# Data Science Assignment 3
## Group 6: Gabriea Groenewegen Van Der Weijden & Arnaud Haaster

### Week 10 preparation:
1. Download the data from https://www.kaggle.com/c/home-depot-product-search-relevance/data and unzip all files. You now have a directory with four csv files and one docx file.
2. Import the csv files in Python as separate Pandas dataframes.
3. Read the information on the task and the data. Provide a task definition in your report for assignment 3.

In [85]:
# Importing all necessary packages.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Reading the csv files into dataframes.
df_test = pd.read_csv("test.csv", encoding="ISO-8859-1")
df_train = pd.read_csv("train.csv", encoding="ISO-8859-1")
df_attr = pd.read_csv("attributes.csv")
df_sample_sub = pd.read_csv("sample_submission.csv")
df_pro_desc = pd.read_csv("product_descriptions.csv")


#### Data exploration

In [84]:
# Creating a dataframe with all the product-query combinations.
df_proq = df_train[["product_title", "search_term"]]

# Creating a dataframe with all the unique product-query combinations.
un_comb = df_proq.groupby(["product_title", "search_term"]).size().reset_index()

# Calculating the unique number of products in the training data.
uni_prod = df_train['product_title'].nunique()

# Calculating the two most frequently occurring products in the training data.
two_most = df_train.groupby("product_title").size().reset_index().rename(columns={0: "Occurences"}).sort_values(by=["Occurences"], ascending=False).head(2)

# Calculating the descriptive statistics for the relevance values (mean, median, standard deviation) in the training data.
mean_relevance = round(df_train['relevance'].mean(), 2)
median_relevance = round(df_train['relevance'].median(), 2)
std_relevance = round(df_train['relevance'].std(), 2)

# Preparing the data in order to show a histogram or boxplot of the distribution of relevance values in the training data.
df_rel_val = df_train.groupby("relevance").size().reset_index().rename(columns={0: "Occurences"})
df_rel_val


| | relevance | Occurences |
|---|---|---|
| 0 | 1.0 | 2105 |
| 1 | 1.25 | 4 |
| 2 | 1.33 | 3006 |
| 3 | 1.5 | 5 |
| 4 | 1.67 | 6780 |
| 5 | 1.75 | 9 |
| 6 | 2.0 | 11730 |
| 7 | 2.25 | 11 |
| 8 | 2.33 | 16060 |
| 9 | 2.5 | 19 |
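
The counts above can be turned into the histogram that the cell prepares for. A minimal sketch, assuming matplotlib is installed and using only the ten rows displayed above (the bar width and figure size are illustrative choices, not taken from the notebook):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the output displayed above (only the rows shown there).
df_rel_val = pd.DataFrame({
    "relevance": [1.0, 1.25, 1.33, 1.5, 1.67, 1.75, 2.0, 2.25, 2.33, 2.5],
    "Occurences": [2105, 4, 3006, 5, 6780, 9, 11730, 11, 16060, 19],
})

fig, ax = plt.subplots(figsize=(8, 4))
# One bar per distinct relevance score; the width stays below the smallest
# gap between scores (0.08) so neighbouring bars do not overlap.
ax.bar(df_rel_val["relevance"], df_rel_val["Occurences"], width=0.06)
ax.set_xlabel("relevance")
ax.set_ylabel("number of query-product pairs")
ax.set_title("Distribution of relevance values (rows shown above)")
fig.tight_layout()
```

In the notebook, the same call would work directly on the `df_rel_val` dataframe built in the cell above, which also contains any rows not displayed here.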


In [3]:
# A copy of the baseline model.

stemmer = SnowballStemmer('english')
num_train = df_train.shape[0]

def str_stemmer(s):
	return " ".join([stemmer.stem(word) for word in s.lower().split()])

def str_common_word(str1, str2):
	return sum(int(str2.find(word)>=0) for word in str1.split())


df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)

df_all = pd.merge(df_all, df_pro_desc, how='left', on='product_uid')

df_all['search_term'] = df_all['search_term'].map(lambda x:str_stemmer(x))
df_all['product_title'] = df_all['product_title'].map(lambda x:str_stemmer(x))
df_all['product_description'] = df_all['product_description'].map(lambda x:str_stemmer(x))

df_all['len_of_query'] = df_all['search_term'].map(lambda x:len(x.split())).astype(np.int64)

df_all['product_info'] = df_all['search_term']+"\t"+df_all['product_title']+"\t"+df_all['product_description']

df_all['word_in_title'] = df_all['product_info'].map(lambda x:str_common_word(x.split('\t')[0],x.split('\t')[1]))
df_all['word_in_description'] = df_all['product_info'].map(lambda x:str_common_word(x.split('\t')[0],x.split('\t')[2]))

df_all = df_all.drop(['search_term','product_title','product_description','product_info'],axis=1)

df_train = df_all.iloc[:num_train]
df_test = df_all.iloc[num_train:]
id_test = df_test['id']

y_train = df_train['relevance'].values
X_train = df_train.drop(['id','relevance'],axis=1).values
X_test = df_test.drop(['id','relevance'],axis=1).values

rf = RandomForestRegressor(n_estimators=15, max_depth=6, random_state=0)
clf = BaggingRegressor(rf, n_estimators=45, max_samples=0.1, random_state=25)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission.csv',index=False)
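
The `str_common_word` helper above counts a query word whenever it occurs as a substring of the product text, so short words can match inside longer ones. A small sketch of that behaviour on an invented query and title; `str_common_whole_word` is a hypothetical whole-token variant added only for comparison, not part of the original model:

```python
def str_common_word(str1, str2):
    # Same logic as the notebook: count query words found anywhere in str2 (substring match).
    return sum(int(str2.find(word) >= 0) for word in str1.split())

def str_common_whole_word(str1, str2):
    # Hypothetical variant: count query words that equal a whole token of str2.
    tokens = set(str2.split())
    return sum(int(word in tokens) for word in str1.split())

# Invented example strings, for illustration only.
query = "in wall light"
title = "stainless steel wall mount lighting kit"

# "in" is found inside "stainless" and "light" inside "lighting".
substring_hits = str_common_word(query, title)
# Only "wall" is a complete token of the title.
whole_word_hits = str_common_whole_word(query, title)
print(substring_hits, whole_word_hits)
```

Whether the looser substring match helps or hurts is an empirical question; comparing both counts as features on the fixed validation split would show it.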


### Week 11
1. Make an 80-20 split of the training set (80% for training, 20% for testing) with the train_test_split function in sklearn.
2. Evaluate the predictions on the test set in terms of Root Mean Squared Error (RMSE). Verify
that your result is close to 0.48. 

The obtained result is your baseline result. Make sure that you use the same train-test split in every
run. Be aware that lower RMSE scores are better.

3. Evaluate the matching without stemming for search terms, product titles, and product
descriptions.
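
For reference, RMSE is the square root of the mean of the squared differences between true and predicted relevance. A worked sketch on four invented score pairs (the numbers are illustrative and not taken from the dataset):

```python
import numpy as np

# Toy relevance scores, invented for illustration only.
y_true = np.array([3.0, 2.0, 1.0, 2.0])
y_pred = np.array([2.5, 2.0, 2.0, 2.5])

errors = y_true - y_pred               # [0.5, 0.0, -1.0, -0.5]
mse = np.mean(errors ** 2)             # (0.25 + 0.0 + 1.0 + 0.25) / 4 = 0.375
rmse = np.sqrt(mse)                    # square root of 0.375
print(round(float(rmse), 4))
```

Because the errors are squared before averaging, a single large miss (here the prediction of 2.0 for a true 1.0) dominates the score.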

#### Evaluation

In [4]:
# Splitting the train data into 80% of training and 20% testing.

# The target is the relevance of each product to its search term.
X = df_train.drop(columns=['relevance', 'id', 'product_uid'])
y = df_train['relevance']

# Split 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


# print("X_train shape:", X_train.shape)
# print("X_test shape:", X_test.shape)


# Evaluating the predictions on the test set in terms of root mean squared error (RMSE).

# Fitting the model.
clf.fit(X_train, y_train)

# Predictions on the test set.
y_pred_clf = clf.predict(X_test)
rmse_clf = np.sqrt(mean_squared_error(y_test, y_pred_clf))

print("Root Mean Squared Error (RMSE) for clf:", round(rmse_clf, 4))

Root Mean Squared Error (RMSE) for clf: 0.4849


In [5]:
# Re-reading the csv files into dataframes because the previous ones have been changed.
df_test = pd.read_csv("test.csv", encoding="ISO-8859-1")
df_train = pd.read_csv("train.csv", encoding="ISO-8859-1")
df_attr = pd.read_csv("attributes.csv")
df_sample_sub = pd.read_csv("sample_submission.csv")
df_pro_desc = pd.read_csv("product_descriptions.csv")

In [6]:
# Evaluating the matching without stemming for search terms, product titles, and product descriptions.

# Rebuild from original inputs
df_all_nostem = pd.concat((df_train, df_test), axis=0, ignore_index=True)
df_all_nostem = pd.merge(df_all_nostem, df_pro_desc, how='left', on='product_uid')

# Lowercasing only (no stemming)
df_all_nostem['search_term'] = df_all_nostem['search_term'].map(lambda x: x.lower())
df_all_nostem['product_title'] = df_all_nostem['product_title'].map(lambda x: x.lower())
df_all_nostem['product_description'] = df_all_nostem['product_description'].map(lambda x: x.lower())

# Feature engineering
df_all_nostem['len_of_query'] = df_all_nostem['search_term'].map(lambda x: len(x.split())).astype(np.int64)
df_all_nostem['product_info'] = df_all_nostem['search_term'] + "\t" + df_all_nostem['product_title'] + "\t" + df_all_nostem['product_description']

def str_common_word(str1, str2):
    return sum(int(str2.find(word) >= 0) for word in str1.split())

df_all_nostem['word_in_title'] = df_all_nostem['product_info'].map(lambda x: str_common_word(x.split('\t')[0], x.split('\t')[1]))
df_all_nostem['word_in_description'] = df_all_nostem['product_info'].map(lambda x: str_common_word(x.split('\t')[0], x.split('\t')[2]))

# Drop intermediate text columns
df_all_nostem = df_all_nostem.drop(['search_term', 'product_title', 'product_description', 'product_info'], axis=1)

# Split
df_train_nostem = df_all_nostem.iloc[:num_train]
df_test_nostem = df_all_nostem.iloc[num_train:]
id_test = df_test_nostem['id']

y_train = df_train_nostem['relevance'].values
X_train = df_train_nostem.drop(['id', 'relevance'], axis=1).values
X_test = df_test_nostem.drop(['id', 'relevance'], axis=1).values

# Train and predict
rf = RandomForestRegressor(n_estimators=15, max_depth=6, random_state=0)
clf = BaggingRegressor(rf, n_estimators=45, max_samples=0.1, random_state=25)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Save result
pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission_no_stemming.csv', index=False)


In [7]:
# The target is the relevance of each product to its search term.
X = df_train_nostem.drop(columns=['relevance', 'id', 'product_uid'])
y = df_train_nostem['relevance']

# Split 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# print("X_train shape:", X_train.shape)
# print("X_test shape:", X_test.shape)

# Evaluating the predictions on the test set in terms of root mean squared error (RMSE).

# Fitting the model.
clf.fit(X_train, y_train)

# Predictions on the test set.
y_pred_clf = clf.predict(X_test)

# Computing the RMSE.
rmse_clf = np.sqrt(mean_squared_error(y_test, y_pred_clf))
print("Root Mean Squared Error (RMSE) for clf without stemming:", round(rmse_clf, 4))

Root Mean Squared Error (RMSE) for clf without stemming: 0.4949
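
The stemmed and unstemmed runs repeat the same split, fit, and score steps. One way to guarantee an identical split in every run is to wrap those steps in a single function. A minimal sketch: `evaluate_rmse` is a hypothetical helper name, and the dataframe `toy` is synthetic stand-in data with the same column names as the feature tables above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def evaluate_rmse(df_features, target="relevance", drop=("id", "product_uid")):
    """Hypothetical helper: fixed 80-20 split, fit the baseline model, return RMSE."""
    X = df_features.drop(columns=[target, *drop])
    y = df_features[target]
    # The fixed random_state keeps the split identical across feature sets.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    rf = RandomForestRegressor(n_estimators=15, max_depth=6, random_state=0)
    model = BaggingRegressor(rf, n_estimators=45, max_samples=0.1, random_state=25)
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

# Synthetic stand-in for a feature table (not the Home Depot data).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "id": np.arange(n),
    "product_uid": rng.integers(100000, 100050, size=n),
    "len_of_query": rng.integers(1, 6, size=n),
    "word_in_title": rng.integers(0, 4, size=n),
    "word_in_description": rng.integers(0, 4, size=n),
})
toy["relevance"] = np.clip(1.5 + 0.4 * toy["word_in_title"] + rng.normal(0, 0.3, n), 1, 3)

score = evaluate_rmse(toy)
print(round(score, 4))
```

In the notebook, the calls would be `evaluate_rmse(df_train)` for the stemmed features and `evaluate_rmse(df_train_nostem)` for the unstemmed ones, so that the 0.4849 and 0.4949 results are guaranteed to come from the same rows.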


#### Improving the matching


Add features for the query-product matching and evaluate the efficacy of each feature. A few suggestions are:
- Add features for matching query terms to the information in attributes.csv.
- Use the structure of the attribute-value pairs to make better-informed features.
- Replace the simple term count matching functions with other overlap weights. You might consider using the function TfidfVectorizer in sklearn or the text similarity function in the spacy package.

Be creative: use any information from the queries and products that might improve the matching.
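
Two of these suggestions can be combined: collapse each product's attribute name-value pairs into one text field and score it against the query with TF-IDF weights instead of raw term counts. A minimal sketch on toy data, assuming attributes.csv has the columns `product_uid`, `name`, and `value`; the feature name `tfidf_attr_sim` and all example rows are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for train.csv and attributes.csv; column names are assumed
# to match the Kaggle files, the rows are invented.
pairs = pd.DataFrame({
    "product_uid": [1, 1, 2],
    "search_term": ["cordless drill", "garden hose", "garden hose"],
})
attrs = pd.DataFrame({
    "product_uid": [1, 1, 2, 2],
    "name": ["Power Type", "Product Type", "Material", "Product Type"],
    "value": ["Cordless", "Drill", "Rubber", "Garden Hose"],
})

# Collapse all attribute name/value pairs of a product into one text field.
attrs["pair_text"] = attrs["name"].astype(str) + " " + attrs["value"].astype(str)
attr_text = (
    attrs.groupby("product_uid")["pair_text"].agg(" ".join).rename("attr_text").reset_index()
)
pairs = pairs.merge(attr_text, on="product_uid", how="left").fillna({"attr_text": ""})

# Fit one vocabulary on queries and attribute texts, then compare row by row.
vectorizer = TfidfVectorizer()
vectorizer.fit(pd.concat([pairs["search_term"], pairs["attr_text"]]))
q = vectorizer.transform(pairs["search_term"])
a = vectorizer.transform(pairs["attr_text"])

# TfidfVectorizer L2-normalises rows by default, so the row-wise dot
# product equals the cosine similarity between query and attribute text.
pairs["tfidf_attr_sim"] = np.asarray(q.multiply(a).sum(axis=1)).ravel()
print(pairs[["search_term", "tfidf_attr_sim"]])
```

The resulting column can be added to the feature table next to `word_in_title` and `word_in_description` and evaluated on the same 80-20 split as the baseline.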

### Week 12
1. Find three other regression models in the sklearn documentation and compare these for the task, both in quality (RMSE) and processing time. A comparison of the results for the four regression models is part of the report for assignment 3.
2. Select the model that works best. You will now optimize the model’s hyperparameters. sklearn offers some very simple hyperparameter tuning methods (https://scikit-learn.org/stable/modules/grid_search.html). It is also possible to use more advanced methods such as Bayesian Optimization (https://github.com/wangronin/Bayesian-Optimization/).
Make sure you are not optimizing on your test set; you will need to use cross-validation on the train set (e.g., using the function RandomizedSearchCV).
3. 
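
The Week 12 steps can be sketched as a timing loop over candidate regressors followed by a randomized search that cross-validates on the training split only. This is a sketch under stated assumptions: the data is synthetic, the three alternative models (Ridge, k-nearest neighbours, gradient boosting) and the parameter ranges are arbitrary illustrative choices, and tuning gradient boosting does not imply it is the best model for the real data.

```python
import time
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the count features; replace with the real X and y.
rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(400, 3)).astype(float)
y = np.clip(1.5 + 0.3 * X[:, 1] + rng.normal(0, 0.3, 400), 1, 3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline from the notebook plus three other regressors (an arbitrary choice).
models = {
    "bagged_rf_baseline": BaggingRegressor(
        RandomForestRegressor(n_estimators=15, max_depth=6, random_state=0),
        n_estimators=45, max_samples=0.1, random_state=25,
    ),
    "ridge": Ridge(),
    "knn": KNeighborsRegressor(),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    elapsed = time.perf_counter() - start  # fit + predict time in seconds
    results[name] = (float(np.sqrt(mean_squared_error(y_test, pred))), elapsed)
    print(f"{name}: RMSE={results[name][0]:.4f}, time={elapsed:.2f}s")

# Tune one model with 3-fold cross-validation on the training split only.
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [2, 3, 4],
        "learning_rate": [0.03, 0.1, 0.3],
    },
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, round(-search.best_score_, 4))

# The held-out test split is used once, after tuning is finished.
tuned_rmse = float(np.sqrt(mean_squared_error(y_test, search.predict(X_test))))
print(round(tuned_rmse, 4))
```

Timing fit and predict together keeps the comparison simple; timing them separately would show which models are slow to train and which are slow to query.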