# Data Science Assignment 3
## Group 6: Gabriea Groenewegen Van Der Weijden & Arnaud Haaster

### Week 10 preparation:
1. Download the data from https://www.kaggle.com/c/home-depot-product-search-relevance/data and unzip all files. You now have a directory with four csv files and one docx file.
2. Import the csv files in Python as separate Pandas dataframes.
3. Read the information on the task and the data. Provide a task definition in your report for assignment 3.

In [30]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [14]:
df_test = pd.read_csv("test.csv", encoding="ISO-8859-1")
df_train = pd.read_csv("train.csv", encoding="ISO-8859-1")
df_attr = pd.read_csv("attributes.csv")
df_sample_sub = pd.read_csv("sample_submission.csv")
df_pro_desc = pd.read_csv("product_descriptions.csv")


#### Task

In [35]:
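stemmer = SnowballStemmer('english')
num_train = df_train.shape[0]  # number of training rows, used to split df_all again later

def str_stemmer(s):
    # Lowercase the text and stem every whitespace-separated token
    return " ".join([stemmer.stem(word) for word in s.lower().split()])

def str_common_word(str1, str2):
    # Count the words of str1 that occur in str2 (substring match, not whole-word match)
    return sum(int(str2.find(word)>=0) for word in str1.split())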


df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)

df_all = pd.merge(df_all, df_pro_desc, how='left', on='product_uid')

df_all['search_term'] = df_all['search_term'].map(lambda x:str_stemmer(x))
df_all['product_title'] = df_all['product_title'].map(lambda x:str_stemmer(x))
df_all['product_description'] = df_all['product_description'].map(lambda x:str_stemmer(x))

df_all['len_of_query'] = df_all['search_term'].map(lambda x:len(x.split())).astype(np.int64)

df_all['product_info'] = df_all['search_term']+"\t"+df_all['product_title']+"\t"+df_all['product_description']

df_all['word_in_title'] = df_all['product_info'].map(lambda x:str_common_word(x.split('\t')[0],x.split('\t')[1]))
df_all['word_in_description'] = df_all['product_info'].map(lambda x:str_common_word(x.split('\t')[0],x.split('\t')[2]))

df_all = df_all.drop(['search_term','product_title','product_description','product_info'],axis=1)

df_train = df_all.iloc[:num_train]
df_test = df_all.iloc[num_train:]
id_test = df_test['id']

y_train = df_train['relevance'].values
X_train = df_train.drop(['id','relevance'],axis=1).values
X_test = df_test.drop(['id','relevance'],axis=1).values

rf = RandomForestRegressor(n_estimators=15, max_depth=6, random_state=0)
clf = BaggingRegressor(rf, n_estimators=45, max_samples=0.1, random_state=25)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission.csv',index=False)


ValueError: could not convert string to float: 'Not only do angles make joints stronger, they also provide more consistent, straight corners. Simpson Strong-Tie offers a wide variety of angles in various sizes and thicknesses to handle light-duty jobs or projects where a structural connection is needed. Some can be bent (skewed) to match the project. For outdoor projects or those where moisture is present, use our ZMAX zinc-coated connectors, which provide extra resistance against corrosion (look for a "Z" at the end of the model number).Versatile connector for various 90 connections and home repair projectsStronger than angled nailing or screw fastening aloneHelp ensure joints are consistently straight and strongDimensions: 3 in. x 3 in. x 1-1/2 in.Made from 12-Gauge steelGalvanized for extra corrosion resistanceInstall with 10d common nails or #9 x 1-1/2 in. Strong-Drive SD screws'
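
The ValueError above means that at least one text column was still present in the feature matrix when the regressor was fitted. A minimal sketch of one way to check for this before calling .values is given below. The frame toy_df and its rows are hypothetical and invented for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_train; all values are invented.
toy_df = pd.DataFrame({
    "id": [1, 2],
    "product_uid": [100001, 100002],
    "relevance": [3.0, 2.5],
    "len_of_query": [2, 3],
    "product_description": ["angle bracket", "wall plate"],  # stray text column
})

features = toy_df.drop(["id", "relevance"], axis=1)

# Columns whose dtype is not numeric cannot be converted to float by sklearn.
non_numeric = features.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Keep only the numeric columns before handing the matrix to an estimator.
X = features.select_dtypes(include="number").values
print(X.shape)
```

Running the same check on the real frame right before the fit would show which column carries the description text.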

### Week 11
1. Make an 80-20 split of the training set, using 80% for training and 20% for testing, with the train_test_split function in sklearn.
2. Evaluate the predictions on the test set in terms of Root Mean Squared Error (RMSE). Verify that your result is close to 0.48.

The obtained result is your baseline result. Make sure that you use the same train-test split in every run. Be aware that lower RMSE scores are better.

3. Evaluate the matching without stemming for search terms, product titles, and product descriptions.
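
Step 2 scores the predictions with RMSE, the square root of the mean squared difference between the true and the predicted relevance. A minimal sketch of the computation follows; the values in y_true and y_hat are invented for illustration and are not taken from the dataset.

```python
import math

def rmse(y_true, y_hat):
    # Root of the mean of the squared errors; lower is better.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_hat)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Invented example values
y_true = [3.0, 2.0, 1.0, 3.0]
y_hat = [2.5, 2.0, 2.0, 3.0]

score = rmse(y_true, y_hat)
print(round(score, 4))
```

This is the same quantity that np.sqrt(mean_squared_error(y_test, y_pred)) computes in the cells below.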

In [32]:
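# Target: the relevance score of a product for a search term.
# Features: the three engineered columns (len_of_query, word_in_title, word_in_description).
X = df_train.drop(columns=['relevance', 'id', 'product_uid'])
y = df_train['relevance']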

# Split 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

X_train shape: (59253, 3)
X_test shape: (14814, 3)


In [33]:
# 1. Fit the models on the training split
rf.fit(X_train, y_train)
clf.fit(X_train, y_train)

# 2. Predict on the test set
y_pred_rf = rf.predict(X_test)
y_pred_clf = clf.predict(X_test)

# 3. Compute RMSE
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
rmse_clf = np.sqrt(mean_squared_error(y_test, y_pred_clf))
print("Root Mean Squared Error (RMSE) for rf:", round(rmse_rf, 4))
print("Root Mean Squared Error (RMSE) for clf:", round(rmse_clf, 4))

Root Mean Squared Error (RMSE) for rf: 0.4855
Root Mean Squared Error (RMSE) for clf: 0.4849


In [39]:
df_test = pd.read_csv("test.csv", encoding="ISO-8859-1")
df_train = pd.read_csv("train.csv", encoding="ISO-8859-1")
df_attr = pd.read_csv("attributes.csv")
df_sample_sub = pd.read_csv("sample_submission.csv")
df_pro_desc = pd.read_csv("product_descriptions.csv")

In [40]:
# Evaluating the matching without stemming for search terms, product titles, and product descriptions.

# Rebuild from original inputs
df_all_nostem = pd.concat((df_train, df_test), axis=0, ignore_index=True)
df_all_nostem = pd.merge(df_all_nostem, df_pro_desc, how='left', on='product_uid')

# Lowercasing only (no stemming)
df_all_nostem['search_term'] = df_all_nostem['search_term'].map(lambda x: x.lower())
df_all_nostem['product_title'] = df_all_nostem['product_title'].map(lambda x: x.lower())
df_all_nostem['product_description'] = df_all_nostem['product_description'].map(lambda x: x.lower())

# Feature engineering
df_all_nostem['len_of_query'] = df_all_nostem['search_term'].map(lambda x: len(x.split())).astype(np.int64)
df_all_nostem['product_info'] = df_all_nostem['search_term'] + "\t" + df_all_nostem['product_title'] + "\t" + df_all_nostem['product_description']

def str_common_word(str1, str2):
    return sum(int(str2.find(word) >= 0) for word in str1.split())

df_all_nostem['word_in_title'] = df_all_nostem['product_info'].map(lambda x: str_common_word(x.split('\t')[0], x.split('\t')[1]))
df_all_nostem['word_in_description'] = df_all_nostem['product_info'].map(lambda x: str_common_word(x.split('\t')[0], x.split('\t')[2]))

# Drop intermediate text columns
df_all_nostem = df_all_nostem.drop(['search_term', 'product_title', 'product_description', 'product_info'], axis=1)

# Split
df_train_nostem = df_all_nostem.iloc[:num_train]
df_test_nostem = df_all_nostem.iloc[num_train:]
id_test = df_test_nostem['id']

y_train = df_train_nostem['relevance'].values
X_train = df_train_nostem.drop(['id', 'relevance'], axis=1).values
X_test = df_test_nostem.drop(['id', 'relevance'], axis=1).values

# Train and predict
rf = RandomForestRegressor(n_estimators=15, max_depth=6, random_state=0)
clf = BaggingRegressor(rf, n_estimators=45, max_samples=0.1, random_state=25)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Save result
pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission_no_stemming.csv', index=False)


In [41]:
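# Target: the relevance score of a product for a search term.
# Build the features from df_train_nostem: the raw df_train reloaded in In [39]
# still holds the text columns product_title and search_term, which the
# regressors cannot convert to float.
X = df_train_nostem.drop(columns=['relevance', 'id', 'product_uid'])
y = df_train_nostem['relevance']

# Split 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

X_train shape: (59253, 3)
X_test shape: (14814, 3)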


In [42]:
# 1. Fit the models on the training split
rf.fit(X_train, y_train)
clf.fit(X_train, y_train)

# 2. Predict on the test set
y_pred_rf = rf.predict(X_test)
y_pred_clf = clf.predict(X_test)

# 3. Compute RMSE
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
rmse_clf = np.sqrt(mean_squared_error(y_test, y_pred_clf))
print("Root Mean Squared Error (RMSE) for rf:", round(rmse_rf, 4))
print("Root Mean Squared Error (RMSE) for clf:", round(rmse_clf, 4))

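ValueError: could not convert string to float: 'Leviton Decora 2-Gang Midway Nylon Wall Plate - White'

The string in this error is a product title. It reaches the regressor whenever X is built from the raw df_train reloaded in In [39], because dropping relevance, id, and product_uid from that frame leaves only the text columns product_title and search_term. The no-stemming features are stored in df_train_nostem, so the 80-20 split has to be built from that frame.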

### Performance of the model with and without stemming
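
To illustrate why stemming can change the word-match features, the sketch below applies the same substring count as str_common_word to one invented query and title, once on lowercased text only and once after a crude suffix-stripping step. Note that toy_stem is a hypothetical stand-in for SnowballStemmer (it only strips a single trailing "s"), and the query and title strings are made up for this example.

```python
def str_common_word(str1, str2):
    # Same logic as in the notebook: count words of str1 found as substrings of str2.
    return sum(int(str2.find(word) >= 0) for word in str1.split())

def toy_stem(text):
    # Hypothetical stand-in for SnowballStemmer: lowercase, strip one trailing "s".
    words = text.lower().split()
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words)

# Invented example strings
query = "Angles bracket"
title = "Steel angle bracket kit"

# Lowercasing only: "angles" is not a substring of the title, "bracket" is.
no_stem = str_common_word(query.lower(), title.lower())

# After the toy stemming step "angles" becomes "angle", so both words match.
with_stem = str_common_word(toy_stem(query), toy_stem(title))

print(no_stem, with_stem)
```

Because the count uses substring matching, a short query word such as "in" also matches inside a longer word such as "zinc"; this holds with or without stemming.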