<a id="top-of-page"></a>
# Table of Contents #
Click on a chapter:<br>
**[1. Training and Testing the Models](#subtask3)<br>**
[a. K-Nearest Neighbour](#1a-KNN)<br>
[b. KNN First Model Findings](#1b-findings)<br>
[c. KNN Conclusion](#1c-conclusion)<br>
[d. Random Forest Classifier](#1d-RFC)<br>
[e. Conclusion](#1e-conclusion)<br>
**[2. Collaborative Filtering](#2-collaborative-filtering)<br>**
[a. Collaborative Filtering Conclusion](#2b-conclusion)<br>
**[3. Conclusions](#3_conclusion)<br>**
**[4. References](#4_references)<br>**

Subtasks 3 and 4 are closely intertwined, with the former involving building the recommendation system and the latter involving evaluating its effectiveness using appropriate metrics. Performing these tasks together in one notebook allows for a more seamless and integrated approach to building and evaluating the recommendation system. This also makes it easier to keep track of the steps taken and results obtained and make any necessary changes or improvements. Additionally, since these tasks are shorter compared to Subtasks 1, 2, and 5, it makes sense to keep them together in one notebook.

<a id="subtask3"></a>
# 1. Subtask 3 & 4: Training and Testing the Models

To train the models, we will select features such as product price and review score, and use the product ID as the target variable that the model predicts.

Many machine learning models are available. Based on our research into other projects that use customer e-commerce datasets, the K-Nearest Neighbours (KNN) classifier, the random forest classifier, and collaborative filtering appear to be the most suitable for our dataset.

To begin with, we will create models using KNeighborsClassifier and RandomForestClassifier. 


It is crucial to evaluate the recommender system to ensure that it performs well and provides accurate recommendations to users. Metrics such as precision, recall, and F1 score are commonly used to evaluate the performance of recommendation systems.

However, the choice of evaluation metric should depend on the specific goals of the recommendation system and the type of data being used. 
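As a rough illustration of what the `average="weighted"` setting used later in this notebook computes, the sketch below works out precision, recall, and F1 by hand on a toy example. The function name `weighted_scores` and the labels are hypothetical and only for illustration; the notebook itself relies on scikit-learn's metric functions.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Return (precision, recall, f1) averaged over classes, weighted by support."""
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    total = len(y_true)
    for label, n_true in support.items():
        # a class that is never predicted gets precision 0
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n_true
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n_true / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

# toy product labels, purely illustrative
y_true = ["a", "a", "b", "c"]
y_pred = ["a", "b", "b", "b"]
precision, recall, f1 = weighted_scores(y_true, y_pred)
print(precision, recall, f1)
```

Class `c` is never predicted here, so its precision is set to 0. With thousands of product IDs as classes, this situation is common, and it is the kind of case in which scikit-learn emits an undefined-metric warning.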

In [1]:
# Import statements
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

<a id="1a-KNN"></a>
## a) KNN (K-Nearest Neighbours)

K-Nearest Neighbours (KNN) is a supervised learning algorithm that classifies a data point according to the labels of the most similar (nearest) points in the training data (LEDU, 2018). KNN is one of the simpler models to implement, but it can still produce meaningful results. Orders with similar feature values end up close together, so the model predicts a product from the most similar orders it was trained on.
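The idea can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function `knn_predict` and the toy data are hypothetical and are not part of the notebook, which uses scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy (price, review_score) pairs with made-up product labels
train_points = [(10.0, 5), (12.0, 4), (11.0, 5), (200.0, 2), (210.0, 1)]
train_labels = ["mug", "mug", "mug", "sofa", "sofa"]

prediction = knn_predict(train_points, train_labels, query=(15.0, 4), k=3)
print(prediction)
```

Note that price dominates the distance in this toy data because its range is much larger than that of the review score, which is one reason to scale features before using KNN.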

In [2]:
# read in the data from the CSV files
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv', encoding='ISO-8859-1')

# merge the datasets to create a single DataFrame
data = pd.merge(orders, order_items, on='order_id')
data = pd.merge(data, products, on='product_id')
data = pd.merge(data, reviews, on='order_id')

# select the features and target variable
X = data[['price', 'review_score']]
y = data['product_id']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now that we have prepared the data and split it into training and testing sets, we can train the first model.

In [3]:
# create and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# make predictions on the testing data using the KNN model
y_pred = knn.predict(X_test)

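# calculate precision, recall and f1-score, weighted by the number of examples per class
precision = precision_score(y_test, y_pred, average="weighted")
recall = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")
# AUC is not computed: for a multiclass target, roc_auc_score needs class probabilities rather than predicted labels
# auc = roc_auc_score(y_test, y_pred)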

print("Scores for the KNN model")
print("Precision:", precision)
print("Recall:", recall)
print("f1:", f1)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Scores for the KNN model
Precision: 0.10278467373385747
Recall: 0.13081844253253916
f1: 0.10201569046036013


<a id="1b-findings"></a>
## b) KNN First Model Findings
The results of the first KNN run are poor: precision and F1 score are both about 10%, and recall is about 13%.

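To try to improve these scores, we will add more product features, drop rows with missing values, encode the categorical product category as numeric codes, and standardise the features so that no single feature dominates the distance calculation. A grid search over K (1 to 20) is set up to find the best K value, but it is left commented out in the cell below, which fits the model with K = 4.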
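As a small sketch of two of these preprocessing steps, the plain-Python functions below mimic standardisation (as done by `StandardScaler`, which uses the population standard deviation) and integer category codes (as produced by `.astype('category').cat.codes` in the next cell). The function names and example values are hypothetical.

```python
from statistics import mean, pstdev

def standardise(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def category_codes(categories):
    """Map each category to an integer code, assigned in sorted order."""
    lookup = {name: code for code, name in enumerate(sorted(set(categories)))}
    return [lookup[name] for name in categories]

prices = [10.0, 20.0, 30.0]
scaled_prices = standardise(prices)
codes = category_codes(["moveis_sala", "brinquedos", "moveis_sala"])
print(scaled_prices, codes)
```

Integer codes impose an arbitrary ordering on the categories, and a distance-based model such as KNN treats that ordering as meaningful. One-hot encoding avoids this at the cost of one extra column per category.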

In [4]:
# read in the data from the CSV files
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv', encoding='ISO-8859-1')

# merge the datasets to create a single DataFrame
data = pd.merge(orders, order_items, on='order_id')
data = pd.merge(data, products, on='product_id')
data = pd.merge(data, reviews, on='order_id')

data = data.dropna()

# select the features and target variable
# (.copy() avoids pandas' SettingWithCopyWarning when the column is re-encoded below)
X = data[['price', 'review_score', 'product_category_name', 'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']].copy()
y = data['product_id']

# encode the categorical feature as integer category codes
X['product_category_name'] = X['product_category_name'].astype('category').cat.codes

# scale the features using standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

# create a parameter grid for k
param_grid = {'n_neighbors': range(1, 21)}

# create the model with a fixed k; the grid search below is left commented out
knn = KNeighborsClassifier(n_neighbors=4)

# grid_search = GridSearchCV(knn, param_grid, cv=5)
# grid_search.fit(X_train, y_train)

# # print the best value of k
# print("Best value of k:", grid_search.best_params_)

# train the model
knn.fit(X_train, y_train)

# make predictions on the testing data using the fitted KNN model
# y_pred = grid_search.predict(X_test)
y_pred = knn.predict(X_test)

# calculate precision, recall, f1-score
precision = precision_score(y_test, y_pred, average="weighted")
recall = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")

print("Scores for the best KNN model")
print("Precision:", precision)
print("Recall:", recall)
print("f1:", f1)

Scores for the best KNN model
Precision: 0.2600255992782448
Recall: 0.3117283950617284
f1: 0.27288607658894076


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


<a id="1c-conclusion"></a>
## c) KNN Conclusion

We conclude that our data is not well suited to the KNN model: after standardising the data, removing N/A values, and testing different K values, none of our attempts produced a high enough score (>60%).

<a id="1d-RFC"></a>
## d) RFC (Random Forest Classifier)

Random Forest is a supervised ensemble learning algorithm, built from many decision trees, that can be used for classification and regression problems (Sruthi, 2021). The Random Forest Classifier (RFC) handles categorical data and outliers well, which makes it a good choice for our dataset.

In [5]:
# read in the data from the CSV files
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv', encoding='ISO-8859-1')

# merge the datasets to create a single DataFrame
data = pd.merge(orders, order_items, on='order_id')
data = pd.merge(data, products, on='product_id')
data = pd.merge(data, reviews, on='order_id')

# select the features and target variable
X = data[['price', 'review_score']]
y = data['product_id']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rfc = RandomForestClassifier(n_estimators=10, n_jobs=-1, max_depth=4, verbose=0)
rfc.fit(X_train, y_train)

RandomForestClassifier(max_depth=4, n_estimators=10, n_jobs=-1)

Note: Due to the limits of the system we are running the notebook on, we had to reduce the size of the test set in order to make predictions.

In [6]:
# NOTE: the JupyterHub kernel crashes due to the size of the data, so we predict on a 30% sample of the test data.
X_test_rfc = X_test.sample(frac=0.3, random_state=200)
# take the matching labels by index so that features and labels stay aligned
y_test_rfc = y_test.loc[X_test_rfc.index]

# make predictions on the testing data using the RFC model
y_pred_rfc = rfc.predict(X_test_rfc)

In [7]:
# calculate precision for the RFC predictions
precision = precision_score(y_test_rfc, y_pred_rfc, average="weighted")
# recall and f1 were not computed for the RFC model in this run
# recall = recall_score(y_test_rfc, y_pred_rfc, average="weighted")
# f1 = f1_score(y_test_rfc, y_pred_rfc, average="weighted")

print("Scores for the RFC model")
print("Precision:", precision)

Scores for the RFC model
Precision: 0.002319145428238781


  _warn_prf(average, modifier, msg_start, len(result))


<a id="1e-conclusion"></a>
## e) Conclusion

The results for the RFC model are poor. As with the KNN model, this is likely because the dataset is sparse: the target contains a very large number of distinct product IDs, and price and review score alone leave little distance between the classes.
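One way to check how sparse the target is would be to count how many examples each product ID has. The sketch below is a hypothetical helper, not part of the notebook, and the threshold of five examples is an arbitrary assumption.

```python
from collections import Counter

def class_support_summary(labels, min_examples=5):
    """Summarise how many examples each class (product ID) has."""
    counts = Counter(labels)
    rare = sum(1 for n in counts.values() if n < min_examples)
    return {
        "n_classes": len(counts),
        "n_rare_classes": rare,
        "share_rare": rare / len(counts),
    }

# made-up labels: one popular product and three products seen once
labels = ["p1"] * 6 + ["p2", "p3", "p4"]
summary = class_support_summary(labels)
print(summary)
```

In the notebook, this could be called as `class_support_summary(y)` on the target column. A high share of rare classes would suggest that a classifier has too few examples per product to learn from.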

<br>_[Go to top](#top-of-page)_

<a id="2-collaborative-filtering"></a>
# 2. Collaborative Filtering


Collaborative filtering is a technique for recommending items to a user based on their previous purchases and the purchases of other users (Google Machine Learning, 2023).

It takes customers' previous orders, identifies patterns in them, and recommends products that similar customers have ordered. For example, if customer A has ordered products X and Y, then customer B, who has ordered product X, will be recommended product Y.
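The notion of "similar customers" can be made concrete with cosine similarity between rating vectors; the cosine distance used by the `NearestNeighbors` model in this section is one minus this value. The sketch below uses plain Python and made-up ratings for customers A, B, and C, and the helper name is hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# rows are customers, columns are products X, Y, Z; 0 means "not purchased"
ratings = {
    "A": [5, 4, 0],
    "B": [4, 0, 0],
    "C": [0, 0, 5],
}

# rank the other customers by similarity to customer B
similarities = {
    other: cosine_similarity(ratings["B"], vector)
    for other, vector in ratings.items()
    if other != "B"
}
most_similar = max(similarities, key=similarities.get)
print(most_similar, similarities)
```

Customer A is the closest match to B because both bought product X, so product Y (bought by A but not by B) becomes the natural recommendation.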

In [8]:
# read in the data from the CSV files
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv', encoding='ISO-8859-1')

# merge the datasets to create a single DataFrame
data = pd.merge(orders, order_items, on='order_id')
data = pd.merge(data, products, on='product_id')
data = pd.merge(data, reviews, on='order_id')


# filter the data to include only the most active customers
# NOTE: this is mainly done due to performance issues.
customer_counts = data['customer_id'].value_counts()
active_customers = customer_counts[customer_counts > 3].index
data = data[data['customer_id'].isin(active_customers)]


# split the data into training and testing sets
train_data, test_data = train_test_split(data, test_size=0.2)

# create a pivot table with customers as rows and products as columns using the training data
pivot = train_data.pivot_table(index='customer_id', columns='product_id', values='review_score')

# fill missing values with 0
pivot = pivot.fillna(0)

# convert the pivot table to a sparse matrix
matrix = csr_matrix(pivot.values)

# create and fit a NearestNeighbors model using the training data
model = NearestNeighbors(metric='cosine', algorithm='brute')
model.fit(matrix)

NearestNeighbors(algorithm='brute', metric='cosine')

In [9]:
# function to recommend products for a given customer
def recommend_products(customer_id):
    # find the row position of the customer in the pivot table
    customer_index = pivot.index.get_loc(customer_id)

    # find the 6 nearest neighbours of the customer
    query = pivot.iloc[customer_index, :].values.reshape(1, -1)
    distances, indices = model.kneighbors(query, n_neighbors=6)

    # skip the first hit, which is normally the query customer itself
    # NOTE: `indices` holds row positions of neighbouring customers in `pivot`;
    # here each one is used as a position into `pivot.columns` to pick a product ID
    product_ids = [pivot.columns[i] for i in indices.flatten()[1:]]

    # return the most common product ids (at most five)
    return pd.Series(product_ids).value_counts().head().index.tolist()
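The neighbour indices returned by `kneighbors` are row positions of customers in the fitted matrix. The sketch below shows, on plain lists, one way to turn neighbour rows into product IDs by collecting the products each neighbour has rated. It is an illustrative variant under stated assumptions, not the notebook's method: the function name, its parameters, and the choice to exclude products the target customer already bought are all hypothetical.

```python
from collections import Counter

def products_from_neighbours(neighbour_rows, rating_rows, product_ids,
                             exclude_row=None, top_n=5):
    """Count the products rated by neighbouring customers and return the most common ones.

    neighbour_rows: row positions of neighbouring customers
    rating_rows:    one list of ratings per customer (0 = no purchase)
    product_ids:    product ID for each column position
    exclude_row:    optional row position of the target customer, whose own
                    purchases are left out of the recommendations
    """
    already_bought = set()
    if exclude_row is not None:
        already_bought = {
            product_ids[col]
            for col, rating in enumerate(rating_rows[exclude_row])
            if rating > 0
        }

    counts = Counter()
    for row in neighbour_rows:
        for col, rating in enumerate(rating_rows[row]):
            if rating > 0 and product_ids[col] not in already_bought:
                counts[product_ids[col]] += 1
    return [product for product, _ in counts.most_common(top_n)]

product_ids = ["X", "Y", "Z"]
rating_rows = [
    [4, 0, 0],  # row 0: target customer, bought X
    [5, 4, 0],  # row 1: neighbour, bought X and Y
    [3, 5, 2],  # row 2: neighbour, bought X, Y and Z
]
recommended = products_from_neighbours([1, 2], rating_rows, product_ids, exclude_row=0)
print(recommended)
```

With the notebook's objects, `rating_rows` would roughly correspond to `pivot.values` and `product_ids` to `pivot.columns`; that mapping is an assumption and has not been run against the Olist data.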

In [10]:
# test the recommend_products function on a customer from the testing data.
test_customer = test_data.iloc[0]['customer_id']
recommended_products = recommend_products(test_customer)
print(recommended_products)

['86f2416d4670e4ea3ca5494d043d9f24', '86b22a03cb72239dd53996a67df35c63', '8509049c56caff468e3f35c4eefb6035', '86f024d3bdcdb9b54c9fffd92be39f54', '870bcc6c58e03ca658cfdd13db4bbe28']


<a id="2b-conclusion"></a>
## a) Collaborative Filtering Conclusion
The collaborative filtering model has successfully generated product IDs to recommend to customers (based on similar customers' purchasing history).

<a id="3_conclusion"></a>
# 3. Subtask 3 & 4: Conclusion
The KNN and RFC models produced poor results. Ideally, we would have liked accuracy and precision of 70% or more; however, the maximum score we managed to achieve using these models was 40%.

Collaborative filtering makes more sense conceptually (recommend products based on other users' purchasing trends), and in practice it produced promising results: the product IDs recommended for a given user were related to the purchases of other users.

<br>_[Go to top](#top-of-page)_

# 4. Running the Recommender

Passing a customer ID to the `recommend_products` function returns a list of recommended products for that customer.

In [11]:

test_customer = "6c8a03b35eb1de3c0012232b0ff0522d"
recommended = recommend_products(test_customer)
print(recommended)

# remove any duplicates from the recommended list
recommended = list(set(recommended))

filtered_df = data[data['product_id'].isin(recommended)]

# select the product category name column
product_category_names = filtered_df['product_category_name']

# print the product category names
print(product_category_names)

['86f024d3bdcdb9b54c9fffd92be39f54', '86b22a03cb72239dd53996a67df35c63', '8509049c56caff468e3f35c4eefb6035', '86f2416d4670e4ea3ca5494d043d9f24', '84f47b7ffbd21c845d197cc0a7bc479a']
19511                     moveis_decoracao
19512                     moveis_decoracao
80814                          moveis_sala
80815                          moveis_sala
89934    construcao_ferramentas_construcao
89935    construcao_ferramentas_construcao
89936    construcao_ferramentas_construcao
89937    construcao_ferramentas_construcao
91478                    moveis_escritorio
91479                    moveis_escritorio
99232                utilidades_domesticas
99233                utilidades_domesticas
99234                utilidades_domesticas
99235                utilidades_domesticas
99236                utilidades_domesticas
99237                utilidades_domesticas
Name: product_category_name, dtype: object


<a id="4_references"></a>
## References

Chandana, D. (2021) Exploring Customers Segmentation With RFM Analysis and K-Means Clustering. Available from: https://medium.com/web-mining-is688-spring-2021/exploring-customers-segmentation-with-rfm-analysis-and-k-means-clustering-118f9ffcd9f0 [Accessed 3 March 2023].

Google Machine Learning (2023) Collaborative Filtering. Available from: https://developers.google.com/machine-learning/recommendation/collaborative/basics [Accessed 5 March 2023].

LEDU (2018) Understanding K-means Clustering in Machine Learning. Available from: https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1 [Accessed 5 March 2023].

Sruthi, E. R. (2021) Understand Random Forest Algorithms With Examples. Available from: https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/ [Accessed 5 March 2023].


*End of subtask 3 and 4*