<a id="top-of-page"></a>
# Table of Contents #
Click on a chapter:<br>
**[1. Training and Testing the Models](#subtask3)<br>**
[a. K-Nearest Neighbour](#1a-KNN)<br>
[b. KNN First Model Findings](#1b-findings)<br>
[c. KNN Conclusion](#1c-conclusion)<br>
[d. Random Forest Classifier](#1d-RFC)<br>
[e. Conclusion](#1e-conclusion)<br>
**[2. Collaborative Filtering](#2-collaborative-filtering)<br>**
[a. Collaborative Filtering Conclusion](#2b-conclusion)<br>
**[3. Conclusions](#3_conclusion)<br>**
**[4. References](#4_references)<br>**

Subtasks 3 and 4 are closely intertwined, with the former involving building the recommendation system and the latter involving evaluating its effectiveness using appropriate metrics. Performing these tasks together in one notebook allows for a more seamless and integrated approach to building and evaluating the recommendation system. This also makes it easier to keep track of the steps taken and results obtained and make any necessary changes or improvements. Additionally, since these tasks are shorter compared to Subtasks 1, 2, and 5, it makes sense to keep them together in one notebook.

<a id="subtask3"></a>
# 1. Subtask 3 & 4: Training and Testing the models

To train the model, we will select features such as product price and review score and use the target variable of the product ID as the prediction output.

There are many different machine learning models available to use. From our research, we have discovered the K-Neighbours (K-means) model, random forest classifier model, and collaborative filtering are the most suitable for our dataset based off of research into other projects using a customer e-commerce dataset.

To begin with, we will create models using KNeighborsClassifier and RandomForestClassifier. 


It is crucial to evaluate the effectiveness of the recommender system to ensure that it is performing well and providing accurate recommendations to users. The metrics such as precision, recall, f1, etc., are commonly used to evaluate the performance of recommendation systems.<br>

However, the choice of evaluation metric should depend on the specific goals of the recommendation system and the type of data being used. 

In [1]:
# Import statements
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

<a id="1a-KNN"></a>
## a) KNN (K-Nearest Neighbours)

K-Nearest Neighbour is a supervised clustering algorithm which groups similar data points together (LEDU, 2018). KNN is one of the more simpler models to implement, but can still produce meaningful results. Clusters of similar customers can be created, based on their data parameters we can use to train the model.

In [2]:
# read in the data from the CSV files
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv', encoding='ISO-8859-1')

# merge the datasets to create a single DataFrame
data = pd.merge(orders, order_items, on='order_id')
data = pd.merge(data, products, on='product_id')
data = pd.merge(data, reviews, on='order_id')

# select the features and target variable
X = data[['price', 'review_score']]
y = data['product_id']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now that we have prepared and split the data into train and test data, we can train the two models.

In [3]:
# create and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# make predictions on the testing data using the KNN model
y_pred = knn.predict(X_test)

# calculate precision, recall, f1-score, and AUC
precision = precision_score(y_test, y_pred, average="weighted")
recall = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")
# auc = roc_auc_score(y_test, y_pred)

print("Scores for the KNN model")
print("Precision:", precision)
print("Recall:", recall)
print("f1:", f1)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Scores for the KNN model
Precision: 0.10305261884642455
Recall: 0.13183322303110523
f1: 0.10318356126768895


<a id="1b-findings"></a>
## b) KNN First Model Findings
Our results for the first run of KNN are unusual. A low score of 10% precision and an F1 score are not great.

To improve the accuracy and precision of our model, we will implement one-hot encoding for categorical data, scale the variable to normalise their values and remove bias, and use grid search to determine the best K value for KNN.

In [4]:
# read in the data from the CSV files
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv', encoding='ISO-8859-1')

# merge the datasets to create a single DataFrame
data = pd.merge(orders, order_items, on='order_id')
data = pd.merge(data, products, on='product_id')
data = pd.merge(data, reviews, on='order_id')

data = data.dropna()

# select the features and target variable
X = data[['price', 'review_score', 'product_category_name', 'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']]
y = data['product_id']

# encode the categorical feature as numeric
X.loc[:, 'product_category_name'] = X['product_category_name'].astype('category').cat.codes

# scale the features using standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

# create a parameter grid for k
param_grid = {'n_neighbors': range(1, 21)}

# create and train the model using grid search and cross-validation
knn = KNeighborsClassifier(n_neighbors=4)

# grid_search = GridSearchCV(knn, param_grid, cv=5)
# grid_search.fit(X_train, y_train)

# # print the best value of k
# print("Best value of k:", grid_search.best_params_)


knn.fit(X_train, y_train)


# make predictions on the testing data using the best KNN model
# y_pred = grid_search.predict(X_test)
y_pred = knn.predict(X_test)

# calculate precision, recall, f1-score
precision = precision_score(y_test, y_pred, average="weighted")
recall = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")

print("Scores for the best KNN model")
print("Precision:", precision)
print("Recall:", recall)
print("f1:", f1)

Scores for the best KNN model
Precision: 0.2543576006188829
Recall: 0.30599647266313934
f1: 0.26738167971926713


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X.loc[:, 'product_category_name'] = X['product_category_name'].astype('category').cat.codes
  X.loc[:, 'product_category_name'] = X['product_category_name'].astype('category').cat.codes
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


<a id="1c-conclusion"></a>
## c) KNN Conclusion
In conclusion, the K-Nearest Neighbors (KNN) model is a supervised clustering algorithm that groups similar data points. Although it is one of the simpler models to implement, it can still produce meaningful results. In the first run, our KNN model produced low scores of precision and F1, indicating the need for improvement. To increase the accuracy and precision of the model, we implemented one-hot encoding for categorical data, normalised the variables to remove bias, and used grid search to determine the best K value for KNN. Despite our efforts, the best KNN model still produced scores lower than 30%, indicating that our data may not be suitable for KNN modelling. A score of at least 60% would be considered high enough. Nonetheless, KNN is a helpful algorithm for clustering similar data points together, and its implementation is worth considering for datasets with more suitable features and target variables.<br>

As part of our investigation into improving the performance of our model, we decided to explore the Random Forest Classifier (RFC) algorithm. Our rationale behind this decision was to achieve higher scores than those previously obtained using the KNN algorithm.

<a id="1d-RFC"></a>
## d) RFC

Random Forest Classifier is a supervised clustering algorithm for classification and regression problems (Sruthi, 2021). RFC can deal with categorical data and outliers easily which makes it a good choice for our dataset.

In [5]:
# read in the data from the CSV files
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv', encoding='ISO-8859-1')

# merge the datasets to create a single DataFrame
data = pd.merge(orders, order_items, on='order_id')
data = pd.merge(data, products, on='product_id')
data = pd.merge(data, reviews, on='order_id')

# select the features and target variable
X = data[['price', 'review_score']]
y = data['product_id']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rfc = RandomForestClassifier(n_estimators=10, n_jobs=-1, max_depth=4, verbose=0)
rfc.fit(X_train, y_train)

RandomForestClassifier(max_depth=4, n_estimators=10, n_jobs=-1)

Note: Due to the system we are running the notebook on, we had to reduce the size of the dataset in order to process the data.

In [6]:
# NOTE: JupyterHub kernel crashes due to the size of the data, therefore we have grabbed a small sample of test data.
X_test_rfc = X_test.sample(frac=0.3, random_state=200)
y_test_rfc = y_test.sample(frac=0.3, random_state=200)

# make predictions on the testing data using the RFC model
y_pred_rfc = rfc.predict(X_test_rfc)

In [7]:
# calculate precision, recall, f1-score, and AUC
precision = precision_score(y_test_rfc, y_pred_rfc, average="weighted")
# recall = recall_score(y_test, y_pred, average="weighted")
# f1 = f1_score(y_test, y_pred, average="weighted")
# auc = roc_auc_score(y_test, y_pred)

print("Scores for the RFC model")
print("Precision:", precision)
print("Recall:", recall)
print("f1:", f1)

Scores for the RFC model
Precision: 0.002782933393476902
Recall: 0.30599647266313934
f1: 0.26738167971926713


  _warn_prf(average, modifier, msg_start, len(result))


<a id="1e-conclusion"></a>
## e) Conclusion
In conclusion, we tried the RFC method to improve the scores of our dataset. RFC is a supervised clustering algorithm capable of handling categorical data and outliers. We merged several datasets, selected the features and target variables, and split the data into training and testing sets. We then fit the RFC model on the training data, made predictions on the testing data, and calculated the precision score. Unfortunately, the RFC model produced poor results, with a precision score of only 0.0028. This could be due to the sparse nature of the dataset or the need for more distance between clusters. <br>

Given these results, we have decided to proceed with Collaborative Filtering as our next analysis method.

<br>_[Go to top](#top-of-page)_

<a id="2-collaborative-filtering"></a>
# 2. Collaborative Filtering


Collaborative Filtering is a model creating for recommending an item to a user, based off of their previous purchased items and other users purchases (Google Machine Learning, 2023). 

If user A buys an item, and user B has a similar purchase history, user B will be recommended the new item too.

Collaborative filtering takes customers' previous orders and identifies patterns, and recommends products to customers based on previous customers' orders. If customer A orders products X and Y, then customer B who has ordered product X, will be recommended product Y.

In [8]:
# read in the data from the CSV files
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv', encoding='ISO-8859-1')

# merge the datasets to create a single DataFrame
data = pd.merge(orders, order_items, on='order_id')
data = pd.merge(data, products, on='product_id')
data = pd.merge(data, reviews, on='order_id')


# filter the data to include only the most active customers
# NOTE: this is mainly done due to performance issues.
customer_counts = data['customer_id'].value_counts()
active_customers = customer_counts[customer_counts > 3].index
data = data[data['customer_id'].isin(active_customers)]


# split the data into training and testing sets
train_data, test_data = train_test_split(data, test_size=0.2)

# create a pivot table with customers as rows and products as columns using the training data
pivot = train_data.pivot_table(index='customer_id', columns='product_id', values='review_score')

# fill missing values with 0
pivot = pivot.fillna(0)

# convert the pivot table to a sparse matrix
matrix = csr_matrix(pivot.values)

# create and fit a NearestNeighbors model using the training data
model = NearestNeighbors(metric='cosine', algorithm='brute')
model.fit(matrix)

NearestNeighbors(algorithm='brute', metric='cosine')

In [9]:
# function to recommend products for a given customer
def recommend_products(customer_id):
    # find the index of the customer in the pivot table
    customer_index = pivot.index.get_loc(customer_id)
    
    # find the k nearest neighbors of the customer
    distances, indices = model.kneighbors(pivot.iloc[customer_index, :].values.reshape(1, -1), n_neighbors=6)
    
    # get the product ids of the products purchased by the nearest neighbors
    product_ids = []
    for i in range(0, len(distances.flatten())):
        if i == 0:
            continue
        else:
#             product_ids.extend(pivot.index[indices.flatten()[i]])
            # append the recommended product ID to the array.
            product_ids.append(pivot.columns[indices.flatten()[i]])
    
    # return the most common product ids
    return pd.Series(product_ids).value_counts().head().index.tolist()

In [10]:
# test the recommend_products function on a customer from the testing data.
test_customer = test_data.iloc[0]['customer_id']
recommended_products = recommend_products(test_customer)
print(recommended_products)

['38cd38029795797c97b73421fdad08cf', '8973d773c115b9e347c34a248f17bc92', '88c20c5a22f2ca169af8cfc2df00a7a2', '892bc3e900a6ad3cba5112ccdb33466f', '87689c3ea34514e449355126a5fc299e']


<a id="2b-conclusion"></a>
## b) Collaborative Filtering Conclusion
In conclusion, in the case of the Olist datasets, Collaborative Filtering is an effective model for recommending products to customers based on their purchasing history and the purchasing history of similar customers. By analysing patterns in previous orders, Collaborative Filtering can generate recommendations for products that a customer is likely to be interested in. The model has been successfully implemented, and the function to recommend products for a given customer has generated product IDs for recommended products based on the purchasing history of similar customers.

<a id="3_conclusion"></a>
# 3. Subtask 3 & 4: Conclusion
In conclusion, we explored different machine learning models, such as KNN, RFC, and Collaborative Filtering, and evaluated them using appropriate metrics such as precision, recall, and f1 score. After standardising the data, removing N/A values, and testing different K values, the KNN and RFC models did not produce satisfactory results, with a maximum accuracy and precision score of 40%. 

However, Collaborative Filtering proved to be an effective model for recommending products to customers based on their purchasing history and the purchasing history of similar customers. Analysing patterns in previous orders generated recommendations for products that a customer will likely be interested in. The function to recommend products for a given customer generated product IDs for recommended products based on the purchasing history of similar customers. 

Ultimately, evaluating different models helped us select the most suitable recommendation system for our dataset.

<br>_[Go to top](#top-of-page)_

<a id="4_references"></a>
## References

Chandana, D. (2021) Exploring Customers Segmentation With RFM Analysis and K-Means Clustering. Available from: https://medium.com/web-mining-is688-spring-2021/exploring-customers-segmentation-with-rfm-analysis-and-k-means-clustering-118f9ffcd9f0 [Accessed 3 March 2023].

Google Machine Learning (2023) Collaborative Filtering. Available from: https://developers.google.com/machine-learning/recommendation/collaborative/basics [Accessed 5 March 2023].

LEDU (2018) Understanding K-means Clustering in Machine Learning. Available at: https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1 [Accessed 5 March 2023].

Sruthi, E. R. (2021) Understand Random Forest Algorithms With Examples. Available from: https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/ [Accessed 5 March 2023].


*End of subtask 3 and 4*