<a id="top-of-page"></a>
# Table of Contents #
Click on a chapter:<br>
**[1. Training the model](#subtask3)<br>**
**[2. Geolocation Segmentation](#2-geolocation-segmentation)<br>**
[a. KNN](#2a-KNN)<br>
[b. RFC](#2b-RFC)<br>
[c. Conclusion](#2c-conclusion)<br>
**[3. Collaborative Filtering](#3-collaborative-filtering)<br>**
[a. Conclusion](#3a-conclusion)<br>
**[4. Conclusion](#4_conclusion)<br>**

Subtasks 3 and 4 are closely intertwined, with the former involving building the recommendation system and the latter involving evaluating its effectiveness using appropriate metrics. Performing these tasks together in one notebook allows for a more seamless and integrated approach to building and evaluating the recommendation system. This also makes it easier to keep track of the steps taken and results obtained and make any necessary changes or improvements. Additionally, since these tasks are shorter compared to Subtasks 1, 2, and 5, it makes sense to keep them together in one notebook.

<a id="subtask3"></a>
# 1. Subtask 3: Training the model

To train the model, we will select features such as product price and review score and use the target variable of the product id as the prediction output.

Firstly, we will create models using KNeighborsClassifer and RandomForestClassifier. 

In [1]:
# Import statements
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

In [2]:
# read in the data from the CSV files
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv', encoding='ISO-8859-1')

# merge the datasets to create a single DataFrame
data = pd.merge(orders, order_items, on='order_id')
data = pd.merge(data, products, on='product_id')
data = pd.merge(data, reviews, on='order_id')

# select the features and target variable
X = data[['price', 'review_score']]
y = data['product_id']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now that we have prepared and split the data into train and test data, we can train the two models.

In [3]:
# create and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [4]:
rfc = RandomForestClassifier(n_estimators=10, n_jobs=-1, max_depth=4, verbose=0)
rfc.fit(X_train, y_train)

RandomForestClassifier(max_depth=4, n_estimators=10, n_jobs=-1)

<a id="2-geolocation-segmentation"></a>
# 2. Subtask 4: Testing the model performance

It is crucial to evaluate the effectiveness of the recommender system to ensure that it is performing well and providing accurate recommendations to users. The metrics such as Area Under the Curve (AUC), precision, recall, f1, etc., are commonly used to evaluate the performance of recommendation systems.<br>

However, the choice of evaluation metric should depend on the specific goals of the recommendation system and the type of data being used. 

<a id="2a-KNN"></a>
## a) KNN

What is KNN and why are we using it?

In [5]:
# make predictions on the testing data using the KNN model
y_pred = knn.predict(X_test)

# calculate precision, recall, f1-score, and AUC
precision = precision_score(y_test, y_pred, average="weighted")
recall = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")
# auc = roc_auc_score(y_test, y_pred)

print("Scores for the KNN model")
print("Precision:", precision)
print("Recall:", recall)
print("f1:", f1)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Scores for the KNN model
Precision: 0.10767073321474538
Recall: 0.12728877123317892
f1: 0.10409291016884312


<a id="2b-RFC"></a>
## b) RFC

What is RFC and why are we using it?

In [6]:
# NOTE: JupyterHub kernel crashes due to the size of the data, therefore we have grabbed a small sample of test data.
X_test_rfc = X_test.sample(frac=0.3, random_state=200)
y_test_rfc = y_test.sample(frac=0.3, random_state=200)

# make predictions on the testing data using the RFC model
y_pred_rfc = rfc.predict(X_test_rfc)

# calculate precision, recall, f1-score, and AUC
precision = precision_score(y_test_rfc, y_pred_rfc, average="weighted")
# recall = recall_score(y_test, y_pred, average="weighted")
# f1 = f1_score(y_test, y_pred, average="weighted")
# auc = roc_auc_score(y_test, y_pred)

print("Scores for the RFC model")
print("Precision:", precision)
print("Recall:", recall)
print("f1:", f1)

Scores for the RFC model
Precision: 0.002134507856317242
Recall: 0.12728877123317892
f1: 0.10409291016884312


  _warn_prf(average, modifier, msg_start, len(result))


<a id="2c-conclusion"></a>
## c) Conclusion
<br>_[Go to top](#top-of-page)_

<a id="3-collaborative-filtering"></a>
# 3. Subtask 3: Collaborative Filtering

Collaborative filtering takes customers' previous orders and identifies patterns, and recommends products to customers based on previous customers' orders. If customer A orders products X and Y, then customer B who has ordered product X, will be recommended product Y.

In [7]:
# read in the data from the CSV files
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv', encoding='ISO-8859-1')

# merge the datasets to create a single DataFrame
data = pd.merge(orders, order_items, on='order_id')
data = pd.merge(data, products, on='product_id')
data = pd.merge(data, reviews, on='order_id')


# filter the data to include only the most active customers
# NOTE: this is mainly done due to performance issues.
customer_counts = data['customer_id'].value_counts()
active_customers = customer_counts[customer_counts > 3].index
data = data[data['customer_id'].isin(active_customers)]


# split the data into training and testing sets
train_data, test_data = train_test_split(data, test_size=0.2)

# create a pivot table with customers as rows and products as columns using the training data
pivot = train_data.pivot_table(index='customer_id', columns='product_id', values='review_score')

# fill missing values with 0
pivot = pivot.fillna(0)

# convert the pivot table to a sparse matrix
matrix = csr_matrix(pivot.values)

# create and fit a NearestNeighbors model using the training data
model = NearestNeighbors(metric='cosine', algorithm='brute')
model.fit(matrix)

NearestNeighbors(algorithm='brute', metric='cosine')

In [8]:
# function to recommend products for a given customer
def recommend_products(customer_id):
    # find the index of the customer in the pivot table
    customer_index = pivot.index.get_loc(customer_id)
    
    # find the k nearest neighbors of the customer
    distances, indices = model.kneighbors(pivot.iloc[customer_index, :].values.reshape(1, -1), n_neighbors=6)
    
    # get the product ids of the products purchased by the nearest neighbors
    product_ids = []
    for i in range(0, len(distances.flatten())):
        if i == 0:
            continue
        else:
#             product_ids.extend(pivot.index[indices.flatten()[i]])
            # append the recommended product ID to the array.
            product_ids.append(pivot.columns[indices.flatten()[i]])
    
    # return the most common product ids
    return pd.Series(product_ids).value_counts().head().index.tolist()

In [9]:
# test the recommend_products function on a customer from the testing data.
test_customer = test_data.iloc[0]['customer_id']
recommended_products = recommend_products(test_customer)
print(recommended_products)

['35afc973633aaeb6b877ff57b2793310', '98d61056e0568ba048e5d78038790e77', '8b8422bfeaebcd02e897666185ca2c2c', '8bb27b1d96be90b36b8d0c7f30931d52', '8b45810da2ef9860496d56f62435fc40']


<a id="3a-conclusion"></a>
## a) Conclusion

The KNN and RFM models are low in performance quality. They are only 10% -13 % accurate and precise.

Collaborative Filtering seems to be more appropriate for this scenario.
<br>_[Go to top](#top-of-page)_

<a id="4_conclusion"></a>
# 4. Subtask 3 & 4: Conclusion
Key takeaways and how it informs what we do next?
<br>_[Go to top](#top-of-page)_

End of subtask 3 and 4