***
# **Recommender System**
***

## Business Understanding

The aim of this project is to develop a recommendation system that can help customers find fashion products they like based on their previous shopping behaviour. The recommendation system will use implicit feedback data, such as purchase history and product ratings, to suggest products that may be of interest to customers.

The recommendation model is built with the Alternating Least Squares (ALS) algorithm on implicit feedback data. Exploratory Data Analysis (EDA) was also conducted to examine the distribution of the data and extract insights. In the top-product ranking, a Laplace Smoothing approach was used to prevent bias towards products that have only a few high ratings: a constant number of pseudo-reviews at the global average rating is added to every product before its score is computed.

## Data Understanding

This project utilises the [Amazon](https://amazon-reviews-2023.github.io/) dataset for the fashion product category, which includes information on purchase history and product ratings by customers. The dataset contains more than 2.5M product reviews in the fashion category from 2002 to 2023 (downloadable [here](https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/review_categories/Amazon_Fashion.jsonl.gz)).


### Data Description

| Field | Type | Explanation |
| --- | --- | --- |
| rating | float | Rating of the product (from 1.0 to 5.0). |
| title | str | Title of the customer review. |
| text | str | Text body of the customer review. |
| images | list | Images that customers post after they have received the product. Each image has different sizes (small, medium, large), represented by the small_image_url, medium_image_url, and large_image_url respectively. |
| asin | str | ID of the product. |
| parent_asin | str | Parent ID of the product. Note: Products with different colors, styles, sizes usually belong to the same parent ID. The “asin” in previous Amazon datasets is actually parent ID. |
| user_id | str | ID of the reviewer. |
| timestamp | int | Time of the review (unix time). |
| verified_purchase | bool | Customer purchase verification. |
| helpful_vote | int | Helpful votes of the review. |

### Import Necessary Libraries

In [None]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Data Wrangling

In [None]:
def load_jsonl(file_path):
    data = []
    with open(file_path, 'r') as f:
        for line in f:
            data.append(json.loads(line))
    return pd.DataFrame(data)

raw_df = load_jsonl('/content/drive/MyDrive/Data Scientist/Amazon_Fashion.jsonl')
raw_df.head()
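
The download link above points to a gzip-compressed file (`.jsonl.gz`), while the loader above expects an uncompressed `.jsonl`. As a minimal sketch, assuming you would rather read the archive directly, the hypothetical helper `load_jsonl_gz` below does the same parsing through Python's `gzip` module. The sample rows and IDs are made up so that the snippet runs on its own.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_jsonl_gz(file_path):
    """Read a gzip-compressed JSON Lines file into a DataFrame (hypothetical helper)."""
    rows = []
    # "rt" opens the archive in text mode, so every line is one JSON string.
    with gzip.open(file_path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                rows.append(json.loads(line))
    return pd.DataFrame(rows)


# Tiny made-up sample file, written only to make the sketch self-contained.
sample_rows = [
    {"rating": 5.0, "asin": "0000000001", "user_id": "U1"},
    {"rating": 3.0, "asin": "B000000002", "user_id": "U2"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for row in sample_rows:
        f.write(json.dumps(row) + "\n")

sample_df = load_jsonl_gz(sample_path)
print(sample_df)
```

Parsing each line with `json.loads` keeps product IDs as strings, so an ID with leading zeros is not turned into a number.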

Since the timestamp is stored as Unix time in milliseconds, we convert it to a datetime to make it easier to read.

In [None]:
raw_df['timestamp'] = pd.to_datetime(raw_df['timestamp'], unit='ms')
raw_df['timestamp'].sort_values()
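
To illustrate what the conversion does, here is a small sketch on a single illustrative value (not taken from the dataset). With `unit='ms'`, pandas interprets the integer as milliseconds elapsed since 1970-01-01 00:00:00 UTC.

```python
import pandas as pd

# Illustrative millisecond timestamp (an assumption, not a row from the dataset).
raw_ms = 1600000000000

# A scalar becomes a single Timestamp.
converted = pd.to_datetime(raw_ms, unit="ms")
print(converted)  # 2020-09-13 12:26:40

# A Series becomes a datetime64 column, which is what the cell above produces.
series_converted = pd.to_datetime(pd.Series([raw_ms]), unit="ms")
print(series_converted.dt.year.tolist())  # [2020]
```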

The dataset contains reviews from 2002 to 2023. We will use only roughly the last three years of data, from 2020-08-01 to 2023-09-11 (the last date in the data).

In [None]:
df = raw_df[(raw_df['timestamp'] >= '2020-08-01') & (raw_df['timestamp'] <= '2023-09-11')].reset_index(drop=True)
df.head()

Next, select the data that will be used in this project.

In [None]:
df = df.drop(['title', 'text', 'images', 'parent_asin'], axis=1)
df = df.rename(columns={'asin': 'product_id'})
df.head()

We check for missing values and duplicated rows.

In [None]:
df.isnull().sum()

There are no missing values in the data.

In [None]:
print('Duplicate Data:', df.duplicated().sum())

There are 6568 duplicate rows, which we delete.

In [None]:
df.drop_duplicates(inplace=True)
print('Duplicated Data:', df.duplicated().sum())

In [None]:
df.info()

Each column has an appropriate data type.

In [None]:
df.describe(include='all').T

It appears that there are no anomalies in any column.

## EDA

Let's look at the distribution of rating, verified_purchase, and helpful_vote features in the dataset.

In [None]:
fig, ax = plt.subplots(1, 3,figsize=(15,5))
for i, feature in enumerate(['rating', 'verified_purchase']):
    sns.countplot(data=df, x=feature, ax=ax[i])
    ax[i].set_title(f'Distribution of {feature}')
    ax[i].set_xlabel(None)

sns.boxplot(data=df, y='helpful_vote', ax=ax[2])
ax[2].set_title('Distribution of helpful_vote')
ax[2].set_ylabel('helpful_vote')

plt.tight_layout()
plt.show()

What days and times are customers most active?

In [None]:
df['day_of_week'] = df['timestamp'].dt.day_name()
df['hour'] = df['timestamp'].dt.hour

plt.figure(figsize=(12, 8))

plt.subplot(2, 1, 1)
ax1 = sns.countplot(data=df, x='day_of_week', order=df['day_of_week'].value_counts().index)
plt.title('Customer Active Days')
plt.xlabel(None)

plt.subplot(2, 1, 2)
ax2 = sns.countplot(data=df, x='hour', order=df['hour'].value_counts().index)
plt.title('Customer Active Hours')

plt.tight_layout()
plt.show()

Now let's see how many ratings, unique products, and unique customers are in the dataset.

In [None]:
n_ratings = len(df)
n_customers = df['user_id'].nunique()
n_products = df['product_id'].nunique()

print(f'Number of ratings: {n_ratings}')
print(f'Number of customers: {n_customers}')
print(f'Number of products: {n_products}')
print(f'Average number of ratings per customer: {round(n_ratings/n_customers, 2)}')
print(f'Average number of ratings per product: {round(n_ratings/n_products, 2)}')

Let's see who gave the most ratings and the distribution of rating frequency per customer.

In [None]:
ratings_per_customer = df[['user_id', 'product_id']].groupby('user_id').count().reset_index()
ratings_per_customer.columns = ['user_id', 'n_ratings']
ratings_per_customer.sort_values('n_ratings', ascending=False)

In [None]:
sns.set_style("whitegrid")
sns.kdeplot(ratings_per_customer['n_ratings'], fill=True, legend=False)
plt.title("Number of products rated per customer")
plt.xlabel("Ratings per customer")
plt.axvline(ratings_per_customer['n_ratings'].mean(), color="k", linestyle="--")
plt.show()

What are the highest and lowest rated products?

In [None]:
mean_rating = df.groupby('product_id')[['rating']].mean()
mean_rating.head()

In [None]:
display(df.loc[df['product_id'] == mean_rating['rating'].idxmax()],
        df.loc[df['product_id'] == mean_rating['rating'].idxmin()])

Although the product with id '0512238944' has the highest average rating, it received only 2 reviews, which is not a reliable basis for calling it a top product.

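We therefore apply Laplace Smoothing. In its classic form, the technique adds a constant to each category count to avoid zero probabilities. Here the same idea is applied to ratings: every product receives k additional pseudo-reviews at the global average rating, so that products with only a few reviews do not dominate the ranking.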

$$ \text{Laplace Score} = \frac{\sum R_i + k \cdot R_{\text{avg}}}{n + k} $$

- $\sum R_i$: Sum of all ratings for the product.
- $n$: Number of reviews for the product.
- $k$: Constant (the number of additional pseudo-reviews we add).
- $R_{\text{avg}}$: Global average rating across all products.
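
As a quick worked example of the formula above, the sketch below compares a product with two 5-star reviews against a product with 200 reviews averaging 4.8. The global average of 4.0 and both products are assumptions chosen for illustration, not values from the dataset.

```python
def laplace_score(rating_sum, n_reviews, global_avg, k):
    """Smoothed rating: k pseudo-reviews at the global average are added."""
    return (rating_sum + k * global_avg) / (n_reviews + k)


global_avg = 4.0  # assumed global average rating (illustrative)
k = 5             # same constant as used later in this notebook

# Product A: 2 reviews, both 5 stars (raw mean 5.0).
few_reviews = laplace_score(rating_sum=10.0, n_reviews=2, global_avg=global_avg, k=k)

# Product B: 200 reviews with a raw mean of 4.8 (sum = 960).
many_reviews = laplace_score(rating_sum=960.0, n_reviews=200, global_avg=global_avg, k=k)

print(round(few_reviews, 4))   # 4.2857 = (10 + 20) / 7
print(round(many_reviews, 4))  # 4.7805 = (960 + 20) / 205
```

Product A has the higher raw mean, but after smoothing product B ranks higher, because its score rests on far more reviews and is pulled much less towards the global average.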

In [None]:
product_ratings = df.groupby('product_id').agg(
    rating_sum=('rating', 'sum'),
    rating_count=('rating', 'count'),
    rating_mean=('rating', 'mean')
).reset_index()
product_ratings.head()

In [None]:
def laplace_smoothing(row, global_avg, k):
    return (row['rating_sum'] + k * global_avg) / (row['rating_count'] + k)

global_avg_rating = df['rating'].mean()

product_ratings['laplace_score'] = product_ratings.apply(
    laplace_smoothing, global_avg=global_avg_rating, k=5, axis=1
)

top_products = product_ratings.sort_values('laplace_score', ascending=False)
top_products = top_products.set_index('product_id')
print('Highest Product:\n', top_products.head(10), '\n')
print('Lowest Product:\n', top_products.tail(10))

Based on the Laplace score, the product with id '0512238944' (the top product before Laplace Smoothing was applied) no longer appears in the top product list. The product with the highest Laplace score is 'B0B9144W3P'. Let's check its details.

In [None]:
display(df[df['product_id']=='B0B9144W3P'].rating.value_counts(),
        df[df['product_id']=='B0B9144W3P'])

It appears that the product has quite good reviews.

## Data Preparation

We select the features that will be used in the model.

In [None]:
main_df = df[['user_id', 'product_id', 'rating', 'helpful_vote', 'verified_purchase']].copy()
main_df.head()

Then we normalise the rating and helpful_vote features by dividing each by its maximum, so that both are on a comparable scale.

In [None]:
main_df['helpful_vote_norm'] = main_df['helpful_vote'] / main_df['helpful_vote'].max()
main_df['rating_norm'] = main_df['rating'] / main_df['rating'].max()
main_df.head()

In [None]:
main_df.describe().T

We combine the three features as a weighted sum (0.5 for the rating, 0.3 for helpful votes, and 0.2 for a verified purchase) to get the implicit score.

In [None]:
main_df['implicit_score'] = (
    0.5 * main_df['rating_norm'] +
    0.3 * main_df['helpful_vote_norm'] +
    0.2 * main_df['verified_purchase'].astype(int))
main_df.head()
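
To make the weighting concrete, here is a small sketch on three made-up reviews (the values are assumptions for illustration only). It applies the same normalisation and the same 0.5 / 0.3 / 0.2 weights as the cells above.

```python
import pandas as pd

# Three made-up reviews, used only to illustrate the arithmetic.
toy = pd.DataFrame({
    "rating": [5.0, 1.0, 4.0],
    "helpful_vote": [0, 10, 5],
    "verified_purchase": [True, False, True],
})

# Divide by the column maximum, as in the notebook.
toy["rating_norm"] = toy["rating"] / toy["rating"].max()
toy["helpful_vote_norm"] = toy["helpful_vote"] / toy["helpful_vote"].max()

# Weighted sum of the three signals.
toy["implicit_score"] = (
    0.5 * toy["rating_norm"]
    + 0.3 * toy["helpful_vote_norm"]
    + 0.2 * toy["verified_purchase"].astype(int)
)

print(toy["implicit_score"].round(2).tolist())  # [0.7, 0.4, 0.75]
```

The first review has the top rating and a verified purchase but no helpful votes, so it scores 0.5 + 0 + 0.2 = 0.7. The second has the lowest rating and the most helpful votes, giving 0.1 + 0.3 + 0 = 0.4.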

We then build the sparse user-item matrix used to train the model, with user_id as the row index, product_id as the column index, and the implicit score as the value. If a customer reviewed the same product more than once, csr_matrix sums the duplicate entries.

In [None]:
from scipy.sparse import csr_matrix

C = main_df['user_id'].nunique()
P = main_df['product_id'].nunique()

customer_mapper = dict(zip(np.unique(main_df['user_id']), list(range(C))))
product_mapper = dict(zip(np.unique(main_df['product_id']), list(range(P))))
customer_inv_mapper = dict(zip(list(range(C)), np.unique(main_df['user_id'])))
product_inv_mapper = dict(zip(list(range(P)), np.unique(main_df['product_id'])))

row_index = [customer_mapper[i] for i in main_df['user_id']]
col_index = [product_mapper[i] for i in main_df['product_id']]

user_item_matrix = csr_matrix((main_df['implicit_score'], (row_index, col_index)), shape=(C, P))
user_item_matrix

In [None]:
print(user_item_matrix)
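
The ID-to-index mapping above can also be expressed with `pd.factorize`. The following is a minimal sketch on a made-up interaction table (all IDs and scores are assumptions), offered as an alternative formulation rather than a replacement for the cell above.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up interactions, used only to illustrate the index mapping.
toy = pd.DataFrame({
    "user_id": ["u2", "u1", "u2", "u3"],
    "product_id": ["p1", "p1", "p2", "p1"],
    "implicit_score": [0.7, 0.4, 0.9, 0.5],
})

# factorize(sort=True) returns integer codes plus the sorted unique labels,
# so labels[code] maps a matrix index back to the original ID.
user_codes, user_labels = pd.factorize(toy["user_id"], sort=True)
product_codes, product_labels = pd.factorize(toy["product_id"], sort=True)

toy_matrix = csr_matrix(
    (toy["implicit_score"].to_numpy(), (user_codes, product_codes)),
    shape=(len(user_labels), len(product_labels)),
)

print(list(user_labels))     # ['u1', 'u2', 'u3']
print(toy_matrix.toarray())  # rows are users, columns are products
```

Here `user_labels` and `product_labels` play the role of the inverse mappers: position `i` in each holds the ID for row or column `i`.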

## Modeling

Let's train the model with the matrix we have created.

In [None]:
from implicit.als import AlternatingLeastSquares

model = AlternatingLeastSquares(factors=20, regularization=0.1, iterations=50, use_gpu=False)
model.fit(user_item_matrix)
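
For intuition about what "alternating" means, the sketch below runs a simplified ALS on a tiny dense matrix with NumPy. It is an assumption-laden illustration, not the algorithm inside the `implicit` library: it treats every cell as observed and leaves out the confidence weighting that implicit-feedback ALS uses. The function name `als_dense` and the toy matrix are hypothetical.

```python
import numpy as np


def als_dense(R, factors=2, regularization=0.1, iterations=20, seed=0):
    """Simplified ALS: factorise R into U @ V.T, treating all cells as observed."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, factors))
    V = rng.normal(scale=0.1, size=(n_items, factors))
    reg = regularization * np.eye(factors)
    for _ in range(iterations):
        # Fix the item factors V and solve a ridge regression for all users.
        U = np.linalg.solve(V.T @ V + reg, V.T @ R.T).T
        # Fix the user factors U and solve a ridge regression for all items.
        V = np.linalg.solve(U.T @ U + reg, U.T @ R).T
    return U, V


# Made-up scores: users 0-1 like items 0-1, users 2-3 like items 2-3.
R = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 1.0],
    [0.0, 0.0, 1.0, 0.6],
])

U, V = als_dense(R, factors=2, regularization=0.01)
approx = U @ V.T  # predicted score for every user-item pair
print(np.round(approx, 2))
```

Each step is an ordinary regularised least-squares problem because one factor matrix is held fixed while the other is solved for, which is where the algorithm gets its name.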

### Top N Recommender

In [None]:
def get_recommendation(user_id, N=10):
    if user_id in customer_mapper:
        customer_index = customer_mapper[user_id]
        user_items = user_item_matrix[customer_index, :].tocsr()
        recommendations = model.recommend(customer_index, user_items, N=N)
        product_indices, scores = recommendations
        filtered_recommendations = [(product_index, score) for product_index, score in zip(product_indices, scores) if product_index in product_inv_mapper]
    else:
        popular_products = top_products.index[:N]
        filtered_recommendations = [(product_mapper[product], top_products.loc[f'{product}', 'laplace_score']) for product in popular_products if product in product_mapper]

    print(f"Top {N} Recommendations for UserId {user_id}:")
    for recommendation in filtered_recommendations:
        item_index, score = recommendation[0], recommendation[1]
        product_id = product_inv_mapper.get(item_index, "Unknown")
        print(f'Product ID: {product_id}, Score: {score}')

Let's try to get product recommendations from the model for the existing customer with user_id 'AHTTU2FL6FCNBBAESCJHOHHSSW7A'.

In [None]:
if 'AHTTU2FL6FCNBBAESCJHOHHSSW7A' in main_df['user_id'].values:
  get_recommendation('AHTTU2FL6FCNBBAESCJHOHHSSW7A')
else:
  print('This is a new customer')

Now for a new customer with user_id 'TZIYHRCLTADI7R5STTUCVRE2CQMU':

In [None]:
if 'TZIYHRCLTADI7R5STTUCVRE2CQMU' not in main_df['user_id'].values:
  get_recommendation('TZIYHRCLTADI7R5STTUCVRE2CQMU')
else:
  print('This is not a new customer')