## E-Commerce AI Recommendation System

## 1. Data Collection and Preprocessing

In [7]:
# Importing Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [11]:
# Loading the Dataset
# Dataset is Amazon Beauty products ratings found on Kaggle.com
# Link to dataset: https://www.kaggle.com/datasets/skillsmuggler/amazon-ratings
data = pd.read_csv('ratings_Beauty.csv')
data = data.dropna()
data.head()

           UserId   ProductId  Rating     Timestamp
0  A39HTATAQ9V7YF  0205616461     5.0  1.369699e+09
1  A3JM6GV9MNOF9X  0558925278     3.0  1.355443e+09
2  A1Z513UWSAAO0F  0558925278     5.0  1.404691e+09
3  A1WMRR494NWEWV  0733001998     4.0  1.382573e+09
4  A3IAAVS479H7M7  0737104473     1.0  1.274227e+09


In [12]:
print(data.isnull().sum())

UserId       0
ProductId    0
Rating       0
Timestamp    0
dtype: int64


In [13]:
# Count the ratings each product received and rank products by that count
popular_products = pd.DataFrame(data.groupby('ProductId')['Rating'].count())
most_popular = popular_products.sort_values('Rating', ascending=False)
most_popular.head(10)

            Rating
ProductId
B001MA0QY2    7533
B0009V1YR8    2869
B0000YUXI0    2143
B000ZMBSPE    2041
B00121UVU0    1838
B000FS05VG    1589
B000142FVW    1558
B001JKTTVQ    1468
B000TKH6G2    1379
B00150LT40    1349


In [14]:
# Encoding user IDs and product IDs to integer indices
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()

data['user_id_encoded'] = user_encoder.fit_transform(data['UserId'])
data['product_id_encoded'] = item_encoder.fit_transform(data['ProductId'])

print(data.head())

           UserId   ProductId  Rating     Timestamp  user_id_encoded  \
0  A39HTATAQ9V7YF  0205616461     5.0  1.369699e+09           408882   
1  A3JM6GV9MNOF9X  0558925278     3.0  1.355443e+09           459591   
2  A1Z513UWSAAO0F  0558925278     5.0  1.404691e+09           176296   
3  A1WMRR494NWEWV  0733001998     4.0  1.382573e+09           163809   
4  A3IAAVS479H7M7  0737104473     1.0  1.274227e+09           452822   

   product_id_encoded  
0                   0  
1                   1  
2                   1  
3                   2  
4                   3  
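
`LabelEncoder` numbers the distinct IDs in sorted order, which is why the two rows that share product ID `0558925278` both receive code 1. The following is a minimal pure-Python sketch of that behaviour on the five product IDs printed above; the names `classes` and `to_index` are illustrative and not part of the notebook.

```python
# Pure-Python illustration of sorted-order label encoding.
product_ids = ["0205616461", "0558925278", "0558925278", "0733001998", "0737104473"]

# Unique labels in sorted order; the position in this list is the integer code.
classes = sorted(set(product_ids))
to_index = {label: i for i, label in enumerate(classes)}

encoded = [to_index[p] for p in product_ids]

# Mapping codes back to the original IDs (analogous to inverse_transform).
decoded = [classes[i] for i in encoded]

print(encoded)
print(decoded == product_ids)
```

With the fitted encoders in the notebook, `item_encoder.inverse_transform` and `user_encoder.inverse_transform` perform the reverse mapping when recommendations need to be reported with the original IDs.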


## 2. Model-Based Collaborative Filtering

In [16]:
# Hold out 20% of the ratings as a test set
train, test = train_test_split(data, test_size=0.2, random_state=42)

In [17]:
# The `surprise` package on PyPI is a small wrapper that installs scikit-surprise
%pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.4/154.4 kB 3.0 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357251 sha256=59c58998ff023ee65982a4c5abde1a2e7d0f8daf07b46958ba1083f1da4d6bd0
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully inst

In [18]:
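# Collaborative filtering model

from surprise import SVD, Reader, Dataset
from surprise.model_selection import cross_validate

# Ratings in this dataset range from 1 to 5
reader = Reader(rating_scale=(1, 5))

# Surprise expects exactly three columns, in the order: user, item, rating
data_surprise = Dataset.load_from_df(train[['user_id_encoded', 'product_id_encoded', 'Rating']], reader)

# Matrix-factorization model (SVD) with default hyperparameters
model = SVD()

# 3-fold cross-validation on the training split
cross_validate(model, data_surprise, measures=['RMSE', 'MAE'], cv=3, verbose=True)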

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.2644  1.2654  1.2601  1.2633  0.0023  
MAE (testset)     0.9893  0.9892  0.9870  0.9885  0.0011  
Fit time          16.98   18.22   17.40   17.54   0.52    
Test time         3.74    2.35    2.37    2.82    0.65    


{'test_rmse': array([1.26439403, 1.26538335, 1.26011842]),
 'test_mae': array([0.9893194 , 0.98917948, 0.98698272]),
 'fit_time': (16.98042941093445, 18.22217631340027, 17.404427766799927),
 'test_time': (3.7444281578063965, 2.3456671237945557, 2.3722212314605713)}
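
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, and MAE is the mean absolute difference. The sketch below computes both by hand on four made-up rating pairs; the helper names `toy_rmse` and `toy_mae` and the sample values are illustrative assumptions, not data from this notebook.

```python
import math

def toy_rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def toy_mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)

# Made-up ratings purely for illustration.
actual = [5.0, 3.0, 4.0, 1.0]
predicted = [4.0, 3.0, 5.0, 3.0]

print(toy_rmse(actual, predicted))
print(toy_mae(actual, predicted))
```

Because errors are squared before averaging, RMSE penalizes large misses more heavily than MAE, which is consistent with the RMSE values above being larger than the MAE values.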

## 3. Scalability and Performance

To keep the recommendation engine responsive under high traffic, the following strategies can be used:

- Efficient data structures: use structures that support fast lookups, such as hash tables or balanced search trees, to minimize response times.

- Caching: store frequently requested data in a cache such as Redis or Memcached, so that it does not have to be fetched from the database on every request.

- Algorithm optimization: reduce the computational complexity of the serving path so that it stays efficient under high load.

- Distributed systems: deploy the application on a platform that scales horizontally, so that servers can be added to absorb increased load without degrading performance.

- Load balancing: distribute user requests evenly across servers so that no single server becomes a bottleneck.

- Asynchronous processing: run computation-heavy tasks in the background so that user-facing requests stay responsive.
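
As one illustration of the caching strategy above, the sketch below keeps recommendations in an in-process, dict-backed cache with a time-to-live. It is a stand-in for an external cache such as Redis, not a production design; `TTLCache`, `recommend`, and `slow_recommendations` are hypothetical names introduced only for this example.

```python
import time

class TTLCache:
    """Tiny in-process cache with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # a dict is a hash table: O(1) average lookup

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)


def recommend(user_id, cache, compute_fn):
    """Return cached recommendations, computing them only on a miss."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    result = compute_fn(user_id)
    cache.set(user_id, result)
    return result


calls = []

def slow_recommendations(user_id):
    """Hypothetical expensive computation; records each invocation."""
    calls.append(user_id)
    return [f"product-{user_id}-{i}" for i in range(3)]


cache = TTLCache(ttl_seconds=60)
first = recommend(42, cache, slow_recommendations)
second = recommend(42, cache, slow_recommendations)  # served from the cache
print(first == second, len(calls))
```

The time-to-live bounds how stale a cached recommendation can become; a shorter value trades more recomputation for fresher results.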

## 4. Model Evaluation and Optimization

In [19]:
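# Model Evaluation

from surprise import accuracy

# Build (user, item, rating) triples from the held-out test split
testset = list(zip(test['user_id_encoded'].values, test['product_id_encoded'].values, test['Rating'].values))

# `model` is scored as left by cross_validate above; it has not been refit
# explicitly on the full training split (model.fit(data_surprise.build_full_trainset()))
predictions = model.test(testset)

accuracy.rmse(predictions)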

RMSE: 1.2623


1.2623211834837964

In [20]:
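# Optimization and Tuning

from surprise.model_selection import GridSearchCV

# Candidate hyperparameter values; every combination is cross-validated
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data_surprise)

# Best mean RMSE across folds and the parameters that produced it.
# Note: these are only reported; `model` used below keeps its default hyperparameters.
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])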

1.2660408976144435
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
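
To make the cost of this search concrete, the sketch below enumerates the same grid with `itertools.product` and counts how many model fits a 3-fold search implies. It only illustrates how an exhaustive grid is expanded and does not call Surprise; the variable names are illustrative.

```python
from itertools import product

# Same grid and fold count as in the cell above.
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
cv = 3

# Expand the grid into one dict per hyperparameter combination.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

print(len(combinations))       # number of distinct configurations
print(len(combinations) * cv)  # number of model fits across all folds
```

Each additional value in any list multiplies the number of fits, so the grid is best widened one parameter at a time. Note also that the best grid-search RMSE reported above (about 1.266) is slightly higher than the cross-validated RMSE of the default `SVD()` (about 1.263), so this particular grid does not improve on the defaults.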


In [23]:
# Compare the predicted and actual rating for one held-out (user, product) pair
example_row = test.iloc[3]
example_user_id = example_row['user_id_encoded']
example_product_id = example_row['product_id_encoded']
actual_rating = example_row['Rating']

example_prediction = model.predict(example_user_id, example_product_id)
predicted_rating = example_prediction.est

print("Predicted Rating:", predicted_rating)
print("Actual Rating:", actual_rating)

Predicted Rating: 4.591306072919141
Actual Rating: 5.0
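
A single predicted rating becomes a recommendation once candidate products are ranked by their predicted score. The following is a minimal sketch of that ranking step under stated assumptions: `top_n` and `toy_score` are hypothetical helpers, and the scores are made up so that the example runs without a trained model.

```python
import heapq

def top_n(user_id, candidate_items, score_fn, n=3):
    """Return the n candidate items with the highest predicted score."""
    return heapq.nlargest(n, candidate_items,
                          key=lambda item: score_fn(user_id, item))

# Made-up predicted ratings keyed by encoded product ID (illustration only).
toy_scores = {0: 3.9, 1: 4.6, 2: 2.1, 3: 4.8, 4: 4.2}

def toy_score(user_id, item_id):
    """Stand-in for a model's predicted rating; ignores the user."""
    return toy_scores[item_id]

recommended = top_n(7, list(toy_scores), toy_score, n=3)
print(recommended)
```

With the fitted model from this notebook, `score_fn` could be `lambda u, i: model.predict(u, i).est`, and the candidates would typically exclude products the user has already rated.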
