In [None]:
# !pip install pandas numpy

# Step 1: Generating the Synthetic Dataset

We'll generate synthetic data in three parts:

- User Data: Simulate user demographic details.
- Product Data: Simulate product details.
- Interaction Data: Simulate how users interact with products (e.g., purchase or rate).

In [1]:
# Import necessary libraries
import pandas as pd  # To handle data in tabular form
import numpy as np   # To generate random data

# Step 1: Define the number of users and products
# Let's assume we have 1000 users and 500 products in our ecommerce platform.
num_users = 1000
num_products = 500

# Step 2: Generating the Users Data
# Each user has an ID, age, gender, and location.
user_data = {
    'user_id': np.arange(1, num_users + 1),  # Generate user IDs from 1 to 1000
    'age': np.random.randint(18, 70, size=num_users),  # Random ages from 18 to 69 (the upper bound of randint is exclusive)
    'gender': np.random.choice(['M', 'F'], size=num_users),  # Randomly assign gender as Male (M) or Female (F)
    'location': np.random.choice(['Urban', 'Suburban', 'Rural'], size=num_users)  # Randomly assign location type
}

# Convert the user data dictionary into a pandas DataFrame
users_df = pd.DataFrame(user_data)

# Step 3: Generating the Products Data
# Each product has an ID, category, price, and rating.
product_data = {
    'product_id': np.arange(1, num_products + 1),  # Generate product IDs from 1 to 500
    'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books'], size=num_products),  # Randomly assign product category
    'price': np.round(np.random.uniform(5, 500, size=num_products), 2),  # Random prices between $5 and $500, rounded to 2 decimal places
    'rating': np.round(np.random.uniform(1, 5, size=num_products), 1)  # Random ratings between 1 and 5, rounded to 1 decimal place
}

# Convert the product data dictionary into a pandas DataFrame
products_df = pd.DataFrame(product_data)

# Step 4: Generating the User-Product Interaction Data (Purchase History or Ratings)
# We simulate how users interact with products. For example, users can rate or buy products.

interaction_data = {
    'user_id': np.random.choice(users_df['user_id'], size=5000),  # Randomly select users who interacted with products
    'product_id': np.random.choice(products_df['product_id'], size=5000),  # Randomly select products that were interacted with
    'rating': np.random.randint(1, 6, size=5000),  # Assign random ratings (1 to 5 stars) for these interactions
    'timestamp': pd.date_range(start='2023-01-01', periods=5000, freq='min')  # Sequential (not random) timestamps, 1 minute apart; 'min' replaces the deprecated 'T' alias
}

# Convert the interaction data dictionary into a pandas DataFrame
interactions_df = pd.DataFrame(interaction_data)

# Let's check the first few rows of each dataset
users_df.head(), products_df.head(), interactions_df.head()


(   user_id  age gender  location
 0        1   30      M     Rural
 1        2   34      M     Urban
 2        3   30      M  Suburban
 3        4   45      M  Suburban
 4        5   44      M  Suburban,
    product_id     category   price  rating
 0           1        Books  329.69     4.7
 1           2  Electronics   57.92     3.2
 2           3     Clothing  201.51     2.2
 3           4     Clothing   46.14     2.7
 4           5        Books  396.72     2.4,
    user_id  product_id  rating           timestamp
 0      166         148       3 2023-01-01 00:00:00
 1      866         211       4 2023-01-01 00:01:00
 2        7         193       3 2023-01-01 00:02:00
 3      423          75       3 2023-01-01 00:03:00
 4      478         419       2 2023-01-01 00:04:00)

# Explanation:
1. Libraries:

- pandas: Used for handling tabular data (like spreadsheets).
- numpy: Helps generate random values for simulating user, product, and interaction data.

2. User Data:

- We create 1000 users with random ages, genders (male/female), and locations (urban/suburban/rural).

3. Product Data:

- We create 500 products, each belonging to a random category (Electronics, Clothing, Home, Books), with random prices and ratings.

4. Interaction Data:

- We simulate 5000 interactions between users and products, each with a rating (1 to 5 stars) and a timestamp to capture when the interaction happened.
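Because the generation code above draws from NumPy's global random state without a seed, every run produces a different dataset. A minimal sketch of a reproducible variant follows, assuming a seeded `np.random.default_rng` generator; the helper name `make_users` and the seed value are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def make_users(num_users: int, seed: int) -> pd.DataFrame:
    """Build a synthetic users table from a seeded generator (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "user_id": np.arange(1, num_users + 1),
        # endpoint=True makes the upper bound inclusive, so ages span 18..70
        "age": rng.integers(18, 70, size=num_users, endpoint=True),
        "gender": rng.choice(["M", "F"], size=num_users),
        "location": rng.choice(["Urban", "Suburban", "Rural"], size=num_users),
    })

users_a = make_users(1000, seed=42)
users_b = make_users(1000, seed=42)
print(users_a.equals(users_b))  # same seed, same table
```

The same pattern could be applied to the product and interaction tables by passing one generator through all three steps.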

# Step 2: Data Pre-processing

Data pre-processing involves preparing the data for training by handling missing values, encoding categorical data, and normalizing or scaling numerical data, if needed. In the context of a recommendation system, we mainly focus on:

1. Handling missing values: Ensure no missing data in users, products, or interactions.
2. Encoding categorical variables: Convert categories like gender or category into numerical format.
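3. Creating a user-product matrix: Each row represents a user, each column represents a product, and each cell holds a rating or interaction. This wide format is the usual way to picture collaborative-filtering input, although the Surprise model trained later reads the long-format interaction table directly.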
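The encoding step in item 2 can be sketched on a toy table. This example uses plain pandas (category codes and `pd.get_dummies`) instead of scikit-learn's LabelEncoder, so it is an alternative illustration rather than the notebook's own method; the `toy_users` table is made up for the purpose.

```python
import pandas as pd

toy_users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "gender": ["M", "F", "F", "M"],
    "location": ["Urban", "Rural", "Suburban", "Urban"],
})

# Integer codes: categories are sorted, so 'F' -> 0 and 'M' -> 1.
gender_cat = toy_users["gender"].astype("category")
toy_users["gender_encoded"] = gender_cat.cat.codes
gender_mapping = dict(enumerate(gender_cat.cat.categories))

# One-hot encoding avoids implying an order among location types.
location_dummies = pd.get_dummies(toy_users["location"], prefix="location")
encoded = pd.concat([toy_users, location_dummies], axis=1)

print(gender_mapping)
print(encoded.columns.tolist())
```

Keeping the code-to-label mapping around (here `gender_mapping`) makes it possible to decode the integers later.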

In [2]:
# Import necessary libraries for pre-processing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Step 1: Handle missing values
# Checking for missing values in all datasets
print("Missing values in users data:\n", users_df.isnull().sum())
print("Missing values in products data:\n", products_df.isnull().sum())
print("Missing values in interactions data:\n", interactions_df.isnull().sum())

# If there were any missing values, we would handle them here. Since this is synthetic data, it’s unlikely.

# Step 2: Encoding categorical variables
# We need to convert categorical variables like 'gender' and 'category' into numerical format.
# Using LabelEncoder to encode these categorical features

label_encoder = LabelEncoder()

# Encode the gender column in users data (LabelEncoder sorts classes alphabetically, so F -> 0, M -> 1)
users_df['gender_encoded'] = label_encoder.fit_transform(users_df['gender'])

# Encode the location column in users data
users_df['location_encoded'] = label_encoder.fit_transform(users_df['location'])

# Encode the category column in products data
products_df['category_encoded'] = label_encoder.fit_transform(products_df['category'])

# Step 3: Create a User-Product Rating Matrix
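# We pivot the interactions data to create a matrix with users as rows, products as columns, and ratings as values.
# pivot_table averages repeated user/product pairs; fillna(0) marks pairs with no interaction as 0, which is not a real rating.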
user_product_matrix = interactions_df.pivot_table(index='user_id', columns='product_id', values='rating').fillna(0)

# Step 4: Train-test split
# We will split the user-product interaction data into training and test sets.
train_data, test_data = train_test_split(interactions_df, test_size=0.2, random_state=42)

# Let's display the first few rows of the pre-processed data to verify
print("User-Product Matrix:\n", user_product_matrix.head())
print("Train Data Sample:\n", train_data.head())
print("Test Data Sample:\n", test_data.head())


Missing values in users data:
 user_id     0
age         0
gender      0
location    0
dtype: int64
Missing values in products data:
 product_id    0
category      0
price         0
rating        0
dtype: int64
Missing values in interactions data:
 user_id       0
product_id    0
rating        0
timestamp     0
dtype: int64
User-Product Matrix:
 product_id  1    2    3    4    5    6    7    8    9    10   ...  491  492  \
user_id                                                       ...             
1           0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
2           0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
3           0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
4           0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
5           0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  ...  0.0  0.0   

product_id  493  494  495  496  497  498  499  500  
user_id                                      

# Explanation:

1. Handling Missing Values:

- We check for missing values using isnull().sum(). Since our data is synthetic, there should be no missing values, but in real-world data, you'd handle this by filling missing values or dropping them.

2. Encoding Categorical Variables:

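- We convert categorical variables such as gender, location, and category into integer codes using LabelEncoder, which assigns codes in sorted order of the class labels (for example, F -> 0 and M -> 1). Most machine learning models require numerical input, but integer codes imply an ordering, so for unordered features such as location or category, one-hot encoding is often the safer choice. Note that the SVD model trained later uses only user_id, product_id, and rating, so these encoded columns are not consumed by it.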

3. Creating User-Product Matrix:

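- We use a pivot table to transform interactions_df into a matrix where rows represent users, columns represent products, and values represent the rating (averaged when the same user/product pair appears more than once). The matrix is a convenient way to inspect how sparse the interactions are; the Surprise model trained in the next step reads the long-format interactions_df directly rather than this matrix.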

4. Train-test Split:

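- We split the interaction data into training (80%) and testing (20%) sets. Note that the Surprise workflow in the next step makes its own 80/20 split from interactions_df, so train_data and test_data are not reused there.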
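Two properties of the user-product matrix are easy to miss: repeated user/product pairs are averaged, and most cells are empty. The sketch below shows both on a made-up `toy_interactions` table, leaving missing cells as NaN instead of filling them with 0 so that "not rated" stays distinguishable from a rating.

```python
import pandas as pd

toy_interactions = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 3],
    "product_id": [10, 10, 20, 20, 30],
    "rating":     [2, 4, 5, 3, 1],
})

# Without fillna, missing user/product pairs stay NaN ("not rated").
matrix = toy_interactions.pivot_table(
    index="user_id", columns="product_id", values="rating"
)

# User 1 rated product 10 twice (2 and 4); pivot_table stores the mean.
print(matrix.loc[1, 10])

# Share of user/product cells with no rating at all.
sparsity = matrix.isna().to_numpy().mean()
print(f"Sparsity: {sparsity:.2%}")
```

With 5000 interactions spread over 1000 users and 500 products, the full matrix in this notebook has at most 1% of its cells filled, which is why the printed sample above is almost entirely zeros.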


# Step 3: Model Training and Testing

We’ll implement a Collaborative Filtering recommendation system using the Surprise library, which is popular for building recommendation systems. The algorithm we’ll use is Singular Value Decomposition (SVD), a matrix factorization technique commonly used for collaborative filtering.

Here’s what we’ll do in this step:

- Install and Import the Surprise library.
- Prepare the data for the Surprise library.
- Train the SVD model using the training dataset.
- Evaluate the model on the test dataset using metrics such as RMSE (Root Mean Squared Error).

In [3]:
# Install the Surprise library
!pip install scikit-surprise

# Import necessary libraries
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split as surprise_train_test_split
from surprise.model_selection import cross_validate
from surprise import accuracy

# Step 1: Prepare the data for Surprise
# Surprise expects data to have 3 columns: user_id, product_id, and rating. We'll use the interaction data for this.

reader = Reader(rating_scale=(1, 5))  # The rating scale in our dataset is from 1 to 5
data = Dataset.load_from_df(interactions_df[['user_id', 'product_id', 'rating']], reader)

# Step 2: Train-test split
# We perform a 80/20 train-test split on the Surprise dataset
trainset, testset = surprise_train_test_split(data, test_size=0.2)

# Step 3: Train the SVD model
# SVD is a matrix factorization technique used for collaborative filtering
model = SVD()  # Initialize the SVD model

# Train the model on the training set
model.fit(trainset)

# Step 4: Test the model on the test set
# We predict the ratings for the test set and evaluate performance
predictions = model.test(testset)

# Step 5: Evaluate the performance using RMSE
rmse = accuracy.rmse(predictions)


Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357281 sha256=1b9d346a0779f18099d11039841820b0c8708c9f7628d6f890349a07d7993e4e
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Succe

# Explanation:

1. Installing the Surprise Library:

- We use the Surprise library, which is specifically designed for building recommendation systems. It supports various algorithms, including SVD, KNN, and more.

2. Preparing Data:

- We format the interaction data (user_id, product_id, rating) in a way that the Surprise library understands using the Reader class.

3. Train-test Split:

- We split the data into training and test sets (80/20 split) using surprise_train_test_split().

4. Training the SVD Model:

- We initialize the SVD model and train it on the training set. The SVD algorithm performs matrix factorization, which is suitable for collaborative filtering tasks.

5. Evaluating the Model:

- After training the model, we test it on the test set. We calculate the RMSE (Root Mean Squared Error), which tells us how well the model is predicting user ratings.
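To make the matrix factorization idea in item 4 concrete, here is a toy NumPy version of a biased factorization model trained with stochastic gradient descent. It predicts a rating as global mean + user bias + item bias + dot product of user and item factors. This is a sketch of the general technique, not Surprise's implementation; the function names, hyperparameter values, and toy ratings are all assumptions chosen for illustration.

```python
import numpy as np

def train_biased_mf(ratings, n_users, n_items, n_factors=5,
                    n_epochs=30, lr=0.01, reg=0.02, seed=0):
    """Toy biased matrix factorization trained with SGD.

    ratings: list of (user_index, item_index, rating), 0-based indices.
    """
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])           # global mean
    bu = np.zeros(n_users)                             # user biases
    bi = np.zeros(n_items)                             # item biases
    P = rng.normal(0, 0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(0, 0.1, size=(n_items, n_factors))  # item factors

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = P[u].copy()  # update both factors from old values
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu_old - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(params, u, i):
    """Estimated rating, clipped to the 1-5 scale."""
    mu, bu, bi, P, Q = params
    return float(np.clip(mu + bu[u] + bi[i] + P[u] @ Q[i], 1, 5))

def rmse_of(params, ratings):
    """Root mean squared error of the model on a list of ratings."""
    sq = [(r - predict(params, u, i)) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(sq)))

toy = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
params = train_biased_mf(toy, n_users=3, n_items=3)
print(round(rmse_of(params, toy), 4))  # training RMSE on the toy data
```

Surprise handles the mapping from raw IDs to internal indices, which is why the toy version here takes 0-based indices directly.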

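**The RMSE (Root Mean Squared Error) of this model is 1.4730 on a 1-to-5 rating scale. RMSE is the square root of the mean squared difference between predicted and actual ratings, so it weights large errors more heavily than a plain average would. Because the ratings in this synthetic dataset were drawn uniformly at random, there is no preference pattern for the model to learn: always predicting the mean rating of 3 would already give an RMSE of about 1.41, so this score is a sanity check of the pipeline rather than evidence of a useful model. With real interaction data, the score can be improved by tuning the model, adding more data, or trying different algorithms.**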
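An RMSE value is easier to judge next to a trivial baseline. The sketch below regenerates uniform random 1-to-5 ratings (with an arbitrary seed, so the numbers will not match the notebook's data exactly) and scores a predictor that always outputs the training mean.

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=5000)   # uniform integer ratings 1..5

# Hold out 20% and always predict the mean of the remaining 80%.
train, test = ratings[:4000], ratings[4000:]
baseline_pred = train.mean()
baseline_rmse = np.sqrt(np.mean((test - baseline_pred) ** 2))
baseline_mae = np.mean(np.abs(test - baseline_pred))

print(f"Baseline RMSE: {baseline_rmse:.4f}")
print(f"Baseline MAE:  {baseline_mae:.4f}")
```

For uniform ratings on 1 to 5 the theoretical values are sqrt(2), roughly 1.41, for RMSE and 1.2 for MAE, so a model has to beat those numbers before it can be said to have learned anything.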

# Step 4: Saving the Model
- We will use Python's pickle library to serialize (save) the model to a file. This way, you can load the model in your Flask app and use it to make predictions.


In [4]:
import pickle

# Step 1: Save the trained SVD model to a file
model_filename = 'svd_model.pkl'
with open(model_filename, 'wb') as model_file:
    pickle.dump(model, model_file)

print(f"Model saved to {model_filename}")


Model saved to svd_model.pkl


# Explanation:
1. Pickle Library: This is used to serialize the Python objects (in this case, the trained model) into a file.
2. Saving the Model: We open a file in write-binary mode (wb) and save the model to svd_model.pkl.
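Loading is the mirror image of saving: open the file in read-binary mode and call `pickle.load`. The helper name `load_model` below is an assumption for illustration, and the commented usage assumes svd_model.pkl exists and scikit-surprise is installed in the environment that loads it (for example, the Flask app).

```python
import pickle

def load_model(path):
    """Load a pickled object from disk; only unpickle files you trust."""
    with open(path, "rb") as model_file:
        return pickle.load(model_file)

# In this notebook the call would look like this (assumes svd_model.pkl
# exists and scikit-surprise is importable where the file is loaded):
# loaded_model = load_model("svd_model.pkl")
# prediction = loaded_model.predict(uid=166, iid=148)
# print(prediction.est)  # estimated rating for that user/product pair
```

Unpickling executes code stored in the file, so the loader should only ever be pointed at model files produced by your own pipeline.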

# Optional Step: Download the Model File
- After saving the model, we can use google.colab.files to download the file directly to your computer. This only works when the notebook is running in Google Colab.

In [5]:
from google.colab import files

# Download the saved model file
files.download(model_filename)



# Explanation:
- files.download(): This method from the google.colab package allows you to download any file from the Colab environment to your local machine.

# Step 5: Monitoring the Model Performance
- Let’s extend the RMSE evaluation by calculating MAE (Mean Absolute Error) and generating a performance report.

In [6]:
# Step 1: Calculate MAE (Mean Absolute Error)
mae = accuracy.mae(predictions)

# Step 2: Generate a basic performance report
performance_report = {
    'RMSE': rmse,
    'MAE': mae
}

# Display the performance report
print("Model Performance Report:")
for metric, score in performance_report.items():
    print(f"{metric}: {score:.4f}")


MAE:  1.2755
Model Performance Report:
RMSE: 1.4730
MAE: 1.2755


# Explanation:

1. MAE (Mean Absolute Error):

- While RMSE penalizes larger errors more, MAE gives an average of absolute errors. Both metrics are useful for understanding the model's performance.

2. Performance Report:

- We store and print both RMSE and MAE, which gives you an overview of how the model is performing.
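The difference between the two metrics is easiest to see on a hand-sized example. The arrays below are made up; the point is that the single error of 2 pulls RMSE above MAE.

```python
import numpy as np

actual = np.array([3.0, 4.0, 5.0])
predicted = np.array([2.0, 4.0, 7.0])

errors = predicted - actual                 # [-1, 0, 2]
toy_mae = np.mean(np.abs(errors))           # (1 + 0 + 2) / 3
toy_rmse = np.sqrt(np.mean(errors ** 2))    # sqrt((1 + 0 + 4) / 3)

print(f"MAE:  {toy_mae:.4f}")   # 1.0000
print(f"RMSE: {toy_rmse:.4f}")  # 1.2910
```

RMSE is never smaller than MAE on the same predictions, and the gap grows as the errors become more uneven.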

# Additional Step: Advanced Monitoring (Optional)
If you want to go beyond basic monitoring, you can:

1. Track performance over time: You can log the performance after every training cycle to see if the model improves or worsens.
2. Visualization: You can create plots showing error distribution, or monitor the recommendation accuracy using recall/precision if you have ground truth data.
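Tracking performance over time (item 1) can start as simply as appending each run's metrics to a CSV file. This is a minimal sketch using only the standard library; the helper names and the file name `metrics_log.csv` are hypothetical, and the metric values are the ones reported earlier in this notebook.

```python
import csv
import os
from datetime import datetime, timezone

def log_metrics(path, metrics):
    """Append one row of metrics with a UTC timestamp; add a header to a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    row = {"logged_at": datetime.now(timezone.utc).isoformat(), **metrics}
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)

def read_metrics(path):
    """Return all logged rows as a list of dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

log_path = "metrics_log.csv"  # hypothetical file name
log_metrics(log_path, {"RMSE": 1.4730, "MAE": 1.2755})
history = read_metrics(log_path)
print(history[-1])
```

Calling `log_metrics` after every training cycle builds a history that can later be loaded with pandas and plotted to see whether the model improves or worsens.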