# Personalised Recommendation

**Personalised Recommendation:**

> This is a solution to problem statement 2 of the Flipkart Grid challenge, which reads:

The aim is to enhance user experience by implementing a personalized product ranking system.
Your task is to develop an algorithm or model that can generate accurate and relevant product
rankings for individual users. The ranking system should consider factors such as user
preferences, past interactions, product popularity, and user similarity. It should be able to predict
the most suitable products for a user based on their unique characteristics and preferences.
You are not provided with a specific dataset for this challenge. Instead, you are expected to
design and implement a solution that simulates user interactions and generates personalized
rankings. You can define user profiles, product categories, and interaction patterns within your
solution.
To evaluate the effectiveness of your solution, you should define appropriate metrics for
measuring the accuracy and relevance of the rankings. You should also provide a report
explaining your approach, describing the algorithms or techniques used, and discussing the
strengths and limitations of your solution.





# Project Overview

In this project, we aim to develop and evaluate a personalized recommendation system using collaborative filtering. Collaborative filtering is a popular technique that leverages user-item interaction data to provide personalized recommendations to users. Our system will predict user preferences for items based on their past interactions and similarity to other users.

## Goals

The main goals of this project are:

- Develop a collaborative filtering model to generate personalized recommendations for users.
- Implement an evaluation metric, the Personalized Click-Through Diversity Index (PCDI), to measure the performance of the recommendation system.
- Incorporate user interaction data, such as clicks, add-to-cart actions, and ratings, to enhance the quality of recommendations.

## Methodology

Our approach involves several steps:

1. Data Preprocessing: We'll preprocess the user interaction data, including clicks, add-to-cart actions, and ratings, to create a utility matrix.
2. Collaborative Filtering: We'll train a collaborative filtering model using the Surprise library, which will learn user and item embeddings to make personalized recommendations.
3. Evaluation: We'll evaluate the model's performance using the PCDI metric, which combines click-through rate and diversity of recommendations.
4. Interpretation: We'll analyze the results and gain insights into the effectiveness of the recommendation system.
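
The PCDI formula is not spelled out at this point, so the following is only one possible reading of "combines click-through rate and diversity": the product of a user's click-through rate on the recommended list and the share of distinct categories in that list. The function name `pcdi`, the `category_of` lookup, and the multiplicative combination are all assumptions made for illustration.

```python
def pcdi(recommended, clicked, category_of):
    """Hypothetical PCDI: click-through rate times category diversity.

    recommended: list of product ids shown to one user
    clicked: set of product ids the user clicked
    category_of: dict mapping product id -> category label (assumed input)
    """
    if not recommended:
        return 0.0
    # Click-through rate: share of recommended items that were clicked.
    ctr = sum(1 for p in recommended if p in clicked) / len(recommended)
    # Diversity: number of distinct categories relative to the list length.
    diversity = len({category_of[p] for p in recommended}) / len(recommended)
    return ctr * diversity


example_score = pcdi(
    recommended=[1, 2, 3, 4],
    clicked={1, 3},
    category_of={1: "shoes", 2: "shoes", 3: "phones", 4: "books"},
)
print(example_score)  # 0.5 (CTR) * 0.75 (diversity) = 0.375
```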

## Data

We will use a dataset containing user interactions with products, including clicks, add-to-cart actions, and ratings. The dataset will be preprocessed to create the necessary input for the collaborative filtering model.
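
As a rough illustration of the utility matrix mentioned in the methodology, the sketch below folds interaction rows into a nested dictionary keyed by user and product. The sample rows and the helper name `build_utility_matrix` are made up for this example, and keeping the most recent rating per user-product pair is an assumption rather than something specified here.

```python
from collections import defaultdict


def build_utility_matrix(interactions):
    """Return {user_id: {product_id: rating}}, keeping the latest rating per pair."""
    matrix = defaultdict(dict)
    # Sorting by timestamp means later ratings overwrite earlier ones.
    for row in sorted(interactions, key=lambda r: r["timestamp"]):
        matrix[row["user_id"]][row["product_id"]] = row["rating"]
    return dict(matrix)


sample_interactions = [
    {"user_id": 1, "product_id": 10, "rating": 3, "timestamp": "2023-01-01 10:00:00"},
    {"user_id": 1, "product_id": 10, "rating": 5, "timestamp": "2023-02-01 10:00:00"},
    {"user_id": 2, "product_id": 11, "rating": 4, "timestamp": "2023-01-15 09:30:00"},
]
utility = build_utility_matrix(sample_interactions)
print(utility)  # {1: {10: 5}, 2: {11: 4}}
```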

Let's get started with data preprocessing and model development!



## Importing Necessary Libraries

In this section, we begin by importing the essential libraries that will be used throughout our recommendation system project. Each library serves a specific purpose in data processing, model development, interaction with external platforms, and synthetic data generation. Here's an overview of the libraries we're using and how they contribute to our project:

- **Pandas and NumPy**: These libraries are fundamental for data manipulation, cleaning, and analysis. We'll use them to handle our dataframes and perform various operations on our data.

- **scikit-learn's train_test_split**: This function helps us split our dataset into training and testing sets, allowing us to evaluate our model's performance.

- **Surprise Library**: Surprise is a powerful library for building recommendation systems. We'll use it to create collaborative filtering models and make user-product recommendations.

- **SQLAlchemy and SQLite**: These libraries enable interaction with SQLite databases. We'll use them to store and retrieve data if necessary.

- **Confluent Kafka**: Kafka is a streaming platform, and Confluent Kafka is a Python client for Kafka. We'll use it to simulate real-time data interactions in our recommendation system.

- **Faker**: Faker is a library for generating synthetic data. We'll use it to create artificial user interactions for testing and development.

To ensure smooth execution and access to these functionalities, we will install the required packages using pip. With these libraries at our disposal, we're well-equipped to build, evaluate, and enhance our personalized recommendation system.


In [None]:
# Install third-party packages first so the imports below succeed
!pip install surprise confluent_kafka faker kafka

import pandas as pd
import numpy as np
# Note: this name is rebound later by surprise.model_selection.train_test_split
from sklearn.model_selection import train_test_split
from surprise import Dataset, Reader, SVD
import sqlalchemy
import sqlite3
from confluent_kafka import Producer, Consumer
from faker import Faker
from sklearn.metrics.pairwise import cosine_similarity





## Generating Synthetic Data for Testing

To thoroughly test and develop our recommendation system, it's crucial to have a diverse and representative dataset. However, obtaining real user interactions and preferences can be challenging during the initial stages of development. To address this, we have created a function called `generate_synthetic_data`.

**Function Purpose:**
The `generate_synthetic_data` function leverages the Faker library to generate synthetic user interactions for testing purposes. These interactions include user clicks, product add-to-cart actions, ratings, and timestamps. The generated data has the same shape as real interaction logs and allows us to validate the functionality of our recommendation system under various scenarios.

**How It Works:**
The function uses a loop to create a specified number of synthetic interactions (in this case, 10,000 interactions). For each interaction, it generates random values for user IDs, product IDs, click actions, add-to-cart actions, ratings, and timestamps. These values are assembled into a structured format and stored in a list.

**Data Storage:**
Once all interactions are generated, the function creates a Pandas DataFrame to organize the synthetic data. This DataFrame is then saved as a CSV file named 'synthetic_data.csv'. This file serves as a valuable resource for testing, model training, and evaluation.

By generating synthetic data, we can explore the behavior of our recommendation system under controlled conditions and fine-tune its performance before deploying it with real user data.


In [None]:
def generate_synthetic_data():
    fake = Faker()
    # Generate synthetic user interactions
    interactions = []
    for _ in range(10000):
        user_id = fake.random_int(min=1, max=100)
        product_id = fake.random_int(min=1, max=1000)
        click = fake.random_element(elements=('click', 'no_click'))
        add_to_cart = fake.random_element(elements=('added', 'not_added'))
        rating = fake.random_int(min=1, max=5)
        timestamp = fake.date_time_between(start_date='-1y', end_date='now')
        interactions.append({
            'user_id': user_id,
            'product_id': product_id,
            'click': click,
            'add_to_cart': add_to_cart,
            'rating': rating,
            'timestamp': timestamp
        })
    # Save synthetic data as CSV
    df = pd.DataFrame(interactions)
    df.to_csv('synthetic_data.csv', index=False)

## Creating SQLite Database and Importing Synthetic Data

A fundamental step in building a recommendation system is to manage and store the user interaction data in an organized manner. To achieve this, we've developed the `create_sqlite_database` function, which facilitates the creation of an SQLite database and the import of synthetic data for testing and development purposes.

**Function Purpose:**
The `create_sqlite_database` function serves a dual purpose. First, it establishes a connection to an SQLite database named 'recommendation.db'. Second, it defines and creates a table named 'user_interactions' within the database. This table is designed to store information about user interactions, including user IDs, product IDs, click actions, add-to-cart actions, ratings, and timestamps.

**How It Works:**
The function begins by establishing a connection to the SQLite database and obtaining a cursor object, which is used to execute SQL commands. It then defines the structure of the 'user_interactions' table with columns for each interaction attribute.

The next step involves importing the synthetic data stored in 'synthetic_data.csv'. The function reads this CSV file into a Pandas DataFrame and uses the DataFrame's `to_sql` method to write the data into the 'user_interactions' table. Because `if_exists` is set to 'replace', pandas drops any existing 'user_interactions' table, including the one just created, and recreates it from the DataFrame's columns. As a result, any existing data is replaced with the new synthetic data, and the `id` column from the `CREATE TABLE` statement is not present in the final table.

**Data Storage and Management:**
With the 'user_interactions' table populated with synthetic data, the SQLite database becomes a centralized repository for user interaction information. This data can be easily queried, analyzed, and utilized for model training, evaluation, and recommendation generation.

Using an SQLite database to store synthetic data allows us to closely mimic real-world scenarios and interactions, enabling us to develop and test our recommendation system with a solid foundation of data.

Please note that in a production environment, a more robust database management solution might be necessary to handle large-scale data and ensure data integrity.


In [None]:
# Create SQLite database and import synthetic data
def create_sqlite_database():
    conn = sqlite3.connect('recommendation.db')
    cursor = conn.cursor()
    # Create table for user interactions
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS user_interactions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id INTEGER,
            product_id INTEGER,
            click TEXT,
            add_to_cart TEXT,
            rating INTEGER,
            timestamp DATETIME
        )
    ''')
    # Import synthetic data into the table
    df = pd.read_csv('synthetic_data.csv')
    df.to_sql('user_interactions', conn, if_exists='replace', index=False)
    conn.commit()
    conn.close()
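
If the declared schema (including the autoincrementing `id`) should be kept, rows can be inserted into the existing table instead of replacing it. The sketch below does this with `sqlite3` alone, against an in-memory database so that it does not touch `recommendation.db`. The two sample rows are taken from the sample output shown later in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("""
    CREATE TABLE IF NOT EXISTS user_interactions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER,
        product_id INTEGER,
        click TEXT,
        add_to_cart TEXT,
        rating INTEGER,
        timestamp DATETIME
    )
""")

sample_rows = [
    (76, 613, "no_click", "not_added", 3, "2022-10-08 09:02:20"),
    (7, 67, "click", "added", 2, "2023-01-22 10:06:43"),
]
# Insert into the declared table so the autoincrementing id is kept.
conn.executemany(
    "INSERT INTO user_interactions "
    "(user_id, product_id, click, add_to_cart, rating, timestamp) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    sample_rows,
)
conn.commit()

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(user_interactions)")]
ids = [row[0] for row in conn.execute("SELECT id FROM user_interactions ORDER BY id")]
conn.close()
print(columns, ids)
```

With pandas, passing `if_exists='append'` to `to_sql` has a similar effect, provided the table is emptied first when a fresh load is wanted.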


## Running the Synthetic Data Generator

With `generate_synthetic_data` defined, we can now run it to produce the dataset used for testing and developing the recommendation system.

**Generating Synthetic Data:**
The function uses the `Faker` library to create user interactions, including clicks, add-to-cart events, ratings, and timestamps, and writes them to `synthetic_data.csv`. Because the values are random, the data is suited to checking that the pipeline runs end to end rather than to judging recommendation quality.

**Calling the Function:**
To generate the synthetic data, execute the following code snippet:

In [None]:
# Generate synthetic data
generate_synthetic_data()

## Running the Database Setup

To facilitate efficient data storage and manipulation, we create an SQLite database and import the synthetic user interaction data using the `create_sqlite_database` function. This function sets up a structured database table to organize the generated data for seamless querying and analysis.

**Creating SQLite Database:**
The `create_sqlite_database` function is responsible for creating an SQLite database named `recommendation.db`. It establishes a table named `user_interactions` to store user interactions data. This table is designed to accommodate attributes such as user ID, product ID, click actions, add-to-cart events, ratings, and timestamps.

**Importing Synthetic Data:**
After creating the database and defining the table structure, we import the previously generated synthetic data from the CSV file (`synthetic_data.csv`) into the `user_interactions` table. This step is crucial for populating the database with realistic user interaction data, which can then be leveraged for recommendation system development and testing.

**Calling the Function:**
To create the SQLite database and import synthetic data, execute the following code snippet:

In [None]:
# Create SQLite database and import synthetic data
create_sqlite_database()


## Verifying Database Creation and Data Retrieval

To ensure the successful creation of the SQLite database and the proper import of synthetic data, we utilize the code snippet provided below. This snippet demonstrates how to connect to the database, retrieve data from the `user_interactions` table, and display the initial rows of the dataset.

**Verifying Database Connection:**
Before interacting with the database, the `sqlite3` module is used to establish a connection to the SQLite database file named `recommendation.db`. This connection allows us to interact with the database, execute SQL queries, and retrieve data.

**Retrieving Data:**
The `pd.read_sql_query` function from the Pandas library is employed to execute an SQL query on the connected database. The query, `'SELECT * FROM user_interactions'`, retrieves all rows and columns from the `user_interactions` table.

**Closing Database Connection:**
After retrieving the data, the database connection is closed using the `conn.close()` statement to release any resources and ensure proper management of the database file.

**Displaying Data:**
To visually inspect the retrieved data, the `df.head()` function is used. This displays the initial few rows of the DataFrame `df`, which contains the data from the `user_interactions` table.

**Calling the Code:**
Execute the following code snippet to verify that the database was created and to view the initial data entries:

In [None]:
# Verify that the database was created

import sqlite3
conn = sqlite3.connect('recommendation.db')
df = pd.read_sql_query('SELECT * FROM user_interactions', conn)
conn.close()
df.head()


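|   | user_id | product_id | click | add_to_cart | rating | timestamp |
|---|---------|------------|-------|-------------|--------|-----------|
| 0 | 76 | 613 | no_click | not_added | 3 | 2022-10-08 09:02:20 |
| 1 | 31 | 483 | no_click | not_added | 3 | 2023-08-20 05:08:31 |
| 2 | 7 | 67 | click | added | 2 | 2023-01-22 10:06:43 |
| 3 | 51 | 72 | no_click | not_added | 5 | 2022-12-26 08:22:49 |
| 4 | 67 | 132 | no_click | not_added | 4 | 2022-12-31 07:28:06 |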


## Kafka Configuration for Click Stream Producer

In this code snippet, a configuration dictionary named `kafka_config` is defined to configure the Kafka producer that will be used to send click stream events to a Kafka topic. The configuration includes two key-value pairs:

1. `'bootstrap.servers'`: This specifies the address and port of the Kafka broker. Replace `'localhost:9092'` with the address of your Kafka broker. The broker is the central hub for handling Kafka messages.
   
2. `'client.id'`: This sets an identifier for the Kafka producer. In this case, it is set to `'clickstream-producer'`, which is a unique name to identify the producer.

This configuration is essential for establishing a connection between the producer and the Kafka broker, enabling the producer to send events to the specified Kafka topic.

Ensure that you provide the correct broker address to establish the connection, and customize the `client.id` to suit your application's needs.

In [None]:
kafka_config = {
    'bootstrap.servers': 'localhost:9092',  # Replace with your Kafka broker
    'client.id': 'clickstream-producer'
}
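
For reference, a slightly fuller producer configuration might look like the sketch below. The extra keys (`acks`, `enable.idempotence`, `linger.ms`, `compression.type`) are standard librdkafka producer settings, but the specific values and the name `kafka_config_durable` are illustrative assumptions rather than settings taken from this project.

```python
# Hypothetical variant of kafka_config with explicit delivery settings.
kafka_config_durable = {
    "bootstrap.servers": "localhost:9092",  # replace with your Kafka broker
    "client.id": "clickstream-producer",
    "acks": "all",                # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,   # avoid duplicates when the producer retries
    "linger.ms": 5,               # small batching delay, in milliseconds
    "compression.type": "gzip",   # compress batches before sending
}
```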

## Simulating Click Stream Events and Adding to DataFrame

In this section, we will simulate capturing multiple click stream events and then add these events to an existing DataFrame. Each click stream event will be represented by a `ClickStreamEvent` object, containing information such as user ID, product ID, click status, add-to-cart status, rating, and timestamp.

1. **ClickStreamEvent Class**: We define a class `ClickStreamEvent` to encapsulate the information for each click stream event. The class constructor takes arguments to initialize the event attributes such as user ID, product ID, click status, add-to-cart status, rating, and timestamp.

2. **Simulating Events**: We use a loop to capture a specified number of click stream events. For each event, we prompt the user to provide inputs for user ID, product ID, click status (yes/no), add-to-cart status (yes/no), rating, and timestamp (in the format YYYY-MM-DD HH:MM:SS). Based on these inputs, we create instances of `ClickStreamEvent` and add them to the `click_stream_events` list.

3. **Adding to DataFrame**: We then build one row per event from the `click_stream_events` list and add these rows to the existing DataFrame (`df`), using each event's attributes as the column values.

Overall, this section simulates the process of capturing click stream events, creating corresponding event objects, and adding them to an existing DataFrame for further analysis and processing.

---

Make sure to run the code in this section to simulate the click stream event capturing and DataFrame updating process.

In [None]:
from datetime import datetime
import time

class ClickStreamEvent:
    def __init__(self, user_id, product_id, click, add_to_cart, rating, timestamp):
        self.user_id = user_id
        self.product_id = product_id
        self.click = click
        self.add_to_cart = add_to_cart
        self.rating = rating
        self.timestamp = timestamp

# Simulate capturing multiple click stream events
click_stream_events = []

num_events = int(input("Enter the number of click stream events: "))

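for _ in range(num_events):
    # Cast IDs to int so they match the integer IDs already in df
    user_id = int(input("Enter user ID: "))
    product_id = int(input("Enter product ID: "))
    click = input("Did the user click? (yes/no): ").strip().lower() == "yes"
    add_to_cart = input("Did the user add to cart? (yes/no): ").strip().lower() == "yes"
    # The rating is optional, so an empty answer is stored as None
    rating_text = input("Enter rating (if given): ").strip()
    rating = int(rating_text) if rating_text else None
    timestamp = input("Enter timestamp (YYYY-MM-DD HH:MM:SS): ")

    event = ClickStreamEvent(user_id, product_id, click, add_to_cart, rating, timestamp)
    click_stream_events.append(event)

# Turn the captured events into rows and add them to the existing DataFrame.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
new_rows = [
    {
        'user_id': event.user_id,
        'product_id': event.product_id,
        'click': event.click,
        'add_to_cart': event.add_to_cart,
        'rating': event.rating,
        'timestamp': event.timestamp
    }
    for event in click_stream_events
]
if new_rows:
    df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)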

Enter the number of click stream events: 0
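
Because `input()` makes this cell awkward to rerun or test, a non-interactive variant can be handy. The sketch below is an alternative under stated assumptions, not the notebook's code: it uses a dataclass named `ClickEvent` (a hypothetical stand-in for `ClickStreamEvent`) and a helper `parse_event` that validates raw strings, including the `YYYY-MM-DD HH:MM:SS` timestamp format the prompt asks for. The 1 to 5 rating check mirrors the range used by the synthetic data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ClickEvent:
    user_id: int
    product_id: int
    click: bool
    add_to_cart: bool
    rating: Optional[int]
    timestamp: datetime


def parse_event(user_id, product_id, click, add_to_cart, rating, timestamp):
    """Build a ClickEvent from raw strings, raising ValueError on bad input."""
    parsed_rating = int(rating) if rating.strip() else None
    if parsed_rating is not None and not 1 <= parsed_rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    return ClickEvent(
        user_id=int(user_id),
        product_id=int(product_id),
        click=click.strip().lower() == "yes",
        add_to_cart=add_to_cart.strip().lower() == "yes",
        rating=parsed_rating,
        # strptime raises ValueError if the text does not match the format.
        timestamp=datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
    )


event = parse_event("7", "67", "Yes", "no", "", "2023-01-22 10:06:43")
print(event)
```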


## Sending Click Stream Events to Kafka

In this section, we will send the captured click stream events to a Kafka topic using a Kafka producer. The click stream events, which we previously simulated and stored in the `click_stream_events` list, will be serialized into JSON format and then sent to the Kafka topic named `click_stream_topic`.

1. **Creating Kafka Producer**: We first create a Kafka producer using the provided `kafka_config` dictionary. The producer will be responsible for sending the click stream events to the Kafka topic.

2. **Processing and Sending Events**: We use a loop to iterate through the list of `click_stream_events`. For each event, we create a dictionary `event_data` containing the attributes of the event such as user ID, product ID, click status, add-to-cart status, rating, and timestamp. We then serialize this dictionary into JSON format using the `json.dumps()` function.

3. **Sending to Kafka**: Using the Kafka producer, we send the serialized event data to the `click_stream_topic` Kafka topic. We then call `producer.flush()`, which blocks until the queued message has been delivered or has failed.

4. **Print and Delay**: After sending each event, we print a message confirming the sent event's data and use `time.sleep(1)` to introduce a delay of 1 second before processing the next event. You can adjust the delay time according to your preferences.

5. **Releasing the Producer**: Once all events are sent, we call `producer.flush()` one final time and set `producer` to `None` so that the object can be released.

Overall, this section demonstrates how to use a Kafka producer to send simulated click stream events to a Kafka topic for further processing and analysis.

---

Make sure to run the code in this section to send the simulated click stream events to the Kafka topic.

In [None]:
import json  # needed for json.dumps below

# Create Kafka producer
producer = Producer(kafka_config)

# Process and send the list of click stream events to Kafka
for event in click_stream_events:
    # Serialize event data
    event_data = {
        'user_id': event.user_id,
        'product_id': event.product_id,
        'click': event.click,
        'add_to_cart': event.add_to_cart,
        'rating': event.rating,
        'timestamp': event.timestamp
    }
    serialized_event = json.dumps(event_data)

    # Send event data to Kafka topic
    producer.produce('click_stream_topic', value=serialized_event.encode('utf-8'))
    producer.flush()
    print("Sent click stream event to Kafka:", event_data)
    print("-" * 20)

    # Simulate some delay before processing the next event
    time.sleep(1)  # Adjust the delay as needed

# Flush any outstanding messages and close the Kafka producer
producer.flush()
producer = None  # Set the producer object to None to release its resources
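
The serialization step can be checked without a running broker. The sketch below builds the bytes that would be passed as `value=` to `producer.produce` and confirms that decoding them recovers the original dictionary. The sample event values are placeholders.

```python
import json

event_data = {
    "user_id": 7,
    "product_id": 67,
    "click": True,
    "add_to_cart": True,
    "rating": 2,
    "timestamp": "2023-01-22 10:06:43",
}

# Same encoding as in the producer loop: JSON text, then UTF-8 bytes.
payload = json.dumps(event_data).encode("utf-8")

# A consumer would reverse the two steps to get the dictionary back.
decoded = json.loads(payload.decode("utf-8"))
print(decoded == event_data)  # True
```

Note that `json.dumps` cannot serialize `datetime` objects directly, so timestamps need to be strings (as here) or converted first.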



## Processing Click Stream Events

In this section, we display the details of the click stream events that were sent to the Kafka topic. The code iterates through the in-memory `click_stream_events` list and prints the attributes of each event, including user ID, product ID, click status, add-to-cart status, rating, and timestamp. It does not consume messages from Kafka.

1. **Iterating Through Events**: We use a loop to iterate through each event in the `click_stream_events` list that was previously sent to the Kafka topic.

2. **Displaying Event Details**: For each event, we print its details to the notebook's output. These details include the user ID, product ID, click status, add-to-cart status, rating, and timestamp. Each attribute is printed along with its corresponding value.

3. **Separator Line**: After displaying the details of each event, we print a separator line (`"-" * 20`) to visually separate the information for different events.

By running the code in this section, you will see the details of each click stream event that was passed to the producer. Because the loop reads the local list rather than the topic, it confirms what was queued for sending, not what the broker delivered.

---

Execute the code to view the details of the processed click stream events in the notebook's output.

In [None]:
# Process the list of click stream events
for event in click_stream_events:
    print("Received click stream event:")
    print("User ID:", event.user_id)
    print("Product ID:", event.product_id)
    print("Click:", event.click)
    print("Add to Cart:", event.add_to_cart)
    print("Rating:", event.rating)
    print("Timestamp:", event.timestamp)
    print("-" * 20)
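
To read the events back from Kafka itself, a consumer is needed. The following is a minimal sketch under several assumptions: a broker at `localhost:9092`, the `click_stream_topic` topic used above, and a made-up consumer group name `clickstream-consumer`. The helper names `decode_event` and `consume_events` are hypothetical. `confluent_kafka` is imported inside the function so that the decoding helper can be exercised without the library or a broker.

```python
import json


def decode_event(raw_value):
    """Turn the UTF-8 JSON bytes produced earlier back into a dictionary."""
    return json.loads(raw_value.decode("utf-8"))


def consume_events(max_messages=10, timeout_s=1.0):
    """Poll up to max_messages events from click_stream_topic (needs a broker)."""
    from confluent_kafka import Consumer  # imported lazily; see the note above

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # assumed broker address
        "group.id": "clickstream-consumer",     # hypothetical group name
        "auto.offset.reset": "earliest",        # start from the oldest message
    })
    consumer.subscribe(["click_stream_topic"])
    events = []
    try:
        while len(events) < max_messages:
            msg = consumer.poll(timeout_s)
            if msg is None:
                break  # nothing arrived within the timeout
            if msg.error():
                print("Consumer error:", msg.error())
                continue
            events.append(decode_event(msg.value()))
    finally:
        consumer.close()  # leave the consumer group cleanly
    return events


# The decoding step on its own, using a hand-written payload:
print(decode_event(b'{"user_id": 7, "rating": 2}'))
```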

In [None]:
# Append the generated clickstream events to the database
conn = sqlite3.connect('recommendation.db')
cursor = conn.cursor()

for event in click_stream_events:
    cursor.execute('''
        INSERT INTO user_interactions (user_id, product_id, click, add_to_cart, rating, timestamp)
        VALUES (?, ?, ?, ?, ?, ?)
    ''', (event.user_id, event.product_id, event.click, event.add_to_cart, event.rating, event.timestamp))

conn.commit()
conn.close()

print("Generated clickstream events have been added to the database.")


Generated clickstream events have been added to the database.



## Collaborative Filtering and K-Nearest Neighbors (KNN) Recommendation System

In this section, we'll explore building a recommendation system using collaborative filtering and the K-Nearest Neighbors (KNN) algorithm. Collaborative filtering is a widely used technique in recommendation systems that predicts a user's preferences based on the preferences of similar users. KNN, on the other hand, is a method that identifies the 'k' nearest neighbors to a particular item or user, and uses their preferences to make recommendations.

We'll utilize the Surprise library, which is a Python scikit for building and analyzing recommender systems. We'll begin by importing the necessary libraries, including `SVD` (Singular Value Decomposition) and `KNNBasic` algorithms from Surprise, `Dataset`, `Reader` for loading data, `train_test_split` for splitting data, `pandas` for data manipulation, and `datetime` for handling timestamps.



In [None]:
from surprise import SVD, Dataset, Reader, KNNBasic
from surprise.model_selection import train_test_split
import pandas as pd
from datetime import datetime


## Loading Data for Collaborative Filtering

In this section, we'll discuss the process of loading data for collaborative filtering, a popular technique used in recommendation systems. Collaborative filtering involves making automatic predictions (filtering) about the interests of a user by collecting preferences from many users (collaborating). We'll utilize the Surprise library to load and preprocess our data.

The provided code snippet demonstrates how to load data for collaborative filtering using the `load_data_for_collaborative_filtering` function. This function takes a DataFrame (`df`) as input, which is expected to contain user interactions including `user_id`, `product_id`, and `rating` columns. The `Reader` class is used to specify the rating scale (here, between 1 and 5).

The `Dataset.load_from_df` method from the Surprise library is then used to load the data into a Surprise-compatible format. This format is crucial for training and evaluating collaborative filtering models. The loaded data is returned as a `data` object.


In [None]:
# Load data for collaborative filtering
def load_data_for_collaborative_filtering(df):
    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(df[['user_id', 'product_id', 'rating']], reader)
    return data


## Splitting Data into Training and Testing Sets

Once we have loaded the data for collaborative filtering, the next step is to split it into training and testing sets. This process allows us to train our recommendation model on one subset of the data and evaluate its performance on another. This split helps us assess how well the model generalizes to unseen data.

The code snippet provided demonstrates how to split the data into training and testing sets using the `split_data` function. The `data` object, loaded in the previous step, is used as input. The `train_test_split` function from the Surprise library is employed for this purpose.

The function returns two sets: `trainset` and `testset`. The `trainset` is used for training our collaborative filtering model, while the `testset` is used to evaluate the model's performance. The parameter `test_size` specifies the proportion of data allocated for testing (in this case, 20% of the data).

It's important to note that setting a `random_state` ensures reproducibility in the data split, allowing us to obtain consistent results across different runs.


In [None]:
# Split data into training and testing sets
def split_data(data):
    trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
    return trainset, testset
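
To make the effect of `test_size` and `random_state` concrete, here is a plain-Python sketch of a seeded shuffle-and-slice split. It illustrates the idea only and does not reproduce Surprise's internal implementation; `simple_split` is a hypothetical helper and the ratings are dummy values.

```python
import random


def simple_split(ratings, test_size=0.2, seed=42):
    """Shuffle a copy of the ratings with a fixed seed, then slice off the test part."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


ratings = [(user, item, 3) for user in range(5) for item in range(4)]  # 20 rows
train_a, test_a = simple_split(ratings)
train_b, test_b = simple_split(ratings)
# Same seed, same split: the two test sets are identical.
print(len(train_a), len(test_a), test_a == test_b)
```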

## Training Collaborative Filtering Model

With the data prepared and split into training and testing sets, the next step is to train a collaborative filtering model. Collaborative filtering is a widely used technique for recommendation systems that leverages user-item interactions to make personalized recommendations.

The code snippet provided demonstrates how to train a collaborative filtering model using the Surprise library. We define the `train_collaborative_filtering_model` function, which takes the `trainset` as input – the set containing user-item interactions for training.

Within the function, we create an instance of the Singular Value Decomposition (SVD) model using `SVD()` from the Surprise library. SVD is a matrix factorization technique commonly used for collaborative filtering. We then fit the model to the training data using the `fit` method.

The trained model is returned, which can be used to generate recommendations for users based on their interactions and preferences.


In [None]:

# Train collaborative filtering model
def train_collaborative_filtering_model(trainset):
    model = SVD()
    model.fit(trainset)
    return model
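
For intuition, Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The sketch below evaluates that rule by hand. All numbers are invented, and `predict_rating` is a hypothetical helper rather than part of Surprise.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(1, 5)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    estimate = global_mean + user_bias + item_bias + dot
    low, high = rating_scale
    return min(high, max(low, estimate))


estimate = predict_rating(
    global_mean=3.0,
    user_bias=0.25,
    item_bias=-0.5,
    user_factors=[0.5, 1.0],
    item_factors=[1.0, 0.5],
)
print(estimate)  # 3.0 + 0.25 - 0.5 + (0.5 + 0.5) = 3.75
```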

## Generating Unranked Recommendations using Collaborative Filtering

After training the collaborative filtering model, we build an initial, unranked candidate list for a specific user.

The code snippet provided defines the `generate_unranked_recommendations` function, which takes the trained `model`, a `user_id`, and the `trainset` as inputs. The `model` argument is not used inside the function; the list is built from the `trainset` alone.

Inside the function, we access the user's interactions in the `trainset` using `trainset.ur[trainset.to_inner_uid(user_id)]`. For each item the user has interacted with, we retrieve its inner item ID and convert it back to the original product ID using `trainset.to_raw_iid(inner_iid)`. These product IDs form the list of unranked recommendations. If `user_id` does not appear in the training set, `to_inner_uid` raises a `ValueError`.

Because the list contains only products the user has already interacted with in the training set, it serves as a seed set for later ranking rather than as a list of new suggestions. The items are not yet ranked by relevance or preference.


In [None]:
# Collect the products a user has already interacted with in the training set.
# Note: `model` is accepted but not used; the list comes from `trainset` alone.
def generate_unranked_recommendations(model, user_id, trainset):
    user_items = list(trainset.ur[trainset.to_inner_uid(user_id)])
    recommendations = []
    for inner_iid, _ in user_items:
        product_id = trainset.to_raw_iid(inner_iid)
        recommendations.append(product_id)
    return recommendations
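
If the goal is to suggest products the user has not seen yet, one common pattern is to score every unseen product and keep the top few. The sketch below shows that pattern with a stand-in scoring function; with a fitted Surprise model, `score` could be `lambda u, i: model.predict(u, i).est`. The helper name `recommend_unseen` and the toy scores are assumptions for illustration.

```python
def recommend_unseen(user_id, seen_by_user, all_products, score, top_n=3):
    """Rank products the user has not interacted with by a scoring function."""
    seen = seen_by_user.get(user_id, set())
    candidates = [p for p in all_products if p not in seen]
    # Highest score first; ties are broken by product id for determinism.
    candidates.sort(key=lambda p: (-score(user_id, p), p))
    return candidates[:top_n]


toy_scores = {101: 4.5, 102: 2.0, 103: 3.5, 104: 5.0, 105: 4.0}
top_products = recommend_unseen(
    user_id=1,
    seen_by_user={1: {104}},          # product 104 is already known to the user
    all_products=sorted(toy_scores),
    score=lambda user, product: toy_scores[product],  # stand-in for a model
)
print(top_products)  # [101, 105, 103]
```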



## Training a K-Nearest Neighbors (KNN) Model

In addition to collaborative filtering, another recommendation technique we can employ is the K-Nearest Neighbors (KNN) algorithm. The KNN algorithm leverages the similarity between users or items to make recommendations.

The code snippet provided showcases how to train a KNN model using the Surprise library. We define the `train_knn_model` function, which takes the `trainset` as input.

Inside the function, we specify the `sim_options` parameter, where we choose the similarity metric to be the cosine similarity. Additionally, we set `user_based` to `False`, indicating that we are using item-based similarity. The model is then trained on the `trainset`.

The KNN model calculates the similarity between items and uses this information to make recommendations based on the items that are most similar to the ones a user has interacted with.


In [None]:
# Train KNN model
def train_knn_model(trainset):
    sim_options = {
        'name': 'cosine',
        'user_based': False
    }
    model = KNNBasic(sim_options=sim_options)
    model.fit(trainset)
    return model
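
To illustrate what item-based cosine similarity measures, the sketch below computes it for two items using only the users who rated both, which matches how Surprise defines its `cosine` measure. The ratings and the helper name `item_cosine` are made up for the example.

```python
import math


def item_cosine(ratings_i, ratings_j):
    """Cosine similarity between two items over users who rated both.

    ratings_i, ratings_j: dicts mapping user_id -> rating for each item.
    """
    common = ratings_i.keys() & ratings_j.keys()
    if not common:
        return 0.0
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)


item_a = {1: 3, 2: 4, 3: 5}
item_b = {1: 3, 2: 4, 4: 1}  # only users 1 and 2 rated both items
similarity = item_cosine(item_a, item_b)
print(round(similarity, 6))  # 1.0, since the common ratings are identical
```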


## Generating Ranked Recommendations using K-Nearest Neighbors (KNN) Model

After training the KNN model, the next step is to generate ranked recommendations using this model. The code snippet provided demonstrates how to do this using the Surprise library.

We define the `generate_ranked_knn_recommendations` function, which takes the trained KNN `model` and a list of `unranked_recommendations` as input. The unranked recommendations are the items that the collaborative filtering model has suggested based on the user's past interactions.

Inside the function, we iterate through each `product_id` in the `unranked_recommendations` list. For each product, we map the `product_id` to its corresponding inner item ID using `model.trainset.to_inner_iid(product_id)`.

We then look up a similarity value from the KNN model using `model.sim[inner_iid][0]`. Since `model.sim` is the item-item similarity matrix, this value is the similarity between the given item and the item whose inner ID is 0, so the ranking is relative to that single reference item rather than to the user's history.

The recommendations are sorted in descending order of this score, so items with higher similarity scores are ranked higher. Finally, we return the ranked list. Note that it contains Surprise's inner item IDs, not the original product IDs; `model.trainset.to_raw_iid` converts them back.


In [None]:

# Generate ranked recommendations using KNN model
def generate_ranked_knn_recommendations(model, unranked_recommendations):
    knn_ranked_recommendations = []
    for product_id in unranked_recommendations:
        inner_iid = model.trainset.to_inner_iid(product_id)
        # Sort key: column 0 of the item's similarity row (its similarity to inner item 0)
        similarity_score = model.sim[inner_iid][0]
        knn_ranked_recommendations.append((inner_iid, similarity_score))
    knn_ranked_recommendations.sort(key=lambda x: x[1], reverse=True)  # Sort by similarity in descending order
    return [item_id for item_id, _ in knn_ranked_recommendations]
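
One possible alternative, sketched here as an assumption and not as the notebook's method, is to score each candidate by its mean similarity to the items the user has already interacted with. The function name, the `history_ids` argument, and the toy matrix are hypothetical. With Surprise, the history could for instance be read from `trainset.ur[inner_uid]`.

```python
def rank_by_history_similarity(sim, candidate_ids, history_ids):
    """Order candidates by mean similarity to items the user already rated.

    `sim` is a square item-item similarity matrix indexed by inner item ID.
    """
    scored = []
    for cand in candidate_ids:
        if not history_ids:
            score = 0.0  # no history: every candidate ties
        else:
            score = sum(sim[cand][h] for h in history_ids) / len(history_ids)
        scored.append((cand, score))
    # Python's sort is stable, so ties keep their original order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in scored]

# Toy 4x4 similarity matrix (hypothetical values)
toy_sim = [
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.4],
    [0.1, 0.8, 0.4, 1.0],
]
ranked = rank_by_history_similarity(toy_sim, candidate_ids=[1, 2, 3], history_ids=[0])
```

Here item 2 ranks first because it is the most similar to item 0, the only item in the toy history.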

Before training any recommendation models, it's crucial to preprocess the user interaction data to ensure it's in the right format for analysis. The following preprocessing steps are applied to the dataset:

1. **Timestamp Conversion:**
   The timestamps in the dataset are converted from string format to datetime format using the `pd.to_datetime` function. This conversion facilitates accurate time-based analysis and enables easier handling of time-related operations.

2. **Mapping Click and Add-to-Cart:**
   To simplify the analysis and modeling, the 'click' and 'add_to_cart' columns are mapped to binary values. Specifically, 'click' values are mapped to 1 (indicating a click) and 'no_click' values are mapped to 0. Similarly, 'added' values in the 'add_to_cart' column are mapped to 1 (indicating an item was added to the cart), and 'not_added' values are mapped to 0.

These preprocessing steps enhance the dataset's usability and make it ready for training collaborative filtering and KNN models. By converting timestamps and mapping categorical variables to binary values, we create a structured and standardized dataset for recommendation analysis.

In [None]:
# Preprocess timestamp to datetime format
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Map click and add_to_cart to binary values
df['click'] = df['click'].map({'click': 1, 'no_click': 0})
df['add_to_cart'] = df['add_to_cart'].map({'added': 1, 'not_added': 0})
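
Note that `Series.map` with a dictionary yields `NaN` for any value missing from the dictionary, so an unexpected label silently becomes a missing value. The sketch below shows the same two preprocessing steps on a single record in plain Python, raising an error on unknown labels instead. The `preprocess_row` helper and the sample record are hypothetical.

```python
from datetime import datetime

CLICK_MAP = {"click": 1, "no_click": 0}
CART_MAP = {"added": 1, "not_added": 0}

def preprocess_row(row):
    """Return a copy of `row` with a parsed timestamp and binary flags."""
    for column, mapping in (("click", CLICK_MAP), ("add_to_cart", CART_MAP)):
        if row[column] not in mapping:
            raise ValueError(f"unexpected {column} value: {row[column]!r}")
    return {
        **row,
        "timestamp": datetime.fromisoformat(row["timestamp"]),
        "click": CLICK_MAP[row["click"]],
        "add_to_cart": CART_MAP[row["add_to_cart"]],
    }

# Hypothetical record in the string format described above
raw = {"user_id": 60, "product_id": 594, "click": "click",
       "add_to_cart": "not_added", "rating": 4,
       "timestamp": "2023-08-01 12:30:00"}
clean = preprocess_row(raw)
```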

# Converting Data to Surprise-Compatible Format

To effectively train and evaluate recommendation models using the Surprise library, the user interaction data needs to be converted into a format compatible with Surprise. The following steps outline this conversion process:

1. **Initializing Reader:**
   The `Reader` class from the Surprise library is used to specify the rating scale. In this case, the rating scale is set to be between 1 and 5.

2. **Loading Data from DataFrame:**
   The `Dataset.load_from_df` function is employed to load the data from the `df` DataFrame, containing columns 'user_id', 'product_id', and 'rating'. This creates a Surprise `Dataset` object that encapsulates the user-item interactions.

3. **Building Trainset:**
   The `build_full_trainset()` method is called on the `Dataset` object to construct a Surprise `Trainset`. This trainset includes all user-item interactions from the original DataFrame and is suitable for training collaborative filtering models.

By converting the data into the Surprise-compatible format, we are ready to train collaborative filtering models and utilize Surprise's functionalities for recommendation analysis.

Remember that these steps prepare the data for collaborative filtering models, and you can further extend this process to include additional features or preprocess for other recommendation algorithms as needed.



In [None]:
# Convert the 'df' DataFrame to Surprise-compatible format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'product_id', 'rating']], reader)
trainset = data.build_full_trainset()
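
A Surprise `Trainset` refers to users and items by contiguous inner IDs, not by the raw IDs in the DataFrame, which is why the later cells call `to_inner_iid` and `to_raw_iid`. The sketch below mimics that idea with a hypothetical `IdMapper` class. It is an illustration only and not Surprise's implementation.

```python
class IdMapper:
    """Assigns contiguous inner IDs to raw IDs in order of first appearance."""

    def __init__(self):
        self._raw_to_inner = {}
        self._inner_to_raw = []

    def to_inner(self, raw_id):
        # Unlike Surprise, this sketch registers unknown raw IDs on first lookup
        if raw_id not in self._raw_to_inner:
            self._raw_to_inner[raw_id] = len(self._inner_to_raw)
            self._inner_to_raw.append(raw_id)
        return self._raw_to_inner[raw_id]

    def to_raw(self, inner_id):
        return self._inner_to_raw[inner_id]

items = IdMapper()
inner_ids = [items.to_inner(pid) for pid in [594, 967, 594, 187]]
```

In Surprise itself, `to_inner_iid` raises a `ValueError` for an item that is not in the trainset, so candidate lists should only contain items seen during training.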

In [None]:

# Train collaborative filtering model
collab_model = train_collaborative_filtering_model(trainset)

# Generating Unranked Recommendations

In the previous steps, we trained a collaborative filtering model using the SVD algorithm and prepared the data for recommendations. Now, let's take a closer look at how we can generate unranked recommendations for a specific user using the trained collaborative filtering model.

**Step 1: Define Example User**
We choose an example user for whom we want to generate unranked recommendations. In this case, the `example_user_id` is set to 60. You can replace this with any valid user ID from your dataset.

**Step 2: Generating Unranked Recommendations**
We call the `generate_unranked_recommendations()` function, passing the following arguments:
- `collab_model`: The collaborative filtering model trained using the SVD algorithm.
- `example_user_id`: The ID of the user for whom we want to generate recommendations.
- `trainset`: The training dataset used to train the collaborative filtering model.

The function returns a list of unranked recommendations for the specified user. These recommendations are based on the items that the user has interacted with in the past, as learned by the collaborative filtering model.

Unranked recommendations are a starting point for personalized recommendations, but they are not sorted or prioritized based on any specific criterion. In the next steps, we will see how to use these unranked recommendations as input to the KNN algorithm to generate ranked recommendations that can be presented to the user for further exploration.

In [None]:

# Example user for generating unranked recommendations
example_user_id = 60
unranked_recommendations = generate_unranked_recommendations(collab_model, example_user_id, trainset)

# Training K-Nearest Neighbors (KNN) Model and Generating Ranked Recommendations

In this section, we focus on enhancing the unranked recommendations obtained from the collaborative filtering model by utilizing the K-Nearest Neighbors (KNN) algorithm. The KNN algorithm is particularly useful for finding items that are similar to those already interacted with by the user, thereby improving recommendation quality.

**Step 1: Training the KNN Model**
We use the `train_knn_model()` function to train the KNN model. The model is configured to use the cosine similarity metric and is set to operate in item-based mode (`user_based=False`). This means that the similarity between items will be calculated based on user interactions with those items.

**Step 2: Generating Ranked Recommendations using KNN**
We proceed to generate ranked recommendations using the `generate_ranked_knn_recommendations()` function. This function takes two arguments:
- `knn_model`: The KNN model trained in the previous step.
- `unranked_recommendations`: The list of unranked recommendations obtained from the collaborative filtering model.

For each item in the list of unranked recommendations, the function looks up a score in the KNN model's similarity matrix: the entry relating that item to the item with inner ID 0. The recommendations are then sorted by this score in descending order to create a list of ranked recommendations.

By combining the strengths of both collaborative filtering and KNN, we can create a more sophisticated recommendation system that leverages user-item interactions and item-item similarity to deliver personalized and diverse recommendations to users.

In [None]:
# Train KNN model
knn_model = train_knn_model(trainset)

# Generate ranked recommendations using KNN model
knn_ranked_recommendations = generate_ranked_knn_recommendations(knn_model, unranked_recommendations)

Computing the cosine similarity matrix...
Done computing similarity matrix.


After generating unranked recommendations from the Collaborative Filtering model for the example user (`example_user_id`, here 60), we can print out the list of recommended products. These unranked recommendations represent items that the user is likely to be interested in based on their interactions with the system. However, they have not been sorted or ranked according to any specific criteria.

The code snippet provided iterates through the list of unranked recommendations and prints the product ID of each recommended item. This provides a simple view of the items that the Collaborative Filtering model suggests for the given user. Keep in mind that these recommendations are the initial output before applying any ranking or similarity-based techniques.

This step serves as a starting point for understanding what items the Collaborative Filtering model suggests for the example user. Subsequent steps involve enhancing these recommendations using the K-Nearest Neighbors (KNN) algorithm to provide a more refined and relevant list of ranked recommendations.

In [None]:

print(f"Unranked Recommendations from Collaborative Filtering for user {example_user_id}:")
for product_id in unranked_recommendations:
    print(f"Product {product_id}")


Unranked Recommendations from Collaborative Filtering for user 60:
Product 594
Product 967
Product 187
Product 709
Product 616
Product 21
Product 217
Product 665
Product 278
Product 679
Product 122
Product 962
Product 161
Product 716
Product 581
Product 636
Product 269
Product 514
Product 112
Product 69
Product 296
Product 614
Product 280
Product 804
Product 444
Product 644
Product 985
Product 908
Product 31
Product 848
Product 978
Product 229
Product 658
Product 378
Product 266
Product 689
Product 465
Product 127
Product 771
Product 4
Product 91
Product 895
Product 175
Product 444
Product 746
Product 196
Product 995
Product 741
Product 922
Product 508
Product 679
Product 358
Product 196
Product 837
Product 980
Product 340
Product 27
Product 491
Product 986
Product 360
Product 208
Product 887
Product 943
Product 142
Product 568
Product 648
Product 913
Product 658
Product 8
Product 188
Product 279
Product 791
Product 214
Product 774
Product 977
Product 750
Product 82
Product 7
Product 5
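
The list above contains some product IDs more than once (444, 679, 196 and 658, for example). If repeated entries are not wanted, a small order-preserving filter can be applied before ranking. The helper below is a hypothetical sketch, not part of the notebook's pipeline.

```python
def dedupe_preserving_order(product_ids):
    """Drop repeated IDs while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for pid in product_ids:
        if pid not in seen:
            seen.add(pid)
            unique.append(pid)
    return unique

# Short sample containing repeats of the kind seen in the printed output
sample = [594, 967, 444, 679, 444, 196, 679, 196]
unique_sample = dedupe_preserving_order(sample)
```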

In [None]:
print(f"Ranked KNN Recommendations for user {example_user_id}:")
for rank, item_id in enumerate(knn_ranked_recommendations, start=1):
    print(f"Rank {rank}: Product {trainset.to_raw_iid(item_id)}")


Ranked KNN Recommendations for user 60:
Rank 1: Product 187
Rank 2: Product 709
Rank 3: Product 962
Rank 4: Product 161
Rank 5: Product 581
Rank 6: Product 269
Rank 7: Product 112
Rank 8: Product 69
Rank 9: Product 296
Rank 10: Product 444
Rank 11: Product 985
Rank 12: Product 848
Rank 13: Product 978
Rank 14: Product 229
Rank 15: Product 658
Rank 16: Product 465
Rank 17: Product 127
Rank 18: Product 444
Rank 19: Product 746
Rank 20: Product 922
Rank 21: Product 360
Rank 22: Product 943
Rank 23: Product 648
Rank 24: Product 658
Rank 25: Product 82
Rank 26: Product 7
Rank 27: Product 993
Rank 28: Product 699
Rank 29: Product 123
Rank 30: Product 187
Rank 31: Product 774
Rank 32: Product 750
Rank 33: Product 266
Rank 34: Product 122
Rank 35: Product 223
Rank 36: Product 340
Rank 37: Product 727
Rank 38: Product 8
Rank 39: Product 995
Rank 40: Product 636
Rank 41: Product 214
Rank 42: Product 977
Rank 43: Product 478
Rank 44: Product 543
Rank 45: Product 280
Rank 46: Product 18
Rank 47: P

# Simulating User Interaction Data for Recommendation Evaluation
In this section, we will simulate user interaction data to showcase the evaluation of recommendation systems using various metrics. We will generate synthetic data that simulates user interactions with products, including attributes such as user ID, product ID, clicks, ratings, and timestamps.

The purpose of generating this synthetic data is to provide a controlled context for evaluating recommendation algorithms. With simulated user interactions, we can demonstrate how the evaluation metrics are computed without needing a real dataset.

Please note that the data generated here is for illustrative purposes only and is not representative of real-world user behavior. Let's proceed with generating the synthetic user interaction data and then explore the evaluation metrics using this data.

In [None]:
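# Simulated data for demonstration.
# Note: this produces `num_users` interaction rows in total, each with a randomly
# drawn user ID, so some users may appear several times and others not at all.
def generate_fake_data(num_users, num_items):
    fake = Faker()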
    data = []
    for _ in range(num_users):
        user_id = fake.random_int(min=1, max=num_users, step=1)
        product_id = fake.random_int(min=1, max=num_items, step=1)
        clicked = fake.random_element(elements=('yes', 'no'))
        added_to_cart = fake.random_element(elements=('yes', 'no'))
        rating = fake.random_int(min=1, max=5, step=1)
        timestamp = fake.date_time_between(start_date='-30d', end_date='now')
        data.append({'user_id': user_id, 'product_id': product_id, 'clicked': clicked, 'added_to_cart': added_to_cart, 'rating': rating, 'timestamp': timestamp})
    return data

# Simulate generating data
sample_data = generate_fake_data(num_users=100, num_items=200)
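
If every user should appear with a fixed number of interactions and runs should be repeatable, a seeded generator built on the standard library is one option. The sketch below is an assumption about what such a variant could look like. The function name, the `per_user` parameter, and the fixed reference date are hypothetical.

```python
import random
from datetime import datetime, timedelta

def generate_interactions(num_users, num_items, per_user, seed=0):
    """Create `per_user` interactions with distinct products for every user."""
    rng = random.Random(seed)
    now = datetime(2023, 8, 1)  # arbitrary fixed reference time so runs are repeatable
    rows = []
    for user_id in range(1, num_users + 1):
        # sample() draws without replacement, so a user never repeats a product
        for product_id in rng.sample(range(1, num_items + 1), per_user):
            rows.append({
                "user_id": user_id,
                "product_id": product_id,
                "clicked": rng.choice(("yes", "no")),
                "added_to_cart": rng.choice(("yes", "no")),
                "rating": rng.randint(1, 5),
                # timestamp somewhere in the 30 days before the reference time
                "timestamp": now - timedelta(minutes=rng.randint(0, 30 * 24 * 60)),
            })
    return rows

rows = generate_interactions(num_users=5, num_items=20, per_user=3)
```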

**Calculating Personalized Click-Through Diversity Index (PCDI)**

In this section of the code, we implement the calculation of the Personalized Click-Through Diversity Index (PCDI), a comprehensive evaluation metric for recommendation systems that combines both click-through rate (CTR) and diversity of recommended items.

1. **Calculate CTR (Click-Through Rate):**
The function `calculate_ctr` computes the CTR, which represents the ratio of clicked recommended items to the total number of recommended items. It takes the list of clicked items and the total number of recommended items as input and returns the CTR.

2. **Calculate Diversity Score:**
The function `calculate_diversity_score` calculates the diversity score of the clicked items based on their embeddings. It takes the embeddings of the clicked items as input, averages every entry of their cosine similarity matrix (including each item's similarity with itself), and returns one minus that average. The lower the average similarity, the more diverse the clicked items are.

3. **Calculate PCDI (Personalized Click-Through Diversity Index):**
The function `calculate_pcdi` combines the CTR and diversity score to compute the final PCDI metric. It allows specifying weights for CTR and diversity, which can be adjusted based on the importance assigned to each aspect. The function takes CTR, diversity score, and optional weights as input and returns the PCDI.

These calculations are integral for evaluating recommendation models that aim to balance user engagement (CTR) with diversity of recommended items.
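
Written out, the three functions described above amount to the following. The symbols are notation introduced here for illustration: $C$ is the list of clicked items, $N$ the number of recommended items, $e_1, \dots, e_n$ the embeddings of the clicked items, and $w_{\mathrm{ctr}}$, $w_{\mathrm{div}}$ the two weights.

$$
\mathrm{CTR} = \frac{|C|}{N}, \qquad
D = 1 - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \cos(e_i, e_j), \qquad
\mathrm{PCDI} = w_{\mathrm{ctr}} \, \mathrm{CTR} + w_{\mathrm{div}} \, D
$$

If $\mathrm{CTR}$ and $D$ both lie in $[0, 1]$ and the weights sum to 1, then $\mathrm{PCDI}$ also lies in $[0, 1]$. For $D$ this holds when the embeddings have non-negative entries, since cosine similarity can otherwise be negative.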

Please ensure you have the relevant item embeddings, clicked items data, and recommended item count as needed for the calculations. The provided functions encapsulate these calculations, making it easier to evaluate and optimize your recommendation system's performance.

In [None]:
# Calculate CTR
def calculate_ctr(clicked_items, total_recommended_items):
    return len(clicked_items) / total_recommended_items

# Calculate Diversity Score
def calculate_diversity_score(clicked_items_embeddings):
    similarity_matrix = cosine_similarity(clicked_items_embeddings)
    diversity_score = 1 - np.mean(similarity_matrix)
    return diversity_score

# Calculate PCDI
def calculate_pcdi(ctr, diversity_score, weight_ctr=0.5, weight_diversity=0.5):
    return (ctr * weight_ctr) + (diversity_score * weight_diversity)
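
Because the mean in `calculate_diversity_score` is taken over the whole similarity matrix, each item's similarity with itself (always 1) pulls the score down. For two orthogonal embeddings, for instance, it gives 0.5 where one might expect 1. A variant that averages over distinct pairs only is sketched below in plain Python. The helper names are hypothetical and all embeddings are assumed to be non-zero vectors.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def pairwise_diversity(embeddings):
    """1 minus the mean cosine similarity over distinct item pairs."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0  # fewer than two items: no pair to compare
    return 1 - sum(cosine(u, v) for u, v in pairs) / len(pairs)

orthogonal = [[1.0, 0.0], [0.0, 1.0]]  # maximally different directions
identical = [[1.0, 2.0], [1.0, 2.0]]   # same direction
```

With these inputs, `pairwise_diversity(orthogonal)` evaluates to 1.0 and `pairwise_diversity(identical)` to approximately 0.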


**Calculating PCDI for Evaluating Recommendations**

In this part of the code, we proceed with calculating the Personalized Click-Through Diversity Index (PCDI) for evaluating the performance of our recommendation system.

1. **Generate Clicked Items List and Item Embeddings:**
The code first generates a list called `clicked_items` containing the product IDs of items that have been clicked by users. This is achieved by iterating through the sample data and selecting items with 'clicked' status as 'yes'. Additionally, we simulate item embeddings by creating a random matrix (`clicked_items_embeddings`) where each row represents an item and the columns represent the embedding dimensions. These embeddings would ideally represent the characteristics of the items.

2. **Define Total Recommended Items:**
The `total_recommended_items` variable is set to the total number of items that were recommended to users. This value is used to calculate the Click-Through Rate (CTR).

3. **Calculate CTR (Click-Through Rate):**
The `calculate_ctr` function is applied to calculate the Click-Through Rate (CTR), which is the ratio of clicked items to the total number of recommended items. This metric reflects the proportion of recommended items that users actually engaged with.

4. **Calculate Diversity Score:**
The `calculate_diversity_score` function is used to calculate the diversity score of the clicked items. It computes the average pairwise cosine similarity between the embeddings of the clicked items. A lower similarity score indicates higher diversity among the clicked items.

These calculations provide insights into the system's effectiveness in terms of both engagement (CTR) and diversity of recommendations (diversity score). By combining these metrics using the `calculate_pcdi` function, we can obtain a comprehensive evaluation of the recommendation system's performance that considers both aspects.

Make sure to replace the placeholders with actual item embeddings, clicked item data, and the appropriate number of recommended items for accurate evaluation.

In [None]:

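# Process data to calculate PCDI
# Build the clicked-items list first, because the embeddings below depend on its length
clicked_items = [item['product_id'] for item in sample_data if item['clicked'] == 'yes']

# Random placeholder embeddings: one 10-dimensional row per clicked item
clicked_items_embeddings = np.random.rand(len(clicked_items), 10)  # Replace with your item embeddings

total_recommended_items = 20  # Replace with the total number of recommended items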

# Calculate CTR
ctr = calculate_ctr(clicked_items, total_recommended_items)

# Calculate Diversity Score
diversity_score = calculate_diversity_score(clicked_items_embeddings)
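
The cell above divides the number of clicked rows in the simulated data by a fixed placeholder, so the resulting value can exceed 1. To keep CTR a true rate, one option is to count only the clicks that fall on recommended items. The sketch below assumes a list of recommended product IDs is available. The function name and sample IDs are hypothetical.

```python
def click_through_rate(recommended_ids, clicked_ids):
    """Fraction of distinct recommended items that were clicked."""
    recommended = set(recommended_ids)
    if not recommended:
        return 0.0  # nothing was recommended, so no rate can be formed
    return len(recommended & set(clicked_ids)) / len(recommended)

recommended = [101, 102, 103, 104]
clicked = [102, 104, 104, 999]  # 999 was clicked but never recommended
ctr_example = click_through_rate(recommended, clicked)
```

Here two of the four recommended items were clicked, so `ctr_example` is 0.5. The repeated click on 104 and the click on 999 do not inflate the rate.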

**Calculating PCDI and Printing Evaluation Results**

In this part of the code, we finalize the calculation of the Personalized Click-Through Diversity Index (PCDI) by considering the weights assigned to the Click-Through Rate (CTR) and Diversity Score. The calculated PCDI value, along with CTR and Diversity Score, is then printed for evaluation.

1. **Define Weights for CTR and Diversity:**
Two variables, `weight_ctr` and `weight_diversity`, are set to adjust the importance of CTR and Diversity Score in the PCDI calculation. These weights are adjusted based on your business goals and priorities. A higher weight signifies greater significance for that metric in the overall evaluation.

2. **Calculate PCDI with Weights:**
The `calculate_pcdi` function is invoked with the calculated CTR, diversity score, and the specified weights for CTR and Diversity Score. This function returns the PCDI value, which represents the combined evaluation of recommendation performance considering both metrics.

3. **Print Evaluation Results:**
The calculated CTR, diversity score, and PCDI are printed to provide a comprehensive understanding of how well the recommendation system is performing. These metrics offer insights into user engagement, recommendation diversity, and the system's overall effectiveness.

Remember to adjust the weights according to your specific goals and preferences to achieve a balanced evaluation of the recommendation system's performance.

In [None]:


# Calculate PCDI
weight_ctr = 0.5  # Adjust the weight for CTR based on your business goals
weight_diversity = 0.5  # Adjust the weight for Diversity Score based on your business goals
pcdi = calculate_pcdi(ctr, diversity_score, weight_ctr, weight_diversity)

print(f"CTR: {ctr:.2f}")
print(f"Diversity Score: {diversity_score:.2f}")
print(f"PCDI: {pcdi:.2f}")


CTR: 2.95
Diversity Score: 0.24
PCDI: 1.60
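
The weighted sum is easier to compare across runs when its inputs are checked and its weights are normalised. The sketch below is a hypothetical variant of `calculate_pcdi`, not the notebook's function. It rejects a CTR outside [0, 1] and divides by the sum of the weights, so weights such as 3 and 1 behave like 0.75 and 0.25.

```python
def weighted_pcdi(ctr, diversity, weight_ctr=0.5, weight_diversity=0.5):
    """Weighted average of CTR and diversity with normalised weights."""
    if not 0.0 <= ctr <= 1.0:
        raise ValueError(f"CTR should be a rate in [0, 1], got {ctr}")
    total = weight_ctr + weight_diversity
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (ctr * weight_ctr + diversity * weight_diversity) / total

balanced = weighted_pcdi(0.4, 0.6)                                     # equal weights
ctr_heavy = weighted_pcdi(0.4, 0.6, weight_ctr=3, weight_diversity=1)  # favours CTR
```

With these hypothetical inputs, shifting weight towards CTR moves the result from 0.5 towards the CTR value of 0.4.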


The Personalized Click-Through Diversity Index (PCDI) is a metric that combines Click-Through Rate (CTR) and Diversity Score to evaluate recommendation system performance. It considers user engagement (CTR) and the variety of recommended items (Diversity Score).

- **Click-Through Rate (CTR)** indicates user interaction with recommendations. Higher CTR suggests relevance to user preferences.
- **Diversity Score** measures variety in recommendations. Higher Diversity Score indicates diverse options.

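**PCDI Value Significance:**
- PCDI combines CTR and Diversity Score using weights.
- When CTR and Diversity Score each lie between 0 and 1 and the weights sum to 1, PCDI also lies between 0 and 1, and a higher value indicates a better mix of engagement and diversity.
- A PCDI above 1 signals that an input is out of range, not that performance is favorable.

The run above reports a PCDI of 1.60, driven by a CTR of 2.95. That CTR is not a valid rate: it divides the number of clicked rows in the simulated data by the placeholder `total_recommended_items = 20`. Treat these figures as a demonstration of the calculation, and recompute CTR against the items actually recommended before comparing with benchmarks and business goals.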

In conclusion, this Google Colab notebook outlines the process of evaluating a recommendation system using the Personalized Click-Through Diversity Index (PCDI) metric. The notebook covers several essential steps, including:

1. **Data Generation and Preparation:** We generated synthetic user interaction data and stored it in an SQLite database. This data included user-product interactions, such as clicks, add-to-cart actions, ratings, and timestamps.

2. **Collaborative Filtering and KNN:** We implemented collaborative filtering using the Surprise library to generate unranked recommendations. We then used K-Nearest Neighbors (KNN) to rank these recommendations for a given user.

3. **PCDI Calculation:** We computed the PCDI metric by combining Click-Through Rate (CTR) and Diversity Score. CTR reflects user engagement with recommendations, while Diversity Score measures the variety of recommended items. A higher PCDI value indicates a balance between engagement and diversity.

4. **Evaluation and Interpretation:** The PCDI value should be interpreted against business goals and benchmarks. The demonstration run produced a PCDI of 1.60, but its CTR of 2.95 exceeds 1 because of a placeholder denominator, so the figure illustrates the calculation and does not measure system quality.

5. **Simulated Data:** We also demonstrated the application of PCDI on simulated data, showcasing how to calculate CTR, Diversity Score, and PCDI.

By following the steps outlined in this notebook, you can effectively evaluate and improve your recommendation system's performance. Keep in mind that the weights assigned to CTR and Diversity Score can be adjusted based on your business objectives and user preferences. This approach provides a comprehensive way to assess recommendation quality, combining user interaction and item diversity in a single metric.