# **ASSIGNMENT: BUILDING A RECOMMENDATION SYSTEM**

## Building a predictive model to forecast which pratilipis (stories) a user is likely to read in the future, based on historical reading behavior data.



### Objective

 1. Predict Future Reading Behavior: Build a model to predict at least 5 pratilipis that each user will read in the future.

 2. Train/Test Split: Use the first 75% of the dataset (ordered by time) for training and evaluate the model on the remaining 25% of the data.

In [51]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from scipy.sparse import coo_matrix, csr_matrix
from sklearn.neighbors import NearestNeighbors

In [52]:
user_interactions = pd.read_csv("/content/user_interaction.csv")
meta_data = pd.read_csv("/content/metadata.csv")

## **Previewing Data**

In [53]:
user_interactions.head(10)

Unnamed: 0,user_id,pratilipi_id,read_percent,updated_at
0,5506791961876448,1377786228262109,100.0,2022-03-22 10:29:57.291
1,5506791971543560,1377786223038206,40.0,2022-03-19 13:49:25.660
2,5506791996468218,1377786227025240,100.0,2022-03-21 17:28:47.288
3,5506791978752866,1377786222398208,65.0,2022-03-21 07:39:25.183
4,5506791978962946,1377786228157051,100.0,2022-03-22 17:32:44.777
5,5506791950813636,1377786228123227,100.0,2022-03-21 01:57:23.967
6,5506791963323596,1377786228041122,7.0,2022-03-18 16:48:11.675
7,5506791954583270,1377786219622753,100.0,2022-03-18 18:05:29.744
8,5506791970811653,1377786219946385,100.0,2022-03-22 14:56:10.889
9,5506791996662298,286572936861384,100.0,2022-03-22 16:49:12.824


In [54]:
meta_data.head(10)

Unnamed: 0,author_id,pratilipi_id,category_name,reading_time,updated_at,published_at
0,-3418949279741297,1025741862639304,translation,0,2020-08-19 15:26:13,2016-09-30 10:37:04
1,-2270332351871840,1377786215601277,translation,171,2021-01-21 16:27:07,2018-06-11 13:17:48
2,-2270332352037261,1377786215601962,translation,92,2020-09-29 12:33:57,2018-06-12 04:19:12
3,-2270332352521845,1377786215640994,translation,0,2019-10-17 09:03:37,2019-09-26 14:58:53
4,-2270332349665658,1377786215931338,translation,47,2020-05-05 11:33:41,2018-11-25 12:28:23
5,-2270332348759753,1377786216399294,translation,157,2019-12-01 16:17:43,2019-03-18 07:54:59
6,-3729070011118961,1377786216409045,translation,139,2020-03-05 09:56:35,2019-03-18 11:42:27
7,-2270332347597550,1377786216454709,translation,130,2020-08-19 16:19:40,2019-04-11 14:03:19
8,-2270332348469806,1377786216456585,translation,5,2019-10-16 08:55:39,2019-08-27 12:50:20
9,-2270332347658712,1377786216463086,translation,136,2019-10-23 11:29:27,2019-10-22 12:44:05


In [55]:
print("User Interaction Columns:", user_interactions.columns.tolist())
print("Meta Data Columns:", meta_data.columns.tolist())

User Interaction Columns: ['user_id', 'pratilipi_id', 'read_percent', 'updated_at']
Meta Data Columns: ['author_id', 'pratilipi_id', 'category_name', 'reading_time', 'updated_at', 'published_at']


## **Converting Timestamps to Datetime Format**

In [56]:
# Convert timestamps to datetime; unparseable values become NaT instead of raising
user_interactions['updated_at'] = pd.to_datetime(user_interactions['updated_at'], errors='coerce')
meta_data['updated_at'] = pd.to_datetime(meta_data['updated_at'], errors='coerce')
meta_data['published_at'] = pd.to_datetime(meta_data['published_at'], errors='coerce')
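
With `errors='coerce'`, a malformed timestamp does not raise. It silently becomes `NaT`, and a row with a `NaT` timestamp would satisfy neither side of a later time-based comparison. The summary further down reports a full count for `updated_at`, so this does not appear to affect this dataset, but a quick check is cheap. A minimal sketch on made-up values (the `raw` series is purely illustrative):

```python
import pandas as pd

# Illustrative values only: one valid timestamp, one garbage string, one missing
raw = pd.Series(["2022-03-22 10:29:57.291", "not a timestamp", None])
parsed = pd.to_datetime(raw, errors="coerce")

# Unparseable or missing values become NaT rather than raising
n_bad = int(parsed.isna().sum())
print(f"{n_bad} of {len(parsed)} timestamps could not be parsed")
```

The same `isna().sum()` check could be run on `user_interactions['updated_at']` right after the conversion.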

## **Sorting Data by Time**

In [57]:
# Sort interactions by time
user_interactions.sort_values('updated_at', inplace=True)

# Display a summary of user interactions
print("User Interactions Summary:")
display(user_interactions.describe())


User Interactions Summary:


Unnamed: 0,user_id,pratilipi_id,read_percent,updated_at
count,2500000.0,2500000.0,2500000.0,2500000
mean,5489174000000000.0,1369444000000000.0,93.24295,2022-03-20 22:13:28.009031168
min,3257553000000000.0,-5375940000000000.0,0.0,2022-03-18 15:14:41.827000
25%,5506792000000000.0,1377786000000000.0,100.0,2022-03-19 18:09:25.668249856
50%,5506792000000000.0,1377786000000000.0,100.0,2022-03-20 23:18:17.970999808
75%,5506792000000000.0,1377786000000000.0,100.0,2022-03-22 02:29:16.531249920
max,5506792000000000.0,1377786000000000.0,2400.0,2022-03-23 00:08:25.306000
std,160670500000000.0,122175600000000.0,21.70149,


## **Splitting Data into Training and Testing Sets**

In [58]:
# Determine split time based on the 75th percentile
split_time = user_interactions['updated_at'].quantile(0.75)
train_data = user_interactions[user_interactions['updated_at'] <= split_time].copy()
test_data = user_interactions[user_interactions['updated_at'] > split_time].copy()


In [59]:
print("Training Data Shape:", train_data.shape)
print("Testing Data Shape:", test_data.shape)

Training Data Shape: (1875000, 4)
Testing Data Shape: (625000, 4)
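
The split above is temporal: everything up to the 75th-percentile timestamp is training data, everything after it is test data. One consequence is that some users may appear only in the test period, and an item-based recommender trained on the earlier period has no history for them. The sketch below reproduces the split on a toy frame (`toy` and its values are invented for illustration) and counts such cold-start users:

```python
import pandas as pd

# Toy interaction log: 8 events on consecutive days (illustrative data)
toy = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2, 3, 1, 4],
    "updated_at": pd.date_range("2022-03-18", periods=8, freq="D"),
}).sort_values("updated_at")

# Same technique as in the notebook: split at the 75th-percentile timestamp
split_time = toy["updated_at"].quantile(0.75)
train = toy[toy["updated_at"] <= split_time]
test = toy[toy["updated_at"] > split_time]

# Users seen only after the split have no training history
cold_start_users = set(test["user_id"]) - set(train["user_id"])
print(len(train), len(test), cold_start_users)
```

Applying the same set difference to `train_data` and `test_data` would show how many test users cannot receive recommendations from this model.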


## **Creating Mappings for Users and Pratilipis**

In [60]:
# Create mappings for user and pratilipi ids
user_ids = train_data['user_id'].unique()
pratilipi_ids = train_data['pratilipi_id'].unique()

user_id_to_idx = {user_id: idx for idx, user_id in enumerate(user_ids)}
pratilipi_id_to_idx = {pid: idx for idx, pid in enumerate(pratilipi_ids)}

In [61]:
# Map ids in training data to indices
train_data['user_idx'] = train_data['user_id'].map(user_id_to_idx)
train_data['pratilipi_idx'] = train_data['pratilipi_id'].map(pratilipi_id_to_idx)

## **Building a Sparse User-Item Matrix**

In [62]:
# Build a sparse user-item matrix (users x pratilipis)
train_matrix = coo_matrix(
    (train_data['read_percent'], (train_data['user_idx'], train_data['pratilipi_idx'])),
    shape=(len(user_ids), len(pratilipi_ids))
).tocsr()

print("User-Item Matrix Shape:", train_matrix.shape)

User-Item Matrix Shape: (213331, 219088)
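
One detail worth knowing about this construction: when a COO matrix is converted to CSR, entries that share the same (row, column) position are summed. If a user-pratilipi pair occurs more than once in the training data, its `read_percent` values add up. The sketch below shows the effect on toy data and one possible remedy. Keeping the maximum per pair and capping at 100 are assumptions, not something the assignment specifies; the cap is motivated only by the summary above reporting a maximum `read_percent` of 2400.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Toy data: the pair (user 0, item 0) appears twice
toy = pd.DataFrame({
    "user_idx": [0, 0, 1],
    "pratilipi_idx": [0, 0, 1],
    "read_percent": [100.0, 100.0, 40.0],
})

# Duplicate coordinates are summed during the COO -> CSR conversion
summed = coo_matrix(
    (toy["read_percent"], (toy["user_idx"], toy["pratilipi_idx"])), shape=(2, 2)
).tocsr()

# Assumed remedy: keep the highest value per pair, then cap at 100
dedup = toy.groupby(["user_idx", "pratilipi_idx"], as_index=False)["read_percent"].max()
dedup["read_percent"] = dedup["read_percent"].clip(upper=100)
clean = coo_matrix(
    (dedup["read_percent"], (dedup["user_idx"], dedup["pratilipi_idx"])), shape=(2, 2)
).tocsr()

print(summed.toarray())
print(clean.toarray())
```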


## **Fitting the Item-Item Nearest-Neighbors Model**

In [63]:
# Create the item matrix (each row corresponds to a pratilipi, each column to a user)
item_matrix = train_matrix.T.tocsr()

In [64]:
# Initialize and fit the NearestNeighbors model using cosine distance
nn_model = NearestNeighbors(n_neighbors=6, metric='cosine', algorithm='brute')
nn_model.fit(item_matrix)

print("NearestNeighbors model trained on item matrix.")

NearestNeighbors model trained on item matrix.
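
When an item that was part of the fitted data is used as the query, the closest match is normally the item itself at distance (approximately) zero. That is presumably why the model asks for 6 neighbors when 5 recommendations are wanted. A small sketch on an invented 4-item, 3-user matrix (`toy_items` is illustrative, not taken from the dataset):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Rows are items, columns are users (illustrative read percentages)
toy_items = csr_matrix(np.array([
    [100.0, 100.0, 0.0],   # item 0
    [100.0, 50.0, 0.0],    # item 1: read by the same users as item 0
    [0.0, 0.0, 100.0],     # item 2: no readers in common with item 0
    [0.0, 40.0, 100.0],    # item 3: one reader in common with item 0
]))

toy_model = NearestNeighbors(n_neighbors=3, metric="cosine", algorithm="brute")
toy_model.fit(toy_items)

# Query with item 0: the first hit is item 0 itself
distances, indices = toy_model.kneighbors(toy_items[0], return_distance=True)
similarities = 1 - distances[0]
print(indices[0], similarities)
```

Cosine similarity depends only on the direction of the rating vectors, so items 0 and 1 come out as close neighbors even though their read percentages differ.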


## **Generating Item-Based Recommendations**

In [65]:
# Create inverse mapping for pratilipi ids
idx_to_pratilipi = {idx: pid for pid, idx in pratilipi_id_to_idx.items()}

def get_item_based_recommendations(user_id, user_id_to_idx, train_data, train_matrix, nn_model, idx_to_pratilipi, N=5):
    """
    Generate top N item-based recommendations for a given user.

    Each item the user read in training contributes its nearest neighbors as
    candidates; a candidate's score is the sum of its cosine similarities.
    """
    # Users absent from the training data have no history to draw on
    if user_id not in user_id_to_idx:
        return []

    # Get items the user has interacted with in training
    user_items = train_data.loc[train_data['user_id'] == user_id, 'pratilipi_idx'].unique()
    read_items = set(user_items)

    # Build the item vectors (items x users) from train_matrix so the function
    # does not depend on the global item_matrix
    item_vectors = train_matrix[:, user_items].T.tocsr()

    # Query all of the user's items at once;
    # distances and indices are 2D arrays with shape (len(user_items), n_neighbors)
    distances, indices = nn_model.kneighbors(item_vectors, return_distance=True)

    candidate_scores = {}
    for item, item_dists, item_neighbors in zip(user_items, distances, indices):
        for dist, neighbor in zip(item_dists, item_neighbors):
            # Skip if neighbor is the item itself
            if neighbor == item:
                continue
            # Convert cosine distance to similarity
            similarity = 1 - dist
            candidate_scores[neighbor] = candidate_scores.get(neighbor, 0) + similarity

    # Exclude items already read by the user
    candidate_scores = {k: v for k, v in candidate_scores.items() if k not in read_items}
    # Sort candidate items by aggregated similarity score (highest first)
    recommended_items = sorted(candidate_scores, key=candidate_scores.get, reverse=True)[:N]
    # Map indices back to pratilipi_ids
    recommended_pratilipi_ids = [idx_to_pratilipi.get(idx) for idx in recommended_items]

    return recommended_pratilipi_ids


## **Testing the Recommendation System**

In [66]:
# Generate recommendations for a sample user
sample_user = user_ids[0]
recommendations = get_item_based_recommendations(
    sample_user, user_id_to_idx, train_data, train_matrix, nn_model, idx_to_pratilipi, N=5
)

# Convert recommendations to a DataFrame
recommendations_df = pd.DataFrame(recommendations, columns=["Pratilipi ID"])

# Display in tabular format
print(f"Item-based Recommendations for User {sample_user}:")
display(recommendations_df)

Item-based Recommendations for User 5506791954036110:


Unnamed: 0,Pratilipi ID
0,1377786225929943
1,1377786226154996
2,1377786226272582
3,1377786226019035
4,1377786225631330


## **Taking User Input**

In [67]:
# Take user input
user_input = input("Enter User ID: ")

# Convert input to integer
try:
    user_input = int(user_input)
except ValueError:
    print("Invalid input! Please enter a numeric User ID.")
    user_input = None

# Generate recommendations only if the input is valid and exists in the training data
if user_input is not None and user_input in user_id_to_idx:
    recommendations = get_item_based_recommendations(
        user_input, user_id_to_idx, train_data, train_matrix, nn_model, idx_to_pratilipi, N=5
    )

    # Convert recommendations to a DataFrame
    recommendations_df = pd.DataFrame(recommendations, columns=["Pratilipi ID"])

    # Display in tabular format
    print(f"Item-based Recommendations for User {user_input}:")
    display(recommendations_df)
elif user_input is not None:
    # Numeric input, but the user has no interactions in the training period
    print("User ID not found in the training data. Please enter a valid User ID.")


Enter User ID: 5506791961876448
Item-based Recommendations for User 5506791961876448:


Unnamed: 0,Pratilipi ID
0,1377786223568546
1,1377786226522036
2,1377786223648009
3,1377786225568948
4,1377786226213569
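
The objective asks for the model to be evaluated on the held-out 25%. One common way to do that is precision@k and recall@k: of the k recommended pratilipis, how many did the user actually read in the test period, and what fraction of their test-period reads was recovered. The sketch below is one possible approach, not a prescribed metric. The function name and the two dictionaries `recs_by_user` and `actual_by_user` are hypothetical, and the IDs in them are made up.

```python
def precision_recall_at_k(recommended, actual, k=5):
    """Return (precision@k, recall@k) for one user.

    recommended: ranked list of item ids; actual: set of items read in the test period.
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(actual))
    precision = hits / k
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical inputs: recommendations and test-period reads per user
recs_by_user = {1: [10, 11, 12, 13, 14], 2: [20, 21, 22, 23, 24]}
actual_by_user = {1: {10, 12, 99}, 2: {30}}

# Average the per-user metrics over users that appear in both dictionaries
scores = [
    precision_recall_at_k(recs_by_user[u], actual_by_user[u], k=5)
    for u in recs_by_user
    if u in actual_by_user
]
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
print(f"precision@5={mean_precision:.3f}, recall@5={mean_recall:.3f}")
```

In this notebook, `actual_by_user` could plausibly be built with `test_data.groupby('user_id')['pratilipi_id'].agg(set).to_dict()`, and `recs_by_user` by calling `get_item_based_recommendations` for each test user who also appears in `user_id_to_idx`. Because each call runs a nearest-neighbor query, evaluating on a sample of users may be more practical than on all of them.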
