<a href="https://colab.research.google.com/github/Aayush050502/Pratilipi-Recommendation-System/blob/main/pratilipi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Analysis**

**Importing the libraries**

In [67]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np


**Load Dataset**

In [None]:
user_interaction_path = '/content/user_interaction.csv'
metadata_path = '/content/metadata.csv'

user_interaction_df = pd.read_csv(user_interaction_path)
metadata_df = pd.read_csv(metadata_path)
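
Both files carry timestamp columns (`updated_at`, `published_at`), which `pd.read_csv` loads as plain strings by default. The sketch below shows one way to parse them while reading. It uses an in-memory CSV built from two rows of the user-interaction preview, and the names `sample_csv` and `sample_df` are illustrative, not part of the notebook.

```python
import io

import pandas as pd

# Illustrative stand-in for /content/user_interaction.csv: two rows taken
# from the notebook's own preview of the user-interaction data.
sample_csv = io.StringIO(
    "user_id,pratilipi_id,read_percent,updated_at\n"
    "5506791961876448,1377786228262109,100.0,2022-03-22 10:29:57.291\n"
    "5506791971543560,1377786223038206,40.0,2022-03-19 13:49:25.660\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
sample_df = pd.read_csv(sample_csv, parse_dates=["updated_at"])
print(sample_df.dtypes)
```

With real files the same `parse_dates` argument could be passed to the two `pd.read_csv` calls above, assuming the timestamp format is consistent across rows.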

**Basic Information**

In [None]:
print("User Interaction Dataset:")
print(user_interaction_df.head())
print("\nMetadata Dataset:")
print(metadata_df.head())

User Interaction Dataset:
            user_id      pratilipi_id  read_percent               updated_at
0  5506791961876448  1377786228262109         100.0  2022-03-22 10:29:57.291
1  5506791971543560  1377786223038206          40.0  2022-03-19 13:49:25.660
2  5506791996468218  1377786227025240         100.0  2022-03-21 17:28:47.288
3  5506791978752866  1377786222398208          65.0  2022-03-21 07:39:25.183
4  5506791978962946  1377786228157051         100.0  2022-03-22 17:32:44.777

Metadata Dataset:
          author_id      pratilipi_id category_name  reading_time  \
0 -3418949279741297  1025741862639304   translation             0   
1 -2270332351871840  1377786215601277   translation           171   
2 -2270332352037261  1377786215601962   translation            92   
3 -2270332352521845  1377786215640994   translation             0   
4 -2270332349665658  1377786215931338   translation            47   

            updated_at         published_at  
0  2020-08-19 15:26:13  2016-09-

**Checking the Main Columns of the Datasets**


In [None]:
required_columns_ui = ['user_id', 'pratilipi_id', 'read_percent', 'updated_at']
required_columns_meta = ['author_id', 'pratilipi_id', 'category_name', 'reading_time', 'updated_at', 'published_at']

if not set(required_columns_ui).issubset(user_interaction_df.columns):
    raise ValueError("user_interaction.csv is missing required columns.")
if not set(required_columns_meta).issubset(metadata_df.columns):
    raise ValueError("metadata.csv is missing required columns.")
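
The check above reports that columns are missing but not which ones. A minimal sketch of a more informative variant follows; `missing_columns` and `demo_df` are hypothetical names used only for illustration.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df, sorted."""
    return sorted(set(required) - set(df.columns))


# Toy frame that lacks two of the four user-interaction columns.
demo_df = pd.DataFrame({"user_id": [1], "pratilipi_id": [10]})
missing = missing_columns(
    demo_df, ["user_id", "pratilipi_id", "read_percent", "updated_at"]
)
print(missing)
```

The returned list could be interpolated into the `ValueError` message so that a failed check names the absent columns.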


**Merging the Datasets on the Common Column pratilipi_id**

In [None]:
merged_data = user_interaction_df.merge(metadata_df, on='pratilipi_id', how='left')
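
A left merge keeps every interaction row, but it can also add rows when a story appears more than once in the metadata, and it leaves the metadata columns empty when a story has no match. The sketch below shows one way to measure both effects on toy data, using the `indicator` option of `DataFrame.merge`. The frames `interactions` and `metadata` are made up for the example.

```python
import pandas as pd

# Toy data: story 10 has two metadata rows, story 12 has none.
interactions = pd.DataFrame(
    {"user_id": [1, 2, 3], "pratilipi_id": [10, 11, 12], "read_percent": [100.0, 40.0, 65.0]}
)
metadata = pd.DataFrame(
    {"pratilipi_id": [10, 10, 11], "category_name": ["romance", "drama", "horror"]}
)

# indicator=True adds a "_merge" column telling where each row came from.
merged = interactions.merge(metadata, on="pratilipi_id", how="left", indicator=True)

# Interactions whose story has no metadata row.
unmatched = int((merged["_merge"] == "left_only").sum())
# Rows added because a story matched more than one metadata row.
extra_rows = len(merged) - len(interactions)
print(unmatched, extra_rows)
```

If `extra_rows` is non-zero on the real data, the same user and story pair appears several times in `merged_data`, which is one possible source of repeated story IDs in a recommendation list.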

**Encoding user_id and pratilipi_id as integers for further modeling**

In [None]:
merged_data['user_id'] = merged_data['user_id'].astype('category').cat.codes
merged_data['pratilipi_id'] = merged_data['pratilipi_id'].astype('category').cat.codes
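
The encoding above overwrites the original IDs with integer codes, so the codes can no longer be matched against the raw `pratilipi_id` values in `metadata_df`. A minimal sketch of how the mapping could be kept and reversed is shown below. `raw_ids`, `as_category`, and `decoded` are illustrative names.

```python
import pandas as pd

# Three raw IDs, one repeated, taken from the user-interaction preview.
raw_ids = pd.Series([1377786228262109, 1377786223038206, 1377786228262109])

as_category = raw_ids.astype("category")
codes = as_category.cat.codes            # integer code per row
categories = as_category.cat.categories  # sorted unique raw IDs; position == code

# Decode: look up each code's position in the categories index.
decoded = categories.take(codes.to_numpy())
print(codes.tolist())
print(decoded.tolist())
```

Saving `categories` before overwriting the column would let later steps translate encoded story IDs back to the raw IDs used in the metadata.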

**Using read_percent as the target variable**

In [None]:
X = merged_data[['user_id', 'pratilipi_id']]
y = merged_data['read_percent']

# **Train-test Split And Model Selection**

**Train-test split**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

**Training a Linear Regression model**

In [37]:
model = LinearRegression()
model.fit(X_train, y_train)

**Evaluating the Model**

In [38]:
predictions = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("\nRMSE of the model:", rmse)


RMSE of the model: 21.202155756403105
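
An RMSE value is easier to interpret next to a trivial baseline, such as always predicting the mean `read_percent` of the training set. The sketch below computes that baseline on made-up numbers. The arrays and the helper `rmse` are illustrative and are not the notebook's data or results.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy targets standing in for y_train and y_test.
y_train_demo = np.array([100.0, 40.0, 100.0, 65.0])
y_test_demo = np.array([100.0, 50.0])

# Baseline: predict the training mean for every test row.
baseline_pred = np.full(len(y_test_demo), y_train_demo.mean())
baseline_rmse = rmse(y_test_demo, baseline_pred)
print("Baseline RMSE:", baseline_rmse)
```

Running the same comparison with the real `y_train` and `y_test` would show how much the fitted model improves on a constant prediction, if at all.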


**Defining a function to get the top recommendations with category names**

In [43]:
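def get_top_n_with_categories(model, X_test, user_ids, metadata, n=5):
    # X_test stores integer-encoded pratilipi_id values, so `metadata` must be
    # keyed by the same codes; the raw IDs in metadata.csv will not match them.
    # Keep one category per story so the lookup returns one value per ID.
    categories = metadata.drop_duplicates('pratilipi_id').set_index('pratilipi_id')['category_name']
    user_recommendations = {}
    for user_id in user_ids:
        # Test-set rows (user, story pairs) that belong to this user.
        user_data = X_test[X_test['user_id'] == user_id]
        story_scores = model.predict(user_data)
        # Rank by predicted read_percent, highest first, and keep the top n.
        top_stories = user_data.iloc[np.argsort(-story_scores)]['pratilipi_id'][:n]
        # reindex yields NaN for unknown story IDs instead of raising KeyError.
        top_categories = categories.reindex(top_stories.to_numpy()).tolist()
        user_recommendations[user_id] = list(zip(top_stories.tolist(), top_categories))
    return user_recommendations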
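
The ranking step in the function relies on `np.argsort` applied to the negated scores. The sketch below isolates that step on toy scores. `top_n_indices` and `demo_scores` are hypothetical names, and the stable sort is an added choice so that tied scores keep their original order.

```python
import numpy as np


def top_n_indices(scores, n=5):
    """Return positions of the n highest scores, best first."""
    scores = np.asarray(scores, dtype=float)
    # Negating the scores turns argsort's ascending order into descending.
    order = np.argsort(-scores, kind="stable")
    return order[:n].tolist()


# Toy predicted read_percent values for five candidate stories.
demo_scores = [62.0, 91.5, 47.3, 91.5, 80.0]
top3 = top_n_indices(demo_scores, n=3)
print(top3)
```

The returned positions index into the rows that were scored, which is how the function maps scores back to `pratilipi_id` values through `iloc`.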


**Collecting the unique users in the test set**

In [44]:
unique_users = X_test['user_id'].unique()

# **Building Recommendations**

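**Displaying the top 5 recommendations**

In [62]:
# Assumes top_n maps each user_id to its ranked list of recommended stories,
# for example the dictionary returned by get_top_n_with_categories.
print("\nTop 5 Recommendations for Users with Categories:")
for uid, stories in list(top_n.items())[:5]:
    print(f"User {uid}: {stories}")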


Top 5 Recommendations for Users with Categories:
User 148640: [235324, 234110, 229172, 229172, 38226]
User 20029: [12107, 9556, 9556]
User 200450: [230529, 213709, 213709, 213709, 203277]
User 195863: [232713, 231487, 231456, 231360, 230110]
User 76348: [235590, 229526, 223353, 207799, 205636]


**Recommendations for a first-time user**

In [65]:
def recommend_for_new_user(metadata, n=5):
    # Cold-start fallback: a new user has no reading history, so ask for a category.
    print("\nFirst-time User Recommendation")
    preferred_category = input("Enter your preferred category (e.g., Romance, Comedy, etc.): ").strip()
    # Case-insensitive match against the category names in the metadata.
    filtered_metadata = metadata[metadata['category_name'].str.lower() == preferred_category.lower()]
    if filtered_metadata.empty:
        print("No stories available for the chosen category.")
        return
    # head(n) takes the first n matching rows in file order; no ranking is applied.
    top_stories = filtered_metadata.head(n)
    print("\nRecommended Stories:")
    for _, row in top_stories.iterrows():
        print(f"Story ID: {row['pratilipi_id']}, Category: {row['category_name']}")


**Recommendations for the user according to their preferences**

In [68]:
recommend_for_new_user(metadata_df)


First-time User Recommendation
Enter your preferred category (e.g., Romance, Comedy, etc.): Horror

Recommended Stories:
Story ID: -869580012874040, Category: horror
Story ID: -852906293946680, Category: horror
Story ID: -844691134572856, Category: horror
Story ID: -835729899219256, Category: horror
Story ID: -833681799604536, Category: horror
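
Because the function above reads the category with `input()` and prints its result, it is hard to reuse or test. A minimal sketch of a non-interactive variant follows, which takes the category as an argument and returns the matching rows. `recommend_by_category` and `demo_metadata` are hypothetical names, and the toy metadata is made up.

```python
import pandas as pd


def recommend_by_category(metadata, category, n=5):
    """Return up to n metadata rows whose category matches, ignoring case."""
    wanted = category.strip().lower()
    # Missing category names become empty strings so they never match.
    names = metadata["category_name"].fillna("").str.strip().str.lower()
    return metadata.loc[names == wanted, ["pratilipi_id", "category_name"]].head(n)


# Toy metadata with mixed-case category names.
demo_metadata = pd.DataFrame(
    {
        "pratilipi_id": [101, 102, 103, 104],
        "category_name": ["horror", "romance", "Horror", "comedy"],
    }
)
horror_picks = recommend_by_category(demo_metadata, " Horror ", n=5)
print(horror_picks)
```

As in the original function, rows are returned in file order. Ranking them, for instance by average `read_percent`, would be a separate design choice.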
