<h1 align='center'>Collaborative Filtering from Scratch</h1>
<h6 align='center'>User Based | Memory Based</h6>

#### Dataset: [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset)

In [1]:
import os
import pandas as pd
import numpy as np

### Data Load and Pre-processing 

*Reading Data File*

In [2]:
ratings_df = pd.read_csv("./TheMoviesDataset/ratings_small.csv")
ratings_df.head()

,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


*Drop non-required columns and duplicates*

In [3]:
ratings_df.drop(columns=['timestamp'], inplace=True)
ratings_df.drop_duplicates(inplace=True)

In [4]:
ratings_df.head()

,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


*Brief data description*

In [5]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100004 entries, 0 to 100003
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100004 non-null  int64  
 1   movieId  100004 non-null  int64  
 2   rating   100004 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 3.1 MB


### Collaborative Filtering

#### User Based | Memory Based

*Creating Pivot Table*

In [6]:
data_df = ratings_df.pivot(columns='movieId', index='userId', values='rating')
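To make the reshaping step concrete, here is a minimal sketch on a made-up toy table (the `toy_ratings` values are hypothetical, not taken from the dataset). It shows how `pivot` turns long-format ratings into a user × movie matrix, leaving `NaN` wherever a user has not rated a movie.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings_df
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 3.0, 5.0, 2.5],
})

# One row per user, one column per movie; unrated pairs become NaN
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")
print(toy_matrix)
```

Note that `pivot` raises a `ValueError` if the same (userId, movieId) pair appears more than once, so it relies on each user rating a movie at most once.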

*Fill the missing ratings with 0 so that similarities can be computed on a dense matrix*

In [7]:
pivot_df = data_df.fillna(0)
pivot_df

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
670,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


*Convert DataFrame to numpy array*

In [8]:
data_mat = pivot_df.values
print("Shape of data matrix: ", data_mat.shape)

Shape of data matrix:  (671, 9066)


*All User IDs*

In [10]:
users_ids = np.unique(pivot_df.index)

#### Calculate similarity between users

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

In [12]:
cos_sim = cosine_similarity(data_mat, data_mat)
print("Shape of similarity matrix: ", cos_sim.shape)

Shape of similarity matrix:  (671, 671)
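For readers who want to see what the similarity computation does, the following is a minimal NumPy sketch of pairwise cosine similarity between rows, i.e. the dot product of two rating vectors divided by the product of their norms. The helper name `cosine_similarity_matrix` and the toy matrix are illustrative assumptions, not part of the notebook; treating all-zero rows as having similarity 0 is a choice made for this sketch.

```python
import numpy as np

def cosine_similarity_matrix(mat):
    """Pairwise cosine similarity between the rows of `mat`."""
    mat = np.asarray(mat, dtype=float)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    # Avoid division by zero for rows with no ratings (all zeros)
    norms[norms == 0] = 1.0
    unit = mat / norms            # scale every row to unit length
    return unit @ unit.T          # dot products of unit vectors

# Hypothetical toy matrix: users 0 and 1 rate identically, user 2 shares no items
toy = np.array([
    [4.0, 0.0, 3.0],
    [4.0, 0.0, 3.0],
    [0.0, 5.0, 0.0],
])
toy_sim = cosine_similarity_matrix(toy)
print(np.round(toy_sim, 3))
```

Users with identical rating vectors get a similarity of 1, and users with no rated items in common get 0, which is why many entries of the full similarity matrix are exactly zero.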


*Transform similarity matrix to DataFrame*

In [13]:
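# Row and column labels are 0-based positions, not userIds;
# in this dataset position i corresponds to userId i + 1
cos_sim_df = pd.DataFrame(cos_sim)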
cos_sim_df

,0,1,2,3,4,5,6,7,8,9,...,661,662,663,664,665,666,667,668,669,670
0,1.000000,0.000000,0.000000,0.074482,0.016818,0.000000,0.083884,0.000000,0.012843,0.000000,...,0.000000,0.000000,0.014474,0.043719,0.000000,0.000000,0.000000,0.062917,0.000000,0.017466
1,0.000000,1.000000,0.124295,0.118821,0.103646,0.000000,0.212985,0.113190,0.113333,0.043213,...,0.477306,0.063202,0.077745,0.164162,0.466281,0.425462,0.084646,0.024140,0.170595,0.113175
2,0.000000,0.124295,1.000000,0.081640,0.151531,0.060691,0.154714,0.249781,0.134475,0.114672,...,0.161205,0.064198,0.176134,0.158357,0.177098,0.124562,0.124911,0.080984,0.136606,0.170193
3,0.074482,0.118821,0.081640,1.000000,0.130649,0.079648,0.319745,0.191013,0.030417,0.137186,...,0.114319,0.047228,0.136579,0.254030,0.121905,0.088735,0.068483,0.104309,0.054512,0.211609
4,0.016818,0.103646,0.151531,0.130649,1.000000,0.063796,0.095888,0.165712,0.086616,0.032370,...,0.191029,0.021142,0.146173,0.224245,0.139721,0.058252,0.042926,0.038358,0.062642,0.225086
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
666,0.000000,0.425462,0.124562,0.088735,0.058252,0.000000,0.232051,0.069005,0.066412,0.032653,...,0.342283,0.050754,0.074080,0.124214,0.351207,1.000000,0.091597,0.018416,0.153111,0.127995
667,0.000000,0.084646,0.124911,0.068483,0.042926,0.019563,0.058773,0.112366,0.194493,0.098561,...,0.074089,0.059010,0.093021,0.082525,0.114487,0.091597,1.000000,0.000000,0.178017,0.135387
668,0.062917,0.024140,0.080984,0.104309,0.038358,0.024583,0.073151,0.055143,0.029291,0.060549,...,0.015960,0.025953,0.077927,0.101707,0.028773,0.018416,0.000000,1.000000,0.042609,0.085202
669,0.000000,0.170595,0.136606,0.054512,0.062642,0.019465,0.096240,0.247687,0.384429,0.158650,...,0.183662,0.122126,0.123407,0.143380,0.159479,0.153111,0.178017,0.042609,1.000000,0.228677


#### How to predict the rating of an item (I) for a user (U)?

For user U and item I, find the n users most similar to U and average their ratings for item I.

*Note*: Among those n users, consider only the ones who actually rated item I.
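The rule above can be sketched as a small function. This is an illustrative sketch under stated assumptions, not the notebook's own code: the name `predict_rating`, its parameters, and the toy data are hypothetical, `ratings` is assumed to be a users × items DataFrame with `NaN` for missing ratings, and `sim` a similarity array whose rows are in the same order as `ratings`. Here `n` counts neighbours excluding the user itself.

```python
import numpy as np
import pandas as pd

def predict_rating(ratings, sim, user_pos, item, n=30):
    """Mean rating of `item` among the n users most similar to row `user_pos`."""
    order = np.argsort(sim[user_pos])[::-1]       # positions, most similar first
    neighbours = order[order != user_pos][:n]     # drop the user itself, keep n
    # Keep only neighbours who actually rated the item
    neighbour_ratings = ratings[item].iloc[neighbours].dropna()
    if neighbour_ratings.empty:
        return np.nan                             # no neighbour rated the item
    return neighbour_ratings.mean()

# Hypothetical toy data: 4 users, 2 items
toy_user_item = pd.DataFrame(
    {10: [4.0, 5.0, np.nan, 1.0], 20: [np.nan, 3.0, 4.0, np.nan]},
    index=[1, 2, 3, 4],
)
toy_user_sim = np.array([
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.5, 0.2],
    [0.8, 0.5, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
])
print(predict_rating(toy_user_item, toy_user_sim, user_pos=0, item=20, n=2))
```

Returning `NaN` when none of the neighbours rated the item is one possible design choice; falling back to the item's overall mean would be another.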

Extract the users most similar to user 1 (userId=1, row 0 of the similarity matrix), sorted from most to least similar

In [15]:
# Column 0 holds the similarities to userId 1; the first entry of the result is the user itself
top30_sim_user = cos_sim_df[0].sort_values(ascending=False)[:30].index
top30_sim_user


Int64Index([  0, 324, 633, 340, 309, 206,  34, 194, 484, 129, 228, 101, 402,
            118, 386, 538, 574, 390, 467, 496, 509, 231, 213, 275,  72,  18,
             33, 584,   6, 517],
           dtype='int64')

#### Extract every user's rating for the item (movieId=31)

In [17]:
rating_df_item31 = pd.DataFrame(data_df[31])
rating_df_item31.head()

userId,31
1,2.5
2,
3,
4,
5,


Extract the ratings given to the item by the most similar users (the first entry, the user itself, is skipped)

In [19]:
# top30_sim_user holds row positions, so select with iloc; [1:] skips the user itself
rating_given_by_sim_user = rating_df_item31.iloc[top30_sim_user[1:]]
rating_given_by_sim_user.head()

userId,31
325,4.5
634,
341,4.5
310,1.5
207,


Consider only those similar users who rated the item

In [20]:
# NaN never compares equal to anything, so `!= np.nan` is always True; use notna() instead
fltered_user_rating = rating_given_by_sim_user[rating_given_by_sim_user[31].notna()]

Prediction: the average of the ratings given by those users

In [22]:
predicted_rating_item31 = fltered_user_rating.mean()

In [30]:
# Look up by label (userId=1, movieId=31) rather than by column position
print("Original rating of item 31: ", data_df.loc[1, 31])
print("Predicted rating of item 31: ", predicted_rating_item31)

Original rating of item 31:  2.5
Predicted rating of item 31:  31    3.214286
dtype: float64
