
# Recommendation Engine: IMDb Top 250 Movies
# Objective
To build a recommendation system that suggests movies to users based on their ratings and the ratings of similar users. The recommendation system uses cosine similarity to find similar users and their preferences.

Data Preparation:

Import and Clean Data: Read the movie ratings dataset, clean and preprocess it to ensure it is ready for analysis.
Drop Irrelevant Columns: Keep only the necessary columns (title, imbd_rating, and user_id).
Expand User IDs: Split the user_id column, which contains a list of user IDs for each movie, into separate rows. This results in a long-format DataFrame where each row corresponds to a single user-movie-rating triplet.

Create User-Item Matrix:

Pivot Table: Transform the expanded DataFrame into a user-item matrix where rows represent users, columns represent movies, and values are the IMDb ratings.

Cosine Similarity Calculation:

Fill Missing Values: Replace missing entries in the user-item matrix with 0 so that every user has a complete rating vector.
Compute Cosine Similarity: Calculate the cosine similarity between users using the zero-filled matrix. The result is a similarity matrix indicating how similar each user is to every other user.

Recommendation Function:

Define Function: Create a function to generate movie recommendations for a given user. The function uses the similarity matrix to identify the most similar users and then recommends movies based on their ratings.
Filter Rated Movies: Ensure that the recommended movies are not already rated by the user to provide new suggestions.
Sort Recommendations: Return the top recommended movies sorted by predicted rating.
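
The recommendation step outlined above can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the notebook's own implementation: the function name `recommend_movies`, its parameters `top_k` and `n_recs`, and the toy data are all hypothetical. It scores each unseen movie by a similarity-weighted average of the neighbours' ratings.

```python
import numpy as np
import pandas as pd


def recommend_movies(user_id, user_item, user_sim, top_k=2, n_recs=3):
    """Hypothetical helper: rank movies that `user_id` has not rated yet.

    user_item: DataFrame (users x movies) where 0 means "not rated".
    user_sim:  DataFrame (users x users) labelled with user_item's index.
    """
    # Most similar users, excluding the target user and zero similarities
    sims = user_sim.loc[user_id].drop(user_id)
    neighbours = sims[sims > 0].nlargest(top_k)
    if neighbours.empty:
        return pd.Series(dtype=float)

    # Similarity-weighted average per movie, counting only the
    # neighbours who actually rated that movie
    ratings = user_item.loc[neighbours.index]
    rated_mask = (ratings > 0).astype(float)
    weighted_sum = ratings.mul(neighbours, axis=0).sum(axis=0)
    weight_total = rated_mask.mul(neighbours, axis=0).sum(axis=0)
    predicted = (weighted_sum / weight_total).dropna()

    # Filter out movies the target user has already rated
    unseen = user_item.columns[user_item.loc[user_id] == 0]
    predicted = predicted[predicted.index.isin(unseen)]
    return predicted.sort_values(ascending=False).head(n_recs)


# Toy data (illustrative values, not taken from the dataset)
toy = pd.DataFrame(
    [[9.0, 8.0, 0.0, 0.0],
     [9.0, 8.0, 7.0, 0.0],
     [0.0, 0.0, 0.0, 6.0]],
    index=["u1", "u2", "u3"],
    columns=["A", "B", "C", "D"],
)
unit = toy.div(np.linalg.norm(toy, axis=1), axis=0)  # row-normalise
toy_sim = unit @ unit.T                              # cosine similarity
recs = recommend_movies("u1", toy, toy_sim)
print(recs)
```

Ignoring zero-similarity neighbours keeps unrelated users from contributing to the prediction; a user with no overlapping neighbour simply gets an empty result.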

In [40]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [41]:
df = pd.read_csv("/content/movies.csv")
df.head()

Unnamed: 0,rank,movie_id,title,year,link,imbd_votes,imbd_rating,certificate,duration,genre,...,director_id,director_name,writer_id,writer_name,storyline,user_id,user_name,review_id,review_title,review_content
0,1,tt0111161,The Shawshank Redemption,1994,https://www.imdb.com/title/tt0111161,2711075,9.3,R,2h 22m,Drama,...,nm0001104,Frank Darabont,"nm0000175,nm0001104","Stephen King,Frank Darabont","Over the course of several years, two convicts...","ur16161013,ur15311310,ur0265899,ur16117882,ur1...","hitchcockthelegend,Sleepin_Dragon,EyeDunno,ale...","rw2284594,rw6606154,rw1221355,rw1822343,rw1288...","Some birds aren't meant to be caged.,An incred...",The Shawshank Redemption is written and direct...
1,2,tt0068646,The Godfather,1972,https://www.imdb.com/title/tt0068646,1882829,9.2,R,2h 55m,"Crime,Drama",...,nm0000338,Francis Ford Coppola,"nm0701374,nm0000338","Mario Puzo,Francis Ford Coppola",The aging patriarch of an organized crime dyna...,"ur24740649,ur86182727,ur15794099,ur15311310,ur...","CalRhys,andrewburgereviews,gogoschka-1,Sleepin...","rw3038370,rw4756923,rw4059579,rw6568526,rw1897...","The Pinnacle Of Flawless Films!,An offer so go...",'The Godfather' is the pinnacle of flawless fi...
2,3,tt0468569,The Dark Knight,2008,https://www.imdb.com/title/tt0468569,2684051,9.0,PG-13,2h 32m,"Action,Crime,Drama",...,nm0634240,Christopher Nolan,"tt0468569,nm0634300,nm0634240,nm0275286,tt0468569","Writers,Jonathan Nolan,Christopher Nolan,David...",When the menace known as the Joker wreaks havo...,"ur87850731,ur1293485,ur129557514,ur12449122,ur...","MrHeraclius,Smells_Like_Cheese,dseferaj,little...","rw5478826,rw1914442,rw6606026,rw1917099,rw5170...","The Dark Knight,The Batman of our dreams! So m...","Confidently directed, dark, brooding, and pack..."
3,4,tt0071562,The Godfather Part II,1974,https://www.imdb.com/title/tt0071562,1285350,9.0,R,3h 22m,"Crime,Drama",...,nm0000338,Francis Ford Coppola,"nm0000338,nm0701374","Francis Ford Coppola,Mario Puzo",The early life and career of Vito Corleone in ...,"ur0176092,ur0688559,ur92260614,ur0200644,ur117...","Nazi_Fighter_David,tfrizzell,umunir-36959,DanB...","rw0135607,rw0135487,rw5049900,rw0135526,rw0135...",Breathtaking in its scope and tragic grandeur....,"Coppola's masterpiece is rivaled only by ""The ..."
4,5,tt0050083,12 Angry Men,1957,https://www.imdb.com/title/tt0050083,800954,9.0,Approved,1h 36m,"Crime,Drama",...,nm0001486,Sidney Lumet,nm0741627,Reginald Rose,The jury in a New York City murder trial is fr...,"ur1318549,ur0643062,ur0688559,ur20552756,ur945...","uds3,tedg,tfrizzell,TheLittleSongbird,henrique...","rw0060044,rw0060025,rw0060034,rw2262425,rw5448...","The over-used term ""classic movie"" really come...",This once-in-a-generation masterpiece simply h...


In [42]:
df.ndim

2

In [43]:
df.size

5500

In [44]:
df.shape

(250, 22)

In [45]:
df.dtypes

rank                int64
movie_id           object
title              object
year                int64
link               object
imbd_votes         object
imbd_rating       float64
certificate        object
duration           object
genre              object
cast_id            object
cast_name          object
director_id        object
director_name      object
writer_id          object
writer_name        object
storyline          object
user_id            object
user_name          object
review_id          object
review_title       object
review_content     object
dtype: object

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   rank            250 non-null    int64  
 1   movie_id        250 non-null    object 
 2   title           250 non-null    object 
 3   year            250 non-null    int64  
 4   link            250 non-null    object 
 5   imbd_votes      250 non-null    object 
 6   imbd_rating     250 non-null    float64
 7   certificate     249 non-null    object 
 8   duration        250 non-null    object 
 9   genre           250 non-null    object 
 10  cast_id         250 non-null    object 
 11  cast_name       250 non-null    object 
 12  director_id     250 non-null    object 
 13  director_name   250 non-null    object 
 14  writer_id       250 non-null    object 
 15  writer_name     250 non-null    object 
 16  storyline       250 non-null    object 
 17  user_id         250 non-null    obj

In [47]:
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
245    False
246    False
247    False
248    False
249    False
Length: 250, dtype: bool

In [48]:
df.isnull().sum()

rank              0
movie_id          0
title             0
year              0
link              0
imbd_votes        0
imbd_rating       0
certificate       1
duration          0
genre             0
cast_id           0
cast_name         0
director_id       0
director_name     0
writer_id         0
writer_name       0
storyline         0
user_id           0
user_name         0
review_id         0
review_title      0
review_content    0
dtype: int64

In [49]:
# Drop irrelevant columns
columns_to_keep = ['title', 'imbd_rating', 'user_id']
df_new = df[columns_to_keep]

df_new.head()

Unnamed: 0,title,imbd_rating,user_id
0,The Shawshank Redemption,9.3,"ur16161013,ur15311310,ur0265899,ur16117882,ur1..."
1,The Godfather,9.2,"ur24740649,ur86182727,ur15794099,ur15311310,ur..."
2,The Dark Knight,9.0,"ur87850731,ur1293485,ur129557514,ur12449122,ur..."
3,The Godfather Part II,9.0,"ur0176092,ur0688559,ur92260614,ur0200644,ur117..."
4,12 Angry Men,9.0,"ur1318549,ur0643062,ur0688559,ur20552756,ur945..."


In [50]:
# Expand user_id lists into separate rows
expanded_df = df_new.assign(user_id=df_new['user_id'].str.split(',')).explode('user_id')

# Reset index for clean DataFrame
expanded_df = expanded_df.reset_index(drop=True)

expanded_df.head()

Unnamed: 0,title,imbd_rating,user_id
0,The Shawshank Redemption,9.3,ur16161013
1,The Shawshank Redemption,9.3,ur15311310
2,The Shawshank Redemption,9.3,ur0265899
3,The Shawshank Redemption,9.3,ur16117882
4,The Shawshank Redemption,9.3,ur1898687


In [51]:
expanded_df.shape

(6235, 3)

In [52]:
# Create the user-item matrix
user_item_matrix = expanded_df.pivot_table(index='user_id', columns='title', values='imbd_rating')

# Fill NaN values with 0 (here 0 acts as a "not rated" placeholder rather than an actual rating)
user_item_matrix = user_item_matrix.fillna(0)

user_item_matrix.head()

title,12 Angry Men,12 Years a Slave,1917,2001: A Space Odyssey,3 Idiots,A Beautiful Mind,A Clockwork Orange,A Separation,Aladdin,Alien,...,V for Vendetta,Vertigo,WALL·E,Warrior,Whiplash,Wild Strawberries,Wild Tales,Witness for the Prosecution,Yojimbo,Your Name.
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ur0003136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0005435,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,8.1,0.0,0.0,0.0,0.0
ur0011596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0011667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0011762,0.0,0.0,0.0,0.0,8.4,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
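
Because every unrated cell is filled with 0, most of the matrix ends up as zeros. A quick density check, sketched here on a toy frame (the names `toy_long` and `toy_matrix` and their values are illustrative, not taken from the dataset), shows one way to measure that before computing similarities.

```python
import pandas as pd

# Long-format toy data with the same column names as expanded_df
toy_long = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "imbd_rating": [9.0, 9.0, 8.5, 8.0],
    "user_id": ["u1", "u2", "u2", "u3"],
})
toy_matrix = toy_long.pivot_table(
    index="user_id", columns="title", values="imbd_rating"
).fillna(0)

# Share of cells that hold a rating, and number of rated movies per user
n_filled = int((toy_matrix != 0).to_numpy().sum())
density = n_filled / toy_matrix.size
movies_per_user = (toy_matrix != 0).sum(axis=1)

print(f"density: {density:.2f}")
print(movies_per_user)
```

Users with very few non-zero entries have little basis for overlap with anyone else, which is worth keeping in mind when reading the similarity matrix.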


In [53]:
#Calculating Cosine Similarity between Users
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation


In [54]:
user_sim = 1 - pairwise_distances(user_item_matrix.values, metric='cosine')

In [55]:
user_sim

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
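
For two rating vectors u and v, cosine similarity is u·v / (‖u‖ ‖v‖). With zero-filled ratings the dot product only picks up movies that both users have in common, so two users with no shared movie get a similarity of exactly 0. The NumPy sketch below works through this on made-up vectors; the helper name `cosine_similarity_matrix` is an assumption for illustration.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Row-wise cosine similarity; all-zero rows get similarity 0."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty rows
    unit = X / norms
    return unit @ unit.T


toy = np.array([
    [9.0, 8.0, 0.0],  # user 0
    [9.0, 0.0, 0.0],  # user 1 shares one movie with user 0
    [0.0, 0.0, 7.0],  # user 2 shares nothing with users 0 and 1
])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```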

In [56]:
#Store the results in a dataframe
user_sim_df = pd.DataFrame(user_sim)

In [57]:
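#Set the index and column names to user ids
# Caution: user_item_matrix (and therefore user_sim) follows the sorted index produced
# by pivot_table, while unique() returns ids in order of first appearance, so these
# labels can be misaligned with the rows. user_item_matrix.index is the matching label set.
user_sim_df.index = expanded_df.user_id.unique()
user_sim_df.columns = expanded_df.user_id.unique()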
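
One thing worth checking at this step: `pivot_table` returns its index sorted, while `unique()` returns values in order of first appearance, so the two orderings can differ. The toy example below (hypothetical data and variable names) shows the difference and labels the similarity matrix with the index of the matrix it was computed from.

```python
import numpy as np
import pandas as pd

toy_long = pd.DataFrame({
    "title": ["A", "A", "B"],
    "imbd_rating": [9.0, 9.0, 8.0],
    "user_id": ["u9", "u1", "u5"],  # first-appearance order: u9, u1, u5
})
toy_matrix = toy_long.pivot_table(
    index="user_id", columns="title", values="imbd_rating"
).fillna(0)

appearance_order = list(toy_long["user_id"].unique())
matrix_order = list(toy_matrix.index)
print(appearance_order, matrix_order)

# Cosine similarity on the matrix rows, labelled with the matrix's own index
norms = np.linalg.norm(toy_matrix.values, axis=1, keepdims=True)
unit = toy_matrix.values / norms
toy_sim_df = pd.DataFrame(
    unit @ unit.T, index=toy_matrix.index, columns=toy_matrix.index
)
print(toy_sim_df)
```

Here `u9` and `u1` both reviewed movie A, so their similarity is 1 only when the labels come from `toy_matrix.index`.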

In [58]:
user_sim_df.iloc[0:15, 0:15]

Unnamed: 0,ur16161013,ur15311310,ur0265899,ur16117882,ur1898687,ur0257957,ur0355122,ur16749093,ur1234929,ur0482513,ur118977607,ur87850731,ur0174908,ur0842118,ur1005460
ur16161013,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur15311310,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0265899,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur16117882,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur1898687,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0257957,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0355122,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur16749093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur1234929,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0482513,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [59]:
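# Zero the diagonal so that no user is picked as their own most similar user.
# This edits user_sim in place; user_sim_df reflects it only if it shares memory with user_sim.
np.fill_diagonal(user_sim, 0)
user_sim_df.iloc[0:15, 0:15]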

Unnamed: 0,ur16161013,ur15311310,ur0265899,ur16117882,ur1898687,ur0257957,ur0355122,ur16749093,ur1234929,ur0482513,ur118977607,ur87850731,ur0174908,ur0842118,ur1005460
ur16161013,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur15311310,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0265899,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur16117882,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur1898687,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0257957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0355122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur16749093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur1234929,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0482513,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [60]:
#Most Similar Users
user_sim_df.idxmax(axis=1)[0:5]

ur16161013    ur128634296
ur15311310     ur18622219
ur0265899     ur115595273
ur16117882     ur14069613
ur1898687      ur32227901
dtype: object
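
`idxmax` returns only one neighbour per user, and for a row whose similarities are all zero it returns the first column label, which is not a meaningful match. A sketch of a top-k lookup that skips zero similarities follows; the helper name `top_similar_users` and the toy values are assumptions.

```python
import pandas as pd


def top_similar_users(sim_df, user_id, k=3):
    """Hypothetical helper: up to k most similar users with similarity > 0."""
    sims = sim_df.loc[user_id].drop(user_id)
    return sims[sims > 0].nlargest(k)


users = ["u1", "u2", "u3", "u4"]
toy_sim_df = pd.DataFrame(
    [[0.0, 0.8, 0.3, 0.0],
     [0.8, 0.0, 0.0, 0.0],
     [0.3, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]],  # u4 overlaps with nobody
    index=users,
    columns=users,
)

neighbours = top_similar_users(toy_sim_df, "u1", k=2)
isolated_idxmax = toy_sim_df.idxmax(axis=1)["u4"]   # falls back to the first label
isolated_topk = top_similar_users(toy_sim_df, "u4")  # empty, as expected
print(neighbours, isolated_idxmax, isolated_topk, sep="\n")
```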

In [61]:
expanded_df[expanded_df['user_id'].isin(['ur16161013', 'ur128634296'])]

Unnamed: 0,title,imbd_rating,user_id
0,The Shawshank Redemption,9.3,ur16161013
59,The Dark Knight,9.0,ur128634296
192,Pulp Fiction,8.9,ur16161013
339,Inception,8.8,ur16161013
453,Se7en,8.6,ur16161013
696,Star Wars,8.6,ur16161013
898,Gladiator,8.5,ur16161013
994,The Usual Suspects,8.5,ur16161013
1586,Witness for the Prosecution,8.4,ur16161013
1884,Braveheart,8.4,ur16161013


In [62]:
user_1=expanded_df[expanded_df['user_id']== 'ur16161013']

In [63]:
user_2=expanded_df[expanded_df['user_id']== 'ur128634296']

In [64]:
user_1.title

0                 The Shawshank Redemption
192                           Pulp Fiction
339                              Inception
453                                  Se7en
696                              Star Wars
898                              Gladiator
994                     The Usual Suspects
1586           Witness for the Prosecution
1884                            Braveheart
2053                   Singin' in the Rain
2269                        Reservoir Dogs
2470                               Vertigo
2608                     Full Metal Jacket
2757                             The Sting
2955                                Snatch
2972    Indiana Jones and the Last Crusade
3467                            Unforgiven
3547                      A Beautiful Mind
3668                      The Great Escape
3794                             The Thing
4041                     Dial M for Murder
4686             In the Name of the Father
5067                                  Jaws
5372       

In [65]:
user_2.title

59    The Dark Knight
Name: title, dtype: object

In [66]:
pd.merge(user_1,user_2,on='title',how='outer')

Unnamed: 0,title,imbd_rating_x,user_id_x,imbd_rating_y,user_id_y
0,The Shawshank Redemption,9.3,ur16161013,,
1,Pulp Fiction,8.9,ur16161013,,
2,Inception,8.8,ur16161013,,
3,Se7en,8.6,ur16161013,,
4,Star Wars,8.6,ur16161013,,
5,Gladiator,8.5,ur16161013,,
6,The Usual Suspects,8.5,ur16161013,,
7,Witness for the Prosecution,8.4,ur16161013,,
8,Braveheart,8.4,ur16161013,,
9,Singin' in the Rain,8.3,ur16161013,,


In [67]:
#This merged DataFrame can help identify movies that one user has rated and the other has not, which is valuable for a recommendation system.
#For example, movies rated highly by similar users but not yet rated by the target user can be good candidates for recommendations.
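
To act on this idea, the outer merge can be given `indicator=True`, which adds a `_merge` column recording which side each row came from. A minimal sketch on toy frames (the data and the names `toy_user_1`, `toy_user_2`, and `candidates` are illustrative):

```python
import pandas as pd

toy_user_1 = pd.DataFrame({
    "title": ["A", "B"],
    "imbd_rating": [9.0, 8.5],
    "user_id": ["u1", "u1"],
})
toy_user_2 = pd.DataFrame({
    "title": ["B", "C"],
    "imbd_rating": [8.5, 8.0],
    "user_id": ["u2", "u2"],
})

merged = pd.merge(toy_user_1, toy_user_2, on="title", how="outer", indicator=True)

# Rows marked "right_only" are movies only user 2 has rated:
# candidates to recommend to user 1, highest rated first
candidates = merged.loc[merged["_merge"] == "right_only", ["title", "imbd_rating_y"]]
candidates = candidates.sort_values("imbd_rating_y", ascending=False)
print(candidates)
```

Filtering on `_merge` avoids checking for NaN in the suffixed columns by hand.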