## Movies_Recommendation_System
   - By Debjeet Das
 


In [1]:
# Loading the necessary libraries to be used in the following steps

from surprise import KNNWithMeans,Dataset,accuracy,Reader
from surprise.model_selection import train_test_split

In [2]:
# Loading the ratings dataset.
import pandas as pd 
ratings = pd.read_csv("ratings_assignment.csv")

# Ratings in this dataset range from 0.5 to 5 in steps of 0.5
ratings.rating.value_counts()

4.0    26818
3.0    20047
5.0    13211
3.5    13136
4.5     8551
2.0     7551
2.5     5550
1.0     2811
1.5     1791
0.5     1370
Name: rating, dtype: int64

## Recommendation System Predicting the User Rating:
  - Here we build a model that predicts movie ratings for a recommendation system.
  - We use the cosine similarity between users' rating vectors to measure how alike two users are.
  - The higher the cosine similarity (equivalently, the smaller the cosine distance), the more similar the two users are.
  - The ratings given to the movies in this dataset range from 0.5 to 5.
  - We split the dataset into a trainset and a testset.
  - The split returns two objects: a surprise Trainset, and a testset that is a list of (user, movie, rating) tuples.
  - While training, the model computes the cosine similarity between users; at prediction time it selects the k most similar users who have rated the movie.
  - KNNWithMeans then predicts the target user's mean rating plus the similarity-weighted average of how far those k neighbours' ratings deviate from their own means. It is not a plain mean of the neighbours' ratings.
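
The prediction rule described above can be sketched in plain Python. This is a simplified illustration, not surprise's implementation: the toy ratings, the function names and k=2 are assumptions made for the example, and it leaves out details such as clipping the estimate to the rating scale.

```python
import math

# Toy ratings (made up for illustration): user -> {movie: rating}
ratings = {
    "u1": {"m1": 4.0, "m2": 5.0, "m3": 3.0},
    "u2": {"m1": 4.0, "m2": 4.5, "m3": 3.0, "m4": 5.0},
    "u3": {"m1": 1.0, "m2": 5.0, "m4": 2.0},
}

def cosine_sim(a, b):
    """Cosine similarity between two users over the movies both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in common))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def user_mean(user):
    """Average of all ratings given by one user."""
    values = ratings[user].values()
    return sum(values) / len(values)

def predict(user, movie, k=2):
    """User mean plus the similarity-weighted average of the
    k nearest neighbours' mean-centred ratings for the movie."""
    neighbours = [
        (cosine_sim(ratings[user], ratings[other]), other)
        for other in ratings
        if other != user and movie in ratings[other]
    ]
    top_k = sorted(neighbours, reverse=True)[:k]
    numerator = sum(sim * (ratings[v][movie] - user_mean(v)) for sim, v in top_k)
    denominator = sum(sim for sim, _ in top_k)
    base = user_mean(user)
    return base if denominator == 0 else base + numerator / denominator

estimate = predict("u1", "m4")
print(round(estimate, 2))
```

Centring each neighbour's rating on that neighbour's own mean corrects for users who rate everything high or everything low, which is the difference between KNNWithMeans and a basic KNN average.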
  
## Predicting the Movie Rating for a UserID:
  - We take the userID and movieID as inputs from the keyboard.
  - We show the rating the model predicts the user would give the movie if they watched it.
  - The prediction is based on the ratings that similar users gave to that movie, together with the user's own average rating. Genres are not used by this model.

In [32]:
# predicting the user rating for a movie


# Setting the range scale from 0.5 - 5 for this dataset
reader = Reader(rating_scale=(0.5,5))

# building a surprise Dataset from the ratings dataframe; this is what train_test_split expects
data = Dataset.load_from_df(ratings[["userId","movieId", "rating"]],reader)

# splitting the data into trainset (70%) and testset (30%)
trainset, testset = train_test_split(data,test_size=0.3,shuffle=True)


# here we use the KNNWithMeans algorithm to train our recommender system, with cosine similarity between users
recom=KNNWithMeans(k=9,sim_options={"name":"cosine","user_based":True})
recom.fit(trainset)
test_pred = recom.test(testset)
rmse = accuracy.rmse(test_pred)

# Taking the UserID and MovieID as inputs from the keyboard and displaying the predicted rating
user_id = int(input("Enter UserID "))
movie_id = int(input("Enter MovieID "))
r = recom.predict(user_id,movie_id)
print(f"Predicted Movie Rating for the UserID {user_id} is {r.est:.2f} ")

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9163
Enter UserID 6
Enter MovieID 7
Predicted Movie Rating for the UserID 6 is 3.46 
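
The RMSE reported above is the square root of the mean squared difference between the true and the estimated ratings over the testset. A minimal sketch of that formula follows; the (true, estimated) pairs are made up for illustration and are not taken from this dataset.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (true, estimated) ratings
pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
print(round(rmse(pairs), 4))
```

Because the train/test split above is shuffled without a fixed seed, the RMSE will vary slightly from run to run.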


##### Loading Movies dataset:


In [5]:
import numpy as np
movies = pd.read_csv("movies_assignment.csv")

In [6]:
# Merging ratings & movies DataFrames on their shared movieId column
movie_rating = pd.merge(ratings,movies,on="movieId")

In [7]:
# Dropping the unwanted columns from the merged Dataframe
movie_rating = movie_rating.drop(["timestamp","genres"],axis= 1)

In [8]:
# merged Dataframe with movie titles and movieID
movie_rating

,userId,movieId,rating,title
0,1,1,4.0,Toy Story (1995)
1,5,1,4.0,Toy Story (1995)
2,7,1,4.5,Toy Story (1995)
3,15,1,2.5,Toy Story (1995)
4,17,1,4.5,Toy Story (1995)
...,...,...,...,...
100831,610,160341,2.5,Bloodmoon (1997)
100832,610,160527,4.5,Sympathy for the Underdog (1971)
100833,610,160836,3.0,Hazard (2005)
100834,610,163937,3.5,Blair Witch (2016)


In [9]:
# Finding the top 10 movies on the basis of the sum of their ratings
movie_rating.pivot_table(index="movieId",values="rating",aggfunc="sum").sort_values(by="rating",ascending=False).head(10).reset_index()

,movieId,rating
0,318,1404.0
1,356,1370.0
2,296,1288.5
3,2571,1165.5
4,593,1161.0
5,260,1062.0
6,110,955.5
7,2959,931.5
8,527,929.5
9,480,892.5


# Creating a recommendation system:
   - This is a demo recommendation system built on a matrix with movie titles as the index and userIDs as the columns.
   - Ratings are the values in the matrix; a 0 means the user has not rated that movie.
   - The top ten movies are selected by the sum of their ratings, which favours movies that have both high ratings and many ratings.
   - A user gets recommendations only if the average of the ratings they gave to the top ten movies is greater than 2.5.
   - The recommender takes a userID as input and shows the top ten movies that user has not rated yet.
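
The threshold rule described above can be sketched without pandas. The titles, user IDs and ratings below are hypothetical, and `recommend` is an illustrative helper rather than a function from the notebook.

```python
# One row of ratings per user, aligned with `titles`; 0.0 means "not rated"
titles = ["Movie A", "Movie B", "Movie C"]
user_rows = {
    101: [4.0, 0.0, 5.0],   # average of rated movies is 4.5
    102: [0.0, 2.0, 1.0],   # average of rated movies is 1.5
}

def recommend(row, titles, threshold=2.5):
    """Return the unrated titles if the user's average rating exceeds the threshold."""
    rated = [r for r in row if r != 0.0]
    if not rated or sum(rated) / len(rated) <= threshold:
        return []
    return [title for title, r in zip(titles, row) if r == 0.0]

for user_id, row in user_rows.items():
    print(user_id, recommend(row, titles))
```

Note that this rule filters users rather than movies: every qualifying user is offered all of the top ten movies they have not rated.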

In [10]:
# importing csr_matrix (compressed sparse matrices) and NearestNeighbors; neither is used in the cells shown below

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

ls=[318,356,296,2571,593,260,110,2959,527,480]
top_ten_rate=movie_rating[movie_rating["movieId"].isin(ls)]


In [11]:
# this matrix marks unrated movies with 0; on the full dataset such a matrix would be mostly zeros and use a lot of memory
rating_pivot = top_ten_rate.pivot(index= "title" , columns = "userId",values = "rating").fillna(0)
rating_pivot

title \ userId,1,2,3,4,5,6,7,8,10,11,...,601,602,603,604,605,606,607,608,609,610
Braveheart (1995),4.0,0.0,0.0,0.0,4.0,5.0,0.0,3.0,0.0,5.0,...,0.0,5.0,1.0,3.0,3.0,3.5,5.0,4.0,3.0,4.5
Fight Club (1999),5.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.5,0.0,...,5.0,0.0,4.0,0.0,0.0,5.0,0.0,5.0,0.0,5.0
Forrest Gump (1994),4.0,0.0,0.0,0.0,0.0,5.0,5.0,3.0,3.5,5.0,...,0.0,3.0,3.0,0.0,3.0,4.0,0.0,3.0,4.0,3.0
Jurassic Park (1993),4.0,0.0,0.0,0.0,0.0,5.0,5.0,4.0,0.0,4.0,...,0.0,4.0,0.0,0.0,3.0,2.5,4.0,3.0,3.0,5.0
"Matrix, The (1999)",5.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.5,0.0,...,5.0,0.0,5.0,0.0,0.0,5.0,5.0,5.0,0.0,5.0
Pulp Fiction (1994),3.0,0.0,0.0,1.0,5.0,2.0,0.0,4.0,1.0,0.0,...,0.0,5.0,5.0,5.0,2.0,5.0,3.0,5.0,4.0,5.0
Schindler's List (1993),5.0,0.0,0.5,0.0,5.0,3.0,0.0,5.0,0.0,0.0,...,5.0,5.0,3.0,0.0,0.0,5.0,5.0,4.0,0.0,3.5
"Shawshank Redemption, The (1994)",0.0,3.0,0.0,0.0,3.0,5.0,0.0,5.0,0.0,4.0,...,5.0,5.0,0.0,0.0,0.0,3.5,5.0,4.5,4.0,3.0
"Silence of the Lambs, The (1991)",4.0,0.0,0.0,5.0,0.0,4.0,5.0,4.0,0.0,5.0,...,0.0,5.0,5.0,5.0,0.0,4.5,5.0,4.0,0.0,4.5
Star Wars: Episode IV - A New Hope (1977),5.0,0.0,0.0,5.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,5.0,4.0,0.0,5.0,4.5,3.0,3.5,0.0,5.0


### Recommendation system  for recommending the movies to a user :

In [28]:
dict_recom = {}      # maps each userID to its list of recommended titles
for user in rating_pivot.columns:      # looping over every user in the pivot table, not only the first 12
    user_ratings = rating_pivot[user]
    watched = user_ratings[user_ratings != 0]   # ratings of the movies the user has already rated
    avg = watched.mean()         # calculating the average rating of the movies they have watched already

    if avg > 2.5:            # recommending only when the average rating is above the 2.5 threshold
        # the movies the user has not rated (value 0) become the recommendations
        dict_recom[user] = user_ratings[user_ratings == 0].index.tolist()
    else:
        dict_recom[user] = []     # no recommendations for this user


# This is the final dictionary, which maps each UserID to the recommended titles.
final_dict = dict_recom


# taking the UserID as input and displaying the recommended movies
u_id = int(input("enter the user id: "))
if u_id not in final_dict:
    print("User Id not Found")
elif not final_dict[u_id]:
    print("No recommendations for this user")
else:
    print("Recommended Movies out of the Top 10 Movies are : ")
    for title in final_dict[u_id]:
        print(title)

enter the user id: 5
Recommended Movies out of the Top 10 Movies are : 
Fight Club (1999)
Forrest Gump (1994)
Jurassic Park (1993)
Matrix, The (1999)
Silence of the Lambs, The (1991)
Star Wars: Episode IV - A New Hope (1977)


# Main Recommendation system:
   - First we take the UserID as an input from the keyboard.
   - Then we collect the unique movieIDs from the dataframe and store them in unique_iid.
   - Next we find the movies already watched (rated) by that user and store them in u_iid.
   - We then build an array of the movies the user has not watched and store it in not_watched_iid.
   - We create a testset that is used to predict the user's rating for every unwatched movie.
   - We build a dataframe named ratingDF with the movieID and the predicted rating as columns.
   - We set movieID as the index of the dataframe, which makes it easier to get the top 10 movies by rating.
   - We sort by predicted rating and keep the top 10 movies for the user.
   - Finally we display the titles of those top 10 movies; all of them are movies the user has not watched yet.
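
The steps above can be sketched as one small function. Here `fake_predict` and its scores are hypothetical stand-ins for the trained model; in the notebook the estimate would come from the `est` field of `recom.predict(...)`.

```python
def top_n_unwatched(user_id, all_movie_ids, watched_ids, predict, n=10):
    """Rank the movies the user has not watched by predicted rating and keep the best n."""
    candidates = sorted(set(all_movie_ids) - set(watched_ids))
    scored = [(predict(user_id, movie_id), movie_id) for movie_id in candidates]
    # Highest predicted rating first; ties are broken by the smaller movie ID
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [movie_id for _, movie_id in scored[:n]]

# Hypothetical predicted ratings, standing in for the trained model
fake_scores = {10: 3.0, 20: 4.5, 30: 4.5, 40: 2.0, 50: 5.0}

def fake_predict(user_id, movie_id):
    return fake_scores[movie_id]

top = top_n_unwatched(6, fake_scores.keys(), watched_ids=[50], predict=fake_predict, n=3)
print(top)
```

Returning the IDs as an ordered list keeps the ranking intact, so the titles can be looked up one by one in ranked order.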

In [14]:
# Recommending the top ten movies on the basis of the ratings predicted for the user

UserID = int(input("Enter the userID : "))

# unique MovieID throughout the dataset
unique_iid = movie_rating["movieId"].unique()

# movies which the user has watched
u_iid = movie_rating[movie_rating["userId"]==UserID]["movieId"]

# movies which the user has not watched
not_watched_iid = np.setdiff1d(unique_iid,u_iid)
not_watched_iid

# We are creating a testset which we will use for the prediction of ratings for all the unwatched movies
# The third element (5) is only a placeholder for the true rating; it does not affect the estimate
testset = [[UserID,i,5] for i in not_watched_iid]
predicted_movies = recom.test(testset)
predicted_movies


# Creating a dataframe with MovieID and predicted ratings as columns.
mov_id=[]
rate = []

for i in predicted_movies:
    mov_id.append(i.iid)
    rate.append(i.est)

ratingDF = pd.DataFrame({"movieID":mov_id,"ratings":rate})
ratingDF = ratingDF.set_index("movieID")

# sorting the values by ratings and collecting the top 10 movies
recom_movies=ratingDF.sort_values(by="ratings",ascending = False).head(10)
top_10_movie = pd.Series(recom_movies.index) 


# Displaying the titles of the top 10 movies by predicted rating
# All of them are movies the user has not watched yet
# Note: isin keeps the row order of the movies dataframe, not the ranking order
name = movies[movies["movieId"].isin(top_10_movie)]["title"]
print("Movies that are recommended are : ")
print(name)

Enter the userID : 6
Movies that are recommended are : 
2880           I'm the One That I Want (2000)
4372                       Siam Sunset (1999)
7059              Garfield's Pet Force (2009)
7199    Mickey's Once Upon a Christmas (1999)
7239                          Meantime (1984)
8898                  Ghost Graduation (2012)
9289                    World of Glory (1991)
9323              Central Intelligence (2016)
9337                       Indignation (2016)
9365        Tom Segura: Mostly Stories (2016)
Name: title, dtype: object


# Merged DF:
  - Collecting the top ten recommended movies into a dataframe.
  - Creating another dataframe with the average rating per movie across users.
  - Merging the top 10 movies with the avg_movie_rating dataframe.
  - Loading the two datasets that hold the movie links and tags.
  - Then merging all the dataframes to get the final dataframe with MovieID, Title, Ratings, Tags, imdbID and Genres in it.
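
The merges described above are left joins: every recommended movie is kept even when the other table has no matching row. A minimal pure-Python sketch of that behaviour follows. The rows are made up, `left_join` is an illustrative helper, and pandas fills the gaps with NaN where this sketch uses None.

```python
def left_join(left, right, key):
    """Keep every row of `left`; attach matching `right` rows, or None where nothing matches."""
    right_cols = {col for row in right for col in row} - {key}
    by_key = {}
    for row in right:
        by_key.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        matches = by_key.get(row[key])
        if matches:
            # One output row per match, so a movie with several tags is repeated
            joined.extend({**row, **match} for match in matches)
        else:
            joined.append({**row, **{col: None for col in right_cols}})
    return joined

# Hypothetical rows
top_movies = [{"movieId": 1, "title": "Movie A"}, {"movieId": 2, "title": "Movie B"}]
tags = [{"movieId": 2, "tag": "funny"}, {"movieId": 2, "tag": "heist"}]

result = left_join(top_movies, tags, "movieId")
for row in result:
    print(row)
```

This is why a movie that nobody has tagged still appears in the final dataframe, with an empty tag column.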

In [33]:
# Collecting the top ten movies and creating a dataframe
top_ten_mov = pd.DataFrame(name).reset_index().drop("index",axis=1)

# Creating a dataframe with average rating  per movie 
avg_movie_rating = movie_rating.pivot_table(index=["title","movieId"],values="rating").reset_index()

# merging the top 10 movies with the avg_movie_rating dataframe
top_avg_merged = pd.merge(top_ten_mov,avg_movie_rating,how ="left",on = "title")

# Loading the two datasets that hold the movie links and tags
links = pd.read_csv("links_assignment.csv")
tags = pd.read_csv("tags_assignment.csv")

# Then merging all the dataframes to get the final dataframe with MovieID, Title, Ratings, Tags, imdbID and Genres in it.
top_avg_tag = pd.merge(top_avg_merged,tags,how ="left",on = "movieId")
top_avg_tag =top_avg_tag.drop(["userId","timestamp"],axis=1)

top_avg_tag_imdb = pd.merge(top_avg_tag,links,on="movieId",how = "left")
top_avg_tag_imdb =top_avg_tag_imdb.drop("tmdbId",axis=1)

top_avg_tag_imdb_genre = pd.merge(top_avg_tag_imdb,movies,how="left",on=["movieId","title"])
top_avg_tag_imdb_genre

,title,movieId,rating,tag,imdbId,genres
0,I'm the One That I Want (2000),3851,5.0,,251739,Comedy
1,Siam Sunset (1999),6402,5.0,,178022,Comedy
2,Garfield's Pet Force (2009),69469,5.0,,1389762,Animation
3,Mickey's Once Upon a Christmas (1999),72692,5.0,,238414,Animation|Comedy|Fantasy
4,Meantime (1984),73822,5.0,,82727,Comedy|Drama
5,Ghost Graduation (2012),134847,5.0,,1924273,Comedy
6,World of Glory (1991),158398,5.0,,102083,Comedy
7,Central Intelligence (2016),160271,3.625,,1489889,Action|Comedy
8,Indignation (2016),160644,5.0,,4193394,Drama
9,Tom Segura: Mostly Stories (2016),162344,5.0,,4970632,Comedy
