<a href="https://colab.research.google.com/github/PrajwalRedee/Moive-Recommendation-System/blob/main/Movie_Reccommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Movie Recommendation System


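##### https://files.grouplens.org/datasets/movielens/ml-25m.zip
Download the MovieLens 25M dataset from the link above and extract it. The notebook reads `movies.csv` and `ratings.csv` from the working directory.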

In [1]:
import pandas as pd

In [2]:
movies = pd.read_csv("movies.csv")


In [3]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


Let's clean the title column with a regular expression that removes every character except letters, digits, and spaces.

In [4]:
import re
def title_cleaning(title):
  title = re.sub("[^a-zA-Z0-9 ]", "", title)
  return title

In [5]:
movies["clean_title"] = movies["title"].apply(title_cleaning)


In [6]:
movies.head()

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
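
As a quick check of what the regex keeps and drops, the sketch below applies the same pattern to a few titles taken from this notebook. The helper name `strip_punctuation` is only illustrative; it mirrors `title_cleaning`.

```python
import re

def strip_punctuation(title):
    # Same pattern as title_cleaning: drop everything except letters, digits, and spaces
    return re.sub("[^a-zA-Z0-9 ]", "", title)

samples = ["Toy Story (1995)", "Avengers, The (2012)", "Ant-Man (2015)"]
cleaned = [strip_punctuation(t) for t in samples]
print(cleaned)
```

Note that hyphens are removed without inserting a space, so "Ant-Man" becomes "AntMan". Non-ASCII characters such as accented letters are stripped too, which may matter for foreign-language titles.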


Let's build a TF-IDF matrix of the cleaned titles using TfidfVectorizer. With ngram_range=(1,2), both single words and two-word phrases become features.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,2))

tfidf = vectorizer.fit_transform(movies["clean_title"])
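
To make the effect of `ngram_range=(1, 2)` concrete, here is a minimal sketch on three sample titles. The names `toy_titles`, `toy_vectorizer`, and `toy_tfidf` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny sample of cleaned titles, for illustration only
toy_titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995"]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_tfidf = toy_vectorizer.fit_transform(toy_titles)

# vocabulary_ maps each unigram or bigram to its column in the matrix
terms = sorted(toy_vectorizer.vocabulary_)
print(terms)
print(toy_tfidf.shape)  # (number of titles, number of distinct terms)
```

The vocabulary contains bigrams such as "toy story" alongside the single words. With the default tokenizer, single-character tokens like "2" are ignored, so sequels that differ only by a one-digit number look very similar to the original title.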

Create a search function that uses cosine_similarity to find the titles in the dataset most similar to a query.


In [8]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(title):
    title = title_cleaning(title)
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # argsort ranks the scores, so the five best matches are returned best-first
    indices = np.argsort(similarity)[-5:][::-1]
    results = movies.iloc[indices]
    return results
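
When picking the top matches from a similarity vector, the order of the selected indices matters if the first result is treated as the best one. `np.argpartition` only guarantees that the selected positions hold the k largest values, not that they are sorted, whereas `np.argsort` gives a full ranking. A minimal sketch on made-up scores:

```python
import numpy as np

# Made-up similarity scores for six titles
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5, 0.2])
k = 3

# argpartition: the last k positions hold the k largest scores, in no guaranteed order
top_unordered = np.argpartition(scores, -k)[-k:]

# argsort: take the last k of the ascending order, then reverse for best-first
top_ranked = np.argsort(scores)[-k:][::-1]

print(sorted(top_unordered.tolist()))
print(top_ranked.tolist())
```

Both calls select the same three positions, but only the `argsort` version is guaranteed to list them from highest to lowest score.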

Let's load the ratings file.

In [9]:
ratings = pd.read_csv("ratings.csv")


In [10]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
...,...,...,...,...
13277171,85926,3347,3.0,1281233110
13277172,85926,3578,4.5,1281233781
13277173,85926,3703,3.5,1281232077
13277174,85926,3996,4.5,1281234172


We need to find the users who liked the same movie as the input. Here, "liked" means a rating above 4.

In [11]:
movie_id = 89745
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
similar_users


array([   21,   187,   208, ..., 85869, 85914, 85921])

Next, let's get the other movies that those users liked.

In [12]:
similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
similar_user_recs

3741           318
3742           527
3743           541
3744           589
3745           741
             ...  
13276785     97752
13276789    106072
13276790    110553
13276791    111364
13276792    112623
Name: movieId, Length: 309113, dtype: int64

Let's narrow this down to the movies that more than 10% of those similar users liked.

In [13]:
similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

similar_user_recs = similar_user_recs[similar_user_recs > .10]

In [14]:
similar_user_recs

89745    1.000000
58559    0.570088
59315    0.531289
79132    0.518461
2571     0.503755
           ...   
1193     0.103254
780      0.102628
2542     0.102315
31658    0.101690
150      0.101690
Name: movieId, Length: 195, dtype: float64
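
The same three steps (find fans of the target movie, collect what they liked, keep the movies liked by a large enough share) can be traced by hand on a small made-up table. All data and names below (`toy_ratings`, `target`, `fans`, `shares`) are illustrative, and the threshold is raised to 50% because the sample is so small.

```python
import pandas as pd

# Made-up ratings, for illustration only
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 30, 10, 20, 10, 30, 20],
    "rating":  [5.0, 5.0, 4.5, 5.0, 3.0, 4.5, 5.0, 5.0],
})
target = 10
liked = toy_ratings[toy_ratings["rating"] > 4]

# Users who liked the target movie
fans = liked[liked["movieId"] == target]["userId"].unique()

# Share of those users who liked each movie
shares = liked[liked["userId"].isin(fans)]["movieId"].value_counts() / len(fans)

# Keep movies liked by more than half of them (the notebook uses 10%)
popular = shares[shares > 0.5]
print(popular)
```

The target movie always has a share of 1.0, since every similar user liked it by definition. User 4 never rated the target, so that user's ratings do not count toward the shares.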

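Now let's check how popular these candidate movies are among all users, not just the similar ones. For each movie, we compute the share of users who rated it above 4, out of all users who liked at least one candidate movie.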

In [15]:
all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]


In [16]:
all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())


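Let's build the top 10 recommendations. The score is the ratio of a movie's share among similar users to its share among all users, so a high score marks movies that similar users like disproportionately often.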

In [17]:
rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
rec_percentages.columns = ["similar", "all"]

In [18]:
rec_percentages

Unnamed: 0,similar,all
1,0.243429,0.126634
32,0.103880,0.099763
47,0.204318,0.145610
50,0.220588,0.202348
110,0.187735,0.162574
...,...,...
134853,0.202128,0.036700
152081,0.129224,0.020178
164179,0.130163,0.029388
166528,0.126408,0.014131


In [19]:
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
rec_percentages = rec_percentages.sort_values("score", ascending=False)
rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")


Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
17067,1.0,0.040432,24.733104,89745,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX,Avengers The 2012
25058,0.238736,0.01217,19.616797,122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi,Avengers Age of Ultron 2015
20513,0.104193,0.005377,19.379114,106072,Thor: The Dark World (2013),Action|Adventure|Fantasy|IMAX,Thor The Dark World 2013
19678,0.206821,0.011436,18.084714,102125,Iron Man 3 (2013),Action|Sci-Fi|Thriller|IMAX,Iron Man 3 2013
16725,0.211827,0.011828,17.908354,88140,Captain America: The First Avenger (2011),Action|Adventure|Sci-Fi|Thriller|War,Captain America The First Avenger 2011
16312,0.170526,0.009931,17.171391,86332,Thor (2011),Action|Adventure|Drama|Fantasy|IMAX,Thor 2011
21348,0.28567,0.016914,16.889547,110102,Captain America: The Winter Soldier (2014),Action|Adventure|Sci-Fi|IMAX,Captain America The Winter Soldier 2014
25071,0.213079,0.012992,16.400432,122920,Captain America: Civil War (2016),Action|Sci-Fi|Thriller,Captain America Civil War 2016
25061,0.135169,0.008552,15.805771,122900,Ant-Man (2015),Action|Adventure|Sci-Fi,AntMan 2015
14628,0.236233,0.015054,15.692011,77561,Iron Man 2 (2010),Action|Adventure|Sci-Fi|Thriller|IMAX,Iron Man 2 2010
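
To see why the ratio favors niche movies over broadly popular ones, here is a minimal sketch with made-up shares. The names `similar`, `everyone`, and `toy_rec` are illustrative.

```python
import pandas as pd

# Made-up shares: fraction of similar users vs. all users who liked each movie
similar = pd.Series({10: 1.0, 30: 0.5})
everyone = pd.Series({10: 0.5, 30: 0.125})

# concat with axis=1 aligns the two Series on their movieId index
toy_rec = pd.concat([similar, everyone], axis=1)
toy_rec.columns = ["similar", "all"]

toy_rec["score"] = toy_rec["similar"] / toy_rec["all"]
toy_rec = toy_rec.sort_values("score", ascending=False)
print(toy_rec)
```

Movie 10 is liked by every similar user but also by half of everyone, so its score is 2.0. Movie 30 is liked by fewer similar users, yet it is rare in the general population, so its score of 4.0 ranks it first.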


Building a recommendation function

In [20]:
def find_similar_movies(movie_id):
    # Finding users similar to us
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]

    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)
    similar_user_recs = similar_user_recs[similar_user_recs > .10]

    # Finding how much all users like these candidate movies
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

    # Creating Recommendation score
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]

Create a widget to display the recommended movies


In [21]:
import ipywidgets as widgets
from IPython.display import display

In [22]:
movie_name_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_name_input.observe(on_type, names='value')

display(movie_name_input, recommendation_list)

Text(value='Toy Story', description='Movie Title:')

Output()

## Conclusion

The final widget takes a movie title as input, looks up the closest matching title in the dataset, and displays the top 10 recommended movies for it.