# Movie recommendation system based on collaborative filtering using Pandas

## Import the necessary datasets and Python libraries

The datasets come from the official MovieLens website (http://movielens.org).  
Download link:  
https://files.grouplens.org/datasets/movielens/ml-25m.zip

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

In [None]:
import numpy as np
import pandas as pd

# Google Drive must be mounted (cell above) before these paths resolve
movies_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Machine Learning/Projects/Datasets/Movie recommendation system/StatQuest Dataset/movies.csv")
ratings_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Machine Learning/Projects/Datasets/Movie recommendation system/StatQuest Dataset/ratings.csv")


## Designing a Search Engine to find Movie titles and their IDs

In [None]:
movies_df.head(1)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


### Cleaning the titles with REGEX

In [None]:
import re

def clean_title(title):
    return re.sub("[^a-zA-Z0-9 ]","",title)

movies_df["title_clean"] = movies_df.title.apply(clean_title)

In [None]:
movies_df.title_clean.head(4)

0            Toy Story 1995
1              Jumanji 1995
2     Grumpier Old Men 1995
3    Waiting to Exhale 1995
Name: title_clean, dtype: object
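As a quick check of what the regex keeps and drops, here is a minimal sketch that reuses the same pattern on a few sample strings. The third sample is a hypothetical input chosen to show a side effect: accented letters fall outside `a-zA-Z`, so they are removed as well.

```python
import re

def clean_title(title):
    # Same pattern as above: keep only ASCII letters, digits, and spaces
    return re.sub("[^a-zA-Z0-9 ]", "", title)

# The last sample is hypothetical and only illustrates non-ASCII handling
samples = ["Toy Story (1995)", "Bug's Life, A (1998)", "Am\u00e9lie (2001)"]
cleaned = [clean_title(s) for s in samples]
print(cleaned)
```

If titles with accented characters matter for search, one option is to normalise them to ASCII before applying the regex; that step is not part of this notebook.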

### Constructing a TF-IDF matrix from cleaned titles

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# Fit-transform the vectorizer to the title_clean column
titles_vector = tfidf_vectorizer.fit_transform(movies_df['title_clean'])

### Create a Search function using Cosine Similarity

In [None]:
# Search function that returns the movie title best matching the query title.
from sklearn.metrics.pairwise import cosine_similarity

def search_movie(title_query):
    title_query = clean_title(title_query)
    title_query_vector = tfidf_vectorizer.transform([title_query])
    similarity = cosine_similarity(titles_vector, title_query_vector).flatten()
    indices = np.argpartition(similarity, -1)[-1:]  # index of the single highest-scoring title (top 1)
    results = movies_df.iloc[indices][["movieId", "title", "genres"]]
    return results

In [None]:
search_movie("doom")

Unnamed: 0,movieId,title,genres
10234,37380,Doom (2005),Action|Horror|Sci-Fi
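The search function keeps only the single best match. If several candidates are wanted, the same `np.argpartition` call can be generalised to the top `k` scores. The sketch below uses a made-up similarity array, and `k` is a hypothetical parameter that does not exist in the notebook. Because `argpartition` does not order the selected indices, they are sorted by score afterwards.

```python
import numpy as np

# Made-up similarity scores, one per movie row
similarity = np.array([0.1, 0.9, 0.3, 0.7, 0.0, 0.5])
k = 3  # hypothetical number of matches to return; must not exceed len(similarity)

# Indices of the k largest scores, in no particular order
top_k = np.argpartition(similarity, -k)[-k:]

# Order those indices from best to worst match
top_k = top_k[np.argsort(similarity[top_k])[::-1]]
print(top_k)
```

The resulting index array could be passed to `movies_df.iloc[...]` in the same way as `indices` in `search_movie`.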


## Designing a Recommendation System based on similar users' ratings

In [None]:
ratings_df.head(2)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817


In [None]:
ratings_df.rating.unique()

array([5. , 3.5, 4. , 2.5, 4.5, 3. , 0.5, 2. , 1. , 1.5])

In [None]:

movieId = 1  # hard-coded for a trial run: assume you liked this movie (M1)

# identify similar users: those who also liked M1 (rating >= 4)
similar_users = ratings_df[(ratings_df.movieId == movieId) & (ratings_df.rating >= 4)]["userId"].unique()

# collect the movies liked by those similar users (the candidate recommendations)
similar_user_recs = ratings_df[(ratings_df.rating >=4)&(ratings_df.userId.isin(similar_users))]["movieId"]

# confidence(M1 -> M2): fraction of similar users who also liked each candidate movie M2
confidence_movie_recs = similar_user_recs.value_counts()/len(similar_users)

# keep only candidates above the confidence threshold (tune between roughly 0.1 and 0.3)
confidence = 0.1
confidence_movie_recs = confidence_movie_recs[(confidence_movie_recs > confidence)]

confidence_movie_recs.head(5)

1      1.000000
318    0.549604
260    0.531518
356    0.517224
296    0.495744
Name: movieId, dtype: float64

In [None]:
# support of each candidate movie: fraction of all users who liked it (rating >= 4)
all_users = ratings_df[(ratings_df.movieId.isin(confidence_movie_recs.index)&(ratings_df.rating >= 4))]["movieId"]
support_movie_recs = all_users.value_counts()/len(ratings_df.userId.unique())
support_movie_recs.head(5)

318     0.433823
296     0.384002
356     0.362216
593     0.356642
2571    0.342941
Name: movieId, dtype: float64

In [None]:
# lift of each candidate: confidence divided by support
lift_movie_recs =confidence_movie_recs/support_movie_recs

# keep candidates whose lift meets the threshold (lift > 1 means similar users like the movie more often than users overall)
lift = 2
best_movie_recs = lift_movie_recs[lift_movie_recs >= lift]

# sort by lift in descending order and keep the top n movies
top_n = 10
best_movie_recs = best_movie_recs.sort_values(ascending=False)[:top_n]

best_movie_recs.head(5)

1        4.310403
3114     3.264452
78499    2.847179
2355     2.811184
2081     2.599146
Name: movieId, dtype: float64
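The three quantities used above can be traced by hand on a tiny example. The sketch below uses a hypothetical five-user ratings table (invented for illustration, not MovieLens data) and follows the same definitions: confidence is the share of the seed movie's fans who liked a candidate, support is the share of all users who liked it, and lift is their ratio.

```python
import pandas as pd

# Hypothetical toy ratings, invented for illustration
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4, 5],
    "movieId": [1, 2, 1, 2, 1, 3, 1, 3, 3],
    "rating":  [5.0, 4.0, 4.5, 4.0, 4.0, 5.0, 2.0, 4.5, 4.0],
})

seed = 1  # the movie we assume the user liked
liked = toy_ratings[toy_ratings.rating >= 4]  # a "like" is a rating of 4 or more

# Users who liked the seed movie
fans = liked[liked.movieId == seed].userId.unique()

# confidence(seed -> m): fraction of fans who also liked movie m
confidence = liked[liked.userId.isin(fans)].movieId.value_counts() / len(fans)

# support(m): fraction of all users who liked movie m
support = liked.movieId.value_counts() / toy_ratings.userId.nunique()

# lift(seed -> m): how much more often fans like m than users overall
lift = confidence / support
print(lift.sort_values(ascending=False))
```

In this toy data, movie 3 ends up with a lift below 1, because fans of the seed movie like it less often than users in general; any lift threshold above 1 would drop it. The seed movie itself always has confidence 1, which is why movie 1 tops the notebook's own output.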

In [None]:
# Show details of the recommended movies (rows appear in movies_df order, not lift order)
movies_df[movies_df.movieId.isin(best_movie_recs.index)]

Unnamed: 0,movieId,title,genres,title_clean
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1005,1028,Mary Poppins (1964),Children|Comedy|Fantasy|Musical,Mary Poppins 1964
1047,1073,Willy Wonka & the Chocolate Factory (1971),Children|Comedy|Fantasy|Musical,Willy Wonka the Chocolate Factory 1971
1249,1282,Fantasia (1940),Animation|Children|Fantasy|Musical,Fantasia 1940
1818,1907,Mulan (1998),Adventure|Animation|Children|Comedy|Drama|Musi...,Mulan 1998
1992,2081,"Little Mermaid, The (1989)",Animation|Children|Comedy|Musical|Romance,Little Mermaid The 1989
2264,2355,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,Bugs Life A 1998
2669,2761,"Iron Giant, The (1999)",Adventure|Animation|Children|Drama|Sci-Fi,Iron Giant The 1999
3021,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
14813,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010


In [None]:
# Wrap the whole process, from movie ID to recommended-movies DataFrame, in one function

def get_movie_recs(movieId,confidence=0.1,lift=2,top_n = 10):

    # identify similar users: those who liked this movie (M1) with rating >= 4
    similar_users = ratings_df[(ratings_df.movieId == movieId) & (ratings_df.rating >= 4)]["userId"].unique()

    # collect the movies liked by those similar users (the candidate recommendations)
    similar_user_recs = ratings_df[(ratings_df.rating >=4)&(ratings_df.userId.isin(similar_users))]["movieId"]

    # confidence(M1 -> M2): fraction of similar users who also liked each candidate movie M2
    confidence_movie_recs = similar_user_recs.value_counts()/len(similar_users)

    # keep only candidates above the confidence threshold (tune between roughly 0.1 and 0.3)
    confidence_movie_recs = confidence_movie_recs[(confidence_movie_recs > confidence)]

    # support of each candidate: fraction of all users who liked it (rating >= 4)
    all_users = ratings_df[(ratings_df.movieId.isin(confidence_movie_recs.index)&(ratings_df.rating >= 4))]["movieId"]
    support_movie_recs = all_users.value_counts()/len(ratings_df.userId.unique())

    # lift of each candidate: confidence divided by support
    lift_movie_recs =confidence_movie_recs/support_movie_recs

    # keep candidates whose lift meets the threshold
    best_movie_recs = lift_movie_recs[lift_movie_recs >= lift]

    # sort by lift in descending order and keep the top n movies
    best_movie_recs = best_movie_recs.sort_values(ascending=False)[:top_n]

    # return details of the recommended movies (rows appear in movies_df order)
    return movies_df[movies_df.movieId.isin(best_movie_recs.index)][["title","genres"]]


# test the function with an arbitrary movieId
movieId = 5
get_movie_recs(movieId)

Unnamed: 0,title,genres
2,Grumpier Old Men (1995),Comedy|Romance
4,Father of the Bride Part II (1995),Comedy
73,Bed of Roses (1996),Drama|Romance
78,"Juror, The (1996)",Drama|Thriller
133,Down Periscope (1996),Comedy
138,Up Close and Personal (1996),Drama|Romance
184,Nine Months (1995),Comedy|Romance
627,Sgt. Bilko (1996),Comedy
704,Multiplicity (1996),Comedy
812,"First Wives Club, The (1996)",Comedy


## Creating an IPython widget to take a movie title as input and display its recommendations

In [None]:
from ipywidgets import widgets
from IPython.display import display

movie_input = widgets.Text(
    value = "Toy Story",
    description = "Movie Title: ",
    disabled = False
)
movie_list = widgets.Output()

def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title)>3:
            movie_df = search_movie(title)
            movie_name = movie_df["title"].values[0]
            movieId = movie_df["movieId"].values[0]

            # print() returns None, so print the message first and display only the DataFrame
            print(f"The recommended movies for {movie_name} are shown below:")
            display(get_movie_recs(movieId))


movie_input.observe(on_type,names = "value")
print("please enter the movie title you like below :)")
display(movie_input,movie_list)




please enter the movie title you like below :)


Text(value='Toy Story', description='Movie Title: ')

Output()

Observation:  
As you can see, the search string is matched against our movie database, and the best match supplies the movie ID. The system generates recommendations from this movie ID and displays them in the widget output below.

By default, I have set the recommendation sensitivity parameters as follows:  
Confidence: 0.1  
Lift: 2  

Raising the lift and confidence thresholds keeps only the movies that the user is most likely to enjoy, at the cost of fewer recommendations.

Based on my testing, this configuration returns at least one recommendation for almost 41,000 of the 62,000 movies in the database.
