## `Content Based Movie Recommender System:`

- `INPUT` : Sequence of words describing the user's preferred movie themes
- `OUTPUT` : Returns the 5 closest movie matches to the user's description
- `APPROACH` : Utilize TF-IDF with cosine similarity to recommend movies based on the user query
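
The approach can be illustrated on a tiny scale before touching the real data. The sketch below is a minimal example under stated assumptions: the three documents and the query are made up for illustration and are not taken from the IMDB dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus (not from the IMDB dataset)
docs = [
    "a detective hunts a serial killer in the city",
    "two friends take a road trip across the country",
    "a young wizard attends a school of magic",
]
query = "detective chasing a killer"

# Fit TF-IDF on the documents, then project the query into the same space
vectorizer = TfidfVectorizer(stop_words="english")
doc_vecs = vectorizer.fit_transform(docs)
query_vec = vectorizer.transform([query])

# One cosine similarity score per document
scores = cosine_similarity(query_vec, doc_vecs).flatten()
best = int(scores.argmax())
print(docs[best], scores[best])
```

Documents that share no vocabulary with the query score exactly 0, so only the first document is a candidate match here.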

### Dataset: 
A small, publicly available movie dataset, the "IMDB Movie Dataset" from Kaggle. A toy version (roughly the first 5 rows) is kept in our GitHub repo for reference. The full dataset used for the task is loaded from a public Google Drive link rather than being committed to the repo.

In [43]:
#Import necessary libraries & modules
import pandas as pd
import numpy as np
import gdown
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [45]:
#If you want to download the whole dataset, please run the commented lines below

#url = "https://drive.google.com/uc?id=1iTK_Z80gzNagzo9e-NxT8IIZWQ-2q3FI"
#output = "data.csv"
#gdown.download(url, output, quiet=False)

In [47]:
#URL link to load the csv directly to our pandas dataframe
url = "https://drive.google.com/uc?id=1iTK_Z80gzNagzo9e-NxT8IIZWQ-2q3FI"
df = pd.read_csv(url)
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [49]:
df.info() #Displays type and non-null count info for each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB


### Data pre-processing pipeline:

In [26]:
#Helper Functions in the data pipeline
def preprocess_text(text):
    
    text = text.lower() #Converts to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation)) #Removes punctuation 
    text = text.strip() #Removes leading and trailing whitespace
    return text

def preprocess_data(df, text_column="Overview", sample_size=500, random_state=42):

    #Selects only the columns needed (title plus the chosen text column)
    df = df[['Series_Title', text_column]].copy()
    df.rename(columns={'Series_Title': 'Movie_Title'}, inplace=True)
    
    #Randomly samples sample_size rows (500 by default) from the dataset
    if len(df) > sample_size:
        df_sample = df.sample(n=sample_size, random_state=random_state).reset_index(drop=True)
    else:
        df_sample = df.copy().reset_index(drop=True)
    
    df_sample = df_sample.dropna(subset=[text_column])
    
    #Applies text cleaning to the text column
    df_sample[text_column] = df_sample[text_column].apply(preprocess_text)
    df_sample.reset_index(drop=True, inplace=True)
    
    return df_sample
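
To make the effect of the cleaning step concrete, here is a small sketch that re-declares `preprocess_text` as defined above and runs it on two sample strings. The sample strings are illustrative assumptions, not rows from the dataset.

```python
import string

def preprocess_text(text):
    # Same steps as the helper above: lowercase, strip punctuation, trim ends
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.strip()

# Hypothetical sample inputs for illustration
cleaned = preprocess_text("  Two imprisoned men bond over a number of years...  ")
possessive = preprocess_text("A crime dynasty's aging patriarch")
spaced = preprocess_text("a  b")

print(repr(cleaned))
print(repr(possessive))
print(repr(spaced))
```

Two side effects are worth knowing: removing punctuation fuses possessives (`dynasty's` becomes `dynastys`), and `strip()` only trims the ends, so repeated spaces inside the text are kept. Neither is likely to matter much for TF-IDF, since the default tokenizer splits on whitespace runs anyway.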

In [51]:
df2 = preprocess_data(df) #Data after pre-processing
df2.info()
df2

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Movie_Title  500 non-null    object
 1   Overview     500 non-null    object
dtypes: object(2)
memory usage: 7.9+ KB


Unnamed: 0,Movie_Title,Overview
0,Trois couleurs: Bleu,a woman struggles to find a way to live her li...
1,Captain America: The Winter Soldier,as steve rogers struggles to embrace his role ...
2,Wreck-It Ralph,a video game villain wants to be a hero and se...
3,The Sandlot,in the summer of 1962 a new kid in town is tak...
4,Gandhi,the life of the lawyer who became the famed le...
...,...,...
495,Monty Python and the Holy Grail,king arthur and his knights of the round table...
496,Les diaboliques,the wife and mistress of a loathed school prin...
497,Dog Day Afternoon,three amateur bank robbers plan to hold up a b...
498,Sabrina,a playboy becomes interested in the daughter o...


### Processing data using TF-IDF and computing their cosine similarities:

In [57]:
#Vectorizes the movie overviews with scikit-learn's TfidfVectorizer, which also removes its built-in English stop words
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_vec = tfidf_vectorizer.fit_transform(df2['Overview']) #Returns a sparse matrix (rows: movies, columns: terms)
#print(tfidf_vec)

In [61]:
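def recommend_movies(query, df, top_n=5):
    """Returns the top_n movies in df most similar to query. Relies on the global tfidf_vectorizer and tfidf_vec, which must have been fitted on this same df."""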
    query = preprocess_text(query)
    query_vec = tfidf_vectorizer.transform([query]) #Transforms the given user query into the TF-IDF vector space
    
    #Computes cosine similarity between the query and all movie descriptions
    cosine_sim = cosine_similarity(query_vec, tfidf_vec).flatten()
    
    #Gets indices of the top N most similar movies, i.e. movies with the highest cosine similarity scores
    top_indices = cosine_sim.argsort()[::-1][:top_n]
    
    recommended = df.iloc[top_indices].copy()
    recommended["similarity"] = cosine_sim[top_indices]
    
    return recommended[["Movie_Title", "similarity"]].reset_index(drop=True)
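
The function above reads `tfidf_vectorizer` and `tfidf_vec` from the notebook's global scope, so it silently depends on them matching the `df` that is passed in. One possible alternative, sketched below, passes both explicitly and also drops matches with zero similarity. The name `recommend_movies_explicit`, its extra parameters, and the toy movie titles are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend_movies_explicit(query, df, vectorizer, doc_matrix, top_n=5):
    """Hypothetical variant: same ranking logic, but no reliance on globals."""
    query_vec = vectorizer.transform([query.lower()])
    scores = cosine_similarity(query_vec, doc_matrix).flatten()
    top_indices = scores.argsort()[::-1][:top_n]
    result = df.iloc[top_indices].copy()
    result["similarity"] = scores[top_indices]
    # Drop rows that share no vocabulary with the query
    result = result[result["similarity"] > 0]
    return result[["Movie_Title", "similarity"]].reset_index(drop=True)

# Made-up toy data for illustration only
toy = pd.DataFrame({
    "Movie_Title": ["Heist Night", "Space Farm", "Royal Ball"],
    "Overview": [
        "three robbers plan a bank heist",
        "a farmer grows crops on a distant planet",
        "a princess hosts a ball at the palace",
    ],
})
vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(toy["Overview"])

result = recommend_movies_explicit("bank robbers", toy, vec, matrix, top_n=2)
print(result)
```

Even though `top_n=2` is requested, only one row comes back here, because the other two overviews have no terms in common with the query.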

### Let's test our movie recommender!!

In [101]:
#User query to the recommender system. Please feel free to change the query below!
query = "I love comedy movies."

recommendations = recommend_movies(query, df2, top_n=2)
recommendations

Unnamed: 0,Movie_Title,similarity
0,50/50,0.199758
1,Barton Fink,0.19143
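
The similarity scores above are fairly low. One plausible reason is that only a few query terms survive tokenization and stop-word removal, and any term missing from the fitted vocabulary contributes nothing. The sketch below shows one way to check this, using `build_analyzer()` on a fresh vectorizer and a single made-up overview (an assumption for illustration, not the real fitted vocabulary).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")

# The analyzer applies the same lowercasing, tokenization and
# stop-word filtering that transform() uses internally
analyzer = vectorizer.build_analyzer()
tokens = analyzer("i love comedy movies")
print(tokens)

# Hypothetical one-document corpus standing in for the movie overviews
vectorizer.fit(["a video game villain wants to be a hero"])

# Query terms that the fitted vocabulary actually knows about
known = [t for t in tokens if t in vectorizer.vocabulary_]
print(known)
```

Running the same check against the notebook's own `tfidf_vectorizer` would show which words of a query can influence the ranking at all.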


### `Salary expectation per month:`
1600-1800$ per month. However, I’m flexible and open to further negotiation.