# <b>Machine Learning Mini Project.</b>
## <b>Title : Movie Recommender System.</b>
### Name : Sarang Manoj Pekhale.
### Roll No. : MT2212

### <b>Introduction : </b>
<ol>
<li><b>Movie Recommender System :</b></li>
A movie recommender system is a specialized software application or algorithm that suggests movies to users based on their individual preferences and historical interactions with movie-related data. These systems leverage various techniques, such as collaborative filtering, content-based filtering, and hybrid methods, to provide personalized movie recommendations. They enhance user experiences by helping individuals discover and enjoy films that align with their tastes, ultimately increasing user engagement and satisfaction in online movie platforms and streaming services.

For this project, I will build a content-based movie recommendation system.
Building it involves the following input and steps:

<ul>
<li> Movie attributes as the input data.</li> 
<li> Convert text to vectors (text vectorization) using the bag-of-words technique.</li> 
<li> Use cosine similarity to find the five vectors most similar to the input given by the user.</li> 
</ul>
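
The vectorization and similarity steps can be illustrated with a minimal sketch. It uses only the Python standard library rather than any particular vectorization library. The three short tag strings and the helper names `bag_of_words`, `cosine_similarity`, and `most_similar` are invented for illustration and are not part of the project code.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count how often each whitespace-separated token occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tag strings, invented for this example.
tags = {
    "Movie A": "space alien battle future",
    "Movie B": "space alien romance future",
    "Movie C": "wedding comedy romance",
}
vectors = {title: bag_of_words(text) for title, text in tags.items()}

def most_similar(title, top_n=5):
    """Rank every other movie by cosine similarity to `title`."""
    scores = [
        (other, cosine_similarity(vectors[title], vec))
        for other, vec in vectors.items() if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

ranking = most_similar("Movie A")
```

Here "Movie B" shares three of its four tokens with "Movie A" and therefore ranks first, while "Movie C" shares none and scores zero.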

### <b>Libraries required :</b>
<ul>
<li> <b>numpy</b> : Working with arrays, matrices, and mathematical functions.</li> 
<li> <b>pandas</b> : Data structures (DataFrame and Series) for handling structured data efficiently.</li>
<li> <b>ast</b> : Parses and evaluates literal expressions.</li>
</ul>

In [151]:
import numpy as np
import pandas as pd
import ast

### <b>Data Import and Preprocessing : For EDA.</b>
In this section, I describe how the data required for the analysis is imported, cleaned, and preprocessed. The data is loaded from external CSV files, checked, and prepared for the subsequent tasks.

<ul>
<li> <b>Data Import</b> : Two CSV files are imported into the Python environment using pandas, creating two DataFrames, df_credits and df_movies, to store and work with the data from these files.</li> 
</ul>

In [152]:
df_credits = pd.read_csv("tmdb_5000_credits.csv")
df_movies = pd.read_csv("tmdb_5000_movies.csv")

<ul>
    <li> Name of data : TMDB 5000 Movie Dataset.</li>
    <li> Data source : Kaggle.</li> 
    <li> Data size or dimensions : Number of rows and columns in the dataset.</li>
</ul>

In [153]:
print('shape df_credits ',df_credits.shape)
print('shape df_movies ',df_movies.shape)

shape df_credits  (4803, 4)
shape df_movies  (4803, 20)


<ul>
<li> Preview of the CREDITS data.</li>
</ul>


In [154]:
df_credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


<ul>
<li> Preview of the MOVIES data.</li>
</ul>

In [155]:
df_movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


<ul>
    <li> Rename the 'movie_id' column to 'id' in the CREDITS dataframe.</li>
    <li> Remove the 'title' column from the CREDITS dataframe to avoid duplicate columns.</li> 
    <li> Merge the two dataframes CREDITS and MOVIES on the common 'id' column.</li> 
</ul>

In [156]:
df_credits = df_credits.rename(columns={'movie_id': 'id'})
df_credits=df_credits.drop('title',axis=1)

df = df_credits.merge(df_movies, on='id')
print('shape df ',df.shape)

shape df  (4803, 22)


<ul>
    <li> Descriptive statistics of a DataFrame.</li>
</ul>

In [157]:
df.describe()

Unnamed: 0,id,budget,popularity,revenue,runtime,vote_average,vote_count
count,4803.0,4803.0,4803.0,4803.0,4801.0,4803.0,4803.0
mean,57165.484281,29045040.0,21.492301,82260640.0,106.875859,6.092172,690.217989
std,88694.614033,40722390.0,31.81665,162857100.0,22.611935,1.194612,1234.585891
min,5.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,9014.5,790000.0,4.66807,0.0,94.0,5.6,54.0
50%,14629.0,15000000.0,12.921594,19170000.0,103.0,6.2,235.0
75%,58610.5,40000000.0,28.313505,92917190.0,118.0,6.8,737.0
max,459488.0,380000000.0,875.581305,2787965000.0,338.0,10.0,13752.0


<ul>
    <li> Summary of the DataFrame's basic information.</li>
</ul>

In [158]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    4803 non-null   int64  
 1   cast                  4803 non-null   object 
 2   crew                  4803 non-null   object 
 3   budget                4803 non-null   int64  
 4   genres                4803 non-null   object 
 5   homepage              1712 non-null   object 
 6   keywords              4803 non-null   object 
 7   original_language     4803 non-null   object 
 8   original_title        4803 non-null   object 
 9   overview              4800 non-null   object 
 10  popularity            4803 non-null   float64
 11  production_companies  4803 non-null   object 
 12  production_countries  4803 non-null   object 
 13  release_date          4802 non-null   object 
 14  revenue               4803 non-null   int64  
 15  runtime              

<ul>
    <li> Identify the columns in the DataFrame df that have numeric datatype.</li>
</ul>

In [159]:
num_col = df.dtypes[df.dtypes != 'object'].index
num_col

Index(['id', 'budget', 'popularity', 'revenue', 'runtime', 'vote_average',
       'vote_count'],
      dtype='object')

<ul>
    <li> Count the number of missing (null) values in each column of a DataFrame df. </li>
</ul>

In [160]:
df.isnull().sum()

id                         0
cast                       0
crew                       0
budget                     0
genres                     0
homepage                3091
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64

<ul>
    <li> The 'homepage', 'tagline', and 'spoken_languages' columns are removed. The first two contain many null values (3091 and 844 respectively). The third is dropped because I am treating this as a Hollywood movie dataset, where the language is assumed to be English. </li>
</ul>

In [161]:
# Drop the three columns in a single call.
df = df.drop(columns=['homepage', 'tagline', 'spoken_languages'])

<ul>
    <li> Remove rows with missing values from the DataFrame df. </li>
</ul>

In [162]:
df.dropna(inplace=True)

In [163]:
df.isnull().sum()

id                      0
cast                    0
crew                    0
budget                  0
genres                  0
keywords                0
original_language       0
original_title          0
overview                0
popularity              0
production_companies    0
production_countries    0
release_date            0
revenue                 0
runtime                 0
status                  0
title                   0
vote_average            0
vote_count              0
dtype: int64

<ul>
    <li> Check for duplicate columns and duplicate rows in the DataFrame df. </li>
</ul>


In [164]:
print(df.columns.duplicated().sum())
print(df.duplicated().sum())

0
0


<ul>
    <li> 'original_title' and 'title' are approximately the same, so I will remove the 'original_title' column. </li>
</ul>

In [165]:
df=df.drop('original_title',axis=1)
df.columns

Index(['id', 'cast', 'crew', 'budget', 'genres', 'keywords',
       'original_language', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime', 'status',
       'title', 'vote_average', 'vote_count'],
      dtype='object')

<ul>
    <li> Several columns store their values as strings containing a list of dictionaries, in the format shown below. I want to extract the names from those dictionaries.</li>
</ul>

In [166]:
df.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

<ul>
    <li> The function converts a string containing a list of dictionaries into a list of the values stored under the "name" key of those dictionaries. </li>
    <li> ast.literal_eval(obj) safely parses a string containing a Python literal (here, a list of dictionaries) and returns the corresponding object, without executing arbitrary code.</li>
</ul>
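
As a minimal sketch of what `ast.literal_eval` does with such a string, the snippet below parses a shortened version of the genres value shown above and also shows that a non-literal expression is rejected rather than executed. The variable names are illustrative only.

```python
import ast

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval turns the string into real Python objects (a list of dicts).
parsed = ast.literal_eval(raw)
names = [entry["name"] for entry in parsed]

# Anything that is not a plain literal (for example a function call) is
# rejected with ValueError instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```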

In [167]:
def convert(obj):
    # Parse the stringified list of dicts and collect every "name" value.
    L = []
    for i in ast.literal_eval(obj):
        L.append(i["name"])
    return L

In [168]:
df["keywords"] = df["keywords"].apply(convert)
df["genres"] = df["genres"].apply(convert)
df["production_companies"] = df["production_companies"].apply(convert)
df["production_countries"] = df["production_countries"].apply(convert)

<ul>
    <li> I will consider only the first three names of the cast.</li>
</ul>

In [None]:
df.iloc[0].cast

<ul>
    <li> Same function as convert, but keeping only the first three entries.</li>
</ul>

In [170]:
def convert3(obj):
    # Parse the stringified list of dicts and keep only the first three names.
    return [i["name"] for i in ast.literal_eval(obj)[:3]]

In [171]:
df["cast"] = df["cast"].apply(convert3)

<ul>
    <li> From the crew, I will keep only the director's name.</li>
</ul>

In [None]:
df.iloc[0].crew

<ul>
    <li> Same idea as convert, but this time keeping only the name of the first entry whose "job" is "Director".</li>
</ul>

In [173]:
def fetch_director(obj):
    # Return a one-element list with the first crew member whose job is "Director".
    L = []
    for i in ast.literal_eval(obj):
        if i["job"] == "Director":
            L.append(i["name"])
            break
    return L

In [174]:
df["crew"] = df["crew"].apply(fetch_director)
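
The three helpers above differ only in how many names they keep and whether they filter on the "job" key. A minimal sketch of how they could be folded into one function follows. The name `extract_names` and its `limit` and `job` parameters are hypothetical and not part of the notebook, and the sample strings are made up in the same shape as the dataset columns.

```python
import ast

def extract_names(obj, limit=None, job=None):
    """Return the "name" values from a stringified list of dicts.

    limit: keep at most this many names (None keeps all).
    job:   if given, keep only entries whose "job" equals this value.
    """
    names = []
    for entry in ast.literal_eval(obj):
        if limit is not None and len(names) >= limit:
            break
        if job is not None and entry.get("job") != job:
            continue
        names.append(entry["name"])
    return names

# Made-up sample strings in the same shape as the dataset columns.
cast = '[{"name": "A"}, {"name": "B"}, {"name": "C"}, {"name": "D"}]'
crew = '[{"job": "Editor", "name": "E"}, {"job": "Director", "name": "F"}]'

all_names = extract_names(cast)                          # like convert
first_three = extract_names(cast, limit=3)               # like convert3
director = extract_names(crew, limit=1, job="Director")  # like fetch_director
```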


### <b>Data Import and Preprocessing : For Model building.</b>
In this section, I repeat the preprocessing steps performed earlier, keeping only what the model needs.

In [175]:
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

In [176]:
movies = movies.merge(credits,on="title")
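
Note that this cell merges on "title", whereas the EDA section merged on 'id'. Titles are not guaranteed to be unique, and a merge on a non-unique key produces one row per matching pair. That is consistent with `movies.info()` below reporting 4809 rows instead of 4803. The sketch below shows the effect on two tiny frames whose titles and ids are invented for illustration. It assumes pandas is installed.

```python
import pandas as pd

# Two made-up films share the title "Remake" but have different ids.
toy_movies = pd.DataFrame({"id": [1, 2, 3],
                           "title": ["Remake", "Remake", "Solo"]})
toy_credits = pd.DataFrame({"movie_id": [1, 2, 3],
                            "title": ["Remake", "Remake", "Solo"],
                            "cast": ["x", "y", "z"]})

# Merging on the title pairs every "Remake" with every "Remake": 2*2 + 1 rows.
by_title = toy_movies.merge(toy_credits, on="title")

# Merging on the unique id keeps exactly one row per film.
by_id = toy_movies.merge(toy_credits.drop(columns="title"),
                         left_on="id", right_on="movie_id")

rows_by_title = len(by_title)
rows_by_id = len(by_id)
```

Whether the extra rows matter depends on the use case. Merging on the id column is one way to avoid them.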

<ul>
    <li> Based on the model's requirements, I keep only the following features: 'movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew'.</li>
    <li> The text features among these will later be merged into a single 'tags' column.</li>
</ul>

In [177]:
movies = movies[["movie_id","title","overview","genres","keywords","cast","crew"]]

<ul>
    <li> Summary of the Dataframe movies.</li>
</ul>

In [178]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4809 non-null   int64 
 1   title     4809 non-null   object
 2   overview  4806 non-null   object
 3   genres    4809 non-null   object
 4   keywords  4809 non-null   object
 5   cast      4809 non-null   object
 6   crew      4809 non-null   object
dtypes: int64(1), object(6)
memory usage: 263.1+ KB


<ul>
    <li> Preview of the Dataframe movies.</li>
</ul>

In [179]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


<ul>
    <li> Checking for null values in columns.</li>
</ul>

In [180]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

<ul>
    <li> Removing rows that contain null values.</li>
</ul>

In [181]:
movies.dropna(inplace=True)

In [182]:
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

<ul>
    <li> Checking for duplicate columns and duplicate rows.</li>
</ul>

In [183]:
print(movies.columns.duplicated().sum())
print(movies.duplicated().sum())

0
0


<ul>
<li> Applying the convert, convert3, and fetch_director functions defined earlier, which turn each string containing a list of dictionaries into a list of names. </li>
</ul>

In [184]:
movies["genres"] = movies["genres"].apply(convert)
movies["keywords"] = movies["keywords"].apply(convert)
movies["cast"] = movies["cast"].apply(convert3)
movies["crew"] = movies["crew"].apply(fetch_director)

<ul>
<li> Splitting each overview string into a list of words. </li>
</ul>

In [185]:
movies["overview"][0]
movies["overview"] = movies["overview"].apply(lambda x:x.split())

<ul>
<li> Removing the spaces inside each genre, keyword, cast, and crew entry so that multi-word names become single tokens, then merging all the textual features into one 'tags' column for each observation. </li>
</ul>
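
The reason for removing the spaces can be seen in a small sketch. The two cast lists and the helper name `tokens` are illustrative only. If the spaces were kept, two different people sharing a first name would contribute a common token and make unrelated movies look more similar.

```python
# Illustrative example: two different people who share a first name.
cast_a = ["Sam Worthington"]
cast_b = ["Sam Mendes"]

def tokens(names, strip_spaces):
    """Split names into whitespace tokens, optionally joining each name first."""
    if strip_spaces:
        names = [name.replace(" ", "") for name in names]
    return set(" ".join(names).split())

# With spaces kept, both lists appear to share the token "Sam".
shared_with_spaces = tokens(cast_a, False) & tokens(cast_b, False)

# With spaces removed, each full name is one distinct token.
shared_without_spaces = tokens(cast_a, True) & tokens(cast_b, True)
```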

In [186]:
movies["genres"] = movies["genres"].apply(lambda x:[i.replace(" ","") for i in x])
movies["keywords"] = movies["keywords"].apply(lambda x:[i.replace(" ","") for i in x])
movies["cast"] = movies["cast"].apply(lambda x:[i.replace(" ","") for i in x])
movies["crew"] = movies["crew"].apply(lambda x:[i.replace(" ","") for i in x])

In [187]:
movies["tags"] = movies["overview"]+movies["genres"]+movies["keywords"]+movies["cast"]+movies["crew"]

<ul>
<li> Preview of Dataframe movies.</li>
</ul>

In [188]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


<ul>
<li> Creating a new DataFrame new_df with only three features: "movie_id", "title", and "tags". </li>
<li> Joining the list of strings in 'tags' into a single space-separated string. </li>
</ul>

In [189]:
# Take an explicit copy so later assignments do not raise SettingWithCopyWarning.
new_df = movies[["movie_id","title","tags"]].copy()
new_df["tags"] = new_df["tags"].apply(lambda x:" ".join(x))


<ul>
<li> Preview of Dataframe new_df.</li>
</ul>

In [190]:
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."


<ul>
<li> Preview of 'tags'.</li>
</ul>

In [191]:
new_df["tags"][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

<ul>
<li> To make the text uniform and avoid case-sensitivity issues, I will convert every letter to lower case.</li>
</ul>

In [192]:
new_df["tags"] = new_df["tags"].apply(lambda x:x.lower())



In [193]:
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."


<ul>
<li> Creating a function for stemming the text using the Porter Stemmer from the NLTK (Natural Language Toolkit) library in Python. Stemming is a text normalization technique that reduces words to their base or root form. </li>
</ul>

In [194]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

<ul>
<li> Converting every word in 'tags' to its base or root form. </li>
</ul>

In [195]:
new_df["tags"] = new_df["tags"].apply(stem)



In [196]:
new_df["tags"][0]

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron'
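
In the output above, tokens such as 'century,' and 'civilization.' keep their trailing punctuation because `stem` splits on whitespace only, so they appear to pass through unchanged while the other words are reduced. A minimal sketch of one possible remedy is to strip ASCII punctuation before the text is split. The helper name `strip_punctuation` is hypothetical and this step is not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation so 'century,' and 'century' become one token."""
    return text.translate(_PUNCT_TABLE)

sample = "in the 22nd century, a paraplegic marine is dispatched to the moon pandora"
cleaned = strip_punctuation(sample)
cleaned_tokens = cleaned.split()
```

Under this assumption, the helper would be applied to the 'tags' column before `stem`, for example with `new_df["tags"].apply(strip_punctuation)`.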