## Content-Based Movie Recommendation System
A content-based recommender uses attributes of each movie, such as genre, director, description, and cast, to make suggestions. The intuition is that a user who liked a particular movie or show is likely to enjoy others with similar attributes.

The following libraries are used:
<ol>
<li><b>NumPy: </b>Used for working with arrays.</li>
<li><b>Pandas: </b>Used for loading and analysing tabular data.</li>
<li><b>Matplotlib.pyplot: </b>Used for visual representation such as plotting graphs.</li>
<li><b>Scikit-learn (sklearn): </b>Used for its machine learning tools, here text vectorization and cosine similarity.</li>
<li><b>ast: </b>Standard-library module for working with Python's abstract syntax; here its literal_eval function turns stringified lists of dictionaries into Python objects.</li>
</ol>

 


In [74]:
import numpy as np
import pandas as pd
import ast

Loading the datasets i.e. Movies and Credits.

In [75]:
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

<ul>
<li> Preview of the MOVIES data.</li>
</ul>

In [76]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


<ul>
<li> Preview of the CREDITS data.</li>
</ul>

In [77]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


<ul>
<li> Merge the two dataframes CREDITS and MOVIES on the common 'title' column.</li>
</ul>

In [78]:
movies = movies.merge(credits,on="title")
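
One thing to watch with a title-based merge: titles are not guaranteed to be unique, and when two films share a title each one is paired with both credit rows. The sketch below uses made-up rows to show the effect, and assumes that `id` in MOVIES corresponds to `movie_id` in CREDITS (the Avatar previews above show 19995 in both). It is an illustration, not a change to the notebook's pipeline.

```python
import pandas as pd

# Toy stand-ins for the two datasets; the rows are invented for illustration.
movies_toy = pd.DataFrame({"id": [1, 2, 3],
                           "title": ["Batman", "Batman", "Avatar"]})
credits_toy = pd.DataFrame({"movie_id": [1, 2, 3],
                            "title": ["Batman", "Batman", "Avatar"],
                            "cast": ["cast A", "cast B", "cast C"]})

# Merging on title: each "Batman" row matches both "Batman" credit rows.
by_title = movies_toy.merge(credits_toy, on="title")

# Merging on the identifier keeps exactly one row per film.
by_id = movies_toy.merge(credits_toy.drop(columns="title"),
                         left_on="id", right_on="movie_id")

print(len(by_title))  # 5
print(len(by_id))     # 3
```

Comparing the row count before and after the merge is a quick way to see whether this happened on the real data.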

In [79]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


<ul>
    <li> The model only needs the following features: 'movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew'.</li>
    <li> These features will later be merged into a single 'tags' column.</li>
</ul>

In [80]:
movies = movies[["movie_id","title","overview","genres","keywords","cast","crew"]]

<ul>
    <li> Summary of the Dataframe movies.</li>
</ul>

In [81]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4809 non-null   int64 
 1   title     4809 non-null   object
 2   overview  4806 non-null   object
 3   genres    4809 non-null   object
 4   keywords  4809 non-null   object
 5   cast      4809 non-null   object
 6   crew      4809 non-null   object
dtypes: int64(1), object(6)
memory usage: 263.1+ KB


<ul>
    <li> Preview of the Dataframe movies.</li>
</ul>

In [82]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


<ul>
    <li> Checking for null values in columns.</li>
</ul>

In [83]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

<ul>
    <li> Removing null values from rows.</li>
</ul>

In [84]:
movies.dropna(inplace=True)
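
Note that `dropna` removes rows but keeps the original index labels, so labels and row positions can diverge afterwards. That matters whenever a label taken from `.index` is later used to index a NumPy array by position. A small sketch with toy data; resetting the index is one possible remedy and is not part of the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B", "C"],
                    "overview": ["x", None, "z"]})

dropped = toy.dropna()
print(list(dropped.index))       # [0, 2]  -> label 2 now sits at position 1

# Renumber the rows so that labels and positions agree again.
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))    # [0, 1]
```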

In [85]:
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

<ul>
    <li> Checking for duplicate rows.</li>
</ul>

In [86]:
movies.duplicated().sum()

0

In [87]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

<ul>
<li> The function below converts a string containing a list of dictionaries into a list of the values stored under the "name" key of each dictionary. </li>
</ul>

In [88]:
def convert(obj):
    # import ast
    L = []
    for i in ast.literal_eval(obj):
        L.append(i["name"])
    return L
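
To make the parsing step concrete, here is a short sketch of what `ast.literal_eval` returns for a shortened version of the genres string shown above. The comparison with `json.loads` assumes the cell is valid JSON, which holds for this sample but has not been checked for every row.

```python
import ast
import json

# Shortened version of the genres value printed above.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(sample)          # string -> list of dicts
print(type(parsed).__name__, type(parsed[0]).__name__)  # list dict

names = [d["name"] for d in parsed]
print(names)                                # ['Action', 'Adventure']

# For this sample the JSON parser gives the same result.
print(json.loads(sample) == parsed)         # True
```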

In [89]:
movies["genres"] = movies["genres"].apply(convert)

In [90]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [91]:
movies["keywords"] = movies["keywords"].apply(convert)

In [92]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


<ul>
    <li> Same as convert, but keeps only the first three entries; used here for the cast column.</li>
</ul>

In [93]:
def convert3(obj):
    # import ast
    L = []
    counter = 0
    for i in ast.literal_eval(obj):
        if counter != 3:
            L.append(i["name"])
            counter += 1
        else:
            break
    return L
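
The counter-based loop can also be expressed with a slice. The helper below is hypothetical (the name `names_from` and its `limit` parameter are not from the notebook) and the sample string is made up; it is only a sketch of an equivalent, more compact form.

```python
import ast

def names_from(obj, limit=None):
    """Hypothetical helper: parse `obj` and return up to `limit` 'name' values.

    With limit=None the slice keeps every entry, like convert();
    with limit=3 it behaves like convert3().
    """
    return [d["name"] for d in ast.literal_eval(obj)[:limit]]

# Made-up sample with four entries.
cast = '[{"name": "A"}, {"name": "B"}, {"name": "C"}, {"name": "D"}]'

print(names_from(cast, limit=3))   # ['A', 'B', 'C']
print(names_from(cast))            # ['A', 'B', 'C', 'D']
```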

In [94]:
movies["cast"] = movies["cast"].apply(convert3)

In [95]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


<ul>
    <li> Similar to convert, but keeps only the name of the first crew member whose job is "Director".</li>
</ul>

In [96]:
def fetch_director(obj):
    # import ast
    L = []
    for i in ast.literal_eval(obj):
        if i["job"] == "Director":
            L.append(i["name"])
            break
    return L

In [97]:
movies["crew"] = movies["crew"].apply(fetch_director)

In [98]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [99]:
movies["overview"][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

<ul>
<li> Splitting each overview string into a list of words. </li>
</ul>

In [100]:
movies["overview"] = movies["overview"].apply(lambda x:x.split())

In [101]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


<ul>
<li> Removing the spaces inside each genre, keyword, cast, and crew entry so that multi-word names become single tokens, then merging all the textual features into one column for each movie. </li>
</ul>

In [102]:
movies["genres"] = movies["genres"].apply(lambda x:[i.replace(" ","") for i in x])

In [103]:
movies["keywords"] = movies["keywords"].apply(lambda x:[i.replace(" ","") for i in x])
movies["cast"] = movies["cast"].apply(lambda x:[i.replace(" ","") for i in x])
movies["crew"] = movies["crew"].apply(lambda x:[i.replace(" ","") for i in x])
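
Why remove the spaces? Once the tags are split into words, two different people who share a first name would otherwise contribute the same token. The sketch below illustrates this with two names chosen purely as an example; the `tokens` helper is hypothetical and only mimics whitespace tokenization.

```python
def tokens(names, keep_spaces):
    """Join the names into one string and split it into lowercase word tokens."""
    joined = " ".join(n if keep_spaces else n.replace(" ", "") for n in names)
    return set(joined.lower().split())

film_a = ["Sam Worthington"]
film_b = ["Sam Mendes"]   # illustrative pairing, not taken from the dataset

# With spaces kept, the two films appear to share a tag.
shared_with_spaces = tokens(film_a, True) & tokens(film_b, True)
print(shared_with_spaces)      # {'sam'}

# With spaces removed, each full name is its own token and nothing overlaps.
shared_without_spaces = tokens(film_a, False) & tokens(film_b, False)
print(shared_without_spaces)   # set()
```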

In [104]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]


In [105]:
movies["tags"] = movies["overview"]+movies["genres"]+movies["keywords"]+movies["cast"]+movies["crew"]

In [106]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


<ul>
<li> Creating a new dataframe new_df with only three features: "movie_id", "title", "tags". </li>

</ul>

In [107]:
new_df = movies[["movie_id","title","tags"]]

In [108]:
new_df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."
...,...,...,...
4804,9367,El Mariachi,"[El, Mariachi, just, wants, to, play, his, gui..."
4805,72766,Newlyweds,"[A, newlywed, couple's, honeymoon, is, upended..."
4806,231617,"Signed, Sealed, Delivered","[""Signed,, Sealed,, Delivered"", introduces, a,..."
4807,126186,Shanghai Calling,"[When, ambitious, New, York, attorney, Sam, is..."


<ul>
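<li> Joining the list of strings in 'tags' into a single space-separated string. </li>
<li> Pandas prints a SettingWithCopyWarning here because new_df was created by selecting columns from movies; the assignment still takes effect on new_df, as the preview that follows shows. </li>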
</ul>

In [109]:
new_df["tags"] = new_df["tags"].apply(lambda x:" ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df["tags"] = new_df["tags"].apply(lambda x:" ".join(x))
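
One common way to avoid this warning is to take an explicit copy when creating the subset, so pandas knows the new frame is independent. A minimal sketch on toy data (the frame and column values are invented for illustration):

```python
import pandas as pd

movies_toy = pd.DataFrame({"title": ["Avatar"],
                           "tags": [["in", "the", "22nd"]],
                           "extra": [0]})

# .copy() makes the subset an independent frame, so later column
# assignments are unambiguous and no warning is raised.
new_toy = movies_toy[["title", "tags"]].copy()
new_toy["tags"] = new_toy["tags"].apply(" ".join)

print(new_toy.loc[0, "tags"])      # in the 22nd
print(movies_toy.loc[0, "tags"])   # ['in', 'the', '22nd']  (source frame unchanged)
```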


In [110]:
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."


In [111]:
new_df["tags"][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

<ul>
<li> To make the text uniform and avoid case-sensitivity issues, I convert every letter to lowercase.</li>
</ul>

In [112]:
new_df["tags"] = new_df["tags"].apply(lambda x:x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df["tags"] = new_df["tags"].apply(lambda x:x.lower())


In [113]:
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."


<ul>
<li> Defining a function that stems the text with the Porter stemmer from the NLTK (Natural Language Toolkit) library. Stemming is a text normalization technique that reduces words to a root form, so that variants such as "dispatched" and "dispatch" map to the same token. </li>
</ul>

In [114]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [115]:
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

In [116]:
new_df["tags"] = new_df["tags"].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df["tags"] = new_df["tags"].apply(stem)


In [117]:
new_df["tags"][0]

'in the 22nd century, a parapleg marin is dispatch to the moon pandora on a uniqu mission, but becom torn between follow order and protect an alien civilization. action adventur fantasi sciencefict cultureclash futur spacewar spacecoloni societi spacetravel futurist romanc space alien tribe alienplanet cgi marin soldier battl loveaffair antiwar powerrel mindandsoul 3d samworthington zoesaldana sigourneyweav jamescameron'

<ul>
<li> I use CountVectorizer rather than TF-IDF, imported with "from sklearn.feature_extraction.text import CountVectorizer". The variable cv holds the CountVectorizer object, which keeps at most the 5000 most frequent terms and removes English stop words. </li>
</ul>

In [118]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 5000, stop_words = "english")

<ul>
<li>Transforming the 'tags' column into a matrix of token counts, with one row per movie. </li>
</ul>

In [119]:
vectors = cv.fit_transform(new_df["tags"]).toarray()

In [120]:
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

<ul>
<li>Names of columns or features of the above matrix. </li>
</ul>

In [121]:
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

<ul>
<li>Now I calculate the cosine similarity matrix from the count vectors. </li>
</ul>

In [122]:
from sklearn.metrics.pairwise import cosine_similarity

In [123]:
similarity = cosine_similarity(vectors)
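
Cosine similarity between two vectors a and b is their dot product divided by the product of their lengths, a·b / (‖a‖‖b‖). For count vectors it ranges from 0 (no shared terms) to 1 (same direction). The sketch below computes it by hand with NumPy for three small made-up count vectors; the `cosine` function is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two non-zero 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up count vectors over a five-term vocabulary.
v0 = [0, 1, 0, 0, 1]
v1 = [1, 1, 0, 0, 1]
v2 = [0, 0, 1, 1, 0]

print(round(cosine(v0, v1), 4))   # 0.8165  (two shared terms: 2 / sqrt(2 * 3))
print(cosine(v0, v2))             # 0.0     (no shared terms)
print(round(cosine(v0, v0), 4))   # 1.0     (a vector compared with itself)
```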

In [124]:
similarity[1]

array([0.08346223, 1.        , 0.06063391, ..., 0.02378257, 0.        ,
       0.02615329])

<ul>
<li>Creating a function that prints the 5 movies most similar to a given title, using the cosine similarity matrix above. </li>
</ul>

In [125]:
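def recommend(movie):
    # Use the row's position rather than its index label: dropna() left gaps in
    # the labels, while `similarity` and `.iloc` are both indexed by position.
    movie_index = np.flatnonzero(new_df["title"].to_numpy() == movie)[0]
    distances = similarity[movie_index]
    # Rank by similarity; the top entry is normally the movie itself, so skip it.
    movies_list = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:6]
    for i in movies_list:
        print(new_df.iloc[i[0]].title)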
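
As a possible variation, the lookup can return a list instead of printing, handle unknown titles, and exclude the query movie by position rather than assuming it ranks first. The function name `top_similar` and the toy data are hypothetical; this is a sketch of the idea, not the notebook's implementation.

```python
import numpy as np

def top_similar(titles, similarity, movie, n=5):
    """Return up to n titles most similar to `movie`, or [] if it is unknown."""
    titles = list(titles)
    if movie not in titles:
        return []
    pos = titles.index(movie)                      # position of the first match
    order = np.argsort(similarity[pos])[::-1]      # positions, most similar first
    order = [j for j in order if j != pos][:n]     # drop the movie itself
    return [titles[j] for j in order]

# Toy data: four titles and a symmetric similarity matrix.
toy_titles = ["A", "B", "C", "D"]
toy_sim = np.array([[1.0, 0.8, 0.1, 0.3],
                    [0.8, 1.0, 0.2, 0.4],
                    [0.1, 0.2, 1.0, 0.5],
                    [0.3, 0.4, 0.5, 1.0]])

print(top_similar(toy_titles, toy_sim, "A", n=2))   # ['B', 'D']
print(top_similar(toy_titles, toy_sim, "Z"))        # []
```

On the real data this would be called with `new_df["title"]` and `similarity` as the first two arguments.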

<ul>
<li>Preview of how the function works internally. </li>
</ul>

In [126]:
new_df[new_df["title"]=="Batman Begins"].index[0]

119

In [127]:
movies_list = sorted(list(enumerate(similarity[119])),reverse = True,key=lambda x:x[1])[1:6]

In [128]:
for i in movies_list:
    print(i)
    print(new_df.iloc[i[0]].title)

(65, 0.40218090755486674)
The Dark Knight
(1363, 0.35434169344615046)
Batman
(1362, 0.3340765523905305)
Batman
(3, 0.3177444546511212)
The Dark Knight Rises
(3297, 0.3120099844792576)
10th & Wolf


<ul>
<li>Testing my model for correct recommendations. </li>
</ul>

In [129]:
recommend("Batman Begins")

The Dark Knight
Batman
Batman
The Dark Knight Rises
10th & Wolf
