# Movie Recommendation System Using Cosine-Similarity

## 1) Problem Statement
Title: Movie Recommendation System Using Cosine Similarity

## 2) Data Collection
-  The TMDB 5000 movie dataset, which contains movie titles and additional information, will be used for this project.
-  Source of Dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv

### 2.1 Importing Required Libraries

In [1]:
from sklearn.feature_extraction.text import CountVectorizer 
import pandas as pd
import numpy as np
import ast
import nltk

### 2.2 Loading Datasets

In [2]:
df_1 = pd.read_csv(r"tmdb_5000_movies.csv")
df_2 = pd.read_csv(r"tmdb_5000_credits.csv")

### 2.2 Showing the Dataset

In [3]:
df_1.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [4]:
df_2.head(2)


Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


### 2.3 Checking the Size of the Datasets

In [5]:
df_1.shape , df_2.shape

((4803, 20), (4803, 4))

### 2.4 Merging Both Datasets for More Accurate Results

In [6]:
movies = pd.merge(df_1, df_2, on="title", how="inner")
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
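
A note on the join key: both input files have 4803 rows, yet the frame selected from this merge in section 3.1 has 4809 rows. An inner join on `title` pairs every row of one frame with every same-titled row of the other, so repeated titles multiply. The sketch below uses made-up toy frames, and it assumes that `id` in the movies file and `movie_id` in the credits file identify the same film (they agree in the two rows shown above). It is an illustration of the effect, not a replacement for the cell above.

```python
import pandas as pd

# Toy frames (made-up rows): two different films share the title "Remake".
toy_movies = pd.DataFrame({"id": [1, 2, 3],
                           "title": ["Remake", "Remake", "Solo"]})
toy_credits = pd.DataFrame({"movie_id": [1, 2, 3],
                            "title": ["Remake", "Remake", "Solo"],
                            "cast": ["a", "b", "c"]})

# Joining on the title pairs every "Remake" with every "Remake": 2*2 + 1 rows.
by_title = pd.merge(toy_movies, toy_credits, on="title", how="inner")

# Joining on the identifier keeps exactly one row per film.
by_id = pd.merge(toy_movies, toy_credits.drop(columns="title"),
                 left_on="id", right_on="movie_id", how="inner")

print(len(by_title), len(by_id))
```

With the identifier as the key, the merged frame has as many rows as there are films in the toy data.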


## 3) Data Preprocessing

In [7]:
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')

### 3.1 Features Selection

In [8]:
dataset = movies[["movie_id","title", "genres","keywords","overview", "cast", "crew" ]]
dataset.head(2)

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [9]:
dataset.shape

(4809, 7)

### 3.2 Checking Null Values

In [10]:
dataset.isnull().sum()

movie_id    0
title       0
genres      0
keywords    0
overview    3
cast        0
crew        0
dtype: int64

**Only 3 rows contain null values (all in 'overview'), which is negligible compared to the size of the dataset, so we simply drop those rows**

### 3.3 Eliminating Null values

In [11]:
dataset = dataset.dropna()
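
Called without arguments, `dropna()` removes every row that has a missing value in any column. A minimal sketch on a made-up toy frame, showing the same check-then-drop pattern; the `subset` argument and the `reset_index` call are optional additions that the cell above does not use.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows) with one missing overview.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["some plot", np.nan, "another plot"],
})

# Count missing values per column, as in section 3.2.
print(toy.isnull().sum())

# Drop rows whose overview is missing, then renumber the index 0..n-1.
cleaned = toy.dropna(subset=["overview"]).reset_index(drop=True)
print(cleaned.shape)
```

Resetting the index keeps the row labels equal to the row positions, which can matter if rows are later looked up by position in a NumPy array.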

### 3.4 Checking Duplicate Values

In [12]:
dataset.duplicated().sum()

0

### 3.5 Exploring the Dataset

In [13]:
dataset['genres'][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [14]:
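# Helper for EDA: parse a stringified list of dicts and collect each 'name' value.
# Note: this name shadows Python's built-in help() for the rest of the notebook.
def help(obj):
    l = []
    for i in ast.literal_eval(obj):
        l.append(i['name'])
    return l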

**3.5.1 Since we are building a content-based recommendation system, we create a set of tags for each movie that can be used to measure the similarity between movies. For that reason, we keep only the 'name' value of each entry as a keyword**

In [15]:
dataset['genres'] = dataset['genres'].apply(help)
dataset['keywords'] = dataset['keywords'].apply(help)
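
To make the parsing step concrete, here is a minimal sketch of what the helper does to one cell value. The function name `extract_names` is hypothetical (chosen so that it does not shadow the built-in `help`), and the input string is shortened from the `genres` value shown in section 3.5.

```python
import ast

# A shortened version of the stringified list stored in the 'genres' column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def extract_names(text):
    """Parse a stringified list of dicts and return the 'name' values."""
    return [item["name"] for item in ast.literal_eval(text)]

names = extract_names(raw)
print(names)  # ['Action', 'Adventure']
```

`ast.literal_eval` only evaluates Python literals, so it turns the string into a real list of dictionaries without executing arbitrary code.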

In [16]:
dataset.head(2)

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [17]:
# Keep only the names of the first three entries (used for the cast column)
def help2(obj):
    l = []
    k = 0
    for i in ast.literal_eval(obj):
        if k != 3:
            l.append(i['name'])
            k += 1
        else:
            break
    return l

In [18]:
dataset['cast'] = dataset['cast'].apply(help2)
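
The counter-and-break loop in `help2` keeps the first three names. The same result can be sketched with a slice; `first_names` and its `limit` parameter are hypothetical names, and the input strings are made up.

```python
import ast

def first_names(text, limit=3):
    """Return the 'name' of at most `limit` entries, in their original order."""
    return [item["name"] for item in ast.literal_eval(text)[:limit]]

raw_cast = ('[{"cast_id": 1, "name": "A"}, {"cast_id": 2, "name": "B"}, '
            '{"cast_id": 3, "name": "C"}, {"cast_id": 4, "name": "D"}]')

top = first_names(raw_cast)                                # four entries in, three out
short = first_names('[{"cast_id": 1, "name": "A"}]')       # fewer than three entries
print(top, short)
```

Slicing past the end of a list is safe in Python, so a film with fewer than three cast entries simply yields a shorter list.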

In [19]:
dataset.head(2)

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [20]:
# Keep only the name of the first crew member whose job is "Director"
def convert(data):
    L = []
    for i in ast.literal_eval(data):
        if i["job"] == "Director":
            L.append(i["name"])
            break
    return L

In [21]:
dataset['crew'] = dataset['crew'].apply(convert)
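
As an illustration of the director lookup, the sketch below finds the first crew entry whose job is "Director" and returns it as a one-element list, or an empty list if there is none. The name `find_director` is hypothetical and the crew strings are made up.

```python
import ast

def find_director(text):
    """Return [name] of the first crew member with job 'Director', else []."""
    crew = ast.literal_eval(text)
    name = next((m["name"] for m in crew if m.get("job") == "Director"), None)
    return [name] if name is not None else []

raw_crew = ('[{"job": "Editor", "name": "E"}, {"job": "Director", "name": "D1"}, '
            '{"job": "Director", "name": "D2"}]')

director = find_director(raw_crew)
no_director = find_director('[{"job": "Editor", "name": "E"}]')
print(director, no_director)
```

Returning a list in both cases keeps the column ready for the list concatenation used later to build the tags.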

In [22]:
dataset.head(2)

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]


**3.5.2 To build the tags for each movie, we convert the 'overview' column from a string into a list, because a list can only be concatenated with another list (not with a 'str')**

In [23]:
dataset['overview'] = dataset['overview'].apply(lambda x : x.split(","))
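
Note that `split(",")` produces comma-separated phrases rather than individual words. The sketch below, on a made-up sentence, compares this with a whitespace split and shows that once the pieces are joined with spaces again (as happens to the tags later), a whitespace tokenizer still sees individual words.

```python
overview = "A retired pilot, now a farmer, returns to space"

by_comma = overview.split(",")   # phrases; the commas themselves are dropped
by_space = overview.split()      # words; commas stay attached to some words

print(by_comma)
print(by_space)

# Joining the phrases with spaces and splitting on whitespace gives plain words.
rejoined_words = " ".join(by_comma).split()
print(rejoined_words)
```

Either split gives a list, which is all the concatenation step needs.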

**3.5.3 Removing the spaces inside each string of the list columns, so that a multi-word name or phrase is treated as a single token**

In [24]:
dataset['cast'] = dataset['cast'].apply(lambda x: [ i.replace(" ", "") for i in x])
dataset['crew'] = dataset['crew'].apply(lambda x: [ i.replace(" ", "") for i in x])
dataset['genres'] = dataset['genres'].apply(lambda x: [ i.replace(" ", "") for i in x])
dataset['keywords'] = dataset['keywords'].apply(lambda x: [ i.replace(" ", "") for i in x])
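
A small sketch of why the spaces are removed: with whitespace tokenization, two different people who share a first name would otherwise contribute a common token. The helper names `tokens` and `squash` are hypothetical, and the second cast list is made up for illustration.

```python
def tokens(names):
    """Lower-cased whitespace tokens of all names, as a set."""
    return {tok for name in names for tok in name.lower().split()}

def squash(names):
    """Remove the spaces inside each name so it becomes one token."""
    return [name.replace(" ", "") for name in names]

movie_a = ["Sam Worthington", "Zoe Saldana"]
movie_b = ["Sam Rockwell"]           # made-up second cast list

shared_before = tokens(movie_a) & tokens(movie_b)
shared_after = tokens(squash(movie_a)) & tokens(squash(movie_b))
print(shared_before, shared_after)
```

Before squashing, the two lists share the token "sam"; afterwards they share nothing, so the two films no longer look similar because of an unrelated first name.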

In [25]:
dataset.head(2)

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew
0,19995,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[In the 22nd century, a paraplegic Marine is ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[Captain Barbossa, long believed to be dead, ...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]


**3.5.4 Merging all the required columns**

In [26]:
dataset['Tags'] =dataset['genres'] + dataset['keywords'] + dataset['overview'] + dataset['cast'] + dataset["crew"]
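
Adding two pandas columns that hold lists applies `+` row by row, which for lists means concatenation. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "genres": [["Action"], ["Drama", "Comedy"]],
    "cast": [["ActorOne"], []],
})

# Row-wise list concatenation: each row's lists are joined end to end.
toy["Tags"] = toy["genres"] + toy["cast"]
print(toy["Tags"].tolist())
```

This is also why every column in the sum has to hold lists: adding a list to a plain string raises a `TypeError`.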

In [27]:
dataset.head(2)

Unnamed: 0,movie_id,title,genres,keywords,overview,cast,crew,Tags
0,19995,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[In the 22nd century, a paraplegic Marine is ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[Action, Adventure, Fantasy, ScienceFiction, c..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[Captain Barbossa, long believed to be dead, ...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Adventure, Fantasy, Action, ocean, drugabuse,..."


**3.5.5 Dropping all the unnecessary columns**

In [28]:
new_data = dataset[["movie_id", "title", "Tags"]]
new_data['Tags'] = new_data['Tags'].apply(lambda x : " ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_data['Tags'] = new_data['Tags'].apply(lambda x : " ".join(x))
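
The message printed above is the text of pandas' `SettingWithCopyWarning`: `new_data` was selected from `dataset`, so pandas cannot tell whether the assignment is meant to change the original frame as well. One common way to avoid it, sketched here on a made-up toy frame, is to take an explicit copy of the selection before assigning to it.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["A", "B"],
    "Tags": [["x", "y"], ["z"]],
    "extra": [0, 0],
})

# .copy() makes the selection an independent frame, so later column
# assignments clearly apply to the copy only.
subset = toy[["movie_id", "title", "Tags"]].copy()
subset["Tags"] = subset["Tags"].apply(lambda tags: " ".join(tags))

print(subset["Tags"].tolist())
print(toy["Tags"].tolist())   # the original frame is unchanged
```

The results are the same as without the copy; the difference is that the intent is explicit and the warning is not raised.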


In [29]:
new_data["Tags"] = new_data["Tags"].apply(lambda x: x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_data["Tags"] = new_data["Tags"].apply(lambda x: x.lower())


**3.5.6 Stemming the 'Tags' column so that words with nearly the same meaning (e.g. 'actors' and 'actor') are reduced to a common form and are not counted as separate words during vectorization**

In [30]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [31]:
# Function for stemming (note: this redefines the earlier convert() used for the crew column)
def convert(data):
    txt = []
    for i in data.split():
        txt.append(ps.stem(i))
    return " ".join(txt)
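
A short sketch of what stemming does to a few words. The function name `stem_text` is hypothetical; it mirrors the stemming function above but uses a name that does not reuse `convert`. It assumes `nltk` is installed; `PorterStemmer` itself needs no extra data download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Stem every whitespace-separated word and join the stems with spaces."""
    return " ".join(stemmer.stem(word) for word in text.split())

stemmed = stem_text("actors actor adventure fantasy")
print(stemmed)  # actor actor adventur fantasi
```

The stems need not be dictionary words ("adventur", "fantasi"); what matters for vectorization is that related word forms end up as the same token.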

In [32]:
new_data["Tags"] = new_data["Tags"].apply(convert)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_data["Tags"] = new_data["Tags"].apply(convert)


***Required preprocessed Dataset***

In [33]:
new_data.head(2)

Unnamed: 0,movie_id,title,Tags
0,19995,Avatar,action adventur fantasi sciencefict culturecla...
1,285,Pirates of the Caribbean: At World's End,adventur fantasi action ocean drugabus exotici...


**3.5.7 Using scikit-learn's CountVectorizer for vectorization**

In [34]:
cv = CountVectorizer(max_features=5000, stop_words="english")

In [35]:
encoded_data = cv.fit_transform(new_data['Tags']).toarray()

In [36]:
encoded_data

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
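
The printed array is almost entirely zeros because each movie uses only a small part of the 5000-word vocabulary. To show what the rows and columns mean, here is a minimal sketch on three made-up tag strings, using the same `CountVectorizer` settings as above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up, already-stemmed tag strings.
docs = ["action adventur space", "adventur ocean pirat", "space space war"]

toy_cv = CountVectorizer(max_features=5000, stop_words="english")
counts = toy_cv.fit_transform(docs).toarray()

# Vocabulary words ordered by their column index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(vocab)
print(counts)
```

Each row is one document and each column one vocabulary word; an entry is the number of times that word occurs in that document, so the third row has a 2 in the "space" column.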

**3.5.8 Saving the preprocessed data and the encoded data so that they do not have to be recomputed**

In [37]:
np.save("encoded_data.npy", encoded_data)

In [38]:
new_data.to_csv('preprocessed_data.csv',index= False, columns=["movie_id","title"])
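
The saved array can be read back with `np.load("encoded_data.npy")`. The similarity step named in the title is not shown in this section, so the following is only a sketch of how cosine similarity could be computed from such a count matrix, using the definition cos(a, b) = a·b / (|a| |b|). It runs on a made-up toy matrix, and the function name `cosine_similarity_matrix` is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(counts):
    """Pairwise cosine similarity between the rows of a 2-D count matrix."""
    counts = np.asarray(counts, dtype=float)
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # all-zero rows get similarity 0, not NaN
    unit = counts / norms            # scale every row to unit length
    return unit @ unit.T             # dot products of unit vectors

toy_counts = np.array([[1, 1, 0, 0, 1, 0],
                       [0, 1, 1, 1, 0, 0],
                       [0, 0, 0, 0, 2, 1]])
sim = cosine_similarity_matrix(toy_counts)

# Rank the other rows by similarity to row 0, most similar first.
ranking = [int(i) for i in np.argsort(-sim[0]) if i != 0]
print(np.round(sim, 3))
print(ranking)
```

Row 0 shares one word with row 1 (similarity 1/3) and the repeated word with row 2 (similarity 2/√15, about 0.516), so row 2 ranks first.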