# Movie Recommender System

Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np

## Data Loading and Exploration

In [35]:
movies=pd.read_csv('movies.csv')
credits=pd.read_csv('credits.csv')

In [16]:
print(movies.info())
movies.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [15]:
print(credits.info())
credits.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB
None


Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


## Preprocessing

Merging the two dataframes on movie title. Titles are not unique, so the merged frame has 4809 rows, slightly more than the 4803 rows in each input.

In [36]:
movies=movies.merge(credits,on='title')
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               
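
The merged frame has 4809 rows against 4803 in each input, which indicates that some titles repeat and get paired with each other. A minimal sketch of an alternative, assuming `id` in `movies.csv` and `movie_id` in `credits.csv` refer to the same identifier (both Avatar rows above show 19995). The toy frames and their names are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for movies.csv and credits.csv (made-up rows, not the real data).
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],  # two different films share a title
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on title pairs every "Batman" with every "Batman": 2*2 + 1 = 5 rows.
by_title = movies_toy.merge(credits_toy, on="title")

# Joining on the identifier keeps one row per film; validate raises if a key repeats.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",
)

print(len(by_title), len(by_id))
```

Whether the extra rows matter depends on how the recommendations are looked up later; the title-based merge used in this notebook is kept as is.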

Selecting the features we want to use for our recommender system

In [79]:
df=movies[['title','genres','popularity','overview','keywords','cast','crew']]

In [27]:
df.head(1)

Unnamed: 0,title,genres,popularity,overview,keywords,cast,crew
0,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",150.437577,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   title       4809 non-null   object 
 1   genres      4809 non-null   object 
 2   popularity  4809 non-null   float64
 3   overview    4806 non-null   object 
 4   keywords    4809 non-null   object 
 5   cast        4809 non-null   object 
 6   crew        4809 non-null   object 
dtypes: float64(1), object(6)
memory usage: 263.1+ KB


#### Data Cleaning

Handling missing and duplicate values


In [38]:
df.isnull().sum()

title         0
genres        0
popularity    0
overview      3
keywords      0
cast          0
crew          0
dtype: int64

The `overview` column has 3 missing values, a small fraction of the 4809 rows, so we drop the records with missing values

In [80]:
# Reassign instead of using inplace=True: df was selected from movies,
# so an in-place drop triggers pandas' "set on a copy of a slice" warning.
df = df.dropna()
df.isnull().sum()

title         0
genres        0
popularity    0
overview      0
keywords      0
cast          0
crew          0
dtype: int64
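
The drop can also be restricted to the one column that has gaps. A minimal sketch on a made-up frame (the rows and the names `movies_toy` and `df_toy` are illustrative, not from the dataset). Taking an explicit `.copy()` of the column selection is one way to keep pandas from warning about writes to a slice:

```python
import pandas as pd

# Made-up rows for illustration; one overview is missing.
movies_toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["first plot", None, "third plot"],
    "budget": [10, 20, 30],
})

# .copy() makes the selection an independent frame rather than a slice of movies_toy.
df_toy = movies_toy[["title", "overview"]].copy()

# Drop only rows whose overview is missing, then renumber the index.
df_toy = df_toy.dropna(subset=["overview"]).reset_index(drop=True)

print(df_toy)
```

Resetting the index is optional here; it only matters if later steps look rows up by position.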

Checking for duplicates

In [40]:
df.duplicated().sum()

0

#### Feature Engineering

In [43]:
df.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

Function to extract the `name` field from each entry of a stringified list of dictionaries

In [45]:
import ast

def extract_features(x):
    # Parse the string into a list of dicts and keep each unique 'name', in order
    l=[]
    for i in ast.literal_eval(x):
        if i['name'] not in l:
            l.append(i['name'])
    return l
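
To see what the function does, it can be run on the `genres` string printed above. The sketch below repeats the function so the block runs on its own. `ast.literal_eval` parses the string into a list of dictionaries without evaluating arbitrary code, and the loop keeps each `name` once, in order:

```python
import ast

def extract_features(x):
    # Parse the stringified list of dicts and keep each unique 'name' in order.
    names = []
    for item in ast.literal_eval(x):
        if item["name"] not in names:
            names.append(item["name"])
    return names

# The genres value of the first row, as shown by df.iloc[0].genres.
sample = ('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, '
          '{"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

print(extract_features(sample))
# ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
```

Since the cells look like JSON, `json.loads` would probably work as well; that is an assumption about the rest of the file, not something verified here.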

In [None]:
df['genres']=df['genres'].apply(extract_features)

In [None]:
df['keywords']=df['keywords'].apply(extract_features)

In [54]:
df.head(1)

Unnamed: 0,title,genres,popularity,overview,keywords,cast,crew
0,Avatar,"[Action, Adventure, Fantasy, Science Fiction]",150.437577,"In the 22nd century, a paraplegic Marine is di...","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [55]:
def extract_top_3_cast(x):
    # Keep the names of the first three cast entries (fewer if the list is shorter)
    return [i['name'] for i in ast.literal_eval(x)[:3]]

In [None]:
df['cast']=df['cast'].apply(extract_top_3_cast)

In [57]:
df.head(1)

Unnamed: 0,title,genres,popularity,overview,keywords,cast,crew
0,Avatar,"[Action, Adventure, Fantasy, Science Fiction]",150.437577,"In the 22nd century, a paraplegic Marine is di...","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [58]:
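def extract_director(x):
    # Return the first crew member whose job is 'Director' as a one-element list
    # (an empty list if no director is credited)
    l=[]
    for i in ast.literal_eval(x):
        if i['job']=='Director':
            l.append(i['name'])
            break
    return l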

In [None]:
df['crew']=df['crew'].apply(extract_director)

In [60]:
df.head(1)

Unnamed: 0,title,genres,popularity,overview,keywords,cast,crew
0,Avatar,"[Action, Adventure, Fantasy, Science Fiction]",150.437577,"In the 22nd century, a paraplegic Marine is di...","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [None]:
df['overview']=df['overview'].apply(lambda x:x.split())

In [None]:
df['popularity']=df['popularity'].apply(lambda x: [int(x)])

In [88]:
df.head(1)

Unnamed: 0,title,genres,popularity,overview,keywords,cast,crew
0,Avatar,"[Action, Adventure, Fantasy, Science Fiction]",[150],"[In, the, 22nd, century,, a, paraplegic, Marin...","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [None]:
df['tags']=df['overview']+df['genres']+df['keywords']+df['cast']+df['crew']+df['popularity']

In [94]:
df = df.drop('tags', axis=1)

In [95]:
df.head(1)

Unnamed: 0,title,genres,popularity,overview,keywords,cast,crew
0,Avatar,"[Action, Adventure, Fantasy, Science Fiction]",[150],"[In, the, 22nd, century,, a, paraplegic, Marin...","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [98]:
df['genres']=df['genres'].apply(lambda x: [i.replace(" ","") for i in x])
df['keywords']=df['keywords'].apply(lambda x: [i.replace(" ","") for i in x])
df['cast']=df['cast'].apply(lambda x: [i.replace(" ","") for i in x])
df['crew']=df['crew'].apply(lambda x: [i.replace(" ","") for i in x])

In [99]:
df.head(1)

Unnamed: 0,title,genres,popularity,overview,keywords,cast,crew
0,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]",[150],"[In, the, 22nd, century,, a, paraplegic, Marin...","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
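
Removing the spaces turns each multi-word name into a single token. A likely motivation, stated here as an assumption since the notebook does not spell it out, is that a later word-level comparison would otherwise treat "Sam Worthington" and "Sam Mendes" as sharing the word "Sam". A small sketch with made-up cast lists and a hypothetical helper `collapse_spaces`:

```python
def collapse_spaces(names):
    # Hypothetical helper mirroring the lambdas above: strip spaces inside each entry.
    return [name.replace(" ", "") for name in names]

cast_a = ["Sam Worthington", "Zoe Saldana"]
cast_b = ["Sam Mendes"]  # made-up second entry for illustration

# Word-level overlap before collapsing: the shared first name looks like a match.
words_a = {w for name in cast_a for w in name.split()}
words_b = {w for name in cast_b for w in name.split()}
print(words_a & words_b)

# After collapsing, each person is one token and the false match disappears.
tokens_a = set(collapse_spaces(cast_a))
tokens_b = set(collapse_spaces(cast_b))
print(tokens_a & tokens_b)
```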


In [100]:
df['tags']=df['overview']+df['genres']+df['keywords']+df['cast']+df['crew']+df['popularity']

In [101]:
df.head(1)

Unnamed: 0,title,genres,popularity,overview,keywords,cast,crew,tags
0,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]",[150],"[In, the, 22nd, century,, a, paraplegic, Marin...","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


#### Refined Dataframe for recommender system model

In [102]:
data=df[['title','tags']]

In [104]:
data.head(4)

Unnamed: 0,title,tags
0,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
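
Each `tags` value is still a Python list, and its last element is the integer taken from `popularity`. If a later step needs one string per movie (an assumption; the rest of the pipeline is not shown here), the elements have to be converted with `str` before joining, because `" ".join` rejects non-string items. A minimal sketch on a made-up row named `data_toy`:

```python
import pandas as pd

# Made-up row shaped like data: a title plus a mixed list of strings and one int.
data_toy = pd.DataFrame({
    "title": ["Avatar"],
    "tags": [["In", "the", "22nd", "century,", "Action", "JamesCameron", 150]],
})

# map(str, ...) converts the trailing integer; lower() makes matching case-insensitive.
data_toy["tags"] = data_toy["tags"].apply(lambda t: " ".join(map(str, t)).lower())

print(data_toy.loc[0, "tags"])
# in the 22nd century, action jamescameron 150
```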
