<center><h1 style="color:purple;">Movie Recommender System</h1></center>

<h2 style="color:purple">Context</h2>
These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

<h2 style="color:purple">Content</h2>
This dataset consists of the following files:

movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
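
Several of these files store nested fields as stringified objects. As a minimal sketch, assuming the fields use Python-literal syntax (single-quoted keys and values), such a field can be parsed with `ast.literal_eval`. The sample string below is made up for illustration and is not taken from the dataset.

```python
from ast import literal_eval

# Hypothetical sample in the style of a keywords field (not real dataset content)
raw = "[{'id': 1, 'name': 'board game'}, {'id': 2, 'name': 'jungle'}]"

# literal_eval safely evaluates the string into a list of dicts
parsed = literal_eval(raw)

# Keep only the human-readable names
names = [item['name'] for item in parsed]
print(names)
```

`literal_eval` only accepts Python literals, so it is a safer choice than `eval` for data read from a file.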


<h1 style="color:red">Load Data</h1>

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
from tqdm import tqdm
import warnings 
from ast import literal_eval
import missingno as msno
warnings.filterwarnings('ignore')


from scipy import stats
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer


In [2]:
metadata=pd.read_csv("../input/the-movies-dataset/movies_metadata.csv")
metadata.head(3)

In [3]:
metadata.info()

<h3 style="color:red">Visualize Missing values</h3>

In [4]:
msno.bar(metadata,sort="ascending",color='#7209b7',figsize=(20,10),fontsize=15)

In [5]:
pd.DataFrame(metadata.isnull().sum()/(metadata.shape[0])*100)


In [6]:
credit=pd.read_csv("../input/the-movies-dataset/credits.csv")
keyword=pd.read_csv("../input/the-movies-dataset/keywords.csv")
links_small = pd.read_csv('../input/the-movies-dataset/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [7]:
credit.head()

In [8]:
keyword.head()

In [9]:
links_small.head()

In [10]:
msno.bar(credit,sort="ascending",color='#7209b7',figsize=(20,10),fontsize=15)

In [11]:
msno.bar(keyword,sort="ascending",color='#7209b7',figsize=(20,10),fontsize=15)

<h1 style="color:purple;">Metadata based recommendation system</h1>

In [12]:
metadata.shape,credit.shape,keyword.shape,links_small.shape

The full dataset is large and our computational power is limited, so we will build the recommender on the roughly 9,000 movies listed in links_small.csv.

In [13]:
# Drop three malformed rows whose 'id' values are not numeric, so the cast to int below succeeds
metadata = metadata.drop([19730, 29503, 35587])

In [14]:
keyword['id'] = keyword['id'].astype('int')
credit['id'] = credit['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')
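
The integer casts above fail if any `id` value is not numeric, which is presumably why three rows are dropped by hard-coded label first. A possible alternative, sketched here on a toy frame with made-up values, is to detect such rows with `pd.to_numeric(..., errors='coerce')` instead of listing their labels.

```python
import pandas as pd

# Toy frame with one malformed id (values are illustrative only)
toy = pd.DataFrame({'id': ['11', '1997-08-20', '42'],
                    'title': ['A', None, 'B']})

# Non-numeric ids become NaN instead of raising an error
numeric_id = pd.to_numeric(toy['id'], errors='coerce')

# Row labels whose id could not be parsed
bad_rows = toy.index[numeric_id.isna()]

# Drop them, then the integer cast is safe
clean = toy.drop(bad_rows)
clean['id'] = clean['id'].astype('int')
print(list(bad_rows), clean['id'].tolist())
```

This keeps the cleaning step valid even if the row order of the source file changes.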

In [15]:
metadata=metadata.merge(credit,on='id')
metadata=metadata.merge(keyword,on='id')


In [16]:
metadata=metadata[metadata['id'].isin(links_small)]
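
The restriction to the small subset works by keeping only movies whose `id` appears among the TMDB IDs in the links file. A minimal sketch of the same pattern on toy data (the frames and values below are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the metadata and links_small tables
movies = pd.DataFrame({'id': [1, 2, 3, 4], 'title': ['A', 'B', 'C', 'D']})
links = pd.DataFrame({'movieId': [10, 11, 12], 'tmdbId': [2.0, None, 4.0]})

# Drop links without a TMDB ID, then cast the remaining IDs to int
small_ids = links.loc[links['tmdbId'].notnull(), 'tmdbId'].astype('int')

# Keep only movies whose id is in the small ID set
subset = movies[movies['id'].isin(small_ids)]
print(subset['title'].tolist())
```

Note that the filtered frame keeps its original row labels (here 1 and 3), which matters later when rows are looked up by position.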

In [17]:
metadata['cast']=metadata['cast'].apply(literal_eval)
metadata['crew'] = metadata['crew'].apply(literal_eval)
metadata['keywords'] = metadata['keywords'].apply(literal_eval)

In [18]:
metadata['cast_size']=metadata['cast'].apply(lambda x:len(x))
metadata['crew_size']=metadata['crew'].apply(lambda x:len(x))

In [19]:
def get_director(x):
    for i in x:
        if i['job']=='Director':
            return i['name']
    return np.nan

In [20]:
metadata['director']=metadata['crew'].apply(get_director)

In [21]:
plt.figure(figsize=(12,8))
sns.countplot(y='director',data=metadata,order=metadata['director'].value_counts().index[:20])
plt.yticks(size=14)
plt.show()

In [22]:
metadata['cast'] = metadata['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
metadata['cast'] = metadata['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

In [23]:
metadata['keywords'] = metadata['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [24]:
metadata.cast_size.unique()

In [25]:
metadata.cast_size.value_counts().sort_values(ascending=False)

In [26]:
# jointplot creates its own figure, so size it with `height` instead of plt.figure
sns.jointplot(x='vote_average',y="cast_size",data=metadata,kind="reg",height=8)
plt.show()

In [27]:
# jointplot creates its own figure, so size it with `height` instead of plt.figure
sns.jointplot(x='vote_average',y="crew_size",data=metadata,kind="reg",height=8)
plt.show()

In [28]:
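# Lowercase names and strip spaces so each person becomes a single token
metadata['cast']=metadata['cast'].apply(lambda x:[str.lower(i.replace(" ","")) for i in x])
# Note: astype('str') turns a missing director into the literal token 'nan'
metadata['director']=metadata['director'].astype('str').apply(lambda x : x.replace(" ",""))
# Repeat the director three times to give it more weight than a single cast member
metadata['director']=metadata['director'].apply(lambda x:[x,x,x])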

In [29]:
s=metadata.apply(lambda x:pd.Series(x['keywords']),axis=1).stack().reset_index(level=1,drop=True)
s.name='keyword'

In [30]:
s=s.value_counts()
s

In [31]:
s=s[s>1]

In [32]:
!pip install surprise

In [33]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words
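
The keyword counting and filtering steps can also be written with `Series.explode`, which avoids building one `pd.Series` per row. This is a sketch on a toy column, not the notebook's own method; the names `toy`, `counts`, `frequent`, and `kept` are hypothetical.

```python
import pandas as pd

# Toy keyword lists, one list per movie
toy = pd.DataFrame({'keywords': [['jungle', 'board game'], ['jungle'], []]})

# One row per keyword; empty lists explode to NaN, so drop those
counts = toy['keywords'].explode().dropna().value_counts()

# Keep keywords that occur more than once
frequent = counts[counts > 1]

# Filter each movie's list down to the frequent keywords
kept = toy['keywords'].apply(lambda kws: [k for k in kws if k in frequent.index])
print(kept.tolist())
```

The membership test uses `frequent.index` explicitly. The `i in s` check in `filter_keywords` is equivalent, because `in` on a pandas Series tests the index labels rather than the values.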

In [34]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

In [35]:
metadata['keywords'] = metadata['keywords'].apply(filter_keywords)
metadata['keywords'] = metadata['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
metadata['keywords'] = metadata['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [36]:
metadata.head()

In [37]:
metadata['genres'] = metadata['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [38]:
metadata.head()

In [39]:
list_=[]
def genre_vis(x):
    for i in x:
        for j in i:
            list_.append(j)
genre_vis(metadata['genres'])
        

In [40]:
ge=pd.Series(list_)

In [41]:
plt.figure(figsize=(12,8))
ge.value_counts().plot.bar()
plt.xticks(size=14)
plt.yticks(size=14)
plt.show()

In [42]:
metadata['soup'] = metadata['keywords'] + metadata['cast'] + metadata['director'] + metadata['genres']
metadata['soup'] = metadata['soup'].apply(lambda x: ' '.join(x))

In [43]:
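# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')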
count_matrix = count.fit_transform(metadata['soup'])

In [44]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)
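
To make the soup-and-similarity step concrete, here is a small sketch with three made-up soups. The tokens (`actora`, `directorx`, and so on) are hypothetical placeholders, and the vectorizer settings mirror the ones used above apart from `min_df`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical soups: keywords + cast + director (repeated three times) + genres
soups = [
    'jungle boardgame actora directorx directorx directorx adventure',
    'jungle treasure actorb directorx directorx directorx adventure',
    'romance paris actorc directory directory directory drama',
]

# Count unigrams and bigrams in each soup
vec = CountVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
matrix = vec.fit_transform(soups)

# Pairwise cosine similarity between the count vectors
sim = cosine_similarity(matrix)
print(sim.round(2))
```

The first two soups share a keyword, a director, and a genre, so their similarity is higher than either has with the third, which shares no tokens with them. Repeating the director token raises its count and therefore its influence on the score.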

In [45]:
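# Use the reset (0..n-1) index so positions match the rows of cosine_sim;
# metadata itself still carries the labels left over from filtering
smd = metadata.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])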

In [46]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [47]:
metadata.head()

In [48]:
get_recommendations('Jumanji').head(10)
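
One caveat with the title lookup used by `get_recommendations`: if two movies share a title, indexing the title-to-position Series returns several positions instead of one. A minimal sketch of one way to handle that, assuming the first match is acceptable (the helper `first_position` and the toy titles are hypothetical):

```python
import pandas as pd

# Toy title list with one duplicated title
titles = pd.Series(['Heat', 'Jumanji', 'Heat'])
indices = pd.Series(titles.index, index=titles)

def first_position(title):
    """Return the first row position recorded for a title."""
    match = indices[title]
    # A duplicated title yields a Series of positions; a unique one yields a scalar
    if isinstance(match, pd.Series):
        return int(match.iloc[0])
    return int(match)

print(first_position('Heat'), first_position('Jumanji'))
```

Picking the first match is a simplification. Disambiguating by release year would be another option, at the cost of a more complex lookup key.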