#### 1: Mounting your Google Drive

In [1]:
from google.colab import drive, files
drive.mount('/content/drive')

Mounted at /content/drive


#### 2: Uploading your Kaggle API Token File

In [2]:
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

## Section 1: Kaggle Setup

> 	Information & details on downloading dataset via Kaggle API

#### 1: Enable Kaggle API for User Mode Access

In [3]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
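
The same token setup can be written in Python, which may be convenient when running outside Colab. This is a minimal sketch, not part of the original notebook: `install_kaggle_token` and its `home` parameter are hypothetical names introduced for illustration.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy a Kaggle token into <home>/.kaggle and restrict its permissions.

    `home` is an illustrative parameter; it defaults to the user's home directory.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    # exist_ok=True mirrors `mkdir -p`: re-running the step does not fail
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    # Owner read/write only, the equivalent of `chmod 600`
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_token('kaggle.json')` after the upload step would place the token where the Kaggle CLI looks for it by default.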

#### 2: Downloading the Dataset from Kaggle to Google Colab

In [4]:
!kaggle datasets download -d rounakbanik/the-movies-dataset

Dataset URL: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
License(s): CC0-1.0
Downloading the-movies-dataset.zip to /content
 97% 220M/228M [00:02<00:00, 84.4MB/s]
100% 228M/228M [00:02<00:00, 84.4MB/s]


#### 3: Extracting Dataset

Datasets downloaded from Kaggle usually arrive as a zip or tar archive. To access the files, we first need to extract the archive with the matching decompression tool.

In [5]:
## Unzip the dataset into /content/data directory
!unzip the-movies-dataset.zip -d /content/data

## Once extracted, remove the archive to save disk space.
!rm the-movies-dataset.zip

Archive:  the-movies-dataset.zip
  inflating: /content/data/credits.csv  
  inflating: /content/data/keywords.csv  
  inflating: /content/data/links.csv  
  inflating: /content/data/links_small.csv  
  inflating: /content/data/movies_metadata.csv  
  inflating: /content/data/ratings.csv  
  inflating: /content/data/ratings_small.csv  
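
For archives that may be either zip or tar, the extraction can also be done from Python. The sketch below is one possible approach using the standard library; `extract_archive` is a hypothetical helper, and the demo archive is created on the fly so the example runs without the Kaggle download.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Unpack a zip or tar archive; the format is inferred from the file extension."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive_path), str(target))
    return sorted(p.name for p in target.iterdir())


# Demonstration on a throwaway archive (illustrative content only)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("example.csv", "id,title\n862,Toy Story\n")
    extracted = extract_archive(demo_zip, Path(tmp) / "data")
    print(extracted)  # ['example.csv']
```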


## Section 2: Modules & Library

>     Information on setting up requirements for training and inference

#### 1: Installing required packages

In [6]:
!pip install --quiet fastparquet
!pip install --quiet pyarrow

---

For your information: this notebook was compiled on Google Colab, which provides most of the modules pre-installed in the working environment. If you run it locally on your system, you may need to install additional dependencies.

---

#### 2: Importing Required Packages

In [7]:
%matplotlib inline
import pandas as pd
import numpy as np

from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem.snowball import SnowballStemmer

import pyarrow as pa
import pyarrow.parquet as pq

import warnings
warnings.simplefilter('ignore')

## Section 3: Data Cleaning & Engineering

>     Information on preparing data for trainable features

#### 1: Utility Functions For Data Cleaning & Engineering

In [8]:
def get_director(x):
    """
    Return the name of the first crew member whose job is 'Director', or NaN if there is none.
    """
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
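
To make the expected input concrete, here is a small dependency-free sketch of the same lookup. The crew string is an illustrative sample in the shape of the `crew` column, `first_director` is a hypothetical variant, and it returns `None` instead of `np.nan` so that it runs without NumPy.

```python
from ast import literal_eval

# Illustrative crew string in the same shape as the `crew` column (fields shortened)
raw_crew = "[{'job': 'Director', 'name': 'John Lasseter'}, {'job': 'Screenplay', 'name': 'Joss Whedon'}]"


def first_director(crew):
    """Return the first crew member whose job is 'Director', or None if there is none."""
    return next((member['name'] for member in crew if member.get('job') == 'Director'), None)


crew = literal_eval(raw_crew)   # the CSV stores the list as a string
print(first_director(crew))     # John Lasseter
print(first_director([]))       # None
```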

#### 2: Reading the datasets and merging them to form the master dataset

In [9]:
movies_dataset  = pd.read_csv('/content/data/movies_metadata.csv')
credits         = pd.read_csv('/content/data/credits.csv')
keywords        = pd.read_csv('/content/data/keywords.csv')
links           = pd.read_csv('/content/data/links.csv')

In [10]:
## Dropping these 3 rows because their 'id' column holds a date string instead of an integer ID.
movies_dataset = movies_dataset.drop([19730, 29503, 35587])
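
Hard-coded row numbers only work for this exact version of the CSV. A more general option is to detect the malformed ids directly. The sketch below shows the idea in plain Python on illustrative values; `find_malformed_ids` is a hypothetical helper. In pandas, a similar check is commonly written with `pd.to_numeric(movies_dataset['id'], errors='coerce')` followed by `isna()`.

```python
def find_malformed_ids(ids):
    """Return positions whose id is not a plain non-negative integer string."""
    return [pos for pos, value in enumerate(ids) if not str(value).strip().isdigit()]


# Illustrative values: two valid ids and one date-like string in the id field
sample_ids = ['862', '8844', '1997-08-20']
print(find_malformed_ids(sample_ids))  # [2]
```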

In [11]:
## Extracting genre names from the genres column (a stringified list of dicts). Missing or non-list values become an empty list
movies_dataset['genres'] = movies_dataset['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [12]:
## Convert to common data type for primary key in our dataset
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
movies_dataset['id'] = movies_dataset['id'].astype('int')

In [13]:
## Merging movies dataset with credits & keywords to form master dataset
movies_dataset = movies_dataset.merge(credits, on='id')
master_dataset = movies_dataset.merge(keywords, on='id')

In [14]:
master_dataset.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


In [16]:
print(master_dataset.columns)

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'cast', 'crew', 'keywords'],
      dtype='object')


In [15]:
links = links[links['tmdbId'].notnull()]['tmdbId'].astype('int')
master_dataset = master_dataset[master_dataset['id'].isin(links)]
print(master_dataset.shape)

(46628, 27)


#### 3: Data cleaning and Engineering

In [17]:
## Parsing the cast, crew and keywords columns: they are loaded as strings and need to be converted to lists
master_dataset['cast']      = master_dataset['cast'].apply(literal_eval)
master_dataset['crew']      = master_dataset['crew'].apply(literal_eval)
master_dataset['keywords']  = master_dataset['keywords'].apply(literal_eval)

In [18]:
## Extracting cast names and keeping only the top 3 members, so every movie contributes a comparable number of cast tokens
master_dataset['cast']      = master_dataset['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
master_dataset['cast']      = master_dataset['cast'].apply(lambda x: x[:3])

## Extracting the keyword names; if the value is not a list, use an empty list instead
master_dataset['keywords']  = master_dataset['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

## Extracting director names from the crew
master_dataset['director']  = master_dataset['crew'].apply(get_director)

In [19]:
## For uniqueness, lowercasing the names and removing all the spaces within them
master_dataset['cast']          = master_dataset['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

## Maintaining the original director name as main director
master_dataset['main_director'] = master_dataset['director']

## Repeating the director name 3 times to keep it in proportion with the 3 cast members kept above
master_dataset['director']      = master_dataset['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
master_dataset['director']      = master_dataset['director'].apply(lambda x: [x,x,x])
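
One effect of repeating a token is that it counts more heavily once the features are compared with count-based cosine similarity. The following plain-Python sketch illustrates this on two hypothetical movies that share a director but no cast members; the `cosine` helper and all token values are made up for illustration.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Two hypothetical movies: same director, disjoint casts
cast_a, cast_b = ['actor1', 'actor2', 'actor3'], ['actor4', 'actor5', 'actor6']
director = 'jamescameron'

once = cosine(cast_a + [director], cast_b + [director])
thrice = cosine(cast_a + [director] * 3, cast_b + [director] * 3)
print(round(once, 3), round(thrice, 3))  # 0.25 0.75
```

With the director listed once, the shared token contributes a similarity of 0.25; listed three times, it contributes 0.75.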

In [20]:
## Stacking the keywords and counting how often each one occurs across the dataset
s = master_dataset.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s = s.value_counts()
print(s[:5])

keyword
woman director      3128
independent film    1942
murder              1314
based on novel       841
musical              734
Name: count, dtype: int64


In [21]:
## Keeping only the keywords that occur more than once across the dataset
s = s[s > 1]

In [22]:
## Creating an English Snowball stemmer to reduce keywords to their stems
stemmer                     = SnowballStemmer('english')

## Stem the keywords, then lowercase them and remove the spaces inside multi-word keywords for uniqueness
master_dataset['keywords']  = master_dataset['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
master_dataset['keywords']  = master_dataset['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [23]:
master_dataset['keywords'].head(3)

0    [jealousi, toy, boy, friendship, friend, rival...
1    [boardgam, disappear, basedonchildren'sbook, n...
2       [fish, bestfriend, duringcreditssting, oldmen]
Name: keywords, dtype: object

In [24]:
## Creating a soup feature - combination of (keywords, cast, director, genres)
master_dataset['soup'] = master_dataset['keywords'] + master_dataset['cast'] + master_dataset['director'] + master_dataset['genres']

## Modifying by placing single space between all the soup words
master_dataset['soup'] = master_dataset['soup'].apply(lambda x: ' '.join(x))

In [25]:
master_dataset['soup'].head(3)

0    jealousi toy boy friendship friend rivalri boy...
1    boardgam disappear basedonchildren'sbook newho...
2    fish bestfriend duringcreditssting oldmen walt...
Name: soup, dtype: object

In [26]:
print(master_dataset.columns)

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'cast', 'crew', 'keywords', 'director',
       'main_director', 'soup'],
      dtype='object')


In [27]:
## Removing unwanted columns from the dataset - these features can be used if you wish to add more features to your recommender system.
## We are not going to use them, so we are removing them.
master_dataset.drop(['adult', 'belongs_to_collection', 'budget','homepage','original_language', 'production_companies','production_countries', 'revenue', 'runtime','spoken_languages','status','video'],axis=1,inplace=True)
master_dataset.drop(['overview', 'tagline','vote_average', 'vote_count', 'cast', 'crew','keywords', 'director'],axis=1,inplace=True)
master_dataset.drop(['id','imdb_id','original_title','poster_path','genres'],axis=1,inplace=True)

In [28]:
## Keeping popularity only where the value is a float; anything else becomes NaN and the row is dropped
master_dataset['popularity']    = master_dataset.apply(lambda r: r['popularity'] if type(r['popularity'])==float else np.nan, axis=1)
master_dataset.dropna(inplace=True)

## Dropping rows where the director name is not longer than 1 character
master_dataset['main_director'] = master_dataset.apply(lambda r: r['main_director'] if len(r['main_director'])>1 else np.nan, axis=1)
master_dataset.dropna(inplace=True)
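
The `type(...) == float` test keeps only values that are already stored as floats. If some rows hold the popularity as a numeric string (an assumption about how the CSV was parsed, not something shown in this notebook), those rows would be dropped as well. A possible alternative is to coerce values instead, sketched here in plain Python with a hypothetical `to_float_or_nan` helper; in pandas, `pd.to_numeric(..., errors='coerce')` plays a similar role.

```python
import math


def to_float_or_nan(value):
    """Coerce a value to float; return NaN when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


# Illustrative values: a float, a numeric string, a junk string and a missing value
samples = [21.9, '21.9', 'not-a-number', None]
converted = [to_float_or_nan(v) for v in samples]
print(converted)  # [21.9, 21.9, nan, nan]
```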

In [29]:
## Sorting the whole dataset based on popularity. This will help us to take top X number of movies based on popularity.
master_dataset.sort_values(by=['popularity'],ascending=False,inplace=True)

## Dropping popularity column after sorting based on popularity
master_dataset.drop(['popularity'],axis=1,inplace=True)
master_dataset.dropna(inplace=True)

In [30]:
## Reset index because after sorting, the index values have changed.
master_dataset.reset_index(inplace=True,drop=True)

In [31]:
## Dropping rows where the release date is not longer than 1 character
master_dataset['release_date'] = master_dataset.apply(lambda r: r['release_date'] if len(r['release_date'])>1 else np.nan, axis=1)
master_dataset.dropna(inplace=True)

---

IMPORTANT NOTE:

The following cell contains commented-out options for the different model sizes that can be created. If you have a Google Colab Pro account or your local system has at least 32 GB of RAM, you may run the full dataset. Otherwise, it is advised to pick a smaller dataset that fits your memory.

---

In [33]:
## For Demo, we will take top 5000 movies, which is hosted online already.
master_dataset = master_dataset[:5000]

## For Tiny-Model, we will take top 1000 movies
# master_dataset = master_dataset[:1000]

## For Extra-Small-Model, we will take top 5000 movies
# master_dataset = master_dataset[:5000]

## For Small-Model, we will take top 10000 movies
# master_dataset = master_dataset[:10000]

## For Medium-Model, we will take top 20000 movies
# master_dataset = master_dataset[:20000]

## For Large-Model, we will take top 30000 movies
# master_dataset = master_dataset[:30000]

## LEAVE ALL THE LINES COMMENTED IF YOU WISH TO TRAIN FULL MOVIES DATASET.

---

In [34]:
## This is the final dataset, which we will use to build the word count matrix and the cosine similarity matrix
master_dataset.head()

Unnamed: 0,release_date,title,main_director,soup
0,2015-06-17,Minions,Kyle Balda,assist aftercreditssting duringcreditssting ev...
1,2014-10-24,Big Hero 6,Chris Williams,brotherbrotherrelationship hero talent reveng ...
2,2016-02-09,Deadpool,Tim Miller,antihero mercenari marvelcom superhero basedon...
3,2017-04-19,Guardians of the Galaxy Vol. 2,James Gunn,sequel superhero basedoncom misfit space outer...
4,2009-12-10,Avatar,James Cameron,cultureclash futur spacewar spacecoloni societ...


In [35]:
print(master_dataset.shape)

(5000, 4)


## Section 4: Recommendation Matrix

>     Building the matrix which contains similarity scores between movies based on the features

#### 1: Training Word based count vectorizer model

In [36]:
## Creating a CountVectorizer with a word analyzer, n-grams of length 1-2, English stop words removed, and min_df=2 (a term must occur in at least 2 movies)
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=2, stop_words='english')

## Fitting the vectorizer on the soup column and transforming it into a count matrix
count_matrix = count.fit_transform(master_dataset['soup'])

In [37]:
print(count_matrix.shape)

(5000, 12778)
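
To see roughly what `ngram_range=(1, 2)` and `min_df=2` do, the sketch below rebuilds a vocabulary in plain Python. It is an approximation, not scikit-learn's implementation: it splits on whitespace only, whereas `CountVectorizer` by default also drops single-character tokens and, with `stop_words='english'`, common English words. The helper names and sample soups are illustrative.

```python
from collections import Counter


def ngram_tokens(text, n_max=2):
    """Split on whitespace and emit all n-grams of length 1..n_max."""
    words = text.lower().split()
    return [' '.join(words[i:i + n]) for n in range(1, n_max + 1) for i in range(len(words) - n + 1)]


def build_vocabulary(docs, min_df=2):
    """Keep n-grams that appear in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(ngram_tokens(doc)))   # set(): count each document once
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Illustrative soups, not taken from the dataset
soups = ['toy friendship tomhanks', 'toy friendship timallen', 'space alien']
vocab = build_vocabulary(soups)
print(vocab)  # ['friendship', 'toy', 'toy friendship']
```

Terms that occur in a single movie, such as the actor names here, are dropped by the `min_df` filter, which keeps the number of columns in the count matrix manageable.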


#### 2: Building Cosine Similarity Matrix

**NOTE: THE FOLLOWING CODE CELL CAN CONSUME LARGE MEMORY**

In [38]:
## We build it as a pyarrow Table so that it can be written directly to Parquet
table = pa.Table.from_pandas(pd.DataFrame(cosine_similarity(count_matrix, count_matrix)))
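
A back-of-the-envelope estimate shows why memory becomes the limiting factor: the similarity matrix is dense and has one entry per pair of movies. The sketch assumes 8-byte float64 entries and counts a single copy of the matrix; the intermediate DataFrame and the Arrow table can add further copies, so peak usage may be noticeably higher.

```python
def dense_matrix_bytes(n_movies, bytes_per_value=8):
    """Size of a dense n x n matrix, assuming 8-byte float64 entries."""
    return n_movies * n_movies * bytes_per_value


# Model sizes mentioned in this notebook, plus the 46628 rows printed before cleaning
for n in (1000, 5000, 10000, 20000, 30000, 46628):
    print(f"{n:>6} movies -> {dense_matrix_bytes(n) / 1e9:6.2f} GB")
```

For 5000 movies this comes to 0.2 GB, while 46628 movies would need about 17.4 GB for one copy of the matrix. The cleaned dataset has fewer rows than that, so the last figure is an upper bound.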

## Model & Data Export

>     Exporting the trained model & dataset efficiently

We export the model in Parquet format, for three reasons (we recommend it for your own projects as well):

1. Uses less storage
2. Compresses well
3. Fast and optimized for efficient reads and writes

In [39]:
## save the Master Dataset
master_dataset.to_parquet('/content/movie_database.parquet',engine='fastparquet',index=False)

In [40]:
## Writing the Matrix table
pq.write_table(table, '/content/model.parquet')

In [50]:
!mv /content/model.parquet /content/drive/MyDrive/

## Inference

>     Loading the trained model to execute Inference

In [51]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [52]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet  ## the submodule must be imported explicitly for pa.parquet to be available

In [53]:
master_dataset = pd.read_parquet('/content/movie_database.parquet')

In [55]:
master_dataset.head(3)

Unnamed: 0,release_date,title,main_director,soup
0,2015-06-17,Minions,Kyle Balda,assist aftercreditssting duringcreditssting ev...
1,2014-10-24,Big Hero 6,Chris Williams,brotherbrotherrelationship hero talent reveng ...
2,2016-02-09,Deadpool,Tim Miller,antihero mercenari marvelcom superhero basedon...


In [59]:
table = pa.parquet.read_table('/content/drive/MyDrive/model.parquet').to_pandas()

In [60]:
master_dataset = master_dataset.reset_index()
titles = master_dataset['title']
indices = pd.Series(master_dataset.index, index=master_dataset['title'])

In [61]:
def get_recommendations(movie_id_from_db,movie_db):
    try:
        sim_scores = list(enumerate(movie_db[movie_id_from_db]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        sim_scores = sim_scores[1:15] ## skip the first entry (the movie itself) and keep the next 14 recommendations

        movie_indices = [i[0] for i in sim_scores]
        output = master_dataset.iloc[movie_indices]
        output.reset_index(inplace=True, drop=True)

        response = []
        for i in range(len(output)):
            response.append({
                'movie_title':output['title'].iloc[i],
                'movie_release_date':output['release_date'].iloc[i],
                'movie_director':output['main_director'].iloc[i],
                'google_link':"https://www.google.com/search?q=" + '+'.join(output['title'].iloc[i].strip().split())
            })
        return response
    except Exception as e:
        print("error: ",e)
        return []
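
The function above sorts the whole similarity row and then drops the first entry, which assumes the queried movie always sorts first. If another movie has identical features (similarity 1.0), that may not hold. A possible variation is to exclude the query index explicitly and select only the top k entries. This is a sketch with a hypothetical `top_k_similar` helper, not the notebook's method.

```python
import heapq


def top_k_similar(scores, query_index, k=14):
    """Return (index, score) pairs for the k highest scores, skipping the query itself."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != query_index)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Illustrative similarity row for movie 0 (its self-similarity is 1.0)
row = [1.0, 0.2, 0.9, 0.0, 0.5]
print(top_k_similar(row, query_index=0, k=3))  # [(2, 0.9), (4, 0.5), (1, 0.2)]
```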

In [77]:
movie_name = input('Enter a movie Name: ')

Enter a movie Name: Avatar


In [78]:
movie_index = titles.to_list().index(movie_name)
recommendations = get_recommendations(movie_index,table)
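
The lookup with `list.index` requires an exact, case-sensitive title and raises a `ValueError` otherwise. A more forgiving lookup could fall back to a fuzzy match. The sketch below uses the standard library's `difflib`; `find_title_index` and the 0.6 cutoff are illustrative choices, not part of the original notebook.

```python
import difflib


def find_title_index(query, titles):
    """Case-insensitive exact match first, then the closest fuzzy match; None if nothing fits."""
    lowered = [t.lower() for t in titles]
    q = query.strip().lower()
    if q in lowered:
        return lowered.index(q)
    close = difflib.get_close_matches(q, lowered, n=1, cutoff=0.6)
    return lowered.index(close[0]) if close else None


# Illustrative titles taken from the preview of the final dataset
sample_titles = ['Minions', 'Big Hero 6', 'Deadpool', 'Avatar']
print(find_title_index('avatar', sample_titles))    # 3
print(find_title_index('Deadpol', sample_titles))   # 2
print(find_title_index('zzzz', sample_titles))      # None
```

A `None` result can then be reported to the user instead of letting the cell fail.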

In [79]:
print(f"{'Movie Title':<40} | {'Director':<20} | {'Release Date':<15}")
print("-"*80)
for recommendation in recommendations:
    print(f"{recommendation['movie_title']:<40} | {recommendation['movie_director']:<20} | {recommendation['movie_release_date']:<15}")

Movie Title                              | Director             | Release Date   
--------------------------------------------------------------------------------
Aliens                                   | James Cameron        | 1986-07-18     
Terminator 2: Judgment Day               | James Cameron        | 1991-07-01     
The Terminator                           | James Cameron        | 1984-10-26     
The Abyss                                | James Cameron        | 1989-08-09     
True Lies                                | James Cameron        | 1994-07-14     
Titanic                                  | James Cameron        | 1997-11-18     
2012: Ice Age                            | Trey Stokes          | 2011-06-27     
Guardians                                | Sarik Andreasyan     | 2017-02-14     
Battle For SkyArk                        | Simon Hung           | 2015-05-18     
Star Trek Into Darkness                  | J.J. Abrams          | 2013-05-05     
Star Wars: The Cl