In [1]:
#importing libraries

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import sklearn.preprocessing as pp
import scipy.sparse
import warnings

warnings.filterwarnings("ignore")

## Data Preparation

In [2]:
allMovies = pd.read_csv("IMDb Movies.csv")
print(allMovies.shape)
allMovies.head()

(85855, 22)


Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0
3,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25.0,3.0
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0


In [3]:
allMovies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 22 columns):
imdb_title_id            85855 non-null object
title                    85855 non-null object
original_title           85855 non-null object
year                     85855 non-null object
date_published           85855 non-null object
genre                    85855 non-null object
duration                 85855 non-null int64
country                  85791 non-null object
language                 85022 non-null object
director                 85768 non-null object
writer                   84283 non-null object
production_company       81400 non-null object
actors                   85786 non-null object
description              83740 non-null object
avg_vote                 85855 non-null float64
votes                    85855 non-null int64
budget                   23710 non-null object
usa_gross_income         15326 non-null object
worlwide_gross_income    31016 non-null object

Dropping the entries with no description.

In [4]:
allMovies = allMovies.dropna(subset = ['description'])
allMovies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 83740 entries, 0 to 85854
Data columns (total 22 columns):
imdb_title_id            83740 non-null object
title                    83740 non-null object
original_title           83740 non-null object
year                     83740 non-null object
date_published           83740 non-null object
genre                    83740 non-null object
duration                 83740 non-null int64
country                  83676 non-null object
language                 82926 non-null object
director                 83657 non-null object
writer                   82253 non-null object
production_company       79524 non-null object
actors                   83676 non-null object
description              83740 non-null object
avg_vote                 83740 non-null float64
votes                    83740 non-null int64
budget                   23383 non-null object
usa_gross_income         15293 non-null object
worlwide_gross_income    30270 non-null object

Counting the number of words in each description and analyzing the distribution.

In [5]:
allMovies["description_#words"] = allMovies["description"].str.split().str.len()
allMovies["description_#words"].describe()

count    83740.000000
mean        27.814808
std          8.849126
min          1.000000
25%         21.000000
50%         30.000000
75%         34.000000
max         79.000000
Name: description_#words, dtype: float64

About 25% of our descriptions have 21 words or fewer. Very short descriptions are poor inputs to the recommender system, since there is little to infer about a movie from so few words. 
<br>
<br>
*Here, I'm considering only those movies which have more than 15 words in their description.*

In [6]:
allMovies = allMovies[allMovies["description_#words"] > 15]
allMovies = allMovies.reset_index(drop=True)
allMovies.shape

(74889, 23)

**Segregating the dataset according to the genres.**

In [7]:
action_df = allMovies[allMovies["genre"].str.contains("Action")]
adventure_df = allMovies[allMovies["genre"].str.contains("Adventure")]
animation_df = allMovies[allMovies["genre"].str.contains("Animation")]
biography_df = allMovies[allMovies["genre"].str.contains("Biography")]
comedy_df = allMovies[allMovies["genre"].str.contains("Comedy")]
crime_df = allMovies[allMovies["genre"].str.contains("Crime")]
documentary_df = allMovies[allMovies["genre"].str.contains("Documentary")]
drama_df = allMovies[allMovies["genre"].str.contains("Drama")]
family_df = allMovies[allMovies["genre"].str.contains("Family")]
fantasy_df = allMovies[allMovies["genre"].str.contains("Fantasy")]
filmnoir_df = allMovies[allMovies["genre"].str.contains("Film-Noir")]
history_df = allMovies[allMovies["genre"].str.contains("History")]
horror_df = allMovies[allMovies["genre"].str.contains("Horror")]
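# "Music" not followed by "al": also matches a genre string that ends in "Music", and excludes "Musical"
music_df = allMovies[allMovies["genre"].str.contains(r"Music(?!al)")]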
musical_df = allMovies[allMovies["genre"].str.contains("Musical")]
mystery_df = allMovies[allMovies["genre"].str.contains("Mystery")]
romance_df = allMovies[allMovies["genre"].str.contains("Romance")]
scifi_df = allMovies[allMovies["genre"].str.contains("Sci-Fi")]
sport_df = allMovies[allMovies["genre"].str.contains("Sport")]
thriller_df = allMovies[allMovies["genre"].str.contains("Thriller")]
war_df = allMovies[allMovies["genre"].str.contains("War")]
western_df = allMovies[allMovies["genre"].str.contains("Western")]

In [8]:
action_df = action_df.reset_index(drop=True)
adventure_df = adventure_df.reset_index(drop=True)
animation_df = animation_df.reset_index(drop=True)
biography_df = biography_df.reset_index(drop=True)
comedy_df = comedy_df.reset_index(drop=True)
crime_df = crime_df.reset_index(drop=True)
documentary_df = documentary_df.reset_index(drop=True)
drama_df = drama_df.reset_index(drop=True)
family_df = family_df.reset_index(drop=True)
fantasy_df = fantasy_df.reset_index(drop=True)
filmnoir_df = filmnoir_df.reset_index(drop=True)
history_df = history_df.reset_index(drop=True)
horror_df = horror_df.reset_index(drop=True)
music_df = music_df.reset_index(drop=True)
musical_df = musical_df.reset_index(drop=True)
mystery_df = mystery_df.reset_index(drop=True)
romance_df = romance_df.reset_index(drop=True)
scifi_df = scifi_df.reset_index(drop=True)
sport_df = sport_df.reset_index(drop=True)
thriller_df = thriller_df.reset_index(drop=True)
war_df = war_df.reset_index(drop=True)
western_df = western_df.reset_index(drop=True)
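
The per-genre filtering and index reset above can also be written as one loop over a list of genre names. The following is a minimal sketch on a made-up toy frame, not the notebook's actual data. It assumes the genre labels are separated by `", "` (as in the sample rows shown earlier), and `genre_frames` is a hypothetical name. Matching on the split labels instead of substrings keeps "Music" and "Musical" apart without a special-case pattern.

```python
import pandas as pd

# Toy data for illustration only (not taken from the IMDb dataset)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Drama, Music", "Musical, Comedy", "Music", "Action, Drama"],
})

genres = ["Action", "Comedy", "Drama", "Music", "Musical"]

# Split "Drama, Music" into ["Drama", "Music"] so labels are compared exactly
genre_lists = toy["genre"].str.split(", ")

# One filtered, re-indexed DataFrame per genre
genre_frames = {
    g: toy[genre_lists.apply(lambda labels, g=g: g in labels)].reset_index(drop=True)
    for g in genres
}

print(genre_frames["Music"]["title"].tolist())    # ['A', 'C']
print(genre_frames["Musical"]["title"].tolist())  # ['B']
```

A dictionary keyed by genre also makes later steps (saving, vectorizing) a matter of iterating over `genre_frames.items()`.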

Storing the segregated datasets to the disk.

In [10]:
action_df.to_csv("datasets/action_df.csv", index=False)
adventure_df.to_csv("datasets/adventure_df.csv", index=False)
animation_df.to_csv("datasets/animation_df.csv", index=False)
biography_df.to_csv("datasets/biography_df.csv", index=False)
comedy_df.to_csv("datasets/comedy_df.csv", index=False)
crime_df.to_csv("datasets/crime_df.csv", index=False)
documentary_df.to_csv("datasets/documentary_df.csv", index=False)
drama_df.to_csv("datasets/drama_df.csv", index=False)
family_df.to_csv("datasets/family_df.csv", index=False)
fantasy_df.to_csv("datasets/fantasy_df.csv", index=False)
filmnoir_df.to_csv("datasets/filmnoir_df.csv", index=False)
history_df.to_csv("datasets/history_df.csv", index=False)
horror_df.to_csv("datasets/horror_df.csv", index=False)
music_df.to_csv("datasets/music_df.csv", index=False)
musical_df.to_csv("datasets/musical_df.csv", index=False)
mystery_df.to_csv("datasets/mystery_df.csv", index=False)
romance_df.to_csv("datasets/romance_df.csv", index=False)
scifi_df.to_csv("datasets/scifi_df.csv", index=False)
sport_df.to_csv("datasets/sport_df.csv", index=False)
thriller_df.to_csv("datasets/thriller_df.csv", index=False)
war_df.to_csv("datasets/war_df.csv", index=False)
western_df.to_csv("datasets/western_df.csv", index=False)

## Model Preparation and Score Generation

Using unigrams, bigrams and trigrams for the Tf-Idf scores of each genre, wrapped in a generic function. Also defining a separate unigram-only model for the whole dataset of about 75k movies.
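
To see what the n-gram range changes, here is a small sketch on two made-up descriptions (not from the dataset). It uses the same `TfidfVectorizer` settings as this notebook. The names `toy_descriptions`, `unigram_vec` and `ngram_vec` are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up descriptions for illustration
toy_descriptions = [
    "A corrupt police officer changes his ways",
    "A police officer investigates a corrupt politician",
]

# Unigrams only vs. unigrams + bigrams + trigrams
unigram_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), stop_words="english")
ngram_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")

unigram_matrix = unigram_vec.fit_transform(toy_descriptions)
ngram_matrix = ngram_vec.fit_transform(toy_descriptions)

# Rows are documents, columns are vocabulary terms
print(unigram_matrix.shape, ngram_matrix.shape)

# Multi-word terms exist only in the (1, 3) vocabulary
print(sorted(t for t in ngram_vec.vocabulary_ if " " in t))
```

The (1, 3) vocabulary is considerably larger, which is one plausible reason to keep the whole-dataset model to unigrams.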

<br>
Defining a modified cosine similarity function, which is much faster than the cosine_similarity() function in sklearn.metrics.pairwise, for getting the similarity scores.

In [9]:
tfidf_model = TfidfVectorizer(analyzer = 'word', ngram_range = (1,3), stop_words = 'english')
tfidf_modelAll = TfidfVectorizer(analyzer = 'word', ngram_range = (1,1), stop_words = 'english')

def model(df):
    sparse_matrix = tfidf_model.fit_transform(df['description'])
    return sparse_matrix

def modelAll(df):
    sparse_matrix = tfidf_modelAll.fit_transform(df['description'])
    return sparse_matrix

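def cosine_similarities(mat):
    # Expects documents as columns (callers pass sm.T). Each column is scaled to
    # unit L2 norm, so the product below holds the pairwise cosine similarities.
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T @ col_normed_mat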
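
As a sanity check on the normalized-product idea, the sketch below compares it with scikit-learn's `cosine_similarity` on a small made-up matrix. `cosine_similarities_sketch` is a stand-alone copy written for this example, and `toy` is illustrative data. It does not measure speed.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

def cosine_similarities_sketch(mat):
    # Columns are documents: scale each to unit L2 norm, then take all dot products
    col_normed = normalize(mat.tocsc(), axis=0)
    return col_normed.T @ col_normed

# Three "documents" (rows) over three "terms" (columns)
toy = sp.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
    [1.0, 1.0, 0.0],
]))

fast = cosine_similarities_sketch(toy.T).toarray()  # transpose so documents are columns
reference = cosine_similarity(toy)                   # documents as rows

print(np.allclose(fast, reference))
```

The result stays sparse until `.toarray()` is called, which matters when the similarity matrix covers tens of thousands of movies.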

Getting the sparse matrices for all the genres respectively.

In [11]:
action_sm = model(action_df)
adventure_sm = model(adventure_df)
animation_sm = model(animation_df)
biography_sm = model(biography_df)
comedy_sm = model(comedy_df)
crime_sm = model(crime_df)
documentary_sm = model(documentary_df)
drama_sm = model(drama_df)
family_sm = model(family_df)
fantasy_sm = model(fantasy_df)
filmnoir_sm = model(filmnoir_df)
history_sm = model(history_df)
horror_sm = model(horror_df)
music_sm = model(music_df)
musical_sm = model(musical_df)
mystery_sm = model(mystery_df)
romance_sm = model(romance_df)
scifi_sm = model(scifi_df)
sport_sm = model(sport_df)
thriller_sm = model(thriller_df)
war_sm = model(war_df)
western_sm = model(western_df)

In [12]:
# Whole dataset: use the unigram-only model defined above
all_sm = modelAll(allMovies)

Storing the sparse matrices to the disk.

In [13]:
scipy.sparse.save_npz("sparse_matrices/action_sm.npz", action_sm)
scipy.sparse.save_npz("sparse_matrices/adventure_sm.npz", adventure_sm)
scipy.sparse.save_npz("sparse_matrices/animation_sm.npz", animation_sm)
scipy.sparse.save_npz("sparse_matrices/biography_sm.npz", biography_sm)
scipy.sparse.save_npz("sparse_matrices/comedy_sm.npz", comedy_sm)
scipy.sparse.save_npz("sparse_matrices/crime_sm.npz", crime_sm)
scipy.sparse.save_npz("sparse_matrices/documentary_sm.npz", documentary_sm)
scipy.sparse.save_npz("sparse_matrices/drama_sm.npz", drama_sm)
scipy.sparse.save_npz("sparse_matrices/family_sm.npz", family_sm)
scipy.sparse.save_npz("sparse_matrices/fantasy_sm.npz", fantasy_sm)
scipy.sparse.save_npz("sparse_matrices/filmnoir_sm.npz", filmnoir_sm)
scipy.sparse.save_npz("sparse_matrices/history_sm.npz", history_sm)
scipy.sparse.save_npz("sparse_matrices/horror_sm.npz", horror_sm)
scipy.sparse.save_npz("sparse_matrices/music_sm.npz", music_sm)
scipy.sparse.save_npz("sparse_matrices/musical_sm.npz", musical_sm)
scipy.sparse.save_npz("sparse_matrices/mystery_sm.npz", mystery_sm)
scipy.sparse.save_npz("sparse_matrices/romance_sm.npz", romance_sm)
scipy.sparse.save_npz("sparse_matrices/scifi_sm.npz", scifi_sm)
scipy.sparse.save_npz("sparse_matrices/sport_sm.npz", sport_sm)
scipy.sparse.save_npz("sparse_matrices/thriller_sm.npz", thriller_sm)
scipy.sparse.save_npz("sparse_matrices/war_sm.npz", war_sm)
scipy.sparse.save_npz("sparse_matrices/western_sm.npz", western_sm)
scipy.sparse.save_npz("sparse_matrices/all_sm.npz", all_sm)
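
The save calls above follow one pattern, so they can be driven from a dictionary. A minimal sketch, assuming the matrices are collected in a dict keyed by name. Here `matrices` holds random stand-ins and a temporary directory replaces `sparse_matrices/`, so nothing in the project folder is touched.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Random stand-ins for the Tf-Idf matrices (illustration only)
matrices = {
    "drama_sm": scipy.sparse.random(5, 8, density=0.3, format="csr", random_state=0),
    "scifi_sm": scipy.sparse.random(4, 8, density=0.3, format="csr", random_state=1),
}

out_dir = tempfile.mkdtemp()  # stand-in for the "sparse_matrices" directory

# Save every matrix under its own name
for name, mat in matrices.items():
    scipy.sparse.save_npz(os.path.join(out_dir, f"{name}.npz"), mat)

# Load them back and confirm nothing changed
reloaded = {
    name: scipy.sparse.load_npz(os.path.join(out_dir, f"{name}.npz"))
    for name in matrices
}
round_trip_ok = all(
    np.allclose(matrices[name].toarray(), reloaded[name].toarray())
    for name in matrices
)
print(round_trip_ok)
```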

## Testing

Choosing just 3 genres for now, and finding the cosine similarity scores in the form of sparse matrices.

In [14]:
drama_sm = scipy.sparse.load_npz("sparse_matrices/drama_sm.npz")
animation_sm = scipy.sparse.load_npz("sparse_matrices/animation_sm.npz")
scifi_sm = scipy.sparse.load_npz("sparse_matrices/scifi_sm.npz")

In [15]:
drama_sp = cosine_similarities(drama_sm.T)
animation_sp = cosine_similarities(animation_sm.T)
scifi_sp = cosine_similarities(scifi_sm.T)

A generic function for getting recommendations.

In [16]:
def give_recommendations(df, sim_mat, mov_name):
    # Index of the first movie whose original title matches
    movie = df[df.original_title == mov_name].index[0]
    scores = sim_mat[movie].toarray()
    # Ten highest scores, skipping the very top one (normally the movie itself)
    index_recomm = scores.T.argsort(axis=0)[-11:-1]
    
    print("Original Description: ",df.description[movie],"\n")

    # Print from most to least similar
    for i in np.flipud(index_recomm):
        print("Score: ",scores.T[i][0],"\t Title: ",df.title[i[0]])
        print("IMDb Title ID: ",df.imdb_title_id[i[0]])
        print(df.description[i[0]],"\n")
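
For use outside a notebook, returning the results may be more convenient than printing them. The sketch below is a hypothetical variant, not the function used in this notebook. `top_k_similar` and the toy data are assumptions for illustration. It masks the query movie explicitly instead of dropping the top-ranked entry, and like the original it assumes the DataFrame index has been reset so labels equal row positions.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

def top_k_similar(df, sim_mat, mov_name, k=10):
    """Return the k most similar movies as a DataFrame (hypothetical helper)."""
    matches = df.index[df["original_title"] == mov_name]
    if len(matches) == 0:
        raise ValueError(f"No movie with original_title == {mov_name!r}")
    movie = matches[0]

    # One row of the similarity matrix as a flat array
    scores = sim_mat[movie].toarray().ravel()
    scores[movie] = -np.inf  # never recommend the query movie itself

    top = np.argsort(scores)[::-1][:k]  # highest scores first
    result = df.loc[top, ["title"]].copy()
    result["score"] = scores[top]
    return result.reset_index(drop=True)

# Toy data for illustration only
toy_df = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "original_title": ["Alpha", "Beta", "Gamma", "Delta"],
})
toy_sim = sp.csr_matrix(np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.7, 0.3, 1.0, 0.5],
    [0.1, 0.4, 0.5, 1.0],
]))

recs = top_k_similar(toy_df, toy_sim, "Alpha", k=2)
print(recs["title"].tolist())  # ['Gamma', 'Beta']
```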

In [17]:
give_recommendations(drama_df, drama_sp, "Simmba")

Original Description:  Simmba, a Corrupt Officer, enjoys all the perks of being an immoral and unethical police officer until a life-changing event forces him to choose the righteous path. 

Score:  [0.11642789] 	 Title:  Lipstikka
IMDb Title ID:  tt1374988
Two women reunite in London, where they go over the details of a life-changing event which occurred when they were teenagers in Jerusalem. 

Score:  [0.09896578] 	 Title:  Ayogya
IMDb Title ID:  tt9304360
A corrupt police officer finds his life changing when he takes on a case of gang rape. 

Score:  [0.09312665] 	 Title:  Little Birds
IMDb Title ID:  tt1623745
Lily and Alison face a life-changing event after they leave their Salton Sea home and follow the boys they meet back to Los Angeles. 

Score:  [0.09086234] 	 Title:  Temper
IMDb Title ID:  tt4442758
Daya, a corrupt police officer, finds his life changing when he takes on a case of gang rape. 

Score:  [0.07739576] 	 Title:  Destined
IMDb Title ID:  tt2515456
Destined tells th

In [18]:
give_recommendations(animation_df, animation_sp, "Frozen")

Original Description:  When the newly crowned Queen Elsa accidentally uses her power to turn things into ice to curse her home in infinite winter, her sister Anna teams up with a mountain man, his playful reindeer, and a snowman to change the weather condition. 

Score:  [0.04876863] 	 Title:  Norm of the North: Keys to the Kingdom
IMDb Title ID:  tt7913394
Norm, the newly crowned polar bear king of the arctic, must save New York City and his home. But Norm goes from hero to villain when he's framed for a crime he didn't commit. He must work ... 

Score:  [0.04215783] 	 Title:  Frozen II - Il segreto di Arendelle
IMDb Title ID:  tt4520988
Anna, Elsa, Kristoff, Olaf and Sven leave Arendelle to travel to an ancient, autumn-bound forest of an enchanted land. They set out to find the origin of Elsa's powers in order to save their kingdom. 

Score:  [0.03536353] 	 Title:  Howard Lovecraft and the Frozen Kingdom
IMDb Title ID:  tt4768656
After visiting his father in Arkham Sanitarium, young 

In [19]:
give_recommendations(scifi_df, scifi_sp, "Interstellar")

Original Description:  A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival. 

Score:  [0.06204962] 	 Title:  Ritorno dal futuro
IMDb Title ID:  tt0108342
During the unstoppable alien invasion of 2022, one man flees through a wormhole to 1992 in hope of changing the future. But attempt after attempt is made by someone there to catch and kill him. 

Score:  [0.05114115] 	 Title:  La fine del mondo
IMDb Title ID:  tt1213663
Five friends who reunite in an attempt to top their epic pub crawl from twenty years earlier unwittingly become humanity's only hope for survival. 

Score:  [0.04710461] 	 Title:  The Beyond
IMDb Title ID:  tt5723416
After observing an anomaly in space, scientists transplant human brains in to synthetic bodies and send them through the wormhole. 

Score:  [0.0378474] 	 Title:  The Last Days on Mars
IMDb Title ID:  tt1709143
A group of astronaut explorers succumb one by one to a mysterious and terrifying force while collect

***The recommendations look relevant to the description of the chosen movie: for "Simmba", both "Ayogya" and "Temper" are about a corrupt police officer whose life changes. So, we can proceed with the deployment.***

## Thank You