# What is a Recommendation System?
A recommender (or recommendation) system is a subclass of information filtering systems that seeks to predict the rating or preference a user would give to an item.

They are primarily used in applications where a person or entity interacts with a product or service. To further improve their experience with the product, we try to personalize it to their needs. For this we have to look at their past interactions with the product.

*In one line* -> **Specialized content for everyone.**

*For further info, [Wiki](https://en.wikipedia.org/wiki/Recommender_system#:~:text=A%20recommender%20system%2C%20or%20a,would%20give%20to%20an%20item.)*

## Types of Recommender System
* 1). Popularity Based
* 2). Classification Based
* 3). Content Based
* 4). Collaborative Based
* 5). Hybrid Based (Content + Collaborative)
* 6). Association Based Rule Mining

# Content based recommender system
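A content-based recommender suggests items based on their descriptions. Here we convert each movie's descriptive text (genre, director, actors and keywords from the description) into a vector of word counts and compute the cosine similarity between these vectors. Similar movies have a high cosine similarity and are therefore recommended to the user.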
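To make the idea of cosine similarity concrete, here is a minimal sketch on hand-made count vectors. The four-word vocabulary and the three movies are hypothetical and are not taken from the dataset; the helper `cosine` is an illustrative name only.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word counts over the vocabulary [drama, history, crime, romance]
movie_a = [1, 1, 0, 0]   # drama, history
movie_b = [1, 0, 1, 0]   # drama, crime
movie_c = [0, 0, 0, 1]   # romance only

sim_ab = cosine(movie_a, movie_b)  # one shared word out of two each -> 0.5
sim_ac = cosine(movie_a, movie_c)  # no shared words -> 0.0
print(round(sim_ab, 3), round(sim_ac, 3))
```

A value of 1 means the two vectors point in the same direction (same word proportions), and 0 means the movies share no words at all.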

# Import packages and dataset

We will use the rake_nltk package. RAKE stands for Rapid Automatic Keyword Extraction, a domain-independent keyword extraction algorithm that tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.

*Credits to -> [csurfer](https://github.com/csurfer/rake-nltk)*
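The following is a simplified, standard-library sketch of the idea behind RAKE's word degrees. It is not rake_nltk's implementation: the stopword list is a tiny hypothetical one (rake_nltk uses NLTK's English stopwords), and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list (assumption, not NLTK's list)
STOPWORDS = {"a", "an", "and", "are", "both", "of", "the", "to", "with", "in", "is", "for"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def word_degrees(text):
    """Degree of a word = total length of the candidate phrases it appears in."""
    degree = defaultdict(int)
    for phrase in candidate_phrases(text):
        for word in phrase:
            degree[word] += len(phrase)
    return dict(degree)

degrees = word_degrees("True story of notorious Australian outlaw Ned Kelly.")
print(degrees)
```

Words that co-occur in long phrases get a higher degree, and stopwords such as "of" never become keywords.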

In [1]:
!pip install rake_nltk

Collecting rake_nltk
  Downloading rake_nltk-1.0.4.tar.gz (7.6 kB)
Building wheels for collected packages: rake-nltk
  Building wheel for rake-nltk (setup.py) ... done
  Created wheel for rake-nltk: filename=rake_nltk-1.0.4-py2.py3-none-any.whl size=7819 sha256=30eed2996a42e0b155c9607f1ffb7cf4b07cb2a72f7e600b64bd5ed94e73fe8c
  Stored in directory: /root/.cache/pip/wheels/7c/d9/8a/b8a9244fa89a07f288f9fe006aafc79d93fceb58496c29b606
Successfully built rake-nltk
Installing collected packages: rake-nltk
Successfully installed rake-nltk-1.0.4


In [2]:
import pandas as pd
import numpy as np

from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer #converts a collection of text documents into a matrix of token counts
from ast import literal_eval #safely evaluates a string containing a Python literal (not used below)

In [3]:
data = pd.read_csv('../input/imdb-extensive-dataset/IMDb movies.csv')
print(data.shape)
data.head()

(81273, 22)


Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,537,$ 2250,,,,7.0,7.0
1,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.9,171,,,,,4.0,2.0
2,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,420,$ 45000,,,,24.0,3.0
3,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2019,,,,,28.0,14.0
4,tt0002199,"From the Manger to the Cross; or, Jesus of Naz...","From the Manger to the Cross; or, Jesus of Naz...",1912,1913,"Biography, Drama",60,USA,English,Sidney Olcott,...,"R. Henderson Bland, Percy Dyer, Gene Gauntier,...","An account of the life of Jesus Christ, based ...",5.7,438,,,,,12.0,5.0


In [4]:
#There are many null values
data.isnull().sum()

imdb_title_id                0
title                        0
original_title               0
year                         0
date_published               0
genre                        0
duration                     0
country                     39
language                   755
director                    73
writer                    1493
production_company        4325
actors                      66
description               2430
avg_vote                     0
votes                        0
budget                   58469
usa_gross_income         66179
worlwide_gross_income    51381
metascore                68551
reviews_from_users        7077
reviews_from_critics     10987
dtype: int64

In [5]:
#Let's replace all null values with the string 'missing value'
data = data.fillna('missing value')

### Recommend movies based on a director/writer

In [6]:
#Recommend movies based on a director (please give the full name)
#rec_director = input('Enter the director whose movies you want to be recommended: ')
rec_director = 'Christopher Nolan'
data[data['director'] == rec_director]

#Recommend movies based on a writer (please give the full name)
#rec_writer = input('Enter the writer whose movies you want to be recommended: ')
#data[data['writer'] == rec_writer]

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
32413,tt0154506,Following,Following,1998,1999-11-05,"Crime, Mystery, Thriller",69,UK,English,Christopher Nolan,...,"Jeremy Theobald, Alex Haw, Lucy Russell, John ...",A young writer who follows strangers for mater...,7.5,79954,$ 6000,$ 48482,$ 48482,60,190,135
35417,tt0209144,Memento,Memento,2000,2000-10-27,"Mystery, Thriller",113,USA,English,Christopher Nolan,...,"Guy Pearce, Carrie-Anne Moss, Joe Pantoliano, ...",A man with short-term memory loss attempts to ...,8.4,1052890,$ 9000000,$ 25544867,$ 39723096,80,2166,203
38610,tt0278504,Insomnia,Insomnia,2002,2002-08-30,"Drama, Mystery, Thriller",118,"USA, UK",English,Christopher Nolan,...,"Al Pacino, Martin Donovan, Oliver 'Ole' Zemen,...",Two Los Angeles homicide detectives are dispat...,7.2,258114,$ 46000000,$ 67355513,$ 113758770,78,678,183
42664,tt0372784,Batman Begins,Batman Begins,2005,2005-06-16,"Action, Adventure",140,"USA, UK","English, Urdu, Mandarin",Christopher Nolan,...,"Christian Bale, Michael Caine, Liam Neeson, Ka...","After training with his mentor, Batman begins ...",8.2,1225020,$ 150000000,$ 206852432,$ 373413297,70,2810,310
46756,tt0468569,The Dark Knight,The Dark Knight,2008,2008-07-24,"Action, Crime, Drama",152,"USA, UK","English, Mandarin",Christopher Nolan,...,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",When the menace known as the Joker wreaks havo...,9.0,2134569,$ 185000000,$ 535234033,$ 1004934033,84,6344,418
47427,tt0482571,The Prestige,The Prestige,2006,2006-11-10,"Drama, Mystery, Sci-Fi",130,"USA, UK",English,Christopher Nolan,...,"Hugh Jackman, Christian Bale, Michael Caine, P...","After a tragic accident, two stage magicians e...",8.5,1095249,$ 40000000,$ 53089891,$ 109676311,66,1285,368
48940,tt0816692,Interstellar,Interstellar,2014,2014-11-07,"Adventure, Drama, Sci-Fi",169,"USA, UK, Canada",English,Christopher Nolan,...,"Ellen Burstyn, Matthew McConaughey, Mackenzie ...",A team of explorers travel through a wormhole ...,8.6,1348155,$ 165000000,$ 188020017,$ 677471339,74,3571,612
55050,tt1345836,The Dark Knight Rises,The Dark Knight Rises,2012,2012-07-20,"Action, Thriller",164,"UK, USA","English, Arabic",Christopher Nolan,...,"Christian Bale, Gary Oldman, Tom Hardy, Joseph...",Eight years after the Joker's reign of anarchy...,8.4,1421494,$ 250000000,$ 448139099,$ 1081041287,78,2878,837
55287,tt1375666,Inception,Inception,2010,2010-07-16,"Action, Adventure, Sci-Fi",148,"USA, UK","English, Japanese, French",Christopher Nolan,...,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",A thief who steals corporate secrets through t...,8.8,1892929,$ 160000000,$ 292576195,$ 829895144,74,3439,462
73689,tt5013056,Dunkirk,Dunkirk,2017,2017-07-21,"Action, Drama, History",106,"UK, Netherlands, France, USA","English, French, German",Christopher Nolan,...,"Fionn Whitehead, Damien Bonnard, Aneurin Barna...","Allied soldiers from Belgium, the British Empi...",7.9,487517,$ 100000000,$ 189740665,$ 526940665,94,2222,608


### Recommend movies based on actor

In [7]:
#rec_actor = input('Enter the actor whose movies you want to be recommended: ')
rec_actor = 'Ryan Gosling'
rec_actor = data[data['actors'].str.contains(rec_actor)] 

In [8]:
rec_actor

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
28920,tt0116356,Frankenstein and Me,Frankenstein and Me,1996,1996-07-12,"Comedy, Family, Fantasy",91,Canada,English,Robert Tinnell,...,"Jamieson Boulanger, Ricky Mabe, Polly Shannon,...",Earl Williams is a dreamer teenager obsessed w...,5.4,346,missing value,missing value,missing value,missing value,5,2
35505,tt0210945,Remember the Titans,Remember the Titans,2000,2001-02-09,"Biography, Drama, Sport",113,USA,English,Boaz Yakin,...,"Denzel Washington, Will Patton, Wood Harris, R...",The true story of a newly appointed African-Am...,7.8,187326,$ 30000000,$ 115654751,$ 136706683,48,439,158
37002,tt0247199,The Believer,The Believer,2001,2001-08-23,Drama,98,USA,"English, Hebrew",Henry Bean,...,"Ryan Gosling, Peter Meadows, Garret Dillahunt,...",A young Jewish man develops a fiercely anti-Se...,7.2,35195,$ 1500000,$ 416925,$ 1309316,75,162,96
37908,tt0264935,Murder by Numbers,Murder by Numbers,2002,2002-06-28,"Crime, Mystery, Thriller",115,USA,English,Barbet Schroeder,...,"Sandra Bullock, Ben Chaplin, Ryan Gosling, Mic...","Two gifted high school students execute a ""per...",6.2,50625,$ 50000000,$ 31945749,$ 56714147,50,307,133
38028,tt0266971,The Slaughter Rule,The Slaughter Rule,2002,2002-09-14,"Drama, Sport",112,USA,English,"Alex Smith, Andrew J. Smith",...,"Ryan Gosling, David Morse, Clea DuVall, David ...","A young man finds solace with a young woman, h...",6.0,2237,$ 500000,$ 13411,$ 13411,65,30,17
39940,tt0301976,The United States of Leland,The United States of Leland,2003,2005-03-25,Drama,108,USA,English,Matthew Ryan Hoge,...,"Don Cheadle, Ryan Gosling, Chris Klein, Jena M...",A young man's experience in a juvenile detenti...,7.1,22026,missing value,$ 343847,$ 343847,37,105,65
41185,tt0332280,The Notebook,The Notebook,2004,2004-06-25,"Drama, Romance",123,USA,English,Nick Cassavetes,...,"Tim Ivey, Gena Rowlands, Starletta DuPois, Jam...",A poor yet passionate young man falls in love ...,7.8,486191,$ 29000000,$ 81001787,$ 115533700,53,1197,178
42594,tt0371257,Stay,Stay,2005,2005-10-21,"Drama, Mystery, Thriller",99,USA,English,Marc Forster,...,"Ewan McGregor, Ryan Gosling, Kate Burton, Naom...",This movie focuses on the attempts of a psychi...,6.8,71166,$ 50000000,$ 3626883,$ 8342132,41,372,121
46746,tt0468489,Half Nelson,Half Nelson,2006,2007-05-25,Drama,106,USA,English,Ryan Fleck,...,"Ryan Gosling, Jeff Lima, Shareeka Epps, Nathan...",An inner-city junior high school teacher with ...,7.2,81313,$ 700000,$ 2697938,$ 4660481,85,204,217
47636,tt0488120,Fracture,Fracture,2007,2007-04-20,"Crime, Drama, Thriller",113,"USA, Germany",English,Gregory Hoblit,...,"Anthony Hopkins, Ryan Gosling, David Strathair...",An attorney intent on climbing the career ladd...,7.2,173459,missing value,$ 39015018,$ 91354215,68,358,204
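The actor lookup above is case sensitive and treats the name as a regular expression. A hedged variant, shown on a hypothetical three-row stand-in for the data (the helper name `movies_with_actor` is an assumption), matches the name literally, ignores case and tolerates missing values:

```python
import pandas as pd

# Hypothetical three-row stand-in for the 'title' and 'actors' columns
movies = pd.DataFrame({
    "title": ["Half Nelson", "Stay", "Cleopatra"],
    "actors": ["Ryan Gosling, Jeff Lima", "Ewan McGregor, Ryan Gosling", None],
})

def movies_with_actor(df, name):
    """Rows whose 'actors' string contains `name`, ignoring case."""
    # regex=False treats the name as plain text; na=False skips missing values
    mask = df["actors"].str.contains(name, case=False, regex=False, na=False)
    return df[mask]

found = movies_with_actor(movies, "ryan gosling")
print(found["title"].tolist())
```

Literal matching matters for names containing characters such as `.` or `(`, which have a special meaning in regular expressions.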


# Data Preprocessing

**Things to do:**
* Impute all missing values
* Extract only relevant columns
* Convert all text to lower case
* Split comma-separated strings (genres, names) into lists
* Merge each director and actor name into a single word (e.g. 'charlestait') so that a full name is treated as one token during text extraction
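The steps above can be sketched on a toy dataframe before applying them to the real data. The two rows and the helper `to_tokens` are hypothetical; the sketch also strips the whitespace around each list entry, which the notebook cells below do not do for genres.

```python
import pandas as pd

# Hypothetical two-row stand-in for the IMDb data
toy = pd.DataFrame({
    "title": ["Cleopatra", "L'Inferno"],
    "genre": ["Drama, History", "Adventure, Drama, Fantasy"],
    "director": ["Charles L. Gaskill", None],
})

def to_tokens(cell, squash_spaces=False):
    """Lower-case a comma separated string and return its parts as a list."""
    parts = [part.strip() for part in cell.lower().split(",")]
    if squash_spaces:
        # 'charles l. gaskill' -> 'charlesl.gaskill', so a full name is one token
        parts = [part.replace(" ", "") for part in parts]
    return parts

toy = toy.fillna("missing value")                      # impute missing values
toy["title"] = toy["title"].str.lower()                # lower-case plain text
toy["genre"] = toy["genre"].map(to_tokens)             # split into a list
toy["director"] = toy["director"].map(lambda cell: to_tokens(cell, squash_spaces=True))
print(toy)
```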

In [9]:
data.columns

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

In [10]:
#Extract relevant columns that would influence a movie's rating based on the content.

#Due to memory issues we use only the first 3,000 rows. You can try this code on Google Colab for better performance
data1 = data[['title','genre','director','actors','description']].head(3000)
data1.head()

Unnamed: 0,title,genre,director,actors,description
0,The Story of the Kelly Gang,"Biography, Crime, Drama",Charles Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...
1,Den sorte drøm,Drama,Urban Gad,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...
2,Cleopatra,"Drama, History",Charles L. Gaskill,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...
3,L'Inferno,"Adventure, Drama, Fantasy","Francesco Bertolini, Adolfo Padovan","Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...
4,"From the Manger to the Cross; or, Jesus of Naz...","Biography, Drama",Sidney Olcott,"R. Henderson Bland, Percy Dyer, Gene Gauntier,...","An account of the life of Jesus Christ, based ..."


Remember that the more columns you extract here, the higher the chance of overfitting, because the recommended movies will also take into account director, writer, production_company, etc. These features may be irrelevant to a user who wants to be recommended a movie based on their own preferences.

In [11]:
data1.isnull().sum()

title          0
genre          0
director       0
actors         0
description    0
dtype: int64

In [12]:
#Impute all missing values
data1 = data1.fillna('missing value')

In [13]:
#Convert all string values to lower case
data1 = data1.applymap(lambda x: x.lower() if isinstance(x, str) else x)
data1.head()

Unnamed: 0,title,genre,director,actors,description
0,the story of the kelly gang,"biography, crime, drama",charles tait,"elizabeth tait, john tait, norman campbell, be...",true story of notorious australian outlaw ned ...
1,den sorte drøm,drama,urban gad,"asta nielsen, valdemar psilander, gunnar helse...",two men of high rank are both wooing the beaut...
2,cleopatra,"drama, history",charles l. gaskill,"helen gardner, pearl sindelar, miss fielding, ...",the fabled queen of egypt's affair with roman ...
3,l'inferno,"adventure, drama, fantasy","francesco bertolini, adolfo padovan","salvatore papa, arturo pirovano, giuseppe de l...",loosely adapted from dante's divine comedy and...
4,"from the manger to the cross; or, jesus of naz...","biography, drama",sidney olcott,"r. henderson bland, percy dyer, gene gauntier,...","an account of the life of jesus christ, based ..."


In [14]:
#Use genre as a list of words
data1['genre'] = data1['genre'].map(lambda x: x.split(','))
data1['genre']

0         [biography,  crime,  drama]
1                             [drama]
2                   [drama,  history]
3       [adventure,  drama,  fantasy]
4                 [biography,  drama]
                    ...              
2995                          [drama]
2996        [crime,  drama,  romance]
2997                [drama,  romance]
2998                          [drama]
2999        [drama,  romance,  sport]
Name: genre, Length: 3000, dtype: object

In [15]:
#Similarly, let's split the comma separated director and actor names into lists of names
data1[['director','actors']] = data1[['director','actors']].applymap(lambda x: x.split(',')) #applymap works element-wise on several columns, map on a single column
data1[['director','actors']].head()

Unnamed: 0,director,actors
0,[charles tait],"[elizabeth tait, john tait, norman campbell,..."
1,[urban gad],"[asta nielsen, valdemar psilander, gunnar he..."
2,[charles l. gaskill],"[helen gardner, pearl sindelar, miss fieldin..."
3,"[francesco bertolini, adolfo padovan]","[salvatore papa, arturo pirovano, giuseppe d..."
4,[sidney olcott],"[r. henderson bland, percy dyer, gene gaunti..."


In [16]:
#Merge each director and actor name into 1 word (e.g. 'charles tait' -> 'charlestait') so that a full name is one token during text extraction
#Rows yielded by iterrows() may be copies, so we assign whole columns instead of writing to each row
for col in ['actors', 'director']:
    data1[col] = data1[col].map(lambda names: [x.replace(' ','') for x in names])

In [17]:
data1.head()

Unnamed: 0,title,genre,director,actors,description
0,the story of the kelly gang,"[biography, crime, drama]",[charlestait],"[elizabethtait, johntait, normancampbell, bell...",true story of notorious australian outlaw ned ...
1,den sorte drøm,[drama],[urbangad],"[astanielsen, valdemarpsilander, gunnarhelseng...",two men of high rank are both wooing the beaut...
2,cleopatra,"[drama, history]",[charlesl.gaskill],"[helengardner, pearlsindelar, missfielding, mi...",the fabled queen of egypt's affair with roman ...
3,l'inferno,"[adventure, drama, fantasy]","[francescobertolini, adolfopadovan]","[salvatorepapa, arturopirovano, giuseppedeligu...",loosely adapted from dante's divine comedy and...
4,"from the manger to the cross; or, jesus of naz...","[biography, drama]",[sidneyolcott],"[r.hendersonbland, percydyer, genegauntier, al...","an account of the life of jesus christ, based ..."


## For content based movie recommendation we use the following NLP techniques
* Keyword extraction -> Extract keywords from description
* Bag of Words Creation -> Collect all words from a row into a bag
* Count Vectorizer -> Count frequency of words from this BOW
* Cosine Similarity -> Find cosine similarity between the word-count vectors of all movies




# Keyword Extraction
Keyword extraction is the automatic detection of terms that best describe the subject of a document. We will use Rake to extract keywords from description.

*For more info -> [Wiki](https://en.wikipedia.org/wiki/Keyword_extraction)*

**Things to do:**
* Create an empty 'keywords' column
* Go through all rows to extract the keywords from description
* Get the dictionary of keywords and their scores
* Store the keywords in the 'keywords' column of the dataframe

In [18]:
#Create an empty 'keywords' column
data1['keywords'] = ''

In [19]:
#Go through all rows to extract all keywords from description

#instantiate Rake once; by default it uses English stopwords from NLTK and discards all punctuation chars
r = Rake()

def extract_keywords(description):
    #extract key phrases by passing the text
    r.extract_keywords_from_text(description)
    #get the dictionary of words and their degree scores, keep only the words
    return list(r.get_word_degrees().keys())

#assign keywords to the new column (rows yielded by iterrows() may be copies, so we assign the whole column)
data1['keywords'] = data1['description'].map(extract_keywords)

#'description' is kept in the dataframe for reference

In [20]:
data1.set_index('title', inplace = True)
data1.head()

title,genre,director,actors,description,keywords
the story of the kelly gang,"[biography, crime, drama]",[charlestait],"[elizabethtait, johntait, normancampbell, bell...",true story of notorious australian outlaw ned ...,"[true, story, notorious, australian, outlaw, n..."
den sorte drøm,[drama],[urbangad],"[astanielsen, valdemarpsilander, gunnarhelseng...",two men of high rank are both wooing the beaut...,"[offer, two, men, jeweler, hirsch, famous, equ..."
cleopatra,"[drama, history]",[charlesl.gaskill],"[helengardner, pearlsindelar, missfielding, mi...",the fabled queen of egypt's affair with roman ...,"[affair, egypt, fabled, queen, ulimately, disa..."
l'inferno,"[adventure, drama, fantasy]","[francescobertolini, adolfopadovan]","[salvatorepapa, arturopirovano, giuseppedeligu...",loosely adapted from dante's divine comedy and...,"[inspired, divine, comedy, gustav, doré, dante..."
"from the manger to the cross; or, jesus of nazareth","[biography, drama]",[sidneyolcott],"[r.hendersonbland, percydyer, genegauntier, al...","an account of the life of jesus christ, based ...","[parents, jesus, based, books, life, foretold,..."


# Bag of Words Creation

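This is an important technique used in NLP and information retrieval: a text is represented as the collection (bag) of its words, ignoring grammar and word order. In our case the bag for each movie is built from its genre, director, actors and keywords. The occurrence count of every word is then used as a feature, here for measuring the similarity between movies rather than for training a classifier.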

*For more info, -> [Wiki](https://en.wikipedia.org/wiki/Bag-of-words_model)*

**Things to do:**
* Create an empty 'bow' column
* Iterate over all rows, combining genre, director, actor names and keywords

In [21]:
#Join only the list-valued columns; 'description' is a plain string and is already represented by 'keywords'
bow_columns = ['genre', 'director', 'actors', 'keywords']
data1['bow'] = [' '.join(' '.join(tokens) for tokens in row)
                for row in data1[bow_columns].itertuples(index = False)]

#To leave the director out of the bag of words, remove 'director' from bow_columns

In [22]:
data1.head()

title,genre,director,actors,description,keywords,bow
the story of the kelly gang,"[biography, crime, drama]",[charlestait],"[elizabethtait, johntait, normancampbell, bell...",true story of notorious australian outlaw ned ...,"[true, story, notorious, australian, outlaw, n...",biography crime drama charlestait elizabetht...
den sorte drøm,[drama],[urbangad],"[astanielsen, valdemarpsilander, gunnarhelseng...",two men of high rank are both wooing the beaut...,"[offer, two, men, jeweler, hirsch, famous, equ...",drama urbangad astanielsen valdemarpsilander g...
cleopatra,"[drama, history]",[charlesl.gaskill],"[helengardner, pearlsindelar, missfielding, mi...",the fabled queen of egypt's affair with roman ...,"[affair, egypt, fabled, queen, ulimately, disa...",drama history charlesl.gaskill helengardner p...
l'inferno,"[adventure, drama, fantasy]","[francescobertolini, adolfopadovan]","[salvatorepapa, arturopirovano, giuseppedeligu...",loosely adapted from dante's divine comedy and...,"[inspired, divine, comedy, gustav, doré, dante...",adventure drama fantasy francescobertolini a...
"from the manger to the cross; or, jesus of nazareth","[biography, drama]",[sidneyolcott],"[r.hendersonbland, percydyer, genegauntier, al...","an account of the life of jesus christ, based ...","[parents, jesus, based, books, life, foretold,...",biography drama sidneyolcott r.hendersonbland...


# Count Vectorizer

CountVectorizer converts a collection of text documents to a matrix of token counts: one row per document (here, one per movie's bag of words) and one column per word in the vocabulary.

*For more info -> [Count Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)*

**Things to do:**
* Instantiate & fit CountVectorizer on 'bow' -> creates count_matrix, which is the input for cosine similarity
* 'title' is the index as we saw above, so we copy it into a Series -> this maps each row position (0, 1, 2, ...) to a title
* Understand the count_matrix -> Check its shape and type
* Convert sparse count_matrix to a dense matrix -> only to inspect it; the sparse form is the one that saves memory, *For more info -> [Sparse Matrices](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/)*
* Print the stored entries for a sample row
* Check all words in the vocabulary
* Generate cosine similarity for count_matrix
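As a small-scale preview of these steps, the sketch below fits `CountVectorizer` on three hypothetical bags of words and inspects the result. The documents are made up; only the scikit-learn calls mirror the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical bags of words
docs = [
    "drama history helengardner",
    "drama crime charlestait",
    "drama drama romance",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)   # SciPy sparse matrix, one row per document

# vocabulary_ maps each token to its column index (alphabetical order), not to a count
vocab = vectorizer.vocabulary_
dense = matrix.toarray()                  # fine for a toy example, costly for large data

print(matrix.shape)
print(sorted(vocab, key=vocab.get))
print(dense)
```

The third document contains "drama" twice, so its entry in the "drama" column is 2 while every other non-zero entry is 1.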

In [23]:
#instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(data1['bow'])

#create a Series for movie titles so they are associated to an ordered numerical list, we will use this later to match index
indices = pd.Series(data1.index)
indices[:5]

0                          the story of the kelly gang
1                                       den sorte drøm
2                                            cleopatra
3                                            l'inferno
4    from the manger to the cross; or, jesus of naz...
Name: title, dtype: object

In [24]:
#Shape count_matrix
count_matrix

<3000x21128 sparse matrix of type '<class 'numpy.int64'>'
	with 88732 stored elements in Compressed Sparse Row format>

In [25]:
type(count_matrix)

scipy.sparse.csr.csr_matrix

In [26]:
#Convert sparse count_matrix to a dense matrix
c = count_matrix.todense()
c

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [27]:
#Print count_matrix for 0th row
print(count_matrix[0,:]) #This shows the (row, column index) of every word in the bow of the 0th row and its frequency

  (0, 2146)	1
  (0, 4128)	1
  (0, 5132)	1
  (0, 3260)	1
  (0, 5688)	1
  (0, 10641)	1
  (0, 14262)	1
  (0, 1868)	1
  (0, 20600)	1
  (0, 17225)	1
  (0, 9891)	1
  (0, 10540)	1
  (0, 19945)	1
  (0, 13780)	2
  (0, 12854)	1
  (0, 13193)	1
  (0, 7060)	1
  (0, 14476)	1
  (0, 19537)	1
  (0, 18396)	1
  (0, 14305)	1
  (0, 1523)	1
  (0, 14652)	1
  (0, 13997)	1
  (0, 11068)	1
  (0, 51)	1
  (0, 119)	1


In [28]:
#Gives the vocabulary of all words in 'bow', mapped to their column index in count_matrix (not their counts)
count.vocabulary_

{'biography': 2146,
 'crime': 4128,
 'drama': 5132,
 'charlestait': 3260,
 'elizabethtait': 5688,
 'johntait': 10641,
 'normancampbell': 14262,
 'bellacola': 1868,
 'willcoyne': 20600,
 'samcrewes': 17225,
 'jackennis': 9891,
 'johnforde': 10540,
 'veralinden': 19945,
 'mr': 13780,
 'marshall': 12854,
 'mckenzie': 13193,
 'frankmills': 7060,
 'olliewilson': 14476,
 'true': 19537,
 'story': 18396,
 'notorious': 14305,
 'australian': 1523,
 'outlaw': 14652,
 'ned': 13997,
 'kelly': 11068,
 '1855': 51,
 '80': 119,
 'urbangad': 19787,
 'astanielsen': 1428,
 'valdemarpsilander': 19833,
 'gunnarhelsengreen': 8213,
 'emilalbes': 5788,
 'hugoflink': 9292,
 'maryhagen': 12929,
 'offer': 14396,
 'two': 19608,
 'men': 13258,
 'jeweler': 10359,
 'hirsch': 9074,
 'famous': 6433,
 'equestrian': 5979,
 'acrobat': 217,
 'stella': 18315,
 'ignores': 9415,
 'home': 9125,
 'follow': 6823,
 'wooing': 20859,
 'accepts': 180,
 'count': 4031,
 'von': 20213,
 'waldberg': 20267,
 'high': 9027,
 'rank': 15987,


# Calculate Cosine similarity

In [29]:
#generating the cosine similarity matrix

cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim

array([[1.        , 0.03450328, 0.27586342, ..., 0.03227486, 0.03450328,
        0.03131121],
       [0.03450328, 1.        , 0.02595871, ..., 0.03340766, 0.03571429,
        0.03241019],
       [0.27586342, 0.02595871, 1.        , ..., 0.02428215, 0.02595871,
        0.02355714],
       ...,
       [0.03227486, 0.03340766, 0.02428215, ..., 1.        , 0.03340766,
        0.09095086],
       [0.03450328, 0.03571429, 0.02595871, ..., 0.03340766, 1.        ,
        0.03241019],
       [0.03131121, 0.03241019, 0.02355714, ..., 0.09095086, 0.03241019,
        1.        ]])
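The full similarity matrix has one entry per pair of movies, which is part of why only 3,000 rows are used here. A possible alternative, sketched under the assumption that only one movie is queried at a time, is to compare a single row against all rows. The documents and the helper `similarity_to_all` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three hypothetical bags of words
docs = [
    "drama history egypt queen",
    "drama history rome emperor",
    "comedy musical dance",
]
matrix = CountVectorizer().fit_transform(docs)

def similarity_to_all(sparse_matrix, row):
    """Cosine similarity of one row against every row, as a 1-D array."""
    # the slice keeps the query two-dimensional, as cosine_similarity expects
    return cosine_similarity(sparse_matrix[row:row + 1], sparse_matrix).ravel()

scores = similarity_to_all(matrix, 0)
print(scores.round(2))
```

This keeps memory proportional to the number of movies rather than its square, at the cost of recomputing the similarities for each query.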

# Recommend top n movies given a movie name

**Things to do:**
* Create empty list
* Get index of the movie that matches this title
* Sort the cosine_sim scores this title shares with all other titles, highest first, and save them in a Series
* Get indexes of the 'n' most similar movies
* Populate list with titles of n matching movies
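These steps can be tried on a toy example first. The titles and the hand-written similarity matrix below are hypothetical, and `top_n` is an illustrative variant that drops the queried movie explicitly and raises an error for unknown titles.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a hand-written symmetric similarity matrix
titles = pd.Series(["cleopatra", "antony", "caesar", "musical"])
sim = np.array([
    [1.0, 0.6, 0.4, 0.0],
    [0.6, 1.0, 0.5, 0.1],
    [0.4, 0.5, 1.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
])

def top_n(title, n, titles=titles, sim=sim):
    """Titles of the n movies most similar to `title`, best first."""
    matches = titles[titles == title]
    if matches.empty:
        raise KeyError(f"unknown title: {title!r}")
    idx = matches.index[0]
    scores = pd.Series(sim[idx]).drop(idx)      # never recommend the movie itself
    return titles.loc[scores.nlargest(n).index].tolist()

picks = top_n("cleopatra", 2)
print(picks)
```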

In [30]:
#Let's build a function that takes a movie title and recommends the top n most similar movies

def recommendations(title,n,cosine_sim = cosine_sim):
    recommended_movies = []
    
    #get index of the first movie that matches the title
    idx = indices[indices == title].index[0]
    
    #sort the cosine_sim scores this title shares with all other titles, highest first, dropping the movie itself
    score_series = pd.Series(cosine_sim[idx]).drop(idx).sort_values(ascending = False)
    
    #get indexes of the 'n' most similar movies
    top_n_indexes = list(score_series.iloc[:n].index)
    print(top_n_indexes)
    
    #populating the list with titles of n matching movies
    for i in top_n_indexes:
        recommended_movies.append(indices[i])
    return recommended_movies

In [31]:
#movie = input("Enter the name of the movie you want similar recommendations for: ").lower()
movie = 'cleopatra'
#n = int(input("How many movies do you want to be recommended: "))
n = 10

In [32]:
movie

'cleopatra'

In [33]:
recommendations(movie, n)

[0, 2710, 170, 2904, 2899, 1576, 2245, 1230, 2987, 1824]


['the story of the kelly gang',
 'think fast, mr. moto',
 'victory',
 'mysterious mr. moto',
 'mr. moto takes a chance',
 'picture snatcher',
 'the crimes of stephen hawke',
 'dr. jekyll and mr. hyde',
 'tsuruhachi tsurujirô',
 'mrs. wiggs of the cabbage patch']

**What is the index of the movie you requested?**

In [34]:
indices[indices == movie].index[0]

2

**What is the cosine similarity this movie shares with all other movies?**

In [35]:
pd.Series(cosine_sim[indices[indices == movie].index[0]])

0       0.275863
1       0.025959
2       1.000000
3       0.025507
4       0.025078
          ...   
2995    0.023557
2996    0.051014
2997    0.024282
2998    0.025959
2999    0.023557
Length: 3000, dtype: float64

***This is how we use NLP techniques to recommend movies based on their content.***