<a href="https://colab.research.google.com/github/torn2537/ContentBasedRecommendation/blob/master/contentBasedRecommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install the required libraries and import them into Google Colab.



In [57]:
!pip install pandas
!pip install rake-nltk

import pandas as pd
from rake_nltk import Rake
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer



# Load the `IMDB Top 250 dataset` from the provided URL and display the first five rows.

* `pd.set_option` sets the maximum number of columns to display.

In [58]:
pd.set_option('display.max_columns', 100)
df = pd.read_csv('https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7')
print(df.head())

   Unnamed: 0                     Title  Year     Rated     Released  Runtime  \
0           1  The Shawshank Redemption  1994         R  14 Oct 1994  142 min   
1           2             The Godfather  1972         R  24 Mar 1972  175 min   
2           3    The Godfather: Part II  1974         R  20 Dec 1974  202 min   
3           4           The Dark Knight  2008     PG-13  18 Jul 2008  152 min   
4           5              12 Angry Men  1957  APPROVED  01 Apr 1957   96 min   

                  Genre              Director  \
0          Crime, Drama        Frank Darabont   
1          Crime, Drama  Francis Ford Coppola   
2          Crime, Drama  Francis Ford Coppola   
3  Action, Crime, Drama     Christopher Nolan   
4          Crime, Drama          Sidney Lumet   

                                              Writer  \
0  Stephen King (short story "Rita Hayworth and S...   
1  Mario Puzo (screenplay), Francis Ford Coppola ...   
2  Francis Ford Coppola (screenplay), Mario Puzo .

Check the shape of the dataframe. `df.shape` returns a tuple `(number of rows, number of columns)`.
In this tutorial, that is `(total number of movies, total number of features per movie)`.

In [59]:
df.shape

(250, 38)

# Select the `features` that the recommendation system will be based on. I will pick the following features:

* Title
* Genre
* Director
* Actors
* Plot






In [60]:
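# Take an explicit copy so later column assignments do not act on a slice of the original dataframe
df = df[['Title', 'Genre', 'Director', 'Actors', 'Plot']].copy()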
df

Unnamed: 0,Title,Genre,Director,Actors,Plot
0,The Shawshank Redemption,"Crime, Drama",Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",Two imprisoned men bond over a number of years...
1,The Godfather,"Crime, Drama",Francis Ford Coppola,"Marlon Brando, Al Pacino, James Caan, Richard ...",The aging patriarch of an organized crime dyna...
2,The Godfather: Part II,"Crime, Drama",Francis Ford Coppola,"Al Pacino, Robert Duvall, Diane Keaton, Robert...",The early life and career of Vito Corleone in ...
3,The Dark Knight,"Action, Crime, Drama",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",When the menace known as the Joker emerges fro...
4,12 Angry Men,"Crime, Drama",Sidney Lumet,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",A jury holdout attempts to prevent a miscarria...
...,...,...,...,...,...
245,The Lost Weekend,"Drama, Film-Noir",Billy Wilder,"Ray Milland, Jane Wyman, Phillip Terry, Howard...",The desperate life of a chronic alcoholic is f...
246,Short Term 12,Drama,Destin Daniel Cretton,"Brie Larson, John Gallagher Jr., Stephanie Bea...",A 20-something supervising staff member of a r...
247,His Girl Friday,"Comedy, Drama, Romance",Howard Hawks,"Cary Grant, Rosalind Russell, Ralph Bellamy, G...",A newspaper editor uses every trick in the boo...
248,The Straight Story,"Biography, Drama",David Lynch,"Sissy Spacek, Jane Galloway Heitz, Joseph A. C...",An old man makes a long journey by lawn-mover ...


# **Clean the data**
### Before moving on to the next steps, we should `clean` the selected features.
* For Genre, we split the string by comma (,).
* For Director, we split the name by white space.
* For Actors, we also split by comma, as with Genre, and keep only the first three actors.

In [0]:
df['Genre'] = df['Genre'].map(lambda x: x.split(','))
df['Director'] = df['Director'].map(lambda x: x.split(' '))
df['Actors'] = df['Actors'].map(lambda x: x.split(',')[:3])

We need to merge the first, middle and last names of each actor and director so that each full name is treated as a single word.

In [62]:
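nonWhiteSpace = ''
# Assign whole columns back; writing to `row` inside iterrows() may not update df
df['Actors'] = df['Actors'].map(lambda names: [x.lower().replace(' ', nonWhiteSpace) for x in names])
df['Director'] = df['Director'].map(lambda parts: nonWhiteSpace.join(parts).lower())
df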

Unnamed: 0,Title,Genre,Director,Actors,Plot
0,The Shawshank Redemption,"[Crime, Drama]",frankdarabont,"[timrobbins, morganfreeman, bobgunton]",Two imprisoned men bond over a number of years...
1,The Godfather,"[Crime, Drama]",francisfordcoppola,"[marlonbrando, alpacino, jamescaan]",The aging patriarch of an organized crime dyna...
2,The Godfather: Part II,"[Crime, Drama]",francisfordcoppola,"[alpacino, robertduvall, dianekeaton]",The early life and career of Vito Corleone in ...
3,The Dark Knight,"[Action, Crime, Drama]",christophernolan,"[christianbale, heathledger, aaroneckhart]",When the menace known as the Joker emerges fro...
4,12 Angry Men,"[Crime, Drama]",sidneylumet,"[martinbalsam, johnfiedler, leej.cobb]",A jury holdout attempts to prevent a miscarria...
...,...,...,...,...,...
245,The Lost Weekend,"[Drama, Film-Noir]",billywilder,"[raymilland, janewyman, phillipterry]",The desperate life of a chronic alcoholic is f...
246,Short Term 12,[Drama],destindanielcretton,"[brielarson, johngallagherjr., stephaniebeatriz]",A 20-something supervising staff member of a r...
247,His Girl Friday,"[Comedy, Drama, Romance]",howardhawks,"[carygrant, rosalindrussell, ralphbellamy]",A newspaper editor uses every trick in the boo...
248,The Straight Story,"[Biography, Drama]",davidlynch,"[sissyspacek, janegallowayheitz, josepha.carpe...",An old man makes a long journey by lawn-mover ...
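To illustrate why merging names matters, here is a minimal sketch using plain Python sets. The helper `tokens` and the actor name `Harrison Ford` are assumptions chosen for illustration; they are not taken from the notebook. When names are split on white space, two different people can share a token such as `ford`, which would count as a spurious match. When each name is merged into one word, that overlap disappears.

```python
def tokens(name, merge):
    """Return the set of lowercase tokens produced for one person's name."""
    name = name.lower()
    # Merged: the whole name becomes one token; unmerged: one token per name part
    return {name.replace(' ', '')} if merge else set(name.split(' '))

director = 'Francis Ford Coppola'
actor = 'Harrison Ford'  # illustrative name, an assumption for this sketch

shared_unmerged = tokens(director, merge=False) & tokens(actor, merge=False)
shared_merged = tokens(director, merge=True) & tokens(actor, merge=True)

print(shared_unmerged)  # {'ford'}
print(shared_merged)    # set()
```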


### Extract keywords from the `Plot` column to prepare for creating word vectors.
*   Initialize a `Rake()` instance (by default it uses the English stopwords from NLTK and discards all punctuation characters).
*   Use `extract_keywords_from_text(<text>)` to extract keywords from the Plot column.
*   Create `dictOfKeywordsAndScores` (dict), whose keys are the keywords and whose values are the keywords' scores.




In [63]:
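rake = Rake()

def extractKeywords(plot):
  # Rake keeps the state of the last text it processed, so extract and read together
  rake.extract_keywords_from_text(plot)
  # Maps each keyword to its degree score
  dictOfKeywordsAndScores = rake.get_word_degrees()
  return list(dictOfKeywordsAndScores.keys())

# Assign the whole column; writing to `row` inside iterrows() may not update df
df['Keywords'] = df['Plot'].map(extractKeywords)
df.drop(columns='Plot', inplace=True)
#Display the dataframe
df.set_index('Title', inplace = True)
df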

Unnamed: 0_level_0,Genre,Director,Actors,Keywords
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
The Shawshank Redemption,"[Crime, Drama]",frankdarabont,"[timrobbins, morganfreeman, bobgunton]","[acts, number, common, decency, years, eventua..."
The Godfather,"[Crime, Drama]",francisfordcoppola,"[marlonbrando, alpacino, jamescaan]","[reluctant, son, organized, crime, dynasty, tr..."
The Godfather: Part II,"[Crime, Drama]",francisfordcoppola,"[alpacino, robertduvall, dianekeaton]","[portrayed, grip, early, life, tightens, son, ..."
The Dark Knight,"[Action, Crime, Drama]",christophernolan,"[christianbale, heathledger, aaroneckhart]","[joker, emerges, mysterious, past, ability, fi..."
12 Angry Men,"[Crime, Drama]",sidneylumet,"[martinbalsam, johnfiedler, leej.cobb]","[forcing, jury, holdout, attempts, justice, pr..."
...,...,...,...,...
The Lost Weekend,"[Drama, Film-Noir]",billywilder,"[raymilland, janewyman, phillipterry]","[followed, desperate, life, four, chronic, alc..."
Short Term 12,[Drama],destindanielcretton,"[brielarson, johngallagherjr., stephaniebeatriz]","[longtime, boyfriend, residential, treatment, ..."
His Girl Friday,"[Comedy, Drama, Romance]",howardhawks,"[carygrant, rosalindrussell, ralphbellamy]","[wife, book, keep, ace, reporter, ex, newspape..."
The Straight Story,"[Biography, Drama]",davidlynch,"[sissyspacek, janegallowayheitz, josepha.carpe...","[mend, old, man, makes, lawn, mover, tractor, ..."
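To make the scores behind `get_word_degrees()` more concrete, the following is a simplified, pure-Python sketch of the RAKE word-degree idea. It is not rake-nltk's implementation: the function `word_degrees`, the regular expressions, and the tiny `STOPWORDS` set are assumptions for illustration, whereas rake-nltk relies on NLTK's tokenizer and full English stopword list. The text is split into candidate phrases at stopwords and punctuation, and each word's degree grows by the length of every phrase it appears in.

```python
import re
from collections import defaultdict

# Tiny hand-picked stopword list, an assumption for this sketch only
STOPWORDS = {"a", "an", "the", "of", "to", "by", "in", "and", "over", "is"}

def word_degrees(text):
    """Simplified RAKE-style word degrees for one text."""
    # Words become tokens; each punctuation character becomes its own token
    all_tokens = re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())
    phrases, current = [], []
    for token in all_tokens:
        # Stopwords and punctuation both end the current candidate phrase
        if token in STOPWORDS or not re.match(r"[a-z0-9']", token):
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)

    degrees = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            # Each word co-occurs with every word in its phrase (itself included)
            degrees[word] += len(phrase)
    return dict(degrees)

degrees = word_degrees("Two imprisoned men bond over a number of years.")
keywords = list(degrees.keys())
print(degrees)
print(keywords)
```

As in the notebook, only the dictionary keys are kept as keywords; the degree values are discarded.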


# Create a `bag of words` from all the words in the dataframe, then drop every column except `bag_of_words`, which is used in the next step.

In [64]:
whiteSpace = ' '

def buildBagOfWords(row):
    # Director is a single string; the other columns hold lists of words
    parts = [value if isinstance(value, str) else whiteSpace.join(value) for value in row]
    return whiteSpace.join(parts)

# Assign the whole column; writing to `row` inside iterrows() may not update df
df['bag_of_words'] = df.apply(buildBagOfWords, axis=1)
df.drop(columns = [col for col in df.columns if col!= 'bag_of_words'], inplace = True)
df.head()

Unnamed: 0_level_0,bag_of_words
Title,Unnamed: 1_level_1
The Shawshank Redemption,Crime Drama frankdarabont timrobbins morganfr...
The Godfather,Crime Drama francisfordcoppola marlonbrando a...
The Godfather: Part II,Crime Drama francisfordcoppola alpacino rober...
The Dark Knight,Action Crime Drama christophernolan christia...
12 Angry Men,Crime Drama sidneylumet martinbalsam johnfied...


# **Build the similarity matrix**
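Initialize an instance of `CountVectorizer` and fit it on the `bag_of_words` column to get a matrix of word counts (one row per movie, one column per distinct word).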

In [65]:
countVectorizer = CountVectorizer()
count_matrix = countVectorizer.fit_transform(df['bag_of_words'])
count_matrix

<250x2961 sparse matrix of type '<class 'numpy.int64'>'
	with 5342 stored elements in Compressed Sparse Row format>
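As a rough illustration of what such a count matrix holds, the sketch below builds word-count vectors by hand with `collections.Counter`. It only approximates `CountVectorizer`, whose real tokenizer is more involved; the helper `count_vector` is an assumption, and the two bags of words are shortened versions of the rows shown earlier (genres, director and two actors, with the plot keywords left out).

```python
from collections import Counter

# Shortened bags of words (genres, director, two actors; keywords omitted)
bags = {
    'The Godfather': 'Crime Drama francisfordcoppola marlonbrando alpacino',
    'The Godfather: Part II': 'Crime Drama francisfordcoppola alpacino robertduvall',
}

# Vocabulary: every distinct lowercase word, sorted for a stable column order
vocabulary = sorted({word for text in bags.values() for word in text.lower().split()})

def count_vector(text):
    counts = Counter(text.lower().split())
    # One column per vocabulary word; 0 when the word is absent
    return [counts[word] for word in vocabulary]

count_rows = {title: count_vector(text) for title, text in bags.items()}
print(vocabulary)
print(count_rows)
```

Most entries in the real matrix are zero because each movie uses only a small part of the vocabulary, which is why scikit-learn stores it in a sparse format.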

Create a Series of the `Title` values so that row positions in the `cosine similarity matrix` can be mapped to titles in the recommendation function.

In [66]:
indices = pd.Series(df.index)
indices

0      The Shawshank Redemption
1                 The Godfather
2        The Godfather: Part II
3               The Dark Knight
4                  12 Angry Men
                 ...           
245            The Lost Weekend
246               Short Term 12
247             His Girl Friday
248          The Straight Story
249         Slumdog Millionaire
Name: Title, Length: 250, dtype: object

Create the `cosine similarity matrix`.

In [67]:
cosine_simi = cosine_similarity(count_matrix, count_matrix)
#Display the cosine_simi
cosine_simi

array([[1.        , 0.15789474, 0.13764944, ..., 0.05263158, 0.05263158,
        0.05564149],
       [0.15789474, 1.        , 0.36706517, ..., 0.05263158, 0.05263158,
        0.05564149],
       [0.13764944, 0.36706517, 1.        , ..., 0.04588315, 0.04588315,
        0.04850713],
       ...,
       [0.05263158, 0.05263158, 0.04588315, ..., 1.        , 0.05263158,
        0.05564149],
       [0.05263158, 0.05263158, 0.04588315, ..., 0.05263158, 1.        ,
        0.05564149],
       [0.05564149, 0.05564149, 0.04850713, ..., 0.05564149, 0.05564149,
        1.        ]])
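Each entry of this matrix is the cosine of the angle between two movies' count vectors: the dot product divided by the product of the vector lengths. The sketch below computes that by hand with the standard library. The function `cosine` and the three toy vectors are hypothetical and are not rows of the real count matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy count vectors over a six-word vocabulary (hypothetical values)
movie_a = [1, 1, 1, 1, 1, 0]
movie_b = [1, 1, 1, 1, 0, 1]
movie_c = [0, 0, 1, 0, 0, 0]

print(cosine(movie_a, movie_a))  # approximately 1.0: a vector compared with itself
print(cosine(movie_a, movie_b))  # approximately 0.8: four shared words out of five each
print(cosine(movie_a, movie_c))  # lower: only one shared word
```

This is why the diagonal of the matrix is 1 and why the matrix is symmetric: `cosine(u, v)` equals `cosine(v, u)`.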

# **Build a `recommendation function` to recommend movies**
*   The function has three parameters: `input_movie_name` (str), `cosine_simi` (array) and `numberOfRecommendedItems` (int).
*   The function returns a `List` of `numberOfRecommendedItems` titles, ordered from most to least similar.



In [0]:
def recommendItem(input_movie_name, cosine_simi, numberOfRecommendedItems):
  # Row position of the input movie in the similarity matrix
  idx = indices[indices == input_movie_name].index[0]
  # Similarity of every movie to the input movie, highest first
  score_series = pd.Series(cosine_simi[idx]).sort_values(ascending = False)
  # Skip the first entry (normally the input movie itself), then take the next N
  recommendedItemsIndices = list(score_series.iloc[1:numberOfRecommendedItems + 1].index)
  return [indices[item] for item in recommendedItemsIndices]

# **Recommend movies with our `recommendation function`**

In [70]:
recommendItem('Short Term 12',cosine_simi,9)

['Fight Club',
 'Room',
 'The Pianist',
 'All Quiet on the Western Front',
 'Pink Floyd: The Wall',
 'The Great Escape',
 'The Imitation Game',
 'Patton',
 'The Best Years of Our Lives']
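The ranking step can also be sketched without pandas, which may make the logic easier to follow. Everything in this example is hypothetical: the function `top_similar`, the titles `A` to `D`, and the similarity values are made up for illustration. Unlike the notebook's function, this sketch excludes the input movie by position instead of assuming it sorts first.

```python
def top_similar(title, titles, similarity, n):
    """Return the n titles most similar to `title`, excluding the title itself."""
    idx = titles.index(title)
    # Pair every other movie's position with its similarity to the input movie
    scored = [(score, i) for i, score in enumerate(similarity[idx]) if i != idx]
    # Highest similarity first; ties keep the original row order
    scored.sort(key=lambda pair: -pair[0])
    return [titles[i] for _, i in scored[:n]]

# Hypothetical titles and similarity scores, for illustration only
titles = ['A', 'B', 'C', 'D']
similarity = [
    [1.0, 0.2, 0.7, 0.4],
    [0.2, 1.0, 0.1, 0.3],
    [0.7, 0.1, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
]

print(top_similar('A', titles, similarity, 2))  # ['C', 'D']
```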