***
#### <center> Project Report on </center>
# <center> Movie Recommendation System based on User-Input using Machine Learning </center>
#### <center> by </center>
### <center> Dhiraj Dharmadip Raut </center>
***

# Table of Contents

- [Introduction](#Introduction)

- [Import the Requied Modules and the Functions](#Import-the-Requied-Modules-and-the-Functions)

- [Data Extraction of Hindi Movies Using Webscrapping](#Data-Extraction-of-Hindi-Movies-Using-Webscrapping)

- [Data Collection of English Movies From 'English_Movies.csv' File](#Data-Collection-of-English-Movies-From-'English_Movies.csv'-File)

- [Feature Extraction of movies data](#Feature-Extraction-of-movies-data)

- [Cosine Similarity](#Cosine-Similarity)
  
- [Steps for Recommendation System on a Predefined movie](#Steps-for-Recommendation-System-on-a-Predefined-movie)

- [Live Movie Recommendation System Based on User Input](#Live-Movie-Recommendation-System-Based-on-User-Input)


***

# Introduction

This project builds a **Movie Recommendation System** from the `content` of a movie dataset that combines **1700+ Hindi movies**, extracted from Wikipedia using web scraping, with **4800+ English movies** loaded from the `'English_Movies.csv'` file downloaded from `Kaggle` (https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata).

The **system works as follows**: \
After the user provides the name of a film they liked, the engine selects from the database a list of 30 films that the user is likely to enjoy, based on content.

Movie recommendation systems are used extensively nowadays by **OTT platforms** such as `NETFLIX` and `AMAZON PRIME` to keep users engaged and increase the time they spend on the platform.

There are `three main types` of movie recommendation system:
- **Content-Based Recommendation System** - recommends movies based on the content (genre, storyline, etc.) of the movies the user has watched
- **Popularity-Based Recommendation System** - recommends movies based on popularity (most viewed, highest rated, IMDb ratings, etc.), optionally broken down by region, genre, etc.
- **Collaborative Recommendation System** - recommends movies based on the viewing history of other people whose viewing patterns resemble yours

In the current case, since the dataset only describes the content of the films, collaborative filtering and popularity filtering are excluded, and I will thus build a system that uses the content of the entries.

***

# Import the Requied Modules and the Functions

Install the third-party **Python libraries** required for this project: `Scikit-Learn`, `BeautifulSoup4`, and `pandas`. The `urllib`, `difflib`, and `warnings` modules ship with Python's standard library and need no installation.

Once all libraries are installed, **import** the modules and functions required from them.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from bs4 import BeautifulSoup
import urllib.request  # import the submodule explicitly so that urllib.request.urlopen is available
import pandas as pd
import difflib
import warnings
warnings.filterwarnings('ignore')

***

# Data Extraction of Hindi Movies Using Webscrapping

First, create a **pandas DataFrame** named `movie_data` to store all the movie data.

The dataframe has the columns `Title`, `Year`, `Genre`, `Director`, and `Cast`.

In [2]:
movie_data = pd.DataFrame(columns=['Title','Year','Genre','Director','Cast'])
display(movie_data)

Unnamed: 0,Title,Year,Genre,Director,Cast


The `urllib` library is used to download the webpage https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2000. The same is done for the other years of Bollywood films by changing the year in the URL.

I have **grouped** the URLs by the **layout of their tables**, since pages with the same layout can be scraped with the same code.

Each URL is stored as a string in a dictionary keyed by year. While looping over it, the current address is held in the variable `url` and the body of the response is saved in a variable named `html_data`.

In [3]:
URL = {2000:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2000',
       2001:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2001',
       2002:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2002',    
       2003:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2003',
       2004:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2004',
       2005:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2005',
       2006:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2006',
       2007:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2007'
      }
for key, url in URL.items():
    print(key,":",url)

2000 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2000
2001 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2001
2002 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2002
2003 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2003
2004 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2004
2005 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2005
2006 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2006
2007 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2007
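
Since the addresses differ only in the year, a dictionary like `URL` could also be generated instead of typed out. The following is a minimal sketch of that idea; the names `BASE_URL`, `build_urls`, and `URL_generated` are hypothetical and not part of the notebook.

```python
# Hypothetical helper: build the {year: url} mapping from a sequence of years.
BASE_URL = "https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_{year}"

def build_urls(years):
    """Return a dict mapping each year to its Wikipedia list page."""
    return {year: BASE_URL.format(year=year) for year in years}

# range(2000, 2008) yields 2000 .. 2007, the same keys as the URL dictionary
URL_generated = build_urls(range(2000, 2008))
for key, url in URL_generated.items():
    print(key, ":", url)
```

The pages still have to be grouped by table layout afterwards, so a helper like this only saves typing within one group.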


Now, parse `html_data` using `BeautifulSoup`.

**BeautifulSoup** is a Python library for pulling data out of HTML and XML files. It represents the HTML as a set of objects with methods for searching and navigating the document, so we can walk the HTML as a tree and filter out what we are looking for.

To parse a document, pass it into the <code>BeautifulSoup</code> constructor. This returns a BeautifulSoup object, which represents the document as a nested data structure.

During parsing, the document is converted to Unicode and HTML entities are converted to Unicode characters. Beautiful Soup thereby transforms a complex HTML document into a tree of Python objects.
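
To make the tree navigation concrete, here is a small self-contained sketch that parses an inline HTML string instead of a downloaded page. The table content is made up for illustration; only the calls (`BeautifulSoup(...)`, `find`, `find_all`, `get_text`) are the ones used in this notebook.

```python
from bs4 import BeautifulSoup

# Made-up HTML with the same kind of table the scraping code looks for
html = """
<table class="wikitable">
  <tr><th>Title</th><th>Director</th><th>Cast</th><th>Genre</th></tr>
  <tr><td>Film A</td><td>Director A</td><td>Actor 1, Actor 2</td><td>Drama</td></tr>
  <tr><td>Film B</td><td>Director B</td><td>Actor 3</td><td>Comedy</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

records = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if cells:  # the header row has only <th> cells, so it is skipped
        records.append({
            "Title": cells[0].get_text(strip=True),
            "Director": cells[1].get_text(strip=True),
            "Cast": cells[2].get_text(strip=True),
            "Genre": cells[3].get_text(strip=True),
        })

print(records)
```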

In [4]:
for key, url in URL.items():
    html_data = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html_data,"html.parser") # or soup = BeautifulSoup(html_data,"lxml")
    table = soup.find_all('table', class_='wikitable')[1].tbody
    rows = table.find_all('tr')
    for row in rows:
        col = row.find_all("td")
        if (col != []):
            Title = col[0].get_text(strip=True)
            Year = str(key)
            Genre = col[3].get_text(strip=True)
            Director = col[1].get_text(strip=True)
            Cast = col[2].get_text(strip=True)
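            # Finally we add the data of each row to the table
            # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
            new_row = pd.DataFrame([{"Title":Title, "Year":Year, "Genre":Genre, "Director":Director, "Cast":Cast}])
            movie_data = pd.concat([movie_data, new_row], ignore_index=True)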

display(movie_data)

Unnamed: 0,Title,Year,Genre,Director,Cast
0,Aaghaaz,2000,Thriller,Yogesh Ishwar,"Sunil Shetty,Sushmita Sen,Namrata Shirodkar"
1,Aaj Ka Ravan,2000,Drama,Imran Khalid,"Kasam Ali,Mithun Chakraborty"
2,Anjaane,2000,Romance,Ravi Rai,"Raveena Tandon,Vivek Mushran"
3,Anokha Moti,2000,Family,Anshul Singla,"Sanjay Sharma,Arjun Chakraborty,Nayab Aftab"
4,Apradhi Kaun,2000,Thriller,Mohan Bhakri,"Ishrat Ali,Shagufta Ali"
...,...,...,...,...,...
772,The Train,2007,Drama Thriller,"Hasnain Hyderabadwala,Raksha Mistry","Emraan Hashmi,Geeta Basra,Rajat Bedi,Sayali Bh..."
773,Victoria No. 203,2007,Comedy,Anant Mahadevan,"Om Puri,Anupam Kher,Jimmy Sheirgill,Preeti Jha..."
774,Welcome,2007,Comedy,Anees Bazmee,"Akshay Kumar,Nana Patekar,Anil Kapoor,Katrina ..."
775,Yatra,2007,Drama,Gautam Ghose,"Rekha,Nana Patekar,Deepti Naval"


Next, the **remaining URLs** are grouped according to the differences in the HTML of these pages.

The same steps as above are followed to extract the data from these webpages using **web scraping**. The code changes from group to group only where it is **necessary to locate** the required `table`, `rows`, and `columns` in the HTML.

In [5]:
URL1 = {2008:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2008',
       2009:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2009',
       2010:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2010',
        2012:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2012'
      }
for key, url in URL1.items():
    print(key,":",url)

2008 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2008
2009 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2009
2010 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2010
2012 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2012


In [6]:
for key, url in URL1.items():
    html_data = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html_data,"html.parser") # or soup = BeautifulSoup(html_data,"lxml")
    table = soup.find_all('table', class_='wikitable')[1:]
    #print(len(table))
    for t in table:
        rows = t.tbody.find_all('tr')[1:]
        for row in rows:
            col = row.find_all("td")
            if (col != []):
                Title = col[-4].get_text(strip=True)
                Year = str(key)
                Genre = col[-1].get_text(strip=True)
                Director = col[-3].get_text(strip=True)
                Cast = col[-2].get_text(strip=True)
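                # Finally we add the data of each row to the table
                # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
                new_row = pd.DataFrame([{"Title":Title, "Year":Year, "Genre":Genre, "Director":Director, "Cast":Cast}])
                movie_data = pd.concat([movie_data, new_row], ignore_index=True)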
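
The column positions are what differs between the two groups of pages scraped so far, so they can be written down as data. The sketch below is one possible way to do that, assuming the cell texts of a row have already been extracted into a plain list. `LAYOUTS` and `row_to_record` are hypothetical names, the sample rows are made up, and the indices are copied from the two scraping cells above.

```python
# Column positions per page group, copied from the scraping cells above
LAYOUTS = {
    "2000-2007": {"Title": 0, "Director": 1, "Cast": 2, "Genre": 3},
    "2008-2010, 2012": {"Title": -4, "Director": -3, "Cast": -2, "Genre": -1},
}

def row_to_record(cells, layout, year):
    """Pick the configured positions out of one table row."""
    record = {field: cells[position] for field, position in layout.items()}
    record["Year"] = str(year)
    return record

# Negative indices count from the end of the row, so they still work
# when a row has an extra leading cell (made-up example).
short_row = ["Film A", "Director A", "Actor 1,Actor 2", "Drama"]
long_row = ["extra cell", "Film A", "Director A", "Actor 1,Actor 2", "Drama"]

print(row_to_record(short_row, LAYOUTS["2000-2007"], 2000))
print(row_to_record(long_row, LAYOUTS["2008-2010, 2012"], 2008))
```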


In [7]:
URL2 = {2011:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2011'
       }
for key, url in URL2.items():
    print(key,":",url)

2011 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2011


In [8]:
for key, url in URL2.items():
    html_data = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html_data,"html.parser") # or soup = BeautifulSoup(html_data,"lxml")
    table = soup.find_all('table', class_='wikitable')[1:]
#     print(len(table))
    for t in table:
        rows = t.tbody.find_all('tr')[2:]
        for row in rows:
            col = row.find_all("td")
            if (col != []):
                Title = col[-5].get_text(strip=True)
                Year = str(key)
                Genre = col[-4].get_text(strip=True)
                Director = col[-3].get_text(strip=True)
                Cast = col[-2].get_text(strip=True)
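                # Finally we add the data of each row to the table
                # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
                new_row = pd.DataFrame([{"Title":Title, "Year":Year, "Genre":Genre, "Director":Director, "Cast":Cast}])
                movie_data = pd.concat([movie_data, new_row], ignore_index=True)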


In [9]:
URL3 = {2013:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2013',
       2015:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2015',
      }
for key, url in URL3.items():
    print(key,":",url)

2013 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2013
2015 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2015


In [10]:
for key, url in URL3.items():
    html_data = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html_data,"html.parser") # or soup = BeautifulSoup(html_data,"lxml")
    table = soup.find_all('table', class_='wikitable')[1:]
#     print(len(table))
    for t in table:
        rows = t.tbody.find_all('tr')[1:]
        for row in rows:
            col = row.find_all("td")
            if (col != []):
                Title = col[-5].get_text(strip=True)
                Year = str(key)
                Genre = col[-2].get_text(strip=True)
                Director = col[-4].get_text(strip=True)
                Cast = col[-3].get_text(strip=True)
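                # Finally we add the data of each row to the table
                # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
                new_row = pd.DataFrame([{"Title":Title, "Year":Year, "Genre":Genre, "Director":Director, "Cast":Cast}])
                movie_data = pd.concat([movie_data, new_row], ignore_index=True)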


In [11]:
URL4 = {2014:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014',
      }
for key, url in URL4.items():
    print(key,":",url)

2014 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014


In [12]:
for key, url in URL4.items():
    html_data = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html_data,"html.parser") # or soup = BeautifulSoup(html_data,"lxml")
    table = soup.find_all('table', class_='wikitable')[2:]
#     print(len(table))
    for t in table:
        rows = t.tbody.find_all('tr')[1:]
        for row in rows:
            col = row.find_all("td")
            if (col != []):
                Title = col[-5].get_text(strip=True)
                Year = str(key)
                Genre = col[-2].get_text(strip=True)
                Director = col[-4].get_text(strip=True)
                Cast = col[-3].get_text(strip=True)
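                # Finally we add the data of each row to the table
                # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
                new_row = pd.DataFrame([{"Title":Title, "Year":Year, "Genre":Genre, "Director":Director, "Cast":Cast}])
                movie_data = pd.concat([movie_data, new_row], ignore_index=True)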


In [13]:
URL5 = {2016:'https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2016'
      }
for key, url in URL5.items():
    print(key,":",url)

2016 : https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2016


In [14]:
for key, url in URL5.items():
    html_data = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html_data,"html.parser") # or soup = BeautifulSoup(html_data,"lxml")
    table = soup.find_all('table', class_='wikitable')[1:]
#     print(len(table))
    for t in table:
        rows = t.tbody.find_all('tr')[1:]
        for row in rows:
            col = row.find_all("td")
            if (col != []):
                if (col[-5].get_text(strip=True) == 'Force 2'):
                    continue
                Title = col[-6].get_text(strip=True)
                Year = str(key)
                Genre = col[-3].get_text(strip=True)
                Director = col[-5].get_text(strip=True)
                Cast = col[-4].get_text(strip=True)
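                # Finally we add the data of each row to the table
                # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
                new_row = pd.DataFrame([{"Title":Title, "Year":Year, "Genre":Genre, "Director":Director, "Cast":Cast}])
                movie_data = pd.concat([movie_data, new_row], ignore_index=True)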


Once all the data is saved in the `movie_data` dataframe, we replace the **slash** in the `Genre` column and the **comma** in the `Cast` column with a space, so that the entries are separated by spaces only.

In [15]:
movie_data["Genre"] = movie_data['Genre'].str.replace('/'," ")
movie_data["Cast"] = movie_data['Cast'].str.replace(','," ")
display(movie_data)

Unnamed: 0,Title,Year,Genre,Director,Cast
0,Aaghaaz,2000,Thriller,Yogesh Ishwar,Sunil Shetty Sushmita Sen Namrata Shirodkar
1,Aaj Ka Ravan,2000,Drama,Imran Khalid,Kasam Ali Mithun Chakraborty
2,Anjaane,2000,Romance,Ravi Rai,Raveena Tandon Vivek Mushran
3,Anokha Moti,2000,Family,Anshul Singla,Sanjay Sharma Arjun Chakraborty Nayab Aftab
4,Apradhi Kaun,2000,Thriller,Mohan Bhakri,Ishrat Ali Shagufta Ali
...,...,...,...,...,...
1749,Moh Maya Money,2016,Crime drama,Munish Bharadwaj,Ranvir Shorey Neha Dhupia Vidhushi Mehra Ash...
1750,Kahaani 2,2016,Thriller,Sujoy Ghosh,Vidya Balan Arjun Rampal Jugal Hansraj
1751,Befikre,2016,Romance,Aditya Chopra,Ranveer Singh Vaani Kapoor
1752,Wajah Tum Ho,2016,Romance drama,Vishal Pandya,Sana Khan Sharman Joshi Gurmeet Choudhary


***

# Data Collection of English Movies From 'English_Movies.csv' File

**Credit** for English Movies data : https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata

Load the data from the `English_Movies.csv` file to the pandas dataframe named `english_movies_data`.

In [16]:
english_movies_data = pd.read_csv('English_Movies.csv')

Now, display the **first 5 rows** of the `english_movies_data` dataframe using the `head` function. For the last 5 rows, use the `tail` function.

In [17]:
display(english_movies_data.head())

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


Display the number of rows and columns in the `english_movies_data` dataframe.

This will be used in the next step **for iterating over the data** row by row, collecting the required fields for each movie and ignoring the other columns of the dataframe.

In [18]:
display(english_movies_data.shape)

(4803, 24)

The `3rd column` in the dataframe provides the **Genre** of each film. It is important for building the recommendation system since it describes the content of the film (i.e. Drama, Comedy, Action, ...).

Similarly, the `13th column` in the dataframe provides the **release date** of the movie, with the year as the first `-`-separated field. We extract only the year from it. \
The `19th column` in the dataframe provides the **Title** of the movie.\
The `22nd column` in the dataframe provides the **Cast** (i.e. actors and actresses) of the film.\
The `24th column` (i.e. the last column) in the dataframe provides the **Director** of the movie.

These fields are read **row by row**, from the first to the last row of the `english_movies_data` dataframe, using a **for loop** and saved into the `movie_data` dataframe, which already holds the Title, Year, Genre, Director, and Cast of all the Bollywood movies.

In [19]:
for i in range(0,len(english_movies_data)):
    Genre = english_movies_data.iloc[i,2]
    Year = str(english_movies_data.iloc[i,12]).split('-')[0]
    Title = english_movies_data.iloc[i,18]
    Cast = english_movies_data.iloc[i,21]
    Director = english_movies_data.iloc[i,23]
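    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
    new_row = pd.DataFrame([{"Title":Title, "Year":Year, "Genre":Genre, "Director":Director, "Cast":Cast}])
    movie_data = pd.concat([movie_data, new_row], ignore_index=True)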

display(movie_data)

Unnamed: 0,Title,Year,Genre,Director,Cast
0,Aaghaaz,2000,Thriller,Yogesh Ishwar,Sunil Shetty Sushmita Sen Namrata Shirodkar
1,Aaj Ka Ravan,2000,Drama,Imran Khalid,Kasam Ali Mithun Chakraborty
2,Anjaane,2000,Romance,Ravi Rai,Raveena Tandon Vivek Mushran
3,Anokha Moti,2000,Family,Anshul Singla,Sanjay Sharma Arjun Chakraborty Nayab Aftab
4,Apradhi Kaun,2000,Thriller,Mohan Bhakri,Ishrat Ali Shagufta Ali
...,...,...,...,...,...
6552,El Mariachi,1992,Action Crime Thriller,Robert Rodriguez,Carlos Gallardo Jaime de Hoyos Peter Marquardt...
6553,Newlyweds,2011,Comedy Romance,Edward Burns,Edward Burns Kerry Bish\u00e9 Marsha Dietlein ...
6554,"Signed, Sealed, Delivered",2013,Comedy Drama Romance TV Movie,Scott Smith,Eric Mabius Kristin Booth Crystal Lowe Geoff G...
6555,Shanghai Calling,2012,,Daniel Hsia,Daniel Henney Eliza Coupe Bill Paxton Alan Ruc...
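
The same fields could also be selected by column name for all rows at once, which avoids the positional indices and the Python-level loop. This is only a sketch on a tiny stand-in dataframe with made-up rows: the names `title`, `genres`, `cast`, and `director` appear in the `head()` output above, while `release_date` is an assumed name for the 13th column, which that output elides.

```python
import pandas as pd

# Tiny stand-in for english_movies_data with made-up rows.
# 'release_date' is an ASSUMED column name (hidden behind '...' in the output above).
english = pd.DataFrame({
    "title": ["Film X", "Film Y"],
    "release_date": ["2009-12-10", "2015-10-26"],
    "genres": ["Action Adventure", "Drama"],
    "cast": ["Actor 1 Actor 2", "Actor 3"],
    "director": ["Director X", "Director Y"],
})

# Select and rename the columns by name, for all rows at once
subset = pd.DataFrame({
    "Title": english["title"],
    "Year": english["release_date"].astype(str).str.split("-").str[0],
    "Genre": english["genres"],
    "Director": english["director"],
    "Cast": english["cast"],
})

print(subset)
```

A frame like `subset` could then be added to `movie_data` with a single `pd.concat([movie_data, subset], ignore_index=True)`.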


Select all the features relevant for recommendation and replace **null values** in these columns with an **empty string**.

In [20]:
selected_features = ['Title','Year','Genre','Director','Cast']

for feature in selected_features:
  movie_data[feature] = movie_data[feature].fillna('')

Now, **concatenate** the column values of each movie into a single string. This is used for **feature extraction**, and the vectorizer is then fitted on the whole of the `combined_features` variable.

In [21]:
# combining all the 5 selected features
combined_features = movie_data['Title']+' '+ \
                    movie_data['Year']+' '+ \
                    movie_data['Genre']+' '+ \
                    movie_data['Director']+' '+ \
                    movie_data['Cast']

In [22]:
print(combined_features)

0       Aaghaaz 2000 Thriller Yogesh Ishwar Sunil Shet...
1       Aaj Ka Ravan 2000 Drama Imran Khalid Kasam Ali...
2       Anjaane 2000 Romance Ravi Rai Raveena Tandon V...
3       Anokha Moti 2000 Family Anshul Singla Sanjay S...
4       Apradhi Kaun 2000 Thriller Mohan Bhakri Ishrat...
                              ...                        
6552    El Mariachi 1992 Action Crime Thriller Robert ...
6553    Newlyweds 2011 Comedy Romance Edward Burns Edw...
6554    Signed, Sealed, Delivered 2013 Comedy Drama Ro...
6555    Shanghai Calling 2012  Daniel Hsia Daniel Henn...
6556    My Date with Drew 2005 Documentary Brian Herzl...
Length: 6557, dtype: object


***

# Feature Extraction of movies data

Now, feature extraction is performed on the `combined_features` variable. That is, all the **textual data** is converted into **numerical values** (TF-IDF weights) and stored in the variable named `feature_vectors`.

This is required for `cosine similarity`, which is computed between numeric vectors and cannot be applied to raw text directly.

In [23]:
# converting the text data to feature vectors
vectorizer = TfidfVectorizer()

In [24]:
feature_vectors = vectorizer.fit_transform(combined_features)

In [25]:
print(feature_vectors)

  (0, 15508)	0.3212168918471364
  (0, 11946)	0.3212168918471364
  (0, 15155)	0.22984093616686824
  (0, 16471)	0.29891232380015276
  (0, 15456)	0.2277890128059397
  (0, 16410)	0.2543586126639824
  (0, 8167)	0.4086452972083719
  (0, 18772)	0.3904280286213679
  (0, 16969)	0.10976573054416208
  (0, 115)	0.19082276158496245
  (0, 246)	0.4086452972083719
  (1, 3267)	0.275549755044569
  (1, 11464)	0.2876508093071839
  (1, 690)	0.22462070380693686
  (1, 8936)	0.3585400849116652
  (1, 9136)	0.3378765021211304
  (1, 8006)	0.31288994718397617
  (1, 5036)	0.08270743498818127
  (1, 13883)	0.41581314536561087
  (1, 8720)	0.2920081425195199
  (1, 247)	0.3841242613755517
  (1, 115)	0.19416989071951987
  (2, 11826)	0.39385528146594107
  (2, 18125)	0.3187795790159615
  (2, 16682)	0.3421942072235046
  :	:
  (6555, 18927)	0.2927398263319963
  (6555, 5374)	0.25218082303255973
  (6555, 7499)	0.2927398263319963
  (6555, 12915)	0.22131315003878832
  (6555, 2217)	0.16279936091809258
  (6555, 4259)	0.3465818206
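
To see what these numbers mean, the sketch below runs `TfidfVectorizer` on just three of the combined strings shown earlier in this notebook. The names `docs`, `demo_vectorizer`, and `demo_vectors` are introduced here for illustration only. With the default settings, text is lower-cased, single-character tokens such as `2` are dropped, a word that occurs in fewer documents gets a larger weight, and every row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three combined strings built from rows shown in the tables above
docs = [
    "Kahaani 2 2016 Thriller Sujoy Ghosh Vidya Balan Arjun Rampal Jugal Hansraj",
    "Befikre 2016 Romance Aditya Chopra Ranveer Singh Vaani Kapoor",
    "Aaghaaz 2000 Thriller Yogesh Ishwar Sunil Shetty Sushmita Sen Namrata Shirodkar",
]

demo_vectorizer = TfidfVectorizer()
demo_vectors = demo_vectorizer.fit_transform(docs)  # sparse matrix: one row per document
vocab = demo_vectorizer.vocabulary_                 # maps each token to its column index

print(demo_vectors.shape)
print("thriller" in vocab, "2" in vocab)

# 'kahaani' occurs in one document, 'thriller' in two, so 'kahaani' weighs more in row 0
row0 = demo_vectors[0].toarray().ravel()
print(row0[vocab["kahaani"]] > row0[vocab["thriller"]])

# Each row has Euclidean length 1 (the default norm='l2')
print(np.linalg.norm(demo_vectors.toarray(), axis=1))
```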

***

# Cosine Similarity

When building the system, the first step consists of defining a criterion that tells us how close two films are.

To do so, start from the **description** of the film selected by the user. From it, we get the `director` name, the names of the `actors`, and other `keywords` such as the genre. Then, build a **matrix** where each row corresponds to a film in the database and where the **columns** correspond to the previous quantities (director + actors + genre).

| Film | Title | Year | Genre 1 | Genre 2 | ... | director | actor 1 | actor 2 | actor 3 | ... |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Film 1 | $a_{11}$ | $a_{12}$ |   |   | ... |   |   |   |   | $a_{1q}$ |
| ... |   |   |   |   | ... |   |   |   |   |   |
| Film i | $a_{i1}$ | $a_{i2}$ |   |   | $a_{ij}$ |   |   |   |   | $a_{iq}$ |
| ... |   |   |   |   | ... |   |   |   |   |   |
| Film p | $a_{p1}$ | $a_{p2}$ |   |   | ... |   |   |   |   | $a_{pq}$ |

In this matrix, the $a_{ij}$ coefficients take either the value 0 or 1 depending on whether the feature in column $j$ applies to film $i$. Once this matrix has been defined, we **determine the distance between two films $m$ and $n$ according to**:

\begin{eqnarray}
d_{m, n} = \sqrt{  \sum_{j = 1}^{q} \left( a_{m,j}  - a_{n,j} \right)^2  } 
\end{eqnarray}

At this stage, we just have to select the $N$ films which are the **closest** to the entry selected by the user. According to the similarities between entries, we get a list of $N$ films.

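In this notebook the 0/1 coefficients are replaced by the TF-IDF weights stored in `feature_vectors`, and closeness is measured by a **similarity score** computed with `cosine_similarity` rather than by a distance: the higher the score, the closer the two films.

`cosine_similarity` compares the first movie with all other movies, the second movie with all other movies, and so on.

It measures how similar the rows of `feature_vectors` are to each other. In other words, it finds which movies are related to each other and scores each pair accordingly.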

In [26]:
similarity = cosine_similarity(feature_vectors)

In [27]:
print(similarity)

[[1.         0.03705203 0.04319075 ... 0.         0.         0.        ]
 [0.03705203 1.         0.04394834 ... 0.00479505 0.         0.        ]
 [0.04319075 0.04394834 1.         ... 0.01171512 0.         0.        ]
 ...
 [0.         0.00479505 0.01171512 ... 1.         0.         0.02714674]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.02714674 0.         1.        ]]


In [28]:
print(similarity.shape)

(6557, 6557)
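
For two feature vectors $u$ and $v$, the cosine similarity is $\cos(\theta) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The sketch below computes it by hand on two made-up 0/1 rows of the matrix described above and checks how it relates to the distance $d_{m,n}$: for vectors scaled to unit length, $d^2 = 2\,(1 - \cos\theta)$, so the most similar films are also the closest ones. The helper names `cosine`, `euclidean`, and `normalize` are hypothetical and not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Euclidean distance, as in the formula for d_{m,n}."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

# Two made-up films described by four 0/1 features
film_m = [1, 1, 0, 1]
film_n = [1, 0, 1, 1]

cos_mn = cosine(film_m, film_n)
d_mn = euclidean(normalize(film_m), normalize(film_n))

print(cos_mn)                            # two shared features out of three each
print(d_mn ** 2, 2 * (1 - cos_mn))       # the two values agree up to rounding
```

Because `TfidfVectorizer` scales each row to unit length by default, ranking by decreasing cosine similarity gives the same order as ranking by increasing Euclidean distance.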


***

# Steps for Recommendation System on a Predefined movie

Before getting a movie name from the user, let's check our system with the movie name `kahaani`.

In [29]:
movie_name = 'kahaani'

Create a `list` of all the movie titles in the dataset, so that the name `kahaani` can be **compared** against them.

In [30]:
list_of_all_titles = movie_data['Title'].tolist()
print(list_of_all_titles)



Now, find the **best match** for the given movie name using the `get_close_matches` function of the `difflib` library.

In [31]:
find_close_match = difflib.get_close_matches(movie_name, list_of_all_titles)
print(find_close_match)

['Kahaani', 'Yahaan', 'Tahaan']


In [32]:
close_match = find_close_match[0]
print(close_match)

Kahaani
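
`get_close_matches` returns an empty list when no title is similar enough, in which case `find_close_match[0]` raises an `IndexError`. The following sketch shows one way to guard against that on a toy title list; `best_match` is a hypothetical helper, and `n` and `cutoff` are the function's own optional parameters (defaults 3 and 0.6).

```python
import difflib

titles = ["Kahaani", "Welcome", "Yatra"]  # three titles from the dataset

def best_match(name, candidates, cutoff=0.6):
    """Return the closest title, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_match("kahaani", titles))  # matches despite the different first letter case
print(best_match("zzzz", titles))     # nothing is similar, so None is returned
```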


Find the row `index` of the `close_match` movie in `movie_data`. The lookup goes through the `Genre` column, but the value returned is simply the row index of the matching title.

In [33]:
index_of_Genre_of_the_movie = movie_data[movie_data.Title == close_match]['Genre'].index.values[0]
print(index_of_Genre_of_the_movie)

1108


Now, we need a `list` of similar movies based on the **index found** above, using the `similarity` variable defined earlier, which contains all the similarity scores.

The row of `similarity` at that index holds the score of every movie in the dataset relative to `kahaani`. Movies that are similar to it have a high score, and movies that are not similar have a low score. `enumerate` pairs each score with the index of its movie.

In [34]:
similarity_score = list(enumerate(similarity[index_of_Genre_of_the_movie]))
print(similarity_score)

[(0, 0.009088598777756751), (1, 0.005071083739673044), (2, 0.0), (3, 0.0), (4, 0.008519415393253864), (5, 0.0046404077400241635), (6, 0.0), (7, 0.0061556433375167955), (8, 0.004658793279869588), (9, 0.005541665172500432), (10, 0.0), (11, 0.005545805526772859), (12, 0.0), (13, 0.0), (14, 0.0), (15, 0.0), (16, 0.004728973952682347), (17, 0.0), (18, 0.005903336234962415), (19, 0.0), (20, 0.0), (21, 0.004930761212150914), (22, 0.005338187818805301), (23, 0.0), (24, 0.005026646782857961), (25, 0.0), (26, 0.0), (27, 0.0), (28, 0.014718545038052647), (29, 0.004171631949136089), (30, 0.0), (31, 0.005808238762039883), (32, 0.0), (33, 0.004463118908375763), (34, 0.0041363158546544095), (35, 0.003749412485454124), (36, 0.0), (37, 0.003332459970245538), (38, 0.0), (39, 0.006462383944548839), (40, 0.0038924513281827027), (41, 0.004931021379382534), (42, 0.0), (43, 0.005279132670169871), (44, 0.0), (45, 0.0), (46, 0.0), (47, 0.011517900343776408), (48, 0.004709834975823098), (49, 0.00494644619951275

In the **output of the above cell**, the `1st value` of each pair is the **index of the movie** in the dataset and the `2nd value` is the **similarity score** of that movie relative to the movie name entered by the user (here `kahaani`). There is one such pair for every movie in the dataset.

In [35]:
display(len(similarity_score))

6557

Now, we need to `sort` the movies in decreasing order based on their `similarity score`. This will help us to recommend movies with highest similarity score first.

In [36]:
sorted_similar_movies = sorted(similarity_score, key = lambda x:x[1], reverse = True) 
print(sorted_similar_movies)

[(1108, 1.0), (1750, 0.41545939676510996), (1686, 0.2778173489407222), (1670, 0.22869957973828361), (665, 0.2230793603911144), (1522, 0.17878350048003927), (624, 0.1760972609024334), (1376, 0.16923700120445387), (1464, 0.16828920248323992), (487, 0.16247090894736005), (1555, 0.16184821988674963), (402, 0.14746611007613147), (1175, 0.14636696281094336), (1493, 0.1412256794519589), (1360, 0.14094211009968272), (275, 0.14045094173513897), (951, 0.13792944111164704), (1106, 0.13708903487382193), (961, 0.13674236019665567), (1390, 0.13559998251816155), (1689, 0.13482479808710043), (1485, 0.13274561293603007), (1575, 0.1323600060670046), (689, 0.13102753101235923), (798, 0.13038751495568118), (1139, 0.12889910608043748), (1293, 0.12828806307084342), (971, 0.12676996897598394), (778, 0.12642942536004614), (1143, 0.12634668322670273), (1140, 0.1254601517924853), (443, 0.1252727566917367), (1189, 0.12523197967088545), (388, 0.12518575138412696), (532, 0.1243427245557596), (1475, 0.1234320936179

In [37]:
display(movie_data.Genre.index)

RangeIndex(start=0, stop=6557, step=1)

Now, we can recommend the first few movies according to their **similarity score** relative to `kahaani`.

We go through the **indices** in `sorted_similar_movies` in **decreasing order** of score, look up the **Title** of the movie at each index, and print the titles with a serial number. These are the **recommended movies for** `kahaani`. The first entry is `Kahaani` itself, since every movie has a similarity of 1.0 with itself.

In [38]:
# print only the 5 best-scoring entries instead of looping over the whole list
for i, (index, _) in enumerate(sorted_similar_movies[:5], start=1):
    title_from_index = movie_data.loc[index, 'Title']
    print(i, title_from_index, sep=') ')

1) Kahaani
2) Kahaani 2
3) Te3n
4) Traffic
5) Utthaan
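
When only a few recommendations are needed, the full sort can be replaced by a top-N selection, and the query movie itself can be left out. This is a sketch on a made-up similarity row, not the notebook's method; `scores`, `query_index`, and `top_two` are hypothetical names.

```python
import heapq

# Made-up similarity row; index 1 is the query movie itself (score 1.0)
scores = [0.10, 1.00, 0.40, 0.00, 0.30]
query_index = 1

# Pair each score with its index and drop the query movie
candidates = ((idx, s) for idx, s in enumerate(scores) if idx != query_index)

# Keep only the two best-scoring pairs, highest score first
top_two = heapq.nlargest(2, candidates, key=lambda pair: pair[1])
print(top_two)
```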


***

# Live Movie Recommendation System Based on User Input

The above steps are combined into one cell. Now, the **user** can input a favourite **movie** and get a **recommendation** of 30 movies which are **similar in content** to the movie name entered. The first entry of the list is the matched movie itself. `To try it, run the cell below`.

In [39]:
movie_name = input('Enter your favourite movie name : ')

list_of_all_titles = movie_data['Title'].tolist()

find_close_match = difflib.get_close_matches(movie_name, list_of_all_titles)

if not find_close_match:
    # get_close_matches returns an empty list when no title is similar enough
    print('\nNo movie with a similar title was found. Please check the spelling.')
else:
    close_match = find_close_match[0]

    # row index of the matched title in movie_data
    index_of_Genre_of_the_movie = movie_data[movie_data.Title == close_match]['Genre'].index.values[0]

    similarity_score = list(enumerate(similarity[index_of_Genre_of_the_movie]))

    sorted_similar_movies = sorted(similarity_score, key = lambda x:x[1], reverse = True)

    print('\nMovies suggested for you : \n')

    # print the 30 best-scoring titles
    for i, (Genre_index, _) in enumerate(sorted_similar_movies[:30], start=1):
        title_from_index = movie_data.loc[Genre_index, 'Title']
        print(i, title_from_index, sep=') ')

Enter your favourite movie name : Iron man

Movies suggested for you : 

1) Iron Man
2) Iron Man 2
3) Iron Man 3
4) Made
5) Duets
6) The Good Night
7) The Avengers
8) The Last Airbender
9) Charlie Bartlett
10) Contagion
11) Captain America: Civil War
12) Zathura: A Space Adventure
13) Avengers: Age of Ultron
14) Sky Captain and the World of Tomorrow
15) The Iron Giant
16) Mortdecai
17) The Judge
18) Red Tails
19) The Best Man
20) A Scanner Darkly
21) Tropic Thunder
22) Holy Man
23) Lucky You
24) The Nativity Story
25) Sherlock Holmes
26) The Kite Runner
27) R.I.P.D.
28) The Best Man Holiday
29) Two Lovers
30) Se7en


>NOTE: Execute the last cell to input your favourite movie name; you will get movie recommendations based on the content of the movie entered (e.g. superhero movie, space-related movie, drama, thriller, comedy, etc.)
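
For reuse outside an interactive cell, the same steps could be wrapped in a function that returns the titles instead of printing them. The sketch below runs on a toy title list and a made-up 3×3 similarity matrix; `recommend`, `toy_titles`, and `toy_similarity` are hypothetical names, and in the notebook the arguments would be `list_of_all_titles` and `similarity`.

```python
import difflib

def recommend(movie_name, titles, similarity, top_n=30):
    """Return up to top_n titles ranked by similarity to the best title match."""
    matches = difflib.get_close_matches(movie_name, titles, n=1)
    if not matches:
        return []  # no title is close enough to the input
    query_index = titles.index(matches[0])  # first row with the matched title
    ranked = sorted(enumerate(similarity[query_index]),
                    key=lambda pair: pair[1], reverse=True)
    return [titles[idx] for idx, _ in ranked[:top_n]]

# Toy data: the scores are made up for illustration
toy_titles = ["Iron Man", "Iron Man 2", "Contagion"]
toy_similarity = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

print(recommend("iron man", toy_titles, toy_similarity, top_n=2))
print(recommend("zzzz", toy_titles, toy_similarity))
```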

***