# What Is a Movie Recommender System?

A **movie recommender system** is software that suggests movies to a user based on that user's preferences, behavior, or viewing history. Online platforms use recommender systems to surface content tailored to each user. The two classic approaches are collaborative filtering and content-based filtering; the list below also covers hybrids of the two and several more recent techniques.

1. **Collaborative Filtering:**
   - **User-Based Collaborative Filtering:** Recommends movies based on the preferences of users who are similar to the target user. If User A and User B have similar tastes and User A liked a movie that User B has not seen yet, the system might recommend that movie to User B.
   - **Item-Based Collaborative Filtering:** Recommends movies similar to those the user has already liked. If a user enjoyed Movie X, the system suggests other movies that are often liked by users who enjoyed Movie X.

2. **Content-Based Filtering:**
   - Analyzes the content of movies and recommends items similar to those the user has shown interest in before. This can involve features such as genre, director, actors, or keywords associated with the movies.
   - The system builds a profile for each user based on their preferences and matches it with the features of available movies.

3. **Hybrid Recommender Systems:**
   - Combine collaborative filtering and content-based filtering to leverage the strengths of both approaches. Hybrid systems can provide more accurate and diverse recommendations.

4. **Matrix Factorization:**
   - Techniques such as Singular Value Decomposition (SVD) factor the user-item rating matrix into low-dimensional user and item matrices. The resulting latent factors capture hidden aspects of a user's preferences and are used to predict ratings for movies the user has not seen.

5. **Deep Learning Models:**
   - Deep learning models, such as neural collaborative filtering, use neural networks to learn complex, non-linear patterns in user-item interactions, which can improve recommendation accuracy when enough interaction data is available.

6. **Context-Aware Recommender Systems:**
   - Take into account additional contextual information, such as the user's location, time, or device, to provide more contextually relevant recommendations.
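
To make the user-based collaborative filtering idea above concrete, here is a minimal sketch on an invented ratings table. The user names, movie titles, and ratings are all hypothetical, and treating an unrated movie as a rating of 0 when computing similarity is a simplifying assumption, not how a production system would handle missing data.

```python
from math import sqrt

# Toy data (invented for illustration): one rating per movie, 0 = not rated.
movie_titles = ["Movie W", "Movie X", "Movie Y", "Movie Z"]
ratings = {
    "User A": [5, 4, 0, 0],   # target user: has not rated Movie Y or Movie Z
    "User B": [5, 5, 4, 1],   # tastes close to User A
    "User C": [1, 2, 1, 5],   # tastes unlike User A
}

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def recommend(target, ratings):
    """Return (movie index, predicted score) of the best unrated movie for `target`."""
    # Similarity of the target user to every other user
    sims = {user: cosine(ratings[target], vec)
            for user, vec in ratings.items() if user != target}
    total = sum(sims.values())
    scores = {}
    for m, own_rating in enumerate(ratings[target]):
        if own_rating == 0:
            # Similarity-weighted average of the other users' ratings
            scores[m] = sum(sims[u] * ratings[u][m] for u in sims) / total
    best = max(scores, key=scores.get)
    return best, scores[best]

best, score = recommend("User A", ratings)
print(movie_titles[best])  # Movie Y
```

User B is far more similar to User A than User C is, so User B's high rating for Movie Y outweighs User C's high rating for Movie Z.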

Popular movie streaming platforms like Netflix, Hulu, and Amazon Prime Video use sophisticated recommender systems to offer users a personalized and engaging viewing experience. These systems play a crucial role in helping users discover new content they are likely to enjoy, ultimately increasing user satisfaction and platform engagement.

## Importing Libraries

In [84]:
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
import ast
import nltk
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle
import warnings
warnings.filterwarnings('ignore')

## Reading Datasets

In [85]:
movies=pd.read_csv("/home/blackheart/Documents/DATA SCIENCE/PROJECT/Movie-Recommender-System/Data/tmdb_5000_movies.csv")
credits=pd.read_csv("/home/blackheart/Documents/DATA SCIENCE/PROJECT/Movie-Recommender-System/Data/tmdb_5000_credits.csv")

## Dataset Features

The tables below describe the columns of the two TMDB files loaded above.

**Movie Features**

| Feature Name | Data Type | Description |
|---|---|---|
| budget | int | The production budget of the movie in USD. |
| genres | string (JSON-like list) | The genres associated with the movie, each with an `id` and a `name`. |
| homepage | string | The URL of the movie's official website. |
| id | int | A unique identifier for the movie. |
| keywords | string (JSON-like list) | The keywords associated with the movie, each with an `id` and a `name`. |
| original_language | string | The code of the language in which the movie was produced, such as `en`. |
| original_title | string | The title of the movie in its original language. |
| overview | string | A brief overview of the movie's plot. |
| popularity | float | TMDB's popularity score for the movie. |
| production_companies | string (JSON-like list) | The production companies involved in the making of the movie. |
| production_countries | string (JSON-like list) | The countries where the movie was produced. |
| release_date | string | The release date of the movie in `YYYY-MM-DD` format. |
| revenue | int | The revenue generated by the movie in USD. |
| runtime | float | The runtime of the movie in minutes. |
| spoken_languages | string (JSON-like list) | The languages spoken in the movie. |
| status | string | The status of the movie, such as "Released" or "Post Production". |
| tagline | string | The tagline or slogan associated with the movie. |
| title | string | The title of the movie. |
| vote_average | float | The average vote rating for the movie from TMDB users. |
| vote_count | int | The number of votes cast for the movie by TMDB users. |

**Credits Features**

The credits file adds cast and crew information for each movie:

* **movie_id:** The TMDB identifier of the movie; it holds the same values as `id` in the movies file.
* **title:** The title of the movie.
* **cast:** A JSON-like list of cast members, including each actor's name, character, and billing order.
* **crew:** A JSON-like list of crew members, including each person's name, department, and job (director, editor, and so on).

TMDB itself holds more than these two files contain, such as release dates for different countries and translations of titles and overviews. None of that extra information is used in this notebook.

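**Dataset Size**

Each of the two files contains 4,803 rows: the movies file has 20 columns and the credits file has 4.

**Data Format**

Both files are in CSV format. Nested fields such as `genres`, `keywords`, `cast`, and `crew` are stored as JSON-like strings inside individual cells.

**Data Source**

The data originates from TMDB (The Movie Database). This notebook reads it from the local files `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv`.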



In [86]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [87]:
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [88]:
print("Movies Data Shape: ",movies.shape)
print("Credits Data Shape: ",credits.shape)

Movies Data Shape:  (4803, 20)
Credits Data Shape:  (4803, 4)


* **Let's Merge The Datasets**

In [89]:
movies=movies.merge(credits,on='title')
movies.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


In [90]:
print("Movies Data Shape: ",movies.shape)

Movies Data Shape:  (4809, 23)
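
The merged frame has 4809 rows even though each input has 4803. A merge can only return more rows than its inputs when the join key is not unique, so some titles must appear more than once. The sketch below reproduces the effect on invented toy data and shows one possible alternative: joining on the identifier columns instead. The frame names and rows are hypothetical; applying this to the real data is left as an option, since the rest of this notebook keeps the title-based merge.

```python
import pandas as pd

# Toy stand-ins for the two files; every row here is invented.
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Remake", "Remake", "Unique"],   # two different movies share a title
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Remake", "Remake", "Unique"],
    "cast": ["cast of 1", "cast of 2", "cast of 3"],
})

# Joining on title pairs every "Remake" row with every other "Remake" row
by_title = toy_movies.merge(toy_credits, on="title")
print(len(by_title))  # 5 rows from 3 movies

# Joining on the unique identifier keeps exactly one row per movie
by_id = toy_movies.merge(
    toy_credits.drop(columns="title"),   # avoid duplicate title_x / title_y columns
    left_on="id",
    right_on="movie_id",
)
print(len(by_id))  # 3
```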


## Feature Selection

Now we are going to keep only those features that will play an important role in recommending movies.

In [91]:
movies=movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [92]:
movies.head(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


In [93]:
print("Movies Data Shape: ",movies.shape)

Movies Data Shape:  (4809, 7)


### Let's Check for Null Values

In [94]:
movies.isnull()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
4804,False,False,False,False,False,False,False
4805,False,False,False,False,False,False,False
4806,False,False,False,False,False,False,False
4807,False,False,False,False,False,False,False


In [95]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

* Only three rows are missing an `overview`, so let's drop them.

In [96]:
movies.dropna(inplace=True)
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [97]:
movies.isnull().sum().sum()

0

In [98]:
print("Now Movies Data Shape: ",movies.shape)

Now Movies Data Shape:  (4806, 7)


* Let's check for duplicate rows.

In [99]:
movies.duplicated().sum()

0

* No duplicate rows are present.

**Most feature values in this dataset are stored as JSON-like strings (a string that looks like a list of dictionaries), so we are going to convert each of the features listed above into a Python list.**

## Data Preprocessing

### Handling `genres`

In [100]:
movies['genres']

0       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
1       [{"id": 12, "name": "Adventure"}, {"id": 14, "...
2       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
3       [{"id": 28, "name": "Action"}, {"id": 80, "nam...
4       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
                              ...                        
4804    [{"id": 28, "name": "Action"}, {"id": 80, "nam...
4805    [{"id": 35, "name": "Comedy"}, {"id": 10749, "...
4806    [{"id": 35, "name": "Comedy"}, {"id": 18, "nam...
4807                                                   []
4808                  [{"id": 99, "name": "Documentary"}]
Name: genres, Length: 4806, dtype: object

In [101]:
def convert(text):
    li=[]
    for i in ast.literal_eval(text): 
        # ast.literal_eval() safely parses the string as a Python literal (here, a list of dicts)
        li.append(i['name'])
        
    return li
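
As a quick check of what this conversion does, the sketch below runs the same extraction on a short sample string. It also shows a hypothetical alternative, `names_via_json`, which parses with `json.loads`. That alternative assumes every cell is valid JSON, which holds for the samples printed in this notebook but is not verified for the whole column.

```python
import ast
import json

# A shortened sample in the same shape as the genres column
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def names_via_json(text):
    """Hypothetical alternative to convert(): parse with json.loads."""
    return [item["name"] for item in json.loads(text)]

# Same logic as convert(), written as a list comprehension
names_ast = [item["name"] for item in ast.literal_eval(sample)]
names_json = names_via_json(sample)

print(names_ast)   # ['Action', 'Adventure']
print(names_json)  # ['Action', 'Adventure']
```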

In [102]:
movies['genres']=movies['genres'].apply(convert)

In [103]:
movies.head(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


In [104]:
movies['genres']

0       [Action, Adventure, Fantasy, Science Fiction]
1                        [Adventure, Fantasy, Action]
2                          [Action, Adventure, Crime]
3                    [Action, Crime, Drama, Thriller]
4                [Action, Adventure, Science Fiction]
                            ...                      
4804                        [Action, Crime, Thriller]
4805                                [Comedy, Romance]
4806               [Comedy, Drama, Romance, TV Movie]
4807                                               []
4808                                    [Documentary]
Name: genres, Length: 4806, dtype: object

### Handling Keywords

In [105]:
movies['keywords'][0]

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [106]:
movies['keywords']=movies['keywords'].apply(convert)

In [107]:
movies.head(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


### Now Time for `Cast`

In [108]:
movies['cast'][0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

* A single movie can have a very large cast, so we keep only the first three cast members listed for each movie.

In [109]:
def convert_cast(text):
    # Keep only the names of the first three cast members listed
    return [member['name'] for member in ast.literal_eval(text)[:3]]
            

In [110]:
movies['cast']=movies['cast'].apply(convert_cast)
movies.head(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


### Crew

In [111]:
movies['crew'][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [112]:
def fetch_director(text):
    li=[]
    
    for i in ast.literal_eval(text):
        if i['job']=='Director':
            li.append(i['name'])
            break
    return li

In [113]:
movies['crew']=movies['crew'].apply(fetch_director)
movies.head(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]


### Handling Overview

In [114]:
movies['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [115]:
movies['overview']=movies['overview'].apply(lambda x:x.split()) # split the words in overview column
movies.head(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]


In [116]:
movies.sample(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
504,328111,The Secret Life of Pets,"[The, quiet, life, of, a, terrier, named, Max,...","[Animation, Family]","[pet, bunny, anthropomorphism, dog, animal, ap...","[Louis C.K., Eric Stonestreet, Kevin Hart]",[Chris Renaud]
1007,18947,The Boat That Rocked,"[The, Boat, that, Rocked, is, an, ensemble, co...","[Drama, Comedy]","[great britain, musical, rock, pirate radio, s...","[Tom Sturridge, Philip Seymour Hoffman, Rhys I...",[Richard Curtis]
3945,17995,DysFunktional Family,"[Between, sets, from, his, hilarious, live, st...","[Comedy, Documentary]",[],"[Eddie Griffin, Joe Howard, Matthew Brent]",[George Gallo]


In [117]:
movies['overview'][1]

['Captain',
 'Barbossa,',
 'long',
 'believed',
 'to',
 'be',
 'dead,',
 'has',
 'come',
 'back',
 'to',
 'life',
 'and',
 'is',
 'headed',
 'to',
 'the',
 'edge',
 'of',
 'the',
 'Earth',
 'with',
 'Will',
 'Turner',
 'and',
 'Elizabeth',
 'Swann.',
 'But',
 'nothing',
 'is',
 'quite',
 'as',
 'it',
 'seems.']

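### Now Removing Spaces Between First Names and Surnames

For example, **"Sinchan Lohara"** becomes **"SinchanLohara"**, so that each full name (and each multi-word genre or keyword) is treated as a single token.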

In [118]:
def remove_space(text):
    ls=[]
    for i in text:
        ls.append(i.replace(" ",""))
    return ls
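
The reason for joining names can be seen with a small sketch. It compares the word-level tokens of two credits lists before and after removing spaces. `word_tokens` is a hypothetical helper written only for this illustration, and `remove_space` is repeated here so the block runs on its own.

```python
def remove_space(names):
    """Same idea as remove_space above: drop spaces inside each name."""
    return [name.replace(" ", "") for name in names]

def word_tokens(names):
    """Hypothetical helper: lowercase, whitespace-separated tokens of a list of names."""
    return set(" ".join(names).lower().split())

cast_a = ["Sam Worthington", "Zoe Saldana"]   # actors in one movie
crew_b = ["Sam Mendes"]                       # director of another movie

# With spaces, the unrelated people share the token "sam"
shared_before = word_tokens(cast_a) & word_tokens(crew_b)
# Without spaces, each person is one distinct token
shared_after = word_tokens(remove_space(cast_a)) & word_tokens(remove_space(crew_b))

print(shared_before)  # {'sam'}
print(shared_after)   # set()
```

Without this step, two movies could look similar only because unrelated people share a first name.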

In [119]:
movies['keywords']=movies['keywords'].apply(remove_space)
movies['genres']=movies['genres'].apply(remove_space)
movies['cast']=movies['cast'].apply(remove_space)
movies['crew']=movies['crew'].apply(remove_space)

In [120]:
movies.sample(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
1263,194,Amélie,"[At, a, tiny, Parisian, café,, the, adorable, ...","[Comedy, Romance]","[paris, lovetriangle, ghosttrain, sex-shop, sh...","[AudreyTautou, MathieuKassovitz, Rufus]",[Jean-PierreJeunet]
1377,271718,Trainwreck,"[Having, thought, that, monogamy, was, never, ...",[Comedy],"[alcohol, one-nightstand]","[AmySchumer, BillHader, BrieLarson]",[JuddApatow]
1666,146238,Runner Runner,"[When, a, poor, college, student, who, cracks,...","[Crime, Thriller, Drama]","[gambling, casino, gamblingdebts, dirtycop, pu...","[BenAffleck, GemmaArterton, JustinTimberlake]",[BradFurman]


### Time to Concatenate All Features into a Single Feature: `tags`

In [121]:
movies['tags']=movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']

In [122]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."


In [123]:
movies['tags'][1]

['Captain',
 'Barbossa,',
 'long',
 'believed',
 'to',
 'be',
 'dead,',
 'has',
 'come',
 'back',
 'to',
 'life',
 'and',
 'is',
 'headed',
 'to',
 'the',
 'edge',
 'of',
 'the',
 'Earth',
 'with',
 'Will',
 'Turner',
 'and',
 'Elizabeth',
 'Swann.',
 'But',
 'nothing',
 'is',
 'quite',
 'as',
 'it',
 'seems.',
 'Adventure',
 'Fantasy',
 'Action',
 'ocean',
 'drugabuse',
 'exoticisland',
 'eastindiatradingcompany',
 "loveofone'slife",
 'traitor',
 'shipwreck',
 'strongwoman',
 'ship',
 'alliance',
 'calypso',
 'afterlife',
 'fighter',
 'pirate',
 'swashbuckler',
 'aftercreditsstinger',
 'JohnnyDepp',
 'OrlandoBloom',
 'KeiraKnightley',
 'GoreVerbinski']

#### Making New DataFrame with important features

In [124]:
df=movies[['movie_id','title','tags']].copy()  # copy so that later assignments to df do not act on a view of movies
df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


* The text-processing steps that follow work on plain strings, so we join each list of tags into a single space-separated string.

In [125]:
df['tags']=df['tags'].apply(lambda x:" ".join(x))
df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [126]:
df['tags'][1]

"Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But nothing is quite as it seems. Adventure Fantasy Action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger JohnnyDepp OrlandoBloom KeiraKnightley GoreVerbinski"

* Convert everything to lowercase

In [127]:
df['tags']=df['tags'].apply(lambda x:x.lower())

In [128]:
df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


In [129]:
df['tags'][1]

"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley goreverbinski"

# What Is the Porter Stemmer?

The Porter Stemmer is a widely used algorithm for stemming in natural language processing (NLP). Stemming is the process of reducing words to their base or root form, with the goal of simplifying the words while retaining their core meaning. The Porter Stemmer specifically focuses on removing common suffixes from words.

**Key Characteristics of the Porter Stemmer:**

1. **Algorithmic Approach:**
   - The Porter Stemmer is rule-based and uses a set of rules to iteratively strip common suffixes from words.

2. **Designed for the English Language:**
   - It is designed specifically for the English language and may not perform optimally for languages with different morphological structures.

3. **Simplified Word Forms:**
   - The goal of stemming is to reduce words to a common base form. For example, the word "running" would be stemmed to "run."

4. **Light Stemming:**
   - The Porter Stemmer is often described as a relatively "light" stemming algorithm: it is less aggressive than alternatives such as the Lancaster stemmer. Its stems are not guaranteed to be dictionary words, however.

5. **Common Suffix Removal:**
   - The algorithm applies a series of rules to remove common suffixes, such as "-ing," "-ed," or "-s," to get to the root form of a word.

**Example:**
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "flies", "happily", "better"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)
```

Output:
```
['run', 'fli', 'happili', 'better']
```

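In this example, the Porter Stemmer has transformed the words "running," "flies," "happily," and "better" to their stemmed forms. Note that "fli" and "happili" are not dictionary words, and that the irregular form "better" is left unchanged rather than mapped to "good": a stemmer only strips suffixes by rule and does not look words up, which is the job of a lemmatizer.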

**Use Cases:**
- Information Retrieval: Reducing words to their base forms helps improve recall in search queries.
- Text Mining and Analysis: Simplifying word representations for pattern identification.
- Search Engines: Enhancing search results by considering different forms of a word.

While the Porter Stemmer is widely used, it's essential to note that stemming may sometimes result in over-stemming or under-stemming, where words are overly simplified or not sufficiently reduced. The choice of stemming algorithm depends on the specific requirements of the NLP task.
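
Over-stemming and under-stemming are easier to see on concrete words. The following is a minimal sketch, assuming NLTK is installed; the word lists are illustrative choices and are not taken from the movie dataset.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: three words with different meanings collapse to one stem.
over_stemmed = {w: stemmer.stem(w) for w in ["universe", "university", "universal"]}

# Under-stemming: two forms of the same word keep different stems.
under_stemmed = {w: stemmer.stem(w) for w in ["data", "datum"]}

print(over_stemmed)
print(under_stemmed)
```

For a recommender built on word counts, over-stemming can make unrelated movies look more alike, while under-stemming hides overlap that really exists.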

In [130]:
from nltk.stem import PorterStemmer
ps=PorterStemmer()

In [131]:
def stem(text):
    # Stem each whitespace-separated token and join the stems back into one string
    return " ".join(ps.stem(word) for word in text.split())
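
Because `stem` splits on whitespace, punctuation stays attached to its token, so a word such as "seems." reaches the stemmer with the trailing period and is left unstemmed. One possible variant is sketched below. The helper name `stem_tokens` and the regular expression are assumptions made for illustration; the notebook itself keeps the simpler whitespace split.

```python
import re
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_tokens(text):
    # Hypothetical variant of stem(): keep only runs of letters, digits and
    # apostrophes, so punctuation never reaches the stemmer.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(ps.stem(word) for word in words)

print(stem_tokens("But nothing is quite as it seems."))
```

Whether this matters in practice depends on the later steps: `CountVectorizer` applies its own tokenization and drops punctuation anyway, but it would still see the unstemmed word.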

In [132]:
df['tags']=df['tags'].apply(stem)

In [133]:
df['tags'][1]

"captain barbossa, long believ to be dead, ha come back to life and is head to the edg of the earth with will turner and elizabeth swann. but noth is quit as it seems. adventur fantasi action ocean drugabus exoticisland eastindiatradingcompani loveofone'slif traitor shipwreck strongwoman ship allianc calypso afterlif fighter pirat swashbuckl aftercreditssting johnnydepp orlandobloom keiraknightley goreverbinski"

## CountVectorizer?

A **CountVectorizer** is a feature extraction technique used in natural language processing (NLP) to convert a collection of text documents into a matrix of token counts. It's a simple and widely used method for vectorizing text data. The process involves counting the frequency of each word (or "token") in the documents and representing the data in a structured format suitable for machine learning models.

**Key Points about CountVectorizer:**

1. **Tokenization:**
   - The text is first tokenized, breaking it down into individual words or terms. These are the "tokens" that will be counted.

2. **Vocabulary Building:**
   - CountVectorizer builds a vocabulary of all unique words in the document collection. Each unique word becomes a feature in the resulting matrix.

3. **Document-Term Matrix (DTM):**
   - The output is a Document-Term Matrix (DTM), where each row represents a document, and each column represents a unique word from the vocabulary. The matrix elements are the counts of how many times each word appears in each document.

4. **Sparse Representation:**
   - The resulting matrix is often sparse, meaning that many entries are zero because most documents do not contain all the words in the vocabulary. This sparse representation is memory-efficient.

**Example in Python using scikit-learn:**
```python
from sklearn.feature_extraction.text import CountVectorizer

# Example documents
documents = ["This is the first document.",
              "This document is the second document.",
              "And this is the third one.",
              "Is this the first document?"]

# Create the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert to an array for easier inspection
X.toarray()
```

**Output:**
```
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
```

In this example, each row corresponds to a document, and each column corresponds to a unique word in the vocabulary. The numbers in the matrix represent the count of each word in each document.

**Use Cases:**
- Text classification
- Document clustering
- Information retrieval
- Keyword extraction

CountVectorizer is a fundamental tool in text processing and is often used as a preprocessing step before applying machine learning models to text data.

In [135]:
cv=CountVectorizer(max_features=5000,stop_words='english')
vector=cv.fit_transform(df['tags']).toarray()
vector[0]

array([0, 0, 0, ..., 0, 0, 0])

In [136]:
print("Vector Shape: ",vector.shape)

Vector Shape:  (4806, 5000)


## Cosine Similarity?

**Cosine Similarity** is a metric used to measure the similarity between two non-zero vectors of an inner product space. In the context of text data, vectors typically represent the term frequency or TF-IDF (Term Frequency-Inverse Document Frequency) of words in documents. Cosine Similarity is commonly used to assess the similarity between documents.

In scikit-learn, you can compute the cosine similarity between two or more vectors using the `cosine_similarity` function. Here's a brief explanation:

1. **Importing the Necessary Library:**
   ```python
   from sklearn.metrics.pairwise import cosine_similarity
   ```

2. **Creating Sample Data:**
   Let's create two vectors as an example. These could represent the TF-IDF vectors of two documents.
   ```python
   import numpy as np

   vector1 = np.array([1, 2, 0, 1, 0])
   vector2 = np.array([0, 1, 1, 0, 1])
   ```

3. **Reshape the Vectors (if needed):**
   The vectors need to be reshaped if they are 1D arrays, because `cosine_similarity` expects 2D arrays or matrices with one row per sample.
   ```python
   vector1 = vector1.reshape(1, -1)
   vector2 = vector2.reshape(1, -1)
   ```

4. **Compute Cosine Similarity:**
   Use the `cosine_similarity` function to compute the cosine similarity between the vectors.
   ```python
   similarity_score = cosine_similarity(vector1, vector2)
   ```

   The resulting `similarity_score` will be a 2D array of shape `(1, 1)` holding the single cosine similarity value.

**Full Example:**
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Sample data
vector1 = np.array([1, 2, 0, 1, 0])
vector2 = np.array([0, 1, 1, 0, 1])

# Reshape if needed
vector1 = vector1.reshape(1, -1)
vector2 = vector2.reshape(1, -1)

# Compute cosine similarity
similarity_score = cosine_similarity(vector1, vector2)

print("Cosine Similarity:", similarity_score[0, 0])
```

In this example, the cosine similarity between `vector1` and `vector2` is 2 / (√6 · √3) ≈ 0.471. In general, the value ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction), with 0 indicating orthogonality (no similarity). Because word-count and TF-IDF vectors contain no negative entries, the similarity between two documents always falls between 0 and 1.

This cosine similarity measure is often used in information retrieval, document clustering, and recommendation systems to determine the similarity between documents or items.
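
The score is simply the dot product of the two vectors divided by the product of their lengths. As a sanity check, it can be computed by hand with NumPy; the helper name `cosine` is introduced here for illustration only.

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vector1 = np.array([1, 2, 0, 1, 0])
vector2 = np.array([0, 1, 1, 0, 1])

score = cosine(vector1, vector2)
print(round(score, 4))  # 0.4714
```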

In [140]:
similarity=cosine_similarity(vector)

In [141]:
print("Similarity Shape: ",similarity.shape)

Similarity Shape:  (4806, 4806)
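
Calling `cosine_similarity` with a single matrix compares every row with every other row, which is why the result is square. The function also accepts the sparse matrix returned by `fit_transform` directly, so the `.toarray()` step is optional for this computation. The sketch below uses made-up tag strings to show both points.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_tags = ["space alien war", "space alien planet", "love paris romance"]

sparse_vectors = CountVectorizer().fit_transform(toy_tags)  # SciPy sparse matrix

sim_from_sparse = cosine_similarity(sparse_vectors)
sim_from_dense = cosine_similarity(sparse_vectors.toarray())

print(sim_from_sparse.shape)                          # one row and one column per document
print(np.allclose(sim_from_sparse, sim_from_dense))   # same values either way
print(np.allclose(np.diag(sim_from_sparse), 1.0))     # each document matches itself
```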


In [142]:
df[df['title']=='The Dark Knight Rises'].index[0]

3

In [143]:
df[df['title']=='The Avengers'].index[0]

16

Creating the recommendation function:

In [144]:
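def recommend(movie):
    # Row of the requested title (assumes the title exists and df keeps its default integer index)
    movie_index=df[df['title']==movie].index[0]
    distances=similarity[movie_index]
    # Sort by similarity, highest first; position 0 is the movie itself, so keep positions 1-5
    movies_list=sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6]

    for i in movies_list:
        print(df.iloc[i[0]].title)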

In [145]:
recommend('The Avengers')

Iron Man 3
Avengers: Age of Ultron
Captain America: Civil War
Captain America: The First Avenger
Iron Man
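
`recommend` prints its results and raises an `IndexError` when the title is not in the data. A possible variant that returns a list and handles unknown titles is sketched below. The name `recommend_titles`, its parameters, and the four-movie toy data are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the notebook's df and similarity
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tags": ["space alien war", "space alien planet",
             "love paris romance", "war soldier battle"],
})
toy_similarity = cosine_similarity(CountVectorizer().fit_transform(toy_df["tags"]))

def recommend_titles(movie, df, similarity, top_n=5):
    matches = df.index[df["title"] == movie]
    if len(matches) == 0:
        return []                                  # unknown title: nothing to recommend
    pos = df.index.get_loc(matches[0])             # row position, whatever the index labels are
    order = np.argsort(similarity[pos])[::-1]      # most similar first
    picks = [i for i in order if i != pos][:top_n] # skip the movie itself
    return df["title"].iloc[picks].tolist()

print(recommend_titles("A", toy_df, toy_similarity, top_n=2))
print(recommend_titles("Unknown", toy_df, toy_similarity))
```

Returning a list instead of printing makes the function easier to reuse, for example from a web front end that loads the saved artifacts.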


## Saving Model

In [148]:
with open('/home/blackheart/Documents/DATA SCIENCE/PROJECT/Movie-Recommender-System/artifacts/movies_list.pkl','wb') as f:
    pickle.dump(df,f)
with open('/home/blackheart/Documents/DATA SCIENCE/PROJECT/Movie-Recommender-System/artifacts/similarity.pkl','wb') as f:
    pickle.dump(similarity,f)
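
The saved artifacts can later be read back with `pickle.load`. The round trip is sketched below on a small stand-in array written to a temporary directory; the names `similarity_demo` and `artifact_path` are hypothetical, and a real application would point at the files saved above instead.

```python
import os
import pickle
import tempfile

import numpy as np

similarity_demo = np.eye(3)  # stand-in for the real similarity matrix

# Write to a temporary directory so the sketch runs anywhere
artifact_path = os.path.join(tempfile.mkdtemp(), "similarity.pkl")

with open(artifact_path, "wb") as f:
    pickle.dump(similarity_demo, f)

with open(artifact_path, "rb") as f:
    loaded = pickle.load(f)

print(np.array_equal(loaded, similarity_demo))
```

Pickle files can execute arbitrary code when loaded, so only load artifacts that come from a trusted source.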

# **Thank You!**