<center><font color='Black' style='font-family:verdana; font-size:25px'>Movie Recommendation System</font></center>
<hr style="color: black; height: 1px;">

<b><font color='#067F7D' style='font-family:cambria; font-size:20px'>Problem Statement</font></b><br>

<font color="black" style="font-family:Cambria; font-size:16px">The goal is to predict the success of a movie before its release by leveraging various features such as plot, cast, crew, budget, and revenues. The key questions to address include:</font>
<br>

<ul>
    <li><font color="black" style="font-family:Cambria; font-size:16px">Can we identify the factors that contribute to the success or failure of a movie?</font></li>
    <li><font color="black" style="font-family:Cambria; font-size:16px">Can we predict which films will be highly rated?</font></li>
</ul>

<hr>

<b><font color='#067F7D' style='font-family:cambria; font-size:20px'>Overview</font></b><br>

<font color="black" style="font-family:Cambria; font-size:16px">This project uses a dataset containing information on several thousand films, including their plot, cast, crew, budget, and revenues. Due to a DMCA takedown request from IMDb, the original dataset was replaced with a similar set of films and data fields from The Movie Database (TMDb). The dataset includes various columns such as genres, keywords, cast, crew, production companies, and vote average.</font>
<br>

<font color="#067F7D" style="font-family:Cambria; font-size:16px"><b>Dataset Summary</b></font>
<br>

<font color="black" style="font-family:Cambria; font-size:16px">`Source` : The Movie Database (TMDb)</font>
<br>
<font color="black" style="font-family:Cambria; font-size:16px">`Key Features`: Plot, cast, crew, budget, revenues, genres, keywords, production companies, original language, and vote average.</font>

<hr>

<b><font color='#067F7D' style='font-family:cambria; font-size:20px'>Model Architecture</font></b><br>

<font color="black" style="font-family:Cambria; font-size:16px"><b>Data Preprocessing</b></font>
<br>
   - <font color="black" style="font-family:Cambria; font-size:16px">`Merging Datasets`: Merged the movies and credits datasets on the title.</font>
   - <font color="black" style="font-family:Cambria; font-size:16px">`Feature Selection`: Selected relevant columns such as movie_id, title, overview, genres, keywords, cast, crew, production companies, and vote average.</font>
   - <font color="black" style="font-family:Cambria; font-size:16px">`Handling Missing Values`: Dropped rows with missing values.</font>
   - <font color="black" style="font-family:Cambria; font-size:16px">`Stemming`: Applied stemming to reduce words to their root form.</font>

<font color="black" style="font-family:Cambria; font-size:16px"><b>Model</b></font><br>
<ul>
    <li><font color="black" style="font-family:Cambria; font-size:16px">Vectorization: TF-IDF Vectorizer with a custom stemming tokenizer.</font></li>
    <li><font color="black" style="font-family:Cambria; font-size:16px">Similarity Measurement: Cosine similarity to find the closest matches to a given movie.</font></li>
</ul>

<font color="black" style="font-family:Cambria; font-size:16px"><b>Functions</b></font><br>
<ul>
    <li><font color="black" style="font-family:Cambria; font-size:16px">Data Conversion Functions: Convert JSON fields to lists of relevant values (e.g., genres, cast).</font></li>
    <li><font color="black" style="font-family:Cambria; font-size:16px">Stemming Function: Custom tokenizer function using PorterStemmer.</font></li>
    <li><font color="black" style="font-family:Cambria; font-size:16px">Recommendation Function: Recommends movies based on cosine similarity scores.</font></li>
</ul>

<hr>

<b><font color='#067F7D' style='font-family:cambria; font-size:20px'>Expected Outcome</font></b><br>

<font color="black" style="font-family:Cambria; font-size:16px">The expected outcome is a recommendation system that suggests movies similar to a given movie based on the plot, cast, crew, and other features. Users should be able to input a movie title and receive a list of recommended movies that are similar in terms of content and characteristics.</font>

<hr>

<b><font color='#067F7D' style='font-family:cambria; font-size:20px'>Model Training and Evaluation</font></b><br>

<font color="black" style="font-family:Cambria; font-size:16px"><b>Training</b></font><br>
   - <font color="black" style="font-family:Cambria; font-size:16px">`Text Processing`: Applied stemming and TF-IDF vectorization to convert text data into vectors.</font>
   - <font color="black" style="font-family:Cambria; font-size:16px">`Similarity Calculation`: Computed `cosine similarity` between all movie vectors.</font>
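
As a rough illustration of the similarity step, the sketch below computes cosine similarity between a few hand-made vectors with NumPy. The vectors and the `cosine_similarity_matrix` helper are made up for this example; in the notebook the vectors come from the TF-IDF step.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity of the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms          # scale every row to unit length
    return unit @ unit.T            # dot products of unit vectors

# Hypothetical term-weight vectors for three movies
vectors = np.array([
    [1.0, 1.0, 0.0],   # movie A
    [2.0, 2.0, 0.0],   # movie B: same direction as A
    [0.0, 0.0, 3.0],   # movie C: shares no terms with A or B
])

similarity = cosine_similarity_matrix(vectors)

# For movie A (row 0), rank the other movies from most to least similar
ranking = [i for i in np.argsort(-similarity[0]) if i != 0]

print(similarity.round(2))
print(ranking)
```

Cosine similarity ignores vector length, which is why movie B scores as identical to movie A even though its weights are twice as large.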

<font color="black" style="font-family:Cambria; font-size:16px"><b>Evaluation</b></font><br>
<ul>
    <li><font color="black" style="font-family:Cambria; font-size:16px">Assess model performance using metrics like accuracy, precision, recall, and F1-score on training and test datasets.</font></li>
    <li><font color="black" style="font-family:Cambria; font-size:16px">Fine-tuning model parameters to optimize performance.</font></li>
</ul>

<hr>


In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
movies=pd.read_csv('tmdb_5000_movies.csv')
credits=pd.read_csv('tmdb_5000_credits.csv')

In [3]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [5]:
print(f'The shape of the movies data set is {movies.shape}')
print(f'The shape of the Credits data set is {credits.shape}')

The shape of the movies data set is (4803, 20)
The shape of the Credits data set is (4803, 4)


In [6]:
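# merge the two data sets on the title column
# titles are not unique, so the merged frame has more rows than either input (see the shape below)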
movies=movies.merge(credits,on='title')

In [7]:
print(f'The shape of the movies data set is {movies.shape}')

The shape of the movies data set is (4809, 23)
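
The row count grows from 4803 to 4809 because `title` is not unique: rows that share a title are matched with each other. The sketch below reproduces the effect on made-up frames and shows one possible alternative, joining on the numeric id. The toy data is an assumption for illustration only.

```python
import pandas as pd

# Toy frames: two different films share the title "Batman"
movies_toy = pd.DataFrame({
    'id': [1, 2, 3],
    'title': ['Batman', 'Batman', 'Avatar'],
})
credits_toy = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Batman', 'Batman', 'Avatar'],
    'cast': ['cast of film 1', 'cast of film 2', 'cast of film 3'],
})

# Joining on title pairs every "Batman" row with every other "Batman" row
by_title = movies_toy.merge(credits_toy, on='title')

# Joining on the id keeps exactly one row per film
by_id = movies_toy.merge(
    credits_toy.drop(columns='title'), left_on='id', right_on='movie_id'
)

print(by_title.shape)
print(by_id.shape)
```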


In [8]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [9]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

In [10]:
movies[movies['original_title']=='Veer-Zaara']['cast']

3379    [{"cast_id": 11, "character": "Veer Pratap Sin...
Name: cast, dtype: object

In [11]:
movies.iloc[3379]['cast']

'[{"cast_id": 11, "character": "Veer Pratap Singh", "credit_id": "52fe43b6c3a36847f8069abb", "gender": 2, "id": 35742, "name": "Shah Rukh Khan", "order": 0}, {"cast_id": 13, "character": "Zaara Hayaat Khan", "credit_id": "52fe43b6c3a36847f8069ac3", "gender": 1, "id": 35745, "name": "Preity Zinta", "order": 1}, {"cast_id": 12, "character": "Saamiya Siddiqui", "credit_id": "52fe43b6c3a36847f8069abf", "gender": 1, "id": 35776, "name": "Rani Mukerji", "order": 2}, {"cast_id": 18, "character": "Chaudhary Sumer Singh", "credit_id": "52fe43b6c3a36847f8069ad7", "gender": 2, "id": 35780, "name": "Amitabh Bachchan", "order": 3}, {"cast_id": 19, "character": "Maati", "credit_id": "52fe43b6c3a36847f8069adb", "gender": 1, "id": 35781, "name": "Hema Malini", "order": 4}, {"cast_id": 15, "character": "Shabbo", "credit_id": "52fe43b6c3a36847f8069acb", "gender": 1, "id": 35778, "name": "Divya Dutta", "order": 5}, {"cast_id": 17, "character": "Zakir Ahmed", "credit_id": "52fe43b6c3a36847f8069ad3", "gend

In [12]:
movies[movies['original_language']=='hi'].head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
1026,0,"[{""id"": 18, ""name"": ""Drama""}]",,7504,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",hi,1947: Earth,It's 1947 and the borderlines between India an...,1.246883,"[{""name"": ""Cracking the Earth Films"", ""id"": 22...",...,101.0,"[{""iso_639_1"": ""hi"", ""name"": ""\u0939\u093f\u09...",Released,,Earth,6.6,9,7504,"[{""cast_id"": 2, ""character"": ""Dil Navaz"", ""cre...","[{""credit_id"": ""52fe4480c3a36847f8099e81"", ""de..."
2967,7400000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 10749, ""n...",,14395,"[{""id"": 596, ""name"": ""adultery""}, {""id"": 34094...",hi,कभी अलविदा ना कहना,Dev and Maya are both married to different peo...,3.246903,"[{""name"": ""Yash Raj Films"", ""id"": 1569}, {""nam...",...,193.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,A Love.... That Broke All Relationships,Kabhi Alvida Naa Kehna,6.1,42,14395,"[{""cast_id"": 1, ""character"": ""Dev Saran"", ""cre...","[{""credit_id"": ""58a5e57892514152a70006bf"", ""de..."


In [13]:
movies[movies['original_language']=='hi'].original_title

1026                           1947: Earth
2967                    कभी अलविदा ना कहना
2976                             Housefull
3094                                   कृष
3162                      Jab Tak Hai Jaan
3233               Yeh Jawaani Hai Deewani
3332                         Ta Ra Rum Pum
3379                            Veer-Zaara
3548                       Rang De Basanti
3558                         Dum Maaro Dum
3650                     Gandhi, My Father
3729                               Airlift
4173                               एबीसीडी
4205                                 Dabba
4217                         दिल जो भी कहे
4301                       Monsoon Wedding
4337    Rocket Singh: Salesman of the Year
4377                                  Fiza
4518                     Faith Connections
Name: original_title, dtype: object

In [14]:
movies[movies['title']=='Iron Man']

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
68,140000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 878, ""na...",http://www.ironmanmovie.com/,1726,"[{""id"": 539, ""name"": ""middle east""}, {""id"": 61...",en,Iron Man,"After being held captive in an Afghan cave, bi...",120.725053,"[{""name"": ""Marvel Studios"", ""id"": 420}]",...,126.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Heroes aren't born. They're built.,Iron Man,7.4,8776,1726,"[{""cast_id"": 19, ""character"": ""Tony Stark / Ir...","[{""credit_id"": ""52fe4311c3a36847f8037f21"", ""de..."


In [15]:
# Candidate columns for the recommender:
# genres
# id
# keywords
# title
# overview
# cast
# crew
# production_companies
# original_language
# vote_average



In [16]:
movies=movies[['movie_id','title','overview','genres','keywords','cast','crew','production_companies','vote_average']]

In [17]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,production_companies,vote_average
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...",7.2
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",6.9
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...","[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",6.3
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...","[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",7.6
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",6.1


In [18]:
# missing data
movies.isnull().sum()

movie_id                0
title                   0
overview                3
genres                  0
keywords                0
cast                    0
crew                    0
production_companies    0
vote_average            0
dtype: int64

In [19]:
movies.dropna(inplace=True)

In [20]:
# check whether any duplicate rows are present
movies.duplicated().sum()

0

In [21]:
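# Define rating bins and their labels
bins = [0, 3, 6, 8, 10]
labels = ['low', 'medium', 'good', 'excellent']

# Replace the numeric vote_average with its rating category.
# include_lowest=True keeps a rating of exactly 0 in the 'low' bin instead of turning it into NaN.
movies['vote_average'] = pd.cut(movies['vote_average'], bins=bins, labels=labels, include_lowest=True)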
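
The binning can be checked on a few hand-picked ratings. `pd.cut` uses right-closed intervals by default, so a rating of exactly 0 falls outside the first bin unless `include_lowest=True` is passed. The ratings below are made up to sit on and between the bin edges.

```python
import pandas as pd

bins = [0, 3, 6, 8, 10]
labels = ['low', 'medium', 'good', 'excellent']

# Hypothetical ratings chosen to sit on and between the bin edges
ratings = pd.Series([0.0, 3.0, 6.1, 7.2, 8.5])

# Default: intervals are (0, 3], (3, 6], (6, 8], (8, 10]
default_cut = pd.cut(ratings, bins=bins, labels=labels)

# include_lowest=True also admits the left edge of the first interval
closed_cut = pd.cut(ratings, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(closed_cut.tolist())
```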


In [22]:
movies.iloc[1].genres
    

'[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}]'

In [23]:
# input:  '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}]'
# target: ['Adventure', 'Fantasy', 'Action']

In [24]:
import ast

<b><font color='#067F7D' style='font-family:cambria; font-size:20px'>`ast` Module</font></b><br>

 - <font color="brown" style="font-family:Cambria; font-size:16px">In Python, the `ast` module is used for parsing and manipulating abstract syntax trees (ASTs). An AST is a structured representation of code that breaks it down into expressions, statements, functions, classes, and more.</font>
<br>

 - <font color="brown" style="font-family:Cambria; font-size:16px">`ast.literal_eval` is a function in the `ast` module. It evaluates a string containing a Python literal (strings, numbers, tuples, lists, dictionaries, booleans, and `None`) and returns the corresponding Python object. Unlike `eval()`, it does not execute arbitrary code, which makes it much safer to use on input strings you do not control.</font>
<br>

 - <font color="brown" style="font-family:Cambria; font-size:16px">Here, `ast.literal_eval` parses the string `'[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}]'` and returns a Python list of dictionaries, `[{'id': 12, 'name': 'Adventure'}, ...]`, from which the `name` values can be read.</font>
<br>
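
A short sketch of the behaviour described above, using only the standard library. The sample string is shortened from the genres field. Because this string is also valid JSON, `json.loads` would parse it too; that is an observation about this sample, not something the notebook relies on.

```python
import ast
import json

raw = '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}]'

# The string becomes a real list of dictionaries
parsed = ast.literal_eval(raw)
names = [item['name'] for item in parsed]
print(names)

# The same string also parses as JSON, with an equal result
same_as_json = json.loads(raw) == parsed
print(same_as_json)

# literal_eval refuses anything that is not a plain literal, such as a function call
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```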



In [25]:
def convert(obj):
    L=[]
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L

In [26]:
convert('[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}]')

['Adventure', 'Fantasy', 'Action']

In [27]:
movies['genres']=movies['genres'].apply(convert)

In [28]:
movies['keywords']=movies['keywords'].apply(convert)

In [29]:
movies['production_companies']=movies['production_companies'].apply(convert)

In [31]:
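# from the cast column we take only the first 5 dictionaries
# note: iloc is positional, so after dropna() iloc[3379] is no longer the row labelled 3379 (Veer-Zaara); use .loc[3379] for that row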
counter=0
for i in ast.literal_eval(movies.iloc[3379]['cast']):
    if counter !=5:
        print(i['name'])
        counter=counter+1
    else:
        break
        

Chiwetel Ejiofor
Tim Allen
Alice Braga
Jose Pablo Cantillo
Randy Couture


In [32]:
def convert3(obj):
    L=[]
    counter=0
    for i in ast.literal_eval(obj):
        if counter!=5:
            L.append(i['name'])
            counter=counter+1
        else:
            break
    return L
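
The counter loop can also be written with a slice. A minimal equivalent sketch follows; the name `top_cast` and its `limit` parameter are hypothetical and not part of the notebook.

```python
import ast

def top_cast(obj, limit=5):
    """Return the names of the first `limit` entries of a stringified list of dicts."""
    return [member['name'] for member in ast.literal_eval(obj)[:limit]]

# Made-up cast string with eight entries, in the same shape as the cast column
sample = str([{'name': f'Actor {n}', 'order': n} for n in range(8)])

print(top_cast(sample))
```

Slicing stops at the end of the list on its own, so films with fewer than five cast entries need no special case.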

In [33]:
movies['cast']=movies['cast'].apply(convert3)

In [34]:
movies.head(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,production_companies,vote_average
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...","[Ingenious Film Partners, Twentieth Century Fo...",good
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...","[Walt Disney Pictures, Jerry Bruckheimer Films...",good
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...","[Columbia Pictures, Danjaq, B24]",good


In [35]:
for i in ast.literal_eval(movies['crew'][0]):
    if i['job']=='Director':
        print(i['name'])

James Cameron


In [36]:
# extract only the director:
# keep the names of crew entries whose job is 'Director'
def fetch_director(obj):
    L=[]
    for i in ast.literal_eval(obj):
        if i['job']=='Director':
            L.append(i['name'])
    return L

In [37]:
movies['crew']=movies['crew'].apply(fetch_director)

In [38]:
movies.head(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,production_companies,vote_average
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron],"[Ingenious Film Partners, Twentieth Century Fo...",good
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski],"[Walt Disney Pictures, Jerry Bruckheimer Films...",good
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",[Sam Mendes],"[Columbia Pictures, Danjaq, B24]",good


In [39]:
# the overview column holds plain strings; the next cell splits each one into a list of words
movies['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [40]:
movies['overview']= movies['overview'].apply(lambda x:x.split())



<font color="brown" style="font-family:Cambria; font-size:16px">Next, we remove the spaces inside multi-word names and phrases, so that `Sam Worthington` becomes `SamWorthington`. If the space is kept, the name is split into two separate tokens, `sam` and `worthington`. A search for `Sam Mendes` movies would then also match every other `Sam`, because the model cannot tell which `sam` is meant. Joining the words makes each person, genre, keyword, and production company a single distinct token.</font>
<br>
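
The effect is easy to see on two director names. The sketch below uses plain whitespace splitting as a stand-in for the vectorizer's tokenizer, which is a simplification.

```python
directors = ['Sam Mendes', 'Sam Raimi']

# With spaces kept, both directors contribute the token "sam"
with_spaces = [set(name.lower().split()) for name in directors]
shared_with_spaces = with_spaces[0] & with_spaces[1]

# With spaces removed, each director becomes one distinct token
collapsed = [{name.replace(' ', '').lower()} for name in directors]
shared_collapsed = collapsed[0] & collapsed[1]

print(shared_with_spaces)   # tokens the two directors have in common
print(shared_collapsed)
```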




In [41]:
movies['genres']=movies['genres'].apply(lambda x: [i.replace(' ','')for i in x])
movies['keywords']=movies['keywords'].apply(lambda x: [i.replace(' ','')for i in x])
movies['cast']=movies['cast'].apply(lambda x: [i.replace(' ','')for i in x])
movies['crew']=movies['crew'].apply(lambda x: [i.replace(' ','')for i in x])
movies['production_companies']=movies['production_companies'].apply(lambda x: [i.replace(' ','')for i in x])

In [42]:
movies.head(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,production_companies,vote_average
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron],"[IngeniousFilmPartners, TwentiethCenturyFoxFil...",good
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski],"[WaltDisneyPictures, JerryBruckheimerFilms, Se...",good
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...",[SamMendes],"[ColumbiaPictures, Danjaq, B24]",good


In [43]:
movies['vote_average']=movies['vote_average'].astype('str')

In [44]:
# vote_average is a single label; wrap it in a list so it can be concatenated with the other list columns
movies['vote_average']=movies['vote_average'].apply(lambda x:x.split())

In [45]:
# create a new column named tags by concatenating all columns except movie_id and title

movies['tags']=movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']+movies['production_companies']+movies['vote_average']

In [46]:
print(movies.iloc[0].tags)

['In', 'the', '22nd', 'century,', 'a', 'paraplegic', 'Marine', 'is', 'dispatched', 'to', 'the', 'moon', 'Pandora', 'on', 'a', 'unique', 'mission,', 'but', 'becomes', 'torn', 'between', 'following', 'orders', 'and', 'protecting', 'an', 'alien', 'civilization.', 'Action', 'Adventure', 'Fantasy', 'ScienceFiction', 'cultureclash', 'future', 'spacewar', 'spacecolony', 'society', 'spacetravel', 'futuristic', 'romance', 'space', 'alien', 'tribe', 'alienplanet', 'cgi', 'marine', 'soldier', 'battle', 'loveaffair', 'antiwar', 'powerrelations', 'mindandsoul', '3d', 'SamWorthington', 'ZoeSaldana', 'SigourneyWeaver', 'StephenLang', 'MichelleRodriguez', 'JamesCameron', 'IngeniousFilmPartners', 'TwentiethCenturyFoxFilmCorporation', 'DuneEntertainment', 'LightstormEntertainment', 'good']


In [47]:
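# create a new data frame with only three columns: movie_id, title and tags
# .copy() makes new_df independent of movies, so the later assignments do not trigger SettingWithCopyWarning
new_df=movies[['movie_id','title','tags']].copy()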


In [48]:
new_df['tags']=new_df['tags'].apply(lambda x:' '.join(x))

In [49]:
new_df.head(3)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...


In [50]:
new_df['tags'][3379]

"The story of the love between Veer Pratap Singh, an Indian, and Zaara Hayaat Khan, a Pakistani...a love so great it knows no boundaries... Drama Romance loveofone'slife pilot classsociety pakistan india kashmirconflict ShahRukhKhan PreityZinta RaniMukerji AmitabhBachchan HemaMalini YashChopra YashRajFilms good"

In [51]:
movies[new_df['title']=='Veer-Zaara']

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,production_companies,vote_average,tags
3379,4251,Veer-Zaara,"[The, story, of, the, love, between, Veer, Pra...","[Drama, Romance]","[loveofone'slife, pilot, classsociety, pakista...","[ShahRukhKhan, PreityZinta, RaniMukerji, Amita...",[YashChopra],[YashRajFilms],[good],"[The, story, of, the, love, between, Veer, Pra..."


In [52]:
# convert tags to lower case
new_df['tags']=new_df['tags'].apply(lambda x : x.lower())

In [53]:
new_df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."
...,...,...,...
4804,9367,El Mariachi,el mariachi just wants to play his guitar and ...
4805,72766,Newlyweds,a newlywed couple's honeymoon is upended by th...
4806,231617,"Signed, Sealed, Delivered","""signed, sealed, delivered"" introduces a dedic..."
4807,126186,Shanghai Calling,when ambitious new york attorney sam is sent t...


#### Using TF-IDF Vectorization

In [54]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
# tf=TfidfVectorizer(max_features=4500,stop_words='english')

In [55]:
# cv=CountVectorizer(max_features=5000,stop_words='english')

In [56]:
# cv.fit_transform(new_df['tags']).toarray()

In [57]:
# print('the shape is ',cv.fit_transform(new_df['tags']).toarray().shape)
# print('the shape is ',tf.fit_transform(new_df['tags']).toarray().shape)

In [58]:
# vectors=cv.fit_transform(new_df['tags']).toarray()
# vectors=tf.fit_transform(new_df['tags']).toarray()


In [59]:
# vectors

In [60]:
# print(tf.get_feature_names_out()[170:250])

<b><font color='#067F7D' style='font-family:cambria; font-size:20px'>Stemming</font></b><br>

 - <font color="brown" style="font-family:Cambria; font-size:16px">`Stemming` is a process in natural language processing (NLP) and text mining in which the inflected forms of a word are reduced to a common base or root form. The goal is to group different forms of a word so they can be analyzed as a single item. This is especially useful in information retrieval and text processing tasks, where variations of a word (e.g., "running", "runs") should be treated as the same word ("run"). Rule-based stemmers only strip suffixes, so an irregular form such as "ran" is left unchanged, and a stem need not be a dictionary word (e.g., "story" becomes "stori").</font><br>
 
   - <font color="brown" style="font-family:Cambria; font-size:16px">Here's an example to illustrate stemming:</font>
   - <font color="brown" style="font-family:Cambria; font-size:16px">`Words`: "running", "runs" `Stemmed form`: "run"</font>
<br>
<font color="brown" style="font-family:Cambria; font-size:16px">Common Stemming Algorithms:</font>


 - <font color="brown" style="font-family:Cambria; font-size:16px">`Porter Stemmer`: One of the oldest and most widely used stemming algorithms, developed by Martin Porter in 1980. It uses a set of rules to remove common suffixes from English words.</font>
<br>

  - <font color="brown" style="font-family:Cambria; font-size:16px">`Snowball Stemmer`: An improvement over the Porter Stemmer, also developed by Martin Porter. It offers more flexibility and better performance.</font>
<br>

  - <font color="brown" style="font-family:Cambria; font-size:16px">`Lancaster Stemmer`: A more aggressive stemming algorithm that sometimes results in stems that are not actual words. It's faster but can be less accurate due to its aggressive nature.</font>
<br>
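
To compare the three stemmers on the same words, something like the following can be run. It assumes NLTK is installed; the word list is arbitrary, and no outputs are listed here because they differ between algorithms.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    'porter': PorterStemmer(),
    'snowball': SnowballStemmer('english'),
    'lancaster': LancasterStemmer(),
}

# Arbitrary sample words, including one irregular verb form
words = ['running', 'stories', 'romance', 'ran']

# One row per word, one stem per algorithm
stems = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

for word, row in stems.items():
    print(word, row)
```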



In [55]:
from nltk.stem.porter import PorterStemmer

In [56]:
ps=PorterStemmer()

In [None]:
# write a function that iterates over the words and applies the Porter stemmer
# def stem(text):
#     y=[]
#     for i in text.split():
#         y.append(ps.stem(i))
#     return ' '.join(y)

In [60]:
import nltk
import re

In [61]:
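def stemming_tokenizer(text):
    # Lowercase so that tokens match regardless of case
    text = text.lower()
    
    # Remove punctuation. Characters are deleted, not replaced by a space,
    # so "pakistani...a" collapses into the single token "pakistania".
    text = re.sub(r'[^\w\s]', '', text)
    # word_tokenize requires NLTK's Punkt tokenizer data to be downloaded
    tokens = nltk.word_tokenize(text)
    # Reduce every token to its Porter stem
    stemmed_tokens = [ps.stem(token) for token in tokens]
    return stemmed_tokens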

#### See how stemming works: compare the input text with the output below, e.g. `story` becomes `stori` and `romance` becomes `romanc`
"The story of the love between Veer Pratap Singh, an Indian, and Zaara Hayaat Khan, a Pakistani...a love so great it knows no boundaries... Drama Romance loveofone'slife pilot classsociety pakistan india kashmirconflict ShahRukhKhan PreityZinta RaniMukerji YashChopra"

In [62]:
stemming_tokenizer("The story of the love between Veer Pratap Singh, an Indian, and Zaara Hayaat Khan, a Pakistani...a love so great it knows no boundaries... Drama Romance loveofone'slife pilot classsociety pakistan india kashmirconflict ShahRukhKhan PreityZinta RaniMukerji AmitabhBachchan HemaMalini YashChopra YashRajFilms good")

['the',
 'stori',
 'of',
 'the',
 'love',
 'between',
 'veer',
 'pratap',
 'singh',
 'an',
 'indian',
 'and',
 'zaara',
 'hayaat',
 'khan',
 'a',
 'pakistania',
 'love',
 'so',
 'great',
 'it',
 'know',
 'no',
 'boundari',
 'drama',
 'romanc',
 'loveofoneslif',
 'pilot',
 'classsocieti',
 'pakistan',
 'india',
 'kashmirconflict',
 'shahrukhkhan',
 'preityzinta',
 'ranimukerji',
 'amitabhbachchan',
 'hemamalini',
 'yashchopra',
 'yashrajfilm',
 'good']
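
Note the token `pakistania` in the output: the regular expression deletes the `...` between "Pakistani" and "a", so the two words fuse. One possible variation, shown here without NLTK and with a hypothetical helper name, replaces punctuation with a space instead.

```python
import re

def strip_punctuation(text, replacement):
    """Lowercase `text`, substitute every punctuation character, and split on whitespace."""
    return re.sub(r'[^\w\s]', replacement, text.lower()).split()

sample = 'Zaara Hayaat Khan, a Pakistani...a love so great'

fused = strip_punctuation(sample, '')       # punctuation deleted, as in the tokenizer above
separated = strip_punctuation(sample, ' ')  # punctuation replaced by a space

print(fused)
print(separated)
```

Replacing with a space also splits a tag such as `loveofone'slife` at the apostrophe, so whether this is an improvement depends on the data.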

In [63]:
# now applying it to the whole column
# new_df['tags']=new_df['tags'].apply(stem)

<font color="black" style="font-family:Cambria; font-size:20px">TF-IDF (Term Frequency-Inverse Document Frequency) Vectorization:</font>
- <font color="brown" style="font-family:Cambria; font-size:16px">`TF (Term Frequency)`: TF measures how frequently a term (word) appears in a document relative to the total number of words in that document. It reflects how important a term is to a document.</font>
  - <font color="brown" style="font-family:Cambria; font-size:16px">TF counts how often a word appears in a specific book compared to the total number of words in that book.</font>
<br>
  - <font color="brown" style="font-family:Cambria; font-size:16px">`Example`: In a book about dragons, the word "dragon" appears 10 times out of 100 words. So, TF for "dragon" in that book is 10/100 = 0.1.</font>
 <br>

- <font color="brown" style="font-family:Cambria; font-size:16px"><b>`IDF (Inverse Document Frequency)`:</b> IDF measures how rare a term is across the entire corpus (collection of documents). Terms that occur in many documents are less informative and receive a lower weight.</font>
  - <font color="brown" style="font-family:Cambria; font-size:16px">IDF measures how unusual a word is across all the books in your library.</font>
<br>
  - <font color="brown" style="font-family:Cambria; font-size:16px">`Example`: If the word "dragon" appears in all five books in the library, its IDF is low because it does not help tell the books apart. A word that appears in only one of the five books gets a higher IDF.</font>
 <br>

<font color="brown" style="font-family:Cambria; font-size:16px"><b>`TF-IDF Vectorization`:</b> TF-IDF combines TF and IDF to transform text documents into numerical vectors. Each document is represented by a vector where each dimension corresponds to a term, and the value represents the TF-IDF score of that term in the document.</font>
    <ul>
      <li><font color="black" style="font-family:Cambria; font-size:16px">Tokenization: Breaking text into individual words or terms (tokens).</font></li>
      <li><font color="black" style="font-family:Cambria; font-size:16px">Counting Term Frequencies: Counting the occurrences of each term in each document.</font></li>
      <li><font color="black" style="font-family:Cambria; font-size:16px">Calculating TF: Calculating the TF for each term in each document.</font></li>
      <li><font color="black" style="font-family:Cambria; font-size:16px">Calculating IDF: Calculating the IDF for each term across the entire corpus.</font></li>
      <li><font color="black" style="font-family:Cambria; font-size:16px">Calculating TF-IDF: Multiplying TF by IDF to get the TF-IDF score for each term in each document.</font></li>
      <li><font color="black" style="font-family:Cambria; font-size:16px">Normalization: Optionally, normalizing the TF-IDF vectors to unit length.</font></li>
    </ul>
  </li>
</ul>
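
<font color="black" style="font-family:Cambria; font-size:16px">The steps above can be sketched by hand on a toy corpus. This is a minimal illustration, not the notebook's pipeline: the three documents and the helper names `tf`, `idf`, and `tfidf` are made up for the example, and it uses the textbook formula log(N / df). scikit-learn's `TfidfVectorizer` by default applies a smoothed IDF and L2-normalizes each row, so its numbers will differ.</font>

```python
import math

# Toy corpus (illustrative only, not taken from the movie dataset)
docs = [
    "dragon fire dragon cave",
    "knight sword dragon",
    "knight castle king",
]
tokenized = [d.split() for d in docs]  # step 1: tokenization

def tf(term, tokens):
    # share of the document's tokens that are `term`
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # textbook form: log(N / number of documents containing the term)
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

def tfidf(term, tokens, corpus):
    # TF-IDF score of `term` in one document
    return tf(term, tokens) * idf(term, corpus)

print(tf("dragon", tokenized[0]))          # 0.5 (2 of 4 tokens)
print(round(idf("dragon", tokenized), 4))  # 0.4055, appears in 2 of 3 documents
print(round(idf("castle", tokenized), 4))  # 1.0986, appears in 1 of 3 documents
print(round(tfidf("dragon", tokenized[0], tokenized), 4))
```

<font color="black" style="font-family:Cambria; font-size:16px">The rarer word ("castle") gets the larger IDF, which is the weighting effect described above.</font>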

<font color="black" style="font-family:Cambria; font-size:16px">Use Cases:</font>

<ul>
  <li><font color="brown" style="font-family:Cambria; font-size:16px">Information Retrieval: To find relevant documents matching a user query.</font></li>
  <li><font color="brown" style="font-family:Cambria; font-size:16px">Keyword Extraction: To identify important terms in a document.</font></li>
  <li><font color="brown" style="font-family:Cambria; font-size:16px">Text Summarization: To rank sentences or paragraphs based on importance.</font></li>
  <li><font color="brown" style="font-family:Cambria; font-size:16px">Document Clustering: To group similar documents based on their content.</font></li>
</ul>

<font color="black" style="font-family:Cambria; font-size:16px">Benefits:</font>

<ul>
  <li><font color="brown" style="font-family:Cambria; font-size:16px">Importance Weighing: Highlights important terms that are frequent in a document but rare in the entire corpus.</font></li>
  <li><font color="brown" style="font-family:Cambria; font-size:16px">Controllable Dimensionality: The feature space has one dimension per vocabulary term, the same as a plain bag-of-words count matrix; options such as `max_features` cap the vocabulary size.</font></li>
  <li><font color="brown" style="font-family:Cambria; font-size:16px">Language Independence: Can be applied to any language with a suitable tokenizer.</font></li>
</ul>

<font color="brown" style="font-family:verdana; font-size:16px">In summary, TF-IDF vectorization is a powerful technique in natural language processing for representing text documents numerically while emphasizing the significance of terms based on their frequency and rarity across a corpus.</font>


In [64]:

tf=TfidfVectorizer(max_features=4500,stop_words='english',tokenizer=stemming_tokenizer)

In [65]:
# print('the shape is ',cv.fit_transform(new_df['tags']).toarray().shape)
print('the shape is ',tf.fit_transform(new_df['tags']).toarray().shape)

the shape is  (4806, 4500)


In [66]:
# vectors=cv.fit_transform(new_df['tags']).toarray()
vectors=tf.fit_transform(new_df['tags']).toarray()



In [67]:
# as you can see, words are reduced to their root form, so variants of a word do not repeat
print(tf.get_feature_names_out()[170:250])

['amandaseyfri' 'amateur' 'amaz' 'ambassador' 'ambit' 'ambiti'
 'amblinentertain' 'ambul' 'ambush' 'america' 'american' 'americanfootbal'
 'americanzoetrop' 'ami' 'amid' 'amidst' 'amnesia' 'amp' 'amus'
 'amusementpark' 'amyadam' 'amysmart' 'analyst' 'anarchiccomedi' 'ancient'
 'ancientrom' 'ancientworld' 'anderson' 'andi' 'andiemacdowel' 'andrew'
 'android' 'andygarcía' 'andyserki' 'angel' 'angelabassett' 'angelinajoli'
 'anger' 'angri' 'ani' 'anim' 'animalattack' 'animalhorror'
 'anjelicahuston' 'ann' 'anna' 'annafari' 'annakendrick' 'annapurnapictur'
 'annasophiarobb' 'annehathaway' 'annehech' 'annetteben' 'anni'
 'anniversari' 'announc' 'annual' 'anonym' 'anonymouscont' 'anoth'
 'answer' 'antholog' 'anthoni' 'anthonyanderson' 'anthonyhopkin'
 'anthonymacki' 'anthropomorph' 'antic' 'antihero' 'antisemit'
 'antoniobandera' 'antonyelchin' 'anyon' 'anyth' 'apart' 'apartheid'
 'apatowproduct' 'ape' 'apocalyps' 'appar']


<font color="black"><h5>In some cases, cosine similarity is more appropriate than Euclidean distance for measuring the similarity between vectors, especially in high-dimensional spaces or with sparse data. Here's an explanation of why cosine similarity can be a better choice in certain scenarios:</h5></font>

<font color="black"><h4>Euclidean Distance</h4></font>
<ul>
  <li><b>Definition:</b> Euclidean distance measures the straight-line distance between two points in a multi-dimensional space.</li>
  <li><b>Use Case:</b> Euclidean distance is suitable when the magnitude of the vectors matters and when the data points are dense and have meaningful geometric interpretations.</li>
  <li><b>Limitations:</b>
    <ul>
      <li>Magnitude Sensitivity: Euclidean distance is sensitive to the magnitude of the vectors. If two vectors point in the same direction but have very different magnitudes, the Euclidean distance between them can still be large.</li>
      <li>High Dimensionality: In high-dimensional spaces, Euclidean distance can become less meaningful due to the "curse of dimensionality".</li>
    </ul>
  </li>
</ul>

<font color="black"><h4>Cosine Similarity</h4></font>
<ul>
  <li><b>Definition:</b> Cosine similarity measures the cosine of the angle between two vectors. It focuses on the orientation rather than the magnitude of the vectors.</li>
  <li><b>Use Case:</b> Cosine similarity is particularly useful when the magnitude of the vectors is not important, but the direction is. It is often used in text mining, information retrieval, and document similarity.</li>
  <li><b>Advantages:</b>
    <ul>
      <li>Magnitude Invariance: Cosine similarity is invariant to the magnitude of the vectors, so a short and a long document with the same mix of terms score as similar. This makes it well suited to high-dimensional and sparse data.</li>
      <li>Normalized Space: It measures the angle between vectors, providing a measure of direction rather than distance.</li>
    </ul>
  </li>
</ul>

<font color="black"><h4>When to Use Cosine Similarity</h4></font>
<ul>
  <li><b>Text Data:</b> When comparing documents represented as TF-IDF vectors or word embeddings, where the relative weight of terms matters more than their absolute count.</li>
  <li><b>High-Dimensional Data:</b> When dealing with data in very high-dimensional spaces where Euclidean distance can become less effective.</li>
  <li><b>Sparse Data:</b> When data contains many zeros, as is often the case in text analysis or recommendation systems.</li>
</ul>

<font color="black"><p><b>Cosine similarity can be a better choice than Euclidean distance in scenarios where the direction of vectors is more important than their magnitude, such as in text mining and high-dimensional data analysis.</b></p></font>
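
<font color="black" style="font-family:Cambria; font-size:16px">A small numeric sketch of the contrast described above. The vectors `a`, `b`, and `c` and the helper functions are made up for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.</font>

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    # cosine of the angle between u and v
    return dot(u, v) / (norm(u) * norm(v))

def euclidean(u, v):
    # straight-line distance between u and v
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

a = [1.0, 2.0, 0.0]
b = [10.0, 20.0, 0.0]  # same direction as a, ten times the magnitude
c = [0.0, 0.0, 3.0]    # orthogonal to a (no shared non-zero dimension)

print(round(cosine(a, b), 3), round(euclidean(a, b), 3))  # 1.0 20.125
print(round(cosine(a, c), 3), round(euclidean(a, c), 3))  # 0.0 3.742
```

<font color="black" style="font-family:Cambria; font-size:16px">By Euclidean distance, `a` looks closer to the unrelated vector `c` than to `b`, even though `b` has exactly the same direction. Cosine similarity ranks the pairs the other way round.</font>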


In [68]:
# so we are using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

In [69]:
cosine_similarity(vectors)

array([[1.        , 0.02353933, 0.03227597, ..., 0.04786948, 0.00617681,
        0.00192822],
       [0.02353933, 1.        , 0.01468775, ..., 0.01879901, 0.        ,
        0.0081162 ],
       [0.03227597, 0.01468775, 1.        , ..., 0.02497005, 0.        ,
        0.00208823],
       ...,
       [0.04786948, 0.01879901, 0.02497005, ..., 1.        , 0.01733707,
        0.02082489],
       [0.00617681, 0.        , 0.        , ..., 0.01733707, 1.        ,
        0.01269076],
       [0.00192822, 0.0081162 , 0.00208823, ..., 0.02082489, 0.01269076,
        1.        ]])

In [70]:
# similarity of the first movie to each of the 4806 movies
# similarity of the second movie to each of the 4806 movies
# similarity of the third movie to each of the 4806 movies
# similarity of the fourth movie to each of the 4806 movies
# ....
# similarity of movie 4806 to each of the 4806 movies


print(f'THE Shape of the cosine is {cosine_similarity(vectors).shape}')

THE Shape of the cosine is (4806, 4806)


In [71]:
similarity=cosine_similarity(vectors)

In [72]:
new_df[new_df['title']=='51 Birch Street'].index[0]

4606

In [73]:
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])[1:6]

[(2409, 0.21625919227729304),
 (778, 0.1969745425941945),
 (539, 0.1728734749761968),
 (3608, 0.17260513643870948),
 (1216, 0.16982460678384623)]

In [74]:
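def recommend(movie):
    # row label of the requested title (raises IndexError if the title is absent)
    movie_index=new_df[new_df['title']==movie].index[0]
    # similarity scores between this movie and every movie in new_df;
    # assumes new_df has a default 0..n-1 index so labels match row positions
    distance=similarity[movie_index]
    # sort by score, highest first; skip the first entry (the movie itself), keep the next five
    movies_list=sorted(list(enumerate(distance)),reverse=True,key=lambda x:x[1])[1:6]
    
    for i in movies_list:
        print(new_df.iloc[i[0]].title)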

In [75]:
recommend('Iron Man')

Iron Man 2
Iron Man 3
Avengers: Age of Ultron
Ant-Man
Captain America: Civil War
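
<font color="black" style="font-family:Cambria; font-size:16px">The `[1:6]` slice in `recommend` relies on the queried movie sorting to position 0. A possible alternative, sketched here with a hypothetical helper `top_k_similar` and a made-up similarity row, is to exclude the query index explicitly and let `heapq.nlargest` pick the top scores without sorting the whole row.</font>

```python
import heapq

def top_k_similar(sim_row, query_index, k=5):
    """Return the k (index, score) pairs with the highest score, excluding the query itself."""
    candidates = ((i, s) for i, s in enumerate(sim_row) if i != query_index)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Made-up similarity row for item 0 (its similarity to itself is 1.0)
row = [1.0, 0.2, 0.7, 0.0, 0.5]
print(top_k_similar(row, query_index=0, k=3))  # [(2, 0.7), (4, 0.5), (1, 0.2)]
```

<font color="black" style="font-family:Cambria; font-size:16px">Excluding by index also behaves sensibly if another row happens to have a similarity of exactly 1.0 with the query, for example a duplicate entry.</font>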


#### To save the pickle files, uncomment the lines below

In [76]:
import pickle

In [77]:
# pickle.dump(new_df.to_dict(),open('movie_dict_copy.pk','wb'))

In [78]:
# pickle.dump(similarity,open('similarity_copy.pk','wb'))
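
<font color="black" style="font-family:Cambria; font-size:16px">A minimal sketch of the save-and-load round trip, using `with` blocks so the file handles are closed. The payload and the file name `movie_dict_demo.pk` are stand-ins for illustration; in the notebook the objects would be `new_df.to_dict()` and `similarity`.</font>

```python
import os
import pickle
import tempfile

# Stand-in payload (hypothetical); replace with new_df.to_dict() in the notebook
payload = {"title": {0: "Movie A", 1: "Movie B"}}

# Write to a temporary directory so the example leaves the working directory untouched
path = os.path.join(tempfile.mkdtemp(), "movie_dict_demo.pk")

with open(path, "wb") as fh:
    pickle.dump(payload, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == payload)  # True
```

<font color="black" style="font-family:Cambria; font-size:16px">Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.</font>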