# 📘 Business Problem Statement
## 🎬 Project Title:
Movie Recommendation System using NLP

## 🧩 Business Context:
In today’s digital streaming landscape, platforms like Netflix, Amazon Prime, and Disney+ offer vast movie libraries. However, with so many choices, users often experience decision fatigue, leading to poor content discovery and decreased engagement. Recommender systems play a crucial role in personalizing user experience, enhancing satisfaction, and increasing platform stickiness.

## 🎯 Problem Statement:
How can we help users easily discover relevant and engaging movies aligned with their interests, using available movie metadata such as genres, cast, crew, and plot keywords?

## 💡 Proposed Solution:
Build a Content-Based Recommendation System powered by Natural Language Processing (NLP) that analyzes and understands the textual metadata of movies. By computing similarities between movie descriptions, cast, crew, and keywords, the system will recommend movies that are most similar to a selected title.

## 🚀 Project Objectives:
- Develop a movie recommendation engine using movie metadata.
- Leverage NLP techniques such as text preprocessing and vectorization (CountVectorizer, TF-IDF).
- Apply cosine similarity to suggest the top N similar movies to a given title.
- Deliver fast and relevant recommendations without relying on user behavior or ratings.

## 📈 Business Impact:
- Improve user engagement by simplifying content discovery.
- Reduce user churn by delivering a more personalized experience.
- Provide a scalable, metadata-driven solution that works even in cold-start scenarios (new users or unrated movies).

### ✅ Steps Followed in the Movie Recommendation System Project
#### 1. Loading the Data
- Imported necessary libraries (pandas, numpy, ast).
- Loaded two CSVs: tmdb_5000_credits.csv and tmdb_5000_movies.csv.

#### 2. Merging Datasets
Merged the two datasets by matching id in the movies file with movie_id in the credits file, combining relevant information from both sources.

#### 3. Data Cleaning
- Removed unnecessary columns (e.g., budget, homepage, status).
- Parsed the stringified JSON fields (genres, cast, crew, keywords) using ast.literal_eval.
- Extracted the names of all genres and keywords, the first three cast members, and the director from crew.

#### 4. Text Preprocessing
- Removed spaces within multi-word names, genres, and keywords so that each becomes a single token.
- Lowercased all text.
- Applied NLP transformations with spaCy:
    - Stopword and punctuation removal
    - Lemmatization

#### 5. Feature Engineering
Created a unified tag column by combining genres, keywords, overview, cast, and director for vectorization.

#### 6. Vectorization
- Applied CountVectorizer (limited to the 2,000 most frequent terms) to convert the tag text into numerical features.
- TF-IDF and Word2Vec are listed as alternatives but are not implemented in the notebook.

#### 7. Similarity Calculation
Used cosine similarity to compute similarity between movies based on tag vectors.
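
To make the similarity step concrete, here is a minimal sketch of cosine similarity on toy count vectors. The helper `cosine` and the three-word vocabulary are illustrative assumptions and not part of the notebook, which relies on scikit-learn's `cosine_similarity` instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary: ["action", "space", "romance"]
movie_a = [1, 1, 0]   # tagged "action space"
movie_b = [1, 0, 1]   # tagged "action romance"
movie_c = [0, 0, 1]   # tagged "romance"

print(cosine(movie_a, movie_b))  # one shared tag out of two each
print(cosine(movie_a, movie_c))  # no shared tags
```

Because the score depends only on the angle between vectors, a movie with a long tag string is not favored merely for having more words.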

#### 8. Recommendation Function
Built a function, recommend_movies, that returns the top-N most similar movies for a given movie title.

In [1]:
import pandas as pd
import numpy as np
from ast import literal_eval


* Loading the Data

In [2]:
df1 = pd.read_csv(r"C:\MACHINE LEARNING\Self Project\tmdb_5000_credits1.csv")
df2 = pd.read_csv(r"C:\MACHINE LEARNING\Self Project\tmdb_5000_movies1.csv")

In [3]:
df1

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...
4798,9367,El Mariachi,"[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4799,72766,Newlyweds,"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4800,231617,"Signed, Sealed, Delivered","[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4801,126186,Shanghai Calling,"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


In [4]:
# df1.loc[0]["crew"]

In [5]:
df2.iloc[0]

budget                                                          237000000
genres                  [{"id": 28, "name": "Action"}, {"id": 12, "nam...
homepage                                      http://www.avatarmovie.com/
id                                                                  19995
keywords                [{"id": 1463, "name": "culture clash"}, {"id":...
original_language                                                      en
original_title                                                     Avatar
overview                In the 22nd century, a paraplegic Marine is di...
popularity                                                     150.437577
production_companies    [{"name": "Ingenious Film Partners", "id": 289...
production_countries    [{"iso_3166_1": "US", "name": "United States o...
release_date                                                   2009-12-10
revenue                                                        2787965087
runtime                               

In [6]:
df2.iloc[0]["overview"]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

* Merging the Data

## Strategy:
     - merge the two dataframes (based on movie id)
     - remove unwanted columns
     - extract information about cast and crew
     - make a column with keywords and genre ---> string field
     - vectorization on the string (CountVectorizer, TfidfVectorizer, Word2Vec)
     - find similarity between different movies
     - for recommendation, find the k nearest movies to the movie being watched
    

In [7]:
df = df2.merge(df1, left_on='id', right_on = "movie_id")
selected_columns = ['genres', "keywords", "original_title", "overview", "cast", "crew" ]
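
The merge above joins `id` from the movies file to `movie_id` from the credits file. As a sketch only, on toy frames with made-up rows, this is one way the join could be sanity-checked. `validate`, `indicator`, and `suffixes` are standard `DataFrame.merge` arguments; the frame names `movies` and `credits` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the two CSVs (values are made up for illustration).
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits = pd.DataFrame({"movie_id": [1, 2, 4], "title": ["A", "B", "D"]})

merged = movies.merge(
    credits,
    left_on="id",
    right_on="movie_id",
    how="outer",                       # keep unmatched rows so they can be inspected
    validate="one_to_one",             # raises MergeError if a key repeats on either side
    suffixes=("_movies", "_credits"),  # clearer than the default _x / _y
    indicator=True,                    # adds "_merge": both / left_only / right_only
)

# Rows that failed to find a partner in the other frame.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["id", "movie_id", "_merge"]])
```

If `unmatched` is empty on the real data, the default inner join used in the notebook loses no rows.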

In [8]:
df.isna().sum()

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title_x                    0
vote_average               0
vote_count                 0
movie_id                   0
title_y                    0
cast                       0
crew                       0
dtype: int64

In [9]:
df.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title_x', 'vote_average',
       'vote_count', 'movie_id', 'title_y', 'cast', 'crew'],
      dtype='object')

* Data Cleaning

In [10]:
selected_columns = ['genres', "keywords", "original_title", "overview", "cast", "crew" ]
df = df[selected_columns]
df

Unnamed: 0,genres,keywords,original_title,overview,cast,crew
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Spectre,A cryptic message from Bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",John Carter,"John Carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...,...,...
4798,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 5616, ""name"": ""united states\u2013mexi...",El Mariachi,El Mariachi just wants to play his guitar and ...,"[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4799,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",[],Newlyweds,A newlywed couple's honeymoon is upended by th...,"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4800,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...","[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...","Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...","[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4801,[],[],Shanghai Calling,When ambitious New York attorney Sam is sent t...,"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


In [11]:
[x["name"] for x in  literal_eval(df["genres"][0])]

['Action', 'Adventure', 'Fantasy', 'Science Fiction']

In [12]:
[x["name"] for x in  literal_eval(df["keywords"][0])]

['culture clash',
 'future',
 'space war',
 'space colony',
 'society',
 'space travel',
 'futuristic',
 'romance',
 'space',
 'alien',
 'tribe',
 'alien planet',
 'cgi',
 'marine',
 'soldier',
 'battle',
 'love affair',
 'anti war',
 'power relations',
 'mind and soul',
 '3d']

In [13]:
df["genres"] = df["genres"].apply(lambda x: [y["name"] for y in  literal_eval(x)])
df["keywords"] = df["keywords"].apply(lambda x: [y["name"] for y in  literal_eval(x)])

In [14]:
df["cast"] = df["cast"].apply(lambda x : [y["name"] for y in literal_eval(x)][:3])
df["crew"] = df["crew"].apply(lambda x : [y["name"] for y in  literal_eval(x) if y["job"]=="Director"])
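
The two cells above repeat the same parse-then-pick pattern four times. As a sketch only, that pattern could be factored into one helper that also tolerates missing or malformed cells. The function name `extract_names` and its parameters are hypothetical and do not appear in the notebook; the input strings are abbreviated toy examples.

```python
from ast import literal_eval

def extract_names(raw, limit=None, job=None):
    """Parse a stringified list of dicts and return the 'name' values.

    limit: keep only the first `limit` names (e.g. top-billed cast).
    job:   keep only entries whose 'job' equals this value (e.g. "Director").
    """
    try:
        items = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # malformed or non-string cell (e.g. NaN)
    if not isinstance(items, list):
        return []
    names = [
        d["name"]
        for d in items
        if isinstance(d, dict) and "name" in d
        and (job is None or d.get("job") == job)
    ]
    return names if limit is None else names[:limit]

cast_raw = ('[{"name": "Sam Worthington"}, {"name": "Zoe Saldana"}, '
            '{"name": "Sigourney Weaver"}, {"name": "Stephen Lang"}]')
crew_raw = ('[{"name": "James Cameron", "job": "Director"}, '
            '{"name": "Jon Landau", "job": "Producer"}]')

print(extract_names(cast_raw, limit=3))
print(extract_names(crew_raw, job="Director"))
```

With such a helper, a call like `df["cast"].apply(extract_names, limit=3)` would replace the inline lambda, since `Series.apply` forwards keyword arguments to the function.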

In [15]:
df

Unnamed: 0,genres,keywords,original_title,overview,cast,crew
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...",Avatar,"In the 22nd century, a paraplegic Marine is di...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...",Spectre,A cryptic message from Bond’s past sends him o...,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",The Dark Knight Rises,Following the death of District Attorney Harve...,"[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...",John Carter,"John Carter is a war-weary, former military ca...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]
...,...,...,...,...,...,...
4798,"[Action, Crime, Thriller]","[united states–mexico barrier, legs, arms, pap...",El Mariachi,El Mariachi just wants to play his guitar and ...,"[Carlos Gallardo, Jaime de Hoyos, Peter Marqua...",[Robert Rodriguez]
4799,"[Comedy, Romance]",[],Newlyweds,A newlywed couple's honeymoon is upended by th...,"[Edward Burns, Kerry Bishé, Marsha Dietlein]",[Edward Burns]
4800,"[Comedy, Drama, Romance, TV Movie]","[date, love at first sight, narration, investi...","Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...","[Eric Mabius, Kristin Booth, Crystal Lowe]",[Scott Smith]
4801,[],[],Shanghai Calling,When ambitious New York attorney Sam is sent t...,"[Daniel Henney, Eliza Coupe, Bill Paxton]",[Daniel Hsia]


In [16]:
def remove_space(entry):
    return [x.replace(" ","" ) for x in entry]
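
The notebook does not say why spaces are removed. One plausible reason is that a whitespace-based tokenizer would otherwise split multi-word names, so two unrelated people who share a first name would appear to have a tag in common. A small standard-library sketch of that effect, using names from the tables above (`shared_tokens` is a hypothetical helper):

```python
def remove_space(entry):
    # Same behavior as the notebook's helper: "Sam Mendes" -> "SamMendes"
    return [x.replace(" ", "") for x in entry]

def shared_tokens(a, b):
    """Lowercased whitespace tokens that two tag lists have in common."""
    tokens_a = set(" ".join(a).lower().split())
    tokens_b = set(" ".join(b).lower().split())
    return tokens_a & tokens_b

avatar_cast = ["Sam Worthington"]
spectre_crew = ["Sam Mendes"]

# With spaces kept, the shared first name produces a spurious match.
print(shared_tokens(avatar_cast, spectre_crew))
# With spaces removed, each full name is a single distinct token.
print(shared_tokens(remove_space(avatar_cast), remove_space(spectre_crew)))
```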

In [17]:
df["genres"] = df["genres"].apply(remove_space)
df["keywords"] = df["keywords"].apply(remove_space)
df["cast"] = df["cast"].apply(remove_space)
df["crew"] = df["crew"].apply(remove_space)

In [18]:
df

Unnamed: 0,genres,keywords,original_title,overview,cast,crew
0,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...",Avatar,"In the 22nd century, a paraplegic Marine is di...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...",Spectre,A cryptic message from Bond’s past sends him o...,"[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...",The Dark Knight Rises,Following the death of District Attorney Harve...,"[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...",John Carter,"John Carter is a war-weary, former military ca...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]
...,...,...,...,...,...,...
4798,"[Action, Crime, Thriller]","[unitedstates–mexicobarrier, legs, arms, paper...",El Mariachi,El Mariachi just wants to play his guitar and ...,"[CarlosGallardo, JaimedeHoyos, PeterMarquardt]",[RobertRodriguez]
4799,"[Comedy, Romance]",[],Newlyweds,A newlywed couple's honeymoon is upended by th...,"[EdwardBurns, KerryBishé, MarshaDietlein]",[EdwardBurns]
4800,"[Comedy, Drama, Romance, TVMovie]","[date, loveatfirstsight, narration, investigat...","Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...","[EricMabius, KristinBooth, CrystalLowe]",[ScottSmith]
4801,[],[],Shanghai Calling,When ambitious New York attorney Sam is sent t...,"[DanielHenney, ElizaCoupe, BillPaxton]",[DanielHsia]


In [19]:
df = df.dropna()

In [20]:
df["overview"] = df["overview"].apply(lambda x: x.split())
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["overview"] = df["overview"].apply(lambda x: x.split())


Unnamed: 0,genres,keywords,original_title,overview,cast,crew
0,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...",Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...",Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...",Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...",The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...",John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]
...,...,...,...,...,...,...
4798,"[Action, Crime, Thriller]","[unitedstates–mexicobarrier, legs, arms, paper...",El Mariachi,"[El, Mariachi, just, wants, to, play, his, gui...","[CarlosGallardo, JaimedeHoyos, PeterMarquardt]",[RobertRodriguez]
4799,"[Comedy, Romance]",[],Newlyweds,"[A, newlywed, couple's, honeymoon, is, upended...","[EdwardBurns, KerryBishé, MarshaDietlein]",[EdwardBurns]
4800,"[Comedy, Drama, Romance, TVMovie]","[date, loveatfirstsight, narration, investigat...","Signed, Sealed, Delivered","[""Signed,, Sealed,, Delivered"", introduces, a,...","[EricMabius, KristinBooth, CrystalLowe]",[ScottSmith]
4801,[],[],Shanghai Calling,"[When, ambitious, New, York, attorney, Sam, is...","[DanielHenney, ElizaCoupe, BillPaxton]",[DanielHsia]


In [21]:
df["tag"] = df["genres"]+df["keywords"] + df["overview"] +  df["cast"]+ df["crew"]
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["tag"] = df["genres"]+df["keywords"] + df["overview"] +  df["cast"]+ df["crew"]


Unnamed: 0,genres,keywords,original_title,overview,cast,crew,tag
0,"[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...",Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[Action, Adventure, Fantasy, ScienceFiction, c..."
1,"[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...",Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Adventure, Fantasy, Action, ocean, drugabuse,..."
2,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...",Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes],"[Action, Adventure, Crime, spy, basedonnovel, ..."
3,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...",The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan],"[Action, Crime, Drama, Thriller, dccomics, cri..."
4,"[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...",John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton],"[Action, Adventure, ScienceFiction, basedonnov..."
...,...,...,...,...,...,...,...
4798,"[Action, Crime, Thriller]","[unitedstates–mexicobarrier, legs, arms, paper...",El Mariachi,"[El, Mariachi, just, wants, to, play, his, gui...","[CarlosGallardo, JaimedeHoyos, PeterMarquardt]",[RobertRodriguez],"[Action, Crime, Thriller, unitedstates–mexicob..."
4799,"[Comedy, Romance]",[],Newlyweds,"[A, newlywed, couple's, honeymoon, is, upended...","[EdwardBurns, KerryBishé, MarshaDietlein]",[EdwardBurns],"[Comedy, Romance, A, newlywed, couple's, honey..."
4800,"[Comedy, Drama, Romance, TVMovie]","[date, loveatfirstsight, narration, investigat...","Signed, Sealed, Delivered","[""Signed,, Sealed,, Delivered"", introduces, a,...","[EricMabius, KristinBooth, CrystalLowe]",[ScottSmith],"[Comedy, Drama, Romance, TVMovie, date, loveat..."
4801,[],[],Shanghai Calling,"[When, ambitious, New, York, attorney, Sam, is...","[DanielHenney, ElizaCoupe, BillPaxton]",[DanielHsia],"[When, ambitious, New, York, attorney, Sam, is..."


In [22]:
df = df[['original_title', "tag"]]


In [23]:
df["tag"] = df["tag"].apply(lambda x: " ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["tag"] = df["tag"].apply(lambda x: " ".join(x))


In [24]:
df

Unnamed: 0,original_title,tag
0,Avatar,Action Adventure Fantasy ScienceFiction cultur...
1,Pirates of the Caribbean: At World's End,Adventure Fantasy Action ocean drugabuse exoti...
2,Spectre,Action Adventure Crime spy basedonnovel secret...
3,The Dark Knight Rises,Action Crime Drama Thriller dccomics crimefigh...
4,John Carter,Action Adventure ScienceFiction basedonnovel m...
...,...,...
4798,El Mariachi,Action Crime Thriller unitedstates–mexicobarri...
4799,Newlyweds,Comedy Romance A newlywed couple's honeymoon i...
4800,"Signed, Sealed, Delivered",Comedy Drama Romance TVMovie date loveatfirsts...
4801,Shanghai Calling,When ambitious New York attorney Sam is sent t...


In [25]:
df['tag'][0]

'Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

    - Clean the data
        - lowercase the text
        - remove stop words
        - lemmatization
        - any other step

In [26]:
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

https://nalepae.github.io/pandarallel/troubleshooting/


In [27]:
import spacy
nlp = spacy.load("en_core_web_lg")


In [28]:
nlp

<spacy.lang.en.English at 0x19b73e37dd0>

In [29]:
def clean_text(text):
    doc = nlp(text)
    tokens = []  # renamed so the list no longer shadows the function name
    for token in doc:
        # Assignment : Add filters for digits, special characters, punctuations etc
        if not token.is_stop and not token.is_punct and not token.is_space:
            tokens.append(token.lemma_.lower())  # lemmatize and convert to lower case
    return " ".join(tokens)

In [30]:
df["tag"] =  df["tag"].apply(clean_text)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["tag"] =  df["tag"].apply(clean_text)


In [31]:
df

Unnamed: 0,original_title,tag
0,Avatar,action adventure fantasy sciencefiction cultur...
1,Pirates of the Caribbean: At World's End,adventure fantasy action ocean drugabuse exoti...
2,Spectre,action adventure crime spy basedonnovel secret...
3,The Dark Knight Rises,action crime drama thriller dccomic crimefight...
4,John Carter,action adventure sciencefiction basedonnovel m...
...,...,...
4798,El Mariachi,action crime thriller unitedstate mexicobarrie...
4799,Newlyweds,comedy romance newlywed couple honeymoon upend...
4800,"Signed, Sealed, Delivered",comedy drama romance tvmovie date loveatfirsts...
4801,Shanghai Calling,ambitious new york attorney sam send shanghai ...


In [32]:
df["tag"][0]

'action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelation mindandsoul 3d 22nd century paraplegic marine dispatch moon pandora unique mission tear follow order protect alien civilization samworthington zoesaldana sigourneyweaver jamescameron'

## Vectorization
    - Count Vectorizer
    - TF-IDF Vectorizer
    - Word2Vec

In [33]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [34]:
vector_array = CountVectorizer(max_features=2000).fit_transform(df["tag"]).toarray()

In [35]:
df.shape

(4800, 2)

In [36]:
vector_array.shape

(4800, 2000)

In [37]:
vector_array

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
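
The list above names TF-IDF as an alternative, but the notebook only runs `CountVectorizer`. Here is a minimal sketch of the TF-IDF variant on a three-document toy corpus; the corpus and variable names are made up for illustration. It also keeps the matrix sparse instead of calling `.toarray()`, which `cosine_similarity` accepts directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for df["tag"]; the real column holds one cleaned string per movie.
toy_tags = [
    "action adventure spy secretagent",
    "action adventure spy mi6",
    "comedy romance newlywed honeymoon",
]

# TF-IDF down-weights tokens that occur in many documents (e.g. "action").
tfidf = TfidfVectorizer(max_features=2000)
tag_matrix = tfidf.fit_transform(toy_tags)   # scipy sparse matrix, rows L2-normalized

similarity = cosine_similarity(tag_matrix)   # dense (3, 3) array
print(similarity.round(3))
```

Whether TF-IDF gives better recommendations than raw counts on this dataset is an open question that the notebook does not evaluate.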

## Build a Similarity Matrix
    - 4800 x 4800 (one row and one column per movie)

In [38]:
from sklearn.metrics.pairwise import cosine_similarity

In [39]:
m_m_similarity = cosine_similarity(vector_array)
m_m_similarity

array([[1.        , 0.11420805, 0.10540926, ..., 0.05326236, 0.03131121,
        0.        ],
       [0.11420805, 1.        , 0.08025724, ..., 0.06082991, 0.        ,
        0.        ],
       [0.10540926, 0.08025724, 1.        , ..., 0.05614346, 0.03300492,
        0.        ],
       ...,
       [0.05326236, 0.06082991, 0.05614346, ..., 1.        , 0.07504692,
        0.05239625],
       [0.03131121, 0.        , 0.03300492, ..., 0.07504692, 1.        ,
        0.06160411],
       [0.        , 0.        , 0.        , ..., 0.05239625, 0.06160411,
        1.        ]])

In [40]:
m_m_similarity.shape

(4800, 4800)

In [41]:
df["original_title"]

0                                         Avatar
1       Pirates of the Caribbean: At World's End
2                                        Spectre
3                          The Dark Knight Rises
4                                    John Carter
                          ...                   
4798                                 El Mariachi
4799                                   Newlyweds
4800                   Signed, Sealed, Delivered
4801                            Shanghai Calling
4802                           My Date with Drew
Name: original_title, Length: 4800, dtype: object

In [42]:
m_m_similarity_df = pd.DataFrame(m_m_similarity, index=df["original_title"], columns=df["original_title"])
m_m_similarity_df

original_title,Avatar,Pirates of the Caribbean: At World's End,Spectre,The Dark Knight Rises,John Carter,Spider-Man 3,Tangled,Avengers: Age of Ultron,Harry Potter and the Half-Blood Prince,Batman v Superman: Dawn of Justice,...,On The Downlow,Sanctuary: Quite a Conundrum,Bang,Primer,Cavite,El Mariachi,Newlyweds,"Signed, Sealed, Delivered",Shanghai Calling,My Date with Drew
Avatar,1.000000,0.114208,0.105409,0.094281,0.213003,0.139212,0.026082,0.202073,0.076139,0.114053,...,0.000000,0.000000,0.024845,0.077850,0.000000,0.030429,0.081650,0.053262,0.031311,0.000000
Pirates of the Caribbean: At World's End,0.114208,1.000000,0.080257,0.026919,0.104257,0.158991,0.029788,0.164845,0.086957,0.130258,...,0.000000,0.000000,0.028375,0.000000,0.000000,0.034752,0.000000,0.060830,0.000000,0.000000
Spectre,0.105409,0.080257,1.000000,0.074536,0.096225,0.088045,0.027493,0.182574,0.080257,0.090167,...,0.102869,0.000000,0.000000,0.000000,0.022680,0.064150,0.000000,0.056143,0.033005,0.000000
The Dark Knight Rises,0.094281,0.026919,0.074536,1.000000,0.043033,0.059062,0.055328,0.061237,0.053838,0.241943,...,0.034503,0.037268,0.035136,0.055048,0.030429,0.086066,0.000000,0.075324,0.066421,0.092748
John Carter,0.213003,0.104257,0.096225,0.043033,1.000000,0.076249,0.071429,0.210819,0.034752,0.130145,...,0.044544,0.000000,0.000000,0.035533,0.000000,0.027778,0.000000,0.024311,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
El Mariachi,0.030429,0.034752,0.064150,0.086066,0.027778,0.025416,0.071429,0.105409,0.069505,0.052058,...,0.000000,0.048113,0.022680,0.071067,0.058926,1.000000,0.000000,0.024311,0.028583,0.059868
Newlyweds,0.081650,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.129099,0.000000,0.000000,0.052705,0.000000,1.000000,0.130466,0.076696,0.000000
"Signed, Sealed, Delivered",0.053262,0.060830,0.056143,0.075324,0.024311,0.044488,0.020838,0.000000,0.060830,0.045561,...,0.077968,0.042108,0.079399,0.093296,0.017190,0.024311,0.130466,1.000000,0.075047,0.052396
Shanghai Calling,0.031311,0.000000,0.033005,0.066421,0.000000,0.026153,0.024500,0.000000,0.071520,0.026784,...,0.000000,0.000000,0.023338,0.000000,0.040423,0.028583,0.076696,0.075047,1.000000,0.061604


In [43]:
m_m_similarity_df["Spectre"].sort_values(ascending=False)[1:5]

original_title
Quantum of Solace        0.435516
Never Say Never Again    0.400320
From Russia with Love    0.357407
Skyfall                  0.315576
Name: Spectre, dtype: float64

In [44]:
def recommend_movies(movie_name, count):
    # Sort by similarity (highest first) and skip position 0, which is expected to be the movie itself
    return list(m_m_similarity_df[movie_name].sort_values(ascending=False)[1:count+1].index.values)

In [45]:
recommend_movies("The Dark Knight Rises", 10)

['The Dark Knight',
 'Batman Begins',
 'Batman',
 'Batman Forever',
 'Batman & Robin',
 'Nighthawks',
 "Amidst the Devil's Wings",
 'Slow Burn',
 'Batman Returns',
 'Dead Man Down']
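
`recommend_movies` assumes that the queried title exists, is unique, and sorts first in its own column. Below is a hedged variant on a toy similarity matrix that drops the query by label and fails with a clear message for unknown titles. The function name `recommend_safe` and the toy scores are assumptions for illustration, not the notebook's results.

```python
import pandas as pd

titles = ["Spectre", "Skyfall", "Avatar", "Newlyweds"]
# Toy symmetric similarity scores (made up, not taken from the notebook's output).
toy_sim = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.0],
     [0.6, 1.0, 0.2, 0.0],
     [0.1, 0.2, 1.0, 0.1],
     [0.0, 0.0, 0.1, 1.0]],
    index=titles, columns=titles,
)

def recommend_safe(sim_df, movie_name, count):
    """Return up to `count` titles most similar to `movie_name`."""
    if movie_name not in sim_df.columns:
        raise KeyError(f"Unknown title: {movie_name!r}")
    scores = sim_df[movie_name]
    if isinstance(scores, pd.DataFrame):
        # Duplicate titles yield several columns; take the first as a simple fallback.
        scores = scores.iloc[:, 0]
    # Exclude the query by label instead of relying on it sorting first.
    scores = scores.drop(labels=movie_name)
    return scores.sort_values(ascending=False).head(count).index.tolist()

print(recommend_safe(toy_sim, "Spectre", 2))
```

Keying the matrix on a unique movie id rather than on the title would avoid the duplicate-title fallback altogether.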

In [46]:
m_m_similarity_df.to_pickle("m_m_similarity_df.pkl")

In [47]:
import sys
print(sys.executable)

C:\Users\Tarun\anaconda3\python.exe
