#### Problem statement:
Our goal is to develop a machine learning-based movie recommendation system that recommends movies matching a user's preferences. The system uses Term Frequency-Inverse Document Frequency (TF-IDF) vectorization to transform each movie's text data into a numerical vector, and cosine similarity to measure how similar two of these vectors are.
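
As a brief reference, the two quantities can be written out as follows. This is a sketch that assumes scikit-learn's default `TfidfVectorizer` settings (smoothed IDF, raw term counts, L2-normalized rows); here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t), \qquad
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2}
```

Because each row is scaled to unit length under these defaults, the cosine similarity of two rows reduces to their dot product.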

#### Importing the necessary libraries:

In [1]:
import pandas as pd
import numpy as np
import ast
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

#### Loading the dataset:

In [2]:
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")
df = movies.merge(credits, on="title")
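
One thing to keep in mind with the merge above: titles are not guaranteed to be unique, so joining on `title` can pair rows from different movies and yield more rows than either input. The sketch below uses made-up toy frames (`movies_demo`, `credits_demo` are hypothetical names, not part of the dataset) to show the effect and a possible alternative that joins on the id columns instead.

```python
import pandas as pd

# Toy frames (hypothetical data): two different movies share the title "Batman".
movies_demo = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Heat"]})
credits_demo = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Batman", "Batman", "Heat"]})

# Merging on the non-unique title pairs every "Batman" with every "Batman".
by_title = movies_demo.merge(credits_demo, on="title")

# Merging on the unique id keeps exactly one row per movie.
by_id = movies_demo.merge(
    credits_demo, left_on="id", right_on="movie_id", suffixes=("", "_credits")
)

print(len(by_title), len(by_id))
```

Whether the id-based join is preferable here depends on how the downstream cells use the `title` column; it is offered only as an option to consider.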

In [3]:
df.sample(5)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
4275,0,"[{""id"": 37, ""name"": ""Western""}]",,218500,[],en,The Ballad of Gregorio Cortez,The entire cause of the problem evolves from t...,0.592821,"[{""name"": ""Moctesuma Esparza Productions"", ""id...",...,104.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,The Ballad of Gregorio Cortez,0.0,0,218500,"[{""cast_id"": 3, ""character"": ""Boone Chate"", ""c...","[{""credit_id"": ""52fe4e4ec3a368484e219717"", ""de..."
7,280000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://marvel.com/movies/movie/193/avengers_ag...,99861,"[{""id"": 8828, ""name"": ""marvel comic""}, {""id"": ...",en,Avengers: Age of Ultron,When Tony Stark tries to jumpstart a dormant p...,134.279229,"[{""name"": ""Marvel Studios"", ""id"": 420}, {""name...",...,141.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Age Has Come.,Avengers: Age of Ultron,7.3,6767,99861,"[{""cast_id"": 76, ""character"": ""Tony Stark / Ir...","[{""credit_id"": ""55d5f7d4c3a3683e7e0016eb"", ""de..."
3903,3000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 10749, ""n...",,261,"[{""id"": 30, ""name"": ""individual""}, {""id"": 236,...",en,Cat on a Hot Tin Roof,"Brick, an alcoholic ex-football player, drinks...",16.553594,"[{""name"": ""Avon Production"", ""id"": 100}, {""nam...",...,108.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Just one pillow on her bed ... and just one de...,Cat on a Hot Tin Roof,7.6,165,261,"[{""cast_id"": 4, ""character"": ""Maggie"", ""credit...","[{""credit_id"": ""52fe422fc3a36847f800a6f7"", ""de..."
3776,0,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",,15122,"[{""id"": 9817, ""name"": ""behind the scenes""}, {""...",en,Love Stinks,A movie about a relationship...that's worse th...,3.347885,[],...,94.0,"[{""iso_639_1"": ""pt"", ""name"": ""Portugu\u00eas""}...",Released,,Love Stinks,5.4,11,15122,"[{""cast_id"": 1, ""character"": ""Seth Winnick"", ""...","[{""credit_id"": ""52fe463c9251416c75071af9"", ""de..."
4686,450000,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 27, ""name...",http://www.malevolencemovie.com/,54702,"[{""id"": 1562, ""name"": ""hostage""}, {""id"": 6259,...",en,Malevolence,It's ten years after the kidnapping of Martin ...,1.077321,[],...,90.0,"[{""iso_639_1"": ""pl"", ""name"": ""Polski""}, {""iso_...",Released,,Malevolence,4.9,14,54702,"[{""cast_id"": 1002, ""character"": ""Samantha Harr...","[{""credit_id"": ""52fe48abc3a36847f817354b"", ""de..."


#### Data cleaning and preprocessing:

In [4]:
#Information about the dataset:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

In [5]:
#Removing the unnecessary columns:
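#(.copy() makes the selection an independent dataframe, so later in-place edits do not trigger SettingWithCopyWarning)
df = df[["genres", "id", "keywords", "title", "overview", "cast", "crew"]].copy()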

In [6]:
#Checking for and handling missing entries:
df.isnull().sum()
df = df.dropna()

In [7]:
#Checking for and handling duplicates:
df.duplicated().sum()

0

In [8]:
#Function to extract the list of genres and list of keywords:
def genre_keyword_converter(obj):
    L = []
    for d in ast.literal_eval(obj):
        L.append(d["name"])
    return L

df["genres"] = df["genres"].apply(genre_keyword_converter)
df["keywords"] = df["keywords"].apply(genre_keyword_converter)
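
To make the conversion concrete, here is a minimal sketch of what `ast.literal_eval` does with one raw cell. The string `raw` is an illustrative value assembled from genre entries visible in the sample output earlier, not an actual row.

```python
import ast

# Illustrative raw cell in the same format as the "genres" column.
raw = '[{"id": 18, "name": "Drama"}, {"id": 37, "name": "Western"}]'

# literal_eval safely parses the string into a list of dicts.
parsed = ast.literal_eval(raw)

# Keep only the "name" field of each dict.
names = [d["name"] for d in parsed]
print(names)
```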

In [9]:
#Function to extract the names of the first three cast members:
def cast_converter(obj):
    return [d["name"] for d in ast.literal_eval(obj)[:3]]

df["cast"] = df["cast"].apply(cast_converter)

In [10]:
#Function to extract the name of the director (empty list if no director is credited):
def crew_converter(obj):
    for d in ast.literal_eval(obj):
        if d["job"] == "Director":
            return [d["name"]]
    return []

df["director"] = df["crew"].apply(crew_converter)
df = df.drop(columns="crew")

In [11]:
#Converting the "overview" column to a list:
df["overview"] = df["overview"].apply(lambda x: x.split(" "))

In [12]:
#Eliminating spaces between keywords/tags:
def replace(L):
    if not L:
        return []
    else:
        return [i.replace(" ", "") for i in L]
df["genres"] = df["genres"].apply(replace)
df["keywords"] = df["keywords"].apply(replace)
df["director"] = df["director"].apply(replace)
df["cast"] = df["cast"].apply(replace)

In [13]:
df.sample(10)

| | genres | id | keywords | title | overview | cast | director |
|---|---|---|---|---|---|---|---|
| 1568 | [Drama, Comedy] | 2755 | [wifehusbandrelationship, channelsurfing, mull... | About Schmidt | [66-year-old, Warren, Schmidt, is, a, retired,... | [JackNicholson, KathyBates, HopeDavis] | [AlexanderPayne] |
| 928 | [Drama] | 60308 | [underdog, basedonnovel, baseball, teamwork, s... | Moneyball | [The, story, of, Oakland, Athletics, general, ... | [BradPitt, JonahHill, PhilipSeymourHoffman] | [BennettMiller] |
| 4482 | [Horror, Drama, Thriller, Crime] | 13429 | [] | Donkey Punch | [Three, hot, girls,, four, guys,, and, one, me... | [NicholaBurley, SianBreckin, JaimeWinstone] | [OllyBlackburn] |
| 3716 | [Drama, Thriller] | 4550 | [independentfilm] | Sympathy for Lady Vengeance | [After, a, 13-year, imprisonment, for, the, ki... | [ChoiMin-sik, LeeYoung-ae, KwonYea-young] | [ParkChan-wook] |
| 694 | [Action, Drama, Mystery, Thriller] | 2501 | [paris, barcelonaspain, assassin, basedonnovel... | The Bourne Identity | [Wounded, to, the, brink, of, death, and, suff... | [MattDamon, FrankaPotente, ChrisCooper] | [DougLiman] |
| 798 | [Comedy] | 116741 | [jobinterview, lossofjob, intern, referencetog... | The Internship | [Two, recently, laid-off, men, in, their, 40s,... | [OwenWilson, VinceVaughn, RoseByrne] | [ShawnLevy] |
| 2074 | [Crime, Drama] | 1646 | [blackpeople, basedonnovel, holocaust, ghetto,... | Freedom Writers | [A, young, teacher, inspires, her, class, of, ... | [HilarySwank, ScottGlenn, ImeldaStaunton] | [RichardLaGravenese] |
| 596 | [Action, Adventure, Comedy, Thriller] | 8427 | [budapest, kidnapping, boxer, secretagent, lib... | I Spy | [When, the, Switchblade,, the, most, sophistic... | [EddieMurphy, OwenWilson, FamkeJanssen] | [BettyThomas] |
| 2730 | [Action, Comedy, Thriller] | 70829 | [] | The Last Godfather | [Young-goo, the, son, of, mafia, boss, Don, Ca... | [HarveyKeitel, JasonMewes, BlakeClark] | [ShimHyung-Rae] |
| 4555 | [Action, Drama, Thriller] | 253626 | [pilot, suspicion, drone, u.s.military, airfor... | Good Kill | [In, the, shadowy, world, of, drone, warfare,,... | [EthanHawke, JanuaryJones, ZoëKravitz] | [AndrewNiccol] |


In [14]:
#Concatenating the keywords, overview, cast, director and genres into the "tags" column and dropping the source columns:
df["tags"] = df["keywords"] + df["overview"] + df["cast"] + df["director"] + df["genres"]
df.drop(columns=["keywords", "genres", "overview", "director", "cast"], inplace=True)

In [15]:
#Joining each list of tags into a single lowercase string:
df["tags"] = df["tags"].apply(lambda x: " ".join(x).lower())

In [16]:
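#Function to perform stemming and remove stopwords from a piece of text
#(word_tokenize and stopwords need the corresponding NLTK resources, installable via nltk.download):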
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    words = word_tokenize(text)
    processed_words = [stemmer.stem(word) for word in words if word not in stop_words]
    return ' '.join(processed_words)

df["tags"] = df["tags"].apply(preprocess_text)
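
To illustrate what this step does to individual tags, the sketch below applies `PorterStemmer` to a few words. It is a simplified stand-in for `preprocess_text`: it splits on whitespace and uses a tiny hand-written stopword set (`demo_stop_words`, an assumption made so the example needs no NLTK downloads) instead of `word_tokenize` and the full English stopword list.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Tiny hand-written stopword set, used here only to avoid downloading NLTK data.
demo_stop_words = {"the", "a", "of"}

def preprocess_demo(text):
    # Whitespace split stands in for word_tokenize in this sketch.
    words = text.split()
    return " ".join(stemmer.stem(w) for w in words if w not in demo_stop_words)

result = preprocess_demo("the police adventure")
print(result)
```

Stemming maps inflected forms to a shared root (for example `police` to `polic`), so variants of the same word end up as one TF-IDF feature.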

In [17]:
df.sample(10)

| | id | title | tags |
|---|---|---|---|
| 377 | 4477 | The Devil's Own | newyork terrorist anonym northernireland frank... |
| 3839 | 573 | Frenzy | londonengland rape polic suspens serialkil ser... |
| 4413 | 242083 | Hits | talent delus fame teenag viralvideo talentless... |
| 1936 | 20309 | Say It Isn't So | boy meet girl ; boy fall love ( wild , non-sto... |
| 3925 | 15158 | Phantasm II | portal undertak evil tallman sentin mike , rel... |
| 4515 | 75986 | Love Letters | obsess radio nuditi letter love romanc lingeri... |
| 781 | 2309 | Inkheart | book fairytal eavesdrop adventur writer'sblock... |
| 3241 | 773 | Little Miss Sunshine | california brothersisterrelationship wifehusba... |
| 1731 | 10383 | Lost Souls | pedophilia daughter anti-christ incest small g... |
| 2906 | 345003 | 10 Days in a Madhouse | undercov insaneasylum report nelli bli , 23 ye... |


In [22]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df["tags"])
cs = cosine_similarity(tfidf_matrix, tfidf_matrix)
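
The similarity matrix can then be used to look up recommendations. The following is a minimal sketch, not part of the original notebook: `recommend` is a hypothetical helper, and the toy dataframe `demo` with its titles and tags is invented for illustration. It looks rows up by position rather than by index label, since rows of the similarity matrix are positional while the dataframe's index may have gaps after dropping rows.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the processed dataframe (hypothetical titles and tags).
demo = pd.DataFrame({
    "title": ["Space War", "Star Battle", "Love Letter"],
    "tags": ["space alien war ship", "space alien battl ship", "romanc letter love"],
})

sim = cosine_similarity(TfidfVectorizer().fit_transform(demo["tags"]))

def recommend(title, frame, similarity, top_n=5):
    """Return up to top_n titles most similar to `title` (hypothetical helper)."""
    # Positions (not index labels) of rows whose title matches.
    positions = np.flatnonzero((frame["title"] == title).to_numpy())
    if len(positions) == 0:
        return []
    pos = positions[0]
    # Rank all rows by similarity to the query, highest first.
    order = similarity[pos].argsort()[::-1]
    # Skip the query movie itself and keep the best top_n.
    return [frame["title"].iloc[i] for i in order if i != pos][:top_n]

print(recommend("Space War", demo, sim, top_n=1))
```

With the real data, the call would presumably be `recommend(some_title, df, cs)`; if several movies share a title, this sketch simply uses the first match.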

#### Exporting the cosine similarity table and the processed dataframe:

In [23]:
#Using "with" ensures each file is closed once the dump finishes:
with open("movies.pkl", "wb") as f:
    pickle.dump(df.to_dict(), f)
with open("cs.pkl", "wb") as f:
    pickle.dump(cs, f)
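
For completeness, here is a sketch of how the exported files might be read back in another script or app. It uses toy stand-ins (`demo`, `sim`) and a temporary directory so that it runs on its own; with the real files, only the two `pickle.load` steps would be needed.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical data) for the processed dataframe and similarity matrix.
demo = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "tags": ["x y", "y z"]})
sim = np.array([[1.0, 0.5], [0.5, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    movies_path = os.path.join(tmp, "movies.pkl")
    cs_path = os.path.join(tmp, "cs.pkl")

    # Write both objects in the same way as the export step.
    with open(movies_path, "wb") as f:
        pickle.dump(demo.to_dict(), f)
    with open(cs_path, "wb") as f:
        pickle.dump(sim, f)

    # Read them back; the dict of columns is rebuilt into a DataFrame.
    with open(movies_path, "rb") as f:
        restored = pd.DataFrame(pickle.load(f))
    with open(cs_path, "rb") as f:
        restored_sim = pickle.load(f)

print(restored.shape, restored_sim.shape)
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.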