# Movie Recommendation System

## Data EDA & Preprocessing

### 1. Install Required Packages

In [2]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


### 2. Import Libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import groq
from dotenv import load_dotenv
from enum import Enum
from pydantic import BaseModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from typing import List
import asyncio
import nest_asyncio
nest_asyncio.apply()

load_dotenv()

True

### 3. EDA

In [3]:
df = pd.read_csv('data/raw/Movies_Dataset.csv')
df.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,2000,102 Dalmatians,American,Kevin Lima,"Glenn Close, Gérard Depardieu, Alice Evans","comedy, family",https://en.wikipedia.org/wiki/102_Dalmatians,"After three years in prison, Cruella de Vil ha..."
1,2000,28 Days,American,Betty Thomas,"Sandra Bullock, Viggo Mortensen",drama,https://en.wikipedia.org/wiki/28_Days_(film),Gwen Cummings (Sandra Bullock) spends her nigh...
2,2000,3 Strikes,American,DJ Pooh,"Brian Hooks, N'Bushe Wright",comedy,https://en.wikipedia.org/wiki/3_Strikes_(film),Robert Douglas (Brian Hooks) is in prison for ...
3,2000,The 6th Day,American,Roger Spottiswoode,"Arnold Schwarzenegger, Robert Duvall",science fiction,https://en.wikipedia.org/wiki/The_6th_Day,At some point in the indeterminate near future...
4,2000,The Adventures of Rocky and Bullwinkle,American,Des McAnuff,"Rene Russo, Jason Alexander, Robert De Niro, P...",comedy,https://en.wikipedia.org/wiki/The_Adventures_o...,35 years since their show's cancellation in 19...


In [4]:
print("Shape of dataset:", df.shape)
print("\nColumn names:\n", df.columns.tolist())
df.info()

Shape of dataset: (12560, 8)

Column names:
 ['Release Year', 'Title', 'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page', 'Plot']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12560 entries, 0 to 12559
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Release Year      12560 non-null  int64 
 1   Title             12560 non-null  object
 2   Origin/Ethnicity  12560 non-null  object
 3   Director          12560 non-null  object
 4   Cast              12042 non-null  object
 5   Genre             12560 non-null  object
 6   Wiki Page         12560 non-null  object
 7   Plot              12560 non-null  object
dtypes: int64(1), object(7)
memory usage: 785.1+ KB


In [5]:
print("\nMissing values per column:\n", df.isnull().sum())
print("\nNumber of duplicated rows:", df.duplicated().sum())


Missing values per column:
 Release Year          0
Title                 0
Origin/Ethnicity      0
Director              0
Cast                518
Genre                 0
Wiki Page             0
Plot                  0
dtype: int64

Number of duplicated rows: 0


### 4. Data Preprocessing

In [6]:
df['Director'] = df['Director'].replace("Unknown", pd.NA)
df['Director'] = df['Director'].fillna("unknown_director")
df['Cast'] = df['Cast'].fillna("unknown_cast")

In [7]:
class Genre(str, Enum):
    ACTION = "Action"
    COMEDY = "Comedy"
    HORROR = "Horror"
    THRILLER = "Thriller"
    ANIMATION = "Animation"
    ADVENTURE = "Adventure"
    BIOGRAPHY = "Biography"
    ROMANCE = "Romance"
    DRAMA = "Drama"

class GenrePrediction(BaseModel):
    plot: str
    pred_genre: Genre

def predict_genre(plot: str) -> str:
    """Ask the Groq chat model for a one-word genre label for a plot."""
    prompt = f"""
    What is the most likely genre (Action, Comedy, Drama, Horror, Thriller, Animation, Adventure, Biography, Romance, etc.) for the following plot: '{plot}'?

    Your response must consist of only one word: Action, Comedy, Drama, Horror, Thriller, Animation, Adventure, Biography, Romance, etc.
    """

    response = groq.Groq().chat.completions.create(
        model="llama3-70b-8192",
        messages=[{"role": "user", "content": prompt}],
    )

    # Use a distinct local name so the function name is not shadowed.
    genre = response.choices[0].message.content.strip()
    return genre
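
The `Genre` enum is defined above, but `predict_genre` returns the raw model text, and values such as `Comedy.` (with a trailing period) appear in the saved output further down. A minimal sketch of one way to map a raw reply onto the enum follows. `normalize_genre` is a hypothetical helper, not part of the original notebook, and the enum is repeated only so the block runs on its own.

```python
from enum import Enum
from typing import Optional


class Genre(str, Enum):  # same members as the enum defined in the cell above
    ACTION = "Action"
    COMEDY = "Comedy"
    HORROR = "Horror"
    THRILLER = "Thriller"
    ANIMATION = "Animation"
    ADVENTURE = "Adventure"
    BIOGRAPHY = "Biography"
    ROMANCE = "Romance"
    DRAMA = "Drama"


def normalize_genre(raw: str) -> Optional[Genre]:
    """Map raw model text to a Genre member, or None if nothing matches."""
    # Strip whitespace, then surrounding punctuation and quotes, then compare case-insensitively.
    cleaned = raw.strip().strip(".!?\"'").strip().lower()
    for genre in Genre:
        if genre.value.lower() == cleaned:
            return genre
    return None
```

Returning `None` for unmatched replies leaves it to the caller to decide whether to keep the raw text, retry, or leave the genre missing.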

In [8]:
df['Genre'] = df['Genre'].replace("unknown", pd.NA)
missing_genre_index = df['Genre'].isnull()
for idx in df[missing_genre_index].index:
    plot = df.loc[idx, 'Plot']
    genre = predict_genre(plot)
    if genre:
        df.loc[idx, 'Genre'] = genre
        print(f"{plot} : {genre}")
    else:
        print(f"{plot} : No Genre Detected")

The movie opens showing super star Eddie Tudor (Cole Sprouse) on the red carpet at the premiere of one of his films. Tom Canty (Dylan Sprouse) watches it on TV and imitates Eddie (he mimics him by saying his catchphrase "Oh yeah!"), when his grandpa, 'Pop' (Ed Lauter), calls him to get ready for school. At school, the principal asks what his ambition is in life, and, Tom replies he wants to be Eddie Tudor. The principal then advises him to join the acting classes (which Pop doesn't allow him to do). He has lived with Pop since his parents died, a couple of years earlier. From a young age, his ambition has been to become an actor, partly inspired by the stories of fame and appeal of acting conveyed to him by his neighbour and best friend, Miles (Vincent Spano) who is a former actor and who also happens to be the father of Eddie Tudor (revealed at the end of the film). Miles has known Tom since he was a little kid, when Miles moved to the area, after quitting showbiz and he often tells T

APIStatusError: Error code: 413 - {'error': {'message': 'Request too large for model `llama3-70b-8192` in organization `org_01k0z41d3wecabet7hv7p5wdxp` service tier `on_demand` on tokens per minute (TPM): Limit 6000, Requested 9281, please reduce your message size and try again. Need more tokens? Upgrade to Dev Tier today at https://console.groq.com/settings/billing', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}
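
The 413 error above reports a request of 9281 tokens against a limit of 6000 tokens per minute, so the plot itself is too long for a single call. One possible workaround is to shorten long plots before building the prompt. The sketch below is only an illustration: `truncate_plot` and its `max_chars` default are hypothetical, and the budget relies on the rough rule of thumb of about four characters per token for English text, not on Groq's actual tokenizer.

```python
def truncate_plot(plot: str, max_chars: int = 8000) -> str:
    """Cut a plot to at most max_chars characters, ending on a word boundary.

    max_chars is an assumed budget (roughly 2000 tokens at ~4 chars/token),
    not a value taken from the Groq documentation.
    """
    if len(plot) <= max_chars:
        return plot
    cut = plot[:max_chars]
    # Drop the last (possibly partial) word if there is a space to split on.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut
```

It could be applied as `predict_genre(truncate_plot(plot))`. The opening of a plot summary usually carries enough signal for a genre guess, though that is an assumption worth checking on a sample.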

In [None]:
class Keywords(BaseModel):
    keywords: List[str]

class KeywordGen(BaseModel):
    plot: str
    keywords: Keywords

def KeywordIdentify(plot: str) -> List[str]:
    """Ask the Groq chat model for 10-20 keywords describing a plot."""
    prompt = f"""
    Identify the 10-20 most suitable keywords for the following plot: '{plot}'

    Your response should be only a comma-separated list of keywords.
    """

    response = groq.Groq().chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}],
    )

    # Split the comma-separated reply and drop empty entries.
    content = response.choices[0].message.content.strip()
    keywords = [k.strip() for k in content.split(",") if k.strip()]
    return keywords

In [None]:
df['Keywords'] = pd.NA

missing_keywords_index = df['Keywords'].isnull() | (df['Keywords'] == '')

for idx in df[missing_keywords_index].index:
    plot = df.loc[idx, 'Plot']
    keywords = KeywordIdentify(plot)
    
    if keywords:
        df.loc[idx, 'Keywords'] = ', '.join(keywords)
        print(f"Plot: {plot[:100]}... | Keywords: {df.loc[idx, 'Keywords']}")
    else:
        print(f"Plot: {plot[:100]}... | No Keywords Detected")
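
This loop issues one API call per row, so it may run into the same per-minute rate limits seen earlier. A common pattern is to retry a failed call after an exponentially growing pause. The helper below is a generic sketch under that assumption: `call_with_backoff` and all of its parameters are hypothetical, and which exception types are worth retrying depends on the client library.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

Inside the loop this could look like `keywords = call_with_backoff(lambda: KeywordIdentify(plot))`. Retrying only helps with transient limits. A single request that is itself too large will fail on every attempt.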

In [None]:
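# Half-open five-year bins: [2000, 2005), [2005, 2010), [2010, 2015), [2015, 2020).
# Years outside this range (including 2020 itself) are assigned NaN by pd.cut.
bins = list(range(2000, 2021, 5))
labels = [f"{b}-{b + 4}" for b in bins[:-1]]

df['Year Binned'] = pd.cut(df['Release Year'], bins=bins, labels=labels, right=False)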

print("\nMovies per binned year group:\n")
print(df['Year Binned'].value_counts().sort_index())

plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='Year Binned', order=labels)
plt.title("Number of Movies by Year Bin")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
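
To make the bin edges concrete, the toy example below runs `pd.cut` with `right=False` on a handful of years. The sample years and the range-style labels are illustrative choices, not values from the dataset.

```python
import pandas as pd

years = pd.Series([2000, 2004, 2005, 2019, 2020])

bins = list(range(2000, 2021, 5))             # [2000, 2005, 2010, 2015, 2020]
labels = [f"{b}-{b + 4}" for b in bins[:-1]]  # one label per half-open interval

# right=False makes each bin include its left edge and exclude its right edge.
binned = pd.cut(years, bins=bins, labels=labels, right=False)

for year, label in zip(years, binned):
    print(year, "->", label)
```

With these edges, 2004 and 2005 land in different bins, and 2020 falls outside the last interval, so it comes back as missing.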

In [None]:
df = df.drop(columns=['Wiki Page', 'Release Year', 'Plot'])
df.head()

Unnamed: 0,Title,Origin/Ethnicity,Director,Cast,Genre,Year Binned
0,Kansas Saloon Smashers,American,Unknown,,Comedy,1900s
1,Love by the Light of the Moon,American,Unknown,,Romance,1900s
2,The Martyred Presidents,American,Unknown,,Thriller,1900s
3,"Terrible Teddy, the Grizzly King",American,Unknown,,Comedy.,1900s
4,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,Fantasy/Adventure,1900s


In [27]:
df.to_csv('data/processed/Movies_Preprocessed.csv', index=False)

### 5. Feature Engineering

In [29]:
df = pd.read_csv("data/processed/Movies_Preprocessed.csv")

df.head(5)

Unnamed: 0,Title,Origin/Ethnicity,Director,Cast,Genre,Year Binned
0,Kansas Saloon Smashers,American,Unknown,,Comedy,1900s
1,Love by the Light of the Moon,American,Unknown,,Romance,1900s
2,The Martyred Presidents,American,Unknown,,Thriller,1900s
3,"Terrible Teddy, the Grizzly King",American,Unknown,,Comedy.,1900s
4,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,Fantasy/Adventure,1900s


In [5]:
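def clean_text(x):
    # Remove spaces so multi-word values ("Kevin Lima") become one token ("kevinlima").
    return str(x).replace(" ", "").lower()

# Join the metadata columns into one space-separated string per movie.
df["combined_features"] = (
    df["Origin/Ethnicity"].apply(clean_text) + " " +
    df["Director"].apply(clean_text) + " " +
    df["Cast"].apply(clean_text) + " " +
    df["Year Binned"].apply(clean_text) + " " +
    df["Genre"].apply(clean_text)
)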
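
The sketch below shows which tokens a `TfidfVectorizer` would extract from one combined string. It uses four of the columns from the first row of the raw dataset shown earlier. The `row` dictionary is built by hand for illustration, and `build_analyzer()` is used only to inspect tokenization without fitting anything.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(x):
    return str(x).replace(" ", "").lower()


# First row of the raw dataset, restricted to four columns for brevity.
row = {
    "Origin/Ethnicity": "American",
    "Director": "Kevin Lima",
    "Cast": "Glenn Close, Gérard Depardieu, Alice Evans",
    "Genre": "comedy, family",
}

combined = " ".join(clean_text(value) for value in row.values())

# The default token pattern keeps runs of two or more word characters,
# so the commas left inside Cast and Genre still act as separators.
analyzer = TfidfVectorizer(stop_words="english").build_analyzer()
tokens = analyzer(combined)
print(tokens)
```

Each person ends up as a single token, which is what allows two movies to match on a shared director or cast member rather than on a shared first name.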

### 6. Vectorization

In [6]:
tfidf = TfidfVectorizer(stop_words="english")

tfidf_matrix = tfidf.fit_transform(df["combined_features"])

print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (34886, 41076)


In [7]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print("Cosine similarity matrix shape:", cosine_sim.shape)

Cosine similarity matrix shape: (34886, 34886)
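
A dense 34886 x 34886 matrix of 64-bit floats takes roughly 9.7 GB of memory. An alternative is to compute one row of similarities at query time. The sketch below illustrates that idea on a toy corpus. The documents and the `top_similar` helper are hypothetical. It relies on `TfidfVectorizer` L2-normalizing rows by default, so the plain dot product from `linear_kernel` equals cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-ins for the combined_features strings.
docs = [
    "american kevinlima glennclose comedy family",
    "american kevinlima sandrabullock comedy",
    "british someoneelse horror",
    "american bettythomas sandrabullock drama",
]

tfidf_matrix = TfidfVectorizer().fit_transform(docs)


def top_similar(idx, n=2):
    """Return the positions of the n rows most similar to row idx."""
    # One sparse row against the whole matrix: a 1 x N result instead of N x N.
    scores = linear_kernel(tfidf_matrix[idx], tfidf_matrix).ravel()
    scores[idx] = -1.0  # exclude the query row itself
    return np.argsort(-scores)[:n].tolist()


print(top_similar(0, n=1))
```

The trade-off is a small amount of work per query in exchange for never holding the full similarity matrix in memory.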


### 7. Recommendation Function

In [36]:
# drop=True keeps row positions aligned with cosine_sim without adding an 'index' column.
df = df.reset_index(drop=True)
df["Title_Clean"] = df["Title"].str.lower().str.strip()

# Map each normalized title to its row position.
indices = pd.Series(df.index, index=df["Title_Clean"])

def recommend_movies(title, n=10):
    title = title.lower().strip()
    if title not in indices:
        return ["Movie not found in dataset."]

    idx = indices[title]

    # Several movies can share a title; use the first match.
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]

    # Drop the query movie by position rather than assuming it sorts first.
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:n]
    movie_indices = [i for i, _ in sim_scores]
    return df["Title"].iloc[movie_indices].tolist()


In [52]:
print(recommend_movies("thor", 5))

['Thor: The Dark World', 'Thor: Ragnarok', 'Avengers, TheThe Avengers', "Goya's Ghosts", 'Jack Ryan: Shadow Recruit']
