pandas handles data manipulation and numpy handles numerical operations. Example: pandas reads the CSV files into DataFrames.

In [None]:
import numpy as np
import pandas as pd

Loading Data

In [None]:
# Loading Data
movies = pd.read_csv('Data/movies.csv')
credits = pd.read_csv('Data/credits.csv')
movies.head(1)

In [None]:
credits.head(1)

Merging Datasets
- Merges both datasets on the common column 'title' to combine information like cast and crew. Rows are matched on title alone, so a title that appears more than once can produce extra rows.

In [None]:
movies = movies.merge(credits, on='title')
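Because `merge` pairs every left row with every matching right row, repeated titles multiply rows. A minimal sketch on made-up frames (the titles and values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frames; the rows are invented for illustration only.
left = pd.DataFrame({'movie_id': [1, 2, 3], 'title': ['A', 'B', 'B']})
right = pd.DataFrame({'title': ['A', 'B', 'B'], 'cast': ['x', 'y', 'z']})

merged = left.merge(right, on='title')

# 'A' matches once; each of the two 'B' rows on the left
# matches both 'B' rows on the right, giving 1 + 2 * 2 rows.
print(len(left), len(merged))   # 3 5
```

Comparing row counts before and after the merge is a quick way to spot this. Merging on a unique identifier avoids it, assuming both files share one.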

Checking Dataset Information
- Helps understand how big the dataset is and what type of data it contains

In [None]:
movies.head(1)

In [None]:
credits.shape

In [None]:
movies.info()

#### Selecting features for further analysis

In [None]:
# genres
# id
# keywords
# title
# overview
# release date
# cast
# crew

Keeps only useful columns for recommendations. Example: We need 'genres' and 'overview' for content-based filtering

In [None]:
movies['original_language'].value_counts()

In [None]:
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [None]:
movies.head(2)

## Preprocessing

Handling Missing Data

In [None]:
# missing values
movies.isnull().sum()

In [None]:
# The overview column has 3 missing values, which is very few, so those rows are dropped.
movies.dropna(inplace=True)
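After `dropna`, the surviving rows keep their original index labels, so labels and row positions can drift apart. This matters whenever a label is later used to index a NumPy array by position. A small sketch on a made-up frame:

```python
import pandas as pd

# Toy frame with one missing overview (values are made up).
df = pd.DataFrame({'title': ['A', 'B', 'C'], 'overview': ['x', None, 'z']})

dropped = df.dropna()
print(list(dropped.index))        # labels keep their old values: [0, 2]

renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))     # labels match positions again: [0, 1]
```

Resetting the index is one option. Another is to look rows up by position rather than by label.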

Checking Unique & Duplicate Values

In [None]:
# Finds unique values in each column  
movies.nunique()

In [None]:
movies['genres'].value_counts()

In [None]:
# Counts duplicate entries
movies.duplicated().sum()

In [None]:
movies.iloc[0].genres

Ensures we don’t have repeated movie entries that could skew recommendations.

Extracting Important Details

In [None]:
import ast  # Parses strings that contain Python literals (lists, dicts) into Python objects

In [None]:
# Extracts genre and keyword names from structured data.
#  Example: Converts "[{'id': 18, 'name': 'Drama'}]" to ['Drama'].

def convert(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name']) 
    return L 
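To see what `convert` does step by step, here is a small sketch on a hand-written string (the ids and names are illustrative). `ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.

```python
import ast

# A hand-written string in the same shape as the example in the comment above.
raw = "[{'id': 18, 'name': 'Drama'}, {'id': 878, 'name': 'Science Fiction'}]"

parsed = ast.literal_eval(raw)               # str -> list of dicts
names = [item['name'] for item in parsed]    # keep only the 'name' field
print(names)                                 # ['Drama', 'Science Fiction']
```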

In [None]:
movies['genres'] = movies['genres'].apply(convert)

In [None]:
movies.head(1)

In [None]:
movies['keywords'] = movies['keywords'].apply(convert)

In [None]:
movies.head(1)

Processing Cast & Crew

In [None]:
# Keeps only the top 3 actors for each movie.
def convert3(text):
    # Slicing takes at most the first three entries of the parsed list.
    return [i['name'] for i in ast.literal_eval(text)[:3]]

In [None]:
movies['cast'] = movies['cast'].apply(convert3)

In [None]:
movies.head(1)

In [None]:
movies['crew'][0]

In [None]:
# Extracts director names from the crew list. 
# Example: Turns "[{'job': 'Director', 'name': 'Christopher Nolan'}]" into ['Christopher Nolan'].

def fetch_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L 
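The same filter can be written as a list comprehension. The sketch below runs it on a hand-written crew string (the names are placeholders). It uses `.get('job')` so that an entry without a `job` key is skipped instead of raising `KeyError`. The notebook does not establish whether the real data contains such entries.

```python
import ast

# Hand-written crew string; names and jobs are placeholders.
raw_crew = ("[{'job': 'Editor', 'name': 'Person A'}, "
            "{'job': 'Director', 'name': 'Person B'}, "
            "{'job': 'Director', 'name': 'Person C'}]")

directors = [member['name']
             for member in ast.literal_eval(raw_crew)
             if member.get('job') == 'Director']
print(directors)   # ['Person B', 'Person C']
```

A movie with co-directors yields more than one name, which is why the function returns a list.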

In [None]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [None]:
movies.head(1)

In [None]:
movies['overview'][0]

Tokenizing Overview
- Splits movie descriptions into words. Example: "A thrilling adventure" → ['A', 'thrilling', 'adventure'].


In [None]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [None]:
movies.head(1)

Removing Spaces in Text
- Makes words easier to process. Example: 'Science Fiction' → 'ScienceFiction'.

In [None]:
movies['genres'] = movies['genres'].apply(lambda x:[i.replace(" ", "") for i in x])

In [None]:
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(" ", "") for i in x])

In [None]:
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(" ", "") for i in x])

In [None]:
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(" ", "") for i in x])
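Removing the spaces matters because the tags are later split on whitespace. Otherwise two different people who share a first name would contribute the same token. A small sketch of that effect, using illustrative tag lists and a simplified whitespace tokenizer (an assumption, not the tokenizer used later in the notebook):

```python
# Two illustrative tag lists that share a first name but refer to different people.
movie_a = ['Sam Worthington', 'Action']
movie_b = ['Sam Mendes', 'Drama']

def tokens(tags, strip_spaces):
    """Join tags into one string, lowercase it, and split on whitespace."""
    if strip_spaces:
        tags = [t.replace(' ', '') for t in tags]
    return set(' '.join(tags).lower().split())

print(tokens(movie_a, False) & tokens(movie_b, False))   # {'sam'}: a spurious overlap
print(tokens(movie_a, True) & tokens(movie_b, True))     # set(): no overlap
```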

In [None]:
movies.head()

Creating "Tags" Column for Recommendations

In [None]:
# Combines all relevant information into a single column for similarity matching.
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [None]:
movies.head()

In [None]:
movies['tags'][0]

Final Dataframe for Model
- Keeps only necessary columns for recommendation modeling.

In [None]:
new_df = movies[['movie_id', 'title', 'tags']].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning

In [None]:
new_df.head()

Text Normalization
- Joins words into a single sentence & converts to lowercase for uniformity. Example: ['Epic', 'Drama'] → 'epic drama'.


In [None]:
new_df['tags'] = new_df['tags'].apply(lambda x:" ".join(x))

In [None]:
new_df.head(2)

In [None]:
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())

In [None]:
new_df['tags'][0]

In [None]:
new_df['tags'][1]

Imports CountVectorizer from scikit-learn for text feature extraction.

In [None]:
# Creates a CountVectorizer instance:
# - max_features=5000 → Limits vocabulary to the 5000 most frequent words.
# - stop_words='english' → Removes common English words like "the," "is," "and" to focus on important words.

from sklearn.feature_extraction.text import CountVectorizer
Count_Vectorizer = CountVectorizer(max_features=5000, stop_words='english')

In [None]:
Count_Vectorizer

Transforms the text data (tags) into a matrix of token counts and converts it into a NumPy array.

In [None]:
vectors = Count_Vectorizer.fit_transform(new_df['tags']).toarray()
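The token-count idea can be illustrated by hand on two toy documents. This is a hand-rolled sketch, not scikit-learn's implementation: `CountVectorizer` additionally lowercases the text, tokenizes with a regular expression, removes stop words when asked, and caps the vocabulary at `max_features`.

```python
from collections import Counter

docs = ['action space action', 'space drama']   # toy documents

# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cell = occurrence count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)    # ['action', 'drama', 'space']
print(matrix)   # [[2, 0, 1], [0, 1, 1]]
```

Each row plays the role of one movie's vector. The real matrix has one row per movie and up to 5000 columns.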

In [None]:
vectors

In [None]:
vectors.shape

In [None]:
# Gets the total number of unique words (features) used in the model.
len(Count_Vectorizer.get_feature_names_out())

In [None]:
# Lists all unique words in the vocabulary after vectorization.
# Example: words like 'action', 'drama', 'thriller' might appear in the list,
# which shows which terms the model takes into account.

Count_Vectorizer.get_feature_names_out()

Introducing NLP Stemming

- nltk is a Natural Language Processing library for handling text.
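- PorterStemmer performs stemming: it reduces words to a common stem (e.g., running → run, actors → actor), so that different forms of a word are counted as the same token. The stem is not always a dictionary word.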

In [None]:
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [None]:
# helper function
def stem(text):
    y = []

    for i in text.split(): # Split text into words  
        y.append(ps.stem(i)) # Apply stemming to each word 

    return " ".join(y)  # Join stemmed words back into a sentence

In [None]:
ps.stem("loving")

In [None]:
ps.stem("loved")

In [None]:
ps.stem("Swimming")

In [None]:
ps.stem("Swimmed")

In [None]:
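new_df['tags'] = new_df['tags'].apply(stem)
vectors = Count_Vectorizer.fit_transform(new_df['tags']).toarray()  # re-vectorize so the similarity step uses the stemmed tags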

In [None]:
new_df['tags'][0]

Applying cosine similarity helps find movies with descriptions most similar to a given movie—essential for recommendations.
 - Example: If Movie A has similar keywords to Movie B, their cosine similarity score will be high, making Movie B a good recommendation for someone who liked Movie A.


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vectors)
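For intuition, cosine similarity between two count vectors is their dot product divided by the product of their lengths. A minimal pure-Python sketch on toy vectors follows. The helper `cosine` is written here for illustration and is not part of scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0          # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)

a = [2, 0, 1]   # toy count vectors
b = [0, 1, 1]
print(round(cosine(a, b), 3))   # 0.316
```

Here the dot product is 1 and the lengths are √5 and √2, so the score is 1/√10 ≈ 0.316. With non-negative counts the score lies between 0 (no shared words) and 1 (same direction).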

In [None]:
similarity.shape

In [None]:
sorted(list(enumerate(similarity[0])), reverse=True, key=lambda x:x[1])[1:6]
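The ranking expression in the cell above can be unpacked on a made-up similarity row. `enumerate` attaches each score to its position, so the position survives the sort:

```python
# One made-up row of a similarity matrix: scores of movie 0 against movies 0..4.
row = [1.0, 0.20, 0.75, 0.05, 0.40]

# Pair each score with its position, then sort by score, highest first.
ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)

# The first entry is the movie itself (score 1.0), so skip it.
top3 = [idx for idx, _ in ranked[1:4]]
print(top3)   # [2, 4, 1]
```

Skipping the first entry assumes the movie ranks first against itself. If another movie had identical tags, the tie could put that movie first instead.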

In [None]:
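def recommend(movie):
    # Use the row position, not the index label: labels have gaps after dropna(),
    # while `similarity` is indexed by position.
    matches = np.flatnonzero((new_df['title'] == movie).to_numpy())
    if len(matches) == 0:
        print(f"'{movie}' not found in the dataset")
        return
    distances = sorted(enumerate(similarity[matches[0]]), reverse=True, key=lambda x: x[1])
    for i in distances[1:6]:
        print(new_df.iloc[i[0]].title)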

In [None]:
recommend('Avatar')

In [None]:
recommend('Batman')

Saving the Model Artifacts

In [None]:
import pickle

In [None]:
with open('Artifacts/movies.pkl', 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(new_df, f)

In [None]:
with open('Artifacts/similarity.pkl', 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(similarity, f)
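To use the saved artifacts elsewhere, they have to be read back with `pickle.load`. A minimal round-trip sketch, using a temporary directory and a stand-in dictionary instead of the notebook's objects (the file name is hypothetical):

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be new_df or similarity.
artifact = {'titles': ['A', 'B'], 'scores': [[1.0, 0.5], [0.5, 1.0]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'artifact.pkl')   # hypothetical file name
    with open(path, 'wb') as f:                # write in binary mode
        pickle.dump(artifact, f)
    with open(path, 'rb') as f:                # read back in binary mode
        restored = pickle.load(f)

print(restored == artifact)   # True
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code while deserializing.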