# Movie Recommendation System using NLP & Cosine Similarity

This project builds a content-based movie recommendation system using NLP techniques and cosine similarity.

The system recommends the top 5 most similar movies based on:
- Genres
- Keywords
- Cast
- Crew (director only)
- Overview

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ast
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.stem.porter import PorterStemmer
import pickle

## Data Loading

Dataset used: TMDB 5000 Movies Dataset

In [None]:
credits=pd.read_csv('tmdb_5000_credits.csv')
movies=pd.read_csv('tmdb_5000_movies.csv')

In [None]:
credits.head()

In [None]:
movies.head()

In [None]:
mov=movies.merge(credits, on='title')

In [None]:
mov.head()

## Data Preprocessing

Select the relevant features and handle missing values.

In [None]:
movies.shape

In [None]:
credits.shape

In [None]:
mov.shape

In [None]:
mov.info()

In [None]:
# .copy() avoids SettingWithCopyWarning when columns are reassigned later
mov=mov[['movie_id','title','overview','genres','keywords','cast','crew']].copy()

In [None]:
mov.head()

In [None]:
mov.info()

In [None]:
mov.duplicated().sum()

## Feature Engineering

Extract the names from the JSON-like string columns and combine them into a single `tags1` column.
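
The parsing step can be sketched in isolation as follows. The sample string and the helper name `extract_names` are hypothetical and only mimic the shape of a raw `genres` cell; the optional `limit` argument is an assumption added to show how the same helper could cover the "first few names" case.

```python
import ast

# Hypothetical sample that mimics the shape of a raw 'genres' cell:
# a string holding a list of dicts, each with an 'id' and a 'name'.
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

def extract_names(cell, limit=None):
    """Parse a stringified list of dicts and return the 'name' values."""
    items = ast.literal_eval(cell)  # safely turns the string into Python objects
    if limit is not None:
        items = items[:limit]       # keep only the first `limit` entries
    return [item["name"] for item in items]

names = extract_names(raw_genres)
first_only = extract_names(raw_genres, limit=1)
print(names)
print(first_only)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for cells like these.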

In [None]:
mov.iloc[0].genres

In [None]:
import ast
ast.literal_eval

In [None]:
def convert(obj):
    L=[]
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L

In [None]:
mov['genres']=mov['genres'].apply(convert)

In [None]:
mov['genres']

In [None]:
mov.iloc[0].keywords

In [None]:
mov['keywords']=mov['keywords'].apply(convert)

In [None]:
mov.head()

In [None]:
mov.iloc[0].cast

In [None]:
# Keep only the first 3 names from cast and the director from crew

In [None]:
def convert3(obj):
    # Keep only the first three names listed in the cast
    L=[]
    for i in ast.literal_eval(obj)[:3]:
        L.append(i['name'])
    return L

In [None]:
mov['cast']=mov['cast'].apply(convert3)

In [None]:
mov['cast']

In [None]:
mov.iloc[0].crew

In [None]:
def fetch_dirc(obj):
    L=[]
    for i in ast.literal_eval(obj):
        if i['job']=='Director':
            L.append(i['name'])
            break
    return L

In [None]:
mov['crew']=mov['crew'].apply(fetch_dirc)

In [None]:
mov.head()

In [None]:
mov['overview'][0]

In [None]:
mov.head()

In [None]:
mov['overview'] = mov['overview'].fillna('')
mov['overview'] = mov['overview'].apply(lambda x: x.split())

In [None]:
mov.head()

In [None]:
# Remove spaces inside multi-word names, e.g. "Science Fiction" -> "ScienceFiction", so the model treats each name as a single token instead of separate words

In [None]:
mov['genres'].apply(lambda x: [i.replace(" ","") for i in x])

In [None]:
mov['genres']=mov['genres'].apply(lambda x: [i.replace(" ","") for i in x])
mov['keywords']=mov['keywords'].apply(lambda x: [i.replace(" ","") for i in x])
mov['cast']=mov['cast'].apply(lambda x: [i.replace(" ","") for i in x])
mov['crew']=mov['crew'].apply(lambda x: [i.replace(" ","") for i in x])

In [None]:
mov.head()

In [None]:
# Combine all features into a single text column

In [None]:
mov['tags1']=mov['overview']+mov['genres']+ mov['keywords']+mov['cast']+mov['crew']

In [None]:
# .copy() avoids SettingWithCopyWarning when 'tags1' is reassigned below
mov_rev=mov[['movie_id','title','tags1']].copy()

In [None]:
mov_rev

In [None]:
mov_rev['tags1']=mov_rev['tags1'].apply(lambda x:" ".join(x))

In [None]:
mov_rev

In [None]:
mov_rev['tags1'][0]

In [None]:
mov_rev['tags1'][1]

## Text Processing (Stemming)

Stemming reduces each word to its stem, so that variants such as "dancing" and "danced" map to the same token.
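
To see why this matters before vectorization, here is a toy suffix stripper. It is deliberately simplified and is not the Porter algorithm used in this notebook; the names `toy_stem` and `toy_stem_text` and the suffix list are assumptions made for illustration.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, leaving short words untouched."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three remaining characters so words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_stem_text(text):
    """Apply toy_stem to every whitespace-separated token."""
    return " ".join(toy_stem(w) for w in text.split())

raw = "dancing dances danced"
stemmed = toy_stem_text(raw)

# Count distinct tokens before and after stemming.
distinct_before = len(set(raw.split()))
distinct_after = len(set(stemmed.split()))
print(distinct_before, distinct_after)
```

Fewer distinct tokens means related movies share more vocabulary, which is what the similarity step relies on.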

In [None]:
import sys
!{sys.executable} -m pip install nltk

In [None]:
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

In [None]:
ps.stem('dancing')

In [None]:
def stemmm(text):
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)
        

In [None]:
stemmm('In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d JamesCameron')

In [None]:
mov_rev['tags1']=mov_rev['tags1'].apply(stemmm)

In [None]:
mov_rev

## Vectorization using Bag of Words
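
A bag-of-words model represents each document as a vector of word counts over a shared vocabulary. The sketch below reproduces that idea in plain Python as a rough stand-in for `CountVectorizer`; it skips stop-word removal, and the names `build_vocabulary`, `vectorize`, `toy_docs`, and `toy_vectors` are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs, max_features):
    """Keep the `max_features` most frequent tokens across all documents."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    # Sort by frequency (descending), then alphabetically so ties are deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    return sorted(token for token, _ in top)

def vectorize(doc, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(doc.lower().split())
    return [counts[token] for token in vocabulary]

toy_docs = ["action space alien", "action romance", "space space war"]
vocab = build_vocabulary(toy_docs, max_features=3)
toy_vectors = [vectorize(doc, vocab) for doc in toy_docs]

print(vocab)
print(toy_vectors)
```

Capping the vocabulary drops rare tokens such as "romance" and "war" here, which keeps the vectors small at the cost of some detail.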

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000,stop_words='english')

In [None]:
vectors=cv.fit_transform(mov_rev['tags1']).toarray()

In [None]:
cv.get_feature_names_out()

In [None]:
vectors.shape

## Cosine Similarity 
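
Cosine similarity measures the angle between two vectors: the dot product divided by the product of their lengths. For count vectors the result lies between 0 (no shared tokens) and 1 (same direction). The following is a minimal plain-Python sketch of the formula; the function name `cosine` and the sample vectors are hypothetical, and returning 0.0 for an all-zero vector is a choice made in this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # sketch-level choice for empty documents
    return dot / (norm_u * norm_v)

a = [1, 1, 1]
b = [1, 0, 0]
c = [0, 0, 2]

sim_ab = cosine(a, b)  # one shared token out of three
sim_bc = cosine(b, c)  # no shared tokens
sim_aa = cosine(a, a)  # a vector compared with itself
print(sim_ab, sim_bc, sim_aa)
```

Because the measure depends on direction rather than length, a long overview and a short one can still score as similar when they use the same words in similar proportions.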

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
similarity=cosine_similarity(vectors)

In [None]:
similarity[1]  # Similarity of the movie at index 1 with every movie in the dataset (including itself)

In [None]:
similarity.shape

In [None]:
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])[1:6]

## Recommendation Function to List the 5 Most Similar Movies
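
The lookup logic can be sketched without pandas as shown below. This variant returns a list instead of printing and returns an empty list for unknown titles; the name `recommend_titles`, the toy titles, and the toy similarity matrix are all hypothetical.

```python
def recommend_titles(movie, titles, similarity, k=5):
    """Return the k titles most similar to `movie`, excluding the movie itself."""
    if movie not in titles:
        return []  # unknown title: nothing to recommend
    idx = titles.index(movie)
    # Pair each column index with its score, then sort by score, highest first.
    ranked = sorted(enumerate(similarity[idx]), key=lambda pair: pair[1], reverse=True)
    # Drop the query movie by index rather than assuming it is sorted first.
    return [titles[i] for i, _ in ranked if i != idx][:k]

toy_titles = ["A", "B", "C", "D"]
toy_similarity = [
    [1.0, 0.9, 0.1, 0.4],
    [0.9, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.8],
    [0.4, 0.3, 0.8, 1.0],
]

top_two = recommend_titles("A", toy_titles, toy_similarity, k=2)
unknown = recommend_titles("Z", toy_titles, toy_similarity)
print(top_two, unknown)
```

Filtering by index is a small design choice: slicing off the first sorted entry assumes the movie itself always ranks first, which may not hold if another row has an identical tag vector.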

In [None]:
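def recommend(movie):
    # Row position of the requested title (raises IndexError if the title is not found)
    mov_index=mov_rev[mov_rev['title']==movie].index[0]
    distances= similarity[mov_index]
    # Sort by similarity score, highest first; skip position 0, which is the movie itself
    movie_list=sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6]

    for i in movie_list:
        print(mov_rev.iloc[i[0]].title)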

In [None]:
recommend('Avatar')

## Save Model Files
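
The saved files are meant to be loaded again later, for example by an app that serves recommendations. The round trip below is a minimal sketch using toy data and a temporary directory; `toy_movies` and `toy_top` are hypothetical and only mimic the shapes produced by `DataFrame.to_dict()` (column name mapped to an index-to-value dict) and by a dictionary of top-similar indices.

```python
import os
import pickle
import tempfile

# Hypothetical toy data mimicking the shapes of the two saved objects.
toy_movies = {"title": {0: "A", 1: "B", 2: "C"}}
toy_top = {0: [1, 2], 1: [0, 2], 2: [1, 0]}

with tempfile.TemporaryDirectory() as tmp:
    movies_path = os.path.join(tmp, "movies_dict.pkl")
    top_path = os.path.join(tmp, "similarity.pkl")

    # Write both objects; the context manager closes each file for us.
    with open(movies_path, "wb") as f:
        pickle.dump(toy_movies, f)
    with open(top_path, "wb") as f:
        pickle.dump(toy_top, f)

    # Read them back, as a downstream script would.
    with open(movies_path, "rb") as f:
        loaded_movies = pickle.load(f)
    with open(top_path, "rb") as f:
        loaded_top = pickle.load(f)

# Turn the stored indices for movie 0 back into titles.
recs = [loaded_movies["title"][i] for i in loaded_top[0]]
print(recs)
```

Storing only the top indices per movie keeps the file far smaller than the full similarity matrix. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.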

In [None]:
# The context manager closes the file once the dump finishes
with open('movies_dict.pkl','wb') as f:
    pickle.dump(mov_rev.to_dict(),f)

In [None]:
top_5_similarities = {}

for i in range(len(similarity)):

    sim_scores = list(enumerate(similarity[i]))

    # remove self similarity
    sim_scores = [x for x in sim_scores if x[0] != i]

    sim_scores = sorted(sim_scores,
                        key=lambda x: x[1],
                        reverse=True)[:5]

    top_5_similarities[i] = [j[0] for j in sim_scores]

# Store only the top-5 indices per movie instead of the full similarity matrix
with open('similarity.pkl','wb') as f:
    pickle.dump(top_5_similarities, f)