### Data Science Project: 
# "Movie Recommendation System"
#### - CodXo

##### Project: Movie Recommendation System
Description: Build a recommendation system for movies based on user
preferences. Implement collaborative filtering or content-based filtering and
display personalized recommendations on a web interface.

Tech Stack: Python (Pandas, Scikit-learn), Flask, HTML/CSS for the
frontend

Outline:
- Week 1: Data preparation and understanding algorithms
- Week 2: Implement recommendation logic
- Week 3: Build frontend with Flask and display results
- Week 4: Testing and optimization

In [54]:
# Importing all necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="pandas")



#### Load the Data: Load your dataset using Pandas. 


In [5]:
# load dataset

data=pd.read_csv("C:\\Users\\dell\\Desktop\\CodXo Internship\\VS CodXo MovieRecommendationSystem\\movies.csv")
data

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4798,4798,220000,Action Crime Thriller,,9367,united states\u2013mexico barrier legs arms pa...,es,El Mariachi,El Mariachi just wants to play his guitar and ...,14.269792,...,81.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,"He didn't come looking for trouble, but troubl...",El Mariachi,6.6,238,Carlos Gallardo Jaime de Hoyos Peter Marquardt...,"[{'name': 'Robert Rodriguez', 'gender': 0, 'de...",Robert Rodriguez
4799,4799,9000,Comedy Romance,,72766,,en,Newlyweds,A newlywed couple's honeymoon is upended by th...,0.642552,...,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,Newlyweds,5.9,5,Edward Burns Kerry Bish\u00e9 Marsha Dietlein ...,"[{'name': 'Edward Burns', 'gender': 2, 'depart...",Edward Burns
4800,4800,0,Comedy Drama Romance TV Movie,http://www.hallmarkchannel.com/signedsealeddel...,231617,date love at first sight narration investigati...,en,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...",1.444476,...,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,"Signed, Sealed, Delivered",7.0,6,Eric Mabius Kristin Booth Crystal Lowe Geoff G...,"[{'name': 'Carla Hetland', 'gender': 0, 'depar...",Scott Smith
4801,4801,0,,http://shanghaicalling.com/,126186,,en,Shanghai Calling,When ambitious New York attorney Sam is sent t...,0.857008,...,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Yorker in Shanghai,Shanghai Calling,5.7,7,Daniel Henney Eliza Coupe Bill Paxton Alan Ruc...,"[{'name': 'Daniel Hsia', 'gender': 2, 'departm...",Daniel Hsia


### Step 1: Data Preparation and Understanding Algorithms


#### Data Cleaning: Check for missing values and data types. Handle any missing values, especially in columns like genres, budget, and title.

In [8]:
# display the shape

display(data.shape)
print()

(4803, 24)




In [9]:
# Check for missing values

print(data.isnull().sum())

index                      0
budget                     0
genres                    28
homepage                3091
id                         0
keywords                 412
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
cast                      43
crew                       0
director                  30
dtype: int64
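
The cleaning step above also calls for checking data types, which the missing-value count alone does not show. A minimal sketch of a combined summary, using a small made-up table (`toy` and its values are hypothetical stand-ins for the movies data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies table (values are invented for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Action", None, "Drama", "Comedy"],
    "budget": [100, 0, 250, 50],
    "runtime": [120.0, np.nan, 95.0, 101.0],
})

# One row per column: data type, missing count, and missing share in percent
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
})
print(summary)
```

Seeing the type next to the missing count helps decide whether a column should be filled with a placeholder string (text columns) or handled numerically.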


In [10]:
# Drop rows where 'genres' is missing

data=data.dropna(subset=['genres'])


In [11]:
# Check the updated missing values

print(data.isnull().sum())


index                      0
budget                     0
genres                     0
homepage                3068
id                         0
keywords                 386
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               0
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  819
title                      0
vote_average               0
vote_count                 0
cast                      27
crew                       0
director                  16
dtype: int64


In [12]:
print(data['genres'])

0       Action Adventure Fantasy Science Fiction
1                       Adventure Fantasy Action
2                         Action Adventure Crime
3                    Action Crime Drama Thriller
4               Action Adventure Science Fiction
                          ...                   
4797                            Foreign Thriller
4798                       Action Crime Thriller
4799                              Comedy Romance
4800               Comedy Drama Romance TV Movie
4802                                 Documentary
Name: genres, Length: 4775, dtype: object
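
Note that the last label printed above is 4802 although only 4775 rows remain: `dropna` keeps the original index labels. A small sketch of this effect on made-up data, with `reset_index(drop=True)` as one possible way to realign labels with row positions:

```python
import pandas as pd

# Hypothetical four-row table with two missing genres
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Action", None, "Drama", None],
})

cleaned = toy.dropna(subset=["genres"])
print(list(cleaned.index))      # labels keep their original values

realigned = cleaned.reset_index(drop=True)
print(list(realigned.index))    # labels now match row positions
```

This matters later, because a similarity matrix built from the cleaned data is addressed by row position, not by the original labels.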


In [56]:
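# Fill missing values in categorical text columns with placeholder strings.
# Assign the result back: fillna(inplace=True) on a single column is chained
# assignment and may not update `data` in recent pandas versions.

data['homepage'] = data['homepage'].fillna('No Homepage')
data['keywords'] = data['keywords'].fillna('No Keywords')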


In [14]:
# Check the updated missing values

print(data.isnull().sum())


index                     0
budget                    0
genres                    0
homepage                  0
id                        0
keywords                  0
original_language         0
original_title            0
overview                  3
popularity                0
production_companies      0
production_countries      0
release_date              0
revenue                   0
runtime                   2
spoken_languages          0
status                    0
tagline                 819
title                     0
vote_average              0
vote_count                0
cast                     27
crew                      0
director                 16
dtype: int64


In [58]:
# Fill missing values for textual columns (assign back instead of inplace=True)

data['overview'] = data['overview'].fillna('No Overview')
data['tagline'] = data['tagline'].fillna('No Tagline')
data['cast'] = data['cast'].fillna('Unknown Cast')
data['director'] = data['director'].fillna('Unknown Director')


In [16]:
# Check the updated missing values
print(data.isnull().sum())


index                   0
budget                  0
genres                  0
homepage                0
id                      0
keywords                0
original_language       0
original_title          0
overview                0
popularity              0
production_companies    0
production_countries    0
release_date            0
revenue                 0
runtime                 2
spoken_languages        0
status                  0
tagline                 0
title                   0
vote_average            0
vote_count              0
cast                    0
crew                    0
director                0
dtype: int64
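
The column-by-column fills above can also be written as a single call by passing a dictionary to `DataFrame.fillna`. A minimal sketch on made-up data (the `toy` frame and the `placeholders` name are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in different text columns
toy = pd.DataFrame({
    "overview": ["A heist goes wrong.", np.nan],
    "tagline": [np.nan, "Trust no one."],
    "cast": ["Actor One Actor Two", np.nan],
})

# Map each column to its placeholder and fill them all in one call
placeholders = {
    "overview": "No Overview",
    "tagline": "No Tagline",
    "cast": "Unknown Cast",
}
toy = toy.fillna(value=placeholders)
print(toy)
```

Keeping the placeholders in one dictionary makes it easy to see, and later change, what each column is filled with.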


In [17]:
# Drop the rows that are still missing a 'runtime' value

data=data.dropna(subset=['runtime'])


In [18]:
# Check the updated missing values after dropping rows with missing 'runtime'

print(data.isnull().sum())

index                   0
budget                  0
genres                  0
homepage                0
id                      0
keywords                0
original_language       0
original_title          0
overview                0
popularity              0
production_companies    0
production_countries    0
release_date            0
revenue                 0
runtime                 0
spoken_languages        0
status                  0
tagline                 0
title                   0
vote_average            0
vote_count              0
cast                    0
crew                    0
director                0
dtype: int64


### Step 2: Movie Recommendation System

For a movie recommendation system, both content-based filtering and collaborative filtering are valid approaches. However, given the tech stack and the goal of personalizing recommendations based on user preferences, content-based filtering is the recommended starting point.

##### Why Content-Based Filtering?
- User-specific recommendations: It works well when you want to recommend movies similar to ones the user already likes.
- Feature-rich dataset: You have features like genres, keywords, cast, and director, which are ideal for a content-based approach.
- Easier to implement: Compared to collaborative filtering, it's simpler to set up with the dataset you have.

##### Steps for Content-Based Filtering:
1. Data preparation: Convert textual features like genres, keywords, and cast into numerical representations using techniques like TF-IDF or Count Vectorizer.
2. Cosine Similarity: Use this to calculate the similarity between movies.
3. Recommendation logic: For a given movie, recommend similar movies based on the similarity score.
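
The three steps above can be sketched end to end on a tiny, invented catalogue. This is only an illustration under assumed data: the `movies` frame, its titles and descriptions, and the `most_similar` helper are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini catalogue; titles and descriptions are invented
movies = pd.DataFrame({
    "title": ["Space War", "Galaxy Battle", "Love in Paris", "Paris Romance"],
    "text": [
        "action space alien war",
        "action space alien battle",
        "romance paris love",
        "romance paris wedding",
    ],
})

# Step 1: turn each description into a TF-IDF vector
matrix = TfidfVectorizer().fit_transform(movies["text"])

# Step 2: similarity between every pair of movies
sim = cosine_similarity(matrix)

# Step 3: for one movie, rank the others by similarity
def most_similar(position, top_n=1):
    order = sim[position].argsort()[::-1]          # highest score first
    order = [i for i in order if i != position]    # drop the movie itself
    return movies["title"].iloc[order[:top_n]].tolist()

print(most_similar(0))  # ['Galaxy Battle']
print(most_similar(2))  # ['Paris Romance']
```

The two space movies share three terms and nothing with the romance titles, so each group ends up recommending its own members.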
##### Next Step: Data Preparation and Feature Extraction
We'll combine the textual columns (genres, keywords, cast, director) into a single string per movie and then vectorize it using TF-IDF. First, confirm that these columns exist and contain data:

In [20]:
# Check the original DataFrame to see if the columns exist and contain data
print(data.head(10))  # Print first 10 rows of the original DataFrame

# Check if the specified columns exist
print(data.columns)  # List all columns in the DataFrame

   index     budget                                    genres  \
0      0  237000000  Action Adventure Fantasy Science Fiction   
1      1  300000000                  Adventure Fantasy Action   
2      2  245000000                    Action Adventure Crime   
3      3  250000000               Action Crime Drama Thriller   
4      4  260000000          Action Adventure Science Fiction   
5      5  258000000                  Fantasy Action Adventure   
6      6  260000000                          Animation Family   
7      7  280000000          Action Adventure Science Fiction   
8      8  250000000                  Adventure Fantasy Family   
9      9  250000000                  Action Adventure Fantasy   

                                            homepage      id  \
0                        http://www.avatarmovie.com/   19995   
1       http://disney.go.com/disneypictures/pirates/     285   
2        http://www.sonypictures.com/movies/spectre/  206647   
3                 http://www

In [21]:

# Combine the text features into one 'soup' string per movie
# (fillna('') guards against any NaN values that may remain)
data['soup'] = data['genres'].fillna('') + ' ' + data['keywords'].fillna('') + ' ' + data['cast'].fillna('') + ' ' + data['director'].fillna('')

# Print a few rows of 'soup' to check its content
print(data['soup'].head(10))

0    Action Adventure Fantasy Science Fiction cultu...
1    Adventure Fantasy Action ocean drug abuse exot...
2    Action Adventure Crime spy based on novel secr...
3    Action Crime Drama Thriller dc comics crime fi...
4    Action Adventure Science Fiction based on nove...
5    Fantasy Action Adventure dual identity amnesia...
6    Animation Family hostage magic horse fairy tal...
7    Action Adventure Science Fiction marvel comic ...
8    Adventure Fantasy Family witch magic broom sch...
9    Action Adventure Fantasy dc comics vigilante s...
Name: soup, dtype: object
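
One thing to watch when building the soup: placeholder strings inserted during cleaning, such as 'Unknown Cast' or 'No Keywords', become ordinary tokens once they are concatenated. The sketch below, on invented strings, suggests how two otherwise unrelated movies can pick up a non-zero similarity purely from a shared placeholder; `pairwise_sim` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_sim(texts):
    """TF-IDF vectorize the texts and return their cosine similarity matrix."""
    matrix = TfidfVectorizer(stop_words="english").fit_transform(texts)
    return cosine_similarity(matrix)

# Two unrelated movies that both lack cast information, plus one with a cast
with_placeholder = ["Horror Unknown Cast", "Comedy Unknown Cast", "Drama Alice Smith"]
without_placeholder = ["Horror", "Comedy", "Drama Alice Smith"]

sim_a = pairwise_sim(with_placeholder)[0, 1]
sim_b = pairwise_sim(without_placeholder)[0, 1]

print("with placeholder:   ", round(float(sim_a), 3))
print("without placeholder:", round(float(sim_b), 3))
```

If this effect is unwanted, one option is to fill the columns that feed the soup with an empty string instead, and keep descriptive placeholders only for display columns.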


#### Next Steps
Now that you have the 'soup' column ready, you can proceed with the following steps in your movie recommendation system project:

#### 1. Vectorization: Convert the text in the 'soup' column into a numerical format that can be used by machine learning algorithms. You can use the TF-IDF vectorizer as planned.

In [23]:
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Use TF-IDF to convert the text data to numerical form
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(data['soup'])

# Save the TF-IDF matrix to a file
with open('tfidf_matrix.pkl', 'wb') as file:
    pickle.dump(tfidf_matrix, file)

# Compute and save the cosine similarity matrix as well
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

with open('cosine_sim.pkl', 'wb') as file:
    pickle.dump(cosine_sim, file)

# Output the shape of the TF-IDF matrix
print("TF-IDF Matrix Shape:", tfidf_matrix.shape)


TF-IDF Matrix Shape: (4773, 14734)


#### 2. Calculate Similarities: Use a similarity metric (like cosine similarity) on the TF-IDF matrix to find similar movies based on the combined text features.

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Output shape of cosine similarity matrix
print("Cosine Similarity Matrix Shape:", cosine_sim.shape)


Cosine Similarity Matrix Shape: (4773, 4773)
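
Cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below checks that definition by hand on invented strings. It also shows that, because `TfidfVectorizer` L2-normalizes each row by default, a plain dot product (`linear_kernel`) yields the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three invented documents
docs = ["action space war", "action space colony", "romance comedy"]
matrix = TfidfVectorizer().fit_transform(docs)

# Cosine similarity by hand for documents 0 and 1:
# dot(a, b) / (norm(a) * norm(b))
a = matrix[0].toarray().ravel()
b = matrix[1].toarray().ravel()
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

full = cosine_similarity(matrix)

# Rows are already unit length, so the dot product alone is enough
dot_only = linear_kernel(matrix, matrix)

print(np.isclose(manual, full[0, 1]))
print(np.allclose(full, dot_only))
```

On a matrix of a few thousand movies either function is fine; the dot-product form simply skips a normalization step that has already been done.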


#### Next Steps for Your Movie Recommendation System
Now that you have the similarity scores calculated, you can proceed with the following steps:

#### 1. Create a Function to Get Movie Recommendations: You can create a function that takes a movie title as input and returns a list of recommended movies based on their similarity.

In [27]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Find the row position of the movie that matches the title.
    # Rows were dropped during cleaning, so index labels no longer line up
    # with the positions used by cosine_sim.
    matches = np.flatnonzero((data['original_title'] == title).to_numpy())
    if len(matches) == 0:
        raise ValueError(f"Movie title not found: {title}")
    idx = matches[0]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the queried movie itself
    sim_scores = [s for s in sim_scores if s[0] != idx][:10]

    # Get the movie positions
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the top 10 most similar movies
    return data['original_title'].iloc[movie_indices]
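
Because rows were dropped during cleaning, index labels and row positions in the data can differ, while the similarity matrix is addressed by position. A title-to-position lookup keeps the two consistent. The sketch below uses a toy catalogue with gaps in its index; `titles`, `sim`, `position_of` and `recommend` are hypothetical names, and it assumes titles are unique.

```python
import numpy as np
import pandas as pd

# Toy catalogue whose index labels have gaps, as after dropna()
titles = pd.Series(["Alpha", "Beta", "Gamma", "Delta"], index=[0, 1, 3, 7])

# Hypothetical 4x4 similarity matrix, one row/column per remaining movie
sim = np.array([
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.4],
    [0.1, 0.8, 0.4, 1.0],
])

# Map each title to its position (0..n-1), not to its index label
position_of = pd.Series(np.arange(len(titles)), index=titles.values)

def recommend(title, top_n=2):
    if title not in position_of.index:
        return []                           # unknown title: nothing to suggest
    pos = position_of[title]
    order = np.argsort(sim[pos])[::-1]      # most similar first
    order = order[order != pos][:top_n]     # skip the movie itself
    return titles.iloc[order].tolist()

print(recommend("Delta"))   # ['Beta', 'Gamma']
print(recommend("Nope"))    # []
```

"Delta" carries the label 7, which would be out of range for a 4x4 matrix; looking it up by position (3) avoids that mismatch. Returning an empty list for an unknown title is one design choice that suits a web interface, where a missing match should not raise an error.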


#### 2. Test the Recommendation Function: Call the function with a movie title to see the recommendations.

In [29]:
print(data['original_title'].head(10))  # Display the first 10 movie titles


0                                      Avatar
1    Pirates of the Caribbean: At World's End
2                                     Spectre
3                       The Dark Knight Rises
4                                 John Carter
5                                Spider-Man 3
6                                     Tangled
7                     Avengers: Age of Ultron
8      Harry Potter and the Half-Blood Prince
9          Batman v Superman: Dawn of Justice
Name: original_title, dtype: object


In [30]:
recommendations = get_recommendations('The American')
print(recommendations)


805     Ghost Rider: Spirit of Vengeance
2616                           In Bruges
2588                   A Most Wanted Man
3438                             Control
795                         Leatherheads
1994                   The Ides of March
586                    The Monuments Men
2726                      Under the Skin
314                 The Spanish Prisoner
1688     Confessions of a Dangerous Mind
Name: original_title, dtype: object
