**TASKS**

1. **Load and Preprocess the Data**:

a. Import the necessary libraries and load the dataset.

b. Convert the 'genres' column from a JSON format to a string of genre names.

c. Handle missing values (if any), and combine the 'overview' and 'tagline' text fields to create a comprehensive description of each movie.

2. **Feature Engineering**:

Use TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text data into a numerical format. TF-IDF weights each word by how often it appears in a movie's description and tagline and by how rare it is across the dataset, so distinctive words count for more than common ones.

3. **Calculate Similarity**:

Use cosine similarity to measure how similar movies are based on their TF-IDF features. Cosine similarity is the cosine of the angle between two non-zero vectors in a multi-dimensional space (here, our TF-IDF vectors): a value of 1 means the vectors point in the same direction, and a value of 0 means the two descriptions share no weighted terms.

4. **Recommendation Function**:

Create a function that takes a movie title as input and outputs a list of most similar movies based on their similarity scores.
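
As a rough illustration of the similarity measure in step 3, here is a minimal sketch that computes cosine similarity by hand with NumPy. The three vectors are hypothetical term weights invented for this example, not values from the dataset, and `cosine_similarity_pair` is a helper name introduced only here.

```python
import numpy as np

def cosine_similarity_pair(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-weight vectors for three tiny "documents"
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a, twice the length
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

sim_ab = cosine_similarity_pair(doc_a, doc_b)  # same direction: close to 1
sim_ac = cosine_similarity_pair(doc_a, doc_c)  # no overlap: 0
```

Because only the direction of the vectors matters, a long and a short description that use the same words in the same proportions score as highly similar.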

In [62]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [48]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [49]:
df = pd.read_csv('/content/drive/MyDrive/movies_metadata.csv', low_memory=False)
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0




In [50]:
df['release_date'].isnull().sum()

87

In [51]:
df.dropna(subset=['release_date'], inplace=True)

In [52]:
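# Replace the malformed value '1' with NaT, then parse the column as datetimes.
# Assigning the result back avoids calling replace(inplace=True) on a column selection.
df['release_date'] = df['release_date'].replace('1', pd.NaT)
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')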

In [53]:
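# Import the classes directly; a preceding `import datetime` would only be shadowed by this line
from datetime import datetime, date
# The column is already datetime64 after the previous cell, so this call leaves it unchanged
df['release_date'] = pd.to_datetime(df['release_date'])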

In [54]:
df['release_date']

0       1995-10-30
1       1995-12-15
2       1995-12-22
3       1995-12-22
4       1995-02-10
           ...    
45460   1991-05-13
45462   2011-11-17
45463   2003-08-01
45464   1917-10-21
45465   2017-06-09
Name: release_date, Length: 45379, dtype: datetime64[ns]

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45379 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   adult                  45379 non-null  object        
 1   belongs_to_collection  4491 non-null   object        
 2   budget                 45379 non-null  object        
 3   genres                 45379 non-null  object        
 4   homepage               7769 non-null   object        
 5   id                     45379 non-null  object        
 6   imdb_id                45365 non-null  object        
 7   original_language      45368 non-null  object        
 8   original_title         45379 non-null  object        
 9   overview               44438 non-null  object        
 10  popularity             45377 non-null  object        
 11  poster_path            45040 non-null  object        
 12  production_companies   45379 non-null  object        
 13  p

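Since the dataset is large, we'll start by loading only a portion of it (the first 1000 rows) to understand its structure better and to decide on the preprocessing steps. This reload replaces the DataFrame explored above. We'll focus on the 'genres', 'overview', and 'tagline' columns for our content-based system, and keep 'title' so that movies can be looked up by name.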

In [55]:
large_sample_size = 1000
df = pd.read_csv('/content/drive/MyDrive/movies_metadata.csv', usecols=['title', 'genres', 'overview', 'tagline'], nrows=large_sample_size)
df.head()

Unnamed: 0,genres,overview,tagline,title
0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","Led by Woody, Andy's toys live happily in his ...",,Toy Story
1,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,Jumanji
2,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men
3,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","Cheated on, mistreated and stepped on, the wom...",Friends are the people who let you be yourself...,Waiting to Exhale
4,"[{'id': 35, 'name': 'Comedy'}]",Just when George Banks has recovered from his ...,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II


Let's start by preprocessing the data. We will transform the 'genres' column, handle missing values, and combine 'overview' and 'tagline'.

In [56]:
from ast import literal_eval

# Preprocessing function for the 'genres' column
def preprocess_genres(genres_str):
    genres_list = literal_eval(genres_str) if pd.notnull(genres_str) else []
    return ' '.join([genre['name'] for genre in genres_list])

# Preprocessing the 'genres', 'overview', and 'tagline' columns
df['genres'] = df['genres'].apply(preprocess_genres)
df['overview'] = df['overview'].fillna('')
df['tagline'] = df['tagline'].fillna('')

# Combining 'overview' and 'tagline' into a single column
df['description'] = df['overview'] + ' ' + df['tagline']

# Displaying the processed data
df.head()

Unnamed: 0,genres,overview,tagline,title,description
0,Animation Comedy Family,"Led by Woody, Andy's toys live happily in his ...",,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,Adventure Fantasy Family,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,Jumanji,When siblings Judy and Peter discover an encha...
2,Romance Comedy,A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,Comedy Drama Romance,"Cheated on, mistreated and stepped on, the wom...",Friends are the people who let you be yourself...,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,Comedy,Just when George Banks has recovered from his ...,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,Just when George Banks has recovered from his ...
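
To make the 'genres' transformation concrete, the following is a small sketch of the same idea applied to individual strings. It is a defensive variant, not the function used above: `genre_names` is a hypothetical helper that returns an empty string when a value cannot be parsed, and the second sample string is made up for illustration.

```python
from ast import literal_eval

def genre_names(genres_str):
    """Return space-separated genre names, or '' if the value cannot be parsed."""
    try:
        parsed = literal_eval(genres_str)
    except (ValueError, SyntaxError, TypeError):
        return ''
    if not isinstance(parsed, list):
        return ''
    # Keep only well-formed entries that actually carry a 'name' key
    return ' '.join(g['name'] for g in parsed if isinstance(g, dict) and 'name' in g)

single = genre_names("[{'id': 35, 'name': 'Comedy'}]")
# Made-up ids and names in the same shape as the dataset's 'genres' strings
double = genre_names("[{'id': 1, 'name': 'A'}, {'id': 2, 'name': 'B'}]")
malformed = genre_names('not a list')   # unparsable text
missing = genre_names(float('nan'))     # missing value as pandas would store it
```

Returning an empty string for bad rows keeps a later `df['genres'].apply(...)` call from failing on a single malformed record, at the cost of silently dropping that row's genre information.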


The next step is feature engineering, where we'll use the TF-IDF (Term Frequency-Inverse Document Frequency) technique to convert the text data in the 'description' column into a numerical format. TF-IDF is effective in this context as it reflects the importance of words in the descriptions relative to the dataset, allowing us to measure textual similarity between movies.

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Creating a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fitting and transforming the 'description' column to a TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(df['description'])

# Displaying the shape of the TF-IDF matrix
tfidf_matrix.shape


(1000, 9779)

The TF-IDF transformation has been applied to the 'description' column, resulting in a matrix with 1000 rows (one for each movie in our sample) and 9779 features (each representing a unique word in the dataset, excluding common English stop words).

Now, we'll calculate the cosine similarity between each pair of movies based on their TF-IDF vectors. Cosine similarity measures how similar two documents are irrespective of their length, because it depends only on the direction of the vectors and not on their magnitude. This makes it suitable for comparing descriptions of very different lengths.

After computing the similarity scores, we'll create a function that takes a movie title as input and returns a list of most similar movies based on their similarity scores.

In [59]:
from sklearn.metrics.pairwise import linear_kernel

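# Computing the cosine similarity matrix. TfidfVectorizer L2-normalizes each row
# by default, so the dot product computed by linear_kernel equals cosine similarity.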
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

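# Creating a mapping from movie title to index to use in the recommendation function.
# Keep the first occurrence of each title so that a lookup returns a single index
# (drop_duplicates() would compare the index values, which are already unique).
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]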
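
The cell above calls `linear_kernel` rather than `cosine_similarity`. The sketch below checks, on a made-up corpus, why the two agree for TF-IDF output: with the default `norm='l2'`, every row already has unit length, so the plain dot product is the cosine. The corpus and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical descriptions; each contains at least one non-stop word
docs = [
    "children roll magic dice in the jungle",
    "a family escapes a jungle maze",
    "two strangers meet in a harbor town",
]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Euclidean length of each row; all ones under the default L2 normalization
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

# Dot products and cosine similarities coincide for unit-length rows
same = np.allclose(linear_kernel(tfidf, tfidf), cosine_similarity(tfidf, tfidf))
```

Skipping the normalization step that `cosine_similarity` performs is why `linear_kernel` is often preferred here. The equivalence would no longer hold if the vectorizer were created with `norm=None`.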

In [60]:
# Recommendation function
def recommend_movies(title, cosine_sim=cosine_sim, indices=indices, df=df):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

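    # Drop the query movie itself (it is not guaranteed to sort first, e.g. when
    # its description is empty), then keep the 10 most similar movies
    sim_scores = [score for score in sim_scores if score[0] != idx][:10]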

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

In [61]:
# Testing the recommendation function with an example
recommend_movies('Jumanji')

8              Sudden Death
363                Maverick
954                  Picnic
976    D3: The Mighty Ducks
541       Super Mario Bros.
881      North by Northwest
894        My Favorite Year
839             Small Faces
591         Window to Paris
96                 Shopping
Name: title, dtype: object

In [44]:
recommend_movies('Waiting to Exhale')

411                        Bad Girls
615                    Condition Red
179          Moonlight and Valentino
338            The Baby-Sitters Club
810                       Phat Beach
780                  Harriet the Spy
724                          Thinner
386                    Jason's Lyric
998    Robin Hood: Prince of Thieves
215                 Boys on the Side
Name: title, dtype: object

These recommendations are based on textual similarity between each query movie (here "Jumanji" and "Waiting to Exhale") and the other movies in the 1000-row sample. The system can be further refined by adjusting the TF-IDF parameters or by incorporating additional features into the model, such as the processed 'genres' column, which is not yet part of the description. You can test the recommendation function with other movie titles from your dataset to see how it performs with different inputs.
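
As one possible direction for such refinements, the sketch below appends genre names to the description text, returns similarity scores alongside titles, and returns an empty result for unknown titles instead of raising a `KeyError`. Everything in it is hypothetical: the four-movie catalogue is invented, and `recommend_with_scores` is an illustrative name rather than part of the notebook above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy catalogue; titles and text are made up for illustration
toy = pd.DataFrame({
    'title': ['Jungle Dice', 'Dice Quest', 'Quiet Harbor', 'Harbor Lights'],
    'genres': ['Adventure Fantasy', 'Adventure Family', 'Drama Romance', 'Drama'],
    'description': [
        'children roll magic dice and wild jungle animals appear',
        'a family rolls enchanted dice to escape a jungle maze',
        'two strangers fall in love in a sleepy harbor town',
        'a fisherman returns to the harbor town he left behind',
    ],
})

# Assumption: genre names are simply appended to the description text
toy['text'] = toy['description'] + ' ' + toy['genres']

toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy['text'])
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)
toy_index = pd.Series(range(len(toy)), index=toy['title'])

def recommend_with_scores(title, n=2):
    """Return up to n rows of (title, score), or an empty frame for unknown titles."""
    if title not in toy_index.index:
        return pd.DataFrame(columns=['title', 'score'])
    pos = toy_index[title]
    # Exclude the query movie by position, then take the n largest scores
    scores = pd.Series(toy_sim[pos]).drop(pos).nlargest(n)
    return pd.DataFrame({'title': toy['title'].iloc[scores.index].values,
                         'score': scores.values})

result = recommend_with_scores('Jungle Dice')
unknown = recommend_with_scores('No Such Movie')
```

Returning the scores makes it easier to judge whether a recommendation reflects real overlap or is just the least dissimilar movie in a small sample.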