# Section 1: Importing Libraries
--- 
**Description:**

This section imports the libraries needed for data manipulation, natural language processing, and machine learning. We use NumPy and pandas for data handling, ast for parsing stringified Python objects, nltk for text processing (specifically the Porter stemmer), scikit-learn for feature extraction and similarity computation, and pickle for saving the processed data. With these imports in place, all dependencies are available for the subsequent steps.

In [1]:
import numpy as np          # For numerical computations
import pandas as pd         # For data manipulation and analysis
import ast                  # For safely evaluating string representations of Python objects
import nltk                 # Natural Language Toolkit for text processing
from nltk.stem.porter import PorterStemmer                  # For stemming words to their root form
from sklearn.feature_extraction.text import CountVectorizer # To convert text data into numerical vectors
from sklearn.metrics.pairwise import cosine_similarity      # To compute cosine similarity between vectors
import pickle                                               # For saving and loading Python objects (e.g., models, data)

# Section 2: Loading the Datasets
---
**Description:**

In this section, we load two CSV datasets: one containing movie details and the other with movie credit information. These datasets are read using pandas' read_csv function from specified local file paths. Having the datasets loaded is the first step before any preprocessing or merging.

In [2]:
# Load the movies and credits datasets from specified file paths
movies = pd.read_csv(r"D:\Adinath's Coding Work\ML Projects\Movie-Recommendation-System\Data Set Files\tmdb_5000_movies.csv")
credits = pd.read_csv(r"D:\Adinath's Coding Work\ML Projects\Movie-Recommendation-System\Data Set Files\tmdb_5000_credits.csv")
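The absolute Windows paths above only resolve on the author's machine. As a minimal sketch of a more portable layout, the example below builds the file paths from a single base folder with `pathlib`. `DATA_DIR`, `movies_demo`, and `credits_demo` are hypothetical names, and a temporary folder with two tiny stand-in CSV files is used so the snippet runs without the real TMDB data.

```python
from pathlib import Path
import tempfile

import pandas as pd

# DATA_DIR is a hypothetical base folder; a temporary directory stands in for it here.
DATA_DIR = Path(tempfile.mkdtemp())

# Write two tiny stand-in files so the sketch runs without the real datasets.
(DATA_DIR / "tmdb_5000_movies.csv").write_text("id,title\n19995,Avatar\n")
(DATA_DIR / "tmdb_5000_credits.csv").write_text("movie_id,title\n19995,Avatar\n")

# Paths are built relative to DATA_DIR, so only one line changes per machine.
movies_demo = pd.read_csv(DATA_DIR / "tmdb_5000_movies.csv")
credits_demo = pd.read_csv(DATA_DIR / "tmdb_5000_credits.csv")
print(movies_demo.shape, credits_demo.shape)
```

In a real project, `DATA_DIR` would typically point at a folder inside the repository rather than a temporary directory.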

# Section 3: Initial Data Exploration
---
**Description:**

This section provides a quick overview of both datasets by displaying the first row of each. The head() function is used to examine the structure and content of the movies and credits DataFrames. This helps in understanding the data before performing further operations.

In [3]:
# Display the first row of the movies dataset to inspect its structure
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
# Display the first row of the credits dataset for an initial look at its structure
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


# Section 4: Merging and Selecting Relevant Columns
---
**Description:**

Here, we merge the movies and credits DataFrames on their shared "title" column, then keep only the columns relevant to the recommendation system: movie_id, title, overview, genres, keywords, cast, and crew. Note that titles are not guaranteed to be unique, so a merge on title can produce extra rows when two films share a name.

In [5]:
# Merge the movies and credits DataFrames using the 'title' column as the key
movies = movies.merge(credits, on= "title")

# Preview the merged DataFrame to confirm the merge was successful
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
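Merging on a text column deserves a quick sanity check, because repeated titles multiply rows. The sketch below uses invented toy data (two different films both called "Batman") to compare a merge on `title` with a merge on the numeric id columns. Whether keying on `id`/`movie_id` is appropriate for the real files is an assumption that should be verified against the data.

```python
import pandas as pd

# Invented toy data: two distinct films share the title "Batman".
movies_demo = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
credits_demo = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})

# Merging on title pairs every "Batman" row with every other "Batman" row.
by_title = movies_demo.merge(credits_demo, on="title")

# Merging on the id columns keeps exactly one row per film.
by_id = movies_demo.merge(
    credits_demo, left_on="id", right_on="movie_id", suffixes=("", "_credits")
)

print(len(by_title))  # 5 rows: 2 x 2 for "Batman" plus 1 for "Avatar"
print(len(by_id))     # 3 rows
```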


In [6]:
# Select a subset of columns essential for the recommendation system
movies = movies[["movie_id", "title", "overview", "genres", "keywords", "cast", "crew"]]
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


# Section 5: Data Cleaning: Handling Missing Values and Duplicates
---
**Description:**

In this section, we conduct data cleaning by checking for and handling missing values and duplicate records. We use isnull().sum() to count missing values, then drop rows with missing data using dropna(). Additionally, we check for duplicates to ensure the dataset’s integrity before further processing.

In [7]:
# Check for missing values in each column
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [8]:
# Drop rows that contain any missing values to ensure clean data for processing
movies.dropna(inplace=True)

In [9]:
# Verify that there are no duplicate rows in the DataFrame
int(movies.duplicated().sum())

0
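The three cleaning calls can be seen end to end on a small invented frame. The sketch below also resets the index after dropping rows. That extra step is not part of the notebook; it is shown because `dropna()` leaves gaps in the index labels, which matters whenever rows are later addressed by position.

```python
import pandas as pd

# Invented toy data: one missing overview and one exact duplicate row.
demo = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "overview": ["text a", "text b", "text b", None],
})

# Count missing values per column.
missing_per_column = demo.isnull().sum()

# Drop incomplete rows and exact duplicates, then renumber the index from 0.
cleaned = demo.dropna().drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```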

# Section 6: Converting String Representations to Python Objects
---
**Description:**

The genres, keywords, cast, and crew columns store lists of dictionaries as strings. This section defines helper functions that parse these strings into Python objects. The convert function extracts every name from genres and keywords, while convert3 keeps only the first three names and is applied to the cast column. Working with plain lists of names makes the later text processing easier.

In [10]:
# Helper function to convert a stringified list of dictionaries into a list of names
def convert(obj):
    L = []
    # Convert the string representation to a list and extract the 'name' field from each dictionary
    for i in ast.literal_eval(obj):       # ast.literal_eval converts the string '[...]' into an actual list
        L.append(i['name'])
    return L

In [11]:
# Apply the convert function to the 'genres' and 'keywords' columns
movies['genres'] = movies['genres'].apply(convert)
movies['keywords'] = movies['keywords'].apply(convert)
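To make the conversion concrete, here is a small standalone sketch that runs the same idea on a sample string shaped like the genres column. `names_from` is a hypothetical helper written as a list comprehension; it is meant as an equivalent illustration, not a replacement for `convert`.

```python
import ast

# A sample string shaped like one cell of the 'genres' column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def names_from(obj):
    """Return the 'name' of every dictionary in a stringified list."""
    return [item["name"] for item in ast.literal_eval(obj)]

print(names_from(sample))  # ['Action', 'Adventure']
```

`ast.literal_eval` only evaluates Python literals (strings, numbers, lists, dictionaries, and so on), which is why it is preferred over `eval` for data read from a file.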


In [12]:
# Helper function to extract only the first three names from the stringified list
def convert3(obj):
    counter = 0
    L = []
    for i in ast.literal_eval(obj):
        if counter != 3:
            L.append(i['name'])
            counter += 1
        else:
            break
    return L

In [13]:
# Apply the convert3 function to the 'cast' column to limit extraction to three names
movies['cast'] = movies['cast'].apply(convert3)
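The counter-and-break loop in `convert3` can also be expressed with list slicing. The sketch below is one possible equivalent, using a hypothetical helper `first_names` and an invented cast string.

```python
import ast

# Invented sample shaped like one cell of the 'cast' column.
cast_sample = '[{"name": "A"}, {"name": "B"}, {"name": "C"}, {"name": "D"}]'

def first_names(obj, limit=3):
    """Return the 'name' field of at most `limit` leading entries."""
    return [item["name"] for item in ast.literal_eval(obj)[:limit]]

print(first_names(cast_sample))  # ['A', 'B', 'C']
```

Slicing also handles lists shorter than three entries without any special casing.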

In [14]:
# Check the structure of the 'crew' column by converting the first element from a string to a Python object
ast.literal_eval(movies['crew'][0])

[{'credit_id': '52fe48009251416c750aca23',
  'department': 'Editing',
  'gender': 0,
  'id': 1721,
  'job': 'Editor',
  'name': 'Stephen E. Rivkin'},
 {'credit_id': '539c47ecc3a36810e3001f87',
  'department': 'Art',
  'gender': 2,
  'id': 496,
  'job': 'Production Design',
  'name': 'Rick Carter'},
 {'credit_id': '54491c89c3a3680fb4001cf7',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Sound Designer',
  'name': 'Christopher Boyes'},
 {'credit_id': '54491cb70e0a267480001bd0',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Supervising Sound Editor',
  'name': 'Christopher Boyes'},
 {'credit_id': '539c4a4cc3a36810c9002101',
  'department': 'Production',
  'gender': 1,
  'id': 1262,
  'job': 'Casting',
  'name': 'Mali Finn'},
 {'credit_id': '5544ee3b925141499f0008fc',
  'department': 'Sound',
  'gender': 2,
  'id': 1729,
  'job': 'Original Music Composer',
  'name': 'James Horner'},
 {'credit_id': '52fe48009251416c750ac9c3',
  'department': 'Directing',
  

# Section 7: Extracting Director Information from the Crew Data
---
**Description:**

This section extracts the director's name from the crew data. A helper function `fetch_director` iterates through the crew list and returns a one-element list holding the name of the first entry whose job is "Director" (or an empty list if there is none). Only the director is retained, which simplifies the feature set for the recommendation engine.

In [15]:
# Define a function to fetch the director's name from the crew data
def fetch_director(obj):
    L = []
    # Iterate over the list to find the first dictionary with job 'Director'
    for i in ast.literal_eval(obj):
        if i.get('job') == 'Director':
            L.append(i.get('name'))
            break
    return L

In [16]:
# Replace the 'crew' column with the director's name extracted by fetch_director
movies['crew'] = movies['crew'].apply(fetch_director)
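As an illustration of the lookup, the sketch below applies the same "first entry with job Director" rule to a shortened crew string. `first_director` is a hypothetical variant built on `next()`; unlike the loop above, it returns an empty list when the matching entry has no name.

```python
import ast

# Shortened crew string with only the fields the lookup needs.
crew_sample = (
    '[{"job": "Editor", "name": "Stephen E. Rivkin"},'
    ' {"job": "Director", "name": "James Cameron"}]'
)

def first_director(obj):
    """Return a one-element list with the first director's name, or an empty list."""
    crew = ast.literal_eval(obj)
    # next() stops at the first match and falls back to None if nobody is a director.
    name = next((m.get("name") for m in crew if m.get("job") == "Director"), None)
    return [] if name is None else [name]

print(first_director(crew_sample))  # ['James Cameron']
```

Returning a list rather than a bare string keeps the column the same type as genres, keywords, and cast, so the lists can be concatenated later.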

# Section 8: Text Preprocessing and Tag Creation
---
**Description:**

In this step, we prepare the text data for feature extraction. The movie overview is split into individual words, and spaces are removed from tokens in genres, keywords, cast, and crew. These processed lists are then concatenated into a single "tags" column that aggregates all textual information for each movie, setting the stage for vectorization.

In [17]:
# Convert the overview into a list of words
movies['overview']=movies['overview'].apply(lambda x: x.split())

In [18]:
# Remove spaces from elements in genres, keywords, cast, and crew to create uniform tokens
movies['genres'] = movies['genres'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [i.replace(" ", "") for i in x])

In [19]:
# Combine all processed text fields into a single 'tags' column
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [20]:
# Preview the DataFrame to verify the new 'tags' column
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."
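The reason for removing spaces is easiest to see on a single row. In the sketch below (values abbreviated from the first row above, helper name `squash` hypothetical), multi-word names become single tokens, so "Sam Worthington" is no longer split into a "Sam" token that would also match "Sam Mendes".

```python
# Abbreviated values taken from the first row of the DataFrame.
overview = "In the 22nd century, a paraplegic Marine is dispatched".split()
genres = ["Action", "Science Fiction"]
cast = ["Sam Worthington"]
crew = ["James Cameron"]

def squash(names):
    """Remove spaces inside each name so it survives as one token."""
    return [n.replace(" ", "") for n in names]

# Concatenate the word list and the squashed name lists into one tag list.
tags = overview + squash(genres) + squash(cast) + squash(crew)
print(tags)
```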


# Section 9: Creating a New DataFrame and Further Text Processing
---
**Description:**

We now create a new DataFrame containing only the necessary columns: movie_id, title, and tags. The list of tags is converted into a single space-separated string and then transformed to lowercase. This standardization is crucial for accurate feature extraction in the next steps.

In [21]:
# Create a new DataFrame with selected columns for the recommendation system
new_df = movies[['movie_id', 'title','tags']]
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [22]:
# Convert the list of tags into a single string for each movie
new_df['tags']= new_df['tags'].apply(lambda x : " ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags']= new_df['tags'].apply(lambda x : " ".join(x))
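The warning above is pandas pointing out that `new_df` was created by selecting columns from `movies`, so an assignment into it might have been intended for the original frame. A common way to avoid it is to take an explicit copy when creating the subset. The sketch below shows this on invented data; `movies_demo` and `new_df_demo` are hypothetical names.

```python
import pandas as pd

# Invented one-row frame standing in for the processed movies DataFrame.
movies_demo = pd.DataFrame({
    "movie_id": [19995],
    "title": ["Avatar"],
    "overview": [["unused"]],
    "tags": [["In", "the", "22nd", "century,"]],
})

# .copy() makes the subset an independent DataFrame, so later assignments are unambiguous.
new_df_demo = movies_demo[["movie_id", "title", "tags"]].copy()
new_df_demo["tags"] = new_df_demo["tags"].apply(" ".join)

print(new_df_demo.loc[0, "tags"])  # In the 22nd century,
```

The original frame keeps its list-valued tags, since only the copy is modified.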


In [23]:
# Display the tags for the first movie to check the format
new_df['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [25]:
# Convert all tags to lowercase for consistency in text processing
new_df['tags'] = new_df['tags'].apply(lambda x: x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda x: x.lower())


In [26]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


# Section 10: Text Stemming
---
**Description:**

Stemming reduces words to a common stem (which is not always a dictionary word), so that inflected forms of the same word are counted together and the overall vocabulary shrinks. In this section, we define a function to stem each word in the tags column using the Porter Stemmer, and then apply this function to transform the tags.

In [27]:
# Initialize the Porter Stemmer
ps = PorterStemmer()

In [28]:
# Define a function to stem each word in the provided text
def stem(text):
    y = []
    # Split text into words and stem each word
    for i in text.split():
        y.append(ps.stem(i))

    return " ".join(y)

In [29]:
# Apply the stemming function to the 'tags' column in the new DataFrame
new_df['tags'] = new_df['tags'].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(stem)
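A short standalone check shows the effect of stemming: two inflected forms of the same verb end up as the same token. The names `ps_demo` and `stem_text` are hypothetical, and the sketch assumes nltk is installed (the Porter stemmer needs no additional corpus downloads).

```python
from nltk.stem.porter import PorterStemmer

ps_demo = PorterStemmer()

def stem_text(text):
    """Stem every whitespace-separated word and join the results back together."""
    return " ".join(ps_demo.stem(word) for word in text.split())

stemmed = stem_text("loved loving")
print(stemmed)  # both words are reduced to the same stem
```

Because both forms collapse to one token, the vectorizer in the next section counts them in the same column.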


# Section 11: Feature Extraction Using CountVectorizer
---
**Description:**

In this section, we transform the processed text into numerical feature vectors using CountVectorizer. The vectorizer keeps at most the 5000 most frequent terms (max_features=5000) and removes common English stop words. Each movie becomes a row of word counts, which will serve as the input for computing cosine similarity between movies.

In [30]:
# Initialize CountVectorizer with a maximum of 5000 features and removal of English stop words
cv = CountVectorizer(max_features=5000, stop_words='english')

In [31]:
# Convert the tags column into a feature vector representation (numerical matrix)
vectors = cv.fit_transform(new_df['tags']).toarray()

In [32]:
# Display the numerical vector for inspection
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(4806, 5000))

In [33]:
# Retrieve and display the feature names corresponding to the vectorized tokens
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      shape=(5000,), dtype=object)

# Section 12: Calculating Cosine Similarity
---
**Description:**

Cosine similarity measures the cosine of the angle between two vectors, effectively quantifying how similar the movies are based on their feature vectors. This section computes the cosine similarity matrix for all movies. Later, this similarity matrix is used to recommend movies that are contextually similar.

In [34]:
# Calculate cosine similarity between all movie vectors
similalarity = cosine_similarity(vectors)

In [35]:
# For the first movie, sort by similarity and display the top 5 matches (position 0 of the sorted list is the movie itself)
sorted(enumerate(similalarity[0]), key=lambda x: x[1], reverse=True)[1:6]

[(1214, np.float64(0.28676966733820225)),
 (2405, np.float64(0.26901379342448517)),
 (3728, np.float64(0.2605130246476754)),
 (507, np.float64(0.255608593705383)),
 (539, np.float64(0.25038669783359574))]
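For intuition, the score for a pair of movies is the dot product of their count vectors divided by the product of their lengths. The sketch below computes this by hand with NumPy on invented vectors; the helper name `cosine` is hypothetical and the function is an illustration of the formula, not scikit-learn's implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|). Assumes neither vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u = [1, 1, 0, 0]  # shares one word with v
v = [1, 0, 1, 0]
w = [0, 0, 0, 2]  # shares no words with u

print(cosine(u, v))  # approximately 0.5
print(cosine(u, w))  # 0.0
print(cosine(u, u))  # approximately 1.0
```

With non-negative word counts the score always lies between 0 (no shared words) and 1 (same direction), which matches the range of the values printed above.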

# Section 13: Building the Recommendation Function
---
**Description:**

This section defines a function named `recommend` that, given a movie title, retrieves similar movies based on the cosine similarity matrix. The function locates the row of the input movie, sorts all movies by their similarity to it, and prints the titles of the top 5 matches. This forms the core of the recommendation system.

In [36]:
def recommend(movie):
    """
    Prints the top 5 movies most similar to the given movie based on text similarity.
    
    Parameters:
    movie (str): The title of the movie for which to find similar movies.
    """
    # Find the row position of the movie in the new DataFrame.
    # The similarity matrix is addressed by position, while new_df's index labels
    # have gaps after dropna(), so the position is used rather than .index[0].
    movies_index = np.flatnonzero((new_df['title'] == movie).to_numpy())[0]
    # Retrieve similarity scores for the specified movie
    distances = similalarity[movies_index]

    # Sort movies by similarity score and exclude the first item (itself)
    movies_list = sorted(list(enumerate(distances)),reverse=True, key=lambda x: x[1])[1:6]

    # Print the titles of the recommended movies
    for i in movies_list:
        print(new_df.iloc[i[0]].title)

In [37]:
# Test the recommendation function with a sample movie
recommend('Batman Begins')

The Dark Knight
Batman
Batman
The Dark Knight Rises
10th & Wolf
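One detail worth checking in any function of this kind is that the similarity matrix is indexed by row position, whereas a DataFrame that has had rows dropped keeps its original index labels. The sketch below, on invented data with hypothetical names (`df_demo`, `sim_demo`, `similar_titles`), looks the movie up by position and returns the titles instead of printing them.

```python
import numpy as np
import pandas as pd

# Invented frame whose index labels (0, 2, 3) have a gap, as happens after dropna().
df_demo = pd.DataFrame({"title": ["A", "B", "C"]}, index=[0, 2, 3])

# Invented 3 x 3 similarity matrix, one row and column per position in df_demo.
sim_demo = np.array([
    [1.0, 0.1, 0.8],
    [0.1, 1.0, 0.3],
    [0.8, 0.3, 1.0],
])

def similar_titles(title, top_n=1):
    """Return up to top_n titles most similar to `title`, or [] if the title is unknown."""
    matches = np.flatnonzero((df_demo["title"] == title).to_numpy())
    if matches.size == 0:
        return []
    pos = matches[0]                                   # row position, not index label
    ranked = np.argsort(-sim_demo[pos], kind="stable")  # most similar first
    ranked = [i for i in ranked if i != pos][:top_n]    # drop the movie itself
    return df_demo["title"].iloc[ranked].tolist()

print(similar_titles("C"))  # ['A']
```

Here the label of "C" is 3, which would be out of range for the 3 x 3 matrix, while its position 2 selects the correct row.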


# Section 14: Saving the Model Artifacts
---
**Description:**

In the final section, we persist the processed data and the computed similarity matrix using Python's pickle module. These artifacts (movies dictionary and similarity matrix) can be saved and later loaded to deploy the recommendation system without the need to reprocess the data from scratch.

In [38]:
# Save the new DataFrame (converted to a dictionary) as a pickle file for future use
with open('movies_dict.pkl', 'wb') as f:   # the with-block closes the file after writing
    pickle.dump(new_df.to_dict(), f)

In [39]:
# Save the similarity matrix as a pickle file for later retrieval in the recommendation system
with open('similarity.pkl', 'wb') as f:    # the with-block closes the file after writing
    pickle.dump(similalarity, f)
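Loading the artifacts back is the mirror image of saving them. The sketch below round-trips a small invented dictionary through a temporary file. The file location and the `artifact` contents are placeholders, not the notebook's real output.

```python
import os
import pickle
import tempfile

# Invented stand-in for the dictionary produced by new_df.to_dict().
artifact = {"movie_id": {0: 19995}, "title": {0: "Avatar"}}

# Write to a temporary folder so the sketch does not touch the project files.
path = os.path.join(tempfile.mkdtemp(), "movies_dict.pkl")

with open(path, "wb") as f:
    pickle.dump(artifact, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == artifact)  # True
```

A dictionary saved this way can be passed to `pd.DataFrame(...)` to rebuild the table. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.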