In [2]:
import numpy as np
import pandas as pd

### Importing CSV files using pandas.

In [3]:
credits_df = pd.read_csv("credits.csv")
movies_df = pd.read_csv("movies.csv")

**The options below make pandas display every column and every row, so you can view the full files at once.**

In [4]:
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)


### Merge both CSV files on 'title' so that all the data can be accessed from a single DataFrame.

In [5]:
movies_df = movies_df.merge(credits_df,on = 'title')

In [6]:
movies_df.shape

(4808, 23)
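
Merging on 'title' implicitly assumes that titles are unique in both files. The sketch below uses made-up frames (the names `movies`, `credits` and all values are hypothetical, not taken from the dataset) to show how a repeated title multiplies rows, which is one possible reason a merged frame can have more rows than either input:

```python
import pandas as pd

# Hypothetical toy frames; "B" appears twice in each, like two films sharing a title.
movies = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "B"]})
credits = pd.DataFrame({"title": ["A", "B", "B"], "cast": ["x", "y", "z"]})

# An inner merge pairs every "B" row on the left with every "B" row on the right.
merged = movies.merge(credits, on="title")
print(merged.shape)  # (5, 3): one row for "A", four (2 x 2) for "B"
```

If unique matching matters, merging on an identifier column instead of the title would avoid this, assuming both files share one.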

### Heading: Selecting Relevant Columns in the Movie DataFrame

Explanation:
In this code snippet, we select specific columns from the movie DataFrame named "movies_df". The selected columns are 'id', 'title', 'overview', 'genres', 'keywords', 'cast', and 'crew'.

This step helps us focus on the information needed for our movie recommendation system. By narrowing down the columns, we reduce the complexity of the data and keep only the features required for generating recommendations.

In [7]:
movies_df = movies_df[['id','title','overview','genres','keywords','cast','crew']]

In [8]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4808 entries, 0 to 4807
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        4808 non-null   int64 
 1   title     4808 non-null   object
 2   overview  4805 non-null   object
 3   genres    4808 non-null   object
 4   keywords  4808 non-null   object
 5   cast      4808 non-null   object
 6   crew      4808 non-null   object
dtypes: int64(1), object(6)
memory usage: 300.5+ KB


### Heading: Handling Missing Values in the Movie DataFrame

Explanation:
In this code snippet, we are performing two operations on the movie DataFrame, "movies_df".

1. Checking for Null Values:
The expression "movies_df.isnull()" returns a DataFrame with the same shape as "movies_df", where each cell contains a Boolean value indicating whether the corresponding value in "movies_df" is null or not. By calling ".sum()" on this DataFrame, we obtain the sum of null values in each column. This helps us identify columns with missing data.

2. Dropping Rows with Missing Values:
After identifying the null values, we use the "dropna()" method to remove any rows containing null values from the DataFrame. The parameter "inplace=True" ensures that the changes are applied directly to the "movies_df" DataFrame, modifying it in place.

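By performing these operations, we handle missing values in the movie DataFrame, so that the recommendation system works with complete data. Here only 'overview' has missing values (3 rows), and a final "duplicated().sum()" check confirms that no duplicate rows remain.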

In [9]:
movies_df.isnull().sum()

id          0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [10]:
movies_df.dropna(inplace=True)

In [11]:
movies_df.duplicated().sum()

0
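
As a small illustration of the two steps above, here is a sketch on a made-up frame (the column values are hypothetical). It uses the non-in-place form of `dropna()`, which returns a new frame, as an alternative to `inplace=True`:

```python
import pandas as pd

# Hypothetical toy frame with one missing overview.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["some text", None, "more text"],
})

null_counts = df.isnull().sum()  # per-column count of missing values
cleaned = df.dropna()            # new frame without the incomplete row; df is unchanged

print(null_counts["overview"], len(df), len(cleaned))
```

Note that the cleaned frame keeps the original index labels, so they are no longer contiguous after rows are dropped.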

### Heading: Accessing Genres of the First Movie

Explanation:
In this code snippet, we retrieve the genres of the first movie in the DataFrame. The positional indexer "iloc[0]" selects the first row, and ".genres" reads that row's 'genres' value.

The output shows that the genres are stored as a single string that looks like a list of dictionaries, not as a Python list, which is why the next step parses it.

In [12]:
movies_df.iloc[0].genres


'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

### Heading: Converting Genre and Keyword Columns to Lists

Explanation:
In this code snippet, we convert the 'genres' and 'keywords' columns in the movie DataFrame from strings to lists.

To accomplish this, we define a function called 'convert' that uses the 'ast.literal_eval' function to evaluate the string as a literal expression. The function then extracts the 'name' attribute from each item in the evaluated list and appends it to a new list.

We apply the 'convert' function to the 'genres' and 'keywords' columns using the 'apply' method. This transforms the string-based columns into lists.

Finally, we display the updated movie DataFrame using the 'head' method, showcasing the converted 'genres' and 'keywords' columns.

This code snippet plays a vital role in preparing the data for analysis and processing in the movie recommendation system by converting the columns into a more suitable list format.

In [13]:
import ast

In [14]:
def convert(obj):
    L=[]
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L
    

In [15]:
movies_df['genres'] = movies_df['genres'].apply(convert)
movies_df['keywords'] = movies_df['keywords'].apply(convert)
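
To see what the conversion does to a single cell, the following sketch parses a shortened version of the genres string shown earlier. The helper name `extract_names` is hypothetical; it is an equivalent list-comprehension form of 'convert', not a replacement for it:

```python
import ast

def extract_names(text):
    """Parse a string holding a list of dicts and return each 'name' value."""
    return [item["name"] for item in ast.literal_eval(text)]

sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
names = extract_names(sample)
print(names)  # ['Action', 'Adventure']
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.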


### Heading: Extracting Top 3 Names from Cast Column

Explanation:
The 'convert3' function extracts the top 3 names from the 'cast' column in the movie DataFrame. It uses 'ast.literal_eval' to evaluate the string as a literal expression, appends the names to a list until 3 names have been added, and then stops.

Applying the 'convert3' function to the 'cast' column using the 'apply' method updates the column with the extracted top 3 names.

This code snippet is essential for narrowing down the cast to the top 3 names in the movie recommendation system, focusing on the main actors/actresses.

In [72]:
def convert3(obj):
    L=[]
    count = 0
    for i in ast.literal_eval(obj):
        if count != 3:
            L.append(i['name'])
            count += 1
        else:
            break
    return L

In [73]:
movies_df['cast'] = movies_df['cast'].apply(convert3)
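
The counter-and-break loop can also be written with list slicing. The sketch below is an equivalent alternative, assuming the cast list is already ordered by billing; the helper name `top_names`, its `limit` parameter, and the sample names are hypothetical:

```python
import ast

def top_names(text, limit=3):
    """Return the 'name' of the first `limit` entries in a stringified list of dicts."""
    return [item["name"] for item in ast.literal_eval(text)[:limit]]

sample = ('[{"name": "Actor One"}, {"name": "Actor Two"}, '
          '{"name": "Actor Three"}, {"name": "Actor Four"}]')
first_three = top_names(sample)
print(first_three)  # ['Actor One', 'Actor Two', 'Actor Three']
```

Slicing also handles casts with fewer than three entries without any special case.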

### Heading: Extracting Director Names from Crew Column

Explanation:
The 'fetch_director' function extracts the names of directors from the 'crew' column in the movie DataFrame. It iterates over the evaluated list obtained from 'ast.literal_eval(obj)', checks if the 'job' attribute is 'Director', and appends the director's name to a list.

By applying the 'fetch_director' function to the 'crew' column using the 'apply' method, we update the column with the extracted director names.

This code snippet is essential for capturing and storing the names of directors in the movie recommendation system, enabling us to consider the director's impact when generating recommendations.

In [75]:
def fetch_director(obj):
    L=[]
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L
    

In [76]:
movies_df['crew'] = movies_df['crew'].apply(fetch_director)
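
The filter on the 'job' field can be seen on a small made-up crew string. The helper name `directors` and the sample people are hypothetical; using `.get("job")` is a defensive choice in case an entry lacks that key, which the original function does not assume:

```python
import ast

def directors(text):
    """Return the names of crew members whose job is 'Director'."""
    return [member["name"] for member in ast.literal_eval(text)
            if member.get("job") == "Director"]

sample = ('[{"job": "Editor", "name": "Person A"}, '
          '{"job": "Director", "name": "Person B"}]')
found = directors(sample)
print(found)  # ['Person B']
```

The result is a list because a film can credit more than one director.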

In [78]:
movies_df.overview[0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

### Heading: Splitting Overview Text into Tokens

Explanation:
In this code snippet, we use the 'apply' method to apply a lambda function to the 'overview' column in the movie DataFrame, 'movies_df'.

The lambda function takes each string in the 'overview' column, 'x', and applies the 'split()' method to it. By using 'split()' without any arguments, it splits the string into a list of tokens based on whitespace.

By assigning the result back to the 'overview' column, we update it with the tokenized representation of the original text.

This code snippet is crucial for transforming the overview text into a list of tokens, which can be used for various text-based operations, such as text analysis or building a recommendation system that considers the content of movie overviews.

In [79]:
movies_df['overview'] = movies_df['overview'].apply(lambda x:x.split())

### Heading: Removing Spaces from Genre, Keyword, Cast, and Crew Columns

Explanation:
In this code snippet, we use the 'apply' method along with lambda functions to modify multiple columns in the movie DataFrame, 'movies_df'.

For each column ('genres', 'keywords', 'cast', 'crew'), the lambda function iterates over each element, 'x', in the column and applies a list comprehension. Within the list comprehension, we use the 'replace()' method to remove spaces (' ') from each element in the list.

By assigning the modified lists back to their respective columns, we update the DataFrame with the modified versions that have spaces removed from the individual elements.

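This code snippet is crucial for cleaning the genre, keyword, cast, and crew columns. Removing the spaces turns each multi-word entry, such as 'Sam Worthington', into a single token ('SamWorthington'), so that a name or genre is later counted as one feature instead of being split into unrelated words.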

In [50]:
movies_df['genres'] = movies_df['genres'].apply(lambda x:[i.replace (" ","") for i in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x:[i.replace (" ","") for i in x])
movies_df['cast'] = movies_df['cast'].apply(lambda x:[i.replace (" ","") for i in x])
movies_df['crew'] = movies_df['crew'].apply(lambda x:[i.replace (" ","") for i in x])
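
A short sketch of why the spaces matter once the text is split into words. The two names are only examples, and plain `split()` stands in for the tokenizer used later:

```python
names = ["Sam Worthington", "Sam Mendes"]

# With spaces kept, both people contribute the same token "sam".
spaced_tokens = " ".join(names).lower().split()

# With spaces removed, each person becomes one distinct token.
joined_tokens = " ".join(n.replace(" ", "") for n in names).lower().split()

print(spaced_tokens)  # ['sam', 'worthington', 'sam', 'mendes']
print(joined_tokens)  # ['samworthington', 'sammendes']
```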



### Heading: Creating a 'tags' Column by Combining Multiple Columns

Explanation:
In this code snippet, we create a new column called 'tags' in the movie DataFrame, 'movies_df', by combining multiple existing columns.

By using the '+' operator, we concatenate the values from the 'overview', 'genres', 'keywords', 'cast', and 'crew' columns. This creates a new column, 'tags', that contains a combination of information from these columns.

The purpose of creating the 'tags' column is to consolidate relevant information from different aspects of the movie, such as the overview, genres, keywords, cast, and crew. This consolidated information can be used for various tasks, such as content-based filtering or generating movie recommendations based on shared characteristics.

This code snippet plays a vital role in aggregating and organizing key information in a single column, facilitating further analysis and recommendation generation in the movie recommendation system.

In [82]:
movies_df['tags'] = movies_df['overview']+movies_df['genres']+movies_df['keywords']+movies_df['cast']+movies_df['crew']
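
Because every one of these columns holds Python lists, '+' concatenates the lists row by row. A minimal sketch on a one-row, made-up frame (a subset of the columns, with hypothetical values):

```python
import pandas as pd

# Each cell is a list, as in movies_df after the earlier conversion steps.
df = pd.DataFrame({
    "overview": [["a", "paraplegic", "marine"]],
    "genres": [["Action"]],
    "cast": [["SamWorthington"]],
})

tags = df["overview"] + df["genres"] + df["cast"]  # element-wise list concatenation
print(tags.iloc[0])
```

This only works if no cell is missing, which is one reason rows with null values are dropped beforehand.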

In [84]:
new_df = movies_df[['id','title','tags']]

### Heading: Joining Tags into a Single String

Explanation:
In this code snippet, we use the 'apply' method along with a lambda function to modify the 'tags' column in the DataFrame, 'new_df'.

The lambda function takes each element, 'x', in the 'tags' column and applies the 'join()' method to it. By using 'join()' with a space (' '), it concatenates the individual elements within 'x' into a single string, separating them with spaces.

By assigning the modified 'tags' column back to itself, we update the DataFrame with the new version where the tags are represented as a single string.

This code snippet is crucial for converting the 'tags' column from a list of individual elements into a cohesive and space-separated string representation. This format is often more convenient for further text analysis or recommendation generation in the movie recommendation system.

In [86]:
new_df['tags'] = new_df['tags'].apply(lambda x:' '.join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda x:' '.join(x))
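
The message above is pandas' SettingWithCopyWarning, which is typically raised when a frame obtained by selecting from another frame is then modified. One common way to avoid it, sketched here on made-up data, is to take an explicit copy when creating the subset. Whether that fits this notebook is an assumption that 'new_df' is meant to be independent of 'movies_df'; the names `movies` and `subset` are hypothetical:

```python
import pandas as pd

movies = pd.DataFrame({"title": ["A"], "tags": [["x", "y"]], "extra": [1]})

# .copy() makes the subset an independent frame, so assigning to it is unambiguous.
subset = movies[["title", "tags"]].copy()
subset["tags"] = subset["tags"].apply(" ".join)

print(subset.loc[0, "tags"])  # x y
```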


In [88]:
new_df['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy Science Fiction culture clash future space war space colony society space travel futuristic romance space alien tribe alien planet cgi marine soldier battle love affair anti war power relations mind and soul 3d Sam Worthington Zoe Saldana Sigourney Weaver James Cameron'

### Converting All Tags to Lowercase

In [89]:
new_df['tags'] = new_df['tags'].apply(lambda X:X.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(lambda X:X.lower())


In [90]:
new_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy science fiction culture clash future space war space colony society space travel futuristic romance space alien tribe alien planet cgi marine soldier battle love affair anti war power relations mind and soul 3d sam worthington zoe saldana sigourney weaver james cameron'

### Heading: Creating Count Vectors for Tags Column

Explanation:
In this code snippet, we import the 'CountVectorizer' class from the 'sklearn.feature_extraction.text' module. The 'CountVectorizer' is a text feature extraction technique that converts text into a matrix of token counts.

We initialize an instance of 'CountVectorizer' called 'cv' with two parameters: 'max_features' and 'stop_words'. 'max_features' specifies the maximum number of features (words) to include in the vocabulary, and 'stop_words' specifies that common English words should be excluded from the vocabulary.

Next, we use the 'fit_transform' method of 'cv' to transform the 'tags' column of the DataFrame, 'new_df', into a matrix representation. This matrix is converted to a NumPy array using 'toarray()' method.

By calling '.shape' on the resulting array, we obtain the shape of the matrix, which represents the number of documents (rows) and the number of features (columns).

We then assign the transformed matrix to a variable called 'vectors' by calling 'fit_transform' method again on 'cv' with the 'tags' column.

This code snippet is essential for converting the text-based 'tags' column into a matrix of count vectors, which represents the frequency of each word (feature) in each document (tag). These count vectors can be used as input for various machine learning algorithms or for further analysis in the movie recommendation system.

In [111]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=5000, stop_words='english')
 

In [112]:
cv.fit_transform(new_df['tags']).toarray().shape


(4805, 5000)

In [113]:
vectors = cv.fit_transform(new_df['tags']).toarray()


In [114]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0])

### Heading: Stemming Words in Tags Column

Explanation:
In this code snippet, we import the 'PorterStemmer' class from the 'nltk.stem.porter' module. The 'PorterStemmer' is a popular stemming algorithm that reduces words to their base or root form.

We define a function called 'stem' that takes a string of text as input. Within the function, we initialize an empty list called 'y'. We then iterate over each word in the text, obtained by splitting the string. For each word, we apply stemming using the 'PorterStemmer' by calling the 'stem' method on the 'ps' object. The stemmed word is then appended to the 'y' list.

After iterating through all the words, we use the 'join' method to combine the stemmed words in the 'y' list into a single string, separated by spaces.

Finally, we apply the 'stem' function to the 'tags' column of the 'new_df' DataFrame using the 'apply' method. This updates the 'tags' column with the stemmed version of the text.

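This code snippet is important for applying stemming to the words in the 'tags' column. Stemming reduces words to their base or root form, which can help in reducing the vocabulary size, because variants such as 'loved' and 'loving' are counted as the same feature. For stemming to affect the count vectors, it has to run before 'fit_transform' is called on the 'tags' column. In the execution order shown here, 'vectors' is computed (In [113]) before the tags are stemmed (In [122]), so the vectorization cell should be re-run after stemming.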

In [116]:
import nltk

In [120]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [121]:
def stem(text):
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

In [122]:
new_df['tags'] = new_df['tags'].apply(stem)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df['tags'] = new_df['tags'].apply(stem)
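
A quick way to see the effect of stemming is to run it on a few related words. This sketch assumes NLTK is installed; the helper name `stem_text` is hypothetical and mirrors the 'stem' function above:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Stem every whitespace-separated word and join the results with spaces."""
    return " ".join(stemmer.stem(word) for word in text.split())

stemmed = stem_text("loved loving love")
print(stemmed)  # love love love
```

All three forms collapse to one stem, so they would share a single column in the count vectors.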


### Heading: Computing Cosine Similarity between Vectors

Explanation:
In this code snippet, we import the 'cosine_similarity' function from the 'sklearn.metrics.pairwise' module. The 'cosine_similarity' function calculates the cosine similarity between vectors, which is a measure of similarity between two vectors based on the cosine of the angle between them.

We apply the 'cosine_similarity' function to the 'vectors' matrix, which represents the count vectors of the 'tags' column in the 'new_df' DataFrame. By passing 'vectors' as the input to the function, it computes the pairwise cosine similarity between all the vectors.

The output is a square matrix where each element represents the cosine similarity between the corresponding pair of vectors. The diagonal elements represent the cosine similarity of each vector with itself, which is always 1.

This code snippet is crucial for computing the cosine similarity between vectors, which can be used in various recommendation systems to measure the similarity between movies based on their tags. It helps in identifying movies that are closely related or have similar characteristics, aiding in generating accurate and relevant recommendations.

In [123]:
from sklearn.metrics.pairwise import cosine_similarity


In [124]:
cosine_similarity(vectors)

array([[1.        , 0.06885304, 0.04948717, ..., 0.03142697, 0.05410018,
        0.        ],
       [0.06885304, 1.        , 0.04259177, ..., 0.04057204, 0.        ,
        0.        ],
       [0.04948717, 0.04259177, 1.        , ..., 0.01944039, 0.08924215,
        0.        ],
       ...,
       [0.03142697, 0.04057204, 0.01944039, ..., 1.        , 0.06375767,
        0.03207501],
       [0.05410018, 0.        , 0.08924215, ..., 0.06375767, 1.        ,
        0.03681051],
       [0.        , 0.        , 0.        , ..., 0.03207501, 0.03681051,
        1.        ]])

In [125]:
cosine_similarity(vectors).shape

(4805, 4805)

In [126]:
similarity = cosine_similarity(vectors)

In [128]:
similarity[0].shape

(4805,)
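
For intuition, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy on two made-up count vectors; it illustrates the formula and is not how scikit-learn implements the pairwise version:

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1, 1, 0])
b = np.array([1, 0, 1])

sim_ab = cosine(a, b)  # one shared word out of two each: 1 / (sqrt(2) * sqrt(2))
sim_aa = cosine(a, a)  # a vector compared with itself
print(round(sim_ab, 3), round(sim_aa, 3))  # 0.5 1.0
```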

### Heading: Retrieving Top Similar Movies

Explanation:
In this code snippet, we retrieve the top similar movies based on the cosine similarity scores.

The 'enumerate(similarity[0])' function creates an iterator that combines the indices and similarity scores of the first row of the 'similarity' matrix. The 'list' function is then used to convert the iterator into a list.

Next, we use the 'sorted' function to sort the list in descending order based on the similarity scores. The 'reverse=True' parameter ensures that the list is sorted in descending order. The 'key=lambda x: x[1]' parameter specifies that the sorting should be based on the second element (similarity score) of each tuple in the list.

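Finally, we use list slicing '[1:6]' to retrieve the top 5 similar movies. The slice starts at 1 because the first entry of the sorted list is the movie itself, whose similarity to itself is 1. This returns a sublist of tuples, each holding the index and similarity score of a similar movie.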

The purpose of this code snippet is to identify the most similar movies to a given movie. By sorting the similarity scores and retrieving the top similar movies, we can generate recommendations based on the similarity of movie tags.

Note: The code assumes that 'similarity' is a variable containing the cosine similarity matrix, and it retrieves the top 5 similar movies based on the first row of the matrix.

In [130]:
sorted(list(enumerate(similarity[0])), reverse= True, key = lambda x:x[1])[1:6]

[(2409, 0.4839354795704658),
 (1537, 0.43622020338771716),
 (3162, 0.4278406467922595),
 (838, 0.398216086792002),
 (4335, 0.39207842352784267)]
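
The same ranking steps on a made-up row of four scores, where position 0 plays the role of the query movie (the values are hypothetical):

```python
scores = [1.0, 0.2, 0.7, 0.5]  # similarity of movie 0 to movies 0..3

# Pair each score with its position, then sort by score, highest first.
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Skip the first entry (the movie itself) and keep the next two.
top_two = ranked[1:3]
print(top_two)  # [(2, 0.7), (3, 0.5)]
```

Skipping index 0 of the sorted list relies on the query movie ranking first, which holds as long as no other movie has a similarity of exactly 1.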

### Heading: Movie Recommendation Function

Explanation:
The code snippet presents a function called 'recommend' that generates movie recommendations based on a given movie.

The function takes a movie name as input and performs the following steps:

1. It first checks if the given movie exists in the 'title' column of the 'new_df' DataFrame. If the movie is not found, it prints a message stating that the movie was not found and returns.

2. If the movie is found in the dataset, it looks up the position of the movie's row in the 'new_df' DataFrame.

3. It then accesses the corresponding row in the 'similarity' matrix, which represents the cosine similarity scores for the given movie.

4. Using the 'sorted' function, it sorts the similarity scores in descending order and retrieves the top 5 similar movies. The movies are represented as tuples containing the index and similarity score.

5. Finally, it iterates over the retrieved movie list and prints the titles of the recommended movies.

The purpose of this code snippet is to provide a function that recommends similar movies based on the input movie. It utilizes the cosine similarity scores and the 'new_df' DataFrame to identify and display the top 5 movies that are most similar to the input movie.

In [145]:
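def recommend(movie):
    if movie not in new_df['title'].values:
        print(f"Movie '{movie}' not found in the dataset.")
        return

    # Use the row's position, not its index label: 'similarity' is indexed by
    # position, and the index labels have gaps after dropna().
    movie_index = new_df['title'].tolist().index(movie)
    distances = similarity[movie_index]
    # Sort by score, highest first; skip the first entry, which is the movie itself.
    movie_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]

    for i in movie_list:
        print(new_df.iloc[i[0]].title)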


In [146]:
recommend('Avatar')

Aliens
Moonraker
Alien
Alien³
Silent Running


In [147]:
recommend('Iron Man')

Iron Man 2
Iron Man 3
Avengers: Age of Ultron
Captain America: Civil War
The Avengers


In [149]:
recommend('Liar Lia')

Movie 'Liar Lia' not found in the dataset.
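
A variant that returns the titles instead of printing them can be easier to reuse and test. The sketch below is one possible shape, run on a made-up three-movie frame whose index labels deliberately have gaps; `recommend_titles`, `frame`, `sim` and `top_n` are hypothetical names, not part of the notebook:

```python
import numpy as np
import pandas as pd

def recommend_titles(movie, frame, sim, top_n=5):
    """Return up to top_n titles most similar to `movie`, or [] if it is unknown."""
    matches = np.flatnonzero(frame["title"].to_numpy() == movie)
    if len(matches) == 0:
        return []
    row = matches[0]                      # position of the movie, not its index label
    ranked = np.argsort(sim[row])[::-1]   # positions sorted by score, highest first
    picks = [i for i in ranked if i != row][:top_n]
    return frame["title"].iloc[picks].tolist()

# Toy data: labels 0, 2, 5 mimic the gaps left by dropping rows.
frame = pd.DataFrame({"title": ["A", "B", "C"]}, index=[0, 2, 5])
sim = np.array([[1.0, 0.1, 0.8],
                [0.1, 1.0, 0.3],
                [0.8, 0.3, 1.0]])

result = recommend_titles("C", frame, sim, top_n=2)
print(result)  # ['A', 'B']
```

Excluding the query row explicitly, instead of slicing from 1, avoids dropping a different movie when two rows tie at the top.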
