# Movies 

Movies are a popular form of entertainment and have been a part of our culture for over a century. With the advent of streaming platforms and the availability of a vast number of movies, it has become increasingly difficult for movie lovers to discover new movies that align with their preferences. This is where movie recommender systems come in.

![Movies](https://images.thedirect.com/media/article_full/marvel-posters-ranked.jpg)


## Movie recommender systems

Movie recommender systems are computer programs that analyze a user's movie preferences and suggest similar movies that the user may enjoy. These systems use a variety of techniques, such as collaborative filtering, content-based filtering, and deep learning-based models, to make recommendations. Collaborative filtering relies on the preferences of other users with similar tastes, while content-based filtering analyzes the attributes of the movies themselves, such as genre, cast, and plot summary. Recommender systems help users discover new movies that align with their preferences and make the movie-watching experience more enjoyable.

## Project Outline

Here is an outline of the steps we would need to take to build a movie recommender system:

- Collect a dataset of movies and their attributes (e.g. genre, director, cast, plot summary, etc.). The MovieLens dataset (https://grouplens.org/datasets/movielens/) is a popular dataset for building movie recommenders.
- Clean and preprocess the data. This might include removing missing values, converting categorical variables to numerical ones, and creating new features.
- Feature engineering: Extract relevant information from the dataset, such as genre, cast, and plot summary, and use it to create new features.
- Select a similarity metric: There are several ways to measure the similarity between movies, such as cosine similarity or Jaccard similarity.
- Create a model: You can use a variety of models, such as memory-based collaborative filtering, model-based collaborative filtering, and deep learning-based models.
- Train and evaluate the model.
- Create an API or a web-based interface for the movie recommender system that takes in a movie name and returns recommendations based on the similarity.

In [1]:
# Importing required packages 
import numpy as np
import pandas as pd

### Reading Dataset

In [2]:
movies = pd.read_csv('Datasets/movie_overviews.csv')

In [3]:
movies.shape

(9099, 4)

In [4]:
movies.head()

| | id | title | overview | tagline |
|---|---|---|---|---|
| 0 | 862 | Toy Story | Led by Woody, Andy's toys live happily in his ... | |
| 1 | 8844 | Jumanji | When siblings Judy and Peter discover an encha... | Roll the dice and unleash the excitement! |
| 2 | 15602 | Grumpier Old Men | A family wedding reignites the ancient feud be... | Still Yelling. Still Fighting. Still Ready for... |
| 3 | 31357 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | Friends are the people who let you be yourself... |
| 4 | 11862 | Father of the Bride Part II | Just when George Banks has recovered from his ... | Just When His World Is Back To Normal... He's ... |


In [5]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9099 entries, 0 to 9098
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        9099 non-null   int64 
 1   title     9099 non-null   object
 2   overview  9087 non-null   object
 3   tagline   7033 non-null   object
dtypes: int64(1), object(3)
memory usage: 284.5+ KB


### Data Preprocessing

Clean and preprocess the data. This might include removing missing values, converting categorical variables to numerical ones, and creating new features.

Text preprocessing for an NLP project typically includes the following steps:

- Tokenization: breaking the text into individual words or phrases (tokens)
- Lowercasing: converting all text to lowercase to standardize the data
- Removing stop words: removing common words such as "the," "and," and "is" that do not provide useful information for the analysis
- Lemmatization or stemming: reducing words to their base form to standardize the data
- Removing special characters and numbers: removing any non-letter characters and numbers to simplify the data
- Removing punctuation
- Removing white spaces
- Removing HTML or XML tags if the data is scraped from a website
- Removing any duplicate data
- Removing any irrelevant data
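
The notebook's own `preprocess_text` function handles tokenization, lowercasing, stop-word removal, and lemmatization. The cleanup steps it does not cover (stripping HTML or XML tags, collapsing whitespace, dropping duplicates) can be sketched with the standard library alone. The helper names `strip_markup` and `dedupe` are hypothetical, the sample strings are made up for illustration, and the regex-based tag removal is a simplification that assumes simple, well-formed markup.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # naive tag matcher; assumes simple, well-formed markup
SPACE_RE = re.compile(r"\s+")

def strip_markup(text):
    """Remove HTML/XML tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return SPACE_RE.sub(" ", text).strip()

def dedupe(texts):
    """Drop exact duplicates while preserving the original order."""
    return list(dict.fromkeys(texts))

# Made-up sample strings, not taken from the dataset
raw = [
    "<p>Led by Woody,   Andy&#39;s toys live happily.</p>",
    "<p>Led by Woody,   Andy&#39;s toys live happily.</p>",
    "A family wedding <b>reignites</b> the feud.",
]
cleaned = dedupe(strip_markup(t) for t in raw)
print(cleaned)
```

For heavily nested or malformed markup, a real HTML parser would be a safer choice than a regular expression.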

In [6]:
movies.drop('tagline', axis=1, inplace=True)

In [7]:
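import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# One-time setup, assuming the resources are not installed yet:
#   nltk.download("stopwords"); nltk.download("punkt"); nltk.download("wordnet")
#   (newer NLTK releases may ask for "punkt_tab" instead of "punkt")
#   python -m spacy download en_core_web_sm

# Load spaCy's language model
nlp = spacy.load("en_core_web_sm")

# Load NLTK's stopwords and WordNetLemmatizer
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()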

def preprocess_text(text):
    # Remove special characters and digits
    text = re.sub(r"[^a-zA-Z]", " ", str(text))
    
    # Lowercase the text
    text = text.lower()
    
    # Tokenize the text
    words = word_tokenize(text)
    
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    
    # Lemmatize the words
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # Join the words back into a single string
    text = " ".join(words)
    
    # Second pass with spaCy: lemmatize again and drop spaCy's stop words
    doc = nlp(text)
    words = [token.lemma_ for token in doc if not token.is_stop]
    text = " ".join(words)
    
    return text


In [8]:
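# Replace missing overviews with empty strings first; otherwise str(NaN)
# would turn into the token "nan" inside preprocess_text
movies['overview'] = movies['overview'].fillna('').apply(preprocess_text)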

In [9]:
movies.head()

| | id | title | overview |
|---|---|---|---|
| 0 | 862 | Toy Story | lead woody andy toy live happily room andy bir... |
| 1 | 8844 | Jumanji | sible judy peter discover enchanted board game... |
| 2 | 15602 | Grumpier Old Men | family wedding reignite ancient feud door neig... |
| 3 | 31357 | Waiting to Exhale | cheat mistreat step woman hold breath wait elu... |
| 4 | 11862 | Father of the Bride Part II | george bank recover daughter wedding receive n... |


In [11]:
feautres = movies['overview']
feautres

0       lead woody andy toy live happily room andy bir...
1       sible judy peter discover enchanted board game...
2       family wedding reignite ancient feud door neig...
3       cheat mistreat step woman hold breath wait elu...
4       george bank recover daughter wedding receive n...
                              ...                        
9094    man cope loss wife obsolescence job find redem...
9095    rustom pavri honourable officer indian navy sh...
9096    village lad sarman draw big bad mohenjo daro m...
9097    mind evangelion come hit large life massive gi...
9098    band storm europe conquer america groundbreaki...
Name: overview, Length: 9099, dtype: object

### Feature Engineering

Extract relevant information from the dataset, such as genre, cast, and plot summary, and use it to create new features. This dataset contains only the plot overviews, so the features here are built from the overview text.

- TF-IDF (term frequency-inverse document frequency): This technique represents each text as a weighted vector of its words. A word's weight grows with its frequency in the text and shrinks as the word appears in more documents of the corpus, so common words carry little weight and distinctive words carry more.
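
To make the weighting concrete, here is a minimal sketch of the textbook formula (raw term count times `log(N / df)`) on three made-up documents, using only the standard library. It is an illustration, not what the notebook runs: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ from this sketch.

```python
import math
from collections import Counter

# Made-up toy corpus, not taken from the dataset
docs = [
    "toy cowboy toy room",
    "board game jungle",
    "cowboy jungle adventure",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: raw term count times log(N / df)."""
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, "toy" gets the highest weight because it occurs twice and appears in no other document, while "cowboy" gets the lowest because it also appears in the third document.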

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')


In [14]:
tfidf_matrix = tfidf.fit_transform(feautres)

### Select a similarity metric

There are several ways to measure the similarity between movies, such as cosine similarity or Jaccard similarity.
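
Jaccard similarity is not used in the rest of this notebook. As a point of comparison, a minimal sketch that treats each overview as a set of tokens and divides the size of the intersection by the size of the union could look like this. The function name and the sample token lists are illustrative assumptions.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two token collections: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return len(set_a & set_b) / len(union)

# Made-up token lists for illustration
toy_story = "toy cowboy room birthday".split()
jumanji = "board game jungle room".split()
print(jaccard_similarity(toy_story, jumanji))  # one shared token out of seven distinct tokens
```

Unlike cosine similarity on TF-IDF vectors, this measure ignores how often a word occurs and how rare it is in the corpus.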

Cosine similarity measures how similar two non-zero vectors of an inner product space are by taking the cosine of the angle between them. It depends only on the direction of the vectors, not on their magnitude. It is often used in information retrieval and text mining to compare documents or terms.

The formula for cosine similarity between two vectors A and B is given by:

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where $A$ and $B$ are the vectors, $A \cdot B$ is their dot product, and $\|A\|$ and $\|B\|$ are their magnitudes.

In general, cosine similarity ranges from -1 to 1. A value of 1 means the vectors point in the same direction, -1 means they point in opposite directions, and 0 means they are orthogonal (perpendicular) to each other. TF-IDF vectors have no negative components, so for the plot summaries here the score always falls between 0 and 1.

Beyond documents, cosine similarity can also compare sentences or individual words once they are represented as vectors. In a movie recommender system, it can be used to compute the similarity between the plot summaries of the movies.
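
The formula can be checked by hand on small vectors. The sketch below uses only the standard library. The helper names and the example vectors are assumptions chosen for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean magnitude of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||); both vectors must be non-zero."""
    return dot(a, b) / (norm(a) * norm(b))

def l2_normalize(a):
    """Scale a non-zero vector to unit length."""
    length = norm(a)
    return [x / length for x in a]

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different magnitude
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine_similarity(a, b))                 # approximately 1
print(cosine_similarity(a, c))                 # 0 for orthogonal vectors
print(dot(l2_normalize(a), l2_normalize(b)))   # dot product of unit vectors
```

For unit-length vectors the denominator is 1, so the dot product alone gives the cosine similarity. This is why a plain dot-product kernel is sufficient when the TF-IDF rows are already L2-normalized.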

In [15]:
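from sklearn.metrics.pairwise import linear_kernel
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel already equals the cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)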

### Modeling

In [18]:
# Map each title to its row index; keep the first row when a title repeats.
# (Series.drop_duplicates() would compare the index values, which are all
# unique, so it would not remove repeated titles.)
indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]

def get_recommendations(title, cosine_sim=cosine_sim, indices=indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Pair every movie index with its similarity to the chosen movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the titles of the 10 most similar movies
    return movies['title'].iloc[movie_indices]

In [29]:
get_recommendations('Iron Man')

7514                    Iron Man 2
8290                    Iron Man 3
5668                      Scarface
6206                      The Cave
8766       Avengers: Age of Ultron
4272          Saturday Night Fever
2047                 Baby Geniuses
2322                 The Dark Half
1650    Return from Witch Mountain
1001                 Touch of Evil
Name: title, dtype: object
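
The ranking step inside `get_recommendations` can be illustrated on a toy example without pandas or scikit-learn. The similarity matrix below is hypothetical (the values are made up, not computed from the dataset), and `top_k` is an illustrative name rather than part of the notebook.

```python
titles = ["Iron Man", "Iron Man 2", "Scarface", "Toy Story"]

# Hypothetical similarity matrix; row i holds the similarity of movie i to every movie
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
]

def top_k(title, k=2):
    """Return the k titles most similar to `title`, excluding the movie itself."""
    idx = titles.index(title)
    scored = [(j, score) for j, score in enumerate(sim[idx]) if j != idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[j] for j, _ in scored[:k]]

print(top_k("Iron Man"))
```

This sketch excludes the queried movie by index instead of dropping the first sorted entry. The two approaches give the same result unless another movie has an identical overview and therefore also scores 1.0, in which case the first sorted entry is not guaranteed to be the queried movie.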