<h2 style = "color:darkblue"> Movie Recommendation System </h2>

There are several machine learning models that you can consider for building a movie recommendation system. The choice of model depends on the type of recommendation system you're aiming to create. Here are a few popular options:

1. **Collaborative Filtering Models:**
   - User-Based Collaborative Filtering: This model recommends movies to a user based on the preferences of users with similar tastes. It calculates similarity scores between users.
   - Item-Based Collaborative Filtering: This model recommends movies by finding similar movies to those the user has already watched or rated. It calculates similarity scores between movies.


2. **Content-Based Models:**
   - Content-Based Filtering: This model recommends movies based on the features of movies that the user has shown interest in. It focuses on movie attributes like genre, cast, and plot.


3. **Matrix Factorization Models:**
   - SVD (Singular Value Decomposition): This model decomposes the user-movie interaction matrix to identify latent features. It's used to fill in missing values (ratings) and make recommendations.


4. **Hybrid Models:**
   - Hybrid Recommender: This combines multiple recommendation techniques (e.g., collaborative filtering and content-based) to provide more accurate and diverse recommendations.


5. **Deep Learning Models:**
   - Neural Collaborative Filtering: This deep learning model replaces the fixed dot product used in matrix factorization with a neural network that learns how user and movie embeddings interact.


6. **Ensemble Models:**
   - Ensemble Recommenders: Ensemble techniques combine predictions from multiple models to improve recommendation quality.
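
As a rough illustration of the item-based collaborative filtering idea from the list above, here is a minimal sketch on a made-up ratings matrix. The ratings, the function names `item_similarity` and `predict_scores`, and the choice of a similarity-weighted average are all assumptions for illustration; this notebook's dataset contains no user ratings.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are movies, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def item_similarity(r):
    """Cosine similarity between movie columns."""
    norms = np.linalg.norm(r, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for movies nobody rated
    normalized = r / norms
    return normalized.T @ normalized

def predict_scores(r, sim, user):
    """Score each movie as a similarity-weighted average of the user's own ratings."""
    rated = r[user] > 0
    weights = sim[:, rated]
    denom = np.abs(weights).sum(axis=1)
    denom[denom == 0] = 1.0
    scores = weights @ r[user, rated] / denom
    scores[rated] = -np.inf  # never recommend a movie the user already rated
    return scores

sim = item_similarity(ratings)
scores = predict_scores(ratings, sim, user=0)
best_movie = int(np.argmax(scores))  # column index of the top unrated movie
print(best_movie)
```

User 0 has rated every movie except the one in column 2, so that is the only candidate the sketch can return.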

Remember that the choice of model depends on factors such as the available data, your specific goals, and the trade-offs between accuracy, interpretability, and scalability. It's often beneficial to experiment with different models and evaluate their performance using metrics like precision, recall, or Mean Average Precision (MAP) to determine the best fit for your recommendation system.
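
To make the evaluation metrics mentioned above concrete, here is a small sketch of precision and recall computed over the top-k recommendations. The helper names `precision_at_k` and `recall_at_k` and the sample lists are hypothetical; a real evaluation would need held-out user interactions, which this notebook does not have.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that the user actually liked."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the liked items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant)

# Hypothetical data: a ranked recommendation list and the set of movies the user liked.
recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "D", "F"}

p = precision_at_k(recommended, relevant, k=5)  # 2 hits out of 5 recommendations
r = recall_at_k(recommended, relevant, k=5)     # 2 hits out of 3 liked movies
print(p, r)
```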

- Adding Required Libraries

In [None]:
import pandas as pd
import numpy as np
import ast

import warnings
warnings.filterwarnings("ignore")
# ast.literal_eval() is used later to parse the stringified list columns (genres, keywords, cast, crew)

- Reading Files : 'Movies' & 'Credits'

In [None]:
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

In [None]:
movies.head(2)

In [None]:
# movies.columns
# credits.columns

<h3 style="color:darkblue"> Data preprocessing </h3>

- Merging the two dataframes 'Movies' and 'Credits'

In [None]:
df_movie = movies.merge(credits,on='title')

In [None]:
df_movie.head(2)
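
One thing to watch for when merging on `title`: if two different films share a title, pandas pairs every matching row on the left with every matching row on the right, which produces extra rows. The sketch below shows the effect on made-up frames and compares it with a merge on an ID column. It assumes the movies table has an `id` column that corresponds to `movie_id` in the credits table; the toy frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames: two different films share the title "Batman".
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "overview": ["1989 film", "1966 film", "2009 film"],
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast 1", "cast 2", "cast 3"],
})

# Merging on title pairs each "Batman" with both credits rows: 2 x 2 + 1 = 5 rows.
by_title = movies_toy.merge(credits_toy, on="title")

# Merging on the ID keeps exactly one row per film.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title), len(by_id))
```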

Building a movie recommendation system using machine learning involves selecting appropriate features to capture user preferences and movie characteristics. Here are some feature variables you might consider for your recommendation model:

1. **User Preferences:** User ID, Ratings, Watch History, Genre Preferences, Release Year Preferences


2. **Movie Characteristics:** Movie ID, Genre, Release Year, Director, Cast, Keywords, Average Rating, Popularity
   
   
3. **Textual Features:** Movie Plot Summary, Movie Tags


4. **Collaborative Filtering Features:** Similarity Metrics, User-based or Item-based Collaborative Filtering


5. **Contextual Features:** Time of Day, Location


6. **External Data:** IMDb or TMDb data, which can supply additional information about each movie.



- Dropping Columns based on the requirements

In [None]:
df_movie.columns

In [None]:
df_movie = df_movie[['movie_id','title','overview','genres', 'keywords','cast','crew']]

In [None]:
df_movie.head()

In [None]:
df_movie.isnull().sum()

In [None]:
#dropping rows with missing values
df_movie.dropna(inplace=True)

In [None]:
#checking for duplicate records
df_movie.duplicated().sum()

In [None]:
#function to collect the value stored under the 'name' key of each entry
def convert(obj):
    l_list = []
    for i in ast.literal_eval(obj):
        l_list.append(i["name"])
    return l_list

In [None]:
df_movie['genres'] = df_movie['genres'].apply(convert)

In [None]:
df_movie['keywords'] = df_movie['keywords'].apply(convert)

In [None]:
#function to extract the first three names from the list of entries
def convert3(obj):
    l_list = []
    counter = 0
    for i in ast.literal_eval(obj):
        if counter != 3:
            l_list.append(i["name"])
            counter+=1
        else:
            break
    return l_list
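
To see what the two helper functions above are doing, here is a small standalone sketch. The sample string is a made-up value in the same shape as the `genres` column (a string holding a list of dictionaries), and `names_from` is a hypothetical helper that covers both cases with an optional limit.

```python
import ast

# Hypothetical sample in the same shape as a 'genres' cell: a string, not a list.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def names_from(obj, limit=None):
    """Parse the stringified list and return the 'name' values, optionally only the first `limit`."""
    items = ast.literal_eval(obj)          # safely turns the string into a Python list of dicts
    names = [item["name"] for item in items]
    return names if limit is None else names[:limit]

print(names_from(sample))           # all names
print(names_from(sample, limit=1))  # only the first name
```

Slicing with `[:limit]` gives the same result as counting inside the loop and is a little shorter.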

In [None]:
df_movie['cast'] = df_movie['cast'].apply(convert3)

In [None]:
df_movie['cast'].head()

In [None]:
var = ast.literal_eval(df_movie['crew'][1])
# var

In [None]:
#writing a function to fetch the director 
def director(obj):
    l = []
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            l.append(i['name'])
    return l

In [None]:
df_movie['Director'] = df_movie['crew'].apply(director)

In [None]:
#dropping the 'crew' column
df_movie.drop(columns='crew',inplace = True)

In [None]:
df_movie.head()

In [None]:
#splitting each overview into a list of words
df_movie['overview'] = df_movie['overview'].apply(lambda x : x.split())

- Removing Spaces in order to create precise Tags for Recommendation

In [None]:
df_movie['overview'].apply(lambda x :[i.replace(' ','') for i in x])

In [None]:
df_movie['genres'].apply(lambda x :[i.replace(' ','') for i in x])

In [None]:
columns_movie = ['overview', 'genres', 'keywords', 'cast','Director']

In [None]:
for i in columns_movie:
    df_movie[i] = df_movie[i].apply(lambda x :[val.replace(' ','') for val in x])
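
Why remove the spaces? A vectorizer splits text on whitespace and punctuation, so a name like "Sam Worthington" would become two unrelated tokens, and "Sam" would match every other Sam in the dataset. The sketch below checks this with scikit-learn's default analyzer; the example name is only an illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer uses to lowercase and tokenize text.
analyze = CountVectorizer().build_analyzer()

with_space = analyze("Sam Worthington")
without_space = analyze("SamWorthington")

print(with_space)     # two separate tokens
print(without_space)  # one token that identifies the person
```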

In [None]:
df_movie.head()

In [None]:
#combining all feature lists into a single 'tags' column
df_movie['tags'] = df_movie['overview'] + df_movie['genres'] + df_movie['keywords'] + df_movie['cast'] + df_movie['Director']

In [None]:
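# .copy() gives an independent frame, so later edits to 'tags' do not trigger SettingWithCopyWarning
data_movie = df_movie[['movie_id','title','tags']].copy()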

In [None]:
#converting tags into string
data_movie['tags'] = data_movie['tags'].apply(lambda x : " ".join(x))

In [None]:
data_movie.head()

The "Bag of Words" (BoW) concept is a fundamental technique in Natural Language Processing (NLP) and machine learning that represents text data as a collection of individual words or tokens, disregarding grammar and word order. It's a simplified representation that transforms text into a numerical format suitable for machine learning algorithms. The name "Bag of Words" implies that we're treating text as an unordered collection, similar to a bag where words are tossed in without considering their sequence.

*Explanation of the Bag of Words concept along with examples:*

**Process of Creating a Bag of Words:**

1. **Tokenization:** Break the text into individual words or tokens. Punctuation and capitalization are usually removed, and the text is split into words.

2. **Vocabulary Creation:** Create a unique vocabulary of all the distinct words in the entire corpus (collection of documents). Each word is assigned a unique index.

3. **Document Representation:** For each document in the corpus, create a vector where each dimension corresponds to a word in the vocabulary. The value in each dimension represents the frequency of that word in the document.

**Example:**

Consider the following two sentences:
1. "The cat in the hat."
2. "The dog in the yard."

**Step 1: Tokenization:**
- Sentence 1 tokens: ["the", "cat", "in", "the", "hat"]
- Sentence 2 tokens: ["the", "dog", "in", "the", "yard"]

**Step 2: Vocabulary Creation:**
The unique words in the corpus are: ["the", "cat", "in", "hat", "dog", "yard"]

**Step 3: Document Representation:**
Create vectors for each sentence based on the vocabulary. The vectors indicate the frequency of each word in the sentence.

- Sentence 1 vector: [2, 1, 1, 1, 0, 0]
- Sentence 2 vector: [2, 0, 1, 0, 1, 1]

In this representation, the order of words is disregarded, and only the frequency of words matters. The "bag" nature of this representation means that it's as if we have a bag containing the words, and we're looking at the counts of each word.

**Usage:**

Bag of Words is used for various NLP tasks, including:
- Text classification: Determining the category of a document (spam/ham, sentiment analysis, topic classification).
- Document clustering: Grouping similar documents together.
- Information retrieval: Ranking documents based on their relevance to a query.
- Keyword extraction: Identifying the most important words in a document.

While Bag of Words is a simple representation, it has limitations: it ignores word order, treats every word as equally informative, and does not capture meaning. TF-IDF (Term Frequency-Inverse Document Frequency) addresses the second point by down-weighting words that appear in many documents, while word embeddings address the third by placing words with similar meanings close together in a vector space.

In [None]:
#converting tags into lowercase
data_movie['tags'] = data_movie['tags'].apply(lambda x: x.lower())

In [None]:
#importing libraries
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words='english')

In [None]:
vectors = cv.fit_transform(data_movie['tags']).toarray()

In [None]:
list(cv.get_feature_names_out())

**Stemming** is a text normalization technique in natural language processing (NLP) that involves reducing words to their base or root form, called a "stem." The purpose of stemming is to simplify words to their core form so that different variations of the same word are treated as the same, making it easier to perform analysis, comparisons, and information retrieval.

- Stemming works by stripping affixes from words, most commonly suffixes, to produce the root form. The resulting stem may not always be a valid word, but it represents the core meaning of the word.

- For example, consider the words "running," "runs," and "run." Applying stemming, we would convert all of these to the stem "run." Not every related form is reduced, though: the Porter stemmer leaves "runner" unchanged.

- Stemming algorithms vary in complexity and rules, but they generally work by applying linguistic rules to the words. Some common stemming algorithms include the Porter stemming algorithm and the Snowball stemming algorithm.

**Example:**
Let's take a few words and apply stemming using the Porter stemming algorithm:

- Original Word: "running"
  - Stemmed Word: "run"

- Original Word: "jumps"
  - Stemmed Word: "jump"

- Original Word: "happily"
  - Stemmed Word: "happili" (Note: The stem might not be a valid word, but it represents the root meaning.)

In situations where maintaining semantic accuracy is critical, more advanced techniques like *lemmatization* are preferred. Lemmatization considers the context and meaning of words to produce a valid base form, known as the "lemma."
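
The stems listed in the example above can be reproduced with NLTK's `PorterStemmer`, the same class used in the cells that follow. A minimal sketch, assuming `nltk` is installed (the stemmer itself needs no extra downloads):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "runs", "jumps", "happily"]
stems = {word: stemmer.stem(word) for word in words}

for word, stemmed in stems.items():
    print(f"{word} -> {stemmed}")
```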

In [None]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [None]:
def stem(text):
    list_val = []
    for i in text.split(' '):
        list_val.append(ps.stem(i))
        
    return " ".join(list_val)

In [None]:
data_movie['tags'] = data_movie['tags'].apply(stem)

In [None]:
# Let's re-create the vector

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words='english')

vectors = cv.fit_transform(data_movie['tags']).toarray()

In [None]:
list(cv.get_feature_names_out())

- Euclidean distance tends to work poorly for text represented as word counts: the vectors are high-dimensional and sparse, and a long document ends up far from a short one even when both use the same words in the same proportions. Cosine similarity compares only the angle between two vectors, so it is insensitive to document length. That makes it the usual choice for measuring similarity between text documents.

- **Calculating Cosine Similarity**
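
Before applying it to the movie vectors, here is a tiny sketch of why cosine similarity suits word counts. The two vectors are made up: the second is the first repeated three times, as if the same text had been pasted three times over.

```python
import numpy as np

# Hypothetical word-count vectors: long_doc is short_doc repeated three times.
short_doc = np.array([1.0, 2.0, 0.0, 1.0])
long_doc = 3 * short_doc

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

euclid = float(np.linalg.norm(short_doc - long_doc))
cos = cosine(short_doc, long_doc)

# Euclidean distance is large because of the length difference,
# while cosine similarity is 1 because the word proportions are identical.
print(euclid, cos)
```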

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
similarity = cosine_similarity(vectors)

In [None]:
sorted(enumerate(similarity[0]),reverse=True,key=lambda x:x[1])[1:6]

In [None]:
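def recommend(movie):
    # Find the row *position* of the movie. `similarity` is indexed by position,
    # and index labels no longer match positions after dropna().
    matches = np.flatnonzero(data_movie['title'].values == movie)
    if len(matches) == 0:
        print(f"'{movie}' was not found in the dataset")
        return
    movie_index = matches[0]
    distances = similarity[movie_index]
    # Sort by similarity, skip the first entry (the movie itself), keep the next five
    movies_list = sorted(enumerate(distances),reverse=True,key=lambda x:x[1])[1:6]

    for i in movies_list:
        print(data_movie.iloc[i[0]].title)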

In [None]:
recommend('Avatar')
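
To see the whole pipeline on something small enough to check by hand, here is a sketch with four made-up movies. The titles, tags, and the helper name `top_matches` are hypothetical; the steps (count vectors, cosine similarity, sort and skip the movie itself) mirror the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue with hand-written tags.
toy = pd.DataFrame({
    "title": ["Space War", "Star Battle", "Love Story", "Romantic Tale"],
    "tags": [
        "space alien war spaceship",
        "space war spaceship hero",
        "love romance wedding",
        "romance love heartbreak",
    ],
})

toy_vectors = CountVectorizer().fit_transform(toy["tags"])
toy_similarity = cosine_similarity(toy_vectors)

def top_matches(title, k=1):
    """Return the k titles most similar to `title`, excluding the title itself."""
    position = np.flatnonzero(toy["title"].values == title)[0]
    order = np.argsort(toy_similarity[position])[::-1]       # most similar first
    order = [i for i in order if i != position][:k]
    return toy["title"].iloc[order].tolist()

print(top_matches("Space War"))
print(top_matches("Love Story"))
```

"Space War" and "Star Battle" share three of their four tags, so each is the other's closest match, and the same holds for the two romance titles.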

In [None]:
import pickle

In [None]:
# the with-block closes the file once the dump has finished
with open('movies.pkl','wb') as f:
    pickle.dump(data_movie,f)

In [None]:
with open('movies_dict.pkl','wb') as f:
    pickle.dump(data_movie.to_dict(),f)

In [None]:
with open('similarity.pkl','wb') as f:
    pickle.dump(similarity,f)
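
The saved files can be read back later with `pickle.load`. The sketch below does a round trip with a small stand-in array in a temporary directory, so it does not touch the real files; the array and the path are assumptions for illustration. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for the real similarity matrix.
toy_similarity = np.eye(3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "similarity.pkl")

    # Write the array to disk.
    with open(path, "wb") as f:
        pickle.dump(toy_similarity, f)

    # Read it back, as a separate script or app would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(np.array_equal(restored, toy_similarity))
```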