### **The Journey Begins: Adil & Harry’s Recommender System Adventure**  

Adil and Harry sat in their workspace, staring at a whiteboard filled with complex equations and diagrams. Their mission was clear—to build a **Movie Recommender System**.  

"Harry, today, we're focusing on **content-based filtering**," Adil said, adjusting his laptop screen. "Before we dive into coding, let's break down the different types of recommender systems."  

Harry nodded eagerly. "Alright, let's start with the basics!"  

---

### **Types of Recommender Systems**  

Adil explained the three major types:  

#### **1. Content-Based Filtering (Today's Focus)**  
🔹 **How it Works**: Analyzes movie features (genres, overview, keywords, cast, crew) to recommend similar items.  
🔹 **Techniques**:  
   - **TF-IDF (Term Frequency-Inverse Document Frequency)** for text processing  
   - **Cosine Similarity** for measuring similarity between movies  
   - **Machine Learning Models** (optional)  

💡 **Example**: If a user likes *Interstellar* and *Inception*, the system recommends movies like *The Martian* and *Blade Runner 2049*.  
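To make the two core techniques concrete, here is a minimal from-scratch sketch that uses only the Python standard library. The three plot snippets, the helper names `tf_idf_vectors` and `cosine`, and the plain `tf * log(N / df)` weighting are assumptions chosen for illustration; library implementations typically use smoothed variants of the formula.

```python
import math
from collections import Counter

# Toy plot descriptions (invented for illustration, not taken from the dataset).
docs = {
    "Movie A": "astronaut travels through space to save earth",
    "Movie B": "astronaut stranded in space must survive",
    "Movie C": "detective hunts a killer in the city",
}

def tf_idf_vectors(texts):
    """Return one {term: weight} dict per text, using tf * log(N / df)."""
    tokenized = [text.split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vec_a, vec_b, vec_c = tf_idf_vectors(list(docs.values()))
sim_ab = cosine(vec_a, vec_b)  # A and B share "astronaut" and "space"
sim_ac = cosine(vec_a, vec_c)  # A and C share no terms
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")
```

Movies A and B share descriptive terms, so their similarity is positive, while A and C share none and score zero.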

---

#### **2. Collaborative Filtering**  
🔹 **How it Works**: Uses user interaction data to recommend items. It assumes that users with similar preferences will like similar items.  
🔹 **Techniques**:  
   - **User-Based Filtering**: Recommends movies based on similar users  
   - **Item-Based Filtering**: Recommends movies whose ratings across users resemble those of movies the user already liked  
   - **Matrix Factorization (SVD, ALS)** for uncovering hidden patterns  

💡 **Example**: If User A and User B both rated *The Dark Knight* highly, and User A also liked *Dunkirk*, the system may recommend *Dunkirk* to User B.  
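The example above can be sketched as a tiny user-based filter. Everything here is an assumption made for illustration: the ratings are invented, unrated movies are treated as zeros when computing cosine similarity, and only the single nearest user is consulted. Production systems usually normalize ratings and combine many neighbours.

```python
import math

# Toy ratings on a 1-5 scale (invented for illustration).
ratings = {
    "User A": {"The Dark Knight": 5, "Dunkirk": 5, "Inception": 4},
    "User B": {"The Dark Knight": 5, "Inception": 4},
    "User C": {"The Dark Knight": 1, "Dunkirk": 2, "Inception": 1, "Notting Hill": 5},
}

def user_similarity(a, b):
    """Cosine similarity between two users; a missing rating counts as 0."""
    movies = set(ratings[a]) | set(ratings[b])
    dot = sum(ratings[a].get(m, 0) * ratings[b].get(m, 0) for m in movies)
    norm_a = math.sqrt(sum(r * r for r in ratings[a].values()))
    norm_b = math.sqrt(sum(r * r for r in ratings[b].values()))
    return dot / (norm_a * norm_b)

def recommend_for(user, min_rating=4):
    """Suggest movies the most similar user rated highly and `user` has not rated."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: user_similarity(user, u))
    return [
        movie
        for movie, rating in ratings[nearest].items()
        if rating >= min_rating and movie not in ratings[user]
    ]

print(recommend_for("User B"))
```

User B's nearest neighbour is User A, so the only unseen movie that User A rated highly, *Dunkirk*, is suggested.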

---

#### **3. Hybrid Recommender Systems**  
🔹 **How it Works**: Combines content-based and collaborative filtering to improve recommendations.  
🔹 **Techniques**:  
   - **Weighted Hybrid**: Merges multiple recommendation scores  
   - **Switching Hybrid**: Switches between methods dynamically  
   - **Deep Learning Models** for learning complex relationships  

💡 **Example**: Netflix is commonly cited as using a **hybrid approach**, combining information about the titles a user has watched (content-based) with the preferences of similar users (collaborative filtering).  
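A weighted hybrid can be sketched in a few lines. The function name `weighted_hybrid`, the scores, and the 0.6 weight are all hypothetical; the only idea taken from the list above is that two recommendation scores are merged into one.

```python
def weighted_hybrid(content_scores, collab_scores, content_weight=0.5):
    """Blend two {title: score} dicts; a title missing from one source scores 0 there."""
    titles = set(content_scores) | set(collab_scores)
    combined = {
        title: content_weight * content_scores.get(title, 0.0)
        + (1 - content_weight) * collab_scores.get(title, 0.0)
        for title in titles
    }
    # Highest blended score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Invented scores in the range 0-1.
content_scores = {"The Martian": 0.9, "Blade Runner 2049": 0.7, "Dunkirk": 0.2}
collab_scores = {"Dunkirk": 0.9, "The Martian": 0.5}

ranking = weighted_hybrid(content_scores, collab_scores, content_weight=0.6)
for title, score in ranking:
    print(f"{title}: {score:.2f}")
```

With these numbers *Dunkirk* overtakes *Blade Runner 2049* because its strong collaborative score outweighs its weak content score.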

---

In [110]:
import numpy as np
import pandas as pd

### **Data Preprocessing for a Content-Based Movie Recommender System**  

With a clear understanding of recommender systems, Adil and Harry moved on to **data preprocessing**—a crucial step in ensuring accurate and efficient recommendations.  

*"A good model starts with clean data,"* Adil reminded Harry as they opened their dataset.  

---

### **Loading the Dataset**

*"Now, it's time to load the raw dataset and take a quick look at it,"* Adil said, as he typed the code. They loaded their movie dataset into a pandas dataframe for inspection.

In [111]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [112]:
credits.sample(2)

Unnamed: 0,movie_id,title,cast,crew
314,29193,The Spanish Prisoner,"[{""cast_id"": 1, ""character"": ""Jimmy Dell"", ""cr...","[{""credit_id"": ""561f77189251417f47001122"", ""de..."
1067,9981,Kicking & Screaming,"[{""cast_id"": 19, ""character"": ""Phil Weston"", ""...","[{""credit_id"": ""52fe4556c3a36847f80c8957"", ""de..."


In [113]:
movies.sample(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
3386,0,"[{""id"": 14, ""name"": ""Fantasy""}, {""id"": 35, ""na...",,11687,"[{""id"": 964, ""name"": ""servant""}, {""id"": 4379, ...",fr,Les visiteurs,This outrageous time-travel comedy follows the...,8.893676,"[{""name"": ""Canal Plus"", ""id"": 104}, {""name"": ""...","[{""iso_3166_1"": ""FR"", ""name"": ""France""}]",1993-01-27,0,107.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""}]",Released,They Weren't Born Yesterday!,The Visitors,7.1,420
2011,23000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 35, ""name...",,22215,[],en,Chéri,A sumptuous dramatic comedy set in late 19th C...,5.494047,"[{""name"": ""Path\u00e9 Renn Productions"", ""id"":...","[{""iso_3166_1"": ""DE"", ""name"": ""Germany""}, {""is...",2009-02-10,9366227,86.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,,Cheri,5.9,56


### **Merging `movies` and `credits` Datasets**

In [114]:
movies = movies.merge(credits,on='title')
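One thing to watch with this merge: joining on `title` can produce extra rows whenever two different movies share a title, because an inner join pairs every matching row on one side with every matching row on the other. The toy join below is written in plain Python with invented rows and a hypothetical `inner_join` helper, purely to illustrate the effect. Joining on the identifier (`id` in `movies`, `movie_id` in `credits`) would avoid it, though that is a suggestion rather than what the notebook does.

```python
# Invented rows: two different movies that happen to share a title.
movies_rows = [
    {"id": 1, "title": "Remake"},
    {"id": 2, "title": "Remake"},
    {"id": 3, "title": "Original"},
]
credits_rows = [
    {"movie_id": 1, "title": "Remake", "director": "Director X"},
    {"movie_id": 2, "title": "Remake", "director": "Director Y"},
    {"movie_id": 3, "title": "Original", "director": "Director Z"},
]

def inner_join(left, right, left_key, right_key):
    """Pair every left row with every right row whose key value matches."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if left_row[left_key] == right_row[right_key]
    ]

by_title = inner_join(movies_rows, credits_rows, "title", "title")
by_id = inner_join(movies_rows, credits_rows, "id", "movie_id")

# The shared title yields 2 x 2 = 4 pairings, two of which mix up the movies.
print(len(by_title), len(by_id))
```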

In [115]:
movies.sample(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
845,80000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 9648, ""na...",,12117,"[{""id"": 378, ""name"": ""prison""}, {""id"": 690, ""n...",en,Instinct,"Dr. Ethan Powell, an anthropologist, is in Afr...",7.386394,"[{""name"": ""Spyglass Entertainment"", ""id"": 158}...",...,126.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,Instinct,6.2,146,12117,"[{""cast_id"": 10, ""character"": ""Dr. Ethan Powel...","[{""credit_id"": ""52fe44b89251416c7503eb3d"", ""de..."


In [116]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

### **Adil and Harry's Feature Selection: Building the Foundation**

Adil and Harry knew that selecting the right features was crucial for building an effective content-based movie recommender system. They examined both the **movies** and **credits** datasets carefully, picking only the relevant columns for their model.

---

### **Selected Features for the Content-Based Recommender System**

To ensure their recommender system was powerful, they decided to use the following features:

---

#### **1. Movie Metadata Features (from the `movies` dataset)**

These features would help the system understand the genre, plot, and keywords associated with each movie.

- **`genres`**: The genre(s) of the movie (e.g., Action, Comedy, Drama, Sci-Fi). This would be used to group similar movies.
  
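- **`movie_id`**: Unique identifier for each movie, necessary for tracking and recommendation. After the merge this column comes from `credits`; the `movies` file stores the same identifier as `id`.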

- **`title`**: The name of the movie, which will be part of the recommendation output.

- **`overview`**: A brief description of the movie’s plot, which will help identify the content of the movie for similarity comparison.

- **`keywords`**: Short tags that describe the movie's themes, settings, or key plot elements (e.g., culture clash, space war), which help surface similar movies.

---

#### **2. Cast and Crew Features (from the `credits` dataset)**

These features would help the system understand the people behind the movie, such as actors and directors, which are important factors in content-based recommendations.

- **`cast`**: The list of actors in the movie (e.g., Leonardo DiCaprio, Robert Downey Jr.). The system could recommend movies featuring similar actors.

- **`crew`**: The list of crew members, including the **director**, **producer**, and **screenwriter**. This information helps recommend movies made by the same people; the preprocessing later in this section keeps only the director.

---

#### **3. Additional Feature (Optional)**

- **`release_date`**: The official release date of the movie. Although not critical for content-based filtering, it could be used to favor recent releases or seasonal titles. It is not part of the column selection used in this walkthrough.

In [117]:
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [118]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [119]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [120]:
movies.dropna(inplace=True)

In [121]:
movies.duplicated().sum()

np.int64(0)

In [122]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [123]:
import ast 
ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [124]:
def convert(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name']) 
    return L
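The `genres` and `keywords` columns hold lists of dictionaries stored as strings, so they have to be parsed before the names can be extracted. The sketch below repeats that idea on a shortened sample string and compares `ast.literal_eval` with `json.loads`. The helper names are hypothetical, and the `null` example is an invented edge case rather than something observed in this dataset.

```python
import ast
import json

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def names_via_ast(text):
    """Parse a stringified list of dicts as Python literals and keep each name."""
    return [item["name"] for item in ast.literal_eval(text)]

def names_via_json(text):
    """Parse the same string as JSON and keep each name."""
    return [item["name"] for item in json.loads(text)]

genres_ast = names_via_ast(raw)
genres_json = names_via_json(raw)

# JSON-only literals such as null are not valid Python literals.
with_null = '[{"id": 1, "name": null}]'
try:
    ast.literal_eval(with_null)
    ast_handles_null = True
except ValueError:
    ast_handles_null = False
json_null_name = json.loads(with_null)[0]["name"]  # parsed as None

print(genres_ast, genres_json, ast_handles_null, json_null_name)
```

Both parsers agree on the sample; `json.loads` may be the more forgiving choice if a column ever contains `true`, `false`, or `null`.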

In [125]:
movies.dropna(inplace=True)

In [126]:
movies['genres'] = movies['genres'].apply(convert)

In [127]:
movies['keywords'] = movies['keywords'].apply(convert)
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [128]:
def convert3(text):
    L = []
    counter = 0
    for i in ast.literal_eval(text):
        if counter < 3:
            L.append(i['name'])
        counter+=1
    return L

In [129]:
# Keeps the full cast list; convert3 (defined above, first three names only) is not applied here.
movies['cast'] = movies['cast'].apply(convert)

In [130]:
def fetch_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L
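The same filter can be tried on a small, self-contained sample. The crew string below is abridged and partly invented (only the director's name is taken from the notebook's output), and `directors` is a hypothetical helper that uses `.get` so entries without a `job` field are skipped rather than raising an error.

```python
import ast

# Abridged, partly invented crew entry; only "job" and "name" matter here.
raw_crew = (
    '[{"job": "Editor", "name": "Example Editor"}, '
    '{"job": "Director", "name": "James Cameron"}, '
    '{"name": "Entry Without Job"}]'
)

def directors(text):
    """Return the names of all crew members whose job is exactly 'Director'."""
    return [
        member["name"]
        for member in ast.literal_eval(text)
        if member.get("job") == "Director"
    ]

found = directors(raw_crew)
print(found)
```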

In [131]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [132]:
movies['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

In [133]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [134]:
movies.sample(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
2628,6552,Idle Hands,"[Anton, is, a, cheerful, but, exceedingly, non...","[Thriller, Comedy, Horror]","[teenager, attic, knitting needle, noise compl...","[Devon Sawa, Seth Green, Jessica Alba, Vivica ...",[Rodman Flender]
1715,699,For Your Eyes Only,"[A, British, spy, ship, has, sunk, and, on, bo...","[Adventure, Action, Thriller]","[london england, submarine, england, sea, assa...","[Roger Moore, Carole Bouquet, Chaim Topol, Lyn...",[John Glen]


In [135]:
def collapse(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1
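Removing the spaces matters because the tags are later split into individual words. Without it, two unrelated people who share a first name would contribute the same token. The sketch below illustrates this with a hypothetical `name_tokens` helper and two names chosen only as an example.

```python
def name_tokens(names, collapse_spaces):
    """Turn a list of names into the set of tokens a word-based vectorizer would see."""
    tokens = set()
    for name in names:
        if collapse_spaces:
            tokens.add(name.replace(" ", "").lower())  # one token per person
        else:
            tokens.update(name.lower().split())        # one token per word
    return tokens

cast_a = ["Sam Worthington"]
cast_b = ["Sam Rockwell"]

shared_raw = name_tokens(cast_a, False) & name_tokens(cast_b, False)
shared_collapsed = name_tokens(cast_a, True) & name_tokens(cast_b, True)

print(shared_raw)        # the first name alone creates a false match
print(shared_collapsed)  # no overlap once each name is a single token
```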

In [136]:
movies['cast'] = movies['cast'].apply(collapse)
movies['crew'] = movies['crew'].apply(collapse)
movies['genres'] = movies['genres'].apply(collapse)
movies['keywords'] = movies['keywords'].apply(collapse)

In [137]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...",[GoreVerbinski]


In [138]:
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [139]:
new_df = movies.drop(columns=['overview','genres','keywords','cast','crew'])

In [140]:
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x))                         

In [141]:
new_df['tags'] = new_df['tags'].apply(lambda x: x.lower())  

In [142]:
new_df.head(2)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."


In [143]:
from nltk.stem.porter import PorterStemmer  
ps = PorterStemmer()  

In [144]:
def stem(text):  
    y = []  
    for i in text.split():  
        y.append(ps.stem(i))  
    return " ".join(y)  
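Stemming is applied so that different forms of a word end up as the same token before counting. The sketch below illustrates the effect with a deliberately crude suffix stripper. It is a toy stand-in written for this example, not the Porter algorithm, and its outputs will differ from `PorterStemmer` on many words.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def crude_stem(word):
    """Strip one common suffix if at least three characters remain (toy rule, not Porter)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_text(text):
    """Apply crude_stem to every whitespace-separated word."""
    return " ".join(crude_stem(word) for word in text.split())

first = "he loved dancing"
second = "she loves dances"

shared_before = set(first.split()) & set(second.split())
shared_after = set(stem_text(first).split()) & set(stem_text(second).split())

print(shared_before)  # no words in common
print(shared_after)   # both sentences now share two stems
```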

In [145]:
new_df['tags'] = new_df['tags'].apply(stem)

### **Vectorization for a Content-Based Movie Recommender System**

In [146]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words='english')

In [147]:
vector = cv.fit_transform(new_df['tags']).toarray()

In [148]:
vector.shape

(4806, 5000)
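Note that this step uses raw term counts rather than TF-IDF weights. The idea behind a count vectorizer with a feature cap can be sketched in plain Python as follows. The stop-word list, the whitespace tokenizer, and the alphabetical tie-breaking are simplifying assumptions, so the result will not match scikit-learn's tokenization exactly.

```python
from collections import Counter

STOP_WORDS = {"a", "the", "in", "of", "and", "is", "to"}  # tiny illustrative list

def bag_of_words(texts, max_features):
    """Return (vocabulary, count vectors) keeping only the most frequent terms."""
    tokenized = [
        [word for word in text.lower().split() if word not in STOP_WORDS]
        for text in texts
    ]
    totals = Counter(word for tokens in tokenized for word in tokens)
    # Most frequent first; ties are broken alphabetically in this sketch.
    by_frequency = sorted(sorted(totals), key=lambda word: -totals[word])
    vocab = sorted(by_frequency[:max_features])
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

tags = ["space war in the future", "war of the worlds", "future of space travel"]
vocab, vectors = bag_of_words(tags, max_features=3)

print(vocab)
for row in vectors:
    print(row)
```

Each row has one column per vocabulary term, which is why the real matrix has 5000 columns: one per retained term.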

In [149]:
from sklearn.metrics.pairwise import cosine_similarity

In [150]:
similarity = cosine_similarity(vector)
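It is worth estimating how much memory these dense arrays take, since both are kept in memory and one of them is later written to disk. The arithmetic below assumes 8 bytes per value (64-bit integers for the counts and 64-bit floats for the similarities), which is an assumption about the dtypes rather than something shown in the notebook output.

```python
n_movies, n_features = 4806, 5000  # from the vector.shape output in this section
bytes_per_value = 8                # assumption: 64-bit values

vector_bytes = n_movies * n_features * bytes_per_value
similarity_bytes = n_movies * n_movies * bytes_per_value

print(f"count matrix:      {vector_bytes / 1e6:.1f} MB")
print(f"similarity matrix: {similarity_bytes / 1e6:.1f} MB")
```

Under that assumption each array is a little under 200 MB. The similarity matrix grows with the square of the number of movies, so a much larger catalogue might call for keeping only each movie's top few neighbours.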

In [151]:
new_df[new_df['title'] == 'The Lego Movie'].index[0]

np.int64(744)

In [152]:
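def recommend(movie):
    # Row position of the title (IndexError if absent). Positions are used because
    # similarity is indexed by position and index labels have gaps after dropna().
    index = np.flatnonzero(new_df['title'] == movie)[0]
    # Rank every movie by similarity to the query, highest first
    distances = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda x: x[1])
    # Skip the first entry (the query movie itself) and print the next five titles
    for i in distances[1:6]:
        print(new_df.iloc[i[0]].title)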
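The ranking logic can be exercised without the dataset. The sketch below uses an invented four-movie similarity matrix and a hypothetical `recommend_titles` function that returns a list instead of printing, returns an empty list for unknown titles, and excludes the query movie by position rather than assuming it sorts first.

```python
# Invented titles and similarity scores, for illustration only.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity_matrix = [
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.7],
    [0.4, 0.3, 0.7, 1.0],
]

def recommend_titles(movie, top_n=2):
    """Return the top_n titles most similar to `movie`, or [] if it is unknown."""
    if movie not in titles:
        return []
    index = titles.index(movie)
    ranked = sorted(
        enumerate(similarity_matrix[index]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Exclude the query movie explicitly instead of slicing off the first entry.
    return [titles[i] for i, _ in ranked if i != index][:top_n]

print(recommend_titles("Movie A"))
print(recommend_titles("Unknown Movie"))
```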

In [153]:
recommend('Gandhi')

Ramanujan
Guiana 1838
The Wind That Shakes the Barley
The Bounty
A Passage to India


In [154]:
import pickle

In [155]:
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(new_df, f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
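Loading the saved objects back is the mirror image of dumping them. The round trip below is a minimal sketch with a small invented payload and a hypothetical file name inside a temporary directory, so it does not touch the real `movie_list.pkl` or `similarity.pkl` files.

```python
import os
import pickle
import tempfile

# Small invented payload standing in for the DataFrame and similarity matrix.
payload = {
    "titles": ["Movie A", "Movie B"],
    "similarity": [[1.0, 0.8], [0.8, 1.0]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarity_demo.pkl")  # hypothetical file name
    # Write the object in binary mode; the with-block closes the file afterwards.
    with open(path, "wb") as fh:
        pickle.dump(payload, fh)
    # Read it back the same way.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == payload)
```

Pickle files can execute code when loaded, so only files from a trusted source should be unpickled.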