---
**Dataset Name:** Top-Rated TMDB Movies Dataset (as of 26-July-2022)

**Dataset Description:**
This dataset consists of information about 10,000 top-rated movies listed on the TMDB (The Movie Database) platform. The data includes details such as movie IDs, titles, genres, original language, overviews, popularity, release dates, vote averages, and vote counts for each movie. It is a valuable resource for anyone interested in analyzing and exploring top-rated movies on TMDB.

**Data Fields:**
The dataset includes the following fields:

1. **ID:** Movie ID number on the TMDB website. (Integer)
2. **Title:** Movie name. (String)
3. **Genre:** Movie genre, which can include categories such as crime, adventure, and more. (String)
4. **Original Language:** The original language in which the movie was released. (String)
5. **Overview:** A summary or description of the movie. (Text)
6. **Popularity:** A measure of the movie's popularity. (Float)
7. **Release Date:** The date on which the movie was released. (Date)
8. **Vote Average:** The average vote or rating given to the movie. (Float)
9. **Vote Count:** The number of votes or ratings the movie has received. (Integer)

**Data Sample:**

Here is a sample of the dataset to illustrate its structure:

| ID    | Title                       | Genre                | Original Language | Overview                                          | Popularity | Release Date | Vote Average | Vote Count |
|-------|-----------------------------|----------------------|-------------------|---------------------------------------------------|------------|--------------|--------------|------------|
| 278   | The Shawshank Redemption    | Drama,Crime          | en                | Framed in the 1940s for the double murder of h... | 94.075     | 1994-09-23   | 8.7          | 21862      |
| 19404 | Dilwale Dulhania Le Jayenge | Comedy,Drama,Romance | hi                | Raj is a rich, carefree, happy-go-lucky second... | 25.408     | 1995-10-19   | 8.7          | 3731       |
| 238   | The Godfather               | Drama,Crime          | en                | Spanning the years 1945 to 1955, a chronicle o... | 90.585     | 1972-03-14   | 8.7          | 16280      |
| ...   | ...                         | ...                  | ...               | ...                                               | ...        | ...          | ...          | ...        |



# Importing Libraries:

In [45]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [46]:
df = pd.read_csv("top10K-TMDB-movies.csv")

In [47]:
df.head(10)

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811
5,667257,Impossible Things,"Family,Drama",es,"Matilde is a woman who, after the death of her...",14.358,2021-06-17,8.6,255
6,129,Spirited Away,"Animation,Family,Fantasy",ja,"A young girl, Chihiro, becomes trapped in a st...",92.056,2001-07-20,8.5,13093
7,730154,Your Eyes Tell,"Romance,Drama",ja,"A tragic accident lead to Kaori's blindness, b...",51.345,2020-10-23,8.5,339
8,372754,Dou kyu sei – Classmates,"Romance,Animation",ja,"Rihito Sajo, an honor student with a perfect s...",14.285,2016-02-20,8.5,239
9,372058,Your Name.,"Romance,Animation,Drama",ja,High schoolers Mitsuha and Taki are complete s...,158.27,2016-08-26,8.5,8895


In [48]:
df.describe()

Unnamed: 0,id,popularity,vote_average,vote_count
count,10000.0,10000.0,10000.0,10000.0
mean,161243.505,34.697267,6.62115,1547.3094
std,211422.046043,211.684175,0.766231,2648.295789
min,5.0,0.6,4.6,200.0
25%,10127.75,9.15475,6.1,315.0
50%,30002.5,13.6375,6.6,583.5
75%,310133.5,25.65125,7.2,1460.0
max,934761.0,10436.917,8.7,31917.0


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 10000 non-null  int64  
 1   title              10000 non-null  object 
 2   genre              9997 non-null   object 
 3   original_language  10000 non-null  object 
 4   overview           9987 non-null   object 
 5   popularity         10000 non-null  float64
 6   release_date       10000 non-null  object 
 7   vote_average       10000 non-null  float64
 8   vote_count         10000 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 703.2+ KB
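
The `df.info()` output lists `release_date` with dtype `object`, meaning the dates are stored as plain strings. If date arithmetic is needed later, one option is to parse the column while reading. The sketch below uses a small in-memory CSV (two rows copied from the `df.head(10)` output) rather than the real file; `csv_text`, `toy`, and `years` are illustrative names.

```python
import io
import pandas as pd

# Two rows copied from the head() output, kept in memory for illustration
csv_text = (
    "id,title,release_date\n"
    "278,The Shawshank Redemption,1994-09-23\n"
    "238,The Godfather,1972-03-14\n"
)

# parse_dates converts the named column to datetime64 while reading
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["release_date"])

# With a datetime column, the .dt accessor exposes parts such as the year
years = toy["release_date"].dt.year.tolist()
print(years)  # [1994, 1972]
```

This recommender uses only the text columns, so parsing the dates is optional for this notebook.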


In [50]:
df.isnull().sum()

id                    0
title                 0
genre                 3
original_language     0
overview             13
popularity            0
release_date          0
vote_average          0
vote_count            0
dtype: int64
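
The counts above show 3 missing genres and 13 missing overviews. One possible way to handle them, sketched here on made-up rows, is to drop the incomplete rows and renumber the index. The frame `toy` and the names `complete` and `remaining_titles` are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Made-up rows standing in for the real CSV
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Drama,Crime", None, "Comedy", "Action"],
    "overview": ["A heist goes wrong.", "Two friends travel.", None, "A pilot returns."],
})

# Drop rows missing either text field, then renumber rows 0..n-1 so that
# row labels keep matching row positions in any matrix built later.
complete = toy.dropna(subset=["genre", "overview"]).reset_index(drop=True)

remaining_titles = complete["title"].tolist()
print(remaining_titles)  # ['A', 'D']
```

Resetting the index matters for this kind of project because a similarity matrix is addressed by row position, so labels and positions should stay aligned.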

# Feature Selection:
### Choose the features best suited to the project's specific requirements.

In [51]:
df.columns

Index(['id', 'title', 'genre', 'original_language', 'overview', 'popularity',
       'release_date', 'vote_average', 'vote_count'],
      dtype='object')

**We've chosen `id`, `title`, `genre`, and `overview` from our dataset to meet the project's core requirements. `id` is the unique identifier, `title` provides movie names, `genre` categorizes them, and `overview` gives a short summary of each film. These features are the building blocks for our analysis and recommendation system.**
    

In [52]:
df = df[['id','title','genre','overview']].copy()  # copy so later column assignments do not act on a view

In [53]:
df.head(3)

Unnamed: 0,id,title,genre,overview
0,278,The Shawshank Redemption,"Drama,Crime",Framed in the 1940s for the double murder of h...
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance","Raj is a rich, carefree, happy-go-lucky second..."
2,238,The Godfather,"Drama,Crime","Spanning the years 1945 to 1955, a chronicle o..."


<!-- df["tags"] = df["genre"] + df["overview"] -->

In [54]:
df["tags"] =  df["overview"] + df["genre"] 

In [55]:
cleaned_df = df.drop(columns=["overview",'genre'])

In [56]:
cleaned_df.head(5)

Unnamed: 0,id,title,tags
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second..."
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o..."
3,424,Schindler's List,The true story of how businessman Oskar Schind...
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...


---
### Description:

- In our content-based recommendation system, we rely on "tags" to make movie recommendations.
- These tags are descriptive text that captures essential details about each movie.
- To build them, we've introduced a new column, `tags`, which concatenates each movie's `overview` and `genre`.
- By comparing these tags, the model can suggest movies whose descriptions and genres closely match a movie you already like.
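
When two text columns are joined into one tag string, a separator and explicit handling of missing values help avoid surprises. The following is a minimal sketch on made-up rows, not the notebook's own cell; `toy`, `fused`, and `tags` are illustrative names.

```python
import pandas as pd

# Made-up rows: the first overview has no trailing punctuation,
# the second overview is missing entirely.
toy = pd.DataFrame({
    "overview": ["A heist goes wrong", None],
    "genre": ["Drama,Crime", "Comedy"],
})

# Plain concatenation: "wrong" and "Drama" fuse into one word, and a
# missing overview turns the whole result for that row into NaN.
fused = toy["overview"] + toy["genre"]

# Filling missing values and inserting a space keeps the words separate
# and preserves the genre even when the overview is absent.
tags = (toy["overview"].fillna("") + " " + toy["genre"].fillna("")).str.strip()
print(tags.tolist())  # ['A heist goes wrong Drama,Crime', 'Comedy']
```

Overviews that end in a period are split correctly by the default tokenizer either way, so the separator mainly matters for text that ends without punctuation.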

# Model Development:

- Here, we are dealing with textual data that must be converted into numerical vectors, since similarity computations operate on numbers.
- To accomplish this, we use a basic **Natural Language Processing (NLP)** technique: `CountVectorizer` from scikit-learn turns each movie's tags into a vector of word counts.

- We then use `sklearn.metrics.pairwise.cosine_similarity` to compute the similarity between these vectors. Higher cosine values indicate greater similarity; for non-negative count vectors the value ranges from 0 (no shared words) to 1 (same direction).
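
To make the vectorize-then-compare idea concrete, here is a small sketch on three made-up sentences. The corpus `docs` is invented for illustration; only `CountVectorizer` and `cosine_similarity` come from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up mini corpus: the first two texts share several words
docs = [
    "space hero saves the galaxy",
    "space hero fights aliens in the galaxy",
    "romantic comedy in paris",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)   # sparse document-term count matrix
sim = cosine_similarity(counts)    # 3x3 matrix of pairwise similarities

# Documents 0 and 1 share "space", "hero", "galaxy", so they score higher
# with each other than either does with document 2 (no shared words).
print(sim.round(2))
```

The diagonal is 1 because every text is identical to itself, and the pair with no shared vocabulary scores 0.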


In [57]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=10000,stop_words="english")


In [58]:
cv

In [59]:
my_vector = cv.fit_transform(cleaned_df['tags'].values.astype('U')).toarray()

In [60]:
my_vector 

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [61]:
my_vector.shape

(10000, 10000)
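
A dense 10000 x 10000 `int64` array occupies 800 MB (10^8 entries at 8 bytes each), and almost all of its entries are zero. `cosine_similarity` also accepts the sparse matrix that `fit_transform` returns, so the `.toarray()` step can be skipped. A minimal sketch on made-up text, showing that both routes give the same similarities:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up mini corpus for illustration
docs = ["crime drama prison", "crime family drama", "animated fantasy adventure"]

sparse_counts = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix
dense_counts = sparse_counts.toarray()                 # same data, dense

sim_sparse = cosine_similarity(sparse_counts)
sim_dense = cosine_similarity(dense_counts)

# Both inputs yield the same dense similarity matrix
same = bool(np.allclose(sim_sparse, sim_dense))
print(same)  # True
```

The resulting similarity matrix is still dense, so it remains the largest object in memory at this dataset size.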

In [62]:
from sklearn.metrics.pairwise import cosine_similarity


cosine_similarity(my_vector)

array([[1.        , 0.05634362, 0.12888482, ..., 0.07559289, 0.11065667,
        0.06388766],
       [0.05634362, 1.        , 0.07624929, ..., 0.        , 0.03636965,
        0.        ],
       [0.12888482, 0.07624929, 1.        , ..., 0.02273314, 0.06655583,
        0.08645856],
       ...,
       [0.07559289, 0.        , 0.02273314, ..., 1.        , 0.03253   ,
        0.02817181],
       [0.11065667, 0.03636965, 0.06655583, ..., 0.03253   , 1.        ,
        0.0412393 ],
       [0.06388766, 0.        , 0.08645856, ..., 0.02817181, 0.0412393 ,
        1.        ]])

In [63]:
similarity_values = cosine_similarity(my_vector)

In [64]:
cleaned_df[cleaned_df['title'] == "Dilwale Dulhania Le Jayenge"]  

Unnamed: 0,id,title,tags
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second..."


In [65]:
cleaned_df[cleaned_df['title'] == "Dilwale Dulhania Le Jayenge"].index[0]  # Code to extract Index Number based on titles

1

In [66]:
distance = sorted(list(enumerate(similarity_values[1])), reverse=True, key=lambda vector:vector[1])
for i in distance[0:5]:
    print(i)

(1, 1.0000000000000002)
(3716, 0.24333213169614382)
(8686, 0.2357022603955158)
(370, 0.2222222222222222)
(886, 0.21516574145596756)


In [67]:
distance = sorted(list(enumerate(similarity_values[1])), reverse=True, key=lambda vector:vector[1])
for i in distance[0:5]:
    print(cleaned_df.iloc[i[0]].title)

Dilwale Dulhania Le Jayenge
A Passage to India
The Manual of Love
The Cameraman
The Graduate
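
Sorting a list of `(index, score)` tuples works, but the same ranking can be obtained directly with NumPy. The sketch below uses a made-up similarity row; `scores`, `query_index`, and `top` are illustrative names.

```python
import numpy as np

# Made-up similarity row; position 1 is the query movie itself (score 1.0)
scores = np.array([0.10, 1.00, 0.24, 0.05, 0.22])
query_index = 1

# argsort returns positions in ascending score order; reverse for descending
order = np.argsort(scores)[::-1]

# Skip the query itself and keep the three best remaining positions
top = [int(i) for i in order if i != query_index][:3]
print(top)  # [2, 4, 0]
```

Filtering out `query_index` explicitly avoids relying on the query always sorting first, which may not hold if two movies have identical tags.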


In [68]:
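def suggestion(movies):
    # Row index of the requested title (raises IndexError if the title is absent)
    index = cleaned_df[cleaned_df['title'] == movies].index[0]

    # Pair every row position with its similarity to the query, highest first
    distance = sorted(list(enumerate(similarity_values[index])), reverse=True, key=lambda vector:vector[1])
    # The first entry is normally the query movie itself, followed by its 10 closest matches
    for i in distance[0:11]:
        print(cleaned_df.iloc[i[0]].title)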

In [69]:
suggestion("Iron Man")

Iron Man
Iron Man 3
Guardians of the Galaxy Vol. 2
Avengers: Age of Ultron
Star Wars: Episode III - Revenge of the Sith
G.O.R.A.
Iron Man 2
Charlie's Angels
Everything Everywhere All at Once
Star Wars: Episode I - The Phantom Menace
The Rocketeer


In [70]:
suggestion("Dilwale Dulhania Le Jayenge")

Dilwale Dulhania Le Jayenge
A Passage to India
The Manual of Love
The Cameraman
The Graduate
A California Christmas
Smart People
The Broken Circle Breakdown
The Nanny Diaries
The Awful Truth
Just Married
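
For reuse outside the notebook (for example in an app), a variant that returns a list, excludes the query title, and tolerates unknown titles may be more convenient. This is a hedged sketch, not the notebook's function: `suggest`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def suggest(title, movies, similarity, k=10):
    """Return up to k titles most similar to `title`, excluding the title itself."""
    # Positional matches, so the result indexes the similarity matrix directly
    positions = np.flatnonzero((movies["title"] == title).to_numpy())
    if positions.size == 0:
        return []  # unknown title: return nothing instead of raising IndexError
    pos = int(positions[0])
    order = np.argsort(similarity[pos])[::-1]      # highest similarity first
    picks = [int(i) for i in order if i != pos][:k]
    return movies.iloc[picks]["title"].tolist()

# Made-up data: three movies and a symmetric similarity matrix
toy_movies = pd.DataFrame({"title": ["A", "B", "C"]})
toy_similarity = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
])

print(suggest("A", toy_movies, toy_similarity, k=2))  # ['C', 'B']
print(suggest("Z", toy_movies, toy_similarity))       # []
```

Passing the frame and matrix as arguments instead of reading globals makes the function easier to test with small inputs like these.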


---
- The `suggestion` function above prints the movie titles ranked highest by cosine similarity, computed on the vectorized tags.

In [71]:
import pickle

In [72]:
# pickle.dump(cleaned_df, open('movies_list.pkl', 'wb'))

In [73]:
# pickle.dump(similarity_values, open('similarity.pkl', 'wb'))

In [74]:
# pickle.load(open('movies_list.pkl', 'rb'))
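
The commented-out cells above save and reload the results with `pickle`. A minimal round-trip sketch follows, using `with` blocks so the files are closed reliably. The `payload` dictionary and the temporary `example.pkl` path are placeholders for illustration, not the notebook's actual objects or file names.

```python
import os
import pickle
import tempfile

# Placeholder object standing in for the DataFrame or similarity matrix
payload = {"titles": ["A", "B"], "similarity": [[1.0, 0.2], [0.2, 1.0]]}

# Hypothetical file location inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "example.pkl")

# Write the object in binary mode; the file closes when the block exits
with open(path, "wb") as f:
    pickle.dump(payload, f)

# Read it back the same way
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == payload)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.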