In [1]:
# import the required libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

`CountVectorizer` from `sklearn.feature_extraction.text` is a tool used to convert a collection of text documents into a matrix of token counts. It is commonly employed in natural language processing (NLP) tasks to extract features from text data.

### Key Features:
1. **Tokenization**: Splits the text into individual words or tokens.
2. **Lowercasing**: Converts all tokens to lowercase by default.
3. **Stop Words Removal**: Can exclude common stop words like "and," "the," etc.
4. **N-grams**: Supports generation of n-grams (e.g., unigrams, bigrams, trigrams).
5. **Vocabulary Control**: Accepts a custom vocabulary, filters tokens by document frequency (`min_df`, `max_df`), and can cap the vocabulary size (`max_features`).

The `cosine_similarity` function from `sklearn.metrics.pairwise` computes the cosine similarity between vectors. Cosine similarity is the cosine of the angle between two non-zero vectors in an inner product space. It measures how closely the two vectors point in the same direction, irrespective of their magnitude.

---

### **Formula for Cosine Similarity**:
For two vectors \( A \) and \( B \):

\[
\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}
\]

Where:
- \( A \cdot B \) is the dot product of the vectors.
- \( \|A\| \) and \( \|B\| \) are the magnitudes (norms) of the vectors.

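The result ranges from:
- **1**: Same direction (maximally similar).
- **0**: Orthogonal vectors; for word counts, the two documents share no vocabulary terms.
- **-1**: Opposite directions (only possible if vectors have negative values, so it never occurs for count vectors).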
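
The formula can be checked by hand with NumPy. The sketch below uses three made-up vectors; the helper name `cosine` is an assumption introduced here for illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

def cosine(u, v):
    """Dot product divided by the product of the two norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine(a, b)   # close to 1: magnitude does not matter
sim_ac = cosine(a, c)   # 0: no shared non-zero component
print(sim_ab, sim_ac)
```

Scaling `b` changes its length but not its direction, which is why the similarity to `a` stays at 1 up to floating-point rounding.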


In [2]:
# import the data from the CSV file and check that the import succeeded
data = pd.read_csv('data.csv')
data

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811
...,...,...,...,...,...,...,...,...,...
9995,10196,The Last Airbender,"Action,Adventure,Fantasy",en,"The story follows the adventures of Aang, a yo...",98.322,2010-06-30,4.7,3347
9996,331446,Sharknado 3: Oh Hell No!,"Action,TV Movie,Science Fiction,Comedy,Adventure",en,The sharks take bite out of the East Coast whe...,12.490,2015-07-22,4.7,417
9997,13995,Captain America,"Action,Science Fiction,War",en,"During World War II, a brave, patriotic Americ...",18.333,1990-12-14,4.6,332
9998,2312,In the Name of the King: A Dungeon Siege Tale,"Adventure,Fantasy,Action,Drama",en,A man named Farmer sets out to rescue his kidn...,15.159,2007-11-29,4.7,668


In [3]:
# display the first 5 rows of data
data.head()

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811


In [4]:
data.describe()

Unnamed: 0,id,popularity,vote_average,vote_count
count,10000.0,10000.0,10000.0,10000.0
mean,161243.505,34.697267,6.62115,1547.3094
std,211422.046043,211.684175,0.766231,2648.295789
min,5.0,0.6,4.6,200.0
25%,10127.75,9.15475,6.1,315.0
50%,30002.5,13.6375,6.6,583.5
75%,310133.5,25.65125,7.2,1460.0
max,934761.0,10436.917,8.7,31917.0


In [5]:
data.isnull().sum()

id                    0
title                 0
genre                 3
original_language     0
overview             13
popularity            0
release_date          0
vote_average          0
vote_count            0
dtype: int64

In [6]:
data.columns

Index(['id', 'title', 'genre', 'original_language', 'overview', 'popularity',
       'release_date', 'vote_average', 'vote_count'],
      dtype='object')

The dataset contains the following columns:

1. id: Movie ID number on the website
2. title: Movie name
3. genre: Comma-separated movie genres (crime, adventure, etc.)
4. original_language: Language in which the movie was originally released
5. overview: Summary of the movie
6. popularity: Movie popularity score
7. release_date: Movie release date
8. vote_average: Average vote for the movie
9. vote_count: Number of votes for the movie

Important fields for the models:
1. title
2. overview
3. genre

In [7]:
movies = data[['id', 'title', 'overview', 'genre']]
movies.head()

Unnamed: 0,id,title,overview,genre
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,"Drama,Crime"
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second...","Comedy,Drama,Romance"
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...","Drama,Crime"
3,424,Schindler's List,The true story of how businessman Oskar Schind...,"Drama,History,War"
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...,"Drama,Crime"


## Feature Selection

In [13]:
# combine overview and genre into one text field; the words in both fields are the features used for matching
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning;
# the ' ' separator keeps the last overview word from fusing with the first genre
movies = movies.assign(tags=movies['overview'] + ' ' + movies['genre'])
movies.head()


Unnamed: 0,id,title,overview,genre,tags
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,"Drama,Crime",Framed in the 1940s for the double murder of h...
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second...","Comedy,Drama,Romance","Raj is a rich, carefree, happy-go-lucky second..."
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...","Drama,Crime","Spanning the years 1945 to 1955, a chronicle o..."
3,424,Schindler's List,The true story of how businessman Oskar Schind...,"Drama,History,War",The true story of how businessman Oskar Schind...
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...,"Drama,Crime",In the continuing saga of the Corleone crime f...
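
One detail of combining string columns with `+` is worth checking: if either operand is missing, the result is missing too, which is a likely reason why some rows end up without tags. A minimal sketch on a made-up frame (the `toy` data and the `fillna("")` variant are illustrative assumptions, not what the notebook does):

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete, one without overview, one without genre
toy = pd.DataFrame({
    "overview": ["A heist goes wrong.", np.nan, "Two friends travel."],
    "genre": ["Crime,Thriller", "Drama", np.nan],
})

# Plain concatenation: a NaN on either side makes the whole result NaN
plain = toy["overview"] + " " + toy["genre"]

# Filling missing values with empty strings keeps whatever text is available
filled = toy["overview"].fillna("") + " " + toy["genre"].fillna("")

n_missing_plain = int(plain.isna().sum())
n_missing_filled = int(filled.isna().sum())
print(n_missing_plain, n_missing_filled)
```

Whether to fill or drop such rows is a design choice; filling keeps every movie available for lookup at the cost of sparser tags.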


In [8]:
movies['genre'].unique()

array(['Drama,Crime', 'Comedy,Drama,Romance', 'Drama,History,War', ...,
       'Action,TV Movie,Science Fiction,Comedy,Adventure',
       'Action,Science Fiction,War', 'Adventure,Fantasy,Action,Drama'],
      dtype=object)

In [14]:
# drop the separate overview and genre columns; only the combined tags are used from here on
new_data = movies.drop(columns=['overview', 'genre'])
new_data.head()

Unnamed: 0,id,title,tags
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second..."
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o..."
3,424,Schindler's List,The true story of how businessman Oskar Schind...
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...


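Two common ways to turn text into numeric vectors:
1. Bag of words (token counts; the approach used below via `CountVectorizer`)
2. TF-IDF (counts reweighted so that terms appearing in many documents weigh less)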
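
For comparison, here is a small sketch of how the two representations differ on a made-up corpus. `TfidfVectorizer` is scikit-learn's TF-IDF counterpart to `CountVectorizer`; the corpus and variable names are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus: "crime" occurs in every document, the other words in one each
toy_docs = ["crime drama", "crime comedy", "crime war"]

bow_model = CountVectorizer()
bow = bow_model.fit_transform(toy_docs).toarray()

tfidf_model = TfidfVectorizer()
tfidf = tfidf_model.fit_transform(toy_docs).toarray()

# Column positions of two words in each model's vocabulary
bow_vocab = bow_model.vocabulary_
tfidf_vocab = tfidf_model.vocabulary_

# Bag of words: both words in document 0 get the same count
print(bow[0, bow_vocab["crime"]], bow[0, bow_vocab["drama"]])
# TF-IDF: the rarer word "drama" gets the larger weight
print(tfidf[0, tfidf_vocab["crime"]], tfidf[0, tfidf_vocab["drama"]])
```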

In [15]:
# define count vector
cv = CountVectorizer(max_features=10000, stop_words='english')
cv

### **Parameters Explained:**

1. **`max_features=10000`**:
   - Limits the number of features (unique tokens) in the vocabulary to the **10,000 most frequent tokens** in the dataset.
   - This is useful for large datasets because it reduces the dimensionality. Note that the selection is by frequency, so the kept tokens are the most common ones, not necessarily the most informative.

2. **`stop_words='english'`**:
   - Automatically removes common English stop words (e.g., "the," "is," "and," etc.).
   - This reduces noise in the data by ignoring words that are unlikely to be meaningful in the analysis.

---

### **How It Works**:
- When you fit the `CountVectorizer` to a collection of text data, it tokenizes the text, removes stop words, and selects the top 10,000 tokens by frequency.
- The resulting feature matrix will have a maximum of 10,000 columns (one for each token).

---

### **Why Use These Parameters?**
- **`max_features`**: Prevents the vectorizer from creating a sparse matrix with too many dimensions, which can slow down computations or lead to overfitting.
- **`stop_words`**: Reduces the inclusion of very common words that carry little meaning when comparing documents.
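
The effect of `max_features` is easy to see on a small scale. The following is a sketch on a made-up corpus in which "space" and "hero" are the two most frequent words (corpus and variable names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: total counts are space=4, hero=3, and 1 for every other word
toy_docs = [
    "space war space hero",
    "space hero romance",
    "space hero comedy",
]

full_cv = CountVectorizer().fit(toy_docs)
top2_cv = CountVectorizer(max_features=2).fit(toy_docs)

full_vocab = sorted(full_cv.vocabulary_)
top2_vocab = sorted(top2_cv.vocabulary_)

print(full_vocab)   # every distinct token
print(top2_vocab)   # only the two most frequent tokens across the corpus
```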

In [16]:
# build the count vectors (one row per movie)
vector = cv.fit_transform(new_data['tags'].values.astype('U')).toarray()

### **Code**:

1. **`new_data['tags']`**:
   - Accesses the `tags` column of the DataFrame `new_data`.

2. **`.values.astype('U')`**:
   - Converts the column's values to a NumPy array of Unicode strings (`'U'`), which is what `CountVectorizer` expects. Missing values (`NaN`) become the literal string `'nan'` instead of causing an error.

3. **`cv.fit_transform(...)`**:
   - Fits the `CountVectorizer` to the text data and transforms it into a sparse matrix of token counts.
   - Each row corresponds to a document (here, an entry in the `tags` column).
   - Each column corresponds to a word from the vocabulary (up to the limit of `max_features`).

4. **`.toarray()`**:
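   - Converts the sparse matrix into a dense NumPy array for easier inspection or further processing. For a 10,000 × 10,000 matrix this takes roughly 800 MB with the default 64-bit integer counts; `cosine_similarity` also accepts the sparse matrix directly.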

---

### **Result**:
- **`vector`**: A 2D array (matrix) where:
  - Rows represent the individual documents (each entry in the `tags` column).
  - Columns represent words from the vocabulary.
  - Values are the counts of each word in the respective document.
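
As a side note on the `.toarray()` step, the sketch below compares the sparse and dense routes on a made-up corpus and estimates the dense memory footprint for a 10,000 × 10,000 matrix. The corpus and variable names are assumptions for illustration:

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_docs = ["crime drama prison", "crime family drama", "space adventure"]

X_sparse = CountVectorizer().fit_transform(toy_docs)  # SciPy sparse matrix
X_dense = X_sparse.toarray()                          # dense NumPy array

# cosine_similarity works on either representation and gives the same values
sim_sparse = cosine_similarity(X_sparse)
sim_dense = cosine_similarity(X_dense)

# Dense storage cost of a 10,000 x 10,000 matrix of 64-bit integers
dense_bytes = 10_000 * 10_000 * np.dtype(np.int64).itemsize
print(sparse.issparse(X_sparse), dense_bytes / 1e6, "MB")
```

Skipping the dense conversion mainly saves memory on the count matrix; the similarity matrix itself is still dense.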


In [17]:
# shape of the count matrix: (number of movies, vocabulary size)
vector.shape

(10000, 10000)

In [18]:
# obtain the similarity matrix
similarity =  cosine_similarity(vector)

### **What Happens?**

1. **`cosine_similarity(vector)`**:
   - Computes the cosine similarity between all rows (documents) of the `vector` matrix.
   - `vector` is a feature matrix where each row represents a document, and each column represents a feature (word/token).

2. **Result**:
   - Returns a square similarity matrix, where:
     - Rows and columns correspond to the documents.
     - Each element \((i, j)\) in the matrix represents the cosine similarity between document \(i\) and document \(j\).
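
A few properties of this matrix can be verified on a small hand-made count matrix (the numbers below are made up for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three "documents" over a four-word vocabulary (made-up counts)
toy_counts = np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],   # shares no words with the first two documents
])

toy_sim = cosine_similarity(toy_counts)
print(toy_sim.round(3))

is_square = toy_sim.shape == (3, 3)                  # one row and column per document
is_symmetric = np.allclose(toy_sim, toy_sim.T)       # sim(i, j) equals sim(j, i)
diag_is_one = np.allclose(np.diag(toy_sim), 1.0)     # each document matches itself
no_overlap_is_zero = np.isclose(toy_sim[0, 2], 0.0)  # no shared words gives 0
```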


In [19]:
# display the similarity matrix
similarity

array([[1.        , 0.05634362, 0.12888482, ..., 0.07559289, 0.11065667,
        0.06388766],
       [0.05634362, 1.        , 0.07624929, ..., 0.        , 0.03636965,
        0.        ],
       [0.12888482, 0.07624929, 1.        , ..., 0.02273314, 0.06655583,
        0.08645856],
       ...,
       [0.07559289, 0.        , 0.02273314, ..., 1.        , 0.03253   ,
        0.02817181],
       [0.11065667, 0.03636965, 0.06655583, ..., 0.03253   , 1.        ,
        0.0412393 ],
       [0.06388766, 0.        , 0.08645856, ..., 0.02817181, 0.0412393 ,
        1.        ]])

In [20]:
# information about the data frame
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      10000 non-null  int64 
 1   title   10000 non-null  object
 2   tags    9985 non-null   object
dtypes: int64(1), object(2)
memory usage: 234.5+ KB


In [None]:
# look up the row index of one movie by its title
new_data[new_data['title']=="Schindler's List"].index[0]

In [21]:

# rank all movies by their similarity to row 3 (Schindler's List), highest first
distance = sorted(list(enumerate(similarity[3])), reverse=True, key=lambda vector:vector[1])
for i in distance[0:5]:
    print(new_data.iloc[i[0]].title)

Schindler's List
Resistance
The Counterfeiters
The Railway Man
Midway


### Code:

1. **`enumerate(similarity[3])`**:
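   - `similarity[3]` is row 3 of the similarity matrix (a NumPy array): the cosine similarity between the movie at index 3 (*Schindler's List*) and every movie in the dataset.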
   - `enumerate` pairs each similarity value with its corresponding document index, e.g., `[(0, 0.5), (1, 0.8), ...]`.

2. **`list(enumerate(similarity[3]))`**:
   - Converts the enumerated pairs into a list.

3. **`sorted(..., reverse=True, key=lambda vector:vector[1])`**:
   - Sorts the list of enumerated pairs in descending order based on the similarity value (`vector[1]`).

4. **`distance[0:5]`**:
   - Selects the 5 highest-scoring entries. The first one is the query movie itself (similarity 1 with itself), so the result is the movie plus its 4 closest matches.

5. **`new_data.iloc[i[0]].title`**:
   - Retrieves the title of the document corresponding to the index `i[0]` from the `new_data` DataFrame.

6. **`print`**:
   - Outputs the titles of the top 5 most similar documents.
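
The same ranking can also be obtained with `np.argsort`, which avoids building a Python list of pairs. This is an alternative sketch on made-up scores, not the approach used in the cell above:

```python
import numpy as np

# Made-up similarity scores of one movie against five movies
scores = np.array([0.10, 1.00, 0.35, 0.80, 0.05])

# Approach from the cell above: sort (index, score) pairs by score, descending
pairs = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)[:3]
top_sorted = [idx for idx, _ in pairs]

# Alternative: argsort returns indices in ascending score order, so reverse it
top_argsort = np.argsort(scores)[::-1][:3].tolist()

print(top_sorted, top_argsort)
```

With tied scores the two approaches may order the tied indices differently.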

In [None]:
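# print the 5 titles most similar to the given movie, based on the tags
def recommend(title):
    # row index of the requested movie (raises IndexError if the title is not in new_data)
    index = new_data[new_data['title'] == title].index[0]
    # (index, score) pairs sorted by score, highest first;
    # the requested movie itself normally comes first with score 1.0
    distance = sorted(enumerate(similarity[index]), reverse=True, key=lambda pair: pair[1])
    for i in distance[0:5]:
        print(new_data.iloc[i[0]].title)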
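
A possible variant returns the titles instead of printing them, skips the queried movie, and returns an empty list for unknown titles. This is a sketch under assumptions: the function name `recommend_titles`, its parameters, and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for new_data and similarity
toy_movies = pd.DataFrame({"title": ["A", "B", "C", "D"]})
toy_similarity = np.array([
    [1.0, 0.9, 0.1, 0.4],
    [0.9, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.7],
    [0.4, 0.3, 0.7, 1.0],
])

def recommend_titles(title, frame, sim, k=2):
    """Return the k titles most similar to `title`, excluding the title itself."""
    matches = frame.index[frame["title"] == title]
    if len(matches) == 0:
        return []                               # unknown title: nothing to recommend
    pos = frame.index.get_loc(matches[0])       # row position in the similarity matrix
    order = np.argsort(sim[pos])[::-1]          # positions by descending similarity
    picks = [i for i in order if i != pos][:k]  # drop the queried movie
    return frame["title"].iloc[picks].tolist()

result_known = recommend_titles("A", toy_movies, toy_similarity)
result_unknown = recommend_titles("Z", toy_movies, toy_similarity)
print(result_known, result_unknown)
```

Returning a list makes the function easier to reuse, for example from a web front end, than printing inside the loop.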

In [None]:
# check the function
recommend("Iron Man")

Iron Man
Iron Man 3
Guardians of the Galaxy Vol. 2
Avengers: Age of Ultron
Star Wars: Episode III - Revenge of the Sith


In [None]:
# save the movie table and the similarity matrix for future use
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(new_data, f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
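
Loading the saved objects later works the same way in reverse. The sketch below round-trips a small made-up matrix through a temporary file; the file name `toy_similarity.pkl` is an assumption for illustration. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np

toy_similarity = np.eye(3)   # made-up stand-in for the real similarity matrix

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_similarity.pkl")

    # Write: the context manager closes the file even if dump() fails
    with open(path, "wb") as f:
        pickle.dump(toy_similarity, f)

    # Read the object back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(np.array_equal(toy_similarity, restored))
```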