<a name="0"></a>
# 🎯 Netflix Recommendation System – Hands-on AI Workshop

💡 Have you ever wondered how Netflix *knows* exactly what to recommend to you after a binge night? Or how it's smart enough to suggest “light comedy” after a breakup movie?

Today, you'll get to build a mini-version of such a system — all using **Traditional AI**!

---------

In this project, we use a Netflix dataset from Kaggle and apply several machine learning techniques (explained below) to build a recommendation function that suggests similar titles for a given movie or TV show.
Let's begin!

### In this notebook, we will cover:
* [Overview](#0)
* [Environment Setup & Data Set Loading](#1)
* [Data Exploration and Adjustment](#2)
* [Text Preprocessing](#3)
* [Text Vectorization and Similarity Calculation](#4)
* [Recommendation Function](#5)
* [Final Recommendation Examples & Output](#6)

# 🔧 Pre-requisites & Setup


Before we begin, make sure you have the following:

1. ✅ A Google Account (for opening Google Colab)
2. ✅ Basic Python knowledge (loops, functions, pandas)
3. ✅ This workshop notebook:
   📎 [Click here to open the Colab](https://colab.research.google.com/github/NamVr/Netflix-Recommendation-System/blob/main/notebook.ipynb)

💻 GitHub Repo: [github.com/NamVr/Netflix-Recommendation-System](https://github.com/NamVr/Netflix-Recommendation-System)

📦 No local setup needed — all code runs in the browser.

> 🚀 ***DIY - Do It Yourself Challenge:*** Try writing the code yourself instead of copying everything from the Colab. Refer to it only when you get stuck.

<h2>📸 Workshop Resources</h2>

<table>
  <tr>
    <th><a href="https://colab.research.google.com/github/NamVr/Netflix-Recommendation-System/blob/main/notebook.ipynb">🔗 Google Colab</a></th>
    <th>&nbsp;&nbsp;&nbsp;</th>
    <th><a href="https://github.com/NamVr/Netflix-Recommendation-System/">🐙 GitHub Repo</a></th>
    <th>&nbsp;&nbsp;&nbsp;</th>
    <th><a href="https://linkedin.com/in/namanvrati">💼 LinkedIn</a></th>
    <th>&nbsp;&nbsp;&nbsp;</th>
    <th><a href="https://forms.gle/1dMRPY8hisXzBWNRA">📝 Feedback Form</a></th>
  </tr>
  <tr>
    <td><img src="https://api.qrserver.com/v1/create-qr-code/?size=120x120&data=https://colab.research.google.com/github/NamVr/Netflix-Recommendation-System/blob/main/notebook.ipynb"></td>
    <td></td>
    <td><img src="https://api.qrserver.com/v1/create-qr-code/?size=120x120&data=https://github.com/NamVr/Netflix-Recommendation-System/"></td>
    <td></td>
    <td><img src="https://api.qrserver.com/v1/create-qr-code/?size=120x120&data=https://linkedin.com/in/namanvrati"></td>
    <td></td>
    <td><img src="https://api.qrserver.com/v1/create-qr-code/?size=120x120&data=https://forms.gle/1dMRPY8hisXzBWNRA"></td>
  </tr>
</table>


<a name="1"></a>
# 1.1 Section 1
## Environment Setup & Data Set Loading


We're now setting up our coding environment! If you're running this in Google Colab, you already have most libraries installed. But let's ensure they're ready.


In [None]:
# Install necessary libraries
!pip install numpy pandas scikit-learn

# Import libraries
import numpy as np
import pandas as pd
import string
import sklearn

In [None]:
# Load the dataset. (df) stays as the untouched original; (data) is an
# independent copy that we will process.
df = pd.read_csv("/content/netflixData.csv")
data = df.copy()

<a name="2"></a>
# 1.2 Section 2
## Data Exploration and Adjustment

Let's explore the dataset to understand what kind of content Netflix provides and how we can use that for recommendations.

In [None]:
# Display the first few rows of the dataset
data.head()

In [None]:
# Describe the DataSet (DataFrame.describe())
data.describe(include='all')

In [None]:
# Get Data Set Info (DataFrame.info())
data.info()

In [None]:
# Check for missing values
data.isnull().sum()

In [None]:
# Keep only the relevant columns. .copy() gives us an independent DataFrame,
# so the in-place edits below do not trigger pandas' SettingWithCopyWarning.
data = data[["Title", "Description", "Director", "Cast", "Genres", "Rating", "Content Type"]].copy()
data.set_index("Title", inplace=True)

# Ordinarily we would drop rows with NA/NaN values, but the Director column
# alone has 2064 NaN values, which is almost 30% of the entire dataset.
# Instead, we replace NaN values with an empty string and skip empty fields
# later on.
data.fillna("", inplace=True)

# Display the adjusted dataset.
data.head()

<a name="3"></a>
# 1.3 Section 3
## Text Preprocessing

🧹 We now clean the data to make it machine-friendly:
- Remove punctuation
- Convert to lowercase
- Combine relevant features
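
To make these three steps concrete, here is a minimal sketch on a single made-up record. The helper names `strip_punctuation` and `collapse_names` and the sample values are illustrative assumptions, not the notebook's own functions. Collapsing each comma-separated name into one token presumably keeps a full name from matching on a first name alone.

```python
import string

def strip_punctuation(text):
    # Remove every punctuation character, then lowercase.
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

def collapse_names(text):
    # "Jane Doe, John Smith" -> "janedoe johnsmith"
    return " ".join(name.replace(" ", "").lower() for name in text.split(","))

# Hypothetical record, for illustration only.
record = {
    "Description": "A small-town mystery, full of secrets!",
    "Cast": "Jane Doe, John Smith",
    "Genres": "Dramas, Thrillers",
}

# Clean each field, then combine everything into one string.
bag = " ".join([
    strip_punctuation(record["Description"]),
    collapse_names(record["Cast"]),
    collapse_names(record["Genres"]),
])
print(bag)
```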

In [None]:
# Text preprocessing helper functions:

# separate(texts: str) -> str:
# Splits the given string on ",", removes all spaces from each item and
# lowercases it, so that a full name becomes a single token.
# Returns the items joined together with ' ' (spaces).
def separate(texts):
    t = []
    for text in texts.split(','):
        t.append(text.replace(' ', '').lower())
    return ' '.join(t)

# remove_space(texts: str) -> str:
# Removes all spaces from the string, converts it to lower case and returns it.
def remove_space(texts):
    return texts.replace(' ', '').lower()

# remove_punc(texts: str) -> str:
# Removes punctuations from the texts, converts text to lower case using
# lower() and then returns it.
def remove_punc(texts):
    return texts.translate(str.maketrans('','',string.punctuation)).lower()

In [None]:
# Series.apply(func)
# Calls the given function on every value in the selected column.
# We use apply to format the text of each column for our algorithm.

data['Content Type'] = data['Content Type'].apply(remove_space)
data['Director'] = data['Director'].apply(separate)
data['Cast'] = data['Cast'].apply(separate)
data['Rating'] = data['Rating'].apply(remove_space)
data['Genres'] = data['Genres'].apply(separate)
data['Description'] = data['Description'].apply(remove_punc)

data.head()

In [None]:
# Here we create a new column 'bag_of_words' in the dataset. Each row holds
# the non-empty values from all other columns in that row, joined with
# spaces. Combining text fields like this is a common preprocessing step
# before vectorizing text.

# Note: avoid naming a variable `string` here, since that would shadow the
# `string` module that remove_punc relies on.
def join_non_empty(row):
    words = [value for value in row if value != '']
    return ' '.join(words)

# Combine all the words into 1 column.
# DataFrame.apply with axis=1 calls the function once per row, passing the
# row as a Series, so no explicit loop or positional indexing is needed.
data['bag_of_words'] = data.apply(join_non_empty, axis=1)

data.drop(data.columns[:-1], axis=1, inplace=True)

<a name="4"></a>
# 1.4 Section 4
## Text Vectorization and Similarity Calculation


**TF-IDF** stands for Term Frequency-Inverse Document Frequency. It scores how important a word is to a document. In a nutshell, a word that appears in many documents of the corpus is considered less informative, so its TF-IDF score is lower. A word that appears in only a few documents gets a higher score.
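
As a rough illustration of that weighting, the sketch below computes TF-IDF by hand on a tiny made-up corpus. The formula used, `ln((1 + n) / (1 + df)) + 1`, is the smoothed IDF that scikit-learn's `TfidfVectorizer` documents as its default; the final length normalization is left out here, and the corpus and function names are assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus (made up); each string plays the role of one title's text.
docs = [
    "space adventure crew",
    "space comedy crew",
    "romantic comedy",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word occurs.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    # Term count in this document multiplied by the word's IDF.
    counts = Counter(tokens)
    return {word: count * idf(word) for word, count in counts.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.3f}")
```

Here "adventure" occurs in one document while "space" occurs in two, so "adventure" ends up with the higher weight.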

----

We now convert text into numerical form using **TF-IDF**, a method that weights each word by how informative it is.
Then we calculate **Cosine Similarity**, which measures how “close” two items are based on the weighted words they share.
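
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements it in plain Python on three hand-made vectors; the vectors and their column meanings are assumptions for illustration, not output from the workshop dataset.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical word weights; columns: space, crew, comedy, romantic
sci_fi = [1.0, 1.0, 0.0, 0.0]
space_comedy = [1.0, 1.0, 1.0, 0.0]
rom_com = [0.0, 0.0, 1.0, 1.0]

print(round(cosine(sci_fi, space_comedy), 3))  # shares two words
print(round(cosine(sci_fi, rom_com), 3))       # shares no words
```

A score near 1 means the two items use largely the same weighted words, and 0 means they share none.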

In [None]:
# TfidfVectorizer is a class provided by scikit-learn for converting a
# collection of raw documents (text) into a matrix of TF-IDF features.
# TF-IDF stands for Term Frequency-Inverse Document Frequency and is a
# numerical statistic used to reflect the importance of a word in a document
# relative to a collection of documents (corpus).
from sklearn.feature_extraction.text import TfidfVectorizer

# cosine_similarity is a function in scikit-learn that computes the cosine
# similarity between vectors. In the context of NLP, these vectors are often
# the rows of a matrix representing documents in a high-dimensional space.
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# A TfidfVectorizer is initialized and stored in the tfid variable.
tfid = TfidfVectorizer()

# Fit and Transform: The fit_transform method of the TfidfVectorizer is then
# used to convert the 'bag_of_words' column of your DataFrame (data) into a
# TF-IDF matrix. The result, stored in tfid_matrix, is a sparse matrix where
# each row corresponds to a document (in this case, each row of the
# 'bag_of_words' column), and each column corresponds to a unique word in the
# entire dataset. The values in the matrix represent the TF-IDF scores for each
# word in each document.
tfid_matrix = tfid.fit_transform(data['bag_of_words'])

In [None]:
tfid_matrix

In [None]:
# The cosine_similarity function takes two matrices as input and computes the
# cosine similarity between every pair of rows, one row from each matrix. In
# this case, both matrices are the same (tfid_matrix), so the resulting
# cosine_sim matrix is a square matrix where each entry [i, j] is the cosine
# similarity between the i-th and j-th rows (documents) of the dataset.
cosine_sim = cosine_similarity(tfid_matrix, tfid_matrix)
cosine_sim

<a name="5"></a>
# 1.5 Section 5
## Recommendation Function

Let's now build our recommender engine! 🎯

We'll use the similarity scores to find the titles most similar to the one the user searches for.
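
The core of that lookup can be sketched in a few lines: take the similarity row of the queried title, leave out the title itself, and keep the highest-scoring entries. The titles, the similarity values, and the function name `top_similar` below are made up for illustration.

```python
# Hypothetical titles and a hand-written similarity matrix.
titles = ["Alpha", "Bravo", "Charlie", "Delta"]
similarity = [
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.6],
    [0.4, 0.3, 0.6, 1.0],
]

def top_similar(title, n=2):
    idx = titles.index(title)   # row of the queried title
    scores = similarity[idx]
    # Rank every other title by its score, highest first.
    others = [i for i in range(len(titles)) if i != idx]
    ranked = sorted(others, key=lambda i: scores[i], reverse=True)
    return [titles[i] for i in ranked[:n]]

print(top_similar("Alpha"))
```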


In [None]:
# Creating recommendations based on Title/Content Type.
# Many other recommendation models can be made after this step :)
# .copy() lets us add a column later without a SettingWithCopyWarning.
final_df = df[['Title', 'Content Type']].copy()

def recommendation(title, total_result=5):
    # Get the index; stop early if the title is not in the dataset
    matches = final_df[final_df['Title'] == title].index
    if len(matches) == 0:
        print('Title "{}" not found in the dataset.'.format(title))
        return
    idx = matches[0]

    # Create a new column for similarity, the value is different for each title you input
    final_df['Similarity'] = cosine_sim[idx]
    # Drop the queried title itself, then keep the most similar results
    sort_final_df = final_df.drop(idx).sort_values(by='Similarity', ascending=False)[:total_result]

    # Is the title a movie or tv show?
    movies = sort_final_df['Title'][sort_final_df['Content Type'] == 'Movie']
    tv_shows = sort_final_df['Title'][sort_final_df['Content Type'] == 'TV Show']

    print('Similar Movie(s) list:')
    if len(movies) != 0:
        for i, movie in enumerate(movies, start=1):
            print('{}. {}'.format(i, movie))
        print()
    else:
        print('-\n')

    print('Similar TV_show(s) list:')
    if len(tv_shows) != 0:
        for i, tv_show in enumerate(tv_shows, start=1):
            print('{}. {}'.format(i, tv_show))
    else:
        print('-')

<a name="6"></a>
# 1.6 Section 6
## Final Recommendation Examples & Output

📦 Here's a sample of recommendations generated by our system.
Try changing the movie title and see how results differ!


In [None]:
# Get recommendations for the TV show "Stranger Things"
recommendation("Stranger Things")

In [None]:
# Final Result
# Get recommendation by inputting movie name.

n = input("Enter Movie/TV Show Name: ")
recommendation(n)

<hr>

# 📎 **Conclusion:**

I hope you enjoyed this session. If you did, **please share it on LinkedIn and tag me + GDG**. Let's inspire more folks to explore AI/ML!

Connect with me on LinkedIn: https://linkedin.com/in/namanvrati <br>
Email me for doubts: info@namanvrati.me <br> <br>

---

### Wohoo! Thanks :)
