##Introduction to the IMDB Top 250 Shows Dataset
###The IMDB Top 250 Shows dataset is a collection of information about the top-rated television shows as ranked by the Internet Movie Database (IMDB). It is well suited to exploratory analysis and to building recommendation systems, and it offers insight into the attributes shared by shows that viewers rate highly.

###The dataset covers the top 250 shows by IMDB rating, as listed on the official IMDB website.

###Features

* rank - Show Rank as per IMDB rating
* show_id - Show ID
* title - Name of the Show
* year - Year of Show release
* link - URL for the Show
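* imbd_votes - Number of people who voted for the IMDB rating (the column name is spelled imbd_votes in the CSV)
* imbd_rating - IMDB rating of the Show (the column name is spelled imbd_rating in the CSV)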
* certificate - Show Certification
* duration - Duration of the Show
* genre - Genre of the Show
* cast_id - IDs of the cast members who worked on the Show
* cast_name - Names of the cast members who worked on the Show
* director_id - IDs of the directors who directed the Show
* director_name - Names of the directors who directed the Show
* writer_id - IDs of the writers who wrote the script for the Show
* writer_name - Names of the writers who wrote the script for the Show
* storyline - Storyline of the Show
* user_id - IDs of the users who wrote reviews for the Show
* user_name - Names of the users who wrote reviews for the Show
* review_id - ID of the user review
* review_title - Short review
* review_content - Long review


##Source

https://www.kaggle.com/datasets/karkavelrajaj/imdb-top-250-shows/data

##Importing Necessary Libraries

In [1]:
import pandas as pd #Pandas is a powerful library for data manipulation and analysis.

In [2]:
df = pd.read_csv('shows.csv') #Loading the dataset.

###The data has now been read from a CSV file into a Pandas DataFrame. Let us look at the first rows of the dataset using df.head().

In [3]:
df.head() #Displays the first 5 rows of the dataset.

Unnamed: 0,rank,show_id,title,year,link,imbd_votes,imbd_rating,certificate,duration,genre,...,director_id,director_name,writer_id,writer_name,storyline,user_id,user_name,review_id,review_title,review_content
0,1,tt5491994,Planet Earth II,2016,https://www.imdb.com/title/tt5491994,145597,9.5,TV-G,4h 58m,Documentary,...,"nm1565498,nm3482115,nm4830788,nm1769336,nm2139...","Justin Anderson,Ed Charles,Fredi Devas,Chadden...",nm2357454,Elizabeth White,David Attenborough returns with a new wildlife...,"ur0362356,ur33816519,ur64238818,ur69264448,ur2...","Wentloog,john-m-madsen,thespookybuz,pjdickinso...","rw3575992,rw3576144,rw3578121,rw3576211,rw3577...","At once awe-inspiring and terrifying!,Yet anot...",I have just finished watching the first episod...
1,2,tt0903747,Breaking Bad,2008,https://www.imdb.com/title/tt0903747,1881190,9.5,TV-MA,49m,"Crime,Drama,Thriller",...,"nm0533713,nm0002835,nm0319213,nm0118778,nm0806...","Michelle MacLaren,Adam Bernstein,Vince Gilliga...","nm0319213,nm0332467,nm2297407,nm1028558,nm0909...","Vince Gilligan,Peter Gould,George Mastras,Sam ...",A chemistry teacher diagnosed with inoperable ...,"ur128165243,ur6387867,ur158768880,ur20552756,u...","FiRE010,Supermanfan-13,Lukasmj,TheLittleSongbi...","rw7088846,rw7530139,rw8672131,rw3856786,rw8725...","Really Great,Damn near perfect!,A show you nee...",I have never watched a show that is as consist...
2,3,tt0795176,Planet Earth,2006,https://www.imdb.com/title/tt0795176,210164,9.4,TV-PG,8h 58m,Documentary,...,"nm0288144,nm1768412","Alastair Fothergill,Mark Linfield","nm0041003,nm1761192,nm0288144,nm0662263","David Attenborough,Vanessa Berlowitz,Alastair ...",Each 50 minute episode features a global overv...,"ur4445210,ur1002035,ur4344459,ur14156906,ur141...","ccthemovieman-1,bob the moo,bs3dc,robert-kamer...","rw2002220,rw1356723,rw1574512,rw1594404,rw1723...","In A Word: Amazing,A visually impressive and m...","Thankfully, I caught a couple of these episode..."
3,4,tt0185906,Band of Brothers,2001,https://www.imdb.com/title/tt0185906,469081,9.4,TV-MA,9h 54m,"Drama,History,War",...,"nm0291205,nm0004121,nm0000158,nm0500896,nm0518...","David Frankel,Mikael Salomon,Tom Hanks,David L...","nm0024421,nm0096897,nm0296861,nm0000158,nm0420...","Stephen Ambrose,Erik Bork,E. Max Frye,Tom Hank...",The story of Easy Company of the U.S. Army 101...,"ur0312444,ur3922673,ur1019294,ur6387867,ur2467...","rbverhoef,philip_vanderveken,bsmith5552,Superm...","rw0626026,rw0626132,rw0625888,rw8123519,rw3248...","Excellent,This series is so unbelievably reali...",This week I saw three things based on WW-II no...
4,5,tt7366338,Chernobyl,2019,https://www.imdb.com/title/tt7366338,751884,9.4,TV-MA,5h 30m,"Drama,History,Thriller",...,nm0719307,Johan Renck,nm0563301,Craig Mazin,"In April 1986, an explosion at the Chernobyl n...","ur0482513,ur71468234,ur6387867,ur115536310,ur1...","Leofwine_draca,jfirebug,Supermanfan-13,DiCapri...","rw5285929,rw4875873,rw8325723,rw8574390,rw8521...","Exemplary,Incredible,Brilliant!,Must Watch!,Pa...",CHERNOBYL is an excellent depiction of the inf...
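
The preview above shows that columns such as genre, cast_name and director_name pack several values into one comma-separated string. As a minimal sketch on toy rows (the frame `toy` and the column `genre_list` are illustrative names, not part of the notebook), such a column could be split into lists and exploded for counting:

```python
import pandas as pd

# Toy rows mirroring the comma-separated layout seen in the preview.
toy = pd.DataFrame({
    "title": ["Breaking Bad", "Planet Earth"],
    "genre": ["Crime,Drama,Thriller", "Documentary"],
})

# Split each genre string into a Python list of individual genres.
toy["genre_list"] = toy["genre"].str.split(",")

# One row per (title, genre) pair, which is convenient for counting genres.
exploded = toy.explode("genre_list")
genre_counts = exploded["genre_list"].value_counts()
```

The same approach is likely to be less reliable for review_title, since a review title may itself contain commas.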


##Exploring the Data:
###Understanding the dataset by exploring its structure and contents.

In [4]:
df.columns # Displays the names of the columns

Index(['rank', 'show_id', 'title', 'year', 'link', 'imbd_votes', 'imbd_rating',
       'certificate', 'duration', 'genre', 'cast_id', 'cast_name',
       'director_id', 'director_name', 'writer_id', 'writer_name', 'storyline',
       'user_id', 'user_name', 'review_id', 'review_title', 'review_content'],
      dtype='object')

In [5]:
df.shape # Displays the total count of the Rows and Columns respectively.

(250, 22)

In [6]:
df.info() #Displays each column's non-null count and data type.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   rank            250 non-null    int64  
 1   show_id         250 non-null    object 
 2   title           250 non-null    object 
 3   year            250 non-null    int64  
 4   link            250 non-null    object 
 5   imbd_votes      250 non-null    object 
 6   imbd_rating     250 non-null    float64
 7   certificate     246 non-null    object 
 8   duration        249 non-null    object 
 9   genre           250 non-null    object 
 10  cast_id         250 non-null    object 
 11  cast_name       250 non-null    object 
 12  director_id     250 non-null    object 
 13  director_name   250 non-null    object 
 14  writer_id       250 non-null    object 
 15  writer_name     250 non-null    object 
 16  storyline       250 non-null    object 
 17  user_id         250 non-null    obj
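
The listing above reports imbd_votes as object rather than a numeric type, and the preview shows duration as text such as 4h 58m or 49m. The notebook does not say why the votes are stored as text; the sketch below assumes thousands separators or stray non-numeric entries, which is only a guess. The helper `duration_to_minutes` and the columns `votes_num` and `duration_min` are hypothetical names used for illustration on toy data:

```python
import re
import pandas as pd

def duration_to_minutes(text):
    """Convert strings such as '4h 58m' or '49m' to total minutes (hypothetical helper)."""
    if not isinstance(text, str):
        return None  # keep missing values missing
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text)
    if match is None or not any(match.groups()):
        return None  # unrecognized format
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

# Toy values; the second and third vote strings are assumed formats, not dataset rows.
toy = pd.DataFrame({
    "imbd_votes": ["145597", "1,881,190", "n/a"],
    "duration": ["4h 58m", "49m", None],
})

# Remove thousands separators, then coerce anything non-numeric to NaN.
toy["votes_num"] = pd.to_numeric(
    toy["imbd_votes"].str.replace(",", "", regex=False), errors="coerce"
)
toy["duration_min"] = toy["duration"].map(duration_to_minutes)
```

Using errors="coerce" turns unparseable entries into NaN instead of raising, so they can be inspected before deciding how to handle them.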

##Data Cleaning:
###Checking for missing values, duplicates, and inconsistencies, then cleaning the data accordingly.

In [7]:
df.isnull().sum()

rank              0
show_id           0
title             0
year              0
link              0
imbd_votes        0
imbd_rating       0
certificate       4
duration          1
genre             0
cast_id           0
cast_name         0
director_id       0
director_name     0
writer_id         0
writer_name       0
storyline         0
user_id           0
user_name         0
review_id         0
review_title      0
review_content    0
dtype: int64

Only two columns have missing values: certificate has 4 nulls and duration has 1. Because so few rows are affected, we can drop them without noticeably affecting the results.
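
Before dropping rows, it can help to see which ones are affected. The sketch below uses a toy frame (not the dataset) to list the rows containing nulls, and it also shows that dropna() keeps the original row labels, which leaves gaps. Renumbering with reset_index(drop=True) is an optional extra step, not something the notebook does; it keeps labels and positions identical, which matters when rows are later looked up by position in a similarity matrix.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "certificate": ["TV-MA", None, "TV-PG", "TV-14"],
    "duration": ["49m", "1h", "2h 5m", "30m"],
})

# Rows that contain at least one missing value.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

# dropna() removes those rows but keeps the old labels: 0, 2, 3.
dropped = toy.dropna()

# Renumber so that labels run 0..n-1 again.
cleaned = dropped.reset_index(drop=True)
```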

In [8]:
df = df.dropna() #Dropping the null values in the dataset.

In [9]:
df.isnull().sum() #Displays the number of null values in each column.

rank              0
show_id           0
title             0
year              0
link              0
imbd_votes        0
imbd_rating       0
certificate       0
duration          0
genre             0
cast_id           0
cast_name         0
director_id       0
director_name     0
writer_id         0
writer_name       0
storyline         0
user_id           0
user_name         0
review_id         0
review_title      0
review_content    0
dtype: int64

The dataset now contains no null values.

In [10]:
df.drop_duplicates(inplace=True) #Dropping the duplicate rows in the dataset.
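
To see whether there is anything to drop, the duplicates can be counted first. This is a small sketch on toy data; checking that show_id is unique afterwards is an extra sanity check assumed here, not a step from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "show_id": ["tt1", "tt2", "tt1"],
    "title": ["A", "B", "A"],
})

# Number of rows that repeat an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

# After deduplication every show_id should appear once.
ids_unique = deduped["show_id"].is_unique
```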

In [11]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder


##Content-Based Filtering:



StandardScaler rescales the ratings to zero mean and unit variance, which puts them on a scale comparable with other standardized features. Note that the genre-based similarity computed in this section does not use the resulting normalized_rating column.
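
As a quick illustration of what StandardScaler computes, the sketch below standardizes a few made-up ratings and reproduces the result by hand: subtract the mean, then divide by the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy ratings in a single column (illustrative values only).
ratings = np.array([[9.5], [9.4], [9.0], [8.6]])

scaled = StandardScaler().fit_transform(ratings)

# Same result by hand; NumPy's std() uses ddof=0 by default, as StandardScaler does.
manual = (ratings - ratings.mean()) / ratings.std()
```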

In [12]:
# Normalize ratings
scaler = StandardScaler()
df['normalized_rating'] = scaler.fit_transform(df[['imbd_rating']])

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

###The next cell creates TfidfVectorizer(stop_words='english') and calls fit_transform(df['genre']) to convert the genre text into numerical features that can be compared between shows.
###TF-IDF Vectorization
###TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (or corpus). The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
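
To make the weighting concrete, the sketch below runs TfidfVectorizer on a three-document toy corpus and rebuilds the first row by hand. It assumes scikit-learn's default settings, where the IDF is smoothed as ln((1 + n) / (1 + df)) + 1 and each row is then scaled to unit length; other TF-IDF variants use slightly different formulas.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["crime drama", "drama", "documentary"]  # toy corpus, not dataset rows

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs).toarray()
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)  # terms in column order

# Document frequency: in how many documents each term occurs.
n_docs = len(docs)
doc_freq = {t: sum(t in d.split() for d in docs) for t in terms}

# scikit-learn's default smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {t: np.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in terms}

# Row 0 by hand: term count times IDF, then scale the row to unit length.
counts = np.array([docs[0].split().count(t) for t in terms], dtype=float)
raw = counts * np.array([idf[t] for t in terms])
manual_row = raw / np.linalg.norm(raw)
```

In the first document, crime gets a higher weight than drama because drama also appears in another document.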

In [14]:
# Feature extraction
tfidf = TfidfVectorizer(stop_words='english') #This creates an instance of TfidfVectorizer that will ignore common English stop words.
tfidf_matrix = tfidf.fit_transform(df['genre'])

###fit_transform(df['genre']) does two things:
###Fit: It learns the vocabulary from the genre column and the document frequency of each term, from which the IDF weights are derived.
###Transform: It then counts the terms in each genre string and converts the string into a TF-IDF vector. Each row in the resulting tfidf_matrix corresponds to a show, and each column corresponds to a term from the genre data, with the cell values representing the TF-IDF scores.
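
One detail worth checking is how the genre strings are tokenized. With the default token pattern, a hyphenated genre such as Sci-Fi is split into two terms. The sketch below shows this on toy genre strings and, as one possible alternative that the notebook does not use, a vectorizer that splits on commas so each genre stays a single term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

genres = ["Crime,Drama,Thriller", "Action,Sci-Fi", "Documentary"]  # toy strings

# Same settings as the notebook: default tokenizer, English stop words removed.
default_vec = TfidfVectorizer(stop_words="english")
default_vec.fit(genres)
default_terms = sorted(default_vec.vocabulary_)

# Alternative (an assumption, not the notebook's method): one term per genre.
# Text is lowercased before the custom tokenizer runs.
comma_vec = TfidfVectorizer(tokenizer=lambda s: s.split(","), token_pattern=None)
comma_vec.fit(genres)
comma_terms = sorted(comma_vec.vocabulary_)
```

Here default_terms contains sci and fi as separate entries, while comma_terms contains sci-fi as one entry.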


In [15]:
# Calculate similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

###The cosine_similarity function takes two matrices as input and computes the cosine similarity between the rows of these matrices.
###By passing tfidf_matrix twice, you compute the similarity between every pair of shows in the dataset.
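
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below checks this by hand against cosine_similarity on a small made-up matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy row vectors (illustrative values only).
vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])

# With a single argument, every row is compared with every other row.
sim = cosine_similarity(vectors)

# Manual version: pairwise dot products divided by the product of row norms.
norms = np.linalg.norm(vectors, axis=1)
manual = (vectors @ vectors.T) / np.outer(norms, norms)
```

Each row has similarity 1 with itself, and rows that share no non-zero column, such as the first and third, have similarity 0.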

In [16]:
# Function to get recommendations
def get_recommendations(title, cosine_sim=cosine_sim):
    # title: the show to get recommendations for. cosine_sim: the precomputed cosine similarity matrix.
    # Row position (not index label) of the title: cosine_sim is positional, and dropna() leaves gaps in the labels.
    idx = (df['title'] == title).to_numpy().nonzero()[0][0]
    # Pair every row position with its similarity to the chosen show.
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity, highest first (the sort is stable, so tied scores keep row order).
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry, assumed to be the show itself, and keep the next 10.
    sim_scores = sim_scores[1:11]
    show_indices = [i[0] for i in sim_scores] # Row positions of the selected shows.
    return df['title'].iloc[show_indices] # Titles of the shows at those positions.

In [17]:
recommendations = get_recommendations('Planet Earth') #Passing the name of a show returns its recommendations.
recommendations


2                    Planet Earth
6                  Blue Planet II
8     Cosmos: A Spacetime Odyssey
10                         Cosmos
11                     Our Planet
17                           Life
25                The Blue Planet
29                   Human Planet
30                  Frozen Planet
50                         Africa
Name: title, dtype: object
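
In the output above, Planet Earth appears in its own recommendations. A likely explanation, stated here as an assumption, is that every show with the same genre string scores the same similarity as the queried show, so after sorting the first entry is simply the earliest such row, and dropping it removes a different show. The sketch below is a hypothetical variant (`recommend_excluding_self` is not part of the notebook) that removes the queried show by position instead. It runs on a toy frame with a renumbered index.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data: three shows share the same genre string, so their scores tie.
toy = pd.DataFrame({
    "title": ["Doc A", "Doc B", "Crime C", "Doc D"],
    "genre": ["Documentary", "Documentary", "Crime,Drama", "Documentary"],
}).reset_index(drop=True)

toy_sim = cosine_similarity(TfidfVectorizer().fit_transform(toy["genre"]))

def recommend_excluding_self(title, frame, sim, top_n=10):
    """Hypothetical variant: returns up to top_n titles, never the query itself."""
    positions = np.flatnonzero(frame["title"].to_numpy() == title)
    if positions.size == 0:
        raise KeyError(f"title not found: {title!r}")
    pos = positions[0]
    # Stable descending sort by similarity, then drop the query's own position.
    order = np.argsort(-sim[pos], kind="stable")
    order = order[order != pos][:top_n]
    return frame["title"].iloc[order]

result = recommend_excluding_self("Doc B", toy, toy_sim, top_n=2)
```

With many tied scores the order among the tied shows is arbitrary (here, row order), so a secondary sort key such as the rating could be considered if a more meaningful ranking is needed.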

In [18]:
recommendations = get_recommendations('Chernobyl') #Passing the name of a show returns its recommendations.
recommendations


242                     The Knick
185                    The Bureau
133                    It's a Sin
217    From the Earth to the Moon
114                    Mahabharat
119                      Deadwood
1                    Breaking Bad
5                        The Wire
41                          Fargo
80                             Oz
Name: title, dtype: object