<a href="https://colab.research.google.com/github/Sandeep81299/Netflix-Movies-and-TV-shows-Prediction/blob/main/NETFLIX_MOVIES_AND_TV_SHOWS_CLUSTERING_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Movies and TV show clustering



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1** -**Sandeep Salunke**


# **Project Summary -**


The entertainment industry is highly competitive, and success is dependent on various factors, including genre, rating, production budget, cast, and more. In this context, a study was conducted to understand the factors influencing the popularity of movies and TV shows on Netflix. The study used a dataset containing around 12 variables to cluster the movies and TV shows based on their popularity and audience preferences.
The first step in the analysis was data wrangling, where missing values were handled and unique values were checked. The study found 2389 missing values in the 'director' column, 718 in the 'cast' column, 507 in the 'country' column, and 10 in the 'date_added' column. Missing 'cast' values were filled with 'No cast', missing 'country' values were filled with the column's mode, and rows with a missing 'date_added' or 'rating' were dropped.

Next, the study performed exploratory data analysis (EDA). Movies outnumber TV shows on Netflix, with 5372 movies and 2398 TV shows in the dataset. TV-MA is the most common rating for both movies and TV shows, indicating that a significant portion of the content is intended for adult audiences and suggesting that Netflix caters primarily to an adult demographic, with a focus on mature and potentially controversial themes.
The years 2017 and 2018 had the highest number of movie releases, while 2020 had the highest number of TV show releases. The number of movie releases on Netflix has grown significantly faster than that of TV shows. Since 2015, there has been a substantial increase in the number of movies and TV shows available on Netflix, followed by a notable drop in titles released after 2020. It appears that Netflix has given more attention to increasing its movie content than its TV shows.

According to the countplot, it appears that Netflix adds the highest number of movies and TV shows during the period between October and January. This period seems to be the busiest time of year for Netflix in terms of adding new content to its platform. Netflix has the highest number of content in the United States, followed by India. India has the highest number of movies on Netflix.

To cluster the shows, the study focused on six key attributes: director, cast, country, genre, rating, and description. These attributes were transformed into a TF-IDF representation with 10,000 features, and Principal Component Analysis (PCA) was used to reduce these to 3000 components, capturing more than 80% of the variance.
Next, two clustering algorithms, K-Means and Agglomerative clustering, were used to group the shows. For K-Means, the optimal number of clusters was found to be 5, while Agglomerative clustering suggested 7 clusters, which were visualized using a dendrogram.
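
The pipeline described above can be sketched on a toy corpus. This is a minimal illustration rather than the notebook's actual code: the `docs` list, the two-component PCA, and `n_clusters=2` are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical toy corpus standing in for the combined text attributes
docs = [
    "crime drama detective murder investigation",
    "detective crime thriller murder case",
    "kids cartoon animals friendship adventure",
    "cartoon kids adventure animals fun",
]

# TF-IDF turns each document into a weighted term vector
tfidf = TfidfVectorizer(max_features=10000)
X = tfidf.fit_transform(docs).toarray()  # PCA needs a dense array

# Reduce dimensionality (the study keeps 3000 components; 2 is enough here)
pca = PCA(n_components=2, random_state=0)
X_reduced = pca.fit_transform(X)

# Group the reduced vectors with K-Means
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)
print(labels)
```

On this toy data the two crime titles and the two kids titles should land in separate clusters; which cluster gets which label number is arbitrary.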

Finally, a content-based recommender system was created using the similarity matrix obtained through cosine similarity. This system provides personalized recommendations based on the type of show the user has watched, giving them 10 top-notch suggestions to explore.
In summary, the study identified key trends in the Netflix dataset, including the growth rate of movies versus TV shows, the busiest period for adding new content, and the content demographics. Through clustering and a content-based recommender system, the study was able to provide personalized recommendations based on the user's viewing history. This study provides valuable insights into the factors influencing the popularity of movies and TV shows on Netflix, offering a foundation for further research and analysis.
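
A content-based recommender of the kind described above can be sketched as follows. The `catalog` frame and the `recommend` helper are hypothetical names introduced for illustration; the real notebook builds its similarity matrix from the Netflix dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalog; titles and text are made up for the example
catalog = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C", "Show D"],
    "text": [
        "crime drama detective murder",
        "detective crime thriller murder",
        "kids cartoon animals adventure",
        "cartoon kids animals friendship",
    ],
})

tfidf_matrix = TfidfVectorizer().fit_transform(catalog["text"])
similarity = cosine_similarity(tfidf_matrix)  # shape: (n_titles, n_titles)

def recommend(title, top_n=10):
    """Return up to top_n titles most similar to `title` (hypothetical helper)."""
    idx = catalog.index[catalog["title"] == title][0]
    # Score every title against the query, then drop the query itself
    scores = pd.Series(similarity[idx], index=catalog["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

print(recommend("Show A", top_n=1))
```

The helper assumes titles are unique; with duplicate titles, the lookup would need a different key such as `show_id`.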

# **GitHub Link -**

https://github.com/Sandeep81299/Netflix-Movies-and-TV-shows-Prediction

# **Problem Statement**


**This dataset consists of TV shows and movies available on Netflix as of 2019. The dataset is collected from Flixable, which is a third-party Netflix search engine.**

**In 2018, they released an interesting report which shows that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.**

**Integrating this dataset with other external datasets, such as IMDb ratings and Rotten Tomatoes, can also provide many interesting findings.**

**In this project, you are required to do**

1. Exploratory Data Analysis.

2. Understanding what type of content is available in different countries.

3. Determining whether Netflix has been increasingly focusing on TV rather than movies in recent years.

4. Clustering similar content by matching text-based features.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Core numerical and data-handling libraries
# (`from numpy import math` fails on recent NumPy versions; use the standard library module)
import math
import numpy as np
import pandas as pd

# Libraries used for analysis and visualization
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import missingno as msno
%matplotlib inline

# Word cloud library
from wordcloud import WordCloud, STOPWORDS

# Libraries used for textual data preprocessing
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from scipy.stats import ttest_ind
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# Libraries used for cluster implementation
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

# Library used for building the recommendation system
from sklearn.metrics.pairwise import cosine_similarity

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Loading CSV File
df = pd.read_csv('/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# count the number of duplicate rows in the DataFrame
duplicate_rows = df[df.duplicated()]
duplicate_count = len(duplicate_rows)

# print the result
print("Number of duplicate rows: ", duplicate_count)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
#total null values
df.isnull().sum().sum()

In [None]:
# Visualizing the missing values
# visualize missing data with a heatmap
msno.heatmap(df)
plt.show()

# visualize missing data with a bar chart
msno.bar(df)
plt.show()

### What did you know about your dataset?

The dataset contains 7787 rows and 12 columns. Five columns contain missing values ('director', 'cast', 'country', 'date_added', and 'rating'), for a total of 3631 missing values in the table.
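
A per-column view of missing values, with percentages, can make this summary easier to read. The sketch below uses a hypothetical toy frame (`toy`); in the notebook the same steps would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `df`
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "cast": ["P", "Q", np.nan, "R"],
})

missing = toy.isnull().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(2),
})
# Keep only the columns that actually have missing values
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
print("Total missing:", int(summary["missing"].sum()))
```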

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns
     

In [None]:
# Dataset Describe
df.describe()

### Variables Description 

**show_id** : Unique ID for every movie / TV show

**type** : Identifier - a Movie or TV Show

**title** : Title of the movie / TV show

**director** : Director of the movie / show

**cast** : Actors involved in the movie / show

**country** : Country where the movie / show was produced

**date_added** : Date it was added on Netflix

**release_year** : Actual release year of the movie / show

**rating** : TV rating of the movie / show

**duration** : Total duration - in minutes or number of seasons

**listed_in** : Genre

**description**: The summary description
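
Because `duration` mixes minutes (movies) and seasons (TV shows) in one text column, it usually needs to be split before numeric analysis. A minimal sketch, assuming values look like `"90 min"` or `"2 Seasons"` (the sample values below are hypothetical):

```python
import pandas as pd

# Hypothetical sample values in the assumed style of the `duration` column
duration = pd.Series(["90 min", "1 Season", "3 Seasons", "125 min"])

# Capture the leading number and the trailing unit separately
parts = duration.str.extract(r"(?P<value>\d+)\s+(?P<unit>\w+)")
parts["value"] = parts["value"].astype(int)
# Normalize "Season"/"Seasons" to a single label
parts["unit"] = parts["unit"].str.lower().str.rstrip("s")
print(parts)
```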

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# iterate over each column in the DataFrame
for col in df.columns:
    # get the unique values for the column
    unique_vals = df[col].unique()
    # print the column name and its unique values
    print(col, unique_vals)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
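# Write your code to make your dataset analysis ready.
# Handling null values: assign the result back instead of calling
# fillna(inplace=True) on a column selection, which newer pandas versions warn about
df['cast'] = df['cast'].fillna('No cast')
df['country'] = df['country'].fillna(df['country'].mode()[0])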

In [None]:
# Rows with a null 'date_added' or 'rating' are an insignificant portion of the data, so we drop those rows
df.dropna(subset=['date_added','rating'],inplace=True)

In [None]:
# Check again whether any null values remain
df.isnull().sum()

### What all manipulations have you done and insights you found?

*   Filled the null values in the 'cast' column with 'No cast'.
*   Filled the null values in the 'country' column with the mode of the column.
*   Dropped rows with null values in the 'date_added' and 'rating' columns.
*   Left the 'director' column unchanged in the code above, so its null values remain.
*   Checked whether any null values remain in the dataset.

Some possible insights after these manipulations:

*   The most common country for Netflix content is the one used to fill the null values in the 'country' column, since that value is the column's mode.
*   The 'cast' column was kept complete by filling its null values with a placeholder rather than dropping rows, which preserves those titles for later analysis.
*   Only a small number of rows had null values in 'date_added' or 'rating', so dropping those rows loses very little data. The columns themselves are kept.
*   The 'director' column has by far the most null values, so it needs separate handling before it can be used as a feature.
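
The null-handling steps listed above can be collected into one reusable function. This is a sketch under assumptions: `clean_titles` is a hypothetical helper name, and the toy frame only contains the four columns the steps touch.

```python
import numpy as np
import pandas as pd

def clean_titles(frame):
    """Hypothetical helper mirroring the null-handling steps listed above."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    out["cast"] = out["cast"].fillna("No cast")
    out["country"] = out["country"].fillna(out["country"].mode()[0])
    out = out.dropna(subset=["date_added", "rating"])
    return out

# Hypothetical toy data with one null in each handled column
toy = pd.DataFrame({
    "cast": ["P", np.nan, "R", "S"],
    "country": ["India", "India", np.nan, "Japan"],
    "date_added": ["January 1, 2020", "May 5, 2019", np.nan, "June 6, 2018"],
    "rating": ["TV-MA", "PG", "R", np.nan],
})
cleaned = clean_titles(toy)
print(cleaned)
```

Working on a copy keeps the raw data available for comparison, which helps when checking how many rows each step removed.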

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - **1.Type**

In [None]:
# Chart - 1 visualization code
sns.set_style('darkgrid')
plt.figure(figsize=(8, 6))
sns.countplot(x='type', data=df, palette='pastel')
#labeling of values
plt.title('Number of Movies and TV Shows', fontsize=14)
plt.xlabel('Type', fontsize=12)
plt.ylabel('Count', fontsize=12)
#Visualization of number of movies and tv shows
plt.show()






##### 1. Why did you pick the specific chart?

A countplot (a type of bar chart) is a good choice for visualizing categorical data, such as the number of movies and TV shows on Netflix, because it displays the frequency of each category in a clear and easy-to-understand way.

##### 2. What is/are the insight(s) found from the chart?

The number of movies on Netflix is greater than the number of TV shows, with 5372 movies and 2398 TV shows currently available on the platform.
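
The counts above can also be expressed as shares of the catalog. A small sketch using only the two figures reported in this insight:

```python
import pandas as pd

# Counts reported in the insight above
counts = pd.Series({"Movie": 5372, "TV Show": 2398})

# Share of each type as a percentage of all titles
share = (counts / counts.sum() * 100).round(1)
print(share)
```

On the full DataFrame, `df['type'].value_counts(normalize=True)` would give the same proportions directly.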

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that there are more movies on Netflix than TV shows is unlikely to have a significant positive or negative business impact on its own. However, this information could be used in conjunction with other insights and data to inform business decisions.

For example, if Netflix notices that TV shows are more popular with its subscribers than movies, it may decide to focus more on acquiring TV show content. Alternatively, if it sees that its original movie productions are gaining popularity, it may decide to invest more in that area.

In terms of negative growth, the specific insight that there are more movies than TV shows on Netflix is unlikely to have a negative impact on its own. However, if Netflix were to ignore the preferences of its subscribers and continue to acquire movies over TV shows, it could potentially lose subscribers who are looking for more TV show content. Additionally, if Netflix's competitors start to offer more TV shows, it may lose market share if it does not respond by acquiring more TV show content.

#### Chart - **2.Rating**

In [None]:
# Chart - 2 visualization code
df['rating']

In [None]:
ratings = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}

df['target_ages'] = df['rating'].map(ratings)

In [None]:
# convert 'type' column to categorical data type
df['type'] = pd.Categorical(df['type'])

# create a new categorical column 'target_ages' with specified categories
df['target_ages'] = pd.Categorical(df['target_ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])


In [None]:
df

In [None]:
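# Create two separate DataFrames, one for TV shows and one for movies
# (.copy() avoids SettingWithCopyWarning when columns are added to them later)
tv_shows = df[df['type'] == 'TV Show'].copy()
movies = df[df['type'] == 'Movie'].copy()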

In [None]:
# group TV shows by 'rating' and count the number of shows in each rating category
tv_ratings = tv_shows.groupby(['rating'])['show_id'].count().reset_index(name='count').sort_values(by='count',ascending=False)

# set figure dimensions
fig_dims = (14,7)

# create a figure and axis object with specified dimensions
fig, ax = plt.subplots(figsize=fig_dims)

# create a point plot using Seaborn's pointplot() function, with 'rating' on the x-axis and 'count' on the y-axis
sns.pointplot(x='rating',y='count',data=tv_ratings)

# set the plot title and font size
plt.title('TV Show Ratings',size='20')

# display the plot
plt.show()


In [None]:
# create a color palette for the different target age groups
colors = ["#FFC300", "#FF5733", "#C70039", "#900C3F"]

# plot a countplot to show the movie ratings based on target age groups
plt.figure(figsize=(14,6))
plt.title('Movie Ratings by Target Age Group')

sns.countplot(x=movies['rating'], hue=movies['target_ages'], data=movies, 
              order=movies['rating'].value_counts().index, palette=colors)

# add a legend to the plot
plt.legend(title='Target Age Group', loc='upper right', labels=['Kids', 'Older Kids', 'Teens', 'Adults'])

plt.show()


##### 1. Why did you pick the specific chart?

I chose these charts because they show the distribution of ratings in a clear and concise manner. The point plot and countplot allow easy comparison between the different ratings, and ordering by count from highest to lowest further emphasizes the dominance of TV-MA. Overall, they provide a quick and informative overview of the ratings landscape on Netflix.

##### 2. What is/are the insight(s) found from the chart?

According to the dataset, TV-MA is the most common rating for both movies and TV shows, with the highest number of occurrences in the 'rating' column for each type. This indicates that a significant portion of the content available on Netflix is intended for adult audiences, and suggests that Netflix caters to a primarily adult demographic, with a focus on mature and potentially controversial themes.
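
To quantify how strongly each content type leans toward adult audiences, the rating-to-audience mapping can be combined with a cross-tabulation. A minimal sketch on hypothetical toy rows, using a subset of the mapping:

```python
import pandas as pd

# Hypothetical toy rows; the notebook's `df` has the same columns
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "rating": ["TV-MA", "R", "PG", "TV-MA", "TV-14", "TV-Y"],
})
# Subset of the rating-to-audience mapping used in this notebook
ratings = {"TV-MA": "Adults", "R": "Adults", "PG": "Older Kids",
           "TV-14": "Teens", "TV-Y": "Kids"}
toy["target_ages"] = toy["rating"].map(ratings)

# Row-normalized shares: fraction of each type aimed at each audience
shares = pd.crosstab(toy["type"], toy["target_ages"], normalize="index").round(2)
print(shares)
```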

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can have a positive impact on Netflix's business strategy. Knowing that TV-MA is the most common rating for both movies and TV shows, Netflix can continue to focus on producing and acquiring content that appeals to adult audiences. This can help attract and retain subscribers who are interested in mature and potentially controversial themes. Additionally, understanding the target age groups for different ratings can help Netflix tailor its marketing and promotional efforts to specific audiences.

However, there is a potential negative impact as well. Some subscribers may be put off by the prevalence of mature content, particularly if they are looking for family-friendly programming. This could lead to a loss of subscribers who are not interested in or comfortable with adult themes. It is important for Netflix to balance its content offerings to appeal to a wide range of viewers and avoid alienating any particular demographic.

#### Chart - **3.Release Year**

In [None]:
#Creating a line chart to visualize the number of movies and TV shows released each year
#Extracting the count of movies and TV shows for each year
movies_year = movies['release_year'].value_counts().sort_index(ascending=False)
tvshows_year = tv_shows['release_year'].value_counts().sort_index(ascending=False)

#Creating a line plot using Seaborn
sns.set(style='whitegrid', font_scale=1.2)
fig, ax = plt.subplots(figsize=(12, 7))

ax = sns.lineplot(x=movies_year.index, y=movies_year.values, color='maroon', label='Movies', linewidth=2.5, marker='o')
ax = sns.lineplot(x=tvshows_year.index, y=tvshows_year.values, color='blue', label='TV Shows', linewidth=2.5, marker='o')

#Customizing the plot
plt.xticks(rotation=90)
ax.set_xlabel('Release Year', fontsize=14)
ax.set_ylabel('Number of Titles', fontsize=14)
ax.set_title('Production Growth Yearly', fontsize=18, pad=15)
plt.legend(fontsize=14)

plt.show()

In [None]:
# Years to include (range excludes its end value, so this covers 2001-2019)
last_20_years = list(range(2001, 2020))

# Filter the dataset to only include movies released in those years
movies_last_20_years = movies[movies['release_year'].isin(last_20_years)]

# Create a count plot of the number of movies released per year
plt.figure(figsize=(12,6))
sns.countplot(x='release_year', data=movies_last_20_years, palette='mako', order=last_20_years)
plt.xticks(rotation=45, ha='right')
plt.xlabel('Year of Release')
plt.ylabel('Number of Movies Released')
plt.title('Number of Movies Released per Year (2001-2019)')
plt.show()

In [None]:
tvshows_year

In [None]:
# filter for movies released from 2008 onward
# (filtering `movies` rather than `df`, which also contains TV shows)
movies_last_15_years = movies[movies['release_year'] >= 2008]

# create a countplot with horizontal bars, ordered by count (at most 15 years shown)
plt.figure(figsize=(10,6))
sns.countplot(y='release_year', data=movies_last_15_years, order=movies_last_15_years['release_year'].value_counts().index[:15])
plt.title('Number of Movies Released per Year (2008 Onward)', fontsize=16)
plt.xlabel('Number of Movies')
plt.ylabel('Release Year')
plt.show()

In [None]:
df

In [None]:
# Add a column with the month each title was added to Netflix
# (strip stray whitespace first so every date parses consistently)
df['month'] = pd.to_datetime(df['date_added'].str.strip()).dt.month
df.head()

##### 1. Why did you pick the specific chart?

A line chart and bar charts were used to display the number of movies and TV shows released per year. These make it easy to compare the two types year by year and to spot trends or patterns in the data. A stacked bar chart or a stacked area chart could also be used to show the proportion of movies and TV shows released in each year.

The line chart effectively conveys that the number of movies on Netflix is growing faster than the number of TV shows. It also highlights the increase in movies and TV shows after 2015, followed by a drop after 2020. Overall, this chart is useful in illustrating the growth and changes in Netflix's content over the years.

##### 2. What is/are the insight(s) found from the chart?

The years 2017 and 2018 had the highest number of movie releases, while 2020 had the highest number of TV show releases.

The growth rate of movie releases on Netflix is significantly faster than that of TV shows.

Since 2015, there has been a substantial increase in the number of movies and TV shows available on Netflix.

However, there has been a notable drop in the number of movies and TV shows released after 2020.

It appears that Netflix has given more attention to increasing its movie content rather than TV shows, as the growth rate of movies has been much more significant than that of TV shows.
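
The growth comparison above can be checked numerically by counting titles per year and type and then computing the year-over-year change. A sketch on hypothetical toy data (the real notebook would group `df`, where `type` is categorical and may need `observed=True` in `groupby`):

```python
import pandas as pd

# Hypothetical toy data: 2 movies and 1 show in 2016, 3 movies and 2 shows in 2017
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show",
             "Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "release_year": [2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017],
})

# One row per year, one column per type
per_year = toy.groupby(["release_year", "type"]).size().unstack(fill_value=0)

# Year-over-year percentage change for each type
growth = per_year.pct_change() * 100
print(per_year)
print(growth)
```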

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

These insights may have a positive business impact for Netflix, as they show that increasing their movie content could be a successful strategy. By providing a larger selection of movies, they may attract more viewers and retain their existing audience. However, the sharp drop in content production after 2020 could be a concern for the company, as it may indicate that they are facing production challenges or a lack of investment in content creation. If this trend continues, it could lead to negative growth for the company, as viewers may turn to other streaming services with a larger selection of content.

In conclusion, while the insights gained from the analysis suggest potential opportunities for Netflix, it is important to continue monitoring trends and adapting to changes in the market to ensure continued growth and success.

#### Chart - **4.Month Added**

In [None]:
# Chart - 4 visualization code
# Visualization of the month in which titles were added to Netflix
plt.figure(figsize=(12, 10))
sns.countplot(x='month', data=df, palette='Set2')
plt.title('Countplot of Month')
plt.xlabel('Month')
plt.ylabel('Count')
plt.show()

In [None]:
#Countplot of Month by Type
fig, ax = plt.subplots(figsize=(15, 6))

sns.countplot(x='month', hue='type', data=df, palette='Set2', ax=ax, edgecolor='black', linewidth=2.5)
ax.set_title('Countplot of Month by Type', fontsize=16)
ax.set_xlabel('Month', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.legend(fontsize=12, title='Type', title_fontsize=12)
sns.despine()
plt.show()

##### 1. Why did you pick the specific chart?

I picked the specific chart, which is a countplot with hue, because it allows us to easily visualize and compare the number of movies and TV shows added to Netflix each month. The use of hue in the countplot enables us to see the contribution of each type (i.e. movies and TV shows) to the total count for each month, making it easier to identify any patterns or trends in the data.

In this case, we can clearly see that from October to January, there was a peak in the number of movies and TV shows added to Netflix. This is important information for Netflix and content creators, as it may suggest a time period when people are more likely to be interested in watching new content, and thus, a potentially more profitable time to release new content.

##### 2. What is/are the insight(s) found from the chart?

According to the countplot, it appears that Netflix adds the highest number of movies and TV shows during the period between October and January. This period seems to be the busiest time of year for Netflix in terms of adding new content to its platform.
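
The share of titles added in the October to January window can be computed directly from the dates. The sketch below assumes `date_added` strings follow the `"Month D, YYYY"` pattern, sometimes with leading whitespace; the sample dates are hypothetical.

```python
import pandas as pd

# Hypothetical sample of `date_added` values (note the stray leading space)
dates = pd.Series([" October 1, 2019", "January 15, 2020", "December 3, 2018",
                   "June 9, 2019", "November 20, 2020", "March 2, 2017"])

# Parse to datetimes and keep the month number
month = pd.to_datetime(dates.str.strip(), format="%B %d, %Y").dt.month

# Titles added per month, in calendar order
by_month = month.value_counts().sort_index()

# Percentage of titles added between October and January
peak_months = [10, 11, 12, 1]
peak_share = month.isin(peak_months).mean() * 100
print(by_month)
print(round(peak_share, 1))
```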

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insight that the most content is added to Netflix from October to January can potentially help create a positive business impact. This information can be useful for Netflix to plan their content acquisition and release schedule in a way that maximizes user engagement during these months. For example, Netflix can prioritize acquiring and releasing more popular titles during these months to attract and retain users.

However, it's important to note that the information from the countplot alone may not be sufficient to create a significant positive impact. Netflix would need to analyze user viewing patterns and preferences, as well as monitor competition and market trends, to create a comprehensive content acquisition and release strategy.

Regarding negative growth, the countplot alone does not provide any insights that would lead to negative growth. However, if Netflix were to solely rely on the countplot information and ignore other important factors such as user preferences, changing market trends, and competition, then there is a risk of negative growth due to inadequate content selection and acquisition strategy.

#### Chart - **5.Genre**

In [None]:
# Chart - 5 visualization code
#Top 10 genres of movies
top10_movies = movies['listed_in'].value_counts().index[0:10]
#Visualization of code
plt.figure(figsize=(14, 6))
sns.countplot(y='listed_in', data=movies, order=top10_movies, palette='muted')
plt.title('Top 10 Genres of Movies', fontsize=16, fontweight='bold')
plt.xlabel('Count', fontsize=14)
plt.ylabel('Genre', fontsize=14)
sns.despine()
plt.tight_layout()
plt.show()

In [None]:
#Top 10 Genres of Tv shows
top10_tvshows = tv_shows['listed_in'].value_counts().index[0:10]
#Visualization
plt.figure(figsize=(14, 6))
sns.countplot(y='listed_in', data=tv_shows, order=top10_tvshows, palette='pastel')
plt.title('Top 10 Genres of TV Shows', fontsize=16, fontweight='bold')
plt.xlabel('Count', fontsize=14)
plt.ylabel('Genre', fontsize=14)
sns.despine()
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal countplot makes it easy to compare how many movies and TV shows fall under each of the top 10 genre listings.

##### 2. What is/are the insight(s) found from the chart?

Netflix offers a diverse range of TV show genres. According to the chart, the genre listing with the most TV show titles is kids TV. Note that the chart counts titles in the catalog, so it reflects what Netflix offers rather than what viewers watch most.

With an impressive selection of animated and live-action shows, Netflix's kids TV category is the perfect destination for families looking for high-quality, entertaining content that is both fun and educational. From beloved classics like SpongeBob SquarePants and Power Rangers to exciting new series like Carmen Sandiego and The Dragon Prince, Netflix's kids TV library has something for every young viewer.

Moreover, Netflix's kids TV category is designed with parents in mind, offering a safe and secure viewing environment that allows them to have peace of mind while their kids enjoy their favorite shows. The parental controls feature allows parents to set age-appropriate content filters, monitor viewing history, and restrict access to certain shows or movies.

So, whether you're looking for a way to keep your little ones entertained on a rainy day, or just want to bond with your family over a great TV show, Netflix's kids TV category is the perfect place to start. With its vast selection of entertaining and educational content, it's no wonder that kids TV remains one of the top genres on the platform.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The top genre for TV shows on Netflix is kids TV, which includes a range of educational and entertaining content for children of all ages, such as "Paw Patrol", "Peppa Pig", and "The Magic School Bus".

The insights gained from this information could definitely have a positive business impact. By knowing which genres are most popular, Netflix can tailor their content offerings and marketing strategies to appeal to their target audience. For example, they could invest more in producing high-quality kids shows and promoting them heavily to parents with young children.

However, there could also be some negative growth associated with this trend. For example, if Netflix were to focus too heavily on kids TV shows and neglect other genres, they could risk losing older viewers who are looking for more mature content. Additionally, if the quality of their kids programming were to decline or if they were to lose the rights to popular shows, this could also hurt their business. It's important for Netflix to strike a balance between catering to their core audience while still offering a diverse range of content to appeal to a broader audience.
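
One caveat on the genre charts: `listed_in` holds comma-separated genre combinations, so counting its raw values ranks combinations rather than individual genres. A sketch of counting individual genres instead, using hypothetical sample values and assuming genres are separated by commas:

```python
import pandas as pd

# Hypothetical sample of `listed_in` values
listed_in = pd.Series([
    "Dramas, International Movies",
    "Comedies, Dramas",
    "Documentaries",
    "Dramas, Comedies, International Movies",
])

# Split each entry into a list, give each genre its own row, then count
genre_counts = (
    listed_in.str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(genre_counts)
```

This can change the ranking noticeably, since a genre that rarely appears alone may still appear in many combinations.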

#### Chart - **6.Duration**

In [None]:
# Chart - 6 visualization code
# Create a figure and set its size
plt.figure(figsize=(10, 7))

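# Extract the duration values as integers using a regex and plot a histogram
# (the raw string avoids an invalid-escape warning; expand=False returns a Series)
sns.histplot(movies['duration'].str.extract(r'(\d+)', expand=False).astype(int), kde=False, color='red')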

# Set the title of the plot
plt.title('Distribution of Movie Durations', fontweight='bold')

# Set the x-axis label
plt.xlabel('Duration (minutes)')

# Set the y-axis label
plt.ylabel('Count')

# Show the plot
plt.show()

In [None]:
# Set the figure size
plt.figure(figsize=(30, 6))

# Create a count plot of TV show durations
sns.countplot(x=tv_shows['duration'], data=tv_shows, order=tv_shows['duration'].value_counts().index)

# Set the title of the plot
plt.title("Distribution of TV Show Durations", fontweight='bold')

# Set the x-axis label
plt.xlabel("Duration (seasons)")

# Set the y-axis label
plt.ylabel("Count")

# Rotate the x-axis labels
plt.xticks(rotation=90)

# Show the plot
plt.show()

In [None]:
# Extract the duration values as integers using a regex
# (assign() returns a new DataFrame, which avoids SettingWithCopyWarning)
movies = movies.assign(minute=movies['duration'].str.extract(r'(\d+)', expand=False).astype(int))

# Calculate the average movie duration by rating
duration_by_rating = movies.groupby(['rating'])['minute'].mean()

# Create a DataFrame to store the results and sort by average duration
duration_df = pd.DataFrame(duration_by_rating).sort_values('minute')

# Set the figure size
plt.figure(figsize=(12, 6))

# Create a bar plot of the average movie duration by rating
ax = sns.barplot(x=duration_df.index, y=duration_df.minute)

# Set the title of the plot
plt.title("Average Movie Duration by Rating", fontweight='bold')

# Set the x-axis label
plt.xlabel("Rating")

# Set the y-axis label
plt.ylabel("Average Duration (minutes)")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Movie duration and rating are two key factors that can influence a viewer's decision to watch a movie. Visualizing the relationship between these two variables makes it easier to identify patterns and trends. For example, the bar chart of average duration by rating shows that NC-17 movies tend to have longer runtimes than movies with other ratings, which could be a useful insight for filmmakers and movie studios.

Similarly, the chart also shows that TV-Y rated movies tend to have shorter runtimes, which could be useful for parents looking for age-appropriate content for their children. Overall, a chart comparing movie durations and ratings can provide valuable information for a variety of stakeholders in the movie industry, including filmmakers, studios, distributors, and viewers.

##### 2. What is/are the insight(s) found from the chart?

When analyzing the movie durations, it was observed that the majority of the movies have a duration between 50 and 150 minutes. TV shows, on the other hand, are dominated by single-season titles, which may indicate that many of the TV shows on Netflix are relatively new, although it could also reflect limited series or shows that were not renewed.

Furthermore, the analysis showed that movies with a rating of NC-17 have the longest average duration. This might be because the movies with such a rating can explore more mature themes and include more explicit content, which requires a longer runtime to tell a compelling story.

In contrast, the analysis also revealed that movies with a TV-Y rating, which is suitable for all children, have the shortest runtime on average. This suggests that the movies with this rating tend to be shorter and may have simpler plots and themes that are suitable for younger audiences.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can potentially help create a positive business impact as it allows movie studios and streaming platforms to better understand their audience and tailor their content accordingly. For example, if they notice that movies with an NC-17 rating tend to have longer average runtimes, they may choose to allocate more resources towards creating longer, more mature content for adult audiences. Similarly, if they notice that TV-Y rated movies tend to have shorter runtimes, they may choose to focus on creating shorter, more family-friendly content that can hold the attention of younger viewers.

However, there could also be insights that lead to negative growth. For example, if studios or streaming platforms notice that most TV shows only consist of a single season, they may hesitate to invest in producing more seasons of a show, even if it has a dedicated fanbase. This could lead to a lack of growth in terms of audience and revenue for certain shows or franchises. Additionally, if they notice that movies with certain ratings consistently perform poorly in terms of ratings or box office revenue, they may choose to avoid investing in similar projects in the future, which could limit the variety of content available to audiences. Ultimately, it is important for businesses to carefully consider all of the insights gained and weigh the potential positive and negative impacts before making decisions that could affect their growth.

#### Chart - **7.Country**

In [None]:
# Chart - 7 visualization code
# create a figure with the desired size
plt.figure(figsize=(18,5))

# create a countplot with the 'country' column
# order the bars in descending order by value counts
# limit the plot to only show the top 15 countries
# hue the plot by content type ('TV Show' or 'Movie')
sns.countplot(x=df['country'], order=df['country'].value_counts().index[0:15], hue=df['type'])

# rotate the x-axis tick labels by 50 degrees for better visibility
plt.xticks(rotation=50)

# set the plot title and font size
plt.title('Top 15 countries with most contents', fontsize=15, fontweight='bold')

# show the plot
plt.show()

In [None]:
# number of titles per country, sorted from most to least
country=df['country'].value_counts().reset_index()
country

In [None]:
# Top 10 countries by count of titles
country_order = df['country'].value_counts()[:10].index

# Create a dataframe with count of movie and TV show for each country
content_data = df[['type', 'country']].groupby('country')['type'].value_counts().unstack().loc[country_order]

# Add a column for total count of titles
content_data['total'] = content_data.sum(axis=1)

# Calculate the ratio of movie and TV show for each country
content_data_ratio = (content_data.T / content_data['total']).T[['Movie', 'TV Show']]

# Sort the dataframe by movie ratio and plot the horizontal bar chart
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
content_data_ratio.sort_values(by='Movie').plot(kind='barh', stacked=True, ax=ax)

# Set the x-axis label and title
ax.set_xlabel('Ratio of Titles', fontsize=14)
ax.set_title('Ratio of Movie and TV Show by Country', fontsize=18)

# Set the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(reversed(handles), reversed(labels), fontsize=12, loc='upper right')

In [None]:
# Preparing data for heatmap
df['count'] = 1
# Names of the 10 countries with the most titles
data = df.groupby('country')['count'].sum().sort_values(ascending=False).index[:10]


df_heatmap = df.loc[df['country'].isin(data)]
df_heatmap = pd.crosstab(df_heatmap['country'],df_heatmap['target_ages'],normalize = "index").T
df_heatmap

##### 1. Why did you pick the specific chart?

A count plot makes it easy to compare the number of titles per country side by side, and splitting each bar by content type shows how movies and TV shows contribute to each total. From it we can see that the United States has the highest number of titles on Netflix, followed by India. Additionally, India has the highest number of movies on Netflix.

A horizontal stacked bar chart complements it by showing the ratio of movies to TV shows for each of the top 10 countries, sorted so that the countries are easy to compare.

##### 2. What is/are the insight(s) found from the chart?

The United States has the highest number of titles on Netflix, followed by India. India has the highest number of movies on Netflix.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

According to our analysis, the United States has the highest number of titles on Netflix, followed by India. Interestingly, India has the highest number of movies on Netflix.

These insights can be useful for Netflix in a number of ways. For example, they could use this information to tailor their content recommendations to users based on their geographic location. They could also use this information to determine which types of content to focus on producing in the future.

However, there are also some potential negative impacts to consider. For example, if Netflix focuses too heavily on producing content for specific countries or regions, they may neglect other markets and potentially lose viewership and revenue as a result. Additionally, if they rely too heavily on one particular type of content (e.g. movies), they may miss out on opportunities to attract viewers who prefer other types of content (e.g. TV shows or documentaries).

Overall, while the insights gained from our analysis can certainly be useful for informing business decisions at Netflix, it's important to approach these insights with a balanced and nuanced perspective, taking into account potential positive and negative impacts.

#### Chart - **8.Originals**

In [None]:
# Chart - 8 visualization code
df['date_added'] = pd.to_datetime(df['date_added'])
movies['year_added'] = df['date_added'].dt.year
df

In [None]:
# Create a new column 'originals': a movie is treated as an original when it was
# added to Netflix in the same year it was released (a heuristic, not an official flag)
movies['originals'] = np.where(movies['release_year'] == movies['year_added'], 'Yes', 'No')

# Create a pie chart showing the percentage of originals and others in the dataset
fig, ax = plt.subplots(figsize=(5,5), facecolor="#363336")
ax.patch.set_facecolor('#363336')

# Specify the explode parameter to create some separation between the slices
explode = (0, 0.1)

# Use value_counts() to count the number of movies in each category, fixing the
# order to ['No', 'Yes'] so that the labels below always match the slices
originals_counts = movies['originals'].value_counts().reindex(['No', 'Yes'], fill_value=0)
ax.pie(originals_counts, explode=explode, autopct='%.2f%%', labels=['Others', 'Originals'],
       shadow=True, startangle=90, textprops={'color': "black", 'fontsize': 20}, colors=['red', '#F5E9F5'])

# Set the title for the plot
ax.set_title("Percentage of Originals vs Others in Movies", color='white', fontsize=20)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

To know the percentage of originals vs others.

##### 2. What is/are the insight(s) found from the chart?

While Netflix is known for producing original content, it is interesting to note that only about 30% of the movies available on the platform were added in the same year they were released, which this analysis treats as a proxy for Netflix originals. The remaining 70% of movies were added to Netflix after an earlier release through other channels, such as theaters or other streaming platforms.

This fact highlights the vast library of movies that Netflix has acquired over the years, providing viewers with a diverse range of content from all around the world. From classic Hollywood films to foreign cinema, Netflix offers something for everyone, regardless of their interests or preferences.

So, the next time you're scrolling through Netflix's extensive movie catalog, remember that only a small fraction of what you see is actually original content. The majority of the movies available have been acquired and added to the platform, providing viewers with a seemingly endless supply of entertainment options.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

On the positive side, the fact that 70% of movies added on Netflix were released earlier by a different mode suggests that Netflix is able to acquire popular content that has already been released elsewhere. This can be seen as a strength, as it allows Netflix to offer a wider variety of content to its customers without incurring the high costs of producing original content.

Furthermore, the fact that 30% of movies were added in their release year suggests that Netflix is investing in creating its own original content. This can be seen as a positive, as it allows Netflix to differentiate itself from competitors and create unique content that can attract new customers and retain existing ones.

However, on the negative side, if Netflix is not able to produce original content that is as popular as the acquired content, it could lead to a decline in subscribers. Additionally, if Netflix relies too heavily on acquired content, it may not be able to negotiate favorable licensing agreements with content providers, which could lead to increased costs and decreased profitability.

#### Chart - 9 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Preparing data for heatmap
df['count'] = 1
# Names of the 10 countries with the most titles
data = df.groupby('country')['count'].sum().sort_values(ascending=False).index[:10]


df_heatmap = df.loc[df['country'].isin(data)]
df_heatmap = pd.crosstab(df_heatmap['country'],df_heatmap['target_ages'],normalize = "index").T
df_heatmap

In [None]:
# Plotting the heatmap
fig, ax = plt.subplots(1, 1, figsize=(12, 12))

country_order2 = ['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain',
       'Mexico']

age_order = ['Adults', 'Teens', 'Older Kids', 'Kids']

sns.heatmap(df_heatmap.loc[age_order,country_order2],cmap="YlGnBu",square=True, linewidth=2.5,cbar=False,
            annot=True,fmt='1.0%',vmax=.6,vmin=0.05,ax=ax,annot_kws={"fontsize":12})
plt.show()

##### 1. Why did you pick the specific chart?

To know the relation between variables.

##### 2. What is/are the insight(s) found from the chart?

The US and UK are closely aligned in their Netflix target ages, but radically different from, for example, India or Japan.

Also, Mexico and Spain have similar content on Netflix for different age groups.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.


### Hypothetical Statement - The United States has the highest number of titles on Netflix, followed by India. India has the highest number of movies on Netflix.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

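Each country contributes a single title count rather than a sample of counts, so the test below uses the release_year of each movie as the sampled variable.

Null hypothesis (H0): The mean release year of movies on Netflix from the United States is equal to the mean release year of movies on Netflix from India.

Alternative hypothesis (H1): The mean release year of movies on Netflix from the United States differs from the mean release year of movies on Netflix from India.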

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Filter movies only
movies = df[df.type == 'Movie']

# Filter by country
us_movies = movies[movies.country == 'United States']
india_movies = movies[movies.country == 'India']

# Perform a two-sided Welch's t-test on the release years of the two groups
t, p = ttest_ind(us_movies['release_year'], india_movies['release_year'], equal_var=False)

# Print the results
alpha = 0.05
if p < alpha:
    print("Reject null hypothesis. The mean release year of US movies differs from that of Indian movies.")
else:
    print("Fail to reject null hypothesis. There is no significant difference in mean release year between US and Indian movies.")

##### Which statistical test have you done to obtain P-Value?

I used a two-sample t-test (also known as an independent samples t-test or unpaired t-test) to obtain the p-value, via the ttest_ind function from the scipy.stats module. This test is appropriate for comparing the means of two independent samples. Note that the code passes the release_year column of each group, so what is actually compared is the mean release year of movies from the United States and India, not the number of movies in each country.

I did not assume that the variances of the two populations are equal (i.e., I set equal_var=False in the ttest_ind function, which gives Welch's t-test), since it's reasonable to expect the spread of the two groups to differ. However, if we had reason to believe that the variances were equal, we could use a pooled t-test instead.
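
To illustrate what `equal_var=False` changes, here is a minimal sketch on synthetic data (not the Netflix dataset). The arrays `group_a` and `group_b` and their parameters are made up for illustration. The block computes Welch's t statistic by hand and compares it with SciPy's result, alongside the pooled variant.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic, illustrative samples (NOT the Netflix data): two groups with
# different sizes and different spreads.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=2012, scale=9.0, size=400)
group_b = rng.normal(loc=2010, scale=4.0, size=150)

# Welch's t statistic by hand: difference of means divided by the
# unpooled standard error.
var_a = group_a.var(ddof=1)
var_b = group_b.var(ddof=1)
se = np.sqrt(var_a / group_a.size + var_b / group_b.size)
t_manual = (group_a.mean() - group_b.mean()) / se

# equal_var=False selects Welch's test in SciPy.
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)
# equal_var=True pools the two variances instead (Student's t-test).
t_pooled, p_pooled = ttest_ind(group_a, group_b, equal_var=True)

print(f"manual t = {t_manual:.4f}, Welch t = {t_welch:.4f}, p = {p_welch:.4g}")
print(f"pooled t = {t_pooled:.4f}, p = {p_pooled:.4g}")
```

When the group sizes and variances differ, as they do here, the pooled and Welch statistics generally disagree, which is why the unpooled version is the safer default.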

##### Why did you choose the specific statistical test?

I chose the two-sample t-test because it's appropriate for comparing the means of two independent samples, which is what we're doing here. We have two independent samples of movies on Netflix, one from the United States and one from India, and we want to test whether the mean of the tested variable differs significantly between them.

The t-test is also appropriate because the population standard deviations are unknown, so we need to use the sample standard deviations to estimate them.

Additionally, the t-test assumes that the data are normally distributed (or approximately normally distributed). This may not hold exactly for the raw values, but with large samples the sampling distribution of the mean is approximately normal, so the test remains a reasonable choice.

Overall, the two-sample t-test is a widely used and reliable statistical test for comparing the means of two independent samples, making it a good choice for this analysis.

### Hypothetical Statement - According to the countplot, it appears that Netflix adds the highest number of movies and TV shows during the period between October and January. This period seems to be the busiest time of year for Netflix in terms of adding new content to its platform.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis (H0): There is no significant difference in the number of movies and TV shows added by Netflix across different months, i.e. content type and month added are independent.

Alternative hypothesis (H1): There is a significant difference in the number of movies and TV shows added by Netflix across different months, i.e. content type and month added are associated.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Convert the "date_added" column to datetime format
df["date_added"] = pd.to_datetime(df["date_added"])

# Extract the month from the "date_added" column
df["month_added"] = df["date_added"].dt.month_name()

# Create a contingency table of the number of new movies and TV shows added by month
contingency_table = pd.crosstab(df["type"], df["month_added"])

# Perform a chi-square test for independence
chi2_statistic, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print("Chi-square statistic:", chi2_statistic)
print("P-value:", p_value)

##### Which statistical test have you done to obtain P-Value?

To obtain the p-value, we performed a chi-square test for independence, which determines whether there is a significant association between two categorical variables. Here the two variables are the content type (movie or TV show) and the month in which a title was added to Netflix. The test compares the observed frequencies in the contingency table to the frequencies expected under the assumption of independence.

The test statistic is the sum, over all cells, of the squared difference between the observed and expected frequencies divided by the expected frequency; under the null hypothesis it approximately follows a chi-square distribution. The p-value is the probability of obtaining a test statistic at least as extreme as the observed one, assuming the null hypothesis (independence) is true. If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that there is a significant association between the two variables.
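
As a sketch of how the statistic is built, the block below computes Pearson's chi-square, the sum of (O − E)² / E over all cells, by hand and checks it against `scipy.stats.chi2_contingency`. The 2 × 3 table is hypothetical (made-up counts, not the notebook's month-by-type table).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 3 contingency table (rows: content type, columns: month).
# The counts are made up for illustration.
observed = np.array([[120, 90, 150],
                     [60, 50, 40]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# SciPy computes the same statistic, plus the p-value and degrees of freedom.
chi2_scipy, p_value, dof, expected = chi2_contingency(observed)

print(f"manual chi2 = {chi2_manual:.4f}, scipy chi2 = {chi2_scipy:.4f}")
print(f"degrees of freedom = {dof}, p-value = {p_value:.4g}")
```

The degrees of freedom equal (rows − 1) × (columns − 1), so a 2 × 12 table of type by month would have 11.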

##### Why did you choose the specific statistical test?

We chose the chi-square test for independence because we were interested in a potential association between two categorical variables: the month of the year and the type of content (movie or TV show) added to Netflix. The chi-square test for independence is commonly used for this type of analysis, where we want to determine if the observed distribution of frequencies differs significantly from the distribution expected under independence. The test yields a p-value, which indicates the strength of evidence against the null hypothesis of independence. If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that there is a significant association between the two variables.

### Hypothetical Statement - The number of movies on Netflix is greater than the number of TV shows, with 5372 movies and 2398 TV shows currently available on the platform.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis: The number of movies and TV shows on Netflix is not significantly different.

Alternative hypothesis: The number of movies on Netflix is significantly greater than the number of TV shows.

#### 2. Perform an appropriate statistical test.

In [None]:
from statsmodels.stats.proportion import proportions_ztest
# Count the number of movies and TV shows
n_movies = df[df['type'] == 'Movie'].count()['type']
n_tv_shows = df[df['type'] == 'TV Show'].count()['type']

# Set the counts and sample sizes for the z-test
counts = [n_movies, n_tv_shows]
nobs = [len(df), len(df)]

# Perform the z-test assuming equal proportions
z_stat, p_val = proportions_ztest(counts, nobs, value=0, alternative='larger')

# Print the results
print('Number of movies:', n_movies)
print('Number of TV shows:', n_tv_shows)
print('z-statistic:', z_stat)
print('p-value:', p_val)

##### Which statistical test have you done to obtain P-Value?

I used a two-sample z-test for proportions to obtain the p-value. The null hypothesis for the test is that the proportion of movies and TV shows on Netflix is equal, while the alternative hypothesis is that the proportion of movies is greater than the proportion of TV shows. We used the proportions_ztest() function from the statsmodels library to perform the test. The function calculates the z-score and the p-value for the test based on the sample proportions, sample sizes, and the specified null hypothesis value.

##### Why did you choose the specific statistical test?

I chose the two-sample z-test for proportions to compare the number of movies and TV shows on Netflix because each title falls into one of two categories (movie or TV show), and we want to test whether the proportions of these categories differ significantly. The test is designed for comparing the proportion of successes in two independent samples; here, a "success" is a title of the given type. It also assumes that the samples are large enough to apply the normal approximation to the binomial distribution, which holds for a dataset of this size. One caveat: the movie and TV show counts come from the same set of titles and are complementary, so they are not truly independent samples. A one-sample test of whether the movie proportion exceeds 0.5 would match the hypothesis more directly.
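
Because every title is either a movie or a TV show, the same question can also be framed as a one-sample test of whether the movie share exceeds 0.5. The following is a minimal sketch of that alternative, not the notebook's method. It uses only the counts quoted in the hypothesis statement (5372 movies, 2398 TV shows) and computes the standard error under the null proportion, which is an assumption of this sketch.

```python
import math

# Counts quoted in the hypothesis statement above.
n_movies, n_tv_shows = 5372, 2398
n_total = n_movies + n_tv_shows

# Observed share of movies and the null value being tested against.
p_hat = n_movies / n_total
p0 = 0.5

# Standard error under the null hypothesis (assumption: null variance is used).
se0 = math.sqrt(p0 * (1 - p0) / n_total)
z = (p_hat - p0) / se0

# One-sided p-value, P(Z >= z), from the standard normal survival function.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"movie share = {p_hat:.4f}")
print(f"z = {z:.2f}, one-sided p-value = {p_value:.3g}")
```

With a share this far from 0.5 and several thousand titles, the z statistic is very large, so the conclusion does not hinge on which variant of the test is used.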

## ***6. Feature Engineering & Data Pre-processing***

**Data Cleaning**

In [None]:
# Missing Data %
round(df.isna().sum()/len(df)*100, 2).sort_values(ascending=False)

In [None]:
# Handling Missing Values & Missing Value Imputation
df[['director','cast','country']] = df[['director','cast','country']].fillna(' ')
df.dropna(axis=0, inplace=True)

In [None]:
# checking for null values after treating them.
df.isna().sum()

In [None]:
# boxplot for outlier detection
sns.boxplot(data=df)

In [None]:
# Handling Outliers & Outlier treatments

# plotting graph
fig,ax = plt.subplots(1,2, figsize=(15,5))

# Display boxplot and dist plot.
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(x=df['release_year'], kde=True, ax=ax[0])
sns.boxplot(data=df, ax=ax[1])


1. Except for the release year, almost all of the data is in text format.

2. The text columns contain the information we need to build the clustering model. Therefore, there is no need to handle outliers.

**6.Textual Data Preprocessing**

**What is textual data preprocessing?**

Textual data preprocessing is the process of preparing text data for analysis or modeling. It includes a series of steps that are applied to raw text data in order to clean, organize and standardize it so that it can be easily analyzed or used as input for natural language processing or machine learning models. The preprocessing steps typically include tokenization, stop-word removal, stemming or lemmatization, lowercasing, removing punctuation, and removing numbers. The goal of textual data preprocessing is to prepare the data for further analysis and modeling by removing irrelevant information and standardizing the format of the text. This can help improve the accuracy and effectiveness of the analysis or modeling.

**Modeling Approach**

Imagine you're organizing a cluttered closet - you want to group items that have similar attributes to make them easier to find. Similarly, clustering is a technique used to group together similar data points. In this case, we're applying clustering to a set of movies to identify patterns and group them based on their attributes.

Before clustering, we need to prepare the textual data. Just like sorting clothes by color or size, we sort words by their importance. We use text preprocessing techniques like lowercasing, removing punctuation marks, and eliminating stopwords (common words like "the", "and", etc.) that don't add much meaning. Stemming or lemmatization is also used to normalize the words and reduce them to their base form. Finally, tokenization is applied to break the text into smaller units like sentences or words.
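
The preprocessing steps described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: `STOP_WORDS` is a tiny hypothetical list (the notebook uses NLTK's English stop words), `preprocess` is a made-up helper name, and stemming is left out to keep the example dependency-free.

```python
import string

# Tiny illustrative stop-word list (an assumption; NLTK's list is much longer).
STOP_WORDS = {"a", "an", "the", "and", "is", "in", "of", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize: split the cleaned text into words.
    tokens = text.split()
    # Keep only the words that are not stop words.
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "The detective, haunted by the past, returns to the city!"
tokens = preprocess(sample)
print(tokens)
# ['detective', 'haunted', 'by', 'past', 'returns', 'city']
```

In this sketch punctuation is stripped before stop words are removed, so a stop word followed by a comma is still recognized; that ordering is a design choice of the example.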

Now that we've tidied up the data, we can start clustering. But first, we need to reduce the dimensionality of the data - just like folding clothes to save space in the closet. Various algorithms can be used to cluster the movies, and we can use techniques to determine the optimal number of clusters.

Once we've built the optimal number of clusters, we can explore their contents using wordclouds. Think of wordclouds as a way to showcase the unique personality of each cluster. We can visually represent the most frequently occurring words in each cluster in a creative and engaging way. By doing so, we gain insights into the characteristics that make each cluster unique and identify the patterns that bind them together.

**Selecting Attributes**

In [None]:
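# creating a tags column from all text columns; it is used for model building.
# the space separator keeps the last word of one column from fusing with the first word of the next.
df['tags'] = (df['description'] + ' ' + df['listed_in'] + ' ' + df['rating'] + ' ' +
              df['cast'] + ' ' + df['country'] + ' ' + df['director'])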

In [None]:
df.tags[0]

We have successfully added all the necessary data into a single column.

**Removing Stopwords and Lower Casing**

Removing stop words and lowercasing words are common pre-processing steps in natural language processing (NLP) tasks.

Stop words are words that are commonly used in a language but do not convey much meaning on their own, such as "a," "an," "the," and "is." These words can add noise to the data and can sometimes affect the performance of NLP models, so they are often removed as a pre-processing step.

Lowercasing words is the process of converting all the words in a text to lowercase. This is a common pre-processing step in NLP tasks, as it can be useful for a few reasons:

1. Case differences can be ignored: By lowercasing the words, you can treat words with different capitalization as the same word, which can be useful in tasks such as information retrieval or text classification where case differences are not important.
2. Vocabulary size is reduced: Lowercasing the words also reduces the size of the vocabulary, which can make larger texts easier to work with.

In [None]:
# download the stop words list if it is not already downloaded
nltk.download('stopwords')

# create a list of English stop words
stop_words = stopwords.words('english')

# displaying stopwords
np.array(stop_words)

In [None]:
# named remove_stopwords so that it does not shadow the nltk `stopwords` corpus used above
def remove_stopwords(text):
    '''a function for removing stop words and lowercasing each word'''
    text = [word.lower() for word in text.split() if word.lower() not in stop_words]
    # joining the list of words with space separator
    return " ".join(text)


In [None]:
# applying remove_stopwords function.
df['tags'] = df['tags'].apply(remove_stopwords)

In [None]:
df.tags[0]

We have successfully removed all the stopwords and converted the corpus to lowercase.

**Removing Punctuations**

Removing punctuation is a common pre-processing step in natural language processing (NLP) tasks. Punctuation marks like periods, commas, and exclamation points can add noise to the data and can sometimes be treated as separate tokens, which can affect the performance of NLP models.

In [None]:
# function to remove punctuations

def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space, which in effect deletes the punctuation marks.
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)
     

In [None]:
# applying remove_punctuation function
df['tags'] = df['tags'].apply(remove_punctuation)

In [None]:
df.tags[0]

We have successfully removed all the punctuation marks from the corpus.

**Stemming**

We use SnowballStemmer to reduce each word in the corpus to its stem.

Stemming is the process of reducing a word to its base or root form. This is a common pre-processing step in natural language processing (NLP) tasks, as it allows you to treat different inflected forms of a word as the same word, which can be useful for tasks like information retrieval or text classification.

For example, the words "run," "runs," and "running" are all inflected forms of the same word, and a stemmer reduces them all to the stem "run." Irregular forms such as "ran" are generally left unchanged by a rule-based stemmer; mapping them to "run" requires lemmatization. Stems are also not guaranteed to be dictionary words.

In [None]:
from nltk.stem.snowball import SnowballStemmer

# create an object of stemming function
stemmer = SnowballStemmer("english")

def stemming(text):    
    '''a function which stems each word in the given text'''
    text = [stemmer.stem(word) for word in text.split()]
    return " ".join(text)

In [None]:
# applying stemming function
df['tags'] = df['tags'].apply(stemming)

In [None]:
df.tags[0]

We have stemmed the corpus.

**Text Vectorization**

Word/Text vectorization is the process of representing text as numerical vectors. This is important in NLP tasks because most machine learning models expect numerical input and cannot work with raw text data directly. Here we use TF-IDF (term frequency-inverse document frequency), which weights each word by how often it appears in a title's tags and by how rare it is across the whole corpus. TF-IDF is a bag-of-words representation: it ignores word order and context, but it lets us measure the similarity between documents by comparing their vectors.
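
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three made-up documents. It uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is my understanding of scikit-learn's default behaviour; treat it as an illustration rather than a reimplementation of `TfidfVectorizer`. The names `docs`, `vocab`, and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Three tiny made-up documents, already lowercased and cleaned.
docs = [
    "crime drama detective city",
    "romantic comedy city",
    "crime thriller detective",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

vectors = [tfidf_vector(doc) for doc in tokenized]

for tok in vocab:
    print(f"{tok:10s} df={doc_freq[tok]} idf={idf[tok]:.3f}")
```

A term that appears in only one document, such as "romantic", receives a higher IDF than one shared by several documents, such as "city".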

In [None]:
# create the TF-IDF vectorizer object
tfidf = TfidfVectorizer(stop_words='english', lowercase=False, max_features = 10000)   # max features = 10000 to prevent system from crashing

# fit the vectorizer using the text data
tfidf.fit(df['tags'])

# collect the vocabulary items used in the vectorizer
dictionary = tfidf.vocabulary_.items()

In [None]:
print(len(dictionary)) # number of independent features created from the "tags" column ---> max_features=10000

In [None]:
# convert vector into array form for clustering
vector = tfidf.transform(df['tags']).toarray()

# summarize encoded vector
print(vector)
print(f'shape of the vector : {vector.shape}')
print(f'datatype : {type(vector)}')

**Dimensionality Reduction**

We can use PCA (Principal Component Analysis) to reduce the dimensionality of data.

Dimensionality reduction is the process of reducing the number of features or dimensions in a dataset while preserving as much information as possible. It is a common step in machine learning and data analysis, as high-dimensional datasets can be difficult to work with and can sometimes suffer from the curse of dimensionality.

In [None]:
# using PCA to reduce dimensionality
pca = PCA(random_state=42)
pca.fit(vector)

In [None]:
# Explained variance for different number of components
plt.figure(figsize=(10,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title('PCA - cumulative explained variance vs number of components')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.axhline(y= 0.8, color='red', linestyle='--')
plt.axvline(x= 3000, color='green', linestyle='--')
plt.show()

We find that 100% of the variance is explained by about 7500 components.

Also, more than 80% of the variance is explained by just 3000 components.

Hence, to simplify the model and reduce dimensionality, we can take the top 3000 components, which still capture more than 80% of the variance.
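
Instead of reading the cut-off from the plot, the smallest number of components that reaches a target variance can be computed directly. The sketch below does this with NumPy on synthetic data; the array `data`, its shape, and the 0.80 threshold are illustrative assumptions, not the notebook's TF-IDF matrix.

```python
import numpy as np

# Synthetic data: 200 samples, 50 features with steadily decreasing scale.
rng = np.random.default_rng(42)
scales = np.linspace(5.0, 0.1, 50)
data = rng.normal(size=(200, 50)) * scales

# PCA via SVD of the centered data: squared singular values are
# proportional to the variance captured by each component.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.80
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(f"{n_components} components explain "
      f"{cumulative[n_components - 1]:.1%} of the variance")
```

With scikit-learn, the same cumulative curve is available as `np.cumsum(pca.explained_variance_ratio_)` after fitting, so the `searchsorted` step applies unchanged to a fitted `PCA` object.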

In [None]:
# reducing the dimensions to 3000 using pca
pca = PCA(n_components=3000, random_state=42)
pca.fit(vector)

In [None]:
# transformed features
X = pca.transform(vector)

# shape of transformed vectors
X.shape

# **Clusters implementation**

# **1. K-Means Clustering**

K-means clustering is an unsupervised machine learning algorithm that is used to divide a dataset into a specified number of clusters. It is called "unsupervised" because the algorithm does not use any labeled examples to learn about the data. Instead, it relies on the inherent structure of the data to group the samples into clusters.

How does it work?

1. The k-means algorithm first selects k initial "centroids", or cluster centers, at random from the data.
2. It assigns each sample in the dataset to the nearest centroid, based on a distance metric such as Euclidean distance.
3. It updates each centroid to be the mean of the samples assigned to it.
4. It iteratively repeats the reassignment and update steps until convergence, i.e. until the assignments no longer change.
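
The steps above can be sketched in a few lines of plain Python. This is an illustration only, not the scikit-learn implementation: the toy points are hypothetical, and the first k points are used as initial centroids to keep the result deterministic, whereas the notebook uses `init='k-means++'`.

```python
import math

def kmeans(points, k, n_iterations=100):
    """Plain Lloyd's algorithm on a list of coordinate tuples."""
    # Simplification: take the first k points as the initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = []
    for _ in range(n_iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: no assignment changed
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this toy data the loop converges after one update, with the points near (1, 1) in one cluster and the points near (8.5, 9) in the other.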

Next, we visualize the elbow curve and the silhouette score to decide on the optimal number of clusters for the K-means clustering algorithm.

In [None]:
'''Elbow method to find the optimal value of k'''

# Initialize a list to store the sum of squared errors for each value of k
SSE = []

for k in range(1, 16):
  # Initialize the k-means model with the current value of k
  kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
  # Fit the model to the data
  kmeans.fit(X)
  # Compute the sum of squared errors for the model
  SSE.append(kmeans.inertia_)

# Plot the values of SSE
plt.plot(range(1, 16), SSE)
plt.title('The Elbow Method - KMeans clustering')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared errors')
plt.show()

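The sum of squared distances between each point and its cluster centroid always decreases as the number of clusters increases, so the curve alone cannot pick k; the "elbow" is the point beyond which adding clusters yields only a small further decrease.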
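
For reference, the quantity plotted on the y-axis (`kmeans.inertia_`) can be reproduced by hand. The sketch below uses hypothetical 1-D points and a hypothetical helper name, `inertia`, to show how the value drops when a second cluster is added.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[label]))
        for p, label in zip(points, labels)
    )

# Hypothetical 1-D data with two visible groups.
points = [(1.0,), (2.0,), (9.0,), (10.0,)]

# k = 1: a single centroid at the overall mean.
sse_k1 = inertia(points, [0, 0, 0, 0], [(5.5,)])
# k = 2: one centroid per group.
sse_k2 = inertia(points, [0, 0, 1, 1], [(1.5,), (9.5,)])
print(sse_k1, sse_k2)  # 65.0 1.0
```

The large drop from k = 1 to k = 2 is what an elbow looks like; further clusters on this data could only reduce the remaining error of 1.0.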

In [None]:
'''Silhouette score method to find the optimal value of k'''

# Initialize a list to store the silhouette score for each value of k
silhouette_avg = []

for k in range(2, 16):
  # Initialize the k-means model with the current value of k
  kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
  # Fit the model to the data
  kmeans.fit(X)
  # Predict the cluster labels for each point in the data
  labels = kmeans.labels_
  # Compute the silhouette score for the model
  score = silhouette_score(X, labels)
  silhouette_avg.append(score)
  
# Plot the Silhouette analysis
plt.plot(range(2,16), silhouette_avg)
plt.xlabel('Number of clusters') 
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k - KMeans clustering')

The highest Silhouette score is obtained for 5 clusters.
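
As a reminder of what the silhouette score measures, the sketch below computes it by hand for hypothetical 1-D points. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the mean distance to the nearest other cluster; the score is `(b - a) / max(a, b)`, averaged over all points. The helper is an illustration of the standard definition, not the scikit-learn code.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points."""
    def mean_distance(x, others):
        return sum(abs(x - y) for y in others) / len(others)

    scores = []
    for i, (x, label) in enumerate(zip(points, labels)):
        same = [
            y for j, (y, lab) in enumerate(zip(points, labels))
            if lab == label and j != i
        ]
        if not same:  # singleton cluster: the score is defined as 0
            scores.append(0.0)
            continue
        a = mean_distance(x, same)  # cohesion within the point's own cluster
        b = min(                    # separation from the nearest other cluster
            mean_distance(x, [y for y, lab in zip(points, labels) if lab == other])
            for other in set(labels) if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [1.0, 2.0, 9.0, 10.0]
good = silhouette(points, [0, 0, 1, 1])  # groups match the data
bad = silhouette(points, [0, 1, 0, 1])   # groups cut across the data
print(good, bad)
```

A value close to 1 means compact, well-separated clusters, while a negative value, as for the second labelling, means points sit closer to another cluster than to their own.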

Building 5 clusters using the k-means clustering algorithm:

In [None]:
# Clustering the data into 5 clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=33)
kmeans.fit(X)

In [None]:
# Evaluation metrics - distortion, Silhouette score
kmeans_distortion = kmeans.inertia_
kmeans_silhouette_score = silhouette_score(X, kmeans.labels_)

print((kmeans_distortion, kmeans_silhouette_score))

In [None]:
# Adding a kmeans cluster number attribute
df['kmeans_cluster'] = kmeans.labels_

In [None]:
df.sample(5)[['type', 'title', 'director', 'cast', 'country', 'rating', 'listed_in', 'description', 'kmeans_cluster']]

In [None]:
# Number of movies and tv shows in each cluster
plt.figure(figsize=(8,5))
graph = sns.countplot(x='kmeans_cluster',data=df, hue='type')
plt.title('Number of movies and TV shows in each cluster - Kmeans Clustering')

# adding value count on the top of bar
for p in graph.patches:
   graph.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

Successfully built 5 clusters using the k-means clustering algorithm.

**Building wordclouds for different clusters in K-Means Clustering**

In [None]:
def kmeans_worldcloud(cluster_number, column_name):
  
  '''function for Building a wordcloud for the movie/shows'''

  df_wordcloud = df[['kmeans_cluster',column_name]].dropna()
  df_wordcloud = df_wordcloud[df_wordcloud['kmeans_cluster']==cluster_number]
  
  # text documents
  text = " ".join(word for word in df_wordcloud[column_name])

  # create the word cloud
  wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="white").generate(text)
  
  # Generate a word cloud image
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off")
  plt.show()

**Word Cloud on "description" column for different cluster**

In [None]:
for i in range(5):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'description')

In [None]:
for i in range(5):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'cast')

In [None]:
for i in range(5):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'director')

In [None]:
for i in range(5):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'listed_in')

In [None]:
for i in range(5):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'country')

In [None]:
for i in range(5):
  print(f'cluster {i}')
  kmeans_worldcloud(i,'title')

# **2. Hierarchical Clustering**

After applying the agglomerative hierarchical clustering algorithm, the resulting clusters are displayed in a dendrogram, which is a tree-like structure. The dendrogram shows the relationships between the clusters at each level of the hierarchy.

To determine the optimal number of clusters for our data, we can visually inspect the dendrogram and look for the longest vertical segment that is not crossed by any horizontal merge line. A long uninterrupted segment corresponds to a large jump in merge distance, meaning that the clusters joined at the top of it are far apart from each other.

We can then draw a horizontal line at this distance and count the number of vertical lines it intersects. This number corresponds to the optimal number of clusters for our data.
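
The visual rule can also be expressed as a small heuristic. The sketch below assumes the merge distances are available as an ascending list (in SciPy they are the third column of the linkage matrix); the helper name `clusters_from_largest_gap` and the heights are hypothetical.

```python
def clusters_from_largest_gap(merge_heights):
    """Place the cut in the widest gap between consecutive merge heights.

    merge_heights: the n-1 merge distances of n samples, in ascending order.
    Returns (cut_height, n_clusters).
    """
    n_samples = len(merge_heights) + 1
    gaps = [hi - lo for lo, hi in zip(merge_heights, merge_heights[1:])]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    cut_height = (merge_heights[i] + merge_heights[i + 1]) / 2
    # Merges up to index i lie below the cut; each merge removes one cluster.
    n_clusters = n_samples - (i + 1)
    return cut_height, n_clusters

# Hypothetical merge heights for 6 samples.
heights = [0.5, 0.7, 1.0, 4.0, 4.5]
print(clusters_from_largest_gap(heights))  # (2.5, 3)
```

The widest gap is only a rule of thumb, so the resulting number of clusters is best checked against the dendrogram itself.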

In [None]:
# Building a dendrogram to decide the number of clusters
plt.figure(figsize=(10, 7))  
dend = shc.dendrogram(shc.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Netflix Shows')
plt.ylabel('Distance')
plt.axhline(y= 4, color='r', linestyle='--')
     

**At a distance of 4 units, 7 clusters can be built using the agglomerative clustering algorithm.**

Building 7 clusters using the Agglomerative clustering algorithm:

In [None]:
# Fitting hierarchical clustering model
# ('affinity' was removed in newer scikit-learn releases; Ward linkage always uses Euclidean distance)
hierarchical = AgglomerativeClustering(n_clusters=7, linkage='ward')
hierarchical.fit_predict(X)
     

In [None]:
# Adding a hierarchical cluster number attribute
df['hierarchical_cluster'] = hierarchical.labels_
     

In [None]:
df.sample(5)[['type', 'title', 'director', 'cast', 'country', 'rating', 'listed_in', 'description', 'hierarchical_cluster']]

In [None]:
# Number of movies and tv shows in each cluster
plt.figure(figsize=(10,5))
graph = sns.countplot(x='hierarchical_cluster',data=df, hue='type')
plt.title('Number of movies and tv shows in each cluster - Hierarchical Clustering')

# adding value count on the top of bar
for p in graph.patches:
   graph.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

Successfully built 7 clusters using the Agglomerative (hierarchical) clustering algorithm.

**Building wordclouds for different clusters in hierarchical Clustering**

In [None]:
def hierarchical_worldcloud(cluster_number, column_name):
  
  '''function for Building a wordcloud for the movie/shows'''

  df_wordcloud = df[['hierarchical_cluster',column_name]].dropna()
  df_wordcloud = df_wordcloud[df_wordcloud['hierarchical_cluster']==cluster_number]
  
  # text documents
  text = " ".join(word for word in df_wordcloud[column_name])

  # create the word cloud
  wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="white").generate(text)
  
  # Generate a word cloud image
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off")
  plt.show()

**Word Cloud on "title" column for different cluster**

In [None]:
for i in range(7):
  print(f'cluster {i}')
  hierarchical_worldcloud(i,'title')

**Word Cloud on "description" column for different cluster**

In [None]:
for i in range(7):
  print(f'cluster {i}')
  hierarchical_worldcloud(i,'description')

**Word Cloud on "cast" column for different cluster**

In [None]:
for i in range(7):
  print(f'cluster {i}')
  hierarchical_worldcloud(i,'cast')

**Word Cloud on "country" column for different cluster**

In [None]:
for i in range(7):
  print(f'cluster {i}')
  hierarchical_worldcloud(i,'country')

**Word Cloud on "listed_in (genre)" column for different cluster**

In [None]:
for i in range(7):
  print(f'cluster {i}')
  hierarchical_worldcloud(i,'listed_in')

**Content Based Recommendation System**

Content-based recommendation systems recommend items to a user by using the similarity of items. This recommender system recommends products or items based on their description or features. It identifies the similarity between the products based on their descriptions.

It takes note of the items a particular user likes and recommends other items whose features are similar to them. Unlike collaborative filtering, it does not rely on the behavior of other users.

We can build a simple content-based recommender system based on the similarity of the movies/shows.
If a person has watched a show on Netflix, the recommender system should be able to recommend a list of similar shows that s/he may like.
To get the similarity score of the shows, we can use cosine similarity.
The cosine similarity between two vectors (A and B) is calculated by taking the dot product of the two vectors and dividing it by the product of their magnitudes. In other words, the cosine similarity score of two vectors increases as the angle between them decreases.
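
A minimal sketch of this calculation, and of ranking items by it, is shown below. The feature vectors and titles are hypothetical; in the notebook the vectors are the PCA-reduced TF-IDF features and the scores come from scikit-learn's `cosine_similarity`.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical feature vectors for four titles.
features = {
    "Show A": [1.0, 0.0, 1.0],
    "Show B": [2.0, 0.0, 2.0],  # same direction as Show A
    "Show C": [0.0, 1.0, 0.0],  # orthogonal to Show A
    "Show D": [1.0, 1.0, 0.0],
}

def most_similar(title, top_n=2):
    """Rank every other title by cosine similarity to `title`."""
    scores = [
        (other, cosine_similarity(features[title], vec))
        for other, vec in features.items() if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

print(most_similar("Show A"))
```

"Show B" ranks first even though its vector is twice as long, because cosine similarity depends only on the angle between the vectors and not on their magnitudes.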

In [None]:
# verifying the index
df[['show_id', 'title', 'tags']]

1. The dataframe under consideration has a total of 7770 rows. However, because some rows containing null values were removed, the index labels are no longer consecutive and the last label shown is 7786.

2. To build a content-based recommendation system, we look up the row of a title by its index and read the similarity scores computed from the corresponding "tags" column.

3. The similarity matrix is addressed by row position (0 to 7769), so if the index is not reset, an old index label can point to a different title or fall outside the matrix, leading to incorrect recommendations.

4. Resetting the index assigns a new sequential index to each row of the dataframe, starting from 0, so that index labels and row positions match.

5. Therefore, resetting the index is an essential step before computing recommendations.
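
The mismatch described above can be illustrated without pandas. In this sketch the row labels and titles are hypothetical: one row was dropped during cleaning, so the remaining labels have a gap while the similarity matrix only knows row positions.

```python
# Hypothetical titles with their original row labels; label 2 was dropped
# during cleaning, so the labels are no longer consecutive.
rows = [(0, "Title A"), (1, "Title B"), (3, "Title C"), (4, "Title D")]

# A similarity matrix built from these rows only knows positions 0..3.
n_rows = len(rows)

label_of = {title: label for label, title in rows}
position_of = {title: pos for pos, (label, title) in enumerate(rows)}

for title in ("Title C", "Title D"):
    label, pos = label_of[title], position_of[title]
    status = "valid row" if label < n_rows else "out of range"
    print(f"{title}: old label {label} ({status}), position {pos}")

# Using the old label as a row number picks the wrong title for "Title C".
wrong_title = rows[label_of["Title C"]][1]
print(wrong_title)  # Title D
```

After a reset, label and position coincide for every row, which is what the recommender relies on.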

In [None]:
# defining a new dataframe for building the recommendation system
recommender_df = df.copy()

# resetting the index
recommender_df.reset_index(inplace=True)

# checking whether the index was reset properly
recommender_df[['show_id', 'title', 'tags']]

As the dataframe above shows, the index was reset successfully. The dataset is now ready for building the content-based recommendation system.

In [None]:
# dropping show-id and index column
recommender_df.drop(columns=['index', 'show_id'], inplace=True)

In [None]:
print(f"before reset index id for movie 'Zero' : {df[df['title'] == 'Zero'].index[0]}")  # index[0] --> to locate index position
print(f"after reset index id for movie 'Zero': {recommender_df[recommender_df['title'] == 'Zero'].index[0]}")

In [None]:
# display the transformed feature array created from the 'tags' column after applying PCA for dimensionality reduction
X

In [None]:
# calculate cosine similarity
similarity = cosine_similarity(X)
similarity

**Function to list the top 10 recommended titles based on cosine similarity score**

In [None]:
def recommend(movie):
    '''
    Print the ten titles most similar to the given title, ranked by cosine similarity score.
    '''
    print(f"If you liked '{movie}', you may also enjoy: \n")

    # find the row position of the title in the re-indexed dataframe
    index = recommender_df[recommender_df['title'] == movie].index[0]

    # sort all titles by their similarity score to the given title, highest first
    distances = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda x:x[1])

    # list the top ten recommended titles (the first entry is normally the title itself, so skip it)
    for i in distances[1:11]:
        print(recommender_df.iloc[i[0]]['title'])

In [None]:
recommend('Naruto')

In [None]:
recommend('Our Planet')

In [None]:
recommend('Phir Hera Pheri')

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Welcome to our exciting journey of exploring the world of Netflix shows! Our goal was to cluster the shows into groups based on their similarities and differences, and then to build a content-based recommender system that suggests 10 shows similar to a title the user has watched.

With 7787 records and 11 attributes, we began our adventure by delving into the dataset's missing values and performing exploratory data analysis (EDA). Our findings revealed that Netflix boasts more movies than TV shows, with a rapidly growing collection of shows from the United States.

To cluster the shows, we focused on six key attributes: director, cast, country, genre, rating, and description. We transformed these attributes into a 10000-feature TF-IDF vectorization, then used Principal Component Analysis (PCA) to tackle the curse of dimensionality. By reducing the components to 3000, we were still able to capture more than 80% of the variance.

Next, we used two clustering algorithms, K-Means and Agglomerative clustering, to group the shows. For K-Means, we chose 5 clusters, guided by the elbow curve and the Silhouette score analysis. For Agglomerative clustering, the dendrogram suggested 7 clusters.

But we didn't stop there. We then created a content-based recommender system using the similarity matrix obtained through cosine similarity. Given a show the user has watched, this system returns the 10 most similar titles to explore.

Join us in discovering the diverse world of Netflix shows, and let our recommender system guide you to your next binge-worthy obsession.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***