# **Project Name**    -**NETFLIX MOVIES AND TV SHOWS CLUSTERING**





##### **Project Type**    - EDA/Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**


The objective of this project is to analyze and cluster a dataset related to Netflix. **The dataset consists of various attributes associated with Netflix shows and movies, such as title, genre, release year, duration, rating, and others. The aim is to explore patterns and similarities among the content available on the platform and group them into meaningful clusters**.

To begin with, the dataset will be preprocessed by handling missing values, removing irrelevant columns, and transforming categorical variables into numerical representations. Feature engineering techniques may also be applied to extract useful information from the existing attributes.

Next, exploratory data analysis (EDA) techniques will be utilized to gain insights into the dataset. **Visualizations and statistical summaries will be used to understand the distribution of variables, identify any trends, and explore relationships between different features**.

Once the dataset has been thoroughly analyzed, clustering algorithms such as k-means, hierarchical clustering, or density-based spatial clustering will be employed. These algorithms will group similar Netflix shows and movies together based on their attributes. **The optimal number of clusters will be determined using techniques like the elbow method or silhouette analysis.**
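The silhouette analysis mentioned above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed. The synthetic blobs are a hypothetical stand-in for the Netflix feature matrix, and the candidate range of 2 to 6 clusters is an arbitrary choice for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the real feature matrix: three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is the preferred choice.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

A higher silhouette score means points sit closer to their own cluster than to neighbouring ones, so the score can be compared across values of k without knowing the true labels.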

After the clustering process, the results will be evaluated and interpreted. **The clusters will be analyzed to understand the common characteristics and patterns within each group. This analysis will provide valuable information for Netflix in terms of content categorization, recommendation systems, and content acquisition strategies**.

Finally, the findings and insights from the clustering analysis will be summarized and presented in a clear and concise manner. Visualizations, charts, and graphs will be used to effectively communicate the outcomes of the project. **Recommendations may also be provided based on the identified clusters, suggesting potential improvements or strategies for Netflix** to enhance user experience and content offerings.

**In conclusion, this project aims to analyze a Netflix dataset, perform clustering techniques to group similar shows and movies together, and provide insights and recommendations based on the clustering results. The project will contribute to a better understanding of Netflix's content landscape and aid in decision-making processes for the company**.




---



# **GitHub Link -** https://github.com/Akash19111997




# **Problem Statement**


This dataset consists of TV shows and movies available on Netflix as of 2018. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.




# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import libraries
!pip install -U kaleido

from datetime import datetime
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from matplotlib.pyplot import figure
import seaborn as sns
import missingno as msno
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.io as pio
import plotly.offline as py
from plotly.subplots import make_subplots


### Dataset Loading

In [None]:
from google.colab import files
uploaded = files.upload()


In [None]:
netflix_data = pd.read_csv('/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
netflix_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
netflix_data.shape

### Dataset Information

In [None]:
# Dataset Info
netflix_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
netflix_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
netflix_data.isnull().sum().sum()

In [None]:
# Missing Values/Null Values Count
netflix_data.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10,6))
sns.heatmap(netflix_data.isnull(), cbar = False,cmap = 'viridis',yticklabels = False )
plt.title('missing value')
plt.xlabel('column value')
plt.ylabel('row value')
plt.show()

### What did you know about your dataset?

*Answer*

The Netflix Movies and TV Shows dataset has no target labels, which makes it suited to unsupervised learning. It consists of 7,787 rows and 12 columns. Several columns contain missing values: the director column has 3671 missing entries, cast has 718, country has 507, date_added has 10, and rating has 7.
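The per-column missing counts can be turned into a small summary table with percentages. This is a sketch on a hypothetical toy frame (`toy` and its values are invented for illustration); the same calls should work on `netflix_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for netflix_data.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["TV-MA", "PG", np.nan, "TV-14"],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
}).sort_values("missing", ascending=False)
print(summary)
```

Sorting by the missing count puts the columns that need the most attention at the top.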


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
netflix_data.columns

In [None]:
# Dataset Describe
netflix_data.describe(include='all')

### Variables Description

Answer Here

**Attribute Information**

**show_id :** Unique ID for every Movie / TV Show

**type :** Identifier - A Movie or TV Show

**title :** Title of the Movie / TV Show

**director :** Director of the Movie

**cast :** Actors involved in the movie / show

**country :** Country where the movie / show was produced

**date_added :** Date it was added on Netflix

**release_year :** Actual release year of the movie / show

**rating :** TV Rating of the movie / show

**duration :** Total Duration - in minutes or number of seasons

**listed_in :** Genre

**description :** The Summary description




### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
netflix_data.nunique()

In [None]:
# Check Unique Values for each variable.
netflix_data.apply(lambda col: col.unique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# First remove surrounding whitespace in the date_added column
netflix_data['date_added'] = netflix_data['date_added'].astype(str).str.strip()

In [None]:
# Write your code to make your dataset analysis ready.
# Convert date_added to datetime; values that cannot be parsed become NaT
netflix_data['date_added'] = pd.to_datetime(netflix_data['date_added'], errors='coerce')

# Fill missing dates with the most frequent date (mode)
most_common_date = netflix_data['date_added'].mode()[0]
netflix_data['date_added'] = netflix_data['date_added'].fillna(most_common_date)

# Derive separate year, month, and day features from date_added
netflix_data['year_added'] = netflix_data['date_added'].dt.year
netflix_data['month_added'] = netflix_data['date_added'].dt.month
netflix_data['day_added'] = netflix_data['date_added'].dt.day

In [None]:
#removing the duplicates
netflix_data.drop_duplicates(inplace=True)

In [None]:
# Fill missing director values with a placeholder
netflix_data.fillna({'director' :'not avilable'}, inplace = True)

In [None]:
# Fill missing cast values with a placeholder
netflix_data.fillna({'cast': 'no_cast'}, inplace= True)

In [None]:
# date_added was already filled with its mode above; confirm no missing dates remain
netflix_data['date_added'].isnull().sum()

In [None]:
# Fill missing values in 'country' with the most frequent value (mode)
netflix_data['country'] = netflix_data['country'].fillna(netflix_data['country'].mode()[0])


In [None]:
# Fill missing rating values with a placeholder
netflix_data.fillna({'rating':'not_avilaable'},inplace=True)


In [None]:
netflix_data['type'].value_counts()

In [None]:
# Count the number of titles per country
netflix_data['country'].value_counts()

In [None]:
# Confirm that no missing values remain
netflix_data.isnull().sum()

### What all manipulations have you done and insights you found?

Answer Here.

In this dataset, a total of 3837 null values were found. We filled the null values in the director, cast, country, rating, and date_added columns. Additionally, we converted the date_added column to datetime and split it into year, month, and day features.
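The date handling described above can be illustrated on a few hand-written values. This is a sketch only: the sample strings and the explicit `format="%B %d, %Y"` are assumptions about how the dates look, not something confirmed by the dataset.

```python
import pandas as pd

# Hypothetical sample of raw date strings, including stray whitespace and a missing value.
toy = pd.DataFrame({
    "date_added": [" August 14, 2020", "January 1, 2019 ", None, "January 1, 2019"]
})

# Strip whitespace, then parse; anything unparseable (or missing) becomes NaT.
cleaned = toy["date_added"].str.strip()
parsed = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")

# Fill missing dates with the most frequent date, then derive the parts.
parsed = parsed.fillna(parsed.mode()[0])
toy["year_added"] = parsed.dt.year
toy["month_added"] = parsed.dt.month
toy["day_added"] = parsed.dt.day
print(toy)
```

Filling with the mode keeps every row, at the cost of slightly inflating the count for the most common date; dropping the few affected rows would be an equally reasonable choice.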

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import plotly.graph_objects as go

# Look up counts by label rather than by position so labels and values cannot get swapped
type_counts = netflix_data['type'].value_counts()
labels = ['TV Show', 'Movie']
values = [type_counts['TV Show'], type_counts['Movie']]

# Colors
colors = ['#FF0000', '#000000']

# Create donut-style pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.6)])

# Customize layout
fig.update_layout(
    title_text='Type of Content on Netflix',
    title_x=0.5,
    height=500,
    width=500,
    legend=dict(x=0.9),
    annotations=[dict(text='Type of Content on Netflix', font_size=12, showarrow=False)]
)

# Set colors
fig.update_traces(marker=dict(colors=colors))
fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

The specific chart used in the code is a pie chart. I picked this chart because it is effective in visualizing the distribution of categorical data. In this case, the chart represents the types of content available on Netflix, which are categorized as "TV Show" and "Movie."


##### 2. What is/are the insight(s) found from the chart?

Answer Here

Movies constitute the majority, accounting for 69.1% of the titles on Netflix, while TV shows make up the remaining 30.9%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The catalog is weighted toward movies, which make up 69.1% of titles compared with 30.9% for TV shows. Because the dataset records what is available rather than what is watched, this reflects the composition of Netflix's catalog and does not by itself show which format viewers prefer.
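The percentages behind a chart like this can be checked directly by label, which avoids any mix-up between the two categories. This is a sketch on a hypothetical four-row sample, not the real counts.

```python
import pandas as pd

# Hypothetical sample of the "type" column.
toy = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns fractions; multiply by 100 for percentages.
share = (toy.value_counts(normalize=True) * 100).round(1)
print(share)
```

Reading the result by label (`share["Movie"]`, `share["TV Show"]`) makes it explicit which category each percentage belongs to.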

#### Chart - 2

In [None]:
# Chart - 2 visualization code
tv_show = netflix_data[netflix_data["type"] == "TV Show"]
movie = netflix_data[netflix_data["type"] == "Movie"]

content_1 = tv_show["year_added"].value_counts().sort_index()
content_2 = movie["year_added"].value_counts().sort_index()

trace1 = go.Scatter(x=content_1.index, y=content_1.values, name="TV Shows", marker=dict(color='#008000', line=dict(width=4)))
trace2 = go.Scatter(x=content_2.index, y=content_2.values, name="Movies", marker=dict(color='#ffd700', line=dict(width=4)))

fig = go.Figure(data=[trace1, trace2], layout=go.Layout(title="Content added over the years",title_x=0.5, legend=dict(x=0.8, y=1.1, orientation="h")))
# Display chart
fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

The line chart is suitable for showing the trend and distribution of data over a continuous axis (in this case, the years). It allows for easy comparison between the two categories (TV shows and movies) and how their counts vary over time.



##### 2. What is/are the insight(s) found from the chart?

Answer Here

The trend in the visualization indicates that between 2008 and 2015, relatively few TV shows and movies were added to Netflix. From 2016 onward, content additions began to increase. In 2019, the number of movies added peaked sharply, while TV shows followed a similar trend but with a smaller increase than movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The insights suggest a positive outlook for Netflix: the number of TV shows and movies added to the platform grew rapidly over the years. The chart shows catalog growth rather than viewer demand directly, but a larger catalog gives Netflix an opportunity to provide more high-quality content to its users, thereby enhancing user satisfaction and engagement.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Count titles added per month and store the result in "month" and "count" columns
months_df = (
    netflix_data['month_added']
    .value_counts()
    .rename_axis('month')
    .reset_index(name='count')
)

In [None]:
months_df.columns

In [None]:
fig = px.bar(months_df, x="month", y="count", text_auto=True, color='count', color_continuous_scale=['#0000FF', '#FFFF00'])
fig.update_layout(
    title={
        'text': 'Month wise Addition of Movies and TV Shows on Netflix',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    autosize=False,
    width=1000,
    height=500,
    showlegend=True)
fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

The bar chart is suitable for comparing and displaying categorical data (months) and their corresponding counts.
The chart helps in understanding the distribution of content additions across different months and identifying any patterns or trends.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

From October to December, there is a noticeable surge in the number of TV shows and movies added to the Netflix platform. One plausible explanation is that these months include holidays and celebrations such as Halloween, Diwali, Thanksgiving, and Christmas, when people often spend more time at home and look for entertainment options.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The insight that more TV shows and movies are added to Netflix from October to December can potentially create a positive business impact. Here are a few reasons:

**1-Meeting Seasonal Demand**

**2-Retaining Existing Subscribers**

**3-Attracting New Subscribers**

#### Chart - 4

In [None]:
# Chart - 4 visualization code
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(15,6))
sns.countplot(x='month_added', hue='type',lw=5, data=netflix_data, ax=ax,palette=['#000000','#FF0000'])

##### 1. Why did you pick the specific chart?

Answer Here.

By using a countplot, we can easily see and compare the frequencies of TV show and movie additions for each month.

##### 2. What is/are the insight(s) found from the chart?

Answer Here


**Movies:**

January, October, and December appear to be the trending months for movie additions on Netflix compared to other months.

**TV Shows:**

October, November, and December emerge as the trending months for TV show additions on Netflix compared to other months.
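The month-by-type counts that the countplot draws can also be produced as a table, which makes the peak months easy to read off. This is a sketch on hypothetical toy data; the column names follow the notebook, the values are invented.

```python
import pandas as pd

# Hypothetical sample: month a title was added and its type.
toy = pd.DataFrame({
    "month_added": [1, 1, 10, 10, 12, 12],
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
})

# Rows are months, columns are content types, cells are counts.
monthly = pd.crosstab(toy["month_added"], toy["type"])

# For each type, find the month with the most additions.
peak = monthly.idxmax()
print(monthly)
print(peak)
```

On the real data, sorting each column of `monthly` in descending order would list the top months per type rather than only the single peak.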

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here.

The gained insights regarding the trending months for movies and TV shows on Netflix can potentially create a positive business impact. Here's why:

**1-Meeting Viewer Demand:**

**2-Capitalizing on Seasonal Trends:**

**3-Improved Competitiveness:**

#### Chart - 5

In [None]:
# Chart - 5 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 7))

# Extract the numeric part from 'duration' and convert it to floats
durations = movie['duration'].str.extract(r'(\d+)')[0].astype(float)

# Plot the histogram
sns.histplot(durations, kde=False, color='red')

plt.title('Distribution of Movie Durations', fontweight="bold")
plt.xlabel('Duration (minutes)')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

The histogram is a suitable choice for this analysis because it shows how many movies fall into each duration range.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Most movies on Netflix have a duration between roughly 50 and 150 minutes.
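A claim about a duration range can be checked numerically once the `duration` strings are parsed. This is a sketch assuming the strings look like `"90 min"` or `"2 Seasons"`; the sample rows and the `duration_value` column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the type and duration columns.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "duration": ["90 min", "125 min", "2 Seasons", "45 min"],
})

# Pull out the leading number: minutes for movies, seasons for TV shows.
toy["duration_value"] = toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)

# Share of movies whose runtime falls between 50 and 150 minutes (inclusive).
movies = toy[toy["type"] == "Movie"]
in_range = movies["duration_value"].between(50, 150).mean()
print(toy)
print(f"Share of movies in 50-150 min: {in_range:.2f}")
```

Because the number means minutes for movies and seasons for TV shows, the two types should be filtered apart before any summary statistic is taken.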

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

**Positive Business Impact:**

1-**Audience Flexibility :** By offering movies and TV shows with a variety of lengths, ranging from shorter films to longer epic productions, Netflix can cater to the diverse preferences and schedules of its audience

2-**Increased Engagement :** Movies and TV shows with varying lengths provide options for viewers to choose content that fits their available time. This can lead to increased engagement and longer viewing sessions

3-**Content Diversity :** By including movies and TV shows of different lengths, Netflix can expand its content library and cater to various genres and storytelling formats.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#Checking the distribution of TV SHOWS
plt.figure(figsize=(20,8))
plt.title("Distribution of TV Shows duration",fontweight='bold')
sns.countplot(x=tv_show['duration'],color = 'red' ,data=tv_show,order = tv_show['duration'].value_counts().index)

##### 1. Why did you pick the specific chart?

Answer Here.

The chart is a countplot, a type of bar chart that shows the frequency of each category in a categorical variable. Here it displays the distribution of TV shows by number of seasons.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

The chart shows that the majority of TV shows in the dataset have only one season, while most of the remaining shows have between two and five seasons.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

Yes, **Positive Impact**: by recognizing that the majority of TV shows have a limited number of seasons, content producers and streaming platforms can optimize their production planning. They can allocate resources more efficiently, reduce production costs, and potentially increase the output of content.

#### Chart - 7

In [None]:
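# Chart - 7 visualization code
# Heuristic: treat a movie as an "original" when it was added in the same year it was released
movie = movie.assign(originals=np.where(movie['release_year'] == movie['year_added'], 'Yes', 'No'))
# Pie plot showing the percentage of originals and others among movies
fig, ax = plt.subplots(figsize=(5,5),facecolor="#660066")
ax.patch.set_facecolor("#660066")
explode = (0, 0.1)
# Fix the order of the counts so they always match the labels below
originals_counts = movie['originals'].value_counts().reindex(['No', 'Yes'], fill_value=0)
ax.pie(originals_counts, explode=explode, autopct='%.2f%%', labels= ['Others', 'Originals'],
       shadow=True, startangle=90,textprops={'color':"black", 'fontsize': 25}, colors =['red','#F5E9F5'])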

##### 1. Why did you pick the specific chart?

*Answer* Here.

The pie plot is a suitable choice for visualizing the distribution of categorical data, such as the proportion of "originals" and "others" in this case. It allows you to see the relative sizes of each category as a portion of the whole.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Using "released in the same year it was added" as a proxy for Netflix originals, about 30% of the movies on Netflix count as originals, while the remaining 70% were released earlier, presumably through other distribution channels, and subsequently added to Netflix.
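The "originals" label in this chart comes from comparing `release_year` with `year_added`, which is a proxy rather than a flag supplied by Netflix. The sketch below shows the comparison on hypothetical rows; the `same_year` column name is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of movies with release and added years.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [2019, 2010, 2020, 2018],
    "year_added": [2019, 2018, 2020, 2019],
})

# Mark titles added in the same year they were released.
toy["same_year"] = np.where(toy["release_year"] == toy["year_added"], "Yes", "No")

# Reindex so the order of the shares is fixed regardless of which is larger.
share = (
    toy["same_year"]
    .value_counts(normalize=True)
    .reindex(["No", "Yes"], fill_value=0.0)
)
print(share)
```

A licensed film that happens to reach Netflix in its release year would also be marked "Yes", so the resulting share is best read as an upper-bound estimate.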

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

Yes, gaining insights can indeed help create a positive business impact. By understanding the distribution of movies on Netflix, such as the proportion of Netflix originals versus non-originals, the streaming service can make informed decisions about content acquisition and production.

#### Chart - 8

In [None]:
netflix_data['cast']

In [None]:
# Separate individual actors from the cast column
cast = netflix_data['cast'].str.split(', ', expand=True).stack()

# Actors who appear in the most movies/shows
cast.value_counts()

In [None]:
# Drop the placeholder used to fill missing cast values ('no_cast')
cast = cast[cast != 'no_cast']

In [None]:
cast.value_counts()

In [None]:
top_10_Genre = netflix_data['listed_in'].value_counts().head(10)

fig2 = px.pie(top_10_Genre, values=top_10_Genre.values, names=top_10_Genre.index)

custom_colors = ['#4c78a8', '#72b7b2', '#ff7f0e', '#2ca02c', '#d62728']
fig2.update_traces(hovertemplate=None, textposition='outside', textinfo='percent+label', rotation=0,
                   marker=dict(colors=custom_colors))

fig2.update_layout(height=600, width=900, title='Top 10 genres on Netflix',
                   margin=dict(t=100, b=30, l=0, r=0),
                   showlegend=False,
                   plot_bgcolor='#fafafa',
                   paper_bgcolor='#fafafa',
                   title_font=dict(size=20, color='#555', family="Lato, sans-serif"),
                   font=dict(size=12, color='#FF0000'),
                   hoverlabel=dict(bgcolor="#444", font_size=13, font_family="Lato, sans-serif"))

fig2.show()


##### 1. Why did you pick the specific chart?

Answer Here.

The pie chart's circular shape allows viewers to quickly compare the sizes of different genres by observing the relative areas of the slices. The accompanying labels and percentage values outside the slices provide additional information and enhance the readability of the chart.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

In this chart, the top three genres on Netflix based on their distribution are:

1-**Documentaries:** 14.4%

2-**Stand-up Comedy:** 13.9%

3-**Drama, International Movies:** 13.8%

These genres have the highest percentages compared to the other genres included in the top 10 list.
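Each `listed_in` value can hold several comma-separated genres, so counting whole strings ranks genre combinations. A sketch of counting individual genres instead, on hypothetical sample rows:

```python
import pandas as pd

# Hypothetical sample of the listed_in column.
toy = pd.DataFrame({
    "listed_in": [
        "Dramas, International Movies",
        "Documentaries",
        "Dramas, Comedies",
        "Documentaries, International Movies",
    ]
})

# Split each string into a list, then give every genre its own row.
genres = toy["listed_in"].str.split(", ").explode()
counts = genres.value_counts()
print(counts)
```

With this approach a title listed under two genres contributes to both counts, so the totals add up to more than the number of titles.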

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

Over-focusing on top genres may reduce content diversity and alienate niche audiences, risking user churn and competitive disadvantage.

Balanced Strategy: Invest in top genres while supporting niche and emerging genres to ensure sustainable growth and audience satisfaction.



#### Chart - 9

In [None]:
# Chart - 9 visualization code
from collections import Counter

# Create subset of dataset with required data.
conuntryVSgenre = netflix_data[['country', 'listed_in']]

# Create a function to separate all genres and store counts for each.
def country_wise_genre(country):
  country_genre = conuntryVSgenre[conuntryVSgenre['country'] == country]
  # Join all genre strings into one long string with ", ".join(), then split it
  # back into individual genre names using ", " as the separator.
  country_genre = ", ".join(country_genre['listed_in'].dropna()).split(", ")
  country_genre_dict = dict(Counter(country_genre))
  return country_genre_dict

In [None]:
conuntryVSgenre

In [None]:
# Define list of top ten countries.
from collections import Counter
country_list = ['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain', 'Mexico', 'Australia']
# Create an empty dict to store values of each genre for each country.
country_wise_genre_dict = {}
# Iterate through all values in country_list.
for i in country_list:
  country_wise_genre_dict[i] = country_wise_genre(i)

# Build the DataFrame once, after the loop; genres absent from a country get a count of 0.
country_genre_count_df = pd.DataFrame(country_wise_genre_dict).fillna(0).reset_index()
country_genre_count_df.rename({'index':'Genre'}, inplace=True, axis=1)

In [None]:
country_genre_count_df

In [None]:
# Plot the above data.
df = country_genre_count_df

# Define colours to be used.
colors = ['aliceblue', 'brown', 'crimson', 'cyan', 'darkblue', 'darkmagenta', 'darkolivegreen', 'darkorange', 'darkturquoise', 'darkviolet', 'deeppink', 'forestgreen',
          'fuchsia', 'gainsboro', 'goldenrod', 'gray','maroon', 'mediumaquamarine', 'mediumvioletred', 'midnightblue', 'orchid', 'palegoldenrod', 'palegreen', 'paleturquoise',
          'plum', 'powderblue', 'purple', 'red', 'rosybrown', 'royalblue', 'saddlebrown', 'salmon', 'sandybrown','seagreen', 'seashell', 'sienna', 'silver', 'slategray', 'snow',
          'springgreen', 'tomato','yellow', 'yellowgreen', 'darkred', 'lavender', 'lightcoral', 'navy', 'olive', 'teal', 'turquoise']


# Create subplots, using 'domain' type for pie charts
specs = [[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}, {'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}, {'type':'domain'}, {'type':'domain'}, {'type':'domain'}]]
fig = make_subplots(rows=2, cols=5, specs=specs, subplot_titles=['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain', 'Mexico', 'Australia'])

# Define traces.
fig.add_trace(go.Pie(labels=df['Genre'], values=df['United States'], name='United States'),1,1)
fig.add_trace(go.Pie(labels=df['Genre'], values=df['India'],  name='India'),1,2)
fig.add_trace(go.Pie(labels=df['Genre'], values=df['United Kingdom'],  name='United Kingdom'),1,3)
fig.add_trace(go.Pie(labels=df['Genre'], values=df['Canada'],  name='Canada'),1,4)
fig.add_trace(go.Pie(labels=df['Genre'], values=df['Japan'],  name='Japan'),1,5)
fig.add_trace(go.Pie(labels=df['Genre'], values=df['France'],  name='France'),2,1)
fig.add_trace(go.Pie(labels=df['Genre'], values=df['South Korea'],  name='South Korea'),2,2)
fig.add_trace(go.Pie(labels=df['Genre'], values=df['Spain'],  name='Spain'),2,3)
fig.add_trace(go.Pie(labels=df['Genre'], values=df['Mexico'],  name='Mexico'),2,4)
fig.add_trace(go.Pie(labels=df['Genre'], values=df['Australia'],  name='Australia'),2,5)

# Tune layout and hover info
fig.update_traces(hoverinfo='label+percent+name', textinfo='none', marker=dict(colors=colors))
fig.update_layout(title={'text': 'Top ten countries and the content they provide.',
                          'y':0.97,
                          'x':0.5,
                          'font_size':25,
                          'xanchor': 'center',
                          'yanchor': 'top'},height=650, width=1550,paper_bgcolor='white',
                  legend=dict(x=0.099,orientation="h")
                          )
fig.show()

##### 1. Why did you pick the specific chart?

Answer Here.

A grid of pie charts, one per country, is suitable for showing the distribution of genres across multiple countries. The slices of each pie represent different genres, and the size of each slice indicates the proportion of content in that genre for that country. This allows for easy comparison of genre distribution across countries in a visually appealing manner.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

**Action & Adventure and Dramas are the most prevalent genres across all countries. They have the highest values in most countries, indicating their popularity. The United States has a diverse content offering across multiple genres, with a strong presence in Action & Adventure, Dramas, Comedies, and Documentaries**.


**India has a significant focus on Independent Movies and Dramas, with relatively fewer offerings in other genres.**

**The United Kingdom has a good balance between Drama, International TV Shows, and Documentaries.**

**Australia's content offering is diverse, with a relatively balanced distribution across various genres such as Dramas, Comedies, International TV Shows, and Documentaries.**
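A genre-by-country table like the one behind these pies can also be built with `explode` and `crosstab`. This is a sketch under the assumption that both `country` and `listed_in` may hold comma-separated values; the sample rows are hypothetical, and it counts co-produced titles under every listed country, which differs from an exact match on the country string.

```python
import pandas as pd

# Hypothetical sample of the country and listed_in columns.
toy = pd.DataFrame({
    "country": ["United States", "India", "United States, India", "India"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Documentaries", "Dramas, International Movies"],
})

# Split both columns into lists, then expand to one row per (country, genre) pair.
long = (
    toy.assign(
        country=toy["country"].str.split(", "),
        genre=toy["listed_in"].str.split(", "),
    )
    .explode("country", ignore_index=True)
    .explode("genre", ignore_index=True)
)

# Rows are genres, columns are countries, cells are title counts.
table = pd.crosstab(long["genre"], long["country"])
print(table)
```

Dividing each column of `table` by its sum would give the per-country genre proportions that the pie slices represent.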

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

The gained insights can potentially help create a positive business impact in the following ways:

1-**Targeted Content Strategy:** By understanding the genre preferences in different countries, businesses can develop a targeted content strategy that aligns with the interests of their target audience.

2-**Market Expansion:** The insights can help businesses identify countries where their content genres are highly popular. This knowledge can guide expansion plans and investment in those markets, increasing the chances of success and profitability.

3-**Content Localization:** Understanding the genre preferences in different countries can aid in content localization efforts. Adapting content to suit the local preferences can increase its appeal and viewership, potentially leading to business growth.

#### Chart - 10

In [None]:
# number of unique values
netflix_data['release_year'].nunique()

In [None]:
# Chart - 10 visualization code
print(f'Oldest release year : {netflix_data.release_year.min()}')
print(f'Latest release year : {netflix_data.release_year.max()}')

In [None]:
# Chart - 10 visualization code
fig,ax = plt.subplots(1,2, figsize=(15,6))

# Univariate analysis (histplot replaces the deprecated distplot)
hist = sns.histplot(netflix_data['release_year'], ax=ax[0], kde=False, color='green')
hist.set_title('Distribution by released year', size=20)

# Bivariate analysis
count = sns.countplot(x="release_year", hue='type', data=netflix_data, order=netflix_data['release_year'].value_counts().index[0:15], ax=ax[1])
count.set_title('Movie/TV shows released in top 15 years', size=15)
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

The chosen chart combination of a histogram and a grouped bar plot allows for both univariate and bivariate analysis. The histogram provides an overview of the distribution of release years across all titles, while the bar plot compares the number of movies and TV shows released in the 15 years with the most titles.

##### 2. What is/are the insight(s) found from the chart?

Answer Here


The histogram of release years shows that most titles in the Netflix catalog were released from around 1980 onward. The number of titles per release year gradually increases, with significant growth from the year 2000 onwards. The highest peak in the distribution lies between 2010 and 2020, indicating that a large share of the movies and TV shows on Netflix were released during that period.

In terms of content type (movies and TV shows), the bar graph highlights 2017 and 2020 as the years with the highest counts. These years show a large number of movie releases and TV show releases among the titles on Netflix.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

Yes, the gained insights can help create a positive business impact. By understanding the distribution of release years and identifying trends, businesses can make informed decisions regarding content acquisition, production, and marketing strategies.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
#Ratings
# number of unique values
netflix_data.rating.nunique()

In [None]:
fig,ax = plt.subplots(1,2, figsize=(15,6))
plt.suptitle('Top 10 rating for different age groups and audiences & Rating based on Movie and Tv_Shows',
             weight='bold', y=1.02, size=18)

# univariate analysis
sns.countplot(x="rating", data=netflix_data, order=netflix_data['rating'].value_counts().index[0:10], ax=ax[0])



# bivariate analysis
graph = sns.countplot(x="rating", data=netflix_data, hue='type', order=netflix_data['rating'].value_counts().index[0:10], ax=ax[1])
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The chosen combination of two count plots allows for both univariate and bivariate analysis. The first plot shows the top 10 ratings across all content, while the second compares those ratings between movies and TV shows.

##### 2. What is/are the insight(s) found from the chart?

**TV-MA:** This rating means that the content is intended for mature audiences only. It may include graphic violence, explicit sexual content, or strong language.

The most common rating is **TV-MA**, for both movies and TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that **TV-MA** is the most common rating for both movies and TV shows can inform content strategies, audience targeting, programming decisions, and content diversity to drive positive business impact in terms of increased viewership and customer satisfaction.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
import folium

# Base map with the default OpenStreetMap tiles
fig = folium.Map(location=[20, 0], zoom_start=2)

# Define a dictionary of country names, coordinates, and marker colors
# ('darkred' stands in for 'brown', which folium.Icon does not support)
countries = {'United States': {'coords': [37.0902, -95.7129], 'color': 'red'},
             'India': {'coords': [20.5937, 78.9629], 'color': 'green'},
             'United Kingdom': {'coords': [55.3781, -3.4360], 'color': 'blue'},
             'Canada': {'coords': [56.1304, -106.3468], 'color': 'orange'},
             'Japan': {'coords': [36.2048, 138.2529], 'color': 'purple'},
             'France': {'coords': [46.2276, 2.2137], 'color': 'pink'},
             'South Korea': {'coords': [35.9078, 127.7669], 'color': 'gray'},
             'Spain': {'coords': [40.4637, -3.7492], 'color': 'black'},
             'Mexico': {'coords': [23.6345, -102.5528], 'color': 'darkred'}}

# Loop over the dictionary and add markers for each country
for country, info in countries.items():
    folium.Marker(location=info['coords'], tooltip=country,
                  popup=f"Color: {info['color']}",
                  icon=folium.Icon(color=info['color'])).add_to(fig)

# Display the map
fig


##### 1. Why did you pick the specific chart?

A world map with one marker per country presents the data in an intuitive and visually appealing manner, allowing viewers to see at a glance where the top content-producing countries on Netflix are located.

##### 2. What is/are the insight(s) found from the chart?


The directors Raúl Campos and Jan Suter have the highest count in terms of overall Movies and TV shows on Netflix.



#### Chart - 13


In [None]:
netflix_data['listed_in'].value_counts().head(25)

In [None]:
import plotly.express as px
import pandas as pd

counts = netflix_data['listed_in'].value_counts().head(10)
average = counts.mean()

df = pd.DataFrame({'Category': counts.index, 'Count': counts.values})
colors = px.colors.qualitative.Dark24[:10]
fig = px.bar(df, x='Category', y='Count', color='Category', color_discrete_sequence=colors)
fig.add_hline(y=average, line_color='red')
fig.update_layout(title='Top 10 Genres by Count with Average', title_x=0.3)

fig.show()

##### **1. Why did you pick the specific chart?**

The bar chart makes it easy to compare the counts of the top 10 genres with one another and with their average, which is drawn as a red horizontal line.

##### **2. What is/are the insight(s) found from the chart?**

The average count across the top 10 genre categories lies between 200 and 250. The genre with the highest count is Documentaries, with 334 titles.

##### **3. Will the gained insights help creating a positive business impact?**
**Are there any insights that lead to negative growth? Justify with specific reason.**


Yes, the gained insights can help create a positive business impact for a streaming platform like Netflix or any other company in the entertainment industry. With these insights, companies can refine their content strategies, enhance viewer satisfaction, attract a larger audience, and ultimately increase viewership, customer retention, and revenue.

#### Chart - 14 Correlation Heatmap



In [None]:
# Correlation Heatmap visualization code
#Assigning the Ratings into grouped categories
ratings = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
netflix_data['target_ages'] = netflix_data['rating'].replace(ratings)

In [None]:
# Add a count column
netflix_data['count'] = 1

# Get top 10 countries by number of titles
data = netflix_data.groupby('country')[['count']].sum().sort_values(by='count', ascending=False).reset_index().head(10)

# Store only the country names
top_countries = data['country']

# Filter the dataset for top countries
df_heatmap = netflix_data[netflix_data['country'].isin(top_countries)]

# Create a crosstab of country vs target_ages (normalize by row for %)
df_heatmap = pd.crosstab(df_heatmap['country'], df_heatmap['target_ages'], normalize='index').T


In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 12))

# Define target orders
country_order2 = ['United States', 'India', 'United Kingdom', 'Canada', 'Japan',
                  'France', 'South Korea', 'Spain', 'Mexico']
age_order = ['Adults', 'Teens', 'Older Kids', 'Kids']

# Filter only available rows and columns to prevent KeyError
valid_ages = [age for age in age_order if age in df_heatmap.index]
valid_countries = [country for country in country_order2 if country in df_heatmap.columns]

# Draw the heatmap
sns.heatmap(data=df_heatmap.loc[valid_ages, valid_countries],
            cmap='YlGnBu',
            square=True,
            linewidth=2.5,
            cbar=False,
            annot=True,
            fmt='1.0%',
            vmax=0.6,
            vmin=0.05,
            ax=ax,
            annot_kws={"fontsize": 12})

##### 1. Why did you pick the specific chart?


A heatmap is a suitable choice when visualizing the relationships between two categorical variables, in this case, countries and age groups. It allows for a clear representation of patterns, trends, and comparisons across different categories.
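
The percentages in the heatmap come from `pd.crosstab(..., normalize='index')`, which divides each count by its row total. The sketch below reproduces that normalization with the standard library only. The `rows` data and the `row_normalized_shares` helper are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (country, target_ages) pairs, invented for illustration only
rows = [
    ("India", "Teens"), ("India", "Teens"), ("India", "Adults"), ("India", "Kids"),
    ("Spain", "Adults"), ("Spain", "Adults"), ("Spain", "Adults"), ("Spain", "Teens"),
]

def row_normalized_shares(pairs):
    """Return {country: {age_group: share}} where each country's shares sum to 1."""
    counts = defaultdict(Counter)
    for country, age_group in pairs:
        counts[country][age_group] += 1
    return {
        country: {age: n / sum(c.values()) for age, n in c.items()}
        for country, c in counts.items()
    }

shares = row_normalized_shares(rows)
print(shares["India"]["Teens"])   # 0.5
print(shares["Spain"]["Adults"])  # 0.75
```

Because each country is normalized separately, a cell describes the mix of content within that country and says nothing about how many titles the country has in total.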

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows, for each country, the share of its titles aimed at each target age group. These shares vary considerably across countries:

1. **Spain** has the highest share of adult content, at 84%.

2. **Mexico** follows with 77% of its titles aimed at adults.

3. **France** also leans strongly towards adults, at 68%.

4. **India** has the highest share of teen content, at 57%.

5. **United Kingdom** has a relatively high adult share, at 51%.

6. **South Korea** and **United States** each have 47% of their titles aimed at adults.

7. **Canada** has a comparatively low adult share, at 45%.

8. **Japan** is split evenly between adults and teens, at 36% each.

**Overall, the mix of target age groups differs from country to country. Spain, Mexico, and France are dominated by adult content and India by teen content, while Canada and Japan have the lowest adult shares among the listed countries.**

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
pair_df = netflix_data[['year_added', 'month_added', 'day_added']]

# Plot
sns.pairplot(pair_df)
plt.suptitle("Pair Plot of Date-Based Features", fontsize=16)
plt.show()


##### 1. Why did you pick the specific chart?



The Pair Plot is ideal for exploring relationships between multiple numeric variables at once. It shows:

- Distributions of each variable on the diagonal (histograms)
- Scatter plots of each variable pair to observe trends or correlations

##### 2. What is/are the insight(s) found from the chart?


From the pair plot, we observe that the majority of Netflix's content was added after 2015, with a strong upward trend until 2020. Content additions also seem to spike in certain months, suggesting seasonal strategies. The plot is not split by content type, so it does not show how Movies and TV Shows differ across the calendar year.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

In [None]:
#making copy of df_clean_frame
netflix_hypothesis = netflix_data.copy()
#head of df_hypothesis
netflix_hypothesis.head()

In [None]:
#filtering movie from Type_of_show column
netflix_hypothesis = netflix_hypothesis[netflix_hypothesis["type"] == "Movie"]

In [None]:
#with respect to each ratings assigning it into group of categories
ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}

netflix_hypothesis['target_ages'] = netflix_hypothesis['rating'].replace(ratings_ages)
#let's see unique target ages
netflix_hypothesis['target_ages'].unique()

In [None]:
netflix_hypothesis['target_ages'] = pd.Categorical(netflix_hypothesis['target_ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])

# Extract the leading number from strings such as "90 min" and convert it to numeric
netflix_hypothesis['duration'] = netflix_hypothesis['duration'].astype(str)
netflix_hypothesis['duration'] = netflix_hypothesis['duration'].str.extract(r'(\d+)', expand=False)
netflix_hypothesis['duration'] = pd.to_numeric(netflix_hypothesis['duration'])

netflix_hypothesis.head(3)

In [None]:
#group_by duration and target_ages
group_by_= netflix_hypothesis[['duration','target_ages']].groupby(by='target_ages')
#mean of group_by variable
group=group_by_.mean().reset_index()
group

In [None]:
# Only apply to rows where 'duration' is not null
netflix_data['duration_cleaned'] = netflix_data['duration'].str.extract(r'(\d+)').astype(float)


In [None]:
# Confirm that the extraction worked
print(netflix_data[['duration', 'duration_cleaned']].head())


In [None]:
# The hypothesis is about movies, so exclude TV shows (their duration counts seasons)
movies = netflix_data[netflix_data['type'] == 'Movie']

# Grouping by target age
group_by_ = movies.groupby('target_ages')

# Get groups
A = group_by_.get_group('Kids')
B = group_by_.get_group('Older Kids')

# Use the cleaned numeric duration column only
M1 = A['duration_cleaned'].mean()
S1 = A['duration_cleaned'].std()

M2 = B['duration_cleaned'].mean()
S2 = B['duration_cleaned'].std()
print(movies[['target_ages', 'duration_cleaned']].head())


In [None]:
#import stats
from scipy import stats
#length of groups and DOF
n1 = len(A)
n2= len(B)
print(n1,n2)

dof = n1+n2-2
print('dof',dof)

sp_2 = ((n1-1)*S1**2  + (n2-1)*S2**2) / dof
print('SP_2 =',sp_2)

sp = np.sqrt(sp_2)
print('SP',sp)

#tvalue
t_val = (M1-M2)/(sp * np.sqrt(1/n1 + 1/n2))
print('tvalue',t_val)
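
The pooled two-sample statistic computed in the cell above can be sketched as a standalone function using only the standard library. The two samples are hypothetical, and the `pooled_t` helper is an illustrative name rather than part of the notebook. The critical value uses a normal approximation as a stand-in for `stats.t.ppf`, which is only reasonable when the degrees of freedom are large.

```python
import math
import statistics

def pooled_t(sample_a, sample_b):
    """Two-sample t statistic with pooled variance (assumes equal variances)."""
    n1, n2 = len(sample_a), len(sample_b)
    m1, m2 = statistics.mean(sample_a), statistics.mean(sample_b)
    v1, v2 = statistics.variance(sample_a), statistics.variance(sample_b)
    dof = n1 + n2 - 2
    # Each group's variance is weighted by that same group's degrees of freedom
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / dof
    t = (m1 - m2) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))
    return t, dof

# Hypothetical durations in minutes, invented for illustration only
kids = [70, 75, 80, 85, 90]
older_kids = [85, 90, 95, 100, 105, 110]

t_val, dof = pooled_t(kids, older_kids)

# Large-sample stand-in for stats.t.ppf(0.975, dof); too loose for a dof this small
critical = statistics.NormalDist().inv_cdf(0.975)
print(t_val, dof, abs(t_val) > critical)
```

The null hypothesis of equal means is rejected when the absolute t-value exceeds the critical value, which is the same decision rule as checking whether the t-value falls outside the range between the two `stats.t.ppf` values.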

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H0: Movies rated for kids and older kids are at least two hours long. (Null Hypothesis)

H1: Movies rated for kids and older kids are not at least two hours long. (Alternate Hypothesis)
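
A claim that a mean duration is "at least two hours" compares one mean with a fixed value, which is the setting of a one-sample t-test. The sketch below shows that computation with the standard library only. The durations are hypothetical, and this is an illustrative alternative rather than the test the notebook runs.

```python
import math
import statistics

# Hypothetical movie durations in minutes, invented for illustration only
durations = [92, 88, 101, 95, 110, 85, 99, 105, 90, 97]
mu0 = 120  # the "two hours" threshold from the hypothesis

n = len(durations)
mean = statistics.mean(durations)
sd = statistics.stdev(durations)  # sample standard deviation (n - 1 in the denominator)

# One-sample t statistic: how many standard errors the sample mean lies from mu0
t_stat = (mean - mu0) / (sd / math.sqrt(n))

print(f"mean={mean:.1f}, sd={sd:.2f}, t={t_stat:.2f}, dof={n - 1}")
```

A strongly negative t-value is evidence against the null hypothesis that the mean is at least 120 minutes. The matching critical value would come from the lower tail of the t-distribution with `n - 1` degrees of freedom.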

#### 2. Perform an appropriate statistical test.

In [None]:
#t-distribution
stats.t.ppf(0.025,dof)

In [None]:
#t-distribution
stats.t.ppf(0.975,dof)

##### Which statistical test have you done to obtain P-Value?

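A two-sample t-test with pooled variance was used. Instead of a p-value, the t-value is compared with the critical values of the t-distribution at the 5% significance level (`stats.t.ppf(0.025, dof)` and `stats.t.ppf(0.975, dof)`).

The **t-value** is not in the range, so the **null hypothesis is rejected**.

**As a result, movies rated for kids and older kids are not at least two hours long.**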

##### Why did you choose the specific statistical test?

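The t-test compares the mean durations of two groups when the population standard deviation is unknown and has to be estimated from the samples.

Because the t-value is not in the range, the null hypothesis is rejected.

**As a result, movies rated for kids and older kids are not at least two hours long.**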


### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H0: Titles with a duration of more than 90 minutes are NOT movies. (Null Hypothesis)

H1: Titles with a duration of more than 90 minutes are movies. (Alternate Hypothesis)

In [None]:
#making copy of df_clean_frame
netflix_hypothesis=netflix_data.copy()
#head of df_hypothesis
netflix_hypothesis.head()

In [None]:
netflix_hypothesis['duration'] = netflix_hypothesis['duration'].str.extract(r'(\d+)', expand=False)
netflix_hypothesis['duration'] = pd.to_numeric(netflix_hypothesis['duration'])

In [None]:
# Treat the content type as a categorical variable with a fixed category order
netflix_hypothesis['type'] = pd.Categorical(netflix_hypothesis['type'], categories=['Movie','TV Show'])
netflix_hypothesis.head(3)

In [None]:
# Perform Statistical Test to obtain P-Value
#group_by duration and TYPE
group_by_= netflix_hypothesis[['duration','type']].groupby(by='type')
#mean of group_by variable
group1=group_by_.mean().reset_index()
group1

In [None]:
# Numeric duration, defined before it is used below: minutes for movies, number of seasons for TV shows
netflix_data['duration_cleaned'] = netflix_data['duration'].str.extract(r'(\d+)').astype(float)

group_by_ = netflix_data.groupby('type')

A = group_by_.get_group('Movie')
B = group_by_.get_group('TV Show')

M1 = A['duration_cleaned'].mean()
S1 = A['duration_cleaned'].std()

M2 = B['duration_cleaned'].mean()
S2 = B['duration_cleaned'].std()

In [None]:
print(netflix_data.select_dtypes(include='number').columns)

In [None]:
#import stats
from scipy import stats
#length of groups and DOF
n1 = len(A)
n2= len(B)
print(n1,n2)

dof = n1+n2-2
print('dof',dof)

sp_2 = ((n1-1)*S1**2  + (n2-1)*S2**2) / dof
print('SP_2 =',sp_2)

sp = np.sqrt(sp_2)
print('SP',sp)

#tvalue
t_val = (M1-M2)/(sp * np.sqrt(1/n1 + 1/n2))
print('tvalue',t_val)

##### Which statistical test have you done to obtain P-Value?

A two-sample t-test with pooled variance, with critical values taken from the t-distribution.

In [None]:
# Perform Statistical Test to obtain P-Value
#t-distribution
stats.t.ppf(0.025,dof)

In [None]:
#t-distribution
stats.t.ppf(0.975,dof)

##### Why did you choose the specific statistical test?

Because the t-value is not in the range, the null hypothesis is rejected.

**As a result, titles with a duration of more than 90 minutes are movies.**

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Replace missing text with '' so that one missing field does not turn the whole combined string into NaN
text_cols = ['director', 'cast', 'country', 'listed_in', 'description']
# Combining all the clustering attributes into a single column
netflix_data['clustering'] = netflix_data[text_cols].fillna('').agg(' '.join, axis=1)

In [None]:
netflix_data['clustering'][25]

# **Textual Data Preprocessing**

In [None]:
# Expand Contraction
# Lower Casing
# Remove Punctuations
# Remove URLs & Remove words and digits contain digits
# Remove Stopwords
# Remove White spaces
# Rephrase Text
# Tokenization
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('all', quiet=True)

# Build these once instead of on every call
STOPWORDS_SET = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def transform_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Tokenize text into words
    words = nltk.word_tokenize(text)

    # Keep alphanumeric tokens only (this also drops punctuation) and remove stopwords
    words = [word for word in words if word.isalnum() and word not in STOPWORDS_SET]

    # Lemmatize words and join them back into a single string
    return ' '.join(LEMMATIZER.lemmatize(word) for word in words)
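
To see what these cleaning steps do to a single description without downloading the NLTK data, the sketch below is a simplified, standard-library approximation. The `simple_clean` helper and its short stopword list are hypothetical, and it skips lemmatization, so its output will differ from `transform_text` on real data.

```python
import re

# A tiny hypothetical stopword list; NLTK's English list is much longer
STOPWORDS = {"a", "an", "the", "and", "of", "in", "to", "is", "at"}

def simple_clean(text):
    text = text.lower()                      # lower casing
    text = re.sub(r"http\S+", "", text)      # drop URLs
    tokens = re.findall(r"[a-z0-9]+", text)  # keep alphanumeric runs only
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(simple_clean("The Detective returns to Paris! See http://example.com"))
# detective returns paris see
```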

In [None]:
netflix_data['Clean_Text'] = netflix_data['clustering'].apply(transform_text)

In [None]:
netflix_data["Clean_Text"]

# **Text Vectorization**

**TF-IDF combines two metrics: Term frequency (TF) and inverse document frequency (IDF).**

Term Frequency (TF): This metric measures the frequency of a term in a document. It assumes that the more often a term appears in a document, the more relevant it is to that document. It is calculated using the formula:

**TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)**

**Inverse Document Frequency (IDF): This metric measures the importance of a term across a collection of documents. It gives higher weight to terms that appear less frequently in the entire collection. It is calculated using the formula:**


**IDF(t) = log_e(Total number of documents / Number of documents containing term t)**
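
The two formulas above can be worked through by hand on a tiny corpus. The sketch below uses three hypothetical documents and the standard library only. The `tf`, `idf`, and `tfidf` helpers are illustrative names, and `idf` assumes the term occurs in at least one document.

```python
import math
from collections import Counter

# Three tiny hypothetical "documents", invented for illustration only
docs = [
    "crime drama detective crime",
    "romantic comedy drama",
    "stand up comedy special",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Share of the document's tokens that are `term`
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    # Assumes the term occurs in at least one document
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

print(tfidf("crime", tokenized[0], tokenized))  # 0.5 * ln(3/1)
print(tfidf("drama", tokenized[0], tokenized))  # 0.25 * ln(3/2)
```

"crime" scores higher than "drama" in the first document because it is both more frequent there and rarer across the corpus. scikit-learn's `TfidfVectorizer` follows the same idea, but by default it uses raw term counts, a smoothed IDF, and L2 normalization of each row, so its numbers will not match this textbook version exactly.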

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned text for every title
bag_of_words = netflix_data.Clean_Text

# Initialize TF-IDF vectorizer
t_vectorizer = TfidfVectorizer()

# Fit and transform the data (X is a sparse matrix)
X = t_vectorizer.fit_transform(bag_of_words)

# To see part of the TF-IDF matrix without converting all of it to a dense array:
print(X[:5].toarray())

In [None]:
print(X.shape)

In [None]:
t_vectorizer.get_feature_names_out()

**Do you think that dimensionality reduction is needed? Explain Why**

Yes. TF-IDF creates one feature per vocabulary term, so the matrix has far more columns than are practical for clustering. PCA is used to reduce the dimensionality of the dataset. PCA identifies the directions (principal components) along which the data varies the most. These components are ordered by the amount of variance they explain in the data.

**Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)**

PCA can extract the most relevant features from a dataset. It transforms the original features into a new set of uncorrelated variables called principal components. These components are linear combinations of the original features and capture the maximum amount of variation present in the data.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

# Example: using only numeric features
numeric_data = netflix_data.select_dtypes(include='number').dropna()

# Step 1: Scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)

# Step 2: Apply PCA
transformer = PCA(n_components=len(numeric_data.columns))
transformer.fit(scaled_data)

# Step 3: Plot cumulative explained variance
plt.figure(figsize=(15, 5), dpi=120)
plt.plot(np.cumsum(transformer.explained_variance_ratio_), marker='o', color='purple')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA: Cumulative Explained Variance by Components')
plt.grid(True)
plt.show()


The plot helps in determining the number of components to consider for dimensionality reduction. You can select the number of components where the cumulative explained variance reaches a satisfactory threshold, such as 95%. The point where the curve intersects or is closest to the threshold line can guide you in choosing the appropriate number of components for your analysis.
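
Reading the threshold off the curve can also be done programmatically. The sketch below assumes a list of explained-variance ratios like the one in `transformer.explained_variance_ratio_`. The ratios shown are hypothetical, and `components_for_threshold` is an illustrative helper name.

```python
from itertools import accumulate

def components_for_threshold(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for k, total in enumerate(accumulate(explained_variance_ratio), start=1):
        if total >= threshold:
            return k
    # Threshold never reached: fall back to using every component
    return len(explained_variance_ratio)

# Hypothetical ratios, invented for illustration only
ratios = [0.50, 0.25, 0.125, 0.0625, 0.0625]
print(components_for_threshold(ratios))        # 5
print(components_for_threshold(ratios, 0.75))  # 2
```

scikit-learn offers a shortcut for the same choice: passing a float between 0 and 1 as `n_components` to `PCA` keeps just enough components to explain that fraction of the variance.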


# **7. ML Model Implementation**

In [None]:
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from sklearn.decomposition import PCA
# Initialize the KMeans model with a random_state of 5
model = KMeans(random_state=5)

# Initialize the KElbowVisualizer with the KMeans model and desired parameters
visualizer = KElbowVisualizer(model, k=(4, 22), metric='silhouette', timings=False, locate_elbow=True)
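# X_scaled is not defined earlier in the notebook; it is assumed here to be the dense TF-IDF matrix
X_scaled = X.toarray()
pca = PCA(n_components=10)
X_transformed = pca.fit_transform(X_scaled)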

# Fit the visualizer on the transformed data
visualizer.fit(X_transformed)

# Display the elbow plot
visualizer.show()

**The plot also marks the "elbow" point, which is the recommended number of clusters for the selected metric. Here the elbow plot gives 5 as the optimal number of clusters.**
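
The silhouette metric used by the visualizer compares, for every point, the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), and scores the point as (b - a) / max(a, b). The sketch below is a minimal standard-library version on hypothetical 2-D points. It is meant to show the definition, not to replace `sklearn.metrics.silhouette_score`.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical 2-D points forming two obvious groups, invented for illustration only
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

A labelling that matches the visible groups scores close to 1, while a labelling that mixes the groups scores much lower, which is why the number of clusters with the highest average silhouette is a reasonable candidate for k.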

In [None]:
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.metrics import silhouette_score

def silhouette_score_analysis(n):
    # Print the silhouette score and draw the silhouette plot for k = 2 .. n-1
    for n_clusters in range(2, n):
        km = KMeans(n_clusters=n_clusters, random_state=5)
        preds = km.fit_predict(X_transformed)

        score = silhouette_score(X_transformed, preds, metric='euclidean')
        print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

        visualizer = SilhouetteVisualizer(km)
        visualizer.fit(X_transformed)  # Fit the data to the visualizer
        visualizer.show()              # Draw the plot (poof() is deprecated)

In [None]:
silhouette_score_analysis(15)

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Create a figure with a specific size and resolution
plt.figure(figsize=(10, 6), dpi=120)

# Initialize an empty list to store the within-cluster sum of squares (WCSS)
wcss = []

# Iterate over different numbers of clusters
for i in range(1, 22):
    # Initialize the KMeans algorithm with specific parameters
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)

    # Fit the KMeans algorithm to the transformed data
    kmeans.fit(X_transformed)

    # Append the WCSS to the list
    wcss.append(kmeans.inertia_)

# Plot the number of clusters against the WCSS
plt.plot(range(1, 22), wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
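# Refit with the chosen k (15, per the conclusion); after the elbow loop `kmeans` holds the 21-cluster model
kmeans = KMeans(n_clusters=15, init='k-means++', n_init=10, random_state=0).fit(X_transformed)
# Add cluster values to the dataframe.
netflix_data['cluster_number'] = kmeans.labels_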

In [None]:
netflix_data.head(1)

In [None]:
# Count the number of movies or TV shows in each cluster
cluster_content_count = netflix_data['cluster_number'].value_counts().rename_axis('clusters').reset_index(name='Movies/TV_Shows')

# Print the cluster content count
print(cluster_content_count)

In [None]:
# Word cloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def word_count(category):
    print("Exploring Cluster", category)
    col_names = ['type', 'title', 'country', 'rating', 'listed_in', 'description']
    # Named stop_words to avoid shadowing the NLTK `stopwords` import
    stop_words = set(STOPWORDS)
    plt.rcParams["figure.figsize"] = (10, 10)

    for i in col_names:
        # Text of column i for the titles in this cluster
        df_word_cloud = netflix_data[['cluster_number', i]].dropna()
        df_word_cloud = df_word_cloud[df_word_cloud['cluster_number'] == category]
        text = " ".join(word for word in df_word_cloud[i])

        # Generate a word cloud image
        wordcloud = WordCloud(stopwords=stop_words, background_color="#FFC0CB", width=500, height=500).generate(text)

        # Display the generated image with matplotlib
        print("Looking for insights from", i, "Movies/TV Shows")
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.show()

In [None]:
word_count(9)

Cluster 9 contains a total of 232 words. The most frequently occurring words in this cluster are as follows:

**Type** - Movies & TV shows

**Title** - Broadway, Remastered, Christmas, Friends Orchestra

**Country** - United Kingdom, Argentina, United States, India

**Rating** - TV-MA, TV-PG

**Listed_in** - Dramas International, Musical Dramas, Musical Documentaries, Comedies International

**Description** - Documentary, Music, One, Bad, Tour, Love.

In [None]:
word_count(11)

Cluster 11 contains a total of 410 words. The most frequently occurring words in this cluster are as follows:

**Type** - Movies & TV shows

**Title** - Special, America, Time, Live, Comedy, Netflix Alive, Martin

**Country** - United States, Brazil, Mexico, Italy

**Rating** - TV-MA, TV-PG

**Listed_in** - TV Comedies, Comedy Stand, Talk Shows

**Description** - Stand Comedy, Comic, Take, Life, Live, Share, Stories.

# **8. Recommender System**

A **recommender system is a type of information filtering system that suggests items to users based on their preferences**, interests, or past behavior. **It is commonly used in various applications such as e-commerce websites, streaming platforms, social media, and more.** The goal of a recommender system is to provide personalized recommendations that are relevant and helpful to the individual user.

**Content-based filtering:** This approach recommends items similar to the ones a user has liked or interacted with in the past. It analyzes the content or attributes of items and finds similar items to recommend. For example, if a user enjoys watching action movies, the system may recommend other action movies based on genre, actors, or plot.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#removing stopwords
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
netflix_data['description'] = netflix_data['description'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(netflix_data['description'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

In [None]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel
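# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row by default,
# so the dot product from linear_kernel equals the cosine similarity.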
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
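
`linear_kernel` computes plain dot products, which match cosine similarity only when the vectors have unit length. `TfidfVectorizer` normalizes each row to unit L2 length by default, so the two agree here. The sketch below checks that identity on two hypothetical vectors using the standard library only. The helper names are illustrative.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical term-weight vectors, invented for illustration only
u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

print(cosine(u, v))                           # 11/15
print(dot(l2_normalize(u), l2_normalize(v)))  # same value, up to rounding
```

Skipping the normalization step inside the similarity call is the main reason to prefer the linear kernel here: it produces the same scores with less work on vectors that are already unit length.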

In [None]:
cosine_sim

In [None]:
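# Map each title to its row position in cosine_sim (first occurrence wins for duplicated titles)
indices = pd.Series(range(len(netflix_data)), index=netflix_data['title']).groupby(level=0).first()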

In [None]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return netflix_data['title'].iloc[movie_indices]
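
The ranking step inside `get_recommendations` can be isolated as follows. The sketch uses a hypothetical similarity row and an illustrative `top_n_similar` helper, and it excludes the queried item by index instead of assuming it always sorts first.

```python
def top_n_similar(sim_row, self_index, n=3):
    """Indices of the n highest scores in sim_row, excluding the item itself."""
    ranked = sorted(range(len(sim_row)), key=lambda i: sim_row[i], reverse=True)
    return [i for i in ranked if i != self_index][:n]

# Hypothetical similarity row for item 0, invented for illustration only
row = [1.0, 0.2, 0.9, 0.0, 0.5]
print(top_n_similar(row, self_index=0))  # [2, 4, 1]
```

Slicing `sim_scores[1:11]` relies on the queried title being ranked first. If another title has an identical description and therefore also scores 1.0, the queried title can land in second place and be returned as its own recommendation, which the index-based filter avoids.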


In [None]:
netflix_data['title'][1:70]

In [None]:
get_recommendations( '14 Cameras',cosine_sim)

# **Conclusion**

**1-** It is interesting to note that the majority of the content available on Netflix consists of movies. However, in recent years, the platform has been focusing more on TV shows.

**2-** Most of these shows are released either at the end or the beginning of the year.

**3-** The United States and India are among the top five content-producing countries on the platform. Additionally, six of the top ten actors with the most content are from India.

**4-** When it comes to content ratings, TV-MA tops the charts, indicating that mature content is the most common on Netflix.

**5-** The value of k=15 was found to be optimal for clustering the data, and it was used to group the content into fifteen distinct clusters.

**6-** Using this data, a content-based recommender system was created using cosine similarity, which provides recommendations for Movies and TV shows.

# **9. Future Work**

- Integrating this dataset with external sources such as IMDb ratings, book clustering, and plant-based type clustering can lead to numerous intriguing discoveries.

- By incorporating additional data, a more comprehensive recommender system could be developed, offering enhanced recommendations to users. This system could then be deployed on the web for widespread usage.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***