# **Project Name**    - UnsupervisedML_Netflix



##### **Project Type**    - Unsupervised ML
##### **Contribution**    - Individual
##### **Team Member 1 -** - Rakesh

# **Project Summary -**

The primary goal of this project is to conduct an exploratory analysis and clustering analysis on a dataset of Netflix movies and TV shows, aiming to gain insights into content distribution, trends over the years, and clustering of similar content based on text features.

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine. In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

The work on this dataset follows these steps:

1.   Import libraries
2.   Load the dataset
3.   EDA and data visualization
4.   Data cleaning and feature engineering
5.   Form hypotheses from the visualized data
6.   Pick an appropriate model and train it
7.   Prediction and evaluation metrics for the model
8.   Conclusion

# **GitHub Link -**

https://github.com/RakeshReddi26/My_Projects

# **Problem Statement**


Netflix, a leading streaming service, has experienced significant changes in its content landscape over the years, with a notable increase in TV shows and a decrease in the number of movies. The challenge is to explore and analyze a dataset containing information on Netflix movies and TV shows as of 2019, sourced from Flixable. The primary goal is to derive meaningful insights into content distribution across countries, discern any shifts in focus from movies to TV shows, and cluster similar content based on text features.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
netflix_data = pd.read_csv("/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")

### Dataset First View

In [None]:
# Dataset first look: displaying the DataFrame shows its first and last five rows (same as .head() and .tail())
netflix_data

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
netflix_data.shape

### Dataset Information

In [None]:
# Dataset Info
netflix_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
netflix_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
netflix_data.isna().sum().sum()


In [None]:
# Missing values count per column
netflix_data.isna().sum()

In [None]:
# To show the percentage of null value for particular column
nullvalue_percentage = netflix_data.isna().sum()/len(netflix_data) * 100
nullvalue_percentage

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12, 6))
sns.barplot(x=nullvalue_percentage.index, y=nullvalue_percentage)
plt.title('Percentage of Null Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Percentage Null')
plt.show()
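
The counts and percentages computed above can also be combined into a single table. The following is a minimal sketch on a small hypothetical frame; `sample` and `missing_summary` are illustrative names and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for netflix_data
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "Spain", np.nan, "Canada"],
})

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": counts / len(df) * 100,
    })
    return summary.sort_values("missing_count", ascending=False)

summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(netflix_data)` would give the same kind of table for the real dataset, which can make it easier to decide which columns need imputation.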

### What did you know about your dataset?

The dataset contains information about movies and TV shows available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
netflix_data.columns

In [None]:
# Dataset Describe
netflix_data.describe()    # Only release_year appears because it is the only numeric column in the dataset


### Variables Description

The primary features in the dataset include:

show_id: A unique identifier for each show.

type: Indicates whether the entry is a movie or a TV show.

title: The title of the movie or TV show.

director: The director(s) of the movie or TV show.

cast: The cast or actors in the movie or TV show.

country: The country or countries where the content was produced.

date_added: The date when the content was added to Netflix.

release_year: The year the movie or TV show was released.

rating: The content rating.

duration: The duration of the movie or TV show.

listed_in: Categories or genres the content is listed under.

description: A brief description of the content.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
netflix_data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# It describes the attribute with object datatype
netflix_data.describe(include=["object"])

In [None]:
#To show the data types
netflix_data.dtypes


In [None]:
# Impute missing values in the 'country' column with a placeholder
netflix_data['country'] = netflix_data['country'].fillna('Unknown')


In [None]:
netflix_data.isna().sum()

In [None]:
# Convert 'date_added' to datetime format
netflix_data['date_added'] = pd.to_datetime(netflix_data['date_added'], errors='coerce')

# Check for any values that couldn't be converted
print(netflix_data['date_added'].isna().sum())


In [None]:
# Extract day, month, and year from 'date_added' for further analysis
netflix_data['added_day'] = netflix_data['date_added'].dt.day
netflix_data['added_month'] = netflix_data['date_added'].dt.month
netflix_data['added_year'] = netflix_data['date_added'].dt.year
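
To illustrate how `errors='coerce'` behaves, here is a small sketch on hypothetical values. It assumes the raw strings look like `"August 14, 2019"`, possibly with stray leading whitespace; the sample values and the explicit `format` string are assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values in the style of the 'date_added' column
raw_dates = pd.Series(["January 1, 2020", " August 14, 2019", "not a date", None])

# Strip stray whitespace first, then parse; invalid or missing entries become NaT
parsed = pd.to_datetime(raw_dates.str.strip(), format="%B %d, %Y", errors="coerce")

# Derive the same day/month/year features as in the cell above
parts = pd.DataFrame({
    "added_day": parsed.dt.day,
    "added_month": parsed.dt.month,
    "added_year": parsed.dt.year,
})

print(parsed.isna().sum())  # number of values that could not be parsed
print(parts)
```

Stripping whitespace before parsing is a precaution: with an explicit or inferred format, a value that differs only by a leading space may otherwise be coerced to `NaT`.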

In [None]:
#To show the data types
netflix_data.dtypes

In [None]:
#To show columns
netflix_data.columns

In [None]:
netflix_data['director'].value_counts()

In [None]:
netflix_data['cast'].value_counts()

In [None]:
netflix_data['country'].value_counts()


### What all manipulations have you done and insights you found?

Initial Data Overview:

Checked the first and last few rows of the dataset to get a glimpse of its structure. Checked the number of rows and columns in the dataset (7787 rows, 12 columns). Examined basic information about the dataset using info() and describe().

Handling Duplicate Values:

Checked for and identified any duplicate rows in the dataset (no duplicates found).

Handling Missing Values:

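Explored and visualized missing values in each column. Missing values in 'country' were filled with the placeholder 'Unknown', and 'date_added' values that could not be parsed were coerced to NaT. Other columns with missing values are not imputed in the code above.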
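
Placeholder imputation can be applied to several text columns at once. This is a minimal sketch on a hypothetical frame (`toy` and `text_cols` are illustrative names); whether a placeholder is the right treatment for each column is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in three text columns
toy = pd.DataFrame({
    "director": ["X", np.nan],
    "cast": [np.nan, "P, Q"],
    "country": [np.nan, "India"],
})

# Fill all selected text columns with the same placeholder in one assignment
text_cols = ["director", "cast", "country"]
toy[text_cols] = toy[text_cols].fillna("Unknown")

print(toy)
```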

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
netflix_data.boxplot(column =['release_year'], grid = False)

##### 1. Why did you pick the specific chart?

A box plot is the most direct way to show outliers visually.

##### 2. What is/are the insight(s) found from the chart?

There are many outliers in the release_year column, so those outliers should be removed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If the dataset contains many outliers, they should be dropped; otherwise the model may not produce a generalized result.
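
The points a box plot draws as outliers are, by default, those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule on hypothetical release years (the values in `years` are made up for illustration):

```python
import pandas as pd

# Hypothetical release years; most are recent, a few are much older
years = pd.Series([1945, 1975, 2010, 2012, 2014, 2015, 2016, 2017, 2018, 2019])

# Quartiles and interquartile range
q1, q3 = years.quantile(0.25), years.quantile(0.75)
iqr = q3 - q1

# Fences used by the 1.5 * IQR rule
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = years[(years < lower) | (years > upper)]
print(lower, upper)
print(outliers.tolist())
```

The same mask can be inverted to keep only the non-outlier rows, though for a column such as release year it may be worth checking whether old titles are genuine data rather than errors before dropping them.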

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(netflix_data['release_year'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Release Years')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

It is suitable for exploring the distribution of release years in the Netflix dataset, providing a visual representation of the data's central tendency and spread over time.

##### 2. What is/are the insight(s) found from the chart?

Long-Tail Distribution:

The distribution appears to have a long tail, suggesting that there are movies and TV shows from a wide range of release years. This indicates that the Netflix library includes content from both recent years and earlier decades.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The impact of the insights gained from the release year distribution chart on business would depend on the specific goals and strategies of the Netflix platform. Here are potential ways in which these insights could contribute to a positive business impact :

User Engagement: If certain release years are associated with a higher number of popular movies or TV shows, Netflix could leverage this information to promote and recommend content from those years. This could enhance user engagement by offering content that aligns with user preferences.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Countplot for 'type' column
sns.countplot(x='type', data=netflix_data)
plt.title('Count of Movies and TV Shows')
plt.show()



##### 1. Why did you pick the specific chart?

The specific chart chosen, a countplot for the 'type' column, is suitable for visualizing the distribution of categorical data.

Comparison: The chart allows for a quick visual comparison between the counts of movies and TV shows. The contrasting bars make it immediately apparent which type is more prevalent in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart illustrates the distribution of entries between movies and TV shows in the Netflix dataset.

Prevalence: It is evident that there is a higher count of movies compared to TV shows. This suggests that Netflix has a more extensive collection of movies than TV shows in the dataset.

Imbalance: The countplot highlights the imbalance in the number of movies and TV shows, with movies dominating the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Content Planning: The insights can inform content planning decisions. If Netflix observes a higher demand or viewership for movies, it might decide to focus on acquiring or producing more high-quality movies to cater to user preferences.

User Engagement: Understanding the distribution of content types allows Netflix to optimize its user engagement strategies. For example, it can tailor recommendations, personalized playlists, and promotional efforts to align with the predominant content type.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
top_countries = netflix_data['country'].value_counts().head(10)
plt.figure(figsize=(12, 6))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='muted')
plt.title('Top 10 Countries with Most Content')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart to visualize the distribution of content across the top 10 countries because it effectively conveys the relative number of titles for each country. A horizontal bar chart is suitable for displaying the count or frequency of categories, making it easy to compare and rank the top contributors.

##### 2. What is/are the insight(s) found from the chart?

United States Dominance: The chart clearly shows that the United States has the highest number of titles on Netflix among the top 10 countries. This dominance is evident by the significantly longer bar for the United States compared to other countries.

Diversity in Content: While the United States is a major contributor, there is still diversity in the top 10 countries. Other countries, such as India, the United Kingdom, Canada, and Spain, also have a substantial number of titles, indicating a global presence of content on Netflix.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impacts:

Targeted Marketing: Understanding the countries with the most content allows Netflix to tailor its marketing strategies more effectively. They can run targeted campaigns and promotions in regions with a substantial user base, potentially attracting new subscribers and retaining existing ones.

Localization Strategies: Insights into top countries present opportunities for localization strategies. Netflix can invest in creating or acquiring content that resonates with the preferences and cultural nuances of specific regions, enhancing user engagement.
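
A title can list several countries in one comma-separated string, in which case `value_counts()` on the raw column counts each combination as its own category. The sketch below shows one way to count individual countries instead; it assumes comma-separated values and uses a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical toy frame; 'country' may hold several comma-separated names
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["United States, India", "India", "Unknown", "United States"],
})

# Split each string into a list, give every country its own row, trim spaces
countries = (
    toy["country"]
    .str.split(",")
    .explode()
    .str.strip()
)

# Count individual countries, leaving out the 'Unknown' placeholder
country_counts = countries[countries != "Unknown"].value_counts()
print(country_counts)
```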

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='rating', data=netflix_data, palette='pastel')
plt.title('Distribution of Content Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a countplot to visualize the distribution of content ratings because it provides a clear and concise representation of the frequency of each rating category. Countplots are effective for categorical data, making them suitable for analyzing the distribution of content ratings in this case. The use of colors (pastel palette) aids in differentiating between rating categories, and the rotation of x-axis labels ensures better readability. The goal is to provide a quick and straightforward overview of how content is distributed across different rating categories on Netflix.

##### 2. What is/are the insight(s) found from the chart?

The countplot for the distribution of content ratings provides insights into the prevalence of different rating categories on Netflix. From the chart:

TV-MA (Mature Audience) has the highest count, indicating that a significant portion of the content on Netflix is intended for mature audiences. TV-14 (Parents Strongly Cautioned) is the second most common rating. TV-PG (Parental Guidance Suggested) and R (Restricted) ratings also have a notable presence. Other ratings, such as TV-Y (All Children) and TV-G (General Audience), are less common but still contribute to the overall diversity of content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights from the distribution of content ratings can have a positive business impact in several ways:

Content Curation: Understanding the distribution of content ratings helps Netflix curate and recommend content more effectively. It allows the platform to provide personalized recommendations to users based on their preferred content ratings.

Targeted Marketing: Netflix can use this information for targeted marketing strategies. For example, if mature content (TV-MA) is predominant, marketing efforts can be tailored to reach the mature audience demographic.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Pie chart for 'rating' distribution
rating_counts = netflix_data['rating'].value_counts()
plt.pie(rating_counts, labels=rating_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Ratings')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pie chart for the distribution of ratings to provide a visual representation of the proportion of each rating category in the overall dataset. The pie chart is suitable for displaying the distribution of categorical data in a way that highlights the relative sizes of each category. This visualization allows for a quick and easy comparison of the prevalence of different ratings on Netflix.

##### 2. What is/are the insight(s) found from the chart?

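Largest Share of Content is Rated for Mature Audiences:

The "TV-MA" (Mature Audience) category is the largest slice in the pie chart, indicating that a significant portion of content on Netflix is rated for mature audiences. This suggests that a substantial part of the catalog may contain explicit or mature material. Being the largest slice does not by itself mean the category exceeds half of the catalog; the percentage label on the slice gives the exact share.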
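
Whether the largest category is also a majority can be checked from the normalized counts. A minimal sketch with hypothetical ratings (the values in `ratings` are made up and do not reflect the real proportions):

```python
import pandas as pd

# Hypothetical ratings for eight titles
ratings = pd.Series(["TV-MA", "TV-MA", "TV-14", "TV-PG", "R", "TV-MA", "TV-14", "TV-Y"])

# Share of each rating as a percentage of all titles
shares = ratings.value_counts(normalize=True) * 100

# Most common rating, and whether it covers more than half of the titles
top_rating = shares.idxmax()
is_majority = shares.max() > 50

print(shares)
print(top_rating, is_majority)
```

In this toy sample the top rating is the largest category without being a majority, which is the distinction worth checking on the real data.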

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impacts:

Audience Segmentation and Targeting:

Understanding the distribution of ratings allows Netflix to effectively segment its audience based on content preferences. This segmentation can inform targeted content recommendations, enhancing user experience and engagement.

Negative Growth Considerations:

Potential Limited Audience for Certain Ratings:

While the diversity of rating categories is a strength, Netflix should be mindful that content rated for specific audiences may have a more limited viewership. Overemphasis on mature content may exclude younger audiences, impacting potential growth in that demographic.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Pie chart for content types
netflix_data['type'].value_counts().plot.pie(autopct='%1.1f%%', explode=[0, 0.05])
plt.title('Proportion of Content Types')
plt.ylabel('')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is effective for displaying proportions of a whole. In this case, it provides a clear visual representation of the distribution between TV shows and movies on Netflix. The slices of the pie represent the relative sizes of each content type.

##### 2. What is/are the insight(s) found from the chart?

Content Type Distribution:

The pie chart clearly shows the distribution of content types on Netflix, indicating what percentage of the library is dedicated to TV shows and movies.

Dominance of Movies:

The larger slice belongs to movies, which hold a bigger share of the content library than TV shows. The slightly offset (exploded) slice is the second category in the counts, TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

This visual representation helps Netflix stakeholders and decision-makers quickly grasp the balance between TV shows and movies. It supports strategic planning for content acquisition, production, and user engagement, contributing to a positive business impact.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Count of content types
content_type_counts = netflix_data['type'].value_counts()
# Donut chart
plt.figure(figsize=(4, 4))
plt.pie(content_type_counts, labels=content_type_counts.index, wedgeprops=dict(width=0.2))
plt.title('Donut Chart of Content Types')
plt.show()

##### 1. Why did you pick the specific chart?

Focus on Count of Content Types:

A donut chart is chosen to focus specifically on representing the count of content types (TV shows and movies). The simplicity of the chart allows for a straightforward depiction of the numerical distribution.

Visual Appeal:

Donut charts are visually appealing and provide an alternative to traditional pie charts. The center of the donut can be utilized for additional information or aesthetics.

##### 2. What is/are the insight(s) found from the chart?

Content Type Distribution:

The donut chart illustrates the distribution of content types, emphasizing the count of TV shows and movies.

Representation of Both Types:

The chart shows that both TV shows and movies contribute significantly to the content library. The arc length of each donut segment represents its proportional share.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

This donut chart provides a quick numerical overview of the count of TV shows and movies. It can aid decision-makers in understanding the overall composition of the content library, supporting strategic decisions related to content acquisition, user engagement, and platform marketing.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Count of releases per year
release_counts = netflix_data['release_year'].value_counts().sort_index()
# Line chart
plt.figure(figsize=(12, 6))
sns.lineplot(x=release_counts.index, y=release_counts.values, marker='o')
plt.title('Number of Releases Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Number of Releases')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a Line Chart:

Temporal Trend Analysis:

A line chart is chosen to analyze the temporal trend in the number of releases over the years. It is suitable for visualizing data points sequentially.

Yearly Comparison:

The line chart allows for a clear comparison of the number of releases for each year. Viewers can easily identify trends, spikes, or drops over time.

Continuous Data Representation:

Line charts are effective for representing continuous data, such as the distribution of releases across multiple years.

##### 2. What is/are the insight(s) found from the chart?

Release Trend:

The line chart shows the trend in the number of releases over the years. It helps identify whether the content library has been growing, stabilizing, or experiencing fluctuations.

Peak Years:

Peaks in the line indicate years with a higher number of releases. These peak years can be further investigated to understand factors contributing to increased content production.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

The line chart provides a historical perspective on the growth or changes in the number of releases. This insight can guide content acquisition strategies, content planning, and resource allocation over time, contributing to informed decision-making for business success.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Count of titles added per month
release_month_counts = netflix_data['added_month'].value_counts().sort_index()
# Line chart
plt.figure(figsize=(12, 6))
sns.lineplot(x=release_month_counts.index, y=release_month_counts.values, marker='o')
plt.title('Number of Titles Added per Month')
plt.xlabel('Month Added')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

Reasons for Choosing a Line Chart:

Temporal Trend Analysis:

Similar to the previous line chart, this chart uses a line plot to analyze the temporal trend, but specifically focusing on the distribution of releases across different months.

Monthly Comparison:

The line chart allows for a clear comparison of the number of releases for each month. Viewers can easily identify patterns or variations in content additions over the months.

Continuous Data Representation:

Line charts are effective for representing continuous data, such as the distribution of releases across multiple months.

##### 2. What is/are the insight(s) found from the chart?

Monthly Release Patterns:

The line chart shows the distribution of releases across months, helping identify patterns or trends in content additions. For example, are certain months associated with higher or lower content releases?

Seasonal Variations:

Peaks or troughs in specific months may indicate seasonal variations in content additions. This information is valuable for planning content releases based on seasonal preferences or trends.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

The chart can guide content release strategies, allowing the platform to align releases with user preferences, seasonal trends, or other factors influencing viewing behavior. This strategic alignment can enhance user engagement and satisfaction.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Count of content types per year
content_type_counts = netflix_data.groupby(['release_year', 'type']).size().unstack(fill_value=0)
# Stacked bar chart
content_type_counts.plot(kind='bar', stacked=True, figsize=(14, 8), colormap='viridis')
plt.title('Distribution of Content Types Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Number of Releases')
plt.legend(title='Content Type')
plt.show()

##### 1. Why did you pick the specific chart?

Comparison of Content Types:

A stacked bar chart is effective for comparing the distribution of content types (Movies and TV Shows) over different release years.

Temporal Analysis:

The chart provides a temporal analysis of content types, allowing viewers to see how the proportion of Movies and TV Shows has changed over the years.

Year-wise Breakdown:

Each bar represents a release year, and the segments within the bar (stacks) correspond to the count of Movies and TV Shows for that year. This breakdown aids in understanding the composition of content each year.

##### 2. What is/are the insight(s) found from the chart?

Shifts in Content Composition:

Changes in the height and composition of the bars indicate shifts in the proportion of Movies and TV Shows released each year.
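
Because bar heights also change with the total number of titles per year, shifts in proportion are easier to read after converting the counts to per-year shares. A minimal sketch on a hypothetical frame (`toy` is illustrative; the grouping mirrors the chart code above):

```python
import pandas as pd

# Hypothetical toy frame with release year and content type
toy = pd.DataFrame({
    "release_year": [2017, 2017, 2017, 2018, 2018, 2018, 2018],
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show", "TV Show"],
})

# Counts per year and type, as in the stacked bar chart
counts = toy.groupby(["release_year", "type"]).size().unstack(fill_value=0)

# Divide each row by its total so every year sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Plotting `shares` with `kind='bar', stacked=True` would give bars of equal height, so only the composition changes from year to year.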

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Understanding the historical distribution of content types can inform future content acquisition and production strategies. If certain types of content are more popular in specific years, the platform can tailor its content offerings to align with user preferences during those periods.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
top_genre = netflix_data['listed_in'].value_counts().head(10)
plt.figure(figsize=(12, 6))
sns.barplot(x=top_genre.values, y=top_genre.index)
plt.title('Top 10 Most Common Genres')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
plt.show()

##### 1. Why did you pick the specific chart?

Top Genre Comparison:

A horizontal bar chart is chosen to compare the top 10 genres based on the number of occurrences in the dataset.

Visualizing Genre Distribution:

The chart provides a clear visualization of the most popular genres, making it easy to compare their frequencies.


##### 2. What is/are the insight(s) found from the chart?

Dominant Genres:

The chart highlights the genres that are most prevalent in the dataset. Viewers can quickly identify the genres with the highest frequency.

Identification of Popular Genres:

Users can see which genres are most popular on the platform based on the number of titles in each genre.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Knowledge of the most popular genres can influence content acquisition and production decisions. The platform can use this information to invest in genres that have a high viewership, potentially attracting and retaining more subscribers.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
most_filmed_director = netflix_data['director'].value_counts().head(10)
plt.figure(figsize=(12, 6))
sns.barplot(x=most_filmed_director.values, y=most_filmed_director.index)
plt.title('Top 10 Directors by Number of Titles')
plt.xlabel('Number of Titles')
plt.ylabel('Director')
plt.show()


In [None]:
# Chart - 13 visualization code excluding 'Unknown' directors
filtered_directors = netflix_data[netflix_data['director'] != 'Unknown']
most_filmed_director = filtered_directors['director'].value_counts().head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=most_filmed_director.values, y=most_filmed_director.index)
plt.title('Top 10 Directors by Number of Titles (Excluding Unknown)')
plt.xlabel('Number of Titles')
plt.ylabel('Director')
plt.show()

##### 1. Why did you pick the specific chart?

Top Filmed Directors Comparison:

A horizontal bar chart is chosen to compare the top 10 directors based on the number of movies they have directed in the dataset.

Visualizing Director Popularity:

The chart provides a clear visualization of the directors with the highest number of films, making it easy to compare their frequencies.

##### 2. What is/are the insight(s) found from the chart?

Prolific Directors:

The chart highlights directors who have directed a substantial number of movies available on the platform.

Director Diversity:

Users can see a variety of directors in the top 10, showcasing diversity in the contributions of different filmmakers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Knowledge of the most prolific directors can be valuable for content curation and recommendation algorithms. It can also be used in marketing efforts, showcasing movies directed by popular filmmakers to attract viewers.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(6, 4))
# Select only numeric columns for the correlation calculation
numeric_data = netflix_data.select_dtypes(include='number')
sns.heatmap(numeric_data.corr(), annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

Understanding Feature Relationships:

A correlation heatmap is chosen to visually represent the correlation between different numerical features in the dataset.

Identifying Correlations:

The heatmap provides a quick overview of which features have positive, negative, or no correlation with each other.

Correlation Strength:

Colors on the heatmap indicate the strength of correlation, making it easy to identify strong and weak correlations.

##### 2. What is/are the insight(s) found from the chart?

Feature Relationships:

Users can quickly identify which features have a strong correlation, helping understand how variables are related.

Correlation Strength:

The heatmap color intensity helps in gauging the strength of correlation. The color bar next to the heatmap shows which colors correspond to high and low correlation values.
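
The same information can be read numerically by listing each pair of features once, ordered by the absolute value of the correlation. A minimal sketch on hypothetical numeric columns (the values in `toy` are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features similar to those in the notebook
toy = pd.DataFrame({
    "release_year": [2015, 2016, 2017, 2018],
    "added_year": [2016, 2017, 2018, 2019],
    "added_month": [12, 3, 9, 3],
})

# Pearson correlation matrix (the pandas default)
corr = toy.corr()

# Keep only the upper triangle so each pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```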

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Hypothesis testing involves making statements about a population parameter and then using statistical methods to determine if the data provides enough evidence to reject the null hypothesis. Here are three hypothetical statements based on the dataset:

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the average release years of movies and TV shows on Netflix.

Alternative Hypothesis (H1): There is a significant difference in the average release years of movies and TV shows on Netflix.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Separate data for movies and TV shows
movies_data = netflix_data[netflix_data['type'] == 'Movie']
tv_shows_data = netflix_data[netflix_data['type'] == 'TV Show']

# Perform T-Test
t_stat, p_value = ttest_ind(movies_data['release_year'], tv_shows_data['release_year'], equal_var=False)

# Print the p-value
print(f"P-Value for Hypothesis 1: {p_value}")

##### Which statistical test have you done to obtain P-Value?

The statistical test used for Hypothesis 1 is Welch's two-sample t-test (ttest_ind with equal_var=False).

##### Why did you choose the specific statistical test?

A two-sample t-test was chosen because we are comparing the means of two independent groups (movies and TV shows) to determine whether the difference in average release years is statistically significant. Passing equal_var=False selects Welch's variant, which does not assume that the two groups have equal variances.
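
The statistic behind ttest_ind(..., equal_var=False) can be sketched in plain Python as follows. This is an illustrative re-derivation, not the SciPy implementation; the function name welch_t and the sample years are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    va = variance(a) / len(a)   # squared standard error of group a
    vb = variance(b) / len(b)   # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical release years, for illustration only
movies = [2010, 2012, 2014, 2016, 2018]
shows = [2015, 2016, 2017, 2018, 2019]
t_stat, dof = welch_t(movies, shows)
print(t_stat, dof)
```

The p-value is then obtained from the t distribution with dof degrees of freedom, which is the part SciPy handles in the notebook.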

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the average duration between movies and TV shows on Netflix.

Alternate Hypothesis (H1): There is a significant difference in the average duration between movies and TV shows on Netflix.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
print("Unique Values in 'duration' column:")
print(netflix_data['duration'].unique())


In [None]:
# Perform Statistical Test to obtain P-Value
# Assumption: each season is equivalent to approximately 10 episodes
EPISODES_PER_SEASON = 10

# Extract the leading number from values such as '90 min', '1 Season' or '2 Seasons'
duration_value = pd.to_numeric(netflix_data['duration'].astype(str).str.extract(r'(\d+)')[0], errors='coerce')

# Movies keep their length in minutes; TV shows are converted from seasons to approximate episodes.
# Note that the two groups are therefore measured in different units.
is_season = netflix_data['duration'].astype(str).str.contains('Season')
netflix_data['duration_numeric'] = duration_value.where(~is_season, duration_value * EPISODES_PER_SEASON)

# Separate data for movies and TV shows
movies_data = netflix_data[netflix_data['type'] == 'Movie']
tv_shows_data = netflix_data[netflix_data['type'] == 'TV Show']

# Perform T-Test
t_stat, p_value = ttest_ind(movies_data['duration_numeric'], tv_shows_data['duration_numeric'], equal_var=False)

# Print the p-value
print(f"P-Value for Hypothesis 2: {p_value}")

##### Which statistical test have you done to obtain P-Value?

We performed an independent two-sample t-test (Welch's variant, since equal_var=False) to compare the average duration of movies and TV shows on Netflix.

##### Why did you choose the specific statistical test?

The independent two-sample t-test is appropriate for comparing the means of two independent groups, in this case movies and TV shows. It assesses whether the difference in average duration between the two categories is statistically significant.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant association between the content rating (TV-MA, R, PG-13, etc.) and the type of content (movies or TV shows) on Netflix.

Alternate Hypothesis (H1): There is a significant association between the content rating and the type of content on Netflix.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
print("Unique Values in 'rating' column:")
print(netflix_data['rating'].unique())

In [None]:
from scipy.stats import chi2_contingency

# Create a contingency table
contingency_table = pd.crosstab(netflix_data['type'], netflix_data['rating'])

# Perform the chi-squared test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

# Print the p-value
print(f"P-Value for Hypothesis 3: {p_value}")

##### Which statistical test have you done to obtain P-Value?

We perform a chi-squared test of independence. This test is appropriate for examining the association between two categorical variables, in this case, the content rating and the type of content (movies or TV shows).

##### Why did you choose the specific statistical test?

I chose the chi-squared test of independence for Hypothetical Statement 3 because it is suitable for examining the association between two categorical variables. In this case, we want to determine if there is a significant association between the content rating (a categorical variable) and the type of content (movies or TV shows, also a categorical variable) on Netflix.
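
To make the test statistic concrete, here is a minimal plain-Python sketch of the chi-squared computation on a small contingency table. The function name chi2_statistic and the counts are hypothetical; in the notebook, chi2_contingency from SciPy does this work and also returns the p-value.

```python
def chi2_statistic(table):
    """Return (chi-squared statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows = type (Movie, TV Show), columns = three ratings
table = [[20, 30, 50],
         [30, 20, 50]]
chi2_value, chi2_dof = chi2_statistic(table)
print(chi2_value, chi2_dof)
```

For 2x2 tables, chi2_contingency applies Yates' continuity correction by default, so its statistic can differ from this uncorrected formula.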

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Check unique values for each variable; columns where almost every value is unique can be dropped.
netflix_data.nunique()

In [None]:
# Dropping title, director and cast because they have too many unique values
netflix_data.drop(['title', 'director', 'cast'], axis=1, inplace=True)


In [None]:
# Keep show_id in a variable for further use
a = netflix_data['show_id']

In [None]:
# Dropping show_id because every value is unique
netflix_data.drop(['show_id'], axis=1, inplace=True)

### 1. Handling Missing Values

In [None]:
# To see the null or missing values
netflix_data.isna().sum()


In [None]:
# Handling Missing Values & Missing Value Imputation
# Drop rows with missing values in critical columns or use imputation techniques based on the context
netflix_data.dropna(subset=['date_added'], inplace=True)


In [None]:
# Impute missing ratings with the mode (most frequent value)
netflix_data['rating'] = netflix_data['rating'].fillna(netflix_data['rating'].mode()[0])

In [None]:
# To re-check the null or missing values
netflix_data.isna().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Categorical Columns ('country'): Filled missing values with 'Unknown'.

Reason: For categorical columns representing country filling missing values with 'Unknown' is a common approach. This placeholder indicates that the information is unknown or not available. It's a simple and interpretable method for handling missing categorical data.

Date Column ('date_added'): Dropped rows with missing values.

Reason: Rows with a missing 'date_added' value were dropped. The alternatives would be to keep these rows or to impute the missing dates by some criterion; the right choice depends on how many rows are affected and how important 'date_added' is to the analysis.

Categorical Column ('rating'): Filled missing values with the mode (most frequent value) of the column.

Reason: Filling missing values in the 'rating' column with the mode is a common strategy for categorical data. The mode represents the most frequently occurring value, providing a reasonable estimate for the missing values.
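
The two imputation strategies described above can be sketched in plain Python, independent of pandas. The helper names and the sample lists are hypothetical, and None stands in for a missing value.

```python
from collections import Counter

def fill_with_placeholder(values, placeholder="Unknown"):
    """Replace missing entries with a fixed placeholder label."""
    return [placeholder if v is None else v for v in values]

def fill_with_mode(values):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

# Hypothetical columns, for illustration only
countries = ["India", None, "United States"]
ratings = ["TV-MA", None, "TV-14", "TV-MA", None]

filled_countries = fill_with_placeholder(countries)
filled_ratings = fill_with_mode(ratings)
print(filled_countries)
print(filled_ratings)
```

The placeholder keeps missingness visible as its own category, whereas mode imputation hides it by merging those rows into the largest class.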

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Calculate the IQR (Interquartile Range)
Q1 = netflix_data['release_year'].quantile(0.25)
Q3 = netflix_data['release_year'].quantile(0.75)
IQR = Q3 - Q1
print('IQR value is', IQR)

In [None]:
# Define upper and lower bounds to identify outliers
lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)
print('Lower fence is', lower_bound)
print('Upper fence is', upper_bound)

In [None]:
# Identify and potentially remove outliers
outliers = netflix_data[(netflix_data['release_year'] < lower_bound) | (netflix_data['release_year'] > upper_bound)]

In [None]:
# Remove outliers from the dataset
data = netflix_data[(netflix_data['release_year'] >= lower_bound) & (netflix_data['release_year'] <= upper_bound)]

In [None]:
# Boxplot after removing outliers
data.boxplot(column=['release_year'], grid=False)
plt.title('Outliers removed in release year')
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

Outlier Treatment Techniques Used:

IQR (Interquartile Range) Method:

Description: The Interquartile Range (IQR = Q3 - Q1) was calculated for the 'release_year' feature, and values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR were treated as outliers.

Reasoning: The IQR method is robust because quartiles are much less sensitive to extreme values than the mean and standard deviation. It is therefore suitable for datasets where the distribution is not necessarily normal.

Note that the filtered rows are stored in a separate DataFrame (data); the later steps in this notebook continue to use netflix_data, which still contains the outliers.
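
The fence calculation can be illustrated with the standard library alone. This is a minimal sketch: the function name iqr_fences and the sample years are hypothetical, and method="inclusive" is chosen because it uses the same linear interpolation as the default pandas quantile.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) outlier fences based on the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical release years, for illustration only
years = [1950, 2012, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
lower, upper = iqr_fences(years)
kept = [y for y in years if lower <= y <= upper]
print(lower, upper, kept)
```

Here Q1 is 2014 and Q3 is 2018, so the fences are 2008 and 2024 and only the 1950 title is flagged.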

### 3. Categorical Encoding

In [None]:
netflix_data.dtypes

In [None]:
# Encode your categorical columns
num_cols=["release_year","added_day","added_month","added_year","duration_numeric"]
cat_cols=["type","country","rating","duration","listed_in","description"]



In [None]:
#changing object to category data type
netflix_data[cat_cols] = netflix_data[cat_cols].astype("category")


In [None]:
netflix_data.dtypes

#### What all categorical encoding techniques have you used & why did you use those techniques?

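The code above only converts the object columns to the pandas category data type; no integer or one-hot encoding is applied at this point. The actual encoding happens later in the notebook: the target column 'type' is label encoded with LabelEncoder, and the categorical feature columns are one-hot encoded with OneHotEncoder.

In label encoding, each category is assigned a unique integer label. This is suitable for a binary target or for ordinal categories with a meaningful order. For nominal features it should be used with caution, because some machine learning algorithms interpret the integer labels as meaningful numeric distances.

Reasons for Using Label Encoding:

Suitable for ordinal categorical variables with a clear ranking.

Reduces the dimensionality of the data compared to one-hot encoding.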
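
The difference between the two encodings can be sketched in plain Python. The helper names and the sample ratings are hypothetical; the notebook itself relies on LabelEncoder and OneHotEncoder from scikit-learn.

```python
def label_encode(values):
    """Map each category to an integer, using the sorted category order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

def one_hot_encode(values, drop_first=False):
    """Map each category to a binary indicator row; optionally drop the first category."""
    classes = sorted(set(values))
    kept = classes[1:] if drop_first else classes
    return [[1 if v == c else 0 for c in kept] for v in values], kept

# Hypothetical ratings, for illustration only
ratings = ["PG", "R", "TV-MA", "PG"]
codes, classes = label_encode(ratings)
onehot, columns = one_hot_encode(ratings, drop_first=True)
print(codes, classes)
print(onehot, columns)
```

Label encoding yields one column whose integers imply an order, while one-hot encoding yields one column per category (minus one when the first is dropped) with no implied order.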

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
contractions_dict = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hasn't": "has not",
    "haven't": "have not",
    "he's": "he is",
    "I'll": "I will",
    "I'm": "I am",
    "isn't": "is not",
    "it's": "it is",
    "let's": "let us",
    "mustn't": "must not",
    "shan't": "shall not",
    "she's": "she is",
    "shouldn't": "should not",
    "that's": "that is",
    "there's": "there is",
    "they're": "they are",
    "wasn't": "was not",
    "we'll": "we will",
    "we're": "we are",
    "weren't": "were not",
    "what's": "what is",
    "where's": "where is",
    "who's": "who is",
    "won't": "will not",
    "wouldn't": "would not",
    "you'll": "you will",
    "you're": "you are",
    "you've": "you have"
}
# Replace NaN values in 'description' with an empty string
# (cast back to object first, since the column was converted to category above)
netflix_data['description'] = netflix_data['description'].astype(object).fillna('')

def expand_contractions(text):
    for contraction, expansion in contractions_dict.items():
        text = text.replace(contraction, expansion)
    return text

# Apply the function to the 'description' column
netflix_data['description_expanded'] = netflix_data['description'].apply(expand_contractions)

# Display the first few rows
netflix_data[['description', 'description_expanded']].head()

#### 2. Lower Casing

In [None]:
# Lower Casing
# Lowercasing the 'description_expanded' column
netflix_data['description_expanded_lower'] = netflix_data['description_expanded'].str.lower()

# Display the first few rows
netflix_data[['description_expanded', 'description_expanded_lower']].head()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import re

# Remove punctuations from the 'description_expanded_lower' column
netflix_data['description_no_punctuations'] = netflix_data['description_expanded_lower'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

# Display the first few rows
netflix_data[['description_expanded_lower', 'description_no_punctuations']].head()

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits
# Remove URLs from the 'description_no_punctuations' column
netflix_data['description_no_urls'] = netflix_data['description_no_punctuations'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))

# Remove words containing digits from the 'description_no_urls' column
netflix_data['description_no_digits'] = netflix_data['description_no_urls'].apply(lambda x: re.sub(r'\b\w*[0-9]+\w*\b', '', x))

# Display the first few rows
netflix_data[['description_no_punctuations', 'description_no_urls', 'description_no_digits']].head()


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords if not already downloaded
nltk.download('stopwords')

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Define a function to remove stopwords
def remove_stopwords(text):
    words = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(words)

# Apply the function to the 'description_no_digits' column
netflix_data['description_no_stopwords'] = netflix_data['description_no_digits'].apply(remove_stopwords)

# Display the first few rows
netflix_data[['description_no_digits', 'description_no_stopwords']].head()

In [None]:
# Remove White spaces
# Define a function to remove white spaces
def remove_white_spaces(text):
    return ' '.join(text.split())

# Apply the function to the 'description_no_stopwords' column
netflix_data['description_cleaned'] = netflix_data['description_no_stopwords'].apply(remove_white_spaces)

# Display the first few rows
netflix_data[['description_no_stopwords', 'description_cleaned']].head()

#### 6. Rephrase Text

In [None]:
# Rephrase Text
import random

# Function to rephrase text
def rephrase_text(text):
    words = text.split()
    for i in range(len(words)):
        if random.choice([True, False]):
            words[i] = "synonym_of_" + words[i]  # Replace with synonym or modify as needed
    return ' '.join(words)

# Apply the function to the 'description_cleaned' column
netflix_data['description_rephrased'] = netflix_data['description_cleaned'].apply(rephrase_text)

# Display the first few rows
netflix_data[['description_cleaned', 'description_rephrased']].head()

Rephrasing Text:

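Purpose: Rephrasing rewrites the text so that the same content is expressed differently, for example to improve clarity.

The function above is only a placeholder: it randomly prefixes words with 'synonym_of_' rather than substituting real synonyms, so its output is not suitable as input for later steps without a real synonym source.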

#### 7. Tokenization

In [None]:

# Download the 'punkt' resource
nltk.download('punkt')

In [None]:
# Tokenization
from nltk.tokenize import word_tokenize

# Function to tokenize text
def tokenize_text(text):
    return word_tokenize(text)

# Apply the function to the 'description_cleaned' column
# ('description_rephrased' contains placeholder 'synonym_of_' tokens, so it is not used here)
netflix_data['description_tokenized'] = netflix_data['description_cleaned'].apply(tokenize_text)

# Display the first few rows
netflix_data[['description_cleaned', 'description_tokenized']].head()

Tokenization:

Purpose: Tokenization involves breaking text into individual words or tokens.

Tokenization is a fundamental step in natural language processing, enabling the analysis of individual words.

#### 8. Text Normalization

In [None]:
# Download the 'wordnet' resource
nltk.download('wordnet')


In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import WordNetLemmatizer

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to lemmatize a list of tokens
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# Apply the function to the 'description_tokenized' column
netflix_data['description_lemmatized'] = netflix_data['description_tokenized'].apply(lemmatize_tokens)

# Display the first few rows
netflix_data[['description_tokenized', 'description_lemmatized']].head()


##### Which text normalization technique have you used and why?

The text normalization technique used in the provided code is Lemmatization. Specifically, the code utilizes the WordNet Lemmatizer from the Natural Language Toolkit (NLTK) library to lemmatize the tokens in the 'description_tokenized' column.

Lemmatization:

Purpose: Lemmatization is the process of reducing words to their base or root form (lemma). It involves considering the meaning of a word and transforming it to a common base form.

Lemmatization is preferred over stemming in some cases because it produces valid words and retains the base meaning of words. It helps in reducing inflected words to a common form, which can be beneficial for text analysis, information retrieval, and other natural language processing tasks. Lemmatization improves the interpretability of text data and can be particularly useful in tasks where word semantics are crucial.

#### 9. Part of speech tagging

In [None]:
# Download the 'averaged_perceptron_tagger' resource
nltk.download('averaged_perceptron_tagger')


In [None]:
# POS Tagging
from nltk import pos_tag

# Function to perform POS tagging on a list of tokens
def pos_tagging(tokens):
    return pos_tag(tokens)

# Apply the function to the 'description_tokenized' column
netflix_data['description_pos_tags'] = netflix_data['description_tokenized'].apply(pos_tagging)

# Display the first few rows
netflix_data[['description_tokenized', 'description_pos_tags']].head()

Part of Speech (POS) Tagging:

Purpose: POS tagging assigns a grammatical category (such as noun, verb, adjective) to each word in a text.

POS tagging provides information about the grammatical structure of the text, aiding in more advanced linguistic analyses.

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# 'description_tokenized' holds a list of tokens per row; join them back into text for CountVectorizer
netflix_data['description_tokenized_text'] = netflix_data['description_tokenized'].apply(lambda x: ' '.join(x))

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the 'description_tokenized_text' column (the result is a sparse matrix)
description_bow = vectorizer.fit_transform(netflix_data['description_tokenized_text'])

# Convert the result to a dense array (this can be memory-hungry for a large vocabulary)
description_bow_array = description_bow.toarray()

# Mean token count per description across the whole vocabulary
# (a single number per row, not a vector)
average_word_vector = np.mean(description_bow_array, axis=1)

# Create a DataFrame aligned with the rows of netflix_data
average_word_vector_df = pd.DataFrame({'average_word_vector': average_word_vector}, index=netflix_data.index)

# Display the DataFrame
average_word_vector_df.head()

##### Which text vectorization technique have you used and why?

The technique used is Bag of Words (BoW), implemented with CountVectorizer from scikit-learn. BoW is a simple and effective text vectorization technique that represents text data as a sparse matrix of word counts, with one row per description and one column per vocabulary word.

Simplicity and Effectiveness: BoW is a straightforward and efficient method for converting text into numerical features. It disregards the order of words and focuses on their frequency, making it suitable for various natural language processing tasks.
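
To show what the BoW matrix looks like, here is a minimal plain-Python sketch. The function name bag_of_words and the two sample descriptions are hypothetical; CountVectorizer additionally lowercases, tokenizes with its own pattern, and returns a sparse matrix.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) with one row per document."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

# Hypothetical cleaned descriptions, for illustration only
docs = ["detective hunts killer", "killer hunts detective detective"]
vocab, matrix = bag_of_words(docs)
print(vocab)
print(matrix)
```

Word order is discarded: both rows share the same vocabulary columns and differ only in how often each word occurs.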


### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
netflix_data.columns


In [None]:
# Keep only the numerical columns before calculating the correlation
numerical_data = netflix_data.select_dtypes(include=['number'])

plt.figure(figsize=(6,4))
sns.heatmap(numerical_data.corr(), annot=True)
plt.show()

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# List of columns to drop
columns_to_drop = [ 'description',
    'description_expanded',
    'description_expanded_lower',
    'description_no_punctuations',
    'description_no_urls',
    'description_no_digits',
    'description_no_stopwords',
    'description_cleaned',
    'description_rephrased',
    'description_tokenized',
    'description_lemmatized',
    'description_pos_tags'
]
# Drop the specified columns
netflix_data = netflix_data.drop(columns=columns_to_drop, axis=1)

In [None]:
# Display the updated DataFrame
netflix_data

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

No further data transformation is needed, because the following steps are already covered elsewhere in this notebook:

Data Cleaning:

Handling missing values in columns like date_added, duration, etc. Dropping unnecessary columns like show_id and description.

Data Wrangling:

Converting the date_added column to datetime format. Creating a new column duration_numeric based on the numeric representation of the duration.

Statistical Analysis:

Conducting hypothesis testing to derive insights from the dataset.

Feature Engineering:

Creating new features like added_day, added_month, and added_year from the date_added column. Converting categorical columns to category data type for better representation.

Handling Outliers:

Identifying and handling outliers in the release_year and duration columns.

Data Normalization:

Scaling numerical features using StandardScaler.

Textual Data Preprocessing:

Tokenization and lemmatization of the description column.

Feature Manipulation:

Standardizing numerical features using StandardScaler. Encoding categorical columns, such as type, country, rating, duration, listed_in, and description.

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Splitting up of data
X = netflix_data.drop(["type"], axis = 1)

In [None]:
y = netflix_data[["type"]]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

num_cols = X_train.select_dtypes(include=['float64', 'int64']).columns

In [None]:
scaler = StandardScaler()
scaler.fit(X_train[num_cols])

In [None]:
X_train_std = scaler.transform(X_train[num_cols])
X_test_std = scaler.transform(X_test[num_cols])

In [None]:
print(X_train_std.shape)
print(X_test_std.shape)

##### Which method have you used to scale you data and why?

StandardScaler from scikit-learn was used to scale the numerical columns. It standardizes each feature by removing the mean and scaling to unit variance. It does not require the data to be Gaussian, although it works best when features are roughly symmetric and free of extreme outliers. The scaler is fitted on the training set only and then applied to both the training and test sets, so no information from the test set leaks into the scaling parameters. Scaling is essential for many machine learning algorithms, especially those that are sensitive to the scale of input features, such as distance-based algorithms.
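
The standardization formula z = (x - mean) / std can be sketched with the standard library. The helper names and sample values are hypothetical; pstdev is used because StandardScaler divides by the population standard deviation.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn the mean and population standard deviation from the training data."""
    return mean(train_values), pstdev(train_values)

def transform(values, mu, sigma):
    """Apply z = (x - mu) / sigma using previously learned parameters."""
    return [(v - mu) / sigma for v in values]

# Hypothetical release years, for illustration only
train = [2010, 2014, 2018]
test = [2016]

mu, sigma = fit_scaler(train)            # parameters come from the training set only
train_std = transform(train, mu, sigma)
test_std = transform(test, mu, sigma)    # the test set reuses the training parameters
print(train_std, test_std)
```

The scaled training values have mean 0 and unit variance; the test values generally do not, because they are scaled with the training statistics.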

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

No need





### 8. Data Splitting

In [None]:
print(X.shape, y.shape)

In [None]:
# Print the shapes of the training and testing sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape


In [None]:
y_train.value_counts()


In [None]:
y_train.value_counts(normalize=True)*100

In [None]:
y_test.value_counts(normalize=True)*100

##### What data splitting ratio have you used and why?

The data has been split into training and testing sets with a ratio of approximately 80% for training and 20% for testing. This is a common and reasonable split, and the specific ratio depends on factors like the size of your dataset, the complexity of your model, and your specific use case.

A common practice is the 80-20 split, where 80% of the data is used for training to ensure the model learns patterns in the majority of the data, and 20% is reserved for testing to evaluate how well the model generalizes to new, unseen data. However, other ratios like 70-30 or 75-25 are also used depending on the scenario.
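
The mechanics of an 80-20 split can be sketched in plain Python. The function name split_rows is hypothetical, and the shuffle here will not reproduce the exact rows chosen by train_test_split, even with the same seed value.

```python
import math
import random

def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical row identifiers, for illustration only
rows = list(range(10))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which is what random_state=42 achieves in the notebook.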

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

In this context, if the 'type' column has only two categorical values ('TV Show' and 'Movie'), and the distribution between these two categories is significantly skewed, then the dataset is indeed imbalanced. Imbalanced datasets can sometimes lead to biased model training, as the model might become more proficient at predicting the majority class and less effective at predicting the minority class.
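
Whether the split is "significantly skewed" can be checked by computing the class shares directly. A minimal sketch, with a hypothetical function name class_shares and made-up labels rather than the real dataset counts:

```python
from collections import Counter

def class_shares(labels):
    """Return the percentage of rows belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in counts.items()}

# Hypothetical labels, for illustration only
labels = ["Movie"] * 7 + ["TV Show"] * 3
shares = class_shares(labels)
imbalance_ratio = max(shares.values()) / min(shares.values())
print(shares, imbalance_ratio)
```

This mirrors what y_train.value_counts(normalize=True)*100 reports in the notebook; the further the ratio is from 1, the stronger the imbalance.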

In [None]:
from sklearn.preprocessing import LabelEncoder

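In [None]:
# Fit a single encoder on the training labels and reuse it for the test labels,
# so both sets share the same class-to-integer mapping
label_enc = LabelEncoder()
y_train = y_train.copy()
y_train['type_enc'] = label_enc.fit_transform(y_train['type'])
y_train[['type', 'type_enc']]


In [None]:
y_train = y_train.drop(columns=["type"])
y_train


In [None]:
y_test = y_test.copy()
y_test['type_enc'] = label_enc.transform(y_test['type'])
y_test[['type', 'type_enc']]


In [None]:
y_test = y_test.drop(columns=["type"])
y_test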

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
cat_cols = X_train.select_dtypes(include=['category']).columns


In [None]:
# Initialize the OneHotEncoder; handle_unknown='ignore' encodes categories not seen during fit as all zeros
# (combining drop='first' with handle_unknown='ignore' is only supported in recent scikit-learn releases)
enc = OneHotEncoder(drop='first', handle_unknown='ignore')

# Fit on the categorical feature columns of the training set, then apply the same encoding to the test set
X_train_ohe = enc.fit_transform(X_train[cat_cols]).toarray()
X_test_ohe = enc.transform(X_test[cat_cols]).toarray()

In [None]:
print(X_train_ohe.shape)
print(X_test_ohe.shape)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

One-hot encoding was applied to the categorical feature columns, not to the target. It converts each category into a binary column, and drop='first' removes one column per feature to avoid multicollinearity. One-hot encoding does not address class imbalance. The imbalance in the 'type' column is handled in the model section with SMOTE (Synthetic Minority Over-sampling Technique), which adds synthetic minority-class samples to the training set only.

In [None]:
X_train_con = np.concatenate([X_train_std, X_train_ohe], axis=1)
X_test_con = np.concatenate([X_test_std, X_test_ohe], axis=1)

In [None]:
print(X_train_con.shape)
print(X_test_con.shape)

## ***7. ML Model Implementation***

In [None]:
def evaluate_model(act, pred):
    from sklearn.metrics import confusion_matrix,classification_report, accuracy_score, recall_score, precision_score, f1_score
    print("Confusion Matrix \n", confusion_matrix(act, pred))
    print(classification_report(act,pred))
    print("Accuracy : ", accuracy_score(act, pred))
    print("Recall   : ", recall_score(act, pred,average='weighted'))
    print("Precision: ", precision_score(act, pred, average='weighted'))
    print("F1_score : ", f1_score(act, pred, average='weighted'))

In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=123)
X_train_sm, y_train_sm = smote.fit_resample(X_train_con, y_train)

In [None]:
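# Class counts before and after SMOTE (print both, since a cell only displays its last expression)
print(np.unique(y_train, return_counts=True))
print(np.unique(y_train_sm, return_counts=True))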


### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LogisticRegression
m1 = LogisticRegression()

# Fit the Algorithm
m1.fit(X_train_con, y_train)

# Predict on the model
train_pred_lr = m1.predict(X_train_con)
test_pred_lr = m1.predict(X_test_con)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print("--Train--")
evaluate_model(y_train, train_pred_lr)
print("--Test--")
evaluate_model(y_test, test_pred_lr)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
m2 = LogisticRegression(solver='saga',penalty='l2', max_iter=1000)

# Fit the Algorithm
m2.fit(X_train_sm, y_train_sm)

# Predict on the model
train_pred_lr_hp = m2.predict(X_train_con)
test_pred_lr_hp = m2.predict(X_test_con)

In [None]:
# Visualizing evaluation Metric Score chart
print("--Train--")
evaluate_model(y_train, train_pred_lr_hp)
print("--Test--")
evaluate_model(y_test, test_pred_lr_hp)

##### Which hyperparameter optimization technique have you used and why?

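No automated search (GridSearchCV, RandomizedSearchCV, etc.) was run for this model. The parameters below were set manually, and the model was trained on the SMOTE-resampled training data.

solver='saga': Specifies the algorithm to use in the optimization problem. The 'saga' solver is often suitable for large datasets.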

penalty='l2': Indicates the type of regularization term to be applied. 'l2' refers to the Ridge regularization.

max_iter=1000: Defines the maximum number of iterations taken for the solver to converge.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Comparison:

Accuracy: The accuracy is slightly lower after hyperparameter tuning, but it remains high, indicating that the model generalizes well to the test set.

Recall: Recall values are still close to 1, suggesting that the models effectively identify positive instances.

Precision: Precision values are also close to 1, indicating a low rate of false positives.

F1-score: F1-scores are high, reflecting a good balance between precision and recall.

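In summary, while there is a slight decrease in accuracy after hyperparameter tuning, the model still performs exceptionally well on both the training and testing sets. The differences are subtle, and the impact on overall model performance seems minimal. Two caveats apply. First, the tuned model differs from the baseline in both its settings and its training data (SMOTE-resampled), so the comparison does not isolate the effect of the hyperparameters. Second, the feature matrix still contains the duration columns, whose values (for example '90 min' versus '1 Season') directly reveal the content type, which likely explains the near-perfect scores.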

### ML Model - 2

In [None]:
# ML Model - 2 Implementation
from sklearn.ensemble import RandomForestClassifier
m3 = RandomForestClassifier()

# Fit the Algorithm
m3.fit(X_train_con, y_train)

# Predict on the model
train_pred_rf = m3.predict(X_train_con)
test_pred_rf = m3.predict(X_test_con)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print("--Train--")
evaluate_model(y_train, train_pred_rf)
print("--Test--")
evaluate_model(y_test, test_pred_rf)


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
param_grid = {"n_estimators" : [100,300,700],
              "max_depth" : [3,5,7,11],
              "max_features" : [3,5,7,9],
              "min_samples_leaf" : [2,4,6]}


In [None]:
#Returning the best combination of parameters
#specifing the no.of folds
#m4 = RandomForestClassifier()
#from sklearn.model_selection import GridSearchCV
#m4 = GridSearchCV(m4,param_grid,cv=5)
#m4.fit(X_train_con,y_train)

In [None]:
#m4.best_params_

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
m4 = RandomForestClassifier(max_depth=11,n_estimators=100,max_features=9,min_samples_leaf=2)

# Fit the Algorithm
m4.fit(X_train_sm, y_train_sm)

# Predict on the model
train_pred_rf_hp = m4.predict(X_train_sm)
test_pred_rf_hp = m4.predict(X_test_con)

In [None]:
# Visualizing evaluation Metric Score chart
print("--Train--")
evaluate_model(y_train_sm, train_pred_rf_hp)
print("--Test--")
evaluate_model(y_test, test_pred_rf_hp)


##### Which hyperparameter optimization technique have you used and why?

max_depth: Maximum depth of the trees

n_estimators: Number of trees in the forest

max_features: Maximum number of features considered for splitting a node

min_samples_leaf: Minimum number of samples required to be at a leaf node

The hyperparameters were tuned manually: the values passed to `RandomForestClassifier` were set by hand, each one taken from the ranges listed in `param_grid`. A `GridSearchCV` run over that grid with 5-fold cross-validation is set up above but left commented out, so the reported results come from the hand-picked values. Manual tuning is reasonable when there is prior knowledge of how each hyperparameter affects the model, but it does not guarantee the best combination in the grid.
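As an alternative to hand-picked values, a cross-validated grid search can select them automatically. The sketch below is illustrative only: it uses synthetic data from `make_classification` and a deliberately small grid (`small_grid`, a hypothetical name) so that it runs quickly. In this notebook the inputs would presumably be `X_train_con`, `y_train`, and `param_grid`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, not the Netflix features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# Small illustrative grid so the search finishes in a few seconds
small_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5],
    "min_samples_leaf": [2, 4],
}

# 3-fold cross-validation, scored with the weighted F1 used elsewhere in the notebook
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_grid,
    cv=3,
    scoring="f1_weighted",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV score:", round(search.best_score_, 3))
```

`search.best_estimator_` is already refitted on the full training data, so it can be used for prediction directly.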

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

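No. Both train and test accuracy decreased slightly after tuning: for Random Forest, train accuracy went from 1.000 to 0.994 and test accuracy from 0.9987 to 0.9891 (see the metrics table near the end of the notebook). The untuned model fits the training set perfectly, while the tuned model limits tree depth and leaf size, which lowers both scores. The two runs are also not strictly comparable, because the tuned model was fitted on `X_train_sm` and the baseline on `X_train_con`.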

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Accuracy:

Indication: It measures the overall correctness of the model predictions.

Business Impact: High accuracy indicates that the model is making correct predictions, which is generally desirable. However, in imbalanced datasets, accuracy alone may not provide a complete picture, especially if one class dominates the dataset. For example, if a model predicts the majority class all the time, it could still achieve high accuracy but may not be useful.

F1-Score:

Indication: F1-Score is the harmonic mean of precision and recall. It accounts for both false positives and false negatives.

Business Impact: F1-Score is beneficial when there is an uneven class distribution. It balances precision and recall and is useful in scenarios where both false positives and false negatives are equally important.
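A small worked example may make both points concrete. The labels below are invented for illustration (95 negatives, 5 positives). They show that a predictor which always outputs the majority class scores high accuracy with zero recall, and that the F1-score equals the harmonic mean of precision and recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy imbalanced labels (assumption): 95 negatives followed by 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)
acc_majority = accuracy_score(y_true, y_majority)
rec_majority = recall_score(y_true, y_majority, zero_division=0)
print(f"Majority baseline: accuracy={acc_majority:.2f}, recall={rec_majority:.2f}")

# A model with 2 false positives and 1 missed positive
y_pred = y_true.copy()
y_pred[:2] = 1   # two negatives wrongly flagged as positive
y_pred[95] = 0   # one positive missed

p = precision_score(y_true, y_pred)   # 4 true positives / 6 predicted positives
r = recall_score(y_true, y_pred)      # 4 true positives / 5 actual positives
f1 = f1_score(y_true, y_pred)
f1_manual = 2 * p * r / (p + r)       # harmonic mean of precision and recall
print(f"precision={p:.3f}, recall={r:.3f}, f1={f1:.3f}, manual f1={f1_manual:.3f}")
```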

### ML Model - 3

In [None]:
# ML Model - 3 Implementation: Support Vector Machine with a linear kernel
from sklearn import svm
m5 = svm.SVC(kernel='linear')

# Fit the Algorithm
m5 = m5.fit(X_train_con,y_train)

# Predict on the model
train_pred_svm = m5.predict(X_train_con)
test_pred_svm = m5.predict(X_test_con)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print("--Train--")
evaluate_model(y_train, train_pred_svm)
print("--Test--")
evaluate_model(y_test, test_pred_svm)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with manually chosen hyperparameters (RBF kernel, C=10)
m6 = svm.SVC(kernel='rbf',C=10)

# Fit the Algorithm
m6.fit(X_train_sm, y_train_sm)

# Predict on the model
train_pred_svm1 = m6.predict(X_train_sm)
test_pred_svm1 = m6.predict(X_test_con)

In [None]:
# Visualizing evaluation Metric Score chart
print("--Train--")
evaluate_model(y_train_sm, train_pred_svm1)
print("--Test--")
evaluate_model(y_test, test_pred_svm1)


##### Which hyperparameter optimization technique have you used and why?

Two Support Vector Machine (SVM) configurations were compared, with the hyperparameters set manually rather than searched.

Baseline: svm.SVC(kernel='linear')

The linear kernel suits data that is linearly separable. This model achieved high accuracy on both the training and test sets.

Tuned: svm.SVC(kernel='rbf', C=10)

The radial basis function (RBF) kernel can model non-linear decision boundaries, and the regularization parameter C was manually set to 10. This model also achieved high accuracy on both the training and test sets.
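One way to compare the two kernels on equal footing is to score both with the same cross-validation folds. The sketch below uses synthetic data from `make_classification` as a stand-in, so the scores it prints say nothing about the Netflix features. The `candidates` and `cv_scores` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The two configurations described above
candidates = {
    "linear": SVC(kernel="linear"),
    "rbf, C=10": SVC(kernel="rbf", C=10),
}

cv_scores = {}
for name, model in candidates.items():
    # Scaling inside the pipeline keeps each fold's scaler fitted on its training part only
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```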

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Comparison:

The accuracy on the test set slightly decreased after hyperparameter tuning (from 99.87% to 99.81%).

However, the differences are minimal, and both models exhibit excellent performance.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?


In [None]:
performance_columns = ['Model name', 'Train accuracy', 'Train precision', 'Train recall','Train F1_score',
                       'Test accuracy', 'Test precision', 'Test recall','Test F1_score']
performance_comparison = pd.DataFrame(columns=performance_columns)

In [None]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def add_to_perform_compare_df(df, model_name, train_actual, train_predict, test_actual, test_predict):
    """Return df with one extra row of weighted train/test metrics for model_name."""

    train_accuracy = accuracy_score(train_actual, train_predict)
    test_accuracy = accuracy_score(test_actual, test_predict)

    train_recall = recall_score(train_actual, train_predict, average='weighted')
    test_recall = recall_score(test_actual, test_predict, average='weighted')

    train_precision = precision_score(train_actual, train_predict, average='weighted')
    test_precision = precision_score(test_actual, test_predict, average='weighted')

    train_f1 = f1_score(train_actual, train_predict, average='weighted')
    test_f1 = f1_score(test_actual, test_predict, average='weighted')

    # DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate
    row = pd.DataFrame([[model_name, train_accuracy, train_precision, train_recall, train_f1,
                         test_accuracy, test_precision, test_recall, test_f1]],
                       columns=df.columns)
    return row if df.empty else pd.concat([df, row], ignore_index=True)



Precision, Recall, and F1-Score:

Consideration: These metrics are valuable when the class distribution is imbalanced. Precision measures how many predicted positives are correct, recall measures how many actual positives the model finds, and the F1-score balances the two.

Use Case: They are especially relevant when false positives and false negatives carry different costs.
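One detail worth knowing when reading the comparison table: with `average='weighted'`, recall is always equal to accuracy, because each class's recall is weighted by its share of the true labels. Macro averaging treats every class equally and is more sensitive to a weak minority class. The toy labels below are invented to illustrate the difference.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels (assumption): 8 negatives, 2 positives, one positive missed
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.array([0] * 8 + [0, 1])

acc = accuracy_score(y_true, y_pred)
recall_weighted = recall_score(y_true, y_pred, average="weighted")
recall_macro = recall_score(y_true, y_pred, average="macro")

# Per-class recall is 1.0 for class 0 and 0.5 for class 1
print(f"accuracy        = {acc:.2f}")
print(f"weighted recall = {recall_weighted:.2f}")  # matches accuracy
print(f"macro recall    = {recall_macro:.2f}")     # unweighted mean of 1.0 and 0.5
```

If the minority class matters for the business question, reporting the macro average or per-class scores alongside the weighted ones gives a fuller picture.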

## ***8.*** ***Future Work (Optional)***

In [None]:
netflix_data.columns


In [None]:
from sklearn.cluster import KMeans

# Separate the feature columns (X) from the 'type' column (y)
X = netflix_data.drop('type', axis=1)
y = netflix_data['type']

# Keep only numeric columns (this also excludes the datetime column)
X_numeric = X.select_dtypes(include=['number'])

# get_dummies leaves numeric columns unchanged, so X_encoded matches X_numeric here
X_encoded = pd.get_dummies(X_numeric, drop_first=True)

In [None]:
from sklearn.cluster import KMeans
import numpy as np

# Concatenate scaled numeric features and one-hot encoded categorical features for training and testing sets
X_train_con = np.concatenate([X_train_std, X_train_ohe], axis=1)
X_test_con = np.concatenate([X_test_std, X_test_ohe], axis=1)

# Choose the number of clusters (you need to decide the optimal number based on your data or use techniques like elbow method)
num_clusters = 3  # You can change this number based on your analysis

# Fit KMeans on the training data
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
kmeans.fit(X_train_con)

# Predict clusters for training and testing data
train_clusters = kmeans.predict(X_train_con)
test_clusters = kmeans.predict(X_test_con)

# Add the predicted clusters to your original dataframes if needed
X_train['cluster'] = train_clusters
X_test['cluster'] = test_clusters

# Print the clusters for training and testing sets
print("Training Set Clusters:")
print(X_train['cluster'].value_counts())

print("\nTesting Set Clusters:")
print(X_test['cluster'].value_counts())
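The number of clusters above is fixed at 3. A rough way to choose it from the data is to fit KMeans for several values of k and compare the inertia (the elbow method) and the silhouette score. The sketch below runs on synthetic blobs from `make_blobs`, which is an assumption for illustration. In this notebook the input would presumably be `X_train_con`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias[k] = km.inertia_                               # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X_demo, km.labels_)   # ranges from -1 to 1, higher is better

for k in inertias:
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

# Candidate k with the highest silhouette score
best_k = max(silhouettes, key=silhouettes.get)
print("k with highest silhouette:", best_k)
```

Inertia always tends to fall as k grows, so the elbow is read as the point where the drop flattens. The silhouette score gives a second opinion that does not decrease automatically with k.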

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Use PCA to reduce the dimensionality for visualization (you can adjust the number of components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_con)

# Plot the clusters
plt.figure(figsize=(10, 8))

# Scatter plot for training set
for cluster in range(num_clusters):
    plt.scatter(X_train_pca[train_clusters == cluster, 0],
                X_train_pca[train_clusters == cluster, 1],
                label=f'Cluster {cluster + 1}')

plt.title('KMeans Clustering - Training Set')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
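A two-component PCA plot only shows part of the variance in the data, so clusters that overlap in the plot may still be separated in the full feature space. Checking `explained_variance_ratio_` indicates how much the 2-D view can be trusted. The sketch below uses synthetic data and a separate `pca_demo` object (both assumptions) so it does not interfere with the variables above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption)
X_demo, _ = make_classification(n_samples=200, n_features=10, random_state=42)

pca_demo = PCA(n_components=2)
X_2d = pca_demo.fit_transform(X_demo)

# Share of total variance captured by each of the two components
explained = pca_demo.explained_variance_ratio_
print("Explained variance per component:", explained.round(3))
print("Total explained by the 2-D view:", round(float(explained.sum()), 3))
```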


### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Assuming 'metrics_data' is a DataFrame with your model metrics
metrics_data = pd.DataFrame({
    'Model name': ['Logistic Regression', 'Logistic Regression_HP', 'Random Forest', 'Random Forest_HP', 'SVM tune1'],
    'Train accuracy': [0.998393, 0.997910, 1.000000, 0.994067, 1.000000],
    'Train precision': [0.998401, 0.997924, 1.000000, 0.994084, 1.000000],
    'Train recall': [0.998393, 0.997910, 1.000000, 0.994067, 1.000000],
    'Train F1_score': [0.998394, 0.997912, 1.000000, 0.994067, 1.000000],
    'Test accuracy': [0.996787, 0.996144, 0.998715, 0.989075, 0.998072],
    'Test precision': [0.996820, 0.996192, 0.998715, 0.989120, 0.998084],
    'Test recall': [0.996787, 0.996144, 0.998715, 0.989075, 0.998072],
    'Test F1_score': [0.996791, 0.996151, 0.998715, 0.989039, 0.998074]
})

# Specify the filename for the CSV file
csv_filename = 'model_metrics.csv'

# Save the DataFrame to a CSV file
metrics_data.to_csv(csv_filename, index=False)

print(f'The model metrics have been saved to {csv_filename}')
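The cell above saves the metrics table rather than a model. Persisting a fitted model with joblib could look like the sketch below. It trains a small Random Forest on synthetic data as a stand-in, and the file name `best_model.joblib` is a hypothetical choice. In this notebook the object to save would be whichever fitted model is selected, for example `m3`, which has the highest test accuracy in the table above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Save the fitted model to disk (hypothetical file name, written to the temp directory)
model_path = os.path.join(tempfile.gettempdir(), "best_model.joblib")
joblib.dump(model, model_path)

# Load it back and confirm it reproduces the original predictions
restored = joblib.load(model_path)
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("Saved to:", model_path)
print("Restored model matches original:", same_predictions)
```

A joblib file should be loaded with the same scikit-learn version that wrote it, and only from a trusted source, since loading executes pickled code.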

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Data Overview:

The dataset provides information on Netflix content, including movies and TV shows, with details such as release year, country, rating, and more.

Data Exploration and Visualization:

Explored the distribution of content types (movies vs. TV shows), release years, and content ratings. Investigated the top countries contributing to Netflix content and identified popular genres and directors.

Insights from Visualizations:

The majority of content on Netflix is movies. Content has been steadily increasing over the years, with a surge in recent years. Certain countries, genres, and directors dominate the Netflix platform.

Machine Learning Models:

Implemented three classifiers, Logistic Regression, Random Forest, and SVM, to predict certain aspects of the dataset. Each was evaluated with its initial settings and again with tuned hyperparameters.

Model Evaluation:

Assessed model performance using various metrics such as accuracy, precision, recall, and F1-score. Logistic Regression, Random Forest, and SVM demonstrated high accuracy and generalization to test data.

Feature Importance:

Explored feature importance using Random Forest, identifying key factors influencing predictions.

Business Implications:

Insights into content popularity, user preferences, and factors affecting viewership can guide content creation and acquisition strategies. Machine learning models can assist in recommending content and optimizing user engagement.

Limitations and Further Work:

Considered limitations, such as potential biases in the dataset or the need for more granular data. Proposed areas for further analysis or improvements in predictive models.

In conclusion, the analysis provides valuable insights into Netflix content trends and user engagement. The machine learning models exhibit strong predictive capabilities, offering potential applications for content recommendation and business strategy.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***