<a href="https://colab.research.google.com/github/MonaRansing/Netflix-Movies-and-TV-Shows-Clustering-Unsupervised-Machine-Learning/blob/main/Netflix_Movies_and_TV_Shows_Clustering_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Movies and TV Shows Clustering



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

This project aims to analyze the evolution of Netflix's content library, using a dataset of TV shows and movies available on Netflix as of 2019, collected from Flixable. Since 2010, the number of TV shows on Netflix has nearly tripled, while the number of movies has decreased by over 2,000 titles. Through Exploratory Data Analysis (EDA), visualization, data cleaning, and unsupervised machine learning algorithms, the project will uncover trends in content availability, genre distribution, and other key attributes. Integrating this dataset with external sources such as IMDb and Rotten Tomatoes will enrich the analysis, providing insights into content popularity and quality. The project will also employ clustering algorithms to identify content similarities and use dimensionality reduction techniques to reveal hidden patterns. The outcome will be detailed insights into Netflix's content strategy, interactive dashboards for user exploration, and a comprehensive view of how Netflix content is perceived in the broader entertainment ecosystem.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine. In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings and Rotten Tomatoes can also provide many interesting findings.

In this project, you are required to do

* Exploratory Data Analysis
* Understanding what type of content is available in different countries
* Checking whether Netflix has been increasingly focusing on TV rather than movies in recent years
* Clustering similar content by matching text-based features

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mtick
from matplotlib.pyplot import figure
import plotly.graph_objects as go
import plotly.offline as py
import plotly.express as px
from plotly.subplots import make_subplots
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mounting google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Almabetter/Data Science/dataset/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# copy main dataset
df1 = df.copy()

In [None]:
# Dataset Rows & Columns count
df1.shape

In [None]:
df1.columns

### Dataset Information

In [None]:
# Dataset Info
df1.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_values = df1.duplicated().sum()
duplicate_values

#### Missing Values/Null Values

In [None]:
#null values
df1.isnull().sum().sum()

In [None]:
# Missing Values/Null Values Count
missing_value = df1.isnull().sum().sort_values(ascending=False).reset_index().rename(columns={'index':'Columns',0:'Missing Values'})
missing_value.head(5)

In [None]:
# Visualizing the missing values

# Define a color palette
palette = sns.color_palette("colorblind", len(missing_value))

# Create a bar plot with missing values
plt.figure(figsize=(8,6))
# Assuming 'missing_value' is a DataFrame with 'Columns' and 'Missing Values' columns
ax = sns.barplot(x='Columns', y='Missing Values', data=missing_value.head(5), palette=palette)
plt.xticks(rotation=90)
plt.xlabel('Columns')
plt.ylabel('Missing Values')
plt.title('Missing Values')

# Adding the exact values on top of the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='baseline', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.show()



### What did you know about your dataset?

The dataset has 7787 rows and 12 columns. There are no duplicate values in the dataset.

There are 3631 missing values in total: 2389 in the director column, 718 in the cast column, 507 in the country column, 10 in the date_added column, and 7 in the rating column.
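The per-column counts above can be reproduced as a small summary table. The sketch below uses a made-up four-row frame (the column names mirror the dataset, the values are invented) and adds a percentage column, which is an addition for illustration rather than something the cells above compute.

```python
import pandas as pd

# Toy frame standing in for df1; the values are invented for illustration.
toy = pd.DataFrame({
    "director": ["A", None, None, "B"],
    "cast": ["X", "Y", None, "Z"],
    "rating": ["TV-MA", "PG", "R", "TV-14"],
})

# Count nulls per column, then express them as a share of all rows.
missing_summary = (
    toy.isnull().sum()
    .rename("missing")
    .to_frame()
    .assign(percent=lambda t: 100 * t["missing"] / len(toy))
    .sort_values("missing", ascending=False)
)
print(missing_summary)
```

Seeing the share next to the raw count helps decide between dropping a column, dropping rows, or imputing.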

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df1.columns

In [None]:
# Dataset Describe
df1.describe()

### Variables Description

* show_id : Unique ID for every Movie / TV Show

* type : Identifier - A Movie or TV Show

* title : Title of the Movie / TV Show

* director : Director of the Movie

* cast : Actors involved in the movie / show

* country : Country where the movie / show was produced

* date_added : Date it was added on Netflix

* release_year : Actual release year of the movie / show

* rating : TV Rating of the movie / show

* duration : Total Duration - in minutes or number of seasons

* listed_in : Genre

* description: The Summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df1.nunique()

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
duplicate_values = df1.duplicated().sum()
duplicate_values

In [None]:
# find out missing values
missing_values = df1.isnull().sum().sort_values(ascending=False).reset_index().rename(columns={'index':'Columns',0:'Missing Values'})
missing_values.head(5)

In [None]:
# replace null values (assign the result back instead of using chained inplace fillna)
df1['cast'] = df1['cast'].fillna("No Cast")
df1['country'] = df1['country'].fillna(df1['country'].mode()[0])


In [None]:
# date_added and rating columns have a few rows with null values, so we drop those rows using dropna.
df1.dropna(subset=['date_added','rating'],inplace=True)

In [None]:
# director column is not needed, so we drop that column from the dataset
df1.drop(['director'],axis=1,inplace=True)

In [None]:
# checking null values
df1.isnull().sum()

### What all manipulations have you done and insights you found?

* The dataset has no duplicate values, so no changes were needed there.

* The dataset has 3631 missing values in total.
* Five columns have missing values:
  * director - 2389
  * cast - 718
  * country - 507
  * date_added - 10
  * rating - 7
* Of these five columns, I dropped the director column because it is not needed for the analysis. The date_added and rating columns have only a few null values, so I dropped those rows using the dropna function.

* Missing values in the cast column are replaced with "No Cast", and missing values in the country column are replaced with the most frequent country (the mode).
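The cleaning steps listed above can be sketched end to end on a toy frame. The rows below are invented; only the column names and the order of operations follow the notebook.

```python
import pandas as pd

# Invented rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "director": [None, "D1", None, "D2"],
    "cast": ["A, B", None, "C", "D"],
    "country": ["India", "India", None, "Japan"],
    "date_added": ["January 1, 2020", "March 5, 2019", "May 9, 2018", None],
    "rating": ["TV-MA", "PG", "R", "TV-14"],
})

cleaned = toy.copy()
# Placeholder label for a missing cast.
cleaned["cast"] = cleaned["cast"].fillna("No Cast")
# Most frequent country (the mode) for a missing country.
cleaned["country"] = cleaned["country"].fillna(cleaned["country"].mode()[0])
# Drop the few rows without a date or rating, then the unused column.
cleaned = cleaned.dropna(subset=["date_added", "rating"])
cleaned = cleaned.drop(columns=["director"])
print(cleaned)
```

Assigning the result of `fillna` back to the column avoids relying on `inplace=True` on a column selection, which newer pandas versions warn about.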

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **How many TV shows and movies are there in the dataset?**

In [None]:
df1.columns

In [None]:
# calculate tv shows and movies
tv_movie_shows = df1['type'].value_counts()
tv_movie_shows

In [None]:
# Chart - 1 visualization tv shows and movies
plt.figure(figsize=(10, 6))
plt.pie(tv_movie_shows, labels=tv_movie_shows.index, autopct='%1.1f%%', startangle=140)
plt.title('TV Shows and Movies')
plt.axis('equal')
plt.show()



##### 1. Why did you pick the specific chart?

I picked a pie chart because it gives a clear and simple visualization of how the two content types split.

##### 2. What is/are the insight(s) found from the chart?

From the above pie chart we can see that there are more movies than TV shows.

#### **What is the distribution of release years for the content?**

In [None]:
df1.columns

In [None]:
# calculate distribution of release years for the content
release_year_dist = df1['release_year'].value_counts().sort_index(ascending=False).head(10)
release_year_dist

In [None]:
# visualize distribution of release years for the content
# Define a color palette
palette = sns.color_palette("colorblind", len(release_year_dist))

# Create a bar plot of titles per release year
plt.figure(figsize=(10, 8))
ax = sns.barplot(x=release_year_dist.index, y=release_year_dist.values, palette=palette)
plt.title('Distribution of Release Years')
plt.xlabel('Release Year')
plt.ylabel('Number of shows/movies')

# Adding the exact values on top of the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='baseline', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.show()

In [None]:
df1

In [None]:
# split the data into TV shows and movies
tv_shows = df1[df1['type'] == 'TV Show']
movies = df1[df1['type'] == 'Movie']

In [None]:
# calculate distribution of movies release years
movies_release_years = movies['release_year'].value_counts().sort_index(ascending=False).head(10)
movies_release_years

In [None]:
# visualize number of movies released per year
# Define colour palette
palette = sns.color_palette("colorblind", len(movies_release_years))

# Create a bar plot
plt.figure(figsize=(10,8))
ax=sns.barplot(x=movies_release_years.index, y=movies_release_years.values, palette=palette)
plt.title('Number of movies release per year')
plt.xlabel('Release Year')
plt.ylabel('Number of movies')

# adding the exact value on the top of the bar
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='baseline', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.show()

In [None]:
# calculate distribution of tv shows release years
tv_shows_release_years = tv_shows['release_year'].value_counts().sort_index(ascending=False).head(10)
tv_shows_release_years

In [None]:
# visualize number of tv shows released per year
# Define colour palette
palette = sns.color_palette("colorblind", len(tv_shows_release_years))

# Create a bar plot
plt.figure(figsize=(10,8))
ax=sns.barplot(x=tv_shows_release_years.index, y=tv_shows_release_years.values, palette=palette)
plt.title('Number of TV shows release per year')
plt.xlabel('Release Year')
plt.ylabel('Number of TV shows')

# adding the exact value on the top of the bar
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='baseline', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.show()

Across all titles, 2018 has the highest count. For movies alone, the highest number of releases is in 2017, and for TV shows alone it is in 2020.

#### **Release by month**

In [None]:
# adding a column for the month in which each title was added (strip stray whitespace before parsing)
df1['month'] = pd.to_datetime(df1['date_added'].str.strip()).dt.month
df1.head()

In [None]:
# visualize release by month
plt.figure(figsize=(10, 6))
ax=sns.countplot(x='month', data=df1, color='orange')
plt.title('Release by Month')
plt.xlabel('Month')
plt.ylabel('Number of shows/movies')

# show exact value on bar
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x()+p.get_width()/2., p.get_height()),
              ha='center', va='baseline', fontsize=11, color='black', xytext=(0, 5),
              textcoords='offset points')
plt.show()

The highest number of titles was added in December (month 12). More generally, October through January are the busiest months for new additions.
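The month counts behind this chart depend on parsing `date_added` correctly. The sketch below assumes the strings follow the "Month D, YYYY" pattern and may carry stray leading whitespace; the three sample dates are invented.

```python
import pandas as pd

# Invented sample of date strings; note the leading space in the second entry.
dates = pd.Series(["August 14, 2020", " December 23, 2018", "December 1, 2019"])

# Strip whitespace first so every entry matches the explicit format.
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y")

# Number of titles added in each calendar month.
month_counts = parsed.dt.month.value_counts().sort_index()
print(month_counts)
```

Passing an explicit `format` makes a malformed entry fail loudly instead of being parsed inconsistently.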

In [None]:
plt.figure(figsize=(15, 6))
sns.countplot(data=df1, x="month", hue="type")
plt.title('Release by Month')
plt.xlabel('Month')
plt.ylabel('Number of shows/movies')
plt.show()

#### **Which 10 countries produce the most content?**

In [None]:
# find out content by countries
content_by_countries = df1['country'].value_counts().head(10)
content_by_countries

In [None]:
# Chart - 3 visualization of content by countries

# Define a color palette
palette = sns.color_palette("colorblind", len(content_by_countries))

# Create a bar plot of titles per country
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=content_by_countries.index, y=content_by_countries.values, palette=palette)
plt.title('Content by Countries')
plt.xlabel('Country')
plt.ylabel('Number of shows/movies')
plt.xticks(rotation=90)

# Adding the exact values on top of the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='baseline', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.show()



In [None]:
# find out top 10 countries in which maximum number of movies released
top_10_countries_movies = movies['country'].value_counts().head(10)
top_10_countries_movies

In [None]:
# plotting top 10 countries in which most number of movies released
# colour palette
color_palette = sns.color_palette("husl", len(top_10_countries_movies))

# plot bar plot
plt.figure(figsize=(10,6))
sns.barplot(x=top_10_countries_movies.index, y=top_10_countries_movies.values, palette=color_palette)
plt.title('Top 10 Countries in which most number of movies released')
plt.xlabel('Country')
plt.ylabel('Number of movies')
plt.xticks(rotation=45)
plt.show()

In [None]:
top_10_countries_tv_show = tv_shows['country'].value_counts().head(10)
top_10_countries_tv_show

# plot top 10 countries in which maximum number of tv shows released
# colour palette
color_palette = sns.color_palette("husl", len(top_10_countries_tv_show))

# plot bar plot
plt.figure(figsize=(10,6))
sns.barplot(x=top_10_countries_tv_show.index, y=top_10_countries_tv_show.values, palette=color_palette)
plt.title('Top 10 Countries in which most number of TV shows released')
plt.xlabel('Country')
plt.ylabel('Number of TV shows')
plt.xticks(rotation=45)
plt.show()

From the above charts, the United States has released more content than any other country.
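One caveat when ranking countries with `value_counts()`: if a `country` entry lists several countries for a co-production (an assumption worth checking with `df1['country'].head(20)`), each combination is counted as its own category. A hedged sketch of counting each country separately, on invented values:

```python
import pandas as pd

# Invented values; the third entry stands for a co-production.
countries = pd.Series(["United States", "India", "United States, India", "Japan"])

# Split on commas, give each country its own row, trim spaces, then count.
per_country = (
    countries.str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(per_country)
```

Whether to count co-productions once per country or only under the first listed country is an analysis choice; the charts above keep the raw strings.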

In [None]:
# horizontal stacked bar plot of the movie / TV show split for the top 10 countries
country_order = df1['country'].value_counts()[:10].index
content_data = df1.groupby('country')['type'].value_counts().unstack().loc[country_order]
content_data['sum'] = content_data.sum(axis=1)
content_data_ratio = (content_data.T/content_data['sum']).T[['Movie', 'TV Show']].sort_values(by='Movie', ascending=False)[::-1]

# plot horizontal bar plot (DataFrame.plot creates its own figure, so figsize is passed here)
content_data_ratio.plot(kind='barh', stacked=True, color=['red', 'black'], figsize=(20, 10))
plt.title("TV shows and movies split")
plt.xlabel("Ratio")
plt.ylabel("Country")
plt.show()

#### **Rating distribution of TV shows and Movies**

In [None]:
df1['rating']

In [None]:
# assign ratings to grouped age categories
ratings = {
    'TV-MA': 'Adults',
    'R': 'Adults',
    'PG-13': 'Teens',
    'TV-14': 'Young Adults',
    'TV-PG': 'Older Kids',
    'NR': 'Adults',
    'TV-G': 'Kids',
    'TV-Y': 'Kids',
    'TV-Y7': 'Older Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'NC-17': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'UR': 'Adults'
}

df1['target_ages'] = df1['rating'].replace(ratings)

In [None]:
# type should be categorical
df1['type'] = df1['type'].astype('category')
df1['target_ages'] = pd.Categorical(df1['target_ages'], categories = ['Kids', 'Teens', 'Young Adults', 'Adults', 'Older Kids'])

In [None]:
df1.head()

In [None]:
# re-create the TV show and movie subsets so that they include the new target_ages column
tv_shows = df1[df1['type'] == 'TV Show']
movies = df1[df1['type'] == 'Movie']

In [None]:
df1

In [None]:
# rating based on rating system of all tv shows
tv_shows['rating'].value_counts()

In [None]:
# Visualize rating distribution of all tv shows
plt.figure(figsize=(10, 6))
sns.pointplot(x=tv_shows['rating'].value_counts().index, y=tv_shows['rating'].value_counts().values, color='red')
plt.title('Rating Distribution of TV Shows')
plt.xlabel('Rating')
plt.ylabel('Number of TV Shows')
plt.show()


In [None]:
# Visualize rating distribution of all movies
plt.figure(figsize=(10, 6))
sns.pointplot(x=movies['rating'].value_counts().index, y=movies['rating'].value_counts().values, color='red')
plt.title('Rating Distribution of Movies')
plt.xlabel('Rating')
plt.ylabel('Number of Movies')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a point plot to visualize the rating distribution of TV shows and movies because it effectively highlights variability across the different categories.

##### 2. What is/are the insight(s) found from the chart?

TV-MA is the most common rating in both cases, i.e. for TV shows as well as for movies.

In [None]:
df1.columns

In [None]:
df1

#### **What are the most common genres on Netflix?**

In [None]:
# find out top 10 genres across all titles
top_10_genres = df1['listed_in'].value_counts().head(10)
top_10_genres

In [None]:
# Chart - 5 visualization top 10 genre
palette = sns.color_palette("colorblind", len(top_10_genres))

plt.figure(figsize=(10, 6))
sns.countplot(y='listed_in', data=df1, order=df1['listed_in'].value_counts().index[:10], palette=palette)
plt.title('Top 10 Genres of shows/movies')
plt.xlabel('Number of Shows/Movies')
plt.ylabel('Genre')
plt.show()

In [None]:
# visualizing top 10 genres of tv_shows
palette = sns.color_palette("colorblind", len(top_10_genres))

plt.figure(figsize=(10, 6))
sns.countplot(y='listed_in', data=tv_shows, order=tv_shows['listed_in'].value_counts().index[:10], palette=palette)
plt.title('Top 10 Genres of TV Shows')
plt.xlabel('Number of TV Shows')
plt.ylabel('Genre')
plt.show()

In [None]:
# visualizing top 10 genres of movies
palette = sns.color_palette("colorblind", len(top_10_genres))

plt.figure(figsize=(10, 6))
sns.countplot(y='listed_in', data=movies, order=movies['listed_in'].value_counts().index[:10], palette=palette)
plt.title('Top 10 Genres of Movies')
plt.xlabel('Number of Movies')
plt.ylabel('Genre')
plt.show()

#### **How does the number of TV shows and movies vary by release year?**

In [None]:
# count TV shows and movies by release year
tv_shows_by_year = df1[df1['type'] == 'TV Show'].groupby('release_year').size()
movies_by_year = df1[df1['type'] == 'Movie'].groupby('release_year').size()

In [None]:
# visualize how the number of TV shows and movies varies by release year
plt.figure(figsize=(12,6))
plt.plot(tv_shows_by_year.index, tv_shows_by_year.values, label='TV Shows')
plt.plot(movies_by_year.index, movies_by_year.values, label='Movies')
plt.title('Number of TV Shows and Movies by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Number of Shows/Movies')
plt.legend()
plt.show()

##### 1. What is/are the insight(s) found from the chart?

The graph shows a significant increase in both TV shows and movies on Netflix starting from the 2000s, with a sharp spike in content around 2015-2019. TV shows have seen rapid growth, especially in the last decade, while the number of movies, after peaking, appears to have slightly declined recently.

##### 2. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix should continue investing in TV shows due to their rapid growth and strong engagement. Additionally, diversifying with more classic content could attract a broader audience. Monitoring the recent decline in movie additions can help maintain a balanced content library.

#### **Duration**

In [None]:
# Extract numeric durations and convert to numeric type
duration_numeric = movies['duration'].str.extract(r'(\d+)', expand=False).astype(float)
duration_numeric

In [None]:
# plot the histogram with a KDE overlay (sns.distplot is deprecated; histplot replaces it)
plt.figure(figsize=(10, 6))
sns.histplot(duration_numeric, bins=20, kde=True, color='red')
plt.title('Distribution of Movie Durations')
plt.xlabel('Duration (minutes)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Checking the distribution of TV show durations (number of seasons)
plt.figure(figsize=(12,6))
sns.countplot(x='duration', data=tv_shows, order=tv_shows['duration'].value_counts().index)
plt.title("Distribution of TV Shows duration", fontweight='bold')
plt.xticks(rotation=90)
plt.xlabel("Duration")
plt.ylabel("Count")
plt.show()

From the above plot we can see that TV shows with a single season are the most common.
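Because `duration` mixes two units (minutes for movies, seasons for TV shows), it helps to parse the number and the unit together. The helper below is a hypothetical illustration using only the standard library; `parse_duration` is not part of the notebook.

```python
import re

def parse_duration(text):
    """Split a duration string such as '90 min' or '2 Seasons' into (value, unit)."""
    match = re.match(r"\s*(\d+)\s*(\w+)", text)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    # Normalise 'season' / 'seasons' to a single label.
    if unit.startswith("season"):
        unit = "season"
    return value, unit

for sample in ["90 min", "2 Seasons", "1 Season"]:
    print(sample, "->", parse_duration(sample))
```

Keeping the unit next to the number guards against accidentally averaging minutes and seasons together.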

### **Heatmap**

In [None]:
# Ensure 'count' column exists
df1['count'] = 1

# Group by 'country' and sum the 'count' column, then sort
data = df1.groupby('country')['count'].sum().sort_values(ascending=False).reset_index()[:10]

# Extract top 10 countries
top_countries = data['country']

# Filter the dataframe for the top 10 countries
df_heatmap = df1.loc[df1['country'].isin(top_countries)]

# Create a crosstab of 'country' and 'target_ages' normalized by index
df_heatmap = pd.crosstab(df_heatmap['country'], df_heatmap['target_ages'], normalize="index").T

df_heatmap


In [None]:
# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df_heatmap * 100, annot=True, cmap="plasma", fmt=".2f", linewidths=.5, linecolor='gray',
                     cbar_kws={'label': 'Percentage (%)'}, annot_kws={"size": 12})
plt.title('Distribution of Target Ages by Country', fontsize=15)
plt.xlabel('Country')
plt.ylabel('Target Ages')
plt.show()

Here are five insights from the heatmap:

1. **Young Adults in Spain and Mexico**:
   - The highest percentage of content in Spain (83.58%) and Mexico (77.00%) is targeted at young adults.

2. **Adults in France and Egypt**:
   - France has a significant portion (67.83%) of content aimed at adults, followed by Egypt (27.72%).

3. **Kids in Canada**:
   - Canada has a notable percentage (18.08%) of content targeted at kids, which is higher compared to other countries.

4. **Varied Target in India**:
   - India has a diverse distribution of content across different age groups, with young adults (56.34%) and adults (25.57%) having significant shares.

5. **Low Teen Content Across Countries**:
   - The percentage of content targeted at teens is generally low across all countries, with the highest being in the United States (7.54%).
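The percentages quoted above are row-normalised shares: for each country, the counts per age group are divided by that country's total. A minimal sketch on invented rows shows how `pd.crosstab(..., normalize="index")` produces them.

```python
import pandas as pd

# Invented rows; only the two column names follow the notebook.
toy = pd.DataFrame({
    "country": ["India", "India", "India", "India", "Japan", "Japan"],
    "target_ages": ["Adults", "Adults", "Teens", "Kids", "Adults", "Kids"],
})

# normalize="index" divides each country's row by its total,
# so after transposing every country column sums to 100 percent.
share = pd.crosstab(toy["country"], toy["target_ages"], normalize="index").T * 100
print(share.round(2))
```

Since each country sums to 100%, the heatmap compares the mix of age groups within a country, not the absolute amount of content between countries.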

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1.   **Null Hypothesis** : There is no difference between the number of movies and TV shows, i.e. the proportion of movies is equal to 0.5.
2.   **Alternate Hypothesis** : There is a difference between the number of movies and TV shows, i.e. the proportion of movies is not equal to 0.5.



#### 2. Perform an appropriate statistical test.

In [None]:
from statsmodels.stats.proportion import proportions_ztest

# Counts from the pie chart data
n_movies = 5372
n_tvshows = 2398
n_total = n_movies + n_tvshows

# Perform the z-test for a single proportion
stat, p_value = proportions_ztest(count=n_movies, nobs=n_total, value=0.5, alternative='two-sided')

# Display the results
print(f"Z-statistic: {stat}")
print(f"P-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference between the number of movies and TV shows.")
else:
    print("Fail to reject the null hypothesis: There is no significant evidence of a difference between the number of movies and TV shows.")


##### Which statistical test have you done to obtain P-Value?

I used a z-test for a single proportion.

##### Why did you choose the specific statistical test?

I chose the z-test because the dataset is large, so the sampling distribution of the proportion is approximately normal.
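To make the test less of a black box, the z-statistic can also be computed by hand. The sketch below uses only the standard library and the counts from the cell above; `one_proportion_ztest` is a hypothetical helper name. It uses the sample proportion in the standard error, which, as far as I know, is also what `proportions_ztest` does with its default `prop_var=False`.

```python
import math

def one_proportion_ztest(count, nobs, p0=0.5):
    """Two-sided z-test for a single proportion (sample proportion in the variance)."""
    p_hat = count / nobs
    se = math.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - p0) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p = one_proportion_ztest(count=5372, nobs=5372 + 2398)
print(f"z = {z:.2f}, p = {p:.3g}")
```

A large positive z means the observed share of movies lies far above 0.5 relative to its standard error.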

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1.   **Null Hypothesis** : Movies rated for kids and older kids are at least two hours long.

2.   **Alternate Hypothesis** : Movies rated for kids and older kids are not at least two hours long.



#### 2. Perform an appropriate statistical test.

In [None]:
movies


In [None]:
# make copy of df1 dataset
df1_hypo = df1.copy()
df1_hypo.head()

In [None]:
# filter movies from "type" column
df1_hypo = df1_hypo[df1_hypo['type'] == 'Movie']

In [None]:
# assign ratings to grouped age categories
ratings_by_age = {
    'TV-MA': 'Adults',
    'R': 'Adults',
    'PG-13': 'Teens',
    'TV-14': 'Young Adults',
    'TV-PG': 'Older Kids',
    'NR': 'Adults',
    'TV-G': 'Kids',
    'TV-Y': 'Kids',
    'TV-Y7': 'Older Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'NC-17': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'UR': 'Adults'
}

df1_hypo['target_ages'] = df1_hypo['rating'].replace(ratings_by_age)

# unique target ages
df1_hypo['target_ages'].unique()

In [None]:
# convert target ages to categorical type with specified order
df1_hypo['target_ages'] = pd.Categorical(df1_hypo['target_ages'], categories = ['Kids', 'Teens', 'Young Adults', 'Adults', 'Older Kids'])

# extract numeric part from duration and converting to numeric type
df1_hypo['duration_numeric'] = df1_hypo['duration'].str.extract(r'(\d+)', expand=False)
df1_hypo['duration_numeric'] = pd.to_numeric(df1_hypo['duration_numeric'], errors='coerce')

# display the first view
df1_hypo.head()

In [None]:
# group by target ages and duration numeric and also find out mean
group_by_ = df1_hypo[['target_ages','duration_numeric']].groupby(by='target_ages')
group_by_

#take mean
group = group_by_.mean().reset_index()
group

In [None]:
# storing the two groups in variables
A = group_by_.get_group("Kids")
B = group_by_.get_group("Older Kids")

In [None]:
# calculate mean and standard deviation
mean_A = A['duration_numeric'].mean()
std_A = A['duration_numeric'].std()
mean_B = B['duration_numeric'].mean()
std_B = B['duration_numeric'].std()

In [None]:
# print the result
print("Mean of group A:", mean_A)
print("Standard Deviation of group A:", std_A)
print("Mean of group B:", mean_B)
print("Standard Deviation of group B:", std_B)

In [None]:
# length of A and B
len_A = len(A)
len_B = len(B)

#print
print(f'len_A = {len_A}, len_B = {len_B}')

In [None]:
# degree of freedom
DOF = len_A + len_B - 2

#print
print(f'DOF = {DOF}')

In [None]:
# pooled variance and pooled standard deviation (each sample variance is weighted by n - 1)
pooled_var = ((len_A - 1)*(std_A)**2 + (len_B - 1)*(std_B)**2)/DOF
sp = np.sqrt(pooled_var)

#print
print(f'sp = {sp}')

In [None]:
# t_value
t_value = (mean_A - mean_B)/(sp*np.sqrt(1/len_A + 1/len_B))

#print
print(f't_value = {t_value}')
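For reference, the equal-variance (pooled) two-sample t statistic, which `scipy.stats.ttest_ind` computes by default, can be written as follows. The symbols $n_A, n_B$ (group sizes), $\bar{x}_A, \bar{x}_B$ (group means) and $s_A, s_B$ (sample standard deviations with the $n-1$ denominator, as returned by pandas `.std()`) are notation introduced here for illustration.

$$
s_p^2 = \frac{(n_A - 1)\,s_A^2 + (n_B - 1)\,s_B^2}{n_A + n_B - 2},
\qquad
t = \frac{\bar{x}_A - \bar{x}_B}{s_p \sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}}
$$

The statistic has $n_A + n_B - 2$ degrees of freedom, which is the `DOF` value computed above.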

In [None]:
# perform t test
from scipy.stats import ttest_ind

t_statistic, p_value = ttest_ind(A['duration_numeric'], B['duration_numeric'])

#print
print(f't_statistic = {t_statistic}')
print(f'p_value = {p_value}')

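1.   The two-sample t-test compares the mean duration of Kids movies (group A) with that of Older Kids movies (group B). A negative t_statistic means that Kids movies are shorter on average than Older Kids movies; it does not by itself measure the distance from 120 min.
2.  The p_value is also very small, so the difference between the two groups is statistically significant.
3. The group means printed above are below 120 min for both kids and older kids, so we can reject the null hypothesis. Strictly speaking, a one-sample t-test against 120 min is the formal check for this claim.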
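A direct way to test the stated null hypothesis (mean duration of at least 120 minutes) is a one-sample t-test against 120. The sketch below is an illustration on invented durations, not the notebook's result; it assumes a SciPy version in which `ttest_1samp` supports the `alternative` argument.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Invented durations (minutes) standing in for Kids / Older Kids movies.
durations = np.array([88, 95, 102, 79, 91, 110, 85, 97], dtype=float)

# H0: mean >= 120 minutes, H1: mean < 120 minutes (one-sided test).
t_stat, p_value = ttest_1samp(durations, popmean=120, alternative="less")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

On the real data, `durations` would be the `duration_numeric` values of the Kids and Older Kids movies with missing values dropped.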



### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis** : At most half of the movies are longer than 90 mins (proportion ≤ 0.5).

**Alternate Hypothesis** : More than half of the movies are longer than 90 mins (proportion > 0.5).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# creating a binary variable: 1 if the movie is longer than 90 min, else 0
df1_hypo['duration_binary'] = np.where(df1_hypo['duration_numeric'] > 90, 1, 0)

# observed proportion of duration > 90 min
observed_prop = df1_hypo['duration_binary'].mean()
print(f'observed_prop = {observed_prop}')

In [None]:
# perform proportion test
from statsmodels.stats.proportion import proportions_ztest
n = len(df1_hypo)
baseline_prop = 0.5
stat, p_value = proportions_ztest(count=n*observed_prop, nobs=n, value=baseline_prop, alternative='larger')

#print result
print(f"Z-statistic: {stat}")
print(f"P-value: {p_value}")

## ***6. Feature Engineering & Data Pre-processing***

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

In [None]:
df1.dtypes

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

In [None]:
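# make sure every description is a string (assign back, otherwise the cast has no effect)
df1['description'] = df1['description'].astype(str)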

In [None]:
# making list of description feature
df1['description'] = df1['description'].apply(lambda x: x.split(' '))

In [None]:
# convert text feature to string from list
df1['description'] = df1['description'].apply(lambda x: ' '.join(x))

In [None]:
# Lower Casing all the words in the text features
df1['description'] = df1['description'].apply(lambda x: x.lower())

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

def remove_punctuations(text):
  # delete every ASCII punctuation character in a single pass
  return text.translate(str.maketrans('', '', string.punctuation))

In [None]:
# apply above function
df1['description'] = df1['description'].apply(remove_punctuations)

In [None]:
df1['description'][0:10]

In [None]:
# load the English stop-word list from nltk
sw = set(stopwords.words('english'))

# remove stop words; the function has its own name so it does not shadow the imported nltk `stopwords` module
def remove_stopwords(text):
  text = [word for word in text.split() if word not in sw]
  return " ".join(text)

In [None]:
# apply above function
df1['description'] = df1['description'].apply(remove_stopwords)

In [None]:
df1['description'][0:10]
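The cleaning steps above (lower-casing, punctuation removal, stop-word removal) can be folded into one function. The sketch below is a standard-library illustration: `clean_text` and the tiny `STOP_WORDS` set are hypothetical stand-ins, whereas the notebook uses NLTK's full English list.

```python
import string

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lower-case, strip ASCII punctuation, and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in stop_words)

print(clean_text("The heist: a crew's plan to rob the Mint!"))
# heist crews plan rob mint
```

Note that removing the apostrophe turns "crew's" into "crews"; stemming or lemmatization (imported earlier but not applied) would be the usual next step if such variants should be merged.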

In [None]:
# import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# fit TF-IDF on the cleaned descriptions, keeping the 5000 most frequent terms
tf = TfidfVectorizer(max_features = 5000)
X = tf.fit_transform(df1['description'])

In [None]:
X.shape

In [None]:
# convert X to array
X = X.toarray()

## ***7. ML Model Implementation***

### ML Model - 1 : KMeans Clustering

In [None]:
# import library
from sklearn.cluster import KMeans

#initializing the list for the values of wcss
wcss = []

# loop over cluster counts from 1 to 29 and record the WCSS (inertia) for each
for i in range(1,30):
  kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
  kmeans.fit(X)
  wcss.append(kmeans.inertia_)

# plot results
plt.plot(range(1,30), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
from sklearn.metrics import silhouette_score

sill = []

for i in range(2, 30):
  model = KMeans(n_clusters=i, init='k-means++', random_state=42)
  model.fit(X)
  labels = model.labels_
  sill.append(silhouette_score(X, labels))
  print(f'Silhouette score for {i} clusters: {sill[-1]}')

In [None]:
# Plotting the silhouette scores
plt.plot(range(2, 30), sill, 'bs--')
plt.xticks(range(2, 30))
plt.grid()
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.title('Silhouette Score for Different Numbers of Clusters')
plt.show()
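The same selection procedure can be checked on data where the answer is known. The sketch below uses synthetic blobs with three well-separated centres (an invented setup, not the Netflix data) and picks the cluster count with the highest silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups.
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Fit KMeans for several k and keep the silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On sparse, high-dimensional text vectors the silhouette curve is usually much flatter than on such blobs, so the chosen number of clusters is better treated as a judgment call than as a sharp optimum.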

In [None]:
# training the k means model on a dataset
kmeans = KMeans(n_clusters=26, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)

In [None]:
# evaluate the silhouette score of the fitted clusters
score = silhouette_score(X, y_kmeans)
print(f'Silhouette score: {score}')

In [None]:
# davies bouldin score of our clusters
from sklearn.metrics import davies_bouldin_score
davies_bouldin_score(X, y_kmeans)

In [None]:
# adding a separate column for the cluster
df1['cluster'] = y_kmeans

In [None]:
df1['cluster'].value_counts()

In [None]:
# plot the graph
fig, ax = plt.subplots(figsize=(10, 6))
sns.countplot(x='cluster', data=df1, ax=ax, hue='type')
ax.set_title('Cluster Distribution')
ax.set_xlabel('Cluster')
ax.set_ylabel('Count')
plt.show()

Cluster 7 has the highest number of data points.

In [None]:
# scatter plot for clusters
fig = px.scatter(df1, y="description", x="cluster", color="cluster")
fig.update_traces(marker_size = 100)
fig.show()

### ML Model - 2 : Hierarchical Clustering

In [None]:
# import library
import scipy.cluster.hierarchy as shc

# plot dendrogram
plt.figure(figsize=(10, 7))
plt.title("Dendrogram")
dend = shc.dendrogram(shc.linkage(X, method='ward'))
plt.show()
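A dendrogram is read by choosing a height at which to cut the tree. The sketch below, on invented 2-D points, shows how the same Ward linkage can be cut into a fixed number of flat clusters with SciPy's `fcluster`; the three-cluster choice is only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three invented, well-separated groups of 10 points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(10, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(10, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(10, 2)),
])

# Ward linkage builds the merge tree; maxclust cuts it into at most 3 flat clusters.
Z = linkage(points, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels, return_counts=True))
```

On the full TF-IDF matrix the linkage step needs all pairwise distances, so it can be slow and memory-hungry; running it on a sample or on reduced dimensions is a common workaround.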

### **Agglomerative Clustering**

In [None]:
# import library
from sklearn.cluster import AgglomerativeClustering

# training the model (ward linkage always uses Euclidean distance; the `affinity` argument was removed in newer scikit-learn releases)
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
y_pred = model.fit_predict(X)

In [None]:
df_hc = df1.copy()

# create a separate column where each row is assigned to its cluster
df_hc['cluster'] = y_pred
df_hc.head()

### **Evaluation**

In [None]:
# silhouette coefficient
score = silhouette_score(X, y_pred)
print(f'Silhouette score: {score}')

In [None]:
# davies bouldin score
davies_bouldin_score(X, y_pred)

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***