<a href="https://colab.research.google.com/github/Flighty07/CAPSTONE-PROJECT-MACHINE-LEARNING-MODULE/blob/main/Copy_of_Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - NETFLIX MOVIES AND TV SHOWS CLUSTERING



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -**   Debasis Deepak Rout


# **Project Summary -**

In this project, we aimed to leverage unsupervised learning techniques to uncover hidden patterns and insights from the Netflix movies and series dataset. Our approach was systematic, involving extensive Exploratory Data Analysis (EDA), hypothesis testing, model training, and evaluation, culminating in clustering the test features of the data.

1. Exploratory Data Analysis (EDA)

We began with a comprehensive EDA to understand the dataset's structure, distributions, and relationships. The dataset contained various attributes such as title, genre, release year, duration, rating, and more. Key steps in our EDA included:

Data Cleaning: Handling missing values, correcting data types, and removing duplicates.

Descriptive Statistics: Calculating summary statistics to understand the central tendency and dispersion of numerical features.

Visualizations: Creating histograms, box plots, scatter plots, and heatmaps to visualize the distribution and correlation of features.

Feature Engineering: Extracting new features such as the decade of release and content length categories (e.g., short, medium, long).

These steps provided critical insights into the data, guiding subsequent analysis and model selection.

2. Hypothesis Testing

We formulated three hypotheses to test specific assumptions about the data:

Hypothesis 1: There is a significant difference in average duration between movies and series.

Hypothesis 2: The distribution of ratings varies significantly across different genres.

Hypothesis 3: The release year influences the average rating of the content.
For each hypothesis, we employed appropriate statistical tests:

T-tests for comparing means (e.g., duration between movies and series).
Chi-square tests for categorical variables (e.g., genre and rating distribution).

ANOVA tests for examining the influence of release year on ratings.
These tests validated our hypotheses, providing deeper insights into the relationships within the dataset.

3. Model Training and Evaluation
We then trained three unsupervised learning models to cluster the data based on test features:

K-Means Clustering

Hierarchical Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Each model was evaluated using the following metrics:

Silhouette Score: To measure how similar an object is to its own cluster compared to other clusters.

Davies-Bouldin Index: To evaluate the average similarity ratio of each cluster with respect to the others.

Dunn Index: To identify compact and well-separated clusters.

K-Means Clustering emerged as the most effective model, achieving the highest silhouette score and the lowest Davies-Bouldin Index, indicating well-defined clusters.
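
Of the three metrics above, the silhouette score and the Davies-Bouldin Index are available in `sklearn.metrics`, whereas the Dunn Index, to our knowledge, is not and has to be computed by hand. The block below is a minimal sketch of one common definition (smallest distance between points of different clusters divided by the largest cluster diameter). The helper names `pairwise_distances` and `dunn_index` and the toy data are illustrative assumptions, not code taken from this notebook.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def dunn_index(X, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = [X[labels == k] for k in np.unique(labels)]
    if len(clusters) < 2:
        raise ValueError("The Dunn Index needs at least two clusters.")

    # Largest distance between two points that share a cluster
    max_diameter = max(pairwise_distances(c, c).max() for c in clusters)
    # Smallest distance between two points that belong to different clusters
    min_separation = min(
        pairwise_distances(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return np.inf if max_diameter == 0 else min_separation / max_diameter

# Toy data (made up): two tight, well-separated groups
X_toy = np.array([[0, 0], [0, 1], [10, 0], [10, 1]])
labels_toy = np.array([0, 0, 1, 1])
print(dunn_index(X_toy, labels_toy))
```

A higher value indicates more compact and better separated clusters. Note that this sketch treats every distinct label as a cluster, so the DBSCAN noise label `-1` would need to be filtered out first, and the pairwise distance matrices grow quadratically with cluster size.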

4. Clustering Test Features

Finally, we applied the best-performing model (K-Means) to cluster the test features of the dataset. The test features included duration, rating, and genre-specific attributes. The clustering revealed several distinct groups within the Netflix content:

Cluster 1: High-rating, short-duration series.

Cluster 2: Medium-rating, long-duration movies.

Cluster 3: Low-rating, varied-duration content across multiple genres.

These clusters provided actionable insights into the characteristics of different types of Netflix content, potentially guiding content curation and recommendation systems.

# **GitHub Link -**

-

# **Problem Statement**


The Netflix movies and series dataset comprises diverse attributes, including title, genre, release year, duration, and ratings. However, the sheer volume and complexity of this data make it challenging to discern meaningful patterns and relationships. Traditional supervised learning approaches are not suitable due to the lack of labeled outcomes for all potential insights. Therefore, we aim to apply unsupervised learning techniques to explore and analyze this dataset. Specifically, we seek to identify underlying structures and clusters within the data, test hypotheses about relationships between key features, and ultimately provide actionable insights that can enhance content recommendation, curation, and strategic decision-making for streaming platforms.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that every chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain what each evaluation metric indicates for the business, and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Install the contractions package first (it is imported below)
!pip install contractions

# Import Libraries
import re
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chisquare, chi2_contingency

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from bs4 import BeautifulSoup
import contractions

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import classification_report, accuracy_score, make_scorer
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score




### Dataset Loading

In [None]:
# Load Dataset
netflix_data = pd.read_csv('/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
netflix_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows, num_columns = netflix_data.shape
print(f"\nNumber of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

### Dataset Information

In [None]:
# Dataset Info
netflix_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
num_duplicates = netflix_data.duplicated().sum()
print(f"Number of duplicate values: {num_duplicates}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = netflix_data.isnull().sum()
missing_values

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(netflix_data.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values in the Dataset")
plt.xlabel("Columns")
plt.ylabel("Rows")
plt.show()

### What did you know about your dataset?



*   There are 7787 rows and 12 columns.
*   There are no duplicate rows.
*   Missing values occur in the `director`, `cast`, `date_added`, `rating`, and `country` columns.
*   The `director` column has around 2389 missing values.
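
Raw counts are easier to compare when they are shown next to the share of rows they represent. The following is a minimal sketch of such a summary; the helper name `missing_summary` and the toy DataFrame are illustrative assumptions, and on the real data the function would be called on `netflix_data` instead.

```python
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the affected columns, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data (made up) standing in for netflix_data
toy = pd.DataFrame({
    "title": ["w", "x", "y", "z"],
    "director": ["A", None, None, "B"],
    "country": ["US", "IN", None, "UK"],
})
report = missing_summary(toy)
print(report)
```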






## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
netflix_data.columns

In [None]:
# Dataset Describe
netflix_data.describe().T

### Variables Description

show_id - Unique ID for every TV show/movie

type - Identifier: Movie or TV Show

title - Title of the movie/TV show

director - Director of the movie or TV show

cast - Actors involved

country - Country of production

date_added - Date the title was added to Netflix

release_year - Actual release year of the title

rating - TV rating of the title

duration - Total duration in minutes (movies) or number of seasons (TV shows)

listed_in - Genre

description - The summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in netflix_data.columns:
    unique_values = netflix_data[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)
    print(f"Number of unique values: {len(unique_values)}\n")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
top_countries = netflix_data['country'].value_counts().nlargest(10).index

genres = netflix_data['listed_in'].str.split(', ').explode()
top_genres = genres.value_counts().nlargest(10)

netflix_data['date_added'] = netflix_data['date_added'].astype(str)

# Strip any leading or trailing spaces from 'date_added' column
netflix_data['date_added'] = netflix_data['date_added'].str.strip()

# Convert 'date_added' to datetime format, coercing errors
netflix_data['date_added'] = pd.to_datetime(netflix_data['date_added'], errors='coerce')

# Extract the year when the title was added
netflix_data['year_added'] = netflix_data['date_added'].dt.year

top_directors = netflix_data['director'].dropna().value_counts().nlargest(10)

actors = netflix_data['cast'].str.split(', ').explode()
top_actors = actors.value_counts().nlargest(10)

movies_duration = netflix_data[netflix_data['type'] == 'Movie']['duration'].str.replace(' min', '').astype(float)

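# Separate TV shows and movies (as copies, so new columns can be added without a SettingWithCopyWarning)
tv_shows = netflix_data[netflix_data['type'] == 'TV Show'].copy()
movies = netflix_data[netflix_data['type'] == 'Movie'].copy()

# Extract the number of seasons for TV shows (raw string avoids an invalid-escape warning for \d)
tv_shows['seasons'] = tv_shows['duration'].str.extract(r'(\d+)', expand=False).astype(float)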


### What all manipulations have you done and insights you found?

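*   Computed the top 10 countries by number of titles.
*   Split the `listed_in` and `cast` columns on `', '` and exploded them to count the top 10 genres and top 10 actors.
*   Cleaned `date_added` (cast to string, stripped whitespace, parsed to datetime with invalid values coerced to `NaT`) and derived a `year_added` column.
*   Counted the top 10 directors, ignoring missing values.
*   Converted movie durations to minutes and extracted the number of seasons for TV shows.

The insights drawn from these prepared variables are discussed chart by chart in the next section.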
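
The `duration` column mixes two units: minutes for movies and seasons for TV shows. As a minimal sketch (on made-up rows, not the real dataset), both numeric columns can be derived in one pass without writing to a filtered slice of the DataFrame. The toy frame `toy` is an assumption for illustration; the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy rows (made up) with the same 'type' and 'duration' layout as the dataset
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "2 Seasons", "125 min", "1 Season"],
})

# Pull the leading number out of strings such as '90 min' or '2 Seasons'
number = toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)
is_movie = toy["type"].eq("Movie")

# Keep the number only where the unit applies; other rows become NaN
toy["duration_minutes"] = number.where(is_movie)
toy["seasons"] = number.where(~is_movie)
print(toy)
```

Leaving the non-applicable rows as `NaN` rather than 0 keeps summary statistics such as the mean movie duration from being pulled down by TV shows.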

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Distribution of `type` (Movies vs. TV Shows)

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8, 6))
sns.countplot(data=netflix_data, x='type')
plt.title('Distribution of Type (Movies vs. TV Shows)')
plt.xlabel('Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot (count plot) is a simple way to show the frequency of each unique value of a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

The dataset contains more movies than TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart shows that Netflix has added more movies than TV shows, which suggests that movies are the more popular format. Netflix could focus on adding or producing more movies, which also take less time to make, so adding more of them could support positive business growth.

#### Chart - 2 Distribution of `release_year`

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(data=netflix_data, x='release_year', bins=30, kde=True)
plt.title('Distribution of Release Year')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram divides the data into bins, and the height of each bin shows how many data points fall into it. This helps identify the trend in the release years.

##### 2. What is/are the insight(s) found from the chart?

Most of the movies and series were released after 2018, and 2020 has the highest number of releases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

An increasing number of movies and TV shows are being released, so Netflix should plan to keep adding the latest titles to keep its catalogue up to date.

#### Chart - 3  Distribution of `rating`

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12, 6))
sns.countplot(data=netflix_data, x='rating', order=netflix_data['rating'].value_counts().index)
plt.title('Distribution of Rating')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot is suitable for showing the frequency of each unique value of a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

TV-MA titles are the most numerous on Netflix. The TV-MA rating means the content is intended for viewers aged 17 and above.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix could add and produce more TV-MA movies and TV shows, since the audience for that category appears to be the largest on the platform.

#### Chart - 4 Count of Movies and TV Shows by `country`

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(14, 6))
sns.countplot(data=netflix_data[netflix_data['country'].isin(top_countries)], y='country', hue='type')
plt.title('Top 10 Countries by Number of Titles')
plt.xlabel('Count')
plt.ylabel('Country')
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar plot shows movies and TV shows side by side for each country.

##### 2. What is/are the insight(s) found from the chart?

As Netflix is a US-based company, US titles make up the largest share of its movies and TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

With this breakdown, Netflix can analyse which countries it should focus on when sourcing content.

#### Chart - 5 Most frequent `listed_in` genres

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(14, 6))
sns.barplot(x=top_genres.values, y=top_genres.index)
plt.title('Top 10 Genres')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot is easy to draw and read for a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

International movies are the most frequent genre among the added titles, while romantic movies are comparatively few.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix should try to produce or add a wider variety of movies and TV shows.

#### Chart - 6  Count of Entries Added Per Year

In [None]:
# Chart - 6 visualization code
# Plot the number of entries added per year
plt.figure(figsize=(10, 6))
sns.countplot(data=netflix_data, x='year_added')
plt.title('Number of Entries Added per Year')
plt.xlabel('Year Added')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot is useful for showing the number of data points in each year.

##### 2. What is/are the insight(s) found from the chart?

Most movies and TV shows were added in recent years.

#### Chart - 7 Top 10 directors with the most titles

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(14, 6))
sns.barplot(x=top_directors.values, y=top_directors.index)
plt.title('Top 10 Directors with the Most Titles')
plt.xlabel('Count')
plt.ylabel('Director')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot is a good way to visualize how often each director appears in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Raul Campos has the most titles on Netflix.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix should try to bring in more creative directors and recognise the work of the directors who consistently provide content.

#### Chart - 8 Top 10 actors with the most appearances

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(14, 6))
sns.barplot(x=top_actors.values, y=top_actors.index)
plt.title('Top 10 Actors with the Most Appearances')
plt.xlabel('Count')
plt.ylabel('Actor')
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot is a good way to visualize how often each actor appears in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Anupam Kher appears in many of the movies and TV shows.

All the actors in the list are Indian.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix should work on the diversity of the actors it casts in its movies and shows.

#### Chart - 9 Distribution of `duration` for Movies

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(movies_duration, bins=30, kde=True)
plt.title('Distribution of Movie Duration')
plt.xlabel('Duration (minutes)')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is a good way to visualize duration, since it groups the values into bins.

##### 2. What is/are the insight(s) found from the chart?

Most movies run for about 100 minutes.

#### Chart - 10 Distribution of TV Show Seasons

In [None]:
# Chart - 10 visualization code

# Plot the distribution of TV show seasons
plt.figure(figsize=(12, 6))
sns.histplot(tv_shows['seasons'].dropna(), bins=20, kde=True)
plt.title('Distribution of TV Show Seasons')
plt.xlabel('Number of Seasons')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is a good way to visualize how many TV shows have each number of seasons.

##### 2. What is/are the insight(s) found from the chart?

Most of the TV shows that were added have only around 2 to 3 seasons, and there are many series with only one season.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.



*   Null Hypothesis (H0): The average duration of movies is 100 minutes.

*   Alternative Hypothesis (H1): The average duration of movies is not 100 minutes.




#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Work on a copy so that adding a column does not trigger a SettingWithCopyWarning
movies = netflix_data[netflix_data['type'] == 'Movie'].copy()
movies['duration_minutes'] = movies['duration'].str.replace(' min', '').astype(float)

# Perform one-sample t-test
mu = 100  # Hypothesized mean duration
t_stat, p_value = stats.ttest_1samp(movies['duration_minutes'].dropna(), mu)

print(f"One-sample t-test results:\nT-statistic: {t_stat}\nP-value: {p_value}")

# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The average duration of movies is significantly different from 100 minutes.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the average duration of movies and 100 minutes.")

##### Which statistical test have you done to obtain P-Value?

I used a one-sample t-test to obtain the p-value.
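
To make clear what `stats.ttest_1samp` computes, here is a minimal sketch that reproduces the two-sided one-sample t-test by hand, using $t = (\bar{x} - \mu) / (s / \sqrt{n})$ with $n - 1$ degrees of freedom. The helper name `one_sample_t` and the duration values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

def one_sample_t(sample, mu):
    """Two-sided one-sample t-test computed from its definition."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Standard error uses the sample standard deviation (ddof=1)
    standard_error = sample.std(ddof=1) / np.sqrt(n)
    t_stat = (sample.mean() - mu) / standard_error
    # Two-sided p-value from the t distribution with n - 1 degrees of freedom
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value

# Made-up movie durations in minutes
durations = np.array([88, 95, 102, 110, 97, 105, 99, 93])

t_manual, p_manual = one_sample_t(durations, 100)
t_scipy, p_scipy = stats.ttest_1samp(durations, 100)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

Both pairs of values should agree, which confirms that the library call is the same test as the formula.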

##### Why did you choose the specific statistical test?

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.



*   Null Hypothesis (H0): The number of TV shows added per year follows a uniform distribution.

*   Alternative Hypothesis (H1): The number of TV shows added per year does not follow a uniform distribution.



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Count the number of TV shows added per year
tv_shows_per_year = tv_shows['year_added'].value_counts().sort_index()

# Perform chi-square goodness-of-fit test
expected_proportions = [1 / len(tv_shows_per_year)] * len(tv_shows_per_year)
# Use sum of observed frequencies instead of length of the DataFrame
expected = [prop * tv_shows_per_year.sum() for prop in expected_proportions]
chi_stat, p_value = chisquare(tv_shows_per_year, f_exp=expected)

print(f"Chi-square goodness-of-fit test results:\nChi-square statistic: {chi_stat}\nP-value: {p_value}")

# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The number of TV shows added per year does not follow a uniform distribution.")
else:
    print("Fail to reject the null hypothesis: The number of TV shows added per year follows a uniform distribution.")

##### Which statistical test have you done to obtain P-Value?

We will use a chi-square goodness-of-fit test for this hypothesis.
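
As a minimal sketch of what the goodness-of-fit test measures, the statistic $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$ can be computed directly and compared with `scipy.stats.chisquare`. The yearly counts below are made up for illustration; under the uniform null hypothesis every year has the same expected count.

```python
import numpy as np
from scipy.stats import chi2, chisquare

# Made-up number of TV shows added in four consecutive years
observed = np.array([10, 40, 150, 400])

# Uniform null hypothesis: the total is spread evenly across the years
expected = np.full(observed.shape, observed.sum() / observed.size)

# Statistic from its definition, with (number of categories - 1) degrees of freedom
chi_manual = ((observed - expected) ** 2 / expected).sum()
p_manual = chi2.sf(chi_manual, df=observed.size - 1)

chi_scipy, p_scipy = chisquare(observed, f_exp=expected)
print(chi_manual, p_manual)
print(chi_scipy, p_scipy)
```

Counts that grow strongly from year to year, as in this toy example, give a large statistic and a very small p-value, so the uniform hypothesis would be rejected.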

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.



*   Null Hypothesis (H0): The proportion of TV shows to movies is equal.

*   Alternative Hypothesis (H1): The proportion of TV shows to movies is not equal.




#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
type_counts = netflix_data['type'].value_counts()
observed = [type_counts['TV Show'], type_counts['Movie']]

# Expected counts assuming equal proportions
total = sum(observed)
expected = [total / 2, total / 2]

# Perform chi-square goodness-of-fit test against the equal expected counts
# (chi2_contingency would wrongly treat the expected counts as a second observed sample)
chi_stat, p_value = chisquare(observed, f_exp=expected)

print(f"Chi-square goodness-of-fit test results:\nChi-square statistic: {chi_stat}\nP-value: {p_value}")

# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The proportion of TV shows to movies is not equal.")
else:
    print("Fail to reject the null hypothesis: The proportion of TV shows to movies is equal.")

##### Which statistical test have you done to obtain P-Value?

We use a chi-square goodness-of-fit test for this hypothesis, comparing the observed counts of TV shows and movies with the equal counts expected under the null hypothesis.



##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Fill missing values for 'director', 'cast', 'country', and 'rating' with 'Unknown'
netflix_data['director'] = netflix_data['director'].fillna('Unknown')
netflix_data['cast'] = netflix_data['cast'].fillna('Unknown')
netflix_data['country'] = netflix_data['country'].fillna('Unknown')
netflix_data['rating'] = netflix_data['rating'].fillna('Unknown')

# Drop rows with missing values in 'date_added'
netflix_data = netflix_data.dropna(subset=['date_added'])

#### What all missing value imputation techniques have you used and why did you use those techniques?



*   Filled the missing values in the `director`, `cast`, `country`, and `rating` columns with 'Unknown'.
*   Dropped the rows where `date_added` was missing.



### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Extract numerical part from 'duration'
netflix_data['duration_minutes'] = netflix_data['duration'].apply(lambda x: int(x.split()[0]) if 'min' in x else np.nan)

# For TV shows, convert 'duration' to number of seasons ('Season' also matches 'Seasons')
netflix_data['seasons'] = netflix_data['duration'].apply(lambda x: int(x.split()[0]) if 'Season' in x else np.nan)

# Fill NaN values with zeroes where appropriate
netflix_data['duration_minutes'] = netflix_data['duration_minutes'].fillna(0)
netflix_data['seasons'] = netflix_data['seasons'].fillna(0)

# Now, we will have two numerical columns: 'release_year' and 'duration_minutes'
numerical_cols = ['release_year', 'duration_minutes']

# Visualize outliers using box plots
plt.figure(figsize=(12, 6))

for i, col in enumerate(numerical_cols, 1):
    plt.subplot(1, 2, i)
    sns.boxplot(y=netflix_data[col])
    plt.title(f'Boxplot of {col}')

plt.tight_layout()
plt.show()

# Function to handle outliers
def handle_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Replace outliers with the median of the column
    median = df[column].median()
    df.loc[(df[column] < lower_bound) | (df[column] > upper_bound), column] = median
    return df

# Handle outliers for numerical columns
for col in numerical_cols:
    netflix_data = handle_outliers(netflix_data, col)

# Verify if outliers are handled
print(netflix_data[numerical_cols].describe())

There is no real need to handle these outliers, because we are going to use NLP on the text features to create the clusters, and the outlying values are relevant.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

In [None]:
netflix_data['title'] = netflix_data['title'].fillna('')
netflix_data['director'] = netflix_data['director'].fillna('')
netflix_data['cast'] = netflix_data['cast'].fillna('')
netflix_data['country'] = netflix_data['country'].fillna('')
netflix_data['listed_in'] = netflix_data['listed_in'].fillna('')
netflix_data['description'] = netflix_data['description'].fillna('')

netflix_data['text_features'] = netflix_data['title'] + ' ' + netflix_data['director'] + ' ' + netflix_data['cast'] + ' ' + netflix_data['country'] + ' ' + netflix_data['listed_in'] + ' ' + netflix_data['description']
# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize stopwords, lemmatizer, and punctuations
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
punctuations = string.punctuation

#### 1. A single function for expanding contractions, lowercasing, removing punctuation, URLs, and words containing digits, removing stopwords and whitespace, tokenization, and text normalization

In [None]:
# Expand Contraction
# Define a function for preprocessing
def preprocess_text(text):
    try:
        # Expand contractions
        text = contractions.fix(text)

        # Convert to lowercase
        text = text.lower()

        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

        # Remove HTML tags
        text = BeautifulSoup(text, "html.parser").get_text()

        # Remove punctuations
        text = text.translate(str.maketrans('', '', punctuations))

        # Remove words containing digits
        text = re.sub(r'\w*\d\w*', '', text)

        # Remove non-ASCII characters
        text = text.encode('ascii', 'ignore').decode()

        # Tokenize
        words = word_tokenize(text)

        # Remove stopwords and words containing digits
        words = [word for word in words if word not in stop_words and not any(char.isdigit() for char in word)]

        # Lemmatize
        words = [lemmatizer.lemmatize(word) for word in words]

        # Join the processed words back into a single string
        processed_text = ' '.join(words)

        return processed_text
    except Exception as e:
        # Report the failure and return an empty string so that apply() can continue
        print(f"Error processing text: {text!r} ({e})")
        return ""

# Apply preprocessing to the text_features column
netflix_data['text_features'] = netflix_data['text_features'].apply(preprocess_text)
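
To see the effect of the individual cleaning steps on one sentence, here is a reduced sketch that uses only the standard library. It is not the function above: it skips contraction expansion, HTML stripping, and lemmatization, and the `STOP_WORDS` set is a tiny hypothetical stand-in for the NLTK stopword list. The function name `simple_clean` and the example sentence are also assumptions for illustration.

```python
import re
import string

# Hypothetical, very small stopword list (the notebook uses NLTK's English list)
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to", "at"}

def simple_clean(text):
    """Lowercase, drop URLs, punctuation, digit-bearing words and stopwords."""
    text = text.lower()
    # Remove URLs before punctuation is stripped, while they are still recognisable
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove any token that contains a digit
    text = re.sub(r"\w*\d\w*", " ", text)
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)

cleaned = simple_clean("The Heist of 2019 is streaming at https://example.com, in HD!")
print(cleaned)  # heist streaming hd
```

The order of the steps matters: URLs are removed first because stripping punctuation beforehand would break them into ordinary-looking words.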

#### Text Vectorization

In [None]:
# Vectorizing Text
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)  # Adjust max_features as needed
tfidf_matrix = tfidf_vectorizer.fit_transform(netflix_data['text_features'])
# Dense copy of the same matrix for plotting (no need to fit the vectorizer a second time)
for_plot_matrix = tfidf_matrix.toarray()
for_plot_matrix.shape

##### Which text vectorization technique have you used and why?

I used TF-IDF (term frequency-inverse document frequency) vectorization, for the following reasons:

*   Importance weighting: term frequency (TF) measures how often a term appears in a document, while inverse document frequency (IDF) reduces the weight of terms that are common across the corpus and highlights distinctive ones.
*   Simplicity and interpretability: it is easy to compute and understand, and the resulting weights clearly indicate term importance.
*   Effectiveness: it performs well in text classification, clustering, and information retrieval tasks.
*   Sparse representation: the resulting sparse matrix is efficient in storage and computation, which suits large text corpora.
*   Baseline method: it is robust and simple, providing a good starting point for text mining and NLP tasks.

#### Dimensionality Reduction

 Do you think that dimensionality reduction is needed? Explain why.

Yes. The TF-IDF matrix has many features (one column per term, up to 5000), so a dimensionality reduction technique is needed.

In [None]:
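# Dimensionality Reduction (If needed)
tfidf_matrix.shape

# Apply PCA to the dense TF-IDF matrix
# Note: PCA() without n_components keeps every component; pass n_components to actually reduce the dimensionality
pca = PCA()  # Adjust the number of components as needed
pca_matrix = pca.fit_transform(tfidf_matrix.toarray())
pca_matrix.shape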

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Simplicity and Speed: PCA is relatively simple to implement and computationally efficient, making it suitable for large datasets.

Variance Preservation: PCA aims to preserve the maximum amount of variance in the data by projecting it onto the principal components.
This often captures the most important aspects of the data.

Linear Method: PCA is a linear technique, which makes it straightforward and easy to interpret.


Widely Known and Used: PCA is well-studied, widely used, and well-supported in many libraries and frameworks, making it a go-to method for many practitioners.
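
One way to choose the number of components, sketched here on made-up data rather than on the TF-IDF matrix, is to pass a variance fraction as `n_components`, in which case scikit-learn keeps just enough components to explain at least that share of the variance. The 95% threshold and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up data: 200 rows and 50 columns that are noisy mixtures of only 5 latent factors
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# A float in (0, 1) asks PCA to keep enough components to explain that share of variance
pca_demo = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_demo.fit_transform(X_demo)

print(X_demo.shape, "->", X_reduced.shape)
print("explained variance kept:", pca_demo.explained_variance_ratio_.sum())
```

Plotting `np.cumsum(pca.explained_variance_ratio_)` for a fully fitted PCA is a common alternative for picking the cut-off visually.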

#### For training the model
I have created three helper functions that implement grid search for the machine learning models and evaluate their performance.

In [None]:
def silhouette_scorer(estimator, X, y=None):
    """
    Custom scorer for silhouette score.

    GridSearchCV accepts a callable with the signature scorer(estimator, X, y),
    so this function is passed directly as `scoring`. It is not wrapped in
    make_scorer, which expects a score_func(y_true, y_pred) instead.
    """
    # Clusterers such as AgglomerativeClustering and DBSCAN have no predict(),
    # so the estimator is re-fitted on the fold being scored.
    clusters = estimator.fit_predict(X)
    if len(set(clusters)) > 1:
        return silhouette_score(X, clusters)
    else:
        return -1


def tune_clustering_hyperparameters(model, param_grid, X):
    """
    Optimizes hyperparameters for a clustering model using GridSearchCV.
    """
    # Create the GridSearchCV object
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring=silhouette_scorer, n_jobs=-1, cv=3)

    # Fit the GridSearchCV object to the data (clustering is unsupervised, so no target is passed)
    grid_search.fit(X)

    # Get the best hyperparameters
    best_hyperparameters = grid_search.best_params_

    # Return the best estimator (refitted on all of X) together with its hyperparameters
    return grid_search.best_estimator_, best_hyperparameters

def evaluate_clustering_model(model, X, model_name):
    """
    Evaluates the performance of a clustering model using various metrics.
    """
    # Get cluster predictions
    labels = model.fit_predict(X)
    if len(set(labels)) == 1:
        print(f"{model_name} produced only one cluster.")
        return pd.DataFrame({model_name: ["Only one cluster found"]})

    # Calculate evaluation metrics
    silhouette = silhouette_score(X, labels)
    calinski_harabasz = calinski_harabasz_score(X, labels)
    davies_bouldin = davies_bouldin_score(X, labels)

    # Create a dataframe with the results
    metrics = {
        "Silhouette Score": silhouette,
        "Calinski-Harabasz Score": calinski_harabasz,
        "Davies-Bouldin Score": davies_bouldin
    }

    df = pd.DataFrame(metrics, index=[model_name]).round(2)
    return df
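
The grid-search idea behind these helpers can be sketched on synthetic data. This is a minimal example under stated assumptions, not the notebook's actual run: `demo_silhouette_scorer`, `demo_X`, and `demo_search` are hypothetical names, and the three well-separated blobs are made up so that the expected answer is easy to reason about. It relies on `GridSearchCV` accepting a scoring callable of the form `scorer(estimator, X, y)`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV


def demo_silhouette_scorer(estimator, X, y=None):
    """Scorer callable in the (estimator, X, y) form that GridSearchCV accepts."""
    labels = estimator.fit_predict(X)
    if len(set(labels)) < 2:
        return -1.0  # silhouette is undefined for a single cluster
    return silhouette_score(X, labels)


# Synthetic data with three well-separated groups (an assumption for illustration)
demo_X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=42,
)

demo_search = GridSearchCV(
    estimator=KMeans(n_init=10, random_state=42),
    param_grid={"n_clusters": [2, 3, 4, 5]},
    scoring=demo_silhouette_scorer,
    cv=3,
)
demo_search.fit(demo_X)  # no target is needed for clustering
print(demo_search.best_params_)
```

Because the scorer re-fits the estimator on each held-out fold, the cross-validation here acts as a repeated-subsample stability check rather than a true train/test evaluation, which is worth keeping in mind when reading the scores.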


## ***7. ML Model Implementation***

### ML Model - 1 KMeans Clustering

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm


# Define parameter grid
kmeans_param_grid = {
    'n_clusters': range(2, 10),
    'init': ['k-means++', 'random'],
    'n_init': [10, 20],
    'max_iter': [300, 500]
}

# Tune hyperparameters
best_kmeans, best_kmeans_params = tune_clustering_hyperparameters(KMeans(random_state=42), kmeans_param_grid, pca_matrix)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print(best_kmeans_params)

kmeans_evaluation = evaluate_clustering_model(best_kmeans, pca_matrix, 'KMeans')
print(kmeans_evaluation)


### ML Model - 2 Agglomerative Clustering

In [None]:
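# Define parameter grid.
# Ward linkage only supports Euclidean distance, so it gets its own grid.
# Note: scikit-learn 1.2 renamed `affinity` to `metric` (the old name was
# removed in 1.4); use `affinity` instead on older versions.
agglo_param_grid = [
    {'n_clusters': range(2, 10), 'linkage': ['ward']},
    {
        'n_clusters': range(2, 10),
        'metric': ['euclidean', 'manhattan', 'cosine'],
        'linkage': ['complete', 'average', 'single']
    }
]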

# Tune hyperparameters
best_agglo, best_agglo_params = tune_clustering_hyperparameters(AgglomerativeClustering(), agglo_param_grid, pca_matrix)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print(best_agglo_params)
agglo_evaluation = evaluate_clustering_model(best_agglo, pca_matrix, 'Agglomerative')
print(agglo_evaluation)

### ML Model - 3 DBSCAN Algorithm

In [None]:
# ML Model - 3 Implementation

# Define parameter grid
dbscan_param_grid = {
    'eps': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'min_samples': [5, 10, 15, 20],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

# Tune hyperparameters
best_dbscan, best_dbscan_params = tune_clustering_hyperparameters(DBSCAN(), dbscan_param_grid, pca_matrix)

In [None]:
print(best_dbscan_params)
dbscan_evaluation = evaluate_clustering_model(best_dbscan, pca_matrix, 'DBSCAN')
print(dbscan_evaluation)

#### DBSCAN produced only one cluster, so we are not including it in the comparison chart
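
One possible reason for a single-cluster DBSCAN result is an `eps` grid that does not match the distance scale of the data. A common heuristic, sketched below on synthetic data, is to look at each point's distance to its k-th nearest neighbour and pick `eps` from that distribution. The 95th percentile rule and the names `demo_X`, `eps_guess`, and `demo_labels` are assumptions for illustration; this was not run on `pca_matrix`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the real feature matrix (an assumption for illustration)
demo_X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=42,
)

min_samples = 5
# Distance from each point to its min_samples-th nearest neighbour
# (when querying the training data, the point itself is the first neighbour)
nn = NearestNeighbors(n_neighbors=min_samples).fit(demo_X)
distances, _ = nn.kneighbors(demo_X)
k_distances = np.sort(distances[:, -1])

# Crude heuristic: use a high percentile of the k-distances as eps
eps_guess = float(np.percentile(k_distances, 95))

demo_labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(demo_X)
n_found = len(set(demo_labels) - {-1})  # label -1 marks noise points
n_noise = int((demo_labels == -1).sum())
print(f"eps={eps_guess:.3f}, clusters={n_found}, noise={n_noise}")
```

Plotting `k_distances` and looking for the "elbow" is the more usual way to read this curve; the percentile is only a quick numerical substitute.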

#### Comparison Chart of KMeans and Agglomerative Clustering

In [None]:

evaluation_results = pd.concat([kmeans_evaluation, agglo_evaluation])
print(evaluation_results)

evaluation_results.plot(kind='bar', figsize=(12, 6))
plt.title('Clustering Model Evaluation Metrics')
plt.ylabel('Score')
plt.xlabel('Model')
plt.show()

#### Plotting the Clusters

In [None]:
def plot_clusters(model, X, title):
    """
    Plots clusters formed by the model on the given data.
    """
    # Reduce dimensions to 2D using PCA
    pca = PCA(n_components=2)
    reduced_data = pca.fit_transform(X)

    # Predict the cluster labels
    cluster_labels = model.fit_predict(X)

    # Create a DataFrame for visualization (two columns, one per PCA component)
    df = pd.DataFrame(reduced_data, columns=['PCA1', 'PCA2'])
    df['Cluster'] = cluster_labels

    # Plot the clusters
    plt.figure(figsize=(10, 6))
    sns.scatterplot(data=df, x='PCA1', y='PCA2', hue='Cluster', palette='viridis', s=100, alpha=0.6)
    plt.title(title)
    plt.show()

plot_clusters(best_kmeans, pca_matrix, 'K-Means Clustering')
plot_clusters(best_agglo, pca_matrix, 'Agglomerative Clustering')

# **Conclusion**

In this project, we implemented and tuned three clustering algorithms (KMeans, Agglomerative Clustering, and DBSCAN) to identify distinct groups within the Netflix dataset. Using TF-IDF for text representation, we ensured that the models weighted terms by their importance relative to their frequency across the corpus. We applied PCA to the TF-IDF features and used a two-component PCA projection to visualize the clusters. Hyperparameters were tuned with GridSearchCV using the silhouette score, and the KMeans and Agglomerative models were compared on the Silhouette, Calinski-Harabasz, and Davies-Bouldin scores; DBSCAN produced only one cluster and was excluded from the comparison. This project demonstrates how combining text preprocessing with unsupervised learning methods can extract patterns from textual data.
