<a href="https://colab.research.google.com/github/AdityaSingh1907/Netflix-Movies-and-TV-Shows-Clustering./blob/main/Netflix_Movies_and_TV_Shows_Clustering_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Netflix Movies and TV Shows Clustering**



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Name** - Aditya Singh


# **Project Summary -**

**Summary:**
This project involves a comprehensive analysis of Netflix content, including movies and TV shows, followed by the development of a movie recommendation system based on content similarity. The project utilizes data cleaning, visualization, clustering, and natural language processing techniques to extract insights from the dataset and provide personalized content recommendations to users.



**Technical Documentation:**

**1. Introduction:**

The project aims to analyze Netflix's extensive collection of movies and TV shows to uncover patterns, trends, and preferences among viewers. Additionally, it implements a recommendation system that suggests content based on the descriptions of previously watched shows or movies.

**2. Data Collection and Overview:**

The dataset consists of 7,787 rows and 12 columns, with information about each title, including title, director, cast, country, date added, release year, rating, duration, genre, and description.

**3. Data Preprocessing:**

Missing data: Null values in columns like 'cast' and 'country' were handled by filling them appropriately. Rows with null values in 'date_added' and 'rating' were dropped.
Feature engineering: Ratings were grouped into categories.
Unnecessary columns: The 'director' column was removed as it was not relevant for the analysis.

**4. Data Exploration:**

Visualizations were created to understand the distribution and trends in Netflix content.
Key insights include a higher number of movies compared to TV shows, the growth of content over the years, and the distribution of content by ratings and genres.

**5. Clustering Analysis:**

K-means clustering was applied to group content based on the descriptions.
Evaluation metrics like Silhouette Score and Davies-Bouldin Score were used to assess the quality of clusters.
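
The clustering step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: the toy descriptions, the TF-IDF vectorization, and `n_clusters=2` are assumptions chosen so the example runs on its own.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Toy descriptions standing in for the 'description' column (illustrative only)
toy_descriptions = [
    "A detective hunts a serial killer in a dark city",
    "Police detective investigates a brutal murder case",
    "A killer is chased by a determined detective",
    "Kids learn songs and colors with friendly animals",
    "Friendly animals teach kids to count and sing songs",
    "Animals and kids sing songs about colors",
]

# Turn each description into a TF-IDF vector (assumed text representation)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(toy_descriptions)

# Fit K-means; k=2 is an assumption that suits this toy corpus
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Silhouette: higher is better (range -1 to 1); Davies-Bouldin: lower is better
sil = silhouette_score(X, labels)
dbi = davies_bouldin_score(X.toarray(), labels)  # this metric needs a dense array
print(f"silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
```

On the real data, the same two scores can be computed for several values of k and compared before settling on a cluster count.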

**6. Movie Recommendation System:**

The recommendation system was built using spaCy and cosine similarity based on word vectors.
Word vectors were created for all descriptions in the dataset.
Recommendations are provided by finding the most similar content descriptions based on cosine similarity.
Users can input a movie or TV show title, and the system suggests top recommendations.
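
The similarity lookup described above can be sketched with plain NumPy. The titles and three-dimensional vectors below are made-up stand-ins for spaCy document vectors, and `top_k_similar` is a hypothetical helper name, not a function from the notebook.

```python
import numpy as np

def top_k_similar(query_index, vectors, titles, k=2):
    """Return the k titles whose vectors are most cosine-similar to the query title."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalize each row to unit length so a dot product equals cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf  # never recommend the query title itself
    best = np.argsort(sims)[::-1][:k]
    return [(titles[i], float(sims[i])) for i in best]

# Toy data (assumed): in the notebook these would be description vectors from spaCy
toy_titles = ["Crime Show A", "Crime Show B", "Kids Show A", "Kids Show B"]
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]

recommendations = top_k_similar(0, toy_vectors, toy_titles, k=2)
print(recommendations)
```

Normalizing once and using a matrix product keeps the lookup cheap even when every description in the catalogue is compared against the query.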

**7. Future Enhancements:**

The recommendation system can be improved by incorporating user feedback and collaborative filtering.
External data sources or additional features, such as user preferences, could enhance recommendation accuracy.

**8. Implementation Details:**

The project was implemented using Python and various libraries, including pandas, matplotlib, scikit-learn, and spaCy.
The code provides step-by-step explanations and is well-documented for easy understanding and replication.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**This dataset consists of tv shows and movies available on Netflix as of 2019. The dataset is collected from Flixable which is a third-party Netflix search engine.**

**In 2018, they released an interesting report which shows that the number of TV shows on Netflix has nearly tripled since 2010. The streaming service’s number of movies has decreased by more than 2,000 titles since 2010, while its number of TV shows has nearly tripled. It will be interesting to explore what all other insights can be obtained from the same dataset.**

**Integrating this dataset with other external datasets such as IMDB ratings, rotten tomatoes can also provide many interesting findings.**


**In this project, you are required to do**

Exploratory Data Analysis

Understanding what type of content is available in different countries

Has Netflix increasingly focused on TV rather than movies in recent years?

Clustering similar content by matching text-based features

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # 'from numpy import math' fails on recent NumPy releases; use the standard library module
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from matplotlib.pyplot import figure
import plotly.graph_objects as go
import plotly.offline as py
import plotly.express as px
from datetime import datetime


import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
#mounting the google drive to access the files
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
file_path='/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv'
df = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(),cbar=True)

In [None]:
#total null values
df.isnull().sum().sum()

### What did you know about your dataset?

The dataset has 7787 rows and 12 columns. There are 3631 missing values in total, spread across the director, cast, country, date_added, and rating columns, and there are no duplicate rows in the dataset.
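
To see how the missing values are spread across columns, a per-column summary helps more than the grand total. The sketch below uses a small made-up frame; `missing_summary` is a hypothetical helper, and on the real data it would be called as `missing_summary(df)`.

```python
import pandas as pd

def missing_summary(frame):
    """Count and percentage of missing values per column, for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame (assumed values) standing in for the Netflix dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, None],
    "rating": ["PG", None, "R", "TV-MA"],
})

summary = missing_summary(toy)
print(summary)
```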

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

*   show_id : Unique ID for every Movie / Tv Show

*   type : Identifier - A Movie or TV Show

*   title : Title of the Movie / Tv Show

*   director : Director of the Movie

*   cast : Actors involved in the movie / show

*   country : Country where the movie / show was produced

*   date_added : Date it was added on Netflix

*   release_year : Actual release year of the movie / show

*   rating : TV Rating of the movie / show

*   duration : Total Duration - in minutes or number of seasons

*   listed_in : Genre

*   description: The Summary description





### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Keep an untouched copy of the raw dataset (the result of copy() must be assigned to be kept)
df_raw = df.copy()
# Check for null values
df.isnull().sum()

In [None]:
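#Handling Null Values
# Assign the result back instead of calling fillna(inplace=True) on a single column,
# which newer pandas versions warn about
df['cast'] = df['cast'].fillna('No cast')
df['country'] = df['country'].fillna(df['country'].mode()[0])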


In [None]:
# Rows with null 'date_added' or 'rating' make up an insignificant portion of the data, so we drop those rows
df.dropna(subset=['date_added','rating'],inplace=True)

In [None]:
# Since there are many rows with a NaN director, we fill them with an empty string
df['director']=df['director'].fillna('')

In [None]:
# Check again whether any null values remain
df.isnull().sum()

In [None]:
df['rating']

In [None]:
#Assigning the Ratings into grouped categories
ratings = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
df['target_ages'] = df['rating'].replace(ratings)

In [None]:
# 'type' and 'target_ages' should be categorical columns
df['type'] = pd.Categorical(df['type'])
df['target_ages'] = pd.Categorical(df['target_ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])

In [None]:
df

### What all manipulations have you done and insights you found?

1.  Checked for null values and identified columns with missing data.
2.  Filled missing values in the 'cast' column with "No cast".
3.  Filled missing values in the 'country' column with the mode.
4.  Dropped rows with null values in 'date_added' and 'rating' columns.
5.  Filled missing values in the 'director' column with an empty string.
6.  Assigned the ratings into grouped categories ('Kids', 'Older Kids', 'Teens', 'Adults') and converted 'type' and 'target_ages' to categorical columns.
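
The null-handling steps listed above can be gathered into one function so they are easy to rerun and test. This is a sketch on a made-up frame; `clean_catalog` is a hypothetical name and the toy values are assumptions, but the four operations mirror the wrangling cells.

```python
import pandas as pd

def clean_catalog(frame):
    """Apply the notebook's null-handling steps to a copy of the frame."""
    cleaned = frame.copy()
    cleaned["cast"] = cleaned["cast"].fillna("No cast")
    # Fill missing countries with the most frequent country
    cleaned["country"] = cleaned["country"].fillna(cleaned["country"].mode()[0])
    # Drop the few rows that lack a date_added or rating value
    cleaned = cleaned.dropna(subset=["date_added", "rating"])
    cleaned["director"] = cleaned["director"].fillna("")
    return cleaned

# Toy rows (assumed values) covering each kind of gap
toy = pd.DataFrame({
    "cast": [None, "A", "B", "C"],
    "country": ["India", None, "India", "United States"],
    "date_added": ["January 1, 2020", "March 3, 2019", None, "May 5, 2018"],
    "rating": ["TV-MA", "PG", "R", None],
    "director": [None, "D1", "D2", None],
})

cleaned = clean_catalog(toy)
print(cleaned)
```

Working on a copy leaves the input untouched, which makes it easy to compare the frame before and after cleaning.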







## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Count of Movies and TV Shows on Netflix

In [None]:
df['type'].value_counts()

In [None]:
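# Chart - 1 visualization code
# Bar chart to visualize the number of movies and TV shows in the 'type' column
# Data: read the counts from the DataFrame instead of hardcoding them
type_counts = df['type'].value_counts()
content_types = ['Movies', 'TV Shows']
counts = [type_counts['Movie'], type_counts['TV Show']]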

# Plotting
plt.figure(figsize=(8, 6))
plt.bar(content_types, counts, color=['blue', 'green'])
plt.title('Count of Movies and TV Shows on Netflix')
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

To compare the counts of movies and TV shows, I used a bar chart. Bar charts are useful for comparing discrete categories, in this case the two content types. The distinct bars for movies and TV shows make it easy to compare their counts visually.

##### 2. What is/are the insight(s) found from the chart?

Netflix has 5372 movies and 2398 TV shows, so there are more movies on Netflix than TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insight about the higher number of movies than TV shows can help Netflix refine their content strategy, enhance user experiences, attract new subscribers, and make more informed decisions that positively impact their business.

They can adjust their content acquisition and production strategy to match viewer preferences and maintain a balanced library.

Highlighting popular TV shows in marketing can attract new subscribers looking for that content.

Understanding their unique content mix helps Netflix position themselves effectively against competitors.

#### Chart - 2 - Content Type Trends Over Years

In [None]:
# Chart - 2 visualization code
# Create two subsets, one per content type (copies, so later column assignments do not warn)
tv_shows=df[df['type']=='TV Show'].copy()
movies=df[df['type']=='Movie'].copy()

# Group the data by 'release_year' and 'type' columns and count occurrences
type_count_by_year = df.groupby(['release_year', 'type']).size().unstack().fillna(0)

# Plotting
plt.figure(figsize=(10, 6))

plt.plot(type_count_by_year.index, type_count_by_year['Movie'], marker='o', label='Movies')
plt.plot(type_count_by_year.index, type_count_by_year['TV Show'], marker='o', label='TV Shows')

plt.title('Content Type Trends Over Years')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.legend()
plt.grid()

plt.show()

# Countplot of movies per release year, limited to the 20 release years with the most movies

plt.figure(figsize=(10,6))
sns.countplot(y='release_year',data=movies,order=movies['release_year'].value_counts().index[0:20])
plt.title('Movies per release year (top 20 years by count)')

# Countplot of TV shows per release year, limited to the 20 release years with the most TV shows
plt.figure(figsize=(10,6))
sns.countplot(y='release_year',data=tv_shows,order=tv_shows['release_year'].value_counts().index[0:20])
plt.title('TV shows per release year (top 20 years by count)')

##### 1. Why did you pick the specific chart?

I have used line plot and Countplot to visualize the count of TV shows and movies over the years. A line plot is suitable for showing trends and changes over a continuous variable (years in this case). The connection between data points on the line helps us understand how the count of each content type evolves over time.

The countplots display the number of movies and TV shows per release year for the 20 years with the most titles, providing insight into yearly production patterns.

Both charts were chosen for their ability to convey trends and patterns in a visually clear manner.

##### 2. What is/are the insight(s) found from the chart?

*   2017 and 2018 had the highest number of movies released on Netflix.
*   2020 also saw a significant number of movie releases.
*   The number of movies on Netflix has been growing faster than the number of TV shows.
*   There was a notable increase in both movie and TV show releases after 2015.
*   There's a substantial drop in the number of movie and TV show releases after 2020.
*   It's evident that Netflix has focused more on increasing movie content compared to TV shows.
*   Movies have experienced a much more dramatic increase than TV shows.
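
The trend statements above can be checked numerically by computing, per release year, the share of TV shows and the year-over-year change in movie counts. The sketch below runs on a small made-up table; the column names `tv_share` and `movie_growth` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy data (assumed values) with the same two columns used for the line plot
toy = pd.DataFrame({
    "release_year": [2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017],
    "type": ["Movie", "Movie", "TV Show", "Movie", "Movie", "Movie", "TV Show",
             "Movie", "TV Show"],
})

# Titles per year and type, with missing combinations counted as zero
yearly = toy.groupby(["release_year", "type"]).size().unstack(fill_value=0)

# Share of TV shows among all titles released in each year
yearly["tv_share"] = yearly["TV Show"] / (yearly["Movie"] + yearly["TV Show"])

# Relative change in the movie count compared with the previous year
yearly["movie_growth"] = yearly["Movie"].pct_change()

print(yearly)
```

A rising `tv_share` over time would point to a growing focus on TV shows, while a falling one would support the statements about movies.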









##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights provide valuable information for content planning, user engagement strategies, and business decisions. The observations about content growth, shifts in focus, and production trends can guide Netflix's content acquisition and production strategies in the future.

#### Chart - 3 - Visualization of Content Ratings for TV Shows and Movies

In [None]:
# Chart - 3 visualization code
#Rating based on rating system of all TV Shows
tv_ratings = tv_shows.groupby(['rating'])['show_id'].count().reset_index(name='count').sort_values(by='count',ascending=False)
fig_dims = (10,7)
fig, ax = plt.subplots(figsize=fig_dims)
sns.pointplot(x='rating',y='count',data=tv_ratings)
plt.title('TV Show Ratings',size='20')
plt.show()

#Movie Ratings based on Target Age Groups
plt.figure(figsize=(10,6))
plt.title('movie ratings')
sns.countplot(x=movies['rating'],hue=movies['target_ages'],data=movies,order=movies['rating'].value_counts().index)

##### 1. Why did you pick the specific chart?

For the TV show ratings, I used a point plot.

Reason: A point plot shows how TV shows are distributed across ratings, with each point representing the number of TV shows that carry a specific rating. This allows a clear comparison of the ratings and their frequency.

For the movie ratings by target age group, I used a grouped count plot.

Reason: A grouped count plot is effective for comparing movie ratings across target age groups. It shows the number of movies in each rating category, grouped by target age, making it easy to identify patterns.

Both chart types were chosen to best represent the data and highlight the relationships between variables while keeping the visualizations clear and easy to interpret.

##### 2. What is/are the insight(s) found from the chart?

TV-MA is the most common rating for both TV shows and movies, so content rated for adults forms the largest group in both cases.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can be valuable for Netflix's content strategy, as it suggests that adult-oriented content, such as mature themes and content suitable for a mature audience, has been well-received by viewers. It could influence decisions related to the acquisition and production of content that aligns with this rating category.

#### Chart - 4 - Visualization of Top Countries with Most Content on Netflix

In [None]:
# Chart - 4 visualization code
# Group the data by 'country' and count occurrences
country_content_count = df['country'].value_counts()

# Select top N countries for visualization
top_countries = country_content_count.head(10)

# Plotting
plt.figure(figsize=(10, 6))
top_countries.plot(kind='bar', color='skyblue')
plt.title('Top Countries with Most Content on Netflix')
plt.xlabel('Country')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart because it is a simple and effective way to compare the number of titles (movies and TV shows) among the top countries on Netflix. The height of the bars directly shows the content count for each country, making it easy to see which countries have the most content.

##### 2. What is/are the insight(s) found from the chart?

The United States has the most content on Netflix, followed by India.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained about content distribution on Netflix, particularly the high content counts in the United States and India, can positively impact the business. They help in tailoring content strategies, improving user engagement, guiding localization efforts, and potentially attracting more subscribers, leading to increased revenue and a competitive edge.

#### Chart - 5 - Geographical Content Distribution And Type Of Content

In [None]:
import re
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.subplots import make_subplots

In [None]:
# all country df
all_countries = df.groupby(['country','type']).count()['show_id'].reset_index()
all_countries.head()

In [None]:
#country wise content for top countries
country_count = {}
for i in range(len(all_countries)):
    l = all_countries['country'][i].split(', ')
    for x in l:
        x = re.sub('[^A-Za-z0-9 ]+', '', x)
        if x not in country_count.keys():
            country_count[x] = all_countries['show_id'][i]
        else:
            country_count[x] += all_countries['show_id'][i]
country_df = pd.DataFrame(list(zip(country_count.keys(), country_count.values())), columns =['country', 'count'])

d = country_df.sort_values(by=['count'], ascending=False).head(10)
# .plot.bar(x='country',y='count',edgecolor='black')
fig = px.bar(d, x='country',y='count')
fig.update_traces(marker_color='#221F1F', marker_line_color='#E50914',
                  marker_line_width=2, opacity=1)
fig.update_layout(title='Content produced country wise')
fig.show()
top_30 = country_df.sort_values(by=['count'], ascending=False)['country'].head(30)

In [None]:
# visualization code for Count of Content per country and Content type using world map
# Group the data by 'country' and count occurrences
country_count = df['country'].value_counts().to_dict()

# Create a DataFrame for country and count
country_df = pd.DataFrame(list(country_count.items()), columns=['country', 'count'])

# Create a dictionary to map country to content type
country_type_mapping = df.groupby('country')['type'].unique().apply(lambda x: ', '.join(x)).to_dict()

# Modify the code to include type of content in hover text
trace = go.Choropleth(
    locations=list(country_count.keys()),
    locationmode='country names',
    z=list(country_count.values()),
    text=[f'{country_df["country"][i]}<br>{country_type_mapping[country_df["country"][i]]}' for i in range(len(country_df))],
    reversescale=False,
    zauto=True,
    colorscale='RdBu',
    marker=dict(
        line=dict(
            color='rgb(0,0,0)',
            width=0.5)
    ),
    colorbar=dict(
        title='Total Content',
        tickprefix='')
)

data = [trace]
layout = go.Layout(
    title='Total content per country',
    geo=dict(
        showframe=True,
        showlakes=False,
        showcoastlines=True,
    )
)

fig = go.Figure(data=data, layout=layout)
fig.show()


#### Chart - 6 - Distribution of movies and Tv-Shows genres

In [None]:
# Chart - 6 visualization code
#country wise genre
all_countries = df[['country','listed_in']]
all_countries.head()

In [None]:
#Analysing top10 genre of the movies
plt.figure(figsize=(14,6))
plt.title('Top10 Genre of Movies',fontweight="bold")
sns.countplot(y=movies['listed_in'],data=movies,order=movies['listed_in'].value_counts().index[0:10])


**Documentaries are the top genre among Netflix movies, followed by stand-up comedy and by dramas combined with international movies.**


In [None]:
#Analysing top10 genres of TVSHOWS
plt.figure(figsize=(14,6))
plt.title('Top10 Genre of TV Shows',fontweight="bold")
sns.countplot(y=tv_shows['listed_in'],data=tv_shows,order=tv_shows['listed_in'].value_counts().index[0:10])



**Looking at the top 10 genres of TV shows, Kids' TV is the top TV show genre on Netflix.**


#### Chart - 7 - Word Clouds

In [None]:
# Chart - 7 visualization code
#word cloud imports
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Word cloud of a text column, restricted to the rows that belong to one category
def func_select_Category(category_name,category_column,column_of_choice):
  df_word_cloud = df[[category_column,column_of_choice]].dropna()
  df_word_cloud = df_word_cloud[df_word_cloud[category_column]==category_name]
  text = " ".join(word for word in df_word_cloud[column_of_choice])
  # Create stopword list:
  stopwords = set(STOPWORDS)
  # Generate a word cloud image
  wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
  # Display the generated image:
  # the matplotlib way:
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off")
  plt.show()


In [None]:
# Word Cloud for Movie on Description Column
func_select_Category('Movie','type','description')

**Inference: Words like "life" and "family" appear most prominently.**

In [None]:
# Word Cloud for TV Shows on Description Column
func_select_Category('TV Show','type','description')

**Inference: Words like "life" and "family" appear most prominently, just as they did for movies.**
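
A word cloud gives a visual impression only, so the inference can be backed with explicit word counts. The sketch below uses the standard library on made-up descriptions; the small stop-word set and the `top_words` helper are assumptions for illustration, not part of the notebook.

```python
import re
from collections import Counter

def top_words(texts, stop_words, n=3):
    """Return the n most frequent words across texts, ignoring stop words."""
    tokens = []
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        tokens.extend(word for word in words if word not in stop_words)
    return Counter(tokens).most_common(n)

# Toy descriptions (assumed) standing in for the 'description' column
toy_descriptions = [
    "A family moves to a new town to start a new life",
    "After a loss, a young woman rebuilds her life",
    "A family secret changes the life of two brothers",
]
toy_stop_words = {"a", "the", "of", "to", "her", "after"}

top = top_words(toy_descriptions, toy_stop_words, n=3)
print(top)
```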

#### Chart - 8 - Correlation Heatmap

In [None]:
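# Heatmap visualization code (share of each target age group per country)
# Preparing data for heatmap
df['count'] = 1
# Sum only the helper 'count' column, so the result does not depend on how pandas
# treats a grouping key that is also selected as a value column
data = df.groupby('country')['count'].sum().sort_values(ascending=False).reset_index()[:10]
data = data['country']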


df_heatmap = df.loc[df['country'].isin(data)]
df_heatmap = pd.crosstab(df_heatmap['country'],df_heatmap['target_ages'],normalize = "index").T
df_heatmap



In [None]:
# Plotting the heatmap
fig, ax = plt.subplots(1, 1, figsize=(12, 12))

country_order2 = ['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain',
       'Mexico']

age_order = ['Adults', 'Teens', 'Older Kids', 'Kids']

sns.heatmap(df_heatmap.loc[age_order,country_order2],cmap="YlGnBu",square=True, linewidth=2.5,cbar=False,
            annot=True,fmt='1.0%',vmax=.6,vmin=0.05,ax=ax,annot_kws={"fontsize":12})
plt.show()

##### 1. Why did you pick the specific chart?

Although this section is labelled a correlation heatmap, the chart plots a normalized cross-tabulation: for each of the top content-producing countries, it shows the percentage of titles aimed at each target age group. A heatmap suits this because it makes it easy to compare the age-group mix of many countries at a glance and to spot countries with similar or very different profiles.

##### 2. What is/are the insight(s) found from the chart?

The US and UK are closely aligned in their Netflix target ages, but radically different from, for example, India or Japan.

Also, Mexico and Spain have similar content on Netflix for different age groups.
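
The percentages in the heatmap come from a row-normalized cross-tabulation. The sketch below reproduces that calculation on a small made-up table so the numbers are easy to verify by hand; the toy rows are assumptions.

```python
import pandas as pd

# Toy data (assumed values): four titles per country
toy = pd.DataFrame({
    "country": ["United States"] * 4 + ["India"] * 4,
    "target_ages": ["Adults", "Adults", "Teens", "Kids",
                    "Teens", "Teens", "Teens", "Adults"],
})

# normalize="index" divides each country's counts by that country's total,
# so every country sums to 1; the transpose puts age groups on the rows
shares = pd.crosstab(toy["country"], toy["target_ages"], normalize="index").T

print(shares)
```

Because each country is normalized separately, the heatmap compares the mix of age groups rather than the raw size of each country's catalogue.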

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Statement 1: The average duration of TV shows on Netflix is significantly different from the average duration of movies.

Statement 2: The distribution of content ratings on Netflix is independent of the content type (movies or TV shows).

Statement 3: There is a significant difference in the distribution of content ratings among the top three countries with the most content on Netflix.

In [None]:
#importing scipy.stats module for statistical hypothesis testing
from scipy import stats
# Make a copy of the cleaned DataFrame for hypothesis testing
df_hypothesis=df.copy()
#head of df_hypothesis
df_hypothesis.head()

### Hypothetical Statement - 1 -The average duration of TV shows on Netflix is significantly different from the average duration of movies.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The average duration of TV shows on Netflix is equal to the average duration of movies.

Alternative Hypothesis (H1): The average duration of TV shows on Netflix is not equal to the average duration of movies.

#### 2. Perform an appropriate statistical test.

In [None]:
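# Perform Statistical Test to obtain P-Value
# 'duration' holds strings such as "90 min" or "2 Seasons"; pd.to_numeric(..., errors='coerce')
# would turn all of them into NaN, so extract the leading number instead
tv_duration = tv_shows['duration'].str.extract(r'(\d+)', expand=False).astype(float)
movie_duration = movies['duration'].str.extract(r'(\d+)', expand=False).astype(float)

# Note: TV show durations are counted in seasons and movie durations in minutes

# Perform Welch's t-test for equality of means (equal_var=False)
t_stat, p_value = stats.ttest_ind(tv_duration, movie_duration, equal_var=False)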

# Set significance level
alpha = 0.05

# Check if p-value is less than alpha
if p_value < alpha:
    print("Reject the null hypothesis: The average duration of TV shows is significantly different from movies.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in duration between TV shows and movies.")

##### Which statistical test have you done to obtain P-Value?

The p-value was obtained with an independent two-sample t-test, run with equal_var=False (Welch's t-test). It is a parametric test used to determine whether there is a significant difference between the means of two independent groups (in this case, the average duration of TV shows and movies on Netflix).

##### Why did you choose the specific statistical test?

The specific statistical test, Welch's t-test, was chosen because it's appropriate for comparing the means of two independent groups (TV shows and movies) when there might be unequal variances or sample sizes.
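
A minimal sketch of the test on made-up data follows. It assumes that 'duration' is stored as text such as "90 min" or "2 Seasons", so the number is extracted before testing; the toy rows and the `duration_value` column name are assumptions. Note that the two groups are measured in different units (minutes and seasons), which limits how the result can be interpreted.

```python
import pandas as pd
from scipy import stats

# Toy rows (assumed values) in the same text format as the 'duration' column
toy = pd.DataFrame({
    "type": ["Movie"] * 4 + ["TV Show"] * 4,
    "duration": ["90 min", "105 min", "120 min", "98 min",
                 "1 Season", "2 Seasons", "1 Season", "3 Seasons"],
})

# Pull the leading number out of each duration string
toy["duration_value"] = toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)

movie_values = toy.loc[toy["type"] == "Movie", "duration_value"]
tv_values = toy.loc[toy["type"] == "TV Show", "duration_value"]

# equal_var=False selects Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(tv_values, movie_values, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```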

### Hypothetical Statement - 2 -The distribution of content ratings on Netflix is independent of the content type (movies or TV shows).

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The distribution of content ratings on Netflix is independent of the content type (movies or TV shows).

Alternative Hypothesis (H1): The distribution of content ratings on Netflix is dependent on the content type.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

# Create a contingency table of content ratings and content type
contingency_table = pd.crosstab(df['rating'], df['type'])

# Perform chi-square test
chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)

# Set significance level
alpha = 0.05

# Check if p-value is less than alpha
if p_value < alpha:
    print("Reject the null hypothesis: The distribution of content ratings is dependent on content type.")
else:
    print("Fail to reject the null hypothesis: The distribution of content ratings is independent of content type.")

##### Which statistical test have you done to obtain P-Value?

The statistical test used to obtain the p-value in this case is the chi-squared test. This test is used to determine if there is an association between two categorical variables, in this case, content ratings and content types (movies or TV shows).

##### Why did you choose the specific statistical test?


The chi-square test was chosen because it is appropriate for analyzing the independence of two categorical variables, which suits the hypothesis being tested here: the relationship between content ratings and content types on Netflix.
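
Beyond the p-value, the expected counts and an effect size help judge how strong the association is. The sketch below uses a made-up contingency table and adds Cramér's V, which the notebook does not compute; both the numbers and that addition are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table (assumed counts): ratings as rows, content types as columns
toy_table = pd.DataFrame(
    {"Movie": [30, 10, 20], "TV Show": [10, 25, 5]},
    index=["TV-MA", "TV-14", "TV-Y"],
)

# 'expected' holds the counts we would see if rating and type were independent
chi2_stat, p_value, dof, expected = chi2_contingency(toy_table)

# Cramér's V rescales chi-square to the range 0 (no association) to 1 (perfect association)
n = toy_table.to_numpy().sum()
cramers_v = np.sqrt(chi2_stat / (n * (min(toy_table.shape) - 1)))

print(f"chi2={chi2_stat:.2f}, dof={dof}, p={p_value:.5f}, V={cramers_v:.3f}")
```

With a catalogue of several thousand titles, even a weak association produces a tiny p-value, so an effect size gives a more useful sense of scale.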

### Hypothetical Statement - 3 -There is a significant difference in the distribution of content ratings among the top three countries with the most content on Netflix.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The distribution of content ratings among the top three countries with the most content on Netflix is the same.

Alternative Hypothesis (H1): There is a significant difference in the distribution of content ratings among these countries.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import kruskal

# Select the top three countries with the most content
top_countries = df['country'].value_counts().head(3).index.tolist()

# Kruskal-Wallis ranks the observations, so it needs ordered values rather than raw rating labels.
# Encode each title by its target age group (Kids < Older Kids < Teens < Adults) using the category codes.
country_data = [df.loc[df['country'] == country, 'target_ages'].cat.codes for country in top_countries]

# Perform Kruskal-Wallis test
H, p_value = kruskal(*country_data)

# Set significance level (e.g., 0.05)
alpha = 0.05

# Check if p-value is less than alpha
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in content ratings among the top three countries.")
else:
    print("Fail to reject the null hypothesis: The distribution of content ratings is the same among the top three countries.")

##### Which statistical test have you done to obtain P-Value?

I used the Kruskal-Wallis test to obtain the p-value for the hypothesis test. This test is a non-parametric method used to determine whether there are statistically significant differences between the distributions of three or more groups. In this case, it was used to determine if there is a significant difference in content ratings among the top three countries with the most content on Netflix.

##### Why did you choose the specific statistical test?

The Kruskal-Wallis test is robust and does not assume equal variances or that the data follows a specific distribution, making it a suitable choice for ordinal data such as content ratings.
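
Because the test works on ranks, the rating labels need an explicit order before they are passed in. The sketch below is one way to do that on made-up data: it encodes the target age groups as ordered codes. The toy countries, the rows, and the `age_code` column are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import kruskal

# Order of the age groups from youngest to oldest audience
age_order = ["Kids", "Older Kids", "Teens", "Adults"]

# Toy data (assumed values): five titles for each of three placeholder countries
toy = pd.DataFrame({
    "country": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "target_ages": ["Adults", "Adults", "Teens", "Adults", "Teens",
                    "Kids", "Older Kids", "Kids", "Teens", "Kids",
                    "Teens", "Older Kids", "Adults", "Teens", "Older Kids"],
})

# Map each label to its position in age_order (0 = Kids ... 3 = Adults)
toy["age_code"] = pd.Categorical(toy["target_ages"], categories=age_order, ordered=True).codes

# One array of codes per country, then the rank-based test across the groups
groups = [group["age_code"].to_numpy() for _, group in toy.groupby("country")]
h_stat, p_value = kruskal(*groups)
print(f"H={h_stat:.2f}, p={p_value:.4f}")
```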


## ***6. Feature Engineering & Data Pre-processing***

###  Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
#Checking Missing Values
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

No missing values remain at this stage; they were filled or dropped during the earlier data-wrangling step, so no further imputation is needed.

###  Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import re
# Define a dictionary of common English contractions and their expanded forms
contractions_dict = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "can not",
    "could've": "could have",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'll": "he will",
    "he's": "he is",
    "how'd": "how did",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'll": "I will",
    "I'm": "I am",
    "I've": "I have",
    "isn't": "is not",
    "it'd": "it would",
    "it'll": "it will",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mightn't": "might not",
    "mustn't": "must not",
    "needn't": "need not",
    "o'clock": "of the clock",
    "shan't": "shall not",
    "she'd": "she would",
    "she'll": "she will",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "that'll": "that will",
    "that's": "that is",
    "there's": "there is",
    "they'd": "they would",
    "they'll": "they will",
    "they're": "they are",
    "they've": "they have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'll": "we will",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "where'd": "where did",
    "where's": "where is",
    "who'd": "who would",
    "who'll": "who will",
    "who're": "who are",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "won't": "will not",
    "would've": "would have",
    "wouldn't": "would not",
    "you'd": "you would",
    "you'll": "you will",
    "you're": "you are",
    "you've": "you have"
}

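# Function to expand contractions in a text
def expand_contractions(text, contractions_dict):
    # Lowercase the keys so that entries such as "I'm" match the lowercased words
    lookup = {key.lower(): value for key, value in contractions_dict.items()}
    words = text.split()
    expanded_words = [lookup.get(word.lower(), word) for word in words]
    return " ".join(expanded_words)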


# Example usage:
text_with_contractions = "I can't believe it's raining. You're going to the party, aren't you?"
expanded_text = expand_contractions(text_with_contractions, contractions_dict)
print(expanded_text)
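The word-by-word lookup above only matches tokens that are exactly a contraction, so a contraction followed directly by punctuation (for example `it's,`) is left unchanged. One possible alternative is a regular-expression substitution. This is a sketch under assumptions: the small `contractions` dictionary and the function name `expand_contractions_regex` are illustrative and not part of the original notebook.

```python
import re

# Small illustrative subset with lowercase keys; the full dictionary could be used instead
contractions = {
    "can't": "can not",
    "it's": "it is",
    "you're": "you are",
    "aren't": "are not",
    "i'm": "i am",
}

# Longest keys first so that longer contractions are preferred over shorter ones
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(contractions, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_regex(text):
    # Replace every matched contraction with its expanded form
    return pattern.sub(lambda m: contractions[m.group(0).lower()], text)

result = expand_contractions_regex("I can't believe it's raining, aren't you?")
print(result)  # I can not believe it is raining, are not you?
```

Note that the expansion has to run before punctuation is stripped, because the apostrophes are what identify a contraction.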


#### 2. Lower Casing

In [None]:
# Combine all text columns into a single text column to work with
df['filtered'] = df['description'] + ' '+ df['listed_in'] + ' ' + df['rating'] + ' '+ df['country']+ ' ' + df['cast'] + ' '+ df['director']

In [None]:
# Convert all text to lowercase
df['filtered'] = df['filtered'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# Define a function to remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Apply the function to the combined 'filtered' column
df['filtered'] = df['filtered'].apply(remove_punctuation)

In [None]:
df['filtered'][0]

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits
# Define a function to remove URLs and words containing digits
def remove_urls_and_digits(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\w*\d\w*', '', text)  # Remove words containing digits
    return text

# Apply the function to the combined 'filtered' column
df['filtered'] = df['filtered'].apply(remove_urls_and_digits)

In [None]:
df['filtered']

#### 5. Removing Stopwords & Removing White spaces

In [None]:
#necessary import for nlp
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
nltk.download('punkt')


In [None]:
#stemming
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))

In [None]:
stop_words

In [None]:
# Function to remove stopwords and stem the remaining tokens
def wordfilter(text, filtwords):
    tokens = word_tokenize(text)
    return [stemmer.stem(word) for word in tokens if word not in filtwords]

# Apply the filter row by row; each entry becomes a list of stemmed tokens
df['filtered_new'] = df['filtered'].apply(lambda text: wordfilter(text, stop_words))

df['filtered_new']

#### 6. Tokenization

In [None]:
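# Tokenization
# Define a function to tokenize text
def tokenize_text(text):
    return word_tokenize(text)

# Apply the function to the 'filtered' column.
# Note: this overwrites 'filtered_new', so the stopword-removed, stemmed tokens
# from the previous step are replaced by the plain tokens of 'filtered'.
df['filtered_new'] = df['filtered'].apply(tokenize_text)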

In [None]:
df['filtered_new']

In [None]:
# Join a list of tokens back into a single string
def join_words(x):
  return " ".join(x)

In [None]:
#final column
df['filtered_new'] = df['filtered_new'].apply(join_words)

In [None]:
df

In [None]:
words = df.filtered_new

### Text Vectorization

In [None]:
# Using tfidf for Vectorizing Text

from sklearn.feature_extraction.text import TfidfVectorizer
t_vectorizer = TfidfVectorizer(max_df = 0.9,min_df = 1, max_features=15000)
X= t_vectorizer.fit_transform(words)


In [None]:
X

In [None]:
X.shape

##### Which text vectorization technique have you used and why?

I used the TF-IDF (Term Frequency-Inverse Document Frequency) text vectorization technique because it's a widely used method that helps represent the importance of words in a document relative to a collection of documents. It's particularly useful for tasks like information retrieval, text classification, and clustering, making it suitable for various text analysis purposes.

### PCA for Dimensionality Reduction

In [None]:
#importing PCA for Dimensionality Reduction
from sklearn.decomposition import PCA

#PCA code
transformer = PCA()
transformer.fit(X.toarray())

In [None]:
#explained var v/s comp
plt.plot(np.cumsum(transformer.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

The plot above shows that roughly 3,000 components explain about 80% of the variance.
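Instead of reading the number of components off the plot, it can be computed from the cumulative explained variance. The sketch below uses synthetic data, so the resulting number is not the notebook's; the variable names `data`, `cumvar` and `n_components_80` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data whose columns have decreasing variance (illustrative only)
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50)) * np.linspace(5.0, 0.1, 50)

pca = PCA().fit(data)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# First index where the cumulative ratio reaches 0.80, converted to a component count
n_components_80 = int(np.searchsorted(cumvar, 0.80) + 1)
print(n_components_80, round(float(cumvar[n_components_80 - 1]), 3))
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` (with the full SVD solver), in which case it keeps just enough components to reach that fraction of the variance.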

In [None]:
# Choose the dimensionality from the plot above; this might take a while.
# n_components = 3000 retains roughly 80% of the variance.
transformer = PCA(n_components=3000)
X_transformed = transformer.fit_transform(X.toarray())
X_transformed.shape

In [None]:
# Vectorize the full corpus (there is no train/test split in this unsupervised setting)
X_vectorized = t_vectorizer.transform(words)

In [None]:
#applying pca
X= transformer.transform(X_vectorized.toarray())

In [None]:
X

## ***7. ML Model Implementation***

### **K Means Clustering**

In [None]:
#finding optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  # Initializing the list for the values of WCSS

# Loop over k = 1 to 29 (range(1, 30) excludes 30)
for i in range(1, 30):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss_list.append(kmeans.inertia_)
plt.plot(range(1, 30), wcss_list)
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.show()


In [None]:
from sklearn.metrics import silhouette_score
# Silhouette score for each number of clusters
sill = []
for i in range(2, 30):
    model = KMeans(n_clusters=i, init='k-means++', random_state=51)
    y1 = model.fit_predict(X)
    score = silhouette_score(X, y1)
    sill.append(score)
    print('cluster: %d \t Silhouette: %0.4f' % (i, score))

In [None]:
# Plotting the silhouette scores against the number of clusters
plt.plot(range(2, 30), sill, 'bs--')  # x values must match k = 2..29
plt.xticks(list(range(2, 30)))  # One tick per value of k
plt.grid()
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

In [None]:
#training the K-means model on a dataset
kmeans = KMeans(n_clusters= 26, init='k-means++', random_state= 42)
y_predict= kmeans.fit_predict(X)

**Evaluation**


In [None]:
#Predict the clusters and evaluate the silhouette score

score = silhouette_score(X, y_predict)
print("Silhouette score is {}".format(score))

In [None]:
#davies_bouldin_score of our clusters
from sklearn.metrics import davies_bouldin_score
davies_bouldin_score(X, y_predict)
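The two metrics point in opposite directions: a higher silhouette score (range -1 to 1) and a lower Davies-Bouldin score (minimum 0) both indicate better-separated clusters. As a minimal sketch of how they can guide the choice of k, the example below scores K-means on synthetic data with three well-separated groups. The data and the range of k are assumptions for illustration, not the notebook's results.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
points, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=0)

# Store (silhouette, Davies-Bouldin) for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = (silhouette_score(points, labels), davies_bouldin_score(points, labels))

# Pick the k with the highest silhouette score
best_k = max(scores, key=lambda k: scores[k][0])
print(best_k, scores[best_k])
```

On real, high-dimensional text features the scores are usually much less clear-cut than on this toy data, so both metrics are best read together with the elbow plot.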

In [None]:
#Adding a separate column for the cluster
df["cluster"] = y_predict

In [None]:
df['cluster'].value_counts()

In [None]:
# Reuse the labels of the fitted model (refitting with fit_predict is unnecessary)
label = kmeans.labels_

In [None]:

# Getting unique labels
u_labels = np.unique(label)

# Create a colormap with enough distinct colors
colors = plt.cm.jet(np.linspace(0, 1, len(u_labels)))

# Increase the figure size
plt.figure(figsize=(12, 8))

# Plotting the results with different colors for each cluster:
for i, color in zip(u_labels, colors):
    plt.scatter(X[label == i, 0], X[label == i, 1], label=i, c=[color])

plt.legend()
plt.title("Clustering Results")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.show()

In [None]:
# kmeans label to cluster column
df['cluster'] = kmeans.labels_

In [None]:
df

#### Word Cloud & Clusters

In [None]:
# Word cloud of a chosen text column for a given cluster
from wordcloud import WordCloud, STOPWORDS

def func_select_Category(category_name,column_of_choice):
  # Keep only the rows of the requested cluster that have a value in the column
  df_word_cloud = df[['cluster',column_of_choice]].dropna()
  df_word_cloud = df_word_cloud[df_word_cloud['cluster']==category_name]
  text = " ".join(word for word in df_word_cloud[column_of_choice])
  # Create stopword list:
  stopwords = set(STOPWORDS)
  # Generate a word cloud image
  wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
  # Display the generated image:
  # the matplotlib way:
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off")
  plt.show()

### Word Cloud on Description col for different cluster

In [None]:
for i in range(26):
  func_select_Category(i,'description')

### Word Cloud on Cast col for different cluster

In [None]:
for i in range(26):
  func_select_Category(i,'cast')

### Word Cloud on director col for different cluster

In [None]:
for i in range(9):
  func_select_Category(i,'director')

### Word Cloud on listed in col for different cluster

In [None]:
for i in range(12):
  func_select_Category(i,'listed_in')

### Word Cloud on Country col for different cluster

In [None]:
for i in range(12):
  func_select_Category(i,'country')

### Word Cloud on Title col for different cluster

In [None]:
for i in range(12):
  func_select_Category(i,'title')

### Cluster 0 : Drama Enthusiasts

In [None]:
df[df['cluster'] == 0][['type','title','director','cast','country','rating','listed_in','description']]

### Cluster 1 : Sci-Fi Lovers

In [None]:
df[df['cluster'] == 1][['type','title','director','cast','country','rating','listed_in','description']]

### Cluster 2 : Comedy Central

In [None]:
df[df['cluster'] == 2][['type','title','director','cast','country','rating','listed_in','description']]

### Cluster 3 : Documentary Buffs

In [None]:
df[df['cluster'] == 3][['type','title','director','cast','country','rating','listed_in','description']]

### Cluster 4 : International Treasures

In [None]:
df[df['cluster'] == 4][['type','title','director','cast','country','rating','listed_in','description']]

### Cluster 5 : Action Mania

In [None]:
df[df['cluster'] == 5][['type','title','director','cast','country','rating','listed_in','description']]

### Cluster 6 : Kids' Corners

In [None]:
df[df['cluster'] == 6][['type','title','director','cast','country','rating','listed_in','description']]

### Cluster 7 : Mystery & Thrill

In [None]:
df[df['cluster'] == 7][['type','title','director','cast','country','rating','listed_in','description']]

### Cluster 8 : Classic Cinema

In [None]:
df[df['cluster'] == 8][['type','title','director','cast','country','rating','listed_in','description']]

### Cluster 9 : Reality TV

In [None]:
df[df['cluster'] == 9][['type','title','director','cast','country','rating','listed_in','description']]

### **Visualization for Clusters using Bokeh!!!**

In [None]:
from sklearn.manifold import TSNE

x_embedded = TSNE(n_components=2).fit_transform(X)

x_embedded.shape

In [None]:
#import for bokeh
from bokeh.io import output_file, output_notebook, show
from bokeh.layouts import column, gridplot
from bokeh.models import (ColumnDataSource, CustomJS, Div, HoverTool,
                          LinearColorMapper, Paragraph, RadioButtonGroup, TextInput)
from bokeh.palettes import Category20
from bokeh.plotting import figure
from bokeh.transform import linear_cmap, transform


In [None]:
output_notebook()
y_labels = label

# data sources
source = ColumnDataSource(data=dict(
    x= x_embedded[:,0],
    y= x_embedded[:,1],
    x_backup = x_embedded[:,0],
    y_backup = x_embedded[:,1],
    desc= y_labels,
    titles= df['title'],
    directors = df['director'],
    cast = df['cast'],
    description = df['description'],
    listed_in = df['listed_in'],
    rating = df['rating'],
    country = df['country'],
    labels = ["C-" + str(x) for x in y_labels]
    ))

In [None]:
# hover over information
hover = HoverTool(tooltips=[
    ("Title", "@titles"),
    ("Director(s)", "@directors"),
    ("Cast", "@cast"),
    ("Description", "@description"),
    ("listed_in","@listed_in"),
    ("rating","@rating"),
    ("country","@country")
],
                 point_policy="follow_mouse")

# map colors
mapper = linear_cmap(field_name='desc',
                     palette=Category20[20],
                     low=min(y_labels) ,high=max(y_labels))


# prepare the figure
p = figure(width=800, height=800,
           tools=[hover, 'pan', 'wheel_zoom', 'box_zoom', 'reset'],
           title="Netflix Movies and TV Shows",
           toolbar_location="right")

# plot
p.scatter('x', 'y', size=5,
          source=source,
          fill_color=mapper,
          line_alpha=0.3,
          line_color="black",
          legend_field='labels')

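# option (this widget is defined but not added to the layout shown below)
option = RadioButtonGroup(labels=["C-0", "C-1", "C-2",
                                  "C-3", "C-4", "C-5",
                                  "C-6", "C-7", "C-8",
                                  ],
                          active=0)  # must be a valid index into labels (0-8)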

# search box
#keyword = TextInput(title="Search:", callback=keyword_callback)
#header
header = Div(text="""<h1>Find similar movies / tv shows in corresponding Cluster</h1>""")

# show
show(column(header,p))

In [None]:
df

Dendrogram

In [None]:
import scipy.cluster.hierarchy as shc
plt.figure(figsize =(8, 8))
plt.title('Visualising the data')
Dendrogram = shc.dendrogram((shc.linkage(X, method ='ward')))
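A dendrogram is typically read by cutting the tree at some height, which yields a flat set of clusters. The sketch below shows one way to obtain such a cut with SciPy's `fcluster` on synthetic data. The data, the choice of three clusters and the names `points`, `Z` and `flat_labels` are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three well-separated synthetic groups of 20 points each (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Same Ward linkage as used for the dendrogram
Z = linkage(points, method="ward")

# Cut the tree so that at most three flat clusters remain (labels start at 1)
flat_labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(flat_labels)[1:])  # cluster sizes
```

The `maxclust` criterion asks for a target number of clusters; a distance threshold can be used instead with `criterion="distance"`.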

### Hierarchical Clustering

#### Agglomerative Clustering

In [None]:
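#Fitting our variable in Agglomerative Clusters
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance, so no separate distance argument is needed
# (the `affinity` parameter has been removed from recent scikit-learn releases)
aggh = AgglomerativeClustering(n_clusters=6, linkage='ward')
#Fit the model and predict the cluster labels
y_hc = aggh.fit_predict(X)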

In [None]:
df_hierarchical =df.copy()
#creating a column where each row is assigned to their separate cluster
df_hierarchical['cluster'] = aggh.labels_
df_hierarchical.head()


**Evaluation**

In [None]:
#Silhouette Coefficient
print("Silhouette Coefficient: %0.3f"%silhouette_score(X,y_hc, metric='euclidean'))


In [None]:
#davies_bouldin_score of our clusters
from sklearn.metrics import davies_bouldin_score
davies_bouldin_score(X, y_hc)

## Movie Recommendation System

In [None]:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
import spacy
from spacy.lang.en.examples import sentences
#!pip install en_core_web_lg
nlp = spacy.load('en_core_web_lg')

In [None]:
# Create word vectors for all movies and TV show descriptions
with nlp.disable_pipes():
    vectors = np.array([nlp(film.description).vector for idx, film in df.iterrows()])

In [None]:
# Function to analyze how similar two word vectors are
def cosine_similarity(a, b):
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))

In [None]:
# Calculate the mean for all word vectors
vec_mean = vectors.mean(axis=0)

# Subtract the mean from the vectors
centered = vectors - vec_mean

In [None]:
# Function to get the indices of the five most similar descriptions
def get_similar_description_indices(description_vec):

    # Calculate similarities between given description and other descriptions in the dataset
    sims = np.array([cosine_similarity(description_vec - vec_mean, vec) for vec in centered])

    # Get the indices of the five most similar descriptions
    most_similar_index = np.argsort(sims)[-6:-1]

    return most_similar_index

In [None]:
# Create array of lists containing indices of five most similar descriptions
similar_indices = np.array([get_similar_description_indices(vec) for vec in vectors])
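The per-row loop above computes one similarity at a time. The same idea can be expressed as a single matrix product once the centered vectors are normalized. This is a sketch on random toy vectors, with a hypothetical helper `top_k_similar`. Unlike the loop above, it returns the most similar item first and excludes each item from its own results explicitly.

```python
import numpy as np

def top_k_similar(vectors, k=5):
    # Center the vectors, as in the cells above
    centered = vectors - vectors.mean(axis=0)
    # Normalize rows to unit length so that dot products equal cosine similarities
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = centered / norms
    sims = unit @ unit.T
    # Exclude each item from its own recommendations
    np.fill_diagonal(sims, -np.inf)
    # Indices sorted by descending similarity, keeping the top k per row
    return np.argsort(sims, axis=1)[:, ::-1][:, :k]

# Toy vectors standing in for the spaCy document vectors (illustrative only)
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(6, 4))
toy_similar = top_k_similar(toy_vectors, k=3)
print(toy_similar)
```

This builds the full item-by-item similarity matrix, so memory use grows with the square of the number of titles.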


In [None]:
similar_indices

In [None]:
# Row position of the chosen title (similar_indices is indexed by row position)
test_pos = np.flatnonzero(df['title'].to_numpy() == "3%")[0]

print("Chosen Movie/TV Show")
print(df['title'].iloc[test_pos] + ': ' + df['description'].iloc[test_pos] + '\n')
print("Top Recommendations")
# The indices are sorted by ascending similarity, so iterate in reverse
for idx in similar_indices[test_pos][::-1]:
    print(df['title'].iloc[idx] + ': ' + df['description'].iloc[idx] + '\n')

# **Conclusion**

Key conclusions from the project:

**Data Overview:** The dataset contains information on 7,787 movies and TV shows available on Netflix, with 12 columns providing details such as title, director, cast, country, date added, release year, rating, duration, genre, and description.

**Data Wrangling:**
Missing values were identified and handled in columns like 'cast' and 'country'.
Rows with null values in 'date_added' and 'rating' columns were dropped.
The 'director' column was dropped entirely.
Ratings were grouped into categories.

**Data Exploration:** Various visualizations were created to understand trends and relationships in the data.
Notable insights included the prevalence of movies over TV shows, trends in content type over the years, and distribution of content by ratings and genres.

**Cluster Analysis:**
K-means clustering (26 clusters) and agglomerative hierarchical clustering (6 clusters) were performed on TF-IDF features built from the combined text columns and reduced with PCA.
The Silhouette Score and Davies-Bouldin Score were used to evaluate the quality of the clusters.

**Movie Recommendation System:**
A content-based recommendation system was implemented using spaCy and cosine similarity based on word vectors.
The system provides top recommendations for a chosen movie or TV show based on similarity of their descriptions.

**Conclusion:**
The project successfully explored, cleaned, and analyzed the Netflix dataset.
Clustering grouped the content based on its combined text features.
The recommendation system suggests titles whose descriptions are most similar to that of a chosen title.

**Next Steps:**
Further refinement of the recommendation system could involve user feedback and collaborative filtering.
Additional features or external data sources could be incorporated for more accurate clustering and recommendations.

