<a href="https://colab.research.google.com/github/Kamini262/Netflix-movies-and-tv-shows/blob/main/Netflix_Movies_%26_TV_shows.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **"Hello, I am Kamini Singh. I've meticulously crafted this insightful project, independently exploring Netflix content clustering. Enjoy the journey!"**


# **Netflix Movies and TV Shows Clustering Analysis**


# **Project Title: Exploring Netflix Content Trends and Insights**

Streaming has changed how people consume entertainment, and Netflix has been central to that shift. This project analyzes a dataset of the TV shows and movies available on Netflix as of 2019. The dataset was collected from Flixable, a third-party Netflix search engine, and describes the Netflix catalog and how it has evolved over the past decade.


# Project Objectives

# **Exploratory Data Analysis**


The project begins with a comprehensive exploration of the dataset. Using Pandas for data manipulation and aggregation, and Matplotlib and Seaborn for visualization, we examine the dataset's structure and characteristics.

# **Content Trends Across Countries**

 The project seeks to understand the types of content available in different countries. By grouping and analyzing content based on production country, we aim to uncover geographical preferences and trends.

# **Focus on TV vs. Movies**

An essential aspect of this project is to investigate whether Netflix has shifted its focus from movies to TV shows in recent years. By analyzing the growth of TV shows and movies over the past decade, we aim to discern any underlying patterns.

# **Clustering Similar Content**

Using text-based features such as titles, directors, and genres, we will apply clustering techniques to group similar content together. This offers another view of content categorization and may reveal patterns that the existing genre labels do not capture.

# **Integration with External Ratings**

To enrich our analysis, we plan to integrate external datasets such as IMDb ratings and Rotten Tomatoes scores. Merging these datasets with the Netflix content data may uncover correlations between ratings and catalog composition.

# **Expected Outcomes:**

Throughout this project, we aim to not only gain insights into Netflix's content strategy and user preferences but also to demonstrate the power of data analysis in deciphering complex trends. The project's deliverables will include a comprehensive Jupyter Notebook showcasing our code, visualizations, and interpretations. Additionally, a video presentation will be created to succinctly present the key findings and insights to stakeholders.


**Significance to Stakeholders:**

Stakeholders in the entertainment industry, content creators, and streaming platforms can greatly benefit from the insights provided by this project. By understanding content trends, user preferences, and the impact of external ratings, stakeholders can make informed decisions about content production, licensing, and audience targeting.

**Skills and Tools:**

The project will be implemented using Python programming and popular libraries such as Pandas, Matplotlib, Seaborn, and Scikit-learn. We will leverage data manipulation, visualization, and clustering techniques to achieve our objectives.

# **Exploratory Data Analysis (EDA) steps using Python code:**


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
import pandas as pd

# Mount Google Drive
drive.mount('/content/drive')

# Path to the CSV file
csv_file_path = '/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv'

# Read the CSV file using pandas
data = pd.read_csv(csv_file_path)



In [None]:

# Display the first few rows of the DataFrame
print(data.head())

In [None]:
# Summary statistics for the numeric columns
data.describe()

In [None]:
# Inspect the DataFrame's row index (reuses `data` loaded above)
print(data.index)

In [None]:
# Understand the basic structure of the dataset
print(data.info())
print(data.head())

In [None]:
#Check for missing values
print(data.isnull().sum())
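
Raw missing-value counts are easier to judge relative to the number of rows. The following is a minimal sketch, using a small hypothetical DataFrame (`toy`) as a stand-in for `data`, that expresses missing values as a percentage per column:

```python
import pandas as pd

# Hypothetical stand-in for the Netflix DataFrame
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
})

# isnull().mean() gives the fraction of missing values in each column
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the real dataset, the same expression would be `data.isnull().mean().mul(100)`.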

In [None]:
# Release year distribution
plt.subplot(2, 2, 1)
sns.histplot(data=data, x="release_year", bins=20)
plt.xlabel("Release Year")
plt.title("Distribution of Release Years")

In [None]:
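# Duration distribution for movies; assumes values look like "90 min"
# (TV shows use "N Seasons", so they are excluded here)
is_movie = data["type"] == "Movie"
minutes = data.loc[is_movie, "duration"].str.extract(r"(\d+)", expand=False).astype(float)
plt.subplot(2, 2, 2)
sns.histplot(x=minutes, bins=20)
plt.xlabel("Duration (minutes)")
plt.title("Distribution of Movie Durations")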

In [None]:
# Content type distribution (TV shows vs. Movies)
plt.subplot(2, 2, 4)
sns.countplot(data=data, x="type")
plt.xlabel("Content Type")
plt.title("Distribution of Content Types")

plt.tight_layout()
plt.show()

In [None]:
# Analyze the distribution of TV shows and movies over the years
content_by_year = data.groupby("release_year")["type"].value_counts().unstack()
content_by_year.plot(kind="bar", stacked=True)
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.title("Content Distribution: TV Shows vs. Movies")
plt.legend(title="Content Type")
plt.show()
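
Stacked counts show volume, but the question of whether the focus has shifted toward TV shows is about proportions. As a rough sketch, assuming the same `type` and `release_year` columns and using a small hypothetical DataFrame (`toy`) in place of `data`, the yearly share of each content type can be computed like this:

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook would use `data`
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "release_year": [2015, 2015, 2015, 2018, 2018, 2018],
})

# normalize="index" turns each year's counts into row-wise proportions
share = pd.crosstab(toy["release_year"], toy["type"], normalize="index")
tv_share = share["TV Show"]
print(tv_share)
```

A rising `tv_share` across years would be consistent with a shift toward TV shows, although release year is only a proxy for when Netflix added the title.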

In [None]:
# Explore the most common genres, directors, actors, and countries
top_genres = data["listed_in"].str.split(", ").explode().value_counts().head(10)
top_directors = data["director"].value_counts().head(10)
top_actors = data["cast"].str.split(", ").explode().value_counts().head(10)
top_countries = data["country"].value_counts().head(10)
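
The `str.split(", ").explode()` chain used above is what makes multi-valued cells countable. A small illustration on hypothetical values (the `listed_in` Series below is made up for the example):

```python
import pandas as pd

# Hypothetical genre strings in the same comma-separated style as `listed_in`
listed_in = pd.Series([
    "Dramas, Comedies",
    "Dramas",
    "Comedies, Documentaries, Dramas",
])

# split() turns each cell into a list; explode() gives each list item its own row
genre_counts = listed_in.str.split(", ").explode().value_counts()
print(genre_counts)
```

Without the split, a cell such as "Dramas, Comedies" would be counted as one distinct category. That is the current behavior for `country` and `director` above, which are counted without splitting.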

In [None]:
# Visualize the top genres
plt.figure(figsize=(12, 6))

plt.subplot(2, 2, 1)
sns.barplot(x=top_genres.values, y=top_genres.index)
plt.xlabel("Count")
plt.ylabel("Genre")
plt.title("Top 10 Genres")

In [None]:
# Visualize the top directors
plt.subplot(2, 2, 2)
sns.barplot(x=top_directors.values, y=top_directors.index)
plt.xlabel("Count")
plt.ylabel("Director")
plt.title("Top 10 Directors")

In [None]:
# Visualize the top 10 actors
plt.subplot(2, 2, 3)
sns.barplot(x=top_actors.values, y=top_actors.index)
plt.xlabel("Count")
plt.ylabel("Actor")
plt.title("Top 10 Actors")

In [None]:
# Visualize the top countries

plt.subplot(2, 2, 4)
sns.barplot(x=top_countries.values, y=top_countries.index)
plt.xlabel("Count")
plt.ylabel("Country")
plt.title("Top 10 Countries")

In [None]:
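# Examine correlations between numeric variables
# (numeric_only=True skips text columns, which would otherwise raise in pandas >= 2.0)
correlation_matrix = data.corr(numeric_only=True)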
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

In [None]:
# Distribution of content types
content_type_counts = data['type'].value_counts()
plt.bar(content_type_counts.index, content_type_counts.values)
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.title('Distribution of Content Types')
plt.show()

# **Text data processing**

In [None]:
# Load the dataset
csv_file_path = '/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv'
data = pd.read_csv(csv_file_path)

# Display the first few rows of the dataset
print(data.head())

In [None]:

# Preprocess text features: replace missing values with empty strings
text_features = ['title', 'director', 'cast', 'listed_in', 'description']
data[text_features] = data[text_features].fillna('')

In [None]:
# Load the dataset
csv_file_path = '/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv'
data = pd.read_csv(csv_file_path)

# Display the first few rows of the dataset before preprocessing
print("Before Preprocessing:")
print(data.head())

# Preprocess text features: replace missing values with empty strings
text_features = ['title', 'director', 'cast', 'listed_in', 'description']
data[text_features] = data[text_features].fillna('')

# Display the first few rows of the dataset after preprocessing
print("After Preprocessing:")
print(data.head())

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Load the dataset
csv_file_path = '/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv'
data = pd.read_csv(csv_file_path)

# Preprocess text features: replace missing values with empty strings
text_features = ['title', 'director', 'cast', 'listed_in', 'description']
data[text_features] = data[text_features].fillna('')

# Combine text features into a single column
data['combined_text'] = data[text_features].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

# TF-IDF vectorization for text features
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(data['combined_text'])

# Convert the sparse matrix to dense array
tfidf_array = tfidf_matrix.toarray()

# Agglomerative Hierarchical Clustering
agg_clustering = AgglomerativeClustering(n_clusters=5)
data['agg_cluster'] = agg_clustering.fit_predict(tfidf_array)

# Print unique cluster labels and their distribution
print(data['agg_cluster'].value_counts())
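
To make the pipeline above easier to follow, here is a minimal sketch of the same two steps (TF-IDF vectorization, then agglomerative clustering) on four hypothetical one-line descriptions. The `docs` list is invented for illustration and is not taken from the dataset.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions: two science-fiction titles, two romance titles
docs = [
    "space aliens invade earth",
    "space aliens attack earth",
    "romantic wedding comedy",
    "romantic wedding drama",
]

# Each document becomes a row of TF-IDF weights over the shared vocabulary
tfidf = TfidfVectorizer().fit_transform(docs)

# AgglomerativeClustering needs a dense array; the default linkage is Ward
labels = AgglomerativeClustering(n_clusters=2).fit_predict(tfidf.toarray())
print(labels)
```

Documents that share vocabulary end up in the same cluster. The numeric label assigned to each cluster is arbitrary, so only the grouping is meaningful.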

**Modeling and clustering through machine learning algorithms**

In [None]:
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Load the dataset
csv_file_path = '/content/drive/MyDrive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv'
data = pd.read_csv(csv_file_path)

# Preprocess text features: replace missing values with empty strings
text_features = ['title', 'director', 'cast', 'listed_in', 'description']
data[text_features] = data[text_features].fillna('')

# Combine text features into a single column
data['combined_text'] = data[text_features].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

# TF-IDF vectorization for text features
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(data['combined_text'])

# Convert the sparse TF-IDF matrix to a dense numpy array
tfidf_array = tfidf_matrix.toarray()

# K-Means Clustering (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
data['kmeans_cluster'] = kmeans.fit_predict(tfidf_array)

# Agglomerative Hierarchical Clustering
agg_clustering = AgglomerativeClustering(n_clusters=5)
data['agg_cluster'] = agg_clustering.fit_predict(tfidf_array)

# Print unique cluster labels and their distribution
print("K-Means Cluster Distribution:")
print(data['kmeans_cluster'].value_counts())
print("\nAgglomerative Cluster Distribution:")
print(data['agg_cluster'].value_counts())
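
The cluster count of 5 used above is fixed by hand. One common way to compare candidate values, shown here as a sketch rather than as the method used in this notebook, is the silhouette score. The example uses synthetic, well-separated points from `make_blobs` instead of the TF-IDF matrix, so the expected answer is known in advance.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups (an assumption for the demo)
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Score each candidate k; a higher silhouette means tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    candidate_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, candidate_labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On the real data, `X` would be replaced by `tfidf_array`. Text clusters are rarely this clean, so the scores should be read as a guide rather than a definitive answer.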

# Histogram for Numerical Features:
Plot histograms to visualize the distribution of numerical features such as release year and duration.

In [None]:
# Histogram of release year
plt.figure(figsize=(10, 6))
data['release_year'].hist(bins=30)
plt.title('Release Year Distribution')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.show()

## **Word Clouds for Textual Features:**
Word clouds visualize the most frequent words in descriptions, titles, or other text-based features.

In [None]:
from wordcloud import WordCloud

# Word cloud for descriptions
plt.figure(figsize=(10, 6))
wordcloud = WordCloud(max_words=100, background_color='white').generate(' '.join(data['description']))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Descriptions')
plt.axis('off')
plt.show()

## **Geographical Maps**

Since the dataset contains country information, I can create a geographical map to show the distribution of content across countries.

In [None]:
import geopandas as gpd

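# Load a world map shapefile
# Note: gpd.datasets was removed in GeoPandas 1.0, so this call needs geopandas < 1.0
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Count content by country (rows listing several countries count as their own category)
country_counts = data['country'].value_counts().reset_index()
country_counts.columns = ['country', 'count']

# Merge data with world map (only names that match exactly are kept,
# e.g. "United States" will not match "United States of America")
merged = world.merge(country_counts, left_on='name', right_on='country')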

# Plot map
fig, ax = plt.subplots(1, 1, figsize=(15, 10))
world.boundary.plot(ax=ax)
merged.plot(column='count', ax=ax, legend=True)
plt.title('Content Distribution by Country')
plt.show()

## **Cluster Distribution**

Displaying the distribution of items across clusters illustrates how the content has been grouped based on textual features.

In [None]:
plt.figure(figsize=(8, 6))
data['kmeans_cluster'].value_counts().sort_index().plot(kind='bar')
plt.title('K-Means Cluster Distribution')
plt.xlabel('Cluster')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
data['agg_cluster'].value_counts().sort_index().plot(kind='bar')
plt.title('Agglomerative Cluster Distribution')
plt.xlabel('Cluster')
plt.ylabel('Count')
plt.show()
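
The two bar charts show cluster sizes but not whether the two algorithms agree on which titles belong together. A possible way to check this, sketched on hypothetical label lists rather than the real `kmeans_cluster` and `agg_cluster` columns, is a cross-tabulation together with the adjusted Rand index:

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: same grouping, different cluster numbering
kmeans_labels = [0, 0, 1, 1, 2, 2]
agg_labels = [2, 2, 0, 0, 1, 1]

# Rows are K-Means clusters, columns are agglomerative clusters
overlap = pd.crosstab(
    pd.Series(kmeans_labels, name="kmeans"),
    pd.Series(agg_labels, name="agglomerative"),
)

# ARI ignores label numbering: 1.0 means identical groupings, about 0 means chance level
ari = adjusted_rand_score(kmeans_labels, agg_labels)
print(overlap)
print(ari)
```

On the notebook's data, the inputs would be `data['kmeans_cluster']` and `data['agg_cluster']`.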

# **Word Cloud of Clustered Descriptions**


For each cluster, a word cloud of the most frequent words in the descriptions gives a visual summary of the content within that cluster.

In [None]:
from wordcloud import WordCloud

for cluster_num in range(5):
    cluster_description = ' '.join(data[data['kmeans_cluster'] == cluster_num]['description'])
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(cluster_description)
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(f'Cluster {cluster_num} - Word Cloud of Descriptions')
    plt.axis('off')
    plt.show()

# Content Type Distribution:
A bar plot showing the distribution of content types (movies vs. TV shows) provides an overview of the dataset's composition.
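
As a small sketch of the numbers behind such a plot, assuming a `type` column with the values "Movie" and "TV Show" (the Series below is hypothetical), both the counts and the proportions can be read off with `value_counts`:

```python
import pandas as pd

# Hypothetical content types standing in for data['type']
types = pd.Series(["Movie", "Movie", "Movie", "TV Show"], name="type")

type_counts = types.value_counts()               # absolute counts per type
type_share = types.value_counts(normalize=True)  # proportions that sum to 1
print(type_counts)
print(type_share)
```

Calling `type_counts.plot(kind="bar")` would then draw the bar plot described above.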

# **Conclusion**

In [None]:
# Print a summary of the project and its findings
print("Project Conclusion:")
print("------------------------------")

print("This project aimed to analyze and cluster Netflix movies and TV shows based on textual features.")
print("Key Steps and Findings:")
print("- Explored the dataset of Netflix content, understanding its structure and contents.")
print("- Conducted exploratory data analysis (EDA) to visualize the distribution of content types, countries, etc.")
print("- Preprocessed text-based features, filling missing values and preparing text for analysis.")
print("- Performed TF-IDF vectorization to transform text features into a numerical format.")
print("- Applied K-Means and Agglomerative Clustering algorithms to cluster similar content.")
print("- Visualized the clusters and analyzed their distribution.")

In [None]:

# Provide insights from clustering results
print("\nInsights from Clustering:")
print("- K-Means and Agglomerative Clustering revealed different clusters of content based on textual features.")
print("- Analyzed the distribution of content across clusters to identify patterns and similarities.")

In [None]:
# Discuss potential next steps or improvements
print("\nNext Steps and Future Improvements:")
print("- Incorporate external datasets (IMDb ratings, Rotten Tomatoes) for more insights.")
print("- Experiment with different clustering algorithms and hyperparameters for better results.")
print("- Perform deeper analysis on specific clusters to understand their characteristics.")

In [None]:
# Thank the readers and conclude
print("Thank you for following along with this analysis!")
print("We appreciate your interest and hope you found the insights and findings valuable.")
print("If you have any questions or feedback, please feel free to reach out.")
print("------------------------------")