##  About Netflix

Netflix is a leading global streaming service with about **222 million paid subscribers** as of the end of 2021.  
It offers a huge library of TV shows, movies, and documentaries across multiple genres and languages.

This case study explores Netflix's dataset to better understand how its content performs across regions and what patterns can be uncovered from user preferences.

##  Problem Statement

Using the available data, our goal is to extract meaningful insights that can help Netflix:

1. Identify which types of content (movies/shows) perform best
2. Understand regional content preferences to support expansion
3. Predict which genres or formats are more likely to succeed in specific countries

In [None]:
# Downloading the Netflix dataset from Google Drive
# Only run this in Google Colab (not needed locally)
!wget "https://drive.google.com/uc?export=download&id=1-qDO7oNwzQn0RV44YtpqWdYS4SO3GkQg" -O netflix_title.csv

In [None]:
# Importing necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

#  Load the dataset
df = pd.read_csv('netflix_title.csv')

# Quick confirmation
print(" Data loaded successfully! Total rows:", df.shape[0])

In [None]:
#  Let's take a quick peek at the dataset
df.head()

In [None]:
#  Checking the size of the dataset
rows, cols = df.shape
print(f" The dataset contains {rows} rows and {cols} columns.")

In [None]:
#  Overview of columns and data types
print(" Dataset Structure:\n")
df.info()

In [None]:
#  Summary statistics for text-based columns
df.describe(include='object')

In [None]:
#  Let's check for missing values in the dataset
print(" Null values per column:")
print(df.isnull().sum().sort_values(ascending=False), "\n")

print(" Null value percentage:")
null_percent = (df.isnull().sum() / len(df)) * 100
print(null_percent.sort_values(ascending=False))
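
The counts and percentages above can also be read side by side in one table. The snippet below is a minimal sketch of that idea on a small made-up frame; `toy` and `null_summary` are hypothetical names, not part of the Netflix dataset.

```python
import pandas as pd

# Hypothetical frame with a few gaps, standing in for the real dataset.
toy = pd.DataFrame({
    "director": ["A", None, None, "B"],
    "country": ["US", "IN", None, "UK"],
    "title": ["w", "x", "y", "z"],
})

# isnull().sum() gives counts; isnull().mean() gives the fraction missing.
null_summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_percent": (toy.isnull().mean() * 100).round(2),
}).sort_values("null_count", ascending=False)

print(null_summary)
```

Swapping `toy` for `df` should give the same view for the full dataset.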

In [None]:
# Handling missing data
# Drop the few rows that are missing 'rating', 'duration', or 'date_added'
df = df.dropna(subset=['rating', 'duration', 'date_added']).copy()
# Fill the remaining gaps with a placeholder; plain assignment avoids
# the unreliable chained `fillna(..., inplace=True)` pattern
for col in ['country', 'cast', 'director']:
    df[col] = df[col].fillna('No Data')
df.isnull().sum()

In [None]:
# Convert 'date_added' to datetime and extract year, month, day
# (strip stray whitespace first so every value parses with the same format)
df['date_added'] = pd.to_datetime(df['date_added'].str.strip())
df['added_year'] = df['date_added'].dt.year
df['added_month'] = df['date_added'].dt.month
df['added_day'] = df['date_added'].dt.day
df.head(3)
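
Date parsing tends to be the step most sensitive to messy input. The sketch below shows one defensive way to do it, assuming the dates look like `"September 25, 2021"`; that format string is an assumption and the sample values are made up.

```python
import pandas as pd

# Hypothetical values mimicking 'date_added', including a leading space
# and a missing entry.
dates = pd.Series(["September 25, 2021", " August 4, 2017", None])

# Strip whitespace, then parse with an explicit format. errors="coerce"
# turns anything unparseable into NaT instead of raising an exception.
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print("Unparsed values:", parsed.isna().sum())
```

Checking the number of `NaT` values after parsing is a quick way to see whether the assumed format actually fits the data.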

In [None]:
#  Count of content types (Movie vs TV Show)
content_counts = df['type'].value_counts()
plt.figure(figsize=(6, 4))
sns.barplot(x=content_counts.index, y=content_counts.values, palette="Set2")
plt.title("Content Type Distribution on Netflix")
plt.xlabel("Type of Content")
plt.ylabel("Count")
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
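# How many unique countries are in the dataset?
# 'country' can list several countries per title, so split before counting
countries = df['country'].str.split(',').explode().str.strip()
unique_countries = countries[~countries.isin(['', 'No Data'])].nunique()
print(f"Netflix content spans {unique_countries} unique countries.")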

In [None]:
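# Top 10 countries producing the most Netflix content
# Split multi-country entries so each country is counted once per title
country_series = df['country'].str.split(',').explode().str.strip()
top_countries = country_series[~country_series.isin(['', 'No Data'])].value_counts().head(10)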
plt.figure(figsize=(8, 5))
sns.barplot(x=top_countries.values, y=top_countries.index, palette="coolwarm")
plt.title("Top 10 Countries with Most Content on Netflix")
plt.xlabel("Number of Titles")
plt.ylabel("Country")
plt.grid(axis='x', linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()

In [None]:
#  Most common genres in Netflix content
genre_counts = df['listed_in'].str.split(', ', expand=True).stack().value_counts().head(10)
plt.figure(figsize=(9, 5))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette="mako")
plt.title("Top 10 Genres on Netflix")
plt.xlabel("Number of Titles")
plt.ylabel("Genre")
plt.grid(axis='x', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
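
The problem statement asks about regional preferences, which means looking at genres and countries together. The following is a rough sketch on made-up rows, assuming both `country` and `listed_in` hold comma-separated values; `toy`, `pairs`, and `genre_by_country` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows shaped like the 'country' and 'listed_in' columns.
toy = pd.DataFrame({
    "country": ["United States, India", "India", "United States"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Comedies"],
})

# Split each comma-separated field into a list, then explode so that
# every (country, genre) combination of a title gets its own row.
pairs = (
    toy.assign(
        country=toy["country"].str.split(", "),
        genre=toy["listed_in"].str.split(", "),
    )
    .explode("country")
    .explode("genre")
    .reset_index(drop=True)
)

# Rows are countries, columns are genres, cells are title counts.
genre_by_country = (
    pairs.groupby(["country", "genre"]).size().unstack(fill_value=0)
)

print(genre_by_country)
```

A title listed under two countries and two genres counts four times here, so the table describes catalog presence rather than unique titles.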

In [None]:
#  Checking unique duration values
print(" Sample duration values:")
print(df['duration'].value_counts().head(10))

In [None]:
#  Let's analyze duration for movies
df_movies = df[df['type'] == 'Movie'].copy()
# Raw string avoids an invalid-escape warning; expand=False returns a Series
df_movies['duration_int'] = df_movies['duration'].str.extract(r'(\d+)', expand=False).astype(float)
plt.figure(figsize=(8, 5))
sns.histplot(df_movies['duration_int'], bins=30, kde=True, color='tomato')
plt.title("Distribution of Movie Durations (in minutes)")
plt.xlabel("Duration (minutes)")
plt.ylabel("Frequency")
plt.grid(axis='y', linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()

In [None]:
#  Analyzing number of seasons for TV Shows
df_shows = df[df['type'] == 'TV Show'].copy()
# Raw string avoids an invalid-escape warning; expand=False returns a Series
df_shows['seasons'] = df_shows['duration'].str.extract(r'(\d+)', expand=False).astype(int)
plt.figure(figsize=(8, 5))
sns.countplot(x='seasons', data=df_shows, palette='crest')
plt.title("Number of Seasons in Netflix TV Shows")
plt.xlabel("Number of Seasons")
plt.ylabel("Count of Shows")
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
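
A count plot shows the shape of the distribution, but a proportion makes the "mostly one season" observation concrete. Here is a small sketch with made-up `duration` strings; the values and the `season_share` name are illustrative only.

```python
import pandas as pd

# Hypothetical 'duration' strings for TV shows.
durations = pd.Series(["1 Season", "2 Seasons", "1 Season", "5 Seasons"])

# Pull out the leading number; expand=False returns a Series.
seasons = durations.str.extract(r"(\d+)", expand=False).astype(int)

# normalize=True turns counts into proportions of all shows.
season_share = seasons.value_counts(normalize=True).sort_index()

print(season_share)
```

On the real data, `df_shows['seasons'].value_counts(normalize=True)` should give the equivalent breakdown.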

In [None]:
#  How many titles were added each year?
plt.figure(figsize=(10, 5))
sns.countplot(x='added_year', data=df, palette='rocket', order=sorted(df['added_year'].dropna().unique()))
plt.title("Netflix Titles Added by Year")
plt.xlabel("Year Added")
plt.ylabel("Number of Titles")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()
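
To see whether growth came from movies or TV shows, the yearly counts can be split by content type. The sketch below uses a made-up frame with the same `type` and `added_year` column names; `by_year_type` is a hypothetical name.

```python
import pandas as pd

# Hypothetical rows with the same column names as the notebook's df.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "added_year": [2019, 2019, 2020, 2020],
})

# Count titles per (year, type), then pivot types into columns.
# fill_value=0 covers years where one type has no titles.
by_year_type = (
    toy.groupby(["added_year", "type"]).size().unstack(fill_value=0)
)

print(by_year_type)
```

Calling `by_year_type.plot(kind="bar")` on the resulting table would draw the two types next to each other for each year.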

In [None]:
#  Do some months have more content uploads than others?
plt.figure(figsize=(9, 5))
sns.countplot(x='added_month', data=df, palette='flare')
plt.title("Netflix Content Added by Month")
plt.xlabel("Month")
plt.ylabel("Number of Titles")
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Most frequently appearing actors/actresses
from collections import Counter
# Skip the 'No Data' placeholder so it is not counted as an actor
cast_series = df.loc[df['cast'] != 'No Data', 'cast'].dropna().str.split(', ')
flat_cast_list = [actor.strip() for sublist in cast_series for actor in sublist]
top_actors = Counter(flat_cast_list).most_common(10)
top_actors_df = pd.DataFrame(top_actors, columns=['Actor', 'Count'])
plt.figure(figsize=(9, 5))
sns.barplot(y='Actor', x='Count', data=top_actors_df, palette="viridis")
plt.title("Top 10 Most Featured Actors on Netflix")
plt.xlabel("Number of Appearances")
plt.ylabel("Actor Name")
plt.grid(axis='x', linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()

---

##  Summary

In this case study, we explored:

- Netflix's movie vs TV show distribution
- Genre popularity and country-wise content spread
- Year/month-based trends of added content
- Actor/actress frequency
- Duration and season breakdowns

---

##  Key Takeaways

- Most TV shows in the catalog run for a single season, which suggests a lean toward shorter series
- Dramas and international titles are the most common genres on the platform
- The 2018–2020 phase was the peak for content expansion

---
