# Netflix Case Study: Viewer Behavior & Content Strategy

A comprehensive analysis of Netflix’s content catalog using Python, exploring content types, genres, countries, and trends over time.


## Project Objective
To uncover insights from Netflix’s dataset that inform content strategy and viewer engagement.

Specifically, we will:
1. Load and clean the dataset robustly.
2. Perform exploratory data analysis (EDA) on content types, genres, and countries.
3. Visualize trends in content addition over time.
4. Identify top actors and understand duration/season distributions.
5. Provide business recommendations based on findings.


## Dataset & Environment Setup
Ensure you have the `netflix_title.csv` file in the same directory as this notebook. If it is missing, download it from the provided Google Drive link, or run the cell below, which attempts the download with `wget` (for example in Colab).


In [None]:
import os
import pandas as pd
import numpy as np

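DATA_FILENAME = 'netflix_title.csv'
DATA_URL = "https://drive.google.com/uc?export=download&id=1-qDO7oNwzQn0RV44YtpqWdYS4SO3GkQg"

if not os.path.exists(DATA_FILENAME):
    try:
        # Attempt the download; this needs an IPython shell (e.g. Colab) with wget available
        get_ipython().system(f'wget "{DATA_URL}" -O {DATA_FILENAME}')
    except NameError:
        pass  # not running under IPython; the check below reports the missing file
    # A failed wget does not raise (and -O can leave an empty file), so verify the result
    if not os.path.exists(DATA_FILENAME) or os.path.getsize(DATA_FILENAME) == 0:
        raise FileNotFoundError(
            f"{DATA_FILENAME} not found locally. Please download it from the Google Drive link and place it here.")
    print(f"Downloaded {DATA_FILENAME} via wget.")
else:
    print(f"✅ Found {DATA_FILENAME} locally.")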

# Load dataset
df = pd.read_csv(DATA_FILENAME)
print(f"Dataset loaded with {len(df)} rows and {len(df.columns)} columns.")

## Initial Data Inspection
Check the first few rows, shape, and column info to understand dataset structure.


In [None]:
# Preview head
df.head()


In [None]:
# Shape of the dataset
rows, cols = df.shape
print(f"Dataset contains {rows} rows and {cols} columns.")

In [None]:
# Column information and data types
df.info()


In [None]:
# Descriptive statistics for object (text) columns
df.describe(include='object')


## Handling Missing Values
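Count missing values in the key columns (`rating`, `duration`, `date_added`) and drop rows missing any of them. Missing `country`, `cast`, and `director` values are less critical, so they are filled with the placeholder `'No Data'` instead.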


In [None]:
# Count missing values in key columns
initial_count = len(df)
missing_rating = df['rating'].isna().sum()
missing_duration = df['duration'].isna().sum()
missing_date = df['date_added'].isna().sum()
print(f"Missing 'rating': {missing_rating} rows")
print(f"Missing 'duration': {missing_duration} rows")
print(f"Missing 'date_added': {missing_date} rows")

# Drop rows where any of these key fields is missing
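# .copy() makes the result an independent frame, so the column assignments below
# cannot trigger pandas' SettingWithCopyWarning
df = df.dropna(subset=['rating', 'duration', 'date_added']).copy()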
dropped = initial_count - len(df)
print(f"Dropped {dropped} rows ({dropped/initial_count:.1%} of the dataset)")

# Fill less critical nulls
df['country'] = df['country'].fillna('No Data')
df['cast'] = df['cast'].fillna('No Data')
df['director'] = df['director'].fillna('No Data')

# Final null check
print("Remaining nulls per column:")
print(df.isnull().sum())
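
Before choosing between dropping and filling, it can help to see the share of missing values per column next to the raw counts. The sketch below is one possible way to do that; `missing_summary` is a hypothetical helper and the toy frame is made-up data standing in for the Netflix table.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, worst first."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),   # absolute number of nulls
        "share": frame.isna().mean(),    # fraction of rows that are null
    })
    return summary.sort_values("missing", ascending=False)

# Toy data (made up) standing in for the real dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": ["PG", None, "TV-MA", "R"],
    "director": [None, None, "X", None],
})
report = missing_summary(toy)
print(report)
```

A column with a small share of nulls is a reasonable candidate for dropping rows, while a column with a large share is usually better filled, since dropping would discard much of the dataset.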

## Parsing Dates and Extracting Time Features
Convert `date_added` to datetime with error handling and extract year/month/day.


In [None]:
import warnings
warnings.filterwarnings('ignore')

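# Keep the stripped raw strings so unparseable values can still be inspected
# after coercion (the parsed column would only show NaT)
raw_dates = df['date_added'].astype(str).str.strip()

# Parse with the known format (e.g. "September 25, 2021"); values with stray
# leading/trailing whitespace can fail a strict format, hence the strip above
df['date_added'] = pd.to_datetime(raw_dates, format='%B %d, %Y', errors='coerce')
failed_dates = df['date_added'].isna().sum()
print(f"Failed to parse {failed_dates} dates.")

if failed_dates > 0:
    display(raw_dates[df['date_added'].isna()].head())

# Drop rows where date parsing failed (if any)
df = df.dropna(subset=['date_added']).copy()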

# Extract year, month, day
df['added_year'] = df['date_added'].dt.year
df['added_month'] = df['date_added'].dt.month
df['added_day'] = df['date_added'].dt.day

df.head(3)
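
The effect of coercion is easiest to see on a few hand-written values. The following is a minimal sketch, assuming the `"Month day, Year"` format used above; `parse_added_dates` is a hypothetical helper and the sample strings are invented for illustration.

```python
import pandas as pd

def parse_added_dates(raw: pd.Series) -> pd.Series:
    """Strip whitespace, then parse 'Month day, Year' strings; failures become NaT."""
    cleaned = raw.str.strip()
    return pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")

# Invented samples: one padded value, one clean value, one garbage value, one null
samples = pd.Series([" August 4, 2017", "September 25, 2021", "not a date", None])
parsed = parse_added_dates(samples)

print(parsed)
print(f"Unparseable values: {parsed.isna().sum()}")
```

With `errors="coerce"`, bad values become `NaT` instead of raising, which is why the notebook counts the `NaT` entries right after parsing.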

## Content Type Distribution (Movies vs TV Shows)
Visualize the percentage and count of Movies vs TV Shows.


In [None]:
# Prepare DataFrame for plotting proportions
type_df = df['type'].value_counts(normalize=True).reset_index()
type_df.columns = ['type', 'percent']
type_df['count'] = df['type'].value_counts().values

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 4))
ax = sns.barplot(data=type_df, x='type', y='percent', palette='Set2')

# Annotate bars
for idx, row in type_df.iterrows():
    ax.text(idx, row['percent'] + 0.01, f"{int(row['count'])} ({row['percent']:.1%})", ha='center', va='bottom', fontsize=10)

plt.title("Content Type Distribution on Netflix")
plt.ylabel("Percentage of Total Titles")
plt.ylim(0, type_df['percent'].max() + 0.1)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

## Country and Genre Analysis
1) Count unique countries; 2) Top 10 countries by number of titles; 3) Top 10 genres.


In [None]:
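# 'country' can list several comma-separated countries per title, so split it
# into one entry per country and skip the 'No Data' placeholder
countries = df.loc[df['country'] != 'No Data', 'country'].str.split(',').explode().str.strip()
countries = countries[countries != '']

# Unique countries count
unique_countries = countries.nunique()
print(f"Netflix content spans {unique_countries} unique countries.")

# Top 10 countries by title count
top_countries = countries.value_counts().head(10)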
plt.figure(figsize=(8, 5))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='coolwarm')
plt.title("Top 10 Countries with Most Content on Netflix")
plt.xlabel("Number of Titles")
plt.ylabel("Country")
plt.grid(axis='x', linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()

# Genre breakdown using robust splitting
from collections import Counter
all_genres = []
for cell in df['listed_in'].dropna():
    for genre in cell.split(','):
        all_genres.append(genre.strip())
top10_genres = Counter(all_genres).most_common(10)
genre_df = pd.DataFrame(top10_genres, columns=['genre', 'count'])

plt.figure(figsize=(9, 5))
ax = sns.barplot(x='count', y='genre', data=genre_df, palette='mako')
plt.title("Top 10 Genres on Netflix")
plt.xlabel("Number of Titles")
plt.ylabel("Genre")
for index, row in genre_df.iterrows():
    ax.text(row['count'] + 3, index, str(row['count']), va='center')
plt.tight_layout()
plt.show()
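
Columns such as `country`, `listed_in`, and `cast` hold several comma-separated values per title, so counting them correctly means splitting each cell first. Below is a small sketch of that idea on made-up data; `count_multivalue` is a hypothetical helper, and the `'No Data'` default mirrors the placeholder used earlier in this notebook.

```python
import pandas as pd

def count_multivalue(column: pd.Series, sep: str = ",", placeholder: str = "No Data") -> pd.Series:
    """Count individual values in a column whose cells hold separator-joined lists."""
    values = column.dropna().str.split(sep).explode().str.strip()
    # Ignore empty fragments (e.g. from a trailing comma) and the fill placeholder
    values = values[(values != "") & (values != placeholder)]
    return values.value_counts()

# Made-up sample, not taken from the dataset
toy_country = pd.Series([
    "United States, India",
    "India",
    "No Data",
    "United States",
    "India, Japan",
])
counts = count_multivalue(toy_country)
print(counts)
```

Counting the raw strings instead would treat "United States, India" as a country of its own, which inflates the number of unique countries and undercounts each member of a co-production.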

## Duration Analysis for Movies
Parse movie durations (only those ending with 'min') and plot distribution.


In [None]:
# Extract durations (in minutes) for Movies only
import numpy as np
df_movies = df[df['type'] == 'Movie'].copy()
def parse_movie_duration(x):
    if isinstance(x, str) and x.strip().endswith('min'):
        return float(x.strip().split()[0])
    else:
        return np.nan
df_movies['duration_int'] = df_movies['duration'].apply(parse_movie_duration)
bad_durations = df_movies['duration_int'].isna().sum()
print(f"⚠️ {bad_durations} Movie rows have unexpected duration format.")
df_movies = df_movies.dropna(subset=['duration_int'])

plt.figure(figsize=(8, 5))
sns.histplot(df_movies['duration_int'], bins=30, kde=True, color='tomato')
plt.title("Distribution of Movie Durations (minutes)")
plt.xlabel("Duration (minutes)")
plt.ylabel("Frequency")
plt.grid(axis='y', linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()

## Season Analysis for TV Shows
Parse TV show durations to extract number of seasons, handling unexpected formats.


In [None]:
df_shows = df[df['type'] == 'TV Show'].copy()
def parse_seasons(x):
    # Expected format: "1 Season" or "3 Seasons"
    if isinstance(x, str) and 'Season' in x:
        try:
            return int(x.split()[0])
        except ValueError:
            return np.nan
    return np.nan
df_shows['seasons'] = df_shows['duration'].apply(parse_seasons)
fail_count = df_shows['seasons'].isna().sum()
print(f"⚠️ Unable to parse seasons for {fail_count} TV Show rows.")
df_shows = df_shows.dropna(subset=['seasons'])
# Any NaN forces a float column; cast back so the axis reads 1, 2, 3 rather than 1.0, 2.0
df_shows['seasons'] = df_shows['seasons'].astype(int)

plt.figure(figsize=(8, 5))
sns.countplot(x='seasons', data=df_shows, palette='crest')
plt.title("Number of Seasons in Netflix TV Shows")
plt.xlabel("Number of Seasons")
plt.ylabel("Count of Shows")
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
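
The count plot can be complemented with a single number: the share of shows that have exactly one season. This is a rough sketch on invented duration strings; the regular expression assumes the `"N Season(s)"` format described above.

```python
import pandas as pd

# Invented duration strings; "90 min" stands in for a row with an unexpected format
toy_durations = pd.Series(["1 Season", "2 Seasons", "1 Season", "5 Seasons", "90 min"])

# Pull out the leading number only when the whole string looks like "N Season(s)"
extracted = toy_durations.str.extract(r"^(\d+)\s+Seasons?$")[0]
seasons = pd.to_numeric(extracted, errors="coerce").dropna().astype(int)

# Share of shows per season count, ordered by number of seasons
season_share = seasons.value_counts(normalize=True).sort_index()
single_season_share = season_share.get(1, 0.0)

print(season_share)
print(f"Single-season share: {single_season_share:.1%}")
```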

## Trends Over Time
Visualize how many titles were added each year and each month.


In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(x='added_year', data=df, palette='rocket', order=sorted(df['added_year'].dropna().unique()))
plt.title("Netflix Titles Added by Year")
plt.xlabel("Year Added")
plt.ylabel("Number of Titles")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()

plt.figure(figsize=(9, 5))
sns.countplot(x='added_month', data=df, palette='flare')
plt.title("Netflix Content Added by Month")
plt.xlabel("Month")
plt.ylabel("Number of Titles")
plt.grid(axis='y', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
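
The yearly totals above do not separate Movies from TV Shows. One way to compare the two over time is a cross-tabulation of year against type, sketched here on made-up rows that reuse the `type` and `added_year` column names from this notebook.

```python
import pandas as pd

# Made-up rows using the notebook's column names
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "added_year": [2018, 2018, 2018, 2019, 2019, 2019],
})

# Rows: year added; columns: content type; cells: number of titles
by_year_type = pd.crosstab(toy["added_year"], toy["type"])

# Divide each row by its total to get the per-year mix of Movies vs TV Shows
share_by_year = by_year_type.div(by_year_type.sum(axis=1), axis=0)

print(by_year_type)
print(share_by_year.round(2))
```

On the real data, `share_by_year` would show whether the TV Show share of additions rises over the years, which the combined count plot cannot reveal.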

## Top Actors/Actresses
Count most frequently appearing cast members in the dataset.


In [None]:
from collections import Counter
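# Missing casts were filled with 'No Data' earlier, so exclude that placeholder;
# otherwise it would be counted as the most frequent "actor"
cast_series = df.loc[df['cast'] != 'No Data', 'cast'].dropna().str.split(',')
flat_cast_list = [actor.strip() for sublist in cast_series for actor in sublist if actor.strip()]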
top_actors = Counter(flat_cast_list).most_common(10)
top_actors_df = pd.DataFrame(top_actors, columns=['Actor', 'Count'])

plt.figure(figsize=(9, 5))
ax = sns.barplot(y='Actor', x='Count', data=top_actors_df, palette='viridis')
plt.title("Top 10 Most Featured Actors on Netflix")
plt.xlabel("Number of Appearances")
plt.ylabel("Actor Name")
plt.grid(axis='x', linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()

## Word Cloud of Titles
Create a word cloud of the most common words in Netflix titles. Make sure `wordcloud` is installed (`pip install wordcloud`).


In [None]:
try:
    from wordcloud import WordCloud
except ImportError:
    raise ImportError("Install the 'wordcloud' package: pip install wordcloud")

text = ' '.join(df['title'].dropna().tolist())
wordcloud = WordCloud(width=1000, height=500, background_color='black', colormap='spring').generate(text)

plt.figure(figsize=(15, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Words in Netflix Titles')
plt.tight_layout(pad=0)
plt.show()
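
A word cloud shows relative prominence but not exact counts. As a complement, the word frequencies can be tallied directly with the standard library. This is only a sketch: `top_title_words` is a hypothetical helper, the stopword set is a small illustrative list, and the titles are invented.

```python
import re
from collections import Counter

# Small illustrative stopword list, not an exhaustive one
STOPWORDS = {"the", "of", "a", "and", "in", "to"}

def top_title_words(titles, n=5):
    """Return the n most common non-stopword words across the given titles."""
    words = []
    for title in titles:
        # Lowercase and keep runs of letters/apostrophes as words
        for word in re.findall(r"[a-z']+", title.lower()):
            if word not in STOPWORDS and len(word) > 1:
                words.append(word)
    return Counter(words).most_common(n)

# Invented titles for demonstration
toy_titles = ["The Love Story", "Love in the City", "City of Christmas", "A Christmas Love"]
top_words = top_title_words(toy_titles, n=3)
print(top_words)
```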

## Summary & Business Recommendations
In this analysis, we set out to:
1. Understand Movie vs TV Show proportions.
2. Identify top genres and key producing countries.
3. Observe historical trends in content addition.
4. Discover frequently featured actors.

**Key Findings:**
- Netflix’s library skews toward Movies, but single-season TV shows grew significantly in recent years.
- Drama and Comedy are the top genres; the U.S. produces the most content, followed by India, Japan, and the UK.
- A major spike in content additions occurred between 2018 and 2020 as Netflix expanded globally.
- A handful of actors appear repeatedly, suggesting possible exclusive partnerships.

**Recommendations:**
1. Maintain a balanced mix of Movies and multi-season series to improve viewer retention.
2. Invest in local content production in high-growth markets (e.g., India, South Korea).
3. Negotiate exclusive deals with top local and global actors to differentiate content.

Prepared by Sarath for portfolio. All data from a public Netflix dataset.