# Netflix Content Analysis - Exploratory Data Analysis
**Author:** Meghana Reddy Guntupalli  
**Date:** January 2026  
**Dataset:** Netflix Movies and TV Shows (8,807 titles)

## Project Overview
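This project analyzes Netflix's content catalog to uncover insights about content trends, genre mix, country-wise production, and the timing of catalog additions. The dataset describes titles, not viewing behavior, so all findings concern what is in the catalog rather than what is watched. The analysis explores: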
- Content type distribution (Movies vs TV Shows)
- Trends in content additions over time
- Top contributing countries and directors
- Genre frequency across the catalog
- Content ratings and target audiences
- Monthly patterns in when titles were added

## 1. Import Libraries and Load Data

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Additional utilities
import warnings
# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

In [None]:
# Load the dataset
df = pd.read_csv('../data/netflix_titles.csv')

print("Dataset loaded successfully!")
print(f"Total records: {df.shape[0]:,}")
print(f"Total features: {df.shape[1]}")
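The later cells assume specific column names, so a quick pre-flight check can catch a mismatched file early. The following is a minimal sketch under stated assumptions: the helper `missing_columns` and the two-row inline CSV are hypothetical stand-ins, and the column list is simply the set of names this notebook references.

```python
import io
import pandas as pd

# Columns referenced by later cells in this notebook
REQUIRED_COLUMNS = ['type', 'director', 'cast', 'country', 'date_added',
                    'release_year', 'rating', 'duration', 'listed_in']

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]

# Hypothetical two-row sample standing in for netflix_titles.csv
sample_csv = io.StringIO(
    'type,director,cast,country,date_added,release_year,rating,duration,listed_in\n'
    'Movie,A,B,India,"September 25, 2021",2020,PG-13,90 min,Dramas\n'
    'TV Show,,C,,"September 24, 2021",2021,TV-MA,2 Seasons,"Crime TV Shows, TV Dramas"\n'
)
sample = pd.read_csv(sample_csv)

absent = missing_columns(sample)
absent_after_drop = missing_columns(sample.drop(columns=['rating']))
print(absent, absent_after_drop)
```

Running the same check on `df` right after loading would surface a renamed or missing column before any plotting cell fails.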

## 2. Data Understanding and Initial Exploration

In [None]:
# Display first few rows
df.head()

In [None]:
# Dataset information
print("\n=== Dataset Info ===")
df.info()

In [None]:
# Statistical summary
print("\n=== Statistical Summary ===")
df.describe(include='all')

In [None]:
# Check for missing values
print("\n=== Missing Values Analysis ===")
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum().values,
    'Missing_Percentage': (df.isnull().sum().values / len(df) * 100).round(2)
})
missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
print(missing_data)
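The same summary can be computed more compactly, because the mean of a boolean mask is the fraction of `True` values. This sketch uses a hypothetical toy frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy frame; None marks a missing value
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': ['X', None, None, 'Y'],
    'country': ['India', 'Japan', None, 'Spain'],
})

# isna().mean() gives the fraction missing per column in a single pass
missing_pct = (toy.isna().mean() * 100).round(2)

# Keep only columns with gaps, largest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```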

In [None]:
# Visualize missing data
plt.figure(figsize=(10, 6))
missing_cols = missing_data['Column'].tolist()
missing_pct = missing_data['Missing_Percentage'].tolist()

plt.barh(missing_cols, missing_pct, color='coral')
plt.xlabel('Missing Percentage (%)')
plt.title('Missing Data Analysis', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
for i, v in enumerate(missing_pct):
    plt.text(v + 0.5, i, f'{v}%', va='center')
plt.tight_layout()
plt.savefig('../images/missing_data.png', dpi=300, bbox_inches='tight')
plt.show()

## 3. Data Cleaning and Preprocessing

In [None]:
# Create a copy for cleaning
df_clean = df.copy()

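# Convert date_added to datetime; strip stray whitespace first so that any
# padded values (e.g. ' August 4, 2017') are parsed instead of coerced to NaT
df_clean['date_added'] = pd.to_datetime(df_clean['date_added'].str.strip(), errors='coerce')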

# Extract year and month added
df_clean['year_added'] = df_clean['date_added'].dt.year
df_clean['month_added'] = df_clean['date_added'].dt.month
df_clean['month_name'] = df_clean['date_added'].dt.month_name()

# Fill missing values for categorical columns
df_clean['director'] = df_clean['director'].fillna('Unknown')
df_clean['cast'] = df_clean['cast'].fillna('Unknown')
df_clean['country'] = df_clean['country'].fillna('Unknown')
df_clean['rating'] = df_clean['rating'].fillna('Not Rated')

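# Split duration into a numeric value (minutes for movies, seasons for TV shows) and its unit.
# Raw strings avoid the invalid '\d' escape; expand=False returns a Series rather than a DataFrame.
df_clean['duration_value'] = df_clean['duration'].str.extract(r'(\d+)', expand=False).astype(float)
df_clean['duration_unit'] = df_clean['duration'].str.extract(r'([a-zA-Z]+)', expand=False)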

print("Data cleaning completed!")
print(f"Rows after cleaning: {df_clean.shape[0]}")
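To see what the date and duration steps do, it helps to run them on a few hand-made values. This is a minimal sketch on a hypothetical toy frame. The explicit `format='%B %d, %Y'` assumes dates are written like "September 25, 2021", and the named regex groups are an illustrative variation on the extraction above, not the notebook's exact code.

```python
import pandas as pd

# Hypothetical toy values: one padded date, one clean date, one missing
toy = pd.DataFrame({
    'date_added': [' August 4, 2017', 'September 25, 2021', None],
    'duration': ['2 Seasons', '90 min', None],
})

# Strip padding, then parse with an explicit format; unparseable values become NaT
toy['date_added'] = pd.to_datetime(
    toy['date_added'].str.strip(), format='%B %d, %Y', errors='coerce'
)

# Named groups give the two extracted columns their names
parts = toy['duration'].str.extract(
    r'(?P<duration_value>\d+)\s*(?P<duration_unit>[A-Za-z]+)'
)
toy['duration_value'] = parts['duration_value'].astype(float)
toy['duration_unit'] = parts['duration_unit']
print(toy)
```

Comparing `date_added.isna().sum()` before and after parsing is a cheap way to confirm that the conversion did not silently discard rows.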

## 4. Content Type Analysis

In [None]:
# Content type distribution
content_type = df_clean['type'].value_counts()
print("\n=== Content Type Distribution ===")
print(content_type)
print(f"\nMovies: {content_type['Movie'] / len(df_clean) * 100:.1f}%")
print(f"TV Shows: {content_type['TV Show'] / len(df_clean) * 100:.1f}%")
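The percentages above can also be obtained directly with `value_counts(normalize=True)`, which returns shares instead of counts. A small sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical toy column of content types
types = pd.Series(['Movie', 'Movie', 'TV Show', 'Movie'], name='type')

# normalize=True returns fractions that sum to 1; scale to percentages
share = (types.value_counts(normalize=True) * 100).round(1)
print(share)
```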

In [None]:
# Visualize content type distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
colors = ['#E50914', '#221f1f']
axes[0].pie(content_type.values, labels=content_type.index, autopct='%1.1f%%', 
            startangle=90, colors=colors, textprops={'fontsize': 12})
axes[0].set_title('Content Type Distribution', fontsize=14, fontweight='bold')

# Bar chart
axes[1].bar(content_type.index, content_type.values, color=colors, width=0.5)
axes[1].set_ylabel('Count', fontsize=11)
axes[1].set_title('Movies vs TV Shows Count', fontsize=14, fontweight='bold')
for i, v in enumerate(content_type.values):
    axes[1].text(i, v + 100, str(v), ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig('../images/content_type_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

## 5. Temporal Analysis - Content Addition Trends

In [None]:
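# Yearly trend of content additions (cast year to int so the x-axis shows whole years)
yearly_content = (
    df_clean.dropna(subset=['year_added'])
    .assign(year_added=lambda d: d['year_added'].astype(int))
    .groupby(['year_added', 'type']).size().unstack(fill_value=0)
)

# DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure(),
# which would leave an empty extra figure in the output
yearly_content.plot(kind='line', marker='o', linewidth=2, markersize=6, figsize=(14, 6))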
plt.title('Netflix Content Additions Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Year', fontsize=11)
plt.ylabel('Number of Titles Added', fontsize=11)
plt.legend(title='Content Type', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../images/yearly_content_trend.png', dpi=300, bbox_inches='tight')
plt.show()

# Find peak year
peak_year = yearly_content.sum(axis=1).idxmax()
peak_count = yearly_content.sum(axis=1).max()
print(f"\nPeak content addition year: {int(peak_year)} with {int(peak_count)} titles")
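The group, count, and pivot pattern used for the yearly trend is easier to follow on a handful of rows. This sketch uses a hypothetical toy frame and shows the intermediate table along with the peak-year lookup:

```python
import pandas as pd

# Hypothetical toy frame: one row per title
toy = pd.DataFrame({
    'year_added': [2018, 2019, 2019, 2019, 2020, 2020],
    'type': ['Movie', 'Movie', 'TV Show', 'Movie', 'TV Show', 'Movie'],
})

# Rows: year, columns: content type, values: number of titles
yearly = toy.groupby(['year_added', 'type']).size().unstack(fill_value=0)

# Sum across types, then find the year with the largest total
totals = yearly.sum(axis=1)
peak_year, peak_count = totals.idxmax(), totals.max()
print(yearly)
print(peak_year, peak_count)
```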

In [None]:
# Monthly addition pattern
monthly_content = df_clean.dropna(subset=['month_added']).groupby('month_name')['type'].value_counts().unstack(fill_value=0)

# Reorder months correctly
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']
monthly_content = monthly_content.reindex([m for m in month_order if m in monthly_content.index])

# DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure()
monthly_content.plot(kind='bar', stacked=False, figsize=(14, 6), color=['#E50914', '#221f1f'])
plt.title('Monthly Content Addition Pattern', fontsize=14, fontweight='bold')
plt.xlabel('Month', fontsize=11)
plt.ylabel('Number of Titles', fontsize=11)
plt.xticks(rotation=45)
plt.legend(title='Content Type')
plt.tight_layout()
plt.savefig('../images/monthly_pattern.png', dpi=300, bbox_inches='tight')
plt.show()
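An alternative to reindexing by a month list is to store the month as an ordered categorical, so that grouping and sorting follow calendar order automatically. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# Hypothetical toy column of month names, deliberately out of order
months = pd.Series(['March', 'January', 'March', 'December'], name='month_name')
months = months.astype(pd.CategoricalDtype(categories=month_order, ordered=True))

# observed=False keeps months with no titles as zero counts, in calendar order
monthly = months.groupby(months, observed=False).size()
print(monthly)
```

With this approach, months that have no titles still appear in the chart as empty bars rather than being dropped.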

## 6. Geographic Analysis - Content by Country

In [None]:
# Extract first country (primary production country)
df_clean['primary_country'] = df_clean['country'].str.split(',').str[0].str.strip()

# Top 15 countries producing content
top_countries = df_clean[df_clean['primary_country'] != 'Unknown']['primary_country'].value_counts().head(15)

plt.figure(figsize=(12, 8))
plt.barh(top_countries.index, top_countries.values, color='#E50914')
plt.xlabel('Number of Titles', fontsize=11)
plt.title('Top 15 Countries Producing Netflix Content', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
for i, v in enumerate(top_countries.values):
    plt.text(v + 20, i, str(v), va='center', fontsize=10)
plt.tight_layout()
plt.savefig('../images/top_countries.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nTop 5 Content Producing Countries:")
print(top_countries.head())
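Taking only the first listed country ignores co-productions. As a complementary view, every listed country can be counted by splitting and exploding the field. This is a sketch on hypothetical toy values, not a replacement for the primary-country analysis above:

```python
import pandas as pd

# Hypothetical toy values, including inconsistent spacing and a missing entry
countries = pd.Series(['United States, India', 'India', 'United Kingdom,Canada', None])

# Split on commas, one row per country, then trim whitespace and drop blanks
exploded = countries.dropna().str.split(',').explode().str.strip()
exploded = exploded[exploded != '']

country_counts = exploded.value_counts()
print(country_counts)
```

Under this counting a co-produced title contributes to each of its countries, so the totals exceed the number of titles.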

In [None]:
# Content type by top countries
top_5_countries = top_countries.head(5).index.tolist()
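country_type = (
    df_clean[df_clean['primary_country'].isin(top_5_countries)]
    .groupby(['primary_country', 'type']).size().unstack(fill_value=0)
    .reindex(top_5_countries)  # keep countries in ranked order instead of alphabetical
)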

country_type.plot(kind='bar', figsize=(12, 6), color=['#E50914', '#221f1f'])
plt.title('Movies vs TV Shows by Top 5 Countries', fontsize=14, fontweight='bold')
plt.xlabel('Country', fontsize=11)
plt.ylabel('Number of Titles', fontsize=11)
plt.xticks(rotation=45)
plt.legend(title='Content Type')
plt.tight_layout()
plt.savefig('../images/country_content_type.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Content Rating Analysis

In [None]:
# Rating distribution
rating_dist = df_clean['rating'].value_counts().head(10)

plt.figure(figsize=(12, 6))
plt.bar(rating_dist.index, rating_dist.values, color='#E50914')
plt.title('Top 10 Content Ratings on Netflix', fontsize=14, fontweight='bold')
plt.xlabel('Rating', fontsize=11)
plt.ylabel('Number of Titles', fontsize=11)
plt.xticks(rotation=45)
for i, v in enumerate(rating_dist.values):
    plt.text(i, v + 30, str(v), ha='center', fontsize=9)
plt.tight_layout()
plt.savefig('../images/content_ratings.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n=== Content Rating Distribution ===")
print(rating_dist)

In [None]:
# Rating by content type
rating_by_type = df_clean.groupby(['rating', 'type']).size().unstack(fill_value=0)
top_ratings = rating_dist.head(8).index.tolist()
rating_by_type_top = rating_by_type.loc[top_ratings]

rating_by_type_top.plot(kind='barh', stacked=True, figsize=(12, 6), color=['#E50914', '#221f1f'])
plt.title('Content Type Distribution by Rating', fontsize=14, fontweight='bold')
plt.xlabel('Number of Titles', fontsize=11)
plt.ylabel('Rating', fontsize=11)
plt.legend(title='Content Type')
plt.tight_layout()
plt.savefig('../images/rating_by_type.png', dpi=300, bbox_inches='tight')
plt.show()
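Raw counts make it hard to compare ratings of very different sizes. One option is `pd.crosstab` with `normalize='index'`, which gives the movie and TV show share within each rating. A sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame: one row per title
toy = pd.DataFrame({
    'rating': ['TV-MA', 'TV-MA', 'TV-MA', 'TV-MA', 'PG-13', 'PG-13'],
    'type': ['Movie', 'Movie', 'Movie', 'TV Show', 'Movie', 'Movie'],
})

# normalize='index' makes each row (rating) sum to 1; scale to percentages
share = pd.crosstab(toy['rating'], toy['type'], normalize='index').mul(100).round(1)
print(share)
```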

## 8. Genre Analysis

In [None]:
# Split the comma-separated genre labels into one row per label
all_genres = df_clean['listed_in'].str.split(', ').explode()
genre_counts = all_genres.value_counts().head(15)

plt.figure(figsize=(12, 8))
plt.barh(genre_counts.index, genre_counts.values, color='#E50914')
plt.xlabel('Number of Titles', fontsize=11)
plt.title('Top 15 Genres on Netflix', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
for i, v in enumerate(genre_counts.values):
    plt.text(v + 20, i, str(v), va='center', fontsize=9)
plt.tight_layout()
plt.savefig('../images/top_genres.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n=== Top 10 Genres ===")
print(genre_counts.head(10))
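Exploding the whole frame rather than a single column keeps each genre attached to its title's other attributes, which allows breakdowns such as genre by content type. A minimal sketch on hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy frame with comma-separated genre labels
toy = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show'],
    'listed_in': ['Dramas, Comedies', 'Dramas', 'TV Dramas, Crime TV Shows'],
})

# One row per (title, genre) pair; other columns are repeated for each genre
long_df = toy.assign(genre=toy['listed_in'].str.split(', ')).explode('genre')

# Count genres within each content type, most frequent first
genre_by_type = long_df.groupby(['type', 'genre']).size().sort_values(ascending=False)
print(genre_by_type)
```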

## 9. Duration Analysis

In [None]:
# Movie duration analysis
movies_df = df_clean[df_clean['type'] == 'Movie'].copy()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of movie durations
axes[0].hist(movies_df['duration_value'].dropna(), bins=30, color='#E50914', edgecolor='black')
axes[0].set_xlabel('Duration (minutes)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Distribution of Movie Durations', fontsize=12, fontweight='bold')
axes[0].axvline(movies_df['duration_value'].median(), color='yellow', linestyle='--', 
                linewidth=2, label=f"Median: {movies_df['duration_value'].median():.0f} min")
axes[0].legend()

# Box plot
axes[1].boxplot(movies_df['duration_value'].dropna(), vert=True, patch_artist=True,
                boxprops=dict(facecolor='#E50914'))
axes[1].set_ylabel('Duration (minutes)', fontsize=11)
axes[1].set_title('Movie Duration Box Plot', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig('../images/movie_duration.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n=== Movie Duration Statistics ===")
print(f"Average: {movies_df['duration_value'].mean():.1f} minutes")
print(f"Median: {movies_df['duration_value'].median():.0f} minutes")
print(f"Shortest: {movies_df['duration_value'].min():.0f} minutes")
print(f"Longest: {movies_df['duration_value'].max():.0f} minutes")
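The points drawn beyond the whiskers in the box plot follow the 1.5 × IQR rule, which is Matplotlib's default whisker setting. The same bounds can be computed by hand to list the outlying titles. This sketch uses hypothetical durations:

```python
import pandas as pd

# Hypothetical movie durations in minutes, with one extreme value
durations = pd.Series([80, 90, 95, 100, 105, 110, 120, 312], dtype=float)

# Interquartile range and the 1.5 * IQR fences
q1, q3 = durations.quantile(0.25), durations.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are the points a box plot draws individually
outliers = durations[(durations < lower) | (durations > upper)]
print(lower, upper, outliers.tolist())
```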

In [None]:
# TV Show seasons analysis
tv_shows_df = df_clean[df_clean['type'] == 'TV Show'].copy()
seasons_count = tv_shows_df['duration_value'].value_counts().sort_index()

plt.figure(figsize=(12, 6))
plt.bar(seasons_count.index, seasons_count.values, color='#221f1f', width=0.6)
plt.xlabel('Number of Seasons', fontsize=11)
plt.ylabel('Number of TV Shows', fontsize=11)
plt.title('Distribution of TV Show Seasons', fontsize=14, fontweight='bold')
plt.xticks(range(1, int(seasons_count.index.max()) + 1))
plt.tight_layout()
plt.savefig('../images/tv_show_seasons.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n=== TV Show Statistics ===")
print(f"Average seasons: {tv_shows_df['duration_value'].mean():.1f}")
print(f"Most common: {tv_shows_df['duration_value'].mode()[0]:.0f} season(s)")
print(f"Maximum seasons: {tv_shows_df['duration_value'].max():.0f}")

## 10. Release Year Analysis

In [None]:
# Content by release year (decade analysis)
df_clean['decade'] = (df_clean['release_year'] // 10) * 10
decade_content = df_clean.groupby(['decade', 'type']).size().unstack(fill_value=0)

decade_content.plot(kind='bar', figsize=(14, 6), color=['#E50914', '#221f1f'])
plt.title('Netflix Content by Release Decade', fontsize=14, fontweight='bold')
plt.xlabel('Decade', fontsize=11)
plt.ylabel('Number of Titles', fontsize=11)
plt.xticks(rotation=45)
plt.legend(title='Content Type')
plt.tight_layout()
plt.savefig('../images/decade_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
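The decade column relies on integer floor division: dividing by 10 discards the last digit, and multiplying by 10 restores the scale. A tiny illustration with hypothetical years:

```python
from collections import Counter

# Hypothetical release years
release_years = [1999, 2000, 2009, 2010, 2019, 2021]

# Floor-divide by 10 to drop the final digit, then scale back up
decades = [(year // 10) * 10 for year in release_years]
decade_counts = Counter(decades)
print(decades)
print(decade_counts)
```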

## 11. Top Directors and Cast

In [None]:
# Top directors
directors_list = df_clean[df_clean['director'] != 'Unknown']['director'].str.split(', ').explode()
top_directors = directors_list.value_counts().head(10)

plt.figure(figsize=(12, 6))
plt.barh(top_directors.index, top_directors.values, color='#E50914')
plt.xlabel('Number of Titles', fontsize=11)
plt.title('Top 10 Directors on Netflix', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
for i, v in enumerate(top_directors.values):
    plt.text(v + 0.2, i, str(v), va='center', fontsize=9)
plt.tight_layout()
plt.savefig('../images/top_directors.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n=== Top 10 Directors ===")
print(top_directors)

In [None]:
# Top cast members
cast_list = df_clean[df_clean['cast'] != 'Unknown']['cast'].str.split(', ').explode()
top_cast = cast_list.value_counts().head(10)

plt.figure(figsize=(12, 6))
plt.barh(top_cast.index, top_cast.values, color='#221f1f')
plt.xlabel('Number of Appearances', fontsize=11)
plt.title('Top 10 Cast Members on Netflix', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
for i, v in enumerate(top_cast.values):
    plt.text(v + 0.5, i, str(v), va='center', fontsize=9)
plt.tight_layout()
plt.savefig('../images/top_cast.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n=== Top 10 Cast Members ===")
print(top_cast)

## 12. Key Insights Summary

In [None]:
print("="*80)
print(" "*20 + "NETFLIX CONTENT ANALYSIS - KEY INSIGHTS")
print("="*80)
print("\n1. CONTENT COMPOSITION:")
print(f"   - Total Titles: {len(df_clean):,}")
print(f"   - Movies: {len(df_clean[df_clean['type']=='Movie']):,} ({len(df_clean[df_clean['type']=='Movie'])/len(df_clean)*100:.1f}%)")
print(f"   - TV Shows: {len(df_clean[df_clean['type']=='TV Show']):,} ({len(df_clean[df_clean['type']=='TV Show'])/len(df_clean)*100:.1f}%)")

print("\n2. TEMPORAL TRENDS:")
print(f"   - Peak Addition Year: {int(peak_year)} ({int(peak_count)} titles)")
print(f"   - Most Active Month: {monthly_content.sum(axis=1).idxmax()}")

print("\n3. GEOGRAPHIC INSIGHTS:")
print(f"   - Top Content Producer: {top_countries.index[0]} ({top_countries.values[0]} titles)")
print(f"   - Top 3 Countries: {', '.join(top_countries.head(3).index.tolist())}")

print("\n4. CONTENT CHARACTERISTICS:")
print(f"   - Most Common Rating: {rating_dist.index[0]} ({rating_dist.values[0]} titles)")
print(f"   - Top Genre: {genre_counts.index[0]} ({genre_counts.values[0]} titles)")
print(f"   - Average Movie Duration: {movies_df['duration_value'].mean():.0f} minutes")
print(f"   - Most TV Shows Have: {tv_shows_df['duration_value'].mode()[0]:.0f} season(s)")

print("\n5. TOP CONTRIBUTORS:")
print(f"   - Most Prolific Director: {top_directors.index[0]} ({top_directors.values[0]} titles)")
print(f"   - Most Featured Actor: {top_cast.index[0]} ({top_cast.values[0]} appearances)")

print("\n" + "="*80)
print(" "*25 + "Analysis Complete!")
print("="*80)

## 13. Recommendations and Next Steps

Based on this analysis, here are potential areas for deeper investigation:

1. **Sentiment Analysis**: Analyze description text to understand content themes and emotional tones
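2. **Predictive Modeling**: Build models to predict content success from features like genre, cast, and duration; this would require joining an external success measure (for example viewership or audience ratings), since the catalog data used here contains none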
3. **Network Analysis**: Explore connections between directors, actors, and genres
4. **Time Series Forecasting**: Predict future content addition trends
5. **Comparative Analysis**: Compare Netflix's strategy across different countries and regions
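As a starting point for the network analysis idea, director and actor connections can be represented as a weighted edge list, where the weight is the number of shared titles. The sketch below is an assumption-laden illustration: the rows and the helper `split_names` are hypothetical, and a graph library could consume the resulting edges in a later step.

```python
from collections import Counter
from itertools import product

# Hypothetical rows mimicking the director and cast fields
titles = [
    {'director': 'Dir A', 'cast': 'Actor X, Actor Y'},
    {'director': 'Dir A, Dir B', 'cast': 'Actor X'},
    {'director': None, 'cast': 'Actor Z'},
]

def split_names(value):
    """Split a comma-separated field into trimmed names; empty for missing values."""
    if not value:
        return []
    return [name.strip() for name in value.split(',') if name.strip()]

# Each (director, actor) pair on the same title adds one to that edge's weight
edges = Counter()
for row in titles:
    for pair in product(split_names(row['director']), split_names(row['cast'])):
        edges[pair] += 1

print(edges.most_common())
```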

---

**Contact**: Meghana Reddy Guntupalli  