# Netflix Content Analysis - Assignment Solutions 🎬

## Comprehensive Analysis with Visualizations

This notebook provides detailed solutions to all submission questions, with visualizations and plain-language insights.

**Dataset**: Netflix Titles (7,770 records after cleaning)  
**Analysis Period**: 2008-2021  
**Content Types**: Movies (69.1%) and TV Shows (30.9%)

## Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from collections import Counter
import re
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

In [None]:
# Clone dataset repository
!git clone "https://github.com/GeeksforgeeksDS/21-Days-21-Projects-Dataset"

In [None]:
# Load and clean the Netflix dataset
print("📊 Loading Netflix Dataset...")
netflix_df = pd.read_csv('21-Days-21-Projects-Dataset/Datasets/netflix_titles.csv')

print(f"Original dataset shape: {netflix_df.shape}")
print("\n🧹 Cleaning data...")

# Data cleaning steps
netflix_df['director'] = netflix_df['director'].fillna('Unknown')
netflix_df['cast'] = netflix_df['cast'].fillna('Unknown')
mode_country = netflix_df['country'].mode()[0]
netflix_df['country'] = netflix_df['country'].fillna(mode_country)
netflix_df.dropna(subset=['date_added', 'rating'], inplace=True)

# Convert date and create time features
# Note: format='mixed' requires pandas >= 2.0
netflix_df['date_added'] = pd.to_datetime(netflix_df['date_added'], format='mixed', dayfirst=False)
netflix_df['year_added'] = netflix_df['date_added'].dt.year
netflix_df['month_added'] = netflix_df['date_added'].dt.month

# Create content age feature
netflix_df['content_age'] = netflix_df['year_added'] - netflix_df['release_year']

print(f"✅ Cleaned dataset shape: {netflix_df.shape}")
print(f"📈 Analysis ready with {len(netflix_df)} titles!")
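
As a quick sanity check, the cleaning steps above can be exercised on a small hand-made frame. The sketch below is not part of the original notebook: `clean_titles` and the `toy` rows are hypothetical, and it parses dates with an explicit `%B %d, %Y` format after stripping whitespace (an assumption about how `date_added` is written) rather than `format='mixed'`.

```python
import pandas as pd


def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the notebook's cleaning steps on a copy."""
    out = df.copy()
    # Fill missing text fields with a placeholder
    out["director"] = out["director"].fillna("Unknown")
    out["cast"] = out["cast"].fillna("Unknown")
    # Fill missing countries with the most frequent country
    out["country"] = out["country"].fillna(out["country"].mode()[0])
    # Drop rows that lack a date or a rating
    out = out.dropna(subset=["date_added", "rating"]).copy()
    # Assumption: dates look like "August 4, 2017", possibly with stray spaces
    out["date_added"] = pd.to_datetime(out["date_added"].str.strip(), format="%B %d, %Y")
    out["year_added"] = out["date_added"].dt.year
    out["month_added"] = out["date_added"].dt.month
    out["content_age"] = out["year_added"] - out["release_year"]
    return out


# Made-up rows covering each kind of missing value
toy = pd.DataFrame({
    "director": ["A. Director", None, "B. Director", None],
    "cast": [None, "Actor One", "Actor Two", None],
    "country": ["United States", None, "India", "United States"],
    "date_added": ["September 9, 2019", " August 4, 2017", None, "January 1, 2020"],
    "rating": ["TV-MA", "TV-14", "R", None],
    "release_year": [2018, 2010, 2015, 2019],
})

cleaned = clean_titles(toy)
print(cleaned[["director", "cast", "country", "year_added", "content_age"]])
```

Only the two rows that have both a date and a rating should survive, with no missing values left in any column.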

---
# 📋 Assignment Questions & Solutions

Let's dive into each question with detailed analysis and beautiful visualizations!

## Question 1: How has the distribution of content ratings changed over time? 📊

**What we're exploring**: Netflix's content strategy regarding audience maturity levels and how it has evolved from 2017 to 2021.

In [None]:
# Question 1 Analysis: Content Ratings Over Time
print("🎯 QUESTION 1: Content Ratings Distribution Over Time")
print("=" * 60)

# Create ratings by year analysis
ratings_by_year = netflix_df.groupby(['year_added', 'rating']).size().unstack(fill_value=0)
recent_years = sorted(netflix_df['year_added'].unique())[-5:]  # Last 5 years

# Create a comprehensive visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 16))
fig.suptitle('Netflix Content Ratings: Evolution Over Time (2017-2021)', fontsize=20, fontweight='bold')

# 1. Stacked bar chart of ratings by year
ratings_subset = ratings_by_year.loc[recent_years]
top_ratings = ['TV-MA', 'TV-14', 'TV-PG', 'R', 'PG-13']
ratings_subset[top_ratings].plot(kind='bar', stacked=True, ax=ax1, colormap='viridis')
ax1.set_title('Content Ratings Distribution by Year (Stacked)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Year Added to Netflix')
ax1.set_ylabel('Number of Titles')
ax1.legend(title='Rating', bbox_to_anchor=(1.05, 1), loc='upper left')
ax1.tick_params(axis='x', rotation=45)

# 2. Line plot showing TV-MA dominance over time
tv_ma_trend = ratings_by_year['TV-MA'].loc[recent_years]
total_by_year = ratings_by_year.loc[recent_years].sum(axis=1)
tv_ma_percentage = (tv_ma_trend / total_by_year * 100)

ax2.plot(recent_years, tv_ma_percentage, marker='o', linewidth=3, markersize=8, color='#e74c3c')
ax2.set_title('TV-MA Content Percentage Over Time', fontsize=14, fontweight='bold')
ax2.set_xlabel('Year')
ax2.set_ylabel('Percentage of TV-MA Content')
ax2.grid(True, alpha=0.3)
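ax2.set_xticks(recent_years)  # Show whole years only on the x-axis
# Pad the y-axis around the observed range instead of hard-coding limits that could clip the line
ax2.set_ylim(tv_ma_percentage.min() - 5, tv_ma_percentage.max() + 5)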

# Add percentage labels
for year, pct in zip(recent_years, tv_ma_percentage):
    ax2.annotate(f'{pct:.1f}%', (year, pct), textcoords="offset points", xytext=(0,10), ha='center')

# 3. Heatmap of ratings distribution
# Note: shares here are computed within the five ratings in top_ratings, not across all titles
ratings_pct = ratings_subset[top_ratings].div(ratings_subset[top_ratings].sum(axis=1), axis=0) * 100
sns.heatmap(ratings_pct.T, annot=True, fmt='.1f', cmap='YlOrRd', ax=ax3, cbar_kws={'label': 'Percentage'})
ax3.set_title('Rating Distribution Heatmap (% Within Top 5 Ratings, by Year)', fontsize=14, fontweight='bold')
ax3.set_xlabel('Year Added')
ax3.set_ylabel('Content Rating')

# 4. Pie chart for overall rating distribution
overall_ratings = netflix_df['rating'].value_counts().head(6)
colors = plt.cm.Set3(np.linspace(0, 1, len(overall_ratings)))
wedges, texts, autotexts = ax4.pie(overall_ratings.values, labels=overall_ratings.index, 
                                   autopct='%1.1f%%', colors=colors, startangle=90)
ax4.set_title('Overall Content Rating Distribution\n(All Years)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Print detailed insights
print("\n📈 KEY INSIGHTS:")
print(f"• TV-MA consistently dominates: {tv_ma_percentage.mean():.1f}% average across 2017-2021")
print(f"• Highest TV-MA year: {tv_ma_percentage.idxmax()} ({tv_ma_percentage.max():.1f}%)")
print(f"• Netflix clearly targets mature audiences with {overall_ratings['TV-MA']} TV-MA titles")
print(f"• TV-14 is second most common with {overall_ratings['TV-14']} titles")

# Show year-by-year breakdown
print("\n📊 YEAR-BY-YEAR BREAKDOWN:")
for year in recent_years:
    year_data = netflix_df[netflix_df['year_added'] == year]['rating'].value_counts().head(3)
    top_3 = ', '.join([f'{rating} ({count})' for rating, count in year_data.items()])
    print(f"  {year}: {top_3}")
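
The percentages in the line plot and in the heatmap use different denominators: the TV-MA line divides by all titles added that year, while the heatmap divides only by titles in the selected ratings. A minimal sketch on made-up counts (the `toy` frame and the two-rating `top` list are illustrative, not taken from the dataset) shows how the two numbers can differ for the same year:

```python
import pandas as pd

# Made-up titles: four added in 2019, five added in 2020
toy = pd.DataFrame({
    "year_added": [2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020],
    "rating": ["TV-MA", "TV-MA", "TV-14", "R", "TV-MA", "TV-14", "TV-14", "TV-14", "PG"],
})

# Same pattern as the notebook: one row per year, one column per rating
counts = toy.groupby(["year_added", "rating"]).size().unstack(fill_value=0)

# Denominator 1: every title added that year (as in the TV-MA line plot)
share_of_all = counts.div(counts.sum(axis=1), axis=0) * 100

# Denominator 2: only titles in the selected ratings (as in the heatmap)
top = ["TV-MA", "TV-14"]  # hypothetical shortlist for this toy example
share_within_top = counts[top].div(counts[top].sum(axis=1), axis=0) * 100

print(share_of_all.round(1))
print(share_within_top.round(1))
```

In the 2019 toy rows, TV-MA accounts for 2 of 4 titles overall but 2 of 3 titles within the shortlist, so the heatmap-style share is the larger of the two. Each row of either table still sums to 100, which is a useful check when reading the charts side by side.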