# 📺 Netflix Content Analysis

This notebook provides an exploratory data analysis (EDA) of a Netflix titles dataset using Python (pandas, Seaborn, Matplotlib). It covers data cleaning, outlier detection, and visual analysis of movies and shows.

## 📁 Load Dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load CSV file (replace path if needed)
df = pd.read_csv('netflix_titles.csv')
df.head()

## 🧾 Dataset Overview
Quick look at column names, data types, and missing values.

In [None]:
df.info()
df.isnull().sum()
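
Raw null counts are easier to judge next to the share of rows they represent. The sketch below is one possible helper for that, not part of the original notebook: `missing_summary` is a hypothetical name, and the small `sample` frame is made-up data standing in for `df`.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny made-up frame standing in for the real dataset
sample = pd.DataFrame({
    "title": ["A", "B", None, "D"],
    "imdb_score": [7.1, None, None, 6.0],
    "type": ["MOVIE", "SHOW", "MOVIE", "SHOW"],
})
report = missing_summary(sample)
print(report)
```

On the real data this would be called as `missing_summary(df)`.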

## 🧹 Data Cleaning
Handle missing values and fix inconsistent formatting.

In [None]:
# Fill nulls in specific columns
df['description'] = df['description'].fillna('No Description')
df['age_certification'] = df['age_certification'].fillna('Unrated')
df['imdb_id'] = df['imdb_id'].fillna('N/A')
df['production_countries'] = df['production_countries'].replace('[]', 'Unknown')

# Drop rows with missing title
df = df.dropna(subset=['title'])

In [None]:
# Fill missing values with averages for numeric columns
for col in ['imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity']:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mean())

## 🔍 Duplicate Check

In [None]:
df[df.duplicated(subset=['title', 'release_year', 'type'])]
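
The cell above only displays duplicate rows. If any turn up, one way to remove them is sketched here, assuming the first occurrence is the one worth keeping. The `sample` frame is made-up data, and `key_cols` is an illustrative name for the same three columns used in the check.

```python
import pandas as pd

# Made-up rows: the second "Alpha" repeats title, year, and type
sample = pd.DataFrame({
    "title": ["Alpha", "Alpha", "Beta", "Alpha"],
    "release_year": [2020, 2020, 2021, 2019],
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE"],
})

key_cols = ["title", "release_year", "type"]

# Flag every repeat after the first occurrence
dup_mask = sample.duplicated(subset=key_cols, keep="first")
print(f"Duplicate rows: {dup_mask.sum()}")

# Drop the repeats and rebuild a clean 0..n-1 index
deduped = sample.drop_duplicates(subset=key_cols, keep="first").reset_index(drop=True)
print(deduped)
```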

## ✅ Movie vs Show Distribution

In [None]:
type_counts = df['type'].value_counts()
plt.figure(figsize=(6, 6))
type_counts.plot.pie(autopct='%1.1f%%', startangle=90, colors=['#66c2a5', '#fc8d62'])
plt.title('Content Type Distribution')
plt.ylabel('')
plt.show()

## 📅 Content by Release Year

In [None]:
year_counts = df['release_year'].value_counts().sort_index()
plt.figure(figsize=(12, 5))
sns.lineplot(x=year_counts.index, y=year_counts.values)
plt.title('Titles Released per Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.grid(True)
plt.show()

## 🌍 Top Production Countries

In [None]:
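# Counts each distinct value as stored, so a co-production such as "['US', 'GB']" is its own category
top_countries = df['production_countries'].value_counts().head(10)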
plt.figure(figsize=(10, 5))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='viridis')
plt.title('Top 10 Producing Countries')
plt.xlabel('Number of Titles')
plt.show()
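
If `production_countries` holds stringified lists (the `'[]'` replacement in the cleaning step suggests it does), counting individual countries requires splitting those lists first. A minimal sketch under that assumption, using a hypothetical `parse_list` helper and made-up sample rows:

```python
import ast

import pandas as pd


def parse_list(value):
    """Turn "['US', 'GB']" into ['US', 'GB']; wrap anything else in a list."""
    if isinstance(value, str) and value.startswith("["):
        return ast.literal_eval(value)
    return [value]


# Made-up rows in the assumed storage format
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "production_countries": ["['US']", "['US', 'GB']", "Unknown", "['GB', 'US']"],
})

country_counts = (
    sample["production_countries"]
    .apply(parse_list)   # string -> list of country codes
    .explode()           # one row per (title, country) pair
    .value_counts()
)
print(country_counts)
```

With this approach a co-production contributes one count to each of its countries, so the totals can exceed the number of titles.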

## 🎭 Genre Distribution

In [None]:
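import ast
from collections import Counter

# Parse stringified lists with ast.literal_eval, which does not execute arbitrary code the way eval() does
genre_series = df['genres'].dropna().apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x.startswith('[') else [x])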
genres_flat = [genre.strip() for sublist in genre_series for genre in sublist if genre]
genre_counts = pd.Series(Counter(genres_flat)).sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette='rocket')
plt.title('Top 10 Genres')
plt.xlabel('Number of Titles')
plt.show()

## ⚠️ Runtime Outlier Detection

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(data=df, x='type', y='runtime')
plt.title('Runtime Outliers by Type')
plt.show()
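
The box plot shows outliers visually. To list them as rows, one common option is the 1.5 × IQR rule, which is also the default whisker rule for box plots. The sketch below applies it per content type. `iqr_outliers` and its parameters are hypothetical names, and the sample runtimes are made up.

```python
import pandas as pd


def iqr_outliers(frame, value_col="runtime", group_col="type", k=1.5):
    """Return rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] within their group."""
    grouped = frame.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto its rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    mask = (frame[value_col] < q1 - k * iqr) | (frame[value_col] > q3 + k * iqr)
    return frame[mask]


# Made-up runtimes in minutes; "Long Cut" is far above the other movies
sample = pd.DataFrame({
    "title": ["M1", "M2", "M3", "M4", "M5", "Long Cut",
              "S1", "S2", "S3", "S4", "S5", "S6"],
    "type": ["MOVIE"] * 6 + ["SHOW"] * 6,
    "runtime": [90, 95, 100, 105, 110, 300, 20, 22, 25, 28, 30, 31],
})

outliers = iqr_outliers(sample)
print(outliers)
```

Computing the bounds per type matters here because movie and show runtimes sit on very different scales.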

## ⏱️ Longest Titles by Runtime

In [None]:
df['hours'] = df['runtime'] // 60
df['minutes'] = df['runtime'] % 60
df['formatted_duration'] = df['hours'].astype(str) + ':' + df['minutes'].astype(str).str.zfill(2)

# Top longest titles
longest = df[['title', 'type', 'runtime', 'formatted_duration']].sort_values(by='runtime', ascending=False).head(10)
longest
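
The cell above assumes `runtime` is an integer column. If it contains missing values, pandas stores it as float and the formatted string comes out like `1.0:30.0`. A sketch of a formatter that tolerates that case follows. `format_duration` is a hypothetical helper, and the sample rows are made up.

```python
import pandas as pd


def format_duration(minutes):
    """Format a runtime in minutes as H:MM; return None when the value is missing."""
    if pd.isna(minutes):
        return None
    total = int(minutes)
    return f"{total // 60}:{total % 60:02d}"


# Made-up rows; the missing runtime forces a float column
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "runtime": [135.0, None, 45.0],
})
sample["formatted_duration"] = sample["runtime"].apply(format_duration)
print(sample)
```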

## ✅ Conclusion
- The majority of titles are movies.
- The most common genres include documentaries and dramas.
- Runtime outliers were identified, particularly in some shows.
- Several fields in the dataset required cleaning.

📌 You can expand this analysis with:
- Sentiment analysis of descriptions
- Rating comparison by genre
- A Tableau dashboard