# Netflix Dataset Analysis

**Ready-to-run notebook** — follows the provided problem statement. Place `netflix_titles.csv` in a `dataset/` folder next to this notebook before running.

### What this notebook contains
- Load & quick checks
- Cleaning & preprocessing (dates, duration, multi-value fields)
- Exploratory Data Analysis (plots with matplotlib & Plotly)
- Save cleaned data
- Short insights & recommendations

---


In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import os
from IPython.display import display
%matplotlib inline
print('Libraries loaded')


## 1) Load dataset
Ensure your CSV is at `dataset/netflix_titles.csv`. If your filename differs, update the path below.

In [None]:
DATA_PATH = 'dataset/netflix_titles.csv'
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"Dataset not found at {DATA_PATH}. Please place 'netflix_titles.csv' in the 'dataset/' folder.")
df = pd.read_csv(DATA_PATH)
print('Dataset loaded — shape:', df.shape)
df.head()
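
Later cells assume certain columns exist. A minimal sketch of an up-front check, assuming the column names used elsewhere in this notebook; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the toy frame stands in for the real CSV:

```python
import pandas as pd

# Columns the later cells rely on (collected from the code in this notebook).
EXPECTED_COLUMNS = {
    "type", "title", "director", "cast", "country",
    "date_added", "release_year", "rating", "duration", "listed_in",
}

def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame`, sorted."""
    return sorted(set(expected) - set(frame.columns))

# Toy frame standing in for the real CSV; it lacks most columns on purpose.
toy = pd.DataFrame({"title": ["A"], "type": ["Movie"], "duration": ["90 min"]})
missing = missing_columns(toy)
print("Missing columns:", missing)
```

On the real data you would call `missing_columns(df)` and stop early if the list is not empty.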

### Quick checks

In [None]:
# Basic info and missing values
df.info()  # prints its report directly and returns None, so it needs no display() wrapper
display(df.isnull().sum())
display(df.describe(include='all').T)

## 2) Cleaning & Preprocessing
Steps:
- Parse dates
- Extract year_added
- Parse duration (minutes / seasons)
- Explode multi-valued columns: listed_in (genres), country, cast
- Deduplicate
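
The duration step can also be written without a row-wise function. A minimal vectorized sketch using `Series.str.extract`, assuming duration strings look like `'90 min'` or `'2 Seasons'`; the toy series and the variable names are illustrative only:

```python
import pandas as pd

durations = pd.Series(["90 min", "1 Season", "3 Seasons", None, "unknown"])

# One named group for the leading integer, one for the unit word.
parts = durations.str.extract(r"^\s*(?P<num>\d+)\s*(?P<unit>min|Seasons?)", expand=True)

# Rows that did not match the pattern end up as NaN in both columns.
duration_num = pd.to_numeric(parts["num"], errors="coerce")
# Normalize 'Season' / 'Seasons' to a single lowercase label.
duration_unit = parts["unit"].str.lower().str.replace("seasons", "season", regex=False)

print(pd.DataFrame({"raw": durations, "num": duration_num, "unit": duration_unit}))
```

Unmatched or missing values stay as NaN rather than raising, which keeps the later numeric analysis simple.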


In [None]:
# Make a copy for cleaning
dfc = df.copy()

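# 1) date parsing
# Strip stray whitespace first: padded values such as ' August 4, 2017' may otherwise be coerced to NaT
dfc['date_added'] = pd.to_datetime(dfc['date_added'].str.strip(), errors='coerce')
dfc['year_added'] = dfc['date_added'].dt.year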

# 2) normalize text fields (trim); .str.strip() leaves missing values as NaN,
#    so there is no need to round-trip through the string 'nan'
for col in ['title','director','cast','country','listed_in','rating']:
    if col in dfc.columns:
        dfc[col] = dfc[col].str.strip()

# 3) duration parsing
import re

def parse_duration(s):
    """Split strings such as '90 min' or '3 Seasons' into (number, unit)."""
    if pd.isna(s):
        return pd.NA, pd.NA
    text = str(s).strip()
    match = re.search(r'\d+', text)  # first run of digits in the string
    n = int(match.group()) if match else pd.NA
    lowered = text.lower()
    if 'min' in lowered:
        unit = 'min'
    elif 'season' in lowered:
        unit = 'season'
    else:
        unit = pd.NA
    return n, unit

dfc[['duration_num','duration_unit']] = pd.DataFrame(dfc['duration'].apply(parse_duration).tolist(), index=dfc.index)

# 4) deduplicate by title, type, release_year (conservative)
# Done before exploding so duplicate titles are not counted twice in the exploded frames.
if {'title','type','release_year'}.issubset(dfc.columns):
    before = dfc.shape[0]
    dfc = dfc.drop_duplicates(subset=['title','type','release_year'])
    after = dfc.shape[0]
    print(f'Deduplicated: {before-after} rows removed')
else:
    print('Dedup skipped: required columns missing')

# 5) fill missing multi-valued fields with '' so .str.split works on every row
for col in ['listed_in','country','cast']:
    if col in dfc.columns:
        dfc[col] = dfc[col].fillna('')

# Create exploded versions for counting analyses (one row per genre / country / cast member)
df_genre = dfc.assign(genre = dfc['listed_in'].str.split(', ')).explode('genre')
df_country = dfc.assign(country_split = dfc['country'].str.split(', ')).explode('country_split')
df_cast = dfc.assign(cast_member = dfc['cast'].str.split(', ')).explode('cast_member')

# Summary after cleaning
print('Cleaned dataframe shape:', dfc.shape)
dfc.head()
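
To see what the split-and-explode step does, here is a small sketch on toy data (the three rows are made up for illustration). A title listed under two countries contributes one row to each, and missing values become empty strings that should be dropped before counting:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "India", None],
})

# Split the comma-separated field into lists, then emit one row per list element.
exploded = (
    toy.assign(country_split=toy["country"].fillna("").str.split(", "))
       .explode("country_split")
)
# Missing countries turned into empty strings; remove them before counting.
exploded = exploded[exploded["country_split"] != ""]

country_counts = exploded["country_split"].value_counts()
print(country_counts)
print("Rows before:", len(toy), "rows after explode and filter:", len(exploded))
```

Because one title can appear in several exploded rows, counts from the exploded frames are counts of title-country (or title-genre) pairs, not of distinct titles.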

## 3) Exploratory Data Analysis (EDA)
We'll run focused EDA sections aligned with the problem statement.

### 3.1 Movie vs TV Show distribution

In [None]:
# Movie vs TV Show counts (matplotlib)
plt.figure(figsize=(6,4))
counts = dfc['type'].value_counts()
counts.plot(kind='bar')
plt.xlabel('Type')
plt.ylabel('Count')
plt.title('Distribution: Movies vs TV Shows')
plt.tight_layout()
plt.show()

# Also show raw counts
display(counts)

### 3.2 Content added per year (by year_added)

In [None]:
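# Content added per year (interactive plotly)
# Drop rows without a parsed date, then cast the year to int so the axis reads 2019 rather than 2019.0
year_counts = (
    dfc.dropna(subset=['year_added'])
       .astype({'year_added': 'int64'})
       .groupby(['year_added','type']).size().reset_index(name='count')
)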
fig = px.area(year_counts, x='year_added', y='count', color='type', title='Content added per year by type', labels={'year_added':'Year','count':'Number of titles'})
fig.show()
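
The area chart shows absolute counts. To compare growth between types, a share per year is often easier to read. A minimal sketch, assuming a frame shaped like `year_counts` (columns `year_added`, `type`, `count`); the numbers below are invented toy values, not results from the dataset:

```python
import pandas as pd

# Toy stand-in for year_counts; the values are illustrative only.
year_counts_toy = pd.DataFrame({
    "year_added": [2018, 2018, 2019, 2019],
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "count": [30, 10, 20, 20],
})

# One row per year, one column per type; years with no titles of a type get 0.
wide = year_counts_toy.pivot(index="year_added", columns="type", values="count").fillna(0)

# Fraction of each year's additions that were TV Shows.
tv_share = wide["TV Show"] / wide.sum(axis=1)
print(tv_share)
```

A rising `tv_share` would support a statement that TV Shows grew faster than Movies over those years.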

### 3.3 Top Genres

In [None]:
# Top genres (exploded); skip the empty strings that stand in for missing values
top_genres = df_genre['genre'].str.strip().replace({'': pd.NA}).dropna().value_counts().head(15)
top_genres_df = top_genres.reset_index()
top_genres_df.columns = ['genre','count']
# Matplotlib bar
plt.figure(figsize=(10,5))
plt.bar(top_genres_df['genre'], top_genres_df['count'])
plt.xticks(rotation=45, ha='right')
plt.ylabel('Count')
plt.title('Top 15 Genres')
plt.tight_layout()
plt.show()

display(top_genres_df)

### 3.4 Top Countries (content-producing)

In [None]:
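# Top countries (exploded); skip the empty strings that stand in for missing countries,
# otherwise '' shows up as one of the top "countries"
top_countries = df_country['country_split'].str.strip().replace({'': pd.NA}).dropna().value_counts().head(15)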
top_countries_df = top_countries.reset_index()
top_countries_df.columns = ['country','count']

plt.figure(figsize=(10,5))
plt.bar(top_countries_df['country'], top_countries_df['count'])
plt.xticks(rotation=45, ha='right')
plt.ylabel('Count')
plt.title('Top 15 Countries by number of titles')
plt.tight_layout()
plt.show()

display(top_countries_df)

### 3.5 Duration analysis (Movies)

In [None]:
# Duration distribution (movies) - numeric
movies = dfc[dfc['type']=='Movie'].copy()
movies['duration_num'] = pd.to_numeric(movies['duration_num'], errors='coerce')
plt.figure(figsize=(8,4))
plt.hist(movies['duration_num'].dropna(), bins=30)
plt.xlabel('Duration (minutes)')
plt.ylabel('Count')
plt.title('Distribution of Movie Durations')
plt.tight_layout()
plt.show()

# Basic stats
display(movies['duration_num'].describe())
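
Alongside `describe()`, the interquartile range gives a compact way to state the typical runtime band and flag unusual values. A sketch on made-up minutes, assuming the common 1.5 × IQR rule for outliers (a convention chosen here, not something the dataset prescribes):

```python
import pandas as pd

# Toy runtimes in minutes; illustrative values only.
minutes = pd.Series([60, 90, 95, 100, 105, 110, 300], name="duration_num")

q1, q3 = minutes.quantile(0.25), minutes.quantile(0.75)
iqr = q3 - q1
# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are treated as outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = minutes[(minutes < low) | (minutes > high)]
print(f"median={minutes.median()}, IQR={iqr}, bounds=({low}, {high})")
print("Outliers:", outliers.tolist())
```

On the real data, replace `minutes` with `movies['duration_num'].dropna()`.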

### 3.6 Ratings distribution

In [None]:
# Ratings distribution (top categories)
rating_counts = dfc['rating'].value_counts().head(20)
plt.figure(figsize=(10,4))
plt.bar(rating_counts.index, rating_counts.values)
plt.xticks(rotation=45, ha='right')
plt.ylabel('Count')
plt.title('Top Ratings by count')
plt.tight_layout()
plt.show()

display(rating_counts)

### 3.7 Top Directors & Cast members

In [None]:
# Top directors (simple count)
if 'director' in dfc.columns:
    top_directors = dfc['director'].replace({'': pd.NA}).dropna().str.split(', ').explode().value_counts().head(15)
    display(top_directors)
else:
    print('No director column')

# Top cast members
top_cast = df_cast['cast_member'].replace({'': pd.NA}).dropna().str.strip().value_counts().head(15)
display(top_cast)

## 4) Save cleaned data

In [None]:
# Save cleaned data to outputs/
os.makedirs('outputs', exist_ok=True)
cleaned_path = 'outputs/cleaned_netflix.csv'
dfc.to_csv(cleaned_path, index=False)
print('Cleaned data saved to', cleaned_path)
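
CSV files store dates as plain text, so a reload needs `parse_dates` to get datetimes back. A small round-trip sketch on a toy frame, written to a temporary directory so it does not touch `outputs/`; the file name `cleaned_toy.csv` is a placeholder:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B"],
    "date_added": pd.to_datetime(["2019-01-05", None]),
    "duration_num": [90.0, float("nan")],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_toy.csv")
    toy.to_csv(path, index=False)
    # Ask pandas to parse the date column again on load.
    reloaded = pd.read_csv(path, parse_dates=["date_added"])

print(reloaded.dtypes)
print("Same shape after reload:", reloaded.shape == toy.shape)
```

The same `parse_dates=['date_added']` argument applies when reading `outputs/cleaned_netflix.csv` in a later session.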

## 5) Insights & Recommendations

Add concise insights here once the notebook has run. The bullets below are templates; adapt each one to the counts and charts you actually see:

- **Type trend**: Inspect the area chart. If TV Shows grew faster than Movies, note the years in which the gap changed and possible reasons.
- **Genre focus**: If Drama, Comedy, and Documentary dominate the top-15 list, recommend content investment or localization strategies for those genres.
- **Geographic distribution**: If the US, India, and the UK are the top producers, recommend regional content expansion in countries where counts are low.
- **Duration**: If movie runtimes cluster in a narrow band (e.g., around 90–100 minutes), quote the median and quartiles from `describe()` as the typical runtime viewers can expect.

> After running the notebook, replace these example bullets with concrete numbers & charts from your outputs.
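
One way to collect the concrete numbers for those bullets is a small summary function. This is a sketch only: `headline_numbers` is a hypothetical helper, it assumes the cleaned columns `type`, `listed_in`, and `duration_num`, and the toy frame below is invented for illustration:

```python
import pandas as pd

def headline_numbers(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: a few figures to quote in the insight bullets."""
    type_share = frame["type"].value_counts(normalize=True)
    # One entry per (title, genre) pair, as in the exploded genre frame.
    genres = frame["listed_in"].dropna().str.split(", ").explode()
    movie_minutes = pd.to_numeric(
        frame.loc[frame["type"] == "Movie", "duration_num"], errors="coerce"
    )
    return {
        "movie_share": round(float(type_share.get("Movie", 0.0)), 3),
        "top_genre": genres.value_counts().idxmax(),
        "median_movie_minutes": float(movie_minutes.median()),
    }

# Toy data standing in for the cleaned frame.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "listed_in": ["Dramas, Comedies", "Dramas", "TV Dramas", "Documentaries"],
    "duration_num": [90, 110, 2, 100],
})
numbers = headline_numbers(toy)
print(numbers)
```

Calling `headline_numbers(dfc)` on the cleaned frame would give values to paste into the bullets in place of the placeholders.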

---

### Next steps (optional)
- Build an interactive Streamlit dashboard using `outputs/cleaned_netflix.csv`.
- Run clustering on top genres + numeric features to discover content segments.
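
For the clustering idea, the genre strings first need to become numeric columns. A minimal sketch of that preparation step using `Series.str.get_dummies`, on a made-up frame; it stops before any clustering algorithm, which is left to whichever library you choose:

```python
import pandas as pd

# Toy rows; illustrative values only.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Documentaries"],
    "duration_num": [90, 110, 60],
})

# One 0/1 indicator column per genre; a title can have several set to 1.
genre_flags = toy["listed_in"].str.get_dummies(sep=", ")

# Combine the genre indicators with a numeric feature into one matrix.
features = pd.concat([genre_flags, toy[["duration_num"]]], axis=1)
print(features)
```

Numeric features such as `duration_num` usually need scaling before they are mixed with 0/1 indicators in a distance-based clustering method.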
