# Pandas Student Notebook — Foundations Practice (6)  
## Dataset: Kaggle “Netflix Movies and TV Shows”

### Goal of this notebook
This notebook focuses on categorical data, time-derived features, and rate-based reasoning.
Students will practice cleaning messy text fields, deriving time features, and avoiding misleading counts.

Key analytical habits:
- Clean before grouping
- Prefer rates and proportions over raw counts
- Be explicit about time grain
- Distinguish catalog size from activity or popularity

File used:
- `netflix_titles.csv`


## 0. Setup + first inspection

Load `netflix_titles.csv` into a DataFrame called `df`.

Write as a comment:
- What does one row represent in this dataset? (grain)


In [2]:
import pandas as pd
import numpy as np
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("shivamb/netflix-shows")

df = pd.read_csv(os.path.join(path, 'netflix_titles.csv'))

# Grain:
# one row per Netflix title (movie or TV show)


## 1. Missing values and empty strings

1) Show missing values per column


In [3]:
df.isna().sum().sort_values(ascending=False)

director        2634
country          831
cast             825
date_added        10
rating             4
duration           3
show_id            0
type               0
title              0
release_year       0
listed_in          0
description        0
dtype: int64
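
The section title also mentions empty strings, which `isna()` does not detect. As a minimal sketch on a toy frame (the values below are hypothetical, not taken from `netflix_titles.csv`), blank or whitespace-only strings can be counted alongside real missing values:

```python
import pandas as pd

# Toy frame with hypothetical values: real missing values plus blank strings.
toy = pd.DataFrame({
    "director": ["A. Director", "", None, "   "],
    "country": ["India", "United States", " ", None],
})

# isna() only catches real missing values (None / NaN).
na_counts = toy.isna().sum()

# A "blank" is a non-missing value that is empty once whitespace is stripped.
blank_counts = toy.apply(lambda col: col.str.strip().eq("").sum())

print(pd.DataFrame({"n_missing": na_counts, "n_blank": blank_counts}))
```

The same `apply` could be run on the text columns of `df`; whether the real file contains any blanks is something to check rather than assume.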

## 2. Date parsing and derived features

1) Convert `date_added` to datetime.  
2) Create:
- `year_added`
- `month_added`
- `year_month_added`

3) Count rows where `date_added` is missing.

Write as a comment:
- Why is `date_added` not the same as release year?


In [None]:
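# Strip surrounding whitespace first: with errors="coerce", a padded string
# that fails to parse would silently become NaT.
df["date_added"] = pd.to_datetime(df["date_added"].str.strip(), errors="coerce")

df["year_added"] = df["date_added"].dt.year
df["month_added"] = df["date_added"].dt.month
# Note: astype(str) turns a missing period into the string "NaT",
# so test for missing dates on date_added, not on this column.
df["year_month_added"] = df["date_added"].dt.to_period("M").astype(str)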

missing_date_added = int(df["date_added"].isna().sum())
missing_date_added

# Comment:
# date_added is when Netflix added the title to its catalog; release_year is when it was produced/released.
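
Because `errors="coerce"` hides parse failures, it can help to separate values that were missing from the start from values that failed to parse. The sketch below uses toy strings and assumes the raw dates look like `September 25, 2021`; the explicit `format` is part of that assumption.

```python
import pandas as pd

# Toy values (hypothetical): one padded string, one unparseable string, one missing.
raw = pd.Series(["September 25, 2021", " August 4, 2017", "not a date", None])

# Assumed format: full month name, day, four-digit year.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# A value that existed in the raw data but is NaT after parsing was coerced.
failed = raw.notna() & parsed.isna()

print("missing in raw data:", int(raw.isna().sum()))
print("failed to parse:", int(failed.sum()))
print(raw[failed])
```

If `failed.sum()` is greater than zero on the real column, the listed raw strings show which formats the parser did not handle.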

## 3. Cleaning categorical text columns

Choose two columns among:
- `type`
- `rating`
- `country`
- `listed_in`

Tasks:
1) Standardize casing and strip whitespace.  
2) Show `value_counts()` before and after cleaning.

Write as a comment:
- How can unclean categories silently distort groupby results?


In [None]:
# Before cleaning (print both: a notebook cell only displays its last expression)
print(df["type"].value_counts(dropna=False))
print(df["rating"].value_counts(dropna=False).head(15))


In [None]:
# Clean 'type'
df["type_clean"] = df["type"].str.strip()

# Clean 'rating' (normalize casing + remove surrounding whitespace)
df["rating_clean"] = df["rating"].str.strip().str.upper()

# Clean 'country' (keep original, but make a cleaned string version for splitting later)
df["country_clean_str"] = df["country"].str.strip()
df[['type_clean','rating_clean', 'country_clean_str']].head()

In [None]:
# After cleaning (print both: a notebook cell only displays its last expression)
print(df["type_clean"].value_counts(dropna=False))
print(df["rating_clean"].value_counts(dropna=False).head(15))


# Unclean categories (extra spaces, inconsistent casing) create “fake” groups in groupby results,
# silently splitting what should be the same category into multiple buckets.
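
To make the "fake groups" effect concrete, here is a small sketch on toy data (the rating strings are hypothetical variants, not values observed in the file). The same two ratings appear as five groups before cleaning and two after:

```python
import pandas as pd

# Toy data: two real ratings written in five different ways.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4", "s5"],
    "rating": ["TV-MA", "tv-ma", " TV-MA", "PG", "PG "],
})

before = toy.groupby("rating")["show_id"].nunique()

# Same cleaning as in the notebook: strip whitespace, then normalize casing.
toy["rating_clean"] = toy["rating"].str.strip().str.upper()
after = toy.groupby("rating_clean")["show_id"].nunique()

print("groups before cleaning:", len(before))
print("groups after cleaning:", len(after))
print(after)
```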


## 4. Movies vs TV Shows: proportions, not counts

1) Compute raw counts of Movies vs TV Shows.  
2) Compute proportions (percentages).  
3) Show both in a small table.

Write as a comment:
- Why are proportions more informative than counts here?


In [None]:
counts = df["type_clean"].value_counts(dropna=False)
props = df["type_clean"].value_counts(normalize=True, dropna=False)

summary = pd.DataFrame({
    "count": counts,
    "share": props
})

summary

# Proportions are more informative because counts depend on total catalog size;
# proportions let you compare composition even if catalog size changes.
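
The point about changing catalog size can be illustrated with a toy example (the years and counts below are made up). Here `pd.crosstab` with `normalize="index"` gives the share of each type within each year:

```python
import pandas as pd

# Toy data: 4 titles added in 2016, 8 titles added in 2020 (hypothetical).
toy = pd.DataFrame({
    "year_added": [2016] * 4 + [2020] * 8,
    "type": ["Movie"] * 3 + ["TV Show"] * 1 + ["Movie"] * 5 + ["TV Show"] * 3,
})

counts_by_year = pd.crosstab(toy["year_added"], toy["type"])
# normalize="index" divides each row by its row total.
shares_by_year = pd.crosstab(toy["year_added"], toy["type"], normalize="index")

print(counts_by_year)
print(shares_by_year)
```

In this toy case the Movie count rises from 3 to 5 while the Movie share falls from 0.75 to 0.625, so the raw counts and the proportions point in opposite directions.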


## 5. Time trends: catalog growth

Compute number of titles added per year.

Then compute:
- cumulative number of titles over time

Write as a comment:
- Can this measure tell us something about viewing behavior?

In [None]:
titles_added_per_year = (
    df.groupby("year_added", dropna=False)["show_id"]
    .nunique()
    .reset_index(name="n_titles_added")
    .sort_values("year_added")
)

titles_added_per_year


In [None]:
# Cumulative catalog additions over time (based on date_added)
titles_added_per_year["cumulative_titles_added"] = titles_added_per_year["n_titles_added"].cumsum()
titles_added_per_year

# No. This measures catalog growth (titles added to Netflix), not viewing behavior or popularity,
# because it uses date_added and not any engagement/watch metrics.
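
On being explicit about time grain: the same additions can be counted per year or per month, and rows without a date can be set aside before accumulating. A minimal sketch on toy dates (all values hypothetical):

```python
import pandas as pd

# Toy data: four dated titles and one title with no date_added.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4", "s5"],
    "date_added": pd.to_datetime(
        ["2019-01-15", "2019-11-01", "2020-01-20", "2020-01-25", None]
    ),
})

# Drop undated rows so they do not end up in a NaN group or in the running total.
dated = toy.dropna(subset=["date_added"])

# Yearly grain vs monthly grain on the same rows.
per_year = dated.groupby(dated["date_added"].dt.year)["show_id"].nunique()
per_month = dated.groupby(dated["date_added"].dt.to_period("M"))["show_id"].nunique()

cumulative = per_year.cumsum()

print(per_year)
print(per_month)
print(cumulative)
```

Whether undated titles should be dropped or reported as their own group is a design choice; the sketch drops them only to keep the cumulative curve tied to known dates.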


## 6. Ratings normalization

For each rating:
- compute number of titles
- compute share of total titles

Filter out ratings with fewer than 20 titles.

Write as a comment:
- Why does filtering on minimum volume matter?


In [None]:
rating_counts = (
    df.groupby("rating_clean")["show_id"]
    .nunique()
    .rename("n_titles")
    .reset_index()
)

total_titles = df["show_id"].nunique()
rating_counts["share_of_titles"] = rating_counts["n_titles"] / total_titles

rating_counts = rating_counts.sort_values("n_titles", ascending=False)
rating_counts.head(15)


In [None]:
# Filter out low-volume ratings (< 20)
rating_counts_20 = rating_counts[rating_counts["n_titles"] >= 20].copy()
rating_counts_20


# Minimum-volume filtering matters because small-n categories produce unstable shares and noisy comparisons.
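
One way to see why small categories are unstable is to ask how much a category's count would change, in relative terms, if a single title were added. The counts below are toy numbers chosen for illustration, not the real rating counts:

```python
import pandas as pd

# Toy counts (hypothetical): two large categories and one tiny one.
n_titles = pd.Series({"TV-MA": 1000, "PG": 500, "NC-17": 4}, name="n_titles")

# Relative change in the count if one more title were added to the category.
relative_change = 1 / n_titles

# Same threshold as in the notebook.
min_titles = 20
kept = n_titles[n_titles >= min_titles]

print(relative_change)
print(kept)
```

With these numbers, one extra title moves the smallest category by 25% but the largest by only 0.1%, which is the kind of noise the threshold is meant to exclude.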


## 7. Data quality checks

Create boolean `suspicious_title` if:
- duration is missing
- OR rating is missing
- OR country is missing

Show:
- number of suspicious rows
- sample rows


In [None]:
df["suspicious_title"] = (
    df["duration"].isna()
    | df["rating_clean"].isna()
    | df["country_clean_str"].isna()
)

print("suspicious rows:", int(df["suspicious_title"].sum()))

df.loc[df["suspicious_title"], ["show_id", "title", "duration", "rating", "country"]].head(10)
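
A single boolean flag does not say which check fired. As a sketch on a toy frame (hypothetical rows), the three conditions can be kept as separate columns and summarized per reason:

```python
import pandas as pd

# Toy rows with hypothetical values; None marks a missing field.
toy = pd.DataFrame({
    "duration": ["90 min", None, "2 Seasons", "100 min"],
    "rating": ["PG", "TV-MA", None, "R"],
    "country": ["India", None, None, "Japan"],
})

# One boolean column per check.
checks = toy[["duration", "rating", "country"]].isna()

reason_counts = checks.sum()        # rows failing each individual check
n_reasons = checks.sum(axis=1)      # number of failed checks per row
suspicious = checks.any(axis=1)     # equivalent to OR-ing the three conditions

print(reason_counts)
print("suspicious rows:", int(suspicious.sum()))
```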


## 8. Capstone: analysis-ready table

Create `analysis_df` with:
- `type`
- `year_added`
- `rating`
- primary country (first listed country)
- number of genres
- release year
- suspicious_title (as int)

Requirements:
- No missing values in engineered columns
- Show `analysis_df.head()` and `analysis_df.isna().sum()`

Write as a comment:
- Which column required the most care to construct, and why?


In [None]:
# Primary country = first listed country
primary_country = (
    df["country_clean_str"]
    .fillna("Unknown")
    .str.split(",")
    .str[0]
    .str.strip()
)

# Number of genres
n_genres = (
    df["listed_in"]
    .fillna("Unknown")
    .str.split(",")
    .apply(len)
)

# Vectorized alternative (avoids the Python-level apply):

# n_genres = (
#     df["listed_in"]
#     .fillna("Unknown")
#     .str.count(",")
#     .add(1)
# )


analysis_df = pd.DataFrame({
    "type": df["type_clean"].fillna("Unknown"),
    "year_added": df["year_added"].fillna(0).astype(int),   # 0 is a sentinel for unknown date_added
    "rating": df["rating_clean"].fillna("UNKNOWN"),
    "primary_country": primary_country.fillna("Unknown"),
    "n_genres": n_genres.fillna(0).astype(int),
    "release_year": df["release_year"].fillna(0).astype(int),
    "suspicious_title": df["suspicious_title"].fillna(False).astype(int),
})

analysis_df.head()


In [None]:
# Confirm that no engineered column contains missing values
analysis_df.isna().sum()
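
A caveat on filling `year_added` with `0`: the column then has no missing values, but the `0` is a sentinel rather than a real year, so numeric summaries should exclude it. A small sketch with toy values (hypothetical):

```python
import pandas as pd

# Toy column: three real years and one 0 standing in for "unknown".
toy = pd.DataFrame({"year_added": [2019, 2020, 0, 2021]})

# A naive mean treats the sentinel as a real year.
naive_mean = toy["year_added"].mean()

# Excluding the sentinel gives the mean of the known years only.
known = toy.loc[toy["year_added"] != 0, "year_added"]
clean_mean = known.mean()

print("mean including sentinel:", naive_mean)
print("mean of known years:", clean_mean)
```

The same masking applies to any `groupby` or trend computed from `analysis_df["year_added"]`.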
