# Pandas Student Notebook — Foundations Practice (6)  
## Dataset: Kaggle “Netflix Movies and TV Shows”

### Goal of this notebook
This notebook focuses on categorical data, time-derived features, and rate-based reasoning.
Students will practice cleaning messy text fields, deriving time features, and avoiding misleading counts.

Key analytical habits:
- Clean before grouping
- Prefer rates and proportions over raw counts
- Be explicit about time grain
- Distinguish catalog size from activity or popularity

File used:
- `netflix_titles.csv`


## 0. Setup + first inspection

Load `netflix_titles.csv` into a DataFrame called `df`.

Write as a comment:
- What does one row represent in this dataset? (grain)
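
A minimal sketch of the loading and first-inspection step. To stay runnable without the file, it reads a two-row in-memory stand-in whose titles are made up for illustration; with the real data the call would be `pd.read_csv("netflix_titles.csv")`, assuming the file sits in the working directory.

```python
import io
import pandas as pd

# Stand-in for netflix_titles.csv: two invented rows, real column names.
sample_csv = io.StringIO('''show_id,type,title,date_added,release_year
s1,Movie,Example Movie,"September 25, 2021",2020
s2,TV Show,Example Show,"September 24, 2021",2021
''')

# With the real file: df = pd.read_csv("netflix_titles.csv")
df = pd.read_csv(sample_csv)

print(df.shape)    # (rows, columns)
print(df.dtypes)   # date_added is still a plain string at this point
print(df.head())

# Grain check: if show_id is unique, that supports reading one row as one title.
print(df["show_id"].is_unique)
```
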


## 1. Missing values and empty strings

1) Show missing values per column


| column | missing |
|---|---|
| director | 2634 |
| country | 831 |
| cast | 825 |
| date_added | 10 |
| rating | 4 |
| duration | 3 |
| show_id | 0 |
| type | 0 |
| title | 0 |
| release_year | 0 |
| listed_in | 0 |
| description | 0 |
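
`isna()` only sees true missing values, so empty or whitespace-only strings go uncounted unless they are converted first. The sketch below shows one way to do that on a toy frame (invented values); the helper name `blank_to_na` is hypothetical, not part of pandas.

```python
import pandas as pd

# Toy data: a mix of real missing values and blank strings.
toy = pd.DataFrame({
    "director": ["A. Director", None, "  ", ""],
    "country": ["India", "", None, "Japan"],
    "title": ["t1", "t2", "t3", "t4"],
})

def blank_to_na(col: pd.Series) -> pd.Series:
    """Replace empty or whitespace-only strings with a missing value."""
    return col.mask(col.str.strip() == "")

cleaned = toy.apply(blank_to_na)

# Missing values per column, largest first, before and after the conversion.
missing_before = toy.isna().sum().sort_values(ascending=False)
missing_after = cleaned.isna().sum().sort_values(ascending=False)

print(missing_before)
print(missing_after)
```

Comparing the two counts shows how many "hidden" missing values each column carried as blank text.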

## 2. Date parsing and derived features

1) Convert `date_added` to datetime.  
2) Create:
- `year_added`
- `month_added`
- `year_month_added`

3) Count rows where `date_added` is missing.

Write as a comment:
- Why is `date_added` not the same as release year?


Rows with missing `date_added` after conversion: 98
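
The count reported for this step (98) is higher than the 10 missing `date_added` values found in section 1, which suggests that some non-empty strings did not parse. One plausible cause, an assumption the notebook does not confirm, is stray whitespace in the raw strings. The sketch below strips whitespace before parsing and derives the three features on toy data; the format string `"%B %d, %Y"` assumes dates written like `September 25, 2021`.

```python
import pandas as pd

# Toy data: one value has a leading space, one is missing.
toy = pd.DataFrame({
    "date_added": ["September 25, 2021", " August 4, 2017", None, "January 1, 2020"],
})

# Strip whitespace, then parse with an explicit format.
# Values that still fail become NaT instead of raising.
toy["date_added"] = pd.to_datetime(
    toy["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Nullable Int64 keeps years and months as integers even when some are missing.
toy["year_added"] = toy["date_added"].dt.year.astype("Int64")
toy["month_added"] = toy["date_added"].dt.month.astype("Int64")
toy["year_month_added"] = toy["date_added"].dt.to_period("M")

n_missing = int(toy["date_added"].isna().sum())
print(toy)
print("missing date_added:", n_missing)
```

Counting missing values both before and after the conversion makes parse failures visible instead of letting them blend in with values that were missing all along.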

## 3. Cleaning categorical text columns

Choose two columns among:
- `type`
- `rating`
- `country`
- `listed_in`

Tasks:
1) Standardize casing and strip whitespace.  
2) Show `value_counts()` before and after cleaning.

Write as a comment:
- How can unclean categories silently distort groupby results?


| rating | count |
|---|---|
| TV-MA | 3207 |
| TV-14 | 2160 |
| TV-PG | 863 |
| R | 799 |
| PG-13 | 490 |
| TV-Y7 | 334 |
| TV-Y | 307 |
| PG | 287 |
| TV-G | 220 |
| NR | 80 |
| G | 41 |
| TV-Y7-FV | 6 |
| NaN | 4 |
| NC-17 | 3 |
| UR | 3 |

|  | type_clean | rating_clean | country_clean_str |
|---|---|---|---|
| 0 | Movie | PG-13 | United States |
| 1 | TV Show | TV-MA | South Africa |
| 2 | TV Show | TV-MA |  |
| 3 | TV Show | TV-MA |  |
| 4 | TV Show | TV-MA | India |


| rating_clean | count |
|---|---|
| TV-MA | 3207 |
| TV-14 | 2160 |
| TV-PG | 863 |
| R | 799 |
| PG-13 | 490 |
| TV-Y7 | 334 |
| TV-Y | 307 |
| PG | 287 |
| TV-G | 220 |
| NR | 80 |
| G | 41 |
| TV-Y7-FV | 6 |
| NaN | 4 |
| NC-17 | 3 |
| UR | 3 |
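
The `rating` counts above are identical before and after cleaning, so that column appears to have been consistent already. To show what the cleaning step protects against, the sketch below uses a toy `type` column with invented casing and whitespace variants. The mapping back to display labels is one possible choice; `str.title()` would turn `TV Show` into `Tv Show`, which is why an explicit mapping is used here.

```python
import pandas as pd

# Toy data: the same two categories written five different ways.
toy = pd.DataFrame({"type": ["Movie", "movie ", " TV Show", "TV show", "MOVIE"]})

before = toy["type"].value_counts(dropna=False)

# Strip whitespace, lower-case, then map to canonical display labels.
# Values not in the mapping are left as their lower-cased form.
toy["type_clean"] = (
    toy["type"]
    .str.strip()
    .str.lower()
    .replace({"movie": "Movie", "tv show": "TV Show"})
)

after = toy["type_clean"].value_counts(dropna=False)

print(before)  # five "categories"
print(after)   # two categories
```

Without the cleaning step, a `groupby` on `type` would report five groups and split each real category's count across its variants.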

## 4. Movies vs TV Shows: proportions, not counts

1) Compute raw counts of Movies vs TV Shows.  
2) Compute proportions (percentages).  
3) Show both in a small table.

Write as a comment:
- Why are proportions more informative than counts here?


| type_clean | count | share |
|---|---|---|
| Movie | 6131 | 0.696151 |
| TV Show | 2676 | 0.303849 |
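
The shares in the table are each count divided by the total, for example 6131 / (6131 + 2676) ≈ 0.696. A small sketch of building such a count-and-share table with `value_counts`, on toy data:

```python
import pandas as pd

# Toy data standing in for the cleaned type column.
toy = pd.DataFrame({"type_clean": ["Movie", "Movie", "TV Show", "Movie"]})

counts = toy["type_clean"].value_counts()
shares = toy["type_clean"].value_counts(normalize=True)

# Both Series share the same index, so they align into one table.
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```
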


## 5. Time trends: catalog growth

Compute number of titles added per year.

Then compute:
- cumulative number of titles over time

Write as a comment:
- Can this measure tell us something about viewing behavior?

|  | year_added | n_titles_added |
|---|---|---|
| 0 | 2008.0 | 2 |
| 1 | 2009.0 | 2 |
| 2 | 2010.0 | 1 |
| 3 | 2011.0 | 13 |
| 4 | 2012.0 | 3 |
| 5 | 2013.0 | 10 |
| 6 | 2014.0 | 23 |
| 7 | 2015.0 | 73 |
| 8 | 2016.0 | 418 |
| 9 | 2017.0 | 1164 |


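|  | year_added | n_titles_added | cumulative_titles_added |
|---|---|---|---|
| 0 | 2008.0 | 2 | 2 |
| 1 | 2009.0 | 2 | 4 |
| 2 | 2010.0 | 1 | 5 |
| 3 | 2011.0 | 13 | 18 |
| 4 | 2012.0 | 3 | 21 |
| 5 | 2013.0 | 10 | 31 |
| 6 | 2014.0 | 23 | 54 |
| 7 | 2015.0 | 73 | 127 |
| 8 | 2016.0 | 418 | 545 |
| 9 | 2017.0 | 1164 | 1709 |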
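
The cumulative column is a running sum of the yearly counts (2, 2 + 2 = 4, 4 + 1 = 5, and so on). One way to compute both columns is sketched below on toy data. Dropping rows without a year and casting to integer is an assumption made here; it also avoids the `2008.0`-style float years that appear when missing values are present.

```python
import pandas as pd

# Toy data: one row has no year_added.
toy = pd.DataFrame({"year_added": [2015, 2016, 2016, 2017, 2017, 2017, None]})

per_year = (
    toy.dropna(subset=["year_added"])      # rows without a year cannot be placed
       .astype({"year_added": int})        # integer years instead of floats
       .groupby("year_added")
       .size()
       .reset_index(name="n_titles_added")
       .sort_values("year_added")          # cumsum depends on chronological order
)

per_year["cumulative_titles_added"] = per_year["n_titles_added"].cumsum()
print(per_year)
```
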


## 6. Ratings normalization

For each rating:
- compute number of titles
- compute share of total titles

Filter out ratings with fewer than 20 titles.

Write as a comment:
- Why does filtering on minimum volume matter?


|  | rating_clean | n_titles | share_of_titles |
|---|---|---|---|
| 11 | TV-MA | 3207 | 0.364142 |
| 9 | TV-14 | 2160 | 0.245259 |
| 12 | TV-PG | 863 | 0.09799 |
| 8 | R | 799 | 0.090723 |
| 7 | PG-13 | 490 | 0.055638 |
| 14 | TV-Y7 | 334 | 0.037924 |
| 13 | TV-Y | 307 | 0.034859 |
| 6 | PG | 287 | 0.032588 |
| 10 | TV-G | 220 | 0.02498 |
| 5 | NR | 80 | 0.009084 |


|  | rating_clean | n_titles | share_of_titles |
|---|---|---|---|
| 11 | TV-MA | 3207 | 0.364142 |
| 9 | TV-14 | 2160 | 0.245259 |
| 12 | TV-PG | 863 | 0.09799 |
| 8 | R | 799 | 0.090723 |
| 7 | PG-13 | 490 | 0.055638 |
| 14 | TV-Y7 | 334 | 0.037924 |
| 13 | TV-Y | 307 | 0.034859 |
| 6 | PG | 287 | 0.032588 |
| 10 | TV-G | 220 | 0.02498 |
| 5 | NR | 80 | 0.009084 |
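
A sketch of the count, share, and filter steps on toy data. The threshold name `MIN_TITLES` is hypothetical and set to 2 so the ten-row example has something to filter; the exercise uses 20. Shares are computed before filtering, so they stay shares of the full catalog, which matches the table above (3207 / 8807 ≈ 0.364).

```python
import pandas as pd

# Toy data: two common ratings and two rare ones.
toy = pd.DataFrame(
    {"rating_clean": ["TV-MA"] * 5 + ["TV-14"] * 3 + ["NC-17", "UR"]}
)

MIN_TITLES = 2  # the exercise uses 20; lowered for the toy data

summary = (
    toy["rating_clean"]
    .value_counts()
    .rename_axis("rating_clean")
    .reset_index(name="n_titles")
)

# Share of all titles, computed before any rows are filtered out.
summary["share_of_titles"] = summary["n_titles"] / len(toy)

filtered = summary[summary["n_titles"] >= MIN_TITLES]
print(filtered)
```
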


## 7. Data quality checks

Create boolean `suspicious_title` if:
- duration is missing
- OR rating is missing
- OR country is missing

Show:
- number of suspicious rows
- sample rows


suspicious rows: 837


|  | show_id | title | duration | rating | country |
|---|---|---|---|---|---|
| 2 | s3 | Ganglands | 1 Season | TV-MA |  |
| 3 | s4 | Jailbirds New Orleans | 1 Season | TV-MA |  |
| 5 | s6 | Midnight Mass | 1 Season | TV-MA |  |
| 6 | s7 | My Little Pony: A New Generation | 91 min | PG |  |
| 10 | s11 | Vendetta: Truth, Lies and The Mafia | 1 Season | TV-MA |  |
| 11 | s12 | Bangkok Breaking | 1 Season | TV-MA |  |
| 13 | s14 | Confessions of an Invisible Girl | 91 min | TV-PG |  |
| 14 | s15 | Crime Stories: India Detectives | 1 Season | TV-MA |  |
| 16 | s17 | Europe's Most Dangerous Man: Otto Skorzeny in ... | 67 min | TV-MA |  |
| 18 | s19 | Intrusion | 94 min | TV-14 |  |
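
The reported 837 suspicious rows is slightly below the sum of the per-column missing counts from section 1 (831 + 4 + 3 = 838), which is consistent with an OR condition counting a row once even when more than one field is missing. A sketch of the flag on toy data with invented values:

```python
import pandas as pd

# Toy data: row 1 misses duration, row 2 misses both rating and country.
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3", "t4"],
    "duration": ["90 min", None, "1 Season", "2 Seasons"],
    "rating": ["PG", "R", None, "TV-MA"],
    "country": ["India", "Japan", None, "Spain"],
})

check_cols = ["duration", "rating", "country"]

# True if ANY of the checked columns is missing in that row.
toy["suspicious_title"] = toy[check_cols].isna().any(axis=1)

n_suspicious = int(toy["suspicious_title"].sum())
print("suspicious rows:", n_suspicious)
print(toy[toy["suspicious_title"]])
```
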


## 8. Capstone: analysis-ready table

Create `analysis_df` with:
- `type`
- `year_added`
- `rating`
- primary country (first listed country)
- number of genres
- release year
- suspicious_title (as int)

Requirements:
- No missing values in engineered columns
- Show `analysis_df.head()` and `analysis_df.isna().sum()`



|  | type | year_added | rating | primary_country | n_genres | release_year | suspicious_title |
|---|---|---|---|---|---|---|---|
| 0 | Movie | 2021 | PG-13 | United States | 1 | 2020 | 0 |
| 1 | TV Show | 2021 | TV-MA | South Africa | 3 | 2021 | 0 |
| 2 | TV Show | 2021 | TV-MA | Unknown | 3 | 2021 | 1 |
| 3 | TV Show | 2021 | TV-MA | Unknown | 2 | 2021 | 1 |
| 4 | TV Show | 2021 | TV-MA | India | 3 | 2021 | 0 |
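
One possible way to assemble such a table is sketched below on toy data with invented values. Several choices here are assumptions rather than the notebook's confirmed method: missing countries and ratings are filled with `"Unknown"` (the sample output does show `Unknown` for country), rows without `year_added` are dropped, and genres are counted by splitting `listed_in` on commas.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "year_added": pd.array([2021, 2021, None], dtype="Int64"),
    "rating": ["PG-13", "TV-MA", None],
    "country": ["United States, India", None, "India"],
    "listed_in": ["Documentaries", "Crime TV Shows, TV Dramas, TV Mysteries", "Kids' TV"],
    "release_year": [2020, 2021, 2019],
    "duration": ["90 min", "1 Season", "2 Seasons"],
})

# Flag computed on the raw columns, before any filling.
suspicious = toy[["duration", "rating", "country"]].isna().any(axis=1)

analysis_df = pd.DataFrame({
    "type": toy["type"],
    "year_added": toy["year_added"],
    "rating": toy["rating"].fillna("Unknown"),
    # First listed country; missing countries become "Unknown".
    "primary_country": toy["country"].str.split(",").str[0].str.strip().fillna("Unknown"),
    # Number of comma-separated genres.
    "n_genres": toy["listed_in"].str.split(",").str.len(),
    "release_year": toy["release_year"],
    "suspicious_title": suspicious.astype(int),
})

# Assumption: titles without an add date are excluded from the analysis table.
analysis_df = analysis_df.dropna(subset=["year_added"]).reset_index(drop=True)

print(analysis_df.head())
print(analysis_df.isna().sum())
```

Computing `suspicious_title` before filling keeps a record of which rows had gaps, even though the final table itself contains no missing values.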
