# Assignment 1: Analysis of the Netflix Dataset

***Dataset:*** "Netflix Movies and TV Shows" on Kaggle (`netflix_titles.csv`)  
Dataset page: https://www.kaggle.com/datasets/shivamb/netflix-shows

This notebook analyzes the Netflix Movies and TV Shows dataset to explore trends in the streaming service's content library. Specifically, it examines how the number of titles added to Netflix has changed over time and compares movies and TV shows by length and content rating.

## Data Dictionary
The Netflix titles dataset contains the following columns:
- `show_id`: Unique identifier for each title
- `type`: Indicates whether the title is a Movie or TV Show
- `title`: Name of the movie or TV show
- `director`: Director(s) of the title
- `cast`: Main cast members
- `country`: Country or countries of production
- `date_added`: Date the title was added to Netflix
- `release_year`: Year the title was originally released
- `rating`: Content rating (e.g., PG, PG-13, TV-MA)
- `duration`: Runtime in minutes for Movies or number of seasons for TV Shows
- `listed_in`: Genre(s) or category labels
- `description`: Brief description of the title

### Columns Used in This Analysis
This analysis focuses on the following columns:
- `type`
- `date_added`
- `release_year`
- `rating`
- `duration`

## Queries
1. How many titles were added to Netflix each year?
2. Do movies and TV shows differ in typical length (minutes for movies, seasons for TV shows) and in the distribution of content ratings?

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("../data/netflix/netflix_titles.csv")
df.head()

----
## 2. Data Cleaning 

In [None]:
def clean_duration(row):
    """
    Extracts the numeric value from the duration column.
    For Movies, returns the number of minutes (e.g., '90 min' -> 90).
    For TV Shows, returns the number of seasons (e.g., '2 Seasons' -> 2).
    Returns NaN if the duration is missing or cannot be parsed.
    """
    duration = row["duration"]
    if pd.isna(duration):
        return np.nan
    # Split the string and take the first part (the number)
    parts = str(duration).split(" ")
    try:
        return int(parts[0])
    except ValueError:
        return np.nan
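
As a quick sanity check, the same parsing rule can be written in vectorized form and run on a few hand-made values. This is a minimal sketch rather than part of the original analysis: the `sample` frame and the `parse_duration` helper are hypothetical, and the regex assumes the number always leads the string.

```python
import pandas as pd

def parse_duration(series):
    """Vectorized counterpart of clean_duration: leading integer, or NaN."""
    # Pull out the leading run of digits; missing or non-matching values become NaN
    digits = series.str.extract(r"^\s*(\d+)", expand=False)
    # Convert to float so NaN can represent missing values
    return pd.to_numeric(digits, errors="coerce").astype("float64")

# Hypothetical sample values covering movies, TV shows, and bad input
sample = pd.DataFrame(
    {"duration": ["90 min", "2 Seasons", "1 Season", None, "unknown"]}
)
sample["duration_num"] = parse_duration(sample["duration"])
print(sample)
```

On the full dataset, comparing this column with the output of `clean_duration` would be one way to confirm that both approaches agree.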

In [None]:
# Drop duplicate rows if any
print("Duplicates found:", df.duplicated().sum())
df = df.drop_duplicates()

# Drop rows missing critical columns for our analysis
# (.copy() makes the result an independent DataFrame before new columns are added below)
df = df.dropna(subset=["title", "type", "duration"]).copy()

# Convert date_added to datetime and extract year
# (strip stray whitespace first; values that still cannot be parsed become NaT)
df["date_added"] = pd.to_datetime(df["date_added"].str.strip(), errors="coerce")
df["year_added"] = df["date_added"].dt.year

# Apply the clean_duration function to extract numeric duration
df["duration_num"] = df.apply(clean_duration, axis=1)

# Fill missing ratings with "Unknown" so they are not lost
df["rating"] = df["rating"].fillna("Unknown")

# Display cleaned dataset info
print("\nCleaned dataset shape:", df.shape)
print("\nMissing values after cleaning:")
print(df[["type", "date_added", "year_added", "duration", "duration_num", "rating"]].isna().sum())
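
Because `errors="coerce"` silently turns unparseable `date_added` values into `NaT`, it can be worth counting how many rows were coerced. The sketch below uses made-up values: the `raw` series is hypothetical, and the explicit `"%B %d, %Y"` format is an assumption about how dates are written in the file.

```python
import pandas as pd

# Hypothetical values: a clean date, one with a leading space, a missing one, and junk
raw = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

# Strip stray whitespace first, then coerce anything unparseable to NaT
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Rows that had a value but failed to parse (rows missing from the start are excluded)
failed = parsed.isna() & raw.notna()
print("Unparseable dates:", int(failed.sum()))
print(parsed.dt.year.tolist())
```

A non-zero count on the real data would suggest inspecting those rows before relying on `year_added`.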

**Note:** Missing `rating` values were filled with the placeholder "Unknown". They are treated as a separate category and may be excluded where appropriate, for example from the rating comparison in Query 2. Rows missing `duration` were dropped because duration is essential to Query 2.
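
One way to exclude the placeholder category when comparing ratings is sketched below. It is illustrative only: the `toy` frame is hypothetical data, and using `pd.crosstab` with row-wise normalization is one possible approach, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "rating": ["PG-13", "TV-MA", "Unknown", "TV-MA", "TV-14", "TV-MA"],
})

# Drop the placeholder category before comparing rating distributions
known = toy[toy["rating"] != "Unknown"]

# Share of each rating within each type (each row sums to 1)
rating_share = pd.crosstab(known["type"], known["rating"], normalize="index")
print(rating_share.round(2))
```

Normalizing within each type makes the two distributions comparable even though the dataset may contain different numbers of movies and TV shows.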

---