# Netflix Titles Data Cleaning Project

## Overview
This project demonstrates data cleaning techniques using a dataset of Netflix titles (`netflix_titles.csv`). The goal is to preprocess the data by handling missing values, splitting columns, and extracting meaningful features for further analysis. This portfolio project showcases proficiency in Python, Pandas, and data wrangling.

## Objectives
- Identify and quantify missing data.
- Handle missing values appropriately.
- Transform and extract features from text-based columns.
- Prepare the dataset for downstream analysis or visualization.

## Tools Used
- **Python Libraries**: Pandas, NumPy, SciPy, Seaborn, Babel, `re` (regular expressions)
- **Dataset**: `netflix_titles.csv` (sourced locally)

In [None]:
import pandas as pd
import numpy as np
import babel as bl
import scipy as sp
import seaborn as sns
import re
from scipy import stats
from babel import numbers

## Step 1: Load the Dataset
We begin by loading the Netflix titles dataset from a CSV file and inspecting the first row to understand its structure.

In [None]:
df = pd.read_csv('/Users/brtelfer/Documents/Python_Data_Projects/*17_Data_Cleaning/netflix_titles.csv')
df.head(1)
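
Beyond the first row, the shape and column dtypes give a quick picture of the structure. The following is a minimal sketch that reads a small in-memory CSV instead of the real file; the two rows and the `title` values are hypothetical and only stand in for `netflix_titles.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row extract (illustrative values, not taken from the real file).
csv_text = io.StringIO(
    "type,title,director,rating,duration,date_added\n"
    'Movie,Example Film,Jane Doe,PG,90 min,"September 9, 2019"\n'
    'TV Show,Example Series,,TV-MA,1 Season,"October 10, 2018"\n'
)

preview = pd.read_csv(csv_text)

print(preview.shape)    # (rows, columns)
print(preview.dtypes)   # text columns load as object/string dtype
print(preview.head(1))  # first row only
```

The empty `director` field in the second row is read as a missing value, which is what the next step counts.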

## Step 2: Assess Missing Data
To understand the quality of the dataset, we calculate the number of missing values in each column and sort them in descending order.

In [None]:
df.isnull().sum().sort_values(ascending=False)

### Calculate Missing Data as Percentages
Next, we compute the percentage of missing values for each column to better assess their impact.

In [None]:
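# Fraction of missing values per column (0.0 to 1.0)
col_lst = df.columns.to_list()  # flat list of column names, not a nested list
results = []
for column in col_lst:
    results.append(df[column].isnull().mean())
print(dict(zip(col_lst, results)))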

In [None]:
results = [round(i * 100, 2) for i in results]
results
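
The same percentages can be computed without an explicit loop, which also keeps the column names attached to the values. This is a sketch on a hypothetical toy frame (`toy` and its values are assumptions for illustration), not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Netflix data; values are illustrative only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["PG", "R", np.nan, "PG"],
})

# isnull() yields a boolean frame; its column-wise mean is the missing fraction.
missing_pct = (toy.isnull().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

Because the result is a Series indexed by column name, it can be sorted or filtered directly, for example to list only columns above a chosen threshold.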

## Step 3: Handle Missing Directors
The `director` column has a significant amount of missing data (approximately 31.58%). We explore options to handle this, such as dropping rows with missing directors.

In [None]:
df['director'].isnull()

In [None]:
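# Three equivalent ways to drop rows where 'director' is missing.
# Each returns a new DataFrame; `df` itself is left unchanged.
no_director = df[df['director'].isnull()].index
df.drop(no_director, axis=0).isnull().sum()  # Index-based drop, then re-check missing counts
df[~(df['director'].isnull())]  # Alternative method: boolean mask
df.dropna(subset=['director'])  # Preferred method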
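
To make the comparison concrete, the sketch below checks on a hypothetical three-row frame that the three approaches return the same rows, and shows one way to keep the result by assigning it to a new name. The names `toy` and `cleaned` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing director.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", np.nan, "Y"],
})

by_index = toy.drop(toy[toy["director"].isnull()].index, axis=0)
by_mask = toy[toy["director"].notnull()]
by_dropna = toy.dropna(subset=["director"])

# All three produce the same two rows.
print(by_index.equals(by_mask) and by_mask.equals(by_dropna))

# None of the calls changes `toy`; assign the result to keep it.
cleaned = toy.dropna(subset=["director"]).reset_index(drop=True)
print(len(toy), len(cleaned))
```

Whether dropping is appropriate depends on the analysis: it removes every row without a director, so it also discards the other fields in those rows.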

## Step 4: Impute Missing Ratings
For columns with fewer missing values, like `rating`, we impute missing entries with the most frequent (mode) value.

In [None]:
# mode() returns a Series because ties are possible; take the first value instead of joining them
df.fillna({'rating': df['rating'].mode()[0]}).head(3)
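
As a small worked example of mode imputation, the sketch below fills a hypothetical `rating` column and writes the result back to the column so the change persists. The toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with two missing entries.
toy = pd.DataFrame({"rating": ["TV-MA", "PG", np.nan, "TV-MA", np.nan]})

# mode() ignores missing values by default; [0] picks the first mode if there is a tie.
most_common = toy["rating"].mode()[0]

# Assign back to the column so the imputed values are kept.
toy["rating"] = toy["rating"].fillna(most_common)
print(toy["rating"].tolist())
```

Mode imputation is reasonable when only a few values are missing; with many gaps it would inflate the most common category.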

## Step 5: Split Duration Column
The `duration` column contains both a number and a unit (e.g., '90 min' or '1 Season'). We split it into two columns: `duration num` (the number, still stored as a string at this stage) and `duration type` (the unit).

In [None]:
df['duration num'] = df['duration'].str.split(' ').str[0]
df['duration type'] = df['duration'].str.split(' ').str[1]
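
A variant of the split, sketched on hypothetical values, performs it once with `expand=True` and converts the numeric part with `pd.to_numeric` so it can be used in calculations. The toy frame and the intermediate name `parts` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical durations, including one missing value.
toy = pd.DataFrame({"duration": ["90 min", "1 Season", "3 Seasons", None]})

# Split once on the first space; expand=True returns one column per piece.
parts = toy["duration"].str.split(" ", n=1, expand=True)

# Convert the number to a numeric dtype; the missing row becomes NaN.
toy["duration num"] = pd.to_numeric(parts[0])
toy["duration type"] = parts[1]
print(toy)
```

Note that the number means minutes for movies and seasons for shows, so the two should not be aggregated together.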

### Extract Numeric Duration for Movies
For movies specifically, we extract the numeric portion of `duration` and convert it to an integer.

In [None]:
df.loc[df['type'] == 'Movie', 'duration'].str.split().str[0].astype(int)
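
`astype(int)` raises an error if any movie has a missing `duration`. One possible way around this, sketched on hypothetical rows, is to extract the digits with a regular expression and use pandas' nullable `Int64` dtype. The toy frame and the names `is_movie` and `minutes` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; the last movie has no duration.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "duration": ["90 min", "2 Seasons", "125 min", None],
})

is_movie = toy["type"] == "Movie"

# Pull out the leading digits, convert to numbers, then to nullable integers.
digits = toy.loc[is_movie, "duration"].str.extract(r"(\d+)", expand=False)
minutes = pd.to_numeric(digits).astype("Int64")
print(minutes)
```

The result keeps the original row labels, so it can be assigned back to the movie rows of the frame if needed.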

## Step 6: Parse Date Added
The `date_added` column (e.g., 'September 9, 2019') is split into separate `year`, `month`, and `day` columns using regular expressions.

In [None]:
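# Raw strings keep the regex backslashes from being read as string escapes
df['year'] = df['date_added'].str.extract(r'(\d{4})')  # four-digit year
df['month'] = df['date_added'].str.extract(r'(\w+) ')  # first word followed by a space (the month name)
df['day'] = df['date_added'].str.extract(r' (\d{1,2}),')  # one- or two-digit day before the comma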

In [None]:
df.date_added
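
The three extractions can also be combined into a single pattern with named groups, so that one call returns all three columns. This is a sketch on hypothetical values; the pattern assumes the 'Month D, YYYY' layout described above.

```python
import pandas as pd

# Hypothetical values, including leading whitespace and a missing entry.
toy = pd.DataFrame({"date_added": ["September 9, 2019", " October 10, 2018", None]})

# Named groups become the column names of the extracted frame.
pattern = r"(?P<month>[A-Za-z]+)\s+(?P<day>\d{1,2}),\s*(?P<year>\d{4})"
parts = toy["date_added"].str.extract(pattern)

# Extracted pieces are strings; convert the numeric ones.
parts["day"] = pd.to_numeric(parts["day"])
parts["year"] = pd.to_numeric(parts["year"])
print(parts)
```

Rows that do not match the pattern, including missing values, come back as missing in all three columns.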

### Alternative Date Parsing
As an alternative approach, we test parsing dates into months using `pd.to_datetime` on a smaller sample dataset.

In [None]:
# Use a separate name so the cleaned `df` is not overwritten by the sample
data = {'date_added': ["September 9, 2018", "October 10, 2019", "November 11, 2020"]}
sample_df = pd.DataFrame(data)
sample_df['date_added'].str.extract(r'\s(\d{1,2}),')  # Extract day
pd.to_datetime(sample_df['date_added'].str.strip()).dt.month  # Extract month
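
Once the column is parsed, the `.dt` accessor gives year, month, and day without any regular expressions. The sketch below assumes the 'Month D, YYYY' format and uses `errors="coerce"` so that unparseable strings become `NaT` rather than raising; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values: one with leading whitespace, one that is not a date.
sample = pd.Series(["September 9, 2018", " October 10, 2019", "not a date"])

# Strip whitespace, then parse with an explicit format; bad values become NaT.
parsed = pd.to_datetime(sample.str.strip(), format="%B %d, %Y", errors="coerce")

date_parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "month_name": parsed.dt.month_name(),
    "day": parsed.dt.day,
})
print(date_parts)
```

Compared with the regex approach, this yields numeric month values directly and validates that each string is a real date.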

## Conclusion
This project successfully cleaned the Netflix titles dataset by:
- Quantifying and addressing missing data.
- Imputing missing ratings with the mode.
- Splitting the `duration` column into numeric and type components.
- Parsing the `date_added` column into year, month, and day.

The cleaned dataset is now ready for exploratory data analysis, visualization, or modeling. Future steps could include analyzing trends in movie durations or release patterns over time.