# Netflix Titles Data Cleaning Project

## Overview
This project demonstrates data cleaning techniques using a dataset of Netflix titles (`netflix_titles.csv`). The goal is to preprocess the data by handling missing values, splitting columns, and extracting meaningful features for further analysis. This portfolio project showcases proficiency in Python, Pandas, and data wrangling.

## Objectives
- Identify and quantify missing data.
- Handle missing values appropriately.
- Transform and extract features from text-based columns.
- Prepare the dataset for downstream analysis or visualization.

## Tools Used
- **Python Libraries**: Pandas, NumPy, SciPy, Seaborn, Babel, and the standard-library `re` module for regular expressions
- **Dataset**: `netflix_titles.csv` (sourced locally)

In [2]:
import pandas as pd
import numpy as np
import babel as bl
import scipy as sp
import seaborn as sns
import re
from scipy import stats
from babel import numbers

## Step 1: Load the Dataset
We begin by loading the Netflix titles dataset from a CSV file and inspecting the first row to understand its structure.

In [3]:
df = pd.read_csv('/Users/brtelfer/Documents/Python_Data_Projects/*17_Data_Cleaning/netflix_titles.csv')
df.head(1)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...


## Step 2: Assess Missing Data
To understand the quality of the dataset, we calculate the number of missing values in each column and sort them in descending order.

In [4]:
df.isnull().sum().sort_values(ascending=False)

director        1969
cast             570
country          476
date_added        11
rating            10
show_id            0
type               0
title              0
release_year       0
duration           0
listed_in          0
description        0
dtype: int64

### Calculate Missing Data as Percentages
Next, we compute the percentage of missing values for each column to better assess their impact.

In [5]:
# Fraction of missing values per column: the mean of the boolean null mask
results = df.isnull().mean()
print(results)

show_id         0.000000
type            0.000000
title           0.000000
director        0.315849
cast            0.091434
country         0.076355
date_added      0.001765
release_year    0.000000
rating          0.001604
duration        0.000000
listed_in       0.000000
description     0.000000
dtype: float64


In [6]:
# Convert the fractions to percentages, rounded to two decimal places
results = (results * 100).round(2)
results

show_id          0.00
type             0.00
title            0.00
director        31.58
cast             9.14
country          7.64
date_added       0.18
release_year     0.00
rating           0.16
duration         0.00
listed_in        0.00
description      0.00
dtype: float64
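
The counts from Step 2 and the percentages above can also be shown side by side. The following is a minimal sketch on a made-up toy frame, not the Netflix data itself; the helper name `missing_summary` and the column labels `missing` and `percent` are hypothetical choices for this illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are invented for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["TV-MA", "R", np.nan, "TV-14"],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, worst column first."""
    mask = frame.isnull()
    summary = pd.DataFrame({
        "missing": mask.sum(),                    # number of null cells per column
        "percent": (mask.mean() * 100).round(2),  # share of null cells per column
    })
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the full dataset should give one table that replaces the two separate cells, assuming `df` is the frame loaded in Step 1.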

## Step 3: Handle Missing Directors
The `director` column has a significant amount of missing data (1,969 of 6,234 rows, approximately 31.58%). Dropping those rows would discard nearly a third of the dataset, so the cells below only preview the result of dropping them rather than modifying `df`.

In [7]:
df['director'].isnull()

0       False
1        True
2        True
3        True
4       False
        ...  
6229     True
6230     True
6231     True
6232     True
6233     True
Name: director, Length: 6234, dtype: bool

In [8]:
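# Three equivalent ways to drop rows where 'director' is missing.
# None of them modifies df in place, and only the last expression is displayed.
no_director = df[df['director'].isnull()].index
df.drop(no_director, axis=0).isnull().sum()  # Option 1: drop by index label, then recount nulls
df[~(df['director'].isnull())]  # Option 2: keep rows with a boolean mask
df.dropna(subset=['director'])  # Option 3 (preferred): dropna restricted to one column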

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...
6,70304989,Movie,Automata,Gabe Ibáñez,"Antonio Banderas, Dylan McDermott, Melanie Gri...","Bulgaria, United States, Spain, Canada","September 8, 2017",2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f..."
7,80164077,Movie,Fabrizio Copano: Solo pienso en mi,"Rodrigo Toro, Francisco Schultz",Fabrizio Copano,Chile,"September 8, 2017",2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...
9,70304990,Movie,Good People,Henrik Ruben Genz,"James Franco, Kate Hudson, Tom Wilkinson, Omar...","United States, United Kingdom, Denmark, Sweden","September 8, 2017",2014,R,90 min,"Action & Adventure, Thrillers",A struggling couple can't believe their luck w...
...,...,...,...,...,...,...,...,...,...,...,...,...
6142,80063224,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"August 30, 2019",2019,TV-PG,7 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
6158,80164216,TV Show,Miraculous: Tales of Ladybug & Cat Noir,Thomas Astruc,"Cristina Vee, Bryce Papenbrook, Keith Silverst...","France, South Korea, Japan","August 2, 2019",2018,TV-Y7,4 Seasons,"Kids' TV, TV Action & Adventure","When Paris is in peril, Marinette becomes Lady..."
6167,80115328,TV Show,Sacred Games,"Vikramaditya Motwane, Anurag Kashyap","Saif Ali Khan, Nawazuddin Siddiqui, Radhika Ap...","India, United States","August 15, 2019",2019,TV-MA,2 Seasons,"Crime TV Shows, International TV Shows, TV Dramas",A link in their pasts leads an honest cop to a...
6182,80176842,TV Show,Men on a Mission,Jung-ah Im,"Ho-dong Kang, Soo-geun Lee, Sang-min Lee, Youn...",South Korea,"April 9, 2019",2019,TV-14,4 Seasons,"International TV Shows, Korean TV Shows, Stand...",Male celebs play make-believe as high schooler...
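
Because the calls above return new DataFrames, the result has to be assigned to be kept. The sketch below, on invented toy data, contrasts dropping the rows with keeping them under a placeholder label; the label `"Unknown"` is an arbitrary assumption for this example, not something the original analysis uses.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Netflix titles (invented values)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", np.nan, "Y"],
})

# Option 1: drop the rows and keep the result by assigning it
dropped = toy.dropna(subset=["director"]).reset_index(drop=True)

# Option 2: keep every row and mark the gap with a placeholder
# ("Unknown" is a hypothetical label chosen for this sketch)
filled = toy.fillna({"director": "Unknown"})

print(len(toy), len(dropped), len(filled))
```

Which option fits depends on the analysis: dropping suits director-level questions, while a placeholder preserves the rows for questions that do not involve the director.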


## Step 4: Impute Missing Ratings
For columns with few missing values, like `rating` (10 missing entries, about 0.16%), we impute missing entries with the most frequent (mode) value.

In [9]:
# mode() returns a Series because ties are possible, so take its first entry
df.fillna({'rating': df['rating'].mode()[0]}).head(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
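
The snippet below is a small sketch of mode imputation on an invented Series. It also shows why indexing the mode is safer than joining it: when two values tie, `mode()` returns both, and joining them would produce a single meaningless string.

```python
import numpy as np
import pandas as pd

# Invented ratings with two gaps
ratings = pd.Series(["TV-MA", "TV-14", np.nan, "TV-MA", np.nan], name="rating")

# mode() ignores NaN by default and returns a Series of the most frequent values
most_common = ratings.mode()[0]

# fillna returns a new Series, so assign it to keep the imputed values
imputed = ratings.fillna(most_common)
print(imputed.tolist())

# With a tie, mode() returns more than one value
tied = pd.Series(["R", "PG", "PG", "R"])
print(len(tied.mode()))
```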


## Step 5: Split Duration Column
The `duration` column contains both numeric values and units (e.g., '90 min' or '1 Season'). We split it into two columns: `duration num` (the number, which remains a string until it is cast) and `duration type` (the unit).

In [10]:
# Split once on the first space and assign both parts in a single step
df[['duration num', 'duration type']] = df['duration'].str.split(' ', n=1, expand=True)

### Extract Numeric Duration for Movies
For movies specifically, we extract the numeric portion of `duration` and convert it to an integer.

In [11]:
df.loc[df['type'] == 'Movie', 'duration'].str.split().str[0].astype(int)

0        90
1        94
4        99
6       110
7        60
       ... 
5577     70
5578    102
5579     88
5580    109
6231     60
Name: duration, Length: 4265, dtype: int64
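
As an alternative to splitting on whitespace, a single regular expression with named groups can produce both columns and make the integer cast explicit. This is a sketch on invented rows; the underscore column names `duration_num` and `duration_type` are assumptions made here because regex group names cannot contain spaces, and the singular/plural normalization is an optional extra.

```python
import pandas as pd

# Invented rows covering both unit styles
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "1 Season", "110 min", "4 Seasons"],
})

# Named groups become the column names of the extracted frame
parts = toy["duration"].str.extract(r"(?P<duration_num>\d+)\s+(?P<duration_type>\w+)")

# Cast the number to int so it can be used in arithmetic
parts["duration_num"] = parts["duration_num"].astype(int)

# Optional: fold 'Seasons' into 'Season' so the unit has two categories
parts["duration_type"] = parts["duration_type"].str.replace(r"s$", "", regex=True)

toy = toy.join(parts)

# Movie runtimes in minutes, selected with .loc
movie_minutes = toy.loc[toy["type"] == "Movie", "duration_num"]
print(toy)
```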

## Step 6: Parse Date Added
The `date_added` column (e.g., 'September 9, 2019') is split into separate `year`, `month`, and `day` columns using regular expressions.

In [12]:
# Raw strings keep the regex backslashes from being read as string escapes
df['year'] = df['date_added'].str.extract(r'(\d{4})')
df['month'] = df['date_added'].str.extract(r'(\w+) ')
df['day'] = df['date_added'].str.extract(r' (\d{1,2}),')

In [13]:
df.date_added

0       September 9, 2019
1       September 9, 2016
2       September 8, 2018
3       September 8, 2018
4       September 8, 2017
              ...        
6229                  NaN
6230                  NaN
6231                  NaN
6232                  NaN
6233                  NaN
Name: date_added, Length: 6234, dtype: object
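
The three extractions can also be combined into one pattern with named groups, which keeps the pieces aligned and leaves missing dates as missing values. The following is a sketch on an invented Series; the pattern assumes dates of the form 'Month D, YYYY', possibly with leading whitespace.

```python
import numpy as np
import pandas as pd

# Invented dates, including a leading space and a missing value
dates = pd.Series(["September 9, 2019", " October 10, 2018", np.nan], name="date_added")

# One pattern, three named groups: month name, day, four-digit year
pattern = r"(?P<month>[A-Za-z]+)\s+(?P<day>\d{1,2}),\s*(?P<year>\d{4})"
parts = dates.str.extract(pattern)

# Extracted values are strings; convert day and year to nullable integers
for col in ["day", "year"]:
    parts[col] = pd.to_numeric(parts[col]).astype("Int64")

print(parts)
```

The nullable `Int64` dtype is used so that rows without a date stay missing instead of forcing the whole column to floats.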

### Alternative Date Parsing
As an alternative approach, we test parsing dates into months using `pd.to_datetime` on a smaller sample dataset.

In [14]:
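data = {'date_added': ["September 9, 2018", "October 10, 2019", "November 11, 2020"]}
# Use a separate name so the sample does not overwrite the main DataFrame
sample_df = pd.DataFrame(data)
sample_df['date_added'].str.extract(r'\s(\d{1,2}),')  # Extract day (not displayed)
pd.to_datetime(sample_df['date_added'].str.strip()).dt.month  # Extract month as an integer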

0     9
1    10
2    11
Name: date_added, dtype: int32
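
The same idea extends to year, month, and day in one pass, and to rows where the date is missing. This is a sketch on invented data; the explicit `format` string assumes every date looks like 'September 9, 2018' with English month names, and `errors="coerce"` turns anything unparseable into `NaT` rather than raising.

```python
import numpy as np
import pandas as pd

# Invented sample with a leading space and a missing date
sample = pd.DataFrame({
    "date_added": ["September 9, 2018", " October 10, 2019", np.nan],
})

# Strip stray whitespace, then parse; unparseable or missing values become NaT
parsed = pd.to_datetime(
    sample["date_added"].str.strip(),
    format="%B %d, %Y",  # assumed layout: full month name, day, four-digit year
    errors="coerce",
)

# The .dt accessor exposes the individual date components
sample["year"] = parsed.dt.year
sample["month"] = parsed.dt.month_name()
sample["day"] = parsed.dt.day
print(sample)
```

Compared with the regex approach, this validates the dates as well as splitting them, at the cost of depending on a consistent date format.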

## Conclusion
This project successfully cleaned the Netflix titles dataset by:
- Quantifying and addressing missing data.
- Imputing missing ratings with the mode.
- Splitting the `duration` column into numeric and type components.
- Parsing the `date_added` column into year, month, and day.

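The row drops and mode imputation above are previewed rather than assigned back to `df`; once those results are saved, the dataset is ready for exploratory data analysis, visualization, or modeling. Future steps could include analyzing trends in movie durations or release patterns over time.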