# Data Cleaning for Netflix Dataset

This notebook focuses on cleaning and preprocessing the Netflix titles dataset. The aim is to handle missing values, correct any inconsistencies, and prepare the data for further analysis, ensuring that the dataset is reliable for any subsequent analytical or machine learning work.

In [1]:
import pandas as pd

In [2]:
def load_data(file_path):
    """Load the dataset from a specified file path."""
    return pd.read_csv(file_path, encoding='latin-1')

file_path = '../data/netflix_titles.csv'
netflix_data = load_data(file_path)

## Initial Data Exploration

Begin by exploring the dataset to understand its structure and identify any immediate inconsistencies or cleaning needs. This step is crucial to determine how much cleaning and preprocessing is needed and to ensure the data's integrity.

In [3]:
netflix_data.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,...,,,,,,,,,,
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,...,,,,,,,,,,
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,...,,,,,,,,,,
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,...,,,,,,,,,,
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,...,,,,,,,,,,


In [4]:
print(netflix_data.describe())
print("----------------------------------------------")
netflix_data.info()  # info() prints its report directly and returns None

       release_year  Unnamed: 12  Unnamed: 13  Unnamed: 14  Unnamed: 15  \
count   8809.000000          0.0          0.0          0.0          0.0   
mean    2014.181292          NaN          NaN          NaN          NaN   
std        8.818932          NaN          NaN          NaN          NaN   
min     1925.000000          NaN          NaN          NaN          NaN   
25%     2013.000000          NaN          NaN          NaN          NaN   
50%     2017.000000          NaN          NaN          NaN          NaN   
75%     2019.000000          NaN          NaN          NaN          NaN   
max     2024.000000          NaN          NaN          NaN          NaN   

       Unnamed: 16  Unnamed: 17  Unnamed: 18  Unnamed: 19  Unnamed: 20  \
count          0.0          0.0          0.0          0.0          0.0   
mean           NaN          NaN          NaN          NaN          NaN   
std            NaN          NaN          NaN          NaN          NaN   
min            NaN          

## Remove Unnecessary Columns

The 'Unnamed: N' columns contain no data at all: in the describe() output above, each of them has a non-null count of 0. They carry no information, so we drop them to simplify the dataset.

In [5]:
columns_to_drop = [col for col in netflix_data.columns if 'Unnamed' in col]
netflix_data.drop(columns=columns_to_drop, inplace=True)

netflix_data.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
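
Matching on the column name works here, but it is worth confirming that the matched columns really are empty before dropping them. The sketch below uses a small toy frame (its values are made up for illustration, not taken from the dataset) and shows an alternative that drops any column consisting solely of missing values, regardless of its name.

```python
import pandas as pd

# Toy frame standing in for netflix_data; values are illustrative only.
toy = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "title": ["A", "B"],
    "Unnamed: 12": [None, None],
    "Unnamed: 13": [None, None],
})

# Check that every name-matched column is entirely empty.
unnamed = [col for col in toy.columns if col.startswith("Unnamed")]
all_empty = bool(toy[unnamed].isna().all().all())
print("Unnamed columns are all empty:", all_empty)

# Drop every column that contains only missing values.
trimmed = toy.dropna(axis=1, how="all")
print(list(trimmed.columns))
```

Dropping by emptiness rather than by name avoids removing a column that happens to be called 'Unnamed' but holds real values.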


## Handle Missing Values

To maintain the integrity of our dataset, we need to address missing values appropriately. Different strategies are used based on the significance of the columns and the nature of their data.
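
Choosing between filling and dropping is easier with the number and share of missing values per column in view. The following is a minimal sketch on a toy frame; the frame and the name `missing` are assumptions for illustration, and the same expressions could be applied to `netflix_data`.

```python
import pandas as pd

# Toy frame standing in for netflix_data; values are illustrative only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["PG", "TV-MA", None, "R"],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Columns with a large share of missing values are candidates for a placeholder such as 'Unknown', while columns with only a handful of gaps can have those rows dropped at little cost.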


### Fill Missing Values with 'Unknown'

For the columns 'director', 'cast', and 'country', a significant portion of data is missing. These columns are categorical, so we fill the gaps with the placeholder 'Unknown'. This keeps the affected rows available for analysis instead of discarding them, and it avoids guessing values that would distort the data.

In [6]:
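# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may operate on a copy and is deprecated in recent pandas.
for column in ['director', 'cast', 'country']:
    netflix_data[column] = netflix_data[column].fillna('Unknown')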

### Drop Rows with Critical Missing Information

The columns 'date_added', 'rating', and 'duration' have minimal missing values. These columns are crucial for temporal and content analysis, hence rows with missing values in these fields are dropped to avoid inaccuracies.

In [7]:
netflix_data.dropna(subset=['date_added', 'rating', 'duration'], inplace=True)

netflix_data.info()  # info() prints its report directly and returns None

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8792 entries, 0 to 8808
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8792 non-null   object
 1   type          8792 non-null   object
 2   title         8792 non-null   object
 3   director      8792 non-null   object
 4   cast          8792 non-null   object
 5   country       8792 non-null   object
 6   date_added    8792 non-null   object
 7   release_year  8792 non-null   int64 
 8   rating        8792 non-null   object
 9   duration      8792 non-null   object
 10  listed_in     8792 non-null   object
 11  description   8792 non-null   object
dtypes: int64(1), object(11)
memory usage: 892.9+ KB


## Data Type Conversions

Ensuring that all data types are appropriate for their respective columns is crucial for accurate analysis. This involves converting date fields to datetime objects for easier manipulation and ensuring numerical fields are appropriately formatted.

In [8]:
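# Strip any leading or trailing whitespace first so stray spaces cannot break date parsing.
netflix_data['date_added'] = pd.to_datetime(netflix_data['date_added'].str.strip())

# release_year is already int64 at this point; the cast is kept as an explicit guarantee.
netflix_data['release_year'] = netflix_data['release_year'].astype(int)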

netflix_data.dtypes

show_id                 object
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                  object
duration                object
listed_in               object
description             object
dtype: object
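
Date parsing can fail or silently misread values when a column mixes formats or contains stray whitespace. The sketch below runs on toy strings, not the real column, and assumes the "Month day, year" layout seen in the head() output. It shows one defensive option: strip whitespace, parse with an explicit format, and turn anything unparseable into NaT so it can be inspected.

```python
import pandas as pd

# Toy values; the leading space and the invalid entry are deliberate.
raw = pd.Series([" August 4, 2017", "September 25, 2021", "not a date"])

# errors="coerce" converts values that do not match the format to NaT
# instead of raising, so bad entries can be counted and reviewed.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print("Unparseable values:", int(parsed.isna().sum()))
```

With errors="coerce", check the NaT count after parsing; a non-zero count means some rows need manual attention before any date-based analysis.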

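## Remove Outliers

Rows whose 'date_added' falls in 2024 are treated as outliers and removed. After this filter the remaining dates span 2008 to 2021, and only two rows are lost (8792 rows before, 8790 after).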

In [9]:
netflix_data = netflix_data[netflix_data['date_added'].dt.year != 2024]

print("Unique years in dataset:", netflix_data['date_added'].dt.year.unique())
print("Updated dataset size:", netflix_data.shape)

Unique years in dataset: [2021 2020 2019 2018 2017 2016 2015 2014 2013 2012 2011 2009 2008 2010]
Updated dataset size: (8790, 12)
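
Before hardcoding a year to exclude, it helps to look at how many rows fall in each year so that isolated values stand out. This is a minimal sketch on toy dates; the cutoff `last_expected_year` is a hypothetical parameter chosen for illustration, not a value defined in this notebook.

```python
import pandas as pd

# Toy dates standing in for netflix_data['date_added'].
toy = pd.DataFrame({
    "date_added": pd.to_datetime(
        ["2019-05-01", "2020-01-15", "2020-07-04", "2021-09-24", "2024-04-05"]
    )
})

# Rows per year, in chronological order; isolated years stand out here.
year_counts = toy["date_added"].dt.year.value_counts().sort_index()
print(year_counts)

# Hypothetical cutoff: keep rows up to the last year expected in the data.
last_expected_year = 2021
filtered = toy[toy["date_added"].dt.year <= last_expected_year]
print("Rows kept:", len(filtered))
```

A range-based filter like this also catches unexpected years other than 2024, whereas an inequality against a single year removes only that year.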


## Check Unique Counts

Counting the unique values in each column helps spot potentially incorrect entries, such as duplicated identifiers or unexpected categories.

In [10]:
unique_counts = netflix_data.nunique()

print(unique_counts)

show_id         8790
type               2
title           8787
director        4527
cast            7679
country          749
date_added      1713
release_year      74
rating            14
duration         220
listed_in        513
description     8758
dtype: int64
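
In the counts above, 'title' has 8787 unique values across 8790 rows, so a few titles appear more than once. Whether these are true duplicates or distinct works sharing a name has to be checked by eye. The sketch below, on a toy frame with made-up values, lists every row whose title is repeated.

```python
import pandas as pd

# Toy frame standing in for netflix_data; values are illustrative only.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "title": ["Alpha", "Beta", "Alpha", "Gamma"],
})

# keep=False marks every occurrence of a repeated title, not just the later ones.
dupes = toy[toy.duplicated(subset="title", keep=False)]
print(dupes)
```

Comparing the other columns of the listed rows (for example 'type' and 'release_year' in the real data) indicates whether a repeated title is a genuine duplicate.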


## Save the Cleaned Dataset

After cleaning, we save the dataset for further analysis to ensure that we do not have to repeat these preprocessing steps.

In [11]:
cleaned_file_path = '../data/cleaned_netflix_titles.csv'
netflix_data.to_csv(cleaned_file_path, index=False)

print("Cleaned dataset saved successfully.")

Cleaned dataset saved successfully.
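
CSV files do not store column types, so the datetime conversion done earlier is not preserved on disk. The sketch below writes a toy frame to an in-memory buffer (an assumption made to avoid touching the file system) and shows that passing parse_dates when reloading restores the datetime type.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "date_added": pd.to_datetime(["2021-09-25", "2021-09-24"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reload without and with date parsing to compare the resulting dtypes.
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date_added"])

print(plain["date_added"].dtype)
print(parsed["date_added"].dtype)
```

When loading the cleaned file in a later notebook, the same parse_dates argument can be applied to 'date_added'.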
