In [44]:
import pandas as pd
import numpy as np

# Load raw data
df = pd.read_csv('netflix_titles.csv')


In [45]:
print("Initial shape:", df.shape)
print(df.info())

Initial shape: (8807, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
None


**The dataset has 8,807 Netflix titles with 12 columns.**

This shows which columns are present, along with their data types.

**Handle missing values**

In [46]:
# 1. Handle missing values
print("Missing values by column:\n", df.isnull().sum())


Missing values by column:
 show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64


**The dataset has many missing values in director (2,634) and moderate gaps in cast (825) and country (831). Minor gaps exist in date_added (10), rating (4), and duration (3). All other columns are complete.**
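To put these counts in proportion, here is a small sketch that converts them to percentages of the 8,807 rows. The numbers are copied from the output above rather than recomputed; on the live frame, `df.isnull().mean() * 100` should give the same shares.

```python
import pandas as pd

# Row count from df.info() and missing counts from df.isnull().sum() above.
n_rows = 8807
missing = pd.Series({
    "director": 2634,
    "cast": 825,
    "country": 831,
    "date_added": 10,
    "rating": 4,
    "duration": 3,
})

# Share of rows with a missing value, per column, in percent.
missing_pct = (missing / n_rows * 100).round(2)
print(missing_pct.sort_values(ascending=False))
```

Only director comes close to a third of the rows; every other column is below 10%.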

In [47]:
# fill missing with meaningful placeholders
df['director'] = df['director'].fillna('Unknown Director')
df['cast'] = df['cast'].fillna('Unknown Cast')
df['country'] = df['country'].fillna('Not Specified')
df['rating'] = df['rating'].fillna('Not Rated')
df['duration'] = df['duration'].fillna('Unknown Duration')
df['date_added'] = df['date_added'].fillna('Unknown Date')


print("Missing values by column:\n", df.isnull().sum())



Missing values by column:
 show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64


**All missing values were replaced with meaningful placeholders (Unknown Director, Unknown Cast, Not Specified, Not Rated, Unknown Duration, Unknown Date).**

I used placeholders instead of dropping rows because the affected rows still contain valuable information: title, type, release year, genre, and so on. Dropping them would shrink the dataset unnecessarily: removing rows with a missing director alone would lose 2,634 rows (about 30% of the data), and missing cast affects 825 rows (about 9%), some of which may overlap with the missing directors.
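The trade-off can be seen on a tiny example. This is only a sketch on made-up rows (the `toy` frame is not part of the Netflix data); it compares `dropna` with `fillna` using a per-column dictionary.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows, standing in for the real dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Jane Doe", np.nan, np.nan, "John Roe"],
    "cast": [np.nan, "X, Y", np.nan, "Z"],
})

# Option 1: drop every row that lacks a director or a cast.
dropped = toy.dropna(subset=["director", "cast"])

# Option 2: keep all rows and fill the gaps with per-column placeholders.
filled = toy.fillna({"director": "Unknown Director", "cast": "Unknown Cast"})

print("rows before:", len(toy))
print("rows after dropna:", len(dropped))
print("rows after fillna:", len(filled))
```

Passing a dictionary to `fillna` does the same job as the six separate assignments above in one call.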

In [48]:
# 2. Remove duplicate rows
df.drop_duplicates(inplace=True)
print("After dropping duplicates:", df.shape)

After dropping duplicates: (8807, 12)


**No duplicates were found: the shape of the dataset remains the same.**
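Comparing shapes works, but duplicates can also be counted directly before dropping anything. A minimal sketch on made-up rows (the `toy` frame is hypothetical); checking `show_id` on its own is my assumption about what a useful key would be.

```python
import pandas as pd

# Toy frame with one fully duplicated row (made-up data).
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "title": ["A", "B", "B", "C"],
})

# Rows that repeat an earlier row in every column.
n_full_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier show_id, regardless of the other columns.
n_id_dupes = int(toy.duplicated(subset=["show_id"]).sum())

deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_full_dupes, n_id_dupes, deduped.shape)
```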

In [49]:
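# 3. Standardize text values (e.g. type, rating)
# Note: str.title() capitalizes only the first letter of each word, so "TV Show" becomes "Tv Show"
df['type'] = df['type'].str.title()            # "movie" -> "Movie"
df['rating'] = df['rating'].str.upper().str.strip()   # also turns the 'Not Rated' placeholder into 'NOT RATED'
df['country'] = df['country'].str.title()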

**Standardized text columns:**

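type → Title case (e.g., "movie" → "Movie"). Note that this also changes "TV Show" to "Tv Show", as the final output shows.

rating → Uppercase, with leading and trailing spaces removed. Dashes are kept (e.g., PG-13, TV-MA).

country → Title case

This keeps formatting consistent across categorical fields, which makes grouping, filtering, and analysis easier.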
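Because `str.title()` lowercases everything after the first letter of each word, acronyms such as "TV" do not survive it. One possible alternative, sketched here on made-up values, is to normalize to lowercase and then map to the exact labels wanted. The `canonical` dictionary is my own assumption about the desired labels, not something taken from the dataset.

```python
import pandas as pd

# Made-up raw values with inconsistent case and stray spaces.
raw = pd.Series([" movie", "TV Show", "tv show ", "Movie"])

# What str.title() does to these values.
title_cased = raw.str.title().tolist()

# Alternative: strip, lowercase, then map to the exact labels we want.
# Values not listed in the mapping become NaN, which makes surprises visible.
canonical = {"movie": "Movie", "tv show": "TV Show"}
cleaned = raw.str.strip().str.lower().map(canonical)

print(title_cased)
print(cleaned.tolist())
```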

In [50]:
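# 4. Fix date column: parse date_added as datetime, then format as dd-mm-yyyy
# errors='coerce' turns unparseable values, including the 'Unknown Date' placeholder, into NaT
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
# strftime returns strings, so the column ends up as text rather than datetime
df['date_added'] = df['date_added'].dt.strftime('%d-%m-%Y')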

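**Parsed date_added with pd.to_datetime and reformatted it as dd-mm-yyyy text.**

This gives every date the same display format. Two caveats: `strftime` returns strings, so the column is no longer a datetime and dd-mm-yyyy strings do not sort chronologically; and `errors='coerce'` turns the 'Unknown Date' placeholder into NaT, so at least those 10 rows are missing again after this step.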
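The difference between sorting real datetimes and sorting dd-mm-yyyy strings shows up on a few made-up values. This is a sketch only: the explicit `format="%B %d, %Y"` is my assumption about how the raw strings look (e.g. "September 25, 2021"), so adjust it if the file differs.

```python
import pandas as pd

# Made-up values in the assumed raw format, plus the placeholder from step 1.
raw = pd.Series(["March 1, 2021", "January 25, 2021", "Unknown Date"])

# The placeholder cannot be parsed, so errors="coerce" turns it into NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")
valid = parsed.dropna()

# Sorting the datetime values gives chronological order.
chronological = valid.sort_values().dt.strftime("%d-%m-%Y").tolist()

# Sorting the formatted strings compares the day digits first.
lexical = sorted(valid.dt.strftime("%d-%m-%Y").tolist())

print("missing after parsing:", int(parsed.isna().sum()))
print("datetime sort:", chronological)
print("string sort:  ", lexical)
```

Keeping the column as datetime for analysis and calling `strftime` only when displaying or exporting avoids both problems.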

In [51]:
# 5. Clean duration: split numeric value & unit
df[['duration_value', 'duration_unit']] = df['duration'].str.extract(r'(\d+)\s*(\w+)')
df['duration_value'] = pd.to_numeric(df['duration_value'], errors='coerce')
# Normalize units; 'season' needs no mapping, and rows without a match get 'unknown'
df['duration_unit'] = df['duration_unit'].str.lower().replace({
    'seasons': 'season',
    'min': 'minutes'
}).fillna('unknown')

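**Split duration into a numeric value (duration_value) and a unit (duration_unit), converting the value to numeric and filling missing units with "unknown".**

Rows holding the 'Unknown Duration' placeholder contain no digits, so they get NaN in duration_value and "unknown" in duration_unit. The split makes the data structured and analyzable, enabling numeric comparisons and filtering by unit (minutes vs. seasons).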
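Here is how the two new columns can be used once they exist. The sketch repeats the split on a handful of made-up durations (the `toy` frame is hypothetical) and then computes an average runtime and a count of multi-season shows.

```python
import pandas as pd

# Made-up durations, including the placeholder from step 1.
toy = pd.DataFrame({
    "duration": ["90 min", "2 Seasons", "1 Season", "125 min", "Unknown Duration"],
})

# Same split as in the cell above: digits, optional space, then the unit word.
parts = toy["duration"].str.extract(r"(\d+)\s*(\w+)")
toy["duration_value"] = pd.to_numeric(parts[0], errors="coerce")
toy["duration_unit"] = (
    parts[1].str.lower()
    .replace({"seasons": "season", "min": "minutes"})
    .fillna("unknown")
)

# Average runtime over rows measured in minutes.
is_minutes = toy["duration_unit"] == "minutes"
mean_minutes = toy.loc[is_minutes, "duration_value"].mean()

# Number of shows with more than one season.
is_season = toy["duration_unit"] == "season"
multi_season = int((is_season & (toy["duration_value"] > 1)).sum())

print(mean_minutes, multi_season)
```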

In [52]:
# 6. Rename columns: clean headers
df.columns = (df.columns.str.strip()
                           .str.lower()
                           .str.replace(' ', '_')
                           .str.replace('-', '_'))

**Normalized all column headers to lowercase with underscores (a header such as Date Added would become date_added).**

All columns in this dataset were already lowercase and well formed, so this step changes nothing here. It is kept as a safeguard in case the source file changes.
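Since the real headers are already clean, the effect of this step is easier to see on deliberately messy ones. The header names below are made up for illustration.

```python
import pandas as pd

# Empty frame with made-up, messy headers.
messy = pd.DataFrame(columns=[" Date Added", "Release-Year", "show_id "])

# Same chain as in the cell above, with regex=False to make the literal match explicit.
messy.columns = (
    messy.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("-", "_", regex=False)
)

print(list(messy.columns))
```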

In [53]:
# 7. Strip whitespace in text columns
for col in ['title', 'director', 'cast', 'listed_in', 'description']:
    df[col] = df[col].astype(str).str.strip()

**Removed leading and trailing spaces from text columns (title, director, cast, listed_in, description).**
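One thing to keep in mind: `astype(str)` is safe here only because the missing values were already filled in step 1. In the pandas 1.x and 2.x versions I am aware of, it would otherwise turn NaN into the literal text "nan". A sketch on made-up rows shows that calling `.str.strip()` directly leaves missing values untouched.

```python
import numpy as np
import pandas as pd

# Made-up rows with stray spaces and one missing director.
toy = pd.DataFrame({
    "title": ["  Ganglands ", "Kota Factory"],
    "director": [np.nan, " Julien Leclercq"],
})

# .str.strip() skips missing values instead of converting them to text.
for col in ["title", "director"]:
    toy[col] = toy[col].str.strip()

print(toy)
```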



In [54]:
# Save the final cleaned dataset
df.to_csv("netflix_titles_final_cleaned.csv", index=False)

print("Data cleaning complete! Final shape:", df.shape)
print(df.head())

Data cleaning complete! Final shape: (8807, 14)
  show_id     type                  title          director  \
0      s1    Movie   Dick Johnson Is Dead   Kirsten Johnson   
1      s2  Tv Show          Blood & Water  Unknown Director   
2      s3  Tv Show              Ganglands   Julien Leclercq   
3      s4  Tv Show  Jailbirds New Orleans  Unknown Director   
4      s5  Tv Show           Kota Factory  Unknown Director   

                                                cast        country  \
0                                       Unknown Cast  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...  Not Specified   
3                                       Unknown Cast  Not Specified   
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

   date_added  release_year rating   duration  \
0  25-09-2021          2020  PG-13     90 min   
1  24-09-2021          2021  TV-MA  2 Seasons   

**Saved the fully cleaned dataset as netflix_titles_final_cleaned.csv and displayed its final shape and sample rows.**

