## Dataset Exploration and Initial Analysis

In this section, we explore the dataset to understand its structure, check for missing or duplicate values, and get a quick overview of its contents before preprocessing.



In [1]:
# Step 1: import pandas library
import pandas as pd

In [2]:
# Step 2: Upload dataset directly in Colab
from google.colab import files
uploaded = files.upload()

Saving netflix_titles.csv to netflix_titles.csv


In [3]:
# Step 3: Read dataset
df = pd.read_csv("netflix_titles.csv")

In [4]:
# Step 4: Quick overview
print("Shape:", df.shape)
df.head()

Shape: (8807, 12)


|   | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---------|------|-------|----------|------|---------|------------|--------------|--------|----------|-----------|-------------|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson |  | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water |  | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... |  | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans |  |  |  | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory |  | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |


In [14]:
# Step 5: check for duplicates

df.duplicated().sum()

np.int64(0)
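Called without arguments, `df.duplicated()` flags a row only when every column matches an earlier row. If `show_id` is unique per row (as the preview suggests), two entries for the same title would not be caught. A minimal sketch of a stricter check follows. The sample rows are made up, and the key columns are an assumption about what identifies a title.

```python
import pandas as pd

# Made-up sample rows for illustration; not taken from the real dataset.
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Example Film", "Example Film", "Example Series"],
    "release_year": [2020, 2020, 2021],
})

# Exact-row check: every column must match, so the differing show_id hides the repeat.
exact_dupes = int(sample.duplicated().sum())

# Subset check: compare only the columns assumed to identify a title.
key_cols = ["type", "title", "release_year"]  # assumed key, adjust as needed
key_dupes = int(sample.duplicated(subset=key_cols).sum())

print("Exact duplicates:", exact_dupes)
print("Duplicates on key columns:", key_dupes)
```

Whether a repeat on these columns is a true duplicate depends on the data, so any rows flagged this way are worth inspecting before dropping.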

In [13]:
# Step 6: Check data types and missing values per column
print(df.dtypes)
print("----------")
print(df.isnull().sum())

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object
----------
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
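Raw missing counts are easier to judge next to the share of rows they represent. The sketch below builds a small summary table with both figures. The sample frame is made up for illustration, and the same expressions should apply to `df`.

```python
import pandas as pd

# Made-up sample frame for illustration only.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["PG", "R", None, "PG"],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(2),
})

# Keep only columns that actually have gaps, largest first.
missing = missing[missing["missing"] > 0].sort_values("missing", ascending=False)
print(missing)
```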


## EDA Notes

From the initial exploration of the dataset, we observed the following:

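- No duplicate rows were found.  
- Six columns have missing values: `director` (2,634), `cast` (825), `country` (831), `date_added` (10), `rating` (4), and `duration` (3).  
- All columns are of type `object` except `release_year`, which is `int64`.  
- `date_added` is stored as text (`object`), and some values may carry leading or trailing whitespace.  
  - Needs to be stripped and converted to a proper datetime type for consistency and analysis.  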
- `duration` column contains mixed units: `'min'`, `'Seasons'`, `'Season'`.  
  - Needs to be split into numeric and type parts and standardized.  
- `type` column contains inconsistent text casing.  
  - Should be normalized for consistency.


These steps provide a clear understanding of the dataset's current state and help guide the subsequent preprocessing steps.
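Observations about casing and units can be confirmed by listing the distinct values in each column. The following is a minimal sketch on made-up sample data. On the real dataset, the same calls would be applied to `df["type"]` and `df["duration"]`.

```python
import pandas as pd

# Made-up sample frame for illustration only.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "1 Season", None],
})

# Distinct labels in 'type', including missing values if any.
type_counts = sample["type"].value_counts(dropna=False)

# Distinct unit words in 'duration' (the alphabetic part of each value).
units = sample["duration"].str.extract(r"([A-Za-z]+)", expand=False)
unit_counts = units.value_counts(dropna=False)

print(type_counts)
print(unit_counts)
```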

## Data Preprocessing

In [29]:
# Step 7: Normalize text case in 'type' column by converting values to Title Case (e.g., "TV Show" → "Tv Show")

df["type"] = df["type"].str.title()
df.head()

|   | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | date_clean |
|---|---------|------|-------|----------|------|---------|------------|--------------|--------|----------|-----------|-------------|------------|
| 8802 | s8803 | Movie | Zodiac | David Fincher | Mark Ruffalo, Jake Gyllenhaal, Robert Downey J... | United States | November 20, 2019 | 2007 | R | 158 min | Cult Movies, Dramas, Thrillers | A political cartoonist, a crime reporter and a... | 2019-11-20 |
| 8803 | s8804 | Tv Show | Zombie Dumb |  |  |  | July 1, 2019 | 2018 | TV-Y7 | 2 Seasons | Kids' TV, Korean TV Shows, TV Comedies | While living alone in a spooky town, a young g... | 2019-07-01 |
| 8804 | s8805 | Movie | Zombieland | Ruben Fleischer | Jesse Eisenberg, Woody Harrelson, Emma Stone, ... | United States | November 1, 2019 | 2009 | R | 88 min | Comedies, Horror Movies | Looking to survive in a world taken over by zo... | 2019-11-01 |
| 8805 | s8806 | Movie | Zoom | Peter Hewitt | Tim Allen, Courteney Cox, Chevy Chase, Kate Ma... | United States | January 11, 2020 | 2006 | PG | 88 min | Children & Family Movies, Comedies | Dragged from civilian life, a former superhero... | 2020-01-11 |
| 8806 | s8807 | Movie | Zubaan | Mozez Singh | Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... | India | March 2, 2019 | 2015 | TV-14 | 111 min | Dramas, International Movies, Music & Musicals | A scrappy but poor boy worms his way into a ty... | 2019-03-02 |
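`str.title()` capitalizes the first letter of each word and lowercases the rest, which is why `"TV Show"` appears as `"Tv Show"` in the output. If the acronym should be kept, one alternative is to normalize to a lowercase key and map it to canonical labels. This is only a sketch: the `canonical` mapping and the sample values are assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up raw values showing the kinds of variation normalization targets.
raw = pd.Series(["TV Show", "tv show", " Movie ", "MOVIE", None])

# Assumed set of canonical labels, keyed by lowercase text.
canonical = {"tv show": "TV Show", "movie": "Movie"}

# Strip whitespace and lowercase to build a comparison key, then map to the label.
# Values missing from the mapping (and missing inputs) come out as NaN.
key = raw.str.strip().str.lower()
normalized = key.map(canonical)

print(normalized.tolist())
```

Because unmapped values become `NaN`, counting nulls before and after the mapping is a quick way to spot labels the dictionary does not cover.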


In [33]:
# Step 8.1: Remove any leading or trailing spaces from 'date_added' to avoid parsing errors

df["date_added"] = df["date_added"].str.strip()

In [34]:
# Step 8.2: Convert 'date_added' to proper datetime format ("%B %d, %Y"); invalid or null values become NaT

df["date_added"] = pd.to_datetime(df["date_added"], format="%B %d, %Y", errors="coerce")

In [43]:
# Display first few rows to verify the changes

df.head()

|   | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | duration_num | duration_type | year_added |
|---|---------|------|-------|----------|------|---------|------------|--------------|--------|----------|-----------|-------------|--------------|---------------|------------|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson |  | United States | 2021-09-25 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... | 90.0 | min | 2021.0 |
| 1 | s2 | Tv Show | Blood & Water |  | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | 2021-09-24 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... | 2.0 | seasons | 2021.0 |
| 2 | s3 | Tv Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... |  | 2021-09-24 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... | 1.0 | seasons | 2021.0 |
| 3 | s4 | Tv Show | Jailbirds New Orleans |  |  |  | 2021-09-24 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... | 1.0 | seasons | 2021.0 |
| 4 | s5 | Tv Show | Kota Factory |  | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | 2021-09-24 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... | 2.0 | seasons | 2021.0 |


In [37]:
# Count of null (NaT) values
df["date_added"].isna().sum()


np.int64(10)
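The `NaT` count after conversion (10) equals the number of missing `date_added` values reported earlier (10), which suggests that every non-null value parsed successfully. With `errors="coerce"`, failures are silent, so it can help to list any rows that had a value before parsing but not after. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up values: a clean date, one with a leading space, a missing value, and junk.
raw = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

cleaned = raw.str.strip()
parsed = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")

# Rows that held a value before parsing but are NaT afterwards failed to parse.
failed = cleaned[cleaned.notna() & parsed.isna()]

print("NaT after parsing:", int(parsed.isna().sum()))
print("Unparseable values:", failed.tolist())
```

An empty `failed` series means the only `NaT` values come from entries that were already missing.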

In [38]:
# Step 9.1: Split 'duration' into numeric and type parts for analysis

df["duration_num"] = df["duration"].str.extract(r'(\d+)', expand=False).astype(float)  # extract number
df["duration_type"] = df["duration"].str.extract(r'([A-Za-z]+)', expand=False)          # extract text

In [39]:
# Step 9.2: Standardize type: convert 'Season' and 'Seasons' → 'seasons', 'min' stays as 'min'

df["duration_type"] = df["duration_type"].str.lower().replace({"season": "seasons"})

In [40]:
# verify change
df[["duration", "duration_num", "duration_type"]].head()

|   | duration | duration_num | duration_type |
|---|----------|--------------|---------------|
| 0 | 90 min | 90.0 | min |
| 1 | 2 Seasons | 2.0 | seasons |
| 2 | 1 Season | 1.0 | seasons |
| 3 | 1 Season | 1.0 | seasons |
| 4 | 2 Seasons | 2.0 | seasons |
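The two extraction passes can also be combined into one regular expression with named groups, so the number and the unit always come from the same match. The sketch below is a variation on the steps above, not the notebook's own code. The anchored pattern assumes every value looks like `<number> <word>`, and the sample values are made up.

```python
import pandas as pd

# Made-up sample values, including a missing entry.
duration = pd.Series(["90 min", "2 Seasons", "1 Season", None])

# Named groups become the column names of the returned DataFrame.
# Values that do not match the whole pattern yield NaN in both columns.
parts = duration.str.extract(r"^(?P<duration_num>\d+)\s+(?P<duration_type>[A-Za-z]+)$")

parts["duration_num"] = parts["duration_num"].astype(float)
parts["duration_type"] = (
    parts["duration_type"].str.lower().replace({"season": "seasons"})
)

print(parts)
```

Here `replace` matches whole values rather than substrings, so `"seasons"` is left unchanged while `"season"` is pluralized.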


In [42]:
# Step 10: Create a new column 'year_added' by extracting the year from 'date_added'
df["year_added"] = df["date_added"].dt.year

# verify change
df[["date_added", "year_added"]].head()

|   | date_added | year_added |
|---|------------|------------|
| 0 | 2021-09-25 | 2021.0 |
| 1 | 2021-09-24 | 2021.0 |
| 2 | 2021-09-24 | 2021.0 |
| 3 | 2021-09-24 | 2021.0 |
| 4 | 2021-09-24 | 2021.0 |
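The years display as `2021.0` because `date_added` contains `NaT` values, and a column with missing entries is returned as floating point. If whole-number years are preferred, one option is pandas' nullable integer type. A minimal sketch on made-up dates:

```python
import pandas as pd

# Made-up dates with one missing entry, mirroring the NaT rows in 'date_added'.
date_added = pd.to_datetime(pd.Series(["2021-09-25", "2021-09-24", None]))

# Floating point here because of the missing value.
year_float = date_added.dt.year

# Nullable integer: whole numbers, with <NA> where the date is missing.
year_int = year_float.astype("Int64")

print(year_int)
```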


## Data Preprocessing Notes

The preprocessing steps above clean and standardize the dataset for analysis:

1. **Normalize Text Case**  
   - Standardized the `type` column by converting all values to **Title Case** (e.g., `"TV Show"` → `"Tv Show"`).

2. **Clean and Convert Dates**  
   - Removed leading/trailing spaces from the `date_added` column.  
   - Converted `date_added` to **datetime** by parsing it with the format `"%B %d, %Y"`; values now display as `YYYY-MM-DD`, and invalid or missing values become `NaT`.  

3. **Split Duration Column**  
   - Separated the `duration` column into two new columns:  
     - `duration_num`: numeric part representing minutes or number of seasons.  
     - `duration_type`: type of duration, standardized to `"min"` or `"seasons"` (e.g., `"Season"` → `"seasons"`).  

4. **Create Year Column**  
   - Extracted the year from `date_added` into a new column `year_added` for temporal analysis.

These preprocessing steps ensure that the dataset is clean, consistent, and ready for further processing.
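For reuse, the steps listed above can be gathered into a single function that works on a copy of the input. This is a sketch only: the function name `clean_titles` is hypothetical, and the sample rows are made up to exercise each step.

```python
import pandas as pd


def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the preprocessing steps to a copy of df (hypothetical helper)."""
    out = df.copy()

    # 1. Normalize text case in 'type' (note: "TV Show" becomes "Tv Show").
    out["type"] = out["type"].str.title()

    # 2. Strip whitespace, then parse dates; unparseable or missing values become NaT.
    out["date_added"] = pd.to_datetime(
        out["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
    )

    # 3. Split 'duration' into a number and a standardized unit.
    out["duration_num"] = out["duration"].str.extract(r"(\d+)", expand=False).astype(float)
    out["duration_type"] = (
        out["duration"]
        .str.extract(r"([A-Za-z]+)", expand=False)
        .str.lower()
        .replace({"season": "seasons"})
    )

    # 4. Extract the year the title was added.
    out["year_added"] = out["date_added"].dt.year
    return out


# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show"],
    "date_added": [" September 25, 2021", None],
    "duration": ["90 min", "1 Season"],
})

cleaned = clean_titles(sample)
print(cleaned)
```

Working on a copy leaves the original frame untouched, which makes it easier to rerun the cleaning after changing a step.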


In [44]:
# Save the cleaned DataFrame to a CSV file
output_file = "Netflix_Cleaned.csv"
df.to_csv(output_file, index=False)  # index=False to avoid adding row numbers as a separate column

# Download the saved CSV file to your local machine
files.download(output_file)


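CSV files store no type information, so a datetime column such as `date_added` comes back as text when the cleaned file is read again unless it is parsed on load. The sketch below shows the round trip using an in-memory buffer in place of a file on disk. The sample row is made up.

```python
import io

import pandas as pd

# Made-up cleaned frame with a datetime column.
cleaned = pd.DataFrame({
    "title": ["Example Film"],
    "date_added": pd.to_datetime(["2021-09-25"]),
})

# Write to an in-memory buffer (stands in for the CSV file).
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Plain read: the date column comes back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with parse_dates: the column is restored as datetime.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date_added"])

print(plain.dtypes)
print(restored.dtypes)
```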