# Netflix Data Cleaning and Standardization for Movies & TV Shows

## Importing Necessary Library and Dataset

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv(r'C:\Users\Aanchal Daryani\Downloads\netflix_titles.csv')

In [3]:
df

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [4]:
print("Initial dataset shape:", df.shape)

Initial dataset shape: (8807, 12)


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [6]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

## Handling Null Values

In [7]:
print("\nMissing values before cleaning:\n", df.isnull().sum())


Missing values before cleaning:
 show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64


In [8]:
df['director'] = df['director'].fillna("Not Available")
df['cast'] = df['cast'].fillna("Not Available")
df['country'] = df['country'].fillna("Unknown")


df = df.dropna(subset=['rating', 'duration', 'date_added'])
# Only a few rows are missing 'rating', 'duration', or 'date_added', so those rows are dropped with dropna
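The choice between filling and dropping can be backed by the share of missing values per column. From the counts above, `director` is missing in 2634 of 8807 rows (about 29.9%), while `date_added` is missing in only 10 (about 0.11%). A minimal sketch of that check follows; the `toy` frame and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the Netflix data (values are made up for illustration).
toy = pd.DataFrame({
    "director": ["A", None, None, "B"],
    "rating": ["PG", "R", None, "TV-MA"],
    "title": ["w", "x", "y", "z"],
})

# Percentage of missing values per column, largest first.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Columns with a large share are candidates for a placeholder fill, and columns with a tiny share can usually be dropped row-wise with little loss.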

In [9]:
df.isnull().sum()

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

## Checking for Duplicates

In [10]:
print("Total Duplicates:", df.duplicated().sum())
# There are no fully duplicated rows in the dataset.

Total Duplicates: 0
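Because every row carries its own `show_id`, a full-row comparison may not catch a title that was listed twice under different IDs. One possible follow-up is to compare on a subset of columns. The sketch below uses a made-up `toy` frame, and the key columns chosen (`type`, `title`, `release_year`) are an assumption about what identifies a title, not something the dataset guarantees.

```python
import pandas as pd

# Hypothetical rows: the same movie listed under two different IDs.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Zoom", "Zoom", "Zoom"],
    "release_year": [2006, 2006, 2006],
})

# Full-row check: the unique show_id hides the repeat.
full_dupes = toy.duplicated().sum()
print("Full-row duplicates:", full_dupes)

# Subset check: compare only the columns assumed to identify a title.
key = ["type", "title", "release_year"]
key_dupes = toy.duplicated(subset=key).sum()
print("Duplicates on key:", key_dupes)
```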


## Changing Datatypes

In [11]:
df.dtypes

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

In [12]:
df['date_added'] = df['date_added'].str.strip()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['date_added'] = df['date_added'].str.strip()
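The warning above most likely traces back to the earlier `dropna` call: pandas treats the filtered result as a possible slice of the original frame, so later column assignments are flagged. A common way to avoid it is to take an explicit copy right after filtering. This is a sketch on a made-up `toy` frame, not a re-run of the notebook.

```python
import pandas as pd

# Toy data with stray whitespace and one missing date (illustrative only).
toy = pd.DataFrame({
    "date_added": [" September 25, 2021", None, "July 1, 2019 "],
    "rating": ["PG-13", "R", "TV-Y7"],
})

# .copy() makes the filtered result an independent DataFrame,
# so assigning to its columns is unambiguous.
cleaned = toy.dropna(subset=["date_added"]).copy()
cleaned["date_added"] = cleaned["date_added"].str.strip()
print(cleaned["date_added"].tolist())
```

The original `toy` frame is left untouched, and the same pattern would apply to the later assignments in this notebook.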


In [13]:
## Changing date_added datatype to datetime
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce', dayfirst=True)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce', dayfirst=True)
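The source strings look like `September 25, 2021`, where the month is spelled out, so `dayfirst=True` should not change how they are read. An alternative is to state the format explicitly and count the values that fail to parse. A minimal sketch, assuming every date follows the `Month day, year` pattern; the `raw` series is made up for illustration.

```python
import pandas as pd

# Illustrative values; the last one is deliberately invalid.
raw = pd.Series(["September 25, 2021", "July 1, 2019", "not a date"])

# %B = full month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")
print(parsed)

# errors="coerce" turns unparseable strings into NaT, so count them.
n_failed = parsed.isna().sum()
print("Unparseable rows:", n_failed)
```

Checking the `NaT` count after coercion helps confirm that no dates were silently lost.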


In [14]:
df.dtypes

show_id                 object
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                  object
duration                object
listed_in               object
description             object
dtype: object

In [15]:
df['date_added'] = df['date_added'].dt.strftime('%d-%m-%Y')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['date_added'] = df['date_added'].dt.strftime('%d-%m-%Y')


In [17]:
df['date_added']

0       25-09-2021
1       24-09-2021
2       24-09-2021
3       24-09-2021
4       24-09-2021
           ...    
8802    20-11-2019
8803    01-07-2019
8804    01-11-2019
8805    11-01-2020
8806    02-03-2019
Name: date_added, Length: 8790, dtype: object
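Note that `strftime` turns the column back into text, as the `dtype: object` line above shows. Text in `DD-MM-YYYY` form sorts by day first rather than chronologically. The sketch below illustrates the difference on three made-up dates; keeping a datetime column for analysis and formatting only for display or export is one possible design, not the approach used in this notebook.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2021-09-25", "2019-11-20", "2020-01-11"]))

# After strftime the values are plain strings.
as_text = dates.dt.strftime("%d-%m-%Y")

# Sorting the strings compares characters, so the day decides the order.
text_order = as_text.sort_values().tolist()
print(text_order)

# Sorting the datetimes first keeps chronological order.
time_order = dates.sort_values().dt.strftime("%d-%m-%Y").tolist()
print(time_order)
```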

In [20]:
## Changing show_id to int
df['show_id'] = df['show_id'].str.replace('s', '', regex=False).str.strip().astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['show_id'] = df['show_id'].str.replace('s', '', regex=False).str.strip().astype(int)
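`str.replace('s', '')` removes every `s` in the string, which works here because the IDs have the form `s<number>`. A slightly stricter variant extracts only the digits that follow a leading `s`. This is a sketch on made-up IDs, assuming all values match that pattern; a value that does not match would become missing and make `astype(int)` fail, which surfaces the problem early.

```python
import pandas as pd

# Illustrative IDs, one with stray whitespace.
ids = pd.Series(["s1", "s2", " s8803 "])

# Anchor the pattern so only a leading "s" followed by digits is accepted.
numeric_ids = (
    ids.str.strip()
       .str.extract(r"^s(\d+)$", expand=False)
       .astype(int)
)
print(numeric_ids.tolist())
```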


## Renaming Columns

In [21]:
# Clean column names: Title Case with underscores
df.columns = (
    df.columns
    .str.strip()                           # remove leading/trailing spaces
    .str.replace(' ', '_', regex=False)    # replace spaces with underscores
    .str.title()                           # capitalize first letter of each word
)

# Check new column names
print(df.columns)


Index(['Show_Id', 'Type', 'Title', 'Director', 'Cast', 'Country', 'Date_Added',
       'Release_Year', 'Rating', 'Duration', 'Listed_In', 'Description'],
      dtype='object')
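The effect of each step in the chain is easier to see on column names that actually contain stray spaces. The names below are made up for illustration; the real columns were already clean apart from their case.

```python
import pandas as pd

# Hypothetical messy column names.
cols = pd.Index([" show_id", "date added", "release_year "])

cleaned_cols = (
    cols.str.strip()                         # drop leading/trailing spaces
        .str.replace(" ", "_", regex=False)  # inner spaces become underscores
        .str.title()                         # capitalize each underscore-separated word
)
print(list(cleaned_cols))
```

`str.title()` treats the underscore as a word boundary, which is why `show_id` becomes `Show_Id`.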


## Standardizing Text Data

In [22]:
# Show rows where 'Country' starts with a comma
rows_with_comma = df[df['Country'].str.startswith(',')]
print(rows_with_comma)


     Show_Id     Type            Title       Director  \
193      194  TV Show             D.P.  Not Available   
365      366    Movie  Eyes of a Thief   Najwa Najjar   

                                                  Cast            Country  \
193  Jung Hae-in, Koo Kyo-hwan, Kim Sung-kyun, Son ...      , South Korea   
365  Khaled Abol El Naga, Souad Massi, Suhail Hadda...  , France, Algeria   

     Date_Added  Release_Year Rating  Duration  \
193  27-08-2021          2021  TV-MA  1 Season   
365  30-07-2021          2014  TV-14   103 min   

                                            Listed_In  \
193                 International TV Shows, TV Dramas   
365  Dramas, Independent Movies, International Movies   

                                           Description  
193  A young private’s assignment to capture army d...  
365  After a decade in prison, a Palestinian man wi...  


In [23]:
# Remove only leading commas and spaces
df['Country'] = df['Country'].str.lstrip(', ').str.strip()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Country'] = df['Country'].str.lstrip(', ').str.strip()


In [24]:
df[df['Country'].str.startswith(',')]

,Show_Id,Type,Title,Director,Cast,Country,Date_Added,Release_Year,Rating,Duration,Listed_In,Description
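`str.lstrip(', ')` treats its argument as a set of characters, so it removes any run of commas and spaces from the left edge only. If trailing commas were also present (an assumption; the notebook only checks for leading ones), `str.strip(', ')` would handle both edges. A small sketch on made-up values:

```python
import pandas as pd

# Illustrative values; "India," is a hypothetical trailing-comma case.
country = pd.Series([", South Korea", ", France, Algeria", "India,", "United States"])

# Left edge only: inner commas and the trailing comma are kept.
left_cleaned = country.str.lstrip(", ")
print(left_cleaned.tolist())

# Both edges: the trailing comma is removed as well.
both_cleaned = country.str.strip(", ")
print(both_cleaned.tolist())
```

In both cases the comma inside `France, Algeria` is preserved, because only characters at the edges are stripped.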


In [25]:
df.columns

Index(['Show_Id', 'Type', 'Title', 'Director', 'Cast', 'Country', 'Date_Added',
       'Release_Year', 'Rating', 'Duration', 'Listed_In', 'Description'],
      dtype='object')

## Setting Show_Id as Index

In [26]:
# Set 'Show_Id' as the index
df.set_index('Show_Id', inplace=True)

# Check the DataFrame
print(df.head())


            Type                  Title         Director  \
Show_Id                                                    
1          Movie   Dick Johnson Is Dead  Kirsten Johnson   
2        TV Show          Blood & Water    Not Available   
3        TV Show              Ganglands  Julien Leclercq   
4        TV Show  Jailbirds New Orleans    Not Available   
5        TV Show           Kota Factory    Not Available   

                                                      Cast        Country  \
Show_Id                                                                     
1                                            Not Available  United States   
2        Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
3        Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...        Unknown   
4                                            Not Available        Unknown   
5        Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

         Date_Added  Release_Year Ratin
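An index is most useful when its values are unique. One way to guard against repeated IDs is to check uniqueness first, or to pass `verify_integrity=True` so that `set_index` raises on duplicates. The frame below is hypothetical and deliberately contains a repeated ID.

```python
import pandas as pd

# Hypothetical frame with a repeated ID (made up for illustration).
toy = pd.DataFrame({"Show_Id": [1, 2, 2], "Title": ["a", "b", "c"]})

print("IDs unique?", toy["Show_Id"].is_unique)

raised = False
try:
    # verify_integrity=True makes set_index raise on duplicate keys.
    toy.set_index("Show_Id", verify_integrity=True)
except ValueError as err:
    raised = True
    print("set_index refused:", err)
```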

In [27]:
pd.set_option("display.max_columns",None)

In [28]:
df

Show_Id,Type,Title,Director,Cast,Country,Date_Added,Release_Year,Rating,Duration,Listed_In,Description
1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Not Available,United States,25-09-2021,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
2,TV Show,Blood & Water,Not Available,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,24-09-2021,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Unknown,24-09-2021,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
4,TV Show,Jailbirds New Orleans,Not Available,Not Available,Unknown,24-09-2021,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
5,TV Show,Kota Factory,Not Available,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,24-09-2021,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...
8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,20-11-2019,2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8804,TV Show,Zombie Dumb,Not Available,Not Available,Unknown,01-07-2019,2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,01-11-2019,2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,11-01-2020,2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [29]:
df.to_csv("Cleaned_Netflix Data.csv",header=True)
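When the cleaned file is read back, the index and the date column need to be restored explicitly, since CSV stores everything as text. The sketch below round-trips a made-up two-row frame through an in-memory buffer instead of a file; the column names mirror the cleaned dataset, and the `%d-%m-%Y` format matches the strings written earlier.

```python
import io
import pandas as pd

# Two illustrative rows shaped like the cleaned data.
toy = pd.DataFrame(
    {"Title": ["Zodiac", "Zoom"], "Date_Added": ["20-11-2019", "11-01-2020"]},
    index=pd.Index([8803, 8806], name="Show_Id"),
)

buffer = io.StringIO()
toy.to_csv(buffer)   # the index is written as the first column
buffer.seek(0)

# Restore the index and parse the dates with the format used on export.
restored = pd.read_csv(buffer, index_col="Show_Id")
restored["Date_Added"] = pd.to_datetime(restored["Date_Added"], format="%d-%m-%Y")
print(restored)
```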

__This project cleans and preprocesses a Netflix movies and TV shows dataset. Key steps include handling missing values (filled with placeholders or dropped), standardizing text fields, renaming columns to a consistent format, converting data types (Show_Id to integer; date_added parsed as datetime and then stored as DD-MM-YYYY strings), and setting Show_Id as the index. The final dataset is consistent and ready for visualization or further analysis.__
