In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("/content/netflix_titles.csv")

In [3]:
df.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


**Checking for Missing Values**

In [36]:
print("Missing values before cleaning: \n")
print(df.isnull().sum())

Missing values before cleaning: 

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        98
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
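
Raw counts are easier to judge next to the share of rows they represent (for example, 2634 missing directors out of 8807 rows is about 30%). The sketch below shows one way to report both; the `toy` frame and its values are illustrative stand-ins, not taken from the Netflix file.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["PG", "TV-MA", None, "R"],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Running the same two expressions on `df` would give the equivalent table for the full dataset.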


In [37]:
# Shape before
print("Original shape:", df.shape)

Original shape: (8807, 12)


In [38]:
print("Number of duplicate rows:", df.duplicated().sum())

Number of duplicate rows: 0


In [40]:
# Drop exact duplicate rows (none were found above), then check the shape
df = df.drop_duplicates()
print("Shape after removing duplicates:", df.shape)

Shape after removing duplicates: (8807, 12)
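
`df.duplicated()` only flags rows that match in every column, so a title entered twice under different `show_id` values would not be counted. A minimal sketch of a stricter check on a subset of columns follows; the choice of `key_cols` is an assumption about what identifies a title, and the `toy` data is made up for illustration.

```python
import pandas as pd

# Toy data: the first two rows differ only in show_id.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Alpha", "Alpha", "Beta"],
    "type": ["Movie", "Movie", "TV Show"],
    "release_year": [2020, 2020, 2021],
})

# Full-row check: show_id differs, so no row is an exact duplicate.
full_dupes = toy.duplicated().sum()

# Subset check: rows sharing title, type and release year count as repeats.
key_cols = ["title", "type", "release_year"]  # assumed identifying columns
key_dupes = toy.duplicated(subset=key_cols).sum()

# Keep the first occurrence of each repeated key.
deduped = toy.drop_duplicates(subset=key_cols, keep="first")

print(full_dupes, key_dupes, deduped.shape)
```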


In [41]:
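# Normalize 'type' casing; str.title() alone turns "TV Show" into "Tv Show",
# so restore the acronym afterwards
df['type'] = df['type'].str.strip().str.title().str.replace("Tv Show", "TV Show", regex=False)

# Clean 'country' values: remove leading/trailing spaces, title case
df['country'] = df['country'].str.strip().str.title()


In [42]:
print(df['type'].unique())

['Movie' 'TV Show']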


In [43]:
print(df['country'].dropna().unique()[:10])

['United States' 'South Africa' 'India'
 'United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia'
 'United Kingdom' 'Germany, Czech Republic' 'Mexico' 'Turkey' 'Australia'
 'United States, India, France']
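
Several `country` cells hold a comma-separated list rather than a single country, so `unique()` treats each combination as its own value. If per-country counts are needed, one option is to split and explode the column, as sketched below on made-up rows (the `toy` frame is illustrative).

```python
import pandas as pd

# Toy rows: one multi-country cell, one single-country cell, one missing.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India, France", "India", None],
})

countries = (
    toy["country"]
    .dropna()          # ignore rows with no country
    .str.split(",")    # "a, b" -> ["a", " b"]
    .explode()         # one row per country
    .str.strip()       # remove the space left after each comma
)
countries = countries[countries != ""]  # drop empties from stray commas

country_counts = countries.value_counts()
print(country_counts)
```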


In [44]:
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

In [45]:
print(df['date_added'].head(10))

0   2021-09-25
1   2021-09-24
2   2021-09-24
3   2021-09-24
4   2021-09-24
5   2021-09-24
6   2021-09-24
7   2021-09-24
8   2021-09-24
9   2021-09-24
Name: date_added, dtype: datetime64[ns]
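
With `errors='coerce'`, any string that fails to parse silently becomes `NaT`, which then looks like a missing value. One possible cause is stray whitespace around the date text; that is an assumption about the raw file, not something the output above confirms. The sketch below strips whitespace, parses with an explicit format, and lists the entries that still fail. The `raw` series is invented for illustration.

```python
import pandas as pd

# Illustrative raw values: clean, leading space, missing, and unparseable.
raw = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

# Strip whitespace first, then parse with the "Month day, year" format.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Entries that had text but could not be parsed (as opposed to truly missing).
failed = raw[parsed.isna() & raw.notna()]

print(parsed)
print(failed)
```

Inspecting `failed` before filling anything makes it clear whether the gaps come from the source data or from the parsing step.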


In [15]:
# Make all column headers lowercase and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

In [46]:
print(df.columns.tolist())

['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']


In [47]:
# Check whether any column name still contains spaces or uppercase letters
spaces = [col for col in df.columns if " " in col]
uppercase = [col for col in df.columns if not col.islower()]

print("Columns with spaces:", spaces)
print("Columns with uppercase letters:", uppercase)

Columns with spaces: []
Columns with uppercase letters: []


In [48]:
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')

In [49]:
print("release_year dtype:", df['release_year'].dtype)
print("date_added dtype:", df['date_added'].dtype)

release_year dtype: int64
date_added dtype: datetime64[ns]


In [50]:
print("Missing values in release_year:", df['release_year'].isnull().sum())
print("Missing values in date_added:", df['date_added'].isnull().sum())

Missing values in release_year: 0
Missing values in date_added: 98


In [52]:
# Fill missing values by assigning the result back to the column;
# df[col].fillna(..., inplace=True) acts on a copy and is deprecated
df['director'] = df['director'].fillna("Unknown")
df['cast'] = df['cast'].fillna("Unknown")
df['country'] = df['country'].fillna("Unknown")
df['date_added'] = df['date_added'].ffill()  # Or use bfill() or dropna()


In [53]:
print("Missing values after cleaning:\n")
print(df.isnull().sum())

Missing values after cleaning:

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          4
duration        3
listed_in       0
description     0
dtype: int64
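
The three "Unknown" fills can also be written as a single `DataFrame.fillna` call with a column-to-value mapping, which avoids operating on an intermediate column object. A minimal sketch on a made-up frame (`toy` and its values are illustrative):

```python
import pandas as pd

# Toy frame with gaps in several text columns.
toy = pd.DataFrame({
    "director": ["X", None],
    "cast": [None, "Y"],
    "country": [None, None],
    "rating": ["PG", None],
})

# One call, one mapping; columns not listed (here 'rating') are left untouched.
toy = toy.fillna({"director": "Unknown", "cast": "Unknown", "country": "Unknown"})

print(toy.isnull().sum())
```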


In [60]:
df['rating'] = df['rating'].fillna('NR')
most_common_duration = df['duration'].mode()[0]
df['duration'] = df['duration'].fillna(most_common_duration)

In [61]:
print(df.isnull().sum())

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64
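
Because `duration` mixes minutes (movies) and season counts (TV shows), the overall mode can belong to the wrong kind of title. Whether that affects the three rows filled here is not visible from the output, so the following is only a sketch of an alternative: fill each gap with the mode of its own `type` group. The `toy` data and the helper `fill_with_group_mode` are hypothetical.

```python
import pandas as pd

# Toy data: a movie with a missing runtime among more numerous TV shows.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show", "TV Show"],
    "duration": ["90 min", "90 min", None, "1 Season", "1 Season", "1 Season", "2 Seasons"],
})

# The overall mode is a season count, which would be wrong for the movie.
overall_mode = toy["duration"].mode()[0]

def fill_with_group_mode(s: pd.Series) -> pd.Series:
    """Fill gaps with the group's most common value, if it has one."""
    modes = s.mode()
    return s.fillna(modes.iloc[0]) if not modes.empty else s

# Fill each missing duration using only titles of the same type.
toy["duration_filled"] = toy.groupby("type")["duration"].transform(fill_with_group_mode)

print(overall_mode)
print(toy)
```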


**In conclusion, the Netflix dataset (8807 rows, 12 columns) was cleaned and preprocessed with pandas. A check for duplicate rows found none. Missing values in director, cast, and country were filled with "Unknown", missing date_added values were forward-filled, missing ratings were set to "NR", and missing durations were filled with the most common duration. Content types and country names were standardized, date_added was converted to datetime, release_year was verified as numeric, and column names were confirmed to be lowercase with underscores. The resulting dataset has no missing values and is ready for analysis.**