### Select a real-world, messy dataset (e.g., a large CSV or JSON file) that has at least five features and over 1,000 records. Write a Python script to import the data, perform data cleaning (handle missing values using imputation or removal, remove duplicates), and perform data transformation (e.g., creating a new calculated feature). Finally, export the cleaned data to a new CSV file.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np


In [2]:
# Load dataset (use the path to your downloaded file)
df = pd.read_csv('netflix_titles.csv')

# Show first few records
df.head()


,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [3]:
# Check basic info to identify missing values and data types
df.info()

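# Check number of missing values in each column
# (not echoed below: a cell displays only its last expression, but the
# Non-Null Count column from df.info() carries the same information)
df.isnull().sum()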

# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
Duplicate rows: 0
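
The missing-value check above can also be expressed as a small per-column report with counts and percentages. The following is a minimal sketch on a made-up toy frame; the helper name `missing_summary` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["US", "IN", None, "FR"],
})

report = missing_summary(toy)
print(report)
```

Calling `missing_summary(df)` on the loaded DataFrame would give the same kind of report for the real columns, which can help decide between dropping and imputing.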


In [4]:
# Option 1: Drop rows that are missing key fields (director or cast)
df.dropna(subset=['director', 'cast'], inplace=True)

# Option 2: Fill missing values (imputation)
# Assign the result back to the column; calling fillna(..., inplace=True)
# on df[col] operates on a copy and is deprecated ahead of pandas 3.0.
# df['director'] = df['director'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])

# Note: 'duration' is stored as text (e.g., '90 min', '2 Seasons'), so a
# numeric median does not apply until the number is parsed out of the string.
# Its few missing values are left as-is here.

print("✅ Missing values handled.")


✅ Missing values handled.
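
Because `duration` holds strings such as `90 min` and `2 Seasons`, median imputation needs a numeric column first. The following is a minimal sketch on made-up toy rows; the column name `duration_value` and the choice to impute within each `type` are assumptions for illustration, not steps the notebook performs.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", None, "2 Seasons", "110 min", "1 Season"],
})

# Pull the leading integer out of strings such as '90 min' or '2 Seasons'
toy["duration_value"] = (
    toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)
)

# Impute within each type so minutes and seasons are never mixed
group_median = toy.groupby("type")["duration_value"].transform("median")
toy["duration_value"] = toy["duration_value"].fillna(group_median)

print(toy)
```

Grouping by `type` matters here because a movie's minutes and a show's season count are on different scales, so a single overall median would be hard to interpret.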


In [5]:
# Remove duplicates
df.drop_duplicates(inplace=True)

print(f"✅ Dataset after duplicate removal: {df.shape[0]} rows")


✅ Dataset after duplicate removal: 5700 rows
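
`drop_duplicates()` with no arguments only removes rows that match in every column. If each row carries its own identifier, near-duplicates can slip through. The following is a minimal sketch on made-up toy rows; treating a normalized title plus release year as the duplicate key is an assumption for illustration, and the helper column `title_key` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "title": ["Ganglands", "ganglands ", "Sankofa", "Sankofa"],
    "release_year": [2021, 2021, 1993, 2019],
})

# Normalize the title so case and stray whitespace do not hide duplicates
toy["title_key"] = toy["title"].str.strip().str.lower()

# Rows sharing the normalized title and the year count as duplicates; keep the first
deduped = toy.drop_duplicates(subset=["title_key", "release_year"], keep="first")
deduped = deduped.drop(columns="title_key")

print(deduped)
```

Whether this key is appropriate depends on the data: two different titles can legitimately share a name, which is why the year is included in the key.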


In [6]:
# Create new feature: 'release_decade'
df['release_decade'] = (df['release_year'] // 10) * 10

# Example of another feature: title length (vectorized string method)
df['title_length'] = df['title'].str.len()

df[['title', 'release_year', 'release_decade', 'title_length']].head()


,title,release_year,release_decade,title_length
2,Ganglands,2021,2020,9
5,Midnight Mass,2021,2020,13
6,My Little Pony: A New Generation,2021,2020,32
7,Sankofa,1993,1990,7
8,The Great British Baking Show,2021,2020,29
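
Another calculated feature could come from `date_added`, which is stored as text such as `September 25, 2021`. The following is a minimal sketch on made-up toy rows; the column names `year_added` and `years_to_catalog` are assumptions for illustration, and the whitespace stripping is a defensive step rather than something the notebook shows is needed.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "date_added": ["September 25, 2021", " August 4, 2017", None],
    "release_year": [2020, 2015, 2019],
})

# Parse 'Month day, year' strings; unparseable or missing values become NaT
parsed = pd.to_datetime(
    toy["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Derive the year the title was added and the gap since its release
toy["year_added"] = parsed.dt.year
toy["years_to_catalog"] = toy["year_added"] - toy["release_year"]

print(toy)
```

Rows with a missing `date_added` end up with `NaN` in both derived columns, so they stay visible for a later imputation or removal decision.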


In [7]:
# Verify the changes
df.info()
df.describe(include='all').T.head()


<class 'pandas.core.frame.DataFrame'>
Index: 5700 entries, 2 to 8806
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   show_id         5700 non-null   object
 1   type            5700 non-null   object
 2   title           5700 non-null   object
 3   director        5700 non-null   object
 4   cast            5700 non-null   object
 5   country         5700 non-null   object
 6   date_added      5700 non-null   object
 7   release_year    5700 non-null   int64 
 8   rating          5700 non-null   object
 9   duration        5697 non-null   object
 10  listed_in       5700 non-null   object
 11  description     5700 non-null   object
 12  release_decade  5700 non-null   int64 
 13  title_length    5700 non-null   int64 
dtypes: int64(3), object(11)
memory usage: 668.0+ KB


,count,unique,top,freq,mean,std,min,25%,50%,75%,max
show_id,5700,5700,s3,1,,,,,,,
type,5700,2,Movie,5522,,,,,,,
title,5700,5700,Ganglands,1,,,,,,,
director,5700,4152,"Raúl Campos, Jan Suter",18,,,,,,,
cast,5700,5512,"Vatsal Dubey, Julie Tejwani, Rupa Bhimani, Jig...",13,,,,,,,


In [8]:
# Save cleaned data to a new CSV
df.to_csv('netflix_titles_cleaned.csv', index=False)
print("✅ Cleaned dataset exported as 'netflix_titles_cleaned.csv'")


✅ Cleaned dataset exported as 'netflix_titles_cleaned.csv'
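
A quick way to confirm the export is to read the file back and compare it with the frame that was written. The following is a minimal sketch using a temporary directory and made-up toy rows; the file name `toy_cleaned.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned dataset (values are made up for illustration)
toy = pd.DataFrame({
    "title": ["Ganglands", "Sankofa"],
    "country": ["Unknown", "United States"],
    "release_year": [2021, 1993],
    "release_decade": [2020, 1990],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_cleaned.csv")
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare structure and content of the round-tripped frame
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(f"Shape preserved: {same_shape}, columns preserved: {same_columns}")
```

Writing with `index=False` also avoids an extra unnamed index column appearing when the CSV is loaded again.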
