# 🎬 Netflix Dataset – Data Cleaning (Internship Task 1)

**Task:** Clean and preprocess the Netflix Movies and TV Shows dataset using Pandas (Python).

📌 This task involves:
- Handling missing values.
- Removing duplicates.
- Standardizing formats.
- Converting dates.
- Renaming columns.
- Fixing data types.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt


## 📂 Step 1: Load the Dataset
We'll load the dataset and check its basic structure.

In [2]:
# Load dataset
df = pd.read_csv("netflix_movies_processed_for_tableau.csv")

# Preview the dataset
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,duration_minutes,seasons,genres
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2021.0,90.0,,Documentaries
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021.0,,2.0,International TV Shows
2,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021.0,,2.0,TV Dramas
3,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021.0,,2.0,TV Mysteries
4,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,2021.0,,1.0,Crime TV Shows
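Beyond `head()`, a quick look at the shape, the dtypes and the number of distinct `show_id` values helps when checking the basic structure. The preview shows `s2` three times, once per genre, so the row count and the title count are likely to differ. Below is a minimal sketch on a hypothetical toy frame (the `sample` rows are illustrative, not read from the real file):

```python
import pandas as pd

# Hypothetical toy rows that mimic the layout of the preview above.
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s2"],
    "type": ["Movie", "TV Show", "TV Show"],
    "duration_minutes": [90.0, None, None],
    "seasons": [None, 2.0, 2.0],
    "genres": ["Documentaries", "International TV Shows", "TV Dramas"],
})

n_rows, n_cols = sample.shape               # (rows, columns)
dtypes = sample.dtypes                      # one dtype per column
unique_shows = sample["show_id"].nunique()  # distinct titles, not rows

print(f"{n_rows} rows, {n_cols} columns, {unique_shows} distinct show_id values")
print(dtypes)
```

On the real data, the same three calls can be run on `df` directly.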


## 🧩 Step 2: Identify Missing Values
We’ll inspect and handle missing values in the dataset.

In [3]:
# Count missing values
df.isnull().sum()

show_id                 0
type                    0
title                   0
director             5884
cast                 1504
country              1722
date_added             20
release_year            0
rating                  6
duration                3
listed_in               0
description             0
year_added             20
duration_minutes     6136
seasons             13190
genres                  0
dtype: int64
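Raw counts are easier to judge as percentages, and splitting them by `type` can show which gaps are structural. In the preview, the movie row has `duration_minutes` but no `seasons`, and the TV show rows have the reverse, so part of the large counts above may simply reflect the title type. Here is a minimal sketch on hypothetical toy rows (the `sample` frame is illustrative only):

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use `df` instead.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "director": ["Kirsten Johnson", None, None, None],
    "duration_minutes": [90.0, None, None, 104.0],
    "seasons": [None, 2.0, 1.0, None],
})

# Share of missing values per column, as a percentage.
missing_pct = sample.isnull().mean().mul(100).round(1)

# Missing counts per column, split by title type.
missing_by_type = (
    sample.drop(columns="type")
    .isnull()
    .groupby(sample["type"])
    .sum()
)

print(missing_pct)
print(missing_by_type)
```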

In [13]:

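# Fill missing countries with a placeholder. Assign the result back:
# an inplace fill on a column selection may not update df in newer pandas versions.
df['country'] = df['country'].fillna('Unknown')

# Drop rows that lack a title or a type
df.dropna(subset=['title', 'type'], inplace=True)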

# Recheck missing values
df.isnull().sum()

show_id                 0
type                    0
title                   0
director             5884
cast                 1504
country                 0
date_added             20
release_year            0
rating                  6
duration                3
listed_in               0
description             0
year_added             20
duration_minutes     6136
seasons             13190
genres                  0
dtype: int64
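The recheck still shows gaps in `director`, `cast`, `rating` and a few other columns. One possible way to handle several text columns at once is to pass a column-to-placeholder mapping to `fillna`. This is only a sketch on hypothetical toy rows, and the placeholder strings (`"Unknown"`, `"Not Rated"`) are assumptions, not values used in this notebook:

```python
import pandas as pd

# Hypothetical toy rows with gaps in several text columns.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Kirsten Johnson", None, None],
    "cast": [None, "Ama Qamata", None],
    "rating": ["PG-13", None, "TV-MA"],
})

# Assumed placeholders: one fill value per column.
placeholders = {"director": "Unknown", "cast": "Unknown", "rating": "Not Rated"}

# fillna returns a new frame; `sample` itself is left unchanged.
filled = sample.fillna(value=placeholders)

remaining = int(filled.isnull().sum().sum())
print("Missing values left:", remaining)
```

Whether a placeholder or a dropped row is the better choice depends on how the column will be used later, so treat the mapping as a starting point.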

## 🔁 Step 3: Remove Duplicate Rows
We'll remove any rows that are exact duplicates of another row across all columns.

In [5]:

print("Original rows:", df.shape[0])

# Drop duplicates
df.drop_duplicates(inplace=True)

# New shape
print("After removing duplicates:", df.shape[0])

Original rows: 19323
After removing duplicates: 19323
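The row count does not change because `drop_duplicates()` only removes rows that match in every column. In the preview, `s2` appears three times, but each row has a different `genres` value, so none of them counts as a duplicate. The sketch below rebuilds those preview rows in a small frame and compares full-row deduplication with a per-title version; using `subset="show_id"` is an assumption about what "one row per title" would mean here, not a step from the notebook:

```python
import pandas as pd

# Rows copied from the preview: one row per (title, genre) pair.
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s2", "s3"],
    "title": ["Dick Johnson Is Dead", "Blood & Water", "Blood & Water",
              "Blood & Water", "Ganglands"],
    "genres": ["Documentaries", "International TV Shows", "TV Dramas",
               "TV Mysteries", "Crime TV Shows"],
})

# Rows that are identical in every column.
full_row_dupes = int(sample.duplicated().sum())

# One row per title: keep the first row seen for each show_id.
per_title = sample.drop_duplicates(subset="show_id", keep="first")

print("Full-row duplicates:", full_row_dupes)
print("Rows after per-title dedup:", len(per_title))
```

Collapsing to one row per title discards the per-genre rows, so it only makes sense if the `genres` column is not needed downstream.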


## ✏️ Step 4: Standardize Text Data
We'll clean up inconsistent text formatting (like case and spaces).

In [8]:

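# Title-case country names (e.g. "united states" -> "United States")
df['country'] = df['country'].str.title()

# Trim leading/trailing whitespace in every text (object-dtype) column
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)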
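To see what these two operations do, here is a minimal sketch on hypothetical messy values. It also collapses repeated inner spaces, which is an extra step not performed in the cell above, and it names the columns explicitly instead of testing the dtype:

```python
import pandas as pd

# Hypothetical messy values; missing entries stay missing throughout.
sample = pd.DataFrame({
    "title": ["  Blood & Water", "Ganglands  ", "Dick Johnson Is Dead"],
    "country": ["SOUTH AFRICA ", None, "  united   states"],
})

cleaned = sample.copy()
for col in ["title", "country"]:
    cleaned[col] = (
        cleaned[col]
        .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
        .str.strip()                           # trim both ends
    )

# Title-case only the country column; titles keep their original casing.
cleaned["country"] = cleaned["country"].str.title()

print(cleaned)
```

Note that `str.title()` capitalizes the letter after every non-letter character, so it is best kept to columns like `country` where that is unlikely to distort values.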

## 📅 Step 5: Convert Date Formats
Let's convert `date_added` to datetime format.

In [9]:

# Parse dates; values that cannot be parsed become NaT instead of raising an error
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

# Preview
df[['title', 'date_added']].head()

Unnamed: 0,title,date_added
0,Dick Johnson Is Dead,2021-09-25
1,Blood & Water,2021-09-24
2,Blood & Water,2021-09-24
3,Blood & Water,2021-09-24
4,Ganglands,2021-09-24
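With `errors='coerce'`, bad values silently turn into `NaT`, so it is worth counting them after parsing. The sketch below uses hypothetical input strings and assumes the ISO `YYYY-MM-DD` format seen in the preview. It also derives a year column as a nullable integer, which avoids the float values (`2021.0`) that `year_added` shows in the loaded data:

```python
import pandas as pd

# Hypothetical raw values: two valid dates, one bad string, one missing.
raw = pd.Series(["2021-09-25", "2021-09-24", "not a date", None],
                name="date_added")

# Assumed format: ISO dates, as in the preview. Unparseable values become NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Count the values that could not be parsed or were missing to begin with.
n_unparsed = int(parsed.isna().sum())

# Nullable integer dtype keeps years as 2021 rather than 2021.0.
year_added = parsed.dt.year.astype("Int64")

print("Unparsed or missing:", n_unparsed)
print(year_added)
```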


## 🏷️ Step 6: Rename Columns
We'll rename columns for consistency (lowercase, no spaces).

In [10]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'year_added', 'duration_minutes', 'seasons', 'genres'],
      dtype='object')
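The column names in this file already look clean, so the rename leaves them unchanged. To show the effect on less tidy headers, here is a minimal sketch with hypothetical column names; trimming whitespace and replacing hyphens are additions beyond what the cell above does:

```python
import pandas as pd

# Hypothetical messy headers (not from the real file).
messy = pd.DataFrame(columns=[" Show ID", "Date Added ", "Listed-In", "release_year"])

messy.columns = (
    messy.columns
    .str.strip()                               # drop leading/trailing spaces
    .str.lower()                               # lowercase everything
    .str.replace(r"[\s\-]+", "_", regex=True)  # spaces and hyphens -> underscore
)

print(list(messy.columns))
```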

## 💾 Step 7: Export Cleaned Dataset
We'll save the cleaned version to a new CSV file.

In [12]:

df.to_csv("netflix_cleaned dataset.csv", index=False)

print("✅ Cleaned dataset saved as 'netflix_cleaned dataset.csv'")

✅ Cleaned dataset saved as 'netflix_cleaned dataset.csv'
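CSV files do not store dtypes, so a datetime column comes back as plain text unless it is parsed again on load. A quick round trip can confirm the export, as in this minimal sketch, which writes a hypothetical two-row frame to a temporary folder (the file name `netflix_cleaned_check.csv` is made up for the example):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned frame with a real datetime column.
cleaned = pd.DataFrame({
    "title": ["Dick Johnson Is Dead", "Ganglands"],
    "date_added": pd.to_datetime(["2021-09-25", "2021-09-24"]),
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "netflix_cleaned_check.csv"  # made-up file name
    cleaned.to_csv(out_path, index=False)

    # Re-parse the date column while loading.
    reloaded = pd.read_csv(out_path, parse_dates=["date_added"])

same_shape = reloaded.shape == cleaned.shape
print("Same shape after reload:", same_shape)
print(reloaded.dtypes)
```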


## ✅ Summary of Cleaning Steps


- Filled missing `country` values with "Unknown"
- Dropped rows missing `title` or `type`
- Removed duplicate records
- Standardized text (title-cased `country`, trimmed whitespace in all text columns)
- Converted `date_added` to datetime format
- Renamed columns to lowercase with underscores
- Saved cleaned dataset as `netflix_cleaned dataset.csv`