# Task 1: Data Cleaning and Preprocessing

## Dataset Used:
[Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows)

## Tools:
- Python
- Pandas
- Google Colab

## Steps Performed:
- Handled missing values using `fillna()`
- Removed duplicate records with `drop_duplicates()`
- Standardized text fields (e.g., 'type', 'country')
- Converted 'date_added' to datetime format
- Renamed columns to lowercase with underscores
- Ensured correct data types (e.g., 'release_year' as integer)

## Output:
- `cleaned_netflix_dataset.csv`

## Learnings:
- Practical experience with data preprocessing using Pandas
- Improved understanding of handling missing values and data formatting


In [6]:
# Install pandas if it is missing (Colab ships with it preinstalled)
!pip install pandas

# Import libraries
import pandas as pd



In [3]:
# Upload the dataset
from google.colab import files
uploaded = files.upload()

Saving netflix_titles.csv to netflix_titles.csv


In [7]:
# Load the dataset
df = pd.read_csv('netflix_titles.csv')

In [8]:
# Display basic information
print("Original Dataset Info:")
df.info()
print("\nFirst 5 rows:")
print(df.head())


Original Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB

First 5 rows:
  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show          

In [9]:
# Check for missing values
print("\nMissing values before cleaning:")
print(df.isnull().sum())

# Fill missing values.
# Assign the result back to the column: chained `df[col].fillna(..., inplace=True)`
# triggers a pandas FutureWarning and stops working in pandas 3.0.
df['director'] = df['director'].fillna('No Director')
df['cast'] = df['cast'].fillna('No Cast')
df['country'] = df['country'].fillna('No Country')
df['date_added'] = df['date_added'].fillna(df['date_added'].mode()[0])
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])
df['duration'] = df['duration'].fillna(df['duration'].mode()[0])



Missing values before cleaning:
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
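
The per-column fills in this cell can also be written as a single `DataFrame.fillna` call that takes a dict of column-to-value pairs. The following is a minimal sketch on a small stand-in frame, not the notebook's own code; the `sample` frame and the `fill_values` name are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values); the notebook uses the full Netflix DataFrame.
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", None],
    "rating": ["PG-13", None],
    "duration": ["90 min", "90 min"],
})

# One fill value per column: a placeholder for free text, the mode for a categorical.
fill_values = {
    "director": "No Director",
    "rating": sample["rating"].mode()[0],
}

# fillna returns a new frame, so the result is assigned back rather than using inplace.
sample = sample.fillna(fill_values)
print(sample.isnull().sum())
```

Keeping the fill values in one dict makes it easy to review, in a single place, which columns receive a placeholder and which receive the mode.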

In [10]:
# Remove duplicate records
df.drop_duplicates(inplace=True)
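
Before dropping rows it can help to count how many duplicates exist, and to check whether any rows repeat on key columns while differing elsewhere. The sketch below uses a hypothetical three-row frame; choosing `title` and `type` as the key columns is an assumption for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 are exact duplicates; row 2 repeats only the title.
sample = pd.DataFrame({
    "show_id": ["s1", "s1", "s2"],
    "title": ["Blood & Water", "Blood & Water", "Blood & Water"],
    "type": ["TV Show", "TV Show", "Movie"],
})

# Rows identical across every column (what drop_duplicates() removes by default).
exact_dupes = int(sample.duplicated().sum())

# Rows that repeat on a chosen subset of columns only.
title_type_dupes = int(sample.duplicated(subset=["title", "type"]).sum())

print(f"Exact duplicates: {exact_dupes}")
print(f"Duplicates on title + type: {title_type_dupes}")

# Drop exact duplicates and rebuild a clean 0..n-1 index.
deduped = sample.drop_duplicates().reset_index(drop=True)
print(f"Rows after dropping: {len(deduped)}")
```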


In [11]:
# Standardize 'type' and 'country' fields
df['type'] = df['type'].str.strip().str.title()
df['country'] = df['country'].str.strip().str.title()
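
Note that `str.title()` capitalizes the first letter of each word and lowercases the rest, so an acronym such as `TV` comes out as `Tv`. The sketch below shows that effect on a few sample values and one possible alternative, a case-insensitive mapping to canonical labels. The `canonical` dict is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

types = pd.Series([" TV Show", "Movie ", "tv show"])

# Strip whitespace, then title-case every word (the approach used for 'type' and 'country').
titled = types.str.strip().str.title()
print(titled.tolist())   # ['Tv Show', 'Movie', 'Tv Show']

# Alternative (an assumption): normalize case, then map to the labels you want to keep.
canonical = {"tv show": "TV Show", "movie": "Movie"}
mapped = types.str.strip().str.lower().map(canonical)
print(mapped.tolist())   # ['TV Show', 'Movie', 'TV Show']
```

With `map`, any value missing from the dict becomes `NaN`, which doubles as a quick check for unexpected categories.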


In [12]:
# Convert 'date_added' to datetime format
# (strip stray whitespace first; values that still fail to parse become NaT)
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')
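
Because `errors='coerce'` silently turns unparseable values into `NaT`, it is worth counting how many rows were coerced. The sketch below assumes the dates follow a "Month day, year" pattern; the sample strings and the explicit `format` are assumptions for illustration, not taken from the notebook output.

```python
import pandas as pd

# Hypothetical sample: one clean date, one with a leading space, one invalid value.
dates = pd.Series(["September 25, 2021", " August 4, 2017", "not a date"])

# Strip whitespace, then parse with an explicit format; failures become NaT.
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y", errors="coerce")

# Count the values that could not be parsed.
n_failed = int(parsed.isna().sum())
print(parsed)
print(f"Unparseable dates: {n_failed}")
```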


In [13]:
# Rename columns to lowercase with underscores
df.columns = [col.lower().replace(' ', '_') for col in df.columns]


In [14]:
# Ensure 'release_year' is integer
# (nullable Int64 keeps the column integer even if a value is coerced to NaN)
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce').astype('Int64')
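
As a small illustration of the dtype behavior: `pd.to_numeric(..., errors='coerce')` returns a float column as soon as one value fails to parse, while pandas' nullable `Int64` dtype can hold integers alongside missing values. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample with one value that cannot be parsed as a number.
years = pd.Series(["2020", "2021", "unknown"])

# Coercion alone produces float64 because NaN cannot live in a plain int64 column.
as_float = pd.to_numeric(years, errors="coerce")
print(as_float.dtype)

# Casting to the nullable Int64 dtype keeps whole numbers and marks the gap as <NA>.
as_nullable_int = as_float.astype("Int64")
print(as_nullable_int.dtype)
print(as_nullable_int.tolist())
```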


In [15]:
# Save the cleaned dataset
df.to_csv("cleaned_netflix_dataset.csv", index=False)
print("\n✅ Cleaned dataset saved as 'cleaned_netflix_dataset.csv'")



✅ Cleaned dataset saved as 'cleaned_netflix_dataset.csv'
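
A CSV file does not store dtypes, so a quick way to verify the saved output is to reload it and inspect the result. The sketch below writes a small hypothetical frame to a temporary file instead of the real `cleaned_netflix_dataset.csv`; the sample rows and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned frame.
cleaned = pd.DataFrame({
    "title": ["Sample Movie", "Sample Show"],
    "date_added": pd.to_datetime(["2021-09-25", "2021-09-24"]),
    "release_year": [2020, 2021],
})

# Write to a temporary location without the index column.
path = os.path.join(tempfile.mkdtemp(), "cleaned_sample.csv")
cleaned.to_csv(path, index=False)

# Dates come back as plain strings unless parse_dates is requested on reload.
reloaded = pd.read_csv(path, parse_dates=["date_added"])
print(reloaded.dtypes)
print(reloaded.isnull().sum())
```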
