# Task 1: Data Cleaning and Preprocessing

## Dataset Used
[Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows)

## Tools Used
- Python (Pandas)

## Cleaning Performed
- Removed 11 duplicate entries
- Handled missing values in `director`, `cast`, `country`, `date_added`, and `rating`
- Standardized column names (lowercase, underscores)
- Converted `date_added` to datetime format
- Cleaned text formats in columns like `type` and `country`
- Ensured data types are correct for all columns

✅ Final cleaned dataset saved as `cleaned_netflix_titles.csv`.


In [62]:
import pandas as pd

In [63]:
df = pd.read_csv('netflix_titles.csv')

In [64]:
# info() prints its summary directly and returns None, so it needs no print().
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [65]:
print(df.describe())

       release_year
count   8807.000000
mean    2014.180198
std        8.819312
min     1925.000000
25%     2013.000000
50%     2017.000000
75%     2019.000000
max     2021.000000


In [66]:
print(df.head())

  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                cast        country  \
0                                                NaN  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN   
3                                                NaN            NaN   
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

           date_added  release_year rating   duration  \
0  September 25, 2021          2020  PG-13     90 min   
1  September 24, 2021          2021  TV-MA  2 Seasons   
2  September 24, 2021        

In [67]:
print(df.isnull().sum())
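# ffill() replaces the deprecated fillna(method='ffill'); it copies the previous
# row's value into each gap, so later 'Unknown' fills only reach leading NaNs.
df.ffill(inplace=True)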

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64



In [68]:
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])
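
The per-column fills above can be wrapped in a small helper so that the remaining gaps are easy to check afterwards. This is a minimal sketch on a toy frame, not the notebook's actual code: `fill_missing` and the sample rows are hypothetical.

```python
import pandas as pd

# Toy rows standing in for netflix_titles.csv (illustrative values only).
toy = pd.DataFrame({
    "director": ["Kirsten Johnson", None, "Julien Leclercq", None],
    "cast": [None, "Ama Qamata", None, "Mayur More"],
    "country": ["United States", "South Africa", None, "India"],
    "rating": ["TV-MA", "TV-MA", None, "PG-13"],
})


def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with text gaps labelled 'Unknown' and rating set to its mode."""
    out = frame.copy()
    for col in ["director", "cast", "country"]:
        out[col] = out[col].fillna("Unknown")
    # mode() ignores missing values; [0] takes the first entry if there is a tie.
    out["rating"] = out["rating"].fillna(out["rating"].mode()[0])
    return out


filled = fill_missing(toy)
print(filled.isnull().sum())
```

Returning a copy rather than mutating in place keeps the raw frame available for comparing null counts before and after.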

In [69]:
df.drop_duplicates(inplace=True)
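
The cell above drops duplicates without reporting how many rows were affected. One way to record that number is to count them first, sketched here on a hypothetical toy frame (the column values are made up for illustration):

```python
import pandas as pd

# Toy frame with one fully repeated row.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "title": ["A", "B", "B", "C"],
})

# duplicated() marks every repeat of an earlier, fully identical row.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate rows; {len(deduped)} remain")
```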

In [70]:
df.columns = [col.lower().replace(' ', '_') for col in df.columns]

In [71]:
df['release_year'] = df['release_year'].astype(int)
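
The cleaning list mentions converting `date_added` to datetime, but no cell above shows that step. A minimal sketch of one way to do it follows, assuming the values look like the `September 25, 2021` strings in the `df.head()` output; the toy rows, including the one with stray leading whitespace, are illustrative.

```python
import pandas as pd

# Toy column: a clean date, one with leading whitespace, and a missing value.
toy = pd.DataFrame({
    "date_added": ["September 25, 2021", " August 4, 2017", None],
})

# Strip whitespace first, then parse. errors="coerce" turns missing or
# unparseable entries into NaT instead of raising.
toy["date_added"] = pd.to_datetime(
    toy["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

print(toy.dtypes)
```

Parsing before any fill step keeps genuinely missing dates visible as `NaT`, which can then be counted or handled explicitly.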

In [72]:
# Strip whitespace only: str.title() would turn 'TV Show' into 'Tv Show'.
df['type'] = df['type'].str.strip()
df['country'] = df['country'].str.strip()
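
After text cleanup it can help to confirm that a categorical column holds only the values you expect. The sketch below uses a toy frame and assumes `Movie` and `TV Show`, the two values visible in the `df.head()` output, are the full set; adjust `expected` if the real data differs.

```python
import pandas as pd

# Toy column with stray whitespace around otherwise valid values.
toy = pd.DataFrame({"type": [" Movie", "TV Show ", "Movie"]})
toy["type"] = toy["type"].str.strip()

# Assumed set of valid categories (taken from the df.head() output).
expected = {"Movie", "TV Show"}
unexpected = set(toy["type"].unique()) - expected

print(unexpected or "all type values are as expected")
```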

In [73]:
df.to_csv('cleaned_netflix_titles.csv', index=False)
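
CSV files do not store dtypes, so a datetime column comes back as plain text unless it is parsed again on load. The following sketch shows a quick round-trip check on a toy frame; the file name `cleaned_toy.csv` and the sample rows are hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "date_added": pd.to_datetime(["2021-09-25", "2021-09-24"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_toy.csv")  # hypothetical file name
    toy.to_csv(path, index=False)
    # Dates are written as text, so ask read_csv to parse them again.
    reloaded = pd.read_csv(path, parse_dates=["date_added"])

print(reloaded.dtypes)
```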