# Task 1: Data Cleaning and Preprocessing
# Objective: Clean and prepare the Netflix Movies and TV Shows dataset by handling nulls, duplicates, and inconsistent formats.

In [57]:
# Import library
import pandas as pd

In [58]:
# Load the Dataset
df = pd.read_csv('/content/netflix_titles.csv')

In [59]:
# Get column headings
print(df.columns)

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')


In [60]:
# View info
print("Initial Shape:", df.shape)
df.info()  # info() prints its report directly and returns None, so no print() is needed

Initial Shape: (8807, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [61]:
# Display the first few rows of the Dataset
print("First few rows of the dataset:")
print(df.head())

First few rows of the dataset:
  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                cast        country  \
0                                                NaN  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN   
3                                                NaN            NaN   
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

           date_added  release_year rating   duration  \
0  September 25, 2021          2020  PG-13     90 min   
1  September 24, 2021          2021  TV-MA  2 Seasons  

In [62]:
# Remove duplicates
df = df.drop_duplicates()
print("After Removing Duplicates:", df.shape)

After Removing Duplicates: (8807, 12)
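
The exact-row check above leaves the shape at (8807, 12), which is unsurprising if every row carries its own `show_id`. The sketch below, on a small made-up frame, shows how duplicates could also be checked on a subset of columns. The key used here (`type`, `title`, `release_year`) is an assumption for illustration, not a rule documented for this dataset.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are illustrative only)
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Example A", "Example A", "Example B"],
    "release_year": [2020, 2020, 2021],
})

# Exact-row duplicates: none, because show_id differs on every row
exact_dupes = int(toy.duplicated().sum())

# Duplicates on an assumed key: type + title + release_year
key = ["type", "title", "release_year"]
key_dupes = int(toy.duplicated(subset=key).sum())

# Keep the first occurrence of each key combination
deduped = toy.drop_duplicates(subset=key, keep="first")

print("Exact duplicates:", exact_dupes)
print("Key duplicates:", key_dupes)
print("Shape after key-based de-duplication:", deduped.shape)
```

Whether key-based duplicates should be dropped depends on the data: two rows with the same title and year may be distinct works, so inspecting them before deleting is usually safer.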


In [63]:
# Count missing values per column
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)

Missing Values:
 show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
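
Raw counts are easier to judge next to the share of rows they represent (for example, 2634 missing directors out of 8807 rows is roughly 30%). A minimal sketch of such a report follows; `missing_report` is a hypothetical helper name and the toy frame is made up.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns that have gaps."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    report = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns with at least one missing value, largest first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "director": ["A", None, None, None],
    "rating": ["PG", "PG", None, "TV-MA"],
    "title": ["w", "x", "y", "z"],
})

report = missing_report(toy)
print(report)
```

On the real data the same call would be `missing_report(df)`.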


In [64]:
# Fill missing 'director' and 'cast' with 'Not Available'
df['director'] = df['director'].fillna('Not Available')
df['cast'] = df['cast'].fillna('Not Available')

In [65]:
# Fill missing 'country' with 'Unknown'
df['country'] = df['country'].fillna('Unknown')

In [66]:
# Fill missing 'date_added' with the most common date (the mode)
# Convert to datetime first; strip stray whitespace so valid dates are not coerced to NaT
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')

# Fill missing values with the most common date
most_common_date = df['date_added'].mode()[0]
df['date_added'] = df['date_added'].fillna(most_common_date)

# Convert to a string in dd-mm-yyyy format (the column dtype becomes object again)
df['date_added'] = df['date_added'].dt.strftime('%d-%m-%Y')


In [67]:
print(df['date_added'].head())
print(df['date_added'].dtype)  # Should be object (string) after formatting


0    25-09-2021
1    24-09-2021
2    24-09-2021
3    24-09-2021
4    24-09-2021
Name: date_added, dtype: object
object
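
Because `errors='coerce'` silently turns anything unparseable into `NaT`, it can help to count how many values were lost and to keep a real datetime column for later analysis. The sketch below uses made-up values and assumes the format is "Month day, year", as in the rows shown by `df.head()`; the variable names are illustrative.

```python
import pandas as pd

# Toy values: a clean date, one with leading whitespace, a gap, and junk
raw = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

# Strip stray whitespace so every entry can match the same format
cleaned = raw.str.strip()

# Parse with an explicit format (assumed); failures become NaT
parsed = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")

# Count entries that did not parse, including the originally missing one
n_unparsed = int(parsed.isna().sum())

# Build the dd-mm-yyyy strings separately and keep `parsed` as datetime
formatted = parsed.dt.strftime("%d-%m-%Y")

print("dtype:", parsed.dtype)
print("unparsed:", n_unparsed)
print(formatted.tolist())
```

Keeping the datetime column alongside the formatted strings preserves sorting and `.dt` accessors such as `.dt.year`, which a string column does not support.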


In [68]:
# Fill missing 'rating' with 'Not Rated'
df['rating'] = df['rating'].fillna('Not Rated')

In [69]:
# Fill missing 'duration' with 'Unknown'
df['duration'] = df['duration'].fillna('Unknown')
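
The `duration` column mixes minutes for movies with season counts for TV shows (see `90 min` and `2 Seasons` in the head output), so it stays text. A possible follow-up, sketched here on made-up values, splits it into a number and a unit. The column names `value` and `unit` and the regular expression are assumptions for illustration.

```python
import pandas as pd

# Toy values mirroring the two formats seen above, plus the 'Unknown' filler
duration = pd.Series(["90 min", "2 Seasons", "1 Season", "Unknown"])

# Capture the leading number and the unit word; non-matching rows become NaN
parts = duration.str.extract(r"^(?P<value>\d+)\s+(?P<unit>min|Seasons?)$")

# Nullable integer dtype keeps missing values without switching to float
parts["value"] = pd.to_numeric(parts["value"]).astype("Int64")

# Normalise 'Season' and 'Seasons' to a single label
parts["unit"] = parts["unit"].replace({"Seasons": "season", "Season": "season"})

print(parts)
```

Rows filled with 'Unknown' end up with missing values in both new columns, which keeps them distinguishable from real durations.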

In [70]:
# Standardize column names
df.columns = df.columns.str.lower().str.strip().str.replace(" ", "_")

In [71]:
# Data type checks
print(df.dtypes)

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


In [72]:
# Final check for nulls
print("Nulls after cleaning:\n", df.isnull().sum())

Nulls after cleaning:
 show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64


# Netflix Data Cleaning and Preprocessing

This notebook performs initial data cleaning and preprocessing on the Netflix Movies and TV Shows dataset.

**Steps Performed:**

1.  **Loading Data**: The dataset was loaded into a pandas DataFrame.
2.  **Initial Inspection**: The shape, column names, data types, and the first few rows of the dataset were inspected.
3.  **Handling Duplicates**: `drop_duplicates()` was applied; the dataset contained no duplicate rows, so the shape stayed at (8807, 12).
4.  **Handling Missing Values**: Missing values were filled as follows: 'director' and 'cast' with 'Not Available', 'country' and 'duration' with 'Unknown', 'rating' with 'Not Rated', and 'date_added' with the most frequent date (the mode). 'date_added' was parsed to datetime before filling and then stored as a string in dd-mm-yyyy format.
5.  **Standardizing Column Names**: Column names were converted to lowercase, stripped of leading/trailing whitespace, and spaces were replaced with underscores.
6.  **Final Check**: Null counts and data types were checked after cleaning; no nulls remain.
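
The steps above can be gathered into one reusable function. This is a minimal sketch rather than the notebook's own code: `clean_netflix` is a hypothetical name, the explicit date format is an assumption based on the rows shown earlier, and the toy frame is made up.

```python
import pandas as pd

def clean_netflix(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above to a copy of `frame`."""
    out = frame.drop_duplicates().copy()
    out.columns = out.columns.str.lower().str.strip().str.replace(" ", "_")

    # Placeholder text for missing categorical values
    fill_values = {
        "director": "Not Available",
        "cast": "Not Available",
        "country": "Unknown",
        "rating": "Not Rated",
        "duration": "Unknown",
    }
    out = out.fillna(value=fill_values)

    # Parse dates (format is an assumption), fill gaps with the mode,
    # then store as dd-mm-yyyy strings. mode()[0] fails if no date parses.
    dates = pd.to_datetime(
        out["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
    )
    dates = dates.fillna(dates.mode()[0])
    out["date_added"] = dates.dt.strftime("%d-%m-%Y")
    return out

# Toy frame: one duplicate row, one row full of gaps, one padded date
toy = pd.DataFrame({
    "Title ": ["A", "A", "B", "C", "D"],
    "director": ["Dir", "Dir", None, "Dir2", "Dir3"],
    "cast": ["X", "X", None, "Y", "Z"],
    "country": ["India", "India", None, "Japan", "Spain"],
    "date_added": ["September 25, 2021", "September 25, 2021", None,
                   " August 4, 2017", "September 25, 2021"],
    "rating": ["PG", "PG", None, "TV-MA", "R"],
    "duration": ["90 min", "90 min", None, "2 Seasons", "100 min"],
})

cleaned = clean_netflix(toy)
print(cleaned)
print("Nulls remaining:", int(cleaned.isnull().sum().sum()))
```

With the real file this would be called as `clean_netflix(pd.read_csv('/content/netflix_titles.csv'))`.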