# Netflix Dataset Cleaning – Task 1 (Data Analyst Internship)

**Objective:** Clean and preprocess the Netflix Movies and TV Shows dataset by filling missing values, checking for duplicate rows, standardizing inconsistent formatting, converting `date_added` to a datetime type, and normalizing column names.

**Tools Used:** Python, Pandas


In [2]:
import pandas as pd

In [8]:
df = pd.read_csv('netflix_titles.csv')

In [9]:
# Preview the first five rows; head is a method, so it must be called with ()
df.head()

     show_id     type                  title         director  \
0         s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1         s2  TV Show          Blood & Water              NaN   
2         s3  TV Show              Ganglands  Julien Leclercq   
3         s4  TV Show  Jailbirds New Orleans              NaN   
4         s5  TV Show           Kota Factory              NaN   

                                                   cast        country  \
0                                                   NaN  United States   
1     Ama Qamata, Khosi Ngema, Gail Mabal

In [10]:
# Basic shape and structure
print("Dataset shape:", df.shape)
print("\nColumn names:\n", df.columns.tolist())

# Detailed column info
df.info()

Dataset shape: (8807, 12)

Column names:
 ['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [11]:
df.isnull().sum().sort_values(ascending=False)

director        2634
country          831
cast             825
date_added        10
rating             4
duration           3
show_id            0
type               0
title              0
release_year       0
listed_in          0
description        0
dtype: int64
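
Raw counts are easier to judge next to the share of rows they represent (for example, 2634 of 8807 rows is roughly 30% for `director`). The following is a minimal sketch of such a summary; the `missing_summary` helper and the toy `sample` frame are hypothetical and only stand in for the real data.

```python
import pandas as pd

# Toy stand-in for the Netflix data (illustrative values, not taken from the real file)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["India", None, "Mexico", "Turkey"],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(sample)
print(summary)
```

Applied to the notebook's data, the call would be `missing_summary(df)`.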

In [12]:
# About 30% of titles have no director listed; use a placeholder instead of dropping rows
df['director'] = df['director'].fillna('Not Specified')

In [13]:
df.isnull().sum().sort_values(ascending=False)

country         831
cast            825
date_added       10
rating            4
duration          3
title             0
type              0
show_id           0
director          0
release_year      0
listed_in         0
description       0
dtype: int64

In [14]:
df['cast'] = df['cast'].fillna('Unknown')


In [15]:
df['country'] = df['country'].fillna('Unknown')


In [16]:
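# Temporary placeholder so the null check below reads zero; it is restored to NaT before datetime parsing
df['date_added'] = df['date_added'].fillna('Not Added')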


In [17]:
df['rating'] = df['rating'].fillna('Not Rated')


In [18]:
df['duration'] = df['duration'].fillna('Unknown')


In [19]:
df.isnull().sum().sort_values(ascending=False)

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64
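
The per-column fills above could also be written as a single `fillna` call with a column-to-value mapping. This is only a sketch on a toy frame: the `sample` rows are illustrative, and `date_added` is left out on purpose because it is converted to a datetime later, where a text placeholder does not fit.

```python
import pandas as pd

# Toy rows with gaps in different columns (illustrative values)
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", None],
    "cast": [None, "Ama Qamata"],
    "country": ["United States", None],
    "rating": [None, "TV-MA"],
    "duration": ["90 min", None],
})

# Same placeholders as the individual cells, gathered into one mapping
placeholders = {
    "director": "Not Specified",
    "cast": "Unknown",
    "country": "Unknown",
    "rating": "Not Rated",
    "duration": "Unknown",
}

# fillna accepts a dict keyed by column name; unlisted columns are left untouched
filled = sample.fillna(value=placeholders)
print(filled.isnull().sum())
```

Keeping the placeholders in one dictionary makes it easier to review which value stands for "missing" in each column.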

In [20]:
# Check number of duplicate rows
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates}")

Duplicate rows: 0
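
`df.duplicated()` compares whole rows, so a unique `show_id` on every row guarantees a count of zero even if the same title was listed twice. A possible follow-up check, sketched here on toy data, compares a normalized title within each `type`; the `title_key` column is a hypothetical helper and the sample rows are made up for illustration.

```python
import pandas as pd

# Toy rows: the first and third differ only in case and trailing whitespace
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "TV Show", "Movie"],
    "title": ["Zodiac", "Ganglands", "zodiac "],
})

# Exact row duplicates: none, because show_id differs on every row
exact = sample.duplicated().sum()

# Normalize the title so case and stray spaces do not hide repeats
key = sample.assign(title_key=sample["title"].str.strip().str.lower())
near = key.duplicated(subset=["type", "title_key"], keep=False)

print("Exact duplicates:", exact)
print(sample[near])
```

Rows flagged this way are only candidates; remakes and re-releases can legitimately share a title, so they would need a manual look before anything is dropped.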


In [21]:
# Trim whitespace and title-case country names (the 'Unknown' placeholder is unaffected)
df['country'] = df['country'].str.strip().str.title()


In [22]:
# Trim whitespace and upper-case rating codes (note: 'Not Rated' becomes 'NOT RATED')
df['rating'] = df['rating'].str.strip().str.upper()


In [23]:
df['duration'] = df['duration'].str.strip()


In [24]:
# Spot-check the first 10 distinct values after normalization
df['country'].unique()[:10]

array(['United States', 'South Africa', 'Unknown', 'India',
       'United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia',
       'United Kingdom', 'Germany, Czech Republic', 'Mexico', 'Turkey',
       'Australia'], dtype=object)
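
The output shows that some `country` values hold several comma-separated countries in one string, so counting by the raw column would treat each combination as its own value. One way to count per country, sketched on toy rows with hypothetical `show_id` values, is to split on commas and explode the result into one row per country.

```python
import pandas as pd

# Toy rows; the show_id values are hypothetical
sample = pd.DataFrame({
    "show_id": ["x1", "x2", "x3"],
    "country": ["United States", "Germany, Czech Republic", "Germany"],
})

# Split each string into a list, then expand to one row per (title, country) pair
countries = sample.assign(country=sample["country"].str.split(",")).explode("country")

# Remove the space left behind after each comma
countries["country"] = countries["country"].str.strip()

counts = countries["country"].value_counts()
print(counts)
```

The exploded frame repeats `show_id` for co-productions, so it is better kept as a separate table than written back over `df`.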

In [25]:
# Undo the 'Not Added' placeholder so missing dates are stored as NaT
df['date_added'] = df['date_added'].replace('Not Added', pd.NaT)

In [26]:
# Parse to datetime; errors='coerce' turns any unparseable string into NaT instead of raising
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

In [27]:
print(df['date_added'].dtype)


datetime64[ns]


In [28]:
df['date_added'].head()


0   2021-09-25
1   2021-09-24
2   2021-09-24
3   2021-09-24
4   2021-09-24
Name: date_added, dtype: datetime64[ns]
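
Because `errors='coerce'` silently converts anything it cannot parse into `NaT`, it can be worth counting how many non-missing values were lost in the conversion. The sketch below assumes the raw strings look like `"September 25, 2021"` and that some carry stray leading whitespace; both are assumptions about the raw file, and the `raw` series is toy data.

```python
import pandas as pd

# Toy raw values: a clean date, one with a leading space, a missing entry, and junk
raw = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

# Strip stray whitespace first; missing entries stay missing
cleaned = raw.str.strip()

# Assumed format: full month name, day, four-digit year
parsed = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")

# Rows that had a value but did not survive parsing
failed = raw.notna() & parsed.isna()

print(parsed)
print("Unparsed non-missing values:", int(failed.sum()))
```

If the count of failed rows on the real column is greater than zero, the offending strings can be inspected with `raw[failed]` before deciding how to handle them.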

In [29]:
# Normalize column names to snake_case (a no-op here, since the names were already clean)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print(df.columns.tolist())

['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']
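
The column list is unchanged because this dataset already uses lowercase, underscore-separated names. To show what the same chain does on less tidy headers, here is a small sketch; the `messy` column names are invented for illustration.

```python
import pandas as pd

# Hypothetical headers with stray spaces and mixed case
messy = pd.DataFrame(columns=[" Show ID", "Date Added ", "release_year"])

# Same normalization chain as the cell above
messy.columns = messy.columns.str.strip().str.lower().str.replace(" ", "_")

print(messy.columns.tolist())  # ['show_id', 'date_added', 'release_year']
```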


In [30]:
df.dtypes

show_id                 object
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                  object
duration                object
listed_in               object
description             object
dtype: object

In [31]:
# Export without the index so no extra unnamed column is written
df.to_csv("netflix_cleaned.csv", index=False)
print("✅ Cleaned dataset saved as 'netflix_cleaned.csv'")

✅ Cleaned dataset saved as 'netflix_cleaned.csv'
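
CSV files do not store dtypes, so `date_added` comes back as text unless the reader is told to parse it. The sketch below illustrates the round trip with an in-memory buffer and a toy frame instead of the real file; the `release_year` values are illustrative.

```python
import io

import pandas as pd

# Toy cleaned frame with one missing date
cleaned = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "date_added": pd.to_datetime(["2021-09-25", None]),
    "release_year": [2020, 2021],
})

# Write to an in-memory buffer instead of disk so the sketch has no side effects
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype; empty cells come back as NaT
restored = pd.read_csv(buffer, parse_dates=["date_added"])
print(restored.dtypes)
```

For the file written above, the equivalent call would be `pd.read_csv("netflix_cleaned.csv", parse_dates=["date_added"])`.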
