
# **Task 1 - Data Cleaning & Preprocessing**

---


# Dataset: Netflix Movies and TV Shows
# Platform: Google Colab

In [41]:
# Import the required libraries
import pandas as pd
import numpy as np

In [42]:
# Load the dataset
df = pd.read_csv('/content/netflix_titles.csv')

**Checking the basic information**



In [43]:
df.head() # the first 5 rows

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [44]:
df.tail() # the last 5 rows

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...


In [45]:
df.shape  # Number of rows and columns

(8807, 12)

In [46]:
df.columns  # List of column names

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [47]:
df.info()    # Data types & null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
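
The non-null counts above imply how many entries are missing in each column (for example, 8807 - 6173 = 2634 missing directors). A minimal sketch of listing those counts directly, using a small made-up frame; `toy` and its values are illustrative and not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["PG", "R", np.nan, "PG"],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running the same two lines on `df` should list only the columns that still need attention.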


**Handling Missing Values**

In [48]:
# Filling 'director', 'cast', 'country' with 'Unknown' where data is missing
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
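
The three assignments can also be written as a single `fillna` call that takes a column-to-placeholder mapping. A small sketch on a made-up frame (`toy` is hypothetical; the placeholder `'Unknown'` is the one used in this notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same three columns (values are illustrative)
toy = pd.DataFrame({
    "director": ["Kirsten Johnson", np.nan],
    "cast": [np.nan, "Ama Qamata"],
    "country": ["United States", np.nan],
})

# One fillna call: each key is a column name, each value its placeholder
toy = toy.fillna({"director": "Unknown", "cast": "Unknown", "country": "Unknown"})
print(toy)
```

Columns not named in the mapping are left untouched, so other missing values are not filled by accident.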

**Cleaning and converting the 'date_added' column**

In [49]:
# Strip stray whitespace, then convert to datetime; unparseable values become NaT instead of raising
df['date_added'] = pd.to_datetime(df['date_added'].astype(str).str.strip(), errors='coerce')
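
To see what the conversion does with awkward inputs, here is a small sketch on a made-up series (`raw` is illustrative; the padded and invalid strings are assumptions about what such data might contain):

```python
import numpy as np
import pandas as pd

# Made-up inputs: a clean date, a date with a leading space, a missing value, and junk text
raw = pd.Series(["September 25, 2021", " August 4, 2017", np.nan, "not a date"])

# Same call as in the cell above: strip whitespace, coerce failures to NaT
parsed = pd.to_datetime(raw.astype(str).str.strip(), errors="coerce")
print(parsed)
print("Unparseable or missing:", parsed.isna().sum())
```

Counting `NaT` values right after the conversion shows how many rows the next step will have to fill.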

In [50]:
# Fill missing values using forward fill first, then backward fill
df['date_added'] = df['date_added'].ffill()
df['date_added'] = df['date_added'].bfill()
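
Forward fill copies the previous row's date into a gap, and backward fill then covers any gap at the very top that has no previous row. A minimal sketch on made-up dates (`dates` is illustrative):

```python
import pandas as pd

# Made-up dates with a gap at the start and one in the middle
dates = pd.Series(pd.to_datetime([None, "2021-09-24", None, "2021-09-20"]))

# ffill handles the middle gap; bfill handles the leading gap that ffill cannot reach
filled = dates.ffill().bfill()
print(filled)
```

The filled value depends on row order, so each gap borrows its date from whichever title happens to sit next to it.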

**Handle Missing Ratings**

In [51]:
# Find the most common rating and use it to fill the missing ratings
most_common_rating = df['rating'].mode()[0]
print(f"\nMost common rating: {most_common_rating}")
df['rating'] = df['rating'].fillna(most_common_rating)


Most common rating: TV-MA
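
`mode()` returns a Series because several values can tie for most frequent, which is why the cell takes element `[0]`. A short sketch on made-up ratings (`ratings` is illustrative):

```python
import numpy as np
import pandas as pd

# Made-up ratings with one missing entry
ratings = pd.Series(["TV-MA", "PG", "TV-MA", np.nan, "PG", "R", "TV-MA"])

# value_counts shows the full distribution; mode() ignores NaN by default
counts = ratings.value_counts()
most_common = ratings.mode()[0]

# Fill the gap with the most frequent value
filled = ratings.fillna(most_common)
print(counts)
print(filled)
```

Looking at `value_counts()` first makes it easy to judge how dominant the mode is before using it as a fill value.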


**Handle Missing Durations**

In [52]:
# Handle missing 'duration' values separately for Movies and TV Shows
# Movies: fill with the most common movie duration
# TV Shows: fill with the most common TV show duration
movie_duration_mode = df[df['type'] == 'Movie']['duration'].mode()[0]
tv_show_duration_mode = df[df['type'] == 'TV Show']['duration'].mode()[0]

In [53]:
print(f"Most common Movie duration: {movie_duration_mode}")
print(f"Most common TV Show duration: {tv_show_duration_mode}")

Most common Movie duration: 90 min
Most common TV Show duration: 1 Season


In [54]:
# Filling missing durations based on type
df.loc[(df['type'] == 'Movie') & (df['duration'].isnull()), 'duration'] = movie_duration_mode
df.loc[(df['type'] == 'TV Show') & (df['duration'].isnull()), 'duration'] = tv_show_duration_mode
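
The same per-type fill can be sketched with `groupby` and `transform`, which avoids writing one line per category. This is an alternative formulation on a made-up frame (`toy` is illustrative), not the approach used in the cells above:

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing duration in each type
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "duration": ["90 min", "90 min", np.nan, "1 Season", np.nan, "1 Season"],
})

# Within each type, fill missing durations with that type's most common value
toy["duration"] = toy.groupby("type")["duration"].transform(
    lambda s: s.fillna(s.mode()[0])
)
print(toy)
```

This assumes every group has at least one non-missing duration; otherwise `mode()` is empty and `[0]` raises an error.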

**Removing Duplicate Rows**

In [55]:
# Count duplicate rows, drop them, and confirm that none remain
print("\nDuplicate rows before removing:", df.duplicated().sum())
df.drop_duplicates(inplace=True)
print("Duplicate rows after removing:", df.duplicated().sum())


Duplicate rows before removing: 0
Duplicate rows after removing: 0
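
`duplicated()` compares entire rows, so two entries that differ only in `show_id` are not flagged. A sketch of checking a subset of columns instead, on a made-up frame; the choice of `title`, `type`, and `release_year` as the key is an assumption for illustration:

```python
import pandas as pd

# Made-up rows: s1 and s2 describe the same title but have different ids
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Zoom", "Zoom", "Zubaan"],
    "type": ["Movie", "Movie", "Movie"],
    "release_year": [2006, 2006, 2015],
})

key = ["title", "type", "release_year"]  # hypothetical key columns

full_dupes = toy.duplicated().sum()           # whole-row comparison
key_dupes = toy.duplicated(subset=key).sum()  # comparison on the key only
deduped = toy.drop_duplicates(subset=key, keep="first")

print("Full-row duplicates:", full_dupes)
print("Key duplicates:", key_dupes)
```

Whether such key-level matches are true duplicates is a judgment call, so it is worth inspecting them before dropping anything.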


**Standardize Text Columns**

In [56]:
# Strip surrounding spaces and normalize case: lowercase for country and category, uppercase for rating
df['country'] = df['country'].str.strip().str.lower()
df['rating'] = df['rating'].str.strip().str.upper()
df['category'] = df['listed_in'].str.strip().str.lower()


In [57]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'category'],
      dtype='object')

In [58]:
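# Rename columns for better readability
# Note: 'category' already exists (created in the previous step), so renaming
# 'listed_in' to 'category' leaves two columns with that name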
df.rename(columns={
    'listed_in': 'category',
    'release_year': 'release_yr'
}, inplace=True)
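
Because a `category` column was already created during text standardization, renaming `listed_in` to `category` produces two columns with the same name, which is why `category` appears twice in the final check. The sketch below reproduces this on a made-up frame and shows one possible way to avoid it (cleaning `listed_in` in place before renaming); `raw`, `buggy`, and `fixed` are hypothetical names:

```python
import pandas as pd

# Made-up frame with just the two columns involved
raw = pd.DataFrame({"listed_in": [" Dramas ", "Comedies"], "release_year": [2015, 2009]})
mapping = {"listed_in": "category", "release_year": "release_yr"}

# Same two steps as in the notebook: add 'category', then rename 'listed_in' to 'category'
buggy = raw.copy()
buggy["category"] = buggy["listed_in"].str.strip().str.lower()
buggy = buggy.rename(columns=mapping)
dupes = buggy.columns[buggy.columns.duplicated()].tolist()
print("Duplicated column names:", dupes)

# One possible alternative: clean 'listed_in' in place, then rename once
fixed = raw.copy()
fixed["listed_in"] = fixed["listed_in"].str.strip().str.lower()
fixed = fixed.rename(columns=mapping)
print(fixed.columns.tolist())
```

Checking `df.columns.is_unique` after a rename is a cheap guard, since selecting a duplicated name returns a DataFrame rather than a Series.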

**Final check**

In [59]:
# Final check of cleaned dataset
print("\n--- Cleaned Data Info ---")
df.info()
print("\nMissing values after cleaning:\n", df.isnull().sum())


--- Cleaned Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   show_id      8807 non-null   object        
 1   type         8807 non-null   object        
 2   title        8807 non-null   object        
 3   director     8807 non-null   object        
 4   cast         8807 non-null   object        
 5   country      8807 non-null   object        
 6   date_added   8807 non-null   datetime64[ns]
 7   release_yr   8807 non-null   int64         
 8   rating       8807 non-null   object        
 9   duration     8807 non-null   object        
 10  category     8807 non-null   object        
 11  description  8807 non-null   object        
 12  category     8807 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(11)
memory usage: 894.6+ KB

Missing values after cleaning:
 show_id        0
type     

**Saving the dataset**

In [60]:
# Save the cleaned dataset to a new CSV file
df.to_csv('netflix_cleaned.csv', index=False)
print("\n✅ Cleaned dataset has been saved as 'netflix_cleaned.csv'.")


✅ Cleaned dataset has been saved as 'netflix_cleaned.csv'.
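
CSV files store dates as plain text, so the `datetime64` type of `date_added` is not preserved when the file is read back. A minimal sketch of restoring it with `parse_dates`, using an in-memory buffer and a made-up one-row frame in place of the real file:

```python
import io

import pandas as pd

# Made-up one-row frame with a datetime column
toy = pd.DataFrame({"title": ["Zoom"], "date_added": pd.to_datetime(["2020-01-11"])})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read returns the dates as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates converts the named column back to datetime while reading
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["date_added"])

print(plain.dtypes)
print(reloaded.dtypes)
```

The same `parse_dates=['date_added']` argument should apply when loading `netflix_cleaned.csv` in a later notebook.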
