# Netflix Movies and TV Shows Analysis 📺🍿  
This notebook explores the Netflix Movies and TV Shows dataset from Kaggle. We load the data, inspect it, clean it, and save the result for later analysis.


In [3]:
import pandas as pd

# Load dataset
df = pd.read_csv("netflix_titles.csv")

# Show first 5 rows
df.head()


,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [4]:
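# Shape of dataset (rows, columns); printed explicitly, since only a cell's last expression is displayed
print(df.shape)

# Overview of columns, non-null counts, and dtypes
df.info()

# Summary statistics for all columns (include='all' covers non-numeric fields too)
df.describe(include='all')


(8807, 12)
<class 'pandas.core.frame.DataFrame'>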
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
count,8807,8807,8807,6173,7982,7976,8797,8807.0,8803,8804,8807,8807
unique,8807,2,8807,4528,7692,748,1767,,17,220,514,8775
top,s1,Movie,Dick Johnson Is Dead,Rajiv Chilaka,David Attenborough,United States,"January 1, 2020",,TV-MA,1 Season,"Dramas, International Movies","Paranormal activity at a lush, abandoned prope..."
freq,1,6131,1,19,19,2818,109,,3207,1793,362,4
mean,,,,,,,,2014.180198,,,,
std,,,,,,,,8.819312,,,,
min,,,,,,,,1925.0,,,,
25%,,,,,,,,2013.0,,,,
50%,,,,,,,,2017.0,,,,
75%,,,,,,,,2019.0,,,,
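
The non-null counts reported by `df.info()` imply gaps in `director`, `cast`, `country`, `date_added`, `rating`, and `duration`. One way to see those gaps directly is to count missing values per column. The sketch below uses a small made-up frame (`sample`, not the real dataset) so that it runs on its own; the same two calls should apply to the notebook's `df`.

```python
import pandas as pd

# Hypothetical stand-in for df: three rows with a few gaps
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "director": ["Kirsten Johnson", None, "Julien Leclercq"],
    "country": ["United States", "South Africa", None],
    "rating": ["PG-13", "TV-MA", "TV-MA"],
})

# Missing values per column, largest first
missing = sample.isna().sum().sort_values(ascending=False)

# Share of rows with a missing value, as a percentage
missing_pct = (sample.isna().mean() * 100).round(1)

print(missing)
print(missing_pct)
```

Looking at the percentages alongside the raw counts can help decide whether a column is worth filling with a placeholder or better left out of a given analysis.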


In [5]:
# Step 1: Normalize column headers (lowercase, trimmed, underscores for spaces)
df.columns = [col.lower().strip().replace(' ', '_') for col in df.columns]

# Step 2: Check for duplicate rows, then drop any that exist
print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

# Step 3: Fill missing values with placeholders.
# A single df.fillna(dict) call replaces chained df[col].fillna(..., inplace=True),
# which raises a FutureWarning and will not modify df in pandas 3.0.
# 'date_added' is not filled here: Step 4 represents missing dates as NaT.
df = df.fillna({
    'director': 'Not Available',
    'cast': 'Not Available',
    'country': 'Unknown',
    'rating': 'Not Rated',
    'duration': 'Not Available',
})

# Step 4: Convert 'date_added' to datetime (errors='coerce' turns unparseable values into NaT)
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

# Step 5: Strip whitespace and standardize case in text columns
df['type'] = df['type'].str.lower().str.strip()
df['country'] = df['country'].str.strip()

# Step 6: Save cleaned dataset
df.to_csv("netflix_cleaned.csv", index=False)

print("✅ Data cleaned and saved successfully.")


Duplicate rows: 0
✅ Data cleaned and saved successfully.
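
Step 4 relies on `errors='coerce'`, which silently turns anything it cannot parse into `NaT`. If some `date_added` values carry stray whitespace (an assumption here, not something visible in the output above), they could be lost that way depending on the pandas version. The following is a minimal sketch on made-up values, stripping whitespace first and passing an explicit format that matches the "Month D, YYYY" style seen in `df.head()`.

```python
import pandas as pd

# Made-up examples: a clean date, a padded date, a placeholder string,
# and a missing value
raw = pd.Series(["September 25, 2021", " August 4, 2017", "Not Given", None])

# Strip whitespace, then parse with an explicit format;
# errors="coerce" maps anything unparseable to NaT
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print("Unparsed values:", parsed.isna().sum())
```

Comparing the number of `NaT` values after parsing with the number of missing values before parsing is a quick way to check that no valid dates were coerced away.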


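## ✅ Conclusion
The Netflix dataset has been cleaned and saved to `netflix_cleaned.csv`.  
- Filled missing values in `director`, `cast`, `country`, `rating`, and `duration` with placeholder text
- Checked for duplicate rows (none found)
- Converted `date_added` to datetime, with missing dates stored as `NaT`
- The data is ready for further analysis or visualization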
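
The points in this list can be checked with a couple of simple counts. The sketch below is one possible way to do that; `check_cleaned` is a hypothetical helper and `cleaned` is a made-up frame standing in for the notebook's `df`.

```python
import pandas as pd

def check_cleaned(frame, filled_columns):
    """Return simple quality counts for a cleaned frame (hypothetical helper)."""
    return {
        # Fully identical rows that remain after cleaning
        "duplicate_rows": int(frame.duplicated().sum()),
        # Missing values left in columns that were supposed to be filled
        "missing_in_filled": int(frame[filled_columns].isna().sum().sum()),
    }

# Made-up frame standing in for the cleaned df
cleaned = pd.DataFrame({
    "title": ["Blood & Water", "Ganglands"],
    "director": ["Not Available", "Julien Leclercq"],
    "country": ["South Africa", "Unknown"],
})

report = check_cleaned(cleaned, ["director", "country"])
print(report)
```

Both counts should be zero for a frame that went through the cleaning steps; a non-zero value points at the step to revisit.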


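In [6]:
# Re-run of Step 4; a no-op once date_added is already datetime
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')


In [7]:
# Re-run of Step 1; the column names are already normalized, so nothing changes
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')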


In [8]:
df.dtypes  # Confirm date_added is datetime64


show_id                 object
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                  object
duration                object
listed_in               object
description             object
dtype: object
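
The dtype check above holds for the frame in memory. A CSV file does not store dtypes, so reading `netflix_cleaned.csv` back with a plain `pd.read_csv` would return `date_added` as text again. The sketch below illustrates this with a made-up two-row frame and an in-memory buffer in place of the real file; `parse_dates` is the `read_csv` parameter that restores the datetime column on load.

```python
import io
import pandas as pd

# Made-up frame with a datetime column, standing in for the cleaned df
frame = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "date_added": pd.to_datetime(["2021-09-25", "2021-09-24"]),
})

# Write to an in-memory buffer instead of netflix_cleaned.csv
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# A plain read returns the dates as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates converts the named column back to datetime while reading
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date_added"])

print(plain["date_added"].dtype)
print(restored["date_added"].dtype)
```

For the real file, the equivalent call would presumably be `pd.read_csv("netflix_cleaned.csv", parse_dates=["date_added"])`.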