In [32]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [33]:
df = pd.read_csv('netflix_titles.csv')

In [34]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [35]:
df.tail()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...


In [36]:
df.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


In [37]:
df.shape

(8807, 12)

In [38]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [39]:
df.nunique()

show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64

### 1. Identify and Handle Missing Values

Missing values are common in real-world datasets and can arise for various reasons, such as data entry errors, incomplete information, or merging datasets from different sources. Identifying and handling them is important for the quality and reliability of the analysis.

**Steps:**
- **Identify Missing Values:** Use functions like `.isnull()` or `.isna()` to detect missing values in the dataset.
- **Handle Missing Values:** Depending on the context and importance of the data, missing values can be handled by:
  - Removing rows or columns with missing values using `.dropna()`
  - Filling missing values with appropriate substitutes, such as the mean, median, mode, or a specific value using `.fillna()`
  - Using advanced techniques like interpolation or predictive modeling for imputation

Proper handling of missing values helps prevent errors in analysis and ensures that the results are accurate and meaningful.
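
As a rough illustration of the options listed above, the sketch below compares dropping rows with filling gaps. The `toy` DataFrame and its values are made up for the example; only the column names echo the Netflix file.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data (values are invented for illustration).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, "Y", np.nan],
    "rating": ["PG", "R", np.nan, "PG"],
})

# Count missing values per column.
missing_counts = toy.isna().sum()

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: keep all rows and fill the gaps instead.
filled = toy.copy()
filled["director"] = filled["director"].fillna("Unknown")                # fixed placeholder
filled["rating"] = filled["rating"].fillna(filled["rating"].mode()[0])   # most frequent value

print(missing_counts)
print(len(dropped), len(filled))
```

Dropping is simple but can discard many rows when one column is sparsely populated; filling keeps the rows at the cost of introducing placeholder values.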

In [40]:
# Check for missing values
print(df.isnull().sum())

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64


In [41]:
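# Drop every row that has at least one missing value in any column
# (this keeps 5,332 of the 8,807 rows, mostly because 'director' is often empty)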
df = df.dropna()

### 2. Remove Duplicate Rows

Duplicate rows can occur in a dataset due to repeated data entry, merging datasets, or data collection errors. Removing duplicates is important to ensure that each record in the dataset is unique, which helps maintain the integrity and accuracy of data analysis.

**Steps:**
- **Identify Duplicates:** Use the `.duplicated()` function to find duplicate rows in the dataset.
- **Remove Duplicates:** Use the `.drop_duplicates()` function to remove duplicate rows, keeping only the first occurrence by default.

By removing duplicate rows, we prevent skewed analysis and ensure that statistical calculations and insights are based on unique data points.
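
The two steps above can be sketched on a small example. The `toy` frame is hypothetical, and the `subset=` variant is shown only as one possible choice when a key column such as `show_id` should define uniqueness.

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "title": ["A", "B", "B", "C"],
})

# Boolean mask: True for each row that exactly repeats an earlier row.
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# Remove duplicates, keeping the first occurrence (the default).
deduped = toy.drop_duplicates()

# Alternative: treat rows as duplicates when only the key column matches.
deduped_by_id = toy.drop_duplicates(subset=["show_id"], keep="first")

print(n_duplicates, len(deduped), len(deduped_by_id))
```

Counting duplicates before dropping them makes it easy to see whether the step changed anything.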

In [42]:
df = df.drop_duplicates()

In [43]:
df.shape

(5332, 12)

### 3. Standardize Text Values

Text data can often be inconsistent due to differences in capitalization, extra spaces, spelling variations, or the use of synonyms. Standardizing text values ensures uniformity and improves the quality of data analysis, especially when grouping or filtering data.

**Steps:**
- **Convert to Lowercase:** Use string methods to convert all text to lowercase for consistency.
- **Remove Extra Spaces:** Strip leading and trailing spaces to avoid mismatches.
- **Unify Categories:** Replace variations or synonyms with a single standardized value (e.g., "United States" and "USA" both become "usa" once the text is lowercased).
- **Consistent Formatting:** Apply consistent formatting to columns such as names, countries, and categories.

Standardizing text values helps prevent errors during analysis and ensures that similar values are treated as the same category.
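
A minimal sketch of these steps using vectorized string methods is shown below. The sample values and the `synonyms` mapping are assumptions made for the example, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical messy values: stray spaces, mixed case, a synonym, and a missing entry.
countries = pd.Series(["  United States", "USA", "united states ", "India", None])

# Hypothetical mapping from variants to one standard label.
synonyms = {"united states": "usa"}

# Strip spaces, lowercase, then unify synonyms; missing values stay missing.
cleaned = countries.str.strip().str.lower().replace(synonyms)

print(cleaned.tolist())
```

The `.str` accessor skips missing values, so no explicit null check is needed for this kind of cleanup.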

In [44]:
u = df['country'].unique()
print(u)

['United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia'
 'United Kingdom' 'United States' 'Germany, Czech Republic' 'India'
 'United States, India, France' 'China, Canada, United States'
 'South Africa, United States, Japan' 'Japan' 'Nigeria'
 'Spain, United States' 'United Kingdom, United States'
 'United Kingdom, Australia, France'
 'United Kingdom, Australia, France, United States'
 'United States, Canada' 'Germany, United States'
 'South Africa, United States' 'United States, Mexico'
 'United States, Italy, France, Japan'
 'United States, Italy, Romania, United Kingdom'
 'Australia, United States' 'Argentina, Venezuela'
 'United States, United Kingdom, Canada' 'China, Hong Kong' 'Canada'
 'Hong Kong' 'United States, China, Hong Kong' 'Italy, United States'
 'United States, Germany' 'France' 'United Kingdom, Canada, United States'
 'United States, United Kingdom' 'India, Nepal'
 'New Zealand, Australia, France, United States' 'Italy, Brazil, Greece'
 'Spain' 'Colomb

In [45]:
# Standardize 'country' column

def clean_country(country_str):
    if pd.isnull(country_str):
        return country_str
    # Split by comma, take first country, strip spaces
    first_country = country_str.split(',')[0].strip().lower()
    # Map to standard abbreviations
    country_map = {
        'united states': 'usa',
        'united kingdom': 'uk',
        'india': 'ind',
        # Add more as needed
    }
    return country_map.get(first_country, first_country)

df['country'] = df['country'].apply(clean_country)
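
`clean_country` returns an empty string whenever the text before the first comma is blank, which would explain the `''` entry in the cleaned `country` values. The variant below is a sketch that skips empty pieces; `first_country` is a hypothetical helper name, and the sample strings in the usage lines are made up.

```python
import pandas as pd

def first_country(country_str):
    """Return the first non-empty country name, lowercased; pass missing values through."""
    if pd.isnull(country_str):
        return country_str
    # Split on commas, trim each piece, and drop pieces that are empty.
    parts = [p.strip().lower() for p in country_str.split(",")]
    non_empty = [p for p in parts if p]
    # If nothing usable remains, report the value as missing rather than ''.
    return non_empty[0] if non_empty else pd.NA

print(first_country(", France, Algeria"))
print(first_country("United States, India"))
```

The abbreviation mapping from `clean_country` could then be applied to the result in the same way as before.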

In [46]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...",usa,"September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",uk,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",usa,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...",germany,"September 23, 2021",2021,TV-MA,127 min,"Dramas, International Movies",After most of her family is murdered in a terr...
24,s25,Movie,Jeans,S. Shankar,"Prashanth, Aishwarya Rai Bachchan, Sri Lakshmi...",ind,"September 21, 2021",1998,TV-14,166 min,"Comedies, International Movies, Romantic Movies",When the father of the man she loves insists t...


### 4. Convert Date Formats

Dates in datasets can appear in various formats, which may cause issues during analysis or when merging data from different sources. Converting all date columns to a consistent format ensures accurate sorting, filtering, and time-based calculations.

**Steps:**
- **Identify Date Columns:** Find columns that represent dates, such as `date_added`.
- **Convert to Datetime:** Use functions like `pd.to_datetime()` to convert date strings to datetime objects.
- **Standardize Format:** If a fixed display format is needed (e.g., `dd-mm-yyyy`), apply it with `.dt.strftime()`. This returns strings, so keep the datetime values for sorting and time-based calculations.

Consistent date formats help avoid confusion, make time-based analysis easier, and ensure compatibility with other systems and tools.
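
The sketch below walks through parsing, formatting, and re-parsing on a few made-up strings in the same "Month day, year" style as `date_added`. The explicit `format` strings are an assumption about the input layout; values that do not match become `NaT` because of `errors="coerce"`.

```python
import pandas as pd

# Hypothetical raw values, including one that cannot be parsed.
raw = pd.Series(["September 25, 2021", "July 1, 2019", "not a date"])

# Parse with an explicit format; unparseable entries become NaT.
parsed = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")

# Format for display. The result holds strings, not datetimes.
as_text = parsed.dt.strftime("%d-%m-%Y")

# Re-parsing dd-mm-yyyy text needs the matching format to keep day and month in order.
round_trip = pd.to_datetime(as_text, format="%d-%m-%Y", errors="coerce")

print(parsed.dtype, as_text.tolist())
```

Keeping the parsed column alongside any formatted text avoids a second parsing step later.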

In [47]:
df['country'].unique()

array(['usa', 'uk', 'germany', 'ind', 'china', 'south africa', 'japan',
       'nigeria', 'spain', 'australia', 'argentina', 'canada',
       'hong kong', 'italy', 'france', 'new zealand', 'colombia',
       'mexico', 'switzerland', 'taiwan', 'bulgaria', '', 'poland',
       'saudi arabia', 'thailand', 'indonesia', 'kuwait', 'egypt',
       'malaysia', 'south korea', 'vietnam', 'lebanon', 'brazil',
       'romania', 'philippines', 'united arab emirates', 'sweden',
       'syria', 'belgium', 'mauritius', 'austria', 'turkey',
       'czech republic', 'cameroon', 'netherlands', 'ireland', 'russia',
       'kenya', 'chile', 'bangladesh', 'portugal', 'hungary', 'norway',
       'singapore', 'iceland', 'serbia', 'namibia', 'uruguay', 'peru',
       'mozambique', 'ghana', 'zimbabwe', 'israel', 'pakistan', 'denmark',
       'paraguay', 'cambodia', 'soviet union', 'georgia', 'iran',
       'finland', 'venezuela', 'slovenia', 'guatemala', 'jamaica',
       'somalia', 'croatia'], dtype=object)

In [48]:
# Convert 'date_added' to datetime, then format as dd-mm-yyyy
# (note: strftime returns strings, so the column is no longer a datetime dtype)
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['date_added'] = df['date_added'].dt.strftime('%d-%m-%Y')

In [49]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...",usa,24-09-2021,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",uk,24-09-2021,2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",usa,24-09-2021,2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...",germany,23-09-2021,2021,TV-MA,127 min,"Dramas, International Movies",After most of her family is murdered in a terr...
24,s25,Movie,Jeans,S. Shankar,"Prashanth, Aishwarya Rai Bachchan, Sri Lakshmi...",ind,21-09-2021,1998,TV-14,166 min,"Comedies, International Movies, Romantic Movies",When the father of the man she loves insists t...


### 5. Rename Column Headers

Column headers in a dataset should be clean, uniform, and easy to reference in code. Inconsistent or messy column names can lead to confusion and errors during analysis. Renaming columns to follow a consistent style improves readability and usability.

**Steps:**
- **Remove Spaces:** Replace spaces with underscores or remove them entirely.
- **Consistent Capitalization:** Use a consistent capitalization style, such as capitalizing the first letter of each word.
- **Strip Extra Characters:** Remove leading/trailing spaces and unnecessary special characters.

Clean and uniform column headers make it easier to access columns programmatically and help maintain a professional and organized dataset.
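
As a small sketch of these steps, the example below cleans some deliberately messy, hypothetical headers with `DataFrame.rename`, once in lowercase snake_case and once in the capitalized style described above.

```python
import pandas as pd

# Hypothetical headers with stray spaces and mixed capitalization.
toy = pd.DataFrame(columns=[" show id", "release_year ", "Listed In"])

# Lowercase snake_case: strip, lowercase, spaces to underscores.
snake = toy.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# Capitalized style: strip, spaces to underscores, capitalize each word.
titled = toy.rename(columns=lambda c: c.strip().replace(" ", "_").title())

print(list(snake.columns))
print(list(titled.columns))
```

Either style works; what matters is applying one convention to every column.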

In [50]:
# Rename columns: First letter uppercase, no spaces (use underscores)
df.columns = [col.strip().replace(' ', '_').title() for col in df.columns]

In [51]:
df.head()

Unnamed: 0,Show_Id,Type,Title,Director,Cast,Country,Date_Added,Release_Year,Rating,Duration,Listed_In,Description
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...",usa,24-09-2021,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",uk,24-09-2021,2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",usa,24-09-2021,2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...",germany,23-09-2021,2021,TV-MA,127 min,"Dramas, International Movies",After most of her family is murdered in a terr...
24,s25,Movie,Jeans,S. Shankar,"Prashanth, Aishwarya Rai Bachchan, Sri Lakshmi...",ind,21-09-2021,1998,TV-14,166 min,"Comedies, International Movies, Romantic Movies",When the father of the man she loves insists t...


### 6. Check and Fix Data Types

Ensuring that each column in the dataset has the correct data type is crucial for accurate analysis and efficient processing. Incorrect data types can lead to errors, unexpected results, or inefficient operations.

**Steps:**
- **Check Data Types:** Use functions like `.dtypes` to inspect the current data types of each column.
- **Convert Data Types:** Change columns to their appropriate types (e.g., convert date columns to `datetime`, numeric columns to `int` or `float`, and categorical columns to `category`).
- **Handle Conversion Errors:** Use parameters like `errors='coerce'` to handle invalid parsing gracefully.

Fixing data types ensures that mathematical operations, date manipulations, and analyses are performed correctly and efficiently.
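
The conversions above can be sketched on a hypothetical frame where a year column arrives as text with one invalid entry. `Int64` (capital I) is the pandas nullable integer type, which can hold missing values produced by `errors="coerce"`.

```python
import pandas as pd

# Hypothetical data: years stored as text, one of them invalid.
toy = pd.DataFrame({
    "Type": ["Movie", "TV Show", "Movie"],
    "Release_Year": ["2020", "2021", "unknown"],
})

# Invalid entries become missing; Int64 keeps the rest as integers.
toy["Release_Year"] = pd.to_numeric(toy["Release_Year"], errors="coerce").astype("Int64")

# A column with few distinct values can be stored as a categorical.
toy["Type"] = toy["Type"].astype("category")

print(toy.dtypes)
```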

In [52]:
df.dtypes

Show_Id         object
Type            object
Title           object
Director        object
Cast            object
Country         object
Date_Added      object
Release_Year     int64
Rating          object
Duration        object
Listed_In       object
Description     object
dtype: object

In [53]:
# 'Date_Added' now holds dd-mm-yyyy strings, so pass the format explicitly;
# without it pandas has to guess the day/month order and emits a warning
df['Date_Added'] = pd.to_datetime(df['Date_Added'], format='%d-%m-%Y', errors='coerce')

# Convert 'Release_Year' to a nullable integer (if not already)
df['Release_Year'] = pd.to_numeric(df['Release_Year'], errors='coerce').astype('Int64')

In [54]:
df.dtypes

Show_Id                 object
Type                    object
Title                   object
Director                object
Cast                    object
Country                 object
Date_Added      datetime64[ns]
Release_Year             Int64
Rating                  object
Duration                object
Listed_In               object
Description             object
dtype: object