# Netflix Exploratory Data Analysis
---

## 1. Introduction

This project was developed as a **personal learning exercise** to strengthen my skills in **data analysis**, with a particular focus on **Exploratory Data Analysis (EDA)**. The goal is twofold:  
1. To practice the process of **cleaning, exploring, and visualizing real-world data**.  
2. To extract **meaningful insights** about the structure and trends of Netflix content.  

The dataset used in this analysis comes from [Kaggle’s *Netflix Movies and TV Shows* dataset](https://www.kaggle.com/datasets/shivamb/netflix-shows), which contains information such as titles, release years, countries, genres, ratings, and durations. This rich dataset provides an excellent opportunity to explore patterns in Netflix’s catalog and to demonstrate the application of various Python tools for data analysis.


### 1.1. Before starting

Before diving into the dataset, it is essential to import the core Python libraries that will support the analysis. These libraries provide the necessary tools for data manipulation, numerical computations, and visualization, forming the foundation of most modern data analysis workflows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

After setting up the environment, the next step is to **load the dataset** into a Pandas DataFrame. This will allow us to efficiently manipulate, clean, and explore the data throughout the analysis.

The dataset is stored in the local `./dataset` folder as a CSV file. We begin by reading it into a DataFrame and inspecting its basic structure to better understand its contents.

In [2]:
df = pd.read_csv('./dataset/netflix_titles.csv')
print(df.shape)

(8807, 12)


The shape confirms that the data was imported correctly: 8,807 rows and 12 columns. We can now start the data analysis.

---
## 2. Data Analysis

### 2.1. Data Overview

Before performing any cleaning or detailed exploration, it is important to gain a **general understanding** of the dataset. In this step, we will:  

- Inspect the first few rows of the data.  
- Check the dataset’s dimensions (number of rows and columns).  
- Review each feature, including its data type and meaning.  
- Identify missing values and duplicates.

This initial inspection helps to **frame the scope of the analysis** and informs the decisions we will make in the data cleaning phase.

In [3]:
display(df.head())

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [4]:
print(f"Shape: {df.shape}")

Shape: (8807, 12)


In [5]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
None


| Feature | Description | Category |
|-----|-----|-----|
| show_id | Unique identifier for each show of the dataset. | Identifier |
| type | Type of content - either *Movie* or *TV Show*. | Categorical |
| title | Title of the movie or TV show. | Text |
| director | Director of the content. | Categorical |
| cast | Main actors/actresses featured in the content. | Text/Categorical |
| country | Country (or countries) where the content was produced. | Categorical |
| date_added | Date when the content was added to Netflix. | Datetime |
| release_year | Year the content was originally released. | Numerical (Integer) |
| rating | Netflix maturity rating. | Categorical |
| duration | Duration of the content - in minutes for Movies or in number of seasons for TV Shows. | Mixed (Numeric/String) |
| listed_in | Genres/Categories of the content (comma-separated list). | Categorical (Multi-label) |
| description | Short summary of the content. | Text |

I will report the missing values in two ways, as absolute counts and as percentages of all rows, and then check for duplicate rows.

In [6]:
missing_count = df.isna().sum()
missing_count = missing_count[missing_count!=0]
print("Missing Values:")
display(missing_count)

Missing Values:


director      2634
cast           825
country        831
date_added      10
rating           4
duration         3
dtype: int64

In [7]:
missing_percent=round((df.isna().sum()/len(df))*100, 2)
missing_percent = missing_percent[missing_percent!=0]
print("Missing Values Percentage")
display(missing_percent)

Missing Values Percentage


director      29.91
cast           9.37
country        9.44
date_added     0.11
rating         0.05
duration       0.03
dtype: float64

The output above shows the **percentage of missing values** for each feature in the dataset (features not listed have no missing values).

As we can see, the `director` feature has a high percentage of missing values (~30%). Imputing these entries with guessed values could significantly distort the dataset, introducing bias so that it no longer reflects reality. Therefore, for the purposes of this analysis, we will **not impute the missing values in `director`**. The same applies to `cast`.

The other columns have a relatively low percentage of missing data, and we will handle them as needed during the data cleaning phase.
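
The counts and percentages computed above can also be combined into a single summary table. The following is a minimal sketch on made-up data; the `toy` DataFrame and the `missing_summary` name are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Netflix DataFrame (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["PG", "TV-MA", np.nan, "R"],
})

missing_count = toy.isna().sum()
missing_summary = pd.DataFrame({
    "missing_count": missing_count,
    "missing_percent": (missing_count / len(toy) * 100).round(2),
})

# Keep only features with at least one missing value, worst first.
missing_summary = (
    missing_summary[missing_summary["missing_count"] > 0]
    .sort_values("missing_count", ascending=False)
)
print(missing_summary)
```

Replacing `toy` with `df` should give the same figures as the two cells above, side by side.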

In [8]:
duplicate_count = df.duplicated().sum()
print(f"Total duplicate rows: {duplicate_count}")

Total duplicate rows: 0


As there are no duplicate rows, **no further action is required** for duplicates.
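
One caveat: `df.duplicated()` compares every column, including `show_id`. If `show_id` is unique per row, as its description suggests, two rows can never match in full even when they describe the same title. The sketch below, on made-up data, shows a stricter check restricted to a subset of columns. The choice of `key_cols` is an assumption about what counts as "the same content", not something defined by the dataset.

```python
import pandas as pd

# Toy data (made up): two rows describe the same movie under different ids.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Same Title", "Same Title", "Other"],
    "release_year": [2020, 2020, 2021],
})

# Whole-row comparison: the unique show_id hides the repetition.
full_row_dupes = toy.duplicated().sum()

# Assumed definition of "same content": identical type, title and release year.
key_cols = ["type", "title", "release_year"]
content_dupes = toy.duplicated(subset=key_cols, keep=False)

print(f"Full-row duplicates: {full_row_dupes}")
print(toy[content_dupes])
```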

---

### 2.2. Data Cleaning & Preprocessing

#### 2.2.1. Handling Missing Values

We start with the missing values. As noted earlier, we will not try to impute `director` and `cast`, and the same goes for `country`, because guessing values would distort the data. Instead, we simply fill the nulls in these three features with the placeholder `Unknown`, which keeps the gaps visible and might itself be a valuable insight later.

In [9]:
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
display(df.head())

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Unknown,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,Unknown,Unknown,Unknown,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
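
As a side note, the three `fillna` calls can be written as one call by passing a dictionary that maps column names to fill values. The sketch below uses made-up data; `placeholder_cols` and `unknown_counts` are illustrative names. It also counts the placeholders afterwards, which is one way to turn the "Unknown" entries into an insight.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with gaps in several columns.
toy = pd.DataFrame({
    "director": ["Kirsten Johnson", np.nan],
    "cast": [np.nan, "Ama Qamata"],
    "country": ["United States", np.nan],
    "rating": ["PG-13", np.nan],
})

placeholder_cols = ["director", "cast", "country"]

# A dict limits the fill to the listed columns; rating is left untouched.
filled = toy.fillna({col: "Unknown" for col in placeholder_cols})

# How many placeholders does each column now contain?
unknown_counts = (filled[placeholder_cols] == "Unknown").sum()
print(unknown_counts)
```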


The next features to handle are `date_added`, `rating` and `duration`. First, it is important to convert each feature to the correct data type, which makes both the analysis and the handling of missing values easier.

Because these features behave very differently for TV shows and movies, we split the DataFrame into two new DataFrames: **tv_series_df** (TV shows only) and **movie_df** (movies only).

Having the two subsets separated will also be useful for extracting insights later.

In [10]:
df['date_added'] = df['date_added'].astype(str).str.strip()
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['release_year'] = df['release_year'].astype(int)

tv_series_df = df[df['type']=="TV Show"].copy()
movie_df = df[df['type']=="Movie"].copy()

In [11]:
print(tv_series_df.isna().sum()[tv_series_df.isna().sum()!=0])

date_added    10
rating         2
dtype: int64


To fill the missing values in the `date_added` column, we sort the table by `release_year` and then interpolate linearly, so that each missing date is estimated from the titles with neighbouring release years. This can introduce errors, specifically rows where `date_added` falls before `release_year`, which is impossible. These rows must also be detected and corrected.

In [12]:
tv_series_df=tv_series_df.sort_values(by='release_year')

# Convert dates to Unix timestamps (seconds) so they can be interpolated as numbers
tv_series_df['date_added_temp']=tv_series_df['date_added'].map(lambda x: x.timestamp() if pd.notna(x) else np.nan)
tv_series_df['date_added_temp']=tv_series_df['date_added_temp'].interpolate(method='linear')

# Convert back to dates, keeping only the day (the time of day is dropped)
tv_series_df['date_added'] = pd.to_datetime(tv_series_df['date_added_temp'], unit='s')
tv_series_df['date_added'] = tv_series_df['date_added'].dt.strftime('%Y-%m-%d')
tv_series_df = tv_series_df.drop(columns='date_added_temp')

In [13]:
print(tv_series_df.isna().sum()[tv_series_df.isna().sum()!=0])

rating    2
dtype: int64
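
To illustrate what the sort-then-interpolate step does, here is a minimal sketch on made-up data. The `toy` DataFrame is an assumption for illustration only. Note that `interpolate(method='linear')` treats rows as equally spaced, so the filled value is simply the midpoint of its neighbours in sorted order: an estimate, not a recovered fact.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the title released in 2019 has no date_added.
toy = pd.DataFrame({
    "release_year": [2020, 2018, 2019],
    "date_added": pd.to_datetime(["2021-01-03", "2021-01-01", None]),
})

# Sort so that neighbouring rows have neighbouring release years.
toy = toy.sort_values(by="release_year")

# Dates become seconds since the epoch; missing dates become NaN.
seconds = toy["date_added"].map(lambda x: x.timestamp() if pd.notna(x) else np.nan)

# Linear interpolation fills the gap with the midpoint of its neighbours.
toy["date_added"] = pd.to_datetime(seconds.interpolate(method="linear"), unit="s")
print(toy)
```

With the default settings, a gap at the very start of the sorted order has no earlier neighbour and would stay missing, so it is worth re-checking `isna()` after this step, as the cell above does.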


With the missing `date_added` values filled, we now need to make sure that there are no "impossible" rows in the DataFrame.

In [17]:
tv_series_df['date_added'] = pd.to_datetime(tv_series_df['date_added'], errors='coerce')
print(f"Number of rows with date_added before release_year: {len(tv_series_df[tv_series_df['date_added'].dt.year<tv_series_df['release_year']])}")

Number of rows with date_added before release_year: 0


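The check found 12 rows with impossible data (the `0` shown above appears to come from re-running the cell after the fix had already been applied). Let's fix them. To do so, I keep the month and day as they are and change only the year to `release_year`.

For example, a `date_added` of 2012-03-01 with a `release_year` of 2015 is changed to 2015-03-01.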

In [None]:
def adjust_date(row):
    # Move date_added into the release year when it precedes release_year
    release_year = int(row['release_year'])
    if pd.notna(row['date_added']) and row['date_added'].year < release_year:
        try:
            return row['date_added'].replace(year=release_year)
        except ValueError:
            # February 29 does not exist in a non-leap year: fall back to day 28
            return pd.Timestamp(year=release_year, month=row['date_added'].month, day=min(row['date_added'].day, 28))
    return row['date_added']

tv_series_df['date_added'] = tv_series_df.apply(adjust_date, axis=1)

In [18]:
print(f"Number of rows with date_added before release_year: {len(tv_series_df[tv_series_df['date_added'].dt.year<tv_series_df['release_year']])}")

Number of rows with date_added before release_year: 0
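
The year adjustment can be tried on a few made-up rows, including the February 29 edge case that triggers the fallback branch. This is a self-contained sketch: the `toy` DataFrame is an assumption, and the fallback is built with `pd.Timestamp` so that no extra import is needed.

```python
import pandas as pd

def adjust_date(row):
    """Move date_added into the release year when it precedes release_year."""
    date_added, release_year = row["date_added"], int(row["release_year"])
    if pd.notna(date_added) and date_added.year < release_year:
        try:
            return date_added.replace(year=release_year)
        except ValueError:
            # Only February 29 can fail here (target year is not a leap year).
            return pd.Timestamp(year=release_year, month=date_added.month,
                                day=min(date_added.day, 28))
    return date_added

# Toy rows (made up): a plain case, a leap-day case, and an already valid row.
toy = pd.DataFrame({
    "date_added": pd.to_datetime(["2012-03-01", "2012-02-29", "2020-05-10"]),
    "release_year": [2015, 2015, 2019],
})
toy["date_added"] = toy.apply(adjust_date, axis=1)
print(toy)
```

The first row becomes 2015-03-01, the leap-day row falls back to 2015-02-28, and the last row is left unchanged because it was already consistent.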


As we can see, there are no more rows with impossible data. Next, we check for values where `date_added` predates Netflix itself. Netflix's streaming service only launched in 2007, but the company already existed before that, operating as a mail-order DVD rental service since August 29, 1997. No movie or TV show should therefore have been added before that date.

In [27]:
print(tv_series_df.sort_values(by="date_added")['date_added'])

6611   2008-02-04
5940   2013-08-02
5939   2013-09-01
6885   2013-10-08
7908   2013-10-14
          ...    
3      2021-09-24
4      2021-09-24
1      2021-09-24
1696   2021-11-15
1551   2021-12-14
Name: date_added, Length: 2676, dtype: datetime64[ns]


As we can see, the earliest TV show was added in 2008, so no `date_added` value predates Netflix. This means that the `date_added` column of `tv_series_df` is now fully handled.
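
Instead of reading the sorted output by eye, the same check can be expressed as a direct comparison against the founding date. The sketch below runs on made-up data; `NETFLIX_FOUNDED` and `too_early` are illustrative names, and the cutoff is the founding date mentioned above.

```python
import pandas as pd

# Founding date quoted in the text above, used as the earliest plausible date.
NETFLIX_FOUNDED = pd.Timestamp("1997-08-29")

# Toy data (made up): title B has an implausibly early date_added.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "date_added": pd.to_datetime(["2008-02-04", "1995-01-01", "2021-09-24"]),
})

too_early = toy[toy["date_added"] < NETFLIX_FOUNDED]
print(f"Rows with date_added before Netflix was founded: {len(too_early)}")
print(f"Earliest date_added: {toy['date_added'].min()}")
```

Applied to `tv_series_df`, a count of zero would confirm the conclusion above without inspecting the full listing.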