# Exploratory Data Analysis with Python: Netflix Titles from Kaggle

### Overview

This project performs an **Exploratory Data Analysis (EDA)** of the **Netflix Titles dataset** obtained from Kaggle. The dataset describes the TV shows and movies available on Netflix, including their genres, release year, age certification, runtime, country of production, and IMDb and TMDB scores. Analyzing this data serves as a good base for my first experience with Exploratory Data Analysis using Kaggle datasets.

### Goals of the Project

The primary goals for this analysis include:

- **Identifying Trends and Correlations**: Explore relationships between variables like genres, ratings, and release year.
- **Detecting Missing or Inconsistent Data**: Identify any gaps, missing values, or inconsistencies in the dataset.
- **Understanding Data Distributions**: Analyze the distributions of key variables such as ratings, release years, and content type (TV show vs. movie).

### Dataset Source:
You can access the dataset on Kaggle: [Netflix Titles Dataset on Kaggle](https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies)

### Dataset Description

The dataset (`titles.csv`) contains the following columns:
- `id`: Unique identifier for each title.
- `title`: The title of the show/movie.
- `type`: Whether the content is a `SHOW` or a `MOVIE`.
- `description`: A brief description of the title.
- `release_year`: Year of release.
- `age_certification`: Age certification, such as `PG-13`, `R`, or `TV-14`.
- `runtime`: Runtime in minutes.
- `genres`: List of genres the title belongs to.
- `production_countries`: List of production country codes, such as `['US']`.
- `seasons`: Number of seasons (null for many rows).
- `imdb_id`: IMDb identifier of the title.
- `imdb_score`: IMDb score.
- `imdb_votes`: Number of votes on IMDb.
- `tmdb_popularity`: Popularity on TMDB.
- `tmdb_score`: TMDB score.

### Approach

- **Data Cleaning**: Handle missing values and duplicates, ensure correct data types.
- **Data Visualization**: Use Python libraries such as `Matplotlib` to create meaningful visualizations, like bar plots, histograms, and scatter plots.
- **Statistical Analysis**: Investigate the distribution of ratings, release years, and content types.
- **Exploratory Insights**: Explore trends such as the most common genres, top-rated titles, or content availability across different countries.

---

### Importing the required libraries and loading the data into a data frame:

In [9]:
import pandas as pd
import numpy as np
import matplotlib
df = pd.read_csv("titles.csv")

### Displaying the information of the data frame:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5850 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5850 non-null   object 
 1   title                 5849 non-null   object 
 2   type                  5850 non-null   object 
 3   description           5832 non-null   object 
 4   release_year          5850 non-null   int64  
 5   age_certification     3231 non-null   object 
 6   runtime               5850 non-null   int64  
 7   genres                5850 non-null   object 
 8   production_countries  5850 non-null   object 
 9   seasons               2106 non-null   float64
 10  imdb_id               5447 non-null   object 
 11  imdb_score            5368 non-null   float64
 12  imdb_votes            5352 non-null   float64
 13  tmdb_popularity       5759 non-null   float64
 14  tmdb_score            5539 non-null   float64
dtypes: float64(5), int64(2), object(8)

### Check for any duplicates in the data:

In [7]:
print(df.duplicated().sum())

0
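
`df.duplicated()` only flags rows that match in every column, so a count of 0 does not rule out the same title appearing twice under different identifiers. A minimal sketch of a narrower check, using a small invented frame rather than the real `titles.csv`:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of titles.csv (values invented)
sample = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3"],
    "title": ["Same Title", "Same Title", "Other Title"],
    "release_year": [2020, 2020, 2021],
})

# Full-row comparison: the differing ids make every row unique
full_dupes = int(sample.duplicated().sum())

# Comparison on a chosen key: rows sharing title and release year count as duplicates
key_dupes = int(sample.duplicated(subset=["title", "release_year"]).sum())

print(full_dupes, key_dupes)
```

Whether `title` plus `release_year` is the right key for this dataset is an assumption; remakes released in the same year would be flagged as well.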


### Show all unique release years:

In [18]:
df["release_year"].unique()

array([1945, 1976, 1972, 1975, 1967, 1969, 1979, 1971, 1980, 1961, 1966,
       1954, 1958, 1977, 1963, 1956, 1960, 1973, 1974, 1959, 1978, 1989,
       1990, 1987, 1984, 1983, 1982, 1986, 1981, 1988, 1985, 2000, 1996,
       1997, 1995, 1994, 1999, 1998, 1993, 1992, 1991, 2008, 2002, 2010,
       2005, 2007, 2004, 2006, 2009, 2003, 2001, 2011, 2012, 2013, 2014,
       2015, 2016, 2018, 2017, 2019, 2020, 2022, 2021], dtype=int64)

### Count all entries for each year:

In [22]:
df["release_year"].value_counts()

release_year
2019    836
2020    814
2021    787
2018    773
2017    563
       ... 
1960      1
1974      1
1959      1
1978      1
1945      1
Name: count, Length: 63, dtype: int64
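
`value_counts()` sorts by frequency, which hides the chronological shape of the data. The sketch below, on a handful of invented years, shows one way to order the counts by year and to group them by decade:

```python
import pandas as pd

# Hypothetical release years standing in for df["release_year"]
years = pd.Series([1945, 1999, 2017, 2019, 2019, 2020, 2021], name="release_year")

# Counts in chronological order instead of by frequency
by_year = years.value_counts().sort_index()

# Integer division maps each year onto the first year of its decade
decade = (years // 10) * 10
by_decade = decade.value_counts().sort_index()

print(by_year)
print(by_decade)
```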

### Count the null values in each column:

In [25]:
df.isnull().sum()

id                         0
title                      1
type                       0
description               18
release_year               0
age_certification       2619
runtime                    0
genres                     0
production_countries       0
seasons                 3744
imdb_id                  403
imdb_score               482
imdb_votes               498
tmdb_popularity           91
tmdb_score               311
dtype: int64
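
Raw counts are easier to compare as a share of all rows. The following sketch uses an invented four-row frame, not the real data, to express missing values as a percentage per column:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for titles.csv (values invented)
sample = pd.DataFrame({
    "title": ["A", "B", None, "D"],
    "seasons": [1.0, np.nan, np.nan, np.nan],
    "imdb_score": [7.1, np.nan, 6.3, 8.0],
})

# isnull() gives booleans; their mean is the fraction of nulls per column
missing_pct = (sample.isnull().mean() * 100).round(1).sort_values(ascending=False)

print(missing_pct)
```

Applied to the real frame, the same expression would be `df.isnull().mean() * 100`.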

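### Replace all null values with the string "0" (first time using NumPy):
Because the replacement value is a string, any numeric column that contained nulls becomes `object` dtype and has to be converted back to numeric before analysis.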

In [30]:
df.replace(np.nan, "0", inplace = True)

### Confirm that no null values remain after the replacement:

In [32]:
df.isnull().sum()

id                      0
title                   0
type                    0
description             0
release_year            0
age_certification       0
runtime                 0
genres                  0
production_countries    0
seasons                 0
imdb_id                 0
imdb_score              0
imdb_votes              0
tmdb_popularity         0
tmdb_score              0
dtype: int64
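
Filling every column with the same string has a side effect on dtypes. The sketch below, on an invented two-column frame, contrasts that approach with a per-column fill; the placeholder values `"Unknown"` and `0.0` are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for two columns of titles.csv (values invented)
sample = pd.DataFrame({
    "age_certification": ["R", np.nan, "PG-13"],
    "imdb_score": [7.8, np.nan, 6.6],
})

# Same approach as above: the float column now mixes floats and the string "0"
as_string = sample.replace(np.nan, "0")
print(as_string.dtypes)

# Per-column fill: text gets a text placeholder, numbers stay numeric
filled = sample.fillna({"age_certification": "Unknown", "imdb_score": 0.0})
print(filled.dtypes)
```

Even a numeric `0.0` is a real value to later statistics, so leaving NaN in place for score columns may be the safer choice.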

### Creating a second data frame that keeps only release years after 2018:

In [56]:
df2 = df[df["release_year"] > 2018]

In [60]:
df2

| | id | title | type | description | release_year | age_certification | runtime | genres | production_countries | seasons | imdb_id | imdb_score | imdb_votes | tmdb_popularity | tmdb_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3042 | ts76645 | The Umbrella Academy | SHOW | A family of former child heroes, now grown apa... | 2019 | TV-14 | 50 | ['comedy', 'drama', 'fantasy', 'scifi', 'action'] | ['US'] | 3.0 | tt1312171 | 8.0 | 230211.0 | 388.952 | 8.7 |
| 3043 | ts216746 | The Secret of Skinwalker Ranch | SHOW | A team of experts and scientists undertakes ex... | 2020 | TV-PG | 42 | ['horror', 'reality', 'thriller', 'documentati... | ['US'] | 3.0 | tt10589968 | 6.0 | 2625.0 | 30.911 | 8.5 |
| 3044 | ts89840 | Virgin River | SHOW | After seeing an ad for a midwife, a recently d... | 2019 | TV-14 | 44 | ['drama', 'romance'] | ['US'] | 4.0 | tt9077530 | 7.4 | 36459.0 | 169.37 | 8.034 |
| 3045 | tm441050 | The Gentlemen | MOVIE | American expat Mickey Pearson has built a high... | 2019 | R | 113 | ['crime', 'action', 'comedy'] | ['US', 'GB'] | 0 | tt8367814 | 7.8 | 325385.0 | 67.919 | 7.687 |
| 3046 | ts87466 | Demon Slayer: Kimetsu no Yaiba | SHOW | It is the Taisho Period in Japan. Tanjiro, a k... | 2019 | TV-MA | 25 | ['scifi', 'action', 'animation', 'horror', 'fa... | ['JP'] | 3.0 | tt9335498 | 8.7 | 93604.0 | 94.263 | 8.809 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5845 | tm1014599 | Fine Wine | MOVIE | A beautiful love story that can happen between... | 2021 | 0 | 100 | ['romance', 'drama'] | ['NG'] | 0 | tt13857480 | 6.8 | 45.0 | 1.466 | 0 |
| 5846 | tm898842 | C/O Kaadhal | MOVIE | A heart warming film that explores the concept... | 2021 | 0 | 134 | ['drama'] | [] | 0 | tt11803618 | 7.7 | 348.0 | 0 | 0 |
| 5847 | tm1059008 | Lokillo | MOVIE | A controversial TV host and comedian who has b... | 2021 | 0 | 90 | ['comedy'] | ['CO'] | 0 | tt14585902 | 3.8 | 68.0 | 26.005 | 6.3 |
| 5848 | tm1035612 | Dad Stop Embarrassing Me - The Afterparty | MOVIE | Jamie Foxx, David Alan Grier and more from the... | 2021 | PG-13 | 37 | [] | ['US'] | 0 | 0 | 0 | 0 | 1.296 | 10.0 |


### Creating a third data frame from `df2` that keeps only titles rated PG-13:

In [66]:
df3 = df2[df2["age_certification"] == "PG-13"]

In [68]:
df3

| | id | title | type | description | release_year | age_certification | runtime | genres | production_countries | seasons | imdb_id | imdb_score | imdb_votes | tmdb_popularity | tmdb_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3057 | tm510203 | A Call to Spy | MOVIE | In the beginning of WWII, with Britain becomin... | 2019 | PG-13 | 123 | ['crime', 'drama', 'thriller', 'war', 'history'] | ['US'] | 0 | tt7698468 | 6.6 | 7091.0 | 11.78 | 6.8 |
| 3090 | tm911973 | Enola Holmes | MOVIE | While searching for her missing mother, intrep... | 2020 | PG-13 | 123 | ['crime', 'drama', 'action'] | ['US'] | 0 | tt7846844 | 6.6 | 162511.0 | 22.461 | 7.3 |
| 3097 | tm362198 | Scary Stories to Tell in the Dark | MOVIE | Mill Valley, Pennsylvania, Halloween night, 19... | 2019 | PG-13 | 108 | ['horror', 'thriller'] | ['US', 'CA', 'CN'] | 0 | tt3387520 | 6.2 | 74754.0 | 36.326 | 6.5 |
| 3110 | tm456156 | Sweetheart | MOVIE | Jenn has washed ashore a small tropical island... | 2019 | PG-13 | 82 | ['horror', 'thriller', 'drama', 'fantasy', 'sc... | ['US'] | 0 | tt6560164 | 5.8 | 8191.0 | 12.174 | 6.5 |
| 3123 | tm460948 | Always Be My Maybe | MOVIE | Reunited after 15 years, famous chef Sasha and... | 2019 | PG-13 | 102 | ['romance', 'comedy'] | ['US'] | 0 | tt7374948 | 6.8 | 55686.0 | 11.434 | 6.6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5643 | tm1159301 | Forgive Us Our Trespasses | MOVIE | In 1939 Germany, a disabled farm boy is pursue... | 2022 | PG-13 | 13 | ['war', 'drama'] | ['US'] | 0 | tt17162524 | 6.5 | 0 | 37.229 | 6.3 |
| 5646 | tm1049083 | Trese After Dark | MOVIE | Stars and creators gather to discuss "Trese," ... | 2021 | PG-13 | 36 | ['documentation'] | ['PH'] | 0 | 0 | 0 | 0 | 2.792 | 7.5 |
| 5706 | tm1161223 | Cat Burglar | MOVIE | In this edgy, over-the-top, interactive trivia... | 2022 | PG-13 | 12 | ['animation', 'comedy'] | ['IE', 'GB'] | 0 | tt17321170 | 6.9 | 0 | 2.472 | 5.9 |
| 5763 | tm1160938 | Adam by Eve: A Live in Animation | MOVIE | Anime, live action and music by cutting-edge a... | 2022 | PG-13 | 58 | ['drama', 'animation', 'music'] | ['JP'] | 0 | tt18274178 | 6.1 | 378.0 | 3.828 | 6.3 |


### Running `info()` on the third data frame to see how many titles released after 2018 are rated PG-13:

In [85]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 148 entries, 3057 to 5848
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    148 non-null    object
 1   title                 148 non-null    object
 2   type                  148 non-null    object
 3   description           148 non-null    object
 4   release_year          148 non-null    int64 
 5   age_certification     148 non-null    object
 6   runtime               148 non-null    int64 
 7   genres                148 non-null    object
 8   production_countries  148 non-null    object
 9   seasons               148 non-null    object
 10  imdb_id               148 non-null    object
 11  imdb_score            148 non-null    object
 12  imdb_votes            148 non-null    object
 13  tmdb_popularity       148 non-null    object
 14  tmdb_score            148 non-null    object
dtypes: int64(2), object(13)
memory usage: 18.
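
The two filtering steps (release year after 2018, then PG-13) can also be written as a single boolean mask. A minimal sketch on an invented frame, with hypothetical titles:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of titles.csv (values invented)
sample = pd.DataFrame({
    "title": ["Old Film", "New Film", "New Show", "Newer Film"],
    "release_year": [2010, 2019, 2020, 2022],
    "age_certification": ["PG-13", "PG-13", "TV-14", "PG-13"],
})

# Each comparison needs its own parentheses because & binds tighter than > and ==
mask = (sample["release_year"] > 2018) & (sample["age_certification"] == "PG-13")
recent_pg13 = sample[mask]

# len() gives the row count directly, without reading it off info()
print(len(recent_pg13))
```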

### Draw a box plot of the release year for the original data set:

In [233]:
# Create boxplot for the 'release_year' column
df['release_year'].dropna().plot(kind='box', figsize=(8, 6))

<Axes: >
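
The points a box plot draws beyond its whiskers follow the 1.5 × IQR rule (Matplotlib's default). The sketch below reproduces that rule numerically on a few invented years, so the outliers can be listed rather than only seen:

```python
import pandas as pd

# Hypothetical release years; one old title among recent ones
years = pd.Series(
    [1945, 2015, 2016, 2017, 2018, 2019, 2019, 2020, 2021, 2022],
    name="release_year",
)

q1 = years.quantile(0.25)
q3 = years.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are the ones plotted as individual points
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = years[(years < lower) | (years > upper)]

print(lower, upper)
print(outliers.tolist())
```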

### Show the unique values of the `type` column (`SHOW` and `MOVIE`):

In [216]:
df["type"].unique()

array(['SHOW', 'MOVIE'], dtype=object)

### Plotting the `type` column as a histogram:

In [195]:
df["type"].hist()

<Axes: >
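
Since `type` is categorical, a bar chart of its counts is a common alternative to a histogram. This is a sketch on invented values; the `Agg` backend is selected only so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import pandas as pd

# Hypothetical stand-in for df["type"]
types = pd.Series(["SHOW", "MOVIE", "MOVIE", "SHOW", "MOVIE"], name="type")

# Count each category, then draw one bar per category
type_counts = types.value_counts()
ax = type_counts.plot(kind="bar", title="Titles by type")
ax.set_xlabel("type")
ax.set_ylabel("number of titles")

print(type_counts)
```

In a notebook the backend line can be dropped, and the chart appears inline.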

### Describing the `production_countries` column, where `['US']` is the most frequent value in the original data set (`df`):

In [198]:
df["production_countries"].describe()

count       5850
unique       452
top       ['US']
freq        1959
Name: production_countries, dtype: object
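
`describe()` treats each list as one opaque string, so `['US']` and `['US', 'GB']` count as different values. Assuming the column stores Python-style list literals as text (as the displayed rows suggest), a sketch of counting individual countries could look like this:

```python
import ast
import pandas as pd

# Hypothetical stand-in for df["production_countries"] (values invented)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "production_countries": ["['US']", "['US', 'GB']", "[]", "['JP']"],
})

# literal_eval safely turns the text "['US', 'GB']" into a real list
parsed = sample["production_countries"].apply(ast.literal_eval)

# explode() gives one row per country; empty lists become NaN and are dropped
country_counts = parsed.explode().dropna().value_counts()

print(country_counts)
```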

### Determine the correlation between `imdb_score`, `imdb_votes`, and `tmdb_popularity`
This requires converting these columns to numeric values with pandas:

In [201]:
df["imdb_score"] = pd.to_numeric(df["imdb_score"], errors="coerce") # Convert imdb_score to numeric, invalid parsing will become NaN

In [238]:
df['imdb_votes'] = pd.to_numeric(df['imdb_votes'], errors='coerce')  # Convert imdb_votes to numeric

In [162]:
df["tmdb_popularity"] = pd.to_numeric(df["tmdb_popularity"], errors="coerce") # Converts tmdb_popularity to numeric

### Verify the conversion, where `float64` (desired) is now displayed instead of `object`:

In [166]:
print(df.dtypes)

id                       object
title                    object
type                     object
description              object
release_year              int64
age_certification        object
runtime                   int64
genres                   object
production_countries     object
seasons                  object
imdb_id                  object
imdb_score              float64
imdb_votes              float64
tmdb_popularity         float64
tmdb_score               object
dtype: object
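
To make the effect of `errors="coerce"` concrete, here is a small sketch on an invented mixed column. Note that the string `"0"` parses to `0.0`, while text that is not a number becomes NaN:

```python
import pandas as pd

# Hypothetical object column mixing strings and floats (values invented)
raw = pd.Series(["7.8", "0", 6.6, "n/a"], dtype=object)

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print(converted.dtype)
```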


### Drop rows with missing values in specific columns
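This removes any row where one of the three columns is NaN. An alternative would be to fill missing values with a statistic such as the mean or median. Note that if nulls were already replaced with "0" earlier, those entries are now `0.0` rather than NaN, so they are kept here and still enter the correlation.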

In [180]:
df = df.dropna(subset=["imdb_score", "imdb_votes", "tmdb_popularity"])
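
The alternative to dropping rows is to fill gaps with a column statistic. The sketch below compares both options on an invented frame; choosing the median rather than the mean is an assumption here, made because vote counts tend to be skewed:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the score and vote columns (values invented)
sample = pd.DataFrame({
    "imdb_score": [6.0, np.nan, 8.0, 7.0],
    "imdb_votes": [100.0, 300.0, np.nan, 200.0],
})

# Option 1: drop every row that has a gap in either column
dropped = sample.dropna(subset=["imdb_score", "imdb_votes"])

# Option 2: fill each column's gaps with that column's median
medians = sample[["imdb_score", "imdb_votes"]].median()
imputed = sample.fillna(medians)

print(len(dropped), len(imputed))
```

Dropping keeps only observed values but loses rows; imputing keeps all rows but pulls the filled values toward the middle of the distribution.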

### Correlation Analysis (1)
Now that the columns are numeric, the correlation matrix can be computed:

In [176]:
print(df[["release_year", "imdb_score", "imdb_votes", "tmdb_popularity"]].corr())

                 release_year  imdb_score  imdb_votes  tmdb_popularity
release_year         1.000000   -0.114379   -0.205648         0.042928
imdb_score          -0.114379    1.000000    0.159753         0.039150
imdb_votes          -0.205648    0.159753    1.000000         0.209695
tmdb_popularity      0.042928    0.039150    0.209695         1.000000


### Correlation Analysis (2)

The correlation matrix between key variables in the dataset reveals the following insights:

---

- **Release Year vs IMDb Score**: There is a slight negative correlation of **-0.11**. This suggests that newer titles tend to have slightly lower IMDb scores than older ones, although the relationship is weak.

---
  
- **Release Year vs IMDb Votes**: A weak negative correlation of **-0.21** indicates that newer titles tend to have somewhat fewer IMDb votes than older ones, though the relationship remains weak.

---
  
- **IMDb Score vs IMDb Votes**: A positive correlation of **0.16** implies a weak positive relationship. Titles with higher IMDb scores tend to have somewhat more votes, but the association is weak.

---

- **IMDb Score vs TMDB Popularity**: There is a very weak positive correlation of **0.04**, indicating almost no linear relationship between a title's IMDb score and its popularity on TMDB.

---

- **IMDb Votes vs TMDB Popularity**: A weak positive correlation of **0.21** shows that titles with more votes on IMDb tend to be slightly more popular on TMDB, although the correlation is still weak.

---

In summary, the dataset shows mostly weak relationships between the variables, with a few trends that may indicate subtle patterns in how movies are rated and voted on different platforms.
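
`corr()` defaults to Pearson correlation, which only measures linear association. For heavily skewed columns such as vote counts, a rank-based measure can tell a different story. The sketch below uses invented numbers, not the Netflix data, to show how the two can diverge:

```python
import pandas as pd

# Hypothetical data: votes grow much faster than scores (values invented)
sample = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "imdb_votes": [10, 100, 1000, 10000, 1000000],
})

# Pearson looks for a straight-line relationship
pearson = sample.corr(method="pearson").loc["imdb_score", "imdb_votes"]

# Spearman compares ranks, so any monotonic relationship scores highly
spearman = sample.corr(method="spearman").loc["imdb_score", "imdb_votes"]

print(round(pearson, 3), round(spearman, 3))
```

Whether Spearman would change the conclusions for this dataset is not something the analysis above establishes; it would need to be run on the real columns.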
