# NETFLIX Data Visualization and Analysis

![Netflix logo](https://upload.wikimedia.org/wikipedia/commons/6/69/Netflix_logo.svg)

Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This [dataset](https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization/data) is a cleaned version of the original, which can be found [here](https://www.kaggle.com/datasets/shivamb/netflix-shows). The data consists of titles added to Netflix from 2008 to 2021. I use this dataset to practise my data cleaning and visualization skills.

In [80]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import plotly.express as px
from plotly.offline import iplot, plot
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from collections import Counter

<p style="background-color:rgb(245,51,9);color:rgb(255,255,255);text-align:left;font-size:30px;padding:10px 10px;font-weight:bold;font-family:'Bebas Neue', sans-serif;">Loading the data </p>

In [81]:
data = pd.read_csv('netflix1.csv')
data.head()

Unnamed: 0,show_id,type,title,director,country,date_added,release_year,rating,duration,listed_in
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,United States,9/25/2021,2020,PG-13,90 min,Documentaries
1,s3,TV Show,Ganglands,Julien Leclercq,France,9/24/2021,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act..."
2,s6,TV Show,Midnight Mass,Mike Flanagan,United States,9/24/2021,2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries"
3,s14,Movie,Confessions of an Invisible Girl,Bruno Garotti,Brazil,9/22/2021,2021,TV-PG,91 min,"Children & Family Movies, Comedies"
4,s8,Movie,Sankofa,Haile Gerima,United States,9/24/2021,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies"


In [82]:
data.describe()

Unnamed: 0,release_year
count,8790.0
mean,2014.183163
std,8.825466
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


It seems that only one column holds numeric values; let's confirm that with `info()`.

In [83]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8790 entries, 0 to 8789
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8790 non-null   object
 1   type          8790 non-null   object
 2   title         8790 non-null   object
 3   director      8790 non-null   object
 4   country       8790 non-null   object
 5   date_added    8790 non-null   object
 6   release_year  8790 non-null   int64 
 7   rating        8790 non-null   object
 8   duration      8790 non-null   object
 9   listed_in     8790 non-null   object
dtypes: int64(1), object(9)
memory usage: 686.8+ KB


Yes, every column except `release_year` has the object dtype. Now let's see how many rows and columns the data has.

In [84]:
print(f'Nº of rows: {data.shape[0]}')
print(f'Nº of columns: {data.shape[1]}')

Nº of rows: 8790
Nº of columns: 10


Also, we'll see if there's any null value or any duplicate in the dataset.

In [85]:
data.isnull().sum()

show_id         0
type            0
title           0
director        0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
dtype: int64

In [86]:
data.duplicated().sum()

0

Looks like we have a clean dataset with no null values or duplicates. Time to prepare it for the analysis.
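
One caveat: `isnull()` only detects true missing values. Later outputs in this notebook show the placeholder string 'Not Given' in the `director` and `country` columns, and the null check does not count those. Below is a minimal sketch of counting such placeholders. The toy rows and the helper name `count_placeholders` are illustrative assumptions, not part of the dataset or the original notebook.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use `data` instead.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Jane Doe", "Not Given", "Not Given", "John Roe"],
    "country": ["France", "Not Given", "India", "Brazil"],
})


def count_placeholders(df, placeholder="Not Given"):
    """Count, per column, how many cells equal the placeholder string."""
    return df.eq(placeholder).sum()


placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

In this toy frame `isnull().sum()` is zero for every column even though three cells carry no real information, which is why both checks are worth running.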

<p style="background-color:rgb(245,51,9);color:rgb(255,255,255);text-align:left;font-size:30px;padding:10px 10px;font-weight:bold;font-family:'Bebas Neue', sans-serif;">Processing the data</p>

First of all, we will drop the 'show_id' column, which is not needed for this analysis.

In [87]:
data.drop(columns='show_id', inplace=True)

We change the datatype of the 'date_added' column to datetime and add some new columns to the dataset: 'year_added', 'month_added' and 'day_added'. As the name suggests, these columns will show the year, month and day the movie or TV show was added to the Netflix catalogue.

In [88]:
data['date_added'] = pd.to_datetime(data['date_added'])

data['year_added'] = data['date_added'].dt.year
data['month_added'] = data['date_added'].dt.month_name()
data['day_added'] = data['date_added'].dt.day_name()
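
The date strings shown earlier look like month/day/year (for example `9/25/2021`). As a hedged variant of the conversion above, the sketch below passes an explicit `format` so that pandas does not have to guess the order of day and month. The three sample strings are copied from the `head()` output; treating every row as `%m/%d/%Y` is an assumption about the rest of the column.

```python
import pandas as pd

# Sample strings copied from the 'date_added' column shown earlier.
dates = pd.Series(["9/25/2021", "9/24/2021", "9/22/2021"])

# An explicit format avoids day/month guessing; with the default
# errors="raise", a string in another layout fails loudly.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

parts = pd.DataFrame({
    "year_added": parsed.dt.year,
    "month_added": parsed.dt.month_name(),
    "day_added": parsed.dt.day_name(),
})
print(parts)
```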

In [89]:
data

Unnamed: 0,type,title,director,country,date_added,release_year,rating,duration,listed_in,year_added,month_added,day_added
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,United States,2021-09-25,2020,PG-13,90 min,Documentaries,2021,September,Saturday
1,TV Show,Ganglands,Julien Leclercq,France,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",2021,September,Friday
2,TV Show,Midnight Mass,Mike Flanagan,United States,2021-09-24,2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",2021,September,Friday
3,Movie,Confessions of an Invisible Girl,Bruno Garotti,Brazil,2021-09-22,2021,TV-PG,91 min,"Children & Family Movies, Comedies",2021,September,Wednesday
4,Movie,Sankofa,Haile Gerima,United States,2021-09-24,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies",2021,September,Friday
...,...,...,...,...,...,...,...,...,...,...,...,...
8785,TV Show,Yunus Emre,Not Given,Turkey,2017-01-17,2016,TV-PG,2 Seasons,"International TV Shows, TV Dramas",2017,January,Tuesday
8786,TV Show,Zak Storm,Not Given,United States,2018-09-13,2016,TV-Y7,3 Seasons,Kids' TV,2018,September,Thursday
8787,TV Show,Zindagi Gulzar Hai,Not Given,Pakistan,2016-12-15,2012,TV-PG,1 Season,"International TV Shows, Romantic TV Shows, TV ...",2016,December,Thursday
8788,TV Show,Yoko,Not Given,Pakistan,2018-06-23,2016,TV-Y,1 Season,Kids' TV,2018,June,Saturday


<p style="background-color:rgb(245,51,9);color:rgb(255,255,255);text-align:left;font-size:30px;padding:10px 10px;font-weight:bold;font-family:'Bebas Neue', sans-serif;"> Data Analysis</p>

## Type of show

In [90]:
types = data["type"].value_counts()
types

type
Movie      6126
TV Show    2664
Name: count, dtype: int64

In [91]:
fig = px.bar(data_frame = types,
      x = types.index,
      y = types,
      title='Movies vs TV Shows',
      text_auto=True,
      labels= {'y' : 'Sum',
              'type' : 'Show Type',
              'color' : 'Show Type'},
      color = types.index,
      color_discrete_sequence=["darkred", "royalblue"])

fig.update_traces(marker_line_color='black',
                  marker_line_width=1.5, opacity=0.8)
fig.show()

In [92]:
fig = px.pie(values = [types.iloc[0], types.iloc[1]],
            title = 'Movies vs TV Shows',
            names = ['Movies', 'TV Shows'],
            color_discrete_sequence=["darkred", "royalblue"])
fig.update_traces(textposition='inside', textinfo='percent+value')
fig.show()

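The pie chart shows that movies make up 69.7% of the titles added to Netflix between 2008 and 2021 (6126 of 8790), while TV shows make up 30.3% (2664). That is a gap of 39.4 percentage points, or roughly 2.3 movies for every TV show.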
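
The shares behind the pie chart can be reproduced directly from the two counts printed above. This is a small arithmetic sketch in plain Python; the variable names are illustrative.

```python
# Counts taken from the value_counts() output above.
counts = {"Movie": 6126, "TV Show": 2664}
total = sum(counts.values())

# Share of the catalogue for each type, in percent.
shares = {kind: 100 * n / total for kind, n in counts.items()}

# Gap between the two shares, in percentage points.
gap_points = shares["Movie"] - shares["TV Show"]

# Ratio of movies to TV shows.
ratio = counts["Movie"] / counts["TV Show"]

print({k: round(v, 1) for k, v in shares.items()}, round(gap_points, 1), round(ratio, 2))
```

The difference between two shares is measured in percentage points, which is not the same thing as a percentage increase.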

## Directors

In [93]:
data.head()

Unnamed: 0,type,title,director,country,date_added,release_year,rating,duration,listed_in,year_added,month_added,day_added
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,United States,2021-09-25,2020,PG-13,90 min,Documentaries,2021,September,Saturday
1,TV Show,Ganglands,Julien Leclercq,France,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",2021,September,Friday
2,TV Show,Midnight Mass,Mike Flanagan,United States,2021-09-24,2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",2021,September,Friday
3,Movie,Confessions of an Invisible Girl,Bruno Garotti,Brazil,2021-09-22,2021,TV-PG,91 min,"Children & Family Movies, Comedies",2021,September,Wednesday
4,Movie,Sankofa,Haile Gerima,United States,2021-09-24,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies",2021,September,Friday


In [94]:
data['director'].nunique()

4528

In [95]:
directors = data['director'].value_counts()
directors

director
Not Given                         2588
Rajiv Chilaka                       20
Alastair Fothergill                 18
Raúl Campos, Jan Suter              18
Suhas Kadav                         16
                                  ... 
Matt D'Avella                        1
Parthiban                            1
Scott McAboy                         1
Raymie Muzquiz, Stu Livingston       1
Mozez Singh                          1
Name: count, Length: 4528, dtype: int64

In [96]:
given_directors = directors.sum() - directors.loc['Not Given']
print(f'There are {given_directors} movies/TV shows with a given director')

There are 6202 movies/TV shows with a given director


In [97]:
fig = px.pie(values = [given_directors, directors.iloc[0]],
            title = 'Given Directors vs Not Given',
            names = ['Given Directors', 'Not Given'],
            color_discrete_sequence=["darkred", "royalblue"])
fig.update_traces(textposition='inside', textinfo='percent+value')
fig.show()

About 29% of the movies and TV shows (2588 of 8790) have no director given.
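
The same share can be computed without relying on 'Not Given' being the first entry of `value_counts()`. The sketch below uses a made-up toy column; in the notebook the input would be `data['director']`.

```python
import pandas as pd

# Hypothetical toy column; in the notebook this would be data["director"].
director = pd.Series(
    ["Not Given", "Rajiv Chilaka", "Not Given", "Suhas Kadav", "Mike Flanagan"]
)

# The mean of a boolean mask is the fraction of True values.
missing_share = (director == "Not Given").mean()

# Look the count up by label rather than by position, so the result does not
# depend on 'Not Given' happening to be the most frequent value.
missing_count = director.value_counts().loc["Not Given"]

print(f"{missing_share:.1%} of rows ({missing_count}) have no director given")
```

With the counts shown above, the real figure is 2588 / 8790, which is about 29.4%.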

## 5 Countries with most shows added

In [98]:
countries = data["country"].value_counts()
countries = countries[countries.index != 'Not Given']
countries.head(5)

country
United States     3240
India             1057
United Kingdom     638
Pakistan           421
Canada             271
Name: count, dtype: int64

In [99]:
data['country'].nunique()

86

In [100]:
fig = px.bar(x = countries.head(5).index, y = countries.head(5),
             color = countries.head(5).index,
            color_discrete_sequence=px.colors.qualitative.G10,
            labels = {'x':'Countries', 'y':'Nº of shows added', 'color':'Countries'})


fig.update_traces(marker_line_color='black',
                  marker_line_width=1.5, opacity=0.8)
fig.show()

## Duration

In [101]:
data['duration'].value_counts().head(5)

duration
1 Season     1791
2 Seasons     421
3 Seasons     198
90 min        152
97 min        146
Name: count, dtype: int64
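
The output above mixes two units: seasons for TV shows and minutes for movies. To analyse durations numerically, the column can be split into a number and a unit. The sketch below is one possible approach using `str.extract`; the regular expression and the column names `value` and `unit` are assumptions made for illustration.

```python
import pandas as pd

# Sample values copied from the 'duration' outputs above.
duration = pd.Series(["90 min", "1 Season", "2 Seasons", "125 min"])

# Named groups become column names: a number, then either 'min' or 'Season'
# (the trailing 's' of 'Seasons' is simply left unmatched).
parts = duration.str.extract(r"(?P<value>\d+)\s*(?P<unit>min|Season)")
parts["value"] = parts["value"].astype(int)

minutes = parts.loc[parts["unit"] == "min", "value"]
seasons = parts.loc[parts["unit"] == "Season", "value"]
print(minutes.mean(), seasons.max())
```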

## Directors with most films & tv shows

In [102]:
fig = px.bar(directors.iloc[1:11], x = directors.iloc[1:11].index,
            y = directors.iloc[1:11],
            color = directors.iloc[1:11].index,
            labels = { 'y' : 'Nº of movies & tv shows', 'x' : 'Director', 'color' : 'Director'},
            color_discrete_sequence=px.colors.qualitative.Plotly)
fig.update_traces(marker_line_color='black',
                  marker_line_width=1.5, opacity=0.8)
fig.show()

In [103]:
directors.loc['Martin Scorsese']

12

The director with the most movies and TV shows in Netflix's catalogue is Rajiv Chilaka, with 20 productions. Martin Scorsese, probably the best-known name on the list, has 12 titles, 40% fewer than Chilaka.
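
Note that `value_counts()` treats a joint credit such as 'Raúl Campos, Jan Suter' as one director. If per-person counts are wanted, the names can be split and exploded first. This is a sketch on made-up rows, assuming co-directors are always separated by a comma and a space.

```python
import pandas as pd

# Hypothetical toy column; the joint entry mirrors 'Raúl Campos, Jan Suter' above.
director = pd.Series([
    "Raúl Campos, Jan Suter",
    "Raúl Campos, Jan Suter",
    "Jan Suter",
    "Not Given",
])

per_person = (
    director[director != "Not Given"]  # drop the placeholder first
    .str.split(", ")                   # one list of names per title
    .explode()                         # one row per (title, name) pair
    .value_counts()
)
print(per_person)
```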

## How many movies are added per year?

In [104]:
mask_movie = data['type'] == 'Movie'
data_movies = data.copy()
data_movies = data_movies.loc[mask_movie]

In [105]:
movies_added_year = data_movies.groupby(data_movies["year_added"])["type"].count()
movies_added_year

year_added
2008       1
2009       2
2010       1
2011      13
2012       3
2013       6
2014      19
2015      56
2016     251
2017     836
2018    1237
2019    1424
2020    1284
2021     993
Name: type, dtype: int64

In [106]:
fig = px.line(x=movies_added_year.index,
      y = movies_added_year,
      markers = True, line_shape='spline',
      width=900, height=500,
      labels={ 'x': 'Year', 'y' : 'Nº of additions'},
      title='Movies added per year')

fig.update_layout(
    margin=dict(l=40, r=10, t=80, b=20),
)
fig.update_traces(line=dict(color='darkred'))
fig.update_yaxes(automargin=True)
fig.show()

The number of movies added grows year after year until it peaks in 2019, with 1424 additions, and then declines. Note that the 2021 figure may be incomplete, since the most recent additions in the sample rows date from September 2021.
The sharpest growth happens between 2016 and 2019: additions jump from 251 in 2016 to 836 in 2017 and 1237 in 2018. This may reflect the rise in popularity of the Netflix platform.
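
The peak and the year-over-year changes described here can be checked numerically. The sketch below rebuilds the series from the counts printed above rather than from the CSV, so it runs on its own.

```python
import pandas as pd

# Counts copied from the movies_added_year output above.
movies_added_year = pd.Series(
    [1, 2, 1, 13, 3, 6, 19, 56, 251, 836, 1237, 1424, 1284, 993],
    index=range(2008, 2022),
    name="movies_added",
)

peak_year = movies_added_year.idxmax()    # year with the most additions
yearly_change = movies_added_year.diff()  # absolute change vs. previous year
growth = movies_added_year.pct_change()   # relative change vs. previous year

print(peak_year, yearly_change.loc[2020], round(growth.loc[2017], 2))
```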

## How many TV shows are added per year?

In [107]:
mask_tv = data['type'] == 'TV Show'
data_tv = data.copy()
data_tv= data_tv.loc[mask_tv]

In [108]:
tv_added_year = data_tv.groupby(data_tv["year_added"])["type"].count()
tv_added_year

year_added
2008      1
2013      5
2014      5
2015     26
2016    175
2017    349
2018    411
2019    592
2020    595
2021    505
Name: type, dtype: int64

In [109]:
fig = px.line(x=tv_added_year.index,
      y = tv_added_year,
      markers = True, line_shape='spline',
      width=900, height=500, 
      labels={ 'x': 'Year', 'y' : 'Nº of additions'},
      title='TV Shows added per year')

fig.update_layout(
    margin=dict(l=40, r=10, t=80, b=20),
)
fig.update_traces(line=dict(color='royalblue'))
fig.update_yaxes(automargin=True)
fig.show()

As with movies, TV show additions grow steadily through 2019. Here, however, the count stays essentially flat in 2020 (595 versus 592 in 2019) before dropping to 505 in 2021, so the decline is later and milder than for movies.
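
The TV series above has no rows for 2009 to 2012, so a line chart connects 2008 directly to 2013. One way to make the gap explicit is to reindex over the full range of years and fill with zero. This is a sketch using the counts printed above; the name `tv_full` is illustrative.

```python
import pandas as pd

# Counts copied from the tv_added_year output above (2009-2012 are absent).
tv_added_year = pd.Series(
    [1, 5, 5, 26, 175, 349, 411, 592, 595, 505],
    index=[2008, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021],
)

# Make the gap years explicit so a line chart does not interpolate across them.
tv_full = tv_added_year.reindex(range(2008, 2022), fill_value=0)

print(tv_full.loc[2010], tv_full.idxmax(), len(tv_full))
```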

## How many movies and TV shows come from each country?

In [110]:
show_origin = data.pivot_table(index="country", columns = data["type"], values = "type", 
                                   aggfunc="count")
show_origin = show_origin.fillna(0).sort_values("Movie", ascending = False)
show_origin

type,Movie,TV Show
country,Unnamed: 1_level_1,Unnamed: 2_level_1
United States,2395.0,845.0
India,976.0,81.0
United Kingdom,387.0,251.0
Not Given,257.0,30.0
Canada,187.0,84.0
...,...,...
Ukraine,0.0,2.0
Cyprus,0.0,1.0
Puerto Rico,0.0,1.0
Senegal,0.0,1.0


The table includes a 'Not Given' row; let's remove it.

In [111]:
show_origin = show_origin[show_origin.index != 'Not Given']
show_origin

type,Movie,TV Show
country,Unnamed: 1_level_1,Unnamed: 2_level_1
United States,2395.0,845.0
India,976.0,81.0
United Kingdom,387.0,251.0
Canada,187.0,84.0
France,148.0,65.0
...,...,...
Ukraine,0.0,2.0
Cyprus,0.0,1.0
Puerto Rico,0.0,1.0
Senegal,0.0,1.0


In [112]:
fig = px.bar(show_origin.head(), barmode="group", color_discrete_sequence=["darkred", "royalblue"],
            labels={'value': 'Value', 'country': 'Countries','type': 'Type'})


fig.show()

Netflix has the most movies from the United States, followed by India in second place and the United Kingdom in third place.
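
As an alternative to `pivot_table` followed by `fillna(0)`, `pd.crosstab` builds the same kind of country-by-type table with integer counts and zeros already filled in. The sketch below uses made-up rows; in the notebook the inputs would be `data['country']` and `data['type']`.

```python
import pandas as pd

# Hypothetical toy rows; the notebook would pass data["country"] and data["type"].
toy = pd.DataFrame({
    "country": ["United States", "United States", "India", "Senegal", "Not Given"],
    "type": ["Movie", "TV Show", "Movie", "TV Show", "Movie"],
})

counts = (
    pd.crosstab(toy["country"], toy["type"])  # integer counts, zeros already filled
    .drop(index="Not Given")                  # remove the placeholder row
    .sort_values("Movie", ascending=False)
)
print(counts)
```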

## Most popular categories

In [113]:
categories = data["listed_in"].str.split(", ")
categories.head()

0                                      [Documentaries]
1    [Crime TV Shows, International TV Shows, TV Ac...
2                 [TV Dramas, TV Horror, TV Mysteries]
3                 [Children & Family Movies, Comedies]
4    [Dramas, Independent Movies, International Mov...
Name: listed_in, dtype: object

In [114]:
counter = Counter()
for category in categories:
    counter.update(category)

In [115]:
most_popular = counter.most_common(10)

In [116]:
most_popular[0:10]

[('International Movies', 2752),
 ('Dramas', 2426),
 ('Comedies', 1674),
 ('International TV Shows', 1349),
 ('Documentaries', 869),
 ('Action & Adventure', 859),
 ('TV Dramas', 762),
 ('Independent Movies', 756),
 ('Children & Family Movies', 641),
 ('Romantic Movies', 616)]

In [117]:
category = list()
position = list()
for i in most_popular:
    category.append(i[0])
    position.append(i[1])

In [118]:
fig = go.Figure(data=[go.Pie(labels=category, values=position, pull=[0.1, 0, 0, 0])])
fig.update_traces(textposition='inside', textinfo='percent+value')
fig.show()

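Among the ten most common genres, "International Movies" leads with 2752 titles, which is 21.7% of the chart, followed by 'Dramas' and 'Comedies'. These percentages are shares of the top-ten genre tags, not of all titles: a title can carry several genres, and 2752 of 8790 titles is about 31%.

The smallest slices in this top ten are 'Children & Family Movies' and 'Romantic Movies', with 5.05% and 4.85%, respectively. They are not the least popular genres overall, since rarer genres are left out of the chart.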
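
Genre counts can also be obtained with pandas alone, and dividing by the number of titles gives the share of titles carrying each genre. In this sketch the first four strings are copied from the `listed_in` column shown earlier, while the last one is made up so that some genres repeat.

```python
import pandas as pd

# First four values copied from 'listed_in' above; the last row is hypothetical.
listed_in = pd.Series([
    "Documentaries",
    "TV Dramas, TV Horror, TV Mysteries",
    "Children & Family Movies, Comedies",
    "Dramas, Independent Movies, International Movies",
    "Dramas, Comedies",
])

# One row per (title, genre) pair, then count how often each genre occurs.
genre_counts = listed_in.str.split(", ").explode().value_counts()

# Share of titles carrying each genre; these shares can sum to more than 100%
# because one title may have several genres.
share_of_titles = genre_counts / len(listed_in)
print(share_of_titles.head(3))
```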

## Oldest movies in Netflix

In [119]:
# Take the five rows with the earliest release_year. (value_counts() would rank
# years by how many titles they have, not by age.) Rows come back sorted ascending.
oldest_movies = data.nsmallest(5, 'release_year')

In [120]:
fig = px.bar(oldest_movies, x='release_year', y = 'title', text_auto=True, color='title',
            labels= {'release_year': 'Release year', 'title' : 'Film Title'},
            title='5 Oldest Movies In Netflix')
fig.update_xaxes(range=[1800, max(data['release_year']) + 1])
fig.show()

It seems that the oldest film in the Netflix catalogue is *Pioneers: First Women Filmmakers*, from 1925. Almost a century old!
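
When selecting the oldest titles, it helps to distinguish between ranking years by how often they occur and ranking titles by their release year. The following sketch shows the difference on made-up rows; the titles and years are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; titles and years are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "release_year": [2019, 1925, 2019, 1943, 2021, 1942],
})

# value_counts() ranks years by how many titles they have, so nsmallest on it
# returns the rarest years, which need not be the earliest ones.
rarest_years = toy["release_year"].value_counts().nsmallest(3).index
print(list(rarest_years))

# Ranking the rows themselves by release_year gives the oldest titles.
oldest = toy.nsmallest(3, "release_year")
print(oldest)
```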