# Package installation and imports
**Note:** This installation may take a few minutes, depending on your hardware. When it completes, rerun the cell to verify that the installation succeeded.

In [1]:
conda install pandas numpy matplotlib

Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.
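
As an alternative to rerunning the install cell, the check can be done from Python itself. The following is a minimal sketch; the helper name `check_packages` is hypothetical and not part of the original notebook.

```python
import importlib.util


def check_packages(names):
    """Return {package name: True/False} depending on whether it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# The three packages requested in the install cell above.
status = check_packages(["pandas", "numpy", "matplotlib"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'MISSING'}")
```

If a package is reported as missing right after installation, restarting the kernel (as the conda output suggests) is usually the first thing to try.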


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.mode.chained_assignment = None

path = "data.csv"
dataset = pd.read_csv(path)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


## Data overview

Looking at this data, there seem to be many possibilities for drawing meaningful insights.

Some aspects that I would like to investigate further, and that I think could help produce meaningful insights, are:

- Additions over time
- Distribution of production countries for existing content
- Content type
- Content rating
- Content genres
- Duration
- Changes in all of the above over time

Knowing that a fair bit of data is missing, we'll have to clean our dataset before proceeding to our analysis. We also have to ensure that our data is correctly entered, so that we can be sure we don't have fields that will skew any projections we make later.

# Data processing
We begin by filtering for and filling out missing values wherever it seems reasonable to do so. 

In [3]:
dataset.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

We begin by filling in the empty rows for the `director`, `cast` and `country` columns, before investigating the `date_added`, `rating` and `duration` columns further.

That so few values are missing in the latter three columns might indicate that something else is wrong, so let's investigate.

In [4]:
dataset.director = dataset.director.fillna("missing data")
dataset.cast = dataset.cast.fillna("missing data")
dataset.country = dataset.country.fillna("missing data")
dataset.isna().sum()

show_id          0
type             0
title            0
director         0
cast             0
country          0
date_added      10
release_year     0
rating           4
duration         3
listed_in        0
description      0
dtype: int64
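
The three column-wise fills can also be expressed as a single `fillna` call with a dict. This is a minimal sketch on a made-up toy frame (the names `toy`, `placeholder` and `fill_columns` are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "director": ["A. Director", np.nan],
    "cast": [np.nan, "B. Actor"],
    "country": [np.nan, np.nan],
    "rating": ["TV-MA", np.nan],
})

placeholder = "missing data"
fill_columns = ["director", "cast", "country"]

# Passing a dict to fillna fills only the listed columns;
# "rating" is left untouched so it can be inspected separately.
toy = toy.fillna({col: placeholder for col in fill_columns})

print(toy.isna().sum())
```

Keeping the placeholder in one variable also makes it harder to end up with two different sentinel strings later on.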

In [5]:
dataset.iloc[np.where(dataset.date_added.isna())]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
6066,s6067,TV Show,A Young Doctor's Notebook and Other Stories,missing data,"Daniel Radcliffe, Jon Hamm, Adam Godley, Chris...",United Kingdom,,2013,TV-MA,2 Seasons,"British TV Shows, TV Comedies, TV Dramas","Set during the Russian Revolution, this comic ..."
6174,s6175,TV Show,Anthony Bourdain: Parts Unknown,missing data,Anthony Bourdain,United States,,2018,TV-PG,5 Seasons,Docuseries,This CNN original series has chef Anthony Bour...
6795,s6796,TV Show,Frasier,missing data,"Kelsey Grammer, Jane Leeves, David Hyde Pierce...",United States,,2003,TV-PG,11 Seasons,"Classic & Cult TV, TV Comedies",Frasier Crane is a snooty but lovable Seattle ...
6806,s6807,TV Show,Friends,missing data,"Jennifer Aniston, Courteney Cox, Lisa Kudrow, ...",United States,,2003,TV-14,10 Seasons,"Classic & Cult TV, TV Comedies",This hit sitcom follows the merry misadventure...
6901,s6902,TV Show,Gunslinger Girl,missing data,"Yuuka Nanri, Kanako Mitsuhashi, Eri Sendai, Am...",Japan,,2008,TV-14,2 Seasons,"Anime Series, Crime TV Shows","On the surface, the Social Welfare Agency appe..."
7196,s7197,TV Show,Kikoriki,missing data,Igor Dmitriev,missing data,,2010,TV-Y,2 Seasons,Kids' TV,A wacky rabbit and his gang of animal pals hav...
7254,s7255,TV Show,La Familia P. Luche,missing data,"Eugenio Derbez, Consuelo Duval, Luis Manuel Áv...",United States,,2012,TV-14,3 Seasons,"International TV Shows, Spanish-Language TV Sh...","This irreverent sitcom featues Ludovico, Feder..."
7406,s7407,TV Show,Maron,missing data,"Marc Maron, Judd Hirsch, Josh Brener, Nora Zeh...",United States,,2016,TV-MA,4 Seasons,TV Comedies,"Marc Maron stars as Marc Maron, who interviews..."
7847,s7848,TV Show,Red vs. Blue,missing data,"Burnie Burns, Jason Saldaña, Gustavo Sorola, G...",United States,,2015,NR,13 Seasons,"TV Action & Adventure, TV Comedies, TV Sci-Fi ...","This parody of first-person shooter games, mil..."
8182,s8183,TV Show,The Adventures of Figaro Pho,missing data,"Luke Jurevicius, Craig Behenna, Charlotte Haml...",Australia,,2015,TV-Y7,2 Seasons,"Kids' TV, TV Comedies","Imagine your worst fears, then multiply them: ..."


Looking over the rows where `date_added` is missing, we can tell that the values are simply absent. In addition, the neighbouring columns to its left and right do not appear to have received its value by accident.
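
The visual check for a date that slipped into a neighbouring column could also be automated. The sketch below uses a made-up toy frame in which one date has deliberately been planted in `country`; the regular expression assumes dates formatted like `January 26, 2017`, as seen elsewhere in this dataset.

```python
import numpy as np
import pandas as pd

# Toy rows with a missing date_added; row 2 has a date planted in "country".
toy = pd.DataFrame({
    "country": ["United States", "Japan", "June 1, 2019"],
    "date_added": [np.nan, np.nan, np.nan],
    "release_year": [2003, 2008, 2015],
    "rating": ["TV-PG", "TV-14", "TV-Y7"],
})

# Assumed date shape: "<Month name> <day>, <year>".
DATE_PATTERN = r"^\s*[A-Z][a-z]+ \d{1,2}, \d{4}\s*$"

missing = toy[toy["date_added"].isna()]

# Compare every other column, cast to text, against the date pattern.
others = missing.drop(columns="date_added").astype(str)
looks_like_date = others.apply(lambda col: col.str.match(DATE_PATTERN))

# Rows where at least one other column holds something date-shaped.
suspect_rows = missing[looks_like_date.any(axis=1)]
print(suspect_rows)
```

An empty `suspect_rows` on the real data would support the conclusion that the dates are genuinely missing rather than misplaced.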

We'll therefore fill these elements with the `missing data` string, before doing the same check on the final two columns.

In [6]:
# Assign through .loc on the frame itself to avoid chained assignment.
dataset.loc[dataset.date_added.isna(), "date_added"] = "missing data"

In [7]:
dataset.iloc[np.where(dataset.rating.isna())]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5989,s5990,Movie,13TH: A Conversation with Oprah Winfrey & Ava ...,missing data,"Oprah Winfrey, Ava DuVernay",missing data,"January 26, 2017",2017,,37 min,Movies,Oprah Winfrey sits down with director Ava DuVe...
6827,s6828,TV Show,Gargantia on the Verdurous Planet,missing data,"Kaito Ishikawa, Hisako Kanemoto, Ai Kayano, Ka...",Japan,"December 1, 2016",2013,,1 Season,"Anime Series, International TV Shows","After falling through a wormhole, a space-dwel..."
7312,s7313,TV Show,Little Lunch,missing data,"Flynn Curry, Olivia Deeble, Madison Lu, Oisín ...",Australia,"February 1, 2018",2015,,1 Season,"Kids' TV, TV Comedies","Adopting a child's perspective, this show take..."
7537,s7538,Movie,My Honor Was Loyalty,Alessandro Pepe,"Leone Frisa, Paolo Vaccarino, Francesco Miglio...",Italy,"March 1, 2017",2015,,115 min,Dramas,"Amid the chaos and horror of World War II, a c..."


In [8]:
# Assign through .loc on the frame itself to avoid chained assignment.
dataset.loc[dataset.rating.isna(), "rating"] = "missing data"

In [9]:
dataset.iloc[np.where(dataset.duration.isna())]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5541,s5542,Movie,Louis C.K. 2017,Louis C.K.,Louis C.K.,United States,"April 4, 2017",2017,74 min,,Movies,"Louis C.K. muses on religion, eternal love, gi..."
5794,s5795,Movie,Louis C.K.: Hilarious,Louis C.K.,Louis C.K.,United States,"September 16, 2016",2010,84 min,,Movies,Emmy-winning comedy writer Louis C.K. brings h...
5813,s5814,Movie,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,"August 15, 2016",2015,66 min,,Movies,The comic puts his trademark hilarious/thought...


Looking at the rows with missing `duration`, we can quickly tell that the values actually ended up in the `rating` column.

To fix this, we copy the misplaced values into these three `duration` cells, then mark the corresponding `rating` cells as missing.

In [10]:
# Copy the misplaced values from rating into the empty duration cells.
missing_duration = dataset.duration.isna()
dataset.loc[missing_duration, "duration"] = dataset.loc[missing_duration, "rating"]

In [11]:
# Use the same "missing data" placeholder as in the other columns.
dataset.loc[[5541, 5794, 5813], "rating"] = "missing data"
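
The same repair can be written without hard-coded row numbers by detecting the misplaced values from their shape. This is a minimal sketch on a made-up toy frame; the pattern `^\d+ min$` is an assumption about what a misplaced duration looks like, based on the three rows shown above.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 has its duration stored in the rating column.
toy = pd.DataFrame({
    "title": ["Special A", "Film B", "Show C"],
    "rating": ["74 min", "PG-13", "TV-MA"],
    "duration": [np.nan, "115 min", "1 Season"],
})

# Rows where duration is missing and rating is shaped like "<number> min".
misplaced = toy["duration"].isna() & toy["rating"].str.match(r"^\d+ min$", na=False)

# Move the value to the right column, then mark the rating as missing.
toy.loc[misplaced, "duration"] = toy.loc[misplaced, "rating"]
toy.loc[misplaced, "rating"] = "missing data"

print(toy)
```

Selecting by a boolean mask keeps the fix valid even if the row order or index of the dataset changes.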

In [12]:
dataset.isnull().sum()

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

Now that we've handled the missing values in our data, we're ready to start analyzing our dataset.

## Productions by country
Beginning our analysis, we look into the `country` column to find the frequency of productions by country.

In [20]:
dataset.country.value_counts()

United States                             2818
India                                      972
missing data                               831
United Kingdom                             419
Japan                                      245
                                          ... 
Romania, Bulgaria, Hungary                   1
Uruguay, Guatemala                           1
France, Senegal, Belgium                     1
Mexico, United States, Spain, Colombia       1
United Arab Emirates, Jordan                 1
Name: country, Length: 749, dtype: int64

Counting the values, we see 749 distinct entries, far more than the total number of countries. This makes it tricky to immediately assess productions by country.

Since a production can list several countries separated by commas, we'll have to split on the comma, then count the number of occurrences of each country.

In [26]:
countries_sorted = dataset.country.apply(lambda x: pd.value_counts(x.replace(", ", ",").split(", "))).sum()

countries_sorted.sort_values(ascending=False)

United States                          2818.0
India                                   972.0
missing data                            831.0
United Kingdom                          419.0
Japan                                   245.0
                                        ...  
Romania,Bulgaria,Hungary                  1.0
Uruguay,Guatemala                         1.0
France,Senegal,Belgium                    1.0
Mexico,United States,Spain,Colombia       1.0
United Arab Emirates,Jordan               1.0
Length: 749, dtype: float64

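Here I apply a lambda function that replaces every comma followed by a space with a bare comma, then splits each element and counts the occurrences. Note, however, that the output above still contains combined entries such as `Romania,Bulgaria,Hungary` and still has 749 rows: once the spaces have been removed, splitting on `", "` finds nothing to split on. Splitting on `","` instead would leave only the individual country values and their number of occurrences.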
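
One way to get per-country counts is to split on the comma, put each country on its own row, and then count. The sketch below runs on a handful of made-up example values rather than the full dataset, and it uses `Series.str.split` with `explode` in place of the lambda above, so it is an alternative technique and not the notebook's original method.

```python
import pandas as pd

# A few example values in the same shape as the `country` column.
country = pd.Series([
    "United States",
    "India",
    "Mexico, United States, Spain, Colombia",
    "United Arab Emirates, Jordan",
    "missing data",
])

# Split on the comma and give every country its own row.
countries = country.str.split(",").explode()

# Remove the whitespace left over from ", " and drop empty fragments
# (which would appear if a value had a leading or trailing comma).
countries = countries.str.strip()
countries = countries[countries != ""]

country_counts = countries.value_counts()
print(country_counts)
```

On the real data this would be applied to `dataset.country`. Note that a co-production is then counted once for each country involved, so the counts sum to more than the number of titles.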