#### GOAL

Generate insights by analysing the data to help our client decide what type of shows/movies to produce to develop the business in different countries.

#### PROBLEM STATEMENT 

Analyze the data and generate insights that could help Netflix decide which types of shows/movies to produce and how to grow the business in different countries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as mlt
import seaborn as sns
%matplotlib inline

In [15]:
data = pd.read_csv('dataset/netflix.csv')

In [31]:
unprocessed_data = data.copy()

#### BASIC METRICS

In [32]:
unprocessed_data.ndim # The dataset is 2 dimensional

2

In [34]:
unprocessed_data.head(2)

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."


In [35]:
unprocessed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


#### OBSERVATIONS

* The data has 8807 rows with 12 columns.
* There are null values in some columns (`director`, `cast`, `country`, `date_added`, `rating`, `duration`).
* The `type` and `rating` columns are stored as `object` and should be converted to categories.
* The only numerical field in the dataset is `release_year`.
* The `cast` and genre (`listed_in`) columns hold multiple comma-separated values per row, which need to be unnested.
* The `duration` column holds the running time in minutes for movies and the number of seasons for TV shows.
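The last two observations could be handled roughly as follows. This is a minimal sketch on a small made-up sample, not the notebook's actual cleaning step; the helper name `unnest`, the `", "` separator, and the `value`/`unit` column names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows that mimic the comma-separated and mixed-unit columns.
sample = pd.DataFrame({
    "title": ["Show A", "Movie B"],
    "cast": ["Actor One, Actor Two", None],
    "listed_in": ["TV Dramas, TV Mysteries", "Documentaries"],
    "duration": ["2 Seasons", "90 min"],
})

def unnest(df, column, sep=", "):
    """Return a copy of df with one row per value of a delimited column."""
    out = df.copy()
    out[column] = out[column].str.split(sep)   # string -> list of values
    return out.explode(column, ignore_index=True)

genres = unnest(sample, "listed_in")
cast = unnest(sample, "cast")   # a missing cast stays as a single NaN row

# Split duration into a number and its unit ("min" or "Season(s)").
parts = sample["duration"].str.extract(r"(?P<value>\d+)\s+(?P<unit>min|Seasons?)")
parts["value"] = pd.to_numeric(parts["value"])

print(genres[["title", "listed_in"]])
print(parts)
```

Unnesting multiplies rows, so counts per genre or per actor should be taken on the unnested frames rather than on the original one.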

In [21]:
unprocessed_data.shape

(8807, 12)

In [22]:
unprocessed_data.dtypes

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

In [23]:
unprocessed_data['type'] = unprocessed_data['type'].astype('category')
unprocessed_data['rating'] = unprocessed_data['rating'].astype('category')

In [24]:
unprocessed_data.dtypes

show_id           object
type            category
title             object
director          object
cast              object
country           object
date_added        object
release_year       int64
rating          category
duration          object
listed_in         object
description       object
dtype: object

In [25]:
unprocessed_data.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
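These counts are easier to compare as a share of the 8807 rows; `director`, for instance, is missing in roughly 30% of them (2634 / 8807). A minimal sketch of that calculation on a made-up frame (the sample values are illustrative only):

```python
import pandas as pd

# Made-up frame with a few missing values.
sample = pd.DataFrame({
    "director": ["A", None, None, "B"],
    "country": ["US", "IN", None, "UK"],
    "title": ["t1", "t2", "t3", "t4"],
})

# isna() gives booleans; their column mean is the fraction of missing rows.
missing_pct = sample.isna().mean().mul(100).round(2)
print(missing_pct)
```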

In [29]:
unprocessed_data.describe()

,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0
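The summary reports a mean of about 2014.18 and a median (the `50%` row) of 2017. Only the median splits the titles into two halves; with a long tail of old titles the mean sits earlier than the median. A small sketch with made-up release years shows the difference:

```python
import pandas as pd

# Made-up, left-skewed release years: a few old titles, many recent ones.
years = pd.Series([1925, 1990, 2015, 2017, 2018, 2019, 2020, 2021])

mean_year = years.mean()
median_year = years.median()

# Share of titles released on or after each statistic.
share_from_mean = (years >= mean_year).mean()
share_from_median = (years >= median_year).mean()

print(mean_year, share_from_mean)
print(median_year, share_from_median)
```

In this sample the median cuts the titles exactly in half, while three quarters of them fall on or after the mean.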


In [41]:
unprocessed_data.groupby("type").mean(numeric_only=True)

type,release_year
Movie,2013.121514
TV Show,2016.605755


In [39]:
unprocessed_data.groupby("rating").mean(numeric_only=True)

rating,release_year
66 min,2015.0
74 min,2017.0
84 min,2010.0
G,1997.804878
NC-17,2015.0
NR,2010.9125
PG,2008.428571
PG-13,2009.314286
R,2010.47184
TV-14,2013.655556


In [73]:
# Three values in the rating column look like durations.
# Collect the positional index of the first row matching each one.
invalid_rating_list = []

for invalid_rating in ['66 min', '74 min', '84 min']:
    # np.where returns a tuple of index arrays; [0][0] is the first match
    first_match = np.where(unprocessed_data['rating'] == invalid_rating)[0][0]
    invalid_rating_list.append(int(first_match))
print(invalid_rating_list)

[5813, 5541, 5794]


In [82]:
unprocessed_data.iloc[invalid_rating_list]

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5813,s5814,Movie,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,"August 15, 2016",2015,66 min,,Movies,The comic puts his trademark hilarious/thought...
5541,s5542,Movie,Louis C.K. 2017,Louis C.K.,Louis C.K.,United States,"April 4, 2017",2017,74 min,,Movies,"Louis C.K. muses on religion, eternal love, gi..."
5794,s5795,Movie,Louis C.K.: Hilarious,Louis C.K.,Louis C.K.,United States,"September 16, 2016",2010,84 min,,Movies,Emmy-winning comedy writer Louis C.K. brings h...


#### OBSERVATIONS

* The shape confirms 8807 rows and 12 columns, as noted in the earlier observations.
* We have converted the `type` and `rating` columns to categories.
* The movies/shows were released between 1925 and 2021.
* The mean release year is about 2014, but the median (the 50% row) is 2017, so at least half of the movies/shows were released in the 5 years from 2017 to 2021, compared with the 92 years from 1925 to 2016.
* By `type`, the mean release year is about 2013.1 for movies and 2016.6 for TV shows.
* For three movies, `duration` is missing and its value (`66 min`, `74 min`, `84 min`) appears in the `rating` column instead.
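One possible repair for the misplaced values is to move them into `duration` and blank out `rating`. The sketch below runs on made-up rows and assumes that the true rating of those titles is unknown and that a `<number> min` pattern reliably identifies them; the sample keeps `rating` as plain strings, so a categorical column may need an extra `cat.remove_unused_categories()` afterwards.

```python
import numpy as np
import pandas as pd

# Made-up rows reproducing the pattern: runtime stored in rating, duration empty.
sample = pd.DataFrame({
    "title": ["Special A", "Movie B"],
    "rating": ["74 min", "PG-13"],
    "duration": [None, "90 min"],
})

# Rows where rating looks like a runtime and duration is missing.
looks_like_runtime = sample["rating"].str.fullmatch(r"\d+ min", na=False)
misplaced = looks_like_runtime & sample["duration"].isna()

# Move the value to duration, then mark the rating as unknown.
sample.loc[misplaced, "duration"] = sample.loc[misplaced, "rating"]
sample.loc[misplaced, "rating"] = np.nan

print(sample)
```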