# 0.0 Imports

In [1]:
import warnings

import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt

## 0.1 Load data

### Feature Description based on the Project Plan

Our [project plan](../docs/planning.md) depends on a clear understanding of each dataset feature. No comprehensive description document is available, so the definitions below are inferred, keeping in mind that the plan covers movies in both the script-drafting and filming stages.

**Dataset Features**:
- `show_id`: Unique identifier for the media.
- `type`: Media type.
- `title`: Media title or name.
- `director`: The directing team responsible for the media.
- `cast`: Cast members involved.
- `country`: Planned country for filming or production.
- `date_added`: Date when the media was added to the database.
- `release_year`: Year the media was released.
- `rating`: Evaluation or rating score.
- `duration`: Media runtime or duration.
- `listed_in`: Categories or genres the media falls under.
- `description`: Brief synopsis of the media.
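
As a quick sanity check, the feature list above can be compared against the columns of a loaded frame. The sketch below uses a small hand-made frame in place of `netflix_data`; the `EXPECTED_COLUMNS` list and the `missing_columns` helper are illustrative names, not part of the project.

```python
import pandas as pd

# Columns named in the feature description above.
EXPECTED_COLUMNS = [
    "show_id", "type", "title", "director", "cast", "country",
    "date_added", "release_year", "rating", "duration",
    "listed_in", "description",
]


def missing_columns(df, expected):
    """Return the expected column names that are absent from df, in order."""
    return [col for col in expected if col not in df.columns]


# Toy frame standing in for netflix_data (hypothetical values).
toy = pd.DataFrame({"show_id": [1], "type": ["Movie"], "title": ["Example"]})
print(missing_columns(toy, EXPECTED_COLUMNS))
```

An empty list means every described feature is present; any name returned points to a mismatch between the description and the data.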

In [2]:
netflix_data = catalog.load("netflix_data")

- Partitioning the data up front is recommended practice because it simulates a production environment, but the held-out partitions would still have to pass through the entire cleaning and processing pipeline. Building such a pipeline outside the notebook is not feasible within this project's time constraints, so we will split the data after the exploratory data analysis and rely on cross-validation later.
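
The project's actual splitting code is not shown in this notebook. As a minimal sketch of the idea, assuming a simple random hold-out followed by k-fold cross-validation on the remaining rows, the hypothetical helpers `holdout_split` and `kfold_indices` below use only pandas and NumPy:

```python
import numpy as np
import pandas as pd


def holdout_split(df, test_frac=0.2, seed=42):
    """Randomly split df into (train, test) parts; assumes a unique index."""
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(index=test.index)
    return train, test


def kfold_indices(n_rows, n_splits=5, seed=42):
    """Yield (train_idx, valid_idx) positional index arrays, one pair per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_rows), n_splits)
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx


# Toy frame standing in for the cleaned dataset (hypothetical values).
toy = pd.DataFrame({"release_year": range(2000, 2020), "rating": range(20)})
train, test = holdout_split(toy)

for train_idx, valid_idx in kfold_indices(len(train)):
    fold_train = train.iloc[train_idx]
    fold_valid = train.iloc[valid_idx]
    # A model would be fitted on fold_train and scored on fold_valid here.
```

The hold-out part stays untouched until the final evaluation, while the folds are used for model selection.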

## 0.2 Helper Functions

In [3]:
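def notebook_settings():
    """Apply display, plotting and warning settings for this notebook."""
    # pandas display options
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', 60)
    pd.set_option('display.float_format', '{:.3f}'.format)

    # seaborn defaults go first: sns.set() rewrites rcParams such as font.size,
    # so the explicit matplotlib settings below must come after it
    sns.set()
    plt.style.use('bmh')
    plt.rcParams['figure.figsize'] = [28, 12]
    plt.rcParams['font.size'] = 24

    # silence library warnings in cell output
    warnings.filterwarnings('ignore')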

notebook_settings()

# 1.0 Data Description

In [4]:
data_description = netflix_data.copy()
data_description.head()

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | 2019-09-09 | 2019.0 | 41.0 | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... |
| 1 | 80117401 | Movie | Jandino: Whatever it Takes | | Jandino Asporaat | United Kingdom | 2016-09-09 | 2016.0 | 52.0 | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... |
| 2 | 70234439 | TV Show | Transformers Prime | | Peter Cullen, Sumalee Montano, Frank Welker, J... | United States | 2018-09-08 | 2013.0 | 82.0 | 1 Season | Kids' TV | With the help of three human allies, the Autob... |
| 3 | 80058654 | TV Show | Transformers: Robots in Disguise | | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | 2018-09-08 | 2016.0 | 64.0 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of... |
| 4 | 80125979 | Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins... | United States | 2017-09-08 | 2017.0 | 57.0 | 99 min | Comedies | When nerdy high schooler Dani finally attracts... |


## 1.1 Data Dimension

In [6]:
print(f'Number of rows: {data_description.shape[0]}')
print(f'Number of columns: {data_description.shape[1]}')

Number of rows: 6234
Number of columns: 12


## 1.2 Check NA

In [7]:
data_description.isna().sum() / data_description.shape[0] * 100


show_id         0.000
type            0.000
title           0.000
director       31.601
cast            0.016
country         0.016
date_added      0.192
release_year    0.016
rating          0.016
duration        0.016
listed_in       0.016
description     0.016
dtype: float64
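
The same percentages can be obtained a little more directly, because the mean of a boolean mask is the share of `True` values. The sketch below runs on a toy frame with hypothetical values rather than on `data_description`, and sorts the result so the worst columns come first:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical values).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
})

# Share of missing values per column, in percent, worst first.
na_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(na_pct)
```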

## 1.3 Drop NA

In [8]:
data_description = data_description.dropna()

data_description.isna().sum() / data_description.shape[0] * 100


show_id        0.000
type           0.000
title          0.000
director       0.000
cast           0.000
country        0.000
date_added     0.000
release_year   0.000
rating         0.000
duration       0.000
listed_in      0.000
description    0.000
dtype: float64
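
Because `dropna()` removes every row with at least one gap, a column with many missing values (here `director`) determines how many rows survive. The sketch below compares that behaviour with one possible alternative on a toy frame. The `"Unknown"` placeholder is an assumption for illustration only, not something this project uses.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical values).
toy = pd.DataFrame({
    "title": ["A", "B", "C", np.nan],
    "director": ["X", np.nan, np.nan, "Y"],
})

# Approach used in this notebook: drop every row with at least one gap.
dropped = toy.dropna()

# Possible alternative: label missing directors, then drop rows that
# still lack a value in any other column.
filled = toy.fillna({"director": "Unknown"}).dropna()

print(f"rows kept by dropna: {len(dropped)} of {len(toy)}")
print(f"rows kept after filling director: {len(filled)} of {len(toy)}")
```

Which option is preferable depends on whether the later analysis needs a real director name for every row.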

## 1.4 Data Types

In [10]:
data_description.dtypes


show_id                  int64
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year           float64
rating                 float64
duration                object
listed_in               object
description             object
dtype: object

In [14]:
data_description['release_year'] = data_description['release_year'].astype('int64')
data_description['rating'] = data_description['rating'].astype('int64')

data_description.dtypes


show_id                  int64
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                   int64
duration                object
listed_in               object
description             object
dtype: object
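
The cast to `int64` works here only because the rows with missing values were dropped beforehand. The sketch below, on a toy series with hypothetical values, shows the failure when a gap is present and, as an alternative this notebook does not use, the nullable `Int64` dtype that can hold missing values:

```python
import numpy as np
import pandas as pd

years = pd.Series([2019.0, 2016.0, np.nan], name="release_year")

# A plain int64 cast cannot represent NaN, so it raises an error.
try:
    years.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# After dropping the gap, the cast succeeds.
years_int = years.dropna().astype("int64")

# The nullable "Int64" extension dtype keeps the gap as <NA> instead.
years_nullable = years.astype("Int64")

print(cast_failed, years_int.dtype, years_nullable.dtype)
```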

## 1.6 Descriptive Statistics

In [15]:
df_aux = data_description.copy()

num_attributes = df_aux.select_dtypes(include=['int64', 'float64'])
cat_attributes = df_aux.select_dtypes(exclude=['int64', 'float64'])

### 1.6.1 Numerical Attributes

In [16]:
# Central tendency and Dispersion
range_values = pd.DataFrame(num_attributes.apply( lambda x: x.max() - x.min() )).T
statistic_metric = num_attributes.agg(['min', 'max', 'mean', 'median', 'std', 'skew', 'kurtosis'])

# Concatenate
metrics = pd.concat([range_values, statistic_metric]).T.reset_index()
metrics.columns = ['attributes', 'range', 'min', 'max', 'mean', 'median', 'std', 'skew', 'kurtosis']
metrics = metrics[['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']]

metrics

| | attributes | min | max | range | mean | median | std | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | show_id | 247747.0 | 81235729.0 | 80987982.0 | 75592271.88 | 80157831.5 | 12875723.525 | -4.366 | 20.879 |
| 1 | release_year | 1942.0 | 2020.0 | 78.0 | 2012.384 | 2016.0 | 9.705 | -3.27 | 13.654 |
| 2 | rating | 0.0 | 97.0 | 97.0 | 63.018 | 66.0 | 16.647 | -1.975 | 5.246 |


### 1.6.2 Categorical Attributes

In [17]:
# checking data variation
cat_attributes.apply(lambda x: x.unique().shape[0])


type              2
title          4238
director       3300
cast           3785
country         483
date_added     1050
duration        191
listed_in       307
description    4256
dtype: int64

- The cardinality of most categorical features is considerable and can affect analysis and visualization, so we will start with features of comparatively low cardinality, such as the media type (`type`, 2 values) and the genre (`listed_in`, 307 values).
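
Part of the cardinality of `listed_in` comes from each distinct combination of genres counting as its own value. A minimal sketch, assuming the genres are separated by `", "` as in the sample rows shown earlier and using a toy frame with hypothetical rows, splits the column into individual genres before counting:

```python
import pandas as pd

# Toy frame (hypothetical rows in the same format as listed_in).
toy = pd.DataFrame({
    "listed_in": [
        "Children & Family Movies, Comedies",
        "Stand-Up Comedy",
        "Comedies",
        "Children & Family Movies",
        "Kids' TV",
    ],
})

# Cardinality of the raw column: each distinct combination counts once.
raw_cardinality = toy["listed_in"].nunique()

# One row per individual genre after splitting on the ", " separator.
genres = toy["listed_in"].str.split(", ").explode()
genre_counts = genres.value_counts()

print(raw_cardinality, genres.nunique())
print(genre_counts)
```

The same pattern would apply to other multi-valued columns such as `cast` or `country`, provided they use the same separator.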

In [20]:
catalog.save("data_description", data_description)