# Exploratory Data Analysis

The main objectives for this notebook are:
* Explore the clean dataset by performing univariate analysis
* Investigate the relationships between the features and the target by performing bivariate and multivariate analyses
* Extract relevant insights to share with business stakeholders
* Understand steps that will be required for ML pre-processing


## Notes
1. Using the Polars framework instead of pandas
2. Using interactive plots (e.g. Plotly) for visualisations
3. Writing clear insights after every section of the analysis
4. Using well-written and documented utility functions

# Imports

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
import sys, os
import plotly.io as pio

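# The utils folder sits outside this notebook's directory, so it is added to sys.path manually.
# '__file__' is a string literal here (notebooks do not define __file__), so dirname() returns ''
# and the path is resolved relative to the current working directory.
path2add = os.path.normpath(os.path.abspath(os.path.join(os.path.dirname('__file__'), os.path.pardir, 'utils')))
if path2add not in sys.path:
    sys.path.append(path2add)
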
import polars as pl
import plotly.express as px
from visualisations import bar_plot, proportion_plot, boxplot_by_bin_with_target
# etc

pio.renderers.default = 'notebook'
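The `sys.path` step in the cell above could also be written with `pathlib`. The following is a minimal sketch, assuming the notebook is run from a directory that sits next to the `utils` folder; the helper name `add_to_sys_path` is hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def add_to_sys_path(folder: Path) -> str:
    """Append `folder` to sys.path if it is missing and return the resolved path."""
    resolved = str(folder.resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Assumption: the current working directory is a sibling of `utils`.
utils_dir = add_to_sys_path(Path.cwd().parent / "utils")
print(utils_dir)
```

Calling the helper a second time leaves `sys.path` unchanged, which keeps the cell safe to re-run with `autoreload` enabled.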

In [4]:
import pandas as pd
data = pd.read_parquet("../data/supervised_clean_data.parquet")

In [5]:
data.head()

| Unnamed: 0 | Unnamed: 1 | _id | inter_api_access_duration(sec) | api_access_uniqueness | sequence_length(count) | vsession_duration(min) | ip_type | num_sessions | num_users | num_unique_apis | source | classification | is_anomaly |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1f2c32d8-2d6e-3b68-bc46-789469f2b71e | 0.000812 | 0.004066 | 85.643243 | 5405 | default | 1460.0 | 1295.0 | 451.0 | E | normal | False |
| 1 | 1 | 4c486414-d4f5-33f6-b485-24a8ed2925e8 | 6.3e-05 | 0.002211 | 16.166805 | 519 | default | 9299.0 | 8447.0 | 302.0 | E | normal | False |
| 2 | 2 | 7e5838fc-bce1-371f-a3ac-d8a0b2a05d9a | 0.004481 | 0.015324 | 99.573276 | 6211 | default | 255.0 | 232.0 | 354.0 | E | normal | False |
| 3 | 3 | 82661ecd-d87f-3dff-855e-378f7cb6d912 | 0.017837 | 0.014974 | 69.792793 | 8292 | default | 195.0 | 111.0 | 116.0 | E | normal | False |
| 4 | 4 | d62d56ea-775e-328c-8b08-db7ad7f834e5 | 0.000797 | 0.006056 | 14.952756 | 182 | default | 272.0 | 254.0 | 23.0 | E | normal | False |


In [6]:
data.shape

(1695, 13)

In [7]:
# info is a method: without the parentheses the cell only echoes the bound-method repr
data.info()

In [8]:
data.isna().sum()

                                  0
_id                               0
inter_api_access_duration(sec)    0
api_access_uniqueness             0
sequence_length(count)            0
vsession_duration(min)            0
ip_type                           0
num_sessions                      0
num_users                         0
num_unique_apis                   0
source                            0
classification                    0
is_anomaly                        0
dtype: int64

In [9]:
data = pl.read_parquet("../data/supervised_clean_data.parquet")
print(data.shape)
data.head()

(1695, 13)


| Unnamed: 0_level_0 | _id | inter_api_access_duration(sec) | api_access_uniqueness | sequence_length(count) | vsession_duration(min) | ip_type | num_sessions | num_users | num_unique_apis | source | classification | is_anomaly |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | f64 | f64 | f64 | i64 | str | f64 | f64 | f64 | str | str | bool |
| 0 | "1f2c32d8-2d6e-3b68-bc46-789469…" | 0.000812 | 0.004066 | 85.643243 | 5405 | "default" | 1460.0 | 1295.0 | 451.0 | "E" | "normal" | False |
| 1 | "4c486414-d4f5-33f6-b485-24a8ed…" | 6.3e-05 | 0.002211 | 16.166805 | 519 | "default" | 9299.0 | 8447.0 | 302.0 | "E" | "normal" | False |
| 2 | "7e5838fc-bce1-371f-a3ac-d8a0b2…" | 0.004481 | 0.015324 | 99.573276 | 6211 | "default" | 255.0 | 232.0 | 354.0 | "E" | "normal" | False |
| 3 | "82661ecd-d87f-3dff-855e-378f7c…" | 0.017837 | 0.014974 | 69.792793 | 8292 | "default" | 195.0 | 111.0 | 116.0 | "E" | "normal" | False |
| 4 | "d62d56ea-775e-328c-8b08-db7ad7…" | 0.000797 | 0.006056 | 14.952756 | 182 | "default" | 272.0 | 254.0 | 23.0 | "E" | "normal" | False |


In [10]:
data['ip_type'].unique()

| ip_type |
| --- |
| str |
| "default" |
| "datacenter" |
