### Data Exploration and Analysis

In [1]:
import sys
sys.path.insert(0,'../scripts/')

In [2]:
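# Set notebook display options
import pandas as pd
# Use the full option name; the short pattern 'max_column' can be ambiguous in recent pandas versions
pd.set_option('display.max_columns', None)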
pd.set_option('display.float_format', '{:.2f}'.format)

In [3]:
from data_loader import load_df_from_csv
from data_information import DataInfo
from data_cleaner import DataCleaner

#### Type Modification before Analysis

In [4]:
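# Load the cleaned dataset and drop the first column of the loaded file
clean_data = load_df_from_csv("../data/teleCo_clean_data.csv")
cleaner = DataCleaner(clean_data)
cleaner.remove_unwanted_columns(cleaner.df.columns[0])
# Treat identifier and handset columns as categorical rather than numeric
category_columns = ['IMSI', 'Handset Manufacturer', 'Handset Type', 'IMEI', 'MSISDN/Number', 'Bearer Id']
df = cleaner.change_columns_type_to(category_columns, 'category')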

explorer = DataInfo(df)
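
`DataCleaner.change_columns_type_to` is a project helper whose source is not shown in this notebook. As a rough sketch, assuming it wraps pandas `astype`, the conversion of identifier columns to `category` might look like the following. The `to_category` function and the toy frame are illustrative names and values, not the project's code.

```python
import pandas as pd

def to_category(df, columns):
    """Return a copy of df with the given columns cast to the 'category' dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out

# Toy stand-in for the telecom dataset (values are made up for illustration)
toy = pd.DataFrame({
    "IMSI": [208201710567424, 208201710567424, 208201710567425],
    "Handset Manufacturer": ["Apple", "Huawei", "Apple"],
    "Dur. (hr)": [0.02, 0.03, 0.52],
})

converted = to_category(toy, ["IMSI", "Handset Manufacturer"])
print(converted.dtypes)
```

Casting identifiers such as IMSI to `category` keeps them out of numeric summaries, where a mean or standard deviation of an ID would be meaningless.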

### Variable Identification

In [5]:
explorer.get_information()

DataFrame Information: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 60 columns):
 #   Column                               Non-Null Count   Dtype   
---  ------                               --------------   -----   
 0   Bearer Id                            150000 non-null  category
 1   Start                                150000 non-null  object  
 2   Start sec                            150000 non-null  float32 
 3   End                                  150000 non-null  object  
 4   End sec                              150000 non-null  float32 
 5   Dur. (hr)                            150000 non-null  float32 
 6   IMSI                                 150000 non-null  category
 7   MSISDN/Number                        150000 non-null  category
 8   IMEI                                 150000 non-null  category
 9   Last Location Name                   148848 non-null  object  
 10  Avg RTT DL (ms)                      150000 

#### Type of Variables
Many predictor-target pairings are possible depending on the purpose, but for now let's assume the target is the total duration of use.
- Predictor Variables
    - start_date
    - start_time
    - MSISDN/Number
    - Total Data (MegaBytes)
    - Total Social Media Data (MegaBytes)
    - Total Google Data (MegaBytes)
    - Total Email Data (MegaBytes)
    - Total Youtube Data (MegaBytes)
    - Total Netflix Data (MegaBytes)
    - Total Gaming Data (MegaBytes)
    - Total Other Data (MegaBytes)
    - Jitter (ms)
    - Avg Delay (ms)
    - Avg Throughput (kbps)
    - Last Location Name
- Target Variables
    - end_date
    - end_time
    - Total Duration (hr)
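
The predictor/target split listed above could be expressed in pandas roughly as follows. This is only a sketch: the toy frame stands in for `explorer.df`, its values are made up, only a few of the listed columns are included, and `predictor_cols` and `target_col` are names introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for explorer.df with a handful of the listed columns
toy = pd.DataFrame({
    "Total Data (MegaBytes)": [500.0, 280.0, 710.0],
    "Jitter (ms)": [480.0, 120.0, 5900.0],
    "Avg Delay (ms)": [40.0, 49.0, 27.0],
    "Total Duration (hr)": [24.0, 16.0, 37.0],
})

predictor_cols = ["Total Data (MegaBytes)", "Jitter (ms)", "Avg Delay (ms)"]
target_col = "Total Duration (hr)"

X = toy[predictor_cols]  # predictor matrix
y = toy[target_col]      # target vector
print(X.shape, y.shape)
```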

#### Data Types
- Character

In [6]:
explorer.df.describe(include='object').columns.to_list()

['Start', 'End', 'Last Location Name', 'start_time', 'end_time']

- Numeric

In [7]:
explorer.df.describe(include=['float32','float64']).columns.to_list()

['Start sec',
 'End sec',
 'Dur. (hr)',
 'Avg RTT DL (ms)',
 'Avg RTT UL (ms)',
 'Avg Bearer TP DL (kbps)',
 'Avg Bearer TP UL (kbps)',
 'DL TP < 50 Kbps (%)',
 '50 Kbps < DL TP < 250 Kbps (%)',
 '250 Kbps < DL TP < 1 Mbps (%)',
 'DL TP > 1 Mbps (%)',
 'UL TP < 10 Kbps (%)',
 '10 Kbps < UL TP < 50 Kbps (%)',
 '50 Kbps < UL TP < 300 Kbps (%)',
 'UL TP > 300 Kbps (%)',
 'Activity Duration DL (sec)',
 'Activity Duration UL (sec)',
 'Total Duration (hr)',
 'Nb of sec with Vol DL < 6250B',
 'Nb of sec with Vol UL < 1250B',
 'Social Media DL (MegaBytes)',
 'Social Media UL (MegaBytes)',
 'Google DL (MegaBytes)',
 'Google UL (MegaBytes)',
 'Email DL (MegaBytes)',
 'Email UL (MegaBytes)',
 'Youtube DL (MegaBytes)',
 'Youtube UL (MegaBytes)',
 'Netflix DL (MegaBytes)',
 'Netflix UL (MegaBytes)',
 'Gaming DL (MegaBytes)',
 'Gaming UL (MegaBytes)',
 'Other DL (MegaBytes)',
 'Other UL (MegaBytes)',
 'Total UL (MegaBytes)',
 'Total DL (MegaBytes)',
 'Total Data (MegaBytes)',
 'Total Social Media Da

#### Variable Category
- Categorical

In [8]:
explorer.df.describe(include='category').columns.to_list()

['Bearer Id',
 'IMSI',
 'MSISDN/Number',
 'IMEI',
 'Handset Manufacturer',
 'Handset Type',
 'start_date',
 'end_date']

- Continuous
    Apart from the categorical variables above, the remaining numeric columns are all continuous.
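
One way to make the categorical/continuous split explicit is pandas `select_dtypes`. The sketch below uses a small made-up frame rather than `explorer.df`. Note that `object` columns (such as `Last Location Name` here) are neither `category` nor numeric, so they land in neither list and need separate handling.

```python
import pandas as pd

# Toy stand-in mixing category, object, and float columns
toy = pd.DataFrame({
    "Handset Manufacturer": pd.Series(["Apple", "Huawei", "Apple"], dtype="category"),
    "Last Location Name": ["D41377B", "D41377B", None],
    "Dur. (hr)": pd.Series([0.02, 0.03, 0.52], dtype="float32"),
    "Total Data (MegaBytes)": [500.0, 280.0, 710.0],
})

categorical_cols = toy.select_dtypes(include="category").columns.to_list()
continuous_cols = toy.select_dtypes(include="number").columns.to_list()

print(categorical_cols)
print(continuous_cols)
```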

### Data Understanding (Basic Metrics Analysis)

In [9]:
explorer.get_description()

DataFrame Description: 


Unnamed: 0,Start sec,End sec,Dur. (hr),Avg RTT DL (ms),Avg RTT UL (ms),Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps),DL TP < 50 Kbps (%),50 Kbps < DL TP < 250 Kbps (%),250 Kbps < DL TP < 1 Mbps (%),DL TP > 1 Mbps (%),UL TP < 10 Kbps (%),10 Kbps < UL TP < 50 Kbps (%),50 Kbps < UL TP < 300 Kbps (%),UL TP > 300 Kbps (%),Activity Duration DL (sec),Activity Duration UL (sec),Total Duration (hr),Nb of sec with Vol DL < 6250B,Nb of sec with Vol UL < 1250B,Social Media DL (MegaBytes),Social Media UL (MegaBytes),Google DL (MegaBytes),Google UL (MegaBytes),Email DL (MegaBytes),Email UL (MegaBytes),Youtube DL (MegaBytes),Youtube UL (MegaBytes),Netflix DL (MegaBytes),Netflix UL (MegaBytes),Gaming DL (MegaBytes),Gaming UL (MegaBytes),Other DL (MegaBytes),Other UL (MegaBytes),Total UL (MegaBytes),Total DL (MegaBytes),Total Data (MegaBytes),Total Social Media Data (MegaBytes),Total Google Data (MegaBytes),Total Email Data (MegaBytes),Total Youtube Data (MegaBytes),Total Netflix Data (MegaBytes),Total Gaming Data (MegaBytes),Total Other Data (MegaBytes),Jitter (ms),Avg Delay (ms),Avg Throughput (kbps)
count,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0
mean,0.5,0.5,0.03,97.77,15.32,13300.05,1770.43,92.84,3.07,1.72,1.6,98.54,0.77,0.15,0.08,1829.18,1408.88,29.06,3702.11,4001.99,1.8,0.03,5.75,2.06,1.79,0.47,11.63,11.01,11.63,11.0,422.04,8.29,421.1,8.26,41.12,454.64,495.76,1.83,7.81,2.26,22.64,22.63,430.33,429.37,-420296.91,-82.46,-11529.62
std,0.29,0.29,0.02,559.91,76.69,23971.88,4625.36,13.01,6.2,4.15,4.82,4.62,3.22,1.62,1.29,5696.4,4643.23,22.51,9151.91,10137.22,1.04,0.02,3.31,1.19,1.04,0.27,6.71,6.35,6.73,6.36,243.97,4.78,243.21,4.77,11.28,244.14,244.38,1.04,3.52,1.07,9.25,9.26,244.02,243.27,2482590.25,564.34,21736.23
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.98,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.87,7.11,28.96,0.0,0.04,0.01,0.08,0.1,0.31,0.15,-84124624.0,-96922.0,-374058.0
25%,0.25,0.25,0.02,35.0,3.0,43.0,47.0,91.0,0.0,0.0,0.0,99.0,0.0,0.0,0.0,14.88,21.54,15.96,88.0,107.0,0.9,0.02,2.88,1.02,0.89,0.23,5.83,5.52,5.78,5.48,210.47,4.13,210.18,4.15,33.22,243.11,284.48,0.93,4.94,1.36,16.0,15.98,218.73,218.55,-24888.25,-49.0,-16193.25
50%,0.5,0.5,0.02,45.0,5.0,63.0,63.0,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,39.3,46.79,24.0,203.0,217.0,1.79,0.03,5.77,2.05,1.79,0.47,11.62,11.01,11.64,11.0,423.41,8.29,421.81,8.27,41.14,455.84,496.86,1.83,7.81,2.26,22.66,22.64,431.62,429.99,486.0,-40.0,-9.0
75%,0.75,0.75,0.04,62.0,11.0,19710.75,1120.0,100.0,4.0,1.0,0.0,100.0,0.0,0.0,0.0,679.61,599.1,36.79,2608.0,2417.0,2.69,0.05,8.62,3.09,2.69,0.7,17.45,16.52,17.47,16.51,633.17,12.43,631.7,12.38,49.03,665.71,706.51,2.73,10.68,3.16,29.29,29.29,641.42,639.93,5931.0,-27.0,7.0
max,1.0,1.0,0.52,96923.0,7120.0,378160.0,58613.0,100.0,93.0,100.0,94.0,100.0,98.0,100.0,96.0,136536.47,144911.3,516.48,604061.0,604122.0,3.59,0.07,11.46,4.12,3.59,0.94,23.26,22.01,23.26,22.01,843.44,16.56,843.44,16.56,78.33,902.97,955.98,3.65,15.53,4.52,45.19,45.2,859.2,859.52,144712880.0,7082.0,39950.0


In [10]:
explorer.get_mode()

Unnamed: 0,Bearer Id,Start,Start sec,End,End sec,Dur. (hr),IMSI,MSISDN/Number,IMEI,Last Location Name,Avg RTT DL (ms),Avg RTT UL (ms),Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps),DL TP < 50 Kbps (%),50 Kbps < DL TP < 250 Kbps (%),250 Kbps < DL TP < 1 Mbps (%),DL TP > 1 Mbps (%),UL TP < 10 Kbps (%),10 Kbps < UL TP < 50 Kbps (%),50 Kbps < UL TP < 300 Kbps (%),UL TP > 300 Kbps (%),Activity Duration DL (sec),Activity Duration UL (sec),Total Duration (hr),Handset Manufacturer,Handset Type,Nb of sec with Vol DL < 6250B,Nb of sec with Vol UL < 1250B,Social Media DL (MegaBytes),Social Media UL (MegaBytes),Google DL (MegaBytes),Google UL (MegaBytes),Email DL (MegaBytes),Email UL (MegaBytes),Youtube DL (MegaBytes),Youtube UL (MegaBytes),Netflix DL (MegaBytes),Netflix UL (MegaBytes),Gaming DL (MegaBytes),Gaming UL (MegaBytes),Other DL (MegaBytes),Other UL (MegaBytes),Total UL (MegaBytes),Total DL (MegaBytes),start_date,start_time,end_date,end_time,Total Data (MegaBytes),Total Social Media Data (MegaBytes),Total Google Data (MegaBytes),Total Email Data (MegaBytes),Total Youtube Data (MegaBytes),Total Netflix Data (MegaBytes),Total Gaming Data (MegaBytes),Total Other Data (MegaBytes),Jitter (ms),Avg Delay (ms),Avg Throughput (kbps)
0,,2019-04-29 07:08:38,0.34,2019-04-25 00:01:32,0.87,0.02,208201710567424.00,33663707136.00,86376900984832.00,D41377B,45.00,5.00,23.00,40.00,100.00,0.00,0.00,0.00,100.00,0.00,0.00,0.00,0.00,0.00,24.00,Apple,Huawei B528S-23A,3.00,217.00,0.15,0.03,3.99,3.43,1.83,0.05,18.87,3.40,2.10,1.75,453.16,15.30,357.81,4.51,35.58,74.13,1556323200000000000,07:08:38,1556323200000000000,00:01:32,833.15,0.15,5.24,0.65,22.67,3.19,15.15,50.57,0.00,-40.00,3.00
1,,,,2019-04-25 00:01:33,,,,,,,,,,,,,,,,,,,,,,,,,,0.29,0.04,6.30,,,0.06,19.13,19.09,,20.56,,16.27,,8.13,38.18,75.11,,,,,,0.48,8.79,0.90,26.45,3.28,66.09,54.04,,,
2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.30,0.06,6.84,,,0.12,20.00,19.24,,21.95,,,,8.52,40.40,90.29,,,,,,0.58,10.21,0.97,28.59,3.59,69.20,67.83,,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.34,0.06,7.70,,,0.16,20.50,,,,,,,9.70,41.71,91.94,,,,,,0.59,,1.15,32.62,4.87,83.01,69.45,,,
4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.34,,9.00,,,0.20,,,,,,,,16.07,43.13,94.98,,,,,,0.67,,1.25,34.50,5.92,84.63,105.26,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
672,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,40.87,,,,,
673,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,41.03,,,,,
674,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,41.18,,,,,
675,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,41.91,,,,,
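
The many empty cells in the output above are consistent with how pandas `DataFrame.mode` behaves: it returns every tied mode for each column and pads the shorter columns with NaN, so a single column with many equally frequent values stretches the result to hundreds of rows. Assuming `get_mode` wraps `DataFrame.mode` (its source is not shown here), the sketch below reproduces that behaviour on a made-up frame and keeps only the first mode per column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Handset Manufacturer": ["Apple", "Apple", "Huawei", "Samsung"],
    # All values are unique, so every value counts as a mode
    "Total Data (MegaBytes)": [10.5, 20.5, 30.5, 40.5],
})

modes = toy.mode()          # one row per tied mode, NaN-padded
print(modes)

first_mode = modes.iloc[0]  # first (smallest) mode of each column
print(first_mode)
```

Taking `modes.iloc[0]` gives a compact one-value-per-column summary, at the cost of hiding ties.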


### Data Exploration

#### Non-Graphical Univariate Analysis

#### Graphical Univariate Analysis

#### Bivariate Analysis

#### Variable Transformations

#### Correlation Analysis

#### Dimensionality Reduction