# Exploratory Data Analysis (EDA)

This notebook explores the telecom dataset. The tasks covered here are:
- View and count null values in the dataset
- Get general information about the dataset
- Check the skewness of the numeric columns
- Identify and remove columns with a high share of null values

In [1]:
# System Modules
import os
import sys
sys.path.append(os.path.abspath(os.path.join('..')))

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Module imports
import pandas as pd
# Use the full option name; the short pattern 'max_column' is ambiguous in newer pandas
pd.set_option('display.max_columns', None)
pd.options.display.float_format = "{:.2f}".format



In [3]:
# Custom Modules
from myscripts import file
from myscripts.df_info import DataFrameInfo
from myscripts.df_cleaning import DataFrameCleaning


In [4]:
file_name = 'Week1_challenge_data_source.csv'
data = file.read_csv(file_name)

## Information About The Data

In [5]:
df_info = DataFrameInfo(data)
df_info.info()

Data Frame contain 150001 rows and 55 columns


In [6]:
# Null percentage
df_info.null_percentage()

Data Frame contain null values of 12.72%
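
The internals of `DataFrameInfo.null_percentage()` are not shown in this notebook. As a minimal sketch, assuming it reports the share of null cells across the whole frame, the same figure could be computed with plain pandas as follows. The `null_percentage` function and the `toy` frame are hypothetical stand-ins, not part of `myscripts`.

```python
import numpy as np
import pandas as pd


def null_percentage(df: pd.DataFrame) -> float:
    """Percentage of all cells in the frame that are null (hypothetical helper)."""
    total_cells = df.size  # rows * columns
    if total_cells == 0:
        return 0.0
    missing_cells = int(df.isna().sum().sum())
    return round(missing_cells / total_cells * 100, 2)


# Toy frame standing in for the telecom data: 8 cells, 3 of them null
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", None, None, "w"]})
print(f"Null values make up {null_percentage(toy)}% of the cells")
```

A whole-frame percentage like this is a quick health check, but it can hide columns that are almost entirely empty, which is why the per-column counts are worth inspecting as well.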


In [7]:
df_info.get_null_counts()

Bearer Id                                      991
Start                                            1
Start ms                                         1
End                                              1
End ms                                           1
Dur. (ms)                                        1
IMSI                                           570
MSISDN/Number                                 1066
IMEI                                           572
Last Location Name                            1153
Avg RTT DL (ms)                              27829
Avg RTT UL (ms)                              27812
Avg Bearer TP DL (kbps)                          1
Avg Bearer TP UL (kbps)                          1
TCP DL Retrans. Vol (Bytes)                  88146
TCP UL Retrans. Vol (Bytes)                  96649
DL TP < 50 Kbps (%)                            754
50 Kbps < DL TP < 250 Kbps (%)                 754
250 Kbps < DL TP < 1 Mbps (%)                  754
DL TP > 1 Mbps (%)             
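
Per-column counts like the ones above can be reproduced with `DataFrame.isna().sum()`. The sketch below assumes `get_null_counts()` does something similar; whether it filters out columns without nulls is not visible here, so that filtering step is an assumption. The `null_counts` function and the `toy` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def null_counts(df: pd.DataFrame) -> pd.Series:
    """Number of null values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy frame: two columns with one null each, one complete column
toy = pd.DataFrame(
    {
        "Bearer Id": [1.0, np.nan, 3.0],
        "Start": ["t0", "t1", None],
        "Dur. (ms)": [10.0, 20.0, 30.0],
    }
)
print(null_counts(toy))
```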

In [8]:
df_info.skewness()

Bearer Id                                    0.03
Start ms                                     0.00
End ms                                      -0.00
Dur. (ms)                                    3.95
IMSI                                        41.05
MSISDN/Number                              332.16
IMEI                                         1.07
Avg RTT DL (ms)                             62.91
Avg RTT UL (ms)                             28.46
Avg Bearer TP DL (kbps)                      2.59
Avg Bearer TP UL (kbps)                      4.50
TCP DL Retrans. Vol (Bytes)                 15.95
TCP UL Retrans. Vol (Bytes)                 84.11
DL TP < 50 Kbps (%)                         -2.30
50 Kbps < DL TP < 250 Kbps (%)               3.27
250 Kbps < DL TP < 1 Mbps (%)                4.57
DL TP > 1 Mbps (%)                           5.37
UL TP < 10 Kbps (%)                         -8.99
10 Kbps < UL TP < 50 Kbps (%)               10.94
50 Kbps < UL TP < 300 Kbps (%)              21.88


## Data Cleaning

In [14]:
df_clean = DataFrameCleaning(data)

Identify the columns with more than 30% null values, then remove them from the dataset.

In [15]:
bad_columns = df_clean.get_column_with_many_null()
print("List Of Columns with More than 30% Null Values")
print(bad_columns)

List Of Columns with More than 30% Null Values
['TCP DL Retrans. Vol (Bytes)', 'TCP UL Retrans. Vol (Bytes)', 'HTTP DL (Bytes)', 'HTTP UL (Bytes)', 'Nb of sec with 125000B < Vol DL', 'Nb of sec with 1250B < Vol UL < 6250B', 'Nb of sec with 31250B < Vol DL < 125000B', 'Nb of sec with 37500B < Vol UL', 'Nb of sec with 6250B < Vol DL < 31250B', 'Nb of sec with 6250B < Vol UL < 37500B']
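
The implementation of `get_column_with_many_null()` is not shown. A minimal sketch of the idea, assuming the 30% cut-off is applied to each column's fraction of null values, might look like this. The `columns_with_many_nulls` function, its `threshold` parameter, and the `toy` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def columns_with_many_nulls(df: pd.DataFrame, threshold: float = 0.30) -> list:
    """Names of columns whose fraction of null values exceeds `threshold`."""
    null_fraction = df.isna().mean()  # per-column share of nulls, in [0, 1]
    return null_fraction[null_fraction > threshold].index.tolist()


toy = pd.DataFrame(
    {
        "mostly_missing": [np.nan, np.nan, 1.0, np.nan],  # 75% null
        "some_missing": [1.0, np.nan, 3.0, 4.0],  # 25% null
        "complete": [1, 2, 3, 4],  # 0% null
    }
)
print(columns_with_many_nulls(toy))
```

Taking the mean of the boolean `isna()` mask gives the null fraction directly, which avoids dividing counts by the row count by hand.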


In [17]:
# Remove bad columns
df_clean.drop_columns(bad_columns)
df_clean.drop_column('Dur. (ms).1')
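
In plain pandas, the equivalent of both drop calls is `DataFrame.drop(columns=...)`. The sketch below is an assumption about what `drop_columns` and `drop_column` do, not their actual code. The `drop_columns` function and the `toy` frame are hypothetical.

```python
import pandas as pd


def drop_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of `df` without `columns`; names not present are skipped."""
    return df.drop(columns=columns, errors="ignore")


toy = pd.DataFrame({"keep": [1, 2], "bad": [None, None], "Dur. (ms).1": [10, 20]})
cleaned = drop_columns(toy, ["bad", "Dur. (ms).1", "not_a_column"])
print(cleaned.columns.tolist())
```

This version returns a new frame and leaves the input untouched, whereas the `DataFrameCleaning` object in this notebook appears to update its own copy of the data in place.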

In [18]:
bad_columns = df_clean.get_column_with_many_null()
print("Number Of Columns with More than 30% Null Values After Clean Up")
print(bad_columns)

Number Of Columns with More than 30% Null Values After Clean Up
[]
