# **<font color='white gray'>panData</font>**
# **<font color='white gray'>Data Analysis Project with Python Language</font>**
### <font color='white gray'>Data Cleaning and Missing Value Treatment Techniques for Data Analysis</font>
### **<font color='blue'>Part 1</font>**

## **Python Packages Used in the Project**

In [40]:
!pip install -q -U watermark

In [41]:
# Imports
import math
import sys, os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import warnings
warnings.filterwarnings('ignore')

In [42]:
%reload_ext watermark
%watermark -a "panData"

Author: panData



## **Loading the Data**

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [43]:
# 1. We create a list to identify missing values
missing_values_labels_list = ["n/a", "na", "undefined"]

In [44]:
# 2. Load the dataset
dataset = pd.read_csv("dataset.csv", na_values=missing_values_labels_list)
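
To make the effect of `na_values` concrete, here is a minimal sketch using a hypothetical three-row CSV held in memory (the `csv_text` content and the `sample` name are illustrative, not part of the project data). It assumes the default `keep_default_na=True`, so the custom labels are added to the markers pandas already treats as missing.

```python
import io

import pandas as pd

# Hypothetical CSV content, used only to illustrate na_values
csv_text = "id,handset\n1,Samsung\n2,undefined\n3,n/a\n"

# The custom labels are read as NaN, in addition to pandas' default markers
sample = pd.read_csv(io.StringIO(csv_text), na_values=["n/a", "na", "undefined"])

missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Labels that are not listed (and are not among the pandas defaults) would be read as ordinary strings, so the list should be checked against the labels actually present in the file.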

In [45]:
# 3. Shape
dataset.shape

(150001, 55)

In [46]:
# 4. Data Sample
dataset.head()

Unnamed: 0,Bearer Id,Start,Start ms,End,End ms,Dur. (ms),IMSI,MSISDN/Number,IMEI,Last Location Name,Avg RTT DL (ms),Avg RTT UL (ms),Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps),TCP DL Retrans. Vol (Bytes),TCP UL Retrans. Vol (Bytes),DL TP < 50 Kbps (%),50 Kbps < DL TP < 250 Kbps (%),250 Kbps < DL TP < 1 Mbps (%),DL TP > 1 Mbps (%),UL TP < 10 Kbps (%),10 Kbps < UL TP < 50 Kbps (%),50 Kbps < UL TP < 300 Kbps (%),UL TP > 300 Kbps (%),HTTP DL (Bytes),HTTP UL (Bytes),Activity Duration DL (ms),Activity Duration UL (ms),Dur. (ms).1,Handset Manufacturer,Handset Type,Nb of sec with 125000B < Vol DL,Nb of sec with 1250B < Vol UL < 6250B,Nb of sec with 31250B < Vol DL < 125000B,Nb of sec with 37500B < Vol UL,Nb of sec with 6250B < Vol DL < 31250B,Nb of sec with 6250B < Vol UL < 37500B,Nb of sec with Vol DL < 6250B,Nb of sec with Vol UL < 1250B,Social Media DL (Bytes),Social Media UL (Bytes),Google DL (Bytes),Google UL (Bytes),Email DL (Bytes),Email UL (Bytes),Youtube DL (Bytes),Youtube UL (Bytes),Netflix DL (Bytes),Netflix UL (Bytes),Gaming DL (Bytes),Gaming UL (Bytes),Other DL (Bytes),Other UL (Bytes),Total UL (Bytes),Total DL (Bytes)
0,1.311448e+19,4/4/2019 12:01,770.0,4/25/2019 14:35,662.0,1823652.0,208201400000000.0,33664960000.0,35521210000000.0,9.16456699548519E+015,42.0,5.0,23.0,44.0,,,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,,,37624.0,38787.0,1823653000.0,Samsung,Samsung Galaxy A5 Sm-A520F,,,,,,,213.0,214.0,1545765.0,24420.0,1634479.0,1271433.0,3563542.0,137762.0,15854611.0,2501332.0,8198936.0,9656251.0,278082303.0,14344150.0,171744450.0,8814393.0,36749741.0,308879636.0
1,1.311448e+19,4/9/2019 13:04,235.0,4/25/2019 8:15,606.0,1365104.0,208201900000000.0,33681850000.0,35794010000000.0,L77566A,65.0,5.0,16.0,26.0,,,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,,,168.0,3560.0,1365104000.0,Samsung,Samsung Galaxy J5 (Sm-J530),,,,,,,971.0,1022.0,1926113.0,7165.0,3493924.0,920172.0,629046.0,308339.0,20247395.0,19111729.0,18338413.0,17227132.0,608750074.0,1170709.0,526904238.0,15055145.0,53800391.0,653384965.0
2,1.311448e+19,4/9/2019 17:42,1.0,4/25/2019 11:58,652.0,1361762.0,208200300000000.0,33760630000.0,35281510000000.0,D42335A,,,6.0,9.0,,,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,,,0.0,0.0,1361763000.0,Samsung,Samsung Galaxy A8 (2018),,,,,,,751.0,695.0,1684053.0,42224.0,8535055.0,1694064.0,2690151.0,672973.0,19725661.0,14699576.0,17587794.0,6163408.0,229584621.0,395630.0,410692588.0,4215763.0,27883638.0,279807335.0
3,1.311448e+19,4/10/2019 0:31,486.0,4/25/2019 7:36,171.0,1321509.0,208201400000000.0,33750340000.0,35356610000000.0,T21824A,,,44.0,44.0,,,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,,,3330.0,37882.0,1321510000.0,,,,,,,,,17.0,207.0,644121.0,13372.0,9023734.0,2788027.0,1439754.0,631229.0,21388122.0,15146643.0,13994646.0,1097942.0,799538153.0,10849722.0,749039933.0,12797283.0,43324218.0,846028530.0
4,1.311448e+19,4/12/2019 20:10,565.0,4/25/2019 10:40,954.0,1089009.0,208201400000000.0,33699800000.0,35407010000000.0,D88865A,,,6.0,9.0,,,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,,,0.0,0.0,1089009000.0,Samsung,Samsung Sm-G390F,,,,,,,607.0,604.0,862600.0,50188.0,6248284.0,1500559.0,1936496.0,173853.0,15259380.0,18962873.0,17124581.0,415218.0,527707248.0,3529801.0,550709500.0,13910322.0,38542814.0,569138589.0


In [47]:
# 5. Set the total number of columns to display when printing the dataframe
pd.set_option('display.max_columns', 100)

In [48]:
# 6. Data Sample
dataset.head()

Unnamed: 0,Bearer Id,Start,Start ms,End,End ms,Dur. (ms),IMSI,MSISDN/Number,IMEI,Last Location Name,Avg RTT DL (ms),Avg RTT UL (ms),Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps),TCP DL Retrans. Vol (Bytes),TCP UL Retrans. Vol (Bytes),DL TP < 50 Kbps (%),50 Kbps < DL TP < 250 Kbps (%),250 Kbps < DL TP < 1 Mbps (%),DL TP > 1 Mbps (%),UL TP < 10 Kbps (%),10 Kbps < UL TP < 50 Kbps (%),50 Kbps < UL TP < 300 Kbps (%),UL TP > 300 Kbps (%),HTTP DL (Bytes),HTTP UL (Bytes),Activity Duration DL (ms),Activity Duration UL (ms),Dur. (ms).1,Handset Manufacturer,Handset Type,Nb of sec with 125000B < Vol DL,Nb of sec with 1250B < Vol UL < 6250B,Nb of sec with 31250B < Vol DL < 125000B,Nb of sec with 37500B < Vol UL,Nb of sec with 6250B < Vol DL < 31250B,Nb of sec with 6250B < Vol UL < 37500B,Nb of sec with Vol DL < 6250B,Nb of sec with Vol UL < 1250B,Social Media DL (Bytes),Social Media UL (Bytes),Google DL (Bytes),Google UL (Bytes),Email DL (Bytes),Email UL (Bytes),Youtube DL (Bytes),Youtube UL (Bytes),Netflix DL (Bytes),Netflix UL (Bytes),Gaming DL (Bytes),Gaming UL (Bytes),Other DL (Bytes),Other UL (Bytes),Total UL (Bytes),Total DL (Bytes)
0,1.311448e+19,4/4/2019 12:01,770.0,4/25/2019 14:35,662.0,1823652.0,208201400000000.0,33664960000.0,35521210000000.0,9.16456699548519E+015,42.0,5.0,23.0,44.0,,,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,,,37624.0,38787.0,1823653000.0,Samsung,Samsung Galaxy A5 Sm-A520F,,,,,,,213.0,214.0,1545765.0,24420.0,1634479.0,1271433.0,3563542.0,137762.0,15854611.0,2501332.0,8198936.0,9656251.0,278082303.0,14344150.0,171744450.0,8814393.0,36749741.0,308879636.0
1,1.311448e+19,4/9/2019 13:04,235.0,4/25/2019 8:15,606.0,1365104.0,208201900000000.0,33681850000.0,35794010000000.0,L77566A,65.0,5.0,16.0,26.0,,,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,,,168.0,3560.0,1365104000.0,Samsung,Samsung Galaxy J5 (Sm-J530),,,,,,,971.0,1022.0,1926113.0,7165.0,3493924.0,920172.0,629046.0,308339.0,20247395.0,19111729.0,18338413.0,17227132.0,608750074.0,1170709.0,526904238.0,15055145.0,53800391.0,653384965.0
2,1.311448e+19,4/9/2019 17:42,1.0,4/25/2019 11:58,652.0,1361762.0,208200300000000.0,33760630000.0,35281510000000.0,D42335A,,,6.0,9.0,,,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,,,0.0,0.0,1361763000.0,Samsung,Samsung Galaxy A8 (2018),,,,,,,751.0,695.0,1684053.0,42224.0,8535055.0,1694064.0,2690151.0,672973.0,19725661.0,14699576.0,17587794.0,6163408.0,229584621.0,395630.0,410692588.0,4215763.0,27883638.0,279807335.0
3,1.311448e+19,4/10/2019 0:31,486.0,4/25/2019 7:36,171.0,1321509.0,208201400000000.0,33750340000.0,35356610000000.0,T21824A,,,44.0,44.0,,,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,,,3330.0,37882.0,1321510000.0,,,,,,,,,17.0,207.0,644121.0,13372.0,9023734.0,2788027.0,1439754.0,631229.0,21388122.0,15146643.0,13994646.0,1097942.0,799538153.0,10849722.0,749039933.0,12797283.0,43324218.0,846028530.0
4,1.311448e+19,4/12/2019 20:10,565.0,4/25/2019 10:40,954.0,1089009.0,208201400000000.0,33699800000.0,35407010000000.0,D88865A,,,6.0,9.0,,,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,,,0.0,0.0,1089009000.0,Samsung,Samsung Sm-G390F,,,,,,,607.0,604.0,862600.0,50188.0,6248284.0,1500559.0,1936496.0,173853.0,15259380.0,18962873.0,17124581.0,415218.0,527707248.0,3529801.0,550709500.0,13910322.0,38542814.0,569138589.0


In [49]:
# 7. Loading the data dictionary
dictionary = pd.read_excel("dictionary.xlsx")

In [50]:
# 8. Shape
dictionary.shape

(56, 2)

In [51]:
# 9. Data Sample
dictionary.head(10)

Unnamed: 0,Fields,Description
0,bearer id,xDr session identifier
1,Dur. (ms),Total Duration of the xDR (in ms)
2,Start,Start time of the xDR (first frame timestamp)
3,Start ms,Milliseconds offset of start time for the xDR (first frame timestamp)
4,End,End time of the xDR (last frame timestamp)
5,End ms,Milliseconds offset of end time of the xDR (last frame timestamp)
6,Dur. (s),Total Duration of the xDR (in s)
7,IMSI,International Mobile Subscriber Identity
8,MSISDN/Number,MS International PSTN/ISDN Number of mobile - customer number
9,IMEI,International Mobile Equipment Identity


In [52]:
# 10. Set a large value for the maximum column width
pd.set_option('display.max_colwidth', 100)

In [53]:
# 11. Data Sample
dictionary.head(60)

Unnamed: 0,Fields,Description
0,bearer id,xDr session identifier
1,Dur. (ms),Total Duration of the xDR (in ms)
2,Start,Start time of the xDR (first frame timestamp)
3,Start ms,Milliseconds offset of start time for the xDR (first frame timestamp)
4,End,End time of the xDR (last frame timestamp)
5,End ms,Milliseconds offset of end time of the xDR (last frame timestamp)
6,Dur. (s),Total Duration of the xDR (in s)
7,IMSI,International Mobile Subscriber Identity
8,MSISDN/Number,MS International PSTN/ISDN Number of mobile - customer number
9,IMEI,International Mobile Equipment Identity


## **Exploratory Analysis**

In [54]:
# 12. Info
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150001 entries, 0 to 150000
Data columns (total 55 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   Bearer Id                                 149010 non-null  float64
 1   Start                                     150000 non-null  object 
 2   Start ms                                  150000 non-null  float64
 3   End                                       150000 non-null  object 
 4   End ms                                    150000 non-null  float64
 5   Dur. (ms)                                 150000 non-null  float64
 6   IMSI                                      149431 non-null  float64
 7   MSISDN/Number                             148935 non-null  float64
 8   IMEI                                      149429 non-null  float64
 9   Last Location Name                        148848 non-null  object 
 10  Avg RTT DL (ms)     

In [55]:
# 13. Descriptive Statistics
dataset.describe()

Unnamed: 0,Bearer Id,Start ms,End ms,Dur. (ms),IMSI,MSISDN/Number,IMEI,Avg RTT DL (ms),Avg RTT UL (ms),Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps),TCP DL Retrans. Vol (Bytes),TCP UL Retrans. Vol (Bytes),DL TP < 50 Kbps (%),50 Kbps < DL TP < 250 Kbps (%),250 Kbps < DL TP < 1 Mbps (%),DL TP > 1 Mbps (%),UL TP < 10 Kbps (%),10 Kbps < UL TP < 50 Kbps (%),50 Kbps < UL TP < 300 Kbps (%),UL TP > 300 Kbps (%),HTTP DL (Bytes),HTTP UL (Bytes),Activity Duration DL (ms),Activity Duration UL (ms),Dur. (ms).1,Nb of sec with 125000B < Vol DL,Nb of sec with 1250B < Vol UL < 6250B,Nb of sec with 31250B < Vol DL < 125000B,Nb of sec with 37500B < Vol UL,Nb of sec with 6250B < Vol DL < 31250B,Nb of sec with 6250B < Vol UL < 37500B,Nb of sec with Vol DL < 6250B,Nb of sec with Vol UL < 1250B,Social Media DL (Bytes),Social Media UL (Bytes),Google DL (Bytes),Google UL (Bytes),Email DL (Bytes),Email UL (Bytes),Youtube DL (Bytes),Youtube UL (Bytes),Netflix DL (Bytes),Netflix UL (Bytes),Gaming DL (Bytes),Gaming UL (Bytes),Other DL (Bytes),Other UL (Bytes),Total UL (Bytes),Total DL (Bytes)
count,149010.0,150000.0,150000.0,150000.0,149431.0,148935.0,149429.0,122172.0,122189.0,150000.0,150000.0,61855.0,53352.0,149247.0,149247.0,149247.0,149247.0,149209.0,149209.0,149209.0,149209.0,68527.0,68191.0,150000.0,150000.0,150000.0,52463.0,57107.0,56415.0,19747.0,61684.0,38158.0,149246.0,149208.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150000.0,150000.0
mean,1.013887e+19,499.1882,498.80088,104608.6,208201600000000.0,41882820000.0,48474550000000.0,109.795706,17.662883,13300.045927,1770.428647,20809910.0,759658.7,92.844754,3.069355,1.717341,1.609654,98.530142,0.776749,0.147987,0.078923,114471000.0,3242301.0,1829177.0,1408880.0,104609100.0,989.699998,340.434395,810.837401,149.257052,965.464756,141.304812,3719.787552,4022.083454,1795322.0,32928.43438,5750753.0,2056542.0,1791729.0,467373.44194,11634070.0,11009410.0,11626850.0,11001750.0,422044700.0,8288398.0,421100500.0,8264799.0,41121210.0,454643400.0
std,2.893173e+18,288.611834,288.097653,81037.62,21488090000.0,2447443000000.0,22416370000000.0,619.782739,84.793524,23971.878541,4625.3555,182566500.0,26453050.0,13.038031,6.215233,4.159538,4.82889,4.634285,3.225176,1.624523,1.295396,963194600.0,19570640.0,5696395.0,4643231.0,81037610.0,2546.52444,1445.365032,1842.162008,1219.112287,1946.387608,993.349688,9171.60901,10160.324314,1035482.0,19006.178256,3309097.0,1189917.0,1035840.0,269969.307031,6710569.0,6345423.0,6725218.0,6359490.0,243967500.0,4782700.0,243205000.0,4769004.0,11276390.0,244142900.0
min,6.917538e+18,0.0,0.0,7142.0,204047100000000.0,33601000000.0,440015200000.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,40.0,0.0,0.0,7142988.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,12.0,0.0,207.0,3.0,14.0,2.0,53.0,105.0,42.0,35.0,2516.0,59.0,3290.0,148.0,2866892.0,7114041.0
25%,7.349883e+18,250.0,251.0,57440.5,208201400000000.0,33651300000.0,35460710000000.0,32.0,2.0,43.0,47.0,35651.5,4694.75,91.0,0.0,0.0,0.0,99.0,0.0,0.0,0.0,112403.5,24322.0,14877.75,21539.75,57440790.0,20.0,10.0,26.0,2.0,39.0,3.0,87.0,106.0,899148.0,16448.0,2882393.0,1024279.0,892793.0,233383.0,5833501.0,5517965.0,5777156.0,5475981.0,210473300.0,4128476.0,210186900.0,4145943.0,33222010.0,243106800.0
50%,7.349883e+18,499.0,500.0,86399.0,208201500000000.0,33663710000.0,35722010000000.0,45.0,5.0,63.0,63.0,568730.0,20949.5,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1941949.0,229733.0,39304.5,46793.5,86399980.0,128.0,52.0,164.0,8.0,288.0,8.0,203.0,217.0,1794369.0,32920.0,5765829.0,2054573.0,1793505.0,466250.0,11616020.0,11013450.0,11642220.0,10996380.0,423408100.0,8291208.0,421803000.0,8267071.0,41143310.0,455841100.0
75%,1.304243e+19,749.0,750.0,132430.2,208201800000000.0,33683490000.0,86119700000000.0,70.0,15.0,19710.75,1120.0,3768308.0,84020.25,100.0,4.0,1.0,0.0,100.0,0.0,0.0,0.0,25042900.0,1542827.0,679609.5,599095.2,132430800.0,693.5,203.0,757.0,35.0,1092.0,31.0,2650.0,2451.0,2694938.0,49334.0,8623552.0,3088454.0,2689327.0,700440.0,17448520.0,16515560.0,17470480.0,16507270.0,633174200.0,12431620.0,631691800.0,12384150.0,49034240.0,665705500.0
max,1.318654e+19,999.0,999.0,1859336.0,214074300000000.0,882397100000000.0,99001200000000.0,96923.0,7120.0,378160.0,58613.0,4294426000.0,2908226000.0,100.0,93.0,100.0,94.0,100.0,98.0,100.0,96.0,72530640000.0,1491890000.0,136536500.0,144911300.0,1859336000.0,81476.0,85412.0,58525.0,50553.0,66913.0,49565.0,604061.0,604122.0,3586064.0,65870.0,11462830.0,4121357.0,3586146.0,936418.0,23259100.0,22011960.0,23259190.0,22011960.0,843441900.0,16558790.0,843442500.0,16558820.0,78331310.0,902969600.0


It doesn’t make sense to calculate descriptive statistics for identifier columns such as Bearer Id, IMSI, MSISDN/Number, and IMEI. However, the describe() method calculates statistics for all numeric columns. These statistics are being computed before the data is cleaned, so they may change after missing values and outliers are addressed.
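
One possible way to keep identifiers out of the summary is to drop them before calling `describe()`. The sketch below uses a hypothetical miniature frame (`sample`) that reuses a few of the dataset's column names; the `identifier_columns` list is an assumption about which columns are identifiers.

```python
import pandas as pd

# Hypothetical miniature frame reusing a few of the dataset's column names
sample = pd.DataFrame({
    "Bearer Id": [1.311448e19, 1.311448e19, 7.349883e18],
    "IMSI": [2.082014e14, 2.082019e14, 2.082003e14],
    "Avg RTT DL (ms)": [42.0, 65.0, None],
    "Total DL (Bytes)": [308879636.0, 653384965.0, 279807335.0],
})

# Assumed list of identifier columns for which statistics are not meaningful
identifier_columns = ["Bearer Id", "IMSI", "MSISDN/Number", "IMEI"]

# errors="ignore" skips identifier names that are absent from this small sample
measures = sample.drop(columns=identifier_columns, errors="ignore")
summary = measures.describe()
print(summary)
```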

In [56]:
# 14. Shape
dataset.shape

(150001, 55)

In [57]:
# 15. Shape
dictionary.shape

(56, 2)

There are 150,001 rows and 55 columns in the dataframe. However, the dictionary has 56 rows, each holding a field name and its description. This means there is a described field that is not included in the dataframe. Let’s identify the missing column.

In [58]:
# 16. Concatenate the dataframes
df_column_comparison = pd.concat([pd.Series(dataset.columns.tolist()), dictionary['Fields']],
                                 axis=1)

In [59]:
# 17. Column names
df_column_comparison.columns

Index([0, 'Fields'], dtype='object')

In [60]:
# 18. Rename columns
df_column_comparison.rename(columns={0: 'Column in Dataset', 'Fields': 'Column in Dictionary'},
                            inplace=True)

In [61]:
# 19. View
df_column_comparison

Unnamed: 0,Column in Dataset,Column in Dictionary
0,Bearer Id,bearer id
1,Start,Dur. (ms)
2,Start ms,Start
3,End,Start ms
4,End ms,End
5,Dur. (ms),End ms
6,IMSI,Dur. (s)
7,MSISDN/Number,IMSI
8,IMEI,MSISDN/Number
9,Last Location Name,IMEI


At index 1 of df_column_comparison, the dictionary lists “Dur. (ms)” while the dataset has “Start”. This is where the order of the columns starts to shift.

However, the same column name, “Dur. (ms)”, appears in the dataset at index 5, where the dictionary file lists “Dur. (s)” at index 6. Since the names indicate different units, we need to verify which one is correct. To investigate further, we will compare it with the “Dur. (ms).1” column, located at index 28 in the dataset and index 29 in the dictionary file.
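
A name-based comparison can complement the positional one, since it does not depend on column order. The sketch below uses only the first ten names from each source, as shown in the comparison table; the `normalize` helper is a hypothetical convenience for ignoring case and surrounding spaces.

```python
# First ten names from each source, copied from the comparison table
dataset_columns = ["Bearer Id", "Start", "Start ms", "End", "End ms",
                   "Dur. (ms)", "IMSI", "MSISDN/Number", "IMEI", "Last Location Name"]
dictionary_fields = ["bearer id", "Dur. (ms)", "Start", "Start ms", "End",
                     "End ms", "Dur. (s)", "IMSI", "MSISDN/Number", "IMEI"]


def normalize(name):
    # Lower-case and trim so that 'Bearer Id' matches 'bearer id'
    return name.strip().lower()


dataset_names = {normalize(c) for c in dataset_columns}
dictionary_names = {normalize(f) for f in dictionary_fields}

# Names described in the dictionary but absent from the dataset, and vice versa
only_in_dictionary = sorted(dictionary_names - dataset_names)
only_in_dataset = sorted(dataset_names - dictionary_names)
print(only_in_dictionary, only_in_dataset)
```

On this ten-name excerpt, “Last Location Name” shows up as dataset-only simply because the dictionary list is cut off before reaching it; running the same comparison on the full lists would avoid that artifact.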

In [62]:
# 20. Select columns for investigation
dataset[['Dur. (ms)', 'Dur. (ms).1']]

Unnamed: 0,Dur. (ms),Dur. (ms).1
0,1823652.0,1.823653e+09
1,1365104.0,1.365104e+09
2,1361762.0,1.361763e+09
3,1321509.0,1.321510e+09
4,1089009.0,1.089009e+09
...,...,...
149996,81230.0,8.123076e+07
149997,97970.0,9.797070e+07
149998,98249.0,9.824953e+07
149999,97910.0,9.791063e+07


It appears that the “Dur. (ms)” column is measured in seconds: its values are about 1,000 times smaller than those of “Dur. (ms).1”. Therefore, let’s rename it accordingly. We’ll also rename some other columns to make them clearer according to their descriptions and to align with the naming style of other columns.
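
The unit check can be made explicit with a short sketch. It uses the first five rows printed above, plus the Start and End timestamps of the first row from the data sample; the assumption is that the timestamps are in month/day/year order, which is how pandas parses them by default.

```python
import pandas as pd

# First five rows of the two duration columns, copied from the output above
durations = pd.DataFrame({
    "Dur. (ms)": [1823652.0, 1365104.0, 1361762.0, 1321509.0, 1089009.0],
    "Dur. (ms).1": [1.823653e+09, 1.365104e+09, 1.361763e+09, 1.321510e+09, 1.089009e+09],
})

# A ratio close to 1000 is what we expect between milliseconds and seconds
ratio = durations["Dur. (ms).1"] / durations["Dur. (ms)"]
print(ratio.round(3))

# Cross-check against the Start and End timestamps of the first row
start = pd.to_datetime("4/4/2019 12:01")
end = pd.to_datetime("4/25/2019 14:35")
elapsed_seconds = (end - start).total_seconds()
print(elapsed_seconds, durations.loc[0, "Dur. (ms)"])
```

The elapsed time between the timestamps is within a minute of the first value in “Dur. (ms)”, which supports reading that column as seconds.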

In [63]:
# 21. Rename columns
dataset.rename(columns={'Dur. (ms)': 'Dur (s)',
                        'Dur. (ms).1': 'Dur (ms)',
                        'Start ms': 'Start Offset (ms)',
                        'End ms': 'End Offset (ms)'},
               inplace=True)

In [64]:
# 22. List of dataset columns
dataset.columns.tolist()

['Bearer Id',
 'Start',
 'Start Offset (ms)',
 'End',
 'End Offset (ms)',
 'Dur (s)',
 'IMSI',
 'MSISDN/Number',
 'IMEI',
 'Last Location Name',
 'Avg RTT DL (ms)',
 'Avg RTT UL (ms)',
 'Avg Bearer TP DL (kbps)',
 'Avg Bearer TP UL (kbps)',
 'TCP DL Retrans. Vol (Bytes)',
 'TCP UL Retrans. Vol (Bytes)',
 'DL TP < 50 Kbps (%)',
 '50 Kbps < DL TP < 250 Kbps (%)',
 '250 Kbps < DL TP < 1 Mbps (%)',
 'DL TP > 1 Mbps (%)',
 'UL TP < 10 Kbps (%)',
 '10 Kbps < UL TP < 50 Kbps (%)',
 '50 Kbps < UL TP < 300 Kbps (%)',
 'UL TP > 300 Kbps (%)',
 'HTTP DL (Bytes)',
 'HTTP UL (Bytes)',
 'Activity Duration DL (ms)',
 'Activity Duration UL (ms)',
 'Dur (ms)',
 'Handset Manufacturer',
 'Handset Type',
 'Nb of sec with 125000B < Vol DL',
 'Nb of sec with 1250B < Vol UL < 6250B',
 'Nb of sec with 31250B < Vol DL < 125000B',
 'Nb of sec with 37500B < Vol UL',
 'Nb of sec with 6250B < Vol DL < 31250B',
 'Nb of sec with 6250B < Vol UL < 37500B',
 'Nb of sec with Vol DL < 6250B',
 'Nb of sec with Vol UL < 12

In [65]:
# 23. Shape
dataset.shape

(150001, 55)

## **Step 1 - Handling Missing Values**
1. Identifying Missing Values
2. Dropping Columns
3. Imputation with Backward Fill
4. Imputation with Forward Fill
5. Imputation of Categorical Variables
6. Dropping Rows

### **1.1. Identifying Missing Values**

In [66]:
# 24. Function to calculate the percentage of missing values
def calculate_missing_values_percentage(df):

    # Calculate the total number of cells in the dataset
    # (np.prod replaces np.product, which was removed in NumPy 2.0)
    totalCells = np.prod(df.shape)

    # Count the number of missing values per column
    missingCount = df.isnull().sum()

    # Calculate the total number of missing values
    totalMissing = missingCount.sum()

    # Calculate the percentage of missing values
    print("The dataset has", round(((totalMissing / totalCells) * 100), 2), "%", "missing values.")

In [67]:
# 25. Check the percentage of missing values
calculate_missing_values_percentage(dataset)

The dataset has 12.72 % missing values.
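
As a cross-check of the function above, the same percentage can be obtained from the boolean mask returned by `isna()`. This is a sketch on a hypothetical two-column frame (`sample`), not the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 3 missing cells out of 8
sample = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", None, "z", None]})

# The mean of a boolean mask is the fraction of True values
missing_pct = sample.isna().to_numpy().mean() * 100

# Same quantity written as missing cells divided by total cells
missing_pct_explicit = sample.isna().sum().sum() / sample.size * 100

print(missing_pct, missing_pct_explicit)
```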


In [68]:
# 26. Function to calculate missing values by column
def calculate_missing_values_by_column(df):

    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * mis_val / len(df)

    # Data type of columns with missing values
    mis_val_dtype = df.dtypes

    # Create a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent, mis_val_dtype], axis=1)

    # Rename columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Missing Values', 2: 'Dtype'})

    # Sort the table by percentage of missing values in descending order and remove columns without missing values
    mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:, 0] != 0].sort_values(
        '% of Missing Values', ascending=False).round(2)

    # Print
    print("The dataset has " + str(df.shape[1]) + " columns.\n"
          "Found: " + str(mis_val_table_ren_columns.shape[0]) + " columns with missing values.")

    if mis_val_table_ren_columns.shape[0] == 0:
        return

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

In [69]:
# 27. Create table with missing values
df_missing = calculate_missing_values_by_column(dataset)

The dataset has 55 columns.
Found: 41 columns with missing values.


In [70]:
# 28. View
df_missing

Unnamed: 0,Missing Values,% of Missing Values,Dtype
Nb of sec with 37500B < Vol UL,130254,86.84,float64
Nb of sec with 6250B < Vol UL < 37500B,111843,74.56,float64
Nb of sec with 125000B < Vol DL,97538,65.02,float64
TCP UL Retrans. Vol (Bytes),96649,64.43,float64
Nb of sec with 31250B < Vol DL < 125000B,93586,62.39,float64
Nb of sec with 1250B < Vol UL < 6250B,92894,61.93,float64
Nb of sec with 6250B < Vol DL < 31250B,88317,58.88,float64
TCP DL Retrans. Vol (Bytes),88146,58.76,float64
HTTP UL (Bytes),81810,54.54,float64
HTTP DL (Bytes),81474,54.32,float64


As a rule of thumb, columns with more than 50% missing values are candidates for removal. For those with 30% to 50% missing values, removal is optional.

However, the final decision is always yours! Yes, you, Data Analyst. Just remember to always justify your choices.

In this project, we will remove columns with 30% or more missing values, because so many columns have missing values that treating each one individually would be a significant amount of work. We will impute variables with a low percentage of missing values and delete those with a high percentage.
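
The threshold rule can be sketched as follows. The frame, the column names, and the `protected_columns` list are hypothetical; the protected list stands for columns that the analyst decides to keep and impute despite a high share of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 80%, 40%, and 0% missing values per column
sample = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "partly_missing": [1.0, np.nan, 3.0, np.nan, 5.0],
    "complete": [1.0, 2.0, 3.0, 4.0, 5.0],
})

threshold = 30.0  # percent; an analyst's choice, not a fixed rule
protected_columns = ["partly_missing"]  # kept for later imputation

# Percentage of missing values per column
missing_share = sample.isna().mean() * 100

candidates = missing_share[missing_share >= threshold].index.tolist()
columns_to_drop = [col for col in candidates if col not in protected_columns]

reduced = sample.drop(columns=columns_to_drop)
print(columns_to_drop, reduced.shape)
```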

### **1.2. Dropping Columns**

In [71]:
# 29. Columns to be removed
columns_to_remove = df_missing[df_missing['% of Missing Values'] >= 30.00].index.tolist()

In [72]:
# 30. Columns to be removed
columns_to_remove

['Nb of sec with 37500B < Vol UL',
 'Nb of sec with 6250B < Vol UL < 37500B',
 'Nb of sec with 125000B < Vol DL',
 'TCP UL Retrans. Vol (Bytes)',
 'Nb of sec with 31250B < Vol DL < 125000B',
 'Nb of sec with 1250B < Vol UL < 6250B',
 'Nb of sec with 6250B < Vol DL < 31250B',
 'TCP DL Retrans. Vol (Bytes)',
 'HTTP UL (Bytes)',
 'HTTP DL (Bytes)']

Even though the “TCP” variables have many missing values, instead of removing them, we will apply imputation to these variables, as they may be necessary for our subsequent analysis.

In [73]:
# 31. Columns to be removed (excluding certain TCP variables)
columns_to_remove = [col for col in columns_to_remove if col not in ['TCP UL Retrans. Vol (Bytes)',
                                                                     'TCP DL Retrans. Vol (Bytes)']]

In [74]:
# 32. Columns to be removed
columns_to_remove

['Nb of sec with 37500B < Vol UL',
 'Nb of sec with 6250B < Vol UL < 37500B',
 'Nb of sec with 125000B < Vol DL',
 'Nb of sec with 31250B < Vol DL < 125000B',
 'Nb of sec with 1250B < Vol UL < 6250B',
 'Nb of sec with 6250B < Vol DL < 31250B',
 'HTTP UL (Bytes)',
 'HTTP DL (Bytes)']

In [75]:
# 33. Drop columns and create a new dataframe
cleaned_dataset = dataset.drop(columns_to_remove, axis=1)

In [76]:
# 34. Shape
cleaned_dataset.shape

(150001, 47)

Now let’s check the status of the missing values in the modified dataframe.

In [77]:
# 35. Check the percentage of missing values in the modified dataframe
calculate_missing_values_percentage(cleaned_dataset)


The dataset has 3.85 % missing values.


In [78]:
# 36. Check missing values by column in the modified dataframe
calculate_missing_values_by_column(cleaned_dataset)


The dataset has 47 columns.
Found: 33 columns with missing values.


Unnamed: 0,Missing Values,% of Missing Values,Dtype
TCP UL Retrans. Vol (Bytes),96649,64.43,float64
TCP DL Retrans. Vol (Bytes),88146,58.76,float64
Avg RTT DL (ms),27829,18.55,float64
Avg RTT UL (ms),27812,18.54,float64
Handset Type,9559,6.37,object
Handset Manufacturer,9559,6.37,object
Last Location Name,1153,0.77,object
MSISDN/Number,1066,0.71,float64
Bearer Id,991,0.66,float64
Nb of sec with Vol UL < 1250B,793,0.53,float64


### **1.3. Imputation with Backward Fill**

The percentages of missing values for 'TCP UL Retrans. Vol (Bytes)' and 'TCP DL Retrans. Vol (Bytes)' are very high, so we will impute them using the backward fill method.

In this case, filling with a single value such as the mean or median is not advisable: more than half of the rows would end up with the same value, which would distort the distribution of the data.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
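
The difference between the two strategies can be seen on a small hypothetical series. The sketch also shows a caveat worth keeping in mind: backward fill cannot fill missing values at the end of a column, because there is no later observation to copy from.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps, including one at the very end
values = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

# Backward fill copies the next observed value into each gap
backfilled = values.bfill()

# Mean imputation puts the same value (25.0 here) into every gap
mean_filled = values.fillna(values.mean())

print(backfilled.tolist())
print(mean_filled.tolist())
print("Still missing after bfill:", backfilled.isna().sum())
```

Backward fill also assumes that neighbouring rows are related (for example, ordered in time), which should be checked before relying on it.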

In [79]:
# 37. Imputation of missing values using backward fill
# Backward fill (bfill) replaces each missing value with the next observed non-null value.
# Series.bfill() is used because fillna(method='bfill') is deprecated in recent pandas versions.
def fix_missing_bfill(df, col):

    count = df[col].isna().sum()

    df[col] = df[col].bfill()

    print(f"{count} missing values in column {col} were replaced using the backward fill method.")


In [80]:
# 38. Backward Fill Imputation for the variable 'TCP UL Retrans. Vol (Bytes)'
fix_missing_bfill(cleaned_dataset, 'TCP UL Retrans. Vol (Bytes)')

96649 missing values in column TCP UL Retrans. Vol (Bytes) were replaced using the backward fill method.


In [81]:
# 39. Backward Fill Imputation for the variable 'TCP DL Retrans. Vol (Bytes)'
fix_missing_bfill(cleaned_dataset, 'TCP DL Retrans. Vol (Bytes)')

88146 missing values in column TCP DL Retrans. Vol (Bytes) were replaced using the backward fill method.


### **1.4. Imputation with Forward Fill**

In [82]:
# 40. Check missing values by column in the cleaned dataset
calculate_missing_values_by_column(cleaned_dataset)


The dataset has 47 columns.
Found: 33 columns with missing values.


Unnamed: 0,Missing Values,% of Missing Values,Dtype
Avg RTT DL (ms),27829,18.55,float64
Avg RTT UL (ms),27812,18.54,float64
Handset Type,9559,6.37,object
Handset Manufacturer,9559,6.37,object
Last Location Name,1153,0.77,object
MSISDN/Number,1066,0.71,float64
Bearer Id,991,0.66,float64
Nb of sec with Vol UL < 1250B,793,0.53,float64
UL TP > 300 Kbps (%),792,0.53,float64
50 Kbps < UL TP < 300 Kbps (%),792,0.53,float64


"Avg RTT DL (ms)" and "Avg RTT UL (ms)" have the next highest percentages of missing values, with around 18.5% each. Let's check if these variables are skewed (do not follow a normal distribution) using the `skew()` method, which returns the skewness coefficient.

In [83]:
# 41. Check skewness for 'Avg RTT DL (ms)'
cleaned_dataset['Avg RTT DL (ms)'].skew(skipna=True)


62.90782807995961

In [84]:
# 42. Check skewness for 'Avg RTT UL (ms)'
cleaned_dataset['Avg RTT UL (ms)'].skew(skipna=True)

28.45741458546382

- If the skewness is between -0.5 and 0.5, the data are fairly symmetric.
- If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.
- If the skewness is less than -1 or greater than 1, the data are highly skewed.

Since both Avg RTT DL (ms) and Avg RTT UL (ms) are strongly positively skewed, it is advisable not to impute them with their mean. We will therefore use forward fill.
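
The thresholds listed above can be written as a small helper, shown here as a sketch (the function name `describe_skewness` and the `rtt` sample are hypothetical). The second part illustrates why the mean is a poor fill value for a strongly right-skewed variable: a single extreme value pulls it far away from the typical observations.

```python
import pandas as pd


def describe_skewness(value):
    # Classify a skewness coefficient using the rule of thumb above
    magnitude = abs(value)
    if magnitude <= 0.5:
        return "fairly symmetric"
    if magnitude <= 1:
        return "moderately skewed"
    return "highly skewed"


# Coefficients reported for the two RTT columns
print(describe_skewness(62.90782807995961))
print(describe_skewness(28.45741458546382))

# Hypothetical right-skewed sample: one extreme value dominates the mean
rtt = pd.Series([30.0, 40.0, 45.0, 50.0, 5000.0])
print("mean:", rtt.mean(), "median:", rtt.median(), "skew:", rtt.skew())
```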

In [85]:
# 43. Imputation of missing values using forward fill
# Forward fill (ffill) replaces each missing value with the last observed non-null value.
# Series.ffill() is used because fillna(method='ffill') is deprecated in recent pandas versions.
def fix_missing_ffill(df, col):

    count = df[col].isna().sum()

    df[col] = df[col].ffill()

    print(f"{count} missing values in column {col} were replaced using the forward fill method.")


In [86]:
# 44. Forward Fill Imputation for 'Avg RTT DL (ms)'
fix_missing_ffill(cleaned_dataset, 'Avg RTT DL (ms)')


27829 missing values in column Avg RTT DL (ms) were replaced using the forward fill method.


In [87]:
# 45. Forward Fill Imputation for 'Avg RTT UL (ms)'
fix_missing_ffill(cleaned_dataset, 'Avg RTT UL (ms)')


27812 missing values in column Avg RTT UL (ms) were replaced using the forward fill method.
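
Forward fill has the mirror-image limitation of backward fill: it cannot fill a gap at the start of a column. The sketch below, on a hypothetical series, shows this and one possible fallback (chaining a backward fill), which is an illustration rather than a step taken in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap at the very start
values = pd.Series([np.nan, 42.0, np.nan, 65.0, np.nan])

# Forward fill copies the previous observed value into each gap
forward_filled = values.ffill()

# Optional fallback: a backward fill afterwards covers any leading gap
fully_filled = values.ffill().bfill()

print(forward_filled.tolist())
print(fully_filled.tolist())
```

In the dataset itself, the first row has observed values for both RTT columns, so forward fill alone leaves no gaps there.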


> We check the missing values again.

In [89]:
# 46. Check the percentage of missing values in the cleaned dataset
calculate_missing_values_percentage(cleaned_dataset)


The dataset has 0.44 % missing values.


In [90]:
# 47. Check missing values by column in the cleaned dataset
calculate_missing_values_by_column(cleaned_dataset)


The dataset has 47 columns.
Found: 31 columns with missing values.


Unnamed: 0,Missing Values,% of Missing Values,Dtype
Handset Type,9559,6.37,object
Handset Manufacturer,9559,6.37,object
Last Location Name,1153,0.77,object
MSISDN/Number,1066,0.71,float64
Bearer Id,991,0.66,float64
Nb of sec with Vol UL < 1250B,793,0.53,float64
UL TP > 300 Kbps (%),792,0.53,float64
50 Kbps < UL TP < 300 Kbps (%),792,0.53,float64
10 Kbps < UL TP < 50 Kbps (%),792,0.53,float64
UL TP < 10 Kbps (%),792,0.53,float64


### **1.5. Imputation of Categorical Variables**








In [91]:
# 48. Information about the cleaned dataset
cleaned_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150001 entries, 0 to 150000
Data columns (total 47 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Bearer Id                       149010 non-null  float64
 1   Start                           150000 non-null  object 
 2   Start Offset (ms)               150000 non-null  float64
 3   End                             150000 non-null  object 
 4   End Offset (ms)                 150000 non-null  float64
 5   Dur (s)                         150000 non-null  float64
 6   IMSI                            149431 non-null  float64
 7   MSISDN/Number                   148935 non-null  float64
 8   IMEI                            149429 non-null  float64
 9   Last Location Name              148848 non-null  object 
 10  Avg RTT DL (ms)                 150001 non-null  float64
 11  Avg RTT UL (ms)                 150001 non-null  float64
 12  Avg Bearer TP DL

Since "Handset Type" and "Handset Manufacturer" are categorical columns, it’s better to impute them with the value "unknown" to avoid biasing the data.
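
To illustrate the bias argument, the sketch below compares filling with "unknown" against filling with the most frequent category on a hypothetical series of manufacturers. The mode-based alternative is shown only for contrast; it is not used in this project.

```python
import pandas as pd

# Hypothetical manufacturer column with two missing entries
handsets = pd.Series(["Samsung", None, "Apple", "Samsung", None],
                     name="Handset Manufacturer")

# Explicit placeholder: the missing entries stay distinguishable
as_unknown = handsets.fillna("unknown")

# Mode imputation: the most frequent category absorbs the missing entries
as_mode = handsets.fillna(handsets.mode().iloc[0])

print(as_unknown.value_counts())
print(as_mode.value_counts())
```

With the placeholder, the counts of the real categories are unchanged, whereas mode imputation inflates the most frequent one.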

In [92]:
# 49. Fill NA values
def fix_missing_value(df, col, value):

    count = df[col].isna().sum()

    df[col] = df[col].fillna(value)

    if isinstance(value, str):
        print(f"{count} missing values in column {col} were replaced with '{value}'.")
    else:
        print(f"{count} missing values in column {col} were replaced with {value}.")


In [93]:
# 50. Imputation of categorical variables
fix_missing_value(cleaned_dataset, 'Handset Type', 'unknown')
fix_missing_value(cleaned_dataset, 'Handset Manufacturer', 'unknown')


9559 missing values in column Handset Type were replaced with 'unknown'.
9559 missing values in column Handset Manufacturer were replaced with 'unknown'.


We check the missing values again.

In [94]:
# 51. Check the percentage of missing values in the cleaned dataset
calculate_missing_values_percentage(cleaned_dataset)

The dataset has 0.17 % missing values.


In [95]:
# 52. Check missing values by column in the cleaned dataset
calculate_missing_values_by_column(cleaned_dataset)


The dataset has 47 columns.
Found: 29 columns with missing values.


Unnamed: 0,Missing Values,% of Missing Values,Dtype
Last Location Name,1153,0.77,object
MSISDN/Number,1066,0.71,float64
Bearer Id,991,0.66,float64
Nb of sec with Vol UL < 1250B,793,0.53,float64
UL TP > 300 Kbps (%),792,0.53,float64
50 Kbps < UL TP < 300 Kbps (%),792,0.53,float64
10 Kbps < UL TP < 50 Kbps (%),792,0.53,float64
UL TP < 10 Kbps (%),792,0.53,float64
Nb of sec with Vol DL < 6250B,755,0.5,float64
50 Kbps < DL TP < 250 Kbps (%),754,0.5,float64


### **1.6. Dropping Rows**

Since only 0.17% of the cells in the dataset are missing and the total number of rows is approximately 150,000, dropping the rows that contain them will not have a noticeable negative impact.
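
The share of missing cells and the share of rows that `dropna()` removes are different quantities, because a row is dropped as soon as one of its cells is missing. A minimal sketch on a hypothetical frame shows how to measure both before dropping anything.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 2 missing cells out of 12, spread over 2 of 4 rows
sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Percentage of cells that are missing
cell_share = sample.isna().to_numpy().mean() * 100

# Percentage of rows with at least one missing cell
row_share = sample.isna().any(axis=1).mean() * 100

cleaned = sample.dropna()
print(round(cell_share, 2), row_share, len(cleaned))
```

Checking the row share first makes it easier to justify the decision to drop rows.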

In [96]:
# 53. Drop rows with missing values
def drop_rows_with_na(df):

    old = df.shape[0]

    df.dropna(inplace=True)

    new = df.shape[0]

    count = old - new

    print(f"{count} rows containing missing values were dropped.")


In [97]:
# 54. Drop rows with missing values
drop_rows_with_na(cleaned_dataset)


3114 rows containing missing values were dropped.


In [98]:
# 55. Check the percentage of missing values in the cleaned dataset
calculate_missing_values_percentage(cleaned_dataset)

The dataset has 0.0 % missing values.


In [99]:
# 56. Shape
cleaned_dataset.shape

(146887, 47)

In [100]:
%watermark -a "panData"

Author: panData



In [101]:
%watermark

Last updated: 2024-11-01T07:40:29.367474+00:00

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.85+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



In [102]:
%watermark --iversions

numpy     : 1.26.4
matplotlib: 3.8.0
sys       : 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
pandas    : 2.2.2



# **The End**