The telecommunication industry has undergone a major transformation over the last decade. Mobile devices have become a fashion trend and play a vital role in everyday life. The success of the mobile industry depends largely on its consumers. Vendors therefore need to focus on their target audience: what their consumers need and require, and how they feel about and perceive the products. Tracking and evaluating customer experience helps organizations optimize their products and services so that they meet evolving user expectations, needs, and acceptance.

In the telecommunication industry, user experience is most often related to the performance of network parameters or to the characteristics of the customer's device.

In this section, you are expected to focus on network parameters such as TCP retransmission, Round Trip Time (RTT), and throughput, and on customer device characteristics such as handset type, to conduct a deep user experience analysis. The network parameters are all columns in the dataset. The following questions guide you through the task. For this task, you need a Python script that includes the solutions to all tasks.

Task 3.1 - Aggregate, per customer, the following information (treat missing values and outliers by replacing them with the mean or the mode of the corresponding variable):
●	Average TCP retransmission
●	Average RTT
●	Handset type
●	Average throughput
Task 3.2 - Compute & list 10 of the top, bottom, and most frequent:
a.	TCP values in the dataset. 
b.	RTT values in the dataset.
c.	Throughput values in the dataset.
Task 3.3 - Compute & report:
d.	The distribution of the average throughput per handset type and provide interpretation for your findings.
e.	The average TCP retransmission view per handset type and provide interpretation for your findings.
Task 3.4 - Using the experience metrics above, perform a k-means clustering (where k = 3) to segment users into groups of experiences and provide a brief description of each cluster. (The description must define each group based on your understanding of the data)


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [2]:
if "google.colab" in str(get_ipython()):
    !pip install dvc[gdrive] > /content/piplog.txt
    print(">> g-colab detected \ncloning repo from github\n")
    !git clone https://github.com/JimohAR/it_core_project1

    print("\n>> changing path to the repo\n")
    os.chdir("/content/it_core_project1")

    print("\n>> downloading the datasets\n")
    !dvc import https://github.com/JimohAR/it_core_project1 data/telco.csv \
    -o data/telco.csv

    print("\n>> set up path to the data directory")
    path = os.getcwd() + "/data/"
else:
    print("\n>> set up path to the data directory")
    path = os.path.abspath(os.getcwd() + "/../../data") + "/"


>> set up path to the data directory


In [3]:
data = pd.read_csv(path + "telco.csv").iloc[:-1]

In [4]:
desc = pd.read_excel(path + "field_descriptions.xlsx").set_index("Fields")

In [5]:
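def resolve_outlier(df):
    """Replace IQR outliers and missing values in numeric columns with the column mean."""
    data = df.copy()
    for col in data.select_dtypes("number").columns:
        Q1, Q3 = data[col].quantile([.25, .75])
        IQR = Q3 - Q1
        lower_range = Q1 - (1.5 * IQR)
        upper_range = Q3 + (1.5 * IQR)

        # Mask values outside the IQR fences as missing
        outliers = (data[col] < lower_range) | (data[col] > upper_range)
        cleaned = data[col].mask(outliers)
        # Fill original NaNs and masked outliers with the mean of the remaining values.
        # Assigning back avoids fillna(inplace=True) on a column selection,
        # which may leave `data` unchanged in recent pandas versions.
        data[col] = cleaned.fillna(cleaned.mean())
    return data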
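
The task statement also allows the mode as a replacement value, which is the natural choice for categorical columns such as 'Handset Type' that `resolve_outlier` leaves untouched. A minimal sketch follows; the helper `fill_mode` and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_mode(df):
    """Fill missing values in non-numeric columns with each column's mode.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    data = df.copy()
    for col in data.select_dtypes(exclude="number").columns:
        modes = data[col].mode(dropna=True)
        if not modes.empty:  # a column that is entirely NaN has no mode
            data[col] = data[col].fillna(modes.iloc[0])
    return data

# Made-up rows: one missing handset type, one missing RTT value
toy = pd.DataFrame({
    "Handset Type": ["A", "A", np.nan, "B"],
    "Avg RTT DL (ms)": [30.0, np.nan, 40.0, 50.0],
})
filled = fill_mode(toy)
```

Numeric columns are deliberately left alone here, so the helper could be applied before or after `resolve_outlier`.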

In [6]:
data.columns

Index(['Bearer Id', 'Start', 'Start ms', 'End', 'End ms', 'Dur. (ms)', 'IMSI',
       'MSISDN/Number', 'IMEI', 'Last Location Name', 'Avg RTT DL (ms)',
       'Avg RTT UL (ms)', 'Avg Bearer TP DL (kbps)', 'Avg Bearer TP UL (kbps)',
       'TCP DL Retrans. Vol (Bytes)', 'TCP UL Retrans. Vol (Bytes)',
       'DL TP < 50 Kbps (%)', '50 Kbps < DL TP < 250 Kbps (%)',
       '250 Kbps < DL TP < 1 Mbps (%)', 'DL TP > 1 Mbps (%)',
       'UL TP < 10 Kbps (%)', '10 Kbps < UL TP < 50 Kbps (%)',
       '50 Kbps < UL TP < 300 Kbps (%)', 'UL TP > 300 Kbps (%)',
       'HTTP DL (Bytes)', 'HTTP UL (Bytes)', 'Activity Duration DL (ms)',
       'Activity Duration UL (ms)', 'Dur. (ms).1', 'Handset Manufacturer',
       'Handset Type', 'Nb of sec with 125000B < Vol DL',
       'Nb of sec with 1250B < Vol UL < 6250B',
       'Nb of sec with 31250B < Vol DL < 125000B',
       'Nb of sec with 37500B < Vol UL',
       'Nb of sec with 6250B < Vol DL < 31250B',
       'Nb of sec with 6250B < Vol UL < 37500B',


In [7]:
network_stats_cols = ['Avg RTT DL (ms)', 'Avg RTT UL (ms)', 'Avg Bearer TP DL (kbps)', 'Avg Bearer TP UL (kbps)', 
                      'TCP DL Retrans. Vol (Bytes)', 'TCP UL Retrans. Vol (Bytes)', 'Handset Manufacturer', 
                      'Handset Type',
                     ]

In [8]:
ux_data = resolve_outlier(data[network_stats_cols])

In [9]:
ux_data.describe()

,Avg RTT DL (ms),Avg RTT UL (ms),Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps),TCP DL Retrans. Vol (Bytes),TCP UL Retrans. Vol (Bytes)
count,150000.0,150000.0,150000.0,150000.0,150000.0,150000.0
mean,47.250673,7.887062,7425.817491,366.371568,1311051.0,33719.82391
std,19.277858,6.952875,12450.903656,552.806965,1214180.0,24363.747133
min,0.0,0.0,0.0,0.0,2.0,1.0
25%,35.0,3.0,43.0,47.0,1311051.0,33719.82391
50%,47.250673,7.0,63.0,63.0,1311051.0,33719.82391
75%,51.0,7.887062,7425.817491,366.371568,1311051.0,33719.82391
max,127.0,34.0,49211.0,2729.0,9365124.0,203003.0


### Task 3.1 - Aggregate, per customer, the following information (treat missing values and outliers by replacing them with the mean or the mode of the corresponding variable): average TCP retransmission, average RTT, handset type, average throughput

In [10]:
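def sum_ul_dl(df):
    """Sum each adjacent (DL, UL) pair of numeric columns into a single 'Data' column."""
    # Keep the non-numeric columns (handset manufacturer and type) as they are
    summed_data = df.select_dtypes(exclude="number").copy()
    numeric = df.select_dtypes("number")
    # Numeric columns are assumed to come in adjacent DL/UL pairs.
    # Indexing `numeric` by name avoids relying on the numeric columns
    # being the first columns of `df`.
    for dl_col, ul_col in zip(numeric.columns[::2], numeric.columns[1::2]):
        new_name = dl_col.replace("UL", "Data").replace("DL", "Data")
        summed_data[new_name] = numeric[dl_col] + numeric[ul_col]
    return summed_data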

In [11]:
ux_data = sum_ul_dl(ux_data)

In [12]:
# convert TCP retransmission volume from bytes to megabytes (1 MB taken as 2**20 bytes)
ux_data["TCP Data Retrans. Vol (Bytes)"] /= 2**20
ux_data.rename(columns= {"TCP Data Retrans. Vol (Bytes)": "TCP Data Retrans. Vol (MBs)"}, inplace= True)

In [13]:
ux_data['Bearer Id'] = data['Bearer Id']

In [14]:
# numeric_only=True skips the handset columns, which cannot be averaged
ux_data_per_user = ux_data.groupby('Bearer Id').mean(numeric_only=True)
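
If 'Bearer Id' identifies sessions rather than customers (the field descriptions should confirm this), a per-customer aggregation might instead group on a customer identifier such as 'MSISDN/Number', which appears in the column list above. The sketch below makes that assumption and uses made-up data. It averages the numeric metrics and keeps the most frequent handset type per customer; `first_mode` is a hypothetical helper.

```python
import pandas as pd

# Toy session-level data; the values are made up for illustration.
sessions = pd.DataFrame({
    "MSISDN/Number": [1, 1, 1, 2, 2],
    "Handset Type": ["A", "A", "B", "C", "C"],
    "Avg RTT Data (ms)": [40.0, 50.0, 60.0, 30.0, 50.0],
    "Avg Bearer TP Data (kbps)": [100.0, 200.0, 300.0, 10.0, 30.0],
    "TCP Data Retrans. Vol (MBs)": [1.0, 2.0, 3.0, 0.5, 1.5],
})

def first_mode(s):
    """Most frequent value of a group (first one on ties), or None if all missing."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else None

# Mean for the numeric metrics, mode for the categorical handset type
per_customer = sessions.groupby("MSISDN/Number").agg({
    "Avg RTT Data (ms)": "mean",
    "Avg Bearer TP Data (kbps)": "mean",
    "TCP Data Retrans. Vol (MBs)": "mean",
    "Handset Type": first_mode,
})
```

The result has one row per customer, which is the shape Task 3.1 asks for.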

### Task 3.2 - Compute & list 10 of the top, bottom, and most frequent: a. TCP values in the dataset. b. RTT values in the dataset. c. Throughput values in the dataset.

In [15]:
def sort_each_col(df, aggregate, size= 10):
    data = df.copy()
    sorted_data = pd.DataFrame()
    if aggregate == "min":
        for i in data.select_dtypes(["int", "float"]).columns:
            sorted_data[i] = data[i].sort_values(ascending= True).unique()[:size]
            
    elif aggregate == "max":
        for i in data.select_dtypes(["int", "float"]).columns:
            sorted_data[i] = data[i].sort_values(ascending= False).unique()[:size]
            
    elif aggregate == "freq":
        for i in data.select_dtypes(["int", "float"]).columns:
            sorted_data[i] = data[i].value_counts().keys()[:size]
    
    else:
        raise ValueError(f"aggregate unknown - {aggregate}\nValid - min, max, freq")
        
    return sorted_data

In [16]:
sort_each_col(ux_data_per_user, "freq")

,Avg RTT Data (ms),Avg Bearer TP Data (kbps),TCP Data Retrans. Vol (MBs)
0,55.137735,7792.189059,1.282473
1,29.0,15.0,1.25155
2,39.0,63.0,1.251584
3,38.0,97.0,0.033426
4,30.0,90.0,0.032194
5,40.0,98.0,1.251573
6,28.0,96.0,0.032245
7,49.0,99.0,0.034695
8,31.0,89.0,1.252852
9,41.0,93.0,0.033415


In [17]:
sort_each_col(ux_data_per_user, "min")

,Avg RTT Data (ms),Avg Bearer TP Data (kbps),TCP Data Retrans. Vol (MBs)
0,0.0,0.0,9.3e-05
1,2.0,0.5,0.000103
2,4.0,1.0,0.000108
3,5.0,1.5,0.000122
4,6.0,2.0,0.000123
5,7.0,2.5,0.000128
6,7.887062,3.0,0.000136
7,8.0,3.5,0.000165
8,9.0,3.666667,0.000167
9,10.0,4.0,0.000168


In [18]:
sort_each_col(ux_data_per_user, "max")

,Avg RTT Data (ms),Avg Bearer TP Data (kbps),TCP Data Retrans. Vol (MBs)
0,160.0,51844.0,9.096872
1,159.0,51790.0,9.086612
2,158.0,51730.0,9.070628
3,157.0,51695.0,9.066752
4,156.0,51588.0,9.060015
5,155.0,51513.0,9.031706
6,154.0,51440.0,9.016943
7,153.0,51423.0,9.013672
8,152.0,51392.0,9.009158
9,151.0,51383.0,9.007325
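
For a single metric, the same three lists can also be obtained with pandas' built-in methods. This is a minimal sketch on made-up values, offered as an alternative to `sort_each_col` rather than a replacement; the series `rtt` is illustrative only.

```python
import pandas as pd

# Made-up RTT values with repeats
rtt = pd.Series([30, 45, 45, 120, 5, 45, 30, 200], name="Avg RTT Data (ms)")

top3 = rtt.drop_duplicates().nlargest(3).tolist()      # largest distinct values
bottom3 = rtt.drop_duplicates().nsmallest(3).tolist()  # smallest distinct values
most_frequent3 = rtt.value_counts().head(3)            # values ranked by occurrence count
```

`value_counts()` also returns the counts themselves, which makes it easy to see when a "most frequent" value is the imputed mean.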


### Task 3.3 - Compute & report: d. The distribution of the average throughput per handset type and provide interpretation for your findings. e. The average TCP retransmission view per handset type and provide interpretation for your findings.

In [19]:
ux_data.groupby(['Handset Type'])['Avg Bearer TP Data (kbps)'].mean().sort_values(ascending= False).to_frame()

Handset Type,Avg Bearer TP Data (kbps)
Xiaomi Communica. M1803E1A,49538.371568
Xiaomi Communica. Redmi Note 2,49381.000000
Huawei Nova 2I Huawei Mate 10 Lite,47661.000000
Htc 2Q6E100,47493.000000
Lephone U Pro,45669.000000
...,...
Test IMEI,2.000000
Concox Informati. Concox Gt06 Gt06N Tr06,2.000000
Quectel Wireless. Quectel Ec25-E,1.000000
Lg Lg-T385,0.000000


In [20]:
ux_data.groupby(['Handset Type'])['TCP Data Retrans. Vol (MBs)'].mean().sort_values(ascending= False).to_frame()

Handset Type,TCP Data Retrans. Vol (MBs)
Zte Blade L110 Zte Blade L110 Blade L110 Blade L110,8.723232
Lg-X210Ds,8.719436
Spa Condor Elect. Plume L1 Plus,8.541739
Lenovo Moto Z Force (2Nd Gen) Ge12072245,8.489502
Tcl Communicatio. Alcatel A5 Led Alcatel A5,8.189405
...,...
Oppo A37F,0.002267
Lenovo Moto X Play,0.001733
Gotron (Hk) Elec. Armor X,0.001431
Samsung Galaxy Note 8 (Sm-N9500),0.000399
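
Handset types that appear in only a handful of sessions can dominate the top and bottom of rankings like the two above, because a mean over one or two sessions is not representative. The outputs do not show session counts, so the following is only a sketch on made-up data. It summarizes the throughput distribution per handset type with `describe()` and keeps only types with at least `min_sessions` records, where `min_sessions` is a hypothetical threshold.

```python
import pandas as pd

# Made-up sessions: type "C" has a single, extreme record
toy = pd.DataFrame({
    "Handset Type": ["A"] * 4 + ["B"] * 3 + ["C"],
    "Avg Bearer TP Data (kbps)": [100.0, 200.0, 300.0, 400.0, 50.0, 60.0, 70.0, 50000.0],
})

min_sessions = 3  # hypothetical threshold; tune to the dataset

# count, mean, std, min, quartiles and max of throughput for every handset type
summary = toy.groupby("Handset Type")["Avg Bearer TP Data (kbps)"].describe()

# Drop handset types with too few sessions, then rank by mean throughput
reliable = summary[summary["count"] >= min_sessions].sort_values("mean", ascending=False)
```

The same pattern applies to 'TCP Data Retrans. Vol (MBs)' for part e.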


#### interpret your findings

### Task 3.4 - Using the experience metrics above, perform a k-means clustering (where k = 3) to segment users into groups of experiences and provide a brief description of each cluster. (The description must define each group based on your understanding of the data)

In [21]:
# .copy() makes sub_ux_data an independent frame rather than a slice of ux_data
sub_ux_data = ux_data[['Avg RTT Data (ms)', 'Avg Bearer TP Data (kbps)', 
                       'TCP Data Retrans. Vol (MBs)'
                      ]].copy()

In [22]:
scaler = StandardScaler()
sub_ux_data_scaled = pd.DataFrame(scaler.fit_transform(sub_ux_data), columns= sub_ux_data.columns)

In [23]:
sub_ux_data_scaled.describe().style

,Avg RTT Data (ms),Avg Bearer TP Data (kbps),TCP Data Retrans. Vol (MBs)
count,150000.0,150000.0,150000.0
mean,0.0,0.0,0.0
std,1.000003,1.000003,1.000003
min,-2.4571,-0.610783,-1.102003
25%,-0.674582,-0.603728,-0.026661
50%,0.0,-0.600985,0.0
75%,0.305803,0.118644,0.0
max,4.672972,3.452956,6.715183
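
`StandardScaler` centers each column on its mean and divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). That is why the `std` row above shows 1.000003 rather than exactly 1: with n = 150000 rows, the ratio between the two is sqrt(n / (n - 1)). A small check on made-up numbers (the column name `rtt` is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

x = pd.DataFrame({"rtt": [10.0, 20.0, 30.0, 40.0]})
scaled = pd.DataFrame(StandardScaler().fit_transform(x), columns=x.columns)

n = len(x)
population_std = scaled["rtt"].std(ddof=0)  # what StandardScaler normalizes to 1
sample_std = scaled["rtt"].std(ddof=1)      # what describe() reports

# Ratio expected for a 150000-row frame such as the one in this notebook
ratio_150k = np.sqrt(150000 / 149999)
```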


In [24]:
kmeans = KMeans(3, random_state=0)

In [25]:
kmeans.fit(sub_ux_data_scaled)

KMeans(n_clusters=3, random_state=0)

In [26]:
task3_cluster_centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns = sub_ux_data.columns)

In [27]:
# mapper = dict(zip(task3_cluster_centers["tot data usage (MBs)"].sort_values().keys(), ["bad experience", "average experience", "good experience"]))

In [28]:
# task3_cluster_centers = task3_cluster_centers.rename(index = mapper).rename_axis("labels")
task3_cluster_centers.to_csv(path + "user_experience_cluster_centers.csv")
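
The commented-out mapper above refers to a column, "tot data usage (MBs)", that is not among the experience metrics used here. One possible way to attach readable names to the clusters is to rank the centers on a single metric. The sketch below ranks by throughput only, which is an assumption made for illustration (RTT and retransmission volume are ignored), and the center values are made up.

```python
import pandas as pd

# Made-up cluster centers in original units
centers = pd.DataFrame({
    "Avg RTT Data (ms)": [80.0, 45.0, 60.0],
    "Avg Bearer TP Data (kbps)": [500.0, 20000.0, 5000.0],
    "TCP Data Retrans. Vol (MBs)": [5.0, 1.0, 2.0],
})

names = ["bad experience", "average experience", "good experience"]

# Cluster indices ordered from lowest to highest throughput
order = centers["Avg Bearer TP Data (kbps)"].sort_values().index
mapper = dict(zip(order, names))

labelled_centers = centers.rename(index=mapper).rename_axis("labels")
```

A ranking that combines all three metrics may describe experience better; the choice of metric should follow from the interpretation of the cluster means.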

In [29]:
# assign() returns a new DataFrame, so no value is set on a slice of ux_data
sub_ux_data = sub_ux_data.assign(labels=kmeans.labels_)


In [30]:
sub_ux_data.groupby("labels").mean()

labels,Avg RTT Data (ms),Avg Bearer TP Data (kbps),TCP Data Retrans. Vol (MBs)
0,77.034648,26915.245077,0.887939
1,47.851843,1841.133412,1.113231
2,78.34644,21642.157364,5.547624


#### provide a brief description of each cluster