# Iperf data exploration

## How data is collected
Iperf data is collected by running iperf3 tests against the test server **clearskystatus.info**.

**Commands**:  
Bandwidth test:
>/usr/bin/iperf3 -c clearskystatus.info

Reverse bandwidth test (the server sends data to the client):
>/usr/bin/iperf3 -c clearskystatus.info -R

Data being collected:
 - **Ping latency** 
 - **Upload speed**
 - **Download speed**

## What the data looks like

In [None]:
from data_exploration import *
import numpy as np

In [None]:
# Set the starting point; by default it is the current time
starting_point=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
# starting_point="2019-01-10 14:00:00"  # uncomment to set an alternative starting point
print("Starting point:",starting_point )

title_tail=" to the date "+ starting_point

In [None]:
time_interval='4w' #5d

Set up influxdb connection:

In [None]:
client, client_df = connect_to_influxdb()

In [None]:
query1 = "SELECT * FROM SPEEDTEST_IPERF_UPLOAD WHERE PROVIDER='iperf' ORDER BY time DESC LIMIT 10;"
df1 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query1,table_name='SPEEDTEST_IPERF_UPLOAD')
df1

Let's take just one device, for example 4:

In [None]:
query1 = "SELECT * FROM SPEEDTEST_IPERF_UPLOAD WHERE PROVIDER='iperf' AND SK_PI='4' ORDER BY time DESC LIMIT 10;"
df1 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query1,table_name='SPEEDTEST_IPERF_UPLOAD')
df1

Checking the upload speed reported by speedtest for the same device:

In [None]:
query1 = "SELECT * FROM SPEEDTEST_IPERF_UPLOAD WHERE PROVIDER!='iperf' AND SK_PI='4' ORDER BY time DESC LIMIT 10;"
df1 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query1,table_name='SPEEDTEST_IPERF_UPLOAD')
df1

It looks like iperf reports speeds in Kbps, while speedtest reports them in Mbps.
Kbps can be converted to Mbps by multiplying by 0.001.
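
A minimal sketch of that conversion, assuming decimal units (1 Mbps = 1000 Kbps). `kbps_to_mbps` is a hypothetical helper, not part of `data_exploration`, and the sample values are made up for illustration.

```python
def kbps_to_mbps(kbps):
    """Convert kilobits per second to megabits per second (decimal units)."""
    return kbps / 1000.0


# Illustrative values only, not measurements from the database.
sample_kbps = [94500.0, 12250.0]
sample_mbps = [kbps_to_mbps(v) for v in sample_kbps]
print(sample_mbps)
```

Converting once, right after loading, keeps every later plot and threshold in the same unit as the speedtest data.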

Checking download speed and ping latency coming from the same device:

In [None]:
query2 = "SELECT * FROM SPEEDTEST_IPERF_DOWNLOAD WHERE PROVIDER='iperf' AND SK_PI='4' ORDER BY time DESC LIMIT 10;"
df2 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query2,table_name='SPEEDTEST_IPERF_DOWNLOAD')
df2

In [None]:
query3 = "SELECT * FROM SPEEDTEST_IPERF_PING WHERE PROVIDER='iperf' AND SK_PI='4' ORDER BY time DESC LIMIT 10;"
df3 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query3,table_name='SPEEDTEST_IPERF_PING')
df3

Comparing with ping latency coming from speedtest:

In [None]:
query3 = "SELECT * FROM SPEEDTEST_IPERF_PING WHERE PROVIDER!='iperf' AND SK_PI='4' ORDER BY time DESC LIMIT 10;"
df3 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query3,table_name='SPEEDTEST_IPERF_PING')
df3

The latencies differ slightly, but not by much; the units look the same: milliseconds.

Let's compare with what we have in MS SQL database:

In [None]:
cnxn = connect_to_mssql()
sql = "SELECT TOP 10 * FROM FCT_SPEEDTEST WHERE PROVIDER='iperf' AND SK_PI='4' ORDER BY DATA_DATE DESC;"
pd.read_sql(sql,cnxn)

Are there any zeros or NaNs?

In [None]:
sql = "SELECT  * FROM FCT_SPEEDTEST WHERE PROVIDER='iperf' AND (UPLOAD=0 OR DOWNLOAD=0 OR PING=0) ORDER BY DATA_DATE DESC;"
pd.read_sql(sql,cnxn)

In [None]:
sql = "SELECT  * FROM FCT_SPEEDTEST WHERE PROVIDER='iperf' AND (UPLOAD IS NULL OR DOWNLOAD IS NULL OR PING IS NULL) ORDER BY DATA_DATE DESC;"
pd.read_sql(sql,cnxn)

No zeros and no NaNs
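
The same two checks can also be run on rows that are already in memory. This is a minimal sketch using plain dictionaries; `find_suspect_rows` and the sample rows are hypothetical, and only the column names come from the queries above.

```python
import math

METRICS = ("UPLOAD", "DOWNLOAD", "PING")


def is_missing(value):
    """True for None (SQL NULL) and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def find_suspect_rows(rows, metrics=METRICS):
    """Return (rows with a zero metric, rows with a missing metric)."""
    zeros = [r for r in rows if any(r.get(m) == 0 for m in metrics)]
    missing = [r for r in rows if any(is_missing(r.get(m)) for m in metrics)]
    return zeros, missing


# Made-up rows for illustration only.
rows = [
    {"UPLOAD": 10.0, "DOWNLOAD": 50.0, "PING": 12.0},
    {"UPLOAD": 0, "DOWNLOAD": 48.0, "PING": 11.0},
    {"UPLOAD": 9.5, "DOWNLOAD": float("nan"), "PING": None},
]
zeros, missing = find_suspect_rows(rows)
print(len(zeros), len(missing))
```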

## Number of datapoints per device

In [None]:
device_numbers=get_tag_values_influxdb(client_influx=client,table_name='SPEEDTEST_IPERF_DOWNLOAD', tag_name='SK_PI')
device_numbers=list(map(int, device_numbers))
device_numbers= sorted(device_numbers)
print(device_numbers)

Getting number of data points per device for the entire period of time.

In [None]:
query_download_counts = "SELECT COUNT(DOWNLOAD) FROM SPEEDTEST_IPERF_DOWNLOAD WHERE PROVIDER='iperf' AND time<= '"+starting_point+"' AND DOWNLOAD>0 GROUP BY SK_PI;"
download_counts=get_stats_influxdb(client_influx=client,
                               query_influx=query_download_counts,
                               stat_name='count',
                               device_numbers=device_numbers)

In [None]:
simple_bar_plot(xvalues=device_numbers,
                yvalues=download_counts,
                name="download datapoints",
                title="Number of data points per device "+ title_tail,
                ytitle="Number of datapoints")

Some of the devices have only a small number of datapoints; maybe they were installed recently? Let's check how many datapoints came in during the last 4 weeks.

Getting number of datapoints per device in last 4 weeks.

In [None]:
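# Bound the window on both sides so these counts are a subset of the totals above
query_download_counts_time = "SELECT COUNT(DOWNLOAD) FROM SPEEDTEST_IPERF_DOWNLOAD WHERE time >= '"+starting_point+"' - "+time_interval+" AND time <= '"+starting_point+"' AND PROVIDER='iperf' AND DOWNLOAD>0 GROUP BY SK_PI;"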
download_counts_time = get_stats_influxdb(client_influx=client,
                                      query_influx=query_download_counts_time,
                                      stat_name='count',
                                      device_numbers=device_numbers)
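
The 4-week window relies on InfluxQL arithmetic on a timestamp literal (`'<timestamp>' - 4w`). Below is a minimal sketch of building such a clause. `build_window_clause` is a hypothetical helper, and the upper bound on `time` is an assumption added so that the window ends at the starting point.

```python
import re

# Duration literal: digits followed by one InfluxQL unit (ns, u, ms, s, m, h, d, w).
_DURATION = re.compile(r"^\d+(ns|u|ms|s|m|h|d|w)$")


def build_window_clause(end_time, interval):
    """Return a WHERE fragment selecting [end_time - interval, end_time]."""
    if not _DURATION.match(interval):
        raise ValueError("not a duration literal: {!r}".format(interval))
    # Spaces around the minus sign keep the duration a separate token.
    return "time >= '{0}' - {1} AND time <= '{0}'".format(end_time, interval)


print(build_window_clause("2019-01-10 14:00:00", "4w"))
```

Validating the interval before concatenating it into the query avoids sending a malformed statement to the server.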

Plotting a combined bar chart: the total number of datapoints vs the number of datapoints in the last 4 weeks.

In [None]:
combined_bar_plot_2traces(xvalues=device_numbers,
                          yvalues1=download_counts_time,
                          yvalues2=[a - b for a, b in zip(download_counts, download_counts_time)],
                          name1='Last '+time_interval,
                          name2='The rest of the time',
                          title="Comparing number of datapoints in last "+time_interval+" vs entire time "+ title_tail,
                          ytitle="Number of datapoints")

There are no datapoints in the last 4 weeks. Let's check the last reporting time for every device.

In [None]:
query_upload_last = "SELECT LAST(UPLOAD), time FROM SPEEDTEST_IPERF_UPLOAD WHERE PROVIDER='iperf' AND time <= '"+starting_point+"' AND UPLOAD>0 GROUP BY SK_PI;"
result_upload_last=get_stats_influxdb(client_influx=client,
                               query_influx=query_upload_last,
                               stat_name='time',
                               device_numbers=device_numbers)

In [None]:
query_upload_first = "SELECT FIRST(UPLOAD), time FROM SPEEDTEST_IPERF_UPLOAD WHERE PROVIDER='iperf' AND time <= '"+starting_point+"' AND UPLOAD>0 GROUP BY SK_PI;"
result_upload_first=get_stats_influxdb(client_influx=client,
                               query_influx=query_upload_first,
                               stat_name='time',
                               device_numbers=device_numbers)

In [None]:
#print("Iperf reporting times:")
data=[]
for i in range(len(device_numbers)):
    try:
        result_upload_first[i] = dateutil.parser.parse(result_upload_first[i]).strftime('%Y-%m-%d %H:%M:%S')
    except Exception:
        # No parsable timestamp for this device
        result_upload_first[i]=None
    try:
        result_upload_last[i] = dateutil.parser.parse(result_upload_last[i]).strftime('%Y-%m-%d %H:%M:%S')
    except Exception:
        # No parsable timestamp for this device
        result_upload_last[i]=None
    #print("Device: ", device_numbers[i],"  was reporting from ", result_upload_first[i], " to ",result_upload_last[i])
    trace = go.Scatter(x=[result_upload_first[i],result_upload_last[i]],y=[device_numbers[i],device_numbers[i]], 
                       name = device_numbers[i],marker=dict(color=colors[i]))
    data.append(trace)
layout = dict(title = "Device reporting times(iperf) "+ title_tail,xaxis=dict(title="Time"),
        yaxis=dict(title="Device Number"))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

It looks like iperf3 stopped listening on the test server on Dec 3.  
`clearskystatus.info` still responds to ping, but all iperf3 tests fail:
   >/usr/bin/iperf3 -c clearskystatus.info  
   >iperf3: error - unable to connect to server: Operation timed out
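
Ping only shows that the host is up, so a TCP connection attempt to the iperf3 port is a more direct check. This is a minimal sketch, assuming the server uses the iperf3 default port 5201 (the commands above pass no `-p` option). `tcp_port_open` is a hypothetical helper, not part of `data_exploration`.

```python
import socket

IPERF3_DEFAULT_PORT = 5201  # iperf3 uses TCP port 5201 unless -p is given


def tcp_port_open(host, port=IPERF3_DEFAULT_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call (not executed here because it needs network access):
# tcp_port_open("clearskystatus.info")
```

A `False` result while ping succeeds would be consistent with the iperf3 server process being stopped or the port being filtered.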

## How often data was collected

Let's check devices 2 and 4 and see how often data was collected:

In [None]:
query2 = "SELECT * FROM SPEEDTEST_IPERF_DOWNLOAD WHERE PROVIDER='iperf' AND SK_PI='4' ORDER BY time DESC LIMIT 10;"
df2 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query2,table_name='SPEEDTEST_IPERF_DOWNLOAD')
df2

In [None]:
query2 = "SELECT * FROM SPEEDTEST_IPERF_DOWNLOAD WHERE PROVIDER='iperf' AND SK_PI='2' ORDER BY time DESC LIMIT 10;"
df2 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query2,table_name='SPEEDTEST_IPERF_DOWNLOAD')
df2

Just from looking at the data, the collection times show no consistent pattern.
Let's calculate the time intervals between all available datapoints up to the starting point for devices 2 and 4.

In [None]:
query_device4 = "SELECT * FROM SPEEDTEST_IPERF_DOWNLOAD WHERE PROVIDER='iperf' AND SK_PI='4' AND time<= '"+starting_point+"';"
df_device4 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query_device4,table_name='SPEEDTEST_IPERF_DOWNLOAD')
df_device4.head()

In [None]:
df_device4['interval'] = df_device4['time'] - df_device4['time'].shift(+1)
df_device4['interval']=round(df_device4['interval'].dt.total_seconds() / 60)

In [None]:
df_device4 = df_device4[np.isfinite(df_device4['interval'])]
df_device4.head()

In [None]:
time_intervals=df_device4['interval'].unique()
time_intervals= sorted(time_intervals)
print("Time intervals for device 4: ",time_intervals)

In [None]:
print("Frequencies for every time interval for device4:")
df_device4.groupby(['interval']).size()

In [None]:
query_device2 = "SELECT * FROM SPEEDTEST_IPERF_DOWNLOAD WHERE PROVIDER='iperf' AND SK_PI='2' AND time<= '"+starting_point+"';"
df_device2 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query_device2,table_name='SPEEDTEST_IPERF_DOWNLOAD')
df_device2['interval'] = df_device2['time'] - df_device2['time'].shift(+1)
df_device2['interval']=round(df_device2['interval'].dt.total_seconds() / 60)
df_device2 = df_device2[np.isfinite(df_device2['interval'])]
print("Frequencies for every time interval for device2:")
df_device2.groupby(['interval']).size()

In [None]:
#trace=go.Histogram(x=df_device4['interval'],xbins=dict(size=222))
#fig = go.Figure(data=[trace])
#fig['layout'].update(title='Download speed histogram per device')
#iplot(fig)

In [None]:
#import plotly.figure_factory as ff
#hist_data = [df_device4['interval']]
#group_labels = ['device 4 time interval']
#fig = ff.create_distplot(hist_data, group_labels,bin_size=60)
#fig['layout']['xaxis'].update(title='Download speed (Mbps)')
#iplot(fig)

Most of the data is collected at 222-minute intervals (or 444 or 666 minutes), but it's not consistent.
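
Since 444 = 2 × 222 and 666 = 3 × 222, one possible reading is that the longer gaps are the same 222-minute schedule with one or two runs missing in between. The sketch below expresses each gap as a multiple of that base period. The 222-minute period as the nominal schedule is an assumption, the helpers are hypothetical, and the timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timedelta

BASE_PERIOD_MIN = 222  # assumed nominal schedule, taken from the most common gap


def gaps_in_minutes(timestamps):
    """Gaps between consecutive timestamps, rounded to whole minutes."""
    ordered = sorted(timestamps)
    return [round((b - a).total_seconds() / 60) for a, b in zip(ordered, ordered[1:])]


def gaps_in_periods(timestamps, base=BASE_PERIOD_MIN):
    """Count gaps by how many base periods they span (1 = no run missing)."""
    return Counter(round(gap / base) for gap in gaps_in_minutes(timestamps))


# Made-up sample times, in minutes after an arbitrary start.
start = datetime(2018, 11, 1, 0, 0)
samples = [start + timedelta(minutes=m) for m in (0, 222, 444, 888, 1110, 1776)]
print(gaps_in_minutes(samples))
print(gaps_in_periods(samples))
```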

Comparing these intervals with speedtest in Grafana, we can see that the two tests alternate (they do not run at the same time).

![](images/grafana_ping2.png)

## Statistics by device

Since there is not a lot of data, we will select everything in the database up to the starting point (today's date by default).

In [None]:
query_download = "SELECT * FROM SPEEDTEST_IPERF_DOWNLOAD WHERE PROVIDER='iperf' AND DOWNLOAD>0 AND time < '"+starting_point+"';"
download_df = get_dataframe_from_influxdb(client_df=client_df,query_influx=query_download,
                                          table_name='SPEEDTEST_IPERF_DOWNLOAD')
download_df['DOWNLOAD']=download_df['DOWNLOAD']*0.001

In [None]:
query_upload = "SELECT * FROM SPEEDTEST_IPERF_UPLOAD WHERE PROVIDER='iperf' AND UPLOAD>0 AND time < '"+starting_point+"';"
upload_df = get_dataframe_from_influxdb(client_df=client_df,query_influx=query_upload,
                                          table_name='SPEEDTEST_IPERF_UPLOAD')
upload_df['UPLOAD']=upload_df['UPLOAD']*0.001

In [None]:
query_ping = "SELECT * FROM SPEEDTEST_IPERF_PING WHERE PROVIDER='iperf' AND PING>0 AND time < '"+starting_point+"';"
ping_df = get_dataframe_from_influxdb(client_df=client_df,query_influx=query_ping,
                                          table_name='SPEEDTEST_IPERF_PING')

In [None]:
download_summary=mean_max_median_min_by1(download_df,'DOWNLOAD')
device_numbers_d=download_summary["SK_PI"].unique()
download_line=go.Scatter(x=device_numbers_d,y=[50] * len(device_numbers_d), mode='markers',marker=dict(color='red'), name='50Mbps')
combined_bar_plot_4traces(xvalues=download_summary["SK_PI"],
                         yvalues1=download_summary["max"],
                         yvalues2=download_summary["mean"],
                         yvalues3=download_summary["median"],
                         yvalues4=download_summary["min"],
                         name1="Max",
                         name2="Mean",
                         name3="Median",
                         name4="Min",
                         title="Download speed by device"+ title_tail,
                         ytitle="Mbps",
                         line=download_line,
                         stack=False)
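
The implementation of `mean_max_median_min_by1` is not shown in this notebook. As a rough illustration of the kind of per-device summary plotted above, here is a sketch in plain Python; `summarize_by_device` is a hypothetical stand-in, and the rows are made up.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_device(rows, value_key, device_key="SK_PI"):
    """Group rows by device and compute mean, max, median and min of one column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[device_key]].append(row[value_key])
    return {
        device: {
            "mean": mean(values),
            "max": max(values),
            "median": median(values),
            "min": min(values),
        }
        for device, values in sorted(grouped.items())
    }


# Made-up download speeds in Mbps.
rows = [
    {"SK_PI": 2, "DOWNLOAD": 40.0},
    {"SK_PI": 2, "DOWNLOAD": 60.0},
    {"SK_PI": 4, "DOWNLOAD": 10.0},
    {"SK_PI": 4, "DOWNLOAD": 20.0},
    {"SK_PI": 4, "DOWNLOAD": 90.0},
]
summary = summarize_by_device(rows, "DOWNLOAD")
print(summary)
```

With a pandas DataFrame, a comparable summary is usually written as `df.groupby("SK_PI")["DOWNLOAD"].agg(["mean", "max", "median", "min"])`. Showing mean and median together makes skew visible: for device 4 in the sample, one fast run pulls the mean well above the median.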

In [None]:
upload_summary=mean_max_median_min_by1(upload_df,'UPLOAD')
device_numbers_u=upload_summary["SK_PI"].unique()
upload_line=go.Scatter(x=device_numbers_u,y=[10] * len(device_numbers_u), mode='markers',marker=dict(color='red'), name='10Mbps')

combined_bar_plot_4traces(xvalues=upload_summary["SK_PI"],
                         yvalues1=upload_summary["max"],
                         yvalues2=upload_summary["mean"],
                         yvalues3=upload_summary["median"],
                         yvalues4=upload_summary["min"],
                         name1="Max",
                         name2="Mean",
                         name3="Median",
                         name4="Min",
                         title="Upload speed by device"+ title_tail,
                         ytitle="Mbps",
                         line=upload_line,
                         stack=False)

In [None]:
ping_summary=mean_max_median_min_by1(ping_df,'PING')
combined_bar_plot_4traces(xvalues=ping_summary["SK_PI"],
                         yvalues1=ping_summary["max"],
                         yvalues2=ping_summary["mean"],
                         yvalues3=ping_summary["median"],
                         yvalues4=ping_summary["min"],  
                         name1="Max",
                         name2="Mean",
                         name3="Median",
                         name4="Min",
                         title="Ping latency by device back"+ title_tail,
                         ytitle="Milliseconds",
                         stack=False)

In [None]:
simple_boxplot(dataframe=download_df,plot_value='DOWNLOAD',sort_value='SK_PI',
               title="Download speed by device "+ title_tail,
               ytitle="Mbps", downloadline=True)

In [None]:
simple_boxplot(dataframe=upload_df,plot_value='UPLOAD',sort_value='SK_PI',
               title="Upload speed by device "+ title_tail,
               ytitle="Mbps", uploadline=True)

In [None]:
simple_boxplot(dataframe=ping_df,plot_value='PING',sort_value='SK_PI',
               title="Ping latency by device "+ title_tail,
               ytitle="Milliseconds")

## Statistics by time of day and day of the week

### Download speed

In [None]:
download_df["hour"]=pd.to_numeric(download_df["time"].dt.hour)

In [None]:
t="Normalized download speed by hour "+ title_tail
traces=[]
for device in device_numbers_d:
    subset=download_df[download_df["SK_PI"]==device]
    trace = go.Scatter(
        x = subset['hour'],
        y=(subset['DOWNLOAD']-subset['DOWNLOAD'].mean())/subset['DOWNLOAD'].std(),
        mode = 'markers',
        marker = dict(color=colors[device]),
        name = str(device)
    )
    traces.append(trace)
layout = go.Layout(
        title=t,
        xaxis=dict(title="Hour of the day"),
        yaxis=dict(title="Standardized download speed (z-score)")
        )
data = traces
fig = go.Figure(data=data, layout=layout)
iplot(fig)
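
The normalization in the cell above subtracts each device's mean and divides by its standard deviation, which gives a unitless z-score rather than a value in Mbps. A minimal sketch with made-up numbers; `zscores` is a hypothetical helper.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize values: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample std (n - 1), the same default as pandas Series.std()
    if sigma == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(v - mu) / sigma for v in values]


# Made-up download speeds in Mbps for one device.
print(zscores([40.0, 50.0, 60.0]))
```

Standardizing per device puts slow and fast links on the same scale, so an hour-of-day pattern shared across devices is easier to spot.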

In [None]:
device_number=2
subset=download_df[download_df["SK_PI"]==device_number]
t="Download speed by hour for the device " + str(device_number)+" "+ title_tail
simple_boxplot(dataframe=subset,plot_value='DOWNLOAD',sort_value='hour',
               title=t,
               xtitle="Hour of the day", downloadline=True)

In [None]:
download_df["time_group"]=""
# Hour h covers h:00-h:59, so each group is a half-open range matching its label
download_df.loc[(download_df["hour"]>=23)|(download_df["hour"]<7),"time_group"]="night 23:00-07:00"
download_df.loc[(download_df["hour"]>=7)&(download_df["hour"]<17),"time_group"]="day 7:00-17:00"
download_df.loc[(download_df["hour"]>=17)&(download_df["hour"]<23),"time_group"]="evening 17:00-23:00"
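
The grouping can be checked at its boundaries with a small standalone function. This sketch assumes each label is read as a half-open range (for example, "day 7:00-17:00" covers hours 7 through 16); `time_group` is a hypothetical helper.

```python
def time_group(hour):
    """Map an integer hour (0-23) to the time-of-day label used in the plots."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if hour >= 23 or hour < 7:
        return "night 23:00-07:00"
    if hour < 17:
        return "day 7:00-17:00"
    return "evening 17:00-23:00"


for h in (0, 6, 7, 16, 17, 22, 23):
    print(h, time_group(h))
```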

In [None]:
#subset=download_df[download_df["SK_PI"]==device_number]
#t="Upload speed by timegroup for the device "+str(device_number)+title_tail
#simple_boxplot(dataframe=subset,plot_value='DOWNLOAD',sort_value='time_group',
#               title=t,
 #              ytitle="Mbps",downloadline=True, jitter=True)

In [None]:
download_df["weekday"]=download_df["time"].dt.day_name()  # weekday_name was removed in pandas 1.0
download_df["weekday"] = pd.Categorical(download_df["weekday"], ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])

In [None]:
device_number=2
subset=download_df[download_df["SK_PI"]==device_number]
t="Download speed by day of the week"+title_tail
simple_boxplot(dataframe=download_df,plot_value='DOWNLOAD',sort_value='weekday',
               title=t,
               ytitle="Mbps", weekdays=True,jitter=True)

In [None]:
download_df["day_group"]="Weekday"
download_df.loc[(download_df["weekday"]=="Sunday")|(download_df["weekday"]=="Saturday"),"day_group"]="Weekend"
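
The same weekday/weekend split can be derived directly from a timestamp. A minimal sketch using the standard library; `day_group` is a hypothetical helper and the dates are arbitrary examples.

```python
from datetime import datetime


def day_group(ts):
    """Return 'Weekend' for Saturday or Sunday, otherwise 'Weekday'."""
    # datetime.weekday(): Monday is 0 and Sunday is 6.
    return "Weekend" if ts.weekday() >= 5 else "Weekday"


print(day_group(datetime(2019, 1, 11)))  # a Friday
print(day_group(datetime(2019, 1, 12)))  # a Saturday
```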

In [None]:
t="Download speed by day group"+title_tail
simple_boxplot(dataframe=download_df,plot_value='DOWNLOAD',sort_value='day_group',
               title=t,
               ytitle="Mbps", jitter=True, downloadline=True)

### Upload speed

In [None]:
upload_df["hour"]=pd.to_numeric(upload_df["time"].dt.hour)

In [None]:
t="Normalized upload speed by hour "+title_tail
traces=[]
for device in device_numbers_u:
    subset=upload_df[upload_df["SK_PI"]==device]
    trace = go.Scatter(
        x = subset['hour'],
        y=(subset['UPLOAD']-subset['UPLOAD'].mean())/subset['UPLOAD'].std(),
        mode = 'markers',
        marker = dict(color=colors[device]),
        name = str(device)
    )
    traces.append(trace)
layout = go.Layout(
        title=t,
        xaxis=dict(title="Hour of the day"),
        yaxis=dict(title="Standardized upload speed (z-score)")
        )
data = traces
fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
upload_df["hour"]=pd.to_numeric(upload_df["time"].dt.hour)

device_number=3
subset=upload_df[upload_df["SK_PI"]==device_number]
t="Upload speed by hour for the device " + str(device_number)+" "+title_tail
simple_boxplot(dataframe=subset,plot_value='UPLOAD',sort_value='hour',
               title=t,
               ytitle="Upload speed (Mbps)",
               xtitle="Hour of the day", uploadline=True)

In [None]:
upload_df["time_group"]=""
# Hour h covers h:00-h:59, so each group is a half-open range matching its label
upload_df.loc[(upload_df["hour"]>=23)|(upload_df["hour"]<7),"time_group"]="night 23:00-07:00"
upload_df.loc[(upload_df["hour"]>=7)&(upload_df["hour"]<17),"time_group"]="day 7:00-17:00"
upload_df.loc[(upload_df["hour"]>=17)&(upload_df["hour"]<23),"time_group"]="evening 17:00-23:00"

In [None]:
subset=upload_df[upload_df["SK_PI"]==device_number]
t="Upload speed by timegroup for the device "+str(device_number)+title_tail
simple_boxplot(dataframe=subset,plot_value='UPLOAD',sort_value='time_group',
               title=t,
               ytitle="Mbps",uploadline=True, jitter=True)

In [None]:
upload_df["weekday"]=upload_df["time"].dt.day_name()  # weekday_name was removed in pandas 1.0
upload_df["weekday"] = pd.Categorical(upload_df["weekday"], ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])

In [None]:
subset=upload_df[upload_df["SK_PI"]==device_number]
t="Upload speed by day of the week for the device "+str(device_number)+title_tail
simple_boxplot(dataframe=subset,plot_value='UPLOAD',sort_value='weekday',
               title=t,
               ytitle="Mbps",uploadline=True, weekdays=True, jitter=True)

### Ping latency

In [None]:
ping_df["hour"]=pd.to_numeric(ping_df["time"].dt.hour)
device_numbers_p=ping_df["SK_PI"].unique()

In [None]:
t="Normalized ping latency by hour "+title_tail
traces=[]
for device in device_numbers_p:
    subset=ping_df[ping_df["SK_PI"]==device]
    trace = go.Scatter(
        x = subset['hour'],
        y=(subset['PING']-subset['PING'].mean())/subset['PING'].std(),
        mode = 'markers',
        marker = dict(color=colors[device]),
        name = str(device)
    )
    traces.append(trace)
layout = go.Layout(
        title=t,
        xaxis=dict(title="Hour of the day"),
        yaxis=dict(title="Standardized ping latency (z-score)")
        )
data = traces
fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
t="Ping latency by hour"+title_tail
simple_boxplot(dataframe=ping_df,plot_value='PING',sort_value='hour',
               title=t,
               ytitle="Milliseconds",
               xtitle="Hour of the day")

In [None]:
device_number=7
by_hour_by_device_p2=mean_max_median_by2(input_dataframe=ping_df,value1="PING", value2="PING",
                                          value3="PING",group_by_value="hour", rename_columns=True)
subset=by_hour_by_device_p2[by_hour_by_device_p2["SK_PI"]==device_number]
t="Ping latency (iperf) by hour for the device "+str(device_number)+title_tail
combined_bar_plot_3traces(xvalues=subset["hour"],
                         yvalues1=subset["max"],
                         yvalues2=subset["mean"],
                         yvalues3=subset["median"],
                         name1="Max",
                         name2="Mean",
                         name3="Median",
                         title=t,
                         xtitle="hour",
                         stack=False)

In [None]:
ping_df["weekday"]=ping_df["time"].dt.day_name()  # weekday_name was removed in pandas 1.0
ping_df["weekday"] = pd.Categorical(ping_df["weekday"], ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])

In [None]:
t="Ping latency (iperf) by day of the week"+title_tail
simple_boxplot(dataframe=ping_df,plot_value='PING',sort_value='weekday',
               title=t,
               ytitle="Milliseconds", weekdays=True)

In [None]:
device_number=7
by_hour_by_device_p2=mean_max_median_by2(input_dataframe=ping_df,value1="PING", value2="PING",
                                          value3="PING",group_by_value="weekday", rename_columns=True)
subset=by_hour_by_device_p2[by_hour_by_device_p2["SK_PI"]==device_number]
t="Ping latency (iperf) by day of the week for the device "+str(device_number)+title_tail
combined_bar_plot_3traces(xvalues=subset["weekday"],
                         yvalues1=subset["max"],
                         yvalues2=subset["mean"],
                         yvalues3=subset["median"],
                         name1="Max",
                         name2="Mean",
                         name3="Median",
                         title=t,
                         xtitle="weekday",
                         stack=False)