### How data is collected

Data (upload speed, download speed and ping latency) is collected in two different ways:
 - Running a **speedtest** test, similar to speedtest.net but from the command line
 - Running an **iperf** test against the test server `clearskystatus.info`

### What the data looks like
Load libraries:

In [None]:
from data_exploration import *

Set up the InfluxDB and MS SQL connections:

In [None]:
client, client_df = connect_to_influxdb()
cnxn = connect_to_mssql()

What the data looks like in the MS SQL database (last 5 records):

In [None]:
sql = "SELECT TOP 5 * FROM FCT_SPEEDTEST  ORDER BY DATA_DATE DESC;"
pd.read_sql(sql,cnxn)

What the data looks like in InfluxDB (first 5 records, ordered by time):

In [None]:
query1 = "SELECT * FROM SPEEDTEST_IPERF_PING ORDER BY time;"
df1 = get_dataframe_from_influxdb(client_df=client_df,query_influx=query1,table_name='SPEEDTEST_IPERF_PING')
df1.head(5)

In MS SQL we have one record per test, including ping latency, upload/download speed and additional metadata.  
In InfluxDB we have separate measurements (tables) for ping, upload speed and download speed.  
The number of records for these 3 measurements should be the same, since they all come from a single row in the MS SQL table.
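The expectation that all three measurements have the same number of records can be checked directly. The following is a minimal sketch on toy data, assuming a long-format frame with a `MES_TYPE` column as used later in this notebook; the measurement names (`PING`, `UPLOAD`, `DOWNLOAD`) are hypothetical placeholders.

```python
import pandas as pd

# Toy long-format data: one row per (test, measurement).
# The measurement names are hypothetical placeholders.
toy = pd.DataFrame({
    "SK_PI": [1, 1, 1, 2, 2, 2, 2],
    "MES_TYPE": ["PING", "UPLOAD", "DOWNLOAD",
                 "PING", "UPLOAD", "DOWNLOAD", "PING"],
})

# Count rows per measurement type.
counts = toy["MES_TYPE"].value_counts()

# True only if every measurement type has the same number of rows.
counts_match = counts.nunique() == 1

print(counts)
print("counts match:", counts_match)
```

In the toy data one extra `PING` row was added on purpose, so the check reports a mismatch.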

To distinguish the test types, we split the incoming data into "iperf" and "speedtest" tests by PROVIDER (it is always "iperf" for iperf tests and the ISP name for speedtest tests).
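The PROVIDER rule can be sketched as follows. This is an illustration on toy data, not necessarily how `data_exploration` derives the column; the helper name `add_test_type` and the ISP names are made up.

```python
import numpy as np
import pandas as pd

# Toy data; the ISP names are made up for illustration.
toy = pd.DataFrame({
    "PROVIDER": ["iperf", "Example ISP", "iperf", "Another ISP"],
})

def add_test_type(frame):
    """Return a copy with a test_type column derived from PROVIDER.

    Hypothetical helper: rows whose PROVIDER is "iperf" are iperf tests,
    everything else is treated as a speedtest test.
    """
    out = frame.copy()
    out["test_type"] = np.where(out["PROVIDER"] == "iperf", "iperf", "speedtest")
    return out

labeled = add_test_type(toy)
print(labeled)
```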

### Number of data points per device

In [None]:
summary_stat = df1.groupby(["SK_PI", "test_type"]).size().unstack().reset_index()

combined_bar_plot_2traces(xvalues=summary_stat["SK_PI"],
                          yvalues1=summary_stat["speedtest"],
                          yvalues2=summary_stat["iperf"],
                          name1='speedtest',
                          name2='iperf',
                          title="Number of datapoints per device ",
                          ytitle="Number of datapoints")
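To make the shape of the summary table concrete, here is a small sketch of the same `groupby`/`size`/`unstack` pattern on toy data. The `fill_value=0` argument is an addition that is not in the cell above: without it, a device with no tests of one type gets `NaN` in that column.

```python
import pandas as pd

# Toy data: device 1 has both test types, device 2 only has speedtest rows.
toy = pd.DataFrame({
    "SK_PI": [1, 1, 1, 2, 2],
    "test_type": ["iperf", "speedtest", "speedtest", "speedtest", "speedtest"],
})

# One row per device, one column per test type, values are row counts.
summary = (
    toy.groupby(["SK_PI", "test_type"])
       .size()
       .unstack(fill_value=0)   # 0 instead of NaN for missing combinations
       .reset_index()
)
print(summary)
```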

### Raw data by device

In [None]:
df = get_all_data(client_df)

In [None]:
def show_data(ev):
    clear_output(wait=True)
    display(Box(children = [device_name1,measurement_type1,show_button1]))
                
    subset = df[(df["test_type"]=="iperf") & (df["SK_PI"] == device_name1.value)&(df["MES_TYPE"]==measurement_type1.value)]
    subset1= df[(df["test_type"]=="speedtest") & (df["SK_PI"] == device_name1.value) & (df["MES_TYPE"]==measurement_type1.value)]
  
    fig = get_fig_raw_data_by_device(subset,subset1)
    fig.update_layout(showlegend=True)
    fig.show()

In [None]:
device_name1 = widgets.Dropdown(options = df['SK_PI'].sort_values().unique(), description ='Device number: ',style = {'description_width': 'initial'}, disabled=False)
measurement_type1 = widgets.Dropdown(options = df["MES_TYPE"].sort_values().unique(), description ='Measurement_type: ',style = {'description_width': 'initial'}, disabled=False)

show_button1 = widgets.Button(button_style= 'info', description="Show Data")
show_button1.on_click(show_data)

display(Box(children = [device_name1,measurement_type1,show_button1]))

### Raw data for all devices over the last 6 months

In [None]:
def show_data_all_devices(ev):
    clear_output(wait=True)
    display(Box(children = [test_type2,measurement_type2,show_button2]))
                
    subset = df[(df["test_type"]==test_type2.value) & (df["MES_TYPE"]==measurement_type2.value)]
    subset = subset[subset["time"]>  datetime.now() - pd.DateOffset(months=6)]
    
    device_numbers=subset["SK_PI"].unique()
    
    fig = get_fig_raw_data_all(subset, device_numbers)
    fig.show()
        

In [None]:
test_type2 = widgets.Dropdown(options = df['test_type'].unique(), description ='Test type: ',style = {'description_width': 'initial'}, disabled=False)
measurement_type2 = widgets.Dropdown(options = df["MES_TYPE"].sort_values().unique(), description ='Measurement_type: ',style = {'description_width': 'initial'}, disabled=False)
show_button2 = widgets.Button(button_style= 'info', description="Show Data")
show_button2.on_click(show_data_all_devices)

display(Box(children = [test_type2,measurement_type2,show_button2]))
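The six-month filter used in `show_data_all_devices` can be sketched on toy data. One assumption to flag: if the `time` column is timezone-aware (whether it is depends on how the data was loaded), pandas generally refuses to compare it with a naive `datetime.now()`, so this sketch uses a timezone-aware "now" instead.

```python
import pandas as pd

# Timezone-aware reference time (assumption: the time column is in UTC).
now = pd.Timestamp.now(tz="UTC")

# Toy data with one row older than six months and two newer rows.
toy = pd.DataFrame({
    "time": [
        now - pd.DateOffset(months=7),
        now - pd.DateOffset(months=5),
        now - pd.Timedelta(days=1),
    ],
    "value": [10.0, 20.0, 30.0],
})

# Keep only rows from the last six calendar months.
cutoff = now - pd.DateOffset(months=6)
recent = toy[toy["time"] > cutoff]
print(recent)
```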

### How often is data collected?
Find the time difference (interval) between consecutive tests for each device, then list the most common intervals for the iperf and speedtest tests.

In [None]:
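# Separate data by test type; copy so that new columns can be added
# without triggering a SettingWithCopyWarning
iperf_df1 = df1[df1["test_type"]=="iperf"].copy()
speedtest_df1 = df1[df1["test_type"]=="speedtest"].copy()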

In [None]:
speedtest_df1["interval"] = round(speedtest_df1.groupby('SK_PI')['time'].diff(-1) * (-1) / np.timedelta64(1, 'm'))
speedtest_df1["interval"].value_counts().head()
#fig = go.Figure(data=[go.Histogram(x=speedtest_df1["interval"])])
#fig.show()

In [None]:
iperf_df1["interval"] = round(iperf_df1.groupby('SK_PI')['time'].diff(-1) * (-1) / np.timedelta64(1, 'm'))
iperf_df1["interval"].value_counts().head()
#fig = go.Figure(data=[go.Histogram(x=iperf_df1["interval"])])
#fig.show()

For both tests (speedtest and iperf), data is most often collected every 222 minutes, i.e. every 3 hours 42 minutes.  
Looking at the graphs, the speedtest and iperf tests appear to run alternately.
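The interval calculation can be illustrated on toy data. This sketch uses `diff()` (time since the previous test) rather than the `diff(-1) * (-1)` form in the cells above (time until the next test); both yield the same set of interval values when the rows are sorted by time within each device. The timestamps are made up.

```python
import pandas as pd

# Toy data: device 1 is tested every 222 minutes, device 2 every 240 minutes.
toy = pd.DataFrame({
    "SK_PI": [1, 1, 1, 2, 2],
    "time": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 03:42", "2021-01-01 07:24",
        "2021-01-01 01:00", "2021-01-01 05:00",
    ]),
}).sort_values(["SK_PI", "time"])

# Minutes since the previous test of the same device
# (NaN for the first test of each device).
toy["interval"] = (
    toy.groupby("SK_PI")["time"].diff() / pd.Timedelta(minutes=1)
).round()

# Most common interval, converted to hours and minutes.
most_common = toy["interval"].value_counts().idxmax()
hours, minutes = divmod(int(most_common), 60)
print(f"{most_common:.0f} minutes = {hours} h {minutes} min")
```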