In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy.stats as stats

## 1.1A
The data consists of four text files stored in CSV format. Each record is on a single line and attributes are separated by a TAB character. All four datasets share the attribute imei, which links the tables to each other.
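
As a minimal sketch of how such tab-separated files relate through imei, the snippet below parses two tiny in-memory tables and joins them. The sample rows and values are invented for illustration; only the column names ts, imei, mwra and store_name come from the datasets.

```python
import io
import pandas as pd

# Hypothetical miniature versions of two of the files (values are made up).
connections_txt = "ts\timei\tmwra\n2018-05-05 10:00:00\t111\t1.0\n2018-05-05 10:01:00\t222\t0.0\n"
devices_txt = "store_name\timei\nStore A\t111\nStore B\t222\n"

# sep="\t" tells pandas that attributes are separated by TAB characters.
conn_demo = pd.read_csv(io.StringIO(connections_txt), sep="\t")
dev_demo = pd.read_csv(io.StringIO(devices_txt), sep="\t")

# The shared imei attribute links a connection record to its device.
joined = conn_demo.merge(dev_demo, on="imei", how="left")
print(joined)
```

A left join keeps every connection record even if its device were missing from the devices table.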

### Connections csv

In [None]:
connections = pd.read_csv('data/connections.csv', sep='\t')
#connections = sns.load_dataset("data/connections.csv")

connections.head()


In [None]:
connections.shape

Connections csv has 15074 records (rows) and 13 attributes (columns). 

In [None]:
connections.info()

This data has no null values (the Non-Null count of each column equals the total number of rows). The first column, ts, holds timestamps stored as type object and will probably be converted to a datetime type during further processing. The second attribute, imei, has type int64 and stores the International Mobile Equipment Identity number. The third, mwra, has type float64 but takes only one of two values, 1.0 or 0.0, representing malware-related activity, so it will probably be converted to type boolean. The other 10 columns contain float64 values that serve as inputs for the evaluation of mwra. The first three attributes (ts, imei, mwra) are discrete (categorical) and the rest are continuous (numeric).

Together, the first and second columns (ts, imei) form the key for a snapshot of the remaining values. The third column is the result of the evaluation.
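
The type conversions considered above could look roughly like the following sketch. The sample rows are hypothetical; only the column names come from connections.csv, and the boolean cast assumes that mwra really holds nothing but 0.0 and 1.0.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from connections.csv.
demo = pd.DataFrame({
    "ts": ["2018-05-05 10:00:00", "2018-05-05 10:01:00"],
    "imei": [111, 222],
    "mwra": [1.0, 0.0],
})

# Parse the object (string) column into datetime64 values.
demo["ts"] = pd.to_datetime(demo["ts"])

# The cast is safe only if mwra holds nothing but 0.0 and 1.0 and has no nulls.
assert demo["mwra"].isin([0.0, 1.0]).all()
demo["mwra"] = demo["mwra"].astype(bool)

print(demo.dtypes)
```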

In [None]:
# data fix for 1.2:
#connections['ts'] = pd.to_datetime(connections['ts'])
# maybe also convert mwra to boolean

### Devices csv

In [None]:
devices = pd.read_csv('data/devices.csv', sep='\t')
devices.head()

In [None]:
devices.shape

Devices csv has 2895 records (rows) and 6 attributes (columns). 

In [None]:
devices.info()

In [None]:
devices.shape[0] - devices.dropna().shape[0]

This data has 3 null values in the attribute code. The first and second attributes contain the geolocation (latitude and longitude) as float64 values. The next three attributes contain values of type object: they are strings that identify the store which sold the device. The attribute store_name contains the name of the store, code is the code of the country in which the store is located, and location is the name of the continent and the city in which the device was sold. In further processing, the location attribute might be split into two columns. The last attribute has type int64 and stores the International Mobile Equipment Identity number. All attributes are discrete (categorical).
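
A possible way to count nulls per column and to split location is sketched below. The rows are invented, and the "Continent/City" format with "/" as the separator is an assumption about how location is encoded.

```python
import pandas as pd

# Hypothetical rows; the "Continent/City" format of location is an assumption.
dev_demo = pd.DataFrame({
    "store_name": ["Store A", "Store B", "Store C"],
    "code": ["SK", None, "US"],
    "location": ["Europe/Bratislava", "Asia/Tokyo", "America/New_York"],
})

# Null count per column, rather than the number of rows with any null.
null_counts = dev_demo.isna().sum()

# Split on the first "/" only, so a city containing "/" stays intact.
dev_demo[["continent", "city"]] = dev_demo["location"].str.split("/", n=1, expand=True)

print(null_counts)
print(dev_demo[["continent", "city"]])
```

Counting per column shows directly which attribute holds the nulls, which the row-based difference of shapes cannot tell.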

The attribute imei is the key for this table.
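
Whether imei really identifies each row could be checked along these lines. This is only a sketch on hypothetical values, one of which is repeated on purpose.

```python
import pandas as pd

# Hypothetical imei values, one of them repeated on purpose.
dev_demo = pd.DataFrame({"imei": [111, 222, 222], "store_name": ["A", "B", "B"]})

# A key column must identify each row uniquely.
print(dev_demo["imei"].is_unique)

# Rows that share an imei with another row, for manual inspection.
repeated = dev_demo[dev_demo["imei"].duplicated(keep=False)]
print(repeated)
```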

### Processes csv

In [None]:
processes = pd.read_csv('data/processes.csv', sep='\t')
processes.shape

In [None]:
processes.head()

### Profiles csv

In [None]:
profiles = pd.read_csv('data/profiles.csv', sep='\t')
profiles.shape

In [None]:
profiles.head()

## 1.1B Analysis of attributes
The following significant attributes were chosen for analysis:
* Connections
   - ts
   - imei
   - mwra
   - c.android.youtube
   - c.dogalize
* Devices
   - store_name
   - imei
* Processes
   - ...
* Profiles
   - ...

### Table Connections - attribute ts

In [None]:
connections.info()

The attribute ts contains values of type object, so descriptive statistics cannot be generated for it; the values first need to be converted to type datetime.

In [None]:
connections['ts'] = pd.to_datetime(connections['ts'])
connections.describe()

The ts column has been successfully converted to a timestamp data type, allowing for date-time operations. The column contains 15,074 entries, which indicates that every record in the dataset has a corresponding timestamp in correct format. The minimum timestamp is 2018-05-05 10:00:00, indicating the earliest recorded time in the dataset. The maximum timestamp is 2018-05-15 18:14:00, which indicates the latest recorded time. The timestamps cover a span of about 10 days.

The mean timestamp is approximately 2018-05-10 14:03:19, so the average record falls around the middle of the range. The 50th percentile (median) is 2018-05-10 14:04:30, which is very close to the mean and suggests a roughly symmetric distribution around this central point. The standard deviation is shown as NaN because it is not computed for timestamp values.
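
The span and the midpoint of the range can be verified with plain timestamp arithmetic. The sketch below uses only the minimum and maximum reported above.

```python
import pandas as pd

# Minimum and maximum timestamps as reported by describe() above.
ts_min = pd.Timestamp("2018-05-05 10:00:00")
ts_max = pd.Timestamp("2018-05-15 18:14:00")

# Subtracting two timestamps gives a Timedelta.
span = ts_max - ts_min
print(span)      # 10 days 08:14:00

# Midpoint of the range, for comparison with the reported mean and median.
midpoint = ts_min + span / 2
print(midpoint)  # 2018-05-10 14:07:00
```

The midpoint lies within a few minutes of both the mean and the median, which is what one would expect if records are spread evenly over the period.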

In [None]:
connections['ts'].unique().size

The number of unique timestamps is 14895, so 179 of the 15074 records repeat a timestamp that already occurs elsewhere. We need to check whether those duplicates also share the same device imei.

In [None]:
connections[['ts', 'imei']].drop_duplicates().shape[0]

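The number of unique (ts, imei) pairs is also 14895. Since this matches the number of unique timestamps, every repeated timestamp comes with the same imei, so the duplicates repeat the whole key.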
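
Duplicated keys can also be counted directly with `duplicated`. The sketch below uses hypothetical rows in which the second and third share both ts and imei; the final line repeats the arithmetic on the counts reported above.

```python
import pandas as pd

# Hypothetical rows: the second and third share both ts and imei.
demo = pd.DataFrame({
    "ts": pd.to_datetime(["2018-05-05 10:00", "2018-05-05 10:01", "2018-05-05 10:01"]),
    "imei": [111, 222, 222],
    "mwra": [1.0, 0.0, 0.0],
})

# Rows beyond the first occurrence of each (ts, imei) pair.
n_extra = demo.duplicated(subset=["ts", "imei"]).sum()

# Rows that are identical in every column, not just in the key.
n_identical = demo.duplicated().sum()

print(n_extra, n_identical)

# The same arithmetic on the counts reported above.
print(15074 - 14895)  # 179
```

Comparing the two counts shows whether the repeated keys are exact copies of a record or conflicting snapshots for the same key.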

In [None]:
connections['date'] = connections['ts'].dt.date
date_counts = connections['date'].value_counts().sort_index()

connections_ts_graph = sns.barplot(x=date_counts.index, y=date_counts.values)

connections_ts_graph.set(xlabel='Date', ylabel='Number of Records')
# Label the bars with the dates that were actually plotted, one per bar
connections_ts_graph.set_xticklabels(date_counts.index, rotation=45)

The records are distributed roughly evenly over time.

### Table Connections - attribute imei

In [None]:
connections.describe()

The attribute imei contains numbers between 3.590434e+17 and 8.630331e+18. Since these are generated identifiers, computing statistical distributions over them makes no sense; the only reasonable metric is the number of unique values.

In [None]:
connections['imei'].unique().size

The attribute imei in the table connections contains records for 500 unique devices.

In [None]:
connections.imei.value_counts().sort_values()

Each device in this dataset has at least 12 and at most 47 records. The number of records per device imei is visualized below.
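
The minimum and maximum number of records per device can be read off without scanning the sorted output. The following is a sketch on a hypothetical imei column with three devices.

```python
import pandas as pd

# Hypothetical imei column with three devices.
imei = pd.Series([111, 111, 111, 222, 222, 333], name="imei")

# Number of records per device.
per_device = imei.value_counts()

# Fewest, most and typical number of records per device.
print(per_device.agg(["min", "max", "median"]))

# Number of distinct devices.
print(imei.nunique())
```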

In [None]:
imei_counts = connections['imei'].value_counts()
connections_imei_graph = sns.barplot(x=imei_counts.index, y=imei_counts.values, errorbar=None)

connections_imei_graph.set(xlabel="imei index", ylabel="number of records")

ticks = connections_imei_graph.get_xticks()
connections_imei_graph.set_xticks(ticks[::100])
connections_imei_graph.set_xticklabels(imei_counts.index[::100], rotation=45)


There is no apparent relationship between the device and the number of records in the dataset.

### Table Connections - attribute mwra (TODO)

In [None]:
connections.describe()

The mean value is 0.628367, which means that more records were reported with malware-related activity than without it (about 62.8% of all records). TODO: count and describe the rest.

In [None]:
connections['mwra'].unique()

There are only two values present in the attribute mwra: 0 and 1. They indicate whether malware-related activity was present on a device at a given time.

In [None]:
connections['mwra'].value_counts()

There are 9472 records with malware-related-activity and 5602 records without it.
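
For a 0/1 column the mean is simply the share of ones, so these counts can be cross-checked against the mean reported by describe(). The sketch below rebuilds such a column from the two counts.

```python
import pandas as pd

# Rebuild a 0/1 column from the counts reported by value_counts() above.
mwra = pd.Series([1.0] * 9472 + [0.0] * 5602)

# For a 0/1 column the mean equals the share of ones.
print(round(mwra.mean(), 6))  # 0.628367

# Relative frequencies of both classes.
print(mwra.value_counts(normalize=True).round(4))
```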

In [None]:
connections.mwra.value_counts().plot(kind='pie')

### Table Connections - attribute c.android.youtube

In [None]:
connections.describe()

The column has 15,074 entries, so every record in the dataset has a value for c.android.youtube. The mean value is approximately 10.65, which suggests that, on average, users have some level of interaction with the YouTube app.
The values range from 1.02 to 20.73. The median is approximately 10.53, close to the mean, which suggests a roughly symmetric distribution. The standard deviation is approximately 2.54.

In [None]:
stats.mode(connections['c.android.youtube'])

In [None]:
np.var(connections['c.android.youtube'])

In [None]:
iqr = np.percentile(connections['c.android.youtube'], 75) - np.percentile(connections['c.android.youtube'], 25)
iqr
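
One detail worth keeping in mind when comparing these numbers with describe(): NumPy's variance defaults to the population form (ddof=0), while pandas defaults to the sample form (ddof=1). The sketch below shows the difference on four hypothetical values, together with a pandas-only way to get the interquartile range.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for one numeric column.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

# NumPy's default is the population variance (ddof=0) ...
pop_var = np.var(s.to_numpy())
# ... while pandas defaults to the sample variance (ddof=1),
# the same convention as the std reported by describe().
sample_var = s.var()

# Interquartile range via pandas quantiles (linear interpolation by default).
iqr_demo = s.quantile(0.75) - s.quantile(0.25)

print(pop_var, sample_var, iqr_demo)
```

With 15,074 records the two variance conventions differ only negligibly, but on small samples the gap is visible.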

TODO: pick two attributes and use the one with the nicest distribution
for devices imei: one record per device

In [None]:
sns.histplot(connections['c.UCMobile.intl'], bins=30, kde=True)