# P5: Exploratory Data Analysis

The files `data_*.txt` contain the preprocessed packets, i.e. the features and the target variable (`size_mm`):
* `start_time`: date and start time of the experiment
* `velocity`: flow velocity of the water for that experiment (m/s)
* The other features are named `<feature name>_<sensor>`, where G stands for geophone, M(iniplate) for accelerometer and S(ound) for microphone:
	* `centroid_frequency` is the weighted average of the frequencies, where the weights are the power spectral density at each frequency (see https://en.wikipedia.org/wiki/Spectral_centroid)
	* `centroid_frequency2` is similar, but the weights are the squared power spectral density
	* `median_freq` is the frequency corresponding to the median of the power spectral densities
	* `flash_ind` and `cv` are the "flashiness index" and the coefficient of variation of the power spectral densities
	* `iqa` is the sum of squared amplitudes of a packet
	* `mab` is the maximum absolute amplitude of a packet
	* `imp` is the number of impulses of a packet
	* `len` is the length of a packet in number of timesteps (same for all sensors)
* `size_mm` is the target variable: grain size b-axis diameter (mm). A good starting point is to use the log of this as the target variable
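
To make the feature definitions above more concrete, the following is a minimal sketch of how the two centroid frequencies, `iqa` and `mab` could be computed for a single packet. It is not the preprocessing code that produced the files: the function name `spectral_centroids`, the use of a plain periodogram as the PSD estimate, the sampling rate and the synthetic packet are all assumptions made for illustration.

```python
import numpy as np

def spectral_centroids(signal, fs):
    """Return (centroid, centroid2) in Hz for a 1-D signal sampled at fs Hz."""
    # Assumption: a simple one-sided periodogram stands in for the PSD.
    psd = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    # Weighted average of the frequencies, weights = PSD
    centroid = np.sum(freqs * psd) / np.sum(psd)
    # Same idea, but weights = squared PSD
    centroid2 = np.sum(freqs * psd**2) / np.sum(psd**2)
    return centroid, centroid2

# Hypothetical packet: a pure 2 kHz tone, 50 timesteps, sampled at 10 kHz
fs = 10_000
t = np.arange(50) / fs
packet = np.sin(2 * np.pi * 2000 * t)

centroid, centroid2 = spectral_centroids(packet, fs)
iqa = np.sum(packet**2)        # sum of squared amplitudes
mab = np.max(np.abs(packet))   # maximum absolute amplitude
print(centroid, centroid2, iqa, mab)
```

For a pure tone both centroids coincide with the tone frequency; for broadband packets the squared weighting pulls `centroid_frequency2` towards the dominant spectral peak.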

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_columns', None)

# MPA

A dedicated measuring system for smaller grain sizes.

`4 Sensors`

In [2]:
mpa = pd.read_table('../data/data_mpa.txt', sep=' ')
print(mpa.shape)
mpa.head(4)

(21663, 39)


Unnamed: 0,start_time,velocity,size_mm,centroid_frequency_M01,centroid_frequency2_M01,centroid_frequency_M02,centroid_frequency2_M02,centroid_frequency_M03,centroid_frequency2_M03,centroid_frequency_M04,centroid_frequency2_M04,median_freq_M01,median_freq_M02,median_freq_M03,median_freq_M04,flash_ind_M01,flash_ind_M02,flash_ind_M03,flash_ind_M04,cv_M01,cv_M02,cv_M03,cv_M04,iqa_M01,iqa_M02,iqa_M03,iqa_M04,mab_M01,mab_M02,mab_M03,mab_M04,imp_M01,imp_M02,imp_M03,imp_M04,len_M01,len_M02,len_M03,len_M04
0,2021-06-22 15:34:38,2.05,12.3,2407.17752,2383.462777,3594.34465,3818.274968,3043.452352,3549.220734,2673.814437,2824.432598,2588.199096,3707.581105,3380.291834,2659.706639,0.00341,0.000856,0.003861,0.004662,0.531196,0.834109,0.647348,0.536881,3.136071e-08,0.000264837,4.509251e-08,2.927102e-08,0.006431,1.131952,0.00766,0.006011,0,2,0,0,42,42,42,42
1,2021-06-22 15:34:38,2.05,12.3,3245.903945,3375.076129,2921.431687,3283.745824,2461.395358,2502.497764,2411.778622,2264.244802,3191.512728,2953.919364,2564.269676,2578.243694,0.001206,0.003759,0.003158,0.003963,0.730771,0.652439,0.413809,0.581647,0.0004363673,3.413999e-08,3.490291e-08,6.079496e-08,1.497038,0.007796,0.006003,0.007696,3,0,0,0,47,47,47,47
2,2021-06-22 15:34:38,2.05,12.3,2354.3781,2214.842318,3336.998058,3318.154655,2888.766886,3205.724097,2583.538995,2783.182747,1977.127033,3232.205322,3219.740721,2931.929627,0.004347,0.001226,0.003707,0.003401,0.70575,0.891155,0.504303,0.679552,9.560187e-08,0.0001952121,5.947095e-08,6.004481e-08,0.011887,0.946315,0.008638,0.008229,0,3,0,0,46,46,46,46
3,2021-06-22 15:34:38,2.05,12.3,2646.745202,2752.672212,3231.972948,3232.356657,2743.411191,2882.039891,2993.83149,3337.390227,2770.004722,3168.124297,2911.109525,3246.180027,0.002389,0.001345,0.002188,0.002083,0.384234,0.906965,0.505675,0.576147,4.415891e-08,1.541509e-05,2.444728e-08,9.764644e-08,0.011723,0.212369,0.007838,0.018704,0,0,0,0,37,37,37,37
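
Because every sensor feature follows the `<feature name>_<sensor>` pattern, the columns can be grouped per sensor by splitting on the last underscore. The sketch below uses a handful of column names copied from the table above; the helper `group_by_sensor` and its `non_sensor` argument are hypothetical names introduced here, not part of the project code.

```python
from collections import defaultdict

# A few column names copied from the table above
columns = [
    "start_time", "velocity", "size_mm",
    "centroid_frequency_M01", "centroid_frequency2_M01",
    "median_freq_M02", "flash_ind_M03", "len_M04",
]

def group_by_sensor(columns, non_sensor=("start_time", "velocity", "size_mm")):
    """Map each sensor suffix (e.g. 'M01') to the list of its feature names."""
    groups = defaultdict(list)
    for col in columns:
        if col in non_sensor:
            continue  # these columns carry no sensor suffix
        # Split on the LAST underscore, since feature names contain underscores too
        feature, sensor = col.rsplit("_", 1)
        groups[sensor].append(feature)
    return dict(groups)

by_sensor = group_by_sensor(columns)
print(by_sensor)
```

With the real data, `group_by_sensor(mpa.columns)` would be the natural call, for example to plot one sensor at a time.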


In [3]:
mpa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21663 entries, 0 to 21662
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   start_time               21663 non-null  object 
 1   velocity                 21663 non-null  float64
 2   size_mm                  21663 non-null  float64
 3   centroid_frequency_M01   21663 non-null  float64
 4   centroid_frequency2_M01  21663 non-null  float64
 5   centroid_frequency_M02   21663 non-null  float64
 6   centroid_frequency2_M02  21663 non-null  float64
 7   centroid_frequency_M03   21663 non-null  float64
 8   centroid_frequency2_M03  21663 non-null  float64
 9   centroid_frequency_M04   21663 non-null  float64
 10  centroid_frequency2_M04  21663 non-null  float64
 11  median_freq_M01          21663 non-null  float64
 12  median_freq_M02          21663 non-null  float64
 13  median_freq_M03          21663 non-null  float64
 14  median_freq_M04       

In [4]:
# Set start time to datetime
mpa['start_time'] = pd.to_datetime(mpa['start_time'])

In [None]:
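# Treat the grain sizes as categories so that each size gets its own color
tmp = mpa['size_mm'].astype('category')
# y was left empty in the original cell; median_freq_M01 is an assumed placeholder
p = sns.scatterplot(x=mpa['centroid_frequency_M01'], y=mpa['median_freq_M01'], hue=tmp)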

In [None]:
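fig, ax = plt.subplots(figsize=(10, 7))
# Pearson correlation of the numeric columns only (start_time is not numeric)
tmp = mpa.corr(method='pearson', numeric_only=True)

p = sns.heatmap(tmp, ax=ax)
plt.show()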

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 5))

# Left: raw target distribution
p = sns.histplot(mpa['size_mm'], ax=axes[0])
p.set_title('Distribution of Target Grain Size', loc='left')

# Right: log-transformed target
p = sns.histplot(np.log(mpa['size_mm']), ax=axes[1])
p.set_title('Distribution of Target with Log-Transformation', loc='left')

sns.despine()
plt.show()

**Interpretation:**

The plot shows that the recommended log transformation brings the grain size distribution closer to a normal distribution.
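
The visual impression can be backed up with a number, for example by comparing the sample skewness before and after the transformation (a value near 0 indicates a symmetric distribution). The sketch below is only an illustration under stated assumptions: it uses a synthetic, right-skewed stand-in for `size_mm` drawn from a log-normal distribution, not the measured grain sizes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumption: synthetic right-skewed values standing in for the real size_mm column
size_mm = pd.Series(rng.lognormal(mean=3.0, sigma=0.6, size=5000), name="size_mm")

skew_raw = size_mm.skew()          # skewness of the raw values
skew_log = np.log(size_mm).skew()  # skewness after the log transformation

print(f"skewness raw: {skew_raw:.2f}, log-transformed: {skew_log:.2f}")
```

On the real data, the same check would be `mpa['size_mm'].skew()` versus `np.log(mpa['size_mm']).skew()`.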

--------------

# SPG

The oldest of the three measuring systems. It provides the largest amount of data and covers a longer measuring period.

`2 Sensors`

In [None]:
spg = pd.read_table('../data/data_spg.txt', sep=' ')
print(spg.shape)
spg.head(4)

In [None]:
spg.info()

In [None]:
spg['start_time'] = pd.to_datetime(spg['start_time'])

# SPS

The newest of the three measuring systems.


In [None]:
sps = pd.read_table('../data/data_sps.txt', sep=' ')
print(sps.shape)
sps.head(4)

In [None]:
sps.info()

In [None]:
sps['start_time'] = pd.to_datetime(sps['start_time'])