# Machine Learning Applications - Data Understanding

<p> Practical examples for machine learning applications lecture No. 7 - Data Understanding </p>

<h4> Contents </h4>

 1. Data Understanding
 2. Data Preprocessing <font color="red">- see Jupyter notebook "Data Preprocessing"</font>
 3. Feature Engineering <font color="red">- see Jupyter notebook "Data Preprocessing"</font>

First, we need to import the necessary Python libraries:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

from sklearn.decomposition import PCA

# tqdm_notebook is deprecated; tqdm.notebook provides the same progress bar
from tqdm.notebook import tqdm

## 1. Data Understanding

<p> In a first step, we load the data and get an overview of the recorded parameters. We call this process Data Understanding. The idea is to analyse and describe the dataset with classical statistical methods in order to get a broad overview of the parameters and their relevance. </p>

<p> We start by importing the two datasets. The first one contains samples of the lightweight container, the second one contains samples of the heavyweight container. </p>

In [2]:
"""ADJUST PATH"""
# Import files
df_light_full = pd.read_csv("Matlab_Data_Training_output_leicht.csv", header=None)
df_heavy_full = pd.read_csv("Matlab_Data_Training_output_schwer.csv", header=None)

# Rename columns
df_light_full.columns = ['DEV1', 't', 'var_1', 'var_2', 'var_3']
df_heavy_full.columns = ['DEV1', 't', 'var_1', 'var_2', 'var_3']

# Drop NaNs
df_heavy_full = df_heavy_full.dropna()
df_light_full = df_light_full.dropna()

# Show first values of dataset (good for checking correct formatting)
df_heavy_full.head(3) 
df_light_full.head(3)

Unnamed: 0,DEV1,t,var_1,var_2,var_3
0,DEV_1,331765,0.056,0.018,0.926
1,DEV_1,331775,-0.004,-0.006,0.997
2,DEV_1,331785,0.017,0.046,0.976
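
Before dropping rows, it can help to check how many values are actually missing in each column. The following is a minimal sketch on a small made-up frame (`df_demo` is hypothetical and not part of the lecture data); it only reuses the column names from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column layout as the lecture files
df_demo = pd.DataFrame({
    "DEV1": ["DEV_1", "DEV_1", "DEV_1", "DEV_1"],
    "t": [331765, 331775, 331785, 331795],
    "var_1": [0.056, -0.004, np.nan, 0.017],
    "var_2": [0.018, -0.006, 0.046, np.nan],
    "var_3": [0.926, 0.997, 0.976, 0.981],
})

# Count missing values per column before removing anything
nan_per_column = df_demo.isna().sum()
print(nan_per_column)

# dropna() removes every row that contains at least one NaN
df_demo_clean = df_demo.dropna()
print(f"rows before: {len(df_demo)}, rows after: {len(df_demo_clean)}")
```

If a large share of rows disappears, it may be worth looking at the affected columns before continuing.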


<h4>Basic descriptive statistics </h4>

We start the analysis with a simple statistical description of the data. Pandas includes functions to help us with this.

In [None]:
df_heavy_full.describe()

In [None]:
df_light_full.describe()
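
The two summaries are easier to compare when they sit next to each other. One possible way, sketched here on made-up values (`light_demo` and `heavy_demo` are hypothetical stand-ins for the two classes), is to concatenate the `describe()` results column-wise.

```python
import pandas as pd

# Hypothetical toy values standing in for the two classes
light_demo = pd.DataFrame({"var_1": [0.0, 0.1, 0.2], "var_3": [0.9, 1.0, 1.1]})
heavy_demo = pd.DataFrame({"var_1": [0.1, 0.3, 0.5], "var_3": [0.8, 1.0, 1.2]})

# Place both summaries side by side; the outer column level names the class
summary = pd.concat(
    [light_demo.describe(), heavy_demo.describe()],
    axis=1,
    keys=["light", "heavy"],
)

# Show only the rows of interest
print(summary.loc[["mean", "std"]])
```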

<h4> Visual inspection </h4>

Plotting the three variables over time.

In [None]:
sns.set(style="darkgrid")
plt.figure(figsize=(12, 6))

df_heavy = df_heavy_full[["var_1", "var_2", "var_3"]]

plt.title("Heavy weight")
# plt.ylim(-1, 1.5)
plt.xlabel('Time')
plt.ylabel('Accelerations in m/s**2')
ax1 = sns.lineplot(data=df_heavy)
ax1


In [None]:
sns.set(style="darkgrid")
plt.figure(figsize=(12, 6))

df_light = df_light_full[["var_1", "var_2", "var_3"]]

plt.title("Light weight")
# plt.ylim(-1, 1.5)
plt.xlabel('Time')
plt.ylabel('Accelerations in m/s**2')
ax2 = sns.lineplot(data=df_light)
ax2

In [None]:
""" Delete all Zero Rows"""

df_light = df_light.loc[(df_light!=0).any(axis=1)]
df_heavy = df_heavy.loc[(df_heavy!=0).any(axis=1)]

sns.set(style="darkgrid")
plt.figure(figsize=(12, 6))
plt.title("Heavy weight")
plt.ylim(-1, 1.5)
ax1 = sns.lineplot(data=df_heavy)
ax1
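
To see what the zero-row filter does, here is a small sketch on a made-up frame (`df_demo` is hypothetical). Only rows in which every column equals zero are removed; rows with a single zero entry are kept.

```python
import pandas as pd

# Hypothetical toy frame: the second row is all zeros, the third only partly
df_demo = pd.DataFrame({
    "var_1": [0.05, 0.0, 0.0],
    "var_2": [0.02, 0.0, 0.0],
    "var_3": [0.93, 0.0, 0.98],
})

# True where at least one column in the row is non-zero
keep_mask = (df_demo != 0).any(axis=1)
df_demo_filtered = df_demo.loc[keep_mask]

print(keep_mask.tolist())
print(df_demo_filtered)
```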

Having gained a first impression of the data, we continue with statistical visualization. Plotting the distribution of each variable, for example as a histogram, is particularly helpful.

In [None]:
# Plot a histogram and kernel density estimate

plt.figure(figsize=(12, 6))

# histplot replaces the deprecated distplot; stat="density" normalizes the histogram
sns.histplot(df_light["var_1"], stat="density", kde=True, color="b", label="Var 1")
sns.histplot(df_light["var_2"], stat="density", kde=True, color="r", label="Var 2")
sns.histplot(df_light["var_3"], stat="density", kde=True, color="g", label="Var 3")

plt.legend()
plt.title("Light weight")
plt.xlabel("Acceleration")
plt.ylabel("PDF")

In [None]:
# Plot a histogram and kernel density estimate

plt.figure(figsize=(12,6))

# histplot replaces the deprecated distplot; stat="density" normalizes the histogram
sns.histplot(df_heavy["var_1"], stat="density", kde=True, color="b", label="Var 1")
sns.histplot(df_heavy["var_2"], stat="density", kde=True, color="r", label="Var 2")
sns.histplot(df_heavy["var_3"], stat="density", kde=True, color="g", label="Var 3")

plt.legend()
plt.title("Heavy weight")
# plt.ylim(0, 24)
# plt.xlim(-0.5, 2)
plt.xlabel("Acceleration")
plt.ylabel("PDF")

In [None]:
plt.figure()
df_light.boxplot()
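
A box plot marks points beyond the whiskers as outliers. Assuming the default whisker length of 1.5 times the interquartile range (IQR), the same points can be found numerically. The sketch below uses a made-up series (`values` is hypothetical).

```python
import pandas as pd

# Hypothetical toy series with one obvious outlier
values = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 100.0], name="var_demo")

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are drawn as individual markers in a box plot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```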

<h3>Analysis</h3>

Now we will compare the two classes against each other rather than comparing the three dimensions.

In [None]:
plt.figure(figsize=(16, 12))

var_names = ['var_1', 'var_2', 'var_3']

for i, var in enumerate(var_names):
    plt.subplot(3,1,i+1)
    # label must be a string, not a list; histplot replaces the deprecated distplot
    sns.histplot(df_light[var], stat="density", kde=True, label='light')
    sns.histplot(df_heavy[var], stat="density", kde=True, label='heavy')
    plt.xlabel('')
    plt.title(var)
    plt.legend()
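
The visual comparison can be complemented by a number per variable. One option, which is not part of the lecture material, is the standardized mean difference (often called Cohen's d): the difference of the class means divided by the pooled standard deviation. The sketch below runs on synthetic data; `light_demo`, `heavy_demo` and `standardized_mean_difference` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two classes: var_1 is similar, var_3 is shifted
rng = np.random.default_rng(0)
light_demo = pd.DataFrame({
    "var_1": rng.normal(0.0, 1.0, 500),
    "var_3": rng.normal(1.0, 1.0, 500),
})
heavy_demo = pd.DataFrame({
    "var_1": rng.normal(0.0, 1.0, 500),
    "var_3": rng.normal(3.0, 1.0, 500),
})


def standardized_mean_difference(a, b):
    """Difference of means (b minus a) in units of the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)


# One value per variable; larger absolute values mean better separated classes
effect = pd.Series({
    var: standardized_mean_difference(light_demo[var], heavy_demo[var])
    for var in light_demo.columns
})
print(effect)
```

A value close to zero suggests that the variable alone hardly separates the classes, while a large absolute value suggests it may be a useful feature.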



<h4>Basic explorative statistics </h4>

Even very basic statistics can reveal correlations between variables. Pandas again includes a function for this, which by default computes the Pearson correlation coefficients. Spearman and Kendall coefficients are available as well.

In [None]:
df_heavy.corr()

In [None]:
df_light.corr()
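
The difference between Pearson and Spearman shows up when a relationship is monotonic but not linear. The following sketch uses a made-up frame (`df_demo` is hypothetical) and the `method` parameter of `DataFrame.corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y grows strictly with x, but not linearly
x = np.arange(1, 11, dtype=float)
df_demo = pd.DataFrame({"x": x, "y": x ** 3})

# Pearson (the default) measures linear association
pearson = df_demo.corr()

# Spearman works on ranks and measures monotonic association
spearman = df_demo.corr(method="spearman")

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

Here the Spearman coefficient reaches 1 because the ranks agree perfectly, while the Pearson coefficient stays below 1.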

Correlations can be plotted in the form of 2D heatmaps. Of course, this gets more interesting with more dimensions...

In [None]:
# matshow creates its own figure, so no separate plt.figure() call is needed
plt.matshow(df_light.corr(), cmap='viridis')
plt.colorbar()
plt.yticks(range(len(var_names)), var_names)
plt.xticks(range(len(var_names)), var_names)
# plt.title('Correlation matrix')
plt.tight_layout()

In [None]:
plt.figure(figsize=(10,6))
plt.subplot(121)
plt.scatter(df_light.var_1, df_light.var_2, alpha=0.6)
plt.xlabel('Var1 = x')
plt.ylabel('Var2 = y')
plt.subplot(122)
plt.scatter(df_light.var_1, df_light.var_3, alpha=0.6, c='C1')
plt.xlabel('Var1 = x')
plt.ylabel('Var3 = z')