# Ausrüster AG Use Case

## Python Notebook: FOM - Area of Application - Business Analytics

Author: Dr. Stephan Hausberg, Winter semester 2024

Learning objectives:

- Data quality impact
- Imputation and its impact

### 1. Reading in the data and descriptive analytics

The dataset contains a sample of 1,000 machines. For each machine we have the readings of two sensors, the age of the component in days (`Alter des Bauteils in Tagen`), and an indicator of whether it failed (`Ausfall`). Let's take a look at the data.

In [None]:
import pandas as pd
import numpy as np

df_in = pd.read_excel("data_pred_main.xlsx")[['Nummer', 'Sensor 1', 'Sensor 2',
                                                           'Alter des Bauteils in Tagen',
                                                           'Ausfall']]

In [None]:
df_in.head()

In [None]:
df_in.describe()

In [None]:
from summarytools import dfSummary
dfSummary(df_in)
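Alongside `describe()` and `dfSummary`, two quick checks are useful before going further: the number of missing values per column and the share of failed machines. The following is a minimal sketch, assuming a small synthetic stand-in called `df_demo` with generated values (it is not the content of `data_pred_main.xlsx`; only the column names are borrowed):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_in (assumption: generated values, not the real data)
rng = np.random.default_rng(0)
n = 1000
sensor_2 = rng.normal(50, 10, n)
df_demo = pd.DataFrame({
    "Sensor 1": rng.normal(20, 5, n),
    "Sensor 2": sensor_2,
    "Alter des Bauteils in Tagen": rng.integers(1, 1000, n),
    # Failure is made more likely for high "Sensor 2" readings (assumption)
    "Ausfall": (sensor_2 + rng.normal(0, 5, n) > 60).astype(int),
})

missing_per_column = df_demo.isna().sum()  # number of NaN values per column
failure_rate = df_demo["Ausfall"].mean()    # share of rows with Ausfall == 1

print(missing_per_column)
print(f"Failure rate: {failure_rate:.1%}")
```

The same two lines can be run on `df_in` directly once the Excel file is loaded.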

Seaborn is a plotting library built on top of matplotlib. We use it to plot the correlation matrix of the features as a heatmap.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(5, 5))
corr = df_in.corr()
sns.heatmap(corr,
    cmap=sns.diverging_palette(200, 10, as_cmap=True),
    vmin=-1.0, vmax=1.0,
    square=True, ax=ax)
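A heatmap is easy to scan, but the exact numbers are sometimes easier to read from a sorted table. The sketch below ranks the features by the absolute value of their correlation with `Ausfall`. It assumes a synthetic stand-in `df_demo` with generated values, in which failures are driven by "Sensor 2" by construction; the real data may look different.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_in (assumption: generated values, not the real data)
rng = np.random.default_rng(0)
n = 1000
sensor_2 = rng.normal(50, 10, n)
df_demo = pd.DataFrame({
    "Sensor 1": rng.normal(20, 5, n),
    "Sensor 2": sensor_2,
    "Alter des Bauteils in Tagen": rng.integers(1, 1000, n),
    "Ausfall": (sensor_2 + rng.normal(0, 5, n) > 60).astype(int),
})

# Correlation of every feature with the target, strongest first
target_corr = (
    df_demo.corr()["Ausfall"]
    .drop("Ausfall")                        # the target itself is not a feature
    .sort_values(key=abs, ascending=False)  # rank by absolute correlation
)
print(target_corr)
```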

### 2. Modification of data and imputation effects

We create a new dataframe and set "Sensor 2" to missing for every machine that failed (`Ausfall == 1`). With `frac=1`, the sample covers all failed machines.

In [None]:
# Creating a second dataframe named df_2 as a copy
df_2 = df_in.copy()

condition = df_2['Ausfall'] == 1

filtered_df = df_2[condition]

random_sample = filtered_df.sample(frac=1, random_state=1)  # Set random_state for reproducibility

df_2.loc[random_sample.index, 'Sensor 2'] = np.nan
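To make the missingness pattern explicit, one can compute the share of missing "Sensor 2" values separately for failed and non-failed machines. This is a minimal sketch on a synthetic stand-in (`df_demo` and `df_missing` are hypothetical names with generated values); the same `groupby` line should work on `df_2`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption: generated values, not the real data)
rng = np.random.default_rng(0)
n = 1000
sensor_2 = rng.normal(50, 10, n)
df_demo = pd.DataFrame({
    "Sensor 2": sensor_2,
    "Ausfall": (sensor_2 + rng.normal(0, 5, n) > 60).astype(int),
})

# Same modification as in the notebook: remove "Sensor 2" for all failed machines
df_missing = df_demo.copy()
df_missing.loc[df_missing["Ausfall"] == 1, "Sensor 2"] = np.nan

# Share of missing "Sensor 2" values within each Ausfall group
missing_share = df_missing["Sensor 2"].isna().groupby(df_missing["Ausfall"]).mean()
print(missing_share)
```

The share is 0 for `Ausfall == 0` and 1 for `Ausfall == 1`: whether a value is missing depends entirely on the target, which is the situation in which simple imputation is most problematic.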

How does this affect the summary statistics?

In [None]:
dfSummary(df_2)

Take a quick look at how this affects the correlation plot.

In [None]:
f, ax = plt.subplots(figsize=(5, 5))
corr = df_2.corr()
sns.heatmap(corr,
    cmap=sns.diverging_palette(200, 10, as_cmap=True),
    vmin=-1.0, vmax=1.0,
    square=True, ax=ax)
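Why does the entry for `Ausfall` and "Sensor 2" change so drastically? `DataFrame.corr()` only uses rows in which both values are present. After the modification, every such row has `Ausfall == 0`, so `Ausfall` is constant on these rows and the Pearson correlation is undefined. A minimal sketch on a synthetic stand-in (hypothetical names, generated values):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption: generated values, not the real data)
rng = np.random.default_rng(0)
n = 1000
sensor_2 = rng.normal(50, 10, n)
df_demo = pd.DataFrame({
    "Sensor 2": sensor_2,
    "Ausfall": (sensor_2 + rng.normal(0, 5, n) > 60).astype(int),
})

df_missing = df_demo.copy()
df_missing.loc[df_missing["Ausfall"] == 1, "Sensor 2"] = np.nan

corr_before = df_demo.corr().loc["Sensor 2", "Ausfall"]
corr_after = df_missing.corr().loc["Sensor 2", "Ausfall"]

# Rows that still contribute to the correlation all share the same Ausfall value
remaining_targets = df_missing.dropna()["Ausfall"].unique()

print(f"before: {corr_before:.2f}, after: {corr_after}")
print("Ausfall values among complete rows:", remaining_targets)
```

In this sketch the correlation after the modification is `NaN` rather than a small number, which is why the corresponding heatmap cell carries no information.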

We see that the correlation between "Ausfall" and "Sensor 2" has vanished completely. Every row that still has a "Sensor 2" value belongs to a machine that did not fail, so the correlation can no longer be computed and the signal is gone. Only the feature "Alter" is left to explain "Ausfall" in a possible model. Next, we impute the missing values with the mean of the respective variable and save the result in another dataframe.

In [None]:
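# Replace every missing value with the mean of its column
# (fillna already returns a new dataframe, so no explicit copy is needed)
df_3 = df_2.fillna(df_2.mean())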

f, ax = plt.subplots(figsize=(5, 5))
corr = df_3.corr()
sns.heatmap(corr,
    cmap=sns.diverging_palette(200, 10, as_cmap=True),
    vmin=-1.0, vmax=1.0,
    square=True, ax=ax)
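Mean imputation does not bring the signal back. All failed machines receive the mean of the non-failed machines, so both groups end up with the same average "Sensor 2" value and the covariance with `Ausfall` is zero. The sketch below illustrates this on a synthetic stand-in (hypothetical names, generated values). It also adds a missing-value indicator column, here called "Sensor 2 missing" (a name chosen for illustration), as one common way to keep the information that a value was absent.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption: generated values, not the real data)
rng = np.random.default_rng(0)
n = 1000
sensor_2 = rng.normal(50, 10, n)
df_demo = pd.DataFrame({
    "Sensor 2": sensor_2,
    "Ausfall": (sensor_2 + rng.normal(0, 5, n) > 60).astype(int),
})

df_missing = df_demo.copy()
df_missing.loc[df_missing["Ausfall"] == 1, "Sensor 2"] = np.nan

# Mean imputation, as in the notebook
df_imputed = df_missing.fillna(df_missing.mean())
corr_imputed = df_imputed.corr().loc["Sensor 2", "Ausfall"]

# Hypothetical extra column: 1 where "Sensor 2" was missing before imputation
df_imputed["Sensor 2 missing"] = df_missing["Sensor 2"].isna().astype(int)
corr_indicator = df_imputed["Sensor 2 missing"].corr(df_imputed["Ausfall"])

print(f"Sensor 2 vs Ausfall after mean imputation: {corr_imputed:.3f}")
print(f"Missing indicator vs Ausfall: {corr_indicator:.3f}")
```

In this constructed case the indicator coincides with `Ausfall`, which is an extreme situation; in real data the indicator would typically carry only part of the signal.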

The summary statistics also show that mean imputation has changed the distribution of "Sensor 2" considerably.

In [None]:
dfSummary(df_3)
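One effect that can be quantified directly is the shrinking standard deviation. Imputed values sit exactly at the mean and add nothing to the sum of squared deviations, while the number of observations grows from $n_{obs}$ to $n$. The standard deviation therefore shrinks by the factor $\sqrt{(n_{obs}-1)/(n-1)}$. A minimal sketch on a synthetic stand-in (hypothetical names, generated values):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption: generated values, not the real data)
rng = np.random.default_rng(0)
n = 1000
sensor_2 = rng.normal(50, 10, n)
df_demo = pd.DataFrame({
    "Sensor 2": sensor_2,
    "Ausfall": (sensor_2 + rng.normal(0, 5, n) > 60).astype(int),
})

df_missing = df_demo.copy()
df_missing.loc[df_missing["Ausfall"] == 1, "Sensor 2"] = np.nan
df_imputed = df_missing.fillna(df_missing.mean())

std_observed = df_missing["Sensor 2"].std()  # NaN values are skipped
std_imputed = df_imputed["Sensor 2"].std()

# Shrinkage factor predicted by the formula above
n_obs = df_missing["Sensor 2"].notna().sum()
expected_ratio = np.sqrt((n_obs - 1) / (n - 1))

print(f"std observed: {std_observed:.3f}, std imputed: {std_imputed:.3f}")
print(f"ratio: {std_imputed / std_observed:.4f}, expected: {expected_ratio:.4f}")
```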

Let's visualize this in a single combined plot.

In [None]:
# Create the plot
plt.figure(figsize=(10, 6))

# Plot histograms for each distribution with transparency (alpha) and different colors
plt.hist(df_in['Sensor 2'], bins=20, color='blue', alpha=0.1, label='Original Data')
plt.hist(df_2['Sensor 2'].dropna(), bins=20, color='red', alpha=0.2, label='Data with missing values')  # drop NaN explicitly before plotting
plt.hist(df_3['Sensor 2'], bins=20, color='green', alpha=0.3, label='Imputed data')

# Add title and labels
plt.title("Histograms of original, missing and imputed data", fontsize=16)
plt.xlabel("Value", fontsize=12)
plt.ylabel("Number of observations", fontsize=12)

# Show legend
plt.legend()

# Display the plot
plt.show()
