# 2. Data analysis & Preprocessing

In this exercise you will get to know the basics from the lectures "3. Getting to Know Your Data" and "4. Preprocessing" in their practical use and apply them yourself.

Since this practice sheet is designed to be used in three sessions, it is roughly divided into three parts:

- Part One: Getting to Know Your Data
- Part Two: Preprocessing - Data cleaning & Data integration
- Part Three: Preprocessing - Data reduction, data transformation & data discretization

Of course, depending on how quickly an exercise group progresses, a part may not be covered completely in its session, or parts of the following part may already be addressed.

## Part Three: Preprocessing - Data reduction, data transformation & data discretization

In this part you will apply the theoretical knowledge gained in the second part of the lecture "Preprocessing".

In [None]:
# Import the required libraries
import tempfile
import sqlite3
import urllib.request
import sklearn.decomposition
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    dataset_folder + "/adventure-works.db",
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(dataset_folder + "/adventure-works.db")

In [None]:
# Create the clean dataframe(s)
# Order dataframe
order_dataframe = pd.read_sql_query(
    "SELECT p.ProductID,p.Name,p.ProductNumber,p.MakeFlag,p.FinishedGoodsFlag,p.Color,p.SafetyStockLevel,"
    "p.ReorderPoint,p.StandardCost,p.ListPrice,p.Size,p.SizeUnitMeasureCode,p.WeightUnitMeasureCode,p.Weight,"
    "p.DaysToManufacture,p.ProductLine,p.Class,p.Style,p.ProductSubcategoryID,p.ProductModelID,p.SellStartDate,"
    "p.SellEndDate,p.DiscontinuedDate,d.PurchaseOrderID,d.PurchaseOrderDetailID,d.DueDate,d.OrderQty,d.ProductID,"
    "d.UnitPrice,d.ReceivedQty,d.RejectedQty,h.RevisionNumber,h.Status,h.EmployeeID,h.VendorID,h.ShipMethodID,"
    "h.OrderDate,h.ShipDate,h.SubTotal,h.TaxAmt,h.Freight,h.TotalDue,e.NationalIDNumber,e.LoginID,e.OrganizationNode,"
    "e.JobTitle,e.BirthDate,e.MaritalStatus,e.Gender,e.HireDate,e.SalariedFlag,e.VacationHours,e.SickLeaveHours,"
    "e.CurrentFlag,r.PersonType,r.NameStyle,r.Title,r.FirstName,r.MiddleName,r.LastName,r.Suffix,r.EmailPromotion,"
    "r.AdditionalContactInfo,r.Demographics "
    "FROM Product p "
    "JOIN PurchaseOrderDetail d ON p.ProductID = d.ProductID "
    "JOIN PurchaseOrderHeader h ON d.PurchaseOrderID = h.PurchaseOrderID "
    "JOIN Employee e ON h.EmployeeID = e.BusinessEntityID "
    "JOIN Person r ON e.BusinessEntityID = r.BusinessEntityID",
    connection,
    index_col="PurchaseOrderDetailID",
)

# CurrencyRate dataframe
currency_rate_dataframe = pd.read_sql_query(
    "SELECT STRFTIME('%Y-%m-%d', CurrencyRateDate) AS CurrencyRateDate,AverageRate,EndOfDayRate "
    "FROM CurrencyRate "
    "WHERE FromCurrencyCode='USD' AND ToCurrencyCode='EUR'",
    connection,
    index_col="CurrencyRateDate",
)

### Normalization

One method introduced in the lecture and frequently used in Data Science is normalization. In order to apply this practically, we will first take a look at a part of the order_dataframe already known from Part One. More precisely, we are looking at some numeric attributes from order_dataframe.

<div class="alert alert-block alert-info">
<b>Task:</b> Display the head of the SubTotal, Freight and OrderQty attributes from order_dataframe.</div>

In [None]:
# Display the head of SubTotal, Freight and OrderQty

In [None]:
# Display the head of SubTotal, Freight and OrderQty
order_dataframe[["SubTotal", "Freight", "OrderQty"]].head(20)

<div class="alert alert-block alert-info">
<b>Task:</b> Display the minimum, maximum, mean and standard deviation of SubTotal, Freight and OrderQty.</div>

In [None]:
# Display the minimum of SubTotal, Freight and OrderQty

In [None]:
# Display the minimum of SubTotal, Freight and OrderQty
order_dataframe[["SubTotal", "Freight", "OrderQty"]].min()

In [None]:
# Display the maximum of SubTotal, Freight and OrderQty

In [None]:
# Display the maximum of SubTotal, Freight and OrderQty
order_dataframe[["SubTotal", "Freight", "OrderQty"]].max()

In [None]:
# Display the mean of SubTotal, Freight and OrderQty

In [None]:
# Display the mean of SubTotal, Freight and OrderQty
order_dataframe[["SubTotal", "Freight", "OrderQty"]].mean()

In [None]:
# Display the standard deviation of SubTotal, Freight and OrderQty

In [None]:
# Display the standard deviation of SubTotal, Freight and OrderQty
order_dataframe[["SubTotal", "Freight", "OrderQty"]].std()

As can be clearly seen, the three attributes differ significantly in their value ranges and spread. This can be a hindrance for some knowledge discovery tasks, because attributes with large values can dominate attributes with small ones. For this reason, normalization is often performed, scaling the attribute values to a much smaller specified range of values.

In the lecture you were introduced to three different variants of normalization: The min-max normalization, the z-score normalization and the normalization by decimal scaling.

Below you can see the implementation of one of the normalization methods for the attributes SubTotal, Freight and OrderQty:

In [None]:
# It is always good to define methods that you may want to use more often as a function.
def mystery_normalization(dataframe_to_normalize):
    # We need to copy the dataframe_to_normalize to avoid overwriting content in the existing dataframe
    dataframe = dataframe_to_normalize.copy()

    # We need to normalize each column individually
    for column in dataframe.columns:
        dataframe[column] = (dataframe[column] - dataframe[column].mean()) / dataframe[
            column
        ].std()

    return dataframe


# Now we can apply the function we just defined to our dataframe
mystery_normalization_dataframe = mystery_normalization(
    order_dataframe[["SubTotal", "Freight", "OrderQty"]]
)
mystery_normalization_dataframe.head(20)

<div class="alert alert-block alert-info">
<b>Task:</b> Determine whether the above function mystery_normalization is an implementation of min-max normalization, the z-score normalization, or normalization by decimal scaling.</div>

<b>The function is an implementation of:</b>
1. [ ] Min-max normalization (for the interval [0, 1])
2. [ ] Z-score normalization
3. [ ] Normalization by decimal scaling

<b>The function is an implementation of:</b>
1. [ ] Min-max normalization (for the interval [0, 1])
2. [X] Z-score normalization
3. [ ] Normalization by decimal scaling

<div class="alert alert-block alert-info">
<b>Task:</b> Implement a function for each of the three normalization methods you got to know. (You may, of course, reuse the above code when you work on the corresponding function).</div>

In [None]:
# Implement a min-max normalization for the interval [0, 1]
def min_max_normalization(dataframe_to_normalize):
    # We need to copy the dataframe_to_normalize to avoid overwriting content in the existing dataframe
    dataframe = dataframe_to_normalize.copy()

    # ...

    return dataframe


# Apply the min-max normalization
min_max_dataframe = min_max_normalization(
    order_dataframe[["SubTotal", "Freight", "OrderQty"]]
)
min_max_dataframe.head(20)

In [None]:
# Implement a min-max normalization for the interval [0, 1]
def min_max_normalization(dataframe_to_normalize):
    # We need to copy the dataframe_to_normalize to avoid overwriting content in the existing dataframe
    dataframe = dataframe_to_normalize.copy()

    # We need to normalize each column individually
    for column in dataframe.columns:
        dataframe[column] = (dataframe[column] - dataframe[column].min()) / (
            dataframe[column].max() - dataframe[column].min()
        )

    return dataframe


# Apply the min-max normalization
min_max_dataframe = min_max_normalization(
    order_dataframe[["SubTotal", "Freight", "OrderQty"]]
)
min_max_dataframe.head(20)

In [None]:
# Display the minimum, maximum, mean and standard deviation of the min_max_dataframe
print("Minimum:")
print(min_max_dataframe.min())

print("\nMaximum:")
print(min_max_dataframe.max())

print("\nMean:")
print(min_max_dataframe.mean())

print("\nStandard deviation:")
print(min_max_dataframe.std())

In [None]:
# Implement a z-score normalization
def z_score_normalization(dataframe_to_normalize):
    # We need to copy the dataframe_to_normalize to avoid overwriting content in the existing dataframe
    dataframe = dataframe_to_normalize.copy()

    # ...

    return dataframe


# Apply the z_score normalization
z_score_dataframe = z_score_normalization(
    order_dataframe[["SubTotal", "Freight", "OrderQty"]]
)
z_score_dataframe.head(20)

In [None]:
# Implement a z-score normalization
def z_score_normalization(dataframe_to_normalize):
    # We need to copy the dataframe_to_normalize to avoid overwriting content in the existing dataframe
    dataframe = dataframe_to_normalize.copy()

    # We need to normalize each column individually
    for column in dataframe.columns:
        dataframe[column] = (dataframe[column] - dataframe[column].mean()) / dataframe[
            column
        ].std()

    return dataframe


# Apply the z_score normalization
z_score_dataframe = z_score_normalization(
    order_dataframe[["SubTotal", "Freight", "OrderQty"]]
)
z_score_dataframe.head(20)

In [None]:
# Display the minimum, maximum, mean and standard deviation of the z_score_dataframe
print("Minimum:")
print(z_score_dataframe.min())

print("\nMaximum:")
print(z_score_dataframe.max())

print("\nMean:")
print(z_score_dataframe.mean())

print("\nStandard deviation:")
print(z_score_dataframe.std())

In [None]:
# Implement a normalization by decimal scaling
def normalization_by_decimal_scaling(dataframe_to_normalize):
    # We need to copy the dataframe_to_normalize to avoid overwriting content in the existing dataframe
    dataframe = dataframe_to_normalize.copy()

    # ...

    return dataframe


# Apply the normalization_by_decimal_scaling
decimal_scaling_dataframe = normalization_by_decimal_scaling(
    order_dataframe[["SubTotal", "Freight", "OrderQty"]]
)
decimal_scaling_dataframe.head(20)

In [None]:
# Implement a normalization by decimal scaling
def normalization_by_decimal_scaling(dataframe_to_normalize):
    # We need to copy the dataframe_to_normalize to avoid overwriting content in the existing dataframe
    dataframe = dataframe_to_normalize.copy()

    # We need to normalize each column individually
    for column in dataframe.columns:
        # Find k
        k = 0
        while dataframe[column].abs().max() / (10 ** k) >= 1:
            k += 1

        # Compute the normalization of the column
        dataframe[column] = dataframe[column] / (10 ** k)

    return dataframe


# Apply the normalization_by_decimal_scaling
decimal_scaling_dataframe = normalization_by_decimal_scaling(
    order_dataframe[["SubTotal", "Freight", "OrderQty"]]
)
decimal_scaling_dataframe.head(20)

In [None]:
# Display the minimum, maximum, mean and standard deviation of the decimal_scaling_dataframe
print("Minimum:")
print(decimal_scaling_dataframe.min())

print("\nMaximum:")
print(decimal_scaling_dataframe.max())

print("\nMean:")
print(decimal_scaling_dataframe.mean())

print("\nStandard deviation:")
print(decimal_scaling_dataframe.std())

It can be clearly seen that not all normalization methods lead to the same result. It is therefore always important to consider which normalization method best serves your purpose. 

<div class="alert alert-block alert-info">
<b>Task:</b> Consider when the various normalization methods presented might be beneficial.</div>

Write down your solution here:

- <b>Min-max normalization:</b><br />
Min-max normalization is advantageous when values must lie within a fixed interval and that interval should be used as fully as possible. For example, certain deep learning methods require input values in the range [0, 1] in order to avoid incorrect results.

- <b>Z-score normalization:</b><br />
The goal of z-score normalization (also called standardization) is not to bring all values into a fixed range of values. Instead, each attribute is shifted and scaled so that it has a mean of 0 and a standard deviation of 1. This makes attributes with different units and spreads directly comparable.

- <b>Normalization by decimal scaling:</b><br />
Although normalization by decimal scaling assures the user that all output values lie in the value range (-1, 1), it rarely uses this range fully (see example). The advantage compared to min-max normalization is that the values are not divided by an arbitrary divisor, but by a power of ten. Since we humans are used to the decimal system, the connection between value and normalized value is easier to recognize. (Min-max normalization: 19953.6 becomes 1.0. Normalization by decimal scaling: 19953.6 becomes 0.199536.)
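
The differences between the three methods can also be seen on a very small example. The following is only a sketch on hypothetical toy values (they are not taken from the AdventureWorks data); it applies the three formulas to a series with one large outlier.

```python
import pandas as pd

# Hypothetical toy values with one large outlier (not from the exercise data)
values = pd.Series([10.0, 20.0, 30.0, 40.0, 500.0])

# Min-max normalization to the interval [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization (pandas' std() is the sample standard deviation, ddof=1)
z_score = (values - values.mean()) / values.std()

# Decimal scaling: smallest k such that max(|v|) / 10**k < 1
k = 0
while values.abs().max() / 10**k >= 1:
    k += 1
decimal_scaled = values / 10**k

comparison = pd.DataFrame(
    {
        "original": values,
        "min_max": min_max,
        "z_score": z_score,
        "decimal_scaling": decimal_scaled,
    }
)
print(comparison)
```

In this sketch the outlier pushes the remaining min-max values close to 0, while decimal scaling keeps the digits of every value recognizable.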

### Discretization

Another commonly used method is discretization. This is used to convert a continuous attribute into an attribute with discrete values. Of course, this method is best demonstrated on an attribute that is as continuous as possible, which is why we take a look at the new currency_rate_dataframe, which represents the USD to EUR exchange rates.

In [None]:
# Print the head of currency_rate_dataframe
currency_rate_dataframe.head(20)

In [None]:
# Draw the progression of the two rates over time
currency_rate_dataframe.plot(subplots=True)
plt.xticks(rotation=30)

In Knowledge Discovery, with such continuous attributes, it often makes more sense to divide the attribute values into "groups", since you can then work with fewer different values. Graphically, we have also already applied a method in Part One that does just that.

<div class="alert alert-block alert-info">
<b>Task:</b> Draw a histogram with five bins for the AverageRate and EndOfDayRate attributes.</div>

In [None]:
# Draw a histogram for AverageRate and EndOfDayRate

In [None]:
# Draw a histogram for AverageRate and EndOfDayRate
currency_rate_dataframe.hist(bins=5, rwidth=0.8)

Although the histogram analysis divides the individual values into different groups and thus basically leads to a discretization of the data, it is not possible to map the temporal relationship of our data set with it. 

An alternative that preserves the temporal relationship is the binning method. The simplest variant is to divide the range of values into several bins of the same interval size, similar to the histogram analysis (equal-width partitioning). This is possible in Pandas via the cut function.

<div class="alert alert-block alert-info">
<b>Task:</b> Use cut() to distribute the attribute values of AverageRate into five bins with equal width. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.cut.html">Pandas documentation</a>)</div>

In [None]:
# Distribute the attribute values of AverageRate into bins with equal width

In [None]:
# Distribute the attribute values of AverageRate into bins with equal width
pd.cut(currency_rate_dataframe["AverageRate"], bins=5)

<div class="alert alert-block alert-info">
<b>Task:</b> Find out how to interpret the interval notation used in the new attribute values and what boundaries each of the five bins has. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.cut.html">Pandas documentation</a>)</div>

Write down your solution here:

<u>Interval notation:</u>

The interval notation used in the new attribute values can be read as "lower limit, upper limit", where the brackets indicate whether the limit is to be understood as inclusive ("[" or "]") or exclusive ("(" or ")"), i.e. whether the limit is part of the interval (inclusive) or whether the interval begins just after or ends just before the limit (exclusive).

<u>Boundaries:</u>

- <b>Bin 1:</b> From 0.961 (excl.) to 1.01 (incl.)
- <b>Bin 2:</b> From 1.01 (excl.) to 1.06 (incl.)
- <b>Bin 3:</b> From 1.06 (excl.) to 1.109 (incl.)
- <b>Bin 4:</b> From 1.109 (excl.) to 1.158 (incl.)
- <b>Bin 5:</b> From 1.158 (excl.) to 1.208 (incl.)
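
The notation can also be inspected programmatically. The following sketch uses hypothetical toy values (not the exchange rates) and only two bins; it reads the first interval created by cut() and checks which of its limits belong to it.

```python
import pandas as pd

# Hypothetical toy values (not the exchange rates from the exercise)
rates = pd.Series([1.00, 1.02, 1.05, 1.10, 1.20])

# retbins=True additionally returns the computed bin edges
binned, edges = pd.cut(rates, bins=2, retbins=True)

# The categories of the result are pandas Interval objects
first_interval = binned.cat.categories[0]
print(first_interval)                           # printed in the form "(lower, upper]"
print(first_interval.closed)                    # which side is inclusive
print(first_interval.right in first_interval)   # is the upper limit part of the bin?
print(first_interval.left in first_interval)    # is the lower limit part of the bin?
print(edges)
```

According to the Pandas documentation, cut() extends the range by 0.1% on each side when an integer number of bins is given. This is why the lowest edge lies slightly below the smallest value, which would otherwise be excluded by the exclusive lower limit.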

Besides equal-width partitioning, Pandas also supports equal-depth partitioning. Here the bins are not chosen so that each interval has the same size, but so that each bin contains approximately the same number of values. This is done with the qcut function.
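
The difference between both partitioning strategies becomes visible on skewed data. The following sketch uses hypothetical toy values and counts how many values end up in each bin.

```python
import pandas as pd

# Hypothetical, deliberately skewed toy values
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: both intervals cover the same share of the value range
equal_width = pd.cut(values, bins=2)

# Equal-depth: both intervals contain (approximately) the same number of values
equal_depth = pd.qcut(values, q=2)

width_counts = equal_width.value_counts().sort_index()
depth_counts = equal_depth.value_counts().sort_index()

print(width_counts)
print(depth_counts)
```

With equal width, the single large value occupies a bin of its own, whereas equal depth splits the values evenly at the median.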

<div class="alert alert-block alert-info">
<b>Task:</b> Use qcut() to distribute the attribute values of AverageRate into five bins with equal depth. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.qcut.html">Pandas documentation</a>)</div>

In [None]:
# Distribute the attribute values of AverageRate into bins with equal depth

In [None]:
# Distribute the attribute values of AverageRate into bins with equal depth
pd.qcut(currency_rate_dataframe["AverageRate"], 5)

# Note: In this case you have to set the number of quantiles, not the number of bins

The disadvantage of this pure partitioning is that the attribute values are no longer purely numerical attributes and thus further processing is only possible to a limited extent. Therefore, one or two representative values are often selected for each bin, to which all values within the bin are smoothed.

In the lecture the variant smoothing by bin means was presented, which we will now have a look at here. 
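
To make the idea concrete, here is a minimal sketch on a single hypothetical series. It is only one possible formulation, based on groupby() and transform(), and not necessarily the approach intended by the exercise.

```python
import pandas as pd

# Hypothetical toy values (not the exchange rates from the exercise)
prices = pd.Series([4.0, 8.0, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0, 34.0])

# Assign each value to one of three equal-width bins (integer labels 0, 1, 2)
bin_ids = pd.cut(prices, bins=3, labels=False)

# Replace every value by the mean of the bin it belongs to
smoothed = prices.groupby(bin_ids).transform("mean")

print(pd.DataFrame({"original": prices, "bin": bin_ids, "smoothed": smoothed}))
```

The order of the values is kept, so a temporal index would survive the smoothing unchanged.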

<div class="alert alert-block alert-info">
<b>Task:</b> Implement a method for smoothing by bin means with equal width partitioning by completing the following program skeleton. You may of course use cut(), but you don't have to. </div>

In [None]:
# Implement a method for smoothing by bin means with equal width partitioning
def smoothing_by_bin_means_with_equal_width_part(dataframe_to_smooth, bins=10):
    # We need to copy the dataframe_to_smooth to avoid overwriting content in the existing dataframe
    dataframe = dataframe_to_smooth.copy()

    # Smooth every column
    for column in dataframe.columns:

        # ...
        continue

    return dataframe


# Apply the smoothing_by_bin_means
currency_rate_smoothed_dataframe = smoothing_by_bin_means_with_equal_width_part(
    currency_rate_dataframe, 5
)
currency_rate_smoothed_dataframe.head(20)

In [None]:
# Implement a method for smoothing by bin means with equal width partitioning
def smoothing_by_bin_means_with_equal_width_part(dataframe_to_smooth, bins=10):
    # We need to copy the dataframe_to_smooth to avoid overwriting content in the existing dataframe
    dataframe = dataframe_to_smooth.copy()

    # Smooth every column
    for column in dataframe.columns:
        # Calculate the bin affiliations of this column with cut()
        bin_affiliations = pd.cut(dataframe_to_smooth[column], bins, labels=False)

        # For each bin
        for bin_index in range(bins):
            # Select the rows whose value in this column falls into the bin
            in_bin = bin_affiliations == bin_index

            # Set these values of the column to the mean of the bin
            dataframe.loc[in_bin, column] = dataframe_to_smooth.loc[
                in_bin, column
            ].mean()

    return dataframe


# Apply the smoothing_by_bin_means
currency_rate_smoothed_dataframe = smoothing_by_bin_means_with_equal_width_part(
    currency_rate_dataframe, 5
)
currency_rate_smoothed_dataframe.head(20)

In [None]:
# Draw the progression of the two rates over time
currency_rate_smoothed_dataframe.plot(subplots=True)
plt.xticks(rotation=30)

In addition to histogram analysis and binning, there are other methods of discretization. One of them - clustering - will be covered later in the semester.

### Data reduction

Another important part of preprocessing can be to reduce the amount of data to be analyzed. One focus in the lecture was Principal Component Analysis, which we now want to look at in practice.

The subject of analysis is still the currency_rate_dataframe, where the AverageRate and the EndOfDayRate are visibly almost identical. It is therefore to be expected that a large part of the redundancy can be eliminated by the PCA.

In order to gain a deeper understanding of PCA, you will first apply the steps presented in the lecture one by one yourself before resorting to an encapsulating function.

<div class="alert alert-block alert-info">
<b>Task:</b> Standardize the currency_rate_dataframe to ensure that all attributes are included in the analysis to the same extent. (Hint: You might want to use one of your previously defined functions) </div>

In [None]:
# Standardize the currency_rate_dataframe

In [None]:
# Standardize the currency_rate_dataframe
standardized_currency_rate_dataframe = z_score_normalization(currency_rate_dataframe)

# Although not part of the question, it is usually helpful to display the result of the calculation
standardized_currency_rate_dataframe.head(20)

<div class="alert alert-block alert-info">
<b>Task:</b> Calculate the covariance matrix for the standardized dataframe. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cov.html">Pandas documentation</a>) </div>

In [None]:
# Calculate the covariance matrix

In [None]:
# Calculate the covariance matrix
covariance_matrix = standardized_currency_rate_dataframe.cov()

# Display the covariance matrix
covariance_matrix

<div class="alert alert-block alert-info">
<b>Task:</b> Calculate the associated eigenvalues and eigenvectors. (Help: <a href="https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html">Numpy documentation</a>) </div>

In [None]:
# Calculate the associated eigenvalues and eigenvectors

In [None]:
# Calculate the associated eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

# Display the eigenvalues and eigenvectors
print("Eigenvalues:")
print(eigenvalues)

print("\nEigenvectors:")
print(eigenvectors)

<div class="alert alert-block alert-info">
<b>Task:</b> Calculate the percentage of information per eigenvector. </div>

In [None]:
# Calculate the percentage of information per eigenvector

In [None]:
# Calculate the percentage of information per eigenvector
relative_information_share = eigenvalues / np.sum(eigenvalues)

# Print the information
print(
    f"First eigenvector: approx. {relative_information_share[0] * 100:.0f}% "
    f"(Exact: {relative_information_share[0]})"
)
print(
    f"Second eigenvector: approx. {relative_information_share[1] * 100:.0f}% "
    f"(Exact: {relative_information_share[1]})"
)

<div class="alert alert-block alert-info">
<b>Task:</b> Select the feature matrix so that the transformation preserves at least 80% of the information contained in the standardized currency_rate_dataframe. </div>

In [None]:
# Select the feature matrix

In [None]:
# Select the feature matrix
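# np.linalg.eig returns the eigenvectors as columns and does not sort them by eigenvalue
# The eigenvector with the largest eigenvalue contains nearly all information => select only that one
feature_matrix = eigenvectors[:, np.argmax(eigenvalues)]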

# Print the feature_matrix
print(feature_matrix)

<div class="alert alert-block alert-info">
<b>Task:</b> Perform the transformation of the standardized data frame using the feature matrix and display the result. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dot.html">Pandas documentation</a>) </div>

In [None]:
# Perform the transformation and display the result

In [None]:
# Perform the transformation
transformated_currency_rate_dataframe = pd.DataFrame(
    data=standardized_currency_rate_dataframe.dot(feature_matrix)
)

# Display the transformed dataframe
transformated_currency_rate_dataframe.head(20)

In this case, PCA allowed the two original attributes to be merged into one without any significant loss of information. Of course, it is very cumbersome to execute PCA step by step manually each time, which is why PCA is also included in some ML frameworks, such as scikit-learn.

It is very important to know exactly what the framework does for you and what it does not. For example, the PCA within scikit-learn does not include standardization, although feature scaling <a href="https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html">is strongly recommended</a> by the framework itself as a preceding step.
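
Why the preceding standardization matters can be illustrated without any framework. The following sketch does not call scikit-learn; it uses a plain NumPy eigendecomposition on hypothetical random data, and the helper first_component is an assumption made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two strongly correlated features on very different scales
base = rng.normal(size=500)
small = base + 0.1 * rng.normal(size=500)
large = 1000.0 * (base + 0.1 * rng.normal(size=500))
data = np.column_stack([small, large])


def first_component(matrix):
    """Return the unit eigenvector of the covariance matrix with the largest eigenvalue."""
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(matrix, rowvar=False))
    return eigenvectors[:, -1]


# Without standardization the large-scale feature dominates the component
raw_component = first_component(data)

# With z-score standardization both features contribute equally
standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
standardized_component = first_component(standardized)

print("Without standardization:", np.abs(raw_component))
print("With standardization:   ", np.abs(standardized_component))
```

Without standardization the first component points almost entirely along the feature with the larger scale, even though both features carry nearly the same information.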

<div class="alert alert-block alert-info">
<b>Task:</b> Use the PCA from scikit-learn to transform the standardized currency_rate_dataframe a second time. This time you may assume that only one component is expected as a result. Display the result. (Help: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html">scikit-learn documentation</a>) </div>

In [None]:
# Use the PCA from scikit learn

In [None]:
# Create a PCA object with sklearn
pca = sklearn.decomposition.PCA(n_components=1)

# Compute the principal components for the standardized_currency_rate_dataframe
transformated_currency_rate_dataframe = pd.DataFrame(
    data=pca.fit_transform(standardized_currency_rate_dataframe),
    index=standardized_currency_rate_dataframe.index,
)

# Display the transformed dataframe
transformated_currency_rate_dataframe.head(20)

<div class="alert alert-block alert-info">
<b>Task:</b> The results of the manual PCA and the PCA from scikit-learn may differ. Consider why both results can be correct, even if they differ. </div>

Write down your solution here:

In PCA, an important step is the determination of the eigenvalues and eigenvectors. Eigenvectors are not unique: if v is an eigenvector, then every nonzero multiple of v is one as well, in particular -v. Even after normalization to unit length the sign remains arbitrary, and different implementations may choose it differently. The component in the transformed data set is then simply mirrored (multiplied by -1). Both results describe the same axis and preserve the same information.
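
This sign ambiguity can be checked with a small sketch. The covariance matrix and the samples below are hypothetical values chosen for illustration, not results from the exercise.

```python
import numpy as np

# Hypothetical covariance matrix of two standardized, strongly correlated attributes
covariance = np.array([[1.0, 0.9], [0.9, 1.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
largest_eigenvalue = eigenvalues[-1]
principal = eigenvectors[:, -1]
mirrored = -principal

# Both vectors satisfy the eigenvector equation C v = lambda v
principal_is_valid = np.allclose(covariance @ principal, largest_eigenvalue * principal)
mirrored_is_valid = np.allclose(covariance @ mirrored, largest_eigenvalue * mirrored)
print(principal_is_valid, mirrored_is_valid)

# Projecting the same (hypothetical) samples yields mirrored results
samples = np.array([[1.0, 1.2], [-0.5, -0.4], [0.3, 0.1]])
projection = samples @ principal
mirrored_projection = samples @ mirrored
print(projection)
print(mirrored_projection)
```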