# 2. Data Analysis & Preprocessing

This Jupyter notebook is part of an exercise series titled *Data Analysis & Preprocessing*.\
The series itself includes practical exercises for lectures *3. Getting to Know Your Data* and *4. Data Preprocessing*. 

This exercise series is divided into three parts. There will be one exercise session per part (= one part per week):

- **2.1.** [Getting to Know Your Data](./2.1-Getting-to-Know-Your-Data.ipynb) (*notebook of the week before last*)
- **2.2.** [Preprocessing - Data Cleaning & Data Integration](./2.2-Preprocessing-Data-Cleaning-and-Integration.ipynb) (*last week's notebook*)
- **2.3.** Preprocessing - Data Reduction, Data Transformation & Data Discretization (*this notebook*)
    - **2.3.1.** [Normalization](#2.3.1.-Normalization)
    - **2.3.2.** [Discretization](#2.3.2.-Discretization)
    - **2.3.3.** [Data Reduction](#2.3.3.-Data-Reduction)

<div class="alert alert-block alert-warning">

**Important:**
    
Work on the respective part yourself **BEFORE** each exercise session. The exercise session is **NOT** intended to take a first look at the exercise sheet, but to solve problems students had while preparing the exercise sheet beforehand.
    
</div>

## 2.3. Preprocessing - Data Reduction, Data Transformation & Data Discretization

In this part you will apply the theoretical knowledge gained in the second part of the lecture *4. Data Preprocessing*.

In [None]:
# Import the required libraries
import tempfile
import sqlite3
import os
import urllib.request
import sklearn.decomposition
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Build path to database
database_path = os.path.join(dataset_folder, "adventure-works.db")

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    database_path,
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(database_path)

In [None]:
# Create the clean DataFrame(s)
# Order DataFrame
order_df = pd.read_sql_query(
    "SELECT p.ProductID,p.Name,p.ProductNumber,p.MakeFlag,p.FinishedGoodsFlag,p.Color,p.SafetyStockLevel,"
    "p.ReorderPoint,p.StandardCost,p.ListPrice,p.Size,p.SizeUnitMeasureCode,p.WeightUnitMeasureCode,p.Weight,"
    "p.DaysToManufacture,p.ProductLine,p.Class,p.Style,p.ProductSubcategoryID,p.ProductModelID,p.SellStartDate,"
    "p.SellEndDate,p.DiscontinuedDate,d.PurchaseOrderID,d.PurchaseOrderDetailID,d.DueDate,d.OrderQty,d.ProductID,"
    "d.UnitPrice,d.ReceivedQty,d.RejectedQty,h.RevisionNumber,h.Status,h.EmployeeID,h.VendorID,h.ShipMethodID,"
    "h.OrderDate,h.ShipDate,h.SubTotal,h.TaxAmt,h.Freight,h.TotalDue,e.NationalIDNumber,e.LoginID,e.OrganizationNode,"
    "e.JobTitle,e.BirthDate,e.MaritalStatus,e.Gender,e.HireDate,e.SalariedFlag,e.VacationHours,e.SickLeaveHours,"
    "e.CurrentFlag,r.PersonType,r.NameStyle,r.Title,r.FirstName,r.MiddleName,r.LastName,r.Suffix,r.EmailPromotion,"
    "r.AdditionalContactInfo,r.Demographics "
    "FROM Product p "
    "JOIN PurchaseOrderDetail d ON p.ProductID = d.ProductID "
    "JOIN PurchaseOrderHeader h ON d.PurchaseOrderID = h.PurchaseOrderID "
    "JOIN Employee e ON h.EmployeeID = e.BusinessEntityID "
    "JOIN Person r ON e.BusinessEntityID = r.BusinessEntityID",
    connection,
    index_col="PurchaseOrderDetailID",
)

# CurrencyRate DataFrame
currency_rate_df = pd.read_sql_query(
    "SELECT STRFTIME('%Y-%m-%d', CurrencyRateDate) AS CurrencyRateDate,AverageRate,EndOfDayRate "
    "FROM CurrencyRate "
    "WHERE FromCurrencyCode='USD' AND ToCurrencyCode='EUR'",
    connection,
    index_col="CurrencyRateDate",
)

### 2.3.1. Normalization

One method introduced in the lecture and frequently used in Data Science is normalization. In order to apply this practically, we will first take a look at a part of the `order_df` already known from Part One. More precisely, we are looking at some numeric attributes from `order_df`.

<div class="alert alert-block alert-info">

**Task 1:**
    
Display the head of the attributes `SubTotal`, `Freight` and `OrderQty` from `order_df`.
</div>

In [None]:
# Display the head of SubTotal, Freight and OrderQty

In [None]:
# Display the head of SubTotal, Freight and OrderQty
order_df[["SubTotal", "Freight", "OrderQty"]].head(20)

<div class="alert alert-block alert-info">

**Task 2:**
    
Display the minimum, maximum, mean, and standard deviation of `SubTotal`, `Freight` and `OrderQty`.
</div>

In [None]:
# Display the minimum, maximum, mean, and standard deviation of SubTotal, Freight and OrderQty

In [None]:
# Display the minimum, maximum, mean, and standard deviation of SubTotal, Freight and OrderQty
order_df[["SubTotal", "Freight", "OrderQty"]].agg(["min", "max", "mean", "std"])

As can be seen, the three attributes differ significantly in the distribution of their values, which is a hindrance for some knowledge discovery tasks. To alleviate this problem, normalization is employed to scale the attribute values to a smaller range.

In the lecture you were introduced to three different variants of normalization: min-max normalization, z-score normalization, and normalization by decimal scaling.
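For reference, the three variants are commonly written as follows. This is a sketch in standard textbook notation, which may differ from the notation on the lecture slides. It assumes that $v$ is an original value of an attribute $A$, $v'$ is its normalized value, and $[a, b]$ is the target interval of min-max normalization; for $[0, 1]$ the min-max formula reduces to the fraction alone.

$$
v'_{\text{min-max}} = \frac{v - \min_A}{\max_A - \min_A} \cdot (b - a) + a
$$

$$
v'_{\text{z-score}} = \frac{v - \bar{A}}{\sigma_A}
$$

$$
v'_{\text{decimal}} = \frac{v}{10^{j}}, \quad \text{where } j \text{ is the smallest integer such that } \max(|v'|) < 1
$$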

Below you can see the implementation of one of the normalization methods for the attributes `SubTotal`, `Freight` and `OrderQty`:

In [None]:
# Compute a "mystery" normalization in a "Pythonic" way
def mystery_normalization(df):
    # Compute the normalization with a simple formula
    return (df - df.mean()) / df.std()


# Apply previously defined function
mystery_normalization_df = mystery_normalization(
    order_df[["SubTotal", "Freight", "OrderQty"]]
)
mystery_normalization_df.head(20)

<div class="alert alert-block alert-info">

**Task 3:** 
    
Determine whether the above function `mystery_normalization` is an implementation of min-max normalization, z-score normalization, or normalization by decimal scaling.
</div>

**The function is an implementation of:**
1. [ ] Min-max normalization (for the interval [0, 1])
2. [ ] Z-score normalization
3. [ ] Normalization by decimal scaling

**The function is an implementation of:**
1. [ ] Min-max normalization (for the interval [0, 1])
2. [X] Z-score normalization
3. [ ] Normalization by decimal scaling

<div class="alert alert-block alert-info">

**Task 4:**
    
Implement a function for each of the three normalization methods you got to know. (You may, of course, reuse the above code when you work on the corresponding function).
</div>

In [None]:
# Implement min-max normalization for the interval [0, 1]
def min_max_normalization(df):
    # ...
    return df


# Apply min-max normalization
min_max_df = min_max_normalization(order_df[["SubTotal", "Freight", "OrderQty"]])
min_max_df.head(20)

In [None]:
# Sample solution 1: Min-max normalization for the interval [0, 1] with for-loop
def min_max_normalization(df):
    # Create a copy to avoid overriding original content
    normalized = df.copy()

    # Normalize each column individually
    for column in normalized.columns:
        normalized[column] = (df[column] - df[column].min()) / (
            df[column].max() - df[column].min()
        )

    return normalized


# Apply min-max normalization
min_max_df = min_max_normalization(order_df[["SubTotal", "Freight", "OrderQty"]])
min_max_df.head(20)

In [None]:
# Sample solution 2: "Pythonic" min-max normalization for the interval [0, 1]
def min_max_normalization(df):
    # Compute the min-max normalization with a simple formula
    return (df - df.min()) / (df.max() - df.min())


# Apply min-max normalization
min_max_df = min_max_normalization(order_df[["SubTotal", "Freight", "OrderQty"]])
min_max_df.head(20)

In [None]:
# Display the minimum, maximum, mean, and standard deviation values of min_max_df
min_max_df.agg(["min", "max", "mean", "std"])

In [None]:
# Implement z-score normalization
def z_score_normalization(df):
    # ...
    return df


# Apply z-score normalization
z_score_df = z_score_normalization(order_df[["SubTotal", "Freight", "OrderQty"]])
z_score_df.head(20)

In [None]:
# Sample solution 1: Z-score normalization with for-loop
def z_score_normalization(df):
    # Create a copy to avoid overriding original content
    normalized = df.copy()

    # Normalize each column individually
    for column in normalized.columns:
        normalized[column] = (
            normalized[column] - normalized[column].mean()
        ) / normalized[column].std()

    return normalized


# Apply z-score normalization
z_score_df = z_score_normalization(order_df[["SubTotal", "Freight", "OrderQty"]])
z_score_df.head(20)

In [None]:
# Sample solution 2: "Pythonic" z-score normalization
def z_score_normalization(df):
    # Compute the z-score normalization with a simple formula
    return (df - df.mean()) / df.std()


# Apply z-score normalization
z_score_df = z_score_normalization(order_df[["SubTotal", "Freight", "OrderQty"]])
z_score_df.head(20)

In [None]:
# Display the minimum, maximum, mean and standard deviation values of z_score_df
z_score_df.agg(["min", "max", "mean", "std"])

In [None]:
# Implement normalization by decimal scaling
def normalization_by_decimal_scaling(df):
    # ...
    return df


# Apply normalization_by_decimal_scaling
decimal_scaling_df = normalization_by_decimal_scaling(
    order_df[["SubTotal", "Freight", "OrderQty"]]
)
decimal_scaling_df.head(20)

In [None]:
# Sample solution 1: Normalization by decimal scaling with for-loop
def normalization_by_decimal_scaling(df):
    # Create a copy to avoid overriding original content
    normalized = df.copy()

    # Normalize each column individually
    for column in normalized.columns:
        # Find k
        k = 0
        while normalized[column].abs().max() / (10**k) >= 1:
            k += 1

        # Compute normalization of the column
        normalized[column] = normalized[column] / (10**k)

    return normalized


# Apply normalization_by_decimal_scaling
decimal_scaling_df = normalization_by_decimal_scaling(
    order_df[["SubTotal", "Freight", "OrderQty"]]
)
decimal_scaling_df.head(20)

In [None]:
# Sample solution 2: "Pythonic" normalization by decimal scaling method
def normalization_by_decimal_scaling(df):
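    # Divide by 10^k, where k is the smallest integer such that max(|v|) / 10^k < 1
    # (floor + 1 instead of ceil, so that a maximum of exactly 10^k is also scaled below 1)
    return df / 10 ** (np.floor(np.log10(df.abs().max())) + 1)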


# Apply normalization_by_decimal_scaling
decimal_scaling_df = normalization_by_decimal_scaling(
    order_df[["SubTotal", "Freight", "OrderQty"]]
)
decimal_scaling_df.head(20)

In [None]:
# Display the minimum, maximum, mean and standard deviation values of decimal_scaling_df
decimal_scaling_df.agg(["min", "max", "mean", "std"])

Note that each normalization results in different scaled values. It is therefore important to consider which normalization method best serves your purpose. 
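The following minimal sketch illustrates this on a small toy `DataFrame`. The values and the names `toy`, `toy_min_max`, `toy_z_score` and `toy_decimal` are assumptions made for illustration only and are not taken from `order_df`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from order_df)
toy = pd.DataFrame({"a": [2.0, 4.0, 6.0, 8.0], "b": [-150.0, 0.0, 50.0, 300.0]})

# Min-max normalization for the interval [0, 1]
toy_min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Z-score normalization (pandas computes the sample standard deviation, ddof=1)
toy_z_score = (toy - toy.mean()) / toy.std()

# Decimal scaling: divide by the smallest power of ten that brings max(|v|) below 1
k = np.floor(np.log10(toy.abs().max())) + 1
toy_decimal = toy / 10**k

# Compare the value ranges produced by the three methods
print(toy_min_max.agg(["min", "max"]))
print(toy_z_score.agg(["mean", "std"]).round(6))
print(toy_decimal.abs().max())
```

In this sketch, min-max normalization uses the full interval [0, 1], z-score normalization centers every column at 0 with a standard deviation of 1, and decimal scaling only guarantees absolute values below 1.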

<div class="alert alert-block alert-info">

**Task 5:**
    
Consider when the various normalization methods presented might be beneficial.
</div>

Write down your solution here:

- **Min-max normalization:**

Min-max normalization is advantageous when all values must be guaranteed to lie in a fixed interval and this interval should be used as fully as possible. For example, in certain deep learning methods it is essential that values lie in the range [0, 1] in order to avoid incorrect results.

- **Z-score normalization:**

The goal of z-score normalization (also called standardization) is not to bring all values into a fixed range. Instead, the attributes are aligned in a different way: after z-score normalization, every attribute has a mean of 0 and a standard deviation of 1.

- **Normalization by decimal scaling:**

Although normalization by decimal scaling guarantees that all output values lie in the range (-1, 1), it rarely uses this range fully (see example). The advantage compared to min-max normalization is that the values are not divided by an arbitrary divisor, but by a power of ten. Since we humans are used to the decimal system, the connection between a value and its normalized value is easier to recognize. (Min-max normalization: 19953.6 becomes 1.0 - Normalization by decimal scaling: 19953.6 becomes 0.199536)

### 2.3.2. Discretization

Another commonly used method is discretization. It is used to convert continuous values to discrete values. This method is best demonstrated on a continuous attribute, which is why we take a look at the `currency_rate_df`, which contains exchange rates from USD to EUR.

In [None]:
# Print the head of currency_rate_df
currency_rate_df.head(20)

In [None]:
# Draw the progression of the two rates over time
currency_rate_df.plot(subplots=True)
plt.xticks(rotation=30)

Sometimes it is desirable to divide attribute values into *groups*, resulting in fewer values that are easier to handle. Graphically, we have already applied a method in Part One that does just that: the histogram.

<div class="alert alert-block alert-info">

**Task 6:**
    
Draw a histogram with five bins for the attributes `AverageRate` and `EndOfDayRate`.
</div>

In [None]:
# Draw a histogram for AverageRate and EndOfDayRate

In [None]:
# Draw a histogram for AverageRate and EndOfDayRate
currency_rate_df.hist(bins=5, rwidth=0.8)

Although histogram analysis divides individual values into different groups and thus discretizes data, it completely ignores the temporal aspect. More specifically, both attributes `AverageRate` and `EndOfDayRate` constitute time series, i.e., the data is indexed by time, and a histogram of the data values ignores this time dependency.

Alternatively, the data can be binned, and thus discretized, while maintaining its temporal nature by assigning every data value to one of several equal-width bins. In pandas, this is possible via the function `cut`.

<div class="alert alert-block alert-info">

**Task 7:**
    
Use `cut()` to distribute the attribute values of `AverageRate` into five bins with equal width. (Help: [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.cut.html))
</div>

In [None]:
# Distribute the attribute values of AverageRate into bins with equal width

In [None]:
# Distribute the attribute values of AverageRate into bins with equal width
pd.cut(currency_rate_df["AverageRate"], bins=5)

<div class="alert alert-block alert-info">

**Task 8:**
    
Find out how to interpret the interval notation used in the new attribute values and what boundaries each of the five bins has. 

</div>

Write down your solution here:

**Interval notation:** Interval notation can be read as "lower limit, upper limit", where the brackets indicate whether the limit itself is included ("[" or "]") or excluded ("(" or ")") in the interval.

**Boundaries:**

- **Bin 1:** From 0.961 (excl.) to 1.01 (incl.)
- **Bin 2:** From 1.01 (excl.) to 1.06 (incl.)
- **Bin 3:** From 1.06 (excl.) to 1.109 (incl.)
- **Bin 4:** From 1.109 (excl.) to 1.158 (incl.)
- **Bin 5:** From 1.158 (excl.) to 1.208 (incl.)
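The notation can also be checked programmatically. The following minimal sketch builds a single `pd.Interval` by hand; the name `first_bin` and the hard-coded limits are assumptions that simply mirror the first bin listed above.

```python
import pandas as pd

# Hypothetical interval mirroring the notation "(0.961, 1.01]"
first_bin = pd.Interval(left=0.961, right=1.01, closed="right")

# The string representation uses the same bracket notation as the output of cut()
print(first_bin)

# The lower limit is excluded, the upper limit is included
print(0.961 in first_bin)
print(1.01 in first_bin)
```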

Besides equal-width partitioning, pandas also supports equal-depth partitioning, where each bin contains approximately the same *number of values*. This is done with the function `qcut`.
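The difference between both partitioning strategies is easiest to see on skewed data. The following minimal sketch uses a hypothetical toy `Series` (the name `toy_values` and its values are assumptions, not part of the exercise data) and counts how many values end up in each bin.

```python
import pandas as pd

# Skewed toy data (hypothetical values): most values are small, one is large
toy_values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: two bins that each span the same range of values
equal_width_counts = pd.cut(toy_values, bins=2).value_counts(sort=False)

# Equal-depth: two bins that each hold roughly the same number of values
equal_depth_counts = pd.qcut(toy_values, q=2).value_counts(sort=False)

print(equal_width_counts)
print(equal_depth_counts)
```

With equal-width bins, the single large value ends up alone in its bin, whereas equal-depth bins split the values evenly.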

<div class="alert alert-block alert-info">

**Task 9:**
    
Use `qcut()` to arrange the attribute values of `AverageRate` into five bins with equal depth. (Help: [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.qcut.html))
</div>

In [None]:
# Distribute attribute values of AverageRate into bins with equal depth

In [None]:
# Distribute attribute values of AverageRate into bins with equal depth
pd.qcut(currency_rate_df["AverageRate"], q=5)

# Note: Parameter q specifies the number of quantiles; q=5 (quintiles) therefore yields five bins.

The disadvantage of this "pure" partitioning is that the attribute values are no longer purely numerical attributes and thus further processing is only possible to a limited extent. Therefore, one or two representative values are often selected for each bin, to which all values within the bin are smoothed.

In the lecture, smoothing by bin means was presented as one such variant, which we will now take a look at.
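To illustrate the idea independently of the exercise data, the following minimal sketch smooths a small toy `Series` by bin means. The values and the names `toy_values`, `bin_labels` and `toy_smoothed` are assumptions, and the `groupby`/`transform` approach is only one of several possible ways to implement the method.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Assign every value to one of three equal-width bins (integer labels 0, 1, 2)
bin_labels = pd.cut(toy_values, bins=3, labels=False)

# Replace every value by the mean of the bin it belongs to
toy_smoothed = toy_values.groupby(bin_labels).transform("mean")

print(pd.DataFrame({"value": toy_values, "bin": bin_labels, "smoothed": toy_smoothed}))
```

After smoothing, the data is numerical again, but it only contains as many distinct values as there are non-empty bins.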

<div class="alert alert-block alert-info">

**Task 10:**
    
Implement a method for smoothing by bin means with equal width partitioning by completing the following program skeleton. You may use `cut()`, but you don't have to. 
</div>

In [None]:
# Implement a method for smoothing by bin means with equal width partitioning
def smoothing_by_bin_means_with_equal_width_part(df, bins=10):
    # Create a copy to avoid overriding original content
    smoothed = df.copy()

    # Smooth every column
    for column in smoothed.columns:
        # ...
        continue

    return smoothed


# Apply smoothing_by_bin_means
currency_rate_smoothed_df = smoothing_by_bin_means_with_equal_width_part(
    currency_rate_df, 5
)
currency_rate_smoothed_df.head(20)

In [None]:
# Implement a method for smoothing by bin means with equal width partitioning
def smoothing_by_bin_means_with_equal_width_part(dataframe_to_smooth, bins=10):
    # Create a copy to avoid overriding original content
    smoothed = dataframe_to_smooth.copy()

    # Smooth every column
    for column in smoothed.columns:
        # Calculate the bin affiliations with cut()
        bin_affiliations = pd.cut(dataframe_to_smooth[column], bins, labels=False)

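        # For each bin
        for bin_index in range(bins):
            # Select the rows whose value in this column falls into the current bin
            in_bin = bin_affiliations == bin_index

            # Set only this column's values to the mean of their bin
            smoothed.loc[in_bin, column] = dataframe_to_smooth.loc[in_bin, column].mean()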

    return smoothed


# Apply smoothing_by_bin_means
currency_rate_smoothed_df = smoothing_by_bin_means_with_equal_width_part(
    currency_rate_df, 5
)
currency_rate_smoothed_df.head(20)

In [None]:
# Draw the progression of the two rates over time
currency_rate_smoothed_df.plot(subplots=True)
plt.xticks(rotation=30)

In addition to histogram analysis and binning, there are other methods of discretization. One of them - clustering - will be covered later in the semester.

### 2.3.3. Data Reduction

Another important part of data preprocessing is to reduce the dimensionality of data. One focus in the lecture was Principal Component Analysis (PCA), which we now utilize.

The subject of analysis is still the `currency_rate_df`, where `AverageRate` and `EndOfDayRate` are visibly almost identical. It is, therefore, expected that a large part of the redundancy can be eliminated by PCA.

In order to develop a deeper understanding of PCA, you will first apply the steps presented in the lecture before resorting to a library function.

<div class="alert alert-block alert-info">

**Task 11:**
    
Standardize the `currency_rate_df` to ensure that all attributes are included in the analysis to the same extent. (Hint: You may use one of your previously defined functions) 
</div>

In [None]:
# Standardize currency_rate_df

In [None]:
# Standardize currency_rate_df
standardized_currency_rate_df = z_score_normalization(currency_rate_df)

# Although not part of the question, it is usually helpful to display the result of the calculation
standardized_currency_rate_df.head(20)

<div class="alert alert-block alert-info">

**Task 12:**
    
Calculate the covariance matrix for the standardized `DataFrame`. (Help: [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cov.html))
</div>

In [None]:
# Calculate covariance matrix

In [None]:
# Calculate covariance matrix
covariance_matrix = standardized_currency_rate_df.cov()

# Display the covariance matrix
covariance_matrix

<div class="alert alert-block alert-info">

**Task 13:**
    
Calculate the associated eigenvalues and eigenvectors. (Help: [NumPy documentation](https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html)) 
</div>

In [None]:
# Calculate the associated eigenvalues and eigenvectors

In [None]:
# Calculate the associated eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

# Display the eigenvalues and eigenvectors
print("Eigenvalues:")
print(eigenvalues)

print("\nEigenvectors:")
print(eigenvectors)

<div class="alert alert-block alert-info">

**Task 14:**
    
Calculate the percentage of information per eigenvector.
</div>

In [None]:
# Calculate the percentage of information per eigenvector

In [None]:
# Calculate the percentage of information per eigenvector
relative_information_share = eigenvalues / np.sum(eigenvalues)

# Print the information
print(
    f"First eigenvector: approx. {relative_information_share[0] * 100:.0f}% "
    f"(Exact: {relative_information_share[0]})"
)
print(
    f"Second eigenvector: approx. {relative_information_share[1] * 100:.0f}% "
    f"(Exact: {relative_information_share[1]})"
)

<div class="alert alert-block alert-info">

**Task 15:**
    
Select the feature matrix so that the transformation preserves at least 80% of the information contained in the standardized `currency_rate_df`. 
</div>

In [None]:
# Select the feature matrix

In [None]:
# Select the feature matrix
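# np.linalg.eig returns the eigenvectors as columns and does not sort them,
# so select the column that belongs to the largest eigenvalue
# (this eigenvector contains nearly all information => select only that one)
feature_matrix = eigenvectors[:, np.argmax(eigenvalues)]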

# Print the feature_matrix
feature_matrix

<div class="alert alert-block alert-info">

**Task 16:**
    
Perform the transformation of the standardized data frame using the feature matrix and display the result. (Help: [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dot.html)) 
</div>

In [None]:
# Perform the transformation and display the result

In [None]:
# Perform the transformation
transformated_currency_rate_df = pd.DataFrame(
    data=standardized_currency_rate_df.dot(feature_matrix)
)

# Display the transformed DataFrame
transformated_currency_rate_df.head(20)

In this exercise, PCA allowed the two original attributes to be merged into one without any significant loss of information. Of course, it is very cumbersome to execute PCA step by step manually each time which is why PCA is also included in some ML frameworks such as scikit-learn. 
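As a recap, the manual steps can be condensed into a few lines. The following minimal sketch runs them on synthetic, strongly correlated toy data; all names prefixed with `toy_` and the data itself are assumptions made for illustration. It uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix, instead of the `np.linalg.eig` function referenced in the task.

```python
import numpy as np
import pandas as pd

# Synthetic, strongly correlated toy data (hypothetical, not the currency rates)
rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
toy = pd.DataFrame({"x": x, "y": x + rng.normal(scale=0.1, size=200)})

# Steps 1-3: standardize, compute the covariance matrix, decompose it
toy_standardized = (toy - toy.mean()) / toy.std()
toy_covariance = toy_standardized.cov()
toy_eigenvalues, toy_eigenvectors = np.linalg.eigh(toy_covariance.to_numpy())

# Step 4: share of information per eigenvector
toy_share = toy_eigenvalues / toy_eigenvalues.sum()

# Step 5: keep the eigenvector (a column) with the largest eigenvalue and project
toy_feature_vector = toy_eigenvectors[:, np.argmax(toy_eigenvalues)]
toy_projected = toy_standardized.dot(toy_feature_vector)

print(toy_share)
print(toy_projected.head())
```

A useful sanity check in this sketch is that the variance of the projected values equals the largest eigenvalue.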

It is very important to know exactly what the framework does for you and what it does not. For example, scikit-learn's PCA does not include standardization, although feature scaling [is strongly recommended](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html) by the framework itself as a preceding step.
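One way to make sure that scaling is not forgotten is to chain both steps. The following minimal sketch combines scikit-learn's `StandardScaler` and `PCA` with `make_pipeline` on synthetic toy data; the data and the names `toy`, `scaled_pca` and `toy_component` are assumptions. Note that `StandardScaler` divides by the population standard deviation, whereas pandas' `std()` uses the sample standard deviation by default, so the scaled values differ from the manual z-score normalization by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, strongly correlated toy data on different scales (hypothetical values)
rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
toy = pd.DataFrame({"x": x, "y": 5 * x + rng.normal(scale=0.5, size=200)})

# Chain standardization and PCA so that both are always applied together
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=1))
toy_component = scaled_pca.fit_transform(toy)

# Inspect the first transformed values and the share of variance that is kept
print(toy_component[:5])
print(scaled_pca.named_steps["pca"].explained_variance_ratio_)
```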

<div class="alert alert-block alert-info">

**Task 17:**
    
Use scikit-learn's PCA to transform the standardized `currency_rate_df` a second time. This time you may assume that only one component is expected to be used as a result. Display the result. (Help: [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)) 

</div>

In [None]:
# Use PCA from scikit-learn

In [None]:
# Instantiate a PCA object
pca = sklearn.decomposition.PCA(n_components=1)

# Compute the principal components for the standardized_currency_rate_df
transformated_currency_rate_df = pd.DataFrame(
    data=pca.fit_transform(standardized_currency_rate_df),
    index=standardized_currency_rate_df.index,
)

# Display the transformed DataFrame
transformated_currency_rate_df.head(20)

<div class="alert alert-block alert-info">

**Task 18:**
    
It may be that the results of the manual PCA and scikit-learn's PCA differ. Explain why both results may well be correct. 

</div>

Write down your solution here:

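One of the steps in PCA is the determination of eigenvalues and eigenvectors. Eigenvectors are not unique: if a vector is an eigenvector, then every non-zero multiple of it is an eigenvector for the same eigenvalue as well. Even if all eigenvectors are normalized to unit length, their sign remains arbitrary, so different implementations may return different, but equally valid, eigenvectors. Different eigenvectors do not lead to an identical result of the PCA (for example, the transformed values may have a flipped sign), but since they are all valid solutions of the corresponding system of equations, all results are also valid results of a PCA. 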
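
The sign ambiguity can be verified with a small example. The following minimal sketch uses a hypothetical 2x2 covariance matrix (the matrix, the sample and all variable names are assumptions) and shows that an eigenvector and its mirrored version both satisfy the eigenvalue equation, while the projections onto them differ only in sign.

```python
import numpy as np

# Toy covariance matrix of two standardized, highly correlated attributes (hypothetical)
toy_covariance = np.array([[1.0, 0.9], [0.9, 1.0]])
toy_eigenvalues, toy_eigenvectors = np.linalg.eigh(toy_covariance)

# Eigenvector (a column) that belongs to the largest eigenvalue
largest = toy_eigenvalues.max()
v = toy_eigenvectors[:, np.argmax(toy_eigenvalues)]

# Both v and -v satisfy the eigenvalue equation C v = lambda v
print(np.allclose(toy_covariance @ v, largest * v))
print(np.allclose(toy_covariance @ (-v), largest * (-v)))

# Projecting a sample onto v or onto -v only flips the sign of the result
sample = np.array([1.0, 2.0])
print(sample @ v, sample @ (-v))
```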