# 2. Data analysis & Preprocessing

In this exercise you will get to know the basics from the lectures "3. Getting to Know Your Data" and "4. Preprocessing" in their practical use and apply them yourself.

Since this practice sheet is designed to be used in three sessions, it is roughly divided into three sections:

- [2.1. Part One: Getting to Know Your Data](#2.1. Part One: Getting to Know Your Data)
- [2.2. Part Two: Preprocessing - Data cleaning & data integration](#2.2. Part Two: Preprocessing - Data cleaning & Data integration)
- [2.3. Part Three: Preprocessing - Data reduction, data transformation & data discretization](#2.3. Part Three: Preprocessing - Data reduction, data transformation & data discretization)

Of course, depending on how quickly an exercise group progresses, one of these parts may not be completed within its session, or the following part may already be started.

### Preparation: Import required libraries

In [None]:
# Import the required libraries
import tempfile
import sqlite3
import urllib.request
import pandas as pd

### Preparation: Download the datasets

In [None]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    dataset_folder + "/adventure-works.db",
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(dataset_folder + "/adventure-works.db")

# Create the dataframe(s)
order_dataframe = pd.read_sql_query(
    "SELECT * FROM Product JOIN PurchaseOrderDetail ON Product.ProductID = PurchaseOrderDetail.ProductID",
    connection,
)

## 2.1. Part One: Getting to Know Your Data

In this part you will apply the theoretical knowledge gained in the lecture "Getting to Know Your Data". In doing so, you will familiarize yourself step by step with the `order_dataframe` dataframe defined above.

#### 2.1.1. Structure of the Dataframe

Currently you don't know anything about the `order_dataframe` except that it is the result of joining the two tables `Product` and `PurchaseOrderDetail` of a database named `AdventureWorks`.
To gain an initial understanding of the structure of the dataframe, it is useful to know its dimensions. This information is stored in the `shape` property of a pandas dataframe.

<div class="alert alert-block alert-warning">
<b>Task 1:</b> Figure out the dimensions of order_dataframe. You are allowed to have a look at <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html">the Pandas documentation</a> regarding the mentioned property.</div>


In [None]:
order_dataframe.shape

Thus, we know that the present dataframe consists of 8845 tuples and contains 34 attributes. But what we still don't know is what data is contained in the data set.
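
Since `shape` is a plain tuple, the number of tuples and the number of attributes can also be read individually. A minimal sketch, assuming a small hypothetical `toy_dataframe` in place of `order_dataframe`:

```python
import pandas as pd

# Hypothetical stand-in for order_dataframe (not the real data)
toy_dataframe = pd.DataFrame(
    {"ProductID": [1, 2, 3], "UnitPrice": [0.21, 1.5, 82.8345]}
)

# shape is a tuple: (number of tuples, number of attributes)
n_rows, n_cols = toy_dataframe.shape
print(f"{n_rows} tuples, {n_cols} attributes")
```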

To get a first impression of the data, it can be useful to look at (a sample of) the dataframe. The seemingly simplest way to do this is the `print()` function.

In [None]:
# Print the order_dataframe
print(order_dataframe)

However, as you can see, this method outputs the dataframe as plain text without any specific layout, and pandas abbreviates large dataframes in this view by default. This makes the output hard to read, especially with very large dataframes, and is therefore not recommended. It is far more common to use the dataframe member function `head()`.

<div class="alert alert-block alert-warning">
<b>Task 2:</b> Use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html">Pandas documentation</a> to familiarize yourself with head(), then apply it to the order_dataframe so that the first 10 tuples are displayed.</div>

In [None]:
# Use the head() function on the order_dataframe while setting the number of rows displayed to 10
order_dataframe.head(10)

As you can see, the representation by `head()` is easier to read. However, `head()` also has its limitations. For example, in this case not all columns are displayed.
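
How many columns are shown is controlled by the pandas display option `display.max_columns`. The following sketch, which assumes a hypothetical `wide_dataframe` with 40 columns instead of `order_dataframe`, lifts this limit temporarily. It is only one possible way to see every column and is not part of the tasks below.

```python
import pandas as pd

# Hypothetical wide dataframe with 40 columns (not the real data)
wide_dataframe = pd.DataFrame({f"col_{i}": range(3) for i in range(40)})

# Lift the column limit only inside the with-block;
# the previous setting is restored automatically afterwards
with pd.option_context("display.max_columns", None):
    full_text = repr(wide_dataframe.head(2))

print(full_text)
```

In a notebook, calling `wide_dataframe.head(2)` as the last expression inside the same `with` block would presumably be the more natural choice, since the rendered table respects the same option.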

<div class="alert alert-block alert-warning">
<b>Task 3:</b> All attributes of a data frame are stored in the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html">member variable columns</a>. Use this information to output a list of all attributes contained in order_dataframe. No special formatting is required, but make sure that this time all column identifiers appear in the output.
</div>

In [None]:
# There are multiple possible solutions, e.g.:
# Sample solution 1: Iterate over the columns
for column in order_dataframe.columns:
    print(column, end=",")

In [None]:
# Sample solution 2: Use list()
print(list(order_dataframe.columns))

For example, we did not see the columns "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class" and "Style" in the above execution of head().  

<div class="alert alert-block alert-warning">
<b>Task 4:</b> Show the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class", "Style" for the first 10 tuples. (Help: <a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html">Pandas tutorial on selecting subsets</a>)
</div>

In [None]:
# There are multiple possible solutions, e.g.:
# Sample solution 1: Explicit naming of the identifiers
order_dataframe[
    [
        "WeightUnitMeasureCode",
        "Weight",
        "DaysToManufacture",
        "ProductLine",
        "Class",
        "Style",
    ]
].head(10)

In [None]:
# Sample solution 2: Using the columns attribute
order_dataframe[order_dataframe.columns[12:18]].head(10)

Unfortunately, all values shown are `0` for the "DaysToManufacture" attribute and `None` for the "ProductLine" attribute. However, this does not mean that this holds for the entire dataframe.
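
One way to check whether an attribute really contains only a single value is to count how often each value occurs. A minimal sketch, assuming a hypothetical `toy_dataframe` whose values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for order_dataframe (values are made up)
toy_dataframe = pd.DataFrame(
    {
        "DaysToManufacture": [0, 0, 1, 0],
        "ProductLine": [None, None, "M", None],
    }
)

# Count how often each value occurs in "DaysToManufacture"
dtm_counts = toy_dataframe["DaysToManufacture"].value_counts()

# Count the tuples in which "ProductLine" is actually set
product_line_set = int(toy_dataframe["ProductLine"].notna().sum())

print(dtm_counts)
print("Tuples with a ProductLine:", product_line_set)
```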

<div class="alert alert-block alert-warning">
<b>Task 5a:</b> Save all tuples with "DaysToManufacture" greater than 0 into a new dataframe called order_dtm_not_zero_dataframe. (Help: <a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html">Pandas tutorial on selecting subsets</a>)
</div>

In [None]:
# Select all tuples with "DaysToManufacture" > 0
order_dtm_not_zero_dataframe = order_dataframe[order_dataframe["DaysToManufacture"] > 0]

<div class="alert alert-block alert-warning">
<b>Task 5b:</b> Display the columns "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class" and "Style" of the new order_dtm_not_zero_dataframe. Limit the output to 10 tuples. (Help: <a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html">Pandas tutorial on selecting subsets</a>)
</div>

In [None]:
# Do the same thing we have done in task 4
order_dtm_not_zero_dataframe[
    [
        "WeightUnitMeasureCode",
        "Weight",
        "DaysToManufacture",
        "ProductLine",
        "Class",
        "Style",
    ]
].head(10)

With larger dataframes, an overview of all attributes can only be obtained to a limited extent. In most cases, however, this is not necessary, since the problem at hand determines which attributes are important and which are not.
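
When a compact overview is nevertheless wanted, the `dtypes` property lists every attribute together with its data type on one line each. A minimal sketch, assuming a hypothetical `toy_dataframe` with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for order_dataframe (values are made up)
toy_dataframe = pd.DataFrame(
    {
        "Name": ["Adjustable Race", "Bearing Ball"],
        "Weight": [1.5, 2.0],
        "DaysToManufacture": [0, 1],
    }
)

# One entry per attribute: the index holds the column names,
# the values hold the data types
attribute_overview = toy_dataframe.dtypes
print(attribute_overview)
```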

#### 2.1.2. Basic Statistical Descriptors

We start this task by taking a closer look at the attributes "ReorderPoint", "DaysToManufacture", and "UnitPrice". 

In [None]:
order_dataframe[["ReorderPoint", "DaysToManufacture", "UnitPrice"]].head(10)

It makes sense to first look at the simple statistical values "Mean", "Median", "Min", "Max" and "NUnique" to get a rough estimate of these attributes. 

<div class="alert alert-block alert-warning">
<b>Task 1a:</b> Determine the mean for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html">Pandas documentation</a>)
</div>

In [None]:
# Output the mean of the attributes
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].mean(axis=0)))
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].mean(axis=0)))
print("UnitPrice: " + str(order_dataframe["UnitPrice"].mean(axis=0)))

<div class="alert alert-block alert-warning">
<b>Task 1b:</b> Determine the median for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.median.html">Pandas documentation</a>)
</div>

In [None]:
# Output the median of the attributes
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].median(axis=0)))
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].median(axis=0)))
print("UnitPrice: " + str(order_dataframe["UnitPrice"].median(axis=0)))

<div class="alert alert-block alert-warning">
<b>Task 1c:</b> Determine the min for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html">Pandas documentation</a>)
</div>

In [None]:
# Output the min of the attributes
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].min(axis=0)))
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].min(axis=0)))
print("UnitPrice: " + str(order_dataframe["UnitPrice"].min(axis=0)))

<div class="alert alert-block alert-warning">
<b>Task 1d:</b> Determine the max for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html">Pandas documentation</a>)
</div>

In [None]:
# Output the max of the attributes
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].max(axis=0)))
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].max(axis=0)))
print("UnitPrice: " + str(order_dataframe["UnitPrice"].max(axis=0)))

<div class="alert alert-block alert-warning">
<b>Task 1e:</b> Determine the nunique for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html">Pandas documentation</a>)
</div>

In [None]:
# Output the nunique of the attributes
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].nunique()))
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].nunique()))
print("UnitPrice: " + str(order_dataframe["UnitPrice"].nunique()))

We can already tell a lot from these few values. For example, the values within "ReorderPoint" are far apart, yet there are only 5 unique values. This clearly indicates that we are dealing with a discrete attribute.

"DaysToManufacture", on the other hand, has only 2 unique values (0 and 1). Long production times do not seem to occur in our data set.

The "UnitPrice", on the other hand, is very variable with 177 different values ranging from 0.21 to 82.8345. Thus, it is the most likely of the three attributes to be a continuous attribute.
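
The five descriptors discussed here can also be collected in a single table with `DataFrame.agg()`. The following is a minimal sketch on a hypothetical `toy_dataframe`. Its values are made up and do not reproduce the results for `order_dataframe`.

```python
import pandas as pd

# Hypothetical stand-in for order_dataframe (values are made up)
toy_dataframe = pd.DataFrame(
    {
        "ReorderPoint": [3, 3, 375, 750, 750],
        "UnitPrice": [0.21, 1.5, 1.5, 40.0, 82.8345],
    }
)

# One row per descriptor, one column per attribute
summary = toy_dataframe.agg(["mean", "median", "min", "max", "nunique"])
print(summary)
```

Compared to one `print()` call per attribute and descriptor, this keeps all values side by side, which makes attributes easier to compare.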

What we cannot yet say much about, however, is the distribution of the individual values within the attributes. This becomes clearer, for example, when looking at a histogram.

<div class="alert alert-block alert-warning">
<b>Task 2a:</b> Draw a histogram for "ReorderPoint". Consider what number of bins might be appropriate. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html">Pandas documentation</a>)
</div>

In [None]:
# Histogram for "ReorderPoint" - 20 bins as the two lower values are too close to each other to be separated otherwise
order_dataframe["ReorderPoint"].plot.hist(bins=20)

<div class="alert alert-block alert-warning">
<b>Task 2b:</b> Draw a histogram for "DaysToManufacture". Consider what number of bins might be appropriate. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html">Pandas documentation</a>)
</div>

In [None]:
# Histogram for "DaysToManufacture" - 2 bins as there are only 2 unique values
order_dataframe["DaysToManufacture"].plot.hist(bins=2)

<div class="alert alert-block alert-warning">
<b>Task 2c:</b> Draw a histogram for "UnitPrice". Consider what number of bins might be appropriate. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html">Pandas documentation</a>)
</div>

In [None]:
# Histogram for "UnitPrice" - 20 bins as 177 would not be displayed very well
order_dataframe["UnitPrice"].plot.hist(bins=20)

As you can see, it does not always make sense to set the number of bins equal to the number of unique values.
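
The effect of the bin count can be made visible without plotting by counting how many bins actually contain values. The sketch below assumes equal-width bins as computed by `numpy.histogram` and uses four made-up values, two of which lie close together:

```python
import numpy as np

# Hypothetical values: two small ones close together, two large ones
values = np.array([3, 60, 375, 750])

# Equal-width bins between the minimum and the maximum
counts_5, _ = np.histogram(values, bins=5)
counts_20, _ = np.histogram(values, bins=20)

# Number of bins that contain at least one value
nonempty_5 = int((counts_5 > 0).sum())
nonempty_20 = int((counts_20 > 0).sum())

print("5 bins: ", nonempty_5, "non-empty bins")
print("20 bins:", nonempty_20, "non-empty bins")
```

With 5 bins the two small values share one bin, so only three bars would be visible, whereas 20 bins are narrow enough to separate all four values.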

Apart from that, note that, especially in the case of "UnitPrice", information is lost by merging multiple values into one bin. In such a case it can also be useful to look at a boxplot and a density curve instead of a histogram.
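
A boxplot is built from a few quantiles, which can also be computed directly. The following sketch uses a hypothetical `prices` series with made-up values and assumes the common whisker rule of 1.5 times the interquartile range, so the limits may differ if a plot is configured differently.

```python
import pandas as pd

# Hypothetical prices (made up), including one extreme value
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# Quartiles that define the box
q1 = prices.quantile(0.25)
median = prices.quantile(0.5)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Assumed whisker rule: values beyond 1.5 * IQR count as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = prices[(prices < lower_limit) | (prices > upper_limit)]

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("Outliers:", outliers.tolist())
```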

<div class="alert alert-block alert-warning">
<b>Task 3a:</b> Draw a boxplot diagram for "UnitPrice". (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.box.html">Pandas documentation</a>)
</div>


In [None]:
# Boxplot for "UnitPrice"
order_dataframe["UnitPrice"].plot.box()

<div class="alert alert-block alert-warning">
<b>Task 3b:</b> Draw a density diagram for "UnitPrice". (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.density.html">Pandas documentation</a>)
</div>


In [None]:
# Density curve for "UnitPrice"
order_dataframe["UnitPrice"].plot.density()

#### 2.1.3. Data Visualization

<div class="alert alert-block alert-info">
TODO
</div>

#### 2.1.4. Data similarity

<div class="alert alert-block alert-info">
TODO
</div>

## 2.2. Part Two: Preprocessing - Data cleaning & Data integration

<div class="alert alert-block alert-info">
TODO
</div>

## 2.3. Part Three: Preprocessing - Data reduction, data transformation & data discretization

<div class="alert alert-block alert-info">
TODO
</div>