# 2. Data analysis & Preprocessing

In this exercise you will get to know the basics from the lectures "3. Getting to Know Your Data" and "4. Preprocessing" in their practical use and apply them yourself.

Since this practice sheet is designed to be used in three sessions, it is roughly divided into three sections:

- <a href="#2.1. Part One: Getting to Know Your Data">2.1. Part One: Getting to Know Your Data</a>
- <a href="#2.2. Part Two: Preprocessing - Data cleaning & Data integration">2.2. Part Two: Preprocessing - Data cleaning & Data integration</a>
- <a href="#2.3. Part Three: Preprocessing - Data reduction, data transformation & data discretization">2.3. Part Three: Preprocessing - Data reduction, data transformation & data discretization</a>

Depending on how quickly an exercise group progresses, a part may not be covered completely in its session, or work on the following part may already begin.

### Preparation: Import required libraries

In [None]:
# Import the required libraries
import tempfile
import sqlite3
import urllib.request
import squarify
import pywaffle
import scipy
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

### Preparation: Download the database and prepare the datasets

In [None]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    dataset_folder + "/adventure-works.db",
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(dataset_folder + "/adventure-works.db")

In [None]:
# Create the clean dataframe(s)
# Order dataframe
order_dataframe = pd.read_sql_query(
    "SELECT * FROM Product "
    "JOIN PurchaseOrderDetail ON Product.ProductID = PurchaseOrderDetail.ProductID "
    "JOIN PurchaseOrderHeader ON PurchaseOrderDetail.PurchaseOrderID = PurchaseOrderHeader.PurchaseOrderID "
    "JOIN Employee ON PurchaseOrderHeader.EmployeeID = Employee.BusinessEntityID "
    "JOIN Person ON Employee.BusinessEntityID = Person.BusinessEntityID",
    connection,
)

In [None]:
# Modify the database to contain dirty data
cursor = connection.cursor()
cursor.execute("UPDATE Employee SET HireDate = STRFTIME('%Y-%m-%d %H:%M:%S', HireDate)")
cursor.execute(
    "UPDATE Employee SET BirthDate = STRFTIME('%Y-%m-%d %H:%M:%S', BirthDate)"
)
cursor.execute(
    'UPDATE Employee SET Gender = "Male" WHERE Gender = \'M\' AND NationalIDNumber LIKE "%8"'
)
cursor.execute(
    'UPDATE Employee SET Gender = "Female" WHERE Gender = \'F\' AND NationalIDNumber LIKE "%7%"'
)
cursor.execute(
    "UPDATE Employee SET HireDate = STRFTIME('%Y-%m-%d %H:%M:%S', DATE(BirthDate, '-10 year')) WHERE NationalIDNumber LIKE \"2%\""
)
cursor.execute(
    "UPDATE Employee SET BirthDate = STRFTIME('%Y-%m-%d', BirthDate) WHERE NationalIDNumber LIKE \"%2%\""
)
cursor.execute("UPDATE Employee SET SickLeaveHours = 2306 WHERE BusinessEntityID = 10")
cursor.execute("UPDATE Employee SET VacationHours = -12 WHERE BusinessEntityID = 21")
cursor.execute('UPDATE Person SET LastName = "Doe"')
cursor.execute('UPDATE Employee SET JobTitle = "None" WHERE NationalIDNumber LIKE "%1"')
cursor.execute(
    "UPDATE Employee SET CurrentFlag = 0 WHERE NationalIDNumber = 658797903 OR NationalIDNumber = 974026903"
)


# Create the dirty dataframe(s)
# Employee dataframe
dirty_employee_dataframe = pd.read_sql_query(
    "SELECT NationalIDNumber, LoginID, OrganizationNode, JobTitle, BirthDate, MaritalStatus, Gender, HireDate, SalariedFlag, VacationHours, SickLeaveHours, CurrentFlag, PersonType, NameStyle, Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, AdditionalContactInfo, Demographics FROM Employee "
    "JOIN Person ON Employee.BusinessEntityID = Person.BusinessEntityID",
    connection,
)

## 2.1. Part One: Getting to Know Your Data

In this part you will apply the theoretical knowledge gained in the lecture "Getting to Know Your Data". In doing so, you will familiarize yourself step by step with the order_dataframe defined above.

#### 2.1.1. Structure of the Dataframe

Currently you don't know anything about the order_dataframe except that it was built by joining the tables Product, PurchaseOrderDetail, PurchaseOrderHeader, Employee and Person of a database named AdventureWorks. 
To gain an initial understanding of its structure, it is useful to know the dimensions of the dataframe. This information is stored in the shape property of a pandas dataframe.

<div class="alert alert-block alert-info">
<b>Task 1:</b> Figure out the dimensions of order_dataframe. You are allowed to have a look at the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html">Pandas documentation</a> regarding the mentioned property.</div>


In [None]:
# Get the shape of the dataframe

In [None]:
# Get the shape of the dataframe
order_dataframe.shape

The output of shape contains two dimensions. It is of course important to know which number is the count of attributes and which number is the count of tuples in our dataframe.

<div class="alert alert-block alert-info">
<b>Task 2:</b> Use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html">Pandas documentation</a> to find out which number stands for the number of attributes and which for the number of tuples.</div>

<b>The count of attributes:</b>
1. [ ] 8845
2. [ ] 3212
3. [ ] 75
4. [ ] 34

<b>The count of tuples:</b>
1. [ ] 8845
2. [ ] 3212
3. [ ] 75
4. [ ] 34

<b>The count of attributes:</b>
1. [ ] 8845
2. [ ] 3212
3. [X] 75
4. [ ] 34

<b>The count of tuples:</b>
1. [X] 8845
2. [ ] 3212
3. [ ] 75
4. [ ] 34
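
The ordering of the two numbers in shape can be checked on a small hand-made dataframe for which both counts are known in advance. The following is a minimal sketch; `toy_dataframe` and its values are invented for illustration and are not part of the AdventureWorks data.

```python
import pandas as pd

# Hypothetical toy data: 3 tuples (rows) described by 2 attributes (columns)
toy_dataframe = pd.DataFrame(
    {"ProductID": [1, 2, 3], "Name": ["Bolt", "Nut", "Washer"]}
)

# shape is a tuple ordered as (number of rows, number of columns)
n_tuples, n_attributes = toy_dataframe.shape
print("Tuples:", n_tuples)
print("Attributes:", n_attributes)
```

Unpacking the tuple into two named variables avoids having to remember which position holds which count.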

Now that we know the count of tuples and attributes in our dataframe, we still do not know what data is contained in the data set.

In order to get a first impression in this respect, it can be useful to look at (a sample of) the data frame. The supposedly simplest method to make this possible is the print() function.

In [None]:
# Print the order_dataframe
print(order_dataframe)

However, as you can see, this method outputs the entire content of the dataframe without any specific layout. This can cause problems, especially with very large dataframes, and is therefore not recommended. It is far more common to use the dataframe member function head().

<div class="alert alert-block alert-info">
<b>Task 3:</b> Use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html">Pandas documentation</a> to familiarize yourself with head(), then apply it to the order_dataframe so that the first 10 tuples are displayed.</div>

In [None]:
# Use the head() function on the order_dataframe while setting the number of rows displayed to 10

In [None]:
# Use the head() function on the order_dataframe while setting the number of rows displayed to 10
order_dataframe.head(10)

As you can see, the representation by head() is easier to read. However, head() also has its limitations. For example, in this case we do not get all columns displayed.

<div class="alert alert-block alert-info">
<b>Task 4:</b> All attributes of a data frame are stored in the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html">member variable columns</a>. Use this information to output a list of all attributes contained in order_dataframe. No special formatting is asked, but it should be made sure that this time all column identifiers are directly named in the output.
</div>

In [None]:
# Output a list of all columns

In [None]:
# There are multiple possible solutions, e.g.:
# Sample solution 1: Iterate over the columns
for column in order_dataframe.columns:
    print(column, end=",")

In [None]:
# Sample solution 2: Use list()
print(list(order_dataframe.columns))

For example, we did not see the columns "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class" and "Style" in the above execution of head().  

<div class="alert alert-block alert-info">
<b>Task 5:</b> Show the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class", "Style" for the first 10 tuples. (Help: <a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html">Pandas tutorial on selecting subsets</a>)
</div>

In [None]:
# Print the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class"
# and "Style" of the first ten tuples

In [None]:
# There are multiple possible solutions, e.g.:
# Sample solution 1: Explicit naming of the identifiers
order_dataframe[
    [
        "WeightUnitMeasureCode",
        "Weight",
        "DaysToManufacture",
        "ProductLine",
        "Class",
        "Style",
    ]
].head(10)

In [None]:
# Sample solution 2: Using the columns attribute
order_dataframe[order_dataframe.columns[12:18]].head(10)

Of course, it is a pity that all the attribute values shown are `0` in the "DaysToManufacture" attribute and `None` in the "ProductLine" attribute. However, this does not mean that this is the case for the entire dataframe.

<div class="alert alert-block alert-info">
<b>Task 6a:</b> Save all tuples with "DaysToManufacture" higher than 0 into a new dataframe called order_dtm_not_zero_dataframe. (Help: <a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html">Pandas tutorial on selecting subsets</a>)
</div>

In [None]:
# Select all tuples with "DaysToManufacture" > 0

In [None]:
# Select all tuples with "DaysToManufacture" > 0
order_dtm_not_zero_dataframe = order_dataframe[order_dataframe["DaysToManufacture"] > 0]
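
The selection above works in two steps: the comparison produces a boolean mask, and indexing with that mask keeps only the matching tuples. The sketch below makes both steps visible on a hypothetical `toy_dataframe` with invented values; the names `mask` and `filtered_dataframe` are chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not taken from the AdventureWorks database
toy_dataframe = pd.DataFrame(
    {"Name": ["Bolt", "Frame", "Wheel"], "DaysToManufacture": [0, 1, 4]}
)

# The comparison yields a boolean Series with one True/False entry per tuple
mask = toy_dataframe["DaysToManufacture"] > 0
print(mask.tolist())

# Indexing with the mask keeps only the tuples where the mask is True
filtered_dataframe = toy_dataframe[mask]
print(filtered_dataframe)
```

Storing the mask in its own variable is optional, but it can make longer conditions easier to read and to combine.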

<div class="alert alert-block alert-info">
<b>Task 6b:</b> Display the columns "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class" and "Style" of the new order_dtm_not_zero_dataframe. Limit the output to 10 tuples. (Help: <a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html">Pandas tutorial on selecting subsets</a>)
</div>

In [None]:
# Print the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class"
# and "Style" of the first ten tuples of the new order_dtm_not_zero_dataframe

In [None]:
# Do the same thing we have done in task 5
order_dtm_not_zero_dataframe[
    [
        "WeightUnitMeasureCode",
        "Weight",
        "DaysToManufacture",
        "ProductLine",
        "Class",
        "Style",
    ]
].head(10)

You have now received a first impression of the data. The methods presented to you here can of course also be applied to any other attributes and tuples.

<div class="alert alert-block alert-info">
<b>Task 7:</b> Consider the limitations of the methods presented in this section and give two examples (related to the data set at hand) of what information would be difficult to find out using only the methodology presented here.
</div>

Write down your solution here:


The methodology presented here is only used to get a first impression of the data. It is largely raw data that a human can only ever capture in part. For example, it would be difficult to answer the following questions using only this methodology:

- What is the longest available production period in our database?
- What is the average price of our products?
- What is the most frequently ordered item?

#### 2.1.2. Basic Statistical Descriptors

In this section we take a closer look at the attributes "ReorderPoint", "DaysToManufacture", and "UnitPrice". 

In [None]:
# Print the attributes "ReorderPoint", "DaysToManufacture", and "UnitPrice" of the first ten tuples
order_dataframe[["ReorderPoint", "DaysToManufacture", "UnitPrice"]].head(10)

It makes sense to first look at the simple statistical values "Mean", "Median", "Min", "Max", "Mode" and "NUnique" to get a rough estimate of these attributes. 

<div class="alert alert-block alert-info">
<b>Task 1a:</b> Determine the mean for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html">Pandas documentation</a>)
</div>

In [None]:
# Output the mean of "ReorderPoint"

In [None]:
# Output the mean of "DaysToManufacture"

In [None]:
# Output the mean of "UnitPrice"

In [None]:
# Output the mean of "ReorderPoint"
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].mean()))

In [None]:
# Output the mean of "DaysToManufacture"
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].mean()))

In [None]:
# Output the mean of "UnitPrice"
print("UnitPrice: " + str(order_dataframe["UnitPrice"].mean()))

<div class="alert alert-block alert-info">
<b>Task 1b:</b> Determine the median for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.median.html">Pandas documentation</a>)
</div>

In [None]:
# Output the median of "ReorderPoint"

In [None]:
# Output the median of "DaysToManufacture"

In [None]:
# Output the median of "UnitPrice"

In [None]:
# Output the median of "ReorderPoint"
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].median()))

In [None]:
# Output the median of "DaysToManufacture"
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].median()))

In [None]:
# Output the median of "UnitPrice"
print("UnitPrice: " + str(order_dataframe["UnitPrice"].median()))

<div class="alert alert-block alert-info">
<b>Task 1c:</b> Determine the min for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html">Pandas documentation</a>)
</div>

In [None]:
# Output the min of "ReorderPoint"

In [None]:
# Output the min of "DaysToManufacture"

In [None]:
# Output the min of "UnitPrice"

In [None]:
# Output the min of "ReorderPoint"
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].min()))

In [None]:
# Output the min of "DaysToManufacture"
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].min()))

In [None]:
# Output the min of "UnitPrice"
print("UnitPrice: " + str(order_dataframe["UnitPrice"].min()))

<div class="alert alert-block alert-info">
<b>Task 1d:</b> Determine the max for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html">Pandas documentation</a>)
</div>

In [None]:
# Output the max of "ReorderPoint"

In [None]:
# Output the max of "DaysToManufacture"

In [None]:
# Output the max of "UnitPrice"

In [None]:
# Output the max of "ReorderPoint"
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].max()))

In [None]:
# Output the max of "DaysToManufacture"
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].max()))

In [None]:
# Output the max of "UnitPrice"
print("UnitPrice: " + str(order_dataframe["UnitPrice"].max()))

<div class="alert alert-block alert-info">
<b>Task 1e:</b> Determine the mode for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html">Pandas documentation</a>)
</div>

In [None]:
# Output the mode of "ReorderPoint"

In [None]:
# Output the mode of "DaysToManufacture"

In [None]:
# Output the mode of "UnitPrice"

In [None]:
# Output the mode of "ReorderPoint"
order_dataframe["ReorderPoint"].mode()

In [None]:
# Output the mode of "DaysToManufacture"
order_dataframe["DaysToManufacture"].mode()

In [None]:
# Output the mode of "UnitPrice"
order_dataframe["UnitPrice"].mode()

<div class="alert alert-block alert-info">
<b>Task 1f:</b> Determine the nunique for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html">Pandas documentation</a>)
</div>

In [None]:
# Output the nunique of "ReorderPoint"

In [None]:
# Output the nunique of "DaysToManufacture"

In [None]:
# Output the nunique of "UnitPrice"

In [None]:
# Output the nunique of "ReorderPoint"
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].nunique()))

In [None]:
# Output the nunique of "DaysToManufacture"
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].nunique()))

In [None]:
# Output the nunique of "UnitPrice"
print("UnitPrice: " + str(order_dataframe["UnitPrice"].nunique()))

Even though the methods used here are fairly simple statistical methods, they can already tell us quite a bit about our data. 
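
Several of these descriptors can also be collected in a single call instead of one cell per value. The following is a minimal sketch using `Series.agg` on a hypothetical `prices` series with invented values; the mode is handled separately because it returns a Series rather than a single number.

```python
import pandas as pd

# Hypothetical values, not taken from the AdventureWorks database
prices = pd.Series([0.5, 2.0, 2.0, 3.5, 12.0], name="UnitPrice")

# agg() accepts a list of method names and returns one value per name
summary = prices.agg(["mean", "median", "min", "max", "nunique"])
print(summary)

# mode() returns a Series because several values can share the highest frequency
print(prices.mode().tolist())
```

Here the mean (4.0) lies well above the median (2.0), which hints at a few large values pulling the average upwards.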

<div class="alert alert-block alert-info">
<b>Task 2a:</b> Consider and describe what can be said about the "ReorderPoint" attribute based on the statistical values obtained.
</div>

Write down your solution here:


The attribute "ReorderPoint" has only five different values, which are distributed between the lowest value 3 and the highest value 750. It can therefore be assumed that there are sometimes large ranges between the individual values and that the attribute is most likely a discrete numeric attribute.

The mean of approx. 589 and the median of 750 clearly indicate that many values are probably located in the higher part of the value range. The fact that the mode is 750 further confirms this thesis.

<div class="alert alert-block alert-info">
<b>Task 2b:</b> Consider and describe what can be said about the "DaysToManufacture" attribute based on the statistical values obtained.
</div>

Write down your solution here:


In contrast to "ReorderPoint", "DaysToManufacture" apparently has only two different values. These seem to be the values "0" and "1", whereby both mean and median, as well as mode show that the value "0" is the more frequent of the two.

<div class="alert alert-block alert-info">
<b>Task 2c:</b> Consider and describe what can be said about the "UnitPrice" attribute based on the statistical values obtained.
</div>

Write down your solution here:


The "UnitPrice" attribute, with its 177 different values ranging from 0.21 up to 82.8345, appears to be much less discrete than the other two attributes. It has two values that occur most frequently, 31.4895 and 48.2895. In general, this, along with the mean of about 34.743 and the median of 39.2805, suggests that most values are likely to be found in the middle of the range of values. However, it is possible that there are many values in the very low range as well as in the very high range, but they simply balance each other out.

The distribution of the individual values within the attributes is something we cannot determine with these simple statistical values. One possibility to get more information about this is a histogram.

<div class="alert alert-block alert-info">
<b>Task 3a:</b> Draw a histogram for "ReorderPoint". Consider what number of bins might be appropriate. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html">Pandas documentation</a>)
</div>

In [None]:
# Histogram for "ReorderPoint"

In [None]:
# Histogram for "ReorderPoint"
# 20 bins as the two lower values are too close to each other to be separated otherwise
order_dataframe["ReorderPoint"].plot.hist(bins=20)

<div class="alert alert-block alert-info">
<b>Task 3b:</b> Draw a histogram for "DaysToManufacture". Consider what number of bins might be appropriate. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html">Pandas documentation</a>)
</div>

In [None]:
# Histogram for "DaysToManufacture"

In [None]:
# Histogram for "DaysToManufacture"
# 2 bins as there are only 2 unique values
order_dataframe["DaysToManufacture"].plot.hist(bins=2)

<div class="alert alert-block alert-info">
<b>Task 3c:</b> Draw a histogram for "UnitPrice". Consider what number of bins might be appropriate. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html">Pandas documentation</a>)
</div>

In [None]:
# Histogram for "UnitPrice"

In [None]:
# Histogram for "UnitPrice"
# 20 bins as 177 would not be displayed very well
order_dataframe["UnitPrice"].plot.hist(bins=20)

It can be seen that it does not always make sense to choose the number of bins exactly like the number of unique values. 
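
How the number of bins decides which values end up in the same bar can be inspected without drawing anything. The sketch below is an illustration under assumptions: it uses `numpy.histogram` (numpy is not imported elsewhere in this sheet) and a hypothetical `values` array with three unique values, two of which lie close together.

```python
import numpy as np

# Hypothetical values: 3 and 45 are close together, 750 is far away
values = np.array([3, 3, 45, 750, 750])

# One bin per unique value: each bin is 249 wide, so 3 and 45 share a bin
counts_3, edges_3 = np.histogram(values, bins=3)
print(counts_3.tolist())

# 20 bins: each bin is 37.35 wide, so 3 and 45 fall into different bins
counts_20, edges_20 = np.histogram(values, bins=20)
print(counts_20.tolist())
```

With equally wide bins, the bin width depends on the value range rather than on the number of unique values, which is why unevenly spaced values may need more bins to stay distinguishable.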

Apart from that, we have to note that, especially in the case of "UnitPrice", insight is lost by merging multiple values into one bin. In such a case it can be useful to look at a boxplot and a density curve instead of a histogram.

<div class="alert alert-block alert-info">
<b>Task 4a:</b> Draw a boxplot diagram for "UnitPrice". (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.box.html">Pandas documentation</a>)
</div>


In [None]:
# Boxplot for "UnitPrice"

In [None]:
# Boxplot for "UnitPrice"
order_dataframe["UnitPrice"].plot.box()

<div class="alert alert-block alert-info">
<b>Task 4b:</b> Draw a density diagram for "UnitPrice". (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.density.html">Pandas documentation</a>)
</div>


In [None]:
# Density curve for "UnitPrice"

In [None]:
# Density curve for "UnitPrice"
order_dataframe["UnitPrice"].plot.density()

#### 2.1.3. Data Visualization

While simple chart types such as histograms, boxplots and density plots can already be implemented directly with pandas, a variety of libraries are available for more advanced visualization techniques. In this section you will use four of these libraries for different diagram types from the lecture.

##### 2.1.3.1. Pixel Oriented Visualization Techniques:  Polar Plot - Plotly

In the lecture you were introduced to polar plots, also called radar charts. In this type of chart, you look at multiple attributes of a single tuple.

<div class="alert alert-block alert-info">
<b>Task 1:</b> Use Plotly Express to output the prepared tuple as a polar plot. (Help: <a href="https://plotly.com/python/radar-chart/">Plotly Documentation</a> and <a href="https://plotly.com/python-api-reference/generated/plotly.express.line_polar.html">Plotly API Reference</a>)
</div>

In [None]:
# Set the attributes to select
attributes = ["OrderQty", "ReceivedQty", "RejectedQty"]

# Prepare tuple_0 (Tuple with the id "0" in order_dataframe)
tuple_0 = pd.DataFrame(
    dict(r=order_dataframe[attributes].iloc[0].values, theta=attributes)
)

In [None]:
# Draw a polar diagram for tuple_0

In [None]:
# Draw a polar diagram for tuple_0
fig = px.line_polar(tuple_0, r="r", theta="theta", line_close=True)
fig.update_traces(fill="toself")
fig.show()

Even though the Polar Plot is initially only about a single tuple, it is of course also possible to display several tuples in the same diagram.

<div class="alert alert-block alert-info">
<b>Task 2:</b> Use Plotly's graph object Scatterpolar to output the prepared tuples within one polar plot. (Help: <a href="https://plotly.com/python/radar-chart/">Plotly Documentation</a> and <a href="https://plotly.com/python-api-reference/generated/plotly.graph_objects.scatterpolar.html?highlight=scatterpolar#module-plotly.graph_objects.scatterpolar">Plotly API Reference</a>)
</div>

In [None]:
# Set the attributes to select
attributes = ["OrderQty", "ReceivedQty", "RejectedQty"]

# Prepare tuple_6, tuple_77, tuple_82
tuple_6 = pd.DataFrame(
    dict(r=order_dataframe[attributes].iloc[6].values, theta=attributes)
)
tuple_77 = pd.DataFrame(
    dict(r=order_dataframe[attributes].iloc[77].values, theta=attributes)
)
tuple_82 = pd.DataFrame(
    dict(r=order_dataframe[attributes].iloc[82].values, theta=attributes)
)

In [None]:
# Draw a single polar diagram for tuple_6, tuple_77 and tuple_82

In [None]:
# Draw a single polar diagram for tuple_6, tuple_77 and tuple_82
fig = go.Figure()

fig.add_trace(
    go.Scatterpolar(
        r=tuple_6["r"],
        theta=tuple_6["theta"],
        mode="lines",
        fill="tonext",
        name="tuple_6",
        line_color="peru",
    )
)
fig.add_trace(
    go.Scatterpolar(
        r=tuple_77["r"],
        theta=tuple_77["theta"],
        mode="lines",
        fill="tonext",
        name="tuple_77",
        line_color="deepskyblue",
    )
)
fig.add_trace(
    go.Scatterpolar(
        r=tuple_82["r"],
        theta=tuple_82["theta"],
        mode="lines",
        fill="tonext",
        name="tuple_82",
        line_color="seagreen",
    )
)


fig.update_layout(showlegend=True)

fig.show()

##### 2.1.3.2. Geometric Projection Visualization Techniques:  Scatter Plot Matrices - Seaborn

One type of graph that we did not consider in 2.1.2 is the scatter plot. It was postponed because this chapter is well suited for directly contrasting scatter plots and scatter plot matrices.

<div class="alert alert-block alert-info">
<b>Task 1:</b> Use the internal pandas functions to create a scatter plot of the "OrderQty" and "ReceivedQty" attributes in order_dataframe. (Help: <a href="https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.plot.scatter.html">Pandas Documentation</a>)
</div>



In [None]:
# Draw a scatter plot regarding "OrderQty" and "ReceivedQty"

In [None]:
# Draw a scatter plot regarding "OrderQty" and "ReceivedQty"
order_dataframe.plot.scatter(x="OrderQty", y="ReceivedQty")

Within a scatter plot, exactly two attributes are always displayed on different axes. In order to compare more than two attributes with each other, a scatter plot matrix is often used. 

<div class="alert alert-block alert-info">
<b>Task 2:</b> Use the PairPlot from the Seaborn library to create a scatter plot matrix of the attributes "OrderQty", "ReceivedQty" and "RejectedQty" in order_dataframe. (Help: <a href="https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot">Seaborn Documentation</a>)
</div>

In [None]:
# Draw a scatter plot matrix regarding "OrderQty", "ReceivedQty" and "RejectedQty"

In [None]:
# Draw a scatter plot matrix regarding "OrderQty", "ReceivedQty" and "RejectedQty"
sns.pairplot(
    order_dataframe[["OrderQty", "ReceivedQty", "RejectedQty"]], diag_kind="kde"
)

##### 2.1.3.3. Icon Based Visualization:  Icon Diagram - PyWaffle

Icon Based Diagrams are often used to show percentage relationships in an understandable way.

For example, such a diagram can be used to highlight the gender distribution of the employees handling our orders.

In [None]:
# Get unique employees
orders_per_employee_dataframe = (
    order_dataframe.groupby(
        ["EmployeeID", "FirstName", "MiddleName", "LastName", "Gender"]
    )
    .size()
    .reset_index(name="Orders")
)

# Sort the dataframe
orders_per_employee_dataframe.sort_values("Orders", ascending=False, inplace=True)

# Show the head of the orders_per_employee_dataframe
orders_per_employee_dataframe.head(15)

In [None]:
# Get the count of employees per gender
employees_per_gender_dataframe = (
    orders_per_employee_dataframe.groupby(["Gender"]).size().reset_index(name="Count")
)

# Print the head of this new dataframe
employees_per_gender_dataframe.head(10)

<div class="alert alert-block alert-info">
<b>Task 1:</b> Use PyWaffle to draw an icon diagram regarding the genders of the employees. (Help: <a href="https://pywaffle.readthedocs.io/en/latest/examples/plot_with_characters_or_icons.html">PyWaffle Documentation</a>)
</div>

In [None]:
# Save the counts into separate variables
# Select the "Count" value of the row whose "Gender" matches
count_male_employees = employees_per_gender_dataframe.loc[
    employees_per_gender_dataframe["Gender"] == "M", "Count"
].iloc[0]
count_female_employees = employees_per_gender_dataframe.loc[
    employees_per_gender_dataframe["Gender"] == "F", "Count"
].iloc[0]

In [None]:
# Draw the icon diagram

In [None]:
# Draw the icon diagram
plt.figure(
    FigureClass=pywaffle.Waffle,
    rows=2,
    values=[count_male_employees, count_female_employees],
    colors=["#0000ff", "#ff0084"],
    icons=["male", "female"],
    font_size=30,
    icon_legend=True,
    legend={
        "labels": ["male", "female"],
        "loc": "upper left",
        "bbox_to_anchor": (1, 1),
    },
)
plt.tight_layout()

##### 2.1.3.4. Icon Based Visualization:  Tree Maps - Squarify

Even better than Icon Diagrams, distributions can be displayed using tree maps. 

For example, a tree map can be used to display the count of orders processed per employee.

In [None]:
# Print the head of orders_per_employee (again)
orders_per_employee_dataframe.head(15)

<div class="alert alert-block alert-info">
<b>Task 1:</b> Use Squarify to draw a tree map displaying the count of orders processed per employee. (Help: <a href="https://github.com/laserson/squarify#Usage">Squarify Documentation</a>)
</div>

In [None]:
# Draw the tree map

In [None]:
# Draw the tree map
fig, ax = plt.subplots(1, figsize=(12, 12))
squarify.plot(
    sizes=orders_per_employee_dataframe["Orders"],
    # Build one multi-line label per employee
    label=[
        f"Employee #{employee['EmployeeID']}:\n"
        f"{employee['FirstName']} {employee['MiddleName']}. {employee['LastName']}"
        f"\n\nOrders:\n{employee['Orders']}"
        for _, employee in orders_per_employee_dataframe.iterrows()
    ],
    pad=True,
)
plt.axis("off")
plt.show()

#### 2.1.4. Data similarity

Data similarity is often measured in data science by the distance between two data objects. There are many different methods to determine distances between different types of data. 

In this chapter, we will focus on the distance between numeric data. The two tuples we are comparing are tuple_77 and tuple_82, which were introduced in chapter 2.1.3.1.

In [None]:
print(tuple_77["r"])

In [None]:
print(tuple_82["r"])

One of the most commonly used methods for calculating the distance between numerical values is the Manhattan distance, also called the L1 norm or City Block.

<div class="alert alert-block alert-info">
<b>Task 1:</b> Use the cityblock function of SciPy to compute the distance between tuple_77 and tuple_82. (Help: <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cityblock.html#scipy.spatial.distance.cityblock">SciPy Documentation</a>)
</div>

In [None]:
# Use cityblock() to compute the distance

In [None]:
# Use cityblock() to compute the distance
scipy.spatial.distance.cityblock(tuple_77["r"], tuple_82["r"])

A second common variant is the Euclidean distance, also called L2 norm.

<div class="alert alert-block alert-info">
<b>Task 2:</b> Use the euclidean function of SciPy to compute the distance between tuple_77 and tuple_82. (Help: <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean">SciPy Documentation</a>)
</div>

In [None]:
# Use euclidean() to compute the distance

In [None]:
# Use euclidean() to compute the distance
scipy.spatial.distance.euclidean(tuple_77["r"], tuple_82["r"])

Apparently, for tuple_77 and tuple_82, the Manhattan distance and the Euclidean distance do not differ. But what about the following two lists?

In [None]:
# Lists to compare
list_1 = [550, 550, 82]
list_2 = [82, 550, 550]

<div class="alert alert-block alert-info">
<b>Task 3:</b> Compute the Manhattan distance and the Euclidean distance between list_1 and list_2. 
</div>

In [None]:
# Compute the manhattan distance

In [None]:
# Compute the euclidean distance

In [None]:
# Compute the manhattan distance
scipy.spatial.distance.cityblock(list_1, list_2)

In [None]:
# Compute the euclidean distance
scipy.spatial.distance.euclidean(list_1, list_2)

<div class="alert alert-block alert-info">
<b>Task 4:</b> Consider and explain why the distance measures did not differ for tuple_77 and tuple_82, but did for list_1 and list_2.
</div>

Write down your solution here:


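Both measures start from the attribute-wise differences between the two objects. The Manhattan distance (L1 norm) adds up the absolute differences, while the Euclidean distance (L2 norm) takes the square root of the sum of the squared differences. The two results coincide exactly when at most one attribute differs, which is evidently the case for tuple_77 and tuple_82.

list_1 and list_2 differ in two positions, the first and the last, by 468 each. The Manhattan distance is therefore 468 + 468 = 936, whereas the Euclidean distance is sqrt(468² + 468²) ≈ 661.85. As soon as more than one attribute differs, the Euclidean distance is smaller than the Manhattan distance.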
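
The arithmetic behind the two measures can be traced in plain Python. The following is a minimal sketch using only the standard library; `single_a` and `single_b` are hypothetical lists added here to show a case in which only one position differs.

```python
import math

list_1 = [550, 550, 82]
list_2 = [82, 550, 550]

# Absolute difference per position
differences = [abs(a - b) for a, b in zip(list_1, list_2)]

# Manhattan: sum of the absolute differences
manhattan = sum(differences)
# Euclidean: square root of the sum of the squared differences
euclidean = math.sqrt(sum(d ** 2 for d in differences))
print(differences, manhattan, euclidean)

# Hypothetical pair that differs in a single position only
single_a = [550, 550, 82]
single_b = [550, 550, 550]
single_differences = [abs(a - b) for a, b in zip(single_a, single_b)]
single_manhattan = sum(single_differences)
single_euclidean = math.sqrt(sum(d ** 2 for d in single_differences))
print(single_differences, single_manhattan, single_euclidean)
```

In the second pair only one difference is non-zero, so squaring and taking the root returns the same number that the plain sum gives.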

Both the manhattan distance and the Euclidean distance are special forms of another distance. Do you remember which one?

<div class="alert alert-block alert-info">
<b>Task 5:</b> Compute the manhattan distance and the euclidean distance between list_1 and list_2 without using cityblock() or euclidean(). (Hint: The required distance measure is also part of <a href="https://docs.scipy.org/doc/scipy/reference/spatial.distance.html">scipy.spatial.distance</a>)
</div>

In [None]:
# Compute the manhattan distance without cityblock()

In [None]:
# Compute the euclidean distance without euclidean()

In [None]:
# Compute the manhattan distance without cityblock()
scipy.spatial.distance.minkowski(list_1, list_2, 1)

In [None]:
# Compute the euclidean distance without euclidean()
scipy.spatial.distance.minkowski(list_1, list_2, 2)

## 2.2. Part Two: Preprocessing - Data cleaning & Data integration

In this part you will apply the theoretical knowledge gained in the first part of the lecture "Preprocessing". 

#### 2.2.1. Recognizing dirty data

We did not look for dirty data in the order_dataframe. In the real world, however, it is rather unusual for a dataset to contain no dirty data. For this reason, we will now look at the dirty_employee_dataframe, into which some obvious problems have been built.

<div class="alert alert-block alert-info">
<b>Task 1:</b> Independently use the skills you learned in Part One to familiarize yourself with the dirty_employee_dataframe. In doing so, try to identify as many problems as possible with the dataset at hand.</div>

In [None]:
# Use the methods you learned in Part One to familiarize yourself with the dirty_employee_dataframe
# (You are of course allowed to create new code cells if necessary)

In [None]:
# Since this task is very loosely defined, no 100% sample solution can be given here.

# But a minimum is to look at the shape of the dataframe
dirty_employee_dataframe.shape

In [None]:
# And then at least take a quick look at the head of all attributes. (Part 1)
dirty_employee_dataframe[dirty_employee_dataframe.columns[0:10]].head(25)

In [None]:
# And then at least take a quick look at the head of all attributes. (Part 2)
dirty_employee_dataframe[dirty_employee_dataframe.columns[10:22]].head(25)

##### 2.2.1.1. Incomplete

Incomplete data can take many different forms. If you look at the present data set, you will notice "None" values in various attributes. 

In [None]:
# Print the columns containing at least one "None"
dirty_employee_dataframe[
    [
        "OrganizationNode",
        "JobTitle",
        "Title",
        "MiddleName",
        "Suffix",
        "AdditionalContactInfo",
    ]
].head(25)

Not every "None" equates to missing data. Sometimes it is simply the correct information that the attribute value is "nothing". This can be seen in the six attributes presented. 

<div class="alert alert-block alert-info">
<b>Task 2:</b> For each of the attributes at hand, consider whether the "None" values are actually indicative of incomplete information, or whether the use of "None" is justified.</div>

Write down your solution here:

A 100% assessment of which attributes are correct and which are not usually requires a lot of expert knowledge. Without this, only an approximate estimate can be given:

<u>Most likely correct attributes:</u>

- <b>MiddleName:</b> That there are some "None" values in the MiddleName is consistent with what you would expect. In the "real" world, there are people with MiddleName as well as those without. However, it is still possible that the entry was forgotten for an employee with MiddleName.
- <b>Suffix:</b> The very fact that all of the first 25 tuples have no entry at the suffix shows that this is probably intentional. One possible reason why this could be an error is that the attribute was added to the database after the fact, which resulted in so many "None" values.
- <b>AdditionalContactInfo:</b> Again, the "None" values definitely suggest that this is at least a valid value. It also sounds reasonable that "Additional" information does not always have to be given.

<u>Most likely incomplete attributes:</u>

- <b>JobTitle:</b> It seems strange that in a dataset of employees, some people do not have a JobTitle. This is most likely incomplete information.
- <b>Title:</b> If the non-"None" values in Title were scientific titles such as "Dr." or "Prof." it would be expected that these titles do not actually exist for every employee. However, since titles such as "Mr." or "Ms." are also used, it can be assumed that information is simply missing. It should definitely be possible to specify these titles for each employee.

<u>Difficult to assess attributes:</u>

- <b>OrganizationNode:</b> There is also a "None" value in the OrganizationNode attribute. However, this is assigned to the CEO, of all people (see JobTitle). It is quite possible that only employees with superiors are supposed to be assigned to an OrganizationNode. However, this could also be an error: the CEO may have been with the company the longest, and the OrganizationNode may simply never have been added for this entry.
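How many entries are missing per attribute can also be counted instead of read off the head of the table. The following is a minimal sketch on a toy dataframe with made-up values; it assumes that the "None" entries are loaded as real missing values (Python `None` / NaN) and not as the string "None".

```python
import pandas as pd

# Toy data (hypothetical values, not taken from adventure-works.db)
toy = pd.DataFrame(
    {
        "JobTitle": ["Chief Executive Officer", None, "Design Engineer"],
        "MiddleName": [None, "Lee", None],
        "Suffix": [None, None, None],
    }
)

# Number of missing entries per attribute
missing_counts = toy.isna().sum()

# Share of missing entries per attribute (1.0 means the column is entirely empty)
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

A share close to 1.0 (as for Suffix in the toy data) hints at an attribute that is either optional or was never maintained, whereas a handful of gaps in an otherwise complete attribute is more likely to be incomplete information.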

##### 2.2.1.2. Noisy

Noisy data, i.e. small measurement inaccuracies, are difficult to detect in the context of such an exercise. However, it is almost certain that the present data set does not contain any noisy data. 

<div class="alert alert-block alert-info">
<b>Task 3:</b> Consider why noisy data is unlikely to be included in dirty_employee_dataframe.</div>

Write down your solution here:

The dirty_employee_dataframe does not contain measured data. It is therefore extremely unlikely that noise is included in the data.

##### 2.2.1.3.  Inconsistencies

Examples of inconsistencies can be found in the present data set in the attributes "Gender", "BirthDate", and "HireDate". 

<div class="alert alert-block alert-info">
<b>Task 4a:</b> Print the head of the attributes "Gender", "BirthDate", and "HireDate".</div>

In [None]:
# Print the head of "Gender", "BirthDate", and "HireDate"

In [None]:
# Print the head of "Gender", "BirthDate", and "HireDate"
dirty_employee_dataframe[["Gender", "BirthDate", "HireDate"]].head(25)

<div class="alert alert-block alert-info">
<b>Task 4b:</b> Consider what inconsistencies are in the "Gender" attribute.</div>

Write down your solution here:

The Gender in the data set is partly given as "F" and "M" and partly as "Female" and "Male".

<div class="alert alert-block alert-info">
<b>Task 4c:</b> Consider what inconsistencies are in the "BirthDate" attribute.</div>

Write down your solution here:

The format in which the BirthDate has been specified differs.

<div class="alert alert-block alert-info">
<b>Task 4d:</b> Consider what inconsistencies are in the "HireDate" attribute. (Hint: Consider the attribute in conjunction with the other two attributes)</div>

Write down your solution here:

In some tuples, the HireDate lies before the BirthDate. This is first of all an inconsistency, but the probability is high that it is actually an error.
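Such a cross-attribute check can also be automated. The following is a minimal sketch on a toy dataframe; the values and the two date formats ("YYYY-MM-DD" with and without a " 00:00:00" suffix) are assumptions made for illustration and may not match every format in the real data.

```python
import pandas as pd

# Toy data (hypothetical values); one BirthDate carries a " 00:00:00" suffix
toy = pd.DataFrame(
    {
        "BirthDate": ["1969-01-29", "1971-08-01 00:00:00", "1984-11-30"],
        "HireDate": ["2009-01-14", "1965-03-02", "2008-01-06"],
    }
)

# Bring both columns to a common "YYYY-MM-DD" text format, then parse them
birth = pd.to_datetime(
    toy["BirthDate"].str.replace(" 00:00:00", "", regex=False), format="%Y-%m-%d"
)
hire = pd.to_datetime(toy["HireDate"], format="%Y-%m-%d")

# Rows in which the employee was supposedly hired before being born
suspicious = toy[hire < birth]
print(suspicious)
```

Comparing parsed dates rather than the raw strings avoids false results caused by the differing text formats.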

##### 2.2.1.4.  Errors/Outlier

Errors in numerical values, and outliers in particular, are sometimes hard to detect at a glance. Fortunately, we have already learned methods in Part One that we can now use.

First, let's look at the "SickLeaveHours" attribute. Does it contain outliers or errors?

<div class="alert alert-block alert-info">
<b>Task 5a:</b> Use a boxplot diagram to graphically analyze whether "SickLeaveHours" contains Outliers.</div>

In [None]:
# Draw a boxplot diagram for "SickLeaveHours"

In [None]:
# Draw a boxplot diagram for "SickLeaveHours"
dirty_employee_dataframe["SickLeaveHours"].plot.box()

<div class="alert alert-block alert-info">
<b>Task 5b:</b> Think about a way to find out which tuple contains the outlier in "SickLeaveHours".</div>

In [None]:
# Output the tuple containing the outlier

In [None]:
# It is clear from the boxplot diagram that the SickLeaveHours of the Outlier are above 2000. This can be used:
dirty_employee_dataframe[dirty_employee_dataframe["SickLeaveHours"] > 2000]

Secondly, let's take a look at "VacationHours".

<div class="alert alert-block alert-info">
<b>Task 6a:</b> Use a boxplot diagram to graphically analyze whether "VacationHours" contains Outliers.</div>

In [None]:
# Draw a boxplot diagram for "VacationHours"

In [None]:
# Draw a boxplot diagram for "VacationHours"
dirty_employee_dataframe["VacationHours"].plot.box()

<div class="alert alert-block alert-info">
<b>Task 6b:</b> Even though the boxplot diagram does not show any outliers, it clearly indicates a possible error in "VacationHours". Which error?</div>

Write down your solution here:

The lower whisker extends below 0. It seems strange that vacation hours can be negative.
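Where the whiskers of a boxplot end can also be checked numerically. The following is a minimal sketch on made-up numbers, assuming the common convention (also the matplotlib default) that whiskers reach to the most extreme values within 1.5 times the interquartile range of the quartiles.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
hours = pd.Series([-8, 5, 10, 20, 30, 40, 50, 60, 99])

q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1

# Fences: points outside this range would be drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The whiskers end at the most extreme values that are still inside the fences
lower_whisker = hours[hours >= lower_fence].min()
upper_whisker = hours[hours <= upper_fence].max()

outliers = hours[(hours < lower_fence) | (hours > upper_fence)]
negative = hours[hours < 0]
print(lower_whisker, upper_whisker, len(outliers), list(negative))
```

In the toy data the negative value lies inside the fences, so it is not flagged as an outlier even though it is implausible. A statistical outlier check therefore does not replace a plausibility check of the value range.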

<div class="alert alert-block alert-info">
<b>Task 6c:</b> Output the affected tuples.</div>

In [None]:
# Output the tuple(s) containing the error

In [None]:
# Output the tuple(s) containing the error
dirty_employee_dataframe[dirty_employee_dataframe["VacationHours"] < 0]

##### 2.2.1.5.  Intentional

There is also an intentional error in the dirty_employee_dataframe. It can be found in either the "NationalIDNumber", "MaritalStatus", "SalariedFlag", "FirstName" or "LastName" attribute.

<div class="alert alert-block alert-info">
<b>Task 7a:</b> Again, independently use your learned skills to search the attributes "NationalIDNumber", "MaritalStatus", "SalariedFlag", "FirstName" and "LastName" for the intentional error.</div>

In [None]:
# Search for the intentional error

In [None]:
# Search for the intentional error
dirty_employee_dataframe[
    ["NationalIDNumber", "MaritalStatus", "SalariedFlag", "FirstName", "LastName"]
].head(25)

<div class="alert alert-block alert-info">
<b>Task 7b:</b> In which of the attributes can the intentional error be found?</div>

<b>The attribute with the intentional error:</b>
1. [ ] NationalIDNumber
2. [ ] MaritalStatus
3. [ ] SalariedFlag
4. [ ] FirstName
5. [ ] LastName

<b>The attribute with the intentional error:</b>
1. [ ] NationalIDNumber
2. [ ] MaritalStatus
3. [ ] SalariedFlag
4. [ ] FirstName
5. [X] LastName

<div class="alert alert-block alert-info">
<b>Task 7c:</b> Why could this error have been built in on purpose?</div>

Write down your solution here:

For data protection reasons, it may sometimes be necessary to anonymize data. While it is rather atypical that this is the case with an employee database, if the data set had been issued for external analysis, for example, and it contained salary data, then one could explain such anonymization. However, it would be quite advantageous if not only this one attribute had been anonymized then.

#### 2.2.2. Data cleaning

The mere detection of dirty data is, of course, only the first step in the data cleaning process. While it is a best case scenario to correct dirty data step by step once it has been identified, this is often a lengthy and difficult process. 

In our example from 2.2.1, only the inconsistencies in "Gender" and "BirthDate" can be fixed quickly.

<div class="alert alert-block alert-info">
<b>Task 1a:</b> Replace all occurrences of "Female" with "F" and all occurrences of "Male" with "M" in the "Gender" attribute of the dirty_employee_dataframe. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html">Pandas documentation</a>)</div>

In [None]:
# Replace "Female" and "Male" values in "Gender"

In [None]:
# Replace "Female" and "Male" values in "Gender"
dirty_employee_dataframe.replace({"Gender": {"Female": "F", "Male": "M"}}, inplace=True)

<div class="alert alert-block alert-info">
    <b>Task 1b:</b> Delete the 00:00:00 suffix in the BirthDate attribute. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html">Pandas documentation</a> - Hint: Use RegEx)</div>

In [None]:
# Delete the 00:00:00 suffix in BirthDate

In [None]:
# Delete the 00:00:00 suffix in BirthDate
dirty_employee_dataframe.replace(
    {"BirthDate": r"\ 00:00:00"}, {"BirthDate": ""}, regex=True, inplace=True
)

If only individual tuples contain an error/outlier and these cannot be manually fixed, the most efficient approach is often to simply remove these tuples from the dataset. 

For example this would apply to the tuples with NationalIDNumber 243322160 and 879342154 in the dirty_employee_dataframe.

<div class="alert alert-block alert-info">
    <b>Task 2:</b> Delete the tuples with NationalIDNumber 243322160 and 879342154. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html">Pandas documentation</a>)</div>

In [None]:
# Delete the tuples with NationalIDNumber 243322160 and 879342154

In [None]:
# Delete the tuples with NationalIDNumber 243322160 and 879342154
dirty_employee_dataframe.drop(
    dirty_employee_dataframe[
        dirty_employee_dataframe.NationalIDNumber == "243322160"
    ].index,
    inplace=True,
)
dirty_employee_dataframe.drop(
    dirty_employee_dataframe[
        dirty_employee_dataframe.NationalIDNumber == "879342154"
    ].index,
    inplace=True,
)

Attributes that do not contain any information should usually be removed from the data set as well.

In this example this is the case with the attribute LastName.

<div class="alert alert-block alert-info">
    <b>Task 3:</b> Delete the attribute LastName. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html">Pandas documentation</a>)</div>

In [None]:
# Delete the attribute LastName

In [None]:
# Delete the attribute LastName
dirty_employee_dataframe.drop(columns=["LastName"], inplace=True)

#### 2.2.3. Data integration

In the context of data integration, we mainly looked at correlation in the lecture. The calculation of this depends on the type of data.

##### 2.2.3.1 Nominal data

One of the two data types we looked at in more detail in the lecture is nominal data. This describes all data that is used to label variables without providing any quantitative value.

The first combination of nominal attributes we will look at in this section is "Gender" and "CurrentFlag". We start by displaying the contingency table for these attributes.

<div class="alert alert-block alert-info">
<b>Task 1:</b> Use the pandas function crosstab() to create a contingency table for the attributes "Gender" and "CurrentFlag". Display it once without and once with subtotals. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html">Pandas documentation</a>)</div>

In [None]:
# Display a contingency table without subtotals

In [None]:
# Display a contingency table with subtotals

In [None]:
# Display a contingency table without subtotals
pd.crosstab(dirty_employee_dataframe["Gender"], dirty_employee_dataframe["CurrentFlag"])

In [None]:
# Display a contingency table with subtotals
pd.crosstab(
    dirty_employee_dataframe["Gender"],
    dirty_employee_dataframe["CurrentFlag"],
    margins=True,
)

The disadvantage of this contingency table is, of course, that only the observed quantities are displayed. For the calculation of the correlation, however, the expected quantities are also important. These can be calculated for example with the expected_freq() function from SciPy.

<div class="alert alert-block alert-info">
<b>Task 2:</b> Use expected_freq() to output the expected quantities for the attributes "Gender" and "CurrentFlag". (Help: <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.expected_freq.html#scipy.stats.contingency.expected_freq">SciPy documentation</a>)</div>

In [None]:
# Display the expected quantities

In [None]:
# Display the expected quantities
# Sample solution 1: Use pd.crosstab
pd_crosstab = pd.crosstab(
    dirty_employee_dataframe["Gender"], dirty_employee_dataframe["CurrentFlag"]
)

# Compute the expected_frequencies
# (it is fine if you just output them - the creation of a pd.DataFrame is just a bonus)
pd_expected_frequencies = scipy.stats.contingency.expected_freq(pd_crosstab)

# Create a pd.DataFrame
pd.DataFrame(
    data=pd_expected_frequencies, index=pd_crosstab.index, columns=pd_crosstab.columns
)

In [None]:
# Sample solution 2: Use scipy.stats.contingency.crosstab
sp_crosstab_elements, sp_crosstab_count = scipy.stats.contingency.crosstab(
    dirty_employee_dataframe["Gender"], dirty_employee_dataframe["CurrentFlag"]
)

# Compute the expected_frequencies
# (it is fine if you just output them - the creation of a pd.DataFrame is just a bonus)
sp_expected_frequencies = scipy.stats.contingency.expected_freq(sp_crosstab_count)

# Create a pd.DataFrame
pd.DataFrame(
    data=sp_expected_frequencies,
    index=sp_crosstab_elements[0],
    columns=sp_crosstab_elements[1],
)

One can see that the expected values and the observed ones are quite close. This indicates that there is probably little correlation between the two attributes.
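The expected quantities follow directly from the row and column totals: each expected count is the row total times the column total, divided by the overall number of tuples. The following is a minimal sketch on a toy contingency table; the counts are made up and are not the ones from the dirty_employee_dataframe.

```python
import pandas as pd

# Toy contingency table of observed counts (hypothetical numbers)
observed = pd.DataFrame(
    [[30, 10], [20, 40]],
    index=["F", "M"],
    columns=[0, 1],
)

row_totals = observed.sum(axis=1)
column_totals = observed.sum(axis=0)
n = observed.to_numpy().sum()

# expected[i, j] = row_totals[i] * column_totals[j] / n
expected = pd.DataFrame(
    [[r * c / n for c in column_totals] for r in row_totals],
    index=observed.index,
    columns=observed.columns,
)
print(expected)
```

In this toy table the observed counts deviate by 10 from every expected count, which would hint at a dependency between the two attributes.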

However, in the lecture, the Chi-squared test was presented as a method to validate this more accurately.

<div class="alert alert-block alert-info">
<b>Task 3a:</b> Use the chi2_contingency function from SciPy to determine the correlation between "Gender" and "CurrentFlag". (Help: <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html#scipy.stats.chi2_contingency">SciPy documentation</a>)</div>

In [None]:
# Compute chi-squared for "Gender" and "CurrentFlag"

In [None]:
# Compute chi-squared for "Gender" and "CurrentFlag"
scipy.stats.chi2_contingency(pd_crosstab)

<div class="alert alert-block alert-info">
<b>Task 3b:</b> Find out what the different values in the above output of chi2_contingency stand for and describe how to interpret them in this case.</div>

Write down your solution here:

- <b>First value (0.0):</b><br />
This value represents the actual Chi-squared value. The closer it is to 0, the less correlation there is between the attributes. Since this value is 0.0, we can assume there is no correlation between "Gender" and "CurrentFlag". (See also the explanation of the second value.)

- <b>Second value (1.0):</b><br />
This is the so-called p-value. It states how likely it would be to observe a chi-squared value at least this large if the two attributes were in fact independent. Only if the p-value is lower than the selected level of statistical significance (usually 0.01, 0.05 or 0.10) can independence be rejected. Since in this case the value is far higher than any commonly selected level, the data provide no evidence of a correlation between "Gender" and "CurrentFlag".

- <b>Third value (1):</b><br />
The third return value describes the degrees of freedom. For a contingency table, this is (number of rows minus one) times (number of columns minus one). With two categories in each attribute (as already known from the contingency table), this yields (2 - 1) · (2 - 1) = 1.

- <b>Last value (Array):</b><br />
This is just another version of the expected values known from Task 2.
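How the first and third value come about can be reproduced by hand. The following is a minimal sketch on a toy 2x2 table with made-up counts; it compares the hand-computed statistic with `chi2_contingency`, once without and once with Yates' continuity correction, which SciPy applies by default to tables with one degree of freedom.

```python
import math
from scipy.stats import chi2_contingency

# Toy 2x2 table of observed counts (hypothetical numbers)
observed = [[30, 10], [20, 40]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_by_hand = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * column_totals[j] / n
        chi2_by_hand += (o - e) ** 2 / e

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
dof_by_hand = (len(observed) - 1) * (len(observed[0]) - 1)

# correction=False switches off Yates' continuity correction
chi2_plain, p_plain, dof, expected = chi2_contingency(observed, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(observed)
print(chi2_by_hand, chi2_plain, chi2_yates, dof)
```

The correction moves each observed count by up to 0.5 towards its expected count, so the corrected statistic is never larger than the uncorrected one. This may also be why a statistic of exactly 0.0 can appear for a 2x2 table whose observed counts are very close to the expected ones.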

Second, let's look at the "Gender" and "SalariedFlag" attributes.

<div class="alert alert-block alert-info">
<b>Task 4a:</b> Using the methods learned above, calculate Chi-squared for "Gender" and "SalariedFlag".</div>

In [None]:
# Compute chi-squared for "Gender" and "SalariedFlag"

In [None]:
# Compute chi-squared for "Gender" and "SalariedFlag"
pd_crosstab_salaried = pd.crosstab(
    dirty_employee_dataframe["Gender"], dirty_employee_dataframe["SalariedFlag"]
)
scipy.stats.chi2_contingency(pd_crosstab_salaried)

<div class="alert alert-block alert-info">
<b>Task 4b:</b> Interpret the chi-squared for "Gender" and "SalariedFlag".</div>

Write down your solution here:

The two most important values for interpretation are the actual chi-squared value and the p-value. 

The p-value in this case is just above 0.05, so the result is not statistically significant at the 0.05 level, although it would be at the 0.10 level. Depending on how certain one wants to be in drawing conclusions, one can therefore either reject or retain the assumption that the two attributes are independent.

The chi-squared value of about 3.65 is clearly above 0, which indicates a noticeably stronger correlation than for the combination "Gender" and "CurrentFlag".

##### 2.2.3.2 Numerical data

The other data type we have considered in the context of Correlation is the numeric data type. Here it is suitable to look at the connection between "VacationHours" and "SickLeaveHours". 

A method of graphical analysis of correlation in numerical data should already be known from Part One.

<div class="alert alert-block alert-info">
<b>Task 1:</b> Draw a scatter plot regarding "VacationHours" and "SickLeaveHours".
</div>

In [None]:
# Draw a scatter plot regarding "VacationHours" and "SickLeaveHours"

In [None]:
# Draw a scatter plot regarding "VacationHours" and "SickLeaveHours"
dirty_employee_dataframe.plot.scatter(x="VacationHours", y="SickLeaveHours")

However, what was not part of the method in Part One is the interpretation of this diagram.

<div class="alert alert-block alert-info">
<b>Task 2:</b> Interpret the scatter plot regarding "VacationHours" and "SickLeaveHours".
</div>

Write down your solution here:

The scatter plot clearly shows a positive correlation between both variables. So it seems that employees who were on vacation more often were also on sick leave more often.

(Caution in interpretation: This may also be due to the fact that both VacationHours and SickLeaveHours were simply added over the contract period. We did not consider this in this analysis).

We can, of course, evaluate this graphical analysis mathematically. In the lecture we used Pearson's product-moment coefficient for this purpose.

<div class="alert alert-block alert-info">
<b>Task 3a:</b> Compute Pearson's product-moment coefficient for "VacationHours" and "SickLeaveHours". Use the pearsonr() function from SciPy. (Help: <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html">SciPy documentation</a>)
</div>

In [None]:
# Compute pearson's product-moment coefficient for "VacationHours" and "SickLeaveHours"

In [None]:
# Compute pearson's product-moment coefficient for "VacationHours" and "SickLeaveHours"
scipy.stats.pearsonr(
    dirty_employee_dataframe["VacationHours"],
    dirty_employee_dataframe["SickLeaveHours"],
)

<div class="alert alert-block alert-info">
<b>Task 3b:</b> Find out what the different values in the above output of pearsonr stand for and describe how to interpret them in this case.</div>

Write down your solution here:

- <b>First value (approx. 0.989):</b><br />
This value represents the actual Pearson's correlation coefficient, which ranges from -1 to 1. A positive value means a positive correlation between the two attributes, and a value this close to 1 indicates a very strong linear relationship. Our graphical analysis is confirmed here.

- <b>Second value (approx. 0.000):</b><br />
Similar to the chi-squared test, this is the p-value. Since this is virtually zero in this case, a high statistical significance can be assumed.
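The coefficient itself can be recomputed without SciPy. The following is a minimal sketch in plain Python: the helper `pearson_r` is a hypothetical name introduced here, and the numbers are made-up toy values, not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment coefficient for two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from the means
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Toy data (hypothetical values)
vacation = [10, 20, 30, 40, 50]
sick_leave = [25, 30, 35, 40, 45]  # exactly linear in vacation
shuffled = [35, 25, 45, 30, 40]  # same values, weaker relationship

r_linear = pearson_r(vacation, sick_leave)
r_weak = pearson_r(vacation, shuffled)
print(r_linear, r_weak)
```

A perfectly linear relationship yields a coefficient of 1, while the shuffled values yield a clearly smaller positive coefficient. Note that the coefficient only measures linear association and says nothing about its cause.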

## 2.3. Part Three: Preprocessing - Data reduction, data transformation & data discretization

<div class="alert alert-block alert-warning">
TODO
</div>