# 2. Data analysis & Preprocessing

In this exercise you will put the basics from the lectures "3. Getting to Know Your Data" and "4. Preprocessing" into practice and apply them yourself.

Since this practice sheet is designed to be used in three sessions, it is roughly divided into three parts:

- Part One: Getting to Know Your Data
- Part Two: Preprocessing - Data cleaning & Data integration
- Part Three: Preprocessing - Data reduction, data transformation & data discretization

Depending on how quickly an exercise group progresses, a part may not be covered completely in its session, or the group may already start on the subsequent part.

## Part One: Getting to Know Your Data

In this part you will apply the theoretical knowledge gained in the lecture "Getting to Know Your Data". In doing so, you will familiarize yourself step by step with the order_dataframe defined in the preparations below.

### Preparations

In [None]:
# Import the required libraries
import tempfile
import sqlite3
import urllib.request
import squarify
import pywaffle
import scipy
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

In [None]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    dataset_folder + "/adventure-works.db",
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(dataset_folder + "/adventure-works.db")

In [None]:
# Create the clean dataframe(s)
# Order dataframe
order_dataframe = pd.read_sql_query(
    "SELECT p.ProductID,p.Name,p.ProductNumber,p.MakeFlag,p.FinishedGoodsFlag,p.Color,p.SafetyStockLevel,"
    "p.ReorderPoint,p.StandardCost,p.ListPrice,p.Size,p.SizeUnitMeasureCode,p.WeightUnitMeasureCode,p.Weight,"
    "p.DaysToManufacture,p.ProductLine,p.Class,p.Style,p.ProductSubcategoryID,p.ProductModelID,p.SellStartDate,"
    "p.SellEndDate,p.DiscontinuedDate,d.PurchaseOrderID,d.PurchaseOrderDetailID,d.DueDate,d.OrderQty,d.ProductID,"
    "d.UnitPrice,d.ReceivedQty,d.RejectedQty,h.RevisionNumber,h.Status,h.EmployeeID,h.VendorID,h.ShipMethodID,"
    "h.OrderDate,h.ShipDate,h.SubTotal,h.TaxAmt,h.Freight,h.TotalDue,e.NationalIDNumber,e.LoginID,e.OrganizationNode,"
    "e.JobTitle,e.BirthDate,e.MaritalStatus,e.Gender,e.HireDate,e.SalariedFlag,e.VacationHours,e.SickLeaveHours,"
    "e.CurrentFlag,r.PersonType,r.NameStyle,r.Title,r.FirstName,r.MiddleName,r.LastName,r.Suffix,r.EmailPromotion,"
    "r.AdditionalContactInfo,r.Demographics "
    "FROM Product p "
    "JOIN PurchaseOrderDetail d ON p.ProductID = d.ProductID "
    "JOIN PurchaseOrderHeader h ON d.PurchaseOrderID = h.PurchaseOrderID "
    "JOIN Employee e ON h.EmployeeID = e.BusinessEntityID "
    "JOIN Person r ON e.BusinessEntityID = r.BusinessEntityID",
    connection,
    index_col="PurchaseOrderDetailID",
)

### Structure of the Dataframe

Currently you don't know anything about the order_dataframe except that it is the result of joining the five tables Product, PurchaseOrderDetail, PurchaseOrderHeader, Employee and Person of a database named AdventureWorks. 
To gain an initial understanding of the structure of the dataframe, it is useful to know its dimensions. This information is stored in the shape property of a pandas dataframe.

<div class="alert alert-block alert-info">
<b>Task:</b> Figure out the dimensions of order_dataframe. You are allowed to have a look at the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html">Pandas documentation</a> regarding the mentioned property.</div>


In [None]:
# Get the shape of the dataframe

In [None]:
# Get the shape of the dataframe
order_dataframe.shape

The output of shape contains two dimensions. It is of course important to know which number is the count of attributes and which number is the count of tuples in our dataframe.

<div class="alert alert-block alert-info">
<b>Task:</b> Use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html">Pandas documentation</a> to find out which number stands for the number of attributes and which for the number of tuples.</div>

<b>The count of attributes:</b>
1. [ ] 8845
2. [ ] 3212
3. [ ] 63
4. [ ] 34

<b>The count of tuples:</b>
1. [ ] 8845
2. [ ] 3212
3. [ ] 63
4. [ ] 34

<b>The count of attributes:</b>
1. [ ] 8845
2. [ ] 3212
3. [X] 63
4. [ ] 34

<b>The count of tuples:</b>
1. [X] 8845
2. [ ] 3212
3. [ ] 63
4. [ ] 34
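To double-check which entry of shape is which, a small toy dataframe can help. The following is only a sketch on made-up data; the name toy_dataframe and its values are assumptions for illustration and not part of the AdventureWorks data.

```python
import pandas as pd

# Hypothetical toy dataframe with 4 tuples (rows) and 2 attributes (columns)
toy_dataframe = pd.DataFrame(
    {"OrderQty": [3, 3, 550, 550], "RejectedQty": [0, 0, 0, 82]}
)

# shape is a tuple of the form (number of rows, number of columns)
tuple_count, attribute_count = toy_dataframe.shape
print("Tuples:", tuple_count)
print("Attributes:", attribute_count)
```

The first entry counts the rows (tuples) and the second one the columns (attributes). The index column is not counted as an attribute.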

Now that we know the count of tuples and attributes in our dataframe, we still do not know what data the data set contains.

To get a first impression in this respect, it can be useful to look at (a sample of) the dataframe. Arguably the simplest way to do this is the print() function.

In [None]:
# Print the order_dataframe
print(order_dataframe)

However, as you can see, this method produces plain-text output without any specific layout, and pandas abbreviates it for large dataframes. This makes the result hard to read, especially with very large dataframes, and is therefore not recommended. It is far more common to use the dataframe member function head().

<div class="alert alert-block alert-info">
<b>Task:</b> Use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html">Pandas documentation</a> to familiarize yourself with head(), then apply it to the order_dataframe so that the first 10 tuples are displayed.</div>

In [None]:
# Use the head() function on the order_dataframe while setting the number of rows displayed to 10

In [None]:
# Use the head() function on the order_dataframe while setting the number of rows displayed to 10
order_dataframe.head(10)

As you can see, the representation by head() is easier to read. However, head() also has its limitations. For example, in this case we do not get all columns displayed.
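How many columns are shown is controlled by the pandas display option display.max_columns. The sketch below uses a made-up wide dataframe (wide_dataframe and its attribute_* columns are hypothetical) to show how this option can be changed temporarily with pd.option_context.

```python
import pandas as pd

# Hypothetical wide dataframe with 30 attributes and 3 tuples
wide_dataframe = pd.DataFrame(
    {"attribute_" + str(i): [i, i + 1, i + 2] for i in range(30)}
)

# With a limit of 6 columns, the middle columns are replaced by "..."
with pd.option_context("display.max_columns", 6):
    truncated_output = repr(wide_dataframe.head(2))

# With the limit set to None, every column is part of the output
with pd.option_context("display.max_columns", None):
    full_output = repr(wide_dataframe.head(2))

print(truncated_output)
print(full_output)
```

Because option_context is used as a context manager, the previous setting is restored as soon as the with block ends.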

<div class="alert alert-block alert-info">
<b>Task:</b> All attributes of a dataframe are stored in the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html">member variable columns</a>. Use this information to output a list of all attributes contained in order_dataframe. No special formatting is required, but make sure that this time all column identifiers appear in the output.
</div>

In [None]:
# Output a list of all columns

In [None]:
# There are multiple possible solutions, e.g.:
# Sample solution 1: Iterate over the columns
for column in order_dataframe.columns:
    print(column, end=",")

In [None]:
# Sample solution 2: Use list()
print(list(order_dataframe.columns))

For example, we did not see the columns "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class" and "Style" in the above execution of head().  

<div class="alert alert-block alert-info">
<b>Task:</b> Show the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class", "Style" for the first 10 tuples. (Help: <a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html">Pandas tutorial on selecting subsets</a>)
</div>

In [None]:
# Print the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class"
# and "Style" of the first ten tuples

In [None]:
# There are multiple possible solutions, e.g.:
# Sample solution 1: Explicit naming of the identifiers
order_dataframe[
    [
        "WeightUnitMeasureCode",
        "Weight",
        "DaysToManufacture",
        "ProductLine",
        "Class",
        "Style",
    ]
].head(10)

In [None]:
# Sample solution 2: Using the columns attribute
order_dataframe[order_dataframe.columns[12:18]].head(10)

Of course, it is a pity that all the attribute values shown are `0` in the "DaysToManufacture" attribute and `None` in the "ProductLine" attribute. However, this does not mean that this is the case for the entire dataframe.

<div class="alert alert-block alert-info">
<b>Task:</b> Save all tuples with "DaysToManufacture" higher than 0 into a new dataframe called order_dtm_not_zero_dataframe. (Help: <a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html">Pandas tutorial on selecting subsets</a>)
</div>

In [None]:
# Select all tuples with "DaysToManufacture" > 0

In [None]:
# Select all tuples with "DaysToManufacture" > 0
order_dtm_not_zero_dataframe = order_dataframe[order_dataframe["DaysToManufacture"] > 0]

<div class="alert alert-block alert-info">
<b>Task:</b> Display the columns "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class" and "Style" of the new order_dtm_not_zero_dataframe. Limit the output to 10 tuples. (Help: <a href="https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html">Pandas tutorial on selecting subsets</a>)
</div>

In [None]:
# Print the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class"
# and "Style" of the first ten tuples of the new order_dtm_not_zero_dataframe

In [None]:
# Print the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class"
# and "Style" of the first ten tuples of the new order_dtm_not_zero_dataframe
order_dtm_not_zero_dataframe[
    [
        "WeightUnitMeasureCode",
        "Weight",
        "DaysToManufacture",
        "ProductLine",
        "Class",
        "Style",
    ]
].head(10)

You have now received a first impression of the data. The methods presented to you here can of course also be applied to any other attributes and tuples.
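Row filtering and column selection can also be combined in a single step with .loc. The following sketch uses a made-up dataframe (toy_dataframe and its values are hypothetical) and is only one of several possible ways to do this.

```python
import pandas as pd

# Hypothetical toy dataframe, loosely modelled on the product attributes
toy_dataframe = pd.DataFrame(
    {
        "Name": ["Chain", "Seat", "Pedal", "Frame"],
        "DaysToManufacture": [0, 1, 0, 1],
        "Weight": [None, 2.5, 0.4, 9.0],
    }
)

# Boolean mask: True for every tuple that satisfies the condition
mask = toy_dataframe["DaysToManufacture"] > 0

# .loc takes the row selection first and the column selection second
subset = toy_dataframe.loc[mask, ["Name", "Weight"]]
print(subset)
```

The same pattern works for any other condition and any other list of attributes.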

<div class="alert alert-block alert-info">
<b>Task:</b> Consider the limitations of the methods presented in this section and give two examples (related to the data set at hand) of what information would be difficult to find out using only the methodology presented here.
</div>

Write down your solution here:


The methodology presented here is only used to get a first impression of the data. It is largely raw data that a human can only ever capture in part. For example, it would be difficult to answer the following questions using only this methodology:

- What is the longest available production period in our database?
- What is the average price of our products?
- What is the most frequently ordered item?

### Basic Statistical Descriptors

In this section we take a closer look at the attributes "ReorderPoint", "DaysToManufacture", and "UnitPrice". 

In [None]:
# Print the attributes "ReorderPoint", "DaysToManufacture", and "UnitPrice" of the first ten tuples
order_dataframe[["ReorderPoint", "DaysToManufacture", "UnitPrice"]].head(10)

It makes sense to first look at the simple statistical values "Mean", "Median", "Min", "Max", "Mode" and "NUnique" to get a rough estimate of these attributes. 

<div class="alert alert-block alert-info">
<b>Task:</b> Determine the mean for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html">Pandas documentation</a>)
</div>

In [None]:
# Output the mean of "ReorderPoint"

In [None]:
# Output the mean of "DaysToManufacture"

In [None]:
# Output the mean of "UnitPrice"

In [None]:
# Output the mean of "ReorderPoint"
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].mean()))

In [None]:
# Output the mean of "DaysToManufacture"
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].mean()))

In [None]:
# Output the mean of "UnitPrice"
print("UnitPrice: " + str(order_dataframe["UnitPrice"].mean()))

<div class="alert alert-block alert-info">
<b>Task:</b> Determine the median for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.median.html">Pandas documentation</a>)
</div>

In [None]:
# Output the median of "ReorderPoint"

In [None]:
# Output the median of "DaysToManufacture"

In [None]:
# Output the median of "UnitPrice"

In [None]:
# Output the median of "ReorderPoint"
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].median()))

In [None]:
# Output the median of "DaysToManufacture"
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].median()))

In [None]:
# Output the median of "UnitPrice"
print("UnitPrice: " + str(order_dataframe["UnitPrice"].median()))

<div class="alert alert-block alert-info">
<b>Task:</b> Determine the min for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html">Pandas documentation</a>)
</div>

In [None]:
# Output the min of "ReorderPoint"

In [None]:
# Output the min of "DaysToManufacture"

In [None]:
# Output the min of "UnitPrice"

In [None]:
# Output the min of "ReorderPoint"
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].min()))

In [None]:
# Output the min of "DaysToManufacture"
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].min()))

In [None]:
# Output the min of "UnitPrice"
print("UnitPrice: " + str(order_dataframe["UnitPrice"].min()))

<div class="alert alert-block alert-info">
<b>Task:</b> Determine the max for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html">Pandas documentation</a>)
</div>

In [None]:
# Output the max of "ReorderPoint"

In [None]:
# Output the max of "DaysToManufacture"

In [None]:
# Output the max of "UnitPrice"

In [None]:
# Output the max of "ReorderPoint"
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].max()))

In [None]:
# Output the max of "DaysToManufacture"
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].max()))

In [None]:
# Output the max of "UnitPrice"
print("UnitPrice: " + str(order_dataframe["UnitPrice"].max()))

<div class="alert alert-block alert-info">
<b>Task:</b> Determine the mode for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html">Pandas documentation</a>)
</div>

In [None]:
# Output the mode of "ReorderPoint"

In [None]:
# Output the mode of "DaysToManufacture"

In [None]:
# Output the mode of "UnitPrice"

In [None]:
# Output the mode of "ReorderPoint"
order_dataframe["ReorderPoint"].mode()

In [None]:
# Output the mode of "DaysToManufacture"
order_dataframe["DaysToManufacture"].mode()

In [None]:
# Output the mode of "UnitPrice"
order_dataframe["UnitPrice"].mode()

<div class="alert alert-block alert-info">
<b>Task:</b> Determine the nunique for each of the three attributes and output it. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html">Pandas documentation</a>)
</div>

In [None]:
# Output the nunique of "ReorderPoint"

In [None]:
# Output the nunique of "DaysToManufacture"

In [None]:
# Output the nunique of "UnitPrice"

In [None]:
# Output the nunique of "ReorderPoint"
print("ReorderPoint: " + str(order_dataframe["ReorderPoint"].nunique()))

In [None]:
# Output the nunique of "DaysToManufacture"
print("DaysToManufacture: " + str(order_dataframe["DaysToManufacture"].nunique()))

In [None]:
# Output the nunique of "UnitPrice"
print("UnitPrice: " + str(order_dataframe["UnitPrice"].nunique()))

Even though the methods used here are fairly simple statistical methods, they can already tell us quite a bit about our data. 
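Several of these descriptors can also be computed in one call with agg(). The sketch below runs on made-up values (toy_dataframe is hypothetical and does not reproduce the real statistics of order_dataframe).

```python
import pandas as pd

# Hypothetical toy data for two of the attributes
toy_dataframe = pd.DataFrame(
    {
        "ReorderPoint": [3, 375, 750, 750, 750],
        "DaysToManufacture": [0, 0, 0, 1, 0],
    }
)

# One row per descriptor, one column per attribute
summary = toy_dataframe.agg(["mean", "median", "min", "max", "nunique"])
print(summary)

# mode() may return several values per attribute, so it is computed separately
modes = toy_dataframe.mode()
print(modes)
```

Keeping the mode separate avoids problems with attributes that have more than one most frequent value.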

<div class="alert alert-block alert-info">
<b>Task:</b> Consider and describe what can be said about the "ReorderPoint" attribute based on the statistical values obtained.
</div>

Write down your solution here:


The attribute "ReorderPoint" has only five different values, which are distributed between the lowest value 3 and the highest value 750. It can therefore be assumed that there are sometimes large ranges between the individual values and that the attribute is most likely a discrete numeric attribute.

The mean of approx. 589 and the median of 750 clearly indicate that many values are probably located in the higher part of the value range. The fact that the mode is 750 further supports this assumption.
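The relationship between mean and median used in this argument can be reproduced on a small example. The values below are made up (they are not the real "ReorderPoint" values) and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical values: most of the mass sits at the upper end of the range
values = pd.Series([3, 3, 375, 750, 750, 750, 750])

mean_value = values.mean()
median_value = values.median()
skewness = values.skew()

print("Mean:", mean_value)
print("Median:", median_value)
print("Skewness:", skewness)
```

A few small values pull the mean below the median, and the negative skewness points in the same direction: the bulk of the values lies in the upper part of the range.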

<div class="alert alert-block alert-info">
<b>Task:</b> Consider and describe what can be said about the "DaysToManufacture" attribute based on the statistical values obtained.
</div>

Write down your solution here:


In contrast to "ReorderPoint", "DaysToManufacture" apparently has only two different values. These seem to be the values "0" and "1", whereby both mean and median, as well as mode show that the value "0" is the more frequent of the two.

<div class="alert alert-block alert-info">
<b>Task:</b> Consider and describe what can be said about the "UnitPrice" attribute based on the statistical values obtained.
</div>

Write down your solution here:


The "UnitPrice" attribute, with its 177 different values ranging from 0.21 up to 82.8345, appears to be much less discrete than the other two attributes. It has two values that occur most frequently, 31.4895 and 48.2895. In general, this, along with the mean of about 34.743 and the median of 39.2805, suggests that most values are likely to be found in the middle of the range of values. However, it is possible that there are many values in the very low range as well as in the very high range, but they simply balance each other out.

The distribution of the individual values within the attributes is something we cannot determine with these simple statistical values. One way to get more information about this is a histogram.

<div class="alert alert-block alert-info">
<b>Task:</b> Draw a histogram for "ReorderPoint". Consider what number of bins might be appropriate. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html">Pandas documentation</a>)
</div>

In [None]:
# Histogram for "ReorderPoint"

In [None]:
# Histogram for "ReorderPoint"
# 20 bins as the two lower values are too close to each other to be separated otherwise
order_dataframe["ReorderPoint"].plot.hist(bins=20, rwidth=0.8)

# Note: The rwidth parameter is not required, but it helps to better separate two buckets
# that are directly next to each other. If you are very precise, then this parameter can
# lead to the histogram no longer being completely correct, since a gap possibly might occur
# exactly above an attribute value. Here it should always be weighed up whether readability
# of the diagram or strict correctness should have priority.
# Another alternative would be to use an edge color that stands out, e.g. by passing
# edgecolor="black", which pandas forwards to Matplotlib's hist().

<div class="alert alert-block alert-info">
<b>Task:</b> Draw a histogram for "DaysToManufacture". Consider what number of bins might be appropriate. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html">Pandas documentation</a>)
</div>

In [None]:
# Histogram for "DaysToManufacture"

In [None]:
# Histogram for "DaysToManufacture"
# 2 bins as there are only 2 unique values
order_dataframe["DaysToManufacture"].plot.hist(bins=2, rwidth=0.8)

<div class="alert alert-block alert-info">
<b>Task:</b> Draw a histogram for "UnitPrice". Consider what number of bins might be appropriate. (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html">Pandas documentation</a>)
</div>

In [None]:
# Histogram for "UnitPrice"

In [None]:
# Histogram for "UnitPrice"
# 20 bins as 177 would not be displayed very well
order_dataframe["UnitPrice"].plot.hist(bins=20, rwidth=0.8)

It can be seen that it does not always make sense to choose the number of bins exactly like the number of unique values. 
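The effect of the bin count can also be checked numerically. The following sketch uses NumPy's histogram function with equal-width bins on made-up values (the array values is hypothetical) to show when two nearby values fall into the same bin.

```python
import numpy as np

# Made-up values: two small values close together, the rest far apart
values = np.array([3, 3, 30, 375, 750, 750, 750])

# 5 equal-width bins: 3 and 30 end up in the same (first) bin
counts_coarse, edges_coarse = np.histogram(values, bins=5)

# 50 equal-width bins: 3 and 30 are counted in different bins
counts_fine, edges_fine = np.histogram(values, bins=50)

print("First bin with 5 bins:", counts_coarse[0])
print("First two bins with 50 bins:", counts_fine[:2])
```

The bin width follows from the value range divided by the number of bins, so values that are closer together than one bin width can be merged.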

Apart from that, we have to note that, especially in the case of "UnitPrice", insight is lost by merging multiple values into one bin. In such a case it can also be useful to look at a boxplot and a density curve instead of a histogram.
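A boxplot is built from a few quantiles, which can also be computed directly. The sketch below uses made-up prices (the series prices is hypothetical) and the commonly used 1.5 × IQR rule for the whiskers, which is also Matplotlib's default.

```python
import pandas as pd

# Hypothetical prices with one obviously extreme value
prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Quartiles and interquartile range (IQR)
q1 = prices.quantile(0.25)
median = prices.median()
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values beyond these fences are drawn as individual outlier points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers.tolist())
```

The box spans Q1 to Q3, the line inside the box marks the median, and the whiskers extend to the most extreme values that still lie within the fences.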

<div class="alert alert-block alert-info">
<b>Task:</b> Draw a boxplot diagram for "UnitPrice". (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.box.html">Pandas documentation</a>)
</div>


In [None]:
# Boxplot for "UnitPrice"

In [None]:
# Boxplot for "UnitPrice"
order_dataframe["UnitPrice"].plot.box()

<div class="alert alert-block alert-info">
<b>Task:</b> Draw a density diagram for "UnitPrice". (Help: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.density.html">Pandas documentation</a>)
</div>


In [None]:
# Density curve for "UnitPrice"

In [None]:
# Density curve for "UnitPrice"
order_dataframe["UnitPrice"].plot.density()

### Data Visualization

While simple chart types such as histograms, boxplots and density plots can already be implemented directly with pandas, a variety of libraries are available for more advanced visualization techniques. In this section you will use four of these libraries for different diagram types from the lecture.

#### Pixel Oriented Visualization Techniques:  Polar Plot - Plotly

In the lecture you were introduced to polar plots, also called radar charts. In this type of chart, you look at multiple attributes of a single tuple.

<div class="alert alert-block alert-info">
<b>Task:</b> Use Plotly Express to output the prepared tuple as a polar plot. (Help: <a href="https://plotly.com/python/radar-chart/">Plotly Documentation</a> and <a href="https://plotly.com/python-api-reference/generated/plotly.express.line_polar.html">Plotly API Reference</a>)
</div>

In [None]:
# Set the attributes to select
attributes = ["OrderQty", "ReceivedQty", "RejectedQty"]

# Prepare tuple_0 (the tuple at position 0 in order_dataframe)
tuple_0 = pd.DataFrame(
    dict(r=order_dataframe[attributes].iloc[0].values, theta=attributes)
)

In [None]:
# Draw a polar diagram for tuple_0

In [None]:
# Draw a polar diagram for tuple_0
fig = px.line_polar(tuple_0, r="r", theta="theta", line_close=True)
fig.update_traces(fill="toself")
fig.show()

Even though the Polar Plot is initially only about a single tuple, it is of course also possible to display several tuples in the same diagram.

<div class="alert alert-block alert-info">
<b>Task:</b> Use Plotly's graph object Scatterpolar to output the prepared tuples within one polar plot. (Help: <a href="https://plotly.com/python/radar-chart/">Plotly Documentation</a> and <a href="https://plotly.com/python-api-reference/generated/plotly.graph_objects.scatterpolar.html?highlight=scatterpolar#module-plotly.graph_objects.scatterpolar">Plotly API Reference</a>)
</div>

In [None]:
# Set the attributes to select
attributes = ["OrderQty", "ReceivedQty", "RejectedQty"]

# Prepare tuple_6, tuple_77, tuple_82
tuple_6 = pd.DataFrame(
    dict(r=order_dataframe[attributes].iloc[6].values, theta=attributes)
)
tuple_77 = pd.DataFrame(
    dict(r=order_dataframe[attributes].iloc[77].values, theta=attributes)
)
tuple_82 = pd.DataFrame(
    dict(r=order_dataframe[attributes].iloc[82].values, theta=attributes)
)

In [None]:
# Draw a single polar diagram for tuple_6, tuple_77 and tuple_82

In [None]:
# Draw a single polar diagram for tuple_6, tuple_77 and tuple_82
fig = go.Figure()

fig.add_trace(
    go.Scatterpolar(
        r=tuple_6["r"],
        theta=tuple_6["theta"],
        mode="lines",
        fill="tonext",
        name="tuple_6",
        line_color="peru",
    )
)
fig.add_trace(
    go.Scatterpolar(
        r=tuple_77["r"],
        theta=tuple_77["theta"],
        mode="lines",
        fill="tonext",
        name="tuple_77",
        line_color="deepskyblue",
    )
)
fig.add_trace(
    go.Scatterpolar(
        r=tuple_82["r"],
        theta=tuple_82["theta"],
        mode="lines",
        fill="tonext",
        name="tuple_82",
        line_color="seagreen",
    )
)


fig.update_layout(showlegend=True)

fig.show()

#### Geometric Projection Visualization Techniques:  Scatter Plot Matrices - Seaborn

One type of graph that we did not consider in 2.1.2 is the scatter plot. This is because this chapter is well suited for directly contrasting scatter plots and scatter plot matrices.

<div class="alert alert-block alert-info">
<b>Task:</b> Use the internal pandas functions to create a scatter plot of the "OrderQty" and "ReceivedQty" attributes in order_dataframe. (Help: <a href="https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.plot.scatter.html">Pandas Documentation</a>)
</div>



In [None]:
# Draw a scatter plot regarding "OrderQty" and "ReceivedQty"

In [None]:
# Draw a scatter plot regarding "OrderQty" and "ReceivedQty"
order_dataframe.plot.scatter(x="OrderQty", y="ReceivedQty")

Within a scatter plot, exactly two attributes are always displayed on different axes. In order to compare more than two attributes with each other, a scatter plot matrix is often used. 
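A numeric counterpart to the scatter plot matrix is the matrix of pairwise correlations, which has the same layout of one cell per attribute pair. The sketch below computes it on made-up quantities (toy_dataframe and its values are hypothetical).

```python
import pandas as pd

# Hypothetical order quantities
toy_dataframe = pd.DataFrame(
    {
        "OrderQty": [3, 3, 550, 550, 1250],
        "ReceivedQty": [3, 3, 550, 468, 1250],
        "RejectedQty": [0, 0, 0, 82, 0],
    }
)

# Pairwise Pearson correlation between all numeric attributes
correlation_matrix = toy_dataframe.corr()
print(correlation_matrix)
```

The matrix is symmetric with ones on the diagonal, just as the scatter plot matrix mirrors each attribute pair across its diagonal.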

<div class="alert alert-block alert-info">
<b>Task:</b> Use the PairPlot from the Seaborn library to create a scatter plot matrix of the attributes "OrderQty", "ReceivedQty" and "RejectedQty" in order_dataframe. (Help: <a href="https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot">Seaborn Documentation</a>)
</div>

In [None]:
# Draw a scatter plot matrix regarding "OrderQty", "ReceivedQty" and "RejectedQty"

In [None]:
# Draw a scatter plot matrix regarding "OrderQty", "ReceivedQty" and "RejectedQty"
sns.pairplot(
    order_dataframe[["OrderQty", "ReceivedQty", "RejectedQty"]], diag_kind="kde"
)

#### Icon Based Visualization:  Icon Diagram - PyWaffle

Icon Based Diagrams are often used to show percentage relationships in an understandable way.

For example, they can be used to highlight the gender distribution of employees working with our orders.
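The percentage relationships that such a diagram visualizes can be computed beforehand. The following sketch works on a made-up gender column (the series genders is hypothetical and does not reflect the real employee data).

```python
import pandas as pd

# Hypothetical gender column with one entry per employee
genders = pd.Series(["M", "M", "M", "F"], name="Gender")

# Relative frequency of each category (the shares sum up to 1)
shares = genders.value_counts(normalize=True)

# The same information expressed as percentages
percentages = (shares * 100).round(1)
print(percentages)
```

In an icon diagram, each category then receives a number of icons proportional to its share.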

In [None]:
# Get unique employees
orders_per_employee_dataframe = (
    order_dataframe.groupby(
        ["EmployeeID", "FirstName", "MiddleName", "LastName", "Gender"]
    )
    .size()
    .reset_index(name="Orders")
)

# Sort the dataframe
orders_per_employee_dataframe.sort_values("Orders", ascending=False, inplace=True)

# Show the head of the orders_per_employee_dataframe
orders_per_employee_dataframe.head(15)

In [None]:
# Get the count of employees per gender
employees_per_gender_dataframe = (
    orders_per_employee_dataframe.groupby(["Gender"]).size().reset_index(name="Count")
)

# Print the head of this new dataframe
employees_per_gender_dataframe.head(10)

<div class="alert alert-block alert-info">
<b>Task:</b> Use PyWaffle to draw an icon diagram regarding the genders of the employees. (Help: <a href="https://pywaffle.readthedocs.io/en/latest/examples/plot_with_characters_or_icons.html">PyWaffle Documentation</a>)
</div>

In [None]:
# Save the counts into separate variables
# (select the "Count" value of the row with the matching gender)
count_male_employees = employees_per_gender_dataframe.loc[
    employees_per_gender_dataframe["Gender"] == "M", "Count"
].iloc[0]
count_female_employees = employees_per_gender_dataframe.loc[
    employees_per_gender_dataframe["Gender"] == "F", "Count"
].iloc[0]

In [None]:
# Draw the icon diagram

In [None]:
# Draw the icon diagram
plt.figure(
    FigureClass=pywaffle.Waffle,
    rows=2,
    values=[count_male_employees, count_female_employees],
    colors=["#0000ff", "#ff0084"],
    icons=["male", "female"],
    font_size=30,
    icon_legend=True,
    legend={
        "labels": ["male", "female"],
        "loc": "upper left",
        "bbox_to_anchor": (1, 1),
    },
)
plt.tight_layout()

#### Icon Based Visualization:  Tree Maps - Squarify

Even better than Icon Diagrams, distributions can be displayed using tree maps. 

For example, a tree map can be used to display the count of orders processed per employee.

In [None]:
# Print the head of orders_per_employee (again)
orders_per_employee_dataframe.head(15)

<div class="alert alert-block alert-info">
<b>Task:</b> Use Squarify to draw a tree map displaying the count of orders processed per employee. (Help: <a href="https://github.com/laserson/squarify#Usage">Squarify Documentation</a>)
</div>

In [None]:
# Draw the tree map

In [None]:
# Draw the tree map
fig, ax = plt.subplots(1, figsize=(12, 12))
squarify.plot(
    sizes=orders_per_employee_dataframe["Orders"],
    label=[
        "Employee #"
        + str(employee[1]["EmployeeID"])
        + ":\n"
        + employee[1]["FirstName"]
        + " "
        + employee[1]["MiddleName"]
        + ". "
        + employee[1]["LastName"]
        + "\n\nOrders:\n"
        + str(employee[1]["Orders"])
        for employee in orders_per_employee_dataframe.iterrows()
    ],
    pad=True,
)
plt.axis("off")
plt.show()

### Data similarity

Data similarity is often measured in data science by the distance between two data objects. There are many different methods to determine distances between different types of data. 

In this chapter, we will focus on the distance between numeric data. The two tuples we are comparing are tuple_77 and tuple_82 we introduced in a previous section.

In [None]:
print(tuple_77["r"])

In [None]:
print(tuple_82["r"])

One of the most commonly used methods for calculating the distance between numerical values is the Manhattan distance, also called the L1 norm or City Block.
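The Manhattan distance is the sum of the absolute differences of all components. A minimal pure-Python sketch, where the function manhattan_distance and the two example vectors are hypothetical names and values chosen for illustration:

```python
def manhattan_distance(x, y):
    """Sum of the absolute component-wise differences (L1 norm of x - y)."""
    if len(x) != len(y):
        raise ValueError("Both vectors need the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical example vectors
vector_a = [1, 2, 3]
vector_b = [4, 0, 3]

# |1 - 4| + |2 - 0| + |3 - 3|
distance = manhattan_distance(vector_a, vector_b)
print(distance)
```

Every component contributes linearly, so one large difference counts exactly as much as several small differences of the same total size.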

<div class="alert alert-block alert-info">
<b>Task:</b> Use the cityblock function of SciPy to compute the distance between tuple_77 and tuple_82. (Help: <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cityblock.html#scipy.spatial.distance.cityblock">SciPy Documentation</a>)
</div>

In [None]:
# Use cityblock() to compute the distance

In [None]:
# Use cityblock() to compute the distance
scipy.spatial.distance.cityblock(tuple_77["r"], tuple_82["r"])

A second common variant is the Euclidean distance, also called L2 norm.
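The Euclidean distance is the square root of the sum of the squared differences of all components. A minimal pure-Python sketch, where the function euclidean_distance and the example vectors are hypothetical:

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences (L2 norm of x - y)."""
    if len(x) != len(y):
        raise ValueError("Both vectors need the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical example vectors
vector_a = [1, 2, 3]
vector_b = [4, 6, 3]

# sqrt((1 - 4)^2 + (2 - 6)^2 + (3 - 3)^2)
distance = euclidean_distance(vector_a, vector_b)
print(distance)
```

Because the differences are squared before they are summed, large differences in a single component weigh more heavily than in the Manhattan distance.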

<div class="alert alert-block alert-info">
<b>Task:</b> Use the euclidean function of SciPy to compute the distance between tuple_77 and tuple_82. (Help: <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean">SciPy Documentation</a>)
</div>

In [None]:
# Use euclidean() to compute the distance

In [None]:
# Use euclidean() to compute the distance
scipy.spatial.distance.euclidean(tuple_77["r"], tuple_82["r"])

Apparently, for tuple_77 and tuple_82, the Manhattan distance and the Euclidean distance do not differ. But what about the following two lists?

In [None]:
# Lists to compare
list_1 = [550, 550, 82]
list_2 = [82, 550, 550]

<div class="alert alert-block alert-info">
<b>Task:</b> Compute the Manhattan distance and the Euclidean distance between list_1 and list_2. 
</div>

In [None]:
# Compute the manhattan distance

In [None]:
# Compute the euclidean distance

In [None]:
# Compute the manhattan distance
scipy.spatial.distance.cityblock(list_1, list_2)

In [None]:
# Compute the euclidean distance
scipy.spatial.distance.euclidean(list_1, list_2)

<div class="alert alert-block alert-info">
<b>Task:</b> Consider and explain why the distance measures did not differ for tuple_77 and tuple_82, but did for list_1 and list_2.
</div>

Write down your solution here:


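For tuple_77 and tuple_82 the two distance measures coincide, which happens exactly when the compared tuples differ in at most one attribute: with a single non-zero difference d, both measures evaluate to |d|. list_1 and list_2, in contrast, differ in two positions (the first and the last), each by 468. The Manhattan distance adds these absolute differences up to 468 + 468 = 936, while the Euclidean distance squares them, sums them up and takes the square root, which gives approx. 661.85. In general, the Euclidean distance is never larger than the Manhattan distance, and it is strictly smaller as soon as more than one component differs.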
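The relationship between the two measures can be checked with NumPy's vector norms. In the sketch below, list_1 and list_2 are taken from the cell above, while single_a and single_b are a made-up pair that differs in only one component.

```python
import numpy as np

# The two lists from above
list_1 = np.array([550, 550, 82])
list_2 = np.array([82, 550, 550])
difference = list_1 - list_2  # two non-zero components

l1 = np.linalg.norm(difference, ord=1)  # sum of absolute differences
l2 = np.linalg.norm(difference, ord=2)  # root of summed squared differences
print("Two differing components:", l1, l2)

# Hypothetical pair that differs in exactly one component
single_a = np.array([550, 550, 0])
single_b = np.array([550, 468, 0])
single_difference = single_a - single_b

single_l1 = np.linalg.norm(single_difference, ord=1)
single_l2 = np.linalg.norm(single_difference, ord=2)
print("One differing component:", single_l1, single_l2)
```

With one differing component both norms reduce to the absolute value of that difference, whereas with two or more differing components the L2 norm is strictly smaller than the L1 norm.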

Both the Manhattan distance and the Euclidean distance are special forms of another distance. Do you remember which one?

<div class="alert alert-block alert-info">
<b>Task:</b> Compute the Manhattan distance and the Euclidean distance between list_1 and list_2 without using cityblock() or euclidean(). (Hint: The required distance measure is also part of <a href="https://docs.scipy.org/doc/scipy/reference/spatial.distance.html">scipy.spatial.distance</a>)
</div>

In [None]:
# Compute the manhattan distance without cityblock()

In [None]:
# Compute the euclidean distance without euclidean()

In [None]:
# Compute the manhattan distance without cityblock()
scipy.spatial.distance.minkowski(list_1, list_2, 1)

In [None]:
# Compute the euclidean distance without euclidean()
scipy.spatial.distance.minkowski(list_1, list_2, 2)