# Exercise 1: Getting to Know Your Data

For this exercise, we require a series of libraries. They are imported by executing the following code cell.

In [None]:
# Import required libraries
import os
import tempfile
import sqlite3
import urllib.request
import squarify
import pywaffle
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

In addition to the libraries, we also need data with which we can familiarize ourselves during the exercise. For this purpose, we use the so-called AdventureWorks database, which is modeled after a fictional shop.

The following code cells download the database from GitHub (first cell) and then import the data required for this exercise into a pandas `DataFrame` (second cell).

In [None]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Build path to database
database_path = os.path.join(dataset_folder, "adventure-works.db")

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    database_path,
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(database_path)

In [None]:
# Create the clean DataFrame(s)
# Order DataFrame
order_df = pd.read_sql_query(
    "SELECT p.ProductID,p.Name,p.ProductNumber,p.MakeFlag,p.FinishedGoodsFlag,p.Color,p.SafetyStockLevel,"
    "p.ReorderPoint,p.StandardCost,p.ListPrice,p.Size,p.SizeUnitMeasureCode,p.WeightUnitMeasureCode,p.Weight,"
    "p.DaysToManufacture,p.ProductLine,p.Class,p.Style,p.ProductSubcategoryID,p.ProductModelID,p.SellStartDate,"
    "p.SellEndDate,p.DiscontinuedDate,d.PurchaseOrderID,d.PurchaseOrderDetailID,d.DueDate,d.OrderQty,d.ProductID,"
    "d.UnitPrice,d.ReceivedQty,d.RejectedQty,h.RevisionNumber,h.Status,h.EmployeeID,h.VendorID,h.ShipMethodID,"
    "h.OrderDate,h.ShipDate,h.SubTotal,h.TaxAmt,h.Freight,h.TotalDue,e.NationalIDNumber,e.LoginID,e.OrganizationNode,"
    "e.JobTitle,e.BirthDate,e.MaritalStatus,e.Gender,e.HireDate,e.SalariedFlag,e.VacationHours,e.SickLeaveHours,"
    "e.CurrentFlag,r.PersonType,r.NameStyle,r.Title,r.FirstName,r.MiddleName,r.LastName,r.Suffix,r.EmailPromotion,"
    "r.AdditionalContactInfo,r.Demographics "
    "FROM Product p "
    "JOIN PurchaseOrderDetail d ON p.ProductID = d.ProductID "
    "JOIN PurchaseOrderHeader h ON d.PurchaseOrderID = h.PurchaseOrderID "
    "JOIN Employee e ON h.EmployeeID = e.BusinessEntityID "
    "JOIN Person r ON e.BusinessEntityID = r.BusinessEntityID",
    connection,
    index_col="PurchaseOrderDetailID",
)

So far, we know nothing about the `DataFrame` `order_df`. 
To get an initial understanding of its structure, it helps to know its dimensions. This information is stored in the `shape` property of a pandas `DataFrame`.

<div class="alert alert-block alert-info">

**Task 1:** 
    
Figure out the shape of `order_df`. You may take a look at the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) regarding the mentioned property.
    
</div>


In [None]:
# Get the shape of the DataFrame

In [None]:
# Get the shape of the DataFrame
order_df.shape

The output of `shape` is a tuple containing two values. It is of course important to know which number is the number of attributes and which is the number of tuples/samples in our `DataFrame`.

<div class="alert alert-block alert-info">

**Task 2:** 

Take a look at the documentation to determine which element stands for the number of attributes and which for the number of tuples. 
    
</div>

The count of attributes/features: ?

The count of tuples/samples: ?

The count of attributes/features: 63

The count of tuples/samples: 8845
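
As a small sanity check of this order, the following sketch builds a toy `DataFrame` and unpacks its `shape`. The name `toy_df` and its values are made up for illustration and are not part of the AdventureWorks data.

```python
import pandas as pd

# Hypothetical toy data: 3 tuples (rows) and 2 attributes (columns)
toy_df = pd.DataFrame({"OrderQty": [3, 5, 8], "UnitPrice": [9.5, 2.0, 4.25]})

# shape is a tuple of (number of rows, number of columns)
n_tuples, n_attributes = toy_df.shape
print("tuples:", n_tuples, "attributes:", n_attributes)
```

Unpacking the tuple into two named variables avoids having to remember which position holds which count.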


Now that we know the count of tuples and attributes in our `DataFrame`, we still do not know what data is contained in the data set.

To get a first impression, it can be useful to look at (a sample of) the `DataFrame`. Seemingly the simplest way to do this is the `print()` function.

In [None]:
# Print the order_df
print(order_df)

However, as you can see, this method outputs the `DataFrame` as plain text without any specific layout. This is hard to read, especially with very large `DataFrames`, and is therefore not recommended. It is far more common to use the `DataFrame` method `head()`.

<div class="alert alert-block alert-info">

**Task 3:** 
    
Use the pandas documentation to familiarize yourself with `head()`, then apply it to the `order_df` to display the first 10 tuples.
    
</div>

In [None]:
# Use the head() function on the order_df while setting the number of rows displayed to 10

In [None]:
# Use the head() function on the order_df while setting the number of rows displayed to 10
order_df.head(10)

As you can see, the representation by `head()` is easier to read. However, `head()` also has its limitations. For example, not all attributes are displayed when the number of attributes is large.
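
One possible way around this limitation, sketched here under the assumption that a temporary change of the pandas display options is acceptable, is `pd.option_context` with the option `display.max_columns`. The `DataFrame` `wide_df` and its column names are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical wide DataFrame: 2 rows and 30 columns named col_0 ... col_29
wide_df = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})

# Temporarily lift the column limit; the previous setting is restored
# automatically when the with-block is left
with pd.option_context("display.max_columns", None):
    full_text = repr(wide_df.head())

print(full_text)
```

Because the option is only changed inside the `with` block, later outputs in the notebook keep their usual compact form.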

<div class="alert alert-block alert-info">

**Task 4:** 
    
List the names of all columns.
    
</div>

In [None]:
# Output a list of all columns

In [None]:
# There are multiple possible solutions, e.g.:
# Sample solution 1: Iterate over the columns
for column in order_df.columns:
    print(column, end=",")

In [None]:
# Sample solution 2: Use list()
list(order_df.columns)

In [None]:
# Sample solution 3: simply access columns and print them
order_df.columns

For example, we did not see the columns `WeightUnitMeasureCode`, `Weight`, `DaysToManufacture`, `ProductLine`, `Class` and `Style` in the above execution of `head()`.  

<div class="alert alert-block alert-info">

**Task 5:** 

Show the first 10 tuples of the attributes `WeightUnitMeasureCode`, `Weight`, `DaysToManufacture`, `ProductLine`, `Class` and `Style`.
    
</div>

In [None]:
# Print the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class"
# and "Style" of the first ten tuples

In [None]:
# There are multiple possible solutions, e.g.:
# Sample solution 1: Explicit naming of the identifiers
order_df[
    [
        "WeightUnitMeasureCode",
        "Weight",
        "DaysToManufacture",
        "ProductLine",
        "Class",
        "Style",
    ]
].head(10)

In [None]:
# Sample solution 2: Using the columns attribute
order_df[order_df.columns[12:18]].head(10)

Unfortunately, all values shown are `0` in the `DaysToManufacture` attribute and `None` in the `ProductLine` attribute. However, this does not mean that this is the case for the entire `DataFrame`.
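
A minimal sketch of why the first rows can be misleading: the toy column below (hypothetical values, not taken from AdventureWorks) starts with zeros only, yet counting all distinct values with `value_counts()` reveals that other values exist further down.

```python
import pandas as pd

# Hypothetical column in which the first rows happen to be 0
toy_df = pd.DataFrame({"DaysToManufacture": [0, 0, 0, 1, 0, 1]})

# The first three rows suggest that the column only contains 0 ...
print(toy_df.head(3))

# ... but counting every distinct value over the whole column shows otherwise
counts = toy_df["DaysToManufacture"].value_counts()
print(counts)
```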

<div class="alert alert-block alert-info">

**Task 6:** 
    
Save all tuples with `DaysToManufacture` higher than 0 into a new `DataFrame` called `order_dtm_not_zero_dataframe`. 
    
</div>

In [None]:
# Select all tuples with DaysToManufacture > 0

In [None]:
# Select all tuples with DaysToManufacture > 0
order_dtm_not_zero_dataframe = order_df[order_df["DaysToManufacture"] > 0]

<div class="alert alert-block alert-info">

**Task 7:** 

Display the columns `WeightUnitMeasureCode`, `Weight`, `DaysToManufacture`, `ProductLine`, `Class`, and `Style` of `order_dtm_not_zero_dataframe`. Limit the output to 10 tuples. 
    
</div>

In [None]:
# Print the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class"
# and "Style" of the first ten tuples of the new order_dtm_not_zero_dataframe

In [None]:
# Print the attributes "WeightUnitMeasureCode", "Weight", "DaysToManufacture", "ProductLine", "Class"
# and "Style" of the first ten tuples of the new order_dtm_not_zero_dataframe
order_dtm_not_zero_dataframe[
    [
        "WeightUnitMeasureCode",
        "Weight",
        "DaysToManufacture",
        "ProductLine",
        "Class",
        "Style",
    ]
].head(10)

In addition to simply outputting data from the `DataFrame`, it can also be useful to examine some basic statistical descriptors. 

Here, we focus on the attributes `ReorderPoint`, `DaysToManufacture`, and `UnitPrice`. 

In [None]:
# Print the attributes "ReorderPoint", "DaysToManufacture", and "UnitPrice" of the first ten tuples
order_df[["ReorderPoint", "DaysToManufacture", "UnitPrice"]].head(10)

<div class="alert alert-block alert-info">

**Task 8:** 
    
Determine the mean for each of the three attributes and print it. (Help: [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html))

</div>

In [None]:
# Output the mean of "ReorderPoint"

In [None]:
# Output the mean of "DaysToManufacture"

In [None]:
# Output the mean of "UnitPrice"

In [None]:
# Output the mean of "ReorderPoint"
order_df["ReorderPoint"].mean()

In [None]:
# Output the mean of "DaysToManufacture"
order_df["DaysToManufacture"].mean()

In [None]:
# Output the mean of "UnitPrice"
order_df["UnitPrice"].mean()

<div class="alert alert-block alert-info">

**Task 9:** 
   
Determine the median for each of the three attributes and print it. (Help: [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.median.html))
    
</div>

In [None]:
# Output the median of "ReorderPoint"

In [None]:
# Output the median of "DaysToManufacture"

In [None]:
# Output the median of "UnitPrice"

In [None]:
# Output the median of "ReorderPoint"
order_df["ReorderPoint"].median()

In [None]:
# Output the median of "DaysToManufacture"
order_df["DaysToManufacture"].median()

In [None]:
# Output the median of "UnitPrice"
order_df["UnitPrice"].median()

<div class="alert alert-block alert-info">

**Task 10:** 
    
Determine the minimum value for each of the three attributes and print it.
    
</div>

In [None]:
# Output the min of "ReorderPoint"

In [None]:
# Output the min of "DaysToManufacture"

In [None]:
# Output the min of "UnitPrice"

In [None]:
# Output the min of "ReorderPoint"
order_df["ReorderPoint"].min()

In [None]:
# Output the min of "DaysToManufacture"
order_df["DaysToManufacture"].min()

In [None]:
# Output the min of "UnitPrice"
order_df["UnitPrice"].min()

<div class="alert alert-block alert-info">

**Task 11:** 
    
Determine the maximum value for each of the three attributes and print it.

</div>

In [None]:
# Output the max of "ReorderPoint"

In [None]:
# Output the max of "DaysToManufacture"

In [None]:
# Output the max of "UnitPrice"

In [None]:
# Output the max of "ReorderPoint"
order_df["ReorderPoint"].max()

In [None]:
# Output the max of "DaysToManufacture"
order_df["DaysToManufacture"].max()

In [None]:
# Output the max of "UnitPrice"
order_df["UnitPrice"].max()

<div class="alert alert-block alert-info">

**Task 12:** 
    
Determine the mode for each of the three attributes and print it.

</div>

In [None]:
# Output the mode of "ReorderPoint"

In [None]:
# Output the mode of "DaysToManufacture"

In [None]:
# Output the mode of "UnitPrice"

In [None]:
# Output the mode of "ReorderPoint"
order_df["ReorderPoint"].mode()

In [None]:
# Output the mode of "DaysToManufacture"
order_df["DaysToManufacture"].mode()

In [None]:
# Output the mode of "UnitPrice"
order_df["UnitPrice"].mode()

<div class="alert alert-block alert-info">

**Task 13:** 
    
Determine the number of unique values for each of the three attributes and output it. (Help: Take a look at the function `nunique` in the pandas documentation.)

</div>

In [None]:
# Output the nunique of "ReorderPoint"

In [None]:
# Output the nunique of "DaysToManufacture"

In [None]:
# Output the nunique of "UnitPrice"

In [None]:
# Output the nunique of "ReorderPoint"
order_df["ReorderPoint"].nunique()

In [None]:
# Output the nunique of "DaysToManufacture"
order_df["DaysToManufacture"].nunique()

In [None]:
# Output the nunique of "UnitPrice"
order_df["UnitPrice"].nunique()

There are also methods in pandas that can be used to obtain multiple simple statistical values simultaneously for a `DataFrame`. An example is `describe`, which encapsulates the output of count, maximum, minimum, mean, standard deviation, and some quantiles in one function.

<div class="alert alert-block alert-info">
    
**Task 14:** 
    
Output the result for `describe` for each of the three attributes.

</div>

In [None]:
# Output the result for describe of "ReorderPoint"

In [None]:
# Output the result for describe of "DaysToManufacture"

In [None]:
# Output the result for describe of "UnitPrice"

In [None]:
# Output the result for describe of "ReorderPoint"
order_df["ReorderPoint"].describe()

In [None]:
# Output the result for describe of "DaysToManufacture"
order_df["DaysToManufacture"].describe()

In [None]:
# Output the result for describe of "UnitPrice"
order_df["UnitPrice"].describe()

Another method is `agg`, which calculates several self-chosen statistical values for a `DataFrame` in a single call.
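
As an illustration on made-up data (the `DataFrame` `toy_df` and its columns `qty` and `price` are hypothetical), `agg` can be given a list of function names and then returns one row per statistic and one column per attribute:

```python
import pandas as pd

# Hypothetical toy data with two numeric attributes
toy_df = pd.DataFrame({"qty": [1, 1, 4], "price": [2.0, 4.0, 6.0]})

# Each name in the list is applied to every column; the result is a
# DataFrame indexed by the function names
stats = toy_df.agg(["mean", "median", "min", "max", "nunique"])
print(stats)
```

Selecting a subset of columns first, for example `toy_df[["qty"]]`, restricts the summary to the attributes of interest.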

<div class="alert alert-block alert-info">
    
**Task 15:** 
    
Output the mean, median, minimum, maximum, and number of unique values again for each of the three attributes. This time, use `agg`.

</div>

In [None]:
# Output the result for agg of "ReorderPoint"

In [None]:
# Output the result for agg of "DaysToManufacture"

In [None]:
# Output the result for agg of "UnitPrice"

In [None]:
# Output the result for agg of "ReorderPoint"
order_df["ReorderPoint"].agg(["mean", "median", "min", "max", "nunique"])

In [None]:
# Output the result for agg of "DaysToManufacture"
order_df["DaysToManufacture"].agg(["mean", "median", "min", "max", "nunique"])

In [None]:
# Output the result for agg of "UnitPrice"
order_df["UnitPrice"].agg(["mean", "median", "min", "max", "nunique"])

Even though the methods used here are fairly simple statistical methods, they can already tell us quite a bit about our data. 

<div class="alert alert-block alert-info">

**Task 16:** 
    
Consider and describe what can be said about the attribute `ReorderPoint` based on the statistical values obtained.

</div>

The attribute `ReorderPoint` has only five different values, which are distributed between the lowest value 3 and the highest value 750. It can therefore be assumed that there are sometimes large gaps between the individual values and that the attribute is most likely a discrete numeric attribute.

The mean of approx. 589 and the median of 750 clearly indicate that many values are located in the higher part of the value range. Since the median equals the maximum, at least half of all tuples must have the value 750. The fact that the mode is 750 is consistent with this.

<div class="alert alert-block alert-info">

**Task 17:** 

Consider and describe what can be said about the `DaysToManufacture` attribute based on the statistical values obtained.
</div>

In contrast to `ReorderPoint`, `DaysToManufacture` apparently has only two different values. These seem to be the values `0` and `1`. The mode of `0` shows us that `0` is the most frequent value. Because the value set is binary, the mean of about `0.13` is the share of tuples with the value `1`, which proves that `0` occurs much more often than `1`.

<div class="alert alert-block alert-info">

**Task 18:**
    
Consider and describe what can be said about the `UnitPrice` attribute based on the statistical values obtained.
</div>

The "UnitPrice" attribute, with its 177 different values ranging from 0.21 up to 82.8345, appears to be much less discrete than the other two attributes. It has two values that occur most frequently, 31.4895 and 48.2895. In general, this, along with the mean of about 34.743 and the median of 39.2805, suggests that most values are likely to be found in the middle of the range of values. However, it is possible that there are many values in the very low range as well as in the very high range, but they simply balance each other out.

The distribution of the individual values within the attributes is something we cannot determine with these simple statistical values. One way to get more information about it is a histogram.

<div class="alert alert-block alert-info">

**Task 19:** 
    
Draw a histogram for `ReorderPoint`. Consider what number of bins might be appropriate. (Help: [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html))
</div>

In [None]:
# Histogram for "ReorderPoint"

In [None]:
# Histogram for "ReorderPoint"
# 20 bins as the two lower values are too close to each other to be separated otherwise
order_df["ReorderPoint"].plot.hist(bins=20, rwidth=0.9)

# Note: The rwidth parameter is not required, but it helps to better separate two buckets
# that are directly next to each other.
# An alternative would be to use an edge color that stands out (additional keyword arguments
# such as edgecolor are passed on to matplotlib). However, this will not be discussed here.

<div class="alert alert-block alert-info">

**Task 20:**
    
Draw a histogram for `DaysToManufacture`. Consider what number of bins might be appropriate.
</div>

In [None]:
# Histogram for "DaysToManufacture"

In [None]:
# Histogram for "DaysToManufacture"
# 2 bins as there are only 2 unique values
order_df["DaysToManufacture"].plot.hist(bins=2, rwidth=0.8)

<div class="alert alert-block alert-info">

**Task 21:**
    
Draw a histogram for `UnitPrice`. Consider what number of bins might be appropriate. 
</div>

In [None]:
# Histogram for "UnitPrice"

In [None]:
# Histogram for "UnitPrice"
# 20 bins as 177 would not be displayed very well
order_df["UnitPrice"].plot.hist(bins=20, rwidth=0.8)

It can be seen that it does not always make sense to set the number of bins equal to the number of unique values. 
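
The effect of the bin count can also be checked numerically. The following sketch uses NumPy's `histogram` function, which computes equal-width bin counts of the kind a histogram plot displays. The list `values` is a hypothetical example with two values lying close together and one far away.

```python
import numpy as np

# Hypothetical data: 3 and 4 are close together, 750 is far away
values = [3, 3, 4, 750, 750]

# With 2 equal-width bins, 3 and 4 fall into the same bin
coarse_counts, _ = np.histogram(values, bins=2)
print(coarse_counts)

# With very narrow bins, 3 and 4 end up in separate bins
fine_counts, _ = np.histogram(values, bins=1000)
print(fine_counts[:2])
```

The number of bins needed to separate two values therefore depends on the distance between them relative to the whole value range, not on the number of unique values.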

Apart from that, we have to note that, especially in the case of `UnitPrice`, insight is lost by merging multiple values. In such a case, it can also be useful to look at a boxplot and a density curve instead of a histogram.
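
A boxplot is essentially a drawing of a few quantiles. As a rough sketch on made-up numbers (the `Series` `prices` is hypothetical), the quartiles and the common 1.5 × IQR rule for flagging outliers, which is also matplotlib's default for the whiskers, can be computed by hand:

```python
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Quartiles: the box spans q1 to q3, the line inside marks the median
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the box are drawn as individual points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(q1, median, q3, outliers.tolist())
```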

<div class="alert alert-block alert-info">

**Task 22:**
    
Draw a boxplot diagram for `UnitPrice`. (Help: [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.box.html))
</div>


In [None]:
# Boxplot for "UnitPrice"

In [None]:
# Boxplot for "UnitPrice"
order_df["UnitPrice"].plot.box()

<div class="alert alert-block alert-info">

**Task 23:**
    
Draw a density diagram for `UnitPrice`. (Help: [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.density.html))
</div>


In [None]:
# Density curve for "UnitPrice"

In [None]:
# Density curve for "UnitPrice"
order_df["UnitPrice"].plot.density()

While simple chart types such as histograms, boxplots and density plots can be directly used with pandas, a variety of libraries are available for more advanced visualizations. 



In the lecture, you were introduced to polar plots, also called radar charts. In this type of chart, you look at multiple attributes of a single tuple.

<div class="alert alert-block alert-info">

**Task 24:**
    
Use Plotly's graph object `Scatterpolar` to draw the tuples prepared in the following cell within one polar plot. (Help: [Plotly Documentation](https://plotly.com/python/radar-chart/) and [Plotly API Reference](https://plotly.com/python-api-reference/generated/plotly.graph_objects.scatterpolar.html?highlight=scatterpolar#module-plotly.graph_objects.scatterpolar))
</div>

In [None]:
# Set the attributes to select
attributes = ["OrderQty", "ReceivedQty", "RejectedQty"]

# Prepare tuple_6, tuple_77, tuple_82
tuple_6 = pd.DataFrame(dict(r=order_df[attributes].iloc[6].values, theta=attributes))
tuple_77 = pd.DataFrame(dict(r=order_df[attributes].iloc[77].values, theta=attributes))
tuple_82 = pd.DataFrame(dict(r=order_df[attributes].iloc[82].values, theta=attributes))

In [None]:
# Draw a single polar diagram for tuple_6, tuple_77 and tuple_82

In [None]:
# Draw a single polar diagram for tuple_6, tuple_77 and tuple_82
fig = go.Figure()

fig.add_trace(
    go.Scatterpolar(
        r=tuple_6["r"],
        theta=tuple_6["theta"],
        mode="lines",
        fill="tonext",
        name="tuple_6",
        line_color="peru",
    )
)
fig.add_trace(
    go.Scatterpolar(
        r=tuple_77["r"],
        theta=tuple_77["theta"],
        mode="lines",
        fill="tonext",
        name="tuple_77",
        line_color="deepskyblue",
    )
)
fig.add_trace(
    go.Scatterpolar(
        r=tuple_82["r"],
        theta=tuple_82["theta"],
        mode="lines",
        fill="tonext",
        name="tuple_82",
        line_color="seagreen",
    )
)


fig.update_layout(showlegend=True)

fig.show()

One type of graph that we have not yet considered is the scatter plot.

<div class="alert alert-block alert-info">

**Task 25:**
    
Use the internal pandas functions to create a scatter plot of the `OrderQty` and `ReceivedQty` attributes in `order_df`. (Help: [pandas Documentation](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.plot.scatter.html))
</div>



In [None]:
# Draw a scatter plot regarding "OrderQty" and "ReceivedQty"

In [None]:
# Draw a scatter plot regarding "OrderQty" and "ReceivedQty"
order_df.plot.scatter(x="OrderQty", y="ReceivedQty")

Within a scatter plot, exactly two attributes are always displayed on different axes. In order to compare more than two attributes with each other, a scatter plot matrix is often used. 

<div class="alert alert-block alert-info">

**Task 26:**
    
Use seaborn to plot a pair plot that creates a scatter plot matrix of the attributes `OrderQty`, `ReceivedQty` and `RejectedQty` in `order_df`. (Help: [Seaborn Documentation](https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot))
</div>

In [None]:
# Draw a scatter plot matrix regarding "OrderQty", "ReceivedQty" and "RejectedQty"

In [None]:
# Draw a scatter plot matrix regarding "OrderQty", "ReceivedQty" and "RejectedQty"
sns.pairplot(order_df[["OrderQty", "ReceivedQty", "RejectedQty"]], diag_kind="kde")

Icon-based diagrams are often used to show percentage relationships in an understandable way. 

For example, they can be used to highlight the gender distribution of employees working with our orders.
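
The percentages behind such a diagram can be derived directly from the counts. The following sketch assumes a small, made-up `Series` of gender codes (`genders` is hypothetical) and uses `value_counts(normalize=True)` to turn counts into shares:

```python
import pandas as pd

# Hypothetical gender codes of four employees
genders = pd.Series(["M", "F", "M", "M"])

# normalize=True returns relative frequencies instead of absolute counts
shares = genders.value_counts(normalize=True) * 100
print(shares)
```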

In [None]:
# Get unique employees
orders_per_employee_df = (
    order_df.groupby(["EmployeeID", "FirstName", "MiddleName", "LastName", "Gender"])
    .size()
    .reset_index(name="Orders")
)

# Sort the DataFrame
orders_per_employee_df.sort_values("Orders", ascending=False, inplace=True)

# Show the head of the orders_per_employee_df
orders_per_employee_df.head(15)

In [None]:
# Get the count of employees per gender
employees_per_gender_df = orders_per_employee_df.groupby(["Gender"]).size()

# Show the resulting Series with the count per gender
employees_per_gender_df

<div class="alert alert-block alert-info">

**Task 27:**
    
Use PyWaffle to draw an icon diagram regarding the genders of the employees. (Help: [PyWaffle Documentation](https://pywaffle.readthedocs.io/en/latest/examples/plot_with_characters_or_icons.html))
</div>

In [None]:
# Draw the icon diagram

In [None]:
# Draw the icon diagram
plt.figure(
    FigureClass=pywaffle.Waffle,
    rows=2,
    values=[employees_per_gender_df.loc["M"], employees_per_gender_df.loc["F"]],
    colors=["#0000ff", "#ff0084"],
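    icons="child",  # Without icons, plain squares are drawn ("child" is the icon used in the PyWaffle examples)
    font_size=30,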
    icon_legend=True,
    legend={
        "labels": ["male", "female"],
        "loc": "upper left",
        "bbox_to_anchor": (1, 1),
    },
)
plt.tight_layout()

Tree maps are often even better suited than icon diagrams for displaying distributions. 

For example, a tree map can be used to display the count of orders processed per employee.

In [None]:
# Print the head of orders_per_employee (again)
orders_per_employee_df.head(15)

<div class="alert alert-block alert-info">

**Task 28:**
    
Use Squarify to draw a tree map displaying the count of orders processed per employee. (Help: [Squarify Documentation](https://github.com/laserson/squarify#Usage))
</div>

In [None]:
# Draw the tree map

In [None]:
# Draw the tree map
fig, ax = plt.subplots(1, figsize=(12, 12))
squarify.plot(
    sizes=orders_per_employee_df["Orders"],
    label=[
        f'Employee #{employee["EmployeeID"]}:\n{employee["FirstName"]} {employee["MiddleName"]}. '
        f'{employee["LastName"]}\n\nOrders:\n{employee["Orders"]}'
        for _, employee in orders_per_employee_df.iterrows()
    ],
    pad=True,
)
plt.axis("off")
plt.show()

In addition to data visualization, it can also be useful to determine the similarity of data.

The similarity of two data objects is usually expressed via a distance between them: the smaller the distance, the more similar the objects. There are many different methods to determine distances between different types of data. 

In this chapter, we will focus on the distance between numeric data. More specifically, here, we compare previously selected and created tuples, namely `tuple_77` and `tuple_82`.

In [None]:
tuple_77["r"]

In [None]:
tuple_82["r"]

One of the most commonly used methods for calculating the distance between numerical values is the Manhattan distance, also called the L1 distance or city block distance. A corresponding function that calculates this distance can be implemented relatively quickly using a for-loop.

In [None]:
# Method to compute the Manhattan distance using a for-loop
def manhattan_distance(tuple_a, tuple_b):
    # We do not check for correct datatypes and same length of both tuples
    # However, this should be done in a production environment

    # Set the distance to 0
    distance = 0

    # Add the absolute difference of each pair of elements
    for i in range(len(tuple_a)):
        difference = tuple_a[i] - tuple_b[i]
        absolute_difference = abs(difference)
        distance += absolute_difference

    # Return the distance
    return distance

In Python, however, one would normally take a more elegant approach and avoid the for-loop.

In [None]:
# "Pythonic" method to compute the Manhattan distance
def manhattan_distance(tuple_a, tuple_b):
    # We do not check for correct datatypes and same length of both tuples
    # However, this should be done in a production environment

    # Convert lists to pandas Series
    if isinstance(tuple_a, list):
        a = pd.Series(tuple_a)
    else:
        a = tuple_a.copy()
    if isinstance(tuple_b, list):
        b = pd.Series(tuple_b)
    else:
        b = tuple_b.copy()

    # Compute the manhattan distance with a simple formula
    return (abs(a - b)).sum()

<div class="alert alert-block alert-info">

**Task 29:**
    
Use the defined function to compute the Manhattan distance between `tuple_77` and `tuple_82`.
</div>

In [None]:
# Use the defined function to compute the Manhattan distance between tuple_77 and tuple_82

In [None]:
# Use the defined function to compute the Manhattan distance between tuple_77 and tuple_82
manhattan_distance(tuple_77["r"], tuple_82["r"])

A second common variant is the Euclidean distance, also called L2 norm.

<div class="alert alert-block alert-info">

**Task 30:**
    
Complete the following source code to create a function to calculate the Euclidean distance and use it to calculate the Euclidean distance between `tuple_77` and `tuple_82`.
</div>

In [None]:
# Method to compute the euclidean distance
def euclidean_distance(tuple_a, tuple_b):
    # ...
    return -1


# Use the defined function to compute the Euclidean distance between tuple_77 and tuple_82
# ...

In [None]:
# Sample solution 1: Method to compute the euclidean distance using a for-loop
def euclidean_distance(tuple_a, tuple_b):
    # We do not check for correct datatypes and same length of both tuples
    # However, this should be done in a production environment

    # Set the distance to 0
    distance = 0

    # Add the squared difference of each pair of elements
    for i in range(len(tuple_a)):
        difference = tuple_a[i] - tuple_b[i]
        absolute_difference = abs(difference)
        exponentiated_difference = absolute_difference**2
        distance += exponentiated_difference

    # Calculate the square root of the distance
    distance = distance ** (1 / 2)

    # Return the distance
    return distance


# Use the defined function to compute the distance between tuple_77 and tuple_82
euclidean_distance(tuple_77["r"], tuple_82["r"])

In [None]:
# Sample solution 2: "Pythonic" method to compute the euclidean distance
def euclidean_distance(tuple_a, tuple_b):
    # We do not check for correct datatypes and same length of both tuples
    # However, this should be done in a production environment

    # Convert lists to pandas Series
    if isinstance(tuple_a, list):
        a = pd.Series(tuple_a)
    else:
        a = tuple_a.copy()
    if isinstance(tuple_b, list):
        b = pd.Series(tuple_b)
    else:
        b = tuple_b.copy()

    # Compute the euclidean distance with a simple formula
    return (abs(a - b) ** 2).sum() ** 0.5


# Use the defined function to compute the distance between tuple_77 and tuple_82
euclidean_distance(tuple_77["r"], tuple_82["r"])

Apparently, for `tuple_77` and `tuple_82`, the Manhattan distance and the Euclidean distance do not differ. But what about the following two lists?

In [None]:
# Lists to compare
list_1 = [550, 550, 82]
list_2 = [82, 550, 550]

<div class="alert alert-block alert-info">

**Task 31:**
    
Compute the Manhattan distance and the Euclidean distance between `list_1` and `list_2`. 
</div>

In [None]:
# Compute the manhattan distance

In [None]:
# Compute the euclidean distance

In [None]:
# Compute the manhattan distance
manhattan_distance(list_1, list_2)

In [None]:
# Compute the euclidean distance
euclidean_distance(list_1, list_2)

<div class="alert alert-block alert-info">

**Task 32:**
    
Consider and explain why the distance measures did not differ for `tuple_77` and `tuple_82`, but did for `list_1` and `list_2`.
</div>

Write down your solution here:


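The two measures coincide whenever two objects differ in at most one component: for a single difference d, the Manhattan distance is |d| and the Euclidean distance is the square root of d², which is also |d|. Since both distances are equal for `tuple_77` and `tuple_82`, these two tuples can differ in at most one of the three attributes.

`list_1` and `list_2`, in contrast, differ in two components (the first and the last), each by 468. The Manhattan distance simply adds these differences (468 + 468 = 936), whereas the Euclidean distance squares them, sums them up, and takes the square root (about 661.85). As soon as more than one component differs, the Euclidean distance is smaller than the Manhattan distance.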
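
The arithmetic behind both results can be retraced step by step. The sketch below uses only the standard library; the second pair of lists (`[5, 5, 0]` and `[5, 5, 3]`) is a made-up example of two objects that differ in a single component.

```python
import math

list_1 = [550, 550, 82]
list_2 = [82, 550, 550]

# Component-wise absolute differences: two components differ by 468 each
differences = [abs(a - b) for a, b in zip(list_1, list_2)]

# Manhattan: sum of the differences; Euclidean: root of the sum of squares
manhattan = sum(differences)
euclidean = math.sqrt(sum(d**2 for d in differences))
print(differences, manhattan, round(euclidean, 2))

# Hypothetical pair that differs in only one component: both measures agree
single = [abs(a - b) for a, b in zip([5, 5, 0], [5, 5, 3])]
single_manhattan = sum(single)
single_euclidean = math.sqrt(sum(d**2 for d in single))
print(single_manhattan, single_euclidean)
```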

Both the Manhattan distance and the Euclidean distance are special forms of the Minkowski distance. 
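
Written out for two objects $a$ and $b$ with $n$ numeric components, the Minkowski distance of order $h$ reads as follows (the symbols $a_i$, $b_i$, and $n$ are notation chosen here; $h$ matches the parameter name used in the code fragment of the following task):

$$
d_h(a, b) = \left( \sum_{i=1}^{n} \lvert a_i - b_i \rvert^{h} \right)^{1/h}
$$

Setting $h = 1$ gives the Manhattan distance, and $h = 2$ gives the Euclidean distance:

$$
d_1(a, b) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert, \qquad
d_2(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$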

<div class="alert alert-block alert-info">

**Task 33:**    
    
Implement the Minkowski distance using the given code fragment. Use the resulting function to calculate the Manhattan distance and the Euclidean distance of `list_1` and `list_2` again.
</div>

In [None]:
# Method to compute the minkowski distance
def minkowski_distance(tuple_a, tuple_b, h):
    # ...
    return -1


# Use the function to compute the Manhattan and the Euclidean distance between list_1 and list_2 and print them
# ...

In [None]:
# Sample solution 1: Method to compute the minkowski distance using a for-loop
def minkowski_distance(tuple_a, tuple_b, h):
    # We do not check for correct datatypes, same length of both tuples and h > 0
    # However, this should be done in a production environment

    # Set the distance to 0
    distance = 0

    # Add the difference of each pair of elements, raised to the power of h
    for i in range(len(tuple_a)):
        difference = tuple_a[i] - tuple_b[i]
        absolute_difference = abs(difference)
        exponentiated_difference = absolute_difference**h
        distance += exponentiated_difference

    # Calculate the h root of the distance
    distance = distance ** (1 / h)

    # Return the distance
    return distance


# Use the function to compute the Manhattan and the Euclidean distance between list_1 and list_2 and print them
print("Manhattan distance: ", minkowski_distance(list_1, list_2, 1))
print("Euclidean distance: ", minkowski_distance(list_1, list_2, 2))

In [None]:
# Sample solution 2: "Pythonic" method to compute the minkowski distance using pandas
def minkowski_distance(tuple_a, tuple_b, h):
    # We do not check for correct datatypes, same length of both tuples and h > 0
    # However, this should be done in a production environment

    # Convert lists to pandas Series
    if isinstance(tuple_a, list):
        a = pd.Series(tuple_a)
    else:
        a = tuple_a.copy()
    if isinstance(tuple_b, list):
        b = pd.Series(tuple_b)
    else:
        b = tuple_b.copy()

    # Compute the minkowski distance with a simple formula
    return (abs(a - b) ** h).sum() ** (1 / h)


# Use the function to compute the Manhattan and the Euclidean distance between list_1 and list_2 and print them
print("Manhattan distance: ", minkowski_distance(list_1, list_2, 1))
print("Euclidean distance: ", minkowski_distance(list_1, list_2, 2))

In [None]:
# Sample solution 3: "Pythonic" method to compute the minkowski distance without pandas
def minkowski_distance(tuple_a, tuple_b, h):
    # We do not check for correct datatypes, same length of both tuples and h > 0
    # However, this should be done in a production environment

    # Compute the minkowski distance with a simple formula
    return sum([abs(a - b) ** h for a, b in zip(tuple_a, tuple_b)]) ** (1 / h)


# Use the function to compute the Manhattan and the Euclidean distance between list_1 and list_2 and print them
print("Manhattan distance: ", minkowski_distance(list_1, list_2, 1))
print("Euclidean distance: ", minkowski_distance(list_1, list_2, 2))