# 2. Data analysis & Preprocessing

In this exercise you will put the basics from the lectures "3. Getting to Know Your Data" and "4. Preprocessing" into practice.

Since this practice sheet is designed to be used in three sessions, it is roughly divided into three sections:

- [2.1. Part One: Getting to Know Your Data](#2.1.-Part-One:-Getting-to-Know-Your-Data)
- [2.2. Part Two: Preprocessing - Data cleaning & Data integration](#2.2.-Part-Two:-Preprocessing---Data-cleaning-&-Data-integration)
- [2.3. Part Three: Preprocessing - Data reduction, data transformation & data discretization](#2.3.-Part-Three:-Preprocessing---Data-reduction,-data-transformation-&-data-discretization)

Depending on how quickly an exercise group progresses, a part may not be finished within its session, or the group may already start on the following part.

### Preparation: Import required libraries

In [1]:
# Import the required libraries
import tempfile
import sqlite3
import urllib.request
import pandas as pd

### Preparation: Download the datasets

In [6]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    dataset_folder + "/adventure-works.db",
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(dataset_folder + "/adventure-works.db")

# Create the dataframe(s)
order_dataframe = pd.read_sql_query(
    "SELECT * FROM Product JOIN PurchaseOrderDetail ON Product.ProductID = PurchaseOrderDetail.ProductID",
    connection,
)

## 2.1. Part One: Getting to Know Your Data

In this part you will apply the theoretical knowledge gained in the lecture "Getting to Know Your Data". In doing so, you will familiarize yourself step by step with the `order_dataframe` dataframe defined above.

### 2.1.1. Structure of the Dataframe

At this point you know nothing about `order_dataframe` except that it is the result of joining the two tables `Product` and `PurchaseOrderDetail` of a database named `AdventureWorks`.
To get an initial understanding of its structure, it is useful to first look at an excerpt of the dataframe.

One way to do this is the built-in function `print()`.


In [3]:
# Print the order_dataframe
print(order_dataframe)

     ProductID                             Name ProductNumber  MakeFlag  \
0            1                  Adjustable Race       AR-5381         0   
1          359               Thin-Jam Hex Nut 9       HJ-1213         0   
2          360              Thin-Jam Hex Nut 10       HJ-1220         0   
3          530                        Seat Post       SP-2981         0   
4            4            Headset Ball Bearings       BE-2908         0   
...        ...                              ...           ...       ...   
8840       880          Hydration Pack - 70 oz.    HY-1023-70         0   
8841       881   Short-Sleeve Classic Jersey, S     SJ-0194-S         0   
8842       882   Short-Sleeve Classic Jersey, M     SJ-0194-M         0   
8843       883   Short-Sleeve Classic Jersey, L     SJ-0194-L         0   
8844       884  Short-Sleeve Classic Jersey, XL     SJ-0194-X         0   

      FinishedGoodsFlag   Color  SafetyStockLevel  ReorderPoint  StandardCost  \
0                 

However, as you can see, this method prints the dataframe as plain text: the columns are wrapped across several blocks and the rows in the middle are elided with `...`. This is hard to read, especially for very large dataframes. Therefore, it is more common to use the dataframe method `head()`.

<div class="alert alert-block alert-warning">
<b>Task 1:</b> Use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html">Pandas documentation about the function</a> to familiarize yourself with head(), then apply it to the order_dataframe so that the first 10 tuples are displayed.</div>

In [4]:
# Use the head() function on the order_dataframe while setting the number of rows displayed to 10
order_dataframe.head(10)

Unnamed: 0,ProductID,Name,ProductNumber,MakeFlag,FinishedGoodsFlag,Color,SafetyStockLevel,ReorderPoint,StandardCost,ListPrice,...,ModifiedDate,PurchaseOrderID,PurchaseOrderDetailID,DueDate,OrderQty,ProductID.1,UnitPrice,ReceivedQty,RejectedQty,ModifiedDate.1
0,1,Adjustable Race,AR-5381,0,0,,1000,750,0.0,0.0,...,2014-02-08 10:01:36.827,1,1,2011-04-30 00:00:00.000,4,1,50.26,3,0,2011-04-23 00:00:00.000
1,359,Thin-Jam Hex Nut 9,HJ-1213,0,0,,1000,750,0.0,0.0,...,2014-02-08 10:01:36.827,2,2,2011-04-30 00:00:00.000,3,359,45.12,3,0,2011-04-23 00:00:00.000
2,360,Thin-Jam Hex Nut 10,HJ-1220,0,0,,1000,750,0.0,0.0,...,2014-02-08 10:01:36.827,2,3,2011-04-30 00:00:00.000,3,360,45.5805,3,0,2011-04-23 00:00:00.000
3,530,Seat Post,SP-2981,0,0,,500,375,0.0,0.0,...,2014-02-08 10:01:36.827,3,4,2011-04-30 00:00:00.000,550,530,16.086,550,0,2011-04-23 00:00:00.000
4,4,Headset Ball Bearings,BE-2908,0,0,,800,600,0.0,0.0,...,2014-02-08 10:01:36.827,4,5,2011-04-30 00:00:00.000,3,4,57.0255,2,1,2011-04-23 00:00:00.000
5,512,HL Road Rim,RM-R800,0,0,,800,600,0.0,0.0,...,2014-02-08 10:01:36.827,5,6,2011-05-14 00:00:00.000,550,512,37.086,550,0,2011-05-07 00:00:00.000
6,513,Touring Rim,RM-T801,0,0,,800,600,0.0,0.0,...,2014-02-08 10:01:36.827,6,7,2011-05-14 00:00:00.000,550,513,26.5965,468,0,2011-05-07 00:00:00.000
7,317,LL Crankarm,CA-5965,0,0,Black,500,375,0.0,0.0,...,2014-02-08 10:01:36.827,7,8,2011-05-14 00:00:00.000,550,317,27.0585,550,0,2011-05-07 00:00:00.000
8,318,ML Crankarm,CA-6738,0,0,Black,500,375,0.0,0.0,...,2014-02-08 10:01:36.827,7,9,2011-05-14 00:00:00.000,550,318,33.579,550,0,2011-05-07 00:00:00.000
9,319,HL Crankarm,CA-7457,0,0,Black,500,375,0.0,0.0,...,2014-02-08 10:01:36.827,7,10,2011-05-14 00:00:00.000,550,319,46.0635,550,0,2011-05-07 00:00:00.000


As you can see, the representation produced by `head()` is easier to read. However, `head()` also has its limitations: in this case the columns between `ListPrice` and `ModifiedDate` are elided (`...`), so not all columns are displayed.
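If you do want to see every column of such an excerpt, one option is to lift the pandas display limit temporarily. The following is a minimal sketch using `pd.option_context` with the `display.max_columns` option; `wide_dataframe` is a hypothetical stand-in for a dataframe with many columns.

```python
import pandas as pd

# Hypothetical wide dataframe with 30 columns
wide_dataframe = pd.DataFrame({f"column_{i}": range(3) for i in range(30)})

# Inside the with-block no column is elided; the previous setting is
# restored automatically when the block ends
with pd.option_context("display.max_columns", None):
    full_text = repr(wide_dataframe.head(2))

print(full_text)
```

The context manager keeps the change local, so other cells of the notebook keep their default display settings.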

<div class="alert alert-block alert-warning">
<b>Task 2:</b> The column labels (attributes) of a dataframe are stored in its <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html">attribute columns</a>. Use this information to output a list (without special formatting) of all attributes contained in order_dataframe.
</div>

In [5]:
# There are many equivalent solutions, e.g.:
# Solution 1: Iterate over the columns
for column in order_dataframe.columns:
    print(column, end=",")

# Solution 2: List the columns
# list(order_dataframe.columns)

# And many more ...

ProductID,Name,ProductNumber,MakeFlag,FinishedGoodsFlag,Color,SafetyStockLevel,ReorderPoint,StandardCost,ListPrice,Size,SizeUnitMeasureCode,WeightUnitMeasureCode,Weight,DaysToManufacture,ProductLine,Class,Style,ProductSubcategoryID,ProductModelID,SellStartDate,SellEndDate,DiscontinuedDate,rowguid,ModifiedDate,PurchaseOrderID,PurchaseOrderDetailID,DueDate,OrderQty,ProductID,UnitPrice,ReceivedQty,RejectedQty,ModifiedDate,
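The column list above contains `ProductID` and `ModifiedDate` twice, because the join with `SELECT *` returns the columns of both tables. As a minimal sketch, duplicated column labels can be found with `Index.duplicated()`; the small `example_dataframe` below is a hypothetical stand-in so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical dataframe with a repeated column label, as produced by the join
example_dataframe = pd.DataFrame(
    [[1, "Adjustable Race", 1], [4, "Headset Ball Bearings", 4]],
    columns=["ProductID", "Name", "ProductID"],
)

# duplicated() marks every repeated occurrence of a label after the first one
duplicate_mask = example_dataframe.columns.duplicated()
duplicate_names = list(example_dataframe.columns[duplicate_mask])
print(duplicate_names)

# Keep only the first occurrence of each label
deduplicated = example_dataframe.loc[:, ~duplicate_mask]
print(list(deduplicated.columns))
```

Dropping the later copy is only safe when both columns hold the same values, as is the case for a join key. Two columns such as `ModifiedDate` that come from different tables can differ and would need to be renamed instead.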

<div class="alert alert-block alert-info">
TODO
</div>

## 2.2. Part Two: Preprocessing - Data cleaning & Data integration

<div class="alert alert-block alert-info">
TODO
</div>

## 2.3. Part Three: Preprocessing - Data reduction, data transformation & data discretization

<div class="alert alert-block alert-info">
TODO
</div>