# Exercise 3: Conducting Classification


Imagine the following scenario:

*You are an employee in the fictitious company Adventure Works GmbH. Your task is to automate a critical decision regarding inventory thresholds within the company:*

*Currently, each product is manually assigned a stock level at which it should be reordered. This threshold should be automatically determined using classification techniques.*

*To perform the classification, you have access to the relevant DataFrame product_df.*

*Additionally, your supervisor provided you with a selection of essential libraries:*

In [None]:
# Import helpful libraries
import os
import tempfile
import sqlite3
import urllib.request
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

In [None]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Build path to database
database_path = os.path.join(dataset_folder, "adventure-works.db")

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    database_path,
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(database_path)

In [None]:
# Load Product into a DataFrame
product_df = pd.read_sql_query(
    "SELECT ProductID, Name, ProductNumber, Size, SizeUnitMeasureCode, Weight, WeightUnitMeasureCode, MakeFlag, StandardCost, ListPrice,  DaysToManufacture, ReorderPoint, Color FROM Product",
    connection,
)

<div class="alert alert-block alert-info">

**Task 1:**
    
Train a classifier on `product_df` to reliably determine the reorder point for a product. 

</div>

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 01/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 02/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 03/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 04/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 05/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 06/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 07/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 08/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 09/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 10/10)

As always, the first step is to get to know the data:

In [None]:
product_df.head(10)

While `ReorderPoint` is easy to recognize as the target variable of our classification, a simple `head()` reveals another important detail:

Some of the columns contain missing values (e.g. `None` or `NaN`). 

As this might lead to problems later on, we should investigate it further first:

In [None]:
# Get the number of missing values per column
print("Missing values per column:\n" + str(product_df.isna().sum()) + "\n")

# Get the total number of rows (tuples) in the DataFrame
print("Total count of tuples in the DataFrame:\n" + str(product_df.shape[0]))
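Absolute counts are easier to judge as a share of all rows. A minimal sketch, assuming a small made-up `toy_df` that only stands in for `product_df` (its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for product_df (values are made up)
toy_df = pd.DataFrame(
    {
        "Name": ["Bike", "Helmet", "Chain", "Frame"],
        "Color": ["Red", None, None, "Black"],
        "Weight": [10.5, np.nan, np.nan, 2.3],
    }
)

# isna() yields a boolean mask; the column-wise mean of that mask is the
# fraction of missing values per column (0.0 = complete, 1.0 = all missing)
missing_share = toy_df.isna().mean()
print(missing_share)
```

Applied to the real DataFrame, the same call would be `product_df.isna().mean()`.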

It seems that the attributes `Size`, `SizeUnitMeasureCode`, `Weight`, `WeightUnitMeasureCode`, and `Color` do not contain any values in about half of all tuples.

This can mean, for example, that a product really has no relevant color, but it can also mean that the color was simply not entered. 

Depending on which is the case, this can distort the results of the classification. It would be best to ask the data producers which case applies to each missing value.

However, we do not have this option in this exercise. We are therefore left with some less optimal solutions:

1. **Ignore the missing data:**

    Ignoring the missing data becomes a problem at the latest when we apply sklearn's classification methods to the DataFrame: many estimators, including `GaussianNB`, raise an error when they encounter `NaN` values. Ignoring is therefore not a valid solution.

2. **Infer the missing data:**

    A frequently used variant for missing data is to simply derive the missing data from the existing data. 

    However, unless we train a separate classifier for this purpose, we can only fall back on very generic fill methods such as the mode. This would further distort the result of our classification: the most frequent value of each attribute would be assigned to even more tuples, most of which probably do not have this value "in real life".

3. **Mark the missing data:**

    Even if deriving the data is a bad idea in our case, there is a second way to get rid of `NaN`/`None`: 

    Introduce an extra value such as `Unknown`. This is advantageous if the absence of a value itself carries meaning (e.g. if the color is always omitted when it is irrelevant). 

    However, this extra value significantly distorts our classification if the data was simply forgotten at random. 

4. **Delete the tuples with missing values:**

    Tuples with missing data are often simply deleted and therefore ignored during training. 

    However, this variant has the disadvantage that there is suddenly a significantly smaller number of tuples on which the classifier can be trained.

    In addition, a problem arises at the latest when the classifier has to predict classes for tuples that themselves have missing values in these attributes (for example, because the color was intentionally left unspecified).

In [None]:
# Make a copy of the DataFrame
product_df_copy = product_df.copy()

# Drop all rows with missing values
product_df_copy = product_df_copy.dropna()

# Get the shape of the new DataFrame
print(
    "Shape of the new DataFrame after dropping missing values:\n"
    + str(product_df_copy.shape)
)

5. **Delete the attributes with missing values:**

    Deleting the attributes with missing values is probably the safest way to avoid the uncertainties of missing values without contacting the data producers. 

    The disadvantage is that fewer attributes are available for training and the classifier will therefore potentially perform less well.

    In our case, however, it is probably the best way to get around the problem.

In [None]:
# Drop all columns with missing values
product_df = product_df.dropna(axis=1)

# Print the DataFrame
product_df.head(10)
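For comparison, options 2 and 3 from the list above can be sketched on a toy column. The data below is made up and only illustrates how the two strategies change the value distribution; it is not taken from `product_df`:

```python
import pandas as pd

# Hypothetical color column with two missing entries
toy_df = pd.DataFrame({"Color": ["Red", None, "Black", None, "Red"]})

# Option 3: mark missing values with an explicit extra category
marked = toy_df["Color"].fillna("Unknown")

# Option 2: infer missing values with a generic method (here: the mode)
mode_value = toy_df["Color"].mode()[0]
imputed = toy_df["Color"].fillna(mode_value)

print(marked.value_counts())
print(imputed.value_counts())
```

Mode imputation inflates the most frequent value, whereas marking keeps the observed values untouched and adds one new category.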

With this problem solved, we can move on to the actual classification.

It is important to note that `sklearn` cannot directly work with categorical attributes and that we must first encode them accordingly:

In [None]:
# Encode the categorical columns using LabelEncoder
label_encoders = {}
for column in product_df.columns:
    if not pd.api.types.is_numeric_dtype(product_df[column]):  # If a column is non-numeric (categorical)
        le = LabelEncoder()
        # Fit and transform the column
        product_df[column] = le.fit_transform(product_df[column].astype(str))
        label_encoders[column] = le

# Print the DataFrame
product_df.head(10)
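To make the encoding step more tangible, here is a minimal sketch of what `LabelEncoder` does to a single column and how a stored encoder can translate the integers back. The color values are made up for illustration:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values
colors = ["Red", "Black", "Silver", "Black"]

le = LabelEncoder()
encoded = le.fit_transform(colors)

# classes_ holds the sorted unique values; each value is replaced by its index
print(list(le.classes_))
print(encoded.tolist())

# The fitted encoder can map the integers back to the original labels
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Note that the resulting integers follow the sort order of the labels, so they carry an arbitrary ordering that a model may interpret as meaningful.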

The next step is to separate the target variable from the remaining attributes:

In [None]:
# Separate the features and the target variable (ReorderPoint)
X = product_df.drop("ReorderPoint", axis=1)
y = product_df["ReorderPoint"]

We also have to split the data we have into training data and test data in order to be able to check the quality of our classifier later on.

In [None]:
# Split the data into training and testing sets (in this case, 70% training and 30% testing which is a common split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Based on this data, `sklearn`'s classifiers can now be trained. Our "supervisor" (see scenario) has provided us with both a Decision Tree Classifier and a Naive Bayes Classifier, so it is probably best to try both and choose the better one.

In [None]:
# Train a Decision Tree Classifier (with entropy (= Information Gain) as the criterion)
tree_model = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree_model.fit(X_train, y_train)

# Test the model
y_pred = tree_model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

In [None]:
# Train a Decision Tree Classifier (with gini as the criterion)
tree_model = DecisionTreeClassifier(criterion="gini", random_state=42)
tree_model.fit(X_train, y_train)

# Test the model
y_pred = tree_model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

In [None]:
# Train a Naive Bayes Classifier
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Test the model
y_pred = nb_model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

Overall, the Decision Tree Classifier seems to perform better (regardless of the criterion), as it achieves a significantly higher f1-score. 

The Naive Bayes leads to a slightly better recall, but has significantly worse precision.
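Instead of reading the scores off three separate reports, one number per model can be collected in a loop. The following is a sketch on synthetic data generated with `make_classification`, so the printed scores say nothing about the Adventure Works data; the macro-averaged F1 is one possible choice of summary metric, not the only one:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem (an assumption, not the notebook's data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

models = {
    "tree (entropy)": DecisionTreeClassifier(criterion="entropy", random_state=42),
    "tree (gini)": DecisionTreeClassifier(criterion="gini", random_state=42),
    "naive bayes": GaussianNB(),
}

# Fit each model and store its macro-averaged F1 on the test set
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te), average="macro")

for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.3f}")
```

Macro averaging weights every class equally, which makes weak performance on a rare class more visible than a weighted average would.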

It is also interesting to note that products whose true reorder point is `600` are often not predicted correctly. However, this may well be due to the small number of tuples with this value in the training data set.

All in all, the classifier is very satisfactory as it is.
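If a rare class such as `600` is underrepresented after a purely random split, a stratified split is one hedged option to keep class proportions equal in both parts. A minimal sketch on made-up labels (the class sizes are hypothetical, and `stratify` requires at least two tuples per class):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 14 tuples of class 375 and 6 tuples of class 600
X_toy = pd.DataFrame({"feature": range(20)})
y_toy = pd.Series([375] * 14 + [600] * 6)

# stratify=y_toy keeps the class ratio (70% / 30%) in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Whether this actually improves the predictions for the rare class would have to be checked on the real data.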

Next, a slightly modified DataFrame is given:

In [None]:
# Load Product into a DataFrame
new_product_df = pd.read_sql_query(
    "SELECT ProductID, Name, ProductNumber, Size, SizeUnitMeasureCode, Weight, WeightUnitMeasureCode, MakeFlag, StandardCost, ListPrice,  DaysToManufacture, SafetyStockLevel, ReorderPoint, Color FROM Product",
    connection,
)

<div class="alert alert-block alert-info">

**Task 2:**
    
Carry out a classification with regard to `ReorderPoint` on `new_product_df`.  

What do you notice about the result, why does this change occur and why should you be careful with classifiers with this result?

</div>

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 01/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 02/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 03/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 04/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 05/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 06/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 07/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 08/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 09/10)

In [None]:
# Train a good classifier to determine the reorder point (Code placeholder 10/10)

As the DataFrame contains new columns, we have to check for missing values again:

In [None]:
# Get the number of missing values per column
print("Missing values per column:\n" + str(new_product_df.isna().sum()) + "\n")

# Get the total number of rows (tuples) in the DataFrame
print("Total count of tuples in the DataFrame:\n" + str(new_product_df.shape[0]))

However, the new `SafetyStockLevel` column does not appear to contain any missing values, so we can continue with our existing preprocessing:

In [None]:
# Drop all columns with missing values
new_product_df = new_product_df.dropna(axis=1)

# Print the DataFrame
new_product_df.head(10)

We do not need to adjust the actual classification either:

In [None]:
# Encode the categorical columns using LabelEncoder
label_encoders = {}
for column in new_product_df.columns:
    if not pd.api.types.is_numeric_dtype(new_product_df[column]):  # If a column is non-numeric (categorical)
        le = LabelEncoder()
        # Fit and transform the column
        new_product_df[column] = le.fit_transform(new_product_df[column].astype(str))
        label_encoders[column] = le

# Print the DataFrame
new_product_df.head(10)

In [None]:
# Separate the features and the target variable (ReorderPoint)
X = new_product_df.drop("ReorderPoint", axis=1)
y = new_product_df["ReorderPoint"]

In [None]:
# Split the data into training and testing sets (in this case, 70% training and 30% testing which is a common split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [None]:
# Train a Decision Tree Classifier (with entropy (= Information Gain) as the criterion)
tree_model = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree_model.fit(X_train, y_train)

# Test the model
y_pred = tree_model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

In [None]:
# Train a Decision Tree Classifier (with gini as the criterion)
tree_model = DecisionTreeClassifier(criterion="gini", random_state=42)
tree_model.fit(X_train, y_train)

# Test the model
y_pred = tree_model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

In [None]:
# Train a Naive Bayes Classifier
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Test the model
y_pred = nb_model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

With the added attribute, all classifiers suddenly have an f1-score of 100%.

However, this is relatively easy to explain if you look at the correlation of all attributes with `ReorderPoint`:

In [None]:
new_product_df.corr()["ReorderPoint"]

It can be clearly seen that `SafetyStockLevel` and `ReorderPoint` are fully correlated.

This of course allows the respective class of `ReorderPoint` to be easily predicted from `SafetyStockLevel`. 

It can be argued that this somewhat negates the point of a classification. 
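A correlation of 1 can be complemented by a direct check of whether every value of one column maps to exactly one class of the other. The sketch below uses made-up numbers that merely mimic such a relationship; the column names are taken from the exercise, the values are hypothetical:

```python
import pandas as pd

# Hypothetical values in which ReorderPoint is fully determined by SafetyStockLevel
toy_df = pd.DataFrame(
    {
        "SafetyStockLevel": [1000, 800, 500, 1000, 500, 800],
        "ReorderPoint": [750, 600, 375, 750, 375, 600],
    }
)

# Pearson correlation between the two columns
corr = toy_df["SafetyStockLevel"].corr(toy_df["ReorderPoint"])

# Number of distinct target classes per feature value; if it is always 1,
# the feature alone determines the target
classes_per_level = toy_df.groupby("SafetyStockLevel")["ReorderPoint"].nunique()
is_determined = bool((classes_per_level == 1).all())

print(corr)
print(is_determined)
```

If such a check succeeds on real data, a perfect score mostly reflects that the target is already encoded in a feature, not that the classifier has learned anything that generalizes.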


If `ReorderPoint` is defined manually before the classification, the same might also apply to `SafetyStockLevel`.

If that is the case, it would probably make sense to determine both automatically via the classification.

One could also consider whether the database should be normalized further given such a clear correlation, since one of the two columns appears to be redundant. 

However, this goes beyond the questions on this task sheet.