# DALI 2024 Winter Application - Machine Learning Track
### John Guerrerio

We now develop models to predict whether the profit of a purchase is above or below the median purchase profit (as determined by our exploratory analysis).  A superstore would likely have access to all features in the Superstore.csv dataset at order time except profit; that would likely have to be calculated later.  Therefore, it would be helpful to the superstore to have a lightweight model that can predict at order time whether a purchase will be above or below the median profit.

Please note that deep learning is a promising approach for this classification problem and would likely perform better than the models I present here.  However, I ran out of GPU compute units on Colab after training my models for predictCategory.ipynb, so I cannot train deep learning models for this task.

Areas for expansion:
- Try additional models (e.g. random forest)
- Try a deep learning approach
- Develop a data pipeline to fill in missing values in the dataset before we train models (for the models in this file, we had enough data to comfortably train even after removing the missing values.  However, more data couldn't hurt)
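
As a rough illustration of the last bullet, the sketch below fills missing values instead of dropping rows.  The toy rows and the choice of mode/median imputation are assumptions for illustration only; they are not taken from Superstore.csv.  In a real pipeline the fill values should be computed on the train split only, to avoid leaking information from the validation and test sets.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few Superstore rows (made-up values, illustration only)
toy = pd.DataFrame({
    "Region": ["West", np.nan, "East", "West"],
    "Quantity": [2.0, 5.0, np.nan, 3.0],
    "Discount": [0.0, 0.2, 0.2, np.nan],
})
numeric_cols = ["Quantity", "Discount"]

filled = toy.copy()
# categorical column: fill with the most frequent value
filled["Region"] = toy["Region"].fillna(toy["Region"].mode()[0])
# numeric columns: fill with the column median
filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(filled)
```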

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score

## Prepare the Dataset

We first need to prepare the dataset for machine learning.  This involves loading it, generating correct class labels (above or below the median), generating feature vectors for each entry, and splitting it into train, validation and test sets.

Splitting the dataset into train, validation and test sets is important to prevent data leakage.  The purpose of each set is as follows:

- Train: The data to train the model on, refining it over time.
- Validation: The dataset on which to optimize performance when tuning hyperparameters.
- Test: The dataset to evaluate the model on after training and hyperparameter tuning.  This gives an indication of the model's real-world performance.

If we train on data from the validation or test sets, we will artificially inflate our model's measured performance because the model will have already seen that data.  Similarly, if we optimize our test set performance when tuning hyperparameters, the test set performance will be artificially inflated and we won't have a good means of measuring the model's real-world performance.  For this project, we use a 70-15-15 train-validation-test split.

In [3]:
df = pd.read_csv('Superstore.csv')

Based on the results of correlation.ipynb, we choose 5 features that appear to be correlated most strongly with profit: region, product category, product sub-category, quantity, and discount.  We drop rows that contain null values for any of these columns.  We are assuming data entries are missing completely at random, so this should not introduce bias into the dataset.  

In [4]:
df.dropna(subset=["Region", "Category", "Sub-Category", "Quantity", "Discount"], inplace=True)
print("Number of rows: " + str(len(df.index))) # make sure we still have enough data

Number of rows: 5923


In [5]:
MEDIAN = 8.662 # from the exploratory analysis file
RANDOM_STATE = 42 # random seed to ensure results are reproducible

The cell below generates a feature vector for each row along with that row's label (1 for above-median profit, 0 for profit less than or equal to the median).

In [6]:
# turn region, category, and sub-category columns into vectors of integer codes
region = np.unique(df['Region'], return_inverse=True)[1]
category = np.unique(df['Category'], return_inverse=True)[1]
subCategory = np.unique(df['Sub-Category'], return_inverse=True)[1]

# turn quantity, discount, and profit columns into vectors of numbers
quantity = df["Quantity"].to_numpy()
discount = df["Discount"].to_numpy()
profit = df["Profit"].to_numpy()

# generate feature vectors: one row per purchase, one column per feature
vectorizedDataset = np.column_stack((region, category, subCategory, quantity, discount)).astype(float)

# generate labels: 1 for profit above the median, 0 otherwise
labels = (profit > MEDIAN).astype(float)
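
To make the encoding step concrete, here is a small sketch of what `np.unique(..., return_inverse=True)` returns on a toy column.  The region values are illustrative.  The one-hot version at the end is an alternative I did not use in this notebook; it avoids implying an order between categories, which integer codes do.

```python
import numpy as np
import pandas as pd

regions = pd.Series(["West", "East", "South", "East"], name="Region")

# integer codes: categories are sorted, so East -> 0, South -> 1, West -> 2
categories, codes = np.unique(regions.to_numpy(), return_inverse=True)
print(categories)
print(codes)

# one-hot alternative (not used in this notebook): one column per category
one_hot = pd.get_dummies(regions, prefix="Region")
print(one_hot)
```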

In [7]:
# shuffles the data and splits it into the train, test, and validation sets
train, validAndTest, trainLabels, validAndTestLabels = train_test_split(vectorizedDataset, labels, test_size=0.3, random_state=RANDOM_STATE)
valid, test, validLabels, testLabels = train_test_split(validAndTest, validAndTestLabels, test_size=0.5, random_state=RANDOM_STATE)

In [8]:
print(len(train))
print(len(trainLabels))
print()

print(len(valid))
print(len(validLabels))
print()

print(len(test))
print(len(testLabels))

4146
4146

888
888

889
889
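
The set sizes above follow directly from the two split calls.  The sketch below repeats the same two-step split on a stand-in array of 5923 row indices (the row count printed earlier) rather than on the real feature vectors.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(5923)  # stand-in for the 5923 rows left after dropping nulls

# step 1: hold out 30% of the rows
train_part, rest = train_test_split(rows, test_size=0.3, random_state=42)
# step 2: split the held-out 30% in half to get validation and test sets
valid_part, test_part = train_test_split(rest, test_size=0.5, random_state=42)

for name, part in [("train", train_part), ("valid", valid_part), ("test", test_part)]:
    print(name, len(part), round(len(part) / len(rows), 3))
```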


In [9]:
trainLabels = trainLabels.T # no-op: transposing a 1-D array returns it unchanged

## Support Vector Machine

The first model we generate is a Support Vector Machine (SVM).  An SVM finds the maximum-margin hyperplane that separates the purchases above the median profit from those below it in the feature space.  It works well when there is a clear margin of separation between classes, which I believe there might be, given that the "above the median profit" and "below the median profit" classes do not seem to overlap in terms of features.  SVM is also relatively memory efficient and quick to train.  However, if there is a lot of noise in the dataset, the algorithm will not be able to find a hyperplane that cleanly separates the two classes.  If SVM performs poorly, I would assume this is the reason why.
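
For intuition, a linear SVM is trained by minimizing the hinge loss, which is zero for points on the correct side of the margin and grows linearly otherwise.  The sketch below computes it by hand.  The weights `w`, bias `b`, and data points are hypothetical; sklearn maps 0/1 labels to -1/+1 internally, which is why `y` is coded that way here.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Mean hinge loss; y must be coded as -1 / +1."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).mean()

# hypothetical 2-D points and a hypothetical separating hyperplane
X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.2, 0.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])
b = 0.0

# all four points are classified correctly, but two lie inside the margin
loss = hinge_loss(w, b, X, y)
print(loss)
```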

The below cell specifies the hyperparameters for our SVM.  We perform a grid search across the penalty, alpha, and epochs hyperparameters.  This means that we try every combination of the candidate values for each hyperparameter and record the combination that performs best on the validation set.  This approach should improve our model's performance, as the hyperparameters we find will likely be better than the defaults sklearn specifies.
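
As a quick check on the size of the search, the sketch below enumerates the grid with sklearn's `ParameterGrid`, using the candidate values from this notebook.  The nested loops used in this notebook visit the same combinations; `ParameterGrid` is shown only as a compact way to list them.

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({
    "penalty": ["l2", "l1", "elasticnet"],
    "alpha": [0.01, 0.001, 0.0002, 0.0001, 0.0005, 0.00001],
    "max_iter": [500, 750, 1000, 1250, 1500],
})

# 3 penalties * 6 alphas * 5 epoch limits = 90 models to train
print(len(grid))

# each element is a dict that could be passed to SGDClassifier(**params)
first_params = list(grid)[0]
print(first_params)
```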

In [10]:
PENALTY = ["l2", "l1", "elasticnet"] # determines the regularization penalty term
ALPHA = [0.01, 0.001, 0.0002, 0.0001, 0.0005, 0.00001] # regularization strength; also sets the step size under sklearn's default "optimal" learning rate schedule
EPOCHS = [500, 750, 1000, 1250, 1500] # maximum number of epochs for training (max_iter)

# sklearn partitions its own validation set from the train data we pass in and stops training once performance on
# its validation set has not improved for 5 consecutive epochs (the default n_iter_no_change)
# this early stopping is important to prevent overfitting
EARLY_STOPPING = True
VALIDATION_FRACTION = 0.15 # fraction of training data to use for validation
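
One consequence of early stopping is that `max_iter` only matters if training actually reaches it.  The sketch below uses synthetic data from `make_classification` (an assumption; it is not the Superstore data) to show two runs with different epoch limits stopping at the same epoch and producing the same weights.  I have not verified that the same happens for every configuration in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# synthetic stand-in data, illustration only
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

def fit_with_limit(max_iter):
    # same settings as the notebook's SVM, apart from the epoch limit
    model = SGDClassifier(max_iter=max_iter, early_stopping=True,
                          validation_fraction=0.15, random_state=42)
    return model.fit(X, y)

short_run = fit_with_limit(500)
long_run = fit_with_limit(1500)

# n_iter_ is the number of epochs actually run before the stopping criterion fired
print(short_run.n_iter_, long_run.n_iter_)
print(np.array_equal(short_run.coef_, long_run.coef_))
```

When this happens, every epoch limit in the grid yields the same validation score, and a strict `>` comparison keeps the first value tried.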

In [11]:
bestPenalty = "l2"
bestAplha = 0.0001
bestEpochs = 1000
bestF1 = 0

# grid search
for penalty in PENALTY:
  for alpha in ALPHA:
    for epoch in EPOCHS:
      # SGDClassifier's default loss is hinge, which trains a linear SVM
      SVM = SGDClassifier(penalty=penalty, alpha=alpha, max_iter=epoch, early_stopping=EARLY_STOPPING, validation_fraction=VALIDATION_FRACTION, random_state=RANDOM_STATE)
      SVM.fit(train, trainLabels)

      # evaluate on validation set
      predictions = SVM.predict(valid)
      f1 = f1_score(validLabels, predictions, average="macro")

      if f1 > bestF1:
        bestF1 = f1
        bestPenalty = penalty
        bestAplha = alpha
        bestEpochs = epoch

# best performing hyperparameters
print("Best f1: " + str(bestF1))
print("Best epochs: " + str(bestEpochs))
print("Best alpha: " + str(bestAplha))
print("Best penalty: " + str(bestPenalty))

Best f1: 0.699212165409399
Best epochs: 500
Best alpha: 0.0005
Best penalty: elasticnet


In [12]:
SVM = SGDClassifier(penalty=bestPenalty, alpha=bestAplha, max_iter=bestEpochs, early_stopping=EARLY_STOPPING, validation_fraction=VALIDATION_FRACTION, random_state=RANDOM_STATE) # this is a linear SVM
SVM.fit(train, trainLabels)
predictions = SVM.predict(test)

In [13]:
# evaluation of best-performing hyperparameters
print(classification_report(testLabels, predictions))

              precision    recall  f1-score   support

         0.0       0.69      0.83      0.75       499
         1.0       0.70      0.52      0.60       390

    accuracy                           0.69       889
   macro avg       0.69      0.67      0.68       889
weighted avg       0.69      0.69      0.68       889



This model performs reasonably well, with a macro-averaged f1 of 0.68 and overall accuracy of 0.69.  Note, however, that recall on the above-median class is only 0.52, so the model misses roughly half of the above-median purchases.  A deep learning approach would likely perform better; however, the advantage of an SVM is that it is far more lightweight and memory efficient.  If a superstore were implementing a model to predict at order time if that order would make above or below the median profit, an SVM would likely be a better choice for its fast performance across thousands of orders.  As this model does substantially better than a random guess (always predicting the majority class would give an accuracy of about 0.56 on this test set), it is at least usable for our classification task.
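
For reference, the macro-averaged f1 reported above is the unweighted mean of the per-class f1 scores.  The sketch below computes it by hand from a confusion matrix and checks it against `f1_score`.  The labels and predictions are made up for illustration; they are not this model's outputs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# made-up labels and predictions, illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # per predicted class
recall = true_positives / cm.sum(axis=1)     # per true class

per_class_f1 = 2 * precision * recall / (precision + recall)
macro_f1 = per_class_f1.mean()

print(cm)
print(macro_f1, f1_score(y_true, y_pred, average="macro"))
```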

## Logistic Regression

The second model we try is Logistic Regression, which is essentially a single layer of a neural network.  It works well for binary classification tasks (which our classification task is), and serves as a solid performance baseline for other ML approaches on this task.  One advantage of logistic regression is that it is highly interpretable.  The weights of logistic regression each correspond to an input feature, so we can use them to see which features are weighted most heavily when predicting profit.  This could also provide useful information to the superstore on which features most heavily influence whether a purchase will make above the median profit or not.
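
To illustrate how the weights turn into a prediction, the sketch below applies the logistic (sigmoid) function to a weighted sum of the features.  The weights, bias, and two-feature inputs are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def predict_proba(w, b, X):
    """Probability of the positive class under logistic regression."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# hypothetical weights for two features, e.g. (quantity, discount)
w = np.array([0.5, -2.0])
b = 0.0
X = np.array([[4.0, 0.0],   # weighted sum  2.0
              [4.0, 1.0],   # weighted sum  0.0
              [1.0, 0.5]])  # weighted sum -0.5

probabilities = predict_proba(w, b, X)
# predict the positive class only when the probability exceeds 0.5
predicted = (probabilities > 0.5).astype(int)

print(probabilities)
print(predicted)
```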

We perform another grid search for the logistic regression hyperparameters.

In [14]:
LR_PENALTY = ["l2", "l1", "elasticnet"] # determines the regularization penalty term
LR_ALPHA = [0.01, 0.001, 0.0002, 0.0001, 0.0005, 0.00001] # regularization strength; also sets the step size under sklearn's default "optimal" learning rate schedule
LR_EPOCHS = [500, 750, 1000, 1250, 1500] # maximum number of epochs for training (max_iter)

In [15]:
lr_bestPenalty = "l2"
lr_bestAplha = 0.0001
lr_bestEpochs = 1000
lr_bestF1 = 0

# grid search
for penalty in LR_PENALTY:
  for alpha in LR_ALPHA:
    for epoch in LR_EPOCHS:
      # we specify log_loss to train a logistic regression classifier
      LogReg = SGDClassifier(loss = 'log_loss', penalty=penalty, alpha=alpha, max_iter=epoch, early_stopping=EARLY_STOPPING, validation_fraction=VALIDATION_FRACTION, random_state=RANDOM_STATE)
      LogReg.fit(train, trainLabels)

      # evaluation on the validation set
      predictions = LogReg.predict(valid)
      f1 = f1_score(validLabels, predictions, average="macro")

      if f1 > lr_bestF1:
        lr_bestF1 = f1
        lr_bestPenalty = penalty
        lr_bestAplha = alpha
        lr_bestEpochs = epoch

# best performing hyperparameters
print("Best f1: " + str(lr_bestF1))
print("Best epochs: " + str(lr_bestEpochs))
print("Best alpha: " + str(lr_bestAplha))
print("Best penalty: " + str(lr_bestPenalty))

Best f1: 0.6982494097882874
Best epochs: 500
Best alpha: 0.0001
Best penalty: l1


In [16]:
LogReg = SGDClassifier(loss = 'log_loss', penalty=lr_bestPenalty, alpha=lr_bestAplha, max_iter=lr_bestEpochs, early_stopping=EARLY_STOPPING, validation_fraction=VALIDATION_FRACTION, random_state=RANDOM_STATE) # this is a logistic regression classifier
LogReg.fit(train, trainLabels)
predictions = LogReg.predict(test)

In [17]:
# evaluation of best-performing hyperparameters
print(classification_report(testLabels, predictions))

              precision    recall  f1-score   support

         0.0       0.74      0.67      0.71       499
         1.0       0.63      0.71      0.66       390

    accuracy                           0.69       889
   macro avg       0.69      0.69      0.68       889
weighted avg       0.69      0.69      0.69       889



Most of what we said in the SVM results section holds true here as well.  Logistic regression performs substantially better than a random guess, and would likely be usable for our classification task.  Like SVM, logistic regression is lightweight and would almost assuredly perform inference quicker than a neural network.  However, a neural network would likely make predictions more accurately.

In [18]:
print(LogReg.coef_) # logistic regression weights

[[  -1.5645983     6.78497593    0.            4.77404507 -140.38173997]]


The above cell shows the weights of our logistic regression model, in the order region, category, sub-category, quantity, discount.  We see that discount has a strong negative relationship with profit, which makes sense (the higher the discount, the less the superstore makes).  Quantity has a positive relationship with profit, which again makes sense; the more a customer buys, the greater the superstore's overall profit.  Region and product category have a negative and positive relationship with profit respectively, but only with respect to the numerical labels we assigned to regions and product categories; because those labels are arbitrary integer codes, the sign and size of these two weights should be read with caution.  The features were also not scaled, so the magnitudes of the weights are not directly comparable across features.  This is still useful information for the superstore; they might want to analyse their best-performing product categories and the regions they sell to, and reduce their operations to only those categories/regions.  Surprisingly, product sub-category has a weight of exactly zero.  The L1 penalty selected by the grid search drives weak weights to exactly zero, so I assume this is because the other features are far better predictors of profit than the integer-coded product sub-category is; however, given the statistical test showing profit was correlated with product sub-category, I am surprised by this result.
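
To make the weights easier to read, the sketch below pairs each printed weight with its feature name and ranks them by absolute value.  The weights are copied from the output above, and the feature order follows the way the feature vectors were built.  Since the features were not scaled, treat the ranking as a rough guide rather than a strict measure of importance.

```python
import numpy as np

feature_names = ["Region", "Category", "Sub-Category", "Quantity", "Discount"]
# weights copied from the printed LogReg.coef_ output
coef = np.array([-1.5645983, 6.78497593, 0.0, 4.77404507, -140.38173997])

# sort features from largest to smallest absolute weight
order = np.argsort(-np.abs(coef))
ranked = [(feature_names[i], coef[i]) for i in order]

for name, weight in ranked:
    print(f"{name:>12}: {weight:+.3f}")
```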