# Machine Learning for Business - Assessment Notebook

### Please enter your student number below:

*(Enter your student number here)*



---



# 1. Outline of the problem

### Predicting inventory errors

Ensuring the availability of fast-moving items is essential for grocery retailers. In many cases, however, retailers cannot make accurate reordering decisions because their inventory records (i.e. the recorded quantity of each item in stock) do not match what is actually available on the shelf. Machine learning is proving to be an effective way to address this problem: retailers can train algorithms to predict which records are likely to contain a discrepancy between the amount of stock registered in the inventory records and the amount of stock actually on the shelf. Being able to predict potential discrepancies allows stores to quickly correct emerging out-of-stock scenarios by reordering products accordingly.

In this exercise, we will train two algorithms aimed at improving the inventory accuracy of a large grocery retailer. The training data contains SKU-level (stock keeping unit) information about actual sales, forecast sales, and product size, among other variables. We will use these variables to predict potential errors in the inventory stock at SKU level.
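
To make the idea of an inventory discrepancy concrete, the sketch below flags a record as an error whenever the recorded and the physically counted stock disagree. The column names and values are hypothetical and chosen for illustration only; the dataset used in this notebook already ships with a `stock_error` column, and how that column was built is not stated here.

```python
import pandas as pd

# Hypothetical SKU-level records; these column names are illustrative only
records = pd.DataFrame({
    "sku": ["A1", "B2", "C3", "D4"],
    "recorded_stock": [12, 0, 7, 30],   # units according to the inventory system
    "counted_stock": [12, 3, 7, 25],    # units found in a physical shelf count
})

# Flag a stock error whenever the record and the physical count disagree
records["stock_error_example"] = (
    records["recorded_stock"] != records["counted_stock"]
).astype(int)

print(records)
```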

# 2. Setup

### Import libraries

Import libraries for managing data structures and plotting figures

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

sns.set()

### Import the dataset

In [None]:
data = pd.read_csv("https://sbs-ml.s3.eu-west-1.amazonaws.com/StockErrors-2.csv")

In [None]:
data.head(6)

# 3. Data exploration

### Numerical exploration

Number of columns and rows in the dataset

In [None]:
print('Data size : ', data.shape)

Checking for missing values

In [None]:
print('Null values per column : \n', data.isnull().sum())

Calculating the basic statistics for each column

In [None]:
data.describe()

Inspecting the distribution of errors in the inventory records

In [None]:
# Drop rows with missing values before inspecting the class balance
data = data.dropna()

In [None]:
print('\nBalance of positive and negative error classes (%): \n', 
      data['stock_error'].value_counts(normalize=True) * 100)
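
The class balance matters when reading accuracy figures later on. As a rough illustration on synthetic labels (not the notebook's data), the sketch below shows that a model which always predicts the most common class already reaches an accuracy equal to that class's share.

```python
import numpy as np

# Synthetic labels for illustration only: 90% class 0, 10% class 1
labels = np.array([0] * 90 + [1] * 10)

# Share of each class, analogous to value_counts(normalize=True)
class_share = np.bincount(labels) / labels.size

# Accuracy of a trivial model that always predicts the most common class
majority_baseline_accuracy = class_share.max()

print(class_share)
print(majority_baseline_accuracy)
```

This is one reason why the notebook reports ROC-AUC and a confusion matrix rather than accuracy alone.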

### Splitting the data



In [None]:
from sklearn.model_selection import train_test_split

Separating the independent variables from the target variable

In [None]:
X = data.drop(['stock_error'], axis = 1)
target = data['stock_error']

Splitting the data into our training and testing data sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    target,  
                                                    test_size = 0.3, 
                                                    random_state = 44,
                                                    stratify=target)
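
The `stratify` argument asks `train_test_split` to keep the class proportions the same in both subsets. The sketch below checks this on synthetic labels; the `_demo` names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels for illustration only (20% positives)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=44, stratify=y_demo
)

# Share of positives in each subset
print(y_tr.mean(), y_te.mean())
```

Without stratification, the share of positives in a small test set can drift away from the share in the full data purely by chance.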

# 4. Training of Algorithm # 1

### Pre-processing of the data

In [None]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

Pre-processing of both numerical and categorical variables

In [None]:
minMax = MinMaxScaler()

In [None]:
# Transformer for categorical variables
cat_transformer = Pipeline(steps=[
        # Note: `sparse_output` requires scikit-learn >= 1.2; older versions use `sparse=False`
        ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
    ])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', minMax, list(range(0, 22))),
        ('cat', cat_transformer, list(range(22, 27)))
    ], remainder='passthrough')
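
To illustrate what one-hot encoding with `handle_unknown='ignore'` is meant to do, here is a hand-written sketch in NumPy. It is not scikit-learn's implementation, and the category values are made up: a category that was not seen during fitting is encoded as a row of zeros instead of raising an error.

```python
import numpy as np

# Made-up categorical values, for illustration only
train_categories = ["small", "medium", "large", "small"]
test_categories = ["medium", "extra-large"]   # "extra-large" is unseen in training

# "Fit": the known categories come from the training data only
known = sorted(set(train_categories))         # ['large', 'medium', 'small']

def one_hot(values):
    """Encode each value as a 0/1 row over the known categories."""
    return np.array([[int(value == category) for category in known]
                     for value in values])

train_encoded = one_hot(train_categories)
test_encoded = one_hot(test_categories)
print(test_encoded)
```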


In [None]:
X_train_sc = preprocessor.fit_transform(X_train)
X_test_sc = preprocessor.transform(X_test)
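
The split between `fit_transform` on the training set and `transform` on the test set means that the scaling parameters are learned from the training data only. A minimal sketch of min-max scaling under that convention, using a single synthetic feature:

```python
import numpy as np

# Synthetic single feature, for illustration only
train_feature = np.array([10.0, 20.0, 30.0, 40.0])
test_feature = np.array([15.0, 50.0])

# "Fit": learn the minimum and maximum from the training data only
train_min, train_max = train_feature.min(), train_feature.max()

def min_max_scale(values):
    """Rescale values using the training minimum and maximum."""
    return (values - train_min) / (train_max - train_min)

train_scaled = min_max_scale(train_feature)   # lies within [0, 1]
test_scaled = min_max_scale(test_feature)     # may fall outside [0, 1]

print(train_scaled)
print(test_scaled)
```

Because the test value 50 exceeds the training maximum, its scaled value is greater than 1. This is expected behaviour and not a sign of an error.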

### Loading the model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf = RandomForestClassifier(random_state=44)

Training the model

In [None]:
clf.fit(X_train_sc, y_train)

## Making predictions

### Predictions on the training set


Computing ROC-AUC

In [None]:
from sklearn.metrics import roc_auc_score, confusion_matrix

In [None]:
prob_est_train = clf.predict_proba(X_train_sc)
roc_train = roc_auc_score(y_train, prob_est_train[:, 1].T)
print('The {} has an ROC-AUC on the training set of {}'.format('Random Forest', roc_train))
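
One way to read ROC-AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing all positive-negative pairs. It is an illustration on toy data, not how `roc_auc_score` is implemented, and `pairwise_roc_auc` is a hypothetical helper name.

```python
import numpy as np

def pairwise_roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos_scores = scores[y_true == 1]
    neg_scores = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos_scores[:, None] - neg_scores[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos_scores) * len(neg_scores))

# Toy labels and scores, for illustration only
y_demo = [0, 0, 1, 1]
score_demo = [0.1, 0.4, 0.35, 0.8]

auc_demo = pairwise_roc_auc(y_demo, score_demo)
print(auc_demo)
```

Here three of the four positive-negative pairs are ranked correctly, so the result is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.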

Plotting the Confusion Matrix

In [None]:
y_pred_train_rf = clf.predict(X_train_sc)
cm_rf_train = confusion_matrix(y_true=y_train, y_pred=y_pred_train_rf)

In [None]:
plt.figure(figsize=(5, 5))
sns.heatmap(cm_rf_train, annot=True, fmt="d")
plt.title('Confusion matrix for {}'.format('Random Forest on Training Set'))
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
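
When reading the heatmap, it helps to know the layout scikit-learn uses for a binary confusion matrix: rows are actual classes, columns are predicted classes, so the cells are `[[TN, FP], [FN, TP]]`. The sketch below rebuilds such a matrix by hand on synthetic labels to show where each count ends up.

```python
import numpy as np

# Synthetic labels for illustration only
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Build the 2x2 matrix by hand: rows = actual class, columns = predicted class
cm_demo = np.zeros((2, 2), dtype=int)
for actual, predicted in zip(y_true_demo, y_pred_demo):
    cm_demo[actual, predicted] += 1

# Unpack in the order TN, FP, FN, TP
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
```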

### Predictions on the testing set

In [None]:
prob_est_test_rf = clf.predict_proba(X_test_sc)
roc_test_rf = roc_auc_score(y_test, prob_est_test_rf[:, 1].T)
print('The {} has an ROC-AUC on the testing set of {}'.format('Random Forest', roc_test_rf))

Plotting the Confusion Matrix

In [None]:
y_pred_test_rf = clf.predict(X_test_sc)
cm_rf_test = confusion_matrix(y_true=y_test, y_pred = y_pred_test_rf)

In [None]:
plt.figure(figsize=(5, 5))
sns.heatmap(cm_rf_test, annot=True, fmt="d")
plt.title('Confusion matrix for {}'.format('Random Forest on Testing Set'))
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Calculating further metrics

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_pred_test_rf))
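
The precision, recall and F1 values in the classification report can all be derived from the confusion-matrix counts. A short worked example with hypothetical counts (not results from this notebook):

```python
# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 30, 10, 20, 140

# Of all records flagged as errors, how many really were errors?
precision = tp / (tp + fp)

# Of all real errors, how many did the model flag?
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Share of all records classified correctly
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(precision, recall, f1, accuracy)
```

With these counts, precision is 0.75, recall is 0.6 and accuracy is 0.85, which shows how a high accuracy can coexist with a noticeably lower recall on the positive class.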

# 5. Training of Algorithm # 2

In [None]:
import tensorflow as tf
# Use the Keras bundled with TensorFlow so metrics and model come from the same package
from tensorflow import keras

### Defining the metrics for the evaluation

In [None]:
METRICS = [
    keras.metrics.TruePositives(name='tp'),
    keras.metrics.FalsePositives(name='fp'),
    keras.metrics.TrueNegatives(name='tn'),
    keras.metrics.FalseNegatives(name='fn'),
    keras.metrics.BinaryAccuracy(name='accuracy'),
    keras.metrics.AUC(name='auc'),
    keras.metrics.AUC(name='prc', curve='PR'),
]

### Loading the model

In [None]:
# Setting the number of layers and neurons per layer
neurons = 70
hidden_layers = 2

In [None]:
# Calculating the initial bias from the training labels only,
# so that no information from the test set is used
neg, pos = np.bincount(y_train)
initial_bias = np.log([pos / neg])
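
The reason for setting the output bias to log(pos / neg) is that a sigmoid output with this bias, and no other input, predicts exactly the share of positives: sigmoid(log(pos / neg)) = pos / (pos + neg). The check below uses made-up class counts.

```python
import numpy as np

# Made-up class counts, for illustration only
neg_demo, pos_demo = 900, 100

# Bias chosen as the log-odds of the positive class
bias_demo = np.log(pos_demo / neg_demo)

# Sigmoid of the bias: what the output unit predicts before any training
initial_prediction = 1 / (1 + np.exp(-bias_demo))

print(initial_prediction)                 # share of positives predicted at the start
print(pos_demo / (pos_demo + neg_demo))   # actual share of positives
```

Starting from the base rate means the network does not have to spend its first epochs learning that positives are rare.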

In [None]:
# Splitting the data into training and validation sets
X_train_ann, X_val_ann, y_train_ann, y_val_ann = train_test_split(
    X_train_sc, y_train, test_size=0.2, stratify=y_train, random_state=44)

In [None]:
# Initialising the model
ann = tf.keras.models.Sequential()

# Adding fully connected layers
for _ in range(hidden_layers):
    ann.add(tf.keras.layers.Dense(units=neurons, activation='relu'))

# Add a dropout layer
ann.add(tf.keras.layers.Dropout(0.2))
# Add the output layer
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid', bias_initializer=tf.keras.initializers.Constant(initial_bias)))

# Compiling the model
ann.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss='binary_crossentropy', metrics=METRICS)
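
The `binary_crossentropy` loss penalises confident wrong predictions heavily. As a rough sketch of the underlying formula, the mean of -[y log(p) + (1 - y) log(1 - p)], here is a NumPy version evaluated on toy values. Keras applies additional numerical safeguards that are left out here.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy for labels in {0, 1} and probabilities in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    per_example = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return per_example.mean()

# Toy labels and predicted probabilities, for illustration only
y_demo = [1, 0, 1]
p_demo = [0.9, 0.2, 0.6]

bce_demo = binary_cross_entropy(y_demo, p_demo)
print(bce_demo)
```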

Training of the model


In [None]:
baseline_history = ann.fit(X_train_ann,
                           y_train_ann,
                           batch_size=32,
                           epochs=100,
                           validation_data=(X_val_ann, y_val_ann),
                           )


## Making predictions

### Predictions on the training set

In [None]:
# Plot the training ROC-AUC recorded by Keras after each epoch
plt.plot(baseline_history.epoch, baseline_history.history['auc'])
plt.xlabel('Epoch')
plt.ylabel('ROC-AUC (training)')

In [None]:
train_predictions_baseline = ann.predict(X_train_sc)

In [None]:
ann_predictions  = pd.DataFrame(train_predictions_baseline)

In [None]:
confusion = confusion_matrix(y_true= y_train, y_pred = ann_predictions.iloc[:, 0] > 0.5)
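
The neural network outputs probabilities, and the cut-off of 0.5 used above is a choice rather than a fixed rule. The sketch below uses synthetic probabilities to show how moving the threshold trades true positives against false positives.

```python
import numpy as np

# Synthetic predicted probabilities and true labels, for illustration only
probs_demo = np.array([0.05, 0.30, 0.45, 0.55, 0.70, 0.95])
y_demo = np.array([0, 0, 1, 0, 1, 1])

results = {}
for threshold in (0.25, 0.5, 0.75):
    predicted = (probs_demo > threshold).astype(int)
    true_positives = int(((predicted == 1) & (y_demo == 1)).sum())
    false_positives = int(((predicted == 1) & (y_demo == 0)).sum())
    results[threshold] = (true_positives, false_positives)

print(results)
```

A lower threshold catches more real errors but also raises more false alarms. Which balance is appropriate depends on the relative cost of each kind of mistake.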

Plotting the Confusion Matrix

In [None]:
plt.figure(figsize=(5, 5))
sns.heatmap(confusion, annot=True, fmt="d")
plt.title('Confusion matrix for {}'.format('Neural Network on Training Set'))
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

### Predictions on the testing set


In [None]:
test_predictions_baseline = ann.predict(X_test_sc)

In [None]:
ann_predictions_test  = pd.DataFrame(test_predictions_baseline)

In [None]:
confusion_test = confusion_matrix(y_true= y_test, y_pred = ann_predictions_test.iloc[:, 0] > 0.5)

Plotting the Confusion Matrix

In [None]:
plt.figure(figsize=(5, 5))
sns.heatmap(confusion_test, annot=True, fmt="d")
plt.title('Confusion matrix for {}'.format('Neural Network on Testing Set'))
plt.ylabel('Actual label')
plt.xlabel('Predicted label')



---



# Your task

First, make sure you have added your **student number** at the top!

Then run the above notebook, and **answer the following two questions** (both parts are weighted equally):


1.   State what kind of machine learning algorithms have been implemented in this workbook, and briefly interpret the results obtained.

2.   Discuss the advantages and limitations of the two modelling approaches taken here, and state which approach you would choose for the task at hand. Justify your answer!



# Your answer (1,000 words max)

*(Please type here)*



---



***Now save your notebook, print it as a PDF, and submit the PDF to SAMS!***