# Assignment 5: Discriminant Analysis

## The problems in this assignment are based on the exercises of Chapter 12 in Data Mining for Business Analytics.

### Scenario: A management consultant is studying the roles played by experience and training in a system administrator’s ability to complete a set of tasks in a specified amount of time. In particular, she is interested in discriminating between administrators who are able to complete given tasks within a specified time and those who are not. Data are collected on the performance of 75 randomly selected administrators. Using these data, the consultant performs a discriminant analysis.

### Data: The data are stored in the file SystemAdministrators.csv. The variable Experience measures months of full-time system administrator experience, while Training measures the number of relevant training credits. The dependent variable Completed task is either Yes or No, according to whether or not the administrator completed the tasks.

In [1]:
%matplotlib inline

from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from dmba import classificationSummary, regressionSummary

no display found. Using non-interactive Agg backend


In [2]:
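# Read in data (Path keeps the location portable; a backslash inside a plain string starts an escape sequence)
administrators = pd.read_csv(Path('dmba') / 'SystemAdministrators.csv')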
administrators.tail()

    Experience  Training Completed task
70         5.6         4             No
71         5.9         8             No
72         6.4         6             No
73         3.8         4             No
74         5.3         4             No


### Question 1 (6 points) Run a discriminant analysis with both predictors using the entire dataset as training data. Among those who completed the tasks, what is the percentage of administrators who are classified incorrectly as failing to complete the tasks?

In [3]:
administrators.columns

Index(['Experience', 'Training', 'Completed task'], dtype='object')

In [4]:
processed = pd.get_dummies(administrators, columns=['Completed task'], drop_first=True)
processed.tail()

    Experience  Training  Completed task_Yes
70         5.6         4                   0
71         5.9         8                   0
72         6.4         6                   0
73         3.8         4                   0
74         5.3         4                   0


In [5]:
lda_reg = LinearDiscriminantAnalysis()
lda_reg.fit(processed.drop(columns=['Completed task_Yes']), processed['Completed task_Yes'])

classificationSummary(processed['Completed task_Yes'], 
                      lda_reg.predict(processed.drop(columns=['Completed task_Yes'])),
                      class_names=lda_reg.classes_)
processed['Completed task_Yes'].value_counts()

Confusion Matrix (Accuracy 0.9067)

       Prediction
Actual  0  1
     0 58  2
     1  5 10


0    60
1    15
Name: Completed task_Yes, dtype: int64

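#### Among those who completed the tasks, what percentage were classified incorrectly as failing to complete them?
#### 15 administrators actually completed the tasks (the "Actual 1" row). 5 of these 15 were predicted as not completing the tasks: 5/15 = 0.3333, or 33.33%. (Measured against all 75 administrators, these 5 cases are 6.67% of the data, but the question conditions on those who completed the tasks.)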
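The arithmetic can be checked directly from the confusion matrix printed above. The short sketch below reads the "Actual 1" row (administrators who completed the tasks) and divides the misclassified count by that row's total; the variable names are illustrative only.

```python
# Confusion matrix copied from the output above: rows = actual, columns = predicted
confusion = [[58, 2],
             [5, 10]]

completed_row = confusion[1]          # administrators who actually completed the tasks
missed = completed_row[0]             # completed, but predicted as "not completed"
total_completed = sum(completed_row)  # all actual completers

miss_rate = missed / total_completed
print(f"{missed} of {total_completed} completers misclassified: {miss_rate:.2%}")
```

The denominator is the row total (15), not the size of the whole dataset (75), because the question asks only about administrators who completed the tasks.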

### Question 2 (4 points) Compute the two classification scores (the "task completed" classification score and the "task not completed" classification score) for an administrator with four months of experience and six credits of training. Is this administrator classified as "task not completed" or "task completed"?

In [6]:
# Build a one-row DataFrame for the new administrator:
# four months of experience and six training credits
new_administrator = pd.DataFrame({'Experience': [4], 'Training': [6]})

# Display the new record
new_administrator

   Experience  Training
0           4         6


In [7]:
lda_reg.predict(new_administrator)

array([0], dtype=uint8)

#### The new administrator is classified as "task not completed".
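The `predict` call above returns only the class label, while the question also asks for the two classification scores. One way to obtain a score per class is to evaluate the linear classification function score_c(x) = x'S⁻¹m_c − ½ m_c'S⁻¹m_c + ln(p_c), where m_c is the class mean, S the pooled covariance matrix, and p_c the class prior. The following is a minimal NumPy sketch of that formula; the function name `classification_scores` is hypothetical, and the data are made-up toy values, not the assignment data, so the printed scores are not the answer to Question 2.

```python
import numpy as np

def classification_scores(X, y, x_new):
    """Return {class label: linear classification score} for one observation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    x_new = np.asarray(x_new, dtype=float)
    labels = np.unique(y).tolist()
    n, p = X.shape

    # Class means, priors, and the pooled within-class covariance matrix
    means, priors = {}, {}
    pooled = np.zeros((p, p))
    for c in labels:
        Xc = X[y == c]
        means[c] = Xc.mean(axis=0)
        priors[c] = len(Xc) / n
        centered = Xc - means[c]
        pooled += centered.T @ centered
    pooled /= (n - len(labels))
    pooled_inv = np.linalg.inv(pooled)

    # Linear classification function evaluated at x_new for each class
    scores = {}
    for c in labels:
        weights = pooled_inv @ means[c]
        constant = -0.5 * means[c] @ weights + np.log(priors[c])
        scores[c] = float(x_new @ weights + constant)
    return scores

# Made-up toy data (Experience, Training) with a 0/1 outcome, for illustration only
X_toy = [[2, 4], [3, 4], [4, 6], [5, 4], [6, 6],
         [9, 6], [10, 8], [11, 4], [12, 6], [13, 8]]
y_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

scores = classification_scores(X_toy, y_toy, [4, 6])
predicted = max(scores, key=scores.get)  # the class with the highest score wins
print(scores, "->", predicted)
```

Applied to the real predictors and outcome, the same function would give the "task not completed" (0) and "task completed" (1) scores for the new administrator, and the larger of the two determines the classification.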

### Question 3 (10 points) Now partition the original data into training and validation data (set the random_state=1), and run a discriminant analysis, AND a neural net. For each method, compare the training and validation results, and comment.

In [8]:
# Read in data
administrators = pd.read_csv(Path('dmba') / 'SystemAdministrators.csv')

processed = pd.get_dummies(administrators, columns=['Completed task'], drop_first=True)

outcome = 'Completed task_Yes'

predictors = [c for c in processed.columns if c != outcome]


X = processed[predictors]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
X = rescaledX

y = processed[outcome]
minY = y.min()
rangeY = (y.max() - y.min())
# Transform the actual values to range [0, 1]
y = (y - minY)/rangeY



In [9]:
# split the data into training (60%) and validation (40%) datasets (use random_state=1).
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)
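In the cells above, the `MinMaxScaler` is fit on all 75 rows before the split, so the minimum and maximum of the validation rows also shape the scaling. A common alternative is to split first and fit the scaler on the training partition only. The sketch below illustrates that order on made-up stand-in data (`X_demo` and `y_demo` are hypothetical, not the assignment data).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-in for the two predictors and the 0/1 outcome (illustration only)
rng = np.random.default_rng(1)
X_demo = rng.uniform(low=[2.0, 4.0], high=[14.0, 8.0], size=(75, 2))
y_demo = (X_demo[:, 0] > 9.0).astype(int)

# Split first: 60% training, 40% validation
train_X, valid_X, train_y, valid_y = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=1)

# Fit the scaler on the training rows only, then reuse it for the validation rows
scaler = MinMaxScaler(feature_range=(0, 1))
train_X_scaled = scaler.fit_transform(train_X)
valid_X_scaled = scaler.transform(valid_X)

# Validation values may fall slightly outside [0, 1]; that is expected
print(valid_X_scaled.min(axis=0), valid_X_scaled.max(axis=0))
```

With only two predictors on similar scales the practical difference is likely small here, but fitting on the training partition keeps the validation results an honest estimate.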

### Discriminant Analysis of the original data partitioned into training and validation data (set the random_state=1)

In [10]:
lda_reg = LinearDiscriminantAnalysis()
lda_reg.fit(train_X, train_y)
lda_reg.predict(valid_X)

array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0.])

In [11]:
print('Accuracy of the LDA predictions on the Training Set:')
classificationSummary(train_y, lda_reg.predict(train_X))
print('\nAccuracy of the LDA predictions on the Validation Set:')
classificationSummary(valid_y, lda_reg.predict(valid_X))

Accuracy of the LDA predictions on the Training Set:
Confusion Matrix (Accuracy 0.9111)

       Prediction
Actual  0  1
     0 31  1
     1  3 10

Accuracy of the LDA predictions on the Validation Set:
Confusion Matrix (Accuracy 0.9000)

       Prediction
Actual  0  1
     0 25  3
     1  0  2


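#### Comments on the Linear Discriminant Analysis of the partitioned data:
#### The LDA model reaches an accuracy of 0.9111 on the training set and a only slightly lower accuracy of 0.9000 on the validation set, so there is little sign of overfitting. The errors differ by class, however: on the training set 3 of the 13 administrators who completed the tasks are missed, while on the validation set both completers are classified correctly and 3 of the 28 non-completers are misclassified as completing. With only 2 completers in the validation set, the per-class rates there are very rough estimates.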

### Neural Network Analysis of the original data partitioned into training and validation data (set the random_state=1)

In [12]:
# train a neural network with a single hidden layer of 2 nodes
# (MLPRegressor is used, so the output is a numeric score rather than a class label)
clf = MLPRegressor(hidden_layer_sizes=(2, ), activation='logistic', solver='lbfgs',
                    random_state=1)
print(clf.fit(train_X, train_y))
print()
print(clf.predict(valid_X))

MLPRegressor(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(2,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

[-4.74464260e-03  1.39750558e-01  5.62102756e-01  8.95599359e-03
 -3.82694529e-03  4.72908299e-03  1.45555785e-02 -6.27220834e-03
 -8.05467241e-04  1.83843495e-02 -1.40956090e-02 -5.09695138e-03
  9.91807166e-01  1.45555785e-02  1.15612203e-02  1.04146311e+00
 -1.74547512e-03 -5.63203384e-03  4.72908299e-03  4.14965549e-01
 -3.24020989e-03 -6.11042713e-03  7.64735751e-01  1.55764872e-03
  2.04403022e-01  8.10452901e-02  3.74149949e-02  5.17355102e-02
  6.42500103e-01  3.74580285e-02]


In [13]:
# training
print('RMS error for the training data:')
regressionSummary(train_y, clf.predict(train_X))
# validation
print('\nRMS error for the validation data:')
regressionSummary(valid_y, clf.predict(valid_X))

RMS error for the training data:

Regression statistics

                      Mean Error (ME) : -0.0000
       Root Mean Squared Error (RMSE) : 0.2529
            Mean Absolute Error (MAE) : 0.1361
          Mean Percentage Error (MPE) : nan
Mean Absolute Percentage Error (MAPE) : inf

RMS error for the validation data:

Regression statistics

                      Mean Error (ME) : -0.0999
       Root Mean Squared Error (RMSE) : 0.2899
            Mean Absolute Error (MAE) : 0.1331
          Mean Percentage Error (MPE) : nan
Mean Absolute Percentage Error (MAPE) : inf




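#### Comments on the Neural Network analysis of the partitioned data:
#### The neural network with a single hidden layer of 2 nodes gives a Root Mean Squared Error (RMSE) of 0.2529 on the training set and a slightly higher RMSE of 0.2899 on the validation set, so its performance on new data is close to its training performance. MPE and MAPE are reported as nan and inf because the 0/1 outcome contains zeros, which makes the percentage errors divide by zero; they can be ignored here. Because the network was fit as a regressor, its RMSE is not directly comparable with the LDA accuracy.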
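To put the two models on the same footing, the numeric outputs of the network can be turned into class labels with a cutoff. The sketch below copies the validation outputs of the `MLPRegressor` and the validation labels from `lda_reg.predict(valid_X)` as printed above, and applies a cutoff of 0.5, which is an assumption and not something the assignment specifies.

```python
import numpy as np

# Validation outputs of the MLPRegressor, copied from the printed array above
nn_scores = np.array([
    -4.74464260e-03, 1.39750558e-01, 5.62102756e-01, 8.95599359e-03,
    -3.82694529e-03, 4.72908299e-03, 1.45555785e-02, -6.27220834e-03,
    -8.05467241e-04, 1.83843495e-02, -1.40956090e-02, -5.09695138e-03,
    9.91807166e-01, 1.45555785e-02, 1.15612203e-02, 1.04146311e+00,
    -1.74547512e-03, -5.63203384e-03, 4.72908299e-03, 4.14965549e-01,
    -3.24020989e-03, -6.11042713e-03, 7.64735751e-01, 1.55764872e-03,
    2.04403022e-01, 8.10452901e-02, 3.74149949e-02, 5.17355102e-02,
    6.42500103e-01, 3.74580285e-02])

# Validation labels predicted by LDA, copied from the lda_reg.predict(valid_X) output
lda_labels = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
                       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0])

cutoff = 0.5  # assumed cutoff for converting scores to labels
nn_labels = (nn_scores >= cutoff).astype(int)

print("Predicted 'completed' by the network:", int(nn_labels.sum()))
print("Same labels as LDA on every validation row:",
      bool((nn_labels == lda_labels).all()))
```

If the labels agree on every row, the thresholded network would reproduce the LDA validation confusion matrix, which suggests that the extra complexity of the network buys nothing on this partition. A different cutoff could change that picture.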


### General Comments comparing the LDA vs. the Neural Network analysis

#### Discriminant analysis is based on the statistical distance of an observation from each class centroid, a distance that accounts for the spreads of the predictors and the correlations between them. Computing this distance requires inverting the covariance matrix, which becomes computationally expensive as the number of predictors grows; with only two predictors, as here, the cost is negligible. A neural network, by contrast, adds complexity through the number of layers and the number of nodes per layer, all of which must be chosen and tuned. Thus LDA, being the simpler of the two models, would be the one to go with.