
# Multi-Variable Logistic Regression and Classification Matrix

_Authors: Sam Stack (DC)_


**Exercise Objectives**
- Hands-on experience using Multi-Variable Logistic Regression
- Review and Exploration of the Classification Matrix and its evaluation Metrics
- Introduction to One vs. One and One vs. Rest Classifiers.

**Let's get some data.**

One of the most popular classification datasets in machine learning is the Iris dataset, which can be loaded directly from `sklearn.datasets`.
- Sklearn datasets are loaded as dictionary-like objects whose keys (also available as attributes) give access to specific parts of the dataset.
    - `iris.data` : actual matrix of observations
    - `iris.target` : target column for classification
    - `iris.feature_names` :  column names
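
As a quick illustration of the key-based access described above, the sketch below loads the dataset and reads the same fields both as attributes and as dictionary keys. It assumes scikit-learn is installed; `target_names` is an additional key not listed above.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object behaves like a dictionary, so its keys can be listed
print(sorted(iris.keys()))

# Attribute access and key access return the same content
print(iris.data.shape)                               # observations x features
print(iris["feature_names"] == iris.feature_names)   # same list either way

# target_names maps the integer labels in iris.target to species names
print(list(iris.target_names))
```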

In [1]:
import seaborn as sns
import pandas as pd
from sklearn import datasets

In [2]:
iris = datasets.load_iris()

# .data holds arrays of values for each sample
X = pd.DataFrame(iris.data, columns=iris.feature_names)
# .target holds an array of integer labels (0, 1, 2), one per sample, that encode the iris classes
y = iris.target

In [3]:
# Examine the data
X.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [4]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

**Break down of classes**  
0 : Setosa  
1 : Versicolor  
2 : Virginica  

----

**Modelling**
This data is extremely neat and tidy so no cleaning is necessary and we can get right into modelling.

In [5]:
X.describe()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| --- | --- | --- | --- | --- |
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [6]:
# model the data, using train-test split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [7]:
# Older scikit-learn releases (before 0.22) emit a FutureWarning unless solver and multi_class are set explicitly
logreg = LogisticRegression(solver='lbfgs', multi_class='auto')

In [8]:
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [9]:
logreg.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [10]:
# Predict class labels for the held-out test set
y_pred = logreg.predict(x_test)

In [11]:
# evaluate model performance with confusion matrix.
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

array([[15,  0,  0],
       [ 0, 11,  0],
       [ 0,  0, 12]], dtype=int64)

With a multi-class confusion matrix, some of our labels (True Pos., True Neg., False Pos., False Neg.) get a little warped.  We are no longer separating one class from a null class; we are classifying into 3 distinct classes.  

The **True** diagonal stays the same, as these are properly classified observations.  In scikit-learn's output, rows are the actual classes and columns are the predicted classes.


|     | Pred Class 0 | Pred Class 1 | Pred Class 2 |
| --- |:------------:|:------------:|:------------:|
| **Actual Class 0** | 15 | 0  | 0  |
| **Actual Class 1** | 0  | 11 | 0  |
| **Actual Class 2** | 0  | 0  | 12 |


It is better to stick with True and False labels with multi-class to avoid [_Confusion_](https://www.youtube.com/watch?v=bcYppAs6ZdI)

If you need to refer to False Positives or True Negatives, first pick a specific class, such as `Class 2`, and describe classification or misclassification relative to that class rather than to the set of all classes. 

Example:
    _True Negatives relative to Class 2 are all observations that neither belong to Class 2 nor were predicted as Class 2, which includes the True Positives for Class 0 and Class 1._
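
To make the "relative to a chosen class" idea concrete, here is a minimal sketch that derives the four counts for each class from the confusion matrix printed earlier. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes); the helper name `one_vs_all_counts` is hypothetical and not part of any library.

```python
# Confusion matrix from the notebook output above
# (rows = actual class, columns = predicted class)
cm = [
    [15, 0, 0],
    [0, 11, 0],
    [0, 0, 12],
]

def one_vs_all_counts(matrix, k):
    """Return (TP, FP, FN, TN) for class k of a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(matrix[i][k] for i in range(n)) - tp  # predicted k, actually another class
    fn = sum(matrix[k]) - tp                       # actually k, predicted another class
    tn = total - tp - fp - fn                      # everything that does not involve class k
    return tp, fp, fn, tn

for k in range(len(cm)):
    tp, fp, fn, tn = one_vs_all_counts(cm, k)
    print(f"Class {k}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

For `Class 2` the True Negative count is 26, which here equals the 15 + 11 correctly classified observations of the other two classes because the model made no mistakes.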

Speaking of our Classes, how are probabilities calculated with multi-class?
- Are they Probability of `Class 0` vs. `Not Class 0`?
- Or Probability of `Class 0` vs. `Class 1` vs. `Class 2` ?

In [12]:
# predict_proba returns one column per class, as in the binary case; the 3 probabilities in each row sum to 1
y_prob = logreg.predict_proba(x_test)

In [16]:
y_prob[:10, :]

array([[3.98546368e-03, 8.22130675e-01, 1.73883861e-01],
       [9.44175902e-01, 5.58237879e-02, 3.10136387e-07],
       [1.20890504e-08, 1.82799179e-03, 9.98171996e-01],
       [6.68179712e-03, 7.87139247e-01, 2.06178955e-01],
       [1.54224282e-03, 7.69786608e-01, 2.28671149e-01],
       [9.52618170e-01, 4.73815678e-02, 2.62163071e-07],
       [7.75359367e-02, 9.06996363e-01, 1.54677002e-02],
       [1.77954940e-04, 1.58493509e-01, 8.41328537e-01],
       [2.33019049e-03, 7.75672873e-01, 2.21996936e-01],
       [2.87125656e-02, 9.44160935e-01, 2.71264996e-02]])

In [17]:
# They should all add up to 1 - let's test that
for a, b, c in logreg.predict_proba(x_test):
    print(sum([a, b, c]))

1.0
1.0
1.0
1.0
1.0
1.0
0.9999999999999998
0.9999999999999999
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.9999999999999999
0.9999999999999999
1.0
0.9999999999999999
0.9999999999999999
1.0000000000000002
1.0
0.9999999999999998
0.9999999999999999
1.0
1.0
1.0
1.0
1.0
1.0
1.0000000000000002
1.0
1.0
0.9999999999999999
1.0
1.0
1.0
1.0


Up to floating-point rounding, the probabilities in each row add up to 1, so the model is scoring `Class 0` vs. `Class 1` vs. `Class 2` jointly.
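
One way to see why the rows sum to 1: with `solver='lbfgs'` and `multi_class='auto'` on a target with more than two classes, scikit-learn fits a multinomial model, which turns per-class scores into probabilities with a softmax. The sketch below is a plain-Python illustration of that normalization, not scikit-learn's implementation, and the scores are hypothetical values rather than output from the fitted model.

```python
import math

def softmax(scores):
    """Turn one row of per-class linear scores into probabilities."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical decision scores for one observation (Class 0, Class 1, Class 2)
scores = [-1.2, 3.4, 1.9]
probs = softmax(scores)

print(probs)
# Compare with a tolerance instead of == because of floating-point rounding
print(math.isclose(sum(probs), 1.0))
```

Comparing with `math.isclose` rather than `==` avoids being misled by sums such as `0.9999999999999999` in the output above.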

What if we wanted to create a logistic regression that has `Class 0` vs. `Class 1` & `Class 2` or just `Class 0` vs. `Class 2`?  We will cover that in a bit, but first more evaluation metrics.

---

**Classification Reports/Matrix**

Classification reports are another means of evaluation for classification models, and return a few metrics that are based on True Positives, False Positives and False Negatives.  

In [18]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       1.00      1.00      1.00        11
           2       1.00      1.00      1.00        12

    accuracy                           1.00        38
   macro avg       1.00      1.00      1.00        38
weighted avg       1.00      1.00      1.00        38



**Precision**  
- "How many of the items selected are relevant."
- Of the items placed into a class, how many of them are True Positives.


$$\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

**Recall**  
- "How many of the relevant items are selected."
- Of the items that were supposed to be placed into a class, how many did we place correctly.


$$\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

**F1-Score**

F1 ranges from 0 to 1, where 0 is the worst possible score and 1 is perfect.
F1 is the harmonic mean of Precision and Recall.  With classification models you often have to choose which kind of error you are willing to increase in order to reduce the other, and thus you may want to optimize Precision or Recall accordingly.  If you are uncertain which to optimize, F1 may be the metric of choice.

$$2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

**Support**

The number of observations that actually belong to the given class in the evaluated data (here, in `y_test`).  
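
Because the iris model scores a perfect 1.00 everywhere, the formulas are easier to follow on a matrix that contains some mistakes. The sketch below applies the definitions above to a hypothetical confusion matrix (rows are actual classes, columns are predicted classes); the numbers and the helper name `class_metrics` are made up for illustration.

```python
# Hypothetical confusion matrix with a few misclassifications
# (rows = actual class, columns = predicted class)
cm_example = [
    [15, 0, 0],
    [0, 9, 2],
    [0, 1, 11],
]

def class_metrics(matrix, k):
    """Return (precision, recall, f1, support) for class k."""
    n = len(matrix)
    tp = matrix[k][k]
    predicted_k = sum(matrix[i][k] for i in range(n))  # TP + FP (column total)
    support = sum(matrix[k])                           # TP + FN (row total)
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / support if support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, support

for k in range(len(cm_example)):
    p, r, f1, s = class_metrics(cm_example, k)
    print(f"Class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={s}")
```

For Class 1 in this made-up matrix, 10 observations were predicted as Class 1 and 9 of them were correct (precision 0.90), while 11 observations truly belong to Class 1 and 9 were found (recall of about 0.82).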

---

In [28]:
# 0 : Setosa
# 1 : Versicolor
# 2 : Virginica

# Plot probability distribution (y_prob)

## Intro to Ensembling

Earlier we talked about building models relative to Class combinations: distinguishing one Class from all other Classes, or just one specific Class from another specific Class.  Both are possible by wrapping Logistic Regression in a meta-estimator that trains several binary models and combines their outputs.

Up until this point we have used one model, but there are also Machine Learning methods that involve combining several models to arrive at a more refined conclusion, commonly referred to as Ensemble Methods.

### One Vs. Rest Classification.

One vs. Rest Classification builds an individual model for each Class, to distinguish that specific Class from the rest of the Classes.  When a model focuses on one Class, e.g. `Class1`, it groups `Class2`, `Class3`, and `Class4` into a single class of `Not Class1`.  The same happens for each of the remaining Classes, so four Classes need four models.

1 - Class1 vs. Class2, Class3, Class4  
2 - Class2 vs. Class1, Class3, Class4  
3 - Class3 vs. Class1, Class2, Class4   
4 - Class4 vs. Class1, Class2, Class3  
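
The grouping listed above amounts to relabeling the target once per class. Here is a minimal plain-Python sketch of that relabeling step, using a hypothetical toy target with the same 0/1/2 labels as iris; `one_vs_rest_targets` is an illustrative helper, not a scikit-learn function.

```python
# Hypothetical toy target using the iris label encoding (0, 1, 2)
y_small = [0, 0, 1, 2, 1, 2]

def one_vs_rest_targets(labels):
    """Map each class to the binary target its one-vs-rest model would be trained on."""
    classes = sorted(set(labels))
    # 1 marks "this class", 0 marks "any other class"
    return {c: [1 if label == c else 0 for label in labels] for c in classes}

binary_targets = one_vs_rest_targets(y_small)
for c, target in binary_targets.items():
    print(f"Class {c} vs. rest: {target}")
```

Each binary target would be fitted by its own model, and at prediction time the class whose model is most confident is chosen.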

### One Vs. One Classification.

We train a model for every pair of classes, so $k$ classes require $k(k-1)/2$ models.  As more classes are added this becomes more computationally expensive.  

1 - Class1 vs. Class2  
2 - Class1 vs. Class3  
3 - Class1 vs. Class4  
4 - Class2 vs. Class3  
5 - Class2 vs. Class4  
6 - Class3 vs. Class4  
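
The pairings above can be enumerated programmatically, which also shows how quickly the number of models grows. This is a small sketch using only the standard library; the class names are the placeholder names from the list, and `n_ovo_models` is a hypothetical helper.

```python
from itertools import combinations

classes = ["Class1", "Class2", "Class3", "Class4"]

# Every unordered pair of classes gets its own binary model
pairs = list(combinations(classes, 2))
for i, (a, b) in enumerate(pairs, start=1):
    print(f"{i} - {a} vs. {b}")

def n_ovo_models(k):
    """Number of pairwise models needed for k classes: k * (k - 1) / 2."""
    return k * (k - 1) // 2

for k in (3, 4, 10, 100):
    print(k, "classes ->", n_ovo_models(k), "models")
```

One vs. Rest needs only one model per class, so the gap between the two schemes widens as classes are added.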


#### One Vs. Rest Classifier

In [16]:
# Import ensembler method
from sklearn.multiclass import OneVsRestClassifier 

In [17]:
# Instantiate chosen binary algorithm (can also choose Perceptron)
lr = LogisticRegression(solver='liblinear', multi_class='ovr')

In [18]:
# Place the model in the ensembler method
OVC = OneVsRestClassifier(lr)

In [19]:
# Use the ensembler method like a normal model using earlier train-test split
OVC.fit(x_train, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='ovr', n_jobs=None,
                                                 penalty='l2',
                                                 random_state=None,
                                                 solver='liblinear', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [20]:
# Use the predict method in the same way
class_pred = OVC.predict(x_test)

In [21]:
# Use confusion matrix metric in the same way
# Exactly the same results as the single LogisticRegression model above
confusion_matrix(y_test, class_pred)

array([[15,  0,  0],
       [ 0, 11,  0],
       [ 0,  0, 12]], dtype=int64)

#### One Vs. One Classifier

In [22]:
# OneVsOneClassifier is used the same way as OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier

In [23]:
LR = LogisticRegression(solver='liblinear', multi_class='ovr')

In [24]:
OVO = OneVsOneClassifier(LR)

In [25]:
OVO.fit(x_train, y_train)

OneVsOneClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                dual=False, fit_intercept=True,
                                                intercept_scaling=1,
                                                l1_ratio=None, max_iter=100,
                                                multi_class='ovr', n_jobs=None,
                                                penalty='l2', random_state=None,
                                                solver='liblinear', tol=0.0001,
                                                verbose=0, warm_start=False),
                   n_jobs=None)

In [26]:
# Make prediction and evaluate confusion matrix.
ovo_pred = OVO.predict(x_test)

In [27]:
# Exactly the same results again
confusion_matrix(y_test, ovo_pred)

array([[15,  0,  0],
       [ 0, 11,  0],
       [ 0,  0, 12]], dtype=int64)

One Vs. One and One Vs. Rest Classifiers are not restricted to Logistic Regression.  

With scikit-learn, any classification model can be placed into the One Vs. X wrappers. OvO is classically paired with Support Vector Machines; scikit-learn's `SVC` uses a one-vs-one scheme internally for multi-class problems.
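
As a sketch of swapping in a different base model, the example below wraps a linear-kernel `SVC` in both wrappers and inspects how many binary models each one trained. It assumes scikit-learn is installed and reloads the data under new variable names (`X_iris`, `xs_train`, and so on, chosen here to avoid overwriting the earlier split); the kernel choice is an arbitrary assumption.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Same dataset and random_state as the earlier split, under separate names
X_iris, y_iris = datasets.load_iris(return_X_y=True)
xs_train, xs_test, ys_train, ys_test = train_test_split(X_iris, y_iris, random_state=42)

# Wrap the same kind of base estimator in each multi-class strategy
ovo_svc = OneVsOneClassifier(SVC(kernel="linear")).fit(xs_train, ys_train)
ovr_svc = OneVsRestClassifier(SVC(kernel="linear")).fit(xs_train, ys_train)

# estimators_ holds the fitted binary models: one per pair (OvO) or one per class (OvR)
print("OvO models:", len(ovo_svc.estimators_))
print("OvR models:", len(ovr_svc.estimators_))

# Both wrappers expose the usual predict / score interface
print("OvO accuracy:", ovo_svc.score(xs_test, ys_test))
print("OvR accuracy:", ovr_svc.score(xs_test, ys_test))
```

With three classes both strategies train three binary models, so the difference in cost only appears with four or more classes.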