# Lesson 26 - Multiclass Logistic Regression

## Prepare Environment

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression 

spark = SparkSession.builder.getOrCreate()

## Introduction

A **multiclass classification** problem is a classification problem in which the label has more than two classes. The mathematics behind a multiclass logistic regression model is similar to, but more complicated than, the binary case. Fortunately, the process of building a logistic regression model for a multiclass classification problem is similar to that of building a binary logistic regression model.

## Multiclass Logistic Regression

In the previous lesson, we discussed how to use logistic regression to perform binary classification. In this lesson, we will see how logistic regression can be adapted to a multiclass problem.

For simplicity, assume that we are working with a problem in which there are three classes. The process detailed here generalizes easily to a larger number of classes. As before, we will use the variable \\(y\\) to indicate the value of the label. In this case, we will allow \\(y\\) to assume three values, 0, 1, and 2, with each one indicating a different class. As before, suppose that we have \\(K\\) features that we plan to use in our model, and that the values of these features are represented by variables \\(x_1, x_2, ..., x_K\\).

Our multiclass logistic regression model will be represented by a collection of coefficients of the following form:

$$\hat{\beta}^0_0, ~\hat{\beta}^0_1, ~\hat{\beta}^0_2, ..., ~\hat{\beta}^0_K$$
$$\hat{\beta}^1_0, ~\hat{\beta}^1_1, ~\hat{\beta}^1_2, ..., ~\hat{\beta}^1_K$$
$$\hat{\beta}^2_0, ~\hat{\beta}^2_1, ~\hat{\beta}^2_2, ..., ~\hat{\beta}^2_K$$

 
Notice that we have one set of coefficients for each of the three possible labels, as indicated by the superscripts. As with binary classification, we will use these coefficients to form linear combinations of the feature values. 

$$z_0 = \hat{\beta}^0_0 + \hat{\beta}^0_1 x_1 + \hat{\beta}^0_2 x_2 + ... + \hat{\beta}^0_K x_K$$
$$z_1 = \hat{\beta}^1_0 + \hat{\beta}^1_1 x_1 + \hat{\beta}^1_2 x_2 + ... + \hat{\beta}^1_K x_K$$
$$z_2 = \hat{\beta}^2_0 + \hat{\beta}^2_1 x_1 + \hat{\beta}^2_2 x_2 + ... + \hat{\beta}^2_K x_K$$
 

Finally, we will generate the estimated probability of the observation belonging to each of the three classes as follows:

$$\hat{p}_0 = \frac{e^{z_0}}{e^{z_0} + e^{z_1} + e^{z_2}}$$
$$\hat{p}_1 = \frac{e^{z_1}}{e^{z_0} + e^{z_1} + e^{z_2}}$$
$$\hat{p}_2 = \frac{e^{z_2}}{e^{z_0} + e^{z_1} + e^{z_2}}$$
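
The same construction extends to any number of classes. As a sketch, writing \\(C\\) for the number of classes (a symbol introduced here for illustration only), the estimated probability for class \\(j\\) would be:

$$\hat{p}_j = \frac{e^{z_j}}{\sum_{c=0}^{C-1} e^{z_c}}, \qquad j = 0, 1, ..., C-1$$

Setting \\(C = 3\\) recovers the three formulas above.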

 

We could have, in theory, plugged each of the linear combinations \\(z_0\\), \\(z_1\\), and \\(z_2\\) into the sigmoid function in order to generate our probability estimates. While that would ensure that each result could individually be interpreted as a probability, it would not guarantee that the three probabilities sum to 1. The approach presented here ensures that each probability estimate is between 0 and 1 and that the three estimates sum to 1. The function that maps \\(z_0\\), \\(z_1\\), and \\(z_2\\) to these probabilities is called the **softmax** function.
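
The difference between the two approaches can be checked numerically. The following is a minimal NumPy sketch, not part of the Spark workflow. The values in `z` are hypothetical linear combinations chosen for illustration, and `sigmoid` and `softmax` are helper functions defined here.

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid: each output lies in (0, 1) on its own
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the maximum leaves the result unchanged but avoids overflow
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Hypothetical values of z_0, z_1, z_2 for a single observation
z = np.array([2.0, 1.0, -1.0])

sigmoid_scores = sigmoid(z)
softmax_probs = softmax(z)

print('sigmoid:', sigmoid_scores, 'sum =', sigmoid_scores.sum())
print('softmax:', softmax_probs, 'sum =', softmax_probs.sum())
```

Each sigmoid output is a valid probability by itself, but only the softmax outputs are guaranteed to sum to 1.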

## Multiclass Logistic Regression in Spark

As we can see, the calculations involved with multiclass logistic regression are a bit more complicated than those for binary logistic regression. Fortunately, there is no difference in how we use Spark to create a logistic regression model for a multiclass problem versus a binary classification problem.

## Load and Prepare Data
To illustrate this, we will consider the [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). This dataset contains information about 150 flowers from three different iris species: setosa, versicolor, and virginica. For each flower, we are provided with the species of the flower, as well as measurements for certain leaf-like structures on that flower. Specifically, we are provided with the length and width of the sepals, and the length and width of the petals for each flower. We will build a model that uses these four measurements to predict the iris species for a flower.

In [0]:
iris_schema = 'Sepal_Length DOUBLE, Sepal_Width DOUBLE, Petal_Length DOUBLE, Petal_Width DOUBLE, Species STRING'

iris = (
    spark.read
    .option('delimiter', '\t')
    .option('header', True)
    .schema(iris_schema)
    .csv('/FileStore/tables/iris.txt')
)

iris.printSchema()

In [0]:
iris.show(10)

In [0]:
N = iris.count()
print(N)

### Select Features

We will use the `columns` attribute of our DataFrame to create a list of names of the feature columns.

In [0]:
features = iris.columns[:-1]
print(features)

### Distribution of Label Values

To serve as a baseline against which we can compare our model, we will check the distribution of the label values.

In [0]:
(
    iris
    .select('Species')
    .groupby('Species')
    .agg(
        expr('COUNT(*) as count'), 
        expr(f'ROUND(COUNT(*)/{N},4) as prop')
    )
    .show()
)
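
These proportions give the baseline: a model that always predicts the most common species is correct for exactly that proportion of the observations. Below is a minimal plain-Python sketch of that calculation. It assumes 50 flowers per species, as in the standard Iris dataset; the cell output above remains the authoritative count for this file.

```python
from collections import Counter

# Assumption: 50 flowers per species, as in the standard Iris dataset
species = ['setosa'] * 50 + ['versicolor'] * 50 + ['virginica'] * 50

counts = Counter(species)

# Accuracy of a model that always predicts the most common class
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(species)

print(counts)
print(round(baseline_accuracy, 4))
```

Any useful model should score well above this baseline accuracy.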

### Encode Target Variable

Since our labels are represented as strings, we need to use `StringIndexer` to perform an integer encoding of the label.

In [0]:
indexer = StringIndexer(inputCol='Species', outputCol='label').fit(iris)
iris = indexer.transform(iris)
iris.show(10)

In [0]:
print(indexer.labels)
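
By default, `StringIndexer` orders labels by descending frequency (`stringOrderType='frequencyDesc'`), and recent Spark versions break ties alphabetically. The plain-Python sketch below imitates that ordering rule to show where the indices come from. It is an illustration, not Spark's implementation: `frequency_desc_order` is a hypothetical helper, and the 50-per-species counts are an assumption.

```python
from collections import Counter

def frequency_desc_order(values):
    # Sort by descending count, then alphabetically to break ties
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]

# Assumption: 50 flowers per species, listed in arbitrary order
species = ['virginica'] * 50 + ['setosa'] * 50 + ['versicolor'] * 50

labels = frequency_desc_order(species)

# Each label's position in the list becomes its numeric index
encoding = {label: float(i) for i, label in enumerate(labels)}

print(labels)
print(encoding)
```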

### Assemble Features Vector

We are now ready to use `VectorAssembler` to create our feature vectors.

In [0]:
assembler = VectorAssembler(inputCols=features, outputCol='features')
train = assembler.transform(iris)
train.show(5, truncate=False)

## Logistic Regression Model

Next, we will create a `LogisticRegression` object and call its `fit()` method to create a trained `LogisticRegressionModel` object.

In [0]:
logreg = LogisticRegression(featuresCol='features', labelCol='label')
logreg_model = logreg.fit(train)

# or chain those two steps together:
# logreg_model = LogisticRegression(featuresCol='features', labelCol='label').fit(train)

### Model Coefficients

In multiclass logistic regression, there is one set of coefficients for each possible label. These coefficient values are stored in the `interceptVector` and `coefficientMatrix` attributes of the trained model.

In [0]:
pd.DataFrame(
    np.vstack([
        logreg_model.interceptVector.toArray(),
        logreg_model.coefficientMatrix.toArray().T
    ]),
    columns = indexer.labels,
    index = ['intercept'] + features
)

| | setosa | versicolor | virginica |
|---|---|---|---|
| intercept | 1.983366 | 20.327028 | -22.310395 |
| Sepal_Length | -9.049063 | 5.757136 | 3.291927 |
| Sepal_Width | 38.994453 | -16.156789 | -22.837664 |
| Petal_Length | -11.366109 | 0.968380 | 10.397729 |
| Petal_Width | -25.596427 | 3.655240 | 21.941186 |
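
To connect these coefficients with the formulas from earlier in the lesson, the probabilities for a single flower can be reproduced by hand. The NumPy sketch below uses the rounded values from the table above and the measurements of the first new flower scored at the end of the lesson. Because the coefficients are rounded, the result should only approximately match what the Spark model reports.

```python
import numpy as np

labels = ['setosa', 'versicolor', 'virginica']

# Rounded values copied from the coefficient table; one column per class
intercepts = np.array([1.983366, 20.327028, -22.310395])
coefficients = np.array([
    [ -9.049063,   5.757136,   3.291927],   # Sepal_Length
    [ 38.994453, -16.156789, -22.837664],   # Sepal_Width
    [-11.366109,   0.968380,  10.397729],   # Petal_Length
    [-25.596427,   3.655240,  21.941186],   # Petal_Width
])

# Sepal length, sepal width, petal length, petal width of one flower
x = np.array([6.5, 2.9, 5.1, 1.7])

# Linear combinations z_0, z_1, z_2, one per class
z = intercepts + x @ coefficients

# Softmax, shifted by the maximum for numerical stability
exp_z = np.exp(z - z.max())
probs = exp_z / exp_z.sum()

predicted = labels[int(np.argmax(probs))]

print(z)
print(probs)
print(predicted)
```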


### Generating Predictions

We will now use the `transform()` method of our model to generate predictions for the training set.

In [0]:
train_pred = logreg_model.transform(train)
train_pred.select(['probability', 'prediction', 'label']).show(10, truncate=False)

### Scoring the Model

We will now calculate our model's accuracy on the training set.

In [0]:
accuracy_eval = MulticlassClassificationEvaluator(
    predictionCol='prediction', labelCol='label', metricName='accuracy'
)

acc = accuracy_eval.evaluate(train_pred)
print(acc)

We see that the model achieves nearly perfect accuracy on the training set. In fact, as the cell below shows, the model predicts the wrong label for only 2 of the 150 observations.

In [0]:
(
    train_pred
    .filter(expr('prediction != label'))
    .select('probability', 'prediction', 'label')
    .show(truncate=False)
)
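
The misclassification count ties directly back to the accuracy score. Assuming the 2 errors out of 150 observations reported in this lesson, the arithmetic is:

$$\text{accuracy} = \frac{150 - 2}{150} = \frac{148}{150} \approx 0.9867$$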

### Generating Predictions for New Observations

We will end the lesson by generating predictions for a new set of observations consisting of two flowers.

In [0]:
new_df = spark.createDataFrame(
    data = [[6.5, 2.9, 5.1, 1.7], [5.1, 3.1, 3.4, 1.1]], 
    schema = 'Sepal_Length DOUBLE, Sepal_Width DOUBLE, Petal_Length DOUBLE, Petal_Width DOUBLE' 
)

new_df = assembler.transform(new_df)
new_df.show()

In [0]:
new_pred = logreg_model.transform(new_df)
new_pred.select('probability', 'prediction').show(truncate=False)
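
The `prediction` column holds a label index rather than a species name. With default settings, that index is the position of the largest entry in the `probability` vector. The plain-Python sketch below shows how an index maps back to a name. It assumes the label order `['setosa', 'versicolor', 'virginica']` (in the notebook this would come from `indexer.labels`) and uses made-up probability vectors, not real model output.

```python
# Assumed label order; in the notebook this would be indexer.labels
labels = ['setosa', 'versicolor', 'virginica']

# Hypothetical probability vectors standing in for the `probability` column
probabilities = [
    [0.01, 0.24, 0.75],
    [0.05, 0.90, 0.05],
]

predicted_species = []
for probs in probabilities:
    # The predicted index is the position of the largest probability
    prediction = max(range(len(probs)), key=lambda i: probs[i])
    predicted_species.append(labels[prediction])

print(predicted_species)
```

Within Spark itself, the `IndexToString` transformer from `pyspark.ml.feature` serves the same purpose.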