# Lesson 25 - Logistic Regression

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression 

spark = SparkSession.builder.getOrCreate()

## Sigmoid Function 

Before discussing the logistic regression classification algorithm, we need to introduce the **sigmoid function**. This function is defined according to the formula \\(\sigma(z) = \frac{1}{1 + e^{-z}}\\). A plot of the function is provided below. One of the most important properties of the sigmoid function is that it can accept any real number as an input, but its output is always within the interval (0,1). This makes the sigmoid function useful in statistics and machine learning, since its output can be interpreted as a probability. 

![Plot of the sigmoid function](https://drbeane.github.io/files/images/417/sigmoid.png)
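As a quick numerical check of the properties described above, the following minimal sketch evaluates the sigmoid function at a few inputs using only the Python standard library. The function name `sigmoid` is our own choice for illustration and is not part of Spark.

```python
import math

def sigmoid(z):
    """Return 1 / (1 + e^(-z)) for a real number z."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs map close to 0, zero maps to exactly 0.5,
# and large positive inputs map close to 1.
for z in [-5, -1, 0, 1, 5]:
    print(z, round(sigmoid(z), 4))
```

Every printed value lies strictly between 0 and 1, and the values increase as \\(z\\) increases.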

## Logistic Regression

**Logistic regression** is a classification algorithm that allows us to estimate the probability that a particular observation belongs to a particular class, based on the values of a set of features for which we have measurements. A logistic regression model generates its probability estimates by first calculating a linear function of the feature values, and then passing the result of that calculation to the sigmoid function to obtain a value between 0 and 1 that can be interpreted as a probability. 

To explain this concept in more detail, suppose that we are working on a binary classification problem in which we wish to classify observations as belonging to one of two classes. For the sake of discussion, let's name the two classes positive and negative. Let \\(y\\) be a variable that indicates the correct class for any given observation. We will set \\(y=0\\) for observations in the negative class and \\(y=1\\) for observations in the positive class. 

Now suppose that we have \\(K\\) features that we plan to use in our model. We will represent the values of these features with variables \\(x_1, x_2, ..., x_K\\). We will denote our model's estimate of the probability that a particular observation belongs to the positive class by \\(\hat{p}_1\\).

A logistic regression model is defined by a collection of coefficients \\(\hat{\beta}_0,\hat{\beta}_1,..., \hat{\beta}_K\\). Given these coefficients, the model generates its probability estimate \\(\hat{p}_1\\) by first calculating a linear combination of the form \\( z = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + ... + \hat{\beta}_K x_K\\). It then uses the sigmoid function to estimate the probability of the observation being in the positive class: \\(\hat{p}_1 = \sigma(z)\\). Once we have \\(\hat{p}_1\\), we can estimate the probability of the observation being in the negative class as follows: \\(\hat{p}_0 = 1 - \hat{p}_1\\).
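The calculation described above can be sketched in a few lines of plain Python. The coefficients and feature values below are hypothetical numbers chosen purely for illustration (with \\(K = 2\\)); they do not come from any fitted model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a model with K = 2 features (illustration only).
beta_0 = -1.5            # intercept
betas = [0.8, -0.4]      # beta_1, beta_2

# Hypothetical feature values x_1, x_2 for a single observation.
x = [2.0, 1.0]

# Linear combination: z = beta_0 + beta_1 * x_1 + beta_2 * x_2
z = beta_0 + sum(b * xi for b, xi in zip(betas, x))

p1 = sigmoid(z)   # estimated probability of the positive class
p0 = 1 - p1       # estimated probability of the negative class

print(z, p1, p0)
```

Here \\(z\\) is negative, so \\(\hat{p}_1\\) falls below 0.5 and the negative class is the more likely one for this observation.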

You might be wondering where the coefficients \\(\hat{\beta}_0,\hat{\beta}_1,..., \hat{\beta}_K\\) come from. The short answer is that these are learned from the training data. The training algorithm is provided with several observations for which the true class is known and it finds the coefficients that result in the model that generates the best predictions for the training data. There is a lot of mathematics involved in finding these optimal coefficients, but fortunately the tools we will be using take care of this process for us.
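To give a rough sense of what "learning the coefficients" can look like, here is a minimal sketch that fits a one-feature model by plain gradient descent on the average log loss. This is only an illustration under our own assumptions: the data is made up, the learning rate and iteration count are arbitrary, and this is not the optimizer Spark uses internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(b0, b1, xs, ys):
    """Average negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

# Tiny made-up training set: one feature, binary labels.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]

b0, b1 = 0.0, 0.0          # start with all coefficients at zero
learning_rate = 0.1        # arbitrary step size for this sketch
initial_loss = log_loss(b0, b1, xs, ys)

for _ in range(2000):
    # Gradient of the average log loss with respect to b0 and b1.
    errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
    grad_b0 = sum(errors) / len(xs)
    grad_b1 = sum(e * x for e, x in zip(errors, xs)) / len(xs)
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

final_loss = log_loss(b0, b1, xs, ys)
print(b0, b1, initial_loss, final_loss)
```

The loss after training is lower than the loss at the starting point, which is the sense in which the learned coefficients give better predictions on the training data.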

## Load and Prepare Data

In this lesson, we will demonstrate how to use Spark to create, evaluate, and apply a logistic regression model to perform binary classification. For this example, we will be working with the Pima Diabetes dataset. The goal of this problem will be to use information collected from a medical screening to determine if a patient is likely to develop diabetes in the near future. The dataset consists of 768 observations of adults aged 21 or older. For each individual, we have values for 8 features, as well as a label named `Outcome` that indicates if the individual developed diabetes within 5 years of their data being collected. 

Further information about this dataset can be found here: [Pima Diabetes Data](https://rdrr.io/cran/dprep/man/diabetes.html)

In [0]:
pima_schema = (
    'Pregnancies INTEGER, Glucose INTEGER, BloodPressure INTEGER, SkinThickness INTEGER, Insulin INTEGER, '
    'BMI DOUBLE, DiabetesPedigreeFunction DOUBLE, Age INTEGER, Outcome STRING'
)

pima = (
    spark.read
    .option('delimiter', ',')
    .option('header', True)
    .schema(pima_schema)
    .csv('/FileStore/tables/pima_diabetes.csv')
)

pima.printSchema()

In [0]:
pima.show(10)

In [0]:
N = pima.count()
print(N)

### Select Features

We will use the `columns` attribute of our DataFrame to create a list of names of the feature columns.

In [0]:
features = pima.columns[:-1]
print(features)

### Distribution of Label Values

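To serve as a baseline against which we can compare our model, we will check the distribution of the label values. A model that always predicts the most common class would achieve an accuracy equal to that class's proportion, so a useful model should score higher than this.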

In [0]:
(
    pima
    .select('Outcome')
    .groupby('Outcome')
    .agg(
        expr('COUNT(*) as count'), 
        expr(f'ROUND(COUNT(*)/{N},4) as prop')
    )
    .show()
)

### Encode Target Variable

To train a classification model in Spark, our labels must be numerically encoded. Currently, our label values are given by strings. We can use the `StringIndexer` class from `pyspark.ml.feature` to perform an integer encoding of the label. This is demonstrated in the cell below.

In [0]:
indexer = StringIndexer(inputCol='Outcome', outputCol='label').fit(pima)
pima = indexer.transform(pima)
pima.show(10)

# StringIndexer assigns indices by descending frequency: the most common Outcome value becomes 0.0, the next most common becomes 1.0

In [0]:
print(type(indexer)) # the fitted indexer is a StringIndexerModel object

In [0]:
print(indexer.labels)
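The frequency-based encoding can be mimicked in plain Python to make the rule concrete. The sketch below uses a hypothetical list of string labels (not the actual `Outcome` values) and breaks ties alphabetically, which is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical string labels standing in for the Outcome column.
outcomes = ['no', 'yes', 'no', 'no', 'yes', 'no']

# Count each distinct label, then order by descending frequency
# (ties broken alphabetically in this sketch).
counts = Counter(outcomes)
labels = sorted(counts, key=lambda lab: (-counts[lab], lab))

# The position of a label in this list is its numeric index.
index_of = {lab: float(i) for i, lab in enumerate(labels)}
encoded = [index_of[lab] for lab in outcomes]

print(labels)    # ['no', 'yes']
print(encoded)   # [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
```

The `labels` list plays the same role as `indexer.labels`: the position of each string in the list is the numeric value it was assigned.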

### Assemble Feature Vector

Before we can use MLlib to create a machine learning model, we must first combine any columns representing features to be used in our model into a single column, which we will typically name `features`. Each entry in the `features` column will contain a list of feature values for that particular observation. In Spark terms, we will refer to this list as a feature vector. PySpark provides us with a `VectorAssembler` class that can be used to easily create this column.

In [0]:
assembler = VectorAssembler(inputCols=features, outputCol='features')
train = assembler.transform(pima)
train.show(5, truncate=False)

In [0]:
# Only the label and features columns are used to train the model
train.select('label', 'features').show(5, truncate=False)

## Logistic Regression Model

To create a logistic regression model in Spark, we must first create an instance of the `LogisticRegression` class (which can be imported from `pyspark.ml.classification`). When creating this instance, we specify the `featuresCol` and `labelCol` parameters. These should be set to the names of the columns containing our feature vector and label, respectively; if omitted, they default to `'features'` and `'label'`. The `LogisticRegression` object represents a training algorithm. 

To create an actual model, we need to call the `fit()` method of our `LogisticRegression` object and pass it the DataFrame that contains our training data. The `fit()` method returns an object of type `LogisticRegressionModel`. This object will represent our model.

In [0]:
logreg = LogisticRegression(featuresCol='features', labelCol='label') # this line represents the algorithm
logreg_model = logreg.fit(train) # this line represents the model

### Model Coefficients

The coefficients defining our trained logistic regression model are contained in the `intercept` and `coefficients` attributes of our `LogisticRegressionModel` object. We will display these in the cell below.

In [0]:
pd.DataFrame({
    'Feature':['Intercept'] + features,
    'Coefficient': [logreg_model.intercept] + logreg_model.coefficients.tolist()
})             

| | Feature | Coefficient |
|---|---|---|
| 0 | Intercept | -8.404703 |
| 1 | Pregnancies | 0.123183 |
| 2 | Glucose | 0.035164 |
| 3 | BloodPressure | -0.013296 |
| 4 | SkinThickness | 0.000619 |
| 5 | Insulin | -0.001192 |
| 6 | BMI | 0.089701 |
| 7 | DiabetesPedigreeFunction | 0.945183 |
| 8 | Age | 0.014869 |

### Generating Predictions

Every object representing a trained machine learning model comes equipped with a `transform()` method that can be used to generate predictions. This method requires that a DataFrame be passed to it as an argument. This DataFrame must contain a `features` column containing vectors that have been assembled using the same `VectorAssembler` object that was applied to the training set. The method will return a DataFrame that contains all of the columns in the argument DataFrame as well as the following three new columns:

* **`prediction`** - Each entry in this column will be a float representing a whole number value. This value indicates which of the label classes the model has predicted that the observation belongs to.
* **`probability`** - Each entry in this column will be a vector with one element for each possible class for your target variable. The values contained in the vector will be the probabilities of the observation being in each of the possible classes, as estimated by the logistic regression model. This column is useful for assessing how confident the model is in its prediction, as well as for determining what other classes the model believes might be likely.
* **`rawPrediction`** - Each entry in this column will be a vector with one element for each possible class for your target variable. The values contained in the vector will be the log odds scores that the model has assigned to each possible class. We will not make frequent use of this column.

In [0]:
train_pred = logreg_model.transform(train)
train_pred.select(['rawPrediction', 'probability', 'prediction', 'label']).show(10, truncate=False)

# rawPrediction is [-z, z]; applying the sigmoid function to these values gives probability = [prob_class0, prob_class1]
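The relationship between the three output columns can be reproduced by hand. The sketch below takes the rounded values from the Model Coefficients table and applies them to the feature values of the first new observation used at the end of this lesson. Because the coefficients are rounded, and because the sketch assumes the default decision threshold of 0.5, the numbers should be close to Spark's output but not necessarily identical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Intercept and coefficients as displayed in the Model Coefficients table
# (rounded to six decimal places, so results are approximate).
intercept = -8.404703
coefficients = [0.123183, 0.035164, -0.013296, 0.000619,
                -0.001192, 0.089701, 0.945183, 0.014869]

# Feature values of the first new observation used at the end of this lesson.
x = [3, 130, 62, 33, 315, 31.5, 0.428, 37]

# Linear combination of the feature values.
z = intercept + sum(b * xi for b, xi in zip(coefficients, x))

raw_prediction = [-z, z]                  # log-odds scores for class 0 and class 1
probability = [sigmoid(-z), sigmoid(z)]   # the two entries sum to 1

# Assumes the default threshold of 0.5 on the class 1 probability.
prediction = 1.0 if probability[1] > 0.5 else 0.0

print(raw_prediction, probability, prediction)
```

With these rounded coefficients, \\(z\\) comes out near -0.86, which gives a class 1 probability of roughly 0.30 and therefore a prediction of 0.0.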

### Scoring the Model

There are several metrics that can be used to score or evaluate a classification model. These metrics each have their advantages in certain situations, or when working with datasets with certain characteristics. The simplest (but not always best) classification metric is **accuracy**. A model's accuracy score with respect to a certain dataset is the proportion of observations for which the model predicts the correct class label. We can use the PySpark class `MulticlassClassificationEvaluator` to calculate a classification model's accuracy on a dataset. To use this tool, we must have created a dataset that contains predicted classes as well as the actual classes for the observations in the dataset. Such a DataFrame is returned by the `transform()` method of any MLlib model object. The syntax for using `MulticlassClassificationEvaluator` is shown below.

In [0]:
accuracy_eval = MulticlassClassificationEvaluator(
    predictionCol='prediction', labelCol='label', metricName='accuracy')

acc = accuracy_eval.evaluate(train_pred)
print(acc)
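The accuracy calculation itself is simple enough to sketch in plain Python. The predicted and actual labels below are hypothetical values for eight observations, used only to show the arithmetic.

```python
# Hypothetical predicted and actual labels for eight observations.
predictions = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
labels      = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]

# Accuracy is the share of observations where prediction equals label.
n_correct = sum(1 for p, y in zip(predictions, labels) if p == y)
accuracy = n_correct / len(labels)

print(accuracy)   # 0.75
```

Six of the eight hypothetical predictions match their labels, so the accuracy is 0.75.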

### Generating Predictions for New Observations

We will now illustrate how to use our model to generate predictions for new observations. In the cell below, we create a DataFrame named `new_df` that contains feature values for two individuals to whom we would like to apply the model. Before applying the model, we must create feature vectors for both observations. We can do this by using the `transform()` method of the `VectorAssembler` object we created earlier.

In [0]:
new_df = spark.createDataFrame(
    data = [[3, 130, 62, 33, 315, 31.5, 0.428, 37],
            [5, 152, 71, 27, 254, 27.4, 0.638, 45]], 
    schema = (
        'Pregnancies INTEGER, Glucose INTEGER, BloodPressure INTEGER, SkinThickness INTEGER,' 
        'Insulin INTEGER, BMI DOUBLE, DiabetesPedigreeFunction DOUBLE, Age INTEGER'
    )
)

new_df = assembler.transform(new_df)
new_df.show(truncate=False)

In [0]:
new_pred = logreg_model.transform(new_df)
new_pred.select('probability', 'prediction').show(truncate=False)