# Logistic Regression - Binary

Running a logistic regression with Python and Spark! 

Steps to follow: 

1. Create a Spark session and import LogisticRegression
2. Load the data and check that it has the expected columns: label and features
3. Split the data into training and testing sets (7:3)
4. Create an instance of LogisticRegression
5. Fit the instance to the training data to obtain a model
6. Evaluate the trained model on the testing data to obtain prediction results
7. Select the label and prediction columns from the prediction results
8. Create an evaluator instance
9. Pass the selected label and predictions to the evaluator to get the metric (area under the ROC curve by default)

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('logregdoc').getOrCreate()

# Logistic regression is a classification task, so it lives in pyspark.ml.classification
from pyspark.ml.classification import LogisticRegression

In [2]:
# Load data

data = spark.read.format("libsvm").load("sample_libsvm_data.txt")

# Check if data is formatted as label and features

data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
|  1.0|(692,[154,155,156...|
|  0.0|(692,[153,154,155...|
|  0.0|(692,[151,152,153...|
|  1.0|(692,[129,130,131...|
|  0.0|(692,[154,155,156...|
|  1.0|(692,[150,151,152...|
|  0.0|(692,[124,125,126...|
|  0.0|(692,[152,153,154...|
|  1.0|(692,[97,98,99,12...|
|  1.0|(692,[124,125,126...|
+-----+--------------------+
only showing top 20 rows



In [3]:
# Split data into training and testing set (7:3)

train, test = data.randomSplit([0.7,0.3])

# Create an instance of Logistic Regression

lr = LogisticRegression()

# Train/fit model on training data 

lrModel = lr.fit(train)

# Evaluate the trained model on the testing data; evaluate() returns a summary
# object whose .predictions attribute holds the scored data frame

predictionResults = lrModel.evaluate(test)

# Show prediction results

predictionResults.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[122,123,148...|[22.6839973146669...|[0.99999999985924...|       0.0|
|  0.0|(692,[123,124,125...|[16.8927374259317...|[0.99999995391311...|       0.0|
|  0.0|(692,[124,125,126...|[32.4786205273090...|[0.99999999999999...|       0.0|
|  0.0|(692,[124,125,126...|[18.0200716259477...|[0.99999998507266...|       0.0|
|  0.0|(692,[126,127,128...|[28.9440252750824...|[0.99999999999973...|       0.0|
|  0.0|(692,[126,127,128...|[21.0927052690363...|[0.99999999930887...|       0.0|
|  0.0|(692,[126,127,128...|[23.3070269958379...|[0.99999999992451...|       0.0|
|  0.0|(692,[126,127,128...|[26.5544262041464...|[0.99999999999706...|       0.0|
|  0.0|(692,[126,127,128...|[23.0883641925825...|[0.99999999990606...|       0.0|
|  0.0|(692,[150

In [4]:
# Check schema of evaluated data frame
predictionResults.predictions.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)
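
As a hedged aside on how the three model-generated columns relate: for binary logistic regression, the probability vector is the logistic (sigmoid) function applied to each element of rawPrediction, and prediction is 1.0 when the second probability exceeds the threshold (0.5 by default). The plain-Python sketch below reproduces that relationship outside Spark. The helper names `sigmoid` and `predict_from_raw` are illustrative and not part of the PySpark API, and the second raw-prediction element is assumed to be the negation of the first, since the output above truncates it.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def predict_from_raw(raw_prediction, threshold=0.5):
    """Map a two-element raw prediction [-m, m] to (probability, prediction)."""
    probability = [sigmoid(v) for v in raw_prediction]
    prediction = 1.0 if probability[1] > threshold else 0.0
    return probability, prediction

# First element copied from the first row of the predictions output;
# the second element is assumed to be its negation.
probability, prediction = predict_from_raw([22.6839973146669, -22.6839973146669])
print(probability[0], prediction)
```

The first probability comes out extremely close to 1, which is why that row is assigned class 0.0.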



We want to compare the label (true values) with the prediction (predicted values).

In [5]:
prediction_and_labels = predictionResults.predictions.select('label','prediction')

In [6]:
prediction_and_labels.show()

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
+-----+----------+
only showing top 20 rows
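
Comparing label with prediction amounts to counting how often the two columns agree. The sketch below does that counting in plain Python on the 20 rows displayed above; `confusion_counts` and `accuracy` are hypothetical helper names, and only the displayed rows are used, not the full test set.

```python
def confusion_counts(pairs):
    """Count outcomes from (label, prediction) pairs whose values are 0.0 or 1.0."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for label, prediction in pairs:
        if label == 1.0:
            counts["tp" if prediction == 1.0 else "fn"] += 1
        else:
            counts["tn" if prediction == 0.0 else "fp"] += 1
    return counts

def accuracy(pairs):
    """Fraction of pairs where the prediction equals the label."""
    counts = confusion_counts(pairs)
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total

# The 20 rows displayed above: sixteen (0.0, 0.0) pairs and four (1.0, 1.0) pairs.
shown_rows = [(0.0, 0.0)] * 16 + [(1.0, 1.0)] * 4
counts = confusion_counts(shown_rows)
print(counts, accuracy(shown_rows))
```

For a small data frame, the pairs could be obtained with `prediction_and_labels.collect()`; for large data, the counting is better left to Spark.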



#### Available evaluation metrics
1. BinaryClassificationEvaluator
    1. areaUnderROC (area under the ROC curve, the default)
    2. areaUnderPR (area under the Precision-Recall curve)

2. MulticlassClassificationEvaluator
    1. f1
    2. weightedPrecision
    3. weightedRecall
    4. accuracy
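
To make the areaUnderROC metric concrete, here is a minimal sketch of one equivalent definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. This is not how Spark computes the metric internally (it works from the ROC curve); the function name `area_under_roc` and the toy data are assumptions for illustration.

```python
def area_under_roc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))

# Toy data: one positive (0.35) is scored below one negative (0.4).
labels = [0.0, 0.0, 1.0, 1.0]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc)
```

Three of the four positive-negative pairs are ordered correctly, so the sketch yields 0.75.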

In [7]:
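# Create evaluator object

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Note: rawPredictionCol points at the hard 0/1 'prediction' column here, so the
# default metric (areaUnderROC) is computed from class labels rather than scores
evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label')

# Pass the evaluated data frame to an evaluator

evaluator.evaluate(prediction_and_labels)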

1.0

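This means that the area under the ROC curve (the evaluator's default metric) was 1.0. Because the evaluator was given the hard 0/1 predictions rather than raw scores, a value of 1.0 here means every observation in the testing set was classified correctly.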
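
A hedged aside on what this number measures when the evaluator is fed hard 0/1 predictions: the ROC curve then has a single operating point between (0, 0) and (1, 1), and the area under it reduces to (TPR + TNR) / 2. The sketch below computes that quantity in plain Python on made-up labels; `auc_from_hard_predictions` is a hypothetical helper, not a PySpark function.

```python
def auc_from_hard_predictions(labels, predictions):
    """Area under the one-point ROC curve built from 0/1 predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1.0 and p == 1.0)
    fn = sum(1 for y, p in pairs if y == 1.0 and p == 0.0)
    tn = sum(1 for y, p in pairs if y == 0.0 and p == 0.0)
    fp = sum(1 for y, p in pairs if y == 0.0 and p == 1.0)
    tpr = tp / (tp + fn)          # true positive rate
    fpr = fp / (fp + tn)          # false positive rate
    # Trapezoids through (0, 0), (fpr, tpr), (1, 1) simplify to this expression.
    return (tpr + (1.0 - fpr)) / 2.0

labels = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
perfect = auc_from_hard_predictions(labels, [0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
one_mistake = auc_from_hard_predictions(labels, [0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
print(perfect, one_mistake)
```

Under this setup a single misclassified negative already pulls the value below 1.0, so a result of exactly 1.0 corresponds to no misclassifications. Scoring with the rawPrediction or probability column instead would measure how well the model ranks observations across all thresholds.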

-------------------------------------------------------------------