### Logistic Regression

Unlike linear regression, which predicts a continuous value, logistic regression is used for classification problems, where the dependent variable is categorical and most often dichotomous (e.g., pass/fail, spam/not spam, admitted/rejected).
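
For reference, the standard binary logistic model can be written as below. The symbols $w$ (weight vector), $b$ (intercept), and $x$ (feature vector) are notation introduced here for illustration and do not come from the notebook itself.

$$
P(y = 1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

The predicted class is 1 when this probability exceeds a chosen threshold (commonly 0.5), and 0 otherwise.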

Logistic Regression Example

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Logistic Regression').getOrCreate()

In [2]:
from pyspark.ml.classification import LogisticRegression

In [4]:
my_data = spark.read.format('libsvm').load('Datasets/sample_libsvm_data.txt')

In [5]:
my_data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
|  1.0|(692,[154,155,156...|
|  0.0|(692,[153,154,155...|
|  0.0|(692,[151,152,153...|
|  1.0|(692,[129,130,131...|
|  0.0|(692,[154,155,156...|
|  1.0|(692,[150,151,152...|
|  0.0|(692,[124,125,126...|
|  0.0|(692,[152,153,154...|
|  1.0|(692,[97,98,99,12...|
|  1.0|(692,[124,125,126...|
+-----+--------------------+
only showing top 20 rows



In [6]:
log_reg_model = LogisticRegression()

In [7]:
fitted_logreg = log_reg_model.fit(my_data)

In [8]:
fitted_logreg.coefficients # Weights for each feature

SparseVector(692, {95: 0.0011, 96: 0.0003, 97: 0.0111, 98: 0.0148, 99: 0.0108, 100: 0.0041, 101: -0.0169, 102: -0.0233, 119: 0.0125, 120: 0.0042, 121: 0.0073, 122: 0.0007, 123: -0.0003, 124: 0.0016, 125: 0.001, 126: 0.0007, 127: 0.0007, 128: 0.0004, 129: -0.0022, 130: -0.0002, 131: 0.0026, 132: 0.0043, 133: 0.005, 146: 0.0129, 147: 0.0044, 148: 0.0036, 149: 0.0019, 150: -0.0004, 151: -0.0004, 152: 0.0005, 153: -0.0007, 154: -0.0024, 155: -0.0022, 156: -0.0017, 157: -0.0011, 158: -0.0009, 159: 0.0002, 160: 0.0017, 161: 0.001, 162: 0.003, 163: 0.0032, 164: 0.0195, 174: 0.0061, 175: 0.0017, 176: -0.0016, 177: -0.0014, 178: 0.0002, 179: -0.0001, 180: -0.0009, 181: -0.0006, 182: -0.0008, 183: 0.0011, 184: -0.0, 185: -0.0004, 186: -0.0008, 187: -0.0004, 188: -0.0001, 189: 0.0004, 190: 0.0021, 191: 0.0011, 192: 0.0019, 202: -0.0113, 203: -0.002, 204: -0.0011, 205: -0.0003, 206: 0.0008, 207: -0.0016, 208: -0.0014, 209: -0.0002, 210: 0.0014, 211: 0.0027, 212: 0.0008, 213: -0.0012, 214: -0.0012,

In [9]:
fitted_logreg.intercept # Bias term

9.250831966638948

In [10]:
log_summary = fitted_logreg.summary
log_summary.predictions.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



**Show the summary**

The columns of the predictions DataFrame are:
- label - The actual class label (0 or 1 in a binary classification task).
- features - The feature vector used for prediction. Each vector has 692 dimensions, but only the non-zero entries are stored and displayed (sparse representation).
- rawPrediction - The raw output of the linear model (one score per class) before the logistic function is applied.
- probability - The vector of class probabilities obtained by applying the sigmoid (logistic) function to the raw prediction.
- prediction - The final predicted class (0 or 1): 1 if the probability of class 1 exceeds the threshold (0.5 by default), otherwise 0.
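
The relationship between these columns can be sketched in plain Python, without Spark. This is a minimal illustration, not Spark's implementation: the helper names (`sigmoid`, `sparse_margin`, `predict_row`) and the small weight and feature dictionaries are hypothetical, and the `[-margin, margin]` layout of the raw prediction is an assumption based on what the binary output in this notebook looks like.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sparse_margin(features, weights, intercept):
    """Linear score w.x + b, with sparse vectors stored as {index: value} dicts."""
    return intercept + sum(v * weights.get(i, 0.0) for i, v in features.items())

def predict_row(margin, threshold=0.5):
    """Return (rawPrediction, probability, prediction) for one row."""
    p1 = sigmoid(margin)                      # probability of class 1
    raw_prediction = [-margin, margin]        # assumed layout: one score per class
    probability = [1.0 - p1, p1]
    prediction = 1.0 if p1 > threshold else 0.0
    return raw_prediction, probability, prediction

# Hypothetical weights and features, chosen only for illustration.
weights = {0: 0.5, 3: -1.25}
intercept = 0.5
features = {0: 2.0, 3: 1.0, 7: 4.0}  # index 7 has no weight, so it contributes 0

margin = sparse_margin(features, weights, intercept)
raw, prob, pred = predict_row(margin)
print(raw, prob, pred)
```

Under the same assumptions, a class-0 raw score of about 20.38 (as in the first row of the predictions table) corresponds to a class-0 probability of roughly 0.9999999986, which is why the probabilities in this dataset look so extreme.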


In [11]:
log_summary.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[127,128,129...|[20.3777627514862...|[0.99999999858729...|       0.0|
|  1.0|(692,[158,159,160...|[-21.114014198867...|[6.76550380001560...|       1.0|
|  1.0|(692,[124,125,126...|[-23.743613234676...|[4.87842678715831...|       1.0|
|  1.0|(692,[152,153,154...|[-19.192574012719...|[4.62137287298722...|       1.0|
|  1.0|(692,[151,152,153...|[-20.125398874697...|[1.81823629113437...|       1.0|
|  0.0|(692,[129,130,131...|[20.4890549504187...|[0.99999999873608...|       0.0|
|  1.0|(692,[158,159,160...|[-21.082940212813...|[6.97903542824686...|       1.0|
|  1.0|(692,[99,100,101,...|[-19.622713503566...|[3.00582577441380...|       1.0|
|  0.0|(692,[154,155,156...|[21.1594863606570...|[0.99999999935352...|       0.0|
|  0.0|(692,[127

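**More on the model summary**

The metrics below come from the training summary, so they are computed on the same data the model was fitted on. Perfect scores here do not show how the model performs on unseen data.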

In [12]:
print("Accuracy:", log_summary.accuracy)
print("AUC-ROC Score:", log_summary.areaUnderROC)
print("Precision by Label:", log_summary.precisionByLabel)
print("Recall by Label:", log_summary.recallByLabel)
print("F1-Score:", log_summary.fMeasureByLabel())

Accuracy: 1.0
AUC-ROC Score: 1.0
Precision by Label: [1.0, 1.0]
Recall by Label: [1.0, 1.0]
F1-Score: [1.0, 1.0]
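
To make the per-label metrics concrete, here is a small pure-Python sketch of how precision, recall, and F1 can be computed for each label. The function name `per_label_metrics` and the sample labels are made up for illustration; this is not Spark's code.

```python
def per_label_metrics(labels, predictions):
    """Return {label: (precision, recall, f1)} from two parallel lists."""
    metrics = {}
    for cls in sorted(set(labels) | set(predictions)):
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == cls and p == cls)  # true positives
        fp = sum(1 for y, p in pairs if y != cls and p == cls)  # false positives
        fn = sum(1 for y, p in pairs if y == cls and p != cls)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[cls] = (precision, recall, f1)
    return metrics

# Made-up labels and predictions, for illustration only.
sample_labels      = [0, 0, 0, 1, 1, 1, 1]
sample_predictions = [0, 0, 1, 1, 1, 1, 0]

for cls, (precision, recall, f1) in per_label_metrics(sample_labels, sample_predictions).items():
    print(cls, precision, recall, f1)
```

When every prediction matches its label, as in the training summary above, all three values are 1.0 for every label.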


In [13]:
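# 70% of rows for training, 30% for testing; pass seed=<int> to randomSplit for a reproducible split
lr_train, lr_test = my_data.randomSplit([0.7, 0.3])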

In [14]:
final_model = LogisticRegression()

In [15]:
fit_final = final_model.fit(lr_train)

In [16]:
# evaluate() returns a summary object; its .predictions DataFrame holds the test-set predictions
prediction_and_labels = fit_final.evaluate(lr_test)

Show predictions on the test data

In [17]:
prediction_and_labels.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[98,99,100,1...|[26.7842465178436...|[0.99999999999766...|       0.0|
|  0.0|(692,[124,125,126...|[33.6593592225403...|[0.99999999999999...|       0.0|
|  0.0|(692,[126,127,128...|[17.6835998932546...|[0.99999997910173...|       0.0|
|  0.0|(692,[126,127,128...|[23.8602148040308...|[0.99999999995658...|       0.0|
|  0.0|(692,[127,128,129...|[22.6207461498305...|[0.99999999985005...|       0.0|
|  0.0|(692,[150,151,152...|[18.9413140537971...|[0.99999999405855...|       0.0|
|  0.0|(692,[152,153,154...|[17.2295366227941...|[0.99999996709155...|       0.0|
|  0.0|(692,[152,153,154...|[11.9235449158779...|[0.99999336765120...|       0.0|
|  1.0|(692,[100,101,102...|[46.5691686307323...|           [1.0,0.0]|       0.0|
|  1.0|(692,[123

**Using evaluator**

![eval](Img/eval.png)

In [18]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [20]:
my_eval = BinaryClassificationEvaluator()  # default metricName is 'areaUnderROC'
my_final_roc = my_eval.evaluate(prediction_and_labels.predictions)
my_final_roc

0.9230769230769232
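
One way to read this number: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that pairwise quantity in plain Python. The function name `auc_roc` and the sample data are hypothetical, and this is not how Spark computes the metric internally, so it may not match Spark's value digit for digit.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up labels and class-1 scores, for illustration only.
sample_labels = [1, 0, 1, 0]
sample_scores = [0.9, 0.8, 0.7, 0.1]
print(auc_roc(sample_labels, sample_scores))
```

In this toy data, three of the four positive/negative pairs are ordered correctly. A value of 1.0 would mean every positive outranks every negative.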

In [21]:
eval_acc = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
eval_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")

accuracy = eval_acc.evaluate(prediction_and_labels.predictions)
F1 = eval_f1.evaluate(prediction_and_labels.predictions)

print("Accuracy :", accuracy)
print("F1 :", F1)

Accuracy : 0.9523809523809523
F1 : 0.9528291316526611
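
Accuracy and F1 differ slightly here because, as far as the Spark documentation describes it, metricName="f1" reports a weighted F-measure: the per-label F1 scores averaged by each label's share of the true instances. A minimal sketch of that weighting follows; the function name `weighted_f1` and the numbers are invented for illustration and are not taken from this notebook's data.

```python
def weighted_f1(f1_by_label, support_by_label):
    """Average per-label F1 scores, weighting each label by its number of true instances."""
    total = sum(support_by_label.values())
    return sum(f1_by_label[c] * support_by_label[c] for c in f1_by_label) / total

# Made-up per-label F1 scores and label counts, for illustration only.
sample_f1 = {0: 2 / 3, 1: 0.75}
sample_support = {0: 3, 1: 4}
print(weighted_f1(sample_f1, sample_support))
```

With these made-up inputs the result works out to 5/7, which sits between the two per-label scores and closer to the label with more instances.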
