# Logistic Regression

https://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression


Evaluators will be a very important part of our pipeline when working with machine learning:

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.BinaryClassificationEvaluator.html#pyspark.ml.evaluation.BinaryClassificationEvaluator.metricName

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.MulticlassClassificationEvaluator.html


In [1]:
path = '/home/danial/Desktop/myspark/Apache-Spark/Python-and-Spark-for-Big-Data-master/Spark_for_Machine_Learning/Logistic_Regression/sample_libsvm_data.txt'

In [2]:
import findspark

In [3]:
findspark.init("/home/danial/spark-3.3.2-bin-hadoop3")

In [4]:
from pyspark.sql import SparkSession

In [7]:
spark = SparkSession.builder.appName('logis').getOrCreate()

23/04/04 11:17:00 WARN Utils: Your hostname, danial-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
23/04/04 11:17:00 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/04/04 11:17:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/04/04 11:17:01 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [9]:
data = spark.read.format('libsvm').load(path)

23/04/04 11:19:29 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.



In [10]:
data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
|  1.0|(692,[154,155,156...|
|  0.0|(692,[153,154,155...|
|  0.0|(692,[151,152,153...|
|  1.0|(692,[129,130,131...|
|  0.0|(692,[154,155,156...|
|  1.0|(692,[150,151,152...|
|  0.0|(692,[124,125,126...|
|  0.0|(692,[152,153,154...|
|  1.0|(692,[97,98,99,12...|
|  1.0|(692,[124,125,126...|
+-----+--------------------+
only showing top 20 rows



In [11]:
data.select('label').distinct().show()

+-----+
|label|
+-----+
|  0.0|
|  1.0|
+-----+



In [12]:
from pyspark.ml.classification import LogisticRegression

In [15]:
train_Set, test_set = data.randomSplit([0.7, 0.3])

In [14]:
lg = LogisticRegression()

In [16]:
lg_model = lg.fit(train_Set)

23/04/04 11:24:09 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/04/04 11:24:09 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
23/04/04 11:24:10 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
23/04/04 11:24:10 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


In [28]:
log_summary = lg_model.summary

In [31]:
log_summary.predictions.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [33]:
log_summary.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[98,99,100,1...|[25.1513018780422...|[0.99999999998806...|       0.0|
|  0.0|(692,[121,122,123...|[26.0753841057895...|[0.99999999999526...|       0.0|
|  0.0|(692,[122,123,124...|[20.2151747204161...|[0.99999999833788...|       0.0|
|  0.0|(692,[122,123,148...|[22.3056083865609...|[0.99999999979450...|       0.0|
|  0.0|(692,[123,124,125...|[24.0462456874123...|[0.99999999996395...|       0.0|
|  0.0|(692,[124,125,126...|[33.1079069148920...|[0.99999999999999...|       0.0|
|  0.0|(692,[124,125,126...|[34.1141437492309...|[0.99999999999999...|       0.0|
|  0.0|(692,[124,125,126...|[36.4386030338422...|[0.99999999999999...|       0.0|
|  0.0|(692,[124,125,126...|[22.1688960637801...|[0.99999999976440...|       0.0|
|  0.0|(692,[125

In [34]:
# Evaluating the model 

In [21]:
predictions_and_labels = lg_model.evaluate(test_set)

In [22]:
predictions_and_labels.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[95,96,97,12...|[24.4085149772204...|[0.99999999997490...|       0.0|
|  0.0|(692,[100,101,102...|[2.22487928435347...|[0.90246153751556...|       0.0|
|  0.0|(692,[123,124,125...|[31.8716286857398...|[0.99999999999998...|       0.0|
|  0.0|(692,[123,124,125...|[36.1540641835322...|[0.99999999999999...|       0.0|
|  0.0|(692,[124,125,126...|[21.0332131007928...|[0.99999999926651...|       0.0|
|  0.0|(692,[124,125,126...|[21.9067180479556...|[0.99999999969378...|       0.0|
|  0.0|(692,[126,127,128...|[16.3898344184417...|[0.99999992379467...|       0.0|
|  0.0|(692,[126,127,128...|[39.9715770068874...|           [1.0,0.0]|       0.0|
|  0.0|(692,[126,127,128...|[33.2562823069702...|[0.99999999999999...|       0.0|
|  0.0|(692,[126
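
The `rawPrediction` and `probability` columns are related through the logistic function. The sketch below, in plain Python, assumes that for a binary model the first `probability` entry is the sigmoid of the first `rawPrediction` entry, and checks that assumption against the second row of the table above (2.22487928435347 and 0.90246153751556, digits as displayed):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, mapping a raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# First rawPrediction entry of the second row in the table above.
raw_class0 = 2.22487928435347

p_class0 = sigmoid(raw_class0)   # probability of label 0.0
p_class1 = 1.0 - p_class0        # the two class probabilities sum to 1

# With a 0.5 threshold, the predicted label is the one with the larger probability.
predicted = 0.0 if p_class0 >= p_class1 else 1.0
```

This also explains why most rows show probabilities like 0.9999...: raw scores above 20 are far out on the flat part of the sigmoid.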

In [23]:
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                    MulticlassClassificationEvaluator)

In [25]:
my_eval = BinaryClassificationEvaluator()

In [35]:
# areaUnderROC:

my_final_roc = my_eval.evaluate(predictions_and_labels.predictions)

In [36]:
my_final_roc

1.0
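
The value 1.0 is the area under the ROC curve, the evaluator's default metric. One way to read it is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below implements that pairwise reading in plain Python on made-up scores. It is meant as an illustration of the metric, not as the way Spark computes it internally:

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1.0]
    neg = [s for y, s in zip(labels, scores) if y == 0.0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0.0, 0.0, 1.0, 1.0]
perfect = pairwise_auc(labels, [-3.0, -1.0, 2.0, 5.0])  # every positive outranks every negative
mixed = pairwise_auc(labels, [-3.0, 2.5, 2.0, 5.0])     # one negative outranks one positive
```

Read this way, a result of 1.0 on the test split means every positive test row was scored above every negative one. On a small, cleanly separable sample file that is plausible, but it should not be taken as typical of real data.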