<a href="https://colab.research.google.com/github/Datangels/Machine_Learning_with_PySpark/blob/master/pyspark_logistic_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Google Colab Configuration & Creating the SparkSession Object**

---



In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Old Spark releases are served from the Apache archive; the www-us mirror no longer hosts 2.4.4
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

## **Read the Dataset**

In [0]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [0]:
dataset_not_clean = spark.read.csv('/content/drive/My Drive/pycharm_colab_training/dataset/user_countries_socialMediaUsage.csv',inferSchema=True, header=True)

## **Exploratory Data Analysis**

In [0]:
# dataset_not_clean.printSchema()
dataset_not_clean.describe().show()
# print((dataset_not_clean.count(), len(dataset_not_clean.columns)))

+-------+--------+-----------------+-----------------+--------+-----------------+------------------+
|summary| Country|              Age|   Repeat_Visitor|Platform| Web_pages_viewed|            Status|
+-------+--------+-----------------+-----------------+--------+-----------------+------------------+
|  count|   20000|            20000|            20000|   20000|            20000|             20000|
|   mean|    null|         28.53955|           0.5029|    null|           9.5533|               0.5|
| stddev|    null|7.888912950773227|0.500004090187782|    null|6.073903499824976|0.5000125004687693|
|    min|  Brazil|               17|                0|    Bing|                1|                 0|
|    max|Malaysia|              111|                1|   Yahoo|               29|                 1|
+-------+--------+-----------------+-----------------+--------+-----------------+------------------+



## **Feature Engineering**

In [0]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import OneHotEncoder

platform_indexer = StringIndexer(inputCol="Platform", outputCol="Platform_Num").fit(dataset_not_clean)
dataset_partially_clean = platform_indexer.transform(dataset_not_clean)
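# In Spark 2.4 OneHotEncoder is a plain transformer; from Spark 3.0 on it is an estimator and needs .fit(...) before .transform(...)
platform_encoder = OneHotEncoder(inputCol="Platform_Num", outputCol="Platform_Vector")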
dataset_partially_clean = platform_encoder.transform(dataset_partially_clean)

country_indexer = StringIndexer(inputCol="Country", outputCol="Country_Num").fit(dataset_partially_clean)
dataset_partially_clean = country_indexer.transform(dataset_partially_clean)
country_encoder = OneHotEncoder(inputCol="Country_Num", outputCol="Country_Vector")
dataset_partially_clean = country_encoder.transform(dataset_partially_clean)

dataset_assembler = VectorAssembler(inputCols=['Platform_Vector','Country_Vector','Age', 'Repeat_Visitor','Web_pages_viewed'], outputCol="features")
dataset_clean = dataset_assembler.transform(dataset_partially_clean)

dataset_clean.show()

+---------+---+--------------+--------+----------------+------+------------+---------------+-----------+--------------+--------------------+
|  Country|Age|Repeat_Visitor|Platform|Web_pages_viewed|Status|Platform_Num|Platform_Vector|Country_Num|Country_Vector|            features|
+---------+---+--------------+--------+----------------+------+------------+---------------+-----------+--------------+--------------------+
|    India| 41|             1|   Yahoo|              21|     1|         0.0|  (2,[0],[1.0])|        1.0| (3,[1],[1.0])|[1.0,0.0,0.0,1.0,...|
|   Brazil| 28|             1|   Yahoo|               5|     0|         0.0|  (2,[0],[1.0])|        2.0| (3,[2],[1.0])|[1.0,0.0,0.0,0.0,...|
|   Brazil| 40|             0|  Google|               3|     0|         1.0|  (2,[1],[1.0])|        2.0| (3,[2],[1.0])|(8,[1,4,5,7],[1.0...|
|Indonesia| 31|             1|    Bing|              15|     1|         2.0|      (2,[],[])|        0.0| (3,[0],[1.0])|(8,[2,5,6,7],[1.0...|
| Malaysia| 3
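
The `Platform_Num` and `Platform_Vector` columns above can be read as a two-step encoding: `StringIndexer` assigns indices by descending label frequency by default, and `OneHotEncoder` drops the last category by default, which is why `Bing` (index 2.0) appears as the empty sparse vector `(2,[],[])`. The plain-Python sketch below only mimics that behaviour for illustration. It is not Spark's implementation; the helper names, the sample list, and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter

def string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here; that is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """Return (size, indices, values); the last category becomes all zeros."""
    size = num_categories - 1
    i = int(index)
    if i < size:
        return (size, [i], [1.0])
    return (size, [], [])

# Hypothetical sample; frequencies are chosen so the ordering matches the output above
platforms = ["Yahoo", "Yahoo", "Yahoo", "Google", "Google", "Bing"]
mapping = string_index(platforms)
encoded = {label: one_hot_drop_last(idx, len(mapping)) for label, idx in mapping.items()}
print(mapping)
print(encoded)
```

Dropping the last category keeps the encoded columns from summing to one in every row, which avoids a redundant column in a linear model.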

## **Splitting the Dataset**

In [0]:
model_df = dataset_clean.select(['features','Status'])
train_df, test_df = model_df.randomSplit([0.75,0.25])
print("whole dataset: " + str(model_df.count()))
print("train_df dataset: " + str(train_df.count()))
print("test_df dataset: " + str(test_df.count()))

whole dataset: 20000
train_df dataset: 15053
test_df dataset: 4947
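
The counts above are 15053 and 4947 rather than exactly 15000 and 5000 because `randomSplit` samples row by row, so the weights are only matched approximately. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not Spark's sampler; the function name `random_split` and the seed value are assumptions.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split using one uniform draw against cumulative weights."""
    total = float(sum(weights))
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point remainder
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(range(20000), [0.75, 0.25], seed=42)
print(len(train), len(test))
```

In PySpark, `randomSplit` also accepts a `seed` argument, which helps make a split repeatable when the input data and its partitioning stay the same.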


## **Build and Train Logistic Regression Model**

In [0]:
from pyspark.ml.classification import LogisticRegression
log_reg = LogisticRegression(labelCol='Status').fit(train_df)

## **Training Results**

In [0]:
train_results = log_reg.evaluate(train_df).predictions
train_results.filter(train_results['Status']==1).filter(train_results['prediction']==1).select(['Status','prediction','probability']).show(10,False)

+------+----------+----------------------------------------+
|Status|prediction|probability                             |
+------+----------+----------------------------------------+
|1     |1.0       |[0.32738633322874056,0.6726136667712594]|
|1     |1.0       |[0.32738633322874056,0.6726136667712594]|
|1     |1.0       |[0.18784915435591013,0.8121508456440898]|
|1     |1.0       |[0.18784915435591013,0.8121508456440898]|
|1     |1.0       |[0.18784915435591013,0.8121508456440898]|
|1     |1.0       |[0.09902872212683424,0.9009712778731658]|
|1     |1.0       |[0.09902872212683424,0.9009712778731658]|
|1     |1.0       |[0.09902872212683424,0.9009712778731658]|
|1     |1.0       |[0.09902872212683424,0.9009712778731658]|
|1     |1.0       |[0.04963829348541264,0.9503617065145874]|
+------+----------+----------------------------------------+
only showing top 10 rows
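
Each `probability` entry above is a pair `[P(Status=0), P(Status=1)]`, and with the default threshold of 0.5 the `prediction` is 1.0 whenever the second value is larger than 0.5. A minimal sketch of how a binary logistic regression turns a feature vector into those two columns is shown below. The coefficients and feature values are hypothetical and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, intercept, features, threshold=0.5):
    """Return ([P(0), P(1)], prediction) for one feature vector."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    p1 = sigmoid(z)
    probability = [1.0 - p1, p1]
    prediction = 1.0 if p1 > threshold else 0.0
    return probability, prediction

# Hypothetical coefficients and features, for illustration only
probability, prediction = predict([0.8, -0.5], -1.0, [3.0, 1.0])
print(probability, prediction)
```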



## **Evaluate Logistic Regression Model on Test Data**

In [0]:
results = log_reg.evaluate(test_df).predictions
results.select(['Status','prediction']).show(10,False)

+------+----------+
|Status|prediction|
+------+----------+
|0     |0.0       |
|0     |0.0       |
|0     |0.0       |
|0     |0.0       |
|0     |0.0       |
|0     |0.0       |
|1     |0.0       |
|0     |0.0       |
|0     |0.0       |
|1     |1.0       |
+------+----------+
only showing top 10 rows



## **Confusion Matrix**

In [0]:
true_postives = results[(results.Status == 1) & (results.prediction == 1)].count()
true_negatives = results[(results.Status == 0) & (results.prediction == 0)].count()
false_postives = results[(results.Status == 0) & (results.prediction == 1)].count()
false_negatives = results[(results.Status == 1) & (results.prediction == 0)].count()
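
Each of the four `count()` calls above scans the results separately. The same four numbers can be tallied in a single pass by grouping on the (label, prediction) pair; in Spark that would be something like `results.groupBy('Status', 'prediction').count()`. The plain-Python sketch below shows the tallying logic on the ten test rows displayed earlier; the variable names are assumptions of the sketch.

```python
from collections import Counter

# (Status, prediction) pairs copied from the ten test rows shown earlier
pairs = [(0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0),
         (0, 0.0), (1, 0.0), (0, 0.0), (0, 0.0), (1, 1.0)]

# One pass over the data, counting each (label, prediction) combination
counts = Counter((label, int(pred)) for label, pred in pairs)

tp = counts[(1, 1)]  # actual 1, predicted 1
tn = counts[(0, 0)]  # actual 0, predicted 0
fp = counts[(0, 1)]  # actual 0, predicted 1
fn = counts[(1, 0)]  # actual 1, predicted 0
print(tp, tn, fp, fn)
```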

## **Accuracy**

In [0]:
# Accuracy: the share of all test rows whose prediction matches the actual Status
accuracy = float(true_postives + true_negatives) / results.count()
print(accuracy)

0.9419850414392561


## **Recall**

In [0]:
# Recall: the share of actual positive cases (Status == 1) that the model predicts as positive
recall = float(true_postives) / (true_postives + false_negatives)
print(recall)

0.9403946838501812


## **Precision**

In [0]:
# Precision: the share of predicted positives (prediction == 1) that are actually positive
precision = float(true_postives) / (true_postives + false_postives)
print(precision)

0.9438156831042845
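
The three metrics above all derive from the same four confusion-matrix counts, so they can be computed together. The helper below is a small sketch of that, with a guard against division by zero. The function name `classification_metrics` is hypothetical, and the counts passed in are those of the ten test rows displayed earlier, not of the full test set.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall and precision from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class is absent
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
    }

# Counts from the ten displayed test rows: 1 TP, 8 TN, 0 FP, 1 FN
metrics = classification_metrics(tp=1, tn=8, fp=0, fn=1)
print(metrics)
```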
