# Logistic Regression model training

- After creating labels and features for the data, we’re ready to build a model that can learn from it (training). Before you train the model, you'll split the combined dataset into training and test sets so that the model can be evaluated on data it has not seen. Logistic Regression suits this task because it assigns each data point a probability of being spam. We can then classify a message as spam or not, depending on how high that probability is.

- In this final part of the exercise, you'll split the data into training and test sets, run Logistic Regression on the training data, apply the same HashingTF() feature transformation to get vectors for a positive example (spam) and a negative one (non-spam), and finally check the accuracy of the trained model.

- Remember, you have a SparkContext sc available in your workspace, as well as the samples variable.
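
To make the probability-then-threshold idea concrete, here is a minimal pure-Python sketch that needs no Spark. The weights, bias, and feature values are made up for illustration and are not parameters of the model trained in this exercise.

```python
import math

def spam_probability(features, weights, bias):
    """Logistic (sigmoid) function applied to a weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Return 1 (spam) if the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0

# Hypothetical weights and bias for a 3-feature toy model
weights = [1.5, -2.0, 0.5]
bias = -0.25

p_high = spam_probability([2.0, 0.0, 1.0], weights, bias)
p_low = spam_probability([0.0, 2.0, 0.0], weights, bias)
print(p_high, classify(p_high))
print(p_low, classify(p_low))
```

In pyspark.mllib, a LogisticRegressionModel applies a threshold of 0.5 by default, and calling clearThreshold() on it makes predict return the raw probability instead of a 0/1 label.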

## Instructions

- Split the combined data into training and test sets (80/20).
- Train the Logistic Regression (LBFGS variant) model with the training dataset.
- Use the trained model to predict a label for each sample in the test dataset.
- Combine the labels in the test dataset with the labels in the prediction dataset.
- Calculate the accuracy of the trained model by comparing the original and predicted labels in labels_and_preds.
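
The steps above can be sketched on a small in-memory list without Spark. This is only an illustration: `toy_samples` and the rule-based `toy_predict` function are hypothetical stand-ins for the samples RDD and the trained model.

```python
import random

def split_80_20(items, seed=42):
    """Shuffle a copy of items and cut it into 80% train / 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

def accuracy(labels_and_preds):
    """Fraction of (label, prediction) pairs that agree."""
    pairs = list(labels_and_preds)
    correct = sum(1 for label, pred in pairs if label == pred)
    return correct / float(len(pairs))

# Toy (label, feature) samples: the label is 1 when the feature is above 5
toy_samples = [(1 if x > 5 else 0, x) for x in range(10)]
train, test = split_80_20(toy_samples)

def toy_predict(feature):
    """Hypothetical stand-in for model.predict: a deliberately imperfect rule."""
    return 1 if feature > 6 else 0

# Pair each original label with the predicted label, then score the pairs
labels = [label for label, _ in test]
preds = [toy_predict(feature) for _, feature in test]
labels_and_preds = list(zip(labels, preds))
print("Accuracy: {:.2f}".format(accuracy(labels_and_preds)))
```

Note that RDD.randomSplit assigns each element at random according to the weights, so the resulting set sizes are only approximately 80/20, unlike the exact cut used in this sketch.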

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List any extra packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

In [3]:
file_path_spam = "file:///home/talentum/test-jupyter/P2/M4/SM3/3_Classification/Dataset/spam.txt"
file_path_non_spam = "file:///home/talentum/test-jupyter/P2/M4/SM3/3_Classification/Dataset/ham.txt"


# Load the datasets into RDDs
spam_rdd = sc.textFile(file_path_spam)
non_spam_rdd = sc.textFile(file_path_non_spam)

# Split the email messages into words
spam_words = spam_rdd.flatMap(lambda email: email.split(' '))
non_spam_words = non_spam_rdd.flatMap(lambda email: email.split(' '))

# Print the first element in the split RDD
print("The first element in spam_words is", spam_words.first())
print("The first element in non_spam_words is", non_spam_words.first())

The first element in spam_words is You
The first element in non_spam_words is Rofl.


In [4]:
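from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint

# Create a HashingTF instance with 200 features
tf = HashingTF(numFeatures=200)

# Hash each RDD element into a 200-dimensional term-frequency vector.
# Each element here is a single word (a string), so the vector counts
# its characters: "You" yields a total count of 3 in the output below.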
spam_features = tf.transform(spam_words)
non_spam_features = tf.transform(non_spam_words)
print(spam_features.take(2))
print(non_spam_features.take(2))

# Label the features: 1 for spam, 0 for non-spam
spam_samples = spam_features.map(lambda features:LabeledPoint(1, features))
non_spam_samples = non_spam_features.map(lambda features:LabeledPoint(0, features))
print(type(spam_samples))
print(spam_samples.take(2))
print(non_spam_samples.take(2))

# Combine the two datasets
samples = spam_samples.union(non_spam_samples)

[SparseVector(200, {103: 1.0, 111: 1.0, 119: 1.0}), SparseVector(200, {14: 1.0, 89: 1.0, 193: 1.0, 199: 1.0})]
[SparseVector(200, {103: 2.0, 136: 1.0, 162: 2.0}), SparseVector(200, {64: 1.0, 163: 2.0})]
<class 'pyspark.rdd.PipelinedRDD'>
[LabeledPoint(1.0, (200,[103,111,119],[1.0,1.0,1.0])), LabeledPoint(1.0, (200,[14,89,193,199],[1.0,1.0,1.0,1.0]))]
[LabeledPoint(0.0, (200,[103,136,162],[2.0,1.0,2.0])), LabeledPoint(0.0, (200,[64,163],[1.0,2.0]))]
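
The vectors above carry one count per character of each word (a total of three for "You", five for "Rofl."), which suggests that HashingTF iterates over each string element character by character. The following pure-Python sketch of the hashing trick shows the difference between passing a single string and passing a list of words. It uses zlib.crc32 as a stand-in hash function, so the bucket indices will not match Spark's, and the `hashing_tf` helper is hypothetical.

```python
import zlib

def hashing_tf(document, num_features=200):
    """Map an iterable of terms to a sparse {index: count} dict via hashing."""
    vector = {}
    for term in document:  # iterating over a str yields its characters
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector

message = "You have 1 new message"

# A single word (str) is iterated character by character
char_vector = hashing_tf("You")

# A list of words is iterated word by word
word_vector = hashing_tf(message.split(" "))

print(sum(char_vector.values()))  # total count equals the number of characters
print(sum(word_vector.values()))  # total count equals the number of words
```

If one vector per message is wanted, the usual approach is to build the word lists with map(lambda email: email.split(' ')) instead of flatMap before calling tf.transform.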


In [6]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# Split the data into training and test sets (80% / 20%)
train_samples,test_samples = samples.randomSplit([0.8, 0.2])

# Train the model
model = LogisticRegressionWithLBFGS.train(train_samples)

# Create a prediction label from the test data
predictions = model.predict(test_samples.map(lambda x: x.features))

# Combine original labels with the predicted labels
labels_and_preds = test_samples.map(lambda x: x.label).zip(predictions)

# Check the accuracy of the model on the test data
accuracy = labels_and_preds.filter(lambda x: x[0] == x[1]).count() / float(test_samples.count())
print("Model accuracy : {:.2f}".format(accuracy))


Model accuracy : 0.82
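
A single accuracy figure does not show which class the errors fall in. As a rough sketch, the (label, prediction) pairs can be broken down into true/false positives and negatives. The pairs below are hypothetical and are not taken from the notebook run, and `confusion_counts` is an illustrative helper, not part of pyspark.

```python
from collections import Counter

def confusion_counts(labels_and_preds):
    """Count true/false positives and negatives, treating label 1 as spam."""
    counts = Counter()
    for label, pred in labels_and_preds:
        if label == 1 and pred == 1:
            counts["tp"] += 1
        elif label == 0 and pred == 0:
            counts["tn"] += 1
        elif label == 0 and pred == 1:
            counts["fp"] += 1
        else:  # label == 1 and pred == 0
            counts["fn"] += 1
    return counts

# Hypothetical (label, prediction) pairs for illustration only
pairs = [(1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0)]
counts = confusion_counts(pairs)
accuracy = (counts["tp"] + counts["tn"]) / float(len(pairs))
print(dict(counts), "accuracy = {:.2f}".format(accuracy))
```

For a small test set, the same pairs could be brought to the driver with labels_and_preds.collect() and passed to this helper.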
