## Machine Learning with Spark

Classification Using Decision Tree, Random Forest, and Logistic Regression

### **Attributes**

1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'primary', 'secondary', 'tertiary', 'unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. balance : bank balance
7. housing: has housing loan? (categorical: 'no','yes','unknown')
8. loan: has personal loan? (categorical: 'no','yes','unknown')

### Related to the last contact of the current campaign:
9. contact: contact communication type (categorical: 'cellular','telephone','unknown')
10. day: last contact day of the month (numeric: 1, 2, ..., 28, 29, 30)
11. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
12. duration: last contact duration, in seconds (numeric). Important note: this attribute strongly affects the output target (e.g., if duration=0 then the target is 'no'). However, the duration is not known before a call is made, and once the call ends the outcome is already known. This input should therefore be included only for benchmarking and discarded if the goal is a realistic predictive model.

### Other attributes:
13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
15. previous: number of contacts performed before this campaign and for this client (numeric)
16. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success','unknown')

### Output variable (desired target):
17. deposit: has the client subscribed to a term deposit? (binary: 'yes','no')

### Step 1: Data Loading and Preparation <a class="anchor" name="data-preparation"></a> 

In [1]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()\
            .setMaster("local[*]")\
            .setAppName("ML-Classification")

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
spark.sparkContext.setLogLevel('ERROR')

df = spark.read.csv('bank.csv', header = True, inferSchema = True)
cols = df.columns

In [2]:
# List the categorical input columns, the numeric input columns, and the target column.
categoryInputCols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome']
numericInputCols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
categoryOutputCol = 'deposit'
categoryCols = categoryInputCols+[categoryOutputCol]

### Step 2: Feature Engineering <a class="anchor" name="feature-engineering"></a> 

In [3]:
# Convert categorical columns to numeric indices
from pyspark.ml.feature import StringIndexer

# Define the output columns
outputCols=[f'{x}_index' for x in categoryInputCols]
outputCols.append('label')

print(categoryCols)
print(outputCols)

# Create the index values for categorical values
# Initialize StringIndexer (use inputCols and outputCols)
inputIndexer = StringIndexer(inputCols=categoryCols, outputCols=outputCols)

# Call the fit and transform() method to get the encoded results 
df_indexed = inputIndexer.fit(df).transform(df)

# # Display the output, only the output columns
# df_indexed.show(5)

['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome', 'deposit']
['job_index', 'marital_index', 'education_index', 'default_index', 'housing_index', 'loan_index', 'contact_index', 'poutcome_index', 'label']
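
By default, `StringIndexer` orders labels by descending frequency, so the most common category receives index 0.0. The plain-Python sketch below mirrors that idea on a toy column without needing a Spark session. The helper name `fit_string_index`, the toy data, and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically here; that rule is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical toy column, not taken from bank.csv
jobs = ["management", "technician", "management", "admin.", "technician", "management"]
job_index = fit_string_index(jobs)
print(job_index)
```

Because the index reflects frequency rather than any natural order, the indexed columns are one-hot encoded in the next cell before being used as features.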


In [4]:
from pyspark.ml.feature import OneHotEncoder

# input columns for OHE are all output columns from StringIndexer except label
inputCols_OHE = [x for x in outputCols if x!='label']
outputCols_OHE = [f'{x}_vec' for x in categoryInputCols]

#Define OneHotEncoder with the appropriate columns
encoder = OneHotEncoder(inputCols=inputCols_OHE,
                        outputCols=outputCols_OHE)

# Fit the encoder, then transform to get the encoded results
model = encoder.fit(df_indexed)
df_encoded = model.transform(df_indexed)
# # Display the output columns
# df_encoded.show(3)
# print('\n')
df_encoded.select("job", "job_index", "job_vec").show(3)

+----------+---------+--------------+
|       job|job_index|       job_vec|
+----------+---------+--------------+
|    admin.|      3.0|(11,[3],[1.0])|
|    admin.|      3.0|(11,[3],[1.0])|
|technician|      2.0|(11,[2],[1.0])|
+----------+---------+--------------+
only showing top 3 rows
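
The `job` column has 12 categories, yet `job_vec` has size 11. This is consistent with `OneHotEncoder`'s default `dropLast=True`, which leaves out the last category and represents it as an all-zeros vector. The sketch below reproduces the `(size, [indices], [values])` triples in plain Python; the function name `one_hot_drop_last` is hypothetical.

```python
def one_hot_drop_last(index, num_categories):
    """Return (size, indices, values) for a one-hot vector that omits the last category."""
    size = num_categories - 1
    idx = int(index)
    if idx < size:
        return (size, [idx], [1.0])
    # The last category is encoded as the all-zeros vector
    return (size, [], [])

admin_vec = one_hot_drop_last(3.0, 12)   # same index as 'admin.' in the output above
last_vec = one_hot_drop_last(11.0, 12)   # the category with the highest index
print(admin_vec)
print(last_vec)
```

Dropping one category avoids a redundant column, since the omitted category is implied when all other slots are zero.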



In [5]:
from pyspark.ml.feature import VectorAssembler

# Assembler inputs are the one-hot encoded columns plus the numeric columns
assemblerInputs = outputCols_OHE + numericInputCols
print(assemblerInputs)

# Define the assembler with appropriate input and output columns
assembler = VectorAssembler(inputCols = assemblerInputs, outputCol="features")

# Use the assembler's transform() to build the feature vector
df_final = assembler.transform(df_encoded)

# Display the output
df_final.select('features').show()

['job_vec', 'marital_vec', 'education_vec', 'default_vec', 'housing_vec', 'loan_vec', 'contact_vec', 'poutcome_vec', 'age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
+--------------------+
|            features|
+--------------------+
|(30,[3,11,13,16,1...|
|(30,[3,11,13,16,1...|
|(30,[2,11,13,16,1...|
|(30,[4,11,13,16,1...|
|(30,[3,11,14,16,1...|
|(30,[0,12,14,16,2...|
|(30,[0,11,14,16,2...|
|(30,[5,13,16,18,2...|
|(30,[2,11,13,16,1...|
|(30,[4,12,13,16,1...|
|(30,[3,12,13,16,1...|
|(30,[1,11,13,16,1...|
|(30,[0,11,14,16,2...|
|(30,[1,12,14,16,1...|
|(30,[2,12,14,16,1...|
|(30,[0,14,16,18,2...|
|(30,[1,12,15,16,1...|
|(30,[4,11,13,16,1...|
|(30,[3,11,13,16,1...|
|(30,[3,13,16,20,2...|
+--------------------+
only showing top 20 rows



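Each row of `features` is a sparse vector printed as (size, [indices], [values]); positions that are not listed are zero. Example: (30,[3,11,13,16,17],[1.0,1.0,1.0,50.0,789.0])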
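
To make the sparse notation concrete, the sketch below expands the example triple into an ordinary dense list. It uses plain Python rather than Spark's vector classes, and the helper name `to_dense` is hypothetical.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

example = to_dense(30, [3, 11, 13, 16, 17], [1.0, 1.0, 1.0, 50.0, 789.0])
nonzero = sum(1 for x in example if x != 0.0)
print(len(example), nonzero)
print(example[3], example[16], example[17])
```

Only 5 of the 30 positions are stored, which is why the sparse form is used for one-hot encoded features.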

### Step 3: Pipeline API <a class="anchor" name="pipeline"></a> 

In [6]:
from pyspark.ml import Pipeline

# Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.
stage_1 = inputIndexer
stage_2 = encoder
stage_3 = assembler

stages = [stage_1,stage_2,stage_3]

pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(df)
df_pipeline = pipelineModel.transform(df)

# Keep label and features, followed by the original columns
selectedCols = ['label', 'features'] + cols
df_pipeline = df_pipeline.select(selectedCols)
df_pipeline.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- deposit: string (nullable = true)



### Step 4: Train/Test Split <a class="anchor" name="train-test"></a> 

In [7]:
# Split the data into training and test sets (70% / 30%).
# The seed makes the split reproducible across runs.
train, test = df_pipeline.randomSplit([0.7, 0.3], seed = 2020)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

Training Dataset Count: 7858
Test Dataset Count: 3304
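
The counts above are close to, but not exactly, 70% and 30% of the rows, because the weights act as sampling probabilities rather than exact quotas. The plain-Python sketch below illustrates that behaviour and the role of the seed. It is a simplified illustration, not Spark's actual sampling implementation, and the names `random_split`, `train_rows`, and `test_rows` are hypothetical.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows

rows = list(range(1000))  # hypothetical row ids
train_a, test_a = random_split(rows, 0.7, seed=2020)
train_b, test_b = random_split(rows, 0.7, seed=2020)

print(train_a == train_b)          # same seed, same split
print(len(train_a), len(test_a))   # roughly 700 / 300, not exactly
```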


## ML Classification Models <a class="anchor" name="models"></a>
<hr />

### Decision Tree <a class="anchor" name="dt"></a>


In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

# Define a decision tree classifier with a maximum depth of 3 and fit it on the training set.
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 3)
dtModel = dt.fit(train)

In [None]:
dtPredictions = dtModel.transform(test)
dtPredictions.select('features','label','prediction','probability').show()
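
`DecisionTreeClassifier` uses Gini impurity as its default split criterion. The sketch below works through the arithmetic for one candidate split in plain Python. The class counts are hypothetical, and the helpers `gini` and `split_gain` are illustrative names rather than part of the Spark API.

```python
def gini(class_counts):
    """Gini impurity of a node given its per-class row counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def split_gain(parent, left, right):
    """Impurity reduction obtained by splitting parent into left and right."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical counts as [label 0, label 1]
parent, left, right = [50, 50], [40, 10], [10, 40]
gain = split_gain(parent, left, right)
print(gini(parent), round(gini(left), 2), round(gain, 2))
```

A tree of depth 3 applies at most three such splits along any path from the root to a leaf.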

### Random Forest <a class="anchor" name="rf"></a>

In [None]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
forestModel = rf.fit(train)

In [None]:
rfPredictions = forestModel.transform(test)
rfPredictions.select('features','label','prediction','probability').show()
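
The `probability` column of a forest combines the outputs of its individual trees. The sketch below is a simplified illustration that assumes the forest averages per-tree class probabilities and predicts the class with the highest average. The per-tree values are hypothetical and `forest_predict` is not a Spark function.

```python
def forest_predict(tree_probabilities):
    """Average per-tree class probabilities and pick the most likely class."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    averaged = [
        sum(p[k] for p in tree_probabilities) / n_trees for k in range(n_classes)
    ]
    prediction = max(range(n_classes), key=lambda k: averaged[k])
    return averaged, float(prediction)

# Hypothetical class probabilities from three trees for one row
per_tree = [[0.9, 0.1], [0.4, 0.6], [0.35, 0.65]]
probability, prediction = forest_predict(per_tree)
print([round(p, 2) for p in probability], prediction)
```

In this toy case two of the three trees lean towards class 1, but the averaged probabilities favour class 0, so probability averaging and simple majority voting can disagree.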

### Logistic Regression <a class="anchor" name="lr"></a>


In [None]:
from pyspark.ml.classification import LogisticRegression

# Fit a logistic regression model with elastic-net regularization on the training set.
lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10, regParam=0.3, elasticNetParam=0.7)
lrModel = lr.fit(train)

In [None]:
lrPredictions = lrModel.transform(test)
lrPredictions.select('features','label','prediction','probability').show()
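
The `regParam` and `elasticNetParam` settings control how strongly the coefficients are penalized: `regParam` scales the whole penalty, and `elasticNetParam` mixes the L1 term (value 1.0) with the L2 term (value 0.0). The sketch below evaluates that penalty and the logistic function in plain Python. The coefficient values are hypothetical, and the helper names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function that turns a linear score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )

weights = [1.0, -2.0]  # hypothetical coefficients
penalty = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.7)
print(round(penalty, 3))
print(sigmoid(0.0))
```

With a penalty this strong, many coefficients can be driven to exactly zero, so it is worth checking whether the fitted model still predicts both classes.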

## Model Evaluation <a class="anchor" name="model-evaluation"></a>
<hr />

In [None]:
def compute_metrics(predictions):
    # Count the four cells of the confusion matrix (label 1 is the positive class)
    TN = predictions.filter('prediction = 0 AND label = prediction').count()
    TP = predictions.filter('prediction = 1 AND label = prediction').count()
    FN = predictions.filter('prediction = 0 AND label <> prediction').count()
    FP = predictions.filter('prediction = 1 AND label <> prediction').count()

    # Derive the metrics; fall back to 0.0 when a denominator is zero,
    # e.g. when a model never predicts the positive class
    total = TN + TP + FN + FP
    accuracy = (TN + TP) / total if total else 0.0
    precision = TP / (TP + FP) if (TP + FP) else 0.0
    recall = TP / (TP + FN) if (TP + FN) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

In [None]:
print('Logistic regression:',compute_metrics(lrPredictions))
print('Decision Trees:',compute_metrics(dtPredictions))
print('Random Forest:',compute_metrics(rfPredictions))
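
Each printed tuple is ordered as (accuracy, precision, recall, F1). As a worked example of the arithmetic, the sketch below computes the same four metrics from hypothetical confusion-matrix counts, which are not results from the models above. The function `metrics_from_counts` is an illustrative plain-Python helper.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts
accuracy, precision, recall, f1 = metrics_from_counts(tp=40, fp=10, fn=20, tn=30)
print(accuracy, precision, round(recall, 3), round(f1, 3))

# A model that never predicts the positive class: precision falls back to 0.0
degenerate = metrics_from_counts(tp=0, fp=0, fn=60, tn=40)
print(degenerate)
```

The second call shows why accuracy alone can mislead: a model that always predicts the negative class still scores 0.4 here while its recall and F1 are zero.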