# Binary Tabular Data Classification with PySpark

This notebook covers a classification problem in Machine Learning and go through a comprehensive guide to succesfully develop an End-to-End ML class prediction model using PySpark.

Classification Algorithms In order to predict the class of certain samples, there are several classification algorithms that can be used. In fact, when developing our machine learning models, we will train and evaluate a certain number of them, and we will keep those with better predicting performance.

A non-exhaustive list of some of the most used algorithms are:

1) Logistic Regression
2) Decision Trees
3) Random Forests
4) Support Vector Machines
5) K-Nearest Neighbors (KNN)

<b>ROC</b> the metric that we will use in our project is the Reciever Operation Characteristic or ROC. The ROC curve tells us about how good the model can distinguish between two classes. It can get values from 0 to 1. The better the model is, the closer to 1 value it will be.

We will use a number of different supervised algorithms to precisely predict individuals’ income using data collected from the 1994 U.S. Census. 

We will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data. Our goal with this implementation is to build a model that accurately predicts whether an individual makes more than $50,000.

<p>Therefore, we are facing a binary classification problem, where we want to determine wether an individual makes more than $50K a year (class 1) or do not (class 0).</p>

In [1]:
import findspark
findspark.init()

In [2]:
import pandas as pd
import numpy as np
from datetime import date, timedelta, datetime
import time

import pyspark #only run this after findspark.init()
from pyspark.sql import SparkSession, SQLContext
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql.types import *

## 1. Load Data


The census dataset consists of approximately 45222 data points, with each datapoint having 13 features
The dataset for this project can be found from the UCI Machine Learning Repo.
[DATASET](https://archive.ics.uci.edu/dataset/2/adult).

In [3]:
# Initiliaze Spark Session
spark = SparkSession.builder.appName('imbalanced_binary_classification').getOrCreate()

In [4]:
spark

In [5]:
#File Location and type
file_location = "./adult.csv"
file_type = "csv"

# CSV Options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# make sure to add column name as csv does not contain column name as default

df = spark.read.format(file_type) \
   .option("inferSchema", infer_schema)\
   .option("header", first_row_is_header)\
   .option("sep", delimiter)\
   .load(file_location) \
   .toDF("age", "workClass", "fnlwgt", "education", "education-num","marital-status", "occupation", "relationship",
        "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income")
display(df)


DataFrame[age: int, workClass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string, income: string]

In [6]:
df.show()
df.columns
display(df)

+---+----------------+------+------------+-------------+--------------+-----------------+--------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workClass|fnlwgt|   education|education-num|marital-status|       occupation|  relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------------+------+------------+-------------+--------------+-----------------+--------------+-----+------+------------+------------+--------------+--------------+------+
| 90|               ?| 77053|     HS-grad|            9|       Widowed|                ?| Not-in-family|White|Female|           0|        4356|            40| United-States| <=50K|
| 82|         Private|132870|     HS-grad|            9|       Widowed|  Exec-managerial| Not-in-family|White|Female|           0|        4356|            18| United-States| <=50K|
| 66|               ?|186061|Some-college|           10|       Widowed|                ?|     U

DataFrame[age: int, workClass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string, income: string]

### 2. Data Preprocessing

In [7]:
# import pyspark functions
from pyspark.sql import functions as F
#create add new column to the dataset
df = df.withColumn('>50k', F.when(df.income == '<=50k', 0).otherwise(1))
# Drop the income label
df = df.drop('income')
df.columns
                                

['age',
 'workClass',
 'fnlwgt',
 'education',
 'education-num',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'native-country',
 '>50k']

In [8]:
display(df)

DataFrame[age: int, workClass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string, >50k: int]

#### Vectorizing Numerical Features and One-Hot Encoding Categorical Features

<p>1. Vectorizing Numerical Features:</p><p>
Vectorization is the process of converting numerical features into a numerical vector format. This is often done to standardize or scale numerical features, making them comparable and avoiding issues related to varying scales

</p>

<p>2. One-Hot Encoding Categorical Features:</p><p>
One-hot encoding is a technique used to represent categorical variables as binary vectors. It creates binary columns for each category and indicates the presence of the category with a 1 or 0. This is crucial for machine learning models that require numerical input. Libraries such as scikit-learn provide tools for one-hot encoding:</p>

In [9]:
# Selecting categorical features
categorical_columns = [
 'workClass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'hours-per-week',
 'native-country',
 ]

In [10]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier, RandomForestClassifier, LogisticRegression)
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# The index of string values multiple column
indexers = [
    StringIndexer(inputCol = c, outputCol = "{0}_indexed".format(c))
    for c in categorical_columns]
# The encode of indexed values multiple column
encoders = [OneHotEncoder(dropLast = False, inputCol= indexer.getOutputCol(),
                          outputCol="{0}_encoded".format(indexer.getOutputCol()))
            for indexer in indexers]

<p>The above code basically indexes each categorical column using the StringIndexer, and then converts the indexed categories into one-hot encoded variables. The resulting output has the binary vectors appended to the end of each row.</p>

#### Join the categorical encoded features with the numerical ones and make a vector with both of them

In [11]:
# Vectorizing encoded values
categorical_encoded = [encoder.getOutputCol() for encoder in encoders]
numerical_columns = ['age', 'education-num', 'capital-gain', 'capital-loss']
inputcols = categorical_encoded + numerical_columns
assembler = VectorAssembler(inputCols= inputcols, outputCol = "features")

#### Set up the pipeline to automatize this stages
<p>Pipeline is used to assemble multiple stages (StringIndexers and OneHotEncoders) into a single pipeline.

fit is called on the pipeline with the original DataFrame (df), and then transform is applied to execute the transformations.</p>s

In [12]:
# Assemble the stages into a Pipeline
pipeline = Pipeline(stages=indexers + encoders+[assembler])
model = pipeline.fit(df)
#Transform data
transformed = model.transform(df)
display(transformed)

DataFrame[age: int, workClass: string, fnlwgt: int, education: string, education-num: int, marital-status: string, occupation: string, relationship: string, race: string, sex: string, capital-gain: int, capital-loss: int, hours-per-week: int, native-country: string, >50k: int, workClass_indexed: double, education_indexed: double, marital-status_indexed: double, occupation_indexed: double, relationship_indexed: double, race_indexed: double, sex_indexed: double, hours-per-week_indexed: double, native-country_indexed: double, workClass_indexed_encoded: vector, education_indexed_encoded: vector, marital-status_indexed_encoded: vector, occupation_indexed_encoded: vector, relationship_indexed_encoded: vector, race_indexed_encoded: vector, sex_indexed_encoded: vector, hours-per-week_indexed_encoded: vector, native-country_indexed_encoded: vector, features: vector]

#### Finally, we will select a dataset only with the relevant features.

In [13]:
# Transform data
final_data = transformed.select('features', '>50k')

### 3. Build a Model

In [17]:
# Initialize the classification models
# Decision Trees
# Random Forests
# Gradient Boosted Trees

dtc = DecisionTreeClassifier(labelCol = '>50k', featuresCol = 'features')
rfc = RandomForestClassifier(numTrees=150, labelCol='>50k', featuresCol = 'features')
gbt = GBTClassifier(labelCol= '>50k', featuresCol = 'features', maxIter=10)

In [18]:
# Split data
# We will perform a classic 80/20 split between training and testing data.
train_data, test_data = final_data.randomSplit([0.8,0.2], seed=623)
print(train_data.count())
print(test_data.count())

26140
6421


### 4. Start Training

In [19]:
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

## 5. Evaluate with Test-set

In [20]:
dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)

## 6. Evaluating Model's Performance

In [21]:
# our evaluator will be the ROC
my_eval = BinaryClassificationEvaluator(labelCol='>50k')

In [22]:
# Display Decision Tree evaluation metric
print('DTC')
print(my_eval.evaluate(dtc_preds))

DTC
1.0


In [23]:
# Display Random Forest evaluation metric
print('RFC')
print(my_eval.evaluate(rfc_preds))

RFC
1.0


In [24]:
# Display Gradien Boosting Tree evaluation metric
print('GBT')
print(my_eval.evaluate(gbt_preds))

GBT
1.0


In [None]:
# Improving Model Performance (Model Tuning)