# Classification in Spark

The intent of this blog is to demonstrate binary classification in PySpark. The steps involved in developing a classification model in PySpark are as follows:

1) Initialize a Spark session

2) Download and read the dataset

3) Develop an initial understanding of the data

4) Handle missing values

5) Scale the features

6) Train test split

7) Imbalance handling

8) Feature selection

9) Performance evaluation

In [1]:
# Initializing a Spark session
from pyspark.sql import SparkSession
import time
spark = SparkSession.builder.master("local").appName("income").config("spark.some.config.option","some-value").getOrCreate()

# Download and read the dataset


In [3]:
start_time = time.time()
raw_data = spark.read.csv('s3://516ml/adult-training.csv',
                    header='true', inferSchema='true')
print("--- %s seconds ---" % (time.time() - start_time))
raw_data.columns

--- 11.9728660583 seconds ---


['Age',
 'Workclass',
 'fnlgwt',
 'Education',
 'Education num',
 'Marital Status',
 'Occupation',
 'Relationship',
 'Race',
 'Sex',
 'Capital Gain',
 'Capital Loss',
 'Hours/Week',
 'Native country',
 'Income']

In [4]:
raw_data.show(5)

+---+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+----------+--------------+------+
|Age|        Workclass|  fnlgwt| Education|Education num|     Marital Status|        Occupation|  Relationship|  Race|    Sex|Capital Gain|Capital Loss|Hours/Week|Native country|Income|
+---+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+----------+--------------+------+
| 39|        State-gov| 77516.0| Bachelors|         13.0|      Never-married|      Adm-clerical| Not-in-family| White|   Male|      2174.0|         0.0|      40.0| United-States| <=50K|
| 50| Self-emp-not-inc| 83311.0| Bachelors|         13.0| Married-civ-spouse|   Exec-managerial|       Husband| White|   Male|         0.0|         0.0|      13.0| United-States| <=50K|
| 38|          Private|215646.0|   HS-grad|          9.0|           Di

In [5]:
raw_data.describe().select("Age","fnlgwt","Education num", "Sex", "Hours/Week", "Income").show()

+------------------+------------------+-----------------+-------+------------------+------+
|               Age|            fnlgwt|    Education num|    Sex|        Hours/Week|Income|
+------------------+------------------+-----------------+-------+------------------+------+
|             32561|             32561|            32561|  32561|             32561| 32561|
| 38.58164675532078|189778.36651208502| 10.0806793403151|   null|40.437455852092995|  null|
|13.640432553581356|105549.97769702227|2.572720332067397|   null|12.347428681731838|  null|
|                17|           12285.0|              1.0| Female|               1.0| <=50K|
|                90|         1484705.0|             16.0|   Male|              99.0|  >50K|
+------------------+------------------+-----------------+-------+------------------+------+



In [6]:
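import numpy as np
from pyspark.sql.functions import when
# Map the literal string 'null' in Income to NaN and keep every other value unchanged
raw_data=raw_data.withColumn("Income",when(raw_data.Income=='null',np.nan).otherwise(raw_data.Income))
raw_data.select("Income").show()
# Encode the label: ' <=50K' (note the leading space in the raw value) becomes 0.0, everything else 1.0
raw_data=raw_data.withColumn("Income",when(raw_data.Income==' <=50K',0.0).otherwise(1.0))

# Same treatment for Sex: 'Female' becomes 1.0, everything else 0.0
# Caution: Income only matches with a leading space; if Sex is stored the same way,
# 'Female' never matches and every row is encoded as 0.0 (compare with ' Female' or trim first)
raw_data=raw_data.withColumn("Sex",when(raw_data.Sex=='null',np.nan).otherwise(raw_data.Sex))
raw_data=raw_data.withColumn("Sex",when(raw_data.Sex=='Female',1.0).otherwise(0.0))
# Cast Age to float so that all numeric feature columns share a floating-point type
raw_data=raw_data.withColumn("Age",raw_data.Age.cast('float'))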

+------+
|Income|
+------+
| <=50K|
| <=50K|
| <=50K|
| <=50K|
| <=50K|
| <=50K|
| <=50K|
|  >50K|
|  >50K|
|  >50K|
|  >50K|
|  >50K|
| <=50K|
| <=50K|
|  >50K|
| <=50K|
| <=50K|
| <=50K|
| <=50K|
|  >50K|
+------+
only showing top 20 rows



In [8]:
raw_data.select("Income").show()

+------+
|Income|
+------+
|   0.0|
|   0.0|
|   0.0|
|   0.0|
|   0.0|
|   0.0|
|   0.0|
|   1.0|
|   1.0|
|   1.0|
|   1.0|
|   1.0|
|   0.0|
|   0.0|
|   1.0|
|   0.0|
|   0.0|
|   0.0|
|   0.0|
|   1.0|
+------+
only showing top 20 rows



In [9]:
raw_data.select("Age","fnlgwt","Education num", "Sex", "Hours/Week", "Income").show(5)

+----+--------+-------------+---+----------+------+
| Age|  fnlgwt|Education num|Sex|Hours/Week|Income|
+----+--------+-------------+---+----------+------+
|39.0| 77516.0|         13.0|0.0|      40.0|   0.0|
|50.0| 83311.0|         13.0|0.0|      13.0|   0.0|
|38.0|215646.0|          9.0|0.0|      40.0|   0.0|
|53.0|234721.0|          7.0|0.0|      40.0|   0.0|
|28.0|338409.0|         13.0|0.0|      40.0|   0.0|
+----+--------+-------------+---+----------+------+
only showing top 5 rows
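
The all-zero Sex column above is worth a second look. The Income comparison only works because it includes a leading space (' <=50K'), which suggests that the string columns keep the space that follows each comma in the CSV. If that is the case, comparing Sex with 'Female' never matches. The sketch below reproduces the effect in plain Python; the helper name `encode_sex` and the sample values are assumptions made for illustration, not part of the original notebook.

```python
def encode_sex(raw_value, strip_whitespace=True):
    """Return 1.0 for 'Female' and 0.0 otherwise, optionally trimming spaces first."""
    value = raw_value.strip() if strip_whitespace else raw_value
    return 1.0 if value == "Female" else 0.0


# Hypothetical raw values, assuming each one keeps the space after the CSV comma
samples = [" Male", " Female", " Female"]

# Exact comparison against 'Female' never matches a value with a leading space
without_trim = [encode_sex(s, strip_whitespace=False) for s in samples]

# Trimming first restores the intended encoding
with_trim = [encode_sex(s) for s in samples]

print(without_trim)
print(with_trim)
```

In PySpark the same trimming could be done with `trim` from `pyspark.sql.functions` before the comparison.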



So we have replaced the 'null' strings with NaN and encoded Income and Sex as numeric columns. Now we can simply impute any NaN values by calling an Imputer :)
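
As a rough illustration of what the Imputer does with its default `mean` strategy, the plain-Python sketch below replaces every NaN in a column with the mean of the non-missing values. The function name `impute_mean` and the sample column are assumptions made for this example.

```python
import math


def impute_mean(column):
    """Replace NaN entries with the mean of the observed (non-NaN) entries."""
    observed = [v for v in column if not math.isnan(v)]
    if not observed:
        # Nothing to average over, so return the column unchanged
        return list(column)
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]


# Hypothetical column with one missing entry
hours = [40.0, float("nan"), 50.0, 30.0]
imputed = impute_mean(hours)
print(imputed)
```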

In [10]:
from pyspark.ml.feature import Imputer
imputer=Imputer(inputCols=["Age","fnlgwt","Education num", "Sex", "Hours/Week", "Income"],outputCols=["Age","fnlgwt","Education num", "Sex", "Hours/Week", "Income"])
model=imputer.fit(raw_data)
raw_data=model.transform(raw_data)
raw_data.show(5)

+----+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+---+------------+------------+----------+--------------+------+
| Age|        Workclass|  fnlgwt| Education|Education num|     Marital Status|        Occupation|  Relationship|  Race|Sex|Capital Gain|Capital Loss|Hours/Week|Native country|Income|
+----+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+---+------------+------------+----------+--------------+------+
|39.0|        State-gov| 77516.0| Bachelors|         13.0|      Never-married|      Adm-clerical| Not-in-family| White|0.0|      2174.0|         0.0|      40.0| United-States|   0.0|
|50.0| Self-emp-not-inc| 83311.0| Bachelors|         13.0| Married-civ-spouse|   Exec-managerial|       Husband| White|0.0|         0.0|         0.0|      13.0| United-States|   0.0|
|38.0|          Private|215646.0|   HS-grad|          9.0|           Divorced| Handle

In [11]:
cols = ["Age","fnlgwt","Education num", "Sex", "Hours/Week", "Income"]
cols.remove("Income")
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=cols,outputCol="features")
raw_data=assembler.transform(raw_data)
raw_data.select("features").show(truncate=False)

+-----------------------------+
|features                     |
+-----------------------------+
|[39.0,77516.0,13.0,0.0,40.0] |
|[50.0,83311.0,13.0,0.0,13.0] |
|[38.0,215646.0,9.0,0.0,40.0] |
|[53.0,234721.0,7.0,0.0,40.0] |
|[28.0,338409.0,13.0,0.0,40.0]|
|[37.0,284582.0,14.0,0.0,40.0]|
|[49.0,160187.0,5.0,0.0,16.0] |
|[52.0,209642.0,9.0,0.0,45.0] |
|[31.0,45781.0,14.0,0.0,50.0] |
|[42.0,159449.0,13.0,0.0,40.0]|
|[37.0,280464.0,10.0,0.0,80.0]|
|[30.0,141297.0,13.0,0.0,40.0]|
|[23.0,122272.0,13.0,0.0,30.0]|
|[32.0,205019.0,12.0,0.0,50.0]|
|[40.0,121772.0,11.0,0.0,40.0]|
|[34.0,245487.0,4.0,0.0,45.0] |
|[25.0,176756.0,9.0,0.0,35.0] |
|[32.0,186824.0,9.0,0.0,40.0] |
|[38.0,28887.0,7.0,0.0,50.0]  |
|[43.0,292175.0,14.0,0.0,45.0]|
+-----------------------------+
only showing top 20 rows



# Standard Scaler

So we have created a feature vector. Now let us use StandardScaler to scale the newly created "features" column

In [12]:
from pyspark.ml.feature import StandardScaler
standardscaler=StandardScaler().setInputCol("features").setOutputCol("Scaled_features")
raw_data=standardscaler.fit(raw_data).transform(raw_data)
raw_data.select("features","Scaled_features").show(5)

+--------------------+--------------------+
|            features|     Scaled_features|
+--------------------+--------------------+
|[39.0,77516.0,13....|[2.85914686699289...|
|[50.0,83311.0,13....|[3.66557290640114...|
|[38.0,215646.0,9....|[2.78583540886487...|
|[53.0,234721.0,7....|[3.88550728078521...|
|[28.0,338409.0,13...|[2.05272082758464...|
+--------------------+--------------------+
only showing top 5 rows
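
With its default settings, StandardScaler divides each feature by its sample standard deviation and does not subtract the mean (withStd=True, withMean=False). That is why the first scaled Age is about 2.86 rather than a value centred on zero: it is 39.0 divided by the Age standard deviation reported by describe() earlier. A minimal plain-Python sketch of the same arithmetic follows; the function name `scale_to_unit_std` is an assumption for illustration.

```python
import math


def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std = math.sqrt(variance)
    return [v / std for v in values]


# Standard deviation of Age as printed by raw_data.describe() above
age_std = 13.640432553581356

# First row of the dataset has Age = 39.0
first_age_scaled = 39.0 / age_std
print(first_age_scaled)

# Toy column whose sample standard deviation is exactly 1.0
toy_scaled = scale_to_unit_std([1.0, 2.0, 3.0])
print(toy_scaled)
```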



# Train, test split

Now that the preprocessing of the data is complete, let us split the dataset into training and testing sets.

In [13]:
train, test = raw_data.randomSplit([0.8, 0.2], seed=12345)

Let us check whether there is imbalance in the dataset

In [14]:
dataset_size=float(train.select("Income").count())
numPositives=train.select("Income").where('Income == 1').count()
per_ones=(float(numPositives)/float(dataset_size))*100
numNegatives=float(dataset_size-numPositives)
print('The number of ones are {}'.format(numPositives))
print('Percentage of ones are {}'.format(per_ones))

The number of ones are 6222
Percentage of ones are 24.039873271


# Imbalance handling

Since only 24.04 % of the rows in the training set are ones, the dataset is clearly imbalanced.

Therefore, the logistic loss objective function should treat the positive class (Income == 1) with higher weight. For this purpose we calculate the BalancingRatio as follows:

BalancingRatio = numNegatives / dataset_size

Then against every Income == 1, we put BalancingRatio in column "classWeights", and against every Income == 0, we put 1-BalancingRatio in column "classWeights"

In this way, we assign a higher weight to the minority class (i.e. the positive class)
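
A small worked example may make the weighting clearer. The sketch below uses hypothetical counts (24 positives and 76 negatives) and shows that, with these weights, both classes end up contributing the same total weight to the loss. The function name `class_weights` is an assumption for illustration.

```python
def class_weights(num_positives, num_negatives):
    """Return (weight for positives, weight for negatives) using the balancing ratio."""
    dataset_size = num_positives + num_negatives
    balancing_ratio = num_negatives / dataset_size
    return balancing_ratio, 1.0 - balancing_ratio


# Hypothetical class counts, chosen only to keep the arithmetic easy to follow
num_positives, num_negatives = 24, 76
pos_weight, neg_weight = class_weights(num_positives, num_negatives)

# Total weight carried by each class after re-weighting
total_pos_weight = num_positives * pos_weight
total_neg_weight = num_negatives * neg_weight

print(pos_weight, neg_weight)
print(total_pos_weight, total_neg_weight)
```

Each positive row counts for more than each negative row, yet the two classes carry equal total weight, so the minority class is no longer drowned out.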

In [15]:
BalancingRatio= numNegatives/dataset_size
print('BalancingRatio = {}'.format(BalancingRatio))

BalancingRatio = 0.75960126729


In [16]:
train=train.withColumn("classWeights", when(train.Income == 1,BalancingRatio).otherwise(1-BalancingRatio))
train.select("classWeights").show(5)

+-------------------+
|       classWeights|
+-------------------+
|0.24039873270999146|
|0.24039873270999146|
|0.24039873270999146|
|0.24039873270999146|
|0.24039873270999146|
+-------------------+
only showing top 5 rows



# Building a classification model using Logistic Regression (LR)

In [17]:
from pyspark.ml.classification import LogisticRegression
start_time = time.time()
# Weighted logistic regression on the scaled feature vector; classWeights up-weights the minority class
lr = LogisticRegression(labelCol="Income", featuresCol="Scaled_features",weightCol="classWeights",maxIter=10)
model = lr.fit(train)

print("--- %s seconds ---" % (time.time() - start_time))
predict_train=model.transform(train)
predict_test=model.transform(test)

predict_test.select("Income","prediction").show(10)

--- 2.93401193619 seconds ---
+------+----------+
|Income|prediction|
+------+----------+
|   0.0|       0.0|
|   0.0|       0.0|
|   0.0|       0.0|
|   0.0|       0.0|
|   0.0|       0.0|
|   0.0|       0.0|
|   0.0|       0.0|
|   0.0|       0.0|
|   0.0|       0.0|
|   0.0|       0.0|
+------+----------+
only showing top 10 rows



# Evaluating the model

Now let us evaluate the model using the BinaryClassificationEvaluator class in Spark ML. By default, BinaryClassificationEvaluator uses areaUnderROC as the performance metric.

In [20]:
# The BinaryClassificationEvaluator uses areaUnderROC as the default metric. As of now we will continue with the same
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator=BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",labelCol="Income")

In [18]:
predict_test.select("Income","rawPrediction","prediction","probability").show(5)

+------+--------------------+----------+--------------------+
|Income|       rawPrediction|prediction|         probability|
+------+--------------------+----------+--------------------+
|   0.0|[3.25412979493759...|       0.0|[0.96282122779656...|
|   0.0|[3.07498513336678...|       0.0|[0.95584903204928...|
|   0.0|[3.75138778546630...|       0.0|[0.97705376443933...|
|   0.0|[3.04977377641950...|       0.0|[0.95477275879431...|
|   0.0|[3.83494964851657...|       0.0|[0.97885437073768...|
+------+--------------------+----------+--------------------+
only showing top 5 rows



In [None]:
print("The area under ROC for train set is {}".format(evaluator.evaluate(predict_train)))
print("The area under ROC for test set is {}".format(evaluator.evaluate(predict_test)))
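
For intuition, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The plain-Python sketch below computes it by comparing every positive/negative pair. The function name `pairwise_auc` and the toy labels and scores are assumptions, and this is only meant to show what the number represents, not how Spark computes it internally.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: one of the four positive/negative pairs is ranked the wrong way round
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.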

# Hyper parameters

Up to this point we have developed a classification model using logistic regression. However, the behaviour of logistic regression depends on a number of parameters, and so far we have set only maxIter and left the rest at their defaults. Now, let's try to tune the hyperparameters and see whether it makes any difference.

In [None]:
# If you are unsure which parameters to tune, use "print(lr.explainParams())" to list the parameters available for tuning
print(lr.explainParams())

# List of tunable parameters in LR

1) aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)

2) elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)

3) family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)

4) featuresCol: features column name. (default: features, current: Scaled_features)

5) fitIntercept: whether to fit an intercept term. (default: True)

6) labelCol: label column name. (default: label, current: Income)

7) maxIter: max number of iterations (>= 0). (default: 100, current: 10)

8) predictionCol: prediction column name. (default: prediction)

9) probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)

10) rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)

11) regParam: regularization parameter (>= 0). (default: 0.0)

12) standardization: whether to standardize the training features before fitting the model. (default: True)

13) threshold: Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match, e.g. if threshold is p, then thresholds must be equal to [1-p, p]. (default: 0.5)

14) thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. (undefined)

15) tol: the convergence tolerance for iterative algorithms (>= 0). (default: 1e-06)

16) weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (current: classWeights)
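
To make regParam and elasticNetParam concrete, the sketch below evaluates the elastic-net penalty in the form given in the Spark ML documentation: regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2), where alpha is elasticNetParam. The function name `elastic_net_penalty` and the coefficient vector are assumptions made for illustration.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of the L1 norm and half the squared L2 norm."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    l1_part = elastic_net_param * l1_norm
    l2_part = (1.0 - elastic_net_param) / 2.0 * l2_norm_squared
    return reg_param * (l1_part + l2_part)


# Hypothetical coefficient vector: L1 norm is 7, squared L2 norm is 25
weights = [3.0, -4.0]

pure_l2 = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=0.0)
pure_l1 = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=1.0)
mixed = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=0.5)

print(pure_l2, pure_l1, mixed)
```

With regParam at its default of 0.0 the penalty vanishes regardless of elasticNetParam, so the mixing parameter only matters once regParam is positive.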


Now let us tune some of these parameters and observe their effect on the performance of the algorithm.

For the purpose of hyperparameter tuning we will consider the following parameters:

1) aggregationDepth [2, 5, 10]

2) elasticNetParam [0.0, 0.5, 1.0]

3) fitIntercept [True / False]

4) maxIter [10, 100, 1000]

5) regParam [0.01, 0.5, 2.0]

First of all, let us define a parameter grid as follows:

In [29]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = ParamGridBuilder()\
    .addGrid(lr.aggregationDepth,[2,5,10])\
    .addGrid(lr.elasticNetParam,[0.0, 0.5, 1.0])\
    .addGrid(lr.fitIntercept,[False, True])\
    .addGrid(lr.maxIter,[10, 100, 1000])\
    .addGrid(lr.regParam,[0.01, 0.5, 2.0]) \
    .build()

# https://spark.apache.org/docs/2.1.0/ml-tuning.html
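
Before launching the search it helps to count how much work the grid implies. The sketch below enumerates the same value lists with itertools.product; with the 5-fold cross-validation used in the next step, each combination is fitted once per fold. The variable names are assumptions for illustration.

```python
from itertools import product

# Same candidate values as in the parameter grid above
grid_values = {
    "aggregationDepth": [2, 5, 10],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "fitIntercept": [False, True],
    "maxIter": [10, 100, 1000],
    "regParam": [0.01, 0.5, 2.0],
}

# Every combination of one value per parameter
combinations = list(product(*grid_values.values()))

# Each combination is trained once per fold during cross-validation
num_folds = 5
fits_during_cv = len(combinations) * num_folds

print(len(combinations), fits_during_cv)
```

That is 162 combinations and 810 model fits during cross-validation, which goes a long way towards explaining a long run time. Trimming the grid, for example by fixing parameters that are unlikely to change the fitted model, would shrink the search considerably.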

# K-fold cross validation

In [30]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(train)
# This will likely take a fair amount of time because of the number of models that we are creating and testing
predict_train=cvModel.transform(train)
predict_test=cvModel.transform(test)
print("The area under ROC for train set after CV  is {}".format(evaluator.evaluate(predict_train)))
print("The area under ROC for test set after CV  is {}".format(evaluator.evaluate(predict_test)))

KeyboardInterrupt: 

In [31]:
print((raw_data.count(), len(raw_data.columns)))

(32561, 17)
