# Classification in PySpark's MLlib

PySpark offers a good variety of algorithms for classification problems. However, because PySpark operates on distributed dataframes, we cannot use popular Python libraries like scikit-learn directly for our machine learning applications, which means we need to use PySpark's MLlib package for these tasks. Luckily, MLlib offers a pretty good variety of algorithms! In this notebook we will go over how to prep our data and how to train and test the classification algorithms PySpark offers.

## Defining Classification

As we went over in the concept review lecture, classification is a supervised machine learning task where we want to automatically sort our data into a set of pre-defined categories. Examples include sorting objects like flowers into species or labeling images as cat, dog, fish, etc. To do this, we need training data and a pre-defined dependent variable: the column in your dataset that holds the categories you want to predict.

## Algorithms Available

PySpark offers the following algorithms for classification. 

1. Logistic Regression 
2. Naive Bayes
3. One Vs Rest
4. Linear Support Vector Machine (LinearSVC)
5. Random Forest Classifier
6. GBT Classifier
7. Decision Tree Classifier
8. Multilayer Perceptron Classifier (Neural Network)

In [1]:
# First let's create our PySpark instance
# import findspark
# findspark.init()

import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
# May take awhile locally
spark = SparkSession.builder.appName("Class").getOrCreate()

# Note: this counts executors (including the driver), not CPU cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")
spark
# Click the hyperlinked "Spark UI" link to view details about your Spark session

You are working with 1 core(s)


In [2]:
# Read in functions we will need
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import * 
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler

## Let's read our dataset in for this notebook 

### Data Set Name: Autistic Spectrum Disorder Screening Data for Adult

Autistic Spectrum Disorder (ASD) is a neurodevelopmental condition associated with significant healthcare costs, and early diagnosis can significantly reduce these.

**The Problem**
Unfortunately, waiting times for an ASD diagnosis are lengthy and procedures are not cost effective. The economic impact of autism and the increase in the number of ASD cases across the world reveal an urgent need for easily implemented and effective screening methods. A time-efficient and accessible ASD screening is therefore urgently needed to help health professionals and to inform individuals whether they should pursue a formal clinical diagnosis.

**About the data**
This dataset contains 20 features related to the classification of ASD cases. It records ten behavioural features (AQ-10-Adult) plus ten individual characteristics that have proved effective in distinguishing ASD cases from controls in behaviour science.

### Source: 
https://www.kaggle.com/faizunnabi/autism-screening

In [3]:
path ="Datasets/autism-screening-for-toddlers/"
df = spark.read.csv(path+'Toddler Autism dataset July 2018.csv',inferSchema=True,header=True)

### Check out the dataset

In [4]:
df.limit(6).toPandas()

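|   | Case_No | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | Age_Mons | Qchat-10-Score | Sex | Ethnicity | Jaundice | Family_mem_with_ASD | Who completed the test | Class/ASD Traits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 28 | 3 | f | middle eastern | yes | no | family member | No |
| 1 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 36 | 4 | m | White European | yes | no | family member | Yes |
| 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 36 | 4 | m | middle eastern | yes | no | family member | Yes |
| 3 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 24 | 10 | m | Hispanic | no | no | family member | Yes |
| 4 | 5 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 20 | 9 | f | White European | no | yes | family member | Yes |
| 5 | 6 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 21 | 8 | m | black | no | no | family member | Yes |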


In [5]:
df.printSchema()

root
 |-- Case_No: integer (nullable = true)
 |-- A1: integer (nullable = true)
 |-- A2: integer (nullable = true)
 |-- A3: integer (nullable = true)
 |-- A4: integer (nullable = true)
 |-- A5: integer (nullable = true)
 |-- A6: integer (nullable = true)
 |-- A7: integer (nullable = true)
 |-- A8: integer (nullable = true)
 |-- A9: integer (nullable = true)
 |-- A10: integer (nullable = true)
 |-- Age_Mons: integer (nullable = true)
 |-- Qchat-10-Score: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Ethnicity: string (nullable = true)
 |-- Jaundice: string (nullable = true)
 |-- Family_mem_with_ASD: string (nullable = true)
 |-- Who completed the test: string (nullable = true)
 |-- Class/ASD Traits : string (nullable = true)



### How many classes do we have?

It's important to check for class imbalance in your dependent variable for classification tasks. If some classes are extremely under- or over-represented, your model can become biased toward the majority class, and the accuracy of its predictions might suffer as a result.

If you see class imbalance, one common way to correct it is to bootstrap or resample your dataframe.

In [6]:
df.groupBy("Class/ASD Traits ").count().show(100)

+-----------------+-----+
|Class/ASD Traits |count|
+-----------------+-----+
|               No|  326|
|              Yes|  728|
+-----------------+-----+
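
With 728 `Yes` rows against 326 `No` rows, the classes are imbalanced at roughly 2.2 to 1. One way to resample, sketched below in plain Python, is to downsample the majority class: compute a per-class keep fraction so that every class ends up with about as many rows as the smallest one. The helper name `downsample_fractions` is hypothetical; the resulting dictionary has the shape expected by the `fractions` argument of PySpark's `DataFrame.sampleBy`.

```python
# Class counts copied from the groupBy output above
class_counts = {"No": 326, "Yes": 728}

def downsample_fractions(counts):
    """Return the fraction of each class to keep so that all classes
    shrink to roughly the size of the smallest one (hypothetical helper)."""
    smallest = min(counts.values())
    return {label: smallest / n for label, n in counts.items()}

fractions = downsample_fractions(class_counts)
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

print(fractions)
print(round(imbalance_ratio, 2))
```

In the notebook, these fractions could be passed to something like `df.sampleBy("Class/ASD Traits ", fractions, seed=...)`. Because the sampling is random, the resulting class sizes would only be approximately equal.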



## Format Data 

MLlib requires all input columns of your dataframe to be vectorized. You will see that we rename our dependent variable to `label`, as that is the default column name MLlib's estimators expect. If we rename it once here, we never have to do it again!

For more methods on transformations visit: https://spark.apache.org/docs/latest/ml-features

In [6]:
input_columns = df.columns

In [8]:
input_columns = input_columns[1:-1]
input_columns

['A1',
 'A2',
 'A3',
 'A4',
 'A5',
 'A6',
 'A7',
 'A8',
 'A9',
 'A10',
 'Age_Mons',
 'Qchat-10-Score',
 'Sex',
 'Ethnicity',
 'Jaundice',
 'Family_mem_with_ASD',
 'Who completed the test']

In [10]:
dependent_var = 'Class/ASD Traits '

In [11]:
renamed = df.withColumn('label_str',df[dependent_var].cast(StringType()))
indexer = StringIndexer(inputCol="label_str",outputCol="label")
indexed = indexer.fit(renamed).transform(renamed)
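
By default, `StringIndexer` assigns indices by descending frequency, so the most common class becomes `0.0`. The plain-Python sketch below mimics that ordering for the class counts reported earlier. The function `frequency_desc_index` is a hypothetical stand-in for illustration, not part of PySpark, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct string to a float index, most frequent first.
    Ties are broken alphabetically (an assumption for this sketch)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Rebuild the label column from the counts reported earlier
labels = ["No"] * 326 + ["Yes"] * 728
mapping = frequency_desc_index(labels)
print(mapping)  # {'Yes': 0.0, 'No': 1.0}
```

This is why `Yes` shows up as `0.0` and `No` as `1.0` in the `label` column later in the notebook.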

In [13]:
numeric_inputs = []
string_inputs = []
for column in input_columns:
    # isinstance works across Spark versions, whereas the string form of the type varies
    if isinstance(indexed.schema[column].dataType, StringType):
        indexer = StringIndexer(inputCol=column, outputCol=column+'_num')
        indexed = indexer.fit(indexed).transform(indexed)
        new_col_name = column+'_num'
        string_inputs.append(new_col_name)
    else:
        numeric_inputs.append(column)

In [16]:
indexed.printSchema()

root
 |-- Case_No: integer (nullable = true)
 |-- A1: integer (nullable = true)
 |-- A2: integer (nullable = true)
 |-- A3: integer (nullable = true)
 |-- A4: integer (nullable = true)
 |-- A5: integer (nullable = true)
 |-- A6: integer (nullable = true)
 |-- A7: integer (nullable = true)
 |-- A8: integer (nullable = true)
 |-- A9: integer (nullable = true)
 |-- A10: integer (nullable = true)
 |-- Age_Mons: integer (nullable = true)
 |-- Qchat-10-Score: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Ethnicity: string (nullable = true)
 |-- Jaundice: string (nullable = true)
 |-- Family_mem_with_ASD: string (nullable = true)
 |-- Who completed the test: string (nullable = true)
 |-- Class/ASD Traits : string (nullable = true)
 |-- label_str: string (nullable = true)
 |-- label: double (nullable = false)
 |-- Sex_num: double (nullable = false)
 |-- Ethnicity_num: double (nullable = false)
 |-- Jaundice_num: double (nullable = false)
 |-- Family_mem_with_ASD_num: double (nullable = false)
 |-- Who completed the test_num: double (nullable = false)

In [18]:
# Approximate 1st and 99th percentiles of each numeric column, used as caps for outliers
d = {}
for c in numeric_inputs:
    d[c] = indexed.approxQuantile(c, [0.01, 0.99], 0.25)

# Cap outliers, then log-transform positively skewed columns
# and exponentiate negatively skewed ones.
# The loop variable is named c so it does not shadow pyspark.sql.functions.col.
for c in numeric_inputs:
    skew = indexed.agg(skewness(indexed[c])).collect()[0][0]
    capped = when(indexed[c] < d[c][0], d[c][0]) \
        .when(indexed[c] > d[c][1], d[c][1]) \
        .otherwise(indexed[c])
    if skew > 1:
        indexed = indexed.withColumn(c, log(capped + 1))
        print(c, "has been treated for pos skew", skew)
    elif skew < -1:
        indexed = indexed.withColumn(c, exp(capped))
        print(c, "has been treated for neg skew", skew)
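
The cell above caps each numeric column at its approximate 1st and 99th percentiles, then applies a log transform when skewness exceeds 1 (or an exponential when it falls below -1). To make the arithmetic concrete, here is a plain-Python sketch of the positive-skew branch. It assumes the population definition of skewness, and the sample values and caps are made up for illustration.

```python
import math

def skewness(xs):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def cap_and_log(xs, lower, upper):
    """Clamp each value to [lower, upper], then apply log(x + 1)."""
    return [math.log(min(max(x, lower), upper) + 1) for x in xs]

# Hypothetical right-skewed column with one extreme value
values = [1, 1, 2, 2, 3, 3, 4, 100]
treated = cap_and_log(values, lower=1, upper=50)

print(skewness(values))   # strongly positive before treatment
print(skewness(treated))  # smaller after capping and log
```

The treatment reduces skew without necessarily removing it, so it can be worth re-checking the skewness of a column after transforming it.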

In [23]:
# Find the smallest value across all numeric columns (e.g. to check for negative values)
minimums = df.select([min(c).alias(c) for c in df.columns if c in numeric_inputs])
min_array = minimums.select(array(numeric_inputs).alias("mins"))
df_minimum = min_array.select(array_min(min_array.mins)).collect()
df_minimum = df_minimum[0][0]

In [24]:
df_minimum

0

In [25]:
features_list = numeric_inputs + string_inputs
assembler = VectorAssembler(inputCols=features_list,outputCol='features')
output = assembler.transform(indexed).select('features','label')

In [26]:
output.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(17,[6,7,9,10,11,...|  1.0|
|(17,[0,1,5,6,10,1...|  0.0|
|(17,[0,6,7,9,10,1...|  0.0|
|[1.0,1.0,1.0,1.0,...|  0.0|
|[1.0,1.0,0.0,1.0,...|  0.0|
|[1.0,1.0,0.0,0.0,...|  0.0|
|(17,[0,3,4,5,8,10...|  0.0|
|(17,[1,4,6,7,8,9,...|  0.0|
|(17,[6,9,10,11,13...|  1.0|
|[1.0,1.0,1.0,0.0,...|  0.0|
|[1.0,0.0,0.0,1.0,...|  0.0|
|[1.0,1.0,1.0,1.0,...|  0.0|
|(17,[10,12,13,14]...|  1.0|
|[1.0,1.0,1.0,1.0,...|  0.0|
|(17,[10,13],[18.0...|  1.0|
|(17,[0,1,2,4,6,7,...|  0.0|
|(17,[10,13,15],[3...|  1.0|
|[1.0,1.0,1.0,0.0,...|  0.0|
|(17,[0,4,9,10,11,...|  1.0|
|[1.0,1.0,1.0,0.0,...|  0.0|
+--------------------+-----+
only showing top 20 rows
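
The `features` column mixes two display formats. Rows such as `(17,[6,7,9,...],[...])` are sparse vectors, written as size, non-zero indices, and the matching values; rows such as `[1.0,1.0,...]` are dense vectors listing every entry. `VectorAssembler` stores each row in whichever form is more compact. The plain-Python sketch below expands a sparse triple into its dense form. `sparse_to_dense` is a hypothetical helper, and the example values are made up because the output above truncates the real ones.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a full list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense

# Hypothetical 5-element vector with non-zeros at positions 1 and 3
dense = sparse_to_dense(5, [1, 3], [1.0, 28.0])
print(dense)  # [0.0, 1.0, 0.0, 28.0, 0.0]
```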



In [27]:
scaler = MinMaxScaler(inputCol="features",outputCol="scaledFeatures",min=0,max=1000)
scalerModel = scaler.fit(output)
scaled_data = scalerModel.transform(output)
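
`MinMaxScaler` rescales each feature independently to the range `[min, max]`, here 0 to 1000. A minimal plain-Python sketch of the documented formula, applied to one made-up column of ages in months (`min_max_scale` is a hypothetical helper, not a PySpark function):

```python
def min_max_scale(column, new_min=0.0, new_max=1000.0):
    """Rescale values linearly so the smallest maps to new_min and
    the largest to new_max. A constant column maps to the midpoint,
    following the behavior described in the MinMaxScaler docs."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.5 * (new_max + new_min)] * len(column)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min
            for x in column]

# Hypothetical ages in months
ages = [12, 18, 24, 36]
print(min_max_scale(ages))  # [0.0, 250.0, 500.0, 1000.0]
```

The same formula explains why the binary A1 to A10 answers appear as either 0.0 or 1000.0 in `scaledFeatures`.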

In [28]:
scaled_data.show()

+--------------------+-----+--------------------+
|            features|label|      scaledFeatures|
+--------------------+-----+--------------------+
|(17,[6,7,9,10,11,...|  1.0|[0.0,0.0,0.0,0.0,...|
|(17,[0,1,5,6,10,1...|  0.0|[1000.0,1000.0,0....|
|(17,[0,6,7,9,10,1...|  0.0|[1000.0,0.0,0.0,0...|
|[1.0,1.0,1.0,1.0,...|  0.0|[1000.0,1000.0,10...|
|[1.0,1.0,0.0,1.0,...|  0.0|[1000.0,1000.0,0....|
|[1.0,1.0,0.0,0.0,...|  0.0|[1000.0,1000.0,0....|
|(17,[0,3,4,5,8,10...|  0.0|[1000.0,0.0,0.0,1...|
|(17,[1,4,6,7,8,9,...|  0.0|[0.0,1000.0,0.0,0...|
|(17,[6,9,10,11,13...|  1.0|[0.0,0.0,0.0,0.0,...|
|[1.0,1.0,1.0,0.0,...|  0.0|[1000.0,1000.0,10...|
|[1.0,0.0,0.0,1.0,...|  0.0|[1000.0,0.0,0.0,1...|
|[1.0,1.0,1.0,1.0,...|  0.0|[1000.0,1000.0,10...|
|(17,[10,12,13,14]...|  1.0|[0.0,0.0,0.0,0.0,...|
|[1.0,1.0,1.0,1.0,...|  0.0|[1000.0,1000.0,10...|
|(17,[10,13],[18.0...|  1.0|[0.0,0.0,0.0,0.0,...|
|(17,[0,1,2,4,6,7,...|  0.0|[1000.0,1000.0,10...|
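|(17,[10,13,15],[3...|  1.0|[0.0,0.0,0.0,0.0,...|
|[1.0,1.0,1.0,0.0,...|  0.0|[1000.0,1000.0,10...|
|(17,[0,4,9,10,11,...|  1.0|[1000.0,0.0,0.0,0...|
|[1.0,1.0,1.0,0.0,...|  0.0|[1000.0,1000.0,10...|
+--------------------+-----+--------------------+
only showing top 20 rows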


In [29]:
final_data = scaled_data.select('label','scaledFeatures')
final_data = final_data.withColumnRenamed("scaledFeatures",'features')

In [30]:
final_data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|[0.0,0.0,0.0,0.0,...|
|  0.0|[1000.0,1000.0,0....|
|  0.0|[1000.0,0.0,0.0,0...|
|  0.0|[1000.0,1000.0,10...|
|  0.0|[1000.0,1000.0,0....|
|  0.0|[1000.0,1000.0,0....|
|  0.0|[1000.0,0.0,0.0,1...|
|  0.0|[0.0,1000.0,0.0,0...|
|  1.0|[0.0,0.0,0.0,0.0,...|
|  0.0|[1000.0,1000.0,10...|
|  0.0|[1000.0,0.0,0.0,1...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|[0.0,0.0,0.0,0.0,...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|[0.0,0.0,0.0,0.0,...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|[0.0,0.0,0.0,0.0,...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|[1000.0,0.0,0.0,0...|
|  0.0|[1000.0,1000.0,10...|
+-----+--------------------+
only showing top 20 rows



In [31]:
train,test = final_data.randomSplit([0.7,0.3])

In [32]:
train.count()

733

In [33]:
test.count()

321
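
The split sizes (733 and 321) are close to, but not exactly, 70% and 30% of the 1,054 rows because `randomSplit` assigns rows at random rather than cutting at a fixed position. The plain-Python sketch below imitates that idea with an independent draw per row; it is an illustration, not Spark's actual sampling code. Fixing a seed makes the split repeatable, and `randomSplit` accepts an optional `seed` argument for the same purpose.

```python
import random

def random_split(rows, train_weight=0.7, seed=None):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1054))  # stand-in for the 1,054 rows in final_data
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))  # sizes vary with the seed
print(train_a == train_b)         # same seed, same split
```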

In [34]:
# Read in dependencies
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.sql.functions import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder