# Classification with MLlib and PySpark

*This reference material takes a closer look at classification with MLlib. The overall process of training, testing and evaluating is almost identical to scikit-learn; the main difference is data preparation, which is handled with PySpark.*

**Variable pre-processing:**

- `StringIndexer` (better for ordinal variables)
- `IndexToString` -> transforms indices back to the original string labels
- `OneHotEncoderEstimator` (named `OneHotEncoder` since Spark 3.0) is better for nominal variables; scikit-learn has an equivalent

**Classifier in PySpark:**

- LogisticRegression
- Decision Tree
- Ensemble Methods (Random Forest - bagging, Gradient-Boosting)
- Boosting learns from previous mistakes -> models are built sequentially
- Bagging trains models in parallel on random samples of the data and then combines their results (e.g. by majority vote)
- Multilayer Perceptron Classifier (MLP)
- Linear SVM
- One vs Rest classifier (each class is compared against all the remaining classes taken together)
- Naive Bayes Classifier

## Data Formatting and Transformations

In [2]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Class").getOrCreate()

In [3]:
spark

<div class="alert alert-block alert-warning"><b>Basic Imports and data load</b></div>

In [4]:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler

**Autistic Spectrum Disorder Screening Data for Adult**

Autistic Spectrum Disorder (ASD) is a neurodevelopment condition associated with significant healthcare costs, and early diagnosis can significantly reduce these. Unfortunately, waiting times for an ASD diagnosis are lengthy and procedures are not cost effective. The economic impact of autism and the increase in the number of ASD cases across the world reveals an urgent need for the development of easily implemented and effective screening methods. Therefore, a time-efficient and accessible ASD screening is imminent to help health professionals and inform individuals whether they should pursue formal clinical diagnosis. The rapid growth in the number of ASD cases worldwide necessitates datasets related to behaviour traits. However, such datasets are rare making it difficult to perform thorough analyses to improve the efficiency, sensitivity, specificity and predictive accuracy of the ASD screening process. Presently, very limited autism datasets associated with clinical or screening are available and most of them are genetic in nature. Hence, we propose a new dataset related to autism screening of adults that contained 20 features to be utilised for further analysis especially in determining influential autistic traits and improving the classification of ASD cases. In this dataset, we record ten behavioural features (AQ-10-Adult) plus ten individuals characteristics that have proved to be effective in detecting the ASD cases from controls in behaviour science.


**Source:** https://www.kaggle.com/faizunnabi/autism-screening

In [6]:
df = spark.read.csv("Datasets/Toddler Autism dataset July 2018.csv", inferSchema=True, header=True)

In [7]:
df.limit(5).toPandas()

Unnamed: 0,Case_No,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Age_Mons,Qchat-10-Score,Sex,Ethnicity,Jaundice,Family_mem_with_ASD,Who completed the test,Class/ASD Traits
0,1,0,0,0,0,0,0,1,1,0,1,28,3,f,middle eastern,yes,no,family member,No
1,2,1,1,0,0,0,1,1,0,0,0,36,4,m,White European,yes,no,family member,Yes
2,3,1,0,0,0,0,0,1,1,0,1,36,4,m,middle eastern,yes,no,family member,Yes
3,4,1,1,1,1,1,1,1,1,1,1,24,10,m,Hispanic,no,no,family member,Yes
4,5,1,1,0,1,1,1,1,1,1,1,20,9,f,White European,no,yes,family member,Yes


In [8]:
df.printSchema()

root
 |-- Case_No: integer (nullable = true)
 |-- A1: integer (nullable = true)
 |-- A2: integer (nullable = true)
 |-- A3: integer (nullable = true)
 |-- A4: integer (nullable = true)
 |-- A5: integer (nullable = true)
 |-- A6: integer (nullable = true)
 |-- A7: integer (nullable = true)
 |-- A8: integer (nullable = true)
 |-- A9: integer (nullable = true)
 |-- A10: integer (nullable = true)
 |-- Age_Mons: integer (nullable = true)
 |-- Qchat-10-Score: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Ethnicity: string (nullable = true)
 |-- Jaundice: string (nullable = true)
 |-- Family_mem_with_ASD: string (nullable = true)
 |-- Who completed the test: string (nullable = true)
 |-- Class/ASD Traits : string (nullable = true)



<div class="alert alert-block alert-warning"><b>Data preparation</b></div>

In [12]:
df.groupBy("Class/ASD Traits ").count().show()
#check whether the label classes are imbalanced

+-----------------+-----+
|Class/ASD Traits |count|
+-----------------+-----+
|               No|  326|
|              Yes|  728|
+-----------------+-----+



**First we drop the label and the id column from the list of inputs.**

In [28]:
input_columns = df.columns
input_columns = input_columns[1:-1] #drop the first (Case_No) and the last (label) column

In [29]:
print(input_columns)

['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'Age_Mons', 'Qchat-10-Score', 'Sex', 'Ethnicity', 'Jaundice', 'Family_mem_with_ASD', 'Who completed the test']


In [30]:
label_var = "Class/ASD Traits "

In [17]:
renamed = df.withColumn("label_str", df[label_var].cast(StringType())) #cast the label column to string
indexer = StringIndexer(inputCol="label_str", outputCol="label") #string indexing
indexed = indexer.fit(renamed).transform(renamed) #fit and transform - only for our label
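
As an aside, the label encoding can be sketched in plain Python. By default `StringIndexer` orders labels by descending frequency, so with the class counts shown earlier the majority class `Yes` becomes `0.0` and `No` becomes `1.0`. The helper names `fit_string_index` and `index_to_string` are hypothetical and only mimic what `StringIndexer` and `IndexToString` do; this is not Spark code.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def index_to_string(mapping):
    """Invert the mapping (index -> original string), the job IndexToString does."""
    return {index: label for label, index in mapping.items()}

# Class counts taken from the groupBy output earlier in this section
labels = ["No"] * 326 + ["Yes"] * 728
mapping = fit_string_index(labels)
inverse = index_to_string(mapping)

print(mapping)
print(inverse[1.0])
```

This is worth keeping in mind when reading the models later: label `1.0` stands for `No`, not for `Yes`.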

**Now dealing with other columns.**

In [31]:
numeric_inputs = []
string_inputs = []

for column in input_columns: #iterate through our columns
    if isinstance(indexed.schema[column].dataType, StringType): #string columns need to be indexed
        indexer = StringIndexer(inputCol=column, outputCol=column+"_num") #index into a new column with the _num suffix
        indexed = indexer.fit(indexed).transform(indexed) #fitting and transforming
        new_col_name = column+"_num"
        string_inputs.append(new_col_name) #appending to our string list
    else:
        numeric_inputs.append(column) #appending to our numeric list if column is numeric already

In [35]:
indexed.limit(5).toPandas()

Unnamed: 0,Case_No,A1,A2,A3,A4,A5,A6,A7,A8,A9,...,Family_mem_with_ASD,Who completed the test,Class/ASD Traits,label_str,label,Sex_num,Ethnicity_num,Jaundice_num,Family_mem_with_ASD_num,Who completed the test_num
0,1,0,0,0,0,0,0,1,1,0,...,no,family member,No,No,1.0,1.0,2.0,1.0,0.0,0.0
1,2,1,1,0,0,0,1,1,0,0,...,no,family member,Yes,Yes,0.0,0.0,0.0,1.0,0.0,0.0
2,3,1,0,0,0,0,0,1,1,0,...,no,family member,Yes,Yes,0.0,0.0,2.0,1.0,0.0,0.0
3,4,1,1,1,1,1,1,1,1,1,...,no,family member,Yes,Yes,0.0,0.0,5.0,0.0,0.0,0.0
4,5,1,1,0,1,1,1,1,1,1,...,yes,family member,Yes,Yes,0.0,1.0,0.0,0.0,1.0,0.0


So the string columns now appear twice: once in their original string form and once as numeric indices (with the `_num` suffix).

<div class="alert alert-block alert-danger"><b>Be aware:</b> this is only a tutorial; in a real project I would pay more attention to how nominal and ordinal variables are encoded.</div>

**Outlier treatment based on skewness (the tough one)**

In [44]:
d = {}

for col in numeric_inputs:
    d[col] = indexed.approxQuantile(col,[0.01,0.99],0.25)
    #approximate 1st and 99th percentile of every numeric column (relative error 0.25)
    
for col in numeric_inputs:
    skew = indexed.agg(skewness(indexed[col])).collect() #skewness is an aggregate function
    skew = skew[0][0] #first row and first column hold the single skewness value
    if skew > 1:
        #positive skew: cap values at both percentiles, then apply log(x+1)
        indexed = indexed.withColumn(col, log(when(indexed[col] < d[col][0], d[col][0])\
                                             .when(indexed[col] > d[col][1], d[col][1])\
                                             .otherwise(indexed[col]) + 1).alias(col))
        print(col,"has been treated for positive skew", skew)
    elif skew < -1:
        #negative skew: cap values at both percentiles, then apply exp(x)
        indexed = indexed.withColumn(col, exp(when(indexed[col] < d[col][0], d[col][0])\
                                             .when(indexed[col] > d[col][1], d[col][1])\
                                             .otherwise(indexed[col])).alias(col))
        print(col,"has been treated for negative skew", skew)
#a handy routine: it measures the skewness of each numeric column and treats strongly skewed ones
#the top and bottom tails are capped at the percentiles (not deleted) before the transform
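
To make the idea concrete, here is a minimal plain-Python sketch of the positive-skew branch on made-up numbers. It assumes the population definition of skewness (third central moment divided by variance to the power 1.5), which to my understanding is what Spark's `skewness` aggregate computes; the function names and the data are hypothetical.

```python
import math

def population_skewness(xs):
    """Third central moment divided by variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def cap_and_log(xs, lower, upper):
    """Clamp each value to [lower, upper], then apply log(x + 1)."""
    return [math.log(min(max(x, lower), upper) + 1) for x in xs]

values = [1, 1, 2, 2, 3, 3, 4, 100]   # made-up data with one extreme value
treated = cap_and_log(values, lower=1, upper=10)

print(population_skewness(values))    # strongly right-skewed
print(population_skewness(treated))   # noticeably closer to 0
```

Note that nothing is removed: the extreme value is pulled back to the cap and then compressed by the logarithm.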

**Negative values detector**

In [45]:
minimus = df.select([min(c).alias(c) for c in df.columns if c in numeric_inputs])
#one row holding the minimum of every numeric input column
min_array = minimus.select(array(numeric_inputs).alias("mins"))
#pack those minimums into a single array column
df_minimum = min_array.select(array_min(min_array.mins)).collect()
df_minimum = df_minimum[0][0]
#smallest value across all numeric inputs; a value below 0 means negative values are present

In [50]:
min_array.show()

+--------------------+
|                mins|
+--------------------+
|[0, 0, 0, 0, 0, 0...|
+--------------------+



**Creating our features - VectorAssembler**

In [59]:
feature_list = numeric_inputs + string_inputs #just a list of my features
#string inputs are also numerical already..

In [61]:
feature_list #something like that

['A1',
 'A2',
 'A3',
 'A4',
 'A5',
 'A6',
 'A7',
 'A8',
 'A9',
 'A10',
 'Age_Mons',
 'Qchat-10-Score',
 'Sex_num',
 'Ethnicity_num',
 'Jaundice_num',
 'Family_mem_with_ASD_num',
 'Who completed the test_num']

In [63]:
assembler = VectorAssembler(inputCols=feature_list, outputCol="features")
#VectorAssembler joins all feature columns into a single vector column
output = assembler.transform(indexed).select("features", "label")

In [72]:
output.show() #my data are now only in two columns

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(17,[6,7,9,10,11,...|  1.0|
|(17,[0,1,5,6,10,1...|  0.0|
|(17,[0,6,7,9,10,1...|  0.0|
|[1.0,1.0,1.0,1.0,...|  0.0|
|[1.0,1.0,0.0,1.0,...|  0.0|
|[1.0,1.0,0.0,0.0,...|  0.0|
|(17,[0,3,4,5,8,10...|  0.0|
|(17,[1,4,6,7,8,9,...|  0.0|
|(17,[6,9,10,11,13...|  1.0|
|[1.0,1.0,1.0,0.0,...|  0.0|
|[1.0,0.0,0.0,1.0,...|  0.0|
|[1.0,1.0,1.0,1.0,...|  0.0|
|(17,[10,12,13,14]...|  1.0|
|[1.0,1.0,1.0,1.0,...|  0.0|
|(17,[10,13],[18.0...|  1.0|
|(17,[0,1,2,4,6,7,...|  0.0|
|(17,[10,13,15],[3...|  1.0|
|[1.0,1.0,1.0,0.0,...|  0.0|
|(17,[0,4,9,10,11,...|  1.0|
|[1.0,1.0,1.0,0.0,...|  0.0|
+--------------------+-----+
only showing top 20 rows



**Scaling**

In [97]:
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures", min=0, max=1000)
#scale every feature to the range [0, 1000]; the lower bound of 0 keeps all values non-negative (needed later for Naive Bayes)
scalerModel = scaler.fit(output)
scaled_data = scalerModel.transform(output)
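
The rescaling itself is a simple linear map. The sketch below uses a hypothetical helper, `min_max_rescale`, and assumes that `Age_Mons` ranges from 12 to 36 months; that range is an assumption, not something read from the data, although it is consistent with the value 18.0 becoming 250 in the `scaled_data.show()` output.

```python
def min_max_rescale(x, col_min, col_max, new_min=0.0, new_max=1000.0):
    """Linearly map x from [col_min, col_max] to [new_min, new_max]."""
    if col_max == col_min:
        # Constant column: the MinMaxScaler docs describe mapping to the midpoint
        return 0.5 * (new_max + new_min)
    return (x - col_min) / (col_max - col_min) * (new_max - new_min) + new_min

# Assumed column range of 12..36 months for Age_Mons (hypothetical)
print(min_max_rescale(18, col_min=12, col_max=36))
```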

In [99]:
scaled_data.show()

+--------------------+-----+--------------------+
|            features|label|      scaledFeatures|
+--------------------+-----+--------------------+
|(17,[6,7,9,10,11,...|  1.0|(17,[6,7,9,10,11,...|
|(17,[0,1,5,6,10,1...|  0.0|(17,[0,1,5,6,10,1...|
|(17,[0,6,7,9,10,1...|  0.0|(17,[0,6,7,9,10,1...|
|[1.0,1.0,1.0,1.0,...|  0.0|[1000.0,1000.0,10...|
|[1.0,1.0,0.0,1.0,...|  0.0|[1000.0,1000.0,0....|
|[1.0,1.0,0.0,0.0,...|  0.0|[1000.0,1000.0,0....|
|(17,[0,3,4,5,8,10...|  0.0|(17,[0,3,4,5,8,10...|
|(17,[1,4,6,7,8,9,...|  0.0|(17,[1,4,6,7,8,9,...|
|(17,[6,9,10,11,13...|  1.0|(17,[6,9,10,11,13...|
|[1.0,1.0,1.0,0.0,...|  0.0|[1000.0,1000.0,10...|
|[1.0,0.0,0.0,1.0,...|  0.0|[1000.0,0.0,0.0,1...|
|[1.0,1.0,1.0,1.0,...|  0.0|[1000.0,1000.0,10...|
|(17,[10,12,13,14]...|  1.0|(17,[10,12,13,14]...|
|[1.0,1.0,1.0,1.0,...|  0.0|[1000.0,1000.0,10...|
|(17,[10,13],[18.0...|  1.0|(17,[10,13],[250....|
|(17,[0,1,2,4,6,7,...|  0.0|(17,[0,1,2,4,6,7,...|
|(17,[10,13,15],[3...|  1.0|(17,[10,13,15],[1...|


In [100]:
final_data = scaled_data.select("label","scaledFeatures")
final_data = final_data.withColumnRenamed("scaledFeatures","features")

In [101]:
final_data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(17,[6,7,9,10,11,...|
|  0.0|(17,[0,1,5,6,10,1...|
|  0.0|(17,[0,6,7,9,10,1...|
|  0.0|[1000.0,1000.0,10...|
|  0.0|[1000.0,1000.0,0....|
|  0.0|[1000.0,1000.0,0....|
|  0.0|(17,[0,3,4,5,8,10...|
|  0.0|(17,[1,4,6,7,8,9,...|
|  1.0|(17,[6,9,10,11,13...|
|  0.0|[1000.0,1000.0,10...|
|  0.0|[1000.0,0.0,0.0,1...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|(17,[10,12,13,14]...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|(17,[10,13],[250....|
|  0.0|(17,[0,1,2,4,6,7,...|
|  1.0|(17,[10,13,15],[1...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|(17,[0,4,9,10,11,...|
|  0.0|(17,[0,1,2,4,6,7,...|
+-----+--------------------+
only showing top 20 rows



Here the original `features` column is dropped and `scaledFeatures` is renamed to `features`.

**Splitting data**

In [102]:
train, test = final_data.randomSplit([0.7,0.3]) #70% and 30% random split

In [103]:
train.count(), test.count() #a DataFrame has no len(); use count() instead

(720, 334)
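
The split is roughly 68/32 rather than exactly 70/30 because rows are assigned at random. The plain-Python sketch below imitates that behaviour with a hypothetical `random_split` helper; it is not Spark's implementation. In PySpark itself, `randomSplit` accepts a `seed` argument, which makes the split reproducible.

```python
import random

def random_split(rows, train_fraction=0.7, seed=None):
    """Assign each row independently to train or test."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        (train_rows if rng.random() < train_fraction else test_rows).append(row)
    return train_rows, test_rows

rows = list(range(1054))   # 1054 = 326 + 728 rows in this dataset
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))   # close to, but usually not exactly, 70/30
print(train_a == train_b)          # same seed, same split
```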

## Classification

In this part I will go through several classification algorithms.

<div class="alert alert-block alert-info"><b>Note:</b> this part contains basic info for every PySpark classifier and how to use it. Some parts are just copy and paste, it's more a quick refresher of PySpark</div>

In [104]:
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.sql.functions import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

### Logistic Regression

<div class="alert alert-block alert-warning"><b>Training, testing and predicting</b></div>

In [118]:
MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") #for metric measurement

In [107]:
classifier = LogisticRegression() #my classifier

In [110]:
paramGrid = (ParamGridBuilder().addGrid(classifier.maxIter, [10, 15,20]).build())
#setting up my hyperparameters

In [111]:
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MC_evaluator,
                          numFolds=3) 
#my cross validation 
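
Cross-validation multiplies the training cost, which is why the grids in this notebook are kept small. A rough count, assuming one fit per parameter combination per fold plus one final refit of the best combination on the whole training set (the helper name is hypothetical):

```python
def count_cross_validation_fits(num_param_combinations, num_folds):
    """One fit per combination per fold, plus one final refit of the best combination."""
    return num_param_combinations * num_folds + 1

# Grid above: maxIter in [10, 15, 20] -> 3 combinations, with numFolds=3
print(count_cross_validation_fits(3, 3))
```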

In [112]:
fitModel = crossval.fit(train) #fitting my model

In [113]:
BestModel = fitModel.bestModel #looking for the best one

In [114]:
print("Intercept: " + str(BestModel.interceptVector))
print("Coefficients: \n" + str(BestModel.coefficientMatrix))
#printing results

Intercept: [53.328829302256686]
Coefficients: 
DenseMatrix([[-1.18950832e-02, -1.10041935e-02, -1.15087043e-02,
              -1.14871367e-02, -1.14903526e-02, -1.12813030e-02,
              -1.17502752e-02, -1.13838220e-02, -1.17504639e-02,
              -1.16707740e-02, -1.26690894e-03, -3.29165767e-02,
              -4.95346691e-05, -4.13318562e-04, -8.41729898e-04,
               2.81695294e-04, -8.28203670e-03]])


In [115]:
predictions = fitModel.transform(test) #now I am only using transform for my test

In [116]:
accuracy = (MC_evaluator.evaluate(predictions))*100 #and I am evaluating prediction

In [117]:
print(accuracy)

100.0


<div class="alert alert-block alert-warning"><b>Feature importance</b></div>

In [123]:
coeff_array = BestModel.coefficientMatrix.toArray() #coefficient to array from my best model

#convert the first (and only) row of the coefficient array into a list of Python floats
coeff_scores = []
for x in coeff_array[0]:
    coeff_scores.append(float(x))
    
result = spark.createDataFrame(zip(input_columns,coeff_scores), schema=['feature','coeff'])
result.show()

+--------------------+--------------------+
|             feature|               coeff|
+--------------------+--------------------+
|                  A1|-0.01189508322066376|
|                  A2|-0.01100419352374...|
|                  A3|-0.01150870434067...|
|                  A4|-0.01148713670676...|
|                  A5|-0.01149035258499...|
|                  A6|-0.01128130300534...|
|                  A7|-0.01175027522635...|
|                  A8|-0.01138382202587...|
|                  A9|-0.01175046394679...|
|                 A10|-0.01167077403238...|
|            Age_Mons|-0.00126690893723...|
|      Qchat-10-Score|-0.03291657674658315|
|                 Sex|-4.95346690839824...|
|           Ethnicity|-4.13318562236252...|
|            Jaundice|-8.41729897833716...|
| Family_mem_with_ASD|2.816952939039711E-4|
|Who completed the...|-0.00828203669800...|
+--------------------+--------------------+



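And that is probably all. This is only for educational purposes, so the perfect accuracy should not be taken at face value: we only have a small part of the original dataset, and the Decision Tree section of this notebook puts all of its feature importance on `Qchat-10-Score`, which suggests that the label is almost fully determined by that one feature. A good article on feature importance and coefficient interpretation is here: ***https://www.displayr.com/how-to-interpret-logistic-regression-coefficients/***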
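
As a small worked example of reading a coefficient: the features were scaled to [0, 1000], so each coefficient is expressed per scaled unit. The sketch below assumes the raw `Qchat-10-Score` spans 0 to 10 (an assumption), so one raw point corresponds to 100 scaled units. Since the label indexing shown earlier maps `No` to `1.0`, a negative coefficient lowers the odds of `No`.

```python
import math

coeff_scaled = -0.03291657674658315   # Qchat-10-Score coefficient printed above
scaled_units_per_point = 1000 / 10    # assumes the raw score spans 0..10 before scaling

# Change in the log-odds of label 1.0 ("No") for one extra raw score point
delta_log_odds = coeff_scaled * scaled_units_per_point
odds_ratio = math.exp(delta_log_odds)

print(round(delta_log_odds, 3))
print(round(odds_ratio, 3))
```

Under these assumptions each extra point on the questionnaire multiplies the odds of the `No` class by a factor well below 1, which matches the intuition that a higher score points towards ASD traits.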

### One vs. Rest

<div class="alert alert-block alert-warning"><b>Training, testing and predicting</b></div>

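With One-vs-Rest, each class is compared against all the remaining classes taken together, as opposed to against each other class individually. It is not a new algorithm, just a wrapper that trains one binary classifier (here logistic regression) per class.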
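
The prediction step of this reduction can be sketched in a few lines of plain Python. The scorers below are made up for a three-class toy problem and the helper name is hypothetical; in PySpark the per-class models are fitted by `OneVsRest` itself.

```python
def one_vs_rest_predict(scorers, x):
    """Return the class whose 'this class vs. the rest' scorer is most confident."""
    scores = {label: scorer(x) for label, scorer in scorers.items()}
    return max(scores, key=scores.get)

# Hypothetical binary scorers for a made-up three-class problem on one number x
scorers = {
    "low": lambda x: -x,
    "mid": lambda x: -abs(x - 5),
    "high": lambda x: x - 10,
}

print(one_vs_rest_predict(scorers, 1))
print(one_vs_rest_predict(scorers, 6))
print(one_vs_rest_predict(scorers, 12))
```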

In [126]:
lr = LogisticRegression() #creating another logistic regression classifier

In [127]:
classifier = OneVsRest(classifier=lr) #now creating classifier with one classifier within :)

In [131]:
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build() #just parameters

In [132]:
crossval = CrossValidator(estimator=classifier, #our estimator
                          estimatorParamMaps=paramGrid, #our parameters
                          evaluator=MulticlassClassificationEvaluator(), #our evaluator
                          numFolds=5)  #and number of folds

In [133]:
fitModel = crossval.fit(train) #training

In [134]:
BestModel = fitModel.bestModel #looking for best model

In [135]:
models = BestModel.models

In [142]:
for model in models:
    print(model.intercept, model.coefficients)
    
#just printing coefficients...

-7.393831841017095 [0.0014856949899890567,0.001688943775857319,0.0013360474103731505,0.001500978245841736,0.0017346188604542082,0.0015081235053492706,0.0016205494908386193,0.0017214085038386787,0.0018115385933320709,0.0014320087455013677,0.00036832670335998086,0.004529542675628303,-0.00035002916340941753,0.0006321027175696799,0.0005038737980106367,2.4629920556283505e-05,0.000713711595807801]
7.393831840565555 [-0.0014856949899168701,-0.0016889437758628824,-0.0013360474103949645,-0.0015009782458118468,-0.0017346188602188075,-0.00150812350531691,-0.0016205494911018564,-0.0017214085034702915,-0.0018115385930456763,-0.0014320087454208646,-0.00036832670332823163,-0.004529542675385013,0.000350029163575465,-0.0006321027176559,-0.0005038737980327789,-2.462992072611913e-05,-0.0007137115964014607]


In [143]:
predictions = fitModel.transform(test)#predicting

In [144]:
accuracy = (MC_evaluator.evaluate(predictions))*100

In [145]:
print(accuracy)

100.0


And again, the result is not important here.

### Multilayer Perceptron Classifier

*Also known as a neural network. Basic info:*

***Common Hyper Parameters:***

**MaxIter:** <br>
The maximum number of iterations to use. There is no clear formula for the optimal number of iterations, but you can find a reasonable value iteratively: start with a small number such as 100 and increase it linearly until the error on the test set stops decreasing (or even starts to increase). The link below describes this well:
https://www.quora.com/What-will-happen-if-I-train-my-neural-networks-with-too-much-iteration

**Layers:** <br>
Spark requires the input layer size to equal the number of features in the dataset and the output layer size to equal the number of classes. The hidden layers in between are flexible; one or two neurons more than the number of features is a reasonable starting point. Here's a great article to learn more about how to play around with the hidden layers: https://towardsdatascience.com/beginners-ask-how-many-hidden-layers-neurons-to-use-in-artificial-neural-networks-51466afa0d3e

**Block size:** <br>
Block size for stacking input data in matrices to speed up the computation. Data is stacked within partitions. If block size is more than remaining data in a partition then it is adjusted to the size of this data. Recommended size is between 10 and 1000. Default: 128

**Seed:** <br>
A random seed. Set this value if you need your results to be reproducible across repeated calls (highly recommended).

**Weights**: *printed for us below along with accuracy rate* <br> 
Each hidden neuron added will increase the number of weights, thus it is recommended to use the least number of hidden neurons that accomplish the task. Using more hidden neurons than required will add more complexity.

**PySpark Documentation link:** <br> 
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.MultilayerPerceptronClassifier

In [146]:
#First collect the number of features and the number of classes
features_count = len(final_data.select("features").first()[0]) #length of one feature vector

class_count = final_data.select(countDistinct("label")).collect() #counting the classes
classes = class_count[0][0]

layers = [features_count, features_count+1, features_count, classes]
#INPUT (FEATURES) - FIRST HIDDEN LAYER - SECOND HIDDEN LAYER - OUTPUT (CLASSES)
#Note: only the first and the last value are fixed by the data; hidden layer sizes are a free choice

classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
#our classifier

fitModel = classifier.fit(train) #training our data
print(fitModel.weights.size) #number of weights in the fitted network
predictions = fitModel.transform(test) #predicting
accuracy = (MC_evaluator.evaluate(predictions))*100 #accuracy
print("Accuracy: ",accuracy) 

683
Accuracy:  90.71856287425149
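
The printed weight count of 683 can be checked by hand, assuming a fully connected network with one bias term per neuron. The helper name `mlp_weight_count` is hypothetical.

```python
def mlp_weight_count(layers):
    """Each layer transition has (inputs + 1 bias) * outputs weights."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

layers = [17, 18, 17, 2]   # 17 features, two hidden layers, 2 classes
print(mlp_weight_count(layers))
```

This also illustrates the remark about hidden neurons: widening a hidden layer by one neuron adds weights on both sides of that layer.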


### Naive Bayes

***Basic Info***

**Assumptions:**
 - Independence between every pair of features
 - Feature values are nonnegative (which is why we checked earlier)

**Hyper Parameters:**

 - **smoothing** = A frequency-based probability of zero is problematic because it wipes out all the information in the other probabilities. A solution is Laplace smoothing, a technique for smoothing categorical data. In PySpark this number needs to be >= 0; the default is 1.0. Here is a great article that defines smoothing in more detail: https://medium.com/syncedreview/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf
 - **thresholds** = Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold. The default value is none. 
 - **weightCol** = If you have a weight column you would enter the name of the column here. If this is not set or empty, we treat all instance weights as 1.0. To learn more about the theory behind this, here is a good paper: http://pami.uwaterloo.ca/~khoury/ece457f07/Zhang2004.pdf
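
A minimal sketch of additive (Laplace) smoothing, using made-up counts and a hypothetical helper name. It only illustrates why a smoothing value above 0 avoids zero probabilities; it is not how Spark implements Naive Bayes internally.

```python
def smoothed_probability(count, total, num_values, smoothing=1.0):
    """Additive (Laplace) smoothing of a frequency-based probability estimate."""
    return (count + smoothing) / (total + smoothing * num_values)

# Made-up counts: a feature value never observed together with this class
print(smoothed_probability(0, 10, 4, smoothing=0.0))   # zero wipes out the whole product
print(smoothed_probability(0, 10, 4, smoothing=1.0))   # small but non-zero
```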

In [147]:
#Still the same
classifier = NaiveBayes()
paramGrid = (ParamGridBuilder().addGrid(classifier.smoothing, [0.0, 0.3, 0.7, 0.8]).build())

crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=5)
fitModel = crossval.fit(train)
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Accuracy:  85.02994011976048


### Linear Support Vector Machine

***Basic info***

**Interpreting the coefficients:**

The sign of the decision function gives the predicted class: take the dot product of a point with the coefficient vector and add the intercept. If the result is positive, the point belongs to the positive class; if it is negative, it belongs to the negative class.

You can even learn something about the importance of each feature. Let's say the SVM finds only one feature useful for separating the data; then the hyperplane is orthogonal to that axis. So the absolute size of a coefficient relative to the other ones gives an indication of how important the feature was for the separation. 
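
Both points can be sketched in plain Python with a made-up hyperplane. The helper names are hypothetical, and the importance reading only makes sense when the features are on comparable scales, as they are here after MinMax scaling.

```python
def linear_svm_predict(coefficients, intercept, x):
    """Predict 1 if the point lies on the positive side of the hyperplane, else 0."""
    margin = sum(w * xi for w, xi in zip(coefficients, x)) + intercept
    return 1 if margin > 0 else 0

def relative_importance(coefficients):
    """Absolute coefficient sizes normalized to sum to 1 (a rough indicator only)."""
    total = sum(abs(w) for w in coefficients)
    return [abs(w) / total for w in coefficients]

# Made-up hyperplane that uses only the first of two features
coefficients, intercept = [2.0, 0.0], -1.0

print(linear_svm_predict(coefficients, intercept, [1.0, 5.0]))
print(linear_svm_predict(coefficients, intercept, [0.0, 5.0]))
print(relative_importance(coefficients))
```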

**Hyper Parameters:** <br>

**MaxIter:** <br>
The maximum number of iterations to use. There is no clear formula for the optimal number of iterations, but you can find a reasonable value iteratively: start with a small number such as 100 and increase it linearly until the error on the test set stops decreasing (or even starts to increase). The link below describes this well:
https://www.quora.com/What-will-happen-if-I-train-my-neural-networks-with-too-much-iteration

**regParam**: <br>
The purpose of the regularizer is to encourage simple models and avoid overfitting. To learn more about this concept, here is an interesting article: https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a

**PySpark Documentation link:** <br> https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LinearSVC

In [156]:
#Just a small "checker" 
class_count = final_data.select(countDistinct("label")).collect() #I am counting unique classes values
classes = class_count[0][0] #pulling only number
if classes > 2:
    print("LinearSVC cannot be used because PySpark currently only accepts "
          "binary classification data for this algorithm")

classifier = LinearSVC()
paramGrid = (ParamGridBuilder().addGrid(classifier.maxIter, [15, 50]).addGrid(classifier.regParam, [0.2, 0.01]).build())

crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=5)

#Fitting and predicting
fitModel = crossval.fit(train)
BestModel = fitModel.bestModel
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Accuracy:  100.0


### Decision Tree

***Basic Info***
**Common Hyper Parameters**

 - **maxBins** = Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.
     - **Continuous features:** For small datasets in single-machine implementations, the split candidates for each continuous feature are typically the unique values for the feature. Some implementations sort the feature values and then use the ordered unique values as split candidates for faster tree calculations.
         Sorting feature values is expensive for large distributed datasets. This implementation computes an approximate set of split candidates by performing a quantile calculation over a sampled fraction of the data. The ordered splits create “bins” and the maximum number of such bins can be specified using the maxBins parameter.
         Note that the number of bins cannot be greater than the number of instances N (a rare scenario since the default maxBins value is 32). The tree algorithm automatically reduces the number of bins if the condition is not satisfied.

     - **Categorical features:** For a categorical feature with M possible values (categories), one could come up with $2^{M-1}-1$ split candidates. For binary (0/1) classification and regression, we can reduce the number of split candidates to M−1 by ordering the categorical feature values by the average label. For example, for a binary classification problem with one categorical feature with three categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical features are ordered as A, C, B. The two split candidates are A | C, B and A, C | B where | denotes the split.
         In multiclass classification, all $2^{M-1}-1$ possible splits are used whenever possible. When $2^{M-1}-1$ is greater than the maxBins parameter, we use a (heuristic) method similar to the method used for binary classification and regression. The M categorical feature values are ordered by impurity, and the resulting M−1 split candidates are considered.
         
 - **maxDepth** = The maximum depth of the tree (depth 0 means a single leaf node; depth 1 means one internal node and two leaf nodes). In Spark the default is 5 and the supported maximum is 30. Deeper trees can fit the training data more closely, up to the point where every leaf is pure, i.e. all of the data in the leaf comes from the same class.
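
The split-candidate arithmetic for categorical features can be reproduced with the A, B, C example from the text. The function names are hypothetical; this is a sketch of the counting argument, not of Spark's tree code.

```python
def unordered_split_count(m):
    """Number of ways to split M categories into two non-empty groups."""
    return 2 ** (m - 1) - 1

def ordered_split_candidates(label_proportions):
    """Order categories by their proportion of label 1, then split between neighbours."""
    ordered = sorted(label_proportions, key=label_proportions.get)
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]

proportions = {"A": 0.2, "B": 0.6, "C": 0.4}   # example from the text above

print(unordered_split_count(len(proportions)))
for left, right in ordered_split_candidates(proportions):
    print(left, "|", right)
```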

### Feature Importance Scores
Scores add up to 1 across all variables, so the variable with the lowest score is the least important one.


### Extra Reading
**How to tune a decision tree** <br>
https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680

**PySpark Documentation link:** <br> https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier

In [157]:
classifier = DecisionTreeClassifier()
paramGrid = (ParamGridBuilder().addGrid(classifier.maxBins, [10, 20, 40, 80, 100]).build())
#note: only one parameter is tuned here to save computing time; in a real project I would tune more of them

crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=3) 


fitModel = crossval.fit(train)

# Feature importance - important
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)

predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print("Accuracy: ",accuracy)

Feature Importances:  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
Accuracy:  100.0


*Now a nicer-looking feature importance table.*

In [158]:
imp_scores = []
for x in featureImportances:
    imp_scores.append(float(x)) #float, not int: int() would truncate fractional importances to 0
    
# Then zip with input_columns list and create a df
result = spark.createDataFrame(zip(input_columns,imp_scores), schema=['feature','score'])
result.orderBy(result["score"].desc()).show(truncate=False) #show() prints the table itself and returns None

+----------------------+-----+
|feature               |score|
+----------------------+-----+
|Qchat-10-Score        |1.0  |
|A9                    |0.0  |
|Age_Mons              |0.0  |
|A10                   |0.0  |
|Family_mem_with_ASD   |0.0  |
|Sex                   |0.0  |
|Ethnicity             |0.0  |
|Jaundice              |0.0  |
|A3                    |0.0  |
|Who completed the test|0.0  |
|A5                    |0.0  |
|A6                    |0.0  |
|A7                    |0.0  |
|A8                    |0.0  |
|A1                    |0.0  |
|A2                    |0.0  |
|A4                    |0.0  |
+----------------------+-----+


<div class="alert alert-block alert-danger"><b>Be aware:</b> it is a good habit to run a tree at the beginning of a project and look at which features are more important than others, so that we can pay more attention to them in the analysis.</div>

### Random Forest

*Basic info is the same as for the Decision Tree. A random forest simply trains many trees on random subsets of the data and combines their votes.*

PySpark Documentation link: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier

In [159]:
#Still the same
classifier = RandomForestClassifier()
paramGrid = (ParamGridBuilder()
             .addGrid(classifier.maxDepth, [2, 5, 10])
#            .addGrid(classifier.maxBins, [5, 10, 20])
#            .addGrid(classifier.numTrees, [5, 20, 50])
             .build())

crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2)


fitModel = crossval.fit(train)
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print(" ")
print("Accuracy: ",accuracy)

Feature Importances:  [0.07954638 0.01880797 0.         0.02694263 0.18578089 0.13816786
 0.10158916 0.01076742 0.04391258 0.         0.         0.39394903
 0.         0.00053607 0.         0.         0.        ]
 
Accuracy:  100.0
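
The importance vector printed above is unlabelled, so it is hard to tell which feature each score belongs to. A minimal sketch of pairing the scores with names and sorting them follows. The helper `rank_importances` and the placeholder names `feature_0` to `feature_16` are assumptions for illustration only; in practice the names would be the list passed as `inputCols` to the `VectorAssembler`, in the same order.

```python
def rank_importances(importances, names=None):
    """Pair each importance score with a feature name, highest score first."""
    if names is None:
        # Placeholder names (hypothetical); replace with the assembler's input columns.
        names = [f"feature_{i}" for i in range(len(importances))]
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Scores copied from the Random Forest output above.
rf_importances = [0.07954638, 0.01880797, 0.0, 0.02694263, 0.18578089, 0.13816786,
                  0.10158916, 0.01076742, 0.04391258, 0.0, 0.0, 0.39394903,
                  0.0, 0.00053607, 0.0, 0.0, 0.0]

ranked = rank_importances(rf_importances)
for name, score in ranked[:5]:
    print(f"{name:<12}{score:.4f}")
```

With a fitted model, the array returned by `BestModel.featureImportances.toArray()` could be passed in place of the hard-coded list.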


### Gradient Boosted Tree Classifier

*Same basics as for the Decision Tree and Random Forest, but now we are boosting: each new tree learns from the mistakes of the previous ones!*

PySpark Documentation link: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.GBTClassifier

In [160]:
#Again and again :)
classifier = GBTClassifier()

paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
             .addGrid(classifier.maxIter, [10, 15,50,100])
             .build())

crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2) 


fitModel = crossval.fit(train)
BestModel = fitModel.bestModel
featureImportances = BestModel.featureImportances.toArray()
print("Feature Importances: ",featureImportances)   
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100
print(" ")
print("Accuracy: ",accuracy)

Feature Importances:  [0.00000000e+00 8.49404185e-17 2.16993702e-18 0.00000000e+00
 0.00000000e+00 8.47789349e-19 0.00000000e+00 0.00000000e+00
 0.00000000e+00 2.58373897e-17 0.00000000e+00 1.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 1.78815931e-15]
 
Accuracy:  100.0
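
To make the idea of boosting concrete, here is a toy sketch in plain Python. It is not Spark's `GBTClassifier` implementation: it assumes a single numeric feature, a squared-error loss and one-split "stumps", and the helper names `fit_stump` and `boost` are made up for this illustration. Each round fits a stump to the errors left by the previous rounds, which is the sequential part of boosting.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that best fits the residuals (squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every value except the largest is a candidate threshold
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, n_rounds=10, learning_rate=0.5):
    """Sequentially add stumps, each one trained on the current residuals."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean label
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, history


xs = list(range(10))
ys = [0.0] * 5 + [1.0] * 5  # label flips from 0 to 1 at x = 5
preds, history = boost(xs, ys)
print(history)  # mean squared error after each round
```

The error shrinks with every round because each stump only has to correct what the earlier ones got wrong. A random forest, by contrast, trains its trees independently of each other.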


<div class="alert alert-block alert-info"><b>Note:</b> I could iterate through all classifiers, but that is not the point here.</div>

## Tips & Tricks for reducing redundancy

<div class="alert alert-block alert-warning"><b>Data preparation function</b></div>

This is simply all of the data preparation steps, copied and rewritten as one function.

In [161]:
# Data Prep function
def MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=True,treat_neg_values=True):
    
    # change label (class variable) to string type to prep for reindexing
    # PySpark expects a zero-indexed numeric label column.
    # Just in case our data is not in that format, we treat it with the built-in StringIndexer
    renamed = df.withColumn("label_str", df[dependent_var].cast(StringType())) # Copy the label into a new string-typed column
    indexer = StringIndexer(inputCol="label_str", outputCol="label") # PySpark classifiers expect the label column to be named "label" by default
    indexed = indexer.fit(renamed).transform(renamed)

    # Convert all string type data in the input column list to numeric
    # Otherwise the Algorithm will not be able to process it
    numeric_inputs = []
    string_inputs = []
    for column in input_columns:
        if isinstance(indexed.schema[column].dataType, StringType): # type check instead of comparing the type's string form
            indexer = StringIndexer(inputCol=column, outputCol=column+"_num") 
            indexed = indexer.fit(indexed).transform(indexed)
            new_col_name = column+"_num"
            string_inputs.append(new_col_name)
        else:
            numeric_inputs.append(column)
            
    if treat_outliers == True:
        print("We are correcting for non normality now!")
        # empty dictionary d
        d = {}
        # Create a dictionary of quantiles
        for col in numeric_inputs: 
            d[col] = indexed.approxQuantile(col,[0.01,0.99],0.25) #if you want to make it go faster increase the last number
        #Now fill in the values
        for col in numeric_inputs:
            skew = indexed.agg(skewness(indexed[col])).collect() #check for skewness
            skew = skew[0][0]
            # Right skew: floor and cap at the quantiles, then apply log(x + 1) (the +1 guards against 0 values)
            if skew > 1:
                indexed = indexed.withColumn(col, \
                log(when(indexed[col] < d[col][0],d[col][0])\
                .when(indexed[col] > d[col][1], d[col][1])\
                .otherwise(indexed[col] ) +1).alias(col))
                print(col+" has been treated for positive (right) skewness. (skew =",skew,")")
            # Left skew: floor and cap at the quantiles, then apply exp
            elif skew < -1:
                indexed = indexed.withColumn(col, \
                exp(when(indexed[col] < d[col][0],d[col][0])\
                .when(indexed[col] > d[col][1], d[col][1])\
                .otherwise(indexed[col] )).alias(col))
                print(col+" has been treated for negative (left) skewness. (skew =",skew,")")

            
    # Warn if the dataframe contains negative values, because Naive Bayes cannot handle them.
    # Note: we only need to check the numeric input values since anything that is indexed won't have negative values
    minimums = df.select([min(c).alias(c) for c in df.columns if c in numeric_inputs]) # Calculate the minimum of each numeric input column
    min_array = minimums.select(array(numeric_inputs).alias("mins")) # Gather the per-column minimums into a single array column
    df_minimum = min_array.select(array_min(min_array.mins)).collect() # Collect global min as Python object
    df_minimum = df_minimum[0][0] # Slice to get the number itself

    features_list = numeric_inputs + string_inputs
    assembler = VectorAssembler(inputCols=features_list,outputCol='features')
    output = assembler.transform(indexed).select('features','label')

#     final_data = output.select('features','label') #drop everything else
    
    # Now check for negative values and warn the user about them
    if df_minimum < 0:
        print(" ")
        print("WARNING: The Naive Bayes Classifier will not be able to process your dataframe as it contains negative values")
        print(" ")
    
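    # Note: the rescaling below runs whenever treat_neg_values is True, whether or not negative values were found
    if treat_neg_values == True: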
        print("You have opted to correct that by rescaling all your features to a range of 0 to 1")
        print(" ")
        print("We are rescaling you dataframe....")
        scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

        # Compute summary statistics and generate MinMaxScalerModel
        scalerModel = scaler.fit(output)

        # rescale each feature to range [min, max].
        scaled_data = scalerModel.transform(output)
        final_data = scaled_data.select('label','scaledFeatures')
        final_data = final_data.withColumnRenamed('scaledFeatures','features')
        print("Done!")

    else:
        print("You have opted not to correct that, therefore you will not be able to use the Naive Bayes classifier")
        print("We will return the dataframe unscaled.")
        final_data = output
    
    return final_data
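
The outlier treatment inside the function (floor, cap, then log + 1 for right-skewed columns) can be hard to picture from the Spark code alone. The following is a small plain-Python sketch of the same idea on a list of numbers. The helper names are hypothetical, and it uses an exact nearest-rank quantile where the function above uses Spark's `approxQuantile`, so the cut-off values may differ from what Spark would choose.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def floor_cap_log(values, lower_q=0.01, upper_q=0.99):
    """Clip values to the [lower_q, upper_q] quantiles, then apply log(x + 1)."""
    ordered = sorted(values)
    lo = ordered[int(lower_q * (len(ordered) - 1))]  # nearest-rank quantile (assumption)
    hi = ordered[int(upper_q * (len(ordered) - 1))]
    return [math.log1p(min(max(v, lo), hi)) for v in values]


raw = [1, 1, 2, 2, 3, 3, 4, 5, 8, 200]  # one extreme value creates strong right skew
treated = floor_cap_log(raw)

before = skewness(raw)
after = skewness(treated)
print("skew before:", before)
print("skew after: ", after)
```

The treated column keeps the ordering of the original values but is far less dominated by the single extreme point.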

***Test run!***

In [162]:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import * 
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler

col_list = ["A1","A2","A3","A4","A5","A6","A7","A8","A9","A10","Age_Mons","Qchat-10-Score","Sex","Ethnicity","Jaundice","Family_mem_with_ASD","Who completed the test"]

input_columns = col_list
dependent_var = 'Class/ASD Traits '

final_data = MLClassifierDFPrep(df,input_columns,dependent_var)
final_data.limit(5).toPandas()

We are correcting for non normality now!
You have opted to correct that by rescaling all your features to a range of 0 to 1
 
We are rescaling you dataframe....
Done!


Unnamed: 0,label,features
0,1.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, ..."
1,0.0,"(1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, ..."
2,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, ..."
3,0.0,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ..."
4,0.0,"[1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ..."
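
The rescaling step maps every feature to the range 0 to 1 using x' = (x - min) / (max - min), computed per feature. Here is a minimal plain-Python sketch of that formula, under the assumption that the data is a list of equal-length rows. The function name `min_max_scale` is made up for this example. The constant-column case follows what the Spark `MinMaxScaler` documentation describes (the midpoint of the target range).

```python
def min_max_scale(rows):
    """Rescale each column of a list of equal-length rows to the range [0, 1]."""
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        if hi == lo:
            # Constant column: use the midpoint of the [0, 1] range.
            scaled_columns.append([0.5] * len(col))
        else:
            scaled_columns.append([(v - lo) / (hi - lo) for v in col])
    # Transpose back so that each inner list is one row again.
    return [list(row) for row in zip(*scaled_columns)]


rows = [[-2.0, 10.0, 7.0],
        [0.0, 20.0, 7.0],
        [2.0, 40.0, 7.0]]
scaled = min_max_scale(rows)
for row in scaled:
    print(row)
```

After scaling no value is negative, which is why this step makes the data usable for Naive Bayes.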


<div class="alert alert-block alert-warning"><b>Testing and evaluating function</b></div>

In [163]:
def ClassTrainEval(classifier,features,classes,train,test):

    def FindMtype(classifier):
        # Instantiate Model
        M = classifier
        # Learn what it is
        Mtype = type(M).__name__
        
        return Mtype
    
    Mtype = FindMtype(classifier)
    

    def IntanceFitModel(Mtype,classifier,classes,features,train):
        
        if Mtype == "OneVsRest":
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
#             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()
            #Cross Validator requires the following parameters:
            crossval = CrossValidator(estimator=OVRclassifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 is best practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == "MultilayerPerceptronClassifier":
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count+1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
            fitModel = MPC_classifier.fit(train)
            return fitModel
        if Mtype in("LinearSVC","GBTClassifier") and classes != 2: # These classifiers currently only accept binary classification
            print(Mtype," could not be used because PySpark currently only accepts binary classification data for this algorithm")
            return
        if Mtype in("LogisticRegression","NaiveBayes","RandomForestClassifier","GBTClassifier","LinearSVC","DecisionTreeClassifier"):
  
            # Add parameters of your choice here:
            if Mtype == "LogisticRegression":
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .addGrid(classifier.maxIter, [10, 15,20])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == "NaiveBayes":
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == "RandomForestClassifier":
                paramGrid = (ParamGridBuilder() \
                               .addGrid(classifier.maxDepth, [2, 5, 10])
#                                .addGrid(classifier.maxBins, [5, 10, 20])
#                                .addGrid(classifier.numTrees, [5, 20, 50])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == "GBTClassifier":
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .addGrid(classifier.maxIter, [10, 15,50,100])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype == "LinearSVC":
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.maxIter, [10, 15]) \
                             .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .build())
            
            # Add parameters of your choice here:
            if Mtype == "DecisionTreeClassifier":
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
                             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .build())
            
            #Cross Validator requires all of the following parameters:
            crossval = CrossValidator(estimator=classifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
    
    fitModel = IntanceFitModel(Mtype,classifier,classes,features,train)
    
    # Print feature selection metrics
    if fitModel is not None:
        
        if Mtype == "OneVsRest":
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept,'\033[1m' + '\nCoefficients:'+ '\033[0m',model.coefficients)

        if Mtype == "MultilayerPerceptronClassifier":
            print("")
            print('\033[1m' + Mtype," Weights"+ '\033[0m')
            print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
            print("")

        if Mtype in("DecisionTreeClassifier", "GBTClassifier","RandomForestClassifier"):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature’s importance is the average of its importance across all trees
            # in the ensemble. The importance vector is normalized to sum to 1.
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Feature Importances"+ '\033[0m')
            print("(Scores add up to 1)")
            print("Lowest score is the least important")
            print(" ")
            featureImportances = BestModel.featureImportances.toArray()
            print(featureImportances)
            
            # Keep the importances and best model of each tree-based classifier in globals for later use
            if Mtype == "DecisionTreeClassifier":
                global DT_featureImportances
                DT_featureImportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype == "GBTClassifier":
                global GBT_featureImportances
                GBT_featureImportances = BestModel.featureImportances.toArray()
                global GBT_BestModel
                GBT_BestModel = BestModel
            if Mtype == "RandomForestClassifier":
                global RF_featureImportances
                RF_featureImportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        if Mtype == "LogisticRegression":
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficient Matrix"+ '\033[0m')
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficientMatrix))
            print("Intercept: " + str(BestModel.interceptVector))
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel

        if Mtype == "LinearSVC":
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficients"+ '\033[0m')
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficients))
            global LSVC_coefficients
            LSVC_coefficients = BestModel.coefficients.toArray()
            global LSVC_BestModel
            LSVC_BestModel = BestModel
        
   
    # Set the column names to match the external results dataframe that we will join with later:
    columns = ['Classifier', 'Result']
    
    if Mtype in("LinearSVC","GBTClassifier") and classes != 2:
        Mtype = [Mtype] # make this a list
        score = ["N/A"]
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # predictionCol="prediction" could also be passed here
        accuracy = (MC_evaluator.evaluate(predictions))*100
        Mtype = [Mtype] # make this a list
        score = [str(accuracy)] #make this a string and convert to a list
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
        result = result.withColumn('Result',result.Result.substr(0, 5))
        
    return result
    # Feature importances, coefficients and best models are not returned; they are stored in the global variables set above

*This function is rather long for my taste; it would be better to split it up a little. It is probably also very time-consuming to run. But it is still a good reference.*
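
One possible way to shorten the function is to replace the chain of `if Mtype ...` blocks with a lookup table. The sketch below is plain Python and only an illustration: `PARAM_GRIDS`, `expand_grid` and `count_fits` are hypothetical names, and the grids are written as simple dictionaries rather than Spark `ParamGridBuilder` objects. It also shows why the run is slow: cross-validation trains one model per fold for every parameter combination, plus a final refit of the best combination.

```python
from itertools import product

# Hypothetical lookup table: classifier type name -> parameters to search.
# The values mirror the grids used in ClassTrainEval.
PARAM_GRIDS = {
    "LogisticRegression": {"maxIter": [10, 15, 20]},
    "NaiveBayes": {"smoothing": [0.0, 0.2, 0.4, 0.6]},
    "RandomForestClassifier": {"maxDepth": [2, 5, 10]},
    "GBTClassifier": {"maxIter": [10, 15, 50, 100]},
    "LinearSVC": {"maxIter": [10, 15], "regParam": [0.1, 0.01]},
    "DecisionTreeClassifier": {"maxBins": [10, 20, 40, 80, 100]},
}


def expand_grid(spec):
    """Return every combination of the listed values, one dict per combination."""
    names = list(spec)
    return [dict(zip(names, combo)) for combo in product(*(spec[n] for n in names))]


def count_fits(model_type, num_folds=2):
    """Models trained by k-fold cross-validation: folds x combinations, plus the final refit."""
    return num_folds * len(expand_grid(PARAM_GRIDS[model_type])) + 1


for model_type in PARAM_GRIDS:
    print(model_type, "->", count_fits(model_type), "model fits")
```

In the Spark version, each entry of the table could presumably be fed to `ParamGridBuilder().addGrid(getattr(classifier, name), values)`, so that one generic block replaces six nearly identical ones.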

***Test Run!***

In [164]:
#This part is just okay, basic testing
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.sql import functions
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
    LogisticRegression(),
    OneVsRest(),
    LinearSVC(),
    NaiveBayes(),
    RandomForestClassifier(),
    GBTClassifier(),
    DecisionTreeClassifier(),
    MultilayerPerceptronClassifier(),
]

train,test = final_data.randomSplit([0.7,0.3])
features = final_data.select(['features']).collect()
# Learn how many classes there are in order to specify evaluation type based on binary or multi and turn the df into an object
class_count = final_data.select(countDistinct("label")).collect()
classes = class_count[0][0]

#set up your results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
results.show(100,False)

 
LogisticRegression  Coefficient Matrix
You should compares these relative to eachother
Coefficients: 
DenseMatrix([[ -5.74826963,  -5.88605442,  -5.7136875 ,  -5.64906936,
               -5.49625639,  -5.86266633,  -6.33965982,  -5.49307227,
               -6.38637691,  -5.93037897,   0.03782589, -16.69952918,
                0.18956656,  -0.75559565,  -0.51833962,  -0.19425441,
               -9.58377893]])
Intercept: [26.888162137531307]
 
OneVsRest
Intercept:  -7.435940405560727
Coefficients: [1.5426091771792398,1.8014770692007012,1.2906738274071288,1.5146930541860517,1.4805674045093462,1.6484001984963874,1.6509116447967473,1.5165470502758096,2.0817681499674063,1.483337728646401,0.2930327345686611,4.575638620467662,-0.17176988123414272,0.5507928698449102,0.44382006219496994,-0.15660900773580122,1.744358219859474]
Intercept:  7.43594040556073
Coefficients: [-1.5426091771792385,-1.8014770692007032,-1.2906738274071308,-1.5146930541860
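
In the One-vs-Rest output above, the two binary models look like mirror images of each other: the intercepts are roughly -7.44 and +7.44, and the coefficients shown have opposite signs. For a two-class problem this is what one would expect, since "class 0 vs the rest" and "class 1 vs the rest" are the same question asked from opposite sides. The sketch below illustrates this with made-up weights and a made-up feature vector (not the fitted values); `binary_score` is a hypothetical helper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_score(intercept, coefficients, x):
    """Probability from one binary logistic model: sigmoid(intercept + w . x)."""
    return sigmoid(intercept + sum(w * v for w, v in zip(coefficients, x)))


# Made-up weights for the first binary model ...
intercept_a, coef_a = -1.5, [0.8, -0.3, 2.0]
# ... and its mirror image for the second binary model.
intercept_b, coef_b = -intercept_a, [-w for w in coef_a]

x = [1.0, 0.0, 1.0]  # hypothetical feature vector
p_a = binary_score(intercept_a, coef_a, x)
p_b = binary_score(intercept_b, coef_b, x)

# One-vs-Rest predicts the class whose binary model is the most confident.
predicted = 0 if p_a > p_b else 1
print(p_a, p_b, predicted)
```

Because the two scores are complementary, One-vs-Rest adds little over plain logistic regression when there are only two classes; it becomes useful with three or more.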

<div class="alert alert-block alert-warning"><b>Diagnostics</b></div>

*This gives the same kind of information as our testing, but in more depth. It could be useful in the future.*

In [166]:
from pyspark.ml.evaluation import *
from pyspark.ml.classification import *

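def ClassDiag(classifier):
    # Note: this relies on the training summary and the coefficient matrix,
    # so it is meant for LogisticRegression. train and test come from the notebook's global scope.
    
    # Fit our model
    C = classifier
    fitModel = C.fit(train)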

    # Load the Summary
    trainingSummary = fitModel.summary

    # General Describe
    trainingSummary.predictions.describe().show()

    # View Predictions
    pred_and_labels = fitModel.evaluate(test)
    pred_and_labels.predictions.show()

    # Print the coefficients and intercept for multinomial logistic regression
    print("Coefficients: \n" + str(fitModel.coefficientMatrix))
    print(" ")
    print("Intercept: " + str(fitModel.interceptVector))
    print(" ")

    # Obtain the objective per iteration
    objectiveHistory = trainingSummary.objectiveHistory
    print(" ")
    print("objectiveHistory:")
    for objective in objectiveHistory:
        print(objective)

    # for multiclass, we can inspect metrics on a per-label basis
    print(" ")
    print("False positive rate by label:")
    for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
        print("label %d: %s" % (i, rate))

    print(" ")
    print("True positive rate by label:")
    for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
        print("label %d: %s" % (i, rate))

    print(" ")
    print("Precision by label:")
    for i, prec in enumerate(trainingSummary.precisionByLabel):
        print("label %d: %s" % (i, prec))

    print(" ")
    print("Recall by label:")
    for i, rec in enumerate(trainingSummary.recallByLabel):
        print("label %d: %s" % (i, rec))

    print(" ")
    print("F-measure by label:")
    for i, f in enumerate(trainingSummary.fMeasureByLabel()):
        print("label %d: %s" % (i, f))

    accuracy = trainingSummary.accuracy
    falsePositiveRate = trainingSummary.weightedFalsePositiveRate
    truePositiveRate = trainingSummary.weightedTruePositiveRate
    fMeasure = trainingSummary.weightedFMeasure()
    precision = trainingSummary.weightedPrecision
    recall = trainingSummary.weightedRecall
    print(" ")
    print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
          % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))
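
To clarify what the per-label numbers in the summary mean, here is a small plain-Python sketch that computes them from a list of true labels and a list of predictions. The function `per_label_metrics` and the toy data are assumptions for illustration; this is not how Spark computes its summary internally.

```python
def per_label_metrics(labels, predictions):
    """Precision, recall (true positive rate) and false positive rate for each label."""
    pairs = list(zip(labels, predictions))
    metrics = {}
    for label in sorted(set(labels)):
        tp = sum(1 for y, p in pairs if y == label and p == label)  # correctly predicted as label
        fp = sum(1 for y, p in pairs if y != label and p == label)  # wrongly predicted as label
        fn = sum(1 for y, p in pairs if y == label and p != label)  # missed instances of label
        tn = len(pairs) - tp - fp - fn
        metrics[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "fpr": fp / (fp + tn) if fp + tn else 0.0,
        }
    return metrics


labels      = [0, 0, 0, 1, 1, 1, 1, 0]
predictions = [0, 1, 0, 1, 1, 0, 1, 0]

metrics = per_label_metrics(labels, predictions)
accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)

for label, values in metrics.items():
    print("label", label, values)
print("accuracy:", accuracy)
```

Recall and true positive rate are the same quantity, which is why the summary reports identical values for both.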

In [167]:
ClassDiag(LogisticRegression())

+-------+-------------------+-------------------+
|summary|              label|         prediction|
+-------+-------------------+-------------------+
|  count|                736|                736|
|   mean|             0.3125|             0.3125|
| stddev|0.46382761282805135|0.46382761282805135|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(17,[0,1,2,3,4,5,...|[192.286546049392...|[1.0,3.0975199095...|       0.0|
|  0.0|(17,[0,1,2,3,4,5,...|[157.706929220363...|[1.0,3.2266427365...|       0.0|
|  0.0|(17,[0,1,2,3,4,5,...|[158.310679045804...|[1.0,1.7641912512...|       0.0|
|  0.0|(17,[0,1,2,3,4,6,...|[157.628655938967...|[1.0,3.4

***The End***