# Exercises with Spark ML

This lab is split into two parts. The first part is closely guided: the goal is to write a program that
estimates whether a tumor is malignant or benign from a few features collected from a biopsy. The second
part is more exploratory, with several ML tasks on several datasets. One of the goals of this lab is to make
you efficient at reading and using the documentation.

The main documentation page for Spark ML is https://spark.apache.org/docs/latest/ml-pipeline.html

## The Wisconsin breast cancer dataset

The Wisconsin breast cancer dataset contains 699 cases of breast cancer. The dataset is presented here:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
and you can download it directly from the following URL:
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data


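Our goal here is to determine whether a tumor is malignant or benign using the provided features. Each line of the dataset represents a case and contains 11 values separated by commas: a sample identifier, nine features, and a class code (2 for benign, 4 for malignant). Missing feature values are written as `?`.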
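
To make the record layout concrete, here is a minimal sketch in plain Python (no Spark needed). The sample line is made up for illustration and is not copied from the file, and `parse_record` is a hypothetical helper name; unlike the lab code further down, it keeps missing values as `None` instead of replacing them.

```python
# Illustrative record in the dataset's 11-column layout (made up, not from the file).
SAMPLE_LINE = "1234567,5,1,1,1,2,?,3,1,1,2"

def parse_record(line):
    """Split one comma-separated record into id, features and a 0/1 label."""
    fields = line.strip().split(",")
    sample_id = fields[0]        # column 1: sample identifier, not a feature
    raw_features = fields[1:10]  # columns 2-10: the nine features
    class_code = fields[10]      # column 11: '2' = benign, '4' = malignant
    # Missing feature values are written as '?'; keep them as None here.
    features = [None if v == "?" else float(v) for v in raw_features]
    label = 0 if class_code == "2" else 1
    return sample_id, features, label

sample_id, features, label = parse_record(SAMPLE_LINE)
print(sample_id, features, label)
```

Keeping the identifier out of the feature list matters: it carries no medical information and would only add noise to the model.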


## Setting up the environment

In [2]:
import pyspark
import random
import os
sc = pyspark.SparkContext(appName="SparkML")

## 1. Load the data into Spark
The first step of our ML application consists of setting up Spark and reading the dataset.

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SQLContext
from pyspark.ml.linalg import VectorUDT,Vectors

In [4]:
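def toFloat(x):
    # Missing values are encoded as '?'; replace them with a fixed default of 5.0.
    if x == '?':
        return 5.0
    else:
        return float(x)

def doLine(l):
    # Parse one comma-separated line into a (features, label) pair.
    item = l.split(",")
    # The last column is the class code: '2' is benign (label 0), otherwise malignant (label 1).
    label = 1
    if item[10] == '2':
        label = 0
    # Column 0 is the sample id, so only columns 1 to 9 are used as features.
    return (Vectors.dense([toFloat(e) for e in item[1:10]]), label)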

In [5]:
raw_data = sc.textFile("/home/p5hngk/Downloads/GitHub/SD_701---Data_Mining/breast-cancer-wisconsin.data")
schema = StructType([StructField("features", VectorUDT(), True),
                     StructField("label",IntegerType(),True)])
data = SQLContext(sc).createDataFrame(raw_data.map(doLine),schema)

In [7]:
data.show(10)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[5.0,1.0,1.0,1.0,...|    0|
|[5.0,4.0,4.0,5.0,...|    0|
|[3.0,1.0,1.0,1.0,...|    0|
|[6.0,8.0,8.0,1.0,...|    0|
|[4.0,1.0,1.0,3.0,...|    0|
|[8.0,10.0,10.0,8....|    1|
|[1.0,1.0,1.0,1.0,...|    0|
|[2.0,1.0,2.0,1.0,...|    0|
|[2.0,1.0,1.0,1.0,...|    0|
|[4.0,2.0,1.0,1.0,...|    0|
+--------------------+-----+
only showing top 10 rows



In [8]:
data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: integer (nullable = true)



In [26]:
data.groupby(data.label).count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1|  241|
|    0|  458|
+-----+-----+



**1** = malignant tumors  
**0** = benign tumors

## 2. Splitting into training and testing

To build our model, we first need to split the data into a training set and a testing set. Here we use a
90/10 split: 90% of the dataset is used for training and the remaining 10% is used to test the model.
You can use the `randomSplit` function for this (see the documentation).

In [28]:
df_split = data.randomSplit([0.9 , 0.1])

In [29]:
data_train = df_split[0]
data_test = df_split[1]

In [32]:
data_train.count()

636

In [33]:
data_test.count()

63

## 3. Building the model

The model is built with the `DecisionTreeClassifier` class from the `pyspark.ml.classification` package.


https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier


In [41]:
from pyspark.ml.classification import DecisionTreeClassifier

# Train a DecisionTree model.
bc_model = DecisionTreeClassifier(labelCol="label", featuresCol="features").fit(data_train)

## 4. Testing your model

Predictions for the test data are computed by applying the model to the test set, and an evaluator then scores them:
```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

predictions = bc_model.transform(data_test)
evaluator = BinaryClassificationEvaluator(labelCol="label",
                                          metricName='areaUnderROC')
result = evaluator.evaluate(predictions)
```

In [50]:
predictions = bc_model.transform(data_test)
predictions

DataFrame[features: vector, label: int, rawPrediction: vector, probability: vector, prediction: double]

In [62]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName='areaUnderROC')
result = evaluator.evaluate(predictions)
print("The area under ROC of our classifier is : {}".format(result))

The area under ROC of our classifier is : 0.9473684210526315


In [71]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator2 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator2.evaluate(predictions)
print("Test error = %g " % (1.0 - accuracy))
print("The accuracy of our classifier is {}%.".format(accuracy*100))

Test error = 0.031746 
The accuracy of our classifier is 96.82539682539682%.
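
The two printed numbers are tied together by simple arithmetic. The sketch below reproduces them in plain Python, assuming 2 of the 63 test rows were misclassified; that count is inferred from the printed test error and is not reported by the notebook itself.

```python
n_test = 63   # size of the test set in the run shown above
n_wrong = 2   # assumption: inferred from the printed test error of 0.031746

accuracy = (n_test - n_wrong) / n_test
test_error = 1.0 - accuracy

print("accuracy   = %g" % accuracy)
print("test error = %g" % test_error)
```

With only 63 test rows, each misclassified case moves the accuracy by about 1.6 percentage points, so this estimate can vary noticeably from one random split to another.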


## 5. Improving the model

The model depends on a set of hyperparameters (number of bins, maximum depth, etc.). Spark ML has tools to help you decide
which values are well suited to your application (see https://spark.apache.org/docs/2.2.0/ml-tuning.html).
