# MLlib: Basic Statistics and Exploratory Data Analysis

We will introduce Spark's machine learning library [MLlib](https://spark.apache.org/docs/latest/mllib-guide.html).

## Getting the data and creating the RDD

The main objective is to play with a small dataset from the Hackathon.

In [13]:
ls

mllib-tutorial.ipynb           Spark_correction_tutorial.ipynb
mllib-tutorial-students.ipynb  Spark_tutorial_students.ipynb


In [23]:
import numpy as np
from pyspark import Row

n_examples = 1000
n_feats=11

def gen_features(_):
    return np.random.randn(1, n_feats).flatten().tolist()

def gen_dict(x): 
    return dict([ ("feature_{}".format(i) , e) for i, e in enumerate(x) ])

featuresDF = sc.parallelize(range(n_examples)).map(gen_features).map(gen_dict).map(lambda r: Row(**r))

featuresDF=sqlContext.createDataFrame(featuresDF)

featuresDF.write.parquet("/home/matthieu_le_goff/spark-tutorial/input-dataset.parquet")

In [24]:
featuresDF = sc.parallelize(range(n_examples)).map(gen_features)
print(featuresDF.first())

[-1.2286099587270376, 0.5083651052766873, -0.5272028734422457, -0.23747419008827422, -1.6039189103925688, 0.7295034651296943, 0.24573184837021395, 1.4005658579060023, -1.0406163559216033, 0.1758052835612565, -0.3064545508872768]


In [30]:
featuresDF=None
featuresDF = sqlContext.read.parquet("/home/matthieu_le_goff/spark-tutorial/input-dataset.parquet")

First, parsing the file...

In [31]:
featuresDF.printSchema()

root
 |-- feature_0: double (nullable = true)
 |-- feature_1: double (nullable = true)
 |-- feature_10: double (nullable = true)
 |-- feature_2: double (nullable = true)
 |-- feature_3: double (nullable = true)
 |-- feature_4: double (nullable = true)
 |-- feature_5: double (nullable = true)
 |-- feature_6: double (nullable = true)
 |-- feature_7: double (nullable = true)
 |-- feature_8: double (nullable = true)
 |-- feature_9: double (nullable = true)



In [32]:
result = featuresDF.describe().collect()

In [33]:
for l in result:
    print "----------------------"
    r = l.asDict()
    print "Statistics {}".format(r["summary"])
    for key in r.keys():
        print "{0}: {1}".format(key, r[key])
    print "----------------------"

----------------------
Statistics count
feature_10: 1000
summary: count
feature_8: 1000
feature_9: 1000
feature_2: 1000
feature_3: 1000
feature_0: 1000
feature_1: 1000
feature_6: 1000
feature_7: 1000
feature_4: 1000
feature_5: 1000
----------------------
----------------------
Statistics mean
feature_10: 0.021629714868542822
summary: mean
feature_8: 0.03411087104521261
feature_9: 0.04760686252662131
feature_2: -0.01651924887134106
feature_3: 0.004302160644344671
feature_0: -0.05458669093655795
feature_1: 0.07591372627210322
feature_6: 0.016071453810657506
feature_7: -0.047463444974409456
feature_4: -0.040010827075252296
feature_5: 0.009890709057433654
----------------------
----------------------
Statistics stddev
feature_10: 0.9825420772692703
summary: stddev
feature_8: 1.0186248346945252
feature_9: 1.0119930348969068
feature_2: 0.9601107948692916
feature_3: 1.018268338009346
feature_0: 1.0719935478003724
feature_1: 0.9948617364436604
feature_6: 1.0040374446267113
feature_7: 1.0025740

In [38]:
from pyspark.sql.types import LongType
label_df = featuresDF.withColumn("label", (featuresDF.feature_1>0.1).cast("int") )

Compute statistics by labels

In [39]:
label_df.select("label").groupBy("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1|  502|
|    0|  498|
+-----+-----+



## Machine learning with Apache Spark
Now that the inputs are defined, we can apply some basics (or advanced) data processing functions to classify the type of interactions (i.e. "label")

In [40]:
from pyspark.ml.feature import StringIndexer
col_names=featuresDF.columns
s = StringIndexer(inputCol="label", outputCol="idx_label").fit(label_df.select(col_names + ["label"]))

In [41]:
result = s.transform(label_df.select(col_names + ["label"]))

In [42]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler, PCA

assemblor = VectorAssembler(inputCols=col_names, outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="idx_label", maxDepth=1, maxBins=32, numTrees=1)
pipeline = Pipeline(stages=[s, assemblor, rf])

Train and test splits

In [43]:
train, test = label_df.select(col_names + ["label"]).randomSplit([0.6,0.4])

In [44]:
model = pipeline.fit(train)

Compute accuracy on both train and test sets

In [45]:
model.transform(test).select("prediction", "idx_label").groupBy("prediction", "idx_label").count().show()

+----------+---------+-----+
|prediction|idx_label|count|
+----------+---------+-----+
|       1.0|      1.0|  198|
|       0.0|      1.0|    2|
|       0.0|      0.0|  205|
+----------+---------+-----+



In [46]:
preds = model.transform(test)
print preds.where(preds.prediction == preds.idx_label).count()

403


Try applying a PCA before learning the model

In [47]:
from pyspark.ml.feature import PCA

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
assemblor = VectorAssembler(inputCols=col_names, outputCol="features")
rf = RandomForestClassifier(featuresCol="pca_features", labelCol="idx_label", maxDepth=1, maxBins=32, numTrees=1)
pipeline = Pipeline(stages=[s, assemblor, pca, rf])

In [48]:
model = pipeline.fit(train)

In [49]:
model.transform(test).select("prediction", "idx_label").groupBy("prediction", "idx_label").count().show()

+----------+---------+-----+
|prediction|idx_label|count|
+----------+---------+-----+
|       1.0|      1.0|  163|
|       0.0|      1.0|   37|
|       1.0|      0.0|  165|
|       0.0|      0.0|   40|
+----------+---------+-----+



Try applying a kmeans to the dataset

In [50]:
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=2, seed=1, featuresCol="features", predictionCol="kmeans_pred")
assemblor = VectorAssembler(inputCols=col_names, outputCol="features")
kmeans_assemblor = VectorAssembler(inputCols=col_names+["kmeans_pred"], outputCol="kmeans_features")
rf = RandomForestClassifier(featuresCol="kmeans_features", labelCol="idx_label", maxDepth=1, maxBins=32, numTrees=1)
pipeline = Pipeline(stages=[s, assemblor, kmeans, kmeans_assemblor, rf])

In [51]:
model = pipeline.fit(train)

### Cross validation

In [56]:
from pyspark.ml.feature import PCA

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
assemblor = VectorAssembler(inputCols=col_names, outputCol="features")
rf = RandomForestClassifier(featuresCol="pca_features", labelCol="idx_label", maxDepth=1, maxBins=32, numTrees=1)
pipeline = Pipeline(stages=[s, assemblor, pca, rf])


In [61]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = ParamGridBuilder() \
    .addGrid(pca.k, [2, 5, 8]) \
    .addGrid(rf.maxDepth, [1, 10]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=3)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(train)

In [63]:
cvModel.transform(test).select("prediction", "idx_label").groupBy("prediction", "idx_label").count().show()

+----------+---------+-----+
|prediction|idx_label|count|
+----------+---------+-----+
|       1.0|      1.0|  163|
|       0.0|      1.0|   37|
|       1.0|      0.0|  165|
|       0.0|      0.0|   40|
+----------+---------+-----+



Exercices:
* Compare these models to the ones in scikit learn
* Detect classification noise using multiple models
    - If different models keep giving wrong predictions on the same sample, it may be a labeling mistake
    
Optional but good for training:
* Implement the Viola Jones strategy using Spark
    - Lear a classifier (Decision Tree)
    - Reduce to zero the FPR using a threshold on probability
    - 