# MLlib: Basic Statistics and Exploratory Data Analysis

We will introduce Spark's machine learning library [MLlib](https://spark.apache.org/docs/latest/mllib-guide.html).

## Getting the data and creating the RDD

The main objective is to play with a small dataset from the Hackathon.

# Prepare the data

In this notebook, you will use the famous [Boston dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston) coming from the sklearn library.

In the next cell, the dataset is loaded from scikit-learn and written to a CSV file that Spark can load. Note that `load_boston` was removed in scikit-learn 1.2, so this cell requires an older version.

In [None]:
from sklearn.datasets import load_boston
import pandas
import numpy as np

data = load_boston()
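# Stack the feature matrix and the target column side by side
features_and_target = np.concatenate([data["data"], data["target"].reshape(-1, 1)], axis=1)
columns = data["feature_names"].tolist() + ["Target"]
df = pandas.DataFrame.from_records(features_and_target, columns=columns)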
df.to_csv("boston.csv", header=True, index=False)

## Machine learning with Apache Spark
Now that the input data is ready, we can apply some basic (or advanced) data processing functions to predict the value of the "Target" column (i.e. the label).

In [None]:
import pyspark
sc = pyspark.SparkContext()
sqlCtx = pyspark.SQLContext(sc)

In [None]:
# Load the CSV file written above; the header row gives the column names
df = sqlCtx.read.format("csv").option("header", "true").option("inferSchema", "true").load("boston.csv")

In [47]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorAssembler, PCA

# Combine the selected input columns into a single vector column
assembler = VectorAssembler(inputCols=["CRIM", "ZN"], outputCol="features")
rf = RandomForestRegressor(featuresCol="features", labelCol="Target", maxDepth=1, maxBins=32, numTrees=1)
pipeline = Pipeline(stages=[assembler, rf])

Train and test splits

In [48]:
# Split the loaded DataFrame: roughly 60% for training, 40% for testing
train, test = df.randomSplit([0.6, 0.4])

In [49]:
model = pipeline.fit(train)

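Inspect the predictions on the test set (for a regressor, quality is measured with an error metric such as RMSE rather than accuracy)

In [None]:
# Show each prediction next to the true target value
model.transform(test).select("prediction", "Target", "features").show()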

# Exercise: PCA Preprocessing
Add a PCA step to the pipeline

In [52]:
from pyspark.ml.feature import PCA
pca = PCA(k=2, inputCol="features", outputCol="pca_features")

# Exercise: KMeans
Add a KMeans stage to your pipeline

In [55]:
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=2, seed=1, featuresCol="features", predictionCol="kmeans_pred")