<div style="background:black">
    <center>
<img src="./images/session3/title.png" alt="Title"/>
    </center>
</div>


<div class="alert alert-block alert-success">
<center>
Today's objectives:<br/><br/>
    </center>
    &#x25a2; Get familiar with <b>supervised classification</b><br/>
    &#x25a2; Try <b>Spark's MLlib classification</b> library <br/>
    &#x25a2; Apply supervised classification to <b>network intrusion detection</b><br/>
    &#x25a2; Practice with Spark <b>DataFrames</b>
</div>


# Principles of supervised classification

See slides [here](pdf/session3/classification.pdf)

# Supervised classification in Spark

  <div class="alert alert-block alert-info">
    <center>
Spark has a rich API for Classification and Regression, described <a href="https://spark.apache.org/docs/latest/ml-classification-regression.html">here</a>.
        </center>
        </div>

We will go through an example adapted from https://spark.apache.org/docs/latest/ml-classification-regression.html

In [None]:
# A little magic to adjust the config at Ericsson
import os
os.environ["IPYTHON"]="1"
os.environ["PYSPARK_PYTHON"]="/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"]="ipython3"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"]="notebook"
os.environ["JAVA_HOME"]="/usr/lib/jvm/default-java"

In [None]:
# Create Spark session
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .getOrCreate()

The data is already prepared for MLlib:

In [None]:
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/session3/sample_libsvm_data.txt")

# Convert to Pandas for better visualization in notebook
import pandas as pd
pd.set_option('display.max_columns', None)
data.toPandas()

Once the data is prepared, the first step is to split it into training and test sets:

In [None]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3], seed=123)

print(f'Dataset has {data.count()} elements')
print(f'Training set has {trainingData.count()} elements')
print(f'Test set has {testData.count()} elements')
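
Note that `randomSplit` assigns rows to splits at random according to the weights, so the resulting sizes are only approximately 70% and 30%. The plain-Python sketch below illustrates the idea on a list of integers. It is not Spark's implementation, and the helper name `random_split` is hypothetical.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split at random, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        # Draw a point in [0, total) and find the weight interval it falls in
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(row)
                break
    return splits

rows = list(range(100))
training, test = random_split(rows, [0.7, 0.3], seed=123)

# Sizes are close to 70 and 30, but usually not exactly equal to them
print(len(training), len(test))
```

Fixing the seed makes the split reproducible from one run to the next, which is why the cell above passes `seed=123`.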

  <div class="alert alert-block alert-info">
    <center>
    The <b>Pipeline</b> is an important element of MLlib, containing the various transformations and models applied to the data.
    </center>
    </div>
    <a id="pipeline"></a>

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Define a RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)

# Wrap the forest in a Pipeline (data preparation stages would be chained here too)
pipeline = Pipeline(stages=[rf])

# Train the model. Fitting the pipeline runs every stage in order.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

<ul style="list-style-image: url('images/do.png');">
<li>Using your knowledge of the DataFrame API, compute the accuracy of this classifier on the test set:</li>
</ul>
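
As a reminder, accuracy is the fraction of test rows whose prediction equals the label. Here is a minimal plain-Python sketch of that definition, using made-up (prediction, label) pairs. On the `predictions` DataFrame, the same two counts could be obtained with `filter` and `count`.

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs where both values match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("cannot compute the accuracy of an empty set")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)

# Made-up (prediction, label) pairs, for illustration only
example_pairs = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]
print(accuracy(example_pairs))  # 3 correct out of 4 -> 0.75
```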


# Quiz


<div class="alert alert-block alert-warning">
Deep Learning can be used for supervised classification:
    
&#x25a2; True

&#x25a2; False

</div>

<div class="alert alert-block alert-info">

Decision trees can be used for regression:
    
&#x25a2; True

&#x25a2; False

</div>

<div class="alert alert-block alert-warning">
Classifiers in Spark's MLlib are all parallelized and can leverage a computing cluster:
    
&#x25a2; True

&#x25a2; False

</div>

# Mini-project: network intrusion detection

<center>
<div class="alert alert-block alert-info">
    <b>Goal</b>: predict if a network connection is an attack
    </div>
    </center>

## Data inspection

Data: US Air Force LAN data ([link](https://www.kaggle.com/sampadab17/network-intrusion-detection)), available in <code>data/session3/network_data.csv</code>

In [None]:
! head data/session3/network_data.csv

## Data loading

<ul style="list-style-image: url('images/do.png');">
    <li>Load the dataset as a Spark DataFrame:</li>
</ul>

<ul style="list-style-image: url('images/do.png');">
    <li>What are the categorical and numerical features?</li>
</ul>

## Data preparation (1 feature)
<a id="preparation"></a>

<center>
<div class="alert alert-block alert-info">
    Required: a DataFrame with columns named <b>label</b> (numeric) and <b>features</b> (vector of numbers).
    </div>
    </center>
    
For clarity, we will first start with a single feature, <code>src_bytes</code>.

<ul style="list-style-image: url('images/do.png');">
    <li>Build a DataFrame called <code>data</code> containing only two columns: <code>src_bytes</code> and <code>class</code>. Tip: use function <code>select</code> in the DataFrame API.</li>
</ul>

Column <code>src_bytes</code> is of type <code>string</code> while a number is required:

In [None]:
data.describe()

<center>
<div class="alert alert-block alert-info">
    Module <b>pyspark.sql.functions</b> contains useful functions to manipulate DataFrames. 
</div>
    Here we use function <code>col</code> to access a column from its name:
</center>

In [None]:
# Cast every column except 'class' to float
from pyspark.sql.functions import col
data = data.select([col(col_name).cast("float") 
                    for col_name in data.columns if col_name != 'class'] + [ col('class') ])
data.toPandas()


We will now build our data transformation pipeline, which will:
<ol>
    <li>Encode column <code>class</code> as numbers</li>
    <li>Build a feature vector from column <code>src_bytes</code></li>
</ol>
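
To illustrate step 1: by default, `StringIndexer` orders values by descending frequency, so the most frequent class receives index 0. The plain-Python sketch below mimics that ordering on a made-up list of class values. The helper `frequency_index` is hypothetical and is not part of Spark.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct value to an index, the most frequent value getting 0."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically so that ties are deterministic
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(index) for index, value in enumerate(ordered)}

# Made-up class column, for illustration only
classes = ["normal", "anomaly", "normal", "normal", "anomaly"]
mapping = frequency_index(classes)
print(mapping)  # {'normal': 0.0, 'anomaly': 1.0}
```

With this ordering, which class is encoded as 0 depends on the class counts in the data, so it is worth checking the mapping before interpreting labels.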

Here is the definition of the pipeline stages:

In [None]:
stages = []

from pyspark.ml.feature import StringIndexer
label_stringIdx = StringIndexer(inputCol = 'class', outputCol = 'label')
stages += [label_stringIdx]

from pyspark.ml.feature import VectorAssembler
assemblerInputs = [c for c in data.columns if c != 'class' and c != 'label']
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

stages

<ul style="list-style-image: url('images/do.png');">
    <li>Apply these pipeline stages to <code>data</code>. Tip: use the example in <a href="#pipeline">Section 2</a>.</li>
</ul>

## Supervised classification

<ul style="list-style-image: url('images/do.png');">
    <li>Split the transformed data into a training (70%) and a test (30%) set. Tip: use function <code>randomSplit</code>.</li>
</ul>

<ul style="list-style-image: url('images/do.png');">
    <li>Train a Random Forest classifier on the training set, and make predictions on the test set using the resulting model. Tip: use the example in <a href="#pipeline">Section 2</a>.</li>
</ul>

<ul style="list-style-image: url('images/do.png');">
    <li>Compute the accuracy of your predictions.</li>
</ul>

<center>
<div class="alert alert-block alert-success">
    You should obtain an accuracy <b>slightly above 0.9</b>. That's quite good for a model trained on a single feature! 
</div>
</center>

<b>Optional</b>

<ul style="list-style-image: url('images/do.png');">
    <li>Play with the analysis parameters (number of trees in the forest, train/test ratio) to see how they influence the prediction.</li>
</ul>

## Feature analysis


<div class="alert alert-block alert-warning">
<center>
    Obtaining such a high accuracy with a single feature is suspicious. 
</center>
    <ol>
    <li>Was information shared between the train and test sets?</li>
    <li>Is the feature strongly correlated with the class label?</li>
    </ol>
</div>


Let's plot the feature histogram for each class, using Matplotlib:

In [None]:
anomalies = data.where(data['class'] == 'anomaly').toPandas()
normals = data.where(data['class'] == 'normal').toPandas()

In [None]:
from matplotlib import pyplot as plt

# Parameters
thresh = 500  # we won't plot beyond src_bytes=500
bins = 40  # number of bins in the histograms

# Plot histograms
filtered_anomalies = [ x for x in anomalies.src_bytes if x < thresh ]
plt.hist(filtered_anomalies, alpha=0.5, label='anomalies', bins=bins)

filtered_normals = [ x for x in normals.src_bytes if x < thresh ]
plt.hist(filtered_normals, alpha=0.5, label='normals', bins=bins)

# Formatting
plt.xlim(0, thresh)
plt.xlabel('src_bytes')
plt.ylabel('occurrences')
plt.legend()  # display the 'anomalies' and 'normals' labels
plt.show()

<div class="alert alert-block alert-info">
<center>
It looks like the large majority of anomalies have <code>src_bytes</code> less than 30. A simple threshold should give good classification performance.
</center>
</div>

## A threshold-based classifier

<ul style="list-style-image: url('images/do.png');">
    <li>Starting from DataFrame <code>test</code>, create a <code>predictions</code> DataFrame where the prediction will be 1 (anomaly) if <code>src_bytes</code> &lt; 30, and 0 (normal) otherwise.</li>
</ul>
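
The decision rule itself can be sketched in plain Python as follows, on made-up `src_bytes` values. The function name `threshold_predict` is hypothetical. On a Spark DataFrame, one way to express the same rule is with `when` and `otherwise` from `pyspark.sql.functions`.

```python
def threshold_predict(src_bytes, threshold=30):
    """Return 1.0 (anomaly) if src_bytes is below the threshold, 0.0 (normal) otherwise."""
    return 1.0 if src_bytes < threshold else 0.0

# Made-up src_bytes values, for illustration only
example_values = [0.0, 12.0, 30.0, 215.0]
example_predictions = [threshold_predict(value) for value in example_values]
print(example_predictions)  # [1.0, 1.0, 0.0, 0.0]
```

The comparison is strict, so a connection with exactly 30 bytes is predicted as normal.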

<ul style="list-style-image: url('images/do.png');">
    <li>Compute the accuracy of this simple classifier.</li>
</ul>

<div class="alert alert-block alert-success">
<center>
The accuracy is close to the one obtained from Random Forests.<br/>
This reinforces our confidence in this result that initially looked too good to be true.
</center>
    </div>

## More numerical features

<div class="alert alert-block alert-info">
<center>
Many features remain unused, which suggests that our 0.9 accuracy result might still be improved. 
    </center>
    </div>

<ul style="list-style-image: url('images/do.png');">
    <li>Starting from the data preparation example in <a href="#preparation">Section 4.3</a>, prepare the dataset to use all the <b>numerical</b> features. Don't include categorical features for now.</li>
</ul>

<ul style="list-style-image: url('images/do.png');">
    <li>Split the dataset into training (70%) and test (30%) sets</li>
    </ul>

<ul style="list-style-image: url('images/do.png');">
    <li>Fit a Random Forest classifier on the training set and use it to make predictions on the test set</li>
    </ul>

<ul style="list-style-image: url('images/do.png');">
    <li>Evaluate the accuracy of your classifier</li>
    </ul>

<div class="alert alert-block alert-success">
<center>
As expected, accuracy improved substantially by adding more features. <br/>
    This won't always be the case. Finding the best features is called <b>feature engineering</b>.
</center>
    </div>

## Categorical features

<div class="alert alert-block alert-info">
<center>
Categorical features need to be encoded as numbers to be used by Spark MLlib classifiers.<br/>Here we will use a popular encoding method called <b>one-hot encoding</b>.
</center>
    </div>

First, let's re-load the data, cast the numerical features to floats, and index the class column:

In [None]:
# Read CSV data in a DataFrame
filename = 'data/session3/network_data.csv'
data = spark.read.option("header","true").csv(filename)
    
# categorical_features is assumed to be the list of categorical column names
# identified in the "Data loading" section
numeric_features = [ x for x in data.columns if x not in categorical_features and x != 'class' ]

# Cast numeric columns to float, keep the class and categorical columns as strings
from pyspark.sql.functions import col
data = data.select([col(col_name).cast("float") for col_name in numeric_features]
                   + [ col('class') ]
                   + [ col(col_name) for col_name in categorical_features ])

# Stages
stages = []

# Index class 
from pyspark.ml.feature import StringIndexer
label_stringIdx = StringIndexer(inputCol = 'class', outputCol = 'label')
stages += [label_stringIdx]



We will implement one-hot encoding by adding two steps to the MLlib pipeline, **for each categorical variable**:
1. A `StringIndexer`, to convert categorical values to integers
2. A `OneHotEncoder`, to represent these integers as binary vectors that imply no ordering
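
The two steps can be sketched in plain Python as shown below, on a hypothetical categorical column with three values. This only illustrates the encoding: Spark's `OneHotEncoder` produces sparse vectors and, by default, drops the last category (`dropLast=True`), which the `drop_last` argument of the hypothetical helper `one_hot` mimics.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0/1.0 values with at most one 1.0."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    # When the last category is dropped, it is encoded as the all-zeros vector
    if index < size:
        vector[index] = 1.0
    return vector

# Hypothetical category values, for illustration only
categories = ["tcp", "udp", "icmp"]

# Step 1: convert each category to an integer index
indices = {category: index for index, category in enumerate(categories)}

# Step 2: convert each index to a one-hot vector
encoded = {category: one_hot(indices[category], len(categories)) for category in categories}
for category, vector in encoded.items():
    print(category, vector)
```

Dropping one category avoids a redundant column, since the last category is already identified by all other entries being zero.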

<ul style="list-style-image: url('images/do.png');">
    <li>Define the pipeline stages to transform categorical features using <code>StringIndexer</code> and <code>OneHotEncoder</code> from <code>pyspark.ml.feature</code></li>
    </ul>

<ul style="list-style-image: url('images/do.png');">
    <li>Define the pipeline stage to create column <code>features</code> using <code>VectorAssembler</code> from <code>pyspark.ml.feature</code></li>
    </ul>

<ul style="list-style-image: url('images/do.png');">
    <li>Apply the transformation pipeline to the dataset</li>
    </ul>

<ul style="list-style-image: url('images/do.png');">
    <li>Split the dataset into training (70%) and test (30%) sets</li>
    </ul>

<ul style="list-style-image: url('images/do.png');">
    <li>Fit a Random Forest classifier on the training set and use it to make predictions on the test set</li>
    </ul>

<ul style="list-style-image: url('images/do.png');">
    <li>Evaluate the accuracy of your classifier</li>
    </ul>

<div class="alert alert-block alert-success">
<center>
Categorical variables did not bring much improvement.
</center>
    </div>

# Recap


<div class="alert alert-block alert-success">
<center>
    </center>
    &#x2611; Get familiar with <b>supervised classification</b><br/>
    &#x2611; Try <b>Spark's MLlib classification</b> library <br/>
    &#x2611; Apply supervised classification to <b>network intrusion detection</b><br/>
    &#x2611; Practice with Spark <b>DataFrames</b>
</div>
