# Spark tutorial

For the linux guys, load ipython notebook with IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark

Two important variables : contexts

In [1]:
sc

<pyspark.context.SparkContext at 0x7f27a8cfcd90>

In [2]:
sqlContext

<pyspark.sql.context.SQLContext at 0x7f27916f36d0>

Ipython notebook supports some shell commands.

##Loading the data
We are using data from [KDD99](http://kdd.ics.uci.edu/databases/kddcup99). Please, check out the website for more information.

In [3]:
import urllib
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")

Reads file and returns a RDD containing each line

In [5]:
featuresFile = sc.textFile("kddcup.data_10_percent.gz")
featuresFile.first()

u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.'

Count number of lines in the file

In [6]:
featuresFile.count()

494021

In [7]:
!cat kddcup.data_10_percent.gz | wc -l

7747


##Basic arithmetic
Basic operations on an RDD.

In [8]:
a = range(100)
p = sc.parallelize(a)

Sum of elements

4950

Sum of square elements

[0, 1]
[0, 1]


328350

Sum of even elements

2500


##Playing with data

Counting the number of lines containing ".normal"

In [16]:
featuresFile.take(1)

[u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.']

Lazy parsing the entire file. How to parse the dataset is described on the website. But basically, all attributes are possible features except the 41st which is the label of the interaction.

In [None]:
# types = [int,str, str, str] + [int]*20 + [float]*7 + [int, int] + [float]*8 + [str]

# for (e,t) in zip(featuresFile.take(1)[0].split(","), types):
#     print e, t

In [39]:
def parseLine(line):
    
    types = [int,str, str, str] + [int]*20 + [float]*7 + [int, int] + [float]*8 + [str]
    
    elems = [ t(e) for (e, t) in zip(line.split(","), types) ]
    
    tag = elems[41]
    
    return (tag, elems)

featuresRDD = featuresFile.map(parseLine)

Counting elements tagged as normal

Find the different values for protocol and service access

In [105]:
# indf = 1 protocol
# indf = 2 service
# Use the distinct function

Try acessing a numerical feature and see how many values are greater than 0

In [106]:
featuresRDD.filter(lambda features: features[1][3] >0).count()

494021

## Caching

Interesting experiments

* You can take advantage of common pipeline elements

In [5]:
%%time
featureToFilter = 1
featuresRDD.filter(lambda features: features[1][3] >0).count()

CPU times: user 8 ms, sys: 4 ms, total: 12 ms
Wall time: 1.25 s


In [6]:
%%time
featureToFilter = 2
featuresRDD.filter(lambda features: features[1][4] >0).count()

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 1.28 s


In [47]:
%%time
featuresRDD.cache() # Store results in memory // Other persistance scheme could be used DISK, MEMORY_AND_DISK ...

featuresRDD.filter(lambda features: features[1][4] >0).count()

featuresRDD.filter(lambda features: features[1][3] >0).count()

CPU times: user 22.8 ms, sys: 8.45 ms, total: 31.2 ms
Wall time: 23.8 s


Compute the mean of the fourth feature

In [50]:
featuresRDD.map(lambda features: features[1][4]).mean()

3025.6102959186424

Test of classic ML operations

##KMeans

Let's get to work ... Implement a k means algorithm on the dataset

Extract some features to decrease computing time

In [101]:
inputRDD = featuresRDD.map(lambda features: features[1][4:15])

In [102]:
inputRDD = inputRDD.sample(False,0.01)

Useful function

Find the closest center in a list if centers to the datapoint: datapoint -> index of the closest centroid

In [62]:
import numpy as np

def closestTo(datapoint, centerList):
    
    datapoint = np.array(datapoint)
    
    distanceList = [ np.sum((datapoint - center)**2)  for center in centerList]
    
    return np.argmin(distanceList)

Random initialisation

In [92]:
N = 10 # Number of centroids
from numpy.random import rand
centroidsList = []
for indCentroid in range(N):
    centroidsList.append(rand(1,11))

In [93]:
closestTo(rand(1,11), centroidsList)

8

Simplest kMeans algorithm:
* Compute centroids of each cluster
* Update centroids
* Repeat

Explanations:
* First transformation: datapoint -> (closest centroid Index , ( datapoint, 1 ) )
* Reduce By Key: Aggregate for each centroid 
    - ( datapoint1, pop1 ) and ( datapoint2, pop2 ) => (datapoint1 + datapoint2 , pop1+pop2)
* So that at the end, 
    - (cluster Index, (sum of datapoint in cluster , number of datapoints in cluster) )
    - Can compute the centroid

In [103]:
for t in range(5):
    
    # Compute centroids of clusters
    
    newCentroids = inputRDD.map(lambda data:
               (closestTo(data, centroidsList) , ( np.array(data) ,1 ) )
               ).reduceByKey( lambda (datapoint1, eff1 ), (datapoint2,eff2):
                             ((datapoint1+datapoint2),(eff1+eff2))
                             ).collect()
    
    # Update centroids with new ones
    
    for stats in newCentroids:
        centroidIndex = stats[0]
        newCentroid = stats[1][0] / stats[1][1]

        centroidsList[centroidIndex] = newCentroid
        
    # Repeat ...

Final centroids

In [104]:
centroidsList

[array([3664247,       0,       0,       0,       0,       1,       0,
              1,       0,       0,       0]),
 array([468,   2,   0,   0,   0,   0,   0,   0,   0,   0,   0]),
 array([[ 0.11603557,  0.41126311,  0.51308471,  0.31779167,  0.87743239,
          0.77976412,  0.24335836,  0.83978181,  0.89069832,  0.46411378,
          0.21869815]]),
 array([26,  4,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 array([1764,  100,    0,    0,    0,    0,    0,    0,    0,    0,    0]),
 array([107, 138,   0,   0,   0,   0,   0,   0,   0,   0,   0]),
 array([  246, 16620,     0,     0,     0,     0,     0,     1,     0,
            0,     0]),
 array([8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([ 339, 1569,    0,    0,    0,    0,    0,    0,    0,    0,    0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]