### Assignment

In case you are facing issues, please read the following two documents first:

https://github.com/IBM/skillsnetwork/wiki/Environment-Setup

https://github.com/IBM/skillsnetwork/wiki/FAQ

Then, please feel free to ask:

https://coursera.org/learn/machine-learning-big-data-apache-spark/discussions/all

Please make sure to follow the guidelines before asking a question:

https://github.com/IBM/skillsnetwork/wiki/FAQ#im-feeling-lost-and-confused-please-help-me


In [1]:
import findspark
findspark.init()  # locate the local Spark installation before importing pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session, then get the SparkContext used for the RDD examples
spark = SparkSession.builder.getOrCreate()
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

# Exercise
## Part 1
Now let's calculate covariance and correlation ourselves using Apache Spark.

First, we create two random RDDs, which should show almost no correlation.


In [11]:
import random
rddX = sc.parallelize(random.sample(list(range(100)),100))
rddY = sc.parallelize(random.sample(list(range(100)),100))

Now we calculate the means. We explicitly cast the denominator to float so that the division yields a float instead of an int (this matters under Python 2; in Python 3, `/` always performs true division).

In [12]:
meanX = rddX.sum()/float(rddX.count())
meanY = rddY.sum()/float(rddY.count())
print(meanX)
print(meanY)

49.5
49.5


Now we calculate the covariance

In [13]:
rddXY = rddX.zip(rddY)
covXY = rddXY.map(lambda x_y : (x_y[0]-meanX)*(x_y[1]-meanY)).sum()/rddXY.count()
covXY

-0.05
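
As a Spark-free sanity check of the formula used above, the following sketch computes the population covariance (dividing by n, as the notebook does) on a tiny hand-made data set. The helper name `population_covariance` and the sample values are illustrative assumptions, not part of the assignment.

```python
def population_covariance(xs, ys):
    """Mean of the products of deviations, dividing by n (not n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]                                 # rises together with xs
cov_up = population_covariance(xs, ys)
cov_down = population_covariance(xs, ys[::-1])    # falls as xs rises
print(cov_up, cov_down)
```

The sign indicates the direction of the relationship, while the magnitude depends on the units of the data, which is why the covariance is normalized in the next step.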

Covariance is not a normalized measure: its magnitude depends on the scale of the data. We therefore use it to calculate the correlation, but first we need the individual standard deviations.

In [18]:
from math import sqrt
n = rddXY.count()
sdX = sqrt(rddX.map(lambda x : pow(x-meanX,2)).sum()/n)
sdY = sqrt(rddY.map(lambda x : pow(x-meanY,2)).sum()/n)
print(sdX)
print(sdY)

28.86607004772212
28.86607004772212
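
The value above is the population standard deviation, because the sum of squared deviations is divided by n rather than n - 1. A minimal cross-check with Python's standard `statistics` module, assuming the same values 0 to 99, compares the two conventions and the closed form for n consecutive integers:

```python
import statistics
from math import sqrt

values = list(range(100))
n = len(values)

population_sd = statistics.pstdev(values)   # divides by n, like the RDD code above
sample_sd = statistics.stdev(values)        # divides by n - 1
closed_form = sqrt((n * n - 1) / 12)        # sd of n consecutive integers

print(population_sd, sample_sd, closed_form)
```

Both RDDs are permutations of the same 100 integers, which is why `sdX` and `sdY` are identical.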


Now we calculate the correlation

In [20]:
corrXY = covXY / (sdX * sdY)
corrXY

-6.000600060006001e-05
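
To see the normalization at work without Spark, here is a small sketch of the same calculation in plain Python. The function name `pearson` is an illustrative assumption; it follows the notebook's convention of dividing by n throughout.

```python
from math import sqrt

def pearson(xs, ys):
    """Population covariance divided by the product of population standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

xs = list(range(100))
r_same = pearson(xs, [x + 100 for x in xs])   # shifted copy: close to +1
r_reversed = pearson(xs, xs[::-1])            # reversed copy: close to -1

# Reproduce the notebook's result from its printed covariance and standard deviations.
r_notebook = -0.05 / (28.86607004772212 * 28.86607004772212)
print(r_same, r_reversed, r_notebook)
```

Unlike the covariance, the result always lies between -1 and 1, so values from different data sets can be compared directly.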

## Part 2
Now we want to create a correlation matrix from the four RDDs used in the lecture.

In [34]:
from pyspark.mllib.stat import Statistics
import random
column1 = sc.parallelize(list(range(100)))
column2 = sc.parallelize(list(range(100,200)))
column3 = sc.parallelize(list(reversed(range(100))))
column4 = sc.parallelize(random.sample(list(range(100)),100))
# Chained zips yield (((c1, c2), c3), c4); flatten each record into a 4-element list
data = (column1.zip(column2).zip(column3).zip(column4)
        .map(lambda a_b_c_d: [a_b_c_d[0][0][0], a_b_c_d[0][0][1], a_b_c_d[0][1], a_b_c_d[1]]))
Statistics.corr(data)

array([[ 1.        ,  1.        , -1.        , -0.10009001],
       [ 1.        ,  1.        , -1.        , -0.10009001],
       [-1.        , -1.        ,  1.        ,  0.10009001],
       [-0.10009001, -0.10009001,  0.10009001,  1.        ]])
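
Chained `zip` calls nest their results, which is why the lambda above indexes `[0][0][0]` and so on. Python's built-in `zip` nests the same way, so the structure can be inspected with plain lists. This is a sketch only: the constant column `d` is a placeholder assumption standing in for the random column.

```python
a = list(range(100))
b = list(range(100, 200))
c = list(reversed(range(100)))
d = [7] * 100            # placeholder constant instead of random values

nested = list(zip(zip(zip(a, b), c), d))
print(nested[0])         # (((0, 100), 99), 7)

# Unpacking the nested tuple directly avoids the chain of index lookups.
rows = [[w, x, y, z] for (((w, x), y), z) in nested]
print(rows[0])           # [0, 100, 99, 7]
```

Python 3 lambdas cannot unpack tuple parameters, which explains the index-based style in the notebook; a named helper function with an unpacking assignment is one alternative.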

## Part 3
Now let's calculate skewness and kurtosis ourselves using Apache Spark.

In [86]:
rdd = sc.parallelize(range(100))

In [87]:
total = rdd.sum()  # avoid shadowing the built-in sum()
n = rdd.count()
mean = total/n
print(mean)

49.5


In [88]:
from math import sqrt
sd = sqrt(rdd.map(lambda x : pow(x-mean,2)).sum()/n)
sd

28.86607004772212

In [89]:
n = float(n)
skewness = n/((n-1)*(n-2)) * rdd.map(lambda x : pow(x-mean,3)/pow(sd,3)).sum()
skewness

7.323672807257268e-17
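
The heading of this part also mentions kurtosis, which the cells above do not compute. One possible approach, sketched here in plain Python, is to take the fourth standardized moment in the same way the third one is used for skewness. This assumes the population (divide by n) definition with no sample-size adjustment; other conventions exist, so treat it as an illustration rather than the expected solution.

```python
from math import sqrt

values = list(range(100))
n = len(values)
mean = sum(values) / n
sd = sqrt(sum((x - mean) ** 2 for x in values) / n)

# Third and fourth standardized moments (population versions, dividing by n).
skewness = sum(((x - mean) / sd) ** 3 for x in values) / n
kurtosis = sum(((x - mean) / sd) ** 4 for x in values) / n
excess_kurtosis = kurtosis - 3   # 0 for a normal distribution

print(skewness, kurtosis, excess_kurtosis)
```

With an RDD, the same sum could presumably be written as `rdd.map(lambda x : pow((x-mean)/sd,4)).sum()/n`. Note that the skewness here omits the `n/((n-1)*(n-2))` adjustment factor used in the notebook cell, so the two values differ slightly in general, although both are essentially zero for this symmetric data.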

Congratulations, you are done with the exercise.