<div style="color:red;font-weight:bold;background:yellow;text-align:center;padding:10px;border:solid">
    <h1>RUN IN EMR CLUSTER ONLY</h1>
    If the URL of the current page does not begin with "ec2", then do <strong>NOT</strong> proceed!
</div>

# PySpark Practice
In this practice, you will use the tools you learned in the readings and labs to perform some computation using PySpark.

## Connecting to PySpark

In [1]:
name = !hostname
if "dsa" in name[0]:
    raise RuntimeError("Only run this notebook in the EMR Cluster!")
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("pyspark-lab")
spark_context = SparkContext(conf=conf)

## 1
Bring in the Dataset
For this portion of the practice, you will be using a familiar dataset: the Amazon Review Dataset. Bring in the dataset using Spark SQL in the box below.

(For reference, this dataset is located in `/datasets/amazon_reviews.json`)

In [2]:
from pyspark.sql import SQLContext

# To use Spark SQL we create a SQLContext from SparkContext
sqlContext = SQLContext(spark_context)

# Location of the dataset on HDFS
DATASET = '/datasets/amazon_reviews.json'

# Load the table with the JSON reader and keep the first 1000 rows
df = sqlContext.read.json(DATASET).limit(1000)

In [3]:
df.head()

Row(helpfulness_count=7, helpfulness_score=7, price=-1.0, product_id='B000179R3I', profile_name='Jeanmarie Kabala "JP Kabala"', score=4.0, summary='Periwinkle Dartmouth Blazer', text='I own the Austin Reed dartmouth blazer in every color in which they make it-- it is a staple of my business wardrobe. Well made, quality fabric, nicely tailored, classic lines, appropriate for a professional woman. (something that can be hard to find at times) It should be noted, however, that the periwinkle and raspberry colors are lovely, but the fabric and buttons are slightly different than the "classic" colors(lighter) and the linings and interfacings are not as substantial as the brown, navy, camel, red and ivory. It\'s still a good value, particularly as these are colors appropriate to warmer seasons and climates, but I was a bit surprised.', time=1182816000, title='Amazon.com', user_id='A3Q0VJTUO4EZ56')

In [4]:
df.select('score')

DataFrame[score: double]

## 2
Extract lists of `score` and `helpfulness_score`

In [5]:

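# Convert the 1000-row Spark DataFrame to pandas and pull each column out as a list
pandas_df = df.toPandas()
scores = pandas_df["score"].values.tolist()
helpfulness = pandas_df["helpfulness_score"].values.tolist()

# Second way to do it, staying in Spark until the final collect():
# scores = df.select("score").rdd.flatMap(lambda x: x).collect()
# helpfulness = df.select("helpfulness_score").rdd.flatMap(lambda x: x).collect()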
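
The `flatMap(lambda x: x)` alternative in this cell relies on each `Row` behaving like a tuple, so the identity function unpacks single-column rows into bare values. A minimal plain-Python model of that flattening is sketched below. It uses ordinary tuples as stand-ins for `Row` objects, and the sample values are made up for illustration rather than taken from the dataset.

```python
from itertools import chain

# Stand-ins for the single-column Rows returned by df.select("score");
# these values are illustrative, not taken from the dataset.
rows = [(4.0,), (5.0,), (3.0,)]

# flatMap applies a function to each element and concatenates the results.
# With the identity function, each 1-tuple contributes its single value.
flattened = list(chain.from_iterable(row for row in rows))

print(flattened)  # [4.0, 5.0, 3.0]
```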


## 3
Calculate the Mean

Use PySpark to parallelize finding the mean of scores

$$
\mu = \frac{\sum_{i = 1}^{n} x_i}{n}
$$

Hint: Use accumulators

In [8]:

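# Distribute the scores across the cluster as an RDD
rdd = spark_context.parallelize(scores)
# Create an accumulator for the running sum (float, since scores are doubles)
acc = spark_context.accumulator(0.0)
# Add every element to the accumulator on the executors
rdd.foreach(lambda x: acc.add(x))
# Read the total back on the driver
total = acc.value

# Compute the mean
mean_score = total / len(scores)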

# Print mean
print(mean_score)

4.268
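
One way to picture the accumulator pattern: each task adds the elements of its own partition, and the partial results are merged into a single total that only the driver reads. The sketch below models that idea in plain Python, with no Spark required. `split_into_partitions` is a hypothetical helper and the values are made up.

```python
def split_into_partitions(values, num_partitions):
    """Hypothetical helper: deal values into roughly equal chunks."""
    return [values[i::num_partitions] for i in range(num_partitions)]

values = [5.0, 4.0, 3.0, 5.0, 1.0, 4.0]  # illustrative scores, not real data
partitions = split_into_partitions(values, 3)

# Each "task" sums only its own partition ...
partial_sums = [sum(part) for part in partitions]
# ... and the partial sums are merged into one total on the "driver".
total = sum(partial_sums)

mean = total / len(values)
print(mean)
```

Because addition is associative and commutative, the order in which partitions finish does not change the total, apart from floating-point rounding. That rounding is a likely reason a manual sum and `rdd.mean()` can differ in the last digits.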


In [9]:
# Sanity check: compare with Spark's built-in mean
rdd.mean()

4.267999999999997

In [10]:
# Sanity check: Spark's built-in population standard deviation
rdd.stdev()

1.203401844771729

## 4
Calculate the Mean

Use PySpark to parallelize finding the mean of helpfulness score

$$
\mu = \frac{\sum_{i = 1}^{n} x_i}{n}
$$

Hint: Use accumulators

In [11]:
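# Distribute the helpfulness scores across the cluster as an RDD
rdd = spark_context.parallelize(helpfulness)
# Create an accumulator for the running sum
acc = spark_context.accumulator(0)
# Add every element to the accumulator on the executors
rdd.foreach(lambda x: acc.add(x))
# Read the total back on the driver
total = acc.value

# Compute the mean
mean_helpfulness_score = total / len(helpfulness)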

# Print mean
print(mean_helpfulness_score)

4.973


In [12]:
# Sanity check: compare with Spark's built-in mean
rdd.mean()

4.972999999999999

In [13]:
# Sanity check: Spark's built-in population standard deviation
rdd.stdev()

12.031137560513553

## 5
Calculate the Standard Deviation

Use PySpark to parallelize finding the standard deviation of the score. You will most likely need two separate parallelized steps: one to compute $(x_i-\mu)^2$ and one for the summation.

$$
\sigma = \sqrt{
    \frac{\sum_{i = 1}^{n} {(x_i - \mu)^2}}{n}
}
$$

Hint: Use accumulators and broadcast variables

In [14]:
from math import sqrt
# Distribute the scores as an RDD
rdd1 = spark_context.parallelize(scores)

# Broadcast the mean so every executor can read it
broadcast = spark_context.broadcast(mean_score)

# First parallel step: compute (x - mu)^2 for every element
temp_residual_score = rdd1.map(lambda x: (x - broadcast.value)*(x - broadcast.value))

# Collect the squared residuals on the driver
residual_score = temp_residual_score.collect()

# Second parallel step: sum the squared residuals with an accumulator
rdd2 = spark_context.parallelize(residual_score)
acc2 = spark_context.accumulator(0.0)
rdd2.foreach(lambda x: acc2.add(x))
# Read the total back on the driver
total = acc2.value

# Divide by n and take the square root
std_score = sqrt(total / len(residual_score))

# Print the standard deviation
print(std_score)


1.2034018447717287
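
The population standard deviation can also be computed in a single pass by carrying a `(count, sum, sum of squares)` triple through the data and merging the triples from each partition. The sketch below models this in plain Python. `seq_op` and `comb_op` are hypothetical names, chosen because they have the shape that `RDD.aggregate(zeroValue, seqOp, combOp)` expects, and the two partitions hold made-up values.

```python
from functools import reduce
from math import sqrt

def seq_op(acc, x):
    """Fold one value into a (count, sum, sum_of_squares) triple."""
    n, s, ss = acc
    return (n + 1, s + x, ss + x * x)

def comb_op(a, b):
    """Merge the triples produced by two partitions."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

# Two made-up partitions standing in for an RDD's partitions.
partitions = [[2.0, 4.0, 4.0, 4.0], [5.0, 5.0, 7.0, 9.0]]

zero = (0, 0.0, 0.0)
per_partition = [reduce(seq_op, part, zero) for part in partitions]
n, s, ss = reduce(comb_op, per_partition, zero)

mu = s / n
# Population variance as E[x^2] - mu^2, matching the divide-by-n formula.
sigma = sqrt(ss / n - mu * mu)
print(mu, sigma)  # 5.0 2.0
```

The `E[x^2] - mu^2` form can lose precision when the mean is large relative to the spread, so the two-step residual approach may still be preferable for such data.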


## 6
Now that you have worked through calculating a standard deviation, calculate the correlation between `score` and `helpfulness_score`.

Correlation is given by:
$$
r_{xy} = \frac{\sigma_{xy}}{\sigma_x\sigma_y}
$$
where $\sigma_{xy}$ is the covariance between $x$ and $y$, given by:
$$
\sigma_{xy} = \frac{
    \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)
}{n-1}
$$

You should parallelize as much as you can!
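
Note that the covariance formula here divides by $n-1$, while the standard deviation formula in section 5 divides by $n$. When the two conventions are mixed, the ratio equals the Pearson coefficient scaled by $n/(n-1)$. The difference is small for large $n$, but it is not exactly $r$. The plain-Python sketch below checks this on made-up data; all variable names are illustrative.

```python
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up values
y = [2.0, 1.0, 4.0, 3.0, 5.0]  # made-up values
n = len(x)

mu_x = sum(x) / n
mu_y = sum(y) / n
co_moment = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
ss_x = sum((a - mu_x) ** 2 for a in x)
ss_y = sum((b - mu_y) ** 2 for b in y)

# Consistent normalization: the 1/n (or 1/(n-1)) factors cancel.
r = co_moment / sqrt(ss_x * ss_y)

# Mixed normalization: covariance over n-1, standard deviations over n.
cov_sample = co_moment / (n - 1)
r_mixed = cov_sample / (sqrt(ss_x / n) * sqrt(ss_y / n))

print(r, r_mixed, r * n / (n - 1))
```

With only five points the scaling is large enough to see. With 1000 rows, the factor $n/(n-1)$ is about 1.001.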

In [15]:
##########
# Std Dev of Helpfulness
##########
# Distribute the helpfulness scores as an RDD
rdd1 = spark_context.parallelize(helpfulness)

# Broadcast the mean so every executor can read it
broadcast = spark_context.broadcast(mean_helpfulness_score)

# First parallel step: compute (x - mu)^2 for every element
temp_residual_score = rdd1.map(lambda x: (x - broadcast.value)*(x - broadcast.value))

# Collect the squared residuals on the driver
residual_helpfulness = temp_residual_score.collect()

# Second parallel step: sum the squared residuals with an accumulator
rdd2 = spark_context.parallelize(residual_helpfulness)
acc2 = spark_context.accumulator(0.0)
rdd2.foreach(lambda x: acc2.add(x))
# Read the total back on the driver
total = acc2.value

# Divide by n and take the square root
std_helpfulness = sqrt(total / len(residual_helpfulness))

# Print the standard deviation
print(std_helpfulness)


12.031137560513553


In [16]:
############
# Covariance between X and Y
############
# Distribute both columns as RDDs
rdd = spark_context.parallelize(scores)
rdd1 = spark_context.parallelize(helpfulness)

# Broadcast both means so every executor can read them
broadcast_mean_score = spark_context.broadcast(mean_score)
broadcast_mean_helpfulness_score = spark_context.broadcast(mean_helpfulness_score)

# Compute the residuals (x - mu_x) and (y - mu_y) in parallel
temp_residual_score = rdd.map(lambda x: (x - broadcast_mean_score.value))
temp_residual_helpfulness = rdd1.map(lambda x: (x - broadcast_mean_helpfulness_score.value))

# Collect the residuals and multiply them pairwise on the driver
residual_score = temp_residual_score.collect()
residual_helpfulness = temp_residual_helpfulness.collect()
product_score_helpfulnessScore = [
    a * b for a, b in zip(residual_score, residual_helpfulness)
]

# Sum the products in parallel with an accumulator
rdd2 = spark_context.parallelize(product_score_helpfulnessScore)
acc2 = spark_context.accumulator(0.0)
rdd2.foreach(lambda x: acc2.add(x))
# Read the total back on the driver
total = acc2.value

# Divide by n - 1 to get the sample covariance
covariance_score_helpfulnessScore = total / (len(product_score_helpfulnessScore) - 1)

# Print the covariance
print(covariance_score_helpfulnessScore)


0.34758358358358327
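
The pairwise multiplication in this cell runs on the driver. One possible way to keep that step distributed is to pair the two columns first and compute each product inside a map. In PySpark, `RDD.zip` can do the pairing, provided both RDDs have the same number of partitions and the same number of elements in each partition. The sketch below models the idea in plain Python. The pairs are made up, the two list slices stand in for partitions, and `residual_product` is a hypothetical helper.

```python
# Made-up paired observations standing in for (score, helpfulness_score).
pairs = [(5.0, 3), (4.0, 0), (1.0, 12), (5.0, 1)]
n = len(pairs)

# The means play the role of the broadcast variables.
mu_x = sum(x for x, _ in pairs) / n
mu_y = sum(y for _, y in pairs) / n

def residual_product(pair):
    """Per-element work that could run inside a map over the paired data."""
    x, y = pair
    return (x - mu_x) * (y - mu_y)

# Split into two "partitions", sum each one, then merge the partial sums.
partitions = [pairs[:2], pairs[2:]]
partial = [sum(residual_product(p) for p in part) for part in partitions]
covariance = sum(partial) / (n - 1)
print(covariance)
```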


In [20]:
##############
# Calculate correlation
##############
correlation_score_helpfulnessScore = covariance_score_helpfulnessScore/(std_score*std_helpfulness)
print(correlation_score_helpfulnessScore)


0.02400722104063107
