# Predict Wikipedia pageviews (for mobile and for desktop)

Dataset: This file contains a count of pageviews to the English-language Wikipedia over nearly 6 weeks, grouped by timestamp (down to a one-second resolution level) and site (mobile or desktop).
 * from 2015-03-16T00:00:00 Monday
 * stop 2015-04-25T15:59:59 Saturday
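
As a quick sanity check on the date range above, here is a minimal sketch in plain Python (no Spark needed). The row estimate assumes exactly one row per second per site, which is an assumption for illustration and not something the dataset description guarantees.

```python
from datetime import datetime

# Endpoints copied from the dataset description above.
start = datetime(2015, 3, 16, 0, 0, 0)
stop = datetime(2015, 4, 25, 15, 59, 59)

# The range includes the final second, so add 1.
total_seconds = int((stop - start).total_seconds()) + 1
total_hours = total_seconds // 3600

# Assumption: one row per second for each of the two sites (mobile, desktop).
max_rows = total_seconds * 2

# weekday() returns 0 for Monday and 5 for Saturday.
print(start.weekday(), stop.weekday())
print(total_hours, max_rows)
```

The hour count is also the largest number of rows we should expect once the data is reduced to hourly totals.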

### Step 1: Load the dataset.
We will not cache this dataset.  In a later step we will cache just the aggregate data, saving memory.

In [None]:
from pyspark.sql import functions as fn
from pyspark.sql.types import *

pageViewsDF = spark.read.option("header", True).option("delimiter", "\t").csv("/data/training/pageviews-by-second-tsv.gz")
pageViewsDF = pageViewsDF\
    .withColumn("timestamp", fn.col("timestamp").cast(TimestampType()))\
    .withColumn("requests", pageViewsDF.requests.cast("int"))

pageViewsDF.printSchema()
pageViewsDF.show()

In [None]:
pageViewsDF.take(100)

### Let's graph the traffic for the desktop and mobile websites, as a function of time.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

fig1, ax = plt.subplots()

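# Plot only the first 200 rows per site to keep the figure light.
for site, color in [("mobile", "r"), ("desktop", "b")]:
    data = pageViewsDF.filter(fn.col("site") == site).select("timestamp", "requests").take(200)
    time = [row[0] for row in data]
    requests = [row[1] for row in data]
    ax.scatter(time, requests, marker='o', color=color, label=site)
ax.legend()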

### Step 2: Reduce our data to hourly totals, and cache it.

In [None]:
hourCol = fn.hour(fn.col("timestamp")).alias("Hour")
dayCol = fn.date_format(fn.col("timestamp"), "E").alias("Day")
dateCol = fn.to_date(fn.col("timestamp")).alias("Date")
# This next line is ugly, but what it does is create a timestamp truncated to the start of the hour.
# We could create a Python UDF, but it's faster if we can reuse Spark's built-in functions.
dateTimeCol = fn.from_unixtime(dateCol.cast(TimestampType()).cast(LongType()) + hourCol * 60 * 60).alias("DateTime")

requestsPerHourDF=pageViewsDF.groupBy(dayCol, dateTimeCol).sum("requests").withColumnRenamed("sum(requests)", "TotalRequests").orderBy("DateTime")
requestsPerHourDF.show()
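
To make the `dateTimeCol` arithmetic in Step 2 easier to follow, here is a minimal plain-Python sketch of the same idea: take midnight of the date as epoch seconds, then add `hour * 60 * 60`. The helper name `truncate_to_hour` is hypothetical, and the sample uses UTC for reproducibility, whereas Spark uses the session time zone.

```python
from datetime import datetime, timezone

def truncate_to_hour(ts):
    """Rebuild a timestamp as (midnight of its date) + hour * 3600 seconds."""
    midnight = datetime(ts.year, ts.month, ts.day, tzinfo=ts.tzinfo)
    epoch_seconds = int(midnight.timestamp()) + ts.hour * 60 * 60
    return datetime.fromtimestamp(epoch_seconds, tz=ts.tzinfo)

# Minutes and seconds are discarded, so 13:59:59 still maps to 13:00:00.
sample = datetime(2015, 3, 16, 13, 59, 59, tzinfo=timezone.utc)
print(truncate_to_hour(sample))
```

Every second within the same hour maps to the same value, which is what lets `groupBy` collapse them into one hourly total.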


In [None]:
requestsPerHourDF.cache()

In [None]:
requestsPerHourDF.show()

### Step 3: Graph the requests as a function of time.
Notice several key features:
* Traffic cycles up and down by Time-of-Day
* Traffic cycles up and down by Day-of-Week
* And we might imagine traffic increasing month-to-month, as Wikipedia becomes ever more popular.
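
One simple way to check the cycles listed above is to average the hourly totals by hour-of-day and by day-of-week. The sketch below is plain Python; the helper `average_by` is hypothetical and the rows are made-up toy values, not numbers from the dataset.

```python
from collections import defaultdict
from datetime import datetime

def average_by(rows, key_fn):
    """Average the totals of (datetime, total) rows, grouped by key_fn(datetime)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for when, total in rows:
        key = key_fn(when)
        sums[key] += total
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Hypothetical toy values, NOT taken from the dataset.
toy_rows = [
    (datetime(2015, 3, 16, 3), 100.0),   # Monday 03:00
    (datetime(2015, 3, 16, 15), 300.0),  # Monday 15:00
    (datetime(2015, 3, 21, 3), 80.0),    # Saturday 03:00
    (datetime(2015, 3, 21, 15), 200.0),  # Saturday 15:00
]

by_hour = average_by(toy_rows, lambda d: d.hour)
by_weekday = average_by(toy_rows, lambda d: d.weekday())  # 0 = Monday
print(by_hour)
print(by_weekday)
```

With the real data, the rows could come from `requestsPerHourDF.collect()`. Note that `from_unixtime` produces `DateTime` as a string, so it would need parsing into a `datetime` first.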

In [None]:
# analyse data

### Step 4: Extract numerical features we can use for machine learning.
This involves extracting the hour, day-of-week, and date-time all as Doubles.
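
As a rough illustration of what these three features look like for a single row, here is a plain-Python sketch. The helper `numeric_features` is hypothetical. It uses `isoweekday() - 1` to get the same Monday = 0 coding that the notebook derives from the `"u"` date pattern, and it uses UTC for reproducibility.

```python
from datetime import datetime, timezone

def numeric_features(ts):
    """Return (hour, day_code, unix_time) as floats for one timestamp."""
    hour = float(ts.hour)
    day_code = float(ts.isoweekday() - 1)  # Monday = 0 ... Sunday = 6
    unix_time = float(ts.timestamp())      # seconds since 1970-01-01 UTC
    return hour, day_code, unix_time

sample = datetime(2015, 3, 16, 13, 0, 0, tzinfo=timezone.utc)  # a Monday
print(numeric_features(sample))
```

The hour and day code are small category indices, while the Unix time is a very large number. That difference in magnitude is why the pipeline later rescales the features.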

In [None]:
hourCol = fn.hour(fn.col("DateTime")).cast(DoubleType()).alias("Hour")
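# "u" is the day number of the week (1 = Monday ... 7 = Sunday); subtracting 1 gives a 0-based code.
# Note: on Spark 3.x this pattern letter may require spark.sql.legacy.timeParserPolicy=LEGACY.
dayCodeCol = (fn.date_format(fn.col("DateTime"), "u")-1).cast(DoubleType()).alias("DayCode")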
unixTimeCol = fn.col("DateTime").cast(TimestampType()).cast(DoubleType()).alias("UnixTime")
totalCol = fn.col("TotalRequests").cast(DoubleType()).alias("TotalRequests")

requestsPerHourNumericalDF=requestsPerHourDF.select(hourCol, dayCodeCol, unixTimeCol, "Day", "DateTime", totalCol)


In [None]:
requestsPerHourNumericalDF.show()

### Step 5: Build a linear-regression machine learning pipeline.
We expect roughly linear growth over time, with a cyclical pattern by hour-of-day and day-of-week on top of it.  
To account for the non-linear relationship between traffic and both hour and day-of-week, we'll encode each hour-of-day value and each day-of-week value as its own independent variable in a vector.  This is called one-hot encoding (`OneHotEncoder` in Spark ML).
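
To show what the encoding produces, here is a minimal plain-Python sketch. The function `one_hot` is hypothetical and returns a plain list rather than Spark's sparse vector. The `drop_last` flag mirrors the `dropLast` parameter of Spark's `OneHotEncoder`, which defaults to true and leaves the last category as the all-zeros vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0/1.0 indicator values."""
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

print(one_hot(2, 7))                   # day code 2 (Wednesday) among 7 days
print(one_hot(6, 7))                   # last category: all zeros when dropped
print(one_hot(6, 7, drop_last=False))  # last category kept as its own slot
```

Because each hour and each day gets its own coefficient, the linear model can learn an arbitrary daily and weekly shape instead of a single straight-line slope per feature.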

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression

hourEncoder = OneHotEncoder(inputCol="Hour", outputCol="HourVector")
dayEncoder = OneHotEncoder(inputCol="DayCode", outputCol="DayVector")

# Selects columns from the input dataframe and puts them into a Vector we can use for learning.
vectorizer = VectorAssembler(inputCols=["UnixTime", "DayVector", "HourVector"], outputCol="features")

# Scales each feature to unit standard deviation (no centering by default).  This ensures large inputs don't dwarf small inputs.
standardizer = StandardScaler(inputCol='features', outputCol='standardizedFeatures')

# The linear regression estimator
linearReg = LinearRegression(featuresCol = 'standardizedFeatures', labelCol = 'TotalRequests')
linearReg.setPredictionCol("PredictedTotal")
linearReg.setRegParam(.5)

pipeline = Pipeline().setStages([dayEncoder, hourEncoder, vectorizer, standardizer, linearReg])
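
To see why the scaling stage matters, here is a minimal plain-Python sketch of dividing a column by its sample standard deviation, which is what we assume `StandardScaler` does with its default settings (scale to unit standard deviation, no mean-centering). The helper `scale_to_unit_std` is hypothetical, and returning zeros for a constant column is a choice made for this sketch.

```python
from statistics import stdev

def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation of the column."""
    sd = stdev(values)
    if sd == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [v / sd for v in values]

unix_times = [1426464000.0, 1426467600.0, 1426471200.0]  # three consecutive hours
hour_flags = [0.0, 1.0, 0.0]                             # one one-hot slot

scaled_times = scale_to_unit_std(unix_times)
scaled_flags = scale_to_unit_std(hour_flags)
print(scaled_times)
print(scaled_flags)
```

After scaling, both columns have a standard deviation of 1, so the regularization penalty treats the Unix-time coefficient and the one-hot coefficients on a comparable footing.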


### Step 6: Build a model using our machine learning pipeline

In [None]:
model = pipeline.fit(requestsPerHourNumericalDF)

Predict traffic based on time, day-of-week, and hour-of-day, using the model.

In [None]:
# Predict traffic using our model.  At first we'll predict using the entire dataset.
result=model.transform(requestsPerHourNumericalDF)

In [None]:
result.select("Day", "DateTime", "TotalRequests", "PredictedTotal").show()

### Step 7: Evaluate our model using Training and Test datasets.

In [None]:
# Split up our data into Training Data and Test Data

(trainingData, testData) = requestsPerHourNumericalDF.randomSplit((0.80, 0.20), seed = 42)
(trainingData.count(), testData.count())

In [None]:
# Train a model using the Training Data
testModel = pipeline.fit(trainingData)

In [None]:
# Predict traffic for the held-out Test Data, using the model trained on the Training Data.
testResult=testModel.transform(testData)

In [None]:
testResult.show()

In [None]:
testResult.select("Day", "DateTime", "TotalRequests", "PredictedTotal").show()

In [None]:
# Now let's compute some evaluation metrics against our test dataset
from pyspark.mllib.evaluation import RegressionMetrics

metrics = RegressionMetrics(testResult.rdd.map(lambda r: (r.PredictedTotal, r.TotalRequests)))

rmse = metrics.rootMeanSquaredError
explainedVariance = metrics.explainedVariance
r2 = metrics.r2

print("Root Mean Squared Error: {}".format(rmse))
print("Explained Variance: {}".format(explainedVariance))
print("R2: {}".format(r2))
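
For reference, here is a minimal plain-Python sketch of the textbook definitions of two of these metrics, taking the same `(prediction, observation)` pair order that we pass to `RegressionMetrics`. The helper `rmse_and_r2` is hypothetical and the pairs are toy values, not results from the model.

```python
import math

def rmse_and_r2(pairs):
    """pairs: iterable of (prediction, observation) tuples."""
    pairs = list(pairs)
    n = len(pairs)
    mean_obs = sum(obs for _, obs in pairs) / n
    # Sum of squared errors between observations and predictions.
    ss_err = sum((obs - pred) ** 2 for pred, obs in pairs)
    # Total sum of squares of the observations around their mean.
    ss_tot = sum((obs - mean_obs) ** 2 for _, obs in pairs)
    return math.sqrt(ss_err / n), 1.0 - ss_err / ss_tot

# Hypothetical toy values, NOT model output.
toy_pairs = [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)]
toy_rmse, toy_r2 = rmse_and_r2(toy_pairs)
print(toy_rmse, toy_r2)
```

RMSE is in the same units as `TotalRequests`, so it reads as a typical hourly error. R2 is unitless, and values closer to 1 mean the predictions explain more of the variation in the observations.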


Finally, let's look at the distribution of the error.

In [None]:
# Let's look at the distribution of error.
# First we calculate the residual error and divide it by the RMSE
testResult.selectExpr("TotalRequests", "PredictedTotal", "TotalRequests - PredictedTotal Residual_Error", "(TotalRequests - PredictedTotal) / {} -.25   Within_RMSE".format(rmse)).createOrReplaceTempView("RMSE_Evaluation")

In [None]:
spark.sql("SELECT * from RMSE_Evaluation").show()

### Challenge Exercise 1:
Instead of using LinearRegression, try using `org.apache.spark.ml.regression.DecisionTreeRegressor` (in PySpark: `pyspark.ml.regression.DecisionTreeRegressor`).