# Spark Machine Learning Pipeline

This coursework is about implementing and applying Spark Machine Learning Pipelines, and evaluating them with respect to preprocessing, parametrisation, and scaling.

## 1. Data set initial analysis and summary of pipeline task. (20%)

In [None]:
# import dependencies for creating a data frame
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import *
import csv


# Create the SparkSession and keep a handle on its SparkContext (used as `sc` below)
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext




# create DataFrames from the csv files (spark.read.csv returns a DataFrame, not an RDD)
# no schema has been defined at this point, so let Spark infer the column types
trainRDD = spark.read.csv("hdfs://saltdean/data/data/santander-products/train_ver2.csv", 
                          header=True, mode="DROPMALFORMED", inferSchema=True)

testRDD = spark.read.csv("hdfs://saltdean/data/data/santander-products/test_ver2.csv", 
                          header=True, mode="DROPMALFORMED", inferSchema=True)





# alternatively...
# create RDD from csv files
trainRDD = sc.textFile("hdfs://saltdean/data/data/santander-products/train_ver2.csv")
trainRDD = trainRDD.mapPartitions(lambda x: csv.reader(x))





# alternatively... from https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
# create RDD from csv files
lines = sc.textFile("hdfs://saltdean/data/data/santander-products/train_ver2.csv")
elements = lines.map(lambda l: l.split(","))

# Each line is converted to a tuple.
clients = elements.map(lambda p: (p[0], p[1].strip(),p[2],...))

# The schema is encoded in a string.
schemaString = "name age ..."
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD and register the DataFrame to be used with Spark SQL.
trainRDD = spark.createDataFrame(clients, schema)
trainRDD.createOrReplaceTempView('trainingset')






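# alternatively, as seen in tutorial 8:
lines = sc.textFile("hdfs://saltdean/data/data/santander-products/train_ver2.csv")
header = lines.first()  # the csv has a header row, which int()/float() cannot parse
parts = lines.filter(lambda l: l != header).map(lambda l: l.split(","))
# NOTE: the field names and types below are carried over from the tutorial's
# movie-ratings example and still need replacing with the Santander columns.
trainRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))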

# Create DataFrame and register it to be used with Spark SQL.
trainClients = spark.createDataFrame(trainRDD)
trainClients.createOrReplaceTempView('Clients')

# For testing
trainClients.describe().show() # summary statistics per column
print(trainClients.count()) # number of instances




## 2. Implementation of machine learning pipeline. (25%)
Implement a machine learning pipeline in Spark, including feature extractors, transformers, and/or selectors. Test that your pipeline is correctly implemented and explain your choice of processing steps, learning algorithms, and parameter settings.

In [None]:
# import dependencies for the machine learning pipeline
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder


## 3. Evaluation and test of model. (20%)
Evaluate the performance of your pipeline on the training and test sets (use pyspark.ml.tuning.TrainValidationSplit rather than cross-validation).

## 4. Model fine-tuning (hyperparameter optimization). (35%)
Implement a parameter grid (using pyspark.ml.tuning.ParamGridBuilder), varying at least one feature preprocessing step, one machine learning parameter, and the training set size. Document the training and test performance and the time taken for training and testing. Comment on your findings.