<a href="https://colab.research.google.com/github/Lawrence-Krukrubo/Advanced-Data-Science/blob/master/spark_ml_pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Java, Spark, and Findspark
This installs Apache Spark 2.4.7, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q  http://apache.osuosl.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
Set the locations where Spark and Java are installed.

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"
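
A wrong path here usually only shows up later, as a hard-to-read error when Spark starts. As a minimal sketch, the hypothetical helper `missing_paths` below (not part of the original notebook) reports which of the two variables are unset or do not point at an existing directory.

```python
import os

def missing_paths(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variables in `names` that are unset or not a directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# Hypothetical usage: an empty list means both directories were found.
problems = missing_paths(os.environ)
print("Missing or invalid:", problems)
```

Running this right after setting the variables gives an early hint if the download or extraction step did not complete.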

# Start a SparkSession


First, let's initialise a spark context if none exists

In [4]:
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext

# getOrCreate() returns the running SparkContext if there is one,
# otherwise it creates a new one from this configuration.
conf = SparkConf().setMaster("local").setAppName("My_App")
sc = SparkContext.getOrCreate(conf=conf)
print('SparkContext Initialised Successfully!')

#spark = SparkSession.builder.master("local[*]").getOrCreate()

SparkContext Initialised Successfully!


Then start a local Spark session.

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('MyApp').getOrCreate()
spark

A Pipeline is a convenient way to organise the data-preprocessing steps of a machine learning workflow.<br>Certain steps must happen before the actual ML begins. These steps are called data preprocessing and/or feature engineering.<br>The cool thing about pipelines is that they give us a recipe: a list of predefined steps that always run in the same order.<br>These steps could include:<br>1. Indexing categorical values, e.g. mapping each class to 0 or 1<br>2. Normalising the range of values per dimension<br>3. One-hot encoding<br>4. Modelling, where we train our ML algorithm.<br>
So the idea is that with pipelines we can keep the same preprocessing and just switch out different modelling algorithms, or different parameter sets of the same algorithm, without changing anything that comes before. This is very handy.<br>The overall idea is that we fuse our complete data processing flow into one single pipeline, which we can then use further downstream.<br>
In Spark ML, a Pipeline behaves like a machine learning algorithm itself and has the methods fit and transform. Calling fit starts the training and returns a fitted PipelineModel; calling transform on that model gives back the transformed data, including the predicted values when the last stage is a model.<br>
One advantage is that we can cross-validate: we can try out many parameter combinations using that very same pipeline, which really speeds up tuning the algorithm.<br>
In summary, pipelines make day-to-day machine learning work easier: we draw on predefined data processing steps, we make sure everything stays aligned, and we can swap algorithms as needed. Once created, a pipeline can be reused in downstream processing such as hyperparameter tuning.
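
To make the fit-then-transform flow concrete, here is a plain-Python sketch of the idea. It is not Spark's implementation: the classes `MeanCenterer`, `Doubler` and `MiniPipeline` are hypothetical, and for brevity `fit` returns the pipeline itself, whereas Spark's `Pipeline.fit()` returns a separate `PipelineModel`.

```python
class MeanCenterer:
    """Estimator-like stage: fit() learns the mean, transform() subtracts it."""
    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean for v in values]


class Doubler:
    """Pure transformer stage: nothing to learn, so fit() is a no-op."""
    def fit(self, values):
        return self

    def transform(self, values):
        return [2 * v for v in values]


class MiniPipeline:
    """Runs stages in order; each stage is fitted on the previous stage's output."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


model = MiniPipeline([MeanCenterer(), Doubler()]).fit([1.0, 2.0, 3.0])
print(model.transform([4.0]))  # [4.0]  -> (4.0 - mean 2.0) * 2
```

Swapping `Doubler` for another stage leaves `MeanCenterer` untouched, which is the property that makes it cheap to try different algorithms behind the same preprocessing.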

Finally, remember that DataFrames in Apache Spark are lazy: transformations only build up an execution plan, and nothing is executed until an action such as show() or count() actually asks for the data.
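
Lazy evaluation can be illustrated without Spark. The sketch below uses a hypothetical `LazyNumbers` class (an analogy only, not how Spark is built): `map` merely records a step, and the work happens when the action `collect` is called.

```python
class LazyNumbers:
    """Records transformations and only runs them when an action is called."""
    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new plan and does no work yet.
        return LazyNumbers(self.data, self.steps + (fn,))

    def collect(self):
        # Action: executes every recorded step in order.
        result = list(self.data)
        for fn in self.steps:
            result = [fn(v) for v in result]
        return result


calls = []

def double(v):
    calls.append(v)  # lets us observe when the function really runs
    return 2 * v

plan = LazyNumbers([1, 2, 3]).map(double)
print(len(calls))      # 0: nothing has run yet
print(plan.collect())  # [2, 4, 6]
print(len(calls))      # 3: the action triggered the work
```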

<h3>1. Data Extraction</h3>

* Note that the Parquet file format is a compressed, columnar store, and its data layout maps well onto the Apache Spark Tungsten memory layout.


In [6]:
# This is the dataset that contains the different folders for reading the accelerometer data
# We will clone this data set
accelerometer_readings = 'https://github.com/wchill/HMP_Dataset.git'

In [7]:
!git clone https://github.com/wchill/HMP_Dataset.git

Cloning into 'HMP_Dataset'...
remote: Enumerating objects: 865, done.
remote: Total 865 (delta 0), reused 0 (delta 0), pack-reused 865
Receiving objects: 100% (865/865), 1010.96 KiB | 13.85 MiB/s, done.


In [8]:
# Let's list out the folders in the HMP_Dataset
!ls HMP_Dataset

Brush_teeth	Drink_glass  Getup_bed	  Pour_water	 Use_telephone
Climb_stairs	Eat_meat     impdata.py   README.txt	 Walk
Comb_hair	Eat_soup     Liedown_bed  Sitdown_chair
Descend_stairs	final.py     MANUAL.txt   Standup_chair


In [9]:
# Let's have a look at one of the folders
!ls HMP_Dataset/Brush_teeth

Accelerometer-2011-04-11-13-28-18-brush_teeth-f1.txt
Accelerometer-2011-04-11-13-29-54-brush_teeth-f1.txt
Accelerometer-2011-05-30-08-35-11-brush_teeth-f1.txt
Accelerometer-2011-05-30-09-36-50-brush_teeth-f1.txt
Accelerometer-2011-05-30-10-34-16-brush_teeth-m1.txt
Accelerometer-2011-05-30-21-10-57-brush_teeth-f1.txt
Accelerometer-2011-05-30-21-55-04-brush_teeth-m2.txt
Accelerometer-2011-05-31-15-16-47-brush_teeth-f1.txt
Accelerometer-2011-06-02-10-42-22-brush_teeth-f1.txt
Accelerometer-2011-06-02-10-45-50-brush_teeth-f1.txt
Accelerometer-2011-06-06-10-45-27-brush_teeth-f1.txt
Accelerometer-2011-06-06-10-48-05-brush_teeth-f1.txt


Let's see the content or structure of the data in these text files, looking at the Brush_teeth folder.

In [10]:
!head HMP_Dataset/Brush_teeth/Accelerometer-2011-04-11-13-28-18-brush_teeth-f1.txt

22 49 35
22 49 35
22 52 35
22 52 35
21 52 34
22 51 34
20 50 35
22 52 34
22 50 34
22 51 35


Let's traverse the folders in HMP_Dataset, create an Apache Spark DataFrame from each text file, and then union all these DataFrames into one overall DataFrame containing all the data.

Let's define the schema of the data frame below

In [11]:
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
                     StructField('x',IntegerType(),True),
                     StructField('y',IntegerType(),True),
                     StructField('z',IntegerType(),True)
])
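
To make the schema concrete, here is a plain-Python sketch of how one space-delimited line from the sample file maps onto the three integer columns. The helper `parse_reading` is hypothetical and only illustrates the mapping; Spark's CSV reader does this work for us.

```python
def parse_reading(line):
    """Split one 'x y z' line into a dict keyed by the schema's column names."""
    x, y, z = (int(part) for part in line.split())
    return {"x": x, "y": y, "z": z}

print(parse_reading("22 49 35"))  # {'x': 22, 'y': 49, 'z': 35}
```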

Now let's list the contents of the dataset folder using the os module

In [12]:
file_list = os.listdir('HMP_Dataset')
file_list

['Drink_glass',
 'final.py',
 'Getup_bed',
 'Descend_stairs',
 'Comb_hair',
 'Sitdown_chair',
 'Liedown_bed',
 '.idea',
 'README.txt',
 'MANUAL.txt',
 'Walk',
 'Eat_soup',
 'Pour_water',
 '.git',
 'Standup_chair',
 'Eat_meat',
 'Climb_stairs',
 'Brush_teeth',
 'impdata.py',
 'Use_telephone']

Now let's keep only the activity folders: every entry whose name contains an underscore, plus Walk. The remaining entries (scripts, text files and hidden folders) are not needed.

In [13]:
file_list_filtered = [i for i in file_list if '_' in i or i == 'Walk']
file_list_filtered

['Drink_glass',
 'Getup_bed',
 'Descend_stairs',
 'Comb_hair',
 'Sitdown_chair',
 'Liedown_bed',
 'Walk',
 'Eat_soup',
 'Pour_water',
 'Standup_chair',
 'Eat_meat',
 'Climb_stairs',
 'Brush_teeth',
 'Use_telephone']
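
The underscore test works for this dataset but depends on how the folders happen to be named. As a sketch of an alternative, assuming the activity data always lives in non-hidden subdirectories, the hypothetical helper `activity_folders` filters on the file system instead. A temporary directory stands in for the dataset so the example is self-contained.

```python
import os
import tempfile

def activity_folders(root):
    """Return the sorted names of non-hidden subdirectories of `root`."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name)) and not name.startswith(".")
    )

# Build a tiny stand-in for the dataset layout to show the behaviour.
with tempfile.TemporaryDirectory() as root:
    for folder in ("Walk", "Brush_teeth", ".git"):
        os.mkdir(os.path.join(root, folder))
    for filename in ("README.txt", "final.py"):
        open(os.path.join(root, filename), "w").close()
    found = activity_folders(root)

print(found)  # ['Brush_teeth', 'Walk']
```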

Okay, so we have all the folders containing data in one list. Now we can iterate over this list.

In [14]:
# Start with no DataFrame; the first file we read becomes the initial DataFrame
df = None
# Next we import tqdm progress bars to see how our code runs
from tqdm import tqdm

from pyspark.sql.functions import lit
# lit() creates a column holding a constant (literal) value, here a string.

# Now let's iterate through the folders
for category in tqdm(file_list_filtered):
    # Now we traverse all through the files in each folder
    data_files = os.listdir('HMP_Dataset/' + category)
    for data_file in data_files:
        # Now we create a temporary dataframe
        # we use our defined schema above
        temp_df = spark.read.option('header','false').option('delimiter',' ').csv('HMP_Dataset/'+ category + '/' + data_file, schema=schema)  
        temp_df = temp_df.withColumn('class',lit(category))  # Adding a class column to the dataframe
        temp_df = temp_df.withColumn('source',lit(data_file))  # Adding a source column to the dataframe
        # now we put a condition if df is empty
        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)  # else union appends the data frames vertically

100%|██████████| 14/14 [00:42<00:00,  3.04s/it]


Let's see the schema of our DataFrame

In [15]:
df.printSchema()

root
 |-- x: integer (nullable = true)
 |-- y: integer (nullable = true)
 |-- z: integer (nullable = true)
 |-- class: string (nullable = false)
 |-- source: string (nullable = false)



Let's see the first 10 rows of the DataFrame...

In [16]:
df.show(10)

+---+---+---+-----------+--------------------+
|  x|  y|  z|      class|              source|
+---+---+---+-----------+--------------------+
| 29| 39| 51|Drink_glass|Accelerometer-201...|
| 29| 39| 51|Drink_glass|Accelerometer-201...|
| 28| 38| 52|Drink_glass|Accelerometer-201...|
| 29| 37| 51|Drink_glass|Accelerometer-201...|
| 30| 38| 52|Drink_glass|Accelerometer-201...|
| 29| 39| 52|Drink_glass|Accelerometer-201...|
| 30| 39| 51|Drink_glass|Accelerometer-201...|
| 29| 39| 52|Drink_glass|Accelerometer-201...|
| 29| 38| 51|Drink_glass|Accelerometer-201...|
| 30| 39| 52|Drink_glass|Accelerometer-201...|
+---+---+---+-----------+--------------------+
only showing top 10 rows



<h3>2. Data Transformation</h3>

Now we need to transform the data and create a numeric representation of the class column, as ML algorithms cannot cope with strings. We will map each class to a numeric index using the StringIndexer. By default, the most frequent class gets index 0.0.

In [17]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol = 'class', outputCol = 'classIndex')
indexed = indexer.fit(df).transform(df)  # This is a new data frame

# Let's see it
indexed.show(10)

+---+---+---+-----------+--------------------+----------+
|  x|  y|  z|      class|              source|classIndex|
+---+---+---+-----------+--------------------+----------+
| 29| 39| 51|Drink_glass|Accelerometer-201...|       2.0|
| 29| 39| 51|Drink_glass|Accelerometer-201...|       2.0|
| 28| 38| 52|Drink_glass|Accelerometer-201...|       2.0|
| 29| 37| 51|Drink_glass|Accelerometer-201...|       2.0|
| 30| 38| 52|Drink_glass|Accelerometer-201...|       2.0|
| 29| 39| 52|Drink_glass|Accelerometer-201...|       2.0|
| 30| 39| 51|Drink_glass|Accelerometer-201...|       2.0|
| 29| 39| 52|Drink_glass|Accelerometer-201...|       2.0|
| 29| 38| 51|Drink_glass|Accelerometer-201...|       2.0|
| 30| 39| 52|Drink_glass|Accelerometer-201...|       2.0|
+---+---+---+-----------+--------------------+----------+
only showing top 10 rows
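
By default the StringIndexer orders labels by descending frequency, so the most frequent label gets index 0.0. The plain-Python sketch below mimics that ordering with a hypothetical `frequency_index` helper. The label counts are made up for illustration and are not the counts in this dataset, and ties (which the sketch does not try to reproduce) are avoided.

```python
from collections import Counter

def frequency_index(labels):
    """Map each label to a float index, most frequent label first."""
    ordered = [label for label, _ in Counter(labels).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}

# Illustrative counts only: 3 x Walk, 2 x Drink_glass, 1 x Eat_soup.
labels = ["Walk"] * 3 + ["Eat_soup"] * 1 + ["Drink_glass"] * 2
print(frequency_index(labels))
# {'Walk': 0.0, 'Drink_glass': 1.0, 'Eat_soup': 2.0}
```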



<h3>3. One-Hot Encoding:</h3>

We can see the class index for each class. Good.
So now we do one-hot-encoding

In [21]:
from pyspark.ml.feature import OneHotEncoder

# In Spark 2.4 the OneHotEncoder is a pure Transformer, so there is no fit() step.
# (From Spark 3.0 onwards it is an Estimator and must be fitted first.)
encoder = OneHotEncoder(inputCol = 'classIndex', outputCol = 'categoryVec')
encoded = encoder.transform(indexed)  # This is a new data frame
encoded.show(10, False)

+---+---+---+-----------+----------------------------------------------------+----------+--------------+
|x  |y  |z  |class      |source                                              |classIndex|categoryVec   |
+---+---+---+-----------+----------------------------------------------------+----------+--------------+
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|
|28 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|
|29 |37 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|
|30 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|
|29 |39 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|
|30 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13
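
The value `(13,[2],[1.0])` is a sparse vector: size 13, with a 1.0 at position 2. The size is 13 rather than 14 because the encoder drops the last category by default (`dropLast=True`), so that category is represented by the all-zeros vector. A plain-Python sketch of this encoding, using a hypothetical `one_hot_sparse` helper:

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values) for a one-hot vector in sparse form."""
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    # The dropped last category is encoded as the all-zeros vector.
    return (size, [], [])

print(one_hot_sparse(2, 14))   # (13, [2], [1.0])
print(one_hot_sparse(13, 14))  # (13, [], [])
```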

<h3>4. VectorAssembler</h3>

The next thing we need to do is combine our values x, y and z into a single vector column, because Spark ML works on vector objects.
So let's import Vectors and the VectorAssembler.

In [22]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
# VectorAssembler creates vectors from ordinary data types for us

vectorAssembler = VectorAssembler(inputCols = ['x','y','z'], outputCol = 'features')
# Now we use the vectorAssembler object to transform our last updated dataframe

features_vectorized = vectorAssembler.transform(encoded)  # note this is a new df

# Let's see the first 10 rows
features_vectorized.show(10, False)

+---+---+---+-----------+----------------------------------------------------+----------+--------------+----------------+
|x  |y  |z  |class      |source                                              |classIndex|categoryVec   |features        |
+---+---+---+-----------+----------------------------------------------------+----------+--------------+----------------+
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|[29.0,39.0,51.0]|
|29 |39 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|[29.0,39.0,51.0]|
|28 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|[28.0,38.0,52.0]|
|29 |37 |51 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|[29.0,37.0,51.0]|
|30 |38 |52 |Drink_glass|Accelerometer-2011-06-01-14-13-57-drink_glass-f1.txt|2.0       |(13,[2],[1.0])|[30.0,38.0,52.0]|
|29 |39 |52 |Drink_glass

<h3>5. Normalizing The Dataset</h3>

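The next thing we do is normalise the data set.
The Normalizer rescales each row's feature vector to unit norm. With p=1.0 every component is divided by the sum of the absolute values in that row, so for non-negative readings like ours all values end up between 0 and 1. The idea is to bring all feature vectors onto the same scale so that none overshadows the others.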

In [23]:
from pyspark.ml.feature import Normalizer
normalizer = Normalizer(inputCol = 'features', outputCol = 'features_norm', p=1.0)  # p=1.0 selects the L1 (Manhattan) norm
normalized_data = normalizer.transform(features_vectorized) # New data frame too.

In [24]:
normalized_data.show(10)

+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
|  x|  y|  z|      class|              source|classIndex|   categoryVec|        features|       features_norm|
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
| 29| 39| 51|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 29| 39| 51|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 28| 38| 52|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[28.0,38.0,52.0]|[0.23728813559322...|
| 29| 37| 51|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[29.0,37.0,51.0]|[0.24786324786324...|
| 30| 38| 52|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[30.0,38.0,52.0]|[0.25,0.316666666...|
| 29| 39| 52|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[29.0,39.0,52.0]|[0.24166666666666...|
|

As seen in the features_norm column, all values now lie between 0 and 1, and the components of each row sum to 1.
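
The numbers can be checked by hand. With p=1.0 each component is divided by the sum of the absolute values in its row, so the row [30, 39, 51] is divided by 120. The plain-Python sketch below uses a hypothetical `l1_normalize` helper; returning an all-zero vector unchanged is a choice made here for the sketch.

```python
def l1_normalize(vector):
    """Divide each component by the sum of absolute values (the L1 norm)."""
    norm = sum(abs(v) for v in vector)
    return [v / norm for v in vector] if norm else list(vector)

print(l1_normalize([30.0, 39.0, 51.0]))  # [0.25, 0.325, 0.425]
```

The same calculation for [29, 39, 51] gives 29/119 as the first component, which matches the 0.2436... shown in the first row of the output.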

<h3>6. Creating The Pipeline:</h3>

The Pipeline constructor below takes an array of Pipeline stages we pass to it.<br>
Here we pass the 4 stages above in the right sequence one after another.

In [25]:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = [indexer,encoder,vectorAssembler,normalizer])

Now let's fit the Pipeline object to our original data frame

In [27]:
data_model = pipeline.fit(df)

Finally, let's transform our data frame using the fitted pipeline model

In [28]:
pipelined_data = data_model.transform(df)

Let's see the first 10 rows

In [29]:
pipelined_data.show(10)

+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
|  x|  y|  z|      class|              source|classIndex|   categoryVec|        features|       features_norm|
+---+---+---+-----------+--------------------+----------+--------------+----------------+--------------------+
| 29| 39| 51|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 29| 39| 51|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[29.0,39.0,51.0]|[0.24369747899159...|
| 28| 38| 52|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[28.0,38.0,52.0]|[0.23728813559322...|
| 29| 37| 51|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[29.0,37.0,51.0]|[0.24786324786324...|
| 30| 38| 52|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[30.0,38.0,52.0]|[0.25,0.316666666...|
| 29| 39| 52|Drink_glass|Accelerometer-201...|       2.0|(13,[2],[1.0])|[29.0,39.0,52.0]|[0.24166666666666...|
|

In [30]:
# first let's list out the columns we want to drop
cols_to_drop = ['x','y','z','class','source','classIndex','features']

# Next let's use a list comprehension with conditionals to select cols we need
selected_cols = [col for col in pipelined_data.columns if col not in cols_to_drop]

# Let's define a new df_train with only the categoryVec and features_norm cols
df_train = pipelined_data.select(selected_cols)

# Let's see our training dataframe.
df_train.show(10)

+--------------+--------------------+
|   categoryVec|       features_norm|
+--------------+--------------------+
|(13,[2],[1.0])|[0.24369747899159...|
|(13,[2],[1.0])|[0.24369747899159...|
|(13,[2],[1.0])|[0.23728813559322...|
|(13,[2],[1.0])|[0.24786324786324...|
|(13,[2],[1.0])|[0.25,0.316666666...|
|(13,[2],[1.0])|[0.24166666666666...|
|(13,[2],[1.0])|  [0.25,0.325,0.425]|
|(13,[2],[1.0])|[0.24166666666666...|
|(13,[2],[1.0])|[0.24576271186440...|
|(13,[2],[1.0])|[0.24793388429752...|
+--------------+--------------------+
only showing top 10 rows
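
The column filter itself is ordinary Python and can be checked without Spark. The sketch below applies the same list comprehension to the column names shown in the output above, hard-coded here as a plain list in place of `pipelined_data.columns`.

```python
# Column names as they appear in the pipelined DataFrame's output.
columns = ['x', 'y', 'z', 'class', 'source', 'classIndex',
           'categoryVec', 'features', 'features_norm']
cols_to_drop = ['x', 'y', 'z', 'class', 'source', 'classIndex', 'features']

# Keep every column that is not in the drop list, preserving order.
selected = [name for name in columns if name not in cols_to_drop]
print(selected)  # ['categoryVec', 'features_norm']
```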

