# CSIT 5800 Introduction to Big Data
## Spring 2020
### Assignment 2 - PySpark

### Description
In this assignment, you will have an opportunity to:
<ul>
<li>apply data pre-processing techniques that you learned in the class to a problem using Spark</li>
<li>apply machine learning techniques that you learned in the class to a problem using Spark</li>
</ul>

<br/>
To get started on this assignment, download the given dataset and carefully read the description on this page. Please note that your program must be implemented entirely in Python.
<br/>

### Intended Learning Outcomes

Upon completion of this assignment, you should be able to:
<ol>
    <li>Demonstrate your understanding of how to pre-process data using the algorithms / techniques described in the class.</li>
    <li>Demonstrate your understanding of how to make predictions using the machine learning algorithms / techniques described in the class.</li>
    <li>Use PySpark to write a Python program that pre-processes data, trains a machine learning model on the training data and classifies the data in the testing set.</li>
</ol>

### Dataset
The dataset contains daily weather observations from numerous Australian weather stations.
The problem is to predict whether or not it will rain tomorrow by training a binary classification model on the target variable RainTomorrow.
The target variable RainTomorrow means: Will it rain the next day? Yes or No.

Note: You should exclude the variable RISK_MM when training a binary classification model. RISK_MM is the amount of rain recorded for the next day, so keeping it would leak the answer to your model and make its evaluation results misleadingly good.

<font color="red">Note: The suggested functions below are for reference only. You can use any functions from PySpark.</font>

## Step 0: Installing PySpark
### Step 0.1 : Install Java
#### Step 0.1.1: Check Java Version

In command prompt:
<pre>java -version</pre>

Note: Spark 2.4.5, the release used in this assignment, requires Java 8.

#### Step 0.1.2
Install Java from the official download website:
<a href="https://www.oracle.com/java/technologies/javase-jdk8-downloads.html">https://www.oracle.com/java/technologies/javase-jdk8-downloads.html</a>


### Step 0.2: Install Apache Spark on Windows
<ol>
    <li>Go to the <a href="http://spark.apache.org/downloads.html">Spark download</a>.</li>
    <li>Select the latest stable release (2.4.5 as of May-2020) of Spark for "Choose a Spark release".</li>
    <li>Select a version that is pre-built for the latest version of Hadoop such as Pre-built for Hadoop 2.7 and later
        for "Choose a package type"</li>
    <li>Click the link next to Download Spark to download the spark-2.4.5-bin-hadoop2.7.tgz</li>
    <li>Extract the files from the downloaded .tgz archive using WinZip or an equivalent tool (right-click the downloaded file and choose "Extract here").</li>
    <li>Make sure that the folder path and the folder name containing Spark files do not contain any spaces.</li>
    <li>Create a folder called "Spark" on your desktop and extract the downloaded file into it as a folder called spark-2.4.5-bin-hadoop2.7, so that all Spark files are in C:\Users\[your_user_name]\Desktop\Spark\spark-2.4.5-bin-hadoop2.7. This folder will be referred to as SPARK_HOME.</li>
    <li>To test if your installation was successful, open Anaconda Prompt, change to SPARK_HOME directory and type bin\pyspark. This should start the PySpark shell which can be used to interactively work with Spark. </li>
    <li>Create a system environment variable in Windows called SPARK_HOME that points to the SPARK_HOME folder path.</li>
</ol>

### Step 0.2: Install Apache Spark on Mac

You can use Homebrew to install Apache Spark.
<ol>
    <li>Install Homebrew using the following command in your terminal:<br/>
        <pre>/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"</pre></li>
    <li>Install Spark with Homebrew
         <ol>
         <li>In your terminal, type: <br />
         <pre>brew install apache-spark</pre></li>
         <li>You can check the version of spark: <br />
         <pre>pyspark --version</pre></li>
         </ol>
    </li>
    <li>You may need to install PySpark by:<br />
    <pre>pip install pyspark</pre>
    </li>
    <li>To know where spark is installed: <br/>
    <pre>brew info apache-spark</pre>
    </li>
    <li>Set the environment variables:<br />
    <pre>export SPARK_HOME="[your_path]/libexec/"</pre>
    </li>
</ol>

### Step 0.3: Using PySpark in Jupyter Notebook

findspark is a library that locates your Spark installation and sets up the environment so that the PySpark library can be imported in a notebook.
To install findspark, run the following in your shell:<br />
<pre>pip install findspark</pre>

In [None]:
import findspark
findspark.init()
findspark.find()

## Step 1: Importing data and exploring the features (8 points)

In [None]:
import pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

### Step 1.1
Read the csv file 'weatherAUS.csv' using 
<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#pyspark.sql.SparkSession">SparkSession.read</a> to create a dataframe.

In [None]:
# Put your statement here



### Step 1.2
Use show() of <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#pyspark.sql.DataFrame">pyspark.sql.DataFrame</a> to print out the first rows of the dataframe created.

In [None]:
# Put your statement here



### Step 1.3
Use printSchema() of <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#pyspark.sql.DataFrame">pyspark.sql.DataFrame</a> to print out the schema of the dataframe created.

In [None]:
# Put your statement here



Which of the features above have a wrong datatype in the schema? What should the correct datatypes be?

<font color=red>(Write your answer here)</font>



## Step 2: Convert the data types of features (7 points)

### Step 2.1
Import data types from 
<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#module-pyspark.sql.types">pyspark.sql.types</a>

In [None]:
# Put your statement here



### Step 2.2
Convert the datatype of features using:
<ul>
<li>withColumn() from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#pyspark.sql.DataFrame">pyspark.sql.DataFrame</a>, and</li>
<li>cast() from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#pyspark.sql.Column">pyspark.sql.Column</a>
to convert the features (columns) into appropriate types 
(<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#module-pyspark.sql.types">pyspark.sql.types</a>)</li>
</ul>

In [None]:
# Put your statements here



## Step 3: Exploring missing values (10 points)

### Step 3.1 
<ul>
<li>Count the missing values in features using:
<ul>
<li>isNull() from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#pyspark.sql.Column">pyspark.sql.Column</a>
to see whether there are missing values in a certain feature (column).</li>
<li>filter() from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#pyspark.sql.DataFrame">pyspark.sql.DataFrame</a>
to filter rows given a condition, and</li>
<li>count() from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#pyspark.sql.DataFrame">pyspark.sql.DataFrame</a>
to count the number of rows in a dataframe</li>
</ul>
</li>
<li>Print the features with missing values and their corresponding number of rows with missing values</li>
</ul>


In [None]:
# Put your statements here



### Step 3.2
Some features have values of "NA", which should be regarded as missing values.<br/>
Print those features with values of "NA" and their corresponding number of rows with values "NA".
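
The idea can be illustrated without Spark. The sketch below is plain Python on a made-up three-row sample (the column values are hypothetical, not taken from weatherAUS.csv), and it is not the PySpark solution expected in the cell below. It assumes the CSV reader keeps the text "NA" as an ordinary string, which is why a null check alone does not find these cells.

```python
import csv
import io

# Hypothetical three-row sample; the real file has many more rows and columns.
sample = io.StringIO(
    "Location,WindGustDir,Rainfall\n"
    "Albury,W,0.6\n"
    "Albury,NA,NA\n"
    "Sydney,NA,1.2\n"
)

rows = list(csv.DictReader(sample))

# For every column, count the cells that hold the literal string "NA".
na_counts = {
    column: sum(1 for row in rows if row[column] == "NA")
    for column in rows[0]
}

# Keep only the columns that actually contain "NA".
na_counts = {column: n for column, n in na_counts.items() if n > 0}
print(na_counts)  # {'WindGustDir': 2, 'Rainfall': 1}
```

In PySpark the same comparison would be expressed as a column condition passed to filter(), followed by count().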

In [None]:
# Put your statements here



## Step 4: Drop the feature, RISK_MM (2 points)
As described in the dataset description, we will need to drop the feature, RISK_MM, using drop() from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#pyspark.sql.DataFrame">pyspark.sql.DataFrame</a>

In [None]:
# Put your statement here



## Step 5: Processing the date feature (5 points)

### Step 5.1
Import the to_date(), year(), month(), dayofmonth()
from
<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions">pyspark.sql.functions</a> module.

In [None]:
# Put your statement here



### Step 5.2
Use the to_date() function to convert the <strong>Date</strong> attribute to DateType.

In [None]:
# Put your statement here



### Step 5.3
Extract the <strong>year</strong>, <strong>month</strong>, <strong>day</strong> attributes of the converted <strong>Date</strong> attribute using year(), month() & dayofmonth()
and create the corresponding new features <strong>Year</strong>, <strong>Month</strong> and <strong>Day</strong>.

In [None]:
# Put your statements here



### Step 5.4
Drop the original <strong>Date</strong> feature.

In [None]:
# Put your statement here



## Step 6: Handling missing values of the categorical features (8 points)
### Step 6.1 Find the most frequent value for categorical features with "NA" values 
Use groupBy(), count() & show() of <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read#pyspark.sql.DataFrame">pyspark.sql.DataFrame</a> to list the distinct values of a feature and count the corresponding number of rows.

In [None]:
# Put your statements here



For the categorical features with missing values ("NA"), what is the most frequent value of each of the features?

<font color="red">(Write your answer here)</font>



### Step 6.2
Import when(), lit() from
<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions">pyspark.sql.functions</a> module.

In [None]:
# Put your statement here



### Step 6.3
For each of the categorical features with missing values ("NA"), replace "NA" with the corresponding most frequent value of each of the features using when(), lit(), otherwise().
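
As a rough illustration of the replacement rule only, here is a plain-Python sketch on a hypothetical list of wind directions. It is not the PySpark code asked for below; the list comprehension simply mirrors the when(...).otherwise(...) pattern.

```python
from collections import Counter

# Hypothetical values of one categorical feature.
wind_dir = ["W", "NA", "SE", "W", "NA", "N", "W"]

# Most frequent value among the observed (non-"NA") entries.
observed = [value for value in wind_dir if value != "NA"]
most_frequent = Counter(observed).most_common(1)[0][0]

# Same logic as: when(col == "NA", lit(most_frequent)).otherwise(col)
filled = [most_frequent if value == "NA" else value for value in wind_dir]
print(most_frequent, filled)
```

Note that "NA" is excluded before the most frequent value is chosen; if "NA" happened to be the most common entry of a feature, it should not be used as the replacement.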

In [None]:
# Put your statements here



## Step 7: Handling missing values of the numerical features (10 points)
### Step 7.1 
Print the list of numerical features.

In [None]:
# Put your statement(s) here



### Step 7.2
Import Imputer from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.feature">pyspark.ml.feature</a> module

In [None]:
# Put your statement here



### Step 7.3
Use <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Imputer">pyspark.ml.feature.Imputer</a> to fill in the missing values of the numerical features with the mean.
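
Mean imputation itself is simple; the following plain-Python sketch on hypothetical rainfall values shows what the result should look like. It is only an illustration of the technique, under the assumption that the mean is computed from the observed values only and that missing entries are represented as None.

```python
from statistics import mean

# Hypothetical numerical feature with two missing entries.
rainfall = [0.0, 2.0, None, 4.0, None]

# The fill value is the mean of the observed values only.
observed = [value for value in rainfall if value is not None]
fill_value = mean(observed)

# Replace every missing entry with that mean; observed values stay unchanged.
imputed = [fill_value if value is None else value for value in rainfall]
print(fill_value, imputed)  # 2.0 [0.0, 2.0, 2.0, 4.0, 2.0]
```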

In [None]:
# Put your statements here



## Step 8: Transform the features (10 points)
### Step 8.1
Import skewness from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions">pyspark.sql.functions</a> module.

In [None]:
# Put your statement here



### Step 8.2
We can get the skewness using skewness() from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions">pyspark.sql.functions</a> module.

Find the features whose skewness is larger than 0.75, and print these features together with their skewness values.
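
For reference, skewness measures how asymmetric a distribution is: roughly 0 for symmetric data and positive when there is a long right tail. The sketch below is a plain-Python version using the population definition (third central moment divided by the 1.5 power of the second). That Spark's skewness() uses exactly this definition is an assumption here, and the helper name and sample lists are hypothetical.

```python
def skewness(values):
    """Population skewness: m3 / m2 ** 1.5, where mk is the k-th central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]        # evenly spread around the mean
right_skewed = [0.0, 0.0, 0.0, 1.0, 10.0]    # one large value creates a right tail

print(skewness(symmetric))
print(skewness(right_skewed))
```

The second list exceeds the 0.75 threshold used in this step, while the first does not.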

In [None]:
# Put your statements here



### Step 8.3
Import log1p from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions">pyspark.sql.functions</a> module.

In [None]:
# Put your statement here



### Step 8.4

Apply log transformation on those features with skewness values larger than 0.75 using log1p() from 
<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions">pyspark.sql.functions</a> module.
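
To see why this transformation helps, note that log1p(x) = ln(1 + x) is defined at x = 0 (unlike ln(x)) and compresses large values much more than small ones. The plain-Python sketch below uses hypothetical rainfall values and a hypothetical skewness helper (population definition, assumed rather than confirmed to match Spark) to compare the skewness before and after the transformation.

```python
import math


def skewness(values):
    """Population skewness: m3 / m2 ** 1.5, where mk is the k-th central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature: mostly small values and one very large one.
rainfall = [0.0, 0.0, 0.2, 1.0, 3.0, 12.0, 80.0]

# log1p keeps zeros at zero and pulls the large value towards the rest.
transformed = [math.log1p(v) for v in rainfall]

print(skewness(rainfall), skewness(transformed))
```

The transformed values are still right-skewed in this small example, but noticeably less so than the raw values.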

In [None]:
# Put your statements here



## Step 9: Converting categorical features (10 points)

### Step 9.1 List categorical features
Get and print the list of categorical features (excluding "RainTomorrow").

In [None]:
# Put your statements here



### Step 9.2 Convert categorical features into dummy/indicator features
#### Step 9.2.1
Import <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer">StringIndexer</a> and <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator">OneHotEncoderEstimator</a> from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.feature">pyspark.ml.feature</a> module. (OneHotEncoderEstimator is the Spark 2.4 name; Spark 3.0 and later call this class OneHotEncoder.)

In [None]:
# Put your statement here



#### Step 9.2.2
Use StringIndexer to convert categorical values into category indices for each of the categorical features.

In [None]:
# Put your statement here



#### Step 9.2.3
Use OneHotEncoderEstimator to map a column of category indices to a column of binary vectors for the categorical features.
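
The two-stage encoding of Steps 9.2.2 and 9.2.3 can be pictured with a small plain-Python sketch. It assumes the default behaviour as I understand it: StringIndexer assigns index 0 to the most frequent label, and the one-hot encoder drops the last category, so that category is encoded as an all-zero vector. The sample values are hypothetical, and Spark stores the result as sparse vectors rather than Python lists.

```python
from collections import Counter

# Hypothetical values of one categorical feature.
wind_dir = ["W", "SE", "W", "N", "W", "SE"]

# Stage 1 (indexing): most frequent label gets index 0, next gets 1, and so on.
labels = [label for label, _ in Counter(wind_dir).most_common()]
index_of = {label: index for index, label in enumerate(labels)}
indexed = [index_of[value] for value in wind_dir]

# Stage 2 (one-hot): vector length is number of categories minus one,
# because the last category is represented by all zeros.
size = len(labels) - 1


def one_hot(index):
    return [1 if index == position else 0 for position in range(size)]


encoded = [one_hot(index) for index in indexed]
print(labels)   # ['W', 'SE', 'N']
print(indexed)  # [0, 1, 0, 2, 0, 1]
print(encoded)
```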

In [None]:
# Put your statements here



## Step 10: Training the logistic regression model  (20 points)

### Step 10.1: Creating the feature vector

#### Step 10.1.1
Import VectorAssembler from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.feature">pyspark.ml.feature</a> module.


In [None]:
# Put your statement here



#### Step 10.1.2
Use <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler">VectorAssembler</a>, a feature transformer, to merge the columns created in Step 9.2.3 and the columns of numerical features into a vector column.

In [None]:
# Put your statements here



### Step 10.2 
Is there any other transformation on the dataframe needed?

In [None]:
# Put your statement here



### Step 10.3 Split the data into training data and testing data

Use randomSplit() of <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame">pyspark.sql.DataFrame</a>
to randomly split the DataFrame into training and testing datasets with a ratio of 0.7 to 0.3.
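
A weighted random split assigns each row to a subset independently, so the resulting sizes are only approximately 70% and 30%. The plain-Python sketch below illustrates this under that assumption; the function random_split is a hypothetical helper written for this example, not Spark's implementation. Fixing a seed makes the split reproducible, which is useful when comparing results between runs.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one subset with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds of the normalized weights, e.g. [0.7, 1.0].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last subset also catches any floating-point rounding leftovers.
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```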

In [None]:
# Put your statement here



### Step 10.4 Build the logistic regression model

#### Step 10.4.1
Import LogisticRegression from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.classification">pyspark.ml.classification</a> module.

In [None]:
# Put your statement here



#### Step 10.4.2
<ol>
<li>Initialize a Logistic Regression model by <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression">LogisticRegression()</a> function using the feature vector generated in Step 10.1.2.</li>
<li>Use fit() function to train the logistic regression model using the training feature data.</li>
</ol>

In [None]:
# Put your statements here



#### Step 10.4.3
<ul>
    <li>Get the summary of the model trained on the training set.</li>
    <li>Print the accuracy, objective history, total iterations of the trained model.</li>
</ul>

In [None]:
# Put your statements here



#### Step 10.4.4
Predict the target values for the testing feature data using the transform() function.

In [None]:
# Put your statement here



#### Step 10.4.5
Use show() to list the top 5 rows of the results from Step 10.4.4, showing only "prediction", "RainTomorrow" and the featuresCol specified in the LogisticRegression model.

In [None]:
# Put your statement here



### Step 10.5 Evaluate the results
#### Step 10.5.1 
Import MulticlassClassificationEvaluator from <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.evaluation">pyspark.ml.evaluation</a> module.

In [None]:
# Put your statement here



#### Step 10.5.2
Use <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator">pyspark.ml.evaluation.MulticlassClassificationEvaluator</a> to evaluate the predictions from Step 10.4.4.
Print the evaluation results.
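
As a reminder of what an accuracy figure means, the plain-Python sketch below computes it from a few hypothetical (prediction, label) pairs. It only illustrates the metric. To my knowledge the evaluator's default metric is F1 rather than accuracy, so check which metricName you are reporting.

```python
# Hypothetical (prediction, label) pairs; 1.0 means rain, 0.0 means no rain.
pairs = [(1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0)]

# Accuracy is the fraction of rows whose prediction equals the label.
correct = sum(1 for prediction, label in pairs if prediction == label)
accuracy = correct / len(pairs)
print(accuracy)  # 0.6
```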

In [None]:
# Put your statements here



## Step 11: Pipeline (10 points)

### Step 11.1
Import <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Pipeline">Pipeline</a> from 
<a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml">pyspark.ml</a> package.

In [None]:
# Put your statement here



### Step 11.2

Rewrite Steps 9.2 to 10.5:
<ol>
    <li>Configure an ML pipeline which consists of the StringIndexer, OneHotEncoderEstimator, VectorAssembler, LogisticRegression</li>
    <li>Fit the pipeline to the training data.</li>
    <li>Make predictions on the testing data.</li>
    <li>Evaluate the prediction results as in step 10.5.</li>
</ol>

Note: 
<ul>
    <li>It is fine to have the steps 9.2 to 10.5 slightly rearranged in this step.</li>
    <li>Hence, the evaluation results may be slightly different.</li>
</ul>

In [None]:
# Put your statements here



## Step 12: Submission
Submit your Jupyter notebook (.ipynb) to Canvas.

<center>The end of HW2</center>