### AST 2: Intro to PySpark

### Learning Objectives

At the end of the experiment, you will be able to:

* interact with Spark using Python
* understand Spark DataFrames
* implement linear regression using PySpark

### Dataset

The dataset chosen for this assignment is [Ecommerce customers](https://www.kaggle.com/srolka/ecommerce-customers). It contains 500 records and 8 columns. It holds customer information, such as e-mail, address, and avatar color, along with the following numerical columns:

* Avg Session Length: Average length of in-store style advice sessions
* Time on App: Average time spent on App in minutes
* Time on Website: Average time spent on Website in minutes
* Length of Membership: How many years the customer has been a member
* Yearly Amount Spent

Here, we will use the first four features to perform linear regression with Spark and predict the Yearly Amount Spent by each customer.

### Information

**Why do we need Spark?**

Spark is a widely used technology for handling Big Data quickly and easily. It is an open-source distributed computing framework that aims to offer a clean, pleasant experience similar to that of Pandas, while scaling to large data sets via a distributed architecture under the hood.

Apache Spark is a powerful cluster computing engine designed for fast computation on big data. Spark keeps data in memory (RAM) where possible, which makes processing much faster than repeatedly reading from and writing to disk. It includes the MLlib library for performing machine learning tasks within the Spark framework.

### Introduction

Apache Spark is known as a fast, easy-to-use, general-purpose engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It is well known for its speed, ease of use, generality and ability to run virtually everywhere. Although Spark is one of the most in-demand tools for data engineers, data scientists can also benefit from it for exploratory data analysis, feature extraction, supervised learning and model evaluation.

Spark is a platform for cluster computing that lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.

Each node works on its own subset of the data and carries out part of the total calculation, so both data processing and computation are performed in parallel across the nodes of the cluster. Parallel computation can make certain types of programming tasks much faster.
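
As a rough illustration of this idea (not Spark itself), the following plain-Python sketch splits a list into partitions, lets each worker summarize only its own partition, and then combines the partial results into a mean. The helper names `split_into_partitions`, `partial_sum_count`, and `distributed_mean`, the sample values, and the use of a thread pool as a stand-in for cluster nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a non-empty list into roughly equal contiguous chunks (hypothetical helper)."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_count(partition):
    """Work done by one 'node': summarize only its own partition."""
    return sum(partition), len(partition)


def distributed_mean(data, num_partitions=4):
    """Compute the mean by combining per-partition (sum, count) pairs."""
    partitions = split_into_partitions(data, num_partitions)
    # Threads stand in for cluster nodes here; Spark would instead send
    # each partition to an executor process, possibly on another machine.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 101))  # made-up sample data: 1..100
print(distributed_mean(values))
```

Note that the mean is not computed by averaging per-partition means; each partition returns a (sum, count) pair so that the combined result stays correct even when partitions have different sizes.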

### Setup Steps:

### Importing required packages

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

### PySpark

PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.

<figure>
<img src='https://cdn.iisc.talentsprint.com/CDS/Images/pyspark_components.png' width = 700 px/>
</figure>

**Spark SQL and DataFrame**

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine.

**Streaming**

Running on top of Spark, the streaming feature in Apache Spark enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault tolerance characteristics.

**MLlib**

Built on top of Spark, MLlib is a scalable machine learning library that provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.

**Spark Core**

Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. It provides an RDD (Resilient Distributed Dataset) and in-memory computing capabilities.

#### Install PySpark

In [None]:
!pip install pyspark

#### Start a Spark Session

A Spark session is the unified entry point of a Spark application, introduced in Spark 2.0. It provides a way to interact with Spark's functionality using fewer constructs: instead of separate Spark, Hive, and SQL contexts, everything is now encapsulated in a single Spark session.

In [None]:
# Start spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('LinearRegression').getOrCreate()
spark

### Data Processing using PySpark

#### Loading data into PySpark

To load the dataset we will use the `read.csv` method. The `inferSchema` parameter tells Spark to determine the data type of each column automatically. The `header` and `sep` parameters are also set, because the dataset contains a header row and its values are separated by a vertical bar (`|`).
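
To make these three options concrete, here is a small plain-Python sketch (not Spark's implementation) that reads pipe-separated text with a header row and then guesses a type for each column, which is roughly the job that `sep`, `header`, and `inferSchema` do. The sample rows and the helper `infer_type` are made up for illustration.

```python
import csv
import io

# Made-up sample in the same layout as the dataset: header row, '|' separator.
raw = io.StringIO(
    "Email|Time on App|Time on Website\n"
    "a@example.com|12.5|39.6\n"
    "b@example.com|11.0|37.3\n"
)


def infer_type(values):
    """Return the narrowest type name that fits every value (hypothetical helper)."""
    for type_name, cast in (("int", int), ("double", float)):
        try:
            for v in values:
                cast(v)
            return type_name
        except ValueError:
            continue
    return "string"


reader = csv.DictReader(raw, delimiter="|")  # the header row supplies the column names
rows = list(reader)

# Every parsed field starts out as a string; inference has to scan the values.
schema = {name: infer_type([row[name] for row in rows]) for name in reader.fieldnames}
print(schema)
```

Because inference has to scan the data, it costs an extra pass over the file; for a 500-row dataset this is negligible.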

In [None]:
df = spark.read.csv("ecommerce_customers_.csv", sep="|", header=True, inferSchema=True)  # create a Spark DataFrame

#### Data exploration with PySpark

* Display data types of dataframe columns

In [None]:
# Print the data types
df.dtypes

* Display column details

In [None]:
# Print the Schema of the DataFrame
df.printSchema()

* Display rows

In [None]:
df.show(5)

* Display total number of rows

In [None]:
df.count()

* Display column labels

In [None]:
df.columns

* Display specific columns

In [None]:
columns = ["Email","Time on App","Time on Website"]
df.select(columns).show(5)

* Display the statistics of dataframe

In [None]:
df.describe().show()

* Display the number of distinct values in the *Avatar* column

In [None]:
# Distinct value count
df.select('Avatar').distinct().count()

* Display the number of records for each distinct value in the *Avatar* column

In [None]:
df.groupby('Avatar').count().show(10)

* Plot the count of distinct values in *Avatar* column

In [None]:
DF = df.groupby('Avatar').count().sort("count", ascending= False)
DF.show(8)

In [None]:
plt.figure(figsize=(24, 4))
avatar_counts = DF.toPandas()  # collect the result once as a pandas DataFrame
sns.barplot(x=avatar_counts['Avatar'], y=avatar_counts['count'])
plt.xticks(rotation=90)
plt.show()

* Display the average time spent on the app for each *Avatar* value

In [None]:
df.groupby('Avatar').avg('Time on App').show(5)  # average only the column we need

* Display the records where average time spent on website by user is greater than 37 minutes

In [None]:
df.filter(df['Time on Website'] > 37).show(5)

* Display the minimum Yearly Amount Spent where average time spent on website by user is greater than 39 minutes

In [None]:
from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in min
df.filter(F.col('Time on Website') > 39).agg(F.min('Yearly Amount Spent')).show()

* Display the records where average time spent on app by user is greater than 12 minutes and average time spent on website is less than 37 minutes

In [None]:
from pyspark.sql.functions import col
df.filter((col('Time on App') > 12) & (col('Time on Website') < 37)).show(10, truncate=False)

To learn more about other `pyspark.sql.functions` operations, click [here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html).
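
For intuition about what the filter, aggregate, and group-by calls in this section compute, the following plain-Python sketch applies the same three steps to a handful of in-memory records. It only mirrors the semantics and says nothing about how Spark executes them; the record values are made up, and only the column names come from the dataset.

```python
from collections import defaultdict

# Made-up records using the dataset's column names.
records = [
    {"Avatar": "Violet", "Time on App": 12.6, "Time on Website": 39.5, "Yearly Amount Spent": 587.9},
    {"Avatar": "Green", "Time on App": 11.1, "Time on Website": 37.2, "Yearly Amount Spent": 392.2},
    {"Avatar": "Violet", "Time on App": 11.4, "Time on Website": 39.1, "Yearly Amount Spent": 487.5},
]

# filter(...) keeps only the rows that satisfy the condition.
long_visits = [r for r in records if r["Time on Website"] > 39]

# agg(min(...)) reduces the remaining rows to a single value.
min_spent = min(r["Yearly Amount Spent"] for r in long_visits)

# groupby(...).avg(...) collects the values per key, then averages each group.
groups = defaultdict(list)
for r in records:
    groups[r["Avatar"]].append(r["Time on App"])
avg_time_on_app = {avatar: sum(v) / len(v) for avatar, v in groups.items()}

print(min_spent, avg_time_on_app)
```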

### Linear Regression Model

Linear regression is one of the oldest and most widely used machine learning approaches. It assumes a linear relationship between the dependent variable and one or more independent variables. Fitting the model means finding the line that best fits the scattered data points; this best-fitting line is known as the regression line.
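
To show what "best-fitting line" means in the simplest case, here is a minimal plain-Python sketch of ordinary least squares with a single feature, using the closed-form slope (covariance divided by variance). This is a textbook illustration and not necessarily the solver Spark uses internally; the function name `fit_line` and the sample points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With four features, as in this assignment, the same idea applies, but the model learns one coefficient per feature plus an intercept.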

#### Setting Up DataFrame for Model

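For Spark to accept the data, it needs to be in the form of two columns: a features column holding one vector per row, and a label column. By default these are named `features` and `label`; other names can be passed through the `featuresCol` and `labelCol` parameters.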

* Features are data points of all the attributes to be used for prediction
* Labels are output for each data point
* We will be predicting Label from Features

For the linear regression model, we need to import two classes from PySpark: `VectorAssembler` and `LinearRegression`. `VectorAssembler` is a transformer that combines multiple numeric columns into a single vector column.

To learn more about `VectorAssembler`, click [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html).
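
As a rough mental model of the assembling step, the plain-Python sketch below packs four numeric columns of each row into one list and then keeps only the features and the label. It imitates the shape of the result, not Spark's vector type; the function `assemble` and the row values are made up, and only the column names come from the dataset.

```python
# Made-up rows using the dataset's column names.
rows = [
    {"Avg Session Length": 34.5, "Time on App": 12.7, "Time on Website": 39.6,
     "Length of Membership": 4.1, "Yearly Amount Spent": 587.95},
    {"Avg Session Length": 31.9, "Time on App": 11.1, "Time on Website": 37.3,
     "Length of Membership": 2.7, "Yearly Amount Spent": 392.20},
]

input_cols = ["Avg Session Length", "Time on App", "Time on Website", "Length of Membership"]


def assemble(row, input_cols, output_col="features"):
    """Return a copy of the row with the input columns packed into one list."""
    assembled = dict(row)  # keep the existing columns
    assembled[output_col] = [row[c] for c in input_cols]
    return assembled


output_rows = [assemble(r, input_cols) for r in rows]

# Keep only the two pieces a regression model needs: features and label.
final_rows = [(r["features"], r["Yearly Amount Spent"]) for r in output_rows]
print(final_rows[0])
```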

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

In [None]:
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", "Time on Website", "Length of Membership"],
    outputCol="features",  # name of the output column that combines all the input columns
)

In [None]:
# A new 'features' column is added alongside the existing columns;
# each value in it is a vector holding the four input column values for that row
output = assembler.transform(df)

In [None]:
output.show(10)

In [None]:
output.select("features").show(10, truncate=False)  # display only the 'features' column (one vector of input values per row)

In [None]:
# Complete dataset is represented in 2 columns
final_data = output.select("features",'Yearly Amount Spent')

#### Splitting the data into Training and Test set

In [None]:
# Split the data into train and test sets (approximately 70% training data, 30% testing data)
train_data, test_data = final_data.randomSplit([0.7, 0.3], seed=42)  # a fixed seed makes the split reproducible

In [None]:
train_data.describe().show()

In [None]:
test_data.describe().show()

#### Create a Linear Regression Model object and fit on train data

In [None]:
regressor = LinearRegression(featuresCol="features", labelCol="Yearly Amount Spent")

# Fit the model on the training set
model = regressor.fit(train_data)

#### Predicting the Test set results

In [None]:
predict = model.transform(test_data)

predict.show(10)  # shows all columns, including the new 'prediction' column

#### Evaluating Model Performance

In [None]:
metrics = model.evaluate(test_data)                             # Using evaluate method we can verify our model's performance

print('Mean absolute error: {}'.format(metrics.meanAbsoluteError))
print('Root mean squared error: {}'.format(metrics.rootMeanSquaredError))
print('R_squared value: {}'.format(metrics.r2))
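
The three metrics printed above can also be computed by hand from paired actual and predicted values. The sketch below uses the usual textbook definitions in plain Python; it is assumed, rather than verified here, that these match the values Spark reports. The function name `regression_metrics` and the sample numbers are made up.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # variation left unexplained
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation in the label
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up actual and predicted values.
y_true = [500.0, 400.0, 600.0, 500.0]
y_pred = [510.0, 390.0, 590.0, 510.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, rmse, r2)
```

MAE and RMSE are in the same units as the label, so they are best judged against the typical size of Yearly Amount Spent; an R-squared close to 1 means the model explains most of the variation in the label.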

To learn more about other operations in PySpark, click [here](https://cdn.iisc.talentsprint.com/CDS/cheatSheet_pyspark.pdf).