# Tutorial 6

**This tutorial will cover:**

* Examples of PySpark MLlib (DataFrame-based)

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Machine Learning Practice").getOrCreate()

# Read the dataset.
dataset = spark.read.csv("test-data-6.csv", header=True, inferSchema=True)
dataset.show()

+-----+---+----------+------+
| Name|Age|Experience|Salary|
+-----+---+----------+------+
|Steve| 31|        10| 30000|
| Bill| 30|         8| 25000|
| John| 29|         4| 20000|
| Paul| 24|         3| 20000|
|Chris| 21|         1| 15000|
|  Tom| 23|         2| 18000|
+-----+---+----------+------+



In [2]:
# View the schema.
dataset.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



In [3]:
# View the columns in the dataset.
dataset.columns

['Name', 'Age', 'Experience', 'Salary']

`VectorAssembler` is a transformer that combines a given list of columns into a single vector column ([MLlib docs](https://spark.apache.org/docs/3.3.1/ml-features.html#vectorassembler)). In machine learning, the input columns used to make a prediction are called "features", so you will hear terms like "feature column", "input features", or "independent features". The vector column that `VectorAssembler` produces holds all of the input features for a row, which is why it is often given a name such as "features" or "input features".

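You will specify the columns that will be used as inputs (i.e. the independent variables or x-variables) with `inputCols`.
Then you will name the output column with `outputCol`. This column holds the assembled feature vector, not the regression output. The value to be predicted (i.e. the dependent variable or y-variable, here "Salary") is passed to the regression model later as the label column.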

The output column is a vector whose entries are the values of the input columns, in the order they were listed.
In other words, the inputs are used to "assemble" the output vector, hence the class name `VectorAssembler`.
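
To make the "assembling" step concrete, here is a minimal plain-Python sketch of what happens to each row. It is only an illustration, not how Spark implements `VectorAssembler`; the helper name `assemble_row` and the dictionary-based row are assumptions made for this example.

```python
def assemble_row(row, input_cols):
    """Combine the chosen columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]

# One row from the dataset above, written as a plain dictionary (an assumption
# for illustration; Spark uses Row objects and vector types instead).
row = {"Name": "Steve", "Age": 31, "Experience": 10, "Salary": 30000}

features = assemble_row(row, ["Age", "Experience"])
print(features)  # [31.0, 10.0]
```

The order of the entries follows the order of the column names, so listing `"Experience"` first would give `[10.0, 31.0]`.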

The following example will show you how to create a model that uses the "Age" and "Experience" columns to predict the salary.

In [4]:
# Create the VectorAssembler.
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["Age","Experience"],outputCol="Input Features")

In [5]:
# Create a dataframe that includes the output column.
output = assembler.transform(dataset)

In [6]:
# View the new dataframe.
output.show()

+-----+---+----------+------+--------------+
| Name|Age|Experience|Salary|Input Features|
+-----+---+----------+------+--------------+
|Steve| 31|        10| 30000|   [31.0,10.0]|
| Bill| 30|         8| 25000|    [30.0,8.0]|
| John| 29|         4| 20000|    [29.0,4.0]|
| Paul| 24|         3| 20000|    [24.0,3.0]|
|Chris| 21|         1| 15000|    [21.0,1.0]|
|  Tom| 23|         2| 18000|    [23.0,2.0]|
+-----+---+----------+------+--------------+



In [7]:
# View the updated columns in the dataset.
output.columns

['Name', 'Age', 'Experience', 'Salary', 'Input Features']

In [8]:
# Use `select()` to keep only the columns that will be used to train and test the model.
train_test_df = output.select("Input Features", "Salary")

In [9]:
train_test_df.show()

+--------------+------+
|Input Features|Salary|
+--------------+------+
|   [31.0,10.0]| 30000|
|    [30.0,8.0]| 25000|
|    [29.0,4.0]| 20000|
|    [24.0,3.0]| 20000|
|    [21.0,1.0]| 15000|
|    [23.0,2.0]| 18000|
+--------------+------+



In [10]:
# Now we can train a model to predict the salary amounts.
from pyspark.ml.regression import LinearRegression

# Create a train/test split. About 75% of the rows go to the train dataset and about 25% to the test dataset.
# The split is random, so the exact sizes are approximate and can change between runs.
train_data, test_data = train_test_df.randomSplit([0.75, 0.25])

# Create the linear regression estimator, naming the features column and the label column.
regressor = LinearRegression(featuresCol="Input Features", labelCol="Salary")

# Fit the model.
model = regressor.fit(train_data)
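
Because `randomSplit()` assigns rows at random, the 75/25 weights hold only approximately, and on a six-row dataset the split sizes can vary noticeably between runs. The plain-Python sketch below illustrates the idea by drawing one random number per row. It is a simplified illustration, not Spark's actual implementation, and the function name `random_split_sketch` is hypothetical.

```python
import random

def random_split_sketch(rows, train_weight=0.75, seed=None):
    """Assign each row to the train or test list independently at random."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_weight.
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = ["Steve", "Bill", "John", "Paul", "Chris", "Tom"]
train_rows, test_rows = random_split_sketch(rows, seed=42)

# Every row ends up in exactly one of the two lists, but the sizes are
# not guaranteed to be exactly 75% and 25% of the data.
print(len(train_rows), len(test_rows))
```

In PySpark, `randomSplit()` also accepts an optional `seed` argument, for example `randomSplit([0.75, 0.25], seed=42)`, if you want the same split on every run.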

In [11]:
# Coefficients
model.coefficients

DenseVector([-263.7076, 1767.624])

In [12]:
# Intercept
model.intercept

19919.060052212404

In [13]:
# Evaluate the model on the test data. The returned summary holds the predicted salaries and the error metrics.
pred_results = model.evaluate(test_data)

In [14]:
# Show the predicted salaries in a dataframe where you can compare the predicted values to the actual salary values.
pred_results.predictions.show()

+--------------+------+-----------------+
|Input Features|Salary|       prediction|
+--------------+------+-----------------+
|    [29.0,4.0]| 20000|19342.03655352618|
+--------------+------+-----------------+
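
A linear regression prediction is the intercept plus the dot product of the coefficients and the feature vector. The sketch below reproduces the prediction for the test row `[29.0, 4.0]` using the coefficient and intercept values printed earlier. The printed coefficients are rounded to four decimal places, so the result agrees with the `prediction` column only approximately.

```python
# Values copied from the outputs of model.coefficients and model.intercept.
coefficients = [-263.7076, 1767.624]  # rounded as displayed: [Age, Experience]
intercept = 19919.060052212404

features = [29.0, 4.0]  # the test row: Age = 29, Experience = 4

# prediction = intercept + sum(coefficient_i * feature_i)
prediction = intercept + sum(c * x for c, x in zip(coefficients, features))
print(round(prediction, 2))  # 19342.04
```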



In [15]:
# View the mean absolute error and the mean squared error of the predictions.
pred_results.meanAbsoluteError, pred_results.meanSquaredError

(657.9634464738192, 432915.89689570636)
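
With a single row in the test set, both metrics can be checked by hand: the mean absolute error is the absolute difference between the actual and predicted salary, and the mean squared error is that difference squared. The sketch below uses the values from the predictions table; the helper functions are written for this example and are not part of PySpark.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# The single test row from the predictions table.
actual = [20000]
predicted = [19342.03655352618]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(mae, mse)
```

Because the test set here contains only one row, these numbers describe the error on that single prediction rather than how the model behaves in general.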