<h1>Spark MLlib:</h1>

Go to the Apache Spark page at https://spark.apache.org/. 
<UL>
<LI>Documentation > Latest Release > Programming Guides > MLlib lists the available algorithms, e.g., classification, regression, clustering.

**NB.** Feature extraction and transformation algorithms are also available, e.g., VectorIndexer, PCA.  </LI>
    
    
<LI>
Documentation > Latest Release > API Docs > Python provides detailed information about the functions available in the MLlib package.</LI>
 </UL>

<h2>Input format:</h2>

The input dataset should have only one or two columns: one for the 'label' (absent when the data is unlabeled) and one holding all 'features' as a single vector. See the section 'Formatting MLlib input' for more information.

## Linear Regression Example:


In [14]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mllib').getOrCreate()

In [15]:
from pyspark.ml.regression import LinearRegression

#### In this example, the input data is already in the correct format (i.e., it has two columns, 'label' and 'features')
So we can load it directly with the libsvm reader:

In [16]:
all_data = spark.read.format('libsvm').load('sample_linear_regression_data.txt')

In [17]:
all_data.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|
| -7.896274316726144|(10,[0,1,2,3,4,5,...|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -5.082010756207233|(10,[0,1,2,3,4,5,...|
|  7.887786536531237|(10,[0,1,2,3,4,5,...|
| 14.323146365332388|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-0.8995693247765151|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|  5.601801561245534|(10,[0,1,2,3,4,5,...|
|-3.2256352187273354|(10,[0,1,2,3,4,5,...|
| 1.5299675726687754|(10,[0,1,2,3,4,5,...|
| -0.250102447941961|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+
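
For readers unfamiliar with the `libsvm` format loaded above, here is a minimal plain-Python sketch (not Spark's own parser) of how one text line maps to the `label` and `features` columns. Each line has the form `label index:value index:value ...` with one-based indices, which Spark converts to zero-based on loading. The `(10,[0,1,2,...` shown in the table is a sparse vector printed as size, indices, values. The sample line and the helper name `parse_libsvm_line` are made up for illustration.

```python
def parse_libsvm_line(line):
    """Split one libsvm-formatted line into (label, indices, values)."""
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        index, value = item.split(":")
        # The file uses one-based indices; Spark vectors are zero-based.
        indices.append(int(index) - 1)
        values.append(float(value))
    return label, indices, values


# Hypothetical line with three non-zero features (not taken from the data file).
label, indices, values = parse_libsvm_line("-9.5 1:0.25 2:-0.5 4:1.0")
print(label, indices, values)
```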

In [18]:
train_data, test_data = all_data.randomSplit([0.7,0.3])

In [21]:
train_data.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                368|
|   mean| 0.3449742030841025|
| stddev| 10.002903086492546|
|    min|-28.571478869743427|
|    max|  27.78383192005107|
+-------+-------------------+



### Create an instance of Linear Regression model:

In [36]:
# featuresCol and labelCol are the names of the feature and label columns, respectively.
lr = LinearRegression(featuresCol='features',labelCol='label',predictionCol='prediction',regParam=0.0, elasticNetParam=0.0)

### NB. Regularizing the Linear Regression model:

***Ordinary least squares (no regularization):***  regParam = 0  and   elasticNetParam = 0

***Ridge (L2):***                  regParam > 0  and   elasticNetParam = 0

***LASSO (L1):***                  regParam > 0  and   elasticNetParam = 1

***Elastic net (L1+L2):***         regParam > 0  and   elasticNetParam &isin; (0,1)
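
To make the four cases above concrete, the sketch below computes the penalty term that, per the Spark MLlib documentation, is added to the squared-error loss: `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. This is plain Python for illustration only, not Spark code; the function name and the coefficient values are assumptions.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss for a given coefficient vector."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


weights = [3.0, -4.0]  # made-up coefficients: L1 norm 7, squared L2 norm 25

ols = elastic_net_penalty(weights, reg_param=0.0, elastic_net_param=0.0)    # no penalty
ridge = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)  # pure L2
lasso = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)  # pure L1
mixed = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)  # L1 + L2
print(ols, ridge, lasso, mixed)
```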

### Fit the model (train the model):

In [23]:
lrModel = lr.fit(train_data)

In [24]:
lrModel.coefficients

DenseVector([0.304, 1.1027, -0.9633, 2.2453, -0.0762, 1.2084, 0.2702, -0.7544, -0.6322, 0.2282])

In [25]:
lrModel.intercept

0.27827499500963354

In [35]:
training_summary = lrModel.summary

In [28]:
training_summary.r2

0.02963794930112995

In [29]:
training_summary.rootMeanSquaredError

9.840158316048619

### Evaluate the model (test the model)

In [33]:
test_results = lrModel.evaluate(test_data)

In [34]:
test_results.rootMeanSquaredError

11.07457218916498

In [73]:
test_results.residuals.show()

+-------------------+
|          residuals|
+-------------------+
| -27.95456795978699|
|-24.036702338156424|
|-26.787671379819635|
| -20.47985032261661|
|  -19.3821758322171|
| -19.32343899120233|
| -16.61414553142395|
|-16.922863567109914|
|-15.880851617675004|
|-17.404932746758213|
|-16.240651231038118|
|-19.897900090192714|
|-14.869980008755615|
| -17.20311403285779|
|-15.339977390253512|
|-14.555161340950864|
|-14.469243135335658|
|-16.718023956084227|
| -17.51813627408178|
|-13.301707287750471|
+-------------------+
only showing top 20 rows



### Deploy the model (transform the unlabeled data)
At deployment time the data is unlabeled, so we pass only the 'features' column to transform():

In [38]:
predictions = lrModel.transform(test_data.select('features'))

In [39]:
predictions.show()

+--------------------+--------------------+
|            features|          prediction|
+--------------------+--------------------+
|(10,[0,1,2,3,4,5,...|   1.149084531303916|
|(10,[0,1,2,3,4,5,...|   0.549262217219912|
|(10,[0,1,2,3,4,5,...|  3.8378454436235607|
|(10,[0,1,2,3,4,5,...|  0.5952895483431853|
|(10,[0,1,2,3,4,5,...|-0.49081520585130517|
|(10,[0,1,2,3,4,5,...|  1.0482254251977003|
|(10,[0,1,2,3,4,5,...| -1.1894806572405667|
|(10,[0,1,2,3,4,5,...| -0.4038571655660324|
|(10,[0,1,2,3,4,5,...| -1.1456406465345437|
|(10,[0,1,2,3,4,5,...|  0.7127257254471082|
|(10,[0,1,2,3,4,5,...| 0.15499219001662898|
|(10,[0,1,2,3,4,5,...|   4.117215057569414|
|(10,[0,1,2,3,4,5,...| -0.5674047846756034|
|(10,[0,1,2,3,4,5,...|  1.8683465529354497|
|(10,[0,1,2,3,4,5,...|  0.2834944157110795|
|(10,[0,1,2,3,4,5,...|-0.26699156880032465|
|(10,[0,1,2,3,4,5,...| -0.2935151175954693|
|(10,[0,1,2,3,4,5,...|   2.389045447008785|
|(10,[0,1,2,3,4,5,...|  3.7456947123789095|
|(10,[0,1,2,3,4,5,...| 0.2617792

In [47]:
#predictions.join(test_data, predictions.features == test_data.features, 'inner').head(2)

### Regression evaluation metrics:

***Mean Absolute Error (MAE)***: The average of the absolute errors. It is in the same unit as y.

<img src="https://latex.codecogs.com/gif.latex?\frac{1}{n}\sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|" title="\frac{1}{n}\sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|" />

***Mean Squared Error (MSE)***: Large errors are penalized more heavily than with MAE, because the errors are squared before averaging.
Drawback: MSE is in squared units of y, not in the same unit as y.

<img src="https://latex.codecogs.com/gif.latex?\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2" title="\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2" />

***Root Mean Squared Error (RMSE)***: The square root of MSE, so it has the same unit as y. It is reported more often than the two metrics above.

<img src="https://latex.codecogs.com/gif.latex?\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2&space;}" title="\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2 }" />

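***R-Squared***: Strictly speaking, this is not an error metric but a statistical measure of fit: the proportion of the variance in y that the model accounts for. For a least-squares fit with an intercept it lies in [0, 1] on the training data; on held-out data it can be negative when the model predicts worse than the mean of y.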


<img src="https://latex.codecogs.com/gif.latex?\large&space;R^2=1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}" title="\large R^2=1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}" />

***Adjusted R-Squared***: Adjusted R-Squared takes into account the number of independent variables in the model (k in the formula, with n the number of observations) and can help indicate whether a variable is useful or not.

<img src="https://latex.codecogs.com/gif.latex?\large&space;Adjusted\&space;R^2=1-\frac{(n-1)}{[n-(k&plus;1)]}(1-R^2)" title="\large Adjusted\ R^2=1-\frac{(n-1)}{[n-(k+1)]}(1-R^2)" />
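
As a worked check of the formulas above, the following plain-Python sketch computes each metric by hand on a tiny made-up dataset. It is illustrative only (in Spark these values come from the model summary or an evaluator); the function name `regression_metrics` and the sample numbers are assumptions.

```python
import math


def regression_metrics(y_true, y_pred, num_features):
    """Compute MAE, MSE, RMSE, R-squared and adjusted R-squared."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    y_mean = sum(y_true) / n
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (n - 1) / (n - (num_features + 1)) * (1.0 - r2)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


# Made-up labels and predictions for a model with one feature (k = 1).
metrics = regression_metrics(
    y_true=[3.0, 5.0, 7.0, 9.0],
    y_pred=[2.0, 5.0, 8.0, 12.0],
    num_features=1,
)
print(metrics)
```

Note how the single large error (9 vs. 12) dominates MSE and RMSE much more than it does MAE.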



# Formatting MLlib input:

In [48]:
data = spark.read.csv('Ecommerce_Customers.csv', inferSchema=True, header = True)

In [49]:
data.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [52]:
for item in data.head(1)[0]:
    print(item)

mstephenson@fernandez.com
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005


## Set feature column:
We want to use only the 'Avg Session Length', 'Time on App', 'Time on Website', and 'Length of Membership' columns as features.

**VectorAssembler** combines the chosen input columns into a single vector column and assigns a name to that column.
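
Conceptually, the assembler does something like the following for every row. This is a plain-Python sketch for intuition only, not how Spark implements it; the helper name `assemble` is hypothetical, and the row values are the first record printed above.

```python
def assemble(row, input_cols):
    """Collect the chosen columns of one row into a single feature list."""
    return [float(row[col]) for col in input_cols]


# First record of the dataset, as printed by data.head(1) above.
row = {
    "Email": "mstephenson@fernandez.com",
    "Avatar": "Violet",
    "Avg Session Length": 34.49726772511229,
    "Time on App": 12.65565114916675,
    "Time on Website": 39.57766801952616,
    "Length of Membership": 4.0826206329529615,
    "Yearly Amount Spent": 587.9510539684005,
}

input_cols = ["Avg Session Length", "Time on App", "Time on Website", "Length of Membership"]
features = assemble(row, input_cols)
print(features)
```

The non-feature columns (such as 'Email') and the label are left out of the vector; the original columns remain in the DataFrame alongside the new one.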

In [63]:
from pyspark.ml.feature import VectorAssembler

Define a VectorAssembler instance:

In [64]:
assembler = VectorAssembler(inputCols=['Avg Session Length','Time on App','Time on Website',
                                       'Length of Membership'],
                            outputCol = 'features')

Use the VectorAssembler instance to transform the data:

In [65]:
output = assembler.transform(data)

In [66]:
output.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)
 |-- features: vector (nullable = true)



In [67]:
final_data = output.select('features','Yearly Amount Spent')
final_data.show()

+--------------------+-------------------+
|            features|Yearly Amount Spent|
+--------------------+-------------------+
|[34.4972677251122...|  587.9510539684005|
|[31.9262720263601...|  392.2049334443264|
|[33.0009147556426...| 487.54750486747207|
|[34.3055566297555...|  581.8523440352177|
|[33.3306725236463...|  599.4060920457634|
|[33.8710378793419...|   637.102447915074|
|[32.0215955013870...|  521.5721747578274|
|[32.7391429383803...|  549.9041461052942|
|[33.9877728956856...|  570.2004089636196|
|[31.9365486184489...|  427.1993848953282|
|[33.9925727749537...|  492.6060127179966|
|[33.8793608248049...|  522.3374046069357|
|[29.5324289670579...|  408.6403510726275|
|[33.1903340437226...|  573.4158673313865|
|[32.3879758531538...|  470.4527333009554|
|[30.7377203726281...|  461.7807421962299|
|[32.1253868972878...| 457.84769594494855|
|[32.3388993230671...| 407.70454754954415|
|[32.1878120459321...|  452.3156754800354|
|[32.6178560628234...|   605.061038804892|
+--------------------+-------------------+

In [68]:
#split to train-test
train_data , test_data = final_data.randomSplit([0.7,0.3])

In [69]:
train_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                346|
|   mean| 499.89733812560945|
| stddev|  81.56403887266724|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



In [70]:
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                154|
|   mean|  498.0035073885354|
| stddev|  74.25293922502175|
|    min|  275.9184206503857|
|    max|  712.3963268096637|
+-------+-------------------+



### Example: linear regression

In [71]:
# featuresCol defaults to 'features', so only the label column needs to be named.
lr = LinearRegression(labelCol = 'Yearly Amount Spent')

# StringIndexer vs. OneHotEncoder:

##### StringIndexer:
StringIndexer (comparable to LabelEncoder in sklearn) assigns an index in [0, numLabels) to each distinct string. By default the indices are ordered by label frequency, so the most frequent string gets index 0.
The problem is that the resulting numbers suggest an ordering among the categories, and a model may misinterpret the data as ordered when it is not. To overcome this problem, we use One Hot Encoder.
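
The indexing step can be pictured with the plain-Python sketch below. It assumes the default frequency-descending ordering, with ties broken alphabetically; it is not Spark code, and the function name `fit_string_index` and the sample values are made up for illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically for ties (assumed tie-break).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


categories = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_index(categories)
indexed = [mapping[c] for c in categories]
print(mapping)
print(indexed)
```

The indices 0.0, 1.0, 2.0 carry no meaning beyond identity, which is exactly why they should not be fed to a linear model as a numeric feature.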

##### OneHotEncoder:
**Note:** pyspark OneHotEncoder is different from scikit-learn’s OneHotEncoder.

It maps a column of **category indices** to a column of binary vectors. So, we should first encode categorical features as category indices using **StringIndexer**, and then apply the pyspark OneHotEncoder.

**Note:** The last category is not included by default (configurable via dropLast), because including it would make the vector entries always sum to one, and hence linearly dependent. So, with five categories, an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
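
The effect of dropLast can be sketched in plain Python as follows. This only mimics the encoding rule described in the note; it is not the Spark implementation (which produces sparse vectors), and the function name `one_hot` is an assumption.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a binary vector, optionally dropping the last slot."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[int(index)] = 1.0
    return vector


# Five categories, indices 0.0 to 4.0 as produced by an indexer.
second = one_hot(1.0, 5)                      # [0.0, 1.0, 0.0, 0.0]
last = one_hot(4.0, 5)                        # all zeros: the dropped category
last_kept = one_hot(4.0, 5, drop_last=False)  # [0.0, 0.0, 0.0, 0.0, 1.0]
print(second, last, last_kept)
```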



Check Linear_Regression_Consulting_Project.ipynb for an example.


# GridSearch and CrossValidation

In [1]:
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

To see an example, check Linear_Regression_Consulting_Project.ipynb
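
To show what a grid search with cross-validation has to do, here is a plain-Python sketch of the two ingredients: expanding a parameter grid into all combinations, and splitting row indices into folds. It is a simplification under stated assumptions, not Spark's implementation: the helper names are hypothetical, the parameter values are made up, and the folds here are assigned round-robin rather than randomly.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def k_fold_indices(num_rows, num_folds):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = [list(range(i, num_rows, num_folds)) for i in range(num_folds)]
    for held_out in range(num_folds):
        validation = folds[held_out]
        training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield training, validation


param_maps = build_param_grid({
    "regParam": [0.0, 0.1, 1.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
})
splits = list(k_fold_indices(num_rows=6, num_folds=3))

# Every parameter combination is trained once per fold.
total_fits = len(param_maps) * len(splits)
print(len(param_maps), len(splits), total_fits)
```

The cost grows multiplicatively with the grid size and the number of folds, which is worth keeping in mind before adding more values to the grid.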

# Evaluator: 
Once we have applied model.transform() to the test data, we can evaluate the predicted values using an **Evaluator** instance.
We define an evaluator instance, choose one of its available metrics, and call evaluate() on the predicted DataFrame.

#### Different evaluators exist depending on our algorithm:


***RegressionEvaluator***: "rmse" (default), "mse", "r2", "mae", "var" (explained variance)

***BinaryClassificationEvaluator***: "areaUnderROC" (default), "areaUnderPR"

***MulticlassClassificationEvaluator***: "f1" (default), "accuracy", "weightedPrecision", "weightedRecall", "weightedTruePositiveRate", "weightedFalsePositiveRate", "weightedFMeasure", "truePositiveRateByLabel", "falsePositiveRateByLabel", "precisionByLabel", "recallByLabel", "fMeasureByLabel", "logLoss", "hammingLoss"

In [2]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

To see an example, check Linear_Regression_Consulting_Project.ipynb
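
The regression metrics are defined earlier in these notes; for the classification side, the sketch below illustrates the quantity behind "areaUnderROC": the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. It is a plain-Python illustration on made-up scores, not how Spark computes the metric internally, and the function name `area_under_roc` is an assumption.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted scores.
auc = area_under_roc(labels=[0, 0, 1, 1], scores=[0.1, 0.4, 0.35, 0.8])
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.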

# Pipelines:
Define multiple stages as a pipeline, then run the whole pipeline afterwards.

To see an example please check Logistic_Regression_Consulting_Project.ipynb

In [1]:
from pyspark.ml import Pipeline
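
The idea of chaining stages can be sketched in plain Python as below. This is a toy model of the fit/transform pattern, not Spark's `Pipeline` class: all class names here are hypothetical, and the "data" is just a list of numbers. Stages that have a `fit` method act as estimators (fitting returns a transformer); the others are used as transformers directly.

```python
class AddConstant:
    """Toy transformer: adds a fixed amount to every value."""
    def __init__(self, amount):
        self.amount = amount

    def transform(self, rows):
        return [r + self.amount for r in rows]


class MeanCenterer:
    """Toy estimator: learns the mean and returns a transformer that removes it."""
    def fit(self, rows):
        mean = sum(rows) / len(rows)
        return AddConstant(-mean)


class SimplePipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class SimplePipeline:
    """Unfitted pipeline: fits estimators in order on the running data."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator -> transformer
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return SimplePipelineModel(fitted)


pipeline = SimplePipeline([AddConstant(10.0), MeanCenterer()])
model = pipeline.fit([1.0, 2.0, 3.0])         # learns mean 12.0 after the shift
train_out = model.transform([1.0, 2.0, 3.0])
new_out = model.transform([4.0])              # reuses the learned mean
print(train_out, new_out)
```

The benefit is that the same fitted sequence of steps is applied to training data and to new data, so preprocessing cannot drift between the two.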

# Normalizing features:

Examples of normalizers:

**StandardScaler:** Normalizing each feature to have unit standard deviation and/or zero mean

**MinMaxScaler:** Rescaling each feature to a specific range (default [0, 1]). It takes min and max parameters.

Other algorithms can be found at this <a href="https://spark.apache.org/docs/latest/ml-features.html#normalizer"> link </a>.

NB. Input should be in the MLlib format described in 'Formatting MLlib input' (a vector 'features' column, plus 'label'); the scaler is applied to the 'features' column.
See an example in Logistic_Regression_Consulting_Project.ipynb
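
The two scalers can be illustrated on a single feature column with the plain-Python sketch below. It assumes Spark's documented defaults for StandardScaler (scale to unit standard deviation using the sample standard deviation, without centering unless requested) and the usual min-max formula; it is not Spark code, the function names are hypothetical, and the column is assumed not to be constant when scaling by standard deviation.

```python
import statistics


def standard_scale(column, with_mean=False, with_std=True):
    """Optionally center, then divide by the sample standard deviation."""
    mean = statistics.mean(column) if with_mean else 0.0
    std = statistics.stdev(column) if with_std else 1.0  # assumes std > 0
    return [(x - mean) / std for x in column]


def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Linearly rescale the column to the range [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: map every value to the middle of the target range.
        return [0.5 * (new_min + new_max)] * len(column)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in column]


column = [2.0, 4.0, 6.0]  # made-up feature values, sample std = 2.0

unit_std = standard_scale(column)                  # [1.0, 2.0, 3.0]
centered = standard_scale(column, with_mean=True)  # [-1.0, 0.0, 1.0]
rescaled = min_max_scale(column)                   # [0.0, 0.5, 1.0]
print(unit_std, centered, rescaled)
```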

