## Spark Group Assignment

Group O-2-8

Attacks fall into four main categories:

* DOS: denial-of-service, e.g. syn flood;
* R2L: unauthorized access from a remote machine, e.g. guessing password;
* U2R:  unauthorized access to local superuser (root) privileges, e.g., various buffer overflow attacks;
* probing: surveillance and other probing, e.g., port scanning.

It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data.  This makes the task more realistic.  Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants.  The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only. 

Agenda
1. Spark-Setup / Load Data
2. Inspect Data
3. Preprocess Data
4. Create A Model
5. Make Predictions
6. Evaluate Predictions

## 1. Spark Setup

In [1]:
import os
print(os.environ['SPARK_HOME'])
dataset_path="/home/ubuntu/challenge_1/"

/usr/local/software/spark


In [2]:
import pandas as pd

In [3]:
#import findspark
#findspark.init()
import pyspark

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Dataset") \
    .getOrCreate()

In [5]:
spark.version

'2.2.0'

### 1.1 Data Loading

Data inspection shows that the data does not have a header. We therefore use a simple for loop to assign the correct names to the columns. Furthermore, we name the target column "connection"; it holds the type of each network connection. The connection types fall into the following categories:

* DOS: denial-of-service, e.g. syn flood;
* R2L: unauthorized access from a remote machine, e.g. guessing password;
* U2R:  unauthorized access to local superuser (root) privileges, e.g., various buffer overflow attacks;
* probing: surveillance and other probing, e.g., port scanning.
* normal: no attack was identified

#### 1.1.1 Train Data

In [6]:
df = spark.read \
    .option("inferSchema", "true") \
    .csv("file://"+dataset_path+"full.data")

In [7]:
df_test = spark.read \
    .option("inferSchema", "true") \
    .csv("file://"+dataset_path+"corrected")

In [8]:
features=["duration", "protocol_type", "service", "flag", "src_bytes","dst_bytes", \
          "land","wrong_fragment","urgent","hot","num_failed_logins","logged_in", \
          "num_compromised","root_shell","su_attempted","num_root","num_file_creations", \
          "num_shells","num_access_files","num_outbound_cmds","is_host_login","is_guest_login", \
          "count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate",\
          "same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count", \
          "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate", \
          "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate",\
          "dst_host_srv_rerror_rate"]

target=["connection"]

fieldnames=features+target

rawnames=df.schema.names

# Create a small function
def updateColNames(df,oldnames,newnames):
    for i in range(len(newnames)):
        df=df.withColumnRenamed(oldnames[i], newnames[i])
    return df

df=updateColNames(df,rawnames,fieldnames)

# df.printSchema()

#### 1.1.2 Creating new attack variable 'label'

For the scope of this assignment, there is no need to classify attacks into the correct group (e.g. probing or DOS). We only have to identify whether or not an attack is taking place. Thus, we create a new binary column 'label':

* Assign the value '0' for no attack (=normal)
* Assign the value '1' for attack

In [9]:
# Adding a Boolean column for attack (=1) or normal (=0)
from pyspark.sql.functions import when

df = df.withColumn('label', when(df["connection"] == 'normal.', 0).otherwise(1))

df.groupBy('label').count().show()

+-----+-------+
|label|  count|
+-----+-------+
|    1|3925650|
|    0| 972781|
+-----+-------+
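
The counts above point to a strongly imbalanced training set. As a quick sanity check, the following plain-Python sketch (not part of the Spark job; the variable names are illustrative) recomputes the class shares from the two printed counts.

```python
# Counts copied from the groupBy('label') output above
attack_count = 3925650   # label == 1
normal_count = 972781    # label == 0

total = attack_count + normal_count
attack_share = attack_count / total
normal_share = normal_count / total

print("total records : %d" % total)
print("attack share  : %.1f%%" % (100 * attack_share))
print("normal share  : %.1f%%" % (100 * normal_share))
```

With roughly four attacks for every normal connection, a classifier that always predicts "attack" would already be right about 80% of the time, which is worth keeping in mind when reading the evaluation scores.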



### 1.2 Loading Test Data

We have to repeat the same process for the test data:

* Assign column names
* Create new column 'label'

In [10]:
df_test = spark.read \
    .option("inferSchema", "true") \
    .csv("file://"+dataset_path+"corrected")

In [11]:
features_test=["duration", "protocol_type", "service", "flag", "src_bytes","dst_bytes", \
          "land","wrong_fragment","urgent","hot","num_failed_logins","logged_in", \
          "num_compromised","root_shell","su_attempted","num_root","num_file_creations", \
          "num_shells","num_access_files","num_outbound_cmds","is_host_login","is_guest_login", \
          "count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate",\
          "same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count", \
          "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate", \
          "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate",\
          "dst_host_srv_rerror_rate"]

target_test=["connection"]

fieldnames_test=features_test+target_test

rawnames_test=df_test.schema.names

# Create a small function
def updateColNames_test(df_test,oldnames,newnames):
    for i in range(len(newnames)):
        df_test=df_test.withColumnRenamed(oldnames[i], newnames[i])
    return df_test

df_test=updateColNames_test(df_test,rawnames_test,fieldnames_test)

# df_test.printSchema()

In [12]:
# Adding a Boolean column for attack (=1) or normal (=0)
from pyspark.sql.functions import when

df_test = df_test.withColumn('label', when(df_test["connection"] == 'normal.', 0).otherwise(1))

df_test.groupBy('label').count().show()

+-----+------+
|label| count|
+-----+------+
|    1|250436|
|    0| 60593|
+-----+------+



## 2. Data Inspection


* How many records do we have?
* What is the schema of our data?
* Is it numerical or categorical?
* Visualize your data

In [13]:
# Print the number of records in the data frame
print('Nb. of records  : %d' % df.count())

Nb. of records  : 4898431


### Check correlation

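from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Correlation.corr expects a single vector column, so assemble the numeric
# columns of interest first (without overwriting df)
corr_cols = ["duration", "src_bytes"]
corr_df = VectorAssembler(inputCols=corr_cols, outputCol="corr_features") \
    .transform(df).select("corr_features")

r1 = Correlation.corr(corr_df, "corr_features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))

r2 = Correlation.corr(corr_df, "corr_features", "spearman").head()
print("Spearman correlation matrix:\n" + str(r2[0]))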

In [14]:
# Check the Schema
df.printSchema()

root
 |-- duration: integer (nullable = true)
 |-- protocol_type: string (nullable = true)
 |-- service: string (nullable = true)
 |-- flag: string (nullable = true)
 |-- src_bytes: integer (nullable = true)
 |-- dst_bytes: integer (nullable = true)
 |-- land: integer (nullable = true)
 |-- wrong_fragment: integer (nullable = true)
 |-- urgent: integer (nullable = true)
 |-- hot: integer (nullable = true)
 |-- num_failed_logins: integer (nullable = true)
 |-- logged_in: integer (nullable = true)
 |-- num_compromised: integer (nullable = true)
 |-- root_shell: integer (nullable = true)
 |-- su_attempted: integer (nullable = true)
 |-- num_root: integer (nullable = true)
 |-- num_file_creations: integer (nullable = true)
 |-- num_shells: integer (nullable = true)
 |-- num_access_files: integer (nullable = true)
 |-- num_outbound_cmds: integer (nullable = true)
 |-- is_host_login: integer (nullable = true)
 |-- is_guest_login: integer (nullable = true)
 |-- count: integer (nullable = true

### 2.1 Exploring numerical variables

In total, the feature list contains 38 numerical variables (41 features minus the 3 categorical ones):

* 32 continuous
* 6 boolean

We use agg() operations to compare means between attack and normal connections, which yields a few insights:

* Duration: the mean duration of normal connections is much longer (about 218 vs. 6)
* Dst_bytes: the mean number of data bytes from destination to source is roughly 6x greater for normal connections
* Hot: the mean number of 'hot' indicators is about 15x smaller for attacks
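
The factors quoted in the list can be checked with a few lines of plain Python. This is only a sketch: the numbers are copied by hand from the agg() outputs in this section, and the dictionary layout is an illustrative choice.

```python
# Group means copied from the agg() outputs in this section: (normal, attack)
means = {
    "duration":  (217.82472416710442, 6.3445052411702525),
    "dst_bytes": (3234.6501113816985, 563.0735605568505),
    "hot":       (0.04953530136793379, 0.003244812960910932),
}

# Ratio of the normal-connection mean to the attack mean for each feature
ratios = {name: normal / attack for name, (normal, attack) in means.items()}
for name, ratio in ratios.items():
    print("%-9s normal/attack = %.1fx" % (name, ratio))
```

Note that means are sensitive to outliers (the maximum duration is 58329), so these ratios are a first hint rather than a robust comparison.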

In [15]:
### Compare averages of numerical features between attack and normal connections

In [16]:
# Some stats on numerical features
df.groupBy('label').agg({'duration': 'mean'}).orderBy("avg(duration)", ascending = False).show(30)

+-----+------------------+
|label|     avg(duration)|
+-----+------------------+
|    0|217.82472416710442|
|    1|6.3445052411702525|
+-----+------------------+



In [17]:
# Some stats on numerical features
df.groupBy('label').agg({'src_bytes': 'mean'}).orderBy("avg(src_bytes)", ascending = False).show(30)

+-----+------------------+
|label|    avg(src_bytes)|
+-----+------------------+
|    1| 1923.030449734439|
|    0|1477.8462500809535|
+-----+------------------+



In [18]:
df.groupBy('label').agg({'dst_bytes': 'mean'}).orderBy("avg(dst_bytes)", ascending = False).show(30)

+-----+------------------+
|label|    avg(dst_bytes)|
+-----+------------------+
|    0|3234.6501113816985|
|    1| 563.0735605568505|
+-----+------------------+



In [19]:
# Some stats on numerical features
df.groupBy('label').agg({'wrong_fragment': 'mean'}).orderBy("avg(wrong_fragment)", ascending = False).show(30)

+-----+--------------------+
|label| avg(wrong_fragment)|
+-----+--------------------+
|    1|8.095474634773859E-4|
|    0|                 0.0|
+-----+--------------------+



In [20]:
# Some stats on numerical features
df.groupBy('label').agg({'hot': 'mean'}).orderBy("avg(hot)", ascending = False).show(30)

+-----+--------------------+
|label|            avg(hot)|
+-----+--------------------+
|    0| 0.04953530136793379|
|    1|0.003244812960910932|
+-----+--------------------+



In [21]:
# Some stats on numerical features
df.groupBy('label').agg({'num_failed_logins': 'mean'}).orderBy("avg(num_failed_logins)", ascending = False).show(30)

+-----+----------------------+
|label|avg(num_failed_logins)|
+-----+----------------------+
|    0|   9.86861379899484E-5|
|    1|  1.553882796479563...|
+-----+----------------------+



In [22]:
# Some stats on numerical features
df.select("duration").describe().show()

+-------+-----------------+
|summary|         duration|
+-------+-----------------+
|  count|          4898431|
|   mean|48.34243046395876|
| stddev|723.3298112546812|
|    min|                0|
|    max|            58329|
+-------+-----------------+



In [23]:
# Create a table for SQL access
# df.registerTempTable("train_data")

In [24]:
# df.describe().toPandas().to_csv("data_summary")

### 2.2. Exploring the categorical variables

Again, we use groupby() commands to explore the categorical variables in terms of their number of categories and their count():

* protocol_type (3 distinct types)
* service       (more than 20 distinct types; show() prints only the top 20 rows)
* flag          (11 distinct types)
* connection    (23 distinct types: 22 attack types plus normal)

In [25]:
# How many distinct protocol types we have
df.groupby('protocol_type').count().show()

+-------------+-------+
|protocol_type|  count|
+-------------+-------+
|          tcp|1870598|
|          udp| 194288|
|         icmp|2833545|
+-------------+-------+



In [26]:
# How many distinct services we have
df.groupby('service').count().show()

+---------+-----+
|  service|count|
+---------+-----+
|   telnet| 4277|
|      ftp| 5214|
|     auth| 3382|
| iso_tsap| 1052|
|   systat| 1056|
|     name| 1067|
|  sql_net| 1052|
|    ntp_u| 3833|
|      X11|  135|
|    pop_3| 1981|
|     ldap| 1041|
|  discard| 1059|
|   tftp_u|    3|
|   Z39_50| 1078|
|  daytime| 1056|
| domain_u|57782|
|    login| 1045|
|     smtp|96554|
|http_2784|    1|
|      mtp| 1076|
+---------+-----+
only showing top 20 rows



In [27]:
# How many distinct flags we have
df.groupby('flag').count().show()

+------+-------+
|  flag|  count|
+------+-------+
|RSTOS0|    122|
|    S3|     50|
|    SF|3744328|
|    S0| 869829|
|   OTH|     57|
|   REJ| 268874|
|  RSTO|   5344|
|  RSTR|   8094|
|    SH|   1040|
|    S2|    161|
|    S1|    532|
+------+-------+



In [28]:
df.groupby('connection').count()\
    .orderBy('count', ascending =False)\
    .show(100)

+----------------+-------+
|      connection|  count|
+----------------+-------+
|          smurf.|2807886|
|        neptune.|1072017|
|         normal.| 972781|
|          satan.|  15892|
|        ipsweep.|  12481|
|      portsweep.|  10413|
|           nmap.|   2316|
|           back.|   2203|
|    warezclient.|   1020|
|       teardrop.|    979|
|            pod.|    264|
|   guess_passwd.|     53|
|buffer_overflow.|     30|
|           land.|     21|
|    warezmaster.|     20|
|           imap.|     12|
|        rootkit.|     10|
|     loadmodule.|      9|
|      ftp_write.|      8|
|       multihop.|      7|
|            phf.|      4|
|           perl.|      3|
|            spy.|      2|
+----------------+-------+



### 2.3 Exploring data visually (@Adolfo)

tbd

In [29]:
# 3a. Create a in-memory DataFrame 
# df2.registerTempTable("network_data")

## 3. Preprocess Data

The data inspection shows that our dataset contains three categorical variables:

* protocol_type
* service
* flag

We are going to use StringIndexer, OneHotEncoder, Vector Assembler and a Pipeline to compute feature transformation.

* **StringIndexer**: converts a single column to an index column (similar to a factor column in R)
* **OneHotEncoder**: One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
* **VectorAssembler**: A transformer that combines a given list of columns into a single vector column.
* **Pipelines**: Facilitates the creation, tuning, and inspection of practical ML workflows. A Spark Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. 
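
To make the first two transformers more concrete, here is a small plain-Python sketch of what they do conceptually. It is not the Spark implementation; the function names are hypothetical, and it assumes the default behaviour of ordering categories by descending frequency and dropping the last category in the one-hot vector.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: i for i, category in enumerate(ordered)}

def one_hot(index, n_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the final category becomes all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

# Toy column with the same frequency order as protocol_type (icmp > tcp > udp)
column = ["icmp"] * 5 + ["tcp"] * 3 + ["udp"] * 1
index = fit_string_index(column)
encoded = {c: one_hot(i, len(index)) for c, i in index.items()}
print(index)
print(encoded)
```

Because the last category is represented by the all-zero vector, a column with 3 categories contributes only 2 slots to the final feature vector. The mapping is learned from whatever data the indexer is fitted on, so it should be fitted once and then reused.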


In [31]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

#### 3.1. Transformations

In [32]:
from pyspark.ml import Pipeline

categoricalColumns = [ \
           "protocol_type", "service", "flag"]

stages = [] # stages in our Pipeline
for col in categoricalColumns:
  
  # Category Indexing with StringIndexer
  indexer = StringIndexer(inputCol=col, outputCol=col+"_index")
   
  # Use OneHotEncoder to convert categorical variables into binary SparseVectors
  encoder = OneHotEncoder(inputCol=col+"_index", outputCol=col+"_vector")
  
  # Add stages.  These are not run here, but will run all at once later on.
  stages += [indexer, encoder]

#### 3.2 VectorAssembler 

This output will include both the numeric columns and the one-hot encoded binary vector columns in our dataset.

We are not going to use all of the numeric features from the dataset. The most important features have been identified while inspecting the data. 

In [33]:
# Transform all numerical features into a vector using VectorAssembler

numericCols_model = ["duration","src_bytes","dst_bytes","land","wrong_fragment","urgent"]

assemblerInputs = [ col + "_vector" for col in categoricalColumns ] + numericCols_model
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

print(assemblerInputs)

['protocol_type_vector', 'service_vector', 'flag_vector', 'duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent']


#### 3.3 Checking the stages

In [34]:
# Check the stages of our pipeline
n=0
for s in stages:
    print('stage number %d %s' %(n,s.getOutputCol()))
    n+=1 

stage number 0 protocol_type_index
stage number 1 protocol_type_vector
stage number 2 service_index
stage number 3 service_vector
stage number 4 flag_index
stage number 5 flag_vector
stage number 6 features


## 4. Create a Model 
 * Create the model
 * Split data into train and test data
 * Train the model with train data
 * Test model predictions with test data

#### 4.1 Create Pipeline

Group together the stages we defined (feature transformations).

In [50]:
from pyspark.ml import Pipeline
# Create a Pipeline.
pipeline = Pipeline(stages=stages)

transformer = pipeline.fit(df)
transformed_df = transformer.transform(df)

# Focus on the relevant columns and define dataset
selection = ["label", "features", "duration", "src_bytes"] + assemblerInputs     # note: duration and src_bytes are also in assemblerInputs, so they are selected twice
dataset = transformed_df.select(selection)

#### 4.2 Splitting dataset into train and test

* 70% train | 30% test
* Setting a seed to ensure reproducibility of the split

In [51]:
(train_data, test_data) = dataset.randomSplit([0.7, 0.3], seed = 123)
print('Training records : %d' % train_data.count())
print('Test records : %d ' % test_data.count())
train_data.cache()

Training records : 3427798
Test records : 1470633 


DataFrame[label: int, features: vector, duration: int, src_bytes: int, protocol_type_vector: vector, service_vector: vector, flag_vector: vector, duration: int, src_bytes: int, dst_bytes: int, land: int, wrong_fragment: int, urgent: int]

#### 4.3 Create a Logistic Regression Model

In [37]:
from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)

# Train model with Training Data
model = lr.fit(train_data)

In [52]:
# Make predictions on test data using the transform() method. Features were specified earlier.
predictions = model.transform(test_data)

#### Evaluation Metrics:

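Binary classifiers separate the elements of a given dataset into one of two possible groups (e.g. attack or no attack). BinaryClassificationEvaluator reports the area under the ROC curve (areaUnderROC) by default, so the score below is an AUC, not an accuracy.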

In [39]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
score = evaluator.evaluate(predictions)
print('Score is : %03f' % score )

Score is : 0.999387
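
Assuming the evaluator is left at its default metric (areaUnderROC), the score can be read as the probability that a randomly chosen attack receives a higher raw score than a randomly chosen normal connection. The following plain-Python sketch computes that pairwise definition on made-up toy data; the function name is hypothetical, and Spark itself derives the value from the ROC curve rather than by enumerating pairs.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example (made-up scores, not taken from the model above)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = area_under_roc(toy_labels, toy_scores)
print("toy AUC: %.2f" % toy_auc)
```

Since AUC only depends on the ranking of the scores, it is not affected by the classification threshold and is less misleading than accuracy on an imbalanced dataset like this one.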


### Model Selection

**Read this:** https://spark.apache.org/docs/2.2.1/ml-tuning.html

Model selection consists of using data to find the best model or parameters for a given task.

 * Inspect available parameters for tuning
 * Use CrossValidator or TrainValidationSplit for parameter tuning
 * Both require the following inputs:
    *  Estimator: algorithm or Pipeline to tune
    *  Set of ParamMaps: parameters to choose from, sometimes called a “parameter grid” to search over
    *  Evaluator: metric to measure how well a fitted Model does on held-out test data

* At a high level, these model selection tools work as follows:

   * They split the input data into separate training and test datasets.
   * For each (training, test) pair, they iterate through the set of ParamMaps:
      * For each ParamMap, they fit the Estimator using those parameters, get the fitted Model, and evaluate the Model’s performance using the Evaluator.
   * They finally select the Model produced by the best-performing set of parameters.
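
The selection procedure described above can be sketched in plain Python. This is a simplified illustration, not Spark's CrossValidator: the helper names, the contiguous (non-random) folds, and the toy threshold "model" are all assumptions made for the example.

```python
from statistics import mean

def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, valid
        start += size

def select_best(params, rows, k, fit, score):
    """Return the candidate with the highest mean validation score, plus all means."""
    results = {}
    for p in params:
        fold_scores = []
        for train, valid in k_fold_indices(len(rows), k):
            model = fit([rows[i] for i in train], p)
            fold_scores.append(score(model, [rows[i] for i in valid]))
        results[p] = mean(fold_scores)
    return max(results, key=results.get), results

# Toy data: the label is 1 when x > 5; the "model" is just a threshold parameter
rows = [(x, int(x > 5)) for x in range(10)]
fit = lambda train_rows, threshold: threshold
score = lambda threshold, valid_rows: mean(
    int((x > threshold) == bool(y)) for x, y in valid_rows)

best, results = select_best([2, 5, 8], rows, 3, fit, score)
print("best threshold:", best)
```

The key point is that every candidate is scored only on rows it was not fitted on, and the winner is chosen by the average over all folds.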

**An interesting blog on parameter tuning:** https://www.oreilly.com/ideas/big-datas-biggest-secret-hyperparameter-tuning

In [40]:
print(lr.explainParam("regParam"))

regParam: regularization parameter (>= 0). (default: 0.0)


In [41]:
print(lr.explainParam("elasticNetParam"))

elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)


#### Create Parameters Grid for Cross Validation
We will create a model for each combination of parameters in the specified grid and evaluate its result.

We use:

 * 3 regularization parameter values (regParam)
 * 3 values for the maximum number of iterations (maxIter)
 * 3 values for elasticNetParam

The grid will have 3 x 3 x 3 = 27 parameter settings to choose from.

Regularization parameter:

Intuitively, it is a penalty against complexity. A bigger regParam penalizes "large" weight coefficients, i.e., it tries to keep our model from picking up "noise" or "deducing a pattern where there is none." In other words, it tries to avoid overfitting.

ElasticNetParam:
read this: https://en.wikipedia.org/wiki/Elastic_net_regularization
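
How the two parameters interact can be shown with a short plain-Python sketch. It assumes the penalty form given in the Spark ML documentation, regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2) with alpha = elasticNetParam; the function name and the example weights are hypothetical.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2), with alpha = elastic_net_param."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

weights = [3.0, -4.0]  # hypothetical coefficients, for illustration only
for alpha in (0.0, 0.5, 1.0):
    print("alpha=%.1f penalty=%.2f" % (alpha, elastic_net_penalty(weights, 1.0, alpha)))
```

The penalty scales linearly with regParam, so large values shrink the coefficients strongly, and any alpha above 0 adds an L1 term that can push individual coefficients to exactly zero.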

In [42]:
# Create Parameters Grid for Cross Validation
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [1, 5, 10])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10])
             .build())
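
To see what the grid amounts to, the following plain-Python sketch enumerates the same combinations with itertools. It does not use Spark; the dictionary is a hand-written copy of the values above, and the variable names are illustrative.

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder call above
grid_values = {
    "regParam": [1, 5, 10],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [1, 5, 10],
}

names = list(grid_values)
param_maps = [dict(zip(names, combo)) for combo in product(*grid_values.values())]

num_folds = 3  # same value as numFolds in the CrossValidator cell
print("parameter settings :", len(param_maps))
print("fits during CV     :", len(param_maps) * num_folds)
print("first setting      :", param_maps[0])
```

Each additional grid dimension multiplies the number of fits, which is why the cross-validation run on the full training data takes a while.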

In [43]:
# Create 3-fold CrossValidator

# numFolds determines the number of train/test dataset pairs used in the cross-validation
# For each parameter setting, the cross validation computes the average of the evaluation metrics
# produced by fitting the Estimator on the 3 different (training, test) dataset pairs.

cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)

# Run cross validations
cvModel = cv.fit(train_data)
# this may take some time (depends on the number of models that we're creating and testing)

In [44]:
# Use test set here so we can measure the accuracy of our model on new data
predictions = cvModel.transform(test_data)
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
best_score=evaluator.evaluate(predictions)
print('Best model score : %03f' % best_score)

Best model score : 0.997650


## 6. Predicting test data

We are using the previously created pipeline on the corrected dataset: df_test

In [45]:
df_test.groupBy('label').count().show()

+-----+------+
|label| count|
+-----+------+
|    1|250436|
|    0| 60593|
+-----+------+



In [46]:
transformer_test = pipeline.fit(df_test)
transformed_df_test = transformer_test.transform(df_test)

# Keep relevant columns
selection_test = ["label", "features", "duration", "src_bytes"] + assemblerInputs
dataset_test = transformed_df_test.select(selection_test)

In [47]:
# Use test set here so we can measure the accuracy of our model on new data
predictions_test = cvModel.transform(dataset_test)        # WE CAN EITHER USE MODEL OR CVMODEL
# cvModel uses the best model found from the Cross Validation

In [48]:
# Evaluate best model
best_score_test = evaluator.evaluate(predictions_test)
print('Best model score : %03f' % best_score_test)

Py4JJavaError: An error occurred while calling o426.evaluate.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1715.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1715.0 (TID 7997, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes: x.size = 82, y.size = 87
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
	at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$31.apply(LogisticRegression.scala:975)
	at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$31.apply(LogisticRegression.scala:974)
	at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:1108)
	at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:904)
	at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:117)
	at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:116)
	... 15 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
	at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:266)
	at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:128)
	at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
	at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
	at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$4$lzycompute(BinaryClassificationMetrics.scala:155)
	at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$4(BinaryClassificationMetrics.scala:146)
	at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions$lzycompute(BinaryClassificationMetrics.scala:148)
	at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions(BinaryClassificationMetrics.scala:148)
	at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.createCurve(BinaryClassificationMetrics.scala:223)
	at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.roc(BinaryClassificationMetrics.scala:86)
	at org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:97)
	at org.apache.spark.ml.evaluation.BinaryClassificationEvaluator.evaluate(BinaryClassificationEvaluator.scala:87)
	at sun.reflect.GeneratedMethodAccessor161.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes: x.size = 82, y.size = 87
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
	at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$31.apply(LogisticRegression.scala:975)
	at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$31.apply(LogisticRegression.scala:974)
	at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:1108)
	at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:904)
	at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:117)
	at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:116)
	... 15 more
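
The root cause is in the last `Caused by` line: the model holds 87 coefficients but receives feature vectors of size 82. A likely explanation is that `pipeline.fit(df_test)` re-fits the StringIndexer stages on the test data, so the one-hot vectors are sized by the categories present in `df_test` rather than by those seen in training. Under the default drop-last encoding, 87 is consistent with 2 + 69 + 10 one-hot slots plus the 6 numeric columns, and 82 with five fewer service categories in the test file. The plain-Python sketch below reproduces the effect with made-up categories; the helper names are hypothetical. In the notebook itself, the analogous fix would presumably be to reuse the already fitted `transformer` (i.e. `transformer.transform(df_test)`); categories that appear only in the test data would then need handling, for example via StringIndexer's `handleInvalid` parameter.

```python
def fit_vocabulary(values):
    """Learn the category -> index mapping from one dataset (sorted for determinism)."""
    return {category: i for i, category in enumerate(sorted(set(values)))}

def encode(value, vocabulary):
    """One-hot encode with one slot per known category; unseen values map to all zeros."""
    vector = [0.0] * len(vocabulary)
    if value in vocabulary:
        vector[vocabulary[value]] = 1.0
    return vector

# Made-up service values: the test set lacks some categories and adds a new one
train_services = ["http", "smtp", "ftp", "telnet", "http"]
test_services = ["http", "smtp", "icmp"]

train_vocab = fit_vocabulary(train_services)
test_vocab = fit_vocabulary(test_services)   # the mistake: re-fitting on test data

wrong = [encode(v, test_vocab) for v in test_services]
right = [encode(v, train_vocab) for v in test_services]

print("model expects   :", len(train_vocab))
print("re-fitted sizes :", {len(v) for v in wrong})
print("reused sizes    :", {len(v) for v in right})
```

Reusing the training vocabulary keeps every vector at the size the model was trained on, at the cost of encoding unseen test categories as "none of the known ones".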


In [None]:
# spark.stop()