# Spark Pipeline on Kickstarter Pledge Dataset

## 1. Overview

### 1.1. Instructions

- **Choosing any sufficiently large open dataset** (datasets with fewer than 100,000 lines are not allowed)


- **Choosing one variable to predict**


- **Implementing at least two supervised learning models**: classification, regression, recommender system, etc. Unsupervised tasks (e.g. clustering, association rules, etc.) are not allowed


- **Mandatory use of Apache Spark** (e.g. on Google Cloud as we did during our lab sessions)


- A **full machine learning pipeline must be implemented**, which includes:
    - Reading the data
    - Transforming data (extracting features, dealing with missing values if any, etc.)
    - Building models (build at least two models to compare)
    - Evaluating quality (use cross-validation or train/test split)

### 1.2. Dataset

### 1.3. Summary & Conclusion

The notebook was also run locally using the installation steps for Spark described [here](https://sparkbyexamples.com/spark/spark-installation-on-linux-ubuntu/).

## 2. Environment Set-Up

We need the following libraries installed to set up the environment:

- kaggle (see documentation [here](https://github.com/Kaggle/kaggle-api#datasets))
- pyspark (see documentation [here](https://spark.apache.org/docs/latest/api/python/index.html))

In [1]:
!pip install kaggle
!pip install pyspark



## 3. Dataset Download

In [2]:
# Removes existing files that may have been downloaded locally
!rm -f kickstarter-projects.zip
!rm -f ks-projects-201612.csv ks-projects-201801.csv

<span style="color:red">**On Google Cloud**: To download the Kaggle dataset, we have to upload our kaggle.json file to /root/.kaggle. The kaggle.json file can be downloaded here:</span>

> ``https://www.kaggle.com/<username>/account``
 
<span style="color:red">**Locally**: To download the Kaggle dataset, we have to place our kaggle.json file in ~/.kaggle.</span>
    
<span style="color:red">Run the cell below when using Google Cloud. **It is assumed we created a /home/user folder where we uploaded our JupyterNotebook and our kaggle.json file in Dataproc**</span>

In [3]:
# Downloads the raw dataset from the Kaggle source
!kaggle datasets download -d kemical/kickstarter-projects

Downloading kickstarter-projects.zip to /
 90%|██████████████████████████████████    | 33.0M/36.8M [00:00<00:00, 47.6MB/s]
100%|██████████████████████████████████████| 36.8M/36.8M [00:00<00:00, 50.4MB/s]


In [4]:
# Unzips the raw dataset and keeps only the most recent instance
!unzip kickstarter-projects.zip
!rm -f ks-projects-201612.csv kickstarter-projects.zip
!ls

Archive:  kickstarter-projects.zip
  inflating: ks-projects-201612.csv  
  inflating: ks-projects-201801.csv  
bin   etc     ks-projects-201801.csv  lib64	  media  proc  sbin  tmp
boot  hadoop  lib		      libx32	  mnt	 root  srv   usr
dev   home    lib32		      lost+found  opt	 run   sys   var


We only keep 'ks-projects-201801.csv', the most recent dataset available.

<span style="color:red">Run the cell below when using Google Cloud:</span>

In [5]:
# Uploads the dataset to HDFS when on Google Cloud
!hdfs dfs -mkdir /user/qlr
!hdfs dfs -rm /user/qlr/ks-projects-201801.csv
!hdfs dfs -put ks-projects-201801.csv /user/qlr
!hdfs dfs -ls /user/qlr

mkdir: `/user/qlr': File exists
Deleted /user/qlr/ks-projects-201801.csv
Found 1 items
-rw-r--r--   2 root hadoop   58030359 2021-01-02 16:51 /user/qlr/ks-projects-201801.csv


## 4. Library Imports & Spark Variables

In [6]:
from pyspark.context import SparkContext
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.feature import Word2Vec, Tokenizer, HashingTF
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp, ceil, isnan, when, count, col
from pyspark.sql.session import SparkSession
from pyspark.sql.types import FloatType

<span style="color:red">Swap the dataset_path variable and comment out the spark_context variable when using Google Cloud:</span>

In [7]:
#dataset_path = "ks-projects-201801.csv" #if run on a local computer
dataset_path = "/user/qlr/ks-projects-201801.csv" # if run on Google Cloud
dataset_format = "csv"
#spark_context = "local" #if run on a local computer
spark_context = "yarn" #if run on Google Cloud

## 5. Starting a Spark Session & Loading the Dataset

<span style="color:red">Comment out the following cell when running the notebook on Google Cloud, as a Spark session is automatically instantiated.</span>

In [8]:
spark #Spark UI on Google Cloud should return v2.3.4 (version), yarn (Master), PySparkShell (AppName)

In [9]:
campaigns = (spark
             .read
             .format(dataset_format)
             .options(header=True)
             .load(dataset_path))

## 6. Data Preprocessing

In [10]:
raw_columns_to_keep = [
    "ID","name","category","deadline","launched","country","usd_goal_real", #features
    "state" # target
]

replace_start_end_dates_with_duration = [
    "ID","name","category","total_duration","country","usd_goal_real", #features
    "state" # target
]

kept_columns_for_1st_batch_of_models = [
    "total_duration","usd_goal_real","name","category","country", #features
    "state" # target
]

deadline_format = "yyyy-MM-dd"
launched_format = "yyyy-MM-dd HH:mm:ss"

In [11]:
# Checks the type of each column
campaigns.dtypes

[('ID', 'string'),
 ('name', 'string'),
 ('category', 'string'),
 ('main_category', 'string'),
 ('currency', 'string'),
 ('deadline', 'string'),
 ('goal', 'string'),
 ('launched', 'string'),
 ('pledged', 'string'),
 ('state', 'string'),
 ('backers', 'string'),
 ('country', 'string'),
 ('usd pledged', 'string'),
 ('usd_pledged_real', 'string'),
 ('usd_goal_real', 'string')]

In [12]:
# Drops NAs, Nulls, and Duplicates as pySpark can raise an error during .fit() procedures
campaigns = campaigns.dropna()
campaigns = campaigns.dropDuplicates()
for column in campaigns.columns:
    campaigns = campaigns.where(col(column).isNotNull())

# Checks for N/A
campaigns.select([count(when(isnan(c), c)).alias(c) for c in campaigns.columns]).show()

+---+----+--------+-------------+--------+--------+----+--------+-------+-----+-------+-------+-----------+----------------+-------------+
| ID|name|category|main_category|currency|deadline|goal|launched|pledged|state|backers|country|usd pledged|usd_pledged_real|usd_goal_real|
+---+----+--------+-------------+--------+--------+----+--------+-------+-----+-------+-------+-----------+----------------+-------------+
|  0|   0|       0|            0|       0|       0|   0|       0|      0|    0|      0|      0|          0|               0|            0|
+---+----+--------+-------------+--------+--------+----+--------+-------+-----+-------+-------+-----------+----------------+-------------+



In [13]:
# Keeps only the relevant columns
campaigns = campaigns.select(raw_columns_to_keep)
# --
print("The dataset contains " + str(campaigns.count()) + " rows.")
campaigns.show(n=5)

The dataset contains 374855 rows.
+----------+--------------------+----------+----------+-------------------+-------+-------------+----------+
|        ID|                name|  category|  deadline|           launched|country|usd_goal_real|     state|
+----------+--------------------+----------+----------+-------------------+-------+-------------+----------+
|1343448542|RWE PRESENTS: NYE...|       Pop|2013-12-02|2013-11-22 02:33:17|     US|    100000.00|  canceled|
|1344240212|One-sided Story y...| Art Books|2013-03-18|2013-02-16 19:41:31|     US|      2000.00|successful|
|1344521181|          Flatlander|   Fiction|2015-11-02|2015-09-16 04:44:41|     US|      3000.00|successful|
|1346558846|Bring readline to...|Technology|2015-06-06|2015-05-22 22:33:48|     US|      5000.00|  canceled|
|  13466532|Printing of Betha...|      Jazz|2011-10-26|2011-10-12 16:16:05|     US|      1200.00|    failed|
+----------+--------------------+----------+----------+-------------------+-------+-------------+----------+

In [14]:
# Computes a duration (in days) from the launch and deadline features before dropping them
launch_times = unix_timestamp('launched', format = launched_format)
deadline_times = unix_timestamp('deadline', format = deadline_format)
time_difference = deadline_times - launch_times
campaigns = campaigns.\
    withColumn("total_duration",ceil(time_difference/(3600*24))).\
    select(replace_start_end_dates_with_duration)
# --
campaigns.show(n=5)

+----------+--------------------+----------+--------------+-------+-------------+----------+
|        ID|                name|  category|total_duration|country|usd_goal_real|     state|
+----------+--------------------+----------+--------------+-------+-------------+----------+
|1343448542|RWE PRESENTS: NYE...|       Pop|            10|     US|    100000.00|  canceled|
|1344240212|One-sided Story y...| Art Books|            30|     US|      2000.00|successful|
|1344521181|          Flatlander|   Fiction|            47|     US|      3000.00|successful|
|1346558846|Bring readline to...|Technology|            15|     US|      5000.00|  canceled|
|  13466532|Printing of Betha...|      Jazz|            14|     US|      1200.00|    failed|
+----------+--------------------+----------+--------------+-------+-------------+----------+
only showing top 5 rows
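
As a sanity check of the duration logic, the same computation can be sketched in plain Python. This is only an illustration: it assumes both timestamps share one time zone, and the helper `duration_in_days` is hypothetical rather than part of the notebook.

```python
from datetime import datetime
from math import ceil

LAUNCHED_FORMAT = "%Y-%m-%d %H:%M:%S"  # Python equivalent of "yyyy-MM-dd HH:mm:ss"
DEADLINE_FORMAT = "%Y-%m-%d"           # Python equivalent of "yyyy-MM-dd"

def duration_in_days(launched, deadline):
    """Hypothetical helper: campaign length in days, rounded up."""
    start = datetime.strptime(launched, LAUNCHED_FORMAT)
    end = datetime.strptime(deadline, DEADLINE_FORMAT)
    seconds = (end - start).total_seconds()
    return int(ceil(seconds / (3600 * 24)))

# Launch and deadline values taken from the first rows of the raw dataset
print(duration_in_days("2013-11-22 02:33:17", "2013-12-02"))
print(duration_in_days("2013-02-16 19:41:31", "2013-03-18"))
```

Because the deadline carries no time of day, it is read as midnight, so a campaign launched in the afternoon counts its first partial day as a full day once rounded up.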



In [15]:
# Cleans the target labels
# 'undefined', 'live' -> dropped
# 'suspended', 'canceled' -> relabeled as 'failed'
for condition in ['state!="undefined"', 'state!="live"']:
    campaigns = campaigns.where(condition)
campaigns = campaigns.\
    withColumn("state",when(col("state") == "canceled", "failed").\
    when(col("state") == "suspended", "failed").\
    when(col("state") == "failed", "failed").\
    otherwise("successful"))
campaigns.select("state").groupBy('state').count().orderBy(col("count").desc()).show()

+----------+------+
|     state| count|
+----------+------+
|    failed|237451|
|successful|134609|
+----------+------+



In [16]:
# Casts the relevant column(s) to their end types
for column in ["total_duration", "usd_goal_real"]:
    campaigns = campaigns.withColumn(column,col(column).cast(FloatType()))

### 6.1. Exploring the dataset

### 6.2. Creating the dataset for Logistic Regression, Decision Tree, and Random Forest


We will rely on indexing and assembling our data pipeline using the following stages:
- **StringIndexer** for all categorical columns
- **OneHotEncoder** for all categorical index columns
- **VectorAssembler** for all feature columns into one vector column
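
To make the three stages concrete, here is a minimal plain-Python sketch of what they do to a handful of toy rows. It is an approximation under stated assumptions: categories are indexed by descending frequency (ties broken alphabetically, which is an assumption of this sketch), the last category is dropped by the one-hot step, and the helper names are hypothetical.

```python
from collections import Counter

def string_index(values):
    """Maps each category to an index, most frequent category first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties (assumption)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def one_hot(index, num_categories):
    """One-hot encodes an index; the last category is encoded as all zeros."""
    vector = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[int(index)] = 1.0
    return vector

# Toy rows: (country, total_duration, usd_goal_real)
rows = [("US", 30.0, 2000.0), ("GB", 43.0, 4926.39),
        ("US", 14.0, 1200.0), ("DE", 21.0, 2240.39)]

country_index = string_index([country for country, _, _ in rows])
num_countries = len(country_index)

# "Assembling" concatenates the encoded and numeric columns into one feature vector
features = [one_hot(country_index[c], num_countries) + [duration, goal]
            for c, duration, goal in rows]

for row, vector in zip(rows, features):
    print(row[0], vector)
```

The real stages produce sparse vectors and learn the category order from the full dataset, but the shape of the result is the same: one block of indicator values per categorical column, followed by the numeric columns.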

In [17]:
# Removes ID and name from the dataset
first_batch_models_dataset = campaigns.select(kept_columns_for_1st_batch_of_models)
# --
print("The dataset contains " + str(first_batch_models_dataset.count()) + " rows.")
first_batch_models_dataset.show(n=5)
first_batch_models_dataset.dtypes

The dataset contains 372060 rows.
+--------------+-------------+--------------------+--------------+-------+----------+
|total_duration|usd_goal_real|                name|      category|country|     state|
+--------------+-------------+--------------------+--------------+-------+----------+
|          43.0|      4926.39|             Borders|         Drama|     GB|    failed|
|          21.0|      2240.39|Spiele für iOS un...|  Mobile Games|     DE|    failed|
|          30.0|        700.0|Odyssey Skateboar...|Graphic Design|     US|    failed|
|          30.0|       5500.0|Debut EP Album Pr...|           R&B|     US|    failed|
|          16.0|       1200.0|GBS Detroit Prese...|    Indie Rock|     US|successful|
+--------------+-------------+--------------------+--------------+-------+----------+
only showing top 5 rows



[('total_duration', 'float'),
 ('usd_goal_real', 'float'),
 ('name', 'string'),
 ('category', 'string'),
 ('country', 'string'),
 ('state', 'string')]

In [18]:
# Creates pipeline stages to string index each categorical feature column except name, and the label column
categorical_feature_columns = first_batch_models_dataset.columns[3:]
string_indexing_feature_columns = [StringIndexer(inputCol=column, 
                                                 outputCol='strindexed_' + column,
                                                 handleInvalid="skip")
                                   for column in categorical_feature_columns]
string_indexing_label_column = [StringIndexer(inputCol='state', 
                                              outputCol='label',
                                              handleInvalid="skip")]

In [19]:
# Creates pipeline stages to one-hot encode each categorical feature column except name
if spark_context == "local":
    onehot_encoding_feature_columns = [OneHotEncoder(inputCol='strindexed_' + column, 
                                                     outputCol='onehot_' + column,
                                                  handleInvalid="keep") 
                                      for column in categorical_feature_columns]
else: #Google Cloud's version of PySpark does not support/need handleInvalid attributes
    onehot_encoding_feature_columns = [OneHotEncoder(inputCol='strindexed_' + column, 
                                                 outputCol='onehot_' + column) 
                                  for column in categorical_feature_columns]

In [20]:
# Creates a pipeline stage to vector assemble each categorical feature column except name
processed_feature_columns = list(map(lambda col_name: "onehot_" + col_name, categorical_feature_columns))
processed_feature_columns += ["total_duration", "usd_goal_real"]
processed_feature_columns.remove("onehot_state")
print(processed_feature_columns)

if spark_context == "local":
    vectorassembler_stage = VectorAssembler(inputCols=processed_feature_columns, 
                                            outputCol='features_1',
                                            handleInvalid="skip")
else: #Google Cloud's version of PySpark does not support/need handleInvalid attributes
    vectorassembler_stage = VectorAssembler(inputCols=processed_feature_columns, 
                                            outputCol='features_1')

['onehot_category', 'onehot_country', 'total_duration', 'usd_goal_real']


In [21]:
tokenizer = Tokenizer(inputCol="name", outputCol="words")
# Embeds the tokenized campaign name as a 10-dimensional vector
word2vec = Word2Vec(vectorSize=10, inputCol=tokenizer.getOutputCol(), outputCol="features_2")

In [22]:
merge_features = VectorAssembler(inputCols=["features_1", "features_2"], outputCol="features")

In [23]:
# Assembles the data processing pipeline
data_processing_pipeline = Pipeline(
    stages = string_indexing_feature_columns +
    string_indexing_label_column + 
    onehot_encoding_feature_columns + 
    [vectorassembler_stage] + 
    [tokenizer] + 
    [word2vec] +
    [merge_features]
)

In [24]:
# Fits the data processing pipeline
pipeline_first_batch_models = data_processing_pipeline.fit(first_batch_models_dataset.na.drop())
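
The campaign name enters this pipeline as a fixed-length vector: the name is lowercased and split on whitespace, and the vectors of its words are averaged. The sketch below illustrates that averaging step only, using a hypothetical two-dimensional embedding table in place of the ten-dimensional vectors learned by the notebook; the helper names are also hypothetical.

```python
def tokenize(text):
    """Lowercases the text and splits it on whitespace."""
    return text.lower().split()

def average_vector(words, embeddings):
    """Averages the embedding vectors of the given words, dimension by dimension."""
    size = len(next(iter(embeddings.values())))
    total = [0.0] * size
    for word in words:
        for i, value in enumerate(embeddings[word]):
            total[i] += value
    return [value / len(words) for value in total]

# Hypothetical 2-dimensional embeddings, for illustration only
toy_embeddings = {
    "debut": [1.0, 0.0],
    "ep": [0.0, 1.0],
    "album": [1.0, 1.0],
}

words = tokenize("Debut EP Album")
name_vector = average_vector(words, toy_embeddings)
print(words, name_vector)
```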

### 6.3. Finalizing our dataset for Naive Bayes

In [25]:
tokenizer = Tokenizer(inputCol="name", outputCol="words")
# Hashes the tokenized campaign name into 20 term-frequency buckets
hashing_tf = HashingTF(numFeatures=20, inputCol=tokenizer.getOutputCol(), outputCol="features_2")

In [26]:
# Assembles the data processing pipeline
data_processing_pipeline = Pipeline(
    stages = string_indexing_feature_columns +
    string_indexing_label_column + 
    onehot_encoding_feature_columns + 
    [vectorassembler_stage] + 
    [tokenizer] + 
    [hashing_tf] +
    [merge_features]
)

In [27]:
# Fits the data processing pipeline
pipeline_naive_bayes = data_processing_pipeline.fit(first_batch_models_dataset.na.drop())
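
For Naive Bayes, the name is encoded with hashed term frequencies instead of word embeddings, presumably because Naive Bayes needs non-negative feature values while embeddings can be negative. The following sketch shows the idea of the hashing trick with 20 buckets. It uses `zlib.crc32` as a stand-in hash, which is an assumption of this example and not the hash function Spark uses, so bucket positions will differ from the notebook's output.

```python
import zlib

def hashing_tf(words, num_features=20):
    """Counts terms into a fixed-size vector using the hashing trick."""
    vector = [0.0] * num_features
    for word in words:
        # zlib.crc32 is a stand-in hash chosen for this sketch
        bucket = zlib.crc32(word.encode("utf-8")) % num_features
        vector[bucket] += 1.0
    return vector

words = "bring readline to technology".split()
vector = hashing_tf(words)
print(vector)
```

With only 20 buckets, different words frequently collide in the same bucket; a larger `num_features` reduces collisions at the cost of a wider feature vector.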

## 7. Running a Logistic Regression Pipeline

### 7.1. Finalizing the data processing pipeline

In [28]:
# Declares hyperparameters
training_size = 0.7
test_size = 0.3
reg_parameters = [0, 0.5, 1, 2]
elastic_net_parameters = [0, 0.5, 1]

def process_confusion_matrix(matrix):
    # Prints each (label, prediction) cell of the confusion matrix with its count
    for item in matrix:
        print(item, matrix[item])
    # Cells that never occur are absent from the counts, so they default to 0
    true_negatives = float(matrix.get(Row(label=0.0, prediction=0.0), 0))
    false_negatives = float(matrix.get(Row(label=1.0, prediction=0.0), 0))
    false_positives = float(matrix.get(Row(label=0.0, prediction=1.0), 0))
    true_positives = float(matrix.get(Row(label=1.0, prediction=1.0), 0))
    # Guards against a division by zero when a class is never predicted or never present
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    precision = true_positives / predicted_positives if predicted_positives else 0.
    recall = true_positives / actual_positives if actual_positives else 0.
    print("\nPrecision score:", precision)
    print("Recall score:", recall)
    # F1 is the harmonic mean of precision and recall: 2PR / (P + R)
    if precision + recall != 0.:
        print("F1 score:", 2 * (precision * recall) / (precision + recall))

In [29]:
final_columns = ['features', 'label']
first_batch_dataset_prepped = pipeline_first_batch_models.\
    transform(first_batch_models_dataset)
# --
print("The dataset contains " + str(first_batch_dataset_prepped.count()) + " rows.")        
first_batch_dataset_prepped.select(["features_1", "features_2"]).show(5)

The dataset contains 370775 rows.
+--------------------+--------------------+
|          features_1|          features_2|
+--------------------+--------------------+
|(181,[26,158,179,...|[0.05186719766684...|
|(181,[5,158,179,1...|[0.35906347632408...|
|(181,[51,159,179,...|[0.03446385506540...|
|(181,[43,158,179,...|[0.03267886023968...|
|(181,[5,159,179,1...|[0.04297338426113...|
+--------------------+--------------------+
only showing top 5 rows



In [30]:
first_batch_dataset_prepped = first_batch_dataset_prepped.\
    select(final_columns).sample(0.01).cache()
# --
print("The dataset contains " + str(first_batch_dataset_prepped.count()) + " rows.")        
first_batch_dataset_prepped.show(5)

The dataset contains 3711 rows.
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(191,[12,166,179,...|  0.0|
|(191,[13,158,179,...|  1.0|
|(191,[8,158,179,1...|  0.0|
|(191,[3,158,179,1...|  0.0|
|(191,[28,158,179,...|  0.0|
+--------------------+-----+
only showing top 5 rows



### 7.2. Creating a model pipeline

#### 7.2.1. Building and fitting the model

In [31]:
lr = LogisticRegression(featuresCol='features', labelCol='label')

In [32]:
# Builds a parameter grid
param_grid = ParamGridBuilder().\
    addGrid(lr.regParam, [0., 0.5, 1., 2.]).\
    addGrid(lr.elasticNetParam, [0., 0.5, 1.]).\
    build()

In [33]:
# Builds the evaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

In [None]:
# Builds the cross-validation model
cv = CrossValidator(estimator=lr, 
                    estimatorParamMaps=param_grid, 
                    evaluator=evaluator, 
                    numFolds=5)
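
To get a feel for the cost of this search, the grid can be expanded by hand. The sketch below is plain Python and only mirrors the two parameter lists and the fold count used above; it does not call Spark.

```python
from itertools import product

reg_params = [0.0, 0.5, 1.0, 2.0]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 5

# Each grid point is one (regParam, elasticNetParam) combination
grid = [{"regParam": r, "elasticNetParam": e}
        for r, e in product(reg_params, elastic_net_params)]

# Cross-validation trains one model per grid point and per fold
fits_during_search = len(grid) * num_folds

print(len(grid), "combinations,", fits_during_search, "fits during the search")
```

The best combination is then refit once more on the full dataset, so adding values to either list multiplies the training time accordingly.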

In [None]:
# Fits the cross-validation model
cv_model = cv.fit(first_batch_dataset_prepped.na.drop())

In [None]:
print("The model was fit using parameters: ")
print(cv_model.extractParamMap())

The model was fit using parameters: 
{Param(parent=u'CrossValidatorModel_4696830d4887abebd9de', name='estimatorParamMaps', doc='estimator param maps'): [{Param(parent=u'LogisticRegression_4c7c81df79b94a8909b5', name='regParam', doc='regularization parameter (>= 0).'): 0.0, Param(parent=u'LogisticRegression_4c7c81df79b94a8909b5', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0}, {Param(parent=u'LogisticRegression_4c7c81df79b94a8909b5', name='regParam', doc='regularization parameter (>= 0).'): 0.0, Param(parent=u'LogisticRegression_4c7c81df79b94a8909b5', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.5}, {Param(parent=u'LogisticRegression_4c7c81df79b94a8909b5', name='regParam', doc='regularization parameter (>= 0).'): 0.0, Param(parent=u'LogisticRegressio

#### 7.2.2. Evaluating the model

In [None]:
# Provides a confusion matrix
label_and_pred = cv_model.transform(first_batch_dataset_prepped).select('label', 'prediction')
confusion_matrix = label_and_pred.rdd.zipWithIndex().countByKey()
process_confusion_matrix(confusion_matrix)

(Row(label=0.0, prediction=1.0), 349)
(Row(label=1.0, prediction=0.0), 734)
(Row(label=0.0, prediction=0.0), 2030)
(Row(label=1.0, prediction=1.0), 598)
('\nPrecision score:', 0.6314677930306231)
('Recall score:', 0.44894894894894893)
('F1 score:', 0.5247915752523036)
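
As a cross-check, precision, recall and F1 can be recomputed directly from the counts printed above. The sketch assumes the usual definition of F1 as the harmonic mean of precision and recall, 2PR / (P + R); the helper name `classification_metrics` is hypothetical.

```python
def classification_metrics(true_positives, false_positives, false_negatives):
    """Returns (precision, recall, F1) computed from confusion-matrix counts."""
    precision = true_positives / float(true_positives + false_positives)
    recall = true_positives / float(true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the confusion matrix printed above
precision, recall, f1 = classification_metrics(
    true_positives=598, false_positives=349, false_negatives=734)
print(precision, recall, f1)
```

Since F1 is a harmonic mean, it always lies between precision and recall, which makes it a quick plausibility test for any reported score.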


In [None]:
# Intercept and coefficients of the regression model
print('Intercept: ' + str(cv_model.bestModel.intercept) + "\n"
      'coefficients: ' + str(cv_model.bestModel.coefficients))

Intercept: -1.19133135782
coefficients: [2.883169701788507,2.8822537448249506,3.4110199938860206,3.2290771491118178,3.426121102437037,2.2029277326287002,2.44258890183341,2.580245409153952,1.7058287523020317,2.001547864443571,1.8049231264298944,2.5813494120723406,1.8365524647872513,2.8891735526482836,2.8493117609947034,2.6234018841832594,2.1246119267337518,1.0552371706685628,2.2473371836537,2.290743279839454,3.7352507907286285,2.127609521157579,2.7682828412005467,0.7000294790882924,3.199848347512026,2.080313460845413,3.9510877888628766,2.88936139237162,1.103806422039922,3.091092158160054,2.854976653257445,2.202866134868444,2.807568849907514,2.7291544731502126,1.4698833233890645,3.6248033484229194,1.0719345219076695,3.3689809815080234,1.9013346599587657,1.9520510263376063,3.3816058076716704,2.9586757098005068,3.3678674099689436,2.734566754657879,1.8644788110440902,2.544353103001597,3.5042498871059053,3.3074304189772614,2.288088161295042,2.7788548311194026,2.590052752961918,3.534033406451

In [None]:
# Parameters of the best model
print('The best RegParam is: ', cv_model.bestModel._java_obj.getRegParam(), "\n",
      'The best ElasticNetParam is: ', cv_model.bestModel._java_obj.getElasticNetParam())

('The best RegParam is: ', 0.0, '\n', 'The best ElasticNetParam is: cv_model.bestModel._java_obj.getElasticNetParam()')


## 8. Running a DecisionTreeClassifier Pipeline

### 8.1. Finalizing the data processing pipeline

In [None]:
# Declares hyperparameters
max_depth_grid = list(range(2,10))

In [None]:
# Splits the dataset between training and test sets
training, test = first_batch_dataset_prepped.randomSplit([training_size, test_size], seed=0)
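
A random split of this kind assigns rows independently, so the resulting sizes only approximate the requested 70/30 proportions. The plain-Python sketch below illustrates the idea with one uniform draw per row; it is a simplified model of the behaviour, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed=0):
    """Assigns each row to one split by drawing a uniform number per row."""
    total = float(sum(weights))
    # Cumulative boundaries, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3]
    boundaries = []
    running = 0.0
    for weight in weights:
        running += weight / total
        boundaries.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, boundary in enumerate(boundaries):
            # The last split also catches draws lost to floating-point rounding
            if draw < boundary or i == len(boundaries) - 1:
                splits[i].append(row)
                break
    return splits

training_rows, test_rows = random_split(list(range(3711)), [0.7, 0.3])
print(len(training_rows), len(test_rows))
```

Fixing the seed makes the split reproducible across runs, which is why the notebook passes `seed=0`.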

### 8.2. Creating a model pipeline using cross-validation

#### 8.2.1. Building and fitting the model

In [None]:
# Builds the estimator
decision_tree_with_crossvalidation = DecisionTreeClassifier(featuresCol='features', labelCol='label')

In [None]:
# Builds a parameter grid
param_grid = ParamGridBuilder().\
    addGrid(decision_tree_with_crossvalidation.maxDepth, max_depth_grid).\
    build()

In [None]:
# Builds the evaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")

In [None]:
# Builds the cross-validation model
cv = CrossValidator(estimator=decision_tree_with_crossvalidation, 
                    estimatorParamMaps=param_grid, 
                    evaluator=evaluator, 
                    numFolds=5)

In [None]:
# Fits the cross-validation model
cv_model = cv.fit(first_batch_dataset_prepped.na.drop())

#### 8.2.2. Evaluating the model

In [None]:
# Predicts on training data
show_columns = ['features', 'label', 'prediction', 'rawPrediction', 'probability']
pred_training_cv = cv_model.transform(first_batch_dataset_prepped)
pred_training_cv.select(show_columns).show(5)

+--------------------+-----+----------+--------------+--------------------+
|            features|label|prediction| rawPrediction|         probability|
+--------------------+-----+----------+--------------+--------------------+
|(191,[12,166,179,...|  0.0|       0.0| [840.0,324.0]|[0.72164948453608...|
|(191,[13,158,179,...|  1.0|       0.0|[1089.0,850.0]|[0.56162970603403...|
|(191,[8,158,179,1...|  0.0|       0.0|[1089.0,850.0]|[0.56162970603403...|
|(191,[3,158,179,1...|  0.0|       0.0| [840.0,324.0]|[0.72164948453608...|
|(191,[28,158,179,...|  0.0|       0.0|[1089.0,850.0]|[0.56162970603403...|
+--------------------+-----+----------+--------------+--------------------+
only showing top 5 rows
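
The `rawPrediction` and `probability` columns above are closely related: for a decision tree, the raw prediction appears to hold the number of training rows of each class in the leaf that was reached, and the probability is that pair of counts normalized to sum to one. A small sketch with the values shown above (the helper name is hypothetical):

```python
def leaf_probabilities(class_counts):
    """Normalizes the class counts of a leaf into class probabilities."""
    total = float(sum(class_counts))
    return [count / total for count in class_counts]

# rawPrediction values from the first two rows of the table above
for raw_prediction in ([840.0, 324.0], [1089.0, 850.0]):
    probabilities = leaf_probabilities(raw_prediction)
    # The predicted label is the index of the most probable class
    predicted_label = float(probabilities.index(max(probabilities)))
    print(raw_prediction, probabilities, predicted_label)
```

Both leaves are dominated by class 0.0 (failed), which explains why every row shown is predicted as 0.0.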



In [None]:
# Provides a confusion matrix
label_and_pred = cv_model.transform(first_batch_dataset_prepped).select('label', 'prediction')
confusion_matrix = label_and_pred.rdd.zipWithIndex().countByKey()
process_confusion_matrix(confusion_matrix)

(Row(label=0.0, prediction=1.0), 52)
(Row(label=1.0, prediction=0.0), 1229)
(Row(label=0.0, prediction=0.0), 2327)
(Row(label=1.0, prediction=1.0), 103)
('\nPrecision score:', 0.6645161290322581)
('Recall score:', 0.07732732732732733)
('F1 score:', 0.13853396099529252)


In [None]:
print('The best MaxDepth is:', cv_model.bestModel._java_obj.getMaxDepth())

('The best MaxDepth is:', 2)


In [None]:
print(cv_model.bestModel)

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_489a81d9127568ddb644) of depth 2 with 7 nodes


### 8.3. Creating a model pipeline using Train-Test split

#### 8.3.1. Building and fitting the model

In [None]:
# Builds the estimator
decision_tree_with_traintestsplit = DecisionTreeClassifier(featuresCol='features', labelCol='label')

In [None]:
# Builds the evaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")

In [None]:
# Fits the model on the training set
traintest_model = decision_tree_with_traintestsplit.fit(training.na.drop())

#### 8.3.2. Evaluating the model

In [None]:
# Predicts on test data
show_columns = ['features', 'label', 'prediction', 'rawPrediction', 'probability']
pred_test = traintest_model.transform(test)
pred_test.select(show_columns).show(5)

+--------------------+-----+----------+-------------+--------------------+
|            features|label|prediction|rawPrediction|         probability|
+--------------------+-----+----------+-------------+--------------------+
|(191,[3,158,179,1...|  0.0|       0.0|[520.0,224.0]|[0.69892473118279...|
|(191,[8,158,179,1...|  0.0|       0.0|[233.0,122.0]|[0.65633802816901...|
|(191,[9,158,179,1...|  1.0|       1.0|[168.0,230.0]|[0.42211055276381...|
|(191,[9,158,179,1...|  0.0|       0.0|[520.0,224.0]|[0.69892473118279...|
|(191,[13,158,179,...|  1.0|       0.0|  [60.0,56.0]|[0.51724137931034...|
+--------------------+-----+----------+-------------+--------------------+
only showing top 5 rows



In [None]:
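# The evaluator returns the area under the ROC curve (not accuracy), so this reports 1 - AUC
area_under_roc = evaluator.evaluate(pred_test)
print("The test error is", 1.0 - area_under_roc)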

('The test error is', 0.462692312887496)
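
Since the evaluator is configured with `areaUnderROC`, the value it returns is a ranking measure rather than an accuracy. One common way to read it is as the share of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes that pairwise statistic on hypothetical toy scores; it is meant to build intuition and is not how Spark computes the metric internally.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for l, s in zip(labels, scores) if l == 1.0]
    negatives = [s for l, s in zip(labels, scores) if l == 0.0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Toy labels and predicted scores for the positive class (hypothetical values)
labels = [0.0, 0.0, 1.0, 1.0]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc, 1.0 - auc)
```

A value of 0.5 corresponds to random ranking, so a "test error" of about 0.46 computed as 1 - AUC means the tree ranks campaigns only slightly better than chance.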


In [None]:
print(traintest_model)

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4083bc918fe7e60ac47a) of depth 5 with 57 nodes


## 9. Running a Random Forest Pipeline

### 9.1. Finalizing the data processing pipeline

In [None]:
# Declares hyperparameters
max_depth_grid = list(range(2,10))
minimum_info_grain = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

### 9.2. Creating a model pipeline using cross-validation

#### 9.2.1. Building and fitting the model

In [None]:
# Builds the estimator
random_forest_with_crossvalidation = RandomForestClassifier(featuresCol='features', labelCol='label')

In [None]:
# Builds a parameter grid
param_grid = ParamGridBuilder().\
    addGrid(random_forest_with_crossvalidation.maxDepth, max_depth_grid).\
    addGrid(random_forest_with_crossvalidation.minInfoGain, minimum_info_grain).\
    build()

In [None]:
# Builds the evaluator
evaluator = BinaryClassificationEvaluator()

In [None]:
# Builds the cross-validation model
cv = CrossValidator(estimator=random_forest_with_crossvalidation, 
                    estimatorParamMaps=param_grid, 
                    evaluator=evaluator)

In [None]:
# Fits the cross-validation model
cv_model = cv.fit(first_batch_dataset_prepped.na.drop())

#### 9.2.2. Evaluating the model

In [None]:
# Predicts on training data
show_columns = ['features', 'label', 'prediction', 'rawPrediction', 'probability']
pred_training_cv = cv_model.transform(first_batch_dataset_prepped)
pred_training_cv.select(show_columns).show(5)

+--------------------+-----+----------+--------------------+--------------------+
|            features|label|prediction|       rawPrediction|         probability|
+--------------------+-----+----------+--------------------+--------------------+
|(191,[12,166,179,...|  0.0|       0.0|[15.4419633230080...|[0.77209816615040...|
|(191,[13,158,179,...|  1.0|       0.0|[11.9675484158580...|[0.59837742079290...|
|(191,[8,158,179,1...|  0.0|       0.0|[14.2468879425841...|[0.71234439712920...|
|(191,[3,158,179,1...|  0.0|       0.0|[12.6286258514609...|[0.63143129257304...|
|(191,[28,158,179,...|  0.0|       0.0|[12.5537399612295...|[0.62768699806147...|
+--------------------+-----+----------+--------------------+--------------------+
only showing top 5 rows
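
For the random forest, the two columns above are again linked: the raw prediction looks like the sum of the per-tree class probabilities, so dividing it by the number of trees (20 for the best model in this section) recovers the probability column. A short check using the truncated values as displayed:

```python
num_trees = 20  # number of trees reported for the best model in this section

# First component of rawPrediction for the first three rows (truncated as displayed)
raw_class0 = [15.4419633230080, 11.9675484158580, 14.2468879425841]

# Averaging the summed votes over the trees gives the class-0 probability
probabilities_class0 = [raw / num_trees for raw in raw_class0]
print(probabilities_class0)
```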



In [None]:
# Provides a confusion matrix
label_and_pred = cv_model.transform(first_batch_dataset_prepped).select('label', 'prediction')
confusion_matrix = label_and_pred.rdd.zipWithIndex().countByKey()
process_confusion_matrix(confusion_matrix)

(Row(label=0.0, prediction=1.0), 53)
(Row(label=1.0, prediction=0.0), 1015)
(Row(label=0.0, prediction=0.0), 2326)
(Row(label=1.0, prediction=1.0), 317)
('\nPrecision score:', 0.8567567567567568)
('Recall score:', 0.23798798798798798)
('F1 score:', 0.37250293772032906)


In [66]:
print(cv_model.bestModel)

RandomForestClassificationModel (uid=RandomForestClassifier_478dbc6cf5e34379607c) with 20 trees


## 10. Running a Naive Bayes Pipeline

### 10.1. Finalizing the data processing pipeline

In [76]:
# Declares hyperparameters
smoothing = [float(value) for value in range(10)]

In [77]:
final_columns = ['features', 'label']
naive_bayes_dataset_prepped = pipeline_naive_bayes.\
    transform(first_batch_models_dataset)
# --
print("The dataset contains " + str(naive_bayes_dataset_prepped.count()) + " rows.")        
naive_bayes_dataset_prepped.select(["features_1", "features_2"]).show(5)

The dataset contains 370775 rows.
+--------------------+--------------------+
|          features_1|          features_2|
+--------------------+--------------------+
|(181,[49,159,179,...|      (20,[9],[1.0])|
|(181,[57,162,179,...|(20,[2,7,8,9,11,1...|
|(181,[52,158,179,...|(20,[1,3,18],[1.0...|
|(181,[107,158,179...|(20,[1,4,8,15],[1...|
|(181,[20,158,179,...|(20,[1,9,10,12,13...|
+--------------------+--------------------+
only showing top 5 rows



In [78]:
naive_bayes_dataset_prepped = naive_bayes_dataset_prepped.\
    select(final_columns).sample(0.01).cache()
# --
print("The dataset contains " + str(naive_bayes_dataset_prepped.count()) + " rows.")        
naive_bayes_dataset_prepped.show(5)

The dataset contains 3812 rows.
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(201,[11,161,179,...|  1.0|
|(201,[15,158,179,...|  1.0|
|(201,[1,158,179,1...|  1.0|
|(201,[40,163,179,...|  0.0|
|(201,[9,158,179,1...|  0.0|
+--------------------+-----+
only showing top 5 rows



### 10.2. Creating a model pipeline using cross-validation

#### 10.2.1. Building and fitting the model

In [79]:
# Builds the estimator
naive_bayes_with_crossvalidation = NaiveBayes(featuresCol='features', labelCol='label')

In [80]:
# Builds a parameter grid
param_grid = ParamGridBuilder().\
    addGrid(naive_bayes_with_crossvalidation.smoothing, smoothing).\
    build()
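
The `smoothing` parameter tuned here is the additive (Laplace) smoothing constant. In a multinomial Naive Bayes model, it is added to every feature count before the counts are turned into conditional probabilities, so that features never seen with a class do not get probability zero. The sketch below shows the textbook formula on hypothetical counts; the helper name is also hypothetical.

```python
def smoothed_likelihoods(feature_counts, smoothing):
    """Multinomial Naive Bayes estimate of P(feature | class) with additive smoothing."""
    total = sum(feature_counts)
    num_features = len(feature_counts)
    return [(count + smoothing) / (total + smoothing * num_features)
            for count in feature_counts]

# Hypothetical per-feature counts observed for one class
counts = [3.0, 0.0, 1.0]

for smoothing_value in [0.0, 1.0, 9.0]:
    print(smoothing_value, smoothed_likelihoods(counts, smoothing_value))
```

Larger smoothing values pull all probabilities toward a uniform distribution, which is why the grid explores a range of values rather than fixing one.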

In [81]:
# Builds the evaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")

In [82]:
# Builds the cross-validation model
cv = CrossValidator(estimator=naive_bayes_with_crossvalidation, 
                    estimatorParamMaps=param_grid, 
                    evaluator=evaluator)

In [83]:
# Fits the cross-validation model
cv_model = cv.fit(naive_bayes_dataset_prepped.na.drop())

#### 10.2.2. Evaluating the model

In [84]:
# Predicts on training data
show_columns = ['features', 'label', 'prediction', 'rawPrediction', 'probability']
pred_training_cv = cv_model.transform(naive_bayes_dataset_prepped)
pred_training_cv.select(show_columns).show(5)

+--------------------+-----+----------+--------------------+--------------------+
|            features|label|prediction|       rawPrediction|         probability|
+--------------------+-----+----------+--------------------+--------------------+
|(201,[11,161,179,...|  1.0|       1.0|[-295.18347220871...|[8.58502538406355...|
|(201,[15,158,179,...|  1.0|       1.0|[-331.33323159063...|[4.46878495662525...|
|(201,[1,158,179,1...|  1.0|       1.0|[-392.11114538793...|[2.46258657583464...|
|(201,[40,163,179,...|  0.0|       1.0|[-408.81133075316...|[9.08339036052063...|
|(201,[9,158,179,1...|  0.0|       0.0|[-333.11701353272...|[0.99991585927173...|
+--------------------+-----+----------+--------------------+--------------------+
only showing top 5 rows



In [85]:
# Provides a confusion matrix
label_and_pred = cv_model.transform(naive_bayes_dataset_prepped).select('label', 'prediction')
confusion_matrix = label_and_pred.rdd.zipWithIndex().countByKey()
process_confusion_matrix(confusion_matrix)

(Row(label=0.0, prediction=1.0), 1683)
(Row(label=1.0, prediction=0.0), 172)
(Row(label=0.0, prediction=0.0), 755)
(Row(label=1.0, prediction=1.0), 1202)
('\nPrecision score:', 0.41663778162911613)
('Recall score:', 0.8748180494905385)
('F1 score:', 0.5644517492369101)


In [88]:
print(cv_model.bestModel)

NaiveBayes_474fb50cda2c3a8aff1b
