# Spark Pipeline on Kickstarter Pledge Dataset

## 1. Overview

### 1.1. Instructions

- **Choosing any sufficiently large open dataset** (less than 100000 lines are not allowed)


- **Choosing one variable to predict**


- **Implementing at least two supervised learning models**: classification, regression, recommender system, etc. Unsupervised tasks (e.g. clusterisation, associative rules, etc.) are not allowed


- **Mandatory use of Apache Spark** (e.g. on Google Cloud as we did during our lab sessions)


- A **full machine learning pipeline must be implemented**, which include:
    - Reading the data
    - Transforming data (extracting features, dealing with missing values if any, etc.)
    - Building models (build at least two models to compare)
    - Evaluating quality (use cross-validation or train/test split)

### 1.2. Dataset

We will be using the [Kickstarter Projects](https://www.kaggle.com/kemical/kickstarter-projects) Kaggle dataset. It contains two .csv files dated December 2016 and January 2018 which contains lists of kickstarter campaigns, explicited with the following data fields:

- ID
- name
- category
- main_category
- currency
- deadline
- goal
- launched
- pledged
- state
- backers
- country
- usd_pledged: conversion in US dollars of the pledged column 
- usd_pledged_real: conversion in US dollars of the pledged column
- usd_goal_real: conversion in US dollars of the goal column


### 1.3. Goal

Our goal will be to predict the **state** value of campaigns based on any number other columns (our features), excluding *usd_pledged* and *usd_pledged_real*.

The notebook was also ran locally using the installation steps for Spark described [here](https://sparkbyexamples.com/spark/spark-installation-on-linux-ubuntu/).

## 2. Environment Set-Up

We need the following libraries installed to set up the environment:

- kaggle (see documentation [here](https://github.com/Kaggle/kaggle-api#datasets))
- pyspark (see documentation [here](https://spark.apache.org/docs/latest/api/python/index.html))

In [1]:
# Installs the kaggle and pyspark module on the machine
!pip install kaggle
!pip install pyspark



## 3. Dataset Download

In [2]:
# Removes previously existing files
!rm -f kickstarter-projects.zip
!rm -f ks-projects-201612.csv ks-projects-201801.csv

### 3.1. Setting up Kaggle environment variables with the kaggle.json file

#### 3.1.2. On Google Cloud

<span style="color:red">To download the kaggle dataset, we must first upload our account's **kaggle.json file** in the **/root/.kaggle/ folder**.</span>
    
<span style="color:red">The kaggle.json file can be downloaded here:</span>

> ``https://www.kaggle.com/<username>/account``
    
<span style="color:red">It is assumed we created a **/home/\<user\>/ folder** where this Jupyter Notebook and the kaggle.json file have been uploaded</span>.

In [None]:
############## WARNING ###########
# RUN ONLY WHEN USING GOOGLE CLOUD
##################################

# Given this notebook and the kaggle.json file are set in the folder /home/<user>/
# Moves the kaggle.json file from the user folder to the root folder
!mv home/qlr/kaggle.json /root/.kaggle/kaggle.json

#### 3.1.2. On a local machine
    
<span style="color:red">Download and move the kaggle.json file to the local /root/.kaggle/ folder.</span>

### 3.2. Downloading the dataset

We only keep 'ks-projects-201801.csv', the most recent dataset available.

In [3]:
# Dowloads the raw dataset from the kaggle source
!kaggle datasets download -d kemical/kickstarter-projects

Downloading kickstarter-projects.zip to /home/qlr/Programming/kickstarter_pledge_prediction
100%|██████████████████████████████████████| 36.8M/36.8M [00:05<00:00, 5.64MB/s]
100%|██████████████████████████████████████| 36.8M/36.8M [00:05<00:00, 6.85MB/s]


In [4]:
# Unzips the raw dataset and keeps only the most recent instance
!unzip kickstarter-projects.zip
!rm -f ks-projects-201612.csv kickstarter-projects.zip

Archive:  kickstarter-projects.zip
  inflating: ks-projects-201612.csv  
  inflating: ks-projects-201801.csv  


### 3.3. Uploading to HDFS when using Google Cloud

In [None]:
############## WARNING ###########
# RUN ONLY WHEN USING GOOGLE CLOUD
##################################

# Uploads the dataset to HDFS when on Google Cloud
!hdfs dfs -mkdir /user/qlr
!hdfs dfs -rm /user/qlr/ks-projects-201801.csv
!hdfs dfs -put ks-projects-201801.csv /user/qlr
!hdfs dfs -ls /user/qlr

## 4. Library Imports & Setting Spark/Global Environment Variables

In [5]:
# Loads the needed modules
from pyspark.context import SparkContext

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.feature import Word2Vec, Tokenizer, HashingTF
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp, ceil, isnan, when, count, col
from pyspark.sql.session import SparkSession
from pyspark.sql.types import FloatType

In [54]:
############## WARNING ###########
# RUN ONLY WHEN ON A LOCAL MACHINE
##################################

dataset_path = "ks-projects-201801.csv"
dataset_format = "csv"
context = "local"

# Instantiates a local Spark Session
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "12g") \
    .appName('distributed-database-assignment') \
    .getOrCreate()

In [6]:
############## WARNING ###########
# RUN ONLY WHEN USING GOOGLE CLOUD
##################################

dataset_path = "/user/qlr/ks-projects-201801.csv"
dataset_format = "csv"
context = "cloud"

## 5. Loading the Kickstarter Dataset

<span style="color:red">Comment out the following cell when when running the notebook Google Cloud as a spark session is automatically instantiated.</span>

In [7]:
#Spark UI on Google Cloud should return:
#   v2.3.4 (version)
#   yarn (Master)
#   PySparkShell (AppName)

spark

In [35]:
# Loads the dataset
campaigns = (spark
             .read
             .format(dataset_format)
             .options(header=True)
             .load(dataset_path))

## 6. Dataset Pre-Processing

In [36]:
# Declares variables for pre-processing the dataset
kept_raw_columns = [
    "ID","name","category","deadline","launched","country","usd_goal_real", #features
    "state" # target
]

remove_date_columns = [
    "ID","name","category","total_duration","country","usd_goal_real", #features
    "state" # target
]

kept_columns_for_modelization = [
    "total_duration","usd_goal_real","name","category","country", #features
    "state" # target
]

deadline_format = "yyyy-MM-dd"
launched_format = "yyyy-MM-dd HH:mm:ss"

final_columns = ["features", "label"]

In [37]:
# Declares useful functions
def dataset_check(db):
    print("The dataset contains " + str(db.count()) + " rows.")
    db.show(n=5)

In [38]:
# Checks the type of the dataset columns
campaigns.dtypes

[('ID', 'string'),
 ('name', 'string'),
 ('category', 'string'),
 ('main_category', 'string'),
 ('currency', 'string'),
 ('deadline', 'string'),
 ('goal', 'string'),
 ('launched', 'string'),
 ('pledged', 'string'),
 ('state', 'string'),
 ('backers', 'string'),
 ('country', 'string'),
 ('usd pledged', 'string'),
 ('usd_pledged_real', 'string'),
 ('usd_goal_real', 'string')]

In [39]:
# Drops NAs, Nulls, and Duplicates 
campaigns = campaigns.dropna()
campaigns = campaigns.dropDuplicates()
for column in campaigns.columns:
    campaigns = campaigns.where(col(column).isNotNull())

In [40]:
# Prunes the non-relevant columns
campaigns = campaigns.select(kept_raw_columns)

In [41]:
# Computes a duration time (in day) between the launch and deadline features
launch_times = unix_timestamp("launched", format = launched_format)
deadline_times = unix_timestamp("deadline", format = deadline_format)
time_difference = deadline_times - launch_times
campaigns = campaigns.withColumn("total_duration",ceil(time_difference/(3600*24)))

# Removes the launch and deadline feature columns
campaigns = campaigns.select(remove_date_columns)

In [42]:
# Cleans the target labels:
#   - 'undefined', 'live' -> dropped
for condition in ['state!="undefined"', 'state!="live"']:
    campaigns = campaigns.where(condition)

#   - 'suspended', 'cancelled' -> renamed to 'failed' 
campaigns = campaigns.\
    withColumn("state",when(col("state") == "canceled", "failed").\
    when(col("state") == "suspended", "failed").\
    when(col("state") == "failed", "failed").\
    otherwise("successful"))

In [43]:
# Casts the relevant column(s) to their end types
for column in ["total_duration", "usd_goal_real"]:
    campaigns = campaigns.withColumn(column,col(column).cast(FloatType()))

In [44]:
# finishes clean-up
processed_campaigns = campaigns.select(kept_columns_for_modelization)

With the preprocessed campaigns, we can both explore the data better and build our Spark pipelines.

### 6.1. Exploring the pre-processed campaigns

In [45]:
# Checks dataset structure
dataset_check(processed_campaigns)

The dataset contains 372060 rows.
+--------------+-------------+--------------------+--------------+-------+----------+
|total_duration|usd_goal_real|                name|      category|country|     state|
+--------------+-------------+--------------------+--------------+-------+----------+
|          43.0|      4926.39|             Borders|         Drama|     GB|    failed|
|          21.0|      2240.39|Spiele für iOS un...|  Mobile Games|     DE|    failed|
|          30.0|        700.0|Odyssey Skateboar...|Graphic Design|     US|    failed|
|          30.0|       5500.0|Debut EP Album Pr...|           R&B|     US|    failed|
|          16.0|       1200.0|GBS Detroit Prese...|    Indie Rock|     US|successful|
+--------------+-------------+--------------------+--------------+-------+----------+
only showing top 5 rows



In [46]:
# Checks state column's content
processed_campaigns.select("state").groupBy("state").count().orderBy(col("count").desc()).show()

+----------+------+
|     state| count|
+----------+------+
|    failed|237451|
|successful|134609|
+----------+------+



In [47]:
processed_campaigns.dtypes

[('total_duration', 'float'),
 ('usd_goal_real', 'float'),
 ('name', 'string'),
 ('category', 'string'),
 ('country', 'string'),
 ('state', 'string')]

In [48]:
# Checks for N/A
processed_campaigns.select([count(when(isnan(c), c)).alias(c) for c in processed_campaigns.columns]).show()

+--------------+-------------+----+--------+-------+-----+
|total_duration|usd_goal_real|name|category|country|state|
+--------------+-------------+----+--------+-------+-----+
|             0|            0|   0|       0|      0|    0|
+--------------+-------------+----+--------+-------+-----+



### 6.2. Creating the dataset for Logistic Regression, Decision Tree, and Random Forest

Our first three models will be:
- Logistic Regression
- Decision Tree
- Random Forest

To create our data pipeline, we will rely on indexing and assembling our data using the following stages:
- **StringIndexer** for all categorical columns
- **OneHotEncoder** for all categorical index columns
- **Tokenizer** and **Word2Vec** for the \<name\> column
- **VectorAssembler** for all feature columns to be assembled into one vector column

In [59]:
# String-indexes the categorical feature columns
categorical_feature_columns = processed_campaigns.columns[3:5]
string_indexing_feature_columns = [
    StringIndexer(inputCol=column, outputCol="strindexed_" + column, handleInvalid="skip")
    for column in categorical_feature_columns
]

# String-indexes the label column
string_indexing_label_column = [
    StringIndexer(inputCol="state", outputCol="label", handleInvalid="skip")
]

<span style="color:red">**Note on pyspark 2.3 used on Google Cloud**:  OneHotEncoder and VectorAssembler do not have the \<handleInvalid\> attribute. The resulting effect is that pyspark can raise a null error during .fit() procedures despite no na exists in the dataset (see cell in previous part). A solution will be to pass dataset column as dataset.na.drop() later on.</span>

In [60]:
# Creates pipeline stages to one-hot encode each categorical feature column
if context == "local":
    onehot_encoding_feature_columns = [
        OneHotEncoder(inputCol = "strindexed_" + column, 
                      outputCol = "onehot_" + column,
                      handleInvalid = "keep")
        for column in categorical_feature_columns
    ]
else:
    onehot_encoding_feature_columns = [
        OneHotEncoder(inputCol = "strindexed_" + column, 
                      outputCol = "onehot_" + column) 
        for column in categorical_feature_columns
    ]

In [61]:
# Creates pipeline stages to vector assemble each categorical feature column
processed_feature_columns = list(map(lambda col_name: "onehot_" + col_name, categorical_feature_columns))
processed_feature_columns += ["total_duration", "usd_goal_real"]

if context == "local":
    vectorassembler_stage = VectorAssembler(inputCols=processed_feature_columns, 
                                            outputCol="features_1",
                                            handleInvalid="skip")
else:
    vectorassembler_stage = VectorAssembler(inputCols=processed_feature_columns, 
                                            outputCol="features_1")

In [62]:
# Creates pipeline stages to vectorize the <name> column
tokenizer = Tokenizer(inputCol="name", outputCol="words")
Word2Vec = Word2Vec(vectorSize=20, inputCol=tokenizer.getOutputCol(), outputCol="features_2")

In [63]:
# Merges the vectors resulting from the categorical feature pipeline and word2vec pipeline
merge_features = VectorAssembler(inputCols=["features_1", "features_2"], outputCol="features")

In [64]:
# Assembles the data processing pipeline
data_processing_pipeline = Pipeline(
    stages = string_indexing_feature_columns +
    string_indexing_label_column + 
    onehot_encoding_feature_columns + 
    [vectorassembler_stage] + 
    [tokenizer] + 
    [Word2Vec] +
    [merge_features]
)

In [72]:
# Fits the data processing pipeline
first_pipeline = data_processing_pipeline.fit(processed_campaigns.na.drop())
first_processed_dataset = first_pipeline.transform(processed_campaigns)

In [73]:
# Caches 20% of the dataset for the session for better time performance
first_processed_dataset = first_processed_dataset.select(final_columns).sample(0.2).cache()

In [74]:
dataset_check(first_processed_dataset)

The dataset contains 73890 rows.
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(193,[18,159,181,...|  0.0|
|(193,[4,162,181,1...|  1.0|
|(193,[28,159,181,...|  0.0|
|(193,[42,160,181,...|  0.0|
|(193,[28,159,181,...|  0.0|
+--------------------+-----+
only showing top 5 rows



### 6.2. Finalizing our dataset for Naive Bayes

Our second batch of models will be:
- Naive Bayes

To create our data pipeline, we will rely on indexing and assembling our data using the following stages:
- **StringIndexer** for all categorical columns
- **OneHotEncoder** for all categorical index columns
- **Tokenizer** and **HashingTF** for the \<name\> column
- **VectorAssembler** for all feature columns to be assembled into one vector column

We use HashingTF for our pipeline because Naive Bayes can only accept positive float values when Word2Vec can output vectors with negative elements.

In [75]:
# Creates pipeline stages to vectorize the <name> column
tokenizer = Tokenizer(inputCol="name", outputCol="words")
HashingTF = HashingTF(numFeatures=20, inputCol=tokenizer.getOutputCol(), outputCol="features_2")

In [76]:
# Assembles the data processing pipeline
data_processing_pipeline = Pipeline(
    stages = string_indexing_feature_columns +
    string_indexing_label_column + 
    onehot_encoding_feature_columns + 
    [vectorassembler_stage] + 
    [tokenizer] + 
    [HashingTF] +
    [merge_features]
)

In [77]:
# Fits the data processing pipeline
pipeline_naive_bayes = data_processing_pipeline.fit(processed_campaigns.na.drop())
second_processed_dataset = pipeline_naive_bayes.transform(processed_campaigns)

In [78]:
# Caches 20% of the dataset for the session for better time performance
second_processed_dataset = second_processed_dataset.select(final_columns).sample(0.2).cache()

In [79]:
dataset_check(second_processed_dataset)

The dataset contains 74522 rows.
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(203,[49,160,181,...|  0.0|
|(203,[8,159,181,1...|  0.0|
|(203,[23,159,181,...|  0.0|
|(203,[42,160,181,...|  0.0|
|(203,[6,159,181,1...|  1.0|
+--------------------+-----+
only showing top 5 rows



## 7. Running a Logistic Regression Pipeline

### 7.1. Declaring model hyper-parameters

In [32]:
# Declares hyperparameters
training_size = 0.7
test_size = 0.3
reg_parameters = [0, 0.5, 1., 2.] # must be float values
elastic_net_parameters = [0., 0.5, 1.] # must be float values

In [None]:
# Declares useful functions
def process_confusion_matrix(matrix):\
    """
    Produces the confusion matrix of a model based on its
    binary classification output.
    """
    items = []
    for item in matrix: 
        items.append(item)
        print(item, matrix[item])
    if Row(label=0.0, prediction=0.0) in items: 
        true_negatives = float(matrix[Row(label=0.0, prediction=0.0)])
    else: 
        true_negatives = 0.
    if Row(label=1.0, prediction=0.0) in items: 
        false_negatives = float(matrix[Row(label=1.0, prediction=0.0)])
    else: 
        false_negatives = 0.
    if Row(label=0.0, prediction=1.0) in items: 
        false_positives = float(matrix[Row(label=0.0, prediction=1.0)])
    else: 
        false_positives = 0.
    if Row(label=1.0, prediction=1.0) in items: 
        true_positives = float(matrix[Row(label=1.0, prediction=1.0)])
    else: 
        true_positives = 0.
    precision = true_positives/(true_positives+false_positives)
    recall = true_positives/(true_positives+false_negatives)
    print("\nPrecision score:", precision)
    print("Recall score:", recall)
    if precision+recall != 0.: 
        print("F1 score:", (precision*recall)/(precision+recall))
        return precision, recall, (precision*recall)/(precision+recall))
    return precision, recall

### 7.2. Creating a model pipeline using Cross-Validation

#### 7.2.1. Building and fitting the model

In [33]:
lr = LogisticRegression(featuresCol="features", labelCol="label")

In [34]:
# Builds a parameter grid
lr_param_grid = ParamGridBuilder().\
    addGrid(lr.regParam, reg_parameters).\
    addGrid(lr.elasticNetParam, elastic_net_parameters).\
    build()

In [35]:
# Builds the evaluator
lr_evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

In [36]:
# Builds the cross-validation model
lr_cv = CrossValidator(estimator=lr, 
                       estimatorParamMaps=lr_param_grid, 
                       evaluator=lr_evaluator, 
                       numFolds=4)

In [37]:
# Fits the cross-validation model
lr_cv_model = lr_cv.fit(first_processed_dataset.na.drop())

In [38]:
print("The model was fit using parameters: \n")
print(lr_cv_model.extractParamMap())

The model was fit using parameters: 
{Param(parent='CrossValidatorModel_ffc5cd3d18df', name='seed', doc='random seed.'): 5370764324114565524, Param(parent='CrossValidatorModel_ffc5cd3d18df', name='numFolds', doc='number of folds for cross validation'): 5, Param(parent='CrossValidatorModel_ffc5cd3d18df', name='estimator', doc='estimator to be cross-validated'): LogisticRegression_3f2500add73e, Param(parent='CrossValidatorModel_ffc5cd3d18df', name='estimatorParamMaps', doc='estimator param maps'): [{Param(parent='LogisticRegression_3f2500add73e', name='regParam', doc='regularization parameter (>= 0).'): 0.0, Param(parent='LogisticRegression_3f2500add73e', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0}, {Param(parent='LogisticRegression_3f2500add73e', name='regParam', doc='regularization parameter (>= 0).'): 0.0, Param(parent='LogisticRegression_3f2500add73e', name='ela

#### 7.2.2. Evaluating the model

In [39]:
# Provides a confusion matrix
lr_label_and_pred = lr_cv_model.transform(first_processed_dataset).select("label", "prediction")
lr_confusion_matrix = lr_label_and_pred.rdd.zipWithIndex().countByKey()
lr_results = process_confusion_matrix(lr_confusion_matrix)

Row(label=1.0, prediction=1.0) 5327
Row(label=0.0, prediction=0.0) 20628
Row(label=1.0, prediction=0.0) 7995
Row(label=0.0, prediction=1.0) 3235

Precision score: 0.6221677178229386
Recall score: 0.3998648851523795
F1 score: 0.24341985011880826


In [40]:
# Intercept and Coefficients of the regresison model
print("Intercept: " + str(lr_cv_model.bestModel.intercept) + "\n"
      "coefficients: " + str(lr_cv_model.bestModel.coefficients))

Intercept: -0.4819529278019527
coefficients: [0.26758795267590774,0.30009144189370945,1.0121060891880869,0.4060548693466408,0.6444665740258785,-0.30477547008262085,-0.14847142463087093,0.026472049456678765,-0.5186856274399023,-0.4007342827118165,-0.1987309359978946,0.38999189488689034,-0.6246978446955926,1.143156112750401,-0.2885451749640301,0.5821233670881333,0.1644139055976517,-1.5141054469178223,-0.1294286631095084,-0.2340278517642612,1.1653789703418131,-0.0018203295877317746,0.31513882922983005,-1.3210137810236107,0.6207974062004611,-0.27529024081583603,1.0686700990323432,0.15550453494200608,-1.199044592513611,0.3993995032425031,0.3241803063560533,0.20269835383190596,-0.1529378693771232,0.8767678319505534,-0.052461231799360616,0.8303418608092669,-0.7370582536777287,0.09200919649376844,-0.3248781917889076,-0.043543017169981595,0.9789183440991038,0.7882246104833379,1.1544693325442312,-0.19631198661363683,0.21423241330250548,-0.3828016823699003,1.2189437365994513,0.11444284502203682,-

In [41]:
# Parameters of the best model
print("The best RegParam is: ", cv_model.bestModel._java_obj.getRegParam(), "\n",
     "The best ElasticNetParam is:", cv_model.bestModel._java_obj.getElasticNetParam())

The best RegParam is:  0.0 
 The best ElasticNetParam is: cv_model.bestModel._java_obj.getElasticNetParam()


## 8. Running a DecisionTreeClassifier Pipeline

### 8.1. Declaring model hyper-parameters

In [42]:
# Declares hyperparameters
max_depth_grid = list(range(2,10))

### 8.2. Creating a model pipeline using Cross-Validation

#### 8.2.1. Building and fitting the model

In [44]:
# Builds the estimator
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")

In [45]:
# Builds a parameter grid
dt_param_grid = ParamGridBuilder().\
    addGrid(dt.maxDepth, max_depth_grid).\
    build()

In [46]:
# Builds the evaluator
dt_evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", 
                                             metricName="areaUnderROC")

In [47]:
# Builds the cross-validation model
dt_cv = CrossValidator(estimator=dt, 
                       estimatorParamMaps=dt_param_grid, 
                       evaluator=dt_evaluator, 
                       numFolds=4)

In [48]:
# Fits the cross-validation model
dt_cv_model = dt_cv.fit(first_processed_dataset.na.drop())

#### 8.2.2. Evaluating the model

In [50]:
# Provides a confusion matrix
dt_cv_label_and_pred = cv_model.transform(first_processed_dataset).select("label", "prediction")
dt_cv_confusion_matrix = dt_cv_label_and_pred.rdd.zipWithIndex().countByKey()
dt_cv_results = process_confusion_matrix(dt_cv_confusion_matrix)

Row(label=1.0, prediction=1.0) 4964
Row(label=0.0, prediction=0.0) 19102
Row(label=1.0, prediction=0.0) 8358
Row(label=0.0, prediction=1.0) 4761

Precision score: 0.5104370179948586
Recall score: 0.37261672421558323
F1 score: 0.21538595044908232


In [51]:
print('The best MaxDepth is:', cv_model.bestModel._java_obj.getMaxDepth())

The best MaxDepth is: 3


In [52]:
print(cv_model.bestModel)

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_e12c75433b1a, depth=3, numNodes=7, numClasses=2, numFeatures=193


### 8.3. Creating a model pipeline using Train-Test split

#### 8.3.1. Building and fitting the model

In [43]:
# Splits the dataset between training and validation sets
training, test = first_processed_dataset.randomSplit([training_size, test_size], seed=0)

In [53]:
# Builds the estimator
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')

In [54]:
# Builds the evaluator
dt_tt_evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")

In [55]:
# Fits the cross-validation model
dt_tt_model = dt.fit(training.na.drop())

#### 8.3.2. Evaluating the model

In [56]:
# Predicts on training data
pred_test = traintest_model.transform(test)

# Provides a confusion matrix
dt_tt_label_and_pred = pred_test.select("label", "prediction")
dt_tt_confusion_matrix = dt_tt_label_and_pred.rdd.zipWithIndex().countByKey()
dt_tt_results = process_confusion_matrix(dt_tt_confusion_matrix)

+--------------------+-----+----------+---------------+--------------------+
|            features|label|prediction|  rawPrediction|         probability|
+--------------------+-----+----------+---------------+--------------------+
|(193,[0,159,181,1...|  1.0|       0.0| [1138.0,149.0]|[0.88422688422688...|
|(193,[0,159,181,1...|  0.0|       0.0|[2358.0,1823.0]|[0.56397990911265...|
|(193,[0,159,181,1...|  0.0|       0.0|  [999.0,440.0]|[0.69423210562890...|
|(193,[1,160,181,1...|  0.0|       0.0|    [408.0,9.0]|[0.97841726618705...|
|(193,[2,159,181,1...|  1.0|       0.0|  [999.0,440.0]|[0.69423210562890...|
+--------------------+-----+----------+---------------+--------------------+
only showing top 5 rows



In [57]:
accuracy = evaluator.evaluate(pred_test)
print("The test error is", 1.0 - accuracy)

The test error is 0.4397729058969222


In [58]:
print(traintest_model)

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_e4af082e9aec, depth=5, numNodes=47, numClasses=2, numFeatures=193


## 9. Running a Random Forest Pipeline

### 9.1. Declaring model hyper-parameters

In [59]:
# Declares hyperparameters
max_depth_grid = map(float,list(range(2,10))) # must be float values
minimum_info_grain = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5] # must be float values

### 9.2. Creating a model pipeline using cross-validation

#### 9.2.1. Building and fitting the model

In [60]:
# Builds the estimator
rf = RandomForestClassifier(featuresCol='features', labelCol='label')

In [61]:
# Builds a parameter grid
rf_param_grid = ParamGridBuilder().\
    addGrid(random_forest_with_crossvalidation.maxDepth, max_depth_grid).\
    addGrid(random_forest_with_crossvalidation.minInfoGain, minimum_info_grain).\
    build()

In [62]:
# Builds the evaluator
rf_evaluator = BinaryClassificationEvaluator()

In [63]:
# Builds the cross-validation model
rf_cv = CrossValidator(estimator=rf, 
                       estimatorParamMaps=rf_param_grid, 
                       evaluator=rf_evaluator)

In [64]:
# Fits the cross-validation model
rf_cv_model = rf_cv.fit(first_processed_dataset.na.drop())

#### 9.2.2. Evaluating the model

In [66]:
# Provides a confusion matrix
rf_label_and_pred = rf_cv_model.transform(first_processed_dataset).select('label', 'prediction')
rf_confusion_matrix = rf_label_and_pred.rdd.zipWithIndex().countByKey()
rf_results = process_confusion_matrix(rf_confusion_matrix)

Row(label=1.0, prediction=1.0) 1899
Row(label=0.0, prediction=0.0) 23158
Row(label=1.0, prediction=0.0) 11423
Row(label=0.0, prediction=1.0) 705

Precision score: 0.7292626728110599
Recall score: 0.14254616423960367
F1 score: 0.11923898028381265


In [67]:
print(rf_cv_model.bestModel)

RandomForestClassificationModel: uid=RandomForestClassifier_5aa6fa19503c, numTrees=20, numClasses=2, numFeatures=193


## 10. Running a Naive Bayes Pipeline

### 10.1. Finalizing the data processing pipeline

In [68]:
# Declares hyperparameters
smoothing = map(float, list(range(0,10))) # must be float values

### 10.2. Creating a model pipeline using cross-validation

#### 10.2.1. Building and fitting the model

In [69]:
# Builds the estimator
nb = NaiveBayes(featuresCol='features', labelCol='label')

In [70]:
# Builds a parameter grid
nb_param_grid = ParamGridBuilder().\
    addGrid(nb.smoothing, smoothing).\
    build()

In [71]:
# Builds the evaluator
nb_evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")

In [72]:
# Builds the cross-validation model
nb_cv = CrossValidator(estimator = nb, 
                       estimatorParamMaps = nb_param_grid, 
                       evaluator = nb_evaluator)

In [73]:
# Fits the cross-validation model
nb_cv_model = nb_cv.fit(second_processed_dataset.na.drop())

#### 10.2.2. Evaluating the model

In [75]:
# Provides a confusion matrix
nb_label_and_pred = cv_model.transform(second_processed_dataset).select('label', 'prediction')
nb_confusion_matrix = nb_label_and_pred.rdd.zipWithIndex().countByKey()
nb_results = process_confusion_matrix(nb_confusion_matrix)

Row(label=0.0, prediction=1.0) 17457
Row(label=1.0, prediction=1.0) 11919
Row(label=0.0, prediction=0.0) 6285
Row(label=1.0, prediction=0.0) 1350

Precision score: 0.4057393790849673
Recall score: 0.8982591001582636
F1 score: 0.2794934927893071


In [76]:
print("The parameter smoothing has best value:", nb_cv_model.bestModel._java_obj.getSmoothing())

The parameter smoothing has best value: 1.0


In [77]:
print(nb_cv_model.bestModel)

NaiveBayesModel: uid=NaiveBayes_cb21321f226a, modelType=multinomial, numClasses=2, numFeatures=203
