# Spark Pipeline on Kickstarter Pledge Dataset

## 1. Overview

### 1.1. Instructions

- **Choosing any sufficiently large open dataset** (datasets with fewer than 100,000 rows are not allowed)


- **Choosing one variable to predict**


- **Implementing at least two supervised learning models**: classification, regression, recommender system, etc. Unsupervised tasks (e.g. clustering, association rules, etc.) are not allowed


- **Mandatory use of Apache Spark** (e.g. on Google Cloud as we did during our lab sessions)


- A **full machine learning pipeline must be implemented**, which includes:
    - Reading the data
    - Transforming data (extracting features, dealing with missing values if any, etc.)
    - Building models (build at least two models to compare)
    - Evaluating quality (use cross-validation or train/test split)

### 1.2. Dataset

### 1.3. Summary & Conclusion

The notebook was also run locally using the installation steps for Spark described [here](https://sparkbyexamples.com/spark/spark-installation-on-linux-ubuntu/).

## 2. Environment Set-Up

We need the following libraries installed to set up the environment:

- kaggle (see documentation [here](https://github.com/Kaggle/kaggle-api#datasets))
- pyspark (see documentation [here](https://spark.apache.org/docs/latest/api/python/index.html))

In [1]:
!pip install kaggle
!pip install pyspark



## 3. Dataset Download

In [2]:
# Removes existing files that may have been downloaded locally
!rm -f kickstarter-projects.zip
!rm -f ks-projects-201612.csv ks-projects-201801.csv

In [3]:
# Downloads the raw dataset from the Kaggle source
!kaggle datasets download -d kemical/kickstarter-projects

Downloading kickstarter-projects.zip to /home/qlr/Programming/kickstarter_pledge_prediction
 98%|█████████████████████████████████████▏| 36.0M/36.8M [00:03<00:00, 9.53MB/s]
100%|██████████████████████████████████████| 36.8M/36.8M [00:04<00:00, 9.50MB/s]


In [4]:
# Unzips the raw dataset and keeps only the most recent instance
!unzip kickstarter-projects.zip
!rm -f ks-projects-201612.csv kickstarter-projects.zip
!ls

Archive:  kickstarter-projects.zip
  inflating: ks-projects-201612.csv  
  inflating: ks-projects-201801.csv  
draft.ipynb		README.md		      spark_pipeline_vDef.ipynb
ks-projects-201801.csv	spark-ml-full-pipeline.ipynb  Untitled.ipynb


We only keep 'ks-projects-201801.csv', the most recent dataset available.

## 4. Library Imports & Spark Variables

In [28]:
from pyspark.context import SparkContext
from pyspark.sql.functions import unix_timestamp, ceil, isnan, when, count, col
from pyspark.sql.session import SparkSession
from pyspark.sql.types import FloatType

In [6]:
dataset_path = "ks-projects-201801.csv"
dataset_format = "csv"
spark_context = "local"  # Spark master URL; "local" runs Spark on the local computer

## 5. Starting a Spark Session & Loading the Dataset

In [7]:
sc = SparkContext(spark_context)
spark = SparkSession(sc)

In [13]:
spark

In [89]:
campaigns = (spark
             .read
             .format(dataset_format)
             .options(header=True)
             .load(dataset_path))

## 6. Data Preprocessing

In [90]:
raw_columns_to_keep = [
    "ID","name","category","deadline","launched","country","usd_goal_real", #features
    "state" # target
]

replace_start_end_dates_with_duration = [
    "ID","name","category","total_duration","country","usd_goal_real", #features
    "state" # target
]

kept_columns_for_decision_tree = [
    "total_duration","usd_goal_real","category","country", #features
    "state" # target
]

deadline_format = "yyyy-MM-dd"
launched_format = "yyyy-MM-dd HH:mm:ss"

In [91]:
# Checks the type of each column
campaigns.dtypes

[('ID', 'string'),
 ('name', 'string'),
 ('category', 'string'),
 ('main_category', 'string'),
 ('currency', 'string'),
 ('deadline', 'string'),
 ('goal', 'string'),
 ('launched', 'string'),
 ('pledged', 'string'),
 ('state', 'string'),
 ('backers', 'string'),
 ('country', 'string'),
 ('usd pledged', 'string'),
 ('usd_pledged_real', 'string'),
 ('usd_goal_real', 'string')]

In [92]:
# Counts NaN values per column (isnan does not flag null entries)
campaigns.select([count(when(isnan(c), c)).alias(c) for c in campaigns.columns]).show()

+---+----+--------+-------------+--------+--------+----+--------+-------+-----+-------+-------+-----------+----------------+-------------+
| ID|name|category|main_category|currency|deadline|goal|launched|pledged|state|backers|country|usd pledged|usd_pledged_real|usd_goal_real|
+---+----+--------+-------------+--------+--------+----+--------+-------+-----+-------+-------+-----------+----------------+-------------+
|  0|   0|       0|            0|       0|       0|   0|       0|      0|    0|      0|      0|          0|               0|            0|
+---+----+--------+-------------+--------+--------+----+--------+-------+-----+-------+-------+-----------+----------------+-------------+
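
NaN and null are different things: NaN is a floating-point value, while null marks an absent entry, and a NaN test does not flag the latter. The following is a plain-Python sketch of counting both kinds of missing value in a column, with `None` standing in for null. The helper `count_missing` and the toy column are hypothetical and not taken from the dataset.

```python
import math

def count_missing(column):
    """Return (number of None entries, number of NaN entries) in a column."""
    nulls = sum(1 for value in column if value is None)
    nans = sum(1 for value in column
               if isinstance(value, float) and math.isnan(value))
    return nulls, nans

toy_column = [1.0, None, float("nan"), 3.5, None]  # illustrative values only
print(count_missing(toy_column))  # (2, 1)
```

In PySpark, the null test is `Column.isNull()`, which can be combined with `isnan` inside the same `when` expression.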



In [93]:
# Keeps only the relevant columns
campaigns = campaigns.select(raw_columns_to_keep)
# --
print(f"The dataset contains {campaigns.count()} rows.")
campaigns.show(n=5)

The dataset contains 378661 rows.
+----------+--------------------+--------------+----------+-------------------+-------+-------------+--------+
|        ID|                name|      category|  deadline|           launched|country|usd_goal_real|   state|
+----------+--------------------+--------------+----------+-------------------+-------+-------------+--------+
|1000002330|The Songs of Adel...|        Poetry|2015-10-09|2015-08-11 12:12:28|     GB|      1533.95|  failed|
|1000003930|Greeting From Ear...|Narrative Film|2017-11-01|2017-09-02 04:43:57|     US|     30000.00|  failed|
|1000004038|      Where is Hank?|Narrative Film|2013-02-26|2013-01-12 00:20:50|     US|     45000.00|  failed|
|1000007540|ToshiCapital Reko...|         Music|2012-04-16|2012-03-17 03:24:11|     US|      5000.00|  failed|
|1000011046|Community Film Pr...|  Film & Video|2015-08-29|2015-07-04 08:35:03|     US|     19500.00|canceled|
+----------+--------------------+--------------+----------+-------------------+-------+-------------+--------+
only showing top 5 rows

In [94]:
# Computes the campaign duration (in days) from the launch and deadline features before dropping them
launch_times = unix_timestamp('launched', format = launched_format)
deadline_times = unix_timestamp('deadline', format = deadline_format)
time_difference = deadline_times - launch_times
campaigns = campaigns.\
    withColumn("total_duration",ceil(time_difference/(3600*24))).\
    select(replace_start_end_dates_with_duration)
# --
campaigns.show(n=5)

+----------+--------------------+--------------+--------------+-------+-------------+--------+
|        ID|                name|      category|total_duration|country|usd_goal_real|   state|
+----------+--------------------+--------------+--------------+-------+-------------+--------+
|1000002330|The Songs of Adel...|        Poetry|            59|     GB|      1533.95|  failed|
|1000003930|Greeting From Ear...|Narrative Film|            60|     US|     30000.00|  failed|
|1000004038|      Where is Hank?|Narrative Film|            45|     US|     45000.00|  failed|
|1000007540|ToshiCapital Reko...|         Music|            30|     US|      5000.00|  failed|
|1000011046|Community Film Pr...|  Film & Video|            56|     US|     19500.00|canceled|
+----------+--------------------+--------------+--------------+-------+-------------+--------+
only showing top 5 rows
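
As a sanity check on the duration arithmetic, the first and fourth rows can be reproduced with the standard library. This is a minimal sketch outside Spark. The helper name `duration_in_days` is hypothetical, and it assumes both timestamps are interpreted in the same time zone.

```python
from datetime import datetime
from math import ceil

SECONDS_PER_DAY = 3600 * 24

def duration_in_days(launched: str, deadline: str) -> int:
    """Round the launch-to-deadline span up to a whole number of days."""
    launch_time = datetime.strptime(launched, "%Y-%m-%d %H:%M:%S")
    # A date without a time component parses as midnight of that day
    deadline_time = datetime.strptime(deadline, "%Y-%m-%d")
    elapsed_seconds = (deadline_time - launch_time).total_seconds()
    return ceil(elapsed_seconds / SECONDS_PER_DAY)

print(duration_in_days("2015-08-11 12:12:28", "2015-10-09"))  # 59
print(duration_in_days("2012-03-17 03:24:11", "2012-04-16"))  # 30
```

Because of the rounding up, a campaign launched at noon and ending 58.5 days later counts as 59 days.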



In [95]:
# Casts the numeric columns from string to float
for column in ["total_duration", "usd_goal_real"]:
    campaigns = campaigns.withColumn(column,col(column).cast(FloatType()))

### 6.1. Building a dataset for a decision tree classifier

In [96]:
# Removes ID and name from the dataset
decision_tree_dataset = campaigns.select(kept_columns_for_decision_tree)
# --
print(f"The dataset contains {decision_tree_dataset.count()} rows.")
decision_tree_dataset.show(n=5)
campaigns.dtypes

The dataset contains 378661 rows.
+--------------+-------------+--------------+-------+--------+
|total_duration|usd_goal_real|      category|country|   state|
+--------------+-------------+--------------+-------+--------+
|          59.0|      1533.95|        Poetry|     GB|  failed|
|          60.0|      30000.0|Narrative Film|     US|  failed|
|          45.0|      45000.0|Narrative Film|     US|  failed|
|          30.0|       5000.0|         Music|     US|  failed|
|          56.0|      19500.0|  Film & Video|     US|canceled|
+--------------+-------------+--------------+-------+--------+
only showing top 5 rows



[('ID', 'string'),
 ('name', 'string'),
 ('category', 'string'),
 ('total_duration', 'float'),
 ('country', 'string'),
 ('usd_goal_real', 'float'),
 ('state', 'string')]

## 7. Running a DecisionTreeClassifier Pipeline

In [112]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

### 7.1. Creating a data processing pipeline

The data processing pipeline indexes, encodes, and assembles the features using the following stages:
- **StringIndexer** to map each categorical column to a numeric index
- **OneHotEncoder** to encode each categorical index column as a one-hot vector
- **VectorAssembler** to combine all feature columns into one vector column
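
To make the first two stages concrete, here is a plain-Python sketch of the idea, not Spark's implementation. It assumes the default `StringIndexer` ordering, where the most frequent value receives index 0, and it represents the one-hot vector as a dense list. The helper names and the toy country list are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct value to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: index for index, value in enumerate(ordered)}

def one_hot(index, size):
    """Return a dense list of length `size` with a single 1.0 at position `index`."""
    return [1.0 if position == index else 0.0 for position in range(size)]

countries = ["US", "GB", "US", "CA", "US", "GB"]  # toy data, not taken from the dataset
index_of = fit_string_index(countries)
print(index_of)                                # {'US': 0, 'GB': 1, 'CA': 2}
print(one_hot(index_of["GB"], len(index_of)))  # [0.0, 1.0, 0.0]
```

Spark's encoder returns sparse vectors and, by default, drops the last category, so vector lengths there can differ from this dense sketch.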

In [149]:
# Declares hyperparameters
training_size = 0.8
validation_size = 0.2
max_depth_grid = list(range(2,10))

In [150]:
# Creates a pipeline stage to string index each categorical feature column, and the label column
categorical_feature_columns = decision_tree_dataset.columns[2:4]
string_indexing_feature_columns = [StringIndexer(inputCol=column, 
                                                 outputCol='strindexed_' + column,
                                                 handleInvalid="skip")
                                   for column in categorical_feature_columns]
string_indexing_label_column = [StringIndexer(inputCol='state', 
                                              outputCol='label',
                                              handleInvalid="skip")]

In [151]:
# Creates a pipeline stage to one-hot encode each categorical feature column
onehot_encoding_feature_columns = [OneHotEncoder(inputCol='strindexed_' + column, 
                                                 outputCol='onehot_' + column,
                                                 handleInvalid="keep") 
                                  for column in categorical_feature_columns]

In [152]:
# Creates a pipeline stage that assembles the encoded categorical columns and the numeric columns into one feature vector
processed_feature_columns = list(map(lambda col_name: "onehot_" + col_name, categorical_feature_columns))
processed_feature_columns += ["total_duration", "usd_goal_real"]
vectorassembler_stage = VectorAssembler(inputCols=processed_feature_columns, 
                                        outputCol='features',
                                        handleInvalid="skip")

In [153]:
# Assembles the data processing pipeline
data_processing_pipeline = Pipeline(
    stages = string_indexing_feature_columns +
    string_indexing_label_column + 
    onehot_encoding_feature_columns + 
    [vectorassembler_stage]
)
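
Conceptually, fitting a pipeline fits each stage in order and passes the transformed data on to the next stage. The toy classes below are hypothetical stand-ins written in plain Python to illustrate that flow. They are not part of PySpark, and for brevity `fit` returns the pipeline itself rather than a separate fitted model object.

```python
class MeanCenterer:
    """Toy estimator: learns the mean at fit time, subtracts it at transform time."""
    def fit(self, rows):
        self.mean = sum(rows) / len(rows)
        return self

    def transform(self, rows):
        return [value - self.mean for value in rows]

class AbsoluteValue:
    """Toy stateless transformer: nothing to learn at fit time."""
    def fit(self, rows):
        return self

    def transform(self, rows):
        return [abs(value) for value in rows]

class ToyPipeline:
    """Chains stages: each one is fitted on the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

pipeline = ToyPipeline([MeanCenterer(), AbsoluteValue()]).fit([1.0, 2.0, 3.0])
print(pipeline.transform([1.0, 2.0, 3.0]))  # [1.0, 0.0, 1.0]
```

The stage order matters for the same reason as in the notebook: the one-hot encoders consume the columns produced by the string indexers, so the indexers must come first.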

In [154]:
# Fits the data processing pipeline
pipeline_model = data_processing_pipeline.fit(decision_tree_dataset)

In [157]:
final_columns = processed_feature_columns + ['features', 'label']
decision_tree_dataset_prepped = pipeline_model.transform(decision_tree_dataset).select(final_columns)
# --
print(f"The dataset contains {decision_tree_dataset_prepped.count()} rows.")        
decision_tree_dataset_prepped.show(5)

The dataset contains 377364 rows.
+-----------------+---------------+--------------+-------------+--------------------+-----+
|  onehot_category| onehot_country|total_duration|usd_goal_real|            features|label|
+-----------------+---------------+--------------+-------------+--------------------+-----+
|(1441,[62],[1.0])|(226,[1],[1.0])|          59.0|      1533.95|(1669,[62,1442,16...|  0.0|
|(1441,[22],[1.0])|(226,[0],[1.0])|          60.0|      30000.0|(1669,[22,1441,16...|  0.0|
|(1441,[22],[1.0])|(226,[0],[1.0])|          45.0|      45000.0|(1669,[22,1441,16...|  0.0|
| (1441,[2],[1.0])|(226,[0],[1.0])|          30.0|       5000.0|(1669,[2,1441,166...|  0.0|
| (1441,[7],[1.0])|(226,[0],[1.0])|          56.0|      19500.0|(1669,[7,1441,166...|  2.0|
+-----------------+---------------+--------------+-------------+--------------------+-----+
only showing top 5 rows
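
The `features` column is printed in sparse notation: `(size, [indices], [values])`. Since the assembler concatenates its inputs in order, the positions of each block are shifted by the total width of the blocks before it. The sketch below reproduces that bookkeeping for the first row in plain Python. The block widths are read off the output above, and the helper name `assemble_sparse` is hypothetical.

```python
def assemble_sparse(blocks):
    """Concatenate (width, {index: value}) blocks into one sparse (size, indices, values) triple."""
    offset, indices, values = 0, [], []
    for width, entries in blocks:
        for index in sorted(entries):
            indices.append(offset + index)  # shift by the width of all earlier blocks
            values.append(entries[index])
        offset += width
    return offset, indices, values

first_row = [
    (1441, {62: 1.0}),   # onehot_category
    (226, {1: 1.0}),     # onehot_country
    (1, {0: 59.0}),      # total_duration
    (1, {0: 1533.95}),   # usd_goal_real
]
size, indices, values = assemble_sparse(first_row)
print(size, indices[:2])  # 1669 [62, 1442]
```

The total size, 1441 + 226 + 1 + 1 = 1669, matches the first number in each printed `features` entry.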



In [158]:
# Splits the dataset between training and validation sets
training, validation = decision_tree_dataset_prepped.randomSplit([training_size, validation_size], seed=0)
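
A weighted random split can be pictured as an independent draw per row, which is why the resulting sizes are only approximately 80/20. The following is a plain-Python sketch of that idea under a fixed seed. It is not Spark's sampler, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent weighted draw."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for split, weight in zip(splits, weights):
            cumulative += weight
            if draw < cumulative:
                split.append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point round-off
    return splits

training_rows, validation_rows = random_split(range(1000), [0.8, 0.2], seed=0)
print(len(training_rows) + len(validation_rows))  # 1000
```

Fixing the seed makes the split reproducible across runs, which keeps the later evaluation comparable between experiments.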

### 7.2. Creating a model pipeline

In [159]:
# Builds the estimator
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')

In [160]:
# Builds a parameter grid
param_grid = ParamGridBuilder().\
    addGrid(dt.maxDepth, max_depth_grid).\
    build()

In [161]:
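# Builds the evaluator. The indexed 'state' label has more than two classes
# (label 2.0 appears in the output above), so a multiclass evaluator is used.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")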

In [162]:
# Builds the cross-validation model
cv = CrossValidator(estimator=dt, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)
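
For a sense of the cost of this search: k-fold cross-validation trains one model per fold for every parameter combination. The sketch below counts those fits for the grid and fold count used here, and shows one way rows could be dealt into folds. It is a plain-Python illustration. `k_fold_indices` is a hypothetical helper, and it deals rows round-robin for readability, whereas Spark assigns rows to folds randomly.

```python
max_depth_grid = list(range(2, 10))  # same grid as above: depths 2 through 9
num_folds = 4

fits_during_search = len(max_depth_grid) * num_folds
print(fits_during_search)  # 32

def k_fold_indices(row_count, folds):
    """Yield (train, held_out) index lists, dealing rows into folds round-robin."""
    for fold in range(folds):
        held_out = [i for i in range(row_count) if i % folds == fold]
        train = [i for i in range(row_count) if i % folds != fold]
        yield train, held_out

for train, held_out in k_fold_indices(8, num_folds):
    print(held_out, train)
```

After the search, the best parameter setting is refit once more on the full input passed to `fit`, so the total is one fit more than the count above.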

In [None]:
# Fits the cross-validation model on the training split only, keeping the validation split held out
cv_model = cv.fit(training)