## Overview

This notebook demonstrates a simple multiclass classification task with PySpark, using the "Estimation of obesity levels based on eating habits and physical condition" dataset.

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorIndexer, VectorAssembler

from pyspark.sql.functions import count, when, col, isnan

### Loading dataset

In [0]:
# File location and type
file_location = "/FileStore/tables/ObesityDataSet_raw_and_data_sinthetic.csv"
file_type = "csv"

# Read dataset
df = spark.read.csv(file_location, header=True, inferSchema=True)

In [0]:
# show the first 5 rows of the data
df.show(5)

+------+----+------+------+------------------------------+----+----+---+---------+-----+----+---+---+---+----------+--------------------+-------------------+
|Gender| Age|Height|Weight|family_history_with_overweight|FAVC|FCVC|NCP|     CAEC|SMOKE|CH2O|SCC|FAF|TUE|      CALC|              MTRANS|         NObeyesdad|
+------+----+------+------+------------------------------+----+----+---+---------+-----+----+---+---+---+----------+--------------------+-------------------+
|Female|21.0|  1.62|  64.0|                           yes|  no| 2.0|3.0|Sometimes|   no| 2.0| no|0.0|1.0|        no|Public_Transporta...|      Normal_Weight|
|Female|21.0|  1.52|  56.0|                           yes|  no| 3.0|3.0|Sometimes|  yes| 3.0|yes|3.0|0.0| Sometimes|Public_Transporta...|      Normal_Weight|
|  Male|23.0|   1.8|  77.0|                           yes|  no| 2.0|3.0|Sometimes|   no| 2.0| no|2.0|1.0|Frequently|Public_Transporta...|      Normal_Weight|
|  Male|27.0|   1.8|  87.0|                         

In [0]:
# printing the dataset schema
df.printSchema()

root
 |-- Gender: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Height: double (nullable = true)
 |-- Weight: double (nullable = true)
 |-- family_history_with_overweight: string (nullable = true)
 |-- FAVC: string (nullable = true)
 |-- FCVC: double (nullable = true)
 |-- NCP: double (nullable = true)
 |-- CAEC: string (nullable = true)
 |-- SMOKE: string (nullable = true)
 |-- CH2O: double (nullable = true)
 |-- SCC: string (nullable = true)
 |-- FAF: double (nullable = true)
 |-- TUE: double (nullable = true)
 |-- CALC: string (nullable = true)
 |-- MTRANS: string (nullable = true)
 |-- NObeyesdad: string (nullable = true)



In [0]:
# the number of records in the dataset
df.count()

Out[6]: 2111

In [0]:
df.summary().show()

+-------+------+-----------------+-------------------+------------------+------------------------------+----+------------------+------------------+------+-----+------------------+----+------------------+------------------+------+----------+-------------------+
|summary|Gender|              Age|             Height|            Weight|family_history_with_overweight|FAVC|              FCVC|               NCP|  CAEC|SMOKE|              CH2O| SCC|               FAF|               TUE|  CALC|    MTRANS|         NObeyesdad|
+-------+------+-----------------+-------------------+------------------+------------------------------+----+------------------+------------------+------+-----+------------------+----+------------------+------------------+------+----------+-------------------+
|  count|  2111|             2111|               2111|              2111|                          2111|2111|              2111|              2111|  2111| 2111|              2111|2111|              2111|              

### Checking null values

Here we check whether the dataset contains missing values: nulls, NaNs, empty strings, or the literal strings 'None' and 'NULL'.

In [0]:
df.select([count(when(col(c).contains('None') |
                            col(c).contains('NULL') |
                            (col(c) == '' ) |
                            col(c).isNull() |
                            isnan(c), c 
                           )).alias(c)
                    for c in df.columns]).show()

+------+---+------+------+------------------------------+----+----+---+----+-----+----+---+---+---+----+------+----------+
|Gender|Age|Height|Weight|family_history_with_overweight|FAVC|FCVC|NCP|CAEC|SMOKE|CH2O|SCC|FAF|TUE|CALC|MTRANS|NObeyesdad|
+------+---+------+------+------------------------------+----+----+---+----+-----+----+---+---+---+----+------+----------+
|     0|  0|     0|     0|                             0|   0|   0|  0|   0|    0|   0|  0|  0|  0|   0|     0|         0|
+------+---+------+------+------------------------------+----+----+---+----+-----+----+---+---+---+----+------+----------+
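To make the rule behind the check explicit, here is a minimal plain-Python sketch of roughly the same predicate. The helper `is_missing` and the toy `rows` are hypothetical and exist only for illustration. The Spark expression above works on columns and relies on implicit casts, which this sketch does not reproduce.

```python
import math

def is_missing(value):
    """Return True for values the check above would count as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        # empty string, or a string containing the literal 'None' / 'NULL'
        return value == "" or "None" in value or "NULL" in value
    return False

# toy rows (made up for illustration, not taken from the dataset)
rows = [
    {"Gender": "Female", "Age": 21.0},
    {"Gender": "", "Age": float("nan")},
    {"Gender": None, "Age": 23.0},
]

# count missing values per column, like the one-row table above
missing_counts = {c: sum(is_missing(r[c]) for r in rows) for c in rows[0]}
print(missing_counts)
```

A row of zeros, as in the output above, means none of these conditions matched in any column.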



### Labels Indexer

StringIndexer maps the string values in a column to numeric indices. The labelIndexer maps only the label column, NObeyesdad.

In [0]:
labelIndexer = StringIndexer(inputCol='NObeyesdad', outputCol='idxObese').fit(df)
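As a rough illustration of how the indices are assigned, the sketch below mimics the mapping in plain Python. It assumes StringIndexer's default ordering (`frequencyDesc`, with ties broken alphabetically in Spark 3.x). The function `fit_string_indexer` and the toy label list are hypothetical and not part of PySpark.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each distinct string to an index: most frequent label gets 0.0."""
    counts = Counter(values)
    # sort by descending frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

# toy label column (made up for illustration)
labels = [
    "Normal_Weight", "Obesity_Type_I", "Obesity_Type_I",
    "Overweight_Level_I", "Obesity_Type_I", "Normal_Weight",
]

mapping = fit_string_indexer(labels)
print(mapping)
```

Under this ordering the indices carry no ordinal meaning: index 0.0 is simply the most common class in the data the indexer was fitted on.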

### Feature Indexer

Similar to the label indexer, the featureIndexer maps string data to indices. Here the input columns are the categorical features rather than the label.

In [0]:
categorical_col = [c for c, tp in df.dtypes if tp == 'string']

In [0]:
categorical_col.remove('NObeyesdad')

In [0]:
featureIndexer = StringIndexer(inputCols=categorical_col, 
                               outputCols=[f'{c}_idx' for c in categorical_col]).fit(df)

### Assemble the features

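Here we assemble the numerical columns of the dataset into a single feature vector. Note that numerical_col holds only the original numeric columns, so the indexed categorical columns produced by featureIndexer are not part of the features vector.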

In [0]:
numerical_col = [c for c, tp in df.dtypes if tp in ['int', 'double']]

In [0]:
featureAssembler = VectorAssembler(inputCols=numerical_col, outputCol='features')
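If the indexed categorical columns should also feed the classifier, one possible approach is to append the `_idx` column names to the assembler inputs. The sketch below only builds that list of names, using the column types from the schema printed earlier. The name `assembler_inputs` is an assumption introduced here, and the scores reported in this notebook come from the numeric-only version.

```python
# (name, type) pairs copied from the schema printed above
dtypes = [
    ("Gender", "string"), ("Age", "double"), ("Height", "double"),
    ("Weight", "double"), ("family_history_with_overweight", "string"),
    ("FAVC", "string"), ("FCVC", "double"), ("NCP", "double"),
    ("CAEC", "string"), ("SMOKE", "string"), ("CH2O", "double"),
    ("SCC", "string"), ("FAF", "double"), ("TUE", "double"),
    ("CALC", "string"), ("MTRANS", "string"), ("NObeyesdad", "string"),
]
label_col = "NObeyesdad"

# same selection logic as the notebook cells, minus the label column
categorical_col = [c for c, tp in dtypes if tp == "string" and c != label_col]
numerical_col = [c for c, tp in dtypes if tp in ("int", "double")]

# hypothetical combined input list: numeric columns plus indexed categoricals
assembler_inputs = numerical_col + [f"{c}_idx" for c in categorical_col]
print(assembler_inputs)
```

A list like this could be passed as `inputCols` to `VectorAssembler` in place of `numerical_col`, since the `_idx` columns exist by the time the assembler stage runs in the pipeline.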

### Split the data

In [0]:
# split the data into training and test dataset
training_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

### Create the classifier

In [0]:
# create the DecisionTreeClassifier: labelCol is the indexed label column to predict,
# and featuresCol is the assembled feature vector
dtc = DecisionTreeClassifier(labelCol="idxObese", featuresCol="features", seed=42)

### Create process pipeline

Create a pipeline of the processing stages the data must go through, then fit the pipeline on the training dataset.

In [0]:
# creating the pipeline for the stages of data processing
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, featureAssembler, dtc])

In [0]:
# fit the pipeline on the training dataset
model = pipeline.fit(training_df)
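Conceptually, fitting a pipeline walks through the stages in order: stages that are already fitted (such as labelIndexer and featureIndexer here) only transform the data, while estimators (such as dtc) are fitted on the output of the previous stages. The toy sketch below imitates that control flow in plain Python. All class and function names are hypothetical, and it is a simplification rather than Spark's actual implementation.

```python
class AddOne:
    """Toy transformer: already 'fitted', so it only transforms."""
    def transform(self, rows):
        return [r + 1 for r in rows]

class Shift:
    """Toy fitted model produced by MeanCenter.fit()."""
    def __init__(self, offset):
        self.offset = offset
    def transform(self, rows):
        return [r + self.offset for r in rows]

class MeanCenter:
    """Toy estimator: fit() learns the mean and returns a transformer."""
    def fit(self, rows):
        return Shift(-sum(rows) / len(rows))

def fit_pipeline(stages, rows):
    """Fit estimators in order, passing transformed data down the chain."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):      # estimator: fit on the current data
            stage = stage.fit(rows)
        fitted.append(stage)
        rows = stage.transform(rows)   # feed the result to the next stage
    return fitted

def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows

fitted = fit_pipeline([AddOne(), MeanCenter()], [1, 2, 3])
result = transform_pipeline(fitted, [1, 2, 3])
print(result)
```

The same pattern is why `model.transform(test_df)` can later reuse everything learned from the training data without refitting.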

### Predictions and evaluations

Create the prediction dataset by transforming the test dataframe with the fitted model.

In [0]:
pred = model.transform(test_df)

In [0]:
pred.select('idxObese', 'prediction').show(10)

+--------+----------+
|idxObese|prediction|
+--------+----------+
|     5.0|       5.0|
|     5.0|       5.0|
|     0.0|       0.0|
|     3.0|       3.0|
|     6.0|       6.0|
|     3.0|       3.0|
|     3.0|       3.0|
|     6.0|       6.0|
|     6.0|       6.0|
|     6.0|       6.0|
+--------+----------+
only showing top 10 rows



Evaluate the classification results using the F1 score and accuracy.

In [0]:
metrics = ['f1', 'accuracy']
evaluators = [MulticlassClassificationEvaluator(labelCol='idxObese', 
                                                predictionCol='prediction', metricName=f'{c}') for c in metrics]
f1 = evaluators[0].evaluate(pred)
acc = evaluators[1].evaluate(pred)

print(f'f1 score: {f1}')
print(f'accuracy: {acc}')

f1 score: 0.8337950983586657
accuracy: 0.8360215053763441
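For intuition about the two metrics, the following plain-Python sketch computes accuracy and a weighted F1 score, where each class's F1 is weighted by its share of the true labels (which is how the evaluator's 'f1' metric is defined). The helper names and the toy labels and predictions are made up for illustration and are unrelated to the results above.

```python
from collections import Counter

def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to true-label frequency."""
    n = len(labels)
    label_counts = Counter(labels)
    total = 0.0
    for cls, cls_count in label_counts.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = sum(l == cls and p != cls for l, p in zip(labels, preds))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * cls_count / n
    return total

# toy data (made up for illustration)
toy_labels = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
toy_preds = [0.0, 1.0, 1.0, 1.0, 2.0, 0.0]

toy_acc = accuracy(toy_labels, toy_preds)
toy_f1 = weighted_f1(toy_labels, toy_preds)
print(f"accuracy: {toy_acc}")
print(f"f1 score: {toy_f1}")
```

Because the F1 score is weighted by class frequency, it can differ from accuracy when some classes are predicted better than others, as the small gap between the two scores above suggests.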
