# CEBD 1261 Winter 2020
## Final Project: Mushroom classification (Poisonous (p) vs. Edible (e))
### Data source: https://www.kaggle.com/uciml/mushroom-classification 
### By: Pawel Kaluski


While searching for data for my project, I found this dataset. It poses a binary classification problem.
The main challenge was that every column holds single-character codes rather than numbers, so the data required a lot of encoding. The second issue was getting the model to fit into a pipeline.

I ended up adapting the code posted on the Databricks site:

https://docs.databricks.com/applications/machine-learning/mllib/binary-classification-mllib-pipelines.html

Through this project, I learned how to apply machine learning with Spark.

To further improve accuracy, we could consider predicting the missing values in "stalk-root".

In [1]:
import pandas as pd
from pyspark import SparkContext
import pyspark.sql as sparksql
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session and load the CSV, letting Spark infer column types
spark = SparkSession.builder.appName('mushrooms').getOrCreate()
train = spark.read.csv('mushrooms.csv', inferSchema=True, header=True)

### Used pandas to make sure there were no NaN values in any column

In [2]:
# load the same file with pandas to test the data for NaN
df = pd.read_csv('mushrooms.csv')

In [3]:
# Summarize missing data per column: build a DataFrame of column names and their NaN counts
nan_info = pd.DataFrame(df.isnull().sum()).reset_index()
nan_info.columns = ['col','nan_cnt']
nan_info.sort_values(by = 'nan_cnt',ascending=False,inplace=True)
nan_info

Unnamed: 0,col,nan_cnt
0,class,0
12,stalk-surface-above-ring,0
21,population,0
20,spore-print-color,0
19,ring-type,0
18,ring-number,0
17,veil-color,0
16,veil-type,0
15,stalk-color-below-ring,0
14,stalk-color-above-ring,0


### We see there are no NaN values in any column (entries coded as "?" are not counted as NaN)

In [4]:
train.printSchema()

root
 |-- class: string (nullable = true)
 |-- cap-shape: string (nullable = true)
 |-- cap-surface: string (nullable = true)
 |-- cap-color: string (nullable = true)
 |-- bruises: string (nullable = true)
 |-- odor: string (nullable = true)
 |-- gill-attachment: string (nullable = true)
 |-- gill-spacing: string (nullable = true)
 |-- gill-size: string (nullable = true)
 |-- gill-color: string (nullable = true)
 |-- stalk-shape: string (nullable = true)
 |-- stalk-root: string (nullable = true)
 |-- stalk-surface-above-ring: string (nullable = true)
 |-- stalk-surface-below-ring: string (nullable = true)
 |-- stalk-color-above-ring: string (nullable = true)
 |-- stalk-color-below-ring: string (nullable = true)
 |-- veil-type: string (nullable = true)
 |-- veil-color: string (nullable = true)
 |-- ring-number: string (nullable = true)
 |-- ring-type: string (nullable = true)
 |-- spore-print-color: string (nullable = true)
 |-- population: string (nullable = true)
 |-- habitat: string (nullable = true)

## Next we will look at the different features to determine what's in them

In [5]:
# Our Target
train.groupBy('class').count().show()

+-----+-----+
|class|count|
+-----+-----+
|    e| 4208|
|    p| 3916|
+-----+-----+



In [6]:
train.groupBy('cap-shape').count().show()

+---------+-----+
|cap-shape|count|
+---------+-----+
|        x| 3656|
|        f| 3152|
|        k|  828|
|        c|    4|
|        b|  452|
|        s|   32|
+---------+-----+



In [7]:
train.groupBy('cap-surface').count().show()

+-----------+-----+
|cap-surface|count|
+-----------+-----+
|          g|    4|
|          f| 2320|
|          y| 3244|
|          s| 2556|
+-----------+-----+



In [8]:
train.groupBy('cap-color').count().show()

+---------+-----+
|cap-color|count|
+---------+-----+
|        g| 1840|
|        n| 2284|
|        e| 1500|
|        p|  144|
|        y| 1072|
|        w| 1040|
|        c|   44|
|        u|   16|
|        b|  168|
|        r|   16|
+---------+-----+



In [9]:
train.groupBy('bruises').count().show()

+-------+-----+
|bruises|count|
+-------+-----+
|      f| 4748|
|      t| 3376|
+-------+-----+



In [10]:
train.groupBy('odor').count().show()

+----+-----+
|odor|count|
+----+-----+
|   l|  400|
|   m|   36|
|   f| 2160|
|   n| 3528|
|   p|  256|
|   y|  576|
|   c|  192|
|   a|  400|
|   s|  576|
+----+-----+



In [11]:
train.groupBy('gill-attachment').count().show()

+---------------+-----+
|gill-attachment|count|
+---------------+-----+
|              f| 7914|
|              a|  210|
+---------------+-----+



In [12]:
train.groupBy('gill-spacing').count().show()

+------------+-----+
|gill-spacing|count|
+------------+-----+
|           w| 1312|
|           c| 6812|
+------------+-----+



In [13]:
train.groupBy('gill-size').count().show()

+---------+-----+
|gill-size|count|
+---------+-----+
|        n| 2512|
|        b| 5612|
+---------+-----+



In [14]:
train.groupBy('gill-color').count().show()

+----------+-----+
|gill-color|count|
+----------+-----+
|         g|  752|
|         n| 1048|
|         k|  408|
|         e|   96|
|         o|   64|
|         h|  732|
|         p| 1492|
|         w| 1202|
|         y|   86|
|         u|  492|
|         b| 1728|
|         r|   24|
+----------+-----+



In [15]:
train.groupBy('stalk-shape').count().show()

+-----------+-----+
|stalk-shape|count|
+-----------+-----+
|          e| 3516|
|          t| 4608|
+-----------+-----+



In [16]:
train.groupBy('stalk-root').count().show()

+----------+-----+
|stalk-root|count|
+----------+-----+
|         e| 1120|
|         c|  556|
|         b| 3776|
|         r|  192|
|         ?| 2480|
+----------+-----+



#### We can see that 2480 values are missing (coded as "?"), so we can exclude this column in the MVP
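The pandas `isnull()` check earlier in the notebook reports zero missing values because this dataset marks missing entries with the string `?` rather than NaN. A minimal sketch, in plain Python, of counting such placeholders per column; the sample rows and the helper name `count_placeholders` are made up for illustration and are not part of the original notebook:

```python
import csv
import io

def count_placeholders(csv_text, placeholder="?"):
    """Count, per column, how many cells equal the placeholder string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == placeholder:
                counts[name] += 1
    return counts

# Hypothetical three-row sample, not taken from mushrooms.csv
sample = "class,stalk-root,odor\np,e,p\ne,?,a\ne,?,l\n"
placeholder_counts = count_placeholders(sample)
print(placeholder_counts)
```

Run against the real file, a check like this would presumably flag only `stalk-root`, in line with the groupBy output above.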

In [17]:
train.groupBy('stalk-surface-above-ring').count().show()

+------------------------+-----+
|stalk-surface-above-ring|count|
+------------------------+-----+
|                       f|  552|
|                       k| 2372|
|                       y|   24|
|                       s| 5176|
+------------------------+-----+



In [18]:
train.groupBy('stalk-surface-below-ring').count().show()

+------------------------+-----+
|stalk-surface-below-ring|count|
+------------------------+-----+
|                       f|  600|
|                       k| 2304|
|                       y|  284|
|                       s| 4936|
+------------------------+-----+



In [19]:
train.groupBy('stalk-color-above-ring').count().show()

+----------------------+-----+
|stalk-color-above-ring|count|
+----------------------+-----+
|                     g|  576|
|                     n|  448|
|                     e|   96|
|                     o|  192|
|                     p| 1872|
|                     w| 4464|
|                     y|    8|
|                     c|   36|
|                     b|  432|
+----------------------+-----+



In [20]:
train.groupBy('stalk-color-below-ring').count().show()

+----------------------+-----+
|stalk-color-below-ring|count|
+----------------------+-----+
|                     g|  576|
|                     n|  512|
|                     e|   96|
|                     o|  192|
|                     p| 1872|
|                     w| 4384|
|                     y|   24|
|                     c|   36|
|                     b|  432|
+----------------------+-----+



In [21]:
train.groupBy('veil-color').count().show()

+----------+-----+
|veil-color|count|
+----------+-----+
|         n|   96|
|         o|   96|
|         w| 7924|
|         y|    8|
+----------+-----+



In [22]:
train.groupBy('veil-type').count().show()

+---------+-----+
|veil-type|count|
+---------+-----+
|        p| 8124|
+---------+-----+



#### Since this feature has the same value in every row, it adds no information and will not be used in our model
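Rather than reading each groupBy table by eye, single-valued columns could also be found programmatically. A small sketch in plain Python, assuming the rows are available as dictionaries; the helper name `constant_columns` and the sample rows are hypothetical:

```python
def constant_columns(rows):
    """Return the names of columns that hold a single distinct value."""
    distinct = {}
    for row in rows:
        for name, value in row.items():
            distinct.setdefault(name, set()).add(value)
    return sorted(name for name, values in distinct.items() if len(values) == 1)

# Hypothetical sample rows, not taken from mushrooms.csv
rows = [
    {"class": "p", "veil-type": "p", "odor": "p"},
    {"class": "e", "veil-type": "p", "odor": "a"},
    {"class": "e", "veil-type": "p", "odor": "l"},
]
dropped = constant_columns(rows)
print(dropped)
```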

In [23]:
train.groupBy('ring-number').count().show()

+-----------+-----+
|ring-number|count|
+-----------+-----+
|          n|   36|
|          o| 7488|
|          t|  600|
+-----------+-----+



In [24]:
train.groupBy('ring-type').count().show()

+---------+-----+
|ring-type|count|
+---------+-----+
|        l| 1296|
|        f|   48|
|        n|   36|
|        e| 2776|
|        p| 3968|
+---------+-----+



In [25]:
train.groupBy('spore-print-color').count().show()

+-----------------+-----+
|spore-print-color|count|
+-----------------+-----+
|                n| 1968|
|                k| 1872|
|                o|   48|
|                h| 1632|
|                w| 2388|
|                y|   48|
|                u|   48|
|                b|   48|
|                r|   72|
+-----------------+-----+



In [26]:
train.groupBy('population').count().show()

+----------+-----+
|population|count|
+----------+-----+
|         n|  400|
|         v| 4040|
|         y| 1712|
|         c|  340|
|         a|  384|
|         s| 1248|
+----------+-----+



In [27]:
train.groupBy('habitat').count().show()

+-------+-----+
|habitat|count|
+-------+-----+
|      l|  832|
|      g| 2148|
|      m|  292|
|      p| 1144|
|      d| 3148|
|      w|  192|
|      u|  368|
+-------+-----+



### We will remove 'veil-type' and 'stalk-root'

In [28]:
train = train.select('class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat')
cols = train.columns
train.printSchema()

root
 |-- class: string (nullable = true)
 |-- cap-shape: string (nullable = true)
 |-- cap-surface: string (nullable = true)
 |-- cap-color: string (nullable = true)
 |-- bruises: string (nullable = true)
 |-- odor: string (nullable = true)
 |-- gill-attachment: string (nullable = true)
 |-- gill-spacing: string (nullable = true)
 |-- gill-size: string (nullable = true)
 |-- gill-color: string (nullable = true)
 |-- stalk-shape: string (nullable = true)
 |-- stalk-surface-above-ring: string (nullable = true)
 |-- stalk-surface-below-ring: string (nullable = true)
 |-- stalk-color-above-ring: string (nullable = true)
 |-- stalk-color-below-ring: string (nullable = true)
 |-- veil-color: string (nullable = true)
 |-- ring-number: string (nullable = true)
 |-- ring-type: string (nullable = true)
 |-- spore-print-color: string (nullable = true)
 |-- population: string (nullable = true)
 |-- habitat: string (nullable = true)



In [29]:
# we will look at the first 5 rows to see if data is still ok and confirm the columns were removed
import pandas as pd
pd.DataFrame(train.take(5), columns=train.columns).transpose()

Unnamed: 0,0,1,2,3,4
class,p,e,e,p,e
cap-shape,x,x,b,x,x
cap-surface,s,s,s,y,s
cap-color,n,y,w,w,g
bruises,t,t,t,t,f
odor,p,a,l,p,n
gill-attachment,f,f,f,f,f
gill-spacing,c,c,c,c,w
gill-size,n,b,b,n,b
gill-color,k,k,n,n,k


### This part is where the encoding takes place (converting labels to numbers). The Databricks example was used.

In [30]:
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

In [31]:
# Databricks example
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
categoricalColumns = ["cap-shape", "cap-surface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", 
                      "gill-size", "gill-color", "stalk-shape", "stalk-surface-above-ring", "stalk-surface-below-ring", 
                      "stalk-color-above-ring", "stalk-color-below-ring", "veil-color", "ring-number", "ring-type", 
                      "spore-print-color", "population", "habitat"]

stages = [] # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    # encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index", outputCol=categoricalCol + "classVec")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages.  These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]
    
# Convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol="class", outputCol="label")
stages += [label_stringIdx]

assemblerInputs = [c + "classVec" for c in categoricalColumns]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
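To make the two encoding stages more concrete, here is a plain-Python sketch of what they do to one column. It assumes the default settings (labels indexed by descending frequency, and the encoder dropping the last category) and returns a dense list where Spark would produce a sparse vector. The helper names `fit_string_index` and `one_hot_drop_last` are invented for this illustration, and the alphabetical tie-break is an assumption:

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties (assumed)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: index for index, category in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """Encode an index as a 0/1 list, leaving out the last category."""
    vector = [0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1
    return vector

# cap-surface counts as reported by the groupBy output in this notebook
cap_surface = ["y"] * 3244 + ["s"] * 2556 + ["f"] * 2320 + ["g"] * 4
mapping = fit_string_index(cap_surface)
encoded_y = one_hot_drop_last(mapping["y"], len(mapping))
encoded_g = one_hot_drop_last(mapping["g"], len(mapping))
print(mapping, encoded_y, encoded_g)
```

The rarest category (`g` here) ends up as the all-zero vector, which is why a four-valued column only needs three slots.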

In [32]:
from pyspark.ml.classification import LogisticRegression
  
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(train)
preppedDataDF = pipelineModel.transform(train)

# Fit model to prepped data
lrModel = LogisticRegression().fit(preppedDataDF)

In [33]:
preppedDataDF.printSchema()

root
 |-- class: string (nullable = true)
 |-- cap-shape: string (nullable = true)
 |-- cap-surface: string (nullable = true)
 |-- cap-color: string (nullable = true)
 |-- bruises: string (nullable = true)
 |-- odor: string (nullable = true)
 |-- gill-attachment: string (nullable = true)
 |-- gill-spacing: string (nullable = true)
 |-- gill-size: string (nullable = true)
 |-- gill-color: string (nullable = true)
 |-- stalk-shape: string (nullable = true)
 |-- stalk-surface-above-ring: string (nullable = true)
 |-- stalk-surface-below-ring: string (nullable = true)
 |-- stalk-color-above-ring: string (nullable = true)
 |-- stalk-color-below-ring: string (nullable = true)
 |-- veil-color: string (nullable = true)
 |-- ring-number: string (nullable = true)
 |-- ring-type: string (nullable = true)
 |-- spore-print-color: string (nullable = true)
 |-- population: string (nullable = true)
 |-- habitat: string (nullable = true)
 |-- cap-shapeIndex: double (nullable = false)
 |-- cap-shapeclas

In [34]:
selectedcols = ["label", "features"]
dataset = preppedDataDF.select(selectedcols)
display(dataset)

DataFrame[label: double, features: vector]

In [35]:
dataset.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)



In [36]:
dataset.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 4208|
|  1.0| 3916|
+-----+-----+



In [37]:
dataset.groupBy('features').count().show()

+--------------------+-----+
|            features|count|
+--------------------+-----+
|(91,[3,5,12,23,26...|    1|
|(91,[0,6,11,23,26...|    1|
|(91,[0,5,12,23,26...|    1|
|(91,[1,6,11,22,26...|    1|
|(91,[1,5,12,24,26...|    1|
|(91,[0,7,9,18,26,...|    1|
|(91,[1,7,9,17,18,...|    1|
|(91,[1,7,8,18,26,...|    1|
|(91,[0,7,9,17,19,...|    1|
|(91,[0,7,9,18,26,...|    1|
|(91,[1,5,9,18,26,...|    1|
|(91,[1,7,10,18,26...|    1|
|(91,[1,7,9,17,19,...|    1|
|(91,[1,7,9,17,19,...|    1|
|(91,[1,7,11,17,19...|    1|
|(91,[1,7,9,18,26,...|    1|
|(91,[1,5,11,17,19...|    1|
|(91,[0,7,11,17,19...|    1|
|(91,[0,5,12,17,18...|    1|
|(91,[12,18,26,31,...|    1|
+--------------------+-----+
only showing top 20 rows
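The leading `91` in each row is the length of the assembled feature vector. Assuming the encoder drops the last category of each column (its default), every feature with k distinct values contributes k - 1 slots, and the category counts from the groupBy outputs earlier in this notebook add up as follows:

```python
# Number of distinct values per feature, read off the groupBy outputs above
cardinalities = {
    "cap-shape": 6, "cap-surface": 4, "cap-color": 10, "bruises": 2,
    "odor": 9, "gill-attachment": 2, "gill-spacing": 2, "gill-size": 2,
    "gill-color": 12, "stalk-shape": 2, "stalk-surface-above-ring": 4,
    "stalk-surface-below-ring": 4, "stalk-color-above-ring": 9,
    "stalk-color-below-ring": 9, "veil-color": 4, "ring-number": 3,
    "ring-type": 5, "spore-print-color": 9, "population": 6, "habitat": 7,
}
# With the last category dropped, each feature contributes (k - 1) slots
feature_length = sum(k - 1 for k in cardinalities.values())
print(feature_length)  # 91
```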



In [38]:
# Randomly split data into training and test sets; set seed for reproducibility
(trainingData, testData) = dataset.randomSplit([0.75, 0.25], seed=100)
print(trainingData.count())
print(testData.count())

6096
2028
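The split sizes are close to, but not exactly, 75% and 25% (6096 of 8124 rows is about 75.04%). One way to picture why is a split that decides each row independently with a seeded random draw. The sketch below illustrates that idea in plain Python; it is not Spark's actual implementation, the helper name `random_split` is made up, and its counts will differ from the ones above:

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(8124))  # stand-in for the 8124 mushroom records
train_a, test_a = random_split(rows, 0.75, seed=100)
train_b, test_b = random_split(rows, 0.75, seed=100)
print(len(train_a), len(test_a))
```

Reusing the same seed reproduces the same split, which is the reason for fixing `seed=100` in the cell above.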


### Logistic Regression

In [39]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=10)
lrModel = lr.fit(trainingData)

In [40]:
# Make predictions on test data using the transform() method.
# LogisticRegression.transform() will only use the 'features' column.
predictions = lrModel.transform(testData)

In [41]:
selected = predictions.select("label", "prediction")
display(selected)

DataFrame[label: double, prediction: double]

In [42]:
selected.describe()

DataFrame[summary: string, label: string, prediction: string]

In [43]:
# Evaluating model with the BinaryClassificationEvaluator
# Default metric for the BinaryClassificationEvaluator is areaUnderROC,
# so the value printed below is the area under the ROC curve, not classification accuracy

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
lr_acc = evaluator.evaluate(predictions)
print('A Logistic Regression algorithm had an accuracy of: {0:2.2f}%'.format(lr_acc*100))

A Logistic Regression algorithm had an accuracy of: 100.00%
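Because the evaluator's default metric is areaUnderROC, the figure above measures how well the model ranks poisonous above edible samples rather than the share of correct predictions. The two can differ, as this plain-Python sketch with made-up scores suggests; the helper names `accuracy` and `area_under_roc` are hypothetical, and the pairwise AUC formula is a simple illustration, not the evaluator's own computation:

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def area_under_roc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical scores: the ranking is perfect, but all fall below 0.5
labels = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4]
auc = area_under_roc(labels, scores)
acc = accuracy(labels, scores)
print(auc, acc)  # 1.0 0.5
```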


### Decision Tree

In [44]:
from pyspark.ml.classification import DecisionTreeClassifier

# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)

# Train model with Training Data
dtModel = dt.fit(trainingData)

In [45]:
# to list the number of nodes and the tree depth of the model 

print("numNodes = ", dtModel.numNodes)
print("depth = ", dtModel.depth)

numNodes =  11
depth =  3
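These two numbers can be sanity-checked with a little arithmetic, given that Spark ML decision trees are binary and every internal node has exactly two children. A depth-3 tree can hold at most 15 nodes, and 11 nodes correspond to 6 leaves and 5 splits. The helper names below are made up for this sketch:

```python
def max_nodes(depth):
    """Upper bound on nodes in a binary tree of the given depth."""
    return 2 ** (depth + 1) - 1

def leaf_count(num_nodes):
    """Leaves in a binary tree where every internal node has two children."""
    return (num_nodes + 1) // 2

print(max_nodes(3))    # 15
print(leaf_count(11))  # 6
```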


In [46]:
# This can also be done with the following
display(dtModel)

DecisionTreeClassificationModel (uid=DecisionTreeClassifier_b4e73d042775) of depth 3 with 11 nodes

In [47]:
# Make predictions on test data using the Transformer.transform() method.
dtpredictions = dtModel.transform(testData)

In [48]:
dtpredictions.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [49]:
# Evaluating model with the BinaryClassificationEvaluator
# Its default metric is areaUnderROC, so the value printed below is not classification accuracy
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Evaluate model
evaluator = BinaryClassificationEvaluator()
dt_acc = evaluator.evaluate(dtpredictions)
print('A Decision Tree algorithm had an accuracy of: {0:2.2f}%'.format(dt_acc*100))

A Decision Tree algorithm had an accuracy of: 98.02%


### Decision Tree Classifier

In [50]:
from pyspark.ml.classification import DecisionTreeClassifier
dtc = DecisionTreeClassifier(labelCol='label',featuresCol='features')

In [51]:
# Train model with Training Data
dtcModel = dtc.fit(trainingData)

In [52]:
dtcpredictions = dtcModel.transform(testData)

In [53]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Compare predictions with the true labels and compute accuracy on the test set
acc_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

dtc_acc = acc_evaluator.evaluate(dtcpredictions)
print('A Decision Tree algorithm had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))

A Decision Tree algorithm had an accuracy of: 99.80%


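### The last algorithm, the "Decision Tree Classifier" with default depth, had the best accuracy.
#### Despite changes to the train/test ratio, it maintained an accuracy of over 99.8%.
#### The logistic regression always shows 100%, which is not a good sign and could indicate over-fitting. Note that this score and the first Decision Tree score come from the BinaryClassificationEvaluator, whose default metric is areaUnderROC rather than accuracy, so they are not directly comparable with the 99.80% figure.
#### The runner-up algorithm is the first Decision Tree (maxDepth=3). It maintained its score of over 98%.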