# CEBD 1261 Winter 2020
## Final Project: Mushroom classification (Poisonous (p) vs. Edible (e))
### Data source: https://www.kaggle.com/uciml/mushroom-classification 
### By: Pawel Kaluski


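While searching for a dataset for my project, I found this one. It is a binary classification problem: each mushroom is either poisonous (p) or edible (e). The challenge with this dataset is that every column is categorical. The values are single-character codes rather than numbers, so the data requires a lot of encoding before a model can use it.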

In [55]:
from pyspark.sql import SparkSession
import pandas as pd

# Start (or reuse) a Spark session and load the mushroom data
spark = SparkSession.builder.appName('mushrooms').getOrCreate()
train = spark.read.csv('mushrooms.csv', inferSchema=True, header=True)

In [56]:
# Register the DataFrame as a temporary SQL view
train.createOrReplaceTempView('table')

In [57]:
train.printSchema()

root
 |-- class: string (nullable = true)
 |-- cap-shape: string (nullable = true)
 |-- cap-surface: string (nullable = true)
 |-- cap-color: string (nullable = true)
 |-- bruises: string (nullable = true)
 |-- odor: string (nullable = true)
 |-- gill-attachment: string (nullable = true)
 |-- gill-spacing: string (nullable = true)
 |-- gill-size: string (nullable = true)
 |-- gill-color: string (nullable = true)
 |-- stalk-shape: string (nullable = true)
 |-- stalk-root: string (nullable = true)
 |-- stalk-surface-above-ring: string (nullable = true)
 |-- stalk-surface-below-ring: string (nullable = true)
 |-- stalk-color-above-ring: string (nullable = true)
 |-- stalk-color-below-ring: string (nullable = true)
 |-- veil-type: string (nullable = true)
 |-- veil-color: string (nullable = true)
 |-- ring-number: string (nullable = true)
 |-- ring-type: string (nullable = true)
 |-- spore-print-color: string (nullable = true)
 |-- population: string (nullable = true)
 |-- habitat: string 

In [58]:
train.groupBy('class').count().show()

+-----+-----+
|class|count|
+-----+-----+
|    e| 4208|
|    p| 3916|
+-----+-----+
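
The counts above show that the two classes are close to balanced. As a quick check, the shares can be worked out directly from the numbers in the table. This is plain Python with no Spark involved, and the variable names are only illustrative:

```python
# Class counts copied from the groupBy output above
counts = {"e": 4208, "p": 3916}

total = sum(counts.values())  # 8124 rows in total
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.3f}")  # roughly 0.518 for e and 0.482 for p
```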



In [59]:
pd.DataFrame(train.take(5), columns=train.columns)

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [60]:
train

DataFrame[class: string, cap-shape: string, cap-surface: string, cap-color: string, bruises: string, odor: string, gill-attachment: string, gill-spacing: string, gill-size: string, gill-color: string, stalk-shape: string, stalk-root: string, stalk-surface-above-ring: string, stalk-surface-below-ring: string, stalk-color-above-ring: string, stalk-color-below-ring: string, veil-type: string, veil-color: string, ring-number: string, ring-type: string, spore-print-color: string, population: string, habitat: string]

In [61]:
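# Every column in this dataset is a string, so this list is empty and describe() returns only the summary labels
numeric_features = [name for name, dtype in train.dtypes if dtype == 'int']
train.select(numeric_features).describe().toPandas().transpose()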

Unnamed: 0,0,1,2,3,4
summary,count,mean,stddev,min,max


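This is where the encoding takes place. Each categorical column is first converted to a numeric index with `StringIndexer`, and that index is then one-hot encoded into a vector with `OneHotEncoder`.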
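
To make the two encoding steps concrete, here is a small pure-Python sketch of what I understand the default behaviour to be: indices are assigned by descending frequency, and the one-hot vector drops the last category. The helper names `string_index` and `one_hot`, the alphabetical tie-break, and the toy `cap_shapes` list are my own assumptions for illustration. This is not Spark's implementation.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically here (an assumption for this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a one-hot list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

# Toy column (hypothetical values, not taken from the dataset rows)
cap_shapes = ["x", "x", "b", "x", "f", "b"]
mapping = string_index(cap_shapes)
encoded = [one_hot(mapping[s], len(mapping)) for s in cap_shapes]

print(mapping)
print(encoded)
```

Dropping the last category keeps the encoded columns from being perfectly collinear, which is why a column with three distinct values becomes a vector of length two.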

In [63]:
from pyspark.ml.feature import (VectorAssembler,OneHotEncoder,StringIndexer)

In [64]:
cap_shape_indexer = StringIndexer(inputCol='cap-shape',outputCol='cap_shapeIndex')
cap_shape_encoder = OneHotEncoder(inputCol='cap_shapeIndex',outputCol='cap_shapeVec')

In [65]:
cap_surface_indexer = StringIndexer(inputCol='cap-surface',outputCol='cap_surfaceIndex')
cap_surface_encoder = OneHotEncoder(inputCol='cap_surfaceIndex',outputCol='cap_surfaceVec')

In [66]:
cap_color_indexer = StringIndexer(inputCol='cap-color',outputCol='cap_colorIndex')
cap_color_encoder = OneHotEncoder(inputCol='cap_colorIndex',outputCol='cap_colorVec')

In [67]:
bruises_indexer = StringIndexer(inputCol='bruises',outputCol='bruisesIndex')
bruises_encoder = OneHotEncoder(inputCol='bruisesIndex',outputCol='bruisesVec')

In [68]:
odor_indexer = StringIndexer(inputCol='odor',outputCol='odorIndex')
bodor_encoder = OneHotEncoder(inputCol='odorIndex',outputCol='odorVec')

In [69]:
gill_attachment_indexer = StringIndexer(inputCol='gill-attachment',outputCol='gill_attachmentIndex')
gill_attachment_encoder = OneHotEncoder(inputCol='gill_attachmentIndex',outputCol='gill_attachmentVec')

In [70]:
gill_spacing_indexer = StringIndexer(inputCol='gill-spacing',outputCol='gill_spacingIndex')
gill_spacing_encoder = OneHotEncoder(inputCol='gill_spacingIndex',outputCol='gill_spacingVec')

In [71]:
gill_size_indexer = StringIndexer(inputCol='gill-size',outputCol='gill_sizeIndex')
gill_size_encoder = OneHotEncoder(inputCol='gill_sizeIndex',outputCol='gill_sizeVec')

In [72]:
gill_color_indexer = StringIndexer(inputCol='gill-color',outputCol='gill_colorIndex')
gill_color_encoder = OneHotEncoder(inputCol='gill_colorIndex',outputCol='gill_colorVec')

In [73]:
stalk_shape_indexer = StringIndexer(inputCol='stalk-shape',outputCol='stalk_shapeIndex')
stalk_shape_encoder = OneHotEncoder(inputCol='stalk_shapeIndex',outputCol='stalk_shapeVec')

In [74]:
stalk_root_indexer = StringIndexer(inputCol='stalk-root',outputCol='stalk_rootIndex')
stalk_root_encoder = OneHotEncoder(inputCol='stalk_rootIndex',outputCol='stalk_rootVec')

In [75]:
stalk_surface_above_ring_indexer = StringIndexer(inputCol='stalk-surface-above-ring',outputCol='stalk_surface_above_ringIndex')
stalk_surface_above_ring_encoder = OneHotEncoder(inputCol='stalk_surface_above_ringIndex',outputCol='stalk_surface_above_ringVec')

In [76]:
stalk_surface_below_ring_indexer = StringIndexer(inputCol='stalk-surface-below-ring',outputCol='stalk_surface_below_ringIndex')
stalk_surface_below_ring_encoder = OneHotEncoder(inputCol='stalk_surface_below_ringIndex',outputCol='stalk_surface_below_ringVec')

In [77]:
stalk_color_above_ring_indexer = StringIndexer(inputCol='stalk-color-above-ring',outputCol='stalk_color_above_ringIndex')
stalk_color_above_ring_encoder = OneHotEncoder(inputCol='stalk_color_above_ringIndex',outputCol='stalk_color_above_ringVec')

In [78]:
stalk_color_below_ring_indexer = StringIndexer(inputCol='stalk-color-below-ring',outputCol='stalk_color_below_ringIndex')
stalk_color_below_ring_encoder = OneHotEncoder(inputCol='stalk_color_below_ringIndex',outputCol='stalk_color_below_ringVec')

In [79]:
veil_type_indexer = StringIndexer(inputCol='veil-type',outputCol='veil_typeIndex')
veil_type_encoder = OneHotEncoder(inputCol='veil_typeIndex',outputCol='veil_typeVec')

In [80]:
veil_color_indexer = StringIndexer(inputCol='veil-color',outputCol='veil_colorIndex')
veil_color_encoder = OneHotEncoder(inputCol='veil_colorIndex',outputCol='veil_colorVec')

In [81]:
ring_number_indexer = StringIndexer(inputCol='ring-number',outputCol='ring_numberIndex')
ring_number_encoder = OneHotEncoder(inputCol='ring_numberIndex',outputCol='ring_numberVec')

In [82]:
ring_type_indexer = StringIndexer(inputCol='ring-type',outputCol='ring_typeIndex')
ring_type_encoder = OneHotEncoder(inputCol='ring_typeIndex',outputCol='ring_typeVec')

In [83]:
spore_print_color_indexer = StringIndexer(inputCol='spore-print-color',outputCol='spore_print_colorIndex')
spore_print_color_encoder = OneHotEncoder(inputCol='spore_print_colorIndex',outputCol='spore_print_colorVec')

In [86]:
population_indexer = StringIndexer(inputCol='population',outputCol='populationIndex')
population_encoder = OneHotEncoder(inputCol='populationIndex',outputCol='populationVec')

In [87]:
habitat_indexer = StringIndexer(inputCol='habitat',outputCol='habitatIndex')
habitat_encoder = OneHotEncoder(inputCol='habitatIndex',outputCol='habitatVec')

In [88]:
assembler = VectorAssembler(inputCols=['cap_shapeVec',
 'cap_surfaceVec',
 'cap_colorVec',
 'bruisesVec',
 'odorVec',
 'gill_attachmentVec',
 'gill_spacingVec',
 'gill_sizeVec',
 'gill_colorVec',
 'stalk_shapeVec',
 'stalk_rootVec',
 'stalk_surface_above_ringVec',
 'stalk_surface_below_ringVec',
 'stalk_color_above_ringVec',
 'stalk_color_below_ringVec',                                     
 'veil_typeVec',                                   
 'veil_colorVec',
 'ring_numberVec',
 'ring_typeVec',
 'spore_print_colorVec',
 'populationVec',                                      
 'habitatVec'],outputCol='features')
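
The cells above repeat the same pattern for all 22 feature columns, and with that much copy and paste it is easy to mistype an intermediate column name. One possible way to avoid this is to derive every name from the raw column name. The helper `stage_names` below is hypothetical, written only to show the naming convention used in this notebook:

```python
# The 22 feature columns from the schema (everything except 'class')
FEATURE_COLUMNS = [
    "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

def stage_names(column):
    """Derive the index and vector column names for one raw column."""
    base = column.replace("-", "_")
    return {"input": column, "index": base + "Index", "vector": base + "Vec"}

names = [stage_names(c) for c in FEATURE_COLUMNS]

# These match the inputCols passed to the VectorAssembler above
assembler_inputs = [n["vector"] for n in names]
print(assembler_inputs[:3])
```

Each entry in `names` could then be passed to `StringIndexer(inputCol=..., outputCol=...)` and `OneHotEncoder(inputCol=..., outputCol=...)` inside a loop, so that the index and vector names always stay in sync.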

In [49]:
from pyspark.ml.classification import DecisionTreeClassifier

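In [89]:
# The classifier needs a numeric label, so index the 'class' column (e/p) first
label_indexer = StringIndexer(inputCol='class', outputCol='label')
dtc = DecisionTreeClassifier(labelCol='label', featuresCol='features')

In [None]:
from pyspark.ml import Pipeline
# Indexers run before their encoders; the odor encoder is defined above as bodor_encoder
pipeline = Pipeline(stages=[label_indexer, cap_shape_indexer, cap_surface_indexer, cap_color_indexer, bruises_indexer, odor_indexer, gill_attachment_indexer, gill_spacing_indexer, gill_size_indexer,
                            gill_color_indexer, stalk_shape_indexer, stalk_root_indexer, stalk_surface_above_ring_indexer, stalk_surface_below_ring_indexer, stalk_color_above_ring_indexer,
                            stalk_color_below_ring_indexer, veil_type_indexer, veil_color_indexer, ring_number_indexer, ring_type_indexer, spore_print_color_indexer, population_indexer, habitat_indexer,
                            cap_shape_encoder, cap_surface_encoder, cap_color_encoder, bruises_encoder, bodor_encoder, gill_attachment_encoder, gill_spacing_encoder, gill_size_encoder, gill_color_encoder,
                            stalk_shape_encoder, stalk_root_encoder, stalk_surface_above_ring_encoder, stalk_surface_below_ring_encoder, stalk_color_above_ring_encoder, stalk_color_below_ring_encoder,
                            veil_type_encoder, veil_color_encoder, ring_number_encoder, ring_type_encoder, spore_print_color_encoder, population_encoder, habitat_encoder, assembler, dtc])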