# ACM Research Coding Challenge for Spring 2022

##### Submission by Agastya Bose.

First, let's take a look at what the dataset contains.

In [1]:
import pandas as pd # For reading the csv file, data processing, etc

data = pd.read_csv("mushrooms.csv") # Can't get too far in this problem without reading the data file
data.describe()

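| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | ... | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | ... | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
| top | e | x | y | n | f | n | f | c | b | b | ... | s | w | w | p | w | o | p | w | v | d |
| freq | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | ... | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |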


In [2]:
data.head()

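| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |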


In [3]:
data['class'].value_counts()

e    4208
p    3916
Name: class, dtype: int64

We see that the two classes are nearly balanced: 4,208 edible mushrooms against 3,916 poisonous ones, so neither class dominates the dataset.
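
To put a number on that balance, the counts printed above can be turned into class shares. The sketch below hard-codes those two counts instead of reading `mushrooms.csv`, so it runs on its own; inside the notebook, `data['class'].value_counts(normalize=True)` should give the same proportions directly.

```python
# Class counts copied from the value_counts() output above.
counts = {"e": 4208, "p": 3916}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print each class share as a percentage with one decimal place.
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The split comes out at roughly 52% edible to 48% poisonous, which is why plain accuracy is a reasonable first metric here.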

Every column holds its categories as characters, while scikit-learn's estimators expect numeric input, so we convert all the entries to integers. Keep in mind that integer codes impose an arbitrary ordering on categories that have no natural order.

In [4]:
from sklearn.preprocessing import LabelEncoder

mappings = [] # One {integer: original character} dict per column, in column order

encoder = LabelEncoder()

for column in data.columns:
    data[column] = encoder.fit_transform(data[column]) # Changing the chars in each column to ints
    mappings.append(dict(enumerate(encoder.classes_))) # Remembering how to decode this column later
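
To make the encoding step less of a black box, here is a plain-Python sketch of what it does to a single column: the distinct labels are sorted and numbered from 0, which is also how `LabelEncoder` orders its `classes_`. The `label_encode` helper is hypothetical and written only for this illustration; it is not part of scikit-learn.

```python
def label_encode(values):
    """Hypothetical helper: map each distinct label to an integer, in sorted order."""
    classes = sorted(set(values))
    to_int = {label: index for index, label in enumerate(classes)}
    encoded = [to_int[value] for value in values]
    # Same shape as the notebook's per-column mapping: integer -> original label.
    mapping = dict(enumerate(classes))
    return encoded, mapping


# The first five 'class' values shown by data.head().
column = ["p", "e", "e", "p", "e"]
encoded, mapping = label_encode(column)

# The stored mapping lets us recover the original characters.
decoded = [mapping[code] for code in encoded]

print(encoded)
print(mapping)
```

Because the notebook refits a single encoder on every column, the per-column dictionaries in `mappings` are the only record of how to turn the integers back into the original characters.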

Our data is now fully numeric and can be used to build a model. We standardize the features and then split the rows, training on 75% of them and holding out the remaining 25% for testing (a rather arbitrary choice).

In [5]:
from sklearn.model_selection import train_test_split # Preparation to split the data into test and training sets
from sklearn.preprocessing import StandardScaler # Rescales each feature to zero mean and unit variance

y = data['class'] # Target: the encoded edible/poisonous label
x = data.drop('class', axis=1) # Features: every remaining column
scale = StandardScaler()
x = pd.DataFrame(scale.fit_transform(x), columns=x.columns) # Standardizing the dataset

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.75) # ThisIsWhereTheFunBegins.gif
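
One detail worth noting in the cell above: the scaler is fit on every row before the split, so the test rows contribute to the mean and standard deviation used for scaling, and the split is not seeded, so it changes from run to run. A common alternative is to split first, with a fixed seed, and compute the scaling statistics from the training rows only. The sketch below shows that idea in plain Python on a made-up column; `split_indices`, `fit_standardizer` and `standardize` are hypothetical helpers, and the seed value is an arbitrary assumption. With scikit-learn, the same effect would come from calling `fit_transform` on `x_train` and `transform` on `x_test`.

```python
import random
from statistics import fmean, pstdev


def split_indices(n_rows, train_size=0.75, seed=0):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_size)
    return indices[:cut], indices[cut:]


def fit_standardizer(values):
    """Return (mean, std) of a column; a constant column gets std 1 to avoid dividing by zero."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply previously fitted statistics to a list of values."""
    return [(value - mean) / std for value in values]


# Made-up encoded column with 8 rows, for illustration only.
column = [0, 1, 1, 2, 0, 1, 2, 2]
train_idx, test_idx = split_indices(len(column))

# Statistics come from the training rows only and are then reused for the test rows.
mean, std = fit_standardizer([column[i] for i in train_idx])
train_scaled = standardize([column[i] for i in train_idx], mean, std)
test_scaled = standardize([column[i] for i in test_idx], mean, std)
```

Applied to the 8,124 rows here, a 75/25 cut gives 6,093 training rows and 2,031 test rows. The guard for constant columns matters for this dataset because `veil-type` has a single unique value.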

Now we are finally ready to create the model. We will use a support-vector machine classifier, scikit-learn's `SVC`.

In [6]:
from sklearn.svm import SVC

model = SVC() # The bane of all wild mushrooms around the world has thus come into existence
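
Called with no arguments, `SVC()` uses scikit-learn's defaults, which in recent versions means an RBF kernel with `C=1.0` and `gamma='scale'`. As a rough illustration of what that kernel measures, here is a plain-Python sketch of the RBF similarity between two feature vectors. The `rbf_similarity` helper is hypothetical and written for this note (it is not the library's implementation), and both the vectors and the `gamma` value are made up.

```python
import math


def rbf_similarity(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2) between two equal-length vectors."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


# Made-up standardized feature vectors, for illustration only.
u = [0.5, -1.0, 0.0]
v = [0.5, -1.0, 0.0]
w = [2.0, 1.0, -1.5]

gamma = 0.1  # arbitrary value chosen for the example

print(rbf_similarity(u, v, gamma))  # identical vectors
print(rbf_similarity(u, w, gamma))  # vectors that are far apart
```

Identical rows score exactly 1 and the score decays towards 0 as rows move apart, which lets the classifier draw non-linear boundaries between the two classes.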

Now how does the model actually fare? Let's train and test it.

In [7]:
model.fit(x_train, y_train) # Passing in the training values

print(f"Accuracy: {100 * model.score(x_test, y_test)}%") # Mean accuracy on the held-out test set

Accuracy: 100.0%


An accuracy of 100% on the held-out test set! Not too shabby. Now the fine folks over at UTD's chapter of ACM can rest assured the next time they are out in the wild with a sudden urge to eat the mushrooms they find there: this model will be on hand to reliably advise them on what to eat.
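
Since eating a poisonous mushroom is far worse than skipping an edible one, it can be worth looking beyond accuracy at which kind of mistake a model makes. The sketch below counts the four possible outcomes for a pair of made-up label lists (they are not this notebook's predictions); under the label encoding used earlier, 0 stands for edible and 1 for poisonous. The `confusion_counts` helper is hypothetical. In the notebook itself, `sklearn.metrics.confusion_matrix(y_test, model.predict(x_test))` would report the same kind of counts.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives, treating `positive` as poisonous."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, guess in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if guess == positive else "fn"] += 1
        else:
            counts["fp" if guess == positive else "tn"] += 1
    return counts


# Made-up labels for illustration: 0 = edible, 1 = poisonous.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)

print(counts)
print(f"Accuracy: {accuracy:.1%}")
```

In this toy example the single error is a false negative, a poisonous mushroom labelled edible, which is exactly the case a forager cares about most. With a perfect score on the test set, both error counts would be zero.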