In [1]:
using CSV
using DataFrames



# Is that Mushroom Edible????

The following information describes the mushroom dataset. Before the data can be used, it must be cleaned and converted into Boolean values. This code was written exclusively to perform that task.

Attribute Information:

1. cap-shape: 
        b=bell
        c=conical
        x=convex
        f=flat
        k=knobbed
        s=sunken
2. cap-surface: 
        f=fibrous
        g=grooves
        y=scaly
        s=smooth
3. cap-color: 
        n=brown
        b=buff
        c=cinnamon
        g=gray
        r=green
        p=pink
        u=purple
        e=red
        w=white
        y=yellow
4. bruises?: 
        t=bruises
        f=no
5. odor: 
        a=almond
        l=anise
        c=creosote
        y=fishy
        f=foul
        m=musty
        n=none
        p=pungent
        s=spicy
6. gill-attachment: 
        a=attached
        d=descending
        f=free
        n=notched
7. gill-spacing: 
        c=close
        w=crowded
        d=distant
8. gill-size: 
        b=broad
        n=narrow
9. gill-color: 
        k=black
        n=brown
        b=buff
        h=chocolate
        g=gray
        r=green
        o=orange
        p=pink
        u=purple
        e=red
        w=white
        y=yellow
10. stalk-shape: 
        e=enlarging
        t=tapering
11. stalk-root: 
        b=bulbous
        c=club
        u=cup
        e=equal
        z=rhizomorphs
        r=rooted
        ?=missing
12. stalk-surface-above-ring: 
        f=fibrous
        y=scaly
        k=silky
        s=smooth
13. stalk-surface-below-ring: 
        f=fibrous
        y=scaly
        k=silky
        s=smooth
14. stalk-color-above-ring: 
        n=brown
        b=buff
        c=cinnamon
        g=gray
        o=orange
        p=pink
        e=red
        w=white
        y=yellow
15. stalk-color-below-ring: 
        n=brown
        b=buff
        c=cinnamon
        g=gray
        o=orange
        p=pink
        e=red
        w=white
        y=yellow
16. veil-type: 
        p=partial
        u=universal
17. veil-color: 
        n=brown
        o=orange
        w=white
        y=yellow
18. ring-number: 
        n=none
        o=one
        t=two
19. ring-type: 
        c=cobwebby
        e=evanescent
        f=flaring
        l=large
        n=none
        p=pendant
        s=sheathing
        z=zone
20. spore-print-color: 
        k=black
        n=brown
        b=buff
        h=chocolate
        r=green
        o=orange
        u=purple
        w=white
        y=yellow
21. population: 
        a=abundant
        c=clustered
        n=numerous
        s=scattered
        v=several
        y=solitary
22. habitat: 
        g=grasses
        l=leaves
        m=meadows
        p=paths
        u=urban
        w=waste
        d=woods

The logical rules that have proven most successful at determining whether a mushroom is edible or poisonous are as follows.

    P_1) odor=NOT(almond.OR.anise.OR.none)
         120 poisonous cases missed, 98.52% accuracy

    P_2) spore-print-color=green
         48 cases missed, 99.41% accuracy

    P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.
              (stalk-color-above-ring=NOT.brown)
         8 cases missed, 99.90% accuracy

    P_4) habitat=leaves.AND.cap-color=white
         100% accuracy

    Rule P_4) may also be

    P_4') population=clustered.AND.cap-color=white
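
As a rough illustration, the rules above could be written as a single Julia predicate. This is only a sketch: the function name `is_poisonous`, the NamedTuple layout, and the two sample records are hypothetical, and it assumes the rules are applied cumulatively, so that a mushroom is flagged as poisonous as soon as any one rule matches.

```julia
# Hypothetical helper: `m` is one mushroom given as a NamedTuple of single-letter codes.
# Assumption: the rules are combined with OR, so any matching rule flags the mushroom.
function is_poisonous(m)
    p1 = !(m.odor in ("a", "l", "n"))              # P_1: odor is not almond, anise, or none
    p2 = m.spore_print_color == "r"                # P_2: spore print is green
    p3 = m.odor == "n" &&
         m.stalk_surface_below_ring == "y" &&
         m.stalk_color_above_ring != "n"           # P_3: no odor, scaly below ring, not brown above ring
    p4 = m.habitat == "l" && m.cap_color == "w"    # P_4: grows on leaves and has a white cap
    return p1 || p2 || p3 || p4
end

# Two made-up records for illustration only (not rows from the dataset)
foul = (odor = "f", spore_print_color = "k", stalk_surface_below_ring = "s",
        stalk_color_above_ring = "w", habitat = "d", cap_color = "n")
plain = (odor = "n", spore_print_color = "k", stalk_surface_below_ring = "s",
         stalk_color_above_ring = "w", habitat = "g", cap_color = "n")

is_poisonous(foul)   # true, because P_1 matches
is_poisonous(plain)  # false, because no rule matches
```

A `false` result here only means that none of the four rules matched the record.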


In [2]:
#here I load the data; the file has no header row, so the data begins on the 1st row and I assign the headers myself 
df = CSV.read("agaricus-lepiota.data", DataFrame; header=false)

#I am assigning headers to each attribute 
rename!(df, [:class, :cap_shape, :cap_surface, :cap_color, :bruises, :odor, :gill_attachment, :gill_spacing, :gill_size, 
        :gill_color, :stalk_shape, :stalk_root, :stalk_surface_above_ring, :stalk_surface_below_ring, :stalk_color_above_ring,
        :stalk_color_below_ring, :veil_type, :veil_color, :ring_number, :ring_type, :spore_print_color, :population, :habitat])

#just checking!
first(df, 6)



ArgumentError: ArgumentError: "agaricus-lepiota.data" is not a valid file

# Data Cleaning

Before I can work with the data, I have to convert it into binary information.

I first iterate through each of the original attributes and determine how many unique identifiers each one contains. I then create a new attribute for each unique identifier, iterate through each row of the dataframe, and determine which of the new attributes is true for a given species.
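
To make the steps above concrete, here is a minimal sketch of the same idea on a single toy column, using only base Julia. The `odor` vector and the `encoded` dictionary are made up for illustration and are not part of the dataset or of the cell below.

```julia
# Toy column for illustration only (not taken from the dataset)
odor = ["a", "n", "f", "n"]

# One Boolean vector per unique value; the keys follow the attribute_value naming pattern
encoded = Dict(Symbol("odor_$v") => (odor .== v) for v in unique(odor))

encoded[:odor_n]  # true for the 2nd and 4th mushrooms, false for the others
```

Each unique value becomes its own true/false column, which is the layout the cleaned dataframe is meant to have.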

In [3]:
#how many mushrooms (rows) are there?
n_mushrooms = nrow(df)

#what are the names of all of the columns 
attributes = names(df)

#initiate the new dataframe
df_mushrooms = DataFrame()

#iterate through each of the columns 
for attribute in attributes
    
    #find the unique types within each attribute
    attribute_types = unique(df[!, attribute])
    
    #an attribute with only two types needs just one Boolean column, so I keep only the first type 
    #(1:1 keeps it as a one-element array so that the loop below still works)
    if length(attribute_types) == 2
        attribute_types = attribute_types[1:1]
    end
    
    #iterate through each of those unique types 
    for name in attribute_types
        #create a symbol that can be used to name the column with the new name 
        new_attribute = Symbol("$(attribute)_$name")
        
        #compare every row of the original column with this type: the cell is marked true if the 
        #species has this attribute and false if it does not, then the new column is added to the dataframe 
        df_mushrooms[!, new_attribute] = df[!, attribute] .== name
    end
   
end


UndefVarError: UndefVarError: df not defined

# CSV 

I will now write this information to a CSV file so that I do not have to rerun this code every time I want to use the data.

In [19]:
CSV.write("df_mushrooms.csv", df_mushrooms)


"df_mushrooms.csv"