# Lab One: Exploring Table Data
___

## Business Understanding

The data we have selected to study describes 23 gilled mushroom species of the *Agaricus* and *Lepiota* families. It was originally collected to find a more efficient and accurate way to tell whether a mushroom is edible or poisonous, as part of a field guide. The data set is hosted on the UCI Machine Learning Repository, was drawn from The Audubon Society Field Guide to North American Mushrooms (1981), and was donated by Jeff Schlimmer.
 
This kind of information is vital to many industries, ranging from tourism to healthcare. Knowing whether a mushroom is usable could save a starving hiker's life or speed up the development of medicines involving those materials. Although most poisonous mushrooms only cause minor symptoms such as vomiting and diarrhea, children or animals can develop symptoms such as organ damage and, in some cases, death. Identification currently relies on a lot of guesswork and tedious work, and even professional mushroom hunters can misidentify a specimen.

In addition to our 22 categorical attributes, each record carries a binary label stating whether the mushroom is poisonous or edible, so we can use this label to check our accuracy. We plan to relate each attribute to the label and see whether any of them is associated strongly enough to accurately predict a mushroom's edibility.
___


## Data Understanding











[15 points] Load the dataset and appropriately define data types. What data type should be used to represent each data attribute? Discuss the attributes collected in the dataset. For datasets with a large number of attributes, only discuss a subset of relevant attributes.  

The following are our 22 categorical data attributes, split by data representation:
* One Hot Encoding
  1. Cap Shape
  2. Cap Surface
  3. Cap Color
  4. Odor
  5. Gill Attachment
  6. Gill Spacing
  7. Gill Color
  8. Stalk Root
  9. Stalk Surface Above Ring
  10. Stalk Surface Below Ring
  11. Stalk Color Above Ring
  12. Stalk Color Below Ring
  13. Veil Color
  14. Ring Number
  15. Ring Type
  16. Spore Print Color
  17. Population
  18. Habitat
* Binary
  1. Bruises
  2. Gill Size
  3. Stalk Shape
  4. Veil Type

Although some of our attributes can stand on their own, such as Population, Habitat, Odor, and Spore Print Color, the rest are fairly specific, so we created a subset named Physical Attributes to group them under.
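
The split between binary and one-hot attributes listed above can be sketched with pandas. This is a minimal illustration on a toy frame, not the project's actual preprocessing; the column names and values are assumptions chosen for the example.

```python
import pandas as pd

# Toy rows; the column names and values are illustrative assumptions,
# not the exact labels produced by the project's loader.
df = pd.DataFrame({
    "bruises": ["yes", "no", "yes"],
    "odor": ["almond", "foul", "none"],
    "habitat": ["woods", "grasses", "woods"],
})

binary_cols = ["bruises"]
one_hot_cols = ["odor", "habitat"]

encoded = df.copy()

# Binary attributes: map the two observed levels onto 0/1 in one column.
for col in binary_cols:
    levels = sorted(encoded[col].unique())  # here: ["no", "yes"]
    encoded[col] = encoded[col].map({levels[0]: 0, levels[1]: 1})

# Multi-valued attributes: one indicator column per observed value.
encoded = pd.get_dummies(encoded, columns=one_hot_cols)

print(encoded.columns.tolist())
```

A binary attribute needs only one column, while a one-hot attribute expands into as many columns as it has distinct values, which is why the two groups are treated separately.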

[15 points] Verify data quality: Explain any missing values or duplicate data. Are those mistakes? Why do these quality issues exist in the data? How do you deal with these problems? Give justifications for your methods (elimination or imputation).  

# Data Quality

## Missing Values
The data set has 2480 missing values, all in one attribute, the stalk root. These values are most likely missing because the stalk root is the only attribute that cannot be seen unless the mushroom has been pulled out of the ground.

To deal with these missing values, we can either eliminate the instances that have them or eliminate the column altogether, since it would be difficult to impute the values with this data set. Eliminating the column can be justified because the attribute cannot help determine the edibility of a mushroom without removing it from the ground.
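
The two elimination options can be sketched as follows. This is a minimal example on a toy frame; treating `"?"` as the missing-value marker is an assumption about the raw file and may not match what the project's loader returns.

```python
import numpy as np
import pandas as pd

# Toy frame; "?" as the missing-value marker is an assumption.
df = pd.DataFrame({
    "stalk-root": ["bulbous", "?", "club", "?"],
    "odor": ["almond", "foul", "none", "foul"],
})

# Turn the marker into a real missing value so pandas can see it.
df = df.replace("?", np.nan)

# Count missing entries per column.
missing = df.isna().sum()
print(missing)

# Option 1: drop every row that has a missing value.
rows_kept = df.dropna(axis=0)

# Option 2: drop every column that has a missing value.
cols_kept = df.dropna(axis=1)
```

Option 1 keeps all attributes but loses rows; option 2 keeps all rows but loses the stalk-root attribute.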

## Single Value Column
We also noticed that the veil-type column contains only one value, which makes the column useless for distinguishing between mushrooms. This likely means that none of the 23 species studied had a veil type other than partial.
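
A column like this can be found automatically by counting distinct values per column. The sketch below uses a toy frame with assumed column names and values.

```python
import pandas as pd

# Toy frame; names and values are illustrative assumptions.
df = pd.DataFrame({
    "veil-type": ["partial", "partial", "partial"],
    "odor": ["almond", "foul", "none"],
})

# nunique() counts the distinct non-missing values in each column.
distinct = df.nunique()
constant_cols = distinct[distinct <= 1].index.tolist()
print(constant_cols)

# A constant column carries no information, so it can be dropped.
df = df.drop(columns=constant_cols)
```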

## Repeat Data
Because there are 8124 rows and only 23 species of mushrooms, we expect a lot of identical rows. However, we decided that, depending on the intended use of the data, it could be beneficial to leave the duplicate rows in.

When implementing our function to read the DataFrame from a file, we added an option to remove duplicate rows so that we can easily analyze the dataset with and without duplicates.
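
A loader with such an option might look like the sketch below. The function `load_frame` and its `drop_duplicates` parameter are hypothetical stand-ins for illustration, not the project's actual `get_data_frame` implementation, and the CSV content is made up.

```python
import io

import pandas as pd


def load_frame(source, drop_duplicates=False):
    """Hypothetical loader: read a CSV and optionally drop duplicate rows."""
    df = pd.read_csv(source)
    if drop_duplicates:
        df = df.drop_duplicates().reset_index(drop=True)
    return df


# Made-up data with one repeated row.
csv_text = "poisonous,odor\nyes,foul\nyes,foul\nno,almond\n"

with_dups = load_frame(io.StringIO(csv_text))
without_dups = load_frame(io.StringIO(csv_text), drop_duplicates=True)

percent_unique = 100.0 * len(without_dups) / len(with_dups)
print(f"{percent_unique:.1f}% of rows are unique")
```

Keeping the choice in a single keyword argument lets the rest of the analysis run unchanged on either version of the data.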

Below is a short piece of code showing what percentage of the data is unique.

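In [10]:
from preprocessing.shroom_dealer import get_data_frame

# Load the full data set and count its rows.
df = get_data_frame()
total_rows = len(df)

# Drop exact duplicate rows and count what remains.
df = df.drop_duplicates()
no_dups = len(df)

# Percentage of rows that are unique.
percent_unique = 100.0 * no_dups / total_rows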

8124

# Smell
We noticed that odor was one of the features most strongly correlated with whether a mushroom is poisonous.

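In [2]:
from analysis import histogram_analysis

# counts[attr][value] is the number of samples with that value, and
# poison_counts[attr][value] is how many of those are poisonous.
counts, poison_counts = histogram_analysis.get_hist_data()

# Fraction of poisonous samples for each value of each attribute.
tf_tpf = {}
for val in counts:
    tf_tpf[val] = {x: poison_counts[val][x] / counts[val][x]
                   for x in counts[val] if counts[val][x] != 0}

print(tf_tpf["odor"])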

{'none': 0.034013605442176874, 'anise': 0.0, 'almond': 0.0, 'fishy': 1.0, 'spicy': 1.0, 'musty': 1.0, 'pungent': 1.0, 'foul': 1.0, 'creosote': 1.0}


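As the output shows, odor almost completely determines whether a mushroom is poisonous. In this data set, mushrooms that smell of anise or almond are always edible, and those with a fishy, spicy, musty, pungent, foul, or creosote odor are always poisonous. The only ambiguous case is a mushroom with no odor, and even then only about 3.4% are poisonous.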
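
The per-value ratio used above can be reproduced on a small scale with the standard library. The rows below are made up for illustration and are not drawn from the mushroom data; the names `counts` and `poison_counts` only mirror the roles those structures play in the cell above.

```python
from collections import Counter

# Toy (odor, label) pairs, invented for this example.
rows = [
    ("none", "edible"), ("none", "edible"),
    ("none", "edible"), ("none", "poisonous"),
    ("almond", "edible"),
    ("foul", "poisonous"), ("foul", "poisonous"),
]

# How often each odor occurs, and how often it occurs on a poisonous sample.
counts = Counter(odor for odor, _ in rows)
poison_counts = Counter(odor for odor, label in rows if label == "poisonous")

# Fraction of poisonous samples for each odor value.
poison_rate = {odor: poison_counts[odor] / n for odor, n in counts.items()}
print(poison_rate)
```

A rate of exactly 0 or 1 means the odor value alone decides the label in the sample; anything in between marks a value that needs other attributes to resolve.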

# Visualizations
## Heatmaps
One of our first visualizations was a heat map showing the correlation between all possible attributes.
![Heatmap of all Attributes](visualizations/singleton/fullheatmap.png)
As you can see, there are too many attributes to read from this chart alone. To look more closely at some of the more strongly correlated attributes, we made heat maps displaying the correlation between every combination of two columns. Below are some of the more interesting ones.
![Heatmap of gill-attachment and veil-color](visualizations/heatmaps/gill-attachment_and_veil-color.png)

![Heatmap of odor and poisonous](visualizations/heatmaps/odor_and_poisonous.png)
We found odor to be highly correlated with whether or not the mushroom was poisonous.
![Heatmap of stalk-shape and stalk-root](visualizations/heatmaps/stalk-shape_and_stalk-root.png)
We ended up dropping the stalk-root attribute because about 31% of the rows (2480 of 8124) are missing a stalk-root value. We thought the heat map was still interesting enough to include anyway.
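
One possible way to build the table behind a two-column heat map like those above is a normalized cross-tabulation. This is a sketch under the assumption that co-occurrence proportions are an acceptable stand-in for "correlation" between categorical columns; it is not necessarily how the figures in this report were produced, and the data is a toy example.

```python
import pandas as pd

# Toy frame; values are invented for illustration.
df = pd.DataFrame({
    "odor": ["none", "none", "foul", "almond", "foul"],
    "poisonous": ["no", "yes", "yes", "no", "yes"],
})

# Rows are odor values and columns are labels; normalize="index"
# turns each row into proportions that sum to 1.
table = pd.crosstab(df["odor"], df["poisonous"], normalize="index")
print(table)
```

The resulting table can be passed to a plotting function such as `seaborn.heatmap(table)` to render it as a heat map.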
