# Assignment – Preprocessing Data for scikit-learn
Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. In this assignment, you’ll use what you’ve learned in the course to prepare data for predictive analysis in Project 4.
 
## Mushrooms Dataset
A famous—if slightly moldy—dataset about mushrooms can be found in the UCI repository here: https://archive.ics.uci.edu/ml/datasets/Mushroom. The fact that this is such a well-known dataset in the data science community has made it a good dataset to use for comparative benchmarking. For example, if someone was working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data, this dataset could be useful. In Project 4, we’ll use scikit-learn to answer the question, “Which other attribute or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

### Your assignment is to:
- First study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there!
- Create a `pandas DataFrame` with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous, the column that includes odor, and at least one other column of your choosing.
- Add meaningful names for each column.
- Replace the codes used in the data with numeric values—for example, in the first “target” column, “e” might become 0 and “p” might become 1. This is because your downstream processing in Project 4 using `scikit-learn` requires that values be stored as numerics.
- Perform exploratory data analysis: show the distribution of data for each of the columns you selected, and show scatterplots for edible/poisonous vs. odor as well as the other column that you selected.
- Include some text describing your preliminary conclusions about whether either of the other columns could be helpful in predicting if a specific mushroom is edible or poisonous.


Your deliverable is a Jupyter Notebook that performs these transformation and exploratory data analysis tasks.


*** 
### First we must import the libraries we will need

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

***
## First study the dataset and the associated description of the data (i.e. “data dictionary”). You may need to look around a bit, but it’s there!

I found the "data dictionary" in the `agaricus-lepiota.names` file.

In [2]:
# Read the data dictionary; the context manager closes the file afterwards
with open('agaricus-lepiota.names', 'r') as data_dictionary:
    print(data_dictionary.read())

1. Title: Mushroom Database

2. Sources: 
    (a) Mushroom records drawn from The Audubon Society Field Guide to North
        American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred
        A. Knopf
    (b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
    (c) Date: 27 April 1987

3. Past Usage:
    1. Schlimmer,J.S. (1987). Concept Acquisition Through Representational
       Adjustment (Technical Report 87-19).  Doctoral disseration, Department
       of Information and Computer Science, University of California, Irvine.
       --- STAGGER: asymptoted to 95% classification accuracy after reviewing
           1000 instances.
    2. Iba,W., Wogulis,J., & Langley,P. (1988).  Trading off Simplicity
       and Coverage in Incremental Concept Learning. In Proceedings of 
       the 5th International Conference on Machine Learning, 73-79.
       Ann Arbor, Michigan: Morgan Kaufmann.  
       -- approximately the same results with their HILLARY algorithm    
    3. In 

***
## Create a `pandas DataFrame` with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous, the column that includes odor, and at least one other column of your choosing.

Now we will create our DataFrame from our preferred columns: 0 (edible/poisonous), 3 (cap color), 5 (odor), 21 (population), and 22 (habitat).

In [3]:
import_file = 'agaricus-lepiota.data'
mushroom_dataset = pd.read_csv(import_file, header=None, usecols=[0, 3, 5, 21, 22])
mushroom_dataset.head(10)

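|   | 0 | 3 | 5 | 21 | 22 |
|---|---|---|---|----|----|
| 0 | p | n | p | s | u |
| 1 | e | y | a | n | g |
| 2 | e | w | l | n | m |
| 3 | p | w | p | s | u |
| 4 | e | g | n | a | g |
| 5 | e | y | a | n | g |
| 6 | e | w | a | n | m |
| 7 | e | w | l | s | m |
| 8 | p | w | p | v | g |
| 9 | e | y | a | s | m |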


***
## Add meaningful names for each column.

In [4]:
column_names = ['Edible/Poisonous', 'Cap-Color', 'Odor', 'Population','Habitat']
mushroom_dataset.columns = column_names
mushroom_dataset.head(10)

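|   | Edible/Poisonous | Cap-Color | Odor | Population | Habitat |
|---|------------------|-----------|------|------------|---------|
| 0 | p | n | p | s | u |
| 1 | e | y | a | n | g |
| 2 | e | w | l | n | m |
| 3 | p | w | p | s | u |
| 4 | e | g | n | a | g |
| 5 | e | y | a | n | g |
| 6 | e | w | a | n | m |
| 7 | e | w | l | s | m |
| 8 | p | w | p | v | g |
| 9 | e | y | a | s | m |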
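
As an alternative to assigning a list to `.columns`, each name can be tied to its original column position with `DataFrame.rename`, so a name cannot silently land on the wrong column if the `usecols` list changes. This is a minimal sketch, assuming the integer column labels that `header=None` produces. The name `position_to_name` is hypothetical, and the two inline rows only stand in for `agaricus-lepiota.data` (values outside the selected columns are illustrative).

```python
import io

import pandas as pd

# Two illustrative rows in the 23-field layout of agaricus-lepiota.data.
sample = io.StringIO(
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u\n"
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g\n"
)

# Hypothetical helper mapping: original column position -> meaningful name.
position_to_name = {
    0: 'Edible/Poisonous',
    3: 'Cap-Color',
    5: 'Odor',
    21: 'Population',
    22: 'Habitat',
}

# Load only the mapped positions, then rename them by position.
df = pd.read_csv(sample, header=None, usecols=list(position_to_name))
df = df.rename(columns=position_to_name)
print(df)
```

Keeping positions and names in one dictionary means the selection and the labels are always edited together.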


***
## Replace the codes used in the data with numeric values—for example, in the first “target” column, “e” might become 0 and “p” might become 1. This is because your downstream processing in Project 4 using `scikit-learn` requires that values be stored as numerics.

In [5]:
replace_with_num_values = mushroom_dataset.replace({'Edible/Poisonous':{'e':0,
                                                                        'p':1},
                                                    'Cap-Color':{'n':0,
                                                                 'b':1,
                                                                 'c':2,
                                                                 'g':3,
                                                                 'r':4,
                                                                 'p':5,
                                                                 'u':6,
                                                                 'e':7,
                                                                 'w':8,
                                                                 'y':9},
                                                    'Odor':{'a': 0,
                                                            'l': 1,
                                                            'c': 2,
                                                            'y': 3,
                                                            'f': 4,
                                                            'm': 5,
                                                            'n': 6,
                                                            'p': 7,
                                                            's': 8},
                                                    
                                                    'Population':{'a':0,
                                                                  'c':1,
                                                                  'n':2,
                                                                  's':3,
                                                                  'v':4,
                                                                  'y':5},
                                                    'Habitat':{'g': 0,
                                                               'l': 1,
                                                               'm': 2,
                                                               'p': 3,
                                                               'u': 4,
                                                               'w': 5,
                                                               'd': 6}})
replace_with_num_values.head(10)

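|   | Edible/Poisonous | Cap-Color | Odor | Population | Habitat |
|---|------------------|-----------|------|------------|---------|
| 0 | 1 | 0 | 7 | 3 | 4 |
| 1 | 0 | 9 | 0 | 2 | 0 |
| 2 | 0 | 8 | 1 | 2 | 2 |
| 3 | 1 | 8 | 7 | 3 | 4 |
| 4 | 0 | 3 | 6 | 0 | 0 |
| 5 | 0 | 9 | 0 | 2 | 0 |
| 6 | 0 | 8 | 0 | 2 | 2 |
| 7 | 0 | 8 | 1 | 3 | 2 |
| 8 | 1 | 8 | 7 | 4 | 0 |
| 9 | 0 | 9 | 0 | 3 | 2 |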
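
`DataFrame.replace` leaves any code that has no mapping untouched, so a forgotten letter would remain in the data as a string. The sketch below shows one possible way to make such gaps fail loudly, using `Series.map`, which turns unmapped values into `NaN`. The names `encode_strict` and `code_maps` are hypothetical, and the three-row frame is illustrative rather than the real dataset.

```python
import pandas as pd

# Hypothetical two-column excerpt of the code-to-number mapping.
code_maps = {
    'Edible/Poisonous': {'e': 0, 'p': 1},
    'Odor': {'a': 0, 'l': 1, 'c': 2, 'y': 3, 'f': 4,
             'm': 5, 'n': 6, 'p': 7, 's': 8},
}

# Illustrative rows only.
raw = pd.DataFrame({
    'Edible/Poisonous': ['p', 'e', 'e'],
    'Odor': ['p', 'a', 'l'],
})


def encode_strict(frame, maps):
    """Map letter codes to integers and raise if any code has no mapping."""
    encoded = pd.DataFrame(index=frame.index)
    for column, mapping in maps.items():
        mapped = frame[column].map(mapping)  # unmapped codes become NaN
        unmapped = frame.loc[mapped.isna(), column].unique()
        if len(unmapped) > 0:
            raise ValueError(f"{column}: no mapping for {list(unmapped)}")
        encoded[column] = mapped.astype(int)
    return encoded


encoded = encode_strict(raw, code_maps)
print(encoded)
```

Note that the integers are arbitrary labels; their numeric order carries no meaning about the mushrooms.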


***
## Perform exploratory data analysis: show the distribution of data for each of the columns you selected, and show scatterplots for edible/poisonous vs. odor as well as the other column that you selected.
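
One possible way to produce the requested plots is sketched below: a bar chart of value counts for each column, plus a scatterplot of the target against each remaining column. The function name `plot_overview` is hypothetical, and the small `demo` frame (built from the first encoded rows shown above) is an assumption standing in for `replace_with_num_values`. The `Agg` backend line is also an assumption for running outside a notebook; it can be omitted in Jupyter with `%matplotlib inline`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for replace_with_num_values (first six encoded rows, three columns).
demo = pd.DataFrame({
    'Edible/Poisonous': [1, 0, 0, 1, 0, 0],
    'Odor':             [7, 0, 1, 7, 6, 0],
    'Habitat':          [4, 0, 2, 4, 0, 0],
})


def plot_overview(frame, target='Edible/Poisonous'):
    """Top row: value counts per column. Bottom row: target vs. each feature."""
    columns = list(frame.columns)
    features = [c for c in columns if c != target]
    fig, axes = plt.subplots(2, len(columns),
                             figsize=(4 * len(columns), 7), squeeze=False)

    # Distribution of every selected column as a bar chart of code counts.
    for ax, column in zip(axes[0], columns):
        frame[column].value_counts().sort_index().plot.bar(ax=ax, title=column)

    # Scatterplot of the target against each feature; alpha reveals overlap.
    for ax, column in zip(axes[1], features):
        ax.scatter(frame[column], frame[target], alpha=0.3)
        ax.set_xlabel(column)
        ax.set_ylabel(target)

    # Hide the unused panel(s) at the end of the bottom row.
    for ax in axes[1][len(features):]:
        ax.set_visible(False)

    fig.tight_layout()
    return fig, axes


fig, axes = plot_overview(demo)
```

Because both axes hold category codes, many points fall on the same coordinates, so the bar charts are usually easier to read than the scatterplots.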

***
## Include some text describing your preliminary conclusions about whether either of the other columns could be helpful in predicting if a specific mushroom is edible or poisonous.
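
One way to ground the written conclusions is a table showing, for each code of a feature, the share of mushrooms in each class. The sketch below uses `pd.crosstab` with `normalize='index'`. The `demo` frame holds only the first six encoded rows shown above, which is far too few to draw conclusions from; it is an assumption standing in for the full `replace_with_num_values` frame.

```python
import pandas as pd

# Stand-in for the full encoded dataset (first six rows only).
demo = pd.DataFrame({
    'Edible/Poisonous': [1, 0, 0, 1, 0, 0],
    'Odor':             [7, 0, 1, 7, 6, 0],
})

# Each row sums to 1: the share of edible (0) and poisonous (1) per odor code.
share = pd.crosstab(demo['Odor'], demo['Edible/Poisonous'], normalize='index')
print(share)
```

A feature whose rows sit close to 0 or 1 separates the classes well, while rows close to the overall class ratio suggest that the feature adds little information.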