# LU1 (Working with data) - Example

In [None]:
# imports, feel free to ignore this for now 
import pandas as pd 
from matplotlib import pyplot as plt 
from utils import get_toy_data
%matplotlib inline

## 1 - Working with jupyter notebooks (this will be insultingly basic) 

Welcome! You are now within a [Jupyter notebook](http://jupyter.org/). We can do a few cool things: 

In [None]:
# basic math 
10 + 15

In [None]:
# defining variables 
some_number = 5
some_animal = 'cat'

In [None]:
# variables can be used later 
print('I have %d tins of %s food' % (some_number, some_animal))

In [None]:
# defining functions 
def sum_all(a, b, c): 
    return a + b + c

In [None]:
sum_all(2, 3 , 4)

Writing in markdown: 

### some odd type of subtitle 

Look mom, I can write markdown! `def Mom==impressed`

> Do you see any Teletubbies in here? 

*yay formatting!* ... **be bold**

Loading images: 

![title](https://media.giphy.com/media/ljUXHv2x2BpjG/giphy.gif)

To learn more: [Lessons by datacamp](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)

## 2 - Loading data 

Great, we're ready to get to work! Let's start by loading our dataset, in our case some mushrooms! 

![title](https://mojohealth.com.au/assets/upload/data/mushrooms.jpg)

You already have some mushroom datasets in your `data` folder. 

We will load them with Pandas, using [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).

In [None]:
# load the csv that is at path data/mushrooms.csv, into a Pandas DataFrame called data
data = pd.read_csv('data/mushrooms.csv')

# print the type 
print('Our dataset is now of the following type: %s' % type(data))

This dataset will now be a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). 

It's ok (at first) to think of DataFrames as tables, a bit like spreadsheets. 
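
If you want to try `read_csv` without touching the files in `data`, here is a minimal sketch. The CSV text and its column names are made up for illustration (they are not the real mushroom file); the only point is that `read_csv` turns CSV text into a DataFrame.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented for this example
csv_text = """cap-color,height,is_poisonous
n,4.2,1
y,3.1,0
w,5.0,0
"""

# read_csv accepts a file path or a file-like object, so StringIO works too
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))
print(toy)
```

The first line of the CSV becomes the column names, and every following line becomes one row.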

## 3 - Understanding data 

The first thing to do with any dataset is... to look at it! 

Let's use [head](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) to look at the first 5 rows of our DataFrame.

In [None]:
data.head(5)

How many rows and columns do we have? 

We can find out with the [.shape](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.shape.html) attribute, which gives us `(number of rows, number of columns)`.

In [None]:
data.shape
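
A small sketch of how `.shape` can be unpacked, using a made-up DataFrame (the values are invented, not taken from the mushroom data):

```python
import pandas as pd

# Made-up data: 4 rows, 2 columns
toy = pd.DataFrame({
    'height': [4.2, 3.1, 5.0, 2.7],
    'is_poisonous': [1, 0, 0, 1],
})

# shape is a tuple, so it can be unpacked into two variables
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() of a DataFrame gives the number of rows
print(len(toy))
```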

What are the columns?

In [None]:
data.columns

We can subset a column, by using brackets notation: 

In [None]:
# subset the column is_poisonous, and save it into a Pandas Series called poison 
poison = data['is_poisonous']


# print the type 
print('poison is a variable of the following type: %s' % type(poison))

You will notice that `poison` is now a Pandas Series (fancy name for "column"). Don't worry too much about this for now. 

We can use [value_counts](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) to understand how many poisonous and non-poisonous mushrooms we have

In [None]:
# How many poisonous and non poisonous mushrooms do we have? 
data['is_poisonous'].value_counts()

Let's see another column: 

In [None]:
# How many of each gill color do we have? 
data['gill-color'].value_counts()
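
Counts are nice, but sometimes shares are easier to read. As a sketch on a made-up Series (the color codes are invented), `value_counts` also takes `normalize=True` to return fractions instead of counts:

```python
import pandas as pd

# Made-up gill colors, for illustration only
colors = pd.Series(['n', 'w', 'n', 'k', 'n', 'w'])

# how many times each value appears, most frequent first
counts = colors.value_counts()
print(counts)

# the same thing as fractions of the total (they add up to 1)
shares = colors.value_counts(normalize=True)
print(shares)
```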

Sometimes, we have continuous variables, such as `height`.  
If we do `value_counts` on `height` it will be pretty useless, as it has too many unique values. 

In these cases, using [a histogram](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.hist.html) is generally better: 

In [None]:
# draw a histogram for the height column
data['height'].hist()
# the following is Matplotlib styling code
plt.xlabel('height')
plt.ylabel('count')
plt.show()
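
If you are curious what a histogram does under the hood, it roughly comes down to cutting the range of values into bins and counting how many values land in each one. Here is a sketch of that idea with `pd.cut` on made-up heights; it uses 3 bins to keep the output small, whereas `.hist()` uses 10 by default.

```python
import pandas as pd

# Made-up heights, for illustration only
heights = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# assign each value to one of 3 equal-width bins
binned = pd.cut(heights, bins=3)

# count how many values fall into each bin, in bin order
bin_counts = binned.value_counts().sort_index()
print(bin_counts)
```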

For a continuous variable, we can get a bunch of descriptive statistics: 

In [None]:
# Mean mushroom height: 
data['height'].mean()

In [None]:
# Minimum mushroom height: 
data['height'].min()

In [None]:
# max mushroom height: 
data['height'].max()
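
Instead of asking for each statistic one by one, `describe` returns several of them at once. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Made-up heights, for illustration only
heights = pd.Series([2.0, 4.0, 6.0, 8.0])

# count, mean, std, min, quartiles and max in a single Series
summary = heights.describe()
print(summary)

# individual values can be picked out by name
print(summary['mean'], summary['min'], summary['max'])
```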

We can also look at a single mushroom by selecting one row by its position: 

In [None]:
data.iloc[2]  # row at position 2, i.e. the third row (positions start at 0)
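
One thing to keep in mind: `.iloc` counts positions from 0, so `iloc[0]` is the first row and `iloc[2]` is the third. The sketch below shows this on a made-up DataFrame with letter labels as its index (an assumption for illustration), and contrasts it with `.loc`, which selects by label instead of position.

```python
import pandas as pd

# Made-up data with a labelled index, for illustration only
toy = pd.DataFrame({'height': [4.2, 3.1, 5.0]}, index=['a', 'b', 'c'])

first = toy.iloc[0]        # first row (position 0)
third = toy.iloc[2]        # third row (position 2)
same_third = toy.loc['c']  # the same row, selected by its label

print(first['height'], third['height'], same_third['height'])
```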

## 5 - Groupby

Ok, now for something a bit trickier. 

We want to know what share of mushrooms is poisonous for each spore-print color (whatever a spore print is). 

So we need to `groupby` the spore-print color and then take the `mean` of `is_poisonous`. Assuming `is_poisonous` is coded as 0 or 1, that mean is the fraction of poisonous mushrooms in each group.

In [None]:
# mean poison by color = group by spore-print color, then take the mean of each group's is_poisonous
mean_poison_by_color = data.groupby('spore-print-color')['is_poisonous'].mean()

So what does this look like?

In [None]:
mean_poison_by_color
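
To see why the mean works here, a tiny made-up example may help. It assumes `is_poisonous` is coded as 0 or 1, so the mean of a group is simply the fraction of 1s in it. The rows below are invented and say nothing about real mushrooms.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'spore-print-color': ['r', 'r', 'y', 'y', 'k', 'k'],
    'is_poisonous':      [1,   1,   0,   0,   1,   0],
})

# split the rows by color, then average is_poisonous inside each group
mean_by_color = toy.groupby('spore-print-color')['is_poisonous'].mean()
print(mean_by_color)
```

In this toy data both `r` rows are poisonous (mean 1.0), neither `y` row is (mean 0.0), and one of the two `k` rows is (mean 0.5).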

As often happens, it is easier to just plot this by adding `.plot` at the end. In this case we use a horizontal bar plot (`kind='barh'`). 

In [None]:
# plot the mean_poison_by_color, with a horizontal bar plot
mean_poison_by_color.plot(kind='barh')
# matplotlib styling (axis labels)
plt.xlabel('Fraction of poisonous mushrooms')
plt.show()

Very interesting: we can tell that if the color is `r` (presumably red) the mushroom is always poisonous, but if it is `y` (probably yellow) it is never poisonous. 

## 6 - Dummifying 

As you might notice, our dataset has multiple types. We have features that are categorical (contain strings), such as `gill-color`, a continuous variable (`height`), and a discrete variable (`is_poisonous`).

In [None]:
# just calling head so we can see the data again 
data[['gill-color', 'spore-print-color', 'is_poisonous', 'height']].head(3)

For some applications, such as making predictions, we need all the data to be numerical. 

For that we have an excellent tool called [get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html), which will _one-hot encode_ our categorical data: each category gets its own column holding 0 or 1. 

Easier with an example: 

In [None]:
toy_data = get_toy_data()

# let's look at it 
toy_data

Now let's get dummies on this dataset

In [None]:
pd.get_dummies(toy_data)
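
In case you can't run `get_toy_data`, here is a self-contained sketch of the same idea. The DataFrame below is made up and is not necessarily what `get_toy_data` returns. Passing `dtype=int` is a choice made here so the new columns show 0/1, since the default output type differs between pandas versions.

```python
import pandas as pd

# Made-up data: one categorical column, one numeric column
toy = pd.DataFrame({
    'color': ['red', 'blue', 'red'],
    'height': [1.5, 2.0, 0.7],
})

# string columns are replaced by one 0/1 column per category;
# numeric columns are left untouched
dummies = pd.get_dummies(toy, dtype=int)
print(dummies)
```

The `color` column disappears and is replaced by `color_blue` and `color_red`, while `height` stays as it was.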

So, now let's do this on our mushroom dataset: 

In [None]:
data = pd.get_dummies(data)

In [None]:
data.head()

In [None]:
print('We now have %d rows and %d columns' % (data.shape[0], data.shape[1]))

## 7 - Saving data

Finally, we'll save our "prepared mushrooms" to a csv, using pandas' `to_csv` method. 

In [None]:
data.to_csv('data/prepared_mushrooms.csv', # the path where we want it to be saved 
            index=False)  # we don't care about the [0, 1, 2, ...] index. Sometimes the index is relevant, but not here
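
To see what `index=False` actually changes, here is a small round-trip sketch on made-up data. It writes to a string instead of a file (when no path is given, `to_csv` returns the CSV text), so nothing in your `data` folder is touched.

```python
import io
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'height': [1.5, 2.0], 'is_poisonous': [1, 0]})

# without the index: reading it back gives the same two columns
csv_text = toy.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)

# with the index (the default): reading it back yields an extra column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
print(with_index)
```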

# Now go do this yourself! 

Feel free to play around with this notebook until you feel comfortable, then head over to the exercise of Learning Unit 1, where you will apply these concepts to a different dataset. 