# Tidy Practice

### Objectives
After this lesson you should be able to...
+ Explain what tidy data is
+ Spot messy data
+ Identify the type of messy data
+ Transform any messy dataset into a tidy data set
+ Transform any messy dataset into one ready for machine learning
+ Know the difference between tidy data and data prepared for machine learning
+ Use the reshaping functions/methods: **`melt, stack, unstack, pivot, pivot_table`**
+ Know the primary purpose of **`melt, stack, unstack, pivot, pivot_table`**
+ Go back and forth between multiple levels of grouped data
+ Handle missing values in a variety of ways
+ Use the **`str`** accessor methods
+ Talk to data owners to get more information
+ Follow the guideline for creating clean datasets
+ Tidy real datasets

### Prepare for this lesson by...
+ Reading Hadley Wickham's paper on [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf)
+ Watching Hadley Wickham's talk on [tidy data](https://vimeo.com/33727555)
+ Watching Jeff Leek's video on [tidy data](https://www.youtube.com/watch?v=whDilsFoLVY)
+ Reading the [reshaping pandas documentation page](http://pandas.pydata.org/pandas-docs/stable/reshaping.html)
+ Reading the entire page [working with text data](http://pandas.pydata.org/pandas-docs/stable/text.html)

### Introduction
The previous notebook focused on one particular type of messy dataset: one where the column names are actually values rather than variable names. This was illustrated with a simple dataset of fruit counts. **`stack`** or **`melt`** will quickly tidy these basic datasets, but datasets often need much more manipulation to become tidy.

In [2]:
import pandas as pd
import numpy as np

In [3]:
# looks so nice and clean!
df = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]], 
                  columns=['Apple', 'Orange', 'Banana'],
                  index=['Ted', 'Penelope', 'Niko'])
df

Unnamed: 0,Apple,Orange,Banana
Ted,12,10,40
Penelope,9,7,12
Niko,0,14,190
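
As a reminder of the previous notebook, here is a minimal sketch of tidying the fruit table with **`melt`**. The names `fruit`, `fruit_tidy`, `Name`, `Fruit` and `Count` are chosen here for illustration and do not come from the original data.

```python
import pandas as pd

fruit = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]],
                     columns=['Apple', 'Orange', 'Banana'],
                     index=['Ted', 'Penelope', 'Niko'])

# move the person names out of the index into a column (assumed name: 'Name'),
# then melt the three fruit columns into one 'Fruit' column and one 'Count' column
fruit_tidy = (fruit.rename_axis('Name')
                   .reset_index()
                   .melt(id_vars='Name', var_name='Fruit', value_name='Count'))
print(fruit_tidy)
```

Each row now holds one observation: a person, a fruit, and a count.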


### Most Common Messy Data Problems
Again, we will rely upon Hadley's paper to describe the most common problems that appear in messy datasets. We will tackle all of these and a few more.
+ Column headers are values, not variable names.
+ Multiple variables are stored in one column.
+ Variables are stored in both rows and columns.
+ Multiple types of observational units are stored in the same table.
+ A single observational unit is stored in multiple tables.

# Multiple variables are stored in one column
A tidy data set needs values of a single variable stored in one column. There are a few flavors of this type.

### Column names are values in the column
Column names will appear directly as values in a single column and the value of these variables will be in another column.

Notice below how the **`Value`** column has both numeric and string data types and the **`Info`** column isn't a variable at all but column names.

In [4]:
df = pd.DataFrame(data={'Name': ['Ted', 'Penelope', 'Niko'] * 3,
                        'Info': ['Age'] * 3 + ['Salary'] * 3 + ['Hair Color'] * 3, 
                        'Value': [10, 15, 20, 3, 4, 5, 'Brown', 'Pink','Red']},
                 columns=['Name', 'Info', 'Value'])
df

Unnamed: 0,Name,Info,Value
0,Ted,Age,10
1,Penelope,Age,15
2,Niko,Age,20
3,Ted,Salary,3
4,Penelope,Salary,4
5,Niko,Salary,5
6,Ted,Hair Color,Brown
7,Penelope,Hair Color,Pink
8,Niko,Hair Color,Red


### The fix
This dataset is 'overly stacked', so pivoting it (which normally creates a messy dataset) will make it tidy. Both **`pivot`** and **`unstack`** will make this work.

In [5]:
df.pivot(index='Name', columns='Info', values='Value')

Info,Age,Hair Color,Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Niko,20,Red,5
Penelope,15,Pink,4
Ted,10,Brown,3


In [10]:
# can also use unstack
df.set_index(['Name', 'Info']).unstack()

Unnamed: 0_level_0,Value,Value,Value
Info,Age,Hair Color,Salary
Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Niko,20,Red,5
Penelope,15,Pink,4
Ted,10,Brown,3


In [36]:
# lots of extra nuisance names: Value, Info, Name still remain
# squeeze forces the one column dataframe to be a Series so unstack doesn't create a multi-index
# rename_axis removes 'Info' and 'Name' level names
df.set_index(['Name', 'Info'])\
  .squeeze()\
  .rename_axis([None, None])\
  .unstack()

Unnamed: 0,Age,Hair Color,Salary
Niko,20,Red,5
Penelope,15,Pink,4
Ted,10,Brown,3
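
One detail the pivot does not fix is the data type. Because **`Value`** mixed numbers and strings, every pivoted column is still `object`. A minimal sketch of one way to recover numeric types, assuming the numeric columns contain only numbers, is **`infer_objects`** (the names `long_df`, `wide` and `wide_typed` are illustrative):

```python
import pandas as pd

long_df = pd.DataFrame({'Name': ['Ted', 'Penelope', 'Niko'] * 3,
                        'Info': ['Age'] * 3 + ['Salary'] * 3 + ['Hair Color'] * 3,
                        'Value': [10, 15, 20, 3, 4, 5, 'Brown', 'Pink', 'Red']})

wide = long_df.pivot(index='Name', columns='Info', values='Value')

# ask pandas to infer a better dtype for each object column;
# Age and Salary hold only integers, so they can become integer columns
wide_typed = wide.infer_objects()
print(wide_typed.dtypes)
```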


### Two or more values are stored in the same cell
Two or more values (of the same variable or of different variables) can be stored in the same cell of a DataFrame. You will need to extract the desired quantities. These values will usually be strings.

### Using the `str` accessor to extract values
pandas provides a `str` accessor with a couple dozen methods, each with the ability to extract different pieces of information from these strings. 

In [77]:
df = pd.DataFrame({'City':['Houston', 'Dallas', 'Austin'], 
                   'Geolocation':['(29.7604° N, 95.3698° W)', '32.7767° N, 96.7970° W', '30.2672° N, 97.7431° W']})
df

Unnamed: 0,City,Geolocation
0,Houston,"(29.7604° N, 95.3698° W)"
1,Dallas,"32.7767° N, 96.7970° W"
2,Austin,"30.2672° N, 97.7431° W"


**`str`** is available on Series objects (and, as used later in this notebook, on Index objects), but not on DataFrames. Most of its methods are self-explanatory. Let's see several examples on the **`City`** column.

In [81]:
# get the length of each value
df.City.str.len()

0    7
1    6
2    6
Name: City, dtype: int64

In [88]:
# make all uppercase
df.City.str.upper()

0    HOUSTON
1     DALLAS
2     AUSTIN
Name: City, dtype: object

In [89]:
# make title case after uppercase
df.City.str.upper().str.title()

0    Houston
1     Dallas
2     Austin
Name: City, dtype: object

In [91]:
# get the 4th character
df.City.str.get(3)

0    s
1    l
2    t
Name: City, dtype: object

In [94]:
# split strings by a letter and expand each split into its own column
df.City.str.split('s', expand=True)

Unnamed: 0,0,1
0,Hou,ton
1,Dalla,
2,Au,tin


### Extracting coordinates
The **`Geolocation`** column has quite a lot of information packed into it. We will parse it into 4 separate variables
+ latitude 
+ latitude direction
+ longitude
+ longitude direction

In [95]:
# strip off parentheses from ends
df.Geolocation.str.strip('()')

0    29.7604° N, 95.3698° W
1    32.7767° N, 96.7970° W
2    30.2672° N, 97.7431° W
Name: Geolocation, dtype: object

In [116]:
# split on a comma
geo_split = df.Geolocation.str.strip('()')\
              .str.split(',', expand=True)
geo_split

Unnamed: 0,0,1
0,29.7604° N,95.3698° W
1,32.7767° N,96.7970° W
2,30.2672° N,97.7431° W


In [141]:
lat = geo_split[0].str.split(' ', expand=True)
long = geo_split[1].str.split(' ', expand=True)

In [142]:
lat

Unnamed: 0,0,1
0,29.7604°,N
1,32.7767°,N
2,30.2672°,N


In [143]:
# give meaningful columns
lat.columns = ['latitude', 'latitude direction']
lat

Unnamed: 0,latitude,latitude direction
0,29.7604°,N
1,32.7767°,N
2,30.2672°,N


In [144]:
long

Unnamed: 0,0,1,2
0,,95.3698°,W
1,,96.7970°,W
2,,97.7431°,W


In [145]:
# an extra empty column appears because each value starts with a space
# (left over from the split on the comma). Let's drop it
long = long.drop(0, axis=1)
long.columns = ['longitude', 'longitude direction']
long

Unnamed: 0,longitude,longitude direction
0,95.3698°,W
1,96.7970°,W
2,97.7431°,W


In [158]:
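# use regex to replace non digit/decimals with nothing
# (regex=True is needed in pandas 2.0+, where patterns are literal by default)
long['longitude'] = long.longitude.str.replace('[^0-9.]+', '', regex=True)
lat['latitude'] = lat.latitude.str.replace('[^0-9.]+', '', regex=True)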

In [159]:
long

Unnamed: 0,longitude,longitude direction
0,95.3698,W
1,96.797,W
2,97.7431,W


In [160]:
lat

Unnamed: 0,latitude,latitude direction
0,29.7604,N
1,32.7767,N
2,30.2672,N


In [162]:
# data types are not right. Let's change lat and long to numeric
long.dtypes

longitude              object
longitude direction    object
dtype: object

In [167]:
lat['latitude'] = pd.to_numeric(lat['latitude'])
long['longitude'] = pd.to_numeric(long['longitude'])
lat.dtypes

latitude              float64
latitude direction     object
dtype: object

In [172]:
# concatenate city column from original DataFrame with 
# two new transformed DataFrames
df_final = pd.concat([df['City'], lat, long], axis=1)
df_final

Unnamed: 0,City,latitude,latitude direction,longitude,longitude direction
0,Houston,29.7604,N,95.3698,W
1,Dallas,32.7767,N,96.797,W
2,Austin,30.2672,N,97.7431,W


### Mini Summary of `str`
+ `str` is very powerful and works directly with text column data
+ `str` works with Series and Index objects, not with DataFrames
+ You will have to learn regular expressions to make `str` more useful
+ Messy datasets with multiple values in a single cell of data need `str` functionality to tidy them up
+ see the **`extract`** method 
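
As a sketch of what **`extract`** can do, the whole `Geolocation` parse can be written as one regular expression with named groups. The pattern and the group names below are assumptions made for this example (Python group names cannot contain spaces, so underscores are used), and the pattern assumes every value follows the `number° N/S, number° E/W` layout.

```python
import pandas as pd

geo = pd.Series(['(29.7604° N, 95.3698° W)',
                 '32.7767° N, 96.7970° W',
                 '30.2672° N, 97.7431° W'])

# each named group becomes a column in the result
pattern = (r'(?P<latitude>[\d.]+)°\s*(?P<latitude_direction>[NS]),\s*'
           r'(?P<longitude>[\d.]+)°\s*(?P<longitude_direction>[EW])')

coords = geo.str.extract(pattern)
# extracted pieces are strings, so convert the two numeric columns
coords = coords.astype({'latitude': float, 'longitude': float})
print(coords)
```

Because the pattern is searched for anywhere in the string, the optional parentheses around the first value do not need separate handling.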

### Variables are stored in both rows and columns
A more difficult situation occurs when variables are stored down a column and across the column names. Pivoting and melting may have to be used together to make it tidy. Let's take a look at the example below. 

The **`Property`** column has names of variables. The years in the columns are all values of variables. There are a few ways to tidy this set.

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/temp_flow_pressure.csv')
df

Unnamed: 0,Group,Property,2012,2013,2014,2015,2016
0,A,Pressure,928,873,814,973,870
1,A,Temperature,1026,1038,1009,1036,1042
2,A,Flow,819,806,861,882,856
3,B,Pressure,817,877,914,806,942
4,B,Temperature,1008,1041,1009,1002,1013
5,B,Flow,887,899,837,824,873


In [3]:
# melt the years and then pivot the columns
df_melt = pd.melt(df, 
                  id_vars=['Group', 'Property'], 
                  value_vars=['2012', '2013', '2014', '2015', '2016'],
                  var_name='Year')
df_melt.head()

Unnamed: 0,Group,Property,Year,value
0,A,Pressure,2012,928
1,A,Temperature,2012,1026
2,A,Flow,2012,819
3,B,Pressure,2012,817
4,B,Temperature,2012,1008


In [4]:
# pivot_table accepts several columns in the index
# (pivot only gained this ability in pandas 1.1)
df_tidy = df_melt.pivot_table(index=['Group', 'Year'], columns='Property', values='value')
df_tidy

Unnamed: 0_level_0,Property,Flow,Pressure,Temperature
Group,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,2012,819,928,1026
A,2013,806,873,1038
A,2014,861,814,1009
A,2015,882,973,1036
A,2016,856,870,1042
B,2012,887,817,1008
B,2013,899,877,1041
B,2014,837,914,1009
B,2015,824,806,1002
B,2016,873,942,1013


In [5]:
# get rid of level name and move out of the index as columns
df_tidy.rename_axis(None, axis=1).reset_index()

Unnamed: 0,Group,Year,Flow,Pressure,Temperature
0,A,2012,819,928,1026
1,A,2013,806,873,1038
2,A,2014,861,814,1009
3,A,2015,882,973,1036
4,A,2016,856,870,1042
5,B,2012,887,817,1008
6,B,2013,899,877,1041
7,B,2014,837,914,1009
8,B,2015,824,806,1002
9,B,2016,873,942,1013


In [204]:
# the transformation is also possible with stack and unstack
df.set_index(['Group', 'Property'])\
  .stack()\
  .unstack('Property')\
  .rename_axis(['Group', 'Year'])\
  .rename_axis(None, axis=1)\
  .reset_index()

Unnamed: 0,Group,Year,Flow,Pressure,Temperature
0,A,2012,819,928,1026
1,A,2013,806,873,1038
2,A,2014,861,814,1009
3,A,2015,882,973,1036
4,A,2016,856,870,1042
5,B,2012,887,817,1008
6,B,2013,899,877,1041
7,B,2014,837,914,1009
8,B,2015,824,806,1002
9,B,2016,873,942,1013
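
One practical difference between **`pivot`** and **`pivot_table`** is worth keeping in mind when choosing between them: **`pivot_table`** aggregates duplicate index/column combinations (with the mean by default), while **`pivot`** refuses to reshape them. The small DataFrame below is made up for illustration and is not part of the lesson data.

```python
import pandas as pd

# hypothetical data with two 'Flow' readings for group A
dup = pd.DataFrame({'Group': ['A', 'A', 'A'],
                    'Property': ['Flow', 'Flow', 'Pressure'],
                    'value': [800, 820, 900]})

# pivot_table quietly averages the two Flow readings
averaged = dup.pivot_table(index='Group', columns='Property', values='value')
print(averaged)

# pivot raises because the (A, Flow) combination is not unique
try:
    dup.pivot(index='Group', columns='Property', values='value')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)
```

If duplicates would indicate a data problem, the error from **`pivot`** is arguably the safer behavior.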


# Case Study: My Brother's Keeper Data
[data.gov](https://www.data.gov) is an excellent place to find interesting and messy (occasionally tidy) datasets. This case study will examine the [My Brother's Keeper](https://catalog.data.gov/dataset/my-brothers-keeper-key-statistical-indicators-on-boys-and-men-of-color) dataset.

'MBK is an interagency effort to improve measurably the expected educational and life outcomes for and address the persistent opportunity gaps faced by boys and young men of color'

In [623]:
df = pd.read_csv('data/my_brothers_keeper.csv')

df.head()

Unnamed: 0,Race/ethnicity,Year,Rate of birth to women ages 18-19,Distribution of male children born to women ages 18-19,Distribution of female children born to women ages 18-19,Rate of birth to women ages 20-24,Distribution of male children born to women ages 20-24,Distribution of female children born to women ages 20-24
0,"White, non-Hispanic",2000,57.5,51.5%,48.5%,91.2,51.2%,48.8%
1,"White, non-Hispanic",2001,54.7,51.3%,48.7%,87.0,51.2%,48.8%
2,"White, non-Hispanic",2002,52.0,51.1%,48.9%,84.7,51.4%,48.6%
3,"White, non-Hispanic",2003,50.0,51.4%,48.6%,84.1,51.3%,48.7%
4,"White, non-Hispanic",2004,48.6,51.2%,48.8%,83.0,51.2%,48.8%


In [624]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 8 columns):
Race/ethnicity                                              65 non-null object
Year                                                        65 non-null int64
Rate of birth to women ages 18-19                           65 non-null float64
Distribution of male children born to women ages 18-19      65 non-null object
Distribution of female children born to women ages 18-19    65 non-null object
Rate of birth to women ages 20-24                           65 non-null float64
Distribution of male children born to women ages 20-24      65 non-null object
Distribution of female children born to women ages 20-24    65 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 4.1+ KB


In [625]:
df.shape

(65, 8)

### Extra spaces in the column names
When I was first exploring this data set, I attempted to select the first column but got hit with a **`KeyError`** informing me that the column didn't exist. This was odd as I typed the column name in verbatim. But when I examined the columns I noticed an extra space.

In [626]:
# notice extra space 'Race/ethnicity '
df.columns

Index(['Race/ethnicity ', 'Year', 'Rate of birth to women ages 18-19',
       'Distribution of male children born to women ages 18-19',
       'Distribution of female children born to women ages 18-19',
       'Rate of birth to women ages 20-24',
       'Distribution of male children born to women ages 20-24',
       'Distribution of female children born to women ages 20-24'],
      dtype='object')

### Strip characters
The **`.str`** accessor provides the **`strip`** method, which removes the given characters from the beginning and end of each string. By default it removes whitespace.

In [627]:
# remove leading and trailing whitespace
df.columns = df.columns.str.strip()
df.columns

Index(['Race/ethnicity', 'Year', 'Rate of birth to women ages 18-19',
       'Distribution of male children born to women ages 18-19',
       'Distribution of female children born to women ages 18-19',
       'Rate of birth to women ages 20-24',
       'Distribution of male children born to women ages 20-24',
       'Distribution of female children born to women ages 20-24'],
      dtype='object')
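
A quick way to spot this kind of problem before it causes a **`KeyError`** is to list the column names that change when stripped. This is only a sketch; the `' Rate'` name is invented here to show a leading space as well.

```python
import pandas as pd

# 'Race/ethnicity ' comes from the dataset; ' Rate' is a made-up example
cols = pd.Index(['Race/ethnicity ', 'Year', ' Rate'])

# keep only the names that have leading or trailing whitespace
padded = [c for c in cols if c != c.strip()]
print(padded)
```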

### Variables as column names
It appears that some variables are stored in the column names, which violates one of the tidy data principles. Both age and possibly gender are stored in the column names.

There also appear to be two other variables: **`birth rate`** and **`percentage male/female`**.

### Split data into two Data Frames
Because both the `Rate` (**`birth_rate`**) columns and the `Distribution` (**`percentage male/female`**) columns need to be melted, we will split them into two separate DataFrames and combine the results at the end.

The Race/ethnicity and Year columns are put into the index to label the rows. This index will be used to join the DataFrames back together later.

In [628]:
# create two new data frames. 
percent = df.set_index(['Race/ethnicity', 'Year']).filter(like='Distribution')
rate = df.set_index(['Race/ethnicity', 'Year']).filter(like='Rate')

In [629]:
percent.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Distribution of male children born to women ages 18-19,Distribution of female children born to women ages 18-19,Distribution of male children born to women ages 20-24,Distribution of female children born to women ages 20-24
Race/ethnicity,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"White, non-Hispanic",2000,51.5%,48.5%,51.2%,48.8%
"White, non-Hispanic",2001,51.3%,48.7%,51.2%,48.8%
"White, non-Hispanic",2002,51.1%,48.9%,51.4%,48.6%
"White, non-Hispanic",2003,51.4%,48.6%,51.3%,48.7%
"White, non-Hispanic",2004,51.2%,48.8%,51.2%,48.8%


In [630]:
rate.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Rate of birth to women ages 18-19,Rate of birth to women ages 20-24
Race/ethnicity,Year,Unnamed: 2_level_1,Unnamed: 3_level_1
"White, non-Hispanic",2000,57.5,91.2
"White, non-Hispanic",2001,54.7,87.0
"White, non-Hispanic",2002,52.0,84.7
"White, non-Hispanic",2003,50.0,84.1
"White, non-Hispanic",2004,48.6,83.0


### Stack data
To get all the percentages in one column we will use the **`stack`** method.

In [631]:
percent_stacked = percent.stack()

In [632]:
percent_stacked.head(10)

Race/ethnicity       Year                                                          
White, non-Hispanic  2000  Distribution of male children born to women ages 18-19      51.5%
                           Distribution of female children born to women ages 18-19    48.5%
                           Distribution of male children born to women ages 20-24      51.2%
                           Distribution of female children born to women ages 20-24    48.8%
                     2001  Distribution of male children born to women ages 18-19      51.3%
                           Distribution of female children born to women ages 18-19    48.7%
                           Distribution of male children born to women ages 20-24      51.2%
                           Distribution of female children born to women ages 20-24    48.8%
                     2002  Distribution of male children born to women ages 18-19      51.1%
                           Distribution of female children born to women ages 1

### Extracting age
We can extract the variables gender and age from the new innermost index level. To do this we will take advantage of the **`str`** accessor, which is available for both the Index and the Series. Specifically, we will use the **`str.extract`** method to extract the age group.

We will use a regular expression to find two numbers followed by a dash followed by two numbers again.

In [633]:
# get the innermost level
percent_levels = percent_stacked.index.get_level_values(-1)

percent_levels[:10]

Index(['Distribution of male children born to women ages 18-19',
       'Distribution of female children born to women ages 18-19',
       'Distribution of male children born to women ages 20-24',
       'Distribution of female children born to women ages 20-24',
       'Distribution of male children born to women ages 18-19',
       'Distribution of female children born to women ages 18-19',
       'Distribution of male children born to women ages 20-24',
       'Distribution of female children born to women ages 20-24',
       'Distribution of male children born to women ages 18-19',
       'Distribution of female children born to women ages 18-19'],
      dtype='object')

In [634]:
# regex to find age group (raw string so that \d is not treated as a string escape)
age_group = percent_levels.str.extract(r'(\d{2}-\d{2})', expand=False).rename('Age Group')

age_group[:10]

Index(['18-19', '18-19', '20-24', '20-24', '18-19', '18-19', '20-24', '20-24',
       '18-19', '18-19'],
      dtype='object', name='Age Group')

In [635]:
# another way of extracting it, using get item selection syntax
percent_stacked.index.get_level_values(-1).str[-5:]

Index(['18-19', '18-19', '20-24', '20-24', '18-19', '18-19', '20-24', '20-24',
       '18-19', '18-19',
       ...
       '20-24', '20-24', '18-19', '18-19', '20-24', '20-24', '18-19', '18-19',
       '20-24', '20-24'],
      dtype='object', length=260)

### Extracting Gender
Gender is extracted similarly using another regular expression. The basics of regular expressions are good to know, but it's unlikely you'll remember how they work unless you deal with text data frequently. Learn to google for what you need.

The gender is the third word, so googling for `find third word regex` led me to [this page](http://stackoverflow.com/questions/23691664/how-to-extract-nth-word-using-regular-expression), which yielded the correct answer.

In [636]:
# regex to return the 3rd word: skip two whitespace-separated words, then capture the next one
gender = percent_levels.str.extract(r'(?:\S+\s+){2}(\S+)', expand=False)

gender

Index(['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female',
       'male', 'female',
       ...
       'male', 'female', 'male', 'female', 'male', 'female', 'male', 'female',
       'male', 'female'],
      dtype='object', length=260)
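
Both pieces can also be pulled out in a single pass by giving **`str.extract`** a pattern with two capture groups. The sketch below runs on two of the labels shown above; the pattern and the group names `Gender` and `Age_Group` are assumptions for this example and rely on every label following the same wording.

```python
import pandas as pd

labels = pd.Index(['Distribution of male children born to women ages 18-19',
                   'Distribution of female children born to women ages 20-24'])

# first group: the word between 'of' and 'children'
# second group: two digits, a dash, two digits after 'ages'
pattern = r'of (?P<Gender>\w+) children .* ages (?P<Age_Group>\d{2}-\d{2})'

# with more than one group, extract returns a DataFrame with one column per group
parts = labels.str.extract(pattern)
print(parts)
```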

### Remove old index level
The innermost index level is now useless. We have extracted the relevant information - gender and age. We can drop an index level with the **`droplevel`** index method.

In [637]:
percent_stacked.index = percent_stacked.index.droplevel(-1)

### Add age group to index
The **`rate`** DataFrame also has age so we will append it to the index to make the join possible later.

In [638]:
df_percent = percent_stacked.to_frame('Gender Percentage').set_index(age_group, append=True)

In [639]:
df_percent.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Gender Percentage
Race/ethnicity,Year,Age Group,Unnamed: 3_level_1
"White, non-Hispanic",2000,18-19,51.5%
"White, non-Hispanic",2000,18-19,48.5%
"White, non-Hispanic",2000,20-24,51.2%
"White, non-Hispanic",2000,20-24,48.8%
"White, non-Hispanic",2001,18-19,51.3%


### Convert percentage to numeric
The percentage sign in the Gender Percentage column prevents the column from being converted to a numeric type. Let's strip the percentage sign and then convert.

In [640]:
df_percent['Gender Percentage'] = pd.to_numeric(df_percent['Gender Percentage'].str.strip('%'))

In [641]:
# confirm numeric
df_percent.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 260 entries, (White, non-Hispanic, 2000, 18-19) to (Asian and Pacific Islander, 2012, 20-24)
Data columns (total 1 columns):
Gender Percentage    260 non-null float64
dtypes: float64(1)
memory usage: 3.1+ KB
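
This column converted cleanly, but real files often contain stray entries. As a hedged sketch of one way to handle them, `pd.to_numeric` accepts `errors='coerce'`, which turns anything unparseable into a missing value instead of raising. The `'n/a'` entry below is invented for illustration.

```python
import pandas as pd

# hypothetical column with one entry that is not a number
raw = pd.Series(['51.5%', 'n/a', '48.5%'])

# strip the percent sign, then convert; unparseable values become NaN
clean = pd.to_numeric(raw.str.strip('%'), errors='coerce')
print(clean)
```

The resulting missing values can then be inspected with `isna` before deciding how to treat them.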


### Add gender column
Since gender is specific to this **`percent`** DataFrame, we will add it as a column.

In [642]:
df_percent['Gender'] = gender

In [643]:
df_percent.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Gender Percentage,Gender
Race/ethnicity,Year,Age Group,Unnamed: 3_level_1,Unnamed: 4_level_1
"White, non-Hispanic",2000,18-19,51.5,male
"White, non-Hispanic",2000,18-19,48.5,female
"White, non-Hispanic",2000,20-24,51.2,male
"White, non-Hispanic",2000,20-24,48.8,female
"White, non-Hispanic",2001,18-19,51.3,male
"White, non-Hispanic",2001,18-19,48.7,female
"White, non-Hispanic",2001,20-24,51.2,male
"White, non-Hispanic",2001,20-24,48.8,female
"White, non-Hispanic",2002,18-19,51.1,male
"White, non-Hispanic",2002,18-19,48.9,female


### Take similar approach with `rate`
We can take a similar approach with the **`rate`** DataFrame which is outputted again below. Only the age group and not gender are found in the column names.

In [644]:
rate.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Rate of birth to women ages 18-19,Rate of birth to women ages 20-24
Race/ethnicity,Year,Unnamed: 2_level_1,Unnamed: 3_level_1
"White, non-Hispanic",2000,57.5,91.2
"White, non-Hispanic",2001,54.7,87.0
"White, non-Hispanic",2002,52.0,84.7
"White, non-Hispanic",2003,50.0,84.1
"White, non-Hispanic",2004,48.6,83.0


### stack and get new index levels

In [645]:
rate_stacked = rate.stack()
rate_levels = rate_stacked.index.get_level_values(-1)

### Get age group

In [646]:
age_group = rate_levels.str.extract(r'(\d{2}-\d{2})', expand=False).rename('Age Group')
age_group[:5]

Index(['18-19', '20-24', '18-19', '20-24', '18-19'], dtype='object', name='Age Group')

### Remove useless index level

In [647]:
rate_stacked.reset_index(level=-1, drop=True, inplace=True)

rate_stacked.head()

Race/ethnicity       Year
White, non-Hispanic  2000    57.5
                     2000    91.2
                     2001    54.7
                     2001    87.0
                     2002    52.0
dtype: float64

### Add age group to index
The index needs to be the same as the **`percent`** DataFrame to align properly.

In [648]:
df_rate = rate_stacked.to_frame('Birth Rate').set_index([age_group], append=True)

### See `final` tables

In [649]:
df_rate.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Birth Rate
Race/ethnicity,Year,Age Group,Unnamed: 3_level_1
"White, non-Hispanic",2000,18-19,57.5
"White, non-Hispanic",2000,20-24,91.2
"White, non-Hispanic",2001,18-19,54.7
"White, non-Hispanic",2001,20-24,87.0
"White, non-Hispanic",2002,18-19,52.0
"White, non-Hispanic",2002,20-24,84.7
"White, non-Hispanic",2003,18-19,50.0
"White, non-Hispanic",2003,20-24,84.1
"White, non-Hispanic",2004,18-19,48.6
"White, non-Hispanic",2004,20-24,83.0


In [650]:
df_percent.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Gender Percentage,Gender
Race/ethnicity,Year,Age Group,Unnamed: 3_level_1,Unnamed: 4_level_1
"White, non-Hispanic",2000,18-19,51.5,male
"White, non-Hispanic",2000,18-19,48.5,female
"White, non-Hispanic",2000,20-24,51.2,male
"White, non-Hispanic",2000,20-24,48.8,female
"White, non-Hispanic",2001,18-19,51.3,male
"White, non-Hispanic",2001,18-19,48.7,female
"White, non-Hispanic",2001,20-24,51.2,male
"White, non-Hispanic",2001,20-24,48.8,female
"White, non-Hispanic",2002,18-19,51.1,male
"White, non-Hispanic",2002,18-19,48.9,female


### Add `Birth Rate` to final with Automatic index alignment
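The following assignment looks very simple but hides a lot of detail: pandas aligns `df_rate['Birth Rate']` to the index of `df_final`, so each birth rate is repeated for the male and female rows that share the same (Race/ethnicity, Year, Age Group) label.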

In [651]:
df_final = df_percent.copy()
df_final['Birth Rate'] = df_rate['Birth Rate']

In [652]:
# get final table
df_final = df_final.reset_index()

df_final.head()

Unnamed: 0,Race/ethnicity,Year,Age Group,Gender Percentage,Gender,Birth Rate
0,"White, non-Hispanic",2000,18-19,51.5,male,57.5
1,"White, non-Hispanic",2000,18-19,48.5,female,57.5
2,"White, non-Hispanic",2000,20-24,51.2,male,91.2
3,"White, non-Hispanic",2000,20-24,48.8,female,91.2
4,"White, non-Hispanic",2001,18-19,51.3,male,54.7


### Not quite tidy
This DataFrame no longer has variables in the column names but has many repeated values. Repeated values can be moved out into their own table and replaced with a **`key`** that uniquely identifies them. 

For instance, we could have kept the **`rate`** and **`percent`** DataFrames separate. Let's do this and add a key so they can easily join one another.

### Drop duplicates before getting keys

In [682]:
unique_index = df_percent.index.drop_duplicates()

In [683]:
df_key = pd.DataFrame(index = unique_index, data=list(range(len(unique_index))), columns=['key'])

In [684]:
df_key.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,key
Race/ethnicity,Year,Age Group,Unnamed: 3_level_1
"White, non-Hispanic",2000,18-19,0
"White, non-Hispanic",2000,20-24,1
"White, non-Hispanic",2001,18-19,2
"White, non-Hispanic",2001,20-24,3
"White, non-Hispanic",2002,18-19,4


### add key to percent table

In [694]:
df_percent_key = df_percent.copy()
df_percent_key['key'] = df_key

In [695]:
df_percent_key.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Gender Percentage,Gender,key
Race/ethnicity,Year,Age Group,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"White, non-Hispanic",2000,18-19,51.5,male,0
"White, non-Hispanic",2000,18-19,48.5,female,0
"White, non-Hispanic",2000,20-24,51.2,male,1
"White, non-Hispanic",2000,20-24,48.8,female,1
"White, non-Hispanic",2001,18-19,51.3,male,2
"White, non-Hispanic",2001,18-19,48.7,female,2
"White, non-Hispanic",2001,20-24,51.2,male,3
"White, non-Hispanic",2001,20-24,48.8,female,3
"White, non-Hispanic",2002,18-19,51.1,male,4
"White, non-Hispanic",2002,18-19,48.9,female,4


### Add key to rate table

In [702]:
df_rate_key = df_rate.copy()
df_rate_key['key'] = df_key

In [703]:
df_rate_key.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Birth Rate,key
Race/ethnicity,Year,Age Group,Unnamed: 3_level_1,Unnamed: 4_level_1
"White, non-Hispanic",2000,18-19,57.5,0
"White, non-Hispanic",2000,20-24,91.2,1
"White, non-Hispanic",2001,18-19,54.7,2
"White, non-Hispanic",2001,20-24,87.0,3
"White, non-Hispanic",2002,18-19,52.0,4
"White, non-Hispanic",2002,20-24,84.7,5
"White, non-Hispanic",2003,18-19,50.0,6
"White, non-Hispanic",2003,20-24,84.1,7
"White, non-Hispanic",2004,18-19,48.6,8
"White, non-Hispanic",2004,20-24,83.0,9


### Drop index from rate and percent table
The key now identifies each row, so the original index is no longer needed.

In [704]:
df_percent_key = df_percent_key.reset_index(drop=True)
df_rate_key = df_rate_key.reset_index(drop=True)

In [705]:
df_percent_key.head()

Unnamed: 0,Gender Percentage,Gender,key
0,51.5,male,0
1,48.5,female,0
2,51.2,male,1
3,48.8,female,1
4,51.3,male,2


In [707]:
df_rate_key.head()

Unnamed: 0,Birth Rate,key
0,57.5,0
1,91.2,1
2,54.7,2
3,87.0,3
4,52.0,4


In [710]:
df_key.reset_index().head()

Unnamed: 0,Race/ethnicity,Year,Age Group,key
0,"White, non-Hispanic",2000,18-19,0
1,"White, non-Hispanic",2000,20-24,1
2,"White, non-Hispanic",2001,18-19,2
3,"White, non-Hispanic",2001,20-24,3
4,"White, non-Hispanic",2002,18-19,4


### Is gender really a variable?
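It looks like gender might not be its own variable after all. The male and female percentages can be pivoted into two columns, leaving one row per race/ethnicity, year, and age group.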

In [717]:
df_final.head(10)

Unnamed: 0,Race/ethnicity,Year,Age Group,Gender Percentage,Gender,Birth Rate
0,"White, non-Hispanic",2000,18-19,51.5,male,57.5
1,"White, non-Hispanic",2000,18-19,48.5,female,57.5
2,"White, non-Hispanic",2000,20-24,51.2,male,91.2
3,"White, non-Hispanic",2000,20-24,48.8,female,91.2
4,"White, non-Hispanic",2001,18-19,51.3,male,54.7
5,"White, non-Hispanic",2001,18-19,48.7,female,54.7
6,"White, non-Hispanic",2001,20-24,51.2,male,87.0
7,"White, non-Hispanic",2001,20-24,48.8,female,87.0
8,"White, non-Hispanic",2002,18-19,51.1,male,52.0
9,"White, non-Hispanic",2002,18-19,48.9,female,52.0


In [718]:
df_final.pivot_table(index=['Race/ethnicity', 'Year', 'Age Group', 'Birth Rate'], 
                     columns='Gender', 
                     values='Gender Percentage').reset_index().head(10)

Gender,Race/ethnicity,Year,Age Group,Birth Rate,female,male
0,American Indian/Alaska Native,2000,18-19,97.1,50.0,50.0
1,American Indian/Alaska Native,2000,20-24,117.2,48.6,51.4
2,American Indian/Alaska Native,2001,18-19,92.7,50.5,49.5
3,American Indian/Alaska Native,2001,20-24,113.8,49.3,50.7
4,American Indian/Alaska Native,2002,18-19,85.3,49.0,51.0
5,American Indian/Alaska Native,2002,20-24,110.7,49.7,50.3
6,American Indian/Alaska Native,2003,18-19,82.1,49.1,50.9
7,American Indian/Alaska Native,2003,20-24,107.0,49.0,51.0
8,American Indian/Alaska Native,2004,18-19,79.9,48.7,51.3
9,American Indian/Alaska Native,2004,20-24,105.4,49.5,50.5


### Problem 1
<span  style="color:green; font-size:16px">Make the following dataset tidy by putting all the `HOUR` columns into a single column</span>

In [229]:
df = pd.read_csv('data/country_hour_price.csv')
df

Unnamed: 0,ASID,BORDER,HOUR1,HOUR2
0,21,GERMANY,2,3
1,32,FRANCE,2,3
2,99,ITALY,2,3
3,77,USA,4,5
4,66,CANADA,4,5
5,55,MEXICO,4,5
6,44,INDIA,6,7
7,88,CHINA,6,7
8,111,JAPAN,6,7


### Problem 2
<span  style="color:green; font-size:16px">Tidy the dataset Impaired_Driving_Death_Rate.csv</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Choose from one of the many files in the **data** directory and make it tidy.</span>

In [719]:
# your code here