# Tidy and Reshaped Data

### Objectives
After this lesson you should be able to...
+ Explain what tidy data is
+ Spot messy data
+ Transform a simple messy dataset into a tidy data set
+ Master the reshaping functions/methods: **`melt, stack, unstack, pivot, pivot_table`**
+ Know the primary purpose of **`melt, stack, unstack, pivot, pivot_table`**
+ Go back and forth between multiple levels of grouped data

### Prepare for this lesson by...
+ Read Hadley Wickham's paper on [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf)
+ Watch Hadley Wickham's talk on [tidy data](https://vimeo.com/33727555)
+ Watch Jeff Leek's video on [tidy data](https://www.youtube.com/watch?v=whDilsFoLVY)
+ Read the [reshaping pandas documentation page](http://pandas.pydata.org/pandas-docs/stable/reshaping.html)

### Datasets until now
Thus far, we have analyzed several datasets but have not done much work to change their structure or do any preprocessing before computation. We immediately began generating results and answering questions. Producing results is typically not the first step of a data analysis. The vast majority of datasets 'in the wild' will need some amount of inspection and preprocessing. And in some cases, the entire project will just be about cleaning the data so that it can be further processed by someone else. 

For all the work that goes into data preparation for machine learning, there is surprisingly sparse coverage on how to do it. This notebook will use many ideas formulated by Hadley Wickham to 'tidy' data before introducing a few more steps in order to prepare it for machine learning and visualization.

There's an infamous data science saying goes some like this: data scientists spend 80% of their time cleaning data and the other 20% complaining about cleaning the data.

### The genesis of data
Do you know where and how data is generated? Many introductory courses such as this one will use premade csv files. Loading this data into your workspace is not the genesis of this data.

The data from these sources must come from somewhere. It wasn't just magically put in a csv file or on a website or in a database used by an API. 

Some original sources of data might be:
+ Playing a mobile game your smart phone sends game data to a small sqlite instance on your local phone and to a large remote Amazon S3 server.
+ You keep track of all your golf scores on paper and copy them to an excel file after each round
+ Censors on industrial equipment continually pour data into an on-premise hadoop cluster
+ Facebook quickly writing all it's interactions to hbase
+ A City of Houston employees enter in personal information in an online web app.

Yes, non-electronic data does exist and is valuable (that was all there was before the 20th century) but for obvious reasons we will only deal with electronic data that can be read by modern computers.

### Getting to know the data management team
Business data can be very complex and multiple systems can interact

### Tidy Data
Tidy data is a term coined by Hadley Wickham the creator of many useful R packages to transform messy data into tidy data. It is highly recommended that you read [his paper](http://vita.had.co.nz/papers/tidy-data.pdf) to get a fuller understanding of tidy data. The basics will be covered below.

Tidy data is a specific structure of data that makes analysis easier. A dataset is tidy when:
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

Any dataset that does not meet this definition is considered messy. This definition is simple but very useful and powerful and something that will take you a long way in your data exploration analyses. 

### First example of messy data
Messy data can appear deceptively clean and tidy, especially if you have not been exposed to it before.

In the table below we have some data about the weight of some fruit owned by some people.

In [2]:
import pandas as pd
import numpy as np

In [3]:
# looks so nice and clean!
df = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]], 
                  columns=['Apple', 'Orange', 'Banana'],
                  index=['Ted', 'Penelope', 'Niko'])
df

Unnamed: 0,Apple,Orange,Banana
Ted,12,10,40
Penelope,9,7,12
Niko,0,14,190


### What's wrong?
Even though the dataset returns perfectly readable and acceptable information it is not technically a tidy data set and although machine learning would be uninteresting with this dataset, visualization would be made easier if the data were tidy. More on this in the plotting notebooks.

The main issue with the above dataset is that the column names are variables themselves. At this point, you might be confused as to what exactly is meant by a 'variable'. With more practice it should become clear but anything that takes on a value (numeric or string) that can be changed is a probably good enough of a definition to identify a variable. 

### What are the variable names?
None of the variable names are actually part of the DataFrame above. You must infer them from the context of the problem. The variables are:
+ Person names
+ Types of fruit 
+ Weight of fruit

### Actual Tidying
To tidy, we simply need to make sure the three tidy rules are followed. Let's start with forcing each variable into a column. The person names already appear to be in a single column, though they are actually in the pandas **`index`**. We will remove it from the index later.

The types of fruit are column names and need to be transposed to a column.

The weight of the fruit is a total mess and comprises a three by three square.

### Stacking
The pandas **`stack`** method, restructures the DataFrame by taking every data value (those not columns names or the index) and forcing them into one column of data. The result is a pandas **`Series`** that adds a label to all the values as the original column names.

In [4]:
# stacking the data into a Series
df.stack()

Ted       Apple      12
          Orange     10
          Banana     40
Penelope  Apple       9
          Orange      7
          Banana     12
Niko      Apple       0
          Orange     14
          Banana    190
dtype: int64

### Finish Tidying
With one command, the above data is much closer to being tidy but the Series index is now comprised of two levels (a MultiIndex). The **`reset_index`** will push all these values back out as normal DataFrame columns.

In [5]:
df_tidy = df.stack().reset_index()
df_tidy

Unnamed: 0,level_0,level_1,0
0,Ted,Apple,12
1,Ted,Orange,10
2,Ted,Banana,40
3,Penelope,Apple,9
4,Penelope,Orange,7
5,Penelope,Banana,12
6,Niko,Apple,0
7,Niko,Orange,14
8,Niko,Banana,190


### Column Names
The 'columns' in the **`index`** are technically called **`levels`** which can have names (more on this later) but do not here. By default they are referenced as integers beginning from 0 on the left. The index can have any number of levels.

Let's name the columns directly with a list.

In [6]:
df_tidy.columns = ['Name', 'Fruit', 'Weight']

df_tidy

Unnamed: 0,Name,Fruit,Weight
0,Ted,Apple,12
1,Ted,Orange,10
2,Ted,Banana,40
3,Penelope,Apple,9
4,Penelope,Orange,7
5,Penelope,Banana,12
6,Niko,Apple,0
7,Niko,Orange,14
8,Niko,Banana,190


In [7]:
# All steps together
df_tidy = df.stack().reset_index()
df_tidy.columns = ['Name', 'Fruit', 'Weight']
df_tidy

Unnamed: 0,Name,Fruit,Weight
0,Ted,Apple,12
1,Ted,Orange,10
2,Ted,Banana,40
3,Penelope,Apple,9
4,Penelope,Orange,7
5,Penelope,Banana,12
6,Niko,Apple,0
7,Niko,Orange,14
8,Niko,Banana,190


### Our first tidy dataset
By ensuring that each variable forms its own row we also instantly have each observation as its own row. You could argue that each row was originally a single observation.

# Focus on `melt, stack, pivot, unpivot`
We will shift focus for the moment by mastering **`melt`, `stack`, `pivot`** and **`unpivot`** on this simple dataset as these will be your primary tools from moving from messy to tidy and back to messy data again. We will return our focus to tidy data after these basic commands are covered.

### Accomplishing the same task with `melt`
Like most large Python libraries, pandas has many different ways to accomplish the same task. A large percentage of the pandas questions on stackoverflow have multiple answers that produce the same successful output with different commands. The differences usually being readability and performance.

pandas contains a *function* and not a DataFrame method named **`melt`** which works similarly to the **`stack`** method but gives a bit more flexibility. The **`melt`** function takes up to 6 parameters with two of them being more important. 
+ **`id_vars`** - a list of column names that you do NOT want to move into a single column.
+ **`value_vars`** - a list of column names that you would like to move into one column

This 'moving' into one column is usually referred to as 'melting' or 'stacking'. The **`id_vars`** will stay in the same column they are currently in but repeat to align with all the newly stacked values in the **`value_vars`** columns. 

One other important note: **`melt`** works when there are no columns in the **`index`**. To get started we first reset the index.

In [8]:
df2 = df.reset_index()
df2

Unnamed: 0,index,Apple,Orange,Banana
0,Ted,12,10,40
1,Penelope,9,7,12
2,Niko,0,14,190


In [9]:
# rename that ugly column
df2 = df2.rename(columns={'index':'Name'})
df2

Unnamed: 0,Name,Apple,Orange,Banana
0,Ted,12,10,40
1,Penelope,9,7,12
2,Niko,0,14,190


In [10]:
# melt is a FUNCTION!
# the first parameter is the dataframe
# id_vars are the columns you don't want to stack/melt. 
# value_vars are the columns you do want to stack/melt
df_melt = pd.melt(df2, 
                  id_vars='Name', 
                  value_vars=['Apple', 'Orange', 'Banana'])
df_melt

Unnamed: 0,Name,variable,value
0,Ted,Apple,12
1,Penelope,Apple,9
2,Niko,Apple,0
3,Ted,Orange,10
4,Penelope,Orange,7
5,Niko,Orange,14
6,Ted,Banana,40
7,Penelope,Banana,12
8,Niko,Banana,190


### Renaming with `melt`
The **`melt`** function contains two other handy-dandy parameters that let you name the melted and value columns.

In [11]:
df_melt = pd.melt(df2, 
                  id_vars='Name', 
                  value_vars=['Apple', 'Orange', 'Banana'],
                  var_name='Fruit', 
                  value_name='Weight')
df_melt

Unnamed: 0,Name,Fruit,Weight
0,Ted,Apple,12
1,Penelope,Apple,9
2,Niko,Apple,0
3,Ted,Orange,10
4,Penelope,Orange,7
5,Niko,Orange,14
6,Ted,Banana,40
7,Penelope,Banana,12
8,Niko,Banana,190


In [12]:
# all in one step
pd.melt(df.reset_index().rename(columns={'index':'Name'}), 
        id_vars='Name', 
        value_vars=['Apple', 'Orange', 'Banana'],
        var_name='Fruit', 
        value_name='Weight')

Unnamed: 0,Name,Fruit,Weight
0,Ted,Apple,12
1,Penelope,Apple,9
2,Niko,Apple,0
3,Ted,Orange,10
4,Penelope,Orange,7
5,Niko,Orange,14
6,Ted,Banana,40
7,Penelope,Banana,12
8,Niko,Banana,190


### `df.stack()` vs `pd.melt(df)`
The primary purpose of both the **`stack`** method and the **`melt`** function is to take multiple columns and put them in a single column. Think of columns being stacked one on top of one another or columns literally melting their data down into one common place. Each value in this long column will be labeled by it's original column name.

The **`stack`** method, takes every column of the DataFrame and stacks all the values into a single column. You do not get to choose a subset of columns. The column names also get put into the **`index`** and create a multi-index (not covered here).

The **`melt`** function gives you more control and allows you to choose which columns will be stacked and which ones will remain as labels. Any values in the index must be first reset if they are to be used in the **`melt`** function.

**Terminology**: For the sake of brevity 'stacked' and 'melted' will refer to the same exact data operation.

### Inversing Stacked Data with `unstack`
pandas has functionality to invert stacked data back to its original messy form.

### `unstack` method
The `unstack` DataFrame method inverts the operation of `stack` by moving values from **`index levels`** to column names.

In [13]:
# stored stacked df to a variable
df_stacked = df.stack()
df_stacked

Ted       Apple      12
          Orange     10
          Banana     40
Penelope  Apple       9
          Orange      7
          Banana     12
Niko      Apple       0
          Orange     14
          Banana    190
dtype: int64

In [14]:
df_stacked.unstack()

Unnamed: 0,Apple,Orange,Banana
Ted,12,10,40
Penelope,9,7,12
Niko,0,14,190


### Transposing a DataFrame with `stack` and `unstack`
A DataFrame can be easily transposed with the **`T`** attribute but can also be achieved by cleverly using **`stack`** and then **`unstack`**. 

The **`unstack`** method defaults to unstacking the inner most(right most) level of the index. Index levels are numbered beginning at 0 from left to right. The **`level`** parameter is defaulted to **`-1`** meaning the right most level. We can change this parameter to choose the exact level we want to unstack. You may use a list to unstack more than one level.

In [15]:
# View original df
df

Unnamed: 0,Apple,Orange,Banana
Ted,12,10,40
Penelope,9,7,12
Niko,0,14,190


In [16]:
# Transpose the original dataframe by unstacking
df_stacked.unstack(level=0)

Unnamed: 0,Ted,Penelope,Niko
Apple,12,9,0
Orange,10,7,14
Banana,40,12,190


In [17]:
# also done more efficiently with .T
df.T

Unnamed: 0,Ted,Penelope,Niko
Apple,12,9,0
Orange,10,7,14
Banana,40,12,190


### Inverting stacked data with `pivot`
Similarly to the **`unstack`** method is the **`pivot`** method (also a function) which will invert stacked data. It takes three parameters:
+ **`index`** - the column that will stay vertical and be made into the index. Leave this as None if you would like to use the current index.
+ **`columns`** - The column which will be transposed and whose unique values will be made into column names
+ **`values`** - The column which will be tiled as the new values of the returned DataFrame

In [18]:
## reprint data from melt
df_melt

Unnamed: 0,Name,Fruit,Weight
0,Ted,Apple,12
1,Penelope,Apple,9
2,Niko,Apple,0
3,Ted,Orange,10
4,Penelope,Orange,7
5,Niko,Orange,14
6,Ted,Banana,40
7,Penelope,Banana,12
8,Niko,Banana,190


In [19]:
# pivot
df_melt.pivot(index='Name', columns='Fruit', values='Weight')

Fruit,Apple,Banana,Orange
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Niko,0,190,14
Penelope,9,12,7
Ted,12,40,10


In [20]:
## Same result with function
## is more generic and does not work directly on your object.
pd.pivot(index=df_melt.Name, columns=df_melt.Fruit, values=df_melt.Weight)

Fruit,Apple,Banana,Orange
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Niko,0,190,14
Penelope,9,12,7
Ted,12,40,10


### What is that extra blank row?
You might be disturbed by what appears to be an extra row with the index label **`Name`** from the resulting pivoted table above. Similarly the word **`Fruit`** has also been added in the upper left cell.

This output is unfortunately very confusing. What appears to be an extra row with label **`Name`** is not a row at all. There is absolutely no data in those blank cells. **`Name`** is simply the name of the first index level (in addition to its numeric level numbering beginning at 0 from the left). By default index levels have no name and so this added 'row' is missing for most DataFrames that you will encounter. Similarly, **`Fruit`** is simply the name of the column index.

In [21]:
# store pivoted data
df_pivot = df_melt.pivot(index='Name', columns='Fruit', values='Weight')

In [22]:
# inspect the index
# note that the name of the index is 'Name'
df_pivot.index

Index(['Niko', 'Penelope', 'Ted'], dtype='object', name='Name')

### Index Names are Helpful
The resulting DataFrame from the **`pivot`** method can be extremely confusing the first time you come upon a named index level and it might seem useless. They become more useful when there are multiple levels in your index or columns (a **`MultiIndex`**).

Index names are also helpful when resetting an index. The index names become the column names so there is no need to set them manually like we had to when first using **`stack`** above.

Let's see how index names turn into column names. We will perform the same tidy operations on the **`df_pivot`** DataFrame as we did the original **`df`**. 

In [23]:
# last column will still need to be renamed
df_pivot.stack().reset_index()

Unnamed: 0,Name,Fruit,0
0,Niko,Apple,0
1,Niko,Banana,190
2,Niko,Orange,14
3,Penelope,Apple,9
4,Penelope,Banana,12
5,Penelope,Orange,7
6,Ted,Apple,12
7,Ted,Banana,40
8,Ted,Orange,10


### Naming the values in a Series
The above doesn't quite get us what we want since the variable weight is never named anywhere. The pandas **`rename`** Series method allows you to give a name to all the values. If this Series is converted to a DataFrame, as what happens when **`reset_index`** is called, then this Series name will be the new column name.

**`df.stack()`** returns a Series which can have the values named with the **`rename`** method. Take a look at the bottom of the output below. You will see the name of the Series as **Weight**.

In [24]:
# Since df_pivot.stack() returns a series you can use the rename method
# to rename the actual values of the series
# See the Name of the values of the series at the very bottom of the output
df_pivot.stack().rename('Weight')

Name      Fruit 
Niko      Apple       0
          Banana    190
          Orange     14
Penelope  Apple       9
          Banana     12
          Orange      7
Ted       Apple      12
          Banana     40
          Orange     10
Name: Weight, dtype: int64

### Putting it all together
We can chain these methods together to get the DataFrame we want.

In [25]:
# now reset the index
df_pivot.stack().rename('Weight').reset_index()

Unnamed: 0,Name,Fruit,Weight
0,Niko,Apple,0
1,Niko,Banana,190
2,Niko,Orange,14
3,Penelope,Apple,9
4,Penelope,Banana,12
5,Penelope,Orange,7
6,Ted,Apple,12
7,Ted,Banana,40
8,Ted,Orange,10


### Renaming index names
You can remove the name of the index if it is bothering you by deleting it with **`del`** or setting it to **`None`**. Your DataFrame will now look more familiar to you.

In [26]:
# DataFrame with named indexes
df_pivot

Fruit,Apple,Banana,Orange
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Niko,0,190,14
Penelope,9,12,7
Ted,12,40,10


In [27]:
# remove names
df_pivot.index.name = None
del df_pivot.columns.name

df_pivot

Unnamed: 0,Apple,Banana,Orange
Niko,0,190,14
Penelope,9,12,7
Ted,12,40,10


### Renaming Index Levels
You can rename the index level names with the **`rename_axis`** method.

In [28]:
# create DataFrame with named levels
df_pivot = df_melt.pivot(index='Name', columns='Fruit', values='Weight')
df_pivot

Fruit,Apple,Banana,Orange
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Niko,0,190,14
Penelope,9,12,7
Ted,12,40,10


In [29]:
# rename levels - pass in a list for multiple levels
df_pivot.rename_axis('PERSON', axis='index').rename_axis('FOOD', axis=1)

FOOD,Apple,Banana,Orange
PERSON,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Niko,0,190,14
Penelope,9,12,7
Ted,12,40,10


### `pivot_table` and `groupby` are very similar
Now that you have seen how **`stack`** and **`melt`** are similar and **`unstack`** and **`pivot`** similarly invert those operations, there is one other set of reshaping methods that do nearly the same thing - **`pivot_table`** and **`groupby`** (when aggregating).

Since we already covered **`groupby`** thoroughly and had an example with **`pivot_table`** we will jump right into a more complex example with the college dataset.

In [30]:
college = pd.read_csv('data/college.csv')
pd.options.display.max_columns = 40

college.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Redo a `groupby` operation with `pivot_table`
It's not apparent at first but **`groupby`** and **`pivot_table`** use almost the exact same inputs. The below **`groupby`** is passed three different lists for three different parts of the operation. 

+ **['STABBR', 'RELAFFIL']** - these are grouping columns.
+ **['UGDS', 'SATMTMID']** - these are the columns being aggregated
+ **['size', 'min', 'max']** - these are the aggregating functions applied to each column

The **`pivot_table`** method (also a function) uses the parameter **`index`** for the first list, **`values`** for the second and **`aggfunc`** for the third.

In [31]:
# use a complex groupby from a previous notebook
cg = college.groupby(['STABBR', 'RELAFFIL'])['UGDS', 'SATMTMID'].agg(['size', 'min', 'max']).head(12)
cg

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,size,min,max,size,min,max
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7,109.0,12865.0,7,,
AK,1,3,27.0,275.0,3,503.0,503.0
AL,0,72,12.0,29851.0,72,420.0,590.0
AL,1,24,13.0,3033.0,24,400.0,560.0
AR,0,68,18.0,21405.0,68,427.0,565.0
AR,1,18,20.0,4485.0,18,495.0,600.0
AS,0,1,1276.0,1276.0,1,,
AZ,0,124,1.0,151558.0,124,503.0,580.0
AZ,1,9,25.0,4102.0,9,480.0,480.0
CA,0,609,0.0,44744.0,609,445.0,785.0


In [32]:
# replicate with pivot_table
cp = college.pivot_table(index=['STABBR', 'RELAFFIL'], 
                         values=['UGDS', 'SATMTMID'], 
                         aggfunc=[np.size, np.min, np.max]).head(12)
cp 

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,amin,amin,amax,amax
Unnamed: 0_level_1,Unnamed: 1_level_1,SATMTMID,UGDS,SATMTMID,UGDS,SATMTMID,UGDS
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7.0,7.0,,109.0,,12865.0
AK,1,3.0,3.0,503.0,27.0,503.0,275.0
AL,0,72.0,72.0,420.0,12.0,590.0,29851.0
AL,1,24.0,24.0,400.0,13.0,560.0,3033.0
AR,0,68.0,68.0,427.0,18.0,565.0,21405.0
AR,1,18.0,18.0,495.0,20.0,600.0,4485.0
AS,0,1.0,1.0,,1276.0,,1276.0
AZ,0,124.0,124.0,503.0,1.0,580.0,151558.0
AZ,1,9.0,9.0,480.0,25.0,480.0,4102.0
CA,0,609.0,609.0,445.0,0.0,785.0,44744.0


### `pivot_table` needs more help for exact replication
Unfortunately the **`pivot_table`** method does not take numpy string methods as aggregate functions like **`groupby`**. The column levels are reversed and not in the right order. **`swaplevel`** and **`sort_index`** are two DataFrame methods that fix this. Column levels are numbered beginning at 0 from the top.

In [33]:
# swap columm levels and sort top column
# close enough
cp.swaplevel(0, 1, axis='columns')\
  .sortlevel(level=0, axis='columns', ascending=False) 
    # this looks to be a pandas bug. Says its lexicographic but its numeric

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,amax,amin,size,amax,amin,size
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,12865.0,109.0,7.0,,,7.0
AK,1,275.0,27.0,3.0,503.0,503.0,3.0
AL,0,29851.0,12.0,72.0,590.0,420.0,72.0
AL,1,3033.0,13.0,24.0,560.0,400.0,24.0
AR,0,21405.0,18.0,68.0,565.0,427.0,68.0
AR,1,4485.0,20.0,18.0,600.0,495.0,18.0
AS,0,1276.0,1276.0,1.0,,,1.0
AZ,0,151558.0,1.0,124.0,580.0,503.0,124.0
AZ,1,4102.0,25.0,9.0,480.0,480.0,9.0
CA,0,44744.0,0.0,609.0,785.0,445.0,609.0


### Taking Advantage of Index Level Names
The original grouped data has two index levels with the same names as the column names that they once were. These level names can be used in-place of the numeric level labeling like we have done above. See the examples below using the index level name in methods.

In [34]:
# original college grouped data
cg

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,size,min,max,size,min,max
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7,109.0,12865.0,7,,
AK,1,3,27.0,275.0,3,503.0,503.0
AL,0,72,12.0,29851.0,72,420.0,590.0
AL,1,24,13.0,3033.0,24,400.0,560.0
AR,0,68,18.0,21405.0,68,427.0,565.0
AR,1,18,20.0,4485.0,18,495.0,600.0
AS,0,1,1276.0,1276.0,1,,
AZ,0,124,1.0,151558.0,124,503.0,580.0
AZ,1,9,25.0,4102.0,9,480.0,480.0
CA,0,609,0.0,44744.0,609,445.0,785.0


In [35]:
# sort by religious affiliation
cg.sort_index(level='RELAFFIL', sort_remaining=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,size,min,max,size,min,max
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7,109.0,12865.0,7,,
AL,0,72,12.0,29851.0,72,420.0,590.0
AR,0,68,18.0,21405.0,68,427.0,565.0
AS,0,1,1276.0,1276.0,1,,
AZ,0,124,1.0,151558.0,124,503.0,580.0
CA,0,609,0.0,44744.0,609,445.0,785.0
CO,0,118,0.0,25873.0,118,424.0,680.0
AK,1,3,27.0,275.0,3,503.0,503.0
AL,1,24,13.0,3033.0,24,400.0,560.0
AR,1,18,20.0,4485.0,18,495.0,600.0


In [36]:
# get all the values of one level
cg.index.get_level_values('STABBR')

Index(['AK', 'AK', 'AL', 'AL', 'AR', 'AR', 'AS', 'AZ', 'AZ', 'CA', 'CA', 'CO'], dtype='object', name='STABBR')

### Crazy reshaping using `stack` and `unstack` with index and column level names
The level names really come in handy when stacking and unstacking data with indexes and columns with multiple levels. Behold the wizardry below. First we will name the column levels since they don't exist currently.

In [37]:
# now all four index and column levels have names
# looks a little odd doesn't it?
cg = cg.rename_axis(['Agg_Cols', 'Agg_Funcs'], axis='columns')
cg

Unnamed: 0_level_0,Agg_Cols,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Agg_Funcs,size,min,max,size,min,max
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7,109.0,12865.0,7,,
AK,1,3,27.0,275.0,3,503.0,503.0
AL,0,72,12.0,29851.0,72,420.0,590.0
AL,1,24,13.0,3033.0,24,400.0,560.0
AR,0,68,18.0,21405.0,68,427.0,565.0
AR,1,18,20.0,4485.0,18,495.0,600.0
AS,0,1,1276.0,1276.0,1,,
AZ,0,124,1.0,151558.0,124,503.0,580.0
AZ,1,9,25.0,4102.0,9,480.0,480.0
CA,0,609,0.0,44744.0,609,445.0,785.0


In [38]:
# commense wizardy
# Stack all the values in the Agg_Cols level
cg.stack('Agg_Cols')

Unnamed: 0_level_0,Unnamed: 1_level_0,Agg_Funcs,size,min,max
STABBR,RELAFFIL,Agg_Cols,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AK,0,UGDS,7,109.0,12865.0
AK,0,SATMTMID,7,,
AK,1,UGDS,3,27.0,275.0
AK,1,SATMTMID,3,503.0,503.0
AL,0,UGDS,72,12.0,29851.0
AL,0,SATMTMID,72,420.0,590.0
AL,1,UGDS,24,13.0,3033.0
AL,1,SATMTMID,24,400.0,560.0
AR,0,UGDS,68,18.0,21405.0
AR,0,SATMTMID,68,427.0,565.0


In [39]:
# stack values in Agg_Funcs
cg.stack('Agg_Funcs')

Unnamed: 0_level_0,Unnamed: 1_level_0,Agg_Cols,UGDS,SATMTMID
STABBR,RELAFFIL,Agg_Funcs,Unnamed: 3_level_1,Unnamed: 4_level_1
AK,0,size,7.0,7.0
AK,0,min,109.0,
AK,0,max,12865.0,
AK,1,size,3.0,3.0
AK,1,min,27.0,503.0
AK,1,max,275.0,503.0
AL,0,size,72.0,72.0
AL,0,min,12.0,420.0
AL,0,max,29851.0,590.0
AL,1,size,24.0,24.0


In [40]:
# stack both into a Series
# now with 4 index levels!
s4 = cg.stack(['Agg_Funcs', 'Agg_Cols'])
s4

STABBR  RELAFFIL  Agg_Funcs  Agg_Cols
AK      0         size       UGDS             7.0
                             SATMTMID         7.0
                  min        UGDS           109.0
                  max        UGDS         12865.0
        1         size       UGDS             3.0
                             SATMTMID         3.0
                  min        UGDS            27.0
                             SATMTMID       503.0
                  max        UGDS           275.0
                             SATMTMID       503.0
AL      0         size       UGDS            72.0
                             SATMTMID        72.0
                  min        UGDS            12.0
                             SATMTMID       420.0
                  max        UGDS         29851.0
                             SATMTMID       590.0
        1         size       UGDS            24.0
                             SATMTMID        24.0
                  min        UGDS            13.0
            

In [41]:
# now unstack
s4.unstack('STABBR')

Unnamed: 0_level_0,Unnamed: 1_level_0,STABBR,AK,AL,AR,AS,AZ,CA,CO
RELAFFIL,Agg_Funcs,Agg_Cols,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,size,UGDS,7.0,72.0,68.0,1.0,124.0,609.0,118.0
0,size,SATMTMID,7.0,72.0,68.0,1.0,124.0,609.0,118.0
0,min,UGDS,109.0,12.0,18.0,1276.0,1.0,0.0,0.0
0,min,SATMTMID,,420.0,427.0,,503.0,445.0,424.0
0,max,UGDS,12865.0,29851.0,21405.0,1276.0,151558.0,44744.0,25873.0
0,max,SATMTMID,,590.0,565.0,,580.0,785.0,680.0
1,size,UGDS,3.0,24.0,18.0,,9.0,164.0,
1,size,SATMTMID,3.0,24.0,18.0,,9.0,164.0,
1,min,UGDS,27.0,13.0,20.0,,25.0,8.0,
1,min,SATMTMID,503.0,400.0,495.0,,480.0,441.0,


In [42]:
s4.unstack(['STABBR','Agg_Funcs'])

Unnamed: 0_level_0,STABBR,AK,AK,AK,AL,AL,AL,AR,AR,AR,AS,AS,AS,AZ,AZ,AZ,CA,CA,CA,CO,CO,CO
Unnamed: 0_level_1,Agg_Funcs,size,min,max,size,min,max,size,min,max,size,min,max,size,min,max,size,min,max,size,min,max
RELAFFIL,Agg_Cols,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
0,UGDS,7.0,109.0,12865.0,72.0,12.0,29851.0,68.0,18.0,21405.0,1.0,1276.0,1276.0,124.0,1.0,151558.0,609.0,0.0,44744.0,118.0,0.0,25873.0
0,SATMTMID,7.0,,,72.0,420.0,590.0,68.0,427.0,565.0,1.0,,,124.0,503.0,580.0,609.0,445.0,785.0,118.0,424.0,680.0
1,UGDS,3.0,27.0,275.0,24.0,13.0,3033.0,18.0,20.0,4485.0,,,,9.0,25.0,4102.0,164.0,8.0,6745.0,,,
1,SATMTMID,3.0,503.0,503.0,24.0,400.0,560.0,18.0,495.0,600.0,,,,9.0,480.0,480.0,164.0,441.0,665.0,,,


### 'Wide' vs 'Long' format
'Wide' and 'Long' are common (but perhaps imprecise) idioms to identify data. Typically long format refers data that has many rows and few columns.  Stacked and tidy data would be 'long' data. 'Wide' data is the opposite and contains many columns and less rows. Pivoted and messy data is wide.

### Back to `pivot_table` `groupby` equivalence
We just saw how we can use the `pivot_table` method to emulate a **`groupby`**. Well, its also possible to do the opposite. **`pivot_table`** offers the **`columns`** argument which transposes the values of a column to column names before aggregating. **`groupby`** offers no direct ability to mimic this behavior but with the help of **`unstack`** it is possible to create the equivalence.

In [43]:
# use pivot_table to transpose the STABBR column
college.pivot_table(index='RELAFFIL', 
                    columns='STABBR', 
                    values='UGDS', 
                    aggfunc=np.mean)

STABBR,AK,AL,AR,AS,AZ,CA,CO,CT,DC,DE,FL,FM,GA,GU,HI,IA,ID,IL,IN,KS,...,NY,OH,OK,OR,PA,PR,PW,RI,SC,SD,TN,TX,UT,VA,VI,VT,WA,WI,WV,WY
RELAFFIL,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1
0,3508.857143,3248.774648,1793.691176,1276.0,4363.533898,3802.08981,2324.619469,1890.573171,2008.285714,2247.75,2676.443149,2344.0,2704.517483,2808.5,2531.05,2872.57377,1624.222222,2251.887446,2884.929293,2236.323529,...,2491.77628,1610.865517,1347.015748,2409.0,1698.535117,1599.431818,602.0,3504.894737,2197.853933,1560.913043,1594.900709,3206.222826,2862.363636,2635.720588,1971.0,1602.684211,2293.683168,2879.130952,1873.857143,2244.363636
1,123.333333,979.722222,917.785714,,692.75,1356.342105,2332.25,1674.142857,4874.75,3788.666667,993.642857,,2288.24,65.0,1509.0,1076.827586,6344.5,1851.860465,1835.5,646.761905,...,1330.269231,1382.098039,1387.5,953.857143,1703.157895,1590.8,,2043.333333,1283.0,691.857143,1450.068966,1680.758621,4938.0,3030.25,,942.0,2064.909091,1716.2,716.428571,


The trick is to groupby by all the columns in both the **`index`** and **`columns`** parameters of **`pivot_table`** and then use **`unstack`**.

In [44]:
cg2 = college.groupby(['RELAFFIL','STABBR'])['UGDS'].agg('mean')
cg2.head(10)

RELAFFIL  STABBR
0         AK        3508.857143
          AL        3248.774648
          AR        1793.691176
          AS        1276.000000
          AZ        4363.533898
          CA        3802.089810
          CO        2324.619469
          CT        1890.573171
          DC        2008.285714
          DE        2247.750000
Name: UGDS, dtype: float64

In [45]:
## unstack STABBR
cg2.unstack('STABBR')

STABBR,AK,AL,AR,AS,AZ,CA,CO,CT,DC,DE,FL,FM,GA,GU,HI,IA,ID,IL,IN,KS,...,NY,OH,OK,OR,PA,PR,PW,RI,SC,SD,TN,TX,UT,VA,VI,VT,WA,WI,WV,WY
RELAFFIL,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1
0,3508.857143,3248.774648,1793.691176,1276.0,4363.533898,3802.08981,2324.619469,1890.573171,2008.285714,2247.75,2676.443149,2344.0,2704.517483,2808.5,2531.05,2872.57377,1624.222222,2251.887446,2884.929293,2236.323529,...,2491.77628,1610.865517,1347.015748,2409.0,1698.535117,1599.431818,602.0,3504.894737,2197.853933,1560.913043,1594.900709,3206.222826,2862.363636,2635.720588,1971.0,1602.684211,2293.683168,2879.130952,1873.857143,2244.363636
1,123.333333,979.722222,917.785714,,692.75,1356.342105,2332.25,1674.142857,4874.75,3788.666667,993.642857,,2288.24,65.0,1509.0,1076.827586,6344.5,1851.860465,1835.5,646.761905,...,1330.269231,1382.098039,1387.5,953.857143,1703.157895,1590.8,,2043.333333,1283.0,691.857143,1450.068966,1680.758621,4938.0,3030.25,,942.0,2064.909091,1716.2,716.428571,


# Problems

The first set of problems will use NY state demographic data found from [data.gov](https://catalog.data.gov/dataset).

### Problem 1
<span  style="color:green; font-size:16px">Read in the `ny_demographics.csv` dataset. Is this a tidy dataset? Explain why or why not.</span>

In [46]:
import pandas as pd
import numpy as np
import os
ny_demo = pd.read_csv('data/ny_demographics.csv')
print(ny_demo.columns)
ny_demo.head(5)

Index(['JURISDICTION NAME', 'COUNT PARTICIPANTS', 'COUNT FEMALE',
       'PERCENT FEMALE', 'COUNT MALE', 'PERCENT MALE', 'COUNT GENDER UNKNOWN',
       'PERCENT GENDER UNKNOWN', 'COUNT GENDER TOTAL', 'PERCENT GENDER TOTAL',
       'COUNT PACIFIC ISLANDER', 'PERCENT PACIFIC ISLANDER',
       'COUNT HISPANIC LATINO', 'PERCENT HISPANIC LATINO',
       'COUNT AMERICAN INDIAN', 'PERCENT AMERICAN INDIAN',
       'COUNT ASIAN NON HISPANIC', 'PERCENT ASIAN NON HISPANIC',
       'COUNT WHITE NON HISPANIC', 'PERCENT WHITE NON HISPANIC',
       'COUNT BLACK NON HISPANIC', 'PERCENT BLACK NON HISPANIC',
       'COUNT OTHER ETHNICITY', 'PERCENT OTHER ETHNICITY',
       'COUNT ETHNICITY UNKNOWN', 'PERCENT ETHNICITY UNKNOWN',
       'COUNT ETHNICITY TOTAL', 'PERCENT ETHNICITY TOTAL',
       'COUNT PERMANENT RESIDENT ALIEN', 'PERCENT PERMANENT RESIDENT ALIEN',
       'COUNT US CITIZEN', 'PERCENT US CITIZEN', 'COUNT OTHER CITIZEN STATUS',
       'PERCENT OTHER CITIZEN STATUS', 'COUNT CITIZEN STATUS UNKN

Unnamed: 0,JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,COUNT PACIFIC ISLANDER,PERCENT PACIFIC ISLANDER,COUNT HISPANIC LATINO,PERCENT HISPANIC LATINO,COUNT AMERICAN INDIAN,PERCENT AMERICAN INDIAN,COUNT ASIAN NON HISPANIC,PERCENT ASIAN NON HISPANIC,COUNT WHITE NON HISPANIC,PERCENT WHITE NON HISPANIC,...,COUNT ETHNICITY TOTAL,PERCENT ETHNICITY TOTAL,COUNT PERMANENT RESIDENT ALIEN,PERCENT PERMANENT RESIDENT ALIEN,COUNT US CITIZEN,PERCENT US CITIZEN,COUNT OTHER CITIZEN STATUS,PERCENT OTHER CITIZEN STATUS,COUNT CITIZEN STATUS UNKNOWN,PERCENT CITIZEN STATUS UNKNOWN,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
0,10001,44,22,0.5,22,0.5,0,0,44,100,0,0.0,16,0.36,0,0.0,3,0.07,1,0.02,...,44,100,2,0.05,42,0.95,0,0.0,0,0,44,100,20,0.45,24,0.55,0,0,44,100
1,10002,35,19,0.54,16,0.46,0,0,35,100,0,0.0,1,0.03,0,0.0,28,0.8,6,0.17,...,35,100,2,0.06,33,0.94,0,0.0,0,0,35,100,2,0.06,33,0.94,0,0,35,100
2,10003,1,1,1.0,0,0.0,0,0,1,100,0,0.0,0,0.0,0,0.0,1,1.0,0,0.0,...,1,100,0,0.0,1,1.0,0,0.0,0,0,1,100,0,0.0,1,1.0,0,0,1,100
3,10004,0,0,0.0,0,0.0,0,0,0,0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,...,0,0,0,0.0,0,0.0,0,0.0,0,0,0,0,0,0.0,0,0.0,0,0,0,0
4,10005,2,2,1.0,0,0.0,0,0,2,100,0,0.0,0,0.0,0,0.0,1,0.5,0,0.0,...,2,100,1,0.5,1,0.5,0,0.0,0,0,2,100,0,0.0,2,1.0,0,0,2,100


### Problem 2
<span  style="color:green; font-size:16px">Reshape the NY demographic data so that it has three variables: JURISDICTION NAME, Gender and Count</span>

In [47]:
df = pd.melt(ny_demo, 
        id_vars='JURISDICTION NAME', 
        value_vars=['COUNT FEMALE', 'COUNT MALE'], 
        var_name=['Gender'],
       value_name = 'Count')
df.head(10)

Unnamed: 0,JURISDICTION NAME,Gender,Count
0,10001,COUNT FEMALE,22
1,10002,COUNT FEMALE,19
2,10003,COUNT FEMALE,1
3,10004,COUNT FEMALE,0
4,10005,COUNT FEMALE,2
5,10006,COUNT FEMALE,2
6,10007,COUNT FEMALE,0
7,10009,COUNT FEMALE,0
8,10010,COUNT FEMALE,0
9,10011,COUNT FEMALE,2


### Problem 3
<span  style="color:green; font-size:16px">Reshape the NY demographic data in the same way you did in problem 2 except with a different command. HINT: If you use stack, put columns that you don't want stacked in the index.</span>

Bonus: If you use stack, use method chaining to rename all the columns correctly.

In [48]:
ny_demo[['JURISDICTION NAME', 'COUNT MALE', 'COUNT FEMALE']].set_index('JURISDICTION NAME')\
                                                            .stack()\
                                                            .reset_index()\
                                                            .head(10)

Unnamed: 0,JURISDICTION NAME,level_1,0
0,10001,COUNT MALE,22
1,10001,COUNT FEMALE,22
2,10002,COUNT MALE,16
3,10002,COUNT FEMALE,19
4,10003,COUNT MALE,0
5,10003,COUNT FEMALE,1
6,10004,COUNT MALE,0
7,10004,COUNT FEMALE,0
8,10005,COUNT MALE,0
9,10005,COUNT FEMALE,2


### Problem 4
<span  style="color:green; font-size:16px">Find a different variable in the columns and tidy that variable by creating another three column DataFrame. Store your resulting DataFrame in **`df_count`**.</span>

In [49]:
df_count = pd.melt(ny_demo, 
        id_vars='JURISDICTION NAME', 
        value_vars=['COUNT PERMANENT RESIDENT ALIEN', 'COUNT US CITIZEN', 'COUNT OTHER CITIZEN STATUS','COUNT CITIZEN STATUS UNKNOWN'], 
        var_name=['Status'],
       value_name = 'Count')
df_count['Status'] = df_count.Status.str.replace('COUNT ', '')
df_count.head()


Unnamed: 0,JURISDICTION NAME,Status,Count
0,10001,PERMANENT RESIDENT ALIEN,2
1,10002,PERMANENT RESIDENT ALIEN,2
2,10003,PERMANENT RESIDENT ALIEN,0
3,10004,PERMANENT RESIDENT ALIEN,0
4,10005,PERMANENT RESIDENT ALIEN,1


### Problem 5
<span  style="color:green; font-size:16px">For the same variable you used in problem 4, create another three column tidy dataset using the percentage column instead of the count. Store your resulting DataFrame in **`df_perc`**.</span>

In [50]:
melted_cols = ['PERCENT PERMANENT RESIDENT ALIEN', 
               'PERCENT US CITIZEN', 
               'PERCENT  OTHER CITIZEN STATUS', 
               'PERCENT CITIZEN STATUS UNKNOWN']

df_perc = pd.melt(ny_demo, 
        id_vars='JURISDICTION NAME', 
        value_vars=melted_cols, 
        var_name=['Status'],
       value_name = 'Percent')
df_perc['Status'] = df_count.Status.str.replace('PERCENT ', '')
df_perc.head()


Unnamed: 0,JURISDICTION NAME,Status,Percent
0,10001,PERMANENT RESIDENT ALIEN,0.05
1,10002,PERMANENT RESIDENT ALIEN,0.06
2,10003,PERMANENT RESIDENT ALIEN,0.0
3,10004,PERMANENT RESIDENT ALIEN,0.0
4,10005,PERMANENT RESIDENT ALIEN,0.5


### Problem 6: Advanced
<span  style="color:green; font-size:16px">Add a **`Percent`** column to **`df_count`** that calculates the percent found in **`df_perc`**. Create an additional column, **`Percent_orig`**, from **`df_perc`** to **`df_count`**. Check that the calculated percentage and original percentage match.</span>

In [89]:
melted_cols = ['PERCENT PERMANENT RESIDENT ALIEN', 
               'PERCENT US CITIZEN', 
               'PERCENT  OTHER CITIZEN STATUS', 
               'PERCENT CITIZEN STATUS UNKNOWN']
df_count = pd.melt(ny_demo, 
        id_vars='JURISDICTION NAME', 
        value_vars=melted_cols, 
        var_name=['Status'],
       value_name = 'Count')
df_count['Status'] = df_count.Status.str.replace('COUNT ', '')


df_perc = pd.melt(ny_demo, 
        id_vars='JURISDICTION NAME', 
        value_vars=melted_cols, 
        var_name=['Status'],
       value_name = 'Percent')
df_perc['Status'] = df_count.Status.str.replace('PERCENT ', '')

def findpercent(s):
    return s/s.sum()

df_count['Percent'] = df_count.groupby(['JURISDICTION NAME'])\
                              .transform(findpercent)
    
# add unique identifiers
df_perc['uid'] = df_perc['JURISDICTION NAME'].astype(str) + df_perc['Status']
df_count['uid'] = df_count['JURISDICTION NAME'].astype(str) + df_count['Status']

# set indexes
df_perc.set_index('uid', inplace=True)
df_count.set_index('uid', inplace=True)

print(df_perc.iloc[0], "\n\n",df_count.iloc[0])

JURISDICTION NAME                       10001
Status               PERMANENT RESIDENT ALIEN
Percent                                  0.05
Name: 10001PERMANENT RESIDENT ALIEN, dtype: object 

 JURISDICTION NAME                               10001
Status               PERCENT PERMANENT RESIDENT ALIEN
Count                                            0.05
Percent                                          0.05
Name: 10001PERCENT PERMANENT RESIDENT ALIEN, dtype: object


In [96]:
#matches based on index, hence the statement above
df_count['Percent2'] = df_perc['Percent']
df_count.describe()

Unnamed: 0,JURISDICTION NAME,Count,Percent,Percent2
count,944.0,708.0,315.0,0.0
mean,11127.173729,0.147627,0.333333,
std,1050.756386,0.339487,0.446964,
min,10001.0,0.0,0.0,
25%,10451.75,0.0,0.0,
50%,11216.5,0.0,0.0,
75%,11422.25,0.0,0.94,
max,20459.0,1.0,1.0,


In [79]:
df_perc.head()

Unnamed: 0_level_0,JURISDICTION NAME,Status,Percent
uid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10001PERMANENT RESIDENT ALIEN,10001,PERMANENT RESIDENT ALIEN,0.05
10002PERMANENT RESIDENT ALIEN,10002,PERMANENT RESIDENT ALIEN,0.06
10003PERMANENT RESIDENT ALIEN,10003,PERMANENT RESIDENT ALIEN,0.0
10004PERMANENT RESIDENT ALIEN,10004,PERMANENT RESIDENT ALIEN,0.0
10005PERMANENT RESIDENT ALIEN,10005,PERMANENT RESIDENT ALIEN,0.5


In [None]:
df1 = df_count.drop('Percent2', axis=1).fillna(0)

In [None]:
df2 = df1.sample(frac=.5, replace=False)

In [None]:
df2.shape

In [None]:
df1.shape

In [None]:
df1['Percet2'] = df2['Percent']

In [None]:
df3 = df1.reset_index()

In [None]:
df3.head()

In [None]:
df2.head()

In [None]:
df3.join(df2, on='uid', rsuffix='_2')

In [None]:
df3

In [None]:
df4 = df2.reset_index()

In [None]:
df3.merge(df4, on='uid')

In [None]:
df1.Percet2.notnull().sum()

### Problem 7
<span  style="color:green; font-size:16px">If you use the **`stack`** method on a 10 row, 5 column DataFrame (that has single level indexes), what will be the resulting shape and data structure. Answer this problem first without writing any code. Then confirm it by testing it on a DataFrame.</span>

In [None]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Give the column index and the index a name of the following DataFrame.</span>

In [None]:
df = pd.DataFrame(np.random.rand(2,2))
df

In [None]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">Use `groupby` method (with other reshaping methods) to recreate the same DataFrame produced by `pivot_table` below.</span>

In [None]:
coh = pd.read_csv('data/coh_employee.csv')
# recreate this table
coh_pivot = coh.pivot_table(index='RACE', columns='GENDER', values='BASE_SALARY', aggfunc=np.mean)
coh_pivot

In [None]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">Use the `melt` function to make make the DataFrame `coh_pivot` tidy.</span>

In [None]:
coh_pivot = coh.pivot_table(index=['RACE', 'DEPARTMENT'], 
                            columns='GENDER', 
                            values='BASE_SALARY', 
                            aggfunc=np.mean)\
                .reset_index()\
                .rename_axis(None, axis='columns',)
coh_pivot.head(10)

In [None]:
# your code here

### Problem 11
<span  style="color:green; font-size:16px">Use the `stack` function to make make the DataFrame `coh_pivot` from problem 10 tidy.</span>

In [None]:
# your code here

### Problem 12
<span  style="color:green; font-size:16px">Make the column levels `first` and `second` index levels. Make the index level `two` a column level.</span>

In [None]:
index = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd', 'e']], names=['one', 'two'])
columns = pd.MultiIndex.from_product([['A', 'B'], ['C', 'D']], names=['first', 'second'])
df = pd.DataFrame(np.random.rand(6,4), index=index, columns=columns)
df

In [None]:
# your code here