<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Pandas Long Format, Wide Format, Pivot Tables, and Melting

_Authors: Kiefer Katovich (SF)_

---

This lesson is all about data transformation using pandas. Data transformation is the reorganization of the rows and columns of your dataset into a different, potentially more useful shape and format. 

The benefits of transforming your data are to have better access to relevant information and to make manipulations of that data more streamlined. As you become more familiar with datasets and operations on them you will grow an intuition and appreciation for when operating row-wise or column-wise on your data is preferable.

Different data formats are better for different tasks. Knowing which one to use takes time and experience, but this lesson serves as an introduction to common structures, common transformations, and how to apply them.

### Learning Objectives
- Understand the differences between long and wide format data
- Understand pivot tables
- Practice transforming data between long and wide format
- Practice creating pivot tables 
- Learn how to avoid common pitfalls and obstacles in data transformation with pandas


### Lesson Guide

- [Wide format data](#wide_format)
- [Load and examine the NPAS data](#load_nerdy)
- [Long format data](#long_format)
- [Using pandas `pivot_table()`: long to wide format](#pivot_tables)
- [Multiindex/Hierarchical indices in pandas](#multiindex)
- [Using pandas `melt()`: wide to long format](#melt)
- [Summarizing data with `pivot_table()`](#pivot_table_summarizing)
- [Inner-workings of the multiindex](#examining_multiindex)
- [Getting rid of hierarchical indices: "flattening" data](#multiindex_to_flat)
- [Merging dataframes: long vs. wide](#pandas_merge)


In [8]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('darkgrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<a id='wide_format'></a>

### "Wide" format data

---

Between "wide" and "long", wide format data is the more intuitive one. It is a common format for .csv type files. You have already viewed multiple datasets in wide format throughout this class.

Wide format data is formatted so that:

- Unique IDs, subjects, observations, etc. are represented as rows.
- Distinct information categories (variables) are represented as columns. In other words, there is a column for every "variable" with its own unique values.
- The format can often be a more compact matrix, particularly if no or little information is missing.
- It is not as useful for SQL-style operations: it can make it much harder or even impossible to join tables together on a value.
- Wide can be more useful in pandas when you need to perform operations on variables **across columns**. For example, multiplying columns together to create a new column.
- It is the data format required for statistical modeling (with few exceptions).

<a id='load_nerdy'></a>

### Load the "Nerdy Personality Attributes" dataset

---

This is a pre-cleaned and modified version of the full "Nerdy Personality Attributes" survey that asked subjects to self-rate on questions related to "nerdiness" as well as more general personality traits such as openness and extraversion. Demographic information on the subjects was also collected.

[You can find the raw data here along with many other sociological surveys.](http://personality-testing.info/_rawdata/)

In this modified version, for the sake of example, some of the subjects have data for the survey but not the demographic variables. Because there are missing values and the data is "messy", we have a data cleaning problem.

**Load the data (which is in wide format).** 

In [13]:
# read_csv with index_col=0 uses the first column (subject_id) as the index
nerdy_wide_f = pd.read_csv('/Users/Mahendra/desktop/GA/hw/3.1.1_pandas-long_wide_pivot_melt-lesson/datasets/NPAS_parsed_trunc_wide_missing.csv', index_col=0)
nerdy_wide_f

# load data and print the dimensions

subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0
5,4.0,18.0,5.0,3.0,4.0,4.0,4.0,4.0,3.0,5.0,...,1.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,5.0,3.0
6,4.0,18.0,1.0,4.0,5.0,6.0,5.0,1.0,1.0,2.0,...,1.0,5.0,5.0,5.0,5.0,2.0,2.0,1.0,4.0,1.0
7,3.0,21.0,7.0,3.0,5.0,1.0,5.0,4.0,6.0,5.0,...,12.0,5.0,5.0,5.0,6.0,2.0,1.0,3.0,3.0,3.0
8,3.0,25.0,5.0,5.0,4.0,3.0,2.0,1.0,1.0,5.0,...,7.0,7.0,3.0,4.0,7.0,2.0,2.0,3.0,0.0,5.0
9,3.0,17.0,6.0,4.0,5.0,5.0,4.0,2.0,6.0,3.0,...,1.0,7.0,5.0,5.0,4.0,2.0,2.0,5.0,5.0,5.0


In [15]:
nerdy_wide_f.shape

(1391, 56)

The dataset is in a familiar format where each column is a variable and each row contains the observations of those variables for a distinct subject.

*Wide format implies that all of the information for one distinct subject will be represented in the columns corresponding to that row. A single subject should not have multiple rows of data.*

In [16]:
# print the header
nerdy_wide_f.head()

subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0


**Check to see how many null values there are per column:**

In [18]:
# count null values by column
nerdy_wide_f.isnull().sum()

academic_over_social              0
age                             691
anxious                           0
bookish                           0
books_over_parties                0
calm                              0
collect_books                     0
conventional                      0
critical                          0
dependable                        0
diagnosed_autistic                0
disorganized                      0
education                       691
engnat                          691
enjoy_learning                    0
excited_about_research            0
extraverted                       0
familysize                      691
gender                          691
hand                            691
hobbies_over_people               0
in_advanced_classes               0
intelligence_over_appearance      0
interested_science                0
introspective                     0
libraries_over_publicspace        0
like_dry_topics                   0
like_hard_material          

The 691 missing demographic variables are intentional (I specifically enforced that only 700 of the subjects would have demographic information).

However, we can see that the `major` variable has 970 missing values. This was not an intentional change.

If we were to just drop all the rows that have any null values at this point, we would lose 970 rows due to the commonly missing variable `major`.

With a numeric column, this would be hard to avoid without "imputing" some number to fill in the values. In the simplest case, missing numeric values are commonly filled with the mean or median (though this is not ideal).

With a **categorical variable**, which `major` is, we have the luxury of replacing the missing values with a new category label that stands for "missing". 

**Replace the missing major column values with "unknown":**

In [23]:
# set missing values in major to "unknown"
nerdy_wide_f.loc[nerdy_wide_f.major.isnull(),'major']='unknown'
nerdy_wide_f.major.isnull().sum()

0

<a id='long_format'></a>

### "Long" format data

---

Now we can load the same data but instead in the format commonly called "long".

Long data is formatted so that:

- There are potentially multiple "id" (identification) columns.
- There are pairs of columns such as `variable:value` that match a variable key to a value (in the simple case, there would be a single `variable` column and a single `value` column).
- The "variable" column corresponds to the multiple variable columns in a wide format dataset. Instead of a column for each variable like in wide, you have a row for each variable:value pair, *per id*. 
- This is a standard format for SQL databases because it makes joining different tables together by keys easier.
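
To make the contrast with wide format concrete, here is a minimal sketch using a made-up table of three subjects. The names `toy_wide` and `toy_long` and all of the values are illustrative assumptions, not part of the NPAS data. The same six values are stored either as one row per subject or as one row per subject-variable pair.

```python
import pandas as pd

# Hypothetical toy data: 3 subjects, 2 variables (not the NPAS dataset).
toy_wide = pd.DataFrame({
    'subject_id': [1, 2, 3],
    'anxious': [4.0, 7.0, 5.0],
    'calm': [6.0, 2.0, 4.0],
})

# The same information in long format: one row per (subject_id, variable) pair.
toy_long = pd.DataFrame({
    'subject_id': [1, 2, 3, 1, 2, 3],
    'variable': ['anxious'] * 3 + ['calm'] * 3,
    'value': [4.0, 7.0, 5.0, 6.0, 2.0, 4.0],
})

print(toy_wide.shape)  # 3 rows, 3 columns
print(toy_long.shape)  # 6 rows, 3 columns
```

Adding another variable to the wide table adds a column, while adding it to the long table adds three more rows.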

**Load the long format of the same data below:**

In [26]:
# read_csv with index_col=0 uses the first column (subject_id) as the index
nerdy_long_f = pd.read_csv('/Users/Mahendra/desktop/GA/hw/3.1.1_pandas-long_wide_pivot_melt-lesson/datasets/NPAS_parsed_trunc_long_missing.csv', index_col=0)

# load long data and print the dimensions
nerdy_long_f 

subject_id,variable,value
1,education,4.0
2,education,3.0
5,education,2.0
6,education,2.0
7,education,2.0
8,education,3.0
9,education,1.0
10,education,2.0
14,education,3.0
15,education,2.0


In [27]:
nerdy_long_f.shape

(70295, 2)

You can see that the long data has far more rows than the wide dataset, but only three columns once `subject_id` is counted (it is the index here, which is why `.shape` reports two).

Below you can view the three columns: `subject_id`, `variable`, and `value`.

**`subject_id:`**
- This is the primary "key" or "id" column. Each subject id will have corresponding entries in the variable column, one for each row.

**`variable:`**
- This column indicates which variable the item in the value column corresponds to.

**`value:`**

- This contains all the values for all of the variables for all ids. Essentially, every cell in the wide dataset except the subject_id is listed in this column.

In [28]:
# print the header
nerdy_long_f.head()

subject_id,variable,value
1,education,4.0
2,education,3.0
5,education,2.0
6,education,2.0
7,education,2.0


In [44]:
# move subject_id out of the index and into a regular column
nerdy_long = nerdy_long_f.reset_index()
nerdy_long

Unnamed: 0,subject_id,variable,value
0,1,education,4.0
1,2,education,3.0
2,5,education,2.0
3,6,education,2.0
4,7,education,2.0
5,8,education,3.0
6,9,education,1.0
7,10,education,2.0
8,14,education,3.0
9,15,education,2.0


**Print out the unique values in the `variable` column.**

You can see that the unique values in the variable column correspond to the column headers of the wide format data.

In [32]:
# print the unique values in the variable column
nerdy_long_f.variable.unique()

array(['education', 'urban', 'gender', 'engnat', 'age', 'hand', 'religion',
       'voted', 'married', 'familysize', 'major', 'race_white',
       'race_nerdy', 'race_native_american', 'writing_novel',
       'read_tech_reports', 'online_over_inperson', 'introspective',
       'hobbies_over_people', 'books_over_parties', 'bookish',
       'libraries_over_publicspace', 'race_native_austrailian',
       'like_hard_material', 'race_hispanic', 'diagnosed_autistic',
       'play_many_videogames', 'race_arab', 'race_asian',
       'interested_science', 'playes_rpgs', 'in_advanced_classes',
       'collect_books', 'intelligence_over_appearance',
       'watch_science_shows', 'academic_over_social',
       'like_science_fiction', 'like_dry_topics', 'race_black', 'calm',
       'disorganized', 'extraverted', 'dependable', 'critical',
       'opennness', 'anxious', 'sympathetic', 'reserved', 'conventional',
       'was_odd_child', 'prefer_fictional_people', 'enjoy_learning',
       'excited_abou

In [46]:
# count the unique subject ids
len(nerdy_long.subject_id.unique())

1391

**Replace the missing values in major with "unknown" in the long format dataset.**

The process for replacing the data will be different due to the format. Using logical selection masks with pandas `.loc` syntax is the preferable way to do this.

In [10]:
# replace the missing values for major in the long dataset with "unknown"
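
One way this could look, sketched on a small made-up long dataframe rather than `nerdy_long` itself (the name `toy_long` and its values are assumptions for illustration). The mask combines two conditions so that only missing values belonging to `major` are replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data with a missing `major` for subject 2.
toy_long = pd.DataFrame({
    'subject_id': [1, 2, 1, 2],
    'variable': ['major', 'major', 'age', 'age'],
    'value': ['biology', np.nan, 22.0, np.nan],
})

# Mask: rows where the variable is "major" AND the value is missing.
missing_major = (toy_long['variable'] == 'major') & toy_long['value'].isnull()

# Assign through .loc so only those cells change; the missing age is untouched.
toy_long.loc[missing_major, 'value'] = 'unknown'
print(toy_long)
```

In wide format the mask only needed to look at one column. In long format the variable name is itself data, so it has to be part of the condition.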

<a id='pivot_tables'></a>

### Pandas `pivot_table()`: going from long to wide format

---

The `pd.pivot_table()` function is a very powerful tool to both transform data from long to wide format as well as summarize data with user-supplied functions.

First we'll look at transforming the long format data back into the wide format using the `pivot_table` function.

**Important parameters for the `pivot_table` function:**

- **`nerdy_long`**: the `pivot_table()` function takes the dataframe to pivot as its first argument.
- **`columns`**: the column or list of columns in the long format data whose unique values become the column headers of the wide format data.
- **`values`**: a single column indicating the values to use when pivoting and filling in the new wide format columns.
- **`index`**: columns in the long format data that are index variables. These are left as single columns rather than spread out across columns by unique value, as happens with the `columns` parameter.
- **`aggfunc`**: often `pivot_table()` is used to perform a summary of the data. `aggfunc` stands for "aggregation function". It is optional and defaults to the mean. You can put your own function in, which I do below.
- **`fill_value`**: if a cell is missing for the wide format data, the value to fill in.
    
Below we pass our own function `select_item_or_nan()` to the `aggfunc` keyword argument. Because each `subject_id` has a single value for each variable, I just want the single element in the long format value cell. My data is messy and so I have to write a function to check for places it can break.

Note: `x` passed into my function will be a series object. I pull out the first element of that with the `.iloc` indexer.

In [57]:
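def select_item_or_nan(x):
    # x is a Series of every value for one subject_id/variable pair
    if len(x) == 0:
        # nothing recorded for this pair
        return np.nan
    # each pair has one value, so take the first element
    return x.iloc[0]

nerdy_wide = pd.pivot_table(nerdy_long, columns=['variable'], values='value',
                            index=['subject_id'], aggfunc=select_item_or_nan,
                            fill_value=np.nan)
nerdy_wide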

variable,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0
5,4.0,18.0,5.0,3.0,4.0,4.0,4.0,4.0,3.0,5.0,...,1.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,5.0,3.0
6,4.0,18.0,1.0,4.0,5.0,6.0,5.0,1.0,1.0,2.0,...,1.0,5.0,5.0,5.0,5.0,2.0,2.0,1.0,4.0,1.0
7,3.0,21.0,7.0,3.0,5.0,1.0,5.0,4.0,6.0,5.0,...,12.0,5.0,5.0,5.0,6.0,2.0,1.0,3.0,3.0,3.0
8,3.0,25.0,5.0,5.0,4.0,3.0,2.0,1.0,1.0,5.0,...,7.0,7.0,3.0,4.0,7.0,2.0,2.0,3.0,0.0,5.0
9,3.0,17.0,6.0,4.0,5.0,5.0,4.0,2.0,6.0,3.0,...,1.0,7.0,5.0,5.0,4.0,2.0,2.0,5.0,5.0,5.0


<a id='multiindex'></a>

### Multiindex/Hierarchical indexing

---

In the header you can see that the format of the wide data is *not* the same as our original loaded wide format. Pandas implements something called **Multiindexing** or **Hierarchical indexing** which allows for "tiered" row and column labels.

Right now the multiindex is not terrible, but this can get very confusing and annoying, which we will see further down in the lesson.

The main difference is that we have a `variable` name in the top left corner, which is "labeling" our columns (and corresponds to the name of our original column in the long format data). The row indexer has become our single key/id variable `subject_id`. The columns are what we would expect here, each one a variable like in the original wide data.

**Drop the null values from our recreated wide data. How many unique subjects do we have?**

Remember our `subject_id` is now the **index**, and so we can access it with the `.index` attribute.

In [58]:
# print the header of the widened dataset
nerdy_wide.head()

variable,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,...,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,...,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,...,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0


In [65]:
# drop the null values and count unique subjects
nerdy_wide.dropna(inplace=True)
nerdy_wide.shape

(421, 56)

In [67]:
len(nerdy_wide.index.unique())

421

**Convert the `subject_id` index back into a column.**

We can use the dataframe function `.reset_index()` to move `subject_id` into a column and create a new index. Now we have the dataframe in the format we got when we loaded the original wide data in before. The only exception is that we still have that "variable" column label.

In [68]:
# convert the index to a column
nerdy_wide.reset_index()

variable,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
1,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
2,5,4.0,18.0,5.0,3.0,4.0,4.0,4.0,4.0,3.0,...,1.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,5.0,3.0
3,8,3.0,25.0,5.0,5.0,4.0,3.0,2.0,1.0,1.0,...,7.0,7.0,3.0,4.0,7.0,2.0,2.0,3.0,0.0,5.0
4,14,2.0,31.0,3.0,1.0,3.0,6.0,4.0,2.0,6.0,...,2.0,4.0,4.0,4.0,4.0,1.0,2.0,3.0,4.0,2.0
5,15,4.0,39.0,2.0,2.0,5.0,5.0,5.0,3.0,1.0,...,2.0,6.0,3.0,4.0,2.0,2.0,2.0,5.0,5.0,1.0
6,22,5.0,20.0,1.0,5.0,5.0,7.0,5.0,6.0,2.0,...,2.0,7.0,4.0,2.0,5.0,3.0,2.0,4.0,5.0,2.0
7,27,4.0,50.0,3.0,5.0,5.0,6.0,4.0,3.0,5.0,...,2.0,6.0,5.0,5.0,2.0,2.0,2.0,4.0,5.0,2.0
8,38,3.0,17.0,7.0,5.0,3.0,3.0,2.0,2.0,5.0,...,4.0,5.0,2.0,5.0,7.0,3.0,2.0,5.0,3.0,1.0
9,45,3.0,15.0,7.0,5.0,3.0,2.0,1.0,6.0,6.0,...,1.0,1.0,2.0,2.0,3.0,2.0,2.0,2.0,1.0,5.0


**Remove the label of the columns.**

You can remove the column label (which can be confusing during print statements) by setting the `.columns.name` attribute to `None`.

In [69]:
# check the current columns label
nerdy_wide.columns.name

'variable'

In [74]:
# remove the columns label
nerdy_wide.columns.name = None
nerdy_wide.reset_index().head(5)


Unnamed: 0,subject_id,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,...,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
0,1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,...,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
1,2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,...,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
2,5,4.0,18.0,5.0,3.0,4.0,4.0,4.0,4.0,3.0,...,1.0,5.0,4.0,5.0,4.0,3.0,2.0,4.0,5.0,3.0
3,8,3.0,25.0,5.0,5.0,4.0,3.0,2.0,1.0,1.0,...,7.0,7.0,3.0,4.0,7.0,2.0,2.0,3.0,0.0,5.0
4,14,2.0,31.0,3.0,1.0,3.0,6.0,4.0,2.0,6.0,...,2.0,4.0,4.0,4.0,4.0,1.0,2.0,3.0,4.0,2.0


<a id='melt'></a>

### Going from wide to long with `.melt()`

---

**`.melt()`** is a function that essentially performs the inverse operation of `pivot_table` on dataframes.

Melt takes a dataframe as its first argument. Additional arguments typically used in the melt function are:

- **`id_vars`**: the column or columns that will be id variables. These are kept as columns and repeated on every row, identifying which subject each variable:value pair belongs to.
- **`value_vars`**: a list that specifies which columns should be converted into a single value column and variable column.
- **`var_name`**: the header name of the variable column (default='variable').
- **`value_name`**: the header name of the value column (default='value').

**First, subset the WIDE data to just columns: `['subject_id','major','anxious','bookish','calm']`.**


In [82]:
# move subject_id into a column and keep only the first 5 rows for a small example
nerdy_wide_f = nerdy_wide_f.reset_index().head(5)


In [84]:
nerdy_subset=nerdy_wide_f[['subject_id','major','anxious','bookish','calm']]
nerdy_subset.head(2)

Unnamed: 0,subject_id,major,anxious,bookish,calm
0,0,unknown,1.0,5.0,7.0
1,1,biophysics,4.0,4.0,6.0


**Use `melt` on the subset with `id_vars=['subject_id','major']`.**

Print out the shape of the data and the header. The non-id columns and their values are now represented by the variable:value column pair.

Note: when you only specify the `id_vars`, the remaining columns are inferred to become part of the variable and value columns.

In [85]:
nerdy_sub_long = pd.melt(nerdy_subset, id_vars=['subject_id','major'])
nerdy_sub_long


Unnamed: 0,subject_id,major,variable,value
0,0,unknown,anxious,1.0
1,1,biophysics,anxious,4.0
2,2,biology,anxious,7.0
3,3,unknown,anxious,4.0
4,4,unknown,anxious,3.0
5,0,unknown,bookish,5.0
6,1,biophysics,bookish,4.0
7,2,biology,bookish,5.0
8,3,unknown,bookish,4.0
9,4,unknown,bookish,5.0


In [87]:
print(nerdy_sub_long.shape)
print(nerdy_subset.shape)

(15, 4)
(5, 5)


You can do the same thing as above without having to subset the dataframe first by simply specifying the `value_vars` keyword argument. The output dataframe will then only contain the data specified in the `id_vars` and `value_vars` arguments.

**Create the same dataframe with melt on the full wide dataset, but select the columns to use with the `value_vars` argument.**

In [91]:
nerdy_sub_long = pd.melt(nerdy_wide_f, id_vars=['subject_id','major'], value_vars=['anxious','bookish','calm'])
nerdy_sub_long.head()

Unnamed: 0,subject_id,major,variable,value
0,0,unknown,anxious,1.0
1,1,biophysics,anxious,4.0
2,2,biology,anxious,7.0
3,3,unknown,anxious,4.0
4,4,unknown,anxious,3.0


In [94]:
print(nerdy_sub_long.shape)
print(nerdy_wide_f.shape)

(15, 4)
(5, 57)


In [96]:
# print the datatypes
nerdy_sub_long.dtypes

subject_id      int64
major          object
variable       object
value         float64
dtype: object

The dtypes above show that the value column is already `float64` here. If it had come through as strings (dtype `object`), we could convert it to float:

In [20]:
# ensure the value is a float
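
A minimal sketch of that conversion, assuming a made-up melted dataframe (`toy_long`) whose values were read in as strings. It is not the lesson's data, only an illustration of the `.astype()` call.

```python
import pandas as pd

# Hypothetical melted data whose values were read in as strings.
toy_long = pd.DataFrame({
    'subject_id': [0, 1],
    'variable': ['anxious', 'anxious'],
    'value': ['1.0', '4.0'],
})

# astype(float) raises an error if any entry cannot be parsed as a number.
toy_long['value'] = toy_long['value'].astype(float)
print(toy_long.dtypes)
```

If some entries might not be numeric, `pd.to_numeric(toy_long['value'], errors='coerce')` is an alternative that turns unparseable entries into `NaN` instead of raising.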

<a id='pivot_table_summarizing'></a>

### Summarizing your data with  `pivot_table` and aggregate functions

---

For those of you experienced with Excel, the pandas pivot table does the same thing as the pivot table in Excel. It's more powerful, but harder to use than the user-friendly spreadsheet version.

Pivot table can take in a variable, value, and an index to group by, and apply aggregate functions to summarize the data.

Note: be careful that your index variable is not pulling out unique rows (for example, subject_id by variable would only have one value to send into the aggregate functions).

Below I am calling the `pivot_table` function with:
- the long format data as the first argument
- "variable" specified as the columns that indicate the variable names (groups)
- "value" specified to be the column that contains the data per variable
- "major" as the index: the rows will be grouped by major
- `np.mean`, `np.median`, `np.std`, and `len` as aggregate functions. These will be calculated for each major-by-variable group
- a fill value of `np.nan`, for cells in the output table that have no data.



In [21]:
# nerdy_major_summary = pd.pivot_table(nerdy_sub_long, columns=['variable'], values='value',
#                                      index=['major'], aggfunc=[np.mean, np.median, np.std, len],
#                                      fill_value=np.nan)

The output dataframe gives you a "hierarchical" column index: the three variables appear under each aggregate function. The row index is the "major" groups.

If you apply more index variables to split by, the row indices will also become hierarchical! It can get bloated fast.
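
As a rough illustration of the resulting shape, here is a sketch on a small made-up long dataframe. The data, the name `toy_summary`, and the choice of `'mean'` and `'count'` (passed as strings) are assumptions made to keep the example short.

```python
import pandas as pd

# Hypothetical long-format data: 4 subjects, 2 majors, 2 survey variables.
toy_long = pd.DataFrame({
    'subject_id': [0, 1, 2, 3] * 2,
    'major': ['biology', 'biology', 'physics', 'physics'] * 2,
    'variable': ['anxious'] * 4 + ['calm'] * 4,
    'value': [1.0, 3.0, 5.0, 7.0, 6.0, 4.0, 2.0, 2.0],
})

# A list of aggregate functions produces one block of columns per function.
toy_summary = pd.pivot_table(toy_long, columns=['variable'], values='value',
                             index=['major'], aggfunc=['mean', 'count'])

print(toy_summary.columns.nlevels)  # two levels: aggregate function, then variable
print(toy_summary)
```

Each cell is now a summary over every subject in that major, not a single subject's answer.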

In [22]:
# print the header of the pivot table

<a id='examining_multiindex'></a>

### Inner-workings of the multiindex

--- 

The `.names` attribute on the index and the columns will show you the hierarchy of labels. The row index is "major", and the two column indices are None and 'variable' (the aggregate functions get no label from pivot table in this case). 

If you print out the columns, you can see it has become a pandas `MultiIndex` object that has levels, labels (called `codes` in newer versions of pandas), and names.

In [23]:
# print(nerdy_major_summary.index.names)
# print(nerdy_major_summary.columns.names)
# print(nerdy_major_summary.columns)

Indexing along the hierarchical column headers can be done with chained bracket keys, with the top level column label in the first bracket down to the bottom level.

In [24]:
# nerdy_major_summary['mean'].head(2)

In [25]:
# nerdy_major_summary['mean']['anxious'].head(2)

In [26]:
# nerdy_major_summary['mean'][['anxious','bookish']].head(2)

In some cases you can just split them up by comma within the brackets.

In [27]:
# nerdy_major_summary['mean', 'bookish'].head(2)
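
The two indexing styles can be compared on a small hand-built table. This is a sketch with made-up numbers: `toy_summary` is constructed directly with `pd.MultiIndex.from_product` instead of coming from `pivot_table`, purely so the example is self-contained.

```python
import pandas as pd

# Hypothetical summary table with two column levels (not the NPAS results).
columns = pd.MultiIndex.from_product([['mean', 'median'], ['anxious', 'bookish']],
                                     names=[None, 'variable'])
toy_summary = pd.DataFrame([[2.0, 4.5, 2.0, 4.5],
                            [6.0, 3.0, 6.0, 3.0]],
                           index=pd.Index(['biology', 'physics'], name='major'),
                           columns=columns)

# Chained brackets: top level first, then the lower level.
chained = toy_summary['mean']['anxious']

# A single tuple key selects the same column in one step.
by_tuple = toy_summary[('mean', 'anxious')]

print(chained.tolist(), by_tuple.tolist())
```

Writing `toy_summary['mean', 'anxious']` is the same as passing the tuple, since Python treats the comma-separated keys inside the brackets as a tuple.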

<a id='multiindex_to_flat'></a>

### Getting rid of the Multiindex: converting back to "flat" data

---

Multiindex dataframes have great potential use and are a cool concept. That being said, the overhead and confusion on how to subset/mask them is most often not worth it, especially when your data needs to be formatted for insertion into a model.

The most reliable way to "flatten" a multi-indexed dataframe is through the `.to_records()` function. To make this a new dataframe, it needs to be wrapped in a `pd.DataFrame()` like so:

In [28]:
# nerdy_major_flat = pd.DataFrame(nerdy_major_summary.to_records())
# nerdy_major_flat.head(2)

You can see that the new column names are the tuples of the hierarchy of the multiindexed columns, stored as strings. You can convert these to new, more easily indexed columns with something like a list comprehension, for example.

The **`eval`** function takes a string and tries to evaluate it as if it were a Python expression.

**Use a list comprehension and the `eval` function to convert the flattened multiindex columns to something more readable:**

In [29]:
# replace the column names with list comprehension and eval
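
As an alternative sketch that skips `.to_records()` and `eval` entirely, the tuples in a `MultiIndex` can be joined into plain strings directly. This is a swapped-in technique, not the approach the exercise asks for, and the table below is made up for illustration (`toy_summary` and `toy_flat` are hypothetical names).

```python
import pandas as pd

# Hypothetical summary table with two column levels (not the NPAS results).
columns = pd.MultiIndex.from_product([['mean', 'median'], ['anxious', 'bookish']])
toy_summary = pd.DataFrame([[2.0, 4.5, 2.0, 4.5],
                            [6.0, 3.0, 6.0, 3.0]],
                           index=pd.Index(['biology', 'physics'], name='major'),
                           columns=columns)

toy_flat = toy_summary.copy()
# Each column label is a tuple such as ('mean', 'anxious'); join it into one string.
toy_flat.columns = ['_'.join(levels) for levels in toy_flat.columns]
# Move `major` out of the index so every label is an ordinary column.
toy_flat = toy_flat.reset_index()
print(toy_flat.columns.tolist())
```

This only works as written when every level label is a string; other label types would need to be converted with `str()` before joining.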

<a id='merging_joining_preface'></a>

### Preface to merging and joining with long and wide data

---

You will be merging and joining datasets extensively throughout this course and your future careers. It is important to note the differences between merging long and wide datasets together.

**Load in the data used above, but now split so that the demographic variables are in one dataset and the survey question answers are in another.** 

These datasets are in wide format, and they both contain `subject_id` to identify who the questions are for. 

As you may recall, the demographic responses have fewer observations.

In [30]:
n_demos_file = './datasets/NPAS_parsed_trunc_demo_sample.csv'
n_survey_file = './datasets/NPAS_parsed_trunc_survey.csv'

# load the files

In [31]:
# print the header of the demos and survey

<a id='pandas_merge'></a>

### Pandas `.merge()` function: joining long format vs. wide format data

---

`.merge()` is a method available on every DataFrame. The first argument is another DataFrame that you want to merge it with, and the `on` keyword argument is the key or keys that you want the DataFrames to be "matched" on.

We are specifying `how='inner'` here, which means that the key must be present in both dataframes for that row to be present in the output. Because the demographics dataset has fewer subject_ids, it will only merge the subject_id rows from the survey dataset that are present in the demographics dataset.
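
A minimal sketch of an inner merge on two made-up wide tables (`toy_demo` and `toy_survey` are hypothetical names and values, not the lesson files). Subject 0 has survey answers but no demographics, so it drops out of the result.

```python
import pandas as pd

# Hypothetical wide tables: demographics exist for only two of three subjects.
toy_demo = pd.DataFrame({'subject_id': [1, 2], 'age': [50, 22]})
toy_survey = pd.DataFrame({'subject_id': [0, 1, 2], 'anxious': [1.0, 4.0, 7.0]})

# how='inner' keeps only subject_ids found in both tables.
toy_merged = toy_demo.merge(toy_survey, on=['subject_id'], how='inner')
print(toy_merged)
```

With wide data each subject appears once in each table, so the merged table still has one row per subject.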

**Merge the survey and demographic wide-format datasets together using `.merge()`:**

In [32]:
# demos_survey = demos_subset.merge(survey, on=['subject_id'], how='inner')

In [33]:
# print the merged data header

**Convert the demographic and survey data into long format using `melt()`.**

- For the demographic dataframe, specify two id_vars, gender and subject_id.
- For the survey dataframe, only specify subject_id for id_vars

In [34]:
# melt the demographic data

In [35]:
# melt the survey data

**Merge together the long form datasets just like we did before with the wide format data.**

Here we will still merge on 'subject_id' with 'inner' for the `how` argument. We have duplicate named columns in each of these dataframes ('variable' and 'value'). We can specify `suffixes=('_survey','_demo')` to give the instances of the survey and demographic dataframes appropriate column names when they are joined together.

In [36]:
# merge the survey and demo data
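
To show what the long-format merge produces, here is a sketch on made-up data for a single subject (the dataframes and their values are hypothetical). Because `subject_id` now repeats in both tables, every survey row for a subject pairs with every demographic row for that subject.

```python
import pandas as pd

# Hypothetical long-format tables for one subject.
toy_demo_long = pd.DataFrame({
    'subject_id': [1, 1],
    'variable': ['age', 'education'],
    'value': [50.0, 4.0],
})
toy_survey_long = pd.DataFrame({
    'subject_id': [1, 1, 1],
    'variable': ['anxious', 'bookish', 'calm'],
    'value': [4.0, 4.0, 6.0],
})

# The suffixes are applied in order: left (survey) first, then right (demo).
toy_merged_long = toy_survey_long.merge(toy_demo_long, on='subject_id',
                                        how='inner', suffixes=('_survey', '_demo'))
print(toy_merged_long.columns.tolist())
print(toy_merged_long.shape)  # 3 survey rows x 2 demo rows = 6 rows
```

This row multiplication is the main difference from the wide merge, and it is worth checking the shape of the output after merging long data.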