<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## `pandas` Long Format, Wide Format, Pivot Tables, and Melting
_Instructor: Aymeric Flaisler_
___
<br>

This lesson is all about **transforming data** using `pandas`. Data transformation is the reorganization of your data set's rows and columns into a different, potentially **more useful shape and format**. 

The benefits of transforming your data include **better access to relevant information** and **streamlined data manipulation**. As you become more familiar with data sets and their associated operations, you will develop an intuition and appreciation for when it's better to **work row-wise or column-wise**.

Different data formats are better for different tasks. It takes time and experience to learn the distinctions. But, for now, we'll introduce the **common structures, transformations, and how to apply these transformations**.

### Learning Objectives
- Understand the differences between **long and wide format data**.
- Understand **pivot tables**.
- Practice transforming data between **long and wide** formats.
- Practice creating pivot tables.
- Learn how to avoid **common pitfalls and obstacles** in data transformation with `pandas`.


### Lesson Guide

- [Wide Format Data](#wide_format)
- [Load and Examine the NPAS Data](#load_nerdy)
- [Long Format Data](#long_format)
- [Using `pandas`' `.pivot_table()` Function: Long to Wide Format](#pivot_tables)
- [MultiIndex/Hierarchical Indices in `pandas`](#multiindex)
- [Using `pandas`' `.melt()` Function: Wide to Long Format](#melt)
- [Summarizing Data With `.pivot_table()` and Aggregate  Functions](#pivot_table_summarizing)
- [The Inner Workings of the MultiIndex](#examining_multiindex)
- [Getting Rid of the MultiIndex: "Flattening" Data](#multiindex_to_flat)
- [A Preface: Merging and Joining With Long and Wide Format Data](#merging_joining_preface)
- [`pandas`' `.merge()` function: Joining Long Format vs. Wide Format Data](#pandas_merge)


In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.max_columns', None)

sns.set_style('darkgrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<a id='wide_format'></a>

### Wide Format Data

---

Between "wide" and "long," **wide format data is the more intuitive**. It's also a common format for `.csv` files. You've already viewed multiple data sets in wide format throughout this course.

Wide format data is structured so that:

- Unique IDs, subjects, observations, etc. are represented as **rows**.
- Distinct information categories (**variables**) are represented as columns. In other words, there is a **column for every "variable"** with its own unique values.
- This format can often be a more compact matrix, particularly if little or no information is missing.
- It is **not as useful for SQL-style operations**: It can make it much harder or even impossible to **join tables together on a value**.
- It can be useful in `pandas` when you need to perform operations on variables **across columns**; for example, multiplying columns together to create a new column.
- It is the data format required for statistical modeling (with few exceptions).
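
As a rough illustration of the points above, here is a minimal sketch of a wide format table. The DataFrame and its column names (`width_cm`, `height_cm`, `area_cm2`) are made up for this example and are not part of the survey data.

```python
import pandas as pd

# Hypothetical wide format data: one row per subject, one column per variable
wide = pd.DataFrame({
    'subject_id': [0, 1, 2],
    'width_cm': [2.0, 3.0, 4.0],
    'height_cm': [5.0, 6.0, 7.0],
})

# Working across columns is straightforward in wide format:
# multiply two columns together to create a new column
wide['area_cm2'] = wide['width_cm'] * wide['height_cm']
print(wide)
```

Each subject stays on a single row, and the new variable simply becomes one more column.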

<a id='load_nerdy'></a>

### Load and Examine the "Nerdy Personality Attributes" Data Set

---

This is a pre-cleaned and modified version of the full "Nerdy Personality Attributes" survey, which asked subjects to rate themselves based on questions related to "nerdiness" as well as more general personality traits such as openness and extraversion. Researchers also collected demographic information from the subjects.

You can find the raw data [here](http://personality-testing.info/_rawdata/), along with many other sociological surveys.

In this modified version, for the sake of our example, some of the subjects provided data for the survey but not the demographic variables. Because there are missing values and the data is "messy," we have a data cleaning problem.

**Load the data (which is in wide format).** 

In [2]:
nerdy_wide_f = './datasets/NPAS_parsed_trunc_wide_missing.csv'

# load data and print the dimensions
nerdy_wide = pd.read_csv(nerdy_wide_f)
print(nerdy_wide.shape)

This data set is in a familiar format in which each column is a variable and each row contains an observation for that variable, corresponding to a distinct subject.

*Wide format implies that all of the information for one distinct subject **will be represented in the columns corresponding to that row**. A single subject should not be represented in multiple rows of data.*

In [3]:
# First let's print the header:


**Check to see how many null values there are per column.**

*Tip:* An easy way is to chain the `.isnull()` method with `.sum()`.

In [None]:
# Now let's count the null values by column:

# ..... .isnull().sum()

The 691 missing demographic variables are intentional (I specifically enforced that only 700 of the subjects would have demographic information).

However, we can see that the `major` variable has 970 missing values. This was not an intentional change.

At this point, if we were to just **drop all the rows that have any null values, we would lose at least 970 rows** because of the missing `major` variable.

With a numeric column, this would be hard to avoid without "imputing" some number to fill in those values. In the simplest case, **imputing the mean or median for missing numeric values** is a common fix (but not ideal).

With a **categorical variable** like `major`, we have the luxury of replacing the missing values with a new category label that stands for "missing." 
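
The two strategies described above can be sketched on a toy DataFrame. This is only an illustration under assumed data: the `toy` DataFrame and its values are invented, and the same pattern would need to be adapted to the real `nerdy_wide` columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric and one categorical column
toy = pd.DataFrame({
    'age': [22.0, np.nan, 50.0, 31.0],
    'major': ['biology', np.nan, 'biophysics', np.nan],
})

# Numeric column: fill missing values with the median of the observed values
toy['age'] = toy['age'].fillna(toy['age'].median())

# Categorical column: build a boolean mask, then assign a placeholder label with .loc
missing_major = toy['major'].isnull()
toy.loc[missing_major, 'major'] = 'unknown'

# Count the remaining null values per column
print(toy.isnull().sum())
```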

**Replace the missing `major` column values with `unknown`.**

In [4]:
# first create a mask for the missing values in the major column:


# set missing values in major to "unknown":


In [5]:
# if all goes right you should not have any missing values left
print(nerdy_wide.major.isnull().sum())

970


<a id='long_format'></a>

### Long Format Data

---

Now, we can load the same data — this time in the format commonly called "long."

Long format data is structured so that:

- There are potentially multiple `ID` (identification) columns.
- There are pairs of columns such as `variable:value` that match a variable key to a value (in the simplest case, there is a single `variable` column and a single `value` column).
- The `variable` column corresponds to the multiple variable columns in a wide format data set. Instead of a column for each variable, you have a row for each `variable:value` pair *per ID*. 
- This is a standard format for SQL databases because it makes it easier to join different tables together with keys.
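
To make the structure concrete, here is a small sketch of a long format table. The `long_toy` DataFrame is invented for illustration; only the column names `subject_id`, `variable`, and `value` follow the lesson's data.

```python
import pandas as pd

# Hypothetical long format data: one row per subject_id/variable pair
long_toy = pd.DataFrame({
    'subject_id': [0, 0, 1, 1],
    'variable': ['anxious', 'calm', 'anxious', 'calm'],
    'value': [1.0, 7.0, 4.0, 6.0],
})

# Each subject now spans several rows, one per variable
rows_per_subject = long_toy.groupby('subject_id').size()
print(rows_per_subject)

# Selecting one variable means filtering rows rather than picking a column
anxious = long_toy[long_toy['variable'] == 'anxious']
print(anxious)
```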

**Load the long format of the same data below.**

In [6]:
nerdy_long_f = './datasets/NPAS_parsed_trunc_long_missing.csv'

# load long data and print the dimensions
nerdy_long = pd.read_csv(nerdy_long_f)
print(nerdy_long.shape)

You can see that the long format data has far more rows than the wide data set but only three columns.

Below you can view the three columns: `subject_id`, `variable`, and `value`.

**`subject_id:`**
- This is the primary "key" or `ID` column. Each `subject_id` appears in multiple rows, one for each entry in the `variable` column.

**`variable:`**
- This column names the variable that the item in the `value` column belongs to.

**`value:`**

- This contains all values for all variables for all IDs. Essentially, every cell in the wide data set except the `subject_id` is listed in this column.

In [None]:
# print the header:


**Print out the unique values in the `variable` column.**

You can see that the unique values in the `variable` column correspond to the column headers in the wide format data.

*Tips: use the .unique() method*

In [None]:
# print the unique values in the variable column:


In [None]:
# count the unique subject ids:


**Replace the missing values in `major` with `unknown` in the long format data set.**

The process for replacing data will be different because of the format. Using logical selection masks with `pandas`' `.loc` syntax is the preferable way to do this.
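
One possible shape for such a mask is sketched below on invented data. The key point is that, in long format, the mask has to combine two conditions: the row must describe the variable of interest, and its value must be missing.

```python
import numpy as np
import pandas as pd

# Hypothetical long format data with missing values for two different variables
long_toy = pd.DataFrame({
    'subject_id': [0, 0, 1, 1],
    'variable': ['major', 'age', 'major', 'age'],
    'value': [np.nan, np.nan, 'biology', '22'],
})

# Both conditions must hold: the row is about `major` AND its value is missing
mask = (long_toy['variable'] == 'major') & long_toy['value'].isnull()

# Assign the placeholder label only to the rows selected by the mask
long_toy.loc[mask, 'value'] = 'unknown'
print(long_toy)
```

The missing `age` value is left untouched because its row fails the first condition.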

In [20]:
# Identify the missing values in major:
sum(nerdy_long.value.isnull())

0

In [None]:
# replace the missing values for major in the long dataset with "unknown":


In [10]:
# check that there are no missing values left:
print(nerdy_long[nerdy_long.variable == 'major'].isnull().sum())

# you should get only 0s

subject_id      0
variable        0
value         279
dtype: int64


<a id='pivot_tables'></a>

### Using `pandas`' `.pivot_table()` Function: Long to Wide Format

---

The `pd.pivot_table()` function is a powerful tool for both transforming data from long to wide format as well as summarizing data with user-supplied functions.

First, we'll look at transforming the long format data back into the wide format using the `.pivot_table()` function.

**Important parameters for the `.pivot_table()` function include:**

> The `pivot_table()` function takes a DataFrame to pivot as its first argument. 
    
- **`columns`**: This is the list of columns in the long format data to be transformed back into columns in the wide format. After pivoting, each unique value in the long format column becomes a header in the wide format.
- **`values`**: A single column indicating the values to use when pivoting and filling the new wide format columns.
- **`index`**: Columns in the long format data that are index variables. These will be left as single columns, not spread out by unique value like in the `columns` parameter.
- **`aggfunc`**: Often `.pivot_table()` is used to perform a summary of the data. `aggfunc` stands for "aggregation function." It's optional and defaults to the mean. You can also supply your own function, which we'll demonstrate below.
- **`fill_value`**: If a cell is missing for the wide format data, this value will fill it in.
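
Here is a minimal sketch of how these parameters fit together, using a made-up long format DataFrame. It passes the string `'first'` as `aggfunc` (an assumption for this toy example, where every `subject_id`/`variable` pair has exactly one value) rather than a custom function.

```python
import pandas as pd

# Hypothetical long format data: one value per subject_id/variable pair
long_toy = pd.DataFrame({
    'subject_id': [0, 0, 1, 1],
    'variable': ['anxious', 'calm', 'anxious', 'calm'],
    'value': [1.0, 7.0, 4.0, 6.0],
})

wide_toy = pd.pivot_table(
    long_toy,
    index='subject_id',    # stays as the row identifier
    columns='variable',    # each unique value becomes a column header
    values='value',        # fills the cells of the new columns
    aggfunc='first',       # keep the single value found in each group
)
print(wide_toy)
```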
    
Next, we'll pass our own function, `select_item_or_nan()`, to the `aggfunc` keyword argument. Because each `subject_id` has a single value for each variable, I just want the single element in the long format value cell. My data is messy, so I have to write a function to check for places it could break.

**Note:** Passed into my function, `x` will be a Series object. I pull out the first element of that using the `.iloc` indexer.

### Let's make sure `value` has only numeric values:

*Note: The lambda operator or lambda function is used for creating small, one-time, anonymous function objects in Python. It is not the subject of this lesson. We will cover it at a later stage. Do not worry about understanding it for now.*

In [13]:
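def is_float(s):
    # Return True if s can be converted to a float, False otherwise
    try:
        float(s)
        return True
    except (ValueError, TypeError):
        # ValueError: non-numeric string; TypeError: non-convertible type such as None
        return False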

In [16]:
# mask with true or false if we can convert to a numerical value
mask = nerdy_long.value.map(lambda x: is_float(x))

#### Now remove non-numeric values using the mask

In [17]:
nerdy_long_only_num = nerdy_long[mask].copy()
nerdy_long_only_num.reset_index(drop=True, inplace=True)

In [19]:
nerdy_long_only_num.shape

(69874, 3)

#### Convert the column `value`  from the dataframe `nerdy_long_only_num` to float

In [None]:
# A:
# nerdy_long_only_num['value'] = ...

#### Finally pivot the data on subject_id and variable using .pivot()

In [None]:
nerdy_long_only_num.pivot(index='subject_id',columns='variable')

### Now with all the data and .pivot_table()

In [None]:
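# First we define a custom function that returns a value if it exists and NaN if not
def select_item_or_nan(x):
    # x is a Series; take its first (and here only) element
    item = x.iloc[0]
    # Treat an empty string as missing; numeric items have no len(), so check the type first
    if isinstance(item, str) and len(item) == 0:
        return np.nan
    return item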
    

In [None]:
# This will take a few seconds to run.
nerdy_wide_pv = pd.pivot_table(nerdy_long, columns=['variable'], values='value',
                            index=['subject_id'], aggfunc=select_item_or_nan , fill_value=np.nan)
# 'pv' for 'pivot version.'
nerdy_wide_pv.head()

<a id='multiindex'></a>

### MultiIndex/Hierarchical Indices in `pandas`

---

First, let's reload a fresh copy of the data:

In [33]:
# let's reload the data
nerdy_long = pd.read_csv('./datasets/nerdy_long.csv')

In [34]:
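def select_item_or_nan(x):
    # x is a Series; take its first (and here only) element
    item = x.iloc[0]
    # Treat an empty string as missing; numeric items have no len(), so check the type first
    if isinstance(item, str) and len(item) == 0:
        return np.nan
    return item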

In [35]:
# This will take a few seconds to run.
nerdy_wide_pv = pd.pivot_table(nerdy_long, columns=['variable'], values='value',
                            index=['subject_id'], aggfunc=select_item_or_nan , fill_value=np.nan)


In the header, you can see that the format of the new wide data **is *not* the same as our originally loaded wide format**. `pandas` implements something called **MultiIndexing** or **hierarchical indexing**, which allows for "tiered" row and column labels.

Right now the MultiIndexing is not terrible but can get **confusing and annoying**, which we will experience later in this lesson.

The main difference is that we now have a `variable` name in the top left corner, which is **"labeling"** our columns (and corresponds to the name of our original column in the long format data). The row indexer has become our **single key/ID variable**, `subject_id`. The columns are what we would expect here: **Each one is a variable**, like in the original wide format data.

In [32]:
# print the header of the widened dataset
nerdy_wide_pv.head()

variable,academic_over_social,age,anxious,bookish,books_over_parties,calm,collect_books,conventional,critical,dependable,diagnosed_autistic,disorganized,education,engnat,enjoy_learning,excited_about_research,extraverted,familysize,gender,hand,hobbies_over_people,in_advanced_classes,intelligence_over_appearance,interested_science,introspective,libraries_over_publicspace,like_dry_topics,like_hard_material,like_science_fiction,like_superheroes,major,married,online_over_inperson,opennness,play_many_videogames,playes_rpgs,prefer_fictional_people,race_arab,race_asian,race_black,race_hispanic,race_native_american,race_native_austrailian,race_nerdy,race_white,read_tech_reports,religion,reserved,socially_awkward,strange_person,sympathetic,urban,voted,was_odd_child,watch_science_shows,writing_novel
subject_id
0,5.0,,1.0,5.0,5.0,7.0,5.0,1.0,1.0,7.0,2.0,1.0,,,5.0,5.0,1.0,,,,4.0,5.0,3.0,3.0,5.0,5.0,3.0,5.0,5.0,5.0,,,4.0,6.0,5.0,3.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,,7.0,5.0,5.0,7.0,,,5.0,5.0,3.0
1,2.0,50.0,4.0,4.0,4.0,6.0,5.0,1.0,3.0,5.0,2.0,5.0,4.0,1.0,3.0,4.0,5.0,3.0,2.0,1.0,1.0,4.0,3.0,4.0,3.0,5.0,1.0,3.0,4.0,4.0,biophysics,1.0,3.0,5.0,1.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,1.0,5.0,5.0,4.0,5.0,2.0,1.0,3.0,5.0,1.0
2,5.0,22.0,7.0,5.0,5.0,2.0,5.0,1.0,6.0,3.0,2.0,6.0,3.0,2.0,5.0,5.0,1.0,2.0,2.0,1.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,4.0,5.0,3.0,biology,1.0,5.0,6.0,5.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,1.0,7.0,5.0,5.0,2.0,1.0,1.0,5.0,5.0,4.0
3,5.0,,4.0,4.0,5.0,7.0,5.0,1.0,2.0,7.0,2.0,1.0,,,1.0,5.0,7.0,,,,5.0,5.0,5.0,5.0,2.0,5.0,3.0,5.0,5.0,5.0,,,5.0,7.0,5.0,5.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,,2.0,5.0,5.0,6.0,,,5.0,5.0,4.0
4,4.0,,3.0,5.0,5.0,6.0,4.0,2.0,5.0,4.0,2.0,5.0,,,4.0,4.0,2.0,,,,4.0,4.0,4.0,4.0,1.0,4.0,4.0,5.0,4.0,4.0,,,5.0,7.0,3.0,4.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,,6.0,0.0,5.0,5.0,,,5.0,4.0,1.0


**Drop the null values from our recreated wide format data. How many unique subjects do we have?**

Remember our `subject_id` is now the **index**, so we can access it using the `.index` attribute.

In [None]:
# drop the null values and count unique subjects

**Convert the `subject_id` index back into a column (i.e., reset the index without dropping it).**

We can use the DataFrame function `.reset_index()` to move `subject_id` into a column and create a new index. We now have a DataFrame with the same format we loaded the original wide format data in previously. The only exception is that we still have the `variable` column label.

In [None]:
# convert the index to a column


**Remove the column label.**

You can remove the column label (which can be confusing during print statements) by setting the `.columns.name` attribute to `None`.
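
The two clean-up steps (moving the index back into a column and removing the column label) can be sketched on a toy pivoted table. The data below is invented, and `.pivot()` is used here only to produce a small table with a named index and named columns.

```python
import pandas as pd

# Hypothetical long format data
long_toy = pd.DataFrame({
    'subject_id': [0, 0, 1, 1],
    'variable': ['anxious', 'calm', 'anxious', 'calm'],
    'value': [1.0, 7.0, 4.0, 6.0],
})

# Pivoting leaves subject_id as the index and labels the columns 'variable'
pivoted = long_toy.pivot(index='subject_id', columns='variable', values='value')
print(pivoted.columns.name)

# Move subject_id back into a regular column and create a fresh integer index
flat = pivoted.reset_index()

# Remove the leftover 'variable' label on the columns
flat.columns.name = None
print(flat)
```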

In [None]:
# remove the columns label


<a id='melt'></a>

### Using `pandas`' `.melt()` Function: Wide to Long Format

---
First, let's reload a fresh copy of the data:

In [37]:
nerdy_wide_flat = pd.read_csv('./datasets/nerdy_wide_flat.csv')

**`.melt()`** is a function that essentially performs the inverse of `.pivot_table()` on DataFrames.

`.melt()` takes a DataFrame as its first argument. Additional arguments typically used with this function are:

- **`id_vars`**: The column or columns that will be ID variables. These columns are kept as they are and repeated on every row, identifying which observation each `variable:value` pair belongs to.
- **`value_vars`**: A list that specifies which columns should be converted into single `value` and `variable` columns.
- **`var_name`**: The header name of the `variable` column (default='variable').
- **`value_name`**: The header name of the `value` column (default='value').
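
A small sketch of these four arguments in use is shown below. The `wide_toy` DataFrame and the names `question` and `score` are made up for the example.

```python
import pandas as pd

# Hypothetical wide format data
wide_toy = pd.DataFrame({
    'subject_id': [0, 1],
    'major': ['biology', 'physics'],
    'anxious': [1.0, 4.0],
    'calm': [7.0, 6.0],
})

long_toy = pd.melt(
    wide_toy,
    id_vars=['subject_id', 'major'],   # kept as columns, repeated per melted row
    value_vars=['anxious', 'calm'],    # columns to stack into variable/value pairs
    var_name='question',               # header for the former column names
    value_name='score',                # header for the former cell values
)
print(long_toy)
```

Two subjects and two melted columns give four rows in the long result.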

**First, subset the wide format data into just columns: `['subject_id','anxious','bookish','calm','major']`.**

In [None]:
# subset the wide data:
# nerdy_subset = nerdy_wide_flat[...??]

**Use `.melt()` on the subset with `id_vars=['subject_id','major']`.**

Print out the shape of the data and the header. The non-ID columns and their values are now represented by the `variable:value` column pair.

**Note**: When you only specify the `id_vars`, the remaining columns become part of the `variable` and `value` columns.

In [39]:
# nerdy_sub_long = pd.melt(nerdy_subset, id_vars=['subject_id','major'])


If we don't specify `major` as an `id_var`, it will end up in the `variable` column.

In [None]:
### with only subject_id as an id_var (major ends up in the variable column)
# nerdy_sub_long = pd.melt(nerdy_subset, id_vars='subject_id')
# print(nerdy_subset.shape, nerdy_sub_long.shape)
# nerdy_sub_long.head(4)

In [None]:
### with explicit value_vars
# nerdy_sub_long = pd.melt(nerdy_wide_flat, id_vars=['subject_id','major'], 
#                          value_vars=['anxious','bookish','calm'])
# print(nerdy_wide_flat.shape, nerdy_sub_long.shape)
# nerdy_sub_long.head(4)

The more columns we specify as `id_vars`, the fewer columns get melted: the long DataFrame keeps more columns and has fewer rows.

You can achieve the same result without having to subset the DataFrame first by simply specifying the `value_vars` keyword argument. The output DataFrame will then only contain the data specified in the `id_vars` and `value_vars` arguments.
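
The sketch below compares the two routes on invented data: subsetting first and then melting, versus melting the full table with `value_vars`. It also shows what happens when `major` is left out of `id_vars`.

```python
import pandas as pd

# Hypothetical wide format data
wide_toy = pd.DataFrame({
    'subject_id': [0, 1],
    'major': ['biology', 'physics'],
    'anxious': [1.0, 4.0],
    'bookish': [5.0, 4.0],
})

# Route 1: subset the columns first, then melt everything that is not an id_var
subset = wide_toy[['subject_id', 'major', 'anxious']]
via_subset = pd.melt(subset, id_vars=['subject_id', 'major'])

# Route 2: melt the full table and pick the columns with value_vars
via_value_vars = pd.melt(wide_toy, id_vars=['subject_id', 'major'],
                         value_vars=['anxious'])
print(via_subset.shape, via_value_vars.shape)

# With only subject_id as an id_var, major is melted into variable/value too
only_id = pd.melt(subset, id_vars='subject_id')
print(only_id)
```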

**Create the same DataFrame with `.melt()` on the full wide data set, but select the columns to use with the `value_vars` argument.**

In [None]:
# nerdy_sub_long = pd.melt(nerdy_wide_flat, id_vars=['subject_id','major'], 
#                          value_vars=['anxious','bookish','calm'])

In [None]:
# print the datatypes

The `value` column is still a string, so we can convert it to a float.

In [None]:
# ensure the value is a float

<a id='pivot_table_summarizing'></a>

### Summarizing Your Data With  `.pivot_table()` and Aggregate Functions

---
First, let's reload a fresh copy of the data:

In [41]:
nerdy_sub_long = pd.read_csv('./datasets/nerdy_sub_long.csv')

For those of you who have experience with Excel, `pandas`' `.pivot_table()` accomplishes the same thing as an Excel pivot table. It's more powerful but harder to use than the spreadsheet version.

`.pivot_table()` can take in a variable, value, and index to group by and apply aggregate functions to summarize the data. 

**Note**: Be careful that your index variable is not pulling out unique rows. For example, indexing by `subject_id` would leave only one value per variable to send into the aggregate functions.

Below, I am calling the `.pivot_table()` function with:

- The long format data as the first argument.
- `variable` specified as the columns that indicate the variable names (groups).
- `value` specified as the column that contains the data per variable.
- `major` as the index; the rows will be grouped by `major`.
- `np.mean`, `np.median`, `np.std`, and `len` as aggregate functions. These will be calculated for each `major-by-variable` group.
- A `fill_value` of `np.nan` for cells in the output table that have no data.
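
The same kind of call can be sketched on a tiny invented data set. As an assumption for this example, the aggregate functions are passed as the strings `'mean'` and `'count'` rather than as NumPy functions.

```python
import pandas as pd

# Hypothetical long format data with a grouping column
long_toy = pd.DataFrame({
    'major': ['biology', 'biology', 'physics', 'physics', 'biology', 'physics'],
    'variable': ['anxious', 'anxious', 'anxious', 'anxious', 'calm', 'calm'],
    'value': [1.0, 3.0, 4.0, 6.0, 7.0, 5.0],
})

# One row per major, one column per (aggregate function, variable) pair
summary = pd.pivot_table(
    long_toy,
    index='major',
    columns='variable',
    values='value',
    aggfunc=['mean', 'count'],
)
print(summary)
```

Each cell summarizes all the rows that share a `major` and a `variable`, so groups with several rows are what make the aggregation meaningful.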

In [42]:
nerdy_major_summary = pd.pivot_table(nerdy_sub_long, columns=['variable'], values='value',
                                     index=['major'], aggfunc=[np.mean, np.median, np.std, len],
                                     fill_value=np.nan)

The output DataFrame gives you a "hierarchical" column index — the three variables for each aggregate function. The row index is the `major` groups.

If you apply more index variables, the row indices will also become hierarchical! However, this can quickly make for a bloated DataFrame.

In [45]:
# print the header of the pivot table


<a id='examining_multiindex'></a>

### The Inner Workings of the MultiIndex

--- 

The `.names` attribute on the index and columns will show you the hierarchy of labels. The row index is `'major'`, and the two column indices are `None` and `'variable'` (The aggregate functions get no label from `.pivot_table()` in this case). 

If you print out the columns, you can see that they have become a `pandas` `MultiIndex` object that has levels, labels (renamed `codes` in `pandas` 0.24 and later), and names.

In [43]:
print(nerdy_major_summary.index.names)
print(nerdy_major_summary.columns.names)
print(nerdy_major_summary.columns)

['major']
[None, 'variable']
MultiIndex(levels=[['mean', 'median', 'std', 'len'], ['anxious', 'bookish', 'calm']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=[None, 'variable'])


Indexing along the hierarchical column headers should not be done with chained bracket keys (that is, putting the top-level column label in the first bracket, the next level in a second bracket, and so on, down to the bottom level).

Instead, you should use a tuple-like index with `.loc`:
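
Tuple-based selection on hierarchical columns can be sketched as follows. The `summary` DataFrame here is built by hand with invented numbers, so that the example does not depend on the lesson's files.

```python
import pandas as pd

# Hypothetical two-level column index: (aggregate function, variable)
columns = pd.MultiIndex.from_product(
    [['mean', 'std'], ['anxious', 'calm']], names=[None, 'variable'])
summary = pd.DataFrame(
    [[2.0, 7.0, 1.4, 0.5],
     [5.0, 5.0, 1.0, 0.2]],
    index=pd.Index(['biology', 'physics'], name='major'),
    columns=columns,
)

# A single tuple selects one column and returns a Series
one_col = summary.loc[:, ('mean', 'anxious')]
print(one_col)

# A list in the second position selects several bottom-level columns at once
both = summary.loc[:, ('mean', ['anxious', 'calm'])]
print(both)
```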

In [None]:
# nerdy_major_summary.loc[:,('mean','anxious')].head(2)

In [None]:
# nerdy_major_summary.loc[:,('mean',['anxious','bookish'])].head(2)

<a id='multiindex_to_flat'></a>

### Getting Rid of the MultiIndex: "Flattening" Data

---

MultiIndex DataFrames hold great potential and are a cool concept. That being said, the overhead and confusion on how to subset/mask them is most often not worth it, especially when your data needs to be formatted for insertion into a model.

A reliable way to "flatten" a MultiIndexed DataFrame is with the `.to_records()` method. To make this a new DataFrame, it needs to be wrapped in a `pd.DataFrame()` like so:

In [47]:
# type(nerdy_major_summary.to_records())
# nerdy_major_flat = pd.DataFrame(nerdy_major_summary.to_records())
# nerdy_major_flat.head(2)

You can see that the new column names are string representations of the tuples that made up the hierarchy of MultiIndexed columns. You could convert these to new, more easily indexed column names with something like a list comprehension.
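
For comparison, here is a sketch of a different flattening approach that does not use `.to_records()`: it builds plain string names by joining the levels of each column tuple in a list comprehension. The `summary` DataFrame is a hand-built stand-in with invented numbers.

```python
import pandas as pd

# Hypothetical summary table with two-level columns
columns = pd.MultiIndex.from_product([['mean', 'std'], ['anxious', 'calm']])
summary = pd.DataFrame(
    [[2.0, 7.0, 1.4, 0.5],
     [5.0, 5.0, 1.0, 0.2]],
    index=pd.Index(['biology', 'physics'], name='major'),
    columns=columns,
)

flat = summary.copy()

# Each column label is a tuple such as ('mean', 'anxious'); join it into one string
flat.columns = ['_'.join(levels) for levels in flat.columns]

# Move the major index into an ordinary column
flat = flat.reset_index()
print(flat.columns.tolist())
```

This only works as written when every level label is a string; labels such as `None` or numbers would need to be converted first.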

Python's built-in **`eval()`** function takes a string and tries to evaluate it as a Python expression.

**Use a list comprehension and the `eval()` function to convert the flattened MultiIndexed columns to something more readable.**

In [None]:
# replace the column names with list comprehension and eval


<a id='merging_joining_preface'></a>

### A Preface (independent practice): Merging and Joining With Long and Wide Format Data

---

You will be merging and joining data sets extensively throughout this course and in your future careers. However, it is important to note the differences between merging long and wide data sets together.

**Load in the data used above, but now split it so that the demographic variables are in one data set and the survey question answers are in another.** 

These data sets are in a wide format, and they both contain a `subject_id` column that identifies the subject each row belongs to.

As you may recall, the demographic responses have fewer observations.

In [48]:
n_demos_file = './datasets/NPAS_parsed_trunc_demo_sample.csv'
n_survey_file = './datasets/NPAS_parsed_trunc_survey.csv'
# load the files
demos_subset = pd.read_csv(n_demos_file)
survey = pd.read_csv(n_survey_file)

In [None]:
# print the header of the demos and survey


<a id='pandas_merge'></a>

### Use  `pandas`' `.merge()` function: Joining Long Format vs. Wide Format Data

---

As we saw yesterday, `.merge()` is a built-in DataFrame method. The first argument is the other DataFrame you want to merge with, and the `on` keyword argument is the key(s) by which you want the DataFrames to be "matched."

We are specifying `how='inner'` here, which means that the key must be present in both DataFrames to have the corresponding rows included in the output. Because the demographics data set has fewer `subject_id`s, it will only merge the `subject_id` rows from the survey data set that are also present in the demographics data set.
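
The effect of an inner merge on wide format data can be sketched with two small invented tables, where the demographic table covers fewer subjects than the survey table.

```python
import pandas as pd

# Hypothetical wide format tables sharing a subject_id key
demos = pd.DataFrame({'subject_id': [1, 2], 'age': [50, 22]})
survey = pd.DataFrame({'subject_id': [0, 1, 2, 3], 'anxious': [1, 4, 7, 4]})

# Inner merge: keep only the subject_ids present in both tables
merged = demos.merge(survey, on='subject_id', how='inner')
print(merged)
```

Subjects 0 and 3 are dropped because they have no demographic row, and each remaining subject still occupies exactly one row.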

**Combine the survey and demographic wide format data sets using `.merge()`.**

In [None]:
# demos_survey = demos_subset.merge(survey, on=..?


In [None]:
# print the merged data header


**Convert the demographic and survey data into long format using `.melt()`.**

- For the demographic DataFrame, specify two `id_vars` — `gender` and `subject_id`.
- For the survey DataFrame, only specify `subject_id` for `id_vars`.

In [None]:
# melt the demographic data


In [None]:
# melt the survey data


**Merge the long form data sets together, just like we did previously with the wide format data.**

Here, we will still merge on `subject_id`, using `'inner'` for the `how` argument. Both DataFrames have columns with the same names (`variable` and `value`). We can specify `suffixes=('_survey','_demo')` so that the columns coming from the survey and demographic DataFrames get distinct, descriptive names when they are joined together.
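
A toy version of this long format merge is sketched below with invented data. It is meant to show how the suffixes are applied and how the number of rows grows.

```python
import pandas as pd

# Hypothetical long format tables sharing a subject_id key
survey_long = pd.DataFrame({
    'subject_id': [1, 1, 2],
    'variable': ['anxious', 'calm', 'anxious'],
    'value': [4, 6, 7],
})
demo_long = pd.DataFrame({
    'subject_id': [1, 1],
    'variable': ['age', 'urban'],
    'value': [50, 2],
})

# The left table's columns get the first suffix, the right table's the second
merged = survey_long.merge(demo_long, on='subject_id', how='inner',
                           suffixes=('_survey', '_demo'))
print(merged)
```

Subject 1 has two survey rows and two demographic rows, so the merge produces every pairing of them (four rows). This row multiplication is the main difference from merging the wide format tables.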

In [None]:
# merge the survey and demo data
