In [None]:
import pandas as pd
import numpy as np
import numpy.random as npr
import seaborn as sns

# Combining data frames

This notebook reviews _concatenating_ dataframes (stacking them vertically) and introduces _merging_ dataframes, which combines two dataframes whose rows are linked by the values in one or more of their columns.

### Goals of this notebook:

Introduce tools that make it quick to combine two dataframes like these:

| sub_id | exp_condition | accuracy |
| --- | --- | --- |
| sub-73 | group_2 | .843 |
| sub-43 | group_2 | .343 |
| sub-81 | group_1 | .897 |



| sub_id | age_group | location |
| --- | --- | --- |
| sub-73 | child | NYC |
| sub-43 | child | NYC |
| sub-81 | adolescent | BOS |


So that we have all of our data together in a frame like this:

| sub_id | exp_condition | accuracy | age_group | location |
| --- | --- | --- | --- | --- |
| sub-73 | group_2 | .843 | child | NYC |
| sub-43 | group_2 | .343 | child | NYC |
| sub-81 | group_1 | .897 | adolescent | BOS |

### Concatenating versus Merging

**Combining dataframes vertically, or _concatenating_ (adding rows)**

Sometimes you want to 'stack' dataframes that have (mostly) the same columns.

A common use for this is when you have experiment data from individual participants and you want to combine them into a dataframe that has data from all participants. 

This is usually done using the pandas `concat()` function which we have seen already. 

It lets us go from separate dataframes like this:


| sub_id | task | accuracy |
| --- | --- | --- |
| sub-113 | working_memory | .34 |
| sub-113 | word_generation | .28 |

| sub_id | task | accuracy |
| --- | --- | --- |
| sub-045 | working_memory | .63 |
| sub-045 | word_generation | .56 |
| sub-045 | attention_control | .72 |


To a single frame like this:

| sub_id | task | accuracy |
| --- | --- | --- |
| sub-113 | working_memory | .34 |
| sub-113 | word_generation | .28 |
| sub-045 | working_memory | .63 |
| sub-045 | word_generation | .56 |
| sub-045 | attention_control | .72 |
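
As a minimal sketch of the stacking shown in the tables above, the following builds the two small frames by hand and concatenates them. The variable names `sub_113`, `sub_045`, and `stacked` are chosen here for illustration only.

```python
import pandas as pd

# Two per-participant frames matching the tables above
sub_113 = pd.DataFrame({
    'sub_id': ['sub-113', 'sub-113'],
    'task': ['working_memory', 'word_generation'],
    'accuracy': [0.34, 0.28],
})
sub_045 = pd.DataFrame({
    'sub_id': ['sub-045', 'sub-045', 'sub-045'],
    'task': ['working_memory', 'word_generation', 'attention_control'],
    'accuracy': [0.63, 0.56, 0.72],
})

# Stack the rows; ignore_index=True renumbers the rows 0 to n-1
stacked = pd.concat([sub_113, sub_045], ignore_index=True)
print(stacked)
```

The two inputs do not need the same number of rows; they only need compatible columns for the result to be easy to read.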



#### A function for simulating data

Throughout this notebook we will want to generate simulated data for participants in an experiment.

To simplify that and make code more readable I have defined a function that takes in some arguments and gives back simulated data for a participant in an experiment.

In [None]:
def make_participant_data(uid, 
                          n_trials = 100, 
                          cond_diffs = True, 
                          response_opts = [1,2,3,4], 
                          exp_conditions = ['easy', 'hard']):
    '''
    Makes fake participant data and returns a dataframe.

            Parameters:
                    uid (str, int): An id for the participant
                    n_trials  (int): Number of trials worth 
                                        of data in each exp_condition
                                        
                    cond_diffs (boolean): Whether to simulate 
                                            data with differences 
                                            between conditions
                                            
                    response_opts (list): the possible response keys that
                                            the participant can press
                    
                    
                    exp_conditions (list): The names of various experimental 
                                            conditions. If cond_diffs is True, then
                                            these trial types will vary in their 
                                            mean response time

            Returns:
                    df (dataframe): A dataframe of simulated data 
                                        with one row for each trial 
                                        and these columns:
                                            uid: participant id
                                            trial_type: experimental condition
                                            response: response key pressed
                                            acc: accuracy (1/0)
                                            RT: response time
    '''
     
    # total number of experimental conditions:
    n_conditions = len(exp_conditions)
    
    # Loop over "trials of the experiment"
    # -----
    # In each loop, take one experimental condition and generate 
    # key presses (response), accuracy (acc), and response time for 
    # each of n_trials
    
    # keep track of exp condition for each trial:
    trial_types = []
    
    # keep track of the key pressed on each trial
    trial_responses = []
    
    # keep track of response time on each trial
    trial_rts = []
    
    # keep track of accuracy on each trial
    trial_acc = []
    

    # Loop over each experimental_condition. Enumerate returns two values: 
    #     first is the index on each loop, second is the actual value
    for trial_type_idx, trial_type in enumerate(exp_conditions):
        
        # If there are condition differences in response time set 
        # the mean RT for this condition to be the trial_type_idx+2
        #
        # This will make the first condition have mean RT of 2
        # (0+2), the next will be mean RT of 3 (1+2), etc
        
        if cond_diffs:
            mean_rt = trial_type_idx+2
        else:
            mean_rt = 1
        
        # get n_trials worth random responses from the response_opts list
        response = np.random.choice(response_opts, n_trials)
        
        # get n_trials worth of binary (1 or 0) integer accuracy scores
        acc = np.random.randint(0, 2, n_trials)
        
        # Get n_trials of response times using the mean_rt from above
        # take the absolute value because all response times are positive numbers
        rt = abs(npr.normal(mean_rt, 1, n_trials))

        # make a list n_trials long that tags each of the response, acc, and rts
        # as being of this trial type
        tts = [trial_type]*n_trials
        
        # Store all the info for this loop of the experimental conditions
        # Use extend() rather than append() because the things we are adding
        # to lists are arrays and we want to add the individual values
        # to the overall trial_* lists
        
        trial_types.extend(tts)
        trial_responses.extend(response)
        trial_rts.extend(rt)
        trial_acc.extend(acc)
    
    # After looping over the trial types and getting trial-level simulated
    # data, make a dataframe from a dictionary of column names (keys) and
    # data values
    trial_df = pd.DataFrame({'uid': str(uid),
                       'response': trial_responses,
                       'acc': trial_acc,
                       'trial_type': trial_types,
                       'RT': trial_rts
                      })
    
    # return a dataframe with all the data in it
    return trial_df

#### Using the data simulation function


Once we've defined the function we can execute all that code in a single line.

At a minimum the function takes in a user id. 

In that case the default arguments will be used to generate 100 trials' worth of data in each of an easy and a hard experimental condition, and the response times will be longer in the hard condition (because `cond_diffs=True`).

```python
make_participant_data(uid, 
                      n_trials = 100, 
                      cond_diffs = True, 
                      response_opts = [1,2,3,4], 
                      exp_conditions = ['easy', 'hard'])
```



**Use the function to make simulated data for one person with 50 trials per condition**

**Plot the response times in the two trial types (exp_conditions)**

**Make some data with three experimental conditions and no response time differences**

#### Make individual dataframes

For each participant in the list `all_subs = ['sub-18','sub-19', 'sub-33', 'sub-08']` use the `make_participant_data()` function with n_trials = 100 to make individual dataframes. The individual dataframes should be stored in a list called `df_list`.

In [None]:
# make a list of dataframes with participant data
all_subs = ['sub-18','sub-19', 'sub-33', 'sub-08']
df_list = []

# fill in code here:
for sub_id in all_subs:
    df_list.append(make_participant_data(sub_id))


Each dataframe has data for one person for a series of trials in the experiment. Rows correspond to trials and columns are different kinds of data we measured. 

### Using `pd.concat()` to combine or stack dataframes vertically



`pd.concat()` takes a list of dataframes as input and returns a dataframe that combines all of the input dataframes vertically.

In [None]:
# can do it this way if individual dataframes are stored in 
# their own variables (here we unpack the four frames in df_list)
df1, df2, df3, df4 = df_list
pd.concat([df1, df2, df3, df4], ignore_index=True)

# the ignore_index=True will ignore the index (row labels)
# in the original dataframes and just renumber everything
# 0 to n-1

In [None]:
# Or use this code, which does the same thing if you
# already have a list of dataframes
all_participants = pd.concat(df_list, ignore_index=True)

all_participants


##### Use dataframe['col'].unique() to see which subject-ids are in the combined dataframe
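
A small sketch of what `.unique()` returns, using a hand-made frame (`toy` is a made-up name, not part of the notebook's data). With the notebook's data the equivalent call would be `all_participants['uid'].unique()`.

```python
import pandas as pd

# A tiny stand-in for the combined dataframe
toy = pd.DataFrame({
    'uid': ['sub-18', 'sub-18', 'sub-19', 'sub-33', 'sub-19'],
    'acc': [1, 0, 1, 1, 0],
})

# unique() returns each distinct value once, in order of first appearance
ids = toy['uid'].unique()
print(ids)
```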

#### Get average data values for each participant or trial type

In previous sections we used `.groupby()` to take different views on our data by combining them according to different values in the dataframe.

For example, let's check the average of the columns in the dataframe after separating the data according to trial_type:

In [None]:
# use unique to remind us of the different trial_types
all_participants['trial_type'].unique()

In [None]:
# get the average accuracy and response time for 
# each trial type:
all_participants.groupby('trial_type', as_index=False)[['acc', 'RT']].mean()

Or get lots of descriptive stats for the 'RT' column:

In [None]:
all_participants.groupby('trial_type')['RT'].describe()

#### Get mean RT for easy and hard trials for each participant in `all_participants` dataframe

Now get the average response time for each trial type within each participant id.
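
One possible approach, sketched on a small hand-made frame rather than the simulated data: passing a list of column names to `.groupby()` splits the rows by every combination of those columns. The name `toy_trials` and its values are invented for illustration.

```python
import pandas as pd

# Two participants, two trials per trial type
toy_trials = pd.DataFrame({
    'uid': ['sub-18', 'sub-18', 'sub-18', 'sub-18',
            'sub-19', 'sub-19', 'sub-19', 'sub-19'],
    'trial_type': ['easy', 'easy', 'hard', 'hard',
                   'easy', 'easy', 'hard', 'hard'],
    'RT': [1.0, 3.0, 2.0, 4.0, 2.0, 2.0, 5.0, 3.0],
})

# Group by participant AND trial type, then average the RT column
mean_rt = toy_trials.groupby(['uid', 'trial_type'], as_index=False)['RT'].mean()
print(mean_rt)
```

The result has one row per participant and trial type, so two participants with two trial types give four rows.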

In [None]:
# get mean response time (RT column) for each person
# in the uid column separated by trial type:


#### Make a bar graph with percent correct (acc column) as the y value and a separate bar for each participant in the all_participants dataframe

**hint**: use seaborn catplot(). Catplot with kind=bar will give us the average of whatever we put in the y= value. 

In our data the acc column is binary 1/0 and the average of that will be the percent correct, or proportion of entries that are equal to 1.
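
To see why the mean of a 1/0 column is the proportion correct, here is a short numeric check on invented data (`toy_acc` is a hypothetical name). These per-participant means are the bar heights that a bar-type catplot would draw with its default settings.

```python
import pandas as pd

# sub-18 gets 3 of 4 trials correct, sub-19 gets 1 of 4
toy_acc = pd.DataFrame({
    'uid': ['sub-18'] * 4 + ['sub-19'] * 4,
    'acc': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of the binary column per participant = proportion of 1s
prop_correct = toy_acc.groupby('uid')['acc'].mean()
print(prop_correct)  # sub-18 -> 0.75, sub-19 -> 0.25
```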

In [None]:
# make a plot for percent correct for each person:


We could keep on analyzing the data, but the main thing from this section is to notice how using pd.concat() and inputting a list of dataframes makes a new dataframe with all the input df's stacked up vertically. If they have the same columns the result makes sense.

You can also use pd.concat() if df's have different columns:

In [None]:
# make some simulated trial data
df1 = make_participant_data('sub-99')
df1.head(2)

In [None]:
# make a different dataframe with some other columns
df99 = pd.DataFrame({'uid': 'sub-88', 
                     'response_time': np.random.uniform(1000,5000, 6)})
df99.head(2)

Use pd.concat() to stack df1 and df99:

In [None]:
df_list = [df1, df99]

combined_df = pd.concat(df_list, ignore_index=True)
combined_df

If the df's to be concatenated don't have the same columns, the output has all of the columns that appeared in any of the input dataframes.

Any rows of the data that come from a dataframe that didn't have that column get NaN (not a number) in that cell.


## Merging dataframes

The preceding examples showed __concatenating__, which glues dataframes together top to bottom.

Another common need is to combine dataframes that have different information for the same person or other unit.

We want to go from these two frames:


| sub_id | exp_condition | accuracy |
| --- | --- | --- |
| sub-73 | group_2 | .843 |
| sub-43 | group_2 | .343 |
| sub-81 | group_1 | .897 |



| sub_id | age_group | location |
| --- | --- | --- |
| sub-73 | child | NYC |
| sub-43 | child | NYC |
| sub-81 | adolescent | BOS |


To one like this, where information is linked based on the sub_id value:

| sub_id | exp_condition | accuracy | age_group | location |
| --- | --- | --- | --- | --- |
| sub-73 | group_2 | .843 | child | NYC |
| sub-43 | group_2 | .343 | child | NYC |
| sub-81 | group_1 | .897 | adolescent | BOS |



In the next examples we will use `pd.merge()` to combine 'experiment data' from our participants that we already made with 'demographics' data for the same people.

The next cell defines a function that takes in a participant id (uid) and returns a dataframe that randomly assigns them to a location (urban, rural) and an age_group (child, adult).

In [None]:
# convenience function to get random demographic info for a person
def make_demo_data(uid, 
                   locations= ['urban', 'rural'], 
                   age_group = ['child', 'adult']):
      
        
    # make dataframe using a randomly selected value from the locations and
    # age_group lists
    # np.random.choice() takes in an array of strings or numbers and returns 
    # one of them selected at random
    demo_df = pd.DataFrame({'uid': [uid],
                            'location': [np.random.choice(locations)],
                            'age_group': [np.random.choice(age_group)]
                           })
    return demo_df
    

In [None]:
# use it like this for subject with id sub-101:


In [None]:
# make some simulated trial data for sub-101 using the 
# make_participant_data function:


#### Do the wrong thing:

Try concatenating the two dataframes we just made.

Concatenating didn't throw away any data but it also didn't result in a very useful dataframe. 

The demographics info for sub-101 sits in its own row, separate from the trial data, instead of being "tidy", where each row has info for all of the relevant variables.

This will become even more clear when we have multiple people each of whom might be child or adult and want to do analyses where we look at trial performance based on demographics grouping. 
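
A sketch of the problem on a hand-made example (the frames `toy_trials` and `toy_demo` and their values are invented, standing in for the output of the two simulation functions):

```python
import pandas as pd

# Trial-level data: one row per trial
toy_trials = pd.DataFrame({
    'uid': ['sub-101', 'sub-101'],
    'acc': [1, 0],
    'RT': [1.8, 2.4],
})

# Demographics: one row per person
toy_demo = pd.DataFrame({
    'uid': ['sub-101'],
    'location': ['urban'],
    'age_group': ['child'],
})

# Concatenating stacks the demographics row underneath the trial rows
stacked = pd.concat([toy_trials, toy_demo], ignore_index=True)
print(stacked)

# Trial rows have no demographics, and the demographics row has no trial data
print(stacked.isna().sum())
```

No trial row knows the participant's location or age group, so grouping trial performance by demographics is not possible with this frame.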

In [None]:
# make new experimental data for four people using the
# make_participant_data() function


# use pd.concat to put them together 


In [1]:
# Use make_demo_data() to get demographic data for the same sub-id's as above 
# and combine them into a group dataframe


### Using `pd.merge()`

To combine our trial data and the demographics info that are in separate dataframes we can use pd.merge() like this:

```python
df = pd.merge(left=dataframe1, right=dataframe2)
```


In [2]:
# use pd.merge() to combine trial and demographics data


Our output now has all the columns we want, and the demographic data values are lined up with the experimental data for each person listed in the uid column.

When we used merge() we gave it two inputs: `left=` and `right=`. 

These inputs corresponded to the dataframes we want to merge.

By default, merge() looks for any columns with the same name in the two input dataframes. Then it takes those columns and lines up the dataframes according to the values in them. 

In our case there was a 'uid' column in both dataframes and so wherever those values overlapped the columns were combined.
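
The default behaviour can be sketched with the small tables from the start of the notebook. The frames `exp_df` and `demo_df` are built by hand here, and the demographics rows are deliberately listed in a different order to show that matching is done by value, not by row position.

```python
import pandas as pd

exp_df = pd.DataFrame({
    'sub_id': ['sub-73', 'sub-43', 'sub-81'],
    'exp_condition': ['group_2', 'group_2', 'group_1'],
    'accuracy': [0.843, 0.343, 0.897],
})

# Same people, but in a different row order
demo_df = pd.DataFrame({
    'sub_id': ['sub-81', 'sub-73', 'sub-43'],
    'age_group': ['adolescent', 'child', 'child'],
    'location': ['BOS', 'NYC', 'NYC'],
})

# 'sub_id' is the only column name the frames share, so merge() uses it
merged = pd.merge(left=exp_df, right=demo_df)
print(merged)
```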

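We can also specify which column values should be matched by including the `on=` argument to merge(), for example `on='uid'`.

Using the `on=` argument is especially useful if the dataframes share several column names but only some of them should be used for matching. If the same underlying data is stored under different column names in the two dataframes, `on=` alone is not enough.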
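
A sketch of a case where `on=` changes the result. The frames below are invented for illustration: both have a `date` column, but the dates mean different things (testing date versus date the demographics form was filled in), so they should not be used for matching.

```python
import pandas as pd

trial = pd.DataFrame({
    'uid': ['sub-18', 'sub-19'],
    'date': ['2024-01-10', '2024-01-11'],   # made-up testing dates
    'acc': [0.8, 0.6],
})
demo = pd.DataFrame({
    'uid': ['sub-18', 'sub-19'],
    'date': ['2023-12-01', '2023-12-02'],   # made-up form dates
    'age_group': ['child', 'adult'],
})

# Default: match on every shared column name ('uid' AND 'date').
# No row agrees on both, so nothing survives the merge.
default_merge = pd.merge(left=trial, right=demo)
print(len(default_merge))

# on='uid': match on uid only; the two date columns are both kept
# and get the suffixes _x (left) and _y (right)
on_uid = pd.merge(left=trial, right=demo, on='uid')
print(on_uid)
```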

Here we will rename the 'uid' and 'location' columns in `demo_data` and see what happens.

In [None]:
# dataframe rename function takes a dictionary mapping old names (keys) to new names (values)
# inplace=True means change the existing dataframe (rather than outputting to a new variable)

demo_data.rename(columns={'uid': 'ID_num', 'location': 'loc'}, 
                     inplace=True)
demo_data.head()

Try merging two dataframes that don't have overlapping column names:


In [None]:
# check the column names in each dataframe:
print(trial_data.columns)
print(demo_data.columns)

In [None]:
# try to merge them


That gave us an error: "No common columns to perform merge on."

The solution is to tell .merge() which columns in the left and right dataframe to merge on, or use for lining up the data.

pd.merge() has optional input arguments called `left_on` and `right_on`. These can be used to give the column names to use (to treat as the same) in the left and right dataframes, respectively.

We'll use it to tell pd.merge() to use the column called 'uid' in one dataframe to line up with the column called 'ID_num' in the other
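
A minimal sketch of `left_on` and `right_on` on hand-made frames (the names `trial`, `demo`, and `combined` and their values are illustrative, not the notebook's variables):

```python
import pandas as pd

# Trial data identifies people in a column called 'uid'
trial = pd.DataFrame({
    'uid': ['sub-18', 'sub-18', 'sub-19'],
    'RT': [1.9, 2.3, 3.1],
})

# Demographics identify the same people in a column called 'ID_num'
demo = pd.DataFrame({
    'ID_num': ['sub-18', 'sub-19'],
    'loc': ['urban', 'rural'],
    'age_group': ['child', 'adult'],
})

# Line up 'uid' in the left frame with 'ID_num' in the right frame
combined = pd.merge(left=trial, right=demo,
                    left_on='uid', right_on='ID_num')
print(combined)
```

Because the key columns have different names, both appear in the output and hold identical values in every row.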

In [None]:
# here we tell pd.merge to use 'uid' column in the 
# left= dataframe and line those values up with the
# ID_num column in the right= dataframe



Now we see that it worked, lining up rows according to matching values in 'uid' and 'ID_num' columns in the two dataframes. 

It also kept both of the original merge columns in the combined dataframe.

We can drop one of those if we want:

In [None]:
del combined_df['ID_num']
combined_df.head()

#### Exercise: practice merging dataframes

The next cell defines another dataframe that gives us some information about people in age_groups child and adult.

In [None]:
age_df = pd.DataFrame({'ages': ['child', 'adult'],
         'expected_value': [0,99]})

age_df.head(1)

Now we'll try to merge the `age_df` with our existing `combined_df` so that we have the `expected_value` for each age_group in the combined_df. 

In age_df the relevant column is called 'ages':
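
Before trying it on the full data, here is a sketch of the same kind of merge on a tiny invented frame (`people` is a hypothetical stand-in for `combined_df`). It shows that a small lookup table like `age_df` is matched to every row with the same age group, so one value can be repeated across many rows.

```python
import pandas as pd

# Stand-in for combined_df: one row per person here, for brevity
people = pd.DataFrame({
    'uid': ['sub-18', 'sub-19', 'sub-33'],
    'age_group': ['child', 'adult', 'child'],
})

# Lookup table with one row per age group
age_df = pd.DataFrame({'ages': ['child', 'adult'],
                       'expected_value': [0, 99]})

# Match 'age_group' on the left with 'ages' on the right
with_expected = pd.merge(left=people, right=age_df,
                         left_on='age_group', right_on='ages')
print(with_expected)
```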



In [None]:
# take a look at the top of the combined_df
# the relevant column is called age_group
combined_df.head(3)

In [4]:
# Merge the combined_df and age_df
# They should be merged using the 'age_group' column
# in combined_df and the 'ages' column in the age_df


In [5]:
# take a look at the merged dataframe:


### Merging on multiple columns

In the last example we defined a little dataframe that had the expected value of something for both children and adults and we merged that into our trial and demographics data so that we had a new column with one value for children and another for adults.

In this example we'll extend that and show how to line up two dataframes based on the values in multiple columns.



In [None]:
# Define a new dataframe that has expected values 
# based on both age and location

# make two lists that line up to make all combos of age and location
ages = ['child', 'adult', 'child', 'adult']
locations = ['urban', 'rural', 'rural', 'urban']

# we have two lists that can be lined up in a dataframe
print(ages)
print(locations)
    

In [None]:
age_loc_df = pd.DataFrame({'ages': ages,
                           'locations': locations,
                           'expected_value': [0, 80, 10, 90]})
                           
age_loc_df

To merge the new expected values into the combined_df so that the value depends on matching both the age **and** the location we can use a **list of columns** as inputs to the `left_on` and `right_on` values in pd.merge()

Checking the mean of the expected_value column after using groupby() with loc and age_group shows that the expected value was merged into the dataframe based on the values in both the age and location columns.
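
Both steps can be sketched as follows. The `people` frame is an invented stand-in for `combined_df` (it assumes the renamed `loc` column and the `age_group` column); `age_loc_df` repeats the definition from the cell above so the example runs on its own.

```python
import pandas as pd

# Stand-in for combined_df with one row per person
people = pd.DataFrame({
    'uid': ['sub-18', 'sub-19', 'sub-33', 'sub-08'],
    'loc': ['urban', 'rural', 'rural', 'urban'],
    'age_group': ['child', 'adult', 'child', 'adult'],
})

age_loc_df = pd.DataFrame({'ages': ['child', 'adult', 'child', 'adult'],
                           'locations': ['urban', 'rural', 'rural', 'urban'],
                           'expected_value': [0, 80, 10, 90]})

# Lists of columns: the first entries are matched with each other,
# and so are the second entries (age_group~ages, loc~locations)
merged = pd.merge(left=people, right=age_loc_df,
                  left_on=['age_group', 'loc'],
                  right_on=['ages', 'locations'])

# Check: one expected value per combination of location and age group
check = merged.groupby(['loc', 'age_group'])['expected_value'].mean()
print(check)
```

The order of the names matters: the two lists must have the same length, and position i in `left_on` is paired with position i in `right_on`.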

### Summary

This notebook reviewed pd.concat() for stacking dataframes vertically and pd.merge() for combining dataframes that are linked based on values in one or more columns.

If you would like to dive deeper into the pandas merge() function I recommend checking out chapter 3, section 7 of the Python Data Science Handbook by Jake VanderPlas:

https://jakevdp.github.io/PythonDataScienceHandbook/

https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html

https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.07-Merge-and-Join.ipynb