# Combining data frames


There are many situations in which you might want to combine separate pandas dataframes into a single dataframe.

Sometimes the dataframes have the same columns and we want to stack them. For example, you might have experiment data from individual participants and want to combine them into one dataframe that has data from all participants. This is usually done using the pandas `concat()` function.

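Other times you might like to combine dataframes that contain different information or columns. For example, we might have one dataframe with behavioral data from a set of participants in an experiment and another with demographic information about the same people. This is usually done using the pandas `merge()` function.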

For many of these examples I'd like to be able to generate fake data for participants in an experiment. To make life simpler (and make code more readable) I have defined a function that makes simulated data for a participant in an experiment.

The function takes in arguments for user id and (optionally) the number of trials they participated in.

The function returns a dataframe with the following columns:

    df = pd.DataFrame({'uid': str(uid),
                       'response': response,
                       'acc': acc,
                       'trial_type': trial_type,
                       'RT': RT
                      })
    


`uid`: a participant id (string)

`response`: the button the person pressed (random integer 1 to 4)

`acc`: whether they got the trial correct (1) or incorrect (0)

`trial_type`: congruent or incongruent, roughly corresponding to easy and hard (string)

`RT`: response time, random number > 0

In [None]:
import pandas as pd
import numpy as np
import numpy.random as npr
import seaborn as sns

In [None]:
def make_participant_data(uid, n_trials=100, cond_diffs=True):
    '''
    Makes fake participant data and returns a dataframe.

            Parameters:
                    uid (str, int): An id for the participant
                    n_trials  (int): Number of trials worth 
                                        of data
                    cond_diffs (boolean): Whether to simulate 
                                            data with differences 
                                            between conditions
                    

            Returns:
                    df (dataframe): A dataframe of simulated data 
                                        with one row for each trial 
                                        and columns:
                                            uid: participant id
                                            trial_type: experimental condition
                                            response: response (1 to 4)
                                            acc: accuracy (1/0)
                                            RT: response time
    '''
    import numpy as np
    import pandas as pd
    from random import shuffle
    
    # Check whether n_trials is divisible by two by checking
    # whether the remainder of n_trials divided by two is 0
    # (even number). If it is not, the assert statement raises
    # an AssertionError with the message below:
    assert n_trials % 2 == 0, 'n_trials must be divisible by two'
    
    
    # get n_trials worth of random integers between 1 and 4 (5 excluded)
    response = np.random.randint(1,5,n_trials)
    
    # get n_trials worth of binary (1 or 0) integer accuracy scores   
    acc = np.random.randint(0,2, n_trials)
    
    # --- trial_type
    # make a list with our possible trial types
    cond_list = ['congruent', 'incongruent']

    # make a list equal in length to the total number of trials:
    # repeat each trial type n_trials/2 times because there
    # are two elements in cond_list
    half = n_trials // 2
    trial_type = [cond_list[0]] * half + [cond_list[1]] * half

    
    # shuffle the list using the shuffle() function from the 
    # random library
    shuffle(trial_type)
    #----------------------------
    
    # get n_trials worth of response time values (all > 0)
    if cond_diffs:
        # use a different mean for each condition (1 for congruent,
        # 2 for incongruent) and draw each trial's RT around the mean
        # for that trial's type, so that RT lines up with the
        # shuffled trial_type list
        means = np.array([cond_list.index(tt) + 1 for tt in trial_type])
        RT = np.abs(np.random.normal(means, 1))
    else:
        RT = np.abs(np.random.normal(1, 1, n_trials))
    
    

    
    # make a dataframe from a dictionary of column names (keys) and
    # data values
    df = pd.DataFrame({'uid': str(uid),
                       'response': response,
                       'acc': acc,
                       'trial_type': trial_type,
                       'RT': RT
                      })
    
    return df

### Concatenating

Pandas provides convenient tools for concatenating individual dataframes together into a single frame. An example of when this can be useful is combining data from individual participants in the same experiment.

Use the `make_participant_data()` function to get dataframes with data for four 'participants' in our 'experiment':

In [None]:
df1 = make_participant_data('sub-18', n_trials=6)
df2 = make_participant_data('sub-19', n_trials=6)
df3 = make_participant_data('sub-33', n_trials=6)
df4 = make_participant_data('sub-08', n_trials=6)

df4.head()

The code in the last cell worked, but when we have the same code repeated several times in a row it's often better to put it in a for loop.

### Exercise 1: Make individual dataframes

For each participant in the list `all_subs = ['sub-18','sub-19', 'sub-33', 'sub-08']` use the `make_participant_data()` function with n_trials = 100 to make individual dataframes. The individual dataframes should be stored in a list called `df_list`.

In [None]:
# make a list of dataframes with participant data
all_subs = ['sub-18','sub-19', 'sub-33', 'sub-08']
df_list = []

# fill in code here:

    


Now we have a set of dataframes stored in their own variables. Each dataframe has data for a series of trials in the experiment. Rows correspond to trials and columns are different kinds of data we measured. 

If we want a dataframe that has all the data from all the participants we can use `pd.concat()` to stack them vertically.

`pd.concat()` takes a list of dataframes as input:

In [None]:
# can do it this way if individual dataframes are stored in 
# their own variables:

pd.concat([df1, df2, df3, df4])

In [None]:
# or the equivalent if you already have a list of dataframes
# the ignore_index=True will ignore the index (row labels)
# in the original dataframes and just renumber everything
# 0 to n-1
all_participants = pd.concat(df_list, ignore_index=True)

all_participants
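To make the effect of `ignore_index=True` concrete, here is a minimal sketch using two tiny hand-made dataframes. The names `a`, `b`, `kept`, and `renumbered` are hypothetical stand-ins chosen for illustration, not part of the notebook's data.

```python
import pandas as pd

# two small hypothetical 'participant' dataframes, two trials each
a = pd.DataFrame({'uid': 'sub-18', 'acc': [1, 0]})
b = pd.DataFrame({'uid': 'sub-19', 'acc': [0, 1]})

# default: each input keeps its own row labels, so labels repeat
kept = pd.concat([a, b])

# ignore_index=True: row labels are renumbered 0 to n-1
renumbered = pd.concat([a, b], ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Repeated row labels are not an error, but they can make label-based selection return more rows than expected, which is one reason to renumber.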


Now we have a dataframe with all the participants in it. We can use the pandas `unique()` function to check which values are in the uid column of our new dataframe:

In [None]:
pd.unique(all_participants['uid'])
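A related check is counting how many rows each participant contributed, which can catch a participant that was accidentally added twice. The sketch below uses a small hypothetical dataframe called `toy` in place of `all_participants`.

```python
import pandas as pd

# hypothetical stand-in for all_participants: three trials for
# one participant and two for another
toy = pd.DataFrame({'uid': ['sub-18'] * 3 + ['sub-19'] * 2,
                    'acc': [1, 0, 1, 1, 1]})

# which participant ids are present?
ids = pd.unique(toy['uid'])

# how many rows (trials) does each participant have?
counts = toy['uid'].value_counts()

print(ids)
print(counts)
```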

In previous sections we used `.groupby()` to take different views of our data by grouping rows according to the values in a column.

For example, let's check the average of the columns in the dataframe after separating the data according to trial_type:

In [None]:
# check the top of the dataframe and we see two trial_types
all_participants.head()

In [None]:
# or use unique to verify all the different trial_types
all_participants['trial_type'].unique()

In [None]:
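# get the average accuracy and response time for 
# each trial type. numeric_only=True skips columns that
# hold strings (like uid); recent versions of pandas raise
# an error if asked for the mean of a string column
all_participants.groupby('trial_type').mean(numeric_only=True)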

Notice that the output did not include the 'mean' of the uid column. This is good, because that column holds strings, and the average of 'sub-18', 'sub-19', 'sub-33', etc. is not meaningful.

Restrict the previous view to just the 'RT' column of the data:

In [None]:
all_participants.groupby('trial_type')['RT'].mean()

Or get lots of descriptive stats for the 'RT' column:

In [None]:
all_participants.groupby('trial_type')['RT'].describe()
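If only a few of those statistics are needed, the groupby `.agg()` method accepts a list of statistic names. This is a minimal sketch on a hypothetical four-row dataframe (`toy`), with RT values chosen by hand so the results are easy to verify.

```python
import pandas as pd

# hypothetical data: two trials per trial type
toy = pd.DataFrame({'trial_type': ['congruent', 'congruent',
                                   'incongruent', 'incongruent'],
                    'RT': [1.0, 2.0, 3.0, 5.0]})

# ask for just the mean and the number of trials in each group
stats = toy.groupby('trial_type')['RT'].agg(['mean', 'count'])
print(stats)
```

The result has one row per trial type and one column per requested statistic.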

### Exercise: Get mean RT for each participant in `all_participants` dataframe

Earlier in this notebook you saw an example of getting mean RT for each trial_type. Now do it for each participant id.

In [None]:
all_participants.head()

In [None]:
# get mean response time (RT column) for each person
# in the uid column:


### Exercise: make a bar graph with percent correct (acc column) as the y value and a separate bar for each participant in the all_participants dataframe

**hint**: use seaborn `catplot()`. Catplot with `kind='bar'` plots the average of whatever we pass as `y=`. In our data the acc column is binary 1/0, so its average is the proportion correct: the fraction of entries that are equal to 1 (multiply by 100 for percent).
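The arithmetic behind that hint can be checked on a tiny example. The values in `acc` below are made up for illustration.

```python
import pandas as pd

# hypothetical accuracy scores for four trials: three correct, one not
acc = pd.Series([1, 0, 1, 1])

# the mean of a 1/0 column is the fraction of 1s
prop_correct = acc.mean()
pct_correct = 100 * prop_correct

print(prop_correct)  # 0.75
print(pct_correct)   # 75.0
```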

In [None]:
# make a plot for percent correct for each person:

We could keep on analyzing the data, but the main thing from this section is to notice how using pd.concat() and inputting a list of dataframes makes a new dataframe with all the input df's stacked up vertically. If they have the same columns the result makes sense.

You can also use pd.concat() if df's have different columns:

In [None]:
all_participants.head(1)

In [None]:
# make a different dataframe with some other columns
df99 = pd.DataFrame({'uid': 'sub-99', 
                     'response_time': np.random.uniform(1000,5000, 6)})
df99.head(1)

Use pd.concat() to stack df1 and df99:

In [None]:
df_list = [df1, df99]

combined_df = pd.concat(df_list, ignore_index=True)
combined_df

As you can see, if the df's to be concatenated don't have the same columns, the output has all of the columns that appeared in any of the input list. Any rows of the data that come from a dataframe that didn't have that column get NaN (not a number) in that cell.


So in our example sub-18 has no values for 'response_time' because df1 didn't have that column, and sub-99 has no data for response, acc, trial_type, and RT because df99 didn't have those columns.
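If the NaN-filled columns are not wanted, `pd.concat()` also accepts `join='inner'`, which keeps only the columns shared by every input. The following is a minimal sketch with two hypothetical dataframes (`a` and `b`) that mimic the shapes of df1 and df99.

```python
import pandas as pd

# hypothetical stand-ins: the two frames share only the uid column
a = pd.DataFrame({'uid': ['sub-18', 'sub-18'], 'RT': [0.9, 1.4]})
b = pd.DataFrame({'uid': ['sub-99', 'sub-99'],
                  'response_time': [1200.0, 3400.0]})

# default (join='outer'): every column is kept, gaps become NaN
outer = pd.concat([a, b], ignore_index=True)

# join='inner': only columns present in all inputs are kept
inner = pd.concat([a, b], ignore_index=True, join='inner')

print(list(outer.columns))  # ['uid', 'RT', 'response_time']
print(list(inner.columns))  # ['uid']
```

Whether to keep or drop the unshared columns depends on the analysis; the inner join discards data, so the default is usually the safer starting point.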

## Merging dataframes

The preceding examples showed __concatenating__ or glueing together some dataframes.

Another common need is to combine dataframes that have different information for the same person.

In the next examples we will use pd.merge() to combine 'experiment data' from our participants that we already made with 'demographics' data for the same people.

The next cell defines a function that takes in a participant id (uid) and returns a dataframe that randomly assigns them to a location (urban, rural) and an age_group (child, adult).

In [None]:
# convenience function to get random demographic info for a person
def make_demo_data(uid):
   
    import random
    
    locations = ['urban','rural']
    age_group = ['child', 'adult']
    
    demo_df = pd.DataFrame({'uid': [uid],
                            'location': [random.choice(locations)],
                            'age_group': [random.choice(age_group)]
                           })
    return demo_df
    

In [None]:
df101 = make_demo_data('sub-19')
df101

In [None]:
# make new experimental data and combine the 
# individual participant data into a group dataframe
# this uses our make_participant_data() that we defined
# at the beginning of this notebook
df1 = make_participant_data('sub-18', n_trials=56)
df2 = make_participant_data('sub-19', n_trials=56)
df3 = make_participant_data('sub-33', n_trials=56)
df4 = make_participant_data('sub-08', n_trials=56)

# use pd.concat to put them together
all_exp_data = pd.concat([df1, df2, df3, df4], ignore_index=True)
all_exp_data.head()



In [None]:
# make new demographic data and combine the 
# individual participant data into a group dataframe
# use our make_demo_data() function
df101 = make_demo_data('sub-18')
df102 = make_demo_data('sub-19')
df103 = make_demo_data('sub-33')
df104 = make_demo_data('sub-08')

all_demo_data = pd.concat([df101, df102, df103, df104], ignore_index=True)

all_demo_data


Now we have the experimental data for each person in one dataframe and the demographics in another. What if we want to do some analyses that involve grouping the behavioral data observations according to demographic data?

Instead of using `pd.concat()` we will use the `pd.merge()` function. First let's run it:

In [None]:
combined_df = pd.merge(left=all_exp_data, 
                       right=all_demo_data)
combined_df

Our output now has all the columns we want, and we have the demographic data values lined up with the experimental data for each person listed in the uid column.

When we used merge() we gave it two inputs, `left=` and `right=`. These inputs corresponded to the dataframes we want to merge.

By default, merge() looks for any columns with the same name in the two input dataframes. Then it takes those columns and lines up the dataframes according to the values in them.
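Here is a minimal sketch of that default behavior on two small hypothetical dataframes (`exp` and `demo`). The only column name they share is `uid`, so that is the column used to line up the rows.

```python
import pandas as pd

# hypothetical experiment data: several rows per participant
exp = pd.DataFrame({'uid': ['sub-18', 'sub-18', 'sub-19'],
                    'RT': [0.8, 1.1, 1.6]})

# hypothetical demographics: one row per participant
demo = pd.DataFrame({'uid': ['sub-18', 'sub-19'],
                     'location': ['urban', 'rural']})

# no on= argument: merge uses the shared column name ('uid')
merged = pd.merge(left=exp, right=demo)
print(merged)
```

Each participant's single demographics row is repeated for every one of their trials, which is why the merged result has as many rows as `exp`.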

We can also specify which column values should be matched by including the `on=` argument to merge():

In [None]:
combined_df = pd.merge(left=all_exp_data, 
                       right=all_demo_data, 
                       on='uid')
combined_df.head()
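One detail to be aware of: by default `pd.merge()` keeps only rows whose key value appears in both dataframes. The sketch below, using hypothetical dataframes `exp` and `demo`, shows a participant with no demographics row being dropped, and how `how='left'` keeps them instead.

```python
import pandas as pd

# hypothetical data: sub-33 has experiment data but no demographics
exp = pd.DataFrame({'uid': ['sub-18', 'sub-19', 'sub-33'],
                    'RT': [0.8, 1.1, 1.6]})
demo = pd.DataFrame({'uid': ['sub-18', 'sub-19'],
                     'location': ['urban', 'rural']})

# default (how='inner'): sub-33 is dropped from the result
inner = pd.merge(left=exp, right=demo, on='uid')

# how='left': every row of the left dataframe is kept and
# missing demographic values are filled with NaN
left = pd.merge(left=exp, right=demo, on='uid', how='left')

print(len(inner))  # 2
print(len(left))   # 3
```

Comparing the number of rows before and after a merge is a quick way to notice silently dropped participants.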

The `on=` argument only works when the key column has the same name in both dataframes. Sometimes the same underlying data is stored under different column names, and then we need a different approach. Here we will rename the 'uid' column in all_demo_data and see what happens.

In [None]:
all_demo_data

In [None]:
# dataframe rename function takes a dictionary mapping old names (keys) to new names (values)
# inplace=True means change the existing dataframe (rather than outputting to a new variable)

all_demo_data.rename(columns={'uid': 'ID_num', 'location': 'loc'}, 
                     inplace=True)
all_demo_data.head()

Try merging two dataframes that don't have overlapping column names:


In [None]:
# check the column names in each dataframe:
print(all_exp_data.columns)
print(all_demo_data.columns)

In [None]:
combined_df = pd.merge(left=all_exp_data, right=all_demo_data)
combined_df

That gave us an error and the error said "No common columns to perform merge on."

The solution is to tell `pd.merge()` which column in the left dataframe and which column in the right dataframe to use for the merge:

In [None]:
# here we tell pd.merge to use 'uid' column in the 
# left= dataframe and line those values up with the
# ID_num column in the right= dataframe
combined_df = pd.merge(left=all_exp_data, 
                       right=all_demo_data, 
                       left_on='uid', 
                       right_on='ID_num')
combined_df

Now we see that it worked, lining up rows according to matching values in 'uid' and 'ID_num' columns in the two dataframes. It also kept both columns by default in the combined_df. We can drop one of those if we want:

In [None]:
del combined_df['ID_num']
combined_df.head()
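Two other ways to end up with a single key column are sketched below on hypothetical dataframes `exp` and `demo`: dropping the extra column with `.drop(columns=...)`, or renaming the key in one dataframe before merging so that `on=` can be used.

```python
import pandas as pd

# hypothetical data with differently named key columns
exp = pd.DataFrame({'uid': ['sub-18', 'sub-19'], 'RT': [0.8, 1.1]})
demo = pd.DataFrame({'ID_num': ['sub-18', 'sub-19'],
                     'location': ['urban', 'rural']})

# option 1: merge on the two names, then drop the duplicate key
dropped = pd.merge(left=exp, right=demo,
                   left_on='uid', right_on='ID_num').drop(columns='ID_num')

# option 2: rename the key in a copy of demo, then merge with on=
# (rename without inplace=True returns a new dataframe)
renamed = pd.merge(left=exp,
                   right=demo.rename(columns={'ID_num': 'uid'}),
                   on='uid')

print(list(dropped.columns))  # ['uid', 'RT', 'location']
print(list(renamed.columns))  # ['uid', 'RT', 'location']
```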

### Exercise: practice merging dataframes

The next cell defines another dataframe that gives us some information about people in age_groups child and adult.

Try to merge the `age_df` with our existing `combined_df` so that we have the `expected_value` for each age_group in the combined_df. In age_df the relevant column is called 'ages':



In [None]:
age_df = pd.DataFrame({'ages': ['child', 'adult'],
         'expected_value': [0,99]})
age_df.head(1)

In [None]:
# take a look at the top of the combined_df
combined_df.head(3)

In [None]:
# your code for merging combined_df and age_df goes here
# they should be merged using the 'age_group' column
# in combined_df and the 'ages' column in the age_df
# Store the results of the merge in a variable
# called combined_df


In [None]:
# take a look at the merged dataframe:
combined_df

## Using .query() to get subsets of dataframes

We can use the dataframe `.query()` method to get portions of a dataframe according to some conditions. 

The `.query()` method helps us achieve things similar to what we did with boolean indexing previously, using things like:

    df[df['column_name'] == 'some value']
    
Some people find the query() syntax easier to deal with (although note the use of single and double quotes to handle strings). It is good to be familiar with both approaches.


Here we use query() to get the parts of a single participant dataframe where trial_type is "incongruent".

In [None]:
df_incong = df1.query('trial_type == "incongruent"')
df_incong.head(1)

The syntax is straightforward. Put the conditions to be met inside of query() and you'll get back any rows that match those conditions. Because the input to query() is a string, we had to put additional quotes around "incongruent" so that it would be interpreted as a string.
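As a quick check that the two approaches select the same rows, this sketch runs boolean indexing and query() side by side on a small hypothetical dataframe (`toy`).

```python
import pandas as pd

# hypothetical three-trial dataframe
toy = pd.DataFrame({'trial_type': ['congruent', 'incongruent',
                                   'incongruent'],
                    'acc': [1, 0, 1]})

# boolean indexing: build a True/False mask, then index with it
by_mask = toy[toy['trial_type'] == 'incongruent']

# query: the same condition written as a string
by_query = toy.query('trial_type == "incongruent"')

print(by_mask.equals(by_query))  # True
```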

Multiple conditions can be combined with `and`:

In [None]:
df1.query('trial_type == "incongruent" and acc==1').head()
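For comparison, the boolean-indexing version of a two-condition selection needs `&` instead of `and`, with each condition wrapped in parentheses. A minimal sketch on a hypothetical dataframe (`toy`):

```python
import pandas as pd

# hypothetical four-trial dataframe
toy = pd.DataFrame({'trial_type': ['congruent', 'incongruent',
                                   'incongruent', 'congruent'],
                    'acc': [1, 0, 1, 1]})

# query version: conditions joined with 'and'
by_query = toy.query('trial_type == "incongruent" and acc == 1')

# boolean indexing version: '&' plus parentheses around each
# condition (without them, & is evaluated before ==)
by_mask = toy[(toy['trial_type'] == 'incongruent') & (toy['acc'] == 1)]

print(by_query.equals(by_mask))  # True
```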

If we want to store some of the values to check in a variable, we precede the variable name with the '@' symbol in the query:

In [None]:
tt = 'incongruent'

df1.query('trial_type == @tt and acc == 0').head()
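The '@' syntax also works with a list, combined with the `in` keyword, to keep rows whose value matches any item in the list. This sketch uses a hypothetical dataframe `toy` and a hypothetical list `keep`.

```python
import pandas as pd

# hypothetical dataframe with one row per participant
toy = pd.DataFrame({'uid': ['sub-18', 'sub-19', 'sub-33'],
                    'acc': [1, 0, 1]})

# participants we want to keep
keep = ['sub-18', 'sub-33']

# rows where uid is one of the values in keep and acc is 1
subset = toy.query('uid in @keep and acc == 1')
print(subset)
```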

### Exercise: query on multiple conditions

Try using the query() syntax on the combined_df to pull out only the data where age_group is child, trial_type is congruent, and acc is 1.


In [None]:
combined_df['age_group'].unique()

In [None]:
# your query goes here:
# note that if you randomly end up with only 
# adults in your dataframe this query will be empty
tt = 'congruent'
a = 'child'
accuracy = 1
combined_df.query('trial_type==@tt and age_group==@a and acc==@accuracy')


END