# Chapter 2: Concatenating Data

[View this lesson on DataCamp](https://learn.datacamp.com/courses/merging-dataframes-with-pandas)

## Appending pandas Series and DataFrames

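The `.append()` method works on both pandas Series and DataFrames, and stacks the rows of one object underneath those of another. For example, appending the two pandas Series below results in one Series in which the elements of `blue` are stacked underneath the elements of `pink` (since we append `blue` to `pink` in this case). Note that `.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the `.append()` examples in this chapter require an older version; `pd.concat()`, covered below, works in all versions.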

In [1]:
import pandas as pd
pd.set_option('display.max_rows', None)
import numpy as np

In [2]:
pink = pd.Series(['rose', 'fuchsia', 'ruby', 'magenta'])
blue = pd.Series(['turquoise', 'sky blue', 'navy', 'ocean blue'])
pb = pink.append(blue)
pb

0          rose
1       fuchsia
2          ruby
3       magenta
0     turquoise
1      sky blue
2          navy
3    ocean blue
dtype: object

In [3]:
pink

0       rose
1    fuchsia
2       ruby
3    magenta
dtype: object
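For pandas versions in which `.append()` is unavailable, the same stacking can be sketched with `pd.concat()`, which is introduced later in this chapter. The variable name `pb_concat` is only for illustration.

```python
import pandas as pd

pink = pd.Series(['rose', 'fuchsia', 'ruby', 'magenta'])
blue = pd.Series(['turquoise', 'sky blue', 'navy', 'ocean blue'])

# Stack blue underneath pink; the inputs go inside a list
pb_concat = pd.concat([pink, blue])

print(pb_concat.index.tolist())    # the original index labels are kept
print(len(pink), len(pb_concat))   # pink itself is not modified
```

As with `.append()`, the inputs are left untouched and a new Series is returned.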

## Appending Series with nonunique indices

One thing to note here is that the `.append()` method doesn't adjust the index of the Series when it stacks them. You can see this above: the first column (which is the index) goes from zero to three, then from zero to three again. Duplicate index labels can cause problems when working with a Series or DataFrame later on (for example, selecting label `0` would return two elements), so it's a good idea to reset the index.

This can be done using the `.reset_index()` method, which sets the row indexes sequentially from 0:

In [4]:
pb.reset_index()

Unnamed: 0,index,0
0,0,rose
1,1,fuchsia
2,2,ruby
3,3,magenta
4,0,turquoise
5,1,sky blue
6,2,navy
7,3,ocean blue


You'll see that the result is now a DataFrame with an extra column: the new index is the leftmost column, and the original index is now a column called "index". In some cases this might be a helpful historical record, but in many cases it's just annoying. Adding the argument `drop=True` will drop the original index:

In [5]:
pb.reset_index(drop=True)

0          rose
1       fuchsia
2          ruby
3       magenta
4     turquoise
5      sky blue
6          navy
7    ocean blue
dtype: object

<div class="alert alert-block alert-info">
One important "gotcha" with `.reset_index()` — and many pandas DataFrame methods — is that <b>by default they don't actually modify the Series or DataFrame you run them on</b>. 
    
So after running the command above, you might think you reset the index of `pb`, but actually you didn't; instead you just saw the copy that was created by your command, printed as output. Thus when we ask to see `pb` again, the index is unchanged:
</div>

 

In [6]:
pb

0          rose
1       fuchsia
2          ruby
3       magenta
0     turquoise
1      sky blue
2          navy
3    ocean blue
dtype: object

So to actually reset the index of `pb`, we need to *assign* the output of the method back to `pb`, like this:

In [7]:
pb = pb.reset_index(drop=True)
pb

0          rose
1       fuchsia
2          ruby
3       magenta
4     turquoise
5      sky blue
6          navy
7    ocean blue
dtype: object

Alternatively, you can include the `inplace=True` argument, in which case you don't need to assign the output with `pb = `:

In [8]:
pb = pink.append(blue)
pb.reset_index(drop=True, inplace=True)
pb

0          rose
1       fuchsia
2          ruby
3       magenta
4     turquoise
5      sky blue
6          navy
7    ocean blue
dtype: object
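The behaviour described in this section can be checked directly. The following is a minimal sketch using shortened, made-up Series; it builds `pb` with `pd.concat()` (introduced in the next section) so that it does not depend on `.append()` being available.

```python
import pandas as pd

pink = pd.Series(['rose', 'fuchsia'])
blue = pd.Series(['turquoise', 'navy'])
pb = pd.concat([pink, blue])              # index labels are 0, 1, 0, 1

# With duplicate labels, selecting label 0 returns two elements
print(pb.loc[0].tolist())

# Without assignment, reset_index() returns a new Series ...
result = pb.reset_index(drop=True)
# ... and pb itself still has the duplicated labels
unchanged = pb.index.tolist()
print(result.index.tolist(), unchanged)

# Assigning the result back is what updates pb
pb = pb.reset_index(drop=True)
print(pb.index.tolist())
```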

## Concatenation

Another way to combine pandas objects (Series or DataFrames) is concatenation. The `pd.concat()` function is a more powerful and flexible tool than the `.append()` method. Whereas appending always adds rows to the bottom of a DataFrame, concatenation can do this, *or* add columns to a DataFrame.

[API for `pd.concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html?highlight=concat#pandas.concat)

There's also a nice, detailed explanation of appending, concatenating, merging, and joining DataFrames [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html).

Here's how we would use `pd.concat()` to do the same thing we did above with `.append()`. There are two major differences to pay attention to. First, `pd.concat()` is a *function*, whereas `.append()` is a *method*. Recall that methods are applied by adding a dot and the method name to the variable you want to operate on (e.g., `pink.append(blue)`). With a function, we write `pd` before the dot and the function name after it, and pass all of the input data as the first argument inside the parentheses. Second, pay attention to how the input data are specified: since a function's arguments are separated by commas, you can't just list the inputs like this: `pd.concat(pink, blue)`, because only `pink` will be interpreted as the input data, and `blue` as a second argument. We need to put the input data inside a list, like this:

In [9]:
pb = pd.concat([pink, blue])
pb

0          rose
1       fuchsia
2          ruby
3       magenta
0     turquoise
1      sky blue
2          navy
3    ocean blue
dtype: object

As with `.append()`, the original index values are preserved, which we might not want. While with `.append()` we had to run a separate method to reset the index, with `pd.concat()` we can do this at the same time as the concatenation, using the `ignore_index` argument:

In [10]:
pb = pd.concat([pink, blue], ignore_index=True)
pb

0          rose
1       fuchsia
2          ruby
3       magenta
4     turquoise
5      sky blue
6          navy
7    ocean blue
dtype: object
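To make the list requirement concrete, here is a small sketch with shortened, made-up Series. It assumes only that passing the Series as separate arguments raises a `TypeError`; the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

pink = pd.Series(['rose', 'fuchsia'])
blue = pd.Series(['turquoise', 'navy'])

# Passing the Series as separate arguments fails, because only the
# first argument is treated as the data to combine
try:
    pd.concat(pink, blue)
    raised = False
except TypeError as err:
    raised = True
    print('TypeError:', err)

# Wrapping the inputs in a list works, and any number of inputs can be combined
pb = pd.concat([pink, blue, pink], ignore_index=True)
print(pb.index.tolist())
```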

## `.append()` and `pd.concat()` 

Above we were working with pandas Series. For pandas DataFrames, the append method works just the same, stacking the rows. 

Here we will use what we learned in chapter 1 to read two CSV files as DataFrames, then combine them with the `.append()` method.

In [11]:
s1 = pd.read_csv('s1.csv')
s2 = pd.read_csv('s2.csv')

all_data = s1.append(s2)

all_data

Unnamed: 0,trial,RT
0,1,0.508971
1,2,0.389858
2,3,0.404175
3,4,0.26952
4,5,0.437765
5,6,0.368142
6,7,0.400544
7,8,0.335198
8,9,0.341722
9,10,0.439583


### Appending DataFrames with `ignore_index`

We can also do this with `pd.concat()`, again using `ignore_index=True`:

In [12]:
# s1 and s2 are already loaded into memory from above

df = pd.concat([s1, s2], ignore_index=True)

df

Unnamed: 0,trial,RT
0,1,0.508971
1,2,0.389858
2,3,0.404175
3,4,0.26952
4,5,0.437765
5,6,0.368142
6,7,0.400544
7,8,0.335198
8,9,0.341722
9,10,0.439583


## Reading multiple files to build a DataFrame

We can get a bit fancier and use a loop to read in the files, as in the previous chapter, and then combine them. Here's the code from the last chapter, which reads the CSV files into a list of DataFrames:

In [13]:
filenames = ['s1.csv', 's2.csv', 's3.csv']

df_list = []

for filename in filenames:
    df_list.append(pd.read_csv(filename))

Since `df_list` is already a list (which is the format that `pd.concat()` wants its input in), we can just pass the whole thing to `pd.concat()`:

In [14]:
df = pd.concat(df_list, ignore_index=True)

df

Unnamed: 0,trial,RT
0,1,0.508971
1,2,0.389858
2,3,0.404175
3,4,0.26952
4,5,0.437765
5,6,0.368142
6,7,0.400544
7,8,0.335198
8,9,0.341722
9,10,0.439583
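If you don't have the CSV files to hand, the read-then-concatenate pattern can be tried with in-memory stand-ins. In this sketch the `fake_files` dictionary and its values are invented for illustration; `io.StringIO` simply lets `pd.read_csv()` read from a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical stand-ins for s1.csv, s2.csv and s3.csv (made-up values)
fake_files = {
    's1.csv': 'trial,RT\n1,0.51\n2,0.39\n',
    's2.csv': 'trial,RT\n1,0.44\n2,0.37\n',
    's3.csv': 'trial,RT\n1,0.40\n2,0.34\n',
}

df_list = []
for filename, text in fake_files.items():
    # Read each "file" into its own DataFrame and collect them in a list
    df_list.append(pd.read_csv(io.StringIO(text)))

# The list can be passed to pd.concat() as-is
df = pd.concat(df_list, ignore_index=True)
print(df.shape)
print(df.index.tolist())
```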


## Concatenating columns rather than rows

As noted earlier, `pd.concat()` is more flexible than `.append()` in how it combines inputs: it can add columns as well as rows.

Consider this example, where we have different data about the same participants, in different files. One file contains participants' favourite colour, and the other their birthday month. What we want is to end up with a DataFrame with one row per participant, with columns for participant number, `Fav Colour`, and `Birthday Month`. However, when we read in the two input files and concatenate them with the default settings, the rows of the inputs are stacked. We get a column for colour and a column for month, with lots of `NaN` values in each, because each input file has a column that the other lacks:

In [15]:
fav_colour = pd.read_csv('fav_colour.csv')
birthday_month = pd.read_csv('birthday_months.csv')

df = pd.concat([fav_colour, birthday_month])

df

Unnamed: 0,Participant num,Fav Colour,Birthday Month
0,1,blue,
1,2,red,
2,3,green,
3,4,purple,
4,5,red,
5,6,green,
6,7,orange,
7,8,yellow,
8,9,yellow,
9,10,pink,


You can see above that there's also a `Participant num` column, which indicates how we can match colours to months. What we actually want is to combine the two inputs "horizontally", such that we have 10 rows (one for each participant), with the colour and month corresponding to each participant in the same row.

The default when concatenating DataFrames is to do so vertically, as we saw above. However, `pd.concat()` allows us to concatenate horizontally as well. To do this, you must specify either `axis=1` or `axis='columns'`. Note in the example below that rows with identical index values are combined when concatenated.

In [16]:
df = pd.concat([fav_colour, birthday_month], axis=1)
df

Unnamed: 0,Participant num,Fav Colour,Participant num.1,Birthday Month
0,1,blue,1,may
1,2,red,2,june
2,3,green,3,january
3,4,purple,4,february
4,5,red,5,september
5,6,green,6,july
6,7,orange,7,may
7,8,yellow,8,may
8,9,yellow,9,august
9,10,pink,10,december


We're still not quite where we want to be, as we have two redundant `Participant num` columns. When concatenating, pandas plays it safe, and doesn't assume that two columns with the same name are redundant. One way to fix this is to make the `Participant num` column the index of each input DataFrame when we first load the data. Since indexes are essentially row labels, making `Participant num` the index tells pandas that these two columns with the same name are indeed the same thing.

In [17]:
fav_colour = pd.read_csv('fav_colour.csv', index_col='Participant num')
birthday_month = pd.read_csv('birthday_months.csv', index_col='Participant num')

df = pd.concat([fav_colour, birthday_month], axis=1)
df

Unnamed: 0_level_0,Fav Colour,Birthday Month
Participant num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,blue,may
2,red,june
3,green,january
4,purple,february
5,red,september
6,green,july
7,orange,may
8,yellow,may
9,yellow,august
10,pink,december


Alternatively, we could make one of the `Participant num` columns the index after concatenation, but specifying the index when we read in the data is a safer way of doing things. This is because your data might not be in the same order in both data files (e.g., one data file might not be sorted by `Participant num`), or one file might have missing data. By making `Participant num` the index for each file before we concatenate them, we ensure that pandas matches the rows from each input based on their index values rather than their position.

Importantly, this is a case where we would *not* want to include the `ignore_index=True` argument to `pd.concat()`, because the index is important and meaningful.
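The point about matching rows by index rather than by position can be illustrated with a small sketch. The two DataFrames here are made up: the second one lists participants in a different order and has no entry for participant 3.

```python
import pandas as pd

# Hypothetical data: note the different row order, and that
# participant 3 has no birthday month recorded
fav_colour = pd.DataFrame(
    {'Fav Colour': ['blue', 'red', 'green']},
    index=pd.Index([1, 2, 3], name='Participant num'),
)
birthday_month = pd.DataFrame(
    {'Birthday Month': ['june', 'may']},
    index=pd.Index([2, 1], name='Participant num'),
)

# Rows are paired up by their index labels, not by their position
df = pd.concat([fav_colour, birthday_month], axis=1)
print(df)
```

Participant 1 should end up paired with `may` and participant 2 with `june`, while participant 3 gets `NaN` for the missing month.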

## MultiIndexes

MultiIndexes extend pandas indexing, allowing you to designate multiple columns as indexes. For example, you may have data for each month of the year, from multiple years. In this case, you might want to use month as the index, but you would not want pandas to treat January 2019 as the same as January 2020. You would want indexes for both month and year.

MultiIndexes can be applied to both rows (for which we've already learned about single-indexing), and to columns. 

Imagine we collected reaction time (RT) data from an individual human participant in two different testing sessions. Each session involved 10 experimental trials. Between the first and the second session, the person played cognitive training games and we want to know if their RTs decreased due to the training. So we can load in the two data files (one from each session):

In [18]:
sess_1 = pd.read_csv('session_1.csv', index_col='trial')
sess_2 = pd.read_csv('session_2.csv', index_col='trial')

Now we view the data from each session:

In [19]:
sess_1

Unnamed: 0_level_0,rt
trial,Unnamed: 1_level_1
0,0.988
1,0.753
2,0.949
3,0.824
4,0.262
5,0.803
6,0.376
7,0.496
8,0.235
9,0.336


In [20]:
sess_2

Unnamed: 0_level_0,rt
trial,Unnamed: 1_level_1
0,0.718
1,0.851
2,0.747
3,0.52
4,0.991
5,0.004
6,0.547
7,0.883
8,0.841
9,0.195


You can see that because of the `index_col='trial'` argument to `pd.read_csv()`, trial number is used as the index for each DataFrame.

Now we can concatenate the data. One way to do this is simply to append the rows of `sess_2` to the bottom of `sess_1`, using the `axis=0` argument to specify that concatenation is by rows:

In [21]:
sess_12 = pd.concat([sess_1, sess_2], axis=0)
sess_12

Unnamed: 0_level_0,rt
trial,Unnamed: 1_level_1
0,0.988
1,0.753
2,0.949
3,0.824
4,0.262
5,0.803
6,0.376
7,0.496
8,0.235
9,0.336


The problem with the result above is that we don't know which session each data point came from. We know session names from the names of the files, but that information doesn't get used in making the DataFrame. We can deal with this by manually specifying the session names, and using them as row indexes. Critically, we will use MultiIndexing so that `trial` is retained as an index. In other words, there are two indexes.

In [22]:
sess_12 = pd.concat([sess_1, sess_2], keys=['sess_1', 'sess_2'], axis=0)
sess_12

Unnamed: 0_level_0,Unnamed: 1_level_0,rt
Unnamed: 0_level_1,trial,Unnamed: 2_level_1
sess_1,0,0.988
sess_1,1,0.753
sess_1,2,0.949
sess_1,3,0.824
sess_1,4,0.262
sess_1,5,0.803
sess_1,6,0.376
sess_1,7,0.496
sess_1,8,0.235
sess_1,9,0.336


We can then use the `.loc[]` indexer to select all trials from one session or the other:

In [23]:
sess_12.loc['sess_1']

Unnamed: 0_level_0,rt
trial,Unnamed: 1_level_1
0,0.988
1,0.753
2,0.949
3,0.824
4,0.262
5,0.803
6,0.376
7,0.496
8,0.235
9,0.336
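Here is a compact sketch of the row MultiIndex built with `keys`, using made-up reaction times with three trials per session. Selecting a single trial with a tuple is an extra step beyond the example above.

```python
import pandas as pd

# Hypothetical RTs, three trials per session
sess_1 = pd.DataFrame({'rt': [0.9, 0.7, 0.8]}, index=pd.Index([0, 1, 2], name='trial'))
sess_2 = pd.DataFrame({'rt': [0.6, 0.5, 0.4]}, index=pd.Index([0, 1, 2], name='trial'))

sess_12 = pd.concat([sess_1, sess_2], keys=['sess_1', 'sess_2'], axis=0)

# Each row label is now a (session, trial) pair
print(sess_12.index.tolist())

# Select a whole session with the outer label ...
second = sess_12.loc['sess_2']
# ... or one trial within a session with a (session, trial) tuple
one_rt = sess_12.loc[('sess_2', 1), 'rt']
print(one_rt)
```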


## Concatenating DataFrames from a dict

Instead of using the `keys=[]` argument to `pd.concat()`, we can create a dictionary mapping index labels to input DataFrames. This works just as well as using the `keys` argument, but it's a little safer and more explicit, because with a dictionary the mapping between labels and data is easy to verify. Using separate lists of DataFrames and keys relies on ensuring that the order of items is the same in both lists.

In [24]:
sess_dict = {'session 1':sess_1, 'session 2':sess_2}

sess_12 = pd.concat(sess_dict)

sess_12

Unnamed: 0_level_0,Unnamed: 1_level_0,rt
Unnamed: 0_level_1,trial,Unnamed: 2_level_1
session 1,0,0.988
session 1,1,0.753
session 1,2,0.949
session 1,3,0.824
session 1,4,0.262
session 1,5,0.803
session 1,6,0.376
session 1,7,0.496
session 1,8,0.235
session 1,9,0.336


## Concatenating Columns

We can instead concatenate the data by column, as we did earlier for colours and birthday month:

In [25]:
sess_12 = pd.concat([sess_1, sess_2], axis='columns')  # We could also use axis=1

sess_12

Unnamed: 0_level_0,rt,rt
trial,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.988,0.718
1,0.753,0.851
2,0.949,0.747
3,0.824,0.52
4,0.262,0.991
5,0.803,0.004
6,0.376,0.547
7,0.496,0.883
8,0.235,0.841
9,0.336,0.195


We have the same problem as when concatenating by row: there are two `rt` columns and no indication of which is associated with which session. So we can use MultiIndexing on the columns, just as we did for rows above, to add the session labels. We do this by first creating a dictionary mapping a label to each session's data, and then using that dictionary as the input to `pd.concat()`:

In [26]:
sess_dict = {'session 1':sess_1, 'session 2':sess_2}
sess_12 = pd.concat(sess_dict, axis='columns')

sess_12

Unnamed: 0_level_0,session 1,session 2
Unnamed: 0_level_1,rt,rt
trial,Unnamed: 1_level_2,Unnamed: 2_level_2
0,0.988,0.718
1,0.753,0.851
2,0.949,0.747
3,0.824,0.52
4,0.262,0.991
5,0.803,0.004
6,0.376,0.547
7,0.496,0.883
8,0.235,0.841
9,0.336,0.195


You can see that the original column labels (`rt`) are preserved but above these are unique labels for each session.
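To see what the column MultiIndex looks like underneath, this sketch (with made-up values) prints the column labels and pulls out a single column using its full label.

```python
import pandas as pd

sess_1 = pd.DataFrame({'rt': [0.9, 0.7]}, index=pd.Index([0, 1], name='trial'))
sess_2 = pd.DataFrame({'rt': [0.6, 0.5]}, index=pd.Index([0, 1], name='trial'))

sess_12 = pd.concat({'session 1': sess_1, 'session 2': sess_2}, axis='columns')

# Each column label is now a (session, measure) pair
print(sess_12.columns.tolist())

# The full pair picks out one column as a Series
s2_rt = sess_12[('session 2', 'rt')]
print(s2_rt.tolist())
```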

### Selecting and slicing on MultiIndexes

When we had row MultiIndexes above, we used `.loc` to access specific data, such as the data for one session. However, when `.loc` is given a single label it looks for that label in the rows, not the columns, so `sess_12.loc['session 1']` won't select one session's data in this case. There are a few ways to select from column MultiIndexes, but one of the most intuitive is the `pd.IndexSlice` object. This allows you to use `.loc` with a [*rows*, *columns*] syntax, as shown below, where we select all rows with `:` and the specific column by specifying its label:

In [27]:
sess_12.loc[pd.IndexSlice[:,'session 1']]

Unnamed: 0_level_0,rt
trial,Unnamed: 1_level_1
0,0.988
1,0.753
2,0.949
3,0.824
4,0.262
5,0.803
6,0.376
7,0.496
8,0.235
9,0.336


We could also select only some trials from one session:

In [28]:
sess_12.loc[pd.IndexSlice[0:3, 'session 1']]

Unnamed: 0_level_0,rt
trial,Unnamed: 1_level_1
0,0.988
1,0.753
2,0.949
3,0.824


One trick that you will often see (including in DataCamp) is to assign `pd.IndexSlice` to a shorter variable name, typically `idx`. This can make your commands a little simpler:

In [29]:
idx = pd.IndexSlice

sess_12.loc[idx[:,'session 1']]

Unnamed: 0_level_0,rt
trial,Unnamed: 1_level_1
0,0.988
1,0.753
2,0.949
3,0.824
4,0.262
5,0.803
6,0.376
7,0.496
8,0.235
9,0.336
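The sketch below, using made-up values with five trials per session, suggests that `idx[...]` simply packages up the row and column selectors: for this simple case, writing the selectors directly inside `.loc[]` gives the same result. It also shows that label-based slices include the end label.

```python
import pandas as pd

# Hypothetical RTs, five trials per session
trials = pd.Index(range(5), name='trial')
sess_1 = pd.DataFrame({'rt': [0.9, 0.7, 0.8, 0.6, 0.5]}, index=trials)
sess_2 = pd.DataFrame({'rt': [0.4, 0.3, 0.2, 0.1, 0.0]}, index=trials)
sess_12 = pd.concat({'session 1': sess_1, 'session 2': sess_2}, axis='columns')

idx = pd.IndexSlice

# Rows 0 to 3 of session 1, written with and without IndexSlice
with_idx = sess_12.loc[idx[0:3, 'session 1']]
plain = sess_12.loc[0:3, 'session 1']
print(with_idx.equals(plain))

# Unlike ordinary Python slicing, the end label (3) is included
print(with_idx.index.tolist())
```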


## Outer and Inner Joins 

"Joins" refer to how two DataFrames are combined. 

To demonstrate, we'll start by showing how this works with NumPy arrays, and then move on to how to do the same things with pandas DataFrames. 

### NumPy

For NumPy, the first step is to create three NumPy arrays, each with different dimensions. To do this we use NumPy's `random.rand()` function, which creates an array with a specified shape, filled with random numbers. We *chain* this with the `.round()` method to shorten the numbers, then multiply the array by 100 to convert the numbers from values less than 1, to values in the 0-100 range (this uses NumPy *broadcasting* to apply the multiplication to each element of the array). This is done to make the numbers easy to look at, as well as demonstrating chaining and broadcasting.

In [30]:
a = np.random.rand(2, 4).round(3)*100
print(a)

[[ 1.1 66.  44.9 12.3]
 [23.3 68.9 39.7 24.4]]


In [31]:
b = np.random.rand(2, 3).round(3)*100
print(b)

[[49.1 20.7 57.2]
 [85.9  0.3 60.6]]


In [32]:
c = np.random.rand(3, 4).round(3)*100
print(c)

[[42.7 35.2  6.1 24.8]
 [69.3 35.6 82.1 17.8]
 [71.2 76.2 23.  66.6]]


Note that although the above arrays are each different shapes - (2, 4), (2, 3), and (3, 4) - there's always one dimension of each array that has the same size as one dimension of another array. For example, `a` and `b` both have 2 rows, while `b` and `c` each have one dimension of length 3, but `b` has 3 columns while `c` has 3 rows. This allows us to have lots of fun combining these arrays in different ways.

NumPy has 'convenience functions' for combining arrays horizontally (adding columns beside columns; `np.hstack()`) and vertically (adding rows below rows; `np.vstack()`). NumPy also has a more generic `np.concatenate()` function that allows either horizontal or vertical concatenation (stacking) using the `axis=` argument. 

### Stacking arrays horizontally

This will produce an array with `b` and `a` together, 'beside' each other. Note that the inputs have to be inside a list:

In [33]:
np.hstack([b, a])

array([[49.1, 20.7, 57.2,  1.1, 66. , 44.9, 12.3],
       [85.9,  0.3, 60.6, 23.3, 68.9, 39.7, 24.4]])

To do the same thing with `np.concatenate()`, we include the `axis=1` argument to specify joining on the column axis. 

In [34]:
np.concatenate([b, a], axis=1)

array([[49.1, 20.7, 57.2,  1.1, 66. , 44.9, 12.3],
       [85.9,  0.3, 60.6, 23.3, 68.9, 39.7, 24.4]])

### Stacking arrays vertically

This will produce an array with `c` stacked below `a`:

In [35]:
np.vstack([a, c])

array([[ 1.1, 66. , 44.9, 12.3],
       [23.3, 68.9, 39.7, 24.4],
       [42.7, 35.2,  6.1, 24.8],
       [69.3, 35.6, 82.1, 17.8],
       [71.2, 76.2, 23. , 66.6]])

Again, we can use `np.concatenate()`, but this time we need to specify `axis=0` to indicate we're stacking on rows:

In [36]:
np.concatenate([a, c], axis=0)

array([[ 1.1, 66. , 44.9, 12.3],
       [23.3, 68.9, 39.7, 24.4],
       [42.7, 35.2,  6.1, 24.8],
       [69.3, 35.6, 82.1, 17.8],
       [71.2, 76.2, 23. , 66.6]])
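The shape requirements behind these functions can be sketched with placeholder arrays that have the same shapes as `a`, `b`, and `c` (the fill values are arbitrary). Horizontal stacking needs matching row counts, vertical stacking needs matching column counts, and a mismatch raises a `ValueError`.

```python
import numpy as np

a = np.zeros((2, 4))
b = np.ones((2, 3))
c = np.full((3, 4), 2.0)

h_shape = np.hstack([b, a]).shape   # rows match (2), columns add up
v_shape = np.vstack([a, c]).shape   # columns match (4), rows add up
print(h_shape, v_shape)

# a has 2 rows and c has 3, so they cannot be placed side by side
try:
    np.hstack([a, c])
    mismatch_error = False
except ValueError as err:
    mismatch_error = True
    print('ValueError:', err)
```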

## Combining pandas DataFrames

NumPy arrays like those above are relatively easy to work with, because they contain only numbers. Real datasets stored in pandas DataFrames present additional challenges, though, because they contain *labelled data*, and often they contain *missing data*. For instance, in a study an individual human participant may provide data on a number of tests, such as working memory span, reading comprehension, and nonverbal intelligence. Sometimes, for any number of reasons (e.g., time, technical failures, human error), an individual's data on one test may be missing. As another example, in a reaction time (RT) study each participant will complete a large number of trials, resulting in *repeated measures* from the same individual (RT measures on each trial). On some trials, individuals may fail to respond, resulting in missing data for those trials.

Missing data creates problems when combining data. Depending on the situation, it may be preferable to replace missing values with null values (which appear in Python as `NaN`, for 'not a number'), or it may be preferable to have complete data and leave out data that we don't have for all of the inputs (e.g., drop the data from one test completely, if we don't have data for every participant).

This is where the terms **inner join** and **outer join** come in. This is a bit of jargon you need to learn, but it's pretty logical.

An **outer join** involves filling in missing values with `NaN`. In formal logical terms, this is the *union* of the input data sets. This is the default for `pd.concat()`. You can also think of the names as reflecting the fact that this approach includes all the data within the 'outer' boundaries of the DataFrame, like a box drawn around the entire table.

An **inner join** involves keeping only the data that is complete for all inputs. In formal logic, this is the *intersection* of the inputs (i.e., only what they all have in common). You can think of the term 'inner' as referring to the fact that this takes only the data inside that big outer box, that has no missing data. 
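Before turning to real data, the difference between the two joins can be sketched with two tiny, made-up DataFrames that share only a `Vocab` column. The names `study_a` and `study_b` and all of the scores are invented for illustration.

```python
import pandas as pd

# Hypothetical scores from two studies with one measure in common
study_a = pd.DataFrame({'Vocab': [30, 35], 'WordID': [60, 70]},
                       index=['a_01', 'a_02'])
study_b = pd.DataFrame({'Vocab': [28, 33], 'Nonverbal': [15, 20]},
                       index=['b_01', 'b_02'])

outer = pd.concat([study_a, study_b])                # join='outer' is the default
inner = pd.concat([study_a, study_b], join='inner')

print(outer.columns.tolist())        # union of the columns
print(inner.columns.tolist())        # only the shared column
print(int(outer.isna().sum().sum())) # NaNs filled in by the outer join
```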

## Example Data

Here we have data from two studies of reading and related abilities in children. Each study involved different children. In each study, some of the same measures were collected (such as vocabulary), along with some that were collected in only one study. We'd like to combine the data from the two studies. 

First, let's load the data from each study and see what we have. Note that I already know that there's a `Participant` column that uniquely identifies each person by an ID code, so we use that as the index for the DataFrame.

In [37]:
study1 = pd.read_csv('study1.csv', index_col='Participant')
study1.shape

(36, 6)

So study 1 contains 6 measures from each of 36 participants. Let's look at how the data are structured:

In [38]:
study1

Unnamed: 0_level_0,Fluency,WordID,Comprehension,Orthoknow,Vocab,PhonAwar
Participant,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
study1_01,73.0,84.0,41,47.0,32.0,31.0
study1_02,104.0,45.0,34,37.0,32.0,15.0
study1_03,109.0,59.0,20,48.0,31.0,26.0
study1_04,94.0,60.0,38,48.0,33.0,29.0
study1_05,106.0,66.0,41,,34.0,32.0
study1_06,133.0,52.0,48,28.0,41.0,13.0
study1_07,118.0,67.0,39,46.0,39.0,28.0
study1_08,106.0,71.0,25,45.0,37.0,30.0
study1_09,128.0,69.0,35,50.0,29.0,31.0
study1_10,108.0,77.0,27,38.0,36.0,32.0


You can see that, since we read in the data with the `index_col='Participant'` argument, the participant ID is used as the index rather than the default row numbers.

Now let's load the data from the other study and look at it:

In [39]:
study2 = pd.read_csv('study2.csv', index_col='Participant')
study2.shape

(43, 4)

So now we have 4 measures from 43 participants. Again we examine it:

In [40]:
study2

Unnamed: 0_level_0,Comprehension,Vocab,Nonverbal,Fluency
Participant,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
study2_01,17.0,15.0,15.0,137.0
study2_02,,31.0,21.0,115.0
study2_05,,,20.0,95.0
study2_07,34.0,7.0,12.0,
study2_08,28.0,18.0,10.0,52.0
study2_09,21.0,27.0,,130.0
study2_10,28.0,,8.0,136.0
study2_11,,26.0,17.0,60.0
study2_12,,26.0,,97.0
study2_13,39.0,,,121.0


Comparing the two datasets, we can see that there are three measures in common across the two studies: `Fluency`, `Comprehension`, and `Vocab`. Each study also has unique measures, for which we don't have data in the other study: study 1 has `WordID`, `Orthoknow`, and `PhonAwar`, while study 2 has `Nonverbal`. 

The other thing to note is that in both datasets, there are missing data (`NaN`); for some participants we are missing data on some tests. 

### Combining the data sets

Now that we have an idea of what we're working with, we can think about how to combine these two datasets using `pd.concat()`. The first question is whether horizontal or vertical concatenation makes more sense. Since each row of data in each data set corresponds to one individual, it really doesn't make sense to combine these horizontally. So we want to concatenate vertically, stacking the rows. For this we use the `axis=0` argument.

In [41]:
studies_1_2 = pd.concat([study1, study2], axis=0)
studies_1_2

Unnamed: 0_level_0,Fluency,WordID,Comprehension,Orthoknow,Vocab,PhonAwar,Nonverbal
Participant,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
study1_01,73.0,84.0,41.0,47.0,32.0,31.0,
study1_02,104.0,45.0,34.0,37.0,32.0,15.0,
study1_03,109.0,59.0,20.0,48.0,31.0,26.0,
study1_04,94.0,60.0,38.0,48.0,33.0,29.0,
study1_05,106.0,66.0,41.0,,34.0,32.0,
study1_06,133.0,52.0,48.0,28.0,41.0,13.0,
study1_07,118.0,67.0,39.0,46.0,39.0,28.0,
study1_08,106.0,71.0,25.0,45.0,37.0,30.0,
study1_09,128.0,69.0,35.0,50.0,29.0,31.0,
study1_10,108.0,77.0,27.0,38.0,36.0,32.0,


You can see above that pandas preserved all of the columns from both inputs, combining the data for column labels that existed in both data sets, and inserting `NaN` in any column that was only present in one of the data sets, for the participants who did not provide data on that measure.

Again, this is called an **outer join**, and is the default for `pd.concat()`. In some data analysis situations, we might only want to analyze the measures that are available in every data set. To do this, we would perform an **inner join**, which keeps only the columns that are present in all of the inputs:

In [42]:
studies_1_2 = pd.concat([study1, study2], axis=0, join='inner')
studies_1_2

Unnamed: 0_level_0,Fluency,Comprehension,Vocab
Participant,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
study1_01,73.0,41.0,32.0
study1_02,104.0,34.0,32.0
study1_03,109.0,20.0,31.0
study1_04,94.0,38.0,33.0
study1_05,106.0,41.0,34.0
study1_06,133.0,48.0,41.0
study1_07,118.0,39.0,39.0
study1_08,106.0,25.0,37.0
study1_09,128.0,35.0,29.0
study1_10,108.0,27.0,36.0


Above you can see that only the `Fluency`, `Comprehension`, and `Vocab` columns were kept (along with the `Participant` index); the others were discarded.

Note however that there are still `NaN` values for some participants, for some measures. In other words, our inner join only applied to the columns and not to the rows. If we truly want complete cases, and therefore wish to drop any participant with missing data, we can use the `.dropna()` method:

In [43]:
studies_1_2.dropna(axis=0)

Unnamed: 0_level_0,Fluency,Comprehension,Vocab
Participant,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
study1_01,73.0,41.0,32.0
study1_02,104.0,34.0,32.0
study1_03,109.0,20.0,31.0
study1_04,94.0,38.0,33.0
study1_05,106.0,41.0,34.0
study1_06,133.0,48.0,41.0
study1_07,118.0,39.0,39.0
study1_08,106.0,25.0,37.0
study1_09,128.0,35.0,29.0
study1_10,108.0,27.0,36.0


We can combine this with `pd.concat()` through chaining, to achieve the full result in one line of code:

In [44]:
studies_1_2 = pd.concat([study1, study2], axis=0, join='inner').dropna(axis=0)
studies_1_2

Unnamed: 0_level_0,Fluency,Comprehension,Vocab
Participant,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
study1_01,73.0,41.0,32.0
study1_02,104.0,34.0,32.0
study1_03,109.0,20.0,31.0
study1_04,94.0,38.0,33.0
study1_05,106.0,41.0,34.0
study1_06,133.0,48.0,41.0
study1_07,118.0,39.0,39.0
study1_08,106.0,25.0,37.0
study1_09,128.0,35.0,29.0
study1_10,108.0,27.0,36.0
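The effect of each step in the chain can be checked on a small, made-up example. In this sketch `study_a` and `study_b` are hypothetical, and one `Vocab` score is deliberately missing so that `.dropna()` has something to remove.

```python
import numpy as np
import pandas as pd

# Hypothetical data; participant a_02 is missing a Vocab score
study_a = pd.DataFrame({'Vocab': [30, np.nan], 'WordID': [60, 70]},
                       index=['a_01', 'a_02'])
study_b = pd.DataFrame({'Vocab': [28, 33], 'Nonverbal': [15, 20]},
                       index=['b_01', 'b_02'])

# The inner join drops the columns that are not shared ...
combined = pd.concat([study_a, study_b], axis=0, join='inner')
# ... and dropna() then drops the rows that still contain NaN
complete = combined.dropna(axis=0)

print(len(combined), len(complete))
print(complete.index.tolist())
```

Note that `.dropna()` returns a new DataFrame, so `combined` still contains the row with the missing value.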
