# Pandas Data Structures

## Summary
In this notebook, we'll be covering:
- [Dataframes](#Dataframes)
- [Columns](#Columns)
- [Rows and cells](#Rows-and-Cells)
- [Modifying dataframes](#Modifying-Dataframes)
- [Creating dataframes](#Creating-Dataframes)
- [Combining dataframes](#Combining-Dataframes)

### Introduction
This is the first section to deal specifically with Pandas. Pandas is a powerful data analysis library that makes data science work much easier than it once was.

Load the pandas package, and assign it a shorthand name, 'pd'.

In [1]:
import pandas as pd

Don't worry about understanding the next block of code, but you will need to run it to do the rest of this exercise.
The important thing is that it stores a dataframe you can use in the variable named `df`.
The `pd.read_csv()` function is often used to load a dataframe from a .csv file. Given the Jupyter notebook format, though, it is easier to build the dataframe directly in code for the purposes of this exercise.
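For comparison, here is a minimal sketch of what loading with `pd.read_csv()` can look like. The CSV text, its values, and the variable names `csv_text` and `csv_df` are made up for illustration. An in-memory `io.StringIO` object stands in for a real file so the example runs anywhere.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file on disk.
csv_text = """ID,Measurement Device,Heart Rate Max
101,Skykandal,139
102,B-Wolf,126
"""

# read_csv accepts a file path or a file-like object such as StringIO.
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df.head())
```

With a real file you would pass its path instead, for example `pd.read_csv('workouts.csv')`, where the file name is again only a placeholder.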

### Dataframes

In [2]:
import random

workout_dict = {'ID': [], 'Measurement Device': [], 'Heart Rate Max': [], 'Heart Rate Min': [], 'Heart Rate Avg': [],
              'Duration of exercise (min)': [], 'Exercise Type': []}
used_ids = set()  # IDs handed out so far, so that every ID is unique

for _ in range(500):
    # Draw a random 9-digit ID, redrawing until it has not been used before
    workout_id = random.randint(100000000, 999999999)
    while workout_id in used_ids:
        workout_id = random.randint(100000000, 999999999)
    used_ids.add(workout_id)
    device = random.choice(['Skykandal', 'B-Wolf'])
    # Draw a minimum heart rate and a higher maximum, redrawing until max > min
    mu = random.randint(65, 85)
    min_rate = int(random.gauss(mu, 10))
    max_rate = int(random.gauss(mu + 55, 25))
    while max_rate <= min_rate:
        max_rate = int(random.gauss(mu + 55, 25))
    # The average is centered on the midpoint of the minimum and maximum
    avg = random.gauss((max_rate + min_rate) / 2, (max_rate - min_rate) / 5)
    duration = random.randint(10, 90)
    # Repeated entries make running and swimming more likely to be chosen
    exercise = random.choice(['Running', 'Running', 'Running', 'Bicycling', 'Swimming', 'Swimming',
                              'Weight training'])
    workout_dict['ID'].append(workout_id)
    workout_dict['Measurement Device'].append(device)
    workout_dict['Heart Rate Min'].append(min_rate)
    workout_dict['Heart Rate Max'].append(max_rate)
    workout_dict['Heart Rate Avg'].append(avg)
    workout_dict['Duration of exercise (min)'].append(duration)
    workout_dict['Exercise Type'].append(exercise)

df = pd.DataFrame(workout_dict)

View the first 5 rows of the dataframe stored in the variable named "df":

In [3]:
df.head()

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,454693206,Skykandal,139,58,104.292775,59,Bicycling
1,726996189,Skykandal,126,69,108.743138,59,Bicycling
2,206329065,Skykandal,101,91,92.866254,65,Running
3,756317113,Skykandal,134,101,111.765248,43,Swimming
4,376747954,Skykandal,154,87,118.287016,82,Running


The `ID` column uniquely identifies each row.

### Columns

We will start by discussing how to view the names of the columns in the dataframe variable named `df`.

In [4]:
df.columns

Index(['ID', 'Measurement Device', 'Heart Rate Max', 'Heart Rate Min',
       'Heart Rate Avg', 'Duration of exercise (min)', 'Exercise Type'],
      dtype='object')

The code below demonstrates how to access the "Exercise Type" column.

In [5]:
df['Exercise Type']

0      Bicycling
1      Bicycling
2        Running
3       Swimming
4        Running
         ...    
495    Bicycling
496      Running
497      Running
498      Running
499      Running
Name: Exercise Type, Length: 500, dtype: object

We can simply look up the column in the dataframe (which we have named `df`) by using the column name as the key.

A single column is returned as a pandas Series. In the output above, the values on the right are the values in the column. The values on the left are the index labels, which here are unique for each row.
The "Length" value gives the number of rows, and the "dtype" value indicates the "data type".

In [8]:
df['Heart Rate Max']

0      139
1      126
2      101
3      134
4      154
      ... 
495    112
496    154
497    102
498    124
499    129
Name: Heart Rate Max, Length: 500, dtype: int64

You can see that this column is dtype `int64`. That's a dtype for integer numbers, which these all are. The Heart Rate Avg column, which is numeric but not made of integers, will have a floating point dtype, probably `float64`. The numbers at the end don't matter as much, just remember that there are `int`s for integers, `float`s for non-integer numbers (ones with decimal points, really), and `object`s for text (and some other less common things).
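To check dtypes without printing a whole column, a Series has a `dtype` attribute and a dataframe has a `dtypes` attribute. The sketch below uses a small made-up frame (`dtype_df` and its values are hypothetical) rather than `df`. The exact names printed can vary with your platform and pandas version; in particular, newer versions may report a dedicated string dtype for text instead of `object`.

```python
import pandas as pd

# A small hypothetical frame with one integer, one float, and one text column.
dtype_df = pd.DataFrame({
    'Heart Rate Max': [139, 126, 101],
    'Heart Rate Avg': [104.3, 108.7, 92.9],
    'Exercise Type': ['Bicycling', 'Bicycling', 'Running'],
})

# dtype of a single column (a Series)
print(dtype_df['Heart Rate Max'].dtype)

# dtypes of every column in the dataframe at once
print(dtype_df.dtypes)
```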

It is possible to loop over this column with a for loop. However, pandas often has shortcuts for manipulating the data, so an explicit for loop is needed less often.

In [9]:
for x in df['Heart Rate Max']:
    print(x)

139
126
101
134
154
...
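As one example of the shortcuts mentioned above, a Series has built-in methods such as `max` and `mean` that do the looping for you. The sketch below uses a short made-up Series (`rates`) so the results are easy to check by hand, and compares a hand-written loop with the built-in method.

```python
import pandas as pd

# A short hypothetical column of maximum heart rates.
rates = pd.Series([139, 126, 101, 134, 154], name='Heart Rate Max')

# Loop version: track the largest value seen so far.
highest = None
for x in rates:
    if highest is None or x > highest:
        highest = x

# Shortcut version: the Series methods do the same work in one call.
print(highest)       # 154
print(rates.max())   # 154
print(rates.mean())  # 130.8
```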

We can also access more than one column at once. We do this by passing a list of columns we want into the square brackets. Note that this means that we have two sets of square brackets, the inner one that defines the list and an outer one that says "get me these items from the dataframe".

In [10]:
df[['ID', 'Measurement Device']]

Unnamed: 0,ID,Measurement Device
0,454693206,Skykandal
1,726996189,Skykandal
2,206329065,Skykandal
3,756317113,Skykandal
4,376747954,Skykandal
...,...,...
495,131378923,Skykandal
496,268368914,B-Wolf
497,607678607,B-Wolf
498,830762164,Skykandal


#### Try writing some code in the space below that accesses more than one column at once. Access these columns: Measurement Device, Heart Rate Min, and Duration of exercise (min).

In [11]:
# put your code here


How do you find the column names? In small dataframes you can get the column names in the same way you check the data: `df.head()`. The `head` method shows the first five rows of the dataframe by default, but you can type a number in between the parentheses to make it show a different number of rows.

`head` is a method of the dataframe. This means that you write `name of dataframe.head()`, and `head` then shows you the head of the dataframe you named. If your dataframe was named blood_counts_frame you would write `blood_counts_frame.head()`.

In [12]:
df.head(10)

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,454693206,Skykandal,139,58,104.292775,59,Bicycling
1,726996189,Skykandal,126,69,108.743138,59,Bicycling
2,206329065,Skykandal,101,91,92.866254,65,Running
3,756317113,Skykandal,134,101,111.765248,43,Swimming
4,376747954,Skykandal,154,87,118.287016,82,Running
5,397302897,Skykandal,175,92,110.573254,59,Running
6,799512013,Skykandal,153,70,119.772863,18,Swimming
7,342981640,B-Wolf,136,69,84.825241,83,Swimming
8,875716976,B-Wolf,144,75,111.396355,29,Running
9,162562604,Skykandal,130,74,108.575459,88,Swimming


This code allows you to print just the column names:

In [13]:
print(list(df.columns))

['ID', 'Measurement Device', 'Heart Rate Max', 'Heart Rate Min', 'Heart Rate Avg', 'Duration of exercise (min)', 'Exercise Type']


We can also rename a column in a dataframe. For instance, this code renames the column 'Heart Rate Max' to 'Max'. 
To do this we use the dataframe `rename` method.

In [14]:
df.rename(columns={'Heart Rate Max': 'Max'}, inplace=True)
df.head()

Unnamed: 0,ID,Measurement Device,Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,454693206,Skykandal,139,58,104.292775,59,Bicycling
1,726996189,Skykandal,126,69,108.743138,59,Bicycling
2,206329065,Skykandal,101,91,92.866254,65,Running
3,756317113,Skykandal,134,101,111.765248,43,Swimming
4,376747954,Skykandal,154,87,118.287016,82,Running


As you can see, `rename` is a method that is "attached" to the dataframe, so we use dot notation for it and then it automatically knows what dataframe it is renaming. There are several ways to specify that we are renaming columns. I think the easiest one to read is the `columns` argument. `columns` then equals a dictionary where the key is the current column name and the value is the new name. Note that we then have a comma, because the whole `columns={'Heart Rate Max': 'Max'}` section was just the first argument.

The `inplace` option is very common in pandas. When it is set to `True`, the operation changes this dataframe directly. When it is set to `False`, the operation leaves this dataframe alone and returns a new dataframe with the change applied. By default it is `False`.

Use of the `False` option is demonstrated here.

In [15]:
new_frame = df.rename(columns={'Max': 'MAX!!!!'}, inplace=False)
print('Original dataframe column names:', df.columns.values)
print('New dataframe column names:', new_frame.columns.values)

Original dataframe column names: ['ID' 'Measurement Device' 'Max' 'Heart Rate Min' 'Heart Rate Avg'
 'Duration of exercise (min)' 'Exercise Type']
New dataframe column names: ['ID' 'Measurement Device' 'MAX!!!!' 'Heart Rate Min' 'Heart Rate Avg'
 'Duration of exercise (min)' 'Exercise Type']


As you can see, Max was changed to MAX!!!! only in the new frame (imaginatively named "new_frame") and not in the original frame. This is very useful when you aren't sure if what you are doing will work the way you think it will, and don't want to mess up your original dataframe.

Since the `columns` argument of `rename` takes a dictionary you can change as many column names at once as you want.

#### Below, write code that changes the names of `Heart Rate Min` to `Min` and `Duration of exercise (min)` to `Duration`. Then print one of them, using the new name, to make sure it worked.

In [14]:
# put your code here


### Rows and Cells

Accessing rows works off of the index of the dataframe. We can print the index for each row below just to remind ourselves what it looks like.

In [16]:
df.index.values

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18

We can see that our index is just row numbers. In some cases it may be something different. You can reset the index to be the ID number, or anything else. But let's not do that, because that isn't useful for the sorts of things we're trying to do.
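For reference only, here is a rough sketch of what using the ID as the index would look like, via the `set_index` and `reset_index` methods. It uses a small made-up frame (`small`) so that `df` is left untouched.

```python
import pandas as pd

# A small hypothetical frame with the default 0, 1, 2 row-number index.
small = pd.DataFrame({
    'ID': [454693206, 726996189, 206329065],
    'Exercise Type': ['Bicycling', 'Bicycling', 'Running'],
})

# set_index returns a new frame whose index is the ID column.
by_id = small.set_index('ID')
print(by_id.loc[726996189, 'Exercise Type'])  # Bicycling

# reset_index turns the index back into an ordinary column.
restored = by_id.reset_index()
print(list(restored.columns))  # ['ID', 'Exercise Type']
```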

Instead, we will use the `loc` indexer of the dataframe to locate the row we want, by its index. Note that `loc` uses square brackets rather than parentheses, like indexing from a list. Below, we will print out the second row (the first row would be index 0).

In [17]:
df.loc[1]

ID                             726996189
Measurement Device             Skykandal
Max                                  126
Heart Rate Min                        69
Heart Rate Avg                108.743138
Duration of exercise (min)            59
Exercise Type                  Bicycling
Name: 1, dtype: object

We can also pass slice notation to `loc`. Unlike ordinary Python slices, a `loc` slice includes both endpoints, so the code below shows the rows with index 0, 1, 2, 3, and 4 and replicates `df.head()`.

In [18]:
df.loc[0:4]

Unnamed: 0,ID,Measurement Device,Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,454693206,Skykandal,139,58,104.292775,59,Bicycling
1,726996189,Skykandal,126,69,108.743138,59,Bicycling
2,206329065,Skykandal,101,91,92.866254,65,Running
3,756317113,Skykandal,134,101,111.765248,43,Swimming
4,376747954,Skykandal,154,87,118.287016,82,Running
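Because `loc` slices by index label and includes the end label, it behaves differently from list slicing and from the position-based `iloc` indexer. The sketch below shows the difference on a small made-up frame (`letters`).

```python
import pandas as pd

# A hypothetical six-row frame with the default 0..5 index.
letters = pd.DataFrame({'letter': list('abcdef')})

by_label = letters.loc[0:4]        # labels 0 through 4, end included
by_position = letters.iloc[0:4]    # positions 0 through 3, end excluded
python_list = list('abcdef')[0:4]  # ordinary list slice, end excluded

print(len(by_label))     # 5
print(len(by_position))  # 4
print(len(python_list))  # 4
```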


We can also get a cell directly with `loc` by specifying both row and column. If you give `loc` a tuple it will assume that it is (row, column). The code below gives us the ID at index 3.

In [19]:
df.loc[3, 'ID']

756317113

We can also do this with the multiple column lookup.

In [20]:
df.loc[3, ['ID', 'Measurement Device']]

ID                    756317113
Measurement Device    Skykandal
Name: 3, dtype: object

The code below will return the maximum heart rate ('Max') for the fourth row (index 3).

In [21]:
df.loc[(3, 'Max')]

134

You can also get both multiple columns and multiple rows, to get a sub-dataframe.

In [22]:
df.loc[1:3, ['ID', 'Measurement Device']]

Unnamed: 0,ID,Measurement Device
1,726996189,Skykandal
2,206329065,Skykandal
3,756317113,Skykandal


#### Below, write some code that does the inverse of `head` and prints the last five rows of the dataframe. However, leave out the ID column.

In [22]:
# your code goes here


(If you actually just want the inverse of `head` there's a method `tail` that does this. However, the purpose of the exercise above is practice.)

### Modifying Dataframes

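Adding a column to a dataframe is quite easy. Below, we first look at the dataframe as it stands, then add a column named 'A' that holds the same text in every row, and finally create a column that is simply the maximum heart rate minus 2.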

In [23]:
df.head()

Unnamed: 0,ID,Measurement Device,Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,454693206,Skykandal,139,58,104.292775,59,Bicycling
1,726996189,Skykandal,126,69,108.743138,59,Bicycling
2,206329065,Skykandal,101,91,92.866254,65,Running
3,756317113,Skykandal,134,101,111.765248,43,Swimming
4,376747954,Skykandal,154,87,118.287016,82,Running


In [24]:
df['A'] = 'A'
df.head()

Unnamed: 0,ID,Measurement Device,Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type,A
0,454693206,Skykandal,139,58,104.292775,59,Bicycling,A
1,726996189,Skykandal,126,69,108.743138,59,Bicycling,A
2,206329065,Skykandal,101,91,92.866254,65,Running,A
3,756317113,Skykandal,134,101,111.765248,43,Swimming,A
4,376747954,Skykandal,154,87,118.287016,82,Running,A


In [25]:
df['Max minus 2'] = df['Max'] - 2
df.head()

Unnamed: 0,ID,Measurement Device,Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type,A,Max minus 2
0,454693206,Skykandal,139,58,104.292775,59,Bicycling,A,137
1,726996189,Skykandal,126,69,108.743138,59,Bicycling,A,124
2,206329065,Skykandal,101,91,92.866254,65,Running,A,99
3,756317113,Skykandal,134,101,111.765248,43,Swimming,A,132
4,376747954,Skykandal,154,87,118.287016,82,Running,A,152


All we had to do here was name a new column (much like adding a key to a dictionary) and set it equal to something else.

However, the thing we set it equal to must be either a single item or a list equal in length to the other columns. The code below shows what happens with a single item: it is just repeated in every cell of that column.

In [26]:
df['Two'] = 2
df.head()

Unnamed: 0,ID,Measurement Device,Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type,A,Max minus 2,Two
0,454693206,Skykandal,139,58,104.292775,59,Bicycling,A,137,2
1,726996189,Skykandal,126,69,108.743138,59,Bicycling,A,124,2
2,206329065,Skykandal,101,91,92.866254,65,Running,A,99,2
3,756317113,Skykandal,134,101,111.765248,43,Swimming,A,132,2
4,376747954,Skykandal,154,87,118.287016,82,Running,A,152,2


This next cell will break. We're giving the column a list of values, but the list has length 2, and the other columns are length 500.

In [27]:
df['Breaking'] = [2, 2]

ValueError: Length of values (2) does not match length of index (500)

In later notebooks we'll generate data the same length as existing columns and fill new columns with it. For now, just remember that a list assigned to a column must have exactly one value per row.
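A minimal sketch of both cases, using a made-up three-row frame (`frame`) instead of `df`: a list with one value per row is accepted, while a list of the wrong length raises a `ValueError`, which is caught here so the cell keeps running.

```python
import pandas as pd

# A hypothetical three-row frame.
frame = pd.DataFrame({'Max': [139, 126, 101]})

# A list with exactly one value per row works.
frame['Device'] = ['Skykandal', 'B-Wolf', 'Skykandal']

# A list of the wrong length raises a ValueError and adds nothing.
try:
    frame['Breaking'] = [2, 2]
except ValueError as err:
    print('ValueError:', err)

print(list(frame.columns))  # ['Max', 'Device']
```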

How do we remove a column? We use the `drop` method, which has a similar argument structure to `rename`. Here we just list the column names to drop. Again, `inplace=True` means "do this to the current dataframe".

The code below drops/removes the column named 'Two'.

In [28]:
df.drop(columns=['Two'], inplace=True)
df.head()

Unnamed: 0,ID,Measurement Device,Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type,A,Max minus 2
0,454693206,Skykandal,139,58,104.292775,59,Bicycling,A,137
1,726996189,Skykandal,126,69,108.743138,59,Bicycling,A,124
2,206329065,Skykandal,101,91,92.866254,65,Running,A,99
3,756317113,Skykandal,134,101,111.765248,43,Swimming,A,132
4,376747954,Skykandal,154,87,118.287016,82,Running,A,152


If you run that code above again it will fail with a `KeyError`, since the column Two no longer exists to be dropped.
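If you would rather have a repeated drop do nothing than raise an error, `drop` accepts `errors='ignore'`. The sketch below uses a small made-up frame (`frame`) and does not use `inplace`, so each call returns a new dataframe.

```python
import pandas as pd

# A hypothetical frame with a column named 'Two' to drop.
frame = pd.DataFrame({'Max': [139, 126], 'Two': [2, 2]})
frame = frame.drop(columns=['Two'])

# Dropping the same column again raises a KeyError by default.
try:
    frame.drop(columns=['Two'])
except KeyError as err:
    print('KeyError:', err)

# With errors='ignore', missing labels are skipped silently.
still_fine = frame.drop(columns=['Two'], errors='ignore')
print(list(still_fine.columns))  # ['Max']
```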

The `drop` method can also be used to drop rows. By default, `drop` actually expects you to drop rows, and so you can simply provide an index to drop.

In [29]:
df.drop(0, inplace=True)
df.head()

Unnamed: 0,ID,Measurement Device,Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type,A,Max minus 2
1,726996189,Skykandal,126,69,108.743138,59,Bicycling,A,124
2,206329065,Skykandal,101,91,92.866254,65,Running,A,99
3,756317113,Skykandal,134,101,111.765248,43,Swimming,A,132
4,376747954,Skykandal,154,87,118.287016,82,Running,A,152
5,397302897,Skykandal,175,92,110.573254,59,Running,A,173


However, while column creation and deletion are common in analysis, adding and deleting rows is less useful. Most row deletion happens as part of filtering, which we will cover in an upcoming notebook.

Creating rows is less useful (normally), and is somewhat harder to do. To add a row you create an entire new dataframe and then use the `concat` function to add the dataframes together. We won't cover this further here.
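For the curious, a rough sketch of that approach follows. The frames `workouts` and `new_row` and their values are made up for illustration. The `ignore_index=True` argument asks `pd.concat` to renumber the rows of the result.

```python
import pandas as pd

# A hypothetical existing frame.
workouts = pd.DataFrame({'ID': [1, 2], 'Exercise Type': ['Running', 'Swimming']})

# The new row is itself a one-row dataframe with the same column names.
new_row = pd.DataFrame({'ID': [3], 'Exercise Type': ['Bicycling']})

# Stack the two frames and renumber the index 0, 1, 2.
combined = pd.concat([workouts, new_row], ignore_index=True)
print(combined)
```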

### Creating Dataframes
Creating dataframes can be useful in analysis. You may want to process data and make a new frame from the processed data only. So, how do we make a dataframe? The easiest way is to make a dictionary and then use `pd.DataFrame` to make it into a dataframe.

Why `pd.DataFrame`? `DataFrame` is a class from the pandas library, and calling it like a function builds a new dataframe. The `pd.DataFrame` notation means "DataFrame, from the pandas library" (much like `df.head()` means "head, of the frame df"). This is one reason we import pandas under the short name `pd`: we'll type it a lot when doing analysis with pandas.

In [30]:
test_dict = {'Column One': [1, 2], 'Column Two': [20, 78], 'Column Three': ['C', 'K']}
new_df = pd.DataFrame(test_dict)
new_df.head()

Unnamed: 0,Column One,Column Two,Column Three
0,1,20,C
1,2,78,K


#### Below, make a dictionary with at least three keys. One of the lists of values should be all numbers. Then turn it into a dataframe and use `head` to check that it worked.

In [31]:
# your code goes here


#### Now, create a new column that is your column of numbers divided by two. Again, use `head` to check your work.

In [32]:
# your code goes here


#### Finally, drop one of your columns. As before, use `head` to check that it happened correctly.

In [33]:
# your code goes here


### Combining Dataframes

When you load your data from the workbench, there is a high likelihood that you will have to combine two pre-existing dataframes together to make a new dataframe. For instance, there may be a dataframe representing people, and another dataframe representing medication prescriptions. Before combining these dataframes, each row in the person dataframe will represent a different person. The medication prescriptions dataframe will have one column with a person_id, and other information in other columns with information on medications that someone is prescribed, such as the medication name, and prescription date. To use information from both of these tables at once, you will need to join or merge by the person_id column that exists in both dataframes.
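As a rough sketch of the person and prescription scenario, the two frames below are combined on `person_id` with `pd.merge`, which is explained later in this section. Everything apart from `person_id` is made up for illustration: the frame names `people` and `prescriptions`, the columns `birth_year` and `medication`, and all of the values.

```python
import pandas as pd

# Hypothetical person table: one row per person.
people = pd.DataFrame({
    'person_id': [1, 2, 3],
    'birth_year': [1980, 1975, 1990],
})

# Hypothetical prescription table: one row per prescription.
prescriptions = pd.DataFrame({
    'person_id': [1, 1, 3],
    'medication': ['metformin', 'lisinopril', 'albuterol'],
})

# Match rows from the two tables by their shared person_id column.
combined = pd.merge(people, prescriptions, on='person_id')
print(combined)
```

Person 1 appears twice in the result, once per prescription, and person 2 does not appear because they have no prescriptions in this made-up data.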

We will need some very small dataframes to see this behavior in.

In [31]:
dict_1 = {'ID': [1, 3, 5, 7, 9], 'B': [True, True, False, True, False]}
dict_2 = {'ID': [1, 2, 3, 4, 5], 'C': [57, 89, 23, 12, 65]}
dict_3 = {'ID_num': [1, 2, 3, 4, 5], 'C': [57, 89, 23, 12, 65]}
dict_4 = {'ID': [1, 2, 3, 4, 5], 'B': [57, 89, 23, 12, 65]}

frame_1 = pd.DataFrame(dict_1)
frame_2 = pd.DataFrame(dict_2)
frame_3 = pd.DataFrame(dict_3)
frame_4 = pd.DataFrame(dict_4)

frame_1 and frame_2 are meant to be matched. Each contains some of the same IDs but different variables in each frame. frame_3 is frame_2 but now ID is ID_num. frame_4 is frame_2 but now C is called B.

The simplest way to combine two dataframes is to just stick them together. `concat` will do this. It's not a dataframe method, so we write it `pd.concat` so Python knows to go looking for the function in pandas.

In [32]:
pd.concat([frame_1, frame_2])

Unnamed: 0,ID,B,C
0,1,True,
1,3,True,
2,5,False,
3,7,True,
4,9,False,
0,1,,57.0
1,2,,89.0
2,3,,23.0
3,4,,12.0
4,5,,65.0


The syntax is simple: pass a list of frames to concatenate (add together).
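One thing to notice in a result like the one above is that the index labels repeat, because each frame keeps its own row numbers. The sketch below shows this on two small made-up frames (`left` and `right`) and how `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Two small hypothetical frames that share only the ID column.
left = pd.DataFrame({'ID': [1, 3], 'B': [True, False]})
right = pd.DataFrame({'ID': [1, 2], 'C': [57, 89]})

# Plain concat keeps each frame's own index labels.
stacked = pd.concat([left, right])
print(list(stacked.index))     # [0, 1, 0, 1]
print(len(stacked.loc[0]))     # 2, because two rows carry the label 0

# ignore_index=True gives the result a fresh 0..n-1 index.
renumbered = pd.concat([left, right], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]
```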

#### Below, write code to concatenate three of the frames (for example frame_1, frame_2, and frame_3) together at once.

In [36]:
# your code goes here


There are some other `concat` options, but the issue with `concat` is that items that are the same (like ID 1) just get repeated. We would really like to attach dataframes in such a way that all items with IDs get their data filled in from B and C. To do this we use `pd.merge`. Here, we can only merge two frames at a time, so instead of a list we just pass both frames as arguments.

In [33]:
m = pd.merge(frame_1, frame_2)
m.head()

Unnamed: 0,ID,B,C
0,1,True,57
1,3,True,23
2,5,False,65


This gave us columns ID, B, and C for every ID present in both dataframes (the odd numbers from 1 to 5). Why ID? Because by default `pd.merge` joins on the column names the two frames share, and ID is the only shared name here.

If we attempt to merge frame_1 with frame_4 we do need to specify which column to use. Remember, frame_4 is frame_2, except that column C is called B, which means that frame_1 and frame_4 have two identical column names. If we do nothing this is what we get:

In [34]:
m = pd.merge(frame_1, frame_4)
m.head(10)

Unnamed: 0,ID,B


Since no rows match in both ID and B we get nothing. We should specify that ID is the real match using the `on` keyword.

In [35]:
m = pd.merge(frame_1, frame_4, on='ID')
m.head(10)

Unnamed: 0,ID,B_x,B_y
0,1,True,57
1,3,True,23
2,5,False,65


Note that this relabels the two different column Bs with underscore and a letter (x and y, by default). Columns with different names would not need this relabeling.
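If `_x` and `_y` are not descriptive enough, `pd.merge` takes a `suffixes` argument to choose the labels yourself. The frames and the suffix names `_flag` and `_score` in this sketch are made up for illustration.

```python
import pandas as pd

# Two hypothetical frames that both have a column named B.
left = pd.DataFrame({'ID': [1, 3, 5], 'B': [True, True, False]})
right = pd.DataFrame({'ID': [1, 2, 3], 'B': [57, 89, 23]})

# The first suffix is used for the left frame, the second for the right.
m = pd.merge(left, right, on='ID', suffixes=('_flag', '_score'))
print(list(m.columns))  # ['ID', 'B_flag', 'B_score']
```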

What if the columns hold the same information but don't have the same name? We can name equivalent columns with `left_on` and `right_on`. Left and right refer to the position of the dataframes in the argument list. (That is, left is the first one, right is the second one.)

frame_3 was created to show this. It is frame_2, but with ID_num instead of ID.

In [36]:
m = pd.merge(frame_1, frame_3, left_on='ID', right_on='ID_num')
m.head(10)

Unnamed: 0,ID,B,ID_num,C
0,1,True,1,57
1,3,True,3,23
2,5,False,5,65


We still get both named columns, but we specified that ID on the left (frame_1) was to be matched to ID_num on the right (frame_3) and so we successfully combined the dataframes intelligently.

However, so far we have kept only the IDs that were present in both dataframes. We can change this behavior with the keyword `how`. We have four simple options:
- Inner: This is the default. Only rows whose value in the matching column appears in both frames are included.
- Outer: All rows from both dataframes are included.
- Left: All rows in the left dataframe (the first one you pass to `pd.merge()`) are kept. Rows in the right dataframe that don't match up with any rows in the left dataframe are discarded.
- Right: The opposite of left. All rows from the right dataframe are kept, and rows in the left dataframe are kept only if they match up with rows in the right dataframe.

We simply write `how=` and then the lowercase name of the join type in either single or double quotes.

This concept sometimes takes a little while to get the hang of. "Left" and "Inner" joins are probably the most frequently used in the workbench.

(Source for additional examples: https://www.analyticsvidhya.com/blog/2020/02/joins-in-pandas-master-the-different-types-of-joins-in-python/)

Here's an outer join version of merging frame_1 and frame_2.


In [37]:
m = pd.merge(frame_1, frame_2, how='outer')
m.head(10)

Unnamed: 0,ID,B,C
0,1,True,57.0
1,2,,89.0
2,3,True,23.0
3,4,,12.0
4,5,False,65.0
5,7,True,
6,9,False,


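You'll see that any column with no data, like column B for the row with ID 2, or column C for the row with ID 7, becomes NaN. We'll discuss this more in the next notebook, but NaN ("not a number") is how pandas marks a blank cell. Column C is also shown with decimal points (57.0) because the default integer dtype cannot hold NaN, so pandas converts the column to floats.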
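To compare the four join types side by side, the sketch below merges the same two made-up frames (`left` and `right`) with each `how` option and records which IDs survive. The dictionary name `ids_by_how` is also just for this illustration.

```python
import pandas as pd

# Two small hypothetical frames that share IDs 1 and 3.
left = pd.DataFrame({'ID': [1, 3, 5, 7], 'B': [True, True, False, True]})
right = pd.DataFrame({'ID': [1, 2, 3], 'C': [57, 89, 23]})

# Merge once per join type and keep the sorted IDs of each result.
ids_by_how = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, on='ID', how=how)
    ids_by_how[how] = sorted(merged['ID'])
    print(how, ids_by_how[how])

# inner [1, 3]
# outer [1, 2, 3, 5, 7]
# left [1, 3, 5, 7]
# right [1, 2, 3]
```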

#### Below, run a left merge on frame_1 and frame_3. Remember to specify how to merge them!

In [42]:
# your code goes here


That concludes data structures. Next up, we'll discuss cleaning up issues like these blank cells we created in our merges.