# Data Manipulation
## Tidy Data

Hadley Wickham proposed the term **TIDY DATA**.

>A huge amount of effort is spent cleaning data to get it ready for analysis, but there
has been little research on how to make data cleaning as easy and effective as possible.
This paper tackles a small, but important, component of data cleaning: data tidying.
Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:
each variable is a column, each observation is a row, and each type of observational unit
is a table. This framework makes it easy to tidy messy datasets because only a small
set of tools are needed to deal with a wide range of un-tidy datasets. This structure
also makes it easier to develop tidy tools for data analysis, tools that both input and
output tidy datasets. The advantages of a consistent data structure and matching tools
are demonstrated with a case study free from mundane data manipulation chores. -- Hadley Wickham https://vita.had.co.nz/papers/tidy-data.pdf

Data can be represented and stored in many ways, and how it is organized can greatly affect how easy and efficient later operations are.

**Wickham** suggests that we store data in a form called **tidy data**. The rules for data to be in this form are:

>1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

![tidy data](assets/tidy1.png)
Reference: Image from the online book "R for Data Science" by Hadley Wickham & Garrett Grolemund https://r4ds.had.co.nz/tidy-data.html

When data is in another form, there are tools that help us reshape it into tidy form.
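
As one illustration of such a tool, the sketch below uses `DataFrame.melt` on a small made-up table. The table, its column names (`Major`, `year`, `students`), and its numbers are assumptions invented for this example; they are not part of the course datasets.

```python
import pandas as pd

# Hypothetical "wide" table: one column per year, so the variable "year"
# is hidden in the column headers instead of having its own column.
wide = pd.DataFrame({
    "Major": ["ICCS", "ICBA"],
    "2022": [120, 200],
    "2023": [135, 180],
})

# melt() moves the year columns into rows:
# one row per (Major, year) observation, one column per variable.
tidy = wide.melt(id_vars="Major", var_name="year", value_name="students")
print(tidy)
```

After melting, each of the three tidy-data rules holds: `Major`, `year`, and `students` are separate columns, and every row is a single observation.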


## Before we can tidy data, we need to know the basic methods for manipulating it

### Combining Data Sets

### Adding rows from a dataframe with identical columns

In [1]:
import pandas as pd

In [62]:
df1 = pd.read_csv('./datasets/muic1.csv')
df2 = pd.read_csv('./datasets/muic2.csv')
df3 = pd.read_csv('./datasets/muic3.csv')

In [63]:
row_concat1 = pd.concat([df1,df2,df3])

In [64]:
row_concat1

Unnamed: 0,ID,Name,Major,GPA
0,6780001,Ann Molive,ICCH,3.2
1,6780002,Ben Kenobi,ICJD,
2,6780003,Peter Cruiser,ICBA,3.3
3,6780004,Bob Hanger,ICFS,3.5
4,6780005,Moe Otto,ICCS,3.1
0,6780031,Tom Lee,ICCS,2.7
1,6780032,Pat Cade,ICFS,3.8
2,6780033,Will Power,ICBA,2.7
3,6780034,Arty Didas,ICFS,3.2
4,6780035,Bart Yolo,ICCS,1.7


In [65]:
# Select the rows labelled 3. Notice that
# several rows are returned, since they all match the label 3
row_concat1.loc[3,]

Unnamed: 0,ID,Name,Major,GPA
3,6780004,Bob Hanger,ICFS,3.5
3,6780034,Arty Didas,ICFS,3.2
3,6780024,Luke Warm,ICCS,1.2


In [66]:
# To get only the 4th row, we use iloc
# iloc selects by integer position, so exactly one row is returned
row_concat1.iloc[3,] 

ID           6780004
Name      Bob Hanger
Major           ICFS
GPA              3.5
Name: 3, dtype: object

Concatenating this way keeps each dataframe's original index, so labels repeat and become confusing. We may want to ignore the original index instead.

In [67]:
row_concat2 = pd.concat([df1,df2,df3], ignore_index=True)
row_concat2 # notice the nice running index below

Unnamed: 0,ID,Name,Major,GPA
0,6780001,Ann Molive,ICCH,3.2
1,6780002,Ben Kenobi,ICJD,
2,6780003,Peter Cruiser,ICBA,3.3
3,6780004,Bob Hanger,ICFS,3.5
4,6780005,Moe Otto,ICCS,3.1
5,6780031,Tom Lee,ICCS,2.7
6,6780032,Pat Cade,ICFS,3.8
7,6780033,Will Power,ICBA,2.7
8,6780034,Arty Didas,ICFS,3.2
9,6780035,Bart Yolo,ICCS,1.7
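
The difference between label-based and position-based selection can be checked on a smaller scale. The following is a minimal sketch that uses two made-up frames (`a` and `b`) in place of the CSV files, so it runs without the course datasets.

```python
import pandas as pd

# Two small made-up frames standing in for muic1.csv and muic2.csv.
a = pd.DataFrame({"ID": [1, 2], "Name": ["Ann", "Ben"]})
b = pd.DataFrame({"ID": [3, 4], "Name": ["Tom", "Pat"]})

stacked = pd.concat([a, b])                         # keeps labels 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)   # fresh labels 0, 1, 2, 3

print(stacked.index.is_unique)   # are the row labels unique?
print(stacked.loc[1])            # label 1 matches two rows, so a DataFrame comes back
print(stacked.iloc[1])           # position 1 is a single row, so a Series comes back
print(renumbered.loc[3])         # with a running index, label and position agree
```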


#### Adding a dataframe where columns are not identical?
#### Case I : a subset

In [68]:
# Create a dataframe with fewer columns (to be used next)
narrower_df = pd.DataFrame({"ID":[6781111,6781112],"Name":["Jacky Chow", "Jet Loo"]})
narrower_df

Unnamed: 0,ID,Name
0,6781111,Jacky Chow
1,6781112,Jet Loo


In [69]:
# df1 has         columns: ID,Name,Major,GPA
# narrower_df has columns: ID,Name
dfx = pd.concat([df1,narrower_df])
dfx
# The columns are aligned by name and the rows are concatenated.
# Cells with no data (Major and GPA for the new rows) will be NaN.

Unnamed: 0,ID,Name,Major,GPA
0,6780001,Ann Molive,ICCH,3.2
1,6780002,Ben Kenobi,ICJD,
2,6780003,Peter Cruiser,ICBA,3.3
3,6780004,Bob Hanger,ICFS,3.5
4,6780005,Moe Otto,ICCS,3.1
0,6781111,Jacky Chow,,
1,6781112,Jet Loo,,


#### Case II : totally different

In [70]:
# Create a dataframe with totally different columns (to be used next)
another_df = pd.DataFrame({"Hobby":["Reading","Running"],"Club":["Board Game", "Debate"]})
another_df

Unnamed: 0,Hobby,Club
0,Reading,Board Game
1,Running,Debate


In [71]:
# df1 has         columns: ID,Name,Major,GPA
# another_df has  columns:                  ,Hobby,Club
dfy = pd.concat([df1,another_df])
dfy
# The result has all columns from both dataframes
# For each row, the columns that exist only in the other dataframe will be NaN (i.e. missing)

Unnamed: 0,ID,Name,Major,GPA,Hobby,Club
0,6780001.0,Ann Molive,ICCH,3.2,,
1,6780002.0,Ben Kenobi,ICJD,,,
2,6780003.0,Peter Cruiser,ICBA,3.3,,
3,6780004.0,Bob Hanger,ICFS,3.5,,
4,6780005.0,Moe Otto,ICCS,3.1,,
0,,,,,Reading,Board Game
1,,,,,Running,Debate


#### Case III : some columns are the same

In [72]:
# Create a dataframe that shares some columns with df1 (to be used next)
foo_df = pd.DataFrame({"ID":[6780101,6780102],"Major":["ICCH", "ICCS"], "Weight":[65,59], "Height":[170,180]})
foo_df

Unnamed: 0,ID,Major,Weight,Height
0,6780101,ICCH,65,170
1,6780102,ICCS,59,180


In [73]:
# df1 has     columns: ID,Name,Major,GPA
# foo_df has  columns: ID,     Major,    Weight,Height
dfz = pd.concat([df1,foo_df])
dfz
# Notice the result: the shared columns (ID, Major) hold data from both dfs.

Unnamed: 0,ID,Name,Major,GPA,Weight,Height
0,6780001,Ann Molive,ICCH,3.2,,
1,6780002,Ben Kenobi,ICJD,,,
2,6780003,Peter Cruiser,ICBA,3.3,,
3,6780004,Bob Hanger,ICFS,3.5,,
4,6780005,Moe Otto,ICCS,3.1,,
0,6780101,,ICCH,,65.0,170.0
1,6780102,,ICCS,,59.0,180.0


In [74]:
# What if we only care about columns that appear in both dataframes?
# With join="inner", only columns present in both dfs are kept

# df1 has     columns: ID,Name,Major,GPA
# foo_df has  columns: ID,     Major,    Weight,Height
dfz = pd.concat([df1,foo_df], join = "inner")
dfz
# Notice only ID,Major are kept

Unnamed: 0,ID,Major
0,6780001,ICCH
1,6780002,ICJD
2,6780003,ICBA
3,6780004,ICFS
4,6780005,ICCS
0,6780101,ICCH
1,6780102,ICCS


### inner
`join="inner"` looks at the structure of both dataframes. If a column is available in both, that column will be in the result.

Hence, this is **structural**.

It doesn't look at the values inside the cells. Let's test it: we will set one cell to NaN (missing) and see what happens.


In [75]:
from numpy import nan as NaN  # the NaN alias was removed in NumPy 2.0; we will learn more about NaN in file 06
# Let's force 6780001 (Ann Molive) to have NaN in major field
df1.loc[0,'Major']=NaN
df1


Unnamed: 0,ID,Name,Major,GPA
0,6780001,Ann Molive,,3.2
1,6780002,Ben Kenobi,ICJD,
2,6780003,Peter Cruiser,ICBA,3.3
3,6780004,Bob Hanger,ICFS,3.5
4,6780005,Moe Otto,ICCS,3.1


In [79]:
# df1 has     columns: ID,Name,Major,GPA
# foo_df has  columns: ID,     Major,    Weight,Height
dfz = pd.concat([df1,foo_df], join = "inner")
dfz
# Notice that the result still has NaN in that cell.
# concat doesn't concatenate based on values, but on structure

Unnamed: 0,ID,Major
0,6780001,
1,6780002,ICJD
2,6780003,ICBA
3,6780004,ICFS
4,6780005,ICCS
0,6780101,ICCH
1,6780102,ICCS
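
The same point can be checked programmatically. This is a small sketch with made-up frames (`left` and `right` are illustrative names, not the course data): the columns kept by `join="inner"` depend only on the column labels, and a row containing NaN is not removed.

```python
import pandas as pd
import numpy as np

# Made-up frames: both have ID and Major; each also has one column of its own.
left = pd.DataFrame({"ID": [1, 2], "Major": [np.nan, "ICJD"], "GPA": [3.2, 3.0]})
right = pd.DataFrame({"ID": [3], "Major": ["ICCS"], "Weight": [59]})

both = pd.concat([left, right], join="inner")

# The kept columns are exactly the labels shared by both frames ...
print(sorted(both.columns))
# ... and the row holding NaN in Major is still present.
print(len(both), both["Major"].isna().sum())
```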


### outer
`join="outer"` is the default value.

So, we don't need to specify it. Leaving it out is the same as setting join="outer".

In [82]:
# df1 has     columns: ID,Name,Major,GPA
# foo_df has  columns: ID,     Major,    Weight,Height
dfz = pd.concat([df1,foo_df], join = "outer")
dfz

Unnamed: 0,ID,Name,Major,GPA,Weight,Height
0,6780001,Ann Molive,,3.2,,
1,6780002,Ben Kenobi,ICJD,,,
2,6780003,Peter Cruiser,ICBA,3.3,,
3,6780004,Bob Hanger,ICFS,3.5,,
4,6780005,Moe Otto,ICCS,3.1,,
0,6780101,,ICCH,,65.0,170.0
1,6780102,,ICCS,,59.0,180.0
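
A quick way to confirm that the default really is the outer join is to compare both results. The sketch below uses small made-up frames rather than the course data.

```python
import pandas as pd

# Made-up frames that share only the ID column.
left = pd.DataFrame({"ID": [1, 2], "GPA": [3.2, 3.5]})
right = pd.DataFrame({"ID": [3], "Weight": [59]})

default_join = pd.concat([left, right])
explicit_outer = pd.concat([left, right], join="outer")

# DataFrame.equals treats NaN values in the same position as equal,
# so it can compare results that contain missing cells.
print(default_join.equals(explicit_outer))
```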


### ADDING COLUMNS

In [83]:
df1 = pd.read_csv("./datasets/concat_1.csv")
df2 = pd.read_csv("./datasets/concat_2.csv")
df3 = pd.read_csv("./datasets/concat_3.csv")

In [20]:
df1

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3


In [21]:
df2

Unnamed: 0,A,B,C,D
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7


In [22]:
df3

Unnamed: 0,A,B,C,D
0,a8,b8,c8,d8
1,a9,b9,c9,d9
2,a10,b10,c10,d10
3,a11,b11,c11,d11


In [87]:
# calling concat will just append rows below as we have seen earlier
row_concat = pd.concat([df1,df2,df3])
row_concat

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7
0,a8,b8,c8,d8
1,a9,b9,c9,d9


### What if we want to add along the column direction?
#### Adding all columns by concat with axis = 1
axis 0 is "row" (the default) and axis 1 is "column".
With axis=1, data with the same row labels will be in the same row.

In [84]:
col_concat = pd.concat([df1,df2,df3], axis=1)
col_concat

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2
0,a0,b0,c0,d0,a4,b4,c4,d4,a8,b8,c8,d8
1,a1,b1,c1,d1,a5,b5,c5,d5,a9,b9,c9,d9
2,a2,b2,c2,d2,a6,b6,c6,d6,a10,b10,c10,d10
3,a3,b3,c3,d3,a7,b7,c7,d7,a11,b11,c11,d11
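
As a smaller illustration of `axis=1`, the sketch below places two made-up frames side by side. It also shows the `keys` argument of `pd.concat`, which is one possible way to keep repeated column labels distinguishable; the key names `"first"` and `"second"` are invented for this example.

```python
import pandas as pd

top = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
bottom = pd.DataFrame({"A": ["a4", "a5"], "B": ["b4", "b5"]})

side_by_side = pd.concat([top, bottom], axis=1)
print(side_by_side.shape)           # same number of rows, twice the columns
print(list(side_by_side.columns))   # the labels A and B each appear twice

# keys= adds an outer level to the column labels, one key per source frame.
labelled = pd.concat([top, bottom], axis=1, keys=["first", "second"])
print(labelled["second"]["A"].tolist())
```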


#### Concatenating dataframes with different column labels

In [91]:
# First, let's rename the column names
df1.columns = ['A','B','C','D']
df2.columns = ['D','E','F','G']
df3.columns = ['A','D','G','H']

In [92]:
df1

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3


In [93]:
df2

Unnamed: 0,D,E,F,G
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7


In [94]:
df3

Unnamed: 0,A,D,G,H
0,a8,b8,c8,d8
1,a9,b9,c9,d9
2,a10,b10,c10,d10
3,a11,b11,c11,d11


In [95]:
# df1 has columns: A,B,C,D
# df2 has columns:       D,E,F,G
# df3 has columns: A,    D,    G,H
row_concat = pd.concat([df1,df2,df3])
row_concat

# Notice what happens when a column appears in several dataframes (e.g. D): it is filled from each of them

Unnamed: 0,A,B,C,D,E,F,G,H
0,a0,b0,c0,d0,,,,
1,a1,b1,c1,d1,,,,
2,a2,b2,c2,d2,,,,
3,a3,b3,c3,d3,,,,
0,,,,a4,b4,c4,d4,
1,,,,a5,b5,c5,d5,
2,,,,a6,b6,c6,d6,
3,,,,a7,b7,c7,d7,
0,a8,,,b8,,,c8,d8
1,a9,,,b9,,,c9,d9


In [97]:
# df1 has columns: A,B,C,D
# df2 has columns:       D,E,F,G
# df3 has columns: A,    D,    G,H
row_concat = pd.concat([df1,df2,df3], join = 'inner')
row_concat

# Notice that all dataframes have column D

Unnamed: 0,D
0,d0
1,d1
2,d2
3,d3
0,a4
1,a5
2,a6
3,a7
0,b8
1,b9


#### In what situation can we use this method?
Sometimes we have dataframes that carry different sets of columns. If they share the important columns, we may want to use the "inner" join to eliminate the extra columns from each dataframe.

In [98]:
# Create a dataframe to be used next
addr1_df = pd.DataFrame({"Number":["114","153/1","102","945"], "Building":["A","Beach Front","Main bldg","C"],"Street":["Sai 4","Rama IX","Sai 3","Baromratchonnee"],"City":["Salaya","Payathai","Salaya","Aroonamarin"],"District":["Taweewattana","Ratchtevee","Taweewattana","Bangkok Noi"], "Province":["Bangkok","Bangkok","Bangkok","Bangkok"],"ZIP":[73710,10400,73710,10700]})

addr1_df

Unnamed: 0,Number,Building,Street,City,District,Province,ZIP
0,114,A,Sai 4,Salaya,Taweewattana,Bangkok,73710
1,153/1,Beach Front,Rama IX,Payathai,Ratchtevee,Bangkok,10400
2,102,Main bldg,Sai 3,Salaya,Taweewattana,Bangkok,73710
3,945,C,Baromratchonnee,Aroonamarin,Bangkok Noi,Bangkok,10700


In [99]:
# Create a dataframe to be used next
addr2_df = pd.DataFrame({"Number":["208","2491","327","123"],"Street":["Rama VI","Rama I","Sukhumvit","Charansanitwong"],"City":["SamSen Nai","Bangrug","Prakanong","Bangplud"],"District":["Phayathai","Ratchtevee","Taweewattana","Bangkok Noi"], "Province":["Bangkok","Bangkok","Bangkok","Bangkok"],"Country":["Thailand","Thailand","Thailand","Thailand"],"ZIP":[10100,10400,10600,10200]})

addr2_df

Unnamed: 0,Number,Street,City,District,Province,Country,ZIP
0,208,Rama VI,SamSen Nai,Phayathai,Bangkok,Thailand,10100
1,2491,Rama I,Bangrug,Ratchtevee,Bangkok,Thailand,10400
2,327,Sukhumvit,Prakanong,Taweewattana,Bangkok,Thailand,10600
3,123,Charansanitwong,Bangplud,Bangkok Noi,Bangkok,Thailand,10200


In [32]:
# If we concat with default join ("outer"), we will get many fields with NaN
addr_df = pd.concat([addr1_df,addr2_df])
addr_df

Unnamed: 0,Number,Building,Street,City,District,Province,ZIP,Country
0,114,A,Sai 4,Salaya,Taweewattana,Bangkok,73710,
1,153/1,Beach Front,Rama IX,Payathai,Ratchtevee,Bangkok,10400,
2,102,Main bldg,Sai 3,Salaya,Taweewattana,Bangkok,73710,
3,945,C,Baromratchonnee,Aroonamarin,Bangkok Noi,Bangkok,10700,
0,208,,Rama VI,SamSen Nai,Phayathai,Bangkok,10100,Thailand
1,2491,,Rama I,Bangrug,Ratchtevee,Bangkok,10400,Thailand
2,327,,Sukhumvit,Prakanong,Taweewattana,Bangkok,10600,Thailand
3,123,,Charansanitwong,Bangplud,Bangkok Noi,Bangkok,10200,Thailand


In [33]:
# With join="inner", only the shared columns remain, so the result is more concise
addr_df = pd.concat([addr1_df,addr2_df], join="inner")
addr_df

Unnamed: 0,Number,Street,City,District,Province,ZIP
0,114,Sai 4,Salaya,Taweewattana,Bangkok,73710
1,153/1,Rama IX,Payathai,Ratchtevee,Bangkok,10400
2,102,Sai 3,Salaya,Taweewattana,Bangkok,73710
3,945,Baromratchonnee,Aroonamarin,Bangkok Noi,Bangkok,10700
0,208,Rama VI,SamSen Nai,Phayathai,Bangkok,10100
1,2491,Rama I,Bangrug,Ratchtevee,Bangkok,10400
2,327,Sukhumvit,Prakanong,Taweewattana,Bangkok,10600
3,123,Charansanitwong,Bangplud,Bangkok Noi,Bangkok,10200
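
Before choosing the inner join, it can help to see which columns would be kept and which would be lost. The following sketch does this with `Index.intersection` and `Index.symmetric_difference` on two made-up, one-row address frames (`addr_a` and `addr_b` are illustrative names).

```python
import pandas as pd

addr_a = pd.DataFrame({"Number": ["114"], "Building": ["A"], "ZIP": [73710]})
addr_b = pd.DataFrame({"Number": ["208"], "Country": ["Thailand"], "ZIP": [10100]})

# Columns present in both frames: these survive join="inner".
shared = addr_a.columns.intersection(addr_b.columns)
# Columns present in only one frame: these are dropped by join="inner".
dropped = addr_a.columns.symmetric_difference(addr_b.columns)
print(sorted(shared), sorted(dropped))

# ignore_index=True additionally gives the combined rows a running index.
combined = pd.concat([addr_a, addr_b], join="inner", ignore_index=True)
print(combined)
```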


#### Concatenate Columns With Different Row Labels

In [101]:
df1.index = [0,1,2,3]
df2.index = [2,3,4,5]
df3.index = [2,3,6,7]

In [105]:
# df1 has rows: 0,1,2,3
# df2 has rows:     2,3,4,5
# df3 has rows:     2,3,   6,7
col_concat = pd.concat([df1,df2,df3],axis=1)
col_concat

# notice that the row labels indicate what data will be in that row
# At the end, we have row 0,1,2,3,4,5,6,7

# Also, notice the column labels, many of them have the same name !!!

Unnamed: 0,A,B,C,D,D.1,E,F,G,A.1,D.2,G.1,H
0,a0,b0,c0,d0,,,,,,,,
1,a1,b1,c1,d1,,,,,,,,
2,a2,b2,c2,d2,a4,b4,c4,d4,a8,b8,c8,d8
3,a3,b3,c3,d3,a5,b5,c5,d5,a9,b9,c9,d9
4,,,,,a6,b6,c6,d6,,,,
5,,,,,a7,b7,c7,d7,,,,
6,,,,,,,,,a10,b10,c10,d10
7,,,,,,,,,a11,b11,c11,d11


In [36]:
# select columns with label "A" out
col_concat["A"]

Unnamed: 0,A,A.1
0,a0,
1,a1,
2,a2,a8
3,a3,a9
4,,
5,,
6,,a10
7,,a11


In [106]:
# set join = "inner"
# if we want only rows that appear in all dataframes
col_concat = pd.concat([df1,df2,df3],axis=1,join="inner")
col_concat

Unnamed: 0,A,B,C,D,D.1,E,F,G,A.1,D.2,G.1,H
2,a2,b2,c2,d2,a4,b4,c4,d4,a8,b8,c8,d8
3,a3,b3,c3,d3,a5,b5,c5,d5,a9,b9,c9,d9
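
With `axis=1`, the `join` argument applies to the row labels instead of the column labels. The sketch below shows this on two made-up single-column frames whose indexes only partly overlap.

```python
import pandas as pd

x = pd.DataFrame({"A": ["a0", "a1", "a2"]}, index=[0, 1, 2])
y = pd.DataFrame({"B": ["b1", "b2", "b3"]}, index=[1, 2, 3])

outer = pd.concat([x, y], axis=1)                 # union of the row labels
inner = pd.concat([x, y], axis=1, join="inner")   # intersection of the row labels

print(sorted(outer.index))
print(sorted(inner.index))
print(inner)
```

Rows 0 and 3 exist in only one frame, so they appear (with NaN) in the outer result and are absent from the inner result.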


## Merging - Combining dataframes based on values
There will be cases where we need to merge data in a more specific way. Pandas provides many functions to do this.

We will use data from the reference textbook of this course, "Pandas for Everyone".

In [107]:
person = pd.read_csv("./datasets/survey_person.csv")
site = pd.read_csv("./datasets/survey_site.csv")
survey = pd.read_csv("./datasets/survey_survey.csv")
visited = pd.read_csv("./datasets/survey_visited.csv")

In [108]:
person

Unnamed: 0,ident,personal,family
0,dyer,William,Dyer
1,pb,Frank,Pabodie
2,lake,Anderson,Lake
3,roe,Valentina,Roerich
4,danforth,Frank,Danforth


In [109]:
site

Unnamed: 0,name,lat,long
0,DR-1,-49.85,-128.57
1,DR-3,-47.15,-126.72
2,MSK-4,-48.87,-123.4


In [110]:
survey

Unnamed: 0,taken,person,quant,reading
0,619,dyer,rad,9.82
1,619,dyer,sal,0.13
2,622,dyer,rad,7.8
3,622,dyer,sal,0.09
4,734,pb,rad,8.41
5,734,lake,sal,0.05
6,734,pb,temp,-21.5
7,735,pb,rad,7.22
8,735,,sal,0.06
9,735,,temp,-26.0


In [111]:
visited

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,735,DR-3,1930-01-12
4,751,DR-3,1930-02-26
5,752,DR-3,
6,837,MSK-4,1932-01-14
7,844,DR-1,1932-03-22


| Join | Description |
|---|---|
| `left` | _Keep all the keys from the left_ |
| `right` | _Keep all the keys from the right_ |
| `outer` | _Keep all the keys from both left and right_ |
| `inner` | _Keep only keys that exist in both left and right_ |
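
The four join types are selected with the `how` argument of `merge` (the default is `"inner"`). As a minimal sketch, the made-up frames below share only the key `"b"`, which makes it easy to see which keys each join type keeps.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "left_val": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "right_val": [20, 30]})

keys_kept = {}
for how in ["left", "right", "outer", "inner"]:
    merged = left.merge(right, on="key", how=how)
    keys_kept[how] = sorted(merged["key"])
    print(how, keys_kept[how])
```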

### One-to-One Merge
The column we want to join on has no duplicated values.
Each value in the join-column of the left dataframe occurs once.
Each value in the join-column of the right dataframe occurs once.

Notice each value of the **name** column of the **site** dataframe occurs only once.

Notice each value of the **site** column of the **visited_subset** dataframe occurs only once.

In [113]:
# Let's select only some rows, so it is easier to observe changes
visited_subset = visited.loc[[0,2,6],]
visited_subset

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
2,734,DR-3,1939-01-07
6,837,MSK-4,1932-01-14


In [44]:
# Notice the name column here
site

Unnamed: 0,name,lat,long
0,DR-1,-49.85,-128.57
1,DR-3,-47.15,-126.72
2,MSK-4,-48.87,-123.4


In [114]:
# LEFT  dataframe: site           : has columns:       name,lat,long
# RIGHT dataframe: visited_subset : has columns: ident,site,         dated
#
# Note: the two dataframes use different column names to store the site name.
# In terms of values: LEFT->name is the same as RIGHT->site
one2one_merge = site.merge(visited_subset, left_on="name", right_on="site")
one2one_merge

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
2,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14
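
If we expect a merge to be one-to-one, pandas can check that assumption for us through the `validate` argument of `merge`. The sketch below uses made-up frames (`site_demo` and `visits_demo` are illustrative names) in which one site is visited twice, so the check fails.

```python
import pandas as pd

site_demo = pd.DataFrame({"name": ["DR-1", "DR-3"], "lat": [-49.85, -47.15]})
# DR-1 appears twice here, so the join column on the right is not unique.
visits_demo = pd.DataFrame({"ident": [619, 622], "site": ["DR-1", "DR-1"]})

try:
    site_demo.merge(visits_demo, left_on="name", right_on="site",
                    validate="one_to_one")
    is_one_to_one = True
except pd.errors.MergeError:
    # Raised when the keys are not unique on both sides.
    is_one_to_one = False

print(is_one_to_one)
```

Without `validate`, the same merge would run silently and return one row per match.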


### Many-to-One Merge
One of the dataframes has repeated values.

Notice (in the cell below) some values of the **site** column of the **visited** dataframe occur more than once (such as DR-3).

Notice each value of the **name** column of the **site** dataframe occurs only once (such as DR-3 here appears only once).



In [119]:
# Let's modify visited a bit; we will see the effect next
visited.loc[2,'site'] = NaN
visited

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,,1939-01-07
3,735,DR-3,1930-01-12
4,751,DR-3,1930-02-26
5,752,DR-3,
6,837,MSK-4,1932-01-14
7,844,DR-1,1932-03-22


In [121]:
site

Unnamed: 0,name,lat,long
0,DR-1,-49.85,-128.57
1,DR-3,-47.15,-126.72
2,MSK-4,-48.87,-123.4


In [124]:
# LEFT: visited has columns: ident,site,dated
# RIGHT: site has   columns:                  name,lat, long
Many2One_merge = visited.merge(site, left_on="site", right_on='name')
Many2One_merge

# Notice the result will have columns: ident,site,dated,name,lat,long
# For each row in LEFT, it'll look at the value in column "site" of that row.
#     With this value, it will look into the table on the RIGHT.
#     If there is a match, the row from the right will be added (=more columns)
#     If there is no match (such as the row where site is NaN), it won't show up here.

Unnamed: 0,ident,site,dated,name,lat,long
0,619,DR-1,1927-02-08,DR-1,-49.85,-128.57
1,622,DR-1,1927-02-10,DR-1,-49.85,-128.57
2,844,DR-1,1932-03-22,DR-1,-49.85,-128.57
3,735,DR-3,1930-01-12,DR-3,-47.15,-126.72
4,751,DR-3,1930-02-26,DR-3,-47.15,-126.72
5,752,DR-3,,DR-3,-47.15,-126.72
6,837,MSK-4,1932-01-14,MSK-4,-48.87,-123.4
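
The unmatched row disappears because `merge` uses `how="inner"` by default. As a sketch of one way to keep and identify such rows, the example below uses `how="left"` together with `indicator=True` on made-up frames (`visits_demo` and `sites_demo` are illustrative names).

```python
import pandas as pd
import numpy as np

visits_demo = pd.DataFrame({"ident": [619, 734, 735],
                            "site": ["DR-1", np.nan, "DR-3"]})
sites_demo = pd.DataFrame({"name": ["DR-1", "DR-3"], "lat": [-49.85, -47.15]})

# Default (inner) merge: the row with the missing site has no match and is dropped.
default_merge = visits_demo.merge(sites_demo, left_on="site", right_on="name")

# Left merge: every row of the left frame is kept.
# indicator=True adds a _merge column telling where each row came from.
kept_all = visits_demo.merge(sites_demo, left_on="site", right_on="name",
                             how="left", indicator=True)

print(len(default_merge), len(kept_all))
print(kept_all[["ident", "_merge"]])
```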


### One-to-Many Merge
The concept is the same; we just swap the order of the dataframes.

In [127]:
# LEFT: site has columns: name,lat, long 
# RIGHT: visited has columns: ident,site,dated
One2Many_merge = site.merge(visited, left_on="name", right_on='site')
One2Many_merge

# For each row in LEFT, it'll look at the value in column "name" of that row.
#     With this value, it will look into the table on the RIGHT.
#     If there is a match, the row from the right will be added (=more columns)
#     If there is no match, it won't show up here.
# !!!!
# Here, there are multiple matches, since the value on the right appears in more than one row.
# Each successful match becomes one row.


Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-1,-49.85,-128.57,622,DR-1,1927-02-10
2,DR-1,-49.85,-128.57,844,DR-1,1932-03-22
3,DR-3,-47.15,-126.72,735,DR-3,1930-01-12
4,DR-3,-47.15,-126.72,751,DR-3,1930-02-26
5,DR-3,-47.15,-126.72,752,DR-3,
6,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14


### Many-to-Many Merge
The data on both sides contain duplicated values.

Let's start with a Many-to-One

In [141]:
# Create a dataframe to be used next
# Notice that many students take the same course

enrollment_df = pd.DataFrame({"ID":[1001,1002],"Course":["ICCS161","ICCS161"]})
enrollment_df

Unnamed: 0,ID,Course
0,1001,ICCS161
1,1002,ICCS161


In [143]:
# Create another dataframe to be used next
# Notice that one course has many exams

exam_df = pd.DataFrame({"Course":["ICCS161","ICCS161"],"Exam":["Midterm","Final"]})
exam_df

Unnamed: 0,Course,Exam
0,ICCS161,Midterm
1,ICCS161,Final


In [144]:
# MANY-TO-MANY
# LEFT: enrollment_df 
# RIGHT: exam_df
Many2Many_merge = enrollment_df.merge(exam_df, left_on="Course",right_on="Course")
Many2Many_merge

# For each row in LEFT, it'll look at the value in column "Course" of that row.
#     With this value, it will look into the table on the RIGHT.
#     If there is a match, the row from the right will be added (=more columns)
#     If there is no match, it won't show up here.
# !!!!
# Here, there are multiple matches, since the value on the right appears in more than one row.
# Each successful match becomes one row.

# This is called Many-to-Many because
# MANY students enroll in the same course
# one course has MANY exams (midterm & final)
#
# Therefore, when we think about  STUDENTS---EXAMS
# That's many to many


Unnamed: 0,ID,Course,Exam
0,1001,ICCS161,Midterm
1,1001,ICCS161,Final
2,1002,ICCS161,Midterm
3,1002,ICCS161,Final
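
Because every match becomes a row, a many-to-many merge can grow quickly. The sketch below, on made-up frames (`enroll` and `exams` are illustrative names), predicts the number of rows before merging: for each key, multiply the number of occurrences on the left by the number on the right, then add the products up.

```python
import pandas as pd

enroll = pd.DataFrame({"ID": [1001, 1002, 1003],
                       "Course": ["ICCS161", "ICCS161", "ICCS101"]})
exams = pd.DataFrame({"Course": ["ICCS161", "ICCS161", "ICCS101"],
                      "Exam": ["Midterm", "Final", "Final"]})

# on="Course" is shorthand for left_on="Course", right_on="Course".
merged = enroll.merge(exams, on="Course")

# Per course: (students enrolled) x (exams), summed over all courses.
left_counts = enroll["Course"].value_counts()
right_counts = exams["Course"].value_counts()
expected_rows = int((left_counts * right_counts).sum())

print(len(merged), expected_rows)
```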
