# Combining and Merging Datasets

Data contained in pandas objects can be combined in a number of ways.



## Database-Style DataFrame Joins


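`pd.merge` links rows from two DataFrames by matching the values in one or more key columns, in the same way as a join in a relational database.
Consider the following example.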


In [2]:
import numpy as np
import pandas as pd
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

```python
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df1
```

In [3]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


```python
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})
df2
```

In [4]:
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


To merge two DataFrames on the column they have in common, call `pd.merge`:
```python
pd.merge(df1,df2)
```

In [5]:
pd.merge(df1,df2)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


When no key is specified, `pd.merge` uses the columns whose names appear in both DataFrames as the keys. By default only the key values found in both DataFrames are kept, and each matching row on the left is paired with every matching row on the right, so repeated keys produce the Cartesian product of the matching rows.
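
To make the Cartesian-product behaviour concrete, here is a minimal sketch. The frames `many_left` and `many_right` are hypothetical and not part of the notebook's data; the key `b` appears three times on one side and twice on the other.

```python
import pandas as pd

# Hypothetical frames: 'b' occurs 3 times on the left and 2 times on the right.
many_left = pd.DataFrame({'key': ['b', 'b', 'b', 'a'], 'lval': [0, 1, 2, 3]})
many_right = pd.DataFrame({'key': ['b', 'b', 'a', 'd'], 'rval': [10, 11, 12, 13]})

many_merged = pd.merge(many_left, many_right, on='key')

# Every left 'b' row is paired with every right 'b' row: 3 * 2 = 6 rows.
n_b = int((many_merged['key'] == 'b').sum())
# 'a' occurs once on each side, so it contributes 1 * 1 = 1 row.
n_a = int((many_merged['key'] == 'a').sum())
print(n_b, n_a, len(many_merged))
```

Because row counts multiply in this way, it is worth checking how often each key repeats before merging large tables.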

It is better practice to specify the key explicitly with the `on` argument:

```python
pd.merge(df1,df2,on='key')
```

In [6]:
pd.merge(df1,df2,on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


We can also merge on key columns that have different names in each DataFrame. First create two such DataFrames:

```python
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
```

In [7]:
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})

In [12]:
df3

Unnamed: 0,lkey,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [13]:
df4

Unnamed: 0,rkey,data2
0,a,0
1,b,1
2,d,2


### Work
From the given `files/studentlist.xlsx`, an Excel file that holds the students' scores, create a DataFrame that contains the total Attendance (Midterm) score and the midterm score.

* To build a new DataFrame from selected columns of an existing one, you may use `new = old[['A', 'C', 'D']].copy()`.
* To rename a specific column, you may use `rankings_pd.rename(columns = {'test':'TEST'}, inplace = True)`.

Note: you do not have to set the student ID column (`รหัสนักศึกษา`) as the index in this exercise.
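
The two hints above can be tried on a small stand-in table first. This is only a sketch: the DataFrame `toy_scores`, its column names, and its values are made up and do not come from the spreadsheet.

```python
import pandas as pd

# Made-up stand-in for one sheet of the spreadsheet.
toy_scores = pd.DataFrame({
    'sid': [1, 2, 3],
    'week1': [5, 0, 9],
    'week2': [5, 0, 8],
    'Total (5)': [2.5, 0.0, 4.0],
})

# .copy() gives an independent DataFrame, so later changes do not touch toy_scores.
toy_attendance = toy_scores[['sid', 'Total (5)']].copy()

# rename returns a new DataFrame when inplace is not used.
toy_attendance = toy_attendance.rename(columns={'Total (5)': 'attendance_midterm'})
print(toy_attendance)
```

Reassigning the result of `rename` instead of using `inplace=True` keeps each step explicit; both forms give the same column names.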

In [15]:
dat = pd.read_excel('files/studentlist.xlsx',header=6)
dat

Unnamed: 0,ที่,รหัสนักศึกษา,2,3,4,5,6,7,Total (5)
0,1,582115509,5,5,5,5,5,5,2.500000
1,2,592115508,5,5,5,5,5,5,2.500000
2,3,592115521,5,5,5,5,5,5,2.500000
3,4,602115001,0,0,0,0,0,0,0.000000
4,5,602115002,9,8,7,9,7,8,4.000000
...,...,...,...,...,...,...,...,...,...
30,31,602115520,0,8,7,6,6,5,2.666667
31,32,602115521,10,7,7,6,7,8,3.750000
32,33,602115522,10,9,0,6,0,7,2.666667
33,34,602115524,7,5,6,7,0,6,2.583333


In [16]:
dat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ที่           35 non-null     int64  
 1   รหัสนักศึกษา  35 non-null     int64  
 2   2             35 non-null     int64  
 3   3             35 non-null     int64  
 4   4             35 non-null     int64  
 5   5             35 non-null     int64  
 6   6             35 non-null     int64  
 7   7             35 non-null     int64  
 8   Total (5)     35 non-null     float64
dtypes: float64(1), int64(8)
memory usage: 2.6 KB


Check df3 and df4
```python
df3
```

In [17]:
df3

Unnamed: 0,lkey,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


```python
df4
```

In [18]:
df4

Unnamed: 0,rkey,data2
0,a,0
1,b,1
2,d,2


Then try to merge `df3` and `df4` without specifying a key:
```python
pd.merge(df3,df4)
```

In [19]:
pd.merge(df3,df4)

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

The merge fails because the two DataFrames have no column name in common.
When the key columns have different names, specify them with the `left_on` and `right_on` arguments:
```python
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
```

In [20]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


Or, with the two DataFrames swapped:
```python
pd.merge(df4, df3,  right_on='lkey',left_on='rkey')
```

In [21]:
pd.merge(df4, df3,  right_on='lkey',left_on='rkey')

Unnamed: 0,rkey,data2,lkey,data1
0,a,0,a,2
1,a,0,a,4
2,a,0,a,5
3,b,1,b,0
4,b,1,b,1
5,b,1,b,6


Notice that `c` and `d` are missing from the result. By default `pd.merge` performs an inner join, so only keys that appear in both DataFrames are kept.
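
One way to check which keys failed to match is the `indicator` argument of `pd.merge`. The sketch below rebuilds `df1` and `df2` with the same values as above and uses `how='outer'` (covered next) so that the unmatched rows are kept and labelled in a `_merge` column.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
checked = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Rows that did not find a partner in the other DataFrame.
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['key', '_merge']])
```

Here `c` is reported as `left_only` and `d` as `right_only`, which is why neither appears in the inner join.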

To keep the keys from both DataFrames, pass `how='outer'`:
```python
pd.merge(df1,df2,how='outer')
```

In [22]:
pd.merge(df1,df2,how='outer')

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0


Try the other values of `how` to see how the results differ:
![image-20230915053652812](./assets/image-20230915053652812.png)
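
As a quick comparison, the sketch below runs all four join types on the same `df1` and `df2` (rebuilt here with the values used above) and records how many rows each one returns.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged_how = pd.merge(df1, df2, on='key', how=how)
    row_counts[how] = len(merged_how)
    # Show the row count and which keys survive each join type.
    print(how, len(merged_how), sorted(merged_how['key'].unique()))
```

The inner join has the fewest rows and the outer join the most, because the outer join keeps both `c` (left only) and `d` (right only).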


An inner join keeps the intersection of the keys:
```python
pd.merge(df1,df2,how='inner')
```

In [23]:
pd.merge(df1,df2,how='inner')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


A left join keeps every row of the left DataFrame and fills the columns from the right with `NaN` where no match exists:
```python
pd.merge(df1,df2,how='left')
```

In [24]:
pd.merge(df1,df2,how='left')

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,1,1.0
2,a,2,0.0
3,c,3,
4,a,4,0.0
5,a,5,0.0
6,b,6,1.0


We can merge on multiple keys by passing a list of column names.
First create these DataFrames:
```python
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
```

In [25]:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

Then merge on both keys:
```python
pd.merge(left,right,on=['key1','key2'],how='outer')
```

In [26]:
pd.merge(left,right,on=['key1','key2'],how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


Sometimes the two DataFrames share a column name that is not part of the merge key, so the names collide.
You can either rename the columns before merging or set the suffixes that pandas appends to the overlapping names.
The following example shows the problem; by default pandas appends `_x` and `_y`:
```python
pd.merge(left, right, on='key1')
```

In [27]:
pd.merge(left, right, on='key1')

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


The suffixes can be set with the `suffixes` argument:
```python
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
```

In [28]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7
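
The other option mentioned above, renaming the overlapping column before merging, can be sketched as follows. The new name `right_key2` is an arbitrary choice for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Rename the non-key column that would collide; 'right_key2' is a made-up name.
right_renamed = right.rename(columns={'key2': 'right_key2'})

merged_renamed = pd.merge(left, right_renamed, on='key1')
print(merged_renamed.columns.tolist())
```

With no overlapping names left, pandas does not add any suffix, and the left `key2` column keeps its original name.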


### Work
Some students have withdrawn, so they do not have a score. In addition, the staff changed some of the column names.
Add the attendance midterm score to the DataFrame from the previous exercise.


In [47]:
df = pd.read_excel('files/studentlist.xlsx',header = 6, sheet_name= 'Attendance (final)')
df

Unnamed: 0,ที่,sid,9,10,11,12,13,Total (5)
0,1,582115509,5,5,5,5,10,3.0
1,2,592115508,5,5,5,5,5,2.5
2,3,592115521,5,7,5,5,10,3.2
3,4,602115001,0,0,0,0,0,0.0
4,5,602115002,6,8,8,7,10,3.9
...,...,...,...,...,...,...,...,...
28,30,602115519,5,5,7,7,10,3.4
29,31,602115520,5,5,9,0,10,2.9
30,33,602115522,7,8,8,7,10,4.0
31,34,602115524,8,7,9,0,0,2.4


In [48]:
df['รหัสนักศึกษา'] = df['sid']
df

Unnamed: 0,ที่,sid,9,10,11,12,13,Total (5),รหัสนักศึกษา
0,1,582115509,5,5,5,5,10,3.0,582115509
1,2,592115508,5,5,5,5,5,2.5,592115508
2,3,592115521,5,7,5,5,10,3.2,592115521
3,4,602115001,0,0,0,0,0,0.0,602115001
4,5,602115002,6,8,8,7,10,3.9,602115002
...,...,...,...,...,...,...,...,...,...
28,30,602115519,5,5,7,7,10,3.4,602115519
29,31,602115520,5,5,9,0,10,2.9,602115520
30,33,602115522,7,8,8,7,10,4.0,602115522
31,34,602115524,8,7,9,0,0,2.4,602115524


In [49]:
df_total = pd.read_excel('files/studentlist.xlsx',header = 6, sheet_name= 'Final')
df_total

Unnamed: 0,ที่,รหัสนักศึกษา,1,2,3,4,Total,Total (25)
0,1.0,582115509.0,6.0,2.0,5.0,7.5,20.500000,8.541667
1,2.0,592115508.0,5.0,4.0,8.0,9,26.000000,10.833333
2,3.0,592115521.0,6.0,6.0,10.0,12,34.000000,14.166667
3,4.0,602115001.0,7.0,5.0,12.0,11.5,35.500000,14.791667
4,5.0,602115002.0,6.0,4.0,7.0,14.5,31.500000,13.125000
...,...,...,...,...,...,...,...,...
35,,,,,,,,
36,,,,,,Max,48.000000,20.000000
37,,,,,,Min,0.000000,0.000000
38,,,,,,Average,27.928571,11.636905


In [50]:
total_table = pd.merge(df,df_total,on = 'รหัสนักศึกษา')
column = ['รหัสนักศึกษา','Total (5)','Total  (25)']
total_table = total_table[column]
total_table

Unnamed: 0,รหัสนักศึกษา,Total (5),Total (25)
0,582115509,3.0,8.541667
1,592115508,2.5,10.833333
2,592115521,3.2,14.166667
3,602115001,0.0,14.791667
4,602115002,3.9,13.125000
...,...,...,...
28,602115519,3.4,10.000000
29,602115520,2.9,11.041667
30,602115522,4.0,17.291667
31,602115524,2.4,8.958333


In [51]:
total_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   รหัสนักศึกษา  33 non-null     int64  
 1   Total (5)     33 non-null     float64
 2   Total  (25)   33 non-null     float64
dtypes: float64(2), int64(1)
memory usage: 924.0 bytes


In [52]:
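# Replace 0 with NaN in one column and reassign, instead of calling replace(inplace=True) on a selected column
total_table = total_table.replace({'Total (5)': {0: np.nan}})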

In [53]:
total_table.isnull().sum()

รหัสนักศึกษา    0
Total (5)       4
Total  (25)     0
dtype: int64

In [54]:
total = total_table.dropna()
total

Unnamed: 0,รหัสนักศึกษา,Total (5),Total (25)
0,582115509,3.0,8.541667
1,592115508,2.5,10.833333
2,592115521,3.2,14.166667
4,602115002,3.9,13.125000
5,602115003,0.7,12.291667
...,...,...,...
28,602115519,3.4,10.000000
29,602115520,2.9,11.041667
30,602115522,4.0,17.291667
31,602115524,2.4,8.958333


In [46]:
# Note: final_score_new is created in cell In [42] further down; run that cell first.
all_score = pd.merge(total, final_score_new, left_on='รหัสนักศึกษา', right_index=True, how='left')
all_score

Unnamed: 0,รหัสนักศึกษา,Total (5),Total (25),Final
0,582115509,3.0,8.541667,8.541667
1,592115508,2.5,10.833333,10.833333
2,592115521,3.2,14.166667,14.166667
4,602115002,3.9,13.125000,13.125000
5,602115003,0.7,12.291667,12.291667
...,...,...,...,...
28,602115519,3.4,10.000000,10.000000
29,602115520,2.9,11.041667,11.041667
30,602115522,4.0,17.291667,17.291667
31,602115524,2.4,8.958333,8.958333


## Merging on Index

Sometimes the merge key is found in a DataFrame's index rather than in a column.
In that case, pass `left_index=True` or `right_index=True` (or both) to indicate that the index should be used as the merge key.

Try this code:
```python
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
```

In [55]:
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

Then merge the `key` column of `left1` with the index of `right1`:
```python
pd.merge(left1, right1, left_on='key', right_index=True)
```

In [56]:
pd.merge(left1, right1, left_on='key', right_index=True)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


The `how` argument works the same way as in an ordinary merge:
```python
pd.merge(left1, right1, left_on='key', right_index=True,how ='outer')
```

In [38]:
pd.merge(left1, right1, left_on='key', right_index=True,how ='outer')

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,
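
As a side note, the DataFrame method `join` offers a shorter spelling for this pattern: `on` names a column of the calling DataFrame that is matched against the index of the other one. This is a sketch alongside the `pd.merge` calls above, not a replacement for them; note that `join` performs a left join by default.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Match left1['key'] against right1's index; every left row is kept.
joined = left1.join(right1, on='key')
print(joined)
```

Because it is a left join, the row with key `c` stays in the result with `NaN` in `group_val`, and the rows keep the order of `left1`.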


With a hierarchical index, joining on the index is equivalent to a multiple-key merge:
```python
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
                               'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                        index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
                                'Ohio', 'Ohio'],
                                 [2001, 2000, 2000, 2000, 2001, 2002]],
                        columns=['event1', 'event2'])
```

In [39]:
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
                               'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                        index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
                                'Ohio', 'Ohio'],
                                 [2001, 2000, 2000, 2000, 2001, 2002]],
                        columns=['event1', 'event2'])

Pass the list of key columns to `left_on` and set `right_index=True` so that they are matched against the levels of the hierarchical index:
```python
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
```

In [40]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4,5
0,Ohio,2000,0.0,6,7
1,Ohio,2001,1.0,8,9
2,Ohio,2002,2.0,10,11
3,Nevada,2001,3.0,0,1


With an `outer` join, the rows from both sides are kept:
```python
pd.merge(lefth, righth, left_on=['key1', 'key2'],
         right_index=True, how='outer')
```

In [41]:
pd.merge(lefth, righth, left_on=['key1', 'key2'],
         right_index=True, how='outer')

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4.0,5.0
0,Ohio,2000,0.0,6.0,7.0
1,Ohio,2001,1.0,8.0,9.0
2,Ohio,2002,2.0,10.0,11.0
3,Nevada,2001,3.0,0.0,1.0
4,Nevada,2002,4.0,,
4,Nevada,2000,,2.0,3.0
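
An alternative worth knowing is to turn the index levels into ordinary columns first and then merge on column names. The sketch below assumes it is acceptable to name the two index levels `key1` and `key2`; the result gets a fresh integer index instead of repeating the left index labels.

```python
import numpy as np
import pandas as pd

lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])

# Name the index levels, then move them into regular columns.
righth_flat = righth.rename_axis(['key1', 'key2']).reset_index()

# An ordinary two-key merge now gives the same matches as right_index=True.
merged_flat = pd.merge(lefth, righth_flat, on=['key1', 'key2'])
print(merged_flat)
```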


### Work
Now set the index of the DataFrame we have built so far.
Then add the rest of the scores so that the overall score can be handled within one DataFrame.

In [42]:
final_score = pd.read_excel('files/studentlist.xlsx', engine='openpyxl', sheet_name='Final', header=6, index_col='รหัสนักศึกษา')
final_score_new = final_score[['Total  (25)']].rename(columns={'Total  (25)': 'Final'})
final_score_new

Unnamed: 0_level_0,Final
รหัสนักศึกษา,Unnamed: 1_level_1
582115509.0,8.541667
592115508.0,10.833333
592115521.0,14.166667
602115001.0,14.791667
602115002.0,13.125000
...,...
,
,20.000000
,0.000000
,11.636905


In [43]:
total

Unnamed: 0,รหัสนักศึกษา,Total (5),Total (25)
0,582115509,3.0,8.541667
1,592115508,2.5,10.833333
2,592115521,3.2,14.166667
4,602115002,3.9,13.125000
5,602115003,0.7,12.291667
...,...,...,...
28,602115519,3.4,10.000000
29,602115520,2.9,11.041667
30,602115522,4.0,17.291667
31,602115524,2.4,8.958333


In [None]:
total.rename(columns={'รหัสนักศึกษา':'StudentID'},inplace=True)
total

Now find the students who are in the top 10 by total score.

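## Concatenating Along an Axis

Concatenation joins objects end to end along a chosen axis rather than matching rows on a key.

The axis sets the direction: `axis=0` (the default) appends rows, and `axis=1` appends columns.

NumPy's `concatenate` function does this for arrays, as the following example shows.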

```python
arr = np.arange(12).reshape((3, 4))
arr
```

In [57]:
arr = np.arange(12).reshape((3, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

We can concatenate the arrays with the `np.concatenate` function:
```python
np.concatenate([arr, arr])
```

In [58]:
np.concatenate([arr, arr])

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

We can concatenate the arrays along a different axis by passing the `axis` argument:
```python
np.concatenate([arr, arr], axis=1)
```

In [59]:
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
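
Checking the shapes is a quick way to confirm which axis grew. This small sketch repeats the two calls above and compares the resulting shapes.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

stacked_rows = np.concatenate([arr, arr], axis=0)  # default axis: more rows
stacked_cols = np.concatenate([arr, arr], axis=1)  # more columns

print(arr.shape, stacked_rows.shape, stacked_cols.shape)
```

Only the dimension along the chosen axis changes; the other dimension must already match between the inputs.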

In pandas, labelled axes raise several questions to consider before concatenating:

* If the objects are indexed differently on the other axes, should we combine the distinct elements in these axes or use only the shared values (the intersection)?
* Do the concatenated chunks of data need to be identifiable in the resulting object?
* Does the “concatenation axis” contain data that needs to be preserved? In many cases, the default integer labels in a DataFrame are best discarded during concatenation.
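
Each of the three questions above corresponds to an argument of `pd.concat`. The sketch below uses two small hypothetical Series, `small` and `big`, to show `join`, `keys`, and `ignore_index` side by side.

```python
import pandas as pd

# Hypothetical inputs with partly overlapping labels.
small = pd.Series([0, 1], index=['a', 'b'])
big = pd.Series([0, 1, 5, 6], index=['a', 'b', 'f', 'g'])

# 1. Union (default) versus intersection of the labels on the other axis.
union = pd.concat([small, big], axis=1)
shared = pd.concat([small, big], axis=1, join='inner')

# 2. Keep the pieces identifiable with an outer index level.
labelled = pd.concat([small, big], keys=['first', 'second'])

# 3. Discard the existing labels and renumber from 0.
renumbered = pd.concat([small, big], ignore_index=True)

print(union.shape, shared.shape, labelled.index.nlevels, renumbered.index.tolist())
```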

Try creating three Series whose indexes do not overlap:

```python
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
```

In [60]:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

Then try a simple `concat`, which glues the values and indexes together:
```python
pd.concat([s1, s2, s3])
```

In [61]:
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

Then concatenate along `axis=1`, which produces a DataFrame with one column per Series:
```python
pd.concat([s1, s2, s3], axis=1)
```

In [62]:
pd.concat([s1, s2, s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


You can name each of the columns with the `keys` argument:
```python
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
```

In [63]:
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])

Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


Notice that each Series has become a new column.

This is the union case: the result contains every index label from all inputs, and `NaN` fills the positions where a Series has no value.

To keep only the shared labels (the intersection), pass `join='inner'`.

See the following example:
```python
s4 = pd.concat([s1, s3])
s4
```

In [64]:
s4 = pd.concat([s1, s3])
s4

a    0
b    1
f    5
g    6
dtype: int64

If we concatenate along `axis=1`, the default outer join keeps all labels; with `join='inner'` only `a` and `b` remain:
```python
pd.concat([s1, s4], axis=1)
```

In [65]:
pd.concat([s1, s4], axis=1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,1
f,,5
g,,6


In [66]:
pd.concat([s1, s4], axis=1, join='inner')

Unnamed: 0,0,1
a,0,0
b,1,1


We can build a hierarchical index on the concatenation axis with the `keys` argument, so that each piece stays identifiable in the result:
```python
result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
result
```




In [67]:
result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
result

one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

We can also use `unstack()` to spread the inner index level across columns and see all the data as a table:
```python
result.unstack()
```

In [68]:
result.unstack()

Unnamed: 0,a,b,f,g
one,0.0,1.0,,
two,0.0,1.0,,
three,,,5.0,6.0


### Work
The class also has a second section with its own scores, provided in `files/studentlist2.xlsx`. Combine the data from the two files into a single DataFrame and compute the descriptive statistics for all students.

In [69]:
dat = pd.read_excel('files/studentlist.xlsx',header=6)

sd2 = pd.read_excel('files/studentlist2.xlsx',header=6)

In [70]:
sd1 = pd.read_excel('files/studentlist.xlsx', skiprows=6, usecols=["รหัสนักศึกษา", "Total (5)"])
sd2 = pd.read_excel('files/studentlist2.xlsx', skiprows=6, usecols=["รหัสนักศึกษา", "Total (5)"])

sd1.head(), sd2.head()

(   รหัสนักศึกษา  Total (5)
 0     582115509        2.5
 1     592115508        2.5
 2     592115521        2.5
 3     602115001        0.0
 4     602115002        4.0,
    รหัสนักศึกษา  Total (5)
 0     612115509          2
 1     612115508          2)

In [71]:
total_students = pd.concat([sd1, sd2], ignore_index=True)
total_students.head()

Unnamed: 0,รหัสนักศึกษา,Total (5)
0,582115509,2.5
1,592115508,2.5
2,592115521,2.5
3,602115001,0.0
4,602115002,4.0


The teacher, being playful, decides to give every student a random score. Create a new DataFrame that holds a random score for each student, drawn from 1 to the maximum score found among all students.
Then merge the random scores with the students and calculate the new score of every student. Finally, assign grades: students whose new score is less than 80 get `F`; all others get `A`.
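
One possible shape of a solution is sketched below on made-up data. Several things here are assumptions: the frame `toy_students` and its values are invented, the column names are arbitrary, the "new score" is taken to be the old score plus the random score, and the random generator is seeded only to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the combined student table.
toy_students = pd.DataFrame({'StudentID': [101, 102, 103, 104],
                             'score': [55.0, 70.0, 82.5, 40.0]})

rng = np.random.default_rng(0)  # seeded for repeatability
max_score = int(toy_students['score'].max())

# Random integer from 1 to max_score (the upper bound of integers() is exclusive).
toy_random = pd.DataFrame({
    'StudentID': toy_students['StudentID'],
    'random_score': rng.integers(1, max_score + 1, size=len(toy_students)),
})

graded = pd.merge(toy_students, toy_random, on='StudentID')
graded['new_score'] = graded['score'] + graded['random_score']  # assumed rule

# Vectorized if-else: below 80 is 'F', otherwise 'A'.
graded['grade'] = np.where(graded['new_score'] < 80, 'F', 'A')
print(graded)
```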

## Combine Data with Overlap



Two datasets can also be combined when their indexes overlap in full or in part.

In [72]:
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])

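We can combine the two Series with `np.where`, which acts as a vectorized if-else expression: take the value from `b` where `a` is null, otherwise keep the value from `a`.
Let's try it: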
```python
np.where(pd.isnull(a),b,a)
```

In [73]:
np.where(pd.isnull(a),b,a)

array([0. , 2.5, 2. , 3.5, 4.5, 5. ])

Now try the last two arguments the other way around:
```python
np.where(pd.isnull(a),a,b)
```

In [74]:
np.where(pd.isnull(a),a,b)

array([nan,  1., nan,  3.,  4., nan])

The Series method `combine_first` performs the same kind of patching, but it aligns the two Series by index label first:
```python
b[:-2].combine_first(a[2:])
```

In [75]:
b[:-2].combine_first(a[2:])

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64

For a DataFrame, `combine_first` does the same thing column by column.

It is similar to *patching* the missing data (the `NaN` positions) in the calling object with data from the object you pass.
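
When both Series carry the same labels, `combine_first`, `fillna` with another Series, and the `np.where` expression shown earlier all give the same values. The sketch below checks this on the Series `a` and `b` defined above; the results are sorted by label before comparison so the check does not depend on the order of the index.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])

patched = a.combine_first(b)                 # fill holes in a from b, by label
by_fillna = a.fillna(b)                      # same idea, also aligned by label
by_where = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)  # positional

print(patched.sort_index())
```

The difference shows up when the labels differ: `np.where` works by position and ignores labels, while `combine_first` also adds labels that exist only in the other object.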

In [77]:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
 'b': [np.nan, 2., np.nan, 6.],
 'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
 'b': [np.nan, 3., 4., 6., 8.]})
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


Try patching `df1` with the values from `df2`:
```python
df1.combine_first(df2)
```


In [76]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


```python
df2.combine_first(df1)
```

In [78]:
df2.combine_first(df1)

Unnamed: 0,a,b,c
0,5.0,,2.0
1,4.0,3.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


# Reshaping and Pivoting



## Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange the data in a DataFrame.

The two primary operations are:

*   `stack`: rotates, or pivots, the columns of the data into the rows
*   `unstack`: pivots the rows into the columns



In [79]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'],
    name='number'))
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


Calling `stack` pivots the columns into the rows and produces a Series from the DataFrame:



```python
result = data.stack()
result
```

In [80]:
result = data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

From a hierarchically indexed Series, we can rearrange the data back into a DataFrame with `unstack`:
```python
result.unstack()
```

In [81]:
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


By default the innermost level is unstacked. We can unstack a different level by passing the level name or number:
```python
result.unstack('state')
```

In [82]:
result.unstack('state')

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5
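
The level can be given by position as well as by name. In this sketch, which rebuilds `data` and `result` as above, level `0` is the outer level `state`, so both calls give the same table.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()

by_name = result.unstack('state')  # refer to the level by its name
by_number = result.unstack(0)      # refer to the same level by its position

print(by_name.equals(by_number))
```

Level names tend to be easier to read, but positions are the only option when the index levels have no names.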


`unstack` may introduce missing data if not all of the values in the level are found in each group:

In [83]:
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

Try calling `unstack`:
```python
data2.unstack()
```

In [84]:
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0


Stacking filters out missing data by default, so a round trip through `unstack` and `stack` drops the `NaN` entries. To preserve them, pass `dropna=False` to `stack`.
```python
data2.unstack().stack()
```

In [85]:
data2.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

With `dropna=False`, the missing values are kept:
```python
data2.unstack().stack(dropna=False)
```

In [86]:
data2.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

When you unstack in a DataFrame, the **level unstacked** becomes the lowest (innermost) level of the columns in the result:
```python
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df
```

In [87]:
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df

Unnamed: 0_level_0,side,left,right
state,number,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,one,0,5
Ohio,two,1,6
Ohio,three,2,7
Colorado,one,3,8
Colorado,two,4,9
Colorado,three,5,10


Try unstacking the `state` level:
```python
df.unstack('state')
```

In [88]:
df.unstack('state')

side,left,left,right,right
state,Ohio,Colorado,Ohio,Colorado
number,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
one,0,3,5,8
two,1,4,6,9
three,2,5,7,10


When calling `stack`, we can name the level to move; here `side` goes back into the rows while `state` stays in the columns:
```python
df.unstack('state').stack('side')
```

In [89]:
df.unstack('state').stack('side')

Unnamed: 0_level_0,state,Colorado,Ohio
number,side,Unnamed: 2_level_1,Unnamed: 3_level_1
one,left,3,0
one,right,8,5
two,left,4,1
two,right,9,6
three,left,5,2
three,right,10,7


# Work

From the three files provided [here](https://github.com/jakevdp/data-USstates/):

1. Combine the data to get a DataFrame that contains the state, abbreviation, and area (sq. mi).
2. Combine the data to get the state, ages, and population.
3. Create a hierarchically indexed DataFrame with the ages as the inner level under each state. The columns should be the years, the values should be the population, and the years must be in order.
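
Steps 2 and 3 can be rehearsed on a tiny stand-in before touching the real files. Everything in this sketch is an assumption for illustration: the frames `toy_abbrevs` and `toy_population` and all their numbers are made up, and the column name `state/region` is assumed to mirror the population file.

```python
import pandas as pd

# Made-up stand-ins for the abbreviation and population tables.
toy_abbrevs = pd.DataFrame({'state': ['Alabama', 'Alaska'],
                            'abbreviation': ['AL', 'AK']})
toy_population = pd.DataFrame({
    'state/region': ['AL', 'AL', 'AL', 'AL', 'AK', 'AK', 'AK', 'AK'],
    'ages': ['total', 'under18', 'total', 'under18'] * 2,
    'year': [2012, 2012, 2011, 2011] * 2,
    'population': [400, 100, 390, 95, 70, 20, 68, 19],
})

# The key columns have different names, so use left_on / right_on.
toy_merged = pd.merge(toy_population, toy_abbrevs,
                      left_on='state/region', right_on='abbreviation')

# State and ages form the row hierarchy; each year becomes a column.
toy_table = pd.pivot_table(toy_merged, values='population',
                           index=['state', 'ages'], columns='year')
print(toy_table)
```

`pivot_table` sorts the year columns, which takes care of the ordering requirement even though the input rows list 2012 before 2011.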


1. Combine the data to get a DataFrame that contains the state, abbreviation, and area (sq. mi).

In [112]:
import pandas as pd

st1 = pd.read_csv('files/state-abbrevs.csv')
st2 = pd.read_csv('files/state-areas.csv')
st3 = pd.read_csv('files/state-population.csv')

In [113]:
merged_df = pd.merge(st1, st2, on='state')
merged_df.head()

Unnamed: 0,state,abbreviation,area (sq. mi)
0,Alabama,AL,52423
1,Alaska,AK,656425
2,Arizona,AZ,114006
3,Arkansas,AR,53182
4,California,CA,163707


2. Combine the data to get the state, ages, and population.

In [114]:
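# st3 has no 'state' column; its state abbreviations are assumed to be in a column
# named 'state/region' (as in the upstream state-population.csv), matched to st1's 'abbreviation'.
merged_population = pd.merge(st3, st1, left_on='state/region', right_on='abbreviation')
# No column names collide, so the full state name is simply 'state' (no '_x' suffix).
final_population = merged_population[['state', 'ages', 'population']]
final_population.head()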

3. Create a hierarchically indexed DataFrame with the ages as the inner level under each state. The columns should be the years, the values should be the population, and the years must be in order.

In [115]:
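# Rebuild the merge so this cell runs on its own; 'state/region' is the assumed
# name of the abbreviation column in state-population.csv.
merged_population = pd.merge(st3, st1, left_on='state/region', right_on='abbreviation')
final_population = merged_population[['state', 'ages', 'year', 'population']]
# pivot_table puts state and ages in the row index and sorts the year columns.
hierarchical_df = pd.pivot_table(final_population, values='population', index=['state', 'ages'], columns='year')

hierarchical_df.head()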