# Join, Combine, and Reshape a DataFrame

---

Data often arrives in different files and in different formats. An analyst has to be able to deal with this and join the files appropriately in order to work on the whole dataset rather than only one part of it. In this lecture, we cover one of the most important and slightly more advanced functionalities of Pandas: how to join and combine several DataFrames, along with the somewhat familiar pivoting and cross-tabulation operations.


### Lecture outline

---

* Hierarchical Indexing (MultiIndex)


* Combining and Merging


* Joining and Concatenation


* Reshaping and Pivoting


* Groupby


* Cross Tabulation


* Long to Wide format


* Wide to Long format

In [1]:
import pandas as pd

import numpy as np

## Hierarchical Indexing (MultiIndex)

---

Before we delve into Pandas merging and reshaping operations, it is essential to know what a hierarchical index is and how to work with it.

Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form, like Series (1d) and DataFrame (2d).


> Note that operations on a hierarchically indexed DataFrame differ because there are several index levels. Hence, we have to specify which level to use.
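
To make the idea of "higher dimensional data in a lower dimensional form" concrete, here is a minimal sketch. The names `sales`, `region`, and `year`, as well as the numbers, are made up for illustration and do not come from the lecture data.

```python
import pandas as pd

# Hypothetical data: one value per (region, year) pair
index = pd.MultiIndex.from_product(
    [["north", "south"], [2019, 2020]],
    names=["region", "year"],
)
sales = pd.Series([10, 12, 7, 9], index=index, name="sales")

# The Series is 1d, but it encodes a 2d table (region x year)
table = sales.unstack(level="year")

# Selecting on the outer level returns a Series indexed by the inner level
north = sales.loc["north"]
```

Here `table` has one row per region and one column per year, while `north` is an ordinary Series indexed by year.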

#### Reference

[MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)


[Multiindexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#multiindexing)

### Intro

In [167]:
multi_df = pd.DataFrame(data=np.random.randint(100, size=9),
                        index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                               [1, 2, 3, 1, 3, 1, 2, 1, 3]],
                        columns=["values"])


multi_df

Unnamed: 0,Unnamed: 1,values
a,1,28
a,2,89
a,3,75
b,1,16
b,3,92
c,1,49
c,2,71
d,1,93
d,3,48


In [168]:
multi_df.index # Return index object

multi_df.index.levels # Return index levels

multi_df.index.names # Return names in index levels. Currently no names

FrozenList([None, None])

In [169]:
multi_df.index.names = ["index_1", "index_2"]

multi_df.index.names

FrozenList(['index_1', 'index_2'])

In [170]:
multi_df.columns.names = ["column_index"]

multi_df.columns.names

FrozenList(['column_index'])

### Slicing

In [77]:
multi_df

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,0.371675
a,2,0.487012
a,3,0.778613
b,1,2.286855
b,3,-0.126349
c,1,0.86191
c,2,0.431215
d,2,-2.444565
d,3,1.881801


In [91]:
multi_df.xs(key="a", axis=0, level=0) # Get values at specified index

multi_df.xs(key=2, axis=0, level=1) # Get values at specified index

multi_df.xs(key=("a", 3)) # Get values at several indexes

multi_df.xs(key=("a", 3), axis=0, level=[0, 1]) # Get values at several indexes and levels

multi_df.xs(key="values", axis=1) # Get values at vertical axis

index_1  index_2
a        1          0.371675
         2          0.487012
         3          0.778613
b        1          2.286855
         3         -0.126349
c        1          0.861910
         2          0.431215
d        2         -2.444565
         3          1.881801
Name: values, dtype: float64

Instead of the `xs()` method, we can use the familiar `loc` indexer to slice along either axis.

In [96]:
All = slice(None) # Python built-in slicer

In [133]:
multi_df.loc["a"] # Slice at the first level

multi_df.loc[["a", "c"]] # Selective slice at the first level

multi_df.loc["a"].loc[:2] # Slice at the second level


multi_df.loc[("a", All), All] # Return all values for "a" index at the first level

multi_df.loc[(All, 1), All] # Return all 1's from the second level

multi_df.loc[(All, 1), ("values")] # Same as the one above. Selects every first level index and "1" from the second level

multi_df.loc[(slice("a", "c"), 2), All] # Selective slicing at both index levels

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,2,0.487012
c,2,0.431215
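
As an alternative to spelling out `slice(None)`, pandas provides the `pd.IndexSlice` helper, which lets us write the same selections with the usual colon syntax. The sketch below uses a small stand-in frame `df` with made-up values rather than `multi_df` itself, so that it runs on its own.

```python
import pandas as pd

# Small stand-in for multi_df (values are made up)
df = pd.DataFrame(
    {"values": [10, 20, 30, 40, 50, 60]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1), ("b", 3), ("c", 1), ("c", 2)],
        names=["index_1", "index_2"],
    ),
)

idx = pd.IndexSlice  # helper that builds the slice tuples for us

# All rows whose second level equals 1 (same as df.loc[(slice(None), 1), :])
second_is_one = df.loc[idx[:, 1], :]

# First level from "a" to "c", second level equal to 2
both_levels = df.loc[idx["a":"c", 2], :]
```

Label slices such as `"a":"c"` assume the index is sorted, which is the case for this stand-in frame.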


### Reordering and Sorting Levels

---

Sometimes we need to swap the index levels and/or sort a MultiIndex DataFrame by one or both index levels. The `swaplevel()` and `sort_index()` methods cover these cases.

In [148]:
multi_df

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,0.314717
a,2,-1.086142
a,3,2.459832
b,1,0.291573
b,3,0.401082
c,1,0.148473
c,2,-0.238632
d,1,0.2047
d,3,-1.064995


In [151]:
multi_df.swaplevel("index_2", "index_1") # Swap or change the index levels

Unnamed: 0_level_0,column_index,values
index_2,index_1,Unnamed: 2_level_1
1,a,0.314717
2,a,-1.086142
3,a,2.459832
1,b,0.291573
3,b,0.401082
1,c,0.148473
2,c,-0.238632
1,d,0.2047
3,d,-1.064995


We can sort a MultiIndex DataFrame either by index or by values.

In [155]:
multi_df.sort_index(level=0) # Sort by index level 0

multi_df.sort_index(level=1) # Sort by index level 1

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,0.314717
b,1,0.291573
c,1,0.148473
d,1,0.2047
a,2,-1.086142
c,2,-0.238632
a,3,2.459832
b,3,0.401082
d,3,-1.064995


In [156]:
multi_df

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,0.314717
a,2,-1.086142
a,3,2.459832
b,1,0.291573
b,3,0.401082
c,1,0.148473
c,2,-0.238632
d,1,0.2047
d,3,-1.064995


In [161]:
multi_df.sort_values(by=("values")) # Sort by column

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,2,-1.086142
d,3,-1.064995
c,2,-0.238632
c,1,0.148473
d,1,0.2047
b,1,0.291573
a,1,0.314717
b,3,0.401082
a,3,2.459832
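
Swapping and sorting are often combined: `swaplevel()` on its own only changes the order of the labels, so a following `sort_index()` is needed to group the rows by the new outer level. A minimal sketch on a made-up frame `df` (not the lecture's `multi_df`):

```python
import pandas as pd

# Made-up frame with two named index levels
df = pd.DataFrame(
    {"values": [5, 3, 8, 1]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
        names=["index_1", "index_2"],
    ),
)

# swaplevel() only relabels; the row order stays the same
swapped = df.swaplevel("index_1", "index_2")

# sort_index() then groups the rows by the new outer level
swapped_sorted = swapped.sort_index()

# Sorting by values, largest first
by_value = df.sort_values(by="values", ascending=False)
```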


### Summary Statistics by Level

In [171]:
multi_df

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,28
a,2,89
a,3,75
b,1,16
b,3,92
c,1,49
c,2,71
d,1,93
d,3,48


In [175]:
multi_df.sum() # Sum up all the values

multi_df.groupby(level=0).sum() # Sum up numbers at the level 0 (sum(level=0) was removed in pandas 2.0)

multi_df.groupby(level=1).sum() # Sum up numbers at the level 1

column_index,values
index_2,Unnamed: 1_level_1
1,186
2,160
3,215


Other statistical and arithmetic functions work the same way: we have to explicitly indicate the level at which we want to perform the operation.
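
Level-wise statistics can be computed by grouping on an index level with `groupby(level=...)` and then calling the desired function. The sketch below uses a small stand-in frame `df` whose values are chosen for illustration.

```python
import pandas as pd

# Stand-in frame with two named index levels
df = pd.DataFrame(
    {"values": [28, 89, 75, 16, 92]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 3)],
        names=["index_1", "index_2"],
    ),
)

# Mean per value of the first index level
mean_by_first = df.groupby(level="index_1").mean()

# Maximum per value of the second index level
max_by_second = df.groupby(level="index_2").max()
```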

### Set and Reset MultiIndex

---

We can set and reset multiple index levels in our DataFrame by using the `set_index()` and `reset_index()` methods.

In [185]:
multi_df.reset_index(level=0) # Reset level 0 index


multi_df.reset_index(level=1) # Reset level 1 index


multi_df.reset_index() # Reset all the index

column_index,index_1,index_2,values
0,a,1,28
1,a,2,89
2,a,3,75
3,b,1,16
4,b,3,92
5,c,1,49
6,c,2,71
7,d,1,93
8,d,3,48


In [187]:
multi_df = multi_df.reset_index() # Reset index and set it again


multi_df

In [188]:
multi_df

column_index,index_1,index_2,values
0,a,1,28
1,a,2,89
2,a,3,75
3,b,1,16
4,b,3,92
5,c,1,49
6,c,2,71
7,d,1,93
8,d,3,48


In [190]:
multi_df.set_index(keys=["index_1", "index_2"]) # Set columns as index

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,28
a,2,89
a,3,75
b,1,16
b,3,92
c,1,49
c,2,71
d,1,93
d,3,48


By default, the columns used as the index are removed from the DataFrame. However, we can keep them in the DataFrame by passing `drop=False`.

In [191]:
multi_df.set_index(keys=["index_1", "index_2"], drop=False)

Unnamed: 0_level_0,column_index,index_1,index_2,values
index_1,index_2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,1,a,1,28
a,2,a,2,89
a,3,a,3,75
b,1,b,1,16
b,3,b,3,92
c,1,c,1,49
c,2,c,2,71
d,1,d,1,93
d,3,d,3,48


## Combining and Merging - my own

---

https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#merge
    
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

# See these

---

https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#grouping


https://towardsdatascience.com/reshape-pandas-dataframe-with-pivot-table-in-python-tutorial-and-visualization-2248c2012a31


https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html



https://stackoverflow.com/questions/15322632/python-pandas-df-groupby-agg-column-reference-in-agg


https://stackoverflow.com/questions/14916358/reshaping-dataframes-in-pandas-based-on-column-labels


https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

## Groups and Aggregations with groupby()

---

Insert the `Group_By` notebook here

In [52]:
athletes = pd.read_csv('athletes.csv')
athletes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11538 entries, 0 to 11537
Data columns (total 12 columns):
id               11538 non-null int64
name             11538 non-null object
nationality      11538 non-null object
sex              11538 non-null object
date_of_birth    11538 non-null object
height           11208 non-null float64
weight           10879 non-null float64
sport            11538 non-null object
gold             11538 non-null int64
silver           11538 non-null int64
bronze           11538 non-null int64
info             131 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 1.1+ MB


In [57]:
# Simply calling groupby returns a GroupBy object 
# This does not calculate anything yet!
g = athletes.groupby('nationality')[['gold', 'silver', 'bronze']]

In [58]:
# Calling an aggregation function on the GroupBy object
# applies the calculation for every group
# and constructs a DataFrame with the results
g.sum()

Unnamed: 0_level_0,gold,silver,bronze
nationality,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AFG,0,0,0
ALB,0,0,0
ALG,0,2,0
AND,0,0,0
ANG,0,0,0
ANT,0,0,0
ARG,21,1,0
ARM,1,3,0
ARU,0,0,0
ASA,0,0,0


In [71]:
# We can select multiple columns to group by
# And we can select a subset of columns to aggregate
g = athletes.groupby(['sport', 'sex'])[['weight', 'height']]

In [129]:
# Because we selected only 2 columns, this calculation will now be cheaper
g.mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,weight,height
sport,sex,Unnamed: 2_level_1,Unnamed: 3_level_1
aquatics,female,62.284483,1.715712
aquatics,male,82.219061,1.860342
archery,female,64.301587,1.67619
archery,male,80.079365,1.795714
athletics,female,60.152542,1.6905
athletics,male,74.77768,1.809234
badminton,female,61.209877,1.686
badminton,male,76.156627,1.805059
basketball,female,75.377622,1.833819
basketball,male,100.297872,2.003611


## Reshaping Rows and Columns with stack() and unstack()

In [84]:
m = pd.read_csv('monthly_data.csv')
m

Unnamed: 0,YYYY,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,YEAR
0,2008,10140,10239,10050,10111,10159,10159,10141,10117,10178,10148,10125,10182,10146
1,2009,10137,10140,10140,10141,10188,10168,10128,10165,10208,10166,10041,10068,10141
2,2010,10151,10034,10168,10194,10158,10166,10158,10129,10147,10135,10057,10133,10136
3,2011,10182,10161,10227,10192,10182,10154,10123,10130,10149,10182,10194,10099,10165
4,2012,10194,10286,10271,10053,10159,10127,10139,10155,10149,10109,10108,10085,10153
5,2013,10142,10169,10099,10155,10113,10180,10201,10176,10151,10129,10155,10170,10153
6,2014,10055,10031,10164,10148,10154,10184,10143,10117,10189,10142,10103,10172,10134
7,2015,10135,10164,10198,10214,10152,10195,10142,10152,10171,10186,10150,10217,10173
8,2016,10100,10099,10144,10122,10140,10137,10168,10183,10177,10214,10144,10283,10159
9,2017,10228,10151,10154,10211,10170,10134,10141,10162,10135,10176,10141,10120,10160


In [85]:
# Preparation: move the 'YYYY' column into the index
m.set_index('YYYY', inplace=True)
m

Unnamed: 0_level_0,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,YEAR
YYYY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2008,10140,10239,10050,10111,10159,10159,10141,10117,10178,10148,10125,10182,10146
2009,10137,10140,10140,10141,10188,10168,10128,10165,10208,10166,10041,10068,10141
2010,10151,10034,10168,10194,10158,10166,10158,10129,10147,10135,10057,10133,10136
2011,10182,10161,10227,10192,10182,10154,10123,10130,10149,10182,10194,10099,10165
2012,10194,10286,10271,10053,10159,10127,10139,10155,10149,10109,10108,10085,10153
2013,10142,10169,10099,10155,10113,10180,10201,10176,10151,10129,10155,10170,10153
2014,10055,10031,10164,10148,10154,10184,10143,10117,10189,10142,10103,10172,10134
2015,10135,10164,10198,10214,10152,10195,10142,10152,10171,10186,10150,10217,10173
2016,10100,10099,10144,10122,10140,10137,10168,10183,10177,10214,10144,10283,10159
2017,10228,10151,10154,10211,10170,10134,10141,10162,10135,10176,10141,10120,10160


In [86]:
# stack() moves the column labels into an inner index level, so all values end up in a single column
m.stack()

YYYY      
2008  JAN     10140
      FEB     10239
      MAR     10050
      APR     10111
      MAY     10159
      JUN     10159
      JUL     10141
      AUG     10117
      SEP     10178
      OCT     10148
      NOV     10125
      DEC     10182
      YEAR    10146
2009  JAN     10137
      FEB     10140
      MAR     10140
      APR     10141
      MAY     10188
      JUN     10168
      JUL     10128
      AUG     10165
      SEP     10208
      OCT     10166
      NOV     10041
      DEC     10068
      YEAR    10141
2010  JAN     10151
      FEB     10034
      MAR     10168
      APR     10194
              ...  
2015  OCT     10186
      NOV     10150
      DEC     10217
      YEAR    10173
2016  JAN     10100
      FEB     10099
      MAR     10144
      APR     10122
      MAY     10140
      JUN     10137
      JUL     10168
      AUG     10183
      SEP     10177
      OCT     10214
      NOV     10144
      DEC     10283
      YEAR    10159
2017  JAN     10228
      FEB

In [87]:
# stack() also allows quick calculations over all cells
m.stack().sum()

1319751

In [130]:
w = athletes.groupby(['sport', 'sex'])['weight'].mean()
w

sport              sex   
aquatics           female     62.284483
                   male       82.219061
archery            female     64.301587
                   male       80.079365
athletics          female     60.152542
                   male       74.777680
badminton          female     61.209877
                   male       76.156627
basketball         female     75.377622
                   male      100.297872
boxing             female           NaN
                   male             NaN
canoe              female     66.457944
                   male       82.150000
cycling            female     60.207254
                   male       72.576052
equestrian         female     58.634146
                   male       72.954887
fencing            female     62.733871
                   male       78.785124
football           female     61.061069
                   male       74.451713
golf               female     63.200000
                   male       79.000000
gymnastics    

In [90]:
# unstack() takes the inner index level and creates a column for every unique index
# It then moves the data into these columns
w.unstack()

sex,female,male
sport,Unnamed: 1_level_1,Unnamed: 2_level_1
aquatics,62.284483,82.219061
archery,64.301587,80.079365
athletics,60.152542,74.77768
badminton,61.209877,76.156627
basketball,75.377622,100.297872
boxing,,
canoe,66.457944,82.15
cycling,60.207254,72.576052
equestrian,58.634146,72.954887
fencing,62.733871,78.785124
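
`unstack()` is not limited to the innermost level, and `stack()` undoes it. Here is a minimal sketch with a made-up Series `w` shaped like the grouped weights above (the numbers are rounded placeholders, not the real means):

```python
import pandas as pd

# Made-up Series with a (sport, sex) MultiIndex
w = pd.Series(
    [62.0, 82.0, 64.0, 80.0],
    index=pd.MultiIndex.from_product(
        [["aquatics", "archery"], ["female", "male"]],
        names=["sport", "sex"],
    ),
    name="weight",
)

# Default: the innermost level ("sex") becomes the columns
by_sex = w.unstack()

# Any level can be chosen by name or position
by_sport = w.unstack(level="sport")

# stack() reverses unstack() and restores the original values and index
restored = by_sex.stack()
```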


## Reshaping Rows and Columns with pivot()

---

Insert the `Pivot_Table` notebook here

In [92]:
p = pd.DataFrame({'id': [823905, 823905,
                         235897, 235897, 235897,
                         983422, 983422],
                  'item': ['prize', 'unit', 
                           'prize', 'unit', 'stock', 
                           'prize', 'stock'],
                  'value': [3.49, 'kg',
                            12.89, 'l', 50,
                            0.49, 4]})
p

Unnamed: 0,id,item,value
0,823905,prize,3.49
1,823905,unit,kg
2,235897,prize,12.89
3,235897,unit,l
4,235897,stock,50
5,983422,prize,0.49
6,983422,stock,4


In [93]:
# pivot() moves data from rows into columns
# so that we end up with a wider, shorter DataFrame

# index= is the column that will be used for row indices
# columns= is the column that will be used to create column labels
# (keyword arguments are required since pandas 2.0)
p.pivot(index='id', columns='item')

Unnamed: 0_level_0,value,value,value
item,prize,stock,unit
id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
235897,12.89,50.0,l
823905,3.49,,kg
983422,0.49,4.0,
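
`pivot()` requires every (index, column) pair to be unique. When the long table contains duplicates, `pivot_table()` can be used instead because it aggregates them. The sketch below uses a made-up `sales` table; the column names and the choice of `sum` as the aggregation are assumptions for illustration.

```python
import pandas as pd

# Made-up long table with a repeated (store, product) pair
sales = pd.DataFrame({
    "store": ["A", "A", "A", "B"],
    "product": ["apple", "apple", "pear", "apple"],
    "units": [3, 5, 2, 4],
})

# pivot() refuses duplicates because it cannot decide which value to keep
try:
    sales.pivot(index="store", columns="product", values="units")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (here: sum)
wide = sales.pivot_table(index="store", columns="product",
                         values="units", aggfunc="sum")
```

Combinations that never occur in the long table, such as store `B` with `pear`, show up as `NaN` in `wide`.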


In [98]:
grades = pd.DataFrame([[6, 4, 5], [7, 8, 7], [6, 7, 9], [6, 5, 5], [5, 2, 7]], 
                       index = ['Mary', 'John', 'Ann', 'Pete', 'Laura'],
                       columns = ['test_1', 'test_2', 'test_3'])
grades.reset_index(inplace=True)
grades

Unnamed: 0,index,test_1,test_2,test_3
0,Mary,6,4,5
1,John,7,8,7
2,Ann,6,7,9
3,Pete,6,5,5
4,Laura,5,2,7


In [102]:
# melt() is the opposite of pivot()
# It moves the data from several columns into a single column
# The column names will show up in a new column called "variable"
grades.melt(id_vars=['index'])

Unnamed: 0,index,variable,value
0,Mary,test_1,6
1,John,test_1,7
2,Ann,test_1,6
3,Pete,test_1,6
4,Laura,test_1,5
5,Mary,test_2,4
6,John,test_2,8
7,Ann,test_2,7
8,Pete,test_2,5
9,Laura,test_2,2
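
The default column names `variable` and `value` can be replaced with more descriptive ones, and `pivot()` takes the long table back to the wide layout. A minimal sketch on a reduced, made-up version of the grades table (the names `student`, `test`, and `grade` are chosen here for illustration):

```python
import pandas as pd

# Reduced, made-up version of the grades table
wide = pd.DataFrame({
    "student": ["Mary", "John"],
    "test_1": [6, 7],
    "test_2": [4, 8],
})

# Wide to long, with descriptive names for the two new columns
long = wide.melt(id_vars="student", var_name="test", value_name="grade")

# Long back to wide
back = long.pivot(index="student", columns="test", values="grade")
```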


## Combining Datasets

---

Insert the `Merging_DataFrame` notebook here

In [113]:
grades = pd.DataFrame([[6, 4, 5], [7, 8, 7], [6, 7, 9], [6, 5, 5], [5, 2, 7]], 
                       index = ['Mary', 'John', 'Ann', 'Pete', 'Laura'],
                       columns = ['test_1', 'test_2', 'test_3'])
grades

Unnamed: 0,test_1,test_2,test_3
Mary,6,4,5
John,7,8,7
Ann,6,7,9
Pete,6,5,5
Laura,5,2,7


In [116]:
# Adding a new column -- needs an indexed data structure (Series)
grades['test_4'] = pd.Series({'John': 5, 'Ann': 8, 'Pete': 9, 'Mary': 7, 'Laura': 10})
grades

Unnamed: 0,test_1,test_2,test_3,test_4
Mary,6,4,5,7
John,7,8,7,5
Ann,6,7,9,8
Pete,6,5,5,9
Laura,5,2,7,10


In [117]:
# Adding a row with .loc -- no Series necessary
grades.loc['Bob'] = [2,3,4,5]
grades

Unnamed: 0,test_1,test_2,test_3,test_4
Mary,6,4,5,7
John,7,8,7,5
Ann,6,7,9,8
Pete,6,5,5,9
Laura,5,2,7,10
Bob,2,3,4,5


In [122]:
# We can also use pd.concat (DataFrame.append was removed in pandas 2.0)
# In that case we need a Series with a name (will be used as row index)
new_row = pd.Series({'test_1': 5, 'test_2': 6, 'test_3': 7, 'test_4': 8}, name="Kim")
pd.concat([grades, new_row.to_frame().T])

Unnamed: 0,test_1,test_2,test_3,test_4
Mary,6,4,5,7
John,7,8,7,5
Ann,6,7,9,8
Pete,6,5,5,9
Laura,5,2,7,10
Bob,2,3,4,5
Kim,5,6,7,8
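
More generally, `pd.concat()` glues several objects together along either axis. The frames `first`, `second`, and `extra` below are made up for this sketch:

```python
import pandas as pd

# Made-up frames sharing student names as the index
first = pd.DataFrame({"test_1": [6, 7]}, index=["Mary", "John"])
second = pd.DataFrame({"test_1": [5]}, index=["Kim"])
extra = pd.DataFrame({"test_2": [4, 8]}, index=["Mary", "John"])

# axis=0 (default): stack rows on top of each other
rows = pd.concat([first, second])

# axis=1: place frames side by side, aligned on the index
cols = pd.concat([first, extra], axis=1)
```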


In [123]:
grades['stud_nr'] = [113, 121, 123, 135, 139, 141]
grades = grades[['stud_nr', 'test_1', 'test_2', 'test_3', 'test_4']]
grades

Unnamed: 0,stud_nr,test_1,test_2,test_3,test_4
Mary,113,6,4,5,7
John,121,7,8,7,5
Ann,123,6,7,9,8
Pete,135,6,5,5,9
Laura,139,5,2,7,10
Bob,141,2,3,4,5


In [124]:
other = pd.DataFrame([[139, 7, 7],
                       [123, 8, 6],
                       [142, 4, 5],
                       [113, 7, 9],
                       [155, 10, 9],
                       [121, 6, 4]], 
                       columns = ['stud_nr', 'exam1', 'exam2'])
other

Unnamed: 0,stud_nr,exam1,exam2
0,139,7,7
1,123,8,6
2,142,4,5
3,113,7,9
4,155,10,9
5,121,6,4


In [125]:
# Merging two DataFrames
# By default this does an inner join on the common column (stud_nr)
grades.merge(other)

Unnamed: 0,stud_nr,test_1,test_2,test_3,test_4,exam1,exam2
0,113,6,4,5,7,7,9
1,121,7,8,7,5,6,4
2,123,6,7,9,8,8,6
3,139,5,2,7,10,7,7


In [128]:
# We can also specify other join types: left, right, outer
grades.merge(other, how='outer')

Unnamed: 0,stud_nr,test_1,test_2,test_3,test_4,exam1,exam2
0,113,6.0,4.0,5.0,7.0,7.0,9.0
1,121,7.0,8.0,7.0,5.0,6.0,4.0
2,123,6.0,7.0,9.0,8.0,8.0,6.0
3,135,6.0,5.0,5.0,9.0,,
4,139,5.0,2.0,7.0,10.0,7.0,7.0
5,141,2.0,3.0,4.0,5.0,,
6,142,,,,,4.0,5.0
7,155,,,,,10.0,9.0
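
To see how the join type decides which rows survive, it can help to name the key explicitly with `on=` and to ask pandas where each row came from with `indicator=True`. A minimal sketch with made-up, reduced versions of the two tables:

```python
import pandas as pd

# Made-up, reduced versions of the grades and exam tables
left = pd.DataFrame({"stud_nr": [113, 121, 135], "test_1": [6, 7, 6]})
right = pd.DataFrame({"stud_nr": [113, 121, 142], "exam1": [7, 6, 4]})

# Keep every row of the left frame; unmatched exam1 values become NaN
left_join = left.merge(right, on="stud_nr", how="left")

# indicator=True adds a "_merge" column telling where each row came from
outer = left.merge(right, on="stud_nr", how="outer", indicator=True)
origin = dict(zip(outer["stud_nr"], outer["_merge"].astype(str)))
```

In `origin`, students present in both tables are marked `both`, while the others are marked `left_only` or `right_only`.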


## Long to Wide format

---

https://chrisalbon.com/python/data_wrangling/pandas_long_to_wide/

## Wide to Long format

---

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html


https://stackoverflow.com/questions/36537945/reshape-wide-to-long-in-pandas


https://stackoverflow.com/questions/22798934/pandas-long-to-wide-reshape-by-two-variables
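
As a rough illustration of what the `pandas.wide_to_long` function linked above does, here is a sketch on a made-up table. It assumes the wide columns share a common stub (`test`) followed by a separator and a numeric suffix; the names `name` and `test_nr` are chosen here for illustration.

```python
import pandas as pd

# Made-up wide table: one column per test
wide = pd.DataFrame({
    "name": ["Mary", "John"],
    "test_1": [6, 7],
    "test_2": [4, 8],
})

# Collect test_1, test_2, ... into a single "test" column;
# the numeric suffix ends up in the new index level "test_nr"
long = pd.wide_to_long(wide, stubnames="test", i="name", j="test_nr", sep="_")
```

The result is indexed by `(name, test_nr)` and has a single `test` column holding the grades.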



# Summary

---
