## CMPINF 2100 Week 05 | Combine DataFrames - Concatenation:

### Import Modules:

In [1]:
import numpy as np
import pandas as pd

### Read Data

Read in the Example A CSV file discussed in the previous recording.

In [2]:
dfA0 = pd.read_csv('Example_A.csv')

In [3]:
dfA0

Unnamed: 0,A,B,C,D,E,F
0,a,0,-100,Jan,aa,10
1,b,1,-200,Feb,aa,20
2,c,2,-300,Mar,aa,10
3,d,3,-400,Apr,bb,20
4,e,4,-500,May,bb,10
5,f,5,-600,Jun,bb,20
6,g,6,-700,Jul,cc,10
7,h,7,-800,Aug,cc,20
8,i,8,-900,Sep,cc,10
9,j,9,-1000,Oct,dd,20


Add an `attempt` column with a constant value of 0.

In [4]:
dfA0['attempt'] = 0

In [5]:
dfA0

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,0
1,b,1,-200,Feb,aa,20,0
2,c,2,-300,Mar,aa,10,0
3,d,3,-400,Apr,bb,20,0
4,e,4,-500,May,bb,10,0
5,f,5,-600,Jun,bb,20,0
6,g,6,-700,Jul,cc,10,0
7,h,7,-800,Aug,cc,20,0
8,i,8,-900,Sep,cc,10,0
9,j,9,-1000,Oct,dd,20,0


Read in the same CSV file again!

In [6]:
dfA1 = pd.read_csv('Example_A.csv')

In [7]:
dfA1

Unnamed: 0,A,B,C,D,E,F
0,a,0,-100,Jan,aa,10
1,b,1,-200,Feb,aa,20
2,c,2,-300,Mar,aa,10
3,d,3,-400,Apr,bb,20
4,e,4,-500,May,bb,10
5,f,5,-600,Jun,bb,20
6,g,6,-700,Jul,cc,10
7,h,7,-800,Aug,cc,20
8,i,8,-900,Sep,cc,10
9,j,9,-1000,Oct,dd,20


Add the `attempt` column again, but this time set it to a constant value of 1.

In [8]:
dfA1['attempt'] = 1

In [9]:
dfA1

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,1
1,b,1,-200,Feb,aa,20,1
2,c,2,-300,Mar,aa,10,1
3,d,3,-400,Apr,bb,20,1
4,e,4,-500,May,bb,10,1
5,f,5,-600,Jun,bb,20,1
6,g,6,-700,Jul,cc,10,1
7,h,7,-800,Aug,cc,20,1
8,i,8,-900,Sep,cc,10,1
9,j,9,-1000,Oct,dd,20,1


### Vertically Concatenate:

Vertically combining means that we stack the objects on top of each other.

In [10]:
pd.concat( [dfA0, dfA1] )

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,0
1,b,1,-200,Feb,aa,20,0
2,c,2,-300,Mar,aa,10,0
3,d,3,-400,Apr,bb,20,0
4,e,4,-500,May,bb,10,0
5,f,5,-600,Jun,bb,20,0
6,g,6,-700,Jul,cc,10,0
7,h,7,-800,Aug,cc,20,0
8,i,8,-900,Sep,cc,10,0
9,j,9,-1000,Oct,dd,20,0


The columns line up cleanly because both DataFrames have the same column names!

In [11]:
dfA0.columns == dfA1.columns

array([ True,  True,  True,  True,  True,  True,  True])
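To see why matching column names matter, here is a minimal sketch of what happens when they do not match. The `top` and `bottom` DataFrames are made-up toy data, not the Example A file. With the default `join='outer'`, `pd.concat()` keeps the union of the column names and fills the gaps with missing values.

```python
import pandas as pd

# Toy frames (hypothetical data, NOT Example_A.csv)
top = pd.DataFrame({'A': ['a', 'b'], 'B': [0, 1]})
bottom = pd.DataFrame({'A': ['c', 'd'], 'G': [10, 20]})

# Default join='outer': keep every column name seen in either frame
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked.columns.tolist())   # ['A', 'B', 'G']
print(stacked.isna().sum())       # B and G each have 2 missing values

# join='inner': keep only the column names shared by every frame
shared = pd.concat([top, bottom], ignore_index=True, join='inner')
print(shared.columns.tolist())    # ['A']
```

So the stack does not fail when the names differ, but the result is probably not what you wanted.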

Look closely at the `.index` attribute of the vertically stacked DataFrame!

In [12]:
pd.concat([dfA0, dfA1]).loc[10]

Unnamed: 0,A,B,C,D,E,F,attempt
10,k,10,-1100,Nov,dd,10,0
10,k,10,-1100,Nov,dd,10,1


By default, the `.index` attribute is allowed to repeat. The `.index` does not uniquely define a row in the stacked DataFrame.

Setting `ignore_index=True` discards the original index values and assigns a new index, so each stacked row gets a unique label.

In [13]:
pd.concat([dfA0, dfA1], ignore_index=True)

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,0
1,b,1,-200,Feb,aa,20,0
2,c,2,-300,Mar,aa,10,0
3,d,3,-400,Apr,bb,20,0
4,e,4,-500,May,bb,10,0
5,f,5,-600,Jun,bb,20,0
6,g,6,-700,Jul,cc,10,0
7,h,7,-800,Aug,cc,20,0
8,i,8,-900,Sep,cc,10,0
9,j,9,-1000,Oct,dd,20,0
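The repeated index can be checked directly instead of spotting it by eye. The sketch below uses two small made-up DataFrames (`top` and `bottom` are hypothetical names) and compares the index with and without `ignore_index=True`. It also shows `verify_integrity=True`, which asks `pd.concat()` to raise an error rather than silently repeat index values.

```python
import pandas as pd

# Toy frames (hypothetical data), both with the default index 0, 1, 2
top = pd.DataFrame({'A': ['a', 'b', 'c']})
bottom = pd.DataFrame({'A': ['a', 'b', 'c']})

# Default behavior: the original index values are kept, so they repeat
repeated = pd.concat([top, bottom])
print(repeated.index.tolist())    # [0, 1, 2, 0, 1, 2]
print(repeated.index.is_unique)   # False
print(len(repeated.loc[1]))       # 2 rows share the label 1

# ignore_index=True assigns a new index: 0 to n-1
fresh = pd.concat([top, bottom], ignore_index=True)
print(fresh.index.tolist())       # [0, 1, 2, 3, 4, 5]
print(fresh.index.is_unique)      # True

# verify_integrity=True refuses to create a repeated index
try:
    pd.concat([top, bottom], verify_integrity=True)
    raised = False
except ValueError:
    raised = True
print(raised)                     # True
```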


I also like to force a deep copy with `copy=True`, just in case.

In [14]:
pd.concat([dfA0, dfA1], ignore_index=True, copy=True)

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,0
1,b,1,-200,Feb,aa,20,0
2,c,2,-300,Mar,aa,10,0
3,d,3,-400,Apr,bb,20,0
4,e,4,-500,May,bb,10,0
5,f,5,-600,Jun,bb,20,0
6,g,6,-700,Jul,cc,10,0
7,h,7,-800,Aug,cc,20,0
8,i,8,-900,Sep,cc,10,0
9,j,9,-1000,Oct,dd,20,0


We can assign the result to an object.

In [15]:
dfA_double = pd.concat([dfA0, dfA1], ignore_index=True, copy=True)

In [16]:
dfA_double.shape

(24, 7)

In [17]:
dfA0.shape

(12, 7)

In [18]:
dfA1.shape

(12, 7)
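As an aside, `pd.concat()` can also label where each row came from through its `keys` and `names` arguments, instead of adding the `attempt` column by hand before stacking. This is only a sketch of that alternative on made-up data. The names `first`, `second`, `labelled`, `tidy`, and the level name `'row'` are assumptions for illustration.

```python
import pandas as pd

# Toy frames (hypothetical data) WITHOUT an 'attempt' column
first = pd.DataFrame({'A': ['a', 'b'], 'F': [10, 20]})
second = pd.DataFrame({'A': ['a', 'b'], 'F': [10, 20]})

# keys= tags each input frame; names= labels the two index levels
labelled = pd.concat([first, second], keys=[0, 1], names=['attempt', 'row'])

# Move the 'attempt' level into a regular column, then renumber the rows
tidy = labelled.reset_index(level='attempt').reset_index(drop=True)

print(tidy.columns.tolist())      # ['attempt', 'A', 'F']
print(tidy['attempt'].tolist())   # [0, 0, 1, 1]
print(tidy.shape)                 # (4, 3)
```

The row counts still add up: 2 rows plus 2 rows gives 4 rows, just like 12 plus 12 gives 24 above.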

### Horizontal Concatenation:

This is where we are 'Binding' columns together!

The default value of the `axis` argument is 0, meaning the DataFrames are *vertically* combined!

In [19]:
pd.concat([dfA0, dfA1], axis=0)

Unnamed: 0,A,B,C,D,E,F,attempt
0,a,0,-100,Jan,aa,10,0
1,b,1,-200,Feb,aa,20,0
2,c,2,-300,Mar,aa,10,0
3,d,3,-400,Apr,bb,20,0
4,e,4,-500,May,bb,10,0
5,f,5,-600,Jun,bb,20,0
6,g,6,-700,Jul,cc,10,0
7,h,7,-800,Aug,cc,20,0
8,i,8,-900,Sep,cc,10,0
9,j,9,-1000,Oct,dd,20,0


If we change `axis` to `axis=1` then the two DataFrames will be combined 'horizontally'!

In [21]:
pd.concat([dfA0, dfA1], axis=1)

Unnamed: 0,A,B,C,D,E,F,attempt,A.1,B.1,C.1,D.1,E.1,F.1,attempt.1
0,a,0,-100,Jan,aa,10,0,a,0,-100,Jan,aa,10,1
1,b,1,-200,Feb,aa,20,0,b,1,-200,Feb,aa,20,1
2,c,2,-300,Mar,aa,10,0,c,2,-300,Mar,aa,10,1
3,d,3,-400,Apr,bb,20,0,d,3,-400,Apr,bb,20,1
4,e,4,-500,May,bb,10,0,e,4,-500,May,bb,10,1
5,f,5,-600,Jun,bb,20,0,f,5,-600,Jun,bb,20,1
6,g,6,-700,Jul,cc,10,0,g,6,-700,Jul,cc,10,1
7,h,7,-800,Aug,cc,20,0,h,7,-800,Aug,cc,20,1
8,i,8,-900,Sep,cc,10,0,i,8,-900,Sep,cc,10,1
9,j,9,-1000,Oct,dd,20,0,j,9,-1000,Oct,dd,20,1


In [22]:
pd.concat([dfA0, dfA1], axis=1).columns

Index(['A', 'B', 'C', 'D', 'E', 'F', 'attempt', 'A', 'B', 'C', 'D', 'E', 'F',
       'attempt'],
      dtype='object')

The column names are no longer unique.

In [23]:
pd.concat([dfA0, dfA1], axis=1).loc[:, ['A', 'B'] ]

Unnamed: 0,A,A.1,B,B.1
0,a,a,0,0
1,b,b,1,1
2,c,c,2,2
3,d,d,3,3
4,e,e,4,4
5,f,f,5,5
6,g,g,6,6
7,h,h,7,7
8,i,i,8,8
9,j,j,9,9


I think this is very bad. I really dislike that pandas allows combining DataFrames horizontally even if they have the **same column names**!
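The duplicated names can at least be detected, and one possible workaround is to rename the columns before combining. The sketch below uses made-up DataFrames (`first` and `second` are hypothetical), and the suffixes `'_0'` and `'_1'` are arbitrary choices.

```python
import pandas as pd

# Toy frames (hypothetical data) that share the same column names
first = pd.DataFrame({'A': ['a', 'b'], 'B': [0, 1]})
second = pd.DataFrame({'A': ['a', 'b'], 'B': [0, 1]})

wide = pd.concat([first, second], axis=1)

# Detect the problem
print(wide.columns.is_unique)             # False
print(wide.columns.duplicated().tolist()) # [False, False, True, True]

# Selecting 'A' returns BOTH columns named 'A', not a single Series
print(wide['A'].shape)                    # (2, 2)

# One workaround: make the names distinct BEFORE combining
wide_named = pd.concat(
    [first.add_suffix('_0'), second.add_suffix('_1')],
    axis=1
)
print(wide_named.columns.tolist())        # ['A_0', 'B_0', 'A_1', 'B_1']
```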

Be careful when you horizontally combine. If it can cause issues like this, why would we ever want to do it?

In [24]:
dfA_left = dfA0.loc[ :, dfA0.columns[:3] ].copy()

In [25]:
dfA_left

Unnamed: 0,A,B,C
0,a,0,-100
1,b,1,-200
2,c,2,-300
3,d,3,-400
4,e,4,-500
5,f,5,-600
6,g,6,-700
7,h,7,-800
8,i,8,-900
9,j,9,-1000


In [26]:
dfA_right = dfA0.loc[ :, dfA0.columns[-2:]].copy()

In [27]:
dfA_right

Unnamed: 0,F,attempt
0,10,0
1,20,0
2,10,0
3,20,0
4,10,0
5,20,0
6,10,0
7,20,0
8,10,0
9,20,0


In [28]:
dfA_left.shape

(12, 3)

In [29]:
dfA_right.shape

(12, 2)

The point of horizontally combining is to bring together **different** columns from DataFrames that have the same number of rows and matching `.index` values!

In [30]:
pd.concat([dfA_left, dfA_right], axis=1)

Unnamed: 0,A,B,C,F,attempt
0,a,0,-100,10,0
1,b,1,-200,20,0
2,c,2,-300,10,0
3,d,3,-400,20,0
4,e,4,-500,10,0
5,f,5,-600,20,0
6,g,6,-700,10,0
7,h,7,-800,20,0
8,i,8,-900,10,0
9,j,9,-1000,20,0
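The result above lines up because `dfA_left` and `dfA_right` share the same `.index`. Horizontal concatenation matches rows by index label, not by position. A minimal sketch with made-up data (`left` and `right` are hypothetical) shows what happens when the index values differ, even though both DataFrames have 3 rows.

```python
import pandas as pd

# Toy frames (hypothetical data) with the same number of rows
left = pd.DataFrame({'A': ['a', 'b', 'c']})                  # index 0, 1, 2
right = pd.DataFrame({'F': [10, 20, 30]}, index=[2, 3, 4])   # index 2, 3, 4

# Rows are aligned on the index labels, so unmatched labels become NaN
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)                     # (5, 2)
print(int(misaligned.isna().sum().sum()))   # 4 missing values

# If the rows really are in matching order, reset both indexes first
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1
)
print(aligned.shape)                        # (3, 2)
print(int(aligned.isna().sum().sum()))      # 0 missing values
```

Resetting the index is only safe when you know the rows are already in the same order in both objects.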


But be careful if you set `ignore_index=True` with horizontal concatenation.

The column names will be replaced by integers if you are not paying attention!

In [31]:
pd.concat([dfA_left, dfA_right], axis=1, ignore_index=True)

Unnamed: 0,0,1,2,3,4
0,a,0,-100,10,0
1,b,1,-200,20,0
2,c,2,-300,10,0
3,d,3,-400,20,0
4,e,4,-500,10,0
5,f,5,-600,20,0
6,g,6,-700,10,0
7,h,7,-800,20,0
8,i,8,-900,10,0
9,j,9,-1000,20,0
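The reason is that `ignore_index` applies to the axis being concatenated. With `axis=1` that axis is the columns, so the column labels are the ones that get renumbered. Here is a small sketch on made-up data (`left` and `right` are hypothetical) that shows the effect and one way to put the names back.

```python
import pandas as pd

# Toy frames (hypothetical data) with different column names
left = pd.DataFrame({'A': ['a', 'b'], 'B': [0, 1]})
right = pd.DataFrame({'F': [10, 20]})

# ignore_index=True with axis=1 renumbers the COLUMNS, not the rows
renumbered = pd.concat([left, right], axis=1, ignore_index=True)
print(renumbered.columns.tolist())   # [0, 1, 2]

# Leaving ignore_index at its default keeps the column names
kept = pd.concat([left, right], axis=1)
print(kept.columns.tolist())         # ['A', 'B', 'F']

# If the names were already lost, they can be assigned back by hand
restored = renumbered.copy()
restored.columns = list(left.columns) + list(right.columns)
print(restored.columns.tolist())     # ['A', 'B', 'F']
```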
