### `pandas.concat()`

The `concat` function of `pandas` combines tables; by default it stacks one table on top of the other.

In [1]:
import pandas as pd

When the tables have the same columns:

In [27]:
data1 = {'col1': [1,2,3,4,5,6],
        'col2': ['a','b','c','d','e','f']
        }

data2 = {'col1': [11,22,33,44,55],
        'col2': ['aa','bb','cc','dd','ee']}


In [28]:
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

In [29]:
df1

,col1,col2
0,1,a
1,2,b
2,3,c
3,4,d
4,5,e
5,6,f


In [30]:
df2

,col1,col2
0,11,aa
1,22,bb
2,33,cc
3,44,dd
4,55,ee


In [31]:
#combining the two tables
df3 = pd.concat([df1,df2])
df3

,col1,col2
0,1,a
1,2,b
2,3,c
3,4,d
4,5,e
5,6,f
0,11,aa
1,22,bb
2,33,cc
3,44,dd
4,55,ee


When the tables have different columns:

In [32]:
data1 = {'col1': [1,2,3,4,5,6],
        'col2': ['a','b','c','d','e','f'],
        'col3': ['apple','ball','cat','dog','egg','fish']}

data2 = {'col1': [11,22,33,44,55],
        'col2': ['aa','bb','cc','dd','ee']}

#defining the `DataFrame`
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

#combining the tables
df3 = pd.concat([df1,df2])
df3

,col1,col2,col3
0,1,a,apple
1,2,b,ball
2,3,c,cat
3,4,d,dog
4,5,e,egg
5,6,f,fish
0,11,aa,
1,22,bb,
2,33,cc,
3,44,dd,
4,55,ee,


The cells that have no value (`col3` for the rows coming from `df2`) are filled with `NaN`, and pandas does not raise an error.
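
Two optional arguments of `pd.concat` are worth knowing at this point. The following is a minimal sketch that rebuilds the two tables from above: `ignore_index=True` renumbers the rows instead of repeating each table's own index, and `join='inner'` keeps only the columns shared by all tables, so nothing has to be padded with `NaN`. The variable names `stacked` and `common` are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                    'col2': ['a', 'b', 'c', 'd', 'e', 'f'],
                    'col3': ['apple', 'ball', 'cat', 'dog', 'egg', 'fish']})
df2 = pd.DataFrame({'col1': [11, 22, 33, 44, 55],
                    'col2': ['aa', 'bb', 'cc', 'dd', 'ee']})

# Renumber the rows 0..10 instead of repeating 0..5 and 0..4
stacked = pd.concat([df1, df2], ignore_index=True)

# Keep only the columns present in both tables (col1 and col2)
common = pd.concat([df1, df2], join='inner', ignore_index=True)

print(stacked)
print(common)
```

Whether to renumber the index depends on the use case: the original row labels are sometimes useful for tracing a row back to its source table.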

### `pandas.merge()`

Defining two `DataFrame` tables.

In [38]:
data1 = {'col1': [1,2,3,4,5,6],
        'col2': ['a','b','c','d','e','f'],
        'col3': ['apple','ball','cat','dog','egg','fish']}

data2 = {'col1': [11,22,33,44,55],
        'col2': ['a','b','cc','dd','e']}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

Merging the two tables above. By default, the join is `inner`, i.e. only rows whose value in the merging column appears in both tables are kept.

In [40]:
df3 = pd.merge(left=df1, right=df2, on='col2')
df3

,col1_x,col2,col3,col1_y
0,1,a,apple,11
1,2,b,ball,22
2,5,e,egg,55


The join type can be specified with the `how` parameter when merging the tables. The most common join types are:
* __inner__: only includes elements that appear in both dataframes with a common key
* __outer__: includes all data from both dataframes
* __right__: includes all of the rows from the "right" dataframe along with any rows from the "left" dataframe with a common key; the result retains all columns from both of the original dataframes
* __left__: includes all of the rows from the "left" dataframe along with any rows from the "right" dataframe with a common key; the result retains all columns from both of the original dataframes

In [42]:
df3 = pd.merge(left=df1, right=df2, on='col2', how='outer')
df3

,col1_x,col2,col3,col1_y
0,1.0,a,apple,11.0
1,2.0,b,ball,22.0
2,3.0,c,cat,
3,4.0,d,dog,
4,5.0,e,egg,55.0
5,6.0,f,fish,
6,,cc,,33.0
7,,dd,,44.0


In an `outer` join, the missing values are filled with `NaN`, as in `pd.concat()`.

_This is consistent with `concat` using an `outer` join by default (its `join` parameter defaults to `'outer'`)._
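
The `left` and `right` joins listed above are not shown in the cells of this section, so here is a small sketch of them using the same two tables. It also uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column recording where each row came from. The variable names `left`, `right` and `outer` are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                    'col2': ['a', 'b', 'c', 'd', 'e', 'f'],
                    'col3': ['apple', 'ball', 'cat', 'dog', 'egg', 'fish']})
df2 = pd.DataFrame({'col1': [11, 22, 33, 44, 55],
                    'col2': ['a', 'b', 'cc', 'dd', 'e']})

# Left join: every row of df1; col1_y is NaN where df2 has no matching col2
left = pd.merge(left=df1, right=df2, on='col2', how='left')

# Right join: every row of df2; col1_x and col3 are NaN where df1 has no match
right = pd.merge(left=df1, right=df2, on='col2', how='right')

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
outer = pd.merge(left=df1, right=df2, on='col2', how='outer', indicator=True)

print(left)
print(right)
print(outer[['col2', '_merge']])
```

The `_merge` column is a convenient way to check which keys failed to match before deciding on a join type.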

### `pandas.DataFrame.groupby()`

The `groupby` function works in three steps: split, apply and combine. It first splits the data into groups based on the given column(s), then applies the aggregating function to each group, and finally combines the results into the output.

In [2]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


We group the above table on the `key` column, which groups the rows by the values in that column.
`groupby` returns a `DataFrameGroupBy` object, as can be seen below.

In [3]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000113DEA59E80>

Calling an aggregating function on the `DataFrameGroupBy` object applies that function to the values of the remaining columns, group by group.

Here we apply the `sum()` function, and the values in the `data` column are summed for each value of the `key` column, i.e. all `data` values in rows where `key` is `A` are added together, and so forth.

In [4]:
df.groupby('key').sum()

key,data
A,3
B,5
C,7
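
The split, apply and combine steps can also be spelled out by hand. The following is only an illustrative sketch of the idea, not how pandas implements `groupby` internally; the names `pieces`, `sums` and `manual` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Split: one sub-table per distinct value of 'key'
pieces = {key: df[df['key'] == key] for key in df['key'].unique()}

# Apply: reduce each sub-table to a single number
sums = {key: piece['data'].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='data').rename_axis('key')

# The same result in a single call
grouped = df.groupby('key')['data'].sum()

print(manual)
print(grouped)
```

Both approaches give 3, 5 and 7 for `A`, `B` and `C`; `groupby` simply does the bookkeeping for us.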


Multiple aggregating functions can be applied in one go by passing a list of them to the `agg` method. _Notice that the functions are given by name and are not called: we pass `'sum'`, not `sum()`. Function objects such as `np.mean` can also be passed, again without parentheses, although recent pandas versions recommend the string names._

In [8]:
df.groupby('key').agg(['sum', 'mean', 'max'])

,data,data,data
key,sum,mean,max
A,3,1.5,3
B,5,2.5,4
C,7,3.5,5
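
The two-level column header in the output above can be avoided with what pandas calls named aggregation. A minimal sketch, assuming the same `df` as above: each keyword argument becomes an output column, and its value is a `(source column, aggregation)` pair. The output names `total`, `average` and `largest` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Each keyword is an output column name (illustrative names);
# each value is a (source column, aggregation name) pair
summary = df.groupby('key').agg(
    total=('data', 'sum'),
    average=('data', 'mean'),
    largest=('data', 'max'),
)
print(summary)
```

This form yields flat, descriptive column names, which can be easier to work with in later steps.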


### `pandas.cut()`

The `cut()` function groups data into bins, i.e. it helps to convert numeric values into categorical data.

In [24]:
# creating a sample dataframe
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E'],
                  'val1': [10,20,30,40,50],
                  'val2': [40, 60, 75, 24, 13]})

In [25]:
df

,name,val1,val2
0,A,10,40
1,B,20,60
2,C,30,75
3,D,40,24
4,E,50,13


In [26]:
#defining the bins to categorize the values in the above table
bins = [0, 20, 40, 60, 80, 100]

In [28]:
#grouping the data of val2 column into categories defined above
df['group'] = pd.cut(df['val2'], bins=bins)
df

,name,val1,val2,group
0,A,10,40,"(20, 40]"
1,B,20,60,"(40, 60]"
2,C,30,75,"(60, 80]"
3,D,40,24,"(20, 40]"
4,E,50,13,"(0, 20]"


Notice how the values in the `group` column start with a round parenthesis and end with a square bracket. This signifies that the starting value is not included in the group while the end value is, i.e. for (20, 40] the number 20 is not included in this group but 40 is.

We can also give each category a name by passing the `labels` parameter to the `cut` function.
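
Which end of each interval is closed can be switched with the `right` parameter of `pd.cut`. A small sketch, assuming the same `val2` values and bins as above; the names `right_closed` and `left_closed` are chosen here for illustration.

```python
import pandas as pd

values = pd.Series([40, 60, 75, 24, 13])
bins = [0, 20, 40, 60, 80, 100]

# Default (right=True): intervals look like (20, 40], so 40 falls in (20, 40]
right_closed = pd.cut(values, bins=bins)

# right=False: intervals look like [40, 60), so 40 falls in [40, 60)
left_closed = pd.cut(values, bins=bins, right=False)

print(pd.DataFrame({'val2': values,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

Only the values that sit exactly on a bin edge (40 and 60 here) change group; the others land in the same range either way.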

In [30]:
#defining the labels to be included for the groups
label = ['very poor', 'poor', 'average', 'good', 'very good']

df['group_named'] = pd.cut(df['val2'], bins=bins, labels = label)
df

,name,val1,val2,group,group_named
0,A,10,40,"(20, 40]",poor
1,B,20,60,"(40, 60]",average
2,C,30,75,"(60, 80]",good
3,D,40,24,"(20, 40]",poor
4,E,50,13,"(0, 20]",very poor
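
Once the values are binned, the new categorical column can be used like any other grouping key. The following is a minimal sketch that combines `cut` with `groupby` to count the rows per category; `observed=False` asks pandas to list categories that have no rows as well. The variable name `counts` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E'],
                   'val1': [10, 20, 30, 40, 50],
                   'val2': [40, 60, 75, 24, 13]})

bins = [0, 20, 40, 60, 80, 100]
label = ['very poor', 'poor', 'average', 'good', 'very good']
df['group_named'] = pd.cut(df['val2'], bins=bins, labels=label)

# Count the rows in each category, including categories with no rows
counts = df.groupby('group_named', observed=False).size()
print(counts)
```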
