### Pandas Day 2

### Missing Numerical data

The second kind of missing-data marker is NaN (short for "Not a Number"). It is a special floating-point value defined by the IEEE floating-point standard, and NumPy exposes it as np.nan. Because NaN is a float, an array that contains it is stored with a floating-point dtype.
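As a quick illustration of the point above (a minimal sketch; the variable names are only for illustration), NaN is an ordinary float and never compares equal to itself, which is why a dedicated check such as `np.isnan` is needed:

```python
import numpy as np

# np.nan is a plain floating-point value
nan_type = type(np.nan)

# IEEE floating point: NaN is not equal to anything, including itself
equals_itself = (np.nan == np.nan)

# use np.isnan (or pd.isna) to detect NaN instead of ==
vals = np.array([1, np.nan, 3, 5])
mask = np.isnan(vals)
print(nan_type, equals_itself, mask)
```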

In [3]:
import pandas as pd

import numpy as np

In [4]:
vals = np.array([1, np.nan, 3, 5])
vals.dtype



dtype('float64')

In [5]:
# any arithmetic operation involving NaN returns NaN
1 - np.nan

nan

In [6]:
0 * np.nan

nan

In [7]:
# aggregates over an array containing NaN do not raise an error, but the result is NaN, which is rarely useful

vals.sum(), vals.min(), vals.max()

(nan, nan, nan)

In [8]:
# NumPy provides NaN-aware aggregates that skip the missing values

np.nansum(vals), np.nanmin(vals), np.nanmax(vals)

(9.0, 1.0, 5.0)
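For comparison, a small sketch of how pandas handles the same values: its aggregations skip NaN by default, and `skipna=False` restores the NumPy-style propagation. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, 5])

# pandas ignores NaN by default when aggregating
total_skip = s.sum()

# skipna=False lets NaN propagate, like ndarray.sum()
total_keep = s.sum(skipna=False)
print(total_skip, total_keep)
```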

In [9]:
# NaN and None in pandas

pd.Series([1,2,3, np.nan, 5, None])

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    NaN
dtype: float64

In [10]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int32

In [11]:
x[0] = None
x

# assigning None upcasts the integer dtype to float, and pandas converts the None to NaN

0    NaN
1    1.0
dtype: float64

### Operating on Null Values

There are several approaches for detecting, removing, and replacing missing values in pandas:

isnull(): generates a Boolean mask indicating missing values

notnull(): the opposite of isnull(); the mask is True where a value is present

dropna(): returns a filtered version of the data with the missing entries dropped

fillna(): returns a copy of the data with the missing values filled or imputed




In [12]:
# detecting null values
# we can either use isnull() or notnull() for this.
# pandas will return a boolean mask over the data

data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [13]:
data.isna() # the same with data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [14]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [15]:
data.notnull() # we can also use data.notna()

0     True
1    False
2     True
3    False
dtype: bool

In [16]:
# a Boolean mask can be used directly as a Series or DataFrame index

data[data.isnull()]

1     NaN
3    None
dtype: object

In [17]:
data[data.notnull()]

0        1
2    hello
dtype: object

In [18]:
# dropping null values: dropna() removes the missing values
# and returns only the entries that are present

data.dropna()


0        1
2    hello
dtype: object

In [19]:
df = pd.DataFrame([[1, 2, np.nan],
                    [3, 4, 6],
                    [np.nan, 4, 8]])
df

Unnamed: 0,0,1,2
0,1.0,2,
1,3.0,4,6.0
2,,4,8.0


In [20]:
# we cannot drop single values from a DataFrame; we can only drop full rows or full columns
# by default, dropna() drops every row that contains any missing value

df.dropna() # returns only the rows without missing data

Unnamed: 0,0,1,2
1,3.0,4,6.0


In [21]:
# to drop along columns instead, pass axis=1

df.dropna(axis=1)

# alternatively, df.dropna(axis='columns') gives the same result
# in this case, every column that contains any missing value is dropped
# the disadvantage is that the good data in those columns is dropped as well




Unnamed: 0,1
0,2
1,4
2,4


In [22]:
# how NA values are dropped can be controlled with the how and thresh parameters
# of dropna; the default is how='any'

df[3] = np.nan
df


Unnamed: 0,0,1,2,3
0,1.0,2,,
1,3.0,4,6.0,
2,,4,8.0,


In [23]:
df.dropna(axis='columns', how='all') # drop only the columns in which every value is missing

Unnamed: 0,0,1,2
0,1.0,2,
1,3.0,4,6.0
2,,4,8.0


In [24]:
# the thresh parameter sets the minimum number of non-null values
# a row or column must have in order to be kept
df.dropna(axis='rows', thresh=3) # keep only the rows with at least 3 non-null values

Unnamed: 0,0,1,2,3
1,3.0,4,6.0,


In [25]:
# filling null values
# sometimes, rather than dropping rows or columns, we want to replace the
# missing values with valid ones
# we can fill with a constant such as 0, or impute a value from the surrounding data

data = pd.Series([1, 4, np.nan, 5, None, 6], index=list('abcdef'))
data

a    1.0
b    4.0
c    NaN
d    5.0
e    NaN
f    6.0
dtype: float64

In [26]:
# filling with a single value, such as zero

data.fillna(0)

a    1.0
b    4.0
c    0.0
d    5.0
e    0.0
f    6.0
dtype: float64

In [27]:
# a forward fill propagates the previous valid value forward

data.fillna(method='ffill')

a    1.0
b    4.0
c    4.0
d    5.0
e    5.0
f    6.0
dtype: float64

In [28]:
# a backward fill propagates the next valid value backward

data.fillna(method='bfill')

a    1.0
b    4.0
c    5.0
d    5.0
e    6.0
f    6.0
dtype: float64
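In recent pandas releases, passing `method=` to `fillna` is deprecated; the calls above work in the version used for this notebook. A minimal sketch of the same fills using the dedicated `ffill()` and `bfill()` methods, with the Series re-created so the block runs on its own:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 4, np.nan, 5, None, 6], index=list('abcdef'))

# propagate the previous valid value forward
forward = data.ffill()

# propagate the next valid value backward
backward = data.bfill()
print(forward)
print(backward)
```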

In [29]:
# the same options work for DataFrames, where we can also specify the axis
# along which the fill takes place

df

Unnamed: 0,0,1,2,3
0,1.0,2,,
1,3.0,4,6.0,
2,,4,8.0,


In [30]:
df.fillna(method='ffill', axis=1) # one value is still missing because there is no previous value in its row to carry forward

Unnamed: 0,0,1,2,3
0,1.0,2.0,2.0,2.0
1,3.0,4.0,6.0,6.0
2,,4.0,8.0,8.0


In [31]:
df.fillna(axis=1, method='bfill')

Unnamed: 0,0,1,2,3
0,1.0,2.0,,
1,3.0,4.0,6.0,
2,4.0,4.0,8.0,


### Hierarchical Indexing

Topics covered under hierarchical indexing:

Direct creation of MultiIndex objects

Considerations around indexing, slicing, and computing statistics across multiply indexed data

Useful routines for converting between simple and hierarchically indexed representations of our data

In [32]:
# A multiply indexed Series
# there are ways of representing 2D data in a 1D Series
# let us first consider the bad way: using Python tuples as keys


# The bad way
index = [('Aba', 2000), ('Aba', 2010),
        ('Lagos', 2000), ('Lagos', 2010),
        ('Uyo', 2000), ('Uyo', 2010)]
populations = [2345543, 4324543,
              20034533, 24345435,
              1234543, 3234543]
pop = pd.Series(populations, index=index) 
pop

(Aba, 2000)       2345543
(Aba, 2010)       4324543
(Lagos, 2000)    20034533
(Lagos, 2010)    24345435
(Uyo, 2000)       1234543
(Uyo, 2010)       3234543
dtype: int64

In [33]:
pop[('Aba', 2010): ('Uyo', 2000)]

(Aba, 2010)       4324543
(Lagos, 2000)    20034533
(Lagos, 2010)    24345435
(Uyo, 2000)       1234543
dtype: int64

In [34]:
# let us do some selection

pop[[i for i in pop.index if i[1] == 2010]] # this produces the result, but it is not clean

(Aba, 2010)       4324543
(Lagos, 2010)    24345435
(Uyo, 2010)       3234543
dtype: int64

In [35]:
# the better way: the pandas MultiIndex
# let us create a MultiIndex from the tuples above

index = pd.MultiIndex.from_tuples(index)
index


MultiIndex([(  'Aba', 2000),
            (  'Aba', 2010),
            ('Lagos', 2000),
            ('Lagos', 2010),
            (  'Uyo', 2000),
            (  'Uyo', 2010)],
           )

In [36]:
pop = pop.reindex(index)
pop

Aba    2000     2345543
       2010     4324543
Lagos  2000    20034533
       2010    24345435
Uyo    2000     1234543
       2010     3234543
dtype: int64

In [37]:
pop[:, 2010]

Aba       4324543
Lagos    24345435
Uyo       3234543
dtype: int64

In [38]:
pop['Aba']

2000    2345543
2010    4324543
dtype: int64

In [39]:
# MultiIndex as an extra dimension
# the unstack method converts a multiply indexed Series into a conventionally indexed DataFrame

pop_df = pop.unstack()
pop_df


Unnamed: 0,2000,2010
Aba,2345543,4324543
Lagos,20034533,24345435
Uyo,1234543,3234543


In [40]:
# adding another column to the multiply indexed data

pop_df = pd.DataFrame({'Total':pop,
                      'Under 18': [768456, 876456,
                                  9843234, 10322344,
                                  784323, 942453]})
pop_df

Unnamed: 0,Unnamed: 1,Total,Under 18
Aba,2000,2345543,768456
Aba,2010,4324543,876456
Lagos,2000,20034533,9843234
Lagos,2010,24345435,10322344
Uyo,2000,1234543,784323
Uyo,2010,3234543,942453


In [41]:
# fraction of under 18

f_u18 = pop_df['Under 18'] / pop_df['Total']
f_u18.unstack()

Unnamed: 0,2000,2010
Aba,0.327624,0.20267
Lagos,0.491313,0.423995
Uyo,0.635314,0.291371


In [76]:
# methods of MultiIndex creation
# the most direct way of creating a multiply indexed Series or DataFrame is
# to pass a list of two or more index arrays to the constructor

df = pd.DataFrame(np.random.rand(4, 2),
                 index=[['a', 'a', 'b', 'b'], [1,2,1,2]],
                 columns=['data1', 'data2'])
df

Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.700219,0.608071
a,2,0.568244,0.328505
b,1,0.822082,0.972334
b,2,0.931543,0.086164


In [43]:
# explicit MultiIndex constructors
# we can use the class-method constructors of pd.MultiIndex, for example from a list of arrays
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1,2,1,2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [44]:
# we can do this as well from a list of tuples

pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [45]:
# we can also do it from a Cartesian product of single indices

pd.MultiIndex.from_product([['a', 'b'], [1,3]])

MultiIndex([('a', 1),
            ('a', 3),
            ('b', 1),
            ('b', 3)],
           )

In [46]:
# MultiIndex level names

pop.index.names = ['state', 'year']
pop

state  year
Aba    2000     2345543
       2010     4324543
Lagos  2000    20034533
       2010    24345435
Uyo    2000     1234543
       2010     3234543
dtype: int64

In [78]:
# Multiindex for columns

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013,2014], [1,2]],
                                  names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Ken', 'Ekez', 'Suleman'], ['HR', 'Temp']],
                                    names=['subject', 'type'])

# mock some data

data = np.round(np.random.randn(4,6), 1)
data[:,::2] *=10
data +=37

# create the Dataframe
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Unnamed: 0_level_0,subject,Ken,Ken,Ekez,Ekez,Suleman,Suleman
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,31.0,37.4,49.0,36.9,35.0,36.8
2013,2,42.0,37.3,40.0,38.4,56.0,37.4
2014,1,25.0,36.8,32.0,35.1,46.0,37.8
2014,2,43.0,36.4,38.0,36.0,15.0,36.9


In [83]:
health_data['Ken']

Unnamed: 0_level_0,type,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,31.0,37.4
2013,2,42.0,37.3
2014,1,25.0,36.8
2014,2,43.0,36.4


In [84]:
# Multiply indexed series
# we will use the pop data here
pop

state  year
Aba    2000     2345543
       2010     4324543
Lagos  2000    20034533
       2010    24345435
Uyo    2000     1234543
       2010     3234543
dtype: int64

In [85]:
# we can access single elements here by indexing with multiple terms

pop['Aba', 2000]

2345543

In [86]:
# the MultiIndex also supports partial indexing
pop['Aba'] # access all the data for Aba

year
2000    2345543
2010    4324543
dtype: int64

In [88]:
# partial slicing is also supported, as long as the MultiIndex is sorted

pop.loc['Lagos': 'Uyo', 2000]

state  year
Lagos  2000    20034533
Uyo    2000     1234543
dtype: int64

In [89]:
pop.loc['Aba': 'Lagos'] # partial slicing indexing

state  year
Aba    2000     2345543
       2010     4324543
Lagos  2000    20034533
       2010    24345435
dtype: int64

In [90]:
# we can as well access all data with the year 2000

pop[:, 2000]

state
Aba       2345543
Lagos    20034533
Uyo       1234543
dtype: int64

In [92]:
pop[pop > 5000000] # Boolean mask indexing

state  year
Lagos  2000    20034533
       2010    24345435
dtype: int64

In [96]:
pop[['Aba', 'Lagos']] # fancy indexing

state  year
Aba    2000     2345543
       2010     4324543
Lagos  2000    20034533
       2010    24345435
dtype: int64

In [100]:
# multiply indexed DataFrames
# they behave in a similar manner

# let us work with our health data

health_data

Unnamed: 0_level_0,subject,Ken,Ken,Ekez,Ekez,Suleman,Suleman
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,31.0,37.4,49.0,36.9,35.0,36.8
2013,2,42.0,37.3,40.0,38.4,56.0,37.4
2014,1,25.0,36.8,32.0,35.1,46.0,37.8
2014,2,43.0,36.4,38.0,36.0,15.0,36.9


In [101]:
# accessing columns

health_data['Ken', 'HR']

year  visit
2013  1        31.0
      2        42.0
2014  1        25.0
      2        43.0
Name: (Ken, HR), dtype: float64

In [102]:
# we can also use the loc and iloc indexers

health_data.iloc[:2, :]

Unnamed: 0_level_0,subject,Ken,Ken,Ekez,Ekez,Suleman,Suleman
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,31.0,37.4,49.0,36.9,35.0,36.8
2013,2,42.0,37.3,40.0,38.4,56.0,37.4


In [103]:
health_data.iloc[:2,:2]

Unnamed: 0_level_0,subject,Ken,Ken
Unnamed: 0_level_1,type,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,31.0,37.4
2013,2,42.0,37.3


In [110]:
health_data.loc[:, ('Ken', ['HR','Temp'])]

Unnamed: 0_level_0,subject,Ken,Ken
Unnamed: 0_level_1,type,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,31.0,37.4
2013,2,42.0,37.3
2014,1,25.0,36.8
2014,2,43.0,36.4


In [114]:
# we can also use pd.IndexSlice, which lets us write slices inside the index tuples

idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

Unnamed: 0_level_0,subject,Ken,Ekez,Suleman
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,1,31.0,49.0,35.0
2014,1,25.0,32.0,46.0
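For comparison, a sketch of the same selection written with Python's built-in `slice(None)` in place of `pd.IndexSlice`. The data are random stand-ins built the same way as `health_data`, so the numbers will differ from the output above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Ken', 'Ekez', 'Suleman'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health_data = pd.DataFrame(np.random.randn(4, 6) + 37,
                           index=index, columns=columns)

# selection with pd.IndexSlice, as in the cell above
idx = pd.IndexSlice
with_indexslice = health_data.loc[idx[:, 1], idx[:, 'HR']]

# slice(None) plays the role of ":" inside an explicit tuple
with_builtin = health_data.loc[(slice(None), 1), (slice(None), 'HR')]
print(with_indexslice.equals(with_builtin))
```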


In [117]:
# rearranging multi-indices
# sorted and unsorted indices

# many of the MultiIndex slicing operations will fail if the index is not sorted

index = pd.MultiIndex.from_product([['a','c','b'], [1,2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.582978
      2      0.502201
c     1      0.952244
      2      0.500132
b     1      0.308063
      2      0.908920
dtype: float64

In [125]:
# let us try partial slicing on the unsorted index to see the lexicographical sorting error
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


In [126]:
# we can correct this error using the sort_index pandas method

data = data.sort_index()
data

char  int
a     1      0.582978
      2      0.502201
b     1      0.308063
      2      0.908920
c     1      0.952244
      2      0.500132
dtype: float64

In [129]:
# with the index sorted, the same partial slice now works
try:
    data = data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)
print(data)

char  int
a     1      0.582978
      2      0.502201
b     1      0.308063
      2      0.908920
dtype: float64
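The check-then-sort pattern shown in the last few cells can be sketched in one place. `is_monotonic_increasing` is an existing `Index` property; the variable names are illustrative and the data is random.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)

# an unsorted MultiIndex reports False here
was_sorted = data.index.is_monotonic_increasing

# sort only when needed; after that, slicing on the first level is safe
if not was_sorted:
    data = data.sort_index()
subset = data['a':'b']
print(was_sorted, len(subset))
```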


In [132]:
# Stacking and unstacking indices
pop

state  year
Aba    2000     2345543
       2010     4324543
Lagos  2000    20034533
       2010    24345435
Uyo    2000     1234543
       2010     3234543
dtype: int64

In [133]:
# unstacking a chosen level: level=0 moves the states into the columns

pop.unstack(level=0)

state,Aba,Lagos,Uyo
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000,2345543,20034533,1234543
2010,4324543,24345435,3234543


In [135]:
# unstacking the other level: level=1 moves the years into the columns

pop.unstack(level=1) # I prefer this layout

year,2000,2010
state,Unnamed: 1_level_1,Unnamed: 2_level_1
Aba,2345543,4324543
Lagos,20034533,24345435
Uyo,1234543,3234543


In [138]:
# the opposite of unstack is stack; here we use it to recover the original Series

pop.unstack(level=1).stack()

state  year
Aba    2000     2345543
       2010     4324543
Lagos  2000    20034533
       2010    24345435
Uyo    2000     1234543
       2010     3234543
dtype: int64

In [140]:
# index setting and resetting
# another way to rearrange hierarchical data is to turn the index labels into columns with reset_index

pop


state  year
Aba    2000     2345543
       2010     4324543
Lagos  2000    20034533
       2010    24345435
Uyo    2000     1234543
       2010     3234543
dtype: int64

In [151]:
# resetting the index
pop_flat = pop.reset_index(name='population')
pop_flat

Unnamed: 0,state,year,population
0,Aba,2000,2345543
1,Aba,2010,4324543
2,Lagos,2000,20034533
3,Lagos,2010,24345435
4,Uyo,2000,1234543
5,Uyo,2010,3234543


In [155]:
# when working with real-world data, it is often easier to build a
# MultiIndex from the column values using set_index
# note: listing all three columns here moves population into the index as well, leaving no data columns

pop_flat.set_index(['state','year', 'population'])

state,year,population
Aba,2000,2345543
Aba,2010,4324543
Lagos,2000,20034533
Lagos,2010,24345435
Uyo,2000,1234543
Uyo,2010,3234543
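A sketch of the more common pattern, assuming we want `population` to stay as a data column: pass only the key columns to `set_index`. The frame below re-creates `pop_flat` so the block runs on its own.

```python
import pandas as pd

pop_flat = pd.DataFrame({
    'state': ['Aba', 'Aba', 'Lagos', 'Lagos', 'Uyo', 'Uyo'],
    'year': [2000, 2010, 2000, 2010, 2000, 2010],
    'population': [2345543, 4324543, 20034533, 24345435, 1234543, 3234543],
})

# only the key columns become index levels; population remains a column
pop_indexed = pop_flat.set_index(['state', 'year'])
print(pop_indexed)

# reset_index undoes the operation and returns the flat table
round_trip = pop_indexed.reset_index()
```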


In [156]:
# Data aggregations on multi-indices
# for hierarchically indexed data, we can pass the level parameter to control which subset of the data the aggregate is computed on

health_data



Unnamed: 0_level_0,subject,Ken,Ken,Ekez,Ekez,Suleman,Suleman
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,31.0,37.4,49.0,36.9,35.0,36.8
2013,2,42.0,37.3,40.0,38.4,56.0,37.4
2014,1,25.0,36.8,32.0,35.1,46.0,37.8
2014,2,43.0,36.4,38.0,36.0,15.0,36.9


In [158]:
# computing the average of the two visits each year

data_mean = health_data.mean(level='year')
data_mean

subject,Ken,Ken,Ekez,Ekez,Suleman,Suleman
type,HR,Temp,HR,Temp,HR,Temp
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2013,36.5,37.35,44.5,37.65,45.5,37.1
2014,34.0,36.6,35.0,35.55,30.5,37.35


In [162]:
# making use of the axis keyword to aggregate along the column levels instead
data_mean = health_data.mean(axis=1, level='type')
data_mean



Unnamed: 0_level_0,type,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,38.333333,37.033333
2013,2,46.0,37.7
2014,1,34.333333,36.566667
2014,2,32.0,36.433333
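The `level` argument of `mean` used above was deprecated in pandas 1.3 and is no longer available from pandas 2.0. A hedged sketch of the equivalent using `groupby(level=...)`, with mock data built the same way as `health_data`; the exact numbers will differ because the data is random.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Ken', 'Ekez', 'Suleman'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health_data = pd.DataFrame(np.random.randn(4, 6) + 37,
                           index=index, columns=columns)

# average the two visits within each year (row-level aggregation)
by_year = health_data.groupby(level='year').mean()

# average across subjects for each measurement type (column-level aggregation):
# transpose, group on the former column level, then transpose back
by_type = health_data.T.groupby(level='type').mean().T
print(by_year.shape, by_type.shape)
```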


### Combining Datasets: Concat and Append

In [163]:
# combining datasets ranges from simple concatenation of different datasets
# to more complex, database-style joins
# here we explore the pd.concat() function

# a helper for creating a DataFrame

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
           for c in cols}
    return pd.DataFrame(data, ind)

In [165]:
make_df('ABC', range(4))

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2
3,A3,B3,C3


In [166]:
# NumPy concatenation

x = [1,2,3]
y = [2,3,5]
np.concatenate([x, y])

array([1, 2, 3, 2, 3, 5])

In [169]:
x = [[3,4],
     [6,7]]
np.concatenate([x,x], axis=1)

array([[3, 4, 3, 4],
       [6, 7, 6, 7]])

In [171]:
# simple concatenation with the pd.concat() function

val1 = pd.Series(['A', 'B', 'C'], index=[1,2,3])
val2 = pd.Series(['D','E', 'F'], index=[4,5,6])
pd.concat([val1,val2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [175]:
# for higher dimensional objects

df1 = make_df('AB', [1,2])
df2 = make_df('AB',[3,4])
print(df1), print(df2), print(pd.concat([df1,df2]))

    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4
    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4


(None, None, None)

In [179]:
# pd.concat() allows us to specify the axis along which to concatenate

df3 = make_df('AB', [0,1])
df4 = make_df('CD', [0,1])

print(df3), print(df4), print(pd.concat([df3, df4], axis=1))

    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1


(None, None, None)

In [183]:
# duplicate indices
# pd.concat() preserves indices, even if the result has duplicate indices

x = make_df('AB', [0,1])
y = make_df('AB', [3,4])

y.index = x.index # make duplicate indices

print(x), print(y), print(pd.concat([x,y]))

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A3  B3
1  A4  B4
    A   B
0  A0  B0
1  A1  B1
0  A3  B3
1  A4  B4


(None, None, None)

In [184]:
# using verify_integrity to check for overlapping indices

try:
    pd.concat([x,y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)
    

ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')


In [186]:
# if the index itself does not matter, we can ignore it with the ignore_index flag

print(pd.concat([x,y], ignore_index=True ))

    A   B
0  A0  B0
1  A1  B1
2  A3  B3
3  A4  B4


In [190]:
# adding MultiIndex keys
print(pd.concat([x,y],  keys =['x', 'y']))

      A   B
x 0  A0  B0
  1  A1  B1
y 0  A3  B3
  1  A4  B4


In [194]:
# concatenation with joins
df5 = make_df("ABC", [1,2])
df6 = make_df("BAF", [4,5])
print(df5), print(df6), print(pd.concat([df5, df6]))


    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   A   F
4  B4  A4  F4
5  B5  A5  F5
    A   B    C    F
1  A1  B1   C1  NaN
2  A2  B2   C2  NaN
4  A4  B4  NaN   F4
5  A5  B5  NaN   F5


(None, None, None)

In [195]:
# the columns that are not shared are filled with NaN values
# we can avoid this using the join parameter
# by default, the join is a union of the input columns (join='outer')


print(df5), print(df6), print(pd.concat([df5, df6], join='inner' ))

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   A   F
4  B4  A4  F4
5  B5  A5  F5
    A   B
1  A1  B1
2  A2  B2
4  A4  B4
5  A5  B5


(None, None, None)

In [205]:
# the append() method
print(df1.append(df2))

    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4


### Combining Datasets: Merge and Join

Relational algebra is the foundation of pd.merge().

Categories of joins:

pd.merge() implements one-to-one, many-to-one, and many-to-many joins. All three types are accessed through an identical call to pd.merge(); the type of join performed depends on the form of the input data.
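A small sketch of how the join category can be checked explicitly: `pd.merge` accepts a `validate` argument that raises `MergeError` when the keys do not have the stated relationship. The frames and variable names below are illustrative and are not the ones used in the cells that follow.

```python
import pandas as pd
from pandas.errors import MergeError

employees = pd.DataFrame({'employee': ['Boby', 'Willy', 'Bene'],
                          'Group': ['Engineering', 'Engineering', 'Geography']})
supervisors = pd.DataFrame({'Group': ['Engineering', 'Geography'],
                            'supervisor': ['Mina', 'Prince']})

# 'Group' repeats on the left but is unique on the right: a many-to-one join
merged = pd.merge(employees, supervisors, on='Group', validate='many_to_one')
print(merged)

# claiming a one-to-one relationship for the same inputs is rejected
try:
    pd.merge(employees, supervisors, on='Group', validate='one_to_one')
    one_to_one_ok = True
except MergeError as err:
    one_to_one_ok = False
    print('MergeError:', err)
```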



In [228]:
# a one-to-one join is the simplest type of join; it is similar to the
# column-wise concatenation seen when concatenating DataFrames

df1 = pd.DataFrame({'employee':['Boby','Willy','Bene','Kitis'],
                   'Group':["Engineering", "Engineering",'Geography',"Mecatronics"]})
df2 = pd.DataFrame({'employee':['Boby','Willy','Bene','Kitis'],
                   'Hire Date': [2014,2016,2017,2018]})

In [229]:
print(df1); print('\n'), print(df2)

  employee        Group
0     Boby  Engineering
1    Willy  Engineering
2     Bene    Geography
3    Kitis  Mecatronics


  employee  Hire Date
0     Boby       2014
1    Willy       2016
2     Bene       2017
3    Kitis       2018


(None, None)

In [232]:
df3 = pd.merge(df1,df2, how='outer')
df3

Unnamed: 0,employee,Group,Hire Date
0,Boby,Engineering,2014
1,Willy,Engineering,2016
2,Bene,Geography,2017
3,Kitis,Mecatronics,2018


In [234]:
# many-to-one joins
# in this case, one of the two key columns contains duplicate entries
# and the resulting DataFrame preserves those duplicate entries as appropriate

df4 = pd.DataFrame({'Group':['Engineering', 'Geography','Mecatronics'],
                   'supervisor': ['Mina', 'Prince','Austin']})

print(df3),print('\n'), print(df4), print('\n'), print(pd.merge(df3,df4))

  employee        Group  Hire Date
0     Boby  Engineering       2014
1    Willy  Engineering       2016
2     Bene    Geography       2017
3    Kitis  Mecatronics       2018


         Group supervisor
0  Engineering       Mina
1    Geography     Prince
2  Mecatronics     Austin


  employee        Group  Hire Date supervisor
0     Boby  Engineering       2014       Mina
1    Willy  Engineering       2016       Mina
2     Bene    Geography       2017     Prince
3    Kitis  Mecatronics       2018     Austin


(None, None, None, None, None)

In [238]:
# many-to-many joins
# if the key column in both the left and right DataFrames contains duplicates,
# then the result is a many-to-many merge

df5 = pd.DataFrame({'Group': ['Engineering', 'Engineering', 'Geography', 'Geography', 'Mecatronics'],
                   'skills':['math', 'excel','coding','spreadsheets', 'timing']})

In [239]:
print(df1),print('\n'), print(df5), print('\n'), print(pd.merge(df1,df5))

  employee        Group
0     Boby  Engineering
1    Willy  Engineering
2     Bene    Geography
3    Kitis  Mecatronics


         Group        skills
0  Engineering          math
1  Engineering         excel
2    Geography        coding
3    Geography  spreadsheets
4  Mecatronics        timing


  employee        Group        skills
0     Boby  Engineering          math
1     Boby  Engineering         excel
2    Willy  Engineering          math
3    Willy  Engineering         excel
4     Bene    Geography        coding
5     Bene    Geography  spreadsheets
6    Kitis  Mecatronics        timing


(None, None, None, None, None)

In [240]:
# specification of the merge keys
# sometimes the column names will not match; pd.merge() provides options
# for handling these cases

# the on keyword is used to specify the name of the key column
# it can take either a column name or a list of column names
# this option works only if both the left and right DataFrames have the specified column name

print(df1),print('\n'), print(df2), print('\n'), print(pd.merge(df1,df2, on='employee'))

  employee        Group
0     Boby  Engineering
1    Willy  Engineering
2     Bene    Geography
3    Kitis  Mecatronics


  employee  Hire Date
0     Boby       2014
1    Willy       2016
2     Bene       2017
3    Kitis       2018


  employee        Group  Hire Date
0     Boby  Engineering       2014
1    Willy  Engineering       2016
2     Bene    Geography       2017
3    Kitis  Mecatronics       2018


(None, None, None, None, None)

In [242]:
# the left_on and right_on keywords
# these are used when the key column has a different name in each DataFrame

df3 = pd.DataFrame({'name':['Boby', 'Willy', 'Kitis', 'Bene'],
                    'salary': [100000,50000,60000,300000]})
print(df1),print('\n'), print(df3), print('\n'), print(pd.merge(df1,df3, left_on='employee', right_on='name'))

  employee        Group
0     Boby  Engineering
1    Willy  Engineering
2     Bene    Geography
3    Kitis  Mecatronics


    name  salary
0   Boby  100000
1  Willy   50000
2  Kitis   60000
3   Bene  300000


  employee        Group   name  salary
0     Boby  Engineering   Boby  100000
1    Willy  Engineering  Willy   50000
2     Bene    Geography   Bene  300000
3    Kitis  Mecatronics  Kitis   60000


(None, None, None, None, None)

In [244]:
# the name column is redundant, so we can drop it
pd.merge(df1,df3, left_on='employee', right_on='name').drop('name', axis=1)

Unnamed: 0,employee,Group,salary
0,Boby,Engineering,100000
1,Willy,Engineering,50000
2,Bene,Geography,300000
3,Kitis,Mecatronics,60000


In [253]:
# the left_index and right_index keywords
# these are used to merge on the index (row labels) rather than on a column

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')

print(df1a), print('_' *20), print(df2a)

                Group
employee             
Boby      Engineering
Willy     Engineering
Bene        Geography
Kitis     Mecatronics
____________________
          Hire Date
employee           
Boby           2014
Willy          2016
Bene           2017
Kitis          2018


(None, None, None)

In [254]:

print(df1a)
print('_' * 20)
print(df2a)
print('_' * 20)
print(pd.merge(df1a, df2a, left_index=True, right_index=True))


                Group
employee             
Boby      Engineering
Willy     Engineering
Bene        Geography
Kitis     Mecatronics
____________________
          Hire Date
employee           
Boby           2014
Willy          2016
Bene           2017
Kitis          2018
____________________
                Group  Hire Date
employee                        
Boby      Engineering       2014
Willy     Engineering       2016
Bene        Geography       2017
Kitis     Mecatronics       2018


In [256]:
# for convenience, DataFrame implements the join() method, which performs a merge
# that defaults to joining on indices

print(df1a)
print('_' * 20)
print(df2a)
print('_' * 20)
print(df1a.join(df2a)) # used in place of specifying left_index and right_index



                Group
employee             
Boby      Engineering
Willy     Engineering
Bene        Geography
Kitis     Mecatronics
____________________
          Hire Date
employee           
Boby           2014
Willy          2016
Bene           2017
Kitis          2018
____________________
                Group  Hire Date
employee                        
Boby      Engineering       2014
Willy     Engineering       2016
Bene        Geography       2017
Kitis     Mecatronics       2018


In [257]:
# we can combine left_index with right_on,
# or right_index with left_on, to get the desired output

print(df1a)
print('_' * 20)
print(df3)
print('_' * 20)

print(pd.merge(df1a, df3, left_index=True, right_on='name'))


                Group
employee             
Boby      Engineering
Willy     Engineering
Bene        Geography
Kitis     Mecatronics
____________________
    name  salary
0   Boby  100000
1  Willy   50000
2  Kitis   60000
3   Bene  300000
____________________
         Group   name  salary
0  Engineering   Boby  100000
1  Engineering  Willy   50000
3    Geography   Bene  300000
2  Mecatronics  Kitis   60000
