# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by data frames in the R programming language. You can think of a DataFrame as a collection of Series objects that share the same index.
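
To make the "Series sharing one index" picture concrete, here is a minimal sketch that assembles a DataFrame from two Series. The names `north`, `east`, and `sales` and the three-month sample are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Two Series built on the same index labels (sample values for illustration).
months = ['Jan', 'Feb', 'Mar']
north = pd.Series([5143, 9492, 6223], index=months, name='North')
east = pd.Series([2027, 3506, 1419], index=months, name='East')

# A dict of Series becomes a DataFrame: keys turn into column names,
# and the shared index becomes the row index.
sales = pd.DataFrame({'North': north, 'East': east})

print(sales)
print(type(sales['North']))  # each column is itself a Series
```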

In [4]:
import pandas as pd
import numpy as np

In [5]:
from numpy.random import randn
np.random.seed(101)

In [12]:
df = pd.read_excel('3_NEWS_Sales.xlsx',sheet_name='NEWS')

In [13]:
df

,Month,North,East,West,South
0,Jan,5143,2027,7256,7428
1,Feb,9492,3506,7047,8374
2,Mar,6223,1419,7866,3485
3,Apr,6537,2986,7062,8061
4,May,9186,4614,5657,9003
5,Jun,6999,8625,5130,2725
6,Jul,4882,2341,7418,8768
7,Aug,7762,5923,6715,3117
8,Sep,9629,6063,4306,8304
9,Oct,2239,5743,5530,3131


In [14]:
df.set_index('Month', inplace = True)
df

Month,North,East,West,South
Jan,5143,2027,7256,7428
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131


## Selection and Indexing

Let's look at the various ways to select data from a DataFrame.

In [6]:
df['North']

Month
Jan    5143
Feb    9492
Mar    6223
Apr    6537
May    9186
Jun    6999
Jul    4882
Aug    7762
Sep    9629
Oct    2239
Nov    5232
Dec    1455
Name: North, dtype: int64

In [15]:
# Pass a list of column names
df[['North','South']]

Month,North,South
Jan,5143,7428
Feb,9492,8374
Mar,6223,3485
Apr,6537,8061
May,9186,9003
Jun,6999,2725
Jul,4882,8768
Aug,7762,3117
Sep,9629,8304
Oct,2239,3131


In [16]:
df.loc['Jan']

North    5143
East     2027
West     7256
South    7428
Name: Jan, dtype: int64

DataFrame columns are just Series.

In [12]:
type(df['North'])

pandas.core.series.Series

**Creating a new column:**

In [18]:
df['North_South'] = df['North'] + df['South']

In [19]:
df

Month,North,East,West,South,North_South
Jan,5143,2027,7256,7428,12571
Feb,9492,3506,7047,8374,17866
Mar,6223,1419,7866,3485,9708
Apr,6537,2986,7062,8061,14598
May,9186,4614,5657,9003,18189
Jun,6999,8625,5130,2725,9724
Jul,4882,2341,7418,8768,13650
Aug,7762,5923,6715,3117,10879
Sep,9629,6063,4306,8304,17933
Oct,2239,5743,5530,3131,5370
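
Bracket assignment modifies the DataFrame in place. As a hedged alternative, the sketch below uses `DataFrame.assign`, which returns a new DataFrame and leaves the original untouched. The `sales` frame is a small stand-in for the Excel data, and `with_total` is an illustrative name.

```python
import pandas as pd

# Small stand-in for the sales data (first three months only).
sales = pd.DataFrame(
    {'North': [5143, 9492, 6223], 'South': [7428, 8374, 3485]},
    index=pd.Index(['Jan', 'Feb', 'Mar'], name='Month'),
)

# assign() returns a new DataFrame with the extra column; `sales` is unchanged.
with_total = sales.assign(North_South=sales['North'] + sales['South'])

print(with_total)
print('North_South' in sales.columns)
```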


**Removing columns:**

In [20]:
df.drop('North_South',axis=1)

Month,North,East,West,South
Jan,5143,2027,7256,7428
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131


In [21]:
# drop() returned a new DataFrame; df itself is unchanged unless inplace=True
df

Month,North,East,West,South,North_South
Jan,5143,2027,7256,7428,12571
Feb,9492,3506,7047,8374,17866
Mar,6223,1419,7866,3485,9708
Apr,6537,2986,7062,8061,14598
May,9186,4614,5657,9003,18189
Jun,6999,8625,5130,2725,9724
Jul,4882,2341,7418,8768,13650
Aug,7762,5923,6715,3117,10879
Sep,9629,6063,4306,8304,17933
Oct,2239,5743,5530,3131,5370


In [22]:
df.drop('North_South',axis=1,inplace=True)

In [23]:
df

Month,North,East,West,South
Jan,5143,2027,7256,7428
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131


You can also drop rows this way:

In [20]:
df.drop('Jan',axis=0)

Month,North,East,West,South
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131
Nov,5232,5784,6924,3996
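
The `drop` calls above can also be written with the `columns=` and `index=` keywords, which some readers find clearer than `axis`. This is a minimal sketch on a two-row stand-in frame; `trimmed` and `no_jan` are illustrative names. It reassigns the result rather than using `inplace=True`.

```python
import pandas as pd

sales = pd.DataFrame(
    {'North': [5143, 9492], 'South': [7428, 8374], 'North_South': [12571, 17866]},
    index=pd.Index(['Jan', 'Feb'], name='Month'),
)

# drop() returns a new DataFrame by default, so keep the result in a variable.
trimmed = sales.drop(columns='North_South')  # same as drop('North_South', axis=1)
no_jan = sales.drop(index='Jan')             # same as drop('Jan', axis=0)

print(list(trimmed.columns))
print(list(no_jan.index))
print(list(sales.columns))  # the original still has all three columns
```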


**Selecting rows:**

In [21]:
df.loc['Jan']

North    5143
East     2027
West     7256
South    7428
Name: Jan, dtype: int64

Or select by integer position instead of by label:

In [24]:
df.iloc[2]

North    6223
East     1419
West     7866
South    3485
Name: Mar, dtype: int64

**Selecting a subset of rows and columns:**

In [23]:
df.loc['Mar','South']

3485

In [24]:
df.loc[['Feb','Mar'],['East','West']]

Month,East,West
Feb,3506,7047
Mar,1419,7866
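
The label-based selections above have positional counterparts with `iloc`. The sketch below, on a three-month stand-in frame (an assumption for illustration), shows the same subset selected both ways and the one difference that tends to surprise people: label slices include the end label, positional slices do not include the end position.

```python
import pandas as pd

sales = pd.DataFrame(
    {'North': [5143, 9492, 6223], 'East': [2027, 3506, 1419],
     'West': [7256, 7047, 7866], 'South': [7428, 8374, 3485]},
    index=pd.Index(['Jan', 'Feb', 'Mar'], name='Month'),
)

# Same rows and columns, once by label and once by position.
by_label = sales.loc[['Feb', 'Mar'], ['East', 'West']]
by_position = sales.iloc[[1, 2], [1, 2]]
print(by_label.equals(by_position))

# Label slices include the end label; positional slices exclude the end position.
print(sales.loc['Jan':'Feb', 'North'].tolist())
print(sales.iloc[0:2, 0].tolist())
```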


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, which works much like boolean indexing in NumPy:

In [25]:
df

Month,North,East,West,South
Jan,5143,2027,7256,7428
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131


In [27]:
df>5000

Month,North,East,West,South
Jan,True,False,True,True
Feb,True,False,True,True
Mar,True,False,True,False
Apr,True,False,True,True
May,True,False,True,True
Jun,True,True,True,False
Jul,False,False,True,True
Aug,True,True,True,False
Sep,True,True,False,True
Oct,False,True,True,False


In [28]:
df[df>5000]

Month,North,East,West,South
Jan,5143.0,,7256.0,7428.0
Feb,9492.0,,7047.0,8374.0
Mar,6223.0,,7866.0,
Apr,6537.0,,7062.0,8061.0
May,9186.0,,5657.0,9003.0
Jun,6999.0,8625.0,5130.0,
Jul,,,7418.0,8768.0
Aug,7762.0,5923.0,6715.0,
Sep,9629.0,6063.0,,8304.0
Oct,,5743.0,5530.0,
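
Indexing with a boolean DataFrame keeps the shape and blanks out the cells that fail the condition, which is why the integers above turned into floats with missing values. A small sketch of that behaviour, using three rows from the data as a stand-in (variable names are illustrative):

```python
import pandas as pd

sales = pd.DataFrame(
    {'North': [5143, 6999, 4882], 'East': [2027, 8625, 2341]},
    index=pd.Index(['Jan', 'Jun', 'Jul'], name='Month'),
)

mask = sales > 5000
masked = sales[mask]  # cells where the mask is False become NaN

print(masked)
print(masked.equals(sales.where(mask)))  # same result as DataFrame.where

# To drop rows instead of blanking cells, reduce the mask to one value per row.
print(list(sales[mask.any(axis=1)].index))
```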


In [31]:
df[df['North']>8000]

Month,North,East,West,South
Feb,9492,3506,7047,8374
May,9186,4614,5657,9003
Sep,9629,6063,4306,8304


In [32]:
df[df['North']>8000]['East']

Month
Feb    3506
May    4614
Sep    6063
Name: East, dtype: int64

In [34]:
df[df['North']>8000][['East','West']]

Month,East,West
Feb,3506,7047
May,4614,5657
Sep,6063,4306


To combine two conditions, use & (and) or | (or), and wrap each condition in parentheses:

In [37]:
df[(df['North']>8000) & (df['East'] > 5000)]

Month,North,East,West,South
Sep,9629,6063,4306,8304
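
The sketch below shows both operators on a three-row stand-in frame, and why Python's `and` cannot be used here: it asks for a single truth value from a whole Series. It also shows `.loc` taking the row mask and the column list in one step, as an alternative to chaining brackets. All variable names are illustrative.

```python
import pandas as pd

sales = pd.DataFrame(
    {'North': [5143, 9492, 9629], 'East': [2027, 3506, 6063]},
    index=pd.Index(['Jan', 'Feb', 'Sep'], name='Month'),
)

high_north = sales['North'] > 8000
high_east = sales['East'] > 5000

both = sales[high_north & high_east]    # rows where both conditions hold
either = sales[high_north | high_east]  # rows where at least one holds
print(list(both.index))
print(list(either.index))

# `and` needs one truth value per operand, which is ambiguous for a Series.
try:
    sales[high_north and high_east]
except ValueError as err:
    print('ValueError:', err)

# .loc accepts the row mask and the column selection together.
east_only = sales.loc[high_north, ['East']]
print(east_only)
```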


## More Index Details

Let's discuss some more features of indexing, including resetting the index and setting it to something else. We'll also cover index hierarchy.

In [38]:
df

Month,North,East,West,South
Jan,5143,2027,7256,7428
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131


In [39]:
# Reset to default 0,1...n index
df.reset_index()

,Month,North,East,West,South
0,Jan,5143,2027,7256,7428
1,Feb,9492,3506,7047,8374
2,Mar,6223,1419,7866,3485
3,Apr,6537,2986,7062,8061
4,May,9186,4614,5657,9003
5,Jun,6999,8625,5130,2725
6,Jul,4882,2341,7418,8768
7,Aug,7762,5923,6715,3117
8,Sep,9629,6063,4306,8304
9,Oct,2239,5743,5530,3131


In [40]:
newind = 'Jan19 Feb19 Mar19 Apr19 May19 Jun19 Jul19 Aug19 Sep19 Oct19 Nov19 Dec19'.split()

In [41]:
df['Year_19'] = newind

In [42]:
df

Month,North,East,West,South,Year_19
Jan,5143,2027,7256,7428,Jan19
Feb,9492,3506,7047,8374,Feb19
Mar,6223,1419,7866,3485,Mar19
Apr,6537,2986,7062,8061,Apr19
May,9186,4614,5657,9003,May19
Jun,6999,8625,5130,2725,Jun19
Jul,4882,2341,7418,8768,Jul19
Aug,7762,5923,6715,3117,Aug19
Sep,9629,6063,4306,8304,Sep19
Oct,2239,5743,5530,3131,Oct19


In [44]:
df.set_index('Year_19')

Year_19,North,East,West,South
Jan19,5143,2027,7256,7428
Feb19,9492,3506,7047,8374
Mar19,6223,1419,7866,3485
Apr19,6537,2986,7062,8061
May19,9186,4614,5657,9003
Jun19,6999,8625,5130,2725
Jul19,4882,2341,7418,8768
Aug19,7762,5923,6715,3117
Sep19,9629,6063,4306,8304
Oct19,2239,5743,5530,3131


In [45]:
df

Month,North,East,West,South,Year_19
Jan,5143,2027,7256,7428,Jan19
Feb,9492,3506,7047,8374,Feb19
Mar,6223,1419,7866,3485,Mar19
Apr,6537,2986,7062,8061,Apr19
May,9186,4614,5657,9003,May19
Jun,6999,8625,5130,2725,Jun19
Jul,4882,2341,7418,8768,Jul19
Aug,7762,5923,6715,3117,Aug19
Sep,9629,6063,4306,8304,Sep19
Oct,2239,5743,5530,3131,Oct19


In [46]:
df.set_index('Year_19',inplace=True)

In [47]:
df

Year_19,North,East,West,South
Jan19,5143,2027,7256,7428
Feb19,9492,3506,7047,8374
Mar19,6223,1419,7866,3485
Apr19,6537,2986,7062,8061
May19,9186,4614,5657,9003
Jun19,6999,8625,5130,2725
Jul19,4882,2341,7418,8768
Aug19,7762,5923,6715,3117
Sep19,9629,6063,4306,8304
Oct19,2239,5743,5530,3131
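
Note that `set_index` replaces the current index, so the old `Month` labels are gone after the call above. As a minimal sketch of how `set_index` and `reset_index` relate, the round trip below moves a column into the index and back. The three-row frame and the names `indexed` and `restored` are illustrative assumptions.

```python
import pandas as pd

sales = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'], 'North': [5143, 9492, 6223]})

indexed = sales.set_index('Month')  # 'Month' moves from a column to the index
restored = indexed.reset_index()    # and back to an ordinary column

print(indexed.index.name)
print(restored.equals(sales))

# drop=True discards the old index instead of keeping it as a column.
print(list(indexed.reset_index(drop=True).columns))
```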


## Multi-Index and Index Hierarchy

Let's go over how to work with a MultiIndex. First, we'll create a quick example of what a multi-indexed DataFrame looks like:

In [26]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [27]:
hier_index

MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
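
When the index is every combination of the outer and inner labels, as it is here, `pd.MultiIndex.from_product` is a shorter way to build it. This sketch compares the two constructions; `from_tuples_index` and `from_product_index` are illustrative names.

```python
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
from_tuples_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

# from_product builds every (group, number) combination directly.
from_product_index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])

print(from_tuples_index.equals(from_product_index))
print(from_product_index.nlevels)
```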

In [28]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

,,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


Now let's see how to index this. For an index hierarchy on the rows, use df.loc[]; if the hierarchy were on the columns axis, you would use normal bracket notation, df[]. Selecting one label from the outer level returns the sub-DataFrame:

In [29]:
df.loc['G1']

,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [30]:
df.loc['G1'].loc[1]

A    0.302665
B    1.693723
Name: 1, dtype: float64

In [31]:
df.index.names

FrozenList([None, None])

In [32]:
df.index.names = ['Group','Num']

In [33]:
df

Group,Num,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


In [34]:
df.xs('G1')

Num,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [35]:
# Pass a multi-level key as a tuple (list keys are deprecated in newer pandas)
df.xs(('G1',1))

A    0.302665
B    1.693723
Name: (G1, 1), dtype: float64

In [36]:
df.xs(1,level='Num')

Group,A,B
G1,0.302665,1.693723
G2,0.166905,0.184502
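
For comparison, here is a hedged sketch of how the `xs` calls above relate to `loc` on a MultiIndex. It uses a small frame filled with `np.arange` values instead of random numbers so the results are predictable; `frame`, `inner`, and `same_rows` are illustrative names.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]], names=['Group', 'Num'])
frame = pd.DataFrame(np.arange(12).reshape(6, 2), index=index, columns=['A', 'B'])

# A full key given as a tuple selects one row, with either loc or xs.
row = frame.loc[('G1', 1)]
print(row.equals(frame.xs(('G1', 1))))

# xs with level= selects on an inner level and drops that level from the result.
inner = frame.xs(1, level='Num')
print(list(inner.index))

# The same rows via loc keep both index levels.
same_rows = frame.loc[(slice(None), 1), :]
print(list(same_rows.index))
```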
