# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. You can think of a DataFrame as a collection of Series objects that share the same index.
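To make the "Series sharing an index" idea concrete, here is a minimal sketch that builds a small DataFrame from two Series. The names `north`, `east`, and `sales` are illustrative only, and the values are hard-coded rather than loaded from the Excel file used in this notebook.

```python
import pandas as pd

# Illustrative data: two regional Series that share the same month labels.
months = ['Jan', 'Feb', 'Mar']
north = pd.Series([5143, 9492, 6223], index=months, name='North')
east = pd.Series([2027, 3506, 1419], index=months, name='East')

# Placing the Series side by side yields a DataFrame on the shared index.
sales = pd.concat([north, east], axis=1)

# Pulling a column back out gives a Series aligned on that same index.
print(type(sales['North']))
print(sales.index.equals(north.index))
```

Each Series name becomes a column label, and the shared month labels become the row index.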

In [1]:
import pandas as pd
import numpy as np

In [2]:
from numpy.random import randn
np.random.seed(101)

In [3]:
df = pd.read_excel('3_NEWS_Sales.xlsx',sheet_name='NEWS')

In [4]:
df

,Month,North,East,West,South
0,Jan,5143,2027,7256,7428
1,Feb,9492,3506,7047,8374
2,Mar,6223,1419,7866,3485
3,Apr,6537,2986,7062,8061
4,May,9186,4614,5657,9003
5,Jun,6999,8625,5130,2725
6,Jul,4882,2341,7418,8768
7,Aug,7762,5923,6715,3117
8,Sep,9629,6063,4306,8304
9,Oct,2239,5743,5530,3131


In [5]:
df.set_index('Month', inplace = True)
df

Month,North,East,West,South
Jan,5143,2027,7256,7428
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131


## Selection and Indexing

Let's look at the various ways to grab data from a DataFrame.

In [6]:
df['North']

Month
Jan    5143
Feb    9492
Mar    6223
Apr    6537
May    9186
Jun    6999
Jul    4882
Aug    7762
Sep    9629
Oct    2239
Nov    5232
Dec    1455
Name: North, dtype: int64

In [7]:
# Pass a list of column names
df[['North','South']]

Month,North,South
Jan,5143,7428
Feb,9492,8374
Mar,6223,3485
Apr,6537,8061
May,9186,9003
Jun,6999,2725
Jul,4882,8768
Aug,7762,3117
Sep,9629,8304
Oct,2239,3131


In [8]:
df.loc['Jan']

North    5143
East     2027
West     7256
South    7428
Name: Jan, dtype: int64

DataFrame columns are just Series:

In [9]:
type(df['North'])

pandas.core.series.Series

**Creating a new column:**

In [10]:
df['North_South'] = df['North'] + df['South']

In [11]:
df

Month,North,East,West,South,North_South
Jan,5143,2027,7256,7428,12571
Feb,9492,3506,7047,8374,17866
Mar,6223,1419,7866,3485,9708
Apr,6537,2986,7062,8061,14598
May,9186,4614,5657,9003,18189
Jun,6999,8625,5130,2725,9724
Jul,4882,2341,7418,8768,13650
Aug,7762,5923,6715,3117,10879
Sep,9629,6063,4306,8304,17933
Oct,2239,5743,5530,3131,5370


**Removing columns:**

In [13]:
df.drop('North_South',axis=1)

Month,North,East,West,South
Jan,5143,2027,7256,7428
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131


In [14]:
# drop() returns a new DataFrame; df is unchanged unless inplace=True is passed
df

Month,North,East,West,South,North_South
Jan,5143,2027,7256,7428,12571
Feb,9492,3506,7047,8374,17866
Mar,6223,1419,7866,3485,9708
Apr,6537,2986,7062,8061,14598
May,9186,4614,5657,9003,18189
Jun,6999,8625,5130,2725,9724
Jul,4882,2341,7418,8768,13650
Aug,7762,5923,6715,3117,10879
Sep,9629,6063,4306,8304,17933
Oct,2239,5743,5530,3131,5370


In [15]:
df.drop('North_South',axis=1,inplace=True)

In [16]:
df

Month,North,East,West,South
Jan,5143,2027,7256,7428
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131
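As an alternative to `inplace=True`, the result of `drop` can be bound to a name. The sketch below uses a small hand-built frame (the name `sales` and the two-row data are assumptions for illustration) to show that the original stays intact while the returned copy has the column removed.

```python
import pandas as pd

# Illustrative two-row frame; values are hard-coded, not read from Excel.
sales = pd.DataFrame(
    {'North': [5143, 9492], 'South': [7428, 8374]},
    index=pd.Index(['Jan', 'Feb'], name='Month'),
)
sales['North_South'] = sales['North'] + sales['South']

# drop returns a new DataFrame; columns= is equivalent to axis=1 here.
trimmed = sales.drop(columns='North_South')

# The original still has the column; only the returned frame lacks it.
print('North_South' in sales.columns)
print('North_South' in trimmed.columns)
```

Binding the result to a new name keeps both versions available, which can make notebook cells easier to re-run.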


You can also drop rows this way:

In [17]:
df.drop('Jan',axis=0)

Month,North,East,West,South
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131
Nov,5232,5784,6924,3996


**Selecting rows:**

In [18]:
df.loc['Jan']

North    5143
East     2027
West     7256
South    7428
Name: Jan, dtype: int64

Or select by integer position instead of by label:

In [19]:
df.iloc[2]

North    6223
East     1419
West     7866
South    3485
Name: Mar, dtype: int64
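The difference between label-based and position-based access also shows up when slicing. The following sketch, on an illustrative three-row frame named `sales`, compares the two: a `loc` slice includes its end label, while an `iloc` slice excludes its end position.

```python
import pandas as pd

# Illustrative frame with month labels as the index.
sales = pd.DataFrame(
    {'North': [5143, 9492, 6223], 'South': [7428, 8374, 3485]},
    index=pd.Index(['Jan', 'Feb', 'Mar'], name='Month'),
)

# The same row, reached by label and by integer position.
by_label = sales.loc['Mar']
by_position = sales.iloc[2]
print(by_label.equals(by_position))

# Label slices include the end label; position slices exclude the end position.
label_slice = sales.loc['Jan':'Feb']   # rows Jan and Feb
position_slice = sales.iloc[0:2]       # rows at positions 0 and 1
print(label_slice.equals(position_slice))
```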

**Selecting a subset of rows and columns:**

In [20]:
df.loc['Mar','South']

3485

In [21]:
df.loc[['Feb','Mar'],['East','West']]

Month,East,West
Feb,3506,7047
Mar,1419,7866


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [22]:
df

Month,North,East,West,South
Jan,5143,2027,7256,7428
Feb,9492,3506,7047,8374
Mar,6223,1419,7866,3485
Apr,6537,2986,7062,8061
May,9186,4614,5657,9003
Jun,6999,8625,5130,2725
Jul,4882,2341,7418,8768
Aug,7762,5923,6715,3117
Sep,9629,6063,4306,8304
Oct,2239,5743,5530,3131


In [23]:
df>5000

Month,North,East,West,South
Jan,True,False,True,True
Feb,True,False,True,True
Mar,True,False,True,False
Apr,True,False,True,True
May,True,False,True,True
Jun,True,True,True,False
Jul,False,False,True,True
Aug,True,True,True,False
Sep,True,True,False,True
Oct,False,True,True,False


In [24]:
df[df>5000]

Month,North,East,West,South
Jan,5143.0,,7256.0,7428.0
Feb,9492.0,,7047.0,8374.0
Mar,6223.0,,7866.0,
Apr,6537.0,,7062.0,8061.0
May,9186.0,,5657.0,9003.0
Jun,6999.0,8625.0,5130.0,
Jul,,,7418.0,8768.0
Aug,7762.0,5923.0,6715.0,
Sep,9629.0,6063.0,,8304.0
Oct,,5743.0,5530.0,
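Indexing a DataFrame with a boolean DataFrame keeps the original shape and fills the cells that fail the condition with `NaN`. Because `NaN` is a floating-point value, the default integer columns are converted to floats, which is why the output above shows `5143.0` rather than `5143`. A minimal sketch on an illustrative two-row frame named `sales`:

```python
import pandas as pd

# Illustrative integer frame.
sales = pd.DataFrame(
    {'North': [5143, 4882], 'East': [2027, 2341]},
    index=pd.Index(['Jan', 'Jul'], name='Month'),
)

# Cells that fail the condition become NaN; the shape is preserved.
masked = sales[sales > 5000]

# The same masking written with DataFrame.where.
same_result = sales.where(sales > 5000)

print(masked.dtypes)         # integer columns are upcast to hold NaN
print(masked.isna().sum())   # number of masked cells per column
```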


In [25]:
df[df['North']>8000]

Month,North,East,West,South
Feb,9492,3506,7047,8374
May,9186,4614,5657,9003
Sep,9629,6063,4306,8304


In [26]:
df[df['North']>8000]['East']

Month
Feb    3506
May    4614
Sep    6063
Name: East, dtype: int64

In [27]:
df[df['North']>8000][['East','West']]

Month,East,West
Feb,3506,7047
May,4614,5657
Sep,6063,4306
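Filtering rows and then selecting columns in a second pair of brackets works for reading data, but the same selection can be written as one `.loc` call that takes the row condition and the column list together. A sketch on an illustrative frame (`sales` and `high_north` are names chosen for this example):

```python
import pandas as pd

# Illustrative frame with three months.
sales = pd.DataFrame(
    {
        'North': [5143, 9492, 9629],
        'East': [2027, 3506, 6063],
        'West': [7256, 7047, 4306],
    },
    index=pd.Index(['Jan', 'Feb', 'Sep'], name='Month'),
)

high_north = sales['North'] > 8000  # boolean Series, one value per row

# Two-step selection: filter rows, then pick columns.
chained = sales[high_north][['East', 'West']]

# One-step selection: rows and columns in a single .loc call.
single_step = sales.loc[high_north, ['East', 'West']]

print(chained.equals(single_step))
```

The single `.loc` form is generally the safer choice when the selection is later used for assignment, since assigning through two chained brackets may act on a temporary copy.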


To combine two conditions, use `|` (or) and `&` (and), wrapping each condition in parentheses:

In [28]:
df[(df['North']>8000) & (df['East'] > 5000)]

Month,North,East,West,South
Sep,9629,6063,4306,8304
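The sketch below contrasts `&` with `|` on an illustrative four-row frame and shows why Python's `and` keyword cannot be used in their place: `and` needs a single truth value, and a boolean Series has one value per row.

```python
import pandas as pd

# Illustrative frame with four months.
sales = pd.DataFrame(
    {'North': [5143, 9492, 9629, 2239], 'East': [2027, 3506, 6063, 5743]},
    index=pd.Index(['Jan', 'Feb', 'Sep', 'Oct'], name='Month'),
)

# & keeps rows where both conditions hold; | keeps rows where either holds.
both = sales[(sales['North'] > 8000) & (sales['East'] > 5000)]
either = sales[(sales['North'] > 8000) | (sales['East'] > 5000)]
print(list(both.index))
print(list(either.index))

# `and` asks each Series for one truth value, which raises a ValueError.
try:
    sales[(sales['North'] > 8000) and (sales['East'] > 5000)]
except ValueError as err:
    print('ValueError:', err)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>`.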
