# Advanced Indexing

The key building blocks in Pandas are:

* indexes - sequence of labels, immutable, homogeneous datatype
* series - 1D array with an index, mutable, homogeneous datatype
* dataframes - 2D table with series as columns, mutable, each column has its own datatype (so the table as a whole can be heterogeneous)
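
As a minimal sketch of these three objects (the labels and values here are made up for illustration), the following builds one of each and prints the datatypes involved:

```python
import pandas as pd

# An Index holds the labels; it cannot be modified in place.
index = pd.Index(['a', 'b', 'c'])

# A Series is a 1D array with a single dtype, paired with an index.
series = pd.Series([1.5, 2.5, 3.5], index=index)

# A DataFrame holds one Series per column; each column has its own dtype.
frame = pd.DataFrame({'count': [1, 2, 3], 'name': ['x', 'y', 'z']}, index=index)

print(series.dtype)
print(frame.dtypes)
```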

In [1]:
import pandas as pd
import numpy as np

values = [1,2,3,4,5,6,7,8,9,0]
labels = list('abcdefghij')
items = pd.Series(values)
items

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
9    0
dtype: int64

By default the assigned index is a range of integers starting at `0`. The `dtype` of the series is `int64`, which is inferred from the `values` list. The `index` itself in this case is a `RangeIndex`.

In [2]:
items = pd.Series([1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,0.0])
print(type(items.index))
print(items.index)
items

<class 'pandas.core.indexes.range.RangeIndex'>
RangeIndex(start=0, stop=10, step=1)


0    1.1
1    2.2
2    3.3
3    4.4
4    5.5
5    6.6
6    7.7
7    8.8
8    9.9
9    0.0
dtype: float64

In [3]:
items = pd.Series(values, index=labels)
print(type(items.index))
print(items.index)
items

<class 'pandas.core.indexes.base.Index'>
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')


a    1
b    2
c    3
d    4
e    5
f    6
g    7
h    8
i    9
j    0
dtype: int64

We can slice an index object, but we cannot change its values without replacing the whole index.

In [4]:
items.index[:2]

Index(['a', 'b'], dtype='object')

In [5]:
try:
    items.index[2] = 'z'
except Exception as error:
    print(error)

Index does not support mutable operations


In [6]:
items.index = list('lmnopqrstu')
print(items.index)

Index(['l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u'], dtype='object')
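
When only a few labels need to change, `rename` is an alternative to assigning a whole new index. The sketch below uses a small made-up series; `rename` builds a new series with a new index, so the immutability of the original index is not violated:

```python
import pandas as pd

items = pd.Series([1, 2, 3], index=list('abc'))

# rename returns a new Series in which the label 'c' is replaced by 'z';
# the original Series and its index are left untouched.
renamed = items.rename({'c': 'z'})

print(list(renamed.index))
print(list(items.index))
```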


We can also set a name on the index.

In [7]:
print(items.index.name)

None


In [8]:
items.index.name = 'letters'
print(items.index)
items

Index(['l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u'], dtype='object', name='letters')


letters
l    1
m    2
n    3
o    4
p    5
q    6
r    7
s    8
t    9
u    0
dtype: int64
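
Assigning to `index.name` modifies the existing series. As a hedged alternative, `rename_axis` returns a new series with the index name set, which is convenient in method chains. The data below is a made-up three-element example:

```python
import pandas as pd

items = pd.Series([1, 2, 3], index=list('lmn'))

# rename_axis returns a copy whose index is named 'letters'.
named = items.rename_axis('letters')

print(named.index.name)
print(items.index.name)  # the original series keeps its unnamed index
```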

With dataframes, we can give a name to both the index and the columns.

In [9]:
df = pd.read_csv('./data/sales.csv', index_col='month') # also sets name to 'month'

print(df.index)
print(df.columns)
df

Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'], dtype='object', name='month')
Index(['eggs', 'salt', 'spam'], dtype='object')


month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
May,132,,52
Jun,205,60.0,55


In [10]:
df.columns.name = 'products'
print(df.columns)
df

Index(['eggs', 'salt', 'spam'], dtype='object', name='products')


products,eggs,salt,spam
month,,,
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
May,132,,52
Jun,205,60.0,55
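
Both names can also be set in one call with `rename_axis`. This is a sketch on a hand-typed two-row subset of the table above, since `sales.csv` itself is not reproduced here:

```python
import pandas as pd

# Hand-typed subset of the sales table (two months, two products).
df = pd.DataFrame({'eggs': [47, 110], 'spam': [17, 31]}, index=['Jan', 'Feb'])

# Name the row index and the column index in a single step.
df = df.rename_axis(index='month', columns='products')

print(df.index.name)
print(df.columns.name)
```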


In [40]:
stocks = pd.read_csv('./data/stocks.csv')
# label each consecutive block of 6 rows with one ticker
stocks['Ticker'] = np.repeat(['TSLA', 'AAPL', 'AMZN', 'MSFT', 'AIG'], 6)
stocks

,Date,High,Low,Open,Close,Volume,Adj Close,Ticker
0,2000-01-03,4.017857,3.631696,3.745536,3.997768,133949200.0,2.665724,TSLA
1,2000-01-04,3.950893,3.613839,3.866071,3.660714,128094400.0,2.440975,TSLA
2,2000-01-05,3.948661,3.678571,3.705357,3.714286,194580400.0,2.476697,TSLA
3,2000-01-06,3.821429,3.392857,3.790179,3.392857,191993200.0,2.262367,TSLA
4,2000-01-07,3.607143,3.410714,3.446429,3.553571,115183600.0,2.369532,TSLA
5,2000-01-10,3.651786,3.383929,3.642857,3.491071,126266000.0,2.327857,TSLA
6,2000-01-03,16.160431,15.599305,15.823756,15.711531,10635000.0,6.698944,AAPL
7,2000-01-04,15.599305,15.150405,15.459024,15.26263,10734600.0,6.507545,AAPL
8,2000-01-05,15.402911,15.066236,15.066236,15.234573,11722500.0,6.495581,AAPL
9,2000-01-06,15.823756,15.178461,15.26263,15.767643,17479500.0,6.72287,AAPL


## MultiIndex (Hierarchical Index)

We use a **MultiIndex** (or hierarchical index) when no single column identifies a row. Here neither `Date` nor `Ticker` is suitable on its own, because both contain repeated values. Instead, we can use a tuple made up of the `Ticker` and `Date` to represent each record uniquely.

Call `sort_index` on the dataframe unless the data is already sorted on the index fields; slicing a MultiIndex by label range relies on a sorted index.
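
The effect of sorting can be sketched on a tiny made-up frame (the level names `key` and `num` are placeholders). On an unsorted MultiIndex a label slice fails with a `KeyError` (pandas raises its `UnsortedIndexError` subclass), and the same slice works after `sort_index`:

```python
import pandas as pd

# Deliberately unsorted two-level index.
index = pd.MultiIndex.from_tuples(
    [('b', 1), ('a', 2), ('b', 2), ('a', 1)], names=['key', 'num'])
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

print(df.index.is_monotonic_increasing)

raised = False
try:
    df.loc['a':'b']  # label slice on the unsorted index
except KeyError as error:
    raised = True
    print(type(error).__name__)

# After sorting, the same slice succeeds.
df_sorted = df.sort_index()
print(df_sorted.index.is_monotonic_increasing)
print(df_sorted.loc['a':'b'])
```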

In [41]:
stocks = stocks.set_index(['Ticker', 'Date'])
stocks = stocks.sort_index()
stocks

Ticker,Date,High,Low,Open,Close,Volume,Adj Close
AAPL,2000-01-03,16.160431,15.599305,15.823756,15.711531,10635000.0,6.698944
AAPL,2000-01-04,15.599305,15.150405,15.459024,15.26263,10734600.0,6.507545
AAPL,2000-01-05,15.402911,15.066236,15.066236,15.234573,11722500.0,6.495581
AAPL,2000-01-06,15.823756,15.178461,15.26263,15.767643,17479500.0,6.72287
AAPL,2000-01-07,16.272657,15.487081,15.487081,15.935981,15755900.0,6.794642
AAPL,2000-01-10,16.188488,15.655418,16.160431,15.823756,10442000.0,6.746791
AIG,2000-01-03,89.5625,79.046799,81.5,89.375,16117600.0,89.375
AIG,2000-01-04,91.5,81.75,85.375,81.9375,17487400.0,81.9375
AIG,2000-01-05,75.125,68.0,70.5,69.75,38457400.0,69.75
AIG,2000-01-06,72.6875,64.0,71.3125,65.5625,18752000.0,65.5625


In [42]:
print(stocks.index.name)
print(stocks.index.names)
stocks.index

None
['Ticker', 'Date']


MultiIndex(levels=[['AAPL', 'AIG', 'AMZN', 'MSFT', 'TSLA'], ['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-10']],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]],
           names=['Ticker', 'Date'])

The index consists of two levels, which makes it a `MultiIndex` or hierarchical index. Notice that the `name` attribute is `None`, but that `names` is `['Ticker', 'Date']`.
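
To inspect the levels individually, a MultiIndex offers `nlevels` and `get_level_values`. The sketch below builds a small index by hand with `MultiIndex.from_product` rather than loading `stocks.csv`:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['AAPL', 'AIG'], ['2000-01-03', '2000-01-04']],
    names=['Ticker', 'Date'])

print(index.nlevels)                       # number of levels
print(list(index.names))                   # one name per level
print(index.get_level_values('Ticker'))    # labels of one level, one per row
print(index[0])                            # each entry is a tuple
```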

In [43]:
# average closing price by stock
stocks.groupby('Ticker').Close.mean()

Ticker
AAPL      15.622686
AIG       74.229167
AMZN      44.041667
MSFT    1378.576233
TSLA       3.635045
Name: Close, dtype: float64

In [44]:
# average volume traded by day
stocks.groupby('Date').Volume.mean()

Date
2000-01-03    32904780.0
2000-01-04    31842720.0
2000-01-05    49650960.0
2000-01-06    46220960.0
2000-01-07    28732840.0
2000-01-10    30839780.0
Name: Volume, dtype: float64
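
In the two cells above, the group keys are index level names rather than columns. A minimal sketch with made-up closing prices shows that the explicit `level=` spelling gives the same result:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['AAPL', 'AIG'], ['2000-01-03', '2000-01-04']],
    names=['Ticker', 'Date'])
# Placeholder prices, not taken from stocks.csv.
df = pd.DataFrame({'Close': [10.0, 20.0, 30.0, 50.0]}, index=index)

by_name = df.groupby('Ticker')['Close'].mean()         # level referenced by name
by_level = df.groupby(level='Ticker')['Close'].mean()  # explicit level keyword

print(by_level)
```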

### Slicing

In [45]:
# fetch a single record
stocks.loc['TSLA', '2000-01-07']

High         3.607143e+00
Low          3.410714e+00
Open         3.446429e+00
Close        3.553571e+00
Volume       1.151836e+08
Adj Close    2.369532e+00
Name: (TSLA, 2000-01-07), dtype: float64

In [46]:
# retrieve a single value
stocks.loc[('TSLA', '2000-01-07'), 'Volume']

115183600.0

In [47]:
# slicing the outermost index
stocks.loc['AAPL']

Date,High,Low,Open,Close,Volume,Adj Close
2000-01-03,16.160431,15.599305,15.823756,15.711531,10635000.0,6.698944
2000-01-04,15.599305,15.150405,15.459024,15.26263,10734600.0,6.507545
2000-01-05,15.402911,15.066236,15.066236,15.234573,11722500.0,6.495581
2000-01-06,15.823756,15.178461,15.26263,15.767643,17479500.0,6.72287
2000-01-07,16.272657,15.487081,15.487081,15.935981,15755900.0,6.794642
2000-01-10,16.188488,15.655418,16.160431,15.823756,10442000.0,6.746791


In [48]:
stocks.loc['AIG':'MSFT']

Ticker,Date,High,Low,Open,Close,Volume,Adj Close
AIG,2000-01-03,89.5625,79.046799,81.5,89.375,16117600.0,89.375
AIG,2000-01-04,91.5,81.75,85.375,81.9375,17487400.0,81.9375
AIG,2000-01-05,75.125,68.0,70.5,69.75,38457400.0,69.75
AIG,2000-01-06,72.6875,64.0,71.3125,65.5625,18752000.0,65.5625
AIG,2000-01-07,70.5,66.1875,67.0,69.5625,10505400.0,69.5625
AIG,2000-01-10,72.625,65.5625,72.5625,69.1875,14757900.0,69.1875
AMZN,2000-01-03,46.9375,44.0,46.75,45.09375,3655600.0,31.241013
AMZN,2000-01-04,45.75,42.78125,44.75,42.8125,2533200.0,29.660543
AMZN,2000-01-05,44.125,41.59375,42.8125,43.4375,3228000.0,30.093563
AMZN,2000-01-06,43.8125,41.625,43.4375,42.25,2601000.0,29.270847


In [49]:
stocks.loc[['MSFT', 'AAPL'], ['Adj Close', 'Open', 'High', 'Volume']]

Ticker,Date,Adj Close,Open,High,Volume
AAPL,2000-01-03,6.698944,15.823756,16.160431,10635000.0
AAPL,2000-01-04,6.507545,15.459024,15.599305,10734600.0
AAPL,2000-01-05,6.495581,15.066236,15.402911,11722500.0
AAPL,2000-01-06,6.72287,15.26263,15.823756,17479500.0
AAPL,2000-01-07,6.794642,15.487081,16.272657,15755900.0
AAPL,2000-01-10,6.746791,16.160431,16.188488,10442000.0
MSFT,2000-01-03,1033.143555,1428.333374,1432.5,166500.0
MSFT,2000-01-04,980.337036,1353.333374,1361.666626,364000.0
MSFT,2000-01-05,982.666077,1315.0,1332.5,266500.0
MSFT,2000-01-06,1012.642273,1320.0,1381.666626,279100.0


In [50]:
stocks.loc[(['AMZN', 'TSLA'], ['2000-01-04','2000-01-07']), ['High', 'Low', 'Close']]

Ticker,Date,High,Low,Close
AMZN,2000-01-04,45.75,42.78125,42.8125
AMZN,2000-01-07,44.25,41.3125,43.4375
TSLA,2000-01-04,3.950893,3.613839,3.660714
TSLA,2000-01-07,3.607143,3.410714,3.553571


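The `:` slice syntax is not valid inside a tuple, so it cannot be used to select a range within one level of the index. Use Python's built-in `slice()` to define the range instead. As with other label-based slicing in `.loc`, both endpoints are included.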

In [51]:
# select a date range
stocks.loc[(['AMZN', 'TSLA'], slice('2000-01-04', '2000-01-07')), ['High', 'Low', 'Close']]

Ticker,Date,High,Low,Close
AMZN,2000-01-04,45.75,42.78125,42.8125
AMZN,2000-01-05,44.125,41.59375,43.4375
AMZN,2000-01-06,43.8125,41.625,42.25
AMZN,2000-01-07,44.25,41.3125,43.4375
TSLA,2000-01-04,3.950893,3.613839,3.660714
TSLA,2000-01-05,3.948661,3.678571,3.714286
TSLA,2000-01-06,3.821429,3.392857,3.392857
TSLA,2000-01-07,3.607143,3.410714,3.553571


Use `slice(None)` as a substitute for a bare `:` when you want every label in a level.

In [34]:
# select all stocks (rows) for a particular date range
stocks.loc[(slice(None), slice('2000-01-04', '2000-01-07')), ['High', 'Low', 'Close']]

Ticker,Date,High,Low,Close
AAPL,2000-01-04,15.599305,15.150405,15.26263
AAPL,2000-01-05,15.402911,15.066236,15.234573
AAPL,2000-01-06,15.823756,15.178461,15.767643
AAPL,2000-01-07,16.272657,15.487081,15.935981
AIG,2000-01-04,91.5,81.75,81.9375
AIG,2000-01-05,75.125,68.0,69.75
AIG,2000-01-06,72.6875,64.0,65.5625
AIG,2000-01-07,70.5,66.1875,69.5625
AMZN,2000-01-04,45.75,42.78125,42.8125
AMZN,2000-01-05,44.125,41.59375,43.4375
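
The `slice` calls used in these selections can also be written with `pd.IndexSlice`, which allows the familiar `:` syntax inside the indexer. The sketch below uses a small hand-built frame with placeholder values, not the real prices:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['AMZN', 'TSLA'], ['2000-01-03', '2000-01-04', '2000-01-05']],
    names=['Ticker', 'Date'])
# Placeholder values 0..5, one per row.
df = pd.DataFrame({'Close': range(6)}, index=index)

idx = pd.IndexSlice

# All tickers, dates from 2000-01-04 to 2000-01-05 inclusive, written two ways.
by_slice = df.loc[(slice(None), slice('2000-01-04', '2000-01-05')), :]
by_indexslice = df.loc[idx[:, '2000-01-04':'2000-01-05'], :]

print(by_indexslice)
```

Both spellings build the same tuple of slice objects, so the choice between them is mostly a matter of readability.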


In [55]:
sales = pd.read_csv('./data/sales2.csv')
sales

,state,month,eggs,salt,spam
0,CA,1,47,12.0,17
1,CA,2,110,50.0,31
2,NY,1,221,89.0,72
3,NY,2,77,87.0,20
4,TX,1,132,,52
5,TX,2,205,60.0,55


In [56]:
sales = sales.set_index(['state', 'month'])
sales

state,month,eggs,salt,spam
CA,1,47,12.0,17
CA,2,110,50.0,31
NY,1,221,89.0,72
NY,2,77,87.0,20
TX,1,132,,52
TX,2,205,60.0,55


In [None]:
sales.loc['NY']

In [57]:
# Look up data for NY in month 1: NY_month1
sales.loc[('NY', 1)]

eggs    221.0
salt     89.0
spam     72.0
Name: (NY, 1), dtype: float64

In [58]:
# Look up data for CA and TX in month 2: CA_TX_month2
sales.loc[(['CA', 'TX'], 2), :]

state,month,eggs,salt,spam
CA,2,110,50.0,31
TX,2,205,60.0,55


In [59]:
# Look up data for all states in month 2: all_month2
sales.loc[(slice(None), 2), :]

state,month,eggs,salt,spam
CA,2,110,50.0,31
NY,2,77,87.0,20
TX,2,205,60.0,55
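
For selecting a single label on an inner level, `xs` (cross-section) is a possible alternative to the `slice(None)` tuple. The sketch below retypes four rows of the sales table by hand instead of reading `sales2.csv`:

```python
import pandas as pd

# Hand-typed subset of the sales table: two states, two months.
sales = pd.DataFrame(
    {'eggs': [47, 110, 221, 77], 'spam': [17, 31, 72, 20]},
    index=pd.MultiIndex.from_tuples(
        [('CA', 1), ('CA', 2), ('NY', 1), ('NY', 2)],
        names=['state', 'month']))

# All states in month 2; by default the selected level is dropped.
month2 = sales.xs(2, level='month')
print(month2)

# Keep the 'month' level in the result.
month2_full = sales.xs(2, level='month', drop_level=False)
print(month2_full)
```

Unlike `.loc`, `xs` is for selection only and cannot be used to assign values.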
