# Chapter 6 -- Understanding Indexes

SAS users tend to think of indexing SAS data sets as a means to improve query performance.  Another use case for indexes in SAS is when non-sequential reads of a data set are needed, as in the case of a table look-up.

Indexing for DataFrames is used to provide direct access to data.  Many analytical techniques take advantage of indexes.  Index values also serve as keys for selecting and subsetting data.  By contrast, as we will see below, SAS uses a default sequential access method for processing observations.  

It took me a bit of time to understand how indexing for Series and DataFrames actually works.  Perhaps my lack of understanding comes from how SAS mainly uses indexing passively; that is, SAS algorithms determine whether a sequential read or an indexed read will give better query performance.  

In working through Series and DataFrame examples, I encountered errors providing little in the way of information as to how to diagnose the problem.  And many of the working examples I found, while mostly useful, used synthetic data which is typically ordered neatly.  Here we examine issues like overlapping date ranges or index values in non-alphabetical or non-sequential order and so on.

That is why you will find some errors in the examples below.  By examining the pits I have fallen into, hopefully you can avoid them.

In [1]:
import numpy as np
import pandas as pd
from numpy.random import randn
from pandas import Series, DataFrame, Index

Consider the creation of the DataFrame df in the cell below.  

In [2]:
df = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3.], 
                    ['b', 'warm', 'medium', 4, 5, 7, 9],
                    ['c', 'hot', 'fast', 9, 4, np.nan, 6],
                    ['d', 'cool', None, np.nan, np.nan, 17, 89],
                    ['e', 'cool', 'medium', 16, 44, 21, 13],
                    ['f', 'cold', 'slow', np.nan, 29, 33, 17]])

The .index attribute returns the DataFrame's index structure.

In [3]:
# inspecting the DataFrame index structure

df.index

RangeIndex(start=0, stop=6, step=1)

Since no index was set explicitly, the DataFrame uses a RangeIndex object that starts at 0 and stops at len(df), so the last row label is len(df) - 1.  SAS uses \_N\_ in the Data Step, and FIRSTOBS= and OBS= in a PROC step, for its row indexing.  
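
As a minimal sketch of the default index, the cell below builds a small, hypothetical three-row DataFrame (the name demo and its values are illustrative only, not part of the chapter's data) and inspects the RangeIndex attributes.

```python
import pandas as pd

# Hypothetical three-row frame, used only to inspect the default index.
demo = pd.DataFrame({'x': [10, 20, 30]})

idx = demo.index
print(idx.start, idx.stop, idx.step)   # → 0 3 1
print(len(demo))                       # → 3, the same value as idx.stop
print(list(idx))                       # → [0, 1, 2], the labels actually generated
```

The stop value is exclusive, which is why the last label is one less than the row count.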

The SAS example below creates a data set with the same data used to create the DataFrame df in cell #2 above.

````
    data df;
    infile cards dlm=',';

    input id   $
          col1 $
          col2 $
          col3
          col4
          col5
          col6 ;

    datalines;
    a, cold, slow, ., 2, 6, 3
    b, warm, medium, 4, 5, 7, 9
    c, hot, fast, 9, 4, ., 6
    d, cool, , ., ., 17, 89
    e, cool,  medium, 16, 44, 21, 13
    f, cold, slow, . ,29, 33, 17
    ;;;;
````

The Data Step below uses the NOBS= option on the SET statement, an example of an implicit index used by SAS.  The variable named by the END= option on the SET statement is initialized to 0 and set to 1 when the last observation is read.  The automatic variable \_N\_ is used as the observation (row) index.

````
    data _null_;
    set df nobs=obs end=end_of_file;

    put _n_ = ;

    if end_of_file then
       put 'Data set contains: ' obs ' observations' /;
    run;

    _N_=1
    _N_=2
    _N_=3
    _N_=4
    _N_=5
    _N_=6
    Data set contains: 6  observations
````

In the DataFrame df, we did not specify column names, so a RangeIndex object is used for the column labels as well.  Cell #4 below returns the default column labels.  SAS has a similar construct (column suffix processing) allowing the use of column name 'ranges' (col1-colN).

In [4]:
df.columns

RangeIndex(start=0, stop=7, step=1)

Using the DataFrame index, we can select specific rows and columns.  DataFrames provide indexers to accomplish these tasks.  They are:

    1. .iloc indexer, which is mainly integer-based (selection by position)
    2. .loc indexer, used to select ranges by labels (either column or row)
    3. .ix indexer, which supports a combination of the .loc and .iloc behaviors
    
We also illustrate altering values in a DataFrame using the .loc indexer.  This is equivalent to the SAS update method.

#### .iloc Indexer

The .iloc indexer locates rows and columns by integer position.

The syntax for the .iloc is:

    df.iloc[row selection, column selection]
    
For both the row and column selection, a comma (,) is used to request a list of multiple cells.  A colon (:) is used to request a range of cells; the range includes the start position and excludes the stop position.

The absence of either a column selection or a row selection is an implicit request for all columns or rows, respectively.
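
The selection rules above can be sketched on a small, hypothetical 4 x 4 DataFrame of integers (the names grid, rows_slice, cells_list and all_rows are illustrative assumptions, not part of the chapter's data).

```python
import numpy as np
import pandas as pd

# Hypothetical 4 x 4 frame holding the integers 0..15, for illustration only.
grid = pd.DataFrame(np.arange(16).reshape(4, 4))

rows_slice = grid.iloc[1:3]              # positions 1 and 2; position 3 is excluded
cells_list = grid.iloc[[0, 3], [1, 2]]   # rows 0 and 3, columns 1 and 2
all_rows = grid.iloc[:, 0]               # every row, first column only

print(rows_slice.shape)                  # → (2, 4)
print(cells_list.values.tolist())        # → [[1, 2], [13, 14]]
```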

In [5]:
# return the first row in the DataFrame

df.iloc[0] 

0       a
1    cold
2    slow
3     NaN
4       2
5       6
6       3
Name: 0, dtype: object

The POINT= option on the SET statement behaves similarly to return the first row in the data set.  Note the SET statement inside a DO loop and the STOP statement.  STOP is needed because the POINT= option indicates a non-sequential access pattern and thus the end of data set indicator is not available.

````
    data _null_;

    do iloc = 1 to 1;
       set df point=iloc;
       put _all_ ;
    end;
    stop;
    run;

    _N_=1 _ERROR_=0 iloc=1 id=a col1=cold col2=slow col3=. col4=2 col5=6 col6=3
````

In the example below, you might expect three rows returned, rather than two.  The slice 2:4 starts at position 2 and stops before position 4.

In [6]:
# return rows at positions 2 and 3.  Notice how the row at position 4 is not returned.

df.iloc[2:4]

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
| 3 | d | cool | None | NaN | NaN | 17.0 | 89.0 |


The SAS analog example for cell #6 is below.

````
    data _null_ ;

    do iloc = 3 to 4;
       set df point=iloc;
       put _all_ ;
    end;
    stop;
    run;

    _N_=1 _ERROR_=0 iloc=3 id=c col1=hot col2=fast col3=9 col4=4 col5=. col6=6
    _N_=1 _ERROR_=0 iloc=4 id=d col1=cool col2=  col3=. col4=. col5=17 col6=89
````

Similar to the indexer for string slicing, the index position 

    iloc[0]
    
returns the first row and 

    iloc[-1]
    
returns the last row in the DataFrame.  This is analogous to the END= option for the SET statement (assuming a sequential access pattern).

The .iloc indexer is mainly used to locate the first or last row in the DataFrame.  

In [7]:
df.iloc[-1]

0       f
1    cold
2    slow
3     NaN
4      29
5      33
6      17
Name: 5, dtype: object

The .iloc indexer in cell #8 below returns rows 2 and 3 using 2:4 for the row selector and columns 0 through 5 using 0:6 for the column selector.

In [8]:
# iloc indexer for rows and columns

df.iloc[2:4, 0:6]

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 2 | c | hot | fast | 9.0 | 4.0 | NaN |
| 3 | d | cool | None | NaN | NaN | 17.0 |


The analog SAS program for returning the same sub-set is below.  FIRSTOBS=3 OBS=4 is the equivalent row selector and keep = id -- col5 is the equivalent column selector.

````
    data df;
        set df(keep = id -- col5
               firstobs=3 obs=4);
        put _all_ ;
    run;

    _N_=1 _ERROR_=0 id=c col1=hot col2=fast col3=9 col4=4 col5=.
    _N_=2 _ERROR_=0 id=d col1=cool col2=  col3=. col4=. col5=17
````

The .iloc indexer below illustrates multi-row and multi-column requests.  Note the list syntax, with square brackets ([ ]) nested inside the indexer's own square brackets.

In [9]:
df.iloc[[1,3,5], [2, 4, 6]]

|   | 2 | 4 | 6 |
|---|---|---|---|
| 1 | medium | 5.0 | 9.0 |
| 3 | None | NaN | 89.0 |
| 5 | slow | 29.0 | 17.0 |


#### .loc Indexer 

The .loc indexer is similar to .iloc and allows access to rows and columns by labels.  A good analogy is a cell reference in Excel, e.g., C31.

The syntax for the .loc indexer is:

    df.loc[row selection, column selection]
    
For both the row and column selection, a comma (,) is used to request a list of multiple cells.  A colon (:) is used to request a range of cells.
    
Similar to the .iloc indexer you can select combinations of rows and columns.  

Consider the DataFrame df2 created below in cell #10.  It adds column labels, including 'id', and a new 'date' column.
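
One difference from .iloc is worth sketching before moving on: a label range given to .loc includes both end points, while a position range given to .iloc excludes the stop position.  The DataFrame pets below is a hypothetical example, not part of the chapter's data.

```python
import pandas as pd

# Hypothetical frame with string row labels, for illustration only.
pets = pd.DataFrame({'legs': [4, 2, 4, 0],
                     'tail': [True, True, False, True]},
                    index=['cat', 'hen', 'frog', 'eel'])

by_label = pets.loc['hen':'frog', 'legs']   # label slice: both end points included
by_position = pets.iloc[1:2, 0]             # position slice: stop position excluded

print(by_label.tolist())      # → [2, 4]
print(by_position.tolist())   # → [2]
```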

In [10]:
df2 = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3., '08/01/16'], 
                    ['b', 'warm', 'medium', 4, 5, 7, 9, '03/15/16'],
                    ['c', 'hot', 'fast', 9, 4, np.nan, 6, '04/30/16'],
                    ['d', 'None', 'fast', np.nan, np.nan, 17, 89, '05/31/16'],
                    ['e', 'cool', 'medium', 16, 44, 21, 13, '07/04/16'],
                    ['f', 'cold', 'slow', np.nan, 29, 33, 17, '01/01/16']],
                    columns=['id', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'date'])
                  

Executing just the name of the DataFrame is the equivalent of:
    
    print(df2)
    
The print() function, however, returns the output for a DataFrame without the cell outlines.

In [11]:
df2

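|   | id | col1 | col2 | col3 | col4 | col5 | col6 | date |
|---|---|---|---|---|---|---|---|---|
| 0 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 | 08/01/16 |
| 1 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 | 03/15/16 |
| 2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 | 04/30/16 |
| 3 | d | None | fast | NaN | NaN | 17.0 | 89.0 | 05/31/16 |
| 4 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 | 07/04/16 |
| 5 | f | cold | slow | NaN | 29.0 | 33.0 | 17.0 | 01/01/16 |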


We start by setting the index to 'id', so we can access rows by a single row or a range of rows based on the 'id' labels ('a' through 'f').  By default, the column is dropped when it becomes the index.  If you know you will switch indexes then use the argument 
    
    drop=False
    
to prevent the index column from being dropped from the DataFrame in the set_index request.  That way, you will not have to re-read/create the DataFrame to access the column previously used as the index.
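
A minimal sketch of the difference, using a hypothetical two-row DataFrame (sample, dropped and kept are illustrative names only):

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only.
sample = pd.DataFrame({'id': ['a', 'b'], 'val': [1, 2]})

dropped = sample.set_index('id')              # default drop=True
kept = sample.set_index('id', drop=False)     # 'id' stays available as a column

print(list(dropped.columns))   # → ['val']
print(list(kept.columns))      # → ['id', 'val']
```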

In [12]:
# setting the index to 'id'

df2.set_index('id', inplace=True, drop=False)

The set_index() method runs silently.  Validate the index using the .index attribute.

In [13]:
df2.index

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object', name='id')

Return the row labeled 'e'.

In [14]:
df2.loc['e', ]

id             e
col1        cool
col2      medium
col3          16
col4          44
col5          21
col6          13
date    07/04/16
Name: e, dtype: object

Return rows in the range of 'b' to 'f' inclusive.  'b':'f' denotes a row range.  The absence of a column request is an implicit request for all of them.

In [15]:
df2.loc['b':'f' ,]

| id (index) | id | col1 | col2 | col3 | col4 | col5 | col6 | date |
|---|---|---|---|---|---|---|---|---|
| b | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 | 03/15/16 |
| c | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 | 04/30/16 |
| d | d | None | fast | NaN | NaN | 17.0 | 89.0 | 05/31/16 |
| e | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 | 07/04/16 |
| f | f | cold | slow | NaN | 29.0 | 33.0 | 17.0 | 01/01/16 |


Return the rows in the range of 'd' to 'f' inclusive.  ['col6','col2'] is a request for columns by label.

In [16]:
df2.loc['d':'f',['col6','col2']]

| id | col6 | col2 |
|---|---|---|
| d | 89.0 | fast |
| e | 13.0 | medium |
| f | 17.0 | slow |


Return the 'date' column by label.

In [17]:
df2.loc[: , 'date']

id
a    08/01/16
b    03/15/16
c    04/30/16
d    05/31/16
e    07/04/16
f    01/01/16
Name: date, dtype: object

Change the DataFrame df2 index from 'id' to 'date'.  The inplace=True argument modifies df2 directly rather than returning a new DataFrame.

In [18]:
df2.set_index('date', inplace=True)

Validate the index.

In [19]:
df2.index

Index(['08/01/16', '03/15/16', '04/30/16', '05/31/16', '07/04/16', '01/01/16'], dtype='object', name='date')

Request a row by label.

In [20]:
df2.loc['05/31/16']

id         d
col1    None
col2    fast
col3     NaN
col4     NaN
col5      17
col6      89
Name: 05/31/16, dtype: object

Request a range of rows.  

In [21]:
# select rows labeled '04/30/16' through '07/04/16' with columns col2 and col1

df2.loc['04/30/16':'07/04/16',['col2','col1']]

| date | col2 | col1 |
|---|---|---|
| 04/30/16 | fast | hot |
| 05/31/16 | fast | None |
| 07/04/16 | medium | cool |


The SAS program below is equivalent to cell #21 above.  It uses character variables to represent dates.

````
    data df2(keep = col2 col1);
       set df(where=(date between '04/30/16' and '07/04/16'));
       put _all_;
    run;

    _N_=1 _ERROR_=0 id=c col1=hot col2=fast col3=9 col4=4 col5=. col6=6 date=04/30/16
    _N_=2 _ERROR_=0 id=d col1=cool col2=  col3=. col4=. col5=17 col6=89 date=05/31/16
    _N_=3 _ERROR_=0 id=e col1=cool col2=medium col3=16 col4=44 col5=21 col6=13 date=07/04/16
````

Cell #22 below is where we hit a snag.  The issue begins with cell #18 above, which uses the set_index() method on the df2 DataFrame.  Examine cell #19 and observe how dtype is 'object'.  This means we are working with string literals and not datetime objects.  Cells #20 and #21 work because these specific labels are values found in the 'date' index.

Cell #22 below does not work, since the range request uses '07/31/16' as the range end-point, which is not in the index.  The remedy, shown below in cell #25, is to use the pd.to_datetime() function to convert the date strings into datetime objects.  The obvious analogy for SAS users is converting a string variable to a numeric variable which has an associated date format.

In [22]:
df2.loc['01/01/16':'07/31/16']

KeyError: '07/31/16'

Return the index for the DataFrame df2 to the default RangeIndex object.  

In [23]:
df2.reset_index(inplace=True)

Validate the index.

In [24]:
df2.index

RangeIndex(start=0, stop=6, step=1)

Alter the 'date' column, changing it from dtype='object' (strings) to dtype='datetime64[ns]'. 

In [25]:
df2['date'] = pd.to_datetime(df2.date)

Set the 'date' column as the index.

In [26]:
df2.set_index('date', inplace=True)

Validate the index.  Observe the dtype is now datetime64, a datetime stamp.  See Chapter 7 for more details on datetime arithmetic, shifting time intervals, and determining durations.

In [27]:
df2.index

DatetimeIndex(['2016-08-01', '2016-03-15', '2016-04-30', '2016-05-31',
               '2016-07-04', '2016-01-01'],
              dtype='datetime64[ns]', name='date', freq=None)

Now that the date literals have been converted to datetime objects, re-do a range request like the one in cell #22 above.  Neither end-point of the range is a value in the index.

In [28]:
df2.loc['02/01/16':'07/31/16']

| date | id | col1 | col2 | col3 | col4 | col5 | col6 |
|---|---|---|---|---|---|---|---|
| 2016-03-15 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
| 2016-04-30 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
| 2016-05-31 | d | None | fast | NaN | NaN | 17.0 | 89.0 |
| 2016-07-04 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
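
Some pandas releases may be stricter about range requests on a DatetimeIndex that is not in chronological order when the end-points are absent from the index.  A minimal sketch that sidesteps the question is to sort the index first.  The DataFrame events below is a hypothetical stand-in that reuses the six date strings from df2, and the explicit format argument is an assumption about how the strings should be parsed.

```python
import pandas as pd

# Hypothetical frame reusing the six date strings from df2.
events = pd.DataFrame(
    {'id': list('abcdef')},
    index=pd.to_datetime(['08/01/16', '03/15/16', '04/30/16',
                          '05/31/16', '07/04/16', '01/01/16'],
                         format='%m/%d/%y'))

events = events.sort_index()                      # put rows in chronological order
window = events.loc['2016-02-01':'2016-07-31']    # end-points need not be in the index

print(window['id'].tolist())   # → ['b', 'c', 'd', 'e']
```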


The SAS example below illustrates a similar set of steps:
    
    1.  Read the original date variable, which is character, and rename it to 'str_date'
    2.  Use the input function to 'read' the 'str_date' values and assign them to 'date' using the mmddyy10. informat
    3.  Print the date value without any formatting, showing it is now a numeric SAS date value (the number of days since January 1, 1960)
    4.  Print the SAS date value using the mmddyy10. date format
   

````
    data df2(drop = str_date);
       set df(rename=(date=str_date));
       date=input(str_date,mmddyy10.);

       if _n_ = 1 then put date= /
          date mmddyy10.;
    run;

    date=20667
    08/01/2016
````

#### Mixing .loc Indexer with Boolean Operators

This approach works by creating either a Series or array of boolean values (True or False).  This Series or array is then used by the .loc indexer to return all of the rows that evaluate to True.  Using the DataFrame df2 created in cell #10 above, consider the following.

We want to return all rows where 'col2' is not equal to 'fast'.  This is expressed as:

    df2['col2'] != 'fast'
    
A Series of True/False values is returned, one per row, indicating whether 'col2' is not equal to 'fast', as shown in cell #29 below.  The 'date' values are displayed since 'date' remains the index for the DataFrame.  The second print() call displays this object as being derived from the class Series.

In [29]:
print(df2['col2'] != 'fast')
print(type(df2['col2'] != 'fast'))

date
2016-08-01     True
2016-03-15     True
2016-04-30    False
2016-05-31    False
2016-07-04     True
2016-01-01     True
Name: col2, dtype: bool
<class 'pandas.core.series.Series'>


Pass the boolean Series:
    
    df2['col2'] != 'fast' 
    
to the .loc indexer to retrieve those rows with a boolean value of True.  Also request 'col1' through 'col2', which is a column request by label.

In [30]:
df2.loc[df2['col2'] != 'fast', 'col1':'col2']

| date | col1 | col2 |
|---|---|---|
| 2016-08-01 | cold | slow |
| 2016-03-15 | warm | medium |
| 2016-07-04 | cool | medium |
| 2016-01-01 | cold | slow |


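You can combine any number of boolean operations.  Wrap each comparison in parentheses, since the & and | operators bind more tightly than the comparison operators.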

In [31]:
df2.loc[(df2.col3 >=  9) & (df2.col1 == 'cool'), ]

| date | id | col1 | col2 | col3 | col4 | col5 | col6 |
|---|---|---|---|---|---|---|---|
| 2016-07-04 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
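
The three element-wise operators, & (and), | (or) and ~ (not), can be sketched on a small, hypothetical DataFrame (the frame t and the result names are illustrative assumptions, not part of the chapter's data).

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame, for illustration only.
t = pd.DataFrame({'temp': ['cold', 'warm', 'hot', 'cool'],
                  'n': [np.nan, 4, 9, 16]})

both = t.loc[(t['n'] >= 9) & (t['temp'] == 'cool')]     # and: both conditions hold
either = t.loc[(t['n'] >= 9) | (t['temp'] == 'cold')]   # or: at least one holds
negated = t.loc[~(t['temp'] == 'hot')]                  # not: condition is False

print(both.index.tolist())      # → [3]
print(either.index.tolist())    # → [0, 2, 3]
print(negated.index.tolist())   # → [0, 1, 3]
```

A comparison against NaN evaluates to False, which is why row 0 is selected by the 'cold' test only.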


The .isin() method returns a boolean vector that can be passed to the .loc indexer in the same way as the boolean Series in cell #30 above.  An element of the vector is True when the value in 'col6' is found in the list passed to .isin().

In [32]:
df2.loc[df2.col6.isin([6, 9, 13])]

| date | id | col1 | col2 | col3 | col4 | col5 | col6 |
|---|---|---|---|---|---|---|---|
| 2016-03-15 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
| 2016-04-30 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
| 2016-07-04 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |


So far, the .loc indexers have only displayed their results.  All of the indexers can be used to sub-set a DataFrame using the assignment syntax shown in cell #33 below.

In [33]:
df3 = df2.loc[df2.col6.isin([6, 9, 13])]

In [34]:
df3

| date | id | col1 | col2 | col3 | col4 | col5 | col6 |
|---|---|---|---|---|---|---|---|
| 2016-03-15 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
| 2016-04-30 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
| 2016-07-04 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |


Generally, in these types of sub-setting operations, the shape of the extracted DataFrame will be smaller than the input DataFrame.

To return a DataFrame of the same shape as the original, use the where() method.  This topic is covered in Chapter 8, data management.

In [35]:
print('Shape for df2 is', df2.shape)
print('Shape for df3 is', df3.shape)

Shape for df2 is (6, 7)
Shape for df3 is (3, 7)
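
As a hedged sketch of the shape-preserving alternative mentioned above, the cell below compares a .loc subset with the where() method on a hypothetical one-column DataFrame (scores, subset and masked are illustrative names only).

```python
import pandas as pd

# Hypothetical one-column frame, for illustration only.
scores = pd.DataFrame({'col6': [3, 9, 6, 89]})

subset = scores.loc[scores['col6'].isin([6, 9])]   # keeps only the matching rows
masked = scores.where(scores.isin([6, 9]))         # keeps every row, NaN where False

print(subset.shape)   # → (2, 1)
print(masked.shape)   # → (4, 1)
```

The masked result has the same shape as the input, with missing values standing in for the rows that failed the test.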


The example SAS program below uses the WHERE IN (list) syntax to subset a data set analogous to the example in cell #32 above.

````
    NOTE: Data set "WORK.df2" has 6 observation(s) and 8 variable(s)

    data df3;
       set df2(where=(col6 in (6 9 13)));
    run;

    NOTE: 3 observations were read from "WORK.df2"
    NOTE: Data set "WORK.df3" has 3 observation(s) and 8 variable(s)
````

Notice how the SAS variable count and DataFrame column count differ by 1.  That is because the DataFrame .shape attribute does not include the index as part of its column count.  By resetting the index, the SAS variable count and DataFrame column count agree.

In [36]:
df3.reset_index(inplace=True)
print('Shape for df3 is', df3.shape)

Shape for df3 is (3, 8)


#### Altering DataFrame values using the .loc indexer
The .loc indexer can also be used to do an in-place update of values.  

df2.col2 column 'before':

In [37]:
df2.loc[: , 'col2']

date
2016-08-01      slow
2016-03-15    medium
2016-04-30      fast
2016-05-31      fast
2016-07-04    medium
2016-01-01      slow
Name: col2, dtype: object

In [38]:
df2.loc[df2['col6'] > 50, "col2"] = "very hot"

df2.col2 column 'after':

In [39]:
df2.loc[: , 'col2']

date
2016-08-01        slow
2016-03-15      medium
2016-04-30        fast
2016-05-31    very hot
2016-07-04      medium
2016-01-01        slow
Name: col2, dtype: object

#### .ix Indexer

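The .ix indexer combines characteristics of the .loc and .iloc indexers.  This means you can select rows and columns by labels and integers.  Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the .ix examples in this section run only on older releases.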

The syntax for the .ix indexer is:

    df.ix[row selection, column selection]
    
For both the row and column selection, a comma (,) is used to request a list of multiple cells.  A colon (:) is used to request a range of cells.
    
Similar to the .loc indexer you can select combinations of rows and columns.  

The .ix indexer is sometimes tricky to use.  A good rule of thumb: if you are indexing only by labels, use .loc, and if you are indexing only by integer positions, use .iloc, to avoid unexpected results.  

Consider the creation of the DataFrame df4 in cell #40 below.  It is similar to DataFrame df2 created in cell #10 above.  The differences are the addition of another column and columns being identified with labels as well as integers.

In [40]:
df4 = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3., 17, '08/01/16'], 
                    ['b', 'warm', 'medium', 4, 5, 7, 9, 21, '03/15/16'],
                    ['c', 'hot', 'fast', 9, 4, np.nan, 6, 10, '04/30/16'],
                    ['d', 'None', 'fast', np.nan, np.nan, 17, 89, 44, '05/31/16'],
                    ['e', 'cool', 'medium', 16, 44, 21, 13, 99, '07/04/16'],
                    ['f', 'cold', 'slow', np.nan, 29, 33, 17, 11,'01/01/16']],
                    columns=['id', 'col1', 'col2', 'col3', 'col4', 4, 5, 6, 'date'])

In [41]:
df4

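|   | id | col1 | col2 | col3 | col4 | 4 | 5 | 6 | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 | 17 | 08/01/16 |
| 1 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 | 21 | 03/15/16 |
| 2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 | 10 | 04/30/16 |
| 3 | d | None | fast | NaN | NaN | 17.0 | 89.0 | 44 | 05/31/16 |
| 4 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 | 99 | 07/04/16 |
| 5 | f | cold | slow | NaN | 29.0 | 33.0 | 17.0 | 11 | 01/01/16 |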


In [42]:
df4.set_index('id', inplace=True)

In [43]:
df4.ix['b':'e', :6]

| id | col1 | col2 | col3 | col4 | 4 | 5 |
|---|---|---|---|---|---|---|
| b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
| c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
| d | None | fast | NaN | NaN | 17.0 | 89.0 |
| e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
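
On pandas releases where .ix is not available, a mixed request (rows by label, columns by position) can be written with .loc or .iloc alone.  The sketch below is one possible translation, not the only one; the DataFrame mixed is a hypothetical, smaller stand-in for df4 with the same blend of string and integer column labels.

```python
import pandas as pd

# Hypothetical frame with mixed string and integer column labels, like df4.
mixed = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     index=['b', 'c', 'd'],
                     columns=['col1', 4, 5])

# Rows by label, columns by position: translate the positions to labels first.
by_loc = mixed.loc['b':'c', mixed.columns[:2]]

# Same selection, translating the row labels to positions instead.
start = mixed.index.get_loc('b')
stop = mixed.index.get_loc('c') + 1      # + 1 because .iloc excludes the stop
by_iloc = mixed.iloc[start:stop, :2]

print(by_loc.equals(by_iloc))            # → True
```

Either form states explicitly whether a value is a label or a position, which removes the guesswork that made .ix tricky.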


#### A Review:

    .iloc uses the integer position in the index and only accepts integers
    .loc uses the labels in the index
    .ix generally behaves like .loc, looking at labels first

Finally, to appreciate the differences, consider the following DataFrame.  Also notice the un-Pythonic style of using a semi-colon at the end of the DataFrame definition.

In [44]:
df5 = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3.], 
                    ['b', 'warm', 'medium', 4, 5, 7, 9],
                    ['c', 'hot', 'fast', 9, 4, np.nan, 6],
                    ['d', 'cool', None, np.nan, np.nan, 17, 89],
                    ['e', 'cool', 'medium', 16, 44, 21, 13],
                    ['f', 'cold', 'slow', np.nan, 29, 33, 17]],
                    index = [6, 8, 2, 3, 4, 5,]);
df5

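|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 6 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |
| 8 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
| 2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
| 3 | d | cool | None | NaN | NaN | 17.0 | 89.0 |
| 4 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
| 5 | f | cold | slow | NaN | 29.0 | 33.0 | 17.0 |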


The .iloc indexer returns the first two rows since it looks at positions.

In [45]:
df5.iloc[:2]

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 6 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |
| 8 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |


The .loc indexer returns 3 rows since it looks at the labels.

In [46]:
df5.loc[:2]

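|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 6 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |
| 8 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
| 2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |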


The .ix indexer returns the same number of rows as the .loc indexer since its behavior is to first use labels before looking by position.  Looking by position with an integer-based index can lead to unexpected results.  This is illustrated in cell #51 below.

In [48]:
df5.ix[:2]

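|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 6 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |
| 8 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
| 2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |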


For the next two examples, review the DataFrame df5 index structure in cell #52 below.  

In [52]:
df5.index

Int64Index([6, 8, 2, 3, 4, 5], dtype='int64')

The .iloc example in cell #55 below returns the first row.  That's because it is looking by position.

In [55]:
df5.iloc[:1]

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 6 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |


The .ix example in cell #51 below raises a KeyError since 1 is not found in the index.

In [51]:
df5.ix[:1]

KeyError: 1

#### Indexing Issues
So far, so good.  We have a basic understanding of how indexes can be established, utilized, and reset.  We can use the .iloc, .loc, and .ix indexers to retrieve subsets of columns and rows.  But what about real-world scenarios where data is rarely, if ever, tidy?

The synthetic examples above work (except the ones with intentional errors, of course) since they rely on constructing the DataFrames in an orderly manner, like having 'id' columns in alphabetical order or dates in chronological order.  

Consider the DataFrame df5 created below.  It is similar to DataFrame df2, except that the 'id' column contains non-unique values and is not in alphabetical order.

In [56]:
df5 = pd.DataFrame([['b', 'cold','slow', np.nan, 2., 6., 3., '01/01/16'], 
                    ['c', 'warm', 'medium', 4, 5, 7, 9, '03/15/16'],
                    ['a', 'hot', 'fast', 9, 4, np.nan, 6, '04/30/16'],
                    ['d', 'cool', None, np.nan, np.nan, 17, 89, '05/31/16'],
                    ['c', 'cool', 'medium', 16, 44, 21, 13, '07/04/16'],
                    ['e', 'cold', 'slow', np.nan, 29, 33, 17, '08/30/16']],
                    columns=['id', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'date'])

In [57]:
df5

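|   | id | col1 | col2 | col3 | col4 | col5 | col6 | date |
|---|---|---|---|---|---|---|---|---|
| 0 | b | cold | slow | NaN | 2.0 | 6.0 | 3.0 | 01/01/16 |
| 1 | c | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 | 03/15/16 |
| 2 | a | hot | fast | 9.0 | 4.0 | NaN | 6.0 | 04/30/16 |
| 3 | d | cool | None | NaN | NaN | 17.0 | 89.0 | 05/31/16 |
| 4 | c | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 | 07/04/16 |
| 5 | e | cold | slow | NaN | 29.0 | 33.0 | 17.0 | 08/30/16 |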


Set the index for DataFrame df5 to the 'id' column. 

In [58]:
df5.set_index('id', inplace=True)

Validate the index for DataFrame df5.

In [59]:
df5.index

Index(['b', 'c', 'a', 'd', 'c', 'e'], dtype='object', name='id')

We can use the .loc indexer to request the rows in the range of 'b' through 'd'.

In [60]:
df5.loc['b':'d', :]

| id | col1 | col2 | col3 | col4 | col5 | col6 | date |
|---|---|---|---|---|---|---|---|
| b | cold | slow | NaN | 2.0 | 6.0 | 3.0 | 01/01/16 |
| c | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 | 03/15/16 |
| a | hot | fast | 9.0 | 4.0 | NaN | 6.0 | 04/30/16 |
| d | cool | None | NaN | NaN | 17.0 | 89.0 | 05/31/16 |


If you look closely at the results from the example above, you will see that the first occurrence of the row 'id' label 'c' was returned, but not the second one.  The 'id' label 'c' is obviously non-unique.  And that is only part of the issue.  Consider further the use of a non-unique label for the row range selection in the example below.  

This issue is described in sparse prose <a href="https://docs.python.org/3.4/library/datetime.html#strftime-and-strptime-behavior"> here.</a>

We want the row label range of 'b' to 'c' with all the columns.  Instead, it raises the error:

    "Cannot get right slice bound for non-unique label: 'c'"


In [61]:
df5.loc['b':'c', :]

KeyError: "Cannot get right slice bound for non-unique label: 'c'"

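The attributes .index.is_monotonic_increasing and .index.is_monotonic_decreasing return a boolean indicating whether the index labels are in sorted order; the attribute .index.is_unique tests directly for non-unique labels.  Here the first is applied to the df5 DataFrame.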

In [62]:
df5.index.is_monotonic_increasing

False

While not spelled out in any documentation I found, the moral of the story is to be wary when using indices with non-unique values.
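
One way to act on that advice is to test the index before slicing and to sort it when the test fails.  The sketch below is an illustration under stated assumptions, not a documented recipe: the DataFrame dup is hypothetical and reuses the 'id' labels from df5, and the slice on the sorted copy relies on the index being in monotonic order.

```python
import pandas as pd

# Hypothetical frame reusing the 'id' labels from df5 above.
dup = pd.DataFrame({'val': [1, 2, 3, 4, 5, 6]},
                   index=pd.Index(['b', 'c', 'a', 'd', 'c', 'e'], name='id'))

print(dup.index.is_unique)                    # → False, 'c' appears twice
print(dup.index.is_monotonic_increasing)      # → False, labels are not sorted

ordered = dup.sort_index()                    # sorted copy; dup is unchanged
print(ordered.index.is_monotonic_increasing)  # → True

# With a sorted index, the range request returns 'b' and both 'c' rows.
b_to_c = ordered.loc['b':'c']
print(len(b_to_c))                            # → 3
```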