This document is a Python exploration of this R-based document: http://m-clark.github.io/data-processing-and-visualization/indexing.html. Code is not optimized for anything but learning. In addition, all the content is located with the main document, not here, so many sections may not be included. I only focus on reproducing the code chunks.

# Indexing

Much of data processing concerns data frames or other tables of mixed data types. Even so, it would be impossible to use Python effectively without knowing how to handle more basic data types.

Note to anyone looking at this and coming from the R notes: Python uses 0-based indexing, which means that things start with 0 instead of 1, unlike most other data science programming languages (and regular human, as opposed to mathematical, experience). Also, when dealing with ranges, the last number is left open, i.e. not included as part of the range. The exception is label-based slicing in pandas (e.g. with `.loc`), which includes the endpoint. So if you want the first 10 things (I'm using 'first' loosely), you'd need `range(0, 10)`, which would give you the zeroth through 9th indices, i.e. the first 10 elements.
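
As a minimal sketch of both conventions, the block below uses a made-up list `values` and Series `s` (neither is part of the original notes) to contrast half-open positional slices with the inclusive label-based slices of `.loc`.

```python
import pandas as pd

values = list(range(100, 115))      # 15 made-up values, for illustration only

# Positions start at 0, and the stop value of a slice is excluded.
first_ten = values[0:10]            # positions 0 through 9
print(len(first_ten))               # 10
print(first_ten[0], first_ten[-1])  # 100 109

# Label-based slicing with .loc is the exception: both ends are included.
s = pd.Series(values)               # default integer labels 0..14
print(len(s.loc[0:10]))             # 11, labels 0 through 10
print(len(s.iloc[0:10]))            # 10, positions 0 through 9
```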

In [1]:
import string
import numpy as np
import pandas as pd

letters_list = list(string.ascii_lowercase)
letters_dict = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
letters = np.array(list(string.ascii_lowercase))

## Slicing Vectors

Taking individual parts of a vector of values is straightforward and something you’ll likely need to do a lot. The basic idea is to provide the indices of the elements you want to extract.

In [2]:
letters[3:6]

array(['d', 'e', 'f'], dtype='<U1')

In [3]:
letters[[12, 9, 2]]

array(['m', 'j', 'c'], dtype='<U1')

In [4]:
# example ranges
list(range(0, 10))  # or list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [5]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

## Slicing Matrices/DataFrames

With 2-d objects we can specify rows and columns. Rows are indexed to the left of the comma, columns to the right.

In [6]:
myMatrix = np.arange(1, 13).reshape(3, 4)

In [7]:
mydf = pd.DataFrame({'a': [1, 5, 2],
                     'b': [3, 8, 1]}, index=['row1', 'row2', 'row3'])

In [8]:
myMatrix[0, 2:4]

array([3, 4])
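
To make the row/column convention a little more concrete, here is a small sketch that recreates `myMatrix` and takes a few more slices. The names `row_part`, `one_col` and `block` are only illustrative.

```python
import numpy as np

myMatrix = np.arange(1, 13).reshape(3, 4)   # rows hold 1-4, 5-8 and 9-12

row_part = myMatrix[0, 2:4]       # row 0, columns 2 and 3
one_col = myMatrix[:, 1]          # every row, column 1
block = myMatrix[1:3, [0, 3]]     # rows 1 and 2, columns 0 and 3

print(row_part.tolist())   # [3, 4]
print(one_col.tolist())    # [2, 6, 10]
print(block.tolist())      # [[5, 8], [9, 12]]
```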

## Label-based Indexing

We can index by name if row or column labels are available.

In [9]:
mydf.loc['row1', 'b']

3

In [10]:
mydf.loc['row1':'row2', 'b']

row1    3
row2    8
Name: b, dtype: int64

## Position-based Indexing

Otherwise we can index by number.

In [11]:
mydf.iloc[0, 1]

3

## Mixed Indexing

With pandas, mixed indexing is limited: a boolean mask (described later) can be combined with either labels or integer positions.

In [12]:
mydf.loc[mydf.loc[:, 'b'] > 1, 'b']  # boolean rows plus a column label; the old mixed indexer .ix was deprecated due to 'user confusion', and labels are more reproducible anyway

row1    3
row2    8
Name: b, dtype: int64
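
The 'integer plus boolean' combination can be sketched with `.iloc`. This assumes the mask is first converted to a plain NumPy array, since as far as I can tell `.iloc` does not accept a boolean Series as a mask. The block recreates `mydf` so it runs on its own; `mask`, `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

mydf = pd.DataFrame({'a': [1, 5, 2],
                     'b': [3, 8, 1]}, index=['row1', 'row2', 'row3'])

mask = mydf['b'] > 1                          # boolean Series, one entry per row

by_label = mydf.loc[mask, 'b']                # boolean rows, column by name
by_position = mydf.iloc[mask.to_numpy(), 1]   # boolean rows, column by position

print(by_label.equals(by_position))           # True
```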

For selecting all rows or columns, use the colon operator.

In [13]:
mydf.loc['row1', :]  # same as mydf.loc['row1']

a    1
b    3
Name: row1, dtype: int64

In [14]:
mydf.loc[:, 'b']

row1    3
row2    8
row3    1
Name: b, dtype: int64

## Non-contiguous

Note that the indices supplied do not have to be in order or in sequence.

In [15]:
mydf.iloc[[0, 2], :]

      a  b
row1  1  3
row3  2  1


## Boolean

Boolean indexing requires some `True`/`False` indicator. In the following, if column `a` has a value greater than or equal to 2, the indicator is `True` and the row is selected. Otherwise it is `False` and the row is dropped.

In [16]:
mydf[lambda df: df.a >= 2]

      a  b
row2  5  8
row3  2  1
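
It can help to look at the indicator itself. The sketch below recreates `mydf`, builds the mask explicitly, and shows one way to combine two conditions. The second condition on `b` is only an illustration.

```python
import pandas as pd

mydf = pd.DataFrame({'a': [1, 5, 2],
                     'b': [3, 8, 1]}, index=['row1', 'row2', 'row3'])

mask = mydf['a'] >= 2               # Series of True/False, one per row
print(mask.tolist())                # [False, True, True]

# The explicit mask selects the same rows as the lambda version.
print(mydf[mask].index.tolist())    # ['row2', 'row3']

# Conditions combine with & (and) and | (or); each needs its own parentheses.
both = mydf[(mydf['a'] >= 2) & (mydf['b'] > 1)]
print(both.index.tolist())          # ['row2']
```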


## List/DataFrame Extraction

We have a couple of ways to get at elements of a list or data frame. Note that pandas DataFrames are not lists, but you could get something similar by converting one to a dictionary via `mydf.to_dict()`. The examples below use a list or dict where noted.
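
For reference, a short sketch of what the conversion gives you, using the small three-row `mydf` from earlier (recreated here so the block is self-contained). The names `as_dict` and `as_lists` are illustrative.

```python
import pandas as pd

mydf = pd.DataFrame({'a': [1, 5, 2],
                     'b': [3, 8, 1]}, index=['row1', 'row2', 'row3'])

as_dict = mydf.to_dict()                 # {column: {row label: value}}
print(as_dict['a']['row2'])              # 5

as_lists = mydf.to_dict(orient='list')   # {column: [values]}
print(as_lists['b'][0:2])                # [3, 8]
```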

In [17]:
#my_list_or_df[2:4]
letters_list[1:4]

['b', 'c', 'd']

In [18]:
letters_dict['d']   # dict

4

In [19]:
mydf['a']           # dataframe

row1    1
row2    5
row3    2
Name: a, dtype: int64

In [20]:
mydf.a

row1    1
row2    5
row3    2
Name: a, dtype: int64

In [21]:
letters_dict.pop('a')   # pop by name: returns the value and removes the key from the dict

1
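
Unlike plain lookup, `pop` changes the dictionary. A minimal sketch of the difference, recreating `letters_dict` so the block stands alone (`removed` is an illustrative name):

```python
letters_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

print(letters_dict['a'])          # 1, the dict is unchanged
print(letters_dict.get('z'))      # None, a missing key does not raise an error

removed = letters_dict.pop('a')   # returns 1 and deletes the key
print(removed)                    # 1
print('a' in letters_dict)        # False
print(len(letters_dict))          # 4
```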

## Miscellaneous indexing

In [22]:
mymatrix = np.matrix(np.random.normal(size=100)).reshape(10, 10)
mydf = pd.read_csv('../data/cars.csv')
my_matdf_list = {'thismat': mymatrix, 'thisdf': mydf}  # all have to be named

In [23]:
mymatrix[0:5, :]

matrix([[-0.20204135, -1.07106443,  0.30304085, -0.86228201, -0.17762166,
          0.77708433, -0.41551318,  1.25933809, -0.17554444,  0.74503061],
        [ 1.42245071,  1.97828504,  1.32814873, -0.79372394, -0.12261866,
          0.40039033, -0.1204134 , -1.94312173, -1.40159201, -0.61447159],
        [-1.69528861,  0.05397468, -1.22006004,  1.35702778, -0.5140965 ,
         -0.2399811 ,  0.05660043, -0.09484537, -0.06190253, -1.52141814],
        [ 0.63252468, -3.08115243, -0.23711853,  0.3091085 ,  0.97987619,
         -0.947788  ,  1.85134599,  0.63400283, -0.10543731, -0.13083431],
        [ 0.18351778, -0.72801773, -0.5449031 ,  0.80636813,  2.7943664 ,
          0.71182609, -0.56865032, -2.04653567, -2.1836784 , -1.38623167]])

In [24]:
mymatrix[:, 0:5]

matrix([[-0.20204135, -1.07106443,  0.30304085, -0.86228201, -0.17762166],
        [ 1.42245071,  1.97828504,  1.32814873, -0.79372394, -0.12261866],
        [-1.69528861,  0.05397468, -1.22006004,  1.35702778, -0.5140965 ],
        [ 0.63252468, -3.08115243, -0.23711853,  0.3091085 ,  0.97987619],
        [ 0.18351778, -0.72801773, -0.5449031 ,  0.80636813,  2.7943664 ],
        [ 0.41965686, -1.12805228, -0.30306457, -0.49835632, -1.93109753],
        [-0.17233402, -1.73062811, -0.13544302,  1.29126885,  0.35569434],
        [ 0.79596888,  0.71070138,  0.4023354 , -0.81500251, -0.80266529],
        [ 0.78885704,  0.82090829, -1.04740755,  1.61731777, -1.27520777],
        [ 0.91321348,  1.28091967, -0.61675141, -1.43059279, -0.19069887]])

In [25]:
mymatrix[0, 1]

-1.0710644292557496

In [26]:
mydf.disp

0     160.0
1     160.0
2     108.0
3     258.0
4     360.0
5     225.0
6     360.0
7     146.7
8     140.8
9     167.6
10    167.6
11    275.8
12    275.8
13    275.8
14    472.0
15    460.0
16    440.0
17     78.7
18     75.7
19     71.1
20    120.1
21    318.0
22    304.0
23    350.0
24    400.0
25     79.0
26    120.3
27     95.1
28    351.0
29    145.0
30    301.0
31    121.0
Name: disp, dtype: float64

In [27]:
mydf.iloc[:,3]

0     160.0
1     160.0
2     108.0
3     258.0
4     360.0
5     225.0
6     360.0
7     146.7
8     140.8
9     167.6
10    167.6
11    275.8
12    275.8
13    275.8
14    472.0
15    460.0
16    440.0
17     78.7
18     75.7
19     71.1
20    120.1
21    318.0
22    304.0
23    350.0
24    400.0
25     79.0
26    120.3
27     95.1
28    351.0
29    145.0
30    301.0
31    121.0
Name: disp, dtype: float64

In [28]:
mydf['disp']

0     160.0
1     160.0
2     108.0
3     258.0
4     360.0
5     225.0
6     360.0
7     146.7
8     140.8
9     167.6
10    167.6
11    275.8
12    275.8
13    275.8
14    472.0
15    460.0
16    440.0
17     78.7
18     75.7
19     71.1
20    120.1
21    318.0
22    304.0
23    350.0
24    400.0
25     79.0
26    120.3
27     95.1
28    351.0
29    145.0
30    301.0
31    121.0
Name: disp, dtype: float64

In [29]:
my_matdf_list.pop('thisdf')

                 car   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110   3.9   2.62  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110   3.9  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85   2.32  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15   3.44  17.02   0   0     3     2
5            Valiant  18.1    6  225.0  105  2.76   3.46  20.22   1   0     3     1
6         Duster 360  14.3    8  360.0  245  3.21   3.57  15.84   0   0     3     4
7          Merc 240D  24.4    4  146.7   62  3.69   3.19   20.0   1   0     4     2
8           Merc 230  22.8    4  140.8   95  3.92   3.15   22.9   1   0     4     2
9           Merc 280  19.2    6  167.6  123  3.92   3.44   18.3   1   0     4     4


## Indexing Exercises

### Exercise 1

For the matrix, in separate operations, take a slice of rows, a selection of columns, and a single element.


### Exercise 2

For the DataFrame, grab a column in three different ways.

### Exercise 3

For the list, grab an element by number and by name (in Python, access by name requires a dict rather than a plain list).
