## 1.3 LOOKING AT COLUMNS, ROWS, AND CELLS

1.3.1.1 Subsetting Columns by Name
If we want only a specific column from our data, we can access the data using square brackets.

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('data/gapminder.tsv', sep='\t')

In [4]:
# just get the country column and save it to its own variable

country_df = df['country']


In [5]:
# show the first 5 observations

print(country_df.head())

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object


In [6]:
# show the last 5 observations

print(country_df.tail())

1699    Zimbabwe
1700    Zimbabwe
1701    Zimbabwe
1702    Zimbabwe
1703    Zimbabwe
Name: country, dtype: object


In [7]:
# Looking at country, continent, and year

subset = df[['country', 'continent', 'year']]

print(subset.head())

       country continent  year
0  Afghanistan      Asia  1952
1  Afghanistan      Asia  1957
2  Afghanistan      Asia  1962
3  Afghanistan      Asia  1967
4  Afghanistan      Asia  1972


In [8]:
print(subset.tail())

       country continent  year
1699  Zimbabwe    Africa  1987
1700  Zimbabwe    Africa  1992
1701  Zimbabwe    Africa  1997
1702  Zimbabwe    Africa  2002
1703  Zimbabwe    Africa  2007
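
The single and double square brackets used above return different types. The following is a minimal sketch on a small made-up dataframe (`toy` and its values are hypothetical stand-ins for the Gapminder data, not part of the original example):

```python
import pandas as pd

# hypothetical stand-in for the Gapminder data
toy = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria'],
    'continent': ['Asia', 'Europe', 'Africa'],
    'year': [1952, 1952, 1952],
})

# a single column name returns a Series
one_col = toy['country']
print(type(one_col))

# a list of column names returns a DataFrame,
# even when the list holds only one name
one_col_df = toy[['country']]
print(type(one_col_df))

# two names in the list give a two-column DataFrame
two_cols = toy[['country', 'year']]
print(two_cols.shape)
```

The inner brackets are an ordinary Python list, which is why selecting several columns needs two pairs of brackets.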


1.3.1.2 Subsetting Columns by Index Position Breaks in Pandas v0.20

At times, you may want to get a particular column by its position rather than its name. For example, you might want the first column (“country”) and the third column (“year”), or just the last column (“gdpPercap”).

In [9]:
print(df.head())

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


On the left side of the printed dataframe, we see what appear to be row numbers. This column-less row of values is the index label of the dataframe. Think of the index label as being like a column name, but for rows instead of columns. By default, Pandas will fill in the index labels with the row numbers (note that it starts counting from 0). A common example where the row index labels are not the same as the row number is when we work with time series data. In that case, the index label will be a timestamp of sorts. For now, though, we will keep the default row number values.
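
To illustrate index labels that are not row numbers, here is a minimal sketch of a time-series style dataframe. The `temps` dataframe, its `temp_c` column, and the dates are all made up for illustration:

```python
import pandas as pd

# hypothetical daily measurements, indexed by date instead of row number
dates = pd.date_range('2020-01-01', periods=3, freq='D')
temps = pd.DataFrame({'temp_c': [4.5, 6.1, 5.2]}, index=dates)

print(temps)

# the index labels are timestamps, not 0, 1, 2
print(temps.index)

first_label = temps.index[0]
print(first_label)
```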

We can use the loc attribute on the dataframe to subset rows based on the index label.

In [10]:
# get the first row

# Python counts from 0

print(df.loc[0])

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object


In [11]:
# get the 100th row

# Python counts from 0

print(df.loc[99])

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap       721.186
Name: 99, dtype: object


Note that passing -1 to loc will cause an error, because loc looks for a row with the index label ‘-1’, which does not exist in our example. Instead, we can use a bit of Python to calculate the number of rows and pass that value into loc.

In [12]:
# get the last row

# this will cause an error

print(df.loc[-1])

KeyError: 'the label [-1] is not in the [index]'

In [13]:
# get the last row (correctly)

# use the first value given from shape to get the number of rows

number_of_rows = df.shape[0]



# subtract 1 from the value since we want the last index value

last_row_index = number_of_rows - 1



# now do the subset using the index of the last row

print(df.loc[last_row_index])

country      Zimbabwe
continent      Africa
year             2007
lifeExp        43.487
pop          12311143
gdpPercap     469.709
Name: 1703, dtype: object


Alternatively, we can use the tail method to return the last 1 row, instead of the default 5.

In [14]:
# there are many ways of doing what you want

print(df.tail(n=1))

       country continent  year  lifeExp       pop   gdpPercap
1703  Zimbabwe    Africa  2007   43.487  12311143  469.709298


Notice that when we used tail() and loc, the results were printed out differently. Let’s look at which type is returned when we use these methods.

In [15]:
subset_loc = df.loc[0]

subset_head = df.head(n=1)



# type using loc of 1 row

print(type(subset_loc))

<class 'pandas.core.series.Series'>


In [16]:
# type using head of 1 row

print(type(subset_head))

<class 'pandas.core.frame.DataFrame'>


Subsetting Multiple Rows

Just as for columns, we can select multiple rows.

In [17]:
# select the first, 100th, and 1000th rows

# note the double square brackets similar to the syntax used to

# subset multiple columns

print(df.loc[[0, 99, 999]])

         country continent  year  lifeExp       pop    gdpPercap
0    Afghanistan      Asia  1952   28.801   8425333   779.445314
99    Bangladesh      Asia  1967   43.453  62821884   721.186086
999     Mongolia      Asia  1967   51.253   1149500  1226.041130
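
The double square brackets also affect the returned type when only one row is requested. A minimal sketch, using a made-up two-row dataframe (`toy` is hypothetical):

```python
import pandas as pd

# hypothetical stand-in for the Gapminder data
toy = pd.DataFrame({
    'country': ['Afghanistan', 'Albania'],
    'year': [1952, 1952],
})

# a scalar label returns the row as a Series
row_as_series = toy.loc[0]

# a list holding one label returns a one-row DataFrame
row_as_frame = toy.loc[[0]]

print(type(row_as_series))
print(type(row_as_frame))
print(row_as_frame.shape)
```

Passing a one-element list is a convenient way to keep the result as a dataframe when later code expects one.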


1.3.2.2 Subset Rows by Row Number: iloc
iloc does the same thing as loc but is used to subset by the row index number. In our current example, iloc and loc will behave in exactly the same way since the index labels are the row numbers. However, keep in mind that the index labels do not necessarily have to be row numbers.

In [18]:
# get the 2nd row

print(df.iloc[1])

country      Afghanistan
continent           Asia
year                1957
lifeExp           30.332
pop              9240934
gdpPercap        820.853
Name: 1, dtype: object


In [19]:
# get the 100th row

print(df.iloc[99])

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap       721.186
Name: 99, dtype: object


Note that when we pass 1 to iloc, we actually get the second row, rather than the first row. This follows Python’s zero-indexed behavior, meaning that the first item of a container is index 0 (i.e., 0th item of the container). More details about this kind of behavior are found in Appendices I, L, and P.

With iloc, we can pass in the -1 to get the last row—something we couldn’t do with loc.

In [20]:
# using -1 to get the last row

print(df.iloc[-1])

country      Zimbabwe
continent      Africa
year             2007
lifeExp        43.487
pop          12311143
gdpPercap     469.709
Name: 1703, dtype: object


In [21]:
# get the first, 100th, and 1000th rows

print(df.iloc[[0, 99, 999]])

         country continent  year  lifeExp       pop    gdpPercap
0    Afghanistan      Asia  1952   28.801   8425333   779.445314
99    Bangladesh      Asia  1967   43.453  62821884   721.186086
999     Mongolia      Asia  1967   51.253   1149500  1226.041130
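
Because the Gapminder index labels happen to equal the row numbers, the examples so far cannot show where loc and iloc differ. The following is a minimal sketch with a made-up dataframe (`toy`, hypothetical) whose index labels are 10, 20, and 30:

```python
import pandas as pd

# hypothetical data whose index labels are NOT the row numbers
toy = pd.DataFrame(
    {'country': ['Afghanistan', 'Albania', 'Algeria']},
    index=[10, 20, 30],
)

# loc looks up the label 10, which is the first row
by_label = toy.loc[10, 'country']

# iloc looks up position 0, which is also the first row
by_position = toy.iloc[0, 0]

# iloc accepts -1 for the last row
last_country = toy.iloc[-1, 0]

# there is no row labeled 0, so loc raises a KeyError
try:
    toy.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, last_country, label_zero_found)
```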


1.3.2.3 Subsetting Rows With ix Is Deprecated as of Pandas v0.20
The ix attribute was deprecated in Pandas v0.20 because it can be confusing, and it was removed entirely in v1.0. Nevertheless, this section quickly reviews ix for completeness.

ix can be thought of as a combination of loc and iloc, as it allows us to subset by label or integer. By default, it searches for labels. If it cannot find the corresponding label, it falls back to using integer indexing. This can be the cause for a lot of confusion, which is why this feature has been taken out. The code using ix will look exactly like that written when using loc or iloc.

In [22]:
# first row

df.ix[0]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object

In [23]:
# 100th row

df.ix[99]

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap       721.186
Name: 99, dtype: object

In [24]:
# 1st, 100th, and 1000th rows

df.ix[[0, 99, 999]]

         country continent  year  lifeExp       pop    gdpPercap
0    Afghanistan      Asia  1952   28.801   8425333   779.445314
99    Bangladesh      Asia  1967   43.453  62821884   721.186086
999     Mongolia      Asia  1967   51.253   1149500  1226.041130


1.3.3.1 Subsetting Columns
If we want to use these techniques to just subset columns, we must use Python’s slicing syntax (Appendix L). We need to do this because if we are subsetting columns, we are getting all the rows for the specified column. So, we need a method to capture all the rows.

The Python slicing syntax uses a colon, :. If we have just a colon, the attribute refers to everything. So, if we just want to get the first column using the loc or iloc syntax, we can write something like df.loc[:, [columns]] to subset the column(s).

In [25]:
# subset columns with loc

# note the position of the colon

# it is used to select all rows

subset = df.loc[:, ['year', 'pop']]

print(subset.head())

   year       pop
0  1952   8425333
1  1957   9240934
2  1962  10267083
3  1967  11537966
4  1972  13079460


In [27]:
subset2 = df.loc[:, ['country', 'year', 'gdpPercap']]

print(subset2.head())

       country  year   gdpPercap
0  Afghanistan  1952  779.445314
1  Afghanistan  1957  820.853030
2  Afghanistan  1962  853.100710
3  Afghanistan  1967  836.197138
4  Afghanistan  1972  739.981106


In [28]:
# subset columns with iloc

# iloc will allow us to use integers

# -1 will select the last column

subset = df.iloc[:, [2, 4, -1]]

print(subset.head())

   year       pop   gdpPercap
0  1952   8425333  779.445314
1  1957   9240934  820.853030
2  1962  10267083  853.100710
3  1967  11537966  836.197138
4  1972  13079460  739.981106


We will get an error if we don’t specify loc and iloc correctly.

In [29]:
# subset columns with loc

# but pass in integer values

# this will cause an error

subset = df.loc[:, [2, 4, -1]]

print(subset.head())

KeyError: 'None of [[2, 4, -1]] are in the [columns]'

In [30]:
# subset columns with iloc

# but pass in index names

# this will cause an error

subset = df.iloc[:, ['year', 'pop']]

print(subset.head())

TypeError: cannot perform reduce with flexible type
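
The exact exception type and message for these mistakes can differ between Pandas versions, so the messages shown above may not match what you see. As a rough sketch (with a made-up `toy` dataframe, and a deliberately broad set of caught exception types chosen here as an assumption), both mistakes can be demonstrated without stopping the script:

```python
import pandas as pd

# hypothetical stand-in for the Gapminder data
toy = pd.DataFrame({'year': [1952, 1957], 'pop': [8425333, 9240934]})

# loc expects labels; no column is labeled 0 or 1
loc_failed = False
try:
    toy.loc[:, [0, 1]]
except KeyError as err:
    loc_failed = True
    print('loc error:', type(err).__name__)

# iloc expects integer positions; the exception type
# raised for strings depends on the Pandas version
iloc_failed = False
try:
    toy.iloc[:, ['year', 'pop']]
except (TypeError, IndexError, ValueError) as err:
    iloc_failed = True
    print('iloc error:', type(err).__name__)
```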

1.3.3.2 Subsetting Columns by Range

You can use the built-in range function to create a range of values in Python. This way you can specify beginning and end values, and Python will automatically create a range of values in between. By default, every value between the beginning and the end (inclusive left, exclusive right; see Appendix L) will be created, unless you specify a step (Appendices L and P). In Python 3, the range function returns a lazy range object that, like a generator (Appendix P), produces its values on demand. If you are using Python 2, the range function returns a list (Appendix I), and the xrange function returns the lazy equivalent.


If we look at the code given earlier (Section 1.3.1.2), we see that we subset columns using a list of integers. Since range does not return a list in Python 3, we convert its result to a list first.


Note that when range(5) is called, five integers are returned: 0 – 4.

In [31]:
# create a range of integers from 0 to 4 inclusive

small_range = list(range(5))

print(small_range)

[0, 1, 2, 3, 4]


In [32]:
# subset the dataframe with the range

subset = df.iloc[:, small_range]

print(subset.head())

       country continent  year  lifeExp       pop
0  Afghanistan      Asia  1952   28.801   8425333
1  Afghanistan      Asia  1957   30.332   9240934
2  Afghanistan      Asia  1962   31.997  10267083
3  Afghanistan      Asia  1967   34.020  11537966
4  Afghanistan      Asia  1972   36.088  13079460


In [38]:
# create a range from 3 to 5 inclusive

small_range = list(range(3, 6))

print(small_range)

[3, 4, 5]


In [39]:
subset = df.iloc[:, small_range]

print(subset.head())

   lifeExp       pop   gdpPercap
0   28.801   8425333  779.445314
1   30.332   9240934  820.853030
2   31.997  10267083  853.100710
3   34.020  11537966  836.197138
4   36.088  13079460  739.981106


In [40]:
# create a range from 0 to 5 inclusive, every other integer

small_range = list(range(0, 6, 2))

subset = df.iloc[:, small_range]

print(subset.head())

       country  year       pop
0  Afghanistan  1952   8425333
1  Afghanistan  1957   9240934
2  Afghanistan  1962  10267083
3  Afghanistan  1967  11537966
4  Afghanistan  1972  13079460


In [41]:
print(small_range)

[0, 2, 4]
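
The start, stop, and step behavior of range can be checked without a dataframe at all. A small sketch in plain Python 3 (the variable names are illustrative only):

```python
# range(start, stop, step): start is included, stop is excluded
r = range(0, 6, 2)

# in Python 3 this is a range object, not a list
print(type(r))

# it still knows its length without being converted
print(len(r))

# converting to a list shows the actual values
values = list(r)
print(values)

# with no step, every integer from start up to (not including) stop
three_to_five = list(range(3, 6))
print(three_to_five)
```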


1.3.3.3 Slicing Columns

Python’s slicing syntax, :, is similar to the range syntax. Instead of a function that specifies start, stop, and step values delimited by a comma, we separate the values with the colon.


If you understand what was going on with the range function earlier, then slicing can be seen as a shorthand means to the same thing.


While the range function can be used to create a generator and converted to a list of values, the colon syntax for slicing only has meaning when slicing and subsetting values, and has no inherent meaning on its own.

In [42]:
small_range = list(range(3))

subset = df.iloc[:, small_range]

print(subset.head())

       country continent  year
0  Afghanistan      Asia  1952
1  Afghanistan      Asia  1957
2  Afghanistan      Asia  1962
3  Afghanistan      Asia  1967
4  Afghanistan      Asia  1972


In [43]:
print(small_range)

[0, 1, 2]


In [44]:
# slice the first 3 columns

subset = df.iloc[:, :3]

print(subset.head())

       country continent  year
0  Afghanistan      Asia  1952
1  Afghanistan      Asia  1957
2  Afghanistan      Asia  1962
3  Afghanistan      Asia  1967
4  Afghanistan      Asia  1972


In [45]:
small_range = list(range(3, 6))

subset = df.iloc[:, small_range]

print(subset.head())

   lifeExp       pop   gdpPercap
0   28.801   8425333  779.445314
1   30.332   9240934  820.853030
2   31.997  10267083  853.100710
3   34.020  11537966  836.197138
4   36.088  13079460  739.981106


In [46]:
print(small_range)

[3, 4, 5]


In [47]:
# slice columns 3 to 5 inclusive

subset = df.iloc[:, 3:6]

print(subset.head())

   lifeExp       pop   gdpPercap
0   28.801   8425333  779.445314
1   30.332   9240934  820.853030
2   31.997  10267083  853.100710
3   34.020  11537966  836.197138
4   36.088  13079460  739.981106


In [48]:
small_range = list(range(0, 6, 2))

subset = df.iloc[:, small_range]

print(subset.head())

       country  year       pop
0  Afghanistan  1952   8425333
1  Afghanistan  1957   9240934
2  Afghanistan  1962  10267083
3  Afghanistan  1967  11537966
4  Afghanistan  1972  13079460


In [49]:
# slice every other column among the first 6 (positions 0, 2, 4)

subset = df.iloc[:, 0:6:2]

print(subset.head())

       country  year       pop
0  Afghanistan  1952   8425333
1  Afghanistan  1957   9240934
2  Afghanistan  1962  10267083
3  Afghanistan  1967  11537966
4  Afghanistan  1972  13079460
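
Although the colon syntax only works inside square brackets, Python's built-in slice function creates an equivalent object that can be stored in a variable. The sketch below, on a made-up one-row dataframe (`toy` is hypothetical), suggests that the colon form, the slice object, and the list built from range all select the same columns:

```python
import pandas as pd

# hypothetical one-row stand-in with the Gapminder column names
toy = pd.DataFrame({
    'country': ['Afghanistan'], 'continent': ['Asia'], 'year': [1952],
    'lifeExp': [28.801], 'pop': [8425333], 'gdpPercap': [779.445314],
})

# colon syntax written directly inside the brackets
with_colons = toy.iloc[:, 0:6:2]

# the same start, stop, and step stored as a slice object
every_other = slice(0, 6, 2)
with_slice = toy.iloc[:, every_other]

# the same positions as an explicit list built from range
with_range = toy.iloc[:, list(range(0, 6, 2))]

print(list(with_colons.columns))
```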


Question


What happens if you use the slicing method with two colons, but leave a value out? For example, what is the result in each of the following cases?

In [50]:
subset = df.iloc[:, 0:6:]

print(subset.head())

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


In [51]:
subset = df.iloc[:, 0::2]

print(subset.head())

       country  year       pop
0  Afghanistan  1952   8425333
1  Afghanistan  1957   9240934
2  Afghanistan  1962  10267083
3  Afghanistan  1967  11537966
4  Afghanistan  1972  13079460


In [52]:
print(df.head())

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


In [53]:
subset = df.iloc[:, :6:2]

print(subset.head())

       country  year       pop
0  Afghanistan  1952   8425333
1  Afghanistan  1957   9240934
2  Afghanistan  1962  10267083
3  Afghanistan  1967  11537966
4  Afghanistan  1972  13079460


In [54]:
subset = df.iloc[:, ::2]

print(subset.head())

       country  year       pop
0  Afghanistan  1952   8425333
1  Afghanistan  1957   9240934
2  Afghanistan  1962  10267083
3  Afghanistan  1967  11537966
4  Afghanistan  1972  13079460


In [55]:
subset = df.iloc[:, ::]

print(subset.head())

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
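
One way to reason about the cases above is to ask Python which start, stop, and step it fills in for a missing value. The sketch below uses the indices method of the built-in slice object, given 6 columns; it illustrates plain Python slicing rules and is not something Pandas requires you to call:

```python
n_cols = 6  # number of columns in the Gapminder dataframe

# each tuple is the (start, stop, step) Python fills in
print(slice(0, 6, None).indices(n_cols))        # 0:6:  -> (0, 6, 1)
print(slice(0, None, 2).indices(n_cols))        # 0::2  -> (0, 6, 2)
print(slice(None, 6, 2).indices(n_cols))        # :6:2  -> (0, 6, 2)
print(slice(None, None, 2).indices(n_cols))     # ::2   -> (0, 6, 2)
print(slice(None, None, None).indices(n_cols))  # ::    -> (0, 6, 1)
```

A missing start defaults to the beginning, a missing stop to the end, and a missing step to 1.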


1.3.3.4 Subsetting Rows and Columns

We’ve been using the colon, :, in loc and iloc to the left of the comma. When we do so, we select all the rows in our dataframe. However, we can choose to put values to the left of the comma if we want to select specific rows along with specific columns.

In [56]:
# using loc

print(df.loc[42, 'country'])

Angola


Just make sure you don’t forget the differences between loc and iloc.

In [57]:
# will cause an error

print(df.loc[42, 0])

TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0] of <class 'int'>

Now, look at how confusing ix can be. Good thing it has been removed from current versions of Pandas.

In [58]:
# get the 43rd country in our data

df.ix[42, 'country']


'Angola'

In [59]:
# instead of 'country' I used the index 0

df.ix[42, 0]

'Angola'

In [60]:
# use iloc

print(df.iloc[42, 0])

Angola
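
When a row position needs to be combined with a column name, one option is to translate one of the two so that loc or iloc can be used on its own. The following is a sketch under that assumption, on a made-up dataframe (`toy`, hypothetical) with string index labels:

```python
import pandas as pd

# hypothetical data with string index labels
toy = pd.DataFrame(
    {'country': ['Afghanistan', 'Albania', 'Algeria'],
     'year': [1952, 1952, 1952]},
    index=['a', 'b', 'c'],
)

# option 1: turn the row position into its label, then use loc
row_label = toy.index[2]
via_loc = toy.loc[row_label, 'country']

# option 2: turn the column name into its position, then use iloc
col_pos = toy.columns.get_loc('country')
via_iloc = toy.iloc[2, col_pos]

print(via_loc, via_iloc)
```

Both forms state explicitly whether a label or a position is meant.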


1.3.3.5 Subsetting Multiple Rows and Columns

We can combine the row and column subsetting syntax with the multiple-row and multiple-column subsetting syntax to get various slices of our data.

In [61]:
# get the 1st, 100th, and 1000th rows

# from the 1st, 4th, and 6th columns

# the columns we are hoping to get are

# country, lifeExp, and gdpPercap

print(df.iloc[[0, 99, 999], [0, 3, 5]])

         country  lifeExp    gdpPercap
0    Afghanistan   28.801   779.445314
99    Bangladesh   43.453   721.186086
999     Mongolia   51.253  1226.041130


In my own work, I try to pass in the actual column names when subsetting data whenever possible. That approach makes the code more readable since you do not need to look at the column name vector to know which index is being called. Additionally, using index positions can lead to problems if the column order gets changed for some reason. This is just a general rule of thumb, as there will be exceptions where using the index position is a better option (e.g., concatenating data in Section 4.3).

In [62]:
# if we use the column names directly,

# it makes the code a bit easier to read

# note now we have to use loc, instead of iloc

print(df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']])

         country  lifeExp    gdpPercap
0    Afghanistan   28.801   779.445314
99    Bangladesh   43.453   721.186086
999     Mongolia   51.253  1226.041130
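
The risk of relying on positions can be seen by reordering the columns. A minimal sketch with a made-up dataframe (`toy` and `reordered` are hypothetical names):

```python
import pandas as pd

# hypothetical stand-in for the Gapminder data
toy = pd.DataFrame({
    'country': ['Afghanistan', 'Albania'],
    'lifeExp': [28.801, 55.230],
    'gdpPercap': [779.445314, 1601.056136],
})

# the same data with the columns in a different order
reordered = toy[['gdpPercap', 'country', 'lifeExp']]

# selecting by name returns the same column either way
by_name_before = toy.loc[:, 'lifeExp']
by_name_after = reordered.loc[:, 'lifeExp']

# selecting by position returns whatever sits at position 1
by_pos_before = toy.iloc[:, 1]
by_pos_after = reordered.iloc[:, 1]

print(by_name_after.name)
print(by_pos_before.name, by_pos_after.name)
```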


Remember, you can use the slicing syntax on the row portion of the loc and iloc attributes.

In [63]:
print(df.loc[10:13, ['country', 'lifeExp', 'gdpPercap']])

        country  lifeExp    gdpPercap
10  Afghanistan   42.129   726.734055
11  Afghanistan   43.828   974.580338
12      Albania   55.230  1601.056136
13      Albania   59.280  1942.284244
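
Note that the loc slice above returned the row labeled 13 as well. Slicing with loc includes the end label, whereas slicing with iloc follows the usual Python rule of excluding the end position. A small sketch on a made-up dataframe with default index labels (`toy` is hypothetical):

```python
import pandas as pd

# hypothetical data with the default index labels 0 through 19
toy = pd.DataFrame({'value': list(range(100, 120))})

# loc slices by label and includes the end label
loc_rows = toy.loc[10:13]

# iloc slices by position and excludes the end position
iloc_rows = toy.iloc[10:13]

print(list(loc_rows.index))
print(list(iloc_rows.index))
```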
