# Slicing and indexing DataFrames

## Explicit indexes

### Setting and removing indexes


pandas lets you designate one or more columns as the index. This makes subsetting code cleaner and, in some circumstances, makes lookups more efficient.

Note: the index values don't need to be unique. 

In [2]:
import pandas as pd

temperatures = pd.read_csv('./data/temperatures.csv', index_col=0)

# Look at temperatures
temperatures.head()

,date,city,country,avg_temp_c
0,2000-01-01,Abidjan,Côte D'Ivoire,27.293
1,2000-02-01,Abidjan,Côte D'Ivoire,27.685
2,2000-03-01,Abidjan,Côte D'Ivoire,29.061
3,2000-04-01,Abidjan,Côte D'Ivoire,28.162
4,2000-05-01,Abidjan,Côte D'Ivoire,27.547


In [5]:
# Set the index of temperatures to "city", assigning to temperatures_ind
temperatures_ind = temperatures.set_index('city')
temperatures_ind.head()

city,date,country,avg_temp_c
Abidjan,2000-01-01,Côte D'Ivoire,27.293
Abidjan,2000-02-01,Côte D'Ivoire,27.685
Abidjan,2000-03-01,Côte D'Ivoire,29.061
Abidjan,2000-04-01,Côte D'Ivoire,28.162
Abidjan,2000-05-01,Côte D'Ivoire,27.547


In [8]:
# Reset the index of temperatures_ind, dropping its contents
temperatures_ind = temperatures_ind.reset_index(drop=True)
temperatures_ind.head()

,date,country,avg_temp_c
0,2000-01-01,Côte D'Ivoire,27.293
1,2000-02-01,Côte D'Ivoire,27.685
2,2000-03-01,Côte D'Ivoire,29.061
3,2000-04-01,Côte D'Ivoire,28.162
4,2000-05-01,Côte D'Ivoire,27.547
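
As a minimal sketch of the steps above, the following uses a hypothetical three-row frame (the `demo` name and its values are invented for illustration, not taken from `temperatures.csv`). It shows that an index may hold duplicate values and contrasts `reset_index()` with `reset_index(drop=True)`.

```python
import pandas as pd

# Hypothetical miniature of the temperatures data (values are made up)
demo = pd.DataFrame({
    "city": ["Abidjan", "Abidjan", "Moscow"],
    "country": ["Côte D'Ivoire", "Côte D'Ivoire", "Russia"],
    "avg_temp_c": [27.3, 27.7, -7.3],
})

# "city" moves from the columns into the index
demo_ind = demo.set_index("city")

# Index values may repeat: "Abidjan" appears twice
has_duplicates = not demo_ind.index.is_unique

# Without drop=True, "city" comes back as an ordinary column
restored = demo_ind.reset_index()

# With drop=True, the "city" values are discarded
dropped = demo_ind.reset_index(drop=True)
```

Use `drop=True` only when the index values are no longer needed, since they cannot be recovered from the result.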


### Subsetting with `.loc[]`

If you have defined an explicit index, you can subset with `.loc[]`, an indexer that accepts index values. When you pass it a single argument, it returns a subset of rows.

The code for subsetting using `.loc[]` can be easier to read than standard square bracket subsetting.

In [10]:
temperatures_ind = temperatures.set_index('city')

# Create a list called cities that contains "Moscow" and "Saint Petersburg"
cities = ['Moscow', 'Saint Petersburg']

# Use .loc[] subsetting to filter temperatures_ind for rows where the city is in the cities list
temperatures_ind.loc[cities]

city,date,country,avg_temp_c
Moscow,2000-01-01,Russia,-7.313
Moscow,2000-02-01,Russia,-3.551
Moscow,2000-03-01,Russia,-1.661
Moscow,2000-04-01,Russia,10.096
Moscow,2000-05-01,Russia,10.357
...,...,...,...
Saint Petersburg,2013-05-01,Russia,12.355
Saint Petersburg,2013-06-01,Russia,17.185
Saint Petersburg,2013-07-01,Russia,17.234
Saint Petersburg,2013-08-01,Russia,17.153
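
To illustrate the readability point, here is a small comparison of the two styles on a hypothetical frame (`demo` and its values are invented for this sketch). Both approaches select the same rows; only the row order may differ.

```python
import pandas as pd

# Hypothetical sample data (values are made up)
demo = pd.DataFrame({
    "city": ["Moscow", "Lahore", "Saint Petersburg", "Moscow"],
    "avg_temp_c": [-7.3, 12.8, -5.1, -3.6],
})
cities = ["Moscow", "Saint Petersburg"]

# Square-bracket subsetting with a boolean mask; rows keep their original order
by_mask = demo[demo["city"].isin(cities)]

# The same rows via .loc[] on an index; rows follow the order of the list
by_loc = demo.set_index("city").loc[cities]
```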


### Setting multi-level indexes

Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index).

The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. 

In [11]:
# Set the index of temperatures to the "country" and "city" columns, and assign this to temperatures_ind
temperatures_ind = temperatures.set_index(['country', 'city'])

# Specify two country/city pairs to keep: "Pakistan"/"Lahore" and "Brazil"/"Rio De Janeiro", assigning to rows_to_keep
rows_to_keep = [('Pakistan', 'Lahore'), ('Brazil', 'Rio De Janeiro')]

# Subset temperatures_ind for rows_to_keep; rows come back in the order listed
temperatures_ind.loc[rows_to_keep]

country,city,date,avg_temp_c
Pakistan,Lahore,2000-01-01,12.792
Pakistan,Lahore,2000-02-01,14.339
Pakistan,Lahore,2000-03-01,20.309
Pakistan,Lahore,2000-04-01,29.072
Pakistan,Lahore,2000-05-01,34.845
...,...,...,...
Brazil,Rio De Janeiro,2013-05-01,24.443
Brazil,Rio De Janeiro,2013-06-01,24.703
Brazil,Rio De Janeiro,2013-07-01,23.768
Brazil,Rio De Janeiro,2013-08-01,23.175
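
The difference between selecting on the outer level and selecting exact pairs can be sketched as follows. The `demo` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with a two-level (country, city) index
demo = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Pakistan", "Pakistan"],
    "city": ["Rio De Janeiro", "São Paulo", "Faisalabad", "Lahore"],
    "avg_temp_c": [24.4, 23.7, 12.8, 12.8],
}).set_index(["country", "city"])

# A list of plain strings selects on the outer level only
outer = demo.loc[["Brazil"]]

# A list of tuples selects exact (outer, inner) combinations
pairs = demo.loc[[("Pakistan", "Lahore"), ("Brazil", "Rio De Janeiro")]]
```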


### Sorting by index values

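It is sometimes useful to sort rows by the values in the index. Use `.sort_index()` for this. It returns a new, sorted DataFrame rather than modifying the original.

In [12]:
# Sort temperatures_ind by ascending country then descending city.
# .sort_index() returns a new DataFrame, so this call on its own leaves temperatures_ind unchanged
temperatures_ind.sort_index(level=['country', 'city'], ascending=[True, False])

# temperatures_ind is therefore still in its original order, as the output below shows
temperatures_ind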

country,city,date,avg_temp_c
Côte D'Ivoire,Abidjan,2000-01-01,27.293
Côte D'Ivoire,Abidjan,2000-02-01,27.685
Côte D'Ivoire,Abidjan,2000-03-01,29.061
Côte D'Ivoire,Abidjan,2000-04-01,28.162
Côte D'Ivoire,Abidjan,2000-05-01,27.547
...,...,...,...
China,Xian,2013-05-01,18.979
China,Xian,2013-06-01,23.522
China,Xian,2013-07-01,25.251
China,Xian,2013-08-01,24.528
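
To keep the sorted result, assign the return value of `.sort_index()` to a name. A minimal sketch on hypothetical data (`demo`, `demo_srt`, and the values are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data in no particular order
demo = pd.DataFrame({
    "country": ["China", "Brazil", "China", "Brazil"],
    "city": ["Harbin", "Rio De Janeiro", "Xian", "São Paulo"],
    "avg_temp_c": [4.9, 24.4, 12.0, 23.7],
}).set_index(["country", "city"])

# Ascending by country, descending by city; demo itself keeps its original order
demo_srt = demo.sort_index(level=["country", "city"], ascending=[True, False])
```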


## Slicing and subsetting with `.loc[]` and `.iloc[]`

### Slicing index values

Slicing lets you select consecutive elements of an object using `first:last` syntax. DataFrames can be sliced by index values with `.loc[]`. Unlike positional slicing, a label slice includes both `first` and `last`.

* Slice an index only after sorting it with `.sort_index()`; label slices on an unsorted index can raise an error or return unexpected rows.
* To slice at the outer level, `first` and `last` can be strings.
* To slice at inner levels, `first` and `last` should be tuples.
* If you pass a single slice to `.loc[]`, it will slice the rows.

In [16]:
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Slice rows from Pakistan to Russia (both endpoints are included)
temperatures_srt.loc['Pakistan':'Russia']

country,city,date,avg_temp_c
Pakistan,Faisalabad,2000-01-01,12.792
Pakistan,Faisalabad,2000-02-01,14.339
Pakistan,Faisalabad,2000-03-01,20.309
Pakistan,Faisalabad,2000-04-01,29.072
Pakistan,Faisalabad,2000-05-01,34.845
...,...,...,...
Russia,Saint Petersburg,2013-05-01,12.355
Russia,Saint Petersburg,2013-06-01,17.185
Russia,Saint Petersburg,2013-07-01,17.234
Russia,Saint Petersburg,2013-08-01,17.153


In [17]:
# Slice from Pakistan, Lahore to Russia, Moscow
temperatures_srt.loc[('Pakistan', 'Lahore'):('Russia', 'Moscow')]

country,city,date,avg_temp_c
Pakistan,Lahore,2000-01-01,12.792
Pakistan,Lahore,2000-02-01,14.339
Pakistan,Lahore,2000-03-01,20.309
Pakistan,Lahore,2000-04-01,29.072
Pakistan,Lahore,2000-05-01,34.845
...,...,...,...
Russia,Moscow,2013-05-01,16.152
Russia,Moscow,2013-06-01,18.718
Russia,Moscow,2013-07-01,18.136
Russia,Moscow,2013-08-01,17.485
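
The slicing rules in this section can be checked on a small example. This is a sketch on hypothetical data (`demo` and its values are made up) showing a sorted index, an outer-level slice with strings, and an inner-level slice with tuples.

```python
import pandas as pd

# Hypothetical sample data, deliberately out of order
demo = pd.DataFrame({
    "country": ["Russia", "Pakistan", "India", "Pakistan", "Russia"],
    "city": ["Moscow", "Lahore", "Delhi", "Faisalabad", "Saint Petersburg"],
    "avg_temp_c": [-7.3, 12.8, 18.1, 12.8, -5.1],
}).set_index(["country", "city"])

# Sort first, then confirm the index is in increasing order
demo_srt = demo.sort_index()
is_sorted = demo_srt.index.is_monotonic_increasing

# Outer-level slice with strings: all Pakistan and all Russia rows
outer = demo_srt.loc["Pakistan":"Russia"]

# Inner-level slice with tuples: (Pakistan, Lahore) through (Russia, Moscow)
inner = demo_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")]
```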


### Slicing in both directions

Since DataFrames are two-dimensional objects, it is often natural to slice both dimensions at once. That is, by passing two arguments to `.loc[]`, you can subset by rows and columns in one go.

In [18]:
# Slice in both directions at once from Hyderabad to Baghdad, and date to avg_temp_c
temperatures_srt.loc[('India','Hyderabad'):('Iraq','Baghdad'), 'date':'avg_temp_c']

country,city,date,avg_temp_c
India,Hyderabad,2000-01-01,23.779
India,Hyderabad,2000-02-01,25.826
India,Hyderabad,2000-03-01,28.821
India,Hyderabad,2000-04-01,32.698
India,Hyderabad,2000-05-01,32.438
...,...,...,...
Iraq,Baghdad,2013-05-01,28.673
Iraq,Baghdad,2013-06-01,33.803
Iraq,Baghdad,2013-07-01,36.392
Iraq,Baghdad,2013-08-01,35.463
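
A compact sketch of two-dimensional slicing follows. The `demo` frame, its values, and the extra `source` column are hypothetical; the extra column is there only to show that the column slice leaves it out.

```python
import pandas as pd

# Hypothetical sample data with one column beyond the slice range
demo = pd.DataFrame({
    "country": ["India", "India", "Iraq", "Italy"],
    "city": ["Delhi", "Hyderabad", "Baghdad", "Rome"],
    "date": ["2000-01-01"] * 4,
    "avg_temp_c": [14.0, 23.8, 9.9, 8.1],
    "source": ["demo"] * 4,
}).set_index(["country", "city"]).sort_index()

# First argument slices rows, second slices columns; both include their endpoints
subset = demo.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"]
```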


### Slicing time series

Slicing is particularly useful for time series, where filtering for a date range is a common task. Keep your dates in ISO 8601 format: "yyyy-mm-dd" for year-month-day, "yyyy-mm" for year-month, and "yyyy" for year.

In [25]:
# Set the date column values as datetime objects
temperatures['date'] = pd.to_datetime(temperatures['date'])

In [27]:
# Set the index of temperatures to the date column and sort it
temperatures_date = temperatures.set_index('date').sort_index()

# Use .loc[] to subset temperatures_date for rows in 2010 and 2011
temperatures_date.loc['2010':'2011']

date,city,country,avg_temp_c
2010-01-01,Faisalabad,Pakistan,11.810
2010-01-01,Melbourne,Australia,20.016
2010-01-01,Chongqing,China,7.921
2010-01-01,São Paulo,Brazil,23.738
2010-01-01,Guangzhou,China,14.136
...,...,...,...
2011-12-01,Nagoya,Japan,6.476
2011-12-01,Hyderabad,India,23.613
2011-12-01,Cali,Colombia,21.559
2011-12-01,Lima,Peru,18.293


In [28]:
# Use .loc[] to subset temperatures_date for rows from Aug 2010 to Feb 2011
temperatures_date.loc['2010-08':'2011-02']

date,city,country,avg_temp_c
2010-08-01,Calcutta,India,30.226
2010-08-01,Pune,India,24.941
2010-08-01,Izmir,Turkey,28.352
2010-08-01,Tianjin,China,25.543
2010-08-01,Manila,Philippines,27.101
...,...,...,...
2011-02-01,Kabul,Afghanistan,3.914
2011-02-01,Chicago,United States,0.276
2011-02-01,Aleppo,Syria,8.246
2011-02-01,Delhi,India,18.136


### Subsetting by row/column number

It is also occasionally useful to slice by row and column position. This is done using `.iloc[]`. Positions are zero-based, and the stop position of a slice is excluded.

In [31]:
# Get the first 5 rows and the third and fourth columns (positions 2 and 3) from temperatures
temperatures.iloc[:5, 2:4]

,country,avg_temp_c
0,Côte D'Ivoire,27.293
1,Côte D'Ivoire,27.685
2,Côte D'Ivoire,29.061
3,Côte D'Ivoire,28.162
4,Côte D'Ivoire,27.547
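
To make the contrast with `.loc[]` concrete, this sketch selects the same cells once by position and once by label. The `demo` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with the default integer index 0, 1, 2
demo = pd.DataFrame({
    "date": ["2000-01-01", "2000-02-01", "2000-03-01"],
    "city": ["Abidjan", "Abidjan", "Abidjan"],
    "country": ["Côte D'Ivoire"] * 3,
    "avg_temp_c": [27.3, 27.7, 29.1],
})

# Positions: rows 0-1 and columns 2-3; the stop positions 2 and 4 are excluded
by_position = demo.iloc[:2, 2:4]

# Labels: both endpoints are included, so 0:1 also means rows 0 and 1
by_label = demo.loc[0:1, "country":"avg_temp_c"]
```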


## Working with pivot tables

### Create a pivot table

In [32]:
# Add a year column to temperatures, from the year component of the date column
temperatures['year'] = temperatures['date'].dt.year

# Make a pivot table of the avg_temp_c column, with country and city as rows, and year as columns 
# Assign to temp_by_country_city_vs_year, and look at the result
temp_by_country_city_vs_year = temperatures.pivot_table('avg_temp_c', index=['country', 'city'], columns='year')
temp_by_country_city_vs_year

country,city,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
Afghanistan,Kabul,15.822667,15.847917,15.714583,15.132583,16.128417,14.847500,15.798500,15.518000,15.479250,15.093333,15.676000,15.812167,14.510333,16.206125
Angola,Luanda,24.410333,24.427083,24.790917,24.867167,24.216167,24.414583,24.138417,24.241583,24.266333,24.325083,24.440250,24.150750,24.240083,24.553875
Australia,Melbourne,14.320083,14.180000,14.075833,13.985583,13.742083,14.378500,13.991083,14.991833,14.110583,14.647417,14.231667,14.190917,14.268667,14.741500
Australia,Sydney,17.567417,17.854500,17.733833,17.592333,17.869667,18.028083,17.749500,18.020833,17.321083,18.175833,17.999000,17.713333,17.474333,18.089750
Bangladesh,Dhaka,25.905250,25.931250,26.095000,25.927417,26.136083,26.193333,26.440417,25.951333,26.004500,26.535583,26.648167,25.803250,26.283583,26.587000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
United States,Chicago,11.089667,11.703083,11.532083,10.481583,10.943417,11.583833,11.870500,11.448333,10.242417,10.298333,11.815917,11.214250,12.821250,11.586889
United States,Los Angeles,16.643333,16.466250,16.430250,16.944667,16.552833,16.431417,16.623083,16.699917,17.014750,16.677000,15.887000,15.874833,17.089583,18.120667
United States,New York,9.969083,10.931000,11.252167,9.836000,10.389500,10.681417,11.519250,10.627333,10.641667,10.141833,11.357583,11.272250,11.971500,12.163889
Vietnam,Ho Chi Minh City,27.588917,27.831750,28.064750,27.827667,27.686583,27.884000,28.044000,27.866667,27.611417,27.853333,28.281750,27.675417,28.248750,28.455000
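
A small sketch of what `.pivot_table()` computes, using hypothetical data (`demo` and its values are invented). When no `aggfunc` is given, each cell holds the mean of the matching rows, and combinations with no rows become `NaN`.

```python
import pandas as pd

# Hypothetical sample data: two Moscow readings in 2000, one in 2001, one Harbin reading
demo = pd.DataFrame({
    "country": ["Russia", "Russia", "Russia", "China"],
    "city": ["Moscow", "Moscow", "Moscow", "Harbin"],
    "year": [2000, 2000, 2001, 2000],
    "avg_temp_c": [-7.0, -3.0, 5.0, 4.0],
})

# Rows are (country, city) pairs, columns are years, cells are mean temperatures
pivot = demo.pivot_table(values="avg_temp_c", index=["country", "city"], columns="year")

# Mean of the two Moscow readings in 2000
moscow_2000 = pivot.loc[("Russia", "Moscow"), 2000]
```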


### Subsetting pivot tables

A pivot table is just a DataFrame with sorted indexes, so you can use `.loc[]` to subset it.

In [33]:
# Subset temp_by_country_city_vs_year from Egypt, Cairo to India, Delhi, and 2005 to 2010
# The year columns hold integers (from .dt.year), so slice them with integer labels
temp_by_country_city_vs_year.loc[('Egypt','Cairo'):('India','Delhi'), 2005:2010]

country,city,2005,2006,2007,2008,2009,2010
Egypt,Cairo,22.0065,22.05,22.361,22.6445,22.625,23.71825
Egypt,Gizeh,22.0065,22.05,22.361,22.6445,22.625,23.71825
Ethiopia,Addis Abeba,18.312833,18.427083,18.142583,18.165,18.765333,18.29825
France,Paris,11.552917,11.7885,11.750833,11.27825,11.464083,10.409833
Germany,Berlin,9.919083,10.545333,10.883167,10.65775,10.0625,8.606833
India,Ahmadabad,26.828083,27.282833,27.511167,27.0485,28.095833,28.017833
India,Bangalore,25.4765,25.41825,25.464333,25.352583,25.72575,25.70525
India,Bombay,27.03575,27.3815,27.634667,27.17775,27.8445,27.765417
India,Calcutta,26.729167,26.98625,26.584583,26.522333,27.15325,27.288833
India,Delhi,25.716083,26.365917,26.145667,25.675,26.55425,26.52025
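
The same kind of subset can be sketched on a tiny pivot table built from hypothetical data (`records`, `demo`, and the values are invented). It assumes the column labels are integers, as they are when the year comes from `.dt.year`.

```python
import pandas as pd

# Hypothetical readings: one temperature per city, repeated for three years
records = [("Egypt", "Cairo", 22.0), ("India", "Delhi", 25.7), ("Japan", "Tokyo", 13.0)]
demo = pd.DataFrame(
    [(country, city, year, temp) for country, city, temp in records for year in (2004, 2005, 2006)],
    columns=["country", "city", "year", "avg_temp_c"],
)
pivot = demo.pivot_table(values="avg_temp_c", index=["country", "city"], columns="year")

# Rows are sliced with tuples, columns with integer year labels; endpoints are included
subset = pivot.loc[("Egypt", "Cairo"):("India", "Delhi"), 2005:2006]
```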


### Calculating on a pivot table

In [41]:
# Calculate the mean temperature for each year
mean_temp_by_year = temp_by_country_city_vs_year.mean()

# Filter mean_temp_by_year for the year that had the highest mean temperature
mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()]

year
2013    20.312285
dtype: float64

In [42]:
# Calculate the mean temperature for each city, assigning to mean_temp_by_city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis='columns')

# Filter mean_temp_by_city for the city that had the lowest mean temperature
mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()]

country  city  
China    Harbin    4.876551
dtype: float64
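
If only the label of the extreme value is needed, `.idxmax()` and `.idxmin()` are an alternative to filtering against `.max()` or `.min()`. A minimal sketch on a hypothetical two-row pivot table (the `pivot` values here are made up):

```python
import pandas as pd

# Hypothetical pivot table: rows are (country, city), columns are years
pivot = pd.DataFrame(
    {2012: [4.0, 20.0], 2013: [6.0, 22.0]},
    index=pd.MultiIndex.from_tuples(
        [("China", "Harbin"), ("India", "Delhi")], names=["country", "city"]
    ),
)

# One mean per column (year), then one mean per row (country, city)
mean_by_year = pivot.mean()
mean_by_city = pivot.mean(axis="columns")

# Labels of the largest yearly mean and the smallest city mean
warmest_year = mean_by_year.idxmax()
coldest_city = mean_by_city.idxmin()
```

The filtering form used above returns a Series and keeps every tied label, whereas `.idxmax()` and `.idxmin()` return only the first matching label.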