[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HarisJafri-xcode/Data-Analyst-in-Python/blob/main/04_Data_Manipulation_with_pandas/03_Slicing_and_Indexing_DataFrames.ipynb)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Setting and Removing Indexes

pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).
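
As a minimal sketch of the idea, assuming a small made-up DataFrame (the `toy` frame and its values are illustrative and not taken from the temperatures data), setting an index moves a column into the row labels, and resetting it moves the labels back:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "city": ["Lima", "Oslo", "Lima"],
    "temp_c": [19.5, -3.0, 20.9],
})

# "city" becomes the row labels and is no longer a regular column
toy_ind = toy.set_index("city")
print(toy_ind.columns.tolist())  # ['temp_c']
print(toy_ind.index.name)        # city

# reset_index() restores "city" as a column with a fresh 0..n-1 index
restored = toy_ind.reset_index()
print(restored.equals(toy))      # True
```

Note that an index does not have to be unique: "Lima" appears twice here, just as each city appears many times in the temperatures data.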

In this chapter, you'll be exploring temperatures, a DataFrame of average temperatures in cities around the world.

In [2]:
file_path = 'https://raw.githubusercontent.com/HarisJafri-xcode/Data-Analyst-in-Python/refs/heads/main/04_Data_Manipulation_with_pandas/temperatures.csv'
temperatures = pd.read_csv(file_path)

Let us observe the DataFrame.

In [3]:
temperatures.head()

,date,city,country,avg_temp_c
0,1/1/2000,Abidjan,Côte D'Ivoire,27.293
1,2/1/2000,Abidjan,Côte D'Ivoire,27.685
2,3/1/2000,Abidjan,Côte D'Ivoire,29.061
3,4/1/2000,Abidjan,Côte D'Ivoire,28.162
4,5/1/2000,Abidjan,Côte D'Ivoire,27.547


In [4]:
temperatures.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16500 entries, 0 to 16499
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        16500 non-null  object 
 1   city        16500 non-null  object 
 2   country     16500 non-null  object 
 3   avg_temp_c  16407 non-null  float64
dtypes: float64(1), object(3)
memory usage: 515.8+ KB


In [5]:
temperatures['date'] = pd.to_datetime(temperatures['date'],format='%d/%m/%Y')

Set the index of temperatures to "city", assigning to temperatures_ind.

In [6]:
temperatures_ind = temperatures.set_index('city')
temperatures_ind.head()

city,date,country,avg_temp_c
Abidjan,2000-01-01,Côte D'Ivoire,27.293
Abidjan,2000-01-02,Côte D'Ivoire,27.685
Abidjan,2000-01-03,Côte D'Ivoire,29.061
Abidjan,2000-01-04,Côte D'Ivoire,28.162
Abidjan,2000-01-05,Côte D'Ivoire,27.547


We can also reset the index, either keeping its values as a regular column or discarding them.

In [7]:
temperatures_ind.reset_index()  # Move the index back into a regular "city" column

,city,date,country,avg_temp_c
0,Abidjan,2000-01-01,Côte D'Ivoire,27.293
1,Abidjan,2000-01-02,Côte D'Ivoire,27.685
2,Abidjan,2000-01-03,Côte D'Ivoire,29.061
3,Abidjan,2000-01-04,Côte D'Ivoire,28.162
4,Abidjan,2000-01-05,Côte D'Ivoire,27.547
...,...,...,...,...
16495,Xian,2013-01-05,China,18.979
16496,Xian,2013-01-06,China,23.522
16497,Xian,2013-01-07,China,25.251
16498,Xian,2013-01-08,China,24.528


In [8]:
temperatures_ind.reset_index(drop=True)  # Discard the index values instead of keeping them as a column

,date,country,avg_temp_c
0,2000-01-01,Côte D'Ivoire,27.293
1,2000-01-02,Côte D'Ivoire,27.685
2,2000-01-03,Côte D'Ivoire,29.061
3,2000-01-04,Côte D'Ivoire,28.162
4,2000-01-05,Côte D'Ivoire,27.547
...,...,...,...
16495,2013-01-05,China,18.979
16496,2013-01-06,China,23.522
16497,2013-01-07,China,25.251
16498,2013-01-08,China,24.528


# Subsetting with .loc[]

The killer feature for indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.

The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.
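
The difference can be sketched on a small made-up frame (the `toy` data and the `wanted` list are hypothetical, chosen only to keep the example self-contained). Both approaches pick the same rows; they differ in how the request is written and in the order of the result:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "city": ["London", "Paris", "Rome", "London"],
    "temp_c": [4.7, 5.1, 9.3, 6.1],
})
wanted = ["London", "Paris"]

# Bracket subsetting needs a Boolean mask built from the column
by_mask = toy[toy["city"].isin(wanted)]

# With "city" as the index, .loc[] accepts the labels directly
by_label = toy.set_index("city").loc[wanted]

# Same temperatures either way; the mask keeps the original row order,
# while .loc[] follows the order of the labels in the list
print(sorted(by_mask["temp_c"]) == sorted(by_label["temp_c"]))  # True
print(by_label.index.tolist())  # ['London', 'London', 'Paris']
```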

Create a list called cities that contains "London" and "Paris".

In [9]:
cities = ["London", "Paris"]

Use [] subsetting to filter temperatures for rows where the city column takes a value in the cities list.

In [10]:
temperatures[temperatures['city'].isin(cities)]

,date,city,country,avg_temp_c
8910,2000-01-01,London,United Kingdom,4.693
8911,2000-01-02,London,United Kingdom,6.115
8912,2000-01-03,London,United Kingdom,7.422
8913,2000-01-04,London,United Kingdom,8.246
8914,2000-01-05,London,United Kingdom,12.491
...,...,...,...,...
12040,2013-01-05,Paris,France,11.703
12041,2013-01-06,Paris,France,16.340
12042,2013-01-07,Paris,France,21.186
12043,2013-01-08,Paris,France,19.235


Use .loc[] subsetting to filter temperatures_ind for rows where the city is in the cities list.

In [11]:
temperatures_ind.loc[cities]

city,date,country,avg_temp_c
London,2000-01-01,United Kingdom,4.693
London,2000-01-02,United Kingdom,6.115
London,2000-01-03,United Kingdom,7.422
London,2000-01-04,United Kingdom,8.246
London,2000-01-05,United Kingdom,12.491
...,...,...,...
Paris,2013-01-05,France,11.703
Paris,2013-01-06,France,16.340
Paris,2013-01-07,France,21.186
Paris,2013-01-08,France,19.235


# Setting Multi-level Indexes

Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.

The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial, you might have control and treatment groups. Then each test subject belongs to one or another group, and we can say that a test subject is nested inside the treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say a city is nested inside the country.

The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes and keep track of how your data is represented.
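
To make the "two syntaxes" point concrete, here is a small sketch on made-up data (the `toy` frame is hypothetical). The same question, "which rows belong to Peru?", is written differently depending on whether `country` is a column or an index level:

```python
import pandas as pd

# Hypothetical toy data, already ordered by country and city
toy = pd.DataFrame({
    "country": ["Chile", "Peru", "Peru"],
    "city": ["Santiago", "Cusco", "Lima"],
    "temp_c": [14.8, 12.1, 19.5],
})
toy_ind = toy.set_index(["country", "city"])

# Column syntax: compare the column and filter with the resulting mask
peru_cols = toy[toy["country"] == "Peru"]
print(peru_cols["city"].tolist())  # ['Cusco', 'Lima']

# Index syntax: pass the outer-level label to .loc[]
peru_ind = toy_ind.loc["Peru"]
print(peru_ind.index.tolist())     # ['Cusco', 'Lima']

# Index levels are read through the index object, not through []
print(toy_ind.index.get_level_values("city").tolist())
# ['Santiago', 'Cusco', 'Lima']
```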

Set the index of temperatures to the "country" and "city" columns, and assign this to temperatures_ind.

In [12]:
temperatures_ind = temperatures.set_index(['country','city'])

Specify two country/city pairs to keep: "Brazil"/"Rio De Janeiro" and "Pakistan"/"Lahore", assigning to rows_to_keep.

In [13]:
rows_to_keep = [('Brazil','Rio De Janeiro'),('Pakistan','Lahore')]

In [14]:
temperatures_ind.loc[rows_to_keep]

country,city,date,avg_temp_c
Brazil,Rio De Janeiro,2000-01-01,25.974
Brazil,Rio De Janeiro,2000-01-02,26.699
Brazil,Rio De Janeiro,2000-01-03,26.270
Brazil,Rio De Janeiro,2000-01-04,25.750
Brazil,Rio De Janeiro,2000-01-05,24.356
...,...,...,...
Pakistan,Lahore,2013-01-05,33.457
Pakistan,Lahore,2013-01-06,34.456
Pakistan,Lahore,2013-01-07,33.279
Pakistan,Lahore,2013-01-08,31.511


# Sorting by index values

Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It's also useful to be able to sort by elements in the index. For this, you need to use .sort_index().
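
A short sketch of the main `.sort_index()` options, using a made-up three-row frame (the `toy` data is hypothetical). The `level` and `ascending` arguments take matching lists when each level needs its own direction:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "country": ["Peru", "Chile", "Peru"],
    "city": ["Lima", "Santiago", "Cusco"],
    "temp_c": [19.5, 14.8, 12.1],
}).set_index(["country", "city"])

# Default: sort by every level, outer to inner, ascending
print(toy.sort_index().index.tolist())
# [('Chile', 'Santiago'), ('Peru', 'Cusco'), ('Peru', 'Lima')]

# Per-level direction: country ascending, city descending
mixed = toy.sort_index(level=["country", "city"], ascending=[True, False])
print(mixed.index.tolist())
# [('Chile', 'Santiago'), ('Peru', 'Lima'), ('Peru', 'Cusco')]
```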

Sort temperatures_ind by the index values.

In [15]:
temperatures_ind.sort_index()

country,city,date,avg_temp_c
Afghanistan,Kabul,2000-01-01,3.326
Afghanistan,Kabul,2000-01-02,3.454
Afghanistan,Kabul,2000-01-03,9.612
Afghanistan,Kabul,2000-01-04,17.925
Afghanistan,Kabul,2000-01-05,24.658
...,...,...,...
Zimbabwe,Harare,2013-01-05,18.298
Zimbabwe,Harare,2013-01-06,17.020
Zimbabwe,Harare,2013-01-07,16.299
Zimbabwe,Harare,2013-01-08,19.232


Sort temperatures_ind by the index values at the "city" level.

In [16]:
temperatures_ind.sort_index(level='city')

country,city,date,avg_temp_c
Côte D'Ivoire,Abidjan,2000-01-01,27.293
Côte D'Ivoire,Abidjan,2000-01-02,27.685
Côte D'Ivoire,Abidjan,2000-01-03,29.061
Côte D'Ivoire,Abidjan,2000-01-04,28.162
Côte D'Ivoire,Abidjan,2000-01-05,27.547
...,...,...,...
China,Xian,2013-01-05,18.979
China,Xian,2013-01-06,23.522
China,Xian,2013-01-07,25.251
China,Xian,2013-01-08,24.528


Sort temperatures_ind by ascending country then descending city.

In [17]:
temperatures_ind.sort_index(level=['country','city'], ascending=[True,False])

country,city,date,avg_temp_c
Afghanistan,Kabul,2000-01-01,3.326
Afghanistan,Kabul,2000-01-02,3.454
Afghanistan,Kabul,2000-01-03,9.612
Afghanistan,Kabul,2000-01-04,17.925
Afghanistan,Kabul,2000-01-05,24.658
...,...,...,...
Zimbabwe,Harare,2013-01-05,18.298
Zimbabwe,Harare,2013-01-06,17.020
Zimbabwe,Harare,2013-01-07,16.299
Zimbabwe,Harare,2013-01-08,19.232


# Slicing Index Values

Slicing lets you select consecutive elements of an object using first:last syntax. DataFrames can be sliced by index values or by row/column number; we'll start with the first case. This involves slicing inside the .loc[] method.

Compared to slicing lists, there are a few things to remember.

- You can only slice an index if the index is sorted (using .sort_index()).
- To slice at the outer level, first and last can be strings.
- To slice at inner levels, first and last should be tuples.
- If you pass a single slice to .loc[], it will slice the rows.
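
The rules above can be sketched on a small made-up frame (the `toy` data is hypothetical). The last line shows a common pitfall: parentheses around a single string do not make a tuple, so `("Cuba, Havana")` is just a string.

```python
import pandas as pd

# Hypothetical toy data, deliberately unsorted
toy = pd.DataFrame({
    "country": ["Peru", "Chile", "Peru", "Cuba"],
    "city": ["Lima", "Santiago", "Cusco", "Havana"],
    "temp_c": [19.5, 14.8, 12.1, 25.2],
}).set_index(["country", "city"])

# Sort first: slicing relies on a sorted index
print(toy.index.is_monotonic_increasing)      # False
toy_srt = toy.sort_index()
print(toy_srt.index.is_monotonic_increasing)  # True

# Outer level: plain strings, and both endpoints are included
outer = toy_srt.loc["Chile":"Cuba"]
print(outer.index.tolist())
# [('Chile', 'Santiago'), ('Cuba', 'Havana')]

# Inner level: each endpoint is an (outer, inner) tuple
inner = toy_srt.loc[("Cuba", "Havana"):("Peru", "Cusco")]
print(inner.index.tolist())
# [('Cuba', 'Havana'), ('Peru', 'Cusco')]

# Pitfall: this is a string, not a tuple
print(type(("Cuba, Havana")).__name__)  # str
```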

Sort the index of temperatures_ind and store the result in temperatures_srt.

In [18]:
temperatures_srt = temperatures_ind.sort_index()

Use slicing with .loc[] to get these subsets:
- from Pakistan to Philippines.
- from Lahore to Manila. (This will return nonsense.)
- from Pakistan, Lahore to Philippines, Manila.

In [19]:
temperatures_srt.loc['Pakistan':'Philippines']

country,city,date,avg_temp_c
Pakistan,Faisalabad,2000-01-01,12.792
Pakistan,Faisalabad,2000-01-02,14.339
Pakistan,Faisalabad,2000-01-03,20.309
Pakistan,Faisalabad,2000-01-04,29.072
Pakistan,Faisalabad,2000-01-05,34.845
...,...,...,...
Philippines,Manila,2013-01-05,29.552
Philippines,Manila,2013-01-06,28.572
Philippines,Manila,2013-01-07,27.266
Philippines,Manila,2013-01-08,26.754


In [20]:
# temperatures_srt.loc['Lahore':'Manila']  # Slices the outer (country) level between 'Lahore' and 'Manila', so the result is not meaningful

In [21]:
temperatures_srt.loc[('Pakistan','Lahore'):('Philippines','Manila')]  # Each endpoint is a (country, city) tuple

country,city,date,avg_temp_c
Pakistan,Lahore,2000-01-01,12.792
...,...,...,...
Philippines,Manila,2013-01-05,29.552
Philippines,Manila,2013-01-06,28.572
Philippines,Manila,2013-01-07,27.266
Philippines,Manila,2013-01-08,26.754


# Slicing in Both Directions

You've seen slicing DataFrames by rows and by columns, but since DataFrames are two-dimensional objects, it is often natural to slice both dimensions at once. That is, by passing two arguments to .loc[], you can subset by rows and columns in one go.
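
A minimal sketch of two-argument `.loc[]` on a made-up frame (the `toy` data and its labels are hypothetical). The first argument selects rows, the second selects columns, and a bare `:` keeps everything along that dimension:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]},
    index=["w", "x", "y", "z"],
)

# Rows "x" through "y" and columns "b" through "c";
# label slices include both endpoints
block = toy.loc["x":"y", "b":"c"]
print(block.values.tolist())  # [[20, 200], [30, 300]]

# ":" keeps every row while the columns are sliced
print(toy.loc[:, "a":"b"].columns.tolist())  # ['a', 'b']
```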

Use .loc[] slicing to subset rows from India, Hyderabad to Iraq, Baghdad.

In [22]:
temperatures_srt.loc[('India','Hyderabad'):('Iraq','Baghdad')]

country,city,date,avg_temp_c
India,Hyderabad,2000-01-01,23.779
India,Hyderabad,2000-01-02,25.826
India,Hyderabad,2000-01-03,28.821
India,Hyderabad,2000-01-04,32.698
India,Hyderabad,2000-01-05,32.438
...,...,...,...
Iraq,Baghdad,2013-01-05,28.673
Iraq,Baghdad,2013-01-06,33.803
Iraq,Baghdad,2013-01-07,36.392
Iraq,Baghdad,2013-01-08,35.463


Use .loc[] slicing to subset columns from date to avg_temp_c.

In [23]:
temperatures_srt.loc[:,'date':'avg_temp_c']

country,city,date,avg_temp_c
Afghanistan,Kabul,2000-01-01,3.326
Afghanistan,Kabul,2000-01-02,3.454
Afghanistan,Kabul,2000-01-03,9.612
Afghanistan,Kabul,2000-01-04,17.925
Afghanistan,Kabul,2000-01-05,24.658
...,...,...,...
Zimbabwe,Harare,2013-01-05,18.298
Zimbabwe,Harare,2013-01-06,17.020
Zimbabwe,Harare,2013-01-07,16.299
Zimbabwe,Harare,2013-01-08,19.232


Slice in both directions at once from Hyderabad to Baghdad, and date to avg_temp_c.

In [24]:
temperatures_srt.loc[('India','Hyderabad'):('Iraq','Baghdad'),'date':'avg_temp_c']

country,city,date,avg_temp_c
India,Hyderabad,2000-01-01,23.779
India,Hyderabad,2000-01-02,25.826
India,Hyderabad,2000-01-03,28.821
India,Hyderabad,2000-01-04,32.698
India,Hyderabad,2000-01-05,32.438
...,...,...,...
Iraq,Baghdad,2013-01-05,28.673
Iraq,Baghdad,2013-01-06,33.803
Iraq,Baghdad,2013-01-07,36.392
Iraq,Baghdad,2013-01-08,35.463


# Slicing Time Series

Slicing is particularly useful for time series since it's a common thing to want to filter for data within a date range. Add the date column to the index, then use .loc[] to perform the subsetting. The important thing to remember is to keep your dates in ISO 8601 format, that is, "yyyy-mm-dd" for year-month-day, "yyyy-mm" for year-month, and "yyyy" for year.
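
As a small sketch of partial date strings, assuming a made-up frame with a sorted datetime index (the `toy` data is hypothetical): a year or year-month string stands for the whole period, so the slice includes everything up to the end of the last period named.

```python
import pandas as pd

# Hypothetical toy data with a sorted DatetimeIndex
toy = pd.DataFrame(
    {"temp_c": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.to_datetime(
        ["2009-12-01", "2010-01-01", "2010-07-01", "2011-12-01", "2012-01-01"]
    ),
)

# "yyyy" strings: all of 2010 through all of 2011
print(toy.loc["2010":"2011", "temp_c"].tolist())        # [2.0, 3.0, 4.0]

# "yyyy-mm" strings: July 2010 through the end of December 2011
print(toy.loc["2010-07":"2011-12", "temp_c"].tolist())  # [3.0, 4.0]
```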

Use Boolean conditions, not .isin() or .loc[], and the full date "yyyy-mm-dd", to subset temperatures for rows where the date column is in 2010 and 2011 and print the results.

In [25]:
temperatures_bool = temperatures[(temperatures['date'] >= '2010-01-01') & (temperatures['date'] <= '2011-12-31')]

In [26]:
temperatures_bool

,date,city,country,avg_temp_c
120,2010-01-01,Abidjan,Côte D'Ivoire,28.270
121,2010-01-02,Abidjan,Côte D'Ivoire,29.262
122,2010-01-03,Abidjan,Côte D'Ivoire,29.596
123,2010-01-04,Abidjan,Côte D'Ivoire,29.068
124,2010-01-05,Abidjan,Côte D'Ivoire,28.258
...,...,...,...,...
16474,2011-01-08,Xian,China,23.069
16475,2011-01-09,Xian,China,16.775
16476,2011-01-10,Xian,China,12.587
16477,2011-01-11,Xian,China,7.543


Set the index of temperatures to the date column and sort it.

In [27]:
temperatures_ind = temperatures.set_index('date').sort_index()

In [28]:
temperatures_ind

date,city,country,avg_temp_c
2000-01-01,Abidjan,Côte D'Ivoire,27.293
2000-01-01,Lahore,Pakistan,12.792
2000-01-01,Tangshan,China,-5.406
2000-01-01,Gizeh,Egypt,12.669
2000-01-01,Lakhnau,India,15.152
...,...,...,...
2013-01-09,Nanjing,China,
2013-01-09,New Delhi,India,
2013-01-09,New York,United States,17.408
2013-01-09,Peking,China,


Use .loc[] to subset temperatures_ind for rows in 2010 and 2011.

In [29]:
temperatures_ind.loc['2010':'2011']

date,city,country,avg_temp_c
2010-01-01,Faisalabad,Pakistan,11.810
2010-01-01,Melbourne,Australia,20.016
2010-01-01,Chongqing,China,7.921
2010-01-01,São Paulo,Brazil,23.738
2010-01-01,Guangzhou,China,14.136
...,...,...,...
2011-01-12,Nagoya,Japan,6.476
2011-01-12,Hyderabad,India,23.613
2011-01-12,Cali,Colombia,21.559
2011-01-12,Lima,Peru,18.293


# Subsetting by row/column number

The most common ways to subset rows are the ways we've previously discussed: using a Boolean condition or by index labels. However, it is also occasionally useful to pass row numbers.

This is done using .iloc[], and like .loc[], it can take two arguments to let you subset by rows and columns.
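
One difference is worth a quick sketch on a made-up frame (the `toy` data is hypothetical): `.iloc[]` slices by position and excludes the end, like a Python list, whereas `.loc[]` slices by label and includes the end.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]},
    index=["w", "x", "y", "z"],
)

# Positions 0 and 1; position 2 is excluded
print(toy.iloc[0:2].index.tolist())     # ['w', 'x']

# Labels "w" through "y"; "y" is included
print(toy.loc["w":"y"].index.tolist())  # ['w', 'x', 'y']

# A single cell: third row, second column
print(toy.iloc[2, 1])                   # 30
```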

Use .iloc[] on temperatures to take subsets.

In [30]:
# Get 23rd row, 2nd column (index 22, 1)
temperatures.iloc[22,1]

'Abidjan'

In [31]:
# Use slicing to get the first 5 rows
temperatures.iloc[:5]

,date,city,country,avg_temp_c
0,2000-01-01,Abidjan,Côte D'Ivoire,27.293
1,2000-01-02,Abidjan,Côte D'Ivoire,27.685
2,2000-01-03,Abidjan,Côte D'Ivoire,29.061
3,2000-01-04,Abidjan,Côte D'Ivoire,28.162
4,2000-01-05,Abidjan,Côte D'Ivoire,27.547


In [32]:
# Use slicing to get columns 3 to 4
temperatures.iloc[:,2:4]

,country,avg_temp_c
0,Côte D'Ivoire,27.293
1,Côte D'Ivoire,27.685
2,Côte D'Ivoire,29.061
3,Côte D'Ivoire,28.162
4,Côte D'Ivoire,27.547
...,...,...
16495,China,18.979
16496,China,23.522
16497,China,25.251
16498,China,24.528


In [33]:
# Use slicing in both directions at once
temperatures.iloc[:6,2:4]

,country,avg_temp_c
0,Côte D'Ivoire,27.293
1,Côte D'Ivoire,27.685
2,Côte D'Ivoire,29.061
3,Côte D'Ivoire,28.162
4,Côte D'Ivoire,27.547
5,Côte D'Ivoire,25.812


# Pivot Temperature by City and Year

It's interesting to see how temperatures for each city change over time, but looking at every month results in a big table, which can be tricky to reason about. Instead, let's look at how temperatures change by year.

You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year.
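
A minimal sketch of the `.dt` accessor on a made-up datetime Series (the `dates` values are hypothetical):

```python
import pandas as pd

# Hypothetical dates, for illustration only
dates = pd.Series(pd.to_datetime(["2000-01-15", "2005-06-30", "2013-09-01"]))

# Each component comes back as a Series of integers
print(dates.dt.year.tolist())   # [2000, 2005, 2013]
print(dates.dt.month.tolist())  # [1, 6, 9]
print(dates.dt.day.tolist())    # [15, 30, 1]
```

The accessor only works on datetime data, which is why the date column is converted with `pd.to_datetime()` first.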

Once you have the year column, you can create a pivot table with the data aggregated by city and year.

Add a year column to temperatures, from the year component of the date column.

In [34]:
temperatures['year'] = temperatures['date'].dt.year

In [35]:
temperatures.head()

,date,city,country,avg_temp_c,year
0,2000-01-01,Abidjan,Côte D'Ivoire,27.293,2000
1,2000-01-02,Abidjan,Côte D'Ivoire,27.685,2000
2,2000-01-03,Abidjan,Côte D'Ivoire,29.061,2000
3,2000-01-04,Abidjan,Côte D'Ivoire,28.162,2000
4,2000-01-05,Abidjan,Côte D'Ivoire,27.547,2000


Make a pivot table of the avg_temp_c column, with country and city as rows, and year as columns. Assign to temp_by_country_city_vs_year, and look at the result.

In [36]:
temp_by_country_city_vs_year = temperatures.pivot_table("avg_temp_c",index=['country','city'],columns='year')

In [39]:
temp_by_country_city_vs_year.head()

country,city,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
Afghanistan,Kabul,15.822667,15.847917,15.714583,15.132583,16.128417,14.8475,15.7985,15.518,15.47925,15.093333,15.676,15.812167,14.510333,16.206125
Angola,Luanda,24.410333,24.427083,24.790917,24.867167,24.216167,24.414583,24.138417,24.241583,24.266333,24.325083,24.44025,24.15075,24.240083,24.553875
Australia,Melbourne,14.320083,14.18,14.075833,13.985583,13.742083,14.3785,13.991083,14.991833,14.110583,14.647417,14.231667,14.190917,14.268667,14.7415
Australia,Sydney,17.567417,17.8545,17.733833,17.592333,17.869667,18.028083,17.7495,18.020833,17.321083,18.175833,17.999,17.713333,17.474333,18.08975
Bangladesh,Dhaka,25.90525,25.93125,26.095,25.927417,26.136083,26.193333,26.440417,25.951333,26.0045,26.535583,26.648167,25.80325,26.283583,26.587


# Subsetting Pivot Tables

A pivot table is just a DataFrame with sorted indexes, so the techniques you have learned already can be used to subset them. In particular, the .loc[] + slicing combination is often helpful.
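
The "sorted indexes" claim can be checked on a small made-up frame (the `toy` data is hypothetical, and its rows are deliberately out of order). With its default settings, `pivot_table()` returns sorted row and column labels, so slicing works without an extra `.sort_index()` call:

```python
import pandas as pd

# Hypothetical toy data, deliberately out of order
toy = pd.DataFrame({
    "country": ["Peru", "Chile", "Peru", "Chile", "Cuba", "Cuba"],
    "city": ["Lima", "Santiago", "Lima", "Santiago", "Havana", "Havana"],
    "year": [2001, 2001, 2000, 2000, 2000, 2001],
    "temp_c": [20.0, 15.0, 19.0, 14.0, 25.0, 26.0],
})

pivot = toy.pivot_table(values="temp_c", index=["country", "city"], columns="year")

# Row and column labels come back sorted
print(pivot.index.is_monotonic_increasing)  # True
print(pivot.columns.tolist())               # [2000, 2001]

# Slice rows by country and columns by year; the year labels are integers
subset = pivot.loc["Chile":"Cuba", 2000:2001]
print(subset.values.tolist())  # [[14.0, 15.0], [25.0, 26.0]]
```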

Use .loc[] on temp_by_country_city_vs_year to take subsets.

From Egypt to India.

In [49]:
temp_by_country_city_vs_year.loc["Egypt":"India",:]

country,city,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
Egypt,Alexandria,20.7445,21.454583,21.456167,21.221417,21.064167,21.082333,21.148167,21.50775,21.739,21.6705,22.459583,21.1815,21.552583,21.4385
Egypt,Cairo,21.486167,22.330833,22.414083,22.1705,22.081917,22.0065,22.05,22.361,22.6445,22.625,23.71825,21.986917,22.48425,22.907
Egypt,Gizeh,21.486167,22.330833,22.414083,22.1705,22.081917,22.0065,22.05,22.361,22.6445,22.625,23.71825,21.986917,22.48425,22.907
Ethiopia,Addis Abeba,18.24125,18.296417,18.46975,18.320917,18.29275,18.312833,18.427083,18.142583,18.165,18.765333,18.29825,18.60675,18.448583,19.539
France,Paris,11.739667,11.37125,11.871333,11.9095,11.338833,11.552917,11.7885,11.750833,11.27825,11.464083,10.409833,12.32575,11.219917,11.011625
Germany,Berlin,10.963667,9.69025,10.264417,10.06575,9.822583,9.919083,10.545333,10.883167,10.65775,10.0625,8.606833,10.556417,9.964333,10.1215
India,Ahmadabad,27.436,27.198083,27.719083,27.403833,27.628333,26.828083,27.282833,27.511167,27.0485,28.095833,28.017833,27.290417,27.02725,27.608625
India,Bangalore,25.337917,25.528167,25.755333,25.92475,25.252083,25.4765,25.41825,25.464333,25.352583,25.72575,25.70525,25.362083,26.042333,26.6105
India,Bombay,27.203667,27.243667,27.628667,27.578417,27.31875,27.03575,27.3815,27.634667,27.17775,27.8445,27.765417,27.384917,27.1925,26.713
India,Calcutta,26.491333,26.515167,26.703917,26.561333,26.634333,26.729167,26.98625,26.584583,26.522333,27.15325,27.288833,26.406917,26.935083,27.36925


From Egypt, Cairo to India, Delhi.

In [53]:
temp_by_country_city_vs_year.loc[("Egypt","Cairo"):("India","Delhi"),:]

country,city,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
Egypt,Cairo,21.486167,22.330833,22.414083,22.1705,22.081917,22.0065,22.05,22.361,22.6445,22.625,23.71825,21.986917,22.48425,22.907
Egypt,Gizeh,21.486167,22.330833,22.414083,22.1705,22.081917,22.0065,22.05,22.361,22.6445,22.625,23.71825,21.986917,22.48425,22.907
Ethiopia,Addis Abeba,18.24125,18.296417,18.46975,18.320917,18.29275,18.312833,18.427083,18.142583,18.165,18.765333,18.29825,18.60675,18.448583,19.539
France,Paris,11.739667,11.37125,11.871333,11.9095,11.338833,11.552917,11.7885,11.750833,11.27825,11.464083,10.409833,12.32575,11.219917,11.011625
Germany,Berlin,10.963667,9.69025,10.264417,10.06575,9.822583,9.919083,10.545333,10.883167,10.65775,10.0625,8.606833,10.556417,9.964333,10.1215
India,Ahmadabad,27.436,27.198083,27.719083,27.403833,27.628333,26.828083,27.282833,27.511167,27.0485,28.095833,28.017833,27.290417,27.02725,27.608625
India,Bangalore,25.337917,25.528167,25.755333,25.92475,25.252083,25.4765,25.41825,25.464333,25.352583,25.72575,25.70525,25.362083,26.042333,26.6105
India,Bombay,27.203667,27.243667,27.628667,27.578417,27.31875,27.03575,27.3815,27.634667,27.17775,27.8445,27.765417,27.384917,27.1925,26.713
India,Calcutta,26.491333,26.515167,26.703917,26.561333,26.634333,26.729167,26.98625,26.584583,26.522333,27.15325,27.288833,26.406917,26.935083,27.36925
India,Delhi,26.048333,25.862917,26.634333,25.721083,26.239917,25.716083,26.365917,26.145667,25.675,26.55425,26.52025,25.6295,25.889417,26.70925


From Egypt, Cairo to India, Delhi, and 2005 to 2010.

In [54]:
temp_by_country_city_vs_year.loc[("Egypt","Cairo"):("India","Delhi"),2005:2010]  # year labels are integers, taken from .dt.year

country,city,2005,2006,2007,2008,2009,2010
Egypt,Cairo,22.0065,22.05,22.361,22.6445,22.625,23.71825
Egypt,Gizeh,22.0065,22.05,22.361,22.6445,22.625,23.71825
Ethiopia,Addis Abeba,18.312833,18.427083,18.142583,18.165,18.765333,18.29825
France,Paris,11.552917,11.7885,11.750833,11.27825,11.464083,10.409833
Germany,Berlin,9.919083,10.545333,10.883167,10.65775,10.0625,8.606833
India,Ahmadabad,26.828083,27.282833,27.511167,27.0485,28.095833,28.017833
India,Bangalore,25.4765,25.41825,25.464333,25.352583,25.72575,25.70525
India,Bombay,27.03575,27.3815,27.634667,27.17775,27.8445,27.765417
India,Calcutta,26.729167,26.98625,26.584583,26.522333,27.15325,27.288833
India,Delhi,25.716083,26.365917,26.145667,25.675,26.55425,26.52025


# Calculating on a pivot table

Pivot tables are filled with summary statistics, but they are only a first step to finding something insightful. Often you'll need to perform further calculations on them. A common thing to do is to find the rows or columns where the highest or lowest value occurs.

Recall from Chapter 1 that you can easily subset a Series or DataFrame to find rows of interest using a logical condition inside of square brackets. For example: series[series > value].
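
A short sketch of that pattern on a made-up Series (the `means` values are hypothetical). Comparing against `.max()` returns the matching rows, and `.idxmax()` is an alternative when only the label is needed:

```python
import pandas as pd

# Hypothetical yearly means, for illustration only
means = pd.Series([19.5, 20.3, 19.8], index=[2011, 2012, 2013], name="temp_c")

# Logical condition inside square brackets
warm = means[means > 19.7]
print(warm.index.tolist())     # [2012, 2013]

# Rows where the highest value occurs
hottest = means[means == means.max()]
print(hottest.index.tolist())  # [2012]

# Alternative: the label of the (first) maximum
print(means.idxmax())          # 2012
```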

Calculate the mean temperature for each year, assigning to mean_temp_by_year.

In [56]:
mean_temp_by_year = temp_by_country_city_vs_year.mean()
mean_temp_by_year

year
2000    19.506243
2001    19.679352
2002    19.855685
2003    19.630197
2004    19.672204
2005    19.607239
2006    19.793993
2007    19.854270
2008    19.608778
2009    19.833752
2010    19.911734
2011    19.549197
2012    19.668239
2013    20.312285
dtype: float64

Filter mean_temp_by_year for the year that had the highest mean temperature.

In [68]:
mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()]

year
2013    20.312285
dtype: float64

Calculate the mean temperature for each city (across columns), assigning to mean_temp_by_city.

In [70]:
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")

Filter mean_temp_by_city for the city that had the lowest mean temperature.

In [71]:
mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()]

country  city  
China    Harbin    4.876551
dtype: float64