# Chapter III

## Slicing and Indexing DataFrames

### Explicit indexes

In Chapter I, you saw that DataFrames are composed of three parts: a NumPy array for the data, and two indexes to store the row and column details.
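
To make the three parts concrete, here is a minimal sketch on a small hand-built DataFrame. The `toy` frame and its values are illustrative assumptions, not the Dogs.csv data used below.

```python
import pandas as pd

# Hypothetical two-row frame, just for illustration
toy = pd.DataFrame({"name": ["Bella", "Max"], "height_cm": [56, 59]})

data = toy.to_numpy()    # the underlying values as a NumPy array
columns = toy.columns    # Index object holding the column labels
rows = toy.index         # RangeIndex holding the row labels

print(data.shape)        # (2, 2)
print(list(columns))     # ['name', 'height_cm']
print(list(rows))        # [0, 1]
```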

In [156]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 

dogs = pd.read_csv('datasets\\Dogs.csv')

dogs.head()

,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
1,Charlie,Poodle,Black,43,23,2016-09-16
2,Lucy,Poodle,Brown,46,22,2014-08-25
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
4,Max,Labrador,Black,59,29,2017-01-20


In [157]:
# .columns and .index

print(dogs.columns)
# Contains an index object of column names

print("\n------------------------\n")

print(dogs.index)
# Contains an index object of row numbers

Index(['Name ', 'Breed', 'Colour', 'Height(cm)', ' Weight (kg)',
       ' Date of Birth '],
      dtype='object')

------------------------

RangeIndex(start=0, stop=7, step=1)


### Setting a column as the index

You can move a column from the body of the DataFrame to the index. This is called **"setting an index,"** and it uses the **set_index()** method.

In [158]:
dogs_ind = dogs.set_index("Name ")
dogs_ind

Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
Bella,Labrador,Brown,56,25,2013-07-01
Charlie,Poodle,Black,43,23,2016-09-16
Lucy,Poodle,Brown,46,22,2014-08-25
Cooper,Schnauzer,Gray,49,17,2011-12-11
Max,Labrador,Black,59,29,2017-01-20
Stella,Chihuahua,Tan,18,2,2015-04-20
Bernie,St. Bernard,White,77,74,2018-02-27


Notice that the output changed slightly: a quick visual clue that Name is now in the index is that the index values are left-aligned rather than right-aligned.

### Removing an index

To undo what you just did, you can reset the index, which moves it back into the body of the DataFrame as an ordinary column and restores the default row numbers. This is done with **reset_index()**.

In [159]:
dogs_ind.reset_index()


,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
1,Charlie,Poodle,Black,43,23,2016-09-16
2,Lucy,Poodle,Brown,46,22,2014-08-25
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
4,Max,Labrador,Black,59,29,2017-01-20
5,Stella,Chihuahua,Tan,18,2,2015-04-20
6,Bernie,St. Bernard,White,77,74,2018-02-27


### Dropping an index

**reset_index()** has a drop argument that allows you to discard an index. Here, setting drop to True entirely removes the dog names.

In [160]:
dogs_ind.reset_index(drop=True)

,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Labrador,Brown,56,25,2013-07-01
1,Poodle,Black,43,23,2016-09-16
2,Poodle,Brown,46,22,2014-08-25
3,Schnauzer,Gray,49,17,2011-12-11
4,Labrador,Black,59,29,2017-01-20
5,Chihuahua,Tan,18,2,2015-04-20
6,St. Bernard,White,77,74,2018-02-27
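
The round trip between `set_index()` and the two flavours of `reset_index()` can be sketched on a tiny frame. The `toy` data below is a hypothetical miniature of the dogs dataset, chosen only so that the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical miniature of the dogs data
toy = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Poodle"],
})

toy_ind = toy.set_index("name")            # "name" moves from the columns into the index
restored = toy_ind.reset_index()           # "name" comes back as an ordinary column
dropped = toy_ind.reset_index(drop=True)   # "name" is discarded entirely

print(list(toy_ind.columns))    # ['breed']
print(list(restored.columns))   # ['name', 'breed']
print(list(dropped.columns))    # ['breed']
```

None of these calls modifies `toy` in place; each one returns a new DataFrame.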


### Indexes make subsetting simpler

Indexing makes subsetting code cleaner. Consider this example of subsetting for the rows where the dog is called Bella or Stella.

In [161]:
dogs[dogs['Name '].isin(['Bella', 'Stella'])]

# It's a fairly tricky line of code for such a simple task!

,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
5,Stella,Chihuahua,Tan,18,2,2015-04-20


In [162]:
# This is the equivalent and easier method.

dogs_ind.loc[['Bella', 'Stella']]

Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
Bella,Labrador,Brown,56,25,2013-07-01
Stella,Chihuahua,Tan,18,2,2015-04-20
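
As a quick check that the two approaches select the same rows, here is a sketch on a hypothetical three-dog frame (the `toy` data and the `wanted` list are assumptions made for illustration).

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Bella", "Charlie", "Stella"],
    "breed": ["Labrador", "Poodle", "Chihuahua"],
})
wanted = ["Bella", "Stella"]

# Column-based subsetting: build a Boolean mask, then filter
by_mask = toy[toy["name"].isin(wanted)]

# Index-based subsetting: look the labels up directly
by_loc = toy.set_index("name").loc[wanted]

# Same rows either way; only the location of "name" differs
print(by_mask["breed"].tolist())   # ['Labrador', 'Chihuahua']
print(by_loc["breed"].tolist())    # ['Labrador', 'Chihuahua']
```

One difference worth knowing: `.loc` returns the rows in the order of the list and, in recent pandas versions, raises a `KeyError` if a label is missing, while `isin` keeps the original row order and silently skips labels that aren't there.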


### Index values don't need to be unique

Here, there are two Poodles and two Labradors in the index.

In [163]:
dogs_ind2 = dogs.set_index('Breed')
dogs_ind2

Breed,Name,Colour,Height(cm),Weight (kg),Date of Birth
Labrador,Bella,Brown,56,25,2013-07-01
Poodle,Charlie,Black,43,23,2016-09-16
Poodle,Lucy,Brown,46,22,2014-08-25
Schnauzer,Cooper,Gray,49,17,2011-12-11
Labrador,Max,Black,59,29,2017-01-20
Chihuahua,Stella,Tan,18,2,2015-04-20
St. Bernard,Bernie,White,77,74,2018-02-27


In [164]:
dogs_ind2.loc['Labrador']

# If you subset on 'Labrador' using loc, all the Labrador data is returned.

Breed,Name,Colour,Height(cm),Weight (kg),Date of Birth
Labrador,Bella,Brown,56,25,2013-07-01
Labrador,Max,Black,59,29,2017-01-20
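
The same behaviour can be reproduced on a small hand-built frame. This is only a sketch; the `toy` data is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Labrador"],
    "name": ["Bella", "Charlie", "Max"],
}).set_index("breed")

# The index is allowed to contain repeated labels
print(toy.index.is_unique)       # False: "Labrador" appears twice

# .loc returns every row whose label matches
labs = toy.loc["Labrador"]
print(labs["name"].tolist())     # ['Bella', 'Max']
```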


### Multi-level indexes (a.k.a. hierarchical indexes)

You can include multiple columns in the index by passing a list of column names to **set_index()**.

In [165]:
dogs_ind3 = dogs.set_index(['Breed', 'Colour'])
dogs_ind3

Breed,Colour,Name,Height(cm),Weight (kg),Date of Birth
Labrador,Brown,Bella,56,25,2013-07-01
Poodle,Black,Charlie,43,23,2016-09-16
Poodle,Brown,Lucy,46,22,2014-08-25
Schnauzer,Gray,Cooper,49,17,2011-12-11
Labrador,Black,Max,59,29,2017-01-20
Chihuahua,Tan,Stella,18,2,2015-04-20
St. Bernard,White,Bernie,77,74,2018-02-27


### Subset the outer level with a list

To take a subset of rows at the outer level of the index, you pass a list of index values to loc. Here, the list contains Labrador and Poodle, and the resulting subset contains all dogs from both breeds.

In [166]:
dogs_ind3.loc[['Labrador', 'Poodle']]

Breed,Colour,Name,Height(cm),Weight (kg),Date of Birth
Labrador,Brown,Bella,56,25,2013-07-01
Labrador,Black,Max,59,29,2017-01-20
Poodle,Black,Charlie,43,23,2016-09-16
Poodle,Brown,Lucy,46,22,2014-08-25


### Subset inner levels with a list of Tuples

To subset on inner levels, you need to pass a list of tuples. Here, the first tuple specifies Labrador at the outer level and Brown at the inner level. A row is returned only if it matches every condition in one of the tuples.

In [167]:
dogs_ind3.loc[[('Labrador', 'Brown'), ('Poodle', 'Black')]]

Breed,Colour,Name,Height(cm),Weight (kg),Date of Birth
Labrador,Brown,Bella,56,25,2013-07-01
Poodle,Black,Charlie,43,23,2016-09-16
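
To see the difference between a list of labels and a list of tuples side by side, here is a sketch on a hypothetical four-dog frame with a two-level index.

```python
import pandas as pd

toy = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Poodle", "Labrador"],
    "colour": ["Brown", "Black", "Brown", "Black"],
    "name": ["Bella", "Charlie", "Lucy", "Max"],
}).set_index(["breed", "colour"])

# A list of plain labels is matched against the outer level only
outer = toy.loc[["Poodle"]]

# A list of tuples must match every level of the index
inner = toy.loc[[("Labrador", "Brown"), ("Poodle", "Black")]]

print(outer["name"].tolist())   # ['Charlie', 'Lucy']
print(inner["name"].tolist())   # ['Bella', 'Charlie']
```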


### Sorting by index values

Just as we sort the rows of a DataFrame using **sort_values()**, we can also sort by index values using **sort_index()**.  
By default, it sorts all index levels from outer to inner, in ascending order.

In [168]:
dogs_ind3.sort_index()

Breed,Colour,Name,Height(cm),Weight (kg),Date of Birth
Chihuahua,Tan,Stella,18,2,2015-04-20
Labrador,Black,Max,59,29,2017-01-20
Labrador,Brown,Bella,56,25,2013-07-01
Poodle,Black,Charlie,43,23,2016-09-16
Poodle,Brown,Lucy,46,22,2014-08-25
Schnauzer,Gray,Cooper,49,17,2011-12-11
St. Bernard,White,Bernie,77,74,2018-02-27


In [169]:
# You can control the sorting by passing lists to the level and ascending arguments.

dogs_ind3.sort_index(level=['Colour', 'Breed'], ascending=[True, False])

Breed,Colour,Name,Height(cm),Weight (kg),Date of Birth
Poodle,Black,Charlie,43,23,2016-09-16
Labrador,Black,Max,59,29,2017-01-20
Poodle,Brown,Lucy,46,22,2014-08-25
Labrador,Brown,Bella,56,25,2013-07-01
Schnauzer,Gray,Cooper,49,17,2011-12-11
Chihuahua,Tan,Stella,18,2,2015-04-20
St. Bernard,White,Bernie,77,74,2018-02-27
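
The effect of the `level` and `ascending` arguments is easier to follow on a frame small enough to sort by hand. The `toy` data below is a hypothetical example.

```python
import pandas as pd

toy = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Poodle", "Labrador"],
    "colour": ["Brown", "Black", "Brown", "Black"],
    "name": ["Bella", "Charlie", "Lucy", "Max"],
}).set_index(["breed", "colour"])

# Default: every level, outer to inner, ascending
default_sorted = toy.sort_index()

# Colour ascending first, then breed descending within each colour
custom_sorted = toy.sort_index(level=["colour", "breed"], ascending=[True, False])

print(default_sorted["name"].tolist())   # ['Max', 'Bella', 'Charlie', 'Lucy']
print(custom_sorted["name"].tolist())    # ['Charlie', 'Max', 'Lucy', 'Bella']
```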


### Now you have two problems

* Index values are just data

* Indexes violate "tidy data" principles

* You need to learn two syntaxes

Indexes are controversial. Although they simplify subsetting code, there are some downsides. Index values are just data. Storing data in multiple forms makes it harder to think about. There is a concept called **"tidy data,"** where data is stored in tabular form - like a DataFrame.  
Each row contains a single observation, and each variable is stored in its own column. Indexes violate the last rule since index values don't get their own column.  
In pandas, the syntax for working with indexes is different from the syntax for working with columns. By using two syntaxes, your code is more complicated, which can result in more bugs. If you decide you don't want to use indexes, that's perfectly reasonable. However, it's useful to know how they work for cases when you need to read other people's code.

In [170]:
### Temperature Data Set
### In this chapter, we'll work with a monthly time series of air temperatures in cities around the world.

temperatures = pd.read_csv('datasets\\temperatures.csv')
# Look at temperatures
temperatures.head()

,Unnamed: 0,date,city,country,avg_temp_c
0,0,2000-01-01,Abidjan,Côte D'Ivoire,27.293
1,1,2000-02-01,Abidjan,Côte D'Ivoire,27.685
2,2,2000-03-01,Abidjan,Côte D'Ivoire,29.061
3,3,2000-04-01,Abidjan,Côte D'Ivoire,28.162
4,4,2000-05-01,Abidjan,Côte D'Ivoire,27.547


In [171]:
# Index temperatures by city
temperatures_ind = temperatures.set_index('city')
# Look at temperatures_ind
temperatures_ind

city,Unnamed: 0,date,country,avg_temp_c
Abidjan,0,2000-01-01,Côte D'Ivoire,27.293
Abidjan,1,2000-02-01,Côte D'Ivoire,27.685
Abidjan,2,2000-03-01,Côte D'Ivoire,29.061
Abidjan,3,2000-04-01,Côte D'Ivoire,28.162
Abidjan,4,2000-05-01,Côte D'Ivoire,27.547
...,...,...,...,...
Xian,16495,2013-05-01,China,18.979
Xian,16496,2013-06-01,China,23.522
Xian,16497,2013-07-01,China,25.251
Xian,16498,2013-08-01,China,24.528


In [172]:
# Reset the index, keeping its contents
temperatures_ind.reset_index()

,city,Unnamed: 0,date,country,avg_temp_c
0,Abidjan,0,2000-01-01,Côte D'Ivoire,27.293
1,Abidjan,1,2000-02-01,Côte D'Ivoire,27.685
2,Abidjan,2,2000-03-01,Côte D'Ivoire,29.061
3,Abidjan,3,2000-04-01,Côte D'Ivoire,28.162
4,Abidjan,4,2000-05-01,Côte D'Ivoire,27.547
...,...,...,...,...,...
16495,Xian,16495,2013-05-01,China,18.979
16496,Xian,16496,2013-06-01,China,23.522
16497,Xian,16497,2013-07-01,China,25.251
16498,Xian,16498,2013-08-01,China,24.528


In [173]:
# Reset the index, dropping its contents
temperatures_ind.reset_index(drop=True)

,Unnamed: 0,date,country,avg_temp_c
0,0,2000-01-01,Côte D'Ivoire,27.293
1,1,2000-02-01,Côte D'Ivoire,27.685
2,2,2000-03-01,Côte D'Ivoire,29.061
3,3,2000-04-01,Côte D'Ivoire,28.162
4,4,2000-05-01,Côte D'Ivoire,27.547
...,...,...,...,...
16495,16495,2013-05-01,China,18.979
16496,16496,2013-06-01,China,23.522
16497,16497,2013-07-01,China,25.251
16498,16498,2013-08-01,China,24.528


In [174]:
# Make a list of cities to subset on
cities = ['Moscow', 'Saint Petersburg']
# Subset temperatures using square brackets
print(temperatures[temperatures['city'].isin(cities)])
# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])

       Unnamed: 0        date              city country  avg_temp_c
10725       10725  2000-01-01            Moscow  Russia      -7.313
10726       10726  2000-02-01            Moscow  Russia      -3.551
10727       10727  2000-03-01            Moscow  Russia      -1.661
10728       10728  2000-04-01            Moscow  Russia      10.096
10729       10729  2000-05-01            Moscow  Russia      10.357
...           ...         ...               ...     ...         ...
13360       13360  2013-05-01  Saint Petersburg  Russia      12.355
13361       13361  2013-06-01  Saint Petersburg  Russia      17.185
13362       13362  2013-07-01  Saint Petersburg  Russia      17.234
13363       13363  2013-08-01  Saint Petersburg  Russia      17.153
13364       13364  2013-09-01  Saint Petersburg  Russia         NaN

[330 rows x 5 columns]
                  Unnamed: 0        date country  avg_temp_c
city                                                        
Moscow                 10725  2000-01-

In [175]:
# Index temperatures by country & city
temperatures_ind = temperatures.set_index(['country', 'city'])
# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep =  [('Brazil', 'Rio De Janeiro'), ('Pakistan', 'Lahore')]
# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])

                         Unnamed: 0        date  avg_temp_c
country  city                                              
Brazil   Rio De Janeiro       12540  2000-01-01      25.974
         Rio De Janeiro       12541  2000-02-01      26.699
         Rio De Janeiro       12542  2000-03-01      26.270
         Rio De Janeiro       12543  2000-04-01      25.750
         Rio De Janeiro       12544  2000-05-01      24.356
...                             ...         ...         ...
Pakistan Lahore                8575  2013-05-01      33.457
         Lahore                8576  2013-06-01      34.456
         Lahore                8577  2013-07-01      33.279
         Lahore                8578  2013-08-01      31.511
         Lahore                8579  2013-09-01         NaN

[330 rows x 3 columns]


In [176]:
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())
# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level=['city']))
# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=['country', 'city'], ascending=[True,False]))

                    Unnamed: 0        date  avg_temp_c
country     city                                      
Afghanistan Kabul         7260  2000-01-01       3.326
            Kabul         7261  2000-02-01       3.454
            Kabul         7262  2000-03-01       9.612
            Kabul         7263  2000-04-01      17.925
            Kabul         7264  2000-05-01      24.658
...                        ...         ...         ...
Zimbabwe    Harare        5605  2013-05-01      18.298
            Harare        5606  2013-06-01      17.020
            Harare        5607  2013-07-01      16.299
            Harare        5608  2013-08-01      19.232
            Harare        5609  2013-09-01         NaN

[16500 rows x 3 columns]
                       Unnamed: 0        date  avg_temp_c
country       city                                       
Côte D'Ivoire Abidjan           0  2000-01-01      27.293
              Abidjan           1  2000-02-01      27.685
              Abidjan      

## Slicing and subsetting with .loc and .iloc

### Slicing lists

To slice a list, you pass the first and last positions, separated by a colon, in square brackets. The first position is included and the last is excluded.  
Here are the dog breeds, this time as a list:

In [177]:
breeds = ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua", "St. Bernard"]   

print("1st line :" + str(breeds[2:5]))

print("2nd line :" + str(breeds[:3]))

print("3rd line :" + str(breeds[:]))

1st line :['Chow Chow', 'Schnauzer', 'Labrador']
2nd line :['Labrador', 'Poodle', 'Chow Chow']
3rd line :['Labrador', 'Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard']


You can also slice DataFrames, but first, you need to sort the index. Here, the dogs dataset has been given a multi-level index of breed and colour; then, the index is sorted with **sort_index()**.

In [178]:
dogs_srt = dogs.set_index(["Breed", "Colour"]).sort_index()

dogs_srt

Breed,Colour,Name,Height(cm),Weight (kg),Date of Birth
Chihuahua,Tan,Stella,18,2,2015-04-20
Labrador,Black,Max,59,29,2017-01-20
Labrador,Brown,Bella,56,25,2013-07-01
Poodle,Black,Charlie,43,23,2016-09-16
Poodle,Brown,Lucy,46,22,2014-08-25
Schnauzer,Gray,Cooper,49,17,2011-12-11
St. Bernard,White,Bernie,77,74,2018-02-27


### Slicing the outer index level

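To slice rows at the outer level of an index, you call .loc, passing the first and last values separated by a colon.  
There are two differences compared to slicing lists. First, rather than specifying row numbers, you specify index values. Second, the final value is included. Because the index is sorted, the boundary values don't have to be present in it: Chow Chow isn't a breed in this dataset, so the slice simply starts at the next breed, Labrador.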

In [179]:
dogs_srt.loc["Chow Chow":"Poodle"]

Breed,Colour,Name,Height(cm),Weight (kg),Date of Birth
Labrador,Black,Max,59,29,2017-01-20
Labrador,Brown,Bella,56,25,2013-07-01
Poodle,Black,Charlie,43,23,2016-09-16
Poodle,Brown,Lucy,46,22,2014-08-25


### Slicing the inner index

The same technique doesn't work on inner index levels. Here, trying to slice from Tan to Gray returns an empty DataFrame instead of the six dogs we wanted.  
It's important to understand that pandas doesn't throw an error to let you know that there is a problem.

In [180]:
dogs_srt.loc["Tan":"Gray"]

Breed,Colour,Name,Height(cm),Weight (kg),Date of Birth


The correct approach to slicing at inner index levels is to pass the first and last positions as tuples. Here, the first element to include is a tuple of Labrador and Brown.

In [181]:
dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Gray")]

Breed,Colour,Name,Height(cm),Weight (kg),Date of Birth
Labrador,Brown,Bella,56,25,2013-07-01
Poodle,Black,Charlie,43,23,2016-09-16
Poodle,Brown,Lucy,46,22,2014-08-25
Schnauzer,Gray,Cooper,49,17,2011-12-11


### Slicing Columns

Since DataFrames are two-dimensional objects, you can also slice columns. You do this by passing two arguments to loc.  
The simplest case involves subsetting columns but keeping all rows. To do this, pass a colon as the first argument to loc.  
The second argument takes column names as the first and last positions to slice on.

In [182]:
dogs_srt.loc[:, "Name ":"Height(cm)"]

Breed,Colour,Name,Height(cm)
Chihuahua,Tan,Stella,18
Labrador,Black,Max,59
Labrador,Brown,Bella,56
Poodle,Black,Charlie,43
Poodle,Brown,Lucy,46
Schnauzer,Gray,Cooper,49
St. Bernard,White,Bernie,77


You can slice on rows and columns at the same time: simply pass the appropriate slice to each argument. Here, you see the previous two slices being performed in the same line of code.

In [183]:
dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Gray"), "Name ":"Height(cm)"]

Breed,Colour,Name,Height(cm)
Labrador,Brown,Bella,56
Poodle,Black,Charlie,43
Poodle,Brown,Lucy,46
Schnauzer,Gray,Cooper,49


### Slicing by range of Dates

An important use case of slicing is to subset DataFrames by a range of dates. To demonstrate this, let's set the **Date of Birth** column as the index and sort by this index.

In [184]:
dogs = dogs.set_index(" Date of Birth ").sort_index()
print(dogs)

                   Name         Breed Colour  Height(cm)   Weight (kg)
 Date of Birth                                                        
2011-12-11        Cooper    Schnauzer   Gray          49            17
2013-07-01         Bella     Labrador  Brown          56            25
2014-08-25          Lucy       Poodle  Brown          46            22
2015-04-20        Stella    Chihuahua    Tan          18             2
2016-09-16       Charlie       Poodle  Black          43            23
2017-01-20           Max     Labrador  Black          59            29
2018-02-27        Bernie  St. Bernard  White          77            74


In [185]:
# Get dogs with date_of_birth between 2014-08-25 and 2016-09-16
dogs.loc["2014-08-25":"2016-09-16"]

Date of Birth,Name,Breed,Colour,Height(cm),Weight (kg)
2014-08-25,Lucy,Poodle,Brown,46,22
2015-04-20,Stella,Chihuahua,Tan,18,2
2016-09-16,Charlie,Poodle,Black,43,23


In [186]:
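# Slicing by partial dates
# Here, the first and last positions are only specified as 2014 and 2016, with no month or day parts.
# With a datetime index, pandas would slice from the start of 2014 to the end of 2016.
# This index holds plain strings (read_csv was not asked to parse dates), so the labels are compared
# as text and "2016-09-16", which sorts after "2016", is left out of the result below.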

dogs.loc["2014":"2016"]

Date of Birth,Name,Breed,Colour,Height(cm),Weight (kg)
2014-08-25,Lucy,Poodle,Brown,46,22
2015-04-20,Stella,Chihuahua,Tan,18,2


### Subsetting by row/column number

We can also slice DataFrames by row or column number using the **.iloc[]** indexer.  
There are two arguments: one for rows and one for columns. Unlike .loc, the end position of each slice is excluded.

In [187]:
print(dogs.iloc[2:5, 1:4])

                     Breed Colour  Height(cm)
 Date of Birth                               
2014-08-25          Poodle  Brown          46
2015-04-20       Chihuahua    Tan          18
2016-09-16          Poodle  Black          43


In [188]:
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()
# Subset rows from Pakistan to Russia
print(temperatures_srt.loc["Pakistan":"Russia"])
# Try to subset rows from Lahore to Moscow
print(temperatures_srt.loc["Lahore":"Moscow"])
# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")])

                           Unnamed: 0        date  avg_temp_c
country  city                                                
Pakistan Faisalabad              4785  2000-01-01      12.792
         Faisalabad              4786  2000-02-01      14.339
         Faisalabad              4787  2000-03-01      20.309
         Faisalabad              4788  2000-04-01      29.072
         Faisalabad              4789  2000-05-01      34.845
...                               ...         ...         ...
Russia   Saint Petersburg       13360  2013-05-01      12.355
         Saint Petersburg       13361  2013-06-01      17.185
         Saint Petersburg       13362  2013-07-01      17.234
         Saint Petersburg       13363  2013-08-01      17.153
         Saint Petersburg       13364  2013-09-01         NaN

[1155 rows x 3 columns]
                    Unnamed: 0        date  avg_temp_c
country city                                          
Mexico  Mexico           10230  2000-01-01      12.694
    

In [189]:
# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")])
# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:, "date":"avg_temp_c"])
# Subset in both directions at once
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"])

                   Unnamed: 0        date  avg_temp_c
country city                                         
India   Hyderabad        5940  2000-01-01      23.779
        Hyderabad        5941  2000-02-01      25.826
        Hyderabad        5942  2000-03-01      28.821
        Hyderabad        5943  2000-04-01      32.698
        Hyderabad        5944  2000-05-01      32.438
...                       ...         ...         ...
Iraq    Baghdad          1150  2013-05-01      28.673
        Baghdad          1151  2013-06-01      33.803
        Baghdad          1152  2013-07-01      36.392
        Baghdad          1153  2013-08-01      35.463
        Baghdad          1154  2013-09-01         NaN

[2145 rows x 3 columns]
                          date  avg_temp_c
country     city                          
Afghanistan Kabul   2000-01-01       3.326
            Kabul   2000-02-01       3.454
            Kabul   2000-03-01       9.612
            Kabul   2000-04-01      17.925
            Kab

In [190]:
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
temperatures_bool = temperatures[(temperatures["date"] >= "2010-01-01") & (temperatures["date"] <= "2011-12-31")]
print(temperatures_bool)
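# Parse date as datetimes (read_csv loaded it as text), set it as the index, and sort the index
temperatures_ind = temperatures.assign(date=pd.to_datetime(temperatures["date"])).set_index("date").sort_index()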
# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc["2010":"2011"])
# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc["2010-08":"2011-02"])

       Unnamed: 0        date     city        country  avg_temp_c
120           120  2010-01-01  Abidjan  Côte D'Ivoire      28.270
121           121  2010-02-01  Abidjan  Côte D'Ivoire      29.262
122           122  2010-03-01  Abidjan  Côte D'Ivoire      29.596
123           123  2010-04-01  Abidjan  Côte D'Ivoire      29.068
124           124  2010-05-01  Abidjan  Côte D'Ivoire      28.258
...           ...         ...      ...            ...         ...
16474       16474  2011-08-01     Xian          China      23.069
16475       16475  2011-09-01     Xian          China      16.775
16476       16476  2011-10-01     Xian          China      12.587
16477       16477  2011-11-01     Xian          China       7.543
16478       16478  2011-12-01     Xian          China      -0.490

[2400 rows x 5 columns]
            Unnamed: 0        city    country  avg_temp_c
date                                                     
2010-01-01        4905  Faisalabad   Pakistan      11.810
2010-01-0

In [191]:
# Get 23rd row, 2nd column (index 22, 1)
print(temperatures.iloc[22, 1])
# Use slicing to get the first 5 rows
print(temperatures.iloc[:5])
# Use slicing to get columns 3 to 4
print(temperatures.iloc[:, 2:4])
# Use slicing in both directions at once
print(temperatures.iloc[:5, 2:4])

2001-11-01
   Unnamed: 0        date     city        country  avg_temp_c
0           0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1           1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2           2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3           3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4           4  2000-05-01  Abidjan  Côte D'Ivoire      27.547
          city        country
0      Abidjan  Côte D'Ivoire
1      Abidjan  Côte D'Ivoire
2      Abidjan  Côte D'Ivoire
3      Abidjan  Côte D'Ivoire
4      Abidjan  Côte D'Ivoire
...        ...            ...
16495     Xian          China
16496     Xian          China
16497     Xian          China
16498     Xian          China
16499     Xian          China

[16500 rows x 2 columns]
      city        country
0  Abidjan  Côte D'Ivoire
1  Abidjan  Côte D'Ivoire
2  Abidjan  Côte D'Ivoire
3  Abidjan  Côte D'Ivoire
4  Abidjan  Côte D'Ivoire


## Working with Pivot Tables

### Pivoting the dog pack

Recall that you create a pivot table by calling **.pivot_table()**:  
* The first argument is the name of the column containing the values to aggregate.  
* The index argument lists the columns to group by and display in rows.  
* The columns argument lists the columns to group by and display in columns.  

```python
dogs_height_by_breed_vs_color = dog_pack.pivot_table("height_cm", index="breed", columns="color")
```

Pivot tables are just DataFrames with sorted indexes. That means that everything we've learned so far in this chapter can be used on them. In particular, the loc and slicing combination is ideal for subsetting pivot tables, like so:  

```python 
dogs_height_by_breed_vs_color.loc["Chow Chow":"Poodle"]
```
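
Since `dog_pack` is not defined in this chapter, here is a self-contained sketch that builds a hypothetical stand-in, pivots it, and slices the result. The data values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for dog_pack
dog_pack = pd.DataFrame({
    "breed": ["Poodle", "Labrador", "Labrador", "Chihuahua", "Poodle"],
    "color": ["Black", "Brown", "Black", "Tan", "Black"],
    "height_cm": [43, 56, 59, 18, 47],
})

# Mean height (the default aggregation) by breed vs color
heights = dog_pack.pivot_table("height_cm", index="breed", columns="color")

# The row index comes back sorted, so label slicing works directly
print(heights.index.tolist())                              # ['Chihuahua', 'Labrador', 'Poodle']
print(heights.loc["Chow Chow":"Poodle"].index.tolist())    # ['Labrador', 'Poodle']
print(heights.loc["Poodle", "Black"])                      # 45.0
```

Breed and color combinations that don't occur in the data appear as NaN in the pivot table.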

### The Axis argument

Methods for calculating summary statistics on a DataFrame, such as mean, have an axis argument. The default value is **"index"**, which means **"calculate the statistic across rows."**  
Here, the mean is calculated for each color, that is, **"across the breeds."** The behavior is the same as if you hadn't specified the axis argument.

```python
dogs_height_by_breed_vs_color.mean(axis="index")
```

To calculate a summary statistic for each row, that is, **"across the columns,"** you set axis to **"columns"**.  
Here, the mean height is calculated for each breed, that is, **"across the colors."**
```python 
dogs_height_by_breed_vs_color.mean(axis="columns")
```
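
The two axis settings can be compared on a frame small enough to check by hand. The `heights` frame below is a hypothetical pivot-table-like example with breeds in rows and colors in columns.

```python
import pandas as pd

# Hypothetical pivot-table-like frame: breeds in rows, colors in columns
heights = pd.DataFrame(
    {"Black": [59.0, 43.0], "Brown": [56.0, 46.0]},
    index=["Labrador", "Poodle"],
)

per_color = heights.mean(axis="index")     # one value per column
per_breed = heights.mean(axis="columns")   # one value per row

print(per_color.to_dict())   # {'Black': 51.0, 'Brown': 51.0}
print(per_breed.to_dict())   # {'Labrador': 57.5, 'Poodle': 44.5}
```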

For most DataFrames, setting axis to "columns" doesn't make sense, since the columns hold different data types. Pivot tables are a special case since every column contains the same data type.

In [None]:
# Add a year column to temperatures (date was read as text, so parse it first)
temperatures["year"] = pd.to_datetime(temperatures["date"]).dt.year
# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table("avg_temp_c", index=["country", "city"], columns="year")
# See the result
print(temp_by_country_city_vs_year)