## Accessing and Selecting Values


In [1]:
%%html
<style>
    table { display: inline-block }
</style>

In [3]:
import numpy as np
import pandas as pd

### Contents

    4.1 Data
    4.2 Overview of Column and Row Accessors
    4.3 .loc and .iloc
    4.4 Final Remark
    4.A Appendix




---
### 4.1 Data

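We use an excerpt from the female literacy dataset. It contains the population (in millions) and the male and female literacy rates (in %) for five Indian states.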

In [10]:
data = {
    'population' : [199.8, 112.4, 104.1, 91.3, 72.6],       # in millions
    'm_literacy' : [79.24, 89.82, 73.39, 82.67, 80.53],     # literacy male in %
    'f_literacy' : [59.26, 75.48, 53.33, 71.16, 60.02]      # literacy female in %
}
index = ['Uttar_Pradesh', 'Maharashtra', 'Bihar', 'West_Bengal', 'Madhya_Pradesh']
df = pd.DataFrame(data, index=index)
df.head()

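|                | population | m_literacy | f_literacy |
|:---------------|:-----------|:-----------|:-----------|
| Uttar_Pradesh  | 199.8      | 79.24      | 59.26      |
| Maharashtra    | 112.4      | 89.82      | 75.48      |
| Bihar          | 104.1      | 73.39      | 53.33      |
| West_Bengal    | 91.3       | 82.67      | 71.16      |
| Madhya_Pradesh | 72.6       | 80.53      | 60.02      |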


---
### 4.2 Overview of Column and Row Accessors

The following accessors return a Series object:

|Name                | Operation                      | Syntax          | Return |
|:-------------------|:-------------------------------|:----------------|:-------|
| Attribute accessor | Select column                  | `df.col`        | Series |
| Indexing operator  | Select column                  | `df[col]`       | Series |
| `.loc` accessor    | Select row by label            | `df.loc[label]` | Series |
| `.iloc` accessor   | Select row by integer location | `df.iloc[loc]`  | Series |

#### 4.2.1 Attribute Access

The accessor `df.column_name` is known as the *dot notation* or *attribute access* method. It accesses a column in a DataFrame by using the column name as an attribute of the DataFrame object. This method can only be used if the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.

In [11]:
# show column names
df.columns

Index(['population', 'm_literacy', 'f_literacy'], dtype='object')

In [12]:
# access column 'population'
df.population

Uttar_Pradesh     199.8
Maharashtra       112.4
Bihar             104.1
West_Bengal        91.3
Madhya_Pradesh     72.6
Name: population, dtype: float64

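The restriction on attribute access can be illustrated with a small sketch. The frame `demo` and its column names are hypothetical and chosen only to show two cases where dot notation is not available: a name containing a space, and a name that coincides with a DataFrame method.

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
demo = pd.DataFrame({
    'population': [199.8, 112.4],
    'female literacy': [59.26, 75.48],   # contains a space: not a valid identifier
    'count': [1, 2],                     # same name as the method DataFrame.count()
})

# Works: 'population' is a valid identifier and clashes with nothing.
pop = demo.population

# 'female literacy' cannot be written as an attribute,
# so the indexing operator is required.
f_lit = demo['female literacy']

# demo.count refers to the method, not to the column.
count_is_method = callable(demo.count)
count_column = demo['count']

print(count_is_method)
print(count_column.tolist())
```

For this reason the indexing operator is the safer default when column names are not under our control.
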
#### 4.2.2 Indexing Operator `[]`

The `df['column_name']` accessor is known as the **indexing operator** for accessing a column in a DataFrame by using the corresponding column name as a key, similar to how we would access a value in a dictionary using its key.

This method is more flexible than attribute access, because it allows us to access columns with names that are not valid Python identifiers. It can also be used to access multiple columns at once by passing a list of column names within the square brackets. In this case, the return type is a DataFrame rather than a Series.

In [13]:
df['m_literacy']

Uttar_Pradesh     79.24
Maharashtra       89.82
Bihar             73.39
West_Bengal       82.67
Madhya_Pradesh    80.53
Name: m_literacy, dtype: float64

In [14]:
# returns data frame object
df[['f_literacy', 'population']]

|                | f_literacy | population |
|:---------------|:-----------|:-----------|
| Uttar_Pradesh  | 59.26      | 199.8      |
| Maharashtra    | 75.48      | 112.4      |
| Bihar          | 53.33      | 104.1      |
| West_Bengal    | 71.16      | 91.3       |
| Madhya_Pradesh | 60.02      | 72.6       |

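The difference in return type between a single column name and a list of column names can be checked directly. The following sketch uses a shortened two-row version of the data. Note that even a list with a single name yields a DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [199.8, 112.4], 'm_literacy': [79.24, 89.82]},
    index=['Uttar_Pradesh', 'Maharashtra'],
)

as_series = df['population']     # single label  -> Series
as_frame = df[['population']]    # list of labels -> DataFrame, even with one entry

print(type(as_series).__name__, as_series.shape)   # Series (2,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (2, 1)
```
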

**Note:** Passing a slice to `[]` is also possible, but a slice selects rows, not columns:

+ Slicing with index-labels (if the DataFrame has them) includes the endpoint.

+ Slicing with integer-locations excludes the endpoint.

In [16]:
# 1. Slicing with index-labels
#df[:'Bihar']

# 2. Slicing with integer-locations
#df[:3]

|               | population | m_literacy | f_literacy |
|:--------------|:-----------|:-----------|:-----------|
| Uttar_Pradesh | 199.8      | 79.24      | 59.26      |
| Maharashtra   | 112.4      | 89.82      | 75.48      |
| Bihar         | 104.1      | 73.39      | 53.33      |

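The different treatment of the endpoint is easy to overlook, so here is a minimal sketch that contrasts the two kinds of slices. It uses only the `population` column of the data. `'Bihar'` sits at integer-location 2.

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [199.8, 112.4, 104.1, 91.3, 72.6]},
    index=['Uttar_Pradesh', 'Maharashtra', 'Bihar', 'West_Bengal', 'Madhya_Pradesh'],
)

by_label = df[:'Bihar']   # label slice: the endpoint 'Bihar' is included
by_position = df[:2]      # position slice: position 2 ('Bihar') is excluded

print(list(by_label.index))      # ['Uttar_Pradesh', 'Maharashtra', 'Bihar']
print(list(by_position.index))   # ['Uttar_Pradesh', 'Maharashtra']
```
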

#### 4.2.3 `.loc` Accessor

The `.loc` accessor provides access to the data in a DataFrame using row and column labels. It supports various indexing and slicing operations.

In its simplest form, the `.loc` accessor can be used to access a single row of data by specifying the row label. For example, `df.loc[row_label]` returns a Series containing the data in the specified row.

In [17]:
df.loc['Bihar']

population    104.10
m_literacy     73.39
f_literacy     53.33
Name: Bihar, dtype: float64

#### 4.2.4 `.iloc` Accessor

The accessor `df.iloc[]` is used for *purely integer-location based indexing* for selection by position in a DataFrame. Like the `.loc` accessor, the `.iloc` accessor also provides versatile access to the data.

In its simplest form, the `.iloc` accessor can be used to access a single row of data by specifying the row position as an integer, counting from 0. For example, `df.iloc[row_index]` returns a Series containing the data in the specified row.

In [18]:
df.iloc[2]

population    104.10
m_literacy     73.39
f_literacy     53.33
Name: Bihar, dtype: float64

---
### 4.3 `.loc` and `.iloc`

Section 4.2 introduced the `.loc` and `.iloc` accessors for retrieving rows. In this section, we will explore both accessors in more depth.

#### 4.3.1 Overview

1. `.loc` : 

    + **label-based selection:** access rows by their index-labels (names)

    + slicing *includes* the endpoint

    + supports boolean indexing

2. `.iloc`: 
    
    + **position-based selection:** access rows by their integer-location (positional)
    
    + slicing *excludes* the endpoint

    + accepts boolean arrays and lists, but not a boolean Series

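The distinction between labels and positions becomes most visible when the index labels are themselves integers. The following sketch uses a hypothetical frame `demo` whose labels 10, 20, 30 differ from the positions 0, 1, 2.

```python
import pandas as pd

# Hypothetical frame: integer labels that do not coincide with positions.
demo = pd.DataFrame({'value': [1.5, 2.5, 3.5]}, index=[10, 20, 30])

by_label = demo.loc[10:20]     # labels 10 and 20, endpoint included
by_position = demo.iloc[0:2]   # positions 0 and 1, endpoint excluded

one_by_label = demo.loc[20, 'value']   # row labelled 20
one_by_position = demo.iloc[1, 0]      # row at position 1, column at position 0

print(by_label.equals(by_position))
print(one_by_label, one_by_position)
```

Both slices return the same two rows, although the numbers written in the slice are entirely different.
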

#### 4.3.2 Accessing Individual Values

The preferred way to access individual values from a DataFrame object `df` is

+ `df.loc['row_name', 'col_name']`

+ `df.iloc[row_index, col_index]`

The appendix lists alternative ways to access individual values.

**Example:** Access the population of Bihar

In [20]:
# with .loc[]
df.loc['Bihar', 'population'] 

104.1

In [21]:
# with .iloc[]
df.iloc[2, 0]

104.1

#### 4.3.3 Selecting Several Rows

In [25]:
# 1. with list of index-labels / index-positions
# df.loc[['Bihar', 'Maharashtra']]
# df.iloc[[2, 1]]

# 2. with slicing
# df.loc[:'Bihar']                                      # with endpoint 'Bihar'
# df.iloc[:3]                                           # without endpoint

|               | population | m_literacy | f_literacy |
|:--------------|:-----------|:-----------|:-----------|
| Uttar_Pradesh | 199.8      | 79.24      | 59.26      |
| Maharashtra   | 112.4      | 89.82      | 75.48      |
| Bihar         | 104.1      | 73.39      | 53.33      |

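When rows are selected with a list, the result follows the order of the list rather than the order of the rows in the DataFrame. A small sketch on a reduced version of the data:

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [199.8, 112.4, 104.1]},
    index=['Uttar_Pradesh', 'Maharashtra', 'Bihar'],
)

picked_by_label = df.loc[['Bihar', 'Maharashtra']]
picked_by_position = df.iloc[[2, 1]]

print(list(picked_by_label.index))                 # ['Bihar', 'Maharashtra']
print(picked_by_label.equals(picked_by_position))  # True
```
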

#### 4.3.4 Selecting a sub-Series

In [26]:
df

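|                | population | m_literacy | f_literacy |
|:---------------|:-----------|:-----------|:-----------|
| Uttar_Pradesh  | 199.8      | 79.24      | 59.26      |
| Maharashtra    | 112.4      | 89.82      | 75.48      |
| Bihar          | 104.1      | 73.39      | 53.33      |
| West_Bengal    | 91.3       | 82.67      | 71.16      |
| Madhya_Pradesh | 72.6       | 80.53      | 60.02      |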


In [28]:
# 1. with indexer for rows
# df.loc[['Bihar', 'Maharashtra'], 'population'] 
# df.iloc[[2, 1], 0]

# 2. with indexer for columns
# df.loc['Bihar', ['m_literacy', 'f_literacy']]
# df.iloc[2, [1, 2]]

# 3. with slicing for rows
# df.loc[:'Bihar', 'population'] 
# df.iloc[:3, 0]

# 4. with slicing for columns
# df.loc['Bihar', 'm_literacy':'f_literacy'] 
# df.iloc[2, 1:]

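A sub-Series arises whenever exactly one of the two indexers is a single label or position. The sketch below shows what the resulting Series looks like in both cases: the fixed label becomes the name of the Series, and the other axis becomes its index.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'population': [199.8, 112.4, 104.1],
        'm_literacy': [79.24, 89.82, 73.39],
        'f_literacy': [59.26, 75.48, 53.33],
    },
    index=['Uttar_Pradesh', 'Maharashtra', 'Bihar'],
)

# several rows, one column: part of a column
col_part = df.loc[['Bihar', 'Maharashtra'], 'population']

# one row, several columns: part of a row
row_part = df.loc['Bihar', ['m_literacy', 'f_literacy']]

print(col_part.name, list(col_part.index))   # population ['Bihar', 'Maharashtra']
print(row_part.name, list(row_part.index))   # Bihar ['m_literacy', 'f_literacy']
```
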
#### 4.3.5 Selecting a sub-DataFrame

In [31]:
# 1. with indexers 
# df.loc[['Bihar', 'Maharashtra'], ['m_literacy', 'f_literacy']] 
# df.iloc[[2, 1], [1, 2]]

# 2. with slicing
# df.loc[:'Bihar', 'm_literacy':'f_literacy']
# df.iloc[:3, 1:]

# 3. with slicing and indexer
# ...

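Slices and lists can also be combined within one call. The following sketch shows one possible combination, a slice for the rows and a list for the columns. It is only an illustration and not necessarily the example intended for the cell above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'population': [199.8, 112.4, 104.1, 91.3, 72.6],
        'm_literacy': [79.24, 89.82, 73.39, 82.67, 80.53],
        'f_literacy': [59.26, 75.48, 53.33, 71.16, 60.02],
    },
    index=['Uttar_Pradesh', 'Maharashtra', 'Bihar', 'West_Bengal', 'Madhya_Pradesh'],
)

# rows by slice, columns by list
sub_by_label = df.loc[:'Bihar', ['population', 'f_literacy']]
sub_by_position = df.iloc[:3, [0, 2]]

print(sub_by_label.shape)                    # (3, 2)
print(sub_by_label.equals(sub_by_position))  # True
```
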
#### 4.3.6 Boolean Indexing

In [32]:
# show data
df.head()

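|                | population | m_literacy | f_literacy |
|:---------------|:-----------|:-----------|:-----------|
| Uttar_Pradesh  | 199.8      | 79.24      | 59.26      |
| Maharashtra    | 112.4      | 89.82      | 75.48      |
| Bihar          | 104.1      | 73.39      | 53.33      |
| West_Bengal    | 91.3       | 82.67      | 71.16      |
| Madhya_Pradesh | 72.6       | 80.53      | 60.02      |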


**Example 1:** Return all rows with at least 80% male literacy. 

In [33]:
# boolean mask
mask = df['m_literacy'] >= 80.0   # mask = df.loc[:, 'm_literacy'] >= 80.0

# show mask
print(mask)
print('type:', type(mask))

Uttar_Pradesh     False
Maharashtra        True
Bihar             False
West_Bengal        True
Madhya_Pradesh     True
Name: m_literacy, dtype: bool
type: <class 'pandas.core.series.Series'>


When using a boolean mask with the `.loc` accessor, the mask selects all rows for which the corresponding mask value is `True`.

In [36]:
#df.loc[mask]

# short form
# df.loc[df['m_literacy'] >= 80.0]

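# .iloc rejects a boolean Series such as mask, but accepts a boolean array
# df.iloc[mask]               # raises an error
# df.iloc[mask.to_numpy()]    # works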

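The boolean Series `mask` cannot be passed to `.iloc` as it is. A possible workaround, sketched here under the assumption that the mask is aligned with the rows of `df`, is to strip the index from the mask with `Series.to_numpy()`:

```python
import pandas as pd

df = pd.DataFrame(
    {'m_literacy': [79.24, 89.82, 73.39, 82.67, 80.53]},
    index=['Uttar_Pradesh', 'Maharashtra', 'Bihar', 'West_Bengal', 'Madhya_Pradesh'],
)

mask = df['m_literacy'] >= 80.0
by_loc = df.loc[mask]

# .iloc refuses a mask that carries an index
try:
    df.iloc[mask]
    series_mask_accepted = True
except (ValueError, NotImplementedError):
    series_mask_accepted = False

# a plain NumPy boolean array is accepted
by_iloc = df.iloc[mask.to_numpy()]

print(series_mask_accepted)
print(by_iloc.equals(by_loc))
```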
**Example 2:** Return all rows with more than 100 million people and female literacy greater than 75 %.

In [None]:
df.loc[(df['population'] > 100) & (df['f_literacy'] > 75)]

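Conditions are combined element-wise with `&` (and), `|` (or) and `~` (not). The parentheses around each comparison are required because `&` binds more tightly than `>`. Python's keyword `and` does not work here, since it would try to reduce each Series to a single truth value and raise a `ValueError`. The sketch below builds the two masks separately (the names `big` and `literate` are chosen for illustration) and shows the result for our data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'population': [199.8, 112.4, 104.1, 91.3, 72.6],
        'f_literacy': [59.26, 75.48, 53.33, 71.16, 60.02],
    },
    index=['Uttar_Pradesh', 'Maharashtra', 'Bihar', 'West_Bengal', 'Madhya_Pradesh'],
)

big = df['population'] > 100       # more than 100 million people
literate = df['f_literacy'] > 75   # female literacy above 75 %

result = df.loc[big & literate]

print(list(result.index))   # ['Maharashtra']
```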
---
### 4.4 Final Remark

Unlike many other Python libraries, Pandas uses the indexing operator `[]` primarily for column selection rather than row selection. This design choice is consistent with the structure of a DataFrame. A DataFrame can be thought of as a dictionary of Series objects representing the columns. Series uses the indexing operator `[]` to access values by their labels. 

To access rows, use the `.loc[]` and `.iloc[]` accessors.

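The dictionary analogy can be made concrete: iterating over a DataFrame, testing membership with `in`, and calling `.items()` all operate on the columns, just as they operate on the keys of a dictionary. A short sketch on a reduced version of the data:

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [199.8, 112.4], 'm_literacy': [79.24, 89.82]},
    index=['Uttar_Pradesh', 'Maharashtra'],
)

print(list(df))              # ['population', 'm_literacy']
print('population' in df)    # True:  column labels behave like keys
print('Maharashtra' in df)   # False: row labels are not keys

# .items() yields (column name, Series) pairs
for name, column in df.items():
    print(name, type(column).__name__)
```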
---
### 4.A Appendix

The appendix shows alternative ways to access data using `.loc[]` and `.iloc[]`. 

---
#### 4.A.1 Example with `.loc`

We demonstrate six different ways to access the population of Bihar using `df.loc['Bihar']`. In general, the preferred way to access a value is `df.loc['row_name', 'col_name']`.

In [None]:
# 1. chaining the indexing operator with column-index
#    (positional [] on a label-indexed Series is deprecated in recent pandas versions)
#df.loc['Bihar'][0]

# 2. chaining the indexing operator with column-label
#df.loc['Bihar']['population']

# 3. chaining the attribute access
#df.loc['Bihar'].population

# 4. chaining .iloc
#df.loc['Bihar'].iloc[0] 

# 5. chaining .loc
# df.loc['Bihar'].loc['population'] 

# 6. preferred access
#df.loc['Bihar', 'population'] 


#### 4.A.2 Example with `.iloc`

As with `.loc`, we demonstrate six different ways to access the population of Bihar using `df.iloc[]`. In general, the preferred way to access a value is `df.iloc[row_index, col_index]`.

In [None]:
# 1. chaining the indexing operator with column-index
#    (positional [] on a label-indexed Series is deprecated in recent pandas versions)
#df.iloc[2][0]

# 2. chaining the indexing operator with column-label
# df.iloc[2]['population']

# 3. chaining the attribute access
# df.iloc[2].population

# 4. chaining .iloc
# df.iloc[2].iloc[0] 

# 5. chaining .loc
# df.iloc[2].loc['population'] 

# 6. preferred access
# df.iloc[2, 0]
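
The chained variants all work in two steps: the first step extracts the row as an intermediate Series, and the second step picks a value from that Series. The preferred forms reach the value in a single call. The following sketch checks that the chained variants (except the positional `[0]` form) agree with the two preferred forms. The variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'population': [199.8, 112.4, 104.1],
        'm_literacy': [79.24, 89.82, 73.39],
    },
    index=['Uttar_Pradesh', 'Maharashtra', 'Bihar'],
)

row = df.loc['Bihar']   # step 1: intermediate Series holding the row

chained = [
    row['population'],       # indexing operator with column-label
    row.population,          # attribute access
    row.iloc[0],             # .iloc on the row
    row.loc['population'],   # .loc on the row
]

direct_by_label = df.loc['Bihar', 'population']   # preferred, label-based
direct_by_position = df.iloc[2, 0]                # preferred, position-based

all_equal = all(value == direct_by_label for value in chained)
print(all_equal, direct_by_label == direct_by_position)
```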