![license_header_logo](../../../images/license_header_logo.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
<br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.

# Notebook Content

* [Native Accessors](#Native-accessors)
* [Indexing in Pandas](#Indexing-in-pandas)
    * [Index-based Selection](#Index-based-selection)
    * [Label-based Selection](#Label-based-selection)
* [Manipulating the Index](#Manipulating-the-index)
* [Conditional Selection](#Conditional-selection)
* [Assigning Data](#Assigning-data)

In [1]:
import pandas as pd
water_potability = pd.read_csv("../../../resources/day_01/water_potability.csv")

# Native accessors

Native Python objects provide good ways of indexing data. Pandas carries all of these over, which makes it easy to get started.

Consider this DataFrame:

In [2]:
water_potability

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
0,,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.990970,2.963135,0
1,3.716080,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656,0
2,8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934,0
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
4,9.092223,181.101509,17978.986339,6.546600,310.135738,398.410813,11.558279,31.997993,4.075075,0
...,...,...,...,...,...,...,...,...,...,...
3271,4.668102,193.681735,47580.991603,7.166639,359.948574,526.424171,13.894419,66.687695,4.435821,1
3272,7.808856,193.553212,17329.802160,8.061362,,392.449580,19.903225,,2.798243,1
3273,9.419510,175.762646,33155.578218,7.350233,,432.044783,11.039070,69.845400,3.298875,1
3274,5.126763,230.603758,11983.869376,6.303357,,402.883113,11.168946,77.488213,4.708658,1


In Python, we can access the property of an object by accessing it as an attribute. A `book` object, for example, might have a `title` property, which we can access by calling `book.title`. Columns in a pandas DataFrame work in much the same way. 

Hence, to access the `ph` column of `water_potability`, we can use:

In [3]:
water_potability.ph

0            NaN
1       3.716080
2       8.099124
3       8.316766
4       9.092223
          ...   
3271    4.668102
3272    7.808856
3273    9.419510
3274    5.126763
3275    7.874671
Name: ph, Length: 3276, dtype: float64

If we have a Python dictionary, we can access its values using the indexing (`[]`) operator. We can do the same with columns in a DataFrame:

In [4]:
water_potability['ph']

0            NaN
1       3.716080
2       8.099124
3       8.316766
4       9.092223
          ...   
3271    4.668102
3272    7.808856
3273    9.419510
3274    5.126763
3275    7.874671
Name: ph, Length: 3276, dtype: float64

These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator `[]` does have the advantage that it can handle column names with reserved characters in them (e.g. if we had a `ph values` column, `water_potability.ph values` wouldn't work).
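As a minimal sketch of that difference, consider a small made-up DataFrame. The `toy` frame, its `ph values` column, and all of its numbers are hypothetical and not part of the water potability dataset:

```python
import pandas as pd

# Hypothetical toy data; the "ph values" column name contains a space on purpose.
toy = pd.DataFrame({"ph": [7.1, 6.4, 8.0], "ph values": [1.0, 2.0, 3.0]})

# For a simple column name, both accessors return the same Series.
same_series = toy.ph.equals(toy["ph"])
print(same_series)  # True

# Only the indexing operator can be written for a name with a space in it;
# `toy.ph values` would be a Python syntax error.
spaced = toy["ph values"]
print(spaced.tolist())  # [1.0, 2.0, 3.0]
```

Because the bracket form works for every column name, it is the safer default when column names come from an external file.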

Doesn't a pandas Series look kind of like a fancy dictionary? It pretty much is, so it's no surprise that, to drill down to a single specific value, we need only use the indexing operator `[]` once more:

In [5]:
water_potability['ph'][10]

7.360640105838258
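To make the dictionary comparison concrete, here is a small sketch on a hypothetical three-element Series (the values are made up). The index labels play the role of dictionary keys:

```python
import pandas as pd

# Hypothetical Series with the default integer index 0, 1, 2.
ph = pd.Series([7.1, 6.4, 8.0], name="ph")

# keys() returns the index, much like the keys of a dictionary.
print(list(ph.keys()))  # [0, 1, 2]

# Indexing by a key returns the matching value.
print(ph[1])  # 6.4

# Membership tests look at the keys (index labels), not the values.
print(1 in ph)  # True
```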

# Indexing in pandas

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. This makes them easy for a novice to pick up and use. However, pandas has its own accessor operators, `loc` and `iloc`. For more advanced operations, these are the ones you're supposed to be using.

### Index-based selection

Pandas indexing works in one of two paradigms. The first is **index-based selection**: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [6]:
water_potability.iloc[0]

ph                          NaN
Hardness             204.890455
Solids             20791.318981
Chloramines            7.300212
Sulfate              368.516441
Conductivity         564.308654
Organic_carbon        10.379783
Trihalomethanes       86.990970
Turbidity              2.963135
Potability             0.000000
Name: 0, dtype: float64

Both `loc` and `iloc` are row-first, column-second. This is the opposite of the native accessors shown earlier, where we select the column first and the row second.

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with `iloc`, we can do the following:

In [7]:
water_potability.iloc[:, 0]

0            NaN
1       3.716080
2       8.099124
3       8.316766
4       9.092223
          ...   
3271    4.668102
3272    7.808856
3273    9.419510
3274    5.126763
3275    7.874671
Name: ph, Length: 3276, dtype: float64

On its own, the `:` operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the `ph` column from just the first, second, and third row, we would do:

In [8]:
water_potability.iloc[:3, 0]

0         NaN
1    3.716080
2    8.099124
Name: ph, dtype: float64

Or, to select just the second and third entries, we would do:

In [9]:
water_potability.iloc[1:3, 0]

1    3.716080
2    8.099124
Name: ph, dtype: float64

It's also possible to pass a list:

In [10]:
water_potability.iloc[[0, 1, 2], 0]

0         NaN
1    3.716080
2    8.099124
Name: ph, dtype: float64

Finally, it's worth knowing that negative numbers can be used in selection. These count backwards from the _end_ of the values. So, for example, here are the last five rows of the dataset.

In [11]:
water_potability.iloc[-5:]

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
3271,4.668102,193.681735,47580.991603,7.166639,359.948574,526.424171,13.894419,66.687695,4.435821,1
3272,7.808856,193.553212,17329.80216,8.061362,,392.44958,19.903225,,2.798243,1
3273,9.41951,175.762646,33155.578218,7.350233,,432.044783,11.03907,69.8454,3.298875,1
3274,5.126763,230.603758,11983.869376,6.303357,,402.883113,11.168946,77.488213,4.708658,1
3275,7.874671,195.102299,17404.177061,7.509306,,327.45976,16.140368,78.698446,2.309149,1


### Label-based selection

The second paradigm for attribute selection is the one followed by the `loc` operator: **label-based selection**. In this paradigm, it's the data index value, not its position, which matters.

For example, to get the `ph` value of the first entry in `water_potability`, we would now do the following:

In [12]:
water_potability.loc[0, 'ph']

nan

`iloc` is conceptually simpler than `loc` because it ignores the dataset's indices. When we use `iloc` we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. `loc`, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using `loc` instead. For example, here's one operation that's much easier using `loc`:

In [13]:
water_potability.loc[:, ['Solids', 'Conductivity', 'Turbidity']]

Unnamed: 0,Solids,Conductivity,Turbidity
0,20791.318981,564.308654,2.963135
1,18630.057858,592.885359,4.500656
2,19909.541732,418.606213,3.055934
3,22018.417441,363.266516,4.628771
4,17978.986339,398.410813,4.075075
...,...,...,...
3271,47580.991603,526.424171,4.435821
3272,17329.802160,392.449580,2.798243
3273,33155.578218,432.044783,3.298875
3274,11983.869376,402.883113,4.708658
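For comparison, here is a rough sketch of what the same column selection looks like with `iloc`. It uses a small hypothetical DataFrame with made-up numbers, and translates each column name into its integer position with `columns.get_loc` before selecting:

```python
import pandas as pd

# Hypothetical toy data borrowing four column names from the dataset.
toy = pd.DataFrame(
    {
        "ph": [7.1, 6.4],
        "Solids": [20791.3, 18630.1],
        "Conductivity": [564.3, 592.9],
        "Turbidity": [2.96, 4.50],
    }
)

wanted = ["Solids", "Conductivity", "Turbidity"]

# Label-based: name the columns directly.
by_label = toy.loc[:, wanted]

# Position-based: first look up the integer position of each name.
positions = [toy.columns.get_loc(name) for name in wanted]
by_position = toy.iloc[:, positions]

print(positions)                     # [1, 2, 3]
print(by_label.equals(by_position))  # True
```

Both routes return the same data, but the `loc` version does not need the extra lookup step.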


### Choosing between `loc` and `iloc`

When choosing or transitioning between `loc` and `iloc`, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.

`iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So `0:10` will select entries `0,...,9`. `loc`, meanwhile, indexes inclusively. So `0:10` will select entries `0,...,10`.

Why the change? Remember that `loc` can index any stdlib type: strings, for example. If we have a DataFrame with index values `Apples, ..., Potatoes, ...`, and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index `df.loc['Apples':'Potatoes']` than it is to index something like `df.loc['Apples':'Potatoet']` (`t` coming after `s` in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. `0,...,1000`. In this case `df.iloc[0:1000]` will return 1000 entries, while `df.loc[0:1000]` returns 1001 of them! To get 1000 elements using `loc`, you will need to go one lower and ask for `df.loc[0:999]`.

Otherwise, the semantics of using `loc` are the same as those for `iloc`.
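The difference between the two slicing schemes can be checked on a tiny example. The sketch below uses two hypothetical DataFrames (`numbers` and `produce`, both made up for illustration), one with the default integer index and one with a sorted string index:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..4.
numbers = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

by_position = numbers.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = numbers.loc[0:3]      # labels 0, 1, 2, 3 (end included)
print(len(by_position), len(by_label))  # 3 4

# Hypothetical frame with a sorted string index.
produce = pd.DataFrame(
    {"stock": [5, 2, 7, 1]},
    index=["Apples", "Bananas", "Potatoes", "Tomatoes"],
)

# With loc, the end label "Potatoes" is part of the result.
a_to_p = produce.loc["Apples":"Potatoes"]
print(list(a_to_p.index))  # ['Apples', 'Bananas', 'Potatoes']
```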

# Manipulating the index

Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can manipulate the index in any way we see fit.

The `set_index()` method can be used to do the job. It returns a new DataFrame and leaves the original unchanged. Here is what happens when we `set_index` to the `ph` column:

In [14]:
water_potability.set_index("ph")

ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.990970,2.963135,0
3.716080,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656,0
8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934,0
8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
9.092223,181.101509,17978.986339,6.546600,310.135738,398.410813,11.558279,31.997993,4.075075,0
...,...,...,...,...,...,...,...,...,...
4.668102,193.681735,47580.991603,7.166639,359.948574,526.424171,13.894419,66.687695,4.435821,1
7.808856,193.553212,17329.802160,8.061362,,392.449580,19.903225,,2.798243,1
9.419510,175.762646,33155.578218,7.350233,,432.044783,11.039070,69.845400,3.298875,1
5.126763,230.603758,11983.869376,6.303357,,402.883113,11.168946,77.488213,4.708658,1


This is useful if you can come up with an index for the dataset which is better than the current one.
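As a small sketch of how a new index interacts with label-based selection, the example below uses a hypothetical DataFrame with a made-up `sample_id` column (not present in the water potability dataset). It also uses `reset_index()`, which this notebook does not otherwise cover, to move the index back into an ordinary column:

```python
import pandas as pd

# Hypothetical toy data; "sample_id" is an invented identifier column.
toy = pd.DataFrame({"sample_id": ["a", "b", "c"], "ph": [7.1, 6.4, 8.0]})

# set_index returns a new DataFrame; `toy` keeps its default integer index.
indexed = toy.set_index("sample_id")
print(list(toy.index))      # [0, 1, 2]
print(list(indexed.index))  # ['a', 'b', 'c']

# Label-based selection now uses the new labels.
print(indexed.loc["b", "ph"])  # 6.4

# reset_index() turns the index back into a regular column.
restored = indexed.reset_index()
print(list(restored.columns))  # ['sample_id', 'ph']
```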

# Conditional selection

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do *interesting* things with the data, however, we often need to ask questions based on conditions. 

For example, suppose that we're interested specifically in water samples that are potable (`Potability == 1`).

We can start by checking whether each sample has `Potability == 1`:

In [15]:
water_potability.Potability == 1

0       False
1       False
2       False
3       False
4       False
        ...  
3271     True
3272     True
3273     True
3274     True
3275     True
Name: Potability, Length: 3276, dtype: bool

This operation produced a Series of `True`/`False` booleans based on the `Potability` of each record. This result can then be used inside of `loc` to select the relevant data:

In [16]:
water_potability.loc[water_potability.Potability == 1]

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
250,9.445130,145.805402,13168.529156,9.444471,310.583374,592.659021,8.606397,77.577460,3.875165,1
251,9.024845,128.096691,19859.676476,8.016423,300.150377,451.143481,14.770863,73.778026,3.985251,1
252,,169.974849,23403.637304,8.519730,,475.573562,12.924107,50.861913,2.747313,1
253,6.800119,242.008082,39143.403329,9.501695,187.170714,376.456593,11.432466,73.777275,3.854940,1
254,7.174135,203.408935,20401.102461,7.681806,287.085679,315.549900,14.533510,74.405616,3.939896,1
...,...,...,...,...,...,...,...,...,...,...
3271,4.668102,193.681735,47580.991603,7.166639,359.948574,526.424171,13.894419,66.687695,4.435821,1
3272,7.808856,193.553212,17329.802160,8.061362,,392.449580,19.903225,,2.798243,1
3273,9.419510,175.762646,33155.578218,7.350233,,432.044783,11.039070,69.845400,3.298875,1
3274,5.126763,230.603758,11983.869376,6.303357,,402.883113,11.168946,77.488213,4.708658,1


This DataFrame has ~1,200 rows. The original had 3,276. That means that around 40% of the water samples have `Potability == 1`.
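Rather than reading row counts off the output, the boolean Series itself can be summarised. This is a minimal sketch on a hypothetical five-row DataFrame (made-up values), relying on the fact that `True` counts as 1 and `False` as 0:

```python
import pandas as pd

# Hypothetical toy data with three potable samples out of five.
toy = pd.DataFrame({"Potability": [0, 1, 1, 0, 1]})

is_potable = toy.Potability == 1

# sum() counts the True values; mean() gives their share of all rows.
n_potable = int(is_potable.sum())
share_potable = float(is_potable.mean())
print(n_potable, share_potable)  # 3 0.6
```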

Suppose we also want to know which of these are the better samples, and that samples with `ph` values between 6.0 and 7.0 (inclusive) are considered good.

We can use the ampersand (`&`) to bring the two questions together:

In [17]:
water_potability.loc[(water_potability.Potability == 1) & (water_potability.ph >= 6.0) & (water_potability.ph <= 7.0)]

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
253,6.800119,242.008082,39143.403329,9.501695,187.170714,376.456593,11.432466,73.777275,3.854940,1
259,6.101955,215.268090,15976.926225,8.857160,308.482695,417.843553,13.147279,62.505642,3.535596,1
262,6.548021,278.585105,25508.386949,6.749378,366.871502,497.321753,16.563167,79.323678,3.611860,1
264,6.618011,233.661636,19598.860740,4.701049,432.556385,401.669791,11.766146,73.191921,4.437696,1
268,6.869639,251.293448,21728.821295,8.803175,279.776857,539.466877,12.994140,56.409708,4.726702,1
...,...,...,...,...,...,...,...,...,...,...
3257,6.683368,272.111698,18989.316768,5.336202,336.555100,307.725009,20.178716,75.402260,5.208061,1
3258,6.638411,180.826667,9772.504814,8.295983,,401.111143,12.601517,61.051889,5.164057,1
3263,6.923636,260.593154,24792.525623,5.501164,332.232177,607.773567,15.483027,51.535867,4.013339,1
3268,6.702547,207.321086,17246.920347,7.708117,304.510230,329.266002,16.217303,28.878601,3.442983,1
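Note that each comparison is wrapped in parentheses, because `&` binds more tightly than `==`, `>=`, and `<=` in Python. As an alternative sketch for the range check, `Series.between` (inclusive on both ends by default) expresses the same condition more compactly. The toy DataFrame below is hypothetical, with made-up values:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {"ph": [5.5, 6.0, 6.8, 7.0, 7.4], "Potability": [1, 1, 0, 1, 1]}
)

# Two explicit comparisons joined with &.
explicit = toy.loc[(toy.Potability == 1) & (toy.ph >= 6.0) & (toy.ph <= 7.0)]

# The same range check written with between().
shorter = toy.loc[(toy.Potability == 1) & toy.ph.between(6.0, 7.0)]

print(list(explicit.index))      # [1, 3]
print(explicit.equals(shorter))  # True
```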


Suppose we want any sample that has high organic carbon (above 18) or high turbidity (5 or above). For this we use a pipe (`|`):

In [18]:
water_potability.loc[(water_potability.Organic_carbon > 18) | (water_potability.Turbidity >= 5)]

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
13,,150.174923,27331.361962,6.838223,299.415781,379.761835,19.370807,76.509996,4.413974,0
16,7.051786,211.049406,30980.600787,10.094796,,315.141267,20.397022,56.651604,4.268429,0
22,,215.977859,17107.224226,5.607060,326.943978,436.256194,14.189062,59.855476,5.459251,0
25,6.514415,198.767351,21218.702871,8.670937,323.596349,413.290450,14.900000,79.847843,5.200885,0
...,...,...,...,...,...,...,...,...,...,...
3248,6.260111,211.594112,18577.623969,7.154891,340.792574,357.098395,7.992210,82.365378,5.403615,1
3257,6.683368,272.111698,18989.316768,5.336202,336.555100,307.725009,20.178716,75.402260,5.208061,1
3258,6.638411,180.826667,9772.504814,8.295983,,401.111143,12.601517,61.051889,5.164057,1
3264,5.893103,239.269481,20526.666156,6.349561,341.256362,403.617560,18.963707,63.846319,4.390702,1


Finally, to get all the samples that have a known (non-null) sulfate value, we can use `notnull()`:

In [19]:
water_potability.loc[water_potability.Sulfate.notnull()]

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
0,,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.990970,2.963135,0
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
4,9.092223,181.101509,17978.986339,6.546600,310.135738,398.410813,11.558279,31.997993,4.075075,0
5,5.584087,188.313324,28748.687739,7.544869,326.678363,280.467916,8.399735,54.917862,2.559708,0
6,10.223862,248.071735,28749.716544,7.513408,393.663396,283.651634,13.789695,84.603556,2.672989,0
...,...,...,...,...,...,...,...,...,...,...
3267,8.989900,215.047358,15921.412018,6.297312,312.931022,390.410231,9.899115,55.069304,4.613843,1
3268,6.702547,207.321086,17246.920347,7.708117,304.510230,329.266002,16.217303,28.878601,3.442983,1
3269,11.491011,94.812545,37188.826022,9.263166,258.930600,439.893618,16.172755,41.558501,4.369264,1
3270,6.069616,186.659040,26138.780191,7.747547,345.700257,415.886955,12.067620,60.419921,3.669712,1
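A dedicated method is needed here because `NaN` never compares equal to anything, including itself. The sketch below shows this on a hypothetical four-row DataFrame (made-up values), together with `isnull()`, the counterpart of `notnull()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing sulfate readings.
toy = pd.DataFrame({"Sulfate": [368.5, np.nan, np.nan, 356.9]})

# Comparing against NaN matches nothing.
naive_matches = int((toy.Sulfate == np.nan).sum())
print(naive_matches)  # 0

# notnull() and isnull() split the rows into known and missing values.
known = toy.loc[toy.Sulfate.notnull()]
missing = toy.loc[toy.Sulfate.isnull()]
print(list(known.index), list(missing.index))  # [0, 3] [1, 2]
```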


# Assigning data

Going the other way, assigning data to a DataFrame is easy. For example, you can assign a constant value to a single cell:

In [20]:
water_potability.loc[0, "Potability"] = 1
water_potability

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
0,,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.990970,2.963135,1
1,3.716080,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656,0
2,8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934,0
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
4,9.092223,181.101509,17978.986339,6.546600,310.135738,398.410813,11.558279,31.997993,4.075075,0
...,...,...,...,...,...,...,...,...,...,...
3271,4.668102,193.681735,47580.991603,7.166639,359.948574,526.424171,13.894419,66.687695,4.435821,1
3272,7.808856,193.553212,17329.802160,8.061362,,392.449580,19.903225,,2.798243,1
3273,9.419510,175.762646,33155.578218,7.350233,,432.044783,11.039070,69.845400,3.298875,1
3274,5.126763,230.603758,11983.869376,6.303357,,402.883113,11.168946,77.488213,4.708658,1
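Assignment also works with an iterable of values or with a boolean condition inside `loc`. The following is a minimal sketch on a hypothetical three-row DataFrame; the `row_number` column and the rule "mark samples with `ph` above 7.0 as not potable" are invented purely for illustration:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"ph": [5.5, 6.8, 7.4], "Potability": [0, 0, 1]})

# Constant value for a single cell, as in the cell above.
toy.loc[0, "Potability"] = 1

# Iterable of values: one value per row, stored in a new (hypothetical) column.
toy["row_number"] = range(len(toy), 0, -1)

# Conditional assignment: only rows matching the condition are changed.
toy.loc[toy.ph > 7.0, "Potability"] = 0

print(toy.Potability.tolist())  # [1, 0, 0]
print(toy.row_number.tolist())  # [3, 2, 1]
```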


# Contributors

**Author**
<br>Chee Lam

# References

1. [Learning Pandas](https://www.kaggle.com/learn/pandas)
2. [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html)