## ![logo](../../img/license_header_logo.png)
> **Copyright &copy; 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program and the accompanying materials are made available under the
terms of the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). <br>
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License. <br>
<br>**SPDX-License-Identifier: Apache-2.0**

# <a name="top">03 - Common Operations</a>
Authored by: Scotrraaj Gopal - scotrraaj.gopal@certifai.ai

## <a name="description">Notebook Description</a>

Selecting particular values from a `pandas` `DataFrame` or `Series` is an important step in practically any data operation you'll perform. One of the first things you should learn when working with data in Python is therefore how to quickly and effectively choose the data points that matter to you.

By the end of this tutorial, you will be able to:

1. Understand the difference between `.iloc` and `.loc`.
2. Slice and select your data through index, labels and conditions.
3. Obtain data by columns and by rows.

## Notebook Outline
Here's the outline for this tutorial:
1. [Notebook Description](#description)
2. [Notebook Configurations](#configuration)
3. [Native Accessors](#native)
4. [Indexing in `pandas`](#index)
    - [`.iloc` - Index based selection](#iloc)
    - [`.loc` - Label based selection](#loc)
5. [Filtering with Conditions](#conditions)
6. [Summary](#summary)
7. [Reference](#reference)

## <a name="configuration">Notebook Configurations</a>

Import the `pandas` module and the dataset from `../../Datasets/pandas/winemag-data-130k-v2.csv`. Name the `DataFrame` object `reviews`.

In [1]:
### BEGIN SOLUTION
import pandas as pd
reviews = pd.read_csv("../../Datasets/pandas/winemag-data-130k-v2.csv", index_col=0)
### END SOLUTION

reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,US,Citation is given as much as a decade of bottl...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,France,Well-drained gravel soil gives this wine its c...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
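To see what `index_col=0` does in the `read_csv` call above, here is a minimal sketch. It assumes a tiny, made-up CSV string (`csv_text` is hypothetical, not the real dataset) whose first column is unnamed and holds the row numbers, which is how the wine file is laid out.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real file; the first
# (unnamed) column holds the row numbers.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n2,US,87\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))     # ['country', 'points']
print(list(without_index.columns))  # ['Unnamed: 0', 'country', 'points']
```

Without `index_col=0`, the row numbers would be read as an ordinary column called `Unnamed: 0`, which would shift every column position by one.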


## <a name="native">Native Accessors</a>

The Python and NumPy indexing operator `[ ]` and attribute operator `.` provide quick and easy access to `pandas` data structures across a wide range of use cases. Say we have a `car` object, for example. It might have a `brand` property, which we can access by calling `car.brand`.

Columns in a `DataFrame` object work in much the same way. Let's try to access the `country` property of the `reviews` `DataFrame`.

In [2]:
# Access 'country' property
### BEGIN SOLUTION
reviews.country
### END SOLUTION


0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

What if you were interested in only the 5th country in that column?
> *Note: The first element is in index 0*

In [3]:
# Get 5th country in 'country' column
### BEGIN SOLUTION
reviews.country[4]
### END SOLUTION


'US'

What if you were interested in a range of rows? For example, you'd like to know the 6th to the 9th country in that column.

You can either list every row that you want with `[5,6,7,8]`, or slice them out using `[5:9]`.

> *The convention to keep in mind when using square brackets to slice is* `[start:stop:step]`*, in which the* `stop` *position is excluded and the default step is equal to 1*

In [4]:
# List the indices of interest
### BEGIN SOLUTION
reviews.country[[5,6,7,8]]
### END SOLUTION


5      Spain
6      Italy
7     France
8    Germany
Name: country, dtype: object

In [5]:
# Slicing
### BEGIN SOLUTION
reviews.country[5:9]
### END SOLUTION


5      Spain
6      Italy
7     France
8    Germany
Name: country, dtype: object
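The two cells above return the same rows. As a small sketch of the `[start:stop:step]` convention, the example below uses a made-up `Series` called `letters` (an assumption for illustration, not part of the dataset) with the default integer index.

```python
import pandas as pd

# Hypothetical Series with the default 0..9 integer index.
letters = pd.Series(list("abcdefghij"))

by_list = letters[[5, 6, 7, 8]]   # every label listed explicitly
by_slice = letters[5:9]           # start=5, stop=9 (excluded), step=1
every_other = letters[5:9:2]      # same range, taking every 2nd element

print(by_list.equals(by_slice))   # True
print(list(every_other))          # ['f', 'h']
```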

We can also use the indexing operator (`[ ]`) to do the same thing.

In [6]:
# Access 'country' property
### BEGIN SOLUTION
reviews['country']
### END SOLUTION


0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

In [7]:
# Get 5th country in 'country' column
### BEGIN SOLUTION
reviews['country'][4]
### END SOLUTION


'US'

In [8]:
# List the indices of interest
### BEGIN SOLUTION
reviews['country'][[5,6,7,8]]
### END SOLUTION


5      Spain
6      Italy
7     France
8    Germany
Name: country, dtype: object

In [9]:
# Slicing
### BEGIN SOLUTION
reviews['country'][5:9]
### END SOLUTION


5      Spain
6      Italy
7     France
8    Germany
Name: country, dtype: object

## <a name="index">Indexing in `pandas`</a>
The indexing and attribute operators are great because they work just like they do in the rest of the Python ecosystem, which makes them easy for a beginner to pick up and use. However, `pandas` also has its own accessor operators, `.iloc` and `.loc`, which are really handy for more advanced operations.

### <a name="iloc">`.iloc` - Index based selection</a>

`.iloc` is what you're looking for when you want to select data based on its numerical position in the data.

> *Note: Both* `.iloc` *and* `.loc` *are row-first, column-second expressions. This is the opposite of the native accessors above, where we select the column first and then the row (for example,* `reviews['country'][4]`*).*

Using `.iloc`, let's try to access the element of the `DataFrame` in the 7th row in the `"designation"` column. *(Hint: What's the index of the column with this label?)*

In [10]:
### BEGIN SOLUTION
reviews.iloc[6,2]
### END SOLUTION


'Belsito'
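If you would rather not count columns by hand, the position of a label can be looked up with `columns.get_loc`. The sketch below uses a hypothetical three-row miniature of the table (`mini` is made up for illustration) with the same first three columns as `reviews`, and also contrasts the column-first native accessor with the row-first `.iloc`.

```python
import pandas as pd

# Hypothetical miniature of the reviews table, using the same column order.
mini = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "description": ["Aromas...", "Ripe...", "Tart..."],
    "designation": ["Vulkà Bianco", "Avidagos", None],
})

col_pos = mini.columns.get_loc("designation")  # integer position of the label
print(col_pos)                                  # 2

# Native accessors go column first, then row; .iloc goes row first, then column.
print(mini["designation"][1])                   # Avidagos
print(mini.iloc[1, col_pos])                    # Avidagos
```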

The `:` operator, which also comes from native Python, means "everything" when used on its own. When combined with other selectors, it indicates a range of values.

To pick the `"designation"` column from only the first, second, and third rows with `.iloc`, we can write the following.

In [11]:
### BEGIN SOLUTION
reviews.iloc[:3,2]
### END SOLUTION


# Also try listing the indices of interest
### BEGIN SOLUTION
reviews.iloc[[0,1,2],2]
### END SOLUTION


0    Vulkà Bianco
1        Avidagos
2             NaN
Name: designation, dtype: object

Negative numbers can also be used in a selection. They **count backwards from the end** of the values, so `-1` refers to the last row.

To illustrate this, here are the bottom 7 elements of the `"designation"` column.

In [12]:
### BEGIN SOLUTION
reviews.iloc[-7:,2]
### END SOLUTION


129964              Domaine Saint-Rémy Herrenweg
129965               Seppi Landmann Vallée Noble
129966    Brauneberger Juffer-Sonnenuhr Spätlese
129967                                       NaN
129968                                     Kritt
129969                                       NaN
129970             Lieu-dit Harth Cuvée Caroline
Name: designation, dtype: object
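The same idea on a smaller scale, as a minimal sketch: `df` below is a hypothetical five-row frame used only to show how negative positions behave on either side of the `:`.

```python
import pandas as pd

# Hypothetical five-row frame used only to illustrate negative positions.
df = pd.DataFrame({"points": [87, 88, 89, 90, 91]})

print(df.iloc[-1, 0])            # 91, the last row
print(list(df.iloc[-2:, 0]))     # [90, 91], the last two rows
print(list(df.iloc[:-2, 0]))     # [87, 88, 89], everything except the last two
```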

### <a name="loc">`.loc` - Label based selection</a>

`.loc` selects data by index label rather than by position. The labels can be of any data type, including `int64` and `object` (for example, strings).
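To make the point about non-integer labels concrete, here is a small sketch. The `wines` frame is hypothetical and is indexed by winery name instead of row numbers, so the label and the position of a row are clearly different things.

```python
import pandas as pd

# Hypothetical frame indexed by winery name instead of row numbers.
wines = pd.DataFrame(
    {"country": ["Italy", "US", "US"], "points": [87, 87, 90]},
    index=["Nicosia", "Rainstorm", "Citation"],
)

print(wines.loc["Rainstorm", "points"])   # 87, looked up by label
print(wines.iloc[1, 1])                   # 87, same cell looked up by position
```

In `reviews` the index labels happen to be the integers 0, 1, 2, ..., so labels and positions coincide for rows.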

Now, let's try to access the same element in the 7th row in the `"designation"` column but this time using `.loc`.

In [13]:
### BEGIN SOLUTION
reviews.loc[6, 'designation']
### END SOLUTION


'Belsito'

In [14]:
# Try to get the 7th row for every column up until "designation"
### BEGIN SOLUTION
reviews.loc[6, :'designation']
### END SOLUTION


country                                                    Italy
description    Here's a bright, informal red that opens with ...
designation                                              Belsito
Name: 6, dtype: object

Observe that when you used the `:` operator with `"designation"`, the `"designation"` column was included in the result. This happens because `.iloc` and `.loc` use slightly different slicing schemes: `.iloc` follows the Python convention and excludes the end of the range, whereas `.loc` includes it. Keep this behaviour in mind when choosing between them.

For example, suppose you have a `DataFrame` `df` whose index runs over the integers `[0, 1, 2, ..., 1000]`. Then `df.iloc[0:1000]` returns 1000 entries while `df.loc[0:1000]` returns 1001! To get 1000 entries with `.loc`, you need to go one lower and write `df.loc[0:999]`.

In [15]:
# Test it out for yourself.

# import numpy as np
# data = np.random.randn(1001,2)
# df = pd.DataFrame(data)
# df.iloc[:1000].shape
# df.loc[:1000].shape
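One way to run the comparison described above, as a minimal sketch. It assumes a frame of zeros rather than random numbers, since only the shapes matter here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 1001 rows labelled 0..1000.
df = pd.DataFrame(np.zeros((1001, 2)))

print(df.iloc[0:1000].shape)   # (1000, 2): stop position is excluded
print(df.loc[0:1000].shape)    # (1001, 2): stop label is included
print(df.loc[0:999].shape)     # (1000, 2)
```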

## <a name="conditions">Filtering with Conditions</a>

We've gone over how to select rows and columns, but what if we want to make a conditional selection? While analyzing data, we often have questions running through our heads, and a smart use of conditions lets us get at the answers.

For example, suppose we want to know which of the wines from Italy are better than average. Let's work step by step towards answering that question.

In [16]:
### BEGIN SOLUTION
reviews.country == 'Italy'
### END SOLUTION


0          True
1         False
2         False
3         False
4         False
          ...  
129966    False
129967    False
129968    False
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

The operation above created a `Series` of boolean values based on the `country` of each entry. This series can then be used inside of `.loc` to select the appropriate data.

In [17]:
### BEGIN SOLUTION
reviews.loc[reviews.country == 'Italy']
### END SOLUTION


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
13,Italy,This is dominated by oak and oak-driven aromas...,Rosso,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Masseria Setteporte 2012 Rosso (Etna),Nerello Mascalese,Masseria Setteporte
22,Italy,Delicate aromas recall white flower and citrus...,Ficiligno,87,19.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Baglio di Pianetto 2007 Ficiligno White (Sicilia),White Blend,Baglio di Pianetto
24,Italy,"Aromas of prune, blackcurrant, toast and oak c...",Aynat,87,35.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Canicattì 2009 Aynat Nero d'Avola (Sicilia),Nero d'Avola,Canicattì
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129929,Italy,"This luminous sparkler has a sweet, fruit-forw...",,91,38.0,Veneto,Prosecco Superiore di Cartizze,,,,Col Vetoraz Spumanti NV Prosecco Superiore di...,Prosecco,Col Vetoraz Spumanti
129943,Italy,"A blend of Nero d'Avola and Syrah, this convey...",Adènzia,90,29.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Baglio del Cristo di Campobello 2012 Adènzia R...,Red Blend,Baglio del Cristo di Campobello
129947,Italy,"A blend of 65% Cabernet Sauvignon, 30% Merlot ...",Symposio,90,20.0,Sicily & Sardinia,Terre Siciliane,,Kerin O’Keefe,@kerinokeefe,Feudo Principi di Butera 2012 Symposio Red (Te...,Red Blend,Feudo Principi di Butera
129961,Italy,"Intense aromas of wild cherry, baking spice, t...",,90,30.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,COS 2013 Frappato (Sicilia),Frappato,COS
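The mechanics are easier to inspect on a handful of rows. The sketch below assumes a hypothetical four-row frame `df` standing in for `reviews`.

```python
import pandas as pd

# Hypothetical four-row frame standing in for reviews.
df = pd.DataFrame({
    "country": ["Italy", "US", "Italy", "France"],
    "points": [87, 90, 92, 88],
})

mask = df.country == "Italy"      # boolean Series, one value per row
print(list(mask))                 # [True, False, True, False]
print(mask.sum())                 # 2
print(list(df.loc[mask].index))   # [0, 2]
```

Rows where the mask is `True` are kept along with their original index labels, which is why the filtered `reviews` table above starts at 0 and then jumps to 6.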


Awesome! Now for one more condition. We wanted to know which of the wines are better than average, so we can base our judgement on the points awarded to each wine. Wines are reviewed on an 80-100 point scale, and we can find the mean of all the points with `reviews.points.mean()`.

Let's bring both of these conditions together with the ampersand symbol (`&`) and save our filtered table into `topItalyWines`.

In [18]:
### BEGIN SOLUTION
topItalyWines = reviews.loc[(reviews.country == 'Italy') & (reviews.points >= reviews.points.mean())]
### END SOLUTION


topItalyWines

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
120,Italy,"Slightly backward, particularly given the vint...",Bricco Rocche Prapó,92,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Prapó (Barolo),Nebbiolo,Ceretto
130,Italy,"At the first it was quite muted and subdued, b...",Bricco Rocche Brunate,91,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Brunate (Barolo),Nebbiolo,Ceretto
133,Italy,"Einaudi's wines have been improving lately, an...",,91,68.0,Piedmont,Barolo,,,,Poderi Luigi Einaudi 2003 Barolo,Nebbiolo,Poderi Luigi Einaudi
135,Italy,The color is just beginning to show signs of b...,Sorano,91,60.0,Piedmont,Barolo,,,,Giacomo Ascheri 2001 Sorano (Barolo),Nebbiolo,Giacomo Ascheri
140,Italy,"A big, fat, luscious wine with plenty of toast...",Costa Bruna,90,26.0,Piedmont,Barbera d'Alba,,,,Poderi Colla 2005 Costa Bruna (Barbera d'Alba),Barbera,Poderi Colla
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129929,Italy,"This luminous sparkler has a sweet, fruit-forw...",,91,38.0,Veneto,Prosecco Superiore di Cartizze,,,,Col Vetoraz Spumanti NV Prosecco Superiore di...,Prosecco,Col Vetoraz Spumanti
129943,Italy,"A blend of Nero d'Avola and Syrah, this convey...",Adènzia,90,29.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Baglio del Cristo di Campobello 2012 Adènzia R...,Red Blend,Baglio del Cristo di Campobello
129947,Italy,"A blend of 65% Cabernet Sauvignon, 30% Merlot ...",Symposio,90,20.0,Sicily & Sardinia,Terre Siciliane,,Kerin O’Keefe,@kerinokeefe,Feudo Principi di Butera 2012 Symposio Red (Te...,Red Blend,Feudo Principi di Butera
129961,Italy,"Intense aromas of wild cherry, baking spice, t...",,90,30.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,COS 2013 Frappato (Sicilia),Frappato,COS
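A note on the parentheses in the cell above: in Python, `&` binds more tightly than `==` and `>=`, so each condition has to be wrapped in its own parentheses when written inline. The sketch below uses a hypothetical four-row frame `df` and shows that naming the masks first gives the same result.

```python
import pandas as pd

# Hypothetical four-row frame standing in for reviews.
df = pd.DataFrame({
    "country": ["Italy", "US", "Italy", "France"],
    "points": [87, 90, 92, 88],
})

is_italy = df.country == "Italy"
above_mean = df.points >= df.points.mean()   # the mean is 89.25 here

named = df.loc[is_italy & above_mean]
inline = df.loc[(df.country == "Italy") & (df.points >= df.points.mean())]

print(list(named.index))      # [2]
print(named.equals(inline))   # True
```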


What if we wanted to keep the same conditions but include another `country`, say `Brazil`?

You can replace the equality check in the first condition with the `isin()` method, passing the countries of interest as a list. Save the results in the variable `topItalyBrazilWines`.

In [19]:
### BEGIN SOLUTION
topItalyBrazilWines = reviews.loc[(reviews.country.isin(['Italy', 'Brazil'])) & (reviews.points >= reviews.points.mean())]
### END SOLUTION


topItalyBrazilWines

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
120,Italy,"Slightly backward, particularly given the vint...",Bricco Rocche Prapó,92,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Prapó (Barolo),Nebbiolo,Ceretto
130,Italy,"At the first it was quite muted and subdued, b...",Bricco Rocche Brunate,91,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Brunate (Barolo),Nebbiolo,Ceretto
133,Italy,"Einaudi's wines have been improving lately, an...",,91,68.0,Piedmont,Barolo,,,,Poderi Luigi Einaudi 2003 Barolo,Nebbiolo,Poderi Luigi Einaudi
135,Italy,The color is just beginning to show signs of b...,Sorano,91,60.0,Piedmont,Barolo,,,,Giacomo Ascheri 2001 Sorano (Barolo),Nebbiolo,Giacomo Ascheri
140,Italy,"A big, fat, luscious wine with plenty of toast...",Costa Bruna,90,26.0,Piedmont,Barbera d'Alba,,,,Poderi Colla 2005 Costa Bruna (Barbera d'Alba),Barbera,Poderi Colla
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129929,Italy,"This luminous sparkler has a sweet, fruit-forw...",,91,38.0,Veneto,Prosecco Superiore di Cartizze,,,,Col Vetoraz Spumanti NV Prosecco Superiore di...,Prosecco,Col Vetoraz Spumanti
129943,Italy,"A blend of Nero d'Avola and Syrah, this convey...",Adènzia,90,29.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Baglio del Cristo di Campobello 2012 Adènzia R...,Red Blend,Baglio del Cristo di Campobello
129947,Italy,"A blend of 65% Cabernet Sauvignon, 30% Merlot ...",Symposio,90,20.0,Sicily & Sardinia,Terre Siciliane,,Kerin O’Keefe,@kerinokeefe,Feudo Principi di Butera 2012 Symposio Red (Te...,Red Blend,Feudo Principi di Butera
129961,Italy,"Intense aromas of wild cherry, baking spice, t...",,90,30.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,COS 2013 Frappato (Sicilia),Frappato,COS
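`isin()` is shorthand for several equality checks joined with the "or" operator (`|`). The sketch below, on a hypothetical four-row frame `df`, checks that the two spellings produce the same mask.

```python
import pandas as pd

# Hypothetical four-row frame standing in for reviews.
df = pd.DataFrame({
    "country": ["Italy", "Brazil", "US", "Italy"],
    "points": [92, 91, 95, 84],
})

with_isin = df.country.isin(["Italy", "Brazil"])
with_or = (df.country == "Italy") | (df.country == "Brazil")
print(with_isin.equals(with_or))   # True

# Combine with the points condition; the mean is 90.5 here.
top = df.loc[with_isin & (df.points >= df.points.mean())]
print(list(top.country))           # ['Italy', 'Brazil']
```

The `isin()` form scales better as the list of countries grows, which is presumably why the notebook uses it.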


Amazing. As your filtering skills improve, you will be able to answer more and more complex questions that are critical to your analysis. That is why good data scientists must not only know how to code, but must also possess the creativity to ask the right questions.

##  <a name="summary">Summary</a>
To conclude, you should now be able to:

1. Understand the difference between `.iloc` and `.loc`.
2. Slice and select your data through index, labels and conditions.
3. Obtain data by columns and by rows.

Congratulations, that concludes this lesson. In the next tutorial, we'll learn how to clean up our data and deal with missing values.

See you!

## <a name="reference">Reference</a>
* [Dataset Source](https://www.kaggle.com/zynicide/wine-reviews)

<font size=2>[Back to Top](#top)</font>