## ![logo](../../img/license_header_logo.png)
> **Copyright &copy; 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful,
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.

# <a name="top">03 - Common Operations</a>
Authored by: Scotrraaj Gopal - scotrraaj.gopal@certifai.ai

## <a name="description">Notebook Description</a>

Selecting particular values from a `pandas` DataFrame or Series to work on is an important step in practically any data operation you'll perform. One of the first things to learn when working with data in Python is therefore how to select the data points that matter to you quickly and effectively.

By the end of this tutorial, you will be able to:

1. Understand the difference between `.iloc` and `.loc`.
2. Slice and select your data through index, labels and conditions.
3. Obtain data by columns and by rows.

## Notebook Outline
Here's the outline for this tutorial:
1. [Notebook Description](#description)
2. [Notebook Configurations](#configuration)
3. [Native Accessors](#native)
4. [Indexing in `pandas`](#index)
    - [`.iloc` - Index based selection](#iloc)
    - [`.loc` - Label based selection](#loc)
5. [Filtering with Conditions](#conditions)
6. [Summary](#summary)
7. [Reference](#reference)

## <a name="configuration">Notebook Configurations</a>

Import the `pandas` module and the dataset from `../../Datasets/pandas/winemag-data-130k-v2.csv`. Name the `DataFrame` object `reviews`.

In [None]:
# YOUR CODE HERE

reviews

## <a name="native">Native Accessors</a>

The Python and NumPy indexing operator `[ ]` and attribute operator `.` provide quick and easy access to `pandas` data structures across a wide range of use cases. Say we have a `car` object, for example. It might have a `brand` property, which we can access by calling `car.brand`.

Columns in a `DataFrame` object work in much the same way. Let's try to access the `country` property of the `reviews` dataframe.

In [None]:
# Access 'country' property
# YOUR CODE HERE


What if you were interested in only the 5th country in that column?
> *Note: The first element is in index 0*

In [None]:
# Get 5th country in 'country' column
# YOUR CODE HERE


What if you were interested in a range of rows? For example, you'd like to know the 6th up until the 9th country in that column.

You can either list every row that you want with `[5,6,7,8]`, or slice them out using `[5:9]`.

> *The convention that you have to keep in mind when using square brackets to slice is* `[start:stop:step]`*, in which the default step is equal to 1*
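
As a minimal sketch of the `[start:stop:step]` convention, the example below slices a small made-up `Series`. The name `countries` and its values are hypothetical and only stand in for the `country` column of the wine dataset.

```python
import pandas as pd

# Hypothetical toy Series; the values are made up for illustration
countries = pd.Series(
    ["Italy", "Portugal", "US", "US", "US",
     "Spain", "Italy", "France", "France", "Germany"]
)

# start=5, stop=9: positions 5, 6, 7 and 8 (the stop position is excluded)
sixth_to_ninth = countries[5:9]

# start=0, stop=10, step=2: every second element
every_other = countries[0:10:2]

print(sixth_to_ninth.tolist())
print(every_other.tolist())
```

Leaving out `step` is the same as writing a step of 1, which is why `[5:9]` and `[5:9:1]` select the same rows.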

In [None]:
# List the indices you are interested in
# YOUR CODE HERE


In [None]:
# Slicing
# YOUR CODE HERE


We can also use the indexing operator (`[ ]`) to do the same thing.
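
The equivalence of the two access styles can be sketched on a small made-up frame. The name `toy`, its columns and its values are assumptions for illustration and are not taken from the wine dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US", "US", "Spain"],
    "tasting notes": ["ripe", "smooth", "tart", "dry", "rich", "crisp"],
})

# Attribute access and bracket access return the same column
by_attribute = toy.country
by_brackets = toy["country"]
same_column = by_attribute.equals(by_brackets)

# The 5th country sits at index 4, because counting starts at 0
fifth_country = toy["country"][4]

# Only bracket access works for a column name that contains a space
notes = toy["tasting notes"]

print(same_column, fifth_country)
```

Bracket access is the safer habit when a column name contains spaces or clashes with an existing `DataFrame` attribute, since attribute access cannot reach such columns.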

In [None]:
# Access 'country' property
# YOUR CODE HERE


In [None]:
# Get 5th country in 'country' column
# YOUR CODE HERE


In [None]:
# List the indices you are interested in
# YOUR CODE HERE


In [None]:
# Slicing
# YOUR CODE HERE


## <a name="index">Indexing in `pandas`</a>
The indexing and attribute operators are great because they work just like they do in the rest of the Python ecosystem, which makes them easy for beginners to pick up and use. However, `pandas` also has its own accessor operators, `.iloc` and `.loc`, which are really handy for more advanced operations.

### <a name="iloc">`.iloc` - Index based selection</a>

The `.iloc` indexer is what you're looking for when you want to select data based on its numerical position in the data.

> *Note: Both* `.iloc` *and* `.loc` *are row-first, column-second expressions. This is the opposite of the native accessors above, where we pick the column first and then the row.*

Using `.iloc`, let's try to access the element of the `DataFrame` in the 7th row in the `"designation"` column. *(Hint: What's the index of the column with this label?)*

In [None]:
# YOUR CODE HERE


The `:` operator, which is also native Python, signifies "everything" on its own. However, when used in conjunction with other selectors, it may be used to represent a range of values. 
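
The following is a minimal sketch of positional selection with `.iloc` on a made-up frame. The name `toy` and its values are hypothetical; only the column name `designation` mirrors the wine dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US"],
    "designation": ["Bianco", "Avidagos", "Reserve", "Estate"],
    "points": [87, 87, 86, 88],
})

# Look up the position of a column from its label
col_pos = toy.columns.get_loc("designation")

# Row first, column second: row position 2, column position 1
single = toy.iloc[2, col_pos]

# ":" on its own means every row
whole_column = toy.iloc[:, col_pos]

# ":3" means row positions 0, 1 and 2 (position 3 is excluded)
first_three = toy.iloc[:3, col_pos]

# The same rows, listed explicitly
listed = toy.iloc[[0, 1, 2], col_pos]

print(col_pos, single)
print(first_three.tolist())
```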

To pick the `"designation"` column from only the first, second, and third rows with `.iloc`, we can do the following.

In [None]:
# YOUR CODE HERE


# Also try listing the indices you are interested in
# YOUR CODE HERE


Moreover, it is important to understand that negative values can be used in selection. They **count backwards from the end** of the values, so `-1` refers to the last element.

To illustrate this, here are the bottom 7 elements of the `"designation"` column.

In [None]:
# YOUR CODE HERE
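
Negative positions can be sketched on a small made-up frame as follows. The name `toy` and its values are hypothetical and only illustrate the counting direction.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US", "France"],
    "designation": ["Bianco", "Avidagos", "Reserve", "Estate", "Cuvee"],
})

# -1 is the last row, -2 the one before it, and so on
last_one = toy.iloc[-1, 1]

# "-2:" runs from the second-to-last row up to the end
last_two = toy.iloc[-2:, 1]

print(last_one)
print(last_two.tolist())
```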


### <a name="loc">`.loc` - Label based selection</a>

The `.loc` indexer selects data based on index labels rather than positions. The labels can be of any data type, including `int64`, `object` and more.

Now, let's try to access the same element in the 7th row in the `"designation"` column but this time using `.loc`.

In [None]:
# YOUR CODE HERE


In [None]:
# Try to get the 7th row for every column up until "designation"
# YOUR CODE HERE


Observe that when you used the `:` operator with `"designation"`, the `"designation"` column was included in the result. This happens because `.iloc` and `.loc` use slightly different indexing schemes: `.iloc` follows the Python convention and excludes the end of a range, while `.loc` includes it. This behaviour is something that you need to be aware of when choosing between them.

For example, suppose you have a `DataFrame`, `df`, indexed with the integers `[0, 1, 2, ..., 1000]`. Using `df.iloc[0:1000]` returns 1000 entries while `df.loc[0:1000]` returns 1001! To get 1000 elements with `.loc`, you will need to go one lower and write `df.loc[0:999]`.
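
The same difference shows up with non-numeric labels. Below is a minimal sketch on a made-up frame; the name `toy`, the string index `a` to `d` and the values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "US"],
        "designation": ["Bianco", "Avidagos", "Reserve", "Estate"],
        "points": [87, 87, 86, 88],
    },
    index=["a", "b", "c", "d"],
)

# Label slices include the end: rows a, b, c and columns country, designation
by_label = toy.loc["a":"c", "country":"designation"]

# Position slices exclude the end: rows 0, 1 and column 0 only
by_position = toy.iloc[0:2, 0:1]

print(by_label.shape)
print(by_position.shape)
```

An inclusive end is convenient for labels, because there is often no natural "next label" to name as an exclusive stop.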

In [None]:
# Test it out for yourself.

# import numpy as np
# data = np.random.randn(1001,2)
# df = pd.DataFrame(data)
# df.iloc[:1000].shape   # positional slice: end excluded
# df.loc[:1000].shape    # label slice: end included

## <a name="conditions">Filtering with Conditions</a>

We've gone over how to select rows and columns, but what if we wanted to make a conditional selection? While analyzing our data, we will often have questions in mind, and a smart use of conditions lets us answer them.

For example, suppose we want to know which of the wines from Italy are better than average. Let's work step by step towards answering that question.

In [None]:
# YOUR CODE HERE


The operation above created a `Series` of boolean values based on the `country` of each entry. This series can then be used inside of `.loc` to select the appropriate data.
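
As a minimal sketch of this pattern on made-up data, the comparison below produces a boolean `Series` that `.loc` then uses as a row filter. The name `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "country": ["Italy", "US", "Italy", "France"],
    "points": [87, 90, 92, 85],
})

# One True/False value per row
is_italy = toy.country == "Italy"

# .loc keeps only the rows where the mask is True
italian = toy.loc[is_italy]

print(is_italy.tolist())
print(italian)
```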

In [None]:
# YOUR CODE HERE


Awesome! Now for one more condition. We wanted to know which of the wines were better than average, so we can base our judgement on the points for each wine. Wines are reviewed on an 80-100 point scale, and we can find the mean of all the points with `reviews.points.mean()`.

Let's bring both of these conditions together with the ampersand symbol (&) and save our filtered table into `topItalyWines`.
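
Combining two conditions can be sketched on made-up data as below. The name `toy` and its values are hypothetical. Note the parentheses around each comparison: `&` binds more tightly than `==` and `>`, so leaving them out changes the meaning of the expression.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "country": ["Italy", "US", "Italy", "France", "Brazil"],
    "points": [87, 90, 92, 85, 91],
})

average = toy.points.mean()

# Each comparison is wrapped in parentheses before combining with &
italian_above_avg = toy.loc[(toy.country == "Italy") & (toy.points > average)]

print(average)
print(italian_above_avg)
```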

In [None]:
# YOUR CODE HERE


topItalyWines

What if we wanted to have the same conditions but we would like to include another `country`, say `Brazil`?

You can replace the equality check in the first condition with the `isin()` method, called on the `country` column with a list of countries as its argument. Save the results in the variable `topItalyBrazilWines`.
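
Here is a minimal sketch of `isin()` on made-up data. The name `toy` and its values are hypothetical; the exercise itself should still be done on `reviews`.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "country": ["Italy", "US", "Italy", "France", "Brazil"],
    "points": [87, 90, 92, 85, 91],
})

# True for every row whose country appears in the list
in_either = toy.country.isin(["Italy", "Brazil"])

# Combine with the above-average condition as before
above_avg_in_either = toy.loc[in_either & (toy.points > toy.points.mean())]

print(in_either.tolist())
print(above_avg_in_either)
```

Using `isin()` keeps the condition short as the list of countries grows, compared with joining several equality checks with `|`.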

In [None]:
# YOUR CODE HERE


topItalyBrazilWines

Amazing. As you progress in your ability to filter, you will be able to answer more and more complex questions that are critical to your analysis. That is why good data scientists must not only know how to code, but must also possess the creativity to ask the right questions.

## <a name="summary">Summary</a>
To conclude, you should now be able to:

1. Understand the difference between `.iloc` and `.loc`.
2. Slice and select your data through index, labels and conditions.
3. Obtain data by columns and by rows.

Congratulations, that concludes this lesson. In the next tutorial, we'll learn how to clean up our data and deal with missing values.

See you!

## <a name="reference">Reference</a>
* [Dataset Source](https://www.kaggle.com/zynicide/wine-reviews)

<font size=2>[Back to Top](#top)</font>