# <center> Data Accessing & Aggregation </center>

* [Selecting data by row, column and index values](#section_1)
* [Filtering data with conditions](#section_2)
* [Aggregating and sorting data](#section_3)

In this lesson, we will learn about the main tools that data professionals use to query and explore DataFrames.

To demonstrate how to query and explore a Pandas DataFrame object, we will need a toy dataset. Let's build one about country information and call it `df_countries_info`. 

In [1]:
# Import pandas


In [16]:
# Create a countries information DataFrame. Refer to lesson video for details

# Display DataFrame


In this DataFrame, we have information about 25 countries. Each country has numerical values such as its "population" and text values such as "main language", "region" and "name".

Also notice that we replaced the default numerical index with a custom index, the `ISO code`, which is a two-letter code representing the country.
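The lesson's full 25-country dataset is not reproduced here, so the following is only a minimal sketch of how such a DataFrame could be built. The column names (`name`, `region`, `main_language`, `population`), the index name `iso_code`, and the six rows are assumptions for illustration; the population figures are rough placeholders in millions, not authoritative statistics.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset.
# Populations are rough placeholder figures in millions.
records = [
    ("AU", "Australia", "Oceania", "English", 26),
    ("BR", "Brazil", "South America", "Portuguese", 215),
    ("CN", "China", "Asia", "Mandarin", 1400),
    ("NZ", "New Zealand", "Oceania", "English", 5),
    ("GB", "United Kingdom", "Europe", "English", 67),
    ("VN", "Vietnam", "Asia", "Vietnamese", 98),
]

df_countries_info = pd.DataFrame(
    records,
    columns=["iso_code", "name", "region", "main_language", "population"],
).set_index("iso_code")  # replace the default 0..n-1 index with the ISO code

print(df_countries_info)
```

After `set_index`, the ISO code is no longer a regular column: it becomes the row label used by label-based selection.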

Let's use this dataset to learn about all the different ways that we can use to select records from a Pandas DataFrame.

### Selecting Data by Row, Column and Index Values <a class="anchor" id="section_1"></a>

Pandas provides multiple ways to select a group of rows and columns by label or by position using the [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) indexers, which are used with square brackets rather than called like functions. `loc` is label-based: you specify rows and columns by their labels. `iloc` is integer position-based: you specify rows and columns by their 0-based integer positions.

Let's see an example by selecting the record for the country "China" in two ways: first with [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and then with [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html).
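As a minimal sketch of the two approaches, assume a small hypothetical stand-in for `df_countries_info` (six made-up rows, placeholder populations in millions, assumed column names). In this toy frame China has the label `"CN"` and sits at position 2; in the lesson dataset the position will differ.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# Select one record by its index label
by_label = df_countries_info.loc["CN"]

# Select the same record by its 0-based integer position
by_position = df_countries_info.iloc[2]

print(by_label)
print(by_label.equals(by_position))
```

Selecting a single row this way returns a Series whose index is the column names.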

In [1]:
# Select one record by labels


# Select one record by index integer


We can see that these two examples gave us the same result. In the first we selected by index label, and in the second by integer position.

Similarly, you can pass a list of labels or a list of integer positions. This time we will pass a list of countries to select "China" again, along with "New Zealand" and the "UK", as below:
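A sketch of list-based selection, again on a hypothetical six-row stand-in for `df_countries_info` (assumed column names, placeholder values). Note that the ISO two-letter code for the United Kingdom is `GB`.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# Select a list of records by label (note the double brackets: a list inside the indexer)
by_labels = df_countries_info.loc[["CN", "NZ", "GB"]]

# Select the same records by integer position (positions are specific to this toy frame)
by_positions = df_countries_info.iloc[[2, 3, 4]]

print(by_labels)
```

Passing a list returns a DataFrame, with rows in the order given in the list.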

In [2]:
# Select a list of records


# Select a list of records by integer position


One limitation of this approach is that users need to know the exact position or index label of each record they want to select.

In another scenario, you may not know the exact label or position of every record you need, but you want to select a range of records from one index value to another. We can do that with [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html), as in the example below:
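The sketch below shows range selection on a hypothetical six-row stand-in for `df_countries_info` (assumed column names, placeholder values). The positions used with `iloc` are specific to this toy frame. A `loc` slice includes both endpoints, while an `iloc` slice stops before its end position.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# Label-based slice: both "CN" and "VN" are included
range_by_label = df_countries_info.loc["CN":"VN"]

# Position-based slice: starts at position 2 and stops BEFORE position 6
range_by_position = df_countries_info.iloc[2:6]

print(range_by_label)
```

Because of this difference, the stop value passed to `iloc` is one past the position of the last row you want.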

In [3]:
# Select a range of values by label


# Select a range of values by integer position


The first command gave us a subset where China is the first country and Vietnam is the last. We achieved the same output with [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) by specifying the range from numerical position 10 to position 15. Note that a `loc` slice includes both endpoints, whereas an `iloc` slice excludes its stop position, as in ordinary Python slicing.

These examples showed us how to use the [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) indexers to select DataFrame records by index label or by integer position.

You can also specify which columns to display, using either the column labels or their integer positions. Let's modify this example to also select specific columns, as below:
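As a sketch of selecting rows and columns together, both indexers accept a row selector followed by a column selector, separated by a comma. The frame is a hypothetical six-row stand-in for `df_countries_info`; the column choice (`name` and `population`) is arbitrary and only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# Rows "CN" through "VN", columns chosen by label
subset_by_label = df_countries_info.loc["CN":"VN", ["name", "population"]]

# Same rows and columns chosen by integer position
# ("name" is column 0 and "population" is column 3 in this toy frame)
subset_by_position = df_countries_info.iloc[2:6, [0, 3]]

print(subset_by_label)
```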

In [4]:
# Select a range of records and a list of columns by label


# Select a range of records and a list of columns by integer position


From the result, we can see that we selected the same range of records, from China to Vietnam, but kept only a specific list of columns.

The [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) version gives the same result, selecting the same rows and columns by their integer positions.

One last thing to note is that you can skip [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) altogether when you want the entire dataset or only specific columns of your DataFrame. Let's have a look at the example below:
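A short sketch of plain bracket selection, using a hypothetical six-row stand-in for `df_countries_info` (assumed column names, placeholder values). A list of column names returns a DataFrame, while a single column name returns a Series.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# A list of column names keeps all rows and returns a DataFrame
region_and_population = df_countries_info[["region", "population"]]

# A single column name returns a Series
region_only = df_countries_info["region"]

print(region_and_population)
```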

In [5]:
# Select `region` and `population` columns without using the loc and iloc functions


### Filtering Data with Conditions <a class="anchor" id="section_2"></a>

We can also select rows by adding filter conditions that match only a subset of records. When several conditions are combined, each one must be wrapped in parentheses `()`, because the `&` (`AND`) and `|` (`OR`) operators used to combine them bind more tightly than comparison operators such as `==`.

Let's have a look at an example by selecting all the countries that have English as their main language.

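To keep only the rows where the main language is English, we write the condition as a comparison inside the selection brackets. Wrapping it in parentheses is optional for a single condition, but it makes it easier to add more conditions later.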
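As a sketch of a single-condition filter, the comparison produces a boolean Series (one `True`/`False` per row), and passing it inside the brackets keeps only the `True` rows. The frame is a hypothetical six-row stand-in for `df_countries_info`, so the matching countries differ from the lesson output.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# Boolean mask: True where the main language is English
is_english = (df_countries_info["main_language"] == "English")

# Keep only the rows where the mask is True
english_countries = df_countries_info[is_english]

print(english_countries)
```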

In [6]:
# Select records based on one condition


You can see here that these countries are in multiple regions. Some of them are in "Africa", some in "North America", "Europe" and "Oceania". 

We can add another condition: say we want "English" as the "main language" and "Oceania" as the "region".

We can modify the query by using the `AND` operator (`&`) to add the second condition and further limit our subset to records that satisfy both.
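A minimal sketch of combining two conditions with `&`, on a hypothetical six-row stand-in for `df_countries_info` (assumed column names, placeholder values). Each condition sits in its own parentheses.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# Both conditions must be True for a row to be kept
english_in_oceania = df_countries_info[
    (df_countries_info["main_language"] == "English")
    & (df_countries_info["region"] == "Oceania")
]

print(english_in_oceania)
```

Python's `and` keyword does not work here, because it cannot compare two Series element by element; `&` is the element-wise operator.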

In [7]:
# Select records based on a list of conditions using the AND operator


If only one of the conditions needs to be true, rather than both, we can use the `OR` operator (`|`) instead, as follows:
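A sketch of the `|` operator on a hypothetical six-row stand-in for `df_countries_info`. The second condition used here (`region` equal to `"Asia"`) is an assumption chosen so that the result visibly differs from the English-only filter in this toy frame; the lesson may use a different condition.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# A row is kept if at least one of the conditions is True
english_or_asia = df_countries_info[
    (df_countries_info["main_language"] == "English")
    | (df_countries_info["region"] == "Asia")
]

print(english_or_asia)
```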

In [8]:
# Select records based on a list of conditions using the OR operator


<br>

### Aggregating and Sorting Data <a class="anchor" id="section_3"></a>

We can also use Pandas to sort values and to compute summary values per group of records. Pandas provides a built-in method called [`sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html). It takes several parameters, the main one being the column to sort by.

If the sorting column is a numerical one, then it can be sorted from the smallest value to the largest value or the other way around; if it's a date column then it can be sorted from earliest to latest, and if it's a text column, it can be sorted alphabetically. 
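The sort directions described above can be sketched with the `ascending` parameter of `sort_values()`, which defaults to `True`. The frame is a hypothetical six-row stand-in for `df_countries_info` (assumed column names, placeholder populations in millions).

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# Numerical column, smallest to largest (the default)
smallest_first = df_countries_info.sort_values("population")

# Numerical column, largest to smallest
largest_first = df_countries_info.sort_values("population", ascending=False)

# Text column, sorted alphabetically
alphabetical = df_countries_info.sort_values("name")

print(largest_first)
```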

Let's see an example of how we can sort by the "population" column.

In [9]:
# Sort all records that have `English` as `main_language` and `region` as `Oceania` by `population`


Keep in mind that [`sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) can be applied either to the result of a query (like the example we just had) or to the entire DataFrame.

For example, if we call [`sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) directly on the DataFrame, we get the entire DataFrame sorted by the population column.
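Both uses can be sketched side by side on a hypothetical six-row stand-in for `df_countries_info` (assumed column names, placeholder values): chaining `sort_values()` onto a filtered result, and calling it on the whole DataFrame. By default the method returns a new, sorted DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# Sort the result of a query: filter first, then sort what is left
sorted_query = df_countries_info[
    (df_countries_info["main_language"] == "English")
    & (df_countries_info["region"] == "Oceania")
].sort_values("population")

# Sort the entire DataFrame by population
sorted_all = df_countries_info.sort_values("population")

print(sorted_query)
print(sorted_all)
```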

In [10]:
# Sort the entire DataFrame by population


Finally, we will learn how to calculate a summary value for a specific group of records.

For example, you may want the total population or the average population per group of records.

To do that, Pandas provides the [`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method, which splits the records into groups and lets us calculate a summary value for each group.

Let's make a query to summarize the total "population" per "main_language". 
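A minimal sketch of this query on a hypothetical six-row stand-in for `df_countries_info` (assumed column names, placeholder populations in millions, so the totals are illustrative only): group by `main_language`, pick the `population` column, then sum within each group.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# Total population per main language
population_by_language = (
    df_countries_info.groupby("main_language")["population"].sum()
)

print(population_by_language)
```

The result is a Series indexed by the group key, with one row per distinct language.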

In [13]:
# Summarize population by main_language


We can also calculate the population per region.
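The same pattern can be sketched for regions. This example also computes the average alongside the total by passing a list of aggregation names to `agg()`; that choice is an illustration, not necessarily what the lesson does. The frame is a hypothetical six-row stand-in for `df_countries_info` with placeholder populations in millions.

```python
import pandas as pd

# Hypothetical stand-in for the lesson dataset (placeholder values).
df_countries_info = pd.DataFrame(
    {
        "name": ["Australia", "Brazil", "China", "New Zealand", "United Kingdom", "Vietnam"],
        "region": ["Oceania", "South America", "Asia", "Oceania", "Europe", "Asia"],
        "main_language": ["English", "Portuguese", "Mandarin", "English", "English", "Vietnamese"],
        "population": [26, 215, 1400, 5, 67, 98],
    },
    index=pd.Index(["AU", "BR", "CN", "NZ", "GB", "VN"], name="iso_code"),
)

# Total and average population per region, one column per aggregation
population_by_region = (
    df_countries_info.groupby("region")["population"].agg(["sum", "mean"])
)

print(population_by_region)
```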

In [14]:
# Summarize population by region


In this lesson, we have learned the main techniques for exploring and querying data in our DataFrames.

We have learned how to select a subset of records based on their index value or column name, how to add one or more conditions to filter results, how to sort those results by specific columns, and how to group them using the `groupby()` method.

Want to learn more cool stuff about the Pandas library? Keep learning!