# <center> Data Accessing & Aggregation </center>

* [Selecting data by row, column and index values](#section_1)
* [Filtering data with conditions](#section_2)
* [Aggregating and sorting data](#section_3)

In [1]:
import pandas as pd

In [2]:
# Countries Information DataFrame
df_countries_info = pd.DataFrame(
    {'country_name': ['Egypt','Kenya','Morocco','Nigeria','South Africa','Brazil','Canada','Chile','Mexico','United States','China','India','Indonesia','Japan','Vietnam','Austria','Belgium','France','Italy','United Kingdom','Australia','Fiji','New Zealand','Tonga','Tuvalu'],
     'region': ['Africa','Africa','Africa','Africa','Africa','South Americas','North Americas','South Americas','North Americas','North Americas','Asia','Asia','Asia','Asia','Asia','Europe','Europe','Europe','Europe','Europe','Oceania','Oceania','Oceania','Oceania','Oceania'],
     'population':[100388073,52573973,36471769,200963599,58558270,211049527,37411047,18952038,127575529,329064917,1433783686,1366417754,270625568,126860301,96462106,8955102,11539328,65129728,60550075,67530172,25203198,889953,4783063,110940,11646],
     'main_language':['Arabic','English','Arabic','English','English','Portuguese','English','Spanish','Spanish','English','Mandarin','Hindi','Indonesian','Japanese','Vietnamese','German','Dutch','French','Italian','English','English','English','English','English','English']}, 
     index = ['EG','KE','MA','NG','ZA','BR','CA','CL','MX','US','CN','IN','ID','JP','VN','AT','BE','FR','IT','GB','AU','FJ','NZ','TO','TV'])

# Display DataFrame
df_countries_info

Unnamed: 0,country_name,region,population,main_language
EG,Egypt,Africa,100388073,Arabic
KE,Kenya,Africa,52573973,English
MA,Morocco,Africa,36471769,Arabic
NG,Nigeria,Africa,200963599,English
ZA,South Africa,Africa,58558270,English
BR,Brazil,South Americas,211049527,Portuguese
CA,Canada,North Americas,37411047,English
CL,Chile,South Americas,18952038,Spanish
MX,Mexico,North Americas,127575529,Spanish
US,United States,North Americas,329064917,English


### Selecting Data by Row, Column and Index Values <a class="anchor" id="section_1"></a>

The pandas library provides several ways to select rows and columns by label or by position, chiefly through the `loc` and `iloc` indexers. `loc` is label-based: rows and columns are specified by their row and column labels. `iloc` is position-based: rows and columns are specified by their 0-based integer positions.

In [3]:
# Select one record by index label
df_countries_info.loc['CN']

# Select the same record by integer position (0-based, so 10 is the 11th row)
df_countries_info.iloc[10]

country_name          China
region                 Asia
population       1433783686
main_language      Mandarin
Name: CN, dtype: object

In [4]:
# Select a list of records by index label
df_countries_info.loc[['CN', 'GB', 'NZ']]

# Select the same records by integer position
df_countries_info.iloc[[10,19,22]]

Unnamed: 0,country_name,region,population,main_language
CN,China,Asia,1433783686,Mandarin
GB,United Kingdom,Europe,67530172,English
NZ,New Zealand,Oceania,4783063,English


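In [5]:
# Select a range of records by index label (the end label 'VN' is included)
df_countries_info.loc['CN':'VN']

# Select the same range by integer position (the end position 15 is excluded)
df_countries_info.iloc[10:15]

Unnamed: 0,country_name,region,population,main_language
CN,China,Asia,1433783686,Mandarin
IN,India,Asia,1366417754,Hindi
ID,Indonesia,Asia,270625568,Indonesian
JP,Japan,Asia,126860301,Japanese
VN,Vietnam,Asia,96462106,Vietnamese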
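
One detail of range selection is easy to miss: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, as with ordinary Python slicing. The sketch below illustrates this on a small hypothetical frame, `df_demo`, which is not part of the original notebook and reuses three of its population values.

```python
import pandas as pd

# Hypothetical three-row frame, used only to illustrate slice endpoints
df_demo = pd.DataFrame(
    {'population': [1433783686, 1366417754, 270625568]},
    index=['CN', 'IN', 'ID'])

# Label-based slice: both endpoints ('CN' and 'IN') are included
by_label = df_demo.loc['CN':'IN']

# Position-based slice: position 2 is excluded, so rows 0 and 1 are returned
by_position = df_demo.iloc[0:2]

# Both selections contain the same two rows
print(by_label.equals(by_position))
```

This is why a label slice ending at the fifth label and a position slice ending at position 5 can return the same rows.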

In [6]:
# Select a range of records and a list of columns by label
df_countries_info.loc['CN':'VN', ['region', 'population']]

# Select the same records and columns by integer position
df_countries_info.iloc[10:15, [1,2]]

Unnamed: 0,region,population
CN,Asia,1433783686
IN,Asia,1366417754
ID,Asia,270625568
JP,Asia,126860301
VN,Asia,96462106


In [7]:
# Select columns by name (a list of names returns a DataFrame)
df_countries_info[['region', 'population']]

Unnamed: 0,region,population
EG,Africa,100388073
KE,Africa,52573973
MA,Africa,36471769
NG,Africa,200963599
ZA,Africa,58558270
BR,South Americas,211049527
CA,North Americas,37411047
CL,South Americas,18952038
MX,North Americas,127575529
US,North Americas,329064917
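
When selecting columns with square brackets, the type of the result depends on what is passed in. As a minimal sketch, assuming a small hypothetical frame `df_demo` that is not part of the original notebook, a single column name returns a Series, while a list of names returns a DataFrame even when the list holds only one name.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration
df_demo = pd.DataFrame(
    {'region': ['Africa', 'Asia'],
     'population': [100388073, 1433783686]},
    index=['EG', 'CN'])

# A single column name returns a Series
as_series = df_demo['population']

# A list of column names returns a DataFrame
as_frame = df_demo[['population']]

print(type(as_series).__name__, type(as_frame).__name__)
```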


### Filtering Data with Conditions <a class="anchor" id="section_2"></a>

We can also select rows by adding filter conditions that match only a subset of records. Each condition must be wrapped in parentheses `()`, because `&` and `|` bind more tightly than comparison operators such as `==`. Several conditions can then be combined with `&` (AND) or `|` (OR).

In [8]:
# Select records that match either condition (| means OR)
df_countries_info.loc[(df_countries_info['main_language']=='English') | 
                      (df_countries_info['region']=='Oceania')]

Unnamed: 0,country_name,region,population,main_language
KE,Kenya,Africa,52573973,English
NG,Nigeria,Africa,200963599,English
ZA,South Africa,Africa,58558270,English
CA,Canada,North Americas,37411047,English
US,United States,North Americas,329064917,English
GB,United Kingdom,Europe,67530172,English
AU,Australia,Oceania,25203198,English
FJ,Fiji,Oceania,889953,English
NZ,New Zealand,Oceania,4783063,English
TO,Tonga,Oceania,110940,English
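
The difference between `|`, `&` and the negation operator `~` can be sketched on a small hypothetical frame, `df_demo`, which is not part of the original notebook. Storing each condition in a named boolean Series first is one way to keep the filter readable; it also sidesteps the need for parentheses around inline comparisons.

```python
import pandas as pd

# Hypothetical four-row frame, used only for illustration
df_demo = pd.DataFrame(
    {'region': ['Africa', 'Oceania', 'Europe', 'Oceania'],
     'main_language': ['English', 'English', 'French', 'English']},
    index=['KE', 'FJ', 'FR', 'NZ'])

# Each comparison yields a boolean Series aligned with the index
is_english = df_demo['main_language'] == 'English'
in_oceania = df_demo['region'] == 'Oceania'

# OR: rows that satisfy at least one condition
either = df_demo.loc[is_english | in_oceania]

# AND: rows that satisfy both conditions
both = df_demo.loc[is_english & in_oceania]

# NOT: rows that satisfy neither condition
neither = df_demo.loc[~(is_english | in_oceania)]

print(list(either.index), list(both.index), list(neither.index))
```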


### Aggregating and Sorting Data <a class="anchor" id="section_3"></a>

We can use pandas built-in methods to sort values and to compute summary values, such as totals, for each group of records.

In [9]:
# Filter on both conditions (& means AND), then sort by population in ascending order
df_countries_info.loc[(df_countries_info['main_language']=='English') & 
                      (df_countries_info['region']=='Oceania')].sort_values(by='population')

Unnamed: 0,country_name,region,population,main_language
TV,Tuvalu,Oceania,11646,English
TO,Tonga,Oceania,110940,English
FJ,Fiji,Oceania,889953,English
NZ,New Zealand,Oceania,4783063,English
AU,Australia,Oceania,25203198,English


In [10]:
# Sort the whole DataFrame by population (ascending by default)
df_countries_info.sort_values('population')

Unnamed: 0,country_name,region,population,main_language
TV,Tuvalu,Oceania,11646,English
TO,Tonga,Oceania,110940,English
FJ,Fiji,Oceania,889953,English
NZ,New Zealand,Oceania,4783063,English
AT,Austria,Europe,8955102,German
BE,Belgium,Europe,11539328,Dutch
CL,Chile,South Americas,18952038,Spanish
AU,Australia,Oceania,25203198,English
MA,Morocco,Africa,36471769,Arabic
CA,Canada,North Americas,37411047,English
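
`sort_values()` also accepts an `ascending` argument and a list of columns. The following is a minimal sketch on a hypothetical four-row frame, `df_demo`, which is not part of the original notebook. It sorts from largest to smallest population, and then by region with the largest population first within each region.

```python
import pandas as pd

# Hypothetical four-row frame, used only for illustration
df_demo = pd.DataFrame(
    {'region': ['Oceania', 'Europe', 'Oceania', 'Europe'],
     'population': [4783063, 8955102, 25203198, 65129728]},
    index=['NZ', 'AT', 'AU', 'FR'])

# Descending order: largest population first
largest_first = df_demo.sort_values(by='population', ascending=False)

# Two sort keys: region A to Z, then population from largest to smallest
by_region_then_size = df_demo.sort_values(
    by=['region', 'population'], ascending=[True, False])

print(list(largest_first.index))
print(list(by_region_then_size.index))
```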


In [11]:
# Total population size by main language
df_countries_info.groupby('main_language').population.sum()

main_language
Arabic         136859842
Dutch           11539328
English        777100778
French          65129728
German           8955102
Hindi         1366417754
Indonesian     270625568
Italian         60550075
Japanese       126860301
Mandarin      1433783686
Portuguese     211049527
Spanish        146527567
Vietnamese      96462106
Name: population, dtype: int64

In [12]:
# Total population size by region
df_countries_info.groupby('region').population.sum()

region
Africa             448955684
Asia              3294149415
Europe             213704405
North Americas     494051493
Oceania             30998800
South Americas     230001565
Name: population, dtype: int64
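
A grouped column is not limited to a single `sum()`. As a hedged sketch on a hypothetical frame, `df_demo`, which is not part of the original notebook, `agg()` can compute several summaries at once, and the result can then be sorted like any other DataFrame.

```python
import pandas as pd

# Hypothetical four-row frame, used only for illustration
df_demo = pd.DataFrame(
    {'region': ['Oceania', 'Europe', 'Oceania', 'Europe'],
     'population': [4783063, 8955102, 25203198, 65129728]},
    index=['NZ', 'AT', 'AU', 'FR'])

# Total, average and number of records per region, largest total first
summary = (df_demo.groupby('region')['population']
           .agg(['sum', 'mean', 'count'])
           .sort_values(by='sum', ascending=False))

print(summary)
```

The result has one row per region and one column per summary function.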

Finally, we can query summary values for each group of records. In our toy example, we may want the total population size per region or per main language. This is done by calling the pandas `groupby()` method on the grouping column and then applying a summary function, such as `sum()`, to the numerical column. The two `groupby()` examples above calculate the total population size per main language and per geographic region.

Sometimes we need to sort a DataFrame or a query result by numeric, alphabetical, or date values. The `sort_values()` method does this: it takes the name of the column to sort by as its required `by` argument and sorts in ascending order by default. The sorting examples above order the filtered Oceania records, and then the whole DataFrame, by population size.

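In the filtering example above, we retrieve all records whose main language is English or whose region is Oceania, using the `|` operator. Replacing `|` with `&`, as in the first sorting example, keeps only the records that satisfy both conditions. Because no column selection is passed to `loc`, all DataFrame columns are returned.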

In the first selection example above, we apply both `loc` and `iloc` to select a record by its index label 'CN' or by its integer position 10 (the 11th row, because positions are 0-based). Both return a Series holding China's information.

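We can also select rows and columns together by passing both row and column labels to `loc`, or both row and column positions to `iloc`, as in the example above that returns the `region` and `population` columns for the records from `CN` to `VN`.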

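We can also pass a range of index labels or position values using the colon sign `:`, as in the range-selection example above. Note that a `loc` slice includes its end label, whereas an `iloc` slice excludes its end position.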

We can also pass a list of index labels or position values, as in the example above that selects the records for China, the United Kingdom and New Zealand.

So far in this course, you have learned different pandas techniques for processing data and getting it ready for analysis. This section of the course covers how to explore your data by performing the three tasks listed at the top: selecting data by row, column and index values; filtering data with conditions; and aggregating and sorting data.

To demonstrate these tasks, we use the toy DataFrame `df_countries_info`, defined at the start of this section, which holds information about different countries. Its index holds each country's ISO code, and its columns hold the country's name, region, population size and most common language.