![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - The Pandas Library - Grouping and Sorting

*Basic initialization of the workspace.*

In [1]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.19.5


In [2]:
!python -m pip install pandas
import pandas as pd
print ("Pandas installed at version: {}".format(pd.__version__))

#adjust pandas DataFrame display for a wider target 
pd.set_option('display.expand_frame_repr', False)

Pandas installed at version: 1.1.5


Load sample data for processing:

In [3]:
# load EU quality of life indicator 
quality_of_life_data_frame = pd.read_csv(
    "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/EU_Quality_Of_Life.csv",
 )
print(
   "Loaded quality of life data frame with shape {}".format(
       quality_of_life_data_frame.shape
   ) 
)

Loaded quality of life data frame with shape (6315, 6)


## 1. Data Grouping

Data grouping is an essential step in data exploration, analysis and decision making. It lets analysts see the bigger picture by consolidating granular data into summary information.

Exploring the data before grouping, we can identify the following fields:

*   **isced11** - encodes the **International Standard Classification of Education (ISCED) 2011** data associated with the observation;
*   **sex** - encodes the sex value associated with the observation (M/F);
*   **age** - encodes various age segments associated with the observation;
*   **geo** - represents the country code for the observation;
*   **TIME_PERIOD** - represents the year of the observation (2013/2018);
*   **OBS_VALUE** - represents the perceived quality of life indicator value (higher is better).

In [4]:
# display a data sample
print(
    "A sample of the quality of life data is: \n{}".format(
        quality_of_life_data_frame.iloc[0:10]
    )
)

A sample of the quality of life data is: 
  isced11 sex     age geo  TIME_PERIOD  OBS_VALUE
0   ED0-2   F  Y16-24  AT         2013        8.2
1   ED0-2   F  Y16-24  AT         2018        8.0
2   ED0-2   F  Y16-24  BE         2013        7.7
3   ED0-2   F  Y16-24  BE         2018        7.8
4   ED0-2   F  Y16-24  BG         2013        5.5
5   ED0-2   F  Y16-24  BG         2018        6.3
6   ED0-2   F  Y16-24  CH         2013        NaN
7   ED0-2   F  Y16-24  CH         2018        8.4
8   ED0-2   F  Y16-24  CY         2013        7.5
9   ED0-2   F  Y16-24  CY         2018        7.9


We observe that some records have no observation value (NaN), so we drop them:

In [5]:
# remove records where data is empty
clean_quality_of_life_data_frame = quality_of_life_data_frame.drop(
    quality_of_life_data_frame.index[np.isnan(quality_of_life_data_frame["OBS_VALUE"])],
    axis = 0
 )

print(
    "Removing empty data has dropped {} records".format(
        quality_of_life_data_frame.shape[0] - clean_quality_of_life_data_frame.shape[0]
    )
)

Removing empty data has dropped 131 records
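
The same cleanup can likely be written more compactly with the `dropna` method of the data frame. The sketch below is only an illustration: it uses a small hypothetical frame (`toy_data_frame`) instead of the course data, and assumes that missing values are only of interest in the `OBS_VALUE` column.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame that mimics the structure of the course data
toy_data_frame = pd.DataFrame({
    "geo": ["AT", "AT", "CH", "CH"],
    "TIME_PERIOD": [2013, 2018, 2013, 2018],
    "OBS_VALUE": [8.2, 8.0, np.nan, 8.4],
})

# drop the rows whose OBS_VALUE is missing
clean_toy_data_frame = toy_data_frame.dropna(subset=["OBS_VALUE"])

print(
    "Removing empty data has dropped {} records".format(
        toy_data_frame.shape[0] - clean_toy_data_frame.shape[0]
    )
)
```

The `subset` parameter restricts the check to the listed columns, so rows with missing values elsewhere would be kept.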


In [6]:
# remove redundant values
previous_row_counts = clean_quality_of_life_data_frame.shape[0]
clean_quality_of_life_data_frame = clean_quality_of_life_data_frame[
     (clean_quality_of_life_data_frame["isced11"] != "TOTAL")
     &
     (clean_quality_of_life_data_frame["sex"] != "T")
     &
     (clean_quality_of_life_data_frame["age"] != "Y_GE16")
     &
     (clean_quality_of_life_data_frame["age"] != "Y_GE65")
    ]   

print(
    "Removing redundant data has dropped {} records".format(
        previous_row_counts - clean_quality_of_life_data_frame.shape[0]
    )
)

Removing redundant data has dropped 3721 records
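
When several values of the same column must be excluded, the chained comparisons can be shortened with `isin`. The following is a minimal sketch on a hypothetical toy frame; the names `toy_data_frame`, `redundant_ages` and `keep_mask` are illustrative and not part of the course material.

```python
import pandas as pd

# hypothetical toy frame with the same categorical columns as the course data
toy_data_frame = pd.DataFrame({
    "isced11": ["ED0-2", "TOTAL", "ED5-8", "ED0-2"],
    "sex": ["F", "M", "T", "M"],
    "age": ["Y16-24", "Y25-34", "Y35-49", "Y_GE16"],
    "OBS_VALUE": [8.2, 7.5, 7.9, 7.1],
})

# aggregate age segments that overlap with the detailed ones
redundant_ages = ["Y_GE16", "Y_GE65"]

# keep only the rows that are not totals in any dimension
keep_mask = (
    (toy_data_frame["isced11"] != "TOTAL")
    & (toy_data_frame["sex"] != "T")
    & ~toy_data_frame["age"].isin(redundant_ages)
)
filtered_toy_data_frame = toy_data_frame[keep_mask]

print(filtered_toy_data_frame)
```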


A simple grouping is to create one data group per country, which lets us extract aggregated data for a specific country.

The Pandas method that groups data for aggregation is [**groupby**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html). It returns a GroupBy object that can be iterated over in order to perform data processing:

In [7]:
# create a data grouping by country
# (a scalar key keeps the group keys as plain strings across pandas versions)
geo_groups = clean_quality_of_life_data_frame.groupby("geo")

# iterate over the data groups
for geo_key, geo_group in geo_groups:
    print(
        "The group with key [{}] has {} records".format(
            geo_key,
            geo_group.shape[0]
        )
    )

The group with key [AT] has 77 records
The group with key [BE] has 78 records
The group with key [BG] has 75 records
The group with key [CH] has 42 records
The group with key [CY] has 78 records
The group with key [CZ] has 75 records
The group with key [DE] has 78 records
The group with key [DK] has 71 records
The group with key [EE] has 77 records
The group with key [EL] has 78 records
The group with key [ES] has 78 records
The group with key [FI] has 76 records
The group with key [FR] has 78 records
The group with key [HR] has 73 records
The group with key [HU] has 77 records
The group with key [IE] has 76 records
The group with key [IS] has 70 records
The group with key [IT] has 77 records
The group with key [LT] has 75 records
The group with key [LU] has 73 records
The group with key [LV] has 76 records
The group with key [MT] has 75 records
The group with key [NL] has 78 records
The group with key [NO] has 78 records
The group with key [PL] has 78 records
The group with key [PT] h
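
If only the number of records per group is needed, the explicit loop can probably be replaced by the `size` method of the GroupBy object. This is a sketch on a hypothetical toy frame (`toy_data_frame`), not on the course data.

```python
import pandas as pd

# hypothetical toy frame with a country code column
toy_data_frame = pd.DataFrame({
    "geo": ["AT", "AT", "BE", "RO", "RO", "RO"],
    "OBS_VALUE": [8.2, 8.0, 7.7, 8.1, 8.0, 6.9],
})

# count the records of each group without iterating
toy_group_sizes = toy_data_frame.groupby("geo").size()

print(toy_group_sizes)
```

The result is a Series indexed by the group key, so a single count can be read with `toy_group_sizes["RO"]`.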

In [8]:
# display a group sample
group_key = "RO"
print(
    "The group with key {} has the following data \n {}".format(
        group_key,
        geo_groups.get_group(group_key)
    )
)

The group with key RO has the following data 
      isced11 sex     age geo  TIME_PERIOD  OBS_VALUE
52     ED0-2   F  Y16-24  RO         2013        8.1
53     ED0-2   F  Y16-24  RO         2018        8.0
118    ED0-2   F  Y25-34  RO         2013        6.9
119    ED0-2   F  Y25-34  RO         2018        7.1
158    ED0-2   F  Y25-64  RO         2018        6.8
...      ...  ..     ...  ..          ...        ...
3970   ED5-8   M  Y50-64  RO         2018        8.0
4034   ED5-8   M  Y65-74  RO         2013        7.6
4035   ED5-8   M  Y65-74  RO         2018        8.0
4197   ED5-8   M  Y_GE75  RO         2013        7.7
4198   ED5-8   M  Y_GE75  RO         2018        7.2

[76 rows x 6 columns]
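
To make clear what `get_group` returns, the sketch below compares it with a plain boolean filter on a hypothetical toy frame. Both approaches should select the same rows and keep the original index labels.

```python
import pandas as pd

# hypothetical toy frame with a country code column
toy_data_frame = pd.DataFrame({
    "geo": ["AT", "RO", "BE", "RO"],
    "OBS_VALUE": [8.2, 8.1, 7.7, 6.9],
})

# rows of one group, extracted from the GroupBy object
rows_by_group = toy_data_frame.groupby("geo").get_group("RO")

# the same rows, selected with a boolean mask
rows_by_mask = toy_data_frame[toy_data_frame["geo"] == "RO"]

print(rows_by_group)
```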


The advantage of data groups is that they allow their data to be aggregated with different aggregation functions. Aggregation is available through the [**agg**](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) method of the GroupBy object.

The method takes the column to aggregate along with the aggregation functions to apply.

The most relevant aggregation functions supported by the agg method are:

*  **count** - returns the count of records;
*  **min** - returns the minimum value;
*  **mean** - returns the average value;
*  **median** - returns the median value;
*  **max** - returns the maximum value.

For example, we can aggregate the values of "OBS_VALUE" with different aggregation functions over the country (geo) values: 

In [9]:
# extract the aggregation values for the OBS_VALUE
print(
    "The OBS_VALUE aggregations over 'geo' are \n {}".format(
        geo_groups.agg({"OBS_VALUE": ["min", "mean", "median", "max"]})
    )
)

The OBS_VALUE aggregations over 'geo' are 
     OBS_VALUE                      
          min      mean median  max
geo                                
AT        6.5  7.850649   8.00  8.7
BE        6.4  7.556410   7.60  8.2
BG        3.3  5.090667   5.00  7.1
CH        6.6  7.890476   8.00  8.7
CY        4.9  6.743590   6.80  7.9
CZ        5.0  7.045333   7.10  8.3
DE        6.1  7.282051   7.40  8.3
DK        6.6  7.900000   7.90  8.8
EE        5.6  6.688312   6.50  8.0
EL        5.4  6.401282   6.40  7.5
ES        6.3  7.280769   7.30  8.2
FI        7.2  8.047368   8.10  8.5
FR        6.4  7.242308   7.20  8.2
HR        4.8  6.541096   6.50  8.1
HU        4.4  6.366234   6.50  7.9
IE        6.3  7.878947   7.90  8.9
IS        6.6  7.895714   7.90  8.5
IT        6.1  7.057143   7.10  8.1
LT        4.2  6.534667   6.50  8.5
LU        6.8  7.539726   7.60  8.4
LV        5.5  6.638158   6.50  7.8
MT        6.7  7.497333   7.60  8.1
NL        7.0  7.751282   7.80  8.2
NO        7.2  7.935
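
The dictionary form above produces two-level column labels. If flat column names are preferred, named aggregation is one option: each keyword argument of `agg` names an output column and maps it to a `(column, function)` pair. The sketch below uses a hypothetical toy frame, and the output names (`min_value`, `mean_value`, `max_value`) are illustrative.

```python
import pandas as pd

# hypothetical toy frame with a country code column
toy_data_frame = pd.DataFrame({
    "geo": ["AT", "AT", "BG", "BG"],
    "OBS_VALUE": [8.2, 8.0, 5.5, 6.3],
})

# named aggregation: output column name = (input column, function)
toy_aggregates = toy_data_frame.groupby("geo").agg(
    min_value=("OBS_VALUE", "min"),
    mean_value=("OBS_VALUE", "mean"),
    max_value=("OBS_VALUE", "max"),
)

print(toy_aggregates)
```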

It is also possible to aggregate on multiple levels:

In [10]:
# create a data grouping by country and time period
geo_time_period_groups = clean_quality_of_life_data_frame.groupby(
    ["geo", "TIME_PERIOD"]
)

print(
    "The OBS_VALUE aggregations over 'geo' and 'TIME_PERIOD' are \n {}".format(
        geo_time_period_groups.agg({"OBS_VALUE": ["min", "mean", "median", "max"]})
    )
)

The OBS_VALUE aggregations over 'geo' and 'TIME_PERIOD' are 
                 OBS_VALUE                      
                      min      mean median  max
geo TIME_PERIOD                                
AT  2013              6.7  7.808571   7.90  8.6
    2018              6.5  7.885714   8.00  8.7
BE  2013              6.4  7.550000   7.60  8.2
    2018              6.8  7.561905   7.60  8.2
BG  2013              3.3  4.791429   4.60  6.5
...                   ...       ...    ...  ...
SK  2018              4.8  6.838095   7.05  8.7
TR  2013              5.1  5.843478   5.80  6.9
    2018              5.1  5.900000   6.00  6.7
UK  2013              6.1  7.282857   7.40  8.0
    2018              6.8  7.576190   7.60  8.2

[65 rows x 4 columns]
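
A multi-level result can be easier to compare when one grouping level is moved into the columns. The sketch below, on a hypothetical toy frame, computes the mean per country and year and then uses `unstack` to place the years side by side.

```python
import pandas as pd

# hypothetical toy frame with country, year and observation value
toy_data_frame = pd.DataFrame({
    "geo": ["AT", "AT", "AT", "BG", "BG", "BG"],
    "TIME_PERIOD": [2013, 2013, 2018, 2013, 2018, 2018],
    "OBS_VALUE": [8.2, 7.8, 8.0, 5.5, 6.3, 6.5],
})

# mean per (country, year), with the year level pivoted into columns
toy_means_by_year = (
    toy_data_frame
    .groupby(["geo", "TIME_PERIOD"])["OBS_VALUE"]
    .mean()
    .unstack("TIME_PERIOD")
)

print(toy_means_by_year)
```

Each row is now a country and each column a year, so a change between years can be read across a single row.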


## 2. Data Sorting

The Pandas library offers extended capabilities for data sorting for both data series and data frames. The data sorting capabilities are available via the [**sort_values**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method. This method accepts the following relevant parameters:

*   **by** - the key(s) used for sorting;
*   **ascending** - if True, the sort order is ascending; otherwise it is descending. A list of booleans, one per key, sets the order for each key individually.

In [11]:
# sort values of the data frame by country (ascending)
# and year (descending)
print(
    "The data frame ordered by country and year is \n {}".format(
        clean_quality_of_life_data_frame.sort_values(
            by = ["geo", "TIME_PERIOD"],
            ascending = [True, False]
        )
    )
)

The data frame ordered by country and year is 
      isced11 sex     age geo  TIME_PERIOD  OBS_VALUE
1      ED0-2   F  Y16-24  AT         2018        8.0
67     ED0-2   F  Y25-34  AT         2018        7.7
132    ED0-2   F  Y25-64  AT         2018        7.3
166    ED0-2   F  Y35-49  AT         2018        7.5
232    ED0-2   F  Y50-64  AT         2018        7.1
...      ...  ..     ...  ..          ...        ...
3817   ED5-8   M  Y25-34  UK         2013        7.7
3915   ED5-8   M  Y35-49  UK         2013        7.4
3980   ED5-8   M  Y50-64  UK         2013        7.4
4045   ED5-8   M  Y65-74  UK         2013        8.0
4208   ED5-8   M  Y_GE75  UK         2013        7.8

[2463 rows x 6 columns]
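
Sorting also combines naturally with grouping, for example to rank countries by their mean indicator value. The following is a minimal sketch on a hypothetical toy frame; a Series has no `by` parameter because it holds a single column of values.

```python
import pandas as pd

# hypothetical toy frame with a country code column
toy_data_frame = pd.DataFrame({
    "geo": ["AT", "AT", "BG", "BG", "FI"],
    "OBS_VALUE": [8.2, 8.0, 5.5, 6.3, 8.5],
})

# mean value per country, highest first
toy_ranking = (
    toy_data_frame
    .groupby("geo")["OBS_VALUE"]
    .mean()
    .sort_values(ascending=False)
)

print(toy_ranking)
```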


The data can also be sorted by index value using the [**sort_index**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html) method:

In [12]:
# sort country means data by country codes descending
country_means_aggregate = geo_groups.agg({"OBS_VALUE": ["mean"]})
country_means_data_series = pd.Series(
    data = country_means_aggregate.values[:, 0],
    index = country_means_aggregate.index.values
)

print(
    "The sorted (descending) data by index is \n{}".format(
        country_means_data_series.sort_index(ascending = False)
    )
)

The sorted (descending) data by index is 
UK    7.442857
TR    5.879687
SK    6.810390
SI    7.110811
SE    7.852632
RS    5.541558
RO    7.352632
PT    6.905333
PL    7.551282
NO    7.935897
NL    7.751282
MT    7.497333
LV    6.638158
LU    7.539726
LT    6.534667
IT    7.057143
IS    7.895714
IE    7.878947
HU    6.366234
HR    6.541096
FR    7.242308
FI    8.047368
ES    7.280769
EL    6.401282
EE    6.688312
DK    7.900000
DE    7.282051
CZ    7.045333
CY    6.743590
CH    7.890476
BG    5.090667
BE    7.556410
AT    7.850649
dtype: float64
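
A data series of group means can also be obtained directly, by selecting the column on the GroupBy object before aggregating. This sketch uses a hypothetical toy frame and is meant as a shorter alternative, not as a replacement for the approach above.

```python
import pandas as pd

# hypothetical toy frame with a country code column
toy_data_frame = pd.DataFrame({
    "geo": ["AT", "AT", "BG", "BG", "FI"],
    "OBS_VALUE": [8.2, 8.0, 5.5, 6.3, 8.5],
})

# selecting one column before aggregating yields a Series indexed by country
toy_means_data_series = toy_data_frame.groupby("geo")["OBS_VALUE"].mean()

# sort the series by country code, descending
toy_sorted_by_index = toy_means_data_series.sort_index(ascending=False)

print(toy_sorted_by_index)
```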
