<b><font size="5">Pandas Dataframes: Characteristics and Manipulation</font></b>
<br><br>
In this notebook, we will take a look at some additional aspects of what can be done with the very flexible constructs that are pandas dataframes. As usual, you are encouraged to complement your knowledge using the documentation that is available online:<br>
https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

### <font color='#BFD72F'>Table of Contents </font> <a class="anchor" id='toc'></a> 
- [0. Importing packages and Data](#P0) 
- [1. Data Exploration](#P1) 
    - [Attributes](#P1.1)
    - [Basic inspection](#P1.2)
    - [Descriptive statistics](#P1.3)
- [2. Grouping and Aggregation](#P2) 
- [3. Cross tabulation and Pivoting](#P3)
    - [CrossTab](#P3.1)
    - [Pivot Tables](#P3.2)
- [4. Grabbing subsets - Advanced options](#P4)
- [5. Try it out](#P5)

### <font color='#BFD72F'>0. Importing packages and Data</font> <a class="anchor" id="P0"></a>
  [Back to TOC](#toc)

The first step you should take in any Data Science project that uses Jupyter notebooks is importing the packages and data that you intend to use. In this notebook we will almost exclusively use the *pandas* package. However, in the real world, you will regularly find projects where you need to call dozens of different packages and/or functions.<br>
While importing packages as you go is possible, importing every package right from the start is considered a *good practice* as it will allow you to keep your notebooks clean and organized, which will definitely come in handy in situations where you need to revisit a notebook.

  <b>Step 0.1</b>: Import *pandas* and *numpy*

In [44]:
from sys import modules as notebook_modules

import pandas as pd
import numpy as np

print(f"Pandas version: {pd.__version__}\nNumPy version: {np.__version__}")

Pandas version: 2.2.2
NumPy version: 1.26.4


  <b>Step 0.2</b>: Import the csv file *pop_censuspt_2021.csv* that is stored in folder *data* and store it in variable *population*.

In [45]:
#always pay attention to the path and the separator used in the csv (Portuguese csvs by default use ";" instead of ",")
population = pd.read_csv(r"./data/pop_censuspt_2021.csv", sep=";")
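To see why the separator matters, here is a minimal sketch that reads the same text twice, once with the default separator and once with `sep=";"`. The three-column sample in `raw` is made up for illustration and is not the census file.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics a semicolon-separated file
raw = "District;Parish;Population\nAveiro;Santa Eulália;137\nBraga;Adaúfe;1026\n"

# Default separator (","): every line is read as one single column
wrong = pd.read_csv(io.StringIO(raw))

# Matching separator (";"): the three columns are recovered
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

A dataframe with a single, oddly named column right after loading is usually a sign that the separator is wrong.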

### <font color='#BFD72F'>1. Data Exploration</font> <a class="anchor" id="P1"></a>
  [Back to TOC](#toc)

Every problem has its own intricacies, so whenever you start working with new data, you should take the time to get familiar with it: what variables are present and what each variable means. In this notebook, we will use an adapted version of the Population Census 2021 collected from Instituto Nacional de Estatística (INE). For every parish, you will find the official population numbers stratified by sex and age bracket. For every parish, there are 8 rows (Age brackets: 0-14, 15-24, 25-64, 65 or older for both males and females).

In this section, we will use different pandas methods to explore the data and get a better understanding of it.

#### Attributes <a class="anchor" id="P1.1"></a>

A pandas dataframe is a two-dimensional data structure, i.e., it is a table with rows and columns, each having its own index. You can tell the number of rows and columns of a dataframe by using the `shape` attribute.

**Step 1.1** Use the `shape` attribute to find out the number of rows and columns of the dataframe.

In [46]:
#(number of rows, number of columns)
population.shape

(24728, 10)

To see the index of the dataset, you can use the `index` attribute of the dataframe.

**Step 1.2** Use the `index` attribute to see the index of the dataframe.

In [47]:
population.index

RangeIndex(start=0, stop=24728, step=1)
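As a small illustration of how `shape` and `index` relate, the sketch below uses a made-up three-row dataframe named `toy` (not the census data).

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame(
    {"Parish": ["Oura", "Girabolhos", "Adaúfe"], "Population": [100, 5, 1026]}
)

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = toy.shape
print(n_rows, n_cols)      # 3 2
print(len(toy) == n_rows)  # True

# Without an explicit index, pandas creates a RangeIndex starting at 0
print(toy.index)           # RangeIndex(start=0, stop=3, step=1)
print(list(toy.index))     # [0, 1, 2]
```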

To access the names of the columns in the dataset you can use the `columns` attribute on the dataframe.

**Step 1.3** Use the *columns* attribute to see the columns of the dataframe.

In [48]:
population.columns.tolist()

['Year',
 'Location_Code',
 'District',
 'Municipality',
 'Parish',
 'Sex',
 'Age',
 'Population',
 'Notes',
 'Population_corrected']

Using `dtypes` you can also access the names of the different columns whilst also knowing the datatype they contain.

**Step 1.4** Use the `dtypes` attribute to print the names of the columns and their datatypes.

In [49]:
population.dtypes

Year                      int64
Location_Code            object
District                 object
Municipality             object
Parish                   object
Sex                      object
Age                      object
Population                int64
Notes                   float64
Population_corrected      int64
dtype: object

In [50]:
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24728 entries, 0 to 24727
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Year                  24728 non-null  int64  
 1   Location_Code         24728 non-null  object 
 2   District              24728 non-null  object 
 3   Municipality          24728 non-null  object 
 4   Parish                24728 non-null  object 
 5   Sex                   24728 non-null  object 
 6   Age                   24728 non-null  object 
 7   Population            24728 non-null  int64  
 8   Notes                 0 non-null      float64
 9   Population_corrected  24728 non-null  int64  
dtypes: float64(1), int64(3), object(6)
memory usage: 1.9+ MB


In [51]:
#in case you want to change a dtype you can use the astype() method
#even though it makes more sense to keep year as an integer, here is the general idea:
population['Year'] = population['Year'].astype('float')
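The sketch below, on a made-up dataframe `toy`, shows the same conversion and how to undo it. It also illustrates that `astype()` returns a new object, so a column only changes when the result is assigned back.

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({"Year": [2021, 2021], "Population": [100, 5]})

toy["Year"] = toy["Year"].astype("float")  # int64 -> float64
print(toy["Year"].dtype)                   # float64

toy["Year"] = toy["Year"].astype("int64")  # back to an integer dtype
print(toy["Year"].dtype)                   # int64

# Without reassignment the original column keeps its dtype
as_text = toy["Population"].astype(str)
print(toy["Population"].dtype)             # int64
print(as_text.tolist())                    # ['100', '5']
```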

**Exercise**

In some cases, columns may come with a name that is not suitable for your analysis. For example, the column `Age` may intuitively imply that it should be numerical (as age is generally represented by a number). Later in this notebook, you will be able to tell column `Age` is constituted by different age brackets. In this case, it would be more appropriate to rename the column `Age` to `Age_bracket`.

**Step 1.5** Rename the column `Age` to `Age_bracket` in the `population` dataframe.

As there are multiple ways to rename columns, we will start by creating copies of the dataframe.<br>
**Step 1.5.0** Run the cell below to create 2 distinct copies of the `population` dataframe. 

In [52]:
#creating 2 copies of the population dataframe
population_copy_1 = population.copy()
population_copy_2 = population.copy()

**Step 1.5.1** Edit the name of column `Age` in `population_copy_1` exclusively using the `columns` attribute and lists.

**Step 1.5.1.1** Create a list of the column names of `population_copy_1` using the `columns` attribute.

In [53]:
column_list = population_copy_1.columns.tolist()
type(column_list)

list

**Step 1.5.1.2** Use list indexing to edit "Age" to "Age_bracket".

In [54]:
column_list[-4] = "Age_bracket"

**Step 1.5.1.3** Assign the updated list of column names as the new columns attribute of `population_copy_1` 

In [55]:
population_copy_1.columns = column_list
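The hard-coded position `-4` only works while `Age` stays fourth from the end. As a hedged alternative, the sketch below looks the position up by name with `list.index()`. The empty dataframe `toy` and its four columns are made up for illustration.

```python
import pandas as pd

# Hypothetical empty dataframe with a few of the column names
toy = pd.DataFrame(columns=["Year", "Sex", "Age", "Population"])

names = toy.columns.tolist()

# Find the position of "Age" by name instead of counting by hand
names[names.index("Age")] = "Age_bracket"

toy.columns = names
print(toy.columns.tolist())  # ['Year', 'Sex', 'Age_bracket', 'Population']
```

Note that `list.index()` raises a `ValueError` when the name is missing, which makes a misspelled column name fail loudly instead of silently renaming the wrong column.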

**Step 1.5.1.4** Check whether your changes took effect by looking at the columns of `population_copy_1`

In [56]:
population_copy_1

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Notes,Population_corrected
0,2021.0,170320,Vila Real,Chaves,Oura,F,65 and older,100,,100
1,2021.0,91205,Guarda,Seia,Girabolhos,M,0 - 14 year old,5,,5
2,2021.0,10416,Aveiro,Arouca,Santa Eulália,M,0 - 14 year old,137,,137
3,2021.0,31360,Braga,Vila Verde,União das freguesias da Ribeira do Neiva,M,0 - 14 year old,203,,203
4,2021.0,130526,Porto,Lousada,Vilar do Torno e Alentém,M,25 - 64 year old,346,,346
...,...,...,...,...,...,...,...,...,...,...
24723,2021.0,40542,Bragança,Macedo de Cavaleiros,"União das freguesias de Espadanedo, Edroso, Mu...",M,65 and older,76,,76
24724,2021.0,140503,Santarém,Benavente,Santo Estêvão,F,0 - 14 year old,124,,124
24725,2021.0,430101,Ilha Terceira,Angra do Heroísmo,Altares,F,25 - 64 year old,235,,235
24726,2021.0,171305,Vila Real,Vila Pouca de Aguiar,Capeludos,F,0 - 14 year old,5,,5


An alternative method that allows you to edit the name of one or more columns is through the `rename()` method. You can access its documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).

**Step 1.5.2.1** Use the method `rename()` to change `Age` to `Age_bracket` in `population_copy_2`.

In [57]:
#Note - You need to explicitly state that you want to change a column (axis=1), which column to change and its new name
population_copy_2.rename(mapper={"Age":"Age_bracket"}, axis=1, inplace=True)

# Alternatively, you can simply pass the dictionary to the columns parameter: rename(columns={"Age": "Age_bracket"})
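The two spellings of `rename()` can be compared side by side. This is a minimal sketch on a made-up one-row dataframe `toy`; it also shows that, without `inplace=True`, the original dataframe is left untouched.

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({"Age": ["0 - 14 year old"], "Notes": [None]})

# Spelling 1: generic mapper plus the axis it applies to
via_mapper = toy.rename(mapper={"Age": "Age_bracket"}, axis=1)

# Spelling 2: the columns keyword, no axis needed
via_columns = toy.rename(columns={"Age": "Age_bracket"})

print(via_mapper.columns.tolist())   # ['Age_bracket', 'Notes']
print(via_columns.columns.tolist())  # ['Age_bracket', 'Notes']

# rename() returned new dataframes; toy itself is unchanged
print(toy.columns.tolist())          # ['Age', 'Notes']
```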

**Step 1.5.2.2** Check whether your changes took effect by looking at the columns of `population_copy_2`.

In [58]:
population_copy_2

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Notes,Population_corrected
0,2021.0,170320,Vila Real,Chaves,Oura,F,65 and older,100,,100
1,2021.0,91205,Guarda,Seia,Girabolhos,M,0 - 14 year old,5,,5
2,2021.0,10416,Aveiro,Arouca,Santa Eulália,M,0 - 14 year old,137,,137
3,2021.0,31360,Braga,Vila Verde,União das freguesias da Ribeira do Neiva,M,0 - 14 year old,203,,203
4,2021.0,130526,Porto,Lousada,Vilar do Torno e Alentém,M,25 - 64 year old,346,,346
...,...,...,...,...,...,...,...,...,...,...
24723,2021.0,40542,Bragança,Macedo de Cavaleiros,"União das freguesias de Espadanedo, Edroso, Mu...",M,65 and older,76,,76
24724,2021.0,140503,Santarém,Benavente,Santo Estêvão,F,0 - 14 year old,124,,124
24725,2021.0,430101,Ilha Terceira,Angra do Heroísmo,Altares,F,25 - 64 year old,235,,235
24726,2021.0,171305,Vila Real,Vila Pouca de Aguiar,Capeludos,F,0 - 14 year old,5,,5


**Step 1.5.3** Using one of the methods above (or another method you can come up with), rename columns `Age` and `Notes` on the original dataframe `population` to `Age_bracket` and `Comments`. Run all the code in a single cell (including verification).

In [59]:
#changing column names
population.rename(mapper={"Age": "Age_bracket", "Notes":"Comments"}, axis=1, inplace=True)

#checking
print(population.columns.tolist())

['Year', 'Location_Code', 'District', 'Municipality', 'Parish', 'Sex', 'Age_bracket', 'Population', 'Comments', 'Population_corrected']


#### Basic Inspection <a class="anchor" id="P1.2"></a>

More than just knowing the shape of the data, it is important to know what the data looks like. There are multiple methods that allow you to do that, and this section will cover the most important ones.

**Step 1.6** Use the method `head` on dataframe `population` to look at the first rows of a dataframe (default = 5).

In [60]:
population.head()

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
0,2021.0,170320,Vila Real,Chaves,Oura,F,65 and older,100,,100
1,2021.0,91205,Guarda,Seia,Girabolhos,M,0 - 14 year old,5,,5
2,2021.0,10416,Aveiro,Arouca,Santa Eulália,M,0 - 14 year old,137,,137
3,2021.0,31360,Braga,Vila Verde,União das freguesias da Ribeira do Neiva,M,0 - 14 year old,203,,203
4,2021.0,130526,Porto,Lousada,Vilar do Torno e Alentém,M,25 - 64 year old,346,,346


In [61]:
#While checking the head you notice that you prefer explicitly stating the Sex instead of using 'F' and 'M'

#so you check the unique values covered in Sex
print(population['Sex'].unique())
#in this case it's just F/M, so we can replace using the following
population['Sex'] = population['Sex'].replace({'M': 'Male', 'F': 'Female'})

['F' 'M']


In [62]:
list(population['Sex'].unique())

# nunique() returns the number of unique values instead of the values themselves

['Female', 'Male']
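The difference between `unique()`, `nunique()` and `replace()` can be sketched on a small made-up series `sex`. The last lines show what happens when the dictionary does not cover every value.

```python
import pandas as pd

# Hypothetical toy series, used only for illustration
sex = pd.Series(["F", "M", "M", "F"])

print(sex.unique())   # the distinct values, in order of appearance
print(sex.nunique())  # 2, i.e. how many distinct values there are

# Every key in the dictionary is replaced by its value
replaced = sex.replace({"M": "Male", "F": "Female"})
print(replaced.tolist())  # ['Female', 'Male', 'Male', 'Female']

# Values that are not in the dictionary are left as they are
partial = sex.replace({"M": "Male"})
print(partial.tolist())   # ['F', 'Male', 'Male', 'F']
```

Checking `unique()` before replacing, as done above, is what guarantees that no value is left behind.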

**Step 1.6.1** Use, again, the method `head` on dataframe `population` to look at the first 50 rows.

In [63]:
population.head(50)

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
0,2021.0,170320,Vila Real,Chaves,Oura,Female,65 and older,100,,100
1,2021.0,91205,Guarda,Seia,Girabolhos,Male,0 - 14 year old,5,,5
2,2021.0,10416,Aveiro,Arouca,Santa Eulália,Male,0 - 14 year old,137,,137
3,2021.0,31360,Braga,Vila Verde,União das freguesias da Ribeira do Neiva,Male,0 - 14 year old,203,,203
4,2021.0,130526,Porto,Lousada,Vilar do Torno e Alentém,Male,25 - 64 year old,346,,346
5,2021.0,90628,Guarda,Gouveia,União das freguesias de Rio Torto e Lagarinhos,Female,0 - 14 year old,26,,26
6,2021.0,130811,Porto,Matosinhos,"União das freguesias de Custóias, Leça do Bali...",Male,15 - 24 year old,2332,,2332
7,2021.0,30301,Braga,Braga,Adaúfe,Male,25 - 64 year old,1026,,1026
8,2021.0,30922,Braga,Póvoa de Lanhoso,Santo Emilião,Male,15 - 24 year old,47,,47
9,2021.0,151101,Setúbal,Sesimbra,Sesimbra (Castelo),Female,25 - 64 year old,5456,,5456


**Step 1.6.2** Use, again, the method `head` on dataframe `population` to look at the first 100 rows.

In [64]:
population.head(100)

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
0,2021.0,170320,Vila Real,Chaves,Oura,Female,65 and older,100,,100
1,2021.0,91205,Guarda,Seia,Girabolhos,Male,0 - 14 year old,5,,5
2,2021.0,10416,Aveiro,Arouca,Santa Eulália,Male,0 - 14 year old,137,,137
3,2021.0,31360,Braga,Vila Verde,União das freguesias da Ribeira do Neiva,Male,0 - 14 year old,203,,203
4,2021.0,130526,Porto,Lousada,Vilar do Torno e Alentém,Male,25 - 64 year old,346,,346
...,...,...,...,...,...,...,...,...,...,...
95,2021.0,470101,Ilha do Faial,Horta,Capelo,Male,0 - 14 year old,32,,32
96,2021.0,140120,Santarém,Abrantes,União das freguesias de Abrantes (São Vicente ...,Female,65 and older,2193,,2193
97,2021.0,180409,Viseu,Cinfães,Nespereira,Female,25 - 64 year old,424,,424
98,2021.0,110658,Lisboa,Lisboa,Belém,Female,15 - 24 year old,798,,798


That's not good. It seems that you wanted to see too many rows at once, so pandas truncated the output. While circumstantial, you might run into situations such as this one. To circumvent it, you can use the following code to extend the viewing limits of pandas. 

In [65]:
with pd.option_context('display.max_rows', None):
    display(population.head(100))

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
0,2021.0,170320,Vila Real,Chaves,Oura,Female,65 and older,100,,100
1,2021.0,91205,Guarda,Seia,Girabolhos,Male,0 - 14 year old,5,,5
2,2021.0,10416,Aveiro,Arouca,Santa Eulália,Male,0 - 14 year old,137,,137
3,2021.0,31360,Braga,Vila Verde,União das freguesias da Ribeira do Neiva,Male,0 - 14 year old,203,,203
4,2021.0,130526,Porto,Lousada,Vilar do Torno e Alentém,Male,25 - 64 year old,346,,346
5,2021.0,90628,Guarda,Gouveia,União das freguesias de Rio Torto e Lagarinhos,Female,0 - 14 year old,26,,26
6,2021.0,130811,Porto,Matosinhos,"União das freguesias de Custóias, Leça do Bali...",Male,15 - 24 year old,2332,,2332
7,2021.0,30301,Braga,Braga,Adaúfe,Male,25 - 64 year old,1026,,1026
8,2021.0,30922,Braga,Póvoa de Lanhoso,Santo Emilião,Male,15 - 24 year old,47,,47
9,2021.0,151101,Setúbal,Sesimbra,Sesimbra (Castelo),Female,25 - 64 year old,5456,,5456
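The advantage of `pd.option_context` over setting the option globally is that the change only lasts for the duration of the `with` block. A minimal sketch that checks this with `pd.get_option`:

```python
import pandas as pd

# Whatever limit is currently configured (60 unless it was changed)
default_rows = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", None):
    # Inside the block there is no row limit
    inside = pd.get_option("display.max_rows")

# Outside the block the previous limit is restored automatically
after = pd.get_option("display.max_rows")

print(inside)                 # None
print(after == default_rows)  # True
```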


**Step 1.7** Use the method `tail` on dataframe `population` to look at the last rows of a dataframe (default = 5).

In [66]:
population.tail()

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
24723,2021.0,40542,Bragança,Macedo de Cavaleiros,"União das freguesias de Espadanedo, Edroso, Mu...",Male,65 and older,76,,76
24724,2021.0,140503,Santarém,Benavente,Santo Estêvão,Female,0 - 14 year old,124,,124
24725,2021.0,430101,Ilha Terceira,Angra do Heroísmo,Altares,Female,25 - 64 year old,235,,235
24726,2021.0,171305,Vila Real,Vila Pouca de Aguiar,Capeludos,Female,0 - 14 year old,5,,5
24727,2021.0,130526,Porto,Lousada,Vilar do Torno e Alentém,Male,15 - 24 year old,68,,68


**Step 1.7.1** Use, again, the method `tail` on dataframe `population` to look at the last 50 rows.

In [67]:
population.tail(50)

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
24678,2021.0,480102,Ilha das Flores,Lajes das Flores,Fajãzinha,Male,15 - 24 year old,2,,2
24679,2021.0,181111,Viseu,Penalva do Castelo,Sezures,Female,15 - 24 year old,30,,30
24680,2021.0,70304,Évora,Borba,Borba (São Bartolomeu),Male,25 - 64 year old,132,,132
24681,2021.0,181917,Viseu,Tabuaço,Valença do Douro,Male,65 and older,37,,37
24682,2021.0,151302,Setúbal,Sines,Porto Covo,Female,0 - 14 year old,68,,68
24683,2021.0,40119,Bragança,Alfândega da Fé,Vilarelhos,Female,0 - 14 year old,9,,9
24684,2021.0,30322,Braga,Braga,Lamas,Male,0 - 14 year old,67,,67
24685,2021.0,11706,Aveiro,Sever do Vouga,Sever do Vouga,Male,65 and older,303,,303
24686,2021.0,50901,Castelo Branco,Sertã,Cabeçudo,Female,25 - 64 year old,237,,237
24687,2021.0,80903,Faro,Monchique,Monchique,Male,65 and older,670,,670


**Step 1.7.2** Use, again, the method `tail` on dataframe `population` to look at the last 100 rows.

In [68]:
population.tail(100)

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
24628,2021.0,90210,Guarda,Almeida,Freixo,Male,65 and older,23,,23
24629,2021.0,50431,Castelo Branco,Fundão,Enxames,Male,25 - 64 year old,99,,99
24630,2021.0,180417,Viseu,Cinfães,Travanca,Male,25 - 64 year old,180,,180
24631,2021.0,30838,Braga,Guimarães,Ponte,Male,65 and older,513,,513
24632,2021.0,110109,Lisboa,Alenquer,Ota,Male,0 - 14 year old,93,,93
...,...,...,...,...,...,...,...,...,...,...
24723,2021.0,40542,Bragança,Macedo de Cavaleiros,"União das freguesias de Espadanedo, Edroso, Mu...",Male,65 and older,76,,76
24724,2021.0,140503,Santarém,Benavente,Santo Estêvão,Female,0 - 14 year old,124,,124
24725,2021.0,430101,Ilha Terceira,Angra do Heroísmo,Altares,Female,25 - 64 year old,235,,235
24726,2021.0,171305,Vila Real,Vila Pouca de Aguiar,Capeludos,Female,0 - 14 year old,5,,5


A simple, yet less discussed way of looking at the dataset in Jupyter notebooks is to simply call the dataset outright. The dataset will be presented in a truncated format, showing the first and the last 5 rows.

In [69]:
population

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
0,2021.0,170320,Vila Real,Chaves,Oura,Female,65 and older,100,,100
1,2021.0,91205,Guarda,Seia,Girabolhos,Male,0 - 14 year old,5,,5
2,2021.0,10416,Aveiro,Arouca,Santa Eulália,Male,0 - 14 year old,137,,137
3,2021.0,31360,Braga,Vila Verde,União das freguesias da Ribeira do Neiva,Male,0 - 14 year old,203,,203
4,2021.0,130526,Porto,Lousada,Vilar do Torno e Alentém,Male,25 - 64 year old,346,,346
...,...,...,...,...,...,...,...,...,...,...
24723,2021.0,40542,Bragança,Macedo de Cavaleiros,"União das freguesias de Espadanedo, Edroso, Mu...",Male,65 and older,76,,76
24724,2021.0,140503,Santarém,Benavente,Santo Estêvão,Female,0 - 14 year old,124,,124
24725,2021.0,430101,Ilha Terceira,Angra do Heroísmo,Altares,Female,25 - 64 year old,235,,235
24726,2021.0,171305,Vila Real,Vila Pouca de Aguiar,Capeludos,Female,0 - 14 year old,5,,5


If you look at the rows, it is clear that the rows do not seem to follow any particular order. We can, however, seamlessly rearrange the order of the rows using the method `sort_values()`. More information on this method can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html).

**Step 1.8** Sort dataframe `population` by the column `District` in ascending order<br>
**Note**: Sorting text is alphabetical, but bear in mind that accents and special characters affect the order, because characters are compared by their Unicode code point. For example, `é` will be sorted after `z`.

In [70]:
population.sort_values(
    by="District",
    ascending=True
)

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
21545,2021.0,10517,Aveiro,Aveiro,União das freguesias de Glória e Vera Cruz,Female,25 - 64 year old,6328,,6328
10318,2021.0,10610,Aveiro,Castelo de Paiva,"União das freguesias de Raiva, Pedorido e Paraíso",Female,25 - 64 year old,1152,,1152
994,2021.0,11509,Aveiro,Ovar,"União das freguesias de Ovar, São João, Arada ...",Male,15 - 24 year old,1646,,1646
14124,2021.0,11305,Aveiro,Oliveira de Azeméis,Macieira de Sarnes,Male,0 - 14 year old,109,,109
23031,2021.0,10121,Aveiro,Águeda,União das freguesias de Águeda e Borralha,Female,65 and older,1793,,1793
...,...,...,...,...,...,...,...,...,...,...
8986,2021.0,70509,Évora,Évora,São Miguel de Machede,Female,25 - 64 year old,180,,180
21953,2021.0,70304,Évora,Borba,Borba (São Bartolomeu),Male,0 - 14 year old,24,,24
4994,2021.0,70906,Évora,Portel,Santana,Male,65 and older,56,,56
17157,2021.0,71404,Évora,Vila Viçosa,Pardais,Male,65 and older,48,,48
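The effect of accents on the order can be reproduced on a short made-up series `districts`. The helper `strip_accents` is a hypothetical name introduced here; it relies on the standard-library `unicodedata` module and is passed to the `key` parameter of `sort_values()`. Treat it as one possible workaround, not as the notebook's method.

```python
import unicodedata

import pandas as pd

# Hypothetical toy series, used only for illustration
districts = pd.Series(["Évora", "Viseu", "Aveiro", "Faro"])

# Default sort: "É" has a higher code point than any unaccented letter
print(districts.sort_values().tolist())
# ['Aveiro', 'Faro', 'Viseu', 'Évora']


def strip_accents(text):
    """Return text with accents removed (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# key receives the whole series and must return a series of sort keys
by_base_letter = districts.sort_values(key=lambda s: s.map(strip_accents))
print(by_base_letter.tolist())
# ['Aveiro', 'Évora', 'Faro', 'Viseu']
```

The values themselves are not modified; only the order is computed from the accent-free version.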


Doing this, however, only returns a new, sorted dataframe. The original dataset remains unchanged.

**Step 1.8.1** Call the dataset `population` to check whether the sorting changed it

In [71]:
population

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
0,2021.0,170320,Vila Real,Chaves,Oura,Female,65 and older,100,,100
1,2021.0,91205,Guarda,Seia,Girabolhos,Male,0 - 14 year old,5,,5
2,2021.0,10416,Aveiro,Arouca,Santa Eulália,Male,0 - 14 year old,137,,137
3,2021.0,31360,Braga,Vila Verde,União das freguesias da Ribeira do Neiva,Male,0 - 14 year old,203,,203
4,2021.0,130526,Porto,Lousada,Vilar do Torno e Alentém,Male,25 - 64 year old,346,,346
...,...,...,...,...,...,...,...,...,...,...
24723,2021.0,40542,Bragança,Macedo de Cavaleiros,"União das freguesias de Espadanedo, Edroso, Mu...",Male,65 and older,76,,76
24724,2021.0,140503,Santarém,Benavente,Santo Estêvão,Female,0 - 14 year old,124,,124
24725,2021.0,430101,Ilha Terceira,Angra do Heroísmo,Altares,Female,25 - 64 year old,235,,235
24726,2021.0,171305,Vila Real,Vila Pouca de Aguiar,Capeludos,Female,0 - 14 year old,5,,5


**Step 1.8.2** Try to sort the dataframe `population`, now in descending order on columns `District`, `Municipality` and `Parish`. Make sure that you set the parameter `inplace = True`. Then, call the dataset to check whether the sorting was successful.

In [72]:
#sorting
population.sort_values(
    by=["District", "Municipality", "Parish"],
    ascending=False,
    inplace=True
)
#checking changes
population

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
3232,2021.0,70523,Évora,Évora,"União das freguesias de Évora (São Mamede, Sé,...",Female,15 - 24 year old,212,,212
7672,2021.0,70523,Évora,Évora,"União das freguesias de Évora (São Mamede, Sé,...",Male,65 and older,430,,430
13912,2021.0,70523,Évora,Évora,"União das freguesias de Évora (São Mamede, Sé,...",Male,0 - 14 year old,265,,265
14498,2021.0,70523,Évora,Évora,"União das freguesias de Évora (São Mamede, Sé,...",Female,65 and older,771,,771
16065,2021.0,70523,Évora,Évora,"União das freguesias de Évora (São Mamede, Sé,...",Male,25 - 64 year old,1052,,1052
...,...,...,...,...,...,...,...,...,...,...
4904,2021.0,10209,Aveiro,Albergaria-a-Velha,Albergaria-a-Velha e Valmaior,Female,15 - 24 year old,576,,576
14888,2021.0,10209,Aveiro,Albergaria-a-Velha,Albergaria-a-Velha e Valmaior,Female,0 - 14 year old,777,,777
15395,2021.0,10209,Aveiro,Albergaria-a-Velha,Albergaria-a-Velha e Valmaior,Male,25 - 64 year old,2941,,2941
21325,2021.0,10209,Aveiro,Albergaria-a-Velha,Albergaria-a-Velha e Valmaior,Male,15 - 24 year old,625,,625
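Two variations on multi-column sorting are sketched below on a made-up dataframe `toy`: giving each column its own direction by passing a list to `ascending`, and reassigning the result instead of using `inplace=True`. The sketch also shows that the original row labels travel with the rows, which is why the index above is no longer in order.

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame(
    {
        "District": ["Porto", "Aveiro", "Porto", "Aveiro"],
        "Parish": ["Vandoma", "Santa Eulália", "Irivo", "Macieira de Sarnes"],
        "Population": [10, 20, 30, 40],
    }
)

# District ascending, Parish descending within each district
result = toy.sort_values(by=["District", "Parish"], ascending=[True, False])

print(result["Parish"].tolist())
# ['Santa Eulália', 'Macieira de Sarnes', 'Vandoma', 'Irivo']

# The row labels are kept, so the index is shuffled
print(result.index.tolist())  # [1, 3, 0, 2]

# reset_index(drop=True) renumbers the rows from 0
tidy = result.reset_index(drop=True)
print(tidy.index.tolist())    # [0, 1, 2, 3]
```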


Another important method in EDA is `info()`. It gives us the number of non-null values in each column and the data type of each column. This is useful to check if there are any missing values in the dataset.

**Step 1.9** Use the method `info()` on `population`. Are there any columns whose datatype is nonsensical? Moreover, is there any column with missing values? Write your answer in the markdown cell below.

In [73]:
population.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24728 entries, 3232 to 22975
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Year                  24728 non-null  float64
 1   Location_Code         24728 non-null  object 
 2   District              24728 non-null  object 
 3   Municipality          24728 non-null  object 
 4   Parish                24728 non-null  object 
 5   Sex                   24728 non-null  object 
 6   Age_bracket           24728 non-null  object 
 7   Population            24728 non-null  int64  
 8   Comments              0 non-null      float64
 9   Population_corrected  24728 non-null  int64  
dtypes: float64(2), int64(2), object(6)
memory usage: 2.1+ MB


**Answer**:

In [74]:
population.isna().sum() # Better way to see missing values....

Year                        0
Location_Code               0
District                    0
Municipality                0
Parish                      0
Sex                         0
Age_bracket                 0
Population                  0
Comments                24728
Population_corrected        0
dtype: int64
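Beyond the raw count, the same boolean mask can give the share of missing values and the list of columns that are completely empty. A minimal sketch on a made-up dataframe `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame(
    {"Population": [100, 5, 137], "Comments": [np.nan, np.nan, np.nan]}
)

# isna() yields True/False per cell; True counts as 1 when summed
missing_count = toy.isna().sum()
print(missing_count.to_dict())  # {'Population': 0, 'Comments': 3}

# The mean of a boolean column is the fraction of True values
missing_share = toy.isna().mean()
print(missing_share.to_dict())  # {'Population': 0.0, 'Comments': 1.0}

# Columns in which every single value is missing
empty_columns = toy.columns[toy.isna().all()].tolist()
print(empty_columns)            # ['Comments']
```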

#### Descriptive statistics<a class="anchor" id="P1.3"></a>

Looking at individual rows can only tell so much, so it is useful to have methods that give us a better sense of the data as a whole. For example, we could be interested in understanding how the population is distributed, and for that we can use statistical measures such as the mean, median, standard deviation, etc.

**Step 1.9** Obtain some descriptive statistics of column `Population` (mean, median, first quartile and third quartile)

**Step 1.9.1** Calculate the mean of column `Population`

In [75]:
population['Population'].mean()

417.9555160142349

**Step 1.9.2** Calculate the median of column `Population`

In [76]:
population.Population.median()

108.0

**Step 1.9.3** Calculate the first quartile of column `Population`.<br>
**Note:** Use the `quantile()` method. You can check how it works [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html). 

In [77]:
population.Population.quantile(.25)

41.0

**Step 1.9.4** Calculate the third quartile of column `Population`.

In [78]:
population.Population.quantile(.75)

304.0
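`quantile()` also accepts a list, which returns all requested quantiles at once. The sketch below uses a made-up series `values` and derives the interquartile range from the result.

```python
import pandas as pd

# Hypothetical toy series, used only for illustration
values = pd.Series([1, 2, 3, 4, 5])

# One call, three quantiles; the result is a series indexed by quantile
quartiles = values.quantile([0.25, 0.5, 0.75])
print(quartiles.tolist())  # [2.0, 3.0, 4.0]

# Interquartile range: third quartile minus first quartile
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
print(iqr)                 # 2.0

# The median is simply the 0.5 quantile
print(values.median() == values.quantile(0.5))  # True
```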

Calculating statistical properties of one, or more, variables is important in many regards. `pandas` has a very helpful method that will allow you to obtain the most relevant descriptive statistics for any variable in `describe()`.

**Step 1.9.5** Use the `describe()` method to obtain the descriptive statistics for the numerical variables of dataset `population`.<br>
**Note**: Depending on your preference, you may prefer to look at the resulting dataframe in its transposed form using `describe().T`

In [79]:
population.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,24728.0,2021.0,0.0,2021.0,2021.0,2021.0,2021.0,2021.0
Population,24728.0,417.955516,1134.569287,0.0,41.0,108.0,304.0,20301.0
Comments,0.0,,,,,,,
Population_corrected,24728.0,417.955516,1134.569287,0.0,41.0,108.0,304.0,20301.0


In [80]:
population.describe(include='O').T

Unnamed: 0,count,unique,top,freq
Location_Code,24728,3091,70523,8
District,24728,29,Braga,2776
Municipality,24728,306,Barcelos,488
Parish,24728,2873,Pinheiro,48
Sex,24728,2,Female,12364
Age_bracket,24728,4,15 - 24 year old,6182


In [81]:
population.describe(include='all').T
# kind of questionable: mixing numerical and object columns leaves most cells as NaN

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Year,24728.0,,,,2021.0,0.0,2021.0,2021.0,2021.0,2021.0,2021.0
Location_Code,24728.0,3091.0,70523,8.0,,,,,,,
District,24728.0,29.0,Braga,2776.0,,,,,,,
Municipality,24728.0,306.0,Barcelos,488.0,,,,,,,
Parish,24728.0,2873.0,Pinheiro,48.0,,,,,,,
Sex,24728.0,2.0,Female,12364.0,,,,,,,
Age_bracket,24728.0,4.0,15 - 24 year old,6182.0,,,,,,,
Population,24728.0,,,,417.955516,1134.569287,0.0,41.0,108.0,304.0,20301.0
Comments,0.0,,,,,,,,,,
Population_corrected,24728.0,,,,417.955516,1134.569287,0.0,41.0,108.0,304.0,20301.0


Categorical variables also have their own methods. For example, you may want to know which `District` appears most often or how many parishes it has.

**Step 1.10.1** Use the method `mode()` on the `District` column to find the most common district in the dataset.

In [82]:
population['District'].mode()

0    Braga
Name: District, dtype: object
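`mode()` returns a series rather than a single value because several values can be tied for first place. A minimal sketch on two made-up series:

```python
import pandas as pd

# Hypothetical toy series, used only for illustration
one_winner = pd.Series(["Braga", "Braga", "Viseu"])
tie = pd.Series(["Braga", "Viseu", "Braga", "Viseu", "Porto"])

print(one_winner.mode().tolist())  # ['Braga']

# Two values share the highest frequency, so both are returned
print(tie.mode().tolist())         # ['Braga', 'Viseu']

# To get a plain value, take the first element of the result
print(one_winner.mode().iloc[0])   # Braga
```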

**Step 1.10.2** Use the method `describe(include = 'O')` to find relevant statistics for the `object` columns in dataframe `population`

In [83]:
population.describe(include = 'O') 

Unnamed: 0,Location_Code,District,Municipality,Parish,Sex,Age_bracket
count,24728,24728,24728,24728,24728,24728
unique,3091,29,306,2873,2,4
top,70523,Braga,Barcelos,Pinheiro,Female,15 - 24 year old
freq,8,2776,488,48,12364,6182


**Step 1.10.3** Use the method `value_counts()` to count the number of rows that refer to each district.

In [84]:
#number of rows per district
population.District.value_counts()

District
Braga                  2776
Viseu                  2216
Porto                  1944
Guarda                 1936
Bragança               1808
Viana do Castelo       1664
Vila Real              1576
Coimbra                1240
Aveiro                 1168
Santarém               1128
Lisboa                 1072
Castelo Branco          960
Leiria                  880
Beja                    600
Évora                   552
Portalegre              552
Faro                    536
Ilha de São Miguel      512
Setúbal                 440
Ilha da Madeira         424
Ilha Terceira           240
Ilha do Pico            136
Ilha do Faial           104
Ilha das Flores          88
Ilha de São Jorge        88
Ilha de Santa Maria      40
Ilha da Graciosa         32
Ilha de Porto Santo       8
Ilha do Corvo             8
Name: count, dtype: int64

Is this the number of parishes per district? 

**Step 1.10.4** Validate the claim that each parish is represented by 8 rows using value counts.<br>
**Hint**: You can use value_counts on multiple columns simultaneously

In [85]:
population[['District', 'Municipality', 'Parish']].value_counts()

District  Municipality        Parish                                                                 
Aveiro    Albergaria-a-Velha  Albergaria-a-Velha e Valmaior                                              8
Porto     Paredes             Vandoma                                                                    8
          Penafiel            Guilhufe e Urrô                                                            8
                              Irivo                                                                      8
                              Lagares e Figueira                                                         8
                                                                                                        ..
Coimbra   Penacova            Carvalho                                                                   8
                              Figueira de Lorvão                                                         8
                              Lorvão      

**Step 1.10.5** Now that we confirmed that each parish is represented by 8 rows, use `value_counts` to obtain the number of parishes in each district<br>
**Hint**: Recall that you can apply mathematical operations to pandas series

In [86]:
population.District.value_counts()/8

District
Braga                  347.0
Viseu                  277.0
Porto                  243.0
Guarda                 242.0
Bragança               226.0
Viana do Castelo       208.0
Vila Real              197.0
Coimbra                155.0
Aveiro                 146.0
Santarém               141.0
Lisboa                 134.0
Castelo Branco         120.0
Leiria                 110.0
Beja                    75.0
Évora                   69.0
Portalegre              69.0
Faro                    67.0
Ilha de São Miguel      64.0
Setúbal                 55.0
Ilha da Madeira         53.0
Ilha Terceira           30.0
Ilha do Pico            17.0
Ilha do Faial           13.0
Ilha das Flores         11.0
Ilha de São Jorge       11.0
Ilha de Santa Maria      5.0
Ilha da Graciosa         4.0
Ilha de Porto Santo      1.0
Ilha do Corvo            1.0
Name: count, dtype: float64
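
Dividing the row counts by 8 only works because every parish contributes exactly 8 rows. As a hedged alternative, the sketch below counts distinct location codes per district directly. The `toy` frame is made up for illustration (it is not the census file) and uses 2 rows per parish instead of 8.

```python
import pandas as pd

# Toy data (hypothetical): each location code appears exactly twice
toy = pd.DataFrame({
    'District': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Location_Code': [1, 1, 2, 2, 3, 3],
})

# Approach used above: rows per district divided by rows per parish
by_division = toy['District'].value_counts() / 2

# Alternative: count distinct codes, with no assumption about rows per parish
by_nunique = toy.groupby('District')['Location_Code'].nunique()

print(by_division.sort_index())
print(by_nunique)
```

The second approach returns integers and keeps working if some parish has missing rows.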

A final note on `describe`: it can report basic statistics for both numerical and object variables at once using `include = 'all'`. For each dtype, the method computes only the statistics that make sense for it and fills the remaining ones with NaN. See the example below:

In [87]:
population.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Year,24728.0,,,,2021.0,0.0,2021.0,2021.0,2021.0,2021.0,2021.0
Location_Code,24728.0,3091.0,70523,8.0,,,,,,,
District,24728.0,29.0,Braga,2776.0,,,,,,,
Municipality,24728.0,306.0,Barcelos,488.0,,,,,,,
Parish,24728.0,2873.0,Pinheiro,48.0,,,,,,,
Sex,24728.0,2.0,Female,12364.0,,,,,,,
Age_bracket,24728.0,4.0,15 - 24 year old,6182.0,,,,,,,
Population,24728.0,,,,417.955516,1134.569287,0.0,41.0,108.0,304.0,20301.0
Comments,0.0,,,,,,,,,,
Population_corrected,24728.0,,,,417.955516,1134.569287,0.0,41.0,108.0,304.0,20301.0


### <font color='#BFD72F'>2. Grouping and Aggregation</font> <a class="anchor" id="P2"></a>
  [Back to TOC](#toc)

  In this section, we will shift our attention to grouping data and computing aggregate statistics. In particular, we will cover the flexible `groupby()` method. More info on this method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html). It is really important to bear in mind that <b>aggregating the result of `groupby()` returns either a *pandas* series or a *pandas* dataframe</b>. This means that we can apply all the methods that we have covered thus far for those data structures.
  
  Taking into consideration our data, we may want to verify what is the total population per district.
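
  As a minimal sketch of the point above, the example below uses a made-up `toy` frame (an assumption for illustration, not the census data) to show that `groupby()` on its own returns a grouping object, and that the aggregated result is a series or a dataframe depending on how the column is selected.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'District': ['A', 'A', 'B'],
    'Population': [10, 20, 5],
})

grouped = toy.groupby('District')          # a GroupBy object, nothing is aggregated yet
as_series = grouped['Population'].sum()    # single brackets -> pandas Series
as_frame = grouped[['Population']].sum()   # double brackets -> pandas DataFrame

print(type(grouped).__name__)
print(type(as_series).__name__, type(as_frame).__name__)
```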

  **Step 2.1** Use `groupby()` to get the `sum` of every numerical column per district.

In [88]:
population.groupby('District').sum(numeric_only=True) # numeric_only=True restricts the sum to numerical columns (since pandas 2.0 the default is False)

Unnamed: 0_level_0,Year,Population,Comments,Population_corrected
District,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aveiro,2360528.0,692925,0.0,692925
Beja,1212600.0,144401,0.0,144401
Braga,5610296.0,846293,0.0,846293
Bragança,3653968.0,122804,0.0,122804
Castelo Branco,1940160.0,177962,0.0,177962
Coimbra,2506040.0,408551,0.0,408551
Faro,1083256.0,467343,0.0,467343
Guarda,3912656.0,142974,0.0,142974
Ilha Terceira,485040.0,53234,0.0,53234
Ilha da Graciosa,64672.0,4090,0.0,4090


  **Step 2.2.2** Use `groupby()` to obtain a pandas series with the `sum` of column `Population` per `District`

In [89]:
population.groupby('District')[['Population']].sum() # double square brackets return a dataframe; single brackets (['Population']) return a series

Unnamed: 0_level_0,Population
District,Unnamed: 1_level_1
Aveiro,692925
Beja,144401
Braga,846293
Bragança,122804
Castelo Branco,177962
Coimbra,408551
Faro,467343
Guarda,142974
Ilha Terceira,53234
Ilha da Graciosa,4090


  **Step 2.2.3** Use `groupby()` to obtain a pandas dataframe with the `sum` of column `Population` per `District`. Sort it in descending order

In [90]:
population.groupby('District')[['Population']].sum(numeric_only = True).sort_values(by='Population', ascending=False)
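
Dataframes in current pandas versions have no `sort` method; sorting by values is done with `sort_values()`. The sketch below illustrates this on a made-up `toy` frame (hypothetical data, not the census file).

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'District': ['A', 'A', 'B', 'C'],
    'Population': [10, 20, 50, 5],
})

totals = toy.groupby('District')[['Population']].sum()

# sort_values needs the column to sort by; ascending=False gives descending order
totals_sorted = totals.sort_values(by='Population', ascending=False)
print(totals_sorted)
```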

**Step 2.2.4** Use `groupby()` to obtain a pandas dataframe with the `sum` of column `Population` per `District` and `Age_bracket`. Store it in variable `population_per_district_age`.

In [91]:
#storing the result in a variable lets you reuse it later like any other pandas dataframe (it has too many rows to inspect in one go)
population_per_district_age =  population.groupby(['District', 'Age_bracket'])[['Population']].sum()

#checking the result
population_per_district_age

Unnamed: 0_level_0,Unnamed: 1_level_0,Population
District,Age_bracket,Unnamed: 2_level_1
Aveiro,0 - 14 year old,85047
Aveiro,15 - 24 year old,72264
Aveiro,25 - 64 year old,374324
Aveiro,65 and older,161290
Beja,0 - 14 year old,17498
...,...,...
Viseu,65 and older,98871
Évora,0 - 14 year old,18482
Évora,15 - 24 year old,14812
Évora,25 - 64 year old,77747


Grouping on multiple columns generates a MultiIndex by default. While MultiIndexes have been covered before, we haven't yet learned how to deal with them. In this notebook, we will cover the most basic way of removing a MultiIndex, which is resetting it using `reset_index()`. More info [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html).

**Step 2.2.5** Identify and deal with the MultiIndex in the dataframe `population_per_district_age`<br>
**Step 2.2.5.1** Check the shape and columns of the dataframe `population_per_district_age` (you should expect it to have a single column)

In [92]:
print(f'Shape of the dataframe: {population_per_district_age.shape}\n' +
      f'Columns in the dataframe: {population_per_district_age.columns.unique()}')

Shape of the dataframe: (116, 1)
Columns in the dataframe: Index(['Population'], dtype='object')


**Step 2.2.5.2** Use the method `reset_index()` to remove the MultiIndex and set its components to regular columns of a *pandas* dataframe. Check whether your changes had an effect.

In [93]:
#resetting index
population_per_district_age.reset_index(inplace=True, drop = False) #the current default is drop = False

#check
print(f'Shape of the dataframe: {population_per_district_age.shape}\n' +
      f'Columns in the dataframe: {population_per_district_age.columns.unique()}')

population_per_district_age

Shape of the dataframe: (116, 3)
Columns in the dataframe: Index(['District', 'Age_bracket', 'Population'], dtype='object')


Unnamed: 0,District,Age_bracket,Population
0,Aveiro,0 - 14 year old,85047
1,Aveiro,15 - 24 year old,72264
2,Aveiro,25 - 64 year old,374324
3,Aveiro,65 and older,161290
4,Beja,0 - 14 year old,17498
...,...,...,...
111,Viseu,65 and older,98871
112,Évora,0 - 14 year old,18482
113,Évora,15 - 24 year old,14812
114,Évora,25 - 64 year old,77747
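
As a hedged aside, `groupby()` also accepts `as_index=False`, which keeps the grouping keys as regular columns so that no index reset is needed afterwards. The sketch below compares both routes on a made-up `toy` frame (hypothetical data).

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'District': ['A', 'A', 'B', 'B'],
    'Age_bracket': ['young', 'old', 'young', 'old'],
    'Population': [10, 20, 5, 15],
})

# Route 1: group keys become a MultiIndex, then reset_index() turns them into columns
via_reset = toy.groupby(['District', 'Age_bracket'])[['Population']].sum().reset_index()

# Route 2: as_index=False keeps the group keys as regular columns from the start
via_flag = toy.groupby(['District', 'Age_bracket'], as_index=False)[['Population']].sum()

print(via_reset)
print(via_flag)
```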


The `groupby` method not only allows grouping over multiple variables, it also lets you apply multiple functions at once through the `agg()` method. While very useful for obtaining more granular insight, this has the **drawback** of possibly generating a column **MultiIndex**, which is much more challenging to handle than the ones covered thus far.

**Step 2.3.1** Use `groupby()` to obtain a pandas dataframe with the `sum`, `mean` and `median` of column `Population` per `District` and `Age_bracket`.

In [94]:
population.groupby(['District', 'Age_bracket']).agg({
    'Population': ['sum', 'mean', 'median'],
})

Unnamed: 0_level_0,Unnamed: 1_level_0,Population,Population,Population
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,median
District,Age_bracket,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Aveiro,0 - 14 year old,85047,291.256849,197.5
Aveiro,15 - 24 year old,72264,247.479452,168.5
Aveiro,25 - 64 year old,374324,1281.931507,861.5
Aveiro,65 and older,161290,552.363014,392.0
Beja,0 - 14 year old,17498,116.653333,50.5
...,...,...,...,...
Viseu,65 and older,98871,178.467509,120.0
Évora,0 - 14 year old,18482,133.927536,47.5
Évora,15 - 24 year old,14812,107.333333,39.0
Évora,25 - 64 year old,77747,563.384058,234.5


**Step 2.3.2** Use `groupby()` to obtain a pandas dataframe with the `sum`, `mean` and `median` of column `Population` and the number of `Location_Code` and `Parish` per `District`.

In [95]:
population.groupby(['District']).agg({
    'Population': ['sum', 'mean', 'median'],
    'Location_Code': 'nunique',  # nunique gives the number of unique values
    'Parish': 'nunique',         # nunique gives the number of unique values
})

Unnamed: 0_level_0,Population,Population,Population,Location_Code,Parish
Unnamed: 0_level_1,sum,mean,median,nunique,nunique
District,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Aveiro,692925,593.257705,323.0,146,145
Beja,144401,240.668333,122.5,75,75
Braga,846293,304.860591,142.0,347,342
Bragança,122804,67.922566,30.0,226,225
Castelo Branco,177962,185.377083,61.5,120,120
Coimbra,408551,329.476613,139.0,155,155
Faro,467343,871.908582,384.5,67,67
Guarda,142974,73.850207,33.0,242,237
Ilha Terceira,53234,221.808333,127.5,30,30
Ilha da Graciosa,4090,127.8125,100.0,4,4
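
One possible way to avoid the column MultiIndex altogether is named aggregation, where each output column is declared as `new_name=(source_column, function)`. The sketch below is an illustration on a made-up `toy` frame; the output column names (`population_sum`, `population_mean`, `n_parishes`) are assumptions chosen for this example.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'District': ['A', 'A', 'B', 'B'],
    'Parish': ['p1', 'p2', 'p3', 'p3'],
    'Population': [10, 20, 5, 15],
})

# Named aggregation: each keyword becomes a flat column name in the result
summary = toy.groupby('District').agg(
    population_sum=('Population', 'sum'),
    population_mean=('Population', 'mean'),
    n_parishes=('Parish', 'nunique'),
)

print(summary.columns.tolist())
print(summary)
```

The result has a single level of column names, so it can be filtered and sorted like any other dataframe.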


### <font color='#BFD72F'>3. CrossTab and Pivot Tables</font> <a class="anchor" id="P3"></a>
  [Back to TOC](#toc)

#### CrossTab<a class="anchor" id="P3.1"></a>

By default, cross tabulation creates a frequency table that shows how the observations are distributed across the categories of two or more variables.
Consider the example given in the theoretical class, where you had multiple students and their respective classes.

**Step 3.1.1** Run the cell below to create the dataframe used in the theoretical class.

In [96]:
#create sample dataframe
crosstab_example = pd.DataFrame(data = {
                                            'Name' : ['John', 'Mary', 'Hanna', 'James', 'Louise', 'Robert', 'Ben', 'Hanna', 'Richard', 'Emily'],
                                            'Class' : ['LGI', 'LGI', 'LSTI', 'LCD', 'LSTI', 'LGI', 'LCD', 'LGI', 'LCD', 'LSTI'],
                                            'P' : ['P1', 'P2', 'P1', 'P1', 'P1', 'P1', 'P2', 'P2', 'P1', 'P2'],
                                             })
#display dataframe
crosstab_example

Unnamed: 0,Name,Class,P
0,John,LGI,P1
1,Mary,LGI,P2
2,Hanna,LSTI,P1
3,James,LCD,P1
4,Louise,LSTI,P1
5,Robert,LGI,P1
6,Ben,LCD,P2
7,Hanna,LGI,P2
8,Richard,LCD,P1
9,Emily,LSTI,P2


**Step 3.1.2** Use `crosstab()` on `crosstab_example` so that the columns are the different values of `P` and the rows are the values of `Class`

In [97]:
pd.crosstab(index = crosstab_example['Class'],
            columns = crosstab_example['P'])

P,P1,P2
Class,Unnamed: 1_level_1,Unnamed: 2_level_1
LCD,2,1
LGI,2,2
LSTI,2,1


A key difference between `crosstab()` and `pivot_table()` is that `crosstab` does not expect dataframes as input. Instead, it expects arrays, series, or lists. This means that you can use `crosstab` to create a table from data that is not already in a dataframe. Check the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html)

**Step 3.2.1** Create two separate lists: the first with the data in column `Class` and the second with the data in column `P`

In [98]:
#list for Class
class_list = crosstab_example['Class'].to_list()

#do the same for p
p_list = crosstab_example['P'].to_list()

**Step 3.2.2** Use `crosstab()` on lists `class_list` and `p_list` to replicate the previous crosstab

In [99]:
pd.crosstab(index = class_list,
            columns = p_list)

col_0,P1,P2
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
LCD,2,1
LGI,2,2
LSTI,2,1


However, `crosstab()` also allows you to make computations and can be used to create pivot tables thanks to the `values` and `aggfunc` parameters.

**Step 3.2** Use `crosstab()` on `population` to create a pivot table where the rows are `District`, the columns are the `Sex` and the values are the *sum* of `Population`

In [100]:
pd.crosstab(index = population['District'],
            columns = population['Sex'],
            values = population.Population,
            aggfunc = 'sum'
             )

Sex,Female,Male
District,Unnamed: 1_level_1,Unnamed: 2_level_1
Aveiro,360863,332062
Beja,71690,72711
Braga,439664,406629
Bragança,64039,58765
Castelo Branco,93109,84853
Coimbra,216077,192474
Faro,240568,226775
Guarda,75308,67666
Ilha Terceira,27371,25863
Ilha da Graciosa,2070,2020


#### Pivot Tables<a class="anchor" id="P3.2"></a>

`pivot_table()` is a *pandas* function that is designed to create pivot tables from a *DataFrame* (key difference between this method and `crosstab()`). It shares a lot of the syntax used in `crosstab()`, although it is a bit more flexible. To check the documentation, click [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html).

**Step 3.3** Use `pivot_table()` on `population` to create a pivot table where the rows are `District`, the columns are the `Sex` and the values are the *sum* of `Population`

In [101]:
pd.pivot_table(data = population,
               index = 'District',
               columns = 'Sex',
               values = 'Population',
               aggfunc = 'sum'
               )

# Not necessary to break down tables...

Sex,Female,Male
District,Unnamed: 1_level_1,Unnamed: 2_level_1
Aveiro,360863,332062
Beja,71690,72711
Braga,439664,406629
Bragança,64039,58765
Castelo Branco,93109,84853
Coimbra,216077,192474
Faro,240568,226775
Guarda,75308,67666
Ilha Terceira,27371,25863
Ilha da Graciosa,2070,2020


Both `crosstab()` and `pivot_table()` allow the creation of multi-indexed tables. The `crosstab()` function is a special case of the `pivot_table()` function. However, `pivot_table()` has a more forgiving syntax since you can pass the column names directly instead of the series themselves.
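
To make the comparison concrete, here is a minimal sketch on a made-up `toy` frame (hypothetical data) that builds the same table with both functions. Note how `crosstab()` receives the series, while `pivot_table()` receives the dataframe and the column names.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'District': ['A', 'A', 'B', 'B', 'A'],
    'Sex': ['F', 'M', 'F', 'M', 'F'],
    'Population': [10, 20, 5, 15, 1],
})

# crosstab: pass the series themselves
ct = pd.crosstab(index=toy['District'], columns=toy['Sex'],
                 values=toy['Population'], aggfunc='sum')

# pivot_table: pass the dataframe plus the column names
pt = pd.pivot_table(data=toy, index='District', columns='Sex',
                    values='Population', aggfunc='sum')

print(ct)
print(pt)
```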

**Step 3.4** Use `pivot_table()` on `population` to create a pivot table where the rows are `District`, the columns are `Sex` and `Age_bracket` and the values are the *sum* of `Population`

In [102]:
pd.pivot_table(
    data=population,
    index='District',
    columns=['Sex', 'Age_bracket'],
    values='Population',
    aggfunc='sum'
)

Sex,Female,Female,Female,Female,Male,Male,Male,Male
Age_bracket,0 - 14 year old,15 - 24 year old,25 - 64 year old,65 and older,0 - 14 year old,15 - 24 year old,25 - 64 year old,65 and older
District,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Aveiro,41362,35586,193018,90897,43685,36678,181306,70393
Beja,8467,6431,35074,21718,9031,7436,39579,16665
Braga,52614,46868,243870,96312,56082,49273,227244,74030
Bragança,5628,5278,29704,23429,5825,5403,28683,18854
Castelo Branco,8881,7536,44023,32669,9334,8164,42585,24770
Coimbra,22619,19150,109342,64966,23958,19730,100813,47973
Faro,30603,22126,127813,60026,32178,23703,119971,50923
Guarda,6637,5961,34652,28058,6714,6308,33555,21089
Ilha Terceira,3521,2887,15182,5781,3605,3008,14948,4302
Ilha da Graciosa,274,205,1091,500,317,219,1094,390


**Note** Order matters when creating these MultiIndexes: the first name listed becomes the outermost level and the last one the innermost.<br>
**Step 3.5** Use `pivot_table()` to recreate the pivot table you just created. However, this time, make the columns be `Age_bracket` first and `Sex` second

In [103]:
pd.pivot_table(
    data = population,
    columns = ['Age_bracket', 'Sex'],
    index = 'District',
    values = 'Population',
    aggfunc = 'sum'
)

Age_bracket,0 - 14 year old,0 - 14 year old,15 - 24 year old,15 - 24 year old,25 - 64 year old,25 - 64 year old,65 and older,65 and older
Sex,Female,Male,Female,Male,Female,Male,Female,Male
District,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Aveiro,41362,43685,35586,36678,193018,181306,90897,70393
Beja,8467,9031,6431,7436,35074,39579,21718,16665
Braga,52614,56082,46868,49273,243870,227244,96312,74030
Bragança,5628,5825,5278,5403,29704,28683,23429,18854
Castelo Branco,8881,9334,7536,8164,44023,42585,32669,24770
Coimbra,22619,23958,19150,19730,109342,100813,64966,47973
Faro,30603,32178,22126,23703,127813,119971,60026,50923
Guarda,6637,6714,5961,6308,34652,33555,28058,21089
Ilha Terceira,3521,3605,2887,3008,15182,14948,5781,4302
Ilha da Graciosa,274,317,205,219,1091,1094,500,390


### <font color='#BFD72F'>4. Grabbing subsets - Advanced options</font> <a class="anchor" id="P4"></a>
  [Back to TOC](#toc)

  In this section, we will explore a few more advanced options to grab subsets of data. In particular, we will use the following methods:
  1. `query` to grab subsets of data based on a condition or set of conditions
  2. `isin` to make explicit lists of values to grab
  3. `between` to grab subsets of data based on a range of values
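
  As a quick orientation before the steps, the three methods listed above can be sketched on a made-up `toy` frame (hypothetical data, not the census file):

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'District': ['Lisboa', 'Porto', 'Faro', 'Lisboa'],
    'Population': [2500, 120, 80, 150],
})

by_query = toy.query('District == "Lisboa"')            # condition written as a string
by_isin = toy[toy['District'].isin(['Porto', 'Faro'])]  # explicit list of accepted values
by_between = toy[toy['Population'].between(100, 200)]   # numeric range

print(by_query)
print(by_isin)
print(by_between)
```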

  **Step 4.1** Use query to obtain subsets of data based on a condition or set of conditions.<br>
  **Step 4.1.1** Select all rows where the `District` is `Lisboa`

In [289]:
population.query(expr='District=="Lisboa"')

# equivalent notation
population[population['District']=='Lisboa']

# use .loc if you want to make further selections

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
6518,2021.0,111409,Lisboa,Vila Franca de Xira,Vila Franca de Xira,Female,25 - 64 year old,5063,,5063
7094,2021.0,111409,Lisboa,Vila Franca de Xira,Vila Franca de Xira,Male,0 - 14 year old,1323,,1323
10408,2021.0,111409,Lisboa,Vila Franca de Xira,Vila Franca de Xira,Female,15 - 24 year old,974,,974
14336,2021.0,111409,Lisboa,Vila Franca de Xira,Vila Franca de Xira,Male,15 - 24 year old,988,,988
15841,2021.0,111409,Lisboa,Vila Franca de Xira,Vila Franca de Xira,Male,25 - 64 year old,4495,,4495
...,...,...,...,...,...,...,...,...,...,...
8631,2021.0,110106,Lisboa,Alenquer,Carnota,Female,0 - 14 year old,72,,72
10248,2021.0,110106,Lisboa,Alenquer,Carnota,Female,25 - 64 year old,397,,397
19353,2021.0,110106,Lisboa,Alenquer,Carnota,Male,65 and older,209,,209
21332,2021.0,110106,Lisboa,Alenquer,Carnota,Female,65 and older,255,,255


  **Step 4.1.2** Select all rows where the `District` is `Lisboa` and the `Population` is greater than or equal to 2000 people.

In [292]:
population.query('District == "Lisboa" and Population >= 2000')

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
6518,2021.0,111409,Lisboa,Vila Franca de Xira,Vila Franca de Xira,Female,25 - 64 year old,5063,,5063
15841,2021.0,111409,Lisboa,Vila Franca de Xira,Vila Franca de Xira,Male,25 - 64 year old,4495,,4495
17798,2021.0,111409,Lisboa,Vila Franca de Xira,Vila Franca de Xira,Female,65 and older,2480,,2480
4467,2021.0,111408,Lisboa,Vila Franca de Xira,Vialonga,Male,0 - 14 year old,2052,,2052
12404,2021.0,111408,Lisboa,Vila Franca de Xira,Vialonga,Male,25 - 64 year old,5533,,5533
...,...,...,...,...,...,...,...,...,...,...
21972,2021.0,111512,Lisboa,Amadora,Alfragide,Male,25 - 64 year old,4272,,4272
14007,2021.0,110120,Lisboa,Alenquer,União das freguesias de Carregado e Cadafais,Male,25 - 64 year old,4000,,4000
17708,2021.0,110120,Lisboa,Alenquer,União das freguesias de Carregado e Cadafais,Female,25 - 64 year old,4318,,4318
1591,2021.0,110119,Lisboa,Alenquer,União das freguesias de Alenquer (Santo Estêvã...,Female,25 - 64 year old,3406,,3406


  **Step 4.1.3** Select all rows where the `Population` is greater than or equal to 2000 people and the `District` is `Lisboa` or `Porto`.

In [294]:
population.query('Population >= 2000 and (District == "Lisboa" or District == "Porto")') 

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
1669,2021.0,131628,Porto,Vila do Conde,Vila do Conde,Female,25 - 64 year old,8423,,8423
6806,2021.0,131628,Porto,Vila do Conde,Vila do Conde,Female,65 and older,3417,,3417
7657,2021.0,131628,Porto,Vila do Conde,Vila do Conde,Male,25 - 64 year old,7649,,7649
11004,2021.0,131628,Porto,Vila do Conde,Vila do Conde,Male,65 and older,2422,,2422
11495,2021.0,131628,Porto,Vila do Conde,Vila do Conde,Female,0 - 14 year old,2058,,2058
...,...,...,...,...,...,...,...,...,...,...
21972,2021.0,111512,Lisboa,Amadora,Alfragide,Male,25 - 64 year old,4272,,4272
14007,2021.0,110120,Lisboa,Alenquer,União das freguesias de Carregado e Cadafais,Male,25 - 64 year old,4000,,4000
17708,2021.0,110120,Lisboa,Alenquer,União das freguesias de Carregado e Cadafais,Female,25 - 64 year old,4318,,4318
1591,2021.0,110119,Lisboa,Alenquer,União das freguesias de Alenquer (Santo Estêvã...,Female,25 - 64 year old,3406,,3406
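
Chained `or` comparisons on the same column get long quickly. As a hedged sketch on made-up data, `query` can also reference Python variables with `@` and test membership with `in`. The names `toy`, `districts` and `threshold` are assumptions introduced for this illustration.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'District': ['Lisboa', 'Porto', 'Faro', 'Lisboa'],
    'Population': [2500, 3000, 80, 150],
})

districts = ['Lisboa', 'Porto']  # hypothetical search values
threshold = 2000

# '@name' refers to a Python variable; 'in' tests membership in a list
result = toy.query('Population >= @threshold and District in @districts')
print(result)
```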


  **Step 4.2** Use standard dataframe slicing to obtain subsets of data based on a condition or set of conditions.<br>
  **Step 4.2.1** Select all rows where the `District` is `Coimbra`

In [296]:
population[population['District'] == 'Coimbra']

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
1779,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Female,25 - 64 year old,350,,350
8525,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Female,65 and older,205,,205
10475,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Male,25 - 64 year old,314,,314
10488,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Female,0 - 14 year old,78,,78
15764,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Male,0 - 14 year old,77,,77
...,...,...,...,...,...,...,...,...,...,...
13257,2021.0,60102,Coimbra,Arganil,Arganil,Female,15 - 24 year old,180,,180
15456,2021.0,60102,Coimbra,Arganil,Arganil,Male,65 and older,429,,429
18608,2021.0,60102,Coimbra,Arganil,Arganil,Female,65 and older,570,,570
22219,2021.0,60102,Coimbra,Arganil,Arganil,Male,25 - 64 year old,968,,968


  **Step 4.2.2** Select all rows where the `District` is `Coimbra` and the `Population` is less than or equal to 300 people.

In [300]:
population[(population['District'] == 'Coimbra') & (population['Population'] <= 300)]

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
8525,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Female,65 and older,205,,205
10488,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Female,0 - 14 year old,78,,78
15764,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Male,0 - 14 year old,77,,77
19539,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Female,15 - 24 year old,52,,52
19735,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Male,15 - 24 year old,77,,77
...,...,...,...,...,...,...,...,...,...,...
21400,2021.0,60104,Coimbra,Arganil,Benfeita,Female,15 - 24 year old,1,,1
1673,2021.0,60102,Coimbra,Arganil,Arganil,Female,0 - 14 year old,228,,228
3732,2021.0,60102,Coimbra,Arganil,Arganil,Male,0 - 14 year old,223,,223
13257,2021.0,60102,Coimbra,Arganil,Arganil,Female,15 - 24 year old,180,,180


  **Step 4.2.3** Select all rows where the `Population` is less than or equal to 300 people and the `District` is `Coimbra` or `Aveiro`.

In [302]:
population[(population['Population'] <= 300) & ((population['District'] == 'Coimbra') | (population['District'] == 'Aveiro'))]

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
8525,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Female,65 and older,205,,205
10488,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Female,0 - 14 year old,78,,78
15764,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Male,0 - 14 year old,77,,77
19539,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Female,15 - 24 year old,52,,52
19735,2021.0,61704,Coimbra,Vila Nova de Poiares,São Miguel de Poiares,Male,15 - 24 year old,77,,77
...,...,...,...,...,...,...,...,...,...,...
7634,2021.0,10202,Aveiro,Albergaria-a-Velha,Alquerubim,Female,0 - 14 year old,118,,118
10768,2021.0,10202,Aveiro,Albergaria-a-Velha,Alquerubim,Female,15 - 24 year old,116,,116
15185,2021.0,10202,Aveiro,Albergaria-a-Velha,Alquerubim,Male,65 and older,258,,258
19906,2021.0,10202,Aveiro,Albergaria-a-Velha,Alquerubim,Male,0 - 14 year old,141,,141
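
The parentheses in a combined mask are not optional: in Python, `&` binds more tightly than `|`, and both bind more tightly than comparisons such as `==` or `<=`. The sketch below uses a made-up `toy` frame (hypothetical data) to show how dropping the inner parentheses changes the result.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'District': ['Coimbra', 'Aveiro', 'Coimbra', 'Porto'],
    'Population': [250, 900, 400, 100],
})

small = toy['Population'] <= 300
in_coimbra = toy['District'] == 'Coimbra'
in_aveiro = toy['District'] == 'Aveiro'

# Intended filter: small AND (Coimbra OR Aveiro)
intended = toy[small & (in_coimbra | in_aveiro)]

# Without the inner parentheses this reads as (small AND Coimbra) OR Aveiro,
# so the large Aveiro row slips through
different = toy[small & in_coimbra | in_aveiro]

print(intended)
print(different)
```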


There are also methods that allow you to effortlessly create more elaborate conditions. One possibility is using `isin` to obtain a boolean mask that you can use to filter the dataframe based on a set of values you want to check for. More info on this method can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html).

**Step 4.3.1** Use the `isin` function to create a boolean mask that you can use to filter the dataframe. The mask should be true for all rows where the `District` column is either `Coimbra` or `Leiria`. Use that mask to filter the dataframe `population`.

In [310]:
population['District'].isin(['Coimbra', 'Leiria'])

# to use the mask as a filter, place it inside square brackets after the dataframe name

3232     False
7672     False
13912    False
14498    False
16065    False
         ...  
4904     False
14888    False
15395    False
21325    False
22975    False
Name: District, Length: 24728, dtype: bool

An alternative, and more elegant, way is to create a list with the values you want to check for and use that list as the input to the `isin()` method. Try it.

In [312]:
#create list (try and change it to look for different values)
districts_to_check = ['Coimbra', 'Leiria']

#filter
population[population['District'].isin(districts_to_check)]

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
2629,2021.0,101207,Leiria,Óbidos,Vau,Female,0 - 14 year old,44,,44
3390,2021.0,101207,Leiria,Óbidos,Vau,Female,25 - 64 year old,232,,232
4597,2021.0,101207,Leiria,Óbidos,Vau,Female,65 and older,142,,142
7362,2021.0,101207,Leiria,Óbidos,Vau,Female,15 - 24 year old,39,,39
14849,2021.0,101207,Leiria,Óbidos,Vau,Male,0 - 14 year old,56,,56
...,...,...,...,...,...,...,...,...,...,...
13257,2021.0,60102,Coimbra,Arganil,Arganil,Female,15 - 24 year old,180,,180
15456,2021.0,60102,Coimbra,Arganil,Arganil,Male,65 and older,429,,429
18608,2021.0,60102,Coimbra,Arganil,Arganil,Female,65 and older,570,,570
22219,2021.0,60102,Coimbra,Arganil,Arganil,Male,25 - 64 year old,968,,968


Another example, this time for numerical values, is the `between()` method. More info on this method can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.between.html).

**Step 4.3.1** Use the `between()` function to create a boolean mask that you can use to filter the dataframe. Only the rows where `Population` values are between 100 and 200 should be returned

In [314]:
population[population['Population'].between(100, 200)]

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
21931,2021.0,70523,Évora,Évora,"União das freguesias de Évora (São Mamede, Sé,...",Female,0 - 14 year old,189,,189
23796,2021.0,70523,Évora,Évora,"União das freguesias de Évora (São Mamede, Sé,...",Male,15 - 24 year old,192,,192
4662,2021.0,70527,Évora,Évora,União das freguesias de São Sebastião da Giest...,Female,65 and older,172,,172
15786,2021.0,70527,Évora,Évora,União das freguesias de São Sebastião da Giest...,Male,65 and older,125,,125
1585,2021.0,70526,Évora,Évora,União das freguesias de São Manços e São Vicen...,Male,65 and older,145,,145
...,...,...,...,...,...,...,...,...,...,...
23238,2021.0,10210,Aveiro,Albergaria-a-Velha,São João de Loure e Frossos,Female,0 - 14 year old,171,,171
7634,2021.0,10202,Aveiro,Albergaria-a-Velha,Alquerubim,Female,0 - 14 year old,118,,118
10768,2021.0,10202,Aveiro,Albergaria-a-Velha,Alquerubim,Female,15 - 24 year old,116,,116
19906,2021.0,10202,Aveiro,Albergaria-a-Velha,Alquerubim,Male,0 - 14 year old,141,,141
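
By default, `between()` includes both endpoints. The `inclusive` parameter changes this behaviour; the sketch below shows the effect on a small made-up series (the `values` series is an assumption for illustration).

```python
import pandas as pd

# Toy series (hypothetical)
values = pd.Series([100, 150, 200, 250])

both = values.between(100, 200)                          # default: inclusive='both'
neither = values.between(100, 200, inclusive='neither')  # 100 < x < 200
left_only = values.between(100, 200, inclusive='left')   # 100 <= x < 200

print(both.tolist())
print(neither.tolist())
print(left_only.tolist())
```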


An alternative, and more elegant way is to assign the values you want to search in variables and then use those variables as the inputs to `between`. Try it.

In [318]:
#assign values to the variables
max_value = 200
min_value = 100

#search for the values
population.Population.between(min_value,max_value)

3232     False
7672     False
13912    False
14498    False
16065    False
         ...  
4904     False
14888    False
15395    False
21325    False
22975    False
Name: Population, Length: 24728, dtype: bool

You can also combine these two methods for more powerful queries to your data. 

**Step 4.3** Return the rows in the dataset that belong to the districts `Évora`, `Beja` or `Faro` and whose `Population` is between 50 and 150 inhabitants.

In [325]:
#set search space
districts_to_check = ['Évora', 'Beja', 'Faro']
min_value = 50
max_value = 150

#search for the values
mask1 = population.District.isin(districts_to_check)
mask2 = population.Population.between(min_value, max_value)

population[mask1 & mask2]

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age_bracket,Population,Comments,Population_corrected
15786,2021.0,70527,Évora,Évora,União das freguesias de São Sebastião da Giest...,Male,65 and older,125,,125
1585,2021.0,70526,Évora,Évora,União das freguesias de São Manços e São Vicen...,Male,65 and older,145,,145
5823,2021.0,70526,Évora,Évora,União das freguesias de São Manços e São Vicen...,Male,15 - 24 year old,57,,57
6732,2021.0,70526,Évora,Évora,União das freguesias de São Manços e São Vicen...,Female,0 - 14 year old,59,,59
7830,2021.0,70526,Évora,Évora,União das freguesias de São Manços e São Vicen...,Male,0 - 14 year old,54,,54
...,...,...,...,...,...,...,...,...,...,...
17484,2021.0,20104,Beja,Aljustrel,São João de Negrilhos,Male,0 - 14 year old,71,,71
22913,2021.0,20104,Beja,Aljustrel,São João de Negrilhos,Female,0 - 14 year old,75,,75
2865,2021.0,20103,Beja,Aljustrel,Messejana,Female,0 - 14 year old,50,,50
4738,2021.0,20103,Beja,Aljustrel,Messejana,Male,65 and older,123,,123


### <font color='#BFD72F'>5. Try it out</font> <a class="anchor" id="P5"></a>
  [Back to TOC](#toc)

1. Create a pivot_table whose values are the `sum` of the population separated by **males** and **females** for each `District`, `Municipality`, `Parish`. Store it in variable `exercise_pivot`

In [148]:
population = pd.read_csv(r"./data/pop_censuspt_2021.csv", sep=";")
population.head(1)

Unnamed: 0,Year,Location_Code,District,Municipality,Parish,Sex,Age,Population,Notes,Population_corrected
0,2021,170320,Vila Real,Chaves,Oura,F,65 and older,100,,100


In [156]:
exercise_pivot = population.pivot_table(
    index = ["District", "Municipality", "Parish"],
    columns= "Sex",
    values = "Population",
    aggfunc = "sum",
)

exercise_pivot

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,F,M
District,Municipality,Parish,Unnamed: 3_level_1,Unnamed: 4_level_1
Aveiro,Albergaria-a-Velha,Albergaria-a-Velha e Valmaior,5720,5338
Aveiro,Albergaria-a-Velha,Alquerubim,1137,1096
Aveiro,Albergaria-a-Velha,Angeja,984,891
Aveiro,Albergaria-a-Velha,Branca,2770,2657
Aveiro,Albergaria-a-Velha,Ribeira de Fráguas,749,745
...,...,...,...,...
Évora,Évora,União das freguesias de Malagueira e Horta das Figueiras,11352,10195
Évora,Évora,União das freguesias de Nossa Senhora da Tourega e Nossa Senhora de Guadalupe,513,482
Évora,Évora,União das freguesias de São Manços e São Vicente do Pigeiro,547,533
Évora,Évora,União das freguesias de São Sebastião da Giesteira e Nossa Senhora da Boa Fé,472,455


2. Create a new column in `exercise_pivot` with the total population of each parish. What are the 10 most populated parishes and to which municipality and district do they belong? <br>
**Hint**: You may need to revise your answer to question 1 to get the correct answer to this question.

In [157]:
#computing new column
exercise_pivot["Parish_Population"] = exercise_pivot.F + exercise_pivot.M
exercise_pivot.sort_values(by="Parish_Population", ascending = False).head(10) # Answer

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,F,M,Parish_Population
District,Municipality,Parish,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Lisboa,Sintra,Algueirão-Mem Martins,36259,32390,68649
Lisboa,Cascais,União das freguesias de Cascais e Estoril,34576,29616,64192
Lisboa,Odivelas,Odivelas,31710,27876,59586
Lisboa,Cascais,São Domingos de Rana,31328,27910,59238
Lisboa,Oeiras,"União das freguesias de Oeiras e São Julião da Barra, Paço de Arcos e Caxias",31488,26606,58094
Porto,Vila Nova de Gaia,União das freguesias de Mafamude e Vilar do Paraíso,28401,24443,52844
Setúbal,Setúbal,Setúbal (São Sebastião),27586,25041,52627
Lisboa,Sintra,União das freguesias de Queluz e Belas,27694,24720,52414
Porto,Gondomar,Rio Tinto,27250,23833,51083
Setúbal,Seixal,Corroios,26829,23977,50806
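
An alternative way to get the same kind of result is sketched below, assuming a small made-up pivot (`toy_pivot` and its values are hypothetical). It computes the total with a row-wise `sum(axis=1)`, which extends to any number of columns, and picks the largest rows with `nlargest` instead of sorting the whole table.

```python
import pandas as pd

# Hypothetical pivot with a (District, Parish) index; values are invented
toy_pivot = pd.DataFrame(
    {"F": [30, 7, 12], "M": [33, 6, 12]},
    index=pd.MultiIndex.from_tuples(
        [("A", "P1"), ("B", "P2"), ("B", "P3")],
        names=["District", "Parish"],
    ),
)

# Row-wise sum over the selected columns
toy_pivot["Parish_Population"] = toy_pivot[["F", "M"]].sum(axis=1)

# The 2 rows with the largest totals, already in descending order
top2 = toy_pivot.nlargest(2, "Parish_Population")
print(top2)
```

Because the district and municipality live in the index, they stay attached to each row in the result.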


3. Filter the dataframe `exercise_pivot` so that only the parishes with more females than males appear. Sort them in ascending order of total population.

In [158]:
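# Keep parishes where females outnumber males, smallest total population first
# (.head(5) only limits the display to the first five rows)
more_females = exercise_pivot.query("F > M").sort_values(by="Parish_Population", ascending=True).head(5)
more_females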

Unnamed: 0_level_0,Unnamed: 1_level_0,Sex,F,M,Parish_Population
District,Municipality,Parish,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Ilha das Flores,Lajes das Flores,Fajãzinha,38,33,71
Ilha das Flores,Santa Cruz das Flores,Caveira,40,36,76
Viseu,Tabuaço,União das freguesias de Paradela e Granjinha,51,48,99
Guarda,Trancoso,Granja,58,51,109
Bragança,Vimioso,Vilar Seco,60,56,116
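
The same filter can also be written with a boolean mask instead of `query`. The sketch below compares both forms on a made-up pivot (`toy_pivot` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical pivot; values are invented for illustration
toy_pivot = pd.DataFrame(
    {"F": [30, 7, 12, 50], "M": [33, 6, 12, 40]},
    index=pd.Index(["P1", "P2", "P3", "P4"], name="Parish"),
)
toy_pivot["Parish_Population"] = toy_pivot["F"] + toy_pivot["M"]

# Option 1: query with a string expression
by_query = toy_pivot.query("F > M").sort_values(by="Parish_Population")

# Option 2: boolean mask built from the two columns
by_mask = toy_pivot[toy_pivot["F"] > toy_pivot["M"]].sort_values(by="Parish_Population")

print(by_query)
print(by_mask)
```

Note that the comparison is strict, so a parish with equally many females and males (`P3` in the toy data) is excluded by both forms.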


4. Find the 5 most common `Parish` names and the number of times they appear.<br>
**Hint**: You may need to reset the index of the dataframe.

In [159]:
exercise_pivot.reset_index().Parish.value_counts().sort_values(ascending=False)

Parish
Pinheiro                                                                   6
Santa Bárbara                                                              5
Rio de Moinhos                                                             5
Carvalhal                                                                  5
Santo António                                                              4
                                                                          ..
Vale de Figueira                                                           1
Valongo dos Azeites                                                        1
Bordonhos                                                                  1
Figueiredo de Alva                                                         1
União das freguesias de Évora (São Mamede, Sé, São Pedro e Santo Antão)    1
Name: count, Length: 2873, dtype: int64
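
To show only the most frequent names rather than the full series, `head` can be chained after `value_counts`, which already returns counts in descending order by default. A minimal sketch on made-up data (the series `parishes` is hypothetical):

```python
import pandas as pd

# Hypothetical parish names; repetitions are invented for illustration
parishes = pd.Series(
    ["Pinheiro", "Carvalhal", "Pinheiro", "Oura", "Pinheiro", "Carvalhal"],
    name="Parish",
)

# value_counts sorts by count in descending order by default
counts = parishes.value_counts()

# Keep only the 2 most common names
top2 = counts.head(2)
print(top2)
```

On the census data, the equivalent would be to end the chain with `.head(5)`; names tied on the same count may appear in any order among themselves.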

### That's it for this week. Don't forget to practice!