# Python Pandas
## Filtering and Sorting Data
* 1 - Filter data
 * 1.1 - Ascending/Descending
   * 1.1.1 - Single sorting Index
   * 1.1.2 - Single sorting
   * 1.1.3 - Multiple sorting
 * 1.2 - By condition (equal, smaller or greater than)
   * 1.2.1 - Filter single conditions
       * Plain boolean indexing, `.query()` and `.isin()`
   * 1.2.2 - Filter multiple conditions (and, or)
 * 1.3 - Count Values and Unique Values
   * 1.3.1 - Count Values `.value_counts`
   * 1.3.2 - `.nunique` Number of unique observations
   * 1.3.3 - `.unique` Outputs each unique value
   
[Official Documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)

In [1]:
import pandas as pd
Countries = pd.read_excel('Countries.xlsx', 
                          sheet_name = 'Sheet1', 
                          index_col = 0)

# 1 - Filter data
The sections below show how to (a) sort data in ascending or descending order and (b) filter data by specified criteria.

## 1.1 - Filter data in ascending/descending order (`Sort` by column)


### 1.1.1 - Single sorting criteria on `Index`

Sort the rows by the `Index` column (here, the country names) with the `.sort_index()` method. By default the method sorts in ascending order, i.e. `ascending = True` is the default setting, so the argument can be included or omitted. To invert the order, i.e. to start from the largest or last label, `ascending = False` is required.

In [2]:
# Countries.sort_index()
Countries.sort_index(ascending = True)

Country,European_Region,GDPperCapita(PPP),Population_Millions
Azerbaijan,Eastern,18.8,9.9
Belgium,Western,49.7,11.5
France,Western,47.1,67.3
Germany,Western,55.0,82.9
Italy,Southern,40.7,60.5
Norway,Northern,76.6,5.4
Poland,Eastern,33.5,38.4
Portugal,Southern,33.4,10.3
Spain,Southern,42.1,46.8
Switzerland,Western,66.8,8.6
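
The descending case described above is not run in this notebook, so here is a minimal sketch of it. It assumes a small hand-typed subset of the table stands in for `Countries.xlsx`; the name `countries` (lower case) is only used for this illustration.

```python
import pandas as pd

# Assumption: hand-typed subset of the Countries table, used instead of the Excel file
countries = pd.DataFrame(
    {
        "European_Region": ["Southern", "Eastern", "Western", "Northern"],
        "Population_Millions": [10.3, 38.4, 82.9, 5.4],
    },
    index=pd.Index(["Portugal", "Poland", "Germany", "Norway"], name="Country"),
)

# ascending=False reverses the alphabetical order of the index labels
descending = countries.sort_index(ascending=False)
print(descending)

# sort_index returns a new DataFrame; the original order is left untouched
print(countries.index.tolist())
```

Because the result is a new object, it has to be assigned to a name (as done with `descending`) if the sorted order is needed later.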


### 1.1.2 - Single sorting
Sort by a single column.

Example - Order countries by population in ascending order, and then in descending order.

In [3]:
# Countries.sort_values(by=['Population_Millions'], ascending=True)  ## ascending=True is the default
Countries.sort_values(by=['Population_Millions'])

Country,European_Region,GDPperCapita(PPP),Population_Millions
Norway,Northern,76.6,5.4
Switzerland,Western,66.8,8.6
Azerbaijan,Eastern,18.8,9.9
Portugal,Southern,33.4,10.3
Belgium,Western,49.7,11.5
Poland,Eastern,33.5,38.4
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5
United Kingdom,Northern,47.0,66.4
France,Western,47.1,67.3


In [4]:
Countries.sort_values(by=['Population_Millions'], ascending=False)

Country,European_Region,GDPperCapita(PPP),Population_Millions
Germany,Western,55.0,82.9
France,Western,47.1,67.3
United Kingdom,Northern,47.0,66.4
Italy,Southern,40.7,60.5
Spain,Southern,42.1,46.8
Poland,Eastern,33.5,38.4
Belgium,Western,49.7,11.5
Portugal,Southern,33.4,10.3
Azerbaijan,Eastern,18.8,9.9
Switzerland,Western,66.8,8.6


### 1.1.3 - Multiple sorting
Sort by multiple columns.

Example - Sort by European region and then by GDP per capita.

In [5]:
# Countries.sort_values(['European_Region', 'GDPperCapita(PPP)'], 
#                           ascending=True)  ## ascending=True is the default
Countries.sort_values(['European_Region', 'GDPperCapita(PPP)'])

Country,European_Region,GDPperCapita(PPP),Population_Millions
Azerbaijan,Eastern,18.8,9.9
Poland,Eastern,33.5,38.4
United Kingdom,Northern,47.0,66.4
Norway,Northern,76.6,5.4
Portugal,Southern,33.4,10.3
Italy,Southern,40.7,60.5
Spain,Southern,42.1,46.8
France,Western,47.1,67.3
Belgium,Western,49.7,11.5
Germany,Western,55.0,82.9


In [6]:
Countries.sort_values(['European_Region', 'GDPperCapita(PPP)'], ascending = False)

Country,European_Region,GDPperCapita(PPP),Population_Millions
Switzerland,Western,66.8,8.6
Germany,Western,55.0,82.9
Belgium,Western,49.7,11.5
France,Western,47.1,67.3
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5
Portugal,Southern,33.4,10.3
Norway,Northern,76.6,5.4
United Kingdom,Northern,47.0,66.4
Poland,Eastern,33.5,38.4
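
In the two cells above a single `ascending` value applies to every sort column. `sort_values` also accepts a list with one flag per column, which allows mixed directions. The sketch below assumes a small hand-typed subset of the data in place of `Countries.xlsx`; the names `countries` and `mixed` are only illustrative.

```python
import pandas as pd

# Assumption: hand-typed subset of the Countries table, used instead of the Excel file
countries = pd.DataFrame(
    {
        "European_Region": ["Western", "Eastern", "Western", "Eastern"],
        "GDPperCapita(PPP)": [55.0, 33.5, 47.1, 18.8],
    },
    index=pd.Index(["Germany", "Poland", "France", "Azerbaijan"], name="Country"),
)

# Regions A-Z, and within each region the highest GDP per capita first
mixed = countries.sort_values(
    by=["European_Region", "GDPperCapita(PPP)"],
    ascending=[True, False],
)
print(mixed)
```

The list passed to `ascending` must have the same length as the list passed to `by`.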


## 1.2 - By condition (Equal, smaller/greater than, equivalent)

### 1.2.1 - Filter single values
Example - Show data for Western European Countries.

Start by showing the condition in Boolean form

In [7]:
# Countries.European_Region=='Western'
Countries['European_Region']=='Western'

Country
Portugal          False
Poland            False
Germany            True
United Kingdom    False
France             True
Spain             False
Italy             False
Belgium            True
Norway            False
Switzerland        True
Azerbaijan        False
Name: European_Region, dtype: bool

Apply the Boolean selection to the dataframe

In [8]:
# Countries[Countries.European_Region=='Western']
Countries[Countries['European_Region']=='Western']

Country,European_Region,GDPperCapita(PPP),Population_Millions
Germany,Western,55.0,82.9
France,Western,47.1,67.3
Belgium,Western,49.7,11.5
Switzerland,Western,66.8,8.6


Filter by non `Western` European_Region

In [9]:
Countries[Countries['European_Region']!='Western']

Country,European_Region,GDPperCapita(PPP),Population_Millions
Portugal,Southern,33.4,10.3
Poland,Eastern,33.5,38.4
United Kingdom,Northern,47.0,66.4
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5
Norway,Northern,76.6,5.4
Azerbaijan,Eastern,18.8,9.9


**Use `.query()` method for filtering**

Apply the same filter as before, i.e., select western European countries

In [10]:
Countries.query('European_Region=="Western"')

Country,European_Region,GDPperCapita(PPP),Population_Millions
Germany,Western,55.0,82.9
France,Western,47.1,67.3
Belgium,Western,49.7,11.5
Switzerland,Western,66.8,8.6


Second example: apply the `.query()` method to remove Western European countries, using the `!=` (not equal) operator.

In [11]:
Countries.query('European_Region!="Western"')

Country,European_Region,GDPperCapita(PPP),Population_Millions
Portugal,Southern,33.4,10.3
Poland,Eastern,33.5,38.4
United Kingdom,Northern,47.0,66.4
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5
Norway,Northern,76.6,5.4
Azerbaijan,Eastern,18.8,9.9


Apply `.query()` to filter multiple values

Example, select Eastern and Western European Countries

In [12]:
Countries.query('European_Region==["Western", "Eastern"]')

Country,European_Region,GDPperCapita(PPP),Population_Millions
Poland,Eastern,33.5,38.4
Germany,Western,55.0,82.9
France,Western,47.1,67.3
Belgium,Western,49.7,11.5
Switzerland,Western,66.8,8.6
Azerbaijan,Eastern,18.8,9.9
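
When the list of values lives in a Python variable rather than inside the query string, `.query()` can reference it with the `@` prefix. The sketch below assumes a small hand-typed subset of the data in place of `Countries.xlsx`; `countries`, `regions` and `selected` are illustrative names.

```python
import pandas as pd

# Assumption: hand-typed subset of the Countries table, used instead of the Excel file
countries = pd.DataFrame(
    {
        "European_Region": ["Southern", "Eastern", "Western", "Northern"],
        "Population_Millions": [10.3, 38.4, 82.9, 5.4],
    },
    index=pd.Index(["Portugal", "Poland", "Germany", "Norway"], name="Country"),
)

# The list is defined outside the query string and referenced with @
regions = ["Western", "Eastern"]
selected = countries.query("European_Region in @regions")
print(selected)
```

This keeps the query string fixed while the list of regions can change, for example when it comes from user input.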


##### Apply single filters to numerical values

Example - filter countries with a population of at least 45 million (`>=`)

In [13]:
# Countries[Countries['Population_Millions']>=45]  
Countries[Countries.Population_Millions>=45]

Country,European_Region,GDPperCapita(PPP),Population_Millions
Germany,Western,55.0,82.9
United Kingdom,Northern,47.0,66.4
France,Western,47.1,67.3
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5


Applying the `.query()` method to the above example. The query below uses a strict `>`, which returns the same rows for this data.

In [14]:
Countries.query('Population_Millions > 45')

Country,European_Region,GDPperCapita(PPP),Population_Millions
Germany,Western,55.0,82.9
United Kingdom,Northern,47.0,66.4
France,Western,47.1,67.3
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5


##### Apply `Wild Cards` to strings

Example, select Northern and Southern countries. Note that both names end with the five characters 'thern', as opposed to 'tern' in Western and Eastern. Apply the `.str.contains()` method with the desired term; it matches the term anywhere in the string.

In [15]:
Countries[Countries['European_Region'].str.contains("thern")]

Country,European_Region,GDPperCapita(PPP),Population_Millions
Portugal,Southern,33.4,10.3
United Kingdom,Northern,47.0,66.4
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5
Norway,Northern,76.6,5.4
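
The table in this notebook has no missing regions, but real data often does, and a missing value cannot be matched as text. The sketch below shows the `na=False` argument of `.str.contains()`, which treats missing entries as non-matches. The missing region for Germany is a hypothetical value added only for this illustration, and the frame is a hand-typed stand-in for `Countries.xlsx`.

```python
import pandas as pd

# Assumption: hand-typed data; the None entry is hypothetical, not in the real table
countries = pd.DataFrame(
    {"European_Region": ["Southern", "Eastern", None, "Northern"]},
    index=pd.Index(["Portugal", "Poland", "Germany", "Norway"], name="Country"),
)

# na=False makes rows with a missing region count as "no match"
mask = countries["European_Region"].str.contains("thern", na=False)
print(countries[mask])
```

For a match at the end of the string only, `.str.endswith("thern")` is a stricter alternative to a substring search.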


##### Filter single conditions using `.isin()` method
The `.isin()` method allows for a greater level of flexibility when applying filters.

It takes a list of values and keeps the rows whose entry in the column `is in` that list. Below, it keeps the rows where European_Region is `Western`.

In [16]:
Countries[Countries['European_Region'].isin(['Western'])]

Country,European_Region,GDPperCapita(PPP),Population_Millions
Germany,Western,55.0,82.9
France,Western,47.1,67.3
Belgium,Western,49.7,11.5
Switzerland,Western,66.8,8.6


Negate `.isin()` with the `~` operator to remove the given values. As an example, remove 'Western' European countries.

In [17]:
Countries[~Countries['European_Region'].isin(['Western'])]

Country,European_Region,GDPperCapita(PPP),Population_Millions
Portugal,Southern,33.4,10.3
Poland,Eastern,33.5,38.4
United Kingdom,Northern,47.0,66.4
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5
Norway,Northern,76.6,5.4
Azerbaijan,Eastern,18.8,9.9


##### Explicitly filter multiple values per column
Pass several values to `.isin()` to explicitly filter on more than one value of a given column.

Example, select both Southern and Northern European countries.

In [18]:
Countries[Countries['European_Region'].isin(['Southern','Northern'])]

Country,European_Region,GDPperCapita(PPP),Population_Millions
Portugal,Southern,33.4,10.3
United Kingdom,Northern,47.0,66.4
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5
Norway,Northern,76.6,5.4


### 1.2.2 Filter multiple conditions
Apply `and` and `or` types of statements. With boolean indexing, the `and` statement is written as `&` and the `or` statement is written as `|`, and each condition is wrapped in parentheses.

Starting from the country-indexed dataset below

In [19]:
Countries

Country,European_Region,GDPperCapita(PPP),Population_Millions
Portugal,Southern,33.4,10.3
Poland,Eastern,33.5,38.4
Germany,Western,55.0,82.9
United Kingdom,Northern,47.0,66.4
France,Western,47.1,67.3
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5
Belgium,Western,49.7,11.5
Norway,Northern,76.6,5.4
Switzerland,Western,66.8,8.6


##### `And` condition
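Example, find the set of countries with a population of at least 45 million people `and` a GDP per capita of at least 45k per year. Note the difference in how the two columns are accessed below: `Population_Millions` can be reached with attribute notation, whereas `GDPperCapita(PPP)` contains parentheses and therefore needs bracket notation.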

In [20]:
Countries[(Countries.Population_Millions>=45) 
          & (Countries['GDPperCapita(PPP)']>=45)
         ]

Country,European_Region,GDPperCapita(PPP),Population_Millions
Germany,Western,55.0,82.9
United Kingdom,Northern,47.0,66.4
France,Western,47.1,67.3
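
The cell above combines the conditions with `&` rather than the Python keyword `and`. The sketch below illustrates why: `&` compares the two boolean Series row by row, while `and` asks for a single truth value of a whole Series and raises a `ValueError`. The frame is a hand-typed subset standing in for `Countries.xlsx`, and the names `big`, `rich` and `both` are illustrative.

```python
import pandas as pd

# Assumption: hand-typed subset of the Countries table, used instead of the Excel file
countries = pd.DataFrame(
    {
        "GDPperCapita(PPP)": [55.0, 42.1, 76.6, 47.1],
        "Population_Millions": [82.9, 46.8, 5.4, 67.3],
    },
    index=pd.Index(["Germany", "Spain", "Norway", "France"], name="Country"),
)

big = countries["Population_Millions"] >= 45
rich = countries["GDPperCapita(PPP)"] >= 45

# Element-wise "and": True only where both conditions hold for that row
both = countries[big & rich]
print(both)

# The keyword `and` does not work on Series and raises a ValueError
error_message = None
try:
    countries[big and rich]
except ValueError as err:
    error_message = str(err)
print(error_message)
```

The parentheses around each condition in the notebook cell matter for a similar reason: `&` binds more tightly than `>=`, so without them the expression is grouped differently from what was intended.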


##### `And` condition with `.query()` method

Example, Western European countries with more than 50 million people

In [21]:
#Countries.query('(European_Region=="Western") & (Population_Millions>50)')
Countries.query('European_Region=="Western" and Population_Millions>50')

Country,European_Region,GDPperCapita(PPP),Population_Millions
Germany,Western,55.0,82.9
France,Western,47.1,67.3


The same example can be written with the `.isin()` method, an alternative for cases where the default equality test is not enough.

In [22]:
Countries[(Countries['European_Region'].isin(['Western'])) 
          & (Countries['Population_Millions'] >= 50)
         ]

Country,European_Region,GDPperCapita(PPP),Population_Millions
Germany,Western,55.0,82.9
France,Western,47.1,67.3


##### `Or` condition

Example, select the countries that have a population above 65 million people `or` those with a population below 10 million

In [23]:
Countries[(Countries.Population_Millions>65) 
          | (Countries.Population_Millions<10)
         ]

Country,European_Region,GDPperCapita(PPP),Population_Millions
Germany,Western,55.0,82.9
United Kingdom,Northern,47.0,66.4
France,Western,47.1,67.3
Norway,Northern,76.6,5.4
Switzerland,Western,66.8,8.6
Azerbaijan,Eastern,18.8,9.9
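
The same `or` filter can be expressed in other ways. The sketch below shows two that should give the same rows: `.query()` with the keyword `or`, and the negation of `.between()`, whose bounds are inclusive by default. The frame is a hand-typed subset standing in for `Countries.xlsx`, and the variable names are illustrative.

```python
import pandas as pd

# Assumption: hand-typed subset of the Countries table, used instead of the Excel file
countries = pd.DataFrame(
    {"Population_Millions": [82.9, 46.8, 5.4, 10.3]},
    index=pd.Index(["Germany", "Spain", "Norway", "Portugal"], name="Country"),
)

# Inside a query string the keyword `or` can be used instead of `|`
via_query = countries.query("Population_Millions > 65 or Population_Millions < 10")

# between(10, 65) includes both bounds, so its negation is "< 10 or > 65"
via_between = countries[~countries["Population_Millions"].between(10, 65)]

print(via_query)
print(via_between)
```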


##### Filter and column selection

Show only the `Country` column for countries with a population above 40 million people

In [24]:
Countries2 = pd.read_excel('Countries.xlsx', 
                          sheet_name = 'Sheet1')
Countries2

Unnamed: 0,Country,European_Region,GDPperCapita(PPP),Population_Millions
0,Portugal,Southern,33.4,10.3
1,Poland,Eastern,33.5,38.4
2,Germany,Western,55.0,82.9
3,United Kingdom,Northern,47.0,66.4
4,France,Western,47.1,67.3
5,Spain,Southern,42.1,46.8
6,Italy,Southern,40.7,60.5
7,Belgium,Western,49.7,11.5
8,Norway,Northern,76.6,5.4
9,Switzerland,Western,66.8,8.6


In [25]:
### 5 alternative ways to reach the same result (a one-column DataFrame)
#Countries2.loc[Countries2.Population_Millions>40, ['Country']]
#Countries2.loc[Countries2['Population_Millions']>40, ['Country']]
#Countries2[Countries2['Population_Millions']>40][['Country']]
#Countries2[Countries2.Population_Millions>40][['Country']]
Countries2.query('Population_Millions>40')[['Country']]

Unnamed: 0,Country
2,Germany
3,United Kingdom
4,France
5,Spain
6,Italy


In [26]:
### 5 alternative ways to reach the same result (a Series)
#Countries2.loc[Countries2.Population_Millions>40, 'Country']
#Countries2.loc[Countries2['Population_Millions']>40, 'Country']
#Countries2[Countries2['Population_Millions']>40].Country
#Countries2[Countries2.Population_Millions>40].Country
Countries2.query('Population_Millions>40').Country

2           Germany
3    United Kingdom
4            France
5             Spain
6             Italy
Name: Country, dtype: object

## 1.3 - Count Values and Unique Values

### 1.3.1 - Count Values `.value_counts`

Return a Series containing the count of each unique value, by default in descending order starting from the most frequently occurring element. NA values are excluded by default.

In [42]:
Countries2['European_Region'].value_counts()

Western     4
Southern    3
Northern    2
Eastern     2
Name: European_Region, dtype: int64

In [43]:
Countries2['European_Region'].value_counts().sum()

11
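
Counts can also be returned as shares of the total with `normalize=True`. The sketch below assumes the eleven region labels are typed in by hand, in the row order shown for `Countries2`, rather than read from `Countries.xlsx`; `regions`, `counts` and `shares` are illustrative names.

```python
import pandas as pd

# Assumption: region labels typed by hand in the row order of the Countries2 table
regions = pd.Series(
    ["Southern", "Eastern", "Western", "Northern", "Western", "Southern",
     "Southern", "Western", "Northern", "Western", "Eastern"],
    name="European_Region",
)

counts = regions.value_counts()                 # absolute counts
shares = regions.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```

Passing `dropna=False` would additionally report a count for missing values, which are left out by default.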

### 1.3.2 - `.nunique` Number of unique observations
Count distinct observations over the requested axis. In the `Countries2` dataframe there are 11 unique countries and 4 unique European regions.

In [46]:
Countries2.loc[:, ['Country', 'European_Region']].nunique()

Country            11
European_Region     4
dtype: int64


### 1.3.3 - `.unique` Outputs each unique value
Return unique values of Series object in order of appearance. Values are not sorted.

In [45]:
print(Countries2['European_Region'].unique())

['Southern' 'Eastern' 'Western' 'Northern']
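
Since `.unique()` keeps the order of appearance, an alphabetical list needs an explicit sort. A minimal sketch, assuming the region labels are typed in by hand rather than read from `Countries.xlsx`; `regions` and `sorted_regions` are illustrative names.

```python
import pandas as pd

# Assumption: region labels typed by hand in the row order of the Countries2 table
regions = pd.Series(
    ["Southern", "Eastern", "Western", "Northern", "Western", "Southern"],
    name="European_Region",
)

# Python's sorted() turns the unique values into an alphabetical list
sorted_regions = sorted(regions.unique())
print(sorted_regions)

# The number of unique values matches what .nunique() reports
print(len(regions.unique()) == regions.nunique())
```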
