In [95]:
import polars as pl
import polars.selectors as cs

# Using `polars` column selectors

In this notebook, we will use [`polars` column selectors](https://docs.pola.rs/api/python/stable/reference/selectors.html#selectors) to

1. select columns,
2. group and aggregate, and
3. reshape tables.

## The Data - World Bank Economic Indicators

First, let's load the World Bank's [World Development Indicators](https://databank.worldbank.org/source/world-development-indicators).

#### Attempt 1

In [139]:
(WB_dev_ind :=
 pl.read_csv('./data/world_bank_raw_download_F23.csv')
)

ComputeError: could not parse `..` as dtype `f64` at column '2011 [YR2011]' (column number 56)

The current offset in the file is 1187125 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `..` to the `null_values` list.

Original error: ```remaining bytes non-empty```

#### Attempt 2

Let's follow the first hint and increase `infer_schema_length`.

In [140]:
(WB_dev_ind :=
 pl.read_csv('./data/world_bank_raw_download_F23.csv', infer_schema_length=10000)
)

Country Name,Region,Series Name,Series Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],1966 [YR1966],1967 [YR1967],1968 [YR1968],1969 [YR1969],1970 [YR1970],1971 [YR1971],1972 [YR1972],1973 [YR1973],1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],1980 [YR1980],1981 [YR1981],1982 [YR1982],1983 [YR1983],1984 [YR1984],1985 [YR1985],1986 [YR1986],1987 [YR1987],1988 [YR1988],1989 [YR1989],1990 [YR1990],1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],1996 [YR1996],1997 [YR1997],1998 [YR1998],1999 [YR1999],2000 [YR2000],2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021],2022 [YR2022]
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""7.6""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""10.9""",""".."""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""7.6""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""10.9""",""".."""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""7.6""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""10.9""",""".."""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""7.6""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""10.9""",""".."""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""7.6""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""..""","""10.9""",""".."""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"""Data from database: World Deve…",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


#### Attempt 3

The file now loads, but every column is read as `str` because missing data is written as `".."`. Let's pass that string to `null_values`.

In [141]:
(WB_dev_ind :=
 pl.read_csv('./data/world_bank_raw_download_F23.csv', infer_schema_length=10000, null_values='..')
)

Country Name,Region,Series Name,Series Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],1966 [YR1966],1967 [YR1967],1968 [YR1968],1969 [YR1969],1970 [YR1970],1971 [YR1971],1972 [YR1972],1973 [YR1973],1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],1980 [YR1980],1981 [YR1981],1982 [YR1982],1983 [YR1983],1984 [YR1984],1985 [YR1985],1986 [YR1986],1987 [YR1987],1988 [YR1988],1989 [YR1989],1990 [YR1990],1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],1996 [YR1996],1997 [YR1997],1998 [YR1998],1999 [YR1999],2000 [YR2000],2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021],2022 [YR2022]
str,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.6,,,,,,,,,,10.9,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.6,,,,,,,,,,10.9,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.6,,,,,,,,,,10.9,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.6,,,,,,,,,,10.9,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.6,,,,,,,,,,10.9,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"""Data from database: World Deve…",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


#### Attempt 4 - Removing rows with a null `Series Name`

Notice that there are some extra rows at the bottom of the table that don't correspond to a series name/code.  Let's remove these.

In [142]:
(WB_dev_ind :=
 pl.read_csv('./data/world_bank_raw_download_F23.csv', infer_schema_length=10000, null_values='..')
 .filter(pl.col("Series Name").is_not_null())
)

Country Name,Region,Series Name,Series Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],1966 [YR1966],1967 [YR1967],1968 [YR1968],1969 [YR1969],1970 [YR1970],1971 [YR1971],1972 [YR1972],1973 [YR1973],1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],1980 [YR1980],1981 [YR1981],1982 [YR1982],1983 [YR1983],1984 [YR1984],1985 [YR1985],1986 [YR1986],1987 [YR1987],1988 [YR1988],1989 [YR1989],1990 [YR1990],1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],1996 [YR1996],1997 [YR1997],1998 [YR1998],1999 [YR1999],2000 [YR2000],2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021],2022 [YR2022]
str,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.6,,,,,,,,,,10.9,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.6,,,,,,,,,,10.9,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.6,,,,,,,,,,10.9,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.6,,,,,,,,,,10.9,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.6,,,,,,,,,,10.9,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""World""",,"""Rural population""","""SP.RUR.TOTL""",2.0122e9,2.0244e9,2.0468e9,2.0783e9,2.1094e9,2.1460e9,2.1846e9,2.2221e9,2.2607e9,2.3011e9,2.3423e9,2.3840e9,2.4237e9,2.4628e9,2.4999e9,2.5368e9,2.5716e9,2.6055e9,2.6368e9,2.6659e9,2.6942e9,2.7227e9,2.7545e9,2.7876e9,2.8194e9,2.8513e9,2.8839e9,2.9170e9,2.9500e9,2.9829e9,3.0160e9,3.0469e9,3.0766e9,3.1053e9,3.1326e9,3.1590e9,3.1851e9,3.2100e9,3.2336e9,3.2559e9,3.2766e9,3.2926e9,3.3050e9,3.3163e9,3.3263e9,3.3352e9,3.3440e9,3.3525e9,3.3599e9,3.3672e9,3.3743e9,3.3835e9,3.3941e9,3.4037e9,3.4118e9,3.4182e9,3.4239e9,3.4290e9,3.4325e9,3.4346e9,3.4354e9,3.4324e9,3.4263e9
"""World""",,"""Urban population""","""SP.URB.TOTL""",1.0183e9,1.0470e9,1.0791e9,1.1141e9,1.1500e9,1.1811e9,1.2128e9,1.2451e9,1.2783e9,1.3122e9,1.3467e9,1.3826e9,1.4186e9,1.4559e9,1.4947e9,1.5318e9,1.5701e9,1.6089e9,1.6515e9,1.6984e9,1.7466e9,1.7967e9,1.8466e9,1.8956e9,1.9456e9,1.9970e9,2.0504e9,2.1055e9,2.1616e9,2.2178e9,2.2756e9,2.3337e9,2.3916e9,2.4493e9,2.5074e9,2.5656e9,2.6244e9,2.6839e9,2.7441e9,2.8046e9,2.8660e9,2.9320e9,3.0014e9,3.0715e9,3.1429e9,3.2157e9,3.2894e9,3.3636e9,3.4400e9,3.5169e9,3.5940e9,3.6686e9,3.7455e9,3.8242e9,3.9043e9,3.9853e9,4.0664e9,4.1474e9,4.2274e9,4.3063e9,4.3837e9,4.4539e9,4.5231e9
"""World""",,"""CO2 emissions (kt)""","""EN.ATM.CO2E.KT""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.1284e7,2.1440e7,2.1390e7,2.1532e7,2.1677e7,2.2299e7,2.2779e7,2.3203e7,2.3366e7,2.3530e7,2.4280e7,2.4644e7,2.4990e7,2.6133e7,2.7332e7,2.8372e7,2.9308e7,3.0419e7,3.0632e7,3.0238e7,3.2096e7,3.3080e7,3.3460e7,3.4120e7,3.4261e7,3.4070e7,3.4146e7,3.4688e7,3.5561e7,3.5477e7,3.3566e7,,
"""World""",,"""Adolescent fertility rate (bir…","""SP.ADO.TFRT""",91.748048,90.692445,94.138599,95.977382,91.732784,89.285728,87.201738,84.5266,85.594114,85.010166,84.750717,84.645394,83.463033,82.905935,82.786104,81.86678,81.784936,79.347406,77.332245,76.727103,75.603382,75.696488,76.842785,76.598129,76.603095,76.050318,75.408671,72.901563,73.311004,73.851735,73.791261,74.656674,73.885417,73.376407,72.666378,71.666689,69.551847,67.726505,66.758498,65.452044,64.43683,63.825338,61.907368,60.069875,57.055993,53.385806,51.88069,51.890943,52.966351,51.994128,52.023991,51.659622,51.43433,51.199118,50.806674,47.189771,45.659168,44.917716,44.070807,43.372848,42.745881,42.479374,


## `polars` Column Selectors

Column selectors let us pick columns based on

- name,
- index,
- type, or
- other useful helper functions like `contains`, `starts_with`, or `matches`

#### Example - Selecting by name

In [63]:
(WB_dev_ind
 .select(cs.by_name("Series Name", "Series Code"))
).head()

Series Name,Series Code
str,str
"""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""


#### Example - Selecting by index

In [64]:
(WB_dev_ind
 .select(cs.by_index(range(1,5)))
).head()

Region,Series Name,Series Code,1960 [YR1960]
str,str,str,f64
"""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",
"""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",
"""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",
"""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",
"""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS""",


#### Example - Selecting all string columns.

In [65]:
(WB_dev_ind
 .select(cs.by_dtype(pl.String))
).head()

Country Name,Region,Series Name,Series Code
str,str,str,str
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""


In [66]:
(WB_dev_ind
 .select(cs.string())
).head()

Country Name,Region,Series Name,Series Code
str,str,str,str
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""


## Familiar string-based helpers

In addition, we can use
- `contains` to check for a sub-string,
- `starts_with` and `ends_with` to select by prefix/suffix,
- `matches` to capture more complicated patterns with a RegEx.

#### Example - Selecting the Series name and code using `contains`

In [67]:
(WB_dev_ind
 .select(cs.contains('Series'))
).head()


Series Name,Series Code
str,str
"""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""
"""Diabetes prevalence (% of popu…","""SH.STA.DIAB.ZS"""


#### Example - Selecting the 1990s using `starts_with`

In [68]:
(WB_dev_ind
 .select(cs.starts_with('199'))
).head()


1990 [YR1990],1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],1996 [YR1996],1997 [YR1997],1998 [YR1998],1999 [YR1999]
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,


#### Example - Selecting the first five years of the 1990s using `matches`

In [69]:
(WB_dev_ind
 .select(cs.matches(r'^199[0-4]'))
).head()


1990 [YR1990],1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994]
f64,f64,f64,f64,f64
,,,,
,,,,
,,,,
,,,,
,,,,


### Combining selectors with set operations

Selectors can also be combined with set operations:

- **Complement.** `~selector1`
- **Union.** `selector1 | selector2`
- **Intersection.** `selector1 & selector2`
- **Difference.** `selector1 - selector2`
- **Symmetric difference.** `selector1 ^ selector2`

#### Example - All the string/index columns excluding the `Series Code`

In [70]:
(WB_dev_ind
 .select(cs.string() - cs.contains('Code'))
).head()


Country Name,Region,Series Name
str,str,str
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…"
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…"
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…"
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…"
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…"


#### Example - All index columns (minus the `Series Code`) and the `1990`s

In [71]:
(WB_dev_ind
 .select(cs.string() - cs.contains('Code') | cs.starts_with('199'))
).head()


Country Name,Region,Series Name,1990 [YR1990],1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],1996 [YR1996],1997 [YR1997],1998 [YR1998],1999 [YR1999]
str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…",,,,,,,,,,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…",,,,,,,,,,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…",,,,,,,,,,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…",,,,,,,,,,
"""Afghanistan""","""Asia""","""Diabetes prevalence (% of popu…",,,,,,,,,,


## Using column selectors to reshape tables.

Another place where column selectors are useful is when reshaping tables, e.g., using `pivot` or `unpivot`.

### Example - Making a tidy subset of the data

**Goal.** Compare the urban and overall population changes across regions for each year in the `1990`s.

**Task 1.** Tidy up the table by reshaping:
1. `filter` to the measures of interest,
2. `unpivot` the years in question, and
3. `pivot` the measures into separate columns.

#### Solution 1 - Without column selectors

First, let's use comprehensions to perform the reshaping (after the filter).

In [113]:
(year_columns := 
 [c for c in WB_dev_ind.columns if c.startswith('199')])

['1990 [YR1990]',
 '1991 [YR1991]',
 '1992 [YR1992]',
 '1993 [YR1993]',
 '1994 [YR1994]',
 '1995 [YR1995]',
 '1996 [YR1996]',
 '1997 [YR1997]',
 '1998 [YR1998]',
 '1999 [YR1999]']

In [114]:
(string_columns :=
 [c for i, c in enumerate(WB_dev_ind.columns) if i in (0, 1, 2)])

['Country Name', 'Region', 'Series Name']

In [115]:
(index_columns :=
 [c for c in string_columns if c != 'Series Name'])

['Country Name', 'Region']

In [130]:
(pop_nums :=
 WB_dev_ind
 .select(cs.string() - cs.contains('Code') | cs.starts_with('199'))
 .filter(pl.col('Series Name').str.contains(r'^(Urban|Population)'))
 .unpivot(on = year_columns,
          index= string_columns,
          variable_name = "Year",
          )
 .pivot(on = 'Series Name',
        index = index_columns + ['Year'],  # Had to manually include the new column
        aggregate_function='sum'
       )
)

Country Name,Region,Year,"Population, total",Urban population
str,str,str,f64,f64
"""Afghanistan""","""Asia""","""1990 [YR1990]""",6.4168776e7,1.3589022e7
"""Albania""","""Europe""","""1990 [YR1990]""",1.9719252e7,7.183332e6
"""Algeria""","""Africa""","""1990 [YR1990]""",1.53108444e8,7.9746534e7
"""American Samoa""","""Oceania""","""1990 [YR1990]""",286908.0,232248.0
"""Andorra""","""Europe""","""1990 [YR1990]""",321414.0,304416.0
…,…,…,…,…
"""Sub-Saharan Africa""",,"""1999 [YR1999]""",6.53883261e8,2.03014745e8
"""Sub-Saharan Africa (excluding …",,"""1999 [YR1999]""",6.53802851e8,2.02974349e8
"""Sub-Saharan Africa (IDA & IBRD…",,"""1999 [YR1999]""",6.53883261e8,2.03014745e8
"""Upper middle income""",,"""1999 [YR1999]""",2.3531e9,1.1255e9


#### Solution 2 - Using column selectors in `unpivot` and `pivot`

In [131]:
(pop_nums :=
 WB_dev_ind
 .select(cs.string() - cs.contains('Code') | cs.starts_with('199'))
 .filter(pl.col('Series Name').str.contains(r'^(Urban|Population)'))
 .unpivot(on = cs.starts_with('199'),
          index= cs.string(),
          variable_name = "Year",
          )
 .pivot(on = 'Series Name',
        index = cs.string() - cs.by_name('Series Name'),  # Column selectors captured the new column!
        aggregate_function='sum'
       )
)

Country Name,Region,Year,"Population, total",Urban population
str,str,str,f64,f64
"""Afghanistan""","""Asia""","""1990 [YR1990]""",6.4168776e7,1.3589022e7
"""Albania""","""Europe""","""1990 [YR1990]""",1.9719252e7,7.183332e6
"""Algeria""","""Africa""","""1990 [YR1990]""",1.53108444e8,7.9746534e7
"""American Samoa""","""Oceania""","""1990 [YR1990]""",286908.0,232248.0
"""Andorra""","""Europe""","""1990 [YR1990]""",321414.0,304416.0
…,…,…,…,…
"""Sub-Saharan Africa""",,"""1999 [YR1999]""",6.53883261e8,2.03014745e8
"""Sub-Saharan Africa (excluding …",,"""1999 [YR1999]""",6.53802851e8,2.02974349e8
"""Sub-Saharan Africa (IDA & IBRD…",,"""1999 [YR1999]""",6.53883261e8,2.03014745e8
"""Upper middle income""",,"""1999 [YR1999]""",2.3531e9,1.1255e9


## Using column selectors to group and aggregate.

Aggregation also benefits from column selectors, both for
1. grouping by multiple columns with `group_by`, and
2. performing multiple aggregations with `agg`.

### Example - Compute the regional population totals.

#### Solution 1 - Without column selectors

In [90]:
(pop_nums
 .filter(pl.col('Region').is_not_null())
 .group_by('Region', 'Year')
 .agg(pl.col('Population, total').sum(),
      pl.col('Urban population').sum(),
     )
)

Region,Year,"Population, total",Urban population
str,str,f64,f64
"""Asia""","""1998 [YR1998]""",1.9675e10,6.7835e9
"""Asia""","""1997 [YR1997]""",1.9393e10,6.5832e9
"""Africa""","""1998 [YR1998]""",4.6636e9,1.6007e9
"""Oceania""","""1992 [YR1992]""",1.6516122e8,1.16916756e8
"""Europe""","""1999 [YR1999]""",4.2982e9,3.0452e9
…,…,…,…
"""Asia""","""1999 [YR1999]""",1.9950e10,6.9842e9
"""The Americas""","""1998 [YR1998]""",4.8727e9,3.7050e9
"""Asia""","""1992 [YR1992]""",1.7941e10,5.6267e9
"""Middle East""","""1992 [YR1992]""",8.52674184e8,5.1825078e8


#### Solution 2 - With column selectors

In [132]:
(pop_nums_by_region_and_year :=
 pop_nums
 .drop(cs.starts_with('C'))
 .filter(pl.col('Region').is_not_null())
 .group_by(cs.string())
 .agg(cs.float().sum())
)

Region,Year,"Population, total",Urban population
str,str,f64,f64
"""The Americas""","""1992 [YR1992]""",4.4579e9,3.2733e9
"""Oceania""","""1993 [YR1993]""",1.67485644e8,1.18040172e8
"""Middle East""","""1993 [YR1993]""",8.73521784e8,5.35684674e8
"""Asia""","""1999 [YR1999]""",1.9950e10,6.9842e9
"""The Americas""","""1994 [YR1994]""",4.5989e9,3.4174e9
…,…,…,…
"""The Americas""","""1990 [YR1990]""",4.3141e9,3.1286e9
"""The Americas""","""1999 [YR1999]""",4.9395e9,3.7764e9
"""Africa""","""1998 [YR1998]""",4.6636e9,1.6007e9
"""Middle East""","""1994 [YR1994]""",8.93041194e8,5.5203258e8


## Cleaning up multiple column transformations

Finally, we can use column selectors to apply the same computation to multiple columns simultaneously.

### Example - Converting the population totals to thousands of people

#### Solution 1 - Without column selectors

In [133]:
(pop_per_1K_by_region_and_year :=
 pop_nums_by_region_and_year
 .with_columns(pl.col('Population, total')/1000, 
               pl.col('Urban population')/1000,
              )
)

Region,Year,"Population, total",Urban population
str,str,f64,f64
"""The Americas""","""1992 [YR1992]""",4.4579e6,3.2733e6
"""Oceania""","""1993 [YR1993]""",167485.644,118040.172
"""Middle East""","""1993 [YR1993]""",873521.784,535684.674
"""Asia""","""1999 [YR1999]""",1.9950e7,6.9842e6
"""The Americas""","""1994 [YR1994]""",4.5989e6,3.4174e6
…,…,…,…
"""The Americas""","""1990 [YR1990]""",4.3141e6,3.1286e6
"""The Americas""","""1999 [YR1999]""",4.9395e6,3.7764e6
"""Africa""","""1998 [YR1998]""",4.6636e6,1.6007e6
"""Middle East""","""1994 [YR1994]""",893041.194,552032.58


#### Solution 2 - With column selectors

In [134]:
(pop_per_1K_by_region_and_year :=
 pop_nums_by_region_and_year
 .with_columns(cs.float()/1000)
)

Region,Year,"Population, total",Urban population
str,str,f64,f64
"""The Americas""","""1992 [YR1992]""",4.4579e6,3.2733e6
"""Oceania""","""1993 [YR1993]""",167485.644,118040.172
"""Middle East""","""1993 [YR1993]""",873521.784,535684.674
"""Asia""","""1999 [YR1999]""",1.9950e7,6.9842e6
"""The Americas""","""1994 [YR1994]""",4.5989e6,3.4174e6
…,…,…,…
"""The Americas""","""1990 [YR1990]""",4.3141e6,3.1286e6
"""The Americas""","""1999 [YR1999]""",4.9395e6,3.7764e6
"""Africa""","""1998 [YR1998]""",4.6636e6,1.6007e6
"""Middle East""","""1994 [YR1994]""",893041.194,552032.58


### Example - Standardize multiple columns

#### Solution 1 - Without column selectors

In [135]:
(pop_z_scores_by_region_and_year :=
 pop_nums_by_region_and_year
 .with_columns((pl.col('Population, total') - pl.col('Population, total').mean())/pl.col('Population, total').std(), 
               (pl.col('Urban population') - pl.col('Urban population').mean())/pl.col('Urban population').std(),
              )
)

Region,Year,"Population, total",Urban population
str,str,f64,f64
"""The Americas""","""1992 [YR1992]""",-0.166381,0.400455
"""Oceania""","""1993 [YR1993]""",-0.857565,-1.128629
"""Middle East""","""1993 [YR1993]""",-0.743824,-0.926231
"""Asia""","""1999 [YR1999]""",2.329283,2.198809
"""The Americas""","""1994 [YR1994]""",-0.143674,0.470314
…,…,…,…
"""The Americas""","""1990 [YR1990]""",-0.18956,0.330323
"""The Americas""","""1999 [YR1999]""",-0.088801,0.644295
"""Africa""","""1998 [YR1998]""",-0.133249,-0.410126
"""Middle East""","""1994 [YR1994]""",-0.74068,-0.918309


#### Solution 2 - With column selectors

In [136]:
(pop_z_scores_by_region_and_year :=
 pop_nums_by_region_and_year
 .with_columns((cs.float() - cs.float().mean())/cs.float().std())
)

Region,Year,"Population, total",Urban population
str,str,f64,f64
"""The Americas""","""1992 [YR1992]""",-0.166381,0.400455
"""Oceania""","""1993 [YR1993]""",-0.857565,-1.128629
"""Middle East""","""1993 [YR1993]""",-0.743824,-0.926231
"""Asia""","""1999 [YR1999]""",2.329283,2.198809
"""The Americas""","""1994 [YR1994]""",-0.143674,0.470314
…,…,…,…
"""The Americas""","""1990 [YR1990]""",-0.18956,0.330323
"""The Americas""","""1999 [YR1999]""",-0.088801,0.644295
"""Africa""","""1998 [YR1998]""",-0.133249,-0.410126
"""Middle East""","""1994 [YR1994]""",-0.74068,-0.918309


### Example - Standardize multiple columns (within `Region`)

#### Solution 1 - Without column selectors

In [137]:
(pop_z_scores_by_region_and_year :=
 pop_nums_by_region_and_year
 .with_columns((pl.col('Population, total') - pl.col('Population, total').mean().over('Region'))/pl.col('Population, total').std().over('Region'), 
               (pl.col('Urban population') - pl.col('Urban population').mean().over('Region'))/pl.col('Urban population').std().over('Region'),
              )
)

Region,Year,"Population, total",Urban population
str,str,f64,f64
"""The Americas""","""1992 [YR1992]""",-0.821409,-0.824405
"""Oceania""","""1993 [YR1993]""",-0.509868,-0.487318
"""Middle East""","""1993 [YR1993]""",-0.450549,-0.463163
"""Asia""","""1999 [YR1999]""",1.465332,1.515147
"""The Americas""","""1994 [YR1994]""",-0.15145,-0.16302
…,…,…,…
"""The Americas""","""1990 [YR1990]""",-1.505272,-1.488382
"""The Americas""","""1999 [YR1999]""",1.467548,1.484156
"""Africa""","""1998 [YR1998]""",1.166318,1.161915
"""Middle East""","""1994 [YR1994]""",-0.158724,-0.161371


#### Solution 2 - With column selectors

In [138]:
(pop_z_scores_by_region_and_year :=
 pop_nums_by_region_and_year
 .with_columns((cs.float() - cs.float().mean().over('Region'))/cs.float().std().over('Region'))
)

Region,Year,"Population, total",Urban population
str,str,f64,f64
"""The Americas""","""1992 [YR1992]""",-0.821409,-0.824405
"""Oceania""","""1993 [YR1993]""",-0.509868,-0.487318
"""Middle East""","""1993 [YR1993]""",-0.450549,-0.463163
"""Asia""","""1999 [YR1999]""",1.465332,1.515147
"""The Americas""","""1994 [YR1994]""",-0.15145,-0.16302
…,…,…,…
"""The Americas""","""1990 [YR1990]""",-1.505272,-1.488382
"""The Americas""","""1999 [YR1999]""",1.467548,1.484156
"""Africa""","""1998 [YR1998]""",1.166318,1.161915
"""Middle East""","""1994 [YR1994]""",-0.158724,-0.161371
