# Exploring the use of Polars dataframes as an alternative to Pandas

Polars is a dataframe package written in the Rust programming language. It can do much of what can be done with Pandas, usually with much better performance.

[Polars documentation](https://docs.pola.rs/)

[Comparison with other tools (Polars documentation)](https://docs.pola.rs/user-guide/misc/comparison/)

#### What advantages do Polars dataframes have over Pandas dataframes?

- Much better performance, speed and memory efficiency.
    - It is built on an engine written in Rust that is highly optimised for parallel execution and efficient memory usage.
    - With the lazy API, it builds and optimises an execution plan before performing any operations ("lazy execution").
- Native support for a number of operations such as datetime manipulation and rounding floats.
- Join operations are faster and deal better with larger datasets.
- Better handling of nulls and missing values.
- Built-in support for Altair visualisations, which are interactive and have an intuitive syntax.
- Ed's personal opinion: 
    - Polars syntax tends to be more explicit than Pandas syntax, which can make it easier to read. That said, it is often possible to perform operations using the same syntax as Pandas, or at least a familiar one, so you have options.
    - Polars displays the data type in more contexts than Pandas (for example, at the top of each column when you return a dataframe), which serves as a useful reminder when you are considering which operations you need to perform.


#### Why might you still opt for Pandas?

- Simpler EDA.
    - Pandas is a much more mature library with plenty of exploratory methods, extensions and libraries.
    - Simpler syntax.
- Seamless integration with a large number of Machine Learning frameworks.
    - **That said, this is improving all the time with Polars. `scikit-learn` supports Polars outputs and Polars integrates seamlessly with TensorFlow.**

Installation
------------

#### With pip

`python -m pip install polars`

#### With uv

`uv add polars`

Import Polars
-------------

In [2]:
import polars as pl

Read data from a .csv
---------------------

In [3]:
data = './data/simple_monthly_timeseries_data.csv'

df_import = pl.read_csv(data, try_parse_dates=True)

df_import.glimpse() # get the first 10 entries for each column in the data.

# Notice that Polars has parsed the dates for us. The original format of the Month column is dd/mm/yyyy.

Rows: 36
Columns: 2
$ Month    <date> 2019-01-01, 2019-02-01, 2019-03-01, 2019-04-01, 2019-05-01, 2019-06-01, 2019-07-01, 2019-08-01, 2019-09-01, 2019-10-01
$ Activity  <i64> 433, 635, 643, 645, 770, 846, 853, 351, 885, 443



Create a copy of the dataframe with a "year" column and group by year.
----------------------------------------------------------------------

In [4]:
df_processed = df_import.with_columns(          # .with_columns() to add one or more columns
    year=pl.col("Month").dt.year(),             # create a "year" column by getting the year date part from "Month"
)
# NB: the trailing comma after the last new column is optional, even when you are only adding one column; it simply makes it easier to add more columns later.

df_processed.glimpse()

Rows: 36
Columns: 3
$ Month    <date> 2019-01-01, 2019-02-01, 2019-03-01, 2019-04-01, 2019-05-01, 2019-06-01, 2019-07-01, 2019-08-01, 2019-09-01, 2019-10-01
$ Activity  <i64> 433, 635, 643, 645, 770, 846, 853, 351, 885, 443
$ year      <i32> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019



In [5]:
result = (
    df_processed.select(                                    # Select the columns you want in the result dataframe
        pl.all().exclude("Month"),
    )
    .group_by(                                              # Group by the specified column
        pl.col("year"),
        maintain_order=True,                                # This will keep the same order the years appear in the data
    )
    .agg(                                                   # Create an aggregation column
        pl.col("Activity").sum().name.prefix("total_"),     # Specify column(s) to aggregate, how to aggregate, and create new name with prefix
    )
)

# .name.prefix is particularly useful when you want to aggregate multiple columns. To do that, it would look like this:
# pl.col("Activity","Cost").sum().name.prefix("total_"),
# This would result in "total_Activity" and "total_Cost" columns.

print(result)

shape: (3, 2)
┌──────┬────────────────┐
│ year ┆ total_Activity │
│ ---  ┆ ---            │
│ i32  ┆ i64            │
╞══════╪════════════════╡
│ 2019 ┆ 7881           │
│ 2020 ┆ 9852           │
│ 2021 ┆ 12314          │
└──────┴────────────────┘


Export the "result" table to a .csv file
----------------------------------------

In [6]:
result.write_csv('data/result.csv')

Import data from a database in conjunction with SQLAlchemy
----------------------------------------------------------

This is largely the same as the connection we have been using with Pandas; however,
the "read_database" function uses keyword arguments and the .connect() method needs
to be used explicitly. N.B.: sensitive server/database info has been removed.

In [7]:
import sqlalchemy as sa

server = r'[SERVER NAME HERE]'
database = '[DATABASE NAME HERE]'

connection_string = ('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+
                     ';DATABASE='+database+
                     ';ENCRYPT=no;TRUSTED_CONNECTION=yes;'
    )

connection_url = sa.engine.URL.create(
    "mssql+pyodbc",
    query=dict(odbc_connect=connection_string)
    )

engine = sa.create_engine(connection_url)

query = 'SELECT TOP (1000) * FROM [TABLE NAME HERE]'

df = pl.read_database(query=query, connection=engine.connect())

df.glimpse()

Rows: 1000
Columns: 9
$ Reporting_Period_Start             <datetime[μs]> 2022-04-01 00:00:00, 2022-04-01 00:00:00, 2022-04-01 00:00:00, 2022-04-01 00:00:00, 2022-04-01 00:00:00, 2022-04-01 00:00:00, 2022-04-01 00:00:00, 2022-04-01 00:00:00, 2022-04-01 00:00:00, 2022-04-01 00:00:00
$ Reporting_Period_End               <datetime[μs]> 2023-03-31 00:00:00, 2023-03-31 00:00:00, 2023-03-31 00:00:00, 2023-03-31 00:00:00, 2023-03-31 00:00:00, 2023-03-31 00:00:00, 2023-03-31 00:00:00, 2023-03-31 00:00:00, 2023-03-31 00:00:00, 2023-03-31 00:00:00
$ HI_Breakdown                                <str> 'Total', 'Total', 'Total', 'Total', 'Total', 'Age Group', 'Age Group', 'Age Group', 'Age Group', 'Age Group'
$ ICB_Code                                    <str> 'QNQ', 'QNX', 'QRL', 'QSL', 'QU9', 'QNQ', 'QNQ', 'QNQ', 'QNQ', 'QNQ'
$ ICB_Name                                    <str> 'NHS FRIMLEY INTEGRATED CARE BOARD', 'NHS SUSSEX INTEGRATED CARE BOARD', 'NHS HAMPSHIRE AND ISLE OF WIGHT INTEGRATED CARE 

Joining tables
--------------

In [8]:
# Create some tables from scratch

source_table = pl.DataFrame(
    {
        "icb_code": ["QRL","QRL","QU9","QU9"],
        "provider_code": ["RHM","RHU","RHW","RNU"],
        "activity": [100,200,400,800]
    }
)

icb_lookup = pl.DataFrame(
    {
        "icb_code": ["QRL","QU9"],
        "icb_name": ["Hampshire and Isle of Wight", "Bucks, Oxon and Berks West"]
    }
)

provider_lookup = pl.DataFrame(
    {
        "provider_code": ["RHM","RHU","RHW","RNU"],
        "provider_name": ["Southampton General","Portsmouth University Hospital","Royal Berks","Oxford Health"]
    }
)

# Use method-chaining to join onto both lookup tables.

join_result =(
    source_table.join(
        icb_lookup,
        on="icb_code",
        how="left"
    )
    .join(
        provider_lookup,
        on="provider_code",
        how="left"
    )
)

print(join_result)

shape: (4, 5)
┌──────────┬───────────────┬──────────┬─────────────────────────────┬───────────────────────┐
│ icb_code ┆ provider_code ┆ activity ┆ icb_name                    ┆ provider_name         │
│ ---      ┆ ---           ┆ ---      ┆ ---                         ┆ ---                   │
│ str      ┆ str           ┆ i64      ┆ str                         ┆ str                   │
╞══════════╪═══════════════╪══════════╪═════════════════════════════╪═══════════════════════╡
│ QRL      ┆ RHM           ┆ 100      ┆ Hampshire and Isle of Wight ┆ Southampton General   │
│ QRL      ┆ RHU           ┆ 200      ┆ Hampshire and Isle of Wight ┆ Portsmouth University │
│          ┆               ┆          ┆                             ┆ Hospital              │
│ QU9      ┆ RHW           ┆ 400      ┆ Bucks, Oxon and Berks West  ┆ Royal Berks           │
│ QU9      ┆ RNU           ┆ 800      ┆ Bucks, Oxon and Berks West  ┆ Oxford Health         │
└──────────┴───────────────┴──────────┴─────────────────────────────┴───────────────────────┘

Concatenating tables
--------------------

In [9]:
# Create another table containing additional rows to add onto "join_result"

rows_to_add = pl.DataFrame(
    {
        "icb_code": ["QSL","QNQ","QNX"],
        "provider_code": ["RH5","RDU","RX2"],
        "activity": [1600,3200,6400],
        "icb_name": ["Somerset","Frimley","Sussex"],
        "provider_name": ["Somerset Foundation Trust","Frimley Health","Sussex Partnership"]
    }
)


concatenation_result = pl.concat(
    [
        join_result,
        rows_to_add,
    ],
    how="vertical",
)

print(concatenation_result)


shape: (7, 5)
┌──────────┬───────────────┬──────────┬─────────────────────────────┬───────────────────────────┐
│ icb_code ┆ provider_code ┆ activity ┆ icb_name                    ┆ provider_name             │
│ ---      ┆ ---           ┆ ---      ┆ ---                         ┆ ---                       │
│ str      ┆ str           ┆ i64      ┆ str                         ┆ str                       │
╞══════════╪═══════════════╪══════════╪═════════════════════════════╪═══════════════════════════╡
│ QRL      ┆ RHM           ┆ 100      ┆ Hampshire and Isle of Wight ┆ Southampton General       │
│ QRL      ┆ RHU           ┆ 200      ┆ Hampshire and Isle of Wight ┆ Portsmouth University     │
│          ┆               ┆          ┆                             ┆ Hospital                  │
│ QU9      ┆ RHW           ┆ 400      ┆ Bucks, Oxon and Berks West  ┆ Royal Berks               │
│ QU9      ┆ RNU           ┆ 800      ┆ Bucks, Oxon and Berks West  ┆ Oxford Health             │
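│ QSL      ┆ RH5           ┆ 1600     ┆ Somerset                    ┆ Somerset Foundation Trust │
│ QNQ      ┆ RDU           ┆ 3200     ┆ Frimley                     ┆ Frimley Health            │
│ QNX      ┆ RX2           ┆ 6400     ┆ Sussex                      ┆ Sussex Partnership        │
└──────────┴───────────────┴──────────┴─────────────────────────────┴───────────────────────────┘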

Pivot / Unpivot
---------------

#### Pivot

In [10]:
pivot_result=(
    concatenation_result.select(                            # just select part of the concatenation_result
        pl.col("icb_code","provider_code","activity"),
    )
    .pivot(
        "provider_code",                                    # what we are pivoting
        index="icb_code",                                   # the column determining the rows
        values="activity",                                  # the values
        aggregate_function="sum"                            # how to combine values when an index/column pair has more than one row
    )
)

print(pivot_result)

shape: (5, 8)
┌──────────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ icb_code ┆ RHM  ┆ RHU  ┆ RHW  ┆ RNU  ┆ RH5  ┆ RDU  ┆ RX2  │
│ ---      ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str      ┆ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  │
╞══════════╪══════╪══════╪══════╪══════╪══════╪══════╪══════╡
│ QRL      ┆ 100  ┆ 200  ┆ null ┆ null ┆ null ┆ null ┆ null │
│ QU9      ┆ null ┆ null ┆ 400  ┆ 800  ┆ null ┆ null ┆ null │
│ QSL      ┆ null ┆ null ┆ null ┆ null ┆ 1600 ┆ null ┆ null │
│ QNQ      ┆ null ┆ null ┆ null ┆ null ┆ null ┆ 3200 ┆ null │
│ QNX      ┆ null ┆ null ┆ null ┆ null ┆ null ┆ null ┆ 6400 │
└──────────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘


#### Unpivot

In [11]:
# Unpivot the pivot_result

unpivot_result = pivot_result.unpivot(
    pivot_result.columns[1:],                   # this takes all the provider_code columns without having to name them all
    index="icb_code"
    )

print(unpivot_result)

shape: (35, 3)
┌──────────┬──────────┬───────┐
│ icb_code ┆ variable ┆ value │
│ ---      ┆ ---      ┆ ---   │
│ str      ┆ str      ┆ i64   │
╞══════════╪══════════╪═══════╡
│ QRL      ┆ RHM      ┆ 100   │
│ QU9      ┆ RHM      ┆ null  │
│ QSL      ┆ RHM      ┆ null  │
│ QNQ      ┆ RHM      ┆ null  │
│ QNX      ┆ RHM      ┆ null  │
│ …        ┆ …        ┆ …     │
│ QRL      ┆ RX2      ┆ null  │
│ QU9      ┆ RX2      ┆ null  │
│ QSL      ┆ RX2      ┆ null  │
│ QNQ      ┆ RX2      ┆ null  │
│ QNX      ┆ RX2      ┆ 6400  │
└──────────┴──────────┴───────┘


While we are at it, let's have a look at how we drop rows containing nulls.

In [12]:
cleaned_unpivot_result = unpivot_result.filter(                 # use the .filter() method
    pl.col("value").is_not_null()
)

print(cleaned_unpivot_result)

shape: (7, 3)
┌──────────┬──────────┬───────┐
│ icb_code ┆ variable ┆ value │
│ ---      ┆ ---      ┆ ---   │
│ str      ┆ str      ┆ i64   │
╞══════════╪══════════╪═══════╡
│ QRL      ┆ RHM      ┆ 100   │
│ QRL      ┆ RHU      ┆ 200   │
│ QU9      ┆ RHW      ┆ 400   │
│ QU9      ┆ RNU      ┆ 800   │
│ QSL      ┆ RH5      ┆ 1600  │
│ QNQ      ┆ RDU      ┆ 3200  │
│ QNX      ┆ RX2      ┆ 6400  │
└──────────┴──────────┴───────┘


Built-in visualisations with Altair
-----------------------------------

Polars uses Altair for its built-in chart plotting (the `altair` package needs to be installed). Altair has interactive panning and zooming
functionality as standard, and it is easy to add tooltips.

Let's return to the timeseries data we imported earlier.

[Polars visualisation documentation](https://docs.pola.rs/user-guide/misc/visualization/)

[Altair documentation](https://altair-viz.github.io/getting_started/overview.html)

In [13]:
chart = (
    df_import.plot.line(
        x="Month",
        y="Activity",
        tooltip="Activity"
    )
    .properties(width=500, title="Monthly Activity")
    .configure_scale(zero=False)
)
chart.encoding.x.title = "Month"
chart.encoding.y.title = "Activity"
chart

Exercises
---------

For these exercises, we will use this [Water, Sanitation and Hygiene](https://www.kaggle.com/datasets/willianoliveiragibin/water-sanitation-and-hygiene) dataset on Kaggle. 

It contains data on the proportion of the population with access to safely managed drinking water, by country and year.

You would be wise to make use of the Polars documentation, in particular the [Expressions](https://docs.pola.rs/user-guide/expressions/) and [IO](https://docs.pola.rs/user-guide/io/) sections.

#### 1. Import the data from the .csv file into a Polars dataframe and view the first 10 entries in the dataset.

In [14]:
file = 'data/safe_drinking_water.csv'

# your code here...

df_import = pl.read_csv(file)

df_import.glimpse()

Rows: 5737
Columns: 3
$ Country                                         <str> 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan'
$ Year                                            <i64> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009
$ Usage of safely managed drinking water services <f64> 11.093327, 11.105221, 12.007733, 12.909922, 13.818684, 14.733853, 15.648427, 16.562523, 17.476011, 18.388884



#### 2. Find the mean values across all years for "Usage of safely managed drinking water services" by country.

In [15]:
# your code here...

df_aggregated = (
    df_import.select(
        pl.all().exclude("Year"),
    )
    .group_by(
        pl.col("Country"),
        maintain_order=True,
    )
    .agg(
        pl.col("Usage of safely managed drinking water services").mean().name.prefix("avg_"),
    )
)

print(df_aggregated)

"""
Bonus material to get columns for multiple aggregations:

.agg([
        pl.col("Usage of safely managed drinking water services").mean().alias("avg_Usage"),
        pl.col("Usage of safely managed drinking water services").min().alias("min_Usage"),
        pl.col("Usage of safely managed drinking water services").max().alias("max_Usage")
    ])

"""


shape: (254, 2)
┌───────────────────────┬─────────────────────────────────┐
│ Country               ┆ avg_Usage of safely managed dr… │
│ ---                   ┆ ---                             │
│ str                   ┆ f64                             │
╞═══════════════════════╪═════════════════════════════════╡
│ Afghanistan           ┆ 20.237917                       │
│ Africa (WHO)          ┆ 25.927864                       │
│ Albania               ┆ 62.007785                       │
│ Algeria               ┆ 73.702565                       │
│ American Samoa        ┆ 87.329029                       │
│ …                     ┆ …                               │
│ Western Pacific (WHO) ┆ null                            │
│ World                 ┆ 66.808092                       │
│ Yemen                 ┆ null                            │
│ Zambia                ┆ null                            │
│ Zimbabwe              ┆ 28.01352                        │
└───────────────────────┴─────────────────────────────────┘


#### 3. Create a new DataFrame based on the aggregated figures where rows containing "null" have been removed.

In [16]:
# your code here...

cleaned_df_aggregated = df_aggregated.filter(
    pl.col("avg_Usage of safely managed drinking water services").is_not_null()
)

print(cleaned_df_aggregated)

shape: (169, 2)
┌───────────────────┬─────────────────────────────────┐
│ Country           ┆ avg_Usage of safely managed dr… │
│ ---               ┆ ---                             │
│ str               ┆ f64                             │
╞═══════════════════╪═════════════════════════════════╡
│ Afghanistan       ┆ 20.237917                       │
│ Africa (WHO)      ┆ 25.927864                       │
│ Albania           ┆ 62.007785                       │
│ Algeria           ┆ 73.702565                       │
│ American Samoa    ┆ 87.329029                       │
│ …                 ┆ …                               │
│ Uzbekistan        ┆ 67.485789                       │
│ Vietnam           ┆ 51.779854                       │
│ Wallis and Futuna ┆ 69.01344                        │
│ World             ┆ 66.808092                       │
│ Zimbabwe          ┆ 28.01352                        │
└───────────────────┴─────────────────────────────────┘
