# Exploring the use of Polars dataframes as an alternative to Pandas

Polars is a DataFrame library written in the Rust programming language. It can do much of what can be done with Pandas, often with much better performance.

[Polars documentation](https://docs.pola.rs/)

[Comparison with other tools (Polars documentation)](https://docs.pola.rs/user-guide/misc/comparison/)

Installation
------------

#### With pip

`python -m pip install polars`

#### With uv

`uv add polars`

Import Polars
-------------

In [2]:
import polars as pl

Read data from a .csv
---------------------

In [None]:
data = './data/simple_monthly_timeseries_data.csv'

df_import = pl.read_csv(data, try_parse_dates=True)

df_import.glimpse() # get the first 10 entries for each column in the data.

# Notice that Polars has parsed the dates for us. The original format of the Month column is dd/mm/yyyy.

Create a copy of the dataframe with a "year" column and group by year.
----------------------------------------------------------------------

In [None]:
df_processed = df_import.with_columns(          # .with_columns() to add one or more columns
    year=pl.col("Month").dt.year(),             # create a "year" column by getting the year date part from "Month"
)
# NB: the trailing comma after the last new column is optional, but it makes it easier to add more columns later.

df_processed.glimpse()

In [None]:
result = (
    df_processed.select(                                    # Select the columns you want in the result dataframe
        pl.all().exclude("Month"),
    )
    .group_by(                                              # Group by the specified column
        pl.col("year"),
        maintain_order=True,                                # This will keep the same order the years appear in the data
    )
    .agg(                                                   # Create an aggregation column
        pl.col("Activity").sum().name.prefix("total_"),     # Specify column(s) to aggregate, how to aggregate, and create new name with prefix
    )
)

# .name.prefix is particularly useful when you want to aggregate multiple columns. To do that, it would look like this:
# pl.col("Activity","Cost").sum().name.prefix("total_"),
# This would result in "total_Activity" and "total_Cost" columns.

print(result)

Export the "result" table to a .csv file
----------------------------------------

In [6]:
result.write_csv('data/result.csv')

Import data from a database in conjunction with SQLAlchemy
----------------------------------------------------------

This is largely the same as the connection we have been using with Pandas; however,
the `read_database` function uses keyword arguments and the engine's `.connect()` method
needs to be called explicitly.

In [None]:
import sqlalchemy as sa

server = r'BIS-000-SP08.bis.xswhealth.nhs.uk, 14431'
database = 'Analyst_SQL_Area'

connection_string = ('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+
                     ';DATABASE='+database+
                     ';ENCRYPT=no;TRUSTED_CONNECTION=yes;'
    )

connection_url = sa.engine.URL.create(
    "mssql+pyodbc",
    query=dict(odbc_connect=connection_string)
    )

engine = sa.create_engine(connection_url)

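query = 'SELECT TOP (1000) * FROM [Analyst_SQL_Area].[Spec].[0002_4-5_CYP_Access_MHB]'

# Open the connection explicitly; the "with" block closes it once the query has run
with engine.connect() as connection:
    df = pl.read_database(query=query, connection=connection)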

df.glimpse()

Joining tables
--------------

In [None]:
# Create some tables from scratch

source_table = pl.DataFrame(
    {
        "icb_code": ["QRL","QRL","QU9","QU9"],
        "provider_code": ["RHM","RHU","RHW","RNU"],
        "activity": [100,200,400,800]
    }
)

icb_lookup = pl.DataFrame(
    {
        "icb_code": ["QRL","QU9"],
        "icb_name": ["Hampshire and Isle of Wight", "Bucks, Oxon and Berks West"]
    }
)

provider_lookup = pl.DataFrame(
    {
        "provider_code": ["RHM","RHU","RHW","RNU"],
        "provider_name": ["Southampton General","Portsmouth University Hospital","Royal Berks","Oxford Health"]
    }
)

# Use method-chaining to join onto both lookup tables.

join_result = (
    source_table.join(
        icb_lookup,
        on="icb_code",
        how="left"
    )
    .join(
        provider_lookup,
        on="provider_code",
        how="left"
    )
)

print(join_result)

Concatenating tables
--------------------

In [None]:
# Create another table containing additional rows to add onto "join_result"

rows_to_add = pl.DataFrame(
    {
        "icb_code": ["QSL","QNQ","QNX"],
        "provider_code": ["RH5","RDU","RX2"],
        "activity": [1600,3200,6400],
        "icb_name": ["Somerset","Frimley","Sussex"],
        "provider_name": ["Somerset Foundation Trust","Frimley Health","Sussex Partnership"]
    }
)


concatenation_result = pl.concat(
    [
        join_result,
        rows_to_add,
    ],
    how="vertical",
)

print(concatenation_result)


Pivot / Unpivot
---------------

#### Pivot

In [None]:
pivot_result = (
    concatenation_result.select(                            # just select part of the concatenation_result
        pl.col("icb_code","provider_code","activity"),
    )
    .pivot(
        "provider_code",                                    # what we are pivoting
        index="icb_code",                                   # the column determining the rows
        values="activity",                                  # the values
        aggregate_function="sum"                            # needed when an index/column combination has more than one value
    )
)

print(pivot_result)

#### Unpivot

In [None]:
# Unpivot the pivot_result

unpivot_result = pivot_result.unpivot(
    pivot_result.columns[1:],                   # this takes all the provider_code columns without having to name them all
    index="icb_code",
)

print(unpivot_result)

While we are at it, let's have a look at how we drop rows containing nulls.

In [None]:
cleaned_unpivot_result = unpivot_result.filter(                 # use the .filter() method
    pl.col("value").is_not_null()
)

print(cleaned_unpivot_result)

Built-in visualisations with Altair
-----------------------------------

Polars uses Altair for its built-in chart plotting through the `.plot` namespace, so Altair needs to be
installed in your environment. Altair has interactive panning and zooming functionality as standard,
and it is easy to add tooltips.

Let's return to the timeseries data we imported earlier.

[Polars visualisation documentation](https://docs.pola.rs/user-guide/misc/visualization/)

[Altair documentation](https://altair-viz.github.io/getting_started/overview.html)

In [None]:
chart = (
    df_import.plot.line(
        x="Month",
        y="Activity",
        tooltip="Activity"
    )
    .properties(width=500, title="Monthly Activity")
    .configure_scale(zero=False)
)
chart.encoding.x.title = "Month"
chart.encoding.y.title = "Activity"
chart

Exercises
---------

For these exercises, we will use this [Water, Sanitation and Hygiene](https://www.kaggle.com/datasets/willianoliveiragibin/water-sanitation-and-hygiene) dataset on Kaggle. 

It contains data on the proportion of the population with access to safely managed drinking water, by country and year.

It is worth making use of the Polars documentation, in particular the [Expressions](https://docs.pola.rs/user-guide/expressions/) and [IO](https://docs.pola.rs/user-guide/io/) sections.

#### 1. Import the data from the .csv file into a Polars dataframe and view the first 10 entries in the dataset.

In [None]:
file = 'data/safe_drinking_water.csv'

# your code here...

#### 2. Find the mean values across all years for "Usage of safely managed drinking water services" by country.

In [None]:
# your code here...


#### 3. Create a new DataFrame based on the aggregated figures where rows containing "null" have been removed.

In [None]:
# your code here...
