# Polars Tutorial

* Taken from [Matt Harrison at PyCon 2023](https://www.youtube.com/watch?v=CJ0f45evuME)
* [GitHub dataset](https://github.com/mattharrison/datasets/blob/master/data/__mharrison__2020-2021.csv)
* Unlike the original, I use the LazyFrame API wherever possible

## Imports

In [57]:
import numpy as np
import polars as pl
import polars.selectors as cs

## Load data
* `pl.read_csv()` outputs a `DataFrame`
* `pl.scan_csv()` outputs a `LazyFrame`

In [58]:
# Load from GitHub repo
url = 'https://github.com/mattharrison/datasets/raw/master/data/__mharrison__2020-2021.csv'
df = pl.read_csv(url)
lf = pl.scan_csv(url)

In [59]:
# df
lf.collect()

Tweet id,Tweet permalink,Tweet text,time,impressions,engagements,engagement rate,retweets,replies,likes,user profile clicks,url clicks,hashtag clicks,detail expands,permalink clicks,app opens,app installs,follows,email tweet,dial phone,media views,media engagements,promoted impressions,promoted engagements,promoted engagement rate,promoted retweets,promoted replies,promoted likes,promoted user profile clicks,promoted url clicks,promoted hashtag clicks,promoted detail expands,promoted permalink clicks,promoted app opens,promoted app installs,promoted follows,promoted email tweet,promoted dial phone,promoted media views,promoted media engagements
i64,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i64,i64,i64,i64,i64,i64,i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
1212580517905780737,"""https://twitter.com/__mharriso…","""Sounds like a great topic! htt…","""2020-01-02 03:44:00+00:00""",1465.0,7.0,0.004778,0.0,0.0,3.0,3.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1212582494828036097,"""https://twitter.com/__mharriso…","""@FogleBird Looks like SLC. I c…","""2020-01-02 03:52:00+00:00""",154.0,3.0,0.019481,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1212613735698690049,"""https://twitter.com/__mharriso…","""@afilina That's really amount …","""2020-01-02 05:56:00+00:00""",1024.0,6.0,0.005859,0.0,0.0,1.0,2.0,0.0,0.0,3.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1212911749617242113,"""https://twitter.com/__mharriso…","""@randal_olson I use anaconda w…","""2020-01-03 01:41:00+00:00""",1419.0,14.0,0.009866,0.0,1.0,5.0,7.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1212920556028252160,"""https://twitter.com/__mharriso…","""@AlSweigart Sometimes the stud…","""2020-01-03 02:16:00+00:00""",198.0,1.0,0.005051,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
1475300661851934721,"""https://twitter.com/__mharriso…","""@allison_horst That's awesome!""","""2021-12-27 03:01:00+00:00""",986.0,1.0,0.001014,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1475518143690801156,"""https://twitter.com/__mharriso…","""@willmcgugan You need to find …","""2021-12-27 17:25:00+00:00""",1790.0,7.0,0.003911,0.0,0.0,3.0,1.0,0.0,0.0,3.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1475891441243025408,"""https://twitter.com/__mharriso…","""@posco Visiting Hawaii for the…","""2021-12-28 18:08:00+00:00""",1611.0,12.0,0.007449,0.0,0.0,4.0,4.0,0.0,0.0,4.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1476453819751878656,"""https://twitter.com/__mharriso…","""@johndsaunders My son just bui…","""2021-12-30 07:23:00+00:00""",1354.0,8.0,0.005908,0.0,0.0,2.0,4.0,0.0,0.0,2.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,


In [60]:
# df.columns
lf.collect_schema().names()

['Tweet id',
 'Tweet permalink',
 'Tweet text',
 'time',
 'impressions',
 'engagements',
 'engagement rate',
 'retweets',
 'replies',
 'likes',
 'user profile clicks',
 'url clicks',
 'hashtag clicks',
 'detail expands',
 'permalink clicks',
 'app opens',
 'app installs',
 'follows',
 'email tweet',
 'dial phone',
 'media views',
 'media engagements',
 'promoted impressions',
 'promoted engagements',
 'promoted engagement rate',
 'promoted retweets',
 'promoted replies',
 'promoted likes',
 'promoted user profile clicks',
 'promoted url clicks',
 'promoted hashtag clicks',
 'promoted detail expands',
 'promoted permalink clicks',
 'promoted app opens',
 'promoted app installs',
 'promoted follows',
 'promoted email tweet',
 'promoted dial phone',
 'promoted media views',
 'promoted media engagements']

`DataFrame.sample(n)` returns `n` rows of the DF chosen at random. Since selecting rows at random requires loading all rows, LFs do not have a `.sample()` method.

However, `Expr` does have a `.sample()` method, so we can apply it to every column via `pl.all()` inside a `.select()` context. Note that without a `seed`, each column is sampled independently, so the values in one output row can come from different source rows (as in the output below). To keep rows intact, convert to a DF first via `lf.collect().sample(20)`.

In [61]:
# Select 20 rows at random

# df.sample(20)
# lf.collect().sample(20)  # Sample via converting LF to DF
lf.select(pl.all().sample(20)).collect()

Tweet id,Tweet permalink,Tweet text,time,impressions,engagements,engagement rate,retweets,replies,likes,user profile clicks,url clicks,hashtag clicks,detail expands,permalink clicks,app opens,app installs,follows,email tweet,dial phone,media views,media engagements,promoted impressions,promoted engagements,promoted engagement rate,promoted retweets,promoted replies,promoted likes,promoted user profile clicks,promoted url clicks,promoted hashtag clicks,promoted detail expands,promoted permalink clicks,promoted app opens,promoted app installs,promoted follows,promoted email tweet,promoted dial phone,promoted media views,promoted media engagements
i64,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i64,i64,i64,i64,i64,i64,i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
1337784727474823168,"""https://twitter.com/__mharriso…","""@fulhack Life!""","""2020-03-17 19:01:00+00:00""",1862.0,32.0,0.03125,0.0,0.0,5.0,4.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1302306774700101632,"""https://twitter.com/__mharriso…","""@dvassallo Interesting... Whic…","""2021-09-26 14:25:00+00:00""",295.0,6.0,0.023853,0.0,1.0,0.0,5.0,0.0,0.0,5.0,0.0,0,0,0,0,0,151,0,,,,,,,,,,,,,,,,,,
1433650885582680064,"""https://twitter.com/__mharriso…","""Utah #COVID19 plot for May 5. …","""2020-12-23 05:49:00+00:00""",666.0,209.0,0.058824,0.0,1.0,1.0,0.0,13.0,0.0,0.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1240701506937999360,"""https://twitter.com/__mharriso…","""@sbstnschmtthd One to import l…","""2020-10-06 02:45:00+00:00""",52.0,2.0,0.192327,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1359895060427337730,"""https://twitter.com/__mharriso…","""@RyanQualtrics Thanks for spre…","""2020-10-02 17:30:00+00:00""",2126.0,0.0,0.052632,0.0,1.0,3.0,0.0,0.0,0.0,13.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
1234348713218211840,"""https://twitter.com/__mharriso…","""My company, MetaSnake, has bee…","""2020-10-26 04:47:00+00:00""",2059.0,5.0,0.009298,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1350447679776583686,"""https://twitter.com/__mharriso…","""One annoying thing with pullin…","""2021-07-23 13:14:00+00:00""",515.0,5.0,0.081633,0.0,1.0,4.0,21.0,1.0,0.0,0.0,0.0,0,0,0,0,0,165,0,,,,,,,,,,,,,,,,,,
1463906183509471234,"""https://twitter.com/__mharriso…","""@pydanny 😢Sounds annoying. You…","""2020-11-30 19:14:00+00:00""",1636.0,1.0,0.054545,0.0,0.0,6.0,0.0,107.0,0.0,5.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1233548397681860609,"""https://twitter.com/__mharriso…","""@jackbutcher Going to make a l…","""2021-03-25 22:51:00+00:00""",280.0,59.0,0.044118,0.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,


LazyFrames also don't know their own shape, so the `.shape` attribute is only available on DFs.

In [62]:
# df.shape

lf.collect().shape
# lf.select(pl.len()).collect()  # row count
# len(lf.collect_schema())  # col count

(5791, 40)

## Types

In [63]:
# df.dtypes
lf.collect_schema().dtypes()

[Int64,
 String,
 String,
 String,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 Int64,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String,
 String]

LFs have no estimated size because they are query plans, not in-memory data, so `.estimated_size()` is only available on DFs.

In [64]:
# Estimated size in MB
df.estimated_size() / 10**6

1.925925

In [65]:
# df.describe()
lf.describe()  # Executes the query and returns a DataFrame, so no .collect() is needed

statistic,Tweet id,Tweet permalink,Tweet text,time,impressions,engagements,engagement rate,retweets,replies,likes,user profile clicks,url clicks,hashtag clicks,detail expands,permalink clicks,app opens,app installs,follows,email tweet,dial phone,media views,media engagements,promoted impressions,promoted engagements,promoted engagement rate,promoted retweets,promoted replies,promoted likes,promoted user profile clicks,promoted url clicks,promoted hashtag clicks,promoted detail expands,promoted permalink clicks,promoted app opens,promoted app installs,promoted follows,promoted email tweet,promoted dial phone,promoted media views,promoted media engagements
str,f64,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""count""",5791.0,"""5791""","""5791""","""5791""",5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,5791.0,"""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0"""
"""null_count""",0.0,"""0""","""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791""","""5791"""
"""mean""",1.3604e+18,,,,2297.820411,111.400967,0.034748,0.979796,1.124504,9.400622,20.594543,4.502331,0.019686,34.658436,0.0,0.001036,0.0,0.135901,0.0,0.0,39.868935,39.712485,,,,,,,,,,,,,,,,,,
"""std""",6.8508e+16,,,,16414.560844,976.353689,0.050031,10.903919,6.322059,108.117865,436.521415,32.377223,0.302481,355.671163,0.0,0.045513,0.0,3.870531,0.0,0.0,333.838003,333.753866,,,,,,,,,,,,,,,,,,
"""min""",1.2126e+18,"""https://twitter.com/__mharriso…","""""i"" has a specific connotation…","""2020-01-02 03:44:00+00:00""",7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
"""25%""",1.3142e+18,,,,175.0,3.0,0.007064,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
"""50%""",1.3589e+18,,,,612.0,7.0,0.016043,0.0,0.0,1.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
"""75%""",1.4155e+18,,,,1617.0,25.0,0.040902,0.0,1.0,4.0,4.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
"""max""",1.477e+18,"""https://twitter.com/__mharriso…","""🤯🙏 https://t.co/AsLaS3Hrdv""","""2021-12-31 21:11:00+00:00""",856749.0,45660.0,0.484127,465.0,207.0,5358.0,22393.0,1272.0,12.0,17078.0,0.0,3.0,0.0,191.0,0.0,0.0,16816.0,16816.0,,,,,,,,,,,,,,,,,,


In [66]:
# Find the 25th percentile of each column

# df.quantile(0.25)
lf.quantile(0.25).collect()

Tweet id,Tweet permalink,Tweet text,time,impressions,engagements,engagement rate,retweets,replies,likes,user profile clicks,url clicks,hashtag clicks,detail expands,permalink clicks,app opens,app installs,follows,email tweet,dial phone,media views,media engagements,promoted impressions,promoted engagements,promoted engagement rate,promoted retweets,promoted replies,promoted likes,promoted user profile clicks,promoted url clicks,promoted hashtag clicks,promoted detail expands,promoted permalink clicks,promoted app opens,promoted app installs,promoted follows,promoted email tweet,promoted dial phone,promoted media views,promoted media engagements
f64,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
1.3142e+18,,,,175.0,3.0,0.007064,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,


Whereas in Pandas the DF and the data it contains are manipulated directly, in Polars we use expressions so that Polars can build a query plan. So, in Polars it's generally better to build expressions that can run in parallel within a single context, rather than sequential chains of method calls as in Pandas.

In [67]:
# Select all columns

# df.select(pl.all())  # Preferred over df.select(pl.col('*'))
lf.select(pl.all()).collect()  # Preferred over lf.select(pl.col('*'))

Tweet id,Tweet permalink,Tweet text,time,impressions,engagements,engagement rate,retweets,replies,likes,user profile clicks,url clicks,hashtag clicks,detail expands,permalink clicks,app opens,app installs,follows,email tweet,dial phone,media views,media engagements,promoted impressions,promoted engagements,promoted engagement rate,promoted retweets,promoted replies,promoted likes,promoted user profile clicks,promoted url clicks,promoted hashtag clicks,promoted detail expands,promoted permalink clicks,promoted app opens,promoted app installs,promoted follows,promoted email tweet,promoted dial phone,promoted media views,promoted media engagements
i64,str,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i64,i64,i64,i64,i64,i64,i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
1212580517905780737,"""https://twitter.com/__mharriso…","""Sounds like a great topic! htt…","""2020-01-02 03:44:00+00:00""",1465.0,7.0,0.004778,0.0,0.0,3.0,3.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1212582494828036097,"""https://twitter.com/__mharriso…","""@FogleBird Looks like SLC. I c…","""2020-01-02 03:52:00+00:00""",154.0,3.0,0.019481,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1212613735698690049,"""https://twitter.com/__mharriso…","""@afilina That's really amount …","""2020-01-02 05:56:00+00:00""",1024.0,6.0,0.005859,0.0,0.0,1.0,2.0,0.0,0.0,3.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1212911749617242113,"""https://twitter.com/__mharriso…","""@randal_olson I use anaconda w…","""2020-01-03 01:41:00+00:00""",1419.0,14.0,0.009866,0.0,1.0,5.0,7.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1212920556028252160,"""https://twitter.com/__mharriso…","""@AlSweigart Sometimes the stud…","""2020-01-03 02:16:00+00:00""",198.0,1.0,0.005051,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
1475300661851934721,"""https://twitter.com/__mharriso…","""@allison_horst That's awesome!""","""2021-12-27 03:01:00+00:00""",986.0,1.0,0.001014,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1475518143690801156,"""https://twitter.com/__mharriso…","""@willmcgugan You need to find …","""2021-12-27 17:25:00+00:00""",1790.0,7.0,0.003911,0.0,0.0,3.0,1.0,0.0,0.0,3.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1475891441243025408,"""https://twitter.com/__mharriso…","""@posco Visiting Hawaii for the…","""2021-12-28 18:08:00+00:00""",1611.0,12.0,0.007449,0.0,0.0,4.0,4.0,0.0,0.0,4.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,
1476453819751878656,"""https://twitter.com/__mharriso…","""@johndsaunders My son just bui…","""2021-12-30 07:23:00+00:00""",1354.0,8.0,0.005908,0.0,0.0,2.0,4.0,0.0,0.0,2.0,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,


* In addition to selecting certain columns in a `.select()` context, a `pl.col()` expression can select all columns *except* certain ones via the `.exclude()` method
* `pl.col()` can also select columns via regex (the pattern must start with `^` and end with `$`)

In [68]:
# Select by dtypes: all Float64 cols

# (
#     df
#     .select(
#         pl.col(pl.Float64)
#     )
# )
(
    lf
    .select(
        pl.col(pl.Float64)
    )
).collect()

impressions,engagements,engagement rate,retweets,replies,likes,user profile clicks,url clicks,hashtag clicks,detail expands,permalink clicks
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
1465.0,7.0,0.004778,0.0,0.0,3.0,3.0,0.0,0.0,1.0,0.0
154.0,3.0,0.019481,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
1024.0,6.0,0.005859,0.0,0.0,1.0,2.0,0.0,0.0,3.0,0.0
1419.0,14.0,0.009866,0.0,1.0,5.0,7.0,0.0,0.0,1.0,0.0
198.0,1.0,0.005051,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
…,…,…,…,…,…,…,…,…,…,…
986.0,1.0,0.001014,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1790.0,7.0,0.003911,0.0,0.0,3.0,1.0,0.0,0.0,3.0,0.0
1611.0,12.0,0.007449,0.0,0.0,4.0,4.0,0.0,0.0,4.0,0.0
1354.0,8.0,0.005908,0.0,0.0,2.0,4.0,0.0,0.0,2.0,0.0


In [69]:
# Can also use pl.selectors to select by dtype

# (
#     lf
#     .select(
#         cs.by_dtype([pl.Float32, pl.Float64]),
#     )
# ).collect()
(
    lf
    .select(
        # cs.float(),  # Select all float types
        # cs.integer(),  # Select all int types
        cs.numeric(),  # Select all floats, ints, & uints
    )
).collect()

Tweet id,impressions,engagements,engagement rate,retweets,replies,likes,user profile clicks,url clicks,hashtag clicks,detail expands,permalink clicks,app opens,app installs,follows,email tweet,dial phone,media views,media engagements
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i64,i64,i64,i64,i64,i64,i64
1212580517905780737,1465.0,7.0,0.004778,0.0,0.0,3.0,3.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0
1212582494828036097,154.0,3.0,0.019481,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0
1212613735698690049,1024.0,6.0,0.005859,0.0,0.0,1.0,2.0,0.0,0.0,3.0,0.0,0,0,0,0,0,0,0
1212911749617242113,1419.0,14.0,0.009866,0.0,1.0,5.0,7.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0
1212920556028252160,198.0,1.0,0.005051,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
1475300661851934721,986.0,1.0,0.001014,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0,0,0,0,0,0
1475518143690801156,1790.0,7.0,0.003911,0.0,0.0,3.0,1.0,0.0,0.0,3.0,0.0,0,0,0,0,0,0,0
1475891441243025408,1611.0,12.0,0.007449,0.0,0.0,4.0,4.0,0.0,0.0,4.0,0.0,0,0,0,0,0,0,0
1476453819751878656,1354.0,8.0,0.005908,0.0,0.0,2.0,4.0,0.0,0.0,2.0,0.0,0,0,0,0,0,0,0


A `.select()` context cannot return duplicate column names (unlike Pandas), so rename any duplicates with one of the following:
* `.name.prefix()`
* `.name.suffix()`
* `.alias()`

In [71]:
# Select duplicate columns

# (
#     df.select(
#         pl.col('impressions'),
#         pl.col('impressions').name.suffix('_2'),
#     )
# )

(
    lf.select(
        pl.col('impressions'),
        pl.col('impressions').name.suffix('_2'),
    )
).collect()  # Passing engine='gpu' here requires the optional `cudf_polars` package