# SI 618 Scaling up (polars)
### Dr. Chris Teplovs, School of Information, University of Michigan
Copyright &copy; 2024.  This notebook may not be shared outside of the course without permission.
### Please ensure you have this version:
Version 2024.11.20.1.CT

In [None]:
# Uncomment the following line and run this cell once to install polars. Then, comment out the line again so you don't ever run it again.
#%pip install polars

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import polars as pl

In [2]:
pandas_df = pd.read_csv('athlete_events.csv', na_values=['NA'])
polars_df = pl.read_csv('athlete_events.csv', null_values=['NA'])

In [3]:
print(pandas_df.head())

   ID                      Name Sex   Age  Height  Weight            Team  \
0   1                 A Dijiang   M  24.0   180.0    80.0           China   
1   2                  A Lamusi   M  23.0   170.0    60.0           China   
2   3       Gunnar Nielsen Aaby   M  24.0     NaN     NaN         Denmark   
3   4      Edgar Lindenau Aabye   M  34.0     NaN     NaN  Denmark/Sweden   
4   5  Christine Jacoba Aaftink   F  21.0   185.0    82.0     Netherlands   

   NOC        Games  Year  Season       City          Sport  \
0  CHN  1992 Summer  1992  Summer  Barcelona     Basketball   
1  CHN  2012 Summer  2012  Summer     London           Judo   
2  DEN  1920 Summer  1920  Summer  Antwerpen       Football   
3  DEN  1900 Summer  1900  Summer      Paris     Tug-Of-War   
4  NED  1988 Winter  1988  Winter    Calgary  Speed Skating   

                              Event Medal  
0       Basketball Men's Basketball   NaN  
1      Judo Men's Extra-Lightweight   NaN  
2           Football Men's

In [4]:
print(polars_df.head())

shape: (5, 15)
┌─────┬────────────────────┬─────┬─────┬───┬───────────┬───────────────┬───────────────────┬───────┐
│ ID  ┆ Name               ┆ Sex ┆ Age ┆ … ┆ City      ┆ Sport         ┆ Event             ┆ Medal │
│ --- ┆ ---                ┆ --- ┆ --- ┆   ┆ ---       ┆ ---           ┆ ---               ┆ ---   │
│ i64 ┆ str                ┆ str ┆ i64 ┆   ┆ str       ┆ str           ┆ str               ┆ str   │
╞═════╪════════════════════╪═════╪═════╪═══╪═══════════╪═══════════════╪═══════════════════╪═══════╡
│ 1   ┆ A Dijiang          ┆ M   ┆ 24  ┆ … ┆ Barcelona ┆ Basketball    ┆ Basketball Men's  ┆ null  │
│     ┆                    ┆     ┆     ┆   ┆           ┆               ┆ Basketball        ┆       │
│ 2   ┆ A Lamusi           ┆ M   ┆ 23  ┆ … ┆ London    ┆ Judo          ┆ Judo Men's        ┆ null  │
│     ┆                    ┆     ┆     ┆   ┆           ┆               ┆ Extra-Lightweight ┆       │
│ 3   ┆ Gunnar Nielsen     ┆ M   ┆ 24  ┆ … ┆ Antwerpen ┆ Football      ┆ Foo

In [5]:
pandas_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB


In [6]:
pandas_df.Height.describe()

count    210945.000000
mean        175.338970
std          10.518462
min         127.000000
25%         168.000000
50%         175.000000
75%         183.000000
max         226.000000
Name: Height, dtype: float64

In [7]:
polars_df.schema

Schema([('ID', Int64),
        ('Name', String),
        ('Sex', String),
        ('Age', Int64),
        ('Height', Int64),
        ('Weight', Float64),
        ('Team', String),
        ('NOC', String),
        ('Games', String),
        ('Year', Int64),
        ('Season', String),
        ('City', String),
        ('Sport', String),
        ('Event', String),
        ('Medal', String)])

In [8]:
pandas_df[pandas_df['Age'] > 65]

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
2392,1337,Olof Ahlberg,M,71.0,,,Sweden,SWE,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
7433,4160,Robert Day Andrews,M,75.0,,,United States,USA,1932 Summer,1932,Summer,Los Angeles,Art Competitions,"Art Competitions Mixed Architecture, Unknown E...",
9369,5146,George Denholm Armour,M,68.0,,,Great Britain,GBR,1932 Summer,1932,Summer,Los Angeles,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
9370,5146,George Denholm Armour,M,68.0,,,Great Britain,GBR,1932 Summer,1932,Summer,Los Angeles,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
9371,5146,George Denholm Armour,M,84.0,,,Great Britain,GBR,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
257054,128719,John Quincy Adams Ward,M,97.0,,,United States,USA,1928 Summer,1928,Summer,Amsterdam,Art Competitions,"Art Competitions Mixed Sculpturing, Statues",
260790,130487,Norman L. Wilkinson,M,69.0,,,Great Britain,GBR,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
260791,130487,Norman L. Wilkinson,M,69.0,,,Great Britain,GBR,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",
260792,130487,Norman L. Wilkinson,M,69.0,,,Great Britain,GBR,1948 Summer,1948,Summer,London,Art Competitions,"Art Competitions Mixed Painting, Unknown Event",


In [9]:
polars_df.filter(pl.col("Age") > 65)

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
i64,str,str,i64,i64,f64,str,str,str,i64,str,str,str,str,str
1337,"""Olof Ahlberg""","""M""",71,,,"""Sweden""","""SWE""","""1948 Summer""",1948,"""Summer""","""London""","""Art Competitions""","""Art Competitions Mixed Paintin…",
4160,"""Robert Day Andrews""","""M""",75,,,"""United States""","""USA""","""1932 Summer""",1932,"""Summer""","""Los Angeles""","""Art Competitions""","""Art Competitions Mixed Archite…",
5146,"""George Denholm Armour""","""M""",68,,,"""Great Britain""","""GBR""","""1932 Summer""",1932,"""Summer""","""Los Angeles""","""Art Competitions""","""Art Competitions Mixed Paintin…",
5146,"""George Denholm Armour""","""M""",68,,,"""Great Britain""","""GBR""","""1932 Summer""",1932,"""Summer""","""Los Angeles""","""Art Competitions""","""Art Competitions Mixed Paintin…",
5146,"""George Denholm Armour""","""M""",84,,,"""Great Britain""","""GBR""","""1948 Summer""",1948,"""Summer""","""London""","""Art Competitions""","""Art Competitions Mixed Paintin…",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
128719,"""John Quincy Adams Ward""","""M""",97,,,"""United States""","""USA""","""1928 Summer""",1928,"""Summer""","""Amsterdam""","""Art Competitions""","""Art Competitions Mixed Sculptu…",
130487,"""Norman L. Wilkinson""","""M""",69,,,"""Great Britain""","""GBR""","""1948 Summer""",1948,"""Summer""","""London""","""Art Competitions""","""Art Competitions Mixed Paintin…",
130487,"""Norman L. Wilkinson""","""M""",69,,,"""Great Britain""","""GBR""","""1948 Summer""",1948,"""Summer""","""London""","""Art Competitions""","""Art Competitions Mixed Paintin…",
130487,"""Norman L. Wilkinson""","""M""",69,,,"""Great Britain""","""GBR""","""1948 Summer""",1948,"""Summer""","""London""","""Art Competitions""","""Art Competitions Mixed Paintin…",


In [10]:
pandas_df.groupby('NOC')['Age'].mean()

NOC
AFG    23.538462
AHO    26.589744
ALB    25.342857
ALG    24.370642
AND    23.065089
         ...    
YEM    21.093750
YMD    23.600000
YUG    24.745721
ZAM    23.461039
ZIM    25.200647
Name: Age, Length: 230, dtype: float64

In [11]:
polars_df.group_by('NOC').agg(pl.col("Age").mean())

NOC,Age
str,f64
"""SAA""",25.064516
"""MAD""",23.720339
"""KIR""",21.636364
"""MAR""",25.571429
"""LBA""",23.5
…,…
"""LAT""",26.659091
"""RSA""",25.561275
"""SEY""",22.648649
"""FRG""",24.440121


In [12]:
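# Neither call modifies its DataFrame in place; each returns a new one.
# Only the last expression in a cell is displayed, so the output below is from polars.
pandas_df.fillna(0)
polars_df.fill_null(0)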

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
i64,str,str,i64,i64,f64,str,str,str,i64,str,str,str,str,str
1,"""A Dijiang""","""M""",24,180,80.0,"""China""","""CHN""","""1992 Summer""",1992,"""Summer""","""Barcelona""","""Basketball""","""Basketball Men's Basketball""",
2,"""A Lamusi""","""M""",23,170,60.0,"""China""","""CHN""","""2012 Summer""",2012,"""Summer""","""London""","""Judo""","""Judo Men's Extra-Lightweight""",
3,"""Gunnar Nielsen Aaby""","""M""",24,0,0.0,"""Denmark""","""DEN""","""1920 Summer""",1920,"""Summer""","""Antwerpen""","""Football""","""Football Men's Football""",
4,"""Edgar Lindenau Aabye""","""M""",34,0,0.0,"""Denmark/Sweden""","""DEN""","""1900 Summer""",1900,"""Summer""","""Paris""","""Tug-Of-War""","""Tug-Of-War Men's Tug-Of-War""","""Gold"""
5,"""Christine Jacoba Aaftink""","""F""",21,185,82.0,"""Netherlands""","""NED""","""1988 Winter""",1988,"""Winter""","""Calgary""","""Speed Skating""","""Speed Skating Women's 500 metr…",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
135569,"""Andrzej ya""","""M""",29,179,89.0,"""Poland-1""","""POL""","""1976 Winter""",1976,"""Winter""","""Innsbruck""","""Luge""","""Luge Mixed (Men)'s Doubles""",
135570,"""Piotr ya""","""M""",27,176,59.0,"""Poland""","""POL""","""2014 Winter""",2014,"""Winter""","""Sochi""","""Ski Jumping""","""Ski Jumping Men's Large Hill, …",
135570,"""Piotr ya""","""M""",27,176,59.0,"""Poland""","""POL""","""2014 Winter""",2014,"""Winter""","""Sochi""","""Ski Jumping""","""Ski Jumping Men's Large Hill, …",
135571,"""Tomasz Ireneusz ya""","""M""",30,185,96.0,"""Poland""","""POL""","""1998 Winter""",1998,"""Winter""","""Nagano""","""Bobsleigh""","""Bobsleigh Men's Four""",


In [13]:
polars_df["Age"]

Age
i64
24
23
24
34
21
…
29
27
27
30


In [14]:
polars_df.select(pl.col("Age"))

Age
i64
24
23
24
34
21
…
29
27
27
30


In [15]:
polars_df.select((pl.col("Weight") / (pl.col("Height")/100) ** 2).alias("BMI"))

BMI
f64
24.691358
20.761246
""
""
23.959094
…
27.776911
19.047004
19.047004
28.049671


In [16]:
result = polars_df.with_columns(
    BMI=pl.col("Weight") / ((pl.col("Height")/100) ** 2),
)
print(result)

shape: (271_116, 16)
┌────────┬──────────────────┬─────┬─────┬───┬───────────────┬──────────────────┬───────┬───────────┐
│ ID     ┆ Name             ┆ Sex ┆ Age ┆ … ┆ Sport         ┆ Event            ┆ Medal ┆ BMI       │
│ ---    ┆ ---              ┆ --- ┆ --- ┆   ┆ ---           ┆ ---              ┆ ---   ┆ ---       │
│ i64    ┆ str              ┆ str ┆ i64 ┆   ┆ str           ┆ str              ┆ str   ┆ f64       │
╞════════╪══════════════════╪═════╪═════╪═══╪═══════════════╪══════════════════╪═══════╪═══════════╡
│ 1      ┆ A Dijiang        ┆ M   ┆ 24  ┆ … ┆ Basketball    ┆ Basketball Men's ┆ null  ┆ 24.691358 │
│        ┆                  ┆     ┆     ┆   ┆               ┆ Basketball       ┆       ┆           │
│ 2      ┆ A Lamusi         ┆ M   ┆ 23  ┆ … ┆ Judo          ┆ Judo Men's Extra ┆ null  ┆ 20.761246 │
│        ┆                  ┆     ┆     ┆   ┆               ┆ -Lightweight     ┆       ┆           │
│ 3      ┆ Gunnar Nielsen   ┆ M   ┆ 24  ┆ … ┆ Football      ┆ Football

### Challenge:
Compare the mean age of athletes who competed before 1950 with the mean age of athletes who competed in 1950 or later.  Don't worry about athletes who competed in both periods: you can count them in each category.

In [22]:
# Calculate the mean age of athletes who competed before 1950 and of those who competed in 1950 or later
# (the "after_1950" label includes 1950 itself)
result = polars_df.with_columns(
    pl.when(pl.col("Year") < 1950).then(pl.lit("before_1950")).otherwise(pl.lit("after_1950")).alias("group")
)

result = result.group_by("group").agg(pl.col("Age").mean())
result

group,Age
str,f64
"""after_1950""",25.040514
"""before_1950""",28.521954


### Challenge:
Use `%%timeit` as the first line in a cell to generate estimates for how long that cell takes to run.  See if you can compare the times for code that uses `pandas` with times for code that uses `polars`.  Is `polars` faster?

We have completed the next cell for you using a pandas example:

In [23]:
%%timeit
pandas_df['BMI'] = pandas_df['Weight'] / (pandas_df['Height']/100) ** 2

1.16 ms ± 71.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [25]:
%%timeit
result = polars_df.with_columns(
    BMI=pl.col("Weight") / ((pl.col("Height")/100) ** 2),
)

1.41 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


## Part 2: Hands-on with polars

### Challenge:
We show code in the next cell that uses pandas DataFrames. Re-implement the code using polars DataFrames.

In [26]:
# This uses pandas -- re-implement using polars
dfs = []
for i in range(10):
    df = pd.read_csv(f"199{i}.csv")
    dfs.append(df)
df = pd.concat(dfs)

In [None]:
# Use polars to implement the same logic

### Challenge:
How many rows do you have in the combined dataframe?

In [None]:
# Count the rows using polars

### Challenge:
Calculate the mean departure delay for each of the three airports.  Which airport is "best"?

### Challenge:
What percentage of flights from each New York airport were canceled?

### Challenge:
Which New York airport has shown the most improvement in reducing delays over the time period for which we have data (i.e., 1990-1999)?

## Challenge: Lazy API

If you have time, take a look at the documentation on the [Lazy API for polars](https://docs.pola.rs/user-guide/concepts/lazy-api/) and see if you can re-implement your solutions above using the Lazy API.