# OPAN5510 Lab Assignment - Joins

This lab focuses on using Polars to perform data joins and aggregations to answer business questions.

# Bike Trips Dataset

## Prerequisites

For this assignment, you'll need to use Polars for data manipulation.

*Insert a code block to import necessary packages (polars)*

In [2]:
# Import necessary packages
import polars as pl

## Load Datasets

Load the `bike_trips.csv` and `bike_weather.csv` files into Polars DataFrames called `trips` and `weather`, respectively.

*Insert code block that reads these files into Polars DataFrames*

In [5]:
# Load the bike trips and weather data
trips = pl.read_csv("https://raw.githubusercontent.com/philhetzel/opan5510-class9/refs/heads/main/data/bike_trips.csv", null_values='NA')
weather = pl.read_csv("https://raw.githubusercontent.com/philhetzel/opan5510-class9/refs/heads/main/data/bike_weather.csv", null_values=['NA',''])

trips = trips.with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d").alias("date")) # ensure date column is typed correctly
weather = weather.with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d").alias("date")) # ensure date column is typed correctly

print(trips.height)
print(weather.height)

78704
733


## Business Question 1: What was the average `duration` of trips that occurred in rainy weather?

#### Part A: Join the trips and weather data frames

The `trips` data represent every ride taken for a bike share company. The `weather` data represent the prevailing weather for a particular day. Join the `trips` and `weather` data together using the `date` column. Name the resulting DataFrame `trips_weather`.

*Insert a code block below that joins `trips` to `weather` using the `date` column.*

In [6]:
# Join trips and weather data

trips_weather = trips.join(
    weather,
    on="date",
    how="left")

print(trips_weather)


shape: (78_704, 35)
┌────────┬──────────┬─────────────┬────────────┬───┬────────────┬────────┬────────────┬────────────┐
│ id     ┆ duration ┆ start_date  ┆ start_stat ┆ … ┆ cloud_cove ┆ events ┆ wind_dir_d ┆ zip_code_r │
│ ---    ┆ ---      ┆ ---         ┆ ion_name   ┆   ┆ r          ┆ ---    ┆ egrees     ┆ ight       │
│ i64    ┆ i64      ┆ str         ┆ ---        ┆   ┆ ---        ┆ str    ┆ ---        ┆ ---        │
│        ┆          ┆             ┆ str        ┆   ┆ i64        ┆        ┆ i64        ┆ i64        │
╞════════╪══════════╪═════════════╪════════════╪═══╪════════════╪════════╪════════════╪════════════╡
│ 4721   ┆ 3        ┆ 2013-08-29T ┆ Market at  ┆ … ┆ 4          ┆ null   ┆ 286        ┆ 94107      │
│        ┆          ┆ 20:27:00Z   ┆ 10th       ┆   ┆            ┆        ┆            ┆            │
│ 4812   ┆ 3        ┆ 2013-08-29T ┆ 2nd at     ┆ … ┆ 4          ┆ null   ┆ 286        ┆ 94107      │
│        ┆          ┆ 21:30:00Z   ┆ Folsom     ┆   ┆            ┆      

#### Part B: Calculate the average `duration` of trips in poor weather

Using the `trips_weather` DataFrame, compute the average trip `duration` for days that had a weather event (i.e., days whose `events` value is not `null`). Name the average-duration column `avg_duration`. The resulting DataFrame should have one row and be named `avg_bad_weather`.

*Insert a code block that shows the computation of the average trip duration for days that had a weather event.*

In [8]:
# Calculate average duration for trips with weather events
# events is a true null (not the text "null") on days without a weather event
avg_bad_weather = trips_weather.filter(pl.col("events").is_not_null()).select(pl.col("duration").mean().alias("avg_duration"))
print(avg_bad_weather)
print(f"The average duration of trips in poor weather was {avg_bad_weather['avg_duration'][0]:,.2f} minutes.")

shape: (1, 1)
┌──────────────┐
│ avg_duration │
│ ---          │
│ f64          │
╞══════════════╡
│ 10.34899     │
└──────────────┘
The average duration of trips in poor weather was 10.35 minutes.


# Baseball Dataset

## Load Data

Load the `Batting.csv`, `People.csv`, and `HallOfFame.csv` datasets into Polars DataFrames.

*Insert a code block to load your datasets*

In [10]:
# Load baseball datasets
Batting = pl.read_csv("https://raw.githubusercontent.com/philhetzel/opan5510-class9/refs/heads/main/data/Batting.csv", null_values='NA')
People = pl.read_csv("https://raw.githubusercontent.com/philhetzel/opan5510-class9/refs/heads/main/data/People.csv", null_values='NA')
HallOfFame = pl.read_csv("https://raw.githubusercontent.com/philhetzel/opan5510-class9/refs/heads/main/data/HallOfFame.csv", null_values='NA')

print(Batting.height)
print(People.height)
print(HallOfFame.height)


112184
20676
4323


## Question 2: How many home runs (`HR`) were hit by players born in Florida?

#### Part A: Join the `Batting` and `People` DataFrames together

In order to answer this question, you'll need to use the `Batting` and `People` DataFrames. The `Batting` DataFrame has hitting statistics for every season that an individual player has played. The `People` DataFrame represents biographical data about every professional baseball player.

Join the `People` DataFrame into the `Batting` DataFrame so that we can perform analysis on batting statistics using columns from the `People` DataFrame. Name the new DataFrame `stats_w_bio`.

*Insert a code block that joins the `Batting` DataFrame to the `People` DataFrame.*

In [11]:
# Join Batting and People DataFrames
stats_w_bio = Batting.join(
    People,
    on = "playerID",
    how = "left"
)

print(stats_w_bio.height)

112184


#### Part B: Calculate the total number of home runs that were hit by Florida-born players

You would like to know how many home runs (`HR`) were hit by players born in Florida (`birthState` equal to `FL`). Using the `stats_w_bio` DataFrame, which combines batting statistics with biographical information, compute the total number of home runs hit by these players and name the new column `total_hr`. The resulting DataFrame should have one row and be named `florida_hr`.

*Insert a block of code that shows the computation of total home runs hit by players born in Florida. The output of this code block should be a DataFrame.*

In [15]:
# Calculate total home runs by Florida-born players
florida_hr = stats_w_bio.filter(pl.col("birthState") == pl.lit("FL")).select(pl.col("HR").sum().alias("total_hr"))
print(florida_hr)
print(f"The total number of home runs hit by players born in Florida is {florida_hr['total_hr'][0]:,.0f}.")

shape: (1, 1)
┌──────────┐
│ total_hr │
│ ---      │
│ i64      │
╞══════════╡
│ 16225    │
└──────────┘
The total number of home runs hit by players born in Florida is 16,225.


## Question 3: What is the average number of career hits (`H`) for Hall of Fame baseball players?

#### Part A: Clean the `HallOfFame` DataFrame

You would like to perform an analysis on the batting statistics of Hall of Fame baseball players. In order to answer this question, you'll need to use the `Batting` and `HallOfFame` DataFrames.

The first thing that you have to do to perform this analysis is to join the `HallOfFame` data into the `Batting` data to understand which players are "Hall of Famers". To join the data correctly, we have to ensure that the `playerID` field is unique in `HallOfFame`. Not every player in the `HallOfFame` DataFrame is a Hall of Famer; their `inducted` field must have the value of `Y` and their `category` field should have the value of `Player`. Name the new DataFrame `hall_inducted`.

*Insert a code block to transform the `HallOfFame` DataFrame to ensure that `playerID` is unique.*

In [16]:
# Clean HallOfFame DataFrame
filtered_HallOfFame = HallOfFame.filter(
    (pl.col("inducted") == pl.lit("Y")) &
    (pl.col("category") == pl.lit("Player"))
)
print(HallOfFame.height)
print(filtered_HallOfFame["playerID"].n_unique())
print(filtered_HallOfFame.height)


hall_inducted = Batting.join(
    filtered_HallOfFame,
    on = "playerID",
    how = "inner"
)

hall_inducted

4323
270
270


playerID,yearID,stint,teamID,lgID,G,AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,yearID_right,votedBy,ballots,needed,votes,inducted,category,needed_note
str,i64,i64,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,str,str,str,i64,i64,str,i64,i64,i64,str,str,str
"ansonca01",1871,1,"RC1",,25,120,29,39,11,3,0,16,6,2,2,1,,,,,0,1939,"Old Timers",,,,"Y","Player",
"whitede01",1871,1,"CL1",,29,146,40,47,6,5,1,21,2,2,4,1,,,,,0,2013,"Veterans",,,,"Y","Player",
"ansonca01",1872,1,"PH1",,46,217,60,90,10,7,0,48,6,6,16,3,,,,,2,1939,"Old Timers",,,,"Y","Player",
"orourji01",1872,1,"MID",,23,99,25,27,5,0,0,16,1,0,4,0,,,,,1,1945,"Old Timers",,,,"Y","Player",
"whitede01",1872,1,"CL1",,22,109,21,37,2,2,0,22,0,0,4,1,,,,,0,2013,"Veterans",,,,"Y","Player",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"riverma01",2013,1,"NYA","AL",64,0,0,0,0,0,0,0,0,0,0,0,"0","0","0","0",0,2019,"BBWAA",425,319,425,"Y","Player",
"jeterde01",2014,1,"NYA","AL",145,581,47,149,19,1,4,50,10,2,35,87,"0","6","8","4",15,2020,"BBWAA",397,298,396,"Y","Player",
"ortizda01",2014,1,"BOS","AL",142,518,59,136,27,0,35,104,0,0,75,95,"22","3","0","6",18,2022,"BBWAA",394,296,307,"Y","Player",
"ortizda01",2015,1,"BOS","AL",146,528,73,144,37,0,37,108,0,1,77,95,"16","0","0","9",16,2022,"BBWAA",394,296,307,"Y","Player",


#### Part B: Find the average number of career hits across Hall of Fame players

Join the `hall_inducted` data into the `Batting` data by `playerID`. Aggregate the data to compute the average total hits (`H`) across all Hall of Fame players. To do this, you will first need to calculate the total number of hits for each player and then calculate the average hits across all players. Call the new column `average_hits`. The resulting DataFrame should have one row and be named `hof_hits`.

*Insert a code block that joins the `Batting` and the `hall_inducted` data together and then calculates the average number of career hits (`H`) across all Hall of Fame players. The output of this code block should be a DataFrame.*

In [21]:
# Calculate average career hits for Hall of Fame players
hof_hits = hall_inducted.group_by("playerID").agg(pl.col("H").sum().alias("total_hits")).select(pl.col("total_hits").mean().alias("average_hits"))
print(hof_hits)
print(f"The average number of career hits across Hall of Fame players is {hof_hits['average_hits'][0]:,.0f}.")

shape: (1, 1)
┌──────────────┐
│ average_hits │
│ ---          │
│ f64          │
╞══════════════╡
│ 1717.29918   │
└──────────────┘
The average number of career hits across Hall of Fame players is 1,717.
