# OPAN5510 Lab Assignment - Joins

This lab focuses on using Polars to perform data joins and aggregations to answer business questions.

# Bike Trips Dataset

## Prerequisites

For this assignment, you'll need to use Polars for data manipulation.

*Insert a code block to import necessary packages (polars)*

In [1]:
# Import necessary packages
import polars as pl

## Load Datasets

Load the `bike_trips.csv` and `bike_weather.csv` files into Polars DataFrames called `trips` and `weather`, respectively.

*Insert code block that reads these files into Polars DataFrames*

In [2]:
# Load the bike trips and weather data
trips = pl.read_csv("https://raw.githubusercontent.com/philhetzel/opan5510-class9/refs/heads/main/data/bike_trips.csv", null_values='NA')
weather = pl.read_csv("https://raw.githubusercontent.com/philhetzel/opan5510-class9/refs/heads/main/data/bike_weather.csv", null_values=['NA',''])

trips = trips.with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d").alias("date")) # ensure date column is typed correctly
weather = weather.with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d").alias("date")) # ensure date column is typed correctly
print(trips.height)
print(weather.height)

78704
733


## Business Question 1: What was the average `duration` of trips that occurred in poor weather?

#### Part A: Join the trips and weather data frames

The `trips` data represent every ride taken for a bike share company. The `weather` data represent the prevailing weather for a particular day. Join the `trips` and `weather` data together using the `date` column. Name the resulting DataFrame `trips_weather`.

*Insert a code block below that joins `trips` to `weather` using the `date` column.*

In [3]:
# Join trips and weather data
trips_weather = trips.join(weather, on="date", how="left")
trips_weather.head()

id,duration,start_date,start_station_name,start_station_id,end_date,end_station_name,end_station_id,bike_id,subscription_type,zip_code,date,max_temperature_f,mean_temperature_f,min_temperature_f,max_dew_point_f,mean_dew_point_f,min_dew_point_f,max_humidity,mean_humidity,min_humidity,max_sea_level_pressure_inches,mean_sea_level_pressure_inches,min_sea_level_pressure_inches,max_visibility_miles,mean_visibility_miles,min_visibility_miles,max_wind_Speed_mph,mean_wind_speed_mph,max_gust_speed_mph,precipitation_inches,cloud_cover,events,wind_dir_degrees,zip_code_right
i64,i64,str,str,i64,str,str,i64,i64,str,i64,date,i64,i64,i64,i64,i64,i64,i64,i64,i64,f64,f64,f64,i64,i64,i64,i64,i64,i64,str,i64,str,i64,i64
4721,3,"""2013-08-29T20:27:00Z""","""Market at 10th""",67,"""2013-08-29T20:30:00Z""","""South Van Ness at Market""",66,416,"""Subscriber""",94107,2013-08-29,74,68,61,61,58,56,93,75,57,30.07,30.02,29.97,10,10,10,23,11,28,"""0""",4,,286,94107
4812,3,"""2013-08-29T21:30:00Z""","""2nd at Folsom""",62,"""2013-08-29T21:33:00Z""","""2nd at Folsom""",62,409,"""Subscriber""",94107,2013-08-29,74,68,61,61,58,56,93,75,57,30.07,30.02,29.97,10,10,10,23,11,28,"""0""",4,,286,94107
4705,3,"""2013-08-29T20:15:00Z""","""Golden Gate at Polk""",59,"""2013-08-29T20:18:00Z""","""San Francisco City Hall""",58,519,"""Subscriber""",94107,2013-08-29,74,68,61,61,58,56,93,75,57,30.07,30.02,29.97,10,10,10,23,11,28,"""0""",4,,286,94107
4841,3,"""2013-08-29T21:48:00Z""","""University and Emerson""",35,"""2013-08-29T21:51:00Z""","""Cowper at University""",37,83,"""Subscriber""",94107,2013-08-29,74,68,61,61,58,56,93,75,57,30.07,30.02,29.97,10,10,10,23,11,28,"""0""",4,,286,94107
4668,4,"""2013-08-29T19:39:00Z""","""San Francisco Caltrain 2 (330 …",69,"""2013-08-29T19:43:00Z""","""Townsend at 7th""",65,489,"""Subscriber""",94107,2013-08-29,74,68,61,61,58,56,93,75,57,30.07,30.02,29.97,10,10,10,23,11,28,"""0""",4,,286,94107


#### Part B: Calculate the average `duration` of trips in poor weather

Using the `trips_weather` DataFrame, compute the average trip `duration` for days that had weather `events` (that is, any day whose `events` value is not `null`). Name the average `duration` column `avg_duration`. The resulting DataFrame should have one row and be named `avg_bad_weather`.

*Insert a code block that shows the computation of the average trip duration for days that had a weather event.*

In [4]:
# Calculate average duration for trips with weather events
avg_bad_weather = (
    trips_weather
    .filter(pl.col("events").is_not_null() & (pl.col("events").cast(pl.Utf8).str.strip_chars() != ""))
    .select(pl.col("duration").mean().alias("avg_duration"))
)

avg_bad_weather

avg_duration
f64
10.34899


# Baseball Dataset

## Load Data

Load the `Batting.csv`, `People.csv`, and `HallOfFame.csv` datasets into Polars DataFrames.

*Insert a code block to load your datasets*

In [7]:
# Load baseball datasets
Batting = pl.read_csv("https://raw.githubusercontent.com/philhetzel/opan5510-class9/refs/heads/main/data/Batting.csv", null_values='NA')
People = pl.read_csv("https://raw.githubusercontent.com/philhetzel/opan5510-class9/refs/heads/main/data/People.csv", null_values='NA')
HallOfFame = pl.read_csv("https://raw.githubusercontent.com/philhetzel/opan5510-class9/refs/heads/main/data/HallOfFame.csv", null_values='NA')


## Business Question 2: How many home runs (`HR`) were hit by players born in Florida?

#### Part A: Join the `Batting` and `People` DataFrames together

In order to answer this question, you'll need to use the `Batting` and `People` DataFrames. The `Batting` DataFrame has hitting statistics for every season that an individual player has played. The `People` DataFrame represents biographical data about every professional baseball player.

Join the `People` DataFrame into the `Batting` DataFrame so that we can perform analysis on batting statistics using columns from the `People` DataFrame. Name the new DataFrame `stats_w_bio`.

*Insert a code block that joins the `Batting` DataFrame to the `People` DataFrame.*

In [8]:
# Join Batting and People DataFrames
stats_w_bio = Batting.join(People.select(["playerID", "birthState"]), on="playerID", how="left")
stats_w_bio.head()

playerID,yearID,stint,teamID,lgID,G,AB,R,H,X2B,X3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,birthState
str,i64,i64,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,str,str,str,i64,str
"""abercda01""",1871,1,"""TRO""",,1,4,0,0,0,0,0,0,0,0,0,0,,,,,0,"""OK"""
"""addybo01""",1871,1,"""RC1""",,25,118,30,32,6,0,0,13,8,1,4,0,,,,,0,"""ON"""
"""allisar01""",1871,1,"""CL1""",,29,137,28,40,4,5,0,19,3,1,2,5,,,,,1,"""PA"""
"""allisdo01""",1871,1,"""WS3""",,27,133,28,44,10,2,2,27,1,1,0,2,,,,,0,"""PA"""
"""ansonca01""",1871,1,"""RC1""",,25,120,29,39,11,3,0,16,6,2,2,1,,,,,0,"""IA"""


#### Part B: Calculate the total number of home runs that were hit by Florida-born players

You would like to know how many home runs (`HR`) were hit by players born in Florida. Using the `stats_w_bio` DataFrame, which combines batting statistics with biographical information, compute the total number of home runs (`HR`) hit by players whose `birthState` is Florida (`FL`). Name the new column `total_hr`. The resulting DataFrame should have one row and be named `florida_hr`.

*Insert a block of code that shows the computation of total home runs hit by players born in Florida. The output of this code block should be a DataFrame.*

In [9]:
# Calculate total home runs by Florida-born players
florida_hr = (
    stats_w_bio
    .filter(pl.col("birthState") == "FL")
    .select(pl.col("HR").fill_null(0).sum().alias("total_hr"))
)

florida_hr

total_hr
i64
16225


## Business Question 3: What is the average number of career hits (`H`) for Hall of Fame baseball players?

#### Part A: Clean the `HallOfFame` DataFrame

You would like to perform an analysis on the batting statistics of Hall of Fame baseball players. In order to answer this question, you'll need to use the `Batting` and `HallOfFame` DataFrames.

The first step is to join the `HallOfFame` data to the `Batting` data to identify which players are Hall of Famers. For the join to work correctly, `playerID` must be unique in `HallOfFame`. Not every row in the `HallOfFame` DataFrame represents a Hall of Famer: keep only rows where `inducted` is `Y` and `category` is `Player`. Name the new DataFrame `hall_inducted`.

*Insert a code block to transform the `HallOfFame` DataFrame to ensure that `playerID` is unique.*

In [10]:
# Clean HallOfFame DataFrame
hall_inducted = (
    HallOfFame
    .filter((pl.col("inducted") == "Y") & (pl.col("category") == "Player"))
    .select("playerID")
    .unique()
)

hall_inducted.height, hall_inducted.head()

(270,
 shape: (5, 1)
 ┌───────────┐
 │ playerID  │
 │ ---       │
 │ str       │
 ╞═══════════╡
 │ burkeje01 │
 │ wanerpa01 │
 │ whitede01 │
 │ mantlmi01 │
 │ wallabo01 │
 └───────────┘)

#### Part B: Find the average number of career hits across Hall of Fame players

Join the `hall_inducted` data into the `Batting` data by `playerID`. Aggregate the data to compute the average total hits (`H`) across all Hall of Fame players. To do this, you will first need to calculate the total number of hits for each player and then calculate the average hits across all players. Call the new column `average_hits`. The resulting DataFrame should have one row and be named `hof_hits`.

*Insert a code block that joins the `Batting` and the `hall_inducted` data together and then calculates the average number of career hits (`H`) across all Hall of Fame players. The output of this code block should be a DataFrame.*

In [11]:
# Calculate average career hits for Hall of Fame players
hof_hits = (
    Batting
    .join(hall_inducted, on="playerID", how="inner")
    .group_by("playerID")
    .agg(pl.col("H").fill_null(0).sum().alias("career_hits"))
    .select(pl.col("career_hits").mean().alias("average_hits"))
)

hof_hits

average_hits
f64
1717.29918
