## 11 - Statistical Analysis of Seattle Transit Activity vs Weather

This notebook explores how weather and temporal features influence real-time public transit activity in Seattle.  
We use the **Gold-layer** table `gtfs_rt_weather_joined` to perform correlation analysis and hypothesis testing.

**Goals:**
- Explore the correlation between **temperature** and **number of vehicle updates**
- Perform statistical tests (**Pearson**, **Spearman**, **t-test**, **ANOVA**)
- Identify meaningful behavioral patterns (e.g., **weekday vs. weekend** transit activity)


In [0]:
# PySpark functions for transformation and aggregation
from pyspark.sql.functions import avg, col, count, dayofweek, hour, to_date, when
from pyspark.sql import functions as F
# Python libraries for statistical analysis and visualization
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr, ttest_ind, f_oneway

In [0]:
# Load the Gold layer table that joins real-time GTFS data with weather snapshots
GOLD_PATH = "/gold/gtfs_rt_weather_joined"
df_gold = spark.read.format("delta").load(GOLD_PATH)
df_gold.display()


### Aggregate Transit Volume

In [0]:
# Extract hour of day from each event timestamp for time-based grouping
df_gold = df_gold.withColumn("hour", hour("event_ts"))


In [0]:
# Count vehicle updates per hour to check how unevenly the data is spread across hours
df_gold.groupBy("hour").count().orderBy("count", ascending=False).show()


In [0]:
# Keep only the 3 PM hour (hour == 15), which has the most data across different dates
df_gold = df_gold.filter(col("hour") == 15)

In [0]:
# Group by event date and weekend/weekday flag
# Aggregate to get daily vehicle update count and average temperature
df_daily = df_gold \
    .withColumn("event_date", to_date("event_ts")) \
    .withColumn("day_of_week", dayofweek("event_ts")) \
    .withColumn("is_weekend", when(col("day_of_week").isin(1, 7), True).otherwise(False)) \
    .groupBy("event_date", "is_weekend") \
    .agg(
        count("*").alias("vehicle_update_count"),
        avg("temperature").alias("avg_temp")
    ) \
    .orderBy("event_date")
    
# Convert to pandas for statistical analysis
df_daily_pd = df_daily.toPandas()
df_daily_pd.head()
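
Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which is why the weekend flag above checks for 1 and 7. As a quick sanity check, the sketch below reproduces that numbering with the standard library only. The helper names `spark_dayofweek` and `is_weekend_date` are hypothetical and not part of PySpark.

```python
from datetime import date

def spark_dayofweek(d: date) -> int:
    """Mimic Spark's dayofweek(): 1 = Sunday, ..., 7 = Saturday."""
    # isoweekday() returns 1 = Monday ... 7 = Sunday, so shift by one
    return d.isoweekday() % 7 + 1

def is_weekend_date(d: date) -> bool:
    """Weekend flag using the same rule as the Spark expression: day in (1, 7)."""
    return spark_dayofweek(d) in (1, 7)

# 2024-01-06 was a Saturday, 2024-01-07 a Sunday, 2024-01-08 a Monday
for d in (date(2024, 1, 6), date(2024, 1, 7), date(2024, 1, 8)):
    print(d, spark_dayofweek(d), is_weekend_date(d))
```

This rule agrees with the pandas convention `weekday() >= 5`, where Monday is 0 and Sunday is 6.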


### Correlation Analysis – Temperature vs Vehicle Count

In [0]:
# Drop rows with nulls in key columns
df_corr = df_daily_pd.dropna(subset=["vehicle_update_count", "avg_temp"])

# Pearson correlation (linear relationship)
pearson_corr, pearson_p = pearsonr(df_corr["avg_temp"], df_corr["vehicle_update_count"])

# Spearman correlation (monotonic relationship)
spearman_corr, spearman_p = spearmanr(df_corr["avg_temp"], df_corr["vehicle_update_count"])

# Print correlation results
print(f"Pearson Correlation: {pearson_corr:.2f} (p={pearson_p:.3f})")
print(f"Spearman Correlation: {spearman_corr:.2f} (p={spearman_p:.3f})")

# Plot correlation between average temperature and transit volume
sns.regplot(data=df_corr, x="avg_temp", y="vehicle_update_count")
plt.title("Correlation: Temperature vs Vehicle Update Count")
plt.xlabel("Average Daily Temperature (°F)")
plt.ylabel("Vehicle Updates")
plt.show()


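The correlation between temperature and the number of vehicle updates is weak and positive.

Both p-values are greater than 0.05, so the correlation is not statistically significant: we cannot reject the null hypothesis of no association, and the observed correlation could have arisen by chance.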
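
A confidence interval can make a non-significant correlation easier to interpret than a p-value alone. The sketch below computes an approximate 95% interval for a Pearson coefficient using the Fisher z-transform, which assumes roughly bivariate-normal data. The function name `pearson_ci` and the input values are illustrative assumptions, not results from this notebook.

```python
import math

def pearson_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a Pearson r via the Fisher z-transform."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)              # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical inputs for illustration only
low, high = pearson_ci(r=0.25, n=30)
print(f"95% CI for r: [{low:.2f}, {high:.2f}]")
```

An interval that contains zero is consistent with the "not significant" conclusion, and its width shows how little a small sample of days can say about the strength of the relationship.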

### Correlation Analysis – Temperature vs Vehicle Count (Weekdays)
Here, we test whether restricting the analysis to weekdays reveals a stronger or statistically significant relationship between temperature and transit volume.

In [0]:
# Filter to weekday data only
df_weekday = df_daily_pd.loc[df_daily_pd["is_weekend"] == False]

# Drop rows with nulls
df_corr = df_weekday.dropna(subset=["vehicle_update_count", "avg_temp"])

# Pearson & Spearman correlations (for weekdays only)
pearson_corr, pearson_p = pearsonr(df_corr["avg_temp"], df_corr["vehicle_update_count"])
spearman_corr, spearman_p = spearmanr(df_corr["avg_temp"], df_corr["vehicle_update_count"])

# Print results
print(f"Pearson Correlation: {pearson_corr:.2f} (p={pearson_p:.3f})")
print(f"Spearman Correlation: {spearman_corr:.2f} (p={spearman_p:.3f})")

# Plot
sns.regplot(data=df_corr, x="avg_temp", y="vehicle_update_count")
plt.title("Correlation: Temperature vs Vehicle Update Count (Weekdays)")
plt.xlabel("Average Daily Temperature (°F)")
plt.ylabel("Vehicle Updates")
plt.show()


Although the correlation coefficients increased slightly, the results are still not statistically significant.

### Weekday vs Weekend – t-test

In [0]:
# Ensure weekend flag is present
if "is_weekend" not in df_daily_pd.columns:
    df_daily_pd["is_weekend"] = df_daily_pd["event_date"].apply(lambda d: pd.to_datetime(d).weekday() >= 5)

# Separate weekday and weekend samples
weekend = df_daily_pd[df_daily_pd["is_weekend"] == True]["vehicle_update_count"]
weekday = df_daily_pd[df_daily_pd["is_weekend"] == False]["vehicle_update_count"]

# Perform Welch's independent two-sample t-test (does not assume equal variances)
t_stat, p_val = ttest_ind(weekday, weekend, equal_var=False)

# Output statistics
print(f"Weekday Mean: {weekday.mean():.2f}")
print(f"Weekend Mean: {weekend.mean():.2f}")
print(f"t-statistic = {t_stat:.2f}, p-value = {p_val:.4f}")


Welch's t-test (`equal_var=False`) shows a statistically significant difference between weekday and weekend vehicle update counts.
Vehicle activity during the 3 PM hour is significantly higher on weekdays than on weekends.
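
To make clear what `ttest_ind(..., equal_var=False)` computes, the following sketch derives Welch's t-statistic and the Welch-Satterthwaite degrees of freedom by hand using only the standard library. The function name `welch_t` and the sample values are made up for illustration and are not taken from the Gold table.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up daily counts for illustration only
weekday_demo = [120, 135, 128, 140, 132]
weekend_demo = [90, 85, 95, 88]
t_demo, df_demo = welch_t(weekday_demo, weekend_demo)
print(f"t = {t_demo:.2f}, df = {df_demo:.1f}")
```

A positive t-statistic here means the first group (weekdays) has the higher mean, matching the argument order used in the notebook's `ttest_ind(weekday, weekend, ...)` call.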
