# TLC Trip Data Analytics with Weather

> Make sure you have saved the trip data in S3 and created a Glue Data Catalog before starting further data processing.

We use the TLC Trip Record Data to analyze trip data together with weather data, and to create a machine learning model that predicts the number of trip requests per hour for each New York taxi zone.

We will cover the following in this section:
1. Enrich the trip data with weather data
2. Analyze the relationship between trip data and weather data
 


In [None]:
# Initialize the Glue environment

%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 32

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

import os
# Detect if the code is running in a Glue job
def is_glue_job():
    try:
        args = getResolvedOptions(sys.argv,['JOB_NAME'])
        print("JOB_NAME: ", args['JOB_NAME'])
        return True
    except:
        return False

print("Running in Glue Job: ", is_glue_job())

## 1. New York Weather data

Weather is an important factor in taxi demand. We join the NY weather data with the TLC Trip Record Data. The weather data is downloaded from [Visual Crossing](https://www.visualcrossing.com/).

We download the weather data from January 2019 to June 2023 and store it in an Amazon S3 bucket. We then join it with the TLC Trip Record Data.


**Column Description**
- name: name of the weather station
- datetime: date and time of the weather record
- temp: temperature in Fahrenheit
- feelslike: feels like temperature in Fahrenheit
- dew: dew point in Fahrenheit
- humid: relative humidity
- precip: precipitation in inches
- precipprob: probability of precipitation
- snow: snowfall in inches
- snowdepth: snow depth in inches
- windgust: wind gust in miles per hour
- windspeed: wind speed in miles per hour
- winddir: wind direction in degrees
- sealvlpressure: sea level pressure in millibars
- cloudcover: cloud cover percentage
- visibility: visibility in miles
- solarradiation: solar radiation in watts per square meter
- solarenergy: solar energy in megajoules per square meter
- uvindex: UV index
- severerisk: severe weather risk index
- conditions: weather conditions
- icon: weather icon
- stations: number of weather stations reporting data


### 1.1 Load Weather Data

Since the weather data is well formatted, we use a Glue Crawler to create a Glue Table and load it into a Spark DataFrame.

The weather data is stored in an Amazon S3 bucket, and the Glue Table name is `tlcny_weather`.

In [None]:
# Create a Spark DataFrame from the Glue Catalog
df_weather = glueContext.create_dynamic_frame.from_catalog(database = "tlc", table_name = "tlcny_weather").toDF()

# print the schema if not running in a Glue job
if not is_glue_job():
    print("Schema before transformation:")
    df_weather.printSchema()

# Count the number of rows
print("Number of rows:", df_weather.count())

### 1.2 Data Cleanup

In this section, we apply the following data transformations:

1. Drop the unneeded fields `name`, `dew`, `stations`, `conditions`, `severerisk`.
2. Convert U.S. customary units to the metric system, so the values are easier to understand for an international audience.
3. Add columns `year`, `month`, `day`, `hour`.
4. Remove duplicated rows.

We derive the `year`, `month`, `day`, `hour` fields from the weather `datetime` column so that they match the fields derived from `request_datetime` in the TLC Trip Record Data.
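
As a quick sanity check of the conversion factors in step 2, the sketch below reimplements the unit conversions in plain Python, outside Spark. The helper names are illustrative and not part of the notebook; the factors are the standard ones (1 inch = 2.54 cm, 1 mph = 0.44704 m/s, 1 mile = 1.609344 km). Note that Python's built-in `round` and Spark's `round` can differ on exact ties, so this is only a check of the factors, not of Spark's rounding behavior.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius, rounded to 1 decimal."""
    return round((temp_f - 32) * 5 / 9, 1)


def inches_to_cm(inches: float) -> float:
    """Convert inches to centimeters (1 inch = 2.54 cm), rounded to 1 decimal."""
    return round(inches * 2.54, 1)


def mph_to_mps(mph: float) -> float:
    """Convert miles per hour to meters per second, rounded to 1 decimal."""
    return round(mph * 0.44704, 1)


def miles_to_km(miles: float) -> float:
    """Convert miles to kilometers, rounded to 1 decimal."""
    return round(miles * 1.609344, 1)


# Spot-check a few well-known reference points.
print(fahrenheit_to_celsius(32.0))   # freezing point of water
print(fahrenheit_to_celsius(212.0))  # boiling point of water
print(inches_to_cm(1.0))
print(mph_to_mps(10.0))
print(miles_to_km(10.0))
```

Comparing a handful of such reference values against the converted Spark columns is a cheap way to catch a wrong factor before the data is joined.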

In [None]:
# Drop columns: name (location name), dew (dew point), stations (number of stations reporting), conditions (weather conditions), severerisk (severe weather risk)
df_weather = df_weather.drop("name", "dew", "stations", "conditions", "severerisk")

if not is_glue_job():
    df_weather.printSchema()

In [None]:
from pyspark.sql.functions import round

# Convert U.S. customary units to the metric system, so the values are easier to understand for an international audience
df_weather = df_weather.withColumn("temp_c", round((df_weather["temp"] - 32) * 5/9, 1))
df_weather = df_weather.withColumn("feelslike_c", round((df_weather["feelslike"] - 32) * 5/9, 1))
# precip, snow and snowdepth are recorded in inches; 1 inch = 2.54 cm
df_weather = df_weather.withColumn("precip_cm", round(df_weather["precip"] * 2.54, 1)) # centimeters
df_weather = df_weather.withColumn("snow_cm", round(df_weather["snow"] * 2.54, 1)) # centimeters
df_weather = df_weather.withColumn("snowdepth_cm", round(df_weather["snowdepth"] * 2.54, 1)) # centimeters
df_weather = df_weather.withColumn("windgust_mps", round(df_weather["windgust"] * 0.44704, 1))  # meters per second
df_weather = df_weather.withColumn("windspeed_mps", round(df_weather["windspeed"] * 0.44704, 1)) # meters per second
df_weather = df_weather.withColumn("visibility_km", round(df_weather["visibility"] * 1.60934, 1)) # kilometers

df_weather = df_weather.drop("temp", "feelslike", "precip", "snow", "snowdepth", "windgust", "windspeed", "visibility")

if not is_glue_job():
    df_weather.printSchema()

In [None]:
# Add columns year, month, day, hour based on field datetime
from pyspark.sql.functions import year, month, dayofmonth, hour

df_weather = df_weather.withColumn("year", year(df_weather["datetime"]))
df_weather = df_weather.withColumn("month", month(df_weather["datetime"]))
df_weather = df_weather.withColumn("day", dayofmonth(df_weather["datetime"]))
df_weather = df_weather.withColumn("hour", hour(df_weather["datetime"]))

if not is_glue_job():
    df_weather.printSchema()

In [None]:
from pyspark.sql.functions import col

duplicated_rows = df_weather.groupBy("year", "month", "day", "hour").count().filter(col("count") > 1)

# Joining with the trip data produced more records than expected. This is usually caused by
# the weather data containing multiple records for the same hour, so we check for them here.
duplicated_rows.show()

In [None]:
# There are duplicated records in the weather data; keep one record per hour
print("Count before removing duplicated records: ", df_weather.count())

df_weather = df_weather.dropDuplicates(["year", "month", "day", "hour"])

print("Count after removing duplicated records: ", df_weather.count())

In [None]:
df_weather.describe().show()

In [None]:
# Alias Spark's sum so it does not shadow Python's built-in sum
from pyspark.sql.functions import col, sum as spark_sum

# Check for null values in each column
null_counts = df_weather.agg(*[spark_sum(col(c).isNull().cast("int")).alias(c) for c in df_weather.columns])

# Display the null counts
null_counts.show()


In [None]:
# Replace null values in the windgust_mps column with 0
df_weather = df_weather.fillna(0, subset=["windgust_mps"])

## 2. Load the Trip Data

Load the trip data from the Glue Table `trips` into a Spark DataFrame. 

In [None]:
df_trips = glueContext.create_dynamic_frame.from_catalog(database = "tlc", table_name = "trips")

if not is_glue_job():
    df_trips = df_trips.toDF().sample(False, 0.002, 42) # sample 0.2% of the data
    df_trips.printSchema()
else:
    df_trips = df_trips.toDF()

print("Number of trips:", df_trips.count())

### 2.1 Aggregate the Trip Data

This section generates a new DataFrame that aggregates the trips by pickup hour and location. The `is_holiday` column is added later, in section 3.1. The resulting columns are:

- year
- month
- day
- hour
- PULocationID
- Count(*) as total_trips
- request_hour
- is_holiday
- weekday_n
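
The aggregation logic can be illustrated without Spark. The following is a minimal plain-Python sketch, assuming each trip is a dict carrying the `year`, `month`, `day`, `hour` and `PULocationID` fields. The sample records and the `aggregate_trips` helper are made up for illustration and are not part of the notebook.

```python
from collections import Counter
from datetime import datetime

# Made-up sample trips; the real data comes from the `trips` Glue table.
sample_trips = [
    {"year": 2019, "month": 1, "day": 5, "hour": 3, "PULocationID": 132},
    {"year": 2019, "month": 1, "day": 5, "hour": 3, "PULocationID": 132},
    {"year": 2019, "month": 1, "day": 5, "hour": 4, "PULocationID": 79},
]

KEY_FIELDS = ("year", "month", "day", "hour", "PULocationID")


def aggregate_trips(trips):
    """Count trips per (year, month, day, hour, PULocationID) group."""
    counts = Counter(tuple(trip[field] for field in KEY_FIELDS) for trip in trips)
    rows = []
    for key, total in sorted(counts.items()):
        row = dict(zip(KEY_FIELDS, key))
        row["total_trips"] = total
        request_dt = datetime(row["year"], row["month"], row["day"], row["hour"])
        # Zero-padded hourly timestamp string, e.g. "2019-01-05 03:00:00".
        row["request_hour"] = request_dt.strftime("%Y-%m-%d %H:00:00")
        # Spark's dayofweek() numbers days from 1 (Sunday) to 7 (Saturday);
        # isoweekday() runs from 1 (Monday) to 7 (Sunday), so shift it.
        row["weekday_n"] = request_dt.isoweekday() % 7 + 1
        rows.append(row)
    return rows


aggregated = aggregate_trips(sample_trips)
for row in aggregated:
    print(row)
```

Each output row corresponds to one pickup location in one hour, which is the granularity the prediction target needs.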

In [None]:
df_trips_grouped = df_trips.groupBy("year", "month", "day", "hour", "PULocationID").count().withColumnRenamed("count", "total_trips")

if not is_glue_job():
    df_trips_grouped.show(3)


In [None]:
df_trips_grouped.printSchema()

In [7]:
from pyspark.sql.functions import date_format, col, lit, concat

df_trips_grouped = df_trips_grouped.withColumn("request_hour", date_format(concat(col("year"), lit("-"), col("month"), lit("-"), col("day"), lit(" "), col("hour"), lit(":00:00")), "yyyy-MM-dd HH:00:00"))

if not is_glue_job():
    df_trips_grouped.show(5)

In [None]:
from pyspark.sql.functions import dayofweek
# Add the column weekday_n (Spark's dayofweek returns 1 for Sunday through 7 for Saturday)

df_trips_grouped = df_trips_grouped.withColumn("weekday_n", dayofweek(df_trips_grouped["request_hour"]))

if not is_glue_job():
    df_trips_grouped.show(5)


## 3. Enrich the Trip Data 
Before running this section, make sure you have created a Glue Table for the TLC Trip Record Data.


### 3.1 Enrich with Holiday Data

Add a column `is_holiday` to the DataFrame, and drop the holiday name column.
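
The effect of this step can be sketched in plain Python: a left join on the date keeps every trip row, and the flag is true only where a holiday matched. The sketch assumes the `holidays` table has the columns `year`, `month`, `day` and `holiday`. The sample rows and the `add_is_holiday` helper are made up for illustration.

```python
# Made-up holiday rows mirroring the assumed layout of the `holidays` table.
holidays = [
    {"year": 2019, "month": 1, "day": 1, "holiday": "New Year's Day"},
    {"year": 2019, "month": 7, "day": 4, "holiday": "Independence Day"},
]

# Made-up aggregated trip rows.
trips_per_hour = [
    {"year": 2019, "month": 1, "day": 1, "hour": 0, "total_trips": 12},
    {"year": 2019, "month": 1, "day": 2, "hour": 0, "total_trips": 7},
]


def add_is_holiday(rows, holiday_rows):
    """Return copies of rows with a boolean is_holiday flag.

    Every input row is kept (left-join semantics) and the holiday
    name itself is not carried over.
    """
    holiday_dates = {(h["year"], h["month"], h["day"]) for h in holiday_rows}
    return [
        {**row, "is_holiday": (row["year"], row["month"], row["day"]) in holiday_dates}
        for row in rows
    ]


flagged = add_is_holiday(trips_per_hour, holidays)
for row in flagged:
    print(row)
```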

In [None]:
df_holidays = glueContext.create_dynamic_frame.from_catalog(database = "tlc", table_name = "holidays").toDF()

if not is_glue_job():
    df_holidays.show(5)

In [None]:
# Join the TLC Trip Data Record with the public holiday data
df_trips_grouped = df_trips_grouped.join(df_holidays, on=['year', 'month', 'day'], how='left')

# Add a column is_holiday of type boolean
df_trips_grouped = df_trips_grouped.withColumn("is_holiday", df_trips_grouped["holiday"].isNotNull())

# Drop the column holiday; drop() returns a new DataFrame, so reassign the result
df_trips_grouped = df_trips_grouped.drop("holiday")

print("Add column is_holiday, and drop column holiday done.")

if not is_glue_job():
    df_trips_grouped.show(3)

### 3.2 Enrich with weather data

The weather data is hourly, so we join each aggregated trip row with the weather record for the same hour.
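
The sketch below shows, in plain Python, why the weather data should hold exactly one record per hour before this join: with a left join, every duplicated weather key multiplies the matching trip rows. The `left_join` and `drop_duplicate_keys` helpers and the sample rows are made up for illustration. Unlike this sketch, Spark's `dropDuplicates` does not guarantee which of the duplicated rows is kept.

```python
def left_join(left_rows, right_rows, left_key, right_key):
    """Naive left join: one output row per matching pair.

    Left rows without a match are kept as they are, without the
    right-hand columns (Spark would fill those columns with nulls).
    """
    joined = []
    for left in left_rows:
        matches = [r for r in right_rows if r[right_key] == left[left_key]]
        if not matches:
            joined.append(dict(left))
        for match in matches:
            merged = {**left, **match}
            merged.pop(right_key, None)  # drop the redundant join column
            joined.append(merged)
    return joined


def drop_duplicate_keys(rows, key):
    """Keep the first row seen for each key value."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row[key], row)
    return list(first_seen.values())


# Made-up sample data: the weather record for 03:00 appears twice.
trips = [
    {"request_hour": "2019-01-05 03:00:00", "total_trips": 2},
    {"request_hour": "2019-01-05 04:00:00", "total_trips": 1},
]
weather = [
    {"datetime": "2019-01-05 03:00:00", "temp_c": 1.5},
    {"datetime": "2019-01-05 03:00:00", "temp_c": 1.5},
]

inflated = left_join(trips, weather, "request_hour", "datetime")
clean = left_join(trips, drop_duplicate_keys(weather, "datetime"), "request_hour", "datetime")

print("Rows with duplicated weather keys:", len(inflated))
print("Rows after deduplication:", len(clean))
```

Comparing the row count before and after the join is a simple check that the join did not inflate the trip data.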


In [None]:
# drop the duplicated columns first, because we are going to join with datetime field
df_weather = df_weather.drop("year", "month", "day", "hour")

df_trips_grouped_with_weather = df_trips_grouped.join(df_weather, df_trips_grouped['request_hour'] == df_weather['datetime'], how='left')

df_trips_grouped_with_weather = df_trips_grouped_with_weather.drop("datetime")

if not is_glue_job():
    df_trips_grouped_with_weather.printSchema()


## 4. Save the data to S3

Save the processed DataFrame to S3, for ML tasks.
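
When Spark writes with `partitionBy("year", "month")`, it creates one directory per partition value using the `column=value` naming scheme. The sketch below builds the prefix of a single partition, so that a downstream task can read only the months it needs. The `partition_prefix` helper is hypothetical, and the sketch assumes `year` and `month` are stored as integers without zero padding.

```python
def partition_prefix(base_path: str, year: int, month: int) -> str:
    """Build the path prefix of one year/month partition.

    Assumes the dataset was written with partitionBy("year", "month")
    and that both columns hold integers (no zero padding).
    """
    return f"{base_path.rstrip('/')}/year={year}/month={month}/"


base = "s3://qiaoshi-aws-ml/tlc/results/full/trips_with_weather/parquet/"
january_2019 = partition_prefix(base, 2019, 1)
print(january_2019)
```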



In [None]:
# Write the records to S3 in Parquet format, partitioned by year and month

df_trips_grouped_with_weather.write.mode("overwrite").partitionBy("year", "month").parquet("s3://qiaoshi-aws-ml/tlc/results/full/trips_with_weather/parquet/")

print("Write to S3 in parquet format done.")

In [None]:
# output the full dataset to S3
df_trips_grouped_with_weather.write.mode("overwrite").csv("s3://qiaoshi-aws-ml/tlc/results/full/trips_with_weather/csv/", header=True)

print("Output full records to S3 done.")

# output a sample dataset to S3
df_trips_grouped_with_weather.sample(False, 0.01, 42).write.mode("overwrite").csv("s3://qiaoshi-aws-ml/tlc/results/sample/trips_with_weather/csv/", header=True)
print("output sample records to S3 done.")