## Preprocessing and analysing rain data

### 1. Importing libraries and data
Further information on the CSV format: https://www.visualcrossing.com/resources/documentation/weather-data/how-we-process-integrated-surface-database-historical-weather-data/

Data source: https://www.ncei.noaa.gov/access/search/data-search/global-hourly?bbox=40.959,-74.251,40.469,-73.761&pageNum=1&stations=72505394728&startDate=2024-10-01T00:00:00&endDate=2025-03-31T23:59:59

In [14]:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, udf, to_timestamp, expr
from pyspark.sql.types import DoubleType, StringType, IntegerType, StructType, StructField

In [15]:
# Create a spark session (which will run spark jobs)
spark = (
    SparkSession.builder.appName("MAST30034 Weather Data Analysis")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)

### 2. Preprocess weather data
+ Feature selection: the candidate features are wind speed, sky condition (overcast, partially cloudy, clear sky), visibility, air temperature, dew point, air pressure and precipitation (rain/snow or no rain/snow). The cells below keep only the wind, air temperature, dew point, air pressure and precipitation observations.

1. Loading data in

In [16]:
# Reading 2024 and 2025 weather data
weather_df = spark.read.option("header", True).csv("weather_data/202*.csv")


25/08/27 11:39:07 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: weather_data/202*.csv.
java.io.FileNotFoundException: File weather_data/202*.csv does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:917)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1238)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:907)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:56)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:381)
	at org.apache.spark.sql.catalyst.analysis.ResolveDataSource.org$apache$spark$sql$catalyst$analysis$ResolveDataSource$$loadV1BatchSource(ResolveDataSource.scala:143)
	at org.apache.spark.sql.catalyst.analysis.ResolveDa
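
The `FileNotFoundException` above is logged at WARN level, and the read still succeeds, as the row count in the next cell shows. To see which files the pattern actually picks up, the glob can be expanded in plain Python before it is handed to Spark. This is a minimal sketch, assuming the CSV files sit on the local filesystem; `list_weather_files` is a hypothetical helper and not part of the notebook.

```python
import glob


def list_weather_files(pattern):
    """Return the sorted list of local files matching a glob pattern."""
    return sorted(glob.glob(pattern))


# Print the CSV files that the notebook's pattern would match (if any).
for path in list_weather_files("weather_data/202*.csv"):
    print(path)
```

An empty list here would mean the relative path is wrong for the current working directory, which is worth ruling out before trusting the row count.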

In [17]:
print("Raw weather shape:", (weather_df.count(),
                           len(weather_df.columns),))

Raw weather shape: (18865, 91)


2. Select the features we want to analyse

In [18]:
from pyspark.sql.functions import  date_format
weather_df = weather_df.withColumn(
    "timestamp", to_timestamp(col("DATE"), "yyyy-MM-dd'T'HH:mm:ss")
).withColumn(
    "date", date_format(col("timestamp"), "yyyy-MM-dd")
).withColumn(
    "time", date_format(col("timestamp"), "HH:mm:ss")
)
print("Weather shape after fix:",
    (weather_df.count(), len(weather_df.columns))
)

# Keep observations from 2024-09-01 up to, but not including, 2025-05-01
weather_filtered = weather_df.filter(
    (col("timestamp") >= "2024-09-01 00:00:00") & (col("timestamp") < "2025-05-01 00:00:00")
)


Weather shape after fix: (18865, 93)
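
The cell above parses the `DATE` string and keeps a half-open window (start inclusive, end exclusive). The same logic can be checked on single values in plain Python. This is a sketch only: `parse_isd_date` and `in_window` are hypothetical names, and the window bounds are copied from the filter above.

```python
from datetime import datetime

ISD_FORMAT = "%Y-%m-%dT%H:%M:%S"   # matches "yyyy-MM-dd'T'HH:mm:ss" in the Spark call
WINDOW_START = datetime(2024, 9, 1)
WINDOW_END = datetime(2025, 5, 1)


def parse_isd_date(raw):
    """Parse a DATE string such as '2025-01-01T00:51:00'."""
    return datetime.strptime(raw, ISD_FORMAT)


def in_window(ts, start=WINDOW_START, end=WINDOW_END):
    """Half-open interval check: start <= ts < end."""
    return start <= ts < end


ts = parse_isd_date("2025-01-01T00:51:00")
print(ts.date(), ts.time(), in_window(ts))
```

Using an exclusive upper bound avoids having to spell out the last second of April.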


In [19]:
# Note: the filtered row count is not half of the total because the 2025 data
# only runs until 13 August 2025
weather_filtered.count()    

7484

In [20]:
print("Chosen weather shape:", 
    (weather_filtered.count(), len(weather_filtered.columns))
)




Chosen weather shape: (7484, 93)


                                                                                

In [21]:
weather_df.count()

18865

In [22]:
# Dropping unnecessary columns
# Keeping WND (wind), TMP (air temperature), DEW (dew point),
# SLP (sea level pressure) and AA1 (liquid precipitation)

weather_filtered = weather_filtered.select(
    col("DATE").alias("timestamp"),
    col("date"),
    col("time"),
    col("WND").alias("wind_observation"),
    col("TMP").alias("air_temp_observation"),
    col("DEW").alias("dew_point_observation"),
    col("SLP").alias("air_pressure_observation"),
    col("AA1").alias("precipitation")
)

In [23]:
print("Filtered weather shape:", 
    (weather_filtered.count(), len(weather_filtered.columns))
)

Filtered weather shape: (7484, 8)


In [24]:
# Checking the final structure of the filtered weather data
weather_filtered.show(5, truncate=False)

                                                                                

+----------+----------+--------+----------------+--------------------+---------------------+------------------------+-------------+
|timestamp |date      |time    |wind_observation|air_temp_observation|dew_point_observation|air_pressure_observation|precipitation|
+----------+----------+--------+----------------+--------------------+---------------------+------------------------+-------------+
|2025-01-01|2025-01-01|00:51:00|050,5,N,0031,5  |+0089,5             |+0039,5              |10052,5                 |01,0000,9,5  |
|2025-01-01|2025-01-01|01:51:00|999,9,V,0015,5  |+0072,5             |+0050,5              |10032,5                 |01,0013,9,5  |
|2025-01-01|2025-01-01|02:46:00|060,5,V,0036,5  |+0070,5             |+0050,5              |99999,9                 |01,0017,3,1  |
|2025-01-01|2025-01-01|02:47:00|040,5,N,0046,5  |+0072,5             |+0050,5              |10018,5                 |01,0036,9,6  |
|2025-01-01|2025-01-01|02:56:00|999,9,V,0021,5  |+0067,5             |+0050,

3. Convert precipitation to a binary Y/N indicator

In [25]:
weather_filtered.filter(col("precipitation").isNull()).count()

                                                                                

1037

In [26]:
from pyspark.sql.functions import when
# AA1 is "period_hours,depth,condition,quality"; depth is in tenths of a millimetre
# No rows are dropped here: rows with a null AA1 get a null precip_mm and fall through to "N"
weather_precipitation = weather_filtered.withColumn(
    "precip_mm",
    (split(col("precipitation"), ",").getItem(1).cast("int") / 10.0)  # convert to mm
).withColumn(
    "precip_happened",
    when(col("precip_mm") > 0, "Y").otherwise("N")
)
print("Precipitation update:", 
    (weather_precipitation.count(), len(weather_precipitation.columns))
)

Precipitation update: (7484, 10)


In [27]:
weather_precipitation.show(5, truncate=False)

+----------+----------+--------+----------------+--------------------+---------------------+------------------------+-------------+---------+---------------+
|timestamp |date      |time    |wind_observation|air_temp_observation|dew_point_observation|air_pressure_observation|precipitation|precip_mm|precip_happened|
+----------+----------+--------+----------------+--------------------+---------------------+------------------------+-------------+---------+---------------+
|2025-01-01|2025-01-01|00:51:00|050,5,N,0031,5  |+0089,5             |+0039,5              |10052,5                 |01,0000,9,5  |0.0      |N              |
|2025-01-01|2025-01-01|01:51:00|999,9,V,0015,5  |+0072,5             |+0050,5              |10032,5                 |01,0013,9,5  |1.3      |Y              |
|2025-01-01|2025-01-01|02:46:00|060,5,V,0036,5  |+0070,5             |+0050,5              |99999,9                 |01,0017,3,1  |1.7      |Y              |
|2025-01-01|2025-01-01|02:47:00|040,5,N,0046,5  |+00
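
The split-and-scale step above can be checked on individual values in plain Python. The sketch below also keeps missing observations separate from dry ones, which the `when(...).otherwise("N")` expression does not do: the 1,037 rows with a null `precipitation` end up labelled `N`. `parse_aa1` and `precip_flag` are hypothetical helpers, and treating a depth of `9999` as the missing-value code is an assumption taken from the ISD format description.

```python
def parse_aa1(raw):
    """Split an AA1 value like '01,0013,9,5' into period hours and depth in mm.

    Returns None when the whole field is missing. A depth of '9999' is
    assumed to be the missing-value code and is returned as None.
    """
    if raw is None:
        return None
    period, depth, _condition, _quality = raw.split(",")
    depth_mm = None if depth == "9999" else int(depth) / 10.0
    return {"period_hours": int(period), "depth_mm": depth_mm}


def precip_flag(raw):
    """Return 'Y' or 'N', or None when the observation is missing."""
    parsed = parse_aa1(raw)
    if parsed is None or parsed["depth_mm"] is None:
        return None
    return "Y" if parsed["depth_mm"] > 0 else "N"


for value in ("01,0000,9,5", "01,0013,9,5", None):
    print(value, "->", parse_aa1(value), precip_flag(value))
```

Whether a missing observation should count as "no precipitation" is a modelling choice; the sketch only makes that choice visible.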

4. Only taking wind speed

In [28]:
weather_wind_speed = weather_precipitation.withColumn(
    "wind_speed_mps",
    (split(col("wind_observation"), ",").getItem(3).cast("int") / 10.0)
)
print("Windspeed update:", 
    (weather_wind_speed.count(), len(weather_wind_speed.columns))
)

Windspeed update: (7484, 11)


In [29]:
weather_wind_speed.filter(col("wind_speed_mps") >= 999).count() # missing wind speed is coded 9999, i.e. 999.9 after scaling

976

In [30]:
weather_wind_speed.count()

7484

In [31]:
weather_wind_speed.show(5, truncate=False)

+----------+----------+--------+----------------+--------------------+---------------------+------------------------+-------------+---------+---------------+--------------+
|timestamp |date      |time    |wind_observation|air_temp_observation|dew_point_observation|air_pressure_observation|precipitation|precip_mm|precip_happened|wind_speed_mps|
+----------+----------+--------+----------------+--------------------+---------------------+------------------------+-------------+---------+---------------+--------------+
|2025-01-01|2025-01-01|00:51:00|050,5,N,0031,5  |+0089,5             |+0039,5              |10052,5                 |01,0000,9,5  |0.0      |N              |3.1           |
|2025-01-01|2025-01-01|01:51:00|999,9,V,0015,5  |+0072,5             |+0050,5              |10032,5                 |01,0013,9,5  |1.3      |Y              |1.5           |
|2025-01-01|2025-01-01|02:46:00|060,5,V,0036,5  |+0070,5             |+0050,5              |99999,9                 |01,0017,3,1  |1.7 

Note: wind speed is not used because 976 of the 7,484 rows (about 13%) have no wind speed reading.
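
For reference, the `WND` field holds five comma-separated parts, and the fourth is the speed in tenths of a metre per second. The sketch below parses one value and reproduces the missing-data share from the counts above. `parse_wnd` and `missing_share` are hypothetical helpers; treating direction `999` and speed `9999` as missing-value codes is an assumption based on the ISD format description.

```python
def parse_wnd(raw):
    """Parse a WND value like '050,5,N,0031,5'.

    Direction '999' and speed '9999' are assumed to be missing-value codes.
    """
    direction, _dir_quality, type_code, speed, _speed_quality = raw.split(",")
    return {
        "direction_deg": None if direction == "999" else int(direction),
        "type_code": type_code,
        "speed_mps": None if speed == "9999" else int(speed) / 10.0,
    }


def missing_share(missing_rows, total_rows):
    """Percentage of rows flagged as missing, rounded to one decimal place."""
    return round(100.0 * missing_rows / total_rows, 1)


print(parse_wnd("999,9,V,0015,5"))
print(missing_share(976, 7484))
```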

5. Extract important information from the other observations

In [32]:
# Extract air temperature in Celsius (TMP is in tenths of a degree);
# rows with a missing temperature are removed in the next cell
weather_temp = weather_precipitation.withColumn(
    "air_temp_celsius",
    (split(col("air_temp_observation"), ",").getItem(0).cast("int") / 10.0)
)
weather_temp.count()


7484

In [33]:
weather_temp = weather_temp.filter(col("air_temp_celsius") < 999) # missing temperature is coded +9999, i.e. 999.9 after scaling
weather_temp.count() 

7195

In [34]:
# Extract dew point in Celsius (DEW is in tenths of a degree)
weather_dew_point = weather_temp.withColumn(
    "dew_point_celsius",
    (split(col("dew_point_observation"), ",").getItem(0).cast("int") / 10.0)
).filter(col("air_temp_celsius") < 999)  # same filter as the previous cell, so no rows are removed; dew point itself is not checked


In [35]:
weather_dew_point.show(5, truncate=False)
weather_dew_point.count()

+----------+----------+--------+----------------+--------------------+---------------------+------------------------+-------------+---------+---------------+----------------+-----------------+
|timestamp |date      |time    |wind_observation|air_temp_observation|dew_point_observation|air_pressure_observation|precipitation|precip_mm|precip_happened|air_temp_celsius|dew_point_celsius|
+----------+----------+--------+----------------+--------------------+---------------------+------------------------+-------------+---------+---------------+----------------+-----------------+
|2025-01-01|2025-01-01|00:51:00|050,5,N,0031,5  |+0089,5             |+0039,5              |10052,5                 |01,0000,9,5  |0.0      |N              |8.9             |3.9              |
|2025-01-01|2025-01-01|01:51:00|999,9,V,0015,5  |+0072,5             |+0050,5              |10032,5                 |01,0013,9,5  |1.3      |Y              |7.2             |5.0              |
|2025-01-01|2025-01-01|02:46:00|060

7195

+ <u>NOTE</u>: Dew point and air temperature are assumed to come from the same sensor, so a missing air temperature is taken to imply a missing dew point as well.
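
The assumption in the note can be tested directly by looking for rows where the temperature is present but the dew point is not. This is a plain-Python sketch with hypothetical helper names. The `+9999` missing-value code is an assumption based on the ISD format description, and the third sample row is made up for illustration.

```python
MISSING_TENTHS = "+9999"  # assumed missing-value code for both TMP and DEW


def parse_tenths(raw):
    """Parse '+0089,5' into 8.9; return None for the missing-value code."""
    value = raw.split(",")[0]
    if value == MISSING_TENTHS:
        return None
    return int(value) / 10.0


def dew_missing_but_temp_present(rows):
    """Return the (TMP, DEW) pairs that break the assumption in the note."""
    return [
        (tmp, dew)
        for tmp, dew in rows
        if parse_tenths(tmp) is not None and parse_tenths(dew) is None
    ]


# The first pair appears in the notebook output; the other two are illustrative.
sample = [("+0089,5", "+0039,5"), ("+9999,9", "+9999,9"), ("+0072,5", "+9999,9")]
print(dew_missing_but_temp_present(sample))
```

If this list is non-empty on the real data, dew point needs its own filter rather than relying on the air temperature filter.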

In [36]:
# Extract air pressure in hPa (SLP is in tenths of a hectopascal)
weather_pressure = weather_dew_point.withColumn(
    "air_pressure",
    (split(col("air_pressure_observation"), ",").getItem(0).cast("int") / 10.0)
).filter(col("air_pressure") != 999)  # NOTE: missing SLP is coded 99999 (9999.9 after scaling), so this filter leaves those rows in
weather_pressure.show(5, truncate=False)
weather_pressure.count()

+----------+----------+--------+----------------+--------------------+---------------------+------------------------+-------------+---------+---------------+----------------+-----------------+------------+
|timestamp |date      |time    |wind_observation|air_temp_observation|dew_point_observation|air_pressure_observation|precipitation|precip_mm|precip_happened|air_temp_celsius|dew_point_celsius|air_pressure|
+----------+----------+--------+----------------+--------------------+---------------------+------------------------+-------------+---------+---------------+----------------+-----------------+------------+
|2025-01-01|2025-01-01|00:51:00|050,5,N,0031,5  |+0089,5             |+0039,5              |10052,5                 |01,0000,9,5  |0.0      |N              |8.9             |3.9              |1005.2      |
|2025-01-01|2025-01-01|01:51:00|999,9,V,0015,5  |+0072,5             |+0050,5              |10032,5                 |01,0013,9,5  |1.3      |Y              |7.2             |5.

7191
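
The raw value `99999,9` in the sample rows scales to 9999.9, not 999, so a `!= 999` comparison does not catch it, and the final table still contains 9999.9 readings. The sketch below shows the difference on two values taken from the output above. `parse_slp` is a hypothetical helper, and treating `99999` as the missing-value code is an assumption based on the ISD format description.

```python
def parse_slp(raw):
    """Parse an SLP value like '10052,5' into hPa; '99999' is assumed missing."""
    value = raw.split(",")[0]
    if value == "99999":
        return None
    return int(value) / 10.0


for raw in ("10052,5", "99999,9"):
    scaled = int(raw.split(",")[0]) / 10.0
    # "passes != 999" shows that the notebook's filter keeps both rows.
    print(raw, "scaled:", scaled, "passes != 999:", scaled != 999, "parsed:", parse_slp(raw))
```

In the Spark pipeline, a comparison such as `col("air_pressure") < 9999` would be one way to drop these rows, although the row counts would then differ from the ones recorded here.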

6. FINAL DATA

<i>yes, finally</i>

In [37]:
# The final processed weather data
# Note: dew point is not filtered separately; its missing values are assumed to
# coincide with missing air temperature, which was filtered above
# Final weather data contains:
# - timestamp, date, time
# - air_temp_celsius
# - dew_point_celsius
# - air_pressure (rows with the 9999.9 missing-value code are still present)
# - precip_mm, precip_happened
weather_final = weather_pressure.select(
    "timestamp",
    "date",
    "time",
    "air_temp_celsius",
    "dew_point_celsius",
    "air_pressure",
    "precip_mm",
    "precip_happened"
)
weather_final.show(5, truncate=False)

+----------+----------+--------+----------------+-----------------+------------+---------+---------------+
|timestamp |date      |time    |air_temp_celsius|dew_point_celsius|air_pressure|precip_mm|precip_happened|
+----------+----------+--------+----------------+-----------------+------------+---------+---------------+
|2025-01-01|2025-01-01|00:51:00|8.9             |3.9              |1005.2      |0.0      |N              |
|2025-01-01|2025-01-01|01:51:00|7.2             |5.0              |1003.2      |1.3      |Y              |
|2025-01-01|2025-01-01|02:46:00|7.0             |5.0              |9999.9      |1.7      |Y              |
|2025-01-01|2025-01-01|02:47:00|7.2             |5.0              |1001.8      |3.6      |Y              |
|2025-01-01|2025-01-01|02:56:00|6.7             |5.0              |9999.9      |2.0      |Y              |
+----------+----------+--------+----------------+-----------------+------------+---------+---------------+
only showing top 5 rows


In [38]:
print("weather final:", 
    (weather_final.count(), len(weather_final.columns))
)

weather final: (7191, 8)


In [24]:
# saving the final weather data
weather_final.write.mode("overwrite").parquet("data/processed_data/processed_weather_data.parquet")

                                                                                