# Data Cleaning and Setup

This notebook marks the beginning of my meteorological forecasting project during my internship at **MeteoGalicia**. My first task was to create a clean and well-structured dataset based on the raw data files provided by my tutors. This step was essential to ensure that the dataset was ready for model training and further analysis.

All data processing and model training were performed remotely on the high-performance computing cluster at **CESGA** (*Centro de Supercomputación de Galicia*).

To access CESGA, I first established a secure **VPN** connection using the `snx` command-line utility, which allowed CESGA to identify and authorize my device. Once connected, I used **SSH** to access the cluster and standard Bash commands to navigate directories, manage files, and execute Python scripts.

Below is the basic process I used to connect to CESGA via SSH after the VPN was successfully established.

In [None]:
%%bash
# Open the VPN tunnel to the CESGA gateway
sudo ./snx -s pasarela.cesga.es -u uscfphpb
# Log in to the cluster over SSH
ssh uscfphpb@ft.cesga.es

Once connected to the CESGA facilities, I started cleaning the data I had at my disposal. To create a comprehensive dataset, I had to merge multiple files containing hour-by-hour output from a meteorological model with observational data collected at various weather stations. Each file represented a different time period or data source, so the timestamps had to be aligned carefully to keep the records consistent.

In [None]:
%%bash
cd GitHub/dataset/
tar -xvzf estaciones.tar.gz
tar -xvzf wrfout.tar.gz

The `.csv` files named by year (2008 to 2025) contain hour-by-hour observational data from different stations across Galicia, whereas the `wrfout` files contain the predictions made by the WRF (Weather Research and Forecasting) model, also sorted by hour.

In [None]:
import glob
import pandas as pd

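# Collect the yearly observation files (2008.csv ... 2025.csv) in chronological order
csv_files = sorted(glob.glob("GitHub/dataset/20*.csv"))

# Stack them into a single table and save the result
df_combined = pd.concat([pd.read_csv(file) for file in csv_files], ignore_index=True)
df_combined.to_csv('Observational_data.csv', index=False)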

In [None]:
import glob
import os
from datetime import datetime

import numpy as np
import pandas as pd

csv_files = sorted(glob.glob("GitHub/dataset/wrfout*.csv"))
all_data = []

for file in csv_files:
    # Extract the date ('YYYYMMDD') from the third '_'-separated field of the filename
    date_str = os.path.basename(file).split('_')[2]
    base_datetime = datetime.strptime(date_str, "%Y%m%d")

    df = pd.read_csv(file)

    # Rows are assumed to cycle through hours 0-23 in order; the modulo also
    # handles files whose row count is not an exact multiple of 24
    hours = np.arange(len(df)) % 24
    df['datetime'] = pd.Timestamp(base_datetime) + pd.to_timedelta(hours, unit='h')

    all_data.append(df)

# Combine all files
merged_df = pd.concat(all_data, ignore_index=True)
# Save to CSV
merged_df.to_csv("WRF.csv", index=False)

In [None]:
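import pandas as pd

# Observations: expose the timestamp and station columns under common key names
obs = pd.read_csv('Observational_data.csv')
obs['datetime'] = obs['fecha']
obs['id'] = obs['estacion']

# WRF predictions: same key names (note that 'Time' replaces any existing 'datetime' column)
wrf = pd.read_csv('WRF.csv')
wrf['datetime'] = wrf['Time']
wrf['id'] = wrf['estacion']

# Keep only the rows where a prediction and an observed temperature (TA)
# exist for the same station and timestamp
merged = pd.merge(wrf, obs[['TA', 'id', 'datetime']], on=["id", "datetime"], how="inner")
merged.to_csv("DATA.csv", index=False)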
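
An inner merge silently drops every row that has no partner in the other table, so it can be worth checking how many rows actually match. The following is only a minimal sketch on toy data, not part of the original pipeline: the frames `wrf_toy` and `obs_toy` and the column `T2` are hypothetical, and it relies on the `indicator` option of `pd.merge` to label each row as matched or unmatched.

```python
import pandas as pd

# Hypothetical toy tables standing in for the WRF and observational data
wrf_toy = pd.DataFrame({
    "id": [1, 1, 2],
    "datetime": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 00:00"]
    ),
    "T2": [10.1, 9.8, 12.3],  # hypothetical model temperature column
})
obs_toy = pd.DataFrame({
    "id": [1, 2, 2],
    "datetime": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 00:00", "2024-01-01 01:00"]
    ),
    "TA": [10.5, 12.0, 11.7],
})

# An outer merge with indicator=True adds a '_merge' column that says
# whether each row was found in both tables or in only one of them
check = pd.merge(wrf_toy, obs_toy, on=["id", "datetime"], how="outer", indicator=True)
counts = check["_merge"].value_counts()
print(counts)
```

A large share of `left_only` or `right_only` rows would suggest that the two timestamp or station columns do not use the same format.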

## Additional variables

The available data was sufficient to begin developing a model capable of forecasting temperature. However, we identified several missing parameters that could significantly affect the model’s accuracy and decided they were worth adding to our initial dataset. One of these was the percentage of sea surrounding a given location. Although Galicia is not a large region, its weather varies considerably from place to place, which makes our task more challenging. The Atlantic Ocean acts as a thermal reservoir that stabilizes coastal temperatures, but its influence decreases as we move inland. We therefore expected that a parameter representing the sea percentage around each point could noticeably improve the model’s accuracy.
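
The notebook does not describe how the sea percentage was computed, so the following is only one possible sketch under stated assumptions: a hypothetical boolean land/sea grid `toy_mask` (`True` for sea) and a hypothetical helper `sea_fraction` that averages the mask over a square window around a grid cell.

```python
import numpy as np


def sea_fraction(is_sea, row, col, half_width):
    """Fraction of sea cells in a square window centred on (row, col).

    The window is clipped at the grid edges, so border cells use fewer cells.
    """
    r0, r1 = max(row - half_width, 0), min(row + half_width + 1, is_sea.shape[0])
    c0, c1 = max(col - half_width, 0), min(col + half_width + 1, is_sea.shape[1])
    return float(is_sea[r0:r1, c0:c1].mean())


# Hypothetical 5x5 mask: the two westernmost columns are sea, the rest is land
toy_mask = np.zeros((5, 5), dtype=bool)
toy_mask[:, :2] = True

coastal = sea_fraction(toy_mask, row=2, col=1, half_width=1)  # 6 of 9 cells are sea
inland = sea_fraction(toy_mask, row=2, col=4, half_width=1)   # no sea cells in the window
print(coastal, inland)
```

Multiplying the result by 100 gives a percentage. The window size controls how far from the coast the feature stays above zero.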

In addition, we decided to incorporate the latitude and longitude of each location, as well as two variables representing the hour of the day and the day of the year. Daily and yearly weather patterns follow periodic cycles, so a plain numeric hour (0 to 23) or day of the year (1 to 365) would not capture them well: 23:00 and 00:00 are only one hour apart, yet their numeric values are as far apart as possible. To address this, we included two periodic functions, one with a period of 24 hours and another with a period of 365 days, so that the values return smoothly to their starting point after a full day or year. This is a common practice in meteorological modeling.
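
As an illustration, the periodic encoding described above could be sketched as follows. This is a minimal example under stated assumptions rather than the exact code used in the project: the helper `add_cyclic_features` and the output column names are hypothetical, and the sketch uses a sine and cosine pair for each cycle (a common variant), because a single sine takes the same value at two different times within one period.

```python
import numpy as np
import pandas as pd


def add_cyclic_features(df, time_col="datetime"):
    """Return a copy of df with sine/cosine encodings of hour and day of year."""
    out = df.copy()
    t = pd.to_datetime(out[time_col])

    # Map each cycle onto an angle in [0, 2*pi)
    hour_angle = 2 * np.pi * t.dt.hour / 24
    # Day 366 of a leap year slightly overshoots one full turn with a 365-day period
    day_angle = 2 * np.pi * (t.dt.dayofyear - 1) / 365

    out["hour_sin"] = np.sin(hour_angle)
    out["hour_cos"] = np.cos(hour_angle)
    out["doy_sin"] = np.sin(day_angle)
    out["doy_cos"] = np.cos(day_angle)
    return out


# Two days of hourly timestamps as toy input
demo = pd.DataFrame(
    {"datetime": pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(48), unit="h")}
)
encoded = add_cyclic_features(demo)
print(encoded.head())
```

With this encoding, 23:00 and 00:00 end up close together in the (`hour_sin`, `hour_cos`) plane, and the same holds for 31 December and 1 January in the yearly pair.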