# Topic ideas

---

Group name: Group iBm 

Sofie Pischl, Furkan Saygin, Julian Erath
---

**Project Title: Weather Data Analysis Using Regression and Classification on ERA5 Dataset**

**1. Scope of the Project:**

The project aims to analyze and derive insights from the ERA5 (ECMWF Reanalysis version 5) weather dataset. It uses both regression and classification analyses to model and predict various weather parameters and to gain insight into correlations, possible causal relationships, and patterns among them. The ERA5 dataset, provided by the European Centre for Medium-Range Weather Forecasts (ECMWF), is a high-quality global atmospheric reanalysis dataset covering multiple decades. The analysis will focus on the region of Bancroft in Ontario, Canada.

## Weather Data Analysis Using Regression and Classification on ERA5 Dataset

### Data source
- **Source:** ERA5 dataset by ECMWF (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5)
- **Temporal Coverage:** ERA5 spans multiple decades; this project uses data from 2015 to 2022 (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation)
- **Spatial Resolution:** Approximately 31 km (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation)
- **Parameters:** Includes but not limited to temperature, precipitation, wind speed, atmospheric pressure, etc. (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Parameterlistings)
- **Labels:** The data has been labelled by meteorologists and data scientists from IBM and The Weather Company.

The ERA5 dataset is provided by the European Centre for Medium-Range Weather Forecasts (ECMWF). It is a product of atmospheric reanalysis, a process that involves assimilating observational data from various sources into a numerical weather prediction model. 

**Collection Method:**

1. **Observational Data Assimilation:** The ERA5 dataset is generated through a process called reanalysis. This involves assimilating vast amounts of observational data from satellites, weather stations, and other sources into a numerical weather prediction model. This assimilation process helps create a consistent and comprehensive representation of the atmosphere over time.

2. **Numerical Weather Prediction Model:** ECMWF uses advanced numerical weather prediction models to simulate the Earth's atmosphere. These models take into account the laws of physics governing the atmosphere and use initial conditions based on observational data.

3. **Temporal Coverage:** The ERA5 dataset spans multiple decades, with ongoing updates. It covers the period from 1940 to near-real-time, providing a continuous record of atmospheric conditions. In this project the data is restricted to 2015 to 2022 because of resource and project limitations.

4. **Spatial Resolution:** The dataset has a high spatial resolution, offering detailed information on a global scale, with grid points approximately 31 kilometers apart.
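
The restriction to 2015 to 2022 mentioned in the list above could be applied with a simple timestamp filter. The following is a minimal sketch on made-up rows; the helper `restrict_period`, the half-open interval convention, and the use of a `valid_datetime` column as the reference timestamp are assumptions for illustration, not the project's confirmed preprocessing.

```python
import pandas as pd

# Hypothetical hourly records around the start of the project period.
records = pd.DataFrame({
    "valid_datetime": pd.date_range(
        "2014-12-31 22:00", periods=6, freq=pd.Timedelta(hours=1)
    ),
    "avg_temp": [270.1, 270.0, 269.8, 269.7, 269.9, 270.2],
})


def restrict_period(frame, start="2015-01-01", end="2023-01-01",
                    column="valid_datetime"):
    """Keep rows with start <= timestamp < end (half-open interval)."""
    stamps = pd.to_datetime(frame[column])
    mask = (stamps >= pd.Timestamp(start)) & (stamps < pd.Timestamp(end))
    return frame.loc[mask].reset_index(drop=True)


subset = restrict_period(records)
print(subset)
```

Using a half-open interval avoids having to spell out the last hour of 2022 and keeps adjacent periods from overlapping.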

**Purpose:** The primary purpose of generating the ERA5 dataset is to provide a comprehensive and accurate representation of past weather conditions. It serves various scientific and operational applications, including climate monitoring, research, and supporting weather forecasting.

**Quality Assurance:** ECMWF is renowned for its commitment to data quality. The assimilation process involves rigorous validation against a wide range of observational data to ensure the accuracy and reliability of the reanalysis output.

**Accessibility:** The ECMWF makes ERA5 data available to the public and the scientific community, fostering research and applications in meteorology, climate science, and related fields.

Source: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation


### Data characteristics

- **Temporal Aspect:** The dataset provides temporal information at various intervals (e.g., hourly or monthly), allowing for both short-term and long-term analyses.
  
- **Spatial Aspect:** With a spatial resolution of around 31 km, the dataset provides comprehensive global coverage, facilitating region-specific studies. In this project the analysis is restricted to the region of Bancroft in Ontario, Canada (as this area is of particular interest for IBM).

- **Multivariate Nature:** Multiple meteorological parameters are available, enabling a comprehensive analysis of weather patterns.

- **Quality:** ERA5 is known for its high quality and precision, making it suitable for detailed analyses and modeling.

The ERA5 dataset is comprehensive, providing a wide range of meteorological variables that cover various aspects of the Earth's atmosphere. The observations and general characteristics measured in the ERA5 dataset are crucial for understanding and analyzing weather and climate patterns. Some of the key variables included in the dataset are:

1. **Temperature:** ERA5 provides information on air temperature at different levels of the atmosphere. This includes 2-meter temperature, which is often used as a representative of surface temperature.

2. **Wind:** Wind speed, gust and direction are essential meteorological parameters. ERA5 includes information on both zonal (east-west) and meridional (north-south) components of the wind at different altitudes.

3. **Precipitation:** Precipitation data includes rainfall and snowfall. This information is crucial for understanding water cycles and is vital for various applications, including hydrology and agriculture.

4. **Pressure:** Atmospheric pressure at different levels is provided in the dataset. Changes in atmospheric pressure are associated with weather patterns and can influence local weather conditions.

5. **Humidity:** Relative humidity and specific humidity are included, providing insights into the moisture content of the atmosphere. This is important for understanding cloud formation and precipitation processes.

6. **Snow and ice:** Snow density as well as cumulative snow and ice are included.

7. **Other Atmospheric Parameters** 
   
These variables collectively provide a detailed snapshot of the Earth's atmosphere and are fundamental for conducting both regression and classification analyses. Researchers and meteorologists often leverage these variables to study climate trends, weather patterns, and environmental changes. The high spatial and temporal resolution of the ERA5 dataset enhances its utility for diverse applications, including climate research, environmental monitoring, and weather forecasting.
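
When wind is given as zonal and meridional components, speed and direction can be derived from them. The sketch below uses the usual meteorological convention (direction is where the wind blows from, measured clockwise from north). The function name and arguments are hypothetical, and whether the project's `avg_windspd` values were computed this way is not stated in the source.

```python
import math


def wind_speed_and_direction(u, v):
    """Return (speed, direction) from wind components.

    u: zonal component, positive towards the east (m/s)
    v: meridional component, positive towards the north (m/s)
    direction: degrees clockwise from north that the wind blows FROM
    """
    speed = math.hypot(u, v)
    direction = (180.0 + math.degrees(math.atan2(u, v))) % 360.0
    return speed, direction


# A wind blowing towards the east comes from the west (270 degrees).
print(wind_speed_and_direction(5.0, 0.0))
```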


### Research questions

**Regression Analysis:**

- *Temperature Prediction:* Can we build an accurate regression model to predict temperature based on historical data? This involves exploring the relationship between temperature and time (time of day, day, season, etc.). The result of this regression analysis is the identification of a trend in the weather data as well as a prediction of the temperature for the next x days, weeks, etc. based on the historical weather.

- *Temperature and Wind Modeling:* Can we find a correlation or causation between the temperature and wind speed, wind gust, or wind direction using regression techniques? This involves investigating the impact of independent variables (wind features) on the temperature. The result of this regression analysis is the identification of the correlation or causation between wind features and the temperature based on the historical weather.
As an outlook, it could also be analyzed whether certain wind directions result in a certain type of weather event (possibly as a classification task).
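
One possible starting point for the temperature prediction question is a linear regression on time features: an intercept, a linear trend, and daily and yearly harmonics. The sketch below fits such a model by ordinary least squares on synthetic hourly data, not on the project data. The feature choice, the 80/20 chronological split, and all names are assumptions for illustration.

```python
import numpy as np

HOURS_PER_YEAR = 24.0 * 365.25


def time_features(hours_since_start):
    """Design matrix: intercept, trend (per year), daily and yearly harmonics."""
    t = np.asarray(hours_since_start, dtype=float)
    day = 2.0 * np.pi * t / 24.0
    year = 2.0 * np.pi * t / HOURS_PER_YEAR
    return np.column_stack([
        np.ones_like(t), t / HOURS_PER_YEAR,
        np.sin(day), np.cos(day),
        np.sin(year), np.cos(year),
    ])


# Synthetic hourly temperatures in kelvin over three years (illustrative only).
rng = np.random.default_rng(0)
hours = np.arange(3 * 365 * 24)
signal = (280.0
          + 10.0 * np.sin(2.0 * np.pi * hours / HOURS_PER_YEAR)
          + 3.0 * np.cos(2.0 * np.pi * hours / 24.0))
observed = signal + rng.normal(scale=1.0, size=hours.size)

# Chronological split: fit on the earlier part, evaluate on the later part.
split = int(0.8 * hours.size)
coef, *_ = np.linalg.lstsq(time_features(hours[:split]), observed[:split],
                           rcond=None)
predicted = time_features(hours[split:]) @ coef
rmse = float(np.sqrt(np.mean((predicted - observed[split:]) ** 2)))
print(f"hold-out RMSE: {rmse:.2f} K")
```

Splitting by time rather than at random keeps future observations out of the training set, which matters when the goal is forecasting.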

**Classification Analysis:**

- *Extreme Weather Events:* Can we classify and predict extreme weather events such as storms? This involves training a binary classification model to identify patterns indicative of extreme events. The result of this classification analysis is the prediction of extreme weather events based on the current weather data and a model that was trained on historical weather data.

- *Weather Event and Pattern Classification:* Is it possible to categorize and predict different weather patterns (e.g., clear skies, rain, snow, etc.) based on multivariate weather data? This involves using multiclass classification algorithms. The result of this classification analysis is the prediction of certain weather events based on the current weather data and a model that was trained on historical weather data.
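
As a simple baseline for the classification questions, a nearest-centroid classifier assigns each observation to the class whose mean feature vector is closest. The sketch below runs on synthetic two-feature data. The class name "Storm", the feature choice, and the helper functions are hypothetical, and the project may well use a different algorithm.

```python
import numpy as np


def fit_centroids(features, labels):
    """Return the class names and the mean feature vector of each class."""
    classes = np.unique(labels)
    centroids = np.vstack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(features, classes, centroids):
    """Assign each row to the class with the closest centroid (Euclidean)."""
    diffs = features[:, None, :] - centroids[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    return classes[distances.argmin(axis=1)]


# Synthetic rows: temperature (K) and wind gust (m/s); illustrative only.
rng = np.random.default_rng(1)
n = 300
calm = np.column_stack([rng.normal(285, 3, n), rng.normal(6, 1, n)])
storm = np.column_stack([rng.normal(280, 3, n), rng.normal(15, 2, n)])
X = np.vstack([calm, storm])
y = np.array(["Blue sky day"] * n + ["Storm"] * n)

# Shuffle the synthetic rows, then hold out the last 20 percent.
order = rng.permutation(len(y))
X, y = X[order], y[order]
split = int(0.8 * len(y))

# Standardize with statistics from the training part only.
mean, std = X[:split].mean(axis=0), X[:split].std(axis=0)
X_train, X_test = (X[:split] - mean) / std, (X[split:] - mean) / std

classes, centroids = fit_centroids(X_train, y[:split])
accuracy = float((predict(X_test, classes, centroids) == y[split:]).mean())
print(f"hold-out accuracy: {accuracy:.2f}")
```

With real hourly weather data a chronological split is usually preferable to shuffling, because neighbouring hours are strongly correlated.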


**Response Variable / Dependent Variable (y):**

- Temperature
- Weather Event (label)


**Predictor Variables / Independent Variables (x):**

1. **Time**
2. **Wind:**
   - Speed
   - Gust
   - Direction
3. **Humidity**
4. **Pressure**
5. **Temperature** (as a predictor in the classification analyses)
6. **Cumulative Precipitation**
7. **Snow Density**
8. **Cumulative Snow**
9. **Cumulative Ice**

By exploring the relationships between the response variables (temperature and weather event) and these predictor variables, we can gain insights into the complex interactions that drive local weather patterns. For instance, we might investigate how temperature and wind variations correlate, or how certain parameter patterns influence the weather events. The goal is to identify key meteorological factors that contribute to each weather event and to identify correlations and possible causal relationships in the data.
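
A first look at how temperature and wind variables relate could use a correlation matrix. The sketch below computes Pearson correlations, plus rank-based (Spearman) correlations obtained as the Pearson correlation of the ranks. The data is synthetic and only borrows the column names `avg_temp`, `avg_windspd`, and `avg_windgust`; the generated relationships are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic values that only mimic the column names of the project data.
rng = np.random.default_rng(2)
n = 500
wind = rng.gamma(shape=9.0, scale=0.27, size=n)
frame = pd.DataFrame({
    "avg_windspd": wind,
    "avg_windgust": 3.0 * wind + rng.normal(0.0, 0.5, n),
    "avg_temp": 280.0 - 1.5 * wind + rng.normal(0.0, 5.0, n),
})

pearson = frame.corr()          # linear association
spearman = frame.rank().corr()  # monotonic association (Pearson on ranks)
print(pearson.round(2))
```

A strong correlation found this way does not by itself establish causation; it only indicates which variable pairs deserve closer modeling.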


### Overview of data

In [1]:
import pandas as pd

df = pd.read_csv("../project/data/external/feature_data_substation_bancroft_labelled.csv")
df.head()


| | Unnamed: 0 | substation | run_datetime | valid_datetime | horizon | avg_temp | avg_windspd | avg_windgust | avg_pressure_change | avg_snow | ... | MA_avg_winddir_month | MA_avg_temp_change_month | label0 | label1 | label2 | label3 | wep | storm_id2 | year | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Bancroft | 2015-07-15 00:00:00 | 2015-07-15 00:00:00 | 0.0 | 287.389224 | 3.38638 | 11.136197 | 52.892217 | 0.0 | ... | 80.302464 | NaN | 0 | 1 | NaN | 1 | Blue sky day | NaN | 2015 | 7 |
| 1 | 1 | Bancroft | 2015-07-15 01:00:00 | 2015-07-15 01:00:00 | 0.0 | 287.378997 | 3.326687 | 11.002795 | 50.256685 | 0.0 | ... | 78.584418 | -0.010227 | 0 | 1 | NaN | 1 | Blue sky day | NaN | 2015 | 7 |
| 2 | 2 | Bancroft | 2015-07-15 02:00:00 | 2015-07-15 02:00:00 | 0.0 | 287.388845 | 3.243494 | 10.700595 | 47.944054 | 0.0 | ... | 77.809235 | -0.000189 | 3 | 1 | NaN | 1 | Blue sky day | NaN | 2015 | 7 |
| 3 | 3 | Bancroft | 2015-07-15 03:00:00 | 2015-07-15 03:00:00 | 0.0 | 287.427324 | 3.145505 | 10.323983 | 45.855264 | 0.0 | ... | 77.93183 | 0.0127 | 2 | 1 | NaN | 1 | Blue sky day | NaN | 2015 | 7 |
| 4 | 4 | Bancroft | 2015-07-15 04:00:00 | 2015-07-15 04:00:00 | 0.0 | 287.489158 | 3.047607 | 9.921157 | 44.823453 | 0.0 | ... | 79.272034 | 0.024983 | 2 | 1 | NaN | 1 | Blue sky day | NaN | 2015 | 7 |


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65345 entries, 0 to 65344
Columns: 184 entries, Unnamed: 0 to month
dtypes: float64(166), int64(8), object(10)
memory usage: 91.7+ MB


In [3]:
df.dtypes

Unnamed: 0          int64
substation         object
run_datetime       object
valid_datetime     object
horizon           float64
                   ...   
label3              int64
wep                object
storm_id2          object
year                int64
month               int64
Length: 184, dtype: object
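
The `dtypes` output shows the timestamp columns stored as `object` and a leftover index column `Unnamed: 0`. A possible cleanup step is sketched below on a tiny stand-in frame. The choices of `valid_datetime` as the index and of a categorical dtype for `wep` are assumptions, not decisions documented in the project.

```python
import pandas as pd

# Small stand-in frame with the same kinds of columns as the loaded CSV.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "substation": ["Bancroft"] * 3,
    "valid_datetime": [
        "2015-07-15 00:00:00",
        "2015-07-15 01:00:00",
        "2015-07-15 02:00:00",
    ],
    "wep": ["Blue sky day"] * 3,
})

tidy = (
    raw.drop(columns=["Unnamed: 0"])            # leftover index column
       .assign(
           valid_datetime=lambda d: pd.to_datetime(d["valid_datetime"]),
           wep=lambda d: d["wep"].astype("category"),
       )
       .set_index("valid_datetime")             # time-indexed for resampling
       .sort_index()
)
print(tidy.dtypes)
```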

In [4]:
pd.options.display.max_rows = 999
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 65345.0 | 32685.66 | 18867.701277 | 0.0 | 16343.0 | 32689.0 | 49025.0 | 65361.0 |
| horizon | 65345.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| avg_temp | 65345.0 | 279.5743 | 11.383325 | 243.8494 | 271.1142 | 279.8827 | 289.9032 | 300.9341 |
| avg_windspd | 65345.0 | 2.440013 | 0.77288 | 0.643393 | 1.882336 | 2.34648 | 2.893087 | 6.470306 |
| avg_windgust | 65345.0 | 7.644774 | 2.512226 | 1.980821 | 5.769796 | 7.31874 | 9.142357 | 19.90555 |
| avg_pressure_change | 56672.0 | 0.01739078 | 33.88573 | -168.3157 | -19.26354 | -0.07129918 | 19.5491 | 153.8115 |
| avg_snow | 65345.0 | 0.438549 | 1.611839 | 0.0 | 0.0 | 0.0 | 0.1095332 | 34.02421 |
| avg_wet_bulb_temp | 65345.0 | 278.8068 | 11.210085 | 243.3543 | 270.5423 | 279.0885 | 288.9529 | 299.5567 |
| avg_snow_density_6 | 65345.0 | 1.364904 | 5.599313 | 0.0 | 0.0 | 0.0 | 0.0 | 79.91902 |
| avg_snow_density_12 | 65345.0 | 0.5019844 | 2.692368 | 0.0 | 0.0 | 0.0 | 0.0 | 42.72815 |