# Topic ideas

---

Group name: Group iBm 

---

**Project Title: Weather Data Analysis Using Regression and Classification on ERA5 Dataset**

**1. Scope of the Project:**

The project aims to analyze and derive insights from the ERA5 (ECMWF Reanalysis version 5) weather dataset. It involves both regression and classification analyses to model and predict various weather parameters as well as gain insights on correlations, causations and patterns involving the parameters. The ERA5 dataset, provided by the European Centre for Medium-Range Weather Forecasts (ECMWF), is a high-quality global atmospheric reanalysis dataset covering multiple decades. The analysis will focus on the region of Bancroft in Ontario, Canada.

## Name of topic idea 1

### Data source
- **Source:** ERA5 dataset by ECMWF (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5)
- **Temporal Coverage:** 2015 to 2022 (the subset used in this project; the full ERA5 record spans multiple decades) (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation)
- **Spatial Resolution:** Approximately 31 km (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation)
- **Parameters:** Includes but not limited to temperature, precipitation, wind speed, atmospheric pressure, etc. (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Parameterlistings)


The ERA5 dataset is provided by the European Centre for Medium-Range Weather Forecasts (ECMWF). It is a product of atmospheric reanalysis, a process that involves assimilating observational data from various sources into a numerical weather prediction model. 

**Collection Method:**

1. **Observational Data Assimilation:** The ERA5 dataset is generated through a process called reanalysis. This involves assimilating vast amounts of observational data from satellites, weather stations, and other sources into a numerical weather prediction model. This assimilation process helps create a consistent and comprehensive representation of the atmosphere over time.

2. **Numerical Weather Prediction Model:** ECMWF uses advanced numerical weather prediction models to simulate the Earth's atmosphere. These models take into account the laws of physics governing the atmosphere and use initial conditions based on observational data.

3. **Temporal Coverage:** The ERA5 dataset spans multiple decades and is updated continuously. The record currently begins in January 1940 (earlier releases started in 1979) and extends to near real time, providing a continuous record of atmospheric conditions. This project restricts the data to 2015 through 2022 because of resource and project constraints.

4. **Spatial Resolution:** The dataset has a high spatial resolution, offering detailed information on a global scale, with grid points approximately 31 kilometers apart.

**Purpose:** The primary purpose of generating the ERA5 dataset is to provide a comprehensive and accurate representation of past weather conditions. It serves various scientific and operational applications, including climate monitoring, research, and supporting weather forecasting.

**Quality Assurance:** ECMWF is renowned for its commitment to data quality. The assimilation process involves rigorous validation against a wide range of observational data to ensure the accuracy and reliability of the reanalysis output.

**Accessibility:** The ECMWF makes ERA5 data available to the public and the scientific community, fostering research and applications in meteorology, climate science, and related fields.

Source: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation
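
To relate the grid spacing described above to a single location, the following sketch snaps a coordinate to the nearest node of a regular latitude/longitude grid and measures the distance to that node. The 0.25° step and the Bancroft coordinates are assumptions for illustration (the step should be checked against the actual download), and `nearest_grid_point` and `haversine_km` are hypothetical helper names.

```python
import math


def nearest_grid_point(lat, lon, step=0.25):
    """Snap a coordinate to the nearest node of a regular lat/lon grid."""
    return round(lat / step) * step, round(lon / step) * step


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate coordinates for Bancroft, Ontario (assumption, not taken from the dataset)
BANCROFT = (45.06, -77.85)

grid_lat, grid_lon = nearest_grid_point(*BANCROFT)
offset_km = haversine_km(BANCROFT[0], BANCROFT[1], grid_lat, grid_lon)
print(grid_lat, grid_lon, round(offset_km, 1))
```

A single grid cell represents an area average, so values at the nearest node are only an approximation of conditions at the site itself.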


### Data characteristics

- **Temporal Aspect:** The dataset provides temporal information at various intervals (e.g., hourly or monthly), allowing for both short-term and long-term analyses.
  
- **Spatial Aspect:** With a spatial resolution of around 31 km, the dataset provides comprehensive global coverage, which facilitates region-specific studies. This project restricts the analysis to the Bancroft region in Ontario, Canada, an area of particular interest to IBM.

- **Multivariate Nature:** Multiple meteorological parameters are available, enabling a comprehensive analysis of weather patterns.

- **Quality:** ERA5 is known for its high quality and precision, making it suitable for detailed analyses and modeling.

The ERA5 dataset is comprehensive, providing a wide range of meteorological variables that cover various aspects of the Earth's atmosphere. The observations and general characteristics measured in the ERA5 dataset are crucial for understanding and analyzing weather and climate patterns. Some of the key variables included in the dataset are:

1. **Temperature:** ERA5 provides information on air temperature at different levels of the atmosphere. This includes 2-meter temperature, which is often used as a representative of surface temperature.

2. **Wind:** Wind speed and direction are essential meteorological parameters. ERA5 includes information on both zonal (east-west) and meridional (north-south) components of the wind at different altitudes.

3. **Precipitation:** Precipitation data includes rainfall and snowfall. This information is crucial for understanding water cycles and is vital for various applications, including hydrology and agriculture.

4. **Pressure:** Atmospheric pressure at different levels is provided in the dataset. Changes in atmospheric pressure are associated with weather patterns and can influence local weather conditions.

5. **Humidity:** Relative humidity and specific humidity are included, providing insights into the moisture content of the atmosphere. This is important for understanding cloud formation and precipitation processes.

6. **Radiation:** Solar radiation at the surface and other radiative fluxes are available. These variables are essential for studying energy exchanges in the atmosphere and on the Earth's surface.

7. **Clouds:** Various cloud-related variables, such as cloud cover and cloud type, are part of the dataset. Cloud information is crucial for understanding the Earth's energy balance and climate.

8. **Other Atmospheric Parameters:** The dataset includes information on other atmospheric variables such as geopotential height, sea level pressure, and potential vorticity.

These variables collectively provide a detailed snapshot of the Earth's atmosphere and are fundamental for conducting both regression and classification analyses. Researchers and meteorologists often leverage these variables to study climate trends, weather patterns, and environmental changes. The high spatial and temporal resolution of the ERA5 dataset enhances its utility for diverse applications, including climate research, environmental monitoring, and weather forecasting.
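
Because the wind is described in terms of zonal and meridional components, a speed and a direction usually have to be derived from them. The sketch below shows one common way to do this, using the meteorological convention (degrees clockwise from north, naming the direction the wind blows from). The function name `wind_speed_and_direction` is hypothetical.

```python
import math


def wind_speed_and_direction(u, v):
    """Convert zonal (u, eastward) and meridional (v, northward) wind
    components into (speed, direction).

    Direction is in degrees clockwise from north and names where the
    wind blows FROM (meteorological convention).
    """
    speed = math.hypot(u, v)
    direction = (180.0 + math.degrees(math.atan2(u, v))) % 360.0
    return speed, direction


# A wind blowing toward the east (u > 0) comes from the west
speed, direction = wind_speed_and_direction(1.0, 0.0)
print(speed, direction)
```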


### Research question

#### Examples

*How does the variability in atmospheric conditions, as captured by the ERA5 dataset, influence regional precipitation patterns?*

*Analyse the trend of parameters such as temperature with a regression model, and use it to predict response variables such as wind speed or the onset of a storm.*

*What effect does air humidity or ground humidity (independent variable) have on clouds or specific weather events (dependent variable)?*

*Do certain wind directions coincide with certain weather conditions, such as temperature?*
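
As a first look at the wind direction question, directions can be binned into compass sectors and the temperature averaged per sector. The sketch below uses a small synthetic sample; the column names mirror `avg_winddir` and `avg_temp` from the project data, while the four-sector split and the sample values are assumptions for illustration.

```python
import pandas as pd

# Synthetic sample; values are illustrative only
sample = pd.DataFrame({
    "avg_winddir": [10.0, 95.0, 180.0, 265.0, 350.0, 100.0],
    "avg_temp": [270.0, 280.0, 290.0, 285.0, 272.0, 282.0],
})

labels = ["N", "E", "S", "W"]

# Shift by 45 degrees so the north sector (315 to 45) does not wrap around 0
sector_idx = (((sample["avg_winddir"] + 45.0) % 360.0) // 90.0).astype(int)
sample["wind_sector"] = sector_idx.map(dict(enumerate(labels)))

# Mean temperature per wind sector
mean_temp_by_sector = sample.groupby("wind_sector")["avg_temp"].mean()
print(mean_temp_by_sector)
```

Any pattern found this way is a correlation only; season and time of day would need to be controlled for before drawing conclusions.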


**Response Variables:**

- *Regional precipitation:* this could be the total precipitation over a specific area or the frequency of precipitation events.
- *Wind speed*
- *Clouds*
- *Weather event*
- *Temperature*


**Possible Predictor Variables:**

1. **Temperature:**
   - 2-meter temperature
   - Temperature at different atmospheric levels

2. **Wind:**
   - Zonal and meridional components of wind at various altitudes
   - Speed
   - Direction

3. **Humidity:**
   - Relative humidity
   - Specific humidity

4. **Pressure:**
   - Atmospheric pressure at different levels

5. **Radiation:**
   - Solar radiation at the surface

6. **Clouds:**
   - Cloud cover
   - Cloud type

7. **Geographical Features:**
   - Latitude and longitude of the region

8. **Time:**
   - Temporal information to capture seasonal and diurnal variations
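
One possible way to expose seasonal and diurnal variation to a model is a cyclical encoding, so that hour 23 sits next to hour 0 and late December sits next to early January. This is a minimal sketch; the function name `cyclical_time_features` and the feature names are hypothetical.

```python
import math
from datetime import datetime


def cyclical_time_features(ts):
    """Encode hour of day and day of year as sine/cosine pairs."""
    hour_angle = 2 * math.pi * ts.hour / 24
    # tm_yday starts at 1, so subtract 1 to place 1 January at angle 0
    doy_angle = 2 * math.pi * (ts.timetuple().tm_yday - 1) / 365.25
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "doy_sin": math.sin(doy_angle),
        "doy_cos": math.cos(doy_angle),
    }


features = cyclical_time_features(datetime(2015, 7, 24, 5, 0))
print(features)
```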

By exploring the relationships between regional precipitation and these predictor variables, we can gain insights into the complex interactions that drive local weather patterns. For instance, we might investigate how temperature and humidity variations correlate with precipitation, or how wind patterns influence the frequency and intensity of rainfall. The goal is to identify key meteorological factors that contribute to precipitation variability in a specific region.

The study could employ regression analysis techniques to model the quantitative relationship between predictor variables and precipitation amounts. Additionally, classification analysis might be applied to predict the likelihood of different precipitation categories (e.g., light rain, heavy rain, no precipitation) based on atmospheric conditions. This dual approach could provide a more holistic understanding of the dynamics influencing regional precipitation.
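
For the classification variant, a continuous precipitation amount first has to be turned into categories. The sketch below does this with `pd.cut`; the thresholds (0 and 2.5, in whatever unit the precipitation column uses) and the sample values are placeholders, not agreed definitions of light or heavy rain.

```python
import pandas as pd

# Synthetic precipitation amounts; values are illustrative only
precip = pd.Series([0.0, 0.3, 2.5, 12.0, 0.0, 7.6])

# Intervals are right-closed: (-inf, 0], (0, 2.5], (2.5, inf)
bins = [-float("inf"), 0.0, 2.5, float("inf")]
labels = ["no precipitation", "light", "heavy"]

category = pd.cut(precip, bins=bins, labels=labels)
counts = category.value_counts()
print(counts)
```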

**Regression Analysis:**

- *Temperature Prediction:* Can we build an accurate regression model to predict temperature based on historical data? This could involve exploring the relationships between temperature and other variables like time of day, geographical location, and atmospheric pressure.

- *Precipitation Modeling:* How well can we predict precipitation patterns using regression techniques? This may involve investigating the impact of various factors on rainfall, including temperature, humidity, and wind patterns.
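
To illustrate the regression idea, the following sketch fits an ordinary least-squares model on synthetic data and evaluates it on a held-out part. The data-generating formula, the choice of predictors, and the 80/20 split are all assumptions made for the example; they say nothing about the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors: hour of day and pressure change
hour = rng.uniform(0, 24, n)
pressure_change = rng.normal(0, 30, n)
diurnal = np.sin(2 * np.pi * hour / 24)

# Synthetic temperature in kelvin with a diurnal cycle and noise
temp = 280.0 + 5.0 * diurnal - 0.02 * pressure_change + rng.normal(0, 0.5, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), diurnal, pressure_change])

# Hold out the last 20 % for evaluation
split = int(0.8 * n)
coef, *_ = np.linalg.lstsq(X[:split], temp[:split], rcond=None)

pred = X[split:] @ coef
rmse = float(np.sqrt(np.mean((pred - temp[split:]) ** 2)))
print(coef, rmse)
```

With real time-ordered weather data, the split would normally be chronological so that the model is never evaluated on periods it has already seen.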

**Classification Analysis:**

- *Extreme Weather Events:* Can we classify and predict extreme weather events such as storms or heatwaves? This involves training a classification model to identify patterns indicative of extreme events.

- *Weather Pattern Classification:* Is it possible to categorize and predict different weather patterns (e.g., clear skies, cloudy, rainy) based on multivariate data? This could involve using clustering or classification algorithms.
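
Before a classifier can be trained, "extreme" needs an operational definition. One simple option, sketched below, is to label observations above a high quantile of a variable. The use of wind gust, the 90th percentile, and the sample values are assumptions for illustration.

```python
import pandas as pd

# Synthetic wind gust values; the name mirrors the avg_windgust column
gust = pd.Series(
    [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 20.0],
    name="avg_windgust",
)

# Label everything above the 90th percentile as an extreme event
threshold = gust.quantile(0.9)
is_extreme = gust > threshold

# Share of positive labels, useful for spotting class imbalance
positive_share = is_extreme.mean()
print(threshold, positive_share)
```

Labels defined this way are rare by construction, so metrics such as precision and recall are likely to be more informative than plain accuracy.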

**Overall Objectives:**

- Utilize regression techniques to predict specific weather parameters accurately.
- Employ classification algorithms to identify and predict different weather events and patterns.
- Explore temporal and spatial trends within the ERA5 dataset.
- Assess the impact of various meteorological factors on each other.

**Expected Outcomes:**

- Regression models predicting temperature, precipitation, and other weather parameters with high accuracy.
- Classification models capable of identifying and predicting extreme weather events.
- Insights into the relationships between different meteorological variables.

**Note:** It's crucial to conduct thorough exploratory data analysis (EDA) before diving into modeling, ensuring a solid understanding of the dataset's characteristics and relationships between variables. Additionally, feature engineering and careful model selection will play a vital role in the success of the regression and classification tasks.
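
Two quick EDA checks that fit this note are the share of missing values per column and the pairwise correlations. The sketch below runs both on a small synthetic frame; the column names mirror the project data, but the values are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic sample with one missing temperature value
sample = pd.DataFrame({
    "avg_temp": [292.9, 296.4, np.nan, 285.9, 283.2],
    "avg_wet_bulb_temp": [291.3, 295.0, 296.0, 284.8, 282.4],
    "avg_windspd": [2.06, 1.90, 2.86, 2.10, 2.66],
})

# Fraction of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)

# Pairwise Pearson correlations (rows with missing values are skipped pairwise)
corr = sample.corr()

print(missing_share)
print(corr.round(3))
```

Strongly correlated predictors, such as temperature and wet-bulb temperature in this sample, can make regression coefficients hard to interpret.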


### Overview of data

*Use the Pandas functions to provide an overview of the data set*

In [20]:
import pandas as pd

df = pd.read_csv("../../../feature_data_substation_Bancroft.csv")
df.head()



| | substation | run_datetime | valid_datetime | horizon | avg_temp | avg_windspd | avg_windgust | avg_pressure_change | avg_snow | avg_wet_bulb_temp | ... | min_ice_M1 | max_ice_M1 | avg_ice_M1 | min_ice_M2 | max_ice_M2 | avg_ice_M2 | damagecount | damagecount_hourly | len_powerline | density_powerline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bancroft | 2015-07-24 05:00:00 | 2015-07-24 05:00:00 | 0.0 | 292.972609 | 2.055396 | 6.17241 | -0.527509 | 0.0 | 291.332157 | ... |  |  |  |  |  |  | 0.0 | 0 | 3920020 | 0.02695 |
| 1 | Bancroft | 2015-07-27 16:00:00 | 2015-07-27 16:00:00 | 0.0 | 296.443003 | 1.895564 | 5.595455 | 2.729771 | 0.0 | 295.036379 | ... |  |  |  |  |  |  | 1.0 | 0 | 3920020 | 0.02695 |
| 2 | Bancroft | 2015-09-07 04:00:00 | 2015-09-07 04:00:00 | 0.0 | 296.953925 | 2.863214 | 9.110616 | -9.833964 | 0.0 | 296.02446 | ... |  |  |  |  |  |  | 2.0 | 0 | 3920020 | 0.02695 |
| 3 | Bancroft | 2015-09-25 04:00:00 | 2015-09-25 04:00:00 | 0.0 | 285.943825 | 2.103261 | 6.400018 | 3.555872 | 0.0 | 284.753172 | ... |  |  |  |  |  |  | 1.0 | 0 | 3920020 | 0.02695 |
| 4 | Bancroft | 2015-10-20 04:00:00 | 2015-10-20 04:00:00 | 0.0 | 283.157591 | 2.65653 | 8.448167 | 39.358195 | 0.0 | 282.441072 | ... |  |  |  |  |  |  | 1.0 | 0 | 3920020 | 0.02695 |


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65449 entries, 0 to 65448
Columns: 142 entries, substation to density_powerline
dtypes: float64(133), int64(2), object(7)
memory usage: 70.9+ MB
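
The summary above lists seven `object` columns, and the `dtypes` output shows `run_datetime` and `valid_datetime` among them, so they are stored as text rather than timestamps. The sketch below shows one way to convert such a column and use it as a time index; the three rows are copied from the preview of the data, and the monthly resampling is only an example of what the conversion enables.

```python
import pandas as pd

# Three rows copied from the data preview
raw = pd.DataFrame({
    "valid_datetime": [
        "2015-07-24 05:00:00",
        "2015-07-27 16:00:00",
        "2015-09-07 04:00:00",
    ],
    "avg_temp": [292.972609, 296.443003, 296.953925],
})

# Parse the text column into timestamps with an explicit format
raw["valid_datetime"] = pd.to_datetime(
    raw["valid_datetime"], format="%Y-%m-%d %H:%M:%S"
)

# A sorted datetime index enables time-based slicing and resampling
ts = raw.set_index("valid_datetime").sort_index()
monthly_mean = ts["avg_temp"].resample("MS").mean()
print(monthly_mean)
```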


In [21]:
pd.options.display.max_rows = 999
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| horizon | 65449.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| avg_temp | 65362.0 | 279.5739 | 11.382142 | 243.8494 | 271.1155 | 279.8819 | 289.9003 | 300.9341 |
| avg_windspd | 65449.0 | 2.439872 | 0.772849 | 0.6001035 | 1.882597 | 2.346945 | 2.89318 | 6.470306 |
| avg_windgust | 65449.0 | 7.636845 | 2.518158 | 1.600792 | 5.761621 | 7.31307 | 9.138378 | 19.90555 |
| avg_pressure_change | 56689.0 | 0.01477876 | 33.881355 | -168.3157 | -19.26306 | -0.07404531 | 19.54243 | 153.8115 |
| avg_snow | 65449.0 | 0.4378833 | 1.610646 | 0.0 | 0.0 | 0.0 | 0.109303 | 34.02421 |
| avg_wet_bulb_temp | 65362.0 | 278.8065 | 11.208903 | 243.3543 | 270.545 | 279.0882 | 288.9492 | 299.5567 |
| avg_snow_density_6 | 65449.0 | 1.362735 | 5.595126 | 0.0 | 0.0 | 0.0 | 0.0 | 79.91902 |
| avg_snow_density_12 | 65449.0 | 0.5011867 | 2.690302 | 0.0 | 0.0 | 0.0 | 0.0 | 42.72815 |
| avg_winddir | 65432.0 | 210.2392 | 70.850966 | 21.17728 | 165.4006 | 222.0229 | 262.8592 | 346.806 |
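
Two things stand out in these statistics: `horizon` is constant (standard deviation 0), so it carries no information for modeling, and the `avg_temp` range of roughly 244 to 301 suggests the values are in kelvin. The sketch below drops constant columns and adds a Celsius column. The kelvin interpretation is an assumption based on the value range, and `avg_temp_c` is a hypothetical column name.

```python
import pandas as pd

# Three rows modelled on the data preview
sample = pd.DataFrame({
    "horizon": [0.0, 0.0, 0.0],
    "avg_temp": [292.972609, 296.443003, 285.943825],
    "avg_windspd": [2.055396, 1.895564, 2.103261],
})

# Columns with at most one distinct value cannot explain any variation
constant_cols = [c for c in sample.columns if sample[c].nunique(dropna=True) <= 1]
cleaned = sample.drop(columns=constant_cols)

# Assumption: avg_temp is in kelvin; convert to degrees Celsius for readability
cleaned["avg_temp_c"] = cleaned["avg_temp"] - 273.15
print(constant_cols)
print(cleaned)
```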


In [22]:
df.dtypes

substation                         object
run_datetime                       object
valid_datetime                     object
horizon                           float64
avg_temp                          float64
avg_windspd                       float64
avg_windgust                      float64
avg_pressure_change               float64
avg_snow                          float64
avg_wet_bulb_temp                 float64
avg_snow_density_6                float64
avg_snow_density_12               float64
avg_winddir                       float64
avg_dewpoint                      float64
avg_soil_moisture                 float64
min_temp                          float64
min_windspd                       float64
min_windgust                      float64
min_pressure_change               float64
min_snow                          float64
min_wet_bulb_temp                 float64
min_snow_density_6                float64
min_snow_density_12               float64
min_winddir                       