# Temperature Prediction Using Dependent Gaussian Data

## License  
This project is released under the **GNU General Public License v3 (GPLv3)**.  
You are free to use, modify, and distribute this code under the terms of the GPLv3 license.  
For more details, see: [GNU GPL v3](https://www.gnu.org/licenses/gpl-3.0.en.html).

## Authors  
- **Mina Sadat Mahmoudi**  
- **Saeed Foroutan**  
- **Seyed Abolfazl Motahari**  
- **Babak Khalaj**  

**Affiliation:** Department of Electrical Engineering & Computer Engineering, Sharif University of Technology, Iran  

## Conference & Publication  
This work has been **accepted for presentation at ICASSP 2025**.

---

## Overview  

This notebook presents a **real-world application** of the theoretical results from the paper:  

**"Uniform Convergence of Lipschitz Functions with Dependent Gaussian Samples"**  
(*ICASSP 2025, Sharif University of Technology*)  

### Objective:  
- Predict temperature in **Hyderabad** using **dependent Gaussian data**.  
- Train a **neural network model** to approximate the target function.  
- Compare prediction errors under **different dependency structures**.  

### Methodology:  
- Utilize **temperature data** from the **five nearest cities** to Hyderabad.  
- Apply a **neural network** to predict the temperature.  
- Measure **prediction error decay** with increasing sample size.  

The goal is to analyze how learning from **dependent data** (e.g., correlated temperature readings) affects **generalization performance**.  


In [1]:
# Install meteostat if it is not already installed.
%pip install meteostat

Collecting meteostat
  Downloading meteostat-1.6.8-py3-none-any.whl.metadata (4.6 kB)
Downloading meteostat-1.6.8-py3-none-any.whl (31 kB)
Installing collected packages: meteostat
Successfully installed meteostat-1.6.8


## Importing Required Libraries  

This notebook utilizes the following libraries for **data processing, modeling, and visualization**:  

- **`pandas`**: For handling structured data and DataFrames.  
- **`numpy`**: For numerical operations and matrix manipulations.  
- **`matplotlib.pyplot`**: For plotting and visualizing temperature trends.  
- **`datetime`**: For handling time-based data processing.  
- **`torch`**: Deep learning framework (PyTorch) for building neural networks.  
- **`torch.nn`**: For defining and training neural network models.  
- **`torch.optim`**: Optimization algorithms like Adam for training.  
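- **`meteostat`**: To retrieve historical weather data from meteorological stations.  
- **`sklearn.preprocessing.MinMaxScaler`**: For scaling inputs and targets to the `[0, 1]` range before training.  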

These libraries support the implementation of **temperature prediction** using **deep learning models** and **historical weather data**.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import torch
import torch.nn as nn
import torch.optim as optim
from meteostat import Stations, Daily, Point
from sklearn.preprocessing import MinMaxScaler

In [None]:

# --------------------------------------------------------------------
# 1. Define a deeper neural network with more layers and neurons
# --------------------------------------------------------------------
class DeepNN(nn.Module):
    """
    Feedforward deep neural network with 7 fully connected layers. Every
    layer except the final output layer is followed by a ReLU activation.

    The network maps a single input feature to a single output value. The
    hidden widths first grow (128 -> 256 -> 512) and then shrink
    (256 -> 128 -> 64) to capture complex relationships.

    Description:
        The architecture is used here for regression. With a suitable loss
        and output handling it could also be adapted to other regression or
        binary classification tasks.

    Attributes:
        fc1 (nn.Linear): Linear layer mapping from 1 input feature to 128 hidden features.
        fc2 (nn.Linear): Linear layer mapping from 128 hidden features to 256 hidden features.
        fc3 (nn.Linear): Linear layer mapping from 256 hidden features to 512 hidden features.
        fc4 (nn.Linear): Linear layer mapping from 512 hidden features to 256 hidden features.
        fc5 (nn.Linear): Linear layer mapping from 256 hidden features to 128 hidden features.
        fc6 (nn.Linear): Linear layer mapping from 128 hidden features to 64 hidden features.
        fc7 (nn.Linear): Output layer mapping from 64 hidden features to 1 output value.

    Forward:
        Args:
            x (torch.Tensor): The input tensor containing a single feature.
        Returns:
            torch.Tensor: The model's output tensor, representing the predicted value.
    """
    def __init__(self):
        super(DeepNN, self).__init__()
        self.fc1 = nn.Linear(1, 128)   # Input layer (1 feature) -> first hidden layer
        self.fc2 = nn.Linear(128, 256)
        self.fc3 = nn.Linear(256, 512)
        self.fc4 = nn.Linear(512, 256)
        self.fc5 = nn.Linear(256, 128)
        self.fc6 = nn.Linear(128, 64)
        self.fc7 = nn.Linear(64, 1)    # Output layer (1 target)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = torch.relu(self.fc4(x))
        x = torch.relu(self.fc5(x))
        x = torch.relu(self.fc6(x))
        x = self.fc7(x)
        return x




## Deep Neural Network for Temperature Prediction

This section implements a **deep feedforward neural network** to model temperature predictions based on dependent weather data.

### 1️⃣ **`DeepNN` Class**
- A **7-layer neural network** with **fully connected (FC) layers** and **ReLU activations**.
- Designed to **capture complex temperature dependencies** effectively.
- **Architecture**:
  - Input: **1 feature** (the mean temperature of the nearby stations, `avg_5_stations`).
  - Hidden layers: **128 → 256 → 512 → 256 → 128 → 64 neurons**.
  - Output: **1 neuron** (predicted temperature value).
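
To put the architecture in perspective, the number of trainable parameters can be counted directly from the layer widths listed above. The following is a small dependency-free sketch; the names `layer_sizes` and `count_parameters` are illustrative and not part of the notebook. It assumes each fully connected layer has a bias term, which is the default for `nn.Linear`, so a layer from `fan_in` to `fan_out` units contributes `fan_in * fan_out` weights plus `fan_out` biases.

```python
# Layer widths of DeepNN as described above: input, six hidden layers, output.
layer_sizes = [1, 128, 256, 512, 256, 128, 64, 1]

def count_parameters(sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

n_params = count_parameters(layer_sizes)
print(n_params)  # 337409
```

With roughly 337,000 parameters and as few as 7 training samples, the smallest experiment in this notebook is heavily over-parameterized, which is worth keeping in mind when reading the error for that setting.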

### 2️⃣ **Helper Functions**
- **`tensorize(arr)`**: Converts NumPy arrays into PyTorch tensors.
- **`custom_loss(output, target)`**: Computes the **Mean Absolute Error (MAE)**.
- **`train_model(x, y, model, optimizer, num_epochs=2000)`**:
  - Trains the **DeepNN** model on the full training set at every epoch, using the optimizer passed in (**Adam** in this notebook).
  - Prints the training **loss** every 100 epochs for monitoring.
- **`predict(model, input_array)`**:
  - Converts input to tensors and generates predictions using the trained model.

This deep learning approach enables us to **learn and generalize** from temperature data under different **dependency conditions**.


In [None]:
# --------------------------------------------------------------------
# 2. Helper functions (tensorize, custom_loss, train_model, predict)
# --------------------------------------------------------------------
def tensorize(arr):
    """
    Convert a numpy array to a float32 PyTorch tensor, shaped (n_samples, 1).
    """
    return torch.tensor(arr, dtype=torch.float32).view(-1, 1)

# Mean absolute error
def custom_loss(output, target):
    return torch.mean(torch.abs(output - target))

def train_model(x, y, model, optimizer, num_epochs=2000):
    """
    x, y: numpy arrays
    model: a PyTorch nn.Module
    optimizer: e.g., torch.optim.Adam(model.parameters(), lr=1e-4)
    """
    # Convert numpy arrays to PyTorch tensors
    x_tensor = tensorize(x)
    y_tensor = tensorize(y)

    for epoch in range(num_epochs):
        model.train()
        # Forward pass
        outputs = model(x_tensor)
        loss = custom_loss(outputs, y_tensor)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (epoch + 1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

    return model

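def predict(model, input_array):
    """
    input_array: a numpy array
    Returns a numpy array of predictions with shape (n_samples, 1).
    """
    model.eval()
    input_tensor = tensorize(input_array)
    # Gradients are not needed at inference time
    with torch.no_grad():
        output = model(input_tensor).numpy()
    return output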

# Fetching and Preparing Temperature Data for Hyderabad  

## Overview  
This section retrieves **historical temperature data** from **Meteostat** for **Hyderabad, India**, and its surrounding weather stations. The goal is to construct a **reliable dataset** by aggregating nearby station data, handling missing values, and preparing training and testing sets for temperature prediction.

---

## 1️⃣ Finding Nearby Weather Stations  
- **Hyderabad coordinates**: `(Latitude: 17.3850, Longitude: 78.4867)`.  
- Using **Meteostat's `Stations()` API**, we retrieve the **10 nearest weather stations**.  
- The **first two stations** are **inside Hyderabad**, so they are excluded from the predictor set (the nearest one supplies the target temperature instead).  
- The **next 5 closest stations** are selected for temperature aggregation.
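
As a sanity check on the station ordering, the distance from Hyderabad to a station can be recomputed from its coordinates. The sketch below uses the haversine great-circle formula with a mean Earth radius of 6371 km. This is an assumption made for illustration: the notebook does not state how Meteostat computes its `distance` column, and `haversine_km` is a hypothetical helper, not a Meteostat function.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

HYD = (17.3850, 78.4867)    # Hyderabad coordinates used in this notebook
MEDAK = (18.05, 78.2667)    # Medak coordinates from the station table

medak_km = haversine_km(*HYD, *MEDAK)
print(f"Hyderabad -> Medak: {medak_km:.1f} km")
```

The result can be compared with the `distance` value (in metres) that the station table reports for Medak.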

---

## 2️⃣ Fetching Daily Temperature Data (`tavg`)  
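- **Fetch window**: **April 1, 2021** → **Present** (the date the notebook is run)  
- **Training data**: the first 7, 30, or 365 rows of the cleaned data, starting **April 1, 2021**  
- **Testing period**: **April 1, 2022** → **Present**  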
- We retrieve **daily average temperature (`tavg`)** from the **selected 5 stations**.  
- The **Meteostat `Daily()` API** is used to fetch temperature records.

---

## 3️⃣ Constructing the Combined Dataset  
- **Temperature data from 5 stations** is combined into a **single DataFrame**.  
- A new column **`avg_5_stations`** is created by computing the **mean temperature** of the selected stations.  
- **Hyderabad's actual temperature (`Hyderabad_tavg`)** is retrieved separately from the **nearest station**.

---

## 4️⃣ Handling Missing Values  
- Some stations may have missing temperature data (`NaN`).  
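- **Imputation strategy**:
  - `avg_5_stations` is computed first, as the mean over the stations that reported a value on that day.  
  - **For missing station values**: Fill them with the corresponding **`avg_5_stations`** value for that day.  
  - This ensures **data completeness** without changing the daily average or introducing artificial fluctuations.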
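
The fill rule described above can be made concrete on a single row. This is a dependency-free sketch with hypothetical helper names (`row_mean_ignoring_nan`, `fill_with_row_mean`); the notebook itself uses `DataFrame.mean(axis=1)`, which skips `NaN` by default, followed by `fillna`.

```python
import math

NAN = float("nan")

def row_mean_ignoring_nan(values):
    """Mean of the non-NaN entries of one row."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else NAN

def fill_with_row_mean(values):
    """Replace NaN entries with the mean of the available entries in the row."""
    mean = row_mean_ignoring_nan(values)
    return [mean if math.isnan(v) else v for v in values], mean

# 2021-04-01 row of the combined table: station_2 has no value that day.
row = [28.7, 30.6, NAN, 30.5, 31.3]
filled, avg = fill_with_row_mean(row)
print(filled, avg)
```

Because the gap is filled with the row mean itself, the mean of the filled row equals the mean of the originally available values, so `avg_5_stations` is unaffected by the fill.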

---

## 5️⃣ Data Cleaning and Final Preparation  
- **Remove rows** where either `avg_5_stations` or `Hyderabad_tavg` is missing.  
- **Test dataset**:
  - Extracted for the period **April 2022 → Present**.  
  - **Downsampled** by keeping **every 10th row**, so consecutive test samples are **at least 10 days apart** (more where rows were dropped during cleaning).  
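
The effect of taking every 10th row can be checked on a small calendar. In the sketch below the two dropped dates are hypothetical, chosen only to illustrate how a removed row stretches the spacing; the notebook does not say which dates were actually removed. The slice `kept_days[::10]` plays the role of `test_data.iloc[::10]`.

```python
from datetime import date, timedelta

start = date(2024, 11, 26)
all_days = [start + timedelta(days=k) for k in range(40)]

# Hypothetical gap: pretend these two days were dropped during cleaning.
dropped = {date(2024, 12, 11), date(2024, 12, 12)}
kept_days = [d for d in all_days if d not in dropped]

sampled = kept_days[::10]  # every 10th remaining row
gaps = [(b - a).days for a, b in zip(sampled, sampled[1:])]

print([d.isoformat() for d in sampled])
print(gaps)
```

Slicing by row position guarantees a spacing of at least 10 calendar days on daily data, and the spacing grows wherever rows are missing.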

---

## 🔥 Key Takeaways  
✅ **Uses real-world weather data for model training.**  
✅ **Constructs a robust dataset by aggregating multiple stations.**  
✅ **Fills missing station values with the same-day station average.**  
✅ **Prepares a well-structured test set for performance evaluation.**  



In [None]:

# --------------------------------------------------------------------
# 3. Fetch daily temperature data using Meteostat
# --------------------------------------------------------------------
# Coordinates for Hyderabad, India
HYD_LAT, HYD_LON = 17.3850, 78.4867
# Set the training start date and the test period (the end date is the day the notebook is run)
train_start = datetime(2021, 4, 1)
test_start = datetime(2022, 4, 1)
test_end = datetime.today()

In [None]:
# (a) Find the 10 nearest weather stations
stations_nearby = Stations()
stations_nearby = stations_nearby.nearby(HYD_LAT, HYD_LON)
stations_df = stations_nearby.fetch(limit=10)
print("Top 10 nearby stations:\n", stations_df)

Top 10 nearby stations:
                             name country region    wmo  icao  latitude  \
id                                                                       
43128          Hyderabad Airport      IN     AP  43128  VOHY   17.4500   
VOHS0  Hyderabad / Anthredigudam      IN     AP   <NA>  VOHS   17.2406   
43083                      Medak      IN     AP  43083  <NA>   18.0500   
43168               Mahabubnagar      IN     AP  43168  <NA>   16.7500   
43133                   Nalgonda      IN     AP  43133  <NA>   17.0000   
43125              Bidar Airport      IN     KA  43125  VOBR   17.9167   
43087                 Hanamkonda      IN     AP  43087  <NA>   18.0167   
43177              Rentachintala      IN     AP  43177  <NA>   16.5500   
43081                  Nizamabad      IN     AP  43081  <NA>   18.6667   
43076                      Udgir      IN     KA  43076  <NA>   18.0667   

       longitude  elevation      timezone hourly_start hourly_end daily_start  \
id   

In [None]:
station_ids = stations_df.index  # Each row's index is the station ID

In [None]:
# Use 5 nearby stations as predictors. The first and second stations are located inside Hyderabad, so we skip them.

stations_df = stations_df.iloc[2:7]
main_station_ids = stations_df.index.tolist()
stations_df

Unnamed: 0_level_0,name,country,region,wmo,icao,latitude,longitude,elevation,timezone,hourly_start,hourly_end,daily_start,daily_end,monthly_start,monthly_end,distance
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
43083,Medak,IN,AP,43083,,18.05,78.2667,472.0,Asia/Kolkata,1987-09-05,2024-09-13,1933-01-01,1970-12-30,1933-01-01,1970-01-01,77529.411822
43168,Mahabubnagar,IN,AP,43168,,16.75,78.0,504.0,Asia/Kolkata,2009-02-05,2024-11-17,NaT,NaT,NaT,NaT,87533.370066
43133,Nalgonda,IN,AP,43133,,17.0,79.25,0.0,Asia/Kolkata,1977-10-19,2024-11-17,1901-01-01,1970-12-31,1901-01-01,1970-01-01,91689.97169
43125,Bidar Airport,IN,KA,43125,VOBR,17.9167,77.5333,663.0,Asia/Kolkata,1944-07-01,2024-11-16,1901-01-02,1970-12-31,1902-01-01,1970-01-01,117050.623523
43087,Hanamkonda,IN,AP,43087,,18.0167,79.5667,266.0,Asia/Kolkata,1944-01-01,2024-11-17,1901-01-01,1970-12-31,1901-01-01,1970-01-01,134246.887933


In [None]:
# Debugging print statements
print(f"train_start: {train_start}, test_end: {test_end}")

# Fetch daily temperature data (tavg) for these 5 stations
temp_data_list = []

for sid in main_station_ids:
    daily_data = Daily(sid, train_start, test_end)
    df_daily = daily_data.fetch()
    print(f"sid is {sid} and df_daily is {df_daily}")
    # We only need the 'tavg' (average daily temperature)
    temp_data_list.append(df_daily['tavg'])

temp_data_list

train_start: 2021-04-01 00:00:00, test_end: 2025-01-11 08:58:28.795586




sid is 43083 and df_daily is             tavg  tmin  tmax  prcp  snow   wdir  wspd  wpgt    pres  tsun
time                                                                     
2021-04-01  28.7  21.3  36.4   0.0   NaN  285.0   7.8   NaN  1004.9   NaN
2021-04-02  27.7  19.8  36.2   0.0   NaN  285.0   7.1   NaN  1005.6   NaN
2021-04-03  28.6  20.0  36.1   0.0   NaN  324.0   4.8   NaN  1006.0   NaN
2021-04-04  28.1  21.9  36.1   0.0   NaN  127.0   7.1   NaN  1007.8   NaN
2021-04-05  27.5  21.0  34.6   0.0   NaN   98.0   6.5   NaN  1008.9   NaN
...          ...   ...   ...   ...   ...    ...   ...   ...     ...   ...
2025-01-07  21.0  15.7  28.1   0.0   NaN   89.0   8.3   NaN  1014.0   NaN
2025-01-08  20.0  15.4  25.9   0.0   NaN   62.0   9.8   NaN  1014.5   NaN
2025-01-09  19.3  13.9  25.8   0.0   NaN   77.0  11.0   NaN  1015.5   NaN
2025-01-10  20.1  15.3  25.9   0.0   NaN  107.0  10.4   NaN  1016.9   NaN
2025-01-11  20.4  16.0  26.4   0.0   NaN  125.0  12.2   NaN  1016.3   NaN

[1370 ro



sid is 43133 and df_daily is             tavg  tmin  tmax  prcp  snow   wdir  wspd  wpgt    pres  tsun
time                                                                     
2021-07-06  31.5  26.3  37.2   0.8   NaN  282.0   5.4   NaN  1004.2   NaN
2021-07-07  30.7  28.0  37.1   7.9   NaN  265.0   5.5   NaN  1004.1   NaN
2021-07-08  28.4  26.9  34.0  11.0   NaN  289.0   6.8   NaN  1004.7   NaN
2021-07-09  29.0  25.8  33.8   6.7   NaN  271.0   6.4   NaN  1003.0   NaN
2021-07-10  29.4  26.3  33.8   7.7   NaN  266.0   7.8   NaN  1001.7   NaN
...          ...   ...   ...   ...   ...    ...   ...   ...     ...   ...
2025-01-07  24.5  18.8  31.0   0.0   NaN   59.0   7.1   NaN  1013.3   NaN
2025-01-08  23.8  17.7  30.2   0.0   NaN   16.0  10.8   NaN  1013.6   NaN
2025-01-09  23.9  18.1  29.9   0.0   NaN   72.0   9.2   NaN  1014.9   NaN
2025-01-10  24.3  18.4  30.4   0.0   NaN  104.0   9.2   NaN  1016.2   NaN
2025-01-11  24.1  18.1  30.0   0.0   NaN  122.0  10.5   NaN  1016.0   NaN

[1274 ro



sid is 43125 and df_daily is             tavg  tmin  tmax  prcp  snow   wdir  wspd  wpgt    pres  tsun
time                                                                     
2021-04-01  30.5  23.1  37.8   0.0   NaN  328.0  10.3   NaN  1005.2   NaN
2021-04-02  29.8  22.6  37.0   0.0   NaN  319.0   8.5   NaN  1005.8   NaN
2021-04-03  30.9  22.2  38.1   0.0   NaN  298.0   6.9   NaN  1006.2   NaN
2021-04-04  31.0  24.0  37.8   0.0   NaN    1.0  11.0   NaN  1008.1   NaN
2021-04-05  29.9  22.4  36.4   0.0   NaN   64.0   8.8   NaN  1009.3   NaN
...          ...   ...   ...   ...   ...    ...   ...   ...     ...   ...
2025-01-07  21.7  16.0  27.9   0.0   NaN   95.0   8.8   NaN  1014.0   NaN
2025-01-08  20.3  15.9  26.0   0.0   NaN   59.0  10.4   NaN  1014.7   NaN
2025-01-09  19.7  14.2  26.0   0.0   NaN   95.0  10.9   NaN  1015.5   NaN
2025-01-10  20.7  15.9  26.0   0.0   NaN  111.0  11.9   NaN  1016.6   NaN
2025-01-11  21.9  16.8  28.0   0.0   NaN  136.0  12.9   NaN  1015.7   NaN

[1370 ro



sid is 43087 and df_daily is             tavg  tmin  tmax  prcp  snow   wdir  wspd  wpgt    pres  tsun
time                                                                     
2021-04-01  31.3  23.0  39.4   0.0   NaN  311.0   8.9   NaN  1003.0   NaN
2021-04-02  31.0  23.6  39.1   0.0   NaN  305.0   7.2   NaN  1003.9   NaN
2021-04-03  31.2  23.6  38.9   0.0   NaN  141.0   6.9   NaN  1005.0   NaN
2021-04-04  31.3  25.3  37.9   0.0   NaN  170.0   7.2   NaN  1006.9   NaN
2021-04-05  30.7  24.2  37.7   0.0   NaN  151.0   7.5   NaN  1008.3   NaN
...          ...   ...   ...   ...   ...    ...   ...   ...     ...   ...
2025-01-07  22.1  15.8  28.4   0.0   NaN  292.0   7.5   NaN  1013.6   NaN
2025-01-08  20.5  15.9  26.4   0.0   NaN  351.0  10.2   NaN  1014.3   NaN
2025-01-09  20.5  13.9  26.8   0.0   NaN    2.0   6.3   NaN  1015.3   NaN
2025-01-10  21.6  16.0  27.3   0.0   NaN  146.0   6.8   NaN  1016.6   NaN
2025-01-11  21.7  16.6  27.1   0.0   NaN  165.0  10.1   NaN  1016.2   NaN

[1370 ro

[time
 2021-04-01    28.7
 2021-04-02    27.7
 2021-04-03    28.6
 2021-04-04    28.1
 2021-04-05    27.5
               ... 
 2025-01-07    21.0
 2025-01-08    20.0
 2025-01-09    19.3
 2025-01-10    20.1
 2025-01-11    20.4
 Name: tavg, Length: 1370, dtype: float64,
 time
 2021-04-01    30.6
 2021-04-02    29.8
 2021-04-03    30.7
 2021-04-04    30.0
 2021-04-05    29.4
               ... 
 2025-01-07    22.3
 2025-01-08    21.7
 2025-01-09    20.9
 2025-01-10    21.3
 2025-01-11    21.6
 Name: tavg, Length: 1370, dtype: float64,
 time
 2021-07-06    31.5
 2021-07-07    30.7
 2021-07-08    28.4
 2021-07-09    29.0
 2021-07-10    29.4
               ... 
 2025-01-07    24.5
 2025-01-08    23.8
 2025-01-09    23.9
 2025-01-10    24.3
 2025-01-11    24.1
 Name: tavg, Length: 1274, dtype: float64,
 time
 2021-04-01    30.5
 2021-04-02    29.8
 2021-04-03    30.9
 2021-04-04    31.0
 2021-04-05    29.9
               ... 
 2025-01-07    21.7
 2025-01-08    20.3
 2025-01-09    19.7
 2025-0

In [None]:
daily_data = Daily(43125, train_start, test_end)
df_daily = daily_data.fetch()
df_daily

Unnamed: 0_level_0,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2021-04-01,30.5,23.1,37.8,0.0,,328.0,10.3,,1005.2,
2021-04-02,29.8,22.6,37.0,0.0,,319.0,8.5,,1005.8,
2021-04-03,30.9,22.2,38.1,0.0,,298.0,6.9,,1006.2,
2021-04-04,31.0,24.0,37.8,0.0,,1.0,11.0,,1008.1,
2021-04-05,29.9,22.4,36.4,0.0,,64.0,8.8,,1009.3,
...,...,...,...,...,...,...,...,...,...,...
2025-01-07,21.7,16.0,27.9,0.0,,95.0,8.8,,1014.0,
2025-01-08,20.3,15.9,26.0,0.0,,59.0,10.4,,1014.7,
2025-01-09,19.7,14.2,26.0,0.0,,95.0,10.9,,1015.5,
2025-01-10,20.7,15.9,26.0,0.0,,111.0,11.9,,1016.6,


In [None]:
# Combine into one DataFrame
df_combined = pd.concat(temp_data_list, axis=1)
df_combined.columns = [f"station_{i}" for i in range(len(main_station_ids))]
df_combined

Unnamed: 0_level_0,station_0,station_1,station_2,station_3,station_4
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2021-04-01,28.7,30.6,,30.5,31.3
2021-04-02,27.7,29.8,,29.8,31.0
2021-04-03,28.6,30.7,,30.9,31.2
2021-04-04,28.1,30.0,,31.0,31.3
2021-04-05,27.5,29.4,,29.9,30.7
...,...,...,...,...,...
2025-01-07,21.0,22.3,24.5,21.7,22.1
2025-01-08,20.0,21.7,23.8,20.3,20.5
2025-01-09,19.3,20.9,23.9,19.7,20.5
2025-01-10,20.1,21.3,24.3,20.7,21.6


In [None]:
# Compute the mean of these 5 stations -> 'avg_5_stations'
df_combined['avg_5_stations'] = df_combined.mean(axis=1)
df_combined

Unnamed: 0_level_0,station_0,station_1,station_2,station_3,station_4,avg_5_stations
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2021-04-01,28.7,30.6,,30.5,31.3,30.275
2021-04-02,27.7,29.8,,29.8,31.0,29.575
2021-04-03,28.6,30.7,,30.9,31.2,30.350
2021-04-04,28.1,30.0,,31.0,31.3,30.100
2021-04-05,27.5,29.4,,29.9,30.7,29.375
...,...,...,...,...,...,...
2025-01-07,21.0,22.3,24.5,21.7,22.1,22.320
2025-01-08,20.0,21.7,23.8,20.3,20.5,21.260
2025-01-09,19.3,20.9,23.9,19.7,20.5,20.860
2025-01-10,20.1,21.3,24.3,20.7,21.6,21.600


In [None]:
# (c) Fetch daily temperature data specifically for Hyderabad
hyderabad_station_id = station_ids[0]
daily_data_hyd = Daily(hyderabad_station_id, train_start, test_end)
df_hyd = daily_data_hyd.fetch()
df_hyd



Unnamed: 0_level_0,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2021-04-01,31.2,22.2,39.9,,,324.0,4.9,,1003.0,
2021-04-02,29.8,23.1,39.1,,,323.0,5.2,,1003.0,
2021-04-03,32.0,24.0,39.3,,,353.0,3.1,,1004.2,
2021-04-04,30.3,24.0,40.1,,,32.0,3.5,,1006.2,
2021-04-05,30.0,25.4,38.2,,,83.0,5.9,,1007.5,
...,...,...,...,...,...,...,...,...,...,...
2025-01-07,22.3,14.0,29.0,0.0,,59.0,6.8,,1015.1,
2025-01-08,21.6,14.0,28.0,0.0,,,6.1,,1015.8,
2025-01-09,20.8,13.0,26.0,0.0,,87.0,6.6,,1016.8,
2025-01-10,21.9,17.1,27.0,0.0,,99.0,7.7,,1017.9,


In [None]:
# We'll use 'tavg' to represent Hyderabad's actual temperature
df_combined['Hyderabad_tavg'] = df_hyd['tavg']

In [None]:
df_combined

Unnamed: 0_level_0,station_0,station_1,station_2,station_3,station_4,avg_5_stations,Hyderabad_tavg
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2021-04-01,28.7,30.6,,30.5,31.3,30.275,31.2
2021-04-02,27.7,29.8,,29.8,31.0,29.575,29.8
2021-04-03,28.6,30.7,,30.9,31.2,30.350,32.0
2021-04-04,28.1,30.0,,31.0,31.3,30.100,30.3
2021-04-05,27.5,29.4,,29.9,30.7,29.375,30.0
...,...,...,...,...,...,...,...
2025-01-07,21.0,22.3,24.5,21.7,22.1,22.320,22.3
2025-01-08,20.0,21.7,23.8,20.3,20.5,21.260,21.6
2025-01-09,19.3,20.9,23.9,19.7,20.5,20.860,20.8
2025-01-10,20.1,21.3,24.3,20.7,21.6,21.600,21.9


In [None]:
# Fill every NaN in a station column with the avg_5_stations value for the same day

# Iterate over each column representing a station
for col in ['station_0', 'station_1', 'station_2', 'station_3', 'station_4']:
    # Fill NaN values in the current station column with the corresponding 'avg_5_stations' value
    df_combined[col] = df_combined[col].fillna(df_combined['avg_5_stations'])

# Display the first few rows of the updated dataframe to verify the changes
df_combined.head()


Unnamed: 0_level_0,station_0,station_1,station_2,station_3,station_4,avg_5_stations,Hyderabad_tavg
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2021-04-01,28.7,30.6,30.275,30.5,31.3,30.275,31.2
2021-04-02,27.7,29.8,29.575,29.8,31.0,29.575,29.8
2021-04-03,28.6,30.7,30.35,30.9,31.2,30.35,32.0
2021-04-04,28.1,30.0,30.1,31.0,31.3,30.1,30.3
2021-04-05,27.5,29.4,29.375,29.9,30.7,29.375,30.0


In [None]:
# (d) Remove rows where avg_5_stations or Hyderabad_tavg is missing
df_combined.dropna(subset=['avg_5_stations', 'Hyderabad_tavg'], inplace=True)
df_combined

Unnamed: 0_level_0,station_0,station_1,station_2,station_3,station_4,avg_5_stations,Hyderabad_tavg
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2021-04-01,28.7,30.6,30.275,30.5,31.3,30.275,31.2
2021-04-02,27.7,29.8,29.575,29.8,31.0,29.575,29.8
2021-04-03,28.6,30.7,30.350,30.9,31.2,30.350,32.0
2021-04-04,28.1,30.0,30.100,31.0,31.3,30.100,30.3
2021-04-05,27.5,29.4,29.375,29.9,30.7,29.375,30.0
...,...,...,...,...,...,...,...
2025-01-07,21.0,22.3,24.500,21.7,22.1,22.320,22.3
2025-01-08,20.0,21.7,23.800,20.3,20.5,21.260,21.6
2025-01-09,19.3,20.9,23.900,19.7,20.5,20.860,20.8
2025-01-10,20.1,21.3,24.300,20.7,21.6,21.600,21.9


In [None]:
# Test set
test_data = df_combined.loc[test_start:test_end]
test_data = test_data.iloc[::10]  # Keep every 10th row, so test samples are at least 10 days apart
test_data

Unnamed: 0_level_0,station_0,station_1,station_2,station_3,station_4,avg_5_stations,Hyderabad_tavg
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2022-04-01,29.2,30.4,32.6,30.8,31.1,30.82,30.4
2022-04-11,30.2,30.1,32.3,31.7,31.1,31.08,31.2
2022-04-21,31.2,30.9,33.4,32.5,32.0,32.00,31.1
2022-05-01,31.6,33.0,35.2,33.4,34.0,33.44,32.8
2022-05-11,29.2,26.7,28.1,30.7,29.2,28.78,27.6
...,...,...,...,...,...,...,...
2024-11-26,21.3,22.3,25.5,20.1,21.4,22.12,22.4
2024-12-06,24.4,24.5,26.4,23.4,25.0,24.74,25.5
2024-12-18,19.3,22.2,24.6,20.0,21.0,21.42,22.0
2024-12-28,22.5,23.1,24.9,22.7,23.3,23.30,23.5


## Training and Evaluating the Neural Network Model  

This section **trains and evaluates** the deep neural network on different training set sizes to predict **Hyderabad's temperature** using the **average temperature of 5 nearby stations**.

---

### 1️⃣ **Feature Scaling**
- **MinMaxScaler** is applied to **normalize** both `X` (input) and `y` (target) values.
- **Why?** Scaling improves the convergence of the neural network during training.
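
Min-max scaling maps each value `v` to `(v - min) / (max - min)`, where the minimum and maximum come from the training data only. The sketch below reproduces that arithmetic without scikit-learn; the helper names are illustrative, and it assumes the default `[0, 1]` target range of `MinMaxScaler`. The training values are the first five `avg_5_stations` entries shown in the combined table.

```python
def fit_min_max(values):
    """Return (min, max) of the training values; these define the scaling."""
    return min(values), max(values)

def transform(values, lo, hi):
    """Map values to (v - lo) / (hi - lo); training values land in [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def inverse_transform(scaled, lo, hi):
    """Undo the scaling to recover values in the original units."""
    return [s * (hi - lo) + lo for s in scaled]

# First five avg_5_stations values from the combined table (°C).
x_train = [30.275, 29.575, 30.35, 30.1, 29.375]
lo, hi = fit_min_max(x_train)
x_train_scaled = transform(x_train, lo, hi)

# A test-period value (2022-05-01) scaled with the *training* min and max.
x_test_scaled = transform([33.44], lo, hi)
print(x_train_scaled, x_test_scaled)
```

The test value scales to a number well above 1: with a very short training window, test inputs can fall outside the range seen during fitting, so the network has to extrapolate.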

---

### 2️⃣ **Training Data Sizes**
To evaluate the model's performance over different training periods, we experiment with:  
- **7 days** (~1 week of training data)  
- **30 days** (~1 month of training data)  
- **365 days** (~1 year of training data)  

For each training size:  
- **Training set (`X_train, y_train`)** is taken from the first `size` days.  
- **Test set (`X_test, y_test`)** remains the same (independent test set).  

---

### 3️⃣ **Neural Network Training**
- A **Deep Neural Network (`DeepNN`)** is initialized for each training size.  
- **Adam optimizer (`lr = 1e-4`)** is used for training.  
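- **Training runs for 500 epochs** per training size. In the logged 7-day run the loss is still decreasing at epoch 500, so this is a fixed budget rather than a convergence guarantee.  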

---

### 4️⃣ **Prediction and Error Calculation**
- The model **predicts temperatures** for the test set.  
- **Inverse transformation** is applied to convert predictions back to original temperature values.  
- The **Average Absolute Risk Error (AARE)** is computed over the $N$ test samples, in °C after the inverse transformation:
$$
  \text{AARE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_{\text{pred},i} - y_{\text{true},i} \right|
$$
- This metric evaluates how well the model generalizes to unseen test data.
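
A worked example of the error computation, as a minimal sketch. The function name `average_absolute_error` is illustrative, and the predictions are made-up numbers chosen only to show the arithmetic; the true values are three `Hyderabad_tavg` entries from the test table.

```python
def average_absolute_error(y_pred, y_true):
    """Mean of |prediction - truth| over all samples."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

y_true = [30.4, 31.2, 31.1]   # Hyderabad_tavg values from the test table (°C)
y_pred = [30.0, 31.5, 32.5]   # hypothetical predictions, for illustration only

aare = average_absolute_error(y_pred, y_true)
print(f"AARE = {aare:.2f} °C")
```

The absolute errors here are 0.4, 0.3, and 1.4 °C, which average to 0.7 °C. Since the error is computed after the inverse transformation, it is expressed in degrees Celsius rather than in scaled units.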

---

### 5️⃣ **Results Summary**
- The **final results** display the **Average Absolute Risk Error** for each training period.
- Helps analyze the **impact of training data size** on prediction accuracy.

---

## 🔥 Key Takeaways
✅ **Uses real-world temperature data to train a deep learning model.**  
✅ **Compares model performance across different training sizes.**  
✅ **Evaluates generalization error using a structured test set.**  
✅ **Applies feature scaling for better convergence.**  


In [None]:


# Initialize scalers for X and y
scaler_X = MinMaxScaler()
scaler_y = MinMaxScaler()

X_test_main = test_data[['avg_5_stations']].values
y_test_main = test_data['Hyderabad_tavg'].values




In [None]:
# Train and test for different training set sizes
train_sizes = [7, 30, 365]  # Approx. 1 week, 1 month, and 1 year of training data
avg_abs_risk_errors = []

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize scalers for X and y
scaler_X = MinMaxScaler()
scaler_y = MinMaxScaler()

for size in train_sizes:
    train_data = df_combined.iloc[:size]
    # The test set is fixed (X_test_main, y_test_main), so no per-size test slice is taken here

    X_train = train_data[['avg_5_stations']].values
    y_train = train_data['Hyderabad_tavg'].values  # 1D here; reshaped to 2D for the scaler below


    # Fit scalers on training data and transform both train and test
    X_train = scaler_X.fit_transform(X_train)
    y_train = scaler_y.fit_transform(y_train.reshape(-1, 1)).flatten()
    X_test = scaler_X.transform(X_test_main)  # Use the same scaler fitted on training data
    y_test = scaler_y.transform(y_test_main.reshape(-1, 1)).flatten()

    print(f"Normalized X_train is {X_train}")
    print(f"Normalized y_train is {y_train}")

    # Define and train model
    model = DeepNN()
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    trained_model = train_model(X_train, y_train, model, optimizer, num_epochs=500)

    # Predict and calculate average absolute risk error
    y_pred = predict(trained_model, X_test).flatten()

    # Inverse transform predictions and true values for error calculation
    y_pred_inverse = scaler_y.inverse_transform(y_pred.reshape(-1, 1)).flatten()
    y_test_inverse = scaler_y.inverse_transform(y_test.reshape(-1, 1)).flatten()

    print(f"y_pred_inverse is {y_pred_inverse}")
    print(f"y_test_inverse is {y_test_inverse}")

    avg_abs_risk_error = np.mean(np.abs(y_pred_inverse - y_test_inverse))
    avg_abs_risk_errors.append(avg_abs_risk_error)

    print(f"Train size: {size}, Average Absolute Risk Error: {avg_abs_risk_error:.4f}")

    # Release references so the model can be garbage-collected
    del model, optimizer, trained_model


# Summary of results
for size, error in zip(train_sizes, avg_abs_risk_errors):
    print(f"Train size: {size} days -> Average Absolute Risk Error: {error:.4f}")

Normalized X_train is [[0.92307692]
 [0.20512821]
 [1.        ]
 [0.74358974]
 [0.        ]
 [0.64102564]
 [0.41025641]]
Normalized y_train is [0.63636364 0.         1.         0.22727273 0.09090909 0.40909091
 0.40909091]
Epoch [100/500], Loss: 0.1271
Epoch [200/500], Loss: 0.1106
Epoch [300/500], Loss: 0.0964
Epoch [400/500], Loss: 0.0857
Epoch [500/500], Loss: 0.0646
y_pred_inverse is [32.869152 33.38571  35.15439  37.877995 30.223383 29.861118 35.15439
 30.065987 30.555414 30.465706 31.005722 30.750336 30.582073 30.58726
 30.536175 30.634766 30.870277 30.821196 30.753078 30.82932  30.842897
 31.102854 30.989513 31.23787  31.167677 31.229765 31.21896  30.807636
 31.483185 31.105556 31.003021 30.807636 30.785952 30.626572 30.454706
 31.03539  30.270033 30.354103 34.69971  30.487984 29.929043 35.646984
 35.267845 35.30566  36.626137 30.432434 30.558123 30.659525 30.938255
 30.44369  30.454706 30.547283 30.758564 30.544535 30.739363 30.566221
 30.457432 30.77776  30.618439 30.769535 30

# **Model Performance Analysis Based on Theoretical Findings**  

This experiment validates the **theoretical results** from the paper **"Uniform Convergence of Lipschitz Functions with Dependent Gaussian Samples"**, accepted at **ICASSP 2025**.  

## **1️⃣ Expected Error Decay with Increased Training Data**  
According to the paper's findings, the **expected generalization error** should **decrease** as the number of training samples increases, even in the presence of **dependent data**. Our results are consistent with this:  

| **Training Size** | **Average Absolute Risk Error (AARE)** |
|------------------|--------------------------------|
| **7 days**      | **4.6286°C** (**High Error**)  |
| **30 days**     | **1.4554°C** (**Significant Improvement**) |
| **365 days**    | **0.9075°C** (**Best Accuracy**) |

- **Small training sizes (7 days)** → Poor generalization, high error.  
- **Larger training sizes (365 days)** → Stronger convergence, lower error.  
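
One rough way to read the table is to look at the slope of the error on a log-log scale between consecutive training sizes. The sketch below does this for the three reported values. It is only a descriptive summary of a single run with three points, and it is not an estimate of the convergence rate derived in the paper; `loglog_slopes` is an illustrative helper.

```python
import math

# (training size in days, AARE in °C) taken from the table above
results = [(7, 4.6286), (30, 1.4554), (365, 0.9075)]

def loglog_slopes(points):
    """Slope of log(error) against log(n) between consecutive points."""
    slopes = []
    for (n1, e1), (n2, e2) in zip(points, points[1:]):
        slopes.append(math.log(e2 / e1) / math.log(n2 / n1))
    return slopes

slopes = loglog_slopes(results)
print([round(s, 2) for s in slopes])
```

Both slopes are negative, so the error falls as the training set grows, steeply between 7 and 30 days (about -0.8) and more slowly between 30 and 365 days (about -0.2). For reference, an error that decays like $n^{-1/2}$ would give a slope of -0.5.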

---

## **2️⃣ Impact of Temporal Dependence on Learning**  
- The paper establishes that learning under **dependent Gaussian samples** still achieves **uniform convergence**, though at a rate influenced by dependency levels (`γ`).  
- Since temperature data exhibits **strong temporal dependencies**, we observe **gradual improvement** in predictions as training size grows.  
- This is in line with the **empirical and theoretical results** of the paper and suggests that **deep learning models can generalize well even with correlated inputs**.

---

## **3️⃣ Practical Implications**  
🔹 **With short training windows (7-30 days of training data)** → Higher error and less reliable generalization.  
🔹 **With a long training window (365 days of training data)** → Lowest error and improved model stability, in line with the uniform convergence predictions.  
🔹 **Real-world takeaway** → Dependent time-series data **does not prevent generalization**, but **larger training sets are essential** for accuracy.  

---

### **Conclusion**  
✅ **Experimental results support the paper’s theoretical uniform convergence bounds.**  
✅ **Learning under dependent data is feasible, with sufficient training samples.**  
✅ **Deep neural networks can effectively model temperature dependencies over time.**  
✅ **Findings reinforce the applicability of dependent learning in real-world forecasting tasks.**  
