## Information about the data normalization/standardization

This notebook implements standardization using `StandardScaler` and `MinMaxScaler` from scikit-learn. In time series analysis, the approach to standardization can be slightly different.

For example, if we are trying to see how a certain variable changes over seasons in a year or over different months, we would typically standardize the data on a yearly basis. We would normalize the data for each year, using its mean and variance, and then move on to the next year. Similarly, if we want to observe how a variable changes across different days of the week, we would apply weekly standardization.

Usually, when performing time series prediction with a certain lookback window, it’s a good idea to normalize the data across that lookback window only, on a rolling basis. In this case, you would normalize the input, save the mean and standard deviation, make predictions, then denormalize the predictions with the saved mean and standard deviation, and finally calculate the loss.

However, in our case, we will normalize the data globally to make things simpler and easier. Most people take this approach because it’s straightforward.


`StandardScaler` normalizes data column wise. It calculates mean($\mu$)  and standard deviation (s) for each column and normalize each column separately by computing (x - $\mu$)/s.

`MinMaxScaler` does the same, except it computes (x - x_min)/(x_max - x_min) and thus maps the data between 0 and 1.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import warnings

#ignore warnings for the notebook

warnings.filterwarnings('ignore')

def standardize_river_forecast_data(df, variables, scaler, method='global'):
    """
    Standardize multiple variables.
    It normalizes data i.e. each column in the dataframe to have mean 0 and variance 1.

    Parameters:
    df (pandas.DataFrame): The input dataframe with a datetime index
    variables (list): List of column names to standardize
    method (str): 'global', 'yearly', or 'monthly'

    Returns:
    pandas.DataFrame: The dataframe with standardized columns
    """
    standardized_df = df.copy()
    df.reset_index(inplace=True)
    df['DATE'] = pd.to_datetime(df['DATE'])
    df.set_index('DATE', inplace=True)

    if method == 'global':
        standardized_df[variables] = scaler.fit_transform(df[variables])

    elif method == 'yearly':
        for year in df.index.year.unique():
            year_data = df[df.index.year == year]
            standardized_df.loc[year_data.index, variables] = scaler.fit_transform(year_data[variables])

    elif method == 'monthly':
        for month in range(1, 13):
            month_data = df[df.index.month == month]
            standardized_df.loc[month_data.index, variables] = scaler.fit_transform(month_data[variables])

    else:
        raise ValueError("Method must be 'global', 'yearly', or 'monthly'")

    return standardized_df

In [2]:
df = pd.read_csv('final_data.csv')

In [3]:
variables_to_standardize = list(df.columns)

if 'DATE' in variables_to_standardize:
    variables_to_standardize.remove('DATE')

scaler1 = StandardScaler()
scaler2 = MinMaxScaler()

global_std_df = standardize_river_forecast_data(df, variables_to_standardize, scaler1, method='global')
yearly_std_df = standardize_river_forecast_data(df, variables_to_standardize, scaler1, method='yearly')
monthly_std_df = standardize_river_forecast_data(df, variables_to_standardize, scaler1, method='monthly')

# The standardized dataframes can now be used for further analysis and modeling

In [4]:
global_std_df.head()

Unnamed: 0,DATE,Precip,WetBulbTemp,DryBulbTemp,RelHumidity,WindSpeed,StationPressure,gauge_height
0,2008-01-01 01:00:00,-0.103183,-1.33112,-1.502346,0.919314,-1.078557,0.108053,-0.311826
1,2008-01-01 02:00:00,-0.103183,-1.26186,-1.440811,0.919314,0.065279,0.043514,-0.346246
2,2008-01-01 03:00:00,-0.103183,-1.26186,-1.379277,0.77616,0.065279,0.043514,-0.37479
3,2008-01-01 04:00:00,-0.103183,-1.26186,-1.379277,0.77616,0.294046,0.108053,-0.406692
4,2008-01-01 05:00:00,-0.103183,-1.05408,-1.071605,-0.082763,1.437883,0.237133,-0.455384


In [5]:
# Let's check mean and variance after normalization for each column

for col in global_std_df.columns:
    if col != 'DATE':
        print(f'{col:15}   : Mean : {global_std_df[col].mean():.2f}  Variance: {global_std_df[col].var():.2f}')

Precip            : Mean : -0.00  Variance: 1.00
WetBulbTemp       : Mean : -0.00  Variance: 1.00
DryBulbTemp       : Mean : 0.00  Variance: 1.00
RelHumidity       : Mean : -0.00  Variance: 1.00
WindSpeed         : Mean : -0.00  Variance: 1.00
StationPressure   : Mean : 0.00  Variance: 1.00
gauge_height      : Mean : -0.00  Variance: 1.00


In [6]:
global_std_df = standardize_river_forecast_data(df, variables_to_standardize, scaler2, method='global')
yearly_std_df = standardize_river_forecast_data(df, variables_to_standardize, scaler2, method='yearly')
monthly_std_df = standardize_river_forecast_data(df, variables_to_standardize, scaler2, method='monthly')


In [7]:
global_std_df.head()

Unnamed: 0_level_0,index,Precip,WetBulbTemp,DryBulbTemp,RelHumidity,WindSpeed,StationPressure,gauge_height
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2008-01-01 01:00:00,0,0.0,0.428571,0.329897,0.877778,0.0,0.589474,0.11375
2008-01-01 02:00:00,1,0.0,0.441558,0.340206,0.877778,0.169492,0.583158,0.109659
2008-01-01 03:00:00,2,0.0,0.441558,0.350515,0.844444,0.169492,0.583158,0.106266
2008-01-01 04:00:00,3,0.0,0.441558,0.350515,0.844444,0.20339,0.589474,0.102475
2008-01-01 05:00:00,4,0.0,0.480519,0.402062,0.644444,0.372881,0.602105,0.096687


In [8]:
# Let's check mean and variance after normalization for each column

for col in global_std_df.columns:
    if col != 'DATE':
        print(f'{col:15}   : Min : {global_std_df[col].min():.2f}  Max: {global_std_df[col].max():.2f}')

index             : Min : 0.00  Max: 143810.00
Precip            : Min : 0.00  Max: 1.00
WetBulbTemp       : Min : 0.00  Max: 1.00
DryBulbTemp       : Min : 0.00  Max: 1.00
RelHumidity       : Min : 0.00  Max: 1.00
WindSpeed         : Min : 0.00  Max: 1.00
StationPressure   : Min : 0.00  Max: 1.00
gauge_height      : Min : 0.00  Max: 1.00


**Note: In neural networks we normally use MinMaxScaler.**