# Weather Forecasting algorithms
### (Or how to use Linear Regression)

In [6]:
import pandas as pd
import numpy as np
from sklearn import linear_model as skl
from regression import Regressors as reg

AttributeError: module 'scipy.sparse' has no attribute 'linalg'

# Creating the datasets

In [None]:
# Creating a dataframe from the CSV file
data = pd.read_csv("weather_data.csv")
data

*Let's represent data as an array of floats*

*As the data is ordered by date, the Date column is not important, so we get rid of it*

In [None]:
data_array = np.array(data.drop(columns="Date").values)
data_array

*We'll also need each column as a separate array*

In [None]:
max_temp = np.array(data["Maximum Temperature (°C)"])
min_temp = np.array(data["Minimum Temperature (°C)"])
mean_temp = np.array(data["Mean Temperature (°C)"])

sunshine = np.array(data["Sunshine Duration (min)"])
radiation = np.array(data["Shortwave Radiation (MJ/m²)"])
precipitation = np.array(data["Precipitation (mm)"])

max_humidity = np.array(data["Maximum Relative Humidity (%)"])
min_humidity = np.array(data["Minimum Relative Humidity (%)"])
mean_humidity = np.array(data["Mean Relative Humidity (%)"])

max_pressure = np.array(data["Maximum Sea Level Pressure (hPa)"])
min_pressure = np.array(data["Minimum Sea Level Pressure (hPa)"])
mean_pressure = np.array(data["Mean Sea Level Pressure (hPa)"])

max_wind_speed = np.array(data["Maximum Wind Speed (m/s)"])
min_wind_speed = np.array(data["Minimum Wind Speed (m/s)"])
mean_wind_speed = np.array(data["Mean Wind Speed (m/s)"])
wind_direction = np.array(data["Wind Direction Dominant (°)"])

# Necessary functions to analyse the data

In [None]:
# Min-max normalise data to the range [0, 1] / map it back to the original scale

def to_standard(dataset):
    return (dataset - min(dataset)) / (max(dataset) - min(dataset))

def to_source(standard, source):
    return standard * (max(source) - min(source)) + min(source)
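
A minimal round-trip sketch of the two functions above, restated with NumPy's `min`/`max` methods. The sample values are made up for illustration only.

```python
import numpy as np

def to_standard(dataset):
    # Min-max scaling: map the values linearly onto [0, 1]
    return (dataset - dataset.min()) / (dataset.max() - dataset.min())

def to_source(standard, source):
    # Inverse mapping: needs the original data to recover its range
    return standard * (source.max() - source.min()) + source.min()

temps = np.array([12.0, 15.0, 9.0, 21.0])  # made-up sample values
scaled = to_standard(temps)                # smallest value -> 0, largest -> 1
restored = to_source(scaled, temps)        # back on the original scale

print(scaled)
print(restored)
```

Note that `to_source` needs the original array, because the scaled values alone do not carry the minimum and maximum.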

In [None]:
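# This function returns an array of the sequential sub-arrays (windows) of _n elements from the array.
# The final window is left out, so that window i lines up with its target value array[i + _n]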

def group(array, _n):
    return np.array([array[i:i + _n] for i in range(len(array) - _n)])
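
A small sketch of how the windows produced by `group` line up with their targets. The series `0..5` is only a stand-in for a daily column.

```python
import numpy as np

def group(array, n):
    # Every window of n consecutive values except the last one,
    # which has no following value to serve as a target
    return np.array([array[i:i + n] for i in range(len(array) - n)])

series = np.arange(6)        # stand-in for a daily series: 0, 1, ..., 5
windows = group(series, 3)   # features: 3 previous days
targets = series[3:]         # target: the day right after each window

for w, t in zip(windows, targets):
    print(w, "->", t)
```

Each row of `windows` holds three consecutive days, and the matching entry of `targets` is the day that follows them, which is why the datasets later use `data[n:]` as `y`.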

In [None]:
# This function splits dataset into train and test sets

def split(dataset, point=0.8):
    pivot = int(len(dataset) * point)
    return dataset[:pivot], dataset[pivot:]
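
A quick illustration of `split` on a made-up list of day indices. The split is chronological (no shuffling), so the test set always contains the most recent days.

```python
def split(dataset, point=0.8):
    # The first `point` share of the samples is used for training, the rest for testing
    pivot = int(len(dataset) * point)
    return dataset[:pivot], dataset[pivot:]

days = list(range(10))       # made-up day indices 0..9
train, test = split(days)

print(train)
print(test)
```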

In [None]:
# Error functions

def MAE(real, predicted):
    # Mean absolute error
    difference = abs(real - predicted)
    return sum(difference) / len(difference)

def MSE(real, predicted):
    # Note: this divides by 2 * len, so it returns half of the usual mean squared error
    difference = abs(real - predicted)
    return sum(difference ** 2) / (2 * len(difference))
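
Written out, the two functions above compute the following, where $m$ is the number of samples, $y_i$ the real value and $\hat{y}_i$ the predicted value (the symbols are introduced here for illustration):

```latex
\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right|
\qquad
\mathrm{MSE} = \frac{1}{2m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2
```

The factor $\frac{1}{2}$ follows the code as written, so the `MSE` values in this notebook are half of what the more common definition without that factor would give.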

## What we are going to do:
We are going to use 3 methods of training and predicting data:
- For each column in the dataframe, predict the next day's value from the previous day's value of the same column
- For each column in the dataframe, predict the next day's value from the values of the n previous days of the same column
- For each column in the dataframe, predict the next day's value from the previous day's values of all columns in the dataframe

For each method we use both the sklearn and regression libraries.
The results are compared with each other and with the corresponding zero theory.
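
To make the difference between the three methods concrete, here is a rough sketch of the feature matrix `X` and target `y` each one would use. The table values and the window length `n = 3` are made up for illustration.

```python
import numpy as np

# Made-up table: 6 days (rows) x 2 weather parameters (columns)
table = np.array([
    [10.0, 1000.0],
    [11.0, 1002.0],
    [12.0, 1001.0],
    [13.0, 1003.0],
    [14.0, 1004.0],
    [15.0, 1005.0],
])
column = table[:, 0]   # the parameter to predict
n = 3                  # window length for method 2 (arbitrary here)

# Method 1: yesterday's value of the same column
X1, y1 = column[:-1].reshape(-1, 1), column[1:]

# Method 2: the previous n values of the same column
X2 = np.array([column[i:i + n] for i in range(len(column) - n)])
y2 = column[n:]

# Method 3: yesterday's values of every column
X3, y3 = table[:-1], column[1:]

print(X1.shape, X2.shape, X3.shape)
```

The number of feature columns grows from one (method 1) to `n` (method 2) or to the number of weather parameters (method 3), while the target is always the next day's value of a single column.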

# Method 1:
In this method we are going to predict a particular weather parameter (e.g. temperature) for "tomorrow",
having only information about this parameter "today".

We take mean temperature as an example and create the datasets.

In [None]:
data_train, data_test = split(mean_temp)

X_train = data_train[:-1]
y_train = data_train[1:]

X_test = data_test[:-1]
y_test = data_test[1:]

*Scikit-learn model*

In [None]:
skl_model = skl.LinearRegression()
skl_model.fit(X_train.reshape(-1, 1), y_train)
# Show the R² value (coefficient of determination)
skl_model.score(X_train.reshape(-1, 1), y_train)

*Prediction*

In [None]:
prediction = skl_model.predict(X_test.reshape(-1, 1))
prediction

*Reality*

In [None]:
y_test

*Error values*

In [None]:
print("Mean Absolute Error:", MAE(y_test, prediction))
print("Mean Squared Error:", MSE(y_test, prediction))

*Let's do the same operation with my own regression model*

In [None]:
reg_model = reg.LinearRegressor()
reg_model.fit(X_train.reshape(-1, 1), y_train)
# Show the R value
reg_model.score(X_train.reshape(-1, 1), y_train)

*Prediction*

In [None]:
prediction = reg_model.predict(X_test.reshape(-1, 1))
prediction

*Error values*

In [None]:
print("Mean Absolute Error:", MAE(y_test, prediction))
print("Mean Squared Error:", MSE(y_test, prediction))

## Results
As we can see, the results of the Scikit-learn model and the regression library model are identical.
For the first method we got:
- Mean Absolute Error: 1.6468385480344996
- Mean Squared Error: 2.3136632848913203

***But how much sense do these results make?***

## Zero theory
*A zero theory is an assumption that lets us predict the data without Machine Learning.
We compare its results with the ML estimator's to see whether using ML is justified.*

In this case, we can assume that the weather doesn't change very much from day to day.
So, our *zero theory* will be that tomorrow's temperature will be (approximately) the same as today's.

In [None]:
# Estimator function of the first zero function
# (we might have different zero functions for other methods)
def zero_func1(X):
    return X

*Prediction*

In [None]:
prediction = zero_func1(X_test)
prediction

*Error values*

In [1]:
print("Mean Absolute Error:", MAE(y_test, prediction))
print("Mean Squared Error:", MSE(y_test, prediction))

NameError: name 'MAE' is not defined

# Conclusion
As we can see, the errors of the linear models are approximately the same as the errors of the zero function.
That means there is little benefit in using the linear models here, as we can make predictions of the same quality from a simple assumption.
We obtained this result because a single previous value gives the linear model too little information to build a better estimator.
The next methods are expected to be more effective.
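
The comparison above can be reproduced in miniature on synthetic data. The sketch below is not the notebook's dataset: it generates a slowly drifting series, fits a least-squares line on (today, tomorrow) pairs with `np.polyfit` as a stand-in for the linear models, and compares it with the zero theory.

```python
import numpy as np

# Synthetic daily series that drifts slowly around 15 (not the notebook's data)
rng = np.random.default_rng(0)
series = np.empty(1000)
series[0] = 15.0
for t in range(1, len(series)):
    series[t] = 15.0 + 0.95 * (series[t - 1] - 15.0) + rng.normal(0.0, 1.0)

pivot = int(len(series) * 0.8)
train, test = series[:pivot], series[pivot:]

# Least-squares line y = slope * x + intercept, fitted on (today, tomorrow) pairs
slope, intercept = np.polyfit(train[:-1], train[1:], 1)

X_test, y_test = test[:-1], test[1:]
mae_model = np.mean(np.abs(y_test - (slope * X_test + intercept)))
mae_zero = np.mean(np.abs(y_test - X_test))   # zero theory: tomorrow == today

print("slope:", slope)
print("MAE (linear model):", mae_model)
print("MAE (zero theory): ", mae_zero)
```

When a series changes slowly from day to day, the fitted slope tends to be close to 1, so the fitted line is close to the identity function, which is what the zero theory already is. That would explain why the two errors end up so similar.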

# Method 2:
In this method we are going to predict a particular weather parameter (e.g. pressure) for "tomorrow",
having information about this parameter for the past *n* days.

In [None]:
# First, let's pick an arbitrary n
test_n = 5

In [None]:
# Now we'll create new datasets with mean pressure as example
data_train, data_test = split(mean_pressure)

test_X_train = group(data_train, test_n)
test_y_train = data_train[test_n:]

test_X_test = group(data_test, test_n)
test_y_test = data_test[test_n:]

In [None]:
# Scikit-learn model
skl_model = skl.LinearRegression()
skl_model.fit(test_X_train, test_y_train)
# Show the R² value (coefficient of determination)
skl_model.score(test_X_train, test_y_train)

*Prediction*

In [None]:
prediction = skl_model.predict(test_X_test)
prediction

*Reality*

In [None]:
test_y_test

*Error values*

In [None]:
print("Mean Absolute Error:", MAE(test_y_test, prediction))
print("Mean Squared Error:", MSE(test_y_test, prediction))

*Now let's test the MultivariateRegressor from the regression module*

In [None]:
reg_model = reg.MultivariateRegressor(test_n)
reg_model.fit(test_X_train, test_y_train)
# Show the R value
reg_model.score(test_X_train, test_y_train)

*Prediction*

In [None]:
prediction = reg_model.predict(test_X_test)
prediction

*Error values*

In [None]:
print("Mean Absolute Error:", MAE(test_y_test, prediction))
print("Mean Squared Error:", MSE(test_y_test, prediction))

***Note:*** *Don't be misled by the raw error values.
They are not really 'larger' than the ones in method 1.
Remember that in method 1 we were predicting temperature (in degrees Celsius),
whereas now we are predicting pressure (in hectopascals), so the two cannot be compared directly.
To see how effective the model is, we will compare it to the zero theory, but before that...*
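
One possible way to put errors in different units on a common footing is to divide the model's MAE by the MAE of the zero theory on the same data. The helper `relative_mae` below is hypothetical (it is not part of the notebook or of either library), and all values are made up to illustrate the difference in scale.

```python
import numpy as np

def mae(real, predicted):
    return np.mean(np.abs(real - predicted))

def relative_mae(real, predicted, baseline):
    # Hypothetical helper: values below 1 mean the model beats the baseline
    return mae(real, predicted) / mae(real, baseline)

# Made-up values, chosen only to illustrate the difference in scale
temp_real = np.array([14.0, 15.0, 13.0, 16.0])
temp_pred = np.array([14.5, 14.0, 14.0, 15.0])
temp_base = np.array([13.0, 14.0, 15.0, 13.0])   # "same as yesterday"

press_real = np.array([1012.0, 1008.0, 1015.0, 1010.0])
press_pred = np.array([1010.0, 1010.0, 1012.0, 1012.0])
press_base = np.array([1016.0, 1012.0, 1008.0, 1015.0])

print(mae(temp_real, temp_pred), mae(press_real, press_pred))
print(relative_mae(temp_real, temp_pred, temp_base),
      relative_mae(press_real, press_pred, press_base))
```

In this made-up example the pressure MAE is larger in absolute terms, yet both ratios are of similar size, because each is measured against a baseline in its own unit.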

# Optimising the value of n
*If you change the value of n and run the code again, you can see that the prediction changes too.
That means the accuracy of the estimator depends on n.
Since we want the estimator with the most accurate predictions possible, we need to optimise the value of n.*

In this example, we'll minimise the MAE function.

In [None]:
# First, let's create a function that maps n to the MAE:
def n_to_MAE(n):
    X_train = group(data_train, n)
    y_train = data_train[n:]

    X_test = group(data_test, n)
    y_test = data_test[n:]

    model = skl.LinearRegression()
    model.fit(X_train, y_train)

    prediction = model.predict(X_test)

    return MAE(y_test, prediction)

Although there are many different optimisation methods,
in this particular case the simplest approach, a plain enumeration of candidate values, is sufficient.
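
A self-contained sketch of such an enumeration, under stated assumptions: the series is synthetic (not the notebook's pressure data), `np.linalg.lstsq` stands in for the linear models, and `mae_for_n` and the candidate range are hypothetical choices made for this example.

```python
import numpy as np

def group(array, n):
    return np.array([array[i:i + n] for i in range(len(array) - n)])

def mae_for_n(train, test, n):
    # Ordinary least squares with an intercept, fitted on windows of length n
    X_train = np.column_stack([group(train, n), np.ones(len(train) - n)])
    coef, *_ = np.linalg.lstsq(X_train, train[n:], rcond=None)
    X_test = np.column_stack([group(test, n), np.ones(len(test) - n)])
    return np.mean(np.abs(test[n:] - X_test @ coef))

# Synthetic stand-in for a daily series: a 30-day cycle plus noise
rng = np.random.default_rng(0)
t = np.arange(600)
series = 1013.0 + 5.0 * np.sin(2 * np.pi * t / 30) + rng.normal(0.0, 1.0, len(t))
train, test = series[:480], series[480:]

# Try every candidate n and keep the one with the smallest test MAE
candidates = range(1, 31)
errors = {n: mae_for_n(train, test, n) for n in candidates}
best_n = min(errors, key=errors.get)

print("best n:", best_n, "MAE:", errors[best_n])
```

Enumeration is affordable here because each fit is cheap and n is a small integer. One caveat of this sketch is that it selects n on the test set, so the reported error for the chosen n is likely to be somewhat optimistic.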

In [None]:
# using range up to 400, because 1 year is 365 days, which can be rounded up to 400
n_to_MAE(4)