# Problem overview

The dataset contains daily weather observations recorded at weather stations around the world.  
The observations span 1940 to 1945.  
Recorded variables include precipitation, snowfall, temperature, and wind.

### Objective
In this report, we tackle a regression task: predicting the mean temperature for a given day.  
The work is divided into two parts:
* A naive prediction to set the baseline scores.
* Sliding window feature extraction and Regression Models.

### Importing useful libraries and models

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# coding utilities
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams["figure.figsize"] = (8, 6)
%config IPCompleter.use_jedi = False

import folium

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

### Importing the Data

In [None]:
in_file = '../input/weatherww2/Summary of Weather.csv'
loc_file = '../input/weatherww2/Weather Station Locations.csv'

df_loc = pd.read_csv(loc_file)
df_loc.head()

In the first file we have information about different stations all around the globe.  
Let's look at the map of all available stations.

In [None]:
# Create the map
mp = folium.Map(width=800,height=500,tiles='CartoDB positron', zoom_start= 5)

# Add points to the map
for idx, row in df_loc.iterrows():
    folium.Marker(location = [row['Latitude'], row['Longitude']]).add_to(mp)
    
# Create title
title_html = '''
             <h3 align="left" style="font-size:28px"><b>{}</b></h3>
             '''.format("Overview of Stations")   
# Add title
mp.get_root().html.add_child(folium.Element(title_html))

mp


In [None]:
df_tot = pd.read_csv(in_file, parse_dates = [1], low_memory=False)
df_tot.head()

In the other file, we have weather information for each station.  
The station ID (`STA` in this file, `WBAN` in the locations file) is the common attribute that can be used to merge the two datasets.  
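
The merge itself is not needed for the rest of this report, but it can be sketched as follows. The two frames below are tiny stand-ins with made-up values (only the coordinates of station 22508 come from this notebook), and the sketch assumes that `STA` and `WBAN` hold the same identifiers.

```python
import pandas as pd

# Toy stand-ins for df_tot and df_loc (values are illustrative)
weather = pd.DataFrame({
    "STA": [22508, 22508, 10001],
    "MeanTemp": [23.9, 24.4, 10.0],
})
stations = pd.DataFrame({
    "WBAN": [22508, 10001],
    "Latitude": [21.483333, 50.0],
    "Longitude": [-158.05, 5.0],
})

# Left join: keep every weather record and attach its station's coordinates
merged = weather.merge(stations, left_on="STA", right_on="WBAN", how="left")
print(merged)
```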

In [None]:
df_tot[["STA", "MeanTemp"]].describe()

In [None]:
df_tot[["STA", "MeanTemp"]].isna().value_counts()

We will perform our analysis on a single station.  
Luckily, the data at hand is clean, so we can choose any station ID for this analysis.  
Had there been NaN values, we would have selected a station with fewer missing values.  
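
If missing values were present, the selection could look roughly like the sketch below. It runs on a toy frame with made-up values; `missing_per_station` and `best_station` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data: station 1 has one missing MeanTemp, station 2 has none
toy = pd.DataFrame({
    "STA": [1, 1, 1, 2, 2, 2],
    "MeanTemp": [20.0, np.nan, 22.0, 15.0, 16.0, 17.0],
})

# Count missing MeanTemp values per station, fewest first
missing_per_station = toy["MeanTemp"].isna().groupby(toy["STA"]).sum().sort_values()
best_station = missing_per_station.index[0]
print(missing_per_station)
print("station with fewest missing values:", best_station)
```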

We choose the station with ID 22508 for this analysis.  

In [None]:
mask = df_loc.WBAN == 22508
df_loc[mask]

In [None]:
# Coordinates of station 22508 (latitude, longitude)
loc = [21.483333, -158.05]

# Map centred on the station, with a single marker
station = folium.Map(location=loc, width=800, height=500, tiles='CartoDB positron', zoom_start=10)
folium.Marker(location=loc).add_to(station)

title_html = '''
             <h3 align="left" style="font-size:28px"><b>{}</b></h3>
             '''.format("Station at Honolulu")   

station.get_root().html.add_child(folium.Element(title_html))
station

In [None]:
# Create a new dataframe only with relevant information
int_cols = ['Date','MeanTemp']
mask_22508 = df_tot["STA"] == 22508
df = df_tot.loc[mask_22508, int_cols]

df.head()

In [None]:
df.dtypes

In [None]:
plt.figure(figsize = (20,5))
plt.plot(df.Date, df.MeanTemp, linewidth = 0.8);

title = 'Temperature cycle'
x_label = 'time'
y_label = 'Temperature (°C)'

plt.suptitle(title, fontsize = 'xx-large')
plt.xlabel(x_label)
plt.ylabel(y_label)
plt.show()

As expected, the series shows a long-term wave-like pattern with a cycle of one year.

We split our dataset into training and testing portions.  
Since it is a time series, we cannot split it with conventional random sampling: the test data must come after the training data.  
Although sklearn provides modules to split time-series data, we will use a simpler but effective method.  
We will train our models on data from 1940-1944 and make predictions for 1945.
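
For reference, the scikit-learn utility alluded to above is `TimeSeriesSplit`. The sketch below runs it on a toy array (not the weather data) to show that every training fold ends before its test fold begins; the array and `n_splits=3` are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 8 ordered observations (illustrative, not the weather data)
values = np.arange(8)

# Each split trains on an expanding prefix and tests on the block right after it
tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(values))

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

The fixed 1940-1944 / 1945 split used in this report corresponds to a single such fold.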

# 1. Naive Solution

We can use several naive solutions to set a baseline score.  
One possible approach is to average the training temperatures for each calendar day (month and day) and use that average as the prediction for the same day of the test year.

In [None]:
# Splitting the data set
mask = df.Date >= '1945-01-01'
test_df = df[mask]
train_df = df[~mask]

print(test_df.head())
print(train_df.head())

In [None]:
# Aggregate on Month and date to find the mean
mean_MeanTemp = train_df.groupby([train_df.Date.dt.month, train_df.Date.dt.day])["MeanTemp"].mean()

# Create a new date column without 29 Feb
d = [(f'1945-{a[0]}-{a[1]}') for a in mean_MeanTemp.index if a[0] != 2 or a[1] != 29]
date = pd.to_datetime(d)

# Attach mean values with related date
mean_MeanTemp.index.names = ["Month", "Day"]
mean_MeanTemp = mean_MeanTemp.reset_index()
mask = (mean_MeanTemp.Month == 2) & (mean_MeanTemp.Day == 29)
mean_MeanTemp = mean_MeanTemp[~mask]
mean_MeanTemp["Date"] = date

# Remove columns created by group by
mean_MeanTemp.drop(["Month", "Day"], axis = 1, inplace = True)
# Standardize
mean_MeanTemp.columns = ["avg_MeanTemp", "Date"]
mean_MeanTemp = mean_MeanTemp[["Date", "avg_MeanTemp"]]
mean_MeanTemp

In [None]:
def model_evalutation(y_true, y_pred):
    """ This function produces a short report for regression results
    """
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    rep = f"""
    R2 score: {r2:.2f}
    MSE:      {mse:.2f}
    RMSE :    {rmse:.2f}
    """
    print(rep)
    
    return
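
As a sanity check on what the report contains, the same three metrics can be computed by hand with NumPy. This is a minimal sketch on made-up numbers; `y_true_toy` and `y_pred_toy` are hypothetical arrays, not results from the dataset.

```python
import numpy as np

# Made-up values for illustration only
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.5, 4.0])

residuals = y_true_toy - y_pred_toy
ss_res = np.sum(residuals ** 2)                           # residual sum of squares
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)    # total sum of squares

mse = ss_res / len(y_true_toy)   # mean squared error
rmse = np.sqrt(mse)              # same units as the target
r2 = 1 - ss_res / ss_tot         # share of variance explained

print(f"R2: {r2:.2f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}")
```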

In [None]:
# Truth values
y_true = test_df.MeanTemp.values
# Naive predictions
y_naive = mean_MeanTemp.avg_MeanTemp.values

model_evalutation(y_true, y_naive)
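
The evaluation above pairs `y_true` and `y_naive` by position, which assumes the 1945 data has exactly one row per calendar day, in order. If that assumption were in doubt, the averages could be attached to the test dates by month and day instead. The sketch below shows the idea on toy frames with made-up values; `climatology` and `naive` are hypothetical names.

```python
import pandas as pd

# Toy training data: two years, two calendar days (values are made up)
train = pd.DataFrame({
    "Date": pd.to_datetime(["1943-01-01", "1944-01-01", "1943-01-02", "1944-01-02"]),
    "MeanTemp": [20.0, 22.0, 21.0, 25.0],
})
# Toy test dates, deliberately out of order
test = pd.DataFrame({"Date": pd.to_datetime(["1945-01-02", "1945-01-01"])})

# Mean temperature per (month, day) over the training years
climatology = (
    train.assign(Month=train.Date.dt.month, Day=train.Date.dt.day)
    .groupby(["Month", "Day"], as_index=False)["MeanTemp"].mean()
    .rename(columns={"MeanTemp": "avg_MeanTemp"})
)

# Attach the matching average to each test date; a left join keeps the test order
test_keys = test.assign(Month=test.Date.dt.month, Day=test.Date.dt.day)
naive = test_keys.merge(climatology, on=["Month", "Day"], how="left")
print(naive[["Date", "avg_MeanTemp"]])
```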

In [None]:
plt.figure(figsize = (16,4))
plt.plot(mean_MeanTemp.Date, y_true, linewidth = 0.7,color = 'blue', label = "True value")
plt.plot(mean_MeanTemp.Date, y_naive, linewidth = 0.7, color = 'black', label = "Naive Prediction")

plt.legend(loc = "upper left")

title = 'Naive Prediction of Temperature'
x_label = 'time'
y_label = 'Temperature (°C)'

plt.suptitle(title, fontsize = 'xx-large')
plt.xlabel(x_label)
plt.ylabel(y_label)
plt.show()

### Conclusion
**It is clear from the results that even *naive* predictions are not so 'naive'.  
This approach makes predictions with a root mean squared error of around 1 degree Celsius.**

In the next section, we will try to further improve our predictions.

# 2. Time Window

To use ML models on time series we need to define two sets of variables:
1. **Predictive or independent variables**
2. **Target variable**  

The predictive variables give the time series a structured, tabular representation.  
They are computed from values already observed in the series, or they can be those values themselves.  
We use this strategy to build a feature matrix from the training dataset.
The matrix holds values in a window of fixed length T: each row contains the MeanTemp values of the T previous days. 
The target variable, in contrast, encodes a future event. In our case, it is the mean temperature of a given day.
The algorithm can then model the relationship between already-seen temperature values and the upcoming behavior of the series. 
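
To make the window idea concrete, here is a minimal sketch on a toy series using NumPy's `sliding_window_view`. It is an alternative to the pandas approach used later in this notebook, and the series values and `T = 3` are illustrative.

```python
import numpy as np

T = 3  # window length (illustrative)
series = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# Each window holds T past values followed by the value to predict
windows = np.lib.stride_tricks.sliding_window_view(series, T + 1)
X_toy = windows[:, :-1]   # features: the T previous days, oldest first
y_toy = windows[:, -1]    # target: the day that follows them

print(X_toy)
print(y_toy)
```

The first T values of the series have no complete window, so a series of length 6 yields 3 rows.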

In [None]:
from statsmodels.graphics.tsaplots import plot_pacf

In [None]:
plot_pacf(df.MeanTemp, lags=7)

plt.xlabel("previous day n.")
plt.ylabel("correlation")
plt.show()

**The partial auto-correlation function of the mean temperature shows that values with a lag of one day are highly correlated with the target.**  
A time window of size 3 is therefore enough to capture the relevant features.
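
For intuition about what the plot measures: the partial auto-correlation at lag k can be estimated as the last coefficient of a least-squares regression of the current value on the k previous values. The sketch below applies this to a synthetic AR(1) series; `pacf_ols` is a hypothetical helper, and this estimator is not necessarily the one `plot_pacf` uses by default.

```python
import numpy as np

def pacf_ols(x, k):
    """Estimate the lag-k partial auto-correlation via least squares."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Columns are the series lagged by 1..k; the target is the current value
    lags = np.column_stack([x[k - i : n - i] for i in range(1, k + 1)])
    target = x[k:]
    coefs, *_ = np.linalg.lstsq(lags, target, rcond=None)
    return coefs[-1]  # coefficient of the lag-k column

# Synthetic AR(1) series: each value is 0.8 times the previous one plus noise
rng = np.random.default_rng(0)
noise = rng.normal(size=2000)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.8 * x[t - 1] + noise[t]

pacf_1 = pacf_ols(x, 1)
pacf_2 = pacf_ols(x, 2)
print(f"lag 1: {pacf_1:.2f}, lag 2: {pacf_2:.2f}")
```

In this synthetic case only lag 1 carries direct information, so the lag-2 estimate stays close to zero once lag 1 is accounted for.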

In [None]:
def to_time_window(df, n):
    """ This function creates a new Dataframe containing values from 'n' previous rows
        First n rows of input dataframe are dropped as they contain missing values
    """
    label_col = "MeanTemp"
    data = pd.DataFrame(df.copy())
    cols = []
    
    # add the lag of the target variable from current steps back up to n
    for i in range(1, n+1):
        new_col = f'day_ - {i}'
        data[new_col] = data[label_col].shift(i)
        cols.insert(0,new_col)

    cols.insert(0, "Date")
    cols.append(label_col)
    
    return data.dropna()[cols]

X = to_time_window(df, 3)
X

We again split the dataset into training and test portions.  
The model is trained on the values from 1940-1944 and predictions are made for 1945.

In [None]:
mask = X.Date < '1945-01-01'
df_train = X[mask]
df_test = X[~mask]

df_test= df_test.drop("Date",axis = 1)
df_train= df_train.drop("Date",axis = 1)

target = "MeanTemp"
X_train = df_train.drop(target, axis = 1).values
X_test = df_test.drop(target, axis = 1).values

y_train = df_train[target].values
y_test = df_test[target].values

X_train.shape, X_test.shape, y_train.shape, y_test.shape

# Linear Regression

The time series is now in a well-structured form, and we can build ML models for prediction.  
We will use a basic linear regression model to predict the target values.

In [None]:
reg  = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
model_evalutation(y_test, y_pred)

In [None]:
plt.figure(figsize = (20,6))
plt.plot(mean_MeanTemp.Date, y_true, linewidth = 1.3, label = "True value")
plt.plot(mean_MeanTemp.Date, y_pred, linewidth = 1.3,  label = "LR Prediction")
plt.plot(mean_MeanTemp.Date, y_naive, linewidth = 0.7, color = 'grey' , label = "Naive Prediction")



plt.legend(loc = "upper left")
title = 'Predictions of Temperature'
x_label = 'time'
y_label = 'temperature (°C)'

plt.suptitle(title, fontsize = 'xx-large')
plt.xlabel(x_label)
plt.ylabel(y_label)
plt.show()

# Signed prediction error (positive when the model under-predicts)
error = y_test - y_pred

plt.figure(figsize = (20,3))
plt.plot(mean_MeanTemp.Date, error, linewidth = 0.9, label = "")
plt.axhline(np.mean(abs(error)), linewidth = 0.5,color = 'purple', label = "Average")

plt.legend(loc = "upper left")
title = 'Error in Predictions'
x_label = 'time'
y_label = 'temperature (°C)'

plt.suptitle(title, fontsize = 'xx-large')
plt.xlabel(x_label)
plt.ylabel(y_label)
plt.show()


print(f"Average absolute error is {np.mean(abs(error)):.2f} with standard deviation of {np.std(abs(error)):.2f}")

# Conclusions

In this report, we have seen that a simple linear regression model can outperform the baseline.  
The R2 score improves by 0.16, and predictions are made with an RMSE of 0.87 degrees Celsius.  

In this experiment, we had the true values for the whole year; in a real-life scenario, we would need to update our model every day with the current day's temperature.  
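
One way to read the remark above is as a walk-forward loop: each day the model predicts from the most recent observations, and the true value is appended to the history before the next prediction. A minimal sketch on a synthetic series follows. The helper names `fit_lag_model` and `predict_next` are hypothetical, and a NumPy least-squares fit stands in for the notebook's `LinearRegression`.

```python
import numpy as np

def fit_lag_model(series, n_lags):
    """Least-squares fit of the next value on the previous n_lags values."""
    values = np.asarray(series, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(values, n_lags + 1)
    X = np.column_stack([np.ones(len(windows)), windows[:, :-1]])  # intercept + lags
    y = windows[:, -1]
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

def predict_next(history, coefs):
    """Predict the next value from the last n_lags observations."""
    n_lags = len(coefs) - 1
    recent = np.asarray(history[-n_lags:], dtype=float)  # oldest to newest
    return coefs[0] + recent @ coefs[1:]

# Synthetic daily series: yearly cycle plus noise (not the weather data)
rng = np.random.default_rng(0)
t = np.arange(400)
series = 24 + 3 * np.sin(2 * np.pi * t / 365) + rng.normal(scale=0.5, size=t.size)
train, test = series[:300], series[300:]

coefs = fit_lag_model(train, n_lags=3)
history = list(train)
preds = []
for observed in test:
    preds.append(predict_next(history, coefs))  # forecast before seeing today
    history.append(observed)                    # then record today's true value
preds = np.array(preds)

rmse = np.sqrt(np.mean((test - preds) ** 2))
print(f"walk-forward RMSE: {rmse:.2f}")
```

The coefficients are fitted once here; refitting inside the loop as new observations arrive would be a straightforward variation.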

The linear regressor provides a good approximation, but its predictions are somewhat lagged.  
The model takes at least one day to react to the current trend, which induces a non-zero error.  
This error is larger when the temperature difference relative to the previous day is high.  
**Although the R2 score is not very high, an average absolute error of 0.64 with a standard deviation of 0.58 might be acceptable for our domain.**



## Further improvements
We can improve the model further by including information from past years.  
For example, we can include the temperature of the target day, and of the days preceding it, from past years.  
This may require data from more years.  
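
As a rough illustration of such a feature, the sketch below adds a "same day last year" column to a toy daily series. The dates, values, and the column name `temp_last_year` are all illustrative; leap days would need extra care in real data.

```python
import numpy as np
import pandas as pd

# Toy daily series covering two non-leap years (values are just a counter)
dates = pd.date_range("1941-01-01", "1942-12-31", freq="D")
toy = pd.DataFrame({"MeanTemp": np.arange(len(dates), dtype=float)}, index=dates)

# Look up the value recorded exactly one year before each date
one_year_back = toy.index - pd.DateOffset(years=1)
toy["temp_last_year"] = toy["MeanTemp"].reindex(one_year_back).to_numpy()

print(toy.loc["1942-01-01":"1942-01-03"])
```

Rows from the first year have no earlier value and are left as NaN, which is why more years of history would be needed.
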
We can also try other regression models to check whether they perform better.  Polynomial features can be used to check whether the relation between the previous days' temperatures and the current day's temperature is non-linear.
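
A minimal sketch of the polynomial-feature check, using scikit-learn's `PolynomialFeatures` in a pipeline. The data is synthetic with a deliberately quadratic relationship, so it only shows the mechanics, not what the weather data would reveal; `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic features standing in for three lagged temperatures
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(200, 3))
# Target with a quadratic term, so a purely linear model cannot fit it fully
y_toy = X_toy[:, 2] ** 2 + 0.5 * X_toy[:, 0]

linear = LinearRegression().fit(X_toy, y_toy)
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X_toy, y_toy)

linear_r2 = linear.score(X_toy, y_toy)
poly_r2 = poly.score(X_toy, y_toy)
print(f"linear R2: {linear_r2:.2f}, degree-2 R2: {poly_r2:.2f}")
```

On the real data, a clear gain of the degree-2 model over the linear one on the held-out year would suggest a non-linear relation.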

With a similar approach, we can also predict other aspects of weather, such as minimum/maximum temperature, humidity level, and precipitation probability.