<h1 align=center><font size=5>Predicting Max Temp in a day given Min Temp. </font></h1>

Weather Conditions in World War Two: Is there a relationship between the daily minimum and maximum temperature? Can you predict the maximum temperature given the minimum temperature?

In [None]:
#Needed libraries and modules
import pandas as pd                                  # data processing
import numpy as np                                   # linear algebra functionalities 
import seaborn as sns                                # visualization library
import matplotlib.pyplot as plt                      # visualization library

#Modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.svm import SVR

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

import multiprocessing
import warnings
warnings.filterwarnings("ignore")


### Table of Contents

* [Loading and visualizing the data](#Section1)
* [Linear regression](#Section2)
* [Polynomial regression](#Section3)
* [Conclusion. Model comparison](#Section4)


# 1. Loading and visualizing the data <a class="anchor" id="Section1"></a>



In [None]:
path = "../input/weatherww2/Summary of Weather.csv"
df = pd.read_csv(path)
df

In [None]:
#Reshape to convert from a 1D numpy array to the 2D numpy array that scikit-learn expects.

X = df['MinTemp'].values.reshape(-1,1)         #X contains the observations of the independent variable
Y = df['MaxTemp'].values.reshape(-1,1)         #Y contains the observations of the dependent variable
 
print("Type and size of the  vector X:", type(X), X.shape)
print("Type and size of the  vector Y:", type(Y), Y.shape)

In [None]:
#Scatter plot of the data

width = 30
height = 15
plt.figure(figsize=(width, height))

plt.scatter(X,Y)
plt.title('MinTemp vs MaxTemp')  
plt.xlabel('MinTemp')  
plt.ylabel('MaxTemp')  
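
The scatter plot gives a visual impression of the relationship; a single number that summarises its linear strength is Pearson's correlation coefficient. The sketch below uses synthetic stand-in data (an assumption, not the actual weather records) only to show the call.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the WW2 weather records).
rng = np.random.default_rng(0)
min_temp = rng.uniform(-10, 30, size=500)
max_temp = 10 + 0.9 * min_temp + rng.normal(0, 4, size=500)

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is Pearson's r between the two variables.
r = np.corrcoef(min_temp, max_temp)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```

On the notebook's data the same call would presumably be `np.corrcoef(X.ravel(), Y.ravel())[0, 1]`, since `X` and `Y` are stored as column vectors.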

In [None]:
#Distribution plot of MaxTemp

width = 20
height = 10
plt.figure(figsize=(width, height))
sns.kdeplot(Y.ravel(), label="MaxTemp")   #distplot is deprecated in recent seaborn; kdeplot draws the same density curve

plt.title('Distribution of MaxTemp')  
plt.xlabel('MaxTemp')
plt.ylabel('Density of observations')

In [None]:
#Splitting the data into train (80%) and test (20%) sets.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

print('Number of elements in the train set:', X_train.size)
print('Number of elements in the test set:', Y_test.size)
print("Shape of train and test vectors", Y_train.shape, X_test.shape)

# 2. Linear regression <a class="anchor" id="Section2"></a>

In [None]:
#Training linear model

lm = LinearRegression()

lm.fit(X_train,Y_train)

print('The learned intercept and coefficients are', lm.intercept_, lm.coef_) 
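
With a single feature, the least-squares intercept and slope have a closed form: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The following sketch checks this against `LinearRegression` on synthetic data (the data and variable names are illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(-10, 30, size=200)
y = 10 + 0.9 * x + rng.normal(0, 3, size=200)

# Closed-form ordinary least squares for one feature.
# bias=True makes np.cov divide by N, matching np.var's default.
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()

# The same fit with scikit-learn (X must be 2D, hence the reshape).
model = LinearRegression().fit(x.reshape(-1, 1), y)

print("closed form :", intercept, slope)
print("scikit-learn:", model.intercept_, model.coef_[0])
```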

In [None]:
#Making predictions using the entire data set and plotting 

Yhat = lm.predict(X)

width = 30
height = 15
plt.figure(figsize=(width, height))

plt.scatter(X,Y)
plt.plot(X, Yhat, color='red', linewidth=4)
plt.title('MinTemp vs MaxTemp')  
plt.xlabel('MinTemp')  
plt.ylabel('MaxTemp')
plt.show()

Let us see how our model performs on new data.

In [None]:
print('R^2 score of simple linear regression on the test set:', lm.score(X_test, Y_test))

Yhat_test = lm.predict(X_test)
print('Mean squared error of simple linear regression on the test set:', mean_squared_error(Y_test, Yhat_test))
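
For reference, both metrics can be computed by hand: the mse is the mean of the squared residuals, and r^2 is one minus the ratio between the residual sum of squares and the total sum of squares around the mean of the true values. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values, for illustration only.
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_pred = np.array([22.0, 24.0, 29.0, 37.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                    # mean squared residual
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print("mse:", mse, "sklearn:", mean_squared_error(y_true, y_pred))
print("r^2:", r2, "sklearn:", r2_score(y_true, y_pred))

# r^2 is not symmetric: the true values must come first,
# because ss_tot is computed from the first argument.
r2_swapped = r2_score(y_pred, y_true)
print("r^2 with swapped arguments:", r2_swapped)
```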

Let's compare the distribution of the predicted max temperatures with the actual one.

In [None]:
width = 30
height = 15
plt.figure(figsize=(width, height))

#Plotting the actual and predicted distributions (kdeplot replaces the deprecated distplot)
ax1 = sns.kdeplot(Y_test.ravel(), color="b", label="Actual Value")
sns.kdeplot(Yhat_test.ravel(), color="r", label="Predicted Values", ax=ax1)
plt.legend()

plt.title('Distribution')  
plt.xlabel('MaxTemp')
plt.ylabel('Density of observations')

plt.show()
plt.close()

This looks pretty accurate. Let's check whether another model fits the data better.

# 3. Polynomial regression <a class="anchor" id="Section3"></a>

In this section, we look for the polynomial model that best fits our data. To that end we will use a Ridge model, which adds a regularization term proportional to the squared L2 norm of the coefficients. See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
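
For reference, the objective that Ridge minimizes, as stated in the scikit-learn documentation, can be written as follows, where $w$ is the coefficient vector, $X$ the matrix of (here polynomial) features and $\alpha \geq 0$ the regularization strength:

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

Larger values of $\alpha$ shrink the coefficients more strongly; $\alpha = 0$ recovers ordinary least squares.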

For each degree d=1, ..., n, we do a grid search over the regularization parameter alpha. This yields:

- The best estimator for each degree, i.e. the one with the best alpha.
- The associated r^2 and mean squared error (mse) scores for each of these best estimators.
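
As an alternative sketch (not the approach used in this notebook), the degree and alpha can be searched together by wrapping `PolynomialFeatures` and `Ridge` in a scikit-learn pipeline. The data, the grids and the names `X_demo`, `y_demo` below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data, for illustration only.
rng = np.random.default_rng(2)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = (1.0 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 0] ** 2
          + rng.normal(0, 0.5, size=300))

# make_pipeline names each step after its lowercased class name,
# which gives the "<step>__<parameter>" keys used in the grid.
pipe = make_pipeline(PolynomialFeatures(include_bias=False), Ridge())
param_grid = {
    "polynomialfeatures__degree": [1, 2, 3, 4],
    "ridge__alpha": [0.1, 0.4, 1.6, 6.4],
}

search = GridSearchCV(pipe, param_grid, scoring="r2", cv=5)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV r^2 score:", search.best_score_)
```

One design benefit of the pipeline is that `search.predict` applies the polynomial expansion automatically, so new inputs do not have to be transformed by hand before predicting.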



In [None]:
CV_param_grid = [{'alpha': [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4] }]

CV_best_estimators = []
CV_best_r2_scores = []
CV_best_mse_scores = []
CV_best_params = []

random_state = 47

degrees = 20
for d in range(degrees):
    pf = PolynomialFeatures(degree=d+1)
    Xd = pf.fit_transform(X_train)
    
    #The grid search computes both the r^2 and the negative mse, but it selects the best model based on the r^2 score.
    Grid = GridSearchCV(estimator= Ridge(random_state = random_state), param_grid=CV_param_grid, scoring=['r2','neg_mean_squared_error'], 
                        cv=20, refit='r2', n_jobs=multiprocessing.cpu_count())
    Grid.fit(Xd, Y_train)
    CV_best_estimators.append(Grid.best_estimator_)
    CV_best_r2_scores.append(Grid.best_score_)
    CV_best_mse_scores.append((-1)*Grid.cv_results_['mean_test_neg_mean_squared_error'][Grid.best_index_])
    CV_best_params.append(Grid.best_params_)


In [None]:
print('The list of best parameters for each degree is:\n', CV_best_params)
print('The list of best CV r^2 scores for each degree is:\n', CV_best_r2_scores)
print('The list of best CV mse scores for each degree is:\n', CV_best_mse_scores)

Let's see how well these models perform on the test set.

In [None]:
test_r2_scores = []
test_mse_scores = []


for d in range(degrees):
    pf = PolynomialFeatures(degree=d+1)
    Xd_test = pf.fit_transform(X_test)
    Ydhat = CV_best_estimators[d].predict(Xd_test)
    #The metric functions expect (y_true, y_pred); r2_score is not symmetric in its arguments
    test_r2_scores.append(r2_score(Y_test, Ydhat))
    test_mse_scores.append(mean_squared_error(Y_test, Ydhat))

In [None]:
print('The list of r^2 scores in the test set for each degree is:\n', test_r2_scores)
print('The list of mse scores in the test set for each degree is:\n', test_mse_scores)

Let's plot the r^2 and mse scores per degree

In [None]:
width = 24
height = 8
plt.figure(figsize=(width, height))

plt.subplot(1, 2, 1)
plt.plot(range(1,degrees+1), CV_best_r2_scores, color='blue', label="CV r^2 score")
plt.plot(range(1,degrees+1), test_r2_scores, color='red', label="Test r^2 score")
plt.legend(loc="lower right")
plt.title('CV R2 score vs Degree') 

plt.subplot(1, 2, 2)
plt.plot(range(1,degrees+1), CV_best_mse_scores, color='blue', label="CV mse score")
plt.plot(range(1,degrees+1), test_mse_scores, color='red', label="Test mse score")
plt.legend(loc="upper right")
plt.title('CV MSE score vs Degree') 

plt.show()


In [None]:
#Conclusion
best_d_r2_cv = CV_best_r2_scores.index(max(CV_best_r2_scores))+1
best_d_mse_cv = CV_best_mse_scores.index(min(CV_best_mse_scores))+1

print("Degree with best CV R^2 score:", best_d_r2_cv)
print("Degree with best CV MSE score:", best_d_mse_cv)

In [None]:
print('The best hyperparameters found after CV are:\n', 'degree:', best_d_r2_cv, '\n',  
      'alpha:', CV_best_params[best_d_r2_cv-1]['alpha'],
     )

In [None]:
BestModel = CV_best_estimators[best_d_r2_cv-1]
BestModel

Let's plot the curve of our trained model together with the data set.

In [None]:
X_new = np.linspace(X.min(), X.max(), 2000).reshape(-1,1)   #2000 evenly spaced MinTemp values covering the observed range
pf = PolynomialFeatures(degree= best_d_r2_cv)
X_trans = pf.fit_transform(X_new)
Yhat_trans = BestModel.predict(X_trans)

#Plotting predictions
width = 30
height = 15
plt.figure(figsize=(width, height))

plt.scatter(X,Y)
plt.plot(X_new, Yhat_trans, color='red', linewidth=4)
plt.title('MinTemp vs MaxTemp')  
plt.xlabel('MinTemp')  
plt.ylabel('MaxTemp')
plt.show()

Let's see how the above polynomial regression model performs on the test set.

In [None]:
pf = PolynomialFeatures(degree= best_d_r2_cv)
Xd_test = pf.fit_transform(X_test)
Ydhat_test = BestModel.predict(Xd_test)

print('R^2 score of the above polynomial regression on the test set:', BestModel.score(Xd_test, Y_test))

print('Mean squared error of the above polynomial regression on the test set:', mean_squared_error(Y_test, Ydhat_test))

# 4. Conclusion. Model comparison <a class="anchor" id="Section4"></a>

The polynomial model yields better r^2 and mse scores than the linear regression model (which scored around 0.76 and 17.6 respectively).

However, in certain regions the variance seems to be very high, especially at the lower extreme values. Let's compare the performance of the polynomial and the linear model in this region.

In [None]:
def PerformaceLimitCases (Model, X, Y, lim_inf, lim_sup, degree=1):
    """
    This function returns the r^2 score of a model in the interval lim_inf < X < lim_sup
    """
    mask = (X > lim_inf) & (X < lim_sup)       #boolean mask selecting the observations inside the interval
    X_in = X[mask].reshape(-1,1)
    Y_in = Y[mask].reshape(-1,1)
    if degree != 1:
        #Expand to the polynomial features the model was trained on
        X_in = PolynomialFeatures(degree=degree).fit_transform(X_in)
    return Model.score(X_in, Y_in)

In [None]:
print('Performance of the BestModel in the range (-34, -30):', 
      PerformaceLimitCases(BestModel, X_test, Y_test, -34, -30, best_d_r2_cv))
print('Performance of the linear model in the range (-34, -30):', 
      PerformaceLimitCases(lm, X_test, Y_test, -34, -30, 1))


We see that in the interval (-34, -30) the linear model performs better. 
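
A caveat when reading r^2 on a narrow interval: the score compares the model against the mean of the targets inside that interval, so it can be low even when the absolute errors are no larger than elsewhere. The sketch below, on synthetic data with a hypothetical helper `slice_metrics`, reports the RMSE and the sample count alongside r^2 to illustrate this.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def slice_metrics(y_true, y_pred, x, low, high):
    """Hypothetical helper: r^2, RMSE and sample count for low < x < high."""
    mask = (x > low) & (x < high)
    yt, yp = y_true[mask], y_pred[mask]
    rmse = np.sqrt(mean_squared_error(yt, yp))
    return r2_score(yt, yp), rmse, int(mask.sum())

# Synthetic data with the same noise level everywhere (illustration only).
rng = np.random.default_rng(3)
x = rng.uniform(-35, 35, size=2000)
y = 10 + 0.9 * x + rng.normal(0, 4, size=2000)
y_hat = 10 + 0.9 * x   # predictions from the true underlying line

r2_full, rmse_full, n_full = slice_metrics(y, y_hat, x, -35, 35)
r2_slice, rmse_slice, n_slice = slice_metrics(y, y_hat, x, -34, -30)

print("full range  : r^2 =", r2_full, "rmse =", rmse_full, "n =", n_full)
print("(-34, -30)  : r^2 =", r2_slice, "rmse =", rmse_slice, "n =", n_slice)
```

In this synthetic setting the RMSE is similar on both ranges while r^2 drops sharply on the narrow one, which suggests looking at an error measure in degrees as well before concluding that one model is worse in that region.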

Let's also check the predicted distribution of the polynomial model.

In [None]:
width = 30
height = 15
plt.figure(figsize=(width, height))

#Plotting the actual and predicted distributions (kdeplot replaces the deprecated distplot)
ax1 = sns.kdeplot(Y_test.ravel(), color="b", label="Actual Value")
sns.kdeplot(Ydhat_test.ravel(), color="r", label="Predicted Values", ax=ax1)
plt.legend()

plt.title('Distribution')  
plt.xlabel('MaxTemp')
plt.ylabel('Density of observations')

plt.show()
plt.close()

We see that the linear model seems to reproduce the actual distribution more accurately, especially for values around 30 degrees. 
Moreover, the improvements in the r^2 and mse scores are not very significant (from 0.76 to 0.79 and from 17 to 15). Therefore, it seems reasonable to keep the linear regression as the predictive model for this data set.