## PS1 Part 1: Linear Models and Validation

### Preamble
We'll load the Mauna Loa atmospheric CO2 concentration data, a dataset commonly used for building time series prediction models. You will build a few baseline linear models and assess them using some of the tools we discussed in class. Which model is best? Let's find out.

First, let's load the data and take a look at it:

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
from datetime import datetime, timedelta
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
sns.set_context('notebook')

# Fetch the data
mauna_lao = fetch_openml('mauna-loa-atmospheric-co2', as_frame = False)
print(mauna_lao.DESCR)
data = mauna_lao.data
# Assemble the day/time from the data columns so we can plot it
d1958 = datetime(year=1958,month=1,day=1)
time = [datetime(int(d[0]),int(d[1]),int(d[2])) for d in data]
# Convert each date to a fractional year (years elapsed since 1958-01-01, plus 1958)
X = np.array([1958+(t-d1958)/timedelta(days=365.2425) for t in time])
X = X.reshape(-1,1)  # Make it a column to make scikit happy
y = np.array(mauna_lao.target)

In [None]:
# Plot the data
plt.figure(figsize=(10,5))    # Initialize empty figure
plt.scatter(X, y, c='k',s=1) # Scatterplot of data
plt.xlabel("Year")
plt.ylabel(r"CO$_2$ in ppm")
plt.title(r"Atmospheric CO$_2$ concentration at Mauna Loa")
plt.tight_layout()
plt.show()

In [None]:
y[:100]

### Linear Models

Construct the following linear models:
1. Model 1: "Vanilla" Linear Regression, that is, $CO_2 = a+b \cdot t$, where $t$ is time
2. Model 2: Quadratic Regression, where $CO_2 = a+b \cdot t + c\cdot t^2$
3. Model 3: A more complex "linear" model with the following additive terms ($CO_2=a+b\cdot t+c\cdot \sin(\omega\cdot t)$):
  * a linear (in time) term
  * a sinusoidal additive term with a period of roughly 1 year (the peak-to-peak spacing of the sinusoid) and a phase shift of zero (set $\omega$ as appropriate to match the peaks)
4. Model 4: A "linear" model with the following additive terms ($CO_2=a+b\cdot t+c\cdot t^2+d\cdot \sin(\omega\cdot t)$):
  * a quadratic (in time) polynomial
  * a sinusoidal additive term with a period of roughly 1 year (the peak-to-peak spacing of the sinusoid) and a phase shift of zero (set $\omega$ as appropriate to match the peaks)
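
All four models are linear in their coefficients once $\omega$ is fixed, so each can be fit by ordinary least squares on a suitable feature matrix. The following is a minimal sketch on synthetic data, not the required implementation. The helper `design_matrix`, the constant `OMEGA`, and the coefficient values are hypothetical, and it assumes time is measured in years so that a one-year period corresponds to $\omega = 2\pi$.

```python
import numpy as np

OMEGA = 2 * np.pi  # assumption: t is in years, so a 1-year period gives omega = 2*pi

def design_matrix(t, model):
    """Build the feature matrix for Models 1-4 (hypothetical helper)."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t]           # intercept a and linear term b*t
    if model in (2, 4):
        cols.append(t ** 2)               # quadratic term
    if model in (3, 4):
        cols.append(np.sin(OMEGA * t))    # seasonal term with fixed omega, zero phase
    return np.column_stack(cols)

# Synthetic, noise-free data generated from Model 4 with made-up coefficients
t = np.linspace(0.0, 10.0, 200)
true_coef = np.array([315.0, 0.8, 0.01, 3.0])
y_synth = design_matrix(t, 4) @ true_coef

# Least squares recovers the coefficients because the model is linear in them
coef, *_ = np.linalg.lstsq(design_matrix(t, 4), y_synth, rcond=None)
print(coef)
```

The synthetic time axis starts at 0 rather than at a calendar year. With raw years near 2000, the $t^2$ column becomes very large, so subtracting a reference year before squaring is a common way to keep the problem well conditioned.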
  
Evaluate these models using **the appropriate kind of Cross Validation** for each of the following amounts of Training data:
1. N=50 Training Data Points
2. N=100
3. N=200
4. N=500
5. N=1000
6. N=2000

**Question**: Before you construct the models or do any coding below, what is your initial guess or intuition about how each of those four models will perform? Note: there is no right or wrong answer to this part of the assignment, and this question will be graded only on completeness, not accuracy. Its intent is to get you to think about and write down your preliminary intuition regarding what you think will happen before you actually implement anything, based on your approximate understanding of how functions of the above complexity *should* perform as N increases.

**Student Response:**
* Model 1 should average out the spikes and approximate the slope of the overall increase in CO2.
* Model 2 should perform similarly to Model 1, except that its fit will be slightly curved rather than a straight line.
* Both of these models should perform fairly well, even with few data points.
* Model 3 should additionally capture the peaks. Of course, with fewer data points it will struggle to find the correct amplitude.
* Model 4 should perform similarly to Model 2 but again have a slight curve representing the overall (growing) increase.

**Question**: What is the appropriate kind of Cross Validation to perform in this case if we want a correct Out of Sample estimate of our Test MSE?

**Student Response:** Time Series Split Cross Validation is appropriate, because to predict (potentially future) CO2 values the model must be trained only on data that precedes the data it is tested on.
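
As a small illustration of that ordering constraint, the sketch below runs scikit-learn's `TimeSeriesSplit` on a toy array and prints each fold. The names `X_demo`, `tscv_demo`, and `folds` and the toy size of 12 samples are made up for illustration; the point is only that every training index precedes every test index within a fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered toy samples (illustrative only, not the CO2 data)
X_demo = np.arange(12).reshape(-1, 1)
tscv_demo = TimeSeriesSplit(n_splits=3)

# Collect the (train, test) index arrays for each fold
folds = list(tscv_demo.split(X_demo))
for k, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {k}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

A shuffled K-fold split would instead mix future points into the training set, which tends to make the test error look better than it would be on truly unseen future data.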

Now, for each of the above models and training data sizes:
* Plot the predicted CO2 as a function of time, including the actual data, for each of the N=X training data examples. This should correspond to six plots (one for each amount of training data) if you plot all models on the same plot, or 6x4 = 24 plots if you plot each model and training data plot separately.
* Create a [Learning Curve](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html) plot for the model which plots its Training and Test MSE as a function of training data. That is, plot how Training and Testing MSE change as you increase the training data for each model. This could be a single plot for all four models (8 lines on the plot) or four different plots corresponding to the learning curve of each model separately.
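
One possible way to compute the points of such a learning curve is sketched below. It is an illustration under stated assumptions rather than the required solution: `learning_curve_point` is a hypothetical helper, the data are a synthetic linear trend with noise standing in for the CO2 series, and each training size uses the first N points in time order.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

def learning_curve_point(model, X_n, y_n, n_splits=5):
    """Average train and test MSE over time-ordered folds (hypothetical helper)."""
    train_mse, test_mse = [], []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X_n):
        model.fit(X_n[train_idx], y_n[train_idx])
        train_mse.append(mean_squared_error(y_n[train_idx], model.predict(X_n[train_idx])))
        test_mse.append(mean_squared_error(y_n[test_idx], model.predict(X_n[test_idx])))
    return float(np.mean(train_mse)), float(np.mean(test_mse))

# Synthetic stand-in for the real series: linear trend plus Gaussian noise
rng = np.random.default_rng(0)
t_demo = np.linspace(0.0, 40.0, 2000).reshape(-1, 1)
y_demo = 315.0 + 1.2 * t_demo.ravel() + rng.normal(scale=0.5, size=2000)

sizes = [50, 100, 200, 500, 1000, 2000]
curve = [learning_curve_point(LinearRegression(), t_demo[:n], y_demo[:n]) for n in sizes]
for n, (tr, te) in zip(sizes, curve):
    print(f"N={n}: train MSE={tr:.3f}, test MSE={te:.3f}")
```

Plotting the two columns of `curve` against `sizes` (for example with `plt.plot`) gives one training line and one test line per model; repeating the call with each of the four pipelines yields the eight lines described above.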

In [None]:
import numpy as np

X_train_100 = X[:100]
y_train_100 = y[:100]
X_test = X[100:200]
print("Shape of X_train_100: %s" % str(X_train_100.shape))
print("Beginning of X_train_100: %s" % str(X_train_100[0:5]))
print("Shape of y_train_100: %s" % str(y_train_100.shape))
print("Beginning of y_train_100: %s" % str(y_train_100[0:5]))

print('Shape of X_test: %s' % str(X_test.shape))
print("Beginning of X_test: %s" % str(X_test[0:5]))

### Modify the below code. You can leave the code above as is. ###



In [None]:
# Insert Modeling Building or Plotting code here
# Note, you may implement these however you see fit
# Ex: using an existing library, solving the Normal Eqns
#     implementing your own SGD solver for them. Your Choice.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

class SinFeature(BaseEstimator, TransformerMixin):
    """Transformer that maps time t to the single feature sin(omega * t)."""
    def __init__(self, omega):
        self.omega = omega
    def fit(self, X, y=None):
        return self  # Nothing to learn: omega is fixed in advance
    def transform(self, X):
        return np.sin(self.omega * X).reshape(-1,1)

omega = 2 * np.pi / 1.0  # yearly seasonality (X is in years, so the period is 1)
n_splits = 5

# Time-ordered splits: every test fold lies after its training fold
tscv = TimeSeriesSplit(n_splits=n_splits)
pipe1 = Pipeline([('poly', PolynomialFeatures(degree=1,
                                             include_bias=False)),
                 ('lr',LinearRegression())])
pipe2 = Pipeline([('poly', PolynomialFeatures(degree=2,include_bias=False)),
                  ('lr',LinearRegression())])
pipe3 = Pipeline([('features', FeatureUnion([
                        ('poly', PolynomialFeatures(degree=1,include_bias=False)),
                        ('sin', SinFeature(omega=omega))])),
                  ('lr',LinearRegression())])
pipe4 = Pipeline([('features', FeatureUnion([
                        ('poly', PolynomialFeatures(degree=2,include_bias=False)),
                        ('sin', SinFeature(omega=omega))])),
                  ('lr',LinearRegression())])

N = [50,100,200,500,1000,2000]

for n in N:
    print(f"N = {n}")
    # Sample n points at random from the full record; sorting keeps them in time order
    random_indices = np.sort(np.random.choice(X.shape[0], n, replace=False))
    randomY = y[random_indices]
    randomX = X[random_indices]

    for i, (train_index, test_index) in enumerate(tscv.split(randomX)):
        pipe1.fit(randomX[train_index],randomY[train_index])
        pipe2.fit(randomX[train_index],randomY[train_index])
        pipe3.fit(randomX[train_index],randomY[train_index])
        pipe4.fit(randomX[train_index],randomY[train_index])

        # Visualize this time split using Model 4's predictions

        test_pred = pipe4.predict(randomX[test_index])
        train_pred = pipe4.predict(randomX[train_index])
        plt.figure(figsize=(4,4))
        plt.scatter(X, y, c='k',s=0.5, alpha=0.5, label="Data")
        plt.scatter(randomX[train_index],randomY[train_index], label="train",s=5)
        plt.scatter(randomX[test_index],randomY[test_index], label="test",s=5)
        plt.scatter(randomX[train_index],train_pred, label="train_pred",s=5)
        plt.scatter(randomX[test_index],test_pred, label="test_pred",s=5)
        plt.legend(loc="best")
        plt.title(f"Time Split Nr {i+1}")
        plt.show()

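    # Predict over the full dataset with the models as fit on the last split's training fold
    pred1 = pipe1.predict(X)
    pred2 = pipe2.predict(X)
    pred3 = pipe3.predict(X)
    pred4 = pipe4.predict(X)

    # Calculate scores for each model over the full dataset. These include the
    # training points, so they are not a pure out-of-sample estimate.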
    mse1 = mean_squared_error(y, pred1)
    r2_1 = r2_score(y, pred1)
    mse2 = mean_squared_error(y, pred2)
    r2_2 = r2_score(y, pred2)
    mse3 = mean_squared_error(y, pred3)
    r2_3 = r2_score(y, pred3)
    mse4 = mean_squared_error(y, pred4)
    r2_4 = r2_score(y, pred4)

    plt.figure(figsize=(8,8))
    plt.scatter(X, y, c='k',s=1, label="Data")
    plt.scatter(X,pred1, label="Model 1: MSE {:.2f}, R2 {:.3f}".format(mse1, r2_1),s=5)
    plt.scatter(X,pred2, label="Model 2: MSE {:.2f}, R2 {:.3f}".format(mse2, r2_2),s=5)
    plt.scatter(X,pred3, label="Model 3: MSE {:.2f}, R2 {:.3f}".format(mse3, r2_3),s=5,alpha=0.5)
    plt.scatter(X,pred4, label="Model 4: MSE {:.2f}, R2 {:.3f}".format(mse4, r2_4),s=5,alpha=0.5)
    plt.legend(loc="best")
    plt.title(f"N = {n}")
    plt.show()

**Question**: Which Model appears to perform best in the N=50 or N=100 Condition? Why is this?

**Student Response:** Although it goes against my intuition, Model 4 performs best everywhere. As seen in the plots, it follows the curve fairly accurately, and since we chose the "correct" $\omega$, it only needs to approximate the amplitude, resulting in high accuracy.

**Question**: Which Model appears to perform best for N=200 to 500? Why is this?

**Student Response:** Model 4

**Question**: Which Model appears to perform best as N = 2000? Why is this?

**Student Response:** Model 4