# STAT 207 Homework 12 [25 points]

## Regularization Models for Linear Relationships

Due: Wednesday, May 1, end of day (11:59 pm CT)

<hr>

## Imports 

Run the following code cell to import the necessary packages into the file.  You may import additional packages, as needed for this assignment.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

## The Data

With available climate data dating back many decades and the prevalence of climate change, humans are looking to understand exactly how different features of the climate affect the temperatures globally.  For this assignment, we will look to understand how the **global temperature** fluctuates based on other environmental features.

We will use various atmospheric and temperature measures over 309 months from 1983 to 2008, with the following variables:

- **Year**: the observation year
- **Month**: the observation month, recorded with numbers 1 to 12
- **MEI**: Multivariate El Nino Southern Oscillation Index (MEI), measuring the effects of the El Nino weather pattern
- **CO2**: atmospheric concentration of carbon dioxide (in ppmv, parts per million by volume)
- **CH4**: atmospheric concentration of methane (in ppmv)
- **N2O**: atmospheric concentration of nitrous oxide (in ppmv)
- **CFC-11**: atmospheric concentration of CCl3F or trichlorofluoromethane (in ppbv, parts per billion by volume)
- **CFC-12**: atmospheric concentration of CCl2F2 or dichlorodifluoromethane (in ppbv)
- **TSI**: the total solar irradiance (TSI) (in W/m2), measuring the rate at which the sun's energy is deposited per unit area.
- **Aerosols**: the mean stratospheric aerosol optical depth at 500 nm, a measure associated with volcanic activity
- **Temp**: the difference in the average global temperature for that month (in Celsius) and a reference value

The ESRL/NOAA Physical Sciences Division reports the MEI; atmospheric concentrations are measured by the ESRL/NOAA Global Monitoring Division; the SOLARIS-HEPPA project website provides the TSI; the Goddard Institute for Space Studies at NASA reports the Aerosols; and the Climatic Research Unit at the University of East Anglia reports the Temp.

Run the code in the cell below to read in the cleaned data for this document.  The data is saved as `df` with this code.  

In [None]:
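# Read in the cleaned climate data
df = pd.read_csv('climate_change.csv')
# Split by time: observations through 2006 for training, 2007 onward for testing
df_train = df[df['Year'] <= 2006]
df_test = df[df['Year'] >= 2007]
# Predictors: every column except the date columns and the response
X_train = df_train.drop(['Year', 'Month', 'Temp'], axis = 1)
X_test = df_test.drop(['Year', 'Month', 'Temp'], axis = 1)
# Response: the temperature difference
y_train = df_train['Temp']
y_test = df_test['Temp']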

## 1. Summarize Data [2 points]

Above, we set aside a training set and a test set.  We didn't randomly select them; instead, we imagine that we are fitting a model at the end of 2006, using all of the data available through 2006 as our training data.  We'll then use the data from the months in 2007 and 2008 as the test set.

As set up in the code above, our response variable for this assignment will be **Temp** (stored in `y_train` and `y_test`).  We'll use all variables except **Year** and **Month** as our predictor variables (stored in `X_train` and `X_test`).

**a)** First, let's explore the summary statistics for our predictor variables.

**b)** Scale our predictor variables in the training data.  Then, observe the summary statistics for the variables in the training data.

**c)** Now, we want to be sure that we also scale our test data according to the same process.  Apply your scaling algorithm from **part b** to the test data.  Observe the means and variances of your scaled test data.

*Note*: This does not include re-fitting your scaling process.  You will re-use your scaling process from **part b**.
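The fit-on-training, transform-on-test pattern described in **parts b and c** can be sketched as follows.  This is a minimal illustration on synthetic data, not the climate data.  It assumes scikit-learn's `StandardScaler` as the scaling method (the assignment does not name one), and the names `toy_train`, `toy_test`, and `scaler` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors; these are NOT the climate data.
rng = np.random.default_rng(0)
toy_train = pd.DataFrame(
    rng.normal(loc=[350.0, 1.0], scale=[10.0, 0.5], size=(100, 2)),
    columns=["a", "b"],
)
# The test rows come from a later, shifted period (column "a" has drifted upward).
toy_test = pd.DataFrame(
    rng.normal(loc=[380.0, 1.0], scale=[10.0, 0.5], size=(20, 2)),
    columns=["a", "b"],
)

# Fit the scaler on the training data only, then transform the training data.
scaler = StandardScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(toy_train), columns=toy_train.columns)

# Re-use the fitted scaler on the test data: transform only, no re-fitting.
test_scaled = pd.DataFrame(scaler.transform(toy_test), columns=toy_test.columns)

# Training means are 0 by construction; test means and variances need not be 0 and 1.
print(train_scaled.describe().loc[["mean", "std"]])
print(test_scaled.agg(["mean", "var"]))
```

One detail to keep in mind when reading the output: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas reports the sample standard deviation (`ddof=1`) by default, so the training `std` shown by `describe()` will be slightly above 1.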

## 2. Fitting A Model [1 point]

Fit a LASSO model with $\lambda = 0.06$ to the training data, including all variables except the year and month variables.  Print the coefficients for this model.
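As a rough sketch of what fitting a LASSO model and printing its coefficients looks like, the example below uses synthetic data rather than the climate data.  It assumes that scikit-learn's `alpha` argument plays the role of $\lambda$; the names `X_toy`, `y_toy`, and `lasso_toy` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic, already-standardized predictors (NOT the climate data).
# Only the first two columns actually affect the response.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 5))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# scikit-learn names the penalty strength `alpha`; it stands in for lambda here.
lasso_toy = Lasso(alpha=0.06)
lasso_toy.fit(X_toy, y_toy)

# The intercept and one coefficient per predictor column.
print(lasso_toy.intercept_)
print(lasso_toy.coef_)
```

In this toy setting the LASSO penalty shrinks the coefficients of the three irrelevant columns to exactly zero, which is the variable-selection behavior that distinguishes LASSO from ridge regression.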

## 3. Picking a Best Model [2 points]

Let's suppose that we decide to move forward with a ridge regression model.  We originally fit the model with $\lambda = 0.06$.  Let's explore whether this value of $\lambda$ is the best for our model.

To do this, we'll explore $\lambda$ values from 0.05 to 1 in steps of 0.05.  We can do this with code using:

`for m in range(1, 21):`

`alph = m / 20`
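To see the grid this loop produces, the short sketch below collects the values in a list.  The list name `alphas` is hypothetical and only used for illustration.

```python
# Build the grid of candidate penalty values: 0.05, 0.10, ..., 1.00.
alphas = []
for m in range(1, 21):
    alph = m / 20  # m = 1 gives 0.05, m = 20 gives 1.0
    alphas.append(alph)

print(len(alphas), alphas[0], alphas[-1])  # prints: 20 0.05 1.0
```

Dividing an integer counter by 20 avoids the rounding error that can build up when 0.05 is added repeatedly to a floating-point value.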

The following code sets up the folds for cross-validation.

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

cross_val = KFold(n_splits = 10, shuffle = True, random_state = 202405)

**a)** Use 10-fold cross-validation to explore the $R^2$ values for each of the $\lambda$s as defined above.

**b)** Print the $R^2$ values for each of the individual folds of your optimal $\lambda$.

**c)** Repeat this process for a different set of 10 folds, using a random state of your choosing.  Determine which $\lambda$ results in the optimal mean $R^2$, similar to what you did above.
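The general shape of a cross-validated search over $\lambda$ can be sketched as below.  This is only an illustration on synthetic data with an arbitrary random state; it assumes `cross_val_score` with `scoring="r2"` as the way to obtain per-fold $R^2$ values, and the names `X_toy`, `y_toy`, `folds`, `mean_r2`, and `best_alpha` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (NOT the climate data).
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(150, 4))
y_toy = X_toy @ np.array([1.5, -0.5, 0.0, 0.8]) + rng.normal(scale=0.5, size=150)

# Ten shuffled folds; the seed here is arbitrary.
folds = KFold(n_splits=10, shuffle=True, random_state=0)

# Mean cross-validated R^2 for each candidate penalty value.
mean_r2 = {}
for m in range(1, 21):
    alph = m / 20
    scores = cross_val_score(Ridge(alpha=alph), X_toy, y_toy, cv=folds, scoring="r2")
    mean_r2[alph] = scores.mean()

# The penalty with the highest mean R^2, and its R^2 on each individual fold.
best_alpha = max(mean_r2, key=mean_r2.get)
best_fold_scores = cross_val_score(
    Ridge(alpha=best_alpha), X_toy, y_toy, cv=folds, scoring="r2"
)
print(best_alpha, mean_r2[best_alpha])
print(best_fold_scores)
```

Because the `KFold` object is built with an integer `random_state`, it yields the same splits each time it is used, so every $\lambda$ is scored on identical folds.  Changing the seed changes the folds and may change which $\lambda$ comes out on top.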

## 4. Evaluating Our Best Model [1.5 points]

**a)** Refit the model with the optimal $\lambda$ found in Question 3 to the full training data.  Then, print the resulting coefficients.

**b)** Calculate the $R^2$ on the test data for the model fit in **4a**.
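A minimal sketch of refitting a ridge model on a full training set and scoring it on held-out data follows.  It uses synthetic data, and `chosen_alpha = 0.5` is a placeholder rather than the answer to Question 3; all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic stand-in data (NOT the climate data).
rng = np.random.default_rng(3)
X_all = rng.normal(size=(120, 3))
y_all = X_all @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=120)

# Hold out the last 20 rows, mimicking a split by time rather than at random.
X_tr, X_te = X_all[:100], X_all[100:]
y_tr, y_te = y_all[:100], y_all[100:]

# Placeholder penalty value; substitute the value selected by cross-validation.
chosen_alpha = 0.5
ridge_toy = Ridge(alpha=chosen_alpha).fit(X_tr, y_tr)
print(ridge_toy.coef_)

# Two equivalent ways to compute R^2 on the held-out rows.
r2_from_score = ridge_toy.score(X_te, y_te)
r2_from_metric = r2_score(y_te, ridge_toy.predict(X_te))
print(r2_from_score, r2_from_metric)
```

Unlike the training $R^2$, the test $R^2$ can be negative: that happens when the model predicts the held-out responses worse than simply using their mean.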

Remember to keep all your cells and hit the save icon above periodically to checkpoint (save) your results on your local computer. Once you are satisfied with your results, restart the kernel and run all (Kernel -> Restart & Run All). **Make sure nothing has changed**. Checkpoint and exit (File -> Save and Checkpoint + File -> Close and Halt). Follow the instructions on the Homework 12 Canvas Assignment to submit your notebook to GitHub.