<h1>Climate Change</h1>
Many studies have documented that the average global temperature has been increasing over the last century. A continued rise in global temperature would have dire consequences: rising sea levels and more frequent extreme weather events would affect billions of people.

In this problem, we will attempt to study the relationship between average global temperature and several other factors.

The file climate_change.csv contains climate data from May 1983 to December 2008. The available variables include:

<p>Year: the observation year.</p>
<p>Month: the observation month.</p>
<p>Temp: the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia.</p>
<p>CO2, N2O, CH4, CFC.11, CFC.12: atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division.</p>
<p>CO2, N2O and CH4 are expressed in ppmv (parts per million by volume; for example, 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere).</p>
<p>CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume).</p>
<p>Aerosols: the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun's energy is reflected back into space. This data is from the Goddard Institute for Space Studies at NASA.</p>
<p>TSI: the total solar irradiance (TSI) in W/m<sup>2</sup> (the rate at which the sun's energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website.</p>
<p>MEI: multivariate El Nino Southern Oscillation index (MEI), a measure of the strength of the El Nino/La Nina-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division.</p>

In [24]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore')

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.feature_selection import RFE

#np.random.seed(123456)

<h2>Problem 1.1 - Creating Our First Model</h2>

<p>We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset climate_change.csv into a dataframe.</p>

<p>Then, split the data into a training set, consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years (hint: filter the dataframe on the Year column). A training set refers to the data that will be used to build the model (this is the data we pass to the model-fitting function), and a testing set refers to the data we will use to test our predictive ability.</p>

<p>Next, build a linear regression model to predict the dependent variable Temp, using MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables (Year and Month should NOT be used in the model). Use the training set to build the model.</p>

Enter the model R2 (the "Multiple R-squared" value):

In [2]:
data = pd.read_csv("climate_change.csv")

In [3]:
df_train = data[data.Year<=2006]
df_test = data[data.Year > 2006]

In [4]:
np.random.seed(9876789)

In [5]:
df_train.head()

Unnamed: 0,Year,Month,MEI,CO2,CH4,N2O,CFC-11,CFC-12,TSI,Aerosols,Temp
0,1983,5,2.556,345.96,1638.59,303.677,191.324,350.113,1366.1024,0.0863,0.109
1,1983,6,2.167,345.52,1633.71,303.746,192.057,351.848,1366.1208,0.0794,0.118
2,1983,7,1.741,344.15,1633.22,303.795,192.818,353.725,1366.285,0.0731,0.137
3,1983,8,1.13,342.25,1631.35,303.839,193.602,355.633,1366.4202,0.0673,0.176
4,1983,9,0.428,340.17,1648.4,303.901,194.392,357.465,1366.2335,0.0619,0.149


In [6]:
X_train = df_train.drop(['Year','Month','Temp'],axis=1)
y_train = df_train[['Temp']]
X_test = df_test.drop(['Year','Month','Temp'],axis=1)
y_test = df_test[['Temp']]

In [7]:
X_train = sm.add_constant(X_train)


In [8]:
# Fit ordinary least squares on the training set; X_train already contains the intercept column 'const'.
model = sm.OLS(y_train,X_train)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                   Temp   R-squared:                       0.751
Model:                            OLS   Adj. R-squared:                  0.744
Method:                 Least Squares   F-statistic:                     103.6
Date:                Sat, 12 Dec 2020   Prob (F-statistic):           1.94e-78
Time:                        02:05:04   Log-Likelihood:                 280.10
No. Observations:                 284   AIC:                            -542.2
Df Residuals:                     275   BIC:                            -509.4
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -124.5943     19.887     -6.265      0.0

<p> Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? </p>

In [9]:
X_train.corr()

Unnamed: 0,const,MEI,CO2,CH4,N2O,CFC-11,CFC-12,TSI,Aerosols
const,,,,,,,,,
MEI,,1.0,-0.041147,-0.033419,-0.05082,0.069,0.008286,-0.154492,0.340238
CO2,,-0.041147,1.0,0.87728,0.97672,0.51406,0.85269,0.177429,-0.356155
CH4,,-0.033419,0.87728,1.0,0.899839,0.779904,0.963616,0.245528,-0.267809
N2O,,-0.05082,0.97672,0.899839,1.0,0.522477,0.867931,0.199757,-0.337055
CFC-11,,0.069,0.51406,0.779904,0.522477,1.0,0.868985,0.272046,-0.043921
CFC-12,,0.008286,0.85269,0.963616,0.867931,0.868985,1.0,0.255303,-0.225131
TSI,,-0.154492,0.177429,0.245528,0.199757,0.272046,0.255303,1.0,0.052117
Aerosols,,0.340238,-0.356155,-0.267809,-0.337055,-0.043921,-0.225131,0.052117,1.0
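Reading the answer off the matrix by eye is error-prone, so the threshold check can also be done in code. The sketch below is only an illustration: it copies the N2O column from the table above into a pandas Series, and the helper name `highly_correlated` is hypothetical, not part of the original notebook.

```python
import pandas as pd

# N2O column of the training-set correlation matrix, copied from the table above.
n2o_corr = pd.Series(
    {
        "MEI": -0.05082,
        "CO2": 0.97672,
        "CH4": 0.899839,
        "N2O": 1.0,
        "CFC-11": 0.522477,
        "CFC-12": 0.867931,
        "TSI": 0.199757,
        "Aerosols": -0.337055,
    },
    name="N2O",
)


def highly_correlated(corr_column, threshold=0.7):
    """Return the labels whose absolute correlation with the column's own
    variable exceeds the threshold, excluding the variable itself."""
    others = corr_column.drop(corr_column.name)
    return others[others.abs() > threshold].index.tolist()


print(highly_correlated(n2o_corr))  # ['CO2', 'CH4', 'CFC-12']
```

In the notebook itself the same helper could be called as `highly_correlated(X_train.corr()["N2O"])`; the all-NaN `const` entry never passes the comparison, so it is filtered out automatically.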


<h2>Problem 3 - Simplifying the Model </h2>

<p>Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.</p>

In [21]:
new_train =  X_train[['MEI','TSI','Aerosols','N2O']]
new_train = sm.add_constant(new_train)
model1 = sm.OLS(y_train,new_train)
results1 = model1.fit()
print(results1.summary())

                            OLS Regression Results                            
Dep. Variable:                   Temp   R-squared:                       0.726
Model:                            OLS   Adj. R-squared:                  0.722
Method:                 Least Squares   F-statistic:                     184.9
Date:                Sat, 12 Dec 2020   Prob (F-statistic):           3.52e-77
Time:                        02:09:55   Log-Likelihood:                 266.64
No. Observations:                 284   AIC:                            -523.3
Df Residuals:                     279   BIC:                            -505.0
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -116.2269     20.223     -5.747      0.0

In [59]:
clmModel = LinearRegression()
rfe = RFE(clmModel)
rfe.fit(X_train.drop(['const'],axis=1), y_train)

RFE(estimator=LinearRegression())
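When `n_features_to_select` is not given, scikit-learn's RFE keeps half of the input features, so the fit above retains four of the eight predictors. The following is a minimal sketch on synthetic data (the feature names `f0` to `f7` and the coefficients are made up for illustration) showing how to inspect which features survive through `support_` and `ranking_`.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): only f0 and f3 drive the target.
rng = np.random.default_rng(0)
feature_names = [f"f{i}" for i in range(8)]
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# With the default n_features_to_select=None, RFE keeps half of the features.
selector = RFE(LinearRegression())
selector.fit(X, y)

# support_ is a boolean mask over the input columns; ranking_ is 1 for kept features.
kept = [name for name, keep in zip(feature_names, selector.support_) if keep]
print(selector.n_features_, kept)
print(dict(zip(feature_names, selector.ranking_)))
```

For the model fitted above, `X_train.drop(['const'], axis=1).columns[rfe.support_]` would list the retained columns. Note that with a linear estimator RFE ranks features by coefficient magnitude, so the outcome depends on the scale of each predictor.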

In [60]:
print(r2_score(y_pred=rfe.predict(X_train.drop(['const'],axis=1)),y_true=y_train))

0.7261321279511104


In [62]:
rfe.predict(X_test)

array([[0.50424254],
       [0.47500149],
       [0.45062686],
       [0.43946052],
       [0.45083106],
       [0.41885274],
       [0.42277342],
       [0.41654133],
       [0.37051579],
       [0.37979078],
       [0.37698739],
       [0.38379337],
       [0.39889735],
       [0.37618515],
       [0.35521685],
       [0.40470021],
       [0.44461922],
       [0.47276726],
       [0.46242522],
       [0.44429193],
       [0.42265441],
       [0.4210011 ],
       [0.43879308],
       [0.43941919]])

In [64]:
from sklearn.metrics import mean_squared_error

In [67]:
y_test

Unnamed: 0,Temp
284,0.601
285,0.498
286,0.435
287,0.466
288,0.372
289,0.382
290,0.394
291,0.358
292,0.402
293,0.362


In [68]:
X_test_pred = rfe.predict(X_test)

In [69]:
np.sqrt(mean_squared_error(y_pred=X_test_pred,y_true=y_test))

0.11084837772392987

In [77]:
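# Out-of-sample R^2: test-set squared error relative to a baseline that always predicts the training-set mean.
SSE = sum(np.power((X_test_pred - y_test.values),2))
SST = sum((np.mean(y_train.values) - y_test.values)**2)
# The built-in sum over (n, 1) arrays returns one-element arrays, so rsq prints as [value].
rsq = 1-SSE/SST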

In [78]:
print(rsq)

[0.49677949]
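The two test-set metrics computed above can be wrapped in small helpers that flatten their inputs first, which avoids the one-element-array result. This is a sketch under the same definition used above (the baseline is the training-set mean); the function names and toy numbers are illustrative and not taken from the dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error; inputs may be 1-D or (n, 1) shaped."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def out_of_sample_r2(y_true, y_pred, y_train):
    """1 - SSE/SST, where SST measures error of predicting the training mean."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_train)) ** 2)
    return float(1.0 - sse / sst)


# Toy values chosen so the arithmetic is easy to check by hand.
y_train_toy = np.array([[0.0], [1.0], [2.0]])  # training mean is 1.0
y_test_toy = np.array([[2.0], [4.0]])
y_pred_toy = np.array([[2.5], [3.5]])

print(rmse(y_test_toy, y_pred_toy))                           # 0.5
print(out_of_sample_r2(y_test_toy, y_pred_toy, y_train_toy))  # 0.95
```

With the notebook's variables, the equivalent calls would be `rmse(y_test, X_test_pred)` and `out_of_sample_r2(y_test, X_test_pred, y_train)`.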
