Week 08 (W53 Jan18) Global Climate Dataset
This week we tried to reduce the dimensionality of our variables and to determine which variables contribute most to the rise in average temperature (Tavg) in countries around the world. Since we do not know a priori which method yields the best result in terms of accuracy and performance, we execute all of the necessary steps for both approaches: principal component analysis/regression (PCA/PCR) and partial least squares regression (PLSR). As a measure of how well the output vector is approximated, we used the Pearson correlation coefficient (PCC).
The PCC is defined as:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
where cov is the covariance and σ the standard deviation. The value of the PCC ranges from -1 to +1. A positive sign means that Y grows linearly with X; with a negative PCC, Y falls as X increases. Either way, an absolute value close to 1 means that the correlation is high and the scatter is therefore low. In our case we calculate the PCC between the fitted and the observed response (so X = "fitted Y" and Y = "observed Y" in the above equation).
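As a quick illustration of the definition, the following minimal sketch computes the PCC between a fitted and an observed response using only the Python standard library. The function name `pearson_cc` and the sample numbers are made up for this example and are not taken from our data.

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of cov, sigma_X and sigma_Y cancel, so they are omitted.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only: X = fitted response, Y = observed response.
fitted = [1.0, 2.1, 2.9, 4.2]
observed = [1.1, 1.9, 3.2, 3.9]
r = pearson_cc(fitted, observed)
print(f"PCC = {r:.4f}")
```

A value close to +1 here would indicate that the fitted response tracks the observed one with little scatter.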
We therefore want to find the point at which the PCC reaches its maximum while discarding one variable per iteration: in every iteration, we discard the variable with the smallest weight and recalculate the PCC. The outcome can be plotted against the number of discarded variables and inspected for a maximum. The X-value at the maximum gives the number of variables we can drop while preserving the maximum PCC.
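The elimination loop can be sketched as follows. This is not our actual pipeline: as a stand-in for the PCR/PLSR weights it assumes ordinary least-squares coefficients on standardized predictors, it runs on synthetic data, and the names `pcc_elimination_curve` and `max_droppable` are hypothetical.

```python
import numpy as np

def pcc_elimination_curve(X, y):
    """Drop the smallest-|weight| predictor each round and record the PCC.

    Assumption: weights are least-squares coefficients on standardized
    columns, standing in for the PCR/PLSR weights described in the text.
    Returns (pccs, dropped) where pccs[k] is the PCC after k drops.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize so weights are comparable
    yc = y - y.mean()
    remaining = list(range(X.shape[1]))
    dropped, pccs = [], []
    while remaining:
        coef, *_ = np.linalg.lstsq(Xs[:, remaining], yc, rcond=None)
        fitted = Xs[:, remaining] @ coef
        pccs.append(np.corrcoef(fitted, yc)[0, 1])  # PCC of fitted vs. observed
        weakest = int(np.argmin(np.abs(coef)))
        dropped.append(remaining.pop(weakest))
    return np.array(pccs), dropped

def max_droppable(pccs, tol=1e-3):
    """Largest number of drops whose PCC stays within tol of the maximum."""
    ok = np.flatnonzero(pccs >= pccs.max() - tol)
    return int(ok[-1])

# Synthetic data: only columns 0 and 1 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

pccs, dropped = pcc_elimination_curve(X, y)
print("drop order:", dropped)
print("variables we can drop:", max_droppable(pccs))
```

With an in-sample least-squares fit the PCC cannot increase when a variable is removed, so this sketch reads off the last point that stays within a small tolerance of the maximum instead of a strict maximum. The tolerance `tol` is an assumption of this example.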
Global Climate Data (GCD) : Main Dataset
- Number of files: 100,791
- Format: .dly files (GHCN-Daily fixed-width text records)
- Size: 26.5 GB
- Features: 46
- Source Date: 1763 - 2015
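To give an idea of how the .dly files can be read, here is a minimal parser sketch. It assumes the fixed-width layout described in the GHCN-Daily readme (11-character station ID, 4-digit year, 2-digit month, 4-character element code, then 31 daily slots of a 5-character value plus three flag characters, with -9999 marking a missing value). The sample record is synthetic and the station ID is a placeholder.

```python
def parse_dly_line(line):
    """Parse one fixed-width GHCN-Daily record into a dict (flags are ignored)."""
    record = {
        "station": line[0:11],
        "year": int(line[11:15]),
        "month": int(line[15:17]),
        "element": line[17:21],
        "values": [],
    }
    for day in range(31):
        start = 21 + 8 * day                 # each day uses 5 value chars + 3 flag chars
        value = int(line[start:start + 5])
        record["values"].append(None if value == -9999 else value)
    return record

# Synthetic record: two observed days, the remaining 29 days missing.
daily = [12, -35] + [-9999] * 29
sample = "XX000000001" + "2015" + "01" + "TAVG" + "".join(f"{v:5d}   " for v in daily)

rec = parse_dly_line(sample)
print(rec["station"], rec["year"], rec["month"], rec["element"], rec["values"][:3])
```

Temperature elements such as TAVG are stored in tenths of degrees Celsius according to the readme, so the values would still need to be divided by 10 before analysis.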
World Bank (WB) : Complementary Dataset
- Number of files: 1
- Format: .csv
- Size: ~15 MB
- Features: 82
- Source Date: 1960 - 2015

It's often useful to choose the number of components to minimize the expected error when predicting the response from future observations on the predictor variables. Simply using a large number of components will do a good job in fitting the current observed data, but is a strategy that leads to overfitting. Fitting the current data too well results in a model that does not generalize well to other data, and gives an overly-optimistic estimate of the expected error. Cross-validation is a more statistically sound method for choosing the number of components in either PLSR or PCR. It avoids overfitting data by not reusing the same data to both fit a model and to estimate prediction error. Thus, the estimate of prediction error is not optimistically biased downwards.
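The idea can be sketched with a small k-fold cross-validation for PCR written in NumPy. This is an illustrative sketch on synthetic data, not the code we ran; the helper names `pcr_fit` and `cv_mse_per_component`, the fold count and the data are all assumptions of this example.

```python
import numpy as np

def pcr_fit(X, y, n_comp):
    """Fit principal component regression and return a predict function."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    # Rows of Vt are the principal directions of the centered predictors.
    _, _, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    V = Vt[:n_comp].T
    scores = (X - x_mean) @ V
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    beta = V @ gamma  # coefficients mapped back to the original variables
    return lambda X_new: (X_new - x_mean) @ beta + y_mean

def cv_mse_per_component(X, y, max_comp, k=5, seed=0):
    """Cross-validated mean squared error for 1..max_comp components."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    mse = np.zeros(max_comp)
    for n_comp in range(1, max_comp + 1):
        sq_err = 0.0
        for test_idx in folds:
            train = np.ones(len(y), dtype=bool)
            train[test_idx] = False  # held-out rows are never used for fitting
            predict = pcr_fit(X[train], y[train], n_comp)
            sq_err += np.sum((y[test_idx] - predict(X[test_idx])) ** 2)
        mse[n_comp - 1] = sq_err / len(y)
    return mse

# Synthetic data: the response depends on the two highest-variance columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=120)

mse = cv_mse_per_component(X, y, max_comp=5)
best = int(np.argmin(mse)) + 1
print("CV MSE per component count:", np.round(mse, 3))
print("chosen number of components:", best)
```

Because each fold is predicted by a model that never saw it, the error curve typically flattens or rises once extra components only fit noise, which is what makes its minimum a reasonable choice for the number of components.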
- Prediction model
- Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.
- Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.12]. NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ
- WB Dataset - http://data.worldbank.org
- Correlation Analysis - http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Multivariable/BS704_Multivariable5.html
- Climate change impacts on Austrian ski areas, Robert Steiger & Bruno Abegg (Link)
- HFCs? Curbing Them Is Key to Climate-Change Strategy (Op-Ed), Hallie Kennan, Energy Innovation: Policy and Technology (Link)
- How do we know more CO2 is causing warming? (Link)
- Does CO2 always correlate with temperature (and if not, why not?)
- Earth itself is telling us there’s nothing to worry about in doubled, or even quadrupled, atmospheric CO2
- China Exports Pollution to U.S., Study Finds