<a href="https://colab.research.google.com/github/ChadDelany/drought_prediction/blob/main/notebooks/05b_Results_RAPIDS_allFeatures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Drought Prediction: Results**

This notebook was run on Google Colab Pro+, using RAPIDS for GPU processing. Earlier attempts to run the models on CPUs with pandas took 24+ hours and often crashed after exhausting the available resources. With RAPIDS on a GPU, models usually ran within 5 minutes and never took more than 15 minutes. Setting up RAPIDS on Google Colab Pro+ takes between 15 minutes and 1 hour, depending on resource availability.

**All models were initially run with all available variables to determine initial model performance.**

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a RAPIDS-compatible GPU such as a Tesla T4, P4, or P100 (the run recorded below was allocated an A100).

In [None]:
!nvidia-smi

Mon Dec 12 19:29:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
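
The same check can be scripted instead of read by eye. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_query` and `list_gpus` are hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def parse_gpu_query(text):
    """Parse the CSV output of
    `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma inside the GPU name would survive.
        name, memory = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append({"name": name, "memory_total": memory})
    return gpus


def list_gpus():
    """Return the visible GPUs, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_query(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    print(list_gpus() or "No GPU visible: switch the runtime type to GPU.")
```

An empty result means the runtime type should be changed before continuing with the setup cells.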

# Setup
This notebook was built on RAPIDS 0.13 stable and is based on this [DataCamp Tutorial](https://www.datacamp.com/community/tutorials/xgboost-in-python). It was tested and working on 0.19 stable.

The setup script:
1. Updates gcc in Colab
1. Installs Conda
1. Installs RAPIDS' current stable version of its libraries, as well as some external libraries including:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuSignal
  1. BlazingSQL
  1. xgboost
1. Copies RAPIDS .so files into the current working directory, a necessary workaround for RAPIDS+Colab integration.


In [None]:
!pip install pynvml

# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pynvml
  Downloading pynvml-11.4.1-py3-none-any.whl (46 kB)
     |████████████████████████████████| 46 kB 831 kB/s 
Installing collected packages: pynvml
Successfully installed pynvml-11.4.1
Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 308, done.
remote: Counting objects: 100% (137/137), done.
remote: Compressing objects: 100% (82/82), done.
remote: Total 308 (delta 79), reused 98 (delta 55), pack-reused 171
Receiving objects: 100% (308/308), 89.88 KiB | 22.47 MiB/s, done.
Resolving deltas: 100% (141/141), done.
***********************************************************************
Woo! Your instance has the right kind of GPU, a A100-SXM4-40GB!
***********************************************************************



In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!
Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
Ign:2 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease [1,581 B]
Hit:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages [1,073 kB]
Hit:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Get:8 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:10 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:11 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:12 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [83.3 

In [None]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:23
🔁 Restarting kernel...


In [None]:
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!


In [None]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

Found existing installation: cffi 1.15.1
Uninstalling cffi-1.15.1:
  Successfully uninstalled cffi-1.15.1
Found existing installation: cryptography 38.0.4
Uninstalling cryptography-38.0.4:
  Successfully uninstalled cryptography-38.0.4
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cffi==1.15.0
  Downloading cffi-1.15.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (446 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 446.7/446.7 kB 14.4 MB/s eta 0:00:00
Installing collected packages: cffi
Successfully installed cffi-1.15.0
Installing RAPIDS Stable 21.12
Starting the RAPIDS install on Colab.  This will take about 15 minutes.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): ...working... done
failed with initial frozen solve. Retrying with flexible so

## Load RAPIDS libraries

In [None]:
# RAPIDS libraries for accessing GPU processing for running models.  Instead of taking 24+ hours to run models, it only takes 15 minutes or less.
import cudf
import cuml
import cupy

import pandas as pd

import pynvml
import numpy as np


## Load additional Libraries.

In [None]:
# Import scikit-learn metrics.  The RAPIDS metrics appeared to be buggy at the time of writing.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score, StratifiedKFold

# For displaying model metrics in an easily readable table form.
from IPython.display import HTML, display
import tabulate

# Functions

## Model Accuracy Assessment for Regression.

In [None]:
# Function for Model Accuracy Assessment for Regression.  Inputs are expected as pandas objects because the RAPIDS metrics appeared to be buggy.
def reg_metric(y_test, y_pred):
  # Calculate each metric, rounded to three decimals and stored as a string for display.
  mse_value = mean_squared_error(y_test, y_pred)
  r2 = str(round(r2_score(y_test, y_pred), 3))
  mse = str(round(mse_value, 3))
  rmse = str(round(np.sqrt(mse_value), 3))
  mae = str(round(mean_absolute_error(y_test, y_pred), 3))

  # Create a table for display.
  metric = [['Metric', 'Value'],
            ['R**2:', r2],
            ['MSE:', mse],
            ['RMSE:', rmse],
            ['MAE:', mae]]
  table = tabulate.tabulate(metric, tablefmt='html')
  display(HTML(table))

  # Return metric values as strings for later display.
  return r2, mse, rmse, mae
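
For reference, the four regression metrics reduce to a few lines of arithmetic. The sketch below is a dependency-free illustration of the standard definitions, not the scikit-learn implementation; the function name `regression_metrics` and the toy values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, MSE, RMSE and MAE from two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares and total sum of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "r2": 1 - ss_res / ss_tot,  # undefined when y_true is constant
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Toy values for illustration only, not project data.
metrics = regression_metrics([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 5.0])
print(metrics)
```

RMSE is simply the square root of MSE, which is why `reg_metric` can derive both from one call to `mean_squared_error`.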

## Coefficients from Regression Models.

In [None]:
# Function to pull coefficients from a fitted RAPIDS regression model.
def reg_coefs(model):
  # Get coefficients from the RAPIDS model and move them to pandas.
  coefs = model.coef_
  coefs = coefs.to_pandas()

  # Associate variable names with coefficients.
  # Uses the notebook-level Xtrain feature table (X_train is not defined in this notebook).
  features = Xtrain.columns
  
  # Label the pandas Series with the feature names.
  coefs = coefs.set_axis(features)

  return coefs

## Model Accuracy Assessment for Classification.

In [None]:
# Function for calculating classification metrics.
def class_metric(ycat_test, y_pred):
  # Overall accuracy as a percentage.
  accuracy = str(np.round(cuml.metrics.accuracy.accuracy_score(ycat_test, y_pred), 3) * 100)
  # Note: cuML documents roc_auc_score for binary classification on prediction scores.
  # Here it receives multiclass hard labels, so interpret this value with caution.
  roc_auc = str(np.round(cuml.metrics.roc_auc_score(ycat_test, y_pred), 3))

  # Confusion matrix normalized over predictions (each column sums to 100%),
  # so the diagonal holds the share of each predicted class that was correct.
  cp = np.round(cuml.metrics.confusion_matrix(ycat_test, y_pred, normalize='pred'), 3) * 100
  pred_perclass = [cp[i][i].get() for i in range(cp.shape[0])]

  # Mean and spread of the diagonal values across classes.
  cp_mean = str(np.round(np.mean(pred_perclass), 1))
  cp_std = str(np.round(np.std(pred_perclass), 1))

  print(f'Accuracy Score: {accuracy}%')
  print(f'ROC AUC: {roc_auc}')
  print(f'Mean Accuracy per Class & Standard Deviation: {cp_mean}% +/- {cp_std}%')
  print(cp)

  return accuracy, roc_auc, cp, cp_mean, cp_std
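
To make the effect of `normalize='pred'` concrete, here is a small pure-Python sketch under the scikit-learn convention (rows are true classes, columns are predicted classes), which cuML is assumed to follow here. Dividing each column by its total puts per-class precision on the diagonal, while dividing each row by its total would give per-class recall. The helper names and toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with rows = true class and columns = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def normalize_over_predictions(matrix):
    """Divide every column by its total; empty columns become 0.0."""
    n = len(matrix)
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [
        [matrix[r][c] / col_totals[c] if col_totals[c] else 0.0 for c in range(n)]
        for r in range(n)
    ]


# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 0, 0]

counts = confusion_counts(y_true, y_pred, 3)
by_pred = normalize_over_predictions(counts)

# Diagonal of the column-normalized matrix: per-class precision.
precision = [by_pred[k][k] for k in range(3)]
# Diagonal divided by row totals: per-class recall.
recall = [counts[k][k] / sum(counts[k]) for k in range(3)]
print(precision, recall)
```

A class that is rarely predicted can therefore show an unstable diagonal value, because its column total is very small.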

# Load Dataset.

In [None]:
# Local location of the data

# Location on Windows
# local_data = 'D:\\Data_Science\\DroughtProject\\Data\\' 

# Location on Linux
# local_data = '/home/chad/Data/Drought_Prediction/' 

# Load local data into Google Colab
# from google.colab import files
# files = files.upload()

In [None]:
# Accessing Google Drive by mounting it locally
# https://towardsdatascience.com/7-ways-to-load-external-data-into-google-colab-7ba73e7d5fc7
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Location on Google Drive
local_data = '/content/drive/MyDrive/Colab Notebooks/'

In [None]:
# Load the training dataset: meteorological variables resampled weekly (mean, max, min)
# merged with the soil variables on the county 'fips' value.
# This version of the file has already been standardized to zero mean and unit variance with StandardScaler.

tsm = cudf.read_csv(local_data + 'train_soil_stats_scaled.csv',
                    parse_dates=['date'],
                    index_col=['index'],
                    header=0)
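
As a reminder of what that scaling does, the sketch below standardizes one column by hand: subtract the mean and divide by the population standard deviation, which is the definition scikit-learn's `StandardScaler` uses. It is an illustration only, since the scaled CSV files were produced outside this notebook; the function name `standardize` and the sample values are hypothetical.

```python
import statistics


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy column for illustration only (mean 5, population std 2).
scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled)
```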

In [None]:
# Load the held-out test/validation dataset: meteorological variables resampled weekly (mean, max, min)
# merged with the soil variables on the county 'fips' value.
# This version of the file has already been standardized to zero mean and unit variance with StandardScaler.

testval = cudf.read_csv(local_data + 'testval_soil_stats_scaled.csv',
                        parse_dates=['date'],
                        index_col=['index'],
                        header=0)

In [None]:
# Unmount Google Drive.
drive.flush_and_unmount()

## Select Features and Target for Models using Full Set of Training Data

In [None]:
# Breaking out independent numerical variables from target variable, categorical variable ('fips'), and date.
cols = tsm.columns.tolist()
features = cols[3:]

# Separating out the features
Xtrain = tsm[features]

# Separating out the target
ytrain = tsm[['score']]
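
Slicing with `cols[3:]` relies on the target, `fips`, and date columns being the first three columns of the file. A minimal sketch of a name-based alternative follows; the helper `select_feature_columns` and the example feature names are hypothetical, and only `fips`, `date`, and `score` come from the notebook.

```python
# Columns that are not model inputs (names taken from the notebook).
NON_FEATURE_COLUMNS = {"fips", "date", "score"}


def select_feature_columns(columns, excluded=NON_FEATURE_COLUMNS):
    """Keep every column that is not an identifier, date, or target."""
    return [c for c in columns if c not in excluded]


# Hypothetical column list for illustration only.
example_columns = ["fips", "date", "score", "PRECTOT_mean", "T2M_max", "elevation"]
feature_names = select_feature_columns(example_columns)
print(feature_names)
```

Selecting by name keeps the feature list correct even if the column order of the CSV changes.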

In [None]:
# Converting 'y' from a cuDF DataFrame to a cuDF Series to avoid type conflicts when running RAPIDS models.
ytrain = ytrain['score']

# Convert X from float64 to float32 in order to utilize GPU processing instead of CPU processing.  RAPIDS currently does not support float64.
Xtrain = Xtrain.astype('float32')

# Convert y from float64 to float32 in order to utilize GPU processing instead of CPU processing.
ytrain = ytrain.astype('float32')

# Create target for classification models.  Drought Score was originally an integer class ranging 0 - 5.
ytrain_cat = np.round(ytrain,0)
ytrain_cat = ytrain_cat.astype(int)
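
One detail worth knowing when rounding scores into classes: NumPy's `np.round` and Python's built-in `round` both send exact halves to the nearest even integer, so 0.5 becomes 0 and 1.5 becomes 2. The sketch below shows this with the built-in; the helper name and sample scores are hypothetical, and it is an assumption that rounding a cuDF Series behaves the same way.

```python
def to_drought_class(score):
    """Round a continuous drought score to the nearest integer class.

    Exact halves go to the nearest even integer (round-half-to-even).
    """
    return round(score)


# Toy scores for illustration only.
examples = [0.4, 0.5, 1.5, 2.5, 4.6]
classes = [to_drought_class(s) for s in examples]
print(classes)
```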

## Select Features and Target for Models from Final Test Dataset

In [None]:
# Breaking out independent numerical variables from target variable, categorical variable ('fips'), and date.
cols = testval.columns.tolist()
features = cols[3:]

# Separating out the features
Xtest = testval[features]

# Separating out the target
ytest = testval[['score']]

In [None]:
# Converting 'y' from a cuDF DataFrame to a cuDF Series to avoid type conflicts when running RAPIDS models.
ytest = ytest['score']

# Convert X from float64 to float32 in order to utilize GPU processing instead of CPU processing.  RAPIDS currently does not support float64.
Xtest = Xtest.astype('float32')

# Convert y from float64 to float32 in order to utilize GPU processing instead of CPU processing.
ytest = ytest.astype('float32')

# Create target for classification models.  Drought Score was originally an integer class ranging 0 - 5.
ytest_cat = np.round(ytest,0)
ytest_cat = ytest_cat.astype(int)

## Random Forest Classification Model

#### Random Forest Classifier, max_depth=100, n_estimators=300

In [None]:
# Train the best verified model on all training data.
# Random Forest Classifier model, max_depth = 100, n_estimators = 300
RFclass_model = cuml.ensemble.RandomForestClassifier(max_depth=100, n_estimators=300)
RFclass_model.fit(Xtrain, ytrain_cat)

RandomForestClassifier()

In [None]:
# Perform Test on Previously Withheld Test Data.
y_pred = RFclass_model.predict(Xtest)
y_pred = y_pred.astype(int)

In [None]:
# Accuracy Assessment for Classification Model.
RF_acc, RF_roc, RF_cp, RF_cpMean, RF_cpSTD = class_metric(ytest_cat, y_pred)

Accuracy Score: 73.7%
ROC AUC: 0.515
Mean Accuracy per Class & Standard Deviation: 18.8% +/- 26.5%
[[76.1 68.1 58.1 56.7 61.5 60. ]
 [12.9 17.2 20.4 23.7 23.   0. ]
 [ 6.7  9.1 14.6 12.9 11.5  0. ]
 [ 2.9  3.9  4.4  4.8  4.1 20. ]
 [ 1.1  1.5  2.2  1.6  0.  20. ]
 [ 0.3  0.3  0.2  0.2  0.   0. ]]
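
Overall accuracy is easier to interpret next to a trivial baseline that always predicts the most common class. The sketch below computes that baseline in plain Python. The function name and the toy labels are hypothetical, and no baseline figure for the drought data is implied; in the notebook the same function could be applied to the test labels after converting them to a Python list.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Return the most common class and the accuracy of always predicting it."""
    counts = Counter(y_true)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(y_true)


# Toy labels for illustration only, not the drought data.
toy_labels = [0] * 7 + [1] * 2 + [2]
majority_class, baseline = majority_baseline_accuracy(toy_labels)
print(f"Always predicting class {majority_class} scores {baseline:.1%}")
```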


# **Conclusions**

## **Best Model: Random Forest (max_depth=100, n_estimators=300)**

In [None]:
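# Per-class values for each Drought Score category, taken from the diagonal of the
# confusion matrix above (normalized over predictions, i.e. the share of each predicted class that was correct).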
metric = [['Class', 'Accuracy'],
          ['0', '76.1%'],
          ['1', '17.2%'],
          ['2', '14.6%'],
          ['3', '4.8%'],
          ['4', '0%'],
          ['5', '0%']]
table = tabulate.tabulate(metric, tablefmt='html')

display(HTML(table))

| Class | Accuracy |
|---|---|
| 0 | 76.1% |
| 1 | 17.2% |
| 2 | 14.6% |
| 3 | 4.8% |
| 4 | 0% |
| 5 | 0% |


**Conclusion and Next Steps:**<br> This process produced a viable model and demonstrated the usefulness of random forests for this problem. However, it was not able to single out a small set of key variables. The next steps to improve this model would be:

- Allocate more resources so that training on the entire training dataset can be used to extract key variables.
- Subset the training dataset and rerun the models to allow a standard cross-validation procedure and the use of tools that identify important input variables.
- Incorporate ordinality information into the classification schema. 
- Incorporate a time series analysis that capitalizes on the time nature of the data.
- Use a recurrent neural network to build a time series model.
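
As an illustration of the ordinality step, one common approach (not something this notebook implements) is to recode each class as a set of cumulative yes/no questions, "is the score greater than j?", and to judge predictions by how many classes they miss by rather than by exact matches alone. The sketch below is a minimal pure-Python version; all function names are hypothetical.

```python
def ordinal_targets(label, n_classes=6):
    """Encode class k as answers to 'is the class greater than j?' for j = 0..n_classes-2."""
    return [1 if label > j else 0 for j in range(n_classes - 1)]


def decode_ordinal(probabilities, threshold=0.5):
    """Recover a class by counting consecutive thresholds predicted as exceeded."""
    label = 0
    for p in probabilities:
        if p < threshold:
            break
        label += 1
    return label


def mean_absolute_class_error(y_true, y_pred):
    """Average distance between true and predicted classes, in class steps."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only.
encoded = ordinal_targets(3)
decoded = decode_ordinal([0.9, 0.8, 0.4, 0.1, 0.0])
distance = mean_absolute_class_error([0, 3, 5], [0, 1, 4])
print(encoded, decoded, distance)
```

With this kind of metric, predicting class 4 for a true class 5 is penalized less than predicting class 0, which plain accuracy cannot express.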

The initial overall accuracy of the Random Forest model is 73.7%. With additional time and resources, these models could plausibly reach an accuracy above 80%, especially once the ordinality of the drought scores and, even more importantly, the information contained in the time series are incorporated. Additionally, a recurrent neural network may be able to take advantage of the size of the dataset. Given the changing climate and the integration of economies throughout the present-day world, understanding and accurately predicting drought is an important first step in adapting to changing environmental conditions and maintaining a viable global economy. Being able to predict drought from simple variables, without overly complex models, would allow these models to be applied worldwide.