# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Run `!nvidia-smi` in the cell below to see which GPU you have.  Currently, RAPIDS runs on all available Colab GPU instances.

In [None]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found
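
The `command not found` output above typically means the runtime has no GPU attached, or no NVIDIA driver is installed. The same check can be made from Python before running the install cell. This is a minimal sketch, not part of the RAPIDS template: the helper name `gpu_available` is hypothetical, and it only tests whether `nvidia-smi` is on the `PATH` and exits cleanly.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi is on PATH and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Same situation as "nvidia-smi: command not found" in the shell.
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if not gpu_available():
    print("No usable GPU detected; switch the runtime type to GPU before installing RAPIDS.")
```

If this reports no GPU, changing the runtime type as described at the top of this section and re-running the check is the likely remedy.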


# Setup #
This setup script:

1. Checks to make sure that the GPU is RAPIDS compatible
1. Installs the **current stable version** of RAPIDSAI's core libraries using pip, which are:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuxFilter
  1. cuCIM
  1. xgboost

**This will complete in about 5-6 minutes**

If you need the **nightly** releases of RAPIDSAI, use the [RAPIDS Conda Colab Template notebook](https://colab.research.google.com/drive/1TAAi_szMfWqRfHVfjGSqnGVLr_ztzUM9) and select the nightly parameter option when running the RAPIDS installation cell.


In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 476, done.
remote: Counting objects: 100% (207/207), done.
remote: Compressing objects: 100% (116/116), done.
remote: Total 476 (delta 141), reused 124 (delta 91), pack-reused 269
Receiving objects: 100% (476/476), 131.59 KiB | 1.34 MiB/s, done.
Resolving deltas: 100% (243/243), done.
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 820.9 kB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 1798, in _LoadNvmlLibrary
    nvmlLib = CDLL("libnvidia-ml.so.1")
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

During handling of the above excep

# RAPIDS is now installed on Colab.  
You can copy your code into the cells below, or use them to validate your RAPIDS installation and version.  
# Enjoy!

In [None]:
import cudf
cudf.__version__

'24.04.01'

In [None]:
import cuml
cuml.__version__

'24.04.00'

In [None]:
import cugraph
cugraph.__version__

'23.10.00'

In [None]:
import cuspatial
cuspatial.__version__

'23.10.00'

In [None]:
import cuxfilter
cuxfilter.__version__

'23.10.00'
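
The per-library cells above can also be collapsed into a single loop. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and a module that fails to import for any reason is reported as `None` rather than raising.

```python
import importlib


def installed_versions(names):
    """Map each module name to its __version__, or None if the import fails."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:  # missing package, or a GPU library failing to initialise
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    rapids_modules = ["cudf", "cuml", "cugraph", "cuspatial", "cuxfilter"]
    for name, version in installed_versions(rapids_modules).items():
        print(f"{name}: {version if version is not None else 'not importable'}")
```

Printing all versions side by side makes it easier to spot libraries that come from different RAPIDS releases.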

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib

In [None]:
from google.colab import drive
drive.mount('/content/drive')
data_source = "drive/MyDrive/Thesis Nina"
import os
print(os.listdir(data_source))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
['scales']
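
The listing above shows only a `scales` folder, while the loading cell later in this notebook reads six pickle files from a `scale_data` subfolder, so it may be worth confirming that the files exist before training. This is a minimal sketch using `pathlib`; the helper name `missing_files` is hypothetical, and the file names are taken from that loading cell.

```python
from pathlib import Path

# File names read by the loading cell: X_train.pkl, X_val.pkl, ..., y_test.pkl
EXPECTED = [f"{prefix}_{split}.pkl" for prefix in ("X", "y") for split in ("train", "val", "test")]


def missing_files(folder, names=EXPECTED):
    """Return the names from `names` that are not regular files inside `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


# Path used by the notebook; adjust it if your Drive layout differs.
missing = missing_files("drive/MyDrive/Thesis Nina/scale_data")
if missing:
    print("Missing files:", missing)
else:
    print("All expected files found.")
```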


In [None]:
import pandas as pd
import numpy as np
import os

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
# cuML's GPU random forest is the one used below. Importing scikit-learn's class
# under the same name would only be shadowed by this import.
from cuml.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.exceptions import ConvergenceWarning

from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVR
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, r2_score

import time

import warnings

X_train = pd.read_pickle(f'{data_source}/scale_data/X_train.pkl').astype('float32')
X_val = pd.read_pickle(f'{data_source}/scale_data/X_val.pkl').astype('float32')
X_test = pd.read_pickle(f'{data_source}/scale_data/X_test.pkl').astype('float32')
y_train = pd.read_pickle(f'{data_source}/scale_data/y_train.pkl').astype('float32')
y_val = pd.read_pickle(f'{data_source}/scale_data/y_val.pkl').astype('float32')
y_test = pd.read_pickle(f'{data_source}/scale_data/y_test.pkl').astype('float32')

models = {
	'Random forest regressor': make_pipeline(
		RandomForestRegressor(n_estimators=1000, max_features='sqrt', random_state=42, n_streams=1)
	),
	# 'Random forest regressor fitted': make_pipeline(
	# 	RandomForestRegressor(1000, max_features='sqrt', random_state=42, max_depth=12)
	# ),
	'Decision tree regressor': make_pipeline(
		DecisionTreeRegressor()
	),
	'Ridge': make_pipeline(
		Ridge()
	),
	'Lasso': make_pipeline(
		Lasso()
	),
	'ElasticNet': make_pipeline(
		ElasticNet(max_iter=10000)
	),
	'Support vector machine': make_pipeline(
		SVR()
	),
	'PCA + Ridge': make_pipeline(
		PCA(),
		Ridge(),
	),
	"PCA + random forest regressor": make_pipeline(
		PCA(),
		RandomForestRegressor(n_estimators=1000, max_depth=12, max_features='sqrt', random_state=42, n_streams=1)
	),
	# "SelectFromModel + random forest regressor": make_pipeline(
	# 	SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42, n_streams=1)),
	# 	RandomForestRegressor(1000, max_depth=12, max_features='sqrt', random_state=42, n_streams=1)
	# ),
	# "SelectFromModel + random forest regressor fitted": make_pipeline(
	# 	SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42)),
	# 	RandomForestRegressor(1000, max_depth=12, max_features='sqrt', random_state=42)
	# )
}

param_grids = {
	'Random forest regressor': {
		'randomforestregressor__max_depth': [*np.linspace(1, 21, 21).astype(int)]
	},
	'Decision tree regressor': {
		'decisiontreeregressor__max_depth': [*np.linspace(1, 21, 21).astype(int)]
	},
	'Ridge': {
		'ridge__alpha': np.logspace(-3, 3, 7)
	},
	'Lasso': {
		'lasso__alpha': np.logspace(-3, 3, 7)
	},
	'ElasticNet': {
		"elasticnet__alpha": np.logspace(-3, 3, 7),
		"elasticnet__l1_ratio": np.linspace(0, 1, 11),
	},
	'Support vector machine': {
		"svr__kernel": ["linear", "poly", "rbf", "sigmoid"]
	},
	'PCA + Ridge': {
		"pca__n_components": np.linspace(1, X_train.shape[1], X_train.shape[1]).astype(int),
		'ridge__alpha': np.logspace(-3, 3, 7)
	},
	'PCA + random forest regressor': {
		"pca__n_components": np.linspace(1, X_train.shape[1], X_train.shape[1]).astype(int),
		'randomforestregressor__max_depth': [*np.linspace(1, 21, 21).astype(int)]
	},
	"SelectFromModel + random forest regressor": {
		"selectfrommodel__max_features": np.linspace(1, X_train.shape[1], X_train.shape[1]).astype(int),
		'randomforestregressor__max_depth': [*np.linspace(1, 21, 21).astype(int)]
	}

}


data = {
    "Model": ["Random forest regressor", "Decision tree regressor", "PCA + random forest regressor"],
    "DASS_anxiety Optimal parameters": ["max. depth=12", "", "n_components"],
    "DASS_anxiety Training score MSE": ["MSE_value", "MSE_value", "MSE_value"],
    "DASS_anxiety Training score R2": ["R2_value", "R2_value", "R2_value"],
    "DASS_anxiety Validation score MSE": ["MSE_value", "MSE_value", "MSE_value"],
    "DASS_anxiety Validation score R2": ["R2_value", "R2_value", "R2_value"],
    "DASS_stress Optimal parameters": ["max. depth=12", "", "n_components"],
    "DASS_stress Training score MSE": ["MSE_value", "MSE_value", "MSE_value"],
    "DASS_stress Training score R2": ["R2_value", "R2_value", "R2_value"],
    "DASS_stress Validation score MSE": ["MSE_value", "MSE_value", "MSE_value"],
    "DASS_stress Validation score R2": ["R2_value", "R2_value", "R2_value"],
    "DASS_depression Optimal parameters":  ["max. depth=12", "", "n_components"],
    "DASS_depression Training score MSE": ["MSE_value", "MSE_value", "MSE_value"],
    "DASS_depression Training score R2": ["R2_value", "R2_value", "R2_value"],
    "DASS_depression Validation score MSE": ["MSE_value", "MSE_value", "MSE_value"],
    "DASS_depression Validation score R2": ["R2_value", "R2_value", "R2_value"],
}

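# The placeholder values above only document the intended table layout.
# Reset every column to an empty list before collecting real results.
for key in data:
	data[key] = []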

warnings.filterwarnings('ignore', category=ConvergenceWarning)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
# X_train_np = X_train.to_numpy()
# y_train_np = y_train.to_numpy()
# X_val_np = X_val.to_numpy()
# y_val_np = y_val.to_numpy()
# print(models.keys()[])
# for model_name in list(models.keys())[2:4]:

def write_file(data):
	df = pd.DataFrame(data)

	df.columns = pd.MultiIndex.from_tuples([
			('Model',"",""),
			('DASS_anxiety', 'Optimal parameters', ""),
			('DASS_anxiety', 'Training scores', 'MSE'), ('DASS_anxiety', 'Training scores', 'R2'),
			('DASS_anxiety', 'Validation scores', 'MSE'), ('DASS_anxiety', 'Validation scores', 'R2'),
			('DASS_stress', 'Optimal parameters', ""),
			('DASS_stress', 'Training scores', 'MSE'), ('DASS_stress', 'Training scores', 'R2'),
			('DASS_stress', 'Validation scores', 'MSE'), ('DASS_stress', 'Validation scores', 'R2'),
			('DASS_depression', 'Optimal parameters', ""),
			('DASS_depression', 'Training scores', 'MSE'), ('DASS_depression', 'Training scores', 'R2'),
			('DASS_depression', 'Validation scores', 'MSE'), ('DASS_depression', 'Validation scores', 'R2'),
	])


	# Write DataFrame to Excel file
	df.to_excel(f'{data_source}/model_comparison.xlsx', engine='openpyxl', index=True)

for model_name in models:
	model = models[model_name]
	print(model_name)
	data["Model"].append(model_name)

	for column in y_train:
		t = time.time()
		print(column)
		y_train_column, y_val_column = y_train[column], y_val[column]
		grid_search = GridSearchCV(model, param_grids[model_name], cv=cv, scoring='neg_mean_squared_error', verbose=2)
		grid_search.fit(X_train, y_train_column)

		best_params = grid_search.best_params_
		data[f"{column} Optimal parameters"].append(", ".join([f"{parameter} = {best_params[parameter]}" for parameter in best_params]))
		# print("Best cross-validation score (MSE):", -grid_search.best_score_)

		# Predictions on the training and validation sets
		y_train_pred = grid_search.predict(X_train)
		y_val_pred = grid_search.predict(X_val)

		# Calculate MSE for the training and validation sets
		mse_train = mean_squared_error(y_train_column, y_train_pred)
		mse_val = mean_squared_error(y_val_column, y_val_pred)

		data[f"{column} Training score MSE"].append(mse_train)
		data[f"{column} Validation score MSE"].append(mse_val)

		# Calculate R2 for the training and validation sets
		r2_train = r2_score(y_train_column, y_train_pred)
		r2_val = r2_score(y_val_column, y_val_pred)

		data[f"{column} Training score R2"].append(r2_train)
		data[f"{column} Validation score R2"].append(r2_val)
		print(f"{column} Optimal parameters:", data[f"{column} Optimal parameters"][-1])
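		print("Time elapsed:", time.time() - t)

	# Save the results collected so far; at this point every column holds one entry per finished model.
	write_file(data)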




ModuleNotFoundError: No module named 'cuml'
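
The `ModuleNotFoundError` above is what the training cell produces when the RAPIDS install did not succeed, for example on a runtime without a GPU. One possible way to keep the notebook runnable is to fall back to scikit-learn's CPU random forest. This is a sketch under assumptions, not the notebook's own approach: `pick_backend` and `random_forest_kwargs` are hypothetical helpers, and `n_streams` is passed only to cuML because scikit-learn's estimator does not accept it.

```python
import importlib
import importlib.util


def pick_backend():
    """Return 'cuml' if it is installed, else 'sklearn', else None."""
    for name in ("cuml", "sklearn"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def random_forest_kwargs(backend):
    """Keyword arguments for RandomForestRegressor on the chosen backend."""
    kwargs = {"n_estimators": 1000, "max_features": "sqrt", "random_state": 42}
    if backend == "cuml":
        kwargs["n_streams"] = 1  # cuML-only option, as used in the notebook
    return kwargs


backend = pick_backend()
if backend is None:
    print("Neither cuml nor scikit-learn is installed.")
else:
    ensemble = importlib.import_module(f"{backend}.ensemble")
    forest = ensemble.RandomForestRegressor(**random_forest_kwargs(backend))
    print(f"Using {backend}: {type(forest).__name__}")
```

Because `make_pipeline` names each step after the lowercased class name, the `randomforestregressor__max_depth` keys in `param_grids` should apply to either backend. The CPU fallback will be considerably slower with 1000 trees.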