<a target="_blank" href="https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/rapids-pip-colab-template.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Install RAPIDS into Colab"/>
</a>

# RAPIDS cuDF is already on your Colab instance!
RAPIDS cuDF is preinstalled on Google Colab and instantly accelerates Pandas with zero code changes. [You can quickly get started with our tutorial notebook](https://nvda.ws/rapids-cudf). This notebook template is for users who want to use the full suite of RAPIDS libraries in their Colab workflows.

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Run `!nvidia-smi` to see which GPU you have; uncomment the cell below to do so.  Currently, RAPIDS runs on all available Colab GPU instances.

In [42]:
# !nvidia-smi

# Setup:
This setup script:

1. Checks that the GPU is RAPIDS compatible
1. Pip installs the RAPIDS libraries, which are:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuxFilter
  1. cuCIM
  1. xgboost

# Controlling Which RAPIDS Version is Installed
This line in the cell below, `!python rapidsai-csp-utils/colab/pip-install.py`, kicks off the RAPIDS installation script.  You can control the RAPIDS version installed by adding either `latest`, `nightlies` or the default/blank option.  Example:

`!python rapidsai-csp-utils/colab/pip-install.py <option>`

You can now tell the script to install:
1. **RAPIDS + Colab Default Version**: leaving the install script option blank (or giving an invalid option) adds the rest of the RAPIDS libraries to the RAPIDS cuDF library preinstalled on Colab.  **This is the default and recommended version.**  Example: `!python rapidsai-csp-utils/colab/pip-install.py`
1. **Latest known working RAPIDS stable version**: the option `latest` upgrades all RAPIDS libraries to the latest working RAPIDS stable version.  This is usually early access to future RAPIDS+Colab functionality; some functionality may not work, and it can be the same as the default version.  Example: `!python rapidsai-csp-utils/colab/pip-install.py latest`
1. **Current nightlies version**: the option `nightlies` installs the current RAPIDS nightlies version.  For RAPIDS developer use - **not recommended/untested**.  Example: `!python rapidsai-csp-utils/colab/pip-install.py nightlies`


**This will complete in about 5-6 minutes**

In [43]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
Installing RAPIDS remaining 24.10.* libraries
Using Python 3.11.11 environment at /usr
Audited 11 packages in 91ms

        ***********************************************************************
        The pip install of RAPIDS is complete.

        Please do not run any further installation from the conda based installation methods, as they may cause issues!

        Please ensure that you're pulling from the git repo to remain updated with the latest working install scripts.

        Troubleshooting:
            - If there is an installation failure, please check back on RAPIDSAI owned templates/notebooks to see how to update your personal files.
            - If an installation failure persists when using the latest script, please make an issue on https://github.com/rapidsai-community/rapidsai-csp-utils
        ***********************************************************************
        


# RAPIDS is now installed on Colab.  
You can copy your code into the cells below or use the below to validate your RAPIDS installation and version.  
# Enjoy!

In [44]:
import cudf
cudf.__version__

'24.10.01'

In [45]:
import cuml
cuml.__version__

'24.10.00'

In [46]:
import cugraph
cugraph.__version__

'24.10.00'

In [47]:
import cuspatial
cuspatial.__version__

'24.10.00'

In [48]:
import cuxfilter
cuxfilter.__version__

'24.10.00'

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib

In [49]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV, LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, mean_squared_error
import numpy as np
import joblib

In [50]:
!gdown 1E3mW0xMmC7JBnD20H4qZA0XPlSruVpJ2
!gdown 1sce_ashbiRGErbm7OXr8w2BGGy9u5jon
!gdown 11mD6Go_WSF6ksAH8lwW8alEeUM8WiUHx
!gdown 1hzopdXc-GiZ03Uqf556A3IgGh87zn3e6

Downloading...
From: https://drive.google.com/uc?id=1E3mW0xMmC7JBnD20H4qZA0XPlSruVpJ2
To: /content/constancia_inscripcion.parquet
100% 10.6M/10.6M [00:00<00:00, 65.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1sce_ashbiRGErbm7OXr8w2BGGy9u5jon
To: /content/creditos_hist.parquet
100% 62.7M/62.7M [00:00<00:00, 105MB/s]
Downloading...
From: https://drive.google.com/uc?id=11mD6Go_WSF6ksAH8lwW8alEeUM8WiUHx
To: /content/principales_variables.parquet
100% 124k/124k [00:00<00:00, 121MB/s]
Downloading...
From: https://drive.google.com/uc?id=1hzopdXc-GiZ03Uqf556A3IgGh87zn3e6
To: /content/sh_emae_mensual_base2004.xls
100% 71.7k/71.7k [00:00<00:00, 76.0MB/s]


In [58]:
data = pd.read_parquet('./creditos_hist.parquet')

In [59]:
# Drop situation 0, which indicates that the credit has already been paid off
data = data.loc[data['situacion'] != 0]
data = data.drop('denominacion', axis = 1) # Drop the column with the company names to save RAM

In [60]:
# One variable of interest is how many credits a firm has at a given point in time
counts = data.groupby(['identificacion', 'periodo']).size().reset_index(name='n_creditos')

# We also care about how much money a firm owes at each point in time
sums = data.groupby(['identificacion', 'periodo'], as_index=True)['monto'].sum().reset_index(name='sum_montos')

# The literature indicates that the length of the firm-bank relationship also matters, so we count the number
# of periods in which each firm-bank pair appears
period_counts = data.groupby(['identificacion', 'entidad']).size().reset_index(name='n_periodos')

# We define default as a credit in situation 4 or 5, so we create the default dummy
# This is our dependent variable
data['default'] = (data['situacion'] >= 4).astype(int)
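
The aggregations above can be checked on a small hand-made table. The following is a minimal sketch, assuming a hypothetical toy registry (the values are invented for illustration and are not taken from `creditos_hist.parquet`); it only reuses the column names from the cell above.

```python
import pandas as pd

# Hypothetical toy registry: one row per firm-bank-period credit line
toy = pd.DataFrame({
    "identificacion": ["A", "A", "A", "B"],
    "entidad": [1, 2, 1, 1],
    "periodo": ["202401", "202401", "202402", "202401"],
    "monto": [100.0, 50.0, 120.0, 30.0],
    "situacion": [1, 4, 5, 2],
})

# Number of credit lines per firm and period
counts = toy.groupby(["identificacion", "periodo"]).size().reset_index(name="n_creditos")

# Total amount owed per firm and period
sums = toy.groupby(["identificacion", "periodo"])["monto"].sum().reset_index(name="sum_montos")

# Default dummy: 1 when the credit is in situation 4 or 5
toy["default"] = (toy["situacion"] >= 4).astype(int)

print(counts)
print(sums)
print(toy[["situacion", "default"]])
```

Firm A has two credit lines in 202401, so its `n_creditos` is 2 and its `sum_montos` is 150 for that period.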

In [61]:
# We want to predict default in the following period.
# Note: shift(1) takes the value from the previous row within each firm-bank pair, so it matches the
# "following period" only if rows are ordered from newest to oldest; the data is not explicitly sorted before this cell.
data['default_lag'] = data.groupby(['identificacion', 'entidad'])['default'].shift(1) # Shift the default variable
data = data.dropna(subset=['default_lag']).copy() # Drop observations without a dependent variable; .copy() avoids the SettingWithCopyWarning
data['default_lag'] = data['default_lag'].astype(int) # Cast the variable of interest to int
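
Because the direction of `shift` decides whether the target looks backward or forward in time, it helps to see both directions side by side. This is a minimal sketch on a hypothetical three-period firm-bank pair, sorted in ascending period order (an assumption; the notebook does not sort before shifting). The column names `default_prev` and `default_next` are illustrative only.

```python
import pandas as pd

# Hypothetical firm-bank pair observed over three consecutive periods
toy = pd.DataFrame({
    "identificacion": ["A"] * 3,
    "entidad": [1] * 3,
    "periodo": ["202401", "202402", "202403"],
    "default": [0, 0, 1],
}).sort_values(["identificacion", "entidad", "periodo"])

grouped = toy.groupby(["identificacion", "entidad"])["default"]
toy["default_prev"] = grouped.shift(1)   # value from the previous row (t-1)
toy["default_next"] = grouped.shift(-1)  # value from the following row (t+1)

print(toy)
```

With ascending order, `shift(-1)` is the one that attaches next period's default to the current row; the first or last row of each group has no neighbour and becomes `NaN`.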


In [62]:
data = data.sort_values(by=['identificacion', 'periodo'])
data['prev_default'] = (
    data.groupby('identificacion')['default']
    .transform(lambda x: x.cumsum().clip(upper=1))
) # Flag whether the firm has had a credit in default at any point in its history, up to and including the current row
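
The running flag built above can be illustrated on toy data. The sketch below assumes two hypothetical firms and compares the `cumsum().clip(upper=1)` idiom with `cummax()`, which gives the same result for a 0/1 column; the `prev_default_cummax` name is illustrative only.

```python
import pandas as pd

# Hypothetical default history for two firms
toy = pd.DataFrame({
    "identificacion": ["A", "A", "A", "B", "B"],
    "periodo": ["202401", "202402", "202403", "202401", "202402"],
    "default": [0, 1, 0, 0, 0],
}).sort_values(["identificacion", "periodo"])

by_firm = toy.groupby("identificacion")["default"]

# Same idiom as the cell above: running sum capped at 1
toy["prev_default"] = by_firm.transform(lambda x: x.cumsum().clip(upper=1))

# Equivalent for a 0/1 column: running maximum
toy["prev_default_cummax"] = by_firm.cummax()

print(toy)
```

Note that the flag switches on in the same row as the first default, so it includes the current period rather than only strictly earlier ones.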

In [63]:
data["sin_historial"] = (
    data.groupby("identificacion")["periodo"]
    .transform("rank", method="first") == 1).astype(int)

In [65]:
# Add the new variables to the dataframe
data = data.merge(counts, on=['identificacion', 'periodo'], how='left')
data = data.merge(sums, on=['identificacion', 'periodo'], how='left')
data = data.merge(period_counts, on=['identificacion', 'entidad'], how='left')

del sums, counts, period_counts # To save RAM

In [66]:
# Finally, the literature also highlights that the intensity of the firm-bank relationship is relevant
# As a proxy for intensity we use the share of the amount owed to one bank over the total amount owed
data['monto_relativo'] = data['monto'] / data['sum_montos']
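
As a quick check of the intensity proxy, the following sketch computes the share on a hypothetical firm that owes money to two banks. The values are invented; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical debts: firm A owes to two banks, firm B to one
toy = pd.DataFrame({
    "identificacion": ["A", "A", "B"],
    "entidad": [1, 2, 1],
    "periodo": ["202401"] * 3,
    "monto": [75.0, 25.0, 40.0],
})

# Total owed per firm and period, merged back onto each credit line
sums = toy.groupby(["identificacion", "periodo"])["monto"].sum().reset_index(name="sum_montos")
toy = toy.merge(sums, on=["identificacion", "periodo"], how="left")

# Share of the firm's total debt held with each bank
toy["monto_relativo"] = toy["monto"] / toy["sum_montos"]

print(toy)
```

The shares of a firm within a period add up to 1, and a firm with a single bank gets a share of exactly 1.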

In [67]:
data = data.loc[data['periodo'] > '202310']

In [68]:
# Randomly pick a percentage of the firms in the population
np.random.seed(42)
cuits = data['identificacion'].unique()
moneda = np.random.binomial(1, 0.1, len(cuits)) # Like flipping a biased coin so that it picks an arbitrary percentage of the firms
cuits_aleatorios = cuits[moneda == 1] # These are the CUITs to keep
data = data.loc[data['identificacion'].isin(cuits_aleatorios)] # Keep only the observations whose CUIT is among the randomly selected ones
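
Sampling at the firm level, rather than the row level, keeps each selected firm's full history intact. The sketch below illustrates this on a hypothetical panel of 1000 firms with three rows each; it uses NumPy's `default_rng` generator instead of the global seed, which is a substitution for illustration and not what the cell above does.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical panel: 1000 firms with three observations each
toy = pd.DataFrame({
    "identificacion": np.repeat(np.arange(1000), 3),
    "monto": 1.0,
})

firms = toy["identificacion"].unique()
keep = rng.random(len(firms)) < 0.1  # biased coin flip per firm
sample = toy[toy["identificacion"].isin(firms[keep])]

print(f"Firms kept: {keep.sum()} of {len(firms)}")
print(f"Rows kept: {len(sample)} of {len(toy)}")
```

Every firm that survives the draw keeps all of its rows, so within-firm features such as lags remain valid on the sample.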

In [69]:
pv = pd.read_parquet('./principales_variables.parquet') # Main monetary variables from the BCRA API
pv.reset_index(inplace= True) # The index is the date, so move it to a column
pv['fecha'] = pd.to_datetime(pv['fecha']) # Convert the new column to datetime
pv['periodo'] = pv['fecha'].dt.strftime('%Y%m') # Build a 'periodo' variable matching the one in the Central de Deudores data
pv = pv.drop('fecha', axis = 1).groupby('periodo').agg(['mean', 'std']) # Drop 'fecha' because the daily data is not of interest
# Keep only the monthly means and also compute the standard deviation
pv.columns = ['_'.join(col).strip() for col in pv.columns] # Flatten the column names to keep things tidy
pv.reset_index(inplace= True) # Bring back the periodo column
pv = pv.loc[pv['periodo'].astype(int) <= 202411] # The Central de Deudores data runs through 202410
pv = pv.dropna(axis = 1) # Drop the columns with NAs
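
The daily-to-monthly aggregation and the flattening of the resulting two-level column index can be sketched on a tiny hypothetical series. The `tasa` column and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical daily series indexed by date, like the BCRA data
daily = pd.DataFrame(
    {"tasa": [10.0, 12.0, 20.0, 22.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-02-01", "2024-02-02"]),
)
daily.index.name = "fecha"
daily = daily.reset_index()

# Monthly key in the same YYYYMM format as the credit data
daily["periodo"] = daily["fecha"].dt.strftime("%Y%m")

# agg with a list of functions yields MultiIndex columns such as ('tasa', 'mean')
monthly = daily.drop(columns="fecha").groupby("periodo").agg(["mean", "std"])

# Flatten ('tasa', 'mean') into 'tasa_mean'
monthly.columns = ["_".join(col) for col in monthly.columns]
monthly = monthly.reset_index()

print(monthly)
```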

In [70]:
pv = pv.sort_values("periodo")

pv["inflacion_acumulada"] = (1 + pv["Inflación mensual (variación en %)_mean"]/100).cumprod()

columnas = pv.columns.tolist()
tasas = [3, 4, 5, 6, 28 ,29, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
excluir = [0, 11, 12, 25, 26, 27, 34, 35, 58, 59, 60, 61, 62, 63, 64]

columnas_excluir = [columnas[i] for i in excluir]
columnas_tasas = [columnas[i] for i in tasas]

columnas_nominales = [col for col in pv.columns if col not in columnas_excluir and col not in columnas_tasas]
# `tasas` holds positional indices, not column names, so filter the names already collected in columnas_tasas
columnas_tasas = [col for col in columnas_tasas if col not in columnas_excluir]

for col in columnas_nominales:
    nombre_col_real = f"{col}_real"
    pv[nombre_col_real] = pv[col] / pv['inflacion_acumulada']

for col in columnas_tasas:
    nombre_col_real = f"{col}_real"
    pv[nombre_col_real] = pv[col]/(100*pv['inflacion_acumulada'])
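
The deflation step for nominal columns can be verified with a small worked example. This is a sketch under invented numbers: a hypothetical nominal series that grows exactly in line with inflation, so its real value should stay constant. The column names `inflacion_mensual` and `base_monetaria` are placeholders, not the actual BCRA column names.

```python
import pandas as pd

# Hypothetical monthly data: 10% inflation in the first two months, none in the third
pv_toy = pd.DataFrame({
    "periodo": ["202401", "202402", "202403"],
    "inflacion_mensual": [10.0, 10.0, 0.0],
    "base_monetaria": [110.0, 121.0, 121.0],
}).sort_values("periodo")

# Cumulative price index: product of (1 + monthly rate) up to each month
pv_toy["inflacion_acumulada"] = (1 + pv_toy["inflacion_mensual"] / 100).cumprod()

# Real value: nominal value divided by the cumulative index
pv_toy["base_monetaria_real"] = pv_toy["base_monetaria"] / pv_toy["inflacion_acumulada"]

print(pv_toy)
```

The index goes 1.10, 1.21, 1.21, so the real series stays at 100 (up to floating-point rounding) in every month. Sorting by `periodo` before `cumprod` matters, since the cumulative product depends on row order.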

In [71]:
data = data.merge(pv, on = 'periodo', how = 'left') # Join the main monetary variables with the Central de Deudores data

In [72]:
data['monto_real'] = data['monto']/data['inflacion_acumulada']
data['sum_montos_real'] = data['sum_montos']/data['inflacion_acumulada']
data.drop('inflacion_acumulada', axis = 1, inplace = True)

In [73]:
emae = pd.read_excel('./sh_emae_mensual_base2004.xls', index_col=[0,1])
meses_a_numeros = {
    'Enero': '01', 'Febrero': '02', 'Marzo': '03', 'Abril': '04',
    'Mayo': '05', 'Junio': '06', 'Julio': '07', 'Agosto': '08',
    'Septiembre': '09', 'Octubre': '10', 'Noviembre': '11', 'Diciembre': '12'
}
emae = emae.reset_index()
emae['level_1'] = emae['level_1'].map(meses_a_numeros)
emae['periodo'] = emae['Período'].astype(str) + emae['level_1']
emae = emae.drop(columns=['level_1', 'Período'])

In [74]:
data = data.merge(emae, on = 'periodo', how= 'left')

In [75]:
arca = pd.read_parquet('./constancia_inscripcion.parquet') # Load the ARCA registration certificate data

In [76]:
cuits_arca = set(arca['identificacion'])
cuits_bcra = set(data['identificacion'])

faltan = list(cuits_bcra - cuits_arca)

data['sin_arca'] = (data['identificacion'].isin(faltan)).astype(int)

In [77]:
# Set the right dtype for the categorical columns so that get_dummies works properly
data['identificacion'] = data['identificacion'].astype('category')
data['entidad'] = data['entidad'].astype('category')
data['situacion'] = data['situacion'].astype('category')
data['default'] = data['default'].astype('category')
data['periodo'] = data['periodo'].astype('category')
data['default_lag'] = data['default_lag'].astype('category')
data['prev_default'] = data['prev_default'].astype('category')
data['sin_arca'] = data['sin_arca'].astype('category')

In [78]:
# Cross Validation
data = data.sort_values(by='periodo') # Sort by date
data = data.reset_index().drop(columns= 'index')
split_index = int(len(data) * 0.8) # The oldest 80% of the observations
train_indices = data.iloc[:split_index].index # These are the indices used later to split into test and train
test_indices = data.iloc[split_index:].index
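
The chronological split can be illustrated on a handful of rows. This sketch assumes a hypothetical frame with one row per period; in the real data many rows share a period, so the 80% cut can fall in the middle of a month.

```python
import pandas as pd

# Hypothetical observations in arbitrary order
toy = pd.DataFrame({
    "periodo": ["202403", "202401", "202402", "202405", "202404"],
    "y": [1, 0, 0, 1, 0],
})

# Sort by time and reset the index so positions and labels coincide
toy = toy.sort_values("periodo").reset_index(drop=True)

split_index = int(len(toy) * 0.8)  # oldest 80% goes to training
train = toy.iloc[:split_index]
test = toy.iloc[split_index:]

print("train periods:", train["periodo"].tolist())
print("test periods:", test["periodo"].tolist())
```

Every training period is earlier than every test period here, which is the property that keeps future information out of the training set.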

In [79]:
Y = data['default_lag']

In [80]:
boolean_columns = data.select_dtypes(include='object').columns # Columns already in the right format (note: this selects object-dtype columns, not bool)
columnas = ['entidad', 'monto', 'n_creditos', 'sum_montos', 'n_periodos', 'monto_relativo', 'sin_arca', 'default', 'prev_default', 'sin_historial', 'periodo', 'monto_real', 'sum_montos_real'] # Some of the model's independent variables
pv.set_index('periodo', inplace= True)
pv.drop('inflacion_acumulada', axis = 1, inplace = True)
columnas.extend(pv.columns) # All the columns from the main monetary variables
emae.set_index('periodo', inplace= True)
columnas.extend(emae.columns) # All the EMAE columns

In [81]:
columns_to_encode = [col for col in columnas if col not in boolean_columns] # Columns that go through get_dummies (everything not already in boolean_columns)
X_encoded = pd.get_dummies(data[columns_to_encode], drop_first=True) # Encode those columns with get_dummies
X = pd.concat([X_encoded, data[boolean_columns]], axis=1) # Combine all the independent variables into a single df

del columns_to_encode, X_encoded
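
To see what `get_dummies` with `drop_first=True` produces for categorical columns, and where a column name such as `default_1` comes from, here is a minimal sketch on a hypothetical three-row frame.

```python
import pandas as pd

# Hypothetical frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "entidad": pd.Categorical(["7", "11", "7"]),
    "default": pd.Categorical([0, 1, 0]),
    "monto": [1.0, 2.0, 3.0],
})

# Each categorical column becomes one indicator per category, minus the first category
X_toy = pd.get_dummies(toy, drop_first=True)

print(X_toy)
```

The numeric column passes through unchanged, while `default` (categories 0 and 1) is reduced to a single `default_1` indicator because the first category is dropped.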

In [82]:
X = cudf.DataFrame(X)
Y = cudf.DataFrame(Y)

In [83]:
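# sklearn's StandardScaler cannot consume cuDF objects (it raises "Implicit conversion to a host NumPy array
# via __array__ is not allowed"), so the cuML scaler is used to keep the data on the GPU.
from cuml.preprocessing import StandardScaler as cuStandardScaler

scaler = cuStandardScaler()
# Rebuild the frame after scaling so that column names and index are preserved
X = cudf.DataFrame(scaler.fit_transform(X.to_cupy()), columns=X.columns, index=X.index)
# Y is the binary target, so it is left unscaled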
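
For reference, standardization subtracts each column's mean and divides by its standard deviation. The sketch below reproduces that formula with plain NumPy on a hypothetical 3x2 matrix; it is an illustration of the transformation and not the scaler's internal implementation.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
])

mean = X_toy.mean(axis=0)  # per-column mean
std = X_toy.std(axis=0)    # per-column population standard deviation (ddof=0)

X_scaled = (X_toy - mean) / std

print(X_scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so a regularization penalty treats them on equal footing. A common practice is to compute `mean` and `std` on the training rows only and reuse them on the test rows.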

In [None]:
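# Rows where the current default status differs from the target, i.e. where the status changes between periods
X_cambio_index = X.loc[X['default_1'] != Y].index
X = X.drop('default_1', axis = 1)

# Split into training and test sets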
X_train = X.loc[train_indices]
Y_train = Y.loc[train_indices]
X_test = X.loc[test_indices]
Y_test = Y.loc[test_indices]

X_test_cambio_index =list(set(X_test.index) & set(X_cambio_index))
Y_test_cambio_index = list(set(Y_test.index) & set(X_cambio_index))

X_test_cambio = X_test.loc[X_test_cambio_index]
Y_test_cambio = Y_test.loc[Y_test_cambio_index]

In [None]:
# Note: this name shadows Python's built-in eval()
def eval(model, X_test, Y_test, linear = None):
    """Print the confusion matrix and classification metrics on the test set; return the predictions."""
    y_pred = model.predict(X_test)

    if linear:
        # Linear models return continuous scores, so threshold at 0.5 to get classes
        y_pred = np.where(y_pred >= 0.5, 1, 0)

    cm = confusion_matrix(Y_test, y_pred)

    precision = precision_score(Y_test, y_pred)
    recall = recall_score(Y_test, y_pred)
    f1 = f1_score(Y_test, y_pred)
    accuracy = accuracy_score(Y_test, y_pred)
    mse = mean_squared_error(Y_test, y_pred)

    print(cm)
    print(f'Precision: {precision}')
    print(f'Recall: {recall}')
    print(f'F1: {f1}')
    print(f'Accuracy: {accuracy}')
    print(f'MSE: {mse} \n')

    return y_pred
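
The metrics printed by the function above can be computed by hand on a tiny example, which makes their definitions concrete. This sketch uses hypothetical labels and plain Python rather than the sklearn functions.

```python
# Hypothetical true labels and predictions (1 = default)
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # defaults correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed defaults
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-defaults correctly passed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)
mse = sum((t - p) ** 2 for t, p in pairs) / len(y_true)

print(f"confusion matrix: [[{tn}, {fp}], [{fn}, {tp}]]")
print(precision, recall, f1, accuracy, mse)
```

For 0/1 labels every error contributes exactly 1 to the squared error, so the MSE equals 1 minus the accuracy and carries no extra information beyond it.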

In [None]:
def mse_table(model, elasticnet=None):
    inverse_Cs = 1 / model.Cs_
    results = []

    if elasticnet:
        l1_ratios = model.l1_ratios_
        for c_idx, lambda_ in enumerate(inverse_Cs):
            for l1_idx, l1_ratio in enumerate(l1_ratios):
                mean_score = np.mean(-model.scores_[1][:, c_idx, l1_idx], axis=0)
                std = np.std(-model.scores_[1][:, c_idx, l1_idx], axis=0)
                results.append({
                    "Lambda": lambda_,
                    "L1 Ratio": l1_ratio,
                    "Mean MSE": mean_score,
                    "Std MSE": std
                })

        results_table = pd.DataFrame(results).sort_values(by="Mean MSE", ascending=True)

    else:
        mean_scores = np.mean(-model.scores_[1], axis=0)
        std = np.std(-model.scores_[1], axis=0)
        results_table = pd.DataFrame({
            "Lambda": inverse_Cs,
            "Mean Score": mean_scores,
            "Std MSE": std
        }).sort_values(by="Mean Score", ascending=True)

    print(results_table)
    return results_table

In [None]:
def non_zero_coefs(model, X_train):
    best_coefs = model.coef_[0]
    feature_names = X_train.columns
    non_zero_coefs = []
    for coef, name in zip(best_coefs, feature_names):
        if coef != 0:
            non_zero_coefs.append({
                "Variable": name,
                "Coeficiente": coef
            })
    non_zero_table = pd.DataFrame(non_zero_coefs)
    print(non_zero_table)

    return non_zero_table

In [None]:
tscv = TimeSeriesSplit(n_splits=10)

Cs = np.logspace(-4, 4, 20)
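
To see how `TimeSeriesSplit` carves up the rows, the sketch below prints the folds for a hypothetical dataset of 12 time-ordered observations with 3 splits (fewer than the 10 used above, to keep the output short).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical dataset: 12 observations already sorted by time
X_toy = np.arange(12).reshape(-1, 1)

tscv_toy = TimeSeriesSplit(n_splits=3)
folds = [(train.tolist(), test.tolist()) for train, test in tscv_toy.split(X_toy)]

for i, (train, test) in enumerate(folds):
    print(f"fold {i}: train={train} test={test}")
```

Each fold trains on an expanding window of past rows and tests on the block that immediately follows, so the model is never evaluated on data older than what it was trained on. The splitter works on row positions, which is why the data must be sorted by period beforehand.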

In [None]:
# solver="qn" is a cuML option, so the GPU LogisticRegression is imported here under its own name
from cuml.linear_model import LogisticRegression as cuLogisticRegression

errors = [] # Best error of each fold

# X_cudf and y_cudf are not defined earlier in the notebook; they are assumed to be the cuDF features and target
for train_idx, test_idx in tscv.split(X_cudf):
    X_train, X_test = X_cudf.iloc[train_idx], X_cudf.iloc[test_idx]
    y_train, y_test = y_cudf.iloc[train_idx], y_cudf.iloc[test_idx]

    # Best model for this fold
    best_C = None
    best_score = float('inf')

    for C in Cs:
        model = cuLogisticRegression(penalty="l1", solver="qn", C=C, fit_intercept=True, max_iter=2000)
        model.fit(X_train, y_train)

        # Prediction and evaluation
        y_pred = model.predict(X_test)
        mse = ((y_test - y_pred) ** 2).mean()

        if mse < best_score:
            best_score = mse
            best_C = C

    errors.append(best_score)
    print(f"Fold finished, best C: {best_C}, error: {best_score}")

print("Cross-validation finished. Errors per fold:", errors)