<a href="https://colab.research.google.com/github/OmarMachuca851/Task/blob/main/credit_information_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit information learning

## Problem 1: Confirming the competition content

- **What to learn from:** the customers' transaction information
- **What to predict:** each customer's ability to repay
- **Submission file:** For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file must contain a header and have the following format:

```
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
```

- **How are submissions evaluated?** Submissions are scored by the area under the ROC curve (AUC) between the predicted probability and the observed target.
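
Because the metric is the area under the ROC curve, it depends on how the predictions rank positives against negatives, so it needs scores or probabilities rather than 0/1 labels. The following is a minimal pure-Python sketch of that idea; the function name `roc_auc` and the toy values are illustrative assumptions, not part of the competition material.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive gets the higher score; tied pairs count as one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two positives among six samples
y_true = [0, 0, 0, 1, 0, 1]
y_prob = [0.05, 0.10, 0.30, 0.35, 0.20, 0.45]

# Thresholding at 0.5 turns every prediction into class 0
y_hard = [1 if p >= 0.5 else 0 for p in y_prob]

auc_prob = roc_auc(y_true, y_prob)  # ranking information preserved
auc_hard = roc_auc(y_true, y_hard)  # every pair is a tie
print(auc_prob, auc_hard)
```

In this toy case the probabilities rank both positives above every negative, while the thresholded labels carry no ranking information at all.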

## Problem 2: Training and validation

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

# load the training dataset from CSV
df = pd.read_csv('application_train.csv')

# drop rows with missing values (not used in this problem; X and y below come from the full df)
cleaned_df = df.dropna()

# names of the categorical (object-dtype) columns
categorical_feats = df.select_dtypes('object').columns.tolist()

# separate the features from the target
X = df.drop(columns=['TARGET'])
y = df['TARGET']

In [None]:
!pip install category_encoders


In [None]:
from category_encoders import CountEncoder
# count-encode the categorical features
X = CountEncoder(cols=categorical_feats).fit_transform(X)
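
Count encoding replaces each category with the number of times it occurs in the data the encoder was fitted on. The sketch below illustrates the idea in plain Python; the class `SimpleCountEncoder`, the category labels, and the choice to map unseen categories to 0 are assumptions made for illustration and do not describe the internals of `category_encoders`.

```python
from collections import Counter


class SimpleCountEncoder:
    """Toy count encoder for a single column."""

    def fit(self, values):
        # Remember how often each category appears in the fitting data
        self.counts_ = Counter(values)
        return self

    def transform(self, values):
        # Categories never seen during fit map to 0 (an assumption of this sketch)
        return [self.counts_.get(v, 0) for v in values]


# Hypothetical category values
train_col = ["cash", "cash", "revolving", "cash"]
test_col = ["revolving", "cash", "other"]

encoder = SimpleCountEncoder().fit(train_col)
train_encoded = encoder.transform(train_col)
test_encoded = encoder.transform(test_col)  # reuses the training counts
print(train_encoded, test_encoded)
```

Keeping the fitted encoder in a variable makes it possible to apply the same mapping to new data later.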

In [None]:
!pip install lightgbm



In [None]:
# split the data into training and test sets using train_test_split from sklearn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# standardize the data (the scaler is fitted on the training set only)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)

# fit the model
from lightgbm import LGBMClassifier

reg = LGBMClassifier(random_state=5).fit(X_train_trans, y_train)

# predict class labels (0/1)
reg_pred = reg.predict(X_test_trans)

print('Acc: ', accuracy_score(y_true=y_test, y_pred=reg_pred))
# note: roc_auc_score receives hard 0/1 labels here rather than probabilities
# such as reg.predict_proba(X_test_trans)[:, 1], so the printed value is not the competition metric
print('ROC: ', roc_auc_score(y_test, reg_pred))

[LightGBM] [Info] Number of positive: 18634, number of negative: 211999
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.114589 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 11717
[LightGBM] [Info] Number of data points in the train set: 230633, number of used features: 116
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080795 -> initscore=-2.431594
[LightGBM] [Info] Start training from score -2.431594
Acc:  0.9196909388901896
ROC:  0.5088892347318036
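
The high accuracy next to a ROC value near 0.5 is what a heavily imbalanced target tends to produce. As a rough check, the class counts in the LightGBM log above give the accuracy of a baseline that always predicts class 0. The counts come from the training split, so the comparison with the test accuracy is only approximate.

```python
# Class counts reported in the LightGBM training log above
n_pos = 18634
n_neg = 211999
n_total = n_pos + n_neg

# Share of positives (the log reports this as pavg)
pos_rate = n_pos / n_total

# Accuracy of a trivial model that predicts 0 for every row
baseline_acc = n_neg / n_total

print(f"positive rate: {pos_rate:.6f}")
print(f"always-predict-0 accuracy: {baseline_acc:.6f}")
```

A baseline of roughly 0.919 is close to the accuracy printed above, which suggests the hard predictions are almost all 0 and that accuracy says little about ranking quality on this dataset.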




## Problem 3: Prediction on the test data

In [None]:
# load the test dataset from CSV
test_df = pd.read_csv('application_test.csv')

# drop rows with missing values (not used below; predictions are made for every row of test_df)
test_cleaned_df = test_df.dropna(axis=0)

# count-encode the categorical features
# note: this fits a new encoder on the test data instead of reusing one fitted on the training data
test_X = CountEncoder(cols=categorical_feats).fit_transform(test_df)

# standardize the data
# note: test_scaler is never used; the call below refits the training scaler on the test data
test_scaler = StandardScaler()
test_X_trans = scaler.fit_transform(test_X)

# predict class labels (0/1)
test_reg_pred = reg.predict(test_X_trans)

# build and save the submission file
kgl_submision = pd.concat([test_df['SK_ID_CURR'], pd.Series(test_reg_pred, name='TARGET')], axis=1)
kgl_submision.to_csv('kgl_submission.csv', index=False)
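
The cell above calls `scaler.fit_transform(test_X)`, which recomputes the mean and standard deviation from the test data. The usual practice is to fit preprocessing on the training data and only transform the test data. A minimal plain-Python sketch of the difference follows; the class `SimpleStandardizer` and the numbers are hypothetical.

```python
from statistics import mean, pstdev


class SimpleStandardizer:
    """Toy standardizer for one numeric column: z = (x - mean) / std."""

    def fit(self, values):
        self.mean_ = mean(values)
        # Population standard deviation; fall back to 1.0 for a constant column
        self.std_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


# Hypothetical values for one feature
train_values = [10.0, 20.0, 30.0]
test_values = [20.0, 40.0]

# Fit on the training data, then apply the same statistics to the test data
std = SimpleStandardizer().fit(train_values)
test_scaled = std.transform(test_values)

# Refitting on the test data gives the same rows different feature values
test_refit = SimpleStandardizer().fit(test_values).transform(test_values)

print(test_scaled)
print(test_refit)
```

With the notebook's objects, the equivalent of the first variant would be `scaler.transform(test_X)`.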



In [None]:
kgl_submision

| | SK_ID_CURR | TARGET |
|---|---|---|
| 0 | 100001 | 0 |
| 1 | 100005 | 0 |
| 2 | 100013 | 0 |
| 3 | 100028 | 0 |
| 4 | 100038 | 0 |
| ... | ... | ... |
| 48739 | 456221 | 0 |
| 48740 | 456222 | 0 |
| 48741 | 456223 | 0 |
| 48742 | 456224 | 0 |
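
The TARGET column above holds hard 0/1 labels, whereas Problem 1 asks for a probability per SK_ID_CURR. The sketch below shows one way to write a file in the expected layout using only the standard library; the helper `write_submission` is hypothetical and the values are the ones from the format example in Problem 1. With a fitted scikit-learn style classifier, the probabilities would typically come from `predict_proba(...)[:, 1]`.

```python
import csv
import io


def write_submission(ids, probabilities, fileobj):
    """Write a header row followed by one 'id,probability' row per customer."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["SK_ID_CURR", "TARGET"])
    for sk_id, p in zip(ids, probabilities):
        # TARGET is expected to be a probability, not a class label
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range for id {sk_id}: {p}")
        writer.writerow([sk_id, p])


# Values from the format example in Problem 1
ids = [100001, 100005, 100013]
probs = [0.1, 0.9, 0.2]

# An in-memory buffer stands in for a real file here
buffer = io.StringIO()
write_submission(ids, probs, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```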


## Problem 4: Feature engineering

In [None]:
# clean the dataset by dropping rows with missing values
cleaned_df = df.dropna()

# separate the features from the target
X = cleaned_df.drop(columns=['TARGET'])
y = cleaned_df['TARGET']
print(X.shape, y.shape)

(8602, 121) (8602,)


In [None]:
# imputation
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

#X = df.drop(columns=['TARGET'])
#y = df['TARGET']

# pattern 1: mean imputation on the numerical columns
imp_mean = SimpleImputer(strategy='mean')

# select only the numerical columns for mean imputation
X_numerical = X.select_dtypes(include=np.number)

# fill missing numerical values with the column mean
# note: X comes from cleaned_df, which has no missing values, so nothing is actually filled here;
# categorical columns are left out of this pattern
imp_X_numerical = imp_mean.fit_transform(X_numerical)

# split the imputed numerical data into training and test sets using train_test_split from sklearn
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(imp_X_numerical, y, test_size=0.25, random_state=42)

# standardize the data
scaler1 = StandardScaler()
scaler1.fit(X_train_1)
X_train_trans_1 = scaler1.transform(X_train_1)
X_test_trans_1 = scaler1.transform(X_test_1)

# fit the model
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(random_state=5)
lgb = lgbm.fit(X_train_trans_1, y_train_1)

# predict class labels
reg_pred_1 = lgb.predict(X_test_trans_1)

print('Accuracy: ', accuracy_score(y_test_1, reg_pred_1))

[LightGBM] [Info] Number of positive: 407, number of negative: 6044
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002984 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 10422
[LightGBM] [Info] Number of data points in the train set: 6451, number of used features: 94
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.063091 -> initscore=-2.698008
[LightGBM] [Info] Start training from score -2.698008
Accuracy:  0.9428172942817294
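
Mean and median imputation only differ where values are missing. Since `X` in this problem is built from `cleaned_df = df.dropna()`, there appear to be no missing values left for the imputer to fill. The toy sketch below shows both points; the helper `impute` and the numbers are hypothetical.

```python
from statistics import mean, median


def impute(values, strategy):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical column with one missing value and one outlier
income = [100.0, 120.0, None, 110.0, 900.0]

mean_filled = impute(income, "mean")      # the outlier pulls the fill value up
median_filled = impute(income, "median")  # the median is less affected by it

# After dropping the missing entry, imputation changes nothing
complete = [v for v in income if v is not None]
unchanged = impute(complete, "mean") == complete

print(mean_filled[2], median_filled[2], unchanged)
```

To compare imputation strategies meaningfully, the imputer would need to run on data that still contains missing values, for example the full `df`.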




In [None]:
# pattern 2: median imputation for numerical columns, most-frequent imputation for categorical columns
from sklearn.preprocessing import OneHotEncoder

#X = df.drop(columns=['TARGET'])
#y = df['TARGET']

# separate numerical and categorical columns
X_numerical = X.select_dtypes(include=np.number)
X_categorical = X.select_dtypes(exclude=np.number)

# impute numerical columns using the median strategy
imp_median_numerical = SimpleImputer(strategy='median')
imp_X_numerical = imp_median_numerical.fit_transform(X_numerical)


# impute categorical columns using the most-frequent strategy (or another strategy suitable for categorical data)
imp_mf_categorical = SimpleImputer(strategy='most_frequent')
imp_X_categorical = imp_mf_categorical.fit_transform(X_categorical)

# one-hot encode the imputed categorical data
enc_1 = OneHotEncoder(handle_unknown='ignore', sparse_output=False) # sparse_output=False returns a dense array
enc_imp_X_categorical = enc_1.fit_transform(imp_X_categorical)

# combine the imputed numerical data and the one-hot encoded categorical data
imp_X_1 = np.hstack((imp_X_numerical, enc_imp_X_categorical))

# split the data into training and test sets using train_test_split from sklearn
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(imp_X_1, y, test_size=0.25, random_state=42)

# standardize the data
scaler2 = StandardScaler()
scaler2.fit(X_train_2)
X_train_trans_2 = scaler2.transform(X_train_2)
X_test_trans_2 = scaler2.transform(X_test_2)

# fit the model
from lightgbm import LGBMClassifier
lgbm_1 = LGBMClassifier(random_state=5)
lgb_1 = lgbm_1.fit(X_train_trans_2, y_train_2)

# predict class labels
reg_pred_2 = lgb_1.predict(X_test_trans_2)

print('Accuracy: ', accuracy_score(y_test_2, reg_pred_2))
print(X.shape)

[LightGBM] [Info] Number of positive: 407, number of negative: 6044
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005163 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 10749
[LightGBM] [Info] Number of data points in the train set: 6451, number of used features: 203
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.063091 -> initscore=-2.698008
[LightGBM] [Info] Start training from score -2.698008
Accuracy:  0.9442119944211994
(8602, 121)




In [None]:
# pattern 3: most-frequent imputation on all columns
imp_mf = SimpleImputer(strategy='most_frequent')

# fill missing values with the most frequent value of each column
imp_X_2 = imp_mf.fit_transform(X)

# one-hot encode every column (numerical columns included)
enc_2 = OneHotEncoder(handle_unknown='ignore')
enc_imp_X_2 = enc_2.fit_transform(imp_X_2).toarray()

# split the data into training and test sets
X_train_3, X_test_3 , y_train_3, y_test_3 = train_test_split(enc_imp_X_2, y, test_size=0.25, random_state=42)

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train_3)
X_train_trans_3 = scaler.transform(X_train_3)
X_test_trans_3 = scaler.transform(X_test_3)

# fit the model
lgbm_2 = LGBMClassifier(random_state=5)
lgb_2 = lgbm_2.fit(X_train_trans_3, y_train_3)

# predict class labels
reg_pred_3 = lgb_2.predict(X_test_trans_3)

print(reg_pred_3.shape)
print('Accuracy: ', accuracy_score(y_test_3, reg_pred_3))

[LightGBM] [Info] Number of positive: 407, number of negative: 6044
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.025960 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5253
[LightGBM] [Info] Number of data points in the train set: 6451, number of used features: 1751
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.063091 -> initscore=-2.698008
[LightGBM] [Info] Start training from score -2.698008




(2151,)
Accuracy:  0.9437470943747094
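
The log for this cell reports 1751 used features, compared with 203 when only the categorical columns were one-hot encoded. A likely reason is that the encoder here is applied to every column, so each distinct numeric value is treated as its own category. The plain-Python sketch below illustrates the effect; the helper `one_hot` and the data are hypothetical.

```python
def one_hot(column):
    """One-hot encode one column: one output column per distinct value."""
    categories = sorted(set(column))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical data: one categorical and one numeric column
contract = ["cash", "revolving", "cash", "cash"]
amount = [1500.0, 2750.5, 990.0, 1500.0]

contract_cats, contract_rows = one_hot(contract)
amount_cats, amount_rows = one_hot(amount)

# Two columns for the categorical feature, three for the numeric one
print(len(contract_cats), len(amount_cats))
```

Encoding a numeric column this way discards the ordering of its values, which is why the earlier pattern that encodes only the categorical columns keeps the feature count much lower.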


In [None]:
# pattern 4: constant-value imputation on all columns
imp_cnst = SimpleImputer(strategy='constant')

# fill missing values with a constant
imp_X_3 = imp_cnst.fit_transform(X)
#print(imp_X_3.shape)

# one-hot encode every column (numerical columns included)
enc_3 = OneHotEncoder(handle_unknown='ignore')
enc_imp_X_3 = enc_3.fit_transform(imp_X_3).toarray()

# split the data into training and test sets using train_test_split from sklearn
X_train_4, X_test_4 , y_train_4, y_test_4 = train_test_split(enc_imp_X_3, y, test_size=0.25, random_state=42)

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train_4)
X_train_trans_4 = scaler.transform(X_train_4)
X_test_trans_4 = scaler.transform(X_test_4)

# fit the model
lgbm_3 = LGBMClassifier(random_state=5)
lgb_3 = lgbm_3.fit(X_train_trans_4, y_train_4)

# predict class labels
reg_pred_4 = lgb_3.predict(X_test_trans_4)

print(reg_pred_4.shape)
print('Accuracy: ', accuracy_score(y_test_4, reg_pred_4))

[LightGBM] [Info] Number of positive: 407, number of negative: 6044
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.026452 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5253
[LightGBM] [Info] Number of data points in the train set: 6451, number of used features: 1751
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.063091 -> initscore=-2.698008
[LightGBM] [Info] Start training from score -2.698008




(2151,)
Accuracy:  0.9437470943747094
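
The four accuracies in this problem are very close to each other. Converting them back to counts of correct predictions on the 2151 test rows (the shape printed above) shows how small the differences are. The pattern labels in the dictionary are descriptive names chosen for this sketch.

```python
# Number of rows in the test split, from the printed shape (2151,)
n_test = 2151

# Accuracies printed by the four cells in this problem
accuracies = {
    "mean, numeric only": 0.9428172942817294,
    "median + most_frequent": 0.9442119944211994,
    "most_frequent, all columns": 0.9437470943747094,
    "constant, all columns": 0.9437470943747094,
}

# Convert each accuracy back to a count of correctly classified rows
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}

for name, count in correct.items():
    print(f"{name}: {count} of {n_test} correct")

# Gap between the best and worst pattern, in rows
spread = max(correct.values()) - min(correct.values())
print("spread:", spread)
```

A spread of three rows is too small to rank the patterns, and accuracy on an imbalanced target is a weak guide in any case, so a probability-based AUC would be the more informative comparison.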


## Problem 5: Posting to Notebooks

https://www.kaggle.com/code/machucacruzomar/homecreditpredict