# Kaggle Predicting Loan Payback

## Visualizing the data

![Alt text](1.png)


The target variable to predict is `loan_paid_back`.

![Alt text](2.png)

The choice of model depends on the data types: this dataset contains both numeric and categorical features.

## Reading the .csv

In [32]:
import pandas as pd
import numpy as np

df = pd.read_csv('train.csv')
df.head()

|   | id | annual_income | debt_to_income_ratio | credit_score | loan_amount | interest_rate | gender | marital_status | education_level | employment_status | loan_purpose | grade_subgrade | loan_paid_back |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 29367.99 | 0.084 | 736 | 2528.42 | 13.67 | Female | Single | High School | Self-employed | Other | C3 | 1.0 |
| 1 | 1 | 22108.02 | 0.166 | 636 | 4593.1 | 12.92 | Male | Married | Master's | Employed | Debt consolidation | D3 | 0.0 |
| 2 | 2 | 49566.2 | 0.097 | 694 | 17005.15 | 9.76 | Male | Single | High School | Employed | Debt consolidation | C5 | 1.0 |
| 3 | 3 | 46858.25 | 0.065 | 533 | 4682.48 | 16.1 | Female | Single | High School | Employed | Debt consolidation | F1 | 1.0 |
| 4 | 4 | 25496.7 | 0.053 | 665 | 12184.43 | 10.21 | Male | Married | High School | Employed | Other | D1 | 1.0 |


## Separating categorical and numeric variables

In [33]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

test_df = pd.read_csv('test.csv')

X = df.drop('loan_paid_back', axis=1)
y = df['loan_paid_back']


# Column names must match the CSV header exactly (note 'debt_to_income_ratio').
numeric_features = ['annual_income', 'debt_to_income_ratio', 'credit_score',
                   'loan_amount', 'interest_rate']
categorical_features = ['gender', 'marital_status', 'education_level', 
                       'employment_status', 'loan_purpose', 'grade_subgrade']
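
The two lists above are typed by hand, which is where a misspelled column name can slip in. As a cross-check, here is a minimal sketch of how pandas can infer the same split from the column dtypes with `select_dtypes`. The `toy` frame is a hypothetical miniature built from a few values in the preview rows, not the real `train.csv`; on the real data, `id` and `loan_paid_back` would also show up as numeric and would need to be excluded by hand.

```python
import pandas as pd

# Hypothetical miniature of the training data (values taken from the preview rows).
toy = pd.DataFrame({
    "annual_income": [29367.99, 22108.02, 49566.20],
    "credit_score": [736, 636, 694],
    "gender": ["Female", "Male", "Male"],
    "grade_subgrade": ["C3", "D3", "C5"],
})

# Numeric columns: any integer or float dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Categorical columns: everything that is not numeric (strings here).
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```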



## Label encoding

In [34]:
label_encoders = {}
for col in categorical_features:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))
    label_encoders[col] = le
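
One caveat with this approach: `LabelEncoder.transform` raises a `ValueError` when it meets a value it did not see during `fit`, so a category that appears only in `test.csv` would stop the notebook. The sketch below shows an alternative, not what this notebook uses: scikit-learn's `OrdinalEncoder`, which can map unseen categories to a sentinel value. The two small frames and the "Retired" category are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frames: "Retired" appears only in the test data.
train = pd.DataFrame({"employment_status": ["Employed", "Self-employed", "Employed"]})
test = pd.DataFrame({"employment_status": ["Employed", "Retired"]})

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["employment_status"]])
test_codes = encoder.transform(test[["employment_status"]])

print(train_codes.ravel())
print(test_codes.ravel())
```

A tree-based model such as a random forest can usually cope with the `-1` sentinel, since it only splits on thresholds.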

## Feature selection

In [35]:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
 
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_scores = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)
print("🔎 Feature importance (Mutual Information):")
print(mi_scores)

🔎 Feature importance (Mutual Information):
employment_status       0.181682
debt_to_income_ratio    0.079689
marital_status          0.036004
credit_score            0.033634
grade_subgrade          0.032247
gender                  0.029844
loan_purpose            0.028081
education_level         0.020901
loan_amount             0.013141
interest_rate           0.011935
annual_income           0.010262
id                      0.000280
dtype: float64
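
With its default `discrete_features='auto'`, `mutual_info_classif` treats every column of a dense input as continuous, so the label-encoded columns above are scored with the continuous estimator. A minimal sketch of passing an explicit boolean mask instead is shown below. It runs on synthetic stand-in data (the column names `category_code` and `noise` are made up for illustration), so the scores say nothing about the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins: one informative integer-coded category
# and one pure-noise numeric column.
category_code = rng.integers(0, 3, size=n)
noise = rng.normal(size=n)
target = (category_code == 0).astype(int)  # target depends only on category_code

X_toy = pd.DataFrame({"category_code": category_code, "noise": noise})
discrete_mask = [True, False]  # True marks columns that hold discrete codes

scores = mutual_info_classif(
    X_toy, target, discrete_features=discrete_mask, random_state=42
)
mi_toy = pd.Series(scores, index=X_toy.columns).sort_values(ascending=False)
print(mi_toy)
```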


### Selected features

In [36]:
selected_features = [
    'employment_status',
    'debt_to_income_ratio',
    'marital_status',
    'credit_score',
    'grade_subgrade',
    'gender',
    'loan_purpose'
]


X = X[selected_features]
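
The seven features kept here are the top seven by mutual information. The notebook does not state a cutoff, but as a sketch, an assumed threshold of 0.025 reproduces the same list from the scores printed earlier:

```python
import pandas as pd

# Scores copied from the mutual-information output above.
mi_scores = pd.Series({
    "employment_status": 0.181682,
    "debt_to_income_ratio": 0.079689,
    "marital_status": 0.036004,
    "credit_score": 0.033634,
    "grade_subgrade": 0.032247,
    "gender": 0.029844,
    "loan_purpose": 0.028081,
    "education_level": 0.020901,
    "loan_amount": 0.013141,
    "interest_rate": 0.011935,
    "annual_income": 0.010262,
    "id": 0.000280,
})

threshold = 0.025  # assumed cutoff; the original notebook does not state one
selected_by_threshold = mi_scores[mi_scores > threshold].index.tolist()
print(selected_by_threshold)
```

Deriving the list this way keeps the selection in sync with the scores if the data or the encoding changes.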

In [37]:
# Hold out 20% of the training data for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest on the selected features.
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)

# Evaluate on the validation split.
y_pred = rf_model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"✅ Validation accuracy: {accuracy:.4f}")

# Encode the test set with the encoders fitted on the training data.
for col in categorical_features:
    if col in test_df.columns:
        test_df[col] = label_encoders[col].transform(test_df[col].astype(str))

X_test = test_df[selected_features]

# Predict on the test set and write the submission file.
test_predictions = rf_model.predict(X_test)

submission = pd.DataFrame({
    'id': test_df['id'],
    'loan_paid_back': test_predictions
})

submission.to_csv('submission.csv', index=False)
print("📁 Submission created successfully!")
print(submission.head())


✅ Validation accuracy: 0.9023
📁 Submission created successfully!
       id  loan_paid_back
0  593994             1.0
1  593995             1.0
2  593996             0.0
3  593997             1.0
4  593998             1.0
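
An accuracy figure is easier to interpret next to a baseline, because a model that always predicts the majority class can already score high when one class dominates. The sketch below compares a random forest against such a baseline and also computes ROC AUC from predicted probabilities. It uses synthetic data from `make_classification` with an assumed 80/20 class split, so the numbers it prints are not results for the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 80% positive class, an assumption).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=7, n_informative=4,
    weights=[0.2, 0.8], random_state=42,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_va, baseline.predict(X_va))

# Same hyperparameters as the notebook's model.
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_tr, y_tr)
model_acc = accuracy_score(y_va, model.predict(X_va))
# ROC AUC uses the predicted probability of the positive class, not hard labels.
model_auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Model accuracy:    {model_acc:.4f}")
print(f"Model ROC AUC:     {model_auc:.4f}")
```

If a competition scores submissions on a probability-based metric, writing `predict_proba(...)[:, 1]` to the submission file instead of hard labels may be the better fit; check the competition's evaluation page before deciding.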
