<a href="https://colab.research.google.com/github/Odima-dev/Data-Science-and-Machine-Learning/blob/main/CreditInformationLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem 1: Confirmation of competition contents

Having read the overview page of the competition, the following points are clear:

1) The goal of this task is to predict the probability that a customer will default on a loan using demographic and financial data.

2) The submission file must be a CSV with 'SK_ID_CURR' and 'TARGET' columns.

3) The evaluation metric used is AUC (Area Under the ROC Curve).
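
To make the metric in point 3 concrete: AUC equals the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter, with ties counted as half. The sketch below checks this on a small example; the labels and scores are invented for illustration and are not taken from the competition data, and `pairwise_auc` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels (1 = default) and predicted probabilities, invented for illustration
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.50, 0.45])

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Matrix of score differences for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

auc_pairs = pairwise_auc(y_true, y_score)
auc_sklearn = roc_auc_score(y_true, y_score)
print(auc_pairs, auc_sklearn)
```

Because only the ranking matters, AUC is unaffected by any strictly increasing rescaling of the predicted probabilities.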


In [1]:
# Problem 2: Learning and verification

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Uploading train data
from google.colab import files
uploaded = files.upload()

# Loading
train_df = pd.read_csv('application_train.csv')

print(train_df.shape)
print(train_df.head())

# Preparing: simple numeric features only
X = train_df.select_dtypes(include=[np.number]).drop(['TARGET', 'SK_ID_CURR'], axis=1)
y = train_df['TARGET']

# Filling missing with median
X = X.fillna(X.median())

# Split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Training baseline Logistic Regression
lr = LogisticRegression(max_iter=500)
lr.fit(X_train, y_train)

# Predicting & AUC
y_pred = lr.predict_proba(X_valid)[:,1]
auc = roc_auc_score(y_valid, y_pred)
print("Baseline Logistic Regression AUC:", auc)


Saving application_train.csv to application_train.csv
(307511, 122)
   SK_ID_CURR  TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR  \
0      100002       1         Cash loans           M            N   
1      100003       0         Cash loans           F            N   
2      100004       0    Revolving loans           M            Y   
3      100006       0         Cash loans           F            N   
4      100007       0         Cash loans           M            N   

  FLAG_OWN_REALTY  CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  \
0               Y             0          202500.0    406597.5      24700.5   
1               N             0          270000.0   1293502.5      35698.5   
2               Y             0           67500.0    135000.0       6750.0   
3               Y             0          135000.0    312682.5      29686.5   
4               Y             0          121500.0    513000.0      21865.5   

   ...  FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_D

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
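
The convergence warning above is what scikit-learn emits when the solver runs out of iterations, which often happens when features sit on very different scales (here, amounts in the hundreds of thousands next to 0/1 flags). One possible remedy, sketched below on synthetic data rather than `application_train.csv`, is to impute and standardize inside a pipeline. The data-generating code and all variable names are assumptions for illustration, and the effect on the competition AUC was not measured here.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-ins: one large-scale amount, one binary flag, one small-scale ratio
amount = rng.normal(500_000, 200_000, n)
flag = rng.integers(0, 2, n).astype(float)
ratio = rng.normal(0.5, 0.1, n)
X_demo = np.column_stack([amount, flag, ratio])
logit = (amount - 500_000) / 200_000 + flag - 0.5
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Knock out about 5% of entries to mimic missing values
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Imputer and scaler are fit on the training fold only, then reused on validation
scaled_lr = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=500),
)
scaled_lr.fit(X_tr, y_tr)
demo_auc = roc_auc_score(y_va, scaled_lr.predict_proba(X_va)[:, 1])
n_iter = scaled_lr[-1].n_iter_[0]
print(f"validation AUC: {demo_auc:.3f}, solver iterations used: {n_iter}")
```

Keeping the imputer and scaler inside the pipeline also means the same fitted statistics are applied automatically when predicting on new data.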


In [2]:
# Problem 3: Estimation on test data

# Uploading test data
uploaded = files.upload()

# Loading test
test_df = pd.read_csv('application_test.csv')
print(test_df.shape)
print(test_df.head())

# Using the same numeric columns, in the same order, as training
X_test = test_df[X.columns]
# Filling missing values with the training medians so train and test are imputed consistently
X_test = X_test.fillna(X.median())

# Predicting
test_pred = lr.predict_proba(X_test)[:,1]

# Creating submission
submission = pd.DataFrame({
    'SK_ID_CURR': test_df['SK_ID_CURR'],
    'TARGET': test_pred
})

submission.to_csv('baseline_submission.csv', index=False)
print("Submission file 'baseline_submission.csv' created!")

# Downloading file (files was already imported from google.colab above)
files.download('baseline_submission.csv')


Saving application_test.csv to application_test.csv
(48744, 121)
   SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  \
0      100001         Cash loans           F            N               Y   
1      100005         Cash loans           M            N               Y   
2      100013         Cash loans           M            Y               Y   
3      100028         Cash loans           F            N               Y   
4      100038         Cash loans           M            Y               N   

   CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  AMT_GOODS_PRICE  \
0             0          135000.0    568800.0      20560.5         450000.0   
1             0           99000.0    222768.0      17370.0         180000.0   
2             0          202500.0    663264.0      69777.0         630000.0   
3             2          315000.0   1575000.0      49018.5        1575000.0   
4             1          180000.0    625500.0      32067.0         625500.0  
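
Before uploading, it can be worth checking the submission against the format stated in Problem 1 (an `SK_ID_CURR` column and a `TARGET` probability column). The helper below is a hypothetical sketch, not part of the competition tooling: `check_submission` and the toy frame are illustrative names, and the checks for unique IDs and probabilities in [0, 1] are assumptions about what a sensible file looks like.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise ValueError if the submission frame violates the assumed format."""
    if list(sub.columns) != ["SK_ID_CURR", "TARGET"]:
        raise ValueError(f"unexpected columns: {list(sub.columns)}")
    if sub["SK_ID_CURR"].duplicated().any():
        raise ValueError("duplicate SK_ID_CURR values")
    if set(sub["SK_ID_CURR"]) != set(expected_ids):
        raise ValueError("IDs do not match the test set")
    if sub["TARGET"].isna().any():
        raise ValueError("TARGET contains missing values")
    if not sub["TARGET"].between(0, 1).all():
        raise ValueError("TARGET must be a probability in [0, 1]")

# Toy example using the first three test IDs shown in the output above;
# the TARGET values are invented
toy_ids = pd.Series([100001, 100005, 100013])
toy_sub = pd.DataFrame({"SK_ID_CURR": toy_ids, "TARGET": [0.08, 0.12, 0.03]})
check_submission(toy_sub, toy_ids)  # passes silently when the frame is valid
```

In this notebook the call would be `check_submission(submission, test_df['SK_ID_CURR'])`, placed just before `submission.to_csv(...)`.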


In [3]:
# Problem 4: Feature engineering

from sklearn.preprocessing import LabelEncoder

# Copying original
df = train_df.copy()

# Example 1: Numeric only (baseline)
X1 = df.select_dtypes(include=[np.number]).drop(['TARGET', 'SK_ID_CURR'], axis=1)
X1 = X1.fillna(X1.median())

# Example 2: Adding encoded Gender
X2 = X1.copy()
X2['CODE_GENDER'] = LabelEncoder().fit_transform(df['CODE_GENDER'])

# Example 3: Adding encoded Education
X3 = X2.copy()
X3['NAME_EDUCATION_TYPE'] = LabelEncoder().fit_transform(df['NAME_EDUCATION_TYPE'])

# Example 4: Adding a ratio feature (credit relative to income; +1 guards against division by zero)
X4 = X3.copy()
X4['CREDIT_INCOME_RATIO'] = df['AMT_CREDIT'] / (df['AMT_INCOME_TOTAL'] + 1)

# Example 5: Filling any remaining missing values with 0 for demonstration
# (X1 was already median-filled, so this only matters if the added columns contain NaN)
X5 = X4.fillna(0)

# Target; each feature set is split inside train_and_score with the same random_state,
# so all five patterns are scored on the same validation rows
y = df['TARGET']

# Training & validating each version
def train_and_score(X, y):
    """Fit a logistic regression on an 80/20 split of (X, y) and return the validation AUC."""
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=500)
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_valid)[:, 1]
    return roc_auc_score(y_valid, y_pred)

auc1 = train_and_score(X1, y)
auc2 = train_and_score(X2, y)
auc3 = train_and_score(X3, y)
auc4 = train_and_score(X4, y)
auc5 = train_and_score(X5, y)

print("AUCs for 5 patterns:")
print(f"1) Baseline numeric only: {auc1:.4f}")
print(f"2) + Gender: {auc2:.4f}")
print(f"3) + Education: {auc3:.4f}")
print(f"4) + Interaction (credit/income): {auc4:.4f}")
print(f"5) + Fill missing with zero: {auc5:.4f}")


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

AUCs for 5 patterns:
1) Baseline numeric only: 0.6269
2) + Gender: 0.6272
3) + Education: 0.6278
4) + Interaction (credit/income): 0.6277
5) + Fill missing with zero: 0.6277
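
`LabelEncoder` assigns integers to categories in sorted order, which a linear model then reads as an ordered numeric scale. An alternative worth trying is one-hot encoding, so that each category gets its own coefficient. The following is a minimal sketch on a made-up frame: the rows and category values are illustrative, and only the column names `CODE_GENDER`, `NAME_EDUCATION_TYPE`, and `AMT_CREDIT` come from the notebook. Whether this changes the AUCs above was not tested.

```python
import pandas as pd

# Invented rows; only the column names match the notebook
toy = pd.DataFrame({
    "CODE_GENDER": ["M", "F", "F", "M"],
    "NAME_EDUCATION_TYPE": [
        "Higher education",
        "Secondary / secondary special",
        "Higher education",
        "Incomplete higher",
    ],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["CODE_GENDER", "NAME_EDUCATION_TYPE"], dtype=float)
print(encoded.columns.tolist())
```

In a train/test workflow, the test frame needs the same indicator columns as the training frame, for example via `encoded_test.reindex(columns=encoded.columns, fill_value=0)`, where `encoded_test` is a hypothetical name for the encoded test frame.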




In Problem 4 above, I tested 5 different feature engineering patterns:

1) Numeric only (baseline)

2) Added encoded gender

3) Added encoded education level

4) Added a new interaction feature (credit to income ratio)

5) Filled remaining missing values with zero.

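Adding the encoded categorical features raised validation AUC slightly, from 0.6269 for the baseline to 0.6272 with gender and 0.6278 with gender and education. The credit-to-income ratio did not help further (0.6277), and pattern 5 scored the same as pattern 4, which suggests that no missing values were left to fill after the median imputation. These gains are small and come from a single train/validation split of a model that reached its iteration limit, so they indicate that feature engineering can help the model, but not by how much.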
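
Since the differences between patterns are in the third or fourth decimal place, a single 80/20 split may not be enough to tell them apart. One way to get a sense of the split-to-split variation is stratified k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the competition features; the helper name `cv_auc` and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the competition data (about 10% positives)
X_syn, y_syn = make_classification(
    n_samples=3000, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

def cv_auc(X, y, n_splits=5):
    """Return per-fold validation AUCs for a scaled logistic regression."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return cross_val_score(model, X, y, cv=folds, scoring="roc_auc")

scores_all = cv_auc(X_syn, y_syn)
scores_subset = cv_auc(X_syn[:, :5], y_syn)  # same folds, first five columns only
print(f"all features: {scores_all.mean():.4f} +/- {scores_all.std():.4f}")
print(f"first 5 only: {scores_subset.mean():.4f} +/- {scores_subset.std():.4f}")
```

A difference between two feature sets is more convincing when it clearly exceeds the fold-to-fold standard deviation.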
