
#### What data do you learn and what do you predict?

- Learning Data: The competition provides various datasets related to loan applications. These include:
1. application_{train|test}.csv: The main table containing information about each loan application. It includes features like demographics, financial information, application details, etc.
2. bureau.csv and bureau_balance.csv: Data about previous credits of the applicants from other financial institutions.
3. previous_application.csv: Information about previous loan applications at Home Credit by the applicants.
4. POS_CASH_balance.csv: Monthly balance snapshots of previous point of sale (POS) and cash loans that the applicant had with Home Credit.
5. credit_card_balance.csv: Monthly balance snapshots of previous credit cards that the applicant had with Home Credit.
6. installments_payments.csv: Repayment history for the previously disbursed credits in previous_application.csv.

- Prediction: The goal is to predict the probability that each loan application in the application_test.csv dataset will default (i.e., the client will not repay the loan).

#### In what format do you create a file and submit it to Kaggle?

- The submission file should be a CSV file with exactly two columns:
    - SK_ID_CURR: The unique identifier of each loan application in the application_test.csv file.
    - TARGET: The predicted probability of default for that loan application. This value should be between 0 and 1 (inclusive).
- The submission file must contain a header row with these exact column names.
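
The format rules above can be checked before uploading. The following is a minimal sketch, not part of the competition tooling: `check_submission` is a hypothetical helper, and the IDs and probabilities are made-up toy values.

```python
import pandas as pd


def check_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise ValueError if `sub` breaks one of the format rules described above.

    Hypothetical helper for illustration; Kaggle performs its own checks.
    """
    if list(sub.columns) != ["SK_ID_CURR", "TARGET"]:
        raise ValueError(f"unexpected columns: {list(sub.columns)}")
    if sub["TARGET"].isna().any() or not sub["TARGET"].between(0, 1).all():
        raise ValueError("TARGET must be a probability in [0, 1] with no missing values")
    if sub["SK_ID_CURR"].duplicated().any():
        raise ValueError("duplicate SK_ID_CURR values")
    if set(sub["SK_ID_CURR"]) != set(expected_ids):
        raise ValueError("SK_ID_CURR values do not match the test set")


# Toy example with made-up IDs and probabilities
toy_ids = pd.Series([100001, 100005, 100013])
toy_sub = pd.DataFrame({"SK_ID_CURR": toy_ids, "TARGET": [0.07, 0.09, 0.05]})
check_submission(toy_sub, toy_ids)

# index=False keeps the file at exactly two columns plus the header row
csv_text = toy_sub.to_csv(index=False)
print(csv_text)
```

Passing `index=False` matters here: without it pandas writes the row index as an extra unnamed first column.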

#### What kind of indicators are used to evaluate the submitted data?

The evaluation metric for this competition is the Area Under the ROC Curve (AUC). The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.

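AUC measures the ability of the classifier to distinguish between positive and negative classes: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A score of 0.5 corresponds to random ranking, and a score closer to 1 indicates better performance.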
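
To make the metric concrete, here is a small sketch on four made-up labels and scores. It computes AUC with scikit-learn's `roc_auc_score` and again by counting correctly ordered positive/negative pairs; the variable names are illustrative only.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Made-up labels (1 = default) and predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# Pairwise view: fraction of (positive, negative) pairs in which the
# positive example gets the higher score (ties count as half)
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairwise = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg)) / (len(pos) * len(neg))

print(auc, pairwise)  # both are 0.75: three of the four pairs are ordered correctly
```

Because only the ordering of the scores matters, rescaling every prediction by a positive constant leaves the AUC unchanged.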

In [2]:
# 3. Create a Baseline Model
# Build a simple baseline from a single data file (application_train.csv): perform basic
# preprocessing, train a Logistic Regression model, and generate a submission file for
# application_test.csv.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.impute import SimpleImputer

# Load the training and testing data
train_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/application_train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/application_test.csv')

# Identify the target variable and the application IDs
TARGET = 'TARGET'
train_ids = train_df['SK_ID_CURR']
test_ids = test_df['SK_ID_CURR']

# Separate the target from the features and drop the ID column from both sets
if TARGET not in train_df.columns:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise KeyError(f"Target column '{TARGET}' not found in application_train.csv")
y = train_df[TARGET]
X = train_df.drop(columns=[TARGET, 'SK_ID_CURR'])
X_test = test_df.drop(columns=['SK_ID_CURR'])

# Align the training and testing columns (important for consistent feature sets).
# A list comprehension keeps the original column order; a set intersection does not.
common_cols = [col for col in X.columns if col in X_test.columns]
X = X[common_cols]
X_test = X_test[common_cols]

# Handle categorical features using one-hot encoding
X = pd.get_dummies(X)
X_test = pd.get_dummies(X_test)

# Align columns again after one-hot encoding, keeping a stable column order
common_cols_encoded = [col for col in X.columns if col in X_test.columns]
X = X[common_cols_encoded]
X_test = X_test[common_cols_encoded]

# Impute missing values using the mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
X_test_imputed = imputer.transform(X_test)

# Hold out 20% of the training rows for local validation
X_train, X_val, y_train, y_val = train_test_split(X_imputed, y, test_size=0.2, random_state=42)

# Initialize and train a Logistic Regression model
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_val_proba = model.predict_proba(X_val)[:, 1]
auc_val = roc_auc_score(y_val, y_pred_val_proba)
print(f"Baseline Validation AUC: {auc_val:.4f}")

# Make predictions on the test data
y_pred_test_proba = model.predict_proba(X_test_imputed)[:, 1]

# Create the submission DataFrame
submission_df = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': y_pred_test_proba})

# Save the submission file
submission_file = 'baseline_submission.csv'
submission_df.to_csv(submission_file, index=False)

print(f"Baseline submission file: {submission_file}")

Baseline Validation AUC: 0.6246
Baseline submission file: baseline_submission.csv


In [3]:
submission_df

| | SK_ID_CURR | TARGET |
|---|---|---|
| 0 | 100001 | 0.070117 |
| 1 | 100005 | 0.094825 |
| 2 | 100013 | 0.053287 |
| 3 | 100028 | 0.022316 |
| 4 | 100038 | 0.067825 |
| ... | ... | ... |
| 48739 | 456221 | 0.070791 |
| 48740 | 456222 | 0.173277 |
| 48741 | 456223 | 0.086267 |
| 48742 | 456224 | 0.066161 |


Explanation of the Baseline Model:

1. Load Data: We load application_train.csv and application_test.csv with pandas.
2. Identify Target and IDs: We identify the target variable ('TARGET') and the application IDs ('SK_ID_CURR').
3. Feature and Target Separation: We separate the features (X) from the target variable (y) in the training data and drop the ID column from both datasets.
4. Column Alignment: We keep only the columns that the training and testing datasets share before one-hot encoding.
5. Handle Categorical Features: We use pd.get_dummies() to perform one-hot encoding on categorical features.
6. Column Alignment After Encoding: After encoding, we align the columns again, because a category that appears in only one dataset produces a dummy column the other lacks.
7. Impute Missing Values: We use SimpleImputer to fill missing values with the mean of each column, computed on the training data.
8. Train-Validation Split: We hold out 20% of the training data as a validation set to evaluate the model locally.
9. Train Logistic Regression: We initialize and train a LogisticRegression model. We use the 'liblinear' solver, which is suitable for smaller datasets.
10. Evaluate on Validation Set: We predict probabilities of default on the validation set and calculate the AUC score using roc_auc_score. This gives us an idea of the model's performance before submitting to Kaggle.
11. Predict on Test Data: We make probability predictions on the application_test.csv data.
12. Create Submission File: We create a pandas DataFrame with the 'SK_ID_CURR' from the test data and the predicted probabilities in the 'TARGET' column.
13. Save Submission File: We save the DataFrame to a CSV file named baseline_submission.csv in the required format.
Next Steps:

#### 4. Feature Engineering

Now, let's think about how to improve the baseline model by engineering new features. The three perspectives below give five patterns to train and validate:

- Perspective 1: Handling Categorical Features More Effectively

 Pattern 1: Label Encoding for Ordinal Features: Identify ordinal categorical features (e.g., education level, or housing situation if it has a natural order) and encode them as ordered integers instead of one-hot encoding them. One-hot encoding discards the order and adds one column per category, which inflates the feature count for high-cardinality variables, so an integer encoding that preserves the order might be more appropriate for ordinal features.

 Pattern 2: Frequency Encoding for Nominal Features: For nominal categorical features with many unique values (high cardinality), try frequency encoding. Replace each category with its frequency in the dataset. This can capture information about the prevalence of different categories.
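
Patterns 1 and 2 can be sketched on a toy frame as follows. The column names NAME_EDUCATION_TYPE and ORGANIZATION_TYPE, the category labels, and in particular the order in `education_order` are assumptions made for illustration; check them against the actual data before use.

```python
import pandas as pd

# Toy stand-ins for the application tables (made-up rows)
train = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Higher education", "Lower secondary",
                            "Secondary / secondary special", "Higher education"],
    "ORGANIZATION_TYPE": ["School", "Business Entity Type 3", "School", "School"],
})
test = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Academic degree", "Incomplete higher"],
    "ORGANIZATION_TYPE": ["School", "Police"],
})

# Pattern 1: ordinal encoding with an explicit order (assumed, lowest to highest)
education_order = ["Lower secondary", "Secondary / secondary special",
                   "Incomplete higher", "Higher education", "Academic degree"]
edu_rank = {name: rank for rank, name in enumerate(education_order)}
for df in (train, test):
    df["EDUCATION_ORDINAL"] = df["NAME_EDUCATION_TYPE"].map(edu_rank)

# Pattern 2: frequency encoding, with frequencies learned from the training data only
org_freq = train["ORGANIZATION_TYPE"].value_counts(normalize=True)
for df in (train, test):
    # Categories never seen in training get frequency 0
    df["ORGANIZATION_FREQ"] = df["ORGANIZATION_TYPE"].map(org_freq).fillna(0.0)

print(train)
print(test)
```

Learning the frequencies from the training rows and reusing them on the test rows keeps both sets on the same scale.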

- Perspective 2: Creating New Features from Existing Numerical Features

 Pattern 3: Ratio Features: Create new features by taking ratios of existing numerical features that might be intuitively related to credit risk. For example:
  * Ratio of AMT_INCOME_TOTAL to AMT_CREDIT.
  * Ratio of AMT_ANNUITY to AMT_INCOME_TOTAL.
  * Ratio of AMT_CREDIT to AMT_GOODS_PRICE.

 Pattern 4: Polynomial Features: Generate polynomial features by raising existing numerical features to higher powers (e.g., square, cube) or by creating interaction terms (e.g., multiplying two features). This can help the model capture non-linear relationships.

- Perspective 3: Incorporating External Data (Bureau Data)

 Pattern 5: Aggregated Bureau Statistics: Merge the bureau.csv and bureau_balance.csv data with the main application data. Because one applicant can have many bureau records, first calculate aggregate statistics (e.g., count, mean, max, min, sum) per applicant, such as the number of past loans, the total amount of past credits, and the debt-to-credit ratio, and then join them on SK_ID_CURR.
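
Pattern 5 might look like the sketch below. The bureau column names (SK_ID_BUREAU, AMT_CREDIT_SUM, AMT_CREDIT_SUM_DEBT) are assumed to match bureau.csv and the rows are made up; the new BUREAU_* feature names are hypothetical.

```python
import pandas as pd

# Toy stand-in for bureau.csv: one row per past credit (made-up values)
bureau = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100005],
    "SK_ID_BUREAU": [1, 2, 3],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
    "AMT_CREDIT_SUM_DEBT": [0.0, 1500.0, 250.0],
})
# Toy stand-in for the application table: one row per applicant
app = pd.DataFrame({"SK_ID_CURR": [100001, 100005, 100013]})

# Collapse the many bureau rows to one row per applicant
bureau_agg = bureau.groupby("SK_ID_CURR").agg(
    BUREAU_LOAN_COUNT=("SK_ID_BUREAU", "count"),
    BUREAU_CREDIT_SUM=("AMT_CREDIT_SUM", "sum"),
    BUREAU_CREDIT_MEAN=("AMT_CREDIT_SUM", "mean"),
    BUREAU_DEBT_SUM=("AMT_CREDIT_SUM_DEBT", "sum"),
).reset_index()
bureau_agg["BUREAU_DEBT_CREDIT_RATIO"] = (
    bureau_agg["BUREAU_DEBT_SUM"] / bureau_agg["BUREAU_CREDIT_SUM"]
)

# A left join keeps applicants without any bureau history
app = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
# No bureau rows means zero past loans; the other statistics stay missing
app["BUREAU_LOAN_COUNT"] = app["BUREAU_LOAN_COUNT"].fillna(0)
print(app)
```

Aggregating before the merge keeps the application table at one row per SK_ID_CURR, which the model and the submission format both require.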

- Implementation Plan for Feature Engineering (illustrative; you'll need to adapt and expand it):

For each of the five patterns mentioned above, you would:

* Implement the Feature Engineering: Write code to create the new features or modify the existing ones based on the chosen pattern.
* Preprocess the Data: Handle missing values and encode categorical features as needed for the modified dataset.
* Train a Model: Train a Logistic Regression (or another suitable classifier) on the engineered features using the training data.
* Evaluate on Validation Set: Make predictions on the validation set and calculate the AUC score to assess the impact of the feature engineering.
* Record Results: Keep track of the AUC score for each pattern to compare their effectiveness.
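
The loop described above could be organized as in the following sketch. It runs on synthetic data from `make_classification` so that it is self-contained; `validation_auc`, the feature names, and the `results` dictionary are hypothetical. Unlike the baseline cell, the imputer sits inside a pipeline so that it is fitted on the training split only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def validation_auc(X, y, seed=42):
    """Train on 80% of the rows and return the AUC on the remaining 20%."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = make_pipeline(
        SimpleImputer(strategy="mean"),
        LogisticRegression(solver="liblinear", random_state=seed),
    )
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])


# Synthetic stand-in for the application data
X_arr, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                               n_redundant=1, random_state=0)
X_base = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

# One engineered variant: add an interaction term
X_interaction = X_base.assign(f0_times_f1=X_base["f0"] * X_base["f1"])

# Record one validation AUC per pattern and print them best first
results = {
    "baseline": validation_auc(X_base, y),
    "with_interaction": validation_auc(X_interaction, y),
}
for name, auc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {auc:.4f}")
```

Using the same seed for every pattern means each variant is scored on the same validation rows, so differences in AUC come from the features rather than from the split.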

In [5]:
# Example: ratio features (Pattern 3)
# train_df and test_df are the DataFrames loaded in the baseline cell

ratio_cols = ['INCOME_CREDIT_RATIO', 'ANNUITY_INCOME_RATIO', 'CREDIT_GOODS_RATIO']

# Add the same ratio features to both sets so their columns stay aligned
for df in (train_df, test_df):
    df['INCOME_CREDIT_RATIO'] = df['AMT_INCOME_TOTAL'] / df['AMT_CREDIT']
    df['ANNUITY_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
    df['CREDIT_GOODS_RATIO'] = df['AMT_CREDIT'] / df['AMT_GOODS_PRICE']
    # A zero denominator yields inf, which SimpleImputer rejects; treat it as missing
    df[ratio_cols] = df[ratio_cols].replace([float('inf'), float('-inf')], float('nan'))


Final Submission:

For the model with the highest validation AUC, you will:

1. Train on the Full Training Data: Train the model using the chosen feature engineering steps on the entire application_train.csv dataset.
2. Predict on Test Data: Make predictions on the application_test.csv data using the same feature engineering and preprocessing steps.
3. Create and Submit: Generate the submission CSV file in the specified format and submit it.
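
These final steps can be sketched as one function. `make_submission` is a hypothetical helper and the toy frames use made-up columns and values. The sketch swaps the baseline's column intersection for `reindex`, which gives the test set exactly the training columns; that is a design choice of this example, not the notebook's method.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def make_submission(train, test, target="TARGET", id_col="SK_ID_CURR"):
    """Fit on all training rows and return a two-column submission frame."""
    y = train[target]
    X = pd.get_dummies(train.drop(columns=[target, id_col]), dtype=float)
    X_test = pd.get_dummies(test.drop(columns=[id_col]), dtype=float)
    # Give the test set the training columns: dummies missing from test become 0,
    # dummies that exist only in test are dropped
    X_test = X_test.reindex(columns=X.columns, fill_value=0)

    model = make_pipeline(
        SimpleImputer(strategy="mean"),
        LogisticRegression(solver="liblinear", random_state=42),
    )
    model.fit(X, y)  # no validation split: every labelled row is used
    proba = model.predict_proba(X_test)[:, 1]
    return pd.DataFrame({id_col: test[id_col].to_numpy(), target: proba})


# Toy data with made-up columns and values
toy_train = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4, 5, 6],
    "AMT_CREDIT": [100.0, 200.0, 150.0, 400.0, 500.0, 450.0],
    "CONTRACT": ["cash", "cash", "revolving", "cash", "revolving", "cash"],
    "TARGET": [0, 0, 0, 1, 1, 1],
})
toy_test = pd.DataFrame({
    "SK_ID_CURR": [7, 8],
    "AMT_CREDIT": [120.0, None],
    "CONTRACT": ["cash", "other"],
})

submission = make_submission(toy_train, toy_test)
submission.to_csv("final_submission.csv", index=False)
print(submission)
```

Any feature engineering chosen from the patterns above would be applied to both frames before they are passed to the function.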
