## Table of Contents
1. [Setting Up](#setting-up)
2. [Loading the dataset](#loading-the-dataset)
3. [Preprocessing](#preprocessing)
4. [Preparing dataset for training and validating](#preparing-dataset-for-training-and-validating)
5. [Model Training and Evaluation](#model-training-and-evaluation)
6. [Comparing models](#comparing-models)
7. [Parameter tuning for Logistic Regression](#parameter-tuning-for-logistic-regression)
8. [Getting and evaluating the optimized logistic regression model](#getting-and-evaluating-the-optimized-logistic-regression-model)
9. [Saving the model](#saving-the-model)

## Setting Up

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.pipeline import Pipeline
import pickle

## Loading the dataset

In [2]:
# Load the CSV file
file_path = '/kaggle/input/train-data/credit_card_train.csv'
df = pd.read_csv(file_path)

In [3]:
df

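| Unnamed: 0 | Num_Children | Gender | Income | Own_Car | Own_Housing | Credit_Card_Issuing |
|---|---|---|---|---|---|---|
| 0 | 1 | Male | 40690 | No | Yes | Denied |
| 1 | 2 | Female | 75469 | Yes | No | Denied |
| 2 | 1 | Male | 70497 | Yes | Yes | Approved |
| 3 | 1 | Male | 61000 | No | No | Denied |
| 4 | 1 | Male | 56666 | Yes | Yes | Denied |
| ... | ... | ... | ... | ... | ... | ... |
| 399995 | 2 | Male | 55332 | Yes | No | Denied |
| 399996 | 1 | Male | 95108 | No | No | Approved |
| 399997 | 3 | Male | 57163 | Yes | Yes | Denied |
| 399998 | 5 | Male | 112237 | Yes | Yes | Approved |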


## Preprocessing

In [4]:
# Convert the target variable 'Credit_Card_Issuing' to binary (1 for Approved, 0 for Denied)
df['Credit_Card_Issuing'] = df['Credit_Card_Issuing'].apply(lambda x: 1 if x == 'Approved' else 0)

# Label encode categorical variables
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
df['Own_Car'] = le.fit_transform(df['Own_Car'])
df['Own_Housing'] = le.fit_transform(df['Own_Housing'])

In [5]:
df

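| Unnamed: 0 | Num_Children | Gender | Income | Own_Car | Own_Housing | Credit_Card_Issuing |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 40690 | 0 | 1 | 0 |
| 1 | 2 | 0 | 75469 | 1 | 0 | 0 |
| 2 | 1 | 1 | 70497 | 1 | 1 | 1 |
| 3 | 1 | 1 | 61000 | 0 | 0 | 0 |
| 4 | 1 | 1 | 56666 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 399995 | 2 | 1 | 55332 | 1 | 0 | 0 |
| 399996 | 1 | 1 | 95108 | 0 | 0 | 1 |
| 399997 | 3 | 1 | 57163 | 1 | 1 | 0 |
| 399998 | 5 | 1 | 112237 | 1 | 1 | 1 |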
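
Reusing one `LabelEncoder` for three columns works, but each `fit_transform` call overwrites the previous mapping, so the codes can no longer be looked up from `le` afterwards. As a minimal sketch of an alternative, the snippet below uses explicit dictionaries so the encoding is visible in the code. The `toy` frame and the `binary_maps` name are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (same column names, made-up rows)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Own_Car": ["No", "Yes", "Yes"],
    "Own_Housing": ["Yes", "No", "Yes"],
    "Credit_Card_Issuing": ["Denied", "Denied", "Approved"],
})

# For the three feature columns this matches LabelEncoder's alphabetical ordering;
# the target follows the notebook's convention of 1 = Approved, 0 = Denied
binary_maps = {
    "Gender": {"Female": 0, "Male": 1},
    "Own_Car": {"No": 0, "Yes": 1},
    "Own_Housing": {"No": 0, "Yes": 1},
    "Credit_Card_Issuing": {"Denied": 0, "Approved": 1},
}

encoded = toy.copy()
for column, mapping in binary_maps.items():
    # Categories missing from the mapping become NaN instead of being silently encoded
    encoded[column] = toy[column].map(mapping)

print(encoded)
```

Keeping the mapping in a dictionary also makes it clear that `Gender == 1` means male, which the fairness check later relies on.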


## Preparing dataset for training and validating

In [6]:
# Separate features and target
X = df.drop('Credit_Card_Issuing', axis=1)
y = df['Credit_Card_Issuing']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model Training and Evaluation

In [7]:
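# Boolean masks over the test set for the per-gender fairness check
# (LabelEncoder sorts labels alphabetically: Female -> 0, Male -> 1)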
male_indices = X_test['Gender'] == 1
female_indices = X_test['Gender'] == 0

In [8]:
# Function to train, predict, evaluate models, and optionally return the model
def train_and_evaluate_model(model, model_name, return_model=False):
    # Create a pipeline with the model
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    # Train the model
    pipeline.fit(X_train, y_train)

    # Predict on the test set
    y_pred = pipeline.predict(X_test)

    # Performance metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print(f"Performance for {model_name}:")
    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision * 100:.2f}%")
    print(f"Recall: {recall * 100:.2f}%")
    print(f"F1-Score: {f1 * 100:.2f}%")

    # Fairness: Check classification reports for males and females
    y_pred_male = pipeline.predict(X_test[male_indices])
    y_true_male = y_test[male_indices]
    y_pred_female = pipeline.predict(X_test[female_indices])
    y_true_female = y_test[female_indices]

    print("\nBias/Fairness Evaluation:")
    print(f"Male Classification Report for {model_name}:")
    print(classification_report(y_true_male, y_pred_male))
    print(f"Female Classification Report for {model_name}:")
    print(classification_report(y_true_female, y_pred_female))

    # Variance: Compare training and test performance
    y_train_pred = pipeline.predict(X_train)
    train_accuracy = accuracy_score(y_train, y_train_pred)
    print(f"\nVariance Check for {model_name}:")
    print(f"Training Accuracy: {train_accuracy * 100:.2f}%")
    print(f"Test Accuracy: {accuracy * 100:.2f}%")
    
    # Interpretability: coefficients for linear models, feature_importances_ for tree ensembles
    fitted_model = pipeline.named_steps['model']
    if hasattr(fitted_model, 'coef_'):  # e.g. Logistic Regression
        feature_importance = fitted_model.coef_[0]
    elif hasattr(fitted_model, 'feature_importances_'):  # e.g. Random Forest and XGBoost
        feature_importance = fitted_model.feature_importances_
    else:
        feature_importance = None

    if feature_importance is not None:
        # Rank features by absolute magnitude, largest first
        importance_sorted = sorted(zip(X.columns, feature_importance), key=lambda item: abs(item[1]), reverse=True)
        print(f"\nFeature Importance for {model_name}:")
        for feature, importance in importance_sorted:
            print(f"{feature}: {importance:.2f}")

    # Optionally return the trained model
    if return_model:
        return pipeline

### Comments

- **Standardization (Scaler)**: The data is scaled using `StandardScaler` to ensure that all features have a similar scale, which is crucial for models like Logistic Regression that are sensitive to feature magnitude.
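
To illustrate what the scaling step does, here is a minimal sketch on made-up numbers (a small-range column next to a large-range one, loosely resembling `Num_Children` and `Income`). The arrays are hypothetical. The point is that the mean and standard deviation are learned from the training rows only and then reused on new rows, which is what the `Pipeline` does automatically.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 has a small range, column 1 a large one
X_train_toy = np.array([[1, 40690], [2, 75469], [1, 70497], [5, 112237]], dtype=float)
X_test_toy = np.array([[3, 57163]], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test_toy)        # reuse those statistics on new rows

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled)
```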

1. **Performance Evaluation**:
    - **Accuracy**: The fraction of all test-set predictions that are correct. For example, the optimized Logistic Regression model later in this notebook reaches 97.26%.

    - **Precision**: Precision measures how many of the predicted positive cases were actually positive. This metric is especially important in cases where false positives are costly.

    - **Recall**: Recall evaluates how many of the actual positive cases were correctly identified by the model. High recall ensures that most true positives are detected.

    - **F1-Score**: The F1-score, which is the harmonic mean of precision and recall, is used as a balanced metric when both false positives and false negatives matter. This score was calculated for both the overall dataset and for specific groups (males and females).

2. **Fairness and Bias Evaluation**: Predictions are evaluated separately for males and females by filtering the test set (`X_test`) with the boolean masks `male_indices` and `female_indices`. A classification report is generated for each gender to check whether model performance differs between the two groups.

3. **Variance Check**: Training and test accuracies are compared to check for overfitting or underfitting. For Logistic Regression the two are nearly identical (97.31% training vs 97.25% test), which suggests the model generalizes well.

4. **Feature Importance**: For Logistic Regression, feature importance is read from the coefficients (`coef_`); for Random Forest and XGBoost it comes from `feature_importances_`. Features are ranked by absolute value to show which factors (for example, income or gender) contribute most to the model's predictions. Because the inputs are standardized, the coefficient magnitudes are comparable across features.
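
The metric definitions and the per-gender check above can be traced by hand on a toy example. The labels, predictions, and gender flags below are made up for illustration; the sketch computes precision, recall, and F1 from the confusion-matrix counts, compares them with scikit-learn's functions, and then computes recall separately for each group using boolean masks.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels and predictions (1 = Approved, 0 = Denied)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0])
gender = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])  # 1 = Male, 0 = Female

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted approvals that were correct
recall = tp / (tp + fn)     # share of actual approvals that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))

# Per-group recall, using boolean masks like male_indices / female_indices
recall_by_group = {}
for label, mask in {"Male": gender == 1, "Female": gender == 0}.items():
    recall_by_group[label] = recall_score(y_true[mask], y_pred[mask])

print(precision, recall, f1)
print(recall_by_group)
```

In this toy data the overall recall looks fine while recall for one group is much lower, which is the kind of gap the per-gender classification reports are meant to expose.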

## Comparing models

In [9]:
# Models to compare
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric='logloss')
}

# Train and evaluate all models
for model_name, model in models.items():
    train_and_evaluate_model(model, model_name)
    print("--------------------------------------------------------------------")

Performance for Logistic Regression:
Accuracy: 97.25%
Precision: 96.50%
Recall: 96.40%
F1-Score: 96.45%

Bias/Fairness Evaluation:
Male Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97     17045
           1       0.98      0.98      0.98     22910

    accuracy                           0.97     39955
   macro avg       0.97      0.97      0.97     39955
weighted avg       0.97      0.97      0.97     39955

Female Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98     32024
           1       0.93      0.93      0.93      8021

    accuracy                           0.97     40045
   macro avg       0.96      0.95      0.95     40045
weighted avg       0.97      0.97      0.97     40045


Variance Check for Logistic Regression:
Training Accuracy: 97.31%
Test Accuracy: 97.25%

Feature Importance 

### Comments

- **Best Model:** Based on accuracy, precision, recall, F1-score, and the fairness evaluation, **Logistic Regression** appears to be the best choice. It offers the best balance between performance and fairness, with no significant overfitting and broadly similar accuracy across genders.

- **General Comments:**
    - Random Forest shows signs of overfitting with a large gap between training and test accuracy, making it less generalizable to unseen data.
    - All models show relatively fair performance across genders, but Logistic Regression and XGBoost perform slightly better for female applicants than Random Forest.
    - Across all models, Income is the most important factor in determining credit card approval. Logistic Regression gives higher importance to features like Gender and Own_Housing, while XGBoost assigns more importance to Gender.
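
The overfitting comparison above comes down to the gap between training and test accuracy. As a rough sketch of that check, the snippet below fits two models on synthetic, deliberately noisy data from `make_classification` (not the credit card dataset) and reports the gap for each. The data, the `candidates` dictionary, and the parameter values are assumptions chosen only to make the effect visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, deliberately noisy data (flip_y randomizes a share of the labels)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=6, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}

gaps = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)  # accuracy on data the model has seen
    test_acc = model.score(X_te, y_te)   # accuracy on held-out data
    gaps[name] = train_acc - test_acc
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}, gap={gaps[name]:.3f}")
```

A fully grown random forest can memorize noisy training labels, so its gap is typically much larger than that of a linear model on the same data.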

## Parameter tuning for Logistic Regression

In [10]:
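# Hyperparameter grid: the 'model__' prefix routes each setting to the pipeline's 'model' step;
# the liblinear solver supports both the L1 and L2 penalties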
param_grid_lr = {
    'model__C': [0.01, 0.1, 1, 10, 100],
    'model__penalty': ['l1', 'l2'],
    'model__solver': ['liblinear']
}

# Grid Search for Logistic Regression
grid_search_lr = GridSearchCV(Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())]), 
                              param_grid_lr, cv=5, scoring='accuracy')
grid_search_lr.fit(X_train, y_train)

# Get best params and score for Logistic Regression
best_params_lr = grid_search_lr.best_params_
best_score_lr = grid_search_lr.best_score_

print(f"\nBest parameters for Logistic Regression: {best_params_lr}")
print(f"Best cross-validation score: {best_score_lr * 100:.2f}%")


Best parameters for Logistic Regression: {'model__C': 0.1, 'model__penalty': 'l1', 'model__solver': 'liblinear'}
Best cross-validation score: 97.31%
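
The fitted search object can also be used directly instead of copying the winning values by hand. The sketch below runs a smaller grid on synthetic data (an assumption, not the notebook's dataset) and shows two options: stripping the `model__` prefix from `best_params_` to get plain `LogisticRegression` keyword arguments, or taking `best_estimator_`, which with the default `refit=True` is the whole pipeline refitted on all the data passed to `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=500, n_features=5, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
grid = {
    "model__C": [0.1, 1, 10],
    "model__penalty": ["l1", "l2"],
    "model__solver": ["liblinear"],
}
search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy").fit(X_syn, y_syn)

# Option 1: strip the "model__" step prefix to get LogisticRegression keyword arguments
lr_kwargs = {key.split("__", 1)[1]: value for key, value in search.best_params_.items()}
print(lr_kwargs)

# Option 2: reuse the refitted pipeline (scaler + tuned model) as-is
tuned_pipeline = search.best_estimator_
print(tuned_pipeline.named_steps["model"])
```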


### Comments

- **Best Parameters:**
    - Best C value (0.1): In scikit-learn, `C` is the inverse of the regularization strength, so 0.1 regularizes more strongly than the default of 1.0 while still fitting the data well.
    - L1 penalty: L1 regularization can shrink some feature coefficients exactly to zero, which simplifies the model and acts as a form of feature selection.
- **High cross-validation score (97.31%):** This is the mean accuracy over the 5 folds of the training set. It is close to the untuned model's test accuracy (97.25%), so tuning changed performance very little, and the consistent scores suggest the model generalizes well to unseen data.
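
To make the link between `C` and the L1 penalty concrete, the following sketch counts how many coefficients are exactly zero at a few values of `C`. It uses synthetic data in which only two of ten features are informative; the data and the chosen `C` values are assumptions for illustration, not results from the credit card dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 2 informative features, 8 pure-noise features
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=2, n_redundant=0, random_state=0,
)

zero_counts = {}
for C in [0.001, 0.1, 100]:
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(C=C, penalty="l1", solver="liblinear")),
    ])
    pipe.fit(X_syn, y_syn)
    coefs = pipe.named_steps["model"].coef_[0]
    # Smaller C means a stronger penalty, which tends to zero out more coefficients
    zero_counts[C] = int(np.sum(coefs == 0))
    print(f"C={C}: {zero_counts[C]} of {coefs.size} coefficients are exactly zero")
```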

## Getting and evaluating the optimized logistic regression model

In [13]:
# Apply the best hyperparameters from GridSearchCV
best_logistic_regression = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')

# Train and evaluate the optimized Logistic Regression model
best_logistic_regression = train_and_evaluate_model(best_logistic_regression, "Optimized Logistic Regression", return_model=True)

Performance for Optimized Logistic Regression:
Accuracy: 97.26%
Precision: 96.49%
Recall: 96.41%
F1-Score: 96.45%

Bias/Fairness Evaluation:
Male Classification Report for Optimized Logistic Regression:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97     17045
           1       0.98      0.98      0.98     22910

    accuracy                           0.97     39955
   macro avg       0.97      0.97      0.97     39955
weighted avg       0.97      0.97      0.97     39955

Female Classification Report for Optimized Logistic Regression:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98     32024
           1       0.93      0.93      0.93      8021

    accuracy                           0.97     40045
   macro avg       0.96      0.95      0.95     40045
weighted avg       0.97      0.97      0.97     40045


Variance Check for Optimized Logistic Regression:
Training Accuracy: 97.31%
Test Accuracy: 97.26%

### Comments

- **Model Performance**:
    - **Accuracy**: The model reaches 97.26% accuracy on the test set, essentially unchanged from the untuned model (97.25%).
    - **Precision**: Overall precision is 96.49%, meaning the model rarely labels a denied application as approved.
    - **Recall**: Overall recall is 96.41%, so the model identifies most of the truly approved applications.
    - **F1-Score**: The overall F1-score of 96.45% shows that precision and recall are well balanced.


- **Bias/Fairness Evaluation**:
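    - **Male Classification**: Precision, recall, and F1-score are 97-98% for both classes, indicating strong performance for males.
    - **Female Classification**: The denied class scores 98%, but the approved class is lower at 93% for precision, recall, and F1-score, five points below the 98% for approved males. Approvals are also much rarer among females in the test set (8,021 of 40,045, about 20%) than among males (22,910 of 39,955, about 57%), so this gap is worth monitoring even though accuracy is 97% for both groups.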


- **Variance Check**:
    - **Training vs Test Accuracy**: Both accuracies are nearly identical (97.31% and 97.26%), meaning the model generalizes well without overfitting.


- **Feature Importance**:
    - **Income**: This is the most important feature, having the biggest impact on predictions.
    - **Gender**: It's important but not as much as income.
    - **Own_Housing and Own_Car**: These factors play a smaller role.
    - **Num_Children**: This has almost no impact on the model's predictions.
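
Because the pipeline standardizes the inputs, each logistic regression coefficient describes the effect of a one-standard-deviation change in that feature, and `exp(coef)` is the corresponding multiplicative change in the odds of approval. The sketch below shows this reading on made-up data in which approval depends on income only; the data-generating rule and all numbers are hypothetical and do not reproduce the notebook's coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
income = rng.normal(60000, 20000, n)
num_children = rng.integers(0, 6, n)

# Hypothetical rule: approval is driven by income (plus noise), not by children
approved = (income + rng.normal(0, 10000, n) > 65000).astype(int)

X_toy = np.column_stack([income, num_children])
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_toy, approved)

coefs = dict(zip(["Income", "Num_Children"], pipe.named_steps["model"].coef_[0]))
for name, coef in sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True):
    # exp(coef): factor by which the odds of approval change per one standard deviation
    print(f"{name}: coef={coef:.2f}, odds ratio per 1 SD={np.exp(coef):.2f}")
```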

## Saving the model

In [14]:
# Save the model to a .pkl file
model_filename = 'optimized_logistic_regression_model.pkl'
with open(model_filename, 'wb') as file:
    pickle.dump(best_logistic_regression, file)

print(f"Model saved as {model_filename}")

Model saved as optimized_logistic_regression_model.pkl
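
Since the saved object is the whole pipeline (scaler plus model), it can be loaded later and applied to raw, unscaled features. The round trip below is a minimal sketch on synthetic data with a temporary file; the file name `toy_model.pkl` and the data are made up for illustration. Pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that created them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and a small pipeline shaped like the notebook's
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(C=0.1, penalty="l1", solver="liblinear")),
]).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(path, "wb") as file:
        pickle.dump(pipe, file)          # save the fitted pipeline
    with open(path, "rb") as file:
        restored = pickle.load(file)     # load it back

# The restored pipeline should reproduce the original predictions
same_predictions = np.array_equal(pipe.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```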
