# Machine Learning Project - Group 18

### Table of Contents
1. [Exploratory Data Analysis](#1.-Exploratory-Data-Analysis)
    - [Monthly Income](#Monthly-Income)
    - [Revolving Utilization of Unsecured Lines](#Revolving-Utilization-Of-Unsecured-Lines)
    - [Debt Ratio](#Debt-Ratio)
    - [Age](#Age)
    - [Number Of Open Credit Lines And Loans](#Number-Of-Open-Credit-Lines-And-Loans)
    - [Number of Real Estate Loans and Lines](#Number-Of-Real-Estate-Loans-And-Lines)
    - [Number of Dependents](#Number-Of-Dependents)
    - [Number of N Days Past Due](#Number-Of-N-Days-Past-Due)
2. [Data Cleaning](#Data-Cleaning)
    - [Monthly Income - Missing Values](#Missing-Values)
    - [Debt Ratio - Incorrect values](#Monthly-Income)
3. [Feature Engineering](#Feature-Engineering)
    - [Rare Case](#Missing-Values)
4. [Preprocessing](#Preprocessing)

5. [Model Selection](#Model-Selection)
6. [Model Tuning](#Model-Tuning)


In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import math 
import seaborn as sns
from math import radians, sin, cos, sqrt, atan2 

from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, BaggingRegressor
from sklearn.ensemble import StackingRegressor, VotingRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn.datasets import make_regression
from scipy.spatial.distance import cdist
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


In [None]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('cs-test.csv')
df_train.head()
df_test.head()
df_train.drop(columns=['Unnamed: 0'], inplace=True)
df_test.drop(columns=['Unnamed: 0', 'SeriousDlqin2yrs'], inplace=True)


# 1. Exploratory Data Analysis

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
df_train.describe()

In [None]:
sns.countplot(x='SeriousDlqin2yrs', data=df_train)
plt.show()

The dataset is imbalanced: borrowers with `SeriousDlqin2yrs` = 1 are a small minority.
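To quantify the imbalance, the class shares and the negative-to-positive ratio can be computed directly. The sketch below uses a synthetic target with a 6% positive rate (an assumed toy series, not the project data), and `imbalance_summary` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd


def imbalance_summary(y: pd.Series) -> dict:
    """Return the class shares and the negative-to-positive ratio of a binary target."""
    shares = y.value_counts(normalize=True)
    negative = float(shares.loc[0])
    positive = float(shares.loc[1])
    return {
        "negative_share": negative,
        "positive_share": positive,
        # This ratio is one common choice for the weight of the positive class
        "neg_to_pos_ratio": negative / positive,
    }


# Synthetic target: 94 non-defaulters and 6 defaulters (illustrative values only)
toy_target = pd.Series([0] * 94 + [1] * 6)
summary = imbalance_summary(toy_target)
print(summary)
```

On the real data, the same function could be called as `imbalance_summary(df_train['SeriousDlqin2yrs'])`.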

## Monthly Income 

In [None]:
# Boxplot for 'MonthlyIncome'
sns.boxplot(x='MonthlyIncome', data=df_train)
plt.title('MonthlyIncome - BoxPlot')
plt.show()

In [None]:
df_train[df_train['MonthlyIncome'].isna()].describe()

In [None]:
df_train[df_train['MonthlyIncome'].isna()]['SeriousDlqin2yrs'].value_counts(normalize=True)

## Revolving Utilization Of Unsecured Lines

In [None]:
df_train['RevolvingUtilizationOfUnsecuredLines'].describe().to_frame().T

In [None]:
# Boxplot for 'RevolvingUtilizationOfUnsecuredLines' 
sns.boxplot(x='RevolvingUtilizationOfUnsecuredLines', data=df_train)
plt.title('RevolvingUtilizationOfUnsecuredLines - BoxPlot')
plt.show()

In [None]:
# Boxplot for 'RevolvingUtilizationOfUnsecuredLines' < 10
sns.boxplot(x='RevolvingUtilizationOfUnsecuredLines', data=df_train[df_train['RevolvingUtilizationOfUnsecuredLines']<10])
plt.title('RevolvingUtilizationOfUnsecuredLines < 10 - BoxPlot')
plt.show()


In [None]:
df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] > 10].describe()

Approximately 97.5% of the values of this variable lie between 0 and 1, with a well-defined right-skewed distribution. Credit utilization is generally expected to fall within the range 0 to 1, although borrowers can sometimes spend beyond their credit limit. Values between 1 and 10 make up about 2% of the dataset. Values above 10 are extreme and account for less than 0.5% of the data; they are dropped to prevent them from distorting the model.
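The shares quoted above can be checked with a short computation. The sketch below bins utilization into the three ranges discussed (0 to 1, 1 to 10, above 10). It runs on a small synthetic series rather than the project data, and `utilization_shares` is a hypothetical helper name.

```python
import pandas as pd


def utilization_shares(values: pd.Series) -> pd.Series:
    """Return the fraction of values in [0, 1], (1, 10] and (10, inf)."""
    bins = [0, 1, 10, float("inf")]
    labels = ["0-1", "1-10", ">10"]
    # include_lowest=True keeps exact zeros in the first bin
    binned = pd.cut(values, bins=bins, labels=labels, include_lowest=True)
    return binned.value_counts(normalize=True).reindex(labels)


# Synthetic example values, not the real column
example = pd.Series([0.0, 0.2, 0.5, 0.9, 1.0, 1.5, 3.0, 12.0, 50.0, 0.1])
shares = utilization_shares(example)
print(shares)
```

On the real data, the call would be `utilization_shares(df_train['RevolvingUtilizationOfUnsecuredLines'])`.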

## Debt Ratio

In [None]:
df_train['DebtRatio'].describe().to_frame().T

In [None]:
# Boxplot for DebtRatio, df_train
sns.boxplot(x='DebtRatio', data=df_train)
plt.title('DebtRatio - BoxPlot')
plt.show()


In [None]:
# Share of DebtRatio values between 0 and 2
df_train['DebtRatio'].between(0, 2).sum()/len(df_train)

In [None]:
df_train[df_train['DebtRatio'] > 2]['MonthlyIncome'].isnull().sum()/len(df_train[df_train['DebtRatio'] > 2])

In [None]:
df_train[df_train['DebtRatio'] > 2].describe()

- 79.4% of the values of this variable lie between 0 and 2.
- The remaining 20.6% are very large (median of 1201); these outliers are responsible for skewing the variable.
- 90% of these large values have a missing Monthly Income. This suggests that the income was never entered and the debt ratio was computed without a valid denominator. \
We could predict Monthly Income, adding constraints based on the value of the debt ratio (e.g. if Debt Ratio = 0, then Monthly Income cannot be greater than 0). \
Then we can adjust the debt ratios using the new Monthly Income values.
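The adjustment step described above can be sketched as follows. This assumes that, on rows where income was missing, `DebtRatio` actually holds the absolute monthly debt amount, so dividing by the imputed income yields a true ratio. The function name `rescale_debt_ratio` and the toy values are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd


def rescale_debt_ratio(df: pd.DataFrame, imputed_mask: pd.Series) -> pd.DataFrame:
    """Divide DebtRatio by MonthlyIncome on the rows whose income was imputed."""
    out = df.copy()
    income = out.loc[imputed_mask, "MonthlyIncome"]
    ratio = out.loc[imputed_mask, "DebtRatio"] / income
    # A zero income produces inf; mark those ratios as missing instead
    out.loc[imputed_mask, "DebtRatio"] = ratio.replace([np.inf, -np.inf], np.nan)
    return out


# Toy data: the first row had a real income, the other two were imputed
toy = pd.DataFrame({
    "DebtRatio": [0.3, 1200.0, 500.0],
    "MonthlyIncome": [5000.0, 4000.0, 0.0],
})
mask = pd.Series([False, True, True])
fixed = rescale_debt_ratio(toy, mask)
print(fixed)
```

Returning a copy keeps the original frame untouched, which makes it easier to compare the ratios before and after the adjustment.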

## Age

In [None]:
# Boxplot for 'Age'
sns.boxplot(x='age', data=df_train)
plt.title('Age - BoxPlot')
plt.show()

A borrower cannot be under 18, so any such value is invalid and should be removed or replaced.

In [None]:
df_train[df_train['age']<18]

In [None]:
df_train['age'] = df_train['age'].replace(-1, 52)

The invalid value -1 is replaced with the median age, 52.
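As an alternative to hard-coding the replacement, the median can be derived from the valid ages only. The sketch below is a minimal illustration on toy values. The helper name `replace_invalid_age` and the threshold of 18 are assumptions.

```python
import pandas as pd


def replace_invalid_age(ages: pd.Series, min_age: int = 18) -> pd.Series:
    """Replace ages below min_age with the median of the valid ages."""
    is_valid = ages >= min_age
    median_valid = ages[is_valid].median()
    # Series.where keeps values where the condition holds and substitutes elsewhere
    return ages.where(is_valid, median_valid)


# Toy example: one invalid entry among three valid ages
toy_ages = pd.Series([-1, 30, 52, 70])
cleaned = replace_invalid_age(toy_ages)
print(cleaned.tolist())
```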

In [None]:
# Boxplot for 'Age' in the test dataset
sns.boxplot(x='age', data=df_test)
plt.title('Age - BoxPlot')
plt.show()

No underage individuals in the test set.

## Number Of Open Credit Lines And Loans

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = df_train['NumberOfOpenCreditLinesAndLoans'], binwidth=1, ax = axes[0])
sns.histplot(x = df_test['NumberOfOpenCreditLinesAndLoans'], binwidth=1, ax = axes[1])

This variable is right-skewed with no extreme values.

Since 'Number of Open Credit Lines and Loans' is a count, it should only take integer values; keeping decimal values would not make sense.

## Number of Real Estate Loans and Lines

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = df_train['NumberRealEstateLoansOrLines'], binwidth=1, ax = axes[0])
sns.histplot(x = df_test['NumberRealEstateLoansOrLines'], binwidth=1, ax = axes[1])

In [None]:
df_train['NumberRealEstateLoansOrLines'].value_counts()

This variable is highly right-skewed. The majority of borrowers have between 0 and 2 mortgage loans.

## Number of Dependents

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = df_train['NumberOfDependents'], binwidth=1, ax = axes[0])
sns.histplot(x = df_test['NumberOfDependents'], binwidth=1, ax = axes[1])

This variable is right-skewed. The majority of borrowers have between 0 and 3 dependents.

## Number of N Days Past Due

In [None]:
# Name each value_counts Series directly so the column labels do not depend on the pandas version
due_30_59 = df_train['NumberOfTime30-59DaysPastDueNotWorse'].value_counts().rename('30-59days')
due_60_89 = df_train['NumberOfTime60-89DaysPastDueNotWorse'].value_counts().rename('60-89days')
due_90 = df_train['NumberOfTimes90DaysLate'].value_counts().rename('90days')
pd.concat([due_30_59, due_60_89, due_90], axis=1)

In [None]:
columns_needed = ['NumberOfTime30-59DaysPastDueNotWorse', 'NumberOfTime60-89DaysPastDueNotWorse',
                  'NumberOfTimes90DaysLate', 'SeriousDlqin2yrs']  
data_filtered = df_train[columns_needed]

mask = (data_filtered['NumberOfTime30-59DaysPastDueNotWorse'].isin([96, 98]) |
        data_filtered['NumberOfTime60-89DaysPastDueNotWorse'].isin([96, 98]) |
        data_filtered['NumberOfTimes90DaysLate'].isin([96, 98]))

delinquency_data = data_filtered[mask]

serious_delinquency_count = delinquency_data['SeriousDlqin2yrs'].sum()
total_cases = delinquency_data.shape[0]

serious_delinquency_count/total_cases

These features have similar distributions. Two values stand out: 96 and 98. A borrower cannot be delinquent 96 or 98 times within 2 years. These values also appear on the same rows across the three features, which might indicate a data entry error. However, they should not be dropped, because they carry a lot of information for identifying defaulting borrowers: 54% of the borrowers in this category defaulted, compared with a 6% global default rate.
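The claim that the 96 and 98 values share the same rows can be verified programmatically. The sketch below checks, on a toy frame, that every row flagged in one past-due column is flagged in all three, and then computes the default rate of the flagged rows. The toy data and the helper name `sentinel_rows_coincide` are assumptions for illustration.

```python
import pandas as pd

PAST_DUE_COLS = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]
SENTINELS = [96, 98]


def sentinel_rows_coincide(df: pd.DataFrame) -> bool:
    """True if every row flagged in one past-due column is flagged in all three."""
    flags = df[PAST_DUE_COLS].isin(SENTINELS)
    return bool((flags.any(axis=1) == flags.all(axis=1)).all())


# Toy frame: rows 1 and 3 carry the sentinel values in all three columns
toy = pd.DataFrame({
    PAST_DUE_COLS[0]: [0, 98, 1, 96],
    PAST_DUE_COLS[1]: [0, 98, 0, 96],
    PAST_DUE_COLS[2]: [0, 98, 0, 96],
    "SeriousDlqin2yrs": [0, 1, 0, 1],
})
coincide = sentinel_rows_coincide(toy)

# Default rate among the flagged rows
flagged = toy[toy[PAST_DUE_COLS].isin(SENTINELS).any(axis=1)]
default_rate = flagged["SeriousDlqin2yrs"].mean()
print(coincide, default_rate)
```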

## Correlation Matrix

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

corr_matrix = df_train.corr()
plt.figure(figsize=(12, 10))
ax = sns.heatmap(corr_matrix, fmt=".2f", cmap='Reds', square=True, linewidths=.5, cbar_kws={"shrink": .8})
plt.title('Correlation Matrix')
plt.xticks(fontstyle='italic')
plt.yticks(fontstyle='italic')
plt.show()

# 2. Data Cleaning

## Number of N Days Past Due - 96 and 98 values

In [None]:
rare_df = df_train[df_train['NumberOfTimes90DaysLate']>90]
len(rare_df[rare_df['DebtRatio']==0])/len(rare_df)

53% of these rows have a target value of 1, so they are important for understanding what drives the probability of serious delinquency. \
However, 80% of them have a `DebtRatio` of 0, which makes it implausible for them to be past due.

In [None]:
len(rare_df[(rare_df['SeriousDlqin2yrs']==1)&(rare_df['DebtRatio']==0)])/len(rare_df[rare_df['SeriousDlqin2yrs']==1])

Among rare-case borrowers with `SeriousDlqin2yrs` = 1, 72% have `DebtRatio` = 0. It is unclear how a borrower with no debt could be seriously delinquent.

In [None]:
(len(rare_df[rare_df['DebtRatio']>5])+len(rare_df[rare_df['DebtRatio']==0]))/len(rare_df)

In [None]:
rare_df[rare_df['SeriousDlqin2yrs']==1]

In [None]:
rare_df.describe()

For these rows, the mean Number of Open Credit Lines and Loans is very small, and the same holds for the Number of Real Estate Loans or Lines. \
Therefore we can consider these records to be errors.

In [None]:
rare_df[rare_df['SeriousDlqin2yrs']==0]

In [None]:
rare_df[rare_df['MonthlyIncome'].isna()]

In [None]:
rare_df[rare_df['MonthlyIncome'].notna()]

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

corr_matrix = df_train.corr()
plt.figure(figsize=(12, 10))
ax = sns.heatmap(corr_matrix, fmt=".2f", cmap='Reds', square=True, linewidths=.5, cbar_kws={"shrink": .8})
plt.title('Correlation Matrix with Rare Case')
plt.xticks(fontstyle='italic')
plt.yticks(fontstyle='italic')
plt.show()

In [None]:
columns_to_filter = ['NumberOfTime30-59DaysPastDueNotWorse', 'NumberOfTimes90DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse']

## Extreme values - RevolvingUtilizationOfUnsecuredLines 

In [None]:
revolv = df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] > 10]

In [None]:
revolv_test = df_test[df_test['RevolvingUtilizationOfUnsecuredLines'] > 10]

In [None]:
len(revolv)

In [None]:
len(revolv[revolv['SeriousDlqin2yrs'] == 1])/len(revolv)

In [None]:
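# Copy the filtered frame so that later column assignments do not act on a slice of the original
df_train = df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] <= 10].copy()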

## Monthly Income - Missing Values 

In [None]:
import pandas as pd

data_no_missing_income = df_train[df_train['MonthlyIncome'].notna()]

debt_ratio_condition = (data_no_missing_income['DebtRatio'] >= 0) & (data_no_missing_income['DebtRatio'] <= 2)
percentage_in_range = (debt_ratio_condition.sum() / len(data_no_missing_income)) * 100

print(f"Percentage of DebtRatio values between 0 and 2: {percentage_in_range:.2f}%")

In [None]:
# Impute missing NumberOfDependents with the median (all imports come from the first cell)
number_imputer = SimpleImputer(strategy='median')
df_train['NumberOfDependents'] = number_imputer.fit_transform(df_train[['NumberOfDependents']]).ravel()

missing_income_indexes = df_train[df_train['MonthlyIncome'].isna()].index

features = ['RevolvingUtilizationOfUnsecuredLines', 'age', 'DebtRatio', 
            'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines']
data_with_income = df_train[df_train['MonthlyIncome'].notna()][features + ['MonthlyIncome']]

X = data_with_income.drop(columns=['MonthlyIncome'])
y = data_with_income['MonthlyIncome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
'''
models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=100, random_state=42),
    'GradientBoostingRegressor': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'SupportVectorRegression': SVR()
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    results[name] = mse

best_model_name = min(results, key=results.get)
best_model = models[best_model_name]
'''
monthly_model = LinearRegression()
monthly_model.fit(X_train, y_train)
y_pred = monthly_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

X_missing = df_train.loc[missing_income_indexes][features]
predicted_income = monthly_model.predict(X_missing)

predicted_income = np.maximum(predicted_income, 0)

df_train.loc[missing_income_indexes, 'MonthlyIncome'] = predicted_income

df_train.loc[missing_income_indexes, 'DebtRatio'] /= df_train.loc[missing_income_indexes, 'MonthlyIncome']

print(f"Income model: {monthly_model} with hold-out MSE: {mse}")


In [None]:
# Reuse the median fitted on the training set instead of refitting on the test set
df_test['NumberOfDependents'] = number_imputer.transform(df_test[['NumberOfDependents']]).ravel()

# Predict MonthlyIncome in df_test where it is missing
missing_income_indexes_submission = df_test[df_test['MonthlyIncome'].isna()].index
X_missing_submission = df_test.loc[missing_income_indexes_submission, features]

predicted_income_submission = monthly_model.predict(X_missing_submission)
predicted_income_submission = np.maximum(predicted_income_submission, 0)  # Ensure non-negative income predictions

# Assign the predicted MonthlyIncome to df_test
df_test.loc[missing_income_indexes_submission, 'MonthlyIncome'] = predicted_income_submission

# Adjust DebtRatio in df_test based on the new MonthlyIncome
df_test.loc[missing_income_indexes_submission, 'DebtRatio'] /= df_test.loc[missing_income_indexes_submission, 'MonthlyIncome']

# A predicted income of 0 makes the division return inf; replace those values with 0
df_test.loc[missing_income_indexes_submission, 'DebtRatio'] = df_test.loc[missing_income_indexes_submission, 'DebtRatio'].replace([np.inf, -np.inf], 0)

In [None]:
print(f"Count of 'DebtRatio' between 0 and 2: {df_train['DebtRatio'].between(0, 2).sum()}")
print(f"Count of 'DebtRatio' beyond 2: {(df_train['DebtRatio'] > 2).sum()}")

In [None]:
df_train.info()

# 3. Feature Engineering

In [None]:
df_train['RareCase'] = (df_train['NumberOfTimes90DaysLate'].isin([96, 98])).astype(int)

In [None]:
median_30_59 = df_train['NumberOfTime30-59DaysPastDueNotWorse'].median()
median_60_89 = df_train['NumberOfTime60-89DaysPastDueNotWorse'].median()
median_90 = df_train['NumberOfTimes90DaysLate'].median()

df_train.loc[df_train['NumberOfTime30-59DaysPastDueNotWorse'].isin([96, 98]), 'NumberOfTime30-59DaysPastDueNotWorse'] = median_30_59
df_train.loc[df_train['NumberOfTime60-89DaysPastDueNotWorse'].isin([96, 98]), 'NumberOfTime60-89DaysPastDueNotWorse'] = median_60_89
df_train.loc[df_train['NumberOfTimes90DaysLate'].isin([96, 98]), 'NumberOfTimes90DaysLate'] = median_90

In [None]:
df_train[df_train['NumberOfTimes90DaysLate']>90]

In [None]:
df_train.replace([np.inf, -np.inf], np.nan, inplace=True)

In [None]:
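# Preview only: the result is not assigned, so df_train keeps its NaNs (they are imputed during preprocessing)
df_train.dropna()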

In [None]:
df_test['RareCase'] = (df_test['NumberOfTimes90DaysLate'].isin([96, 98])).astype(int)
median_30_59 = df_test['NumberOfTime30-59DaysPastDueNotWorse'].median()
median_60_89 = df_test['NumberOfTime60-89DaysPastDueNotWorse'].median()
median_90 = df_test['NumberOfTimes90DaysLate'].median()

df_test.loc[df_test['NumberOfTime30-59DaysPastDueNotWorse'].isin([96, 98]), 'NumberOfTime30-59DaysPastDueNotWorse'] = median_30_59
df_test.loc[df_test['NumberOfTime60-89DaysPastDueNotWorse'].isin([96, 98]), 'NumberOfTime60-89DaysPastDueNotWorse'] = median_60_89
df_test.loc[df_test['NumberOfTimes90DaysLate'].isin([96, 98]), 'NumberOfTimes90DaysLate'] = median_90

# 4. Preprocessing 

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
import pandas as pd


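def preprocess_data(train_df, test_df, target,
                    apply_undersampling=True,
                    apply_scaling=True,
                    apply_imputation=True,
                    apply_pca=True,
                    seed=42,
                    under_sampler=None,
                    scaler=None,
                    imputer=None,
                    pca=None):
    # Build fresh default estimators on every call; instances used as default
    # arguments would be shared (and refitted) across calls
    if under_sampler is None:
        under_sampler = RandomUnderSampler()
    if scaler is None:
        scaler = StandardScaler()
    if imputer is None:
        imputer = SimpleImputer(strategy='mean')
    if pca is None:
        pca = PCA(n_components=0.95)

    under_sampler.set_params(random_state=seed)

    # Split data into features and target
    X = train_df.drop(columns=[target])
    y = train_df[target]

    cols = X.columns

    # Hold out 20% of the training data for evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)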

    if apply_undersampling:
        X_train, y_train = under_sampler.fit_resample(X_train, y_train)

    if apply_scaling:
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
        test_df = scaler.transform(test_df)

    if apply_imputation:
        X_train = imputer.fit_transform(X_train)
        X_test = imputer.transform(X_test)
        test_df = imputer.transform(test_df)

    if apply_pca:
        X_train = pca.fit_transform(X_train)
        X_test = pca.transform(X_test)
        cols = [f'PCA_{i}' for i in range(X_train.shape[1])]
        test_df = pca.transform(test_df)

    test_df = pd.DataFrame(test_df, columns=cols)

    X_train_df = pd.DataFrame(X_train, columns=cols)
    y_train_df = y_train

    X_test_df = pd.DataFrame(X_test, columns=cols)
    y_test_df = y_test

    return X_train_df, X_test_df, y_train_df, y_test_df, test_df

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt


def plot_roc_curve_and_cm(model, X_test, y_test):
    # Compute probabilities
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)

    # Compute AUC
    auc = roc_auc_score(y_test, y_pred_proba)
    print("AUC: ", auc)

    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy: ", accuracy)

    # Compute ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)

    # Create subplots
    fig, ax = plt.subplots(2, 1, figsize=(7, 15))

    # Plot ROC curve
    ax[0].plot(fpr, tpr, label='ROC curve (AUC = %0.2f)' % auc)
    ax[0].plot([0, 1], [0, 1], 'k--')
    ax[0].set_xlim([0.0, 1.0])
    ax[0].set_ylim([0.0, 1.05])
    ax[0].set_xlabel('False Positive Rate')
    ax[0].set_ylabel('True Positive Rate')
    ax[0].set_title('Receiver Operating Characteristic')
    ax[0].legend(loc="lower right")

    # Plot confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax[1])
    ax[1].set_title('Confusion Matrix')

    # Show the plot
    plt.tight_layout()
    plt.show()

In [None]:
X_train_df, X_test_df, y_train_df, y_test_df, test_for_submission = preprocess_data(df_train, df_test, 'SeriousDlqin2yrs')

In [None]:
X_train_df.head()

## Models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

## Logistic Regression with Class Weights

In [None]:
# preprocess_data returns five values; the transformed submission set is not needed here
X_train_no_undersampling, X_test_no_undersampling, y_train_no_undersampling, y_test_no_undersampling, _ = preprocess_data(df_train, df_test, 'SeriousDlqin2yrs', apply_undersampling=False, apply_pca=False)
weight = y_train_no_undersampling.value_counts(normalize=True)[0] / y_train_no_undersampling.value_counts(normalize=True)[1]

lg_w_weight = LogisticRegression(class_weight={0: 1, 1: weight})

lg_w_weight.fit(X_train_no_undersampling, y_train_no_undersampling)

y_pred = lg_w_weight.predict(X_test_no_undersampling)

accuracy = accuracy_score(y_test_no_undersampling, y_pred)
print("Accuracy:", accuracy)

report = classification_report(y_test_no_undersampling, y_pred)

print(report)

In [None]:
plot_roc_curve_and_cm(lg_w_weight, X_test_no_undersampling, y_test_no_undersampling)

In [None]:
importance = lg_w_weight.coef_[0]

indices = np.argsort(np.abs(importance))[::-1]

names = [X_train_no_undersampling.columns[i] for i in indices]

plt.figure()
plt.title("Feature Importance")
plt.bar(range(X_train_no_undersampling.shape[1]), importance[indices])
plt.xticks(range(X_train_no_undersampling.shape[1]), names, rotation=90)
plt.show()

## All models

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# Define models (XGBClassifier is listed once, under "XGBoost")
models = {
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
    "LogisticRegression": LogisticRegression(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Ada Boosting": AdaBoostClassifier(),
    "Support Vector Machine": SVC(probability=True),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
}

# Train and evaluate each model
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train_df, y_train_df)
    plot_roc_curve_and_cm(model, X_test_df, y_test_df)

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

def plot_all_roc_curves(models, X_test, y_test):
    plt.figure(figsize=(10, 8))
    
    # Plot each model's ROC curve
    for name, model in models.items():
        # Predict probabilities for the positive class
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        
        # Calculate ROC AUC
        roc_auc = roc_auc_score(y_test, y_pred_proba)
        
        # Calculate ROC curve points
        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        
        # Plot the ROC curve
        plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')
    
    # Plot the no skill line
    plt.plot([0, 1], [0, 1], 'k--')
    
    # Add labels and legend
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves Comparison')
    plt.legend(loc='lower right')
    
    # Show the plot
    plt.show()

# Define models
models = {
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Ada Boosting": AdaBoostClassifier(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
}

# Assuming X_train_df, y_train_df, X_test_df, y_test_df are already defined and available
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train_df, y_train_df)

# After training, pass the models and test data to plot the ROC curves
plot_all_roc_curves(models, X_test_df, y_test_df)


# Best Model

We check the most important features for the best model.

In [None]:
best_model = GradientBoostingClassifier()

best_model.fit(X_train_df, y_train_df)

plot_roc_curve_and_cm(best_model, X_test_df, y_test_df)

In [None]:
# Reuse the Gradient Boosting classifier fitted in the previous cell
clf = best_model

# Extract feature importances
# Note: preprocess_data was called with apply_pca=True, so these are PCA components, not the original columns
feature_importances = clf.feature_importances_

# Creating a DataFrame to view and sort importances
features = X_train_df.columns
importances_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
importances_df = importances_df.sort_values(by='Importance', ascending=True)

# Plotting
plt.figure(figsize=(10, 8))
plt.barh(importances_df['Feature'], importances_df['Importance'], color='darkred')
plt.xlabel('Importance')
plt.title('Feature Importance from Gradient Boosting Classifier')
plt.show()



## Model with highest accuracy

In [None]:
model = GaussianNB()

model.fit(X_train_df, y_train_df)

plot_roc_curve_and_cm(model, X_test_df, y_test_df)

In [None]:
# GaussianNB has no feature importances; as a rough proxy we plot the per-class feature means (theta_).
# The per-class variances (var_) could be used instead.
class_means = model.theta_

fig, ax = plt.subplots(figsize=(10, 8))
indices = np.argsort(class_means[0])  # Sort by the mean of the first class
plt.title('Feature "Importances" in Gaussian Naive Bayes')
plt.barh(range(len(indices)), class_means[0][indices], color='darkred', align='center')
plt.yticks(range(len(indices)), [X_train_df.columns[i] for i in indices])
plt.xlabel('Feature Mean Value (First Class)')
plt.show()

# Optimizing model

### Random forest optimization

In [None]:
from sklearn.model_selection import RandomizedSearchCV

def optimize_model(model, param_distributions, X_train, y_train, n_iter=100, cv=5, random_state=42):
    # Set up the random search with 5-fold cross validation
    random_search = RandomizedSearchCV(estimator=model,
                                       param_distributions=param_distributions,
                                       n_iter=n_iter,
                                       cv=cv,
                                       random_state=random_state,
                                       verbose = 2,
                                       n_jobs=-1)

    # Fit the random search model
    random_search.fit(X_train, y_train)

    return random_search.best_estimator_

In [None]:
param_distributions = {
    'n_estimators': [int(x) for x in np.linspace(start=200, stop=1000, num=10)],
    'max_features': ['sqrt'],
    'max_depth': [int(x) for x in np.linspace(5, 25, num=6)] + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# optimized_rf = optimize_model(RandomForestClassifier(), 
                                 # param_distributions, 
                                 # X_train_df, 
                                 # y_train_df, 
                                 # n_iter=20)

In [None]:
# optimized_rf.get_params()

In [None]:
# plot_roc_curve_and_cm(optimized_rf, X_test_df, y_test_df)

### Xgboost optimization

In [None]:
'''
param_distributions = {
    'n_estimators': [int(x) for x in np.linspace(start=100, stop=1000, num=10)],
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'max_depth': [int(x) for x in np.linspace(3, 10, num=8)],
    'min_child_weight': [1, 3, 5],
    'gamma': [0, 0.1, 0.2, 0.3, 0.4],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'objective': ['binary:logistic']
}'''

In [None]:
'''
optimized_xgb = optimize_model(XGBClassifier(), 
                                 param_distributions, 
                                 X_train_df, 
                                 y_train_df, 
                                 n_iter=20)
            '''

In [None]:
# plot_roc_curve_and_cm(optimized_xgb, X_test_df, y_test_df)

### Gradient boosting

Optimization for gradient boosting.

In [None]:
param_distributions = {
    'n_estimators': [50, 100],  # Reduced maximum number, fewer steps
    'learning_rate': [0.1, 0.2, 0.3],  # Higher and fewer options
    'max_depth': [3, 5],  # Reduced complexity
    'min_samples_split': [5, 10],  # Reduced variation
    'min_samples_leaf': [2, 4]  # Fewer options, more generalization
}

# Assuming X_train_df and y_train_df are defined
optimized_gb = optimize_model(GradientBoostingClassifier(), 
                              param_distributions,
                              X_train_df, 
                              y_train_df, 
                              n_iter=10)  # Reduced the number of iterations

print("Optimized model:", optimized_gb)

In [None]:
plot_roc_curve_and_cm(optimized_gb, X_test_df, y_test_df)

In [None]:
y_test_df.value_counts()

In [None]:
# Initialize the model with 'log_loss' which supports probability predictions
best_model = GradientBoostingClassifier(max_depth=5, min_samples_leaf=4, min_samples_split=5,
                           n_estimators=50, loss='log_loss')

# Train the model with your training data
best_model.fit(X_train_df, y_train_df)


# Test Dataset

In [None]:
df_test = test_for_submission

In [None]:
df_test.info()

## Submission

In [None]:
def submit(model, test_df, filename='submission3.csv'):
    """
    Generate a CSV submission file containing the probabilities of the positive class.
    
    :param model: Trained Gradient Boosting model.
    :param test_df: DataFrame containing the test features.
    :param filename: The name of the file to save the predictions to.
    """
    # Predict the probabilities for the positive class (column index 1);
    # passing the DataFrame keeps the feature names the model was fitted with
    probabilities = model.predict_proba(test_df)[:, 1]

    # Create the submission DataFrame with 1-based IDs, leaving test_df unchanged
    submission_df = pd.DataFrame({
        'id': range(1, len(test_df) + 1),
        'probability': probabilities
    })


    # Save the DataFrame to a CSV file for submission
    submission_df.to_csv(filename, index=False)
    print("Submission file created successfully and saved to:", filename)


In [None]:
submit(best_model, df_test)

In [None]:
sub = pd.read_csv('submission3.csv')

In [None]:
sub.describe()