In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Problem Understanding

In [None]:
# Load dataset
file_path = 'PD_modelling_dataset.xlsx'
data = pd.read_excel(file_path, sheet_name='dataset')
data.head()

### Need of the Study/Project:
In the role of a data scientist, it is essential to understand why this study or project is necessary:

Risk Management:

- Credit Risk: Predicting the probability of default helps in assessing the credit risk associated with each customer. By identifying high-risk customers, the company can take preventive measures to mitigate potential losses.
- Portfolio Management: Understanding the risk profile of customers allows for better management of the credit portfolio, ensuring a balanced risk-return trade-off.

Financial Stability:

- Provisioning and Reserves: Accurate PD modeling helps in determining the amount of reserves required to cover potential losses. This ensures the financial stability and health of the organization.
- Capital Allocation: Efficient capital allocation based on risk assessment ensures that resources are optimally utilized, contributing to the overall profitability of the company.

Regulatory Compliance:

- Compliance with Standards: Financial institutions are required to comply with regulatory standards such as Basel III, which mandates robust risk assessment and management practices. PD modeling is a key component of these requirements.
- Transparency: Predictive models provide transparency in credit decision-making processes, which is crucial for regulatory reporting and audit purposes.

### Understanding Business/Social Opportunity:
The project presents several business and social opportunities:

Enhanced Customer Experience:

- Personalized Services: By understanding the risk profile of customers, the company can offer personalized credit products and services tailored to individual needs and risk levels.
- Proactive Customer Support: Identifying customers at risk of default allows for proactive engagement and support, such as offering restructuring options or financial counseling.

Competitive Advantage:

- Data-Driven Decisions: Leveraging predictive analytics provides a competitive edge by enabling data-driven decision-making. This can lead to improved credit approval processes, optimized interest rates, and better customer retention strategies.
- Market Positioning: Companies with robust risk management practices are perceived as more reliable and stable, enhancing their market reputation and attracting more customers.

Social Impact:

- Financial Inclusion: By accurately assessing credit risk, the company can extend credit to underserved segments of the population, promoting financial inclusion.
- Economic Stability: Reducing the incidence of defaults contributes to the stability of the financial system, which in turn supports broader economic stability and growth.

# Data Report

In [None]:
# Get the number of rows and columns
num_rows, num_columns = data.shape

print(f"\nNumber of Rows: {num_rows}")
print(f"Number of Columns: {num_columns}")

In [None]:
# Display basic information about the dataset
print("Dataset Info:")
print(data.info())

In [None]:
print("\nDataset Description:")
print(data.describe())

# Exploratory Data Analysis

### a) Univariate analysis (distribution and spread for every continuous attribute, distribution of data in categories for categorical ones)

In [None]:
# Separate continuous and categorical variables
continuous_vars = data.select_dtypes(include=['float64', 'int64']).columns
categorical_vars = data.select_dtypes(include=['object', 'category']).columns

In [None]:
# Univariate analysis for continuous variables
for var in continuous_vars:
    plt.figure(figsize=(12, 6))

    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(data[var].dropna(), kde=True, bins=30)
    plt.title(f'Histogram of {var}')

    # Box plot
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[var].dropna())
    plt.title(f'Box Plot of {var}')

    plt.show()


In [None]:
# Univariate analysis for categorical variables
for var in categorical_vars:
    plt.figure(figsize=(12, 6))

    # Bar plot
    sns.countplot(y=data[var])
    plt.title(f'Bar Plot of {var}')
    plt.show()

### b) Bivariate analysis (relationships between different variables, correlations)

In [None]:
# Bivariate analysis: Continuous vs Continuous
print("Correlation Matrix:\n")
correlation_matrix = data[continuous_vars].corr()
print(correlation_matrix)


for i in range(len(continuous_vars)):
    for j in range(i + 1, len(continuous_vars)):
        plt.figure(figsize=(8, 6))
        sns.scatterplot(x=data[continuous_vars[i]], y=data[continuous_vars[j]])
        plt.title(f'Scatter Plot between {continuous_vars[i]} and {continuous_vars[j]}')
        plt.show()


In [None]:
# Bivariate analysis: Continuous vs Continuous

# Select continuous variables
continuous_vars = data.select_dtypes(include=['float64', 'int64']).columns

# Calculate correlation matrix
correlation_matrix = data[continuous_vars].corr()

# Plot heatmap with improved readability
plt.figure(figsize=(16, 12))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, 
            annot_kws={"size": 10}, fmt=".2f", linewidths=.5, cbar_kws={"shrink": .75})

plt.title('Correlation Matrix', fontsize=20)
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)
plt.show()
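With many variables, the annotated heatmap can be hard to read, so it may help to list the strongly correlated pairs explicitly. The following is a minimal sketch on synthetic data; the helper name `top_correlated_pairs`, the 0.8 threshold, and the column names `a`, `b`, `c` are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.8):
    """Return variable pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    # Indices of the strict upper triangle: each pair once, diagonal excluded
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "var_1": cols[i],
        "var_2": cols[j],
        "abs_corr": corr.to_numpy()[i, j],
    })
    return pairs[pairs["abs_corr"] >= threshold].sort_values("abs_corr", ascending=False)

# Synthetic demo data (hypothetical columns): b is almost a linear function of a
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

high_pairs = top_correlated_pairs(demo, threshold=0.8)
print(high_pairs)
```

In the notebook, the same helper could be applied to `data[continuous_vars]` to flag candidate variables for removal before modelling.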

In [None]:
# Bivariate analysis: Categorical vs Categorical
for i in range(len(categorical_vars)):
    for j in range(i + 1, len(categorical_vars)):
        crosstab = pd.crosstab(data[categorical_vars[i]], data[categorical_vars[j]])
        print(f'\nCrosstab between {categorical_vars[i]} and {categorical_vars[j]}:\n')
        print(crosstab)
        
        plt.figure(figsize=(10, 8))
        sns.heatmap(crosstab, annot=True, cmap='Blues')
        plt.title(f'Heatmap of {categorical_vars[i]} vs {categorical_vars[j]}')
        plt.show()
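The crosstabs and heatmaps show counts, but not whether an association is statistically meaningful. One common complement is a chi-square test of independence with Cramér's V as an effect size. The sketch below uses synthetic data; the helper name `cramers_v` and the demo series are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Chi-square test of independence plus Cramér's V for two categorical series."""
    table = pd.crosstab(x, y)
    # correction=False disables the Yates continuity correction used for 2x2 tables
    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else np.nan
    return chi2, p_value, v

# Synthetic demo: y agrees with x about 90% of the time, so the two are strongly associated
rng = np.random.default_rng(1)
x = pd.Series(rng.choice(["A", "B"], size=300))
flip = rng.random(300) < 0.1
y = x.where(~flip, x.map({"A": "B", "B": "A"}))

chi2, p_value, v = cramers_v(x, y)
print(f"chi2={chi2:.1f}, p-value={p_value:.3g}, Cramér's V={v:.2f}")
```

A small p-value suggests the two variables are not independent, while V (between 0 and 1) indicates how strong the association is.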


In [None]:
# Bivariate analysis: Continuous vs Categorical
for cat in categorical_vars:
    for cont in continuous_vars:
        plt.figure(figsize=(14, 8))  # Increase the figure size for better readability
        sns.boxplot(x=data[cat], y=data[cont])
        plt.title(f'Box Plot of {cont} by {cat}', fontsize=16)
        plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate x-axis labels
        plt.xlabel(cat, fontsize=14)
        plt.ylabel(cont, fontsize=14)
        plt.show()
        
        plt.figure(figsize=(14, 8))  # Increase the figure size for better readability
        sns.violinplot(x=data[cat], y=data[cont])
        plt.title(f'Violin Plot of {cont} by {cat}', fontsize=16)
        plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate x-axis labels
        plt.xlabel(cat, fontsize=14)
        plt.ylabel(cont, fontsize=14)
        plt.show()

### c) Missing Value treatment

In [None]:

# Identify missing values
missing_values = data.isnull().sum()
print("\nMissing Values per Column:\n", missing_values)
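Raw counts are easier to act on when expressed as a share of rows, since a column missing in 2% of rows calls for a different treatment than one missing in 60%. A minimal sketch, using a small hypothetical frame (`income`, `grade`, `age` are made-up columns):

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Summarise missing values per column as counts and percentages."""
    summary = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(2),
        "dtype": df.dtypes.astype(str),
    })
    # Keep only columns that actually have gaps, worst first
    return summary[summary["missing_count"] > 0].sort_values("missing_pct", ascending=False)

# Hypothetical demo data
demo = pd.DataFrame({
    "income": [1.0, np.nan, 3.0, np.nan],
    "grade": ["A", None, "B", "B"],
    "age": [30, 40, 50, 60],
})

report = missing_summary(demo)
print(report)
```

Calling `missing_summary(data)` in the notebook would give the same view for the PD dataset.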

In [None]:
# Drop rows with missing values
data_dropped_rows = data.dropna()

# Drop columns with missing values
data_dropped_columns = data.dropna(axis=1)


In [None]:
# Separate numerical and categorical columns
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = data.select_dtypes(include=['object', 'category']).columns

# Mean Imputation for numerical columns
data_mean_imputed = data.copy()
data_mean_imputed[numerical_cols] = data[numerical_cols].fillna(data[numerical_cols].mean())

# Median Imputation for numerical columns
data_median_imputed = data.copy()
data_median_imputed[numerical_cols] = data[numerical_cols].fillna(data[numerical_cols].median())

# Mode Imputation for categorical columns
data_mode_imputed = data.copy()
for col in categorical_cols:
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which can act on a temporary copy in recent pandas versions
    data_mode_imputed[col] = data_mode_imputed[col].fillna(data[col].mode()[0])

# Forward Fill for both numerical and categorical columns
# (fillna(method='ffill') is deprecated; DataFrame.ffill() returns a new frame)
data_ffill_imputed = data.ffill()

# Backward Fill for both numerical and categorical columns
data_bfill_imputed = data.bfill()

# Display the first few rows of the imputed datasets
print("\nData after Mean Imputation:\n", data_mean_imputed.head())
print("\nData after Median Imputation:\n", data_median_imputed.head())
print("\nData after Mode Imputation:\n", data_mode_imputed.head())
print("\nData after Forward Fill:\n", data_ffill_imputed.head())
print("\nData after Backward Fill:\n", data_bfill_imputed.head())

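# Check the remaining missing values in each imputed dataset
# (mean/median imputation only fills numerical columns, mode imputation only fills
# categorical columns, and forward/backward fill can leave gaps at the start/end)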
print("\nMissing Values per Column after Mean Imputation:\n", data_mean_imputed.isnull().sum())
print("\nMissing Values per Column after Median Imputation:\n", data_median_imputed.isnull().sum())
print("\nMissing Values per Column after Mode Imputation:\n", data_mode_imputed.isnull().sum())
print("\nMissing Values per Column after Forward Fill:\n", data_ffill_imputed.isnull().sum())
print("\nMissing Values per Column after Backward Fill:\n", data_bfill_imputed.isnull().sum())


### d) Outlier treatment (if required)

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

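# Separate numerical columns, excluding the binary target ('default') so that
# minority-class rows are not flagged as outliers
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns.drop('default', errors='ignore')

# Function to detect and remove outliers using Z-Score method
def zscore_outlier_treatment(df, threshold=3):
    # nan_policy='omit' ignores missing values when computing each column's mean and std;
    # entries with a NaN z-score are treated as non-outliers instead of dropping the row
    z_scores = np.abs(stats.zscore(df, nan_policy='omit'))
    df_outliers_removed = df[((z_scores < threshold) | np.isnan(z_scores)).all(axis=1)]
    return df_outliers_removed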

# Function to detect and treat outliers using IQR method
def iqr_outlier_treatment(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    df_outliers_removed = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
    return df_outliers_removed

# Z-Score Method
data_zscore_treated = zscore_outlier_treatment(data[numerical_cols])

# IQR Method
data_iqr_treated = iqr_outlier_treatment(data[numerical_cols])

# Visualization: Boxplots for visual inspection of outliers
plt.figure(figsize=(15, 10))
num_plots = len(numerical_cols)
cols = 3
rows = (num_plots // cols) + (num_plots % cols > 0)

for i, col in enumerate(numerical_cols, 1):
    plt.subplot(rows, cols, i)
    sns.boxplot(x=data[col])
    plt.title(f'Box Plot of {col}')
plt.tight_layout()
plt.show()

# Visualization: Scatterplots for visual inspection of outliers
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(rows, cols, i)
    sns.scatterplot(x=data.index, y=data[col])
    plt.title(f'Scatter Plot of {col}')
plt.tight_layout()
plt.show()

# Display the shape of the datasets after outlier treatment
print("\nOriginal Data Shape:", data.shape)
print("Data Shape after Z-Score Outlier Treatment:", data_zscore_treated.shape)
print("Data Shape after IQR Outlier Treatment:", data_iqr_treated.shape)

# Verify that the outliers have been removed
print("\nSummary Statistics after Z-Score Outlier Treatment:\n", data_zscore_treated.describe())
print("\nSummary Statistics after IQR Outlier Treatment:\n", data_iqr_treated.describe())
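Both functions above remove entire rows, which can discard a large share of a credit dataset. Capping (winsorizing) values at the IQR fences is one alternative that keeps every row. This is a sketch of that swapped-in technique, not the treatment used in this notebook; `iqr_cap` and the demo columns are hypothetical.

```python
import pandas as pd

def iqr_cap(df, k=1.5):
    """Cap each numeric column at [Q1 - k*IQR, Q3 + k*IQR] instead of dropping rows."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # axis=1 aligns the per-column bounds with the DataFrame's columns
    return df.clip(lower=lower, upper=upper, axis=1)

# Hypothetical demo: column x has one extreme value, column y has none
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],
    "y": [10.0, 20.0, 30.0, 40.0, 50.0],
})

capped = iqr_cap(demo)
print(capped)
```

The row count is unchanged, so the capped frame stays aligned with the target and with the categorical columns.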


### e) Variable transformation (if applicable)

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split


# Step 1: Fill missing values (example: using mean imputation for numeric columns)
numeric_columns = data.select_dtypes(include=[np.number]).columns
data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())

# Step 2: Ensure all categorical variables are uniformly of type str
for column in data.select_dtypes(include=['object']).columns:
    data[column] = data[column].astype(str)


# Encode categorical variables (example using LabelEncoder for simplicity)
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

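# Step 3: Scale numerical features, leaving the target column ('default') unscaled
feature_columns = [col for col in numeric_columns if col != 'default']
scaler = StandardScaler()
data[feature_columns] = scaler.fit_transform(data[feature_columns])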

# Step 4: Split the data (optional, usually for supervised learning, but shown here for completeness)
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Display the first few rows of the processed data to verify the transformations
print(data.head())

# Check for any remaining missing values
print(data.isnull().sum())

# Check the data types to ensure all columns are appropriately typed
print(data.dtypes)

# Plot histograms for numerical features
data.hist(bins=30, figsize=(20, 15))
plt.show()

# Check distribution of the label-encoded categorical variables
for column in label_encoders:
    sns.countplot(x=data[column])
    plt.title(f'Count Plot of {column}')
    plt.show()
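The cell above fits the scaler on the full dataset before splitting, so test rows influence the scaling statistics. A common way to avoid this leakage is to split first and fit the scaler on the training rows only. The sketch below illustrates the idea on synthetic data; the column names `loan_amount` and `income` are assumptions, and `default` stands in for the target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic demo data (hypothetical feature names)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "loan_amount": rng.normal(10000, 3000, size=500),
    "income": rng.normal(50000, 15000, size=500),
    "default": rng.integers(0, 2, size=500),
})
feature_cols = ["loan_amount", "income"]

# Split first, stratifying on the target to preserve the default rate in both parts
train_df, test_df = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["default"]
)
train_df = train_df.copy()
test_df = test_df.copy()

# Fit the scaler on training rows only, then apply the same parameters to the test rows
scaler = StandardScaler()
train_df[feature_cols] = scaler.fit_transform(train_df[feature_cols])
test_df[feature_cols] = scaler.transform(test_df[feature_cols])

print(train_df[feature_cols].mean().round(6))
print(test_df[feature_cols].mean().round(3))
```

The training means are zero by construction, while the test means are only approximately zero, which is the expected behaviour when the test set is treated as unseen data.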




# Business insights from EDA

In [None]:
pip install scikit-learn pandas numpy matplotlib seaborn joblib


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

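# Assuming train_data is your DataFrame and numeric_columns is your list of numeric columns

# Work on an explicit copy so that adding columns does not raise SettingWithCopyWarning
train_data = train_data.copy()

# Select features for clustering (excluding the target variable if applicable)
features = train_data.drop(columns=['default'])

# Apply KMeans clustering (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
train_data['Cluster'] = kmeans.fit_predict(features)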

# Analyze clusters
cluster_analysis = train_data.groupby('Cluster')[numeric_columns].mean()
print("\nCluster Analysis:\n", cluster_analysis)

# Visualize the clusters using PCA for 2D visualization
pca = PCA(n_components=2)
principal_components = pca.fit_transform(features)
train_data['PCA1'] = principal_components[:, 0]
train_data['PCA2'] = principal_components[:, 1]

plt.figure(figsize=(10, 7))
plt.scatter(train_data['PCA1'], train_data['PCA2'], c=train_data['Cluster'], cmap='viridis')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('PCA of Clusters')
plt.colorbar(label='Cluster')
plt.show()

# Validate clusters using silhouette score
silhouette_avg = silhouette_score(features, train_data['Cluster'])
print(f'Silhouette Score: {silhouette_avg}')

# Save label encoders and scaler for future use
for column, le in label_encoders.items():
    joblib.dump(le, f'label_encoder_{column}.pkl')
joblib.dump(scaler, 'scaler.pkl')
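The choice of five clusters is fixed in the cell above. One way to sanity-check it is to compare silhouette scores over a range of `k` values. The following sketch does this on synthetic blobs with three known centres; the data, the range 2 to 6, and the variable names are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic demo data with three well-separated groups
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# Fit KMeans for several candidate k values and record the silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette:", best_k)
```

Applied to `features`, the same loop would show whether `n_clusters=5` is supported by the data or whether a smaller or larger value separates the customers more cleanly.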



The PCA scatter plot produced by the clustering cell visualizes the clustering results. Each point is a training instance projected onto the first two principal components (PCA Component 1 and PCA Component 2), and the color indicates its cluster assignment. Business insights can be derived from it as follows:

1. Cluster Distribution and Separation:
   - Visibly distinct groups in the plot suggest that the KMeans algorithm has identified groups of similar instances. The silhouette score printed by the same cell gives a numerical check on this: values near 1 indicate well-separated clusters, values near 0 indicate overlap.
   - Good separation between clusters suggests that the features used for clustering are effective in distinguishing between different patterns in the data.
2. Cluster Characteristics:
   - To gain business insights, look at the average values of key metrics for each cluster (the `cluster_analysis` output). For instance, clusters might represent different customer segments based on their credit behavior.
   - Analyzing the `cluster_analysis` output helps identify the distinguishing characteristics of each cluster.
3. Potential Actions for Each Cluster (to be confirmed against the actual cluster profiles):
   - Cluster 0: If this cluster has higher default rates, stricter credit policies or additional monitoring may be appropriate for this group.
   - Cluster 1: If this cluster shows strong financial health (e.g., low default rates, high paid amounts), it could be targeted for new credit products or upselling.
   - Cluster 2: If this group sits between the high-risk and low-risk segments, tailored offers or educational content about financial management could be beneficial.
   - Cluster 3: This cluster might represent a niche segment with unique characteristics. Understanding its needs could help in designing specific financial products.
   - Cluster 4: If this cluster shows diverse behavior, more granular segmentation might be needed.
4. Market Strategy:
   - Marketing and Customer Engagement: Use the cluster insights to create targeted marketing campaigns, for example offering lower interest rates to low-risk clusters and personalized repayment plans to high-risk clusters.
   - Product Development: Develop financial products tailored to the needs of each cluster, such as premium credit cards for low-risk customers and secured cards for high-risk customers.
5. Risk Management:
   - Use the cluster information to manage risk by adjusting credit limits, interest rates, and approval criteria based on the risk profile of each cluster.
6. Operational Efficiency:
   - Optimize resource allocation by focusing on high-value or high-risk clusters. This could involve providing more personalized customer service to profitable clusters or investing in risk mitigation for high-risk clusters.
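The actions suggested for each cluster depend on its observed default rate, which can be computed directly once cluster labels exist. A minimal sketch, assuming a `Cluster` column and an unscaled 0/1 `default` indicator; the ten rows below are made-up demo values, not results from the dataset.

```python
import pandas as pd

# Hypothetical demo data: cluster label and 0/1 default flag per customer
demo = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 1, 1, 2, 2, 2],
    "default": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Number of customers and observed default rate per cluster, riskiest first
cluster_risk = (
    demo.groupby("Cluster")["default"]
    .agg(customers="count", default_rate="mean")
    .sort_values("default_rate", ascending=False)
)
print(cluster_risk)
```

Running the same aggregation on `train_data` would show which clusters, if any, actually correspond to the high-risk and low-risk segments described above.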

### 1. Model building and interpretation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Load data
file_path = 'PD_modelling_dataset.xlsx'
# Read the same sheet that was used for the EDA above
df = pd.read_excel(file_path, sheet_name='dataset')

# Identify feature and target columns
# Assuming the target column is named 'default'
X = df.drop('default', axis=1)
y = df['default']

# Handle missing values and encode categorical variables
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Drop rows with NaN in the target (if any) from both splits,
# keeping X and y aligned through their shared index
y_train = y_train.dropna()
X_train = X_train.loc[y_train.index]
y_test = y_test.dropna()
X_test = X_test.loc[y_test.index]

# Apply transformations
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
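Wrapping the preprocessor and the classifier in a single `Pipeline` lets cross-validation refit the imputers, scaler, and encoder inside each fold, which gives a less optimistic estimate than a single split. The sketch below runs on synthetic data with hypothetical columns (`income`, `utilisation`, `segment`). To stay within scikit-learn it uses `class_weight='balanced'` in place of SMOTE; with imbalanced-learn installed, `imblearn.pipeline.Pipeline` would allow a SMOTE step in the same position.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic demo data: default is driven mainly by utilisation
rng = np.random.default_rng(0)
n = 400
X_demo = pd.DataFrame({
    "income": rng.normal(50000, 15000, n),
    "utilisation": rng.uniform(0, 1, n),
    "segment": rng.choice(["retail", "sme"], n),
})
X_demo.loc[rng.choice(n, 20, replace=False), "income"] = np.nan  # inject missing values
y_demo = (X_demo["utilisation"] + rng.normal(0, 0.2, n) > 0.7).astype(int)

# Same preprocessing structure as the notebook's ColumnTransformer
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["income", "utilisation"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["segment"]),
])

# class_weight='balanced' is a stand-in for SMOTE in this sketch
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
print("AUC per fold:", auc_scores.round(3))
print("Mean AUC:", auc_scores.mean().round(3))
```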


In [None]:
print("Missing values in y_train:", y_train.isnull().sum())

# Check shapes to ensure consistency
print("X_train shape after handling NaNs:", X_train.shape)
print("y_train shape after handling NaNs:", y_train.shape)

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Fit the classifier to the training data
rf_classifier.fit(X_train, y_train)


In [None]:


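# accuracy_score is used below but was not imported with the other metrics
from sklearn.metrics import accuracy_score

# Train the Random Forest classifier on the SMOTE-resampled training data
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train_resampled, y_train_resampled)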

# Predict on the test data
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

report = classification_report(y_test, y_pred)
print('Classification Report:')
print(report)

# Plot heatmap of confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix Heatmap')
plt.show()

# Compute ROC curve and AUC score
y_pred_prob = rf_classifier.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc_score = roc_auc_score(y_test, y_pred_prob)
print(f'AUC Score: {auc_score:.2f}')

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (area = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
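The ROC curve also yields two quantities often reported for PD models: the Kolmogorov-Smirnov (KS) statistic, which is the maximum gap between the true positive rate and the false positive rate, and the score cutoff at which that gap occurs (Youden's J). A minimal sketch on a small hand-made example; the helper name `ks_and_threshold` and the demo arrays are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve

def ks_and_threshold(y_true, y_score):
    """Return the KS statistic (max TPR - FPR) and the score threshold where it occurs."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    gap = tpr - fpr
    best = int(np.argmax(gap))
    return gap[best], thresholds[best]

# Hypothetical demo: 1 = default, scores are predicted default probabilities
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90])

ks, cutoff = ks_and_threshold(y_true_demo, y_score_demo)
print(f"KS statistic: {ks:.2f} at cutoff {cutoff:.2f}")
```

In the notebook, `ks_and_threshold(y_test, y_pred_prob)` would give a data-driven alternative to the default 0.5 cutoff used by `predict`.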


### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns


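# Remove unexpected class '10000.0' from y_test and corresponding X_test
mask = np.asarray(y_test != 10000.0)
y_test = y_test[mask]
X_test = X_test[mask]

# Train the Decision Tree classifier on the resampled training data
# (dt_classifier must be defined and fitted before it can predict)
train_mask = np.asarray(y_train_resampled != 10000.0)
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train_resampled[train_mask], y_train_resampled[train_mask])

# Predict on the test data
y_pred = dt_classifier.predict(X_test)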

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

report = classification_report(y_test, y_pred)
print('Classification Report:')
print(report)

# Plot heatmap of confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix Heatmap')
plt.show()

# Compute ROC curve and AUC score
y_pred_prob = dt_classifier.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc_score = roc_auc_score(y_test, y_pred_prob)
print(f'AUC Score: {auc_score:.2f}')

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (area = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()


### Support Vector Machine

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Assuming X_train, X_test, y_train, y_test are already defined

# Impute missing values in X_train and X_test
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Remove unexpected class '10000.0' from y_train and y_test
mask_train = y_train != 10000.0
y_train = y_train[mask_train]
X_train = X_train[mask_train]

mask_test = y_test != 10000.0
y_test = y_test[mask_test]
X_test = X_test[mask_test]

# Train the SVM classifier
svm_classifier = SVC(probability=True, random_state=42)
svm_classifier.fit(X_train, y_train)

# Predict on the test data
y_pred = svm_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

report = classification_report(y_test, y_pred)
print('Classification Report:')
print(report)

# Plot heatmap of confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix Heatmap')
plt.show()

# Compute ROC curve and AUC score
y_pred_prob = svm_classifier.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc_score = roc_auc_score(y_test, y_pred_prob)
print(f'AUC Score: {auc_score:.2f}')

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (area = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()


### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Assuming X_train, X_test, y_train, y_test are already defined

# Impute missing values in X_train and X_test
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Remove unexpected class '10000.0' from y_train and y_test
mask_train = y_train != 10000.0
y_train = y_train[mask_train]
X_train = X_train[mask_train]

mask_test = y_test != 10000.0
y_test = y_test[mask_test]
X_test = X_test[mask_test]

# Train the Naive Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Predict on the test data
y_pred = nb_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

report = classification_report(y_test, y_pred)
print('Classification Report:')
print(report)

# Plot heatmap of confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix Heatmap')
plt.show()

# Compute ROC curve and AUC score
y_pred_prob = nb_classifier.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc_score = roc_auc_score(y_test, y_pred_prob)
print(f'AUC Score: {auc_score:.2f}')

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (area = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()


### 2. Model Tuning

In [None]:
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score, classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Assuming X_train, X_test, y_train, y_test are already defined

# Impute missing values in X_train and X_test
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Remove unexpected class '10000.0' from y_train and y_test
mask_train = y_train != 10000.0
y_train = y_train[mask_train]
X_train = X_train[mask_train]

mask_test = y_test != 10000.0
y_test = y_test[mask_test]
X_test = X_test[mask_test]

# Function to train and evaluate models
def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc_score = roc_auc_score(y_test, y_pred_prob)
    
    cm = confusion_matrix(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    print(f'Accuracy: {accuracy:.2f}')
    print('Confusion Matrix:')
    print(cm)
    print('Classification Report:')
    print(report)
    
    # Plot heatmap of confusion matrix
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix Heatmap')
    plt.show()
    
    # Compute ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
    
    # Plot ROC curve
    plt.figure()
    plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (area = {auc_score:.2f})')
    plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'auc_score': auc_score
    }

# Dictionary to store results
results = {}


# GradientBoostingClassifier
print("\nGradientBoostingClassifier Results:")
gb_classifier = GradientBoostingClassifier(random_state=42)
results['GradientBoosting'] = evaluate_model(gb_classifier, X_train, y_train, X_test, y_test)

# StackingClassifier
print("\nStackingClassifier Results:")
estimators = [('rf', RandomForestClassifier(random_state=42)),
              ('gb', GradientBoostingClassifier(random_state=42))]
stacking_classifier = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
results['Stacking'] = evaluate_model(stacking_classifier, X_train, y_train, X_test, y_test)

# Compare results
print("\nComparison of Model Performance:")
for model_name, metrics in results.items():
    print(f"{model_name}:")
    for metric_name, value in metrics.items():
        print(f"  {metric_name}: {value:.2f}")
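The ensembles above are fitted with default hyperparameters. Hyperparameter tuning in the usual sense can be sketched with `GridSearchCV`, which evaluates each parameter combination by cross-validation on the training data only. The example below uses synthetic data from `make_classification`; the grid values are illustrative assumptions, not recommended settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced demo data (about 20% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Small illustrative grid; a real search would usually cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_tr, y_tr)

# Evaluate the refitted best model once on the held-out test set
test_auc = roc_auc_score(y_te, search.best_estimator_.predict_proba(X_te)[:, 1])
print("Best parameters:", search.best_params_)
print(f"Cross-validated AUC: {search.best_score_:.3f}")
print(f"Test AUC: {test_auc:.3f}")
```

The tuned estimator could then be passed to `evaluate_model` in place of the default-configured classifiers so that all models are compared on the same metrics.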


In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score, confusion_matrix, classification_report
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# X_train, X_test, y_train, y_test were already imputed and filtered in the previous cell,
# and evaluate_model() is defined there as well; both are reused here unchanged.

# Dictionary to store results
results = {}

# RandomForestClassifier
print("RandomForestClassifier Results:")
rf_classifier = RandomForestClassifier(random_state=42)
results['Random Forest Classifier'] = evaluate_model(rf_classifier, X_train, y_train, X_test, y_test)

# DecisionTreeClassifier
print("\nDecisionTreeClassifier Results:")
dt_classifier = DecisionTreeClassifier(random_state=42)
results['Decision Tree Classifier'] = evaluate_model(dt_classifier, X_train, y_train, X_test, y_test)

# Support Vector Machine
print("\nSupport Vector Machine Results:")
svm_classifier = SVC(probability=True, random_state=42)
results['Support Vector Machine'] = evaluate_model(svm_classifier, X_train, y_train, X_test, y_test)

# Naive Bayes Classifier
print("\nNaive Bayes Classifier Results:")
nb_classifier = GaussianNB()
results['Naive Bayes Classifier'] = evaluate_model(nb_classifier, X_train, y_train, X_test, y_test)

# Bagging Classifier
print("\nBaggingClassifier Results:")
bagging_classifier = BaggingClassifier(random_state=42)
results['Bagging Classifier'] = evaluate_model(bagging_classifier, X_train, y_train, X_test, y_test)

# AdaBoostClassifier
print("\nAdaBoostClassifier Results:")
ada_classifier = AdaBoostClassifier(random_state=42)
results['Ada Boost Classifier'] = evaluate_model(ada_classifier, X_train, y_train, X_test, y_test)

# GradientBoostingClassifier
print("\nGradientBoostingClassifier Results:")
gb_classifier = GradientBoostingClassifier(random_state=42)
results['Gradient Boosting Classifier'] = evaluate_model(gb_classifier, X_train, y_train, X_test, y_test)

# StackingClassifier
print("\nStackingClassifier Results:")
estimators = [('rf', RandomForestClassifier(random_state=42)),
              ('gb', GradientBoostingClassifier(random_state=42))]
stacking_classifier = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
results['Stacking Classifier'] = evaluate_model(stacking_classifier, X_train, y_train, X_test, y_test)


# Compare results
print("\nComparison of Model Performance:")
comparison_df = pd.DataFrame(results).T
print(comparison_df)

# Display the comparison in tabular format
comparison_df.reset_index(inplace=True)
comparison_df.columns = ['Model Name', 'Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC Score']
comparison_df


### Choosing the Optimum Model
The Gradient Boosting Classifier and the Stacking model stand out with AUC scores of 0.897468 and 0.881665, respectively. AUC measures how well a model ranks defaulting customers above non-defaulting ones across all decision thresholds, so both models separate the two classes well.
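
To make the AUC interpretation concrete, here is a minimal sketch that computes AUC directly as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function name `pairwise_auc` and the toy labels and scores are assumptions made for illustration; they are not taken from the notebook's data.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("Both classes must be present to compute AUC.")
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return correct / (len(pos) * len(neg))

# Toy labels and scores, invented purely for illustration
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 3 of the 4 pairs are ordered correctly
```

For large test sets the pairwise comparison is memory-hungry, so `roc_auc_score` remains the practical choice; the sketch only shows what the number means.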

The Random Forest Classifier has the highest accuracy, but its recall is very low, meaning it misses a large share of actual defaults. When defaults are rare, high accuracy can coexist with low recall, so accuracy alone is not a sufficient criterion. The Naive Bayes Classifier, by contrast, has the highest recall but suffers from very low accuracy and precision.
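
The precision and recall values discussed here come from `model.predict`, which uses a default probability cutoff of 0.5 for these classifiers. The sketch below shows, under stated assumptions, how lowering that cutoff trades precision for recall. The helper `precision_recall_at` and the toy labels and probabilities are hypothetical and are not the notebook's results.

```python
import numpy as np

def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting default for y_prob >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob, dtype=float) >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    # Guard against division by zero when nothing is predicted or nothing is positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels (1 = default) and predicted probabilities, invented for illustration
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_probs = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.25, 0.40, 0.55, 0.90]

for threshold in (0.5, 0.3, 0.2):
    p, r = precision_recall_at(toy_labels, toy_probs, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, recall rises from 0.50 to 1.00 as the threshold drops from 0.5 to 0.2, while precision falls from 0.67 to 0.50. In the notebook, the same sweep could be run on the `y_pred_prob` values computed inside `evaluate_model`, ideally on a validation split rather than the test set.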

Considering the balance of accuracy, precision, recall, F1-score, and AUC score, the Gradient Boosting Classifier appears to be the best overall model.
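
One way to make this kind of multi-metric judgement reproducible is to rank the models on each metric and average the ranks. The sketch below is one possible approach, not the method used in this notebook: the function `rank_models`, the equal weighting of metrics, and the placeholder values are all assumptions made for illustration.

```python
import pandas as pd

def rank_models(metrics_df):
    """Order models by their mean rank across all metric columns (1 = best)."""
    # Higher is better for every metric here, so rank in descending order
    ranks = metrics_df.rank(ascending=False, method="average")
    summary = metrics_df.copy()
    summary["mean_rank"] = ranks.mean(axis=1)
    return summary.sort_values("mean_rank")

# Placeholder values invented for illustration; NOT the notebook's results
toy_metrics = pd.DataFrame(
    {
        "accuracy": [0.90, 0.70, 0.88],
        "precision": [0.80, 0.30, 0.75],
        "recall": [0.20, 0.90, 0.60],
        "f1_score": [0.32, 0.45, 0.67],
        "auc_score": [0.85, 0.75, 0.90],
    },
    index=["Model A", "Model B", "Model C"],
)
print(rank_models(toy_metrics))
```

In the notebook, the same function could be applied to `pd.DataFrame(results).T`, which has one row per model and one column per metric. Equal weighting is a simplification; if missed defaults are costlier than false alarms, recall and AUC could be given more weight.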

##### Implications for the Business
Using the Gradient Boosting Classifier can provide the following benefits to the business:

Improved Prediction Accuracy: A high AUC score (0.897468) indicates that the model separates defaulting from non-defaulting customers well, leading to more reliable predictions.

Balanced Performance: A reasonable balance of precision and recall limits both false positives (creditworthy customers flagged as risky) and false negatives (missed defaults), which is crucial for credit decisions.

Customer Satisfaction: Fewer false positives mean fewer creditworthy customers are wrongly flagged, which builds trust in the credit decision process.

Resource Allocation: More reliable risk predictions help direct reserves, capital, and customer support efforts towards the accounts that need them most.

Risk Management: Identifying high-risk customers early allows the company to take preventive measures before losses occur.

Overall, adopting the Gradient Boosting Classifier can strengthen the business's credit decision-making, leading to better outcomes and improved efficiency.






