# Classification MSA Part 2


Checking_status: This feature is a categorical variable that describes the status of the current checking account of the individual. 

Duration: This feature is a numerical variable describing the duration in months of the credit.

Credit_history: This categorical feature describes the individual's credit history. It is crucial because it shows banks whether the applicant has had loans before and whether those loans were paid off.

Purpose: This categorical feature gives the purpose for which the credit is taken, for example, buying a car or a house.

Credit_amount: This is a numerical feature representing the credit amount.

Savings_status: This is a categorical feature that indicates the savings status of the person applying for the loan. Higher savings would generally be expected to count in the applicant's favour.

Employment: This is a categorical feature that describes the employment duration, for example, how many years they have worked or even if they are unemployed.

Installment_commitment: This is the instalment rate in percentage of disposable income. It is a numerical variable.

Personal_status: This categorical feature describes the personal status, for example, what gender they are and whether they are married.

Other_parties: This categorical variable indicates whether other debtors or guarantors are involved in the loan application; the value can also be 'none'.

Residence_since: This numerical feature is the number of years the applicant has lived at their present residence.

Property_magnitude: This categorical variable describes the property owned by the applicant, such as 'real estate', 'life insurance', and 'unknown/no property'.

Age: This is a numerical variable indicating the age of the individual applying for a loan.

Other_payment_plans: This categorical variable indicates other instalment plans such as 'bank', 'stores', and 'none'.

Housing: This categorical variable indicates the housing situation, for example, whether the applicant is "renting" or "owning" their home (banks can use this to judge whether the individual is living in a "stable" environment).

Existing_credits: This numerical feature shows the number of existing credits at this bank.

Job: This categorical variable describes the job situation, ranging from 'unemployed/unskilled' to 'highly skilled'.

Num_dependents: This numerical feature shows the number of people the applicant is liable to provide maintenance for.

Own_telephone: This categorical variable tells the bank if the individual has a phone.

Foreign_worker: This categorical variable indicates if the individual is a foreign worker.

Class: This is the target variable that the model will try to predict, given the other features in the data set. It is also a categorical variable representing whether the credit risk is 'good' or 'bad'.

These are all the variables provided in the dataset, which includes both numerical and categorical data. Once the categorical features are converted to numerical form, they can be used to train a machine learning model to predict the target variable. The class is the target because it is the variable that tells banks whether they should give the loan to the individual.

### Importing libraries and removing outliers


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingClassifier
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC  # needed for the SVC() model defined later


# Load data
df_credit_risk = pd.read_csv('credit_risk.csv')

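def remove_outliers(df):
    # Drop rows where any numeric column is more than 3 standard deviations from its mean
    numeric_cols = df.select_dtypes(include=[np.number])
    mean = numeric_cols.mean()
    std = numeric_cols.std()
    is_outlier = (np.abs(numeric_cols - mean) > 3 * std).any(axis=1)
    # Return a copy so later column assignments do not act on a slice of the original
    return df[~is_outlier].copy()

df_credit_risk = remove_outliers(df_credit_risk)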
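The three-standard-deviation rule used by remove_outliers can be seen on a small example. The sketch below uses a made-up table (the values and the `toy` name are assumptions for illustration, not the credit data) and applies the same rule through explicit z-scores.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 29 ordinary amounts plus one extreme value
toy = pd.DataFrame({
    "credit_amount": list(range(10, 39)) + [1000],
    "purpose": ["car"] * 30,  # non-numeric column, ignored by the rule
})

numeric = toy.select_dtypes(include=[np.number])
# Z-score: distance from the column mean in units of standard deviation
z_scores = (numeric - numeric.mean()) / numeric.std()
# Keep only rows where every numeric column is within 3 standard deviations
toy_clean = toy[(z_scores.abs() <= 3).all(axis=1)]

print(len(toy), "->", len(toy_clean))  # 30 -> 29
```

Note that the extreme value itself inflates the standard deviation, so in very small samples this rule may fail to flag anything.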


### Setting up and splitting the data

In [None]:
# Convert to appropriate data types
for column in ['duration', 'credit_amount', 'installment_commitment', 'residence_since', 'age', 'existing_credits', 'num_dependents']:
    df_credit_risk[column] = pd.to_numeric(df_credit_risk[column], errors='coerce')

# Label encoding for target variable 'class'
le = LabelEncoder()
df_credit_risk['class'] = le.fit_transform(df_credit_risk['class'])

# Convert categorical variables into dummy/indicator variables
categorical_columns = ['checking_status', 'credit_history', 'purpose', 'savings_status', 'employment', 'personal_status', 
                       'other_parties', 'property_magnitude', 'other_payment_plans', 'housing', 'job', 'own_telephone', 
                       'foreign_worker']
df_credit_risk = pd.get_dummies(df_credit_risk, columns=categorical_columns)



# Define features and target
X = df_credit_risk.drop('class', axis=1)
y = df_credit_risk['class'].values

# Imputation and scaling
imputer = SimpleImputer()
scaler = StandardScaler()
X_imputed = imputer.fit_transform(X)
X_scaled = scaler.fit_transform(X_imputed)

# Feature importance
model = GradientBoostingClassifier()
model.fit(X_scaled, y)
feature_importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Define top 5 features (hard-coded from the ranking in feature_importance)
features_top_5 = ['checking_status_no checking', 'credit_amount', 'duration', 'age', 'checking_status_<0']

# Subset features
X_subset = X[features_top_5]

# Imputation and scaling for subset features
X_subset_imputed = imputer.fit_transform(X_subset)
X_subset_scaled = scaler.fit_transform(X_subset_imputed)
# Define model
model = SVC()

In the code above, data type conversion takes place first: the listed columns are converted to numeric data types, and any non-numeric values are replaced with NaN.

Label Encoding also takes place; the target column chosen was 'class', and it is encoded from string labels to numeric values to be compatible with machine learning algorithms.
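To make the label encoding concrete, here is a minimal sketch on a made-up list of labels (the `labels` and `le_demo` names are illustrative only). LabelEncoder sorts the classes, so 'bad' becomes 0 and 'good' becomes 1, which matters later when reading the confusion matrix.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["good", "bad", "good", "good"]  # hypothetical sample of the target

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ lists the original labels in the order of their integer codes
print(le_demo.classes_.tolist())  # ['bad', 'good']
print(encoded.tolist())           # [1, 0, 1, 1]
```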

Another technique taking place is One-Hot Encoding; in the code above, each categorical column is transformed into a set of binary columns, one per category, so that the machine learning model can read the categorical values as numerical data.
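As a small illustration of what pd.get_dummies produces, the sketch below encodes a made-up 'housing' column (the values are assumptions, not taken from the dataset). Each category becomes its own indicator column, and numeric columns are left untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [25, 40, 33],
    "housing": ["own", "rent", "own"],  # hypothetical category values
})

# One indicator column is created per category, named <column>_<category>
encoded = pd.get_dummies(toy, columns=["housing"])
print(encoded.columns.tolist())  # ['age', 'housing_own', 'housing_rent']
```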

Defining Features and Target: X holds every column except 'class', and y holds the encoded target variable 'class'. The five features with the most significant relationship to the class are selected from X later in the code.

Imputation and Scaling: In this part of the code, missing values (NaN) in each feature are replaced with that feature's mean, and every feature is then standardised to mean 0 and standard deviation 1. This makes the data ready for machine learning algorithms, and because the imputed entries sit at the average, they have little effect on the predictions.
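The two steps can be traced on a single made-up column. This is a sketch under the assumption that the default mean strategy of SimpleImputer is used, as in the code above; the `X_toy` values are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0], [np.nan], [3.0]])  # one feature with a missing value

# Step 1: the missing entry is replaced by the column mean (2.0)
X_filled = SimpleImputer(strategy="mean").fit_transform(X_toy)
# Step 2: the column is rescaled to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X_filled)

print(X_filled.ravel().tolist())  # [1.0, 2.0, 3.0]
print(X_std.ravel().round(3).tolist())
```

After scaling, the imputed entry lands exactly at 0, the centre of the standardised feature.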

Feature Importance: Using Gradient Boosting, the importance of each feature in predicting the target is determined.
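The notebook hard-codes the five feature names after inspecting the ranking. As an alternative, the names could be read directly from the sorted importances. The sketch below shows this on synthetic data from make_classification; the `feature_i` names and all parameters are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the credit data (hypothetical feature names)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

gb = GradientBoostingClassifier(random_state=0).fit(X_demo, y_demo)
importance = pd.Series(gb.feature_importances_, index=X_demo.columns)
importance = importance.sort_values(ascending=False)

# Take the names of the five highest-ranked features
top_5 = importance.head(5).index.tolist()
print(importance.head(5))
```

Fixing random_state keeps the ranking reproducible between runs.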

Subsetting Top Features: The dataset is reduced to a subset containing only the five most important features.

Preparing Subset Data: The subset data also undergoes imputation and scaling, similar to the entire dataset.

Model Definition: Finally, a Support Vector Classifier model is defined, ready for training on the prepared data.

Overall, the script transforms and pre-processes the credit risk data, assesses feature significance, and sets the stage for using an SVM-based predictive model.






In [None]:
# Run the model 10 times with different random seeds
accuracy_scores = []
f1_scores = []
confusion_matrices = []

for i in range(10):
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X_subset_scaled, y, test_size=0.35, random_state=i)

    # Train model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Get accuracy score, F1 score, and confusion matrix
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    # Store scores and confusion matrix
    accuracy_scores.append(accuracy)
    f1_scores.append(f1)
    confusion_matrices.append(cm)

Initialization: In the code above, three empty lists are created (accuracy_scores, f1_scores, and confusion_matrices) to store the evaluation metrics across the ten seeds.

Loop to Train and Test Model: A for loop is initialized, which runs ten times, each time with a different random seed i.

Data Splitting: The dataset is split inside the loop into training and testing sets using the train_test_split function. The test_size is 0.35, so 35% of the dataset is held out for testing while the remaining 65% is used for training.

The random_state argument is set to the current loop iteration number i. This gives a different random seed, and therefore a different split, on each iteration, which helps show how robust the model is to the choice of split.

Model Training: The SVM model was chosen and is trained using the fit method on the training data (X_train and y_train).

Model Prediction: Once training is complete, predictions (y_pred) are made on the test dataset (X_test).

Evaluation: Three evaluation metrics are computed for each iteration:

Firstly, the accuracy of the model is taken; it measures the fraction of correct predictions out of all predictions.

Secondly, the F1 score is added to the list; it is the harmonic mean of precision and recall for the positive class, which makes it useful when the class distribution is imbalanced.

Lastly, cm, the confusion matrix, is taken; it provides a detailed breakdown of true positive, true negative, false positive, and false negative predictions. These metrics are then appended to their respective lists so the results can be graphed later.
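The relationship between the three metrics can be checked by hand on a tiny made-up example (the `y_true` and `y_hat` lists are illustrative). In scikit-learn, confusion_matrix puts true labels on the rows and predicted labels on the columns, with label 0 first, so ravel() returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1]  # hypothetical true labels
y_hat = [0, 1, 1, 1, 1, 1, 1, 0]   # hypothetical predictions

# Row = true label, column = predicted label, label 0 first
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1_manual = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)                   # 1 2 1 4
print(accuracy_score(y_true, y_hat))    # 0.625
print(round(float(f1_manual), 4))       # 0.7273
```

The manual F1 value matches what f1_score returns for the same inputs.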

Result: After the loop has run ten times, each of the three lists holds ten values, ready to be graphed.

By running the SVM model multiple times with different train-test splits, this approach provides a more robust estimate of the model's performance. It accounts for potential variability due to different data splits, reducing the likelihood that a particularly "good" or "bad" split biases the performance assessment.
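One possible refinement of this loop, sketched below on synthetic data, is to fit the scaler inside each training split and to stratify the split so both classes keep the same proportions in train and test. This is a variation under stated assumptions (make_classification data, default SVC settings), not the notebook's own procedure, which scales the full dataset before splitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the credit data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.3, 0.7], random_state=0,
)

scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.35, random_state=seed, stratify=y_demo
    )
    # The pipeline fits the scaler on the training split only
    clf = make_pipeline(StandardScaler(), SVC())
    clf.fit(X_tr, y_tr)
    scores.append(f1_score(y_te, clf.predict(X_te)))

# Report the spread as well as the mean across the ten splits
print(f"F1: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Reporting the standard deviation alongside the mean makes the variability between splits explicit.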

In [None]:
# Display the accuracy and the F1 score
print("Model: Support Vector Machine")
print(f"Mean Accuracy: {np.mean(accuracy_scores)}")
print(f"Mean F1 Score: {np.mean(f1_scores)}")
print("Confusion Matrices: ")
for matrix in confusion_matrices:
    print(matrix)

# Plot the distribution of the scores
fig, ax = plt.subplots(1, 2, figsize=(15, 7))

# Dot plot for Accuracy Scores
sns.stripplot(y=accuracy_scores, ax=ax[0], color='b', jitter=0.4)
ax[0].set_title('Distribution of Accuracy Scores')
ax[0].set_ylabel('Accuracy Score')

# Dot plot for F1 Scores
sns.stripplot(y=f1_scores, ax=ax[1], color='g', jitter=0.4)
ax[1].set_title('Distribution of F1 Scores')
ax[1].set_ylabel('F1 Score')
# display them together
plt.tight_layout()
plt.show()

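As we can see, the model's accuracy ranges from 71% to 77%, which means it is only moderately accurate when it comes to predicting, while the F1 score ranges from 80% to 88%. The F1 score is worth examining alongside accuracy because it combines precision and recall, which matters when the classes are imbalanced. Because it is computed for the positive class only, it says little about how well the other class is predicted, which the confusion matrix addresses.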

In [None]:
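# Plot heatmap of confusion matrix
cm_avg = np.mean(confusion_matrices, axis=0)  # Average confusion matrix
# fmt='.1f' shows the averaged counts as plain decimals instead of scientific notation
sns.heatmap(cm_avg, annot=True, fmt='.1f', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix Heatmap')
plt.show()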

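Reading the heatmap depends on two conventions. The confusion_matrix function puts true labels on the rows and predicted labels on the columns, with label 0 first, and LabelEncoder assigns labels alphabetically, so 'bad' is 0 (the negative class) and 'good' is 1 (the positive class). With that layout, the averaged values are approximately:

True Negatives = 28: The model correctly predicted 28 bad-credit instances as the negative class.

False Positives = 69: The model incorrectly predicted 69 instances as the positive class when they were the negative class. These are type I errors: bad credit predicted as good.

False Negatives = 17: The model incorrectly predicted 17 good-credit instances as the negative class, meaning there were 17 type II errors.

True Positives = 220: The model correctly predicted 220 good-credit instances as the positive class.

This reading is consistent with the reported scores: 2 x 220 / (2 x 220 + 69 + 17) is about 0.84, inside the 80% - 88% F1 range, whereas treating 28 as the true positives would give an F1 score of only about 0.39.

Looking at the data above, it becomes clear that the model is good at predicting positive (good credit) cases but needs to catch up when it comes to predicting negative (bad credit) cases: only 28 of the 97 bad-credit instances are caught. The data did not have an even split between good and bad credit (roughly 237 good to 97 bad in the test set). To improve the model, more data showing the difference between good and bad credit, especially bad credit, would need to be provided.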
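The four counts can be checked against the reported accuracy and F1 ranges with a few lines of arithmetic. The sketch below assumes scikit-learn's layout and that 'good' is encoded as 1 (the positive class), so that 220 is the true-positive count and 28 the true-negative count; the variable names are illustrative.

```python
# Averaged counts from the heatmap, under the assumption that
# 'bad' = 0 (negative class) and 'good' = 1 (positive class)
tn, fp, fn, tp = 28, 69, 17, 220
total = tn + fp + fn + tp

accuracy = (tp + tn) / total        # fraction of all predictions that are correct
precision = tp / (tp + fp)          # predicted 'good' that really is good
recall = tp / (tp + fn)             # actual 'good' that is found
f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall
specificity = tn / (tn + fp)        # actual 'bad' that is found

print(f"accuracy    = {accuracy:.3f}")
print(f"precision   = {precision:.3f}")
print(f"recall      = {recall:.3f}")
print(f"F1          = {f1:.3f}")
print(f"specificity = {specificity:.3f}")
```

Under this assumption, accuracy comes out near 0.74 and F1 near 0.84, both inside the reported ranges, while fewer than a third of the bad-credit cases are identified.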



# Summary

The credit risk prediction model achieved an accuracy that ranged between 71% and 77%. While accuracy provides a general sense of model correctness, the F1 score, ranging between 80% and 88%, combines precision and recall for the positive (good credit) class.

Upon inspecting the confusion matrix, with 'bad' encoded as 0 (negative) and 'good' as 1 (positive), specific insights emerge:

True Positives = 220: The model correctly identified 220 cases as the positive class.

False Positives = 69: A matter of concern arises as the model falsely classified 69 instances as positive when they belonged to the negative class, meaning bad credit was predicted as good.

False Negatives = 17: Further, the model made 17 erroneous predictions of the negative class, which are type II errors.

True Negatives = 28: The model recognised only 28 instances as the negative class.

When exploring the model's performance, it becomes apparent that it can discern positive (good credit) instances but falters on the negative (bad credit) ones. The dataset appears to be skewed in favour of good credit, meaning more examples of bad credit are needed. Such an imbalance can lead the model to develop a bias towards the majority class.

To improve the model, changes to both the data and the model are suggested:

Data Augmentation: Increasing the number of bad-credit instances will give the model a broader perspective, enabling it to differentiate between good and bad credit more proficiently. When direct data collection is challenging, other methods, such as resampling the minority class, can be used.

Model Exploration: Venturing beyond the current algorithm to explore models that can cope better with class imbalance, such as Random Forest or XGBoost, could lead to improvements.

Feature Enhancement: Introducing new features or refining existing ones based on domain expertise can further sharpen the model's discerning power. Domain-specific insights into credit risk factors can pave the way for innovative feature engineering.

Metric Optimisation: Relying on accuracy alone is not the best way to judge a model. Shifting the focus to metrics such as precision, recall, or the F1 score can offer a more detailed and balanced evaluation. Optimising recall can be instrumental because, when dealing with credit, it is crucial that the model identifies risky applicants so that banks give loans to the right people.
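As one illustration of combining these suggestions, the sketch below compares recall on the minority class for a default SVC and for one using class_weight='balanced', which weights errors inversely to class frequency. It runs on synthetic data from make_classification, so the numbers say nothing about the credit dataset itself; all names and parameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with roughly 30% of samples in class 0 (the minority)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.3, 0.7], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=0, stratify=y_demo
)

results = {}
for name, weight in [("default", None), ("balanced", "balanced")]:
    clf = make_pipeline(StandardScaler(), SVC(class_weight=weight))
    clf.fit(X_tr, y_tr)
    # Recall on class 0: the share of minority cases the model catches
    results[name] = recall_score(y_te, clf.predict(X_te), pos_label=0)

for name, value in results.items():
    print(f"{name}: minority-class recall = {value:.3f}")
```

Tracking recall on the minority class directly, rather than overall accuracy, makes any change in how many risky cases are caught visible.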

To conclude, the model, while usable, definitely needs improvement. Credit risk evaluation is critical because many individuals who apply for loans lack the funds to repay the banks.