In [90]:
import pandas as pd
import pyarrow as pa

# Load the dataset
df = pd.read_csv("loan_data.csv")

# Display the first few rows of the dataframe
print(df.head())


   credit.policy             purpose  int.rate  installment  log.annual.inc  \
0              1  debt_consolidation    0.1189       829.10       11.350407   
1              1         credit_card    0.1071       228.22       11.082143   
2              1  debt_consolidation    0.1357       366.86       10.373491   
3              1  debt_consolidation    0.1008       162.34       11.350407   
4              1         credit_card    0.1426       102.92       11.299732   

     dti  fico  days.with.cr.line  revol.bal  revol.util  inq.last.6mths  \
0  19.48   737        5639.958333      28854        52.1               0   
1  14.29   707        2760.000000      33623        76.7               0   
2  11.63   682        4710.000000       3511        25.6               1   
3   8.10   712        2699.958333      33667        73.2               1   
4  14.97   667        4066.000000       4740        39.5               0   

   delinq.2yrs  pub.rec  not.fully.paid  
0            0        0   

In [91]:
# Count the occurrences of each class in the 'not.fully.paid' column
class_distribution = df['not.fully.paid'].value_counts()

# Display the class distribution
print(class_distribution)


not.fully.paid
0    8045
1    1533
Name: count, dtype: int64
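
The raw counts are easier to compare as proportions. As a minimal sketch, the snippet below rebuilds a label column with the same counts as the output above and uses `value_counts(normalize=True)` to get each class's share. The `labels` series is an assumption that stands in for `df['not.fully.paid']`, so the example runs without the CSV file.

```python
import pandas as pd

# Stand-in for df['not.fully.paid'], rebuilt from the counts printed above (assumption)
labels = pd.Series([0] * 8045 + [1] * 1533, name="not.fully.paid")

# normalize=True returns each class's share of the rows instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share.round(3))
```

Roughly 84% of the loans are fully paid and 16% are not, so the two classes are clearly imbalanced.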


In [92]:
# Convert 'purpose' column to dummy variables
df = pd.get_dummies(df, columns=['purpose'])

# Display the updated dataframe
print(df.head())


   credit.policy  int.rate  installment  log.annual.inc    dti  fico  \
0              1    0.1189       829.10       11.350407  19.48   737   
1              1    0.1071       228.22       11.082143  14.29   707   
2              1    0.1357       366.86       10.373491  11.63   682   
3              1    0.1008       162.34       11.350407   8.10   712   
4              1    0.1426       102.92       11.299732  14.97   667   

   days.with.cr.line  revol.bal  revol.util  inq.last.6mths  delinq.2yrs  \
0        5639.958333      28854        52.1               0            0   
1        2760.000000      33623        76.7               0            0   
2        4710.000000       3511        25.6               1            0   
3        2699.958333      33667        73.2               1            0   
4        4066.000000       4740        39.5               0            1   

   pub.rec  not.fully.paid  purpose_all_other  purpose_credit_card  \
0        0               0              

In [93]:
from sklearn.model_selection import train_test_split

# Split the dataset into features (X) and target variable (y)
X = df.drop(columns=['not.fully.paid'])  # Features
y = df['not.fully.paid']  # Target variable

# Split the data into training and testing sets (67% train, 33% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Display the shapes of the training and testing sets
print("Training set - Features:", X_train.shape, "Labels:", y_train.shape)
print("Testing set - Features:", X_test.shape, "Labels:", y_test.shape)


Training set - Features: (6417, 19) Labels: (6417,)
Testing set - Features: (3161, 19) Labels: (3161,)


In [94]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Initialize the Gaussian Naive Bayes classifier
nb_classifier = GaussianNB()

# Train the classifier on the training data
nb_classifier.fit(X_train, y_train)

# Predictions on the testing set
y_pred = nb_classifier.predict(X_test)


In [95]:
# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate F1 score
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)

# Construct confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)


Accuracy: 0.8203100284720025
F1 Score: 0.13149847094801223
Confusion Matrix:
[[2550  100]
 [ 468   43]]


Based on the provided output, here's the interpretation:

    Accuracy: The accuracy of the model is approximately 82.03%, meaning 2,593 of the 3,161 test-set predictions were correct.

    F1 Score: The F1 score is approximately 0.1315. The F1 score is the harmonic mean of precision and recall for the positive class (not fully paid), so it is low whenever either of the two is low. Here precision is 43/143 (about 0.30) and recall is 43/511 (about 0.08), so the low F1 score is driven mainly by very low recall.

    Confusion Matrix:
        True Negative (TN): 2550
        False Positive (FP): 100
        False Negative (FN): 468
        True Positive (TP): 43

Interpreting the confusion matrix (the positive class is not.fully.paid = 1):

    The model correctly predicted 2550 fully paid loans as fully paid (True Negatives).
    It incorrectly predicted 100 fully paid loans as not fully paid (False Positives).
    It incorrectly predicted 468 not fully paid loans as fully paid (False Negatives).
    It correctly predicted only 43 not fully paid loans as not fully paid (True Positives).
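
The reported metrics can be reproduced by hand from the four confusion-matrix counts. The sketch below uses only the numbers printed above; the variable names are illustrative.

```python
# Counts read from the confusion matrix above (positive class: not.fully.paid = 1)
tn, fp, fn, tp = 2550, 100, 468, 43

# Share of all test-set predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Of the loans flagged as not fully paid, the share that truly are
precision = tp / (tp + fp)

# Of the loans that are truly not fully paid, the share that were flagged
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy and F1 values match the `accuracy_score` and `f1_score` output, and the breakdown shows that recall (about 0.08) is the weaker of the two components.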

Based on these evaluation metrics, the accuracy looks relatively high, but the F1 score is low, which means the model is performing poorly on the not fully paid class. The confusion matrix shows why: with 468 false negatives, the model misses 468 of the 511 not fully paid loans in the test set.

In conclusion, the accuracy looks acceptable on its own, but the low F1 score and the lopsided confusion matrix show that the model is not reliable at predicting not fully paid loans. Further investigation and model improvement are warranted.
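
One way to put the 82% accuracy in context is to compare it with a trivial baseline that predicts "fully paid" for every loan. The sketch below derives that baseline from the confusion-matrix counts alone; the baseline itself is a hypothetical comparison point, not something the notebook ran.

```python
# Row sums of the confusion matrix give the true class counts in the test set
actual_fully_paid = 2550 + 100      # true class 0
actual_not_fully_paid = 468 + 43    # true class 1
total = actual_fully_paid + actual_not_fully_paid

# Hypothetical baseline: predict "fully paid" (class 0) for every loan,
# so every truly fully paid loan is counted as correct
baseline_accuracy = actual_fully_paid / total

# Accuracy of the Gaussian Naive Bayes model: correct predictions over all loans
model_accuracy = (2550 + 43) / total

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"model accuracy:    {model_accuracy:.4f}")
```

The baseline reaches about 83.8% accuracy, slightly above the model's 82.0%, which is why accuracy by itself says little about how useful the model is here.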

In the analysis step, we interpret the results obtained from evaluating the model and provide insights into its performance. Here are some points to consider for the analysis:

    Accuracy:
        The accuracy score is the proportion of correctly classified instances out of all instances. Here the model achieved approximately 82.03%. This seems relatively high, but the class counts computed earlier (8045 fully paid vs. 1533 not fully paid, roughly 84% to 16%) show that the dataset is imbalanced, so accuracy alone is not a reliable metric.

    F1 Score:
        The F1 score is the harmonic mean of precision and recall, so it accounts for both false positives and false negatives and is low whenever either precision or recall is low. Here it is approximately 0.1315, indicating poor performance on the not fully paid class.

    Confusion Matrix:
        The confusion matrix provides detailed insights into the model's performance, showing the counts of true positives, true negatives, false positives, and false negatives.
        Analyzing the confusion matrix:
            The high number of false negatives (468) suggests that the model is failing to correctly identify instances of not fully paid loans, leading to a significant number of misclassifications.
            The relatively low number of true positives (43) further indicates the model's difficulty in correctly predicting instances of not fully paid loans.
            The presence of false positives (100) indicates instances where the model incorrectly predicts fully paid loans as not fully paid.

    Imbalance:
        As the class counts computed earlier show, fully paid loans heavily outnumber not fully paid ones. Under this imbalance a model can score high accuracy while rarely identifying the minority class, which is the pattern seen in the metrics above.

    Model Performance:
        Based on the evaluation metrics and the analysis of the confusion matrix, we can conclude that the model's performance is not satisfactory for predicting not fully paid loans accurately.
        The model exhibits a significant number of false negatives, indicating that it struggles to identify instances of not fully paid loans, which is crucial for the task at hand.

    Further Steps:
        Further investigation is needed to improve predictive performance: feature engineering, handling the class imbalance, and trying other machine learning algorithms or tuning the current model.
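
As one possible starting point for handling the imbalance, the sketch below combines a stratified split with a lower decision threshold on the predicted probabilities. It is an illustration under stated assumptions, not the notebook's method: the data is synthetic (generated with `make_classification` at roughly the same 84/16 class ratio), and the 0.3 threshold is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the loan data: roughly 84% class 0, 16% class 1 (assumption)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.84, 0.16], random_state=42
)

# stratify=y keeps the class ratio the same in the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

model = GaussianNB().fit(X_train, y_train)

# Predicted probability of the positive class (column 1) for each test row
positive_proba = model.predict_proba(X_test)[:, 1]

# Compare a 0.5 cutoff with a lower, illustrative cutoff of 0.3
recalls = {}
for threshold in (0.5, 0.3):
    y_pred = (positive_proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, y_pred)
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only flag more loans as positive, so recall cannot decrease, but precision usually drops. Any threshold used in practice would need to be chosen on validation data against the relative cost of missed defaults and false alarms.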

In summary, while the model achieved a decent accuracy, the F1 score and analysis of the confusion matrix reveal weaknesses in its performance, particularly in correctly identifying instances of not fully paid loans. Further steps should be taken to improve the model's performance for this classification task.