<h1 style="color:red;">Credit rating assignment</h1>
<p></p>
In this assignment, we'll work our way through a simple ML exercise. Machine learning is an iterative process that starts with feature engineering (making the features ready for ML) and works its way through various models and hyperparameter tuning exercises until we find a model that seems to work well for us.

<h3 style="color:green;">The problem: Rating creditworthiness of loan applicants</h3>

When banks issue loans to individuals, they have two goals that conflict with each other:
<ol>
    <li>Give as many loans as possible (fees, interest, all add to revenue)</li>
    <li>Try not to give loans to individuals who won't pay them back (lose money on the loan, collection costs, etc.)</li>
</ol>
    
<li>A typical machine learning program in this space tries to find a suitable tradeoff between finding many good loans and not calling a bad loan good</li>

<li>In this assignment, we'll try to build a "good" model that finds a good tradeoff between these two objectives</li>

<li>In machine learning terms, the proportion of times we get our guess right (i.e., we call a bad loan a bad loan and a good loan a good loan divided by the total number of cases) is called <span style="color:blue">accuracy</span></li>

<li>The proportion of actual good loans that we identify as good loans is known as <span style="color:blue">recall</span></li>

<li>The probability that if a loan is called good it actually is good is called <span style="color:blue">precision</span></li>

<li>The precision recall tradeoff is summarized in a single score, the harmonic mean of precision and recall, called the <span style="color:blue">f1 score</span></li>

<li>An important part of running an ML model is trying to figure out "which metric is right for you"</li>
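<p>The four metrics above can all be computed from the counts in a confusion matrix. The sketch below uses made-up counts (not taken from the loan data), and the variable names are illustrative only.</p>

```python
# Hypothetical confusion-matrix counts (not from the loan data)
tn, fp, fn, tp = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all guesses that are right
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```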


    
    
<ol>
    <li>We'll try the SGD classifier, tune hyperparameters using grid search, and examine the results</li>
    <li>Then, set up the data for a random forest classifier, run a grid search, and examine the results</li>
    <li>Finally, run a couple of gradient booster models</li>
    <li>Draw precision recall curves and ROC curves for the two classifiers and compare the results</li>
    <li>Note that grid search is a computing intensive activity. I've simplified the search to a few options but even those can take a long while (less than 15 minutes on my laptop but could be a couple of hours if you have an older machine)</li>
</ol>

<h3 style="color:green;">The models</h3>
<p></p>
<li><b>Model 1 SGD Classifier</b>: Vanilla version with max_iter set to 1000</li>
<li><b>Model 2 SGD Classifier round 2</b>: SGD Classifier with positive cases assigned a higher weight. One issue with our data is that positive cases are vastly outnumbered by negative cases (in other words, a model that says all cases are negative will have a pretty good accuracy). By overweighting positive cases, we make the model more effective at finding the actual positive cases</li>
<li><b>Model 3 SGD Classifier round 3</b>: Best SGD Classifier model after grid search</li>
<li><b>Model 4 Random Forest Classifier round 1</b>: Random Forest Classifier with base parameters (see below)</li>
<li><b>Model 5 Random Forest Classifier round 2</b>: Best model from grid search</li>
<li><b>Model 6 Gradient Booster Classifier</b></li>
<li><b>Model 7 Gradient Booster Classifier (2nd model)</b></li>

For each model, collect model metrics in the following dataframe results_df. After each model run, replace the 0.0 with the appropriate metric value


In [1]:
import pandas as pd
import numpy as np
results_df = pd.DataFrame(np.zeros(shape=(7,6)))
results_df.index=[1,2,3,4,5,6,7]
results_df.columns = ["accuracy","precision","recall","f1_score","AUC","AP"]
results_df.index.rename("Model",inplace=True)
results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h3 style="color:green;">The data</h3>
<p></p>
<li>A curated extract from the popular Lending club loan data. The data is in the file loan_data_small.csv</li>
<li>The dataset contains information about loan applications. Very basic information about the applicant and the status of the loan</li>
<li>The goal of the ML exercise is to build a model that uses information about the loan to predict whether a loan is a "good" one (i.e., it will be paid back) or a "bad" one (the money is unrecoverable)</li>
<li>Note that we're only using a fraction of the data. If you're interested, I can share the curated extract on a larger fraction which gives better results (but can crash your machine!)</li>

<h1 style="color:red;font-size:xx-large">Data preparation and feature engineering</h1>


<h3 style="color:green;">Build a binary target</h3>

<li>For the purposes of this analysis, drop rows that contain any NaN values</li>
<li><b>Target</b>: For the classifier, classify any loans that have a loan_status value of "Charged Off","Default", or "Does not meet the credit policy. Status:Charged Off" as a bad loan and give these loans a target value of 1 (we're predicting bad loans)</li>
<li><b>Input features</b>: create the input feature dataframe (i.e., drop any columns that are not an independent variable). The input variables we're interested in are "int_rate", "grade", "home_ownership", "annual_inc", "loan_amnt", and "purpose"</li>
<p></p>
<li>The data should look like:</li>
<pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  object 
 3   home_ownership  565167 non-null  object 
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  object 
dtypes: float64(2), int64(2), object(3)
memory usage: 30.2+ MB
Out[108]:
0         False
1          True
2         False
3         False
4          True
          ...  
565162    False
565163    False
565164    False
565165     True
565166    False
Name: loan_status, Length: 565167, dtype: bool

</pre>

In [2]:
#read the file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("Resources/loan_data_small.csv")

#Drop rows with NaN values
df.dropna(inplace=True)

#Prepare the y (target) variable
#The target variable should be 1 if loan_status is "Charged Off","Default", or "Does not meet the credit policy. Status:Charged Off"
#And 0 otherwise
#(Hint: Create a boolean mask series)

y = df["loan_status"].isin(["Charged Off","Default","Does not meet the credit policy. Status:Charged Off"])

#remove unwanted input features "Unnamed: 0" and "loan_status"
df.drop(columns=["Unnamed: 0","loan_status"],inplace=True)

#Examine the df and the target
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  object 
 3   home_ownership  565167 non-null  object 
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  object 
dtypes: float64(2), int64(2), object(3)
memory usage: 30.2+ MB


In [3]:
y

0         False
1          True
2         False
3         False
4          True
          ...  
565162    False
565163    False
565164    False
565165     True
565166    False
Name: loan_status, Length: 565167, dtype: bool

<h3 style="color:green;">Label Encoding</h3>
<li>Since we're using regression as our underlying algorithm, all values need to be numerical. ML Models generally deal with numerical data</li>
<li>But, <span style="color:blue">grade</span>, <span style="color:blue">purpose</span>, and <span style="color:blue">home_ownership</span> are not</li>
<li>sklearn's <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html">LabelEncoder</a> assigns numerical values to categorical data</li>
<li>LabelEncoder replaces each categorical string value with an integer - 0, 1, 2, ...</li>
<li>After label encoding, df.info() should return:</li>
<pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  int64  
 3   home_ownership  565167 non-null  int64  
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  int64  
dtypes: float64(2), int64(5)
memory usage: 30.2 MB
</pre>
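<p>As a minimal sketch of what label encoding does, the toy example below (made-up grades, not the loan data) shows that LabelEncoder sorts the distinct categories and replaces each value with the position of its category.</p>

```python
from sklearn.preprocessing import LabelEncoder

# Toy grades, not the real loan data
grades = ["B", "A", "C", "A"]

encoder = LabelEncoder()
codes = encoder.fit_transform(grades)

print(list(encoder.classes_))  # the distinct categories, in sorted order
print(list(codes))             # the integer code assigned to each input value
```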

In [4]:
#replace grade, purpose, and home_ownership by label encoded versions
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["grade"] = le.fit_transform(df["grade"])
df["purpose"] = le.fit_transform(df["purpose"])
df["home_ownership"] = le.fit_transform(df["home_ownership"])


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  int32  
 3   home_ownership  565167 non-null  int32  
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  int32  
dtypes: float64(2), int32(3), int64(2)
memory usage: 23.7 MB


<h3 style="color:green;">One-hot encoding</h3>

<p></p>
<li>In regression, the assumption is that values associated with a feature are ordered</li>
<li>But, this is not necessarily so for the label encoded categorical values</li>
<li>The way to deal with this in regression is to create dummy variables, one for each category, that take the value 1 if the category is present in the row and 0 otherwise</li>
<li>In ML, a procedure known as <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a> is used to do this conversion</li>
<li>One-hot encoding is the process of converting a single column of categorical (integer) data with k categories into k columns of 0 or 1 values, or into k-1 columns when one category is dropped and treated as the baseline, as we do here</li>
<li>for example, the array with three possible categories [1,2,3,2,1] will be converted into the matrix:</li>

$$\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 0 \end{bmatrix}$$

<li>1's are replaced by (0, 0); 2's by (1, 0); and 3's by (0, 1). Note that category 1 is implicitly coded</li>
<li><b>Documentation</b>: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html</a></li>
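<p>The [1,2,3,2,1] example above can be reproduced with OneHotEncoder. This is only a sketch on the toy array; the variable names are illustrative, and drop="first" is what leaves category 1 implicitly coded.</p>

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The example column from the text, reshaped to (n_samples, 1)
categories = np.array([1, 2, 3, 2, 1]).reshape(-1, 1)

# drop="first" removes the column for the first category (1 here),
# so k=3 categories become k-1=2 columns
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(categories).toarray()

print(encoded)
```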

<h3 style="color:green;">Scaling</h3>

<p></p>
<li>Non-categorical independent variables need to be scaled so that they are on a comparable scale and no single feature dominates simply because of its units</li>
<li>We will normalize them so that the mean is 0 and standard deviation is 1 using sklearn's StandardScaler feature transformer</li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html</a></li>

<li>All feature transformations can be encapsulated in a column transformer built with the sklearn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html">make_column_transformer</a> function</li>
<li>Use <span style="color:blue">make_column_transformer</span> to encapsulate both the one-hot coding as well as standard scaling. Note that the one-hot encoded columns are not scaled!</li>
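<p>To illustrate the scaling step on its own, here is a small sketch with made-up interest rates (not the loan data). After StandardScaler, the column has mean 0 and standard deviation 1.</p>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy interest rates, not the real loan data
rates = np.array([[10.0], [12.0], [14.0]])

# Subtract the column mean and divide by the column standard deviation
scaled = StandardScaler().fit_transform(rates)

print(scaled.ravel())
print(scaled.mean(), scaled.std())
```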

In [6]:

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer

#Make a column transformer object that scales (using StandardScaler) the two non-categorical columns
# and one hot encodes (using OneHotEncoder) the three categorical columns
# Using make_column_transformer 
preprocess = make_column_transformer(
    (StandardScaler(),['int_rate', 'annual_inc'], ),
    (OneHotEncoder(categories="auto",drop="first"),['grade', 'home_ownership','purpose'], )
)

#Generate the independent variable df
X = preprocess.fit_transform(df).toarray()
X.shape
#Should return (565167, 26)

(565167, 26)

<h3 style="color:green;">Train/Test split</h3>

<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html</a></li>
<li>split the data into 70% training and 30% testing</li>
<li>make sure the x and y datasets are aligned</li>
<li>use random_state=42 to get the same split as in my code </li>
<li>x and y training data shapes: (395616, 26) (395616,)</li>
<li>x and y testing data shapes: (169551, 26) (169551,)</li>
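<p>The sketch below, on a tiny made-up dataset, shows what "aligned" means here: when X and y are passed to train_test_split together, rows and targets are shuffled in step, so each row keeps its own target. The names X_toy and y_toy are illustrative.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, with the target derived from the first feature
X_toy = np.arange(20).reshape(10, 2)
y_toy = X_toy[:, 0] * 10

x_tr, x_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

print(x_tr.shape, y_tr.shape)
print(x_te.shape, y_te.shape)

# Rows and targets were split together, so the relationship still holds
aligned = bool((x_tr[:, 0] * 10 == y_tr).all() and (x_te[:, 0] * 10 == y_te).all())
print(aligned)
```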

In [7]:
from sklearn.model_selection import train_test_split
#Get x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)

#And check the shape
print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)

"""
Should return:
(395616, 26) (395616,)
(169551, 26) (169551,)
"""

(395616, 26) (395616,)
(169551, 26) (169551,)



In [8]:
x_test

array([[ 0.27542382, -0.54678873,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.81078196,  1.05693436,  1.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.20043776,  0.4221273 ,  1.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.79059571, -0.3930986 ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-1.18940226, -0.85299293,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.09698959, -0.0656718 ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

<h1 style="color:green">The models</h1>
<li>For each model, do the following</li>
<ol>
    <li>Fit a classifier to the training data</li>
    <li>calculate the metrics</li>
    <ul>
        <li>training accuracy</li>
        <li>testing accuracy</li>
        <li>precision on test dataset</li>
        <li>recall on test dataset</li>
        <li>f1 score on test dataset</li>
        <li>area under the curve on test dataset</li>
        <li>average precision on the test dataset</li>
    </ul>
    <li>Write up a brief (pointwise) interpretation of the results</li>
</ol>
<li>Chart the various metrics</li>


<h1 style="color:red;font-size:xx-large">Build Model 1</h1>


<h3 style="color:green;">Build the model on the training data set</h3>

<li>set random_state to 42 (if you want to get the same results that I got) and max_iter to 1000</li>
<li>set the loss function to "log_loss" ("log" if using sklearn 1.0.x or on colab)</li>


In [11]:
from sklearn.linear_model import SGDClassifier
model_1 = SGDClassifier(loss="log_loss",random_state=42, max_iter=1000)
model_1.fit(x_train,y_train) #change if you used different variable names

print(model_1.score(x_train,y_train))
print(model_1.score(x_test,y_test))
"""
You should get:
0.8846634109843889
0.8843828700508991
"""

0.8846634109843889
0.8843828700508991




<h3 style="color:green;">Model 1 metrics</h3>
<li>Report the following on the <b>test</b> data:</li>
<ul>
<li>the confusion matrix</li>
<li>the accuracy, precision, recall, f1-score, AUC, and AP </li>
</ul>


In [13]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score,recall_score,precision_score
from sklearn.metrics import average_precision_score,roc_auc_score

cfm = confusion_matrix(y_test,model_1.predict(x_test))
accuracy_training = model_1.score(x_train,y_train)
accuracy_testing = model_1.score(x_test,y_test)
precision = precision_score(y_test,model_1.predict(x_test))
recall = recall_score(y_test,model_1.predict(x_test))
f1 = f1_score(y_test,model_1.predict(x_test))
auc = roc_auc_score(y_test,model_1.predict_proba(x_test)[:,1])
ap = average_precision_score(y_test,model_1.predict_proba(x_test)[:,1])

print("Confusion Matrix: \n",cfm)
print("Training accuracy: ",accuracy_training)
print("Testing  accuracy: ",accuracy_testing)
print("Precision: ",precision)
print("Recall: ",recall)
print("F1-Score: ",f1)
print("AUC: ",auc)
print("Average Precision: ",ap)


"""

You should see:

Confusion Matrix: 
 [[149948      1]
 [ 19602      0]]
Training accuracy:  0.8846634109843889
Testing  accuracy:  0.8843828700508991
Precision:  0.0
Recall:  0.0
F1-Score:  0.0
AUC:  0.692962177388246
Average Precision:  0.11561123201868465
"""

Confusion Matrix: 
 [[149948      1]
 [ 19602      0]]
Training accuracy:  0.8846634109843889
Testing  accuracy:  0.8843828700508991
Precision:  0.0
Recall:  0.0
F1-Score:  0.0
AUC:  0.6929621775583543
Average Precision:  0.2277691812634675



<h3 style="color:green;">Interpret the results</h3>
<li>In a few bullet points, write your interpretation of the results. Why are we seeing what we are seeing? Is it useful? Why is the AUC not 0.5?</li>

<h4>Interpretation</h4>
<li> The model is heavily biased toward the negative class (good loans). The bias is likely due to the class imbalance in the dataset. </li>
<li> The training and testing accuracy are both approximately 0.884, i.e. the model gets its guess right 88.4\% of the time. However, this high accuracy is misleading because the model predicts almost every loan to be good, regardless of its actual status. Any model that always predicts the majority class would achieve a similar accuracy. </li>
<li> The precision for this model is 0: the model labeled only one test loan as bad, and that loan was actually good, so none of its bad-loan predictions were correct. </li>
<li> The recall for this model is 0: the proportion of actual bad loans that the model identifies as bad is zero, i.e. the model misses all 19602 bad loans in the test set. </li>
<li> The F1 score of 0 reflects that the model fails on both precision and recall for the positive class (bad loans). </li>
<li> The AUC of 0.693 (greater than 0.5) indicates that the model has some ability to distinguish between good and bad loans and is not merely guessing at random. AUC is computed from the predicted probabilities rather than from the hard labels, so even though almost no loan crosses the default decision threshold, bad loans still tend to receive higher scores than good loans. </li>
<li> The average precision of about 0.228 indicates that, across different recall values, the model's precision is relatively low. </li>
<li> Model 1 achieves the goal of giving as many loans as possible, but it fails at the goal of not giving loans to individuals who won't pay them back, i.e. identifying bad loans.</li>
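<p>A small made-up example (not the loan data) shows how recall can be 0 while AUC is well above 0.5. All scores sit below the 0.5 threshold, so every case is labeled negative, yet the positives are still ranked above the negatives.</p>

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.35, 0.40])

# Hard predictions at a 0.5 threshold: every case is labeled 0
y_pred = (y_score >= 0.5).astype(int)

toy_recall = recall_score(y_true, y_pred)   # uses the hard labels
toy_auc = roc_auc_score(y_true, y_score)    # uses the ranking of the scores

print(toy_recall, toy_auc)
```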

<h3 style="color:green;">Update results_df</h3>


In [17]:
results_df.loc[1] = [accuracy_testing,precision,recall,f1,auc,ap]

results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.884383,0.0,0.0,0.0,0.692962,0.227769
2,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 2</h1>



<li>sklearn's ML models can be given a <span style="color:blue">class_weight</span> parameter</li>
<li>weights can be given explicitly or implicitly</li>
<li>note that by increasing the weight of the positive cases, our model is more likely to find true positives</li>
<li>and by decreasing the weight of the positive cases, our model is more likely to find true negatives</li>
<li>In Model 2, increase the weight of positives by a factor of 9 to balance the positives and negatives</li>
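<p>As a sketch of how weights can be derived from the data instead of being typed in by hand, sklearn's compute_class_weight helper with the "balanced" option weights each class inversely to its frequency. The toy target below is made up, with nine negatives for every positive.</p>

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target: nine negatives for every positive (not the loan data)
y_toy = np.array([0] * 9 + [1])

# "balanced" weights are n_samples / (n_classes * count of each class)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)
ratio = weights[1] / weights[0]

print(weights, ratio)
```

<p>With this toy target the positive class ends up weighted 9 times as heavily as the negative class, which is the same ratio as the explicit {0:1, 1:9} dictionary.</p>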

<h3 style="color:green">Build model 2 and report metrics</h3>

In [19]:
model_2 = SGDClassifier(loss="log_loss",random_state=42, max_iter=1000, class_weight={0:1, 1:9})
model_2.fit(x_train,y_train) 

print(model_2.score(x_train,y_train))
print(model_2.score(x_test,y_test))

0.5627401318450215
0.5625033175858591


In [20]:
cfm = confusion_matrix(y_test,model_2.predict(x_test))
accuracy_training = model_2.score(x_train,y_train)
accuracy_testing = model_2.score(x_test,y_test)
precision = precision_score(y_test,model_2.predict(x_test))
recall = recall_score(y_test,model_2.predict(x_test))
f1 = f1_score(y_test,model_2.predict(x_test))
auc = roc_auc_score(y_test,model_2.predict_proba(x_test)[:,1])
ap = average_precision_score(y_test,model_2.predict_proba(x_test)[:,1])

print("Confusion Matrix: \n",cfm)
print("Training accuracy: ",accuracy_training)
print("Testing  accuracy: ",accuracy_testing)
print("Precision: ",precision)
print("Recall: ",recall)
print("F1-Score: ",f1)
print("AUC: ",auc)
print("Average Precision: ",ap)

Confusion Matrix: 
 [[81037 68912]
 [ 5266 14336]]
Training accuracy:  0.5627401318450215
Testing  accuracy:  0.5625033175858591
Precision:  0.17220834134153373
Recall:  0.7313539434751556
F1-Score:  0.27877491492464757
AUC:  0.6940091593866806
Average Precision:  0.22986705851283923


<h3 style="color:green;">Interpret the results</h3>


<h4>Interpretation</h4>
<li> Adjusting the class weights makes the model more sensitive to bad loans, resulting in a much higher recall (0.73), i.e. the model is better at catching bad loans. </li>
<li> The higher recall comes at the cost of misclassifying many good loans as bad ones (68912 false positives), which keeps precision low (0.17) and lowers overall accuracy from 0.88 to 0.56. </li>
<li> The F1 score improves to 0.2788, but it still indicates difficulty in achieving a good balance between precision and recall. </li>
<li> The AUC remains approximately the same, i.e. even though precision and recall have changed, the general ranking of instances hasn't. </li>

<h3 style="color:green;">Update results_df</h3>

In [21]:
results_df.loc[2] = [accuracy_testing,precision,recall,f1,auc,ap]

results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.884383,0.0,0.0,0.0,0.692962,0.227769
2,0.562503,0.172208,0.731354,0.278775,0.694009,0.229867
3,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 3</h1>

<h3 style="color:green;">Tune hyperparameters using grid search</h3>
<li><span style="color:blue">parameters</span> versus <span style="color:blue">hyperparameters</span></li>
<ul>
    <li><span style="color:blue">parameters</span>: the parameters that are necessary for the model to make predictions. For example, the coefficients of the linear equation estimated by the SGD classifier are parameters of the model. Parameters are estimated by the algorithm and from the data</li>
    <li><span style="color:blue">hyperparameters</span>: parameters that are external to the model and cannot be estimated from the data. For example, in an SGD classifier, settings like the loss function, the regularization parameter, stopping rules, etc. are hyperparameters</li>
    </ul>
<li>In ML, hyperparameters are often set intuitively and then <span style="color:red">tuned</span> using a grid search</li>
<li>In a grid search, various combinations of hyperparameters are tried and <span style="color:blue">k-fold cross validation</span> is used to test the efficacy of the parameter combination</li>
<li>the best combination is then selected as a candidate model</li>
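<p>To make the k-fold idea concrete, the sketch below splits a made-up imbalanced target into 5 folds. It uses StratifiedKFold, which is what GridSearchCV falls back to for a classifier when cv is given as an integer; each fold keeps the same share of positives.</p>

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 15 negatives and 5 positives (not the loan data)
y_toy = np.array([0] * 15 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not matter for the split

folds = StratifiedKFold(n_splits=5)

# Each fold is held out once for validation while the other four are used for fitting
positives_per_fold = [int(y_toy[test_idx].sum()) for _, test_idx in folds.split(X_toy, y_toy)]
fold_sizes = [len(test_idx) for _, test_idx in folds.split(X_toy, y_toy)]

print(positives_per_fold, fold_sizes)
```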

<h3 style="color:green;">The <span style="color:blue">scoring</span> parameter</h3>
<li>since our data is imbalanced, we should look for the model with the best f1 score (precision/recall tradeoff)</li>
<li>set the scoring parameter for GridSearchCV so that it maximizes the f1 score</li>
<li>Though we should be using a much wider range of parameters, I've reduced them so that it runs fairly quickly</li>
<li>This takes about 30 seconds on my machine. Could take longer on your machine</li>

In [60]:
%%time
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
#Set up the hyperparameter options in param_grid
param_grid = {
    'alpha': [0.0001, 0.001, 0.01, 1, 10, 100],
    'loss': ["hinge", "log_loss", "modified_huber"],
    'penalty': ["l2", "l1", "elasticnet"],
    'class_weight': [None, {1:3}, {1:5}, {1:7}, {1:9}]
}

#Do the search
grid_search = GridSearchCV(SGDClassifier(random_state=42, max_iter=1000), param_grid, cv=5, n_jobs=-1, verbose=3, scoring='f1', return_train_score=True)
grid_search.fit(x_train, y_train)

Fitting 5 folds for each of 270 candidates, totalling 1350 fits
CPU times: total: 7.81 s
Wall time: 1min 6s
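<p>The "270 candidates, totalling 1350 fits" message follows directly from the size of the grid. As a quick check, sklearn's ParameterGrid can enumerate the same combinations without fitting anything:</p>

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
param_grid = {
    'alpha': [0.0001, 0.001, 0.01, 1, 10, 100],
    'loss': ["hinge", "log_loss", "modified_huber"],
    'penalty': ["l2", "l1", "elasticnet"],
    'class_weight': [None, {1: 3}, {1: 5}, {1: 7}, {1: 9}],
}

n_candidates = len(ParameterGrid(param_grid))  # 6 * 3 * 3 * 5 combinations
n_fits = n_candidates * 5                      # one fit per fold with cv=5

print(n_candidates, n_fits)
```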


<h3 style="color:green;">Get the best model parameters</h3>


In [61]:
grid_search.best_score_, grid_search.best_params_

(0.2963659026877409,
 {'alpha': 0.001,
  'class_weight': {1: 5},
  'loss': 'modified_huber',
  'penalty': 'l1'})

<h3 style="color:green;">Run the best model and report metrics</h3>
<li>Run the classifier using the best parameters</li>






In [62]:
model_3 = SGDClassifier(**grid_search.best_params_,random_state=42, max_iter=1000)
model_3.fit(x_train,y_train)

cfm = confusion_matrix(y_test,model_3.predict(x_test))
accuracy_training = model_3.score(x_train,y_train)
accuracy_testing = model_3.score(x_test,y_test)
precision = precision_score(y_test,model_3.predict(x_test))
recall = recall_score(y_test,model_3.predict(x_test))
f1 = f1_score(y_test,model_3.predict(x_test))
auc = roc_auc_score(y_test,model_3.predict_proba(x_test)[:,1])
ap = average_precision_score(y_test,model_3.predict_proba(x_test)[:,1])


print("Confusion Matrix: \n",cfm)
print("Training accuracy: ",accuracy_training)
print("Testing  accuracy: ",accuracy_testing)
print("Precision: ",precision)
print("Recall: ",recall)
print("F1-Score: ",f1)
print("AUC: ",auc)
print("Average Precision: ",ap)


Confusion Matrix: 
 [[122451  27498]
 [ 11445   8157]]
Training accuracy:  0.7714298713904392
Testing  accuracy:  0.7703168958012634
Precision:  0.22877576777450567
Recall:  0.4161310070400979
F1-Score:  0.2952386123025137
AUC:  0.6910679332023801
Average Precision:  0.22891183309360005


<h3 style="color:green;">Interpret the results</h3>


<h4>Interpretation</h4>
<li> The model correctly predicted the class labels about 77\% of the time for both the training and testing datasets, i.e. the model is not overfitting. </li>
<li> The precision of 22.88\% indicates that when the model predicts a loan to be bad, it is correct only about 23\% of the time. </li>
<li> The recall of 41.61\% means that the model identifies only about 42\% of the actual bad loans. </li>
<li> Even though the grid search optimized the f1 score, the resulting value of 0.295 shows that the model still doesn't balance precision and recall well. There is still room for improvement.</li>
<li> The AUC of 0.691 suggests the model has some ability to distinguish between good and bad loans, essentially unchanged from Models 1 and 2. </li>

<h3 style="color:green;">Update results_df</h3>

In [63]:
results_df.loc[3] = [accuracy_testing,precision,recall,f1,auc,ap]

results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.884383,0.0,0.0,0.0,0.692962,0.227769
2,0.562503,0.172208,0.731354,0.278775,0.694009,0.229867
3,0.770317,0.228776,0.416131,0.295239,0.691068,0.228912
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 4</h1>

<h3 style="color:green;">Random Forest Classifier</h3>
<li>We need to improve recall and precision so perhaps a non-linear classifier will help</li>

<h3 style="color:green;">Build, fit, and report metrics</h3>

<li>Run this with the following parameters (these are our base parameters)</li>
<li>random_state=42,n_estimators=30,max_depth=6,min_samples_leaf=2000,min_samples_split=4000,class_weight={1:5}</li>


In [64]:
from sklearn.ensemble import RandomForestClassifier
# Note: min_samples_leaf=500 is used here, whereas the base parameters listed above specify 2000
model_4 = RandomForestClassifier(random_state=42,n_estimators=30,max_depth=6,min_samples_leaf=500,min_samples_split=4000,class_weight={1:5})
model_4.fit(x_train,y_train)


In [65]:
cfm = confusion_matrix(y_test,model_4.predict(x_test))
accuracy_training = model_4.score(x_train,y_train)
accuracy_testing = model_4.score(x_test,y_test)
precision = precision_score(y_test,model_4.predict(x_test))
recall = recall_score(y_test,model_4.predict(x_test))
f1 = f1_score(y_test,model_4.predict(x_test))
auc = roc_auc_score(y_test,model_4.predict_proba(x_test)[:,1])
ap = average_precision_score(y_test,model_4.predict_proba(x_test)[:,1])

print("Confusion Matrix: \n",cfm)
print("Training accuracy: ",accuracy_training)
print("Testing  accuracy: ",accuracy_testing)
print("Precision: ",precision)
print("Recall: ",recall)
print("F1-Score: ",f1)
print("AUC: ",auc)
print("Average Precision: ",ap)

Confusion Matrix: 
 [[128839  21110]
 [ 12625   6977]]
Training accuracy:  0.8021414705168648
Testing  accuracy:  0.8010333174089213
Precision:  0.24840673621248263
Recall:  0.35593306805428016
F1-Score:  0.29260416448237536
AUC:  0.6945967053414696
Average Precision:  0.22876096340572005


<h3 style="color:green;">Interpreting model 4 results</h3>
<p></p>

<h4>Interpretation</h4>
<li> Out of all the predictions the model made, around 80% were correct for both the training and testing sets.</li>
<li> Out of all the loans that the model predicted as bad, only around 24.84% were actually bad. This means there's a relatively high chance that a loan flagged by the model as bad is actually a good loan. </li>
<li> The model correctly identified about 35.59% of all the actual bad loans, i.e. it missed a large share (around 64.41%) of them. </li>
<li> An F1-score of 29.26% indicates that the model doesn't achieve a good balance between precision and recall. There may still be room for improvement. </li>
<li> An AUC value of 0.6946 indicates that the model has some ability to distinguish between good and bad loans, better than random guessing. </li>
<li> The AP value of about 0.229 suggests that the model's precision, averaged over the recall levels reached at different thresholds, is not very high. </li>

<h3 style="color:green;">Update results_df</h3>

In [66]:
results_df.loc[4] = [accuracy_testing,precision,recall,f1,auc,ap]

results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.884383,0.0,0.0,0.0,0.692962,0.227769
2,0.562503,0.172208,0.731354,0.278775,0.694009,0.229867
3,0.770317,0.228776,0.416131,0.295239,0.691068,0.228912
4,0.801033,0.248407,0.355933,0.292604,0.694597,0.228761
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 5</h1>

<h3 style="color:green;">Random Forest Grid Search</h3>
<p></p>


<li>Run a grid search over random forest hyperparameters to find the best model</li>
<li>Note that this will take a while, possibly a couple of hours on a slower machine (about 25 minutes on my laptop). Let it run. Get some coffee or whatever beverage you like. Then come back in a while to check out the results!</li>
<li>If you want to speed it up, remove one of the n_estimators options; dropping 800 saves the most time (n_estimators is the number of trees generated and is the single most expensive part of the grid search)</li>


In [67]:
%%time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import average_precision_score

# Hyperparameter grid: 2 x 2 x 2 x 2 = 16 combinations, each fit 5 times (cv=5)
parameters = {
    'n_estimators': (500, 800),        # the number of trees
    'min_samples_split': (100, 200),   # minimum samples needed to split a node
    'class_weight': [{1: 4}, {1: 6}],  # weight on class 1 (bad loans)
    'min_samples_leaf': (10, 20),      # minimum samples required at a leaf
}
gs_clf = GridSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_jobs=-1,
                      scoring='f1')
gs_clf.fit(x_train, np.ravel(y_train))


CPU times: total: 41 s
Wall time: 17min 59s
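
To get a feel for why this search is slow, here is a rough sketch that counts the forest fits and the total number of trees grown during cross-validation for the grid above. The helper `search_cost` is made up for illustration, the count ignores the final refit on the full training set, and tree count is only a rough proxy for run time.

```python
from itertools import product

# Same grid as the search above
parameters = {
    "n_estimators": (500, 800),
    "min_samples_split": (100, 200),
    "class_weight": ({1: 4}, {1: 6}),
    "min_samples_leaf": (10, 20),
}
cv_folds = 5

def search_cost(grid, folds):
    """Return (number of forest fits, total trees grown) during cross-validation."""
    combos = list(product(*grid.values()))
    n_estimators_index = list(grid).index("n_estimators")
    trees = sum(combo[n_estimators_index] for combo in combos) * folds
    return len(combos) * folds, trees

fits, trees = search_cost(parameters, cv_folds)
print("full grid:", fits, "fits,", trees, "trees")

# Cost of the two possible reduced grids
only_500 = search_cost({**parameters, "n_estimators": (500,)}, cv_folds)
only_800 = search_cost({**parameters, "n_estimators": (800,)}, cv_folds)
print("n_estimators=500 only:", only_500)
print("n_estimators=800 only:", only_800)
```

Either reduced grid halves the number of fits, but keeping only the 500-tree option grows noticeably fewer trees than keeping only the 800-tree option.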


<h3 style="color:green;">Get the best model parameters</h3>


In [68]:
gs_clf.best_score_, gs_clf.best_params_

(0.331162244496855,
 {'class_weight': {1: 6},
  'min_samples_leaf': 10,
  'min_samples_split': 100,
  'n_estimators': 800})

<h3 style="color:green;">Run the best model and get metrics</h3>


In [70]:
model_5 = RandomForestClassifier(**gs_clf.best_params_,random_state=42)
model_5.fit(x_train,np.ravel(y_train))

In [71]:
#Predict once and reuse the results for every metric
y_pred = model_5.predict(x_test)
y_score = model_5.predict_proba(x_test)[:,1]

cfm = confusion_matrix(y_test,y_pred)
accuracy_training = model_5.score(x_train,y_train)
accuracy_testing = model_5.score(x_test,y_test)
precision = precision_score(y_test,y_pred)
recall = recall_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)
auc = roc_auc_score(y_test,y_score)
ap = average_precision_score(y_test,y_score)

print("Training accuracy: ",accuracy_training)
print("Testing  accuracy: ",accuracy_testing)
print("confusion matrix:")
print(cfm)

print("precision: ",precision)
print("recall: ",recall)
print("f1 score: ",f1)
print("auc",auc)
print("ap",ap)

Training accuracy:  0.7819248968696918
Testing  accuracy:  0.7627675448685056
confusion matrix:
[[119522  30427]
 [  9796   9806]]
precision:  0.24373027117043222
recall:  0.5002550760126517
f1 score:  0.32776802874571737
auc 0.7309289440285697
ap 0.2556894711994775


<h3 style="color:green;">Interpreting model 5 results</h3>

<p>
    </p>
<h4>Interpretation</h4>
<li> The precision of 24.37% indicates that out of all loans predicted as bad, only 24.37% were actually bad. This means that roughly three out of four loans the model flagged as bad were actually good.</li>
<li> The recall of 50.03% indicates that the model correctly identified 50.03% of the actual bad loans, so about half of the bad loans were missed by the model.</li>
<li> An F1 score of 32.78% indicates that the model has room for improvement, especially given that neither precision nor recall is particularly high.</li>
<li> The AUC (Area Under the ROC Curve) of 73.09% indicates a reasonable ability of the model to differentiate between good and bad loans (50% is random guessing). </li>
<li> The AP of 25.57% summarizes the precision-recall curve as a recall-weighted average of precision over all thresholds. It is low in absolute terms, but more than twice the roughly 11.6% share of bad loans in the test set. Given the imbalance of classes (more good loans than bad), AP is often a more informative metric than AUC.</li>
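
One way to read the AUC: it is the probability that a randomly chosen bad loan gets a higher predicted score than a randomly chosen good loan. The sketch below computes it that way on a handful of made-up labels and scores; the function `pairwise_auc` is illustrative only and far too slow for the full test set, where `roc_auc_score` should be used instead.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (bad, good) pairs where the bad loan scores higher.

    Ties count as half a win. Runs in O(n_pos * n_neg) time.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = bad loan) and predicted probabilities of being bad
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.60]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 8 of the 9 (bad, good) pairs are ordered correctly
```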

<h3 style="color:green;">Update results df</h3>


In [72]:
results_df.loc[5] = [accuracy_testing,precision,recall,f1,auc,ap]

results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.884383,0.0,0.0,0.0,0.692962,0.227769
2,0.562503,0.172208,0.731354,0.278775,0.694009,0.229867
3,0.770317,0.228776,0.416131,0.295239,0.691068,0.228912
4,0.801033,0.248407,0.355933,0.292604,0.694597,0.228761
5,0.762768,0.24373,0.500255,0.327768,0.730929,0.255689
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 6</h1>

<li>Gradient Boosting Classifier</li>
<li>Grid search on GBC can take several days so let's just skip to the best models (I ran a 2-day reduced version)!</li>
<li>Sklearn's GradientBoostingClassifier has no class_weight parameter, so we pass a sample weight vector to fit() to correct for imbalances in the data</li>
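
Here is a minimal sketch of what a sample weight vector does, using made-up labels (`y_toy` and `toy_weight` are illustrative names, not part of the assignment data). Weighting each bad loan 4x raises the share of total training weight that the bad loans carry.

```python
import numpy as np

# Made-up training labels (1 = bad loan); the real y_train is much larger
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# Weight every bad loan 4x and every good loan 1x
toy_weight = np.where(np.ravel(y_toy) == 1, 4, 1)

# Share of total training weight carried by bad loans, before and after
share_before = y_toy.mean()
share_after = toy_weight[y_toy == 1].sum() / toy_weight.sum()

print(toy_weight)
print(share_before, share_after)
```

In this toy sample the bad loans go from 25% of the cases to about 57% of the total weight, which is what pushes the model to pay more attention to them.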


In [73]:
from sklearn.ensemble import GradientBoostingClassifier

#sample_weight is a vector that gives the weight of each
#case in the training sample (4 for bad loans, 1 for good loans)
#If you're interested, try values from 1 to 10 instead of 4
sample_weight = np.where(np.ravel(y_train) == 1, 4, 1)


model_6 = GradientBoostingClassifier(min_samples_split=100,
                                     max_depth=8,
                                     min_samples_leaf=100,
                                     n_estimators=400,
                                     subsample=0.6)


model_6.fit(x_train,y_train,sample_weight=sample_weight)


In [74]:
#Calculate and print metrics
#Predict once and reuse the results for every metric
y_pred = model_6.predict(x_test)
y_score = model_6.predict_proba(x_test)[:,1]

cfm = confusion_matrix(y_test,y_pred)
accuracy_training = model_6.score(x_train,y_train)
accuracy_testing = model_6.score(x_test,y_test)
precision = precision_score(y_test,y_pred)
recall = recall_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)
auc = roc_auc_score(y_test,y_score)
ap = average_precision_score(y_test,y_score)

print("Training accuracy: ",accuracy_training)
print("Testing  accuracy: ",accuracy_testing)
print("confusion matrix:")
print(cfm)

print("precision: ",precision)
print("recall: ",recall)
print("f1 score: ",f1)
print("auc",auc)
print("ap",ap)

Training accuracy:  0.8167086265469546
Testing  accuracy:  0.8015641311463807
confusion matrix:
[[127803  22146]
 [ 11499   8103]]
precision:  0.2678766240206288
recall:  0.41337618610345883
f1 score:  0.32508876451826446
auc 0.7486911451332081
ap 0.2592452251286931


<h3 style="color:green;">Interpreting model 6 results</h3>

<p>
    </p>
<h4>Interpretation</h4>
<li> The precision of 26.79% indicates that out of all loans predicted as bad, only 26.79% were actually bad. This means that many of the loans the model predicts as bad are in fact good. </li>
<li> The recall of 41.34% means that the model correctly identified 41.34% of the actual bad loans, so nearly 59% of the bad loans were missed by the model. </li>
<li> An F1 score of 32.51% suggests that the model needs improvement, especially since neither precision nor recall is particularly high. </li>
<li> The AP of 25.92% summarizes the precision-recall curve as a recall-weighted average of precision over all thresholds. It is low in absolute terms, but more than twice the roughly 11.6% share of bad loans in the test set. </li>

<h3 style="color:green;">Update results df</h3>


In [75]:
results_df.loc[6] = [accuracy_testing,precision,recall,f1,auc,ap]

results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.884383,0.0,0.0,0.0,0.692962,0.227769
2,0.562503,0.172208,0.731354,0.278775,0.694009,0.229867
3,0.770317,0.228776,0.416131,0.295239,0.691068,0.228912
4,0.801033,0.248407,0.355933,0.292604,0.694597,0.228761
5,0.762768,0.24373,0.500255,0.327768,0.730929,0.255689
6,0.801564,0.267877,0.413376,0.325089,0.748691,0.259245
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 7</h1>

<li>Same parameters but up the sample weight to 5</li>

In [76]:
from sklearn.ensemble import GradientBoostingClassifier

#sample_weight is a vector that gives the weight of each
#case in the training sample (5 for bad loans, 1 for good loans)
#If you're interested, try values from 1 to 10 instead of 5
sample_weight = np.where(np.ravel(y_train) == 1, 5, 1)


model_7 = GradientBoostingClassifier(min_samples_split=100,
                                     max_depth=8,
                                     min_samples_leaf=100,
                                     n_estimators=400,
                                     subsample=0.6)
model_7.fit(x_train,y_train,sample_weight=sample_weight)

In [78]:
#Calculate and print metrics
#Predict once and reuse the results for every metric
y_pred = model_7.predict(x_test)
y_score = model_7.predict_proba(x_test)[:,1]

cfm = confusion_matrix(y_test,y_pred)
accuracy_training = model_7.score(x_train,y_train)
accuracy_testing = model_7.score(x_test,y_test)
precision = precision_score(y_test,y_pred)
recall = recall_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)
auc = roc_auc_score(y_test,y_score)
ap = average_precision_score(y_test,y_score)

print("Training accuracy: ",accuracy_training)
print("Testing  accuracy: ",accuracy_testing)
print("confusion matrix:")
print(cfm)

print("precision: ",precision)
print("recall: ",recall)
print("f1 score: ",f1)
print("auc",auc)
print("ap",ap)

Training accuracy:  0.7720339925584405
Testing  accuracy:  0.7560262104027696
confusion matrix:
[[117730  32219]
 [  9147  10455]]
precision:  0.24499695364859164
recall:  0.5333639424548515
f1 score:  0.33576337593936667
auc 0.7483617961719404
ap 0.25926125785167986


<h3 style="color:green;">Interpreting model 7 results</h3>

<p>
    </p>
<h4>Interpretation</h4>
<li> For the loans in the testing set, the model made correct predictions for about 75.60% of them. </li>
<li> A precision of 24.50% means that out of all loans the model flagged as bad, only 24.50% were truly bad. This metric is particularly important when the cost of a false positive is high. </li>
<li> A recall of 53.34% implies that the model caught a little more than half of the actual bad loans, leaving almost half undetected. This metric is crucial when the cost of a false negative is significant. </li>
<li> An F1 score of 33.58% suggests there's room for improvement in the model's ability to balance precision and recall. </li>
<li> An AUC of 74.84% suggests that the model has a reasonable ability to differentiate between good and bad loans (50% is random guessing, 100% is perfect). </li>
<li> The AP of 25.93% is the recall-weighted average of precision over all thresholds, not a precision the model holds at every recall level. For reference, a random ranking would score roughly 11.6%, the share of bad loans in the test set. </li>
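
To make the AP number more concrete, here is a small sketch of how average precision is built up: walk down the loans from highest to lowest score, and each time a truly bad loan is found, add the precision at that rank weighted by the resulting increase in recall. The function below is a simplified illustration that assumes no tied scores; `average_precision_score` remains the function to use on real data.

```python
def average_precision(y_true, scores):
    """AP = sum over ranks of (increase in recall) * (precision at that rank).

    Simplified: assumes all scores are distinct.
    """
    ranked = sorted(zip(scores, y_true), reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # recall rises by 1/n_pos at each true positive
            total += (hits / rank) / n_pos
    return total

# Made-up labels (1 = bad loan) and predicted probabilities of being bad
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_ap = average_precision(toy_labels, toy_scores)
print(round(toy_ap, 4))  # (1/1 + 2/3) / 2
```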

<h3 style="color:green;">Update results df</h3>


In [79]:
results_df.loc[7] = [accuracy_testing,precision,recall,f1,auc,ap]

results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.884383,0.0,0.0,0.0,0.692962,0.227769
2,0.562503,0.172208,0.731354,0.278775,0.694009,0.229867
3,0.770317,0.228776,0.416131,0.295239,0.691068,0.228912
4,0.801033,0.248407,0.355933,0.292604,0.694597,0.228761
5,0.762768,0.24373,0.500255,0.327768,0.730929,0.255689
6,0.801564,0.267877,0.413376,0.325089,0.748691,0.259245
7,0.756026,0.244997,0.533364,0.335763,0.748362,0.259261


<h3 style="color:red;font-size:xx-large">Model comparison</h3>
<li>Draw a graph that shows the changes to accuracy, precision, recall, and f1 score</li>
<li>The x-axis contains the seven models you have created</li>
<li>Use bokeh for the charts</li>

In [80]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, LabelSet, HoverTool
output_notebook()

In [82]:
#CHART 
source = ColumnDataSource(results_df)

p1 = figure(width=600, height=400, title="Model Performance Metrics", x_axis_label='Model', y_axis_label='Score', x_minor_ticks=2, y_range=(0, 1), toolbar_location=None)
p1.line(x='Model', y='accuracy', color="green", legend_label="Accuracy", line_width=3, source=source)
p1.line(x='Model', y='precision', color="blue", legend_label="Precision", line_width=3, source=source)
p1.line(x='Model', y='recall', color="red", legend_label="Recall", line_width=3, source=source)
p1.line(x='Model', y='f1_score', color="purple", legend_label="F1-Score", line_width=3, source=source)

p1.legend.location = "bottom_right"
p1.legend.click_policy="hide"

hover = HoverTool()
hover.tooltips = [("Accuracy", "@accuracy"), ("Precision", "@precision"), ("Recall", "@recall"), ("F1-Score", "@f1_score")]
p1.add_tools(hover)

show(p1)

<h3 style="color:green;">Interpret the chart</h3>
<li>What can you say about the changes in precision and recall?</li>

- Model 1: The initial model has zero precision and recall. This likely means that the model is predicting all instances as the negative class, which suggests it is not capturing any of the nuances of the data.
- From Model 1 to Model 2: There's a significant increase in both precision and recall. However, the trade-off between precision and recall is evident here. The recall is at its peak, suggesting that the model is catching most of the positive instances. But, the precision is low, meaning among those predicted as positive, only a small fraction is actually positive.
- From Model 2 to Model 3: Precision increases while recall decreases, indicating a shift towards making more accurate positive predictions at the cost of missing some actual positive instances.
- From Model 3 to Model 4: Precision continues to rise slightly, but recall drops further. The model is becoming more conservative, ensuring its positive predictions are more reliable but catching fewer positive instances overall.
- From Model 4 to Model 5: Precision drops slightly, but recall increases significantly. The model is attempting to identify more positive cases, but in doing so, it's making some incorrect positive predictions.
- From Model 5 to Model 6: Precision improves (24.4% to 26.8%), but recall drops noticeably (50.0% to 41.3%). The model flags fewer loans as bad, so its flags are slightly more reliable but it misses more bad loans.
- From Model 6 to Model 7: Precision drops a tad, but recall sees a significant boost. The model is again leaning towards capturing more positive instances, even if it makes some mistakes in the process.

There's a constant trade-off between precision and recall across the models. When one metric increases, the other tends to decrease. This is a classic challenge in machine learning, especially in imbalanced datasets. From the data, we can see that the F1-score is highest for Model 7, suggesting it might be the best compromise between precision and recall among the presented models.
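
The F1 comparison can be checked directly from the precision and recall columns of results_df, since F1 is the harmonic mean of the two. The sketch below copies the values from the table above; `f1_from` is a small helper written for this check.

```python
# (precision, recall) per model, copied from results_df above
pr = {
    1: (0.0, 0.0),
    2: (0.172208, 0.731354),
    3: (0.228776, 0.416131),
    4: (0.248407, 0.355933),
    5: (0.243730, 0.500255),
    6: (0.267877, 0.413376),
    7: (0.244997, 0.533364),
}

def f1_from(p, r):
    """Harmonic mean of precision and recall; taken as 0 when both are 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

f1_by_model = {m: f1_from(p, r) for m, (p, r) in pr.items()}
best_model = max(f1_by_model, key=f1_by_model.get)

for m, score in f1_by_model.items():
    print(m, round(score, 4))
print("best F1:", best_model)
```

Because the harmonic mean is pulled toward the smaller of the two values, the low precision (around 25%) caps the F1 score for every model, no matter how high recall goes.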

<h3 style="color:green;">Chart AUC and AP</h3>


In [84]:
#CHART
source = ColumnDataSource(results_df)

p1 = figure(width=600, height=400, title="Model Performance Metrics", x_axis_label='Model', y_axis_label='Score', x_minor_ticks=2, y_range=(0, 1), toolbar_location=None)
p1.line(x='Model', y='AUC', color="green", legend_label="AUC", line_width=3, source=source)
p1.line(x='Model', y='AP', color="blue", legend_label="AP", line_width=3, source=source)

p1.legend.location = "bottom_right"
p1.legend.click_policy="hide"

hover = HoverTool()
hover.tooltips = [("AUC", "@AUC"), ("AP", "@AP")]
p1.add_tools(hover)

show(p1)

<h3 style="color:green;">Interpret the AUC/AP chart</h3>
<li>The AUC on the first 4 models is pretty much the same. What does that mean?</li>
<ul><li>The first four models' discriminatory power is nearly the same, i.e. the models are equally adept at ranking a randomly chosen bad loan (positive instance) higher than a randomly chosen good loan (negative instance).</li>
<li>It's possible that the features and the SGDClassifier model have reached a saturation point. Beyond this point, they might not provide additional discriminatory power in predicting bad loans.</li>
<li>The consistent AUC values also suggest that the models rank loans in much the same way: the changes between them mostly shift the operating point (trading precision for recall) rather than improve the ranking itself. On their own, they say little about how predictions would react to changes in incoming data in a production setting.</li>
    </ul>
<li>The average precision improves steadily but almost entirely by getting better at recall than at precision. What does that mean?</li>
<ul>
<li>The increase in recall indicates that the models are getting better at correctly identifying bad loans from the total actual bad loans. This can be particularly crucial for lending institutions, as failing to identify a bad loan (false negative) can result in significant financial losses.</li>
<li>AP stays at about 0.23 for the first four models and rises to about 0.26 for the tree-based models 5 to 7. This indicates that, overall, the model's precision is not deteriorating rapidly as recall increases. This is a good sign, especially if the costs associated with false positives (wrongly predicted bad loans) are lower than the costs of false negatives (missed bad loans).</li></ul>

<li>Finally, what can you do to get better results? </li>
<ul>
    <li>Instead of using a default 0.5 threshold, optimize the threshold based on a desired balance between precision and recall or based on a cost matrix.</li>
    <li>Calibrate the probabilities using techniques like Platt Scaling or Isotonic Regression to ensure they are well-calibrated.</li>
    <li>Use techniques like recursive feature elimination, feature importance from tree-based models, or correlation analysis to select relevant features.</li></ul>
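
As an illustration of the first suggestion, here is a minimal sketch of a threshold sweep that picks the cutoff with the best F1 score. The labels and scores are synthetic stand-ins (an assumption, not the assignment data) for y_test and predict_proba(x_test)[:,1]; in practice the threshold should be chosen on a validation set and then confirmed on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: about 12% bad loans, with bad loans scoring
# somewhat higher on average than good loans
y_true = (rng.random(5000) < 0.12).astype(int)
scores = np.clip(0.25 + 0.15 * y_true + rng.normal(0, 0.15, size=5000), 0, 1)

def f1_at(threshold):
    """F1 for class 1 when loans with score >= threshold are flagged as bad."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

thresholds = np.arange(1, 20) / 20   # 0.05, 0.10, ..., 0.95
f1_scores = np.array([f1_at(t) for t in thresholds])
best_threshold = thresholds[f1_scores.argmax()]

print("F1 at the default 0.5 threshold:", f1_at(0.5))
print("best threshold:", best_threshold, "with F1:", f1_scores.max())
```

The same loop works with a cost-based objective: replace `f1_at` with a function that returns, say, the total cost of false positives and false negatives, and pick the threshold that minimizes it.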
    