# Credit Card Application

Banks and financial institutions often need to predict customer behavior, such as the likelihood of a customer accepting a loan offer, to target their marketing efforts effectively. This helps in increasing the acceptance rate of their offers while reducing marketing costs.

Our objective is to build a model that forecasts the propensity (probability) of customers responding to a personal loan campaign, using logistic regression. The model's predicted probabilities will be used to classify the outcomes and to identify the factors that influence a customer's response. The goal is a model that identifies the clients most likely to accept the loan offer in upcoming personal loan campaigns.
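
As a rough illustration of what "propensity" means here, logistic regression passes a weighted sum of the features through the sigmoid function to obtain a probability between 0 and 1. The intercept, weights, and feature values below are made up for illustration and are not fitted to the campaign data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical intercept, weights, and one customer's (scaled) features.
intercept = -1.0
weights = [2.0, 0.5]
features = [0.8, 0.2]

# Linear score first, then the sigmoid turns it into a probability.
score = intercept + sum(w * x for w, x in zip(weights, features))
propensity = sigmoid(score)
print(f"linear score = {score:.2f}, propensity = {propensity:.3f}")
```

A customer would then be classified as a likely responder when the propensity exceeds a chosen threshold, commonly 0.5.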

We import necessary libraries to handle data manipulation, visualization, and model building. These libraries provide functions to simplify complex operations, ensuring efficient data processing and analysis.

# Grading Scheme:

1. **Importing Libraries and Data (10 points)**
   - Correctly import all required libraries; remove any unnecessary libraries: 5 points
   - Correctly read the dataset: 5 points

2. **Data Exploration (20 points)**
   - Correctly display dimensions, first and last entries: 10 points
   - Correctly display descriptive statistics: 10 points

3. **Handling Missing Values (20 points)**
   - Correctly impute missing numeric values: 10 points
   - Correctly impute missing non-numeric values: 10 points

4. **Data Pre-processing (20 points)**
   - Correctly encode non-numeric data: 10 points
   - Correctly plot histograms and heatmap: 10 points

5. **Model Building (30 points)**
   - Correctly split data into train and test sets: 10 points
   - Correctly scale the data: 10 points
   - Correctly build and fit the logistic regression model: 10 points

6. **Model Evaluation (30 points)**
   - Correctly calculate and display confusion matrix: 10 points
   - Correctly calculate and plot the ROC curve: 20 points
 
7. **Answering Red Questions (20 points)**
   - Correctly answer all subjective questions in red: 20 points


Total: 150 points

## 1) Importing Libraries

In [1]:
# KEEP ONLY THE REQUIRED LIBRARIES; REMOVE OTHERS

# Data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Splitting, scaling, and the two classifiers used in this notebook
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

## <span style="color:red">*Q1. Why are only the required libraries kept in the code?*</span>

In [None]:
# Only the required libraries are kept in the code because it helps keep the code cleaner. Furthermore, removing unnecessary 
# libraries decreases the chances of version conflict and bugs that may pop up. Lastly, keeping only the necessary libraries
# in the code may help the file load faster, and even reduce a device's memory usage.

## 2) Importing and Descriptive Stats

To market their loan products to people who already have deposit accounts, BankABC wants to create a direct marketing channel. To cross-sell personal loans to its current clients, the bank ran a test campaign. An enticing personal loan offer and processing charge waiver were aimed at a random group of 20000 clients. The targeted clients' information has been provided, together with information on how they responded to the marketing offer.

In [2]:
# READ DATA
# <Your code here>  
Approval = pd.read_csv("Approval")

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,Industry,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,1.0,30.83,0.0,1.0,1.0,Industrials,White,1.25,1.0,1.0,1.0,0.0,ByBirth,202.0,0.0,1.0
1,0.0,58.67,4.46,1.0,1.0,Materials,Black,3.04,1.0,1.0,6.0,0.0,ByBirth,43.0,560.0,1.0
2,0.0,24.5,0.5,1.0,1.0,Materials,Black,1.5,1.0,0.0,0.0,0.0,ByBirth,280.0,824.0,1.0
3,1.0,27.83,1.54,1.0,1.0,Industrials,White,3.75,1.0,1.0,5.0,1.0,ByBirth,100.0,3.0,1.0
4,1.0,20.17,5.625,1.0,1.0,Industrials,White,1.71,1.0,0.0,0.0,0.0,ByOtherMeans,120.0,0.0,1.0


**Instructions:**
1. Get the dimensions of the array and print them.
2. Verify if the correct dataset was imported by checking the first 15 entries.
3. Verify by checking the last five entries.
4. Display descriptive statistics of the dataset.

In [2]:
# GETTING THE DIMENSIONS OF THE ARRAY
# <Your code here>

Approval.shape


In [4]:
# VERIFYING IF WE IMPORTED THE RIGHT DATASET BY CHECKING THE FIRST 15 ENTRIES OF THE DATA
# <Your code here>

Approval.head(15)

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,Industry,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,1.0,30.83,0.0,1.0,1.0,Industrials,White,1.25,1.0,1.0,1.0,0.0,ByBirth,202.0,0.0,1.0
1,0.0,58.67,4.46,1.0,1.0,Materials,Black,3.04,1.0,1.0,6.0,0.0,ByBirth,43.0,560.0,1.0
2,0.0,24.5,0.5,1.0,1.0,Materials,Black,1.5,1.0,0.0,0.0,0.0,ByBirth,280.0,824.0,1.0
3,1.0,27.83,1.54,1.0,1.0,Industrials,White,3.75,1.0,1.0,5.0,1.0,ByBirth,100.0,3.0,1.0
4,1.0,20.17,5.625,1.0,1.0,Industrials,White,1.71,1.0,0.0,0.0,0.0,ByOtherMeans,120.0,0.0,1.0
5,1.0,32.08,4.0,1.0,1.0,CommunicationServices,White,2.5,1.0,0.0,0.0,1.0,ByBirth,360.0,0.0,1.0
6,1.0,33.17,1.04,1.0,1.0,Transport,Black,6.5,1.0,0.0,0.0,1.0,ByBirth,164.0,31285.0,1.0
7,0.0,22.92,11.585,1.0,1.0,InformationTechnology,White,0.04,1.0,0.0,0.0,0.0,ByBirth,80.0,1349.0,1.0
8,1.0,54.42,0.5,0.0,0.0,Financials,Black,3.96,1.0,0.0,0.0,0.0,ByBirth,180.0,314.0,1.0
9,1.0,42.5,4.915,0.0,0.0,Industrials,White,3.165,1.0,0.0,0.0,1.0,ByBirth,52.0,1442.0,1.0


In [5]:
# VERIFYING IF WE IMPORTED THE RIGHT DATASET BY CHECKING THE LAST FIVE ENTRIES OF THE DATA
# <Your code here>

Approval.tail()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,Industry,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
685,1.0,21.08,10.085,0.0,0.0,Education,Black,1.25,0.0,0.0,0.0,0.0,ByBirth,260.0,0.0,0.0
686,0.0,22.67,0.75,1.0,1.0,Energy,White,2.0,0.0,1.0,2.0,1.0,ByBirth,200.0,394.0,0.0
687,0.0,25.25,13.5,0.0,0.0,Healthcare,Latino,2.0,0.0,1.0,1.0,1.0,ByBirth,200.0,1.0,0.0
688,1.0,17.92,0.205,1.0,1.0,ConsumerStaples,White,0.04,0.0,0.0,0.0,0.0,ByBirth,280.0,750.0,0.0
689,1.0,35.0,3.375,1.0,1.0,Energy,Black,8.29,0.0,0.0,0.0,1.0,ByBirth,0.0,0.0,0.0


In [6]:
# COLUMN TYPES AND NON-NULL COUNTS
# <Your code here>

Approval.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          690 non-null    float64
 1   Age             690 non-null    float64
 2   Debt            690 non-null    float64
 3   Married         690 non-null    float64
 4   BankCustomer    690 non-null    float64
 5   Industry        690 non-null    object 
 6   Ethnicity       690 non-null    object 
 7   YearsEmployed   690 non-null    float64
 8   PriorDefault    690 non-null    float64
 9   Employed        690 non-null    float64
 10  CreditScore     690 non-null    float64
 11  DriversLicense  690 non-null    float64
 12  Citizen         690 non-null    object 
 13  ZipCode         690 non-null    float64
 14  Income          690 non-null    float64
 15  Approved        690 non-null    float64
dtypes: float64(13), object(3)
memory usage: 86.4+ KB


In [7]:
# DESCRIPTIVE STATS
# <Your code here>

Approval.describe(include='all')

## 3) Handling Missing Values

Missing values in the dataset can lead to incorrect analysis and model predictions. Imputing missing values ensures the integrity of the dataset, making it possible to build reliable models.

**Instructions:**
1. Check for missing values.
2. Impute missing values for numeric data using the mean and for non-numeric data using the mode.
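
Because the dataset turns out to have no gaps, the imputation step is easier to see on a small example that does. The sketch below uses a hypothetical four-row DataFrame (the values are invented; only the column names mirror the dataset) and fills numeric gaps with the mean and non-numeric gaps with the mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; the column names mirror the dataset.
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 50.0, 40.0],
    "Industry": ["Energy", "Energy", None, "Materials"],
})

numeric_cols = toy.select_dtypes(include="number").columns
other_cols = toy.select_dtypes(exclude="number").columns

# Mean for numeric columns, most frequent value (mode) for the rest.
for col in numeric_cols:
    toy[col] = toy[col].fillna(toy[col].mean())
for col in other_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Assigning the filled column back (`toy[col] = ...`) avoids relying on `inplace=True` on a column selection, which recent pandas versions warn about.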

In [8]:
# CHECK FOR MISSING VALUES
# <Your code here>

print(Approval.isnull().sum())

Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
Industry          0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
DriversLicense    0
Citizen           0
ZipCode           0
Income            0
Approved          0
dtype: int64

In [9]:
# IMPUTE MISSING VALUES

# for numeric data using mean
# <Your code here>

numeric_cols = Approval.select_dtypes(include=['float64', 'int64']).columns

for col in numeric_cols:
    mean_val = Approval[col].mean()
    # Assign back rather than calling fillna(inplace=True) on a column selection
    Approval[col] = Approval[col].fillna(mean_val)

In [3]:
# For non-numeric data using mode
# <Your code here>

non_numeric_cols = Approval.select_dtypes(include=['object']).columns

for col in non_numeric_cols:
    # mode() returns a Series; its first entry is the most frequent value
    mode_val = Approval[col].mode()[0]
    Approval[col] = Approval[col].fillna(mode_val)

In [11]:
Approval.head(10)

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,Industry,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,1.0,30.83,0.0,1.0,1.0,Industrials,White,1.25,1.0,1.0,1.0,0.0,ByBirth,202.0,0.0,1.0
1,0.0,58.67,4.46,1.0,1.0,Materials,Black,3.04,1.0,1.0,6.0,0.0,ByBirth,43.0,560.0,1.0
2,0.0,24.5,0.5,1.0,1.0,Materials,Black,1.5,1.0,0.0,0.0,0.0,ByBirth,280.0,824.0,1.0
3,1.0,27.83,1.54,1.0,1.0,Industrials,White,3.75,1.0,1.0,5.0,1.0,ByBirth,100.0,3.0,1.0
4,1.0,20.17,5.625,1.0,1.0,Industrials,White,1.71,1.0,0.0,0.0,0.0,ByOtherMeans,120.0,0.0,1.0
5,1.0,32.08,4.0,1.0,1.0,CommunicationServices,White,2.5,1.0,0.0,0.0,1.0,ByBirth,360.0,0.0,1.0
6,1.0,33.17,1.04,1.0,1.0,Transport,Black,6.5,1.0,0.0,0.0,1.0,ByBirth,164.0,31285.0,1.0
7,0.0,22.92,11.585,1.0,1.0,InformationTechnology,White,0.04,1.0,0.0,0.0,0.0,ByBirth,80.0,1349.0,1.0
8,1.0,54.42,0.5,0.0,0.0,Financials,Black,3.96,1.0,0.0,0.0,0.0,ByBirth,180.0,314.0,1.0
9,1.0,42.5,4.915,0.0,0.0,Industrials,White,3.165,1.0,0.0,0.0,1.0,ByBirth,52.0,1442.0,1.0


In [11]:
Approval.isnull().values.any()

False

In [10]:
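# First five rows rebuilt by hand as a small sample for quick checks.
# Stored under its own name so the full dataset in Approval is not overwritten.
Approval_sample = pd.DataFrame([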
    [1.0, 30.83, 0.000, 1.0, 1.0, 'Industrials', 'White', 1.25, 1.0, 1.0, 1.0, 0.0, 'ByBirth', 202.0, 0.0, 1.0],
    [0.0, 58.67, 4.460, 1.0, 1.0, 'Materials', 'Black', 3.04, 1.0, 1.0, 6.0, 0.0, 'ByBirth', 43.0, 560.0, 1.0],
    [0.0, 24.50, 0.500, 1.0, 1.0, 'Materials', 'Black', 1.50, 1.0, 0.0, 0.0, 0.0, 'ByBirth', 280.0, 824.0, 1.0],
    [1.0, 27.83, 1.540, 1.0, 1.0, 'Industrials', 'White', 3.75, 1.0, 1.0, 5.0, 1.0, 'ByBirth', 100.0, 3.0, 1.0],
    [1.0, 20.17, 5.625, 1.0, 1.0, 'Industrials', 'White', 1.71, 1.0, 0.0, 0.0, 0.0, 'ByOtherMeans', 120.0, 0.0, 1.0]
], columns=[
    'Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'Industry', 'Ethnicity',
    'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore', 'DriversLicense',
    'Citizen', 'ZipCode', 'Income', 'Approved'
])


## <span style="color:red">*Q2. Is there any missing data at all?*</span>

In [None]:
# There are no missing values in the data at all.

## <span style="color:red">*Q3. Why do we impute mean for numeric and mode for non-numeric data?*</span>

In [None]:
# We impute the mean for numeric data because it is a good summary statistic for a "typical" value, and replacing missing
# values with the mean leaves the column's mean unchanged (although it slightly reduces the variance). It is also simple
# and effective when the values are missing at random and there are not too many of them.

# We impute the mode for non-numeric data because categorical columns contain categories or labels, so there is no
# "average." Imputing missing categories with the mode assumes the missing data most likely belongs to the most
# common group. This avoids introducing invalid categories and keeps the category distribution roughly consistent.
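
One way to check how mean imputation affects a column is to compare its summary statistics before and after filling. The sketch below uses an invented series of six values with two gaps; it is only meant to illustrate the effect, not to describe the assignment's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical income values with two gaps.
income = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 50.0])
filled = income.fillna(income.mean())

# pandas skips NaN by default, so "before" uses the four observed values.
print("mean before:", income.mean(), "after:", filled.mean())
print("std before: ", round(income.std(), 2), "after:", round(filled.std(), 2))
```

The mean is the same before and after, while the standard deviation drops because the filled entries sit exactly at the center of the distribution.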

## 4) Data Preprocessing

Data preprocessing is crucial for preparing raw data for analysis. Converting non-numeric data to numeric form, for example with one-hot encoding, ensures compatibility with machine learning algorithms, which typically require numerical input.
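
To show what one-hot encoding produces, here is a minimal sketch on a hypothetical three-row sample. It uses `pd.get_dummies` rather than the `OneHotEncoder` used in this notebook, purely because it is shorter for a quick look; the resulting 0/1 columns follow the same idea.

```python
import pandas as pd

# Hypothetical three-row sample using one of the dataset's categorical columns.
toy = pd.DataFrame({
    "Citizen": ["ByBirth", "ByOtherMeans", "ByBirth"],
    "Age": [30.83, 58.67, 24.50],
})

# Each category becomes its own 0/1 column named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["Citizen"], dtype=float)
print(encoded)
```

Each row has a 1 in exactly one of the `Citizen_*` columns, so no artificial ordering between categories is introduced.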

In [20]:
# Import the necessary library first
from sklearn.preprocessing import OneHotEncoder

# CONVERTING ALL NON-NUMERIC DATA TO NUMERIC - USING ONE-HOT ENCODING

# INSTANTIATE ONEHOTENCODER
ohe = OneHotEncoder()

# USE ONE-HOT ENCODER ohe TO TRANSFORM VARIABLES

# Creating a new DataFrame for storing transformed data
Approval_transformed = pd.DataFrame(index=Approval.index)

for column in Approval.columns:
    if Approval[column].dtype == 'object':
        # One-hot encode the column if it's object type;
        # .toarray() converts the encoder's sparse output to a dense array
        # <Your code here>
        encoded_cols = ohe.fit_transform(Approval[[column]]).toarray()
        # Set the column name of one-hot encoded DataFrame as column_value,
        # one name per category so the names match the encoded columns
        # <Your code here>
        encoded_col_names = [f"{column}_{cat}" for cat in ohe.categories_[0]]
        encoded_df = pd.DataFrame(encoded_cols, columns=encoded_col_names, index=Approval.index)
        # Concatenate to the transformed DataFrame
        # <Your code here>
        Approval_transformed = pd.concat([Approval_transformed, encoded_df], axis=1)
    else:
        # If not object type, just copy the data
        # <Your code here>
        Approval_transformed = pd.concat([Approval_transformed, Approval[[column]]], axis=1)

Approval_transformed.head(10)

## 5) Data Visualization

**Instructions:**
1. Plot histograms for all variables to understand their distributions.
2. Calculate the correlation matrix and plot the heatmap to identify relationships between variables.

In [22]:
# PLOTTING HISTOGRAMS FOR ALL VARIABLES
# <Your code here>

import matplotlib.pyplot as plt

Approval_transformed.hist(figsize=(15, 15), bins=30, edgecolor='black')
plt.tight_layout()
plt.show()

# CALCULATE THE CORRELATION MATRIX
# <Your code here>

corr_matrix = Approval_transformed.corr()
print(corr_matrix)

# Decrease font size
plt.rcParams['font.size'] = 8


# PLOT THE HEATMAP
# <Your code here>

plt.figure(figsize=(12,10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title('Correlation Heatmap')
plt.show()

## <span style="color:red">*Q4. What do the histograms for all variables look like?*</span>

In [None]:
# The histograms for the numeric variables show how the data points are spread across value ranges, and the histograms for the
# categorical variables are represented by bars (one per category) that show how many samples fall in each group.

## <span style="color:red">*Q5. What do the correlation matrix and heatmap reveal about relationships between variables?*</span>

In [None]:
# Since the correlation matrix measures the linear relationship between pairs of numeric variables and the heatmap uses colors
# to show the strength and direction of correlations, combining the two may reveal important relationships between variables.
# For example, if higher income means better credit, that would be a positive correlation. If higher debt reduces the chances
# of loan approval, then that signifies a negative correlation between the two variables.
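
The sign conventions described above can be seen on a tiny made-up table. The numbers below are invented so that one pair of columns moves together and another moves in opposite directions; they do not come from the dataset.

```python
import pandas as pd

# Made-up values chosen so the signs of the correlations are obvious.
toy = pd.DataFrame({
    "Income":      [10, 20, 30, 40, 50],
    "CreditScore": [1, 2, 2, 4, 5],
    "Debt":        [9, 7, 6, 3, 1],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

In the printed matrix, `Income` and `CreditScore` have a coefficient close to +1, `Income` and `Debt` have one close to -1, and every diagonal entry is 1 because each variable is perfectly correlated with itself.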

## 6) Model Building

Model building involves training a machine learning model to make predictions based on historical data. In this case, we are predicting the likelihood of a credit card application being approved. Splitting the data into training and testing sets ensures that we can evaluate the model's performance on unseen data, providing a realistic assessment of its accuracy.

In [14]:
# DROP THE VARIABLES NOT NEEDED
# <Your code here>

data = Approval_transformed.drop(columns=['ZipCode'])

# SEGREGATE FEATURES AND LABELS INTO SEPARATE VARIABLES
# <Your code here>

X = data.drop('Approved', axis=1)
y = data['Approved']

# SPLIT INTO TRAIN AND TEST USING TRAIN_TEST_SPLIT()
# <Your code here>

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Code snippet for splitting data adapted from scikit-learn documentation:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


**Instructions:**
1. Scale the data using MinMaxScaler to ensure that all features contribute equally to the model. Scaling is important as it brings all features to a comparable range, improving the convergence of the learning algorithm.
2. Instantiate and fit a Logistic Regression model to the training set.
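
As a small sketch of what min-max scaling does, the example below fits a `MinMaxScaler` on a hypothetical one-feature training set and applies it to a test set, then repeats the calculation by hand with (x - min) / (max - min). The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature train/test split.
X_train = np.array([[20.0], [30.0], [60.0]])
X_test = np.array([[40.0], [70.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min=20, max=60
X_test_scaled = scaler.transform(X_test)        # reuses the training min/max

# The same transformation written out by hand: (x - min) / (max - min).
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)
manual = (X_test - train_min) / (train_max - train_min)

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

The scaler is fitted on the training data only, so a test value above the training maximum (70 here) maps to a number greater than 1. That is expected and avoids leaking test-set information into the preprocessing.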

In [23]:
# INSTANTIATE MINMAXSCALER AND USE IT TO RESCALE X_TRAIN AND X_TEST
# <Your code here>

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)

# You can try to do z-score normalization (look it up!)
# INSTANTIATE A LOGISTICREGRESSION CLASSIFIER WITH DEFAULT PARAMETER VALUES
# <Your code here>

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# FIT MODEL TO THE TRAIN SET
# <Your code here>

model.fit(X_train_scaled, y_train)

# Code for scaling and logistic regression adapted from:
# scikit-learn documentation – https://scikit-learn.org/stable/modules/preprocessing.html
# and https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


In [16]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)

LogisticRegression(multi_class='warn', solver='warn')

## <span style="color:red">*Q6. Why is it important to split the data into training and testing sets?*</span>

In [None]:
# It is important to split the data into training and testing sets because it allows the user to evaluate how well the ML
# model performs on unseen data. The training set is used to teach the model to distinguish patterns in the data, while
# the testing set represents how the model would perform on real-world or future data. If one were to train and test on
# the same data, the model might appear to perform well while in reality it had memorized the training data instead of
# learning general patterns, leading to poor performance when making predictions on new data.
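
The memorization effect described above can be sketched with a model that is free to overfit. The example below uses synthetic data from `make_classification` as a stand-in for the real applications and an unpruned decision tree instead of logistic regression, simply because the gap is easier to see with it; both choices are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real applications (an assumption).
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An unpruned tree can memorize its training rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```

Scoring on the training rows alone would suggest a perfect model; only the held-out score shows how it behaves on data it has not seen.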

## 7) Model Evaluation

Evaluating the model's performance is crucial to ensure it can accurately predict outcomes on new data. The confusion matrix and accuracy score provide insights into the model's ability to distinguish between approved and not approved applications. This is critical for minimizing false approvals and rejections, directly impacting the bank's operations and customer satisfaction.

In [17]:
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test_scaled) # <Your code here>

# print("Accuracy of logistic regression classifier: ", logreg.score(* <Your code here> *))

print("Accuracy of logistic regression classifier: ", model.score(X_test_scaled, y_test))

# PRINT THE CONFUSION MATRIX OF THE LOGREG MODEL
# <Your code here>

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", cm)

# Model evaluation code adapted from:
# scikit-learn documentation – https://scikit-learn.org/stable/modules/model_evaluation.html

Accuracy of logistic regression classifier:  0.8405797101449275


array([[88, 22],
       [11, 86]])
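
The reported accuracy can be recovered directly from the confusion matrix, along with precision and recall. The sketch below assumes scikit-learn's usual layout (rows are actual classes, columns are predicted classes) and that class 1 means "approved"; the counts are copied from the output above.

```python
import numpy as np

# Confusion matrix as reported above: rows = actual, columns = predicted.
cm = np.array([[88, 22],
               [11, 86]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted approvals that were correct
recall = tp / (tp + fn)     # share of actual approvals that were found

print(f"accuracy = {accuracy:.4f}, precision = {precision:.4f}, recall = {recall:.4f}")
```

Under these assumptions, the 22 false positives correspond to false approvals and the 11 false negatives to false rejections, the two error types the bank wants to minimize.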

**Instructions:**
1. Calculate and plot the ROC curve for the model. The ROC curve is a graphical representation of a classifier's performance and is useful for visualizing the trade-off between the true positive rate and false positive rate at various threshold settings.
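
To make the threshold trade-off concrete, the sketch below computes the false positive rate and true positive rate by hand at a few thresholds, then lets scikit-learn compute the full curve and its area. The four labels and scores are made up and far smaller than the real test set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Four made-up applications: true labels and predicted approval probabilities.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])


def rates_at(threshold: float) -> tuple[float, float]:
    """Return (FPR, TPR) when scores >= threshold are classified as positive."""
    y_hat = y_score >= threshold
    tpr = (y_hat & (y_true == 1)).sum() / (y_true == 1).sum()
    fpr = (y_hat & (y_true == 0)).sum() / (y_true == 0).sum()
    return float(fpr), float(tpr)


# Lowering the threshold catches more positives but also admits more negatives.
for t in (0.8, 0.4, 0.35, 0.1):
    print(t, rates_at(t))

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print("AUC =", auc)
```

Each threshold gives one point on the ROC curve, and the AUC summarizes the whole curve in a single number, where 0.5 corresponds to random guessing and 1.0 to perfect separation.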

In [24]:
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# CALCULATE THE FPR AND TPR FOR ALL THRESHOLDS OF THE CLASSIFICATION
# <Your code here>

fpr, tpr, thresholds = metrics.roc_curve(y_test, model.predict_proba(X_test_scaled)[:,1])

# method to plot ROC Curve
# <Your code here>

plt.plot(fpr, tpr, label=f'ROC curve(AUC = {metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:,1]):.2f})')
plt.plot([0,1], [0,1], linestyle='--', color='gray')  # random guess line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()


## <span style="color:red">*Q7. Can you code a similar model for random forest?*</span>

In [None]:
Yes.

In [4]:
# Import the required classifier

from sklearn.ensemble import RandomForestClassifier # <Your code here>


# Instantiate a RandomForestClassifier with default parameters

rf = RandomForestClassifier(random_state=42) # <Your code here>


# Fit the model on the training data

rf.fit(X_train_scaled, y_train) # <Your code here>


# Re-instantiate the RandomForestClassifier with 200 trees

rf = RandomForestClassifier(n_estimators=200, random_state=42) # <Your code here>)

    
# Fit the model again on the training data

rf.fit(X_train_scaled, y_train) # <Your code here>


# Predict the test set labels

y_pred = rf.predict(X_test_scaled) # <Your code here>

# Compute and print the accuracy on test data

accuracy = rf.score(X_test_scaled, y_test)# <Your code here>
print("Accuracy of random forest classifier: ", accuracy) # <Your code here>)

# Code adapted from:
# scikit-learn RandomForestClassifier documentation:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Accuracy of random forest classifier:  0.8599033816425121


In [26]:
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
# Get the probability predictions for the positive class
probs = rf.predict_proba(X_test_scaled)[:, 1]# <Your code here>)   
preds = rf.predict(X_test_scaled)# <Your code here>

# Compute false positive rate and true positive rate
fpr, tpr, threshold = metrics.roc_curve(y_test, probs)# <Your code here>

# Compute the AUC score
roc_auc = metrics.roc_auc_score(y_test, probs) # <Your code here>

# Plot ROC Curve
import matplotlib.pyplot as plt
# <Your code here>

plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})', color='green')
plt.plot([0,1], [0,1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
