# Module 1 - Classification
## Machine Learning Basics
Machine learning is a branch of statistics/computer science/math that is concerned with the ways in which algorithms can be trained by data and iteratively tuned (self-tuned or by human input) to perform some task (typically predicting some outcome). One way of understanding machine learning algorithms is by considering the outcome that the algorithm seeks to predict. If the outcome is a group (e.g., out/not out or malignant/benign or coupe/sedan/suv/van/truck) then the algorithm is a classifier. If the outcome is a numeric value (e.g., stock price, retention), the algorithm is a regressor.   

Regardless of the type of algorithm, they all need data that will teach the algorithm how to detect the appropriate outcome. This type of data is referred to as training data. To be effective, training data must be representative of the real-world data the model will encounter. Otherwise, the model will be inaccurate at best and broken at worst when used to predict real-world outcomes. 
There are three types of training techniques: supervised learning, unsupervised learning, and semi-supervised learning.  

1. Supervised Learning: Supervised learning refers to model training that uses labeled data. Data are labeled when the correct outcome is known and the outcome data are present during training. For example, if you wanted to train an algorithm to discriminate between malignant and benign tumors, you would need training data which include all data points needed to make that determination (e.g., mass size, shape, location, etc.) and an additional data point with the correct outcome.  
2. Unsupervised Learning: Unsupervised Learning refers to model training that has no predetermined outcome but is instead seeking to organize the inputs in a discriminatory way. For example, a retailer may not know what types of customers shop at their store, but they could use unsupervised learning to ‘find’ different types of customers based on shopping histories.  
3. Semi-Supervised Learning: Semi-supervised learning refers to model training that uses partially labeled data. In some situations, labelling all training data is cost-prohibitive or impossible (e.g., finding radiologists who are willing to label images of tumors is expensive). In this case, supervised and unsupervised techniques are combined to predict the desired outcome. Unsupervised learning is used to find the desired outcome in the unlabeled data and its results are iteratively verified by the labeled data.  

Learning algorithms are said to ‘converge’ on the correct solution. This happens iteratively as the algorithm seeks to reduce error (i.e., the difference between the value predicted by the algorithm and the actual output) as determined by a loss function. Each iteration (or epoch, when the iteration is a full pass over the training data), the algorithm adjusts the weights (the influence each input parameter has on predicting the outcome) until the error is minimized.  
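
As a rough illustration of this loop, the sketch below fits a single weight by gradient descent on a mean squared error loss. The toy data, the learning rate, and the number of epochs are all assumptions made for the example; real libraries use more sophisticated optimizers and stopping rules.

```python
import numpy as np

# Toy data (assumed for illustration): the true relationship is y = 3 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

def mse_loss(weight):
    """Mean squared error between the predictions (weight * x) and the actual y."""
    return float(np.mean((weight * x - y) ** 2))

weight = 0.0          # starting guess
learning_rate = 0.01  # step size (hypothetical value)
losses = []

for epoch in range(200):
    error = weight * x - y
    gradient = float(np.mean(2 * error * x))  # derivative of the loss with respect to the weight
    weight -= learning_rate * gradient        # move against the gradient
    losses.append(mse_loss(weight))

print(f"learned weight: {weight:.3f}, final loss: {losses[-1]:.6f}")
```

Each pass uses the full dataset, so one iteration equals one epoch here. The loss shrinks every epoch and the weight approaches 3, which is what 'converging' means in practice.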

## Machine Learning Process
Machine learning exercises tend to follow a series of steps that are common to all projects.  

1. Problem Identification: Machine learning algorithms are designed to find relationships between inputs and outputs, regardless of whether it is appropriate to do so. Therefore, any machine learning task must begin with an appropriate understanding of what you are trying to achieve.  
2. Data Collection: Beginning with problem identification is important because it will also identify possible data sources and data elements that are needed to solve the problem. Once identified, you must set out to collect the data needed to address the focal problem.  
3. Data Profiling: Real-world data are often messy, inconsistent, and incomplete. Thus, all projects must undergo a process of profiling and processing the data so that they are appropriate for modeling. Data profiling often involves a variety of tasks including exploring data, scaling variables, addressing missing values, and splitting the dataset into training and testing data.  
4. Model Specification: Once cleaned, the data are ready to be used for training a model. At this point, the analyst must choose an algorithm that is appropriate given the inputs and desired outputs and then define the hyperparameters, learning parameters of the chosen algorithm, and convergence criteria.  
5. Model Evaluation: A converging model is not necessarily a good solution. To assess the quality of the model, test data are used to evaluate model predictions. Alternatively, models may be compared to alternative algorithms to assess quality.
6. Conclusion: Model building is not (always) an educational exercise. Often, we build machine learning models to predict outcomes. So, how did this model do? Where is it right and where is it wrong? Is it more important to be right than it is to be not wrong? Are the costs of being wrong so high that we have little to no tolerance for error?

## Classification
Classification is a fundamental task in machine learning where the objective is to predict the categorical label of new observations based on past observations. This type of task is essential in scenarios ranging from email spam detection to medical diagnoses. The process involves training a model on a labeled dataset where the correct label is known. The model then learns to map the input features to the desired output labels by minimizing the discrepancy between its predictions and the actual labels. Various algorithms, such as logistic regression, support vector machines, and neural networks, are employed to perform classification tasks. These models can handle binary classification (two classes) or multi-class classification (multiple classes), depending on the problem at hand.  

<p style="text-align: center"><img src="https://thislondonhouse.com/Jupyter/Images/svm.jpg"></p>

Once trained, the classifier is evaluated using performance metrics such as accuracy, precision, recall, and F1-score to ensure its effectiveness. An important aspect of classification is handling class imbalances, which occur when some classes are underrepresented in the training data. Techniques such as oversampling, undersampling, or employing specialized loss functions help address this challenge. With advancements in machine learning, classifiers can now handle complex and high-dimensional datasets, making them instrumental in various fields including finance, healthcare, and social media analytics. Continual improvements in algorithms and computational power are constantly pushing the boundaries of what is possible with classification models.  
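
One simple way to build a 'specialized loss' for imbalanced data is to weight each class inversely to its frequency. The sketch below computes such weights with the formula n_samples / (n_classes * class_count), which is the same heuristic scikit-learn applies when `class_weight='balanced'` is requested. The function name and the label counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels: 75 negatives and 25 positives
labels = [0] * 75 + [1] * 25
weights = balanced_class_weights(labels)
print(weights)
```

The rare class receives the larger weight, so each mistake on it costs the model more during training.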

<p style="text-align: center"><img src="https://thislondonhouse.com/Jupyter/Images/confusionMatrxiUpdated.jpg"></p>

### Classification Algorithms
#### Logistic Regression
Logistic regression is a simple yet powerful classification technique used for binary and multi-class classification problems. Despite its name, logistic regression is actually a linear model for classification rather than regression. It predicts the probability of a binary outcome using a logistic function, also known as the sigmoid function. The output of the logistic regression model is a value between 0 and 1, which represents the probability of the positive class. The model estimates the coefficients of the input features by minimizing the logistic loss function, which measures the difference between the predicted probabilities and the actual labels. Logistic regression is widely used due to its simplicity, interpretability, and effectiveness on linearly separable data. However, it may not perform well on complex datasets where the relationship between the input features and the output is highly non-linear.
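
To make the sigmoid and the logistic loss concrete, here is a minimal sketch in plain Python. The coefficients, intercept, and feature values are hypothetical; a fitted model would estimate the coefficients from training data.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefficients, intercept):
    """Probability of the positive class for one observation."""
    z = intercept + sum(c * f for c, f in zip(coefficients, features))
    return sigmoid(z)

def log_loss(y_true, p):
    """Logistic loss for one observation with true label y_true (0 or 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Hypothetical coefficients and a single observation
p = predict_proba([2.0, -1.0], coefficients=[0.5, 0.25], intercept=-0.25)
print(f"P(positive) = {p:.3f}")
print(f"loss if the true label is 1: {log_loss(1, p):.3f}")
print(f"loss if the true label is 0: {log_loss(0, p):.3f}")
```

The loss is smaller when the predicted probability leans toward the true label, which is what training exploits when it adjusts the coefficients.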
#### Support Vector Machines
Support Vector Machines (SVM) are powerful classification algorithms that aim to find the optimal hyperplane that separates the data into different classes. The SVM algorithm tries to maximize the margin between the classes, which is the distance between the hyperplane and the nearest data points from each class, known as support vectors. By doing so, SVM achieves a robust and generalizable decision boundary. For non-linearly separable data, SVM uses a technique called the kernel trick to transform the input features into a higher-dimensional space where a linear separation is possible. Common kernel functions include polynomial, sigmoid, and radial basis function (RBF, also known as the Gaussian kernel). SVMs are particularly effective in high-dimensional spaces and have been widely used for tasks such as image recognition and bioinformatics. However, SVMs can be computationally expensive and less efficient on very large datasets.
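
The idea behind the higher-dimensional space can be sketched with an explicit feature map. A kernel does this implicitly, without ever computing the new coordinates, so the code below is an illustration of the geometry rather than of how an SVM is implemented. The data are synthetic and the threshold is chosen by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): an inner cluster (class 0) surrounded by a ring (class 1)
angles = rng.uniform(0, 2 * np.pi, size=100)
radii = np.concatenate([rng.uniform(0.0, 1.0, 50), rng.uniform(2.0, 3.0, 50)])
points = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
labels = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

# Explicit feature map: add a third coordinate, the squared distance from the origin
lifted = np.column_stack([points, (points ** 2).sum(axis=1)])

# No straight line separates the classes in 2-D, but in the lifted space
# the flat plane "third coordinate = 2.25" does
predicted = (lifted[:, 2] > 2.25).astype(int)
print("accuracy of the linear rule in the lifted space:", np.mean(predicted == labels))
```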
#### Naive Bayes Classifier
Naive Bayes classifiers are probabilistic models based on Bayes' Theorem, which leverages the concept of conditional probability. The "naive" aspect of the algorithm refers to the assumption that the features are independent given the class label. Despite this strong assumption, Naive Bayes classifiers have been highly successful in various applications, particularly in text classification tasks such as spam detection and sentiment analysis. The model calculates the probability of each class given the input features and selects the class with the highest probability as the predicted label. There are different types of Naive Bayes classifiers, including Gaussian, Multinomial, and Bernoulli, each suited for different types of data. One major advantage is their simplicity and efficiency; however, their performance can be suboptimal if the feature independence assumption is violated.
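
The calculation can be worked by hand for a tiny spam filter. All probabilities below are invented for the example; a real classifier estimates them from training data.

```python
# Hypothetical probabilities for a two-word spam filter
priors = {"spam": 0.4, "ham": 0.6}
word_given_class = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            score *= word_given_class[cls][word]  # multiply the per-word likelihoods
        scores[cls] = score
    total = sum(scores.values())  # normalize so the posteriors sum to 1
    return {cls: score / total for cls, score in scores.items()}

result = posterior(["free"])
print(result)
print("predicted:", max(result, key=result.get))
```

Multiplying the per-word likelihoods is exactly where the independence assumption enters: the words are treated as unrelated once the class is known.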
#### Neural Network Classifiers
Neural network classifiers are inspired by the human brain and consist of multiple layers of interconnected artificial neurons. These models are capable of learning complex, non-linear decision boundaries and are highly flexible, making them suitable for a wide range of applications. A neural network consists of an input layer, one or more hidden layers, and an output layer. Each neuron in a layer performs a weighted sum of its inputs, applies an activation function, and passes the result to the next layer. During training, the model adjusts the weights using a process called backpropagation, which minimizes the error between the predicted and actual labels. With the advent of deep learning, neural networks with many hidden layers, known as deep neural networks, have achieved remarkable performance in tasks such as image classification, speech recognition, and natural language processing. These models excel at discovering intricate patterns in large and complex datasets. However, they require substantial computational resources and large amounts of data, and they can be more challenging to interpret compared to simpler models.
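
A single forward pass through a small network can be sketched in a few lines of NumPy. The layer sizes and the random, untrained weights are assumptions for illustration; training by backpropagation is not shown.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Randomly initialized (untrained) weights: 4 inputs -> 3 hidden units -> 1 output
w_hidden, b_hidden = rng.normal(size=(4, 3)), np.zeros(3)
w_output, b_output = rng.normal(size=(3, 1)), np.zeros(1)

def forward(x):
    hidden = relu(x @ w_hidden + b_hidden)        # weighted sum, then activation
    return sigmoid(hidden @ w_output + b_output)  # probability of the positive class

batch = rng.normal(size=(5, 4))  # five hypothetical observations
probabilities = forward(batch)
print(probabilities.ravel())
```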



In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from scikeras.wrappers import KerasClassifier
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.svm import LinearSVC
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

## Classification Exercise 1
### Business Problem

It is important for blood banks to be able to correctly identify individuals who are likely to donate blood at a blood drive. Given that blood banks often have very limited resources for incentivizing donations, it would be wise to reserve incentives for those who are more likely to need encouragement (i.e., those who are less likely to donate at any given blood drive).

### Data Collection/Selection
We will be loading data from a blood donation dataset. More information is available [here](https://archive.ics.uci.edu/dataset/176/blood+transfusion+service+center).  

Data are organized in tabular format with each record representing an individual donor. The target variable is 'DonatedBlood', which represents whether the individual donated at the most recent blood drive. The data contain the following columns:  
| Variable Name | Role | Type | Description |
| ------------- | ---- | ---- | ----------- |
| Recency | Feature | Integer | months since last donation |
| Frequency | Feature| Integer | total number of donations |
| DonatedAmt | Feature | Integer | total blood donated in c.c. |
| Time | Feature | Integer | months since first donation |
| DonatedBlood | Target | Binary | whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood) |

The following line will load the data as a pandas dataframe.

In [None]:
blood_donation_df = pd.read_csv("data/blood_donation.csv")

### Data Profiling  
Once the data are loaded, we need to profile the data and prepare them for analysis. This typically involves several steps that may include handling missing data, exploring data, and feature selection, among others. The steps will vary depending on the dataset and the business problem, but profiling always precedes model building.  

The following lines provide important insight into the nature of the data.

In [None]:
print(blood_donation_df.info())

In [None]:
blood_donation_df.head()

In [None]:
blood_donation_df.describe()

In [None]:
for feature, values in blood_donation_df.items():
    if values.dtypes in ['int64', 'float64']:
        values.hist()
    else:
        values.value_counts().plot.barh()
    plt.title(f'{feature}') 
    plt.show()

The following lines define which features (i.e., columns) we want to include in our model and group them based on the data type included in each column. We may change our mind, but these steps will serve as a starting point. 

In [None]:
categorical_cols = []
numeric_cols = []
count_cols = ['Recency', 'Frequency', 'DonatedAmt', 'Time']
target_cols = ['DonatedBlood']
input_cols = [x for x in categorical_cols + numeric_cols + count_cols if x not in target_cols]
data_cols = input_cols + target_cols

Once we've decided on which features to include, it is sometimes easiest to subset these features into a new dataframe.

In [None]:
df = blood_donation_df[data_cols]

In [None]:
print(df.info())

In [None]:
df[target_cols].head()

In [None]:
df[target_cols].value_counts()

In [None]:
for target in target_cols:
    df[target].value_counts().plot.barh()
    plt.show()

We can view a pair plot of each feature separated by the target variable. This can be helpful for spotting any problems in our data or relationships among features.

In [None]:
for target in target_cols:
    sns.pairplot(df, hue=target, corner=True)
    plt.show()

Once we are happy with our dataset, we can drop records with missing data, split the data into testing and training sets, and move on to model specification.
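
With an imbalanced target, it can help to make sure both splits keep the same class ratio. The sketch below shows the `stratify` option of `train_test_split` on synthetic stand-in data (not the blood donation file); the notebook's own split does not use it, so treat this as an optional variation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 80 negatives and 20 positives
toy = pd.DataFrame({"feature": range(100), "target": [0] * 80 + [1] * 20})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["feature"]], toy["target"],
    test_size=0.25, random_state=16,
    stratify=toy["target"],  # keep the class ratio the same in both splits
)
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```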

In [None]:
df = df.dropna()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[input_cols], df[target_cols], test_size=0.25, random_state=16)

### Model Specification
Specifying a model usually requires you to make decisions about whether/how to deal with data transformations and selecting an appropriate model. We will be using a concept known as pipelines to address data transformations and model selection. Pipelines make our lives easier by collecting a series of common steps into a single pipeline. This pipeline of steps will execute whenever the model is run, ensuring consistency and reducing the likelihood of error.  
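
As a minimal sketch of the idea, the pipeline below chains a scaler and a classifier on synthetic data. The data and the step names are invented for the example; the pipelines used in this exercise are built in the cells that follow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumed): the label depends only on the first feature
X_toy = rng.normal(loc=50, scale=10, size=(200, 2))
y_toy = (X_toy[:, 0] > 50).astype(int)

toy_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),          # step 1: learn and apply the scaling
    ('classifier', LogisticRegression()),  # step 2: fit the model on the scaled data
])

toy_pipeline.fit(X_toy[:150], y_toy[:150])  # fits every step in order
toy_accuracy = toy_pipeline.score(X_toy[150:], y_toy[150:])
print("held-out accuracy:", toy_accuracy)
```

Because the scaler is fitted inside the pipeline, the held-out rows are scaled with statistics learned from the training rows only.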

Because different types of data are transformed in different ways, we will build three different transformer pipelines: one for numbers, one for count values, and one for categorical values. Each transformer will only be applied to the columns that contain the specified data type.
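
Count data are often heavily right-skewed and can include zeros, which is why a `log1p` transform (log of 1 + x) is a common choice for them. A small sketch with made-up counts:

```python
import numpy as np

counts = np.array([0, 1, 9, 99, 999])  # hypothetical count values, including zero

logged = np.log1p(counts)    # log(1 + x), which is defined at x = 0
restored = np.expm1(logged)  # inverse transform, exp(x) - 1

print(np.round(logged, 3))
```

The transform compresses the large values far more than the small ones, and it can be undone with `np.expm1`.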

In [None]:
count_transformer = Pipeline(steps=[
    ('log', FunctionTransformer(np.log1p, feature_names_out='one-to-one'))
])

In [None]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

In [None]:
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False))
])

Each transformer stands on its own, but we are going to collect the ones we need into a single preprocessor that is specific to our model. This dataset has only count columns, so only the count transformer is used here.

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('count', count_transformer, count_cols),
    ])

#### Logistic Classifier

In [None]:
logistic_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('logistic', LogisticRegression())
])

With the pipeline built, we are ready to train the model.

In [None]:
logistic_pipeline.fit(X_train, np.ravel(y_train))

To assess the performance of the model, we will use the data we reserved for testing to predict the target variable (donate/not donate). This will help us benchmark the performance of our model.

In [None]:
logistic_predicted = logistic_pipeline.predict(X_test)

If we look at the results side-by-side, we can see the logistic regression model did a pretty good, but not perfect, job. But, how good?

In [None]:
print(pd.DataFrame({f"{target_cols[0]}-True": np.ravel(y_test), 
                    f"{target_cols[0]}-Predicted": logistic_predicted
                   }))

Here we can see whether the test value matches the predicted value.

In [None]:
pd.DataFrame(np.ravel(y_test) == logistic_predicted, columns=['Correctly Predicted']).value_counts()

And if we take the average of these results (note that True is treated as 1 and False is treated as 0, so the average tells us the proportion of True values), we can see the **Accuracy** of our classifier.

In [None]:
np.mean(np.ravel(y_test) == logistic_predicted)

This is a common task, so there are built-in tools for assessing classifier performance. The following function summarizes the most important metrics for classifier performance.

In [None]:
def classifier_performance(y, y_pred, labels_dict=None):
    # Flatten inputs so a one-column DataFrame and a 1-D array are handled alike
    y, y_pred = np.ravel(y), np.ravel(y_pred)
    # Class names ordered by label value; None falls back to the raw labels
    target_names = [labels_dict[i] for i in sorted(labels_dict)] if labels_dict else None

    # average='weighted' weights each per-class score by that class's share of y;
    # zero_division=0 reports 0 (without a warning) when a class is never predicted
    accuracy = metrics.accuracy_score(y, y_pred)
    precision = metrics.precision_score(y, y_pred, average='weighted', zero_division=0)
    recall = metrics.recall_score(y, y_pred, average='weighted', zero_division=0)
    balanced_accuracy = metrics.balanced_accuracy_score(y, y_pred)
    f1 = metrics.f1_score(y, y_pred, average='weighted', zero_division=0)
    report = metrics.classification_report(y, y_pred, target_names=target_names, zero_division=0)

    # Display the confusion matrix with custom labels
    conf_matrix = metrics.confusion_matrix(y, y_pred)
    disp = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=target_names)
    disp.plot(cmap=plt.cm.Greens)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"Balanced Accuracy: {balanced_accuracy:.4f}")
    print(f"F1-score: {f1:.4f}")
    print("\nDetailed Classification Report:")
    print(report)
    plt.show()

If we pass in the results of our current model to the function above, we will get the performance of this model. Then, we could run another model and reuse the function to have a common platform for assessing classifier performance across models.

In [None]:
print("Logistic Classification Performance Metrics:")
# Labels follow the data dictionary: 0 = did not donate, 1 = donated
classifier_performance(y_test, logistic_predicted, {0: 'not donated', 1: 'donated'})

**For reference**  
Accuracy = (TN + TP)/(TN + TP + FN + FP)  
Precision = (TP)/(TP + FP)  
Sensitivity (Recall) = (TP)/(TP + FN)  
Specificity (Selectivity) = (TN)/(TN + FP)  
Balanced Accuracy = (Sensitivity + Specificity)/2  
F1 Score = (2 × Precision × Recall)/(Precision + Recall)  
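
These formulas can be checked by hand with a short helper. The function name and the counts below are hypothetical and are not results from the models in this notebook. Note also that the formulas describe the positive class, whereas `classifier_performance` reports weighted averages across both classes, so its precision, recall, and F1 figures will generally differ from a positive-class calculation.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Compute the reference metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)       # sensitivity
    specificity = tn / (tn + fp)  # selectivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for illustration only
scores = metrics_from_counts(tp=20, fp=10, tn=60, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```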

### Model Evaluation

Is the logistic model any good? Can we do better? The only way to know is to run more models. To get a baseline, we can run a model that ignores all inputs and simply predicts the most common outcome, 'not donated' in this case. This type of model should be the worst we could do.  

Since we already have a pipeline set up, we can reuse most of our work and change the final model.
#### Dummy Classifier

In [None]:
dummy_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('dummy', DummyClassifier(strategy="most_frequent"))
])

Next, we will fit the model, as we did above.

In [None]:
dummy_pipeline.fit(X_train, y_train)

In [None]:
dummy_predicted = dummy_pipeline.predict(X_test)

Now, we can reuse our classifier performance function to get a feel for how the dummy classifier performed.

In [None]:
print("Dummy Classification Performance Metrics:")
classifier_performance(y_test, dummy_predicted, {0: 'not donated', 1: 'donated'})

These results show a reasonable level of accuracy (76%, compared to 79% for the logistic regression), but precision and balanced accuracy scores are much lower (because it wasn't even trying to predict). This illustrates one of the challenges of highly unbalanced datasets: models seeking correct predictions often struggle to outperform the most common result and tend to revert to the most frequent response.
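
The arithmetic behind a most-frequent baseline is easy to reproduce. In the sketch below, the helper name and the labels (76% negatives, chosen to echo the accuracy above) are assumptions rather than the actual test set.

```python
def majority_baseline(labels):
    """Accuracy and balanced accuracy when always predicting the most common class."""
    positives = sum(labels)
    negatives = len(labels) - positives
    majority_is_positive = positives > negatives
    accuracy = max(positives, negatives) / len(labels)
    # Recall is 1 for the majority class and 0 for the other class
    recall_positive = 1.0 if majority_is_positive else 0.0
    recall_negative = 0.0 if majority_is_positive else 1.0
    balanced_accuracy = (recall_positive + recall_negative) / 2
    return accuracy, balanced_accuracy

# Hypothetical labels with 76% negatives
labels = [0] * 76 + [1] * 24
accuracy, balanced_accuracy = majority_baseline(labels)
print(accuracy, balanced_accuracy)
```

Accuracy simply equals the share of the majority class, while balanced accuracy sits at 0.5, which is why the latter is the more informative yardstick on unbalanced data.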

Let's try another model!
#### Support Vector Machines
Again, we will reuse the pipeline we've created and just change the classifier.

In [None]:
svm_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('svm', LinearSVC())
])

In [None]:
svm_pipeline.fit(X_train, np.ravel(y_train))

In [None]:
svm_predicted = svm_pipeline.predict(X_test)

In [None]:
print("SVM Classification Performance Metrics:")
classifier_performance(y_test, svm_predicted, {0: 'not donated', 1: 'donated'})

A little bit better than guessing, but not much. Logistic is still the model to beat.

So, let's try another!
#### Naive Bayes Classifier

In [None]:
nb_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('nb', GaussianNB())
])

In [None]:
nb_pipeline.fit(X_train, np.ravel(y_train))

In [None]:
nb_predicted = nb_pipeline.predict(X_test)

In [None]:
print("Naive Bayes Classification Performance Metrics:")
classifier_performance(y_test, nb_predicted, {0: 'not donated', 1: 'donated'})

The best overall model, thus far. Let's do one more!  
#### Neural Network Classifier  
Neural networks work a little differently as they have no a priori structure. Instead, the analyst builds the model and the model then tries to detect patterns predicting the target via the defined network. As such, there are many more choices that can be made than are defined here. We will look at these models more as the semester progresses and what it takes to tune a neural network. For now, we will use the model defined below and we will insert it into our existing pipelines.

In [None]:
def create_sequential_model(dims, metric):
    model = Sequential()
    model.add(Input(shape=(dims,)))
    model.add(Dense(4, activation="relu"))
    model.add(Dense(4, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=metric)
    return model

In [None]:
nn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('nn', KerasClassifier(model=create_sequential_model(preprocessor.fit_transform(X_train).shape[1], ['accuracy']), epochs=10, batch_size=5, verbose=1))
])

In [None]:
nn_pipeline.fit(X_train, y_train)

In [None]:
nn_predicted = nn_pipeline.predict(X_test)

In [None]:
print("Neural Network Classification Performance Metrics:")
classifier_performance(y_test, nn_predicted, {0: 'not donated', 1: 'donated'})

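Not great. It is just guessing 'not donated' and is therefore no smarter than the dummy classifier. But neural networks have optimization objectives, and we will get different results if we ask the model to prioritize different objectives. For the initial model, we tracked **accuracy**. Note that in Keras the `metrics` argument only controls what is reported during training; the quantity actually being minimized is the loss (binary cross-entropy here). What happens if we track **precision** instead, or change the loss or the class weights so that the model is pushed toward a different objective?  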
#### Feature Selection
Just because we have data doesn't mean we will want to use it. Some values may be unrelated to our target. Others may confound and lead to irrational decisions. Also, for complex models and large datasets, more data may adversely affect the performance of the model--a slow but accurate model may be worse than a fast but kinda accurate model. So, it is important that we consider the role that features have in predicting the target. Some will have a large role and some will have a small role. Those with small roles may **sometimes** be excluded without harming the overall performance of the model. You will have to use trial and error to make the decisions. *Note: Coefficients are not available for all models.*

In [None]:
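def feature_importance(model_pipeline, preprocessor_name, model_name):
    # Use the preprocessor that belongs to the pipeline being inspected (already fitted)
    fitted_preprocessor = model_pipeline.named_steps[preprocessor_name]
    # Collect the output feature names from each fitted transformer, in column order
    all_feature_names = []
    for key, transformer in fitted_preprocessor.named_transformers_.items():
        all_feature_names.extend(transformer.get_feature_names_out())
    # Scale each coefficient by the standard deviation of its transformed feature so
    # features on different scales are comparable (X_train is the split defined above)
    feature_std = np.std(fitted_preprocessor.transform(X_train), axis=0)
    pd.DataFrame(
        model_pipeline[model_name].coef_[0] * feature_std,
        columns=["Coefficients"],
        index=all_feature_names
    ).plot(kind="barh", figsize=(7, 8))
    plt.title("Feature Importance")
    plt.show()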

In [None]:
feature_importance(logistic_pipeline, 'preprocessor', 'logistic')

In [None]:
feature_importance(svm_pipeline, 'preprocessor', 'svm')

### Conclusion
Is this a good model? That depends. What is the risk of misclassifying? If you predict someone as a donor, but they do not donate, what are the consequences? If you predict someone will not donate, but they do, is that bad? Assessing the quality of the model involves not only looking at predictive quality but also considering the cost of misclassification.  

In this case, the models are fine but not great. The best model is moderately better than just assuming someone will not donate. That assumption would be considered a guess in most cases, but given the unbalanced nature of this dataset, not donating is a highly likely outcome. Given the moderate prediction performance, the more expensive (with regard to time/resources) algorithms are less preferred than a standard logistic. Fortunately, the costs of misclassifying are relatively low as it would likely mean that you end up devoting resources to attract a donor who would have donated without the incentives. This is illustrated in the confusion matrix where we see that because the models are biased toward the negative, all algorithms place a large number of positives (donate) in the negative (not donate) column (false negatives). The risks/costs associated with these false negatives are low for blood drives.  

Unfortunately, the best course of action would be to improve the dataset (which is often expensive or impossible). We only have four variables and two of them are perfectly correlated (which means we really only have three input variables). It would be nice to have some more variables about the donors, the donation event, and the event location as it seems likely that all three could have an impact on whether or not someone donates blood.


## Classification Exercise 2

### Business Problem

Offering credit is risky. Banks need to offer credit to a variety of customers because low-risk customers are of little value (because they pay off their balance monthly or they maintain very small balances) but high-risk customers may have negative value if they are prone to default on their debts. So, it is important for banks to be able to accurately identify customers who are likely to default on their credit balances so as to limit their exposure to potential loss.

### Data Collection

We will be loading data from a credit card clients dataset. More information is available [here](https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients).  

Data are organized in tabular format with each record representing an individual customer. The target variable is 'DEFAULT', which represents whether the customer defaulted on their credit card payment. The data contain the following columns:  
| Variable Name | Role | Type | Description |
| ------------- | ---- | ---- | ----------- |
| LIMIT_BAL | Feature | Integer | Current credit borrowing limit |
| SEX | Feature | | |
| EDUCATION | Feature | | |
| MARRIAGE | Feature | | |
| AGE | Feature | | |
| PAY_0 | Feature | | |
| PAY_2 | Feature | | |
| PAY_3 | Feature | | |
| PAY_4 | Feature | | |
| PAY_5 | Feature | | |
| PAY_6 | Feature | | |
| BILL_AMT1 | Feature | | |
| BILL_AMT2 | Feature | | |
| BILL_AMT3 | Feature | | |
| BILL_AMT4 | Feature | | |
| BILL_AMT5 | Feature | | |
| BILL_AMT6 | Feature | | |
| PAY_AMT1 | Feature | | |
| PAY_AMT2 | Feature | | |
| PAY_AMT3 | Feature | | |
| PAY_AMT4 | Feature | | |
| PAY_AMT5 | Feature | | |
| PAY_AMT6 | Feature | | |
| DEFAULT | Target| |  |

The following line will load the data as a pandas dataframe.

In [None]:
credit_default_df = pd.read_csv("data/credit_card_default.csv")

### Data Profiling

In [None]:
print(credit_default_df.info())

In [None]:
print(credit_default_df.head())

In [None]:
print(credit_default_df.describe())

In [None]:
for feature, values in credit_default_df.items():
    if values.dtypes in ['int64', 'float64']:
        values.hist()
    else:
        values.value_counts().plot.barh()
    plt.title(f'{feature}') 
    plt.show()

In [None]:
categorical_cols = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
numeric_cols = ['LIMIT_BAL', 'BILL_AMT1', 'PAY_AMT1', 'BILL_AMT2', 'PAY_AMT2', 'BILL_AMT3', 'PAY_AMT3', 'BILL_AMT4', 'PAY_AMT4', 'BILL_AMT5', 'PAY_AMT5', 'BILL_AMT6', 'PAY_AMT6']
count_cols = []
target_cols = ['DEFAULT']
input_cols = [x for x in categorical_cols + numeric_cols + count_cols if x not in target_cols]
data_cols = input_cols + target_cols

In [None]:
df = credit_default_df[data_cols]

In [None]:
df.describe()

In [None]:
for target in target_cols:
    sns.pairplot(df, hue=target, corner=True)
    plt.show()

In [None]:
df = df.dropna()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[input_cols], df[target_cols], test_size=0.25, random_state=16)

### Model Specification

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', categorical_transformer, categorical_cols),
        ('numeric', numeric_transformer, numeric_cols),
    ])

In [None]:
# Dummy Model
dummy_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('dummy', DummyClassifier(strategy="most_frequent"))
])
dummy_pipeline.fit(X_train, y_train)
dummy_predicted = dummy_pipeline.predict(X_test)

In [None]:
# Logistic Model
logistic_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('logistic', LogisticRegression())
])
logistic_pipeline.fit(X_train, np.ravel(y_train))
logistic_predicted = logistic_pipeline.predict(X_test)

In [None]:
# Naive Bayes Classifier
nb_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('nb', GaussianNB())
])
nb_pipeline.fit(X_train, np.ravel(y_train))
nb_predicted = nb_pipeline.predict(X_test)

In [None]:
# SVM Model
svm_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('svm', LinearSVC())
])
svm_pipeline.fit(X_train, np.ravel(y_train))
svm_predicted = svm_pipeline.predict(X_test)

In [None]:
# Neural Network
nn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('nn', KerasClassifier(model=create_sequential_model(preprocessor.fit_transform(X_train).shape[1], ['accuracy']), epochs=10, batch_size=5, verbose=1))
])
nn_pipeline.fit(X_train, y_train)
nn_predicted = nn_pipeline.predict(X_test)

### Model Evaluation

In [None]:
print("Dummy Classification Performance Metrics:")
classifier_performance(y_test, dummy_predicted, {0: 'not default', 1: 'default'})

In [None]:
print("Logistic Classification Performance Metrics:")
classifier_performance(y_test, logistic_predicted, {0: 'not default', 1: 'default'})

In [None]:
print("Naive Bayes Classification Performance Metrics:")
classifier_performance(y_test, nb_predicted, {0: 'not default', 1: 'default'})

In [None]:
print("SVM Classification Performance Metrics:")
classifier_performance(y_test, svm_predicted, {0: 'not default', 1: 'default'})

In [None]:
print("Neural Network Classification Performance Metrics:")
classifier_performance(y_test, nn_predicted, {0: 'not default', 1: 'default'})

In [None]:
feature_importance(logistic_pipeline, 'preprocessor', 'logistic')

### Conclusion

Is this a good model?

## AI as Person

AI is terrible at acting like traditional software, but it is increasingly capable of acting like a person. What implications will that have for us, for the marketplace, and for society?