# Announcement

As with wrangling, I switched this notebook over to use the Titanic dataset for consistency. The video uses the Pima dataset.

The video ends rather abruptly. I hope the notebook makes clear what you need to do.

One thing I introduce is Upsampling. I promised I would do this way back in Chapter 2. I do not expect you to use it but thought it was worth demonstrating.

I also provide an optional make-up problem for you to get points back.

At the end of the notebook, I explore several ways of combining the four models. The most sophisticated is something called stacking. In essence we build a meta-model that we train with the output of the existing models. So this meta-model attempts to learn how to interpret the existing models' output.

<center>
<h1>Training and Tuning</h1>
</center>

<hr>

Once you are done here, you are ready to start playing with your server. Cool.

# How long does it take?

I think tuning time is the biggest issue for you now.
Using Pima data (training set = 614 rows) and what I consider an ok set of parameters to tune, this notebook takes me roughly 3 hours.

The takeaway is that as you tune each model, be aware that you might need to leave it running while you do something else.

The good news is that each model-tuning step is independent. Once you tune model X and save to GitHub, you are done with model X and can move on to model Y. The bad news is that if your dataset is larger, e.g., 5K rows, you can expect your times to be longer than mine.

The further bad news is that there is not an easy way to get a progress bar with HalvingSearch. So if you wait 30 minutes, you don't know if you are almost done or will take another 4 hours.

Here are some strategies to consider:

1. Use incremental tuning. Tune some subset of params. Get best values and fix them. Then take on new subset using fixed values from past. You can use this strategy with both halving search and keras tuner.

2. I have factor=3. You could increase it to reduce wait time. But my experimentation tells me you may not gain that much.

3. For keras_tuner, it's easier. You can play around with `max_trials`. Set it small to start, e.g., 5. You can count on linearity here. If 5 trials take 10 minutes, 50 are likely to take about 100 minutes.
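To get a feel for strategy 2, here is a rough sketch of how a successive-halving schedule shrinks the candidate list while growing the training sample each round. It is a simplified model, not scikit-learn's exact bookkeeping. The 81 candidates and the starting size of 20 rows are made-up numbers; 552 is the training-set size used later in this notebook. The cost figure (candidates times rows, summed over rounds) is only a crude proxy for wall-clock time.

```python
import math

def halving_schedule(n_candidates, min_resources, max_resources, factor=3):
    """Return a list of (n_candidates, n_rows) pairs, one per halving round.

    Simplified model: each round keeps roughly 1/factor of the candidates
    and gives the survivors factor times as many rows.
    """
    rounds = []
    resources = min_resources
    while True:
        rounds.append((n_candidates, resources))
        next_resources = resources * factor
        # Stop when one candidate is left or the next round would need more rows than we have.
        if n_candidates <= 1 or next_resources > max_resources:
            break
        n_candidates = math.ceil(n_candidates / factor)
        resources = next_resources
    return rounds

def rough_cost(rounds):
    """Crude proxy for work done: candidates times rows, summed over rounds."""
    return sum(n * r for n, r in rounds)

for factor in (2, 3, 4):
    rounds = halving_schedule(n_candidates=81, min_resources=20, max_resources=552, factor=factor)
    print(f"factor={factor}: rounds={rounds} rough cost={rough_cost(rounds)}")
```

Under these made-up numbers, moving from factor=3 to factor=4 trims the rough cost by about a fifth, which fits my experience that a larger factor does not buy you much.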

## Set-up

First bring in your library.

In [None]:
github_name = 'marvnc'
repo_name = 'cs523'
source_file = 'library.py'
url = f'https://raw.githubusercontent.com/{github_name}/{repo_name}/main/{source_file}'
# !rm $source_file
# !wget $url
# %run -i $source_file
from library import *

rm: cannot remove 'library.py': No such file or directory
--2025-06-06 04:52:02--  https://raw.githubusercontent.com/marvnc/cs523/main/library.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49866 (49K) [text/plain]
Saving to: ‘library.py’


2025-06-06 04:52:02 (6.41 MB/s) - ‘library.py’ saved [49866/49866]



## You need to change this url to point to your own dataset

It is also a good idea to rename variables that use "titanic" to something closer to your dataset.

In [2]:
# url = 'https://raw.githubusercontent.com/fickas/asynch_models/main/datasets/titanic_trimmed.csv'  #trimmed version
url = 'https://raw.githubusercontent.com/MarvNC/cs523/refs/heads/main/s25_final_cc_approvals_reduced.csv'

approvals_trimmed = pd.read_csv(url)
approvals_trimmed.head()


Unnamed: 0,Gender,Age,Debt,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Income,Approved
0,1,30.83,0.0,1.25,1,1,1,0,0,1
1,0,58.67,4.46,3.04,1,1,6,0,560,1
2,0,24.5,0.5,1.5,1,0,0,0,824,1
3,1,27.83,1.54,3.75,1,1,5,1,3,1
4,1,20.17,5.625,1.71,1,0,0,0,0,1


In [3]:
len(approvals_trimmed)

690

# Break out into features and labels



In [4]:
target_col = 'Approved'

approvals_features = approvals_trimmed.drop(columns=target_col)
labels = approvals_trimmed[target_col].to_list()

In [5]:
labels.count(1)/len(labels)

0.4449275362318841

## Load pipeline from Wrangling notebook

You will be doing this exact same thing in the server.

In [15]:
import joblib

# model_path = 'MarvNC/cs523/s25_'
full_path = f'https://github.com/MarvNC/cs523/raw/refs/heads/main/s25_final_fully_fitted_pipeline.pkl'
!rm 's25_final_fully_fitted_pipeline.pkl'
!wget $full_path
titanic_transformer = joblib.load("s25_final_fully_fitted_pipeline.pkl")


rm: cannot remove 's25_final_fully_fitted_pipeline.pkl': No such file or directory
--2025-06-06 04:56:16--  https://github.com/MarvNC/cs523/raw/refs/heads/main/s25_final_fully_fitted_pipeline.pkl
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/MarvNC/cs523/refs/heads/main/s25_final_fully_fitted_pipeline.pkl [following]
--2025-06-06 04:56:17--  https://raw.githubusercontent.com/MarvNC/cs523/refs/heads/main/s25_final_fully_fitted_pipeline.pkl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1327 (1.3K) [application/octet-stream]
Saving to: ‘s25_final_fully_fitted_pipeline.pkl’


2025-06-06 04:56:17

# Step I. Break into numpy datasets

In [17]:
%%capture
# rs = 74 #what you computed in wrangling notebook
rs = approvals_variance_based_split
# label_column = 'Survived'  #change to name of your label column
label_column = target_col

x_train,  x_test, y_train,  y_test = dataset_setup(approvals_trimmed, label_column, titanic_transformer, rs=rs)

In [18]:
len(x_train)

552

# II. Upsampling

In Chapter 2 I removed duplicates, giving us unique rows. However, it did shrink the table down to roughly 1000. I noted I would show you a way to build the table back up. I am going to use a popular method called SMOTE (Synthetic Minority Over-sampling Technique). I'll show you how to use it even though I do not expect you will need it here. But could come in handy later in your career.

Note that I am only applying it to the training data. I'd like to keep the test data pure: augment training, let test data stand.

You can find plenty of tutorials on SMOTE. Briefly, it generates new rows by
using existing rows as starting places and then interpolating values. So it
does not duplicate rows but tries to give you similar rows.
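The interpolation idea can be sketched in a few lines of numpy. This is not the imblearn implementation, just a toy version of the core step: pick a row, pick one of its nearest neighbors, and create a new row somewhere on the line between them. The function name `smote_like_sample` and the tiny `minority` array are made up for illustration. Real SMOTE does this within a single class at a time.

```python
import numpy as np

def smote_like_sample(X, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a row and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all rows.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a row is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest rows
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(X))                 # starting row
        j = neighbors[i, rng.integers(k)]        # one of its neighbors
        lam = rng.random()                       # how far to move toward the neighbor, in [0, 1)
        new_rows.append(X[i] + lam * (X[j] - X[i]))
    return np.array(new_rows)

# Made-up rows standing in for one class of transformed training data.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5], [2.5, 0.5], [0.5, 2.0]])
synthetic = smote_like_sample(minority, n_new=4, k=3)
print(synthetic)
```

Every synthetic row sits between two real rows, which is why the new rows look similar to existing ones without being copies.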

In [21]:
from imblearn.over_sampling import SMOTE

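# Calculate per-class targets for roughly 2000 total samples,
# keeping the original class proportions
target_total = 2000
pos_frac = np.sum(y_train == 1)/len(y_train)  # fraction of positive rows
neg_frac = np.sum(y_train == 0)/len(y_train)  # fraction of negative rows
target_0 = int(neg_frac * target_total)  # int() truncates, so the total can land just under target_total
target_1 = int(pos_frac * target_total)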

# Create SMOTE instance with specified sampling strategy
smote = SMOTE(sampling_strategy={0: target_0, 1: target_1}, random_state=42)
x_resampled, y_resampled = smote.fit_resample(x_train, y_train)  #requires transformed data - cannot handle categorical columns

# Verify the new distribution
print("New class distribution:")
print(f"Class 0: {sum(y_resampled == 0)} ({sum(y_resampled == 0)/len(y_resampled):.2%})")
print(f"Class 1: {sum(y_resampled == 1)} ({sum(y_resampled == 1)/len(y_resampled):.2%})")
print(f"Total samples: {len(y_resampled)}")

New class distribution:
Class 0: 1108 (55.43%)
Class 1: 891 (44.57%)
Total samples: 1999


In [23]:
# Use the upsampled data for training (comment out these two lines to keep the original training set).
# Prof said to upsample to 2k.

x_train = x_resampled
y_train = y_resampled

# III. Setup Lime

Reminder: Lime will help us explain to the user why we come up with the predictions we do.

In [24]:
%%capture
!pip install lime

In [25]:
import lime
from lime import lime_tabular

In [26]:
feature_names = approvals_features.columns.to_list()
print(feature_names)

['Gender', 'Age', 'Debt', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore', 'DriversLicense', 'Income']


### Set up the explainer before using

In [27]:
explainer = lime.lime_tabular.LimeTabularExplainer(x_train,
                    feature_names=feature_names,
                    training_labels=y_train,
                    class_names=[0,1], #label values
                    verbose=True,
                    mode='classification')



# IV. Write out to file

And move to GitHub.

In [28]:
!pip install dill
import dill as pickle
with open('lime_explainer.pkl', 'wb') as file:
    pickle.dump(explainer, file)

#read it back in just as a test
with open('lime_explainer.pkl', 'rb') as file:   #this will be in your webserver
    explainer2 = pickle.load(file)



# Minimal help from me with remainder of notebook

I can remind you of the steps you need for each model's tuning:

1. If using halving search, set up the grid. If using keras_tuner, set up the model builder with hp code. With keras_tuner, you will also need to define a validation set.

2. Get the best model found by tuning.

3. Run it on test set.

4. Produce threshold table.

5. Save both best model and threshold table out to GitHub so can load them back in with server.

I would avoid Run All here. Each model can be tuned separately, really in any order. But once you finish the steps above for one model, you don't want to waste time repeating them.
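As a reminder of the overall shape, here is a minimal sketch of the steps for one model using halving search. Everything specific in it is an assumption for illustration: the synthetic data from `make_classification`, the small grid, `min_resources=100`, and the file name `knn_model_sketch.joblib`. Your own grid, data, and file names will differ, and the threshold table should come from the `threshold_results` helper in your library.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (must come before the next import)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import HalvingGridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the transformed training and test sets.
X, y = make_classification(n_samples=600, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Step 1: set up the grid and the halving search.
param_grid = {'n_neighbors': [5, 15, 25], 'weights': ['uniform', 'distance'], 'p': [1, 2]}
search = HalvingGridSearchCV(
    KNeighborsClassifier(), param_grid,
    factor=3, min_resources=100,   # start with enough rows for the largest n_neighbors
    scoring='roc_auc', cv=5, random_state=0,
)
search.fit(X_tr, y_tr)

# Step 2: get the best model found by tuning.
best_model = search.best_estimator_

# Step 3: run it on the test set.
test_probs = best_model.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_probs)
print(search.best_params_, round(test_auc, 3))

# Step 4: produce the threshold table with the library helper, e.g.
#   result_df, fancy_df = threshold_results(thresholds, y_te, test_probs)

# Step 5: save the model and check that it loads back with identical predictions.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'knn_model_sketch.joblib')
    joblib.dump(best_model, model_path)
    reloaded = joblib.load(model_path)
    reloaded_probs = reloaded.predict_proba(X_te)[:, 1]
```

The round-trip in step 5 is worth doing before you push anything to GitHub, since the server will load the saved file rather than the in-memory model.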

# V. KNN tuning



In [29]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

### Follow the steps

# VI. Logistic Regression tuning



In [None]:
from sklearn.linear_model import LogisticRegressionCV

### Follow the steps

# VII. LGB tuning



In [None]:
from lightgbm import LGBMClassifier

### Follow the steps

# VIII. ANN tuning



In [None]:
!pip install keras-tuner -q
import keras_tuner

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Input
import tensorflow as tf
from tensorflow import keras

In [None]:
tf.keras.utils.set_random_seed(1234)  #need this for replication
tf.config.experimental.enable_op_determinism()  #ditto - https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)

In [None]:
import hashlib

def string_to_seed(string):
    # Create a hash of the string using SHA-256
    hash_object = hashlib.sha256(string.encode())
    # Convert first 8 bytes of hash to integer
    hash_int = int.from_bytes(hash_object.digest()[:8], 'big')
    return hash_int % (2**32 - 1)

In [None]:
early_stop_cb = tf.keras.callbacks.EarlyStopping(
    monitor='loss',
    min_delta=0,
    patience=10,
    verbose=0
)

### Follow the steps

# You should eventually have these files on GitHub

* LIME explainer
* tuned KNN model and associated threshold table
* tuned logistic regression model and associated threshold table
* tuned LGB (LightGBM) model and associated threshold table
* tuned ANN model and associated threshold table

# Optional make-up: Random Forest model

I will give you credit for one homework assignment in terms of points if you elect to take on this problem.

You will need to do two things: (1) tune and save your threshold table and model below, and (2) add the model to your production notebook (your last notebook that is part of the final). The latter is the trickier part, given you will actually have to change several places in the code I handed you for the server. But it is doable if you get an early jump on it.

In [None]:
#From chapter 12
from sklearn.ensemble import RandomForestClassifier

### Follow the steps

## You still need to change the production notebook

Find the places where you are loading models and thresholds and add the RF results. Find the places where you are doing predictions and add the RF prediction. Find the place where you are showing prediction results in html and add the RF prediction. Also add the threshold table.

This should not take long but will require you to pay attention to what you are doing to avoid screwing up what is already there.

# Just for your interest

There are several ways to combine the results of multiple models, four models in our case. We are using one of the ways in the server, but wanted to show you other options.

# IX. Voting - averaging binary

There are two ways I can see of voting when we have 4 models producing results. The first is to convert their output to binary, then simply look for a majority of either 0s or 1s. I added a twist: I fall back on the average probability for ties.

In [None]:
lgb_raw = final_lgb_model.predict_proba(x_test)[:,1]
knn_raw = final_knn_model.predict_proba(x_test)[:,1]
logreg_raw = final_logreg_model.predict_proba(x_test)[:,1]
ann_raw = final_ann_model.predict(x_test)[:,0]

In [None]:
yvotes = []
for i in range(len(y_test)):
  # wrap each comparison: + binds tighter than >=, so the unparenthesized form does not count votes
  the_vote = int(lgb_raw[i]>=.5) + int(logreg_raw[i]>=.5) + int(knn_raw[i]>=.5) + int(ann_raw[i]>=.5)
  if the_vote==2:
    #tie breaker - go to probabilities
    prob = (knn_raw[i]+logreg_raw[i]+lgb_raw[i]+ann_raw[i])/4
    the_winner = 1 if prob>=.5 else 0
  else:
    the_winner = 1 if the_vote>2 else 0
  yvotes.append(the_winner)
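The same voting rule (majority of binary votes, with the average probability as tie-breaker) can also be written without a loop. This is just a sketch: the function name `majority_vote` and the four short probability arrays are made up for illustration and stand in for the real model outputs.

```python
import numpy as np

def majority_vote(prob_columns, threshold=0.5):
    """Majority vote over per-model probabilities, falling back to the mean probability on ties."""
    probs = np.column_stack(prob_columns)        # shape (n_rows, n_models)
    n_models = probs.shape[1]
    votes = (probs >= threshold).sum(axis=1)     # number of models voting 1 for each row
    mean_prob = probs.mean(axis=1)
    tie = votes * 2 == n_models
    winners = np.where(tie, mean_prob >= threshold, votes * 2 > n_models)
    return winners.astype(int)

# Made-up probabilities for four rows from four models.
knn_demo = np.array([0.9, 0.2, 0.6, 0.4])
logreg_demo = np.array([0.8, 0.1, 0.7, 0.55])
lgb_demo = np.array([0.7, 0.6, 0.3, 0.3])
ann_demo = np.array([0.4, 0.3, 0.45, 0.6])

demo_votes = majority_vote([knn_demo, logreg_demo, lgb_demo, ann_demo])
print(demo_votes)
```

Rows 3 and 4 are both 2-2 ties. The average probability is above .5 for the first tie and below .5 for the second, so they resolve to 1 and 0.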

In [None]:
sum([1 if p>=.5 else 0 for p in ann_raw])/len(x_test)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, yvotes)
print(cm)


In [None]:
(cm[0,0]+cm[1,1])/len(y_test)  #accuracy 0.5665399239543726

Can now use it to compute precision and recall.

In [None]:
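def precision_recall(y_test, y_pred):
    # sklearn's confusion_matrix puts true labels on rows and predictions on columns,
    # so with labels [0, 1] the layout is [[tn, fp], [fn, tp]]
    cm = confusion_matrix(y_test, y_pred)
    tp = cm[1,1]
    fp = cm[0,1]
    fn = cm[1,0]
    prec = tp / (tp+fp)
    rec = tp / (tp+fn)
    return prec, rec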

precision, recall = precision_recall(y_test, yvotes)
print(f'Precision: {precision} Recall {recall}')

In [None]:
f1 = 2*(precision*recall)/(precision+recall)
f1

# X. Prob averaging

The second voting approach is not actually voting. Instead, take average of 4 raw probabilities and use result as final probability. Can then run that through threshold table.

This is what the server is doing to get the "Ensemble" value.

In [None]:
avg_yraw = []
for i in range(len(y_test)):
  prob = (knn_raw[i]+logreg_raw[i]+lgb_raw[i]+ann_raw[i])/4
  avg_yraw.append(prob)

In [None]:
result_df, fancy_df = threshold_results(np.linspace(0,1,19,endpoint=True), y_test, avg_yraw)
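If you are curious what running probabilities "through a threshold table" amounts to, here is a minimal sketch. It is not the `threshold_results` function from the library, whose internals I am not reproducing here. The function name `simple_threshold_table`, its columns, and the six made-up rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def simple_threshold_table(thresholds, y_true, y_prob):
    """For each threshold, binarize the probabilities and compute basic metrics."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    rows = []
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        tn = int(np.sum((y_pred == 0) & (y_true == 0)))
        # Guard against division by zero when nothing is predicted positive.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        rows.append({'threshold': t, 'precision': precision, 'recall': recall,
                     'f1': f1, 'accuracy': (tp + tn) / len(y_true)})
    return pd.DataFrame(rows)

# Made-up labels and averaged probabilities.
demo_labels = [0, 0, 1, 1, 1, 0]
demo_avg_prob = [0.2, 0.55, 0.7, 0.45, 0.9, 0.1]
demo_table = simple_threshold_table(np.linspace(0, 1, 5), demo_labels, demo_avg_prob)
print(demo_table.round(2))
```

Reading down the table shows the usual trade-off: low thresholds give high recall and weaker precision, and high thresholds do the reverse.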

# XI. Stacking

This is interesting in that it builds a whole separate model (a meta model) that takes the output of other base models, three in example below, and uses that as a row. So a row of 3 feature values, one from each of the base models.

I kind of like it. The meta model learns how to combine the outputs of base models, e.g., when to weight KNN higher than LGB, etc.


In [None]:
from sklearn.ensemble import StackingClassifier

estimators = [
     ('knn', KNeighborsClassifier(15, algorithm='ball_tree', p=1, weights='distance')),
    ('logreg', LogisticRegressionCV(Cs= 5, class_weight= None, cv= 5, max_iter= 500, solver= 'saga', penalty='l1', random_state=1234)),
    ('lgb', LGBMClassifier(boosting_type= 'gbdt',
                          class_weight= 'balanced',
                          learning_rate= 0.3,
                          max_depth= 5,
                          min_child_samples= 10,
                          n_estimators= 10,
                          num_leaves= 7,
                          random_state=1234),
    )
]
final_estimator = LogisticRegressionCV(random_state=1234)   #this is choice for meta model
clf = StackingClassifier(estimators=estimators, final_estimator=final_estimator)

In [None]:
%%capture
clf.fit(x_train, y_train)

In [None]:
yraw = clf.predict_proba(x_test)[:,1]
result_df, fancy_df = threshold_results(np.linspace(0,1,19,endpoint=True), y_test, yraw)
fancy_df
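To see what the meta model is actually trained on, here is a hand-rolled sketch of the stacking idea. It is a simplification, not a replacement for `StackingClassifier`. The data is synthetic, and I swapped in a small decision tree for LGB so the sketch only needs scikit-learn. The key point is that the meta model's training rows are out-of-fold probabilities, one column per base model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the transformed data.
X, y = make_classification(n_samples=500, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

base_models = {
    'knn': KNeighborsClassifier(15),
    'logreg': LogisticRegression(max_iter=500),
    'tree': DecisionTreeClassifier(max_depth=3, random_state=0),  # stand-in for LGB
}

# Each base model contributes one feature: its out-of-fold probability of class 1.
# Out-of-fold predictions keep the meta model from learning on leaked training fits.
meta_train = np.column_stack([
    cross_val_predict(model, X_tr, y_tr, cv=5, method='predict_proba')[:, 1]
    for model in base_models.values()
])

# The meta model learns how to weight the base models' outputs.
meta_model = LogisticRegression()
meta_model.fit(meta_train, y_tr)

# At prediction time, base models are refit on all training rows and feed the meta model.
meta_test = np.column_stack([
    model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    for model in base_models.values()
])
stacked_probs = meta_model.predict_proba(meta_test)[:, 1]

# One learned weight per base model.
print(dict(zip(base_models, meta_model.coef_[0].round(2))))
```

The printed coefficients give a rough read on how much the meta model leans on each base model for this particular data.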

<img src='https://www.dropbox.com/scl/fi/zilmy2diy1lg1tva9vurx/Screenshot-2025-02-07-at-8.38.53-AM.png?rlkey=006szbv5t0daha005eotxt9k2&raw=1' height=400>