<center>
<h1>Chapter 11</h1>
</center>

<hr>

Start getting our models ready for production.


In [None]:
github_name = 'MarvNC'
repo_name = 'cs523'
source_file = 'library.py'
url = f'https://raw.githubusercontent.com/{github_name}/{repo_name}/refs/heads/main/{source_file}'
!rm -f $source_file  #-f avoids an error if the file is not there yet
!wget $url
%run -i $source_file

--2025-05-19 08:08:55--  https://raw.githubusercontent.com/MarvNC/cs523/refs/heads/main/library.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47012 (46K) [text/plain]
Saving to: ‘library.py’


2025-05-19 08:08:55 (4.73 MB/s) - ‘library.py’ saved [47012/47012]



In [None]:
#to be compatible
titanic_variance_based_split = 107
customer_variance_based_split = 113

In [None]:

url = 'https://raw.githubusercontent.com/fickas/asynch_models/refs/heads/main/datasets/titanic_trimmed.csv'
titanic_trimmed = pd.read_csv(url)

In [None]:
titanic_features = titanic_trimmed.drop(columns='Survived')
titanic_features.head()  #print first 5 rows of the table

Unnamed: 0,Age,Gender,Class,Joined,Married,Fare
0,41.0,Male,C3,Southampton,0.0,7.0
1,21.0,Male,Crew,Southampton,0.0,0.0
2,13.0,Male,C3,Southampton,,20.0
3,16.0,Male,C3,Southampton,0.0,
4,,Male,C2,Cherbourg,0.0,24.0


In [None]:
labels = titanic_trimmed['Survived'].to_list()

In [None]:
%%capture
x_train, x_test, y_train, y_test = titanic_setup(titanic_trimmed)

In [None]:
x_train.std(axis=0) #array([0.75333128, 0.47741652, 1.03590395, 0.0872873 , 0.47611519, 1.23157575])

array([0.75333128, 0.47741652, 1.03590395, 0.0872873 , 0.47611519,
       1.23157575])

In [None]:
y_train[:5] #array([0, 0, 1, 1, 0])

array([0, 0, 1, 1, 0])

# I. The tuning problem

The problem is all of those parameters that each model carries. I said we can use the defaults for most of them and get OK results. But I would like to look at a way of actually trying different values to tune our models.

# II. Setting up our alternatives

The simplest way to lay out the space of parameter values we want to try is using a dictionary. The key is the parameter name and the value is a list of values to try for that parameter.
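To see what such a dictionary stands for, it can be expanded by hand into the individual combinations it describes. The sketch below uses a made-up two-parameter grid (`toy_grid` is hypothetical, not one of the grids used in this chapter) and only the standard library.

```python
from itertools import product

# Hypothetical toy grid: parameter name -> list of values to try
toy_grid = {
    'n_neighbors': [5, 15, 25],
    'weights': ['distance', 'uniform'],
}

# Every combination is one value picked from each list
keys = list(toy_grid)
combos = [dict(zip(keys, values)) for values in product(*toy_grid.values())]

print(len(combos))  # 3 * 2 = 6 combinations
print(combos[0])    # {'n_neighbors': 5, 'weights': 'distance'}
```

The number of combinations is the product of the list lengths, so adding one more parameter with 4 values would multiply the count by 4.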

Let's start with KNN since we looked at it first. Here are some of the parameters I would like to vary to tune the algorithm.

<img src='https://www.dropbox.com/s/8aqz8qyxcekcwj8/Screen%20Shot%202022-02-10%20at%209.55.46%20AM.png?raw=1' height=300>

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn_grid_raw = dict(n_neighbors=range(5,100,10),
                weights=('distance', 'uniform'),
                algorithm=('brute', 'kd_tree', 'ball_tree', 'auto'),
                p=(2,1)  #When p=1, manhattan_distance, when p=2 euclidean_distance.
)


### I am going to sort so we get the same results

I have played around with the raw grid by changing the ordering of keys and the ordering of values. Different orderings give different results. I think this is a bug in some of `sklearn`'s searching algorithms (to follow) and have held [lengthy conversations](https://github.com/scikit-learn/scikit-learn/issues/27740#issuecomment-1802334571) with the `sklearn` team about it. My take is that it is likely a bug but not high on their list to fix.

To keep us on the same page, I am going to sort the grid so we all use the same thing.

In [None]:
#sorts both keys and values

def sort_grid(grid):
  sorted_grid = grid.copy()

  #sort values - note that this will expand range for you
  for k,v in sorted_grid.items():
    sorted_grid[k] = sorted(sorted_grid[k], key=lambda x: (x is None, x))  #handles cases where None is an alternative value

  #sort keys
  sorted_grid = dict(sorted(sorted_grid.items()))

  return sorted_grid
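
The sort key used for the values is worth a quick look on its own. A minimal sketch, with a made-up list of values: the key turns each value into a pair whose first element says whether the value is `None`, so `None` always lands last and is never compared directly with a number.

```python
# Same sort key as in sort_grid: None goes last, everything else in natural order
values = [10, None, 3]
sorted_values = sorted(values, key=lambda x: (x is None, x))
print(sorted_values)  # [3, 10, None]
```

Calling plain `sorted(values)` on this list would raise a `TypeError`, because Python cannot order `None` against an integer.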

In [None]:
knn_grid = sort_grid(knn_grid_raw)
knn_grid

{'algorithm': ['auto', 'ball_tree', 'brute', 'kd_tree'],
 'n_neighbors': [5, 15, 25, 35, 45, 55, 65, 75, 85, 95],
 'p': [1, 2],
 'weights': ['distance', 'uniform']}

### How many different combinations?

In [None]:
from sklearn.model_selection import ParameterGrid

param_grid = ParameterGrid(knn_grid)  #a list of dictionaries, one for each combo
len(param_grid)  #160

160

### How many samples (rows)?

In [None]:
len(x_train)  #1050

1050

# III. Exhaustive search (Grid Search)

The early approach to the problem was to do an exhaustive search of all possible combinations. So if we have 100 unique combinations, we will build and train 100 separate models (or 500 if cv=5) and record their scores. At the end we take the combination that gives the best score.

This goes under the name Grid Search and there is an sklearn class for it: [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). I'm not going to discuss it much further because better search algorithms have come along.
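For reference, here is a minimal sketch of what an exhaustive search looks like. It is not part of the chapter's pipeline: the data is synthetic (from `make_classification`) and `toy_grid` is a hypothetical 4-combination grid, chosen only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 rows, 6 features, binary label
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical toy grid: 2 * 2 = 4 combinations
toy_grid = {'n_neighbors': [5, 15], 'weights': ['distance', 'uniform']}

# Every combination is trained and scored on every one of the 5 folds
search = GridSearchCV(KNeighborsClassifier(), toy_grid, scoring='roc_auc', cv=5)
search.fit(X, y)

print(len(search.cv_results_['params']))  # number of combinations tried
print(search.best_params_)
```

With 4 combinations and cv=5 this trains 20 models, plus one final refit. The 160-combination KNN grid above would need 800.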

# IV. Halving Search

This is the search algorithm that will be our go-to for the most part. It's kind of interesting. It works this way:

1. Start with a small number of rows. Train on those rows for all possible combinations. So at this point it is like Grid Search but with a much smaller part of the table.

2. Choose the top half of the candidates, i.e., cut the combinations in half, dropping the half that are the lowest scorers.

3. Double the rows and repeat training, but now on half the original combos.

4. Choose the top half of the candidates.

5. Double the rows.

6. Continue until we either (a) pick the top candidate from the last 2 standing, or (b) run out of rows. In the latter case, use the rows we have and the candidates remaining to select the best.
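The steps above can be sketched as a small schedule calculator. This is a simplified illustration of the halving idea, not sklearn's implementation. The function name `halving_schedule` is hypothetical, and the starting value of 30 rows is just an illustrative choice. The 160 candidates and 1050 rows are the counts computed earlier.

```python
from math import ceil

def halving_schedule(n_candidates, min_resources, max_resources, factor=2):
    """Return (iteration, rows used, candidates trained) for each round."""
    schedule = []
    n_resources = min_resources
    iteration = 0
    while n_resources <= max_resources:
        schedule.append((iteration, n_resources, n_candidates))
        if n_candidates <= factor:
            break  # the last few candidates were just compared, so we are done
        n_candidates = ceil(n_candidates / factor)  # keep the top 1/factor
        n_resources *= factor                       # give the survivors more rows
        iteration += 1
    return schedule

for row in halving_schedule(n_candidates=160, min_resources=30, max_resources=1050):
    print(row)
# (0, 30, 160)
# (1, 60, 80)
# (2, 120, 40)
# (3, 240, 20)
# (4, 480, 10)
# (5, 960, 5)
```

In this run the loop stops because of case (b): the next round would need 1920 rows, more than the 1050 available, so the winner is picked from the 5 candidates still standing.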




In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV

In [None]:
knn_model = KNeighborsClassifier()

### Potential `min_resources` problem in `HalvingGridSearchCV`

I again ran into strange results with the code below. After a [lengthy conversation](https://github.com/scikit-learn/scikit-learn/issues/27422) with the sklearn team, the issue was turned into a decision. I assume this is a choice on whether to issue a PR to make a fix.

If nothing else, looking at my conversation demonstrates what you can expect if you submit an issue (bug report) to any of the major libraries we use.



In [None]:
%%capture

#do the halving search
halving_cv = HalvingGridSearchCV(
    knn_model, knn_grid,  #our model and the parameter combos we want to try
    scoring="roc_auc",  #from chapter 10
    n_jobs=-1,  #use all available cpus
    min_resources=30,  #"exhaust" sets this to 20, which is non-optimal. Possible bug in algorithm. See https://github.com/scikit-learn/scikit-learn/issues/27422.
    factor=2,  #double samples and take top half of combos on each iteration
    cv=5, random_state=1234,
    refit=True,  #remembers the best combo and gives us back that model already trained and ready for testing
)

grid_result = halving_cv.fit(x_train, y_train)


In [None]:
grid_result.best_params_  #{'algorithm': 'auto', 'n_neighbors': 15, 'p': 1, 'weights': 'uniform'}

{'algorithm': 'auto', 'n_neighbors': 15, 'p': 1, 'weights': 'uniform'}

In [None]:
pd.set_option('display.max_colwidth', None)  #don't limit/elide text values in a cell
df = pd.DataFrame(grid_result.cv_results_)


In [None]:
df[['iter', 'n_resources', 'params', 'mean_test_score']][0:]

Unnamed: 0,iter,n_resources,params,mean_test_score
0,0,30,"{'algorithm': 'auto', 'n_neighbors': 5, 'p': 1, 'weights': 'distance'}",0.623889
1,0,30,"{'algorithm': 'auto', 'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}",0.681389
2,0,30,"{'algorithm': 'auto', 'n_neighbors': 5, 'p': 2, 'weights': 'distance'}",0.558889
3,0,30,"{'algorithm': 'auto', 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}",0.550278
4,0,30,"{'algorithm': 'auto', 'n_neighbors': 15, 'p': 1, 'weights': 'distance'}",0.621111
...,...,...,...,...
310,5,960,"{'algorithm': 'auto', 'n_neighbors': 15, 'p': 1, 'weights': 'uniform'}",0.803356
311,5,960,"{'algorithm': 'brute', 'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}",0.781739
312,5,960,"{'algorithm': 'auto', 'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}",0.782674
313,5,960,"{'algorithm': 'kd_tree', 'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}",0.782674


### Results might not match

When I rerun this code, my results do not always match this table. It is a mystery to me. There is randomness somewhere that is not being captured. There is a slight possibility it has to do with Colab and its runtime configuration changing on each run.

Notice the table is big, so I am only showing the first and last parts.

|index|iter|n\_resources|params|mean\_test\_score|
|---|---|---|---|---|
|0|0|30|\{'algorithm': 'auto', 'n\_neighbors': 5, 'p': 1, 'weights': 'distance'\}|0\.6238888888888889|
|1|0|30|\{'algorithm': 'auto', 'n\_neighbors': 5, 'p': 1, 'weights': 'uniform'\}|0\.6813888888888888|
|2|0|30|\{'algorithm': 'auto', 'n\_neighbors': 5, 'p': 2, 'weights': 'distance'\}|0\.558888888888889|
|3|0|30|\{'algorithm': 'auto', 'n\_neighbors': 5, 'p': 2, 'weights': 'uniform'\}|0\.5502777777777779|
|4|0|30|\{'algorithm': 'auto', 'n\_neighbors': 15, 'p': 1, 'weights': 'distance'\}|0\.6211111111111112|
|5|0|30|\{'algorithm': 'auto', 'n\_neighbors': 15, 'p': 1, 'weights': 'uniform'\}|0\.5113888888888889|
|6|0|30|\{'algorithm': 'auto', 'n\_neighbors': 15, 'p': 2, 'weights': 'distance'\}|0\.6255555555555555|
|7|0|30|\{'algorithm': 'auto', 'n\_neighbors': 15, 'p': 2, 'weights': 'uniform'\}|0\.6272222222222222|
|8|0|30|\{'algorithm': 'auto', 'n\_neighbors': 25, 'p': 1, 'weights': 'distance'\}|NaN|
|9|0|30|\{'algorithm': 'auto', 'n\_neighbors': 25, 'p': 1, 'weights': 'uniform'\}|0\.5|
|10|0|30|\{'algorithm': 'auto', 'n\_neighbors': 25, 'p': 2, 'weights': 'distance'\}|NaN|
|11|0|30|\{'algorithm': 'auto', 'n\_neighbors': 25, 'p': 2, 'weights': 'uniform'\}|NaN|
|12|0|30|\{'algorithm': 'auto', 'n\_neighbors': 35, 'p': 1, 'weights': 'distance'\}|NaN|
|13|0|30|\{'algorithm': 'auto', 'n\_neighbors': 35, 'p': 1, 'weights': 'uniform'\}|0\.5|
|14|0|30|\{'algorithm': 'auto', 'n\_neighbors': 35, 'p': 2, 'weights': 'distance'\}|NaN|
|15|0|30|\{'algorithm': 'auto', 'n\_neighbors': 35, 'p': 2, 'weights': 'uniform'\}|NaN|
|16|0|30|\{'algorithm': 'auto', 'n\_neighbors': 45, 'p': 1, 'weights': 'distance'\}|NaN|
|17|0|30|\{'algorithm': 'auto', 'n\_neighbors': 45, 'p': 1, 'weights': 'uniform'\}|0\.5|
|18|0|30|\{'algorithm': 'auto', 'n\_neighbors': 45, 'p': 2, 'weights': 'distance'\}|NaN|
|19|0|30|\{'algorithm': 'auto', 'n\_neighbors': 45, 'p': 2, 'weights': 'uniform'\}|NaN|
|20|0|30|\{'algorithm': 'auto', 'n\_neighbors': 55, 'p': 1, 'weights': 'distance'\}|NaN|
|21|0|30|\{'algorithm': 'auto', 'n\_neighbors': 55, 'p': 1, 'weights': 'uniform'\}|0\.5|
|22|0|30|\{'algorithm': 'auto', 'n\_neighbors': 55, 'p': 2, 'weights': 'distance'\}|NaN|
|23|0|30|\{'algorithm': 'auto', 'n\_neighbors': 55, 'p': 2, 'weights': 'uniform'\}|NaN|
|24|0|30|\{'algorithm': 'auto', 'n\_neighbors': 65, 'p': 1, 'weights': 'distance'\}|NaN|
|305|4|480|\{'algorithm': 'brute', 'n\_neighbors': 5, 'p': 1, 'weights': 'uniform'\}|0\.7849582023008617|
|306|4|480|\{'algorithm': 'ball\_tree', 'n\_neighbors': 5, 'p': 1, 'weights': 'uniform'\}|0\.7859519314949898|
|307|4|480|\{'algorithm': 'auto', 'n\_neighbors': 15, 'p': 1, 'weights': 'uniform'\}|0\.7726732286929819|
|308|4|480|\{'algorithm': 'ball\_tree', 'n\_neighbors': 15, 'p': 1, 'weights': 'uniform'\}|0\.7724003676627056|
|309|4|480|\{'algorithm': 'kd\_tree', 'n\_neighbors': 15, 'p': 1, 'weights': 'uniform'\}|0\.7726732286929819|
|310|5|960|\{'algorithm': 'auto', 'n\_neighbors': 15, 'p': 1, 'weights': 'uniform'\}|0\.8033556478595869|
|311|5|960|\{'algorithm': 'brute', 'n\_neighbors': 5, 'p': 1, 'weights': 'uniform'\}|0\.7817389707190797|
|312|5|960|\{'algorithm': 'auto', 'n\_neighbors': 5, 'p': 1, 'weights': 'uniform'\}|0\.7826741192258598|
|313|5|960|\{'algorithm': 'kd\_tree', 'n\_neighbors': 5, 'p': 1, 'weights': 'uniform'\}|0\.7826741192258598|
|314|5|960|\{'algorithm': 'ball\_tree', 'n\_neighbors': 5, 'p': 1, 'weights': 'uniform'\}|0\.7828490275397011|


### Note the last iteration has more than 2 candidates (i.e., 5)

We cannot halve again because we cannot double the rows: the next round would need 1920 rows and we only have 1050. So we choose the best among the 5 remaining.

## Test set

Notice I am using the model the search found, i.e., `grid_result.best_estimator_`.

In [None]:
best_knn_model = grid_result.best_estimator_
best_knn_model.score(x_test,y_test)  #0.7452471482889734

0.7452471482889734

# V. Build threshold table



In [None]:
ypos = best_knn_model.predict_proba(x_test)[:,1]

In [None]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test, ypos)
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.81,0.43
1,0.05,0.44,0.98,0.61,0.81,0.45
2,0.1,0.47,0.98,0.64,0.81,0.52
3,0.15,0.52,0.96,0.67,0.81,0.6
4,0.2,0.52,0.96,0.67,0.81,0.6
5,0.25,0.56,0.9,0.69,0.81,0.65
6,0.3,0.62,0.82,0.71,0.81,0.71
7,0.35,0.63,0.69,0.66,0.81,0.69
8,0.4,0.63,0.69,0.66,0.81,0.69
9,0.45,0.71,0.63,0.67,0.81,0.73


In [None]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.81,0.43
1,0.05,0.44,0.98,0.61,0.81,0.45
2,0.1,0.47,0.98,0.64,0.81,0.52
3,0.15,0.52,0.96,0.67,0.81,0.6
4,0.2,0.52,0.96,0.67,0.81,0.6
5,0.25,0.56,0.9,0.69,0.81,0.65
6,0.3,0.62,0.82,0.71,0.81,0.71
7,0.35,0.63,0.69,0.66,0.81,0.69
8,0.4,0.63,0.69,0.66,0.81,0.69
9,0.45,0.71,0.63,0.67,0.81,0.73


|index|threshold|precision|recall|f1|accuracy|auc|
|---|---|---|---|---|---|---|
|0|0\.0|0\.43|1\.0|0\.6|0\.43|0\.81|
|1|0\.05|0\.44|0\.98|0\.61|0\.45|0\.81|
|2|0\.1|0\.47|0\.98|0\.64|0\.52|0\.81|
|3|0\.15|0\.52|0\.96|0\.67|0\.6|0\.81|
|4|0\.2|0\.52|0\.96|0\.67|0\.6|0\.81|
|5|0\.25|0\.56|0\.9|0\.69|0\.65|0\.81|
|6|0\.3|0\.62|0\.82|0\.71|0\.71|0\.81|
|7|0\.35|0\.63|0\.69|0\.66|0\.69|0\.81|
|8|0\.4|0\.63|0\.69|0\.66|0\.69|0\.81|
|9|0\.45|0\.71|0\.63|0\.67|0\.73|0\.81|
|10|0\.5|0\.78|0\.57|0\.66|0\.75|0\.81|
|11|0\.55|0\.8|0\.52|0\.63|0\.73|0\.81|
|12|0\.6|0\.8|0\.52|0\.63|0\.73|0\.81|
|13|0\.65|0\.84|0\.46|0\.6|0\.73|0\.81|
|14|0\.7|0\.91|0\.43|0\.58|0\.73|0\.81|
|15|0\.75|0\.92|0\.39|0\.54|0\.72|0\.81|
|16|0\.8|0\.92|0\.39|0\.54|0\.72|0\.81|
|17|0\.85|0\.95|0\.32|0\.47|0\.7|0\.81|
|18|0\.9|1\.0|0\.25|0\.39|0\.67|0\.81|
|19|0\.95|1\.0|0\.14|0\.25|0\.63|0\.81|
|20|1\.0|1\.0|0\.14|0\.25|0\.63|0\.81|
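
A row of a table like this can be reproduced by hand. `threshold_results` comes from `library.py` and its implementation is not shown here, so the following is only a sketch of what one row might compute, not the library's code. The function name `metrics_at_threshold` and the tiny data set are made up. Treating `prob >= threshold` as a positive prediction is an assumption, chosen because it is consistent with recall being 1.0 at threshold 0.0 in the table.

```python
def metrics_at_threshold(y_true, y_prob, threshold):
    """Precision, recall, f1 and accuracy when prob >= threshold counts as positive."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Made-up labels and predicted probabilities, for illustration only
toy_labels = [1, 0, 1, 1, 0, 0]
toy_probs = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
row = metrics_at_threshold(toy_labels, toy_probs, 0.5)
print({k: round(v, 2) for k, v in row.items()})
# {'precision': 0.67, 'recall': 0.67, 'f1': 0.67, 'accuracy': 0.67}
```

Raising the threshold generally trades recall for precision, which is the pattern visible when reading down the table.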

# VI. Save your results!

I'll ask you to use them later in your web server. Save locally then move to your GitHub repo.

Note we are saving our pre-trained model. Nice. It is ready to do prediction once we load it into the backend of our webserver. No need to retrain it (or tune it).

In [None]:
result_df.to_csv('knn_thresholds.csv', index=False)

In [None]:
from joblib import dump
dump(best_knn_model, 'knn_model.joblib')

['knn_model.joblib']

### To load back in

Won't need this until later.

In [None]:
from joblib import load
knn_model2 = load('knn_model.joblib')
knn_model2.predict_proba(x_test)[:,1][:5]  #array([0.6       , 1.        , 1.        , 0.13333333, 0.06666667])

array([0.6       , 1.        , 1.        , 0.13333333, 0.06666667])

In [None]:
best_knn_model.predict_proba(x_test)[:,1][:5]  #array([0.6       , 1.        , 1.        , 0.13333333, 0.06666667])


array([0.6       , 1.        , 1.        , 0.13333333, 0.06666667])

## Congratulations!

You have your first model and threshold table saved and ready for use in production. When we run our webserver, we can use the loading commands to bring both in and will be ready to predict.
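
As a preview of how the saved table could be used, here is a minimal sketch under stated assumptions. Picking the threshold with the highest f1 is just one hypothetical policy, not something this chapter prescribes. The CSV text is written inline (a few rows copied from the KNN table above) rather than read from `knn_thresholds.csv`, so the example runs on its own. The probabilities are the first five printed earlier.

```python
import csv
import io

# A few rows copied from the KNN threshold table, in the layout to_csv(index=False) writes
csv_text = """threshold,precision,recall,f1,auc,accuracy
0.25,0.56,0.9,0.69,0.81,0.65
0.3,0.62,0.82,0.71,0.81,0.71
0.35,0.63,0.69,0.66,0.81,0.69
0.5,0.78,0.57,0.66,0.81,0.75
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical policy: use the threshold whose row has the best f1
best_row = max(rows, key=lambda r: float(r['f1']))
threshold = float(best_row['threshold'])

# First five predicted probabilities from the loaded model
probs = [0.6, 1.0, 1.0, 0.13333333, 0.06666667]
predictions = [int(p >= threshold) for p in probs]
print(threshold, predictions)  # 0.3 [1, 1, 1, 0, 0]
```

A different policy, such as requiring a minimum precision, would only change the `key` used to pick the row.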

Nice.

# Challenge 1
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

I'd like you to try your hand at tuning an algorithm called a Support Vector Machine or SVM. It is an attempt to improve plain old Regression. Check this picture out.

<img src='https://www.dropbox.com/s/8a16c9y7uybtgup/Screen%20Shot%202022-11-03%20at%201.03.38%20PM.png?raw=1' height=200>


### Here are the SVM parameters with defaults

<pre>
class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale',
  coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200,
  class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr',
  break_ties=False, random_state=None)
</pre>


Source: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html


Here is a tutorial if you want to dig deeper: [SVM tutorial](https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/). It helps explain the meaning of some of the parameters.

### First let's try it with all default parameters

Except I want probabilities, so I have to set `probability` to `True`.

In [None]:
from sklearn.svm import SVC
svc_model = SVC(probability=True, random_state=1)  #needs to be True to get probabilities out
svc_model.fit(x_train, y_train)
ypos = svc_model.predict_proba(x_test)[:,1]
ypos[-10:]


array([0.24144045, 0.78004767, 0.81007289, 0.23855765, 0.23730804,
       0.22015847, 0.75452366, 0.23910634, 0.80959319, 0.76831821])

In [None]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test, ypos)
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.75,0.43
1,0.05,0.43,1.0,0.6,0.75,0.43
2,0.1,0.43,1.0,0.6,0.75,0.43
3,0.15,0.44,1.0,0.61,0.75,0.45
4,0.2,0.44,1.0,0.61,0.75,0.45
5,0.25,0.6,0.63,0.61,0.75,0.65
6,0.3,0.68,0.61,0.64,0.75,0.71
7,0.35,0.69,0.6,0.64,0.75,0.71
8,0.4,0.7,0.59,0.64,0.75,0.71
9,0.45,0.71,0.59,0.64,0.75,0.72


In [None]:
result_df

|index|threshold|precision|recall|f1|accuracy|auc|
|---|---|---|---|---|---|---|
|0|0\.0|0\.43|1\.0|0\.6|0\.43|0\.75|
|1|0\.05|0\.43|1\.0|0\.6|0\.43|0\.75|
|2|0\.1|0\.43|1\.0|0\.6|0\.43|0\.75|
|3|0\.15|0\.44|1\.0|0\.61|0\.45|0\.75|
|4|0\.2|0\.44|1\.0|0\.61|0\.45|0\.75|
|5|0\.25|0\.6|0\.63|0\.61|0\.65|0\.75|
|6|0\.3|0\.68|0\.61|0\.64|0\.71|0\.75|
|7|0\.35|0\.69|0\.6|0\.64|0\.71|0\.75|
|8|0\.4|0\.7|0\.59|0\.64|0\.71|0\.75|
|9|0\.45|0\.71|0\.59|0\.64|0\.72|0\.75|
|10|0\.5|0\.71|0\.58|0\.64|0\.71|0\.75|
|11|0\.55|0\.72|0\.58|0\.64|0\.72|0\.75|
|12|0\.6|0\.73|0\.58|0\.65|0\.73|0\.75|
|13|0\.65|0\.77|0\.57|0\.66|0\.74|0\.75|
|14|0\.7|0\.77|0\.56|0\.65|0\.74|0\.75|
|15|0\.75|0\.79|0\.46|0\.59|0\.71|0\.75|
|16|0\.8|0\.94|0\.14|0\.24|0\.62|0\.75|
|17|0\.85|1\.0|0\.02|0\.03|0\.57|0\.75|
|18|0\.9|0\.0|0\.0|0\.0|0\.57|0\.75|
|19|0\.95|0\.0|0\.0|0\.0|0\.57|0\.75|
|20|1\.0|0\.0|0\.0|0\.0|0\.57|0\.75|

### Your job

First set up a grid to tune 5 separate parameters:

* For `C`, try the values `1,2,3`.
* For `gamma`, try both of its string values; exclude `float`.
* For `shrinking`, try all its values.
* For `kernel`, try everything except `'precomputed'` and `'callable'`.
* For `max_iter`, try `5000`, `10000`, `-1`.

In [None]:
svc_grid_raw = {
    'C': [1,2,3],
    'gamma': ['auto', 'scale'],
    'shrinking': [True, False],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'max_iter': [5000, 10000, -1]
}


In [None]:
svc_grid = sort_grid(svc_grid_raw)
svc_grid

{'C': [1, 2, 3],
 'gamma': ['auto', 'scale'],
 'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
 'max_iter': [-1, 5000, 10000],
 'shrinking': [False, True]}

### How many different combinations?

In [None]:
param_grid = ParameterGrid(svc_grid)
len(param_grid)  #144

144

### How many samples (rows)?

In [None]:
len(x_train)  #1050

1050

### Now run `HalvingGridSearchCV` on your grid

You can copy code from above and make changes needed.

In [None]:
svc_model = SVC(probability=True, random_state=1)  #base model

In [None]:
%%capture

#do the halving search
halving_cv = HalvingGridSearchCV(
    svc_model, svc_grid,  #our model and the parameter combos we want to try
    scoring="roc_auc",  #from chapter 10
    n_jobs=-1,  #use all available cpus
    min_resources=20,  #"exhaust" sets this to 20, which is non-optimal. Possible bug in algorithm. See https://github.com/scikit-learn/scikit-learn/issues/27422.
    factor=2,  #double samples and take top half of combos on each iteration
    cv=5, random_state=1234,
    refit=True,  #remembers the best combo and gives us back that model already trained and ready for testing
)

grid_result = halving_cv.fit(x_train, y_train)


In [None]:
df = pd.DataFrame(grid_result.cv_results_)
df[['iter', 'n_resources', 'params', 'mean_test_score']][0:]

Unnamed: 0,iter,n_resources,params,mean_test_score
0,0,20,"{'C': 1, 'gamma': 'auto', 'kernel': 'linear', 'max_iter': -1, 'shrinking': False}",0.800000
1,0,20,"{'C': 1, 'gamma': 'auto', 'kernel': 'linear', 'max_iter': -1, 'shrinking': True}",0.800000
2,0,20,"{'C': 1, 'gamma': 'auto', 'kernel': 'linear', 'max_iter': 5000, 'shrinking': False}",0.800000
3,0,20,"{'C': 1, 'gamma': 'auto', 'kernel': 'linear', 'max_iter': 5000, 'shrinking': True}",0.800000
4,0,20,"{'C': 1, 'gamma': 'auto', 'kernel': 'linear', 'max_iter': 10000, 'shrinking': False}",0.800000
...,...,...,...,...
279,5,640,"{'C': 3, 'gamma': 'auto', 'kernel': 'poly', 'max_iter': 5000, 'shrinking': False}",0.759464
280,5,640,"{'C': 3, 'gamma': 'auto', 'kernel': 'poly', 'max_iter': 10000, 'shrinking': False}",0.759464
281,5,640,"{'C': 3, 'gamma': 'auto', 'kernel': 'poly', 'max_iter': 10000, 'shrinking': True}",0.759464
282,5,640,"{'C': 3, 'gamma': 'auto', 'kernel': 'poly', 'max_iter': -1, 'shrinking': True}",0.759464


## My results

<img src='https://www.dropbox.com/scl/fi/bcesm38n76kdzeivcedws/Screenshot-2025-05-12-at-10.07.34-AM.png?rlkey=kccljqo9orpbtho210v23hrar&raw=1' height=400>

In [None]:
grid_result.best_params_

{'C': 3,
 'gamma': 'auto',
 'kernel': 'poly',
 'max_iter': 5000,
 'shrinking': False}

### My best params after search

<pre>
{'C': 3,                 #different than default
 'gamma': 'auto',        #different than default
 'kernel': 'poly',       #different than default
 'max_iter': 5000,       #different than default (kind of)
 'shrinking': False}     #different than default
</pre>

So all 5 of the 5 were assigned something other than the default.

In [None]:
best_svc_model = grid_result.best_estimator_  #get best model because we have refit=True

### Here are all the parameter values for reference

In [None]:
best_svc_model.get_params()

{'C': 3,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'auto',
 'kernel': 'poly',
 'max_iter': 5000,
 'probability': True,
 'random_state': 1,
 'shrinking': False,
 'tol': 0.001,
 'verbose': False}

### Build threshold table and see if tuned better than defaults

In [None]:
ypos = best_svc_model.predict_proba(x_test)[:,1]

In [None]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test, ypos)
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.74,0.43
1,0.05,0.43,1.0,0.6,0.74,0.43
2,0.1,0.43,1.0,0.6,0.74,0.43
3,0.15,0.44,1.0,0.61,0.74,0.44
4,0.2,0.44,1.0,0.61,0.74,0.45
5,0.25,0.49,0.76,0.6,0.74,0.56
6,0.3,0.61,0.62,0.62,0.74,0.67
7,0.35,0.67,0.6,0.63,0.74,0.7
8,0.4,0.69,0.58,0.63,0.74,0.71
9,0.45,0.7,0.57,0.63,0.74,0.71


In [None]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.74,0.43
1,0.05,0.43,1.0,0.6,0.74,0.43
2,0.1,0.43,1.0,0.6,0.74,0.43
3,0.15,0.44,1.0,0.61,0.74,0.44
4,0.2,0.44,1.0,0.61,0.74,0.45
5,0.25,0.49,0.76,0.6,0.74,0.56
6,0.3,0.61,0.62,0.62,0.74,0.67
7,0.35,0.67,0.6,0.63,0.74,0.7
8,0.4,0.69,0.58,0.63,0.74,0.71
9,0.45,0.7,0.57,0.63,0.74,0.71


|index|threshold|precision|recall|f1|accuracy|auc|
|---|---|---|---|---|---|---|
|0|0\.0|0\.43|1\.0|0\.6|0\.43|0\.74|
|1|0\.05|0\.43|1\.0|0\.6|0\.43|0\.74|
|2|0\.1|0\.43|1\.0|0\.6|0\.43|0\.74|
|3|0\.15|0\.44|1\.0|0\.61|0\.44|0\.74|
|4|0\.2|0\.44|1\.0|0\.61|0\.45|0\.74|
|5|0\.25|0\.49|0\.76|0\.6|0\.56|0\.74|
|6|0\.3|0\.61|0\.62|0\.62|0\.67|0\.74|
|7|0\.35|0\.67|0\.6|0\.63|0\.7|0\.74|
|8|0\.4|0\.69|0\.58|0\.63|0\.71|0\.74|
|9|0\.45|0\.7|0\.57|0\.63|0\.71|0\.74|
|10|0\.5|0\.72|0\.55|0\.62|0\.71|0\.74|
|11|0\.55|0\.75|0\.54|0\.63|0\.72|0\.74|
|12|0\.6|0\.78|0\.53|0\.63|0\.73|0\.74|
|13|0\.65|0\.84|0\.47|0\.61|0\.73|0\.74|
|14|0\.7|0\.86|0\.43|0\.57|0\.72|0\.74|
|15|0\.75|0\.87|0\.39|0\.54|0\.71|0\.74|
|16|0\.8|0\.91|0\.35|0\.51|0\.7|0\.74|
|17|0\.85|0\.94|0\.28|0\.43|0\.68|0\.74|
|18|0\.9|0\.95|0\.17|0\.28|0\.63|0\.74|
|19|0\.95|0\.9|0\.08|0\.15|0\.6|0\.74|
|20|1\.0|0\.0|0\.0|0\.0|0\.57|0\.74|

### Kind of interesting

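Tuning gives roughly the same results as using the defaults. In the tables above, AUC is 0.74 for the tuned model versus 0.75 for the defaults, and the best f1 is 0.63 versus 0.66.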

## Storing table and model

I don't plan to use SVC in our production system. But there is nothing stopping you from saving the model and table in the same way we did with KNN previously. Then it could be available to you if you ever wanted to use it.


# Challenge 2
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

Let's build a function to do Halving Search. Below is my start. You can find most of the code above.

Note that I am defaulting `min_resources` to `"exhaust"` even though I have my doubts about how well that value is chosen by the algorithm. This brings up the whole question of meta-tuning: tuning the tuner, i.e., tuning the tuner parameters! I won't go down that rabbit hole in our class but it is something to consider.



In [None]:
def halving_search(model, grid, x_train, y_train, factor=2, min_resources="exhaust", scoring='roc_auc'):
  halving_cv = HalvingGridSearchCV(
      model, grid,  #our model and the parameter combos we want to try
      scoring=scoring,  #from chapter 10
      n_jobs=-1,  #use all available cpus
      min_resources=min_resources,  #"exhaust" sets this to 20, which is non-optimal. Possible bug in algorithm. See https://github.com/scikit-learn/scikit-learn/issues/27422.
      factor=factor,  #double samples and take top half of combos on each iteration
      cv=5, random_state=1234,
      refit=True,  #remembers the best combo and gives us back that model already trained and ready for testing
  )
  return halving_cv.fit(x_train, y_train)
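
For the curious, here is a sketch of how a starting value of 20 can arise under `"exhaust"`. The formula is an assumption based on a reading of the scikit-learn source, not a documented guarantee, and the function name `exhaust_min_resources` is hypothetical. The idea: take a small floor that depends on the number of folds and classes, then raise it if the data is large enough that the last halving round would still fit.

```python
def exhaust_min_resources(n_candidates, n_samples, n_splits=5, n_classes=2, factor=2):
    """Assumed rule for the starting row count when min_resources='exhaust'."""
    # Assumed floor: folds * 2 * classes, so each fold can hold a few rows per class
    smallest = n_splits * 2 * n_classes
    # Number of halvings needed to shrink n_candidates down to roughly one
    last_iteration = 0
    while factor ** (last_iteration + 1) <= n_candidates:
        last_iteration += 1
    # Start as large as possible while the last round still fits in n_samples
    return max(smallest, n_samples // factor ** last_iteration)

print(exhaust_min_resources(n_candidates=160, n_samples=1050))  # 20
print(exhaust_min_resources(n_candidates=4, n_samples=1050))    # 262
```

Under this reading, with 160 candidates the search would need 7 halvings, and 1050 rows divided by 2 to the 7th is only 8, so the floor of 20 wins. The search then runs out of rows long before it gets down to 2 candidates.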

### Let's put it to use

Let's tune KNN on the Titanic. But to make sure we did not hard-code anything into the function, we will try it on a subset of the Titanic data.

In [None]:
len(titanic_trimmed)

1313

In [None]:
%%capture
x_train, x_test, y_train, y_test = titanic_setup(titanic_trimmed[:1000])  #first 1000 rows

In [None]:
x_train.std(axis=0)  #array([0.71636443, 0.44460481, 1.02148149, 0.06290918, 0.47761641, 1.20103094])

array([0.71636443, 0.44460481, 1.02148149, 0.06290918, 0.47761641,
       1.20103094])

In [None]:
knn_model = KNeighborsClassifier()

### Try out your new function

In [None]:
%%capture
grid_result = halving_search(knn_model, knn_grid, x_train, y_train)
best_model = grid_result.best_estimator_

In [None]:
grid_result.best_params_  #{'algorithm': 'ball_tree', 'n_neighbors': 25, 'p': 1, 'weights': 'uniform'}

{'algorithm': 'ball_tree', 'n_neighbors': 25, 'p': 1, 'weights': 'uniform'}

### Build threshold table

In [None]:
ypos = best_model.predict_proba(x_test)[:,1]

In [None]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test, ypos)
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.26,1.0,0.41,0.79,0.26
1,0.05,0.29,0.96,0.45,0.79,0.4
2,0.1,0.34,0.96,0.5,0.79,0.5
3,0.15,0.35,0.92,0.51,0.79,0.55
4,0.2,0.4,0.88,0.55,0.79,0.64
5,0.25,0.44,0.69,0.53,0.79,0.7
6,0.3,0.53,0.61,0.56,0.79,0.76
7,0.35,0.54,0.53,0.53,0.79,0.76
8,0.4,0.67,0.47,0.55,0.79,0.8
9,0.45,0.87,0.39,0.54,0.79,0.83


In [None]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.26,1.0,0.41,0.79,0.26
1,0.05,0.29,0.96,0.45,0.79,0.4
2,0.1,0.34,0.96,0.5,0.79,0.5
3,0.15,0.35,0.92,0.51,0.79,0.55
4,0.2,0.4,0.88,0.55,0.79,0.64
5,0.25,0.44,0.69,0.53,0.79,0.7
6,0.3,0.53,0.61,0.56,0.79,0.76
7,0.35,0.54,0.53,0.53,0.79,0.76
8,0.4,0.67,0.47,0.55,0.79,0.8
9,0.45,0.87,0.39,0.54,0.79,0.83


### Mixed results

Notice we have higher accuracy but lower f1 scores. Seems we are doing really well on Precision but not Recall, or vice versa. The f1 score tells the story while accuracy is misleading.

|index|threshold|precision|recall|f1|accuracy|auc|
|---|---|---|---|---|---|---|
|0|0\.0|0\.26|1\.0|0\.41|0\.26|0\.79|
|1|0\.05|0\.29|0\.96|0\.45|0\.4|0\.79|
|2|0\.1|0\.34|0\.96|0\.5|0\.5|0\.79|
|3|0\.15|0\.35|0\.92|0\.51|0\.55|0\.79|
|4|0\.2|0\.4|0\.88|0\.55|0\.64|0\.79|
|5|0\.25|0\.44|0\.69|0\.53|0\.7|0\.79|
|6|0\.3|0\.53|0\.61|0\.56|0\.76|0\.79|
|7|0\.35|0\.54|0\.53|0\.53|0\.76|0\.79|
|8|0\.4|0\.67|0\.47|0\.55|0\.8|0\.79|
|9|0\.45|0\.87|0\.39|0\.54|0\.83|0\.79|
|10|0\.5|0\.86|0\.35|0\.5|0\.82|0\.79|
|11|0\.55|0\.83|0\.29|0\.43|0\.8|0\.79|
|12|0\.6|0\.94|0\.29|0\.45|0\.82|0\.79|
|13|0\.65|0\.9|0\.18|0\.3|0\.78|0\.79|
|14|0\.7|0\.89|0\.16|0\.27|0\.78|0\.79|
|15|0\.75|0\.83|0\.1|0\.18|0\.76|0\.79|
|16|0\.8|0\.83|0\.1|0\.18|0\.76|0\.79|
|17|0\.85|0\.67|0\.04|0\.07|0\.75|0\.79|
|18|0\.9|0\.0|0\.0|0\.0|0\.74|0\.79|
|19|0\.95|0\.0|0\.0|0\.0|0\.74|0\.79|
|20|1\.0|0\.0|0\.0|0\.0|0\.74|0\.79|
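
The bottom rows of the table show why accuracy alone can mislead. A minimal sketch with made-up data that has the same class balance as this test set (about 26% positives, which is the precision at threshold 0.0): a "model" that never predicts the positive class still gets 74% accuracy.

```python
# Hypothetical test set: 26 positives out of 100, matching the table's class balance
y_true = [1] * 26 + [0] * 74
y_pred = [0] * 100  # always predict the negative class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)
# With no positive predictions, precision is undefined; f1 is treated as 0 here
f1 = 0.0 if true_positives == 0 else None

print(accuracy, recall, f1)  # 0.74 0.0 0.0
```

That matches the rows for thresholds 0.9 and above: accuracy 0.74 with precision, recall and f1 all at 0.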

# Challenge 3
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

Go ahead and try KNN and halving search on the Customer dataset. Use the same grid.

You don't have to save your results to drive.



In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQPM6PqZXgmAHfRYTcDZseyALRyVwkBtKEo_rtaKq_C7T0jycWxH6QVEzTzJCRA0m8Vz0k68eM9tDm-/pub?output=csv'

In [None]:
customers_df = pd.read_csv(url)
customers_trimmed = customers_df.drop(columns='ID')  #this is a useless column which we will drop early
customers_trimmed = customers_trimmed.drop_duplicates(ignore_index=True)  #get rid of any duplicates
customers_trimmed.head()

Unnamed: 0,Gender,Experience Level,Time Spent,OS,ISP,Age,Rating
0,Female,medium,,iOS,Xfinity,,0
1,Male,medium,71.97,Android,Cox,50.0,0
2,Female,medium,101.81,,Cox,49.0,1
3,Female,medium,86.37,Android,Xfinity,53.0,0
4,Female,medium,103.97,iOS,Xfinity,58.0,0


In [None]:
%%capture
x_train_cust, x_test_cust, y_train_cust, y_test_cust = customer_setup(customers_trimmed, customer_transformer)

In [None]:
x_train_cust.std(axis=0)

array([0.45875063, 0.43511254, 0.75411243, 0.45929552, 0.04987596,
       0.62993528])

In [None]:
knn_model = KNeighborsClassifier()  #need to build new model so don't reuse old one

Use your function from Challenge 2.

In [None]:
%%capture
grid_result = halving_search(knn_model, knn_grid, x_train_cust, y_train_cust)
best_model = grid_result.best_estimator_

In [None]:
grid_result.best_params_

{'algorithm': 'auto', 'n_neighbors': 45, 'p': 1, 'weights': 'distance'}

In [None]:
best_knn_model = grid_result.best_estimator_

In [None]:
ypos = best_knn_model.predict_proba(x_test_cust)[:,1]
ypos[:5]

array([0.08295874, 0.11475831, 0.04429103, 0.72115918, 0.36913518])

### Build threshold table



In [None]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test_cust, ypos)
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.32,1.0,0.49,0.83,0.32
1,0.05,0.35,0.97,0.51,0.83,0.41
2,0.1,0.4,0.92,0.56,0.83,0.54
3,0.15,0.48,0.86,0.61,0.83,0.65
4,0.2,0.56,0.78,0.65,0.83,0.73
5,0.25,0.65,0.75,0.7,0.83,0.79
6,0.3,0.74,0.71,0.73,0.83,0.83
7,0.35,0.79,0.65,0.71,0.83,0.83
8,0.4,0.8,0.56,0.65,0.83,0.81
9,0.45,0.82,0.51,0.63,0.83,0.81


In [None]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.32,1.0,0.49,0.83,0.32
1,0.05,0.35,0.97,0.51,0.83,0.41
2,0.1,0.4,0.92,0.56,0.83,0.54
3,0.15,0.48,0.86,0.61,0.83,0.65
4,0.2,0.56,0.78,0.65,0.83,0.73
5,0.25,0.65,0.75,0.7,0.83,0.79
6,0.3,0.74,0.71,0.73,0.83,0.83
7,0.35,0.79,0.65,0.71,0.83,0.83
8,0.4,0.8,0.56,0.65,0.83,0.81
9,0.45,0.82,0.51,0.63,0.83,0.81


|index|threshold|precision|recall|f1|accuracy|auc|
|---|---|---|---|---|---|---|
|0|0\.0|0\.32|1\.0|0\.49|0\.32|0\.89|
|1|0\.05|0\.34|1\.0|0\.5|0\.36|0\.89|
|2|0\.1|0\.36|1\.0|0\.53|0\.43|0\.89|
|3|0\.15|0\.49|0\.9|0\.63|0\.66|0\.89|
|4|0\.2|0\.57|0\.86|0\.68|0\.74|0\.89|
|5|0\.25|0\.64|0\.79|0\.71|0\.79|0\.89|
|6|0\.3|0\.77|0\.75|0\.76|0\.85|0\.89|
|7|0\.35|0\.85|0\.65|0\.74|0\.85|0\.89|
|8|0\.4|0\.89|0\.65|0\.75|0\.86|0\.89|
|9|0\.45|0\.93|0\.63|0\.75|0\.87|0\.89|
|10|0\.5|0\.95|0\.6|0\.74|0\.86|0\.89|
|11|0\.55|0\.97|0\.57|0\.72|0\.86|0\.89|
|12|0\.6|1\.0|0\.54|0\.7|0\.85|0\.89|
|13|0\.65|1\.0|0\.46|0\.63|0\.83|0\.89|
|14|0\.7|1\.0|0\.43|0\.6|0\.82|0\.89|
|15|0\.75|1\.0|0\.35|0\.52|0\.79|0\.89|
|16|0\.8|1\.0|0\.35|0\.52|0\.79|0\.89|
|17|0\.85|1\.0|0\.27|0\.42|0\.77|0\.89|
|18|0\.9|1\.0|0\.08|0\.15|0\.7|0\.89|
|19|0\.95|0\.0|0\.0|0\.0|0\.68|0\.89|
|20|1\.0|0\.0|0\.0|0\.0|0\.68|0\.89|
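
For intuition about where each row comes from, here is a minimal sketch of the arithmetic behind a single threshold. It is not the `threshold_results` function from the library: the name `threshold_row`, the `>=` comparison, and the convention of reporting 0.0 when a ratio has a zero denominator are assumptions, and the toy labels and probabilities are made up for illustration.

```python
def threshold_row(threshold, y_true, y_prob):
    """Compute precision, recall, f1 and accuracy at one threshold (hypothetical helper)."""
    # Assumption: a probability at or above the threshold counts as a positive prediction.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]

    # Count the four confusion-matrix cells.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Assumption: report 0.0 instead of dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {'threshold': threshold, 'precision': precision,
            'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Toy data, not taken from the customer dataset.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_probs = [0.9, 0.4, 0.6, 0.2, 0.1, 0.7]

for t in (0.0, 0.5, 1.0):
    print(threshold_row(t, toy_labels, toy_probs))
```

At threshold 0.0 everything is predicted positive, so recall is 1.0 and precision equals the share of positives, which is the pattern in the first row of the tables above. The `auc` column is the same in every row because it is computed from the probabilities themselves, not from the thresholded predictions.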

# Challenge 4
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

Try a halving search with `LogisticRegressionCV` on the Titanic data. Explore these alternatives:

* Cs: (5,10,15)

* cv: (3,5,10)

* solver: ('newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga')

* max_iter: (10,100,500,1000)
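
As a reminder of the idea behind a halving search, here is a toy sketch of successive halving in plain Python. It is a simplification, not scikit-learn's implementation and not the `halving_search` function from the library; the names `successive_halving` and `toy_score`, the factor of 3, and the resource numbers are all assumptions chosen for illustration.

```python
import math

def successive_halving(candidates, score, min_resources, max_resources, factor=3):
    """Toy successive halving: returns (winner, history of (resources, n_candidates))."""
    survivors = list(candidates)
    resources = min_resources
    history = []
    while len(survivors) > 1 and resources <= max_resources:
        history.append((resources, len(survivors)))
        # Rank every surviving candidate using only the current budget.
        ranked = sorted(survivors, key=lambda c: score(c, resources), reverse=True)
        # Keep roughly the top 1/factor, then give the survivors a bigger budget.
        survivors = ranked[:max(1, math.ceil(len(ranked) / factor))]
        resources *= factor
    return survivors[0], history

def toy_score(candidate, resources):
    # Hypothetical score: candidates closer to 7 are better. A real search would
    # return a cross-validated metric computed on `resources` training rows.
    return -abs(candidate - 7)

winner, history = successive_halving(range(1, 10), toy_score,
                                     min_resources=10, max_resources=90)
print(winner, history)
```

The point of the schedule is that many candidates are compared cheaply on a small budget, and only the few that survive get evaluated on a large one.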



In [None]:
%%capture
x_trained, x_test, y_train, y_test = titanic_setup(titanic_trimmed)

In [None]:
#create grid

logreg_grid_raw = {
    'Cs': [5,10,15],
    'cv': [3,5,10],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter': [10,100,500,1000]
}

In [None]:
logreg_grid = sort_grid(logreg_grid_raw)
logreg_grid

{'Cs': [5, 10, 15],
 'cv': [3, 5, 10],
 'max_iter': [10, 100, 500, 1000],
 'solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga']}
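
The output above shows the keys in sorted order and each list of values sorted as well. A minimal function that reproduces this ordering might look like the sketch below. It is a guess based only on the output shown here, not the `sort_grid` from the library, so it uses the hypothetical name `sort_grid_sketch` and assumes every list holds values that can be compared with each other.

```python
def sort_grid_sketch(grid):
    """Return a new dict with keys in sorted order and each list of values sorted."""
    # Assumption: each list holds mutually comparable values (all numbers or all strings).
    return {key: sorted(values) for key, values in sorted(grid.items())}

raw = {
    'Cs': [5, 10, 15],
    'cv': [3, 5, 10],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter': [10, 100, 500, 1000],
}
print(sort_grid_sketch(raw))
```

A fixed ordering makes the grid easier to compare between runs, since two grids with the same contents print identically.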

### How many different combinations?

In [None]:
param_grid = ParameterGrid(logreg_grid)
len(param_grid)  #180

180
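
The count is simply the product of the number of options for each parameter: 3 × 3 × 4 × 5 = 180. The sketch below checks that arithmetic in plain Python, without `ParameterGrid`, and also enumerates the combinations explicitly. The variable names are illustrative only.

```python
import math
from itertools import product

grid = {
    'Cs': [5, 10, 15],
    'cv': [3, 5, 10],
    'max_iter': [10, 100, 500, 1000],
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'],
}

# Multiply the number of options per parameter: 3 * 3 * 4 * 5.
n_combinations = math.prod(len(values) for values in grid.values())

# Cross-check by building every combination as its own dict.
all_combinations = [dict(zip(grid, combo)) for combo in product(*grid.values())]

print(n_combinations, len(all_combinations))
```

Adding one more option to any parameter multiplies the total, which is why grids grow quickly and why a halving search is attractive.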

In [None]:
from sklearn.linear_model import LogisticRegressionCV
logreg_model = LogisticRegressionCV(random_state=1, n_jobs=-1)  #base model

In [None]:
%%capture
grid_result = halving_search(logreg_model, logreg_grid, x_trained, y_train)
best_logreg_model = grid_result.best_estimator_

In [None]:
grid_result.best_params_  #{'Cs': 5, 'cv': 3, 'max_iter': 500, 'solver': 'sag'}

{'Cs': 5, 'cv': 3, 'max_iter': 500, 'solver': 'sag'}

In [None]:
ypos = best_logreg_model.predict_proba(x_test)[:,1]

In [None]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test, ypos)
fancy_df

|index|threshold|precision|recall|f1|auc|accuracy|
|---|---|---|---|---|---|---|
|0|0.0|0.43|1.0|0.6|0.68|0.43|
|1|0.05|0.43|1.0|0.6|0.68|0.43|
|2|0.1|0.43|1.0|0.6|0.68|0.43|
|3|0.15|0.44|1.0|0.61|0.68|0.44|
|4|0.2|0.44|0.88|0.59|0.68|0.47|
|5|0.25|0.48|0.75|0.58|0.68|0.54|
|6|0.3|0.56|0.65|0.6|0.68|0.62|
|7|0.35|0.65|0.6|0.62|0.68|0.68|
|8|0.4|0.69|0.58|0.63|0.68|0.7|
|9|0.45|0.71|0.57|0.63|0.68|0.71|


In [None]:
result_df

|index|threshold|precision|recall|f1|auc|accuracy|
|---|---|---|---|---|---|---|
|0|0.0|0.43|1.0|0.6|0.68|0.43|
|1|0.05|0.43|1.0|0.6|0.68|0.43|
|2|0.1|0.43|1.0|0.6|0.68|0.43|
|3|0.15|0.44|1.0|0.61|0.68|0.44|
|4|0.2|0.44|0.88|0.59|0.68|0.47|
|5|0.25|0.48|0.75|0.58|0.68|0.54|
|6|0.3|0.56|0.65|0.6|0.68|0.62|
|7|0.35|0.65|0.6|0.62|0.68|0.68|
|8|0.4|0.69|0.58|0.63|0.68|0.7|
|9|0.45|0.71|0.57|0.63|0.68|0.71|


### My table

|index|threshold|precision|recall|f1|accuracy|auc|
|---|---|---|---|---|---|---|
|0|0\.0|0\.43|1\.0|0\.6|0\.43|0\.68|
|1|0\.05|0\.43|1\.0|0\.6|0\.43|0\.68|
|2|0\.1|0\.43|1\.0|0\.6|0\.43|0\.68|
|3|0\.15|0\.44|1\.0|0\.61|0\.44|0\.68|
|4|0\.2|0\.44|0\.88|0\.59|0\.47|0\.68|
|5|0\.25|0\.48|0\.75|0\.58|0\.54|0\.68|
|6|0\.3|0\.56|0\.65|0\.6|0\.62|0\.68|
|7|0\.35|0\.65|0\.6|0\.62|0\.68|0\.68|
|8|0\.4|0\.69|0\.58|0\.63|0\.7|0\.68|
|9|0\.45|0\.71|0\.57|0\.63|0\.71|0\.68|
|10|0\.5|0\.71|0\.56|0\.63|0\.71|0\.68|
|11|0\.55|0\.72|0\.56|0\.63|0\.71|0\.68|
|12|0\.6|0\.72|0\.55|0\.62|0\.71|0\.68|
|13|0\.65|0\.73|0\.51|0\.6|0\.71|0\.68|
|14|0\.7|0\.75|0\.39|0\.52|0\.68|0\.68|
|15|0\.75|0\.86|0\.32|0\.47|0\.68|0\.68|
|16|0\.8|0\.91|0\.18|0\.29|0\.63|0\.68|
|17|0\.85|1\.0|0\.03|0\.05|0\.58|0\.68|
|18|0\.9|0\.0|0\.0|0\.0|0\.57|0\.68|
|19|0\.95|0\.0|0\.0|0\.0|0\.57|0\.68|
|20|1\.0|0\.0|0\.0|0\.0|0\.57|0\.68|

## Save both `result_df` and `best_logreg_model` to file

In [None]:
result_df.to_csv('logreg_thresholds.csv', index=False)

In [None]:
from joblib import dump
dump(best_logreg_model, 'logreg_model.joblib')

['logreg_model.joblib']

## Test by loading back into `logreg_model2` and comparing results

The loaded model should produce the same predictions as `best_logreg_model`.

In [None]:
from joblib import load

logreg_model2 = load('logreg_model.joblib')

In [None]:
#what you loaded back in
logreg_model2.predict_proba(x_test)[:,1][:5]  #array([0.65603952, 0.76587596, 0.83815386, 0.25765383, 0.27242287])


array([0.65603952, 0.76587596, 0.83815386, 0.25765383, 0.27242287])

In [None]:
#original you saved
best_logreg_model.predict_proba(x_test)[:,1][:5]  #array([0.65603952, 0.76587596, 0.83815386, 0.25765383, 0.27242287])

array([0.65603952, 0.76587596, 0.83815386, 0.25765383, 0.27242287])
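
Comparing the first five values by eye works, but the check can also be made programmatic. The sketch below uses plain Python and the five values printed above; the helper name `same_predictions` and the tolerance are assumptions for illustration.

```python
import math

def same_predictions(a, b, abs_tol=1e-12):
    """True if both sequences have equal length and element-wise (near-)equal values."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol) for x, y in zip(a, b)
    )

# First five probabilities from the loaded model and from the original model.
loaded = [0.65603952, 0.76587596, 0.83815386, 0.25765383, 0.27242287]
original = [0.65603952, 0.76587596, 0.83815386, 0.25765383, 0.27242287]

print(same_predictions(loaded, original))
```

In the notebook, the same check on the full arrays could be written as `np.array_equal(logreg_model2.predict_proba(x_test), best_logreg_model.predict_proba(x_test))`.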

## Congratulations again!

You have a second model stored away and ready to use in production. On a roll.

# Challenge 5
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

You should have all of these available from your library: `halving_search`, `sort_grid`, and `ParameterGrid`.

