# Announcement

In the video lecture for this notebook, I only look at a method called XGBoost. I would like to open it up a bit more for you and offer you some alternatives:

* xgboost. Still a good boosting model. Downsides: it has dropped out of compliance with later versions of sklearn, so if we used it we would have to install a downgraded version of sklearn. It is also a real memory hog and time hog: when running in Colab, it can overrun the memory or time allowance.

* catboost. Another good boosting model. Less of a memory hog and will generally not overrun memory. But still a time hog.

* lightgbm. Perhaps not quite as good as fancier boosting models, but still good. And optimized for memory usage.

I am going to choose lightgbm. I actually tried it against xgboost and catboost and it was as good if not better on our Titanic data!

<center>
<h1>Chapter 12</h1>
</center>

<hr>

Let's work on a couple of other models and get them in shape for our web site.


In [1]:
github_name = 'MarvNC'
repo_name = 'cs523'
source_file = 'library.py'
url = f'https://raw.githubusercontent.com/{github_name}/{repo_name}/main/{source_file}'
!rm $source_file
!wget $url
%run -i $source_file

rm: cannot remove 'library.py': No such file or directory
--2025-05-28 00:21:06--  https://raw.githubusercontent.com/MarvNC/cs523/main/library.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 48297 (47K) [text/plain]
Saving to: ‘library.py’


2025-05-28 00:21:07 (1008 KB/s) - ‘library.py’ saved [48297/48297]



In [2]:
threshold_results, halving_search, sort_grid, ParameterGrid  #all should be defined in your library from Chapter 11

(<function __main__.threshold_results(thresh_list, actuals, predicted)>,
 <function __main__.halving_search(model, grid, x_train, y_train, factor=3, min_resources='exhaust', scoring='roc_auc')>,
 <function __main__.sort_grid(grid)>,
 sklearn.model_selection._search.ParameterGrid)

In [3]:
titanic_variance_based_split, customer_variance_based_split  #107,113 to match my results

(107, 113)

In [4]:
url = 'https://raw.githubusercontent.com/fickas/asynch_models/refs/heads/main/datasets/titanic_trimmed.csv'
titanic_trimmed = pd.read_csv(url)

In [5]:
titanic_features = titanic_trimmed.drop(columns='Survived')
titanic_features.head()  #print first 5 rows of the table

Unnamed: 0,Age,Gender,Class,Joined,Married,Fare
0,41.0,Male,C3,Southampton,0.0,7.0
1,21.0,Male,Crew,Southampton,0.0,0.0
2,13.0,Male,C3,Southampton,,20.0
3,16.0,Male,C3,Southampton,0.0,
4,,Male,C2,Cherbourg,0.0,24.0


In [6]:
labels = titanic_trimmed['Survived'].to_list()

In [7]:
%%capture
x_train, x_test, y_train, y_test = titanic_setup(titanic_trimmed)

In [8]:
x_train.std(axis=0)  #array([0.75333128, 0.47741652, 1.03590395, 0.0872873 , 0.47611519, 1.23157575])

array([0.75333128, 0.47741652, 1.03590395, 0.0872873 , 0.47611519,
       1.23157575])

# I. An Arborist view of machine learning


<img src='https://wallpapercave.com/wp/C6fXFAd.jpg' height=200>

I'd like to trace the lineage of a series of decision-tree based models to the one we will actually use.

## Start with Decision Tree

<img src='https://trevorstephens.com/images/2014-01-13-r-part-3-decision-trees-2.png' height=500>

## The algorithm

1. Out of all the features, choose the root feature that best splits the rows. The idea of "entropy" comes in here. Entropy measures how mixed a node is. The highest-entropy outcome of a split would be .62 and .38 in both split nodes, i.e., no change from the mix we started with. But high entropy is our worst case! We want highly divided nodes. In the diagram, you can see male as .81 and .19. This is bad for entropy but good for us. Our goal is to get to 1.0 and 0 or 0 and 1.0: everyone in the node survived or everyone perished. We can stop right there and predict with 100% accuracy (on the training rows)! What is not shown in the diagram above is the search process, only the outcome. The Sex column did the best anti-entropy split.

2. For each node, decide whether to quit and make it a leaf node or to continue splitting. It looks like a decision was made to keep splitting on each node. Take the node on the left (males). We again search the features for the best split. We are looking for something that does better than .81 and .19. It turns out that splitting on age >= 6.5 does give us better on the left, now .83 and .17, but worse on the right with .33 and .67. It looks like we decide to quit on node 4 on the left. So we will predict perished for males >= 6.5.

3. For node 5 on the right (males < 6.5) we are not happy, so we decide to continue our anti-entropy campaign. We again search for a feature that will improve the odds. We find that SibSp (a new column that counts family members traveling together) is best at splitting. It gives us nodes 10 and 11, which are pretty darn good. We decide to stop splitting and use these as leaf nodes.

4. The process continues for remaining nodes: decide to stop or decide to keep splitting.
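To make the "anti-entropy" search a bit more concrete, here is a minimal sketch of how one candidate split could be scored. The helper names `entropy` and `information_gain` are my own (they are not in our library), and the female proportions and the 65%/35% row weights in the usage lines are invented for illustration. I am assuming the usual Shannon entropy definition; sklearn does its own optimized version of this search internally.

```python
import numpy as np

def entropy(proportions):
    """Shannon entropy (in bits) of a node's class proportions, e.g. [.62, .38]."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, children, weights):
    """Drop in entropy from the parent to the weighted average of its child nodes."""
    weighted_child_entropy = sum(w * entropy(c) for c, w in zip(children, weights))
    return entropy(parent) - weighted_child_entropy

# A 50/50 node is as mixed as it gets; a pure node has zero entropy.
mixed = entropy([0.5, 0.5])
pure = entropy([1.0, 0.0])

# Hypothetical split: second child's proportions and both weights are made up.
gain = information_gain(parent=[0.62, 0.38],
                        children=[[0.81, 0.19], [0.26, 0.74]],
                        weights=[0.65, 0.35])
print(mixed, pure, round(gain, 3))
```

The search in step 1 amounts to computing something like `gain` for every candidate feature (and cut point) and keeping the largest.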

## Overfitting

The hint of danger is seen in the percentages at the bottom of each node. They represent the percent of rows that actually fall into the node. Looking at node 4 (males >= 6.5) the percentage is 62%. That's good. We are capturing 62% of rows in that node. But look at node 10. We are capturing only 1% of the rows (roughly 22 passengers).

Here is where we can go down the rabbit hole. We search for perfect scores, i.e., either 100% survived or 100% perished. We keep splitting and splitting until we get to nodes of this kind. The problem is that we may end up with 100 or even 1000 leaf nodes. See the diagram below.

<img src='https://trevorstephens.com/images/2014-01-13-r-part-3-decision-trees-4.png' height=1000>

## The problem is when to stop

The issue with the huge tree above is overfitting. We are drilling down into the minutiae of each row. But this is for the training set. The test set is unlikely to match the training set exactly. So we have a tree that performs perfectly on the training set and poorly on the test set.

The classifier gives us parameters to allow us to choose when to stop. We will look at one, `max_depth`, but there are others. See documentation for details.

`sklearn` has a [Decision Tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
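Here is a small sketch of what `max_depth` does to overfitting. It does not use our Titanic data: I am assuming a synthetic stand-in from `make_classification` with some label noise, so the numbers will not match anything in this notebook. The thing to watch is the gap between training and test accuracy as the tree is allowed to grow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y adds label noise so overfitting can show up.
X, y = make_classification(n_samples=1000, n_features=6, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for depth in [1, 3, 5, 10, None]:     # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1234)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.2f}  test={test_acc:.2f}")
```

With no depth limit the tree memorizes the training rows, noise included, which is the rabbit hole described above.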

In [9]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(random_state=1234)

In [10]:
dt_grid_raw = dict(criterion=['gini', 'entropy'],  #algorithms that judge the goodness of a split
                max_features=['sqrt', 'log2', None],  #how many features to consider for a split - None=all
                max_depth=range(1,15),
)
dt_grid = sort_grid(dt_grid_raw)
dt_grid

{'criterion': ['entropy', 'gini'],
 'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
 'max_features': ['log2', 'sqrt', None]}

In [11]:
%%capture
grid_result = halving_search(dt_model, dt_grid, x_train, y_train)  #defaults to factor=3, min_resources="exhaust", scoring='roc_auc'
best_model = grid_result.best_estimator_  #best model found by search

In [12]:
grid_result.best_params_   #{'criterion': 'entropy', 'max_depth': 3, 'max_features': 'sqrt'}

{'criterion': 'entropy', 'max_depth': 3, 'max_features': 'sqrt'}

## Test set

Notice I am using the model the search found with the best roc_auc score, found in  `grid_result.best_estimator_`.

In [13]:
ypos = best_model.predict_proba(x_test)[:,1]  #given it is numpy matrix (nx2), I can access 2nd column with [:,1]

In [14]:
ypos[:3]  #array([0.53038674, 0.96240602, 0.96240602])

array([0.53038674, 0.96240602, 0.96240602])
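
These probabilities only become predictions once we pick a threshold. As a reminder of the idea, here is a rough sketch of how a threshold table could be built. This is not the code of our `threshold_results` function: the name `threshold_table_sketch`, the `>=` comparison and the toy inputs are my assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def threshold_table_sketch(thresholds, actuals, probabilities):
    """One row of metrics per threshold; auc does not depend on the threshold."""
    probabilities = np.asarray(probabilities)
    auc = roc_auc_score(actuals, probabilities)
    rows = []
    for t in thresholds:
        predicted = (probabilities >= t).astype(int)   # assumed: >= counts as positive
        rows.append({
            'threshold': t,
            'precision': precision_score(actuals, predicted, zero_division=0),
            'recall': recall_score(actuals, predicted, zero_division=0),
            'f1': f1_score(actuals, predicted, zero_division=0),
            'auc': auc,
            'accuracy': accuracy_score(actuals, predicted),
        })
    return pd.DataFrame(rows).round(2)

# Toy inputs, not from the Titanic data.
demo_table = threshold_table_sketch([0.0, 0.5, 1.0], [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(demo_table)
```

Notice that at threshold 0.0 everything is predicted positive, so recall is 1.0, which matches the first row of every table in this notebook.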

In [15]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test, ypos)

In [16]:
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.77,0.43
1,0.05,0.43,1.0,0.6,0.77,0.43
2,0.1,0.43,1.0,0.6,0.77,0.43
3,0.15,0.43,1.0,0.6,0.77,0.43
4,0.2,0.51,0.89,0.65,0.77,0.58
5,0.25,0.62,0.79,0.69,0.77,0.7
6,0.3,0.62,0.79,0.69,0.77,0.7
7,0.35,0.62,0.79,0.69,0.77,0.7
8,0.4,0.72,0.58,0.64,0.77,0.72
9,0.45,0.72,0.58,0.64,0.77,0.72


In [17]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.77,0.43
1,0.05,0.43,1.0,0.6,0.77,0.43
2,0.1,0.43,1.0,0.6,0.77,0.43
3,0.15,0.43,1.0,0.6,0.77,0.43
4,0.2,0.51,0.89,0.65,0.77,0.58
5,0.25,0.62,0.79,0.69,0.77,0.7
6,0.3,0.62,0.79,0.69,0.77,0.7
7,0.35,0.62,0.79,0.69,0.77,0.7
8,0.4,0.72,0.58,0.64,0.77,0.72
9,0.45,0.72,0.58,0.64,0.77,0.72


## From trees to forest

<center>
<img src='https://miro.medium.com/max/2800/0*w4CMLEKAFp_bVYNk.jpg' height=200>
</center>

Remember our KNN algorithm and the notion of crowd sourcing. Finding a set of experts and then averaging their opinion? Some clever person found a way to apply that to decision trees. Here's what they came up with.

1. Don't have just one tree. Have multiple trees. And let them all vote on the answer/prediction.

2. Ok, here is the wild part. Choose features to split on randomly!

Guess what they called the algorithm? Random Forest. Here is an example with 3 random trees.

<img src='https://trevorstephens.com/images/2014-01-19-r-part-5-random-forests-1.png'>

### Stumps

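The 3 trees above are called stumps because they only have one split and then quit with 2 leaf nodes. How big your trees grow is a hyperparameter. Smaller trees overfit less individually, though sklearn's Random Forest grows trees to full depth by default (`max_depth=None`) and relies on the voting to smooth things out.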

Also, 3 trees is rather a small number for a forest. It is typically more in the 100 range. It is another hyperparameter.
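
To show there is no magic, here is a hand-rolled sketch of the forest idea on synthetic data. It is not how sklearn implements `RandomForestClassifier`; it is a minimal version under my own assumptions: each tree gets a bootstrap sample of the rows (a detail not covered above), `max_features='sqrt'` supplies the random feature choice at each split, and the "vote" is an average of the trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):                                       # 25 trees: small, but quick
    rows = rng.integers(0, len(X_tr), size=len(X_tr))     # bootstrap: sample rows with replacement
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_tr[rows], y_tr[rows])
    trees.append(tree)

# Each tree offers its opinion; the forest averages them.
forest_proba = np.mean([t.predict_proba(X_te)[:, 1] for t in trees], axis=0)
forest_pred = (forest_proba >= 0.5).astype(int)
print("forest accuracy:", (forest_pred == y_te).mean())
```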

## Back to overfitting

It's kind of crazy, but the random forest idea combats overfitting at the same time as producing better results than decision trees.

sklearn has a [Random Forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). And it is still widely used.

## Let's just try with all defaults

In [18]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(random_state=1234)

In [19]:
rf_model.fit(x_train, y_train)

In [20]:
ypos = rf_model.predict_proba(x_test)[:,1]

In [21]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test, ypos)

In [22]:
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.72,0.43
1,0.05,0.46,0.92,0.62,0.72,0.51
2,0.1,0.5,0.84,0.63,0.72,0.57
3,0.15,0.51,0.76,0.61,0.72,0.57
4,0.2,0.52,0.7,0.59,0.72,0.59
5,0.25,0.53,0.65,0.58,0.72,0.6
6,0.3,0.56,0.65,0.6,0.72,0.63
7,0.35,0.6,0.65,0.62,0.72,0.66
8,0.4,0.61,0.64,0.63,0.72,0.67
9,0.45,0.62,0.61,0.62,0.72,0.67


In [23]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.72,0.43
1,0.05,0.46,0.92,0.62,0.72,0.51
2,0.1,0.5,0.84,0.63,0.72,0.57
3,0.15,0.51,0.76,0.61,0.72,0.57
4,0.2,0.52,0.7,0.59,0.72,0.59
5,0.25,0.53,0.65,0.58,0.72,0.6
6,0.3,0.56,0.65,0.6,0.72,0.63
7,0.35,0.6,0.65,0.62,0.72,0.66
8,0.4,0.61,0.64,0.63,0.72,0.67
9,0.45,0.62,0.61,0.62,0.72,0.67


## Untuned forest

Maybe tuning can improve Random Forest. At the moment it is not doing great against the tuned decision tree (auc of 0.72 versus 0.77). But I am going to move on from random forest to something called boosting.


## From random to boosting

If you think carefully about the Random Forest algorithm, you can see that it is highly parallelizable. I could build each tree in parallel, right?

The next clever idea that came along said we can trade that parallelization for something called boosting. The general idea is as follows:

1. We build our first  tree (just as with decision tree model) and predict just with it.

2. We will get some rows right and some wrong. We weight the "wrong" rows higher and build the 2nd tree. The 2nd tree focuses on getting the weighted rows correct. In essence, it tries to do better on the errors the first tree made.

3. The 2nd tree will make mistakes. We continue to build subsequent trees that try to improve on the mistakes of the previous trees.

4. We stop at some point. We now have a forest. Voting is more complicated than with Random Forest. Each tree provides a score and these scores are accumulated then passed through a sigmoid function.
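
The steps above can be sketched in a few lines. A caveat: instead of re-weighting the wrong rows explicitly, this sketch uses the gradient boosting variant, where each new tree is fit to the residuals (actual minus current predicted probability), which are largest exactly where the forest so far is most wrong. The data is synthetic and the settings (`learning_rate`, `n_trees`, `max_depth=2`) are arbitrary choices of mine. LightGBM is far more sophisticated than this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

learning_rate = 0.3
n_trees = 20
scores = np.zeros(len(y))      # accumulated raw scores, one per row, starting at 0
trees, losses = [], []

for _ in range(n_trees):
    residuals = y - sigmoid(scores)            # big where the current forest is most wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                     # next tree focuses on the current errors
    scores += learning_rate * tree.predict(X)  # accumulate this tree's score
    trees.append(tree)

    p = sigmoid(scores)                        # final "vote": sigmoid of the summed scores
    losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(f"log loss after 1 tree: {losses[0]:.3f}, after {n_trees} trees: {losses[-1]:.3f}")
```

Notice the loop cannot be run in parallel: tree 2 needs the residuals left behind by tree 1. That is the trade mentioned above.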

I'm not going to do a deep dive into tree boosting. I can point you to a [good tutorial](https://sefiks.com/2018/10/04/a-step-by-step-gradient-boosting-decision-tree-example/) and a [genealogy](https://towardsdatascience.com/xgboost-its-genealogy-its-architectural-features-and-its-innovation-bf32b15b45d2).

You should note that the "boosting" idea can be applied to any sequence of models. However, tree models are certainly the easiest to work with.

I also note that there is a related concept called "stacking" that works well with heterogeneous models. So if you wanted a "forest" (typically called an ensemble) of KNN, RandomForest and LogisticRegression models, stacking would be your best bet. We may get back to this toward the end of the class.
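
If you are curious what stacking looks like, sklearn has a `StackingClassifier`. Below is a minimal sketch on synthetic data using the three model types just mentioned. The choice of `LogisticRegression` as the final combiner and all of the settings are my assumptions, not something we have tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier()),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=1234)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),   # learns how to weigh the three opinions
)
stack.fit(X_tr, y_tr)
stack_proba = stack.predict_proba(X_te)[:, 1]
print("stacked accuracy:", stack.score(X_te, y_te))
```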


### First run boosting model with just defaults (no tuning)



In [24]:
from lightgbm import LGBMClassifier   #built-in to Colab, nice

In [25]:
%%capture

# Create model with all defaults
model = LGBMClassifier(random_state=1234)

# Train the model
model.fit(x_train, y_train)

ypos = model.predict_proba(x_test)[:,1]

In [26]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test, ypos)
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.78,0.43
1,0.05,0.49,0.97,0.65,0.78,0.55
2,0.1,0.52,0.95,0.67,0.78,0.59
3,0.15,0.54,0.91,0.68,0.78,0.62
4,0.2,0.58,0.85,0.69,0.78,0.67
5,0.25,0.59,0.79,0.68,0.78,0.67
6,0.3,0.62,0.72,0.66,0.78,0.68
7,0.35,0.64,0.66,0.65,0.78,0.69
8,0.4,0.66,0.61,0.64,0.78,0.7
9,0.45,0.71,0.59,0.64,0.78,0.71


In [27]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.78,0.43
1,0.05,0.49,0.97,0.65,0.78,0.55
2,0.1,0.52,0.95,0.67,0.78,0.59
3,0.15,0.54,0.91,0.68,0.78,0.62
4,0.2,0.58,0.85,0.69,0.78,0.67
5,0.25,0.59,0.79,0.68,0.78,0.67
6,0.3,0.62,0.72,0.66,0.78,0.68
7,0.35,0.64,0.66,0.65,0.78,0.69
8,0.4,0.66,0.61,0.64,0.78,0.7
9,0.45,0.71,0.59,0.64,0.78,0.71


### Now try tuning but some important notes

I have to consider how parameter values interact with HalvingSearch. I put a comment on two standard parameters I dropped.

I also considered parameter values based on binary classification. In particular,

* boosting_type:
  * 'gbdt' is stable for binary decisions.
  * 'dart' helps prevent overfitting when you have strong signal features.
  * 'goss' is particularly good for binary classification because it focuses on hard-to-classify samples (those with large gradients).

* num_leaves:
  * Binary decisions often don't need very complex trees.
  * Starting with smaller values (7, 15) helps prevent overfitting.
  * Especially important when your classes are well-separated.

I also note that Claude (or other LLMs) can be a big help here and in general for selecting parameters to tune.
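
As a refresher on why halving matters here: it starts every combination on a small slice of rows and only promotes the top performers to bigger slices. The sketch below uses sklearn's `HalvingGridSearchCV` directly, with the same `factor`, `min_resources` and `scoring` values our `halving_search` defaults to. I am assuming our library function wraps something like this; the tiny decision-tree grid and synthetic data are just there to keep the run fast.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (must come first)
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=6, random_state=0)

small_grid = {'max_depth': [1, 3, 5, 7], 'criterion': ['gini', 'entropy']}  # 8 combos

search = HalvingGridSearchCV(
    DecisionTreeClassifier(random_state=1234),
    small_grid,
    factor=3,                 # keep roughly the best 1/3 of combos each round
    min_resources='exhaust',  # size the rounds so the last one uses as many rows as possible
    scoring='roc_auc',
    cv=5,
    random_state=1234,
)
search.fit(X, y)
print(search.best_params_)
```

Because the rows are already being subsampled round by round, also tuning a row-sampling parameter on top of that muddies the comparison between rounds.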


In [28]:
lgb_grid_raw = {
    # Parameters that work well with HalvingGridSearch
    "n_estimators": [5, 10, 50, 100],
    "learning_rate": [.01, .05, 0.1, 0.2, 0.3, 0.4],
    "max_depth": [1,5,10,15],
    "boosting_type": ['gbdt', 'dart', 'goss'],
    "num_leaves": [5, 7, 15, 31, 40],   #start small with binary classification

    "min_child_samples": [5, 10,15],
    "class_weight": ['balanced', None],

    #"subsample": [0.25, 0.5, 0.75]     # Drop this as HalvingGridSearch handles row sampling
    #"subsample_freq": [0, 1, 5]        # Drop this too as it's related to subsample
}

lgb_grid = sort_grid(lgb_grid_raw)
lgb_grid

{'boosting_type': ['dart', 'gbdt', 'goss'],
 'class_weight': ['balanced', None],
 'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3, 0.4],
 'max_depth': [1, 5, 10, 15],
 'min_child_samples': [5, 10, 15],
 'n_estimators': [5, 10, 50, 100],
 'num_leaves': [5, 7, 15, 31, 40]}

In [29]:
from sklearn.model_selection import ParameterGrid


In [30]:
param_grid = ParameterGrid(lgb_grid)  #a list of dictionaries, one for each combo
len(param_grid)  #8640

8640

### Welcome to longer waits

Up until now our tuning has been taking a few minutes. Boosting and Neural Nets (next chapters) will change that. These models take longer to train and that impacts tuning.

I have purposely kept the parameter space relatively small to keep tuning in the minutes. My first attempt to build what I thought was a reasonable set of parameters to tune took twelve (12!) hours. And remember this is with a dataset that is on the small side.

My warning is that when you get into a real situation, e.g., a job, you will have to be careful that you have enough time and computing resources to tune modern algorithms like Boosting and Neural Nets.

In [31]:
import datetime  #can use to time tuning - roughly 25 minutes for below!

In [32]:
%%capture
start = datetime.datetime.now()

grid_result = halving_search(LGBMClassifier(random_state=1234), lgb_grid, x_train, y_train)

end = datetime.datetime.now()
time_difference = end - start
difference_in_minutes = time_difference.total_seconds() / 60

In [33]:
print(f"The difference between the two datetimes is {difference_in_minutes} minutes.")  #Average time around 25 minutes with roughly 9k combos

The difference between the two datetimes is 17.806271033333335 minutes.


In [34]:
best_model = grid_result.best_estimator_
grid_result.best_params_   #see below for what to match

{'boosting_type': 'gbdt',
 'class_weight': 'balanced',
 'learning_rate': 0.3,
 'max_depth': 15,
 'min_child_samples': 5,
 'n_estimators': 5,
 'num_leaves': 7}

<pre>
{'boosting_type': 'gbdt',
 'class_weight': 'balanced',
 'learning_rate': 0.3,
 'max_depth': 15,
 'min_child_samples': 5,
 'n_estimators': 5,
 'num_leaves': 7}
 </pre>

In [35]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test, best_model.predict_proba(x_test)[:,1])
fancy_df



Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.81,0.43
1,0.05,0.43,1.0,0.6,0.81,0.43
2,0.1,0.43,1.0,0.6,0.81,0.43
3,0.15,0.43,1.0,0.6,0.81,0.43
4,0.2,0.44,0.99,0.61,0.81,0.44
5,0.25,0.48,0.96,0.64,0.81,0.53
6,0.3,0.49,0.96,0.65,0.81,0.56
7,0.35,0.61,0.89,0.72,0.81,0.71
8,0.4,0.61,0.89,0.72,0.81,0.71
9,0.45,0.69,0.61,0.65,0.81,0.71


In [36]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.81,0.43
1,0.05,0.43,1.0,0.6,0.81,0.43
2,0.1,0.43,1.0,0.6,0.81,0.43
3,0.15,0.43,1.0,0.6,0.81,0.43
4,0.2,0.44,0.99,0.61,0.81,0.44
5,0.25,0.48,0.96,0.64,0.81,0.53
6,0.3,0.49,0.96,0.65,0.81,0.56
7,0.35,0.61,0.89,0.72,0.81,0.71
8,0.4,0.61,0.89,0.72,0.81,0.71
9,0.45,0.69,0.61,0.65,0.81,0.71


### Boosting quite an improvement!

# Challenge 1

Save two files and move them to GitHub: the tuned model and its threshold table. This is one of 4 models we will use in the production system, and we will want to show the user the threshold table.

In [37]:
result_df.to_csv('lightgbm_thresholds.csv', index=False)  #to help user make choices

In [38]:
from joblib import dump
dump(best_model, 'lightgbm_model.joblib')  #model to make predictions in production system

['lightgbm_model.joblib']
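
For what it's worth, here is a sketch of the other half of the round trip: loading a dumped model back with `joblib.load`, which is what the production system will need to do. It uses a small decision tree on synthetic data and a throwaway temporary folder so it can run anywhere; the file name `demo_model.joblib` is made up.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
fitted = DecisionTreeClassifier(max_depth=3, random_state=1234).fit(X, y)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo_model.joblib')   # hypothetical file name
    dump(fitted, path)          # same call as above
    restored = load(path)       # what the production side would run

# The restored model should give the same probabilities as the original.
same_predictions = np.array_equal(fitted.predict_proba(X), restored.predict_proba(X))
print(same_predictions)
```

One thing to keep in mind: the environment that loads the file needs the same libraries installed (for our real model, that includes lightgbm).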

# Challenge 2

I'd like you to try a couple more parameters in our grid to see if you can improve the tuning results. They are

<pre>
    "colsample_bytree": ?,    # Fraction of columns to select, important for preventing overfitting, default = 1.0
    "lambda_l1": ?,           # L1 regularization helpful for binary, default 0
    "lambda_l2": ?,           # L2 regularization helpful for binary, default 0
</pre>

I'll give you starting grid below. Your challenge is to do some research (using an LLM is fine) and find reasonable alternatives for each of these parameters and add them to the grid.

Also keep in mind that you will be multiplying your time with new alternatives. So try just a few that seem reasonable from your research.

FYI: I added 9 new values, total, 3 for each. But I have also fixed several with the best values from original tuning to offset. Why? Because they seem to have the least interaction with the 3 new parameters:

* `min_child_samples` mainly affects local node decisions.
Less interaction with feature sampling or regularization.

* `class_weight` operates independently of regularization and feature sampling.

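In summary, I added 9 values and subtracted 3. The three new parameters multiply the grid by 27 (3 x 3 x 3), while fixing `min_child_samples` and `class_weight` divides it by 6 (3 x 2). That still leaves me with a 4.5-times increase in configurations (8640 to 38880).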

## Kind of important idea

As noted in the video, this is really an incremental approach to tuning. Choose some first set of parameters to tune. Get their best values. Fix those values and then choose the next set to tune, etc. It is not optimal, but it is realistic given the time it takes to tune. If you tried tuning all parameters at once, you could be waiting for days.
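
Here is a small sketch of the incremental idea in its purest form, where every first-stage winner is frozen before the second stage. To keep it quick it uses a decision tree, synthetic data and plain `GridSearchCV` rather than our `halving_search`; the grids are arbitrary examples of mine. The point is the count of combinations at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

stage1_grid = {'max_depth': [1, 3, 5, 7], 'criterion': ['gini', 'entropy']}
stage2_new = {'min_samples_leaf': [1, 5, 10], 'max_features': ['sqrt', 'log2', None]}

# Stage 1: tune the first set of parameters.
stage1 = GridSearchCV(DecisionTreeClassifier(random_state=1234), stage1_grid,
                      scoring='roc_auc', cv=3).fit(X, y)

# Stage 2: freeze the stage 1 winners (one-value lists) and tune only the new ones.
frozen = {name: [value] for name, value in stage1.best_params_.items()}
stage2_grid = {**frozen, **stage2_new}
stage2 = GridSearchCV(DecisionTreeClassifier(random_state=1234), stage2_grid,
                      scoring='roc_auc', cv=3).fit(X, y)

staged_count = len(ParameterGrid(stage1_grid)) + len(ParameterGrid(stage2_grid))
joint_count = len(ParameterGrid({**stage1_grid, **stage2_new}))
print(f"staged: {staged_count} combos, all at once: {joint_count} combos")
print(stage2.best_params_)
```

Staged tuning adds the stage sizes together, while tuning everything at once multiplies them. The price is that the staged search can miss a combination that only shines when the parameters change together.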

In [39]:
lgb_grid2_raw = {
    # Parameters that work well with HalvingGridSearch
    "n_estimators": [5, 10, 50, 100],
    "learning_rate": [.01, .05, 0.1, 0.2, 0.3, 0.4],
    "max_depth": [1,5,10,15],
    "boosting_type": ['gbdt', 'dart', 'goss'],
    "num_leaves": [5, 7, 15, 31, 40],

    "min_child_samples": [5],            #use best from original - minimal interaction with new
    "class_weight": ['balanced'],        #use best from original - minimal interaction with new

    #new
    "colsample_bytree": [0.7, 0.8, 0.9],          # Important for preventing overfitting
    "lambda_l1": [0, 0.1, 0.5],                 # L1 regularization helpful for binary
    "lambda_l2": [0, 0.1, 0.5],                 # L2 regularization helpful for binary

    #"subsample": [0.25, 0.5, 0.75]     # Drop this as HalvingGridSearch handles row sampling
    #"subsample_freq": [0, 1, 5]        # Drop this too as it's related to subsample
}

lgb_grid2 = sort_grid(lgb_grid2_raw)
lgb_grid2

{'boosting_type': ['dart', 'gbdt', 'goss'],
 'class_weight': ['balanced'],
 'colsample_bytree': [0.7, 0.8, 0.9],
 'lambda_l1': [0, 0.1, 0.5],
 'lambda_l2': [0, 0.1, 0.5],
 'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3, 0.4],
 'max_depth': [1, 5, 10, 15],
 'min_child_samples': [5],
 'n_estimators': [5, 10, 50, 100],
 'num_leaves': [5, 7, 15, 31, 40]}

In [40]:
param_grid = ParameterGrid(lgb_grid2)
len(param_grid)  #38880 vs old 8640

38880

### Roughly two hours

Go get coffee.

In [41]:
%%capture
start = datetime.datetime.now()

grid_result = halving_search(LGBMClassifier(random_state=1234), lgb_grid2, x_train, y_train)

end = datetime.datetime.now()
time_difference = end - start
difference_in_minutes = time_difference.total_seconds() / 60

In [42]:
print(f"The difference between the two datetimes is {difference_in_minutes} minutes.")  #120 minutes

The difference between the two datetimes is 91.38001606666667 minutes.


In [43]:
best_model = grid_result.best_estimator_
grid_result.best_params_

{'boosting_type': 'goss',
 'class_weight': 'balanced',
 'colsample_bytree': 0.8,
 'lambda_l1': 0.1,
 'lambda_l2': 0.5,
 'learning_rate': 0.05,
 'max_depth': 10,
 'min_child_samples': 5,
 'n_estimators': 50,
 'num_leaves': 7}

### What I came up with

Yours will likely differ unless you picked exactly the same alternatives as I did.

<pre>
{'boosting_type': 'dart',
 'class_weight': 'balanced',
 'colsample_bytree': 0.8,
 'lambda_l1': 0,
 'lambda_l2': 0,
 'learning_rate': 0.1,
 'max_depth': 5,
 'min_child_samples': 5,
 'n_estimators': 50,
 'num_leaves': 5}
 </pre>


In [44]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test, best_model.predict_proba(x_test)[:,1])
fancy_df





Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.8,0.43
1,0.05,0.43,1.0,0.6,0.8,0.43
2,0.1,0.43,1.0,0.6,0.8,0.43
3,0.15,0.43,1.0,0.6,0.8,0.43
4,0.2,0.46,0.99,0.62,0.8,0.48
5,0.25,0.47,0.97,0.63,0.8,0.51
6,0.3,0.53,0.9,0.66,0.8,0.6
7,0.35,0.55,0.85,0.67,0.8,0.63
8,0.4,0.61,0.82,0.7,0.8,0.69
9,0.45,0.64,0.74,0.69,0.8,0.71


In [45]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.43,1.0,0.6,0.8,0.43
1,0.05,0.43,1.0,0.6,0.8,0.43
2,0.1,0.43,1.0,0.6,0.8,0.43
3,0.15,0.43,1.0,0.6,0.8,0.43
4,0.2,0.46,0.99,0.62,0.8,0.48
5,0.25,0.47,0.97,0.63,0.8,0.51
6,0.3,0.53,0.9,0.66,0.8,0.6
7,0.35,0.55,0.85,0.67,0.8,0.63
8,0.4,0.61,0.82,0.7,0.8,0.69
9,0.45,0.64,0.74,0.69,0.8,0.71


### No big improvement for me (compare with prior table)

|index|threshold|precision|recall|f1|accuracy|auc|
|---|---|---|---|---|---|---|
|0|0\.0|0\.43|1\.0|0\.6|0\.43|0\.81|
|1|0\.05|0\.43|1\.0|0\.6|0\.43|0\.81|
|2|0\.1|0\.43|1\.0|0\.6|0\.43|0\.81|
|3|0\.15|0\.43|1\.0|0\.6|0\.43|0\.81|
|4|0\.2|0\.43|1\.0|0\.6|0\.43|0\.81|
|5|0\.25|0\.44|0\.99|0\.61|0\.44|0\.81|
|6|0\.3|0\.46|0\.99|0\.62|0\.48|0\.81|
|7|0\.35|0\.58|0\.89|0\.7|0\.68|0\.81|
|8|0\.4|0\.62|0\.83|0\.71|0\.71|0\.81|
|9|0\.45|0\.72|0\.6|0\.65|0\.73|0\.81|
|10|0\.5|0\.75|0\.59|0\.66|0\.74|0\.81|
|11|0\.55|0\.78|0\.57|0\.66|0\.75|0\.81|
|12|0\.6|0\.89|0\.49|0\.63|0\.75|0\.81|
|13|0\.65|0\.89|0\.48|0\.62|0\.75|0\.81|
|14|0\.7|0\.91|0\.46|0\.61|0\.75|0\.81|
|15|0\.75|0\.9|0\.41|0\.57|0\.73|0\.81|
|16|0\.8|0\.91|0\.35|0\.51|0\.7|0\.81|
|17|0\.85|0\.95|0\.18|0\.3|0\.64|0\.81|
|18|0\.9|0\.0|0\.0|0\.0|0\.57|0\.81|
|19|0\.95|0\.0|0\.0|0\.0|0\.57|0\.81|
|20|1\.0|0\.0|0\.0|0\.0|0\.57|0\.81|

# Challenge 3

Run lightgbm on the cable customer dataset.

Do halving search to find best parameter values.

I'll remind you of the steps.

## Bring in cable data

Divide out into features and labels.

In [46]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQPM6PqZXgmAHfRYTcDZseyALRyVwkBtKEo_rtaKq_C7T0jycWxH6QVEzTzJCRA0m8Vz0k68eM9tDm-/pub?output=csv'

In [47]:
customers_df = pd.read_csv(url)
customers_trimmed = customers_df.drop(columns='ID')  #this is a useless column which we will drop early
customers_trimmed = customers_trimmed.drop_duplicates(ignore_index=True)  #get rid of any duplicates
customers_trimmed.head()

Unnamed: 0,Gender,Experience Level,Time Spent,OS,ISP,Age,Rating
0,Female,medium,,iOS,Xfinity,,0
1,Male,medium,71.97,Android,Cox,50.0,0
2,Female,medium,101.81,,Cox,49.0,1
3,Female,medium,86.37,Android,Xfinity,53.0,0
4,Female,medium,103.97,iOS,Xfinity,58.0,0


In [48]:
%%capture
x_train_cust, x_test_cust, y_train_cust,  y_test_cust = customer_setup(customers_trimmed)

In [49]:
x_train_cust.std(axis=0)  #[0.45875063, 0.43511254, 0.75411243, 0.45929552, 0.04987596, 0.62993528]

array([0.45875063, 0.43511254, 0.75411243, 0.45929552, 0.04987596,
       0.62993528])

## Step 3.1 Get baseline without tuning


In [50]:
%%capture

# Create model with all defaults
model = LGBMClassifier(random_state=1234)

# Fit the model
model.fit(x_train_cust, y_train_cust)

# Get probability predictions
ypos = model.predict_proba(x_test_cust)[:,1]


In [51]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test_cust, ypos)
fancy_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.32,1.0,0.49,0.83,0.32
1,0.05,0.47,0.84,0.61,0.83,0.65
2,0.1,0.53,0.79,0.64,0.83,0.71
3,0.15,0.56,0.73,0.63,0.83,0.73
4,0.2,0.59,0.71,0.65,0.83,0.75
5,0.25,0.65,0.7,0.67,0.83,0.78
6,0.3,0.65,0.68,0.67,0.83,0.78
7,0.35,0.69,0.68,0.69,0.83,0.8
8,0.4,0.7,0.62,0.66,0.83,0.79
9,0.45,0.76,0.62,0.68,0.83,0.82


In [52]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.32,1.0,0.49,0.83,0.32
1,0.05,0.47,0.84,0.61,0.83,0.65
2,0.1,0.53,0.79,0.64,0.83,0.71
3,0.15,0.56,0.73,0.63,0.83,0.73
4,0.2,0.59,0.71,0.65,0.83,0.75
5,0.25,0.65,0.7,0.67,0.83,0.78
6,0.3,0.65,0.68,0.67,0.83,0.78
7,0.35,0.69,0.68,0.69,0.83,0.8
8,0.4,0.7,0.62,0.66,0.83,0.79
9,0.45,0.76,0.62,0.68,0.83,0.82


|index|threshold|precision|recall|f1|accuracy|auc|
|---|---|---|---|---|---|---|
|0|0\.0|0\.32|1\.0|0\.49|0\.32|0\.83|
|1|0\.05|0\.47|0\.84|0\.61|0\.65|0\.83|
|2|0\.1|0\.53|0\.79|0\.64|0\.71|0\.83|
|3|0\.15|0\.56|0\.73|0\.63|0\.73|0\.83|
|4|0\.2|0\.59|0\.71|0\.65|0\.75|0\.83|
|5|0\.25|0\.65|0\.7|0\.67|0\.78|0\.83|
|6|0\.3|0\.65|0\.68|0\.67|0\.78|0\.83|
|7|0\.35|0\.69|0\.68|0\.69|0\.8|0\.83|
|8|0\.4|0\.7|0\.62|0\.66|0\.79|0\.83|
|9|0\.45|0\.76|0\.62|0\.68|0\.82|0\.83|
|10|0\.5|0\.81|0\.62|0\.7|0\.83|0\.83|
|11|0\.55|0\.81|0\.62|0\.7|0\.83|0\.83|
|12|0\.6|0\.84|0\.6|0\.7|0\.84|0\.83|
|13|0\.65|0\.84|0\.59|0\.69|0\.83|0\.83|
|14|0\.7|0\.85|0\.56|0\.67|0\.83|0\.83|
|15|0\.75|0\.83|0\.48|0\.61|0\.8|0\.83|
|16|0\.8|0\.88|0\.44|0\.59|0\.8|0\.83|
|17|0\.85|0\.86|0\.4|0\.54|0\.79|0\.83|
|18|0\.9|0\.91|0\.32|0\.47|0\.77|0\.83|
|19|0\.95|0\.95|0\.3|0\.46|0\.77|0\.83|
|20|1\.0|0\.0|0\.0|0\.0|0\.68|0\.83|

## Now tune

You can use the original grid, i.e., `lgb_grid`. Roughly 20 minutes.

In [53]:
%%capture
start = datetime.datetime.now()

#do search
grid_result = halving_search(LGBMClassifier(random_state=1234), lgb_grid, x_train_cust, y_train_cust)

end = datetime.datetime.now()
time_difference = end - start
difference_in_minutes = time_difference.total_seconds() / 60

In [54]:
print(f"The difference between the two datetimes is {difference_in_minutes} minutes.")  #Time varies for me. Sometimes 21 minutes, sometimes 12.

The difference between the two datetimes is 17.225386266666664 minutes.


In [55]:
best_model = grid_result.best_estimator_
grid_result.best_params_   #match my results below

{'boosting_type': 'dart',
 'class_weight': 'balanced',
 'learning_rate': 0.3,
 'max_depth': 5,
 'min_child_samples': 5,
 'n_estimators': 100,
 'num_leaves': 5}

<pre>
{'boosting_type': 'dart',
 'class_weight': 'balanced',
 'learning_rate': 0.3,
 'max_depth': 5,
 'min_child_samples': 5,
 'n_estimators': 100,
 'num_leaves': 5}
 </pre>

In [56]:
result_df, fancy_df = threshold_results(np.round(np.arange(0.0,1.01,.05), 2), y_test_cust, best_model.predict_proba(x_test_cust)[:,1])
fancy_df



Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.32,1.0,0.49,0.87,0.32
1,0.05,0.33,1.0,0.5,0.87,0.35
2,0.1,0.4,0.97,0.57,0.87,0.53
3,0.15,0.42,0.94,0.58,0.87,0.56
4,0.2,0.47,0.92,0.62,0.87,0.64
5,0.25,0.54,0.89,0.67,0.87,0.72
6,0.3,0.59,0.83,0.69,0.87,0.76
7,0.35,0.62,0.75,0.68,0.87,0.77
8,0.4,0.68,0.73,0.7,0.87,0.8
9,0.45,0.73,0.7,0.72,0.87,0.82


In [57]:
result_df

Unnamed: 0,threshold,precision,recall,f1,auc,accuracy
0,0.0,0.32,1.0,0.49,0.87,0.32
1,0.05,0.33,1.0,0.5,0.87,0.35
2,0.1,0.4,0.97,0.57,0.87,0.53
3,0.15,0.42,0.94,0.58,0.87,0.56
4,0.2,0.47,0.92,0.62,0.87,0.64
5,0.25,0.54,0.89,0.67,0.87,0.72
6,0.3,0.59,0.83,0.69,0.87,0.76
7,0.35,0.62,0.75,0.68,0.87,0.77
8,0.4,0.68,0.73,0.7,0.87,0.8
9,0.45,0.73,0.7,0.72,0.87,0.82


### Wow, tuning seemed to help!

|index|threshold|precision|recall|f1|accuracy|auc|
|---|---|---|---|---|---|---|
|0|0\.0|0\.32|1\.0|0\.49|0\.32|0\.87|
|1|0\.05|0\.33|1\.0|0\.5|0\.35|0\.87|
|2|0\.1|0\.4|0\.97|0\.57|0\.53|0\.87|
|3|0\.15|0\.42|0\.94|0\.58|0\.56|0\.87|
|4|0\.2|0\.47|0\.92|0\.62|0\.64|0\.87|
|5|0\.25|0\.54|0\.89|0\.67|0\.72|0\.87|
|6|0\.3|0\.59|0\.83|0\.69|0\.76|0\.87|
|7|0\.35|0\.62|0\.75|0\.68|0\.77|0\.87|
|8|0\.4|0\.68|0\.73|0\.7|0\.8|0\.87|
|9|0\.45|0\.73|0\.7|0\.72|0\.82|0\.87|
|10|0\.5|0\.8|0\.68|0\.74|0\.84|0\.87|
|11|0\.55|0\.78|0\.63|0\.7|0\.83|0\.87|
|12|0\.6|0\.8|0\.62|0\.7|0\.83|0\.87|
|13|0\.65|0\.85|0\.56|0\.67|0\.83|0\.87|
|14|0\.7|0\.87|0\.54|0\.67|0\.83|0\.87|
|15|0\.75|0\.89|0\.54|0\.67|0\.83|0\.87|
|16|0\.8|0\.89|0\.49|0\.63|0\.82|0\.87|
|17|0\.85|0\.93|0\.43|0\.59|0\.81|0\.87|
|18|0\.9|0\.95|0\.3|0\.46|0\.77|0\.87|
|19|0\.95|0\.92|0\.19|0\.32|0\.73|0\.87|
|20|1\.0|0\.0|0\.0|0\.0|0\.68|0\.87|