# Predicting BreastScreen NSW Attendance to Improve Participation

##### by David Schanzer, Student ID 82329622

## Introduction

### Problem Definition

This project aims to determine the accuracy with which the attendance of women at their next breast screening appointment at a BreastScreen NSW clinic or mobile van can be predicted, classifying women as either "regular" (predicted to attend within 90 days of the rescreen date) or "lapsed" (predicted to not attend within 90 days of their rescreen date).
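As a minimal sketch of this labelling rule, the function below assigns "Regular" or "Lapsed" from two dates. The function name, the argument names and the reading of "within 90 days" as "no later than 90 days after the rescreen date" are assumptions made for illustration; the labels actually used in this project come from the SQL query in Appendix 1.

```python
from datetime import date, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=90)  # window stated in the problem definition


def label_episode(rescreen_date: date, attended_date: Optional[date]) -> str:
    """Return 'Regular' or 'Lapsed' for one rescreen invitation (illustrative only)."""
    # Assumption: a woman with no recorded attendance is treated as lapsed.
    if attended_date is None:
        return "Lapsed"
    # Assumption: attending on or before rescreen date + 90 days counts as regular.
    if attended_date <= rescreen_date + GRACE_PERIOD:
        return "Regular"
    return "Lapsed"


print(label_episode(date(2019, 1, 1), date(2019, 3, 1)))
```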

### Significance of Problem

It is a sad fact that breast cancer is the most common cancer affecting women in NSW, with 1 in 8 NSW women developing breast cancer in their lifetime.

Having a regular screening mammogram is the most effective way to find breast cancer early in women aged over 50, because the best time to treat breast cancer is when it is still very small and has not spread. As a result, attending a breast screen appointment every two years is vital to reduce deaths from breast cancer.

Over 80% of women in NSW in the 50-69 age range have at some point in their lives been screened by BreastScreen NSW. However, only 63% of those women were screened within the last 24 months, with the remaining 37% being either under-screened, or having lapsed from the program completely. Rescreen rates are therefore a major driver of screening activity and participation rates.

Understanding rescreening behaviour is necessary for the development of effective strategies for retaining women in the program, and thereby maintaining and improving participation rates. Being able to identify women at high risk of not rescreening will facilitate individual-level interventions (e.g. a reminder phone call) to encourage regular rescreening.

Therefore, this machine learning project aims to:
1. Identify key predictors of non-reattendance at BreastScreen NSW
2. Determine whether it is feasible to develop a prediction tool to flag individuals that are likely to lapse from the screening program, that is, not attend within 90 days of their next scheduled rescreen date.

## Exploration

### Challenges

The first challenge was to determine whether it was feasible to create a high-quality labelled data set that could be used for training a machine learning model. After discussion with relevant Cancer Institute NSW staff members, the BreastScreen data mart in the production data warehouse (database DATA_MART_SCREENING on server CISQL03PRD) was identified as a suitable, relatively clean data set in a dimensional data model.

The next challenge was to identify suitable features (data attributes) that could either be directly extracted or derived from this data mart. After further discussion with the data custodian, a suitable labelled data set was identified and formal approval granted for the study to proceed in the form of a data request. A relatively complex SQL query was then developed, and this has been included in Appendix 1. In total, 824,812 observations were available, with 23 attributes including the target attribute. These are detailed in the next section.

Next, it was necessary to select a suitable machine learning algorithm. After discussion with the tutor of my Machine Learning subject, the XGBoost gradient boosting library was selected, due to its reputation for being highly efficient and flexible, its popularity and success in relevant classification tasks in Kaggle competitions, and the availability of a Python implementation with a wealth of information on how best to use it.

The next challenge was to select a suitable execution platform for the building and testing of the model. The approach taken was to undertake the work using an Anaconda Python distribution using Jupyter Notebooks on an available and powerful MacBook. This decision was made partly because utilising the alternative available platform, Google Colab, would have required uploading production (although deidentified) data, and partly because the Google Colab platform, while offering the use of high-performance GPUs, has a low limit on the amount of memory made available for free use, and the compute platforms are often busy and therefore unavailable.

My complete lack of experience with XGBoost was the next challenge to be overcome, which was met with a combination of DataCamp online training, online searching, and advice from both my Machine Learning subject lecturer and tutors.

The final challenge was, ironically, the amount of data available for analysis. Because my proposed approach involved a large amount of hyperparameter tuning, using all 824,812 observations proved impractical due to excessive execution times. I therefore decided to proceed with only 100,000 randomly chosen observations for the purpose of this assignment, in the hope that these would be a representative subsample of the entire set, and that a model trained on them would therefore be suitable for predicting unseen values for business use.

***Note that, because of the sensitivity of the data, the GitHub folder (https://github.com/DavidSchanzer823239622/UTS_ML2019_82329622) contains only sample data: 1000 observations, with the data values scrambled. Running the notebook on Google Colab with this sample will not reproduce the results obtained with the 100,000 observations.***

### Plan for Data Models and Tests

The 23 features of the data set, with some preliminary data analysis, are as follows:

| No. | Name                   | Distinct values | Distribution                                                     |
| --: | :--------------------- | :-------------: | :-----------------------------------------------------------     |
|   1 | Country of birth       |       215       | 65.8% = 'AUSTRALIA', 5.9% = 'UNITED KINGDOM', 2.4% = 'CHINA',... |
|   2 | Main language spoken   |        22       | 81.3% = 'English Only', 3.8% = 'Other (please specify)', 2.4% = 'Cantonese',... |
|   3 | Indigenous status      |         6       | 98.4% = 'Non-indigenous', 1.2% = 'Aboriginal', 0.27% = 'Not stated', 0.06% = 'Aboriginal and Torres Strait Islander', 0.03% = 'Torres Strait Islander', 0.02% = 'Declines to Respond' |
|   4 | Remoteness Area        |         6       | 0.97% = NULL, then 69.7% = 'Major Cities of Australia', 23.6% = 'Inner Regional Australia', 5.4% = 'Outer Regional Australia', 0.33% = 'Remote Australia', 0.07% = 'Very Remote Australia' |
|   5 | Relative Socio-Economic Disadvantage Decile | 11 | 2.6% = NULL, then 28.1% = 10, 17.8% = 6, 9.8% = 3, 9.5% = 5,... |
|   6 | Does the woman have any history of previous cancer external to program? | 2 | 98.6% = 0, 1.4% = 1 |
|   7 | Does the woman have any family history of cancer? | 2  | 74.6% = 0, 25.4% = 1 |
|   8 | Total number of episodes (previous BreastScreen encounters) | N/A | min = 2, max = 29, avg = 6, stdev = 3.8 **(binned)** |
|   9 | Has the woman been "DNA" (did not attend) at any point in her past as part of the program? | 2 | 72.7% = 0, 27.3% = 1 |
|  10 | Has the woman had an assessment (ie. asked to come back for a more diagnostic test due to something suspicious on her mammogram) at any point in her past as part of the program? | 2 | 74.6% = 0, 25.4% = 1 |
|  11 | Has the woman had a needle biopsy (a potentially painful or traumatic procedure) at any point in her past as part of the program? | 2 | 93.4% = 0, 6.6% = 1 |
|  12 | Did the woman have a Technical Recall (asked to return for another mammogram due to an inadequate image) at any point in her past as part of the program? | 2 | 96.9% = 0, 3.1% = 1 |
|  13 | Age at most recent episode | N/A | 0% = NULL, min = 40, max = 109, avg = 63, stdev = 8.3 **(binned)** |
|  14 | Distance from residential address to location of most recent episode | N/A | 1.23% = NULL, min = 0, max = 3860, avg = 15.6, stdev = 82.6 **(binned)** |
|  15 | Month of year of most recent screening attendance | 13 | 0.006% = NULL, 10.0% = 8 (May), ..., 4.8% = 12 (Dec) |
|  16 | Day of week of most recent screening attendance | 8 | 0.006% = NULL, 21.8% = 4 (Wed), ..., 0.54% = 1 (Sun) |
|  17 | Hour of day of most recent screening attendance | 25 | 0.006% = NULL, 15.0% = 11 (11am-12pm), 14.3% = 9 (9am-10am), ..., 0.0001% = 21 (9-10pm) |
|  18 | Type of venue of most recent screening attendance | 3 | 11.3% = NULL, 66.7% = 'Fixed', 22.0% = 'Mobile' |
|  19 | Number of films (x-rays) taken at most recent screening attendance | N/A | 0.07% = NULL, min = 0, max = 54, avg = 4, stdev = 0.87 **(binned)** |
|  20 | How many days did the woman have to wait for results after her most recent attendance? | N/A | 0.82% = NULL, min = 0, max = 100, avg = 6, stdev = 4.6 **(binned)** |
|  21 | Was the woman 'regular' or 'lapsed' at her 3rd most recent episode? | 3 | 37.4% = NULL, 49.8% = 'Regular', 12.9% = 'Lapsed' |
|  22 | Was the woman 'regular' or 'lapsed' at her 2nd most recent episode? | 3 | 21.7% = NULL, 61.0% = 'Regular', 17.4% = 'Lapsed' |
|  23 | ***TARGET VARIABLE:*** Was the woman 'regular' or 'lapsed' at her most recent episode? | 2 | 74.4% = 'Regular', 25.6% = 'Lapsed' |

Data was acquired by directly querying the data mart database identified in the previous section, using the SQL statements in Appendix 1.

Data quality control was undertaken through exploratory data analysis, using both SQL (which yielded modifications to the SQL code, as well as clarifications about individual data items from the screening data subject matter expert) and the pandas_profiling Python library (which yielded the report [here](http://htmlpreview.github.com/?https://github.com/DavidSchanzer823239622/UTS_ML2019_82329622/blob/master/pandas-profiling%20output.html "Output from pandas_profiling")). The pandas_profiling report revealed no particular data problems that needed addressing.
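The per-attribute figures in the table above (distinct values, NULL share, most common value) can also be derived directly in pandas. The following is a minimal sketch on a hypothetical three-column toy frame; the helper name `summarise_columns` and the toy values are assumptions for illustration, and NULL is counted as a distinct value to match the table.

```python
import pandas as pd

# Hypothetical miniature data set; the column names follow the report,
# the values are invented for illustration.
toy = pd.DataFrame({
    "VenueTypeMostRecentScreening": ["Fixed", "Mobile", None, "Fixed"],
    "HasBeenDNA": [True, False, False, True],
    "LapsedRegularMostRecentEpisode": ["Regular", "Lapsed", "Regular", "Regular"],
})


def summarise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column distinct count (NULL included), NULL share and most common value."""
    rows = []
    for name in df.columns:
        col = df[name]
        # Relative frequency of every value, with NULL kept as its own entry
        shares = col.value_counts(normalize=True, dropna=False)
        rows.append({
            "column": name,
            "distinct_values": col.nunique(dropna=False),
            "pct_null": round(100 * col.isna().mean(), 2),
            "top_value": shares.index[0],
            "top_value_pct": round(100 * shares.iloc[0], 2),
        })
    return pd.DataFrame(rows).set_index("column")


summary = summarise_columns(toy)
print(summary)
```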

As discussed above, XGBoost was chosen from among the machine learning modelling techniques considered due to:
* it being recommended for the task by the tutor of my Machine Learning subject
* its reputation for being designed to be highly efficient and flexible
* its popularity and success in relevant classification tasks in Kaggle competitions
* the availability of a Python implementation with a wealth of available information on how best to implement it

One additional consideration in favour of XGBoost not mentioned above was the availability of a Feature Importance ranking through use of the get_booster().get_fscore() methods. This would allow me to determine which input attributes (the Top N) had the largest impact on the predicted target variable. I felt this would add business value to my investigation.

Consideration was given to also utilising a Random Forest classifier for comparison, as it is also rated highly in terms of its prediction accuracy. However, it rates poorly in terms of interpretability, as it lacks an equivalent Feature Importance rank to the XGBoost ranking. As a result, this classifier was excluded.

The evaluation method chosen to measure the success of the investigation was to set aside 30% of the available data as a Test Set, which would be used only at the beginning to get a baseline accuracy, and at the end to evaluate the effects of tuning on the accuracy. All hyperparameter tuning between these two points would be done exclusively using the Training Set (the remaining 70%), using various subsets as Validation Sets. In tuning the hyperparameters, the AUC measure (area under the receiver operating characteristic curve) would be used as the value to be maximised, as this is appropriate to binary classification problems.
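To make the AUC measure concrete, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all such pairs; the function name and the toy labels and scores are assumptions for illustration, and in practice a library routine such as sklearn.metrics.roc_auc_score would be used instead of this quadratic-time loop.

```python
from typing import Sequence


def pairwise_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Invented labels and scores, for illustration only
example_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

One reason this measure suits the present data: with a class split of roughly 74% 'Regular' to 26% 'Lapsed', a model that gives every woman the same score could reach about 74% accuracy by always predicting 'Regular', yet its AUC would be only 0.5.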

The criteria for success would ultimately be determined by the BreastScreen business team, based on whether the achieved accuracy would be sufficient for the model to be deployed into production in order to make predictions of non-attendance so that interventions could take place. For the purpose of my investigation, my aim was to improve upon the baseline accuracy through hyperparameter tuning.

## Methodology

The experimental methodology undertaken was as follows:
1. Import data
2. Perform exploratory data analysis to check for completeness, quality and other data issues
3. Bin the numeric variables, to aid interpretability of the Feature Importance chart.
4. Perform one-hot encoding of all categorical columns
5. Split the data into X and y
6. Encode the string target class values ("regular" or "lapsed") as integers in a numpy array
7. Split the data set into 70% training, 30% test, using the encoded numpy array
8. **Approach 1**: call XGBClassifier with no parameters, using all of the default values, to measure the baseline accuracy
9. **Approach 2**: See if we can improve this using XGBoost's built-in cross-validation capabilities
10. **Approach 3**: See if we can improve this using a higher number of boosting rounds with automated boosting round selection using early_stopping
11. **Approach 4**: See if we can improve this by tuning a few of the above hyperparameters using a very simple GridSearch

**Approach 5**: Try a step-by-step approach to hyperparameter tuning:
12. Step 1: Fix the learning rate and the number of estimators, with typical values for other parameters
13. Step 2: Tune max_depth and min_child_weight, as they will have the highest impact on model outcome, using the optimal n_estimators value calculated in the previous step. To start with, set wider ranges for max_depth and min_child_weight and then perform another iteration for smaller ranges.
14. Step 3: Fine-tune max_depth and min_child_weight, looking for optimum values, by searching for values 1 above and below the best values discovered so far.
15. Step 4: Tune gamma using the parameters already tuned above.
16. Step 5: Re-calibrate the number of boosting rounds for the updated parameters
17. Step 6: Tune subsample and colsample_bytree
18. Step 7: Try values in 0.05 intervals around the best value so far for subsample and colsample_bytree
19. Step 8: Apply regularization to reduce overfitting, by tuning reg_alpha
20. Step 9: Try values of reg_alpha closer to the best value so far to see if we improve AUC
21. Step 10: Apply this regularization (reg_alpha) in the model and look at the impact
22. Step 11: Finally, lower the learning rate and add more trees, using the cv function of XGBoost


23. Finally, evaluation: call XGBClassifier with the optimised parameters to measure the final achieved accuracy
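Steps 4 to 7 of the list above can be sketched on a small toy frame as follows. The toy values, the fixed random seed and the use of stratification are assumptions for illustration rather than a record of what was run in this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; column names follow the report, values are invented.
toy = pd.DataFrame({
    "VenueTypeMostRecentScreening": ["Fixed", "Mobile", "Fixed", "Fixed", "Mobile",
                                     "Fixed", "Mobile", "Fixed", "Fixed", "Mobile"],
    "HasBeenDNA": [True, False, False, True, False, True, False, False, True, False],
    "LapsedRegularMostRecentEpisode": ["Regular", "Lapsed", "Regular", "Regular", "Lapsed",
                                       "Regular", "Regular", "Lapsed", "Regular", "Regular"],
})
target = "LapsedRegularMostRecentEpisode"

# Step 4: one-hot encode the categorical predictors
encoded = pd.get_dummies(toy, columns=["VenueTypeMostRecentScreening"])

# Step 5: separate the predictors (X) from the target (y)
X = encoded.drop(columns=[target])
y_text = encoded[target]

# Step 6: encode the string classes as integers (classes are sorted alphabetically)
encoder = LabelEncoder()
y = encoder.fit_transform(y_text)

# Step 7: 70% training / 30% test; the seed and stratification are assumptions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(list(encoder.classes_), X_train.shape, X_test.shape)
```

Because LabelEncoder sorts the class names, 'Lapsed' becomes 0 and 'Regular' becomes 1 in this sketch.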

### Building and Training of Data Models

In [23]:
# Import required packages
import pandas as pd
import pandas_profiling
import numpy as np
import xgboost as xgb
import graphviz
import matplotlib.pyplot as plt
from sklearn import model_selection, metrics
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from joblib import dump, load
import time

In [24]:
# Import the data
dtypes = {
'CountryOfBirth':                           'category',    # Country of birth
'MainLanguage':                             'category',    # Main language spoken
'IndigenousStatus':                         'category',    # Indigenous status
'RemotenessArea':                           'category',    # Remoteness Area
'RelativeSocioEconomicDisadvantageDecile':  'category',    # Relative Socio-Economic Disadvantage Decile
'HistoryPreviousCancer':                    bool,          # Does the woman have any history of previous cancer external to program?
'HistoryFamilyCancer':                      bool,          # Does the woman have any family history of cancer?
'TotalEpisodes':                            int,           # Total number of episodes (previous BreastScreen encounters)
'HasBeenDNA':                               bool,          # Has the woman been "DNA" (did not attend) at any point in her past as part of the program?
'HadAssessment':                            bool,          # Has the woman had an assessment (ie. asked to come back for a more diagnostic test due to something suspicious on her mammogram) at any point in her past as part of the program?
'HadNeedleBiopsy':                          bool,          # Has the woman had a needle biopsy (a potentially painful or traumatic procedure) at any point in her past as part of the program?
'HadTechRecall':                            bool,          # Did the woman have a Technical Recall (asked to return for another mammogram due to an inadequate image) at any point in her past as part of the program?
'AgeAtMostRecentEpisode':                   int,           # Age at most recent episode
'DistanceKms':                              float,         # Distance from residential address to location of most recent episode
'MonthMostRecentScreening':                 'category',    # Month of year of most recent screening attendance
'DayOfWeekMostRecentScreening':             'category',    # Day of week of most recent screening attendance
'HourOfDayMostRecentScreening':             'category',    # Hour of day of most recent screening attendance
'VenueTypeMostRecentScreening':             'category',    # Type of venue of most recent screening attendance
'FilmsTakenMostRecentScreening':            int,           # Number of films (x-rays) taken at most recent screening attendance
'DaysFromAttendanceToResultSent':           int,           # How many days did the woman have to wait for results after her most recent attendance?
'LapsedRegular3rdMostRecentEpisode':        'category',    # Was the woman 'regular' or 'lapsed' at her 3rd most recent episode?
'LapsedRegular2ndMostRecentEpisode':        'category',    # Was the woman 'regular' or 'lapsed' at her 2nd most recent episode?
'LapsedRegularMostRecentEpisode':           'category'}    # TARGET VARIABLE: Was the woman 'regular' or 'lapsed' at her most recent episode?

# Data import for the full data set has been commented out for Google Colab use
#df_full = pd.read_csv('Data extraction.csv', dtype = dtypes)

# Random sampling - Random 100,000 rows
#df = df_full.sample(n = 100000)

# For Google Colab:
df = pd.read_csv('https://raw.githubusercontent.com/DavidSchanzer823239622/UTS_ML2019_82329622/master/Data%20extraction%20-%201000%20observations%20scrambled.csv', dtype = dtypes)
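The random sampling step in the cell above does not fix a seed, so each run would draw a different subsample. The sketch below shows one way to make the draw repeatable and to check that the class balance of the subsample resembles that of the full set. The toy frame, the sample size of 100 and the seed value are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the full extract: 1,000 rows with roughly the
# Regular/Lapsed split reported in the feature table (values invented).
full = pd.DataFrame({
    "LapsedRegularMostRecentEpisode": ["Regular"] * 744 + ["Lapsed"] * 256
})

# A fixed random_state returns the same subsample on every run
sample = full.sample(n=100, random_state=42)

# Compare the class shares in the subsample with those in the full set
full_share = full["LapsedRegularMostRecentEpisode"].value_counts(normalize=True)
sample_share = sample["LapsedRegularMostRecentEpisode"].value_counts(normalize=True)
comparison = pd.DataFrame({"full": full_share, "sample": sample_share})
print(comparison)
```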

In [5]:
# How many rows and columns?
df.shape

(100000, 23)

In [6]:
# Inspect the first few rows
df.head()

|  | CountryOfBirth | MainLanguage | IndigenousStatus | RemotenessArea | RelativeSocioEconomicDisadvantageDecile | HistoryPreviousCancer | HistoryFamilyCancer | TotalEpisodes | HasBeenDNA | HadAssessment | ... | DistanceKms | MonthMostRecentScreening | DayOfWeekMostRecentScreening | HourOfDayMostRecentScreening | VenueTypeMostRecentScreening | FilmsTakenMostRecentScreening | DaysFromAttendanceToResultSent | LapsedRegular3rdMostRecentEpisode | LapsedRegular2ndMostRecentEpisode | LapsedRegularMostRecentEpisode |
| --: | :-- | :-- | :-- | :-- | --: | :-- | :-- | --: | :-- | :-- | :-: | --: | --: | --: | --: | :-- | --: | --: | :-- | :-- | :-- |
| 330518 | NEW ZEALAND | English Only | Non-indigenous | Major Cities of Australia | 10 | False | False | 2 | True | False | ... | 11.0 | 4 | 5 | 13 | Fixed | 6 | 7 |  |  | Regular |
| 786429 | AUSTRALIA | English Only | Non-indigenous | Inner Regional Australia | 5 | False | False | 4 | True | False | ... | 40.0 | 6 | 6 | 9 | Fixed | 4 | 5 |  |  | Lapsed |
| 625728 | AUSTRALIA | English Only | Non-indigenous | Major Cities of Australia | 7 | False | False | 9 | True | True | ... | 16.0 | 4 | 4 | 9 | Fixed | 4 | 1 | Lapsed | Regular | Lapsed |
| 721374 | NEW ZEALAND | English Only | Non-indigenous | Major Cities of Australia | 7 | False | False | 2 | True | False | ... | 9.0 | 11 | 5 | 14 | Fixed | 5 | 5 |  |  | Regular |
| 213098 | ITALY | English Only | Non-indigenous | Major Cities of Australia | 0 | False | False | 2 | False | False | ... | 4.0 | 11 | 3 | 10 | Fixed | 4 | 3 |  |  | Regular |


In [7]:
# Verify data types
df.dtypes

CountryOfBirth                             category
MainLanguage                               category
IndigenousStatus                           category
RemotenessArea                             category
RelativeSocioEconomicDisadvantageDecile    category
HistoryPreviousCancer                          bool
HistoryFamilyCancer                            bool
TotalEpisodes                                 int64
HasBeenDNA                                     bool
HadAssessment                                  bool
HadNeedleBiopsy                                bool
HadTechRecall                                  bool
AgeAtMostRecentEpisode                        int64
DistanceKms                                 float64
MonthMostRecentScreening                   category
DayOfWeekMostRecentScreening               category
HourOfDayMostRecentScreening               category
VenueTypeMostRecentScreening               category
FilmsTakenMostRecentScreening                 int64
DaysFromAtte

In [8]:
# Generate pandas_profiling output for EDA
pp = df.copy()

for c in pp:
    # Columns that are not boolean, float, unsigned integer or date/time are profiled as categories
    if (pp[c].dtypes != bool and pp[c].dtypes != np.float64 and pp[c].dtypes != np.uint64
            and pp[c].dtypes != np.uint8 and pp[c].dtypes != np.datetime64
            and pp[c].dtypes != np.timedelta64 and pp[c].dtypes != np.dtype('<m8[ns]')):
        pp[c] = pp[c].astype("str")
        pp[c] = pp[c].astype("category")

    elif pp[c].dtypes == bool:
        pp[c] = pp[c].astype("int")
    
pfr = pandas_profiling.ProfileReport(pp)
pfr.to_file("pandas-profiling output.html")


**The pandas_profiling report can be viewed [here](http://htmlpreview.github.com/?https://github.com/DavidSchanzer823239622/UTS_ML2019_82329622/blob/master/pandas-profiling%20output.html "Output from pandas_profiling"). The report revealed no particular data problems that needed addressing.**

In [9]:
# In preparation for binning the numeric variables, find the required quantiles
print("AgeAtMostRecentEpisode min and max:", min(df.AgeAtMostRecentEpisode), max(df.AgeAtMostRecentEpisode))
print("AgeAtMostRecentEpisode Q1:", df.AgeAtMostRecentEpisode.quantile(.2))
print("AgeAtMostRecentEpisode Q2:", df.AgeAtMostRecentEpisode.quantile(.4))
print("AgeAtMostRecentEpisode Q3:", df.AgeAtMostRecentEpisode.quantile(.6))
print("AgeAtMostRecentEpisode Q4:", df.AgeAtMostRecentEpisode.quantile(.8))
print("AgeAtMostRecentEpisode Q5:", df.AgeAtMostRecentEpisode.quantile(1))

print("TotalEpisodes min and max:", min(df.TotalEpisodes), max(df.TotalEpisodes))
print("TotalEpisodes Q1:", df.TotalEpisodes.quantile(.2))
print("TotalEpisodes Q2:", df.TotalEpisodes.quantile(.4))
print("TotalEpisodes Q3:", df.TotalEpisodes.quantile(.6))
print("TotalEpisodes Q4:", df.TotalEpisodes.quantile(.8))
print("TotalEpisodes Q5:", df.TotalEpisodes.quantile(1))

print("DistanceKms min and max:", min(df.DistanceKms), max(df.DistanceKms))
print("DistanceKms Q1:", df.DistanceKms.quantile(.2))
print("DistanceKms Q2:", df.DistanceKms.quantile(.4))
print("DistanceKms Q3:", df.DistanceKms.quantile(.6))
print("DistanceKms Q4:", df.DistanceKms.quantile(.8))
print("DistanceKms Q5:", df.DistanceKms.quantile(1))

print("FilmsTakenMostRecentScreening min and max:", min(df.FilmsTakenMostRecentScreening), max(df.FilmsTakenMostRecentScreening))
print("FilmsTakenMostRecentScreening Q1:", df.FilmsTakenMostRecentScreening.quantile(.2))
print("FilmsTakenMostRecentScreening Q2:", df.FilmsTakenMostRecentScreening.quantile(.4))
print("FilmsTakenMostRecentScreening Q3:", df.FilmsTakenMostRecentScreening.quantile(.6))
print("FilmsTakenMostRecentScreening Q4:", df.FilmsTakenMostRecentScreening.quantile(.8))
print("FilmsTakenMostRecentScreening Q5:", df.FilmsTakenMostRecentScreening.quantile(1))

print("DaysFromAttendanceToResultSent min and max:", min(df.DaysFromAttendanceToResultSent), max(df.DaysFromAttendanceToResultSent))
print("DaysFromAttendanceToResultSent Q1:", df.DaysFromAttendanceToResultSent.quantile(.2))
print("DaysFromAttendanceToResultSent Q2:", df.DaysFromAttendanceToResultSent.quantile(.4))
print("DaysFromAttendanceToResultSent Q3:", df.DaysFromAttendanceToResultSent.quantile(.6))
print("DaysFromAttendanceToResultSent Q4:", df.DaysFromAttendanceToResultSent.quantile(.8))
print("DaysFromAttendanceToResultSent Q5:", df.DaysFromAttendanceToResultSent.quantile(1))

AgeAtMostRecentEpisode min and max: 40 96
AgeAtMostRecentEpisode Q1: 56.0
AgeAtMostRecentEpisode Q2: 61.0
AgeAtMostRecentEpisode Q3: 66.0
AgeAtMostRecentEpisode Q4: 71.0
AgeAtMostRecentEpisode Q5: 96.0
TotalEpisodes min and max: 2 28
TotalEpisodes Q1: 3.0
TotalEpisodes Q2: 5.0
TotalEpisodes Q3: 7.0
TotalEpisodes Q4: 10.0
TotalEpisodes Q5: 28.0
DistanceKms min and max: 0.0 3406.0
DistanceKms Q1: 2.0
DistanceKms Q2: 3.0
DistanceKms Q3: 6.0
DistanceKms Q4: 11.0
DistanceKms Q5: 3406.0
FilmsTakenMostRecentScreening min and max: 0 19
FilmsTakenMostRecentScreening Q1: 4.0
FilmsTakenMostRecentScreening Q2: 4.0
FilmsTakenMostRecentScreening Q3: 4.0
FilmsTakenMostRecentScreening Q4: 4.0
FilmsTakenMostRecentScreening Q5: 19.0
DaysFromAttendanceToResultSent min and max: 0 97
DaysFromAttendanceToResultSent Q1: 4.0
DaysFromAttendanceToResultSent Q2: 5.0
DaysFromAttendanceToResultSent Q3: 7.0
DaysFromAttendanceToResultSent Q4: 9.0
DaysFromAttendanceToResultSent Q5: 97.0
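The cut points printed above are the 20th, 40th, 60th, 80th and 100th percentiles of each numeric column. As a more compact sketch, the same kind of table can be produced for several columns in a single call to DataFrame.quantile. The toy values below are invented for illustration and will not match the output above.

```python
import pandas as pd

# Hypothetical toy columns; names follow the report, values are invented.
toy = pd.DataFrame({
    "AgeAtMostRecentEpisode": [50, 52, 55, 58, 60, 63, 65, 68, 70, 75],
    "TotalEpisodes": [2, 2, 3, 4, 5, 6, 7, 8, 10, 12],
})

# One row per requested quantile, one column per numeric attribute
cut_points = toy.quantile([0.2, 0.4, 0.6, 0.8, 1.0])
print(cut_points)
```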


In [10]:
# Bin the numeric variables, to aid interpretability of the Feature Importance chart.

# Firstly, put the AgeAtMostRecentEpisode values into 5 equal depth buckets
df['AgeAtMostRecentEpisode_lt56'] = df.AgeAtMostRecentEpisode[df.AgeAtMostRecentEpisode < df.AgeAtMostRecentEpisode.quantile(.2)]
df['AgeAtMostRecentEpisode_56_to_60'] = df.AgeAtMostRecentEpisode[(df.AgeAtMostRecentEpisode >= df.AgeAtMostRecentEpisode.quantile(.2)) & (df.AgeAtMostRecentEpisode < df.AgeAtMostRecentEpisode.quantile(.4))]
df['AgeAtMostRecentEpisode_61_to_65'] = df.AgeAtMostRecentEpisode[(df.AgeAtMostRecentEpisode >= df.AgeAtMostRecentEpisode.quantile(.4)) & (df.AgeAtMostRecentEpisode < df.AgeAtMostRecentEpisode.quantile(.6))]
df['AgeAtMostRecentEpisode_66_to_70'] = df.AgeAtMostRecentEpisode[(df.AgeAtMostRecentEpisode >= df.AgeAtMostRecentEpisode.quantile(.6)) & (df.AgeAtMostRecentEpisode < df.AgeAtMostRecentEpisode.quantile(.8))]
df['AgeAtMostRecentEpisode_ge71'] = df.AgeAtMostRecentEpisode[df.AgeAtMostRecentEpisode >= df.AgeAtMostRecentEpisode.quantile(.8)]
df['AgeAtMostRecentEpisode_lt56'] = df['AgeAtMostRecentEpisode_lt56'].values.astype(np.int64)
df['AgeAtMostRecentEpisode_56_to_60'] = df['AgeAtMostRecentEpisode_56_to_60'].values.astype(np.int64)
df['AgeAtMostRecentEpisode_61_to_65'] = df['AgeAtMostRecentEpisode_61_to_65'].values.astype(np.int64)
df['AgeAtMostRecentEpisode_66_to_70'] = df['AgeAtMostRecentEpisode_66_to_70'].values.astype(np.int64)
df['AgeAtMostRecentEpisode_ge71'] = df['AgeAtMostRecentEpisode_ge71'].values.astype(np.int64)
df = df.drop(['AgeAtMostRecentEpisode'], axis = 1)

# Now, put the TotalEpisodes values into 5 equal depth buckets
df['TotalEpisodes_lt3'] = df.TotalEpisodes[df.TotalEpisodes < df.TotalEpisodes.quantile(.2)]
df['TotalEpisodes_3_to_4'] = df.TotalEpisodes[(df.TotalEpisodes >= df.TotalEpisodes.quantile(.2)) & (df.TotalEpisodes < df.TotalEpisodes.quantile(.4))]
df['TotalEpisodes_5_to_6'] = df.TotalEpisodes[(df.TotalEpisodes >= df.TotalEpisodes.quantile(.4)) & (df.TotalEpisodes < df.TotalEpisodes.quantile(.6))]
df['TotalEpisodes_7_to_9'] = df.TotalEpisodes[(df.TotalEpisodes >= df.TotalEpisodes.quantile(.6)) & (df.TotalEpisodes < df.TotalEpisodes.quantile(.8))]
df['TotalEpisodes_ge10'] = df.TotalEpisodes[df.TotalEpisodes >= df.TotalEpisodes.quantile(.8)]
df['TotalEpisodes_lt3'] = df['TotalEpisodes_lt3'].values.astype(np.int64)
df['TotalEpisodes_3_to_4'] = df['TotalEpisodes_3_to_4'].values.astype(np.int64)
df['TotalEpisodes_5_to_6'] = df['TotalEpisodes_5_to_6'].values.astype(np.int64)
df['TotalEpisodes_7_to_9'] = df['TotalEpisodes_7_to_9'].values.astype(np.int64)
df['TotalEpisodes_ge10'] = df['TotalEpisodes_ge10'].values.astype(np.int64)
df = df.drop(['TotalEpisodes'], axis = 1)

# Next, put the DistanceKms values into 5 equal depth buckets
df['DistanceKms_lt1'] = df.DistanceKms[df.DistanceKms < df.DistanceKms.quantile(.2)]
df['DistanceKms_1_to_lt3'] = df.DistanceKms[(df.DistanceKms >= df.DistanceKms.quantile(.2)) & (df.DistanceKms < df.DistanceKms.quantile(.4))]
df['DistanceKms_3_to_lt6'] = df.DistanceKms[(df.DistanceKms >= df.DistanceKms.quantile(.4)) & (df.DistanceKms < df.DistanceKms.quantile(.6))]
df['DistanceKms_6_to_lt11'] = df.DistanceKms[(df.DistanceKms >= df.DistanceKms.quantile(.6)) & (df.DistanceKms < df.DistanceKms.quantile(.8))]
df['DistanceKms_ge11'] = df.DistanceKms[df.DistanceKms >= df.DistanceKms.quantile(.8)]
df['DistanceKms_lt1'] = df['DistanceKms_lt1'].values.astype(np.float64)
df['DistanceKms_1_to_lt3'] = df['DistanceKms_1_to_lt3'].values.astype(np.float64)
df['DistanceKms_3_to_lt6'] = df['DistanceKms_3_to_lt6'].values.astype(np.float64)
df['DistanceKms_6_to_lt11'] = df['DistanceKms_6_to_lt11'].values.astype(np.float64)
df['DistanceKms_ge11'] = df['DistanceKms_ge11'].values.astype(np.float64)
df = df.drop(['DistanceKms'], axis = 1)

# Next, put the FilmsTakenMostRecentScreening values into 5 non-equal depth buckets (due to extreme skew)
df['FilmsTakenMostRecentScreening_lt4'] = df.FilmsTakenMostRecentScreening[df.FilmsTakenMostRecentScreening < 4]
df['FilmsTakenMostRecentScreening_4'] = df.FilmsTakenMostRecentScreening[(df.FilmsTakenMostRecentScreening == 4)]
df['FilmsTakenMostRecentScreening_5'] = df.FilmsTakenMostRecentScreening[(df.FilmsTakenMostRecentScreening == 5)]
df['FilmsTakenMostRecentScreening_6_to_7'] = df.FilmsTakenMostRecentScreening[(df.FilmsTakenMostRecentScreening >= 6) & (df.FilmsTakenMostRecentScreening <= 7)]
df['FilmsTakenMostRecentScreening_ge8'] = df.FilmsTakenMostRecentScreening[df.FilmsTakenMostRecentScreening >= 8]
df['FilmsTakenMostRecentScreening_lt4'] = df['FilmsTakenMostRecentScreening_lt4'].values.astype(np.int64)
df['FilmsTakenMostRecentScreening_4'] = df['FilmsTakenMostRecentScreening_4'].values.astype(np.int64)
df['FilmsTakenMostRecentScreening_5'] = df['FilmsTakenMostRecentScreening_5'].values.astype(np.int64)
df['FilmsTakenMostRecentScreening_6_to_7'] = df['FilmsTakenMostRecentScreening_6_to_7'].values.astype(np.int64)
df['FilmsTakenMostRecentScreening_ge8'] = df['FilmsTakenMostRecentScreening_ge8'].values.astype(np.int64)
df = df.drop(['FilmsTakenMostRecentScreening'], axis = 1)
    
# Finally, put the DaysFromAttendanceToResultSent values into 5 equal depth buckets
df['DaysFromAttendanceToResultSent_lt4'] = df.DaysFromAttendanceToResultSent[df.DaysFromAttendanceToResultSent < df.DaysFromAttendanceToResultSent.quantile(.2)]
df['DaysFromAttendanceToResultSent_4'] = df.DaysFromAttendanceToResultSent[(df.DaysFromAttendanceToResultSent >= df.DaysFromAttendanceToResultSent.quantile(.2)) & (df.DaysFromAttendanceToResultSent < df.DaysFromAttendanceToResultSent.quantile(.4))]
df['DaysFromAttendanceToResultSent_5_to_6'] = df.DaysFromAttendanceToResultSent[(df.DaysFromAttendanceToResultSent >= df.DaysFromAttendanceToResultSent.quantile(.4)) & (df.DaysFromAttendanceToResultSent < df.DaysFromAttendanceToResultSent.quantile(.6))]
df['DaysFromAttendanceToResultSent_7_to_8'] = df.DaysFromAttendanceToResultSent[(df.DaysFromAttendanceToResultSent >= df.DaysFromAttendanceToResultSent.quantile(.6)) & (df.DaysFromAttendanceToResultSent < df.DaysFromAttendanceToResultSent.quantile(.8))]
df['DaysFromAttendanceToResultSent_ge9'] = df.DaysFromAttendanceToResultSent[df.DaysFromAttendanceToResultSent >= df.DaysFromAttendanceToResultSent.quantile(.8)]
df['DaysFromAttendanceToResultSent_lt4'] = df['DaysFromAttendanceToResultSent_lt4'].values.astype(np.int64)
df['DaysFromAttendanceToResultSent_4'] = df['DaysFromAttendanceToResultSent_4'].values.astype(np.int64)
df['DaysFromAttendanceToResultSent_5_to_6'] = df['DaysFromAttendanceToResultSent_5_to_6'].values.astype(np.int64)
df['DaysFromAttendanceToResultSent_7_to_8'] = df['DaysFromAttendanceToResultSent_7_to_8'].values.astype(np.int64)
df['DaysFromAttendanceToResultSent_ge9'] = df['DaysFromAttendanceToResultSent_ge9'].values.astype(np.int64)
df = df.drop(['DaysFromAttendanceToResultSent'], axis = 1)

print(df.dtypes)

CountryOfBirth                             category
MainLanguage                               category
IndigenousStatus                           category
RemotenessArea                             category
RelativeSocioEconomicDisadvantageDecile    category
HistoryPreviousCancer                          bool
HistoryFamilyCancer                            bool
HasBeenDNA                                     bool
HadAssessment                                  bool
HadNeedleBiopsy                                bool
HadTechRecall                                  bool
MonthMostRecentScreening                   category
DayOfWeekMostRecentScreening               category
HourOfDayMostRecentScreening               category
VenueTypeMostRecentScreening               category
LapsedRegular3rdMostRecentEpisode          category
LapsedRegular2ndMostRecentEpisode          category
LapsedRegularMostRecentEpisode             category
AgeAtMostRecentEpisode_lt56                   int64
AgeAtMostRec
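
The equal-depth bucketing in the cell above can also be expressed with `pd.qcut`, which computes the quantile edges and assigns each row to a bucket in one call. The following is a minimal sketch on made-up data, not the code used to produce the results in this notebook; the sample values and the `Days` prefix are assumptions for illustration. Note that `pd.qcut` uses right-closed intervals, whereas the cell above uses `>=` on the lower edge, so rows that fall exactly on a quantile can land in a different bucket.

```python
import pandas as pd

# Hypothetical sample of days between attendance and the result being sent
days = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Assign each value to one of 5 equal-depth buckets (codes 0 to 4);
# duplicates='drop' merges buckets if repeated values make two edges equal
bucket = pd.qcut(days, q=5, labels=False, duplicates='drop')

# Build one 0/1 indicator column per bucket
indicators = pd.get_dummies(bucket, prefix='Days').astype(int)
print(indicators.sum())
```

Each row then has exactly one indicator set to 1, which is the usual meaning of a bucketed feature.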

In [11]:
# No need to clean missing data or remove correlated features,
# as boosted trees (which is what we will be using) are robust to these potential data problems

# Perform one-hot encoding of all categorical columns

# LapsedRegularMostRecentEpisode is also categorical, but it is the target variable, so it is not encoded here
categorical_columns = [
    'CountryOfBirth',
    'MainLanguage',
    'IndigenousStatus',
    'RemotenessArea',
    'RelativeSocioEconomicDisadvantageDecile',
    'MonthMostRecentScreening',
    'DayOfWeekMostRecentScreening',
    'HourOfDayMostRecentScreening',
    'VenueTypeMostRecentScreening',
    'LapsedRegular3rdMostRecentEpisode',
    'LapsedRegular2ndMostRecentEpisode'
]

# Replace each categorical column with its one-hot encoded columns, prefixed with the original column name
for column in categorical_columns:
    one_hot = pd.get_dummies(df[column], prefix = column)
    df = df.drop(column, axis = 1)
    df = df.join(one_hot)

df.columns.tolist()

['HistoryPreviousCancer',
 'HistoryFamilyCancer',
 'HasBeenDNA',
 'HadAssessment',
 'HadNeedleBiopsy',
 'HadTechRecall',
 'LapsedRegularMostRecentEpisode',
 'AgeAtMostRecentEpisode_lt56',
 'AgeAtMostRecentEpisode_56_to_60',
 'AgeAtMostRecentEpisode_61_to_65',
 'AgeAtMostRecentEpisode_66_to_70',
 'AgeAtMostRecentEpisode_ge71',
 'TotalEpisodes_lt3',
 'TotalEpisodes_3_to_4',
 'TotalEpisodes_5_to_6',
 'TotalEpisodes_7_to_9',
 'TotalEpisodes_ge10',
 'DistanceKms_lt1',
 'DistanceKms_1_to_lt3',
 'DistanceKms_3_to_lt6',
 'DistanceKms_6_to_lt11',
 'DistanceKms_ge11',
 'FilmsTakenMostRecentScreening_lt4',
 'FilmsTakenMostRecentScreening_4',
 'FilmsTakenMostRecentScreening_5',
 'FilmsTakenMostRecentScreening_6_to_7',
 'FilmsTakenMostRecentScreening_ge8',
 'DaysFromAttendanceToResultSent_lt4',
 'DaysFromAttendanceToResultSent_4',
 'DaysFromAttendanceToResultSent_5_to_6',
 'DaysFromAttendanceToResultSent_7_to_8',
 'DaysFromAttendanceToResultSent_ge9',
 'CountryOfBirth_AFGHANISTAN',
 'CountryOfBirth

In [12]:
df.shape

(100000, 340)

In [13]:
# Split the data into X and y
TargetVariable = 'LapsedRegularMostRecentEpisode'
X = df.loc[:, df.columns != TargetVariable]
y = np.ravel(df.loc[:, df.columns == TargetVariable])
print(X.shape)
print(y.shape)

(100000, 339)
(100000,)


In [14]:
# Encode the string target class values ("regular" or "lapsed") as integers in a numpy array
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(y)
label_encoded_y = label_encoder.transform(y)
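
`LabelEncoder` numbers the classes in sorted order of their labels, which determines which class is encoded as 1 and therefore which class the later `predict_proba(...)[:, 1]` call refers to. The sketch below reproduces that mapping in plain Python. It assumes the target values are the lowercase strings "lapsed" and "regular", as described in the comment above; `encode_labels` is a hypothetical helper, not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


# Hypothetical target values, assuming lowercase class names
classes, encoded = encode_labels(["regular", "lapsed", "regular", "regular"])
print(classes)   # ['lapsed', 'regular']
print(encoded)   # [1, 0, 1, 1]
```

Under that assumption, "lapsed" becomes 0 and "regular" becomes 1.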

In [15]:
# Split the data set into 70% training, 30% test, using the encoded numpy array
seed = 7
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, label_encoded_y, test_size=test_size, random_state=seed)

In [16]:
# Approach 1: call XGBClassifier with no parameters, using all of the default values, to measure the baseline accuracy

start_time = time.time()

# Fit the model to training data
model_baseline = xgb.XGBClassifier()
model_baseline.fit(X_train, y_train)
print(model_baseline)

# Make predictions for the test data
y_pred = model_baseline.predict(X_test)
# predict() already returns class labels (0 or 1), so rounding leaves the values unchanged
predictions = [round(value) for value in y_pred]

# Evaluate the predictions made using the test data
accuracy = accuracy_score(y_test, predictions)
print("Untuned accuracy: %.2f%%" % (accuracy * 100.0))

# Save the model to disk using joblib’s replacement of pickle (dump & load), which is more efficient on objects
# that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators
dump(model_baseline, 'model_baseline.joblib')

print("--- %s seconds ---" % (time.time() - start_time))

# Output (recorded from an earlier run; the figures in the cell output below differ slightly):
#     XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#                   colsample_bynode=1, colsample_bytree=1, gamma=0,
#                   learning_rate=0.1, max_delta_step=0, max_depth=3,
#                   min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
#                   nthread=None, objective='binary:logistic', random_state=0,
#                   reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
#                   silent=None, subsample=1, verbosity=1)
#     Untuned accuracy: 79.00%
#     --- 90.91621375083923 seconds ---
# So, our baseline accuracy on the test dataset is about 79% (79.00% in the run recorded above,
# 78.60% in the cell output below) - can we do better?

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
Untuned accuracy: 78.60%
--- 98.3970000743866 seconds ---


In [17]:
# Approach 2: See if we can improve this using XGBoost's built-in cross-validation capabilities

start_time = time.time()

# Create the DMatrix required for the xgboost cv method, using some minimal parameters
dmatrix = xgb.DMatrix(data = X_train, label = y_train)
params={"objective":"binary:logistic","max_depth":4}
model_cv = xgb.cv(dtrain = dmatrix, params = params, nfold = 4, num_boost_round = 10, metrics = ["error","auc"],
                  as_pandas = True)

print("Untuned cross-validated accuracy: %.2f%%" %(((1 - model_cv["test-error-mean"]).iloc[-1]) * 100.0))
print("Untuned cross-validated AUC: %.4f" %(((model_cv["test-auc-mean"]).iloc[-1])))

print("--- %s seconds ---" % (time.time() - start_time))

# Save the cross-validation results to disk (xgb.cv returns a pandas DataFrame of per-round metrics,
# not a fitted model)
dump(model_cv, 'model_cv.joblib')

# Output (recorded from an earlier run; the figures in the cell output below differ slightly):
#     Untuned cross-validated accuracy: 78.85%
#     Untuned cross-validated AUC: 0.7878
#     --- 41.71943497657776 seconds ---
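# In the recorded run, this accuracy of 78.85% is slightly lower than the baseline accuracy of 79.00%.
# Cross-validation gives a more reliable estimate of performance but does not change the model,
# so on its own it is not going to "magically" improve accuracy.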

Untuned cross-validated accuracy: 78.99%
Untuned cross-validated AUC: 0.7869
--- 42.51188898086548 seconds ---


['model_cv.joblib']
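
The AUC reported alongside the accuracy can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The following plain-Python sketch computes it that way on made-up scores; it is for illustration only and is not how XGBoost computes the metric internally.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)   # 0.75
```

Unlike accuracy, this measure does not depend on the 0.5 classification threshold, which is why the two figures can move independently.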

In [240]:
# Approach 3: See if we can improve this using a higher number of boosting rounds
# with automated boosting round selection using early_stopping

start_time = time.time()

# Create the DMatrix required for the xgboost cv method, using some minimal parameters
dmatrix = xgb.DMatrix(data=X_train, label=y_train)
params = {"objective":"binary:logistic", "max_depth":4}
model_cv_early_stopping = xgb.cv(dtrain = dmatrix, params = params, early_stopping_rounds = 10, nfold = 4,
                                 num_boost_round = 50, metrics = ["error","auc"], seed = 0, as_pandas = True)

print("Cross-validated early_stopping accuracy: %.2f%%" %(((1 - model_cv_early_stopping["test-error-mean"]).iloc[-1]) * 100.0))
print("Cross-validated early_stopping AUC: %.4f" %(((model_cv_early_stopping["test-auc-mean"]).iloc[-1])))

print("--- %s seconds ---" % (time.time() - start_time))

# Save the cross-validation results to disk (xgb.cv returns a pandas DataFrame of per-round metrics,
# not a fitted model)
dump(model_cv_early_stopping, 'model_cv_early_stopping.joblib')

# Output:
#     Cross-validated early_stopping accuracy: 79.10%
#     Cross-validated early_stopping AUC: 0.7943
#     --- 137.35042786598206 seconds ---
# This accuracy of 79.10% is slightly higher than the baseline accuracy of 79.00%,
# and the AUC of 0.7943 is also slightly higher than the untuned cross-validated AUC of 0.7878.
# However, we're not getting much improvement, so time to start tuning the XGBoost hyperparameters.

Cross-validated early_stopping accuracy: 79.10%
Cross-validated early_stopping AUC: 0.7943
--- 137.35042786598206 seconds ---


['model_cv_early_stopping.joblib']
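
The early stopping rule used above can be sketched in a few lines: keep boosting while the validation metric keeps improving, and stop once it has failed to improve for a given number of consecutive rounds. The function below is a simplified illustration on made-up scores, not XGBoost's implementation; `rounds_kept` and its arguments are hypothetical names.

```python
def rounds_kept(scores, patience, maximize=True):
    """Return the number of boosting rounds kept under a simple early stopping rule."""
    best_index = 0
    for i, score in enumerate(scores):
        if maximize:
            improved = score > scores[best_index]
        else:
            improved = score < scores[best_index]
        if improved:
            best_index = i
        elif i - best_index >= patience:
            # No improvement for `patience` rounds, so stop training
            break
    return best_index + 1


# Hypothetical validation AUC after each boosting round
validation_auc = [0.70, 0.75, 0.78, 0.779, 0.778, 0.777, 0.80]
print(rounds_kept(validation_auc, patience=3))    # 3
print(rounds_kept(validation_auc, patience=10))   # 7
```

The example also shows the trade-off in choosing the patience: a small value can stop before a later improvement is reached.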

In [241]:
# For my reference: some frequently tuned XGBoost parameters for the tree base learner:
#
# learning rate (eta):
#     Affects how quickly the model fits the residual error using additional base learners.
#     A low learning rate will require more boosting rounds to achieve the same reduction in residual error as an
#     XGBoost model with a high learning rate.
# gamma:
#     Minimum loss reduction to create new tree split, which affects how strongly regularised the trained model will
#     be.
# lambda:
#     L2 regularisation on leaf weights, which affects how strongly regularised the trained model will be.
# alpha:
#     L1 regularisation on leaf weights, which affects how strongly regularised the trained model will be.
# max_depth:
#     Maximum depth per tree.
#     Must be a positive integer value.
#     Affects how deeply each tree is allowed to grow during any given boosting round.
# subsample:
#     Fraction of samples used per tree.
#     Must be a value between 0 and 1.
#     The fraction of the total training set that can be used for any given boosting round.
#     If the value is low, then the fraction of training data used per boosting round would be low and may run into
#     underfitting problems, while a value that is very high can lead to overfitting.
# colsample_bytree:
#     Fraction of features used per tree.
#     The fraction of features that can be selected from during any given boosting round.
#     Must be a value between 0 and 1.
#     A large value means that almost all features can be used to build a tree during a given boosting round,
#     while a small value means that the fraction of features that can be selected from is very small.
#     Smaller values can be thought of as providing additional regularisation to the model,
#     while using all columns may overfit a trained model.
# num_boost_round:
#     Number of boosting rounds.
#     The number of trees to be built or the number of base learners to be constructed.

In [242]:
# Approach 4: See if we can improve this by tuning a few of the above hyperparameters using a very simple GridSearch

start_time = time.time()

# Create the parameter grid
gbm_param_grid = {
    "colsample_bytree": [0.3, 0.7],
    "n_estimators": [50],
    "max_depth": [2, 5]
}

# Instantiate the classifier
gbm = xgb.XGBClassifier()

# Perform grid search
model_grid_search = GridSearchCV(param_grid = gbm_param_grid, estimator = gbm, scoring = "roc_auc", cv = 4,
                                 n_jobs = -1, verbose = 1)

# Fit model_grid_search to the data
model_grid_search.fit(X_train, y_train)

# Print the best parameters and the highest cross-validated AUC
print("Best parameters found: ", model_grid_search.best_params_)
print("Best AUC found: %.4f" %model_grid_search.best_score_)

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_grid_search, 'model_grid_search.joblib')

# Output:
#     Fitting 4 folds for each of 4 candidates, totalling 16 fits
#     [Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
#     [Parallel(n_jobs=-1)]: Done  16 out of  16 | elapsed:  1.7min finished
#     Best parameters found:  {'colsample_bytree': 0.7, 'max_depth': 5, 'n_estimators': 50}
#     Best AUC found: 0.7955
#     --- 155.17163109779358 seconds ---
# This AUC of 0.7955 is slightly better than the cross-validated early_stopping AUC of 0.7943.
# However, we just picked some hyperparameters somewhat haphazardly, so let's try a more structured approach.

Fitting 4 folds for each of 4 candidates, totalling 16 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 out of  16 | elapsed:  1.7min finished


Best parameters found:  {'colsample_bytree': 0.7, 'max_depth': 5, 'n_estimators': 50}
Best AUC found: 0.7955
--- 155.17163109779358 seconds ---


['model_grid_search.joblib']
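
The "4 candidates, totalling 16 fits" line in the output follows directly from the parameter grid: every combination of the listed values is one candidate, and each candidate is fitted once per cross-validation fold. The sketch below enumerates the combinations with `itertools.product`; it mirrors the grid used above but is only an illustration of the counting, not scikit-learn's own code.

```python
from itertools import product

param_grid = {
    "colsample_bytree": [0.3, 0.7],
    "n_estimators": [50],
    "max_depth": [2, 5],
}
cv_folds = 4

# Every combination of one value per parameter is a candidate
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]
total_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates,", total_fits, "fits")
```

Because the number of candidates is the product of the list lengths, adding values to several parameters at once makes the run time grow quickly.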

In [243]:
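# Define a function to create XGBoost models and perform cross-validation
def modelfit(alg, data, label, useTrainCV = True, cv_folds = 5, early_stopping_rounds = 50):
    # If useTrainCV is set, use xgb.cv with early stopping to choose n_estimators (capped at the value already
    # set on alg), then refit alg on all of the supplied data and report metrics on that same data.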
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(data = data, label = label)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round = alg.get_params()['n_estimators'], nfold = cv_folds,
            metrics = 'auc', early_stopping_rounds = early_stopping_rounds)
        alg.set_params(n_estimators = cvresult.shape[0])
    
    # Fit the algorithm on the data
    alg.fit(data, label, eval_metric = 'auc', verbose = True)
        
    # Predict training set
    dtrain_predictions = alg.predict(data)
    dtrain_predprob = alg.predict_proba(data)[:,1]
        
    # Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(label, dtrain_predictions))
    print("AUC Score (Train): %.4f" % metrics.roc_auc_score(label, dtrain_predprob))
    print("Best Iteration: {}".format(alg.get_booster().best_iteration))  
    print("Best ntree limit: {}".format(alg.get_booster().best_ntree_limit))
    
    feat_imp = pd.Series(alg.get_booster().get_fscore()).nlargest(50).sort_values(ascending=False)
    feat_imp.plot(kind = 'bar', title = 'Top 50 Feature Importances')
    plt.ylabel('Feature Importance Score')
    plt.tight_layout()

In [244]:
# Approach 5: Try a step-by-step approach to hyperparameter tuning
# based on https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

start_time = time.time()

# Step 1: Fix the learning rate and the number of estimators, with typical values for other parameters
model_step_by_step_1 = \
    xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 1000, max_depth = 5, min_child_weight = 1,
                      gamma = 0, subsample = 0.8, colsample_bytree = 0.8, objective = 'binary:logistic',
                      nthread = 4, scale_pos_weight = 1, seed = 27)

modelfit(model_step_by_step_1, X_train, y_train)

# Save the initial Feature Importances bar chart to a file
plt.savefig('Feature Importances - initial.png')

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_1, 'model_step_by_step_1.joblib')

# Output:
#     Model Report
#     Accuracy : 0.8053
#     AUC Score (Train): 0.8213
#     Best Iteration: 128
#     Best ntree limit: 129
#     --- 265.79448223114014 seconds ---
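# So, the AUC on the training set is 0.8213. Note that this is measured on the data the model was fitted to,
# so it is not directly comparable with the cross-validated AUC of 0.7955 from the grid search.
# Let's keep tuning to see if we can improve this further.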


Model Report
Accuracy : 0.8053
AUC Score (Train): 0.8213
Best Iteration: 128
Best ntree limit: 129
--- 265.79448223114014 seconds ---


['model_step_by_step_1.joblib']

**The initial Feature Importances bar chart can be viewed [here](https://github.com/DavidSchanzer823239622/UTS_ML2019_82329622/blob/master/Feature%20Importances%20-%20initial.png "Initial Feature Importances bar chart")**

In [245]:
# Step 2: Tune max_depth and min_child_weight, as they will have the highest impact on model outcome.
# Use the optimal n_estimators value calculated in the previous step (129).
# To start with, set wider ranges for max_depth and min_child_weight and then perform another iteration for
# smaller ranges.

start_time = time.time()

param_step_by_step_2 = {
    'max_depth': range(3, 10, 2),        # Values [3, 5, 7, 9]
    'min_child_weight': range(1, 6, 2)   # Values [1, 3, 5]
}
model_step_by_step_2 = \
    GridSearchCV(estimator = xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 129, max_depth = 5,
                 min_child_weight = 1, gamma = 0, subsample = 0.8, colsample_bytree = 0.8,
                 objective = 'binary:logistic', nthread = 4, scale_pos_weight = 1, seed = 27),
                 param_grid = param_step_by_step_2, scoring = 'roc_auc',n_jobs = 4, iid = False, cv = 5,
                 verbose = 1)
    
model_step_by_step_2.fit(X_train, y_train)

print(pd.DataFrame(model_step_by_step_2.cv_results_)[['mean_test_score', 'std_test_score', 'params']])
print("Best params: %s" % model_step_by_step_2.best_params_)
print("Best score: %f" % model_step_by_step_2.best_score_)

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_2, 'model_step_by_step_2.joblib')

# Output:
#     Fitting 5 folds for each of 12 candidates, totalling 60 fits
#     [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
#     [Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 18.4min
#     [Parallel(n_jobs=4)]: Done  60 out of  60 | elapsed: 28.9min finished
#         mean_test_score  std_test_score                                   params
#     0          0.792775        0.002397  {'max_depth': 3, 'min_child_weight': 1}
#     1          0.792989        0.002257  {'max_depth': 3, 'min_child_weight': 3}
#     2          0.792919        0.002463  {'max_depth': 3, 'min_child_weight': 5}
#     3          0.797768        0.002121  {'max_depth': 5, 'min_child_weight': 1}
#     4          0.797900        0.002359  {'max_depth': 5, 'min_child_weight': 3}
#     5          0.798228        0.002526  {'max_depth': 5, 'min_child_weight': 5}
#     6          0.798096        0.002827  {'max_depth': 7, 'min_child_weight': 1}
#     7          0.798151        0.002446  {'max_depth': 7, 'min_child_weight': 3}
#     8          0.798359        0.002328  {'max_depth': 7, 'min_child_weight': 5}
#     9          0.795148        0.001951  {'max_depth': 9, 'min_child_weight': 1}
#     10         0.795467        0.001924  {'max_depth': 9, 'min_child_weight': 3}
#     11         0.796833        0.001508  {'max_depth': 9, 'min_child_weight': 5}
#     Best params: {'max_depth': 7, 'min_child_weight': 5}
#     Best score: 0.798359
#     --- 1784.0705111026764 seconds ---
# So, the best values discovered so far (using an interval of two) are 7 for max_depth and 5 for min_child_weight.

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 18.4min
[Parallel(n_jobs=4)]: Done  60 out of  60 | elapsed: 28.9min finished


    mean_test_score  std_test_score                                   params
0          0.792775        0.002397  {'max_depth': 3, 'min_child_weight': 1}
1          0.792989        0.002257  {'max_depth': 3, 'min_child_weight': 3}
2          0.792919        0.002463  {'max_depth': 3, 'min_child_weight': 5}
3          0.797768        0.002121  {'max_depth': 5, 'min_child_weight': 1}
4          0.797900        0.002359  {'max_depth': 5, 'min_child_weight': 3}
5          0.798228        0.002526  {'max_depth': 5, 'min_child_weight': 5}
6          0.798096        0.002827  {'max_depth': 7, 'min_child_weight': 1}
7          0.798151        0.002446  {'max_depth': 7, 'min_child_weight': 3}
8          0.798359        0.002328  {'max_depth': 7, 'min_child_weight': 5}
9          0.795148        0.001951  {'max_depth': 9, 'min_child_weight': 1}
10         0.795467        0.001924  {'max_depth': 9, 'min_child_weight': 3}
11         0.796833        0.001508  {'max_depth': 9, 'min_child_weight': 5}

['model_step_by_step_2.joblib']

In [246]:
# Step 3: Fine-tune max_depth and min_child_weight, looking for optimum values,
# by searching for values 1 above and below the best values discovered so far.

start_time = time.time()

param_step_by_step_3 = {
    'max_depth': [6, 7, 8],
    'min_child_weight': [4, 5, 6]
}

model_step_by_step_3 = \
    GridSearchCV(estimator = xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 129, max_depth = 7,
                 min_child_weight = 5, gamma = 0, subsample = 0.8, colsample_bytree = 0.8,
                 objective = 'binary:logistic', nthread = 4, scale_pos_weight = 1, seed = 27), 
                 param_grid = param_step_by_step_3, scoring = 'roc_auc', n_jobs = 4, iid = False, cv = 5,
                 verbose = 1)

model_step_by_step_3.fit(X_train, y_train)

print(pd.DataFrame(model_step_by_step_3.cv_results_)[['mean_test_score', 'std_test_score', 'params']])
print("Best params: %s" % model_step_by_step_3.best_params_)
print("Best score: %f" % model_step_by_step_3.best_score_)

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_3, 'model_step_by_step_3.joblib')

# Output:
#     Fitting 5 folds for each of 9 candidates, totalling 45 fits
#     [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
#     [Parallel(n_jobs=4)]: Done  45 out of  45 | elapsed: 23.8min finished
#        mean_test_score  std_test_score                                   params
#     0         0.798927        0.002129  {'max_depth': 6, 'min_child_weight': 4}
#     1         0.798543        0.002475  {'max_depth': 6, 'min_child_weight': 5}
#     2         0.798449        0.002379  {'max_depth': 6, 'min_child_weight': 6}
#     3         0.798134        0.002382  {'max_depth': 7, 'min_child_weight': 4}
#     4         0.798359        0.002328  {'max_depth': 7, 'min_child_weight': 5}
#     5         0.798307        0.002430  {'max_depth': 7, 'min_child_weight': 6}
#     6         0.797472        0.002018  {'max_depth': 8, 'min_child_weight': 4}
#     7         0.797375        0.002247  {'max_depth': 8, 'min_child_weight': 5}
#     8         0.797992        0.002273  {'max_depth': 8, 'min_child_weight': 6}
#     Best params: {'max_depth': 6, 'min_child_weight': 4}
#     Best score: 0.798927
#     --- 1472.9096410274506 seconds ---
# So, we've found that the optimum values are 6 for max_depth and 4 for min_child_weight.

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  45 out of  45 | elapsed: 23.8min finished


   mean_test_score  std_test_score                                   params
0         0.798927        0.002129  {'max_depth': 6, 'min_child_weight': 4}
1         0.798543        0.002475  {'max_depth': 6, 'min_child_weight': 5}
2         0.798449        0.002379  {'max_depth': 6, 'min_child_weight': 6}
3         0.798134        0.002382  {'max_depth': 7, 'min_child_weight': 4}
4         0.798359        0.002328  {'max_depth': 7, 'min_child_weight': 5}
5         0.798307        0.002430  {'max_depth': 7, 'min_child_weight': 6}
6         0.797472        0.002018  {'max_depth': 8, 'min_child_weight': 4}
7         0.797375        0.002247  {'max_depth': 8, 'min_child_weight': 5}
8         0.797992        0.002273  {'max_depth': 8, 'min_child_weight': 6}
Best params: {'max_depth': 6, 'min_child_weight': 4}
Best score: 0.798927
--- 1472.9096410274506 seconds ---


['model_step_by_step_3.joblib']
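
When refining a grid like this, one quick sanity check is whether the best value sits on the edge of the range that was searched, since the optimum could then lie just outside it. The sketch below applies that check to the Step 3 grid and result. It is an optional diagnostic rather than part of the tuning procedure followed here, and `params_on_grid_edge` is a hypothetical helper.

```python
def params_on_grid_edge(param_grid, best_params):
    """Return the parameters whose best value is the smallest or largest value searched."""
    on_edge = []
    for name, values in param_grid.items():
        if best_params[name] in (min(values), max(values)):
            on_edge.append(name)
    return on_edge


# Grid and best parameters taken from Step 3 above
grid = {"max_depth": [6, 7, 8], "min_child_weight": [4, 5, 6]}
best = {"max_depth": 6, "min_child_weight": 4}
edge_params = params_on_grid_edge(grid, best)
print(edge_params)   # ['max_depth', 'min_child_weight']
```

Both best values are at the lower edge of the fine grid. The coarser grid in Step 2 did cover max_depth = 5 and min_child_weight = 3, although not in combination with the values chosen here.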

In [247]:
# Step 4: Tune gamma using the parameters already tuned above.

start_time = time.time()

param_step_by_step_4 = {
    'gamma': [i/10.0 for i in range(0,5)]    # Values [0.0, 0.1, 0.2, 0.3, 0.4]
}

model_step_by_step_4 = \
    GridSearchCV(estimator = xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 129, max_depth = 6,
                 min_child_weight = 4, gamma = 0, subsample = 0.8, colsample_bytree = 0.8,
                 objective = 'binary:logistic', nthread = 4, scale_pos_weight = 1, seed = 27), 
                 param_grid = param_step_by_step_4, scoring = 'roc_auc', n_jobs = 4, iid = False, cv = 5,
                 verbose = 1)

model_step_by_step_4.fit(X_train, y_train)

print(pd.DataFrame(model_step_by_step_4.cv_results_)[['mean_test_score', 'std_test_score', 'params']])
print("Best params: %s" % model_step_by_step_4.best_params_)
print("Best score: %f" % model_step_by_step_4.best_score_)

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_4, 'model_step_by_step_4.joblib')

# Output:
#     Fitting 5 folds for each of 5 candidates, totalling 25 fits
#     [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
#     [Parallel(n_jobs=4)]: Done  25 out of  25 | elapsed: 11.8min finished
#        mean_test_score  std_test_score          params
#     0         0.798927        0.002129  {'gamma': 0.0}
#     1         0.798545        0.002114  {'gamma': 0.1}
#     2         0.798390        0.002208  {'gamma': 0.2}
#     3         0.798590        0.002178  {'gamma': 0.3}
#     4         0.798611        0.002357  {'gamma': 0.4}
#     Best params: {'gamma': 0.0}
#     Best score: 0.798927
#     --- 752.3053398132324 seconds ---
# So, the optimum value for gamma, given the previous parameters, is 0.

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  25 out of  25 | elapsed: 11.8min finished


   mean_test_score  std_test_score          params
0         0.798927        0.002129  {'gamma': 0.0}
1         0.798545        0.002114  {'gamma': 0.1}
2         0.798390        0.002208  {'gamma': 0.2}
3         0.798590        0.002178  {'gamma': 0.3}
4         0.798611        0.002357  {'gamma': 0.4}
Best params: {'gamma': 0.0}
Best score: 0.798927
--- 752.3053398132324 seconds ---


['model_step_by_step_4.joblib']
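
The mean scores in the gamma table differ by far less than their fold-to-fold standard deviation, which is worth keeping in mind when reading "best" values from these grids. The sketch below lists the candidates whose mean AUC lies within one standard deviation of the best candidate, using the figures from the table above. This comparison is an added illustration and was not part of the original tuning procedure; `within_one_std` is a hypothetical helper.

```python
def within_one_std(results):
    """Return candidates whose mean score is within one std of the best candidate's mean."""
    best_mean, best_std, _ = max(results)
    threshold = best_mean - best_std
    return [params for mean, std, params in results if mean >= threshold]


# (mean_test_score, std_test_score, gamma) from the Step 4 output
gamma_results = [
    (0.798927, 0.002129, 0.0),
    (0.798545, 0.002114, 0.1),
    (0.798390, 0.002208, 0.2),
    (0.798590, 0.002178, 0.3),
    (0.798611, 0.002357, 0.4),
]
close_candidates = within_one_std(gamma_results)
print(close_candidates)   # [0.0, 0.1, 0.2, 0.3, 0.4]
```

By this measure all five gamma values perform about equally well on this data, so keeping the default of 0 is a reasonable choice rather than a clear winner.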

In [248]:
# Step 5: Re-calibrate the number of boosting rounds for the updated parameters

start_time = time.time()

model_step_by_step_5 = \
    xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 129, max_depth = 6, min_child_weight = 4, gamma = 0,
                  subsample = 0.8, colsample_bytree = 0.8, objective = 'binary:logistic', nthread = 4,
                  scale_pos_weight = 1, seed = 27)

modelfit(model_step_by_step_5, X_train, y_train)

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_5, 'model_step_by_step_5.joblib')

# Output:
#     Model Report
#     Accuracy : 0.8114
#     AUC Score (Train): 0.8316
#     Best Iteration: 128
#     Best ntree limit: 129
#     --- 234.36470699310303 seconds ---
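# So, with max_depth = 6, min_child_weight = 4 and gamma = 0, the n_estimators value is still 129.
# Note that 129 was also the upper limit passed to xgb.cv here, so this only shows that early stopping did not
# trigger before round 129; it cannot show whether more rounds would help.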


Model Report
Accuracy : 0.8114
AUC Score (Train): 0.8316
Best Iteration: 128
Best ntree limit: 129
--- 234.36470699310303 seconds ---


['model_step_by_step_5.joblib']

In [249]:
# Step 6: Tune subsample and colsample_bytree

start_time = time.time()

param_step_by_step_6 = {
    'subsample': [i/10.0 for i in range(6,10)],         # Values [0.6, 0.7, 0.8, 0.9]
    'colsample_bytree': [i/10.0 for i in range(6,10)]   # Values [0.6, 0.7, 0.8, 0.9]
}

model_step_by_step_6 = \
    GridSearchCV(estimator = xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 129, max_depth = 6,
                 min_child_weight = 4, gamma = 0, subsample = 0.8, colsample_bytree = 0.8,
                 objective = 'binary:logistic', nthread = 4, scale_pos_weight = 1, seed = 27), 
                 param_grid = param_step_by_step_6, scoring = 'roc_auc', n_jobs = 4, iid = False, cv = 5,
                 verbose = 1)

model_step_by_step_6.fit(X_train, y_train)

print(pd.DataFrame(model_step_by_step_6.cv_results_)[['mean_test_score', 'std_test_score', 'params']])
print("Best params: %s" % model_step_by_step_6.best_params_)
print("Best score: %f" % model_step_by_step_6.best_score_)

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_6, 'model_step_by_step_6.joblib')

# Output:
#     Fitting 5 folds for each of 16 candidates, totalling 80 fits
#     [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
#     [Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 18.5min
#     [Parallel(n_jobs=4)]: Done  80 out of  80 | elapsed: 36.0min finished
#         mean_test_score  std_test_score  \
#     0          0.797229        0.002333   
#     1          0.797729        0.002164   
#     2          0.798429        0.002594   
#     3          0.798609        0.002403   
#     4          0.798080        0.002859   
#     5          0.798242        0.002294   
#     6          0.798585        0.002031   
#     7          0.798720        0.001836   
#     8          0.797463        0.002507   
#     9          0.798609        0.002218   
#     10         0.798927        0.002129   
#     11         0.799125        0.002065   
#     12         0.797627        0.002651   
#     13         0.797927        0.002148   
#     14         0.798310        0.002229   
#     15         0.799003        0.001883   
#     
#                                              params  
#     0   {'colsample_bytree': 0.6, 'subsample': 0.6}  
#     1   {'colsample_bytree': 0.6, 'subsample': 0.7}  
#     2   {'colsample_bytree': 0.6, 'subsample': 0.8}  
#     3   {'colsample_bytree': 0.6, 'subsample': 0.9}  
#     4   {'colsample_bytree': 0.7, 'subsample': 0.6}  
#     5   {'colsample_bytree': 0.7, 'subsample': 0.7}  
#     6   {'colsample_bytree': 0.7, 'subsample': 0.8}  
#     7   {'colsample_bytree': 0.7, 'subsample': 0.9}  
#     8   {'colsample_bytree': 0.8, 'subsample': 0.6}  
#     9   {'colsample_bytree': 0.8, 'subsample': 0.7}  
#     10  {'colsample_bytree': 0.8, 'subsample': 0.8}  
#     11  {'colsample_bytree': 0.8, 'subsample': 0.9}  
#     12  {'colsample_bytree': 0.9, 'subsample': 0.6}  
#     13  {'colsample_bytree': 0.9, 'subsample': 0.7}  
#     14  {'colsample_bytree': 0.9, 'subsample': 0.8}  
#     15  {'colsample_bytree': 0.9, 'subsample': 0.9}  
#     Best params: {'colsample_bytree': 0.8, 'subsample': 0.9}
#     Best score: 0.799125
#     --- 2207.8702890872955 seconds ---
# So, given the above parameters, the best value so far is 0.9 for subsample and 0.8 for colsample_bytree.


['model_step_by_step_6.joblib']

In [250]:
# Step 7: The best value so far for subsample was 0.9, and for colsample_bytree was 0.8.
# Now try values in 0.05 intervals around these.

start_time = time.time()

param_step_by_step_7 = {
    'subsample': [i/100.0 for i in range(85,100,5)],        # Values [0.85, 0.9, 0.95]
    'colsample_bytree': [i/100.0 for i in range(75,90,5)]   # Values [0.75, 0.8, 0.85]
}

model_step_by_step_7 = \
    GridSearchCV(estimator = xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 129, max_depth = 6,
                 min_child_weight = 4, gamma = 0, subsample = 0.9, colsample_bytree = 0.8,
                 objective = 'binary:logistic', nthread = 4, scale_pos_weight = 1, seed = 27), 
                 param_grid = param_step_by_step_7, scoring = 'roc_auc', n_jobs = 4, iid = False, cv = 5,
                 verbose = 1)

model_step_by_step_7.fit(X_train, y_train)

print(pd.DataFrame(model_step_by_step_7.cv_results_)[['mean_test_score', 'std_test_score', 'params']])
print("Best params: %s" % model_step_by_step_7.best_params_)
print("Best score: %f" % model_step_by_step_7.best_score_)

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_7, 'model_step_by_step_7.joblib')

# Output:
#     Fitting 5 folds for each of 9 candidates, totalling 45 fits
#     [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
#     [Parallel(n_jobs=4)]: Done  45 out of  45 | elapsed: 21.1min finished
#        mean_test_score  std_test_score  \
#     0         0.798669        0.002629   
#     1         0.799158        0.002359   
#     2         0.799457        0.002496   
#     3         0.799124        0.002814   
#     4         0.799125        0.002065   
#     5         0.799129        0.002119   
#     6         0.798629        0.002213   
#     7         0.798867        0.002451   
#     8         0.798500        0.002351   
#     
#                                               params  
#     0  {'colsample_bytree': 0.75, 'subsample': 0.85}  
#     1   {'colsample_bytree': 0.75, 'subsample': 0.9}  
#     2  {'colsample_bytree': 0.75, 'subsample': 0.95}  
#     3   {'colsample_bytree': 0.8, 'subsample': 0.85}  
#     4    {'colsample_bytree': 0.8, 'subsample': 0.9}  
#     5   {'colsample_bytree': 0.8, 'subsample': 0.95}  
#     6  {'colsample_bytree': 0.85, 'subsample': 0.85}  
#     7   {'colsample_bytree': 0.85, 'subsample': 0.9}  
#     8  {'colsample_bytree': 0.85, 'subsample': 0.95}  
#     Best params: {'colsample_bytree': 0.75, 'subsample': 0.95}
#     Best score: 0.799457
#     --- 1308.6555030345917 seconds ---
# So, the optimum value for subsample has been refined to 0.95, and for colsample_bytree to 0.75.


['model_step_by_step_7.joblib']

In [251]:
# Step 8: Apply regularization to reduce overfitting, by tuning reg_alpha.
# Tuning reg_lambda may also be useful to further reduce overfitting, but the decision was made that this is enough
# tuning for the purpose of this assignment.

start_time = time.time()

param_step_by_step_8 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}

model_step_by_step_8 = \
    GridSearchCV(estimator = xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 129, max_depth = 6,
                 min_child_weight = 4, gamma = 0, subsample = 0.95, colsample_bytree = 0.75,
                 objective = 'binary:logistic', nthread = 4, scale_pos_weight = 1, seed = 27), 
                 param_grid = param_step_by_step_8, scoring = 'roc_auc', n_jobs = 4, iid = False, cv = 5,
                 verbose = 1)

model_step_by_step_8.fit(X_train, y_train)

print(pd.DataFrame(model_step_by_step_8.cv_results_)[['mean_test_score', 'std_test_score', 'params']])
print("Best params: %s" % model_step_by_step_8.best_params_)
print("Best score: %f" % model_step_by_step_8.best_score_)

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_8, 'model_step_by_step_8.joblib')

# Output:
#     Fitting 5 folds for each of 5 candidates, totalling 25 fits
#     [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
#     [Parallel(n_jobs=4)]: Done  25 out of  25 | elapsed: 11.0min finished
#        mean_test_score  std_test_score                params
#     0         0.799457        0.002496  {'reg_alpha': 1e-05}
#     1         0.799444        0.002456   {'reg_alpha': 0.01}
#     2         0.799148        0.002625    {'reg_alpha': 0.1}
#     3         0.799720        0.002136      {'reg_alpha': 1}
#     4         0.791205        0.001887    {'reg_alpha': 100}
#     Best params: {'reg_alpha': 1}
#     Best score: 0.799720
#     --- 705.3381948471069 seconds ---
# So, the best value so far for reg_alpha from these 5 possible values is 1.



['model_step_by_step_8.joblib']

In [255]:
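# Step 9: Try values of reg_alpha closer to the best value so far (1) to see if we get something better.
# Note: colsample_bytree is set to 0.65 from this step onwards, not the 0.75 selected in Step 7. The recorded
# outputs were produced with 0.65, which is probably why the reg_alpha = 1 score here differs from Step 8.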

start_time = time.time()

param_step_by_step_9 = {
    'reg_alpha': [0.5, 1, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20]
}

model_step_by_step_9 = \
    GridSearchCV(estimator = xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 129, max_depth = 6,
                 min_child_weight = 4, gamma = 0, subsample = 0.95, colsample_bytree = 0.65,
                 objective = 'binary:logistic', nthread = 4, scale_pos_weight = 1, seed = 27), 
                 param_grid = param_step_by_step_9, scoring = 'roc_auc', n_jobs = 4, iid = False, cv = 5,
                 verbose = 1)

model_step_by_step_9.fit(X_train, y_train)

print(pd.DataFrame(model_step_by_step_9.cv_results_)[['mean_test_score', 'std_test_score', 'params']])
print("Best params: %s" % model_step_by_step_9.best_params_)
print("Best score: %f" % model_step_by_step_9.best_score_)

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_9, 'model_step_by_step_9.joblib')

# Output:
#     Fitting 5 folds for each of 14 candidates, totalling 70 fits
#     [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
#     [Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 18.3min
#     [Parallel(n_jobs=4)]: Done  70 out of  70 | elapsed: 29.3min finished
#         mean_test_score  std_test_score              params
#     0          0.799333        0.002465  {'reg_alpha': 0.5}
#     1          0.799116        0.002172    {'reg_alpha': 1}
#     2          0.800166        0.002031    {'reg_alpha': 5}
#     3          0.800270        0.002156    {'reg_alpha': 6}
#     4          0.800363        0.002301    {'reg_alpha': 7}
#     5          0.800188        0.002046    {'reg_alpha': 8}
#     6          0.800091        0.002142    {'reg_alpha': 9}
#     7          0.799958        0.002240   {'reg_alpha': 10}
#     8          0.799878        0.002038   {'reg_alpha': 11}
#     9          0.799791        0.002194   {'reg_alpha': 12}
#     10         0.800000        0.002170   {'reg_alpha': 13}
#     11         0.799666        0.002205   {'reg_alpha': 14}
#     12         0.799583        0.001937   {'reg_alpha': 15}
#     13         0.798932        0.001976   {'reg_alpha': 20}
#     Best params: {'reg_alpha': 7}
#     Best score: 0.800363
#     --- 1795.949402809143 seconds ---
# So, given the above parameters, the optimum value for reg_alpha is 7.



['model_step_by_step_9.joblib']

In [256]:
# Step 10: Apply this regularization (reg_alpha) in the model and look at the impact.

start_time = time.time()

model_step_by_step_10 = \
    xgb.XGBClassifier(learning_rate = 0.1, n_estimators = 1000, max_depth = 6, min_child_weight = 4, gamma = 0,
                      subsample = 0.95, colsample_bytree = 0.65, reg_alpha = 7, objective = 'binary:logistic',
                      nthread = 4, scale_pos_weight = 1, seed = 27)

modelfit(model_step_by_step_10, X_train, y_train)

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_10, 'model_step_by_step_10.joblib')

# Output:
#     Model Report
#     Accuracy : 0.8114
#     AUC Score (Train): 0.8319
#     Best Iteration: 203
#     Best ntree limit: 204
#     --- 417.5869870185852 seconds ---
# So, the new best AUC (measured on the training data) is 0.8319, which is achieved with an n_estimators value of 204.




['model_step_by_step_10.joblib']

In [257]:
# Step 11: Finally, lower the learning rate and add more trees, using the cv function of XGBoost.

start_time = time.time()

model_step_by_step_11 = \
    xgb.XGBClassifier(learning_rate = 0.01, n_estimators = 5000, max_depth = 6, min_child_weight = 4, gamma = 0,
                      subsample = 0.95, colsample_bytree = 0.65, reg_alpha = 7, objective = 'binary:logistic',
                      nthread = 4, scale_pos_weight = 1, seed = 27)

modelfit(model_step_by_step_11, X_train, y_train)

# Save the final Feature Importances bar chart to a file
plt.savefig('Feature Importances - final.png')

print("--- %s seconds ---" % (time.time() - start_time))

# Save the model to disk
dump(model_step_by_step_11, 'model_step_by_step_11.joblib')

# Output:
#     Model Report
#     Accuracy : 0.8092
#     AUC Score (Train): 0.8266
#     Best Iteration: 1602
#     Best ntree limit: 1603
#     --- 2614.470659017563 seconds ---
# So, using this slower learning rate, we have found that the optimum n_estimators value is 1603,
# which yields an AUC value (measured on the training data) of 0.8266.




['model_step_by_step_11.joblib']

**The final Feature Importances bar chart can be viewed [here](https://github.com/DavidSchanzer823239622/UTS_ML2019_82329622/blob/master/Feature%20Importances%20-%20final.png "Final Feature Importances bar chart")**

The top 5 features on the above bar chart are:
1. Distance from residential address to location of most recent episode >= 11 kms
2. Age at most recent episode >= 71 years
3. Age at most recent episode <= 56 years
4. Number of days the woman had to wait for results after her most recent attendance >= 9 days
5. Total number of episodes >= 10 episodes

In [258]:
# Evaluation: call XGBClassifier with the optimised parameters to measure the final achieved accuracy

start_time = time.time()

# Fit the model to training data
model_final = \
    xgb.XGBClassifier(learning_rate = 0.01, n_estimators = 1603, max_depth = 6, min_child_weight = 4, gamma = 0,
                      subsample = 0.95, colsample_bytree = 0.65, reg_alpha = 7, objective = 'binary:logistic',
                      nthread = 4, scale_pos_weight = 1, seed = 27)
model_final.fit(X_train, y_train)
print(model_final)

# Make predictions for the test data
y_pred = model_final.predict(X_test)
predictions = [round(value) for value in y_pred]

# Evaluate the predictions made using the test data
accuracy = accuracy_score(y_test, predictions)
print("Tuned accuracy: %.2f%%" % (accuracy * 100.0))

dump(model_final, 'model_final.joblib')

print("--- %s seconds ---" % (time.time() - start_time))

# Output:
#     XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#                   colsample_bynode=1, colsample_bytree=0.65, gamma=0,
#                   learning_rate=0.01, max_delta_step=0, max_depth=6,
#                   min_child_weight=4, missing=None, n_estimators=1603, n_jobs=1,
#                   nthread=4, objective='binary:logistic', random_state=0,
#                   reg_alpha=7, reg_lambda=1, scale_pos_weight=1, seed=27,
#                   silent=None, subsample=0.95, verbosity=1)
#     Tuned accuracy: 79.53%
#     --- 477.62118005752563 seconds ---
# So, the final accuracy on the test dataset is 79.53%, compared to the baseline accuracy of 79.00%.



## Evaluation

### Report Execution on Data

The results of executing each Jupyter Notebook code cell are shown above as comments at the end of that cell. As explained in the Challenges section above, these results are for 100,000 randomly chosen observations out of the 824,812 available in the full data set. The subset was used for performance reasons, because execution times with the full data set were excessive.

The sample data uploaded to the GitHub folder (https://github.com/DavidSchanzer823239622/UTS_ML2019_82329622), in the file "Data extraction - 1000 observations scrambled.csv", contains only 1000 observations whose values have been scrambled, due to the sensitivity of the BreastScreen data. Scrambled in this context means that the data was imported into Microsoft Excel and each column was then sorted independently of all other columns, creating a sample data set that looks realistic but cannot be re-identified to an individual. As a result, running this Notebook in Google Colab will produce different results, which is why the execution results from the 100,000 observations have been included in the comments, to aid the marker of this assignment.

### Test Results

Below is a table showing the effect on **accuracy and AUC** of the 5 different approaches to XGBoost model training and tuning:

| Approach | Description | Accuracy | AUC |
| --: | :--------------------- | :-------------: | :----------------------------------------------------------- |
|   1 | Call XGBClassifier with no parameters, using all of the default values, to measure the baseline accuracy | 79.00% | N/A |
|   2 | Utilise XGBoost's built-in cross-validation capabilities | 78.85% | 0.7878 |
|   3 | Utilise a higher number of boosting rounds with automated boosting round selection using early_stopping | 79.10% | 0.7943 |
|   4 | Tuning a few of the above hyperparameters using a very simple GridSearch | N/A | 0.7955 |
|   5 | Try a step-by-step approach to hyperparameter tuning: |  |  |
|   5.1 | Fix learning_rate and n_estimators with typical values for other parameters | 80.53% | 0.8213 |
|   5.2 | Tune max_depth and min_child_weight using wide ranges | N/A | 0.7984 |
|   5.3 | Fine-tune max_depth and min_child_weight looking for optimum values | N/A | 0.7989 |
|   5.4 | Tune gamma using the parameters already tuned | N/A | 0.7989 |
|   5.5 | Re-calibrate the number of boosting rounds for the updated parameters | 81.14% | 0.8316 |
|   5.6 | Tune subsample and colsample_bytree using wide values | N/A | 0.7991 |
|   5.7 | Fine-tune subsample and colsample_bytree looking for optimum values | N/A | 0.7995 |
|   5.8 | Apply regularization to reduce overfitting, by tuning reg_alpha | N/A | 0.7997 |
|   5.9 | Fine-tune reg_alpha looking for optimum values | N/A | 0.8004 |
|   5.10 | Apply this regularization (reg_alpha) in the model and look at the impact | 81.14% | 0.8319 |
|   5.11 | Finally, lower the learning rate and add more trees, using the cv function of XGBoost | 80.92% | 0.8266 |
| Evaluation | Compare the final accuracy to the baseline accuracy | 79.53% | N/A |

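As can be seen, accuracy was the measure used to evaluate the ultimate effect of the tuning effort, while AUC was the measure optimised during hyperparameter tuning itself, especially during the step-by-step tuning of Approach 5. Note that the figures in the table are not all measured on the same data: the grid-search rows (such as 5.6 to 5.9) report the mean cross-validated AUC, rows 5.10 and 5.11 come from the `modelfit` report, whose AUC is labelled as a training-set score, and the Evaluation row reports accuracy on the held-out test set. The training-set figures should therefore not be compared directly with the final test accuracy.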
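Because the Evaluation row reports accuracy only, a test-set AUC would make it easier to compare with the tuning rows. The sketch below shows, in plain Python, what AUC measures: the share of (positive, negative) pairs in which the positive case receives the higher score. The function name `auc_from_scores` and the toy labels and scores are illustrative assumptions, not values from this study. In the notebook itself, the equivalent would probably be scikit-learn's `roc_auc_score` applied to `model_final.predict_proba(X_test)[:, 1]`.

```python
def auc_from_scores(labels, scores):
    """Return the ROC AUC as the fraction of (positive, negative) pairs
    in which the positive case gets the higher score (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example with made-up labels (1 = positive class) and predicted probabilities.
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(f"Toy AUC: {toy_auc:.4f}")
```

The pairwise loop is quadratic in the number of observations, so it is suitable for illustration only; a library routine would be preferable on the 30,000-row test set.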

In the end, the accuracy of the model could only be increased from 79.00% (baseline) to 79.53% (final). This could be due to a number of factors:
1. The use of only a 100,000-observation subset of the full data set which, although randomly sampled, may have produced skewed training and testing sets for the model.
2. A lack of time to perform more extensive hyperparameter tuning, such as additional regularization using reg_lambda.
3. The 70/30 split chosen for the training and testing sets; alternative splits may result in differing model accuracy.
4. The default values for XGBoost parameters may already be well chosen, so that good accuracy can be obtained without any hyperparameter tuning.
5. Other factors unknown to me due to my lack of experience with Machine Learning in general and XGBoost in particular.
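A further point worth checking is whether the gains seen during tuning exceed the fold-to-fold variation. The sketch below takes the Step 9 `reg_alpha` results recorded earlier and lists the candidates whose mean AUC lies within one standard deviation of the best candidate. This check was not part of the original analysis; the helper name `within_one_std` and the one-standard-deviation rule are assumptions chosen for illustration.

```python
# (reg_alpha, mean_test_score, std_test_score) copied from the Step 9 output.
step_9_results = [
    (0.5, 0.799333, 0.002465),
    (1, 0.799116, 0.002172),
    (5, 0.800166, 0.002031),
    (6, 0.800270, 0.002156),
    (7, 0.800363, 0.002301),
    (8, 0.800188, 0.002046),
    (9, 0.800091, 0.002142),
    (10, 0.799958, 0.002240),
    (11, 0.799878, 0.002038),
    (12, 0.799791, 0.002194),
    (13, 0.800000, 0.002170),
    (14, 0.799666, 0.002205),
    (15, 0.799583, 0.001937),
    (20, 0.798932, 0.001976),
]


def within_one_std(results):
    """Return the best row and every row whose mean score is no more than
    one standard deviation (of the best row) below the best mean score."""
    best = max(results, key=lambda row: row[1])
    threshold = best[1] - best[2]
    close = [row for row in results if row[1] >= threshold]
    return best, close


best_row, close_rows = within_one_std(step_9_results)
print(f"Best reg_alpha: {best_row[0]} (mean AUC {best_row[1]:.6f})")
print(f"Candidates within one std of the best: {len(close_rows)} of {len(step_9_results)}")
```

With these figures, all 14 candidates fall inside that band, which suggests the differences between them are small relative to the variation across folds.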

### Efficiency Analysis

Below is a table showing the **execution duration (elapsed time)** of the 5 different approaches to XGBoost model training and tuning:

| Approach | Description | Execution Time (secs) |
| --: | :--------------------- | --------------: |
|   1 | Call XGBClassifier with no parameters, using all of the default values, to measure the baseline accuracy | **91** |
|   2 | Utilise XGBoost's built-in cross-validation capabilities | **43** |
|   3 | Utilise a higher number of boosting rounds with automated boosting round selection using early_stopping | **137** |
|   4 | Tuning a few of the above hyperparameters using a very simple GridSearch | **155** |
|   5 | Try a step-by-step approach to hyperparameter tuning: | **13559** |
|   5.1 | Fix learning_rate and n_estimators with typical values for other parameters | 266 |
|   5.2 | Tune max_depth and min_child_weight using wide ranges | 1784 |
|   5.3 | Fine-tune max_depth and min_child_weight looking for optimum values | 1473 |
|   5.4 | Tune gamma using the parameters already tuned | 752 |
|   5.5 | Re-calibrate the number of boosting rounds for the updated parameters | 234 |
|   5.6 | Tune subsample and colsample_bytree using wide values | 2208 |
|   5.7 | Fine-tune subsample and colsample_bytree looking for optimum values | 1309 |
|   5.8 | Apply regularization to reduce overfitting, by tuning reg_alpha | 705 |
|   5.9 | Fine-tune reg_alpha looking for optimum values | 1796 |
|   5.10 | Apply this regularization (reg_alpha) in the model and look at the impact | 418 |
|   5.11 | Finally, lower the learning rate and add more trees, using the cv function of XGBoost | 2614 |
| Evaluation | Compare the final accuracy to the baseline accuracy | 478 |

An analysis of the efficiency of the above approach, when also taking into account the Accuracy test results from the previous section, reveals that:
1. The baseline model, trained in a mere 91 seconds, achieved a level of accuracy that proved surprisingly difficult to improve upon to any great degree.
2. Approaches 2 to 4, which used simple cross-validation and a very elementary grid search, were efficient (low execution time) but had little effect on accuracy.
3. Approach 5, while highly methodical, required 13,559 seconds (3 hrs 46 mins) of execution time while increasing accuracy by only a little over 0.5 percentage points, and so can be considered an inefficient process. The business may still consider the effort worthwhile for the gain in accuracy, but the result highlights the diminishing returns of extensive hyperparameter tuning.
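The totals quoted above can be reproduced from the table. The short sketch below sums the eleven Approach 5 step durations, converts the result to hours and minutes, and computes the accuracy gain in percentage points; the variable names are illustrative.

```python
# Execution time in seconds for each step of Approach 5, taken from the table above.
step_seconds = {
    "5.1": 266, "5.2": 1784, "5.3": 1473, "5.4": 752, "5.5": 234, "5.6": 2208,
    "5.7": 1309, "5.8": 705, "5.9": 1796, "5.10": 418, "5.11": 2614,
}

total_seconds = sum(step_seconds.values())
hours, remainder = divmod(total_seconds, 3600)
minutes = round(remainder / 60)

# Accuracy gain in percentage points: final test accuracy minus baseline accuracy.
accuracy_gain = 79.53 - 79.00

print(f"Approach 5 total: {total_seconds} s (about {hours} hrs {minutes} mins)")
print(f"Accuracy gain: {accuracy_gain:.2f} percentage points")
```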

### Suggestions for Future Research

Some ways in which model accuracy may be further improved are:
1. Repeating the analysis using the full data set (ie. all 824,812 observations).
2. Additional feature engineering (creating additional features from existing features), as well as locating other possibly correlated features from the original data source.
3. Perform finer-level binning of numerical features, which are currently only binned into 5 quantiles.
4. Perform Principal Components Analysis on the data set to determine correlated components, to reduce the number of dimensions used to train the model. For instance, the Country of Birth and Main Language Spoken attributes had high cardinality and therefore generated many columns in the one-hot encoding step, but did not appear in the Feature Importances bar chart and therefore can be considered to have very low predictive correlation with the target variable. Removing these may also reduce run times and therefore improve efficiency.
5. Measure the effect of different train/test splits (other than the 70/30 split used here) on model accuracy.
6. Measure the effect of tuning other XGBoost parameters, such as reg_lambda, on model accuracy.
7. Perform a comparative study utilising other classifiers such as Random Forest or Neural Networks. These were not undertaken as part of this study due to the additional complexity that would be added to this exercise.
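Suggestion 3 can be illustrated without any third-party library. The sketch below bins a numeric feature into quantile-based groups using the standard library's `statistics.quantiles` and `bisect`. The function name `quantile_bin`, the distance values and the choice of 10 bins are all hypothetical, and the notebook's own binning code may differ.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bin(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using quantile cut points."""
    cut_points = quantiles(values, n=n_bins)  # returns n_bins - 1 cut points
    return [bisect_right(cut_points, v) for v in values]


# Made-up distances (km) from home to screening location, sorted for readability.
distances_km = [0.5, 1.2, 2.0, 3.1, 4.4, 5.0, 6.7, 8.3, 11.0, 15.5, 22.0, 40.0]

five_bins = quantile_bin(distances_km, 5)   # the current level of binning
ten_bins = quantile_bin(distances_km, 10)   # a finer alternative

for distance, coarse, fine in zip(distances_km, five_bins, ten_bins):
    print(f"{distance:5.1f} km -> bin {coarse} of 5, bin {fine} of 10")
```

Finer bins multiply the number of one-hot columns, so any gain in accuracy would need to be weighed against longer run times.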

## Ethical Considerations

When considering the ethical aspects of the above study, it seems appropriate to adopt a Deontological (duty-based) approach to ethics. The purpose of the study is to raise rescreen rates for BreastScreen NSW, with the aim of detecting breast cancer earlier in its development, when it is more responsive to treatment. Under a Deontological ethical framework, it is one's duty to prevent suffering if it is within one's power to do so, simply because it is "the right thing to do".

Deontologists would agree that this is true even if it sometimes produces a "bad result". In the case of this study, which seeks to accurately predict whether each woman in the screening program will be "regular" or "lapsed" at her next appointment, a "bad result" can be considered to be a false prediction. Specifically, there are two possibilities for a false prediction:
1. A woman may be predicted to be lapsed (and therefore receives an individual-level intervention such as a phone call) but in fact presents for breast screening within the "regular" period of 90 days from the rescreen date.
2. Conversely, a woman may be predicted to be "regular" but in fact either presents later than 90 days from the rescreen date, or not at all.

To begin with the first case (ie. a false positive), either the woman's "regular" presentation was specifically **in response to** the intervention, or she would have presented within 90 days even without it. In the former case, there is no "bad result", because the intervention resulted in her on-time attendance and therefore reduced her risk of undetected breast cancer. In the latter case, she received an unnecessary phone call. While this may have been annoying for her, it can be justified under the Deontological framework, because intervening was still the "right thing" to do given her (wrongly) predicted risk of being "lapsed".

In the second case (ie. a false negative), a woman is predicted to be "regular" but in fact either presents later than 90 days from the rescreen date or not at all. Here the model's failure to predict her "lapsedness" means that nothing was done to reduce her risk of undetected breast cancer. However, her risk has not increased either: the only consequence is that no intervention was undertaken for her, so her risk is the same as it would be without the prediction model. Deontologists would state that the absence of 100% accuracy in the prediction model is not a reason to avoid using it, since some "good" is still being done for the majority of women, despite the absence of "good" being done for a minority.

It is difficult to anticipate potential misuses of the technique presented in this study. One possible exception is the deployment of an extremely poor prediction model into production, which would result in a large number of unneeded interventions (false positives) and a large number of missed interventions (false negatives). This misuse could be prevented were the business (BreastScreen NSW) to set a minimum level of prediction accuracy required before a solution is deployed into production.
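The two kinds of false prediction, and the idea of a minimum performance level before deployment, can be made concrete with a small sketch. It treats "lapsed" as the positive class, counts the four possible outcomes, and then compares recall (the share of truly lapsed women who are flagged) and precision (the share of interventions that were warranted) against thresholds. The function names, the toy labels and the threshold values of 0.6 are hypothetical; BreastScreen NSW would need to choose its own criteria.

```python
def confusion_counts(actual, predicted, positive="lapsed"):
    """Count true/false positives and negatives, with `positive` as the flagged class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def meets_minimum(counts, min_recall, min_precision):
    """Return True only if both recall and precision reach the given thresholds."""
    flagged = counts["tp"] + counts["fp"]
    truly_lapsed = counts["tp"] + counts["fn"]
    recall = counts["tp"] / truly_lapsed if truly_lapsed else 0.0
    precision = counts["tp"] / flagged if flagged else 0.0
    return recall >= min_recall and precision >= min_precision


# Toy data: what actually happened versus what a model predicted.
actual = ["lapsed", "lapsed", "regular", "regular", "regular", "lapsed"]
predicted = ["lapsed", "regular", "lapsed", "regular", "regular", "lapsed"]

toy_counts = confusion_counts(actual, predicted)
deployable = meets_minimum(toy_counts, min_recall=0.6, min_precision=0.6)

print(f"Unneeded interventions (false positives): {toy_counts['fp']}")
print(f"Missed interventions (false negatives): {toy_counts['fn']}")
print(f"Meets hypothetical minimum: {deployable}")
```

Checking recall and precision separately, rather than overall accuracy alone, keeps the two kinds of error visible, since they carry different costs for the women concerned.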

## Conclusion

This project set out to determine the accuracy with which the attendance of women at their next breast screening appointment at a BreastScreen NSW clinic or mobile van can be predicted, classifying women as either "regular" (predicted to attend within 90 days of the rescreen date) or "lapsed" (predicted to not attend within 90 days of their rescreen date). Within the parameters of this project, an accuracy of 79.53% has been achieved, with a number of suggestions for how this might be improved.

This provides BreastScreen NSW with the opportunity to consider whether this level of prediction accuracy is sufficient for deployment, allowing individual-level interventions (eg. a reminder phone call) to raise rescreen rates for the estimated 37% of NSW women who are either under-screened or have lapsed from the program completely.

The 5 key predictors of non-reattendance at BreastScreen NSW have been identified to be:
1. Distance from residential address to location of most recent episode >= 11 kms
2. Age at most recent episode >= 71 years
3. Age at most recent episode <= 56 years
4. Number of days the woman had to wait for results after her most recent attendance >= 9 days
5. Total number of episodes >= 10 episodes

Most of these predictor factors are outside of BreastScreen NSW's control. However, these results indicate that any changes that can be made to operational procedures in order to reduce the length of time that women wait for results may increase rescreening rates.

A more rigorous study, utilising additional classifiers and Machine Learning expertise in addition to the suggestions above, may well yield improved model accuracy and therefore result in a more deployment-ready "regular" vs. "lapsed" prediction tool.

## Appendix 1 - SQL for Data Acquisition