# Covid-19 Adverse Vaccine Response Prediction
![vaccine brands](../data/images/vaccines.jpg)
Author: Christos Maglaras<br>
Date : 4/14/2021
## Stakeholder
This project is mainly focused on serving the individual by providing a prediction, based on personal information, of the possibility of a negative reaction to one of the Covid-19 vaccines. It could also be applied in a medical center such as a clinic or hospital to screen patients easily and quickly. If you would like to know your own or another person's risk of illness from one of the three available vaccines, you may enter some or all of your personal information into the following web form ~flask link~. Being able to predict a patient's outcome using nothing more than an online form would be highly beneficial to the patient, of course, but also to the healthcare system, by decreasing the strain placed on hospitals.

## Data
The data used for this project was sourced from the CDC VAERS system, a public dataset covering thirty years of domestic adverse vaccine events. Medical professionals and vaccine manufacturers are required to report all adverse reactions that come to their attention, but anyone can submit a report of their own experience. The data consists of general information such as age and sex, vaccination information such as the administration facility and brand, and health information such as preexisting illnesses, allergies, and current medications. This dataset contains roughly 70,000 records involving Covid-19 vaccines and is updated every two weeks with new records. You can collect the data [here](https://vaers.hhs.gov/data/datasets.html?).
![vaers](../data/images/vaers.png)

## Business Understanding
This system would alleviate some of the pressure on hospitals, freeing up resources so they can operate more effectively. The first way in which a system like this would help is as a first-step screening method, prompting patients to at least notify their clinician of their risks. The second is that by avoiding adverse reactions, hospitals do not need to dedicate extra resources to the patient after the reaction. Although we have roughly seventy thousand reports of adverse reactions here, the US has now reached five million vaccinations of at least one dose and three million full vaccinations, meaning that these adverse reactions make up only 1.4% of all domestic vaccinations. This is not to say that the 1.4% should be ignored: with the minimum age for the vaccine being sixteen, roughly two hundred thirty million people in the US are eligible, and 1.4% of them is about three million.
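
The percentages above follow from simple arithmetic, which can be reproduced as a quick check. The figures are the rounded ones quoted in this section, and the variable names are only illustrative.

```python
# rounded figures quoted in the paragraph above
adverse_reports = 70_000
vaccinated_at_least_one_dose = 5_000_000
eligible_population = 230_000_000

# share of vaccinated people with a reported adverse reaction
adverse_rate = adverse_reports / vaccinated_at_least_one_dose

# the same rate applied to everyone eligible for the vaccine
projected_reports = adverse_rate * eligible_population

print(f"adverse rate: {adverse_rate:.1%}")
print(f"projected reports: {projected_reports:,.0f}")
```

The projection comes out slightly above three million, which is where the "three million" figure comes from.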

## Model
The model that achieved the best results was XGBoost, tuned using Bayesian optimization. It achieved significantly better results than both the standard random forest and the neural network, neither of which produced an informative model. There is still some difficulty differentiating between patients who were hospitalized and those who were not, since every patient in the data had some sort of adverse reaction. A real-world test might be more precise, as the model has been trained on the more difficult task of separating two groups that are very similar to one another.

## Contents
```
├── data
|   ├── vaers_guide.pdf
|   ├── images
|   |   ├── vaccines.jpg
|   |   └── vaers.mp4
|   └── dataset
|       ├── 2020
|       |   ├── data20.csv
|       |   ├── symptoms20.csv
|       |   └── vax20.csv
|       └── 2021
|           ├── data21.csv
|           ├── symptoms21.csv
|           └── vax21.csv
├── notebooks
|   ├── Exploration Notebook.ipynb
|   ├── Final Notebook.ipynb
|   └── MVP Final Notebook.ipynb
├── presentation.pdf
└── README.md
```

## Data Preprocessing

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report, mean_squared_error

import lazypredict
from lazypredict.Supervised import LazyClassifier

import xgboost as xgb
from bayes_opt import BayesianOptimization

import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout

pd.set_option('display.max_columns', None)
import warnings; warnings.simplefilter('ignore')



In [2]:
# let's load all three datasets for each year
symptoms20 = pd.read_csv('../data/2020/symptoms20.csv', index_col=['VAERS_ID'], encoding='latin-1')
data20     = pd.read_csv('../data/2020/data20.csv', index_col=['VAERS_ID'], encoding='latin-1')
vax20      = pd.read_csv('../data/2020/vax20.csv', index_col=['VAERS_ID'], encoding='latin-1')

symptoms21 = pd.read_csv('../data/2021/symptoms21.csv', index_col=['VAERS_ID'], encoding='latin-1')
data21     = pd.read_csv('../data/2021/data21.csv', index_col=['VAERS_ID'], encoding='latin-1')
vax21      = pd.read_csv('../data/2021/vax21.csv', index_col=['VAERS_ID'], encoding='latin-1')

In [3]:

combined_vax = pd.concat([vax20, vax21])
combined_data = pd.concat([data20, data21])
combined_symptoms = pd.concat([symptoms20, symptoms21])

In [4]:
datavax = pd.merge(combined_data, combined_vax, on='VAERS_ID', how='right')
dvs = pd.merge(datavax, combined_symptoms, on='VAERS_ID', how='left')
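
The two merges above can multiply rows whenever a `VAERS_ID` appears on more than one row of a table, which is one reason duplicates are dropped afterwards. The sketch below uses tiny made-up frames (the values are illustrative, not taken from VAERS) to show the effect.

```python
import pandas as pd

# tiny stand-ins for the three VAERS tables, indexed by VAERS_ID like the real files
data = pd.DataFrame({'VAERS_ID': [1, 2], 'AGE_YRS': [30, 60]}).set_index('VAERS_ID')
vax = pd.DataFrame({'VAERS_ID': [1, 2], 'VAX_TYPE': ['COVID19', 'FLU3']}).set_index('VAERS_ID')
symptoms = pd.DataFrame({'VAERS_ID': [1, 1, 2],
                         'SYMPTOM1': ['Headache', 'Pyrexia', 'Rash']}).set_index('VAERS_ID')

# same merge order and join types as the cell above
datavax = pd.merge(data, vax, on='VAERS_ID', how='right')
dvs = pd.merge(datavax, symptoms, on='VAERS_ID', how='left')

# report 1 has two symptom rows, so it appears twice after the second merge
print(len(datavax), len(dvs))
```

Because the symptom columns are dropped later on, rows that differ only in their symptoms become exact duplicates of one another.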

In [5]:
# isolating covid-19 vaccinations for the base dataframe
# .copy() gives df its own data so the in-place edits below do not act on a view of dvs
df = dvs[dvs['VAX_TYPE'] == 'COVID19'].copy()

In [6]:
df.drop_duplicates(inplace=True)

In [7]:
df['DIED'] = df['DIED'].fillna(0)
df['DIED'] = df['DIED'].replace('Y', 1)

df['SEX'] = df['SEX'].replace('U', '0')
df['SEX'] = df['SEX'].replace('F', '0')
df['SEX'] = df['SEX'].replace('M', '1')

df['L_THREAT'] = df['L_THREAT'].fillna(0)
df['L_THREAT'] = df['L_THREAT'].replace('Y', 1)

df['HOSPITAL'] = df['HOSPITAL'].fillna(0)
df['HOSPITAL'] = df['HOSPITAL'].replace('Y', 1)

df['HOSPDAYS'] = df['HOSPDAYS'].fillna(0)

df['X_STAY'] = df['X_STAY'].fillna(0)
df['X_STAY'] = df['X_STAY'].replace('Y', 1)

df['DISABLE'] = df['DISABLE'].fillna(0)
df['DISABLE'] = df['DISABLE'].replace('Y', 1)

df['RECOVD'] = df['RECOVD'].fillna(0)
df['RECOVD'] = df['RECOVD'].replace('U', 0)
df['RECOVD'] = df['RECOVD'].replace('N', 0)
df['RECOVD'] = df['RECOVD'].replace('Y', 1)

df['BIRTH_DEFECT'] = df['BIRTH_DEFECT'].fillna(0)
df['BIRTH_DEFECT'] = df['BIRTH_DEFECT'].replace('Y', 1)

df['VAX_DOSE_SERIES'] = df['VAX_DOSE_SERIES'].fillna(0)
df['VAX_DOSE_SERIES'] = df['VAX_DOSE_SERIES'].replace('7+', 7)
df['VAX_DOSE_SERIES'] = df['VAX_DOSE_SERIES'].replace('UNK', 1)

df['NUMDAYS'] = df['NUMDAYS'].where(df['NUMDAYS']<120, 7)
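
Most of the columns above follow the same pattern: `'Y'` becomes 1 and a missing value becomes 0. As a minimal sketch, assuming every other value should also count as 0 (which matches how `RECOVD` is handled above), the pattern could be wrapped in a small helper. The name `encode_flag` is hypothetical.

```python
import numpy as np
import pandas as pd

def encode_flag(series: pd.Series) -> pd.Series:
    """Return 1 where the value is 'Y' and 0 for anything else, including NaN."""
    return series.eq('Y').astype(int)

# made-up rows in the same shape as the VAERS flag columns
example = pd.DataFrame({
    'DIED':   ['Y', np.nan, np.nan],
    'RECOVD': ['Y', 'N', 'U'],
})
for column in ['DIED', 'RECOVD']:
    example[column] = encode_flag(example[column])
print(example)
```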

In [8]:
df['SEX'] = df['SEX'].astype(int)
df['AGE_YRS'] = df['AGE_YRS'].fillna(50)
df['AGE_YRS'] = df['AGE_YRS'].astype(int)
df['HOSPDAYS'] = df['HOSPDAYS'].astype(int)
df['NUMDAYS'] = df['NUMDAYS'].astype(int)
df['VAX_DOSE_SERIES'] = df['VAX_DOSE_SERIES'].astype(int)

## Data Processing
### State

In [9]:
# let's clean up the state column by filling NaN values and binning the unusual locations as Other, as there are only a few of each
# assigning the result back is more reliable than calling replace(..., inplace=True) on a selected column
df['STATE'] = df['STATE'].replace(['AS', 'VI', 'MP', 'Ca', 'XB', 'FM', 'MH', 'GU'], 'OTH')
df['STATE'] = df['STATE'].fillna('NA')

label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(df['STATE'])
df['STATE'] = label_encoder.transform(df['STATE'])
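
Each categorical column in this section reuses the name `label_encoder`, so only the last fitted encoder survives. If the same encoding has to be applied later, for example to input from the web form, one option is to keep one encoder per column. This is a sketch on made-up values, and the `encoders` dictionary is an assumption rather than part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

frame = pd.DataFrame({
    'STATE':    ['NY', 'CA', 'NA', 'CA'],
    'VAX_SITE': ['LA', 'RA', 'NA', 'LA'],
})

encoders = {}  # hypothetical: one fitted encoder kept per column
for column in ['STATE', 'VAX_SITE']:
    encoders[column] = LabelEncoder()
    frame[column] = encoders[column].fit_transform(frame[column])

# classes are sorted alphabetically, so 'CA' -> 0, 'NA' -> 1, 'NY' -> 2
state_codes = frame['STATE'].tolist()
# the stored encoder can map the codes back to the original labels
state_labels = list(encoders['STATE'].inverse_transform(state_codes))
```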

### Vaccine Manufacturer

In [10]:
# since there are no NaNs or unwanted values we can just fit and transform
df['VAX_MANU'].value_counts()

label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(df['VAX_MANU'])
df['VAX_MANU'] = label_encoder.transform(df['VAX_MANU'])

### Injection Site

In [11]:
# where the dose was applied: right arm, left arm, leg, etc.
df['VAX_SITE'] = df['VAX_SITE'].fillna('NA')

label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(df['VAX_SITE'])
df['VAX_SITE'] = label_encoder.transform(df['VAX_SITE'])

### Delivery Method

In [12]:
# states the method of delivering the vaccine, such as syringe, nasal, intradermal, intramuscular, and others
df['VAX_ROUTE'] = df['VAX_ROUTE'].fillna('UN')

label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(df['VAX_ROUTE'])
df['VAX_ROUTE'] = label_encoder.transform(df['VAX_ROUTE'])

### Administration Facility

In [13]:
# examples of locations would be school, military, senior home, etc.
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(df['V_ADMINBY'])
df['V_ADMINBY'] = label_encoder.transform(df['V_ADMINBY'])

### Allergies

In [14]:
# the following columns cannot be cleaned using LabelEncoder, as many rows have multiple values to represent.

In [15]:
# preparing the columns with text data is a little tricky, as there is no standard format, there are many misspellings, and many people have multiple values listed
# Let's begin by lowercasing the Allergies column
df['ALLERGIES'] = df['ALLERGIES'].str.lower()

In [16]:
# Now we replace the many different ways of writing "none", as well as NaNs, with 'none'
nonelist = ['no', 'no known allergies', 'unknown', 'none known', 'n/a', 'none reported', 'na', 'none.',
            'no known drug allergies', 'no allergies', 'na', 'no known', 'no known allergies.', 'none listed', 
           'unk', 'none known.']

df['ALLERGIES'] = df['ALLERGIES'].fillna('none')
df['ALLERGIES'] = df['ALLERGIES'].replace('penicillin|sulfa', 'penicillin')
df['ALLERGIES'] = df['ALLERGIES'].replace(nonelist, 'none')

In [17]:
# these loops find each list-like text entry and turn it into a format acceptable for get_dummies
# We are using this method instead of OneHotEncoder or LabelBinarizer so a row can have multiple allergies represented
# by only using the allergies that more than thirty people have in the data, we can keep the dimensionality and required cleaning down
# it is likely that an algorithm would not be able to learn from allergies that have a very small representation
allall = []
for each in df['ALLERGIES']:
    if ',' in each:
        alls = each.split(',')
        ally = []
        for weach in alls:
            ally.append(weach.strip())
        allall.append(ally)
    else:
        allall.append(each)
        
allall2 = []
for each in allall:
    if type(each) == list:
        welt = "|".join(each)
        allall2.append(welt)
    else:
        allall2.append(each)
        
listy = list(pd.Series(allall2).value_counts()[pd.Series(allall2).value_counts()>30].index)
allall3 = list(map(lambda x: 'none' if x not in listy else x, allall2))

dummyframe = pd.Series(allall3).str.get_dummies()

listy.remove('penicillin|sulfa')
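
The same split, join, and filter pattern is repeated for allergies, current illnesses, history, and medications. A compact variant is sketched below. It is an alternative, not the notebook's method: it counts each individual value rather than each combined string, and the function name and `min_count` parameter are hypothetical.

```python
import pandas as pd

def multi_value_dummies(column: pd.Series, min_count: int) -> pd.DataFrame:
    """One indicator column per comma-separated value seen in more than min_count rows."""
    # normalize "a, b ,c" to "a|b|c" so str.get_dummies can split on '|'
    joined = column.fillna('none').map(
        lambda text: '|'.join(part.strip() for part in text.split(',') if part.strip())
    )
    dummies = joined.str.get_dummies(sep='|')
    # keep only the values that occur often enough to be worth a column
    frequent = dummies.columns[dummies.sum() > min_count]
    return dummies[frequent]

# made-up entries in the style of the ALLERGIES column
allergies = pd.Series(['penicillin, sulfa', 'penicillin', None, 'latex', 'sulfa , penicillin'])
allergy_dummies = multi_value_dummies(allergies, min_count=1)
print(allergy_dummies)
```

Counting individual values means a row such as `penicillin, sulfa` contributes to both columns, so no combined category has to be removed by hand afterwards.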

In [18]:
df.reset_index(inplace=True)

In [19]:
# Let's attach our new columns to the dataframe
df = df.join(dummyframe[listy])

### Current Illnesses

In [20]:
df['CUR_ILL'] = df['CUR_ILL'].str.lower()

nonelist = ['no', 'unknown', 'none.', 'none reported', 'n/a', 'na', 'none known', 'denies', 'none noted', '0', 'no illness',
           'none listed', 'not known', 'no known', 'non', 'no acute illnesses', 'no.', 'denied', 'see below', 'no illnesses',
            'unk', 'unkown', 'none documented', 'none stated', 'nothing', 'none known.', 'unknown.', 'no known illnesses',
            'n/a.','no e', 'none reported.', 'no acute illness']

df['CUR_ILL'] = df['CUR_ILL'].fillna('none')
df['CUR_ILL'] = df['CUR_ILL'].replace(nonelist, 'none')
df['CUR_ILL'] = df['CUR_ILL'].replace(['covid 19', 'covid', 'covid- 19 diagnosis 12/11/2020 asymptomatic', 'covid-19 (diagnosed 10/26/20)', 'covid-19  (diagnosed 10/26/20)'], 'covid-19')

In [21]:
allall = []
for each in df['CUR_ILL']:
    if ',' in each:
        alls = each.split(',')
        ally = []
        for weach in alls:
            ally.append(weach.strip())
        allall.append(ally)
    else:
        allall.append(each)
        
allall2 = []
for each in allall:
    if type(each) == list:
        welt = "|".join(each)
        allall2.append(welt)
    else:
        allall2.append(each)
        
listy = list(pd.Series(allall2).value_counts()[pd.Series(allall2).value_counts()>13].index)
allall3 = list(map(lambda x: 'none' if x not in listy else x, allall2))
    
datufrayme = pd.Series(allall3).str.get_dummies()

listy.remove('alcohol use disorder|facial laceration|alcohol intoxication|secondary syphillis')
listy.remove('elevated troponin i level elevated troponin i level        elevated brain natriuretic peptide (bnp) level elevated brain natriuretic peptide (bnp) level        dyspnea       chest pain        atrial fibrillation with rapid ventricular response (hcc) atrial fibrillation with rapid ventricular response|initial encounter       hyponatremia hyponatremia')
    
df = df.join(datufrayme[listy], lsuffix=" cur_ill")

### Patient history

In [22]:
df['HISTORY'] = df['HISTORY'].str.lower()

In [23]:
# Quite a few entries for history need to be normalized; duplicates were found using .value_counts() and are mapped to their most common form
nonelist = ['no', 'unknown', 'none.', 'none reported', 'n/a', 'na', 'none known', 'denies', 'none noted', '0', 'no illness',
           'none listed', 'not known', 'no known', 'non', 'no acute illnesses', 'no.', 'denied', 'see below', 'no illnesses',
            'unk', 'unkown', 'none documented', 'none stated', 'nothing', 'none known.', 'unknown.', 'as above', 'no known illnesses',
            'n/a.','no e', 'none reported.', 'medical history/concurrent conditions: no adverse event (no reported medical history)',
           'medical history/concurrent conditions: no adverse event (no reported medical history.)', 'see above', 'medical history/concurrent conditions: no adverse event',
           'medical history/concurrent conditions: no adverse event (no medical history reported.)', 'medical history/concurrent conditions: no adverse event (no medical history reported)',
           'medical history/concurrent conditions: no adverse event (medical history not provided)', 'comments: list of non-encoded patient relevant history: patient other relevant history 1: none',
           ]

df['HISTORY'] = df['HISTORY'].fillna('none')
df['HISTORY'] = df['HISTORY'].replace(nonelist, 'none')
df['HISTORY'] = df['HISTORY'].replace('medical history/concurrent conditions: covid-19', 'covid-19')
df['HISTORY'] = df['HISTORY'].replace('medical history/concurrent conditions: hypertension', 'hypertension')
df['HISTORY'] = df['HISTORY'].replace('medical history/concurrent conditions: penicillin allergy', 'penicillin allergy')
df['HISTORY'] = df['HISTORY'].replace(['medical history/concurrent conditions: asthma','mild asthma','exercise induced asthma'], 'asthma')
df['HISTORY'] = df['HISTORY'].replace('medical history/concurrent conditions: blood pressure high', 'high blood pressure')
df['HISTORY'] = df['HISTORY'].replace('medical history/concurrent conditions: sulfonamide allergy', 'sulfonamide allergy')
df['HISTORY'] = df['HISTORY'].replace(['diabetic', 'type 2 diabetes', 'type 1 diabetes'], 'diabetes')
df['HISTORY'] = df['HISTORY'].replace('medical history/concurrent conditions: migraine', 'migraines')

In [24]:
allall = []
for each in df['HISTORY']:
    if ',' in each:
        alls = each.split(',')
        ally = []
        for weach in alls:
            ally.append(weach.strip())
        allall.append(ally)
    else:
        allall.append(each)
            
allall2 = []
for each in allall:
    if type(each) == list:
        welt = "|".join(each)
        allall2.append(welt)
    else:
        allall2.append(each)
            
listy = list(pd.Series(allall2).value_counts()[pd.Series(allall2).value_counts()>40].index)
allall3 = list(map(lambda x: 'none' if x not in listy else x, allall2))
    
datufrayme = pd.Series(allall3).str.get_dummies()

listy.remove('cerebral palsy|anxiety|crohns|bipolar|gerd|nutrition deficiency|iron deficiency')
    
df = df.join(datufrayme[listy], lsuffix=" history")

### Medication

In [25]:
df['OTHER_MEDS'] = df['OTHER_MEDS'].str.lower()

In [26]:
nonelist = ['unknown', 'no', 'none.', 'n/a', 'none reported', 'unk', 'none known', ';', 'not known', 'na', 'denies', ';  ;', 
           'nothing']

df['OTHER_MEDS'] = df['OTHER_MEDS'].fillna('none')
df['OTHER_MEDS'] = df['OTHER_MEDS'].replace(nonelist, 'none')

In [27]:
allall = []
for each in df['OTHER_MEDS']:
    if ',' in each:
        alls = each.split(',')
        ally = []
        for weach in alls:
            ally.append(weach.strip())
        allall.append(ally)
    else:
        allall.append(each)
            
allall2 = []
for each in allall:
    if type(each) == list:
        welt = "|".join(each)
        allall2.append(welt)
    else:
        allall2.append(each)
            
listy = list(pd.Series(allall2).value_counts()[pd.Series(allall2).value_counts()>20].index)
allall3 = list(map(lambda x: 'none' if x not in listy else x, allall2))
    
datufrayme = pd.Series(allall3).str.get_dummies()
    
df = df.join(datufrayme[listy], lsuffix=" meds")

### Symptoms

### Further Cleaning

In [28]:
# drop all the columns with data that has been dummied, data that gives away the target such as days in hospital,
#     columns of post-diagnosis information, and irrelevant columns such as Form Version
df.drop(columns = ['RECVDATE', 'CAGE_MO', 'CAGE_YR', 'RPT_DATE', 'SYMPTOM_TEXT', 'DIED',
                  'DATEDIED', 'L_THREAT', 'ER_VISIT', 'HOSPDAYS', 'X_STAY', 'RECOVD', 'VAX_DATE', 'ONSET_DATE',
                  'NUMDAYS', 'LAB_DATA', 'V_FUNDBY', 'OTHER_MEDS', 'CUR_ILL', 'HISTORY', 'PRIOR_VAX', 'SPLTTYPE', 
                  'FORM_VERS', 'TODAYS_DATE', 'OFC_VISIT', 'ER_ED_VISIT', 'ALLERGIES', 'VAX_TYPE', 'VAX_LOT', 
                  'SYMPTOM1', 'SYMPTOM2','SYMPTOM3', 'SYMPTOM4', 'SYMPTOM5','SYMPTOMVERSION1', 'SYMPTOMVERSION2',
                   'SYMPTOMVERSION3','SYMPTOMVERSION4', 'SYMPTOMVERSION5', 'VAX_NAME', 'VAERS_ID'],
       axis = 1, inplace=True)

In [29]:
categoricals = ['STATE', 'V_ADMINBY', 'VAX_MANU', 'VAX_ROUTE', 'VAX_SITE']

In [30]:
df = df.astype(int)
df[categoricals] = df[categoricals].astype('category')

In [31]:
df_dupe = df.drop_duplicates()

## EDA

In [None]:
df

## Modelling

In [32]:
# XGBoost can consume these categorical columns directly if enable_categorical=True is passed to the DMatrix, but only
#     when the GPU is enabled and numerous other requirements are met, so we one-hot encode them here instead.
df = pd.concat([df,pd.get_dummies(df['STATE'], prefix='STATE: ')],axis=1).drop(['STATE'],axis=1)
df = pd.concat([df,pd.get_dummies(df['VAX_MANU'], prefix='BRAND: ')],axis=1).drop(['VAX_MANU'],axis=1)
df = pd.concat([df,pd.get_dummies(df['VAX_SITE'], prefix='VAX_SITE: ')],axis=1).drop(['VAX_SITE'],axis=1)
df = pd.concat([df,pd.get_dummies(df['VAX_ROUTE'], prefix='VAX_ROUTE: ')],axis=1).drop(['VAX_ROUTE'],axis=1)
df = pd.concat([df,pd.get_dummies(df['V_ADMINBY'], prefix='ADMINBY: ')],axis=1).drop(['V_ADMINBY'],axis=1)

In [33]:
# let's use hospitalization as our target; if it performs well enough we can try the continuous target HOSPDAYS
x = df.drop(columns = ['HOSPITAL'])
y = df['HOSPITAL']
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = .2)
xtrain, xval, ytrain, yval = train_test_split(xtrain, ytrain, test_size = .2)
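
Splitting by 20% twice leaves 64% of the rows for training, 16% for validation, and 20% for testing. The sketch below shows this on synthetic data. The `random_state` and `stratify` arguments are assumptions added for reproducibility and to keep the share of hospitalized cases equal across splits; they are not part of the split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in for the feature matrix and an imbalanced binary target
x_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# random_state and stratify are assumptions, not part of the original split
x_rest, x_test, y_rest, y_test = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0)

print(len(x_train), len(x_val), len(x_test))
```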

### Lazy Predict

In [None]:
# Lazy Predict is a great tool to easily test your dataset across over thirty models
# this gives us a head start in our modelling as we can begin from the best result below
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(xtrain, xval, ytrain, yval)

print(models)


### SKLearn Random Forest

In [None]:
# since Random Forests generated the best results, we should try and improve upon it
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(xtrain, ytrain)

In [None]:
ypred = rfc.predict(xtrain)
plot_confusion_matrix(estimator=rfc, y_true=ytrain, X = xtrain)

In [None]:
ypred = rfc.predict(xval)
plot_confusion_matrix(estimator=rfc, y_true=yval, X = xval)

### XGBoost

In [None]:
# since the best performing model was a random forest, we can try XGBoost, which is also an ensemble of decision trees (gradient boosted rather than bagged)

In [34]:
df = df.astype(int)

In [36]:
# XGBoost requires its own matrix type (DMatrix) to be used
# XGBoost was built with sparse data in mind; missing=0 tells it to treat zeros as missing values, which greatly improves training efficiency on this mostly-zero data
dtrain = xgb.DMatrix(xtrain, label=ytrain, missing=0)
dtest  = xgb.DMatrix(xtest, label=ytest, missing=0)
dval   = xgb.DMatrix(xval, label=yval, missing=0)

In [37]:
# let's define a parameter tuner using Bayesian Optimization
def bo_tune_xgb(max_depth, gamma, n_estimators ,learning_rate, scale_pos_weight, min_child_weight, colsample_bytree, subsample):
    # the optimizer proposes floats, so the integer-valued parameters are cast before use
    params = {'max_depth'       : int(max_depth),
              'gamma'           : gamma,
              # n_estimators is not read by the native xgb.cv API; the round count is num_boost_round below
              'n_estimators'    : int(n_estimators),
              'learning_rate'   : learning_rate,
              'subsample'       : subsample,
              'eval_metric'     : 'rmse',
              'min_child_weight': min_child_weight,
              'scale_pos_weight': scale_pos_weight,
              'colsample_bytree': colsample_bytree,
              'tree_method'     : 'gpu_hist'}
    # the optimizer maximizes its target, so return the negated mean test RMSE of the final round
    cv_result = xgb.cv(params, dtrain, num_boost_round=200, nfold=5)
    return -1.0 * cv_result['test-rmse-mean'].iloc[-1]
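
With the native API, the number of trees is controlled by `num_boost_round` rather than by an `n_estimators` entry in `params`. One possible way to let the optimizer control the round count is to separate it from the other parameters before calling `xgb.cv`. The helper below is a hypothetical sketch of that separation only, and it does not call XGBoost itself.

```python
def split_booster_args(max_depth, n_estimators, **continuous):
    """Separate native-API params from the boosting round count (hypothetical helper)."""
    # integer-valued settings arrive as floats from the optimizer
    params = {'max_depth': int(max_depth), **continuous}
    num_boost_round = int(n_estimators)
    return params, num_boost_round

booster_params, rounds = split_booster_args(max_depth=15.6, n_estimators=373.6,
                                            gamma=0.778, learning_rate=0.1999)
# inside the tuner this would become:
# xgb.cv(booster_params, dtrain, num_boost_round=rounds, nfold=5)
print(booster_params, rounds)
```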

In [38]:
# here we define the ranges that Bayesian Optimization is allowed to search through
# these ranges are common parameter ranges that most models fall into
xgb_bo = BayesianOptimization(bo_tune_xgb, {'max_depth' : (1, 30),
                        'gamma'            : (0, 2),
                        'subsample'        : (0,1),           
                        'learning_rate'    : (0,1),
                        'n_estimators'     : (100,400),
                        'scale_pos_weight' : (5,10),
                        'min_child_weight' : (1,10),
                        'colsample_bytree' : (0,1)} ,verbose=3)

In [39]:
# here we search for the strongest parameter combination
xgb_bo.maximize(n_iter=20, init_points=15, acq='ei')

|   iter    |  target   | colsam... |   gamma   | learni... | max_depth | min_ch... | n_esti... | scale_... | subsample |
-------------------------------------------------------------------------------------------------------------------------
Parameters: { n_estimators } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.

|  4        | -0.4396   |  0.1636   |  1.397    |  0.4194   |  2.625    |  6.96     |  175.7    |  5.386    |  0.01547  |
|  8        | -0.3458   |  0.5312   |  0.778    |  0.1999   |  15.6     |  3.805    |  373.6    |  5.33     |  0.3      |
|  12       | -0.3209   |  0.9868   |  0.2588   |  0.1133   |  15.88    |  4.989    |  100.2    |  9.657    |  0.6738   |
|  16       | -0.4427   |  0.2443   |  1.462    |  0.4707   |  15.75    |  8.08     |  101.7    |  9.935    |  0.325    |
|  20       | -0.3868   |  0.3803   |  1.805    |  0.6522   |  4.916    |  3.647    |  265.6    |  5.145    |  0.5095   |
|  24       | -0.4681   |  0.7184   |  1.603    |  0.8309   |  7.624    |  5.785    |  283.1    |  5.411    |  0.2606   |
|  28       | -0.4159   |  0.3939   |  0.8727   |  0.7701   |  19.64    |  6.095    |  366.3    |  9.819    |  0.8815   |
|  32       | -0.4652   |  0.3104   |  1.852    |  0.1067   |  3.235    |  9.657    |  272.1    |  9.317    |  0.5328   |

In [40]:
# .max returns the strongest parameters
params = xgb_bo.max['params']

# some of the parameters need to be whole numbers
params['max_depth'] = int(params['max_depth'])
params['n_estimators'] = int(params['n_estimators'])
params

{'colsample_bytree': 0.6706444130863642,
 'gamma': 0.06862852195992875,
 'learning_rate': 0.33445149639682603,
 'max_depth': 22,
 'min_child_weight': 3.124991575902069,
 'n_estimators': 223,
 'scale_pos_weight': 7.785205591714172,
 'subsample': 0.9479171369815372}

In [41]:
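# let's see our results for the train data
# xgb.train ignores the n_estimators entry in params; the number of trees comes from num_boost_round, left at its default here
xgb_opt = xgb.train(params, dtrain)
predsopt = xgb_opt.predict(dtrain)
# predictions are passed first and labels second, so the "support" column counts predicted classes
# and the precision/recall columns are swapped relative to sklearn's (y_true, y_pred) convention
print(classification_report(predsopt.round(), ytrain))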



              precision    recall  f1-score   support

         0.0       0.94      0.99      0.97     36577
         1.0       0.96      0.73      0.83      8514

    accuracy                           0.94     45091
   macro avg       0.95      0.86      0.90     45091
weighted avg       0.95      0.94      0.94     45091



In [42]:
confusion_matrix(predsopt.round(), ytrain)

array([[36337,   240],
       [ 2262,  6252]], dtype=int64)

In [43]:
# and now for the validation
# same argument order as the train report: predictions first, labels second
predsoptval = xgb_opt.predict(dval)
print(classification_report(predsoptval.round(), yval))

              precision    recall  f1-score   support

         0.0       0.90      0.96      0.93      9034
         1.0       0.78      0.55      0.64      2239

    accuracy                           0.88     11273
   macro avg       0.84      0.75      0.78     11273
weighted avg       0.87      0.88      0.87     11273



In [44]:
confusion_matrix(predsoptval.round(), yval)

array([[8688,  346],
       [1018, 1221]], dtype=int64)
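
Because `confusion_matrix` is called here as `(predictions, labels)`, the rows of the matrix are predicted classes and the columns are true classes. Under that reading, the usual precision and recall for the hospitalized class can be recomputed by hand from the validation matrix above. This is a sketch of the arithmetic only.

```python
# validation matrix from the cell above; rows are predictions, columns are labels
matrix = [[8688, 346],
          [1018, 1221]]

true_neg, false_neg = matrix[0]   # predicted 0: actually 0, actually 1
false_pos, true_pos = matrix[1]   # predicted 1: actually 0, actually 1

# share of predicted hospitalizations that were real
precision = true_pos / (true_pos + false_pos)
# share of real hospitalizations that were caught
recall = true_pos / (true_pos + false_neg)
accuracy = (true_pos + true_neg) / sum(map(sum, matrix))

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Read this way, the model catches about 78% of hospitalized patients on the validation set, while about 55% of its hospitalization predictions are correct.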

### Neural Network

In [None]:
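ypred = xgb_opt.predict(dval)
# plot_confusion_matrix expects a fitted scikit-learn estimator, so the precomputed matrix is drawn with ConfusionMatrixDisplay instead
metrics.ConfusionMatrixDisplay(confusion_matrix=np.array([[8707,  365],[ 937, 1264]])).plot()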

In [None]:
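# swish, x * sigmoid(b * x), is a smooth alternative to ReLU that keeps a small gradient for negative inputs
# Keras also ships it as the built-in 'swish' activation string, which is what the layers below use
def swish(x, b = 1):
    return (x * tf.sigmoid(b * x))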
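
To get a feel for the shape of swish, the formula can be evaluated on a few scalar inputs without TensorFlow. This is a plain-Python sketch of the same expression, written as `x / (1 + exp(-b * x))`, which is algebraically identical to `x * sigmoid(b * x)`.

```python
import math

def swish(x: float, b: float = 1.0) -> float:
    """Swish activation: x * sigmoid(b * x)."""
    return x / (1.0 + math.exp(-b * x))

# negative inputs give small negative outputs, large positive inputs pass almost unchanged
for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"swish({value:+.1f}) = {swish(value):+.4f}")
```

Unlike ReLU, the output is not exactly zero for negative inputs, so those units still receive a gradient during training.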

In [None]:
def newmod():
    model = tf.keras.Sequential()
    model.add(Dense(176, input_dim=len(xtrain.columns), activation='swish'))
    model.add(Dropout(.2))

    model.add(Dense(88, activation='swish'))
    model.add(Dropout(.2))
    
    model.add(Dense(44, activation='swish'))
    model.add(Dropout(.2))
    
    model.add(Dense(22, activation='swish'))
    model.add(Dropout(.2))
    
    model.add(Dense(11, activation='swish'))
    model.add(Dropout(.2))
    
    model.add(Dense(1, activation='sigmoid'))
    
    return model


estimator = newmod()
estimator.compile(optimizer='nadam', 
                  metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()], loss='binary_crossentropy')

In [None]:
history = estimator.fit(xtrain, ytrain, epochs=100, validation_data=(xval, yval))

In [None]:
history_df = pd.DataFrame(history.history)
plt.plot(history_df['loss'], label='loss')
plt.plot(history_df['val_loss'], label='val_loss')

plt.legend()