# Medical Transcription Classification Task

The goal of this project is to develop or find a NLP model that can correctly classify the medical specialties based on the transcription text with significant accuracy. This project was heavily inspired by the existing Kaggle dataset Medical Transcriptions uploaded by Tara Boyle and the website mtsamples.com that has a collection of transcribed medical report. 

In [195]:
import pandas as pd
import numpy as np
import re
import random
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import f1_score

np.random.seed(42)
random.seed(42)

In [196]:
data_df = pd.read_csv("mtsamples.csv", usecols=["description", "medical_specialty", "sample_name", "transcription", "keywords"])
data_df.head(3)


Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."


### EDA

In [197]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        4999 non-null   object
 1   medical_specialty  4999 non-null   object
 2   sample_name        4999 non-null   object
 3   transcription      4966 non-null   object
 4   keywords           3931 non-null   object
dtypes: object(5)
memory usage: 195.4+ KB


In [198]:
data_df.describe()

Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
count,4999,4999,4999,4966,3931.0
unique,2348,40,2377,2357,3849.0
top,An example/template for a routine normal male...,Surgery,Lumbar Discogram,"PREOPERATIVE DIAGNOSIS: , Low back pain.,POSTO...",
freq,12,1103,5,5,81.0


In [199]:
data_df['medical_specialty'].value_counts()


 Surgery                          1103
 Consult - History and Phy.        516
 Cardiovascular / Pulmonary        372
 Orthopedic                        355
 Radiology                         273
 General Medicine                  259
 Gastroenterology                  230
 Neurology                         223
 SOAP / Chart / Progress Notes     166
 Obstetrics / Gynecology           160
 Urology                           158
 Discharge Summary                 108
 ENT - Otolaryngology               98
 Neurosurgery                       94
 Hematology - Oncology              90
 Ophthalmology                      83
 Nephrology                         81
 Emergency Room Reports             75
 Pediatrics - Neonatal              70
 Pain Management                    62
 Psychiatry / Psychology            53
 Office Notes                       51
 Podiatry                           47
 Dermatology                        29
 Dentistry                          27
 Cosmetic / Plastic Surge

In [200]:
data_df["transcription"][10]

'PREOPERATIVE DIAGNOSIS: , Morbid obesity. ,POSTOPERATIVE DIAGNOSIS: , Morbid obesity. ,PROCEDURE:,  Laparoscopic Roux-en-Y gastric bypass, antecolic, antegastric with 25-mm EEA anastamosis, esophagogastroduodenoscopy. ,ANESTHESIA: , General with endotracheal intubation. ,INDICATIONS FOR PROCEDURE: , This is a 50-year-old male who has been overweight for many years and has tried multiple different weight loss diets and programs.  The patient has now begun to have comorbidities related to the obesity.  The patient has attended our bariatric seminar and met with our dietician and psychologist.  The patient has read through our comprehensive handout and understands the risks and benefits of bypass surgery as evidenced by the signing of our consent form.,PROCEDURE IN DETAIL: , The risks and benefits were explained to the patient.  Consent was obtained.  The patient was taken to the operating room and placed supine on the operating room table.  General anesthesia was administered with endot

In [201]:
data_df["transcription"][11]

'2-D STUDY,1. Mild aortic stenosis, widely calcified, minimally restricted.,2. Mild left ventricular hypertrophy but normal systolic function.,3. Moderate biatrial enlargement.,4. Normal right ventricle.,5. Normal appearance of the tricuspid and mitral valves.,6. Normal left ventricle and left ventricular systolic function.,DOPPLER,1. There is 1 to 2+ aortic regurgitation easily seen, but no aortic stenosis.,2. Mild tricuspid regurgitation with only mild increase in right heart pressures, 30-35 mmHg maximum.,SUMMARY,1. Normal left ventricle.,2. Moderate biatrial enlargement.,3. Mild tricuspid regurgitation, but only mild increase in right heart pressures.'

In [202]:
data_df["description"][10]

' Morbid obesity.  Laparoscopic Roux-en-Y gastric bypass, antecolic, antegastric with 25-mm EEA anastamosis, esophagogastroduodenoscopy.'

In [203]:
data_df["description"][11]

' Normal left ventricle, moderate biatrial enlargement, and mild tricuspid regurgitation, but only mild increase in right heart pressures.'

In [204]:
data_df["sample_name"][10]

' Laparoscopic Gastric Bypass - 1 '

In [205]:
data_df.isna().any()


description          False
medical_specialty    False
sample_name          False
transcription         True
keywords              True
dtype: bool

In [206]:
data_df.isna().sum()

description             0
medical_specialty       0
sample_name             0
transcription          33
keywords             1068
dtype: int64

### Data Cleaning

In [207]:
# only use two columns 

data_tr = data_df[["medical_specialty", "transcription"]]
data_tr.shape

(4999, 2)

In [208]:
# drop rows where transcriptions don't have any text 

data_tr = data_tr.dropna(axis=0)
data_tr.shape

(4966, 2)

In [209]:
data_tr[95:98]

Unnamed: 0,medical_specialty,transcription
95,Urology,"PREOPERATIVE DIAGNOSIS: , Inguinal hernia.,POS..."
96,Urology,"PROCEDURE PERFORMED: , Inguinal herniorrhaphy...."
98,Urology,"PREOPERATIVE DIAGNOSIS:, Bilateral inguinal h..."


In [210]:
data_tr = data_tr.reset_index().drop("index", axis=1)
data_tr[95:98]

Unnamed: 0,medical_specialty,transcription
95,Urology,"PREOPERATIVE DIAGNOSIS: , Inguinal hernia.,POS..."
96,Urology,"PROCEDURE PERFORMED: , Inguinal herniorrhaphy...."
97,Urology,"PREOPERATIVE DIAGNOSIS:, Bilateral inguinal h..."


In [211]:
def strip_lr(row):
    output = row.strip()
    return output

def clean_text(row):
    sample_text = row
    sample_text2 = re.sub(r"\.", " ", sample_text) # replace any . 
    sample_text2 = re.sub(r"[A-Z]{2,}", "", sample_text2) # replace any UPPERCASE words longer than 2 characters
    sample_text2 = re.sub(r"\d", "", sample_text2) # replace any digit
    sample_text2 = re.sub(r"\:", " ", sample_text2) # replace :
    sample_text2 = re.sub(r",", "", sample_text2) # replace comma
    sample_text2 = re.sub(r"-", " ", sample_text2)
    sample_text2 = re.sub(r"_", " ", sample_text2)
    sample_text2 = re.sub(r"[\\\/]", " ", sample_text2) 
    sample_text2 = re.sub(r"\s{2,}", " ", sample_text2) # replace whitespaces longer than 2
    sample_text2 = sample_text2.strip() # strip any whitespace on the sides of text
    sample_text2 = sample_text2.lower() # make everything lowercase
    output = sample_text2
    
    return output

data_tr['medical_specialty'] = data_tr['medical_specialty'].apply(strip_lr)
data_tr['medical_specialty'] = data_tr['medical_specialty'].apply(clean_text)
data_tr['transcription'] = data_tr['transcription'].apply(clean_text)
data_tr.head(3)

Unnamed: 0,medical_specialty,transcription
0,allergy immunology,this year old white female presents with compl...
1,bariatrics,he has difficulty climbing stairs difficulty w...
2,bariatrics,i have seen today he is a very pleasant gentle...


In [212]:
data_tr['transcription'][1]

'he has difficulty climbing stairs difficulty with airline seats tying shoes used to public seating and lifting objects off the floor he exercises three times a week at home and does cardio he has difficulty walking two blocks or five flights of stairs difficulty with snoring he has muscle and joint pains including knee pain back pain foot and ankle pain and swelling he has gastroesophageal reflux disease includes reconstructive surgery on his right hand years ago he is currently single he has about ten drinks a year he had smoked significantly up until several months ago he now smokes less than three cigarettes a day heart disease in both grandfathers grandmother with stroke and a grandmother with diabetes denies obesity and hypertension in other family members none he is allergic to penicillin he has been going to support groups for seven months with lynn holmberg in greenwich and he is from eastchester new york and he feels that we are the appropriate program he had a poor experienc

In [213]:
data_tr['medical_specialty'].value_counts()

surgery                     1088
consult history and phy      516
cardiovascular pulmonary     371
orthopedic                   355
radiology                    273
general medicine             259
gastroenterology             224
neurology                    223
chart progress notes         166
urology                      156
obstetrics gynecology        155
discharge summary            108
otolaryngology                96
neurosurgery                  94
hematology oncology           90
ophthalmology                 83
nephrology                    81
emergency room reports        75
pediatrics neonatal           70
pain management               61
psychiatry psychology         53
office notes                  50
podiatry                      47
dermatology                   29
dentistry                     27
cosmetic plastic surgery      27
letters                       23
physical medicine rehab       21
sleep medicine                20
endocrinology                 19
bariatrics

In [214]:
from collections import Counter
texts = data_tr['transcription']
tokenized_texts = [row.split() for row in texts]
vocabulary = Counter()

for row in tokenized_texts:
    vocabulary.update(row)

In [215]:
len(vocabulary)

22116

In [277]:
vocabulary.most_common(5)

[('the', 149521), ('and', 81921), ('was', 71759), ('of', 56203), ('to', 50468)]

In [282]:
np.max([len(row) for row in tokenized_texts])

2991

### Split the Data into Train, Validation, Test Sets

In [216]:
# split: 
# train set: 70% 
# val set: 15%
# test set: 15% 

In [217]:
sss = StratifiedShuffleSplit(n_splits=1, train_size=0.7, random_state=42)
for train_index, val_test_index in sss.split(data_tr, data_tr['medical_specialty']):
    train_set = data_tr.loc[data_tr.index.intersection(train_index)]
    val_test_set = data_tr.loc[data_tr.index.intersection(val_test_index)]
    

In [218]:
train_X = train_set.drop('medical_specialty',axis=1)['transcription']
train_y = train_set['medical_specialty'].copy()

In [219]:
sss = StratifiedShuffleSplit(n_splits=1, train_size=0.5, random_state=42)
for val_index, test_index in sss.split(val_test_set, val_test_set['medical_specialty']):
    val_set = val_test_set.loc[val_test_set.index.intersection(val_index)]
    test_set = val_test_set.loc[val_test_set.index.intersection(test_index)]


In [220]:
val_X = val_set.drop('medical_specialty',axis=1)['transcription']
val_y = val_set["medical_specialty"].copy()
test_X = test_set.drop('medical_specialty',axis=1)['transcription']
test_y = test_set['medical_specialty'].copy()

In [221]:
train_X.shape, val_X.shape, test_X.shape

((3476,), (228,), (219,))

In [222]:
train_X

0       this year old white female presents with compl...
1       he has difficulty climbing stairs difficulty w...
3       d m left atrial enlargement with left atrial d...
6       deformity right breast reconstruction excess s...
7       d multiple views of the heart and great vessel...
                              ...                        
4959    i ligature strangulation a circumferential lig...
4960    a year old female presents self referred for t...
4962    kawasaki disease kawasaki disease resolving th...
4963    this is a year old white female who comes in t...
4964    this year old male presents to children's hosp...
Name: transcription, Length: 3476, dtype: object

### Baseline Model Performance

#### Random

In [223]:
medical_specialties = list(data_tr['medical_specialty'])

In [224]:
# Validation Set
random_pred = np.random.choice(medical_specialties, val_set.shape[0])

print("Micro F-1 score:", f1_score(val_y, random_pred, average="micro"))
print("Macro F-1 score:", f1_score(val_y, random_pred, average="macro"))
print("Weighted F-1 score:", f1_score(val_y, random_pred, average="weighted"))

Micro F-1 score: 0.14473684210526316
Macro F-1 score: 0.011146135417980079
Weighted F-1 score: 0.22553957709808506


In [225]:
# Test Set
random_pred = np.random.choice(medical_specialties, test_set.shape[0])

print("Micro F-1 score:", f1_score(test_y, random_pred, average="micro"))
print("Macro F-1 score:", f1_score(test_y, random_pred, average="macro"))
print("Weighted F-1 score:", f1_score(test_y, random_pred, average="weighted"))

Micro F-1 score: 0.182648401826484
Macro F-1 score: 0.016040316922849972
Weighted F-1 score: 0.2788184885588703


#### Most common/majority 

In [226]:
majority_label = data_tr['medical_specialty'].value_counts().sort_values(ascending=False).index[0]

In [227]:
# Validation Set
majority_prediction = [majority_label]*val_set.shape[0]
print("Micro F-1 score:", f1_score(val_y, majority_prediction, average="micro"))
print("Macro F-1 score:", f1_score(val_y, majority_prediction, average="macro"))
print("Weighted F-1 score:", f1_score(val_y, majority_prediction, average="weighted"))


Micro F-1 score: 0.7017543859649122
Macro F-1 score: 0.09163802978235967
Weighted F-1 score: 0.5787665038885874


In [228]:
# Test Set
majority_prediction = [majority_label]*test_set.shape[0]
print("Micro F-1 score:", f1_score(test_y, majority_prediction, average="micro"))
print("Macro F-1 score:", f1_score(test_y, majority_prediction, average="macro"))
print("Weighted F-1 score:", f1_score(test_y, majority_prediction, average="weighted"))


Micro F-1 score: 0.7579908675799086
Macro F-1 score: 0.08623376623376623
Weighted F-1 score: 0.653644072822155


### Feature Extraction

In [229]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

train_X_processed = vectorizer.fit_transform(train_X)

In [230]:
vectorizer.get_feature_names()

['abdomen',
 'abdominal',
 'able',
 'abnormal',
 'abnormalities',
 'abscess',
 'abuse',
 'access',
 'accident',
 'achieved',
 'active',
 'activities',
 'activity',
 'actually',
 'acute',
 'addition',
 'additional',
 'adenopathy',
 'adequate',
 'adhesions',
 'administered',
 'admission',
 'admitted',
 'advanced',
 'age',
 'ago',
 'air',
 'airway',
 'alcohol',
 'alert',
 'allergies',
 'allow',
 'allowed',
 'alternatives',
 'amounts',
 'anastomosis',
 'anemia',
 'anesthesia',
 'anesthetic',
 'ankle',
 'anterior',
 'anteriorly',
 'antibiotic',
 'antibiotics',
 'anxiety',
 'aorta',
 'aortic',
 'apparent',
 'apparently',
 'appear',
 'appearance',
 'appeared',
 'appearing',
 'appears',
 'applied',
 'appreciated',
 'appropriate',
 'approximated',
 'approximately',
 'area',
 'areas',
 'arm',
 'arteries',
 'artery',
 'arthritis',
 'asked',
 'aspect',
 'aspiration',
 'aspirin',
 'associated',
 'asthma',
 'atraumatic',
 'atrial',
 'attention',
 'auscultation',
 'avoid',
 'away',
 'axillary',
 'bab

In [231]:
val_X_processed = vectorizer.transform(val_X)
test_X_processed = vectorizer.transform(test_X)

In [232]:
vectorizer.vocabulary_

{'year': 998,
 'old': 607,
 'white': 987,
 'female': 339,
 'presents': 676,
 'allergies': 30,
 'used': 951,
 'worse': 994,
 'past': 629,
 'short': 801,
 'time': 905,
 'began': 83,
 'using': 952,
 'weeks': 984,
 'ago': 25,
 'does': 264,
 'appear': 49,
 'prescription': 672,
 'nasal': 570,
 'asthma': 70,
 'daily': 212,
 'medication': 529,
 'think': 899,
 'currently': 207,
 'known': 469,
 'weight': 985,
 'pounds': 668,
 'blood': 98,
 'pressure': 677,
 'throat': 901,
 'mildly': 541,
 'mucosa': 563,
 'clear': 156,
 'drainage': 271,
 'seen': 790,
 'neck': 574,
 'supple': 869,
 'adenopathy': 17,
 'lungs': 509,
 'use': 950,
 'given': 380,
 'difficulty': 245,
 'floor': 353,
 'times': 906,
 'week': 983,
 'home': 413,
 'muscle': 567,
 'joint': 462,
 'including': 430,
 'knee': 467,
 'pain': 623,
 'foot': 364,
 'ankle': 39,
 'swelling': 877,
 'reflux': 724,
 'disease': 255,
 'surgery': 872,
 'right': 760,
 'hand': 394,
 'years': 999,
 'single': 814,
 'months': 554,
 'day': 214,
 'heart': 402,
 'stro

### Model Selection

In [233]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
import xgboost as xgb

In [234]:
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Random Forest", 
         "Neural Net", "AdaBoost"]

names2 = ["Gaussian Process", "Decision Tree", "Naive Bayes", "Logistic Regression"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier()]

classifiers2 = [GaussianProcessClassifier(1.0 * RBF(1.0)),
               #DecisionTreeClassifier(max_depth=5),
               GaussianNB(),
               LogisticRegression()]

In [235]:
classifiers

[KNeighborsClassifier(n_neighbors=3),
 SVC(C=0.025, kernel='linear'),
 SVC(C=1, gamma=2),
 RandomForestClassifier(max_depth=5, max_features=1, n_estimators=10),
 MLPClassifier(alpha=1, max_iter=1000),
 AdaBoostClassifier()]

In [236]:
for name, clf in zip(names, classifiers):
    clf.fit(train_X_processed, train_y)
    y_pred = clf.predict(val_X_processed)
    
    # evaluate predictions
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
    print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
    print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0))

Nearest Neighbors F1 Score Micro: 20.18%
Nearest Neighbors F1 Score Macro: 4.18%
Nearest Neighbors F1 Score Weighted: 31.07%
Linear SVM F1 Score Micro: 70.18%
Linear SVM F1 Score Macro: 8.58%
Linear SVM F1 Score Weighted: 60.20%
RBF SVM F1 Score Micro: 25.44%
RBF SVM F1 Score Macro: 4.33%
RBF SVM F1 Score Weighted: 33.13%
Random Forest F1 Score Micro: 69.74%
Random Forest F1 Score Macro: 8.35%
Random Forest F1 Score Weighted: 58.57%
Neural Net F1 Score Micro: 52.63%
Neural Net F1 Score Macro: 11.82%
Neural Net F1 Score Weighted: 54.67%
AdaBoost F1 Score Micro: 48.68%
AdaBoost F1 Score Macro: 6.38%
AdaBoost F1 Score Weighted: 53.72%


In [46]:
for name, clf in zip(names2, classifiers2):
    clf.fit(train_X_processed.toarray(), train_y)
    y_pred = clf.predict(val_X_processed)
    
    # evaluate predictions
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
    print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
    print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0))

KeyboardInterrupt: 

#### Improving Random Forest Classifier

In [237]:
name = "Random Forest Classifier"
rfc = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
rfc.fit(train_X_processed, train_y)
y_pred = rfc.predict(val_X_processed)

f1_micro = f1_score(val_y, y_pred, average="micro")
f1_macro = f1_score(val_y, y_pred, average="macro")
f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0))

Random Forest Classifier F1 Score Micro: 69.74%
Random Forest Classifier F1 Score Macro: 8.26%
Random Forest Classifier F1 Score Weighted: 57.96%


In [238]:
name = "Random Forest Classifier"

n_estimators = [10, 100, 200, 500]

for i in n_estimators:
    rfc_classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
    rfc_classifier.fit(train_X_processed, train_y)
    y_pred = rfc_classifier.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Random Forest Classifier number of trees %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Random Forest Classifier number of trees %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Random Forest Classifier number of trees %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0))    

Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.38%
Random Forest Classifier F1 Score Weighted: 58.79%
Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.38%
Random Forest Classifier F1 Score Weighted: 58.79%
Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.36%
Random Forest Classifier F1 Score Weighted: 58.63%
Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.47%
Random Forest Classifier F1 Score Weighted: 59.41%


In [50]:
max_features = [1, 5, 10, 20, 30, 'auto', 'sqrt', 'log2']
name = "Random Forest Classifier"

for i in max_features:
    rfc = RandomForestClassifier(max_depth=5, n_estimators=500, max_features=i)
    rfc.fit(train_X_processed, train_y)
    y_pred = rfc.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Random Forest Classifier max_feature %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Random Forest Classifier max_feature %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Random Forest Classifier max_feature %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0))  

Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.27%
Random Forest Classifier F1 Score Weighted: 58.03%
Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.63%
Random Forest Classifier F1 Score Weighted: 60.53%
Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.77%
Random Forest Classifier F1 Score Weighted: 61.52%
Random Forest Classifier F1 Score Micro: 69.74%
Random Forest Classifier F1 Score Macro: 8.91%
Random Forest Classifier F1 Score Weighted: 62.51%
Random Forest Classifier F1 Score Micro: 68.42%
Random Forest Classifier F1 Score Macro: 17.33%
Random Forest Classifier F1 Score Weighted: 64.05%
Random Forest Classifier F1 Score Micro: 68.86%
Random Forest Classifier F1 Score Macro: 17.36%
Random Forest Classifier F1 Score Weighted: 64.27%
Random Forest Classifier F1 Score Micro: 67.98%
Random Forest Classifier F1 Score Macro: 8.22%
Random Forest Classifier F1

In [239]:
criterions = ['entropy', 'gini']
name = "Random Forest Classifier"

for i in criterions:
    rfc = RandomForestClassifier(max_depth=5, n_estimators=500, max_features=10, criterion=i)
    rfc.fit(train_X_processed, train_y)
    y_pred = rfc.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Random Forest Classifier criterion %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Random Forest Classifier criterion %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Random Forest Classifier criterion %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0))  

Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.72%
Random Forest Classifier F1 Score Weighted: 61.19%
Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.74%
Random Forest Classifier F1 Score Weighted: 61.36%


In [49]:
max_depths = [1, 5, 10, 20, 50]
name = "Random Forest Classifier"

for i in max_depths:
    rfc = RandomForestClassifier(max_depth=i, n_estimators=500, max_features=10, criterion='gini')
    rfc.fit(train_X_processed, train_y)
    y_pred = rfc.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Random Forest Classifier max_depth %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Random Forest Classifier max_depth %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Random Forest Classifier max_depth %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0)) 

Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 9.16%
Random Forest Classifier F1 Score Weighted: 57.88%
Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.74%
Random Forest Classifier F1 Score Weighted: 61.36%
Random Forest Classifier F1 Score Micro: 63.60%
Random Forest Classifier F1 Score Macro: 13.34%
Random Forest Classifier F1 Score Weighted: 61.31%
Random Forest Classifier F1 Score Micro: 27.63%
Random Forest Classifier F1 Score Macro: 3.72%
Random Forest Classifier F1 Score Weighted: 35.62%
Random Forest Classifier F1 Score Micro: 24.12%
Random Forest Classifier F1 Score Macro: 2.92%
Random Forest Classifier F1 Score Weighted: 32.12%


In [240]:
bootstrap_options = ['True', 'False']
name = "Random Forest Classifier"

for i in bootstrap_options:
    rfc = RandomForestClassifier(max_depth=5, n_estimators=500, max_features=10, criterion='gini', bootstrap=i)
    rfc.fit(train_X_processed, train_y)
    y_pred = rfc.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Random Forest Classifier bootstrap %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Random Forest Classifier bootstrap %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Random Forest Classifier bootstrap %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0)) 
    


Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.70%
Random Forest Classifier F1 Score Weighted: 61.02%
Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.74%
Random Forest Classifier F1 Score Weighted: 61.36%


In [241]:
class_weights = ['balanced', 'balanced_subsample', None]

name = "Random Forest Classifier"

for i in class_weights:
    rfc = RandomForestClassifier(max_depth=5, 
                                 n_estimators=500, 
                                 max_features=10, 
                                 criterion='gini', 
                                 bootstrap=False, 
                                 class_weight=i)
    rfc.fit(train_X_processed, train_y)
    y_pred = rfc.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Random Forest Classifier class_weight %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Random Forest Classifier class_weight %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Random Forest Classifier class_weight %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0)) 

Random Forest Classifier F1 Score Micro: 9.65%
Random Forest Classifier F1 Score Macro: 10.95%
Random Forest Classifier F1 Score Weighted: 12.39%
Random Forest Classifier F1 Score Micro: 9.65%
Random Forest Classifier F1 Score Macro: 9.39%
Random Forest Classifier F1 Score Weighted: 12.88%
Random Forest Classifier F1 Score Micro: 70.18%
Random Forest Classifier F1 Score Macro: 8.82%
Random Forest Classifier F1 Score Weighted: 61.86%


#### XGB Classifier

In [242]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(train_y.tolist())
xgb_train_y = le.transform(train_y.tolist())

xgb_val_y = le.transform(val_y.tolist())

In [243]:
name = "XGB Classifier"
xgbclassifier = xgb.XGBClassifier(learning_rate=0.01, 
                                  reg_lambda=3,
                                  objective='multi:softmax')

xgbclassifier.fit(train_X_processed, xgb_train_y)

y_pred = xgbclassifier.predict(val_X_processed)
f1_micro = f1_score(xgb_val_y, y_pred, average="micro")
f1_macro = f1_score(xgb_val_y, y_pred, average="macro")
f1_weighted = f1_score(xgb_val_y, y_pred, average="weighted")
    
print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0))  



XGB Classifier F1 Score Micro: 40.35%
XGB Classifier F1 Score Macro: 6.24%
XGB Classifier F1 Score Weighted: 49.33%


In [244]:
name = "XGB Classifier"

boosters = ['gbtree', 'gblinear', 'dart']

for i in boosters:
    xgbclassifier = xgb.XGBClassifier(learning_rate=0.01,
                                      booster = i,
                                      reg_lambda=3,
                                      objective='multi:softmax')
    
    xgbclassifier.fit(train_X_processed, xgb_train_y)
    y_pred = xgbclassifier.predict(val_X_processed)
    f1_micro = f1_score(xgb_val_y, y_pred, average="micro")
    f1_macro = f1_score(xgb_val_y, y_pred, average="macro")
    f1_weighted = f1_score(xgb_val_y, y_pred, average="weighted")

    print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
    print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
    print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0)) 



XGB Classifier F1 Score Micro: 40.35%
XGB Classifier F1 Score Macro: 6.24%
XGB Classifier F1 Score Weighted: 49.33%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 39.91%
XGB Classifier F1 Score Macro: 6.22%
XGB Classifier F1 Score Weighted: 48.97%


In [245]:
name = "XGB Classifier"

learning_rates = [0.001, 0.01, 0.1, 1, 5, 10]

for i in learning_rates:
    xgbclassifier = xgb.XGBClassifier(learning_rate=i,
                                      booster = 'gblinear',
                                      reg_lambda=3,
                                      objective='multi:softmax')
    
    xgbclassifier.fit(train_X_processed, xgb_train_y)
    y_pred = xgbclassifier.predict(val_X_processed)
    f1_micro = f1_score(xgb_val_y, y_pred, average="micro")
    f1_macro = f1_score(xgb_val_y, y_pred, average="macro")
    f1_weighted = f1_score(xgb_val_y, y_pred, average="weighted")

    print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
    print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
    print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0)) 



XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 0.00%
XGB Classifier F1 Score Macro: 0.00%
XGB Classifier F1 Score Weighted: 0.00%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%


In [246]:
name = "XGB Classifier"

reg_lambdas = [0.001, 0.01, 0.1, 1, 5, 10]

for i in reg_lambdas:
    xgbclassifier = xgb.XGBClassifier(learning_rate=0.01,
                                      booster = 'gblinear',
                                      reg_lambda=i,
                                      objective='multi:softmax')
    
    xgbclassifier.fit(train_X_processed, xgb_train_y)
    y_pred = xgbclassifier.predict(val_X_processed)
    f1_micro = f1_score(xgb_val_y, y_pred, average="micro")
    f1_macro = f1_score(xgb_val_y, y_pred, average="macro")
    f1_weighted = f1_score(xgb_val_y, y_pred, average="weighted")

    print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
    print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
    print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0)) 




XGB Classifier F1 Score Micro: 62.72%
XGB Classifier F1 Score Macro: 12.00%
XGB Classifier F1 Score Weighted: 60.69%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 8.47%
XGB Classifier F1 Score Weighted: 59.41%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%


In [73]:
name = "XGB Classifier"

reg_alphas = [0.001, 0.01, 0.1, 1, 5, 10]

for i in reg_alphas:
    xgbclassifier = xgb.XGBClassifier(learning_rate=0.01,
                                      booster = 'gblinear',
                                      reg_lambda=0.01,
                                      reg_alpha=i,
                                      objective='multi:softmax')
    
    xgbclassifier.fit(train_X_processed, xgb_train_y)
    y_pred = xgbclassifier.predict(val_X_processed)
    f1_micro = f1_score(xgb_val_y, y_pred, average="micro")
    f1_macro = f1_score(xgb_val_y, y_pred, average="macro")
    f1_weighted = f1_score(xgb_val_y, y_pred, average="weighted")

    print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
    print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
    print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0)) 
    




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%




XGB Classifier F1 Score Micro: 70.18%
XGB Classifier F1 Score Macro: 9.16%
XGB Classifier F1 Score Weighted: 57.88%


#### Improving Multi-layer Perceptron classifier

In [248]:
# activation functions:

activation_functions=["identity", "logistic", "tanh", "relu"]

for i in activation_functions:
    mlp_classifier = MLPClassifier(alpha=1, max_iter=5000, activation=i)
    mlp_classifier.fit(train_X_processed, train_y)
    y_pred = mlp_classifier.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Neural Net Activation function: %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Neural Net Activation function: %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Neural Net Activation function: %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0)) 

# it does not seem to matter as much on f1-score

Neural Net Activation function: identity F1 Score Micro: 51.75%
Neural Net Activation function: identity F1 Score Macro: 11.78%
Neural Net Activation function: identity F1 Score Weighted: 54.28%
Neural Net Activation function: logistic F1 Score Micro: 70.18%
Neural Net Activation function: logistic F1 Score Macro: 8.89%
Neural Net Activation function: logistic F1 Score Weighted: 62.38%
Neural Net Activation function: tanh F1 Score Micro: 54.39%
Neural Net Activation function: tanh F1 Score Macro: 11.92%
Neural Net Activation function: tanh F1 Score Weighted: 55.77%
Neural Net Activation function: relu F1 Score Micro: 53.95%
Neural Net Activation function: relu F1 Score Macro: 11.90%
Neural Net Activation function: relu F1 Score Weighted: 55.50%


In [249]:
solvers = ["lbfgs", "sgd", "adam"]

for i in solvers:
    mlp_classifier = MLPClassifier(alpha=1, max_iter=5000, activation = "logistic", solver=i)
    mlp_classifier.fit(train_X_processed, train_y)
    y_pred = mlp_classifier.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Neural Net Solver: %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Neural Net Solver: %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Neural Net Solver: %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0)) 


Neural Net Solver: lbfgs F1 Score Micro: 23.25%
Neural Net Solver: lbfgs F1 Score Macro: 5.94%
Neural Net Solver: lbfgs F1 Score Weighted: 32.74%
Neural Net Solver: sgd F1 Score Micro: 70.18%
Neural Net Solver: sgd F1 Score Macro: 9.16%
Neural Net Solver: sgd F1 Score Weighted: 57.88%
Neural Net Solver: adam F1 Score Micro: 70.18%
Neural Net Solver: adam F1 Score Macro: 8.89%
Neural Net Solver: adam F1 Score Weighted: 62.38%


In [251]:
l2_penalties = [0.001, 0.01, 0.1, 1, 2, 5, 10]

for i in l2_penalties:
    mlp_classifier = MLPClassifier(solver='adam', alpha=i, max_iter=5000, activation="logistic")
    mlp_classifier.fit(train_X_processed, train_y)
    y_pred = mlp_classifier.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Neural Net L2 Penalty: %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Neural Net L2 Penalty: %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Neural Net L2 Penalty: %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0))
    
# keept it as 0.0001

Neural Net L2 Penalty: 0.001 F1 Score Micro: 16.23%
Neural Net L2 Penalty: 0.001 F1 Score Macro: 4.92%
Neural Net L2 Penalty: 0.001 F1 Score Weighted: 24.11%
Neural Net L2 Penalty: 0.01 F1 Score Micro: 14.47%
Neural Net L2 Penalty: 0.01 F1 Score Macro: 4.79%
Neural Net L2 Penalty: 0.01 F1 Score Weighted: 21.58%
Neural Net L2 Penalty: 0.1 F1 Score Micro: 41.67%
Neural Net L2 Penalty: 0.1 F1 Score Macro: 4.84%
Neural Net L2 Penalty: 0.1 F1 Score Weighted: 49.98%
Neural Net L2 Penalty: 1 F1 Score Micro: 70.18%
Neural Net L2 Penalty: 1 F1 Score Macro: 8.89%
Neural Net L2 Penalty: 1 F1 Score Weighted: 62.38%
Neural Net L2 Penalty: 2 F1 Score Micro: 70.18%
Neural Net L2 Penalty: 2 F1 Score Macro: 9.16%
Neural Net L2 Penalty: 2 F1 Score Weighted: 57.88%
Neural Net L2 Penalty: 5 F1 Score Micro: 70.18%
Neural Net L2 Penalty: 5 F1 Score Macro: 9.16%
Neural Net L2 Penalty: 5 F1 Score Weighted: 57.88%
Neural Net L2 Penalty: 10 F1 Score Micro: 70.18%
Neural Net L2 Penalty: 10 F1 Score Macro: 9.16%


In [254]:
learning_rates = ["constant", "invscaling", "adaptive"]

for i in learning_rates:
    mlp_classifier = MLPClassifier(solver='adam', 
                                   alpha=1, 
                                   max_iter=5000, 
                                   activation='logistic', 
                                   learning_rate=i)
    mlp_classifier.fit(train_X_processed, train_y)
    y_pred = mlp_classifier.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Neural Net Learning Rate Type: %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Neural Net Learning Rate Type: %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Neural Net Learning Rate Type: %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0)) 

Neural Net Learning Rate Type: constant F1 Score Micro: 70.18%
Neural Net Learning Rate Type: constant F1 Score Macro: 8.84%
Neural Net Learning Rate Type: constant F1 Score Weighted: 62.03%
Neural Net Learning Rate Type: invscaling F1 Score Micro: 70.18%
Neural Net Learning Rate Type: invscaling F1 Score Macro: 8.86%
Neural Net Learning Rate Type: invscaling F1 Score Weighted: 62.21%
Neural Net Learning Rate Type: adaptive F1 Score Micro: 70.18%
Neural Net Learning Rate Type: adaptive F1 Score Macro: 8.89%
Neural Net Learning Rate Type: adaptive F1 Score Weighted: 62.38%


In [256]:
learning_rate_inits = [0.001, 0.01, 0.1, 1, 2, 5, 10]


for i in learning_rate_inits:
    mlp_classifier = MLPClassifier(solver="adam", 
                                   alpha=1, 
                                   max_iter=5000, 
                                   activation="logistic",
                                   learning_rate="adaptive",
                                   learning_rate_init=i)
    mlp_classifier.fit(train_X_processed, train_y)
    y_pred = mlp_classifier.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Neural Net Initial Learning Rate: %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Neural Net Initial Learning Rate: %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Neural Net Initial Learning Rate: %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0)) 

Neural Net Initial Learning Rate: 0.001 F1 Score Micro: 70.18%
Neural Net Initial Learning Rate: 0.001 F1 Score Macro: 8.84%
Neural Net Initial Learning Rate: 0.001 F1 Score Weighted: 62.03%
Neural Net Initial Learning Rate: 0.01 F1 Score Micro: 70.18%
Neural Net Initial Learning Rate: 0.01 F1 Score Macro: 8.77%
Neural Net Initial Learning Rate: 0.01 F1 Score Weighted: 61.52%
Neural Net Initial Learning Rate: 0.1 F1 Score Micro: 70.18%
Neural Net Initial Learning Rate: 0.1 F1 Score Macro: 8.53%
Neural Net Initial Learning Rate: 0.1 F1 Score Weighted: 59.88%
Neural Net Initial Learning Rate: 1 F1 Score Micro: 70.18%
Neural Net Initial Learning Rate: 1 F1 Score Macro: 9.16%
Neural Net Initial Learning Rate: 1 F1 Score Weighted: 57.88%
Neural Net Initial Learning Rate: 2 F1 Score Micro: 70.18%
Neural Net Initial Learning Rate: 2 F1 Score Macro: 9.16%
Neural Net Initial Learning Rate: 2 F1 Score Weighted: 57.88%
Neural Net Initial Learning Rate: 5 F1 Score Micro: 10.53%
Neural Net Initial 

In [257]:
max_iters = [1000, 5000, 6000, 7000]

for i in max_iters:
    mlp_classifier = MLPClassifier(solver='adam', 
                                   alpha=1, 
                                   max_iter=i, 
                                   activation='logistic', 
                                   learning_rate='adaptive', 
                                   learning_rate_init=0.001)
    mlp_classifier.fit(train_X_processed, train_y)
    y_pred = mlp_classifier.predict(val_X_processed)
    
    f1_micro = f1_score(val_y, y_pred, average="micro")
    f1_macro = f1_score(val_y, y_pred, average="macro")
    f1_weighted = f1_score(val_y, y_pred, average="weighted")
    
    print("Neural Net Learning Rate Type: %s F1 Score Micro: %.2f%%" % (i,f1_micro * 100.0))
    print("Neural Net Learning Rate Type: %s F1 Score Macro: %.2f%%" % (i,f1_macro * 100.0))
    print("Neural Net Learning Rate Type: %s F1 Score Weighted: %.2f%%" % (i,f1_weighted * 100.0)) 

Neural Net Learning Rate Type: 1000 F1 Score Micro: 70.18%
Neural Net Learning Rate Type: 1000 F1 Score Macro: 8.79%
Neural Net Learning Rate Type: 1000 F1 Score Weighted: 61.69%
Neural Net Learning Rate Type: 5000 F1 Score Micro: 70.18%
Neural Net Learning Rate Type: 5000 F1 Score Macro: 8.84%
Neural Net Learning Rate Type: 5000 F1 Score Weighted: 62.03%
Neural Net Learning Rate Type: 6000 F1 Score Micro: 70.18%
Neural Net Learning Rate Type: 6000 F1 Score Macro: 8.77%
Neural Net Learning Rate Type: 6000 F1 Score Weighted: 61.52%
Neural Net Learning Rate Type: 7000 F1 Score Micro: 70.18%
Neural Net Learning Rate Type: 7000 F1 Score Macro: 8.89%
Neural Net Learning Rate Type: 7000 F1 Score Weighted: 62.38%


### Voting Classifier

In [259]:
from sklearn.ensemble import VotingClassifier

name = "Voting Classifier"

clf1 = MLPClassifier(solver='adam', 
                     alpha=1, 
                     max_iter=7000, 
                     activation='logistic', 
                     learning_rate='adaptive', 
                     learning_rate_init=0.001)

clf2 = RandomForestClassifier(max_depth=5, 
                              n_estimators=500, 
                              max_features=10, 
                              criterion='gini', 
                              bootstrap=False)

clf3 = xgb.XGBClassifier(learning_rate=0.01, 
                         booster = 'gblinear', 
                         reg_lambda=0.01,
                         reg_alpha=0.001,
                         objective='multi:softmax')

eclf1 = VotingClassifier(estimators=[('mlp', clf1), ('rfc', clf2), ('xgb', clf3)], 
                         voting='hard')

eclf1.fit(train_X_processed, xgb_train_y)
y_pred = eclf1.predict(val_X_processed)

f1_micro = f1_score(xgb_val_y, y_pred, average="micro")
f1_macro = f1_score(xgb_val_y, y_pred, average="macro")
f1_weighted = f1_score(xgb_val_y, y_pred, average="weighted")

print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0)) 




Voting Classifier F1 Score Micro: 70.18%
Voting Classifier F1 Score Macro: 8.77%
Voting Classifier F1 Score Weighted: 61.52%


### DistilBERT for Sentence Embeddings + Logistic Regression Classifier

In [10]:
data_tr[95:98]

Unnamed: 0,medical_specialty,transcription
95,urology,inguinal hernia direct inguinal hernia rutkow ...
96,urology,inguinal herniorrhaphy after informed consent ...
97,urology,bilateral inguinal hernia bilateral inguinal h...


In [11]:
data_tr.head(2)

Unnamed: 0,medical_specialty,transcription
0,allergy immunology,this year old white female presents with compl...
1,bariatrics,he has difficulty climbing stairs difficulty w...


In [15]:
from transformers import DistilBertModel,DistilBertTokenizer, DistilBertConfig
import torch

In [16]:
configuration = DistilBertConfig(max_position_embeddings=4000, dropout=0.5, )


In [17]:
model = DistilBertModel(configuration).from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer(configuration).from_pretrained('distilbert-base-uncased')

In [None]:
# Convert to Sentence Embeddings

In [29]:
tokenized_transcripts = data_tr['transcription'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True, max_length=100))
tokenized_transcripts


0       [101, 2023, 2095, 2214, 2317, 2931, 7534, 2007...
1       [101, 2002, 2038, 7669, 8218, 5108, 7669, 2007...
2       [101, 1045, 2031, 2464, 2651, 2002, 2003, 1037...
3       [101, 1040, 1049, 2187, 2012, 14482, 4372, 801...
4       [101, 1996, 2187, 18834, 7277, 7934, 17790, 29...
                              ...                        
4961    [101, 1045, 2018, 1996, 5165, 1997, 3116, 1998...
4962    [101, 27324, 4295, 27324, 4295, 29304, 2023, 2...
4963    [101, 2023, 2003, 1037, 2095, 2214, 2317, 2931...
4964    [101, 2023, 2095, 2214, 3287, 7534, 2000, 2336...
4965    [101, 1037, 2095, 2214, 3287, 7534, 2651, 2969...
Name: transcription, Length: 4966, dtype: object

In [30]:
# Make all sentences equal size for DistilBERT
tokenized_transcripts.apply(len)

0       100
1       100
2       100
3        96
4       100
       ... 
4961    100
4962    100
4963    100
4964    100
4965    100
Name: transcription, Length: 4966, dtype: int64

In [33]:
max_length

100

In [32]:
# need to make all the vectors the same size by padding shorter sentences with the token id 0. 
# You can refer to the notebook for the padding step, it’s basic python string and array manipulation.

max_length = max(tokenized_transcripts.apply(len))

padded_transcripts = [row+[0]*(max_length-len(row)) for row in tokenized_transcripts]
padded_transcripts = np.array(padded_transcripts)

#for row in tokenized_transcripts:
    #if len(row) < max_length:
        #difference = max_length - len(row)
        #row.extend([0 for i in range(difference)])

In [34]:
attention_masked_transcripts = np.where(padded_transcripts!=0,1,0)

In [35]:
padded_transcripts

array([[  101,  2023,  2095, ..., 19077, 12509,   102],
       [  101,  2002,  2038, ...,  2055,  2702,   102],
       [  101,  1045,  2031, ...,  2009,  1998,   102],
       ...,
       [  101,  2023,  2003, ...,  3366,  2029,   102],
       [  101,  2023,  2095, ...,  2091,  2000,   102],
       [  101,  1037,  2095, ...,  1998,  5845,   102]])

In [36]:
attention_masked_transcripts

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])

In [37]:
# Convert the matrix to input tensor for DistilBERT

input_ids = torch.tensor(padded_transcripts)
attention_masked = torch.tensor(attention_masked_transcripts)

In [38]:
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_masked)
    
# last_hidden_states holds the outputs of DistilBERT. 
# It is a tuple with the shape 
# (number of examples, max number of tokens in the sequence, number of hidden units in the DistilBERT model)

In [63]:
#we’re only only interested in BERT’s output for the [CLS] token, 
# so we select that slice of the cube and discard everything else.

# Slice the output for the first position for all the sequences, take all hidden unit outputs

features = last_hidden_states[0][:,0,:].numpy()

In [69]:
# train test split here
y_val_test.values

array(['orthopedic', 'surgery', 'surgery', ..., 'chart progress notes',
       'cardiovascular pulmonary', 'surgery'], dtype=object)

In [70]:
sss = StratifiedShuffleSplit(n_splits=1, train_size=0.7, random_state=42)
labels = data_tr['medical_specialty'].values
for train_index, val_test_index in sss.split(features, labels):
    features_train, features_val_test = features[train_index], features[val_test_index]
    y_train, y_val_test = labels[train_index], labels[val_test_index]
    

sss = StratifiedShuffleSplit(n_splits=1, train_size=0.5, random_state=42)
for val_index, test_index in sss.split(features_val_test, y_val_test):
    features_val, features_test = features_val_test[val_index], features_val_test[test_index]
    y_val, y_test = y_val_test[val_index], y_val_test[test_index]


features_train.shape, features_val.shape, features_test.shape
    

((3476, 768), (745, 768), (745, 768))

In [72]:
### Train the Model using Logistic Regression()

from sklearn.linear_model import LogisticRegression

In [78]:
lr_clf = LogisticRegression(max_iter=500)
lr_clf.fit(features_train, y_train)

lr_clf.score(features_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.22281879194630871

In [None]:
### Evaluate on Validation Set using F-1 Scores

In [76]:
name="logistic regression"
y_pred = lr_clf.predict(features_val)
f1_micro = f1_score(y_val, y_pred, average="micro")
f1_macro = f1_score(y_val, y_pred, average="macro")
f1_weighted = f1_score(y_val, y_pred, average="weighted")
    
print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0))  

logistic regression F1 Score Micro: 28.59%
logistic regression F1 Score Macro: 12.65%
logistic regression F1 Score Weighted: 26.10%


### Keras LSTM with Keras Tokenizer

In [77]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np
from collections import Counter

from sklearn import preprocessing


In [110]:
data_tr.head(2)

Unnamed: 0,medical_specialty,transcription
0,allergy immunology,this year old white female presents with compl...
1,bariatrics,he has difficulty climbing stairs difficulty w...


In [128]:
data_tr3 = data_tr.copy()

le = preprocessing.LabelEncoder()
le.fit(data_tr3['medical_specialty'].to_list())
data_tr3['specialty_id'] = le.transform(data_tr3['medical_specialty'].to_list())
data_tr3.head(3)

Unnamed: 0,medical_specialty,transcription,specialty_id
0,allergy immunology,this year old white female presents with compl...,0
1,bariatrics,he has difficulty climbing stairs difficulty w...,2
2,bariatrics,i have seen today he is a very pleasant gentle...,2


In [173]:
le = preprocessing.LabelEncoder()
le.fit(train_y.to_list())
train_y_labels = le.transform(train_y.to_list())
val_y_labels = le.transform(val_y.to_list())
test_y_labels = le.transform(test_y.to_list())

In [174]:
train_y_labels[:10]

array([ 0,  2,  3,  2,  3,  3,  2,  3,  2, 15])

In [131]:
from collections import Counter
texts = data_tr3['transcription']
tokenized_texts = [row.split() for row in texts]
vocabulary = Counter()

for row in tokenized_texts:
    vocabulary.update(row)

In [175]:
vocab_size = len(vocabulary)
tokenizer = Tokenizer(num_words=vocab_size, split=' ', char_level=False)
tokenizer.fit_on_texts(data_tr3['transcription'])

tokenized_train_X = tokenizer.texts_to_sequences(train_X)
tokenized_val_X = tokenizer.texts_to_sequences(val_X) 
tokenized_test_X = tokenizer.texts_to_sequences(test_X) 

In [176]:
max_len = max([len(x.split()) for x in data_tr3['transcription']])

x_train = pad_sequences(tokenized_train_X, padding='post', maxlen=max_len)
x_val = pad_sequences(tokenized_val_X, padding='post', maxlen=max_len)
x_test = pad_sequences(tokenized_test_X, padding='post', maxlen=max_len)

In [180]:
x_train.shape, x_val.shape

((3476, 2991), (228, 2991))

In [187]:
embedding_dim = 20

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=max_len))
model.add(layers.LSTM(units=50))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(len(specialties), activation="relu"))
model.compile(optimizer="adam", 
              loss="sparse_categorical_crossentropy", 
              metrics=['accuracy'])

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 2991, 20)          442320    
_________________________________________________________________
lstm_2 (LSTM)                (None, 50)                14200     
_________________________________________________________________
dropout_2 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 40)                2040      
Total params: 458,560
Trainable params: 458,560
Non-trainable params: 0
_________________________________________________________________


In [188]:
model.fit(x_train, train_y_labels , epochs=20, batch_size=32, verbose=True)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f81a99846d0>

In [189]:
model.predict_classes(x_val)



array([37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
       37, 37, 37, 37, 37

In [194]:
y_pred = np.argmax(model.predict(x_val), axis=-1)
name = "Keras LSTM"
f1_micro = f1_score(val_y_labels, y_pred, average="micro")
f1_macro = f1_score(val_y_labels, y_pred, average="macro")
f1_weighted = f1_score(val_y_labels, y_pred, average="weighted")

print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0)) 

Keras LSTM F1 Score Micro: 70.18%
Keras LSTM F1 Score Macro: 9.16%
Keras LSTM F1 Score Weighted: 57.88%


### Evaluating on Test Set

In [262]:
mlpclassifier = MLPClassifier(solver='adam', 
                     alpha=1, 
                     max_iter=7000, 
                     activation='logistic', 
                     learning_rate='adaptive', 
                     learning_rate_init=0.001)

mlpclassifier.fit(train_X_processed, train_y)
y_pred = mlpclassifier.predict(test_X_processed)

f1_micro = f1_score(test_y, y_pred, average="micro")
f1_macro = f1_score(test_y, y_pred, average="macro")
f1_weighted = f1_score(test_y, y_pred, average="weighted")

In [264]:
name = "MLP Classifier on Test Set"

print("%s F1 Score Micro: %.2f%%" % (name,f1_micro * 100.0))
print("%s F1 Score Macro: %.2f%%" % (name,f1_macro * 100.0))
print("%s F1 Score Weighted: %.2f%%" % (name,f1_weighted * 100.0)) 

MLP Classifier on Test Set F1 Score Micro: 75.80%
MLP Classifier on Test Set F1 Score Macro: 8.31%
MLP Classifier on Test Set F1 Score Weighted: 69.33%
