## Decision Trees with Bagging and Boosting Implementation

We will be using scikit-learn's `sklearn.ensemble` module for Gradient Boosted Decision Trees and Random Forest.  The documentation for gradient boosting is [here](https://scikit-learn.org/stable/modules/ensemble.html#gradientboostingclassifier-and-gradientboostingregressor) and the documentation for random forests is [here](https://scikit-learn.org/stable/modules/ensemble.html#random-forests-and-other-randomized-tree-ensembles).

In [1]:
#import packages
#standard packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd

##support vector machine package
from sklearn import svm 
#evaluation metrics
from sklearn import metrics
from sklearn.metrics import mean_squared_error

#NLP packages
import nltk
from nltk.stem import WordNetLemmatizer #lemmatizer
from nltk.corpus import stopwords
import string #load punctuation characters

# bag of words
from sklearn.feature_extraction.text import CountVectorizer

#Text processing packages
import re
!pip install emoji==1.7
#for emojis
import emoji

#testing and training set splitting function
from sklearn.model_selection import train_test_split

# decision tree package
from sklearn import tree

#gradient boosting
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

#random forest
from sklearn.ensemble import RandomForestClassifier

print("packages imported")

packages imported


In [2]:
## generate some example data
## this is the same data from last week 
## Likely not enough data...so results will probably be somewhat strange
X = [[1,1],[1,2],[1,7],[2,2],[2,4],[2,5],[3,2],[3,4],[3,6],[4,4],[4,6],[4,7],[5,7],[4,1],[5,2],[5,3],[6,2],[6,4],[7,1],[7,3],[7,6],[8,2],[8,5],[8,6]]
Y = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1]

# split the data: 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.3,random_state=109) # 70% training and 30% test

## First fit a gradient boosted tree ensemble
# create the gradient boosting classifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
    max_depth=5)
clf = clf.fit(X_train,y_train) #fit the gradient boosted tree

In [3]:
# check how "good" the tree is by using the testing set

#Predict the response for test dataset
y_predicted = clf.predict(X_test)

print(y_predicted)
print(y_test)

[1 0 1 1 0 1 1 1]
[0, 0, 1, 1, 0, 1, 1, 1]


In [4]:
## Here we have one mis-classified point 
## which is predicted as a 1 but is actually a 0

## Let's print the confusion matrix 

## sklearn puts the true labels on the rows and the predicted labels 
## on the columns, so the entries of the confusion matrix are:
## C[0,0] true negatives 
## C[1,0] false negatives  
## C[1,1] true positives
## C[0,1] false positives

## note that this ordering is slightly different from the 
## confusion matrix on the wikipedia page!

C = metrics.confusion_matrix(y_test,y_predicted)

C

array([[2, 1],
       [0, 5]])
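
The layout described in the comments can be checked by unpacking the matrix. This is only a small sketch: it re-enters the test labels and predictions printed above by hand so that it runs on its own. `ravel()` flattens the 2x2 matrix row by row, which gives the counts in the order TN, FP, FN, TP.

```python
from sklearn import metrics

# labels and predictions copied from the printed output above
y_test = [0, 0, 1, 1, 0, 1, 1, 1]
y_predicted = [1, 0, 1, 1, 0, 1, 1, 1]

# rows are true labels, columns are predicted labels
C = metrics.confusion_matrix(y_test, y_predicted)

# flatten row by row: [C[0,0], C[0,1], C[1,0], C[1,1]]
tn, fp, fn, tp = C.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

The single false positive is the first test point, which is predicted as a 1 but is actually a 0.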

In [5]:
## print out the other metrics
## Accuracy -- what fraction of the time is the classifier correct
print("Model Accuracy:",metrics.accuracy_score(y_test, y_predicted))

## Precision -- true positives divided by (true positives + false positives)
print("Precision:",metrics.precision_score(y_test, y_predicted))

## Recall -- true positives divided by (true positives + false negatives)
print("Recall:",metrics.recall_score(y_test, y_predicted))

Model Accuracy: 0.875
Precision: 0.8333333333333334
Recall: 1.0


In [6]:
mean_squared_error(y_test, clf.predict(X_test))

0.125
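
For 0/1 labels, each squared error is either 0 or 1, so the mean squared error is just the fraction of mis-classified points, i.e. one minus the accuracy. A minimal sketch, again re-entering the printed labels by hand:

```python
import numpy as np

# labels and predictions copied from the printed output above
y_test = np.array([0, 0, 1, 1, 0, 1, 1, 1])
y_predicted = np.array([1, 0, 1, 1, 0, 1, 1, 1])

mse = np.mean((y_test - y_predicted) ** 2)   # mean squared error
error_rate = np.mean(y_test != y_predicted)  # fraction mis-classified
accuracy = np.mean(y_test == y_predicted)    # fraction correct

print(mse, error_rate, 1 - accuracy)
```

So for a classifier the MSE carries no information beyond the accuracy reported above.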

### Random Forest

In [7]:
## generate some example data
## this is the same data from last week 
X = [[1,1],[1,2],[1,7],[2,2],[2,4],[2,5],[3,2],[3,4],[3,6],[4,4],[4,6],[4,7],[5,7],[4,1],[5,2],[5,3],[6,2],[6,4],[7,1],[7,3],[7,6],[8,2],[8,5],[8,6]]
Y = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1]

# split the data: 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.3,random_state=109) # 70% training and 30% test

## Now fit a random forest
# create the random forest classifier
clf2 =  RandomForestClassifier(n_estimators=300)
clf2 = clf2.fit(X_train,y_train) #fit the random forest

In [8]:
# check how "good" the forest is by using the testing set

#Predict the response for test dataset
y_predicted2 = clf2.predict(X_test)

print(y_predicted2)
print(y_test)

## Compare the two printed lists to find any mis-classified points

## Let's print the confusion matrix 

## the entries of the confusion matrix are:
## C[0,0] true negatives 
## C[1,0] false negatives  
## C[1,1] true positives
## C[0,1] false positives

## note that this is slightly different from the 
## confusion matrix on the wikipedia page!

C = metrics.confusion_matrix(y_test,y_predicted2)

C

[1 0 1 1 0 1 1 1]
[0, 0, 1, 1, 0, 1, 1, 1]


array([[2, 1],
       [0, 5]])
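
With only 24 points, a single 70/30 split leaves 8 test points, so one mis-classified point moves the accuracy by 12.5%. One hedged way to get a steadier estimate is cross-validation. The sketch below is an assumption-laden illustration, not part of the original workflow: the choice of 4 folds and `random_state=0` are arbitrary, and the scores will vary with those choices.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# same example data as above
X = [[1,1],[1,2],[1,7],[2,2],[2,4],[2,5],[3,2],[3,4],[3,6],[4,4],[4,6],[4,7],
     [5,7],[4,1],[5,2],[5,3],[6,2],[6,4],[7,1],[7,3],[7,6],[8,2],[8,5],[8,6]]
Y = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1]

# same hyperparameters as above; random_state is added only for repeatability
models = {
    "GBM": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                      max_depth=5, random_state=0),
    "RF": RandomForestClassifier(n_estimators=300, random_state=0),
}

# cv=4 uses stratified 4-fold splitting for classifiers,
# so every point is used for testing exactly once
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X, Y, cv=4)
    print(name, cv_scores[name], "mean:", cv_scores[name].mean())
```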

## Natural Language Processing

Here, we will use the same data set as last week and look at the "angry" emotion from tweets.  Again, there is a lot of cleaning to do first.

In [9]:
newdata_set = pd.read_csv('data3-test.txt',encoding='utf-8',sep="\t")
training_set = pd.read_csv('data3-train.txt',encoding='utf-8',sep="\t")

### following the data cleaning protocol on the kaggle website
#extract hashtags from training data and put them in a new column named hashtags 
training_set["hashtags"]=training_set["Tweet"].apply(lambda x:re.findall(r"#(\w+)",x))

#extract hashtags from new data and put them in a new column named hashtags 
newdata_set["hashtags"]=newdata_set["Tweet"].apply(lambda x:re.findall(r"#(\w+)",x))
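
To see what the hashtag pattern does, here is a tiny sketch on a made-up tweet (the text is invented for illustration). The parentheses in `#(\w+)` form a capture group, so `re.findall` returns the word after each `#` without the `#` itself.

```python
import re

tweet = "Stuck in traffic again #angry #Mondays"  # made-up example tweet

# the capture group (\w+) keeps only the word characters after each '#'
hashtags = re.findall(r"#(\w+)", tweet)
print(hashtags)  # ['angry', 'Mondays']
```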

In [10]:
#translate emojis in training
training_set["clean"]=training_set["Tweet"].apply(lambda x: emoji.demojize(x))

#translate emojis in new data
newdata_set["clean"]=newdata_set["Tweet"].apply(lambda x: emoji.demojize(x))

#remove urls in training
training_set["clean"]=training_set["clean"].apply(lambda x: re.sub(r"http:\S+",'',x))

#remove urls in new data
newdata_set["clean"]=newdata_set["clean"].apply(lambda x: re.sub(r"http:\S+",'',x))

#tokenize training tweet
training_set["clean"]=training_set["clean"].apply(lambda x: nltk.word_tokenize(str(x).lower()))

#tokenize new data tweet
newdata_set["clean"]=newdata_set["clean"].apply(lambda x: nltk.word_tokenize(str(x).lower()))
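
One detail worth checking in the URL-removal step: the pattern `http:\S+` only matches links that begin with `http:`, so `https:` links are left in the text. The sketch below uses a made-up tweet, and the second pattern is a hypothetical alternative rather than what the cells above use.

```python
import re

tweet = "so annoyed http://example.com/a and https://example.com/b"  # made-up tweet

# pattern used above: removes only the link that starts with "http:"
http_only = re.sub(r"http:\S+", "", tweet)
print(repr(http_only))

# hypothetical alternative: the optional "s" also matches "https:" links
http_and_https = re.sub(r"https?:\S+", "", tweet)
print(repr(http_and_https))
```

Switching patterns would change the vocabulary and therefore the numbers reported later, so treat this as something to experiment with.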

In [11]:
#remove stopwords and punctuation
stopwrds = set(stopwords.words('english'))

#training data
training_set["clean"]=training_set["clean"].apply(lambda x: [y for y in x if (y not in stopwrds)]) ##stop words
training_set["clean"]=training_set["clean"].apply(lambda x: [re.sub(r'['+string.punctuation+']','',y) for y in x]) ## punctuation
training_set["clean"]=training_set["clean"].apply(lambda x: [re.sub('\\n','',y) for y in x]) ##whitespace

#new data
newdata_set["clean"]=newdata_set["clean"].apply(lambda x: [y for y in x if (y not in stopwrds)])
newdata_set["clean"]=newdata_set["clean"].apply(lambda x: [re.sub(r'['+string.punctuation+']','',y) for y in x])
newdata_set["clean"]=newdata_set["clean"].apply(lambda x: [re.sub('\\n','',y) for y in x])

#drop empty tokens and very short tokens (two characters or fewer)
#training data
training_set["clean"]=training_set["clean"].apply(lambda x: [y for y in x if y.strip() != '' and len(y.strip())>2])

#new data
newdata_set["clean"]=newdata_set["clean"].apply(lambda x: [y for y in x if y.strip() != '' and len(y.strip())>2])

#keep only the cleaned tweets, the hashtags, and the emotion labels

#training data
training_set=training_set[["clean","hashtags","anger","anticipation","disgust","fear","joy","love","optimism","pessimism","sadness","surprise","trust"]]
#new data
newdata_set=newdata_set[["clean","hashtags","anger","anticipation","disgust","fear","joy","love","optimism","pessimism","sadness","surprise","trust"]]
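
A small caveat on the punctuation step: `string.punctuation` is pasted into the character class without escaping, so the `\]` inside it is read as an escaped `]` and a literal backslash is never removed. The sketch below shows this on a made-up token and compares it with `str.translate`, which is a hypothetical alternative and not the method used above.

```python
import re
import string

token = "grr\\!!"  # made-up token: the characters g r r \ ! !

# pattern from the cell above: strips the "!" but leaves the backslash
original = re.sub(r'[' + string.punctuation + ']', '', token)

# hypothetical alternative: delete every punctuation character via a translation table
table = str.maketrans('', '', string.punctuation)
translated = token.translate(table)

print(repr(original), repr(translated))
```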

In [12]:
#convert tokenized tweets back to sentences
training_set["clean"]=training_set["clean"].apply(lambda x: ' '.join(x).replace('\\n',''))

newdata_set["clean"]=newdata_set["clean"].apply(lambda x: ' '.join(x).replace('\\n',''))


## make a function that will lemmatize
## see https://gist.github.com/MaxHalford/68b584e9154098151e6d9b5aa7464948
## so that we can use lemmatization as the method to generate
## the words in the bag of words using CountVectorizer
def lemmatize(text):
    text = ''.join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]

vectorizer = CountVectorizer(tokenizer=lemmatize,binary = True, ngram_range=(1,2))
text_vectors = vectorizer.fit_transform(training_set["clean"])
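
To make the `binary=True` and `ngram_range=(1,2)` settings concrete, here is a minimal sketch on two made-up "cleaned tweets". It assumes the default tokenizer instead of the `lemmatize` function above, so that it runs without any NLTK data.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["angry angry tweet", "happy tweet"]  # made-up cleaned tweets

# ngram_range=(1, 2): single words and pairs of adjacent words become columns
# binary=True: record presence (1) or absence (0) instead of counts
demo_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
demo_vectors = demo_vectorizer.fit_transform(docs)

# columns are ordered alphabetically by term
print(sorted(demo_vectorizer.vocabulary_))
print(demo_vectors.toarray())
```

Note that "angry" appears twice in the first document but its entry is still 1, because of `binary=True`.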



In [13]:
## Fit a gradient boosted model and a random forest to classify the tweets as angry or not

#Generate the vectors for the classifier for anger
X_train_anger = text_vectors
Y_train_anger = training_set["anger"]
#split the training set
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X_train_anger,Y_train_anger, test_size=0.3,random_state=37) # 70% training and 30% test
#fit the GBM
Tree_classifier_GBM_anger = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5,
    max_depth=4, random_state=0)
Tree_classifier_GBM_anger.fit(X_train_a, y_train_a)


# fit random forest
Random_forest_classifier_anger = RandomForestClassifier(n_estimators=100)
Random_forest_classifier_anger.fit(X_train_a, y_train_a)


In [14]:
## now that it is fitted, let's see how "good" it is at
## predicting anger from tweets
y_predicted_a_gbm = Tree_classifier_GBM_anger.predict(X_test_a)
y_predicted_a_rf = Random_forest_classifier_anger.predict(X_test_a)

print(y_predicted_a_gbm[10]) ## can manually check some values
print(y_predicted_a_rf[10])
print(y_test_a[10])

1
1
0
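
One thing to watch in the manual check above: the predictions are NumPy arrays, so `[10]` means the 11th test tweet, but `y_test_a` is a pandas Series that keeps the original row labels after `train_test_split`. That means `y_test_a[10]` looks up the row labelled 10, which is generally a different tweet (and raises a `KeyError` if that row ended up in the training split). A small sketch with a made-up Series shows the difference; `.iloc` is the positional lookup that lines up with the prediction arrays.

```python
import pandas as pd

# made-up labels with a shuffled index, like a Series after train_test_split
y_demo = pd.Series([1, 0, 1, 0], index=[7, 2, 0, 5])

by_label = y_demo[2]          # the row whose index label is 2
by_position = y_demo.iloc[2]  # the third element, regardless of its label

print(by_label, by_position)
```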


In [15]:
## the entries of the confusion matrix are:
## C[0,0] true negatives 
## C[1,0] false negatives  
## C[1,1] true positives
## C[0,1] false positives

C_gbm = metrics.confusion_matrix(y_test_a,y_predicted_a_gbm)

print(C_gbm)

C_rf = metrics.confusion_matrix(y_test_a,y_predicted_a_rf)

print(C_rf)

[[1163  136]
 [ 300  453]]
[[1222   77]
 [ 352  401]]


In [16]:
#evaluation metrics - GBM
print("Model Accuracy GBM:",metrics.accuracy_score(y_test_a, y_predicted_a_gbm))

print("Precision GBM:",metrics.precision_score(y_test_a, y_predicted_a_gbm))

print("Recall GBM:",metrics.recall_score(y_test_a, y_predicted_a_gbm))

#evaluation metrics - rf
print("Model Accuracy RF:",metrics.accuracy_score(y_test_a, y_predicted_a_rf))

print("Precision RF:",metrics.precision_score(y_test_a, y_predicted_a_rf))

print("Recall RF:",metrics.recall_score(y_test_a, y_predicted_a_rf))

Model Accuracy GBM: 0.7875243664717348
Precision GBM: 0.769100169779287
Recall GBM: 0.601593625498008
Model Accuracy RF: 0.7909356725146199
Precision RF: 0.8389121338912134
Recall RF: 0.5325365205843293
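
These numbers follow directly from the confusion matrices printed earlier. As a sketch, the GBM metrics can be recomputed by hand from `C_gbm` (entered here as a literal so the block runs on its own); the F1 score is an extra derived quantity that the cell above does not print.

```python
import numpy as np

# GBM confusion matrix copied from the printed output above
C_gbm = np.array([[1163, 136],
                  [300, 453]])
tn, fp, fn, tp = C_gbm.ravel()

accuracy = (tp + tn) / C_gbm.sum()   # correct predictions / all predictions
precision = tp / (tp + fp)           # of the tweets predicted angry, how many are
recall = tp / (tp + fn)              # of the angry tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

Replacing the matrix with `C_rf` reproduces the random forest numbers the same way: it has fewer false positives (higher precision) but more false negatives (lower recall).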


In [17]:
## seems to not do too bad... but is worse than SVM 

## let's classify unknown tweets now with the trained 
## gradient boosted model and random forest

text_vectors_1 = vectorizer.transform(newdata_set["clean"])
## why .transform here and .fit_transform above? 
## fit_transform() is used on the training data so that the vectorizer 
## learns its vocabulary (the words and bigrams that become columns) 
## from that data. transform() reuses that vocabulary, so the new tweets 
## are mapped to the same columns the classifiers were trained on.
y_test_a_predicted_GBM = Tree_classifier_GBM_anger.predict(text_vectors_1)
y_test_a_predicted_rf = Random_forest_classifier_anger.predict(text_vectors_1)

print(np.count_nonzero(y_test_a_predicted_GBM == 0))
print(np.count_nonzero(y_test_a_predicted_GBM == 1))


print(np.count_nonzero(y_test_a_predicted_rf == 0))
print(np.count_nonzero(y_test_a_predicted_rf == 1))

2475
784
2714
545


For comparison, the SVM model predicted:

2494

765

The basic Decision Tree predicted:

2345

914
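
On the `.transform` versus `.fit_transform` question from the cell above, a minimal sketch with made-up documents may help. It assumes the default tokenizer rather than the `lemmatize` function, to keep it free of NLTK data.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["angry tweet", "happy tweet"]  # made-up training tweets
new_docs = ["angry furious tweet"]           # "furious" never appears in training

demo_vectorizer = CountVectorizer(binary=True)
train_vectors = demo_vectorizer.fit_transform(train_docs)  # learns the vocabulary
new_vectors = demo_vectorizer.transform(new_docs)          # reuses that vocabulary

print(sorted(demo_vectorizer.vocabulary_))
print(train_vectors.shape, new_vectors.shape)
print(new_vectors.toarray())
```

Both matrices have the same three columns, and the unseen word "furious" is simply dropped. Calling `fit_transform` on the new data instead would build a different vocabulary, and the trained classifiers could not use the result.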

#### Voting Method
One option is to do "voting", which is where we can combine multiple ML algorithms together to (hopefully) increase our success rate.

There is a decent kaggle article [here.](https://www.kaggle.com/code/faressayah/ensemble-ml-algorithms-bagging-boosting-voting) This article also talks about Gradient Boosted Machines and Random Forest. 

The documentation on the Voting Classifier is [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html).

Note that this could take a long time to run, depending on the size of your data.  We are essentially training as many ML algorithms as we want and then adding a "voting" scheme on top: with the default hard voting, each model predicts a class for a sample and the class with the most votes wins. Usually this is not run on very large data sets (Random Forest or Gradient Boosting Machines alone are used instead), but it can be interesting to experiment with this. 
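
The idea behind majority ("hard") voting can be sketched by hand with NumPy. The predictions below are made up for three hypothetical models; this illustrates the idea only and is not how `VotingClassifier` is implemented internally.

```python
import numpy as np

# made-up 0/1 predictions from three hypothetical models on five samples
predictions = np.array([
    [1, 0, 1, 0, 1],   # model A
    [1, 0, 0, 0, 1],   # model B
    [0, 0, 1, 1, 1],   # model C
])

# count the votes for class 1 on each sample
votes_for_1 = predictions.sum(axis=0)

# class 1 wins when it gets more than half of the votes
majority = (votes_for_1 * 2 > predictions.shape[0]).astype(int)
print(majority)  # [1 0 1 0 1]
```

With an odd number of models and two classes there are never ties, which is one reason to combine three estimators as in the next cell.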

In [18]:
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

estimators = []
#GBM
GBM_tree = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5,
    max_depth=4, random_state=0)
estimators.append(('GBM', GBM_tree))
#RF
RF_tree = RandomForestClassifier(n_estimators=100)
estimators.append(('RF', RF_tree))
#SVM
svm_clf = SVC(kernel='linear')
estimators.append(('SVM', svm_clf))

voting = VotingClassifier(estimators=estimators)
voting.fit(X_train_a, y_train_a)

In [19]:
## using their evaluation function
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

### this is adapted from the kaggle example
def evaluate(model, X_train, X_test, y_train, y_test):
    y_test_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)

    print("TRAINING RESULTS: \n===============================")
    clf_report = pd.DataFrame(classification_report(y_train, y_train_pred, output_dict=True))
    print(f"CONFUSION MATRIX:\n{confusion_matrix(y_train, y_train_pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(y_train, y_train_pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n{clf_report}")

    print("TESTING RESULTS: \n===============================")
    clf_report = pd.DataFrame(classification_report(y_test, y_test_pred, output_dict=True))
    print(f"CONFUSION MATRIX:\n{confusion_matrix(y_test, y_test_pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(y_test, y_test_pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n{clf_report}")

In [20]:
evaluate(voting, X_train_a, X_test_a, y_train_a, y_test_a)

TRAINING RESULTS: 
CONFUSION MATRIX:
[[2992    3]
 [   2 1789]]
ACCURACY SCORE:
0.9990
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.999332     0.998326  0.998955     0.998829      0.998955
recall        0.998998     0.998883  0.998955     0.998941      0.998955
f1-score      0.999165     0.998605  0.998955     0.998885      0.998955
support    2995.000000  1791.000000  0.998955  4786.000000   4786.000000
TESTING RESULTS: 
CONFUSION MATRIX:
[[1188  111]
 [ 292  461]]
ACCURACY SCORE:
0.8036
CLASSIFICATION REPORT:
                     0           1  accuracy    macro avg  weighted avg
precision     0.802703    0.805944  0.803606     0.804323      0.803892
recall        0.914550    0.612218  0.803606     0.763384      0.803606
f1-score      0.854984    0.695849  0.803606     0.775416      0.796588
support    1299.000000  753.000000  0.803606  2052.000000   2052.000000


In [21]:
y_test_a_predicted_voting = voting.predict(text_vectors_1)

print(np.count_nonzero(y_test_a_predicted_voting == 0))
print(np.count_nonzero(y_test_a_predicted_voting == 1))

2598
661
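
To compare the models on the unlabelled tweets, it can help to turn the counts into the fraction predicted as angry. The sketch below copies the counts printed in this notebook; it assumes that the SVM and basic Decision Tree numbers quoted earlier are listed in the same order (count of 0 first, then count of 1).

```python
# (predicted not angry, predicted angry) copied from the outputs above
counts = {
    "GBM": (2475, 784),
    "Random Forest": (2714, 545),
    "Voting": (2598, 661),
    "SVM": (2494, 765),            # assumed order: 0 first, then 1
    "Decision Tree": (2345, 914),  # assumed order: 0 first, then 1
}

angry_fraction = {}
for name, (not_angry, angry) in counts.items():
    angry_fraction[name] = angry / (not_angry + angry)
    print(f"{name:>13}: {angry_fraction[name]:.3f}")
```

Since the new tweets have no labels here, this only shows how often each model says "angry", not which model is right.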
