***In this file, we focus only on models with 1 or 2 features, evaluate them with 6 main methods, and then compare the results!***

First, a few notes about this part.

1) In `XGBoost` and `AdaBoost`, we fix the `learning_rate` at 0.1 (the AdaBoost run in Case 2 is the exception: its code uses 0.01).

2) In `XGBoost`, `AdaBoost` and `Random Forest`, we set `n_estimators` = 300.

3) In `XGBoost` and `Random Forest`, we set `max_depth` = 3.

4) The effect of tuning `learning_rate`, `n_estimators` and `max_depth` will be studied in the next session.

5) The last 3 methods are `Naive Bayes, Logistic Regression & KNN`.

Now, we need to import the following libraries:

In [1]:
from numpy import loadtxt 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
import nltk
from nltk.corpus import stopwords
import string
import pandas as pd

**Loading dataset**

In [2]:
# load data
df = pd.read_csv(r"train.csv", usecols = ["text", "target"])
df.head(5)

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1


***Checking null-values***

In [3]:
print("col_names : \t" + str(list(df.columns)))
print('\n')
print("Data-dimensions: \t" + str(df.shape))
print('\n')
print("Count the not-null values of each features: \n" + str(df.notnull().sum()))

col_names : 	['text', 'target']


Data-dimensions: 	(7613, 2)


Count the not-null values of each features: 
text      7613
target    7613
dtype: int64


***Checking & removing duplicates***

In [4]:
df.drop_duplicates(inplace = True)
print("The new dimension after checking duplicate & removing is:\t" + str(df.shape))

The new dimension after checking duplicate & removing is:	(7521, 2)
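
One detail worth keeping in mind: with no arguments, `drop_duplicates` only removes rows that match an earlier row in every column, so a tweet repeated with a different `target` is kept. The sketch below shows this on a few made-up rows (the toy data is hypothetical, not taken from `train.csv`).

```python
import pandas as pd

# Hypothetical toy rows, not taken from train.csv
toy = pd.DataFrame({
    "text":   ["fire downtown", "fire downtown", "fire downtown", "nice day"],
    "target": [1, 1, 0, 0],
})

# Rows that repeat an earlier row in every column
n_exact = int(toy.duplicated().sum())
# Rows that repeat an earlier text, whatever their label
n_same_text = int(toy.duplicated(subset=["text"]).sum())

deduped = toy.drop_duplicates()
print(n_exact, n_same_text, deduped.shape)
```

Here only one row is dropped, and the same text still appears with both labels afterwards.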


***Adding two variables, `Text_length` & `Numb_words`, to get the new model***

In [5]:
df['Text_length'] = df['text'].str.len()
df['Numb_words'] = df['text'].str.split().str.len()
df.head()

Unnamed: 0,text,target,Text_length,Numb_words
0,Our Deeds are the Reason of this #earthquake M...,1,69,13
1,Forest fire near La Ronge Sask. Canada,1,38,7
2,All residents asked to 'shelter in place' are ...,1,133,22
3,"13,000 people receive #wildfires evacuation or...",1,65,8
4,Just got sent this photo from Ruby #Alaska as ...,1,88,16
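
As a quick sanity check, the two new features can be recomputed by hand for the second tweet in the table (the only one printed in full). This is just a plain-Python sketch of what the two pandas expressions do per row.

```python
# Second tweet from the table above (row 1)
text = "Forest fire near La Ronge Sask. Canada"

text_length = len(text)          # number of characters, spaces included
numb_words = len(text.split())   # whitespace-separated tokens

print(text_length, numb_words)   # 38 7
```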


***Define 2 classes: `TextSelector` for the text column and `NumberSelector` for the 2 new numeric features***

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.field]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[[self.field]]
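
The only difference between the two selectors is single versus double brackets. A minimal sketch on a made-up two-row frame (hypothetical data) shows why it matters: the text branch hands a 1-D Series of strings to the vectorizer, while the numeric branch hands a 2-D one-column frame to the scaler.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as in this notebook
toy = pd.DataFrame({"text": ["forest fire", "sunny day today"],
                    "Numb_words": [2, 3]})

as_series = toy["text"]          # same indexing as TextSelector('text').transform
as_frame = toy[["Numb_words"]]   # same indexing as NumberSelector('Numb_words').transform

print(as_series.shape)  # (2,)
print(as_frame.shape)   # (2, 1)
```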

***Define a `Tokenizer` function***

In [7]:
import re
# from spellchecker import SpellChecker    # only needed if step 4 (spell correction) is re-enabled

def Tokenizer(str_input):
    ## 1. Remove URLs
    remove_url = re.compile(r'https?://\S+|www\.\S+').sub(r'', str_input)
    
    ## 2. Remove HTML tags
    remove_html = re.compile(r'<.*?>').sub(r'', remove_url)
    
    ## 3. Remove Emojis
    remove_emo = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE).sub(r'', remove_html)
    ## Keep only letters, digits and hyphens, then lowercase and split on whitespace
    words = re.sub(r"[^A-Za-z0-9\-]", " ", remove_emo).lower().split()    
        
    ## 4. spell_correction
    # spell = SpellChecker()
    # words = [spell.correction(word) for word in words]

    return words
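
To see what the cleaning steps do, here is a simplified stand-in for `Tokenizer` (the name `simple_tokenizer` and the sample tweet are my own, and the emoji step is left out) applied to one made-up string.

```python
import re

def simple_tokenizer(text):
    """Simplified stand-in for Tokenizer: URL, HTML-tag and symbol removal only."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # 1. drop URLs
    text = re.sub(r'<.*?>', '', text)                   # 2. drop HTML tags
    # keep letters, digits and hyphens; everything else becomes a space
    return re.sub(r'[^A-Za-z0-9\-]', ' ', text).lower().split()

# Hypothetical sample tweet
sample = "Check <b>this</b> out: https://t.co/abc #Wild-Fire!!"
tokens = simple_tokenizer(sample)
print(tokens)  # ['check', 'this', 'out', 'wild-fire']
```

Note that the hashtag sign is dropped but the hyphen is kept, so `wild-fire` stays a single token.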

***Assign X, y to train_test_split & fit the corresponding model***

`We only consider 2 cases: ('text' & 'Numb_words') and ('text' & 'Text_length')`


***`1) Case1. Data contains 'text' & 'numb_words'`***

In [8]:
X = df[['text', 'Numb_words']] 
y = df['target']
test_size = 0.3

**Split dataset into 2 parts: train & test**

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = test_size, 
                                                    stratify = y, 
                                                    random_state = 42)
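
The resulting split sizes can be worked out by hand. As far as I understand scikit-learn's behaviour, a float `test_size` is converted to a row count by rounding up, and the rest goes to training; the helper `split_sizes` below is my own sketch under that assumption.

```python
import math

n_rows = 7521  # rows left after drop_duplicates

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(n_rows, 0.3))  # (5264, 2257)
```

The 2257 test rows agree with the `support` totals in the classification reports of Case 1.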

**Import some libraries**

In [10]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import AdaBoostClassifier
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

**1.1. Using AdaBoost_classifier**

In [11]:
classifier1 = Pipeline([
    (
        'features', FeatureUnion([
        ('text', Pipeline([
            ('colext', TextSelector('text')),
            ('tfidf', TfidfVectorizer(tokenizer = Tokenizer, stop_words = 'english',
                     min_df = .0025, max_df = 0.25, ngram_range = (1, 3) ) ),
            ('svd', TruncatedSVD(algorithm = 'randomized', n_components = 300) ), #for AdaBoost
        ])),
        ('words', Pipeline([
            ('wordext', NumberSelector('Numb_words')),
            ('wscaler', StandardScaler()),
        ])),            
    ])
    ),
    ('clf', AdaBoostClassifier(n_estimators = 300, learning_rate = 0.1)),
    ])
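
To make the shape of the combined feature matrix concrete, here is a scaled-down sketch of the same `FeatureUnion` on four made-up tweets. It is not the pipeline above: `FunctionTransformer` stands in for the two selector classes, the default tokenizer is used, and `n_components` is 2 instead of 300, all of which are my own simplifications.

```python
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy tweets
toy = pd.DataFrame({"text": ["forest fire near town", "lovely sunny day",
                             "fire crews evacuate town", "sunny day at the beach"]})
toy["Numb_words"] = toy["text"].str.split().str.len()

features = FeatureUnion([
    ("text", Pipeline([
        ("colext", FunctionTransformer(lambda X: X["text"])),          # 1-D text column
        ("tfidf", TfidfVectorizer()),
        ("svd", TruncatedSVD(n_components=2, random_state=0)),         # 2 dense text components
    ])),
    ("words", Pipeline([
        ("wordext", FunctionTransformer(lambda X: X[["Numb_words"]])), # 2-D numeric column
        ("wscaler", StandardScaler()),
    ])),
])

combined = features.fit_transform(toy)
print(combined.shape)  # (4, 3): 2 SVD components + 1 scaled count
```

By the same logic, the classifier in the real pipeline should receive 300 + 1 = 301 columns per tweet.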

**Fit the model**

In [12]:
import time
start = time.time()

classifier1.fit(X_train, y_train)
preds = classifier1.predict(X_test)

print ('Fit&trainning time : ', time.time() - start)

Fit&trainning time :  63.551987171173096


**Predict & accuracy**

In [13]:
from sklearn.metrics import accuracy_score, precision_score, classification_report, confusion_matrix

train_acc_Ada = accuracy_score(y_train, classifier1.predict(X_train)) * 100.0 
test_acc_Ada = accuracy_score(y_test, preds) * 100

print("Training Accuracy: %.2f%%" % train_acc_Ada)
print("Testing Accuracy: %.2f%%" % test_acc_Ada)
print("Precision:", precision_score(y_test, preds))
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))

Training Accuracy: 77.53%
Testing Accuracy: 74.08%
Precision: 0.7759882869692533
              precision    recall  f1-score   support

           0       0.73      0.88      0.80      1295
           1       0.78      0.55      0.64       962

    accuracy                           0.74      2257
   macro avg       0.75      0.72      0.72      2257
weighted avg       0.75      0.74      0.73      2257

[[1142  153]
 [ 432  530]]
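
The headline numbers can be recovered directly from the confusion matrix, which makes it easier to see why accuracy looks fine while recall on class 1 is weak. A small plain-Python sketch, reading rows as true classes and columns as predicted classes:

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
tn, fp = 1142, 153
fn, tp = 432, 530

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)    # of the tweets predicted as 1, how many really are 1
recall = tp / (tp + fn)       # of the real class-1 tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

This reproduces the 74.08% accuracy and the class-1 row of the report (0.78, 0.55, 0.64): 432 of the 962 real class-1 tweets are missed.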


**1.2. Using XGBoost_classifier**

In [14]:
from xgboost import XGBClassifier 
classifier1 = Pipeline([
    (
        'features', FeatureUnion([
        ('text', Pipeline([
            ('colext', TextSelector('text')),
            ('tfidf', TfidfVectorizer(tokenizer = Tokenizer, stop_words = 'english',
                     min_df = .0025, max_df = 0.25, ngram_range = (1, 3) ) ),
            ('svd', TruncatedSVD(algorithm ='randomized', n_components = 300) ), #for XGB
        ])),
        ('words', Pipeline([
            ('wordext', NumberSelector('Numb_words')),
            ('wscaler', StandardScaler()),
        ])),            
    ])
    ),
    # XGBClassifier takes no `base_estimator`; it always boosts its own (tree) learners
    ('clf', XGBClassifier(max_depth = 3, n_estimators = 300, learning_rate = 0.1))
    ])

## Fit the model
start = time.time()
classifier1.fit(X_train, y_train)
preds = classifier1.predict(X_test)
print ('Fit&trainning time : ', time.time() - start)

train_acc_Xgb = accuracy_score(y_train, classifier1.predict(X_train)) * 100.0 
test_acc_Xgb = accuracy_score(y_test, preds) * 100.0

print("Training_Accuracy: %.2f%%" % train_acc_Xgb)
print("Testing_Accuracy: %.2f%%" % test_acc_Xgb)
print("Precision:", precision_score(y_test, preds))
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))

Fit&trainning time :  47.76820182800293
Training_Accuracy: 90.16%
Testing_Accuracy: 77.18%
Precision: 0.7776397515527951
              precision    recall  f1-score   support

           0       0.77      0.86      0.81      1295
           1       0.78      0.65      0.71       962

    accuracy                           0.77      2257
   macro avg       0.77      0.76      0.76      2257
weighted avg       0.77      0.77      0.77      2257

[[1116  179]
 [ 336  626]]


**1.3. Using RandomForest_classifier**

In [15]:
from sklearn.ensemble import RandomForestClassifier

classifier1 = Pipeline([
    (
        'features', FeatureUnion([
        ('text', Pipeline([
            ('colext', TextSelector('text')),
            ('tfidf', TfidfVectorizer(tokenizer = Tokenizer, stop_words = 'english',
                     min_df = .0025, max_df = 0.25, ngram_range = (1, 3) ) ),
            ('svd', TruncatedSVD(algorithm ='randomized', n_components = 300) ), 
        ])),
        ('words', Pipeline([
            ('wordext', NumberSelector('Numb_words')),
            ('wscaler', StandardScaler()),
        ])),            
    ])
    ),
    ('clf', RandomForestClassifier(max_depth = 3, n_estimators = 300)),
    ])

start = time.time()
classifier1.fit(X_train, y_train)
preds = classifier1.predict(X_test)
print ('Fit&trainning time : ', time.time() - start)

train_acc_RFC = accuracy_score(y_train, classifier1.predict(X_train)) * 100.0
test_acc_RFC = accuracy_score(y_test, preds) * 100.0

print("Training_Accuracy: %.2f%%" % train_acc_RFC )
print("Testing_Accuracy: %.2f%%" % test_acc_RFC )
print("Precision:", precision_score(y_test, preds))
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))

Fit&trainning time :  7.6379077434539795
Training_Accuracy: 71.60%
Testing_Accuracy: 69.78%
Precision: 0.8398058252427184
              precision    recall  f1-score   support

           0       0.67      0.95      0.78      1295
           1       0.84      0.36      0.50       962

    accuracy                           0.70      2257
   macro avg       0.75      0.65      0.64      2257
weighted avg       0.74      0.70      0.66      2257

[[1229   66]
 [ 616  346]]


**1.4. Using NaiveBayes**

In [16]:
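from sklearn.feature_extraction.text import CountVectorizer
# Note: the vectorizer is fit on the whole dataset before the split, so its vocabulary also sees the test tweets.
# This cell uses only the 'text' column (no 'Numb_words').
text_process = CountVectorizer(analyzer = Tokenizer).fit_transform(df['text'])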


from sklearn.naive_bayes import MultinomialNB
X_train, X_test, y_train, y_test = train_test_split(text_process, df['target'], 
                                                    test_size = test_size, 
                                                    stratify = df['target'], 
                                                    random_state = 42)

classifier1 = MultinomialNB()
classifier1.fit(X_train, y_train)
preds = classifier1.predict(X_test)

train_acc_NVB = accuracy_score(y_train, classifier1.predict(X_train)) * 100.0
test_acc_NVB = accuracy_score(y_test, preds) * 100.0

print("Training_Accuracy: %.2f%%" % train_acc_NVB)
print("Testing_Accuracy: %.2f%%" % test_acc_NVB)
print(classification_report(y_test, preds))
print('Confusion Matrix: \n',confusion_matrix(y_test, preds))

Training_Accuracy: 90.44%
Testing_Accuracy: 79.26%
              precision    recall  f1-score   support

           0       0.80      0.86      0.83      1295
           1       0.78      0.71      0.74       962

    accuracy                           0.79      2257
   macro avg       0.79      0.78      0.78      2257
weighted avg       0.79      0.79      0.79      2257

Confusion Matrix: 
 [[1108  187]
 [ 281  681]]
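
In the cell above, `CountVectorizer` is fit on all tweets before the split. A common alternative, sketched here on made-up data, is to wrap the vectorizer and the classifier in one `Pipeline` so that the vocabulary is learned from the training texts only. The toy tweets and the step names `counts` and `clf` are my own choices, and this is not the code that produced the results above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy tweets and labels, only to make the sketch runnable
train_texts = ["forest fire near town", "flood hits the city",
               "lovely sunny day", "great movie tonight"]
train_labels = [1, 1, 0, 0]
test_texts = ["fire in the city", "sunny movie day"]

nb_pipeline = Pipeline([
    ("counts", CountVectorizer()),   # vocabulary is learned during fit() only
    ("clf", MultinomialNB()),
])
nb_pipeline.fit(train_texts, train_labels)

preds = nb_pipeline.predict(test_texts)
vocab = nb_pipeline.named_steps["counts"].vocabulary_
print(preds, "in" in vocab)
```

Words that occur only in the test texts (here `in`) never enter the vocabulary, so the test set cannot influence the features.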


**1.5. Using Logistic Regression**

In [25]:
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], 
                                                    stratify = df['target'],
                                                    test_size = test_size, 
                                                    random_state = 42)


# Initialize the tfidf_vectorizer 
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.7) 

# Fit and transform the training data to Tfidf_Vec
tfidf_train = tfidf_vectorizer.fit_transform(X_train) 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Fit the model
C = 0.1
classifier1 = LogisticRegression(random_state = 42, C = C).fit(tfidf_train, y_train)

preds = classifier1.predict(tfidf_test)

train_acc_logreg = accuracy_score(y_train, classifier1.predict(tfidf_train)) * 100.0
test_acc_logreg = accuracy_score(y_test, preds) * 100.0

print("Training_Accuracy: %.2f%%" % train_acc_logreg)
print("Testing_Accuracy: %.2f%%" % test_acc_logreg)
print(classification_report(y_test, preds))
print('Confusion Matrix: \n', confusion_matrix(y_test, preds))

Training_Accuracy: 78.38%
Testing_Accuracy: 76.92%
              precision    recall  f1-score   support

           0       0.76      0.88      0.81      1295
           1       0.80      0.62      0.69       962

    accuracy                           0.77      2257
   macro avg       0.78      0.75      0.75      2257
weighted avg       0.77      0.77      0.76      2257

Confusion Matrix: 
 [[1143  152]
 [ 369  593]]


**1.6. Using KNN**

In [18]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 100)
classifier1 = knn.fit(tfidf_train, y_train)

preds = knn.predict(tfidf_test)

train_acc_knn = accuracy_score(y_train, knn.predict(tfidf_train)) * 100.0
test_acc_knn = accuracy_score(y_test, preds) * 100.0

print("Training_Accuracy: %.2f%%" % train_acc_knn)
print("Testing_Accuracy: %.2f%%" % test_acc_knn)
print(classification_report(y_test, preds))
print('Confusion Matrix: \n', confusion_matrix(y_test, preds))

Training_Accuracy: 78.38%
Testing_Accuracy: 76.92%
              precision    recall  f1-score   support

           0       0.76      0.88      0.81      1295
           1       0.80      0.62      0.69       962

    accuracy                           0.77      2257
   macro avg       0.78      0.75      0.75      2257
weighted avg       0.77      0.77      0.76      2257

Confusion Matrix: 
 [[1143  152]
 [ 369  593]]


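**Summary 1.** Results for the model with the 2 features `'text'` and `'Numb_words'`. The Logistic Regression cell in 1.5 appears to have been re-executed later (it is numbered In [25]), so the output printed there does not match the Logistic_Regression row below, which comes from the earlier run.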

In [19]:
train_acc = [train_acc_Ada, train_acc_Xgb, train_acc_RFC, train_acc_NVB, train_acc_logreg, train_acc_knn]
test_acc = [test_acc_Ada, test_acc_Xgb, test_acc_RFC, test_acc_NVB, test_acc_logreg, test_acc_knn]
method = ['AdaBoost', 'XGBoost', 'RandomForest', 'Naive_Bayes', 'Logistic_Regression', 'k-NN']

model1 = pd.DataFrame({'train_acc(%)': train_acc,
                       'test_acc(%)' : test_acc,
                       'used_method': method})
model1

Unnamed: 0,train_acc(%),test_acc(%),used_method
0,77.526596,74.080638,AdaBoost
1,90.159574,77.1821,XGBoost
2,71.599544,69.782898,RandomForest
3,90.444529,79.26451,Naive_Bayes
4,71.143617,70.934869,Logistic_Regression
5,78.381459,76.916261,k-NN
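
A quick way to read this table is to rank the methods by test accuracy and look at the train/test gap as a rough sign of overfitting. The sketch below simply re-types the table values, rounded to two decimals.

```python
# (train_acc, test_acc) in percent, rounded from the table above
results = {
    "AdaBoost": (77.53, 74.08),
    "XGBoost": (90.16, 77.18),
    "RandomForest": (71.60, 69.78),
    "Naive_Bayes": (90.44, 79.26),
    "Logistic_Regression": (71.14, 70.93),
    "k-NN": (78.38, 76.92),
}

# Train minus test accuracy: a large gap hints at overfitting
gaps = {name: round(train - test, 2) for name, (train, test) in results.items()}
best = max(results, key=lambda name: results[name][1])

for name in sorted(results, key=lambda n: results[n][1], reverse=True):
    print(f"{name:20s} test={results[name][1]:.2f}  gap={gaps[name]:.2f}")
```

Naive Bayes has the best test accuracy, while XGBoost and Naive Bayes show the largest gaps (about 13 and 11 points).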


**`2) Case2. Data contains 'text' & 'text_length'`**

`So the first step in this case is to assign X, y again using the new column names. Note that this split uses test_size = 0.20, whereas Case 1 used 0.3.`

In [20]:
X = df[['text', 'Text_length']] 
y = df['target']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.20, 
                                                    stratify = y, 
                                                    random_state = 42)

**2.1. AdaBoost**

In [21]:
classifier2 = Pipeline([
    (
        'features', FeatureUnion([
        ('text', Pipeline([
            ('colext', TextSelector('text')),
            ('tfidf', TfidfVectorizer(tokenizer = Tokenizer, stop_words = 'english',
                     min_df = .0025, max_df = 0.25, ngram_range = (1, 3) ) ),
            ('svd', TruncatedSVD(algorithm ='randomized', n_components = 300) ), 
        ])),
        ('words', Pipeline([
            ('wordext', NumberSelector('Text_length')),
            ('wscaler', StandardScaler()),
        ])),            
    ])
    ),
    ('clf', AdaBoostClassifier(n_estimators = 300, learning_rate = 0.01)),  # learning_rate is 0.01 here (0.1 in Case 1)
    ])

start = time.time()
classifier2.fit(X_train, y_train)
preds = classifier2.predict(X_test)
print('Fit&trainning time : ', time.time() - start)

train_acc_Ada2 = accuracy_score(y_train, classifier2.predict(X_train)) * 100.0
test_acc_Ada2 = accuracy_score(y_test, preds) * 100.0

print("Training_Accuracy: %.2f%%" % train_acc_Ada2)
print("Testing_Accuracy: %.2f%%" % test_acc_Ada2)
print("Precision:", precision_score(y_test, preds))
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))

Fit&trainning time :  69.14977693557739
Training_Accuracy: 69.65%
Testing_Accuracy: 68.50%
Precision: 0.71875
              precision    recall  f1-score   support

           0       0.67      0.87      0.76       863
           1       0.72      0.43      0.54       642

    accuracy                           0.69      1505
   macro avg       0.70      0.65      0.65      1505
weighted avg       0.69      0.69      0.67      1505

[[755 108]
 [366 276]]


**2.2. XGBoost**

In [22]:
from xgboost import XGBClassifier 
classifier2 = Pipeline([
    (
        'features', FeatureUnion([
        ('text', Pipeline([
            ('colext', TextSelector('text')),
            ('tfidf', TfidfVectorizer(tokenizer = Tokenizer, stop_words = 'english',
                     min_df = .0025, max_df = 0.25, ngram_range = (1, 3) ) ),
            ('svd', TruncatedSVD(algorithm ='randomized', n_components = 300) ), #for XGB
        ])),
        ('words', Pipeline([
            ('wordext', NumberSelector('Text_length')),
            ('wscaler', StandardScaler()),
        ])),            
    ])
    ),
    # XGBClassifier takes no `base_estimator`; it always boosts its own (tree) learners
    ('clf', XGBClassifier(max_depth = 3, n_estimators = 300, learning_rate = 0.1))
    ])

## Fit the model
start = time.time()
classifier2.fit(X_train, y_train)
preds = classifier2.predict(X_test)
print ('Fit&trainning time : ', time.time() - start)

train_acc_Xgb2 = accuracy_score(y_train, classifier2.predict(X_train)) * 100.0 
test_acc_Xgb2 = accuracy_score(y_test, preds) * 100.0

print("Training_Accuracy: %.2f%%" % train_acc_Xgb2)
print("Testing_Accuracy: %.2f%%" % test_acc_Xgb2)
print("Precision:", precision_score(y_test, preds))
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))

Fit&trainning time :  52.975799798965454
Training_Accuracy: 88.78%
Testing_Accuracy: 78.41%
Precision: 0.7887067395264117
              precision    recall  f1-score   support

           0       0.78      0.87      0.82       863
           1       0.79      0.67      0.73       642

    accuracy                           0.78      1505
   macro avg       0.79      0.77      0.77      1505
weighted avg       0.78      0.78      0.78      1505

[[747 116]
 [209 433]]


**2.3. Random Forest**

In [23]:
from sklearn.ensemble import RandomForestClassifier

classifier2 = Pipeline([
    (
        'features', FeatureUnion([
        ('text', Pipeline([
            ('colext', TextSelector('text')),
            ('tfidf', TfidfVectorizer(tokenizer = Tokenizer, stop_words = 'english',
                     min_df = .0025, max_df = 0.25, ngram_range = (1, 3) ) ),
            ('svd', TruncatedSVD(algorithm ='randomized', n_components = 300) ), 
        ])),
        ('words', Pipeline([
            ('wordext', NumberSelector('Text_length')),
            ('wscaler', StandardScaler()),
        ])),            
    ])
    ),
    ('clf', RandomForestClassifier(max_depth = 3, n_estimators = 300)),
    ])

start = time.time()
classifier2.fit(X_train, y_train)
preds = classifier2.predict(X_test)
print ('Fit&trainning time : ', time.time() - start)

train_acc_RFC2 = accuracy_score(y_train, classifier2.predict(X_train)) * 100.0
test_acc_RFC2 = accuracy_score(y_test, preds) * 100.0

print("Training_Accuracy: %.2f%%" % train_acc_RFC2 )
print("Testing_Accuracy: %.2f%%" % test_acc_RFC2 )
print("Precision:", precision_score(y_test, preds))
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))

Fit&trainning time :  8.966670274734497
Training_Accuracy: 72.41%
Testing_Accuracy: 71.30%
Precision: 0.828125
              precision    recall  f1-score   support

           0       0.68      0.94      0.79       863
           1       0.83      0.41      0.55       642

    accuracy                           0.71      1505
   macro avg       0.75      0.67      0.67      1505
weighted avg       0.74      0.71      0.69      1505

[[808  55]
 [377 265]]


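**Summary 2.** With the model containing the 2 features `'text'` and `'Text_length'`. Note that for the last 3 methods (`Naive Bayes, Logistic Regression & K-nearest neighbors`) we use only the `'text'` feature and ignore both `'Numb_words'` and `'Text_length'`, so their rows below are reused unchanged from Case 1 (30% test split).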

In [24]:
train_acc = [train_acc_Ada2, train_acc_Xgb2, train_acc_RFC2, train_acc_NVB, train_acc_logreg, train_acc_knn]
test_acc = [test_acc_Ada2, test_acc_Xgb2, test_acc_RFC2, test_acc_NVB, test_acc_logreg, test_acc_knn]
method = ['AdaBoost', 'XGBoost', 'RandomForest', 'Naive_Bayes', 'Logistic_Regression', 'k-NN']

model2 = pd.DataFrame({'train_acc(%)': train_acc,
                       'test_acc(%)' : test_acc,
                       'used_method': method})
model2

Unnamed: 0,train_acc(%),test_acc(%),used_method
0,69.647606,68.504983,AdaBoost
1,88.77992,78.405316,XGBoost
2,72.406915,71.295681,RandomForest
3,90.444529,79.26451,Naive_Bayes
4,71.143617,70.934869,Logistic_Regression
5,78.381459,76.916261,k-NN
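
For the three ensemble methods, the two cases can be compared side by side. The sketch below re-types the test accuracies (rounded) from the two summary tables. Keep in mind that the cases also differ in test split (0.3 vs 0.20) and, for AdaBoost, in `learning_rate` (0.1 vs 0.01), so the differences cannot be attributed to the extra feature alone.

```python
# Test accuracy (%) rounded from the two summary tables
case1 = {"AdaBoost": 74.08, "XGBoost": 77.18, "RandomForest": 69.78}   # 'text' + 'Numb_words'
case2 = {"AdaBoost": 68.50, "XGBoost": 78.41, "RandomForest": 71.30}   # 'text' + 'Text_length'

# Positive value: Case 2 scored higher than Case 1
delta = {method: round(case2[method] - case1[method], 2) for method in case1}
print(delta)
```

XGBoost and Random Forest gain roughly 1 to 1.5 points in Case 2, while AdaBoost drops by about 5.6 points.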
