## Master of Applied Data Science
### University of Michigan - School of Information
### Capstone Project - Rapid Labeling of Text Corpus Using Information Retrieval Techniques
### Fall 2021
#### Team Members: Carlo Tak, Michael Penrose

### Experiment Flow

Class label > Count vectorizer > 800 features > scikit-learn

### Purpose

This notebook investigates how well a classifier can predict the **event type (i.e. 'earthquake', 'fire', 'flood', 'hurricane')** of the tweets in the [Disaster tweets dataset](https://crisisnlp.qcri.org/humaid_dataset.html#).

This classifier is to be used as a baseline of classification performance. Two things are investigated:
- Is it possible to build a reasonably 'good' classifier of these tweets at all?
- If it is possible to build a classifier, how well does it perform using all of the labels from the training data?

If it is possible to build a classifier using all of the labels in the training dataset then it should be possible to implement a method for rapidly labeling the corpus of texts in the dataset. Here we think of rapid labeling as any process that does not require the user to label each text in the corpus, one at a time.

To measure the performance of the classifier we use the Area Under the Curve (AUC) metric, which we believe is a good fit for the preliminary work in this project. If a specific goal emerges later that requires a different metric, the appropriate metric can be used at that time. The consequences of false positives (texts assigned a label that is not their true label) and false negatives (texts that have a label but are not assigned it) should be considered. For example, a metric like precision can be used when minimizing false positives matters most. The AUC metric provides a value between zero and one, with a higher number indicating better classification performance.
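As a rough illustration of the metric, the sketch below computes a one-vs-rest AUC per event type with scikit-learn's `roc_auc_score`. The labels and scores are made up for the example (they are not results from this notebook), and `per_class_auc` is a hypothetical helper name. Note that the code cells further down report accuracy rather than AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and classifier scores for six tweets (illustration only).
classes = ["earthquake", "fire", "flood", "hurricane"]
y_true = np.array(["earthquake", "fire", "flood", "hurricane", "fire", "flood"])

# One row per tweet, one column per class, in the same order as `classes`.
y_score = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.20, 0.10, 0.60, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.30, 0.40, 0.20, 0.10],
    [0.10, 0.50, 0.30, 0.10],
])


def per_class_auc(y_true, y_score, classes):
    """One-vs-rest ROC AUC: each class in turn is treated as the positive class."""
    return {
        label: roc_auc_score(y_true == label, y_score[:, i])
        for i, label in enumerate(classes)
    }


auc_by_class = per_class_auc(y_true, y_score, classes)
for label, auc in auc_by_class.items():
    print(f"{label:<12} AUC = {auc:.3f}")
```

In this toy data one flood tweet receives a higher 'fire' score than one of the real fire tweets, so 'fire' is the only class with an AUC below one.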


### Summary

The baseline classifier, built using all the labels in the training dataset, achieved a good AUC score for each of the 4 event type labels (i.e. earthquake, fire, flood, hurricane). All the AUC scores were above 0.98.

A simple approach to vectorizing the texts was implemented because we wanted the baseline classifier to be a basic solution; our feeling was that more complex techniques could be implemented at a later stage. A [count vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) (with default settings) was used to convert the texts. The number of dimensions (features) was then reduced using feature selection ([SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)). This was to improve computation times, since fewer dimensions mean less data to process. It was also simpler to implement than alternatives such as removing stopwords or tuning vectorizer parameters like ‘stop_words’, ‘ngram_range’, ‘max_df’, ‘min_df’, and ‘max_features’. The complexity of the classifier could be adjusted if required, but this simple implementation produced good results.

This notebook reduced the number of features to 100.

The feature importances were extracted from the classifier to see if they made sense. This sense check was important because we made several assumptions in building this classifier that had to be validated. For example, when the text was vectorized we used a simple approach that just counted the individual words (tokens). A more complex classifier might use bi-grams (two words per feature), which would have had the advantage of preserving features like ‘’.

Examining the top features
 



In [1]:
from utilities import dt_utilities as utils
from datetime import datetime
import numpy as np
import pandas as pd
# Acceleration for scikit-learn on Windows 64 bit machines
# from sklearnex import patch_sklearn
# patch_sklearn()
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeClassifier, SGDClassifier, Perceptron, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC, NuSVC
from sklearn.linear_model import LogisticRegression
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
from sklearn.utils.validation import check_is_fitted
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import make_pipeline
from scipy.sparse import coo_matrix, hstack
import scipy.sparse
from collections import Counter
import altair as alt
from tqdm import tqdm

In [2]:
# enable correct rendering
alt.renderers.enable('default')

RendererRegistry.enable('default')

In [3]:
start_time = datetime.now()
start_time.strftime("%Y/%m/%d %H:%M:%S")
RANDOM_STATE = 257

### Load the Data

In [4]:
consolidated_disaster_tweet_data_df = \
    utils.get_consolidated_disaster_tweet_data(root_directory="data/",
                                               event_type_directory="HumAID_data_event_type",
                                               events_set_directories=["HumAID_data_events_set1_47K",
                                                                       "HumAID_data_events_set2_29K"],
                                               include_meta_data=True)

In [5]:
consolidated_disaster_tweet_data_df.head()

Unnamed: 0,tweet_id,class_label,event_type,data_type,tweet_text
0,798262465234542592,sympathy_and_support,earthquake,dev,RT @MissEarth: New Zealand need our prayers af...
1,771464543796985856,caution_and_advice,earthquake,dev,"@johnaglass65 @gordonluke Ah, woke up to a nig..."
2,797835622471733248,requests_or_urgent_needs,earthquake,dev,RT @terremotocentro: #eqnz if you need a tool ...
3,798021801540321280,other_relevant_information,earthquake,dev,RT @BarristerNZ: My son (4) has drawn a pictur...
4,798727277794033664,infrastructure_and_utility_damage,earthquake,dev,Due to earthquake damage our Defence Force is ...


In [6]:
train_df = consolidated_disaster_tweet_data_df[consolidated_disaster_tweet_data_df["data_type"]=="train"].reset_index(drop=True)
train_df.head()

Unnamed: 0,tweet_id,class_label,event_type,data_type,tweet_text
0,798064896545996801,other_relevant_information,earthquake,train,I feel a little uneasy about the idea of work ...
1,797913886527602688,caution_and_advice,earthquake,train,#eqnz Interislander ferry docking aborted afte...
2,797867944546025472,other_relevant_information,earthquake,train,Much of New Zealand felt the earthquake after ...
3,797958935126773760,sympathy_and_support,earthquake,train,"Noticing a lot of aftershocks on eqnz site, bu..."
4,797813020567056386,infrastructure_and_utility_damage,earthquake,train,"RT @E2NZ: Mike Clements, NZ police, says obvio..."


In [7]:
test_df = consolidated_disaster_tweet_data_df[consolidated_disaster_tweet_data_df["data_type"]=="test"].reset_index(drop=True)
test_df.head()

Unnamed: 0,tweet_id,class_label,event_type,data_type,tweet_text
0,798274825441538048,infrastructure_and_utility_damage,earthquake,test,The earthquake in New Zealand was massive. Bil...
1,798452064208568320,infrastructure_and_utility_damage,earthquake,test,These pictures show the alarming extent of the...
2,797804396767682560,sympathy_and_support,earthquake,test,Just woke to news of another earthquake! WTF N...
3,798434862830993408,not_humanitarian,earthquake,test,"When theres an actual earthquake, landslide an..."
4,797790705414377472,caution_and_advice,earthquake,test,"Tsunami warning for entire East Coast of NZ, b..."


In [8]:
dev_df = consolidated_disaster_tweet_data_df[consolidated_disaster_tweet_data_df["data_type"]=="dev"].reset_index(drop=True)
dev_df.head()

Unnamed: 0,tweet_id,class_label,event_type,data_type,tweet_text
0,798262465234542592,sympathy_and_support,earthquake,dev,RT @MissEarth: New Zealand need our prayers af...
1,771464543796985856,caution_and_advice,earthquake,dev,"@johnaglass65 @gordonluke Ah, woke up to a nig..."
2,797835622471733248,requests_or_urgent_needs,earthquake,dev,RT @terremotocentro: #eqnz if you need a tool ...
3,798021801540321280,other_relevant_information,earthquake,dev,RT @BarristerNZ: My son (4) has drawn a pictur...
4,798727277794033664,infrastructure_and_utility_damage,earthquake,dev,Due to earthquake damage our Defence Force is ...


In [9]:
train_df.groupby(["event_type"]).size().reset_index().rename(columns={0: "Count"}).sort_values("Count", ascending=False)

Unnamed: 0,event_type,Count
3,hurricane,31674
2,flood,7815
1,fire,7792
0,earthquake,6250


In [10]:
train_df.groupby(["class_label"]).size().reset_index().rename(columns={0: "Count"}).sort_values("Count", ascending=False)

Unnamed: 0,class_label,Count
8,rescue_volunteering_or_donation_effort,14891
6,other_relevant_information,8501
9,sympathy_and_support,6250
2,infrastructure_and_utility_damage,5715
3,injured_or_dead_people,5110
5,not_humanitarian,4407
0,caution_and_advice,3774
1,displaced_people_and_evacuations,2800
7,requests_or_urgent_needs,1833
4,missing_or_found_people,250


In [11]:
RND_STATE = 2584
train_df = train_df.sample(frac=1, random_state=RND_STATE).reset_index(drop=True)

### Utilities

In [12]:
def supervised_subset(vectorizer, num_samples, model, train_df, model_type, semi_supervised=False, semi_supervised_iterations=1, warm_start=False):
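    # use this cell to reduce the train set to simulate a rapid labelling semi-supervised situation
    # NOTE: .loc slicing is label-based and inclusive of the end label, so this keeps num_samples + 1 rows
    # (capped at len(train_df)); train_df.iloc[:num_samples] would keep exactly num_samples rows
    training_df = train_df.loc[:num_samples]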
    #print("all records",len(train_df))
    #print("Training Records:", len(training_df))
    num_features = 'all'
    target_column = "event_type" # "class_label" or "event_type"
    X_train = vectorizer.transform(training_df["tweet_text"])
    X_test = vectorizer.transform(test_df["tweet_text"])
    y = training_df[target_column]
    y_frac = training_df[target_column]
    y_frac_index = y_frac.index
    y_test = test_df[target_column]
    model_start_time = datetime.now()
    if warm_start:
        try:
            check_is_fitted(model)
            model.partial_fit(X_train, y)
        except Exception:
            # Model not fitted yet, or it has no partial_fit: fall back to a full fit
            model.fit(X_train, y)
    else:
        model.fit(X_train, y)
    y_train_pred = model.predict(X_train)
    if semi_supervised:
        X_train = vectorizer.transform(train_df["tweet_text"])
        y = train_df[target_column]
        y_train_pred = model.predict(X_train)
        vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
        vectorizer.fit(train_df["tweet_text"])
        X_train = vectorizer.transform(train_df["tweet_text"])
        X_test = vectorizer.transform(test_df["tweet_text"])
        for r in range(semi_supervised_iterations):
            y_train_pred[y_frac_index] = y_frac # where the labels are provided we use them, otherwise we use the predicted label for semi-supervised
            if warm_start:
                model.partial_fit(X_train, y_train_pred)
            else:
                model.fit(X_train, y_train_pred)
            y_train_pred = model.predict(X_train)
    
    y_test_pred = model.predict(X_test)
    model_end_time = datetime.now()
    # Time taken on the dummy is not part of the main model time
    dummy_model = DummyClassifier(strategy="stratified", random_state=RANDOM_STATE)
    dummy_model.fit(X_train, y)
    y_train_pred_dummy = dummy_model.predict(X_train)
    y_test_pred_dummy = dummy_model.predict(X_test)
    run_time = (model_end_time - model_start_time).total_seconds()
    
    results = {}
    results['model_type'] = model_type
    results['vectorizer_num_features'] = vectorizer.max_features
    results['semi_supervised'] = semi_supervised
    results['samples'] = num_samples
    results['dummy_train_accuracy'] = accuracy_score(y, y_train_pred_dummy)
    results['dummy_test_accuracy'] = accuracy_score(y_test, y_test_pred_dummy)
    results['train_accuracy'] = accuracy_score(y, y_train_pred)
    results['test_accuracy'] = accuracy_score(y_test, y_test_pred)
    results['run_time'] = run_time
    
    return results
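The semi-supervised branch of `supervised_subset` follows a pseudo-labeling pattern: fit on the labeled subset, predict labels for the whole corpus, overwrite the predictions with the true labels where they are known, then refit on everything. The sketch below isolates that pattern on a hypothetical toy corpus. It is a simplification and not a replacement for the function above (for instance, it does not rebuild the vectorizer or time the run).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical miniature corpus standing in for the tweet texts.
texts = np.array([
    "earthquake shakes city overnight",
    "strong earthquake aftershocks felt",
    "wildfire spreads near hills",
    "crews battle wildfire smoke",
    "flood waters rise quickly",
    "river flood closes road",
    "hurricane makes landfall tonight",
    "hurricane winds cut power",
])
true_labels = np.array(["earthquake", "earthquake", "fire", "fire",
                        "flood", "flood", "hurricane", "hurricane"])

# Pretend only one text per event type has been labeled by the user so far.
labeled_idx = np.array([0, 2, 4, 6])

# The whole corpus is available before labeling starts, so fit the vectorizer on it.
vectorizer = TfidfVectorizer()
X_all = vectorizer.fit_transform(texts)

# Step 1: fit on the labeled subset only.
model = LinearSVC(random_state=257)
model.fit(X_all[labeled_idx], true_labels[labeled_idx])

# Step 2: predict labels for every text, then restore the known true labels.
pseudo_labels = model.predict(X_all)
pseudo_labels[labeled_idx] = true_labels[labeled_idx]

# Step 3: refit on the full corpus using true labels where known, predictions elsewhere.
model.fit(X_all, pseudo_labels)
final_predictions = model.predict(X_all)
```

Steps 2 and 3 can be repeated, which is what `semi_supervised_iterations` controls in the function above.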


In [13]:
alt.themes.enable('fivethirtyeight')
def chart_results_curve(df, model_type):
    sel_multi = alt.selection_multi(fields=['semi_supervised'])

    color = alt.condition(sel_multi,
                      alt.Color('semi_supervised:N'),
                      alt.value('lightgray'))

    #title = "Baseline Accuracy on Test Set by Number of Samples: " + str(model_type)
    title = str(model_type)
    chrt_super = alt.Chart(df, title=title).mark_line().encode(
        x=alt.X('samples:Q', axis=alt.Axis(grid=False, titleFontSize=14, title='Number of Training Labels Used')),
        y=alt.Y('test_accuracy:Q', axis=alt.Axis(grid=False, titleFontSize=14, title='Accuracy on Test Set'), scale=alt.Scale(domain=[0.65, 1.])),
        color=color,
        tooltip=[alt.Tooltip("samples", format=",.0f"), "semi_supervised", alt.Tooltip("test_accuracy", format=",.4f"), 
                 'vectorizer_num_features', alt.Tooltip("run_time", format=",.4f")]
    ).properties(
        width=240,
        height=320
    ).add_selection(
    sel_multi
    ) 
    
    #     legend = alt.Chart(df).mark_point().encode(
    #         y=alt.Y('vectorizer_num_features:N', axis=alt.Axis(orient='right')),
    #         color=color
    #     ).add_selection(
    #         sel_multi
    #     )    

    #chrt_super = chrt_super | legend
    
    return chrt_super   

In [14]:
def chart_accuracy_speed_scatter(df, model_type, chart_upper_limit):
    #title = "Baseline Accuracy on Test Set by Number of Samples: " + str(model_type)
    title = str(model_type)
    chrt_super = alt.Chart(df, title=title).mark_circle().encode(
        x=alt.X('run_time:Q', axis=alt.Axis(grid=False, titleFontSize=14, title='Run Time in Seconds'), scale=alt.Scale(domain=[0., chart_upper_limit])),
        y=alt.Y('test_accuracy:Q', axis=alt.Axis(grid=False, titleFontSize=14, title='Accuracy on Test Set'), scale=alt.Scale(domain=[0.65, 1.])),
        color=alt.Color('semi_supervised:N', title="Semi Supervised"),
        tooltip=['model_type', alt.Tooltip("samples", format=",.0f"), alt.Tooltip("test_accuracy", format=",.4f"), 
                 'semi_supervised:N', alt.Tooltip("run_time", format=",.4f")]
    ).properties(
        width=240,
        height=200
    )
    
    return chrt_super   

In [15]:
def initiate_sgd(use_warm_start):
    model = SGDClassifier(loss="modified_huber", max_iter=1000, tol=1e-3, random_state=2584, n_jobs=-1, warm_start=use_warm_start)
    
    return model

### Prepare for Modeling

In [16]:
# Set up the checkpoints for the list of number of labels against which we check the model accuracy on the test set
upper_limit = len(train_df)
step_size = 1000
label_count_checkpoints = [i for i in range(0, upper_limit, step_size)]
label_count_checkpoints.pop(0)
label_count_checkpoints = [250, 500, 750] + label_count_checkpoints
if upper_limit!=label_count_checkpoints[-1]: label_count_checkpoints.append(upper_limit)

In [17]:
# Model Parameters
semi_supervised_iterations = 1
run_semi_supervised = True
use_warm_start = False #True

tfidf_max_features = [100, 200, 300, 500, 800, None]

kernel = 1.0 * RBF(1.0)
# Define Models
model_dict = {}
# model_dict['GaussianProcessClassifier'] = GaussianProcessClassifier(kernel=kernel,random_state=0)
#model_dict['MultinomialNB']= MultinomialNB()
model_dict['LinearSVC'] = LinearSVC(random_state=RANDOM_STATE)
#model_dict['SGDClassifier'] = initiate_sgd(use_warm_start)
#model_dict['Perceptron'] = Perceptron(random_state=RANDOM_STATE, n_jobs=-1)
#model_dict['PassiveAggressiveClassifier'] = PassiveAggressiveClassifier(random_state=RANDOM_STATE, n_jobs=-1)

# estimators = [
#     ('MultinomialNB', MultinomialNB()),
#     ('LinearSVC', LinearSVC(random_state=RANDOM_STATE)),
#     ('SGDClassifier', initiate_sgd(use_warm_start)),
#     ('Perceptron', Perceptron(random_state=RANDOM_STATE, n_jobs=-1)),
#     ('PassiveAggressiveClassifier', PassiveAggressiveClassifier(random_state=RANDOM_STATE, n_jobs=-1)),
# ]
#model_dict['StackingClassifier'] = StackingClassifier(estimators=estimators, final_estimator=LinearSVC(random_state=RANDOM_STATE), n_jobs=-1)

# Baseline Model
#model = LinearSVC(random_state=RANDOM_STATE)
# App Model
model = initiate_sgd(use_warm_start)

df_results = pd.DataFrame(columns = ['model_type', 'vectorizer_num_features', 'semi_supervised', 'samples', 'dummy_train_accuracy', 
                                     'dummy_test_accuracy', 'train_accuracy', 'test_accuracy', 'run_time'])

In [18]:
# Supervised
# for tmf in tqdm(tfidf_max_features):
#     # Vectorize the train data - we have a corpus before we start labeling
#     vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", max_features=tmf)
#     vectorizer.fit(train_df["tweet_text"])
#     for model_type, model in model_dict.items():
#         for current_num_samples in label_count_checkpoints:
#             results = supervised_subset(vectorizer, current_num_samples, model, train_df, model_type, warm_start=use_warm_start)
#             df_results = df_results.append(results, ignore_index=True)
# to see results on the full train data
#df_results.tail()
#df_results.to_csv("model_accuracy_results.csv", index=False)

In [19]:
# Semi Supervised with Same Model
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
semi_supervised_options_list = [False, True]
vectorizer.fit(train_df["tweet_text"])
for model_type, model in model_dict.items():
    if use_warm_start:
        model = initiate_sgd(use_warm_start)
    for current_num_samples in tqdm(label_count_checkpoints):
        for ssol in semi_supervised_options_list:
            results = supervised_subset(vectorizer, current_num_samples, model, train_df, model_type, warm_start=use_warm_start,
                                        semi_supervised=ssol, semi_supervised_iterations=semi_supervised_iterations)
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
            df_results = pd.concat([df_results, pd.DataFrame([results])], ignore_index=True)
    df_results.to_csv("model_accuracy_results_semi_supervised.csv", index=False)

100%|██████████████████████████████████████████████████████████████████████████████████| 57/57 [13:39<00:00, 14.38s/it]


In [20]:
df_results.loc[df_results['samples']<=1000]

Unnamed: 0,model_type,vectorizer_num_features,semi_supervised,samples,dummy_train_accuracy,dummy_test_accuracy,train_accuracy,test_accuracy,run_time
0,LinearSVC,,False,250,0.374502,0.377243,1.0,0.68595,0.023
1,LinearSVC,,True,250,0.403803,0.401055,0.636435,0.65442,9.43402
2,LinearSVC,,False,500,0.399202,0.389578,1.0,0.783707,0.027002
3,LinearSVC,,True,500,0.403803,0.401055,0.717285,0.764314,8.925019
4,LinearSVC,,False,750,0.398136,0.392414,1.0,0.835554,0.033005
5,LinearSVC,,True,750,0.403803,0.401055,0.774019,0.817678,9.073002
6,LinearSVC,,False,1000,0.394605,0.393997,1.0,0.867942,0.036985
7,LinearSVC,,True,1000,0.403803,0.401055,0.814855,0.850198,9.045017


In [28]:
df_results.loc[df_results['samples']>=50000]

Unnamed: 0,model_type,vectorizer_num_features,semi_supervised,samples,dummy_train_accuracy,dummy_test_accuracy,train_accuracy,test_accuracy,run_time
104,LinearSVC,,False,50000,0.407392,0.401319,0.99986,0.973813,1.310016
105,LinearSVC,,True,50000,0.403803,0.401055,0.998001,0.973747,10.61
106,LinearSVC,,False,51000,0.404639,0.401517,0.999882,0.973879,1.305016
107,LinearSVC,,True,51000,0.403803,0.401055,0.998562,0.973879,10.875995
108,LinearSVC,,False,52000,0.40605,0.401055,0.999865,0.97434,1.392995
109,LinearSVC,,True,52000,0.403803,0.401055,0.99901,0.973945,10.94999
110,LinearSVC,,False,53000,0.40569,0.401385,0.999849,0.974274,1.643022
111,LinearSVC,,True,53000,0.403803,0.401055,0.99957,0.974011,10.642995
112,LinearSVC,,False,53531,0.403803,0.401055,0.999851,0.97434,1.441994
113,LinearSVC,,True,53531,0.403803,0.401055,0.999851,0.97434,10.713996


## Results Visualizations

### Visualizing Results Curve - Accuracy vs Number of Training Samples

In [21]:
chrts = []

semi_supervised_options_list = [False, True]

#for ssol in semi_supervised_options_list:
chrts.append(chart_results_curve(df_results.loc[(df_results['vectorizer_num_features'].isna())], model_type))
row1 = chrts[0]# + chrts[1]

super_chrt = alt.vconcat(row1).properties(
    title='Baseline Accuracy Curves with True Labels'
).configure_title(
    fontSize=20,
    anchor='start',
    color='gray'
)
super_chrt.save('super_chrt_semi_supervised.html')
super_chrt

##### Zoom in to the early part of the chart and the first labels added.

In [22]:
upper_early_sample_limit = 10000
chrts_early = []
for model_type in model_dict.keys():
    chrts_early.append(chart_results_curve(df_results.loc[(df_results['model_type']==model_type) & 
                                                          (df_results['samples']<=upper_early_sample_limit)], model_type))
    

row1 = alt.hconcat(chrts_early[0])# | chrts_early [1] | chrts_early [2] )
#row2 = alt.hconcat(chrts_early[3] | chrts_early [4] | chrts_early [5] )

super_chrt_early = alt.vconcat(row1).properties(
    title='Baseline Accuracy Curves with True Labels - Few Labels'
).configure_title(
    fontSize=20,
    anchor='start',
    color='gray'
)
super_chrt_early.save('super_chrt_early_harder_target.html')
super_chrt_early

### Visualizing Accuracy vs Speed in App for Recommended Texts

In [24]:
chrts = []
chart_upper_limit = 14
chrts.append(chart_accuracy_speed_scatter(df_results.loc[(df_results['vectorizer_num_features'].isna()) & (df_results['model_type']!='StackingClassifier')], None, chart_upper_limit))

row_chrt1 = alt.hconcat(chrts[0]) # | chrts [1] | chrts [2])
#row_chrt2 = alt.hconcat(chrts[3] | chrts [4] | chrts [5])
super_chrt_speed_accuracy = alt.vconcat(row_chrt1).properties(
    title='Baseline Accuracy to Speed with True Labels'
).configure_title(
    fontSize=20,
    anchor='start',
    color='gray'
)
super_chrt_speed_accuracy.save('super_chrt_speed_accuracy_harder_target.html')
super_chrt_speed_accuracy

### Highest Test Accuracy

In [25]:
supervised_highest_test_accuracy = np.max(df_results.loc[df_results['semi_supervised']==False,"test_accuracy"])
semi_supervised_highest_test_accuracy = np.max(df_results.loc[df_results['semi_supervised']==True,"test_accuracy"])
supervised_mean_test_accuracy = np.mean(df_results.loc[df_results['semi_supervised']==False,"test_accuracy"])
semi_supervised_mean_test_accuracy = np.mean(df_results.loc[df_results['semi_supervised']==True,"test_accuracy"])
print("Highest Test Accuracy on Supervised Learning      %.5f" % supervised_highest_test_accuracy)
print("Highest Test Accuracy on Semi-Supervised Learning %.5f" % semi_supervised_highest_test_accuracy)
print("Mean Test Accuracy on Supervised Learning         %.5f" % supervised_mean_test_accuracy)
print("Mean Test Accuracy on Semi-Supervised Learning    %.5f" % semi_supervised_mean_test_accuracy)

Highest Test Accuracy on Supervised Learning      0.97434
Highest Test Accuracy on Semi-Supervised Learning 0.97434
Mean Test Accuracy on Supervised Learning         0.95574
Mean Test Accuracy on Semi-Supervised Learning    0.95357


In [26]:
end_time = datetime.now()
end_time.strftime("%Y/%m/%d %H:%M:%S")

'2021/11/27 12:37:11'

In [27]:
duration = end_time - start_time
print("duration :", duration)

duration : 0:14:52.880723


# To Do
* Viz Small Multiples scatter plot accuracy to speed with color for num features and a plot each for model type
* Save a Version of this Notebook as Baseline
* Run a New Version of this Notebook with Carlo's SGD warm start incremental model - output results and charts
* Compare Accuracy and Run Times between these 2 baselines
* Chart Baselines Against Each other and compare run times to create the baseline.
* Ensemble
* Semi Supervised