<h2> Comments for the Task </h2>

The baseline system, a default linear SVM (LinearSVC), reaches an <i>F1-score</i> of 0.6263. Two tactics were tried to exceed this score:
<ul>
    <li>Try three additional classifiers: Random Forest (RF), Decision Tree (DT) and a linear SVM trained with stochastic gradient descent (SGDClassifier)</li>
    <li>Change the n-gram size used by the Tf-Idf vectorizer</li>
</ul>

The most effective n-gram size turned out to be 6 (word 6-grams only, i.e. ngram_range=(6,6) in the vectorizer). With that setting the <i>F1-scores</i> are:
<ol>
    <li>SVM (same score with SGD): 0.6671</li>
    <li>RF: 0.6666</li>
    <li>DT: 0.6660</li>
</ol>
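
To make the 6-gram setting concrete, the sketch below builds word n-grams by hand. It is an illustration only: word_ngrams is a hypothetical helper, both tweets are invented, and the assumption is that a word-level vectorizer restricted to 6-grams behaves analogously, so a tweet with fewer than six tokens left after tokenization and stop-word removal contributes no features.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented example tweets, already lowercased and split on whitespace.
short_tweet = "love waiting in line".split()                           # 4 tokens
long_tweet = "oh great another monday morning meeting again".split()  # 7 tokens

# A 4-token tweet has no 6-grams; a 7-token tweet has exactly two.
print(word_ngrams(short_tweet, 6))
print(word_ngrams(long_tweet, 6))
```

Under that assumption, short tweets map to all-zero feature vectors, which is worth keeping in mind when reading the 6-gram scores.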

With unigrams, RF, DT and SGD all score below the baseline SVM: 0.5563, 0.5727 and 0.6191, respectively.
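
As a reference point for comparing these scores, the sketch below works out the F1-score of a hypothetical classifier that labels every tweet as class 1. The class counts (1923 negative, 1911 positive) are the ones printed by the evaluation cells; the constant classifier is an assumption introduced only for this calculation.

```python
# Class counts as printed by the notebook: [[0, 1923], [1, 1911]]
n_neg, n_pos = 1923, 1911

# A classifier that always predicts class 1 gets every positive right
# and turns every negative into a false positive.
tp, fp, fn = n_pos, n_neg, 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_all_positive = 2 * precision * recall / (precision + recall)

print(round(f1_all_positive, 4))  # 0.6653
```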

In [66]:
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import dump_svmlight_file
from sklearn import metrics
import numpy as np
import logging
import codecs

In [16]:
def parse_dataset(fp):
    '''
    Loads the dataset .txt file (tab-separated index, label and tweet on each line) and parses it.
    :param fp: filepath of dataset
    :return:
        corpus: list of tweet strings
        y: list of integer labels
    '''
    y = []
    corpus = []
    with open(fp, 'rt', encoding='utf-8') as data_in:
        for line in data_in:
            if not line.lower().startswith("tweet index"): # discard the header line if present
                fields = line.rstrip().split("\t") # remove trailing whitespace, then split the columns once
                y.append(int(fields[1])) # column 1: label
                corpus.append(fields[2]) # column 2: tweet text

    return corpus, y

In [68]:
def featurize(corpus):
    '''
    Tokenizes and creates TF-IDF BoW vectors.
    :param corpus: A list of strings, each string representing one document.
    :return: X: A sparse CSR matrix of TF-IDF-weighted n-gram counts.
    '''

    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True).tokenize
    vectorizer = TfidfVectorizer(strip_accents="unicode", analyzer="word", tokenizer=tokenizer, stop_words="english",
                                ngram_range=(6,6))
    #vectorizer = TfidfVectorizer(strip_accents="unicode", analyzer="word", tokenizer=tokenizer, stop_words="english")
    X = vectorizer.fit_transform(corpus)
    # print(vectorizer.get_feature_names_out()) # to manually check if the tokens are reasonable (get_feature_names() on scikit-learn < 1.0)
    return X

In [62]:
# Dataset: SemEval2018-T4-train-taskA.txt or SemEval2018-T4-train-taskB.txt
DATASET_FP = "./SemEval2018-T4-train-taskA.txt"
TASK = "A" # Define, A or B
FNAME = './predictions-task' + TASK + '.txt'
PREDICTIONSFILE = open(FNAME, "w", encoding='utf-8')

K_FOLDS = 10 # 10-fold crossvalidation
CLF = LinearSVC() # the default, non-parameter optimized linear-kernel SVM

# Load the dataset and featurize it with a simple Tf-Idf BoW model
corpus, y = parse_dataset(DATASET_FP)
X = featurize(corpus)

class_counts = np.asarray(np.unique(y, return_counts=True)).T.tolist()
print(class_counts)

# Returns an array of the same size as 'y' where each entry is a cross-validated prediction
predicted = cross_val_predict(CLF, X, y, cv=K_FOLDS)

# Choose the F1-score variant that matches the task
if TASK.lower() == 'a':
    score = metrics.f1_score(y, predicted, pos_label=1) # binary F1 on the positive class
elif TASK.lower() == 'b':
    score = metrics.f1_score(y, predicted, average="macro") # unweighted mean of per-class F1
print("F1-score Task", TASK, score)
for p in predicted:
    PREDICTIONSFILE.write("{}\n".format(p))
PREDICTIONSFILE.close()

[[0, 1923], [1, 1911]]
F1-score Task A 0.626344086022
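
The cell above switches between two F1 variants depending on the task. The sketch below spells out, in plain Python, what the two variants compute as I understand them: the F1-score of the positive class only, and the unweighted mean of the per-class F1-scores. The helper f1_for_label and the toy labels are hypothetical and are not taken from the dataset.

```python
def f1_for_label(y_true, y_pred, label):
    """F1-score of one class, treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented gold labels and predictions.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Task A style: F1 of the positive class (label 1) only.
binary_f1 = f1_for_label(toy_true, toy_pred, label=1)

# Task B style: macro average, i.e. the plain mean of the per-class F1-scores.
labels = sorted(set(toy_true))
macro_f1 = sum(f1_for_label(toy_true, toy_pred, lab) for lab in labels) / len(labels)

print(binary_f1, macro_f1)
```

With more than two classes the macro variant works the same way: one F1-score per label, then their mean.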


In [63]:
# Dataset: SemEval2018-T4-train-taskA.txt or SemEval2018-T4-train-taskB.txt
DATASET_FP = "./SemEval2018-T4-train-taskA.txt"
TASK = "A" # Define, A or B
FNAME = './predictions-task' + TASK + '.txt'
PREDICTIONSFILE = open(FNAME, "w", encoding='utf-8')

K_FOLDS = 10 # 10-fold crossvalidation
CLF = RandomForestClassifier()

# Load the dataset and featurize it with a simple Tf-Idf BoW model
corpus, y = parse_dataset(DATASET_FP)
X = featurize(corpus)

class_counts = np.asarray(np.unique(y, return_counts=True)).T.tolist()
print(class_counts)

# Returns an array of the same size as 'y' where each entry is a cross-validated prediction
predicted = cross_val_predict(CLF, X, y, cv=K_FOLDS)

# Choose the F1-score variant that matches the task
if TASK.lower() == 'a':
    score = metrics.f1_score(y, predicted, pos_label=1) # binary F1 on the positive class
elif TASK.lower() == 'b':
    score = metrics.f1_score(y, predicted, average="macro") # unweighted mean of per-class F1
print("F1-score Task", TASK, score)
for p in predicted:
    PREDICTIONSFILE.write("{}\n".format(p))
PREDICTIONSFILE.close()

[[0, 1923], [1, 1911]]
F1-score Task A 0.556307692308


In [64]:
# Dataset: SemEval2018-T4-train-taskA.txt or SemEval2018-T4-train-taskB.txt
DATASET_FP = "./SemEval2018-T4-train-taskA.txt"
TASK = "A" # Define, A or B
FNAME = './predictions-task' + TASK + '.txt'
PREDICTIONSFILE = open(FNAME, "w", encoding='utf-8')

K_FOLDS = 10 # 10-fold crossvalidation
CLF = DecisionTreeClassifier()

# Load the dataset and featurize it with a simple Tf-Idf BoW model
corpus, y = parse_dataset(DATASET_FP)
X = featurize(corpus)

class_counts = np.asarray(np.unique(y, return_counts=True)).T.tolist()
print(class_counts)

# Returns an array of the same size as 'y' where each entry is a cross-validated prediction
predicted = cross_val_predict(CLF, X, y, cv=K_FOLDS)

# Choose the F1-score variant that matches the task
if TASK.lower() == 'a':
    score = metrics.f1_score(y, predicted, pos_label=1) # binary F1 on the positive class
elif TASK.lower() == 'b':
    score = metrics.f1_score(y, predicted, average="macro") # unweighted mean of per-class F1
print("F1-score Task", TASK, score)
for p in predicted:
    PREDICTIONSFILE.write("{}\n".format(p))
PREDICTIONSFILE.close()

[[0, 1923], [1, 1911]]
F1-score Task A 0.572747014115


In [69]:
# Dataset: SemEval2018-T4-train-taskA.txt or SemEval2018-T4-train-taskB.txt
DATASET_FP = "./SemEval2018-T4-train-taskA.txt"
TASK = "A" # Define, A or B
FNAME = './predictions-task' + TASK + '.txt'
PREDICTIONSFILE = open(FNAME, "w", encoding='utf-8')

K_FOLDS = 10 # 10-fold crossvalidation
CLF = linear_model.SGDClassifier()

# Load the dataset and featurize it with a simple Tf-Idf BoW model
corpus, y = parse_dataset(DATASET_FP)
X = featurize(corpus)

class_counts = np.asarray(np.unique(y, return_counts=True)).T.tolist()
print(class_counts)

# Returns an array of the same size as 'y' where each entry is a cross-validated prediction
predicted = cross_val_predict(CLF, X, y, cv=K_FOLDS)

# Choose the F1-score variant that matches the task
if TASK.lower() == 'a':
    score = metrics.f1_score(y, predicted, pos_label=1) # binary F1 on the positive class
elif TASK.lower() == 'b':
    score = metrics.f1_score(y, predicted, average="macro") # unweighted mean of per-class F1
print("F1-score Task", TASK, score)
for p in predicted:
    PREDICTIONSFILE.write("{}\n".format(p))
PREDICTIONSFILE.close()

[[0, 1923], [1, 1911]]
F1-score Task A 0.667133111772
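
The four evaluation cells above differ only in the value of CLF, so the comparison could also be written as one loop. The sketch below is one possible way to do that, not the notebook's actual procedure: evaluate is a hypothetical helper, the toy corpus is invented, and the vectorizer uses scikit-learn's default tokenizer instead of TweetTokenizer to keep the example self-contained. It also differs from the cells above in that the Tf-Idf vectorizer sits inside a pipeline and is therefore refitted on the training part of each fold rather than once on the full corpus.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def evaluate(clf, corpus, y, ngram_range=(1, 1), k_folds=10):
    """Cross-validated F1 of the positive class for one classifier (hypothetical helper)."""
    # The vectorizer is part of the pipeline, so it is fitted per training fold.
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=ngram_range), clf)
    predicted = cross_val_predict(pipeline, corpus, y, cv=k_folds)
    return metrics.f1_score(y, predicted, pos_label=1)


# Invented stand-in data; replace with the output of parse_dataset(DATASET_FP).
toy_corpus = ["great another monday"] * 10 + ["see you at lunch"] * 10
toy_y = [1] * 10 + [0] * 10

classifiers = {
    "LinearSVC": LinearSVC(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    scores[name] = evaluate(clf, toy_corpus, toy_y, k_folds=5)
    print(name, scores[name])
```

Passing ngram_range=(6, 6) to evaluate would correspond to the 6-gram setting, and further classifiers can be added to the dictionary without copying the cell.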
