# Stance Detection in Tweets (Conforti et al. 2020)

In this notebook we work with data from what is currently the largest stance detection dataset: the Twitter dataset Will-They-Won't-They, published in 2020 by Conforti et al.

In contrast to the notebook on Somasundaran & Wiebe 2010, we explore a different type of feature for stance detection: word/document embeddings. We also work with more classification algorithms.

The data set was presented and discussed in:

Conforti et al. (2020). Will-They-Won’t-They: A Very Large Dataset for Stance Detection on Twitter.
https://www.aclweb.org/anthology/2020.acl-main.157/


In [1]:
import io
import json
import numpy as np

from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


In [2]:
# These paths need to be adjusted to your platform
id_path = '/home/robin/research/corpora/acl2020-wtwt-tweets/wtwt_ids.json'
corpus_path = '/home/robin/research/corpora/acl2020-wtwt-tweets/tweets/tweets_final.json'

X_path = '/home/robin/research/corpora/acl2020-wtwt-tweets/X_array.npy'
X_stance_path = '/home/robin/research/corpora/acl2020-wtwt-tweets/X_stance_array.npy'

## 1. Having a look at the IDs

The tweet IDs are available in a GitHub repository that can be cloned: https://github.com/cambridge-wtwt/acl2020-wtwt-tweets

The IDs are in the file: wtwt_ids.json

In [3]:
with io.open(id_path, mode='r') as f_in:
    corpus = json.load(f_in)
    
print(len(corpus))

51284


In [4]:
print('The first entry:')
corpus[0]

The first entry:


{'tweet_id': '971761970117357568', 'merger': 'CI_ESRX', 'stance': 'support'}

Due to strict data privacy regulations, the actual tweet content cannot be distributed. To use the corpus, one has to collect the tweets via their IDs and then match each ID with its stance label. To facilitate the matching step, we restructure the corpus so that the labels can be accessed via the tweet ID.

In [5]:
corpus_restructured = {}
for entry in corpus:
    corpus_restructured[entry['tweet_id']] = {'merger': entry['merger'], 'stance': entry['stance']}

corpus_restructured['971761970117357568']

{'merger': 'CI_ESRX', 'stance': 'support'}

## 2. Collecting the Tweet data set

There are several ways to collect tweets via the Twitter API. I recommend using Tweepy, as it is a high-level tool that makes the task quite easy.

https://www.tweepy.org/

Alternatively, you may use a simple tweet crawler that I wrote. It is based on Tweepy and even more high-level.

https://github.com/RobinSchaefer/tweet-crawler

In any case you will need to get your own Twitter Developer credentials. These allow you to access the API.

https://developer.twitter.com/en
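
Whichever tool you use, the IDs are usually requested in batches, and some tweets will no longer be available. The following is a minimal sketch of that batching logic only. It assumes that the lookup endpoint accepts a limited number of IDs per request (100 is used here as an assumption; check the current API documentation). `fetch_batch` is a hypothetical callable that would wrap the actual Tweepy request, and `fake_fetch` is a stand-in so that the sketch runs without credentials.

```python
def batched(ids, batch_size=100):
    """Yield consecutive slices of `ids` with at most `batch_size` items each."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def collect_tweets(ids, fetch_batch, batch_size=100):
    """Call `fetch_batch` on every batch of IDs and gather the returned tweets.

    `fetch_batch` is a hypothetical callable: it takes a list of ID strings
    and returns a list of tweet dicts for the tweets that still exist.
    """
    collected = []
    for batch in batched(ids, batch_size):
        collected.extend(fetch_batch(batch))
    return collected


def fake_fetch(batch):
    """Stand-in for a real API call: pretends every third tweet was deleted."""
    return [{'id_str': id_, 'full_text': 'text of ' + id_}
            for id_ in batch if int(id_) % 3 != 0]


toy_ids = [str(n) for n in range(1, 251)]
batch_sizes = [len(batch) for batch in batched(toy_ids)]
toy_tweets = collect_tweets(toy_ids, fake_fetch)

print(batch_sizes)
print(len(toy_tweets), 'of', len(toy_ids), 'tweets collected')
```

The collected list can be shorter than the ID list, which is why the tweets are matched with their labels by ID rather than by position.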

### 2.1. Loading the Actual Tweet Data and Matching It With Labels

In this section we load the actual tweet data and match them with their labels.

In [6]:
# Again: you need to collect the tweets first and set corpus_path
with io.open(corpus_path, mode='r') as f_in:
    tweets_json = json.load(f_in)

In [7]:
tweets = []
labels = []
support_tweets = 0
refute_tweets = 0
comment_tweets = 0
unrelated_tweets = 0

stance_tweets = []
stance_labels = []

for tweet in tweets_json:
    id_ = tweet['id_str']
    text = tweet['full_text']
    label = corpus_restructured[id_]['stance']
    
    tweets.append(text)
    labels.append(label)
    
    if label == 'support':
        support_tweets += 1
        stance_tweets.append(text)
        stance_labels.append(label)
    elif label == 'refute':
        refute_tweets += 1
        stance_tweets.append(text)
        stance_labels.append(label)
    elif label == 'comment':
        comment_tweets += 1   
        stance_tweets.append(text)
        stance_labels.append(label)
    elif label == 'unrelated':
        # Unrelated tweets are counted but kept out of the stance subset
        unrelated_tweets += 1

print('Number support tweets: {}'.format(support_tweets))
print('Number refute tweets: {}'.format(refute_tweets))
print('Number comment tweets: {}'.format(comment_tweets))
print('Number unrelated tweets: {}'.format(unrelated_tweets))

Number support tweets: 2891
Number refute tweets: 1229
Number comment tweets: 6261
Number unrelated tweets: 4407
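
The matching and counting step can also be written more compactly. The following is a small sketch on toy data, not the notebook's own code: `match_labels` is a hypothetical helper that skips tweets whose ID is missing from the label lookup instead of raising a `KeyError`, and `collections.Counter` replaces the hand-written counters.

```python
from collections import Counter


def match_labels(tweets_json, id_to_label):
    """Pair each collected tweet with its stance label via the tweet ID.

    Tweets whose ID is not in `id_to_label` are skipped and counted.
    """
    texts, labels, skipped = [], [], 0
    for tweet in tweets_json:
        entry = id_to_label.get(tweet['id_str'])
        if entry is None:
            skipped += 1
            continue
        texts.append(tweet['full_text'])
        labels.append(entry['stance'])
    return texts, labels, skipped


# Toy data in the same shape as corpus_restructured and tweets_json
toy_lookup = {
    '1': {'merger': 'CI_ESRX', 'stance': 'support'},
    '2': {'merger': 'CI_ESRX', 'stance': 'comment'},
    '3': {'merger': 'CI_ESRX', 'stance': 'unrelated'},
    '4': {'merger': 'CI_ESRX', 'stance': 'comment'},
}
toy_tweets_json = [
    {'id_str': '1', 'full_text': 'a'},
    {'id_str': '2', 'full_text': 'b'},
    {'id_str': '3', 'full_text': 'c'},
    {'id_str': '4', 'full_text': 'd'},
    {'id_str': '5', 'full_text': 'e'},  # no label available for this ID
]

toy_texts, toy_labels, toy_skipped = match_labels(toy_tweets_json, toy_lookup)
label_counts = Counter(toy_labels)
print(label_counts)
print('Skipped tweets:', toy_skipped)
```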


## 3. Creating Embedding Features

For embedding creation we use Flair, in particular the `TransformerDocumentEmbeddings` class, which gives us access to pretrained BERT embeddings. Here we use 'bert-base-cased'.

Flair allows for different embedding types, both for words and documents, which can be easily applied.

For the documentation and helpful tutorials see: https://github.com/flairNLP/flair

In [8]:
embedder = TransformerDocumentEmbeddings('bert-base-cased')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


As shown in the next code block we can encode a text string simply by using the `.embed()` method of our embedder object. 

In [9]:
# Embeddings for total data set

embeddings = []

for i, text in enumerate(tweets):
    sentence = Sentence(text)

    embedder.embed(sentence)
    # Detach and move to the CPU before converting, so this also works on a GPU
    embedded = sentence.embedding.detach().cpu().numpy()
    embeddings.append(embedded)

    if (i+1) % 1000 == 0:
        print('Generated Embeddings: {}'.format(i+1))

X = np.array(embeddings)

Generated Embeddings: 1000
Generated Embeddings: 2000
Generated Embeddings: 3000
Generated Embeddings: 4000
Generated Embeddings: 5000
Generated Embeddings: 6000
Generated Embeddings: 7000
Generated Embeddings: 8000
Generated Embeddings: 9000
Generated Embeddings: 10000
Generated Embeddings: 11000
Generated Embeddings: 12000
Generated Embeddings: 13000
Generated Embeddings: 14000


In [10]:
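# Embeddings for the stance subset (support/refute/comment; unrelated tweets excluded)

embeddings_stance = []

for i, text in enumerate(stance_tweets):
    sentence = Sentence(text)

    embedder.embed(sentence)
    # Detach and move to the CPU before converting, so this also works on a GPU
    embedded = sentence.embedding.detach().cpu().numpy()
    embeddings_stance.append(embedded)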

    if (i+1) % 1000 == 0:
        print('Generated Embeddings: {}'.format(i+1))

X_stance = np.array(embeddings_stance)

Generated Embeddings: 1000
Generated Embeddings: 2000
Generated Embeddings: 3000
Generated Embeddings: 4000
Generated Embeddings: 5000
Generated Embeddings: 6000
Generated Embeddings: 7000
Generated Embeddings: 8000
Generated Embeddings: 9000
Generated Embeddings: 10000


### Save Embeddings

As creating document embeddings takes some time, we save our features for future steps.

In [11]:
np.save(X_path, X)
np.save(X_stance_path, X_stance)

### Load Embeddings

Once created and saved we can easily load them using `np.load`.

In [12]:
X = np.load(X_path)
print('Data array size: {}'.format(X.shape))

Data array size: (14788, 768)


In [13]:
X_stance = np.load(X_stance_path)
print('Data array size: {}'.format(X_stance.shape))

Data array size: (10381, 768)
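
The save and load steps can be combined into a simple cache, so that the expensive embedding step only runs when no saved file exists yet. This is a minimal sketch: `load_or_compute` is a hypothetical helper, and the toy function below stands in for the embedding loop.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, compute_fn):
    """Return the array stored at `path`; compute and save it first if absent."""
    if os.path.exists(path):
        return np.load(path)
    array = compute_fn()
    np.save(path, array)
    return array


# Toy demonstration with a temporary file instead of the real X_path
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'X_toy.npy')
    calls = []

    def compute_toy():
        calls.append(1)  # record how often the expensive step runs
        return np.ones((4, 768), dtype=np.float32)

    first = load_or_compute(toy_path, compute_toy)   # computes and saves
    second = load_or_compute(toy_path, compute_toy)  # loads from disk

print(first.shape, len(calls))
```

Note that only the feature arrays are cached here. The label lists still have to be rebuilt in the same order, so it is worth checking that `X.shape[0]` equals `len(labels)` after loading.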


## 4. Training and Testing Classifiers

As in the notebook on Somasundaran & Wiebe 2010, we split the data into training and test sets. We then initialize different classification models, fit them, and use them for prediction. For evaluation we use macro-averaged F1 scores.
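
To make the metric concrete: the macro-averaged F1 score is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The toy labels below are made up for illustration and are not taken from the corpus.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

classes = ['comment', 'support', 'refute']

# Toy gold labels and predictions: the classifier favours the majority class
toy_true = ['comment'] * 6 + ['support'] * 3 + ['refute'] * 1
toy_pred = ['comment'] * 6 + ['support', 'comment', 'comment'] + ['comment']

# One F1 score per class, in the order given by `classes`
per_class_f1 = f1_score(toy_true, toy_pred, labels=classes,
                        average=None, zero_division=0)
macro_f1 = f1_score(toy_true, toy_pred, labels=classes,
                    average='macro', zero_division=0)
accuracy = accuracy_score(toy_true, toy_pred)

print(dict(zip(classes, np.round(per_class_f1, 3))))
print('Macro F1:', round(macro_f1, 3), '| Accuracy:', round(accuracy, 3))
```

In this toy case the accuracy is 0.7, while the macro F1 score is only about 0.433, because the class that is never predicted contributes an F1 score of 0.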

### 4.1. Full Dataset

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)
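
The split above is purely random. With imbalanced labels, a possible variation (not used in this notebook) is a stratified split, which keeps the class proportions the same in the training and test sets via the `stratify` parameter of `train_test_split`. A sketch on toy data:

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80% 'comment', 20% 'support'
toy_X = np.arange(100).reshape(-1, 1)
toy_labels = ['comment'] * 80 + ['support'] * 20

_, _, _, toy_y_test = train_test_split(
    toy_X, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# The 25 test items keep the 80/20 ratio of the full toy data
test_counts = Counter(toy_y_test)
print(test_counts)
```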

In [15]:
# Defining models

svm_classifier = SVC()
naive_bayes_classifier = GaussianNB()
decision_tree_classifier = DecisionTreeClassifier()
adaboost_classifier = AdaBoostClassifier()

In [16]:
# Fitting models

svm_classifier.fit(X_train, y_train)
naive_bayes_classifier.fit(X_train, y_train)
decision_tree_classifier.fit(X_train, y_train)
adaboost_classifier.fit(X_train, y_train)

AdaBoostClassifier()

In [17]:
# Testing models

y_pred_svm = svm_classifier.predict(X_test)
y_pred_naive_bayes = naive_bayes_classifier.predict(X_test)
y_pred_decision_tree = decision_tree_classifier.predict(X_test)
y_pred_adaboost = adaboost_classifier.predict(X_test)

In [18]:
# Evaluating models

f1_svm = f1_score(y_test, y_pred_svm, average='macro')
print("F1 Score SVM: {}".format(round(f1_svm,3)))
f1_naive_bayes = f1_score(y_test, y_pred_naive_bayes, average='macro')
print("F1 Score Naive Bayes: {}".format(round(f1_naive_bayes, 3)))
f1_decision_tree = f1_score(y_test, y_pred_decision_tree, average='macro')
print("F1 Score Decision Tree: {}".format(round(f1_decision_tree, 3)))
f1_adaboost = f1_score(y_test, y_pred_adaboost, average='macro')
print("F1 Score AdaBoost: {}".format(round(f1_adaboost, 3)))

F1 Score SVM: 0.475
F1 Score Naive Bayes: 0.432
F1 Score Decision Tree: 0.395
F1 Score AdaBoost: 0.485


### 4.2. Support/Refute/Comment Dataset (unrelated tweets excluded)

In [19]:
X_stance_train, X_stance_test, y_stance_train, y_stance_test = train_test_split(X_stance, stance_labels, test_size=0.25, random_state=0)

In [20]:
svm_classifier_stance = SVC()
naive_bayes_classifier_stance = GaussianNB()
decision_tree_classifier_stance = DecisionTreeClassifier()
adaboost_classifier_stance = AdaBoostClassifier()

In [21]:
svm_classifier_stance.fit(X_stance_train, y_stance_train)
naive_bayes_classifier_stance.fit(X_stance_train, y_stance_train)
decision_tree_classifier_stance.fit(X_stance_train, y_stance_train)
adaboost_classifier_stance.fit(X_stance_train, y_stance_train)

AdaBoostClassifier()

In [22]:
y_stance_pred_svm = svm_classifier_stance.predict(X_stance_test)
y_stance_pred_naive_bayes = naive_bayes_classifier_stance.predict(X_stance_test)
y_stance_pred_decision_tree = decision_tree_classifier_stance.predict(X_stance_test)
y_stance_pred_adaboost = adaboost_classifier_stance.predict(X_stance_test)

In [23]:
f1_svm_stance = f1_score(y_stance_test, y_stance_pred_svm, average='macro')
print("F1 Score SVM: {}".format(round(f1_svm_stance,3)))
f1_naive_bayes_stance = f1_score(y_stance_test, y_stance_pred_naive_bayes, average='macro')
print("F1 Score Naive Bayes: {}".format(round(f1_naive_bayes_stance, 3)))
f1_decision_tree_stance = f1_score(y_stance_test, y_stance_pred_decision_tree, average='macro')
print("F1 Score Decision Tree: {}".format(round(f1_decision_tree_stance, 3)))
f1_adaboost_stance = f1_score(y_stance_test, y_stance_pred_adaboost, average='macro')
print("F1 Score AdaBoost: {}".format(round(f1_adaboost_stance, 3)))

F1 Score SVM: 0.469
F1 Score Naive Bayes: 0.556
F1 Score Decision Tree: 0.478
F1 Score AdaBoost: 0.554


### Question

The classification results for the two data sets differ substantially. What could explain these differences?
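
One possible starting point for investigating this question is to compare the label distribution of each data set with a trivial baseline. The sketch below uses made-up toy labels. `label_shares` is a hypothetical helper, and scikit-learn's `DummyClassifier` with `strategy='most_frequent'` serves as a majority-class baseline.

```python
from collections import Counter

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score


def label_shares(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Toy labels; replace with `labels` or `stance_labels` from the notebook
toy_labels = ['comment'] * 6 + ['unrelated'] * 4 + ['support'] * 3 + ['refute'] * 1
toy_X = np.zeros((len(toy_labels), 1))  # features are ignored by the baseline

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X, toy_labels)
baseline_pred = baseline.predict(toy_X)
baseline_f1 = f1_score(toy_labels, baseline_pred, average='macro', zero_division=0)

shares = label_shares(toy_labels)
print(shares)
print('Majority baseline macro F1:', round(baseline_f1, 3))
```

Running the same comparison on both data sets shows how far each classifier is above its majority-class baseline, and how the number of classes affects the macro average.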