# Stance Detection in Political Debates Using Deep Learning Techniques I

This is a follow-up notebook that works with the data used in Somasundaran & Wiebe 2010. In contrast to the first notebook, we do not use traditional ML techniques; instead, we train a DL model on the data using Flair and Google Colab.

Flair: https://github.com/flairNLP/flair

Google Colab: https://colab.research.google.com/?utm_source=scs-index

The data can be downloaded from:

http://mpqa.cs.pitt.edu/corpora/political_debates/

In [19]:
import io
import os
import pandas as pd
import numpy as np

from flair.data import Corpus
from flair.datasets import ClassificationCorpus
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

from sklearn.model_selection import StratifiedShuffleSplit

In [14]:
# These paths need to be adjusted to your platform
abortion_data_path = '/home/robin/research/corpora/political_debates_SomasundaranWiebeAcl2009/abortion'
gayrights_data_path = '/home/robin/research/corpora/political_debates_SomasundaranWiebeAcl2009/gayRights'

train_path = '/home/robin/research/course_pages/stance-detection-st2021/data/train.csv'
dev_path = '/home/robin/research/course_pages/stance-detection-st2021/data/dev.csv'
test_path = '/home/robin/research/course_pages/stance-detection-st2021/data/test.csv'

data_path = '/home/robin/research/course_pages/stance-detection-st2021/data'
model_path = '/home/robin/research/course_pages/stance-detection-st2021/models'

## 1. Reading in Data

The following block reads in the data. Each post is read from a txt file, and its text lines are joined into a single string. To train a model with Flair, we need to modify the labels slightly: each label gets the prefix `__label__`.
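The parsing rules described above can be sketched as a small stand-alone function. This is only an illustration: `parse_post` is a hypothetical helper that does not appear in the notebook, and the sample string is invented rather than taken from the corpus.

```python
LABEL_PREFIX = "__label__"  # prefix the notebook puts in front of each stance label


def parse_post(raw):
    """Split one raw post into (label, text), mirroring the notebook's rules."""
    stance = None
    text_lines = []
    for line in raw.split("\n"):
        if line.startswith("#stance"):
            stance = int(line[-1])  # the last character holds the stance id
        elif line.startswith("#"):
            continue  # other metadata lines are skipped
        else:
            text_lines.append(line)
    if stance is None:
        raise ValueError("no #stance line found")
    return LABEL_PREFIX + str(stance), " ".join(text_lines)


# Hypothetical sample; the metadata lines are invented for illustration.
sample = "#stance=stance1\n#other=ignored\nFirst line.\nSecond line."
label, text = parse_post(sample)
print(label)  # __label__1
print(text)   # First line. Second line.
```

Raising an error when no stance line is found keeps texts and labels aligned, which matters because the two lists are later paired by position.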

In [15]:
# Full vocab list
vocab = []

# Loading and first preprocessing of abortion data
abortion_data = []
abortion_stance = []

for file in os.listdir(abortion_data_path):
    abortion_file_path = os.path.join(abortion_data_path, file)
    
    with io.open(abortion_file_path, mode='r', encoding='utf-8') as f_in:
        
        try:
            text = []
            for line in f_in.read().split('\n'):
                if line.startswith('#stance'):
                    abortion_stance.append(int(line[-1]))
                elif line.startswith('#'):
                    continue
                else:
                    text.append(line)
                                
            text = " ".join(text)
            #text = [token.strip() for token in text]
            vocab.extend(text.split())
            
            abortion_data.append(text)
        except Exception:
            # skip files that cannot be read or parsed
            pass

# Loading and first preprocessing of gay rights data        
gayrights_data = []
gayrights_stance = []

for file in os.listdir(gayrights_data_path):
    gayrights_file_path = os.path.join(gayrights_data_path, file)
    
    with io.open(gayrights_file_path, mode='r', encoding='utf-8') as f_in:
        
        try:
            text = []
            for line in f_in.read().split('\n'):
                if line.startswith('#stance'):
                    gayrights_stance.append(int(line[-1]))
                elif line.startswith('#'):
                    continue
                else:
                    text.append(line)
                        
            text = " ".join(text)
            #text = [token.strip() for token in text]
            
            vocab.extend(text.split())
            
            gayrights_data.append(text)
        except Exception:
            # skip files that cannot be read or parsed
            pass

vocab = set(vocab)

# __label__ needed for training of model
abortion_stance = ['__label__'+str(stance) for stance in abortion_stance]
gayrights_stance = ['__label__'+str(stance) for stance in gayrights_stance]
        
data_total = abortion_data + gayrights_data
stance_total = abortion_stance + gayrights_stance

abortion_data = pd.Series(abortion_data)
abortion_stance = pd.Series(abortion_stance)
gayrights_data = pd.Series(gayrights_data)
gayrights_stance = pd.Series(gayrights_stance)
data_total = pd.Series(data_total)
stance_total = pd.Series(stance_total)

print("Abortion Data Size: {}".format(len(abortion_data)))
print("Gay Rights Data Size: {}".format(len(gayrights_data)))
print("Total Data Size: {}".format(len(data_total)))

Abortion Data Size: 1082
Gay Rights Data Size: 1927
Total Data Size: 3009


In [16]:
abortion_stance[0]

'__label__1'

## 2. Stratified Data Splitting and Data File Saving

We create stratified splits using `StratifiedShuffleSplit`. "Stratified" means that each split preserves the class distribution of the full data set, which is important when the classes are not balanced.
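One way to see what preserving the class distribution means is to compare label shares between the full data and a held-out part. The sketch below uses invented toy labels, and `label_shares` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def label_shares(labels):
    """Return the relative frequency of each label, rounded to two decimals."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(n / total, 2) for label, n in sorted(counts.items())}


# Toy labels, invented for illustration: 60% class 1, 40% class 2.
full = ["__label__1"] * 60 + ["__label__2"] * 40
# A stratified 10% hold-out keeps the same 60/40 ratio ...
stratified = ["__label__1"] * 6 + ["__label__2"] * 4
# ... whereas an unlucky unstratified hold-out might not.
skewed = ["__label__1"] * 9 + ["__label__2"] * 1

print(label_shares(full))        # {'__label__1': 0.6, '__label__2': 0.4}
print(label_shares(stratified))  # {'__label__1': 0.6, '__label__2': 0.4}
print(label_shares(skewed))      # {'__label__1': 0.9, '__label__2': 0.1}
```

The same helper could be applied to the label series of the actual splits to confirm that their shares match.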

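`StratifiedShuffleSplit` can be used for cross-validation. As we only want to create one split, we set `n_splits` to 1. However, as we want a three-way split (train, development, test), we have to split `X_train` and `y_train` a second time (using `sss2`). The second split uses `test_size=0.112` because it is applied to the remaining 90% of the data: 0.112 × 0.9 ≈ 0.1, so the test set is about as large as the development set.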

In [17]:
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
sss2 = StratifiedShuffleSplit(n_splits=1, test_size=0.112, random_state=0)
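The effect of the two `test_size` values can be worked out by hand. The sketch below does the arithmetic for the 1082 abortion posts, under the assumption that the held-out count is rounded up and the remainder stays in the training part; `split_sizes` is a hypothetical helper, and the actual sizes are best confirmed with `len()` on the resulting splits.

```python
import math


def split_sizes(n_samples, dev_fraction, test_fraction):
    """Sizes of a two-stage hold-out: dev is taken first, test from the remainder.

    Assumption: the held-out count is rounded up and the rest stays in training.
    """
    n_dev = math.ceil(dev_fraction * n_samples)
    n_rest = n_samples - n_dev
    n_test = math.ceil(test_fraction * n_rest)
    n_train = n_rest - n_test
    return n_train, n_dev, n_test


n_train, n_dev, n_test = split_sizes(1082, 0.1, 0.112)
print(n_train, n_dev, n_test)   # 864 109 109
print(round(n_test / 1082, 3))  # 0.101
```

Under this assumption, the development and test sets come out at the same size, each roughly 10% of the abortion data.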

In [18]:
# First split: hold out the development set from the abortion data
for train_index, dev_index in sss.split(abortion_data, abortion_stance):
    # print("TRAIN:", train_index, "DEV:", dev_index)
    X_train, X_dev = abortion_data[train_index], abortion_data[dev_index]
    y_train, y_dev = abortion_stance[train_index], abortion_stance[dev_index]
    
    # Reset the index so that the positional indices from sss2 match the Series labels
    X_train = X_train.reset_index(drop=True)
    y_train = y_train.reset_index(drop=True)
    
    # Second split: hold out the test set from the remaining training data
    for train_index, test_index in sss2.split(X_train, y_train):
        X_train, X_test = X_train[train_index], X_train[test_index]
        y_train, y_test = y_train[train_index], y_train[test_index]

In [11]:
# Concatenate stance label and text; Flair requires the label in column 0 and the text in column 1
train_df = pd.concat([y_train, X_train], axis=1)
dev_df = pd.concat([y_dev, X_dev], axis=1)
test_df = pd.concat([y_test, X_test], axis=1)

test_df.head()

|     | 0 | 1 |
| --- | --- | --- |
| 422 | __label__2 | Sorry- I forgot that this was HTML format...I ... |
| 437 | __label__2 | So it is metaphysical independence that create... |
| 492 | __label__1 | Risk-taking and disorders lead to abortions; ... |
| 182 | __label__2 | This is what it comes down to. Government is h... |
| 913 | __label__1 | So you think it is a baby, the moment the sper... |


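We save the training, development, and test sets and will load them again in part II of the notebook. Note that the files are tab-separated, despite the `.csv` extension, and are written without index or header.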

In [12]:
train_df.to_csv(train_path, sep='\t', index=False, header=False)
dev_df.to_csv(dev_path, sep='\t', index=False, header=False)
test_df.to_csv(test_path, sep='\t', index=False, header=False)
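Before moving on, it can be useful to confirm that a saved file has the expected layout (label, tab, text on each line). The following is a minimal sketch using only the standard library. `check_label_file` is a hypothetical helper, and the demo file is invented; to check the real data, the function would be called on `train_path`, `dev_path`, and `test_path`.

```python
import os
import tempfile

LABEL_PREFIX = "__label__"


def check_label_file(path):
    """Count lines per label and fail on any line without the label prefix."""
    counts = {}
    with open(path, encoding="utf-8") as f_in:
        for line_no, line in enumerate(f_in, start=1):
            label, _, text = line.rstrip("\n").partition("\t")
            if not label.startswith(LABEL_PREFIX) or not text:
                raise ValueError("malformed line {}".format(line_no))
            counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical two-line file in the same layout (label, tab, text).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf-8") as f_out:
        f_out.write("__label__1\tFirst toy post.\n")
        f_out.write("__label__2\tSecond toy post.\n")
    print(check_label_file(demo_path))  # {'__label__1': 1, '__label__2': 1}
```

The returned counts also give a quick view of the class balance in each saved split.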