# Stance Detection in Political Debates (Somasundaran & Wiebe 2010)

This notebook walks through a first, simple example of applied stance detection. The data was used in:

Somasundaran & Wiebe (2010). Recognizing Stances in Ideological On-Line Debates.
https://www.aclweb.org/anthology/W10-0214/

Here we partly replicate the results presented in the paper. To keep things simple, we focus on two parts of the corpus (abortion, gay rights) and use only unigrams as features for now.

If you want to experiment with the code on your own, you need to download the data and adjust the path definitions.

The data can be downloaded from:

http://mpqa.cs.pitt.edu/corpora/political_debates/

In [1]:
import io
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

## 1. Reading in Data

In [2]:
# Adjust these paths to the location of the corpus on your machine
abortion_data_path = '/home/robin/research/corpora/political_debates_SomasundaranWiebeAcl2009/abortion'
gayrights_data_path = '/home/robin/research/corpora/political_debates_SomasundaranWiebeAcl2009/gayRights'

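The following block reads in the data. Each txt file is read line by line: the `#stance` line provides the label, all other lines starting with `#` are skipped, and the remaining lines are joined into a single string.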

In [3]:
# Full vocab list
vocab = []

# Loading and first preprocessing of abortion data
abortion_data = []
abortion_stance = []

for file in os.listdir(abortion_data_path):
    abortion_file_path = os.path.join(abortion_data_path, file)
    
    with io.open(abortion_file_path, mode='r', encoding='utf-8') as f_in:
        
        try:
            text = []
            for line in f_in.read().split('\n'):
                if line.startswith('#stance'):
                    abortion_stance.append(int(line[-1]))
                elif line.startswith('#'):
                    continue
                else:
                    text.append(line)
                                
            text = " ".join(text)
            #text = [token.strip() for token in text]
            vocab.extend(text.split())
            
            abortion_data.append(text)
        except Exception:
            # Skip files that cannot be decoded or parsed
            pass

# Loading and first preprocessing of gay rights data        
gayrights_data = []
gayrights_stance = []

for file in os.listdir(gayrights_data_path):
    gayrights_file_path = os.path.join(gayrights_data_path, file)
    
    with io.open(gayrights_file_path, mode='r', encoding='utf-8') as f_in:
        
        try:
            text = []
            for line in f_in.read().split('\n'):
                if line.startswith('#stance'):
                    gayrights_stance.append(int(line[-1]))
                elif line.startswith('#'):
                    continue
                else:
                    text.append(line)
                        
            text = " ".join(text)
            #text = [token.strip() for token in text]
            
            vocab.extend(text.split())
            
            gayrights_data.append(text)
        except Exception:
            # Skip files that cannot be decoded or parsed
            pass

vocab = set(vocab)
        
data_total = abortion_data + gayrights_data
stance_total = abortion_stance + gayrights_stance

print("Abortion Data Size: {}".format(len(abortion_data)))
print("Gay Rights Data Size: {}".format(len(gayrights_data)))
print("Total Data Size: {}".format(len(data_total)))

Abortion Data Size: 1082
Gay Rights Data Size: 1927
Total Data Size: 3009
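
The two loading loops above are identical except for their paths, so the parsing step could be pulled into one helper. The following is a minimal sketch, not the notebook's actual code: the function name `parse_post` and the sample header lines (`#stance=stance1`, `#originalTopic=example`) are assumptions made for illustration, and the real files may look different.

```python
from typing import Optional, Tuple


def parse_post(raw: str) -> Tuple[str, Optional[int]]:
    """Split one raw post into its text and its stance label (None if missing)."""
    stance = None
    text_lines = []
    for line in raw.split("\n"):
        if line.startswith("#stance"):
            # The label is taken from the last character of the line, as in the cell above
            stance = int(line.strip()[-1])
        elif line.startswith("#"):
            # Other metadata lines are ignored
            continue
        else:
            text_lines.append(line)
    return " ".join(text_lines), stance


# Hypothetical file content, assumed for illustration only
sample = "#stance=stance1\n#originalTopic=example\nFirst line of the post.\nSecond line."
text, stance = parse_post(sample)
print(text)    # First line of the post. Second line.
print(stance)  # 1
```

Returning the text and the label together lets the caller append both only when a label was found, which keeps the data and stance lists aligned.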


In [4]:
vocab = {token: i for i, token in enumerate(vocab)}
print("Vocabulary Size: {}".format(len(vocab)))

Vocabulary Size: 36278


## 2. Feature Extraction and Data Splitting

We use `CountVectorizer` to create a matrix representation of token counts. The `ngram_range` parameter sets the minimum and maximum length of the token sequences that become features. Setting it to `(1,1)` results in a unigram representation.

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
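
To make the effect of `ngram_range` concrete, here is a small sketch on two toy sentences (the sentences and variable names are invented for illustration). It compares the features produced by `(1, 1)` with those produced by `(1, 2)`, using default `CountVectorizer` settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, assumed for illustration only
docs = ["the law should change", "the law should stay"]

# Unigrams only
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1)).fit(docs)
unigram_features = sorted(unigram_vectorizer.vocabulary_)

# Unigrams plus bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(docs)
bigram_features = sorted(bigram_vectorizer.vocabulary_)

print(unigram_features)
# ['change', 'law', 'should', 'stay', 'the']
print(len(bigram_features))
# 9 (the 5 unigrams plus 'the law', 'law should', 'should change', 'should stay')
```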

In [5]:
# Defining count vectorizer instance
vectorizer = CountVectorizer(ngram_range=(1,1), vocabulary=vocab)

# Vectorizing data (the vocabulary is fixed, so all three matrices share the same columns)
X_abortion = vectorizer.fit_transform(abortion_data)
X_gayrights = vectorizer.fit_transform(gayrights_data)
X_total = vectorizer.fit_transform(data_total)
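
One point that may be worth checking: the vocabulary above comes from a plain whitespace split, while `CountVectorizer` by default lowercases the text and extracts tokens of two or more alphanumeric characters. Vocabulary entries that contain capital letters or attached punctuation can therefore never be matched, and their columns stay at zero. The toy vocabulary and sentence below are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy vocabulary as a whitespace split might produce it (assumed for illustration)
toy_vocab = {"Rights,": 0, "rights": 1}

toy_vectorizer = CountVectorizer(ngram_range=(1, 1), vocabulary=toy_vocab)
counts = toy_vectorizer.fit_transform(["Rights, rights and more Rights,"]).toarray()[0]

print(counts)
# [0 3]: every occurrence is counted under 'rights'; the 'Rights,' column is never used
```

Building the vocabulary with the vectorizer itself (by omitting the `vocabulary` argument) would avoid such unused columns, at the cost of a feature set that differs from the one used in this notebook.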

In [6]:
# Convert the sparse matrices to dense arrays (memory-hungry; SVC also accepts sparse input)
X_abortion = X_abortion.toarray()
X_gayrights = X_gayrights.toarray()
X_total = X_total.toarray()

In [7]:
# Hold out 25% of each dataset for testing; the fixed random_state makes the splits reproducible
X_abortion_train, X_abortion_test, y_abortion_train, y_abortion_test = train_test_split(X_abortion, abortion_stance, test_size=0.25, random_state=42)
X_gayrights_train, X_gayrights_test, y_gayrights_train, y_gayrights_test = train_test_split(X_gayrights, gayrights_stance, test_size=0.25, random_state=42)
X_total_train, X_total_test, y_total_train, y_total_test = train_test_split(X_total, stance_total, test_size=0.25, random_state=42)

## 3. Training Classifier

Next, the classifiers are defined and trained. Feel free to change hyperparameters such as `C`. The documentation is available at:

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

A quick note on the C parameter, quoted from the Stack Exchange answer linked below:
"The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, you should get misclassified examples, often even if your training data is linearly separable."

https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel
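
The quoted trade-off can be observed on a toy problem. The sketch below uses invented, overlapping two-dimensional data and a linear kernel (an assumption; the quote refers to linear SVMs) and compares the margin width `2 / ||w||` for a small and a large value of `C`.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clusters (toy data, assumed for illustration)
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.randn(20, 2) - 1.0, rng.randn(20, 2) + 1.0])
y_toy = np.array([0] * 20 + [1] * 20)


def margin_width(C):
    """Fit a linear SVM and return the width of its margin, 2 / ||w||."""
    clf = SVC(kernel="linear", C=C).fit(X_toy, y_toy)
    return 2.0 / np.linalg.norm(clf.coef_)


margin_small_C = margin_width(0.01)
margin_large_C = margin_width(100.0)

# A small C tolerates more margin violations and yields the wider margin
print(margin_small_C > margin_large_C)  # True
```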

In [8]:
# Create SVM instances with default hyperparameters (RBF kernel, C=1.0)
clf_abortion = SVC()
clf_gayrights = SVC()
clf_total = SVC()

In [9]:
clf_abortion.fit(X_abortion_train, y_abortion_train)
clf_gayrights.fit(X_gayrights_train, y_gayrights_train)
clf_total.fit(X_total_train, y_total_train)

SVC()

## 4. In-Domain Testing

In this section the trained models are tested in-domain, i.e. a classifier trained on abortion data is tested on held-out abortion data.

In [10]:
y_abortion_pred = clf_abortion.predict(X_abortion_test)
y_gayrights_pred = clf_gayrights.predict(X_gayrights_test)
y_total_pred = clf_total.predict(X_total_test)

In [11]:
accuracy_abortion = accuracy_score(y_abortion_test, y_abortion_pred)
accuracy_gayrights = accuracy_score(y_gayrights_test, y_gayrights_pred)
accuracy_total = accuracy_score(y_total_test, y_total_pred)

In [12]:
print('Accuracy Abortion: {}'.format(round(accuracy_abortion, 3)))
print('Accuracy Gay Rights: {}'.format(round(accuracy_gayrights, 3)))
print('Accuracy Total: {}'.format(round(accuracy_total, 3)))

Accuracy Abortion: 0.631
Accuracy Gay Rights: 0.633
Accuracy Total: 0.636
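
Accuracy values are easier to interpret next to a simple reference point. The sketch below is an addition that is not part of the original experiment: on invented toy labels, it computes the accuracy of always predicting the most frequent training label, plus the macro-averaged F1 score via the `f1_score` function imported at the top of the notebook.

```python
from collections import Counter

from sklearn.metrics import accuracy_score, f1_score

# Toy labels, assumed for illustration only
y_train_toy = [1, 1, 1, 2, 2]
y_test_toy = [1, 1, 1, 2]

# Always predict the most frequent training label
majority_label = Counter(y_train_toy).most_common(1)[0][0]
y_pred_baseline = [majority_label] * len(y_test_toy)

baseline_accuracy = accuracy_score(y_test_toy, y_pred_baseline)
baseline_macro_f1 = f1_score(
    y_test_toy, y_pred_baseline, average="macro", zero_division=0
)

print(baseline_accuracy)            # 0.75
print(round(baseline_macro_f1, 3))  # 0.429
```

Applied to the real splits, the same comparison would show how far the SVM accuracies lie above a majority-class guess. Macro F1 penalizes a classifier that ignores the minority class, which accuracy alone does not reveal.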


## 5. Cross-Domain Testing

In this section trained models are tested on cross-domain data, e.g. a classifier trained on abortion data is tested on gay rights data.

In [13]:
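# Each classifier predicts on the test set of the other domain
y_abortion_pred_cross = clf_gayrights.predict(X_abortion_test)   # trained on gay rights, tested on abortion
y_gayrights_pred_cross = clf_abortion.predict(X_gayrights_test)  # trained on abortion, tested on gay rights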

In [14]:
accuracy_abortion_cross = accuracy_score(y_abortion_test, y_abortion_pred_cross)
accuracy_gayrights_cross = accuracy_score(y_gayrights_test, y_gayrights_pred_cross)

In [15]:
print('Accuracy Abortion Cross-Domain: {}'.format(round(accuracy_abortion_cross, 3)))
print('Accuracy Gay Rights Cross-Domain: {}'.format(round(accuracy_gayrights_cross, 3)))

Accuracy Abortion Cross-Domain: 0.576
Accuracy Gay Rights Cross-Domain: 0.616
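
Cross-domain prediction only works when both domains are mapped to the same feature columns, which the shared vocabulary takes care of in this notebook. A minimal sketch of the same idea on invented toy texts follows. Here the vectorizer is fitted on the source domain and then reused for the target domain, a variation on the fixed-vocabulary approach used above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy source and target domains, assumed for illustration only
source_texts = ["ban it now", "keep it legal", "ban this law", "keep this right"]
source_labels = [1, 2, 1, 2]
target_texts = ["ban the new law", "keep the new law"]

# Fit the feature space on the source domain, then reuse it for the target domain
toy_vectorizer = CountVectorizer(ngram_range=(1, 1))
X_source = toy_vectorizer.fit_transform(source_texts)
X_target = toy_vectorizer.transform(target_texts)

# Train on the source domain, predict on the target domain
toy_clf = SVC(kernel="linear").fit(X_source, source_labels)
target_pred = toy_clf.predict(X_target)

# Both matrices have the same number of columns
print(X_source.shape[1] == X_target.shape[1])  # True
```

Words that occur only in the target domain (here `the` and `new`) have no column and are silently dropped, which is one reason cross-domain accuracy tends to be lower.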
