# Tasks

## Task 1: Predict the topic labels of discussions

### Purpose

We apply natural language processing techniques learned in class, such as n-grams and stemming, to generate features and evaluate how effectively they classify argument discussions into 10 labeled topic groups. The architecture for this task is shown below:

![alt text](imgs/task1-arch.png "Architecture")

From our observation, people tend to use different keywords when discussing different topics, so we develop a keyword-based approach to classify discussions into topics. The intuition is to identify keywords that are specific to a topic rather than frequently used across all topics, so that a discussion can be classified by the keywords it contains.

### Preprocessing

For each discussion:
1. We removed characters that are neither digits nor English letters (i.e. characters outside the class `[A-Za-z0-9]` in regular-expression notation).
2. We removed stopwords from the text. The stopword list, which contains 2,400 stopwords for 11 languages, comes from the stopword corpus created by Porter (2001).
3. We used the Porter stemmer to convert the words in each discussion to their stemmed forms. There are 98,988 distinct stems in our dataset.
4. We treated the stems as unigrams and built a unigram occurrence vector as the representation of each discussion.
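
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual pipeline: the tiny `STOPWORDS` set and the `identity_stem` placeholder are hypothetical stand-ins for the 2,400-word stopword list and the Porter stemmer, and lowercasing, replacing filtered characters with spaces, and storing counts (rather than 0/1 flags) in the occurrence vector are all assumptions.

```python
import re
from collections import Counter

# Hypothetical stand-in for the multilingual stopword list used in the report.
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}


def identity_stem(word):
    """Placeholder stemmer; pass a real Porter stemmer function instead."""
    return word


def preprocess(text, stopwords=STOPWORDS, stem=identity_stem):
    # Step 1: replace every run of characters outside [A-Za-z0-9] with a space
    # (assumption: word boundaries are kept), then lowercase (assumption).
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", text).lower()
    # Step 2: drop stopwords.
    tokens = [t for t in cleaned.split() if t not in stopwords]
    # Step 3: map each remaining word to its stem.
    return [stem(t) for t in tokens]


def build_vocabulary(token_lists):
    # One vector position per distinct stem, in sorted order.
    vocab = sorted({t for tokens in token_lists for t in tokens})
    return {term: i for i, term in enumerate(vocab)}


def occurrence_vector(tokens, vocab):
    # Step 4: count how often each vocabulary stem occurs in the discussion.
    vector = [0] * len(vocab)
    for term, count in Counter(tokens).items():
        if term in vocab:
            vector[vocab[term]] = count
    return vector


docs = ["Gun control is a hot topic!", "The evolution of gun laws."]
token_lists = [preprocess(d) for d in docs]
vocab = build_vocabulary(token_lists)
vectors = [occurrence_vector(t, vocab) for t in token_lists]
```

With a real stemmer passed as `stem`, the vocabulary would hold stems instead of surface words; everything else stays the same.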

Because the number of unigrams is large, most representation vectors are sparse. This is a known issue that may affect classification performance. As future work, we plan to apply dimensionality reduction to shrink the feature vectors.

### Topic Classification


| Topic        | # of discussions  |
| ------------- |:-------------:|
| abortion | 564 |
| climate change | 40 |
| communism vs capitalism | 38 |
| death penalty | 25 |
| evolution | 871 |
| existence of God | 105 |
| gay marriage | 305 |
| gun control | 824 |
| healthcare | 81 |
| marijuana legalization | 13 |

The table above shows the distribution of topics. We split the dataset into 75% training data and 25% testing data, stratified by topic. We trained a multi-class SVM model with a linear kernel using LIBSVM (Chang and Lin, 2011). The experiment was conducted with 5-fold cross-validation. The accuracy of each round is as follows:

| Round 1 accuracy  | Round 2 accuracy  | Round 3 accuracy  | Round 4 accuracy  | Round 5 accuracy  | Mean accuracy |
| ------------- |:-------------|:-------------|:-------------|:-------------|:-------------:|
| 88.54% | 88.33% | 85.89% | 86.71% | 88.42% | 87.58% |


The mean accuracy is 87.58%, which indicates that the unigram occurrence representation is effective for this task: a keyword-based representation recognizes the topic of a discussion well.
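
To make the stratified 75%/25% split concrete, here is a minimal sketch in plain Python. It is an illustration only: the helper name `stratified_split`, the fixed seed, and the rounding rule are assumptions, and this write-up does not state which tool performed the split. The topic counts are taken from the table above.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_indices, test_indices), splitting each label separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Assumption: round the per-topic training size to the nearest integer.
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


topic_counts = {
    "abortion": 564, "climate change": 40, "communism vs capitalism": 38,
    "death penalty": 25, "evolution": 871, "existence of God": 105,
    "gay marriage": 305, "gun control": 824, "healthcare": 81,
    "marijuana legalization": 13,
}
labels = [topic for topic, n in topic_counts.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)
```

Splitting per topic matters here because the classes are highly imbalanced: a purely random split could leave a small topic such as marijuana legalization (13 discussions) with almost no test examples.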



## Task 2: Identify author stance in discussions with Doc2Vec

### Purpose


In a discussion, authors express different stances toward a topic. We want to find out how to identify an author's stance in a discussion effectively from their posts. Since the IAC dataset contains annotations of author stance for discussions, we convert each author's posts in a discussion to a Doc2Vec (Le and Mikolov, 2014) representation and evaluate whether that representation is effective for identifying author stance. The Doc2Vec model is pre-trained on the AP-NEWS corpus (Lau and Baldwin, 2016), which contains 25 million documents and 0.9 billion tokens.

In our analysis, we performed clustering and classification separately to evaluate the Doc2Vec features.

### Preprocessing

For each discussion, we first identify the authors. For each author, we removed characters that are neither digits nor English letters (i.e. characters outside the class `[A-Za-z0-9]` in regular-expression notation) from their posts. The filtered posts were then concatenated and converted to a single Doc2Vec vector.
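
The per-author grouping can be sketched as follows. This is a hedged illustration: the `(author, text)` input format and the function name `author_documents` are hypothetical, replacing filtered characters with spaces is an assumption, and the final Doc2Vec inference step is left out because it depends on the Doc2Vec implementation used.

```python
import re
from collections import defaultdict


def author_documents(posts):
    """Map each author to one cleaned, concatenated document.

    posts: iterable of (author, text) pairs from a single discussion
    (hypothetical input format).
    """
    grouped = defaultdict(list)
    for author, text in posts:
        # Replace runs of characters outside [A-Za-z0-9] with a space.
        cleaned = re.sub(r"[^A-Za-z0-9]+", " ", text).strip()
        if cleaned:
            grouped[author].append(cleaned)
    # Concatenate each author's posts in their original order.
    return {author: " ".join(texts) for author, texts in grouped.items()}


posts = [("alice", "I agree!"), ("bob", "No way..."), ("alice", "Here's why.")]
documents = author_documents(posts)
```

Each value in `documents` is the text that would then be passed to the pre-trained Doc2Vec model to obtain one vector per author.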

In the IAC dataset, the author stance in a discussion is recorded as a combination of votes from annotators. Each annotator can vote "pro", "anti", or "other". We take the majority vote as the author's stance label, using the following rules:
- If the number of "pro" votes equals the number of "anti" votes, or "other" dominates, choose "other" as the stance label.
- If "pro" dominates, choose "pro" as the stance label.
- If "anti" dominates, choose "anti" as the stance label.
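
These rules can be written as a small function. This is a sketch, not our original code: the name `stance_label` is hypothetical, and because the rules above do not say what happens when "other" ties with the leading side, the sketch assumes such ties also fall back to "other".

```python
def stance_label(pro, anti, other):
    """Return the majority stance label from annotator vote counts."""
    # "pro" and "anti" tie, or "other" has at least as many votes as the
    # leading side (assumption: a tie with "other" also yields "other").
    if pro == anti or other >= max(pro, anti):
        return "other"
    return "pro" if pro > anti else "anti"


examples = {
    (3, 1, 0): stance_label(3, 1, 0),  # "pro" dominates
    (1, 3, 1): stance_label(1, 3, 1),  # "anti" dominates
    (2, 2, 0): stance_label(2, 2, 0),  # "pro" equals "anti"
    (1, 0, 4): stance_label(1, 0, 4),  # "other" dominates
}
```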

### Analysis by K-means clustering

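We perform K-means clustering (Kanungo et al., 2002) with K=3 for each topic separately. Our metric is the Adjusted Mutual Information (AMI) score (Vinh et al., 2010), which measures how much information two clusterings share, corrected for chance: it is 1 when the two clusterings are identical and close to 0 when they agree no more than expected by chance. Here the K-means clusters are compared with the annotated stance labels. The results for each topic are below: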

| Topic | Adjusted Mutual Information Score (AMI) |
| ------------- |:-------------:|
| gay marriage | 0.007 |
| existence of God | 0.016 |
| death penalty | 0.009 |
| abortion | 0.004 |
| climate change | 0.000 |
| healthcare | 0.005 |
| evolution | 0.083 |
| communism vs capitalism | 0.045 |
| gun control | 0.003 |
| marijuana legalization | 0.068 |

According to these results, in which every AMI value is below 0.1, the Doc2Vec representation on its own is not effective for identifying author stance in discussions.
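
For reference, the general form of AMI from Vinh et al. (2010) is given below, which shows why values near zero mean chance-level agreement. Here $U$ and $V$ are the two clusterings, $I$ is mutual information, $H$ is entropy, and the expectation is taken over random clusterings with the same cluster sizes. The normalizer $\operatorname{norm}$ stands for one of the variants in that paper (for example the maximum or the arithmetic mean of the two entropies); which variant produced the numbers above is not recorded here, so it is left generic.

```latex
\mathrm{AMI}(U, V) =
  \frac{I(U; V) - \mathbb{E}\left[ I(U; V) \right]}
       {\operatorname{norm}\bigl( H(U), H(V) \bigr) - \mathbb{E}\left[ I(U; V) \right]}
```

When the observed mutual information equals its expected value under random assignment, the numerator vanishes and AMI is 0, regardless of the number of clusters.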


### Analysis by SVM classification

We follow the same settings as in Task 1 to train a three-class SVM model. The accuracy of each round of 5-fold cross-validation, per topic, is as follows:

| Topic        | Round 1 accuracy  | Round 2 accuracy  | Round 3 accuracy  | Round 4 accuracy  | Round 5 accuracy  | Mean accuracy |
| ------------- |:-------------|:-------------|:-------------|:-------------|:-------------|:-------------:|
| communism vs capitalism | 60.00% | 79.26% | 56.72% | 63.91% | 54.55% | 62.89% |
| marijuana legalization | 64.66% | 66.38% | 75.44% | 78.76% | 72.57% | 71.56% |
| death penalty | 45.39% | 54.44% | 47.21% | 46.47% | 45.90% | 47.88% |
| climate change | 48.30% | 47.95% | 49.66% | 42.07% | 33.33% | 44.26% |
| existence of God | 43.44% | 48.43% | 45.42% | 49.10% | 46.86% | 46.65% |
| healthcare | 60.00% | 44.29% | 38.57% | 62.32% | 57.35% | 52.51% |
| gun control | 64.28% | 46.28% | 52.18% | 54.81% | 54.69% | 54.45% |
| abortion | 47.04% | 45.69% | 52.90% | 43.59% | 51.48% | 48.14% |
| gay marriage | 64.82% | 58.96% | 63.25% | 58.62% | 59.62% | 61.05% |
| evolution | 59.85% | 53.31% | 59.85% | 58.29% | 64.37% | 59.14% |

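The classification results show that the Doc2Vec representation reaches mean accuracies between 44.26% and 71.56% across topics, and stays below 63% for every topic except marijuana legalization.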

### References

1. Porter, M. F. (2001). Snowball: A language for stemming algorithms.
2. Chang, C. C., & Lin, C. J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.
3. Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 1188-1196).
4. Lau, J. H., & Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368.
5. Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu, A. Y. (2002). An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 881-892.
6. Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct), 2837-2854.