# Tasks

## Task 1: Predict the topic labels of discussions

### Purpose

We apply natural language processing techniques learned in class, such as n-grams and stemming, to generate features, and we evaluate how well those features classify argument discussions into 10 labeled topic groups. The architecture for this task is shown below:

![alt text](imgs/task1-arch.png "Architecture")


### Preprocessing

The dataset contains the following number of discussions per topic:

| Topic | # of discussions |
| ----- |:----------------:|
| abortion | 564 |
| climate change | 40 |
| communism vs capitalism | 38 |
| death penalty | 25 |
| evolution | 871 |
| existence of God | 105 |
| gay marriage | 305 |
| gun control | 824 |
| healthcare | 81 |
| marijuana legalization | 13 |



For each discussion:
1. We first removed every character that is neither a digit nor an English letter (that is, any character outside the regular-expression class `[A-Za-z0-9]`).
2. We removed stopwords from the text. The stopword list, which contains 2,400 stopwords for 11 languages, comes from the stopword corpus created by Porter et al. (2001).
3. We applied the Porter stemmer to convert the words in each discussion to their stemmed form. Our dataset contains 98,988 distinct stems.
4. We treated the stems as unigrams and built a unigram occurrence vector to represent each discussion.
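
The four steps above can be sketched as follows. This is a minimal illustration, not the code in `task1.py`: `toy_stem` is a hypothetical stand-in for the Porter stemmer (for real stemming, one could pass `nltk.stem.PorterStemmer().stem` instead), the stopword set is a three-word placeholder, and both the lowercasing and the use of counts rather than binary indicators are assumptions.

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set

# Step 1: anything outside [A-Za-z0-9] is treated as a separator.
NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer (NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text: str, stopwords: Set[str], stem: Callable[[str], str]) -> List[str]:
    """Steps 1-3: strip non-alphanumerics, drop stopwords, stem the rest."""
    cleaned = NON_ALNUM.sub(" ", text).lower()  # lowercasing is an assumption
    return [stem(tok) for tok in cleaned.split() if tok not in stopwords]


def build_vocabulary(token_lists: Iterable[List[str]]) -> Dict[str, int]:
    """Assign each distinct stem a column index, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def occurrence_vector(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Step 4: count how often each vocabulary stem occurs in one discussion."""
    vector = [0] * len(vocab)
    for tok, count in Counter(tokens).items():
        if tok in vocab:
            vector[vocab[tok]] = count
    return vector


# Placeholder data, not taken from the project dataset.
docs = ["Guns are banned!", "The ban on guns"]
stopwords = {"the", "are", "on"}

token_lists = [tokenize(d, stopwords, toy_stem) for d in docs]
vocab = build_vocabulary(token_lists)
vectors = [occurrence_vector(t, vocab) for t in token_lists]
print(vocab)
print(vectors)
```

Passing the stemmer as a parameter keeps the sketch free of external dependencies while leaving room to plug in a real Porter implementation.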

Because the number of unigrams is large, most representation vectors are sparse. This is a known issue that may hurt classification performance. As future work, we plan to adopt dimensionality reduction methods to shrink the feature vectors.
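
To make the sparsity concern concrete, the sketch below measures the fraction of zero entries in a set of occurrence vectors and then drops features that appear in too few discussions. Document-frequency pruning is only one simple option for shrinking the feature space and is an assumption here, not a method the project used; the function names, the `min_df` threshold, and the toy vectors are hypothetical.

```python
from typing import List, Tuple


def sparsity(vectors: List[List[int]]) -> float:
    """Return the fraction of entries that are zero."""
    total = sum(len(row) for row in vectors)
    zeros = sum(1 for row in vectors for value in row if value == 0)
    return zeros / total


def prune_rare_features(
    vectors: List[List[int]], min_df: int
) -> Tuple[List[List[int]], List[int]]:
    """Keep only columns that are non-zero in at least `min_df` rows."""
    n_features = len(vectors[0])
    doc_freq = [sum(1 for row in vectors if row[j] != 0) for j in range(n_features)]
    kept = [j for j, df in enumerate(doc_freq) if df >= min_df]
    reduced = [[row[j] for j in kept] for row in vectors]
    return reduced, kept


# Toy occurrence vectors: 3 discussions, 4 unigrams (placeholder data).
toy_vectors = [
    [1, 0, 0, 2],
    [0, 0, 3, 1],
    [1, 0, 0, 0],
]

reduced, kept = prune_rare_features(toy_vectors, min_df=2)
print(f"sparsity before: {sparsity(toy_vectors):.2f}")
print(f"sparsity after:  {sparsity(reduced):.2f}")
print(f"kept columns: {kept}")
```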

### Article Classification
We split the dataset into a training set (75%) and a test set (25%) using stratified sampling. We trained a multi-class SVM with a linear kernel (LIBSVM; Chang and Lin, 2011). We also performed 5-fold cross-validation, which gave an average accuracy of 87.58%.
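
The procedure can be sketched with scikit-learn, whose `SVC` class wraps LIBSVM. This is an illustration under stated assumptions rather than the project's actual code: the data are synthetic (generated by `make_classification`), the random seeds are arbitrary, and running cross-validation over the whole dataset rather than only the training split is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the unigram vectors and topic labels (assumption).
X, y = make_classification(
    n_samples=400,
    n_features=50,
    n_informative=10,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# 75% / 25% split; stratify=y keeps the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Linear-kernel SVM with penalty parameter C=1.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# 5-fold cross-validation; for classifiers, cv=5 uses stratified folds.
cv_scores = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=5)

print(f"test accuracy: {test_accuracy:.3f}")
print(f"cross-validation accuracies: {cv_scores}")
print(f"mean cross-validation accuracy: {cv_scores.mean():.3f}")
```

Stratification matters for a dataset like this one, where the smallest topic has only 13 discussions: an unstratified split could leave a rare topic almost absent from the test set.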

### Python Programs

```python
# Run task1.py
# - The testing accuracy is about 87%.
# - The SVM model (kernel='linear', penalty parameter C=1) is saved as svm_model.pkl
# - Some additional dump files:
#     - unigram_dict.pickle: binary dump of unigram_dict
#     - unigram_dict.txt: list of all unigrams
#     - discussions_unigram_label_dict.txt: unigram vectors of discussions; each row has the format
#       "[Discussion ID],[Discussion Topic Label in text],[Discussion Unigram Occurrence Vector]"
# - The trained model achieves higher than 90% accuracy when tested on all discussions
# - Accuracy for 5-fold cross validation: [ 0.88541667  0.88327526  0.85888502  0.86713287  0.88421053]

import task1
task1.main()
```

```text
Loading dataset files...
5000 dicussions were loaded
10000 dicussions were loaded
11799 dicussions were loaded
Loading topic file...
Loading author stance file...
===== Start preprocessing =====
[nltk_data] Downloading package stopwords to /home/tomelf/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Load unigram_dict ... 
Load discussions_unigram_label_dict ... 
1000 discussions were loaded
2000 discussions were loaded
2865 discussions were loaded
===== Done! =====
Divide data into train/test sets
Load pre-trained SVM model
5-fold cross validation: [ 0.88541667  0.88327526  0.85888502  0.86713287  0.88421053]
```
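
As a quick check, the reported 87.58% average can be recomputed from the five fold accuracies printed above. The sample standard deviation is an extra summary added here for illustration; the original report does not state it.

```python
from statistics import mean, stdev

# Fold accuracies copied from the program output above.
fold_accuracies = [0.88541667, 0.88327526, 0.85888502, 0.86713287, 0.88421053]

mean_accuracy = mean(fold_accuracies)
spread = stdev(fold_accuracies)  # sample standard deviation across folds

print(f"mean accuracy: {mean_accuracy:.2%}")  # prints "mean accuracy: 87.58%"
print(f"std deviation: {spread:.2%}")
```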
