# Political stance classification
In this project, we will solve a supervised machine learning task. The data that we will use for training and evaluation will be annotated collectively by us and other individuals.
The machine learning task that will be addressed in this project is to develop a text classifier that determines whether a given textual comment expresses an opinion that is positive or negative towards Brexit: the United Kingdom leaving the European Union.
The first two parts of this project deal with data annotation and are solved individually. In the third and final part, you will implement the classification system, and here you will work in a group of two or three people.
Didactic purpose of this assignment:
* Getting some practical understanding of annotating data and inter-annotator agreement.
* Practice several aspects of system development based on machine learning: getting data, cleaning data, processing and selecting features, selecting and tuning a model, evaluating.
* Analyzing results in a machine learning experiment.
Part 1: Crowdsourcing the data
We collect at least 100 Brexit-related comments from social media or the comment fields from online articles. Good places to trawl for comments include social media sites such as Youtube, and newspaper sites in Britain and elsewhere, such as the Telegraph, the Guardian, Daily Mail, the Independent, the Sun, Daily Express, Breitbart, Huffington Post or other English-language sources.
Collect comments that express a pro- or anti-Brexit stance. We will create a balanced dataset, so you should try to collect about 50 instances of each stance. We did not include comments not expressing an opinion about Brexit. Also, since other annotators will see each comment in isolation, don't include comments where you need to read previous comments to understand the opinion (e.g. "You're wrong!"). Try to select comments from a variety of sources.
Store all the comments you collected in an Excel file. This file should have two columns. The first column will store your annotation of whether this comment is pro-Brexit (represented as 1 in the spreadsheet) or anti-Brexit (0 in the spreadsheet). The second column should store the text of the comment. Make sure that the text of each comment is stored in a single cell. The following figure shows an example.

In [1]:
import pandas as pd

# the actual classification algorithm
from sklearn.svm import LinearSVC

# for converting training and test datasets into matrices
# TfidfVectorizer does this specifically for documents
from sklearn.feature_extraction.text import TfidfVectorizer

# for bundling the vectorizer and the classifier as a single "package"
from sklearn.pipeline import make_pipeline

# for splitting the dataset into training and test sets 
from sklearn.model_selection import train_test_split

# for evaluating the quality of the classifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate

from sklearn.feature_extraction.text import TfidfTransformer
# we can choose differnt classifier to model our algorithm
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
 

from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer






we wrote a function to read the dataset. This function returns a list of documents X with their corresponding sentiment labels Y. Since we have an array of labels(value) for each row, we use some assumptions to select a label. The array size is either 1 or two or more. For example, if the array size is 1, then check whether it is '1' or '0' and assign it to that particular row. If, the array size is two we have two scenarios, 1st if both annotations are the same then simply we can assign it to its corresponding comment. 2nd if the annotation is '1' and '-1' we should remove that line(row) because it is ambiguous comment(let us check mathematically, count of 1's is 1 and len(count)/2 is also 1, 1> 1 not true, this means we should remove the comment). The same is true for '1' and '0', '0' and '-1'...etc. Let's see if the size of an array is 3, in this case, the majority vote will be assigned to the comment. For example, if we have '0/0/-1', we have two annotators with '0' and one with '-1', so '0' could be assigned to the comment. Finally, we count the number of annotators with Pro-Brexit, Anti-Brexit and Neither pro or anti-Brexit annotations from the training data set. Finally, the function returns a single label for each row for the whole dataset.

In [2]:

def read_data(filename):
    data = pd.read_csv(filename, sep="\t",names=['target', 'comments'],skiprows=0)
    
    labels = data['target'].astype('str').values
    removeRows = []
    count_pos=0
    count_neg=0
    count_none=0
    for i in range(len(labels)):
        labels[i] = labels[i].replace('/','')
        
        if(labels[i].count('1') > len(labels[i])/2):
            labels[i] = 1
            count_pos+=1
        elif(labels[i].count('0') > len(labels[i])/2):
            labels[i] = 0
            count_neg+=1
        else: 
            removeRows.append(i)
            count_none+=1
    data = data.replace(data['target'].values, labels)
    print('Positive count:',count_pos)
    print('Negative count:',count_neg)
    print('None count:',count_none)
    #drop ambigous answers
    for i in range(len(removeRows)):
        data = data.drop([removeRows[i]])
    
    return data


data returns each row with a single label based on the model we developed shown in the above line.

In [5]:
# working with preliminary sample set
print("From preliminary data set ......")
df_preliminay_sample= read_data("a2_first_sample.tsv")
# working with first annotation round traing data
print("From first annoation round ......")
df_first_training_anota= read_data("a2a_train_round1.tsv")
# working with the final training 
print("From the final trianing set ......")
df_traing = read_data("a2a_train_final.tsv")
print("From the final test set ......")
df_test = read_data("a2a_test_final.tsv")


From preliminary data set ......
Positive count: 466
Negative count: 428
None count: 0
From first annoation round ......
Positive count: 3492
Negative count: 3460
None count: 0
From the final trianing set ......
Positive count: 6378
Negative count: 5434
None count: 1705
From the final test set ......
Positive count: 616
Negative count: 544
None count: 0


In [6]:
Xsample= df_preliminay_sample['comments']
Ysample= df_preliminay_sample['target']

Xround1= df_first_training_anota['comments']
Yround1= df_first_training_anota['target']

Xfinal= df_traing['comments']
Yfinal= df_traing['target']

X_sample1_train, X_sample1_test, Y_sample1_train, Y_sample1_test = train_test_split(Xsample, Ysample, test_size=0.2, random_state=12345)
X_round1_train, X_round1_test, Y_round1_train, Y_round1_test = train_test_split(Xround1, Yround1, test_size=0.2, random_state=12345)
X_final_train, X_final_test, Y_final_train, Y_final_test = train_test_split(Xfinal, Yfinal, test_size=0.2, random_state=12345)


In [7]:
Y_final_train=Y_final_train.astype(int)
Y_final_test = Y_final_test.astype(int)

In [8]:
Y_sample1_test.size,Y_round1_test.size,Y_final_test.size

(179, 1391, 2363)

This function builds a Pipeline for document classification, consisting of a vectorizer and a classifier. The TfidfVectorizer is used to convert a document collection into a matrix that can be used with scikit learn learning algorithms. (Here are some additional details.) We can choose the classifiers. For now, LinearSVC is a type of linear classifier, which often tends to work quite well for high-dimensional feature spaces.

After combining the vectorizer and the classifier into a Pipeline, we call fit to train the complete model.

In [9]:
def train_document_classifier(X,Y):
    pipeline = make_pipeline(TfidfVectorizer(), 
                            LinearSVC()
                            #DecisionTreeClassifier()
                            #MLPClassifier(alpha=1)
                            #SGDClassifier(loss="log", max_iter=50)

                            )
    pipeline.fit(X, Y)
    return pipeline


Let's fit each datasets using train_document_classifier method.

In [10]:
clf_sample = train_document_classifier(X_sample1_train, Y_sample1_train)
clf_round1 = train_document_classifier(X_round1_train, Y_round1_train)
clf_final = train_document_classifier(X_final_train, Y_final_train)

In [11]:
Yguess0 = clf_sample.predict(X_sample1_test)
Yguess1 = clf_round1.predict(X_round1_test)
Yguess2 = clf_final.predict(X_final_test)

accScore0 = accuracy_score(Y_sample1_test, Yguess0)
accScore1 = accuracy_score(Y_round1_test, Yguess1)
accScore2 = accuracy_score(Y_final_test, Yguess2)

print("The accuracy score from the sample set:", accScore0)
print("The accuracy score from the first round dataset:", accScore1)
print("The accuracy score from the second/final round dataset:", accScore2)

The accuracy score from the sample set: 0.659217877094972
The accuracy score from the first round dataset: 0.7268152408339325
The accuracy score from the second/final round dataset: 0.7917900973338976


Let us calculate the cross-validation and confusion matrix for each data set and check how many are labeled correctly and how many are misclassified.

In [12]:
print('-----Cross_validation for the first round------')
print(cross_validate(clf_sample, X_sample1_train, Y_sample1_train))

print('-----Cross_validation for the first round------')
print(cross_validate(clf_round1, X_round1_train, Y_round1_train))

print('-----Cross_validation for the final trining set------')
print(cross_validate(clf_final, X_final_train, Y_final_train))


print('-----Confusion Matrix for preliminary sample training and test set------')
print(confusion_matrix(Y_sample1_test, Yguess0))

print('-----Confusion Matrix for round 1 training and test set------')
print(confusion_matrix(Y_round1_test, Yguess1))

print('-----Confusion Matrix for the final trainin and tset set------')
print(confusion_matrix(Y_final_test, Yguess2))


-----Cross_validation for the first round------




{'fit_time': array([0.02984643, 0.01979589, 0.02002573]), 'score_time': array([0.00746918, 0.00748563, 0.00726318]), 'test_score': array([0.69456067, 0.63598326, 0.62025316]), 'train_score': array([0.99369748, 1.        , 0.9916318 ])}
-----Cross_validation for the first round------
{'fit_time': array([0.16993976, 0.15780473, 0.16638851]), 'score_time': array([0.06376386, 0.06249571, 0.06092954]), 'test_score': array([0.71790723, 0.72545847, 0.72638964]), 'train_score': array([0.97275425, 0.96897761, 0.96952535])}
-----Cross_validation for the final trining set------




{'fit_time': array([0.29655957, 0.2737844 , 0.27260184]), 'score_time': array([0.1049149 , 0.10553718, 0.11014605]), 'test_score': array([0.76920635, 0.76571429, 0.76849794]), 'train_score': array([0.96856644, 0.96491507, 0.96936508])}
-----Confusion Matrix for preliminary sample training and test set------
[[48 36]
 [25 70]]
-----Confusion Matrix for round 1 training and test set------
[[500 185]
 [195 511]]
-----Confusion Matrix for the final trainin and tset set------
[[ 821  257]
 [ 235 1050]]


For the preliminary sample dataset, we have 49 true positives and 70 true negative documents correctly classified and the remaining 61 documents are misclassified out of 179 documents.

For the first round dataset, we have 500 true positives and 511 true negative total of 1011 documents are correctly classified and 380 documents are misclassified out of 1391 documents.

The confusion matrix for the final training and test set presents 821 true positives and 1050 true negative total of 1871 documents correctly classified and 492 documents are misclassified out of 2363 documents.



We can use different classifiers and check the accuracy for each classifier.
DummyClassifier() with  accuracy .........
DecisionTreeClassifier() with the  accuracy ..............
KNeighborsClassifier() with the accuracy of ............
LinearSVC() with accuracy of 0.8017241379310345
LinearSVR() .........
MLPClassifier(alpha=1) ..........
SGDClassifier(loss="hinge", penalty="l2", max_iter=5) .........
SGDClassifier(loss="log", max_iter=5) ...........
and many more classifiers we can add here............

LinearRegression()
...
...
Finally, we can choose the best model.


For example, let's try a classifier such as MLPClassifier and SGDClassifier(loss=" log", max_iter=5) the hyperparameter max_iter will be changed based on the size of the iteration.

In [13]:
def train_document_classifier2(X,Y):
    pipeline = make_pipeline(TfidfVectorizer(), 
                            MLPClassifier(alpha=1)
                            )
    pipeline.fit(X, Y)
    return pipeline

In [14]:
clf_final = train_document_classifier2(X_final_train, Y_final_train)
Yguess6 = clf_final.predict(X_final_test)
accScore20 = accuracy_score(Y_final_test, Yguess6)
print("The accuracy score from the second/final round dataset:", accScore20)

The accuracy score from the second/final round dataset: 0.7735928903935675


In [15]:
def train_document_classifier1(X,Y):
    pipeline = make_pipeline(TfidfVectorizer(), 
                            SGDClassifier(loss="log", max_iter=5)
                            )
    pipeline.fit(X, Y)
    return pipeline

In [16]:
number_of_test = 5
acc_test = 0
for i in range(number_of_test):
    
    clf_commentss = train_document_classifier1(X_final_train, Y_final_train)
    Yguess4 = clf_commentss.predict(X_final_test)
    accScore = accuracy_score(Y_final_test, Yguess4)
    print("Run", (i+1), "got score:", accScore)
    acc_test += accScore
    print("Current average:", acc_test/(i+1))
    print("############################")
print('-----Total Result-----')
print("Final Result: ", acc_test/number_of_test)



Run 1 got score: 0.7867118070249682
Current average: 0.7867118070249682
############################
Run 2 got score: 0.781633516716039
Current average: 0.7841726618705036
############################
Run 3 got score: 0.7833262801523487
Current average: 0.7838905346311186
############################
Run 4 got score: 0.7867118070249682
Current average: 0.784595852729581
############################
Run 5 got score: 0.7875581887431231
Current average: 0.7851883199322894
############################
-----Total Result-----
Final Result:  0.7851883199322894


The accuracy of the SGDClassifier is more than Linear SVC. Therefore, we chose SGDClassifier as the best classifier for our model.

we can apply to use the final test set for the final training set and calculate the accuracy of the training set using the given test set.

In [17]:
# For the final training set 
Xtrain = df_traing['comments']
Ytrain= df_traing['target'].astype(int)
# For the final test set
Xtest = df_test['comments']
Ytest = df_test['target'].astype(int)

Now, we can calculate the accuracy for the final data set and domain shift affects the performance of a classifier.
Let's see how well the classifier perform on the test sets

In [18]:
clf_final_with_test = train_document_classifier(Xtrain, Ytrain)
Yguess_with_test = clf_final_with_test.predict(Xtest)
accScore = accuracy_score(Ytest, Yguess_with_test)
print("The accuracy score for the final dateset:", accScore)

The accuracy score for the final dateset: 0.8017241379310345


Let us calculate the cross-validation and confusion matrix for the last training and test set and check how many are labeled correctly and how many are misclassified.

In [19]:
print('-----Cross_validation for the first round------')
print(cross_validate(clf_final_with_test, Xtrain, Ytrain))

print('-----Confusion Matrix for the final trainin and tset set------')
print(confusion_matrix(Ytest, Yguess_with_test))

-----Cross_validation for the first round------




{'fit_time': array([0.42187452, 0.33099151, 0.3473568 ]), 'score_time': array([0.12289691, 0.14712071, 0.12762499]), 'test_score': array([0.7813611 , 0.76809754, 0.76047752]), 'train_score': array([0.96113792, 0.96177778, 0.96419048])}
-----Confusion Matrix for the final trainin and tset set------
[[421 123]
 [107 509]]


The confusion matrix for the final training and test set presents 421 true positives and 509 true negative total of 930 documents correctly classified and 230 documents are misclassified out of 1160 documents(test set document).

The score slightly increases when we use the test set. we have got 0.7917900973338976 accuracies by splitting the final set to training and test set. But the accuracy of the system using the given test set is 0.8017241379310345.