# Argument Detection

## Prepare Data

In [80]:
# Load data from file

import json

dataset = []

with open('./labelled_data/1000_labelled_argument_sentences_3.json') as f:
    for line in f:
        json_line = json.loads(line)
        arg = {"text": json_line["content"], "label": json_line["annotation"]["labels"][0]}

        dataset.append(arg)

dataset

[{'text': "The motivation for the age restriction, like a lot of the Constitution, might have roots in the political situation in Europe in the 1700's.",
  'label': 'arg'},
 {'text': 'If Alexandria-Ocasio Cortez wanted to run for President in 2020, and people thought she was too young and inexperienced, they could vote against her for that reason.',
  'label': 'arg'},
 {'text': '(Various articles I could quote to support this lmk) Women generally live a couple years longer.',
  'label': 'arg'},
 {'text': 'Why, exactly?', 'label': 'not_arg'},
 {'text': "The minimum age requirement does at least give you some potential limit as to which someone can fill out a political 'resume', at the very least make themselves a known quantity, giving people better ideas as to what sort of a person a politician is/can be, while setting a maximum only hampers potentially still competent, still capable public servants from fulfilling an important role.",
  'label': 'arg'},
 {'text': 'As a foot note after

In [81]:
# Remove punctuation

import string

print(string.punctuation)

dataset = [{"text": sample["text"].translate(str.maketrans('', '', string.punctuation)), "label":sample["label"]} for sample in dataset]

dataset

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


[{'text': 'The motivation for the age restriction like a lot of the Constitution might have roots in the political situation in Europe in the 1700s',
  'label': 'arg'},
 {'text': 'If AlexandriaOcasio Cortez wanted to run for President in 2020 and people thought she was too young and inexperienced they could vote against her for that reason',
  'label': 'arg'},
 {'text': 'Various articles I could quote to support this lmk Women generally live a couple years longer',
  'label': 'arg'},
 {'text': 'Why exactly', 'label': 'not_arg'},
 {'text': 'The minimum age requirement does at least give you some potential limit as to which someone can fill out a political resume at the very least make themselves a known quantity giving people better ideas as to what sort of a person a politician iscan be while setting a maximum only hampers potentially still competent still capable public servants from fulfilling an important role',
  'label': 'arg'},
 {'text': 'As a foot note after the US left Vietnam 

In [82]:
# Create alternative dataset based on lemma


import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

def lemmatize_all(sentence):
    wnl = WordNetLemmatizer()
    for word, tag in pos_tag(word_tokenize(sentence)):
        if tag.startswith("NN"):
            yield wnl.lemmatize(word, pos='n')
        elif tag.startswith('VB'):
            yield wnl.lemmatize(word, pos='v')
        elif tag.startswith('JJ'):
            yield wnl.lemmatize(word, pos='a')
        else:
            yield word

dataset_lemma = [{"text": ' '.join(lemmatize_all(sample["text"])), "label":sample["label"]} for sample in dataset]

dataset_lemma

[nltk_data] Downloading package wordnet to /home/effsy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


[{'text': 'The motivation for the age restriction like a lot of the Constitution might have root in the political situation in Europe in the 1700s',
  'label': 'arg'},
 {'text': 'If AlexandriaOcasio Cortez want to run for President in 2020 and people think she be too young and inexperienced they could vote against her for that reason',
  'label': 'arg'},
 {'text': 'Various article I could quote to support this lmk Women generally live a couple year longer',
  'label': 'arg'},
 {'text': 'Why exactly', 'label': 'not_arg'},
 {'text': 'The minimum age requirement do at least give you some potential limit as to which someone can fill out a political resume at the very least make themselves a known quantity give people better idea as to what sort of a person a politician iscan be while set a maximum only hamper potentially still competent still capable public servant from fulfil an important role',
  'label': 'arg'},
 {'text': 'As a foot note after the US leave Vietnam China invade and fail 

In [168]:
# Split dataset into training and testing set

from sklearn.model_selection import train_test_split

train, test = train_test_split(dataset, test_size=0.1)

train_x = [sample["text"] for sample in train]
train_y = [sample["label"] for sample in train]

test_x = [sample["text"] for sample in test]
test_y = [sample["label"] for sample in test]
train_x
train_y

['not_arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'no

In [84]:
# Represent text as BoW 

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = TfidfVectorizer(ngram_range=(1, 2))
train_x_bow = cv.fit_transform(train_x)
test_x_bow = cv.transform(test_x)
train_x_bow.shape

(670, 10369)

In [102]:
# Split dataset into features and targets

dataset_x = [sample["text"] for sample in dataset]
dataset_y = [sample["label"] for sample in dataset]

dataset_lemma_x = [sample["text"] for sample in dataset_lemma]
dataset_lemma_y = [sample["label"] for sample in dataset_lemma]

['arg',
 'arg',
 'arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'not_arg',
 'arg',
 'arg',
 'arg',
 'not_arg',
 'not_arg',
 'arg',
 'not_ar

(1000, 14562)

## Classification

#### SVM

In [166]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from numpy import mean


from sklearn import svm


clf_svm = Pipeline([('vect', TfidfVectorizer(ngram_range=(1, 2))), ('svm', svm.SVC(kernel='linear'))])

mean(cross_val_score(clf_svm, dataset_x, dataset_y, cv=10, scoring='f1_macro'))

0.6739155683399748

#### Decision Tree

In [86]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()

clf_dec.fit(train_x_bow, train_y)

clf_dec.predict(test_x_bow[0])


array(['not_arg'], dtype='<U7')

#### Logistic Regression

In [87]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()

clf_log.fit(train_x_bow, train_y)

clf_log.predict(test_x_bow[0])




array(['not_arg'], dtype='<U7')

#### Naive Bayes

In [88]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()

clf_gnb.fit(train_x_bow.toarray(), train_y)

clf_gnb.predict(test_x_bow[0].toarray())


array(['not_arg'], dtype='<U7')

#### Dummy

In [89]:
from sklearn.dummy import DummyClassifier

clf_dum = DummyClassifier(strategy='stratified', random_state=0)

clf_dum.fit(train_x_bow, train_y)

clf_dum.predict(test_x_bow[0])


array(['not_arg'], dtype='<U7')

### SVM

In [180]:
# It is possible to get undefined f1-score as in an exhaustive search (grid search), 
# some labels may never be predicted. This can lead to 0 precision or recall.
# Ignore these warnings

import warnings
warnings.filterwarnings('ignore')


from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [181]:
from sklearn import svm

parameters_svm = {'svm__C': [0.1, 1, 10, 100, 1000],  
              'svm__gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'svm__kernel': ['rbf', 'linear']}  

svm_pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1, 2))), ('svm', svm.SVC())])

clf_svm = GridSearchCV(svm_pipeline, parameters_svm, cv=StratifiedKFold(n_splits=5, shuffle = True, random_state = 999), scoring='f1_macro')

clf_svm.fit(train_x, train_y)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=999, shuffle=True),
             error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                               

In [158]:
from sklearn.naive_bayes import MultinomialNB

parameters_svm = {'svm__C': [0.1, 1, 10, 100, 1000],  
              'svm__gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'svm__kernel': ['rbf', 'linear']}  

svm_pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1, 2))), ('svm', svm.SVC())])

clf_svm = GridSearchCV(svm_pipeline, parameters_svm, cv=StratifiedKFold(n_splits=5, shuffle = True, random_state = 999), scoring='f1_macro')

clf_svm.fit(train_x, train_y)



{'svm__C': 0.1, 'svm__gamma': 1, 'svm__kernel': 'rbf'} 0.345928832381665
{'svm__C': 0.1, 'svm__gamma': 0.1, 'svm__kernel': 'rbf'} 0.345928832381665
{'svm__C': 0.1, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'} 0.345928832381665
{'svm__C': 0.1, 'svm__gamma': 0.001, 'svm__kernel': 'rbf'} 0.345928832381665
{'svm__C': 0.1, 'svm__gamma': 0.0001, 'svm__kernel': 'rbf'} 0.345928832381665
{'svm__C': 1, 'svm__gamma': 1, 'svm__kernel': 'rbf'} 0.578525436733805
{'svm__C': 1, 'svm__gamma': 0.1, 'svm__kernel': 'rbf'} 0.345928832381665
{'svm__C': 1, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'} 0.345928832381665
{'svm__C': 1, 'svm__gamma': 0.001, 'svm__kernel': 'rbf'} 0.345928832381665
{'svm__C': 1, 'svm__gamma': 0.0001, 'svm__kernel': 'rbf'} 0.345928832381665
{'svm__C': 10, 'svm__gamma': 1, 'svm__kernel': 'rbf'} 0.649482302365062
{'svm__C': 10, 'svm__gamma': 0.1, 'svm__kernel': 'rbf'} 0.6615520895870345
{'svm__C': 10, 'svm__gamma': 0.01, 'svm__kernel': 'rbf'} 0.345928832381665
{'svm__C': 10, 'svm__gamma': 0

In [None]:
from sklearn.tree import DecisionTreeClassifier

parameters_svm = {'svm__C': [0.1, 1, 10, 100, 1000],  
              'svm__gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'svm__kernel': ['rbf', 'linear']}  

svm_pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1, 2))), ('svm', svm.SVC())])

clf_svm = GridSearchCV(svm_pipeline, parameters_svm, cv=StratifiedKFold(n_splits=5, shuffle = True, random_state = 999), scoring='f1_macro')

clf_svm.fit(train_x, train_y)

In [None]:
from sklearn.linear_model import LogisticRegression

parameters_svm = {'svm__C': [0.1, 1, 10, 100, 1000],  
              'svm__gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'svm__kernel': ['rbf', 'linear']}  

svm_pipeline = Pipeline([('vect', TfidfVectorizer(ngram_range=(1, 2))), ('svm', svm.SVC())])

clf_svm = GridSearchCV(svm_pipeline, parameters_svm, cv=StratifiedKFold(n_splits=5, shuffle = True, random_state = 999), scoring='f1_macro')

clf_svm.fit(train_x, train_y)

In [162]:

from sklearn.metrics import classification_report

y_true, y_pred = test_y, clf_svm.predict(test_x)
print (classification_report(y_true, y_pred))


              precision    recall  f1-score   support

         arg       0.55      0.90      0.68        40
     not_arg       0.88      0.50      0.64        60

    accuracy                           0.66       100
   macro avg       0.71      0.70      0.66       100
weighted avg       0.75      0.66      0.65       100



{'mean_fit_time': array([0.19169555, 0.20035076, 0.20642376, 0.18683457, 0.19028363,
        0.18359694, 0.22767892, 0.2242135 , 0.19481959, 0.16957254,
        0.17064142, 0.18002262, 0.18897939, 0.16739783, 0.21151533,
        0.17690983, 0.17905188, 0.16472702, 0.16546984, 0.16401119,
        0.17651253, 0.16427131, 0.17352662, 0.16368251, 0.16684523,
        0.16524277, 0.18302641, 0.16483483, 0.18043642, 0.21338053,
        0.20057144, 0.19314384, 0.17319698, 0.17001143, 0.20415325,
        0.17844639, 0.16783361, 0.16719723, 0.17032518, 0.19480267,
        0.17746067, 0.16702099, 0.16922493, 0.16744614, 0.17453222,
        0.16667366, 0.17205791, 0.16691537, 0.17185802, 0.16781387]),
 'std_fit_time': array([0.02071794, 0.02913824, 0.03292887, 0.00544549, 0.01190256,
        0.01444971, 0.05537727, 0.02733026, 0.02420517, 0.00397557,
        0.00203077, 0.00999936, 0.01466984, 0.00566747, 0.03507242,
        0.00482943, 0.0081652 , 0.00326431, 0.00331185, 0.00251948,
        0.003

In [177]:
from sklearn import metrics

print("SVM")
metrics.f1_score(test_y, clf_svm.predict(test_x), average='weighted')

print("Decision Tree")
metrics.f1_score(test_y, clf_dec.predict(test_x), average='weighted')

print("Logistic Regression")
metrics.f1_score(test_y, clf_log.predict(test_x), average='weighted')

print("Naive Bayes")
metrics.f1_score(test_y, clf_nb.predict(test_x), average='weighted')

0.6617088800732377

## Evaluation

In [90]:
# print(clf_svm.score(test_x_bow, test_y))
# print(clf_dec.score(test_x_bow, test_y))
# print(clf_gnb.score(test_x_bow.toarray(), test_y))
# print(clf_log.score(test_x_bow, test_y))


In [91]:
# f1 score

from sklearn.metrics import f1_score

display(f1_score(test_y, clf_svm.predict(test_x_bow), average=None, labels=["arg", "not_arg"]))
display(f1_score(test_y, clf_gnb.predict(test_x_bow.toarray()), average=None, labels=["arg", "not_arg"]))
display(f1_score(test_y, clf_log.predict(test_x_bow), average=None, labels=["arg", "not_arg"]))
display(f1_score(test_y, clf_dec.predict(test_x_bow), average=None, labels=["arg", "not_arg"]))
display(f1_score(test_y, clf_dum.predict(test_x_bow), average=None, labels=["arg", "not_arg"]))


array([0.73815461, 0.59459459])

array([0.6557377 , 0.57142857])

array([0.73429952, 0.55284553])

array([0.61842105, 0.6741573 ])

array([0.47590361, 0.4695122 ])

In [92]:
# Tune the model parameters with grid search

## Improving the Model

This is the baseline. We will now explore adding different features to improve the classifier

In [93]:
# Create new features. 

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

train_x_bow
train_x



# Sentiment
sid = SentimentIntensityAnalyzer()


for sentence in train_x:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
        print()

# URLs

# Reddit features

# POS number of each

# Sentence length
train_x_bow



[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/effsy/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


This may lead to alcoholism or drug addiction in some peoples genetics or mental states
compound: 0.0, 
neg: 0.0, 
neu: 1.0, 
pos: 0.0, 
Hes not going to win running on the environment Mr Pinion says
compound: -0.4717, 
neg: 0.218, 
neu: 0.782, 
pos: 0.0, 
Other people are rarely trying to offend others so will get defensive
compound: 0.3075, 
neg: 0.0, 
neu: 0.755, 
pos: 0.245, 
If you go here  httpswwwmerriamwebstercomdictionaryhirsutehttpswwwmerriamwebstercomdictionaryhirsute  I think you get a much more concise definition and will know the precise usage
compound: 0.0, 
neg: 0.0, 
neu: 1.0, 
pos: 0.0, 
The motivation for the age restriction like a lot of the Constitution might have roots in the political situation in Europe in the 1700s
compound: 0.4215, 
neg: 0.078, 
neu: 0.741, 
pos: 0.181, 
In terms of health studies using longitudinal data have found a correlation between marriage and better health of varying types
compound: 0.4404, 
neg: 0.0, 
neu: 0.861, 
pos: 0.139, 
Why woul

compound: 0.3773, 
neg: 0.051, 
neu: 0.833, 
pos: 0.117, 
Are the conditions on death row a little better than for a guy serving life in general population
compound: -0.3167, 
neg: 0.19, 
neu: 0.683, 
pos: 0.127, 
Ultimately Buddy says he wants a world where Supers are gone as a class Including him after he gets his personal revenge first of course
compound: -0.5267, 
neg: 0.134, 
neu: 0.866, 
pos: 0.0, 
This could be the case for a few years after leaving but it does not have to be
compound: 0.0, 
neg: 0.0, 
neu: 1.0, 
pos: 0.0, 
My post was pointing out that the Democratic Party tends to have more factions for lack of a better term than the Republican Party
compound: 0.6943, 
neg: 0.086, 
neu: 0.636, 
pos: 0.278, 
Also you dont have the right to violence unless its for selfdefense or castle law in some states otherwise it would be legal
compound: -0.5574, 
neg: 0.154, 
neu: 0.789, 
pos: 0.056, 
3 Support of traditional medicine in this operative context does not promote the prolifera

compound: -0.25, 
neg: 0.333, 
neu: 0.667, 
pos: 0.0, 
There does not appear to be such an entity for DBS thats financially relevant
compound: 0.0, 
neg: 0.0, 
neu: 1.0, 
pos: 0.0, 
People having good careers means more revenue for the government and happier citizens
compound: 0.743, 
neg: 0.0, 
neu: 0.636, 
pos: 0.364, 
But   what if you could have programs like the ones in Overlord
compound: 0.3612, 
neg: 0.0, 
neu: 0.815, 
pos: 0.185, 
I mean the issue there is misinterpretation
compound: 0.0, 
neg: 0.0, 
neu: 1.0, 
pos: 0.0, 
Just be polite and not an ass about it maybe ask for a description of the person rather than their cup size
compound: 0.431, 
neg: 0.0, 
neu: 0.875, 
pos: 0.125, 
This is a weird example because despite the fact that literally no one had heard of Captain Marvel Captain Marvel made more money than Wonderwoman in the box office
compound: 0.4019, 
neg: 0.116, 
neu: 0.716, 
pos: 0.167, 
Also even if you are not openly biased your choice of the facts you mention an

compound: 0.5927, 
neg: 0.0, 
neu: 0.879, 
pos: 0.121, 
vegetarian here
compound: 0.0, 
neg: 0.0, 
neu: 1.0, 
pos: 0.0, 
That does not change the fact that the 2A probably allows nukes as it is written
compound: 0.0, 
neg: 0.0, 
neu: 1.0, 
pos: 0.0, 
Proeducation republicans are are also very well camouflaged then but they can certainly learn a thing or two from the prosexeducation republicans
compound: 0.5854, 
neg: 0.0, 
neu: 0.798, 
pos: 0.202, 
Thats just how our legal system works
compound: 0.128, 
neg: 0.0, 
neu: 0.8, 
pos: 0.2, 
In Virginia the legislators there are trying to pass a extreme form of gun control and confiscation
compound: -0.34, 
neg: 0.138, 
neu: 0.862, 
pos: 0.0, 
Its even worse if you know that they died in fear for their life or in pain
compound: -0.9217, 
neg: 0.504, 
neu: 0.496, 
pos: 0.0, 
Arguments in favor of the view OP is willing to change must be restricted to replies to other comments
compound: -0.3818, 
neg: 0.221, 
neu: 0.667, 
pos: 0.113, 
uRs3vsos

<670x10369 sparse matrix of type '<class 'numpy.float64'>'
	with 18857 stored elements in Compressed Sparse Row format>

In [94]:
# Add extra features into original train/test sets

train_new

NameError: name 'train_new' is not defined

In [None]:
train_x


## Save the Model