Import the libraries used in this notebook

In [1]:
import pandas as pd
import numpy as np
import nltk
from sklearn.preprocessing import MinMaxScaler

Read in the four datasets

In [2]:
raw_data_dp = pd.read_csv('data/dp-slider-means.csv')
raw_data_evo = pd.read_csv('data/evo-slider-means.csv')
raw_data_gc = pd.read_csv('data/gc-slider-means.csv')
raw_data_gm = pd.read_csv('data/gm-slider-means.csv')

Show the number of columns and the number of rows in each dataset

In [3]:
n_rows = len(raw_data_dp) + len(raw_data_evo) + len(raw_data_gc) + len(raw_data_gm)

print("The data contains {0} columns".format(len(raw_data_dp.columns)))
print("Amount of rows:\n total: {0} \n dp: {1} \n evo: {2}\n gc: {3}\n gm: {4}".format(
    n_rows, len(raw_data_dp), len(raw_data_evo), len(raw_data_gc), len(raw_data_gm)))

The data contains 8 columns
Amount of rows:
 total: 5375 
 dp: 987 
 evo: 1252
 gc: 1590
 gm: 1546


Combine all four datasets into one data frame

In [4]:
frames = [raw_data_dp, raw_data_evo, raw_data_gc, raw_data_gm]

In [5]:
raw_data = pd.concat(frames, axis = 0)

In [6]:
print("The combined dataset contains {0} rows and {1} columns".format(len(raw_data), len(raw_data.columns)))

The combined dataset contains 5375 rows and 8 columns
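
A side note on the row labels after concatenation: `pd.concat` keeps each frame's original index by default, so the same label can occur once per source dataset. The sketch below illustrates this on two toy frames (the data is hypothetical, not taken from the CSV files) and shows `ignore_index=True` as one possible way to get unique labels.

```python
import pandas as pd

# Two toy frames standing in for the per-topic datasets (hypothetical data).
frame_a = pd.DataFrame({"GoodSliderMean": [1.0, 0.2], "Phrase.x": ["a1", "a2"]})
frame_b = pd.DataFrame({"GoodSliderMean": [0.7], "Phrase.x": ["b1"]})

# Default behaviour: each frame keeps its own row labels, so label 0 appears twice.
stacked = pd.concat([frame_a, frame_b], axis=0)

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame.
reindexed = pd.concat([frame_a, frame_b], axis=0, ignore_index=True)

print(list(stacked.index))
print(list(reindexed.index))
```

Duplicate labels do not matter for positional access such as `.values`, which is what the rest of this notebook relies on, but they can be confusing when rows are looked up by label.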


Inspect the first 5 rows of the new dataset

In [7]:
raw_data.head()

Unnamed: 0.1,Unnamed: 0,ItemId,GoodSliderMean,GoodSliderDev,Connective.x,PairType.x,ResponseInitial.x,Phrase.x
0,658,ab5810d83f23243ddce713ac23d775cd,1.0,,so,P1_P2,False,"Sorry for the length of the post, but I hope i..."
1,871,e0a35a65ce12b2457e8ff1f9b8cec749,1.0,0.0,no_connective,P1_P2,True,I am all for the death penalty.
2,931,f16863ac9454707946061848c7e9a3e5,1.0,0.0,no_connective,P1_P2,True,I am pro death penalty.
3,936,f1f99c6b1f3f14025a3c01cb8a13b10b,1.0,,no_connective,P1_P2,False,"I can't believe that you just said ""So what if..."
4,11,029bc4e01ac943f87837556b32d5627a,0.999,0.001414,so,QR,False,So what does he have to do with a debate like ...


We are mainly interested in the argument score and the argument text, so keep only those columns together with the connective and the response-initial flag

In [8]:
raw_data = raw_data[["GoodSliderMean", "Connective.x", "ResponseInitial.x", "Phrase.x"]]

Now we only have the arguments, their annotated scores, and the two context columns

In [9]:
raw_data.head()

Unnamed: 0,GoodSliderMean,Connective.x,ResponseInitial.x,Phrase.x
0,1.0,so,False,"Sorry for the length of the post, but I hope i..."
1,1.0,no_connective,True,I am all for the death penalty.
2,1.0,no_connective,True,I am pro death penalty.
3,1.0,no_connective,False,"I can't believe that you just said ""So what if..."
4,0.999,so,False,So what does he have to do with a debate like ...


The data is now ordered by topic and by argument score, so shuffle it before using it for classification

In [10]:
raw_data = raw_data.sample(frac = 1)
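
If the shuffle needs to be repeatable between runs, one option is to fix the seed. This is a minimal sketch on a toy frame; the `random_state` value and the `reset_index` call are assumptions added for illustration and are not part of the original cell.

```python
import pandas as pd

# Toy frame with the same column names as the notebook (hypothetical values).
toy = pd.DataFrame({"GoodSliderMean": [1.0, 0.9, 0.4, 0.1],
                    "Phrase.x": ["a", "b", "c", "d"]})

# random_state fixes the permutation; reset_index(drop=True) discards the old row labels.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled.equals(shuffled_again))
```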

The data is now properly shuffled

In [11]:
raw_data["Connective.x"]

1045              but
3       no_connective
645                so
1459               so
751                so
547                so
528             first
527                so
83              first
150                if
289                so
77                 so
33      no_connective
891                if
79                 if
237               but
602     no_connective
109                so
242                if
185               but
419                so
920                if
569     no_connective
1237              but
1088               if
281                if
1429              but
786               but
942             first
650               but
            ...      
554               but
32                 so
408             first
191     no_connective
291                so
299               but
589     no_connective
384             first
816     no_connective
1421              but
753                so
808             first
126                so
624                if
427       

In [12]:
sentences = raw_data["Phrase.x"].values

In [13]:
# define a threshold for a good argument
threshold = 0.5

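# copy the scores so that binarizing them does not modify raw_data in place
labels = raw_data["GoodSliderMean"].values.copy()
#negatives = labels[labels < threshold]
#positives = labels[labels >= threshold]
#print("Negatives: {0}, Positives: {1}".format(len(negatives) / len(raw_data), len(positives) / len(raw_data)))

# binarize: 0 for scores below the threshold, 1 for scores at or above it
labels[labels < threshold] = 0
labels[labels >= threshold] = 1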
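
The thresholding step can also be written as a single comparison, which never touches the original scores. The sketch below uses toy values (hypothetical) and assumes integer labels are acceptable. Note that, unlike the in-place assignments, this form maps a missing score (NaN) to 0.

```python
import numpy as np
import pandas as pd

# Toy scores (hypothetical), including a value exactly at the threshold.
toy = pd.DataFrame({"GoodSliderMean": [1.0, 0.999, 0.5, 0.49, 0.0]})
threshold = 0.5

# The comparison builds a new boolean array, so the original scores are untouched.
labels = (toy["GoodSliderMean"].to_numpy() >= threshold).astype(int)

print(labels)
```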
                                                 


Define a function that encodes the sentences by their part-of-speech 

In [14]:
def encode_sentences_POS(sentences):
    dataset = []
    for sentence in sentences:
        # tokenize the sentence
        tokens = nltk.word_tokenize(sentence)
        # tag all the tokens with their part of speech
        pos_tokens = nltk.pos_tag(tokens)
        # for each tracked POS tag, count how many times it occurs in the sentence
        pos_dict = generate_POS_dict()
        for _, tag in pos_tokens:
            if tag in pos_dict:
                pos_dict[tag] += 1
        # get the POS tag counts as features
        pos_vector = list(pos_dict.values())

        # check if certain keywords are present in the sentence
        # (substring match; computed but currently not added to the feature vector)
        keyword_dict = generate_keyword_dict()
        for key in keyword_dict:
            if key in sentence.lower():
                keyword_dict[key] += 1
        keyword_vector = list(keyword_dict.values())

        # final features: the POS tag counts followed by the number of tokens
        feature_vector = pos_vector
        # disabled feature: average token length, ignoring punctuation
        #n_chars = sum(len(token) for token in tokens if token not in ".,?!")
        #feature_vector.append(n_chars / len(tokens))
        feature_vector.append(len(tokens))
        dataset.append(feature_vector)
    # disabled: scale the token-count column by its maximum
    #dataset = pd.DataFrame(dataset)
    #dataset[len(feature_vector) - 1] = dataset[len(feature_vector) - 1] / max(dataset[len(feature_vector) - 1])
    #return dataset.values
    return dataset
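
To make the shape of the feature vector concrete, here is a minimal sketch of the counting step on its own. It works on hand-written `(token, tag)` pairs in the shape that `nltk.pos_tag` returns, so it runs without the NLTK models. The pairs are hypothetical and not actual tagger output, and `pos_count_vector` is a helper name introduced only for this illustration.

```python
from collections import Counter

# Same tag inventory as generate_POS_dict in this notebook.
TAGS = ['NN', 'IN', 'DT', 'PRP', 'RB', 'JJ', '.', 'VB', 'NNS', ',', 'SYM']

def pos_count_vector(tagged_tokens):
    """Return the counts of each tracked tag followed by the token count."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts[tag] for tag in TAGS] + [len(tagged_tokens)]

# Hand-written (token, tag) pairs; hypothetical, not real tagger output.
tagged = [("I", "PRP"), ("am", "VBP"), ("pro", "JJ"),
          ("death", "NN"), ("penalty", "NN"), (".", ".")]
vector = pos_count_vector(tagged)

print(vector)
```

Tags outside the tracked list (here `VBP`) are ignored, so the vector always has one entry per tracked tag plus the token count.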
    

In [15]:
def generate_POS_dict():
    #tags = ['IN', 'PRP', 'VBP', 'TO', 'VB', 'JJ', 'NN', 'VBZ', 'VBN', 'DT', 'NNP', 'VBD', '.',
    #       'CC', 'CD', 'EX', 'FW', 'JJR', 'JJS', 'LS', 'WP', 'WP$', 'WRB']
    tags = ['NN', 'IN', 'DT', 'PRP', 'RB', 'JJ', '.', 'VB', 'NNS', ',', 'SYM']    
    dic = dict.fromkeys(tags, 0)
    return dic
    

In [16]:
def generate_keyword_dict():
    words = ['because', 'but', 'since', 'reason', 'that', 'either', 'nonetheless', 'example', 'so']
    dic = dict.fromkeys(words, 0)
    return dic

Define a function that encodes the sentence based on the words within

In [17]:
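def encode_sentences_BOW(sentences, BOW_dict):
    # use the same stemmer as generate_BOW_dict so that tokens match the dictionary keys
    sno = nltk.stem.SnowballStemmer('english')
    dataset = []
    for sentence in sentences:
        sentence_dict = BOW_dict.copy()
        # mark every stemmed token of the sentence that occurs in the vocabulary
        for token in nltk.word_tokenize(sentence):
            stem = sno.stem(token)
            if stem in sentence_dict:
                sentence_dict[stem] = 1
        dataset.append(list(sentence_dict.values()))
    return dataset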

In [18]:
def generate_BOW_dict(sentences):
    # concatenate all argument sentences into one big string
    all_sentences = " ".join(sentences)
    # construct a dictionary containing all the words that occur in the sentences
    all_tokens = nltk.word_tokenize(all_sentences)
    sno = nltk.stem.SnowballStemmer('english')
    stemmed_tokens = []
    for token in all_tokens:
        stemmed_tokens.append(sno.stem(token))
    feature_dict = dict([token, 0] for token in stemmed_tokens)
    return feature_dict
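
The bag-of-words idea can be sketched without NLTK as follows. This simplified version assumes lowercased whitespace tokens and no stemming, and the helper names `build_vocabulary` and `encode_binary_bow` are hypothetical. It matches whole tokens, so a word such as "also" does not count as an occurrence of "so".

```python
def build_vocabulary(sentences):
    """Map each lowercased whitespace token to a fixed column index."""
    vocabulary = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def encode_binary_bow(sentence, vocabulary):
    """Return a 0/1 vector marking which vocabulary tokens occur in the sentence."""
    vector = [0] * len(vocabulary)
    for token in sentence.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector

# Toy corpus (hypothetical).
corpus = ["so it is", "it is not"]
vocab = build_vocabulary(corpus)
encoded = [encode_binary_bow(s, vocab) for s in corpus]

print(vocab)
print(encoded)
```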

In [19]:
dataset = encode_sentences_POS(sentences)

In [20]:
#BOW_dict = generate_BOW_dict(sentences)

In [21]:
#dataset = encode_sentences_BOW(sentences, BOW_dict)

In [22]:
N = len(dataset)
# use the first 80% of the shuffled data for training and the rest for testing
split = int(0.8 * N)
Xtrain = dataset[:split]
Ytrain = labels[:split]

Xtest = dataset[split:]
Ytest = labels[split:]

# class balance of the test set
negatives = Ytest[Ytest < threshold]
positives = Ytest[Ytest >= threshold]
print("Negatives: {0}, Positives: {1}".format(len(negatives) / len(Xtest), len(positives) / len(Xtest)))

Negatives: 0.40930232558139534, Positives: 0.5906976744186047
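
The split and the class-balance check can be wrapped in a small helper, sketched below on toy data. The helper name `split_80_20` and the toy labels are hypothetical. The sketch assumes the data has already been shuffled, as in this notebook.

```python
import numpy as np

def split_80_20(features, labels, train_fraction=0.8):
    """Split already-shuffled data into a leading train part and a trailing test part."""
    split = int(train_fraction * len(features))
    return features[:split], labels[:split], features[split:], labels[split:]

# Toy data: ten one-feature examples with binary labels (hypothetical).
X = [[i] for i in range(10)]
y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])

Xtr, ytr, Xte, yte = split_80_20(X, y)

# For 0/1 labels the mean is the fraction of positives.
train_positive_rate = ytr.mean()
test_positive_rate = yte.mean()

print(train_positive_rate, test_positive_rate)
```

Comparing the two rates is a quick way to see whether the test set has roughly the same class balance as the training set.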


Fit an MLP on the dataset

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB
from MLP import *
#mlp = MLPClassifier()
#lm = LogisticRegression()
#svm =  LinearSVC() 
#nb = BernoulliNB()
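# custom network from the local MLP module; the cells below call mlp.train, mlp.accuracy and mlp.test
mlp = Network(n_features = len(Xtrain[0]), architecture = [len(Xtrain[0]), 50, 1], n_outputs = 1, learning_rate = 0.01)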
#sk_mlp = MLPClassifier()


# try out a regression model instead of a classifier
from sklearn.neural_network import MLPRegressor
sk_mlp = MLPRegressor()

ImportError: cannot import name 'abs'

In [None]:
#lm.fit(Xtrain, Ytrain)
#svm.fit(Xtrain, Ytrain)
#nb.fit(Xtrain, Ytrain)
Ytrain = Ytrain.reshape(len(Ytrain), 1)
mlp.train(Xtrain, Ytrain, 5000, 100)
#sk_mlp.fit(Xtrain, Ytrain)
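
For reference, this is how the fit, score and probability calls look for one of the scikit-learn classifiers imported above. It is a minimal sketch on toy data (hypothetical), not a result from this dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, binary labels.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns one row per sample, with columns ordered as clf.classes_.
proba = clf.predict_proba(X)
# score returns the mean accuracy on the given data.
accuracy = clf.score(X, y)

print(clf.classes_, proba.shape, accuracy)
```

The column at index 1 of `predict_proba` is the probability of the positive class, which is the value the filtering function further down reads as its confidence.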

See how well the classifiers score

In [None]:
#print(mlp.score(Xtest, Ytest))
#print(lm.score(Xtest, Ytest))
#print(svm.score(Xtest, Ytest))
#print(nb.score(Xtest, Ytest))
Ytest = Ytest.reshape(len(Ytest), 1)
print(mlp.accuracy(Xtest, Ytest))
#print(sk_mlp.score(Xtest, Ytest))

Define a function that splits text into argument sentences and non-argument/bad argument sentences

In [None]:
def filter_post_binary(text):
    # NOTE: requires lm, nb and sk_mlp to be fitted classifiers that support predict_proba
    # (e.g. sk_mlp = MLPClassifier(); MLPRegressor does not provide predict_proba)
    # split on periods; the last piece (the text after the final period) is dropped
    sentences = text.split('.')[:-1]
    #encoded_sentences = encode_sentences_BOW(sentences, BOW_dict)
    encoded_sentences = encode_sentences_POS(sentences)
    filtered_text = []
    removed_sentences = []
    for i in range(len(encoded_sentences)):
        prediction = lm.predict([encoded_sentences[i]])
        # probability of the positive class according to each classifier
        conf_lm = lm.predict_proba([encoded_sentences[i]])[0][1]
        conf_nb = nb.predict_proba([encoded_sentences[i]])[0][1]
        conf_mlp = sk_mlp.predict_proba([encoded_sentences[i]])[0][1]
        print(conf_lm, conf_nb, conf_mlp)
        # average the three confidences
        conf = np.mean([conf_lm, conf_nb, conf_mlp])
        # keep the sentence only if lm predicts a good argument and the mean confidence is high
        if prediction[0] == 1 and conf >= 0.8:
            filtered_text.append((conf, sentences[i]))
        else:
            removed_sentences.append((conf, sentences[i]))

    print("{0} sentences filtered from the input text!".format(len(removed_sentences)))
    return filtered_text, removed_sentences
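
The decision rule inside the loop can be isolated as follows. This is a simplified sketch that only covers the averaging and the cut-off and leaves out the additional check on the logistic regression prediction. The helper names and the probabilities are hypothetical.

```python
import numpy as np

def average_confidence(probabilities):
    """Average the positive-class probabilities reported by several models."""
    return float(np.mean(probabilities))

def keep_sentence(probabilities, min_confidence=0.8):
    """Keep a sentence only if the averaged confidence reaches the cut-off."""
    return average_confidence(probabilities) >= min_confidence

# Hypothetical positive-class probabilities from three classifiers.
confident = [0.9, 0.85, 0.8]
mixed = [0.9, 0.5, 0.7]

print(keep_sentence(confident), keep_sentence(mixed))
```

Averaging means a single low-confidence model can pull a sentence below the cut-off even when the other two agree.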
        
    

In [None]:
def filter_post_regression(text):
    # split on periods; the last piece (the text after the final period) is dropped
    sentences = text.split('.')[:-1]
    #encoded_sentences = encode_sentences_BOW(sentences, BOW_dict)
    encoded_sentences = encode_sentences_POS(sentences)
    filtered_text = []
    removed_sentences = []
    for i in range(len(encoded_sentences)):
        # predicted argument score from the custom MLP
        score = mlp.test([encoded_sentences[i]])
        print(score)
        # keep the sentence only if its predicted score is high enough
        if score >= 0.6:
            filtered_text.append((score, sentences[i]))
        else:
            removed_sentences.append((score, sentences[i]))

    print("{0} sentences filtered from the input text!".format(len(removed_sentences)))
    return filtered_text, removed_sentences
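
Both filter functions split on `'.'` and drop the last piece, which loses a final sentence that does not end in a period and treats `?` and `!` as ordinary characters. A regex-based splitter is one possible alternative. This is a sketch only: the helper name `split_sentences` is hypothetical, and the pattern still breaks on abbreviations and decimal numbers.

```python
import re

def split_sentences(text):
    """Split text into sentences, keeping the terminating punctuation."""
    # a run of non-terminators followed by any terminators that close it
    pieces = re.findall(r'[^.!?]+[.!?]*', text)
    return [piece.strip() for piece in pieces if piece.strip()]

example = "One.Two. Three"

print(split_sentences(example))
print(example.split('.')[:-1])
```

The second print shows what the period-based split keeps for the same input, for comparison.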

Classify the sentences of some example posts

In [None]:
OP_post = "The classic example of this would be a person’s Grandparents that grew up in a different time where it was not considered a bad thing to classify and ultimately look down on a person based on their perceived racial category. Many of not most of the time this person has had limited sustained contact with people of other races and is at times arrogantly used to making broad statements about members of certain racial categories. Most people people don’t have any problems referring their grandparents as racists. Their way of thinking didn’t and doesn’t automatically prevent them from being a good parent, spouse, grandparent, citizen, etc. The difficulty today is that people are unwilling to come to terms with the fact that they have racist beliefs, often inherited from older generations that can subtly (and not so subtly) come out in their interactions with people of different races. Racist to them equal someone that is evil and it feels bad to be labeled that way. Ultimately being a racist is a negative quality that many otherwise good people have. Being racist can and does cause harm to others and people should be proactive in changing their racist beliefs as a method of self improvement. There certainly are evil people who maliciously harbor racial hatred against others that are racists too. We often consider them more representative of the term Racist than your Grandparents or instance but this is just a matter of degree.If a person calls you a racist there is always the possibility that they are right. This should be an opportunity for self reflection rather than an automatic denial."

text, filtered = filter_post_regression(OP_post)

In [None]:
text

In [None]:
filtered

In [None]:
counter = "Ok so I'm not if I should be changing your view on if racists are evil, or if racists are good people. Personally I don't believe racists are good people, I also don't believe they are bad or evil people either, UNLESS they use their ignorance to either do harm or not do anything. For example someone doesn't like black people and refuses to hire them at their work, or refuses to help a black person who is in need.By that same token it's the same with people who don't like kids for instance. You're not good, bad, or evil, UNLESS you use it to harm or not help. Like people who don't like kids and refuse to help children in need, or hurt children. Racial biases, stereotypes, and prejudices exist in the older generations for sure, and no I wouldn't say it is evil, but I wouldn't classify them as good people either. My grandmother is a racist and I don't classify her as good, because it's ignorant as fuck. I also don't classify her as evil for holding beliefs that she hasn't promoted or done harm with. Doesn't mean I have to like her or associate with her for example."
text, filtered = filter_post_regression(counter)

In [None]:
text

In [None]:
filtered

In [None]:
new_str = "It's the same as raising children. In order for them to have the best possible outcome in life, you have to sometimes sacrifice their short-term well-being and hurt their feelings in order for them to have a better chance at long-term happiness. It's the same with fat people, in order for them to have a happy and prosperous long-term future, you (you as in society, friends, family or whatever) have to sacrifice their short-term happiness to achieve this. But that is not ok for some reason. While if someone told you 'I just let my kid do what he wants to whoever he wants and I never tell him what and when he is wrong because I want him to be happy', most normal people would think those are some sucky parents. And just so I try to remove some strawman arguments, I don't include people who have some medical or other problems which are causing them to gain weight. I'm talking about most fat people who are just too lazy and have no self-control when it comes to food. I also don't think we should go around and yell at fat people or try to change them the way parents do to their kids. Im talking about overall societal view to these issue."

In [None]:
text, filtered = filter_post_regression(new_str)

In [None]:
text

In [None]:
filtered

In [None]:
counter = "Who are these people going around telling fat people that it is ok to be fat? I don't think this is an even remotely common occurrence. In my experience people tend to agree (the 'overall societal view') that it is not ok/good/healthy to be fat, but we shouldn't treat people negatively just because they are fat. Accepting people and treating them with kindness is not the same as telling them there is no need to get in shape. Fat people are reminded constantly by society (especially healthcare professionals) that they should work on losing weight."

In [None]:
text, filtered = filter_post_regression(counter)

In [None]:
text

In [None]:
filtered

In [None]:
new_post = "I realize this is illegal in some places, but it should be illegal everywhere. Putting flyers in people's car windows or in their windshield wipers is basically littering. The owners of the cars never consented to having pieces of paper be left on their property. If you want to advertise your business using flyers, have people hand them out to people. Don't just leave it on their car. Will that be more expensive paying someone by the hour to hand people flyers? Yes. But that's the business owner's problem. An attempt at cheap advertisement does not justify putting shit on people's cars. 'But think of the small businesses' will not change my view."

In [None]:
text, filtered = filter_post_regression(new_post)

In [None]:
text

In [None]:
filtered