## Josh's attempt at an inclusive online-attack identifier
   #### Overview:
Using the dataset [Wikipedia Talk Labels: Personal Attacks](https://figshare.com/articles/Wikipedia_Talk_Labels_Personal_Attacks/4054689), I built a natural language processing model to identify personal attacks in online comments and forum posts. To train a model that is more conscious of intersectionally unique voices and perceptions, especially from those who are systemically underrepresented in modern society, I incorporated the available demographic data in an attempt to promote these voices. The methods for doing so are explained in the section below.


In [15]:
import pandas as pd
import numpy as np
import sys
from sklearn.pipeline import Pipeline, FeatureUnion 
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import SGDClassifier, RidgeClassifier
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

### 1.0 Working With Data
   #### 1.1 Combining the three sources:
   To include minority opinions in the "attack"-labeled data, I discussed the data with several friends who together cover more than half of the demographic categories in the dataset as well as 6 different ethnicities (the most representative sample I could manage during COVID) before settling on these rules:
   
   - Every comment marked as an attack by 50% or more of annotators is labeled an attack.
   
   - Additionally, any comment that more than 50% of a given demographic marked as an attack, and that also had at least 25% agreement from annotators overall, is labeled an attack.
   
       - Greater than 50% of a demographic is required because no individual of a group is a monolith speaking for all of the group, so if there is equal consideration for and against, then it is likely some other factor driving their evaluations.
       
       - The 25% overall support requirement was chosen to ensure that at least one other annotator, inside or outside of the demographic, agreed. (The minimum number of annotators on a comment is 8, so 25% means at least two annotators.)
       
           - Requiring someone else to agree was a response to the lack of quality in the demographic data which was only available for 54% of annotators.
           
           - The data also covered a rather limited range of demographics:
           
               - gender: Options were male, female, and other, but no annotator used other, so this column was expanded into male and female columns

               - English as a first language: already a boolean column
               
               - age: Each age range was split into its own column
               
               - education: I reduced the responses into 2 buckets, no_degree and college_degree, due to the lack of overall participation. The thought was that a college education is intended to mold an individual's worldview and thought processes, so the split between holding a college degree and not would be the most likely to produce different lived experiences and thus different perceptions of what constitutes an attack.
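
To make the two rules concrete, here is a minimal sketch of the labeling decision for a single comment. The function name `label_attack` and its inputs are hypothetical and only illustrate the thresholds described above; the notebook itself applies them with vectorized pandas operations.

```python
def label_attack(pct_attack, demo_pcts):
    """Decide whether one comment is labeled an attack.

    pct_attack: share of all annotators who marked the comment an attack.
    demo_pcts:  hypothetical dict mapping a demographic column name to the
                share of annotators in that demographic who marked it an attack.
    """
    # Highest level of support found in any single demographic
    demo_max = max(demo_pcts.values(), default=0.0)

    # Rule 1: at least half of all annotators agree
    majority = pct_attack >= 0.5

    # Rule 2: a strict majority of some demographic agrees,
    # backed by at least 25% of annotators overall
    demographic_backed = demo_max > 0.5 and pct_attack >= 0.25

    return majority or demographic_backed


print(label_attack(0.60, {}))                # True: overall majority
print(label_attack(0.30, {"female": 0.75}))  # True: demographic majority with 25%+ overall
print(label_attack(0.20, {"female": 1.00}))  # False: overall support below 25%
print(label_attack(0.40, {"female": 0.50}))  # False: exactly 50% of a demographic is not enough
```
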

The demographic dataset is far from perfect and hurt model performance, which is covered in a later section. However, data is often less robust than one would like; that constraint is likely to be part of any ML project, so it is still worth attempting to use the data.

NumPy/Pandas were utilized exclusively to avoid iterating through data during transformation.
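
As a rough illustration of that vectorized style, the sketch below computes per-demographic attack percentages on a made-up toy frame with `groupby` and `div` instead of Python loops. The toy data and the variable names (`toy`, `annotators`, `attackers`) are assumptions for illustration; only the column names mirror the real data.

```python
import pandas as pd

# Toy annotations: one row per (comment, annotator) pair
toy = pd.DataFrame({
    "rev_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "attack": [1, 1, 0, 0, 0, 0, 0, 1],
    "female": [1, 0, 1, 0, 0, 0, 0, 1],
    "male":   [0, 1, 0, 1, 1, 1, 1, 0],
})
demo_cols = ["female", "male"]

# Denominator: number of annotators from each demographic per comment
annotators = toy.groupby("rev_id")[demo_cols].sum()

# Numerator: annotators from each demographic who marked the comment an attack
attackers = (
    toy[toy["attack"] > 0]
    .groupby("rev_id")[demo_cols]
    .sum()
    .reindex(annotators.index, fill_value=0)
)

# Share of each demographic, and of all annotators, that marked an attack
demo_pct = attackers.div(annotators)
pct_attack = toy.groupby("rev_id")["attack"].mean()

print(demo_pct)
print(pct_attack)
```

In this toy frame, comment 2 has only 25% support overall but 100% support among female annotators, which is the kind of case the demographic rule is meant to catch.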


In [2]:
# import the three tsv documents
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')
demographics = pd.read_csv('attack_worker_demographics.tsv', sep = '\t')

In [3]:
# Join the demographic data to the annotations data, then blow it
# out into boolean series (technically 1s and 0s) and drop any irrelevant or pre-blown-out columns
annotWithDemo = pd.merge(annotations, demographics, on='worker_id', how='left')
annotWithDemo.drop(columns=['quoting_attack','recipient_attack','third_party_attack',
                            'other_attack', 'worker_id'], inplace=True)
boolCols = annotWithDemo.join(annotWithDemo.gender.str.get_dummies())
boolCols.drop(columns='gender',inplace=True)
boolCols = boolCols.join(boolCols.age_group.str.get_dummies())
boolCols.drop(columns='age_group',inplace=True)
boolCols = boolCols.join(boolCols.education.str.get_dummies())
boolCols.drop(columns='education',inplace=True)
boolCols['no_degree'] = boolCols['none'] + boolCols['some'] + boolCols['hs']
boolCols['college_degree'] = boolCols['bachelors'] + boolCols['doctorate'] + boolCols['masters'] + boolCols['professional']
boolCols.drop(columns=['bachelors','doctorate','hs','masters','none','professional','some'],
              inplace=True)

In [4]:
# Create a data frame containing only reviews with at least one attack identified
# Group both data frames by rev_id, the "attack only" frame will be used as the numerator in
# finding pctg of each demographic column that labeled a review an attack
boolColsAttackOnly = boolCols.loc[boolCols['attack'] > 0]
boolColsAttackOnlyGrouped = boolColsAttackOnly.groupby('rev_id', as_index=False).sum()
boolColsGrouped = boolCols.groupby('rev_id', as_index=False).sum()

In [5]:
# Combine the demographic columns into percentages and find the overall pct of
# annotators marking a comment as an attack to aid in classifying comments
allRev= boolColsGrouped['rev_id'].to_frame("rev_id")
allRevAttackOnlyGrouped = pd.merge(allRev, boolColsAttackOnlyGrouped, on='rev_id', how='left')
demo = allRevAttackOnlyGrouped.loc[:,'english_first_language':].div(boolColsGrouped.loc[:,'english_first_language':])
totalAnnotators = boolCols.groupby('rev_id', as_index=False).count()['attack']
attack = boolColsGrouped['attack'].div(totalAnnotators).to_frame('pctAttack')


In [6]:
# Find the max demographic percentage that advocated for attack in each row and add it to the attack
# dataframe. Create an attack column for the target labels and flip any rows meeting the criteria
# to True. Insert the rev_id column into the attack frame 
demoMax = demo.loc[:,'english_first_language':].max(axis = 1)
attack.insert(1,'demoMax',demoMax)
attack['attack'] = False
attack.loc[(attack['pctAttack'] >= .5) | (attack['demoMax'] > .5), 'attack'] = True
attack.loc[attack['pctAttack'] <.25,'attack'] = False
attack.insert(0,'rev_id',boolColsGrouped['rev_id'])

In [7]:
# Create the labels data frame by dropping irrelevant columns from the attack frame and merge the
# labels into the comments dataframe to complete labeling all comments
labels = attack.drop(columns=['demoMax', 'pctAttack'])
comments = pd.merge(comments, labels, on='rev_id', how='left')

#### 1.2 Cleaning the data
In addition to the cleaning necessary to create the labels, the comment text is cleaned and prepared as well:
 - Remove newline and tab tokens
 - Split into test and train groups
 - Remove stop words from the word n-grams
 - Convert all characters to lowercase
 - Strip all accents
 - Change the comments into the vectors necessary for most classifiers
     - scored for rarity and frequency (tf-idf)
 - Encode the labels for use by the classifiers

Other methods I tried included setting maximum and minimum document frequency; however, these are made a bit redundant by removing stop words and setting max_features limits, respectively.
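
A minimal sketch of the text-side steps on a toy corpus, using the same `TfidfVectorizer` options as the word n-gram branch of the parameter grid later in the notebook. The three documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up comments, including upper case and an accented character
docs = [
    "You are a TOTAL idiót",
    "Thanks for the helpful edit",
    "The edit was helpful",
]

vectorizer = TfidfVectorizer(
    analyzer="word",
    lowercase=True,           # convert all characters to lowercase
    strip_accents="unicode",  # "idiót" becomes "idiot"
    stop_words="english",     # drop common English words before building n-grams
    ngram_range=(1, 2),       # unigrams and bigrams
    max_features=10000,       # cap on vocabulary size
)

# Sparse matrix with one row per comment and one tf-idf weighted column per n-gram
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(X.shape)
```
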

In [8]:
# remove newline and tab tokens
comments['comment'] = comments['comment'].str.replace("NEWLINE_TOKEN", " ", regex=False)
comments['comment'] = comments['comment'].str.replace("TAB_TOKEN", " ", regex=False)

In [9]:
# Split and encode the training data
X_train,X_test,y_train,y_test = train_test_split(
    comments.comment, comments.attack, test_size=.33,random_state=42)

encode = LabelEncoder()
y_train = encode.fit_transform(y_train)
y_test = encode.transform(y_test)  # reuse the encoder fitted on the training labels


### 3.0 Features, Methods and Parameters, Oh my
#### 3.1 Features
The features are pretty basic: a blend of word and character n-grams. I tried adding a couple of hand-made features, such as the ratio of exclamation marks to sentence-ending punctuation and the proportion of upper-case letters, to capture times when people are figuratively yelling. However, they didn't make much of a difference, so I left them out.
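
For illustration, the two "yelling" features could be computed along these lines. This is a sketch under my own assumptions (the function name `shouting_features`, treating `.`, `!` and `?` as sentence-ending punctuation, and returning 0.0 for empty input), not the code that was actually tried.

```python
def shouting_features(comment):
    """Return two hypothetical 'yelling' features for one comment."""
    # Exclamation marks relative to all sentence-ending punctuation
    exclamations = comment.count("!")
    sentence_enders = sum(comment.count(ch) for ch in ".!?")

    # Upper-case letters relative to all letters
    letters = [ch for ch in comment if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())

    return {
        "exclaim_ratio": exclamations / sentence_enders if sentence_enders else 0.0,
        "upper_ratio": upper / len(letters) if letters else 0.0,
    }


features = shouting_features("STOP IT! Now.")
print(features)  # exclaim_ratio is 0.5; upper_ratio is 7/9
```
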

#### 3.2 Results: the output on average looks like this

| Method | Label | Precision | Recall | F1-Score | Support |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| Stochastic Gradient Descent | Not Attack | .92 | .97 | .94 | 30,754 |
|  | Attack | .84 | .66 | .74 | 7,482 |
|  | Weighted Avg. | .91 | .91 | .90 | 38,236 |
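
As a sanity check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The helper name `f1_from` is made up for this sketch; the inputs are the rounded precision and recall values from the table.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


not_attack_f1 = f1_from(0.92, 0.97)
attack_f1 = f1_from(0.84, 0.66)

print(round(not_attack_f1, 2))  # 0.94
print(round(attack_f1, 2))      # 0.74
```
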

#### 3.3 Fine Tuning
I looked at altering nearly every parameter in the documentation, as well as max_features values from 5,000 to 100,000. The biggest difference maker for the SGD model was switching the loss to 'modified_huber', which is ironic given that Huber loss is for regression, though the modified version is a classification loss, which I suppose is what makes it modified. It improved the overall weighted average for SGD by 2%: attack recall rose by 16% while attack precision fell by 8%, a net 8% gain on the attack class, and on the not-attack class precision went up 3% while recall only lost 1%.
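
For reference, the modified Huber loss is defined on the margin z = y * f(x) with labels y in {-1, +1}: it is zero for z >= 1, the squared hinge (1 - z)^2 for -1 <= z < 1, and linear (-4z) below -1, so badly misclassified outliers are penalized linearly instead of quadratically. The sketch below is a plain-Python illustration of that formula next to the ordinary hinge loss, not scikit-learn's internal implementation.

```python
def modified_huber(y, score):
    """Modified Huber loss for a label y in {-1, +1} and a raw decision score."""
    z = y * score
    if z >= -1:
        return max(0.0, 1.0 - z) ** 2  # squared hinge region (zero once z >= 1)
    return -4.0 * z                    # linear region for large violations


def hinge(y, score):
    """Ordinary hinge loss, shown for comparison."""
    return max(0.0, 1.0 - y * score)


for score in (2.0, 0.5, 0.0, -1.0, -3.0):
    print(score, modified_huber(1, score), hinge(1, score))
```
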

I used a mixture of BayesSearchCV from the skopt module (scikit-optimize) and RandomizedSearchCV to quickly get a better idea of which features and value ranges looked best, then finished with a couple of large, long-running GridSearchCVs to find the best combinations. I tried optimizing for precision, recall, and F1-score independently and, as might be expected, found that precision and recall often trade off against each other, so in the end optimizing for F1 at least smoothed the trade and brought the two values closer to one another.
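
A minimal sketch of the coarse, randomized pass, assuming a tiny made-up corpus and a deliberately small search space. The texts, labels, and candidate values are illustrative only and are not the ranges searched in this project; the Bayesian pass is left out so the sketch depends only on scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Made-up comments: 1 = attack, 0 = not attack
texts = [
    "you are an idiot", "thanks for the edit",
    "what a stupid moron", "nice work on the article",
    "you idiot, stop editing", "I added a source",
    "shut up you moron", "please see the talk page",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=42)),
])

# Candidate values to sample from (illustrative, not the project's ranges)
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__loss": ["hinge", "modified_huber"],
    "clf__alpha": [1e-5, 1e-4, 1e-3],
}

# Try 4 random combinations, scoring each by F1 with 2-fold cross-validation
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=4, cv=2, scoring="f1", random_state=42
)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Once a randomized pass like this narrows down the promising regions, the surviving values can be handed to an exhaustive `GridSearchCV` over a much smaller grid.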

In [10]:
# define the parameter grid, I kept it separate for ease in tweaking values:

parameterGrid = dict(
    features__word__max_features=[10000],
    features__word__ngram_range=[(1,2)],
    features__word__lowercase=[True],
    features__word__stop_words=['english'],
    features__word__strip_accents=['unicode'],
    
    features__char__max_features=[25000],
    features__char__ngram_range=[(2,3)],
    features__char__lowercase=[True],
    features__char__strip_accents=['unicode'],
    clf__loss=['modified_huber'],
    clf__alpha=[.0001],
    clf__learning_rate=['optimal'],
    clf__eta0=[.001]  # ignored while learning_rate is 'optimal'
    
)

In [11]:
# Setup classifier
clf = SGDClassifier(verbose = 51) #Verbosity over 50 prints the entire log as it is fitted
wVector = TfidfVectorizer(analyzer='word')
cVector = TfidfVectorizer(analyzer='char')
fUnion = FeatureUnion([("word", wVector), ("char", cVector)])

pipe = Pipeline([
    ('features', fUnion),
    ('clf', clf)
])

grid_search = GridSearchCV(pipe, param_grid=parameterGrid, n_jobs=6, pre_dispatch=4,
                            verbose=51,cv=3, scoring='f1')


In [12]:
# Train the model
grid_search.fit(X_train,y_train)

Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done   1 tasks      | elapsed:   51.1s
[Parallel(n_jobs=6)]: Done   3 out of   3 | elapsed:   51.4s remaining:    0.0s
[Parallel(n_jobs=6)]: Done   3 out of   3 | elapsed:   51.4s finished
-- Epoch 1
Norm: 39.67, NNZs: 34254, Bias: -1.381699, T: 77628, Avg. loss: 0.786204
Total training time: 0.09 seconds.
-- Epoch 2
Norm: 31.83, NNZs: 34639, Bias: -1.158743, T: 155256, Avg. loss: 0.269599
Total training time: 0.18 seconds.
-- Epoch 3
Norm: 29.88, NNZs: 34729, Bias: -1.079704, T: 232884, Avg. loss: 0.245773
Total training time: 0.25 seconds.
-- Epoch 4
Norm: 29.02, NNZs: 34749, Bias: -1.045399, T: 310512, Avg. loss: 0.237410
Total training time: 0.32 seconds.
-- Epoch 5
Norm: 28.55, NNZs: 34768, Bias: -1.036023, T: 388140, Avg. loss: 0.232991
Total training time: 0.39 seconds.
-- Epoch 6
Norm: 28.24, NNZs: 34770, Bias: -1.025659, T:

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('word',
                                                                        TfidfVectorizer()),
                                                                       ('char',
                                                                        TfidfVectorizer(analyzer='char'))])),
                                       ('clf', SGDClassifier(verbose=51))]),
             n_jobs=6,
             param_grid={'clf__alpha': [0.0001], 'clf__eta0': [0.001],
                         'clf__learning_rate': ['optimal'],
                         'clf__loss': ['modified_huber'],
                         'features__char__lowercase':...
                         'features__char__max_features': [25000],
                         'features__char__ngram_range': [(2, 3)],
                         'features__char__strip_accents': ['unicode'],
                  

In [16]:
# Classification Report
y_valid_pred = grid_search.best_estimator_.predict(X_test)
met = classification_report(y_test, y_valid_pred)
print(met)

              precision    recall  f1-score   support

           0       0.92      0.98      0.94     30754
           1       0.86      0.63      0.73      7482

    accuracy                           0.91     38236
   macro avg       0.89      0.80      0.84     38236
weighted avg       0.91      0.91      0.90     38236



In [17]:
# Confusion Matrix: rows are the true labels, columns are the labels predicted by the model
conf_mat = confusion_matrix(y_test, y_valid_pred)
print(conf_mat)

[[30001   753]
 [ 2755  4727]]
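
The attack-class numbers in the classification report can be reproduced from this matrix by hand. In scikit-learn's `confusion_matrix(y_true, y_pred)`, rows are true labels and columns are predicted labels, which matches the row sums here (30,754 and 7,482 are the support values). A small sketch with the counts copied from the output above:

```python
# Counts copied from the printed confusion matrix
conf_mat = [[30001, 753],
            [2755, 4727]]

# Row 0 = true "not attack", row 1 = true "attack"
(tn, fp), (fn, tp) = conf_mat

precision = tp / (tp + fp)                  # of predicted attacks, how many were attacks
recall = tp / (tp + fn)                     # of true attacks, how many were caught
accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of all comments classified correctly

print(round(precision, 2))  # 0.86
print(round(recall, 2))     # 0.63
print(round(accuracy, 2))   # 0.91
```
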


In [18]:
# Lists best parameters from the grid search, borrowed from lecture code:
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameterGrid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
sys.stdout.flush()

Best parameters set:
	clf__alpha: 0.0001
	clf__eta0: 0.001
	clf__learning_rate: 'optimal'
	clf__loss: 'modified_huber'
	features__char__lowercase: True
	features__char__max_features: 25000
	features__char__ngram_range: (2, 3)
	features__char__strip_accents: 'unicode'
	features__word__lowercase: True
	features__word__max_features: 10000
	features__word__ngram_range: (1, 2)
	features__word__stop_words: 'english'
	features__word__strip_accents: 'unicode'
