# Table of Contents
[A Naive Classifier](#anc)  
[Char Logistic Model](#clm)

<a name="anc"></a>
# Testing the Naive Classifier

This is a simple logistic regression classifier used by WikiMedia in its demo.  
However, I don't think it is the model they use for their public dataset.  
A more comprehensive model is described below.

In [7]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import sys
import os
path = './TalkData/computed_dataset/'

In [5]:
comments_dir = os.path.join(path, 'attack_annotated_comments.tsv')
annotations_dir = os.path.join(path, 'attack_annotations.tsv')
comments = pd.read_csv(comments_dir, sep='\t', index_col=0)
annotations = pd.read_csv(annotations_dir, sep='\t')

In [6]:
labels = annotations.groupby('rev_id')['attack'].mean()>.5
comments['attack'] = labels
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
comments.query('attack')['comment'].head()


rev_id
801279             Iraq is not good  ===  ===  USA is bad   
2702703      ____ fuck off you little asshole. If you wan...
4632658         i have a dick, its bigger than yours! hahaha
6545332      == renault ==  you sad little bpy for drivin...
6545351      == renault ==  you sad little bo for driving...
Name: comment, dtype: object
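The labelling rule in cell `In [6]` is a strict majority vote over annotators: a comment counts as an attack only when more than half of its annotations say so. A minimal sketch on made-up annotations (the `rev_id` values and votes below are invented for illustration) shows that an exact tie is labelled `False`:

```python
import pandas as pd

# Invented toy annotations: three comments rated by 3, 2 and 4 annotators.
toy_annotations = pd.DataFrame({
    'rev_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'attack': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Same rule as in the notebook: the mean vote must be strictly above 0.5.
toy_labels = toy_annotations.groupby('rev_id')['attack'].mean() > .5
print(toy_labels)
# rev_id 1 -> True (2 of 3 votes), rev_id 2 -> False (tie), rev_id 3 -> False (1 of 4)
```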

In [7]:
training = comments.query('split == "train"')
testing = comments.query('split == "test"')

clf = Pipeline([
    ('vect', CountVectorizer(max_features=10000, ngram_range=(1,2))),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('clf', LogisticRegression())
])
clf = clf.fit(training['comment'], training['attack'])
auc = roc_auc_score(testing['attack'], clf.predict_proba(testing['comment'])[:,1])
print(auc)

0.956968629551
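The number printed above is the ROC AUC. One way to read it: the probability that a randomly chosen attack comment receives a higher score than a randomly chosen non-attack comment. The sketch below computes that pairwise quantity directly on invented scores; `pairwise_auc` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented toy data: two attacks and three non-attacks.
toy_labels = [True, False, True, False, False]
toy_scores = [0.9, 0.2, 0.4, 0.5, 0.1]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 5 of the 6 pairs are ordered correctly
```

The double loop is quadratic, so it is only suitable for small examples like this one.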


<a name='clm'></a>
# Char-Logistic Model

After searching for some time, I think this is the model they describe in their paper.   
Unfortunately, they did not make their API public, so I grabbed some of their code and made a slightly modified version for our own use.
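For orientation, a character-level logistic model can be sketched with scikit-learn as follows. The n-gram range, the TF-IDF weighting and the toy comments are all assumptions made for illustration; the actual WikiMedia pipelines are loaded from their pickle files in Step 2.1 and may be configured differently.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data, far too small for a real model.
toy_texts = [
    "you are an idiot",
    "shut up you fool",
    "thanks for the help",
    "nice work on the article",
]
toy_labels = [True, True, False, False]

char_clf = Pipeline([
    # analyzer='char' builds features from character n-grams instead of words
    ('vect', TfidfVectorizer(analyzer='char', ngram_range=(1, 5))),
    ('clf', LogisticRegression()),
])
char_clf.fit(toy_texts, toy_labels)

# Column 1 holds the probability of the positive (True) class.
proba = char_clf.predict_proba(["you fool", "thanks a lot"])
print(proba[:, 1])
```

Character n-grams tend to be more robust than word features to misspellings and obfuscated insults, which is one plausible reason to prefer them for this task.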

**Requirements**
* BeautifulSoup4
* Joblib
* mwparserfromhell

### Step 0
Go to the GitHub directory where all the WikiMedia files and data exist.

In [8]:
import sys
import os

WMdir = '/home/ubuntu/wikipedia/TalkAnalytics/ClonedModel/wmModel/wiki-detox/' # change this to the path of your local clone
sys.path.append(WMdir)

### Step 1
Clean the dataset using the **diff_utils** functions provided by WikiMedia

In [9]:
import sklearn
import requests
import sys
import os
import inspect
from pprint import pprint
from bs4 import BeautifulSoup

import src.data_generation.diff_utils as diff_utils

def diff_and_clean(data):
    ''' Take the diff of the text column between consecutive rows of each title, then clean it '''
    
    titles = data.title.unique()
    for title in titles:
        data_subset = data[data.title == title]

        text_new = data_subset.text[1:]
        text_old = data_subset.text[:-1]
        text_diff = [data_subset.text.iloc[0]]

        for new, old in zip(text_new, text_old):
            # missing text is read by pandas as NaN (a float), so replace it with ''
            if not isinstance(new, str):
                print("text is not str: %s, changed to empty" % (new))
                new = ''
            if not isinstance(old, str):
                print("text is not str: %s, changed to empty" % (old))
                old = ''
                
            text_diff.append(new.replace(old,''))
        # the diff_utils function requires a column named "insertion"
        data.loc[data.title==title,'insertion'] = text_diff
    data = diff_utils.clean_and_filter(data)
    return data
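The `new.replace(old, '')` step in `diff_and_clean` is a very rough diff: it only isolates the inserted text when the previous text survives verbatim inside the new one. A small sketch on invented strings shows where it works and where it does not; `naive_insertion` is a hypothetical wrapper around that single call.

```python
def naive_insertion(old, new):
    """Mimic the `new.replace(old, '')` step used in diff_and_clean."""
    return new.replace(old, '')

old = "Hello there."
appended = "Hello there. I disagree with this edit."
edited = "Hello there! I disagree with this edit."

# Case 1: text was only appended, so the old text is removed cleanly.
appended_diff = naive_insertion(old, appended)
print(repr(appended_diff))  # ' I disagree with this edit.'

# Case 2: the old text was altered, so nothing matches and the whole text is kept.
edited_diff = naive_insertion(old, edited)
print(edited_diff == edited)  # True

# Case 3: every occurrence of the old text is removed, not just the first one.
repeated_diff = naive_insertion("ok", "ok, that is ok")
print(repr(repeated_diff))  # ', that is '
```

In case 2 the full text is scored as if it were new, so unchanged content can be counted again in later rows.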



### Step 2
Build Models    

**Step 2.1** Load the pre-trained models

In [3]:
import copy
import joblib

def load_pipeline(directory):
    if os.path.isfile(directory):
        return joblib.load(directory)
    else:
        print("pipeline not found")
        return None
    
# load the sklearn .pkl pipelines from WikiMedia

aggresion_model_dir = os.path.join(WMdir,'app/models/aggression_linear_char_oh_pipeline.pkl')
attack_model_dir = os.path.join(WMdir,'app/models/attack_linear_char_oh_pipeline.pkl')

model_dict = {
    'aggresion': load_pipeline(aggresion_model_dir),
    'attack': load_pipeline(attack_model_dir)
}




**Step 2.2** Define functions to apply models

In [10]:
def apply_models_DF(df, col):
    ''' Predict the probability of input data to be labelled
        'aggressive' or 'attack'
        
        Input:
        ======
        df:   dataFrame to be predicted
        col:  name of the column that stores the texts
        
    '''
    
    texts = df[col]
    for task,model in model_dict.items():
        scores = model.predict_proba(texts)[:,1]
        df['pred_%s_score_uncalibrated'%(task)] = scores
    return df

def apply_models_text(text):
    ''' Predict the probability of input texts to be labelled
        'aggressive' or 'attack'
        
        Input:
        ======
        text:  comments to be predicted
        
    '''

    for task,model in model_dict.items():
        scores = model.predict_proba([text])[:,1]
        # scores is a length-1 array; index it so %f receives a plain scalar
        print('pred_%s_score_uncalibrated: %f'%(task,scores[0]))
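When the scores are needed programmatically rather than on screen, a variant that returns a dictionary may be more convenient. The sketch below is an assumption-laden illustration: `score_text` and `ConstantModel` are hypothetical names, and the stub model stands in for the real pipelines so the example runs without the `.pkl` files.

```python
import numpy as np

class ConstantModel:
    """Hypothetical stand-in for a loaded pipeline; always predicts the same probability."""
    def __init__(self, p_positive):
        self.p_positive = p_positive

    def predict_proba(self, texts):
        # One row per text, two columns; column 1 is read as the positive class.
        return np.array([[1.0 - self.p_positive, self.p_positive] for _ in texts])

def score_text(text, models):
    """Return {column_name: score} for a single comment instead of printing."""
    return {
        'pred_%s_score_uncalibrated' % task: float(model.predict_proba([text])[0, 1])
        for task, model in models.items()
    }

# Keys mirror the spelling used in model_dict so the column names match.
toy_models = {'aggresion': ConstantModel(0.25), 'attack': ConstantModel(0.5)}
toy_result = score_text('well done', toy_models)
print(toy_result)
```

Passing the model dictionary as an argument, rather than reading a global, also makes the function easy to test with stubs like the one above.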


# Trial

In [12]:
comments_dir = './TalkData/computed_dataset/aggression_annotated_comments.tsv'
data = pd.read_csv(comments_dir, sep='\t')

In [13]:
comments = data.comment
data['comment'] = data['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
data['comment'] = data['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
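The two `apply(lambda ...)` calls above assume that every comment is a string; a missing value would raise an `AttributeError`. A minimal alternative, assuming that missing comments should simply become empty strings, uses the vectorised string methods. `strip_layout_tokens` is a hypothetical helper name and the sample comments are invented.

```python
import pandas as pd

def strip_layout_tokens(series):
    """Replace the NEWLINE_TOKEN / TAB_TOKEN placeholders with spaces."""
    cleaned = series.fillna("")  # assumption: treat missing comments as empty
    for token in ("NEWLINE_TOKEN", "TAB_TOKEN"):
        # regex=False makes this a literal substring replacement
        cleaned = cleaned.str.replace(token, " ", regex=False)
    return cleaned

toy_comments = pd.Series(["helloNEWLINE_TOKENworld", "aTAB_TOKENb", None])
toy_cleaned = strip_layout_tokens(toy_comments)
print(toy_cleaned.tolist())  # ['hello world', 'a b', '']
```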

In [14]:
text = data.comment[0]
apply_models_text(text)

pred_aggresion_score_uncalibrated: 0.003575
pred_attack_score_uncalibrated: 0.002437


In [15]:
apply_models_text('damn')
apply_models_text('well done')

pred_aggresion_score_uncalibrated: 0.958296
pred_attack_score_uncalibrated: 0.747354
pred_aggresion_score_uncalibrated: 0.017243
pred_attack_score_uncalibrated: 0.013077


# Apply

In [16]:
test_data = pd.read_csv('parsed.csv', sep='\t', nrows=1000)
test_data = diff_and_clean(test_data)

# remove redundant columns
test_data['clean_text'] = test_data['clean_diff']
test_data = test_data.drop(['text','diff','clean_diff'], axis=1)

text is not str: nan, changed to empty
text is not str: nan, changed to empty
text is not str: nan, changed to empty
text is not str: nan, changed to empty


In [18]:
test_data = apply_models_DF(test_data, 'clean_text')

In [29]:
demo_idx = 798
print(test_data.clean_text[demo_idx])
print(test_data.pred_aggresion_score_uncalibrated[demo_idx],test_data.pred_attack_score_uncalibrated[demo_idx])

 Biggest CROOK 

Lalit Modi is the one of the biggest crook in Indian History. This guy should hanged with rest. He is all about making money, even cutting others throat. His biggest supporters are BJP, which is also uneducated losers want to make Indian free from foreign investment. These bums never realized India is the way it today because foreign investment. Anyway, get rid off  Lalit Modi from his IPL post, send him to Jail.
0.236954043285 0.171614240261


In [30]:
test_data.head()

| Unnamed: 0 | byte | comment | time | title | user | clean_text | pred_aggresion_score_uncalibrated | pred_attack_score_uncalibrated |
|---|---|---|---|---|---|---|---|---|
| 0 | 307 | Project tag | 2009-01-20T02:04:31Z | Jagatballavpur (community development block) | Chandan Guha | Map\nThe map has been created with coordinates... | 0.005145 | 0.001985 |
| 2 | 20 | cities -> geography | 2011-10-20T09:28:30Z | Jagatballavpur (community development block) | Chandan Guha | Map\nThe map has been created with coordinates... | 0.005145 | 0.001985 |
| 3 | 141 | /* Map */ | 2011-10-20T09:30:10Z | Jagatballavpur (community development block) | Chandan Guha | It now has Jagatballavpur coordinates. - | 0.004637 | 0.005583 |
| 15 | 1131 | Notification of altered sources needing review... | 2016-11-14T21:22:02Z | Captain Strong | InternetArchiveBot | External links modified \n\nHello fellow Wiki... | 0.00374 | 0.002265 |
| 32 | 447 |  | 2013-02-01T07:38:18Z | Foreigners (Protected Areas) Order 1958 (India) | 2001:208:5:801:F412:ADB7:2F0E:5507 | Something to note is that the PAP has been sus... | 0.006208 | 0.01163 |


### Evaluate the Model