In [1]:
import os
import re
import pandas as pd
from tqdm import tqdm
from unidecode import unidecode
from collections import defaultdict

In [2]:
# None disables column-width truncation; the older -1 sentinel is deprecated
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 500)

In [3]:
df = pd.read_csv("annotations_metadata.csv")
df.head(5)

Unnamed: 0,file_id,user_id,subforum_id,num_contexts,label
0,12834217_1,572066,1346,0,noHate
1,12834217_2,572066,1346,0,noHate
2,12834217_3,572066,1346,0,noHate
3,12834217_4,572066,1346,0,hate
4,12834217_5,572066,1346,0,noHate


## Read in posts as a csv column

In [4]:
post_list = []
for idx, row in df.iterrows():
    with open("{}/{}.txt".format("all_files", row["file_id"])) as f:
        post_list.append(f.read().strip())

In [5]:
df["post"] = post_list

In [6]:
df["comment_id"] = [int(x.split("_")[0]) for x in df.file_id.tolist()]
df["sentence_num"] = [int(x.split("_")[1]) for x in df.file_id.tolist()]
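
With `comment_id` and `sentence_num` split out, the sentence preceding each row within the same comment can be attached using a grouped shift. This is only a sketch on a toy frame with made-up text. The `prev_post` and `post_with_context` column names are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame mimicking the columns built above; the text is invented.
toy = pd.DataFrame({
    "comment_id": [1, 1, 1, 2],
    "sentence_num": [1, 2, 3, 1],
    "post": ["First .", "Second .", "Third .", "Other ."],
})

# Sort so that shifting within a comment yields the previous sentence.
toy = toy.sort_values(["comment_id", "sentence_num"])
# First sentence of each comment has no predecessor, so fill with "".
toy["prev_post"] = toy.groupby("comment_id")["post"].shift(1).fillna("")
toy["post_with_context"] = (toy["prev_post"] + " " + toy["post"]).str.strip()
```

The shift never crosses a comment boundary, so the first sentence of a comment keeps only its own text.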

## What do some of the hate speech comments look like?

In [7]:
df[df["label"] == "hate"][["file_id", "post", "num_contexts"]].head(20)

Unnamed: 0,file_id,post,num_contexts
3,12834217_4,Click below for a FREE download of a colorfully illustrated 132 page e-book on the Zionist-engineered INTENTIONAL destruction of Western civilization .,0
10,12834493_1,"She may or may not be a Jew but she 's certainly stupid , she seems to think the Blacks wo n't kill her alongside every other White they can get their dirty hands on , what a muppet !",0
21,12849464_3,"I 've always considered teaching as one of the professions I would like to get into , but not in a neighbourhood like that ... never. kids like that disgust me .",1
22,12849464_4,And the sad thing is the white students at those schools will act like that too .,1
31,12855154_1,Their gods often had pale skin and blue eyes but I suppose literacy is n't valued highly by their cultural intelligencia .,0
42,12860820_1,Greece need a man like Adolf Hitler and now is the perfect time to start a national political movement to free Your country from the zionist bankers,0
46,12869018_3,He is a pathetic little chimp .,1
59,12883101_1,Try El Paso Texas Soo much mudd invasion that the river is the same color as those playing/crossing in it .. year round !,4
62,12904919_1,"Here is a video of the event that is worth watching just to see a beautiful White lady screaming `` scum , scum , scum , f * * * ing scum !!! '' at the muslim filth .",0
70,13061184_1,"I suppose the concept of a male cow and a female cow producing MORE cows , and the repeating process is too much for them to understand .",1


## Convert label column to numeric

In [8]:
df["num_label"] = [1 if label == "hate" else 0 for label in df["label"].tolist()]

## Remove Extra Columns

In [9]:
df = df.drop(["label"], axis="columns")

## Class Distribution:
- The dataset is heavily skewed toward `noHate` (about 89% of rows), so it may need to be balanced later

In [10]:
df.groupby("num_label")["file_id"].count()/len(df.index)

num_label
0    0.890716
1    0.109284
Name: file_id, dtype: float64
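
One possible way to handle the imbalance later, without discarding rows, is to weight each class inversely to its frequency. The sketch below uses a toy 90/10 label series rather than the real data, and the `class_weights` name is hypothetical.

```python
import pandas as pd

# Toy labels with a 90/10 split, standing in for the num_label column.
labels = pd.Series([0] * 9 + [1] * 1, name="num_label")

# Fraction of rows per class.
freqs = labels.value_counts(normalize=True)

# Inverse-frequency weights: n_samples / (n_classes * class_count).
class_weights = (1.0 / (len(freqs) * freqs)).to_dict()
```

The rare class ends up with the larger weight, which could then be passed to a model or loss function that accepts per-class weights.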

In [11]:
# Ex: find_examples_in_text(df["post"].tolist(), [r":[a-zA-Z]+:(?=[:\s])"], padding=5, num_examples=20)
def find_examples_in_text(documents, words, num_examples=1, padding=100):
    """Collect up to num_examples matches per regex, with `padding` characters of context on each side."""
    examples = defaultdict(lambda: defaultdict(lambda: defaultdict(str)))
    for word in words:
        print("Examples of {}".format(word))
        # Reset the counter so every pattern gets its own num_examples budget
        curr_examples = 0
        for doc_idx, document in enumerate(documents):
            match = re.search(word, document)
            if match:
                # Use the padding argument instead of a hard-coded 100
                start = max(match.start() - padding, 0)
                end = min(match.end() + padding, len(document))
                examples[word][doc_idx][match] = document[start:end]
                curr_examples += 1
                if curr_examples >= num_examples:
                    break
    return examples

In [12]:
find_examples_in_text(df.post.tolist(), [r"\*"], num_examples=50)

Examples of \*


defaultdict(<function __main__.find_examples_in_text.<locals>.<lambda>()>,
            {'\\*': defaultdict(<function __main__.find_examples_in_text.<locals>.<lambda>.<locals>.<lambda>()>,
                         {62: defaultdict(str,
                                      {<_sre.SRE_Match object; span=(123, 124), match='*'>: "event that is worth watching just to see a beautiful White lady screaming `` scum , scum , scum , f * * * ing scum !!! '' at the muslim filth ."}),
                          76: defaultdict(str,
                                      {<_sre.SRE_Match object; span=(0, 1), match='*'>: '* Unsubscribed * Off to the SA threads .'}),
                          92: defaultdict(str,
                                      {<_sre.SRE_Match object; span=(21, 22), match='*'>: "Excellent Article !! * * * * * Why Were n't They In Jail ?"}),
                          157: defaultdict(str,
                                      {<_sre.SRE_Match object; span=(6, 7), match='*'>: '2.0 )

## Text Preprocessing
The text looks like it has already been run through a tokenizer, so for now some basic preprocessing is enough. Checked items are the ones I actually did; the rest are recommendations for future work:

- [x] Normalize links (harder than usual, because the tokenizer has already split them apart)
- [ ] Surround capitalized words with "_caps_ word _caps_"
- [x] Convert numbers to "_number_"
- [ ] Each post is labelled commentID_sentenceNumber.txt, but some sentences may only be toxic because of the surrounding ones. Attach the previous sentence to each post
- [ ] The incoming comments have already been tokenized, and their tokenizer is not great. May need to undo this and re-tokenize
- [x] The posts have either been run through a profanity filter or the site itself does not allow profanity, and the censored words were then split by the tokenizer. Replace them with "_censored_"
- [x] Standardize repeated characters
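
The unchecked capitalization item could look something like the sketch below. The `mark_caps` helper is hypothetical, and so is the rule that a word counts as capitalized only when it is fully uppercase and at least two letters long; the notebook defines neither.

```python
import re

def mark_caps(text):
    # Wrap fully uppercase words (2+ letters) in _caps_ markers.
    return re.sub(
        r"\b[A-Z]{2,}\b",
        lambda m: "_caps_ {} _caps_".format(m.group(0)),
        text,
    )

marked = mark_caps("Click below for a FREE download")
```

A step like this would have to run before the text is lower-cased, since the pattern relies on the original casing.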

In [13]:
def replace_word_dates(text, replace_str="_date_"):
    """Replace written-out dates such as 'March 13th , 2014' or 'Feb. 14th , 2014' with replace_str."""
    short_month_regex = r"([Jj]an|[Ff]eb|[Mm]ar|[Aa]pr|[Mm]ay|[Jj]un|[Jj]ul|[Aa]ug|[Ss](ep|ept)|[Oo]ct|[Nn]ov|[Dd]ec)\.?"
    long_month_regex = r"([Jj]anuary|[Ff]ebruary|[Mm]arch|[Aa]pril|[Mm]ay|[Jj]une|[Jj]uly|[Aa]ugust|[Ss]eptember|[Oo]ctober|[Nn]ovember|[Dd]ecember)"
    # Day of month with an optional ordinal suffix and an optional (tokenized) comma
    day_regex = r"([1-2][0-9]|3[01]|0?[1-9])(st|nd|rd|th)?(,| ,)?"
    space_regexes = [r" ", r"\s?-\s?"]
    # Four-digit years 1000-2099, or a two-digit year with an optional apostrophe
    year_regex = r"(1[0-9]{3}|20[0-9]{2}|'?[0-9]{2})"
    for space_regex in space_regexes:
        # \b keeps month abbreviations from matching inside longer words
        final_regex = r"\b({sm}|{lm})({sp}{day})?{sp}{year}".format(sm=short_month_regex, lm=long_month_regex, day=day_regex, sp=space_regex, year=year_regex)
        text = re.sub(final_regex, replace_str, text)
    return text

In [14]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

nltk.download('wordnet')
stemmer = SnowballStemmer("english", ignore_stopwords=True)
lemma = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /home/michael/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [15]:
post_list = df.post.tolist()
preprocessed_posts = []
for post in tqdm(post_list):
    post = unidecode(post)
    post = replace_word_dates(post)
    # Replace File Links: http://www.mediafire.com/download/96fg6ft02lyfruz/Booklet _ White _ YT _ Comment _ ( Hyperlinked ) .txt
    post = re.sub(r"https?:\/\/([a-zA-Z0-9_\-\/%\+]\s?|[a-zA-Z0-9]\.[a-zA-Z0-9])+(\( Hyperlinked \) |\( Hyperlinked-Back-Up % 5D)?\.(txt|pdf|docx)", " _file_link_ ", post)
    # Replace Youtube Links: http://www.youtube.com/watch ? v = _ 8hg254ALpM
    post = re.sub(r"https?:\/\/(www\.)?(youtube\.com\/watch \? v = _ [a-zA-Z0-9_]+)", " _youtube_link_ ", post)
    # Replace Other Links: http://trutube.tv/video/14247/The-Zionist-Attack-on-Western-Civilization-Pages-1-33-Part-1-of-4-Banned-from-YouTubeNotepadPromotionalYouTubeComment
    post = re.sub(r"https?:\/\/([a-zA-Z0-9_\-\/\.](?!ttp))+", " _other_link_ ", post)
    # Replace Numbers: 
    post = re.sub(r"(?<=\s)([0-9]+\.?[0-9]*|[0-9]{1,3},([0-9]{3})+)(?=\s)", " _number_ ", post)
    # Repeated !
    post = re.sub(r"!{2,}", " _repeatexc_ ", post)
    # Repeated ?
    post = re.sub(r"\?{2,}", " _repeatq_ ", post)
    # Repeated .
    post = re.sub(r"\.{4,}", " ... ", post)
    # Censor in middle of word: s**t, s*t, etc
    post = re.sub(r"(?<=\s)[a-zA-Z]\s(\*\s){1,}[a-z]*", " _censored_ ", post)
    # Censor first part of word: ***ing, ***ed
    post = re.sub(r"(?<!(\s[a-zA-Z]|(\s|[0-9])[0-9]|\s\*)\s)(\*\s?){2,}(ing|hole|ed|\!|\?|\.(?![a-z]))", " _censored_ ", post)
    # Remove Dashes
    post = re.sub(r"-", " ", post)
    # Repeated *
    post = re.sub(r"(?<=\S)\s(\*\s){2,}", " _repeatstar_ ", post)
    # Remove Commas
    post = re.sub(r",", " ", post)
    # Remove Apostrophes
    post = re.sub(r"'", "", post)
    # Slashes
    post = re.sub(r"\/", " / ", post)
    # Remove `:
    post = re.sub(r"`", " ", post)
    # Remove Excess Spaces
    post = re.sub(r'\s+', ' ', post).strip()
    # Stem?
    post = ' '.join([stemmer.stem(x) for x in post.split(" ")])
    #post = ' '.join([lemma.lemmatize(x) for x in post.split(" ")])
    # Lower Case
    final_post = post.lower()
    preprocessed_posts.append(final_post)

100%|██████████| 10944/10944 [00:01<00:00, 7153.22it/s]
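
The chain of `re.sub` calls above could also be written as an ordered table of rules, which keeps each pattern next to its replacement and makes the order explicit. The sketch below reuses four of the patterns from the cell above unchanged; `RULES` and `apply_rules` are hypothetical names, and the sample sentence is invented.

```python
import re

# Ordered (pattern, replacement) pairs; order matters because later
# rules see the output of earlier ones.
RULES = [
    # Numbers surrounded by whitespace, e.g. 132, 7.42 or 18,300
    (re.compile(r"(?<=\s)([0-9]+\.?[0-9]*|[0-9]{1,3},([0-9]{3})+)(?=\s)"), " _number_ "),
    # Repeated exclamation marks
    (re.compile(r"!{2,}"), " _repeatexc_ "),
    # Repeated question marks
    (re.compile(r"\?{2,}"), " _repeatq_ "),
    # Collapse runs of whitespace left behind by the rules above
    (re.compile(r"\s+"), " "),
]

def apply_rules(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()

cleaned = apply_rules("Downloaded over 18,300 times !!!")
```

Because the number pattern requires whitespace on both sides, a number at the very start or end of a post is left untouched, both here and in the original cell.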


In [16]:
df["preprocessed_post"] = preprocessed_posts

In [17]:
df[["post", "preprocessed_post", "num_label"]].head(20)

Unnamed: 0,post,preprocessed_post,num_label
0,"As of March 13th , 2014 , the booklet had been downloaded over 18,300 times and counting .",as of _date_ the booklet had been download over _number_ time and count .,0
1,"In order to help increase the booklets downloads , it would be great if all Stormfronters who had YouTube accounts , could display the following text in the description boxes of their uploaded YouTube videos .",in order to help increas the booklet download it would be great if all stormfront who had youtub account could display the follow text in the descript box of their upload youtub video .,0
2,( Simply copy and paste the following text into your YouTube videos description boxes. ),( simpli copi and past the follow text into your youtub video descript boxes. ),0
3,Click below for a FREE download of a colorfully illustrated 132 page e-book on the Zionist-engineered INTENTIONAL destruction of Western civilization .,click below for a free download of a color illustr _number_ page e book on the zionist engin intent destruct of western civil .,1
4,Click on the `` DOWNLOAD ( 7.42 MB ) '' green banner link .,click on the download ( _number_ mb ) green banner link .,0
5,"Booklet updated on Feb. 14th , 2014 .",booklet updat on _date_ .,0
6,"( Now with over 18,300 Downloads. )",( now with over _number_ downloads. ),0
7,PDF file : http://www.mediafire.com/download/7p3p3goadvvqvsf/WNDebateBooklet_2-14-14.pdfMSWordfile:http://www.mediafire.com/download/psezkkk4d6a3wt1/WNDebateBooklet _ 2-14-14.docx Watch the 10 hour video version of `` The Zionist Attack on Western Civilization '' @ http://trutube.tv/video/14247/The-Zionist-Attack-on-Western-Civilization-Pages-1-33-Part-1-of-4-Banned-from-YouTubeNotepadPromotionalYouTubeComment:http://www.mediafire.com/download/96fg6ft02lyfruz/Booklet _ White _ YT _ Comment _ ( Hyperlinked ) .txt http://www.mediafire.com/download/zcn3wozjbwnezms/Booklet-White-YT-Comment- ( Hyperlinked-Back-Up % 5D.txt http://www.mediafire.com/download/9uyudq1yuxu1dur/Booklet+Comment+%28Firefox%29.txt2minutepromotionalBOOKLETvideo@http://www.youtube.com/watch ? v = _ 8hg254ALpM Are you interested in helping spread the booklet download link across the world ?,pdf file : _file_link_ mswordfile: _file_link_ watch the _number_ hour video version of the zionist attack on western civil @ _other_link_ : _file_link_ _file_link_ _file_link_ 2minutepromotionalbookletvideo@ _youtube_link_ are you interest in help spread the booklet download link across the world ?,0
8,Then why not simply copy this text ( & links ) and paste it into the description box of your YouTube videos ?,then why not simpli copi this text ( & link ) and past it into the descript box of your youtub video ?,0
9,Thank you in advance. : ) Download the youtube `` description box '' info text file below @ http://www.mediafire.com/download/dqhn1czprr17o21/Booklet-Description-Box _ Info.txt,thank you in advance. : ) download the youtub descript box info text file below @ _file_link_,0


## Write Preprocessed Dataset to CSV

In [18]:
df.to_csv("preprocessed_posts.csv", index=False)