# Notebook for simple text classification

In this notebook, we build a classifier for detecting offensive language. We use the data set provided by the SemEval-2019 task OffensEval: https://competitions.codalab.org/competitions/20011. SemEval tasks are academic competitions in which the organizers challenge research teams to solve a shared task. The results of all teams are published, and the participants are asked to submit a paper describing their system. OffensEval had three subtasks:

- decide whether a tweet is offensive or not (subtask_a)
- decide what is the type of offense: targeted or untargeted (subtask_b)
- decide what is the type of target: individual, group, other (subtask_c)

More information can be found here:

- Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar (2019). SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval), https://www.aclweb.org/anthology/S19-2010, https://arxiv.org/abs/1903.08983
- Offensive Language Identification Dataset - OLID: https://scholar.harvard.edu/malmasi/olid

By the end of this notebook, you will have learned:

- How to create a classifier for detecting offensive language
- How to use words and their part of speech as features
- How to evaluate the performance of the classifier



## Offenseval data sets

The OffensEval CodaLab website provides access to the data sets that were used for the task. We provide the data sets for this notebook in the folder "offenseval-datasets". It is wise to analyse the data sets before starting. The folder contains the following subfolders:

- test-A
- test-B
- test-C
- training-v1
- trial-data

The test folders contain the actual test sets used for the shared task, one for each of the three subtasks. Each folder contains a readme file that describes the test data in terms of content and format. Read these files to understand what is required. The test data is unlabeled, as systems are supposed to generate the labels. Participants had to upload their system output to the CodaLab platform to get an evaluation. Since we cannot do that anymore, we cannot use the test data.

The trial data were released early on to give participants an idea of the task ahead of its official release.

The training data contain the annotated tweets. The folder contains the following files:

- offenseval-annotation.txt: instructions for the human annotators on how to annotate the tweets
- offenseval-training-v1.tsv
- readme-trainingset-v1.txt: explaining the content and format of the training data

It is wise to read these files before working with the training data.

## Recipe for building a classifier

- Step 1: read the training data to obtain the tweets and their annotations
- Step 2: split the data into training data and test data
- Step 3: 

In [20]:
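# load data

import os

import pandas as pd


def load_data(data_dir, setname):
    """Load one OffensEval data set as a pandas DataFrame.

    setname is 'train' (the annotated training tweets) or 'test' (the labeled
    trial data, which we use because the official test sets are unlabeled).
    """
    test_path = 'trial-data/offenseval-trial.txt'
    train_path = 'training-v1/offenseval-training-v1.tsv'

    if setname == 'test':
        # the trial file has no header row, so we name the columns ourselves
        filepath = os.path.join(data_dir, test_path)
        data = pd.read_csv(filepath,
                           delimiter='\t',
                           header=None,
                           names=["tweet", "subtask_a", "subtask_b", "subtask_c"])
    elif setname == 'train':
        # the training file comes with its own header row
        filepath = os.path.join(data_dir, train_path)
        data = pd.read_csv(filepath, delimiter="\t")
    else:
        # without this check an unknown setname would fail on "return data"
        raise ValueError("setname must be 'train' or 'test', got %r" % setname)

    return data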


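def split_train(train_data):
    """Split the data into 90% training and 10% validation data.

    The rows keep their original order: the first 90% become the training
    split and the last 10% the validation split.
    """
    total = len(train_data)
    total_90 = round(total * 0.9)
    train_data_split = train_data[:total_90]
    validation_data = train_data[total_90:]
    return train_data_split, validation_data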
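
The split above cuts the data in file order. If the file happens to be sorted in some way (by topic or by date, for example), the last 10% may not be representative. The sketch below shows one possible alternative, a reproducible shuffled split. It works on plain Python lists so that it runs without the data set; the function name `shuffled_split`, the default `seed` and the example tweets are illustrative assumptions, not part of the original notebook.

```python
import random


def shuffled_split(items, train_fraction=0.9, seed=42):
    """Return (train, validation) lists after a reproducible shuffle."""
    indices = list(range(len(items)))
    # A dedicated Random instance keeps the split identical across runs
    random.Random(seed).shuffle(indices)
    cut = round(len(indices) * train_fraction)
    train = [items[i] for i in indices[:cut]]
    validation = [items[i] for i in indices[cut:]]
    return train, validation


# Made-up stand-ins for the tweets
tweets = ["tweet %d" % i for i in range(20)]
train, validation = shuffled_split(tweets)
print(len(train), len(validation))
```

For a pandas DataFrame, a comparable effect can be obtained by shuffling the rows with `train_data.sample(frac=1, random_state=42)` before calling `split_train`.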
    
    



In [21]:
# your local path to the offenseval data sets
data_dir = 'offenseval/datasets/'

# we load the training data into the variable "train_data" and print its size
train_data = load_data(data_dir, 'train')
print(len(train_data))
# we load the test data into the variable "test_data"
test_data = load_data(data_dir, 'test')
# we split the training data again into train and validation data
train_data, val_data = split_train(train_data)

13240


In [10]:
print(train_data)

          id                                              tweet subtask_a  \
0      86426  @USER She should ask a few native Americans wh...       OFF   
1      90194  @USER @USER Go home you’re drunk!!! @USER #MAG...       OFF   
2      16820  Amazon is investigating Chinese employees who ...       NOT   
3      62688  @USER Someone should'veTaken" this piece of sh...       OFF   
4      43605  @USER @USER Obama wanted liberals &amp; illega...       NOT   
5      97670                  @USER Liberals are all Kookoo !!!       OFF   
6      77444                   @USER @USER Oh noes! Tough shit.       OFF   
7      52415  @USER was literally just talking about this lo...       OFF   
8      45157                         @USER Buy more icecream!!!       NOT   
9      13384  @USER Canada doesn’t need another CUCK! We alr...       OFF   
10     82776  @USER @USER @USER It’s not my fault you suppor...       NOT   
11     42992  @USER What’s the difference between #Kavanaugh...       NOT   

In [11]:
print(test_data)

                                                 tweet subtask_a subtask_b  \
0    @BreitbartNews OK Shannon, YOU tell the vetera...       NOT       NaN   
1    @LeftyGlenn @jaredeker @BookUniverse @hashtagz...       NOT       NaN   
2    Hot Mom Sucks Off Step Son In Shower 8 min htt...       OFF       UNT   
3    bro these are some cute butt plugs I’m trying ...       OFF       UNT   
4    Arizona Supreme Court strikes down state legis...       NOT       NaN   
5    Arguing gun control is wrong of me whoever has...       NOT       NaN   
6    Doctors’ interest in medical marijuana far out...       NOT       NaN   
7    A must-read and a must-share for all your frie...       NOT       NaN   
8    @Jo2timess Now that’s the dumbest shit I have ...       OFF       UNT   
9    Agreed! When all of this drama was unfolding a...       OFF       UNT   
10   @NewYorker On the condition of self-reading af...       NOT       NaN   
11   Surprise Vote in Congress Protects Medical Mar...       NOT

In [12]:
print(train_data[0:5])

      id                                              tweet subtask_a  \
0  86426  @USER She should ask a few native Americans wh...       OFF   
1  90194  @USER @USER Go home you’re drunk!!! @USER #MAG...       OFF   
2  16820  Amazon is investigating Chinese employees who ...       NOT   
3  62688  @USER Someone should'veTaken" this piece of sh...       OFF   
4  43605  @USER @USER Obama wanted liberals &amp; illega...       NOT   

  subtask_b subtask_c  
0       UNT       NaN  
1       TIN       IND  
2       NaN       NaN  
3       UNT       NaN  
4       NaN       NaN  
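
The `NaN` values in the output above follow from the way the three subtasks build on each other: subtask_b is only annotated for offensive tweets, and subtask_c only for targeted offenses. The sketch below checks a row against that hierarchy. It assumes the label codes `IND`, `GRP` and `OTH` for individual, group and other (only `IND` is visible in the output above), and the helper names are hypothetical.

```python
import math


def is_missing(value):
    """True for None or a float NaN, which is how pandas shows an empty cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_consistent(subtask_a, subtask_b, subtask_c):
    """Check one row against the assumed label hierarchy."""
    if subtask_a == "NOT":
        # not offensive: no offense type and no target
        return is_missing(subtask_b) and is_missing(subtask_c)
    if subtask_a != "OFF":
        return False
    if subtask_b == "UNT":
        # untargeted offense: no target
        return is_missing(subtask_c)
    if subtask_b == "TIN":
        # targeted offense: the target label codes are an assumption
        return subtask_c in {"IND", "GRP", "OTH"}
    return False


# Label combinations taken from the first rows printed above
rows = [
    ("OFF", "UNT", float("nan")),
    ("OFF", "TIN", "IND"),
    ("NOT", float("nan"), float("nan")),
]
print([is_consistent(*row) for row in rows])
```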


In [15]:
tweet = train_data[0:1]

In [19]:
print(tweet)

      id                                              tweet subtask_a  \
0  86426  @USER She should ask a few native Americans wh...       OFF   

  subtask_b subtask_c  
0       UNT       NaN  


In [18]:
# Preprocess

# tokenize, remove stop-words

# lemmatize? 
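
The preprocessing steps listed above are still open. As a starting point, here is a minimal sketch of tokenization and stop-word removal that uses only the standard library. The stop-word list is a tiny illustrative sample, the token pattern is one possible choice, and the function name `preprocess` is hypothetical. Lemmatization is left out because it needs an external NLP library.

```python
import re

# Tiny illustrative stop-word list; a realistic list would be much longer
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in"}

# A token is a run of word characters, optionally prefixed by @ or #
# (mentions and hashtags) and optionally followed by 'suffix (e.g. don't)
TOKEN_PATTERN = re.compile(r"[@#]?\w+(?:'\w+)?")


def preprocess(tweet):
    """Lowercase a tweet, split it into tokens and drop stop words."""
    tokens = TOKEN_PATTERN.findall(tweet.lower())
    return [token for token in tokens if token not in STOP_WORDS]


# The first tweet is taken from the training data shown above,
# the second one is made up for illustration
print(preprocess("@USER Buy more icecream!!!"))
print(preprocess("The weather is nice in #Amsterdam"))
```

Punctuation disappears because it never matches the token pattern. Note that the pattern only treats the straight apostrophe as part of a word, so a curly apostrophe as in "you’re" splits the word into two tokens.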



In [9]:
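# transform to count vectors

# to do:
# tokenize and lemmatize, get rid of stop words, punct,  etc
# add function for embeddings
from sklearn.feature_extraction.text import CountVectorizer


def tweets_to_count_vec(tweets_train, tweets_test):
    """Turn tweets into bag-of-words count vectors.

    The vocabulary is learned from the training tweets only and then reused
    for the test tweets, so both matrices have the same columns.
    """
    vectorizer = CountVectorizer()
    train_X = vectorizer.fit_transform(tweets_train)
    test_X = vectorizer.transform(tweets_test)
    return train_X, test_X


train_X, val_X = tweets_to_count_vec(train_data['tweet'], val_data['tweet'])

# the function needs both arguments: calling it with test_data['tweet'] alone
# raises a TypeError, so the test tweets have to be passed as the second
# argument next to the training tweets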


In [7]:
print(train_X)

  (0, 891)	1
  (0, 1225)	1
  (0, 6200)	1
  (0, 8545)	1
  (0, 10746)	1
  (0, 11292)	1
  (0, 14301)	1
  (0, 14407)	1
  (0, 15656)	1
  (0, 15927)	1
  (0, 15998)	1
  (0, 16889)	1
  (0, 17444)	1
  (1, 5184)	1
  (1, 6967)	1
  (1, 7739)	1
  (1, 9775)	1
  (1, 13014)	1
  (1, 16416)	1
  (1, 16870)	1
  (1, 16889)	3
  (1, 17866)	1
  (2, 870)	2
  (2, 916)	1
  (2, 1142)	1
  :	:
  (11914, 744)	1
  (11914, 932)	1
  (11914, 2947)	1
  (11914, 3986)	1
  (11914, 5646)	1
  (11914, 5688)	1
  (11914, 6140)	1
  (11914, 7717)	1
  (11914, 11292)	1
  (11914, 11454)	1
  (11914, 12431)	1
  (11914, 14033)	1
  (11914, 14055)	1
  (11914, 15567)	1
  (11914, 15902)	1
  (11914, 16093)	1
  (11914, 16140)	1
  (11914, 16870)	1
  (11914, 16881)	1
  (11914, 16889)	1
  (11915, 6386)	1
  (11915, 11220)	1
  (11915, 12658)	1
  (11915, 16889)	13
  (11915, 17005)	2
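
Each line of the output above reads as `(row, column) count`: the row is the index of the tweet, the column is the index of a word in the vocabulary, and the count is how often that word occurs in the tweet. The following sketch reproduces the idea in plain Python under simplifying assumptions: it splits on whitespace instead of using the tokenization of `CountVectorizer`, and the helper names and example texts are made up.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Map every distinct token in the training documents to a column index."""
    tokens = sorted({token for doc in documents for token in doc.lower().split()})
    return {token: index for index, token in enumerate(tokens)}


def to_count_triples(documents, vocabulary):
    """Return (row, column, count) triples; unknown tokens are ignored."""
    triples = []
    for row, doc in enumerate(documents):
        counts = Counter(t for t in doc.lower().split() if t in vocabulary)
        for token in sorted(counts, key=vocabulary.get):
            triples.append((row, vocabulary[token], counts[token]))
    return triples


train_docs = ["go home go", "buy more icecream"]
vocabulary = fit_vocabulary(train_docs)
print(vocabulary)
print(to_count_triples(train_docs, vocabulary))
# "pizza" was never seen during fitting, so it gets no column
print(to_count_triples(["go buy pizza"], vocabulary))
```

The last call mirrors the difference between `fit_transform` and `transform`: words that do not occur in the training tweets have no column and are dropped when the validation tweets are vectorized.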


In [8]:
print(val_X)

  (0, 788)	1
  (0, 932)	3
  (0, 2546)	1
  (0, 2647)	1
  (0, 3857)	1
  (0, 6430)	1
  (0, 6456)	3
  (0, 7564)	1
  (0, 8844)	1
  (0, 9775)	1
  (0, 10538)	1
  (0, 11220)	1
  (0, 11463)	1
  (0, 13788)	1
  (0, 14189)	1
  (0, 15884)	1
  (0, 15891)	1
  (0, 15902)	1
  (0, 16155)	1
  (0, 16889)	1
  (0, 17337)	1
  (0, 17866)	1
  (0, 17880)	2
  (1, 788)	1
  (1, 1808)	1
  :	:
  (1320, 16179)	1
  (1321, 932)	1
  (1321, 3926)	1
  (1321, 5011)	1
  (1321, 6770)	1
  (1321, 6920)	1
  (1321, 13344)	1
  (1321, 15998)	1
  (1321, 16889)	1
  (1321, 17337)	1
  (1321, 17528)	1
  (1322, 12800)	1
  (1322, 16889)	1
  (1323, 932)	1
  (1323, 2826)	1
  (1323, 5991)	1
  (1323, 6581)	1
  (1323, 7889)	1
  (1323, 8545)	1
  (1323, 8817)	1
  (1323, 14266)	1
  (1323, 14893)	1
  (1323, 16870)	1
  (1323, 16889)	16
  (1323, 17177)	1


In [4]:
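# classify

from sklearn import svm


def train(train_X, train_y):
    """Fit a support vector classifier on the count vectors and their labels."""
    # gamma='scale' is only accepted by scikit-learn 0.20 and later; older
    # versions typically fail with "TypeError: must be real number, not str"
    clf = svm.SVC(gamma='scale')
    clf.fit(train_X, train_y)
    return clf


def predict(clf, test_X):
    """Return the predicted label for every row of test_X."""
    return clf.predict(test_X)


# the gold labels for subtask A: 'OFF' (offensive) or 'NOT' (not offensive)
train_y = train_data['subtask_a']

clf = train(train_X, train_y)
predictions = predict(clf, val_X)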

TypeError: must be real number, not str

In [39]:
# evaluate
# to do: calculate f-score (you can simply use the scikit learn implementation)
for gold, prediction in zip(val_data['subtask_a'], predictions):
    print(gold, prediction)

NOT NOT
NOT NOT
OFF NOT
NOT NOT
OFF NOT
OFF NOT
OFF NOT
NOT NOT
OFF NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
OFF NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
OFF NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
OFF NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
OFF NOT
NOT NOT
OFF NOT
NOT NOT
NOT NOT
NOT NOT
NOT NOT
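
In the output shown above, every prediction is NOT, so comparing the label pairs by eye is not enough to judge the classifier. The to-do in the evaluation cell asks for an F-score. The sketch below spells out the computation in plain Python, assuming that the macro-average over the two labels is the score of interest; the function names are hypothetical. The gold labels in the example are the first five from the output above.

```python
def f1_for_label(gold_labels, predicted_labels, label):
    """F1 score for one label, treating that label as the positive class."""
    pairs = list(zip(gold_labels, predicted_labels))
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    false_pos = sum(1 for g, p in pairs if g != label and p == label)
    false_neg = sum(1 for g, p in pairs if g == label and p != label)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold_labels, predicted_labels, labels=("OFF", "NOT")):
    """Unweighted mean of the per-label F1 scores."""
    scores = [f1_for_label(gold_labels, predicted_labels, label) for label in labels]
    return sum(scores) / len(scores)


gold = ["NOT", "NOT", "OFF", "NOT", "OFF"]
predicted = ["NOT", "NOT", "NOT", "NOT", "NOT"]
print(f1_for_label(gold, predicted, "OFF"))
print(f1_for_label(gold, predicted, "NOT"))
print(macro_f1(gold, predicted))
```

A classifier that never predicts OFF gets an F1 of 0 for that label, which pulls the macro-average down even though most single predictions are correct. With scikit-learn, `sklearn.metrics.f1_score(gold, predicted, average='macro')` computes the same kind of score.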
