# Natural Language Processing

Natural language processing (NLP) is an area of computer science and artificial intelligence that is concerned with the interaction between computers and humans in natural language. The ultimate goal of NLP is to enable computers to understand language as well as we do. It is the driving force behind things like virtual assistants, speech recognition, sentiment analysis, automatic text summarization, machine translation and much more.
(https://towardsdatascience.com/introduction-to-nlp-5bff2b2a7170)

![NLP number of publications](NLP_publications.png)

## Today's Task: GermEval - Identification of Offensive Language

The task is to decide whether a tweet includes:

1. no form of offensive language - marked OTHER

2. some form of offensive language - marked OFFENSE
    * PROFANITY: the tweet uses profane words but clearly does not intend to insult anyone.
    
    * INSULT: unlike PROFANITY, the tweet clearly intends to offend someone.
    
    * ABUSE: unlike INSULT, the tweet does not merely insult but represents a stronger form of abusive language.
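
The relation between the two label sets can be sketched as a small mapping. This is only an illustration: the helper name `coarse_label` and the constant `OFFENSIVE_FINE_LABELS` are hypothetical and not part of the GermEval material.

```python
# Fine-grained labels that count as offensive in the binary task
# (hypothetical constant name, used for illustration only)
OFFENSIVE_FINE_LABELS = {"PROFANITY", "INSULT", "ABUSE"}


def coarse_label(fine_label):
    """Map a fine-grained label to the binary OTHER / OFFENSE label."""
    if fine_label == "OTHER":
        return "OTHER"
    if fine_label in OFFENSIVE_FINE_LABELS:
        return "OFFENSE"
    raise ValueError(f"unknown label: {fine_label!r}")


binary = [coarse_label(label) for label in ["OTHER", "ABUSE", "PROFANITY"]]
```

A mapping like this can serve as a consistency check between the two label columns of the data file.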



### A couple of examples

### Read the data

In [2]:
import numpy as np
import pandas as pd
def get_train_data(filename):
    X  = []
    y_task1 = []
    y_task2 = []
    
    with open(filename, encoding='UTF-8') as file:
        for line in file:
            #rstrip - remove trailing characters (\n is a new line)
            #split into more parts (separator is tab: \t)
            tweet = line.rstrip('\n').split('\t')
            X.append(tweet[0]) # the actual tweet
            y_task1.append(tweet[1]) # first label
            y_task2.append(tweet[2]) # second label
    
    return np.asarray(X), np.asarray(y_task1), np.asarray(y_task2)

filename = "./data/germeval2018.training.txt"
X_train, Y_train1, Y_train2 = get_train_data(filename)
filename = "./data/germeval2018.test.txt"
X_test, Y_test1, Y_test2 = get_train_data(filename)

Let us have a look at the first ten tweets:

In [3]:
display(X_train[:10])

array(['@corinnamilborn Liebe Corinna, wir würden dich gerne als Moderatorin für uns gewinnen! Wärst du begeisterbar?',
       '@Martin28a Sie haben ja auch Recht. Unser Tweet war etwas missverständlich. Dass das BVerfG Sachleistungen nicht ausschließt, kritisieren wir.',
       '@ahrens_theo fröhlicher gruß aus der schönsten stadt der welt theo ⚓️',
       '@dushanwegner Amis hätten alles und jeden gewählt...nur Hillary wollten sie nicht und eine Fortsetzung von Obama-Politik erst recht nicht..!',
       '@spdde kein verläßlicher Verhandlungspartner. Nachkarteln nach den Sondierzngsgesprächen - schickt diese Stümper #SPD in die Versenkung.',
       '@Dirki_M Ja, aber wo widersprechen die Zahlen denn denen, die im von uns verlinkten Artikel stehen? In unserem Tweet geht es rein um subs. Geschützte. 2017 ist der gesamte Familiennachzug im Vergleich zu 2016 - die Zahlen, die Hr. Brandner bemüht - übrigens leicht rückläufig gewesen.',
       '@milenahanm 33 bis 45 habe ich noch gar nicht 

### Data statistics

In [4]:
display(pd.Series(Y_train1).value_counts())
display(pd.Series(Y_train2).value_counts())

OTHER      3321
OFFENSE    1688
dtype: int64

OTHER        3321
ABUSE        1022
INSULT        595
PROFANITY      71
dtype: int64
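
Since the classes are imbalanced, it helps to know what a trivial classifier would score. The following sketch uses only the counts printed above and computes accuracy and macro F1 for a classifier that always predicts the majority class; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {"OTHER": 3321, "OFFENSE": 1688}
total = sum(counts.values())

# A classifier that always predicts the most frequent class
majority = max(counts, key=counts.get)
accuracy = counts[majority] / total

# For the majority class: precision equals the accuracy, recall is 1
precision, recall = accuracy, 1.0
f1_majority = 2 * precision * recall / (precision + recall)

# The minority class is never predicted, so its F1 is taken as 0
macro_f1 = (f1_majority + 0.0) / 2

print(f"majority class: {majority}")
print(f"accuracy: {accuracy:.3f}, macro F1: {macro_f1:.3f}")
```

Accuracy looks acceptable for this baseline, while macro F1 penalizes the ignored class heavily, which is one reason to evaluate with macro F1 on this data.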

## How to start tackling this problem?

### Using only data given

A simple first approach is to look at each word and compare how often it occurs in offensive tweets with how often it occurs in non-offensive tweets.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from scipy.sparse import find

In [32]:
from collections import Counter, defaultdict

def create_dist(text):
    c = Counter(text)

    total = sum(c.values())
    
    for k, v in c.items():
        c[k] = v/total

    return defaultdict(lambda: min(c.values()), c)

list_of_others = []
list_of_offensive = []
for number, polarity in enumerate(Y_train1):
    if polarity == "OTHER":
        list_of_others.append(number)
    else:
        list_of_offensive.append(number)

In [6]:
# Transform the tweets into a sparse matrix
count_vect = CountVectorizer(min_df=1)
X_train_counts = count_vect.fit_transform(X_train)


display("The shape of the data is: " + str(X_train_counts.shape))

display("Sparse vector representation of the tweet ", X_train[3])

display(find(X_train_counts[3])[1:])

'The shape of the data is: (5009, 17534)'

'Sparse vector representation of the tweet '

'@dushanwegner Amis hätten alles und jeden gewählt...nur Hillary wollten sie nicht und eine Fortsetzung von Obama-Politik erst recht nicht..!'

(array([  664,   747,  3872,  4033,  4644,  5286,  6361,  7213,  7504,
         8019, 10865, 11034, 11076, 11669, 12199, 13480, 15077, 16084,
        16817], dtype=int32),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1]))

In [53]:
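# Note: offensive_counts and non_offensive_counts are computed in cell In [48] below
tweet_words = find(X_train_counts[3])[1]  # column indices of the words in tweet 3
# Look up the vocabulary once (newer scikit-learn versions call this get_feature_names_out())
feature_names = count_vect.get_feature_names()
words = [feature_names[pos] for pos in tweet_words]
offensive_prob = [offensive_counts[pos] for pos in tweet_words]
non_offensive_prob = [non_offensive_counts[pos] for pos in tweet_words]
display(list(zip(words, find(X_train_counts[3])[2], offensive_prob, non_offensive_prob)))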

[('alles', 1, 0.001696005614363413, 0.0017457619332016355),
 ('amis', 1, 5.8482952219428036e-05, 3.06274023368708e-05),
 ('dushanwegner', 1, 0.00023393180887771215, 0.00016845071285278939),
 ('eine', 1, 0.004854085034212527, 0.004456287040014701),
 ('erst', 1, 0.0010819346160594186, 0.0006431754490742867),
 ('fortsetzung', 1, 0.0, 1.53137011684354e-05),
 ('gewählt', 1, 0.00032165623720685423, 0.00038284252921088496),
 ('hillary', 1, 0.0, 0.0001071959081790478),
 ('hätten', 1, 0.0001754488566582841, 0.00039815623037932037),
 ('jeden', 1, 0.0002924147610971402, 0.00016845071285278939),
 ('nicht', 2, 0.011374934206678754, 0.012587862360453898),
 ('nur', 1, 0.005906778174162232, 0.0040581308096353805),
 ('obama', 1, 0.0, 6.12548046737416e-05),
 ('politik', 1, 0.0007895198549622785, 0.0007044302537480284),
 ('recht', 1, 0.0005555880460845664, 0.0009035083689376886),
 ('sie', 1, 0.006725539505234224, 0.009096338494050627),
 ('und', 2, 0.020732206561787238, 0.01764138374603758),
 ('von', 1, 0

In [48]:
non_offensive_counts = np.array(X_train_counts[list_of_others].sum(axis = 0)).flatten()
non_offensive_counts = non_offensive_counts / non_offensive_counts.sum()
offensive_counts = np.array(X_train_counts[list_of_offensive].sum(axis = 0)).flatten()
offensive_counts = offensive_counts / offensive_counts.sum()

In [49]:
offensive_counts

array([5.84829522e-05, 1.75448857e-04, 0.00000000e+00, ...,
       2.92414761e-05, 0.00000000e+00, 2.92414761e-05])

In [50]:
non_offensive_counts

array([6.12548047e-05, 2.29705518e-04, 3.06274023e-05, ...,
       0.00000000e+00, 1.53137012e-05, 0.00000000e+00])

### A couple of terms

DF: Document Frequency - df(t): how many documents contain term t

TF: Term Frequency - tf(t,d): the number of times that term t occurs in document d

IDF: Inverse Document Frequency - idf(t, D) = log(|D| / df(t)) + 1, where |D| is the number of documents in D: a measure of how much information the term t provides across all documents

Goal: Scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

tf-idf(t, d, D) = tf(t, d) * idf(t, D)
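
The definitions above can be worked through by hand on a toy corpus. This is a minimal sketch that follows the formulas exactly as written here; the function name `tf_idf` and the three toy documents are made up for illustration. Note that scikit-learn's `TfidfTransformer` by default uses a smoothed idf and normalizes each row to unit length, so its numbers differ from this sketch.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(documents)
    # df(t): number of documents that contain term t at least once
    df = Counter(term for doc in documents for term in set(doc))
    # idf(t, D) = log(|D| / df(t)) + 1, as defined above
    idf = {term: math.log(n_docs / count) + 1 for term, count in df.items()}
    # tf(t, d) is the raw count of t in d
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in documents
    ]


# Toy corpus (made up): "und" is frequent, "hillary" is rare
docs = [
    "und nicht und".split(),
    "und hillary".split(),
    "nicht obama".split(),
]
weights = tf_idf(docs)
for doc, doc_weights in zip(docs, weights):
    print(doc, doc_weights)
```

In the second toy document the rare word "hillary" receives a higher weight than the common word "und", which is the scaling effect described in the goal above.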


In [92]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
display([(x,y) for x, y in zip(words, find(X_train_tfidf[3])[2])])

[('alles', 0.17160673491699563),
 ('amis', 0.31007615417674533),
 ('dushanwegner', 0.25573099210317224),
 ('eine', 0.13582432850827794),
 ('erst', 0.20288416837938814),
 ('fortsetzung', 0.34599635245987703),
 ('gewählt', 0.23161463976175056),
 ('hillary', 0.29165119038630394),
 ('hätten', 0.23730602831273087),
 ('jeden', 0.2519946664612334),
 ('nicht', 0.20025121923204178),
 ('nur', 0.13429034749981186),
 ('obama', 0.31007615417674533),
 ('politik', 0.20606443153617798),
 ('recht', 0.20444205872496402),
 ('sie', 0.11795542720298555),
 ('und', 0.17114742491461543),
 ('von', 0.12285815231128688),
 ('wollten', 0.2726184353837312)]

### Computing polarity of each word

We can now characterize the polarity of each word by how often it appears in offensive versus non-offensive tweets. For example, the word "peinlich" is contained in 2 offensive and 5 non-offensive tweets, so its offensiveness is 2/(2+5) = 2/7. We then multiply this value by the word's tf-idf weight to lower the importance of common words for the classification. The product of these values over all words in a tweet gives a number that characterizes how likely the tweet is to be offensive.
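
The per-word offensiveness ratio can be sketched in a few lines. This is an illustration of the counting step only, not the naive Bayes classifier used in the next cell; the helper `word_offensiveness` and the toy tweets are assumptions made for the example.

```python
from collections import Counter


def word_offensiveness(tweets, labels):
    """Fraction of the tweets containing each word that are labelled OFFENSE."""
    offensive = Counter()
    total = Counter()
    for tweet, label in zip(tweets, labels):
        # Count each word at most once per tweet
        for word in set(tweet.lower().split()):
            total[word] += 1
            if label == "OFFENSE":
                offensive[word] += 1
    return {word: offensive[word] / total[word] for word in total}


# Toy data reproducing the "peinlich" example: 2 offensive, 5 non-offensive tweets
tweets = ["das ist peinlich"] * 2 + ["wie peinlich"] * 5
labels = ["OFFENSE"] * 2 + ["OTHER"] * 5
scores = word_offensiveness(tweets, labels)
print(scores["peinlich"])
```

Words that occur only in offensive toy tweets get a score of 1 and words that occur only in the others get 0, which hints at why some smoothing is needed on real data.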

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score, make_scorer

# Use Multinomial naive Bayes
#clf = MultinomialNB().fit(X_train_tfidf, Y_train1)

# Characterize the order of operations
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Set hyperparameters
parameters = {
    'vect__ngram_range': [(1, 1)],
    'tfidf__use_idf': [(False)],
    'clf__alpha': [(1e-2)], #smoothing
}

# Find the best parameters
# (the iid argument was removed in scikit-learn 0.24; drop it on newer versions)
gs_clf = GridSearchCV(text_clf, parameters, cv=10, iid=False,
                      n_jobs=-1, scoring=make_scorer(f1_score, average='macro'))
gs_clf = gs_clf.fit(X_train, Y_train1)
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))


display(gs_clf.best_score_)
display(gs_clf.cv_results_)

clf__alpha: 0.01
tfidf__use_idf: False
vect__ngram_range: (1, 1)


0.6943130807402307



{'mean_fit_time': array([0.30255985]),
 'std_fit_time': array([0.08236278]),
 'mean_score_time': array([0.02378073]),
 'std_score_time': array([0.00740208]),
 'param_clf__alpha': masked_array(data=[0.01],
              mask=[False],
        fill_value='?',
             dtype=object),
 'param_tfidf__use_idf': masked_array(data=[False],
              mask=[False],
        fill_value='?',
             dtype=object),
 'param_vect__ngram_range': masked_array(data=[(1, 1)],
              mask=[False],
        fill_value='?',
             dtype=object),
 'params': [{'clf__alpha': 0.01,
   'tfidf__use_idf': False,
   'vect__ngram_range': (1, 1)}],
 'split0_test_score': array([0.67820978]),
 'split1_test_score': array([0.66089269]),
 'split2_test_score': array([0.73644201]),
 'split3_test_score': array([0.70977208]),
 'split4_test_score': array([0.6771315]),
 'split5_test_score': array([0.71850876]),
 'split6_test_score': array([0.66695157]),
 'split7_test_score': array([0.67001162]),
 'split8_

### Getting better, part I

N-grams: counting single words does not always work. Consider two sentences that consist of the same words:

"Alle nicht töten und leben lassen" vs "Alle töten und nicht leben lassen"

A solution: use groups of consecutive words (2-5), so-called n-grams, to better capture the context. For n=2 we get the tokens: Alle nicht, nicht töten, töten und, und leben, leben lassen, ...
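
The effect can be checked directly on the two example sentences. The helper `word_ngrams` below is a hypothetical stand-in that splits on whitespace only; in the notebook, the same idea is controlled through the `ngram_range` parameter of the vectorizer.

```python
def word_ngrams(sentence, n):
    """Return all runs of n consecutive whitespace-separated words."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


first = "Alle nicht töten und leben lassen"
second = "Alle töten und nicht leben lassen"

# Single words cannot tell the two sentences apart ...
same_words = sorted(word_ngrams(first, 1)) == sorted(word_ngrams(second, 1))

# ... but the bigrams differ
bigrams_first = word_ngrams(first, 2)
bigrams_second = word_ngrams(second, 2)

print(same_words)
print(bigrams_first)
print(bigrams_second)
```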

In [75]:
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}
gs_clf = GridSearchCV(text_clf, parameters, cv=10, iid=False,
                      n_jobs=-1, scoring=make_scorer(f1_score, average='macro'))
gs_clf.fit(X_train, Y_train1)
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

clf__alpha: 0.01
tfidf__use_idf: False
vect__ngram_range: (1, 2)


In [77]:
gs_clf.cv_results_



{'mean_fit_time': array([0.25981333, 0.69109664, 1.1778029 , 0.24055579, 0.64192955,
        1.17820904, 0.2945739 , 0.77496316, 1.33338332, 0.30388579,
        1.10317478, 1.43830168]),
 'std_fit_time': array([0.00734382, 0.02066649, 0.06060069, 0.00587286, 0.0134585 ,
        0.0599287 , 0.01731191, 0.00960534, 0.04687888, 0.03156629,
        0.11497705, 0.2490255 ]),
 'mean_score_time': array([0.02124653, 0.0414655 , 0.04980199, 0.02066953, 0.03777406,
        0.0539731 , 0.02496397, 0.04513886, 0.05968206, 0.02507839,
        0.05951574, 0.06048067]),
 'std_score_time': array([0.00119077, 0.00390739, 0.00170173, 0.00273915, 0.00148889,
        0.00426007, 0.00085359, 0.00219275, 0.00550332, 0.00310754,
        0.0204003 , 0.01968232]),
 'param_clf__alpha': masked_array(data=[0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.001, 0.001,
                    0.001, 0.001, 0.001, 0.001],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, F

### Tokenization


Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens. These tokens are often loosely referred to as terms or words, but it is sometimes important to make the distinction.

Usually, punctuation is removed together with Twitter-specific tokens such as |LBR| (new line).

What should one do with the following examples?

Jetzt habt ihr's schon wieder FAST geschafft

ES IST NUR UND AUSSCHLIEßLICH DER ISLAM, ALSO A L L E UND J E D E R MOSLEM

Sicher doch und es würde besser gehen, wenn 100TSD Mecklenburger Kühe das politische Berlin zuscheißen würden 😜
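
One possible answer for the second example is to join letters that were spaced out for emphasis before tokenizing. The following is a heuristic sketch, not part of the notebook's tokenizer: the pattern and the function name `collapse_spaced_letters` are assumptions, and the rule would also merge genuine sequences of one-letter words.

```python
import re

# Three or more single characters separated by single spaces, e.g. "A L L E"
SPACED_WORD = re.compile(r"\b(?:\w ){2,}\w\b")


def collapse_spaced_letters(text):
    """Join letters that were typed with spaces in between."""
    return SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)


example = "ALSO A L L E UND J E D E R MOSLEM"
collapsed = collapse_spaced_letters(example)
print(collapsed)
```

Whether to lowercase shouting such as FAST, or to keep emoji as tokens, remains a separate design decision.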

In [None]:
from nltk.tokenize import TweetTokenizer as Tokenizer_NLTK
from nltk.tokenize.casual import remove_handles
from nltk.stem.snowball import GermanStemmer as Stemmer_NLTK
from sklearn.feature_extraction.text import TfidfVectorizer

class Tokenizer:
    # Substrings replaced before tokenizing. The order matters: the HTML
    # entities (&lt;, &gt;, &amp;) must be handled before ';' is removed.
    REPLACEMENTS = [
        ('#', ' '), ('&lt;', ' '), ('&gt;', ' '), ('&amp;', ' und '),
        ('|LBR|', ' '), ('-', ' '), ('_', ' '), ("'s", ' '), (',', ' '),
        (';', ' '), (':', ' '), ('/', ' '), ('+', ' '),
    ]

    def __init__(self, preserve_case=True, use_stemmer=False, join=False):
        self.preserve_case = preserve_case
        self.use_stemmer = use_stemmer
        self.join = join

    def tokenize(self, tweet):
        tweet = remove_handles(tweet)  # strip @user mentions
        for old, new in self.REPLACEMENTS:
            tweet = tweet.replace(old, new)
        tknzr = Tokenizer_NLTK(preserve_case=self.preserve_case, reduce_len=True)
        tokens = tknzr.tokenize(tweet)

        if self.join:
            # Return a single string (used as preprocessor for character n-grams)
            return " ".join(tokens)
        if self.use_stemmer:
            stmmr = Stemmer_NLTK()
            return [stmmr.stem(token) for token in tokens]
        return tokens


### Stemming and Lemmatization

Lauf, gelaufen, laufend, läuft, but also Demokratie, demokratisch and Demokratisierung: in many situations it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. The difference is that stemming usually refers to a crude heuristic process that chops off the beginnings and ends of words. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Ideally, both would succeed in transforming the word gegessen to essen, but only the lemmatizer would convert "ich aß" to "ich essen"; the stemmer would not perform any transformation.
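
The contrast between the two approaches can be illustrated with toy versions of each. Everything here is hypothetical: the suffix list and the lemma dictionary are tiny hand-made examples, whereas real stemmers (such as the Snowball stemmer imported earlier) and real lemmatizers use far larger rule sets and vocabularies.

```python
# Hypothetical suffix list, ordered longest first (illustration only)
SUFFIXES = ("isierung", "isch", "ie")

# Hypothetical lemma dictionary (illustration only)
LEMMAS = {"aß": "essen", "gegessen": "essen", "läuft": "laufen"}


def toy_stem(word):
    """Crude heuristic: lowercase, then chop off the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned lowercased but unchanged."""
    return LEMMAS.get(word.lower(), word.lower())


stems = {toy_stem(w) for w in ("Demokratie", "demokratisch", "Demokratisierung")}
print(stems)
print(toy_stem("aß"), toy_lemmatize("aß"))
```

The suffix rules collapse the three derivationally related words to one stem, but only the dictionary lookup can connect the irregular form "aß" to "essen".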

### German specific challenge - compound splitting

The Austrian Word of the Year 2016 was Bundespräsidentenstichwahlwiederholungsverschiebung. Such compounds can be split iteratively by finding the longest part of the word that is in the dictionary and splitting it off. The result could be:
Bundespräsident Stichwahl Wiederholung Verschiebung. But should it be "Bundespräsident" or "Bund Präsident"?
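
One way to read the iterative process is as a greedy longest-prefix match. The sketch below makes several assumptions that are not in the text: the function names, the tiny vocabularies, and the list of linking elements ("en", "es", "s") that may sit between parts are all chosen for illustration.

```python
def longest_prefix(text, vocabulary):
    """Return the longest prefix of text that is in the vocabulary, or None."""
    for end in range(len(text), 0, -1):
        if text[:end] in vocabulary:
            return text[:end]
    return None


def split_compound(word, vocabulary, linkers=("en", "es", "s")):
    """Greedily split a compound; return [word] if no full split is found."""
    rest = word.lower()
    parts = []
    while rest:
        match = longest_prefix(rest, vocabulary)
        if match is None and parts:
            # Try to skip a linking element between two parts
            for linker in linkers:
                if rest.startswith(linker):
                    candidate = longest_prefix(rest[len(linker):], vocabulary)
                    if candidate is not None:
                        rest = rest[len(linker):]
                        match = candidate
                        break
        if match is None:
            return [word]
        parts.append(match)
        rest = rest[len(match):]
    return parts


word = "Bundespräsidentenstichwahlwiederholungsverschiebung"
small_vocab = {"bund", "präsident", "stich", "wahl", "stichwahl",
               "wiederholung", "verschiebung"}
large_vocab = small_vocab | {"bundespräsident"}

print(split_compound(word, large_vocab))
print(split_compound(word, small_vocab))
```

With "bundespräsident" in the vocabulary the word stays in one piece, and without it the splitter falls back to "bund" and "präsident", so the answer to the question above depends on the dictionary.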

In [None]:
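# %%
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict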
token_vect=TfidfVectorizer(analyzer="word", max_df=0.01, min_df=0.0002,
                             tokenizer=Tokenizer(preserve_case=False, use_stemmer=True).tokenize)

char_vect  = TfidfVectorizer(analyzer="char", ngram_range=(3, 7), max_df=0.01, min_df=0.0002,
                             preprocessor=Tokenizer(preserve_case=False, join=True).tokenize)
# %%
X_TNGR_train = token_vect.fit_transform(X_train)
X_TNGR_test  = token_vect.transform(X_test)

X_CNGR_train = char_vect.fit_transform(X_train)
X_CNGR_test  = char_vect.transform(X_test)
#%%

#%%
text_clf = Pipeline([
    ('vect', token_vect),
    ('clf', MultinomialNB()),
])
parameters={
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
    'vect__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}
gs_clf=GridSearchCV(text_clf, parameters, cv=StratifiedKFold(n_splits=10), iid=False,
                      n_jobs=-1, scoring=make_scorer(f1_score, average='macro'))
gs_clf=gs_clf.fit(X_train, Y_train1)
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

# %%
gs_clf.best_score_

# %%
text_clf.set_params(**gs_clf.best_params_)
text_clf.fit(X_train, Y_train1)
predicted=text_clf.predict(X_test)
np.mean(predicted == Y_test1)
# %%
# f1_score expects the true labels first, then the predictions
f1_score(Y_test1, predicted, average='macro')
#%%
def get_META_feats(clf, X_train, X_test, y, seeds=(42,)):
    feats_train = []
    for seed in seeds:
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        feats_train.append(cross_val_predict(clf, X_train, y=y, method='predict_proba', cv=skf, n_jobs=-1))
    feats_train = np.mean(feats_train, axis=0)
    
    clf.fit(X_train, y)
    feats_test = clf.predict_proba(X_test)
    
    return feats_train, feats_test

#%%
clfs_task1 = [LogisticRegression(class_weight='balanced'),
              ExtraTreesClassifier(n_estimators=100, criterion='entropy', n_jobs=-1),
              ExtraTreesClassifier(n_estimators=100, criterion='gini', n_jobs=-1)]

base_feats_task1 = [#(X_CNGR_train, X_CNGR_test),
                    (X_TNGR_train, X_TNGR_test)]

X_META_task1_train = []
X_META_task1_test  = []
# Use distinct loop variable names so the raw X_train / X_test are not overwritten
for feats_train, feats_test in base_feats_task1:
    for clf in clfs_task1:
        feats = get_META_feats(clf, feats_train, feats_test, Y_train1)
        X_META_task1_train.append(feats[0])
        X_META_task1_test.append(feats[1])
        
X_META_task1_train = np.concatenate(X_META_task1_train, axis=1)
X_META_task1_test  = np.concatenate(X_META_task1_test, axis=1)


#%%
clf_task1 = LogisticRegression(C=0.17, class_weight='balanced')
clf_task1.fit(X_META_task1_train, Y_train1)

preds_task1 = clf_task1.predict(X_META_task1_test)    


#%%
np.mean(preds_task1 == Y_test1)
# %%
# f1_score expects the true labels first, then the predictions
f1_score(Y_test1, preds_task1, average='macro')

### Dictionary of bad words

In [14]:
with open("./data/lexicon.txt", "r", encoding="UTF-8") as text_file:
    lexicon = text_file.read().split('\n')

In [15]:
lexicon

['AAA Batterie',
 'ABS-Bremser',
 'Aa Esser',
 'Aa Fresser',
 'Aa Gesicht',
 'Aa Lecker',
 'Aa Loch',
 'Aa Lutscher',
 'Aa Wurst',
 'Aal',
 'Aalauge',
 'Aalficker',
 'Aalfresse',
 'Aalschwanz',
 'Aalwurstverkäufer',
 'Aas',
 'Aasaffe',
 'Aasfresser',
 'Aasgeier',
 'Abart',
 'Abbumser',
 'Abdeckstiftbenutzer',
 'Abdeckstiftdauerbenutzer',
 'Abdomen',
 'Abfall',
 'Abfall, biochemischer',
 'Abfallecker',
 'Abfalleimervagina',
 'Abfallficker',
 'Abfallproduckt',
 'Abfallprodukt',
 'Abfallschlucker',
 'Abfalltonnenvollscheißer',
 'Abficker',
 'Abflussrohrgucker',
 'Abflussrohrsauger',
 'Abflußverstopfer',
 'Abgard',
 'Abgaskakerlake',
 'Abgaslaus',
 'Abgasproduzent',
 'Abgefickter',
 'Abnippeler',
 'Abort',
 'Abortdeckel',
 'Abortschüsseltaucher',
 'Abschaum',
 'Abscheißer',
 'Abschiedswinker',
 'Abseiler',
 'Abseitserklärer',
 'Abspritzer',
 'Abspritzmuschi',
 'Abspritzmuschie',
 'Absturztorte',
 'Absturzvogel',
 'Abszess',
 'Abtörner',
 'Abwasserschlüfer',
 'Abwasserschlürfer',
 'Abwasser