# NLP Workshop: Text Classification

In this workshop we'll learn about an NLP (Natural Language Processing) technique called text classification: determining which category a piece of text belongs to. An example application is sentiment analysis, which detects whether a text is positive or negative.

We will use a CrowdFlower dataset about hate speech. The use case is detecting offensive language on social media.

## 1. Loading data

We start by loading the dataset using [Pandas](https://pandas.pydata.org/).

In [2]:
import pandas as pd

df = pd.read_csv('data/twitter-hate-speech.csv', index_col=0)

## 2. Quick dataset overview

Now that the data is loaded, we'll have a quick look at what information is available.

The columns in the dataset sample output are:

- **count** : Number of human annotations for this sample
- **hate_speech** : Times annotated as containing hate speech
- **offensive_language** : Times annotated as containing offensive language
- **neither** : Times annotated as containing neither hateful nor offensive language (normal, respectful language)
- **class** : Human-annotated category, determined by the majority of votes (0=Hate, 1=Offensive, 2=Neither)
- **tweet** : The tweet's text

From these available columns we'll use **class** as the value we try to predict and **tweet** as input to determine the class.
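
To make the relation between the vote columns and **class** concrete, here is a minimal sketch in plain Python. The first four rows mirror the vote counts in the sample output below; the last row and the tie-breaking rule (lowest class index wins) are assumptions for illustration, not something the dataset documentation confirms.

```python
# Vote counts per tweet as (hate_speech, offensive_language, neither).
votes = [
    (0, 0, 3),
    (0, 3, 0),
    (0, 2, 1),
    (0, 6, 0),
    (2, 1, 0),  # hypothetical row, not taken from the dataset
]


def majority_class(row):
    # Index of the largest vote count: 0=Hate, 1=Offensive, 2=Neither.
    # Assumption: on a tie, the lowest class index is returned.
    return max(range(len(row)), key=lambda i: row[i])


classes = [majority_class(row) for row in votes]
print(classes)
```

For the four real rows this reproduces the values shown in the **class** column of the sample output.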

In [3]:
pd.set_option('max_colwidth', 140)

df.head()

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash ou...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;


Now let's see how much data there is in total.

In [15]:
df.shape

(24783, 6)

### Assignment: Plot the distribution of class in the dataset

A useful first insight is the distribution of categories in the dataset. Plotting can be done using the [Seaborn library](https://seaborn.pydata.org/).

**Your assignment is to plot the distribution (count) of the categories.**

After plotting you will notice that the categories are not evenly distributed: class 1 (Offensive) has many more samples than the others. We'll get back to this later!

*Hint: Look at the countplot function in seaborn documentation*

In [5]:
import seaborn as sns

%matplotlib inline

sns.countplot(...)  # This line needs to be completed

## 3. Text preprocessing

In the data above you can see that the tweets contain lots of slang, social-media-specific abbreviations and symbols. There are also usernames and hashtags in the text, which we might not want a model to take into account when classifying: we'd like the model to learn hateful or offensive keywords rather than remember which users use bad language.

To address this, we'll now look at text preprocessing to clean up the data.

The two libraries we will use here are [sklearn](http://scikit-learn.org) and [gensim](https://radimrehurek.com/gensim/).

You can read more here:
- [Gensim text preprocessing](https://radimrehurek.com/gensim/parsing/preprocessing.html)
- [Sklearn text feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

First I will show what the default sklearn and gensim preprocessing functions do. Then we'll take a deeper look and customize our own preprocessing pipeline.

### Whitespace splitting

First let's see how the output looks when we simply split on whitespace. The tweets are simply split into separate words, and all noise such as symbols is retained.

In [6]:
df['tweet'].head(n=5).apply(lambda x: x.split())

0    [!!!, RT, @mayasolovely:, As, a, woman, you, shouldn't, complain, about, cleaning, up, your, house., &amp;, as, a, man, you, should, alw...
1                                         [!!!!!, RT, @mleew17:, boy, dats, cold...tyga, dwn, bad, for, cuffin, dat, hoe, in, the, 1st, place!!]
2    [!!!!!!!, RT, @UrKindOfBrand, Dawg!!!!, RT, @80sbaby4life:, You, ever, fuck, a, bitch, and, she, start, to, cry?, You, be, confused, as,...
3                                                                       [!!!!!!!!!, RT, @C_G_Anderson:, @viva_based, she, look, like, a, tranny]
4    [!!!!!!!!!!!!!, RT, @ShenikaRoberts:, The, shit, you, hear, about, me, might, be, true, or, it, might, be, faker, than, the, bitch, who,...
Name: tweet, dtype: object

### Sklearn default tokenizer

Next up is the sklearn analyzer built from `CountVectorizer`. With the settings below it already does some text cleaning: it lowercases the text, drops punctuation and removes English stopwords.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

sklearn_default_preprocessor = CountVectorizer(strip_accents='unicode', stop_words='english').build_analyzer()

df['tweet'].head(n=5).apply(sklearn_default_preprocessor)

0     [rt, mayasolovely, woman, shouldn, complain, cleaning, house, amp, man, trash]
1       [rt, mleew17, boy, dats, cold, tyga, dwn, bad, cuffin, dat, hoe, 1st, place]
2    [rt, urkindofbrand, dawg, rt, 80sbaby4life, fuck, bitch, start, confused, shit]
3                                 [rt, c_g_anderson, viva_based, look, like, tranny]
4              [rt, shenikaroberts, shit, hear, true, faker, bitch, told, ya, 57361]
Name: tweet, dtype: object
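
To see why symbols disappear in the output above, here is a rough sketch of the tokenization step using only the standard library. It uses the default `token_pattern` of `CountVectorizer` (runs of two or more word characters) and lowercasing. Accent stripping and stopword removal are deliberately left out, so this is an approximation of the analyzer and not a replacement for it. The name `simple_analyzer` is made up for this example.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_analyzer(text):
    # Lowercase first, then keep only the matches of the token pattern.
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_analyzer("!!!!! RT @mleew17: boy dats cold...tyga dwn bad")
print(tokens)
```

Note that the username survives (without the `@`), which is one reason to write a custom preprocessor later on.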

### Gensim preprocessor

Finally we will have a look at the gensim preprocessor. This cleans the text even more: for example, it removes short tokens and numbers, and it applies stemming.

In [8]:
from gensim.parsing.preprocessing import preprocess_string

df['tweet'].head(n=5).apply(preprocess_string)

0    [mayasolov, woman, shouldn, complain, clean, hous, amp, man, trash]
1       [mleew, boi, dat, cold, tyga, dwn, bad, cuffin, dat, hoe, place]
2      [urkindofbrand, dawg, sbabylif, fuck, bitch, start, confus, shit]
3                             [anderson, viva, base, look, like, tranni]
4                  [shenikarobert, shit, hear, true, faker, bitch, told]
Name: tweet, dtype: object

### Assignment: Customized preprocessor

We have shown a few different approaches to preprocessing text. Now we'll create a customized preprocessor that does some extra social-media-specific data cleaning.

For example:
- Remove usernames
- Remove 'hash' from hashtags
- No stemming

**Take a look at the code below and complete the preprocessing functions**

*Hint: Look at how to lowercase strings and how to use regular expressions in python*
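
As a warm-up for the regular expression part, here is a small sketch of one item from the list above: removing the 'hash' from hashtags. The function name `strip_hash` is hypothetical and is not part of the assignment code below. The same `re.sub` technique should carry over to the functions you need to complete.

```python
import re


def strip_hash(tweet):
    # Replace '#word' with 'word'. A '#' that is not directly followed
    # by a word character is left untouched.
    return re.sub(r"#(\w+)", r"\1", tweet)


example = strip_hash("Loving this #NLP #workshop!")
print(example)
```

One caveat: HTML entities such as `&#57361;` also contain a `#` followed by word characters, so a pattern like this would strip the `#` from them as well. Whether that matters depends on which other filters run afterwards.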

In [9]:
from gensim.parsing.preprocessing import strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, \
    remove_stopwords

def drop_short(tweet):
    # This function is included as an example, it removes short tokens
    return ' '.join(x for x in tweet.split() if len(x) >= 3)

def to_lowercase(tweet):
    return tweet  # TODO Complete this function

def drop_usernames(tweet):
    return tweet  # TODO Complete this function
    
my_filters = [ to_lowercase, drop_usernames, strip_multiple_whitespaces, strip_punctuation, strip_numeric,
               remove_stopwords, drop_short ]

df['tweet'].head(n=5).apply(lambda x: preprocess_string(x, my_filters))

0        [mayasolovely, woman, shouldn, complain, cleaning, house, amp, man, trash]
1                 [mleew, boy, dats, cold, tyga, dwn, bad, cuffin, dat, hoe, place]
2    [UrKindOfBrand, Dawg, sbabylife, You, fuck, bitch, start, You, confused, shit]
3                                       [Anderson, viva, based, look, like, tranny]
4                       [ShenikaRoberts, The, shit, hear, true, faker, bitch, told]
Name: tweet, dtype: object

## 4. Feature creation

In the section above we preprocessed the text of the tweets and removed noisy or undesirable words. The tweets have also been split into separate words, which is called tokenization. With this preparation done, we are ready to transform the data into a format suitable for a machine learning model.

Machine learning models generally require numerical input; they don't work on text or words directly. They also usually require a fixed number of input columns, or features. So in this section we will transform the variable-length tokenized tweets into a fixed set of features.

One method of transforming variable-length texts into a fixed set of numerical features is to use each unique word as a feature, with the count of that word in the text as its value. This is called bag-of-words. Below are some example sentences, followed by a table showing them transformed into bag-of-words features:

1. I love machine learning
2. I hate learning boring things
3. Machine learning is a passion


| Sentence   | I | Love | Machine | Learning | Hate | Boring | Things | Is | A | Passion |
| ---------- |:-:|:----:|:-------:|:--------:|:----:|:------:|:------:|:--:|:-:|:-------:|
| Sentence 1 | 1 |    1 |       1 |        1 |    0 |      0 |      0 |  0 | 0 |       0 |
| Sentence 2 | 1 |    0 |       0 |        1 |    1 |      1 |      1 |  0 | 0 |       0 |
| Sentence 3 | 0 |    0 |       1 |        1 |    0 |      0 |      0 |  1 | 1 |       1 | 

We are now going to do this for the tweets using a utility from sklearn, the `CountVectorizer`.
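
Before handing the work to sklearn, it can help to build the bag-of-words features for the three example sentences by hand. The following is a minimal sketch in plain Python: it lowercases, splits on whitespace and counts each word. It is meant to illustrate the idea, not to mirror what `CountVectorizer` does internally.

```python
from collections import Counter

sentences = [
    "I love machine learning",
    "I hate learning boring things",
    "Machine learning is a passion",
]

# Lowercase and split on whitespace (a deliberately naive tokenizer).
tokenized = [s.lower().split() for s in sentences]

# Build the vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per sentence, one column per vocabulary word.
rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

With its default settings `CountVectorizer` would behave slightly differently: it sorts the vocabulary alphabetically and ignores single-character tokens such as "I" and "a".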

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.parsing.preprocessing import preprocess_string

vectorizer = CountVectorizer(strip_accents='unicode', stop_words='english')

X = vectorizer.fit_transform(df['tweet'].values)
y = df['class'].values

Now let's have a look at the output:

In [31]:
X

<24783x35573 sparse matrix of type '<class 'numpy.int64'>'
	with 210603 stored elements in Compressed Sparse Row format>

In [32]:
X.shape

(24783, 35573)

So this transformation has resulted in a matrix of 35573 feature columns, which is probably a few too many. The reason is that many words appear only once or twice. A machine learning algorithm can't learn much from words that appear so infrequently, and any patterns it does learn are unlikely to apply to many new tweets. So we can safely filter out a lot here. The easiest way is to filter by frequency: we simply drop tokens that appear in only a few examples.
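
The filtering idea can be sketched in a few lines of plain Python. The function name `filter_by_document_frequency` and the tiny tokenized "tweets" are made up for illustration. The key point is that we count in how many documents a token occurs, not how often it occurs in total.

```python
from collections import Counter

# Hypothetical tokenized tweets, for illustration only.
docs = [
    ["rt", "boy", "cold"],
    ["rt", "cold", "day"],
    ["rt", "cold", "cold", "night"],
    ["rt", "boy"],
]


def filter_by_document_frequency(docs, min_df):
    # set(doc) makes sure a token is counted at most once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Keep tokens that occur in at least min_df documents.
    return sorted(t for t, n in doc_freq.items() if n >= min_df)


vocab = filter_by_document_frequency(docs, min_df=2)
print(vocab)
```

Here "day" and "night" are dropped because each occurs in only one document, even though "cold" appearing twice in one tweet still counts as a single document for that token.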

In [27]:
vectorizer = CountVectorizer(min_df=5, strip_accents='unicode', stop_words='english')

X_filtered = vectorizer.fit_transform(df['tweet'].values)

In [28]:
X_filtered.shape

(24783, 4693)

### Assignment: Filtering tokens

In this exercise we will experiment with filtering tokens by frequency to remove low-frequency tokens, since they are unlikely to be very useful anyway.

**Experiment with the code below to obtain a feature-matrix of around 500 to 1000 features.**

*Hint: Have a look at the min_df parameter*

In [None]:
vectorizer = CountVectorizer(min_df=..., strip_accents='unicode', stop_words='english')

X_filtered_more = vectorizer.fit_transform(df['tweet'].values)

In [None]:
X_filtered_more.shape

### Features from preprocessed data

In the examples above we haven't yet used any of the preprocessing logic from section 3; we just split the text into words using the sklearn defaults. Let's plug in the gensim `preprocess_string` function now:

In [40]:
vectorizer = CountVectorizer(min_df=10, strip_accents='unicode', analyzer='word',
                             tokenizer=preprocess_string, stop_words='english')

X_preprocessed = vectorizer.fit_transform(df['tweet'].values)

In [41]:
X_preprocessed.shape

(24783, 2164)

### Assignment: Plug in our custom preprocessor

In this exercise you will plug the custom preprocessor we created earlier into the vectorizer.

**Adjust the code below to use the custom preprocessor logic**

*Hint: Use a lambda function*

In [None]:
vectorizer = CountVectorizer(min_df=10, strip_accents='unicode', analyzer='word',
                             tokenizer=..., stop_words='english')

X_custom = vectorizer.fit_transform(df['tweet'].values)

## 5. Comparing the preprocessing / feature approaches

In [42]:
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.naive_bayes import MultinomialNB

scores = cross_val_score(MultinomialNB(), X, y, cv=ShuffleSplit(n_splits=10, test_size=0.2))
print("%s score: %0.2f (+/- %0.2f)" % ('All', scores.mean(), scores.std()))

scores = cross_val_score(MultinomialNB(), X_filtered, y, cv=ShuffleSplit(n_splits=10, test_size=0.2))
print("%s score: %0.2f (+/- %0.2f)" % ('Filtered', scores.mean(), scores.std()))

scores = cross_val_score(MultinomialNB(), X_filtered_more, y, cv=ShuffleSplit(n_splits=10, test_size=0.2))
print("%s score: %0.2f (+/- %0.2f)" % ('Filtered more', scores.mean(), scores.std()))

scores = cross_val_score(MultinomialNB(), X_preprocessed, y, cv=ShuffleSplit(n_splits=10, test_size=0.2))
print("%s score: %0.2f (+/- %0.2f)" % ('Preprocessed', scores.mean(), scores.std()))

All score: 0.87 (+/- 0.00)
Filtered score: 0.89 (+/- 0.00)
Filtered more score: 0.87 (+/- 0.00)
Preprocessed score: 0.89 (+/- 0.00)


## 6. Machine learning models

In [10]:
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

models = [
    LogisticRegression(multi_class='multinomial', solver='newton-cg'),
    MultinomialNB(),
    DecisionTreeClassifier(min_samples_split=50)
]

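# Assumption: X_count is not defined earlier in this notebook; here it is taken
# to be the frequency-filtered count matrix (X_filtered) from section 4.
X_count = X_filtered

for model in models:
    scores = cross_val_score(model, X_count, y, cv=ShuffleSplit(n_splits=10, test_size=0.2))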
    print("Model %s score: %0.2f (+/- %0.2f)" % (model.__class__.__name__, scores.mean(), scores.std()))

Model LogisticRegression score: 0.90 (+/- 0.00)
Model MultinomialNB score: 0.89 (+/- 0.00)
Model DecisionTreeClassifier score: 0.88 (+/- 0.00)


## 7. Class Imbalance

In [11]:
models = [
    LogisticRegression(multi_class='multinomial', solver='newton-cg', class_weight='balanced'),
    DecisionTreeClassifier(min_samples_split=50, class_weight='balanced')
]

for model in models:
    scores = cross_val_score(model, X_count, y, cv=ShuffleSplit(n_splits=10, test_size=0.2))
    print("Model %s score: %0.2f (+/- %0.2f)" % (model.__class__.__name__, scores.mean(), scores.std()))

Model LogisticRegression score: 0.83 (+/- 0.00)
Model DecisionTreeClassifier score: 0.83 (+/- 0.01)
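
The `class_weight='balanced'` option used above gives rare classes a larger weight during training. As described in the scikit-learn documentation, the weight of a class is `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula in plain Python. The function name and the class counts are hypothetical and are not the real counts of this dataset.

```python
def balanced_class_weights(class_counts):
    # weight = n_samples / (n_classes * count), so rare classes weigh more.
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Hypothetical class counts, chosen to make the arithmetic easy to follow.
counts = {"Hate": 100, "Offensive": 800, "Neither": 100}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(label, round(weight, 3))
```

A lower overall score with balanced weights is not necessarily a worse model: the default score here is accuracy, which is dominated by the majority class, so trading some majority-class accuracy for better minority-class recall will likely show up as a drop.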


## 8. Detailed evaluation

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X_count, y, test_size=0.2)

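# Fit a Naive Bayes classifier on the training split
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Per-class precision, recall and F1 on the held-out test split
print(classification_report(y_test, nb.predict(X_test), target_names=('Hate', 'Offensive', 'Neither')))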

             precision    recall  f1-score   support

       Hate       0.46      0.23      0.31       294
  Offensive       0.91      0.96      0.93      3850
    Neither       0.84      0.78      0.81       813

avg / total       0.87      0.89      0.88      4957
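
To make the columns of this report less of a black box, here is a minimal sketch of how precision, recall and F1 can be computed for a single class. The label lists are invented for illustration and the helper name `precision_recall_f1` is hypothetical. In practice `classification_report` does this for every class at once.

```python
def precision_recall_f1(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    # True positives, false positives and false negatives for this label.
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Invented labels: 0=Hate, 1=Offensive, 2=Neither.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0, 2, 2]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, label=0)
print(precision, recall, round(f1, 2))
```

In this toy example most class 0 items are predicted as class 1, which gives a low recall for class 0. The report above shows a similar pattern for the Hate class, where recall is well below precision.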

