# Intro

<img width=600 src="images/plane.jpeg">
<p>This notebook focuses on text classification using a naive Bayes classifier. The <a href="https://www.kaggle.com/crowdflower/twitter-airline-sentiment">Twitter US Airline Sentiment</a> dataset, which contains tweets labeled by their sentiment, is used as an example.</p>

# Import dependencies

In [1]:
import numpy as np
import pandas as pd
import string
import preprocessor as p
from stop_words import get_stop_words
from nltk.stem import SnowballStemmer

import sys
sys.path.insert(0, "..")
import mymllib
import mymllib.metrics.classification as metrics

# Load and preprocess data

<p>Before proceeding, download the dataset using the link above and extract it into the <i>twitter_airline_sentiment</i> directory so that it looks like this:</p>
<p><i>
./twitter_airline_sentiment/<br/>
├── database.sqlite<br/>
└── Tweets.csv<br/>
</i></p>


The dataset contains many columns, but we only need the tweet text and the sentiment label:

In [2]:
dataset = pd.read_csv(
    "./twitter_airline_sentiment/Tweets.csv")[["text", "airline_sentiment"]]
dataset.head()

Unnamed: 0,text,airline_sentiment
0,@VirginAmerica What @dhepburn said.,neutral
1,@VirginAmerica plus you've added commercials t...,positive
2,@VirginAmerica I didn't today... Must mean I n...,neutral
3,@VirginAmerica it's really aggressive to blast...,negative
4,@VirginAmerica and it's a really big bad thing...,negative


Tweets are labeled by sentiment as negative, neutral, or positive. The classes are imbalanced, with the positive class being the least represented:

In [3]:
dataset.airline_sentiment.value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
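
To put the imbalance in numbers, the counts above can be turned into class shares. The snippet below is a small illustration that hard-codes the counts printed above; the `shares` and `majority_baseline` names are introduced here for illustration only.

```python
# Class counts copied from the value_counts() output above.
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}
total = sum(counts.values())

# Share of each class in the whole dataset.
shares = {label: count / total for label, count in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_baseline = max(shares.values())
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

A classifier that always answers "negative" would already reach roughly 63% plain accuracy, which is why accuracy alone is a weak measure on this dataset.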

Let's encode sentiment with numeric labels: 

In [4]:
dataset.airline_sentiment = dataset.airline_sentiment.astype("category").cat.codes
dataset.head()

Unnamed: 0,text,airline_sentiment
0,@VirginAmerica What @dhepburn said.,1
1,@VirginAmerica plus you've added commercials t...,2
2,@VirginAmerica I didn't today... Must mean I n...,1
3,@VirginAmerica it's really aggressive to blast...,0
4,@VirginAmerica and it's a really big bad thing...,0
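
The codes follow the sorted order of the category names, so here 0 is negative, 1 is neutral and 2 is positive. A pure-Python sketch of the same mapping is shown below; the `code_to_label` and `label_to_code` names are illustrative and not part of the notebook.

```python
# Labels as they appear in the first five rows of the dataset.
labels = ["neutral", "positive", "neutral", "negative", "negative"]

# pandas orders string categories lexically when it infers them,
# so sorting the unique labels reproduces the same code assignment.
code_to_label = sorted(set(labels))
label_to_code = {label: code for code, label in enumerate(code_to_label)}

codes = [label_to_code[label] for label in labels]
print(label_to_code)
print(codes)
```

Keeping a list like `code_to_label` around makes it easy to translate numeric predictions back into sentiment names.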


<p>Here we define a preprocessing pipeline with the following steps:</p>
<ol>
    <li>
        Use the <i>tweet-preprocessor</i> package to remove:
        <ul>
            <li>URLs</li>
            <li>Hashtags</li>
            <li>Mentions</li>
            <li>Reserved words (RT, FAV)</li>
            <li>Emojis</li>
            <li>Smileys</li>
            <li>Numbers</li>
        </ul>
    </li>
    <li>Convert to lowercase</li>
    <li>Remove stopwords and words that contain digits</li>
    <li>Remove punctuation</li>
    <li>Apply the Snowball stemmer</li>
</ol>
<p>The pipeline could be improved, for example by fixing or removing misspelled words, but these steps should suffice for a baseline model.</p>

In [5]:
LANGUAGE = "english"
STOP_WORDS = get_stop_words(LANGUAGE)
STEMMER = SnowballStemmer(LANGUAGE)
REMOVE_PUNCTUATION = str.maketrans('', '', string.punctuation)

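def preprocess(tweet):
    # Strip URLs, hashtags, mentions, reserved words, emojis, smileys and numbers
    tweet = p.clean(tweet)
    tweet = tweet.lower()
    # Drop stopwords and any word that contains a digit
    tweet = " ".join(word for word in tweet.split()
                     if word not in STOP_WORDS and not any(c.isdigit() for c in word))
    tweet = tweet.translate(REMOVE_PUNCTUATION)
    # stem() receives the whole string, so only the end of the tweet is stemmed
    tweet = STEMMER.stem(tweet)
    return tweet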

An example of how the preprocessing works on a single tweet:

In [6]:
source_tweet = dataset.text[1]
preprocessed_tweet = preprocess(source_tweet)

print("Source tweet:", source_tweet)
print("Preprocessed tweet:", preprocessed_tweet)

Source tweet: @VirginAmerica plus you've added commercials to the experience... tacky.
Preprocessed tweet: plus added commercials experience tacki
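
Note that only the last word changed in the output above (`tacky` became `tacki`): `STEMMER.stem` treats its argument as a single word, so passing a whole tweet only affects its ending. If every word should be stemmed, the stemmer has to be applied token by token. A minimal sketch follows, assuming a hypothetical `stem_each_word` helper; `toy_stem` is a crude stand-in used only to keep the example dependency-free, and in the notebook `STEMMER.stem` would be passed instead. All outputs shown in this notebook come from the original `preprocess` function, so switching to per-word stemming would change the vocabulary and the reported scores.

```python
def stem_each_word(text, stem):
    """Apply a single-word stemming function to every token of `text`."""
    return " ".join(stem(word) for word in text.split())


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


text = "added commercials delayed flights"

# A whole-string call only touches the ending of the text ...
print(toy_stem(text))                  # added commercials delayed flight
# ... while per-word stemming processes every token.
print(stem_each_word(text, toy_stem))  # add commercial delay flight
```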


Preprocess all tweets in the dataset:

In [7]:
dataset.text = dataset.text.map(preprocess)
dataset.head()

Unnamed: 0,text,airline_sentiment
0,said,1
1,plus added commercials experience tacki,2
2,today must mean need take another trip,1
3,really aggressive blast obnoxious entertainmen...,0
4,really big bad th,0


Now we are ready to perform a train/test split: 

In [8]:
train_dataset = dataset.sample(frac=0.8, random_state=123)
test_dataset = dataset.drop(train_dataset.index)

print("Train dataset size:", train_dataset.shape[0])
print("Test dataset size:", test_dataset.shape[0])

Train dataset size: 11712
Test dataset size: 2928


A vocabulary is required to replace words with numeric tokens. It is built using only the train subset:

In [9]:
unique_words = set(word for text in train_dataset.text for word in text.split())
idx_to_word = sorted(unique_words)
word_to_idx = {word: idx for idx, word in enumerate(idx_to_word)}

print("Vocabulary size:", len(idx_to_word))
print("Vocabulary sample:", idx_to_word[:30])

Vocabulary size: 10434
Vocabulary sample: ['a', 'aa', 'aaaand', 'aadavantage', 'aadfw', 'aadvantage', 'aal', 'aampc', 'aaron', 'aas', 'aaus', 'ab', 'aback', 'abandon', 'abandoned', 'abandonment', 'abassinet', 'abbrev', 'abc', 'abcdef', 'abcs', 'abducted', 'abi', 'abidfw', 'abilities', 'ability', 'able', 'aboard', 'abounds', 'about']


Tweets and their labels are separated, and all words in the tweets are replaced with their indices in the vocabulary. Words missing from the vocabulary are skipped, and shorter sequences are padded with -1:

In [10]:
PADDING_VALUE = -1

def get_x_y(dataset, word_to_idx):
    max_len = max(len(text.split()) for text in dataset.text)
    x = []
    for text in dataset.text:
        idx = [word_to_idx[word] for word in text.split() if word in word_to_idx]
        idx += [PADDING_VALUE] * (max_len - len(idx))
        x.append(idx)
    return np.asarray(x), dataset.airline_sentiment.to_numpy()

X_train, y_train = get_x_y(train_dataset, word_to_idx)
X_test, y_test = get_x_y(test_dataset, word_to_idx)

print(f"Train X shape: {X_train.shape}, train y shape: {y_train.shape}")
print(f"Test X shape: {X_test.shape}, test y shape: {y_test.shape}")

Train X shape: (11712, 21), train y shape: (11712,)
Test X shape: (2928, 21), test y shape: (2928,)
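
The encoding step can be illustrated on a toy vocabulary. The sketch below is a simplified, pure-Python variant of `get_x_y`; the `encode` helper, its explicit `max_len` argument and the toy vocabulary are assumptions introduced here, not part of the notebook.

```python
PADDING_VALUE = -1


def encode(texts, word_to_idx, max_len):
    """Replace known words with vocabulary indices and pad to `max_len`."""
    rows = []
    for text in texts:
        # Words missing from the vocabulary are silently skipped.
        idx = [word_to_idx[word] for word in text.split() if word in word_to_idx]
        # Truncate overly long sequences, then pad short ones.
        idx = idx[:max_len]
        idx += [PADDING_VALUE] * (max_len - len(idx))
        rows.append(idx)
    return rows


word_to_idx = {"bad": 0, "flight": 1, "great": 2, "late": 3}
encoded = encode(["great flight", "bad late flight again"], word_to_idx, max_len=4)
print(encoded)  # [[2, 1, -1, -1], [0, 3, 1, -1]]
```

Passing the same `max_len` for the train and test subsets guarantees equal widths. In the notebook the longest tweet in both subsets happens to have 21 words, so the shapes already match.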


# Train and test the model

Finally, we can train the naive Bayes text classifier:

In [11]:
naive_bayes = mymllib.nlp.NaiveBayesTextClassifier()
naive_bayes.fit(X_train, y_train)
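
The internals of `mymllib.nlp.NaiveBayesTextClassifier` are not shown in this notebook. As a rough illustration of the technique, the following is a minimal multinomial naive Bayes with Laplace smoothing that works on padded index sequences like `X_train`. The `TinyNaiveBayes` class, its `alpha` parameter and the toy data are assumptions made for this sketch and may differ from the library's actual implementation.

```python
import math
from collections import Counter, defaultdict

PADDING_VALUE = -1


class TinyNaiveBayes:
    """Multinomial naive Bayes over sequences of token indices."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.vocab = {t for row in X for t in row if t != PADDING_VALUE}
        class_counts = Counter(y)
        # Log prior: share of training samples belonging to each class.
        self.log_prior = {c: math.log(class_counts[c] / len(y)) for c in self.classes}
        # Token counts per class, ignoring padding.
        self.token_counts = defaultdict(Counter)
        for row, label in zip(X, y):
            self.token_counts[label].update(t for t in row if t != PADDING_VALUE)
        self.total_tokens = {c: sum(self.token_counts[c].values()) for c in self.classes}
        return self

    def _log_likelihood(self, token, c):
        # Smoothed estimate of log P(token | class).
        num = self.token_counts[c][token] + self.alpha
        den = self.total_tokens[c] + self.alpha * len(self.vocab)
        return math.log(num / den)

    def predict(self, X):
        preds = []
        for row in X:
            # Padding and unseen tokens are not in the vocabulary and are skipped.
            scores = {
                c: self.log_prior[c] + sum(
                    self._log_likelihood(t, c) for t in row if t in self.vocab)
                for c in self.classes
            }
            preds.append(max(scores, key=scores.get))
        return preds


# Toy data: tokens 0 and 1 appear in class 0, tokens 2 and 3 in class 1.
X = [[0, 1, -1], [0, 0, 1], [2, 3, -1], [3, 3, 2]]
y = [0, 0, 1, 1]
model = TinyNaiveBayes().fit(X, y)
print(model.predict([[0, 1, 1], [3, -1, -1]]))  # [0, 1]
```

Summing log-probabilities instead of multiplying probabilities avoids numeric underflow on longer texts, and the smoothing term keeps a token that was never seen in a class from zeroing out that class's score.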

Let's evaluate the model's accuracy on both subsets:

In [12]:
y_train_pred = naive_bayes.predict(X_train)
y_test_pred = naive_bayes.predict(X_test)

print("Train accuracy:", metrics.accuracy(y_train, y_train_pred))
print("Train balanced accuracy:", metrics.balanced_accuracy(y_train, y_train_pred))
print()
print("Test accuracy:", metrics.accuracy(y_test, y_test_pred))
print("Test balanced accuracy:", metrics.balanced_accuracy(y_test, y_test_pred))

Train accuracy: 0.8495560109289617
Train balanced accuracy: 0.7699964802630902

Test accuracy: 0.7520491803278688
Test balanced accuracy: 0.6119290791743199
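
The gap between the two metrics comes from the class imbalance. Balanced accuracy is commonly defined as the mean of the per-class recalls, so each class contributes equally regardless of its size. The sketch below follows that common definition; whether `mymllib.metrics.classification.balanced_accuracy` computes it in exactly this way is an assumption.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes present in `y_true`."""
    recalls = []
    for c in sorted(set(y_true)):
        # Predictions for the samples whose true class is `c`.
        preds_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(p == c for p in preds_for_c) / len(preds_for_c))
    return sum(recalls) / len(recalls)


# Toy example: class 0 dominates and is always predicted.
y_true = [0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0]
print(accuracy(y_true, y_pred))           # 0.75
print(balanced_accuracy(y_true, y_pred))  # 0.3333333333333333
```

In the toy example the always-predict-the-majority strategy looks decent by plain accuracy but drops to one third under balanced accuracy, since two of the three classes are never recognized.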


# Conclusion

The model achieves a balanced accuracy of 61% and a plain accuracy of 75% on the test set. While this is definitely not the best result possible, it is quite good for such a simple preprocessing pipeline and model. Naive Bayes is also much faster to train than advanced deep learning models, which makes it a good choice for a baseline.