# Language processing with NLTK

This notebooks shows some basic language processing concepts and approaches using the python NLTK library. You can view this as the most basic example approach.

Concepts i'm showing explain how to process text. Text processing with machine learning is a bit special, because machine learning algorithms only work on fixed-size vectors of numbers and text is a variable length of words. Also the words in the text are organized in sentences and separated by puntuation and other symbols.

In this notebook i'll show how to deal with that by:
- Preprocessing ("cleaning") the text
- Vectorizing the text to fixed-size vectors
- Classifying sentiment with a Naive Bayes Classifier

In [1]:
import pandas as pd

df = pd.read_csv('data/airline_sentiment.csv', index_col=0)

pd.set_option('max_colwidth', 140)

df[['airline_sentiment', 'text']].head()

Unnamed: 0_level_0,airline_sentiment,text
_unit_id,Unnamed: 1_level_1,Unnamed: 2_level_1
681448150,neutral,@VirginAmerica What @dhepburn said.
681448153,positive,@VirginAmerica plus you've added commercials to the experience... tacky.
681448156,neutral,@VirginAmerica I didn't today... Must mean I need to take another trip!
681448158,negative,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse"
681448159,negative,@VirginAmerica and it's a really big bad thing about it


Above you can see the first lines of the loaded data. For each tweet we have the sentiment and the text. The sentiment means if this tweet was a positive one, a neutral one or negative.

Overall what we'll try to do in this notebook is predict the sentiment value from the text.

First step is preprocessing, we'll clean up the text by removing punctuation, lowercasing and splitting into words.

In [25]:
import nltk

tweets = df['text'].astype(str).map(lambda txt: [tok.strip('!,.') for tok in txt.lower().split()]).values
sentiments = df['airline_sentiment'].values

Now that our text has been cleaned, we can start on creating feature vectors for the machine learning. The way we'll do this, is to represent each tweet by a vector of true/false values indicating which words were present in the tweet.

First off, we need to know all the words that exist in this data:

In [26]:
all_words = list()
for words in tweets:
    all_words.extend(words)
    
word_dist = nltk.FreqDist(all_words)

Next thing, we define our feature function. This function will transform a single tweet to a list of true/false values for each word indicating if that was present in the tweet.

In [27]:
def feature_function(doc):
    document_words = set(doc)
    #corpus_words = word_dist.keys()
    corpus_words = [w for w,f in word_dist.most_common(1000)]
    return {
        word : (word in document_words)
        for word in corpus_words
    }

Next an example of how this looks, for a particular tweet. Note i'm only showing the first 10 entries of the feature vector! Also note most words are not present in the tweet (=False) and a few are present (=True).

In [28]:
print(tweets[1])
print(list(feature_function(tweets[1]).items())[:10])

['@virginamerica', 'plus', "you've", 'added', 'commercials', 'to', 'the', 'experience', 'tacky']
[('to', True), ('the', True), ('i', False), ('a', False), ('you', False), ('for', False), ('@united', False), ('flight', False), ('on', False), ('and', False)]


Next order of business, we apply the above feature function to all tweets!

In [34]:
processed_data = list(nltk.classify.util.apply_features(feature_function, list(zip(tweets, sentiments)), True))

Now comes a part that's important to understand in machine learning: Splitting data into the training and the test set.

When training a machine learning algorithm, you're learning patterns from the data you're feeding it. When you've trained the algorithm you want to know how good (or bad) it does on new data, data that it hasn't been trained on yet. This is called evaluating. It's important to do that with new, unseen data, else you might learn patterns that only work on the training data.

So the correct way to do this is to train on one part of data and then evaluate on another part. It's common to split the available data 80%/20% for example or 75%/25%.

In [35]:
import random

random.shuffle(processed_data)

split_at = int(len(processed_data) * 0.75)
train_set, test_set = processed_data[split_at:], processed_data[:split_at]

So now we have preprocessed, vectorized and split up the data, we are ready to train a classifier and see if we can predict the sentiment!

In [36]:
classifier = nltk.classify.NaiveBayesClassifier.train(train_set)

Classifier is trained, so how well does it work?!

In [37]:
print(nltk.classify.accuracy(classifier, test_set))

0.7585610200364299


Final thing i want to show here, which features or words were most important in predicting the sentiment. Of course you assume works like *thanks* or *good* to indicate positive and unfriendly words to indicate negative. Let's check!

In [38]:
classifier.show_most_informative_features(20)

Most Informative Features
                   thank = True           positi : negati =     20.8 : 1.0
                      ;) = True           positi : negati =     19.6 : 1.0
                   hours = True           negati : neutra =     18.4 : 1.0
                 awesome = True           positi : negati =     16.8 : 1.0
                   great = True           positi : neutra =     16.7 : 1.0
               flightled = True           negati : positi =     15.9 : 1.0
              appreciate = True           positi : neutra =     15.9 : 1.0
                 amazing = True           positi : negati =     15.3 : 1.0
     #destinationdragons = True           neutra : negati =     15.2 : 1.0
                 despite = True           positi : negati =     11.8 : 1.0
               wonderful = True           positi : negati =     11.8 : 1.0
                     :-) = True           positi : negati =     11.8 : 1.0
                     ceo = True           neutra : negati =     10.9 : 1.0

So concluding, we've seen a basic approach of text processing with machine learning and explained some concepts.

I hope you enjoyed!