# Introduction to Natural Language Processing
## 1. Data Representation

In [None]:
import pandas as pd

This notebook will take you through many of the concepts we have introduced in this session. We will use the same dataset for many of the examples, namely a collection of 6000 or so tweets from @realDonaldTrump and @BarackObama. 

Wherever possible we will use `sklearn`, Python's machine learning library, which you are most likely already familiar with. For a few tasks we will turn to `nltk` (the Natural Language Toolkit), a Python library for Natural Language Processing (NLP), and a few other libraries.

In [None]:
df = pd.read_pickle("tweets.pkl")

In [None]:
df.head()

## Data Cleaning 

There are many things to consider when cleaning text data. Some problems are common to other data types, such as how to deal with missing values. Others are unique to text data, such as removing HTML tags or URLs. We don't want to focus too much on data cleaning in this course, but we've done a little cleaning below to give you a taste. Generally speaking, regular expressions (available in Python in the `re` module) will get you pretty far. For specific tasks there are often existing libraries you can use: for example, `feedparser` is good for getting data from an RSS feed, and `beautifulsoup` is good for parsing HTML/XML.

In [None]:
import re


def clean_tweet(text):
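    # decode raw bytes to a utf-8 string (Python 3 strings pass through;
    # calling .decode on a str would raise AttributeError)
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # remove commas in numbers (else vectorizer will split on them)
    text = re.sub(r",([0-9])", r"\1", text)
    # sort out HTML formatting of & (the entity is "&amp;", so also
    # consume the trailing semicolon when it is present)
    text = re.sub(r"&amp;?", "and", text)
    # strip urls
    return re.sub(r"https?://\S*", "", text)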


df["text"] = df["text"].map(clean_tweet)
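
As a quick sanity check, the sketch below applies the same kinds of substitutions to a single made-up tweet. The sample string and the helper name `clean_sample` are illustrative assumptions, not part of the dataset or of any library.

```python
import re


def clean_sample(text):
    # hypothetical helper that mirrors the cleaning steps on one string
    text = re.sub(r",([0-9])", r"\1", text)  # 1,250,000 -> 1250000
    text = re.sub(r"&amp;?", "and", text)  # HTML entity for "&"
    text = re.sub(r"https?://\S*", "", text)  # drop urls
    return text.strip()


# made-up example tweet
sample = "Raised $1,250,000 for jobs &amp; growth! https://example.com/abc"
cleaned = clean_sample(sample)
print(cleaned)
```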

## Tokenizing

The field of NLP contains a lot of jargon from linguistics. We don't want to get too bogged down in defining lots of new terms, but the following two are helpful:

- Type: An element of the vocabulary. May be a word, may be an n-gram (ordered sequence of words)
- Token: An instance of a type in running text.
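
To make the distinction concrete, here is a minimal sketch that counts tokens and types in a made-up sentence. It assumes simple whitespace splitting, which is only one of several possible tokenizing schemes.

```python
from collections import Counter

# made-up sentence, split on whitespace
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()

# each distinct word is a type; each occurrence is a token
type_counts = Counter(tokens)
n_tokens = len(tokens)
n_types = len(type_counts)

print(n_tokens, "tokens,", n_types, "types")
print(type_counts.most_common(2))
```

Here the type "the" accounts for three of the tokens, so there are fewer types than tokens.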

Any given language has a large enough vocabulary that trying to do data science on the set of all possible sentences is totally impractical. Instead it helps to break text up into smaller chunks, a process called tokenizing.

Exactly how we do this will depend on the problem, but some common ways include splitting on whitespace, or splitting on non-alphanumeric characters. In general, the method of tokenizing will be informed by the format of the text data being studied.
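
The sketch below compares three tokenizing rules on a made-up tweet, using only the `re` module. The third pattern is the default `token_pattern` of `sklearn`'s `CountVectorizer`, which keeps runs of two or more word characters. The example string is an assumption chosen to show where the rules disagree.

```python
import re

# made-up tweet with punctuation, a hashtag and a mention
tweet = "Don't miss it: #NLP rocks!!! @user"

# 1. split on whitespace: punctuation stays attached to words
whitespace_tokens = tweet.split()

# 2. keep runs of ASCII letters and digits: splits "Don't" in two
alnum_tokens = re.findall(r"[A-Za-z0-9]+", tweet)

# 3. sklearn's default token pattern: words of 2+ word characters
sklearn_like_tokens = re.findall(r"(?u)\b\w\w+\b", tweet)

print(whitespace_tokens)
print(alnum_tokens)
print(sklearn_like_tokens)
```

Note that none of these rules lowercases the text. In `CountVectorizer` lowercasing is a separate preprocessing step.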

**Exercise: Tokenizers are accessed in a slightly roundabout way in `sklearn`, as below. Run this cell a few times to tokenize random tweets.**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from random import randint

# tokenize a random tweet
i = randint(0, len(df) - 1)
tokenizer = CountVectorizer().build_tokenizer()
tokenizer(df["text"].iloc[i])

## Vectorizing

Tokenizing breaks our raw text data down into more manageable chunks, but it's still not in a form that is particularly useful for training models. Let's look at a few common, simple ways of vectorizing text data. We will use `sklearn`, which can efficiently vectorize text data and stores the results as `scipy` sparse matrices.

### Count Vectors

Perhaps the simplest way to vectorize is to count the number of times each type appears in a given piece of text.
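
Before turning to `sklearn`, it may help to see the idea built by hand. The following sketch uses three made-up documents and whitespace tokenizing; it is a plain-Python illustration of the concept, not how `CountVectorizer` is implemented.

```python
from collections import Counter

# three made-up documents
docs = ["make america great", "great great jobs", "jobs for america"]

# the vocabulary is the sorted set of all types in the corpus
vocabulary = sorted({word for doc in docs for word in doc.split()})

# one row per document, one column per type
count_matrix = [
    [Counter(doc.split())[term] for term in vocabulary] for doc in docs
]

print(vocabulary)
for row in count_matrix:
    print(row)
```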

To get some intuition, let's try it on a small test corpus of 10 random tweets.

**Exercise: Use `sample` on the series `df['text']` to get a random selection of 10 tweets**.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

test_corpus = df["text"].sample(n=10, random_state=42).copy()

test_corpus

Given the sample, we can create count vectors using `CountVectorizer` from `sklearn`. We set `max_features=5` so as to work with a small vocabulary of only the most common terms.

See the next cell for usage of `CountVectorizer`.

In [None]:
# create a count vectorizer with our desired parameters
count_vectorizer = CountVectorizer(max_features=5)

# first 'fit' the vectorizer to the corpus
# this step automatically determines the vocabulary
count_vectorizer.fit(test_corpus)

# then 'transform' the corpus to count vectors (a matrix)
count_vectors = count_vectorizer.transform(test_corpus)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2
features = count_vectorizer.get_feature_names_out()
# we use .toarray() to convert from sparse
# array to dense numpy array
for i, row in enumerate(count_vectors.toarray()):
    print(test_corpus.iloc[i])
    print(
        pd.DataFrame({"Terms": features, "Counts": row}).to_string(index=False)
    )
    print("-" * 40)

### Term frequency vectors

Count vectors are very sensitive to document length. In our case we expect all tweets to be similar lengths, but in general we might be dealing with documents of varying lengths, so it makes sense to normalise the count vectors. This results in so-called frequency vectors.
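
As a rough illustration of why normalising helps, the sketch below takes two made-up count vectors, where the second "document" is simply the first repeated twice, and normalises each row. One detail worth knowing: `TfidfVectorizer` defaults to `norm="l2"` (unit Euclidean length), whereas `norm="l1"` gives proportions that sum to one.

```python
import numpy as np

# made-up counts: row 1 is row 0 "repeated twice"
counts = np.array([[2.0, 1.0, 1.0], [4.0, 2.0, 2.0]])

# l1 normalisation: each row becomes proportions summing to 1
l1_vectors = counts / counts.sum(axis=1, keepdims=True)

# l2 normalisation: each row is scaled to unit Euclidean length
l2_vectors = counts / np.linalg.norm(counts, axis=1, keepdims=True)

print(l1_vectors)
print(l2_vectors)
```

Under either normalisation the two rows become identical, so document length no longer affects the representation.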

**Exercise: Using `TfidfVectorizer`, compute term frequency vectors for the test corpus and print them out as we did for the count vectors. Make sure you set `use_idf=False` when initialising your `TfidfVectorizer`. As before limit the vocabulary to 5 types.**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create a term frequency vectorizer
# note that use_idf=False, more on tf-idf in a second

tf_vectorizer = TfidfVectorizer(max_features=5, use_idf=False)

# 'fit' the vectorizer to the corpus
# this step automatically determines the vocabulary
tf_vectorizer.fit(test_corpus)

# then 'transform' the corpus
# this computes the term frequency vectors
tf_vectors = tf_vectorizer.transform(test_corpus)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2
features = tf_vectorizer.get_feature_names_out()
for i, row in enumerate(tf_vectors.toarray()):
    print(test_corpus.iloc[i])
    print(
        pd.DataFrame({"Terms": features, "Term frequencies": row}).to_string(
            index=False
        )
    )
    print("-" * 40)

### tf-idf vectors

tf-idf stands for 'term frequency - inverse document frequency'. Given our term frequencies, we re-weight each term by the inverse of its document frequency, i.e. the number of documents in the corpus that contain it. A term therefore gets a larger value if it appears many times in the document but infrequently across the corpus. In this sense tf-idf automatically detects and upweights terms that are likely to help us distinguish between documents.
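
The re-weighting can be sketched by hand with `numpy`. The counts below are made up, and the formula used is the smoothed variant that `sklearn` applies by default (`smooth_idf=True`), namely $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, followed by l2 normalisation of each row. Other tf-idf variants exist, so treat this as one concrete instance rather than the only definition.

```python
import numpy as np

# made-up count matrix: 3 documents (rows) x 3 terms (columns)
# term 0 appears in every document, terms 1 and 2 in only one each
counts = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 2.0], [1.0, 0.0, 0.0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # documents containing each term

# smoothed inverse document frequency (sklearn's default variant)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# re-weight the counts, then scale each row to unit length
weighted = counts * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(idf)
print(tfidf)
```

In the first document both terms appear once, but the rarer term 1 ends up with the larger weight.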

**Exercise: Compute tfidf vectors for your test_corpus. You can once again use `TfidfVectorizer`, but this time set `use_idf=True`.**

In [None]:
# we create a term frequency - inverse document frequency vectorizer with our desired parameters
# in this case let us limit the vocabulary (max features) to 5
tfidf_vectorizer = TfidfVectorizer(max_features=5, use_idf=True)

# we 'fit' the vectorizer to the corpus, this step automatically determines the vocabulary
tfidf_vectorizer.fit(test_corpus)

# then we 'transform' the corpus, this is the vectorizing step
tfidf_vectors = tfidf_vectorizer.transform(test_corpus)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2
features = tfidf_vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf_vectors.toarray()):
    print(test_corpus.iloc[i])
    print(
        pd.DataFrame(
            {"Terms": features, "Term Frequency (weighted)": row}
        ).to_string(index=False)
    )
    print("-" * 40)

## N-grams

So far we have only considered individual words and their frequencies. We lose a lot of information by doing so, because we discard word order, grammar and so on.

A simple solution to this is to use n-grams, that is, sequences of n consecutive words, when we tokenize.
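
A minimal sketch of how n-grams can be generated from a list of tokens is shown below. The helper name `ngrams` and the example phrase are illustrative assumptions; the terms are joined with spaces, which matches how `CountVectorizer` names its n-gram features.

```python
def ngrams(tokens, n):
    # slide a window of length n over the token list
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


# made-up phrase, split on whitespace
tokens = "make america great again".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```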

In [None]:
# you can tokenize/vectorize with n-grams using the parameter
# ngram_range. It takes a tuple of ints that specify min and max
# n-gram lengths
ngram_vectorizer = CountVectorizer(max_features=5, ngram_range=(2, 2))

**Exercise: Use `ngram_vectorizer` to compute bigram count vectors for your test corpus**

In [None]:
# fit the vectorizer to the test_corpus
ngram_vectorizer.fit(test_corpus)

# transform to get the vectors
ngram_vectors = ngram_vectorizer.transform(test_corpus)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2
features = ngram_vectorizer.get_feature_names_out()
for i, row in enumerate(ngram_vectors.toarray()):
    print(test_corpus.iloc[i])
    print(
        pd.DataFrame({"Terms": features, "Counts": row}).to_string(
            index=False
        )
    )
    print("-" * 40)

The same principles as before, namely vectorizing using term frequencies or term frequency-inverse document frequencies, apply here too.

A big advantage of tokenizing using n-grams is that models can learn some basic information about which words tend to appear together, and which words follow on from other sequences.

## spaCy

We will use `spacy` for accessing GloVe vectors in Python. It's a really nice package for natural language processing that we definitely recommend checking out.

In [None]:
import spacy

In [None]:
# the model must be installed first, e.g. with
# `python -m spacy download en_core_web_lg`
nlp = spacy.load("en_core_web_lg")

In [None]:
nlp("This is a sentence")

In [None]:
# extract all nouns from the sentence
[t.text for t in nlp("This is a sentence") if t.pos_ == "NOUN"]

In [None]:
# lemmatise each of the words in the phrase
[t.lemma_ for t in nlp("Some knives, forks and spoons")]

In [None]:
# get the shape of the GloVe vector associated with the word "octopus"
nlp("octopus").vector.shape

## Word similarity

We can use GloVe vectors to compare how similar words are to each other. The code below computes the "cosine similarity" between the GloVe vectors, which is defined as follows:
$$\mathrm{similarity}(\mathbf{v}, \mathbf{w}) = \cos(\theta) = \frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\| \| \mathbf{w}\|}$$
where $\theta$ is the angle between the two vectors.
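
The formula translates directly into a few lines of `numpy`. The function name `cosine_similarity` and the 2-D vectors below are illustrative assumptions, chosen so that the results can be checked by hand.

```python
import numpy as np


def cosine_similarity(v, w):
    # dot product divided by the product of the Euclidean norms
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))


v = np.array([1.0, 0.0])

same_direction = cosine_similarity(v, np.array([2.0, 0.0]))  # angle 0
diagonal = cosine_similarity(v, np.array([1.0, 1.0]))  # angle 45 degrees
orthogonal = cosine_similarity(v, np.array([0.0, 3.0]))  # angle 90 degrees

print(same_direction, diagonal, orthogonal)
```

Because the norms are divided out, only the direction of each vector matters, not its length.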

In [None]:
import numpy as np

doc = nlp("dog cat banana apple")

similarities = np.zeros((4, 4))

for i, token1 in enumerate(doc):
    for j, token2 in enumerate(doc):
        similarities[i, j] = token1.similarity(
            token2
        )  # computes the cosine similarity

pd.DataFrame(
    similarities,
    index=["Dog", "Cat", "Banana", "Apple"],
    columns=["Dog", "Cat", "Banana", "Apple"],
)

Below we'll read in similarities comparing other words to "octopus".

In [None]:
sim = pd.read_pickle("similarities.pkl")

sim.sample(5, random_state=5)

In [None]:
sim.sort_values(by="similarity", ascending=True).head(15)

In [None]:
sim.sort_values(by="similarity", ascending=False).head(15)

## Direction vectors

GloVe vectors preserve relationships between related words, leading to approximate identities like $$\mathrm{king} \approx \mathrm{queen} - \mathrm{woman} + \mathrm{man}$$
We'll investigate this further below.
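
To see what such an identity means geometrically, here is a toy sketch with hand-made 2-D vectors. These are not real GloVe vectors: the values are invented so that one axis loosely stands for "royalty" and the other for "gender", and the identity holds exactly here, whereas with real embeddings it holds only approximately.

```python
import numpy as np

# hypothetical 2-D "embeddings": axis 0 ~ royalty, axis 1 ~ gender
toy_vectors = {
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
}


def cosine(v, w):
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))


# the "royalty" direction is the same for both pairs
royal_direction_m = toy_vectors["king"] - toy_vectors["man"]
royal_direction_f = toy_vectors["queen"] - toy_vectors["woman"]

# queen - woman + man, then look up the most similar word
target = toy_vectors["queen"] - toy_vectors["woman"] + toy_vectors["man"]
nearest = max(toy_vectors, key=lambda word: cosine(toy_vectors[word], target))

print(nearest)
```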

In [None]:
from sklearn.decomposition import PCA

In [None]:
words = ["man", "woman", "king", "queen", "uncle", "aunt", "sir", "madam"]

In [None]:
pca = PCA()

glove_vectors = np.concatenate(
    [nlp(word).vector.reshape(1, 300) for word in words]
)

glove_pca = pca.fit_transform(glove_vectors)

In [None]:
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot


init_notebook_mode(connected=True)


def scatter(x, y, labels, text):
    data = [
        go.Scatter(
            x=x[labels == label],
            y=y[labels == label],
            mode="markers",
            opacity=0.7,
            text=text[labels == label],
            name=label,
            marker={"size": 15, "line": {"width": 0.5, "color": "white"}},
        )
        for label in set(labels)
    ]
    layout = go.Layout(
        xaxis={"showgrid": False, "showticklabels": False, "zeroline": False},
        yaxis={"showgrid": False, "showticklabels": False, "zeroline": False},
        hovermode="closest",
    )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig, config={"displayModeBar": False})

In [None]:
# cast words to numpy array so scatter can create boolean mask
scatter(glove_pca[:, 0], glove_pca[:, 1], np.array(words), np.array(words))

In [None]:
companies_ceo = [
    "dorsey",
    "twitter",
    "zuckerberg",
    "facebook",
    "ballmer",
    "microsoft",
    "bezos",
    "amazon",
]

In [None]:
pca = PCA()

glove_vectors_ceo = np.concatenate(
    [nlp(word).vector.reshape(1, 300) for word in companies_ceo]
)

glove_pca_ceo = pca.fit_transform(glove_vectors_ceo)

In [None]:
# cast companies_ceo to numpy array so scatter can create boolean mask
scatter(
    glove_pca_ceo[:, 0],
    glove_pca_ceo[:, 1],
    np.array(companies_ceo),
    np.array(companies_ceo),
)

In [None]:
comparatives = [
    "slow",
    "slower",
    "slowest",
    "fast",
    "faster",
    "fastest",
    "long",
    "longer",
    "longest",
]

In [None]:
pca = PCA()

glove_comparatives = np.concatenate(
    [nlp(word).vector.reshape(1, 300) for word in comparatives]
)

glove_comparatives = pca.fit_transform(glove_comparatives)

In [None]:
# cast comparatives to numpy array so scatter can create boolean mask
scatter(
    glove_comparatives[:, 0],
    glove_comparatives[:, 1],
    np.array(comparatives),
    np.array(comparatives),
)