# NLP on tweets - basics

Set up the required imports, and load the data:

In [268]:
import pandas as pd
import spacy as spc
from spacy import displacy as dsp
train_df = pd.read_csv("data/train.csv")

## Tailor the NLP pipeline to our purposes

Useful for reference:
* [rule-based matching](https://spacy.io/usage/rule-based-matching)
* [pipelines](https://spacy.io/usage/processing-pipelines)

Load up the base pipeline:

In [281]:
nlp = spc.load("en_core_web_sm")

The first step is to make sure that '@' and '#' get the same treatment. By default, '@' is kept as part of the token that follows it, while '#' is split off as its own token. Registering both as prefixes makes the tokenizer split each of them into a separate token, which simplifies the later processing steps.

In [282]:
# Converting to a list first works whether the defaults are stored as a tuple or a list.
prefixes = list(nlp.Defaults.prefixes) + [r'@', r'#']
prefix_regex = spc.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

So, this is what the tokenizer does now:

In [283]:
for tok in nlp("@username text #hashtag"):
    print(tok)

@
username
text
#
hashtag
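
To make the prefix mechanism more concrete, here is a minimal sketch of what a prefix search does, written with the standard `re` module only. The shortened `prefixes` list and the `split_prefixes` helper are hypothetical and exist only for illustration; spaCy's real tokenizer also handles suffixes, infixes and exceptions.

```python
import re

# Hypothetical, much-reduced prefix list; spaCy's real defaults are far longer.
prefixes = [r"\(", r'"', r"@", r"#"]

# Anchor every entry at the start of the string and join them into one pattern.
prefix_regex = re.compile("|".join("^" + p for p in prefixes))


def split_prefixes(chunk):
    """Peel leading prefix characters off a whitespace-delimited chunk."""
    pieces = []
    match = prefix_regex.search(chunk)
    # Stop when nothing matches or when the prefix is the whole chunk.
    while match and match.end() > 0 and len(chunk) > match.end():
        pieces.append(chunk[:match.end()])
        chunk = chunk[match.end():]
        match = prefix_regex.search(chunk)
    pieces.append(chunk)
    return pieces


tokens = [
    piece
    for chunk in "@username text #hashtag".split()
    for piece in split_prefixes(chunk)
]
print(tokens)
```

The sketch splits the same sample string into the same five pieces as the tokenizer output above.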


Next, define the part of the pipeline that combines the '@' and '#' symbols, followed by alphanumerics, into a single token:

In [271]:
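def hashtag_user_pipe(doc):
    # Merge a '#' or '@' with the token directly after it (no whitespace in between).
    with doc.retokenize() as retokenizer:
        for tok in doc:
            if tok.text not in ('#', '@') or tok.whitespace_ or tok.i + 1 >= len(doc):
                continue
            # Skip runs like '@@' so that merged spans never overlap.
            if doc[tok.i + 1].text not in ('#', '@'):
                retokenizer.merge(doc[tok.i:tok.i + 2])
    return doc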
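
The merge rule can be tried out without loading a model. The sketch below applies it to plain `(text, trailing_whitespace)` pairs. The `merge_marked` helper is hypothetical, and it assumes one extra condition: a marker followed by another marker is left alone, so that two merges never claim the same token.

```python
MARKERS = ("#", "@")


def merge_marked(tokens):
    """Merge each marker with the token right after it.

    tokens: list of (text, trailing_whitespace) pairs in document order.
    """
    merged = []
    i = 0
    while i < len(tokens):
        text, space = tokens[i]
        can_merge = (
            text in MARKERS
            and not space                       # marker must touch the next token
            and i + 1 < len(tokens)             # there must be a next token
            and tokens[i + 1][0] not in MARKERS  # assumption: never merge '@@'
        )
        if can_merge:
            next_text, next_space = tokens[i + 1]
            merged.append((text + next_text, next_space))
            i += 2  # skip the token that was just absorbed
        else:
            merged.append((text, space))
            i += 1
    return merged


sample = [("@", ""), ("username", " "), ("text", " "), ("#", ""), ("hashtag", "")]
print([text for text, _ in merge_marked(sample)])
```

A marker that is followed by whitespace, or that ends the text, stays a token of its own.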

Then, define the part of the pipeline that takes a merged '@xxxx' or '#xxxx' token and marks it as a user or hashtag entity, respectively. It also marks links as link entities.

In [272]:
def entity_pipe(nlp):
    ruler = nlp.create_pipe("entity_ruler")
    patterns = [
        {"label": "HASHTAG", "pattern": [{"TEXT": {"REGEX": r'^#\w+'}}]},
        {"label": "USER", "pattern": [{"TEXT": {"REGEX": r'^@\w+'}}]},
        {"label": "LINK", "pattern": [{"TEXT": {"REGEX": r'https?://.*'}}]}
    ]
    ruler.add_patterns(patterns)
    return ruler
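
The three regular expressions can be sanity-checked on their own before they go into the ruler. This is a small sketch using the standard `re` module. It assumes that the `REGEX` operator behaves like `re.search` on the token text, and the `label_for` helper is hypothetical.

```python
import re

# Same regular expressions as in the entity ruler patterns, keyed by label.
patterns = {
    "HASHTAG": re.compile(r"^#\w+"),
    "USER": re.compile(r"^@\w+"),
    "LINK": re.compile(r"https?://.*"),
}


def label_for(token_text):
    """Return the first label whose pattern is found in the text, else None."""
    for label, regex in patterns.items():
        if regex.search(token_text):
            return label
    return None


for text in ["#fashion", "@username", "https://t.co/ccwzdtfbus", "gear", "#"]:
    print(text, "->", label_for(text))
```

A bare '#' or '@' gets no label, because `\w+` requires at least one word character after the marker.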

Insert the two components into the NLP pipeline. The tokenizer always runs first (it is implicit and does not show up in the pipeline listing), so the token-merging function should be the first component in the pipeline. The entity ruler should go before the named entity recogniser, because we want the NER to label only what our custom ruler has not already labelled, not the other way around.

In [273]:
nlp.add_pipe(hashtag_user_pipe, name="retokenizer", first=True)
nlp.add_pipe(entity_pipe(nlp), name="entruler", before='ner')

The current pipeline looks like this:

In [274]:
nlp.pipeline

[('retokenizer', <function __main__.hashtag_user_pipe(doc)>),
 ('tagger', <spacy.pipeline.pipes.Tagger at 0x1ada722d0>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1de3a4d00>),
 ('entruler', <spacy.pipeline.entityruler.EntityRuler at 0x1dac9b2d0>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1de3a4ec0>)]
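
The reason for this ordering can be illustrated with a toy model of "earlier labels win". The sketch below is not how spaCy implements it. It only assumes that spans set by the ruler are kept and that a later prediction is accepted when it does not touch an already labelled token. The `fill_unclaimed` helper and the sample spans are hypothetical.

```python
def fill_unclaimed(rule_spans, model_spans):
    """Keep all rule spans, then add model spans that do not overlap them.

    Spans are (start, end, label) token ranges with an exclusive end.
    """
    claimed = set()
    for start, end, _ in rule_spans:
        claimed.update(range(start, end))

    kept = list(rule_spans)
    for start, end, label in model_spans:
        if claimed.isdisjoint(range(start, end)):
            kept.append((start, end, label))
            claimed.update(range(start, end))
    return sorted(kept)


rule_spans = [(3, 4, "HASHTAG")]
model_spans = [(0, 1, "ORG"), (3, 4, "MONEY")]  # hypothetical model predictions
print(fill_unclaimed(rule_spans, model_spans))
```

Swapping the two argument lists shows the opposite ordering, in which the statistical prediction would take precedence over the hashtag rule.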

## Execute the pipeline

Now, run the pipeline on a random tweet:

In [275]:
random_tweet = train_df.sample().iloc[0].text
doc = nlp(random_tweet)

The tweet has been tokenized, but not all tokens are useful. In particular, stop words and punctuation carry little information for our purposes, so `is_token_allowed` will filter those out, along with whitespace-only tokens:

In [276]:
def is_token_allowed(token):
    # token.string is deprecated; token.text is equivalent once stripped.
    return bool(token and token.text.strip() and not token.is_stop and not token.is_punct)
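
The filter can be exercised without loading a model by passing in stand-in objects that carry the same attribute names. This is only a sketch: `fake_token` is a hypothetical helper, and the filter is written against `.text` rather than `.string`, on the assumption that both hold the same stripped text.

```python
from types import SimpleNamespace


def is_token_allowed(token):
    # Keep tokens that have visible text and are neither stop words nor punctuation.
    return bool(token.text.strip()) and not token.is_stop and not token.is_punct


def fake_token(text, is_stop=False, is_punct=False):
    """Hypothetical stand-in exposing only the attributes the filter reads."""
    return SimpleNamespace(text=text, is_stop=is_stop, is_punct=is_punct)


tokens = [
    fake_token("the", is_stop=True),
    fake_token("riot"),
    fake_token("!", is_punct=True),
    fake_token("  "),
    fake_token("#fashion"),
]
kept = [t.text for t in tokens if is_token_allowed(t)]
print(kept)
```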

We also only want some entities:

In [277]:
def is_entity_allowed(entity):
    wanted = ['USER', 'HASHTAG', 'ORG', 'GPE']
    return entity.label_ in wanted

Also, all tokens should be converted to their lowercase, lemmatized form.
So, build two lists of dictionaries holding the results from the processed doc:

In [278]:
useful_tokens = [{'token': token.lemma_.strip().lower(), 'pos': token.pos_, 'dep': token.dep_, 'ent': token.ent_type_} for token in doc if is_token_allowed(token)]
useful_entities = [{'text': ent.text, 'label': ent.label_} for ent in doc.ents if is_entity_allowed(ent)]

## See the results

Finally, print out the results:

In [279]:
for tok in useful_tokens:
    print(tok)

print()

for ent in useful_entities:
    print(ent)

if doc.ents:
    dsp.render(doc, style='ent', options={'colors': {'USER': 'linear-gradient(90deg, #fc4a1a, #f7b733)', 
                                                     'HASHTAG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)',
                                                     'LINK': 'linear-gradient(90deg, #B2FEFA, #0ED2F7)'}})
else:
    print("No entities present.")

{'token': 'riot', 'pos': 'NOUN', 'dep': 'compound', 'ent': ''}
{'token': 'kit', 'pos': 'NOUN', 'dep': 'compound', 'ent': ''}
{'token': 'bah', 'pos': 'PROPN', 'dep': 'compound', 'ent': ''}
{'token': 'new', 'pos': 'ADJ', 'dep': 'amod', 'ent': ''}
{'token': 'concept', 'pos': 'NOUN', 'dep': 'compound', 'ent': ''}
{'token': 'gear', 'pos': 'PROPN', 'dep': 'pobj', 'ent': 'ORG'}
{'token': 'come', 'pos': 'VERB', 'dep': 'ROOT', 'ent': ''}
{'token': 'autumn', 'pos': 'PROPN', 'dep': 'nmod', 'ent': 'DATE'}
{'token': 'winter', 'pos': 'PROPN', 'dep': 'pobj', 'ent': 'DATE'}
{'token': '#menswear', 'pos': 'NUM', 'dep': 'nummod', 'ent': 'HASHTAG'}
{'token': '#fashion', 'pos': 'NUM', 'dep': 'nummod', 'ent': 'HASHTAG'}
{'token': '#urbanfashion\x89û', 'pos': 'PUNCT', 'dep': 'appos', 'ent': 'HASHTAG'}
{'token': 'https://t.co/ccwzdtfbus', 'pos': 'X', 'dep': 'punct', 'ent': 'LINK'}

{'text': 'Gear', 'label': 'ORG'}
{'text': '#menswear', 'label': 'HASHTAG'}
{'text': '#fashion', 'label': 'HASHTAG'}
{'text': '#ur