# Regular expressions & word tokenization
 

## Introduction to regular expressions 

Regular expressions: strings with a special syntax that let us match patterns in other strings.

In [None]:
import re

In [None]:
# match a pattern with a string
re.match('abc', 'abcdef')

In [None]:
# use special patterns that the regex engine understands
word_regex = r'\w+'  # one or more word characters (raw string keeps the backslash literal)
re.match(word_regex, 'hi there!')  # matches the first word, at the start of the string

Common regex patterns: \w+ (word), \d (digit), \s (whitespace), .* (wildcard: any run of characters except a newline), + or * (greedy repetition: one or more, or zero or more, of the preceding item), \S (any non-whitespace character), [a-z] (lowercase character class)

re Module: split, findall, search, match

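Syntax: pattern first, string second. split and findall return a list, search and match return a match object (or None if nothing matches), and finditer returns an iterator of match objects.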
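The return types of the four functions can be checked side by side. This is a minimal sketch using only the standard library; the sample sentence is made up for illustration.

```python
import re

text = "Let's write RegEx! Won't that be fun? I sure think so."

# re.split() returns a list of the pieces between matches
parts = re.split(r"[.?!]", text)
print(parts)

# re.findall() returns a list of every non-overlapping match
words = re.findall(r"\w+", text)
print(words[:3])

# re.search() returns a match object for the first match anywhere in the string
first_capital = re.search(r"[A-Z]\w+", text)
print(first_capital.group())

# re.match() only looks at the start of the string and returns None on failure
no_digits = re.match(r"\d+", text)
print(no_digits)
```

Because search and match can return None, it is usually safer to test the result before calling .group() on it.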

In [None]:
my_string = "Let's write RegEx!"

In [None]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

## Introduction to tokenization 

Tokenization: a process of turning a string or document into tokens (smaller chunks); one step in preparing a text for NLP, many different theories and rules; you can create your own rules using Regex. Examples: breaking out words or sentences, separating punctuation, separating all hashtags in a tweet.
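As a rough illustration of writing your own tokenization rules with regex, the sketch below pulls hashtags out of a tweet and then splits the tweet into word, hashtag, and punctuation tokens. The tweet text and both patterns are assumptions chosen for this example, not a standard tokenizer.

```python
import re

tweet = "Loving this #nlp course, great intro to #python regex!"

# Rule 1: a hashtag is '#' followed by one or more word characters
hashtags = re.findall(r"#\w+", tweet)

# Rule 2: a token is a hashtag, a word, or a single punctuation mark
# (the hashtag alternative comes first so '#nlp' is not split into '#' and 'nlp')
tokens = re.findall(r"#\w+|\w+|[^\w\s]", tweet)

print(hashtags)
print(tokens)
```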

nltk (natural language toolkit) library.

In [None]:
from nltk.tokenize import word_tokenize

word_tokenize('Hi there!')

Tokenization makes it easier to map part of speech, match common words, and remove unwanted tokens.

Other nltk tokenizers: sent_tokenize, regexp_tokenize, TweetTokenizer.
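A short sketch of two of these tokenizers, assuming nltk is installed. The tweet is invented for the example. sent_tokenize is left out here because it needs additional NLTK data to be downloaded first.

```python
from nltk.tokenize import regexp_tokenize, TweetTokenizer

tweet = "Loving this #nlp course, great intro to #python regex!"

# regexp_tokenize keeps only the substrings that match the pattern
hashtags = regexp_tokenize(tweet, r"#\w+")
print(hashtags)

# TweetTokenizer is designed to keep hashtags and mentions as single tokens
tweet_tokens = TweetTokenizer().tokenize(tweet)
print(tweet_tokens)
```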

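Difference between re.search() and re.match(): match only succeeds if the pattern matches at the beginning of the string, while search scans the whole string for the first match.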

In [None]:
re.match('abc', 'abcde')

In [None]:
re.search('abc', 'abcde')

In [None]:
re.match('cd', 'abcde')

In [None]:
re.search('cd', 'abcde')

## Advanced tokenization with NLTK and regex 

### Regex groups using OR "|" 

Define a group using (), define explicit character ranges using [].

In [None]:
match_digits_and_words = r'(\d+|\w+)'

In [None]:
re.findall(match_digits_and_words, 'He has 11 cats.')

Other patterns: [A-Za-z]+ (upper- and lowercase English letters), [0-9] (digits 0 to 9), [A-Za-z\-\.]+ (English letters, - and .), (a-z) (the literal sequence a, -, z), (\s+|,) (whitespace or a comma)
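The difference between a character class in [] and a group in () is easiest to see by running the patterns. The sketch below uses a made-up sentence and only the standard library.

```python
import re

text = "Smith-Jones owns 11 cats, 2 dogs"

# [A-Za-z]+ : runs of English letters only, so the hyphen splits the name
letters = re.findall(r"[A-Za-z]+", text)
print(letters)

# [0-9]+ : runs of digits
numbers = re.findall(r"[0-9]+", text)
print(numbers)

# [A-Za-z\-\.]+ : letters plus '-' and '.', so the hyphenated name stays whole
names = re.findall(r"[A-Za-z\-\.]+", text)
print(names)

# (a-z) is a group, not a range: it matches the literal characters 'a-z'
literal = re.findall(r"(a-z)", "files a-z")
print(literal)

# (\s+|,) matches either a run of whitespace or a comma
separators = re.findall(r"(\s+|,)", "cats, dogs")
print(separators)
```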

In [None]:
# character range with re.match(): the match stops at the first character outside the class (the comma)
my_str = 'match lowercase spaces nums like 12, but no commas'
re.match('[a-z0-9 ]+', my_str)

## Charting word length with nltk 

In [None]:
from matplotlib import pyplot as plt

In [None]:
plt.hist([1, 5, 5, 7, 7, 7, 9])
plt.show()

Combining NLP data extraction with plotting

In [None]:
words = word_tokenize('This is a pretty cool tool!')

In [None]:
# use list comprehension to transform it to a list of lengths
word_lengths = [len(w) for w in words]
plt.hist(word_lengths)
plt.show()

As we can see, 4 is the most common token length in the sentence (3 of the 7 tokens).

# Simple topic identification 

## Word counts with bag-of-words

Basic method for finding topics in a text. Need to first create tokens using tokenization, and then count up all the tokens. The more frequent a word/token is, the more important it might be.

In [None]:
from nltk.tokenize import word_tokenize
from collections import Counter

counter = Counter(word_tokenize("""The cat is in the box. The cat likes the box.
The box is over the cat."""))

In [None]:
counter.most_common(2)

## Simple text preprocessing

Why preprocess: helps make for better input data. Examples: tokenization to create a bag of words, lowercasing words.

Lemmatization/Stemming: shorten words to their root stems.

Removing stop words (and, the), punctuation, or unwanted tokens

In [None]:
from nltk.corpus import stopwords

text = """The cat is in the box. The cat likes the box.
The box is over the cat."""

# keep each token of the lowercased text only if it is purely alphabetic
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]

# remove stop words (build the set once instead of reloading the list for every token)
english_stops = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in english_stops]

In [None]:
Counter(no_stops).most_common(2)

In [None]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

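# Retain alphabetic words: alpha_only
# (tokenize first; iterating over the raw string would yield single characters)
alpha_only = [t for t in word_tokenize(text.lower()) if t.isalpha()]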

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in stopwords.words('english')]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

## Introduction to gensim

Popular open-source NLP library; uses top academic models to perform complex tasks: building document or word vectors, performing topic identification and document comparison

What is a word vector?

Gensim example: using LDA...

Creating a gensim dictionary:

In [None]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize

my_documents = ['The movie was about a spaceship and aliens.',
                'I really liked the movie!',
                'Awesome action scenes, but boring characters.']

tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]

dictionary = Dictionary(tokenized_docs)

dictionary.token2id

In [None]:
# create a gensim corpus
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
corpus

gensim models can be easily saved, updated, and reused

## Tf-idf with gensim

NLP model that allows you to determine the most important words in each document. Each corpus may have shared words beyond just stopwords, and these words should be down-weighted in importance. Tf-idf ensures most common words don't show up as key words.

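The weight of token i in document j is w_ij = tf_ij * log(N / df_i), where tf_ij is the frequency of token i in document j, df_i is the number of documents containing token i, and N is the total number of documents. The weight will be low if the term does not appear often in the document, because tf will then be low. However, the weight will also be low if the log is close to zero, which means the ratio inside it (N / df_i) is close to one (which means nearly all documents contain token i).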
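To make the weighting concrete, here is a small pure-Python sketch of tf * log(N / df). It assumes tf is the relative frequency of a token within its document and uses the natural log; gensim's TfidfModel uses its own defaults (including normalization), so its numbers will differ from these.

```python
import math
from collections import Counter

# Pre-tokenized toy documents (lowercased, punctuation removed for simplicity)
docs = [
    "the movie was about a spaceship and aliens".split(),
    "i really liked the movie".split(),
    "awesome action scenes but boring characters".split(),
]

n_docs = len(docs)

# df_i: number of documents that contain token i
doc_freq = Counter(token for doc in docs for token in set(doc))


def tfidf(doc):
    """Return {token: tf * log(N / df)} for one tokenized document."""
    counts = Counter(doc)
    return {
        token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
        for token, count in counts.items()
    }


weights = tfidf(docs[1])
for token, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{token:>8} {weight:.3f}")
```

Tokens shared with another document ("the", "movie") end up with lower weights than tokens unique to the second document ("really", "liked").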

In [None]:
from gensim.models.tfidfmodel import TfidfModel

tfidf = TfidfModel(corpus)

# reference 2nd document
tfidf[corpus[1]]

# Named-entity recognition

NLP task to identify important named entities in the text, like people, places, dates, etc. 

nltk and the Stanford CoreNLP library: integrated into Python via nltk, Java-based, support for NER as well as coreference and dependency trees.

In [None]:
import nltk
sentence = """In New York, I like to ride the Metro to visit MOMA
              and some restaurants rated well by Ruth Reichl."""
# preprocessing
tokenized_sent = nltk.word_tokenize(sentence)

# pos = part of speech
tagged_sent = nltk.pos_tag(tokenized_sent)

tagged_sent[:3]

NNP is the part-of-speech tag for a singular proper noun.

In [None]:
# chunk function, returns the sentence as a tree
print(nltk.ne_chunk(tagged_sent))

## Introduction to SpaCy

NLP library similar to gensim, with different implementations; focus on creating NLP pipelines to generate models and corpora.

In [None]:
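# SpaCy NER
import spacy
# spaCy 3 removed the 'en' shortcut; this assumes the en_core_web_sm model is installed
nlp = spacy.load('en_core_web_sm')
nlp.get_pipe('ner')  # the entity recognizer (nlp.entity in spaCy 2)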

In [None]:
doc = nlp("""Berlin is the capital of Germany; and the residence of Chancellor Angela Merkel.""")
doc.ents

In [None]:
# investigate the label for each entity
print(doc.ents[0], doc.ents[0].label_)

Why use SpaCy for NER? Easy pipeline creation, different entity types compared to nltk, informal language corpora (easily find entities in Tweets and chat messages), quickly growing.

## Multilingual NER with polyglot

polyglot: NLP library that uses word vectors; has support for 130+ languages

Spanish NER with polyglot:

In [None]:
from polyglot.text import Text

text = """Respecto al referéndum, Puigdemont ha defendido que éste será "efectivo" y tendrá "credibilidad" si la ciudadanía "lo hace suyo" con una amplia movilización."""
ptext = Text(text)
ptext.entities

# Building a "fake news" classifier

## Classifying fake news using supervised learning with NLP

Create supervised learning data from text using bag-of-words models or tf-idf as features.

## Building word count vectors with scikit-learn

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

We've transformed the text into bag-of-words vectors and generated training and test datasets.
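A minimal sketch of those two steps, assuming pandas and scikit-learn are installed. The headlines, the FAKE/REAL labels, and the column names are invented for illustration; a real dataset would be loaded from a file and be far larger.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for a real news dataset
df = pd.DataFrame({
    "text": [
        "aliens landed their spaceship on the white house lawn",
        "the senate passed the budget bill on tuesday",
        "miracle pill melts fat overnight doctors stunned",
        "city council approves new bike lanes downtown",
        "celebrity secretly replaced by robot clone",
        "local school wins national science prize",
        "moon landing was filmed in a basement insiders say",
        "central bank holds interest rates steady",
    ],
    "label": ["FAKE", "REAL", "FAKE", "REAL", "FAKE", "REAL", "FAKE", "REAL"],
})

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.25, random_state=53
)

# Learn the vocabulary on the training text only, then reuse it for the test text
count_vectorizer = CountVectorizer(stop_words="english")
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

print(count_train.shape, count_test.shape)
```

Fitting the vectorizer on the training text only keeps test-set vocabulary from leaking into the features.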

## Training and testing a classification model with scikit-learn

Naive Bayes model: commonly used in NLP, basis in probability.

Example: if the plot has a spaceship, how likely is it to be sci-fi?

Each word from CountVectorizer acts as a feature.

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
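The sketch below shows one way the pieces might fit together, following the sci-fi example above. The plot summaries and labels are made up, and the dataset is far too small for the accuracy to mean anything; it only illustrates the fit, predict, and score calls.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Hypothetical plot summaries and genres
train_plots = [
    "a spaceship lands on a distant planet full of aliens",
    "robots and aliens fight over a spaceship",
    "two strangers fall in love in paris",
    "a wedding brings old lovers together in love",
]
train_labels = ["sci-fi", "sci-fi", "romance", "romance"]

test_plots = ["aliens attack a spaceship", "lovers fall in love at a wedding"]
test_labels = ["sci-fi", "romance"]

# Each vocabulary word becomes one count feature
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_plots)
X_test = vectorizer.transform(test_plots)

# Fit the Naive Bayes classifier on word counts and predict the held-out plots
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, train_labels)
pred = nb_classifier.predict(X_test)

print(pred)
print(metrics.accuracy_score(test_labels, pred))
print(metrics.confusion_matrix(test_labels, pred, labels=["sci-fi", "romance"]))
```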

## Simple NLP, complex problems