This notebook outlines several methods for tokenizing text into words (and sentences), including:

* whitespace
* nltk (Penn Treebank tokenizer)
* nltk (Twitter-aware)
* spaCy
* custom regular expressions

highlighting differences between them.

In [5]:
import nltk, re, json
import spacy
from collections import Counter

In [6]:
# spaCy lemmatization needs tagger but disable the rest
nlp = spacy.load('en', disable=['tagger,ner,parser'])
nlp.remove_pipe('tagger')
nlp.remove_pipe('ner')
nlp.remove_pipe('parser');

In [7]:
def read_tweets_from_json(filename):
    tweets=[]
    with open(filename, encoding="utf-8") as file:
        data=json.load(file)
        for tweet in data:
            tweets.append(tweet["text"])
    return tweets        

trump_tweets.json comes from the Trump Twitter collection here (downloaded 1/19/19)
http://www.trumptwitterarchive.com/archive

In [8]:
filename="../data/trump_tweets.json"

In [8]:
#import nltk
#nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/frederikwarburg/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [190]:
tweets=read_tweets_from_json(filename)

In [191]:
whitespace_tokens=[]
for tweet in tweets:
    whitespace_tokens.append(tweet.split())

In [192]:
tweet

'Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!'

In [193]:
nltk_tokens=[]
for tweet in tweets:
    nltk_tokens.append(nltk.word_tokenize(tweet, language="english"))

In [194]:
nltk_casual_tokens=[]
for tweet in tweets:
    nltk_casual_tokens.append(nltk.casual_tokenize(tweet))

In [195]:
spacy_tokens=[]
for tweet in tweets:
    spacy_tokens.append([token.text for token in nlp(tweet)])

In [196]:
# Shorter version of http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py

# The order here is important (match from first to last)

# Keep usernames together (any token starting with @, followed by A-Z, a-z, 0-9)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (any token starting with #, followed by A-Z, a-z, 0-9, _, or -)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize(text):
    return my_extensible_tokenizer.findall(text)

In [197]:
extensible_tokens=[]
for tweet in tweets:
    extensible_tokens.append(my_extensible_tokenize(tweet))

"(?:\\#+[\\w_]+[\\w\\'_\\-]*[\\w_]+)"

Q1: Write a function to print out the first 5 tokenized tweets in each of the five tokenizers above. Examine those tweets; how would you characterize the differences?



In [205]:
def write_tweets(tweets, regexes, max_):
    for x, tweet in enumerate(tweets):
        print("Tweet ", x)
        for i in range(len(regexes)):
            tmp = re.compile(regexes[i])
            print("regex ", i, tmp.findall(tweet))
        print()

        if x >= max_ - 1:
            break
            
write_tweets(tweets, regexes, 5)



Tweet  0
regex  0 []
regex  1 []
regex  2 ['exico', 'doing', 'stop', 'the', 'aravan', 'which', 'now', 'fully', 'formed', 'and', 'heading', 'the', 'nited', 'tates', 'stopped', 'the', 'last', 'two', 'many', 'are', 'still', 'exico', 'but', 'can’t', 'get', 'through', 'our', 'all', 'but', 'takes', 'lot', 'order', 'gents', 'there', 'all', 'easy']
regex  3 ['Mexico', 'is', 'doing', 'NOTHING', 'to', 'stop', 'the', 'Caravan', 'which', 'is', 'now', 'fully', 'formed', 'and', 'heading', 'to', 'the', 'United', 'States', 'We', 'stopped', 'the', 'last', 'two', 'many', 'are', 'still', 'in', 'Mexico', 'but', 'can', 't', 'get', 'through', 'our', 'Wall', 'but', 'it', 'takes', 'a', 'lot', 'of', 'Border', 'Agents', 'if', 'there', 'is', 'no', 'Wall', 'Not', 'easy']
regex  4 ['M', 'e', 'x', 'i', 'c', 'o', 'i', 's', 'd', 'o', 'i', 'n', 'g', 'N', 'O', 'T', 'H', 'I', 'N', 'G', 't', 'o', 's', 't', 'o', 'p', 't', 'h', 'e', 'C', 'a', 'r', 'a', 'v', 'a', 'n', 'w', 'h', 'i', 'c', 'h', 'i', 's', 'n', 'o', 'w', 'f', '

The tweets have both grammatical and contentual differences. Some a addressing the economy, others border control. They are all consusive. Some are very aggresive, others more formel. This is deduced from the length of the sentences and the grammatical signs e.g. ! 

Q2: Write a function `compare(tokenization_one, tokenization_two)` that compares two tokenizations of the same text and finds the 20 most frequent tokens that don't appear in the other.



In [179]:
def compare(tokenization_one, tokenization_two, max_ = 20):

    tokenization_one = [item for sublist in tokenization_one for item in sublist]
    tokenization_one_dict = dict((x,tokenization_one.count(x)) for x in set(tokenization_one))
    tokenization_one_dict = sorted(tokenization_one_dict.items(), key=lambda x: x[1],reverse=True)

    tokenization_two = [item for sublist in tokenization_two for item in sublist]
    count = 0
    result = {}
    for key, value in tokenization_one_dict:
        if key not in tokenization_two:
            result[key] = value
            count += 1
        if count >= 19:
            break

    return result

In [184]:
results = compare(nltk_casual_tokens, nltk_tokens)
results

{'"': 24807,
 '@realDonaldTrump': 8661,
 '#Trump2016': 840,
 '@BarackObama': 732,
 "don't": 626,
 '#MakeAmericaGreatAgain': 560,
 '@FoxNews': 547,
 "I'm": 524,
 '@foxandfriends': 504,
 "can't": 423,
 '@ApprenticeNBC': 393,
 '@MittRomney': 314,
 "It's": 304,
 "it's": 303,
 '🇺': 300,
 '🇸': 300,
 '#CelebApprentice': 289,
 '@CNN': 285,
 "you're": 276}

Q3: Use one of the NLTK tokenizers; write code to determine how many sentences are in this dataset, and what the average number of words per sentence is.



In [71]:
nltk_tokens_senteces=[]
for tweet in tweets:
    nltk_tokens_senteces.append(nltk.sent_tokenize(tweet, language="english"))
        
import numpy as np
nltk_tokens_sentence_flat = [item for sublist in nltk_tokens_senteces for item in sublist]
print("Number of senteces: ", len(nltk_tokens_sentence_flat))
nltk_tokens_flat = [item for sublist in nltk_tokens for item in sublist]
print("Number of words: ", len(nltk_tokens_flat))
nltk_tokens_unique = np.unique(nltk_tokens_flat)
print("Number of unique words: ", len(nltk_tokens_unique))

Number of senteces:  70491
Number of words:  885028
Number of unique words:  56681


In [77]:
sentence_len = [len(sentence) for sentence in nltk_tokens_sentence_flat]
avg_len = np.mean(sentence_len)
print("Average sentence length: ", avg_len)

Average sentence length:  60.863982636081204


Q4 (check-plus): modify the extensible tokenizer above to keep urls together (e.g., www.google.com or http://www.google.com)

In [176]:
# Keep usernames together (any token starting with @, followed by A-Z, a-z, 0-9)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (any token starting with #, followed by A-Z, a-z, 0-9, _, or -)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep urls together
r"(?:https?:\/\/.+[\w_]+)",
         
# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_url_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize_with_urls(text):
    return my_url_extensible_tokenizer.findall(text)

In [177]:
print ('\n'.join(my_extensible_tokenize_with_urls("The course website is http://people.ischool.berkeley.edu/~dbamman/info256.html")))

The
course
website
is
http://people.ischool.berkeley.edu/~dbamman/info256.html


In [185]:
my_url_extensible_tokenizer

re.compile(r"(?:@[\w_]+)|(?:\#+[\w_]+[\w\'_\-]*[\w_]+)|(?:https?:\/\/.+[\w_]+)|(?:[a-z][a-z’'\-_]+[a-z])|(?:[\w_]+)|(?:\S)",
re.IGNORECASE|re.UNICODE|re.VERBOSE)