This notebook outlines several methods for tokenizing text into words (and sentences) and highlights the differences between them:

* whitespace
* nltk (Penn Treebank tokenizer)
* nltk (Twitter-aware)
* spaCy
* custom regular expressions

In [459]:
import nltk, re, json
import spacy
from collections import Counter

In [460]:
# Only spaCy's tokenizer is used below, so skip loading the tagger, NER and parser.
# `disable` expects a list of separate component names, not one comma-joined string.
nlp = spacy.load('en', disable=['tagger', 'ner', 'parser'])

In [461]:
def read_tweets_from_json(filename):
    tweets=[]
    with open(filename, encoding="utf-8") as file:
        data=json.load(file)
        for tweet in data:
            tweets.append(tweet["text"])
    return tweets        

trump_tweets.json comes from the Trump Twitter collection here (downloaded 1/19/19)
http://www.trumptwitterarchive.com/archive

In [462]:
filename="../data/trump_tweets.json"

In [463]:
tweets=read_tweets_from_json(filename)

In [464]:
tweets

['Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States. We stopped the last two - many are still in Mexico but can’t get through our Wall, but it takes a lot of Border Agents if there is no Wall. Not easy!',
 'Many people are saying that the Mainstream Media will have a very hard time restoring credibility because of the way they have treated me over the past 3 years (including the election lead-up), as highlighted by the disgraceful Buzzfeed story &amp; the even more disgraceful coverage!',
 'The Economy is one of the best in our history, with unemployment at a 50 year low, and the Stock Market ready to again break a record (set by us many times) - &amp; all you heard yesterday, based on a phony story, was Impeachment. You want to see a Stock Market Crash, Impeach Trump!',
 '.@newtgingrich just stated that there has been no president since Abraham Lincoln who has been treated worse or more unfairly by the media than your favorite Presi

In [465]:
whitespace_tokens=[]
for tweet in tweets:
    whitespace_tokens.append(tweet.split())

In [466]:
whitespace_tokens

[['Mexico',
  'is',
  'doing',
  'NOTHING',
  'to',
  'stop',
  'the',
  'Caravan',
  'which',
  'is',
  'now',
  'fully',
  'formed',
  'and',
  'heading',
  'to',
  'the',
  'United',
  'States.',
  'We',
  'stopped',
  'the',
  'last',
  'two',
  '-',
  'many',
  'are',
  'still',
  'in',
  'Mexico',
  'but',
  'can’t',
  'get',
  'through',
  'our',
  'Wall,',
  'but',
  'it',
  'takes',
  'a',
  'lot',
  'of',
  'Border',
  'Agents',
  'if',
  'there',
  'is',
  'no',
  'Wall.',
  'Not',
  'easy!'],
 ['Many',
  'people',
  'are',
  'saying',
  'that',
  'the',
  'Mainstream',
  'Media',
  'will',
  'have',
  'a',
  'very',
  'hard',
  'time',
  'restoring',
  'credibility',
  'because',
  'of',
  'the',
  'way',
  'they',
  'have',
  'treated',
  'me',
  'over',
  'the',
  'past',
  '3',
  'years',
  '(including',
  'the',
  'election',
  'lead-up),',
  'as',
  'highlighted',
  'by',
  'the',
  'disgraceful',
  'Buzzfeed',
  'story',
  '&amp;',
  'the',
  'even',
  'more',
  'disg

In [467]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ayoanimashaun/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [468]:
nltk_tokens=[]
for tweet in tweets:
    nltk_tokens.append(nltk.word_tokenize(tweet, language="english"))

In [469]:
len(nltk_tokens)

36583

In [470]:
nltk_casual_tokens=[]
for tweet in tweets:
    nltk_casual_tokens.append(nltk.casual_tokenize(tweet))

In [471]:
len(nltk_casual_tokens)

36583

In [472]:
spacy_tokens=[]
for tweet in tweets:
    spacy_tokens.append([token.text for token in nlp(tweet)])

In [473]:
spacy_tokens

[['Mexico',
  'is',
  'doing',
  'NOTHING',
  'to',
  'stop',
  'the',
  'Caravan',
  'which',
  'is',
  'now',
  'fully',
  'formed',
  'and',
  'heading',
  'to',
  'the',
  'United',
  'States',
  '.',
  'We',
  'stopped',
  'the',
  'last',
  'two',
  '-',
  'many',
  'are',
  'still',
  'in',
  'Mexico',
  'but',
  'ca',
  'n’t',
  'get',
  'through',
  'our',
  'Wall',
  ',',
  'but',
  'it',
  'takes',
  'a',
  'lot',
  'of',
  'Border',
  'Agents',
  'if',
  'there',
  'is',
  'no',
  'Wall',
  '.',
  'Not',
  'easy',
  '!'],
 ['Many',
  'people',
  'are',
  'saying',
  'that',
  'the',
  'Mainstream',
  'Media',
  'will',
  'have',
  'a',
  'very',
  'hard',
  'time',
  'restoring',
  'credibility',
  'because',
  'of',
  'the',
  'way',
  'they',
  'have',
  'treated',
  'me',
  'over',
  'the',
  'past',
  '3',
  'years',
  '(',
  'including',
  'the',
  'election',
  'lead',
  '-',
  'up',
  ')',
  ',',
  'as',
  'highlighted',
  'by',
  'the',
  'disgraceful',
  'Buzzfeed'

In [474]:
# Shorter version of http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py

# The order here is important (match from first to last)

# Keep usernames together (any token starting with @, followed by A-Z, a-z, 0-9, or _)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (any token starting with #, followed by A-Z, a-z, 0-9, _, or -)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize(text):
    return my_extensible_tokenizer.findall(text)
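
Because the alternatives are tried from first to last at each position, the order of the patterns decides which rule claims a span of text. The sketch below repeats the same five patterns so it runs on its own; the names `patterns`, `tokenizer`, `sample` and `demo_tokens` are illustrative and not part of the notebook.

```python
import re

# Same patterns as the cell above, repeated so this sketch is self-contained.
patterns = (
    r"(?:@[\w_]+)",                     # usernames
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",   # hashtags
    r"(?:[a-z][a-z’'\-_]+[a-z])",       # words with internal apostrophes/hyphens
    r"(?:[\w_]+)",                      # other runs of word characters
    r"(?:\S)",                          # any remaining non-whitespace character
)
tokenizer = re.compile("|".join(patterns), re.VERBOSE | re.I | re.UNICODE)

sample = "@FoxNews I can't wait, lead-up! #MAGA"
demo_tokens = tokenizer.findall(sample)
print(demo_tokens)

# The letters-only word rule comes before the general \w rule, so a mixed
# letter/digit string is split where the first digit appears.
print(tokenizer.findall("qlux5e"))
```

The second call reproduces the `'qlux', '5e'` split that shows up in the URL output further down: the word rule matches `qlux` first, and the general rule picks up `5e` afterwards.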

In [475]:
big_regex

"(?:@[\\w_]+)|(?:\\#+[\\w_]+[\\w\\'_\\-]*[\\w_]+)|(?:[a-z][a-z’'\\-_]+[a-z])|(?:[\\w_]+)|(?:\\S)"

In [476]:
my_extensible_tokenizer

re.compile(r"(?:@[\w_]+)|(?:\#+[\w_]+[\w\'_\-]*[\w_]+)|(?:[a-z][a-z’'\-_]+[a-z])|(?:[\w_]+)|(?:\S)",
re.IGNORECASE|re.UNICODE|re.VERBOSE)

In [477]:
extensible_tokens=[]
for tweet in tweets:
    extensible_tokens.append(my_extensible_tokenize(tweet))

In [478]:
len(extensible_tokens)

36583

Q1: Write a function to print out the first 5 tokenized tweets in each of the five tokenizers above. Examine those tweets; how would you characterize the differences?



In [479]:
def last_five_tokens(tokenizer_list):
    return tokenizer_list[-5:]
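
Q1 asks for the first five tweets, whereas `last_five_tokens` returns the last five. A minimal sketch of a helper that shows the first `n` tweets for several tokenizers side by side follows; `first_n_tokenized`, `print_first_n` and the toy data are hypothetical names introduced here for illustration.

```python
def first_n_tokenized(tokenizations, n=5):
    """Return {tokenizer name: first n tokenized tweets}."""
    return {name: tweets[:n] for name, tweets in tokenizations.items()}

def print_first_n(tokenizations, n=5):
    """Print the first n tokenized tweets under each tokenizer's name."""
    for name, tweets in first_n_tokenized(tokenizations, n).items():
        print(name)
        for tokens in tweets:
            print("   ", tokens)

# Toy stand-ins for the real token lists built earlier in the notebook.
toy = {
    "whitespace": [["Not", "easy!"], ["Very", "funny!"]],
    "spacy": [["Not", "easy", "!"], ["Very", "funny", "!"]],
}
print_first_n(toy, n=1)
```

In the notebook, the dictionary could map names such as `"whitespace"` and `"nltk"` to `whitespace_tokens`, `nltk_tokens`, and the other three lists.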

In [480]:
last_five_tokens(whitespace_tokens)

[['"My',
  'persona',
  'will',
  'never',
  'be',
  'that',
  'of',
  'a',
  'wallflower',
  '-',
  'I’d',
  'rather',
  'build',
  'walls',
  'than',
  'cling',
  'to',
  'them"',
  '--Donald',
  'J.',
  'Trump'],
 ['New',
  'Blog',
  'Post:',
  'Celebrity',
  'Apprentice',
  'Finale',
  'and',
  'Lessons',
  'Learned',
  'Along',
  'the',
  'Way:',
  'http://tinyurl.com/qlux5e'],
 ['Donald',
  'Trump',
  'reads',
  'Top',
  'Ten',
  'Financial',
  'Tips',
  'on',
  'Late',
  'Show',
  'with',
  'David',
  'Letterman:',
  'http://tinyurl.com/ooafwn',
  '-',
  'Very',
  'funny!'],
 ['Donald',
  'Trump',
  'will',
  'be',
  'appearing',
  'on',
  'The',
  'View',
  'tomorrow',
  'morning',
  'to',
  'discuss',
  'Celebrity',
  'Apprentice',
  'and',
  'his',
  'new',
  'book',
  'Think',
  'Like',
  'A',
  'Champion!'],
 ['Be',
  'sure',
  'to',
  'tune',
  'in',
  'and',
  'watch',
  'Donald',
  'Trump',
  'on',
  'Late',
  'Night',
  'with',
  'David',
  'Letterman',
  'as',
  'he',


In [481]:
last_five_tokens(nltk_tokens)

[['``',
  'My',
  'persona',
  'will',
  'never',
  'be',
  'that',
  'of',
  'a',
  'wallflower',
  '-',
  'I',
  '’',
  'd',
  'rather',
  'build',
  'walls',
  'than',
  'cling',
  'to',
  'them',
  "''",
  '--',
  'Donald',
  'J.',
  'Trump'],
 ['New',
  'Blog',
  'Post',
  ':',
  'Celebrity',
  'Apprentice',
  'Finale',
  'and',
  'Lessons',
  'Learned',
  'Along',
  'the',
  'Way',
  ':',
  'http',
  ':',
  '//tinyurl.com/qlux5e'],
 ['Donald',
  'Trump',
  'reads',
  'Top',
  'Ten',
  'Financial',
  'Tips',
  'on',
  'Late',
  'Show',
  'with',
  'David',
  'Letterman',
  ':',
  'http',
  ':',
  '//tinyurl.com/ooafwn',
  '-',
  'Very',
  'funny',
  '!'],
 ['Donald',
  'Trump',
  'will',
  'be',
  'appearing',
  'on',
  'The',
  'View',
  'tomorrow',
  'morning',
  'to',
  'discuss',
  'Celebrity',
  'Apprentice',
  'and',
  'his',
  'new',
  'book',
  'Think',
  'Like',
  'A',
  'Champion',
  '!'],
 ['Be',
  'sure',
  'to',
  'tune',
  'in',
  'and',
  'watch',
  'Donald',
  'Tru

In [482]:
last_five_tokens(nltk_casual_tokens)

[['"',
  'My',
  'persona',
  'will',
  'never',
  'be',
  'that',
  'of',
  'a',
  'wallflower',
  '-',
  'I',
  '’',
  'd',
  'rather',
  'build',
  'walls',
  'than',
  'cling',
  'to',
  'them',
  '"',
  '-',
  '-',
  'Donald',
  'J',
  '.',
  'Trump'],
 ['New',
  'Blog',
  'Post',
  ':',
  'Celebrity',
  'Apprentice',
  'Finale',
  'and',
  'Lessons',
  'Learned',
  'Along',
  'the',
  'Way',
  ':',
  'http://tinyurl.com/qlux5e'],
 ['Donald',
  'Trump',
  'reads',
  'Top',
  'Ten',
  'Financial',
  'Tips',
  'on',
  'Late',
  'Show',
  'with',
  'David',
  'Letterman',
  ':',
  'http://tinyurl.com/ooafwn',
  '-',
  'Very',
  'funny',
  '!'],
 ['Donald',
  'Trump',
  'will',
  'be',
  'appearing',
  'on',
  'The',
  'View',
  'tomorrow',
  'morning',
  'to',
  'discuss',
  'Celebrity',
  'Apprentice',
  'and',
  'his',
  'new',
  'book',
  'Think',
  'Like',
  'A',
  'Champion',
  '!'],
 ['Be',
  'sure',
  'to',
  'tune',
  'in',
  'and',
  'watch',
  'Donald',
  'Trump',
  'on',
 

In [483]:
last_five_tokens(spacy_tokens)

[['"',
  'My',
  'persona',
  'will',
  'never',
  'be',
  'that',
  'of',
  'a',
  'wallflower',
  '-',
  'I',
  '’d',
  'rather',
  'build',
  'walls',
  'than',
  'cling',
  'to',
  'them',
  '"',
  '--Donald',
  'J.',
  'Trump'],
 ['New',
  'Blog',
  'Post',
  ':',
  'Celebrity',
  'Apprentice',
  'Finale',
  'and',
  'Lessons',
  'Learned',
  'Along',
  'the',
  'Way',
  ':',
  'http://tinyurl.com/qlux5e'],
 ['Donald',
  'Trump',
  'reads',
  'Top',
  'Ten',
  'Financial',
  'Tips',
  'on',
  'Late',
  'Show',
  'with',
  'David',
  'Letterman',
  ':',
  'http://tinyurl.com/ooafwn',
  '-',
  'Very',
  'funny',
  '!'],
 ['Donald',
  'Trump',
  'will',
  'be',
  'appearing',
  'on',
  'The',
  'View',
  'tomorrow',
  'morning',
  'to',
  'discuss',
  'Celebrity',
  'Apprentice',
  'and',
  'his',
  'new',
  'book',
  'Think',
  'Like',
  'A',
  'Champion',
  '!'],
 ['Be',
  'sure',
  'to',
  'tune',
  'in',
  'and',
  'watch',
  'Donald',
  'Trump',
  'on',
  'Late',
  'Night',
  'w

In [484]:
last_five_tokens(extensible_tokens)

[['"',
  'My',
  'persona',
  'will',
  'never',
  'be',
  'that',
  'of',
  'a',
  'wallflower',
  '-',
  'I’d',
  'rather',
  'build',
  'walls',
  'than',
  'cling',
  'to',
  'them',
  '"',
  '-',
  '-',
  'Donald',
  'J',
  '.',
  'Trump'],
 ['New',
  'Blog',
  'Post',
  ':',
  'Celebrity',
  'Apprentice',
  'Finale',
  'and',
  'Lessons',
  'Learned',
  'Along',
  'the',
  'Way',
  ':',
  'http',
  ':',
  '/',
  '/',
  'tinyurl',
  '.',
  'com',
  '/',
  'qlux',
  '5e'],
 ['Donald',
  'Trump',
  'reads',
  'Top',
  'Ten',
  'Financial',
  'Tips',
  'on',
  'Late',
  'Show',
  'with',
  'David',
  'Letterman',
  ':',
  'http',
  ':',
  '/',
  '/',
  'tinyurl',
  '.',
  'com',
  '/',
  'ooafwn',
  '-',
  'Very',
  'funny',
  '!'],
 ['Donald',
  'Trump',
  'will',
  'be',
  'appearing',
  'on',
  'The',
  'View',
  'tomorrow',
  'morning',
  'to',
  'discuss',
  'Celebrity',
  'Apprentice',
  'and',
  'his',
  'new',
  'book',
  'Think',
  'Like',
  'A',
  'Champion',
  '!'],
 ['Be'

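### Each tokenizer behaves slightly differently, mainly in how it treats punctuation, contractions and URLs. Some tokenizers split punctuation and apostrophes into separate tokens, whilst others keep them attached to the word. For example, extensible_tokens keeps "I’d" as a single token, whilst nltk_casual_tokens splits the same expression into three tokens ('I', '’', 'd'), one for each character. URLs differ too: whitespace_tokens, nltk_casual_tokens and spacy_tokens keep 'http://tinyurl.com/qlux5e' whole, whilst nltk_tokens and extensible_tokens break it apart. ###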


Q2: Write a function `compare(tokenization_one, tokenization_two)` that compares two tokenizations of the same text and finds the 20 most frequent tokens that don't appear in the other.



In [485]:
from itertools import chain
def compare(tokenization_one, tokenization_two):
    # For each tweet, collect the tokens that tokenization_one produced but
    # tokenization_two did not. A token counts at most once per tweet, so the
    # totals are numbers of tweets rather than raw token frequencies.
    newlist=[]
    for tokens_one, tokens_two in zip(tokenization_one, tokenization_two):
        newlist+=list(set(tokens_one)-set(tokens_two))
    print(Counter(newlist).most_common(20))
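
The function above compares the two tokenizations tweet by tweet. Another reading of Q2 is to compare vocabularies over the whole dataset and rank by raw token frequency. The sketch below takes that alternative reading; `compare_by_frequency` and the toy inputs are hypothetical and not part of the original answer.

```python
from collections import Counter
from itertools import chain

def compare_by_frequency(tokenization_one, tokenization_two, k=20):
    """Most frequent tokens of tokenization_one that never occur in tokenization_two."""
    counts_one = Counter(chain.from_iterable(tokenization_one))
    vocab_two = set(chain.from_iterable(tokenization_two))
    only_in_one = Counter(
        {token: count for token, count in counts_one.items() if token not in vocab_two}
    )
    return only_in_one.most_common(k)

# Toy example: one tokenizer keeps "don't" whole, the other splits it.
toy_one = [["don't", "stop", "!"], ["don't", "go"]]
toy_two = [["do", "n't", "stop", "!"], ["do", "n't", "go"]]
result = compare_by_frequency(toy_one, toy_two)
print(result)
```

With this reading, a token is reported only if the other tokenizer never produces it anywhere in the dataset, so the counts can differ noticeably from the per-tweet version.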

In [486]:
compare(nltk_casual_tokens, nltk_tokens)

[('"', 12802), ('@realDonaldTrump', 8593), ('-', 1714), ('.', 1152), ('S', 893), ('…', 875), ('U', 844), ('#Trump2016', 839), ('@BarackObama', 720), ('/', 681), ("don't", 604), ("'", 567), ('#MakeAmericaGreatAgain', 560), ('@FoxNews', 545), ('@foxandfriends', 503), ("I'm", 500), ('s', 471), ('—', 443), ("can't", 417), ('@ApprenticeNBC', 393)]


Q3: Use one of the NLTK tokenizers; write code to determine how many sentences are in this dataset, and what the average number of words per sentence is.



In [487]:
from nltk.tokenize import sent_tokenize

In [488]:
sentences=[len(sent_tokenize(tweet)) for tweet in tweets]

In [489]:
sum(sentences)

70491

In [490]:
# Words per tweet, using the whitespace tokenization (one list of tokens per tweet).
# whitespace_tokens must stay nested here; flattening it first would count characters instead.
number_of_words=[len(tokens) for tokens in whitespace_tokens]

In [495]:
sum(number_of_words)

664189

In [491]:
Word_to_Sentence_ratio= sum(number_of_words)/sum(sentences)
Word_to_Sentence_ratio

9.422323417173823
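
The figure above divides whitespace-separated words by NLTK sentence counts. A sketch that counts words inside the same sentences it counts, so both numbers come from one pass, might look like the following. `words_per_sentence` and `naive_sent_tokenize` are hypothetical names, and the naive splitter is only a stand-in so the example runs without NLTK data.

```python
import re

def words_per_sentence(texts, sent_tokenize, word_tokenize):
    """Return (sentence count, average words per sentence) over all texts."""
    n_sentences = 0
    n_words = 0
    for text in texts:
        for sentence in sent_tokenize(text):
            n_sentences += 1
            n_words += len(word_tokenize(sentence))
    average = n_words / n_sentences if n_sentences else 0.0
    return n_sentences, average

def naive_sent_tokenize(text):
    # Stand-in splitter: break after ., ! or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

toy_texts = ["Not easy! We stopped the last two.", "Very funny!"]
stats = words_per_sentence(toy_texts, naive_sent_tokenize, str.split)
print(stats)
```

In the notebook, the call would presumably be `words_per_sentence(tweets, nltk.sent_tokenize, nltk.word_tokenize)`. Note that `word_tokenize` counts punctuation marks as tokens, so the average would come out higher than with whitespace splitting.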

Q4 (check-plus): modify the extensible tokenizer above to keep urls together (e.g., www.google.com or http://www.google.com)

In [492]:
# Keep usernames together (any token starting with @, followed by A-Z, a-z, 0-9, or _)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (any token starting with #, followed by A-Z, a-z, 0-9, _, or -)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep urls together: start at http://, https:// or www. and run to the next
# whitespace, ending on a word character or "/" so trailing punctuation is left out
r"(?:(?:https?://|www\.)\S*[\w/])",

# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_url_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize_with_urls(text):
    return my_url_extensible_tokenizer.findall(text)
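
A URL rule is easiest to check in isolation, before it is mixed into the larger alternation. The sketch below tests one possible pattern against the two forms named in Q4 plus a shortened link. It is a simple heuristic rather than a full URL grammar, and `url_pattern`, `samples` and `found` are illustrative names.

```python
import re

# Starts at http://, https:// or www., runs to the next whitespace, and must end
# on a word character or "/" so that a trailing full stop is not swallowed.
url_pattern = re.compile(r"(?:https?://|www\.)\S*[\w/]", re.I)

samples = [
    "See http://www.google.com.",
    "Visit www.google.com today",
    "Post: http://tinyurl.com/qlux5e",
]
found = [url_pattern.findall(sample) for sample in samples]
for sample, urls in zip(samples, found):
    print(sample, "->", urls)
```

Each sample yields exactly one URL, and the sentence-final full stop in the first sample stays outside the match.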

In [493]:
print ('\n'.join(my_extensible_tokenize_with_urls("The course website is http://people.ischool.berkeley.edu/~dbamman/info256.html")))

The
course
website
is
http://people.ischool.berkeley.edu/~dbamman/info256.html
