TOKENIZATION and VECTORIZATION

Let's set things up so that we can install any module from inside Jupyter - there will be no need to use the Terminal!

In [1]:
import sys

Now, you can install any module from Jupyter by running a line such as:
!{sys.executable} -m pip install module_name
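
Why sys.executable? It is the path of the Python interpreter that runs the current kernel, so pip installs into the same environment the notebook imports from. Here is a minimal sketch that builds the same command as a plain Python list. The helper name pip_install_command is made up for illustration, and nothing is actually installed:

```python
import sys

def pip_install_command(module_name):
    """Build the argument list for installing a module with the kernel's own pip."""
    # sys.executable is the interpreter running this notebook kernel.
    return [sys.executable, "-m", "pip", "install", module_name]

command = pip_install_command("nltk")
print(command)
# To actually run it: import subprocess; subprocess.check_call(command)
```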

We'll need the NLTK module (NLTK stands for Natural Language Toolkit).

In [2]:
!{sys.executable} -m pip install nltk
import nltk

distributed 1.21.8 requires msgpack, which is not installed.
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


From the NLTK module we'll use the 'punkt' sentence tokenizer.

In [3]:
nltk.download('punkt')
from pprint import pprint #pretty printing

[nltk_data] Downloading package punkt to /Users/corrine/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In what follows, we'll use an electronic archive of books from Project Gutenberg. In particular, we'll use "Alice's Adventures in Wonderland" by Lewis Carroll. Note that our corpus will be just one plain-text file called carroll-alice.txt.

In [6]:
nltk.download('gutenberg') 
from nltk.corpus import gutenberg 
alice = gutenberg.raw(fileids='carroll-alice.txt') 
pprint(alice[0:35])

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/corrine/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
"[Alice's Adventures in Wonderland b"


Let's split the Alice corpus into sentences using the sentence tokenizer from the NLTK module.

In [7]:
alice_sentences = nltk.sent_tokenize(text=alice)
print('\nTotal sentences in alice:', len(alice_sentences))


Total sentences in alice: 1625


Let's have a look at the first sentence in the Alice corpus.

In [8]:
print('\nFirst sentence in alice:', alice_sentences[0])


First sentence in alice: [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I.


What does the second sentence look like?

In [9]:
print('\nSecond sentence in alice:', alice_sentences[1])


Second sentence in alice: Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'
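
Notice that the heading "Down the Rabbit-Hole" ended up inside the second sentence: the tokenizer looks for sentence-ending punctuation, not for page layout. To see why a trained tokenizer is still worth having, here is a naive splitter that simply breaks after '.', '!' or '?'. This is only an illustration (the function name and the sample text are made up), not how punkt works internally:

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in pieces if p]

sample = "CHAPTER I. Down the Rabbit-Hole\n\nAlice was tired. Mr. Rabbit ran past!"
naive_sentences = naive_sent_split(sample)
for s in naive_sentences:
    print(repr(s))
```

The naive rule cuts right after the abbreviation "Mr.", which is clearly wrong. Punkt learns common abbreviations from data, so it usually avoids this kind of mistake.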


Let's do some tokenization by words now. We'll do it on a sentence below.

In [10]:
sentence = "The brown fox wasn't that quick and he couldn't win the races"
words = nltk.word_tokenize(sentence)
print(words)  

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'races']


Let's tokenize by punctuation rules now. Do you see any difference between this tokenization and the previous one?

In [11]:
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'races']


Let's tokenize by white spaces.

In [12]:
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'races']
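
To see what the last two tokenizers are doing, here is a sketch in plain Python. It assumes that WordPunctTokenizer is equivalent to the regular expression \w+|[^\w\s]+ (runs of word characters, or runs of punctuation) and that WhitespaceTokenizer behaves like str.split() on this sentence:

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the races"

# Runs of letters/digits/underscores, or runs of punctuation characters.
punct_tokens = re.findall(r"\w+|[^\w\s]+", sentence)
# Split on whitespace only, so contractions stay in one piece.
space_tokens = sentence.split()

print(punct_tokens)
print(space_tokens)
```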


Let's get rid of stop words ("it's", "is", "the", etc.). If the stop word list is not on your machine yet, run nltk.download('stopwords') first.

In [13]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

{'yourself', "needn't", "don't", "you'll", 'into', 'don', 'while', 'yourselves', 'they', 'then', "wouldn't", 'during', 'ain', "shouldn't", "shan't", 'up', 'i', 'won', 'before', 'where', 'some', 'can', 've', 'for', 'does', 'his', 'after', 'be', 'own', "didn't", 'haven', "weren't", 'were', 'to', 'wouldn', 'each', 'how', 'have', 'ourselves', 'couldn', 'been', 'about', 'as', 'o', 'we', 'he', 'at', 'it', 'nor', 'which', 'both', "won't", 'other', 'doing', 'now', 'will', 'more', 'yours', 'once', 'from', "wasn't", "isn't", 'until', 'themselves', 'because', 'only', 'any', 'ours', 'y', 'why', 's', 'mustn', 'by', 'few', 'himself', 'are', 'again', 'my', 'that', 'their', 'very', 'll', 'hadn', 'just', 'above', 'didn', 'against', 'those', 'was', 'same', 'needn', 'if', 'ma', 'hasn', 'not', "you'd", 'd', "should've", 'she', 'down', 'an', 'who', 'than', 'doesn', 'in', 'most', 'had', 'this', 'being', 'off', 'a', 'did', 'no', 'her', "you're", 'am', 'all', 'having', 'myself', 'through', 'me', "you've", "do

Compare the tokenized sentence before and after removing the stopwords.

In [14]:
filtered_tokens=[]

for w in words:
    if w not in stop_words:
        filtered_tokens.append(w)
        
print("Tokenized Sentence:",words)
print("Filtered Sentence (without stopwords):",filtered_tokens)

Tokenized Sentence: ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'races']
Filtered Sentence (without stopwords): ['The', 'brown', 'fox', 'quick', 'win', 'races']
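
Did you notice that 'The' survived the filter? The stop word list is all lowercase, and the comparison is case-sensitive. Here is a minimal sketch of a case-insensitive filter. The set small_stop_words is a hand-picked subset of the NLTK list, used only to keep the example self-contained:

```python
# Hand-picked subset of the NLTK English stop words (assumption for this sketch).
small_stop_words = {"the", "that", "and", "he", "wasn't", "couldn't"}

tokens = ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and',
          'he', "couldn't", 'win', 'the', 'races']

# Lowercase each token only for the comparison; keep the original spelling.
filtered_lower = [w for w in tokens if w.lower() not in small_stop_words]
print(filtered_lower)
```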


Stemming and Lemmatization. Let's stem the sentence first - what changed?

In [15]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

stemmed_tokens=[]
for w in filtered_tokens:
    stemmed_tokens.append(ps.stem(w))

print("Filtered Sentence:",filtered_tokens)
print("Stemmed Sentence:",stemmed_tokens)

Filtered Sentence: ['The', 'brown', 'fox', 'quick', 'win', 'races']
Stemmed Sentence: ['the', 'brown', 'fox', 'quick', 'win', 'race']


Compare stemming vs. lemmatization: 

In [16]:
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

word = "running"
print("Lemmatized Word:",lem.lemmatize(word,"v")) # 'v' indicates that the word is a verb
print("Stemmed Word:",ps.stem(word))

Lemmatized Word: run
Stemmed Word: run


One more comparison:

In [17]:
word = "flying"
print("Lemmatized Word:",lem.lemmatize(word,"v")) # 'v' indicates that the word is a verb
print("Stemmed Word:",ps.stem(word))

Lemmatized Word: fly
Stemmed Word: fli
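
Why 'fli'? A stemmer applies suffix rules without checking whether the result is a real word, while a lemmatizer looks the word up in a dictionary (WordNet here). The toy functions below imitate that difference. They are not the Porter algorithm or the WordNet lemmatizer: the three rules and the tiny dictionary are made up for illustration.

```python
VOWELS = "aeiou"

def toy_stem(word):
    """Apply three crude suffix rules in order (a toy, not Porter)."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]                      # flying -> fly, running -> runn
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
        word = word[:-1]                      # runn -> run
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        word = word[:-1] + "i"                # fly -> fli
    return word

# Hypothetical mini-dictionary standing in for WordNet.
TOY_LEMMAS = {"running": "run", "flying": "fly"}

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ["running", "flying"]:
    print(w, "->", toy_stem(w), "(stem),", toy_lemmatize(w), "(lemma)")
```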


Now let's tag each stemmed token with its part of speech (POS).

In [18]:
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.pos_tag(stemmed_tokens)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/corrine/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to /Users/corrine/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


[('the', 'DT'),
 ('brown', 'JJ'),
 ('fox', 'NN'),
 ('quick', 'JJ'),
 ('win', 'NN'),
 ('race', 'NN')]

You can look up the tags here:

In [19]:
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


Let's vectorize a small corpus about "blue skies and blue cheese".

In [20]:
corpus = ['the sky is blue',
          'sky is blue and sky is beautiful', 
          'the beautiful sky is so blue',
          'i love blue cheese']

We'll use the built-in vectorizers from scikit-learn, a machine learning module. We'll start with the bag-of-words representation.

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_BOW = CountVectorizer(max_features=1000)
BOW_matrix = vectorizer_BOW.fit_transform(corpus).toarray()
print(BOW_matrix)

[[0 0 1 0 1 0 1 0 1]
 [1 1 1 0 2 0 2 0 0]
 [0 1 1 0 1 0 1 1 1]
 [0 0 1 1 0 1 0 0 0]]
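
To see where these numbers come from, here is a sketch that rebuilds the same matrix with plain Python. It assumes CountVectorizer's defaults: text is lowercased, tokens need at least two characters (so 'i' is dropped), and the columns follow the alphabetically sorted vocabulary.

```python
import re
from collections import Counter

corpus = ['the sky is blue',
          'sky is blue and sky is beautiful',
          'the beautiful sky is so blue',
          'i love blue cheese']

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# One column per distinct token, in alphabetical order.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

bow_rows = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    bow_rows.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```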


Now, let's do feature extraction (vectorization) using the TF-IDF approach. Note, the results can be slightly different depending on the options you use. See documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer 

vectorizer_TF_IDF = TfidfVectorizer(max_df = 1.0, min_df = 1, norm = None, smooth_idf=True)
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(corpus).todense()
print(TF_IDF_matrix)


[[0.         0.         1.         0.         1.22314355 0.
  1.22314355 0.         1.51082562]
 [1.91629073 1.51082562 1.         0.         2.4462871  0.
  2.4462871  0.         0.        ]
 [0.         1.51082562 1.         0.         1.22314355 0.
  1.22314355 1.91629073 1.51082562]
 [0.         0.         1.         1.91629073 0.         1.91629073
  0.         0.         0.        ]]


Have a look at the IDF weights:

In [26]:
print(vectorizer_TF_IDF.idf_)

[1.91629073 1.51082562 1.         1.91629073 1.22314355 1.91629073
 1.22314355 1.91629073 1.51082562]
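
These weights can be checked by hand. With smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents that contain term t. A sketch, with the document frequencies counted by hand from the four-sentence corpus:

```python
import math

n_docs = 4
# Number of documents in which each term appears (counted by hand).
doc_freq = {'and': 1, 'beautiful': 2, 'blue': 4, 'cheese': 1, 'is': 3,
            'love': 1, 'sky': 3, 'so': 1, 'the': 2}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
       for term, df in doc_freq.items()}

for term, weight in idf.items():
    print(f"{term:10s} {weight:.8f}")
```

With norm=None, each TF-IDF entry is simply the term count times this weight. For example, 'sky' appears twice in the second sentence, so its entry is 2 x 1.22314355 = 2.4462871.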


It's a good idea to normalize the TF-IDF matrix. With norm = 'l2', each row (i.e. each document) is scaled to have Euclidean length 1, which here keeps all entries between 0 and 1. Some text mining models require normalized matrices.

In [27]:
vectorizer_TF_IDF = TfidfVectorizer(max_df = 1.0, min_df = 1, norm = 'l2', smooth_idf=True)
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(corpus).todense()
print(TF_IDF_matrix)

[[0.         0.         0.39921021 0.         0.48829139 0.
  0.48829139 0.         0.60313701]
 [0.44051607 0.34730793 0.22987956 0.         0.5623514  0.
  0.5623514  0.         0.        ]
 [0.         0.43202578 0.28595344 0.         0.3497621  0.
  0.3497621  0.54796992 0.43202578]
 [0.         0.         0.34618161 0.66338461 0.         0.66338461
  0.         0.         0.        ]]
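
To check one row by hand, divide it by its Euclidean length (the square root of the sum of its squared entries). A sketch that starts from the first row of the unnormalized matrix printed earlier:

```python
import math

# First row of the unnormalized TF-IDF matrix (document 'the sky is blue').
raw_row = [0.0, 0.0, 1.0, 0.0, 1.22314355, 0.0, 1.22314355, 0.0, 1.51082562]

# Euclidean (L2) length of the row.
length = math.sqrt(sum(x * x for x in raw_row))
normalized_row = [x / length for x in raw_row]

print([round(x, 8) for x in normalized_row])
print(sum(x * x for x in normalized_row))  # very close to 1.0
```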


EXERCISE: You are given a new small corpus (see below). In Excel, compute the TF-IDF matrix (do not use normalization). Upload your Excel file to Blackboard at the end of class.

In [28]:
corpus_exercise = ['python is great for text mining',
          'anyone can learn python and do text mining', 
          'python can go without eating for days',
          'python can be a great pet']

In [29]:
vectorizer_BOW = CountVectorizer(max_features=1000)
BOW_matrix = vectorizer_BOW.fit_transform(corpus_exercise).toarray()
vectorizer_TF_IDF = TfidfVectorizer(max_df = 1.0, min_df = 1, norm = None, smooth_idf=True)
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(corpus_exercise).todense()
print(vectorizer_TF_IDF.idf_)

[1.91629073 1.91629073 1.91629073 1.22314355 1.91629073 1.91629073
 1.91629073 1.51082562 1.91629073 1.51082562 1.91629073 1.91629073
 1.51082562 1.91629073 1.         1.51082562 1.91629073]


In [31]:
print(BOW_matrix)

[[0 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 0]
 [1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 1 0]
 [0 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 1]
 [0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0]]

