# Distributional Semantics (Exercise): NLTK (tf * idf)

NLTK (Natural Language Toolkit) is one of the largest Python libraries for natural language processing. It was written primarily for educational purposes in computational linguistics and information retrieval. Steven Bird, Ewan Klein, and Edward Loper developed NLTK as an open-source project, originally for Python 2.

Access to NLTK: http://www.nltk.org/

## Installation
If you use Jupyter Notebook, NLTK is usually pre-installed. If something went wrong during installation, run the commands below in a terminal on a Debian-based OS (e.g. Ubuntu). NumPy is a module that NLTK also uses.

    $ sudo pip3 install -U nltk

    $ sudo pip3 install -U numpy
    
For missing modules in Anaconda, type:

    $ conda install nltk

    $ conda install numpy
    
You can check whether the installation was successful by executing "import nltk" in your Python shell.
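
As a minimal sketch of that check (the helper name `is_installed` is hypothetical and not part of NLTK), the snippet below asks Python whether the two modules can be located, without importing them:

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the module on the current path."""
    return importlib.util.find_spec(module_name) is not None

# Report the status of the two modules this session relies on
status = {name: is_installed(name) for name in ("nltk", "numpy")}
for name, found in status.items():
    print(name, "is installed" if found else "is MISSING")
```

If a module is reported as missing, rerun the matching install command from above.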

## The roadmap of this session

1. Import and include text data
2. Tokenize raw data
3. Remove stopwords
4. Display a term-document matrix
5. Apply tf * idf
6. Display the term-document matrix with tf * idf weights

In [None]:
import nltk

# If you are interested in large corpora, you can download ready-made corpora from NLTK.
nltk.download("brown")  # download the Brown corpus used below

# You can use one specific corpus resource
from nltk.corpus import brown

# Access different kinds of information about the corpus with the following commands

fileids = brown.fileids()  # file IDs of the corpus
rawtext = brown.raw()  # raw data
words = brown.words()  # words from the data
sentences = brown.sents()  # sentences from the data

# concordance() and similar() are methods of nltk.Text, so wrap the words first
text1 = nltk.Text(words)
text1.concordance("whale")  # pretty-print the concordances
text1.similar("whale")  # print words that occur in similar contexts

Working with NLTK requires a couple of additional modules and data packages. The most important ones are loaded below. You first need to download the additional data before using them.

In [None]:
import nltk

nltk.download('punkt')   # Punkt Tokenizer Model
nltk.download('averaged_perceptron_tagger')  # Part-of-Speech Tagger
nltk.download("stopwords") # Stopwords

In [None]:
import numpy as np
import string
import re

# modules for tokenization and removal of stopwords
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# modules for stemming
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer 

We will need some text data to process. Consider the following three documents in our dummy set.

In [None]:
ulysses = """Mrkgnao! the cat said loudly. She blinked up out of her avid shameclosing eyes, mewing \
plaintively and long, showing him her milkwhite teeth. He watched the dark eyeslits narrowing \
with greed till her eyes were green stones. Then he went to the dresser, took the jug Hanlon's \
milkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor. \
— Gurrhr! she cried, running to lap."""

hungry_cat = """The hungry cat
A hungry cat is looking for something to eat. She sees a little grey mouse sitting near his house. \
- I want to catch that little mouse, - says the hungry cat. \
She sits down and begins to cry "mew, mew, mew". \
The little grey mouse jumps up to run into his house, but the cat sits still and mews again. \
- She is sitting still, - thinks the mouse. \
- She doesn't want to catch me. I shall not run away. \
- Mew, mew, mew, - says the cat again. \
- Why are you crying? - asks the mouse. \
- See, I have a penny in my hand. \
- Good, you are lucky. That's nothing to cry about, - says the mouse. \
The hungry cat comes nearer. \
- Oh, little mouse, I shall get some meat with the penny. I shall cook it and have it for supper. \
- Good, you are lucky. That's nothing to cry about. \
The hungry cat comes nearer and nearer. \
- There lives a hungry dog in this house. He will eat all the meat. \
- Poor Pussy, - says the mouse. - What will you eat then? \
- You, - cries the cat and jumps at the little grey mouse. \
But the mouse is too quick. He jumps into his little house before the cat can say another "mew". \
- No, no, sly Pussy, - says the mouse. - You will not eat me. You must first catch me."""

bell_the_cat = """There was a grocery shop in a town. Plenty of mice lived in that grocery shop. Food was in plenty for them. They ate everything and spoiled all the bags. They also wasted the bread, biscuits and fruits of the shop.  \
The grocer got really worried. So, he thought "I should buy a cat and let it stay at the grocery. Only then I can save my things." \
He bought a nice, big fat cat and let him stay there. The cat had a nice time hunting the mice and killing them. The mice could not move freely now. They were afraid that anytime the cat would eat them up. \
The mice wanted to do something. They held a meeting and all of them tweeted "We must get rid of the cat. Can someone give a suggestion"? \
All the mice sat and brooded. A smart looking mouse stood up and said, "The cat moves softly. That is the problem. If we can tie a bell around her neck, then things will be fine. We can know the movements of the cat". \
"Yes, that is the answer," stated all the mice. An old mouse slowly stood up and asked, "Who would tie the bell?" After some moments there was no one there to answer this question. \
MORAL : Empty solutions are of no worth."""



#### Tokenizing

NLTK has a built-in tokenizer that splits text into sentences (sent_tokenize()) or into words (word_tokenize()). There are a number of ways to tokenize raw text, depending on the desired result.

The next step is important: it cuts the one large string of each raw document into single words. In our approach, a regex extracts the word tokens and thereby drops all punctuation.

In [None]:
#doc = nltk.word_tokenize(ulysses)
#print(doc)
#for s in doc:
#    print(">",s)

In [None]:
# Abbreviations (U.S.A.), words with optional apostrophes (doesn't), numbers with optional $ or %
token_pattern = r"(?:[A-Z]\.)+|\w+(?:[']\w+)*|\$?\d+(?:\.\d+)?%?"
ulysses = re.findall(token_pattern, ulysses)
hungry_cat = re.findall(token_pattern, hungry_cat)
bell_the_cat = re.findall(token_pattern, bell_the_cat)
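
To see what the pattern keeps and drops, here is a small sketch on a made-up sentence (the sentence and the name `sample` are only for illustration). It uses the same regex as the cell above: the first alternative keeps abbreviations such as U.S.A. together, the second keeps words with internal apostrophes in one piece, and punctuation matches nothing and is therefore left out.

```python
import re

# Same pattern as in the cell above
pattern = r"(?:[A-Z]\.)+|\w+(?:[']\w+)*|\$?\d+(?:\.\d+)?%?"

sample = "The U.S.A. isn't far, she said."
tokens = re.findall(pattern, sample)
print(tokens)
# ['The', 'U.S.A.', "isn't", 'far', 'she', 'said']
```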

Above we imported a list of stopwords. If you wish, you can print them out.


In [None]:
print(stopwords.words('english'))

To remove stopwords from the data, apply the next command. __Task:__ Do the same steps for the other two documents!

Note: the larger the data, the longer it takes to process. On larger texts it can take a couple of minutes to run.

In [None]:
# Build the stopword set once; set lookups are much faster than re-reading the list for every token
stop_words = set(stopwords.words('english'))
bellcat_withoutStopwords = [w for w in bell_the_cat if w.lower() not in stop_words]
ulysses_withoutStopwords = [w for w in ulysses if w.lower() not in stop_words]
hungrycat_withoutStopwords = [w for w in hungry_cat if w.lower() not in stop_words]
print(bellcat_withoutStopwords)
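
The filtering idea can be tried without any downloads. The sketch below uses a tiny hand-picked stopword set (an assumption for illustration; NLTK's English list is much longer) and lowercases each token before the lookup, so that "The" and "the" are treated alike.

```python
# Tiny hypothetical stopword set; NLTK's real list contains many more entries
toy_stopwords = {"the", "a", "and", "is", "to", "for"}

tokens = ["The", "hungry", "cat", "is", "looking", "for", "something", "to", "eat"]

# Keep a token only if its lowercase form is not a stopword
content_tokens = [t for t in tokens if t.lower() not in toy_stopwords]
print(content_tokens)
# ['hungry', 'cat', 'looking', 'something', 'eat']
```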

In [None]:
# Instead of the regex, the code below would equally remove all the punctuation.

#bell_the_cat = [w for w in bell_the_cat if w not in string.punctuation]
#punctCombo = [c+"\"" for c in string.punctuation ]+ ["\""+c for c in string.punctuation ]
#bell_the_cat = [w for w in bell_the_cat if w not in punctCombo]

In [None]:
# If you print out the length of both texts, you will realize that the size was reduced due to sorting out stopwords
print(len(bell_the_cat))
print(len(bellcat_withoutStopwords))

In [None]:
# Use the data, plot, and see your first results. :-)

fdist_catuly = nltk.FreqDist(ulysses_withoutStopwords)
fdist_cat1 = nltk.FreqDist(hungrycat_withoutStopwords)
fdist_cat2 = nltk.FreqDist(bellcat_withoutStopwords)

fdist_catuly.plot(25, cumulative=False)
fdist_cat1.plot(25, cumulative=False)
fdist_cat2.plot(25, cumulative=False)
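
nltk.FreqDist is a subclass of Python's collections.Counter, so the counting itself can be sketched with the standard library alone. The token list below is made up for illustration.

```python
from collections import Counter

tokens = ["cat", "mouse", "cat", "mew", "mew", "mew", "house"]

# Count how often each token occurs
freq = Counter(tokens)

# The two most frequent tokens with their counts
print(freq.most_common(2))
# [('mew', 3), ('cat', 2)]
```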

#### Stemming and Lemmatizing

Stemming and lemmatizing are two similar methods to standardize your data. Stemming reduces words to their stems by defined rules. A lemmatizer needs an additional lexicon for this task, but the results are 'cleaner'. For our purposes a stemmer is perfectly fine.

Reminder: Your tokenized data without stopwords are bellcat_withoutStopwords, ulysses_withoutStopwords, hungrycat_withoutStopwords.

__Task:__ Play with the other documents and the different stemmers!

In [None]:
# You are free to choose your stemmer! Just un/comment the one you want.

def do_stemming(filtered):
    # Create the stemmer once instead of once per token
    stemmer = PorterStemmer()
    #stemmer = LancasterStemmer()
    #stemmer = SnowballStemmer('english')
    return [stemmer.stem(f) for f in filtered]

if __name__ == "__main__":

    print("tokens = %s \n" %(bellcat_withoutStopwords))

    stemmed_tokens = do_stemming(bellcat_withoutStopwords)
    print("stemmed_tokens = %s \n" %stemmed_tokens)

    result = dict(zip(bellcat_withoutStopwords, stemmed_tokens))
    for element, stemmed in result.items():
        print("stemmed token: {} \t {}".format(element, stemmed))

In [None]:
ulysses_stemmed = do_stemming(ulysses_withoutStopwords)
hungrycat_stemmed = do_stemming(hungrycat_withoutStopwords)
bellcat_stemmed = do_stemming(bellcat_withoutStopwords)
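
To make "reducing words by defined rules" concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm, only a toy with three hypothetical rules (`TOY_RULES` and `toy_stem` are made-up names), and it shows why stems are not always real words.

```python
# Hypothetical toy rules: (suffix, replacement), tried in this order
TOY_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters before it."""
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["Cries", "jumping", "cats", "mice"]:
    print(w, "->", toy_stem(w))
# Cries -> cri
# jumping -> jump
# cats -> cat
# mice -> mice
```

Note that "mice" is left untouched: an irregular plural like this is where a lexicon-based lemmatizer would do better than pure suffix rules.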

Borrowing from the scikit-learn library enables us to print the term-document matrix!

__Task:__ The data is not stemmed. What will the matrix look like after stemming?

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

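# Join the tokens of each document back into one string, since CountVectorizer expects raw text
docs = [" ".join(ulysses_withoutStopwords), " ".join(hungrycat_withoutStopwords), " ".join(bellcat_withoutStopwords)]
vec = CountVectorizer()
X = vec.fit_transform(docs)
pd.set_option('display.max_rows', 200)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(X.toarray().transpose(), index=vec.get_feature_names_out())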
print(df)
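
What CountVectorizer computes can be sketched by hand: collect the vocabulary, then count each term per document. The three mini documents below are made up, and the sketch skips CountVectorizer's own lowercasing and tokenization.

```python
from collections import Counter

# Made-up, already tokenized mini documents
docs = {
    "d1": ["cat", "mouse", "cat"],
    "d2": ["mouse", "house"],
    "d3": ["cat", "bell"],
}

# Sorted vocabulary over all documents
vocabulary = sorted({term for tokens in docs.values() for term in tokens})
counts = {name: Counter(tokens) for name, tokens in docs.items()}

# One row per term, one column per document (d1, d2, d3)
matrix = {term: [counts[name][term] for name in docs] for term in vocabulary}
for term, row in matrix.items():
    print(term, row)
# bell [0, 0, 1]
# cat [2, 0, 1]
# house [0, 1, 0]
# mouse [1, 1, 0]
```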

## Term Frequency * Inverse Document Frequency

__Term Frequency _tf(w,d)_:__ number of times a word _w_ appears in a document _d_.

<img src="img/tf.png" align="left" title="Source: http://www.akbarian.org/notes/text-mining-nlp-python/"/>

<img src="img/idf.png" align="left" title="Source: http://www.akbarian.org/notes/text-mining-nlp-python/"/>
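
For reference, the weighting used by the code in this section can be written out as follows. This is a sketch of the variant that scikit-learn's TfidfTransformer applies with smooth_idf=False; the formulas in the images may use the plain textbook form without the added 1. Here _N_ is the number of documents and _df(w)_ is the number of documents that contain _w_; both symbols are introduced here for illustration.

$$
\mathrm{idf}(w) = \ln\frac{N}{\mathrm{df}(w)} + 1, \qquad
\mathrm{tfidf}(w,d) = \mathrm{tf}(w,d)\cdot\mathrm{idf}(w)
$$

With the default settings, each document vector is then scaled to unit Euclidean length. A word that occurs in every document gets idf = 1, so it is down-weighted relative to rarer words but not removed.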


In [None]:
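from sklearn.feature_extraction.text import TfidfTransformer
# smooth_idf=False: no add-one smoothing of the document frequencies
transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(X)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtfidf = pd.DataFrame(tfidf.toarray().transpose(), index=vec.get_feature_names_out())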
print(dtfidf)
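
The same computation can be followed by hand on a tiny example. The sketch below assumes the formula idf = ln(N / df) + 1 followed by L2 normalization, which is what TfidfTransformer(smooth_idf=False) does with its default settings. The count matrix and the helper name `tfidf_rows` are made up for illustration, and the sketch assumes every term occurs in at least one document. Unlike the DataFrame above, rows are documents and columns are terms here.

```python
import math

def tfidf_rows(count_rows):
    """tf * idf per document row: idf = ln(N / df) + 1, then L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df[j]) + 1 for j in range(n_terms)]
    result = []
    for row in count_rows:
        weighted = [tf * idf[j] for j, tf in enumerate(row)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Rows are documents, columns are the terms "cat", "mouse", "bell"
counts = [
    [2, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
]
rows = tfidf_rows(counts)
for row in rows:
    print([round(v, 3) for v in row])
```

In the third document, "bell" ends up with a higher weight than "cat" although both occur once, because "bell" appears in only one document while "cat" appears in all three.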

##### Some further references:

https://github.com/edbullen/nltk

http://www.nltk.org/

http://www.akbarian.org/notes/text-mining-nlp-python/

http://scikit-learn.org/stable/modules/feature_extraction.html
