**Date of Creation**: Jul 21, 2020<br>
**Author**: Karina Chiñas Fuentes

---

# Basic usages of `nltk`

This notebook shows 11 different ways to use `nltk` functions to study the basic components of a text, in this case part of the novel *Moby Dick*.

1. Total of tokens
2. Total of unique tokens
3. Change in unique tokens after lemmatizing the verbs
4. Lexical diversity of the text (i.e. ratio of unique tokens to the total number of tokens)
5. Comparing capitalized words with non-capitalized words (i.e. 'whale' vs 'Whale')
6. Finding the most frequently occurring (unique) tokens in the text and their frequency
7. Specific characteristics in tokens
8. The longest word in the text and its length
9. Unique words with a frequency of more than 2000 and their frequency values
10. The average number of tokens per sentence
11. The 5 most frequent parts of speech in this text and their frequency

In [1]:
import nltk
import pandas as pd
import numpy as np

# If you would like to work with the raw text you can use 'moby_raw'
with open('Data/moby.txt', 'r') as f:
    moby_raw = f.read()
    
# If you would like to work with the novel in nltk.Text format,
# you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)

### 1. Total of tokens
This is how to find the total number of tokens (words and punctuation symbols) in the text.

In [2]:
len(nltk.word_tokenize(moby_raw)),len(text1) # both methods are valid

(255038, 255038)

### 2. Total of unique tokens

In [3]:
len(set(nltk.word_tokenize(moby_raw))),len(set(text1)) # both methods are valid

(20742, 20742)

### 3. Change in unique tokens after lemmatizing the verbs

In [4]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

len(set(lemmatized))

16887

### 4. Lexical diversity of the text (i.e. ratio of unique tokens to the total number of tokens)

In [5]:
len(set(moby_tokens))/len(moby_tokens)

0.08132905684643073
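
The ratio can also be wrapped in a small helper. This is a minimal sketch: the name `lexical_diversity` and the toy token list are assumptions made for illustration, not part of the original notebook, and the guard for an empty list is an added convenience.

```python
def lexical_diversity(tokens):
    """Return the ratio of unique tokens to total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy sample standing in for `moby_tokens` (hypothetical data).
sample_tokens = ["the", "whale", "and", "the", "sea", ",", "the", "Whale"]

diversity = lexical_diversity(sample_tokens)
print(diversity)
```

Note that 'whale' and 'Whale' count as two distinct tokens here, just as they do in the cell above.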

### 5. Comparing capitalized words with non-capitalized words (i.e. 'whale' vs 'Whale')

Have a look at the [documentation](http://www.nltk.org/api/nltk.html?highlight=freqdist) of `FreqDist`. The cell below computes the percentage of all tokens that are either 'whale' or 'Whale'.

In [6]:
from nltk.probability import FreqDist
    
dist = FreqDist(moby_tokens)
((dist["whale"]+dist["Whale"])/len(text1))*100 

0.41248755087477157
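
To compare the two spellings side by side rather than only summing them, the counts can be reported separately. The sketch below uses `collections.Counter`, which `FreqDist` subclasses, so it runs without the novel file. The toy token list and all variable names are assumptions made for illustration.

```python
from collections import Counter

# Toy sample standing in for `moby_tokens` (hypothetical data).
sample_tokens = ["The", "Whale", "saw", "a", "whale", ";",
                 "the", "whale", "dived", "."]

counts = Counter(sample_tokens)

# Case-sensitive counts, reported separately.
lower_count = counts["whale"]
capital_count = counts["Whale"]

# Combined share of all tokens, as a percentage.
combined_pct = (lower_count + capital_count) / len(sample_tokens) * 100

# Case-insensitive alternative: fold every token to lower case first.
folded = Counter(token.lower() for token in sample_tokens)

print(lower_count, capital_count, combined_pct, folded["whale"])
```

Folding to lower case merges every capitalized variant, which may or may not be desirable: proper nouns such as 'Pequod' would be folded too.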

### 6. Finding the most frequently occurring (unique) tokens in the text and their frequency

For example, the first 20.

In [7]:
def unique_tokens():
    from nltk.probability import FreqDist
    
    x = dict(FreqDist(moby_tokens))
    # Sort the dictionary by value (frequency), highest first.
    dic = {k: v for k, v in sorted(x.items(), key=lambda item: item[1],reverse=True)}
    
    tokens = list(dic.keys())
    freq = list(dic.values())
    
    return [(tokens[_],freq[_]) for _ in range(20)]

unique_tokens()

[(',', 19204),
 ('the', 13715),
 ('.', 7306),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2113),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]
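
Because `FreqDist` subclasses Python's `collections.Counter`, the same sorted list can be obtained with `most_common`. The sketch below uses a plain `Counter` and a toy token list so it runs on its own. The helper name `top_tokens` is hypothetical.

```python
from collections import Counter

# Toy sample standing in for `moby_tokens` (hypothetical data).
sample_tokens = ["the", ",", "whale", "the", ",", "the", "sea", "."]

def top_tokens(tokens, n=20):
    """Return the n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)

top = top_tokens(sample_tokens, n=2)
print(top)
```

In the notebook itself, `FreqDist(moby_tokens).most_common(20)` should give the same kind of result without building and sorting a dictionary by hand.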

### 7. Specific characteristics in tokens

For example, let us find the tokens that have a length greater than 5 and a frequency of more than 150.

In [8]:
from nltk.probability import FreqDist
    
dist = FreqDist(moby_tokens)
vocab = dist.keys()
freqwords = [w for w in vocab if len(w) > 5 and dist[w] > 150]
    
sorted(freqwords) 

['Captain',
 'Pequod',
 'Queequeg',
 'Starbuck',
 'almost',
 'before',
 'himself',
 'little',
 'seemed',
 'should',
 'though',
 'through',
 'whales',
 'without']

### 8. The longest word in the text and its length

In [9]:
l = max( len(x) for x in text1)
w = [x for x in text1 if len(x) == l]
(w[0],l)

("twelve-o'clock-at-night", 23)
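
The cell above scans the tokens twice. A single pass with `max` and `key=len` returns the first token of maximal length. This is a sketch on a toy token list, whose contents are an assumption made for illustration.

```python
# Toy sample standing in for `text1` (hypothetical data).
sample_tokens = ["Call", "me", "Ishmael", ".", "twelve-o'clock-at-night"]

# max() with key=len compares tokens by length and keeps the first longest one.
longest = max(sample_tokens, key=len)
print((longest, len(longest)))
```

If several tokens share the maximum length and all of them are needed, the two-step list comprehension from the cell above is still the way to collect them.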

### 9. Unique words with a frequency of more than 2000 and their frequency values

**Hint**: [`isalpha()`](https://www.geeksforgeeks.org/python-string-isalpha-application/) is used to check if the token is a word and not punctuation.

In [10]:
def unique_words_and_frequency():
    from nltk.probability import FreqDist
    import pandas as pd
    
    dist = FreqDist(moby_tokens)
    vocab = dist.keys()
    freq_words  = [w for w in vocab if dist[w] > 2000 and w.isalpha()]
    frequencies = [dist[w] for w in freq_words]
    
    df = (pd.DataFrame([freq_words,frequencies])
          .T.rename(columns={0:"words",1:"freq"})
          .sort_values(by='freq', ascending=False)
          .reset_index())
    
    return [(df["freq"][_],df["words"][_]) for _ in range(len(df))]

unique_words_and_frequency()

[(13715, 'the'),
 (6513, 'of'),
 (6010, 'and'),
 (4545, 'a'),
 (4515, 'to'),
 (3908, 'in'),
 (2978, 'that'),
 (2459, 'his'),
 (2196, 'it'),
 (2113, 'I')]
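
The same filter can be written without pandas. The following sketch uses `collections.Counter` and a toy token list. The helper name `frequent_words` and its `threshold` parameter are assumptions made for illustration.

```python
from collections import Counter

def frequent_words(tokens, threshold):
    """Return (count, word) pairs for alphabetic tokens above the threshold,
    sorted by descending count."""
    counts = Counter(tokens)
    pairs = [(count, word) for word, count in counts.items()
             if count > threshold and word.isalpha()]
    return sorted(pairs, reverse=True)

# Toy sample standing in for `moby_tokens` (hypothetical data).
sample_tokens = ["the"] * 3 + [","] * 4 + ["of"] * 2 + ["sea"]

result = frequent_words(sample_tokens, threshold=1)
print(result)
```

The comma is the most frequent token in the sample, but `isalpha()` filters it out, which mirrors the hint above.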

### 10. The average number of tokens per sentence

In [11]:
sentences = nltk.sent_tokenize(moby_raw)
num_tps = [ len(nltk.word_tokenize(s)) for s in sentences]
    
print("The average number of tokens per sentence is: ",round(sum(num_tps)/len(sentences),0))

The average number of tokens per sentence is:  26.0
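
The arithmetic behind this figure is a plain mean of per-sentence token counts. The sketch below shows it on sentences that are already tokenized, so it needs no NLTK data. The sample sentences and variable names are hypothetical.

```python
from statistics import mean

# Hypothetical sentences, already split into tokens (punctuation included).
tokenized_sentences = [
    ["Call", "me", "Ishmael", "."],
    ["It", "was", "a", "damp", ",", "drizzly", "November", "."],
]

# One count per sentence, then the arithmetic mean of those counts.
tokens_per_sentence = [len(sentence) for sentence in tokenized_sentences]
average = mean(tokens_per_sentence)
print(average)
```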


### 11. The 5 most frequent parts of speech in this text and their frequency

<img src="word_classes.png">

In [14]:
def pos_frequency(top = 5):
    """
    Argument:
    top -- integer: number of leading entries to return.
    
    Returns
    pos_list -- list:  list of tuples of the form (part_of_speech, frequency) sorted 
                       in descending order of frequency.
    """
    
    unique_tokens_frequency = unique_tokens()
    
    words_freq = [wf for wf in unique_tokens_frequency if wf[0].isalpha()]
    # Tag only the `top` most frequent words (they are tagged out of context).
    POS = nltk.pos_tag([words_freq[w][0] for w in range(top)])
    
    pos_list = [(POS[i][1],words_freq[i][1]) for i in range(top)]

    return pos_list

pos_frequency()

[('DT', 13715), ('IN', 6513), ('CC', 6010), ('DT', 4545), ('TO', 4515)]
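
The function above tags only the most frequent words, so a tag such as 'DT' can appear twice in the result. Another reading of the heading is to count the tags of every token in the text. The sketch below shows that counting step on a small, hand-tagged list in the `(word, tag)` shape that `nltk.pos_tag` returns. The sample data and the helper name `top_pos` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical pre-tagged tokens, shaped like the output of nltk.pos_tag.
tagged = [
    ("the", "DT"), ("whale", "NN"), ("of", "IN"), ("the", "DT"),
    ("sea", "NN"), ("swam", "VBD"), ("in", "IN"), ("a", "DT"),
]

def top_pos(tagged_tokens, top=5):
    """Return the `top` most frequent tags as (tag, count) pairs."""
    return Counter(tag for _, tag in tagged_tokens).most_common(top)

result = top_pos(tagged, top=3)
print(result)
```

In the notebook, the input would presumably be `nltk.pos_tag(moby_tokens)`, which tags each token in context and may take a while on the full novel.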