
# Term Frequency

Term frequency is the number of times each word occurs in a text. We use NLTK's FreqDist class to calculate the frequency distribution of any given text.

The FreqDist class works like a dictionary: the keys are the words in the text and the values are the counts associated with those words.
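
As a small illustration of this dictionary-like behaviour, the sketch below uses `collections.Counter` from the standard library, which `FreqDist` subclasses in NLTK 3. The token list is invented for the example, so the counts say nothing about any real corpus.

```python
from collections import Counter

# Toy token list, invented for illustration only.
tokens = ["the", "king", "said", "the", "little", "king", "the"]

# FreqDist(tokens) would support the same lookups shown here.
counts = Counter(tokens)

print(counts["the"])          # count stored under the key "the"
print(counts["queen"])        # unseen words report 0 instead of raising KeyError
print(counts.most_common(2))  # (word, count) pairs, most frequent first
```

The same three calls work unchanged on a `FreqDist` built from a real token list.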

In [0]:
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords

In [6]:
nltk.download('stopwords')  # fetch the stopword lists (needed once per environment)
stopwords = set(stopwords.words('english'))  # note: rebinds the imported name to a set

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
# nltk.download('gutenberg')  # uncomment on first run to fetch the corpus
words = nltk.Text(nltk.corpus.gutenberg.words('bryant-stories.txt'))

In [0]:
# Lowercase the tokens, keeping only alphabetic ones, then drop stopwords
words = [word.lower() for word in words if word.isalpha()]
words = [word for word in words if word not in stopwords]

In [0]:
fDist = FreqDist(words)

In [26]:
for x,v in fDist.most_common(10):
  print(x,':',v)

little : 597
said : 453
came : 191
one : 183
could : 158
king : 141
went : 122
would : 112
great : 110
day : 107


In [29]:
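import numpy as np
# Note: len(fDist) is the number of distinct words, so each value is
# count / vocabulary size; use fDist.freq(x) for the share of all tokens.
for x,v in fDist.most_common(10):
  print(x,':',np.round(v/len(fDist),decimals=4))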

little : 0.1619
said : 0.1228
came : 0.0518
one : 0.0496
could : 0.0428
king : 0.0382
went : 0.0331
would : 0.0304
great : 0.0298
day : 0.029
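
The ratios above divide each count by `len(fDist)`, which is the number of distinct words rather than the total number of tokens, so they do not sum to 1 across the vocabulary. A minimal sketch of the difference on a made-up token list follows. It uses `collections.Counter` to stay free of corpus downloads. With a real `FreqDist`, `fDist.N()` gives the total token count and `fDist.freq(word)` the corresponding share.

```python
from collections import Counter

# Toy token list, invented for illustration only.
tokens = ["little", "king", "little", "said", "little", "king"]
counts = Counter(tokens)

n_tokens = sum(counts.values())  # total number of tokens
n_types = len(counts)            # number of distinct words

for word, count in counts.most_common():
    share_of_tokens = count / n_tokens  # relative frequency; sums to 1 over all words
    count_per_type = count / n_types    # what dividing by len(...) computes
    print(word, round(share_of_tokens, 4), round(count_per_type, 4))
```

Either normalisation preserves the ranking of the words; only the first can be read as a proportion of the text.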


# Finding the Type-Token Ratio (TTR)

The type-token ratio (TTR) is a measure of vocabulary variation within a written text or a
person’s speech. Below, the TTRs of two real-world texts are calculated and compared.
The TTR is a helpful measure of lexical variety within a text, and it can be used
to monitor changes in children and adults with vocabulary difficulties.

Suppose we count the words in a short passage of speech and find 87. The number of words in a text is often referred
to as the number of tokens. However, several of these tokens are repeated. For example, the token
"again" might occur two times, the token "are" three times, and the token "and" five times.

Say that among the 87 tokens there are 62 so-called types (distinct words). The relationship
between the number of types and the number of tokens is known as the type-token ratio (TTR). In this case the TTR is 62/87 ≈ 0.71.
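
The arithmetic can be sketched as a small helper. `type_token_ratio` is a hypothetical name introduced here for illustration, not an NLTK function, and the sample sentence is invented.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens (types) divided by the total number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty token list")
    return len(set(tokens)) / len(tokens)

# The worked example from the text: 62 types among 87 tokens.
print(round(62 / 87, 4))  # 0.7126

# A toy sentence: 8 tokens, of which 5 are distinct.
tokens = "the cat and the dog and the bird".split()
print(type_token_ratio(tokens))  # 0.625
```

A TTR of 1.0 means no word is ever repeated; values closer to 0 mean heavier repetition.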

In [0]:
import nltk
from nltk.corpus import stopwords

In [3]:
nltk.download('stopwords')
stopwords = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [5]:
nltk.download('gutenberg')
words_bryant = nltk.Text(nltk.corpus.gutenberg.words('bryant-stories.txt'))
words_emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


In [0]:
words_bryant = [word.lower() for word in words_bryant if word.isalpha()]
words_emma = [word.lower() for word in words_emma if word.isalpha()]

In [0]:
# Drop stopwords, then keep the first 15,000 tokens so both samples have equal length
words_bryant = [word for word in words_bryant if word not in stopwords][:15000]
words_emma = [word for word in words_emma if word not in stopwords][:15000]
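
Both token lists are cut to the same length because TTR generally falls as a text gets longer: frequent words keep recurring while new types appear more slowly. Comparing texts of different lengths would therefore favour the shorter one. A toy sketch with an invented sentence shows the effect on growing prefixes of the same text.

```python
# Toy sentence, invented for illustration only.
tokens = "the cat sat on the mat the cat ran".split()

ratios = []
for n in (3, 6, 9):
    prefix = tokens[:n]                  # first n tokens of the same text
    ttr = len(set(prefix)) / len(prefix)
    ratios.append(round(ttr, 4))
    print(n, round(ttr, 4))
```

The ratio drops with each longer prefix even though the text itself has not changed, which is why the two corpora are compared at a fixed token count.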

In [0]:
import numpy as np
TTR_bryant = np.round(len(set(words_bryant))/len(words_bryant),decimals=4)
TTR_emma = np.round(len(set(words_emma))/len(words_emma), decimals= 4)

In [23]:
print('TTR_bryant = {}\nTotal_words = {}\nVocabulary_Count = {}'.format(TTR_bryant,len(words_bryant),len(set(words_bryant))), end= '\n\n')

print('TTR_emma = {}\nTotal_words = {}\nVocabulary_Count = {}'.format(TTR_emma,len(words_emma),len(set(words_emma))))

TTR_bryant = 0.1864
Total_words = 15000
Vocabulary_Count = 2796

TTR_emma = 0.2183
Total_words = 15000
Vocabulary_Count = 3274
