<a href="https://colab.research.google.com/github/MK316/mynltkdata/blob/main/Lexical_diversity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Lexical Diversity (LD)]("https://textinspector.com/help/lexical-diversity/")
- Lexical diversity index: It is a measurement of how many different lexical words (N, V, Adv, Advj) are found in a given text.
- LD is considered as one of the important indicators of how complex and difficult to read a given text is.

[Duran et al. 2004]("https://psycnet.apa.org/record/2004-95315-004"): “…lexical diversity is about more than vocabulary range.  Alternative terms, ‘flexibility’, ‘vocabulary richness’, ‘verbal creativity’, or ‘lexical range and balance’ indicate that it has to do with how vocabulary is deployed as well as how large the vocabulary might be.” 
[D scale from (Duran, Malvern, Richards, Chipere 2004:238)] ("https://textinspector.com/wp-content/uploads/2020/12/VocD-comparison-1.jpeg")

# Package to install and import

In [None]:
!pip install lexical-diversity

In [2]:
from lexical_diversity import lex_div as ld

# Paste your text:

In [25]:
text = """A Dove saw an Ant fall into a brook. The Ant struggled in vain to reach the bank, \
and in pity, the Dove dropped a blade of straw close beside it. Clinging to the straw like a shipwrecked sailor to a broken spar, \
the Ant floated safely to shore. Soon after, the Ant saw a man getting ready to kill the Dove with a stone. \
But just as he cast the stone, the Ant stung him in the heel, so that the pain made him miss his aim, \
and the startled Dove flew to safety in a distant wood.
"""

In [32]:
# or upload your text file here
from google.colab import files
text = files.upload(); print(type(text))
text = str(text)

Saving DoveAndAnt.txt to DoveAndAnt (8).txt
<class 'dict'>


In [30]:
tok = ld.tokenize(text)

In [36]:
print('First 10 tokenized words:', tok[:10])
print('Total number of tokenized words:', len(tok))

First 10 tokenized words: ['{doveandanttxt', 'ba', 'dove', 'saw', 'an', 'ant', 'fall', 'into', 'a', 'brook']
Total number of tokenized words: 100


# Lemmatize using flemmatize():

- Captial to lower case
- Contraction forms => apostrphe deletion
- Tense: saw > see
- Plurality: friends > friend, children > child

  e.g., don't > dont, John's > johns, wanna > wanna

In [None]:
flt = ld.flemmatize(text)
flt[:10]

In [55]:
ld.flemmatize("hasn't")

['hasnt']

# Lexical Diversity indices:

## [1] TTR (Type-Token ratio) = Type / Token: 
e.g., unique words = 400, Token = 1,000
TTR = 400 / 1,000 = 0.4

- It is a measure of vocabulary ***variation*** within a written text or a speech

- The closer the TTR ratio is to 1, the greater the lexical richness of the segment.

* Note: TTR values vary in accordance with the length of the text. That is, TTR’s are not comparable unless they are based on texts of the same length. 

=> STTR (Standardized TTR): You can get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with less than 1,000 words will get a standardised type/token ratio of 0.)

References
Johnson (1939) Language and Speech Hygiene: an Application of General Semantics: Outline of a Course (Chicago).

In [48]:
ld.ttr(flt)

0.62

## [2] The Root TTR:
Guiraud (1959) Probl&egrave;mes et m&eacute;thodes de la statistique linguistique (Dordrecht).

In [51]:
ld.root_ttr(flt)

6.2

## [3] Log TTR:
Chotlos (1944); Herdan (1960)

In [52]:
ld.log_ttr(flt)

0.896195844749127

## [4] MASS TTR

In [53]:
ld.maas_ttr(flt)

0.05190207762543653

## [5] the mean segmental type-token ratio (MSTTR):
=> the average of the TTR’s of several consecutive equal-sized samples.

Johnson (1944), which

In [50]:
# Default window length = 50
ld.msttr(flt)
ld.msttr(flt,window_length=1000)

0.62

## [6] Moving average TTR (MATTR)

In [54]:
ld.mattr(flt)
ld.mattr(flt,window_length=25)

0.7868421052631575

## [7] Hypergeometric distribution D (HDD)

A more straightforward and reliable implementation of vocD (Malvern, Richards, Chipere, & Duran, 2004) as per McCarthy and Jarvis (2007, 2010).

In [None]:
ld.hdd(flt)

## [8] Measure of lexical textual diversity (MTLD)

Calculates MTLD based on McCarthy and Jarvis (2010).

In [None]:
ld.mtld(flt)

## [9] the bilogarithmic type-token ratio:
 (Log TTR, Chotlos, 1944; Herdan, 1960); and

Measure of lexical textual diversity (moving average, wrap)
Calculates MTLD using a moving window approach. Instead of calculating partial factors, it wraps to the beginning of the text to complete the last factors.

In [None]:
ld.mtld_ma_wrap(flt)

## [10] Measure of lexical textual diversity (moving average, bi-directional)
Calculates the average MTLD score by calculating MTLD in each direction using a moving window approach.

In [None]:
ld.mtld_ma_bid(flt)