<a href="https://colab.research.google.com/github/MK316/mynltkdata/blob/main/Lexical_diversity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Lexical Diversity (LD)](https://textinspector.com/help/lexical-diversity/)
- Lexical diversity index: a measure of how many different lexical words (N, V, Adv, Adj) are found in a given text.
- LD is considered one of the important indicators of how complex and difficult to read a given text is.

[Duran et al. 2004](https://psycnet.apa.org/record/2004-95315-004): “…lexical diversity is about more than vocabulary range. Alternative terms, ‘flexibility’, ‘vocabulary richness’, ‘verbal creativity’, or ‘lexical range and balance’ indicate that it has to do with how vocabulary is deployed as well as how large the vocabulary might be.”
[D scale from (Duran, Malvern, Richards, Chipere 2004:238)](https://textinspector.com/wp-content/uploads/2020/12/VocD-comparison-1.jpeg)

# Package to install and import

In [8]:
!pip install lexical-diversity

Collecting lexical-diversity
  Downloading lexical_diversity-0.1.1-py3-none-any.whl (117 kB)
Installing collected packages: lexical-diversity
Successful

In [9]:
from lexical_diversity import lex_div as ld

## Paste your text: two texts to compare

e.g., 
text1: Dove and Ant (Aesop fable; 478 characters, 100 tokens), TTR 0.61, Root TTR 6.1
text2: National Geographic reading, level 4 (3,381 characters), TTR 0.457, Root TTR 11.0

In [6]:
text1 = """A Dove saw an Ant fall into a brook. The Ant struggled in vain to reach the bank, \
and in pity, the Dove dropped a blade of straw close beside it. Clinging to the straw like a shipwrecked sailor to a broken spar, \
the Ant floated safely to shore. Soon after, the Ant saw a man getting ready to kill the Dove with a stone. \
But just as he cast the stone, the Ant stung him in the heel, so that the pain made him miss his aim, \
and the startled Dove flew to safety in a distant wood.
"""

In [13]:
text2 = """Living light
The ability of some species to create light, known as bioluminescence, is both magical and commonplace. Magical, because of its glimmering beauty. Commonplace, because many life forms can do it. On land the most familiar examples are fireflies, flashing to attract mates on a warm summer night. But there are other luminous land organisms, including glow-worms, millipedes, and some ninety species of fungus. Even some birds, such as the Atlantic puffin, have beaks that glow in the dark.
But the real biological light show takes place in the sea. Here, an astonishing number of beings can make light. Some, such as ostracods, are like ocean fireflies, using flashes of light to attract mate. There are also glowing bacteria, and light-making fish, squid, and jellyfish. Indeed, of all the groups of organisms known to make light, more than four-fifths live in the ocean.
As a place to live, the ocean has a couple of peculiarities. Firstly, there is almost nowhere to hide, so being invisible is very important. Secondly, as you descend, sunlight disappears. At first, red light is absorbed. Then the yellow and green parts of the spectrum disappear, leaving just the blue. At two-hundred meters below the surface, the ocean becomes a kind of perpetual twilight, and at six-hundred meters the blue fades out too. In fact, most of the ocean is as black as the night sky. These factors make light uniquely useful as a weapon or a veil.
Hiding with light
In the ocean’s upper layers, where light penetrates, creatures need to blend in to survive. Any life form that stands out is in danger of being spotted by predators, especially those swimming below, looking up. Many life forms solve this problem by avoiding the light zone during the day. Others, such as jellyfish and swimming snails, are transparent, ghostlike creatures, almost impossible to see.
Other sea species use light to survive in the upper layers, but how? Some, such as certain shrimp and squid, illuminate their bellies to match the light coming from above. This allows them to become invisible to predators below. Their light can be turned on and off at will, some even have a dimmer switch. For example, certain types of shrimp can alter how much light they give off, depending on the brightness of the water around them. If a cloud passes overhead and briefly blocks the light, the shrimp will dim itself accordingly.
But if the aim is to remain invisible, why do some creatures light up when they are touched, or when the water nearby is disturbed? A couple of reasons. First, a sudden burst of light may startle a predator, giving the prey a chance to escape. Some kinds of deep-sea squid, for example, give a big squirt of light before darting off into the gloom.
Second, there is the principle of ‘the enemy of my enemy is my friend.’ Giving off light can help summon the predator of your predator. Known as the burglar alarm effect, this is especially useful for tiny life forms, such as dinoflagellates, that cannot swim fast. For such small beings, water is too viscous to allow a quick getaway, it would be like trying to swim through syrup. Instead, when threatened by a shrimp, for example, these organisms light up. The flashes attract larger fish that are better able to spot, and eat, the shrimp. The chief defense for these tiny organisms is therefore not fight or flight, but light.
"""

In [14]:
# len() on a string counts characters, not tokens
print(len(text1))
print(len(text2))

478
3381
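
The two numbers above are character counts, because `len()` applied to a string counts characters rather than words. A minimal sketch of the difference, using a made-up sample sentence and whitespace splitting as a rough stand-in for a real tokenizer (both are assumptions for illustration):

```python
# Hypothetical sample sentence, not one of the texts above
sample = "A Dove saw an Ant fall into a brook."

n_chars = len(sample)          # characters, including spaces and punctuation
n_words = len(sample.split())  # rough word count from whitespace splitting

print(n_chars, n_words)
```

Token counts for the two texts come from the tokenizer output further down, not from `len(text1)`.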


## Or upload texts on colab:

In [42]:
# # or upload your text file here
# from google.colab import files
# text = files.upload(); print(type(text))

In [43]:
# # Load your file on colab and then read it

# with open('Ch02.txt') as file:
#     text = file.read()

In [15]:
tok1 = ld.tokenize(text1)
tok2 = ld.tokenize(text2)

In [16]:
print('First 10 tokenized words:', tok1[:10])
print('Total number of tokenized words:', len(tok1))

First 10 tokenized words: ['a', 'dove', 'saw', 'an', 'ant', 'fall', 'into', 'a', 'brook', 'the']
Total number of tokenized words: 100


# Lemmatize using flemmatize():

- Capital to lower case
- Contraction forms => apostrophe deletion
- Tense: saw > see
- Plurality: friends > friend, children > child

  e.g., don't > dont, John's > johns, wanna > wanna
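
The rules listed above can be imitated with a toy lookup lemmatizer. This is only a sketch under the assumption of a dictionary-lookup approach; `toy_lemmatize` and `TOY_LEMMAS` are hypothetical names, and the tiny dictionary is made up for illustration. It is not how `flemmatize()` is implemented.

```python
import re

# Hypothetical mini lemma dictionary, for illustration only
TOY_LEMMAS = {"saw": "see", "friends": "friend", "children": "child", "an": "a"}

def toy_lemmatize(text):
    """Lowercase, drop apostrophes, keep letter runs, then look up each word."""
    cleaned = text.lower().replace("'", "")
    words = re.findall(r"[a-z]+", cleaned)
    # Words missing from the dictionary are returned unchanged
    return [TOY_LEMMAS.get(word, word) for word in words]

print(toy_lemmatize("John's children saw friends"))
print(toy_lemmatize("don't"))
```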

In [17]:
flt1 = ld.flemmatize(text1)
flt2 = ld.flemmatize(text2)

# Show the items at index 1 to 9 of text1 (index 0 is skipped)
print(flt1[1:10])

# See how [hasn't] is lemmatized:
ld.flemmatize("hasn't")

['dove', 'see', 'a', 'ant', 'fall', 'into', 'a', 'brook', 'the']


['hasnt']

# Lexical Diversity indices:

## [1] TTR (Type-Token ratio) = Type / Token: 

e.g., Type (unique words) = 400, Token = 1,000
TTR = 400 / 1,000 = 0.4

- It is a measure of vocabulary ***variation*** within a written text or a speech.

- The closer the TTR is to 1, the greater the lexical richness of the segment.

* Note: TTR values vary with the length of the text. That is, TTRs are not comparable unless they are based on texts of the same length. 

=> STTR (Standardized TTR): an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with fewer than 1,000 words get a standardized type/token ratio of 0.)
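
The length effect described in the note can be seen in a hand-rolled TTR. The sketch below uses a hypothetical helper `type_token_ratio` and a made-up token list; repeating the same words adds tokens but no new types, so the ratio drops.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical toy token list
short = "the ant saw the dove".split()
# Same vocabulary, four times as many tokens
longer = short * 4

print(type_token_ratio(short))   # 4 types over 5 tokens
print(type_token_ratio(longer))  # 4 types over 20 tokens
```

This is why the measures in the following sections either rescale the ratio or compute it over fixed-size segments.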

References
Johnson (1939) Language and Speech Hygiene: an Application of General Semantics: Outline of a Course (Chicago).

In [23]:
ttr1 = ld.ttr(flt1)
ttr2 = ld.ttr(flt2)
print(ttr1, ttr2)

0.61 0.45689655172413796


## [2] The Root TTR:
Guiraud (1959) Problèmes et méthodes de la statistique linguistique (Dordrecht).
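
Root TTR divides the number of types by the square root of the number of tokens, which dampens the length effect of plain TTR. A minimal sketch with a hypothetical helper `root_ttr_sketch` and a synthetic token list built to have 61 types in 100 tokens:

```python
import math

def root_ttr_sketch(tokens):
    """Types divided by the square root of the token count."""
    return len(set(tokens)) / math.sqrt(len(tokens)) if tokens else 0.0

# Synthetic list: 61 distinct tokens plus 39 repeats of the first one
tokens = [f"w{i}" for i in range(61)] + ["w0"] * 39

print(root_ttr_sketch(tokens))
```

With 100 tokens the square root is 10, so a TTR of 0.61 corresponds to a Root TTR of about 6.1, which matches the text1 figures reported in this notebook.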

In [25]:
rttr1 = ld.root_ttr(flt1)
rttr2 = ld.root_ttr(flt2)
print(rttr1,rttr2)

6.1 11.003526080620546


## [3] Log TTR:
Chotlos (1944); Herdan (1960)
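
Log TTR replaces both counts with their logarithms: log(types) / log(tokens). A minimal sketch, with `log_ttr_sketch` as a hypothetical helper and a synthetic token list:

```python
import math

def log_ttr_sketch(tokens):
    """log(types) / log(tokens); the log base cancels out in the ratio."""
    n_tokens = len(tokens)
    if n_tokens < 2:
        return 0.0  # log(1) is 0, so the ratio is undefined for one token
    return math.log(len(set(tokens))) / math.log(n_tokens)

# Synthetic list: 10 types, each repeated 10 times (100 tokens)
tokens = [f"w{i}" for i in range(10)] * 10

print(log_ttr_sketch(tokens))
```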

In [26]:
logttr1 = ld.log_ttr(flt1)
logttr2 = ld.log_ttr(flt2)

## [4] Maas TTR
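
The Maas index is usually given as (log N - log V) / (log N)^2, where N is the number of tokens and V the number of types. Unlike the other ratios here, lower values indicate more diversity. The sketch below assumes base-10 logarithms (the base changes the value, so this is an assumption about convention), and `maas_sketch` is a hypothetical helper.

```python
import math

def maas_sketch(tokens):
    """(log N - log V) / (log N)^2 with base-10 logs (assumed convention)."""
    n_tokens = len(tokens)
    if n_tokens < 2:
        return 0.0  # log10(1) is 0, which would divide by zero
    log_n = math.log10(n_tokens)
    log_v = math.log10(len(set(tokens)))
    return (log_n - log_v) / log_n ** 2

# Synthetic list: 10 types, each repeated 10 times (100 tokens)
tokens = [f"w{i}" for i in range(10)] * 10

print(maas_sketch(tokens))
```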

In [28]:
maas1 = ld.maas_ttr(flt1)
maas2 = ld.maas_ttr(flt2)

## [5] The mean segmental type-token ratio (MSTTR):
=> the average of the TTRs of several consecutive equal-sized samples.

Johnson (1944)
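
The segment averaging can be sketched as follows. `msttr_sketch` is a hypothetical helper; dropping the leftover tokens and returning 0.0 for texts shorter than one segment are conventions chosen for this sketch, and the package may handle those cases differently.

```python
def msttr_sketch(tokens, window_length=50):
    """Mean TTR over consecutive, non-overlapping segments of equal size."""
    n_segments = len(tokens) // window_length
    if n_segments == 0:
        return 0.0  # sketch convention: no complete segment available
    ratios = []
    for i in range(n_segments):
        segment = tokens[i * window_length:(i + 1) * window_length]
        ratios.append(len(set(segment)) / len(segment))
    # Tokens after the last complete segment are ignored
    return sum(ratios) / len(ratios)

# Hypothetical toy list: two segments of four tokens each
tokens = "a b c d a a b b".split()

print(msttr_sketch(tokens, window_length=4))
```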

In [29]:
# Default window length = 50
fdttr1 = ld.msttr(flt1)
# A custom segment size can be passed with window_length
fdttr2 = ld.msttr(flt2, window_length=1000)

## [6] Moving average TTR (MATTR)
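
MATTR slides a fixed-size window through the text one token at a time and averages the TTR of every window, so all tokens contribute and no segment boundary is arbitrary. A minimal sketch; `mattr_sketch` is a hypothetical helper, and falling back to plain TTR for texts shorter than the window is an assumption of this sketch.

```python
def mattr_sketch(tokens, window_length=50):
    """Mean TTR over all overlapping windows of a fixed size."""
    if not tokens:
        return 0.0
    if len(tokens) < window_length:
        # Sketch convention: text shorter than one window, use plain TTR
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[start:start + window_length])) / window_length
        for start in range(len(tokens) - window_length + 1)
    ]
    return sum(ratios) / len(ratios)

# Hypothetical toy list: windows are "a b a", "b a b", "a b c"
tokens = "a b a b c".split()

print(mattr_sketch(tokens, window_length=3))
```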

In [31]:
mattr1 = ld.mattr(flt1)
mattr2 = ld.mattr(flt2)
mattr1_w25 = ld.mattr(flt1, window_length=25)  # same measure, 25-token window
ld.mattr(flt2, window_length=25)  # last expression, shown as the cell output

0.8894244604316532

## [7] Hypergeometric distribution D (HDD)

A more straightforward and reliable implementation of vocD (Malvern, Richards, Chipere, & Duran, 2004) as per McCarthy and Jarvis (2007, 2010).
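
The idea behind HD-D can be sketched directly from the hypergeometric distribution: for each type, compute the probability that it appears at least once in a random sample of 42 tokens, weight it by 1/42, and sum over all types. The sketch below is an illustration under those assumptions, not the package code; `hdd_sketch` is a hypothetical helper, and returning 0.0 for texts shorter than the sample is a convention of this sketch.

```python
from collections import Counter
from math import comb

def hdd_sketch(tokens, sample_size=42):
    """Sum over types of P(type occurs in a random sample) / sample_size."""
    n_tokens = len(tokens)
    if n_tokens < sample_size:
        return 0.0  # sketch convention for texts shorter than the sample
    total = 0.0
    for freq in Counter(tokens).values():
        # Probability that a random sample contains no instance of this type
        p_absent = comb(n_tokens - freq, sample_size) / comb(n_tokens, sample_size)
        total += (1.0 - p_absent) / sample_size
    return total

# Synthetic extremes: every token distinct vs. a single repeated token
all_unique = [f"w{i}" for i in range(50)]
all_same = ["a"] * 50

print(hdd_sketch(all_unique), hdd_sketch(all_same))
```

The result can be read as the expected TTR of a random 42-token sample, which is why it stays comparable across texts of different lengths.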

In [32]:
hdd1 = ld.hdd(flt1)
hdd2 = ld.hdd(flt2)

## [8] Measure of textual lexical diversity (MTLD)

Calculates MTLD based on McCarthy and Jarvis (2010).
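
MTLD reads the text token by token and counts how many "factors" it takes for the running TTR to fall below a threshold (0.72 in McCarthy and Jarvis); the score is the text length divided by the number of factors, averaged over a forward and a backward pass. The following is a simplified sketch: `mtld_sketch` and `_mtld_one_direction` are hypothetical helpers, no minimum factor length is enforced, and returning 0.0 when no factor accumulates is a convention of this sketch.

```python
def _mtld_one_direction(tokens, threshold=0.72):
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count < threshold:
            factors += 1  # a full factor is complete; start a new one
            types.clear()
            count = 0
    if count > 0:
        # Partial factor for the leftover tokens
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors else 0.0

def mtld_sketch(tokens, threshold=0.72):
    """Mean of the forward and backward passes."""
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2

# Hypothetical toy list
tokens = "a b a b".split()

print(mtld_sketch(tokens))
```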

In [33]:
mtld1 = ld.mtld(flt1)
mtld2 = ld.mtld(flt2)

## [9] Measure of textual lexical diversity (moving average, wrap)

Calculates MTLD using a moving window approach. Instead of calculating partial factors, it wraps to the beginning of the text to complete the last factors.
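
The wrap-around idea can be sketched as follows: start a factor at every token position, extend it until the running TTR drops below the threshold, and let it continue from the start of the text if it reaches the end. The score is the mean factor length. `mtld_ma_wrap_sketch` is a hypothetical helper, the 0.72 threshold is the conventional MTLD value, and no minimum factor length is enforced here.

```python
def mtld_ma_wrap_sketch(tokens, threshold=0.72):
    """Mean length of the factors started at each token position."""
    # Doubling the list lets a factor run past the end and wrap around
    doubled = tokens + tokens
    factor_lengths = []
    for start in range(len(tokens)):
        types = set()
        for count, token in enumerate(doubled[start:], start=1):
            types.add(token)
            if len(types) / count < threshold:
                factor_lengths.append(count)
                break
    return sum(factor_lengths) / len(factor_lengths) if factor_lengths else 0.0

# Hypothetical toy list: every start position yields a factor of length 3
tokens = "a b a b".split()

print(mtld_ma_wrap_sketch(tokens))
```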

In [37]:
# Note: despite their names, bilog1 and bilog2 hold MTLD (moving average, wrap) scores
bilog1 = ld.mtld_ma_wrap(flt1)
bilog2 = ld.mtld_ma_wrap(flt2)

## [10] Measure of textual lexical diversity (moving average, bi-directional)
Calculates the average MTLD score by calculating MTLD in each direction using a moving window approach.
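
A rough sketch of the bi-directional variant: run the moving-window factor search once over the text and once over the reversed text, then average the two means. Here factors that reach the end of the text without dropping below the threshold are simply skipped, which is an assumption of this sketch; `mtld_ma_bid_sketch` and `_mtld_ma_one_direction` are hypothetical helpers.

```python
def _mtld_ma_one_direction(tokens, threshold=0.72):
    factor_lengths = []
    for start in range(len(tokens)):
        types = set()
        for count, token in enumerate(tokens[start:], start=1):
            types.add(token)
            if len(types) / count < threshold:
                factor_lengths.append(count)
                break
        # A window that reaches the end unfinished contributes nothing
    return sum(factor_lengths) / len(factor_lengths) if factor_lengths else 0.0

def mtld_ma_bid_sketch(tokens, threshold=0.72):
    """Average of the forward and backward moving-window scores."""
    forward = _mtld_ma_one_direction(tokens, threshold)
    backward = _mtld_ma_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2

# Hypothetical toy list
tokens = "a b a b".split()

print(mtld_ma_bid_sketch(tokens))
```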

In [35]:
ma1 = ld.mtld_ma_bid(flt1)
ma2 = ld.mtld_ma_bid(flt2)

In [None]:
# Collect the indices for text1 in one list
c1 = [ttr1, rttr1, logttr1, mattr1, hdd1, mtld1, bilog1, ma1]