<a href="https://colab.research.google.com/github/MK316/applications/blob/main/LD_mtld_hdd_mass.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lexical Diversity Assessments: **MTLD, vocd-D, and HD-D**
**[Notes]** 
1. The lexical diversity indices here refer to the following article:

* [McCarthy, P. M., & Jarvis, S.](https://link.springer.com/article/10.3758/BRM.42.2.381) (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42, 381–392. https://doi.org/10.3758/BRM.42.2.381

2. The Python code here uses [Kristopher Kyle's {lexical-diversity} package](https://github.com/kristopherkyle/lexical_diversity).

3. TTR, Log TTR, Root TTR, etc.: These measures have been criticized because they are heavily influenced by text length (Tweedie & Baayen, 1998; Chipere et al., 2004; Kettunen, 2014). Researchers should therefore prefer metrics such as MTLD, HD-D, and Maas.
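
The length effect mentioned in note 3 can be illustrated with a deliberately extreme toy example. This is only a sketch: the helper `ttr` and the sample sentence are made up for illustration and are not part of the {lexical-diversity} package.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct words divided by number of words."""
    return len(set(tokens)) / len(tokens)

# Six different words: every token is a new type.
short = "the frogs sent ambassadors to jupiter".split()

# The same six words repeated ten times: the vocabulary is unchanged,
# but the text is ten times longer.
long = short * 10

print(ttr(short))  # 1.0
print(ttr(long))   # 0.1
```

The vocabulary is identical in both cases, yet TTR drops from 1.0 to 0.1 simply because the text is longer.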

# Installing the {lexical-diversity} package:

In [1]:
pip install lexical-diversity

Collecting lexical-diversity
  Downloading lexical_diversity-0.1.1-py3-none-any.whl (117 kB)
[?25l[K     |██▉                             | 10 kB 20.9 MB/s eta 0:00:01[K     |█████▋                          | 20 kB 12.6 MB/s eta 0:00:01[K     |████████▍                       | 30 kB 9.8 MB/s eta 0:00:01[K     |███████████▏                    | 40 kB 8.6 MB/s eta 0:00:01[K     |██████████████                  | 51 kB 4.6 MB/s eta 0:00:01[K     |████████████████▊               | 61 kB 5.3 MB/s eta 0:00:01[K     |███████████████████▌            | 71 kB 5.2 MB/s eta 0:00:01[K     |██████████████████████▎         | 81 kB 5.4 MB/s eta 0:00:01[K     |█████████████████████████       | 92 kB 6.0 MB/s eta 0:00:01[K     |███████████████████████████▉    | 102 kB 5.2 MB/s eta 0:00:01[K     |██████████████████████████████▋ | 112 kB 5.2 MB/s eta 0:00:01[K     |████████████████████████████████| 117 kB 5.2 MB/s 
[?25hInstalling collected packages: lexical-diversity
Successfull

In [20]:
from lexical_diversity import lex_div as ld

# Bring your texts to be analyzed:

Load the text in one of the following ways. If a code line is commented out, remove the leading `#` to run it.

In [45]:
#1# Copy and paste the text directly in the cell:

text = """
THE FROGS, grieved at having no established Ruler, sent
ambassadors to Jupiter entreating for a King.  Perceiving their
simplicity, he cast down a huge log into the lake.  The Frogs
were terrified at the splash occasioned by its fall and hid
themselves in the depths of the pool.  But as soon as they
realized that the huge log was motionless, they swam again to the
top of the water, dismissed their fears, climbed up, and began
squatting on it in contempt.  After some time they began to think
themselves ill-treated in the appointment of so inert a Ruler,
and sent a second deputation to Jupiter to pray that he would set
over them another sovereign.  He then gave them an Eel to govern
them.  When the Frogs discovered his easy good nature, they sent
yet a third time to Jupiter to beg him to choose for them still
another King.  Jupiter, displeased with all their complaints,
sent a Heron, who preyed upon the Frogs day by day till there
were none left to croak upon the lake.
"""

In [None]:
#2# Paste the text into the input box that appears when this cell runs (input() reads a single line):

text = input()


In [None]:
#3# Upload a file to Colab: use the folder icon (Files panel) in the left sidebar


In [11]:
#4# Clone your GitHub repository to work with multiple texts

!git clone https://github.com/MK316/LexicalAnalysis.git

Cloning into 'LexicalAnalysis'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 25 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (25/25), done.


In [13]:
! git pull

remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 7 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects:  71% (5/7)

In [8]:
# Current directory
%pwd

'/content'

In [10]:
# Change directory
%cd /content/LexicalAnalysis/textdata/

/content/LexicalAnalysis/textdata


In [None]:
# Read the file and replace line breaks with spaces;
# the with-block closes the file automatically.
with open('Aesop01.txt', 'r') as file:
    text = file.read().replace("\n", " ")

text
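
Since method #4 is meant for working with multiple texts, the single-file cell above could be extended to a whole folder. The following is a minimal sketch, assuming all texts are UTF-8 `.txt` files in one directory; the helper name `read_texts` is hypothetical and not part of any package used here.

```python
from pathlib import Path

def read_texts(folder, pattern="*.txt"):
    """Return {file name: text} for every matching file in `folder`.

    Line breaks are replaced with spaces, as in the single-file cell.
    Assumes the files are UTF-8 encoded.
    """
    texts = {}
    for path in sorted(Path(folder).glob(pattern)):
        texts[path.name] = path.read_text(encoding="utf-8").replace("\n", " ")
    return texts
```

For example, `texts = read_texts('/content/LexicalAnalysis/textdata')` would collect every `.txt` file in the cloned folder, and `texts['Aesop01.txt']` would then hold the same string as `text` above.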

# [1] **MTLD**: the Measure of Textual Lexical Diversity

* The index for this approach is calculated as the mean length of word strings that maintain a criterion level of lexical variation.

* MTLD was not found to vary as a function of text length (McCarthy & Jarvis, 2010).

> "... MTLD is obtained by dividing the total number of words by the total number of factors. Thus, if the text is 360 words long and there are 4 factors, the MTLD value is 90. ... The final version of MTLD is obtained by running the programme forward and backward through the data and calculating an average of the outcome of both. According to McCarthy (2005) and Crossley et al. (2009), MTLD does not vary as a function of text length for text segments whose length is in the 100–2,000-word range."

Excerpt from [Treffers-Daller, J. (2013)](https://centaur.reading.ac.uk/28712/1/04ch3.pdf). Measuring lexical diversity among L2 learners of French: an exploration of the validity of D, MTLD and HD-D as measures of language ability. In: Jarvis, S. and Daller, M. (eds.) Vocabulary knowledge: human ratings and automated measures. Benjamins, Amsterdam, pp. 79-104. ISBN 9789027241887. Available at http://centaur.reading.ac.uk/28712/
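
The procedure in the excerpt can be sketched in a few lines of Python. This is a simplified illustration, not the code of the {lexical-diversity} package: the function names are hypothetical, the 0.72 TTR threshold is assumed as the criterion level, and the handling of the leftover words at the end of the text (a partial factor) is likewise an assumption of this sketch.

```python
def mtld_pass(tokens, threshold=0.72):
    """One pass over the tokens: count factors, then divide words by factors."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        # A factor is complete once the running TTR falls below the threshold.
        if len(types) / count < threshold:
            factors += 1
            types, count = set(), 0
    if count > 0:
        # Leftover words count as a partial factor, in proportion to how far
        # their TTR has moved from 1.0 toward the threshold (assumption).
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    # With no factor at all (no repeated word), MTLD is undefined.
    return len(tokens) / factors if factors > 0 else float("nan")

def mtld_sketch(tokens, threshold=0.72):
    """Average of a forward pass and a backward pass."""
    forward = mtld_pass(tokens, threshold)
    backward = mtld_pass(tokens[::-1], threshold)
    return (forward + backward) / 2

print(mtld_sketch(["a", "b", "a", "b"]))  # 4.0
```

In the toy call, the running TTR first drops below 0.72 at the third word (2 types / 3 tokens), which completes one factor; the remaining word adds nothing, so each pass yields 4 words / 1 factor = 4.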

In [63]:
# Note: len(text) counts characters (including spaces and line breaks), not words
tleng = len(text); print('Text length: ',tleng)

Text length:  983


In [46]:
tok = ld.tokenize(text); print('First 10 tokenized words:', tok[:10])

First 10 tokenized words: ['', 'the', 'frogs', 'grieved', 'at', 'having', 'no', 'established', 'ruler', 'sent']


In [47]:
flt = ld.flemmatize(text)

In [48]:
mtld = ld.mtld(flt); print("MTLD index: %d"%mtld)

MTLD index: 79


# **[2] HD-D** (or vocd-D): Hypergeometric Distribution D
* HD-D is a viable alternative to the vocd-D standard.
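
As a rough illustration of the idea behind HD-D: for each word type, the hypergeometric distribution gives the probability that the type appears at least once in a random sample of tokens drawn from the text, and these probabilities are summed and scaled by the sample size. The sketch below assumes a sample size of 42 tokens; the function name `hdd_sketch` is hypothetical and this is not the package's own code.

```python
from collections import Counter
from math import comb

def hdd_sketch(tokens, sample_size=42):
    """Sum, over all types, of P(type occurs in a random sample) / sample_size."""
    n_tokens = len(tokens)
    if n_tokens < sample_size:
        raise ValueError("text is shorter than the sample size")
    all_samples = comb(n_tokens, sample_size)
    score = 0.0
    for freq in Counter(tokens).values():
        # Probability that none of this type's `freq` tokens are drawn.
        # comb() returns 0 when too few other tokens remain to fill the sample.
        p_absent = comb(n_tokens - freq, sample_size) / all_samples
        score += (1 - p_absent) / sample_size
    return score
```

Under these assumptions a text of 42 different words scores 1.0, and a text that repeats a single word scores 1/42, so higher values indicate greater diversity.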

In [54]:
hdd = ld.hdd(flt); print('HD-D index: ',hdd)

HD-D index:  0.8217430142133639


# **[3] Maas**:

* Three of the indices, namely MTLD, vocd-D (or HD-D), and Maas, appear to capture unique lexical information (McCarthy & Jarvis, 2010).
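
For orientation, the Maas index is commonly written as (log N − log V) / (log N)², where N is the number of tokens and V the number of types. The sketch below uses that formula; the base-10 logarithm and the function name `maas_sketch` are assumptions of this illustration, not a statement about the package's internals.

```python
import math

def maas_sketch(tokens):
    """Maas index: (log N - log V) / (log N)^2, with base-10 logs (assumption)."""
    n_tokens = len(tokens)       # N: number of tokens
    n_types = len(set(tokens))   # V: number of types
    log_n = math.log10(n_tokens)
    return (log_n - math.log10(n_types)) / log_n ** 2

# 100 tokens drawn from only 10 types: (2 - 1) / 2**2
toy = [str(i % 10) for i in range(100)]
print(maas_sketch(toy))  # 0.25
```

Unlike MTLD and HD-D, a lower Maas value indicates a more diverse text, because the numerator shrinks as the number of types approaches the number of tokens.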

In [55]:
mass = ld.maas_ttr(flt); print('Maas index: ', mass)

Maas index:  0.043234341423214924


# Result summary

In [64]:
print('Text length (characters): ', tleng)
print('MTLD index: ', round(mtld,2))
print('HD-D index: ', round(hdd,4))
print('Maas index: ', round(mass,4))

Text length (characters):  983
MTLD index:  79.32
HD-D index:  0.8217
Maas index:  0.0432
