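# Lecture 3: Lexical Analysis of Written Texts
Learning objectives:
- Introduce the lexical measures commonly used in writing assessment
- Learn to use NLP techniques for tokenization, part-of-speech tagging, semantic analysis, and related tasks
- Use the provided dataset to generate indices such as lexical density, lexical diversity, and lexical sophistication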

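## 1. Research background [Qin]:
[Briefly describe the role of vocabulary in automated writing assessment and recommend 3-5 key references.]  
Writing research has long been an important branch of EFL (English as a Foreign Language) research. To trace how writing develops and how it relates to learners' other cognitive abilities (such as language proficiency), researchers have developed a variety of measures of the complexity of the language used in writing. These measures are commonly referred to as linguistic complexity indices.  
**Linguistic complexity** is defined as the ability to use more advanced linguistic forms and functions, which are typically acquired at later stages of second or foreign language development (Ellis, 2009; Pallotti, 2015). It has long been regarded as a multidimensional construct that covers the complex use of language at several levels (lexical, syntactic, and structural). To analyze these features at scale, many researchers have built tools that automate the analysis. In this lecture we briefly discuss how the various linguistic complexity indices are measured and how these features can be extracted from learner writing and analyzed automatically.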


## 2. Loading the data
Download the data files for Lecture 3 and read them into Python.

In [1]:
import os
# read the txt file in the data directory and save the data in the dict
essay_dict = {}
files = os.listdir('./DATA') # list of files in the DATA directory
for file in files:
    if file.endswith('.txt'): # only read the txt file
        with open('./DATA/'+file, 'r', encoding="utf-8-sig") as f: # open the files
            essay = f.read()
            essay_dict[file] = essay # save the data in the dict

# show the data in the dict
for key, value in essay_dict.items():
    print(f"Filename: {key}")
    print(f"Essay: {value}")
    print("------------------------------------")

Filename: 1.txt
Essay: 
I'm for the cloning of plants, animals and human beings. As it is acknowledged that plants and animal clones can do no harm, these should be definitely allowed in order to serve the human race and its development. And (speuhly?) of humans, while some experts say that moral issues are to be considered, other evidence have shown us that human clones have a lot of genetic flaws. This means that they are incapable of cloning themselves and they shouldn’t be regarded as humane due to their short lives and relatively lower than standard IQ. I propose that they can be used as slaves to benefit humans in production. Once we have artificial cells and wombs for mass clone production, I suggest everyone get two to three clones of themselves to serve them. Legislations may be under going on clone's rights as slaves and the mass number of clones a person can have. Just as ancient Greece thrived on slavery, we shall thrive again on slavery of clones who are sub-humans, thus g

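## 3. Data preprocessing
Data must be preprocessed before any lexical analysis. Only clean, consistently formatted data yields meaningful, interpretable results. In general, preprocessing may include replacing full-width punctuation with half-width punctuation, identifying and correcting spelling errors, removing stop words, stemming, lemmatization, part-of-speech tagging, and tokenization. For lexical analysis we focus on the first two: punctuation replacement and spelling correction.
### 3.1 Replacing full-width punctuation with half-width punctuation
Second-language (L2) learners whose first language is Chinese often confuse full-width (Chinese) and half-width (English) punctuation marks in their writing. For example, a student may write "," as "，", "." as "。", or "?" as "？". Replacing full-width marks with their half-width equivalents removes these errors and makes the subsequent lexical analysis more accurate. The code below performs the replacement: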

In [2]:
def replace_punctuation(text):
    # Define replacements for each punctuation symbol
    replacements = {
        '，': ',',
        '。': '.',
        '！': '!',
        '？': '?',
        '：': ':',
        '；': ';',
        '“': '"',
        '”': '"',
        '‘': "'",
        '’': "'",
        '（': '(',
        '）': ')',
        '【': '[',
        '】': ']',
        '《': '<',
        '》': '>',
    }
    # Replace each punctuation symbol with its English equivalent
    for zh, en in replacements.items():
        text = text.replace(zh, en)
    
    return text
# Test the function
example = '这是一个例子，展示如何用 Python 将中文标点符号替换为英文标点符号！'
print(f"before replace: {example} \n after replace: {replace_punctuation(example)}")
# Replace punctuation in each essay
for key, value in essay_dict.items():
    essay_dict[key] = replace_punctuation(value)
    print(f"Filename: {key}")
    print(f"Essay: {essay_dict[key]}")

before replace: 这是一个例子，展示如何用 Python 将中文标点符号替换为英文标点符号！ 
 after replace: 这是一个例子,展示如何用 Python 将中文标点符号替换为英文标点符号!
Filename: 1.txt
Essay: 
I'm for the cloning of plants, animals and human beings. As it is acknowledged that plants and animal clones can do no harm, these should be definitely allowed in order to serve the human race and its development. And (speuhly?) of humans, while some experts say that moral issues are to be considered, other evidence have shown us that human clones have a lot of genetic flaws. This means that they are incapable of cloning themselves and they shouldn't be regarded as humane due to their short lives and relatively lower than standard IQ. I propose that they can be used as slaves to benefit humans in production. Once we have artificial cells and wombs for mass clone production, I suggest everyone get two to three clones of themselves to serve them. Legislations may be under going on clone's rights as slaves and the mass number of clones a person can have. Jus

### 3.2 Identifying and correcting spelling errors
With the punctuation normalized, the next step is to identify and correct spelling errors.

In [3]:
# Identify spelling errors in the data
# Correct them and print every correction, e.g. ["anmal", "animal"]

# !!!You should run `python -m spacy download en_core_web_sm` in terminal to download the English model for spaCy!!!
!python -m spacy download en_core_web_sm
from spellchecker import SpellChecker
import spacy
import re

# Load the spell checker and the spaCy model once, rather than on every call
spell = SpellChecker()
nlp = spacy.load("en_core_web_sm")

def check_spelling_and_correct(text):
    # Tokenize the text using spaCy
    doc = nlp(text)
    words = [token.text for token in doc if token.is_alpha]
    # find the misspelled words
    misspelled = spell.unknown(words)
    # find the candidates and corrections for each misspelled word
    candidates = {misspelled_word: spell.candidates(misspelled_word) for misspelled_word in misspelled}
    corrections = {misspelled_word: spell.correction(misspelled_word) for misspelled_word in misspelled}

    return candidates, corrections


# Check and correct spelling in each essay
correct_replacement = {}
correct_text = {}
for key, value in essay_dict.items():
    print("---------------------------")
    print(f"Filename: {key}")
    print(f"Essay: {value}")
    candidates, corrections = check_spelling_and_correct(value) # check the spelling errors
    print(f"candidates: {candidates}")
    print(f"corrections: {corrections}")
    correct_replacement[key] = corrections # save the corrections in the dict

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
---------------------------
Filename: 1.txt
Essay: 
I'm for the cloning of plants, animals and human beings. As it is acknowledged that plants and animal clones can do no harm, these should be definitely allowed in order to serve the human race and its de

In [4]:
# fix the spelling errors in the essays manually (or using GPT-4) based on the output of the spell checker
correct_replacement["10.txt"] = {'unfavourable': 'unfavorable'}
correct_replacement["4.txt"] = {'inteneds': 'intends'}
correct_replacement["3.txt"] = {'subtal': 'subtle'}
correct_replacement["2.txt"] = {'al':'a', 'ger':'get', 'persumably':'presumably', 'illed':'ill'}
correct_replacement["1.txt"] = {'speuhly':'especially', 'Legislations':'Legislation'}

# Replace the misspelled words with the corrected words using regex
for key, value in essay_dict.items(): # iterate over the essays
    for misspelled, corrected in correct_replacement[key].items(): # iterate over the misspelled words and their corrections
        if corrected is None: # the spell checker found no suggestion, so leave the word unchanged
            continue
        value = re.sub(rf"\b{re.escape(misspelled)}\b", corrected, value) # replace whole-word matches only
    essay_dict[key] = value # save the corrected essay in the dict
    print(f"Filename: {key}")
    print(f"Essay: {value}")

Filename: 1.txt
Essay: 
I'm for the cloning of plants, animals and human beings. As it is acknowledged that plants and animal clones can do no harm, these should be definitely allowed in order to serve the human race and its development. And (especially?) of humans, while some experts say that moral issues are to be considered, other evidence have shown us that human clones have a lot of genetic flaws. This means that they are incapable of cloning themselves and they shouldn't be regarded as humane due to their short lives and relatively lower than standard IQ. I propose that they can be used as slaves to benefit humans in production. Once we have artificial cells and wombs for mass clone production, I suggest everyone get two to three clones of themselves to serve them. Legislation may be under going on clone's rights as slaves and the mass number of clones a person can have. Just as ancient Greece thrived on slavery, we shall thrive again on slavery of clones who are sub-humans, thus

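## 4. Key concepts and their indices
With preprocessing complete, we can begin the lexical analysis.  
Lexical complexity in learner writing can be understood as the extent to which a learner's language production uses varied, meaningful, rare, and more sophisticated vocabulary. It covers several dimensions, including lexical diversity, lexical density, lexical sophistication, and morphological complexity. The subsections below introduce each concept, its common indices, and how to generate them.
### 4.1 Lexical diversity
#### Concept
Lexical diversity refers to the range of vocabulary that learners display in their language use (Malvern et al., 2004; Yu, 2010).
#### Common indices
- Type-token ratio (TTR): the ratio of the number of types to the number of tokens in a text (Templin, 1957). Types are the distinct words that occur in the text; tokens are all the running words, counting every repetition.
- VocD (D): a lexical diversity index for learner writing (Malvern et al., 2004). Unlike TTR, VocD controls for text length, so its values are more stable.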
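
To make the TTR definition concrete, here is a minimal sketch that counts types and tokens directly. The regex tokenizer and the function name `type_token_ratio` are illustrative assumptions, not part of LCA or any other tool mentioned in this lecture. Real tools usually lemmatize and filter words first, so their values will differ.

```python
import re

def type_token_ratio(text):
    """Return (n_types, n_tokens, TTR) for a text, using a naive tokenizer."""
    # Assumption: any lowercase run of letters/apostrophes counts as one word.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0, 0, 0.0
    types = set(tokens)  # distinct word forms
    return len(types), len(tokens), len(types) / len(tokens)

short = "the cat saw the dog"
longer = " ".join([short] * 3)  # same vocabulary, three times the length

print(type_token_ratio(short))   # (4, 5, 0.8)
print(type_token_ratio(longer))  # same 4 types, but 15 tokens, so a lower TTR
```

The second call illustrates why raw TTR is sensitive to text length: repeating a text adds tokens without adding types, so the ratio falls even though the vocabulary is unchanged.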
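#### Generating the indices
Lu (2012) developed the Lexical Complexity Analyzer (LCA), a tool that generates a range of lexical diversity and density indices. Note that the code cell below does not call LCA itself: it imports a `process_fn` module from `src/TAALED` and prints, for each sample essay, the lexical density (tokens) and a moving-average TTR with a 50-word window (MATTR-50), rather than raw TTR and D.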


In [15]:
# Insert code: use LCA to analyze the lexical complexity of the data and generate the TTR and D indices
# https://sites.psu.edu/xxl13/lca/
# Lu, X. (2012). The relationship of lexical richness to the quality of ESL learners’ oral narratives. Modern Language Journal, 96(2), 190–208. https://doi.org/10.1111/j.1540-4781.2011.01232_1.x

import src.TAALED.process_fn as process_fn_taaled

for filename, essay in essay_dict.items():
    refined_lemma_dict = process_fn_taaled.tag_processor_spaCy(essay, adj_word_list_path="src/TAALED/dep_files/adj_lem_list.txt", real_word_list_path="src/TAALED/dep_files/real_words.txt") # Process the essay using the functions in process_fn
    # Extract the lemma text and the content and function words
    lemma_text_aw = refined_lemma_dict["lemma"]
    lemma_text_cw = refined_lemma_dict["content"]
    lemma_text_fw = refined_lemma_dict["function"] 
    # Calculate the lexical density and MATTR
    [lexical_density_tokens, lexical_density_types] = process_fn_taaled.lex_density(lemma_text_cw, lemma_text_fw)
    mattr_aw_50 = process_fn_taaled.mattr(lemma_text_aw, 50)
    print(f"Filename: {filename}")
    print(f"Lexical Density Tokens: {lexical_density_tokens}")
    # print(f"Lexical Density Types: {lexical_density_types}")
    print(f"MATTR AW 50: {mattr_aw_50}")





Filename: 1.txt
Lexical Density Tokens: 0.46408839779005523
MATTR AW 50: 0.8036363636363629
Filename: 10.txt
Lexical Density Tokens: 0.5357142857142857
MATTR AW 50: 0.8289655172413806
Filename: 2.txt
Lexical Density Tokens: 0.5405405405405406
MATTR AW 50: 0.809494949494949
Filename: 3.txt
Lexical Density Tokens: 0.47761194029850745
MATTR AW 50: 0.75921052631579
Filename: 4.txt
Lexical Density Tokens: 0.5644444444444444
MATTR AW 50: 0.8406818181818188
Filename: 5.txt
Lexical Density Tokens: 0.5266272189349113
MATTR AW 50: 0.7488333333333337
Filename: 6.txt
Lexical Density Tokens: 0.48186528497409326
MATTR AW 50: 0.7979166666666654
Filename: 7.txt
Lexical Density Tokens: 0.4010416666666667
MATTR AW 50: 0.746853146853147
Filename: 8.txt
Lexical Density Tokens: 0.48148148148148145
MATTR AW 50: 0.7809954751131232
Filename: 9.txt
Lexical Density Tokens: 0.4846938775510204
MATTR AW 50: 0.7624489795918374
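
The `MATTR AW 50` values above are moving-average type-token ratios. The sketch below shows one common way such an index can be computed: take the TTR of every consecutive window of a fixed size and average the results. It is an illustration only, not the code in `src/TAALED`; the function name `moving_average_ttr` and the fallback for texts shorter than the window are assumptions.

```python
def moving_average_ttr(tokens, window=50):
    """Average the TTR over every consecutive window of `window` tokens."""
    if not tokens:
        return 0.0
    if len(tokens) < window:
        # Assumption: fall back to plain TTR when the text is shorter than the window.
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)

# Tiny example with a window of 2 so the arithmetic is easy to follow:
# windows are (a, a), (a, b), (b, b) with TTRs 0.5, 1.0, 0.5.
print(moving_average_ttr(["a", "a", "b", "b"], window=2))
```

Because every window has the same length, the index is comparable across essays of different lengths, which is the main reason to prefer it over raw TTR.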


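### 4.2 Lexical density
#### Concept
Lexical density is the proportion of content words (as opposed to function words) among all the words in a text (Ure, 1971).
#### Common indices
- Density: the ratio of content words to the total number of words in a text (Engber, 1995).
#### Generating the indices
Like the lexical diversity indices, Density can be generated automatically (for example with LCA). For the sample essays, the code cell in Section 4.1 already reports it as `Lexical Density Tokens`.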
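
As a rough illustration of the definition, the sketch below computes lexical density from tokens that already carry part-of-speech tags. The tags are hand-written Universal POS labels, and the choice of which tags count as content words (`CONTENT_POS`) is an assumption; tools differ on details such as auxiliaries and proper nouns, so this will not reproduce the LCA or TAALED values exactly.

```python
# Assumption: nouns, lexical verbs, adjectives, and adverbs are content words.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def lexical_density(tagged_tokens):
    """Return content words / all words for a list of (word, pos) pairs."""
    words = [(w, pos) for w, pos in tagged_tokens if pos != "PUNCT"]
    if not words:
        return 0.0
    n_content = sum(1 for _, pos in words if pos in CONTENT_POS)
    return n_content / len(words)

# Hand-tagged example sentence (tags are illustrative, not tagger output).
sentence = [
    ("Clones", "NOUN"), ("serve", "VERB"), ("humans", "NOUN"),
    ("in", "ADP"), ("production", "NOUN"), (".", "PUNCT"),
]
print(lexical_density(sentence))  # 4 content words out of 5 words -> 0.8
```

In practice the `(word, pos)` pairs would come from a tagger, for example the `token.text` and `token.pos_` attributes of a spaCy `Doc`.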

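### 4.3 Lexical sophistication
#### Concept
Lexical sophistication refers to the proportion of "advanced" vocabulary used in a text. "Advanced" can refer either to the breadth of the vocabulary used or to the frequent use of sophisticated words. Lexical sophistication indices therefore usually rely on a reference corpus as a benchmark, such as the British National Corpus (BNC) or the Corpus of Contemporary American English (COCA), or on more specialized academic lists, such as the Academic Word List (AWL) (Coxhead, 2000) and the Academic Formulas List (AFL) (Simpson-Vlach & Ellis, 2010).
#### Common indices
- BNC word frequency: the mean frequency, in the BNC, of the words in a text (Kyle & Crossley, 2015). Separate indices are available for the written and spoken subcorpora.
- Academic word list: the frequency with which words from the AWL occur in a text (Kyle & Crossley, 2015).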
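
The two indices above can be sketched in a few lines, under clearly stated assumptions. `TOY_FREQ` and `TOY_AWL` below are made-up miniature stand-ins for the BNC frequency list and the AWL, the numbers are invented for illustration, and the decisions to log-transform frequencies and to skip words missing from the list are assumptions rather than a description of how TAALES works.

```python
import math

# Invented counts for illustration only; NOT real BNC frequencies.
TOY_FREQ = {"the": 100000, "people": 1000, "analyze": 100, "data": 1000}
# Invented mini word list; NOT the real Academic Word List.
TOY_AWL = {"analyze", "data"}

def mean_log_frequency(tokens, freq):
    """Mean log10 reference-corpus frequency of the tokens found in `freq`."""
    logs = [math.log10(freq[t]) for t in tokens if t in freq]
    return sum(logs) / len(logs) if logs else None

def awl_proportion(tokens, awl):
    """Share of tokens that appear in the academic word list."""
    return sum(t in awl for t in tokens) / len(tokens) if tokens else 0.0

tokens = ["people", "analyze", "the", "data"]
print(mean_log_frequency(tokens, TOY_FREQ))  # lower values mean rarer words
print(awl_proportion(tokens, TOY_AWL))       # 2 of 4 tokens are on the list -> 0.5
```

A lower mean frequency and a higher share of academic words both point toward more sophisticated vocabulary, which is the intuition behind the frequency-based indices.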
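#### Generating the indices
Kyle and Crossley (2015) developed the Tool for the Automatic Analysis of Lexical Sophistication (TAALES), which has since been upgraded several times (Crossley & Kyle, 2018; Kyle et al., 2018). The tool generates a range of lexical sophistication indices, including frequency indices and n-gram indices. For details, see https://www.linguisticanalysistools.org/taales.html. The cell below is reserved for code that uses TAALES to generate the BNC Written Frequency AW and Academic Word List ALL indices for the sample essays.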

In [None]:
# TODO: insert code that uses TAALES to analyze the lexical sophistication of the data and generate the BNC Written Frequency AW and Academic Word List ALL indices
# The code, the index description sheet, and the website address are all bundled in the zip archive
# Website: https://www.linguisticanalysistools.org/taales.html
# Reference: Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49(4), 757–786. https://doi.org/10.1002/tesq.194

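### 4.4 Morphological complexity
#### Concept
Morphological complexity can be divided into inflectional complexity and derivational complexity:    
- Inflectional complexity: inflection expresses the relationship between a word and its local discourse context, in terms of tense, number, and so on, without changing the word's part of speech or underlying meaning (Tywoniw & Crossley, 2020). Examples include changes in a verb's tense, voice, or mood (do becomes doing or done) and changes in a noun's gender, number, or case (apple becomes apples).
- Derivational complexity: derivation changes a word's semantic or syntactic category (that is, its meaning or part of speech) by adding a meaning-bearing unit at the beginning (prefix) or end (suffix) of the word (Tywoniw & Crossley, 2020). Examples include shifts among nouns, adjectives, and adverbs that share a root (interest becomes interesting or interestingly).
#### Common indices
- Inflectional mean complexity index (Inflectional MCI): a composite index of inflectional complexity. For the calculation, see Brezina & Pallotti (2019). The higher the index, the richer the inflectional variation.
- Derivational mean complexity index (Derivational MCI): a composite index of derivational complexity, developed by Tywoniw & Crossley (2020) following the calculation in Brezina & Pallotti (2019). The higher the index, the richer the derivational variation.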
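
The sketch below illustrates the general idea behind a mean complexity index: sample fixed-size subsets of morphological exponents, then combine the average variety within subsets with the average difference between subsets. The formula used here (within-subset variety plus half the between-subset diversity, minus 1) is how the MCI is usually summarized, but treat it as an assumption to verify against Brezina & Pallotti (2019). The function name `mean_complexity_index`, the sampling scheme, and the exponent labels are illustrative and are not taken from TAMMI.

```python
import random
from itertools import combinations

def mean_complexity_index(exponents, subset_size=10, n_subsets=100, seed=0):
    """Sketch of an MCI over a list of exponent labels (e.g. '-ed', '-s')."""
    if len(exponents) < subset_size:
        raise ValueError("need at least `subset_size` exponents")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Assumption: subsets are drawn at random without replacement.
    subsets = [set(rng.sample(exponents, subset_size)) for _ in range(n_subsets)]
    # Within-subset variety: mean number of distinct exponents per subset.
    within = sum(len(s) for s in subsets) / n_subsets
    # Between-subset diversity: mean number of exponents found in only one
    # subset of each pair (size of the symmetric difference).
    pairs = list(combinations(subsets, 2))
    between = sum(len(a ^ b) for a, b in pairs) / len(pairs)
    return within + between / 2 - 1

uniform = ["-ed"] * 20                       # one exponent only
varied = ["-ed", "-s", "-ing", "zero"] * 5   # four different exponents
print(mean_complexity_index(uniform))  # 0.0: no variety at all
print(mean_complexity_index(varied))   # higher: more exponent types in play
```

A text that uses a single exponent throughout scores 0, and the score rises as more distinct exponents appear within and across the sampled subsets.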
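#### Generating the indices
In 2022, Tywoniw and Kyle developed TAMMI (Tool for Automatic Measurement of Morphological Information), an automated analysis tool that generates a range of lexical and morphological complexity indices, including Inflectional_MCI_10 and Derivational_MCI_10. The code below shows how to use TAMMI to generate these two morphological complexity indices for the sample essays.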

In [21]:
# Tywoniw, R., & Crossley, S. (2020). Morphological complexity of L2 discourse. In The Routledge Handbook of Corpus Approaches to Discourse Analysis (pp. 269–297). Routledge.
# using TAMMI tool to analyze the morphological complexity of the essays
import src.TAMMI.process_fn as process_fn_tammi

for filename, essay in essay_dict.items():
    result = process_fn_tammi.analyze_text(essay, "src/TAMMI/morpho_lex_df_w_log_w_prefsuf_no_head.csv")
    print(f"Filename: {filename}")
    # print(f"Result: {result}")
    print(f"inflectional MCI (10): {result['inflectional MCI (10)']}")
    print(f"derivational MCI (10): {result['derivational MCI (10)']}")

Filename: 1.txt
Result: {'Inflected_Tokens': 0.41975308641975306, 'Derivational_Tokens': 0.19753086419753085, 'Tokens_w_Prefixes': 0.07407407407407407, 'Tokens_w_Affixes': 0.16049382716049382, 'Compounds': 0.012345679012345678, 'number_prefixes': 0.07407407407407407, 'number_roots': 0.9506172839506173, 'number_suffixes': 0.18518518518518517, 'number_affixes': 0.25925925925925924, 'num_roots_affixes': 1.2098765432098766, 'num_root_affix_inflec': 3.234567901234568, '%_more_freq_words_morpho-family_prefix': 0.3182466318148148, 'prefix_family_size': 17.530864197530864, 'prefix_freq': 73640.55555555556, 'prefix_log_freq': 0.4414360596913581, 'prefix_len': 0.16049382716049382, 'prefix_in_hapax': 1.0477777777777777e-06, 'hapax_in_prefix': 0.00012397176543209876, '%_more_freq_words_morpho-family_root': 2.5174771048518516, 'root_family_size': 12.91358024691358, 'root_freq': 240965.88888888888, 'root_log_freq': 4.7899698973950615, '%_more_freq_words_morpho-family_suffix': 0.46805068293827157, 's

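## 5. Exercises
1. Following the course code, use the LCA tool to generate the lexical diversity indices NDW and MSTTR-50 for the sample essays.
2. Following the course code, use the TAALES tool to generate the lexical sophistication indices COCA_Academic_Frequency_50 and SUBTLEXus_Freq_AW for the sample essays. Then read Kyle & Crossley (2015) and consider: if the generated indices are not normally distributed, does using them directly in subsequent analyses cause problems, and why?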