<a href="https://colab.research.google.com/github/KeatonDahya/LING229/blob/main/LING229_Final_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Keaton Dahya

300571027

# **Comparing Lexical Complexity in Native and Non-Native English Speakers**








# **1. Introduction**

The objective of my project is to investigate and analyze the variations in lexical complexity between native and non-native English speakers. This includes examining the distinctions in their use of language and speech patterns. It is commonly perceived that native speakers tend to demonstrate a higher level of lexical diversity and a stronger grasp of the language, owing to their lifelong exposure and familiarity with English compared to non-native speakers.

**Research Questions**

**Main Question**

When ranking lexical richness and diversity based on a ratio-based scoring system, which group—native or non-native speakers—demonstrates higher complexity?

**Sub question 1**

To what extent do native and non-native speakers differ in their use of non-standard vocabulary, and how does this influence their respective levels of lexical diversity and richness?

**Sub question 2**

How do varying proficiency levels among native and non-native English speakers impact their usage of non-standard language? Are proficient non-native speakers more or less likely to use non-standard words compared to native speakers?

**Sub question 3**

What role do metrics such as semantic diversity and concreteness play in distinguishing between the speech patterns of native and non-native speakers? Are these metrics effective for such comparisons?

# **2. Data Explanation**

The dataset used for this project is the NICT Japanese Learner English (JLE) Corpus, created in 2004 by the National Institute of Information and Communications Technology. This learner corpus consists of transcripts of audio-recorded English oral proficiency interviews: 1,281 samples totaling 1.2 million words and approximately 300 hours of speech. The corpus is publicly available and was downloaded from the official website.

### **Data Selection Criteria**

The primary criteria for choosing this dataset were its ease of preprocessing and its inclusion of speaker proficiency information. The tags within the dataset are highly detailed, allowing for efficient cleaning and preparation, which is critical for analysis. Additionally, the corpus offers a substantial amount of data, making it suitable for in-depth linguistic analysis and comparison.

### **Reasons for Selection**

Several factors influenced the selection of this dataset:

Preprocessing Efficiency: The dataset is extensively tagged, simplifying the process of text cleaning by enabling the removal of unnecessary elements such as repetitions and filler sounds, which are common in spoken language.

Dataset Size: With over 1,200 distinct texts, the corpus provides a large and diverse sample, making it ideal for analyzing linguistic features at scale.

Proficiency Data: The dataset includes speaker proficiency levels (SST scores), which enable further exploration of how varying levels of English proficiency impact linguistic diversity and other metrics.

### **Data Storage and Organization**

The corpus was downloaded as a ZIP file (7MB) from the official NICT Japanese Learner Corpus page. After extracting the files, the data was organized into separate folders for native and non-native speaker texts, along with an Excel file containing the SST scores for each speaker. Each file was already named numerically (from 0 to 1,281), which required no renaming for organization purposes.

To facilitate analysis, the data was loaded into a Python dictionary where each filename was used as a key, and the associated text (tokenized into a list of words) and SST level were stored as values. This setup streamlined the process of accessing and modifying the data as needed. Additionally, tools such as Pandas were utilized to extract information from the Excel file, associating each speaker's SST level with their respective text.
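
The layout described above can be sketched as follows. This is a minimal illustration and not the notebook's loader: the filenames, transcripts, and `sample_levels` rows are made-up stand-ins for the corpus files and the Excel sheet, and `build_corpus` is a hypothetical helper name.

```python
def build_corpus(texts, level_rows):
    """Map each filename to its token list and SST level."""
    level_by_file = dict(level_rows)
    return {
        name: {"tokens": text.split(), "sst_level": level_by_file.get(name)}
        for name, text in texts.items()
    }

# Made-up sample data standing in for the transcripts and the SST spreadsheet.
sample_texts = {
    "file00001.txt": "i like to play tennis",
    "file00002.txt": "i usually go hiking on weekends",
}
sample_levels = [("file00001.txt", 4), ("file00002.txt", 9)]

corpus = build_corpus(sample_texts, sample_levels)
print(corpus["file00002.txt"]["sst_level"])
```

Keeping the tokens and the level under one key means a text and its proficiency score cannot drift apart when entries are filtered.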

# **3. Program Explanation**

The program comprises several key functions, each responsible for specific tasks to process and analyze text data. Below is an overview of how the program works and what each function accomplishes:

**Text Cleaning (text_cleaner)**

This function processes tagged text files by removing irrelevant content such as interviewer speech, filler sounds, laughter, and unclear passages. Only interviewee speech inside the `<stage4>` section is kept. The function converts all characters to lowercase for consistency during analysis and strips all punctuation to avoid discrepancies in tokenization. A helper function, remove_tags, is called to eliminate formatting tags such as `<B>` or `</B>` that mark speaker changes, leaving only the relevant textual content.
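
As a compact illustration of the same idea, the sketch below uses regular expressions instead of the flag-based loop in the notebook. It is an assumption-laden simplification: `strip_markup` is a hypothetical name, it handles only filler spans and generic tags, and the sample line is invented, not taken from the corpus.

```python
import re
import string

def strip_markup(raw_text):
    """Drop <F>...</F> filler spans, remove remaining tags, lowercase, strip punctuation."""
    # Remove filler spans together with their contents.
    no_fillers = re.sub(r"<F>.*?</F>", " ", raw_text, flags=re.DOTALL)
    # Remove any remaining tag but keep the text it wraps.
    no_tags = re.sub(r"<[^>]+>", " ", no_fillers)
    table = str.maketrans("", "", string.punctuation)
    tokens = [tok.translate(table) for tok in no_tags.lower().split()]
    # Discard tokens that became empty after punctuation removal.
    return [tok for tok in tokens if tok]

print(strip_markup("<B><F>er</F> I like Tennis.</B>"))
```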

**Slang Filtering (slang_remover)**

This function refines the cleaned text further by identifying and removing non-standard or slang words. It leverages a predefined list of standard words from the NLTK library to filter out tokens that do not match. The output is a revised token list that excludes any word not recognized as standard, preparing the text for deeper analysis.
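
The filtering step can be sketched with a small stand-in vocabulary. The word list below is made up for illustration (the notebook uses the NLTK word list), and `keep_known_words` is a hypothetical helper that also reports how many tokens were dropped.

```python
def keep_known_words(tokens, known_words):
    """Return the tokens found in known_words and the number of tokens removed."""
    known = set(known_words)  # set lookup keeps each membership test cheap
    kept = [tok for tok in tokens if tok in known]
    return kept, len(tokens) - len(kept)

# Made-up vocabulary standing in for a dictionary word list.
vocabulary = ["i", "play", "tennis"]
kept, removed = keep_known_words(["i", "wanna", "play", "tennis"], vocabulary)
print(kept, removed)
```

Any token missing from the list is dropped, so the count reflects everything the list does not cover, not slang alone.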

**Lexical Diversity Calculation (lexical_diversity)**

This function measures the lexical diversity of the text by computing the ratio of unique words to the total number of words. The resulting value indicates the richness of the vocabulary used in the text, providing a key metric for comparison.
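
A worked example of the ratio on an invented sentence is shown below. The guard for an empty token list is an addition for safety in this sketch; `type_token_ratio` is a hypothetical name.

```python
def type_token_ratio(tokens):
    """Unique words divided by total words; 0.0 for an empty text."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "the cat sat on the mat and the dog sat too".split()
ttr = type_token_ratio(sample)  # 8 distinct words out of 11 tokens
print(round(ttr, 2))
```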

**Score Computation (calculate_score)**

This function calculates scores for lexical diversity and richness with and without stopwords. Using dictionaries for native and non-native speakers (nativeDict and nonNativeDict), the function processes the text files, computes the scores for each group, and derives an average score for comparison. The results are then neatly formatted for presentation.
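
The sketch below shows one way the group averages, with and without stopwords, could be organized. The two texts and the tiny stop list are invented, and `hapax_ratio` and `mean_score` are hypothetical helpers, not the notebook's functions.

```python
from collections import Counter

STOP_WORDS = {"the", "on", "and", "too"}  # made-up stop list for illustration

def hapax_ratio(tokens):
    """Share of tokens whose word occurs exactly once in the text."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(tokens) if tokens else 0.0

def mean_score(texts, metric, drop_stop_words=False):
    """Average a per-text metric over every text in a group."""
    scores = []
    for tokens in texts.values():
        if drop_stop_words:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        scores.append(metric(tokens))
    return round(sum(scores) / len(scores), 2)

group = {
    "a.txt": "the cat sat on the mat".split(),
    "b.txt": "the dog and the cat sat too".split(),
}
print(mean_score(group, hapax_ratio))
print(mean_score(group, hapax_ratio, drop_stop_words=True))
```

Passing the metric as an argument lets the same loop serve lexical diversity and hapax richness.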

**Proficiency Filtering (proficiency_remover)**

This function filters the dataset by speaker proficiency. It reads the SST levels from an Excel file and keeps only the texts of speakers whose level equals the value passed in as minimumProf; all other texts are dropped. Because SST levels run from 1 to 9, calling it with 9 is equivalent to setting a minimum threshold of 9. The retained data is returned as a new dictionary for further processing.
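
A minimal sketch of level-based selection is given below, using plain Python rows in place of the Excel sheet. Matching on the full filename stem via `os.path.splitext` is an assumption of this sketch (the notebook compares the first eight characters of each name), and all data and the `select_by_level` name are hypothetical.

```python
import os

def select_by_level(texts, level_rows, level):
    """Keep the texts whose speaker has exactly the requested SST level."""
    wanted = {os.path.splitext(name)[0] for name, sst in level_rows if sst == level}
    return {
        name: tokens
        for name, tokens in texts.items()
        if os.path.splitext(name)[0] in wanted
    }

# Made-up texts and level rows for illustration.
texts = {
    "file00001.txt": ["hello"],
    "file00002.txt": ["good", "morning"],
    "file00003.txt": ["yes"],
}
level_rows = [("file00001", 4), ("file00002", 9), ("file00003", 9)]
selected = select_by_level(texts, level_rows, 9)
print(sorted(selected))
```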

**Semantic Diversity Analysis (semantic_diversity)**

This function evaluates semantic diversity by calculating a score for each word in the text based on its semantic properties. It aggregates these scores across all words in each file and computes an average for both native and non-native speaker groups. This metric provides insights into the range of meanings conveyed by each group’s vocabulary.
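
The per-text scoring step can be illustrated with a tiny tab-separated resource. The ratings below are invented and are not values from the real semantic diversity file; `parse_ratings` and `total_rating` are hypothetical names.

```python
def parse_ratings(lines):
    """Turn 'word<TAB>rating' lines into a dict, skipping malformed rows."""
    ratings = {}
    for line in lines:
        parts = line.split("\t")
        if len(parts) != 2:
            continue
        try:
            ratings[parts[0]] = float(parts[1])
        except ValueError:
            continue  # for example, a header row
    return ratings

def total_rating(tokens, ratings):
    """Sum the ratings of the tokens that appear in the resource."""
    return sum(ratings[tok] for tok in tokens if tok in ratings)

# Made-up resource lines, including a header and a malformed row.
lines = ["word\tSemD", "bank\t2.0", "run\t2.5", "oak\t1.0", "broken line"]
ratings = parse_ratings(lines)
score = total_rating(["the", "bank", "run", "run"], ratings)
print(score)
```

Words without a rating contribute nothing, so the total depends on both the coverage of the resource and the length of the text.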

**Concreteness Scoring (concrete_rating)**

This function assesses the concreteness of language used by calculating scores for individual words, based on a predefined concreteness rating. It sums these ratings for all words in each file and computes an average for native and non-native groups separately. This analysis highlights the differences in how abstract or concrete their language tends to be.
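
The notebook calls `concrete_rating()`, but its definition is not reproduced in this report. The sketch below is therefore only a guess at its shape, following the description above: sum the ratings of the rated words in each text, then average over the texts in a group. The function name, ratings, and texts are all hypothetical.

```python
def mean_total_concreteness(group_texts, concreteness):
    """Average, over texts, of the summed concreteness ratings of rated words."""
    totals = [
        sum(concreteness[tok] for tok in tokens if tok in concreteness)
        for tokens in group_texts.values()
    ]
    return sum(totals) / len(totals)

# Made-up ratings and texts for illustration only.
concreteness = {"tree": 5.0, "house": 5.0, "peace": 1.5}
group = {
    "a.txt": ["tree", "house", "peace"],
    "b.txt": ["peace", "and", "peace"],
}
result = mean_total_concreteness(group, concreteness)
print(result)
```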

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.corpus import words
nltk.download(["punkt", "stopwords", "words"])
import string
from collections import Counter
import pandas as pd
import os

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
def populate_dicts():
  # Read every .txt transcript in each folder and store its cleaned token list under the filename.
  nonNativeDir = "/content/drive/MyDrive/LING229 Final/nonNativeTexts/JLE_texts/"
  for filename in os.listdir(nonNativeDir):
    if filename.endswith(".txt"):
      # The with-block closes each file once it has been read.
      with open(nonNativeDir + filename, "r", encoding="UTF-8") as f:
        nonNativeDict.setdefault(filename, text_cleaner(f.read()))

  nativeDir = "/content/drive/MyDrive/LING229 Final/nativeTexts/JLE_texts/"
  for filename in os.listdir(nativeDir):
    if filename.endswith(".txt"):
      with open(nativeDir + filename, "r", encoding="UTF-8") as f:
        nativeDict.setdefault(filename, text_cleaner(f.read()))

In [None]:
# Module-level dictionaries mapping each filename to its cleaned token list.
nonNativeDict = {}
nativeDict = {}

In [None]:

#Takes text and returns cleaned text

def text_cleaner(textFileInput):
  textFileInput = textFileInput.split()
  cleanedOutput = []
  stage4 = False
  intervieweeTalking = False
  interviewerTalking = False
  filler = False
  selfCorrection = False
  laughter = False
  nonVerbalSound = False
  overlappingSpeach = False
  shortPause = False
  longPause = False
  unclearPassage = False
  totallyUnclearPassage = False
  followUp = False


  for word in textFileInput:
    if word == "<stage4>":
      stage4 = True
    elif word == "</stage4>":
      stage4 = False

    if "<B>" in word:
      intervieweeTalking = True



    if "<F>" in word:
      filler = True
    if "</F>" in word:
      filler = False


    if "<A>" in word:
      interviewerTalking = True
    if "</A>" in word:
      interviewerTalking = False


    if "<nvs>" in word:
      nonVerbalSound = True
    if "</nvs>" in word:
      nonVerbalSound = False


    if "<OL>" in word:
      overlappingSpeach = True
    if "</OL>" in word:
      overlappingSpeach = False


    if "<laughter>" in word:
      laughter = True
    if "</laughter>" in word:
      laughter = False

    if "<.>" in word:
      shortPause = True
    if "</.>" in word:
      shortPause = False

    if "<..>" in word:
      longPause = True
    if "</..>" in word:
      longPause = False

    if "<?>" in word:
      unclearPassage = True
    if "</?>" in word:
      unclearPassage = False

    if "<??>" in word:
      totallyUnclearPassage = True
    if "</??>" in word:
      totallyUnclearPassage = False

    if "<followup>" in word:
      followUp = True
    if "</followup>" in word:
      followUp = False




    if "<F>" in word or "</F>" in word:
      continue
    if "<A>" in word or "</A>" in word:
      continue
    if "<nvs>" in word or "</nvs>" in word:
      continue
    if "<OL>" in word or "</OL>" in word:
      continue
    if "<laughter>" in word or "</laughter>" in word:
      continue
    if "<.>" in word or "</.>" in word:
      continue
    if "<..>" in word or "</..>" in word:
      continue
    if "<?>" in word or "</?>" in word:
      continue
    if "<??>" in word or "</??>" in word:
      continue
    if "<followup>" in word or "</followup>" in word:
      continue



    if "<SC>" in word and "</SC>" in word:
      continue
    elif "<SC>" in word:
      selfCorrection = True
      continue
    elif "</SC>" in word:
      selfCorrection = False
      continue



    if stage4 == True and intervieweeTalking == True and filler == False and interviewerTalking == False and nonVerbalSound == False and overlappingSpeach == False and laughter == False and shortPause == False and longPause == False and unclearPassage == False and totallyUnclearPassage == False and followUp == False:
      cleanedOutput.append(word)

    if "</B>" in word:
      intervieweeTalking = False

  #Convert all text to lowercase for uniformity and strip away any tags
  cleanedOutput = remove_tags(cleanedOutput)
  cleanedOutput = [word.lower() for word in cleanedOutput]

  #Eliminate punctuation from the text using a translation table
  table = str.maketrans('', '', string.punctuation)
  cleanedOutput = [w.translate(table) for w in cleanedOutput]

  #Filter out any blank spaces to ensure accurate lexical diversity calculations
  cleanedOutput = [word for word in cleanedOutput if word != ""]

  return cleanedOutput


In [None]:
# Further cleans the text by removing residual tags. Tags are defined as text enclosed within < and >.

def remove_tags(inputList):
  # Build a new list instead of editing inputList while iterating over it.
  cleaned = []
  for word in inputList:
    tempWord = ""
    tagFound = False
    for char in word:
      if char == "<":
        tagFound = True

      # Keep only characters that fall outside a tag.
      if not tagFound:
        tempWord += char

      if char == ">":
        tagFound = False
    cleaned.append(tempWord)
  return cleaned

In [None]:
# Filters out slang terms from the provided list of words.

def slang_remover(inputList):
  # Convert the NLTK word list to a set so each membership test is fast.
  proper_words = set(words.words())
  tempList = []
  for word in inputList:
    if word in proper_words:
      tempList.append(word)
  return tempList

In [None]:
#Calculate lexical diversity
def lexical_diversity(inputList):
    return len(set(inputList))/len(inputList)

def calculate_score():
  #Lexical diversity of non-native text
  allNonNativeLexDiv = []
  for key in nonNativeDict.keys():
    allNonNativeLexDiv.append(lexical_diversity(nonNativeDict.get(key)))

  #Lexical diversity of native text
  allNativeLexDiv = []
  for key in nativeDict.keys():
    allNativeLexDiv.append(lexical_diversity(nativeDict.get(key)))

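  #Calculate hapax richness: the share of tokens whose word occurs exactly once (spelled "hypax" in the identifiers and output)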
  def hypax_richness(inputList):
    c = Counter(inputList)
    uniqueWords = set(k for k,v in c.items() if v==1)
    return len(uniqueWords) / len(inputList)


  allNonNativeHypax = []
  for key in nonNativeDict.keys():
    allNonNativeHypax.append(hypax_richness(nonNativeDict.get(key)))

  allNativeHypax = []
  for key in nativeDict.keys():
    allNativeHypax.append(hypax_richness(nativeDict.get(key)))

  #Remove stop words from both dicts
  stop_words = set(stopwords.words('english'))

  removedStopWordsNonNative = {}
  for key in nonNativeDict.keys():
    tempList = []
    for word in nonNativeDict.get(key):
      if word not in stop_words:
        tempList.append(word)
    removedStopWordsNonNative.setdefault(key, tempList)


  removedStopWordsNative = {}
  for key in nativeDict.keys():
    tempList = []
    for word in nativeDict.get(key):
      if word not in stop_words:
        tempList.append(word)
    removedStopWordsNative.setdefault(key, tempList)


  # Calculate lexical diversity for non-native texts after removing stop words.
  allNonNativeLexDivStop = []
  for key in removedStopWordsNonNative.keys():
    allNonNativeLexDivStop.append(lexical_diversity(removedStopWordsNonNative.get(key)))

  # Compute lexical diversity for native texts after stop words have been removed.

  allNativeLexDivStop = []
  for key in removedStopWordsNative.keys():
    allNativeLexDivStop.append(lexical_diversity(removedStopWordsNative.get(key)))


  allNonNativeHypaxStop = []
  for key in removedStopWordsNonNative.keys():
    allNonNativeHypaxStop.append(hypax_richness(removedStopWordsNonNative.get(key)))

  allNativeHypaxStop = []
  for key in removedStopWordsNative.keys():
    allNativeHypaxStop.append(hypax_richness(removedStopWordsNative.get(key)))


  print("Average lexical diversity across non-native texts:", round(sum(allNonNativeLexDiv) / len(allNonNativeLexDiv), 2))
  print("Average lexical diversity across native texts:", round(sum(allNativeLexDiv) / len(allNativeLexDiv), 2))

  print("Average hypax richness across non-native texts:", round(sum(allNonNativeHypax) / len(allNonNativeHypax), 2))
  print("Average hypax richness across native texts:", round(sum(allNativeHypax) / len(allNativeHypax), 2))

  print("-------------------------")
  print("With stopwords removed: ")
  print("Average lexical diversity across non-native texts:", round(sum(allNonNativeLexDivStop) / len(allNonNativeLexDivStop), 2))
  print("Average lexical diversity across native texts:", round(sum(allNativeLexDivStop) / len(allNativeLexDivStop), 2))

  print("Average hypax richness across non-native texts:", round(sum(allNonNativeHypaxStop) / len(allNonNativeHypaxStop), 2))
  print("Average hypax richness across native texts:", round(sum(allNativeHypaxStop) / len(allNativeHypaxStop), 2))
  print("==============================================================")

In [None]:
def proficiency_remover(inputExcelFile, minimumProf, dictBeingRemoved):
  # Load the spreadsheet that maps each filename to its SST proficiency level.
  df = pd.read_excel(inputExcelFile)

  filenameCol = df["Filename"].tolist()
  sstCol = df["SST_Level"].tolist()

  # Collect the filename prefixes of speakers at the requested level.
  # Note: despite the parameter name, this is an equality test, and [:8] keeps only the
  # first eight characters of a name (e.g. "file0099" for "file00999.txt").
  keptPrefixes = set()
  for filename, sstLevel in zip(filenameCol, sstCol):
    if sstLevel == minimumProf:
      keptPrefixes.add(filename[:8])

  # Keep only the texts whose filename prefix belongs to a selected speaker.
  tempDict = {}
  for key in dictBeingRemoved.keys():
    if key[:8] in keptPrefixes:
      tempDict[key] = dictBeingRemoved.get(key)
  return tempDict

In [None]:
def profeciency_removed(prof):
  global nonNativeDict, nativeDict
  tempDict = nonNativeDict
  nonNativeDict = proficiency_remover("/content/drive/MyDrive/LING229 Final/JLE_data_levels.xlsx", prof, nonNativeDict)
  print("Now with the minimum SST proficiency score set to " + str(prof) + ": ")
  calculate_score()
  nonNativeDict = tempDict

In [None]:
# Create a helper function to generate dictionaries.
# (A later cell redefines this helper so that it also accepts a local file path.)
import requests
def get_word_rating_resource(url):

  # Read the raw text and divide it into lines based on newline characters.
  raw = requests.get(url).text.split('\n')

  # Split each row into a word and its rating, converting the rating to a float rounded to three decimals.
  # The conditional check prevents indexing errors for incomplete rows in the resource.
  raw_list = [(pair.split('\t')[0], round(float(pair.split('\t')[1]), 3)) for pair in raw if len(pair.split('\t')) == 2]

  # Construct a dictionary from the data and return it.
  return dict(raw_list)

In [None]:
def calculate_slandRemoved():
  global nonNativeDict, nativeDict
  nonNativeDictSlangFound = 0
  for key in nonNativeDict.keys():
    lengthBeforeRemoval = len(nonNativeDict[key])
    nonNativeDict[key] = slang_remover(nonNativeDict.get(key))
    nonNativeDictSlangFound += (lengthBeforeRemoval - len(nonNativeDict[key]))


  nativeDictSlangFound = 0
  for key in nativeDict.keys():
    lengthBeforeRemoval = len(nativeDict[key])
    nativeDict[key] = slang_remover(nativeDict.get(key))
    nativeDictSlangFound += (lengthBeforeRemoval - len(nativeDict[key]))

  print("Total slang words removed non-native:", nonNativeDictSlangFound)
  print("Total slang words removed native:", nativeDictSlangFound)



  print("With Slang words removed: ")
  calculate_score()

In [None]:
def semantic_diversity():
    global nonNativeDict, nativeDict

    # Specify the local path to the semantic diversity file in Google Drive
    semd_path = '/content/drive/MyDrive/LING229 Final/semantic_diversity.txt'

    # Read the semantic diversity data from the file using open()
    with open(semd_path, 'r') as f:
        raw = f.read().split('\n')

    # Split each row into a word and its rating, converting the rating to a float rounded to three decimals.
    # The conditional check prevents indexing errors for incomplete rows in the resource.

    raw_list = [(pair.split('\t')[0], round(float(pair.split('\t')[1]), 3)) for pair in raw if len(pair.split('\t')) == 2]

    # Construct a dictionary from the data
    semd_dict = dict(raw_list)

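    # Only the first 20 non-native texts are scored here. A local name is used so that
    # the global dictionary is not truncated as a side effect.
    nonNativeSample = dict(list(nonNativeDict.items())[:20])
    print(nonNativeSample.keys())
    # Sum the ratings of every rated word, then divide by the number of texts.
    totalSemdNonNative = 0
    for key in nonNativeSample.keys():
        for word in nonNativeSample[key]:
            if word in semd_dict:
                totalSemdNonNative += semd_dict[word]
    totalSemdNonNative /= len(nonNativeSample)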

    totalSemdNative = 0
    for key in nativeDict.keys():
        for word in nativeDict[key]:
            if word in semd_dict.keys():
                totalSemdNative += semd_dict[word]
    totalSemdNative /= len(nativeDict)

    print("Total semantic diversity rating for non-native texts:", round(totalSemdNonNative, 2))
    print("Total semantic diversity rating for native texts:", round(totalSemdNative, 2))
    print("==============================================================")

In [None]:
import os

def get_word_rating_resource(resource):

    if os.path.isfile(resource):  # Check if the resource is a local file
        with open(resource, 'r') as file:
            content = file.read().strip().splitlines()
    else:  # Otherwise, assume it's a URL
        import requests
        response = requests.get(resource)
        content = response.text.strip().splitlines()

    # Parse each line into a dictionary
    ratings = {}
    for line in content:
        parts = line.split("\t")
        if len(parts) == 2:  # Ensure line has a word and rating
            word, rating = parts
            try:
                ratings[word] = float(rating)
            except ValueError:
                pass  # Skip lines with invalid ratings
    return ratings


In [None]:
populate_dicts()
calculate_score()
profeciency_removed(9)
concrete_rating()
calculate_slandRemoved()
semantic_diversity()

Average lexical diversity across non-native texts: 0.53
Average lexical diversity across native texts: 0.42
Average hypax richness across non-native texts: 0.34
Average hypax richness across native texts: 0.25
-------------------------
With stopwords removed: 
Average lexical diversity across non-native texts: 0.71
Average lexical diversity across native texts: 0.66
Average hypax richness across non-native texts: 0.52
Average hypax richness across native texts: 0.47
Now with the minimum SST proficiency score set to 9: 
Average lexical diversity across non-native texts: 0.52
Average lexical diversity across native texts: 0.42
Average hypax richness across non-native texts: 0.34
Average hypax richness across native texts: 0.25
-------------------------
With stopwords removed: 
Average lexical diversity across non-native texts: 0.71
Average lexical diversity across native texts: 0.66
Average hypax richness across non-native texts: 0.54
Average hypax richness across native texts: 0.47
Tota

  warn(msg)


Total slang words removed non-native: 9959
Total slang words removed native: 0
With Slang words removed: 
Average lexical diversity across non-native texts: 0.51
Average lexical diversity across native texts: 0.42
Average hypax richness across non-native texts: 0.32
Average hypax richness across native texts: 0.25
-------------------------
With stopwords removed: 
Average lexical diversity across non-native texts: 0.7
Average lexical diversity across native texts: 0.66
Average hypax richness across non-native texts: 0.51
Average hypax richness across native texts: 0.47
dict_keys(['file00999.txt', 'file01015.txt', 'file00989.txt', 'file00986.txt', 'file01011.txt', 'file01005.txt', 'file01012.txt', 'file01051.txt', 'file01022.txt', 'file01009.txt', 'file01033.txt', 'file01006.txt', 'file00967.txt', 'file01048.txt', 'file00970.txt', 'file00976.txt', 'file00954.txt', 'file00968.txt', 'file00990.txt', 'file00973.txt'])
Total semantic diversity rating for non-native texts: 195.02
Total seman

# **4. Results**

The analysis of lexical diversity and hapax richness shows clear differences between non-native and native speakers. In the original texts, non-native speakers had a higher average lexical diversity (0.53) than native speakers (0.42), indicating a broader range of vocabulary relative to the amount of speech. Non-native speakers also had a higher hapax richness (0.34) than native speakers (0.25), meaning that a larger share of their tokens were words used only once.

After stopwords were removed, both groups showed higher lexical diversity and hapax richness. For non-native speakers, lexical diversity rose to 0.71 and hapax richness to 0.52. For native speakers, lexical diversity rose to 0.66 and hapax richness to 0.47. Removing frequently repeated stopwords likely gives a clearer view of the content words each group used.

Semantic diversity and concreteness ratings further distinguish the two groups. Each score is the sum of the ratings of all rated words in a text, averaged over the texts in the group. Native speakers had a much higher semantic diversity score (649.96) than non-native speakers (195.02), and a much higher concreteness score (815.53 against 282.51). On their face, these scores point to more contextually varied and more tangible vocabulary among native speakers, although because both scores are sums they also grow with the length of a text.

Filtering by SST proficiency and removing slang refined the picture further. Restricting the non-native texts to SST level 9 left the averages almost unchanged (lexical diversity 0.52, hapax richness 0.34). The slang filter removed 9,959 words from the non-native texts and none from the native texts, which points to a difference in how much vocabulary outside the NLTK word list each group used.

Overall, these findings contribute to understanding the lexical and semantic characteristics of native and non-native speakers. The analysis underscores the importance of considering various linguistic metrics to capture the complexity of language use across different speaker groups.

# **5. Interpreting with Research Questions**

*When ranking lexical richness and diversity based on a ratio-based scoring system, which group—native or non-native speakers—demonstrates higher complexity?*

Scoring lexical richness and diversity as ratios produced a result that runs against initial expectations: non-native speakers showed higher lexical diversity than native speakers. The average lexical diversity was 53% for non-native speakers and 42% for native speakers. After stop words were removed, these values rose to 71% and 66%, respectively, so the gap narrowed from 11 to 5 percentage points but did not close. This suggests that non-native speakers employ a more diverse vocabulary, while native speakers rely more heavily on repeated words, including stop words, which contribute less to the complexity of the text. Non-native speakers retain slightly higher lexical diversity even once stop words are taken out.

To explore lexical richness further, the analysis included hapax richness (spelled "hypax richness" in the program output), which measures the proportion of words that occur only once relative to the total number of words in a text. It is used alongside the Type-Token Ratio (TTR), which tends to decrease as text length increases and can skew comparisons across texts of varying lengths; the hapax proportion is also affected by text length, so both measures are best compared across texts of similar size. Hapax richness was 34% for non-native speakers and 25% for native speakers, indicating that a larger share of non-native speakers' tokens were words used only once. After removing stop words, it rose to 52% for non-native speakers and 47% for native speakers. This aligns with the lexical diversity results and supports the conclusion that non-native speakers' texts, in this particular corpus, show slightly greater lexical complexity.
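
The length effect mentioned above can be seen in a toy example. The sketch computes both ratios on an invented short text, on the same text repeated three times, and on a copy of the longer text truncated to the short text's length; the function names are hypothetical.

```python
from collections import Counter

def ttr(tokens):
    """Type-token ratio: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens)

def hapax_ratio(tokens):
    """Share of tokens whose word occurs exactly once."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(tokens)

short_text = "the cat sat on the mat".split()
long_text = short_text * 3                   # same vocabulary, three times the length
truncated = long_text[:len(short_text)]      # cut back to a common length

for label, tokens in [("short", short_text), ("long", long_text), ("truncated", truncated)]:
    print(label, round(ttr(tokens), 2), round(hapax_ratio(tokens), 2))
```

Comparing texts after cutting them to a common length is one simple way to keep length from driving the difference; it is offered here as an option, not as the method used in this project.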

Taken together, these findings suggest that non-native speakers, on average, use a more varied and lexically rich vocabulary than native speakers, at least within the analyzed corpus. While native speakers may have a deeper familiarity with idiomatic expressions and cultural nuances, non-native speakers appear to draw on a broader range of vocabulary relative to how much they say. This difference underscores the value of considering multiple metrics, such as lexical diversity and hapax richness, when comparing language complexity across speaker groups.



---



*To what extent do native and non-native speakers differ in their use of non-standard vocabulary, and how does this influence their respective levels of lexical diversity and richness?*

Non-standard words, in this context, are defined as any words not found in a standard English dictionary, including slang terms. It is often assumed that non-native speakers might use fewer non-standard words since their vocabulary is likely shaped by formal language instruction, while native speakers are more exposed to informal contexts like pop culture and colloquial interactions. As a result, native speakers are expected to exhibit a higher lexical richness and diversity when non-standard words are considered.

When non-standard words were removed (based on their absence from the NLTK words list), only the non-native scores changed. Non-native speakers' lexical diversity dropped slightly, from 53% to 51%, while native speakers' lexical diversity stayed at 42%; the program reported 9,959 words removed from the non-native texts and none from the native texts. Hapax richness followed the same pattern, falling from 34% to 32% for non-native speakers and remaining at 25% for native speakers. These shifts suggest that words outside the standard list contribute modestly to the measured lexical richness of non-native speakers in this corpus and not at all to that of native speakers.

When both non-standard words and stop words were removed, lexical diversity rose from 51% to 70% for non-native speakers and from 42% to 66% for native speakers, leaving a gap of 4 percentage points in favor of non-native speakers. The larger rise for native speakers (24 points against 19) suggests that stop words made up a somewhat larger share of their texts. For hapax richness, non-native speakers rose from 32% to 51% (19 points), while native speakers rose from 25% to 47% (22 points). Non-native speakers therefore keep a higher lexical diversity once these elements are removed, although by a smaller margin.

In conclusion, the expectation that native speakers would use more non-standard words is not borne out by this corpus: the filter removed words only from the non-native texts, lowering their lexical diversity and hapax richness by about two percentage points each, while the native scores were unaffected. Non-native speakers remained ahead on both measures after filtering, so non-standard vocabulary accounts for only a small part of the difference between the groups.


---



*How do varying proficiency levels among native and non-native English speakers impact their usage of non-standard language? Are proficient non-native speakers more or less likely to use non-standard words compared to native speakers?*

The analysis of proficiency levels across non-native speakers gives further insight into the relationship between language competence and lexical richness. At the highest proficiency level (SST = 9), non-native speakers had an average lexical diversity of 0.52, close to the value at level 8. At lower proficiency levels diversity declined slightly, with levels 3 and below showing values as low as 0.50 or 0.51. The pattern suggests that advanced learners stabilize in their lexical usage, while less proficient speakers fluctuate more because of their limited vocabulary range.

Hapax richness remained consistent across most proficiency levels, with non-native speakers averaging approximately 0.34. At the lowest proficiency levels (SST = 2 or 1), it increased slightly to 0.35 or 0.36. This might indicate that less proficient speakers compensate for a limited command of grammatical structures by introducing diverse, albeit inconsistent, vocabulary. Native speakers' scores stayed at 0.42 for lexical diversity and 0.25 for hapax richness in every run, which is expected because the proficiency filter is applied only to the non-native texts.


---




*What role do metrics such as semantic diversity and concreteness play in distinguishing between the speech patterns of native and non-native speakers? Are these metrics effective for such comparisons?*

Semantic Diversity Results:

Total semantic diversity rating for non-native texts: 195.02

Total semantic diversity rating for native texts: 649.96

Semantic diversity is a computationally derived measure that assesses how contextually variable a word can be across its usage. It analyzes a corpus to determine how likely a particular word is to appear in diverse contexts. Words with high semantic diversity typically have multiple meanings or senses, while those with low semantic diversity tend to have more specific or narrowly defined meanings. This measure is crucial for understanding the flexibility and adaptability of language use, particularly between different groups of speakers.

In the context of this study, native speakers' total semantic diversity rating was more than three times that of non-native speakers (649.96 versus 195.02). This substantial gap suggests that native speakers tend to use more contextually adaptable and ambiguous words, reflecting their advanced familiarity with the subtleties of the language. Non-native speakers, on the other hand, appear to prefer more direct, context-independent word choices, likely due to limited exposure to nuanced linguistic environments. These findings align with the hypothesis that native speakers' speech incorporates more ambiguous language, requiring a deeper reliance on contextual interpretation.
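A total rating of this kind is typically built by looking each token up in a word-to-rating table and summing the matches. The sketch below shows that idea and also reports a per-token mean. The `rating_summary` helper and the values in `toy_semd` are placeholders invented for illustration and are not drawn from any published norms.

```python
def rating_summary(tokens, ratings):
    """Sum and average the ratings of tokens that appear in the `ratings` lookup."""
    rated = [ratings[t] for t in tokens if t in ratings]
    total = sum(rated)
    return {
        "total": total,
        "rated_tokens": len(rated),
        "mean": total / len(rated) if rated else 0.0,
    }

# Placeholder ratings, not taken from any real semantic diversity dataset.
toy_semd = {"run": 2.0, "set": 2.5, "table": 1.5, "bread": 1.0}

short_text = "run set".split()
long_text = "run set run set table bread".split()

short_summary = rating_summary(short_text, toy_semd)
long_summary = rating_summary(long_text, toy_semd)
```

In the toy data the longer text has the larger total but the smaller mean, which is why reporting the mean next to the total can help when the two groups contribute different amounts of text.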

Concreteness Ratings Results:

Total concreteness rating for non-native texts: 282.51

Total concreteness rating for native texts: 815.53

Concreteness refers to the degree to which a word represents a tangible or abstract concept. Concrete words like "tree" or "house" are easier to visualize and are generally more straightforward, while abstract words like "peace" or "freedom" are more conceptual and require cognitive interpretation. A higher concreteness score suggests that the text consists of more tangible and less abstract vocabulary.

Interestingly, native speakers also scored significantly higher on concreteness ratings, mirroring their performance in semantic diversity. This suggests that their vocabulary is not only more adaptable but also more grounded in universally recognizable concepts. One possible explanation is native speakers' frequent use of stopwords and function words, which add to the total concreteness rating, together with their broader exposure to idiomatic and figurative language.
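One way to check how much stopwords contribute to a concreteness total is to compute it twice, once with them and once without. The following is a hedged sketch: the `STOPWORDS` set, the `concreteness_total` helper, and the ratings in `toy_concreteness` (on an assumed 1 to 5 scale) are all illustrative placeholders, not the lists or norms used in the project.

```python
# Hypothetical stopword list and placeholder ratings (assumed scale: 1 = abstract, 5 = concrete).
STOPWORDS = {"the", "a", "of", "and", "in"}
toy_concreteness = {"the": 1.5, "of": 1.5, "tree": 5.0, "house": 5.0, "peace": 1.5}

def concreteness_total(tokens, ratings, skip=frozenset()):
    """Sum ratings for tokens found in `ratings`, ignoring any token in `skip`."""
    return sum(ratings[t] for t in tokens if t in ratings and t not in skip)

tokens = "the tree of the house".split()

with_stopwords = concreteness_total(tokens, toy_concreteness)
without_stopwords = concreteness_total(tokens, toy_concreteness, skip=STOPWORDS)

# Fraction of the total that comes from stopwords alone.
stopword_share = (with_stopwords - without_stopwords) / with_stopwords
```

If the stopword share differs noticeably between the native and non-native texts, that would support the explanation that function words are driving part of the gap.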

Discussion and Conclusion:

The findings from semantic diversity and concreteness ratings collectively highlight distinct differences in language use between native and non-native speakers. The higher semantic diversity scores for native speakers suggest that their linguistic choices are more context-dependent, leveraging the richness of ambiguous language. In contrast, non-native speakers tend to adopt a more straightforward and precise approach, likely influenced by language learning strategies emphasizing clarity.

Similarly, the higher concreteness ratings for native speakers underline their reliance on vocabulary that is both adaptable and grounded in tangible concepts. This pattern reflects their extensive exposure to varied linguistic contexts and the nuanced use of both abstract and concrete terms. Non-native speakers’ lower concreteness scores may indicate a more limited vocabulary, focusing on direct communication rather than abstract expression.

In addressing the question of what role semantic diversity and concreteness play in distinguishing the speech of native and non-native speakers, and whether these metrics are effective for such comparisons, the results provide a compelling argument for their effectiveness. They reveal deeper insights into the cognitive and contextual factors shaping linguistic choices, offering a more nuanced understanding of lexical diversity and richness beyond what simpler measures can provide.


---



# **Reflection**

This project presented several challenges, primarily in preparing the text data for analysis. A significant amount of time was spent processing the text to ensure that unnecessary tags, interruptions, and repetitive elements were removed while preserving the core content. The presence of interviewer interruptions, laughter, and other non-essential elements complicated this process and could potentially have influenced the accuracy of the results. Addressing these elements required meticulous attention, as leaving them in would have skewed the outcomes and detracted from the focus on linguistic differences.
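A cleaning step of this kind can be sketched with regular expressions. The tag names below are assumptions made for the example (interviewer turns in `<B>`, learner turns in `<A>`, non-speech spans such as `<laughter>`), and the corpus's real tag set should be checked against its documentation. The `clean_transcript` helper and the sample line are likewise hypothetical.

```python
import re

# Assumed markup: interviewer turns in <B>...</B>, learner turns in <A>...</A>,
# and non-speech annotations such as <laughter>...</laughter>. Adjust to the real tag set.
INTERVIEWER_TURN = re.compile(r"<B>.*?</B>", re.DOTALL)
NON_SPEECH = re.compile(r"<(laughter|nvs)>.*?</\1>", re.DOTALL)
ANY_TAG = re.compile(r"</?[^>]+>")

def clean_transcript(raw):
    """Drop interviewer turns and non-speech spans, then strip remaining tags."""
    text = INTERVIEWER_TURN.sub(" ", raw)   # remove everything the interviewer said
    text = NON_SPEECH.sub(" ", text)        # remove laughter and similar spans with their content
    text = ANY_TAG.sub(" ", text)           # keep the words inside other tags, drop the tags
    return " ".join(text.split())           # collapse runs of whitespace

# Invented example line, not taken from the corpus.
raw = "<B>How are you?</B><A>I'm <F>er</F> fine <laughter>laughs</laughter> thank you.</A>"
cleaned = clean_transcript(raw)
```

The order matters here: spans whose content should disappear are removed before the generic tag stripper runs, otherwise their inner text would be left behind as ordinary words.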

Another notable challenge was the efficiency and structure of the code. While the program functioned as intended, certain areas, particularly the slang remover function, were not as streamlined as they could have been. Repeated blocks of code that could have been consolidated into reusable functions added to the program's complexity. Time constraints prevented further refinement, but this experience highlights the importance of code efficiency and maintainability. In the future, revisiting the program with a focus on optimization would be a key priority.

There are numerous opportunities for improvement in this project. For instance, incorporating additional metrics and visualizing the results through graphs or dashboards could enhance the presentation and make trends more apparent at a glance. Utilizing advanced libraries and tools designed for text analysis would also improve the program’s functionality. Leveraging existing resources and datasets from linguistic research would allow for a more robust and comprehensive analysis. Such enhancements would not only save time but also elevate the quality of insights generated by the program.

From a data perspective, the project could benefit from a larger and more diverse dataset. While the current English subcorpus provided valuable insights, its size and scope were limited. Future iterations could involve analyzing texts from different regions and cultural contexts to explore how geographical and societal influences shape language use. A broader dataset would also provide a more nuanced understanding of non-standard word usage and speech ambiguity.

Overall, the project succeeded in highlighting key linguistic differences between native and non-native English speakers, including variations in lexical richness, lexical diversity, semantic diversity, and concreteness. However, there is significant room for improvement in both the methodology and the scope of the analysis. Revisiting this project with additional data, refined tools, and a stronger focus on efficiency would provide deeper insights and allow for a more thorough exploration of the intricacies of language. This project has laid a solid foundation for future research, illustrating the value of quantitative metrics in understanding linguistic complexity.

