# Language Learning App Prototype Notebook

Welcome to the Language Learning App Prototype! This notebook demonstrates an initial prototype for a language learning application. The app predicts and translates words that a learner might find challenging based on their proficiency level and the words they mark as unknown.

## How to Use This Notebook

1. **Run All Cells**: To get started, you need to run all the cells in this notebook. This will install the necessary libraries, download required data, and set up the environment.

2. **Interface Deployment**: Once all cells are run, a Gradio interface will be deployed at the bottom of this notebook. You will interact with this interface to use the app.

3. **Using the Interface**:
   - **Native Language**: Select your native language (currently only English is available).
   - **Target Language**: Select the target language you want to learn (currently only Spanish is available).
   - **Proficiency Level**: Choose your proficiency level from A1 to C2.
   - **Text Input**: Enter the text you want to read and learn from in the target language.
   - **Start**: Click the 'Start' button to begin the process. The app will process the text, predict unknown words, and provide translations.
   - **Input Unknown Words**: You can input any additional unknown words you encounter.
   - **Next Paragraph**: Click the 'Next Paragraph' button to process the next paragraph of text.
   - **Restart**: If you want to start over, click the 'Restart' button to reset the interface.


## Library Installation

In this section, we install the libraries the application needs: NLP tools, a translation client, word-frequency data, fuzzy string matching, and Gradio for building the user interface.

In [None]:
# Install necessary libraries
!pip install -q stanza deep-translator langdetect wordfreq wiktionaryparser nltk gradio rapidfuzz

## Data Download

We download and extract the CogNet data, a database of cognate pairs that includes English-Spanish pairs. This data helps identify cognates in the text, which the app treats as words the learner is likely to recognize.


In [None]:
# Download and extract the CogNet data
!wget https://github.com/kbatsuren/CogNet/raw/master/CogNet-v2.0.zip
!unzip CogNet-v2.0.zip

## Library Imports

This section imports all the necessary libraries that we will use throughout the notebook. These libraries provide functionalities such as natural language processing, translation, and data manipulation.


In [None]:
import cProfile
import pstats
import io
import gradio as gr
import stanza
from deep_translator import GoogleTranslator
from collections import defaultdict
from nltk.stem.snowball import SnowballStemmer
from wordfreq import word_frequency
from rapidfuzz import fuzz, process
import pandas as pd
import re
import time

## NLP Tools Initialization

We download the Stanza language models and initialize one pipeline for English and one for Spanish. Both pipelines handle tokenization (splitting text into words), lemmatization (reducing words to their base forms), and part-of-speech tagging (identifying grammatical roles). The Spanish pipeline also performs named entity recognition (detecting proper names and entities). Additionally, we initialize the Snowball Stemmer for Spanish, which reduces words to their stems and helps identify morphological similarities.


In [None]:
# Download Stanza language models for English and Spanish
stanza.download('en')
stanza.download('es')

# Initialize Stanza pipelines with specific components
nlp_native = stanza.Pipeline('en', processors='tokenize,lemma,pos')
nlp_target = stanza.Pipeline('es', processors='tokenize,lemma,pos,ner')

# Initialize the Snowball Stemmer for Spanish
stemmer = SnowballStemmer("spanish")


## Data Loading and Filtering

We load the CogNet data into a DataFrame and filter it to get the Spanish-English cognates. This filtered data will help us identify cognates in the input text, which are words that have a common etymological origin.


In [None]:
# Load the CogNet TSV file into a DataFrame
cognet_df = pd.read_csv(
    'CogNet-v2.0.tsv', sep='\t', header=None,
    names=['concept_id', 'lang1', 'word1', 'lang2', 'word2', 'translit1', 'translit2'],
    on_bad_lines='skip', engine='python'
)

# Filter the DataFrame to get Spanish-English cognates
cognet_sp_en = cognet_df[
    ((cognet_df['lang1'] == 'spa') & (cognet_df['lang2'] == 'eng')) |
    ((cognet_df['lang1'] == 'eng') & (cognet_df['lang2'] == 'spa'))
]

## Initializing Variables

This section initializes various variables and data structures that will hold the state of the application, such as paragraphs, known words, unknown words, and translations.


In [None]:

# Define frequency thresholds for different proficiency levels
frequency_thresholds = {
    'A1': 0.0005,
    'A2': 0.00005,
    'B1': 0.00001,
    'B2': 0.000005,
    'C1': 0.000001,
    'C2': 0.0000005
}

# Cache for translations to avoid repeated translations
translation_cache = {}

# Function to initialize the state variables
def initialize_variables():
    global state
    state = {
        'paragraphs': [],
        'current_paragraph_index': 0,
        'known_words': [],
        'unknown_words': [],
        'validated_translations': [],
        'all_final_unknown_words': [],
        'all_cognate_pairs': {},
        'final_unknown_words_dict': defaultdict(set),
        'original_word_mapping': {},
        'native_language': '',
        'target_language': '',
        'level': '',
        'final_unknown_word_counts': defaultdict(int),
        'nlp_cache': {},
        'frequency_cache': {},
        'ner_cache': {},
        'merged_paragraphs': []
    }

initialize_variables()

## Cognate Identification Function

This function identifies cognates between Spanish and English words using the CogNet data and independent similarity checks. Cognates are words in two languages that have a common etymological origin.
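
The spelling-similarity part of this check can be tried out without CogNet. The sketch below is only an approximation: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `rapidfuzz.fuzz.ratio` (the two scores are close but not identical), and the helper names `similarity` and `likely_cognates` are hypothetical, not part of the notebook.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for rapidfuzz's fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_cognates(spanish_words, english_words, threshold=65):
    """Pair up words whose spelling similarity meets the threshold."""
    return [
        (sp, en)
        for sp in spanish_words
        for en in english_words
        if similarity(sp, en) >= threshold
    ]

pairs = likely_cognates(["información", "perro"], ["information", "dog"])
print(pairs)
```

Spelling similarity alone also admits false friends, which is why the notebook combines it with CogNet lookups and POS checks.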


In [None]:

# Function to identify cognates
def find_cognates(spanish_sentences, english_sentences, cognet_df, similarity_threshold=65):
    cognates = []

    for sp_sentence, en_sentence in zip(spanish_sentences, english_sentences):
        spanish_words = [word.lower() for word in sp_sentence.split()]
        english_words = [word.lower() for word in en_sentence.split()]

        # Check for cognates in CogNet
        for sp_word in spanish_words:
            matches = cognet_df[(cognet_df['word1'].str.lower() == sp_word) | (cognet_df['word2'].str.lower() == sp_word)]
            for _, row in matches.iterrows():
                if row['lang1'] == 'spa' and row['lang2'] == 'eng':
                    en_word = row['word2'].lower()
                elif row['lang1'] == 'eng' and row['lang2'] == 'spa':
                    en_word = row['word1'].lower()
                else:
                    continue

                if en_word in english_words:
                    similarity = fuzz.ratio(sp_word, en_word)
                    if similarity >= similarity_threshold:
                        cognates.append((sp_word, en_word))
                        print(f"CogNet Cognate Identified: {sp_word} (Spanish) <-> {en_word} (English)")

        # Identify lemma-based cognates with POS tag check
        for sp_word in spanish_words:
            if len(sp_word) <= 3:
                continue
            sp_features = state['nlp_cache'].get(sp_word, {})
            for en_word in english_words:
                if len(en_word) <= 3:
                    continue
                en_features = state['nlp_cache'].get(en_word, {})

                # Debugging: uncomment to print the words being compared and their features
                # print(f"Comparing: {sp_word} (Spanish) with {en_word} (English)")
                # print(f"  Similarity: {fuzz.ratio(sp_word, en_word)}")
                # print(f"  Spanish POS: {sp_features.get('pos')}  English POS: {en_features.get('pos')}")
                # print(f"  Spanish Lemma: {sp_features.get('lemma')}  English Lemma: {en_features.get('lemma')}")

                similarity = fuzz.ratio(sp_word, en_word)
                same_pos = sp_features.get('pos') == en_features.get('pos')
                # Accept very close spellings outright; accept looser matches
                # only when the POS tags agree.
                pos_cutoff = min(50, similarity_threshold)
                if similarity >= 80 or (same_pos and similarity >= pos_cutoff):
                    cognates.append((sp_word, en_word))

    return cognates




## Batch Translation Function

Here we define the `batch_translate` function, which translates a list of sentences in batch mode. This function uses the GoogleTranslator to translate sentences from the target language to the native language. It also caches translations to avoid redundant API calls, improving performance. It uses retry logic to handle errors. The function attempts to translate each sentence up to three times before returning a fallback message. This ensures that temporary issues with the translation service do not cause the app to fail.
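
The cache-then-retry pattern can be tested offline. The following is a minimal sketch, assuming a translation callable is passed in; `translate_with_retry` and `flaky_translate` are hypothetical names, and `flaky_translate` merely simulates a service that fails on its first call.

```python
def translate_with_retry(sentences, translate, cache, max_retries=3):
    """Translate each sentence, reusing cached results and retrying on errors."""
    results = []
    for sentence in sentences:
        if sentence in cache:
            results.append(cache[sentence])
            continue
        for attempt in range(max_retries):
            try:
                cache[sentence] = translate(sentence)
                break
            except Exception as error:
                print(f"Attempt {attempt + 1} of {max_retries} failed: {error}")
        # Fall back to a placeholder if no attempt succeeded
        results.append(cache.get(sentence, "Translation not available"))
    return results

# Hypothetical stand-in for a translation service that fails on its first call
calls = {"count": 0}

def flaky_translate(sentence):
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("temporary outage")
    return {"hola": "hello"}.get(sentence, sentence)

cache = {}
first = translate_with_retry(["hola", "hola"], flaky_translate, cache)
print(first, calls["count"])
```

The second `"hola"` is served from the cache, so the simulated service is called only twice: one failure and one success.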


In [None]:
# Batch translation
def batch_translate(sentences, src, dest, max_retries=3):
    translations = []
    for sentence in sentences:
        if sentence in translation_cache:
            translations.append(translation_cache[sentence])
        else:
            for attempt in range(max_retries):
                try:
                    translation = GoogleTranslator(source=src, target=dest).translate(sentence)
                    if translation:
                        translations.append(translation)
                        translation_cache[sentence] = translation
                        break
                except Exception as e:
                    print(f"Error translating sentence '{sentence}': {e} (Attempt {attempt + 1} of {max_retries})")
            else:
                # Every attempt failed or returned nothing. The fallback is not cached,
                # so a later call can retry once the service recovers.
                translations.append("Translation not available")
    return translations

## Morphological Similarity Check Function

This section introduces the `is_similar_morphology` function. The function checks if two words are morphologically similar based on their stems and lemmas. This is useful for identifying words that are related or share a common root, which can help in predicting unknown words.
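
To make the rule concrete, here is a small sketch of the same stem and lemma comparison on hand-written feature dictionaries. The helper name `shares_root` is hypothetical, and the stems and frequencies are illustrative values, not real output from the Snowball stemmer or `wordfreq`.

```python
def shares_root(word1, word2, threshold):
    """Compare stems, then lemmas, in the order described above."""
    for key in ("stem", "lemma"):
        a, b = word1[key], word2[key]
        if a == b:
            return True
        # A partial overlap only counts when the second word is rare
        if (a in b or b in a) and word2["frequency"] < threshold:
            return True
    return False

# Illustrative, hand-written features (assumed values)
correr = {"stem": "corr", "lemma": "correr", "frequency": 1e-4}
corriendo = {"stem": "corr", "lemma": "correr", "frequency": 1e-5}
mesa = {"stem": "mes", "lemma": "mesa", "frequency": 1e-4}

print(shares_root(correr, corriendo, threshold=1e-5))
print(shares_root(correr, mesa, threshold=1e-5))
```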

In [None]:
# Function to check morphological similarity
def is_similar_morphology(word1, word2, threshold):
    stem1, stem2 = word1['stem'], word2['stem']
    lemma1, lemma2 = word1['lemma'], word2['lemma']
    freq2 = word2['frequency']

    if stem1 == stem2:
        return True
    if (stem1 in stem2 or stem2 in stem1) and freq2 < threshold:
        return True
    if lemma1 == lemma2:
        return True
    if (lemma1 in lemma2 or lemma2 in lemma1) and freq2 < threshold:
        return True
    return False

## Text Preprocessing Function

In this section, we define the `batch_preprocess_text` function. This function tokenizes, lemmatizes, and performs part-of-speech tagging on the input text using Stanza. It also calculates word frequencies and caches the results to improve performance. The preprocessing steps are essential for understanding the structure of the text and identifying which words might be challenging for learners.


In [None]:
# Preprocess text (tokenize, lemmatize, POS tagging)
def batch_preprocess_text(paragraphs, nlp, target_language, use_cache=True):
    batch_text = "\n\n".join(paragraphs)
    if use_cache and batch_text in state['nlp_cache']:
        return state['nlp_cache'][batch_text]

    doc = nlp(batch_text)
    sentences = [sentence.text for sentence in doc.sentences]
    words = []
    for sentence in doc.sentences:
        for word in sentence.words:
            if len(word.text) > 1:  # Ensure we are processing only whole words
                if word.text in state['frequency_cache']:
                    frequency = state['frequency_cache'][word.text]
                else:
                    frequency = word_frequency(word.text, target_language)
                    state['frequency_cache'][word.text] = frequency
                word_features = {
                    'text': word.text,
                    'lemma': word.lemma,
                    'pos': word.upos,
                    'frequency': frequency,
                    'stem': stemmer.stem(word.text)
                }
                words.append(word_features)
                state['nlp_cache'][word.text.lower()] = word_features


    if use_cache:
        state['nlp_cache'][batch_text] = (sentences, words)
    return sentences, words


## Named Entity Recognition (NER) Function

This section defines the `perform_ner` function, which performs named entity recognition on the input text using the Stanza pipeline. Named entities (e.g., names of people, places, organizations) are often known words for language learners, and recognizing them helps in accurately predicting unknown words.

In [None]:
# Perform Named Entity Recognition (NER)
def perform_ner(text, nlp):
    if text in state['ner_cache']:
        return state['ner_cache'][text]

    doc = nlp(text)

    entities = [entity.text.lower() for sentence in doc.sentences for entity in sentence.ents]
    state['ner_cache'][text] = entities
    return entities

## Translation Validation Function

This section defines the `validate_translation_in_context` function, which aims to ensure that the translations make sense within the given context. The function uses several checks:

1. **Initial Check (Direct Match)**: It directly matches the translated word with words in the translated sentence.
2. **Similarity Check**: It uses fuzzy string matching to find the most similar word in the translated sentence.
3. **POS Tag Check**: It looks for a word in the translated sentence whose part-of-speech (POS) tag matches that of the original Spanish word.

These checks help in providing accurate translations that fit well in the context of the sentences.
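
The first two checks can be sketched with the standard library alone. This is a simplified illustration, not the notebook's implementation: `difflib.get_close_matches` stands in for `rapidfuzz.process.extractOne`, the POS check is omitted because it needs a tagger, and `pick_contextual_word` is a hypothetical name.

```python
from difflib import get_close_matches

def pick_contextual_word(translation, translated_sentence, cutoff=0.8):
    """Return the translation, or 'translation/closest word' when only a fuzzy match exists."""
    words = translated_sentence.lower().replace(".", "").split()
    # Check 1: the translation appears verbatim in the sentence
    if translation.lower() in words:
        return translation
    # Check 2: fall back to the closest spelling above the cutoff
    close = get_close_matches(translation.lower(), words, n=1, cutoff=cutoff)
    if close:
        return f"{translation}/{close[0]}"
    return translation

print(pick_contextual_word("house", "The house is big."))
print(pick_contextual_word("run", "She runs every day."))
print(pick_contextual_word("cat", "The dog sleeps."))
```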

In [None]:
# Validate translation with context
def validate_translation_in_context(translation, original_sentences, translated_sentences, spanish_pos):
    for orig_sent, trans_sent in zip(original_sentences, translated_sentences):
        # Retrieve or compute the Stanza document for the translated sentence.
        # Looking the key up first avoids re-running the pipeline on a cache hit,
        # and the tuple key keeps these documents apart from the word- and
        # paragraph-level entries that share state['nlp_cache'].
        doc_key = ('native_doc', trans_sent)
        trans_doc = state['nlp_cache'].get(doc_key)
        if trans_doc is None:
            trans_doc = nlp_native(trans_sent)
            state['nlp_cache'][doc_key] = trans_doc

        # Initial Check: Direct match
        for word in trans_doc.sentences[0].words:
            if word.text.lower() == translation.lower():
                return word.text

        # Similarity Check: Find the most similar word
        words_in_trans_sent = [word.text for word in trans_doc.sentences[0].words]
        most_similar = process.extractOne(translation, words_in_trans_sent, scorer=fuzz.ratio, score_cutoff=80)
        similar_enough = process.extractOne(translation, words_in_trans_sent, scorer=fuzz.ratio, score_cutoff=70)

        if most_similar:
            similar_word = most_similar[0]
            for word in trans_doc.sentences[0].words:
                if word.text == similar_word:
                    return f"{translation}/{similar_word}"  # Return both the translation and the most similar word

        if similar_enough:
            similar_enough_word = similar_enough[0]
            for word in trans_doc.sentences[0].words:
                if word.text == similar_enough_word and word.upos == spanish_pos:
                    return f"{translation}/{similar_enough_word}"

        # POS Tag Check: Find a word with the same POS tag as the Spanish word
        pos_matches = [word.text for word in trans_doc.sentences[0].words if word.upos == spanish_pos]
        if len(pos_matches) == 1:
            return f"{translation}/{pos_matches[0]}"  # Return both the translation and the POS matching word if there's only one match
        elif len(pos_matches) > 1:
            return translation  # Stick to the individual translation if multiple POS matches are found

    # If no word with the same POS tag is found
    return translation


## Profiling Decorator

This section introduces a decorator, `profile_func`, that profiles a function each time it is called. It runs the function under `cProfile` and prints the ten entries with the highest cumulative time, which helps locate performance bottlenecks when optimizing the app.
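
As a standalone illustration of the pattern, the sketch below applies a similar decorator to a toy function. The names `profiled` and `sum_of_squares` are hypothetical, and the limit of five printed entries is an arbitrary choice for the example.

```python
import cProfile
import functools
import io
import pstats

def profiled(func):
    """Run func under cProfile and print the top entries by cumulative time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        profiler.enable()
        try:
            return func(*args, **kwargs)
        finally:
            profiler.disable()
            stream = io.StringIO()
            pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
            print(stream.getvalue())
    return wrapper

@profiled
def sum_of_squares(n):
    return sum(i * i for i in range(n))

total = sum_of_squares(1000)
```

The decorated function still returns its normal result; the profile report is printed as a side effect.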


In [None]:
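# Profile decorator
import functools

def profile_func(func):
    @functools.wraps(func)  # Keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        result = func(*args, **kwargs)
        pr.disable()
        # Collect the report in memory, sorted by cumulative time
        s = io.StringIO()
        sortby = 'cumulative'
        ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
        ps.print_stats(10)  # Show only the ten most expensive entries
        print(s.getvalue())
        return result
    return wrapper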


## Paragraph Processing Function

This section defines the `process_paragraph` function, which processes each paragraph of text to predict unknown words, translate sentences, and validate translations. It involves several steps:

1. **Preprocessing Text**: Tokenizes, lemmatizes, and performs POS tagging.
2. **Named Entity Recognition**: Identifies named entities in the text.
3. **Translation**: Translates sentences from the target language to the native language.
4. **Identifying Cognates**: Finds cognates between the target language and the native language.
5. **Identifying Unknown Words**: Determines which words are unknown to the learner based on frequency thresholds and similarity checks.

This function combines these steps to provide a comprehensive analysis of each paragraph.
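
The frequency-threshold rule in step 5 can be shown in isolation. In this sketch the thresholds are copied from the notebook, while the word frequencies are made-up illustrative values (not real `wordfreq` output) and `predict_unknown` is a hypothetical helper.

```python
# Thresholds copied from the notebook's frequency_thresholds
frequency_thresholds = {
    "A1": 0.0005, "A2": 0.00005, "B1": 0.00001,
    "B2": 0.000005, "C1": 0.000001, "C2": 0.0000005,
}

# Made-up frequencies for illustration only
example_frequencies = {"de": 0.05, "casa": 0.0003, "murciélago": 0.000002}

def predict_unknown(words, level, frequencies):
    """Return the words whose frequency falls below the level's threshold."""
    threshold = frequency_thresholds[level]
    return [w for w in words if frequencies.get(w, 0.0) < threshold]

words = ["de", "casa", "murciélago"]
for level in ("A1", "B1", "C2"):
    print(level, predict_unknown(words, level, example_frequencies))
```

Higher proficiency levels use lower thresholds, so fewer words are predicted as unknown.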


In [None]:
# Function to generate a word report
def generate_word_report(word_report):
    print("\nWord Report:")
    for report in word_report:
        print(f"Word: {report['text']}")
        print(f"  Is known word: {'yes' if report['is_known'] else 'no'}")
        print(f"  Is input unknown word: {'yes' if report['is_input_unknown'] else 'no'}")
        print(f"  Is within frequency threshold: {report['frequency_status']}")
        print(f"  Is entity: {'yes' if report['is_entity'] else 'no'}")
        print(f"  Is cognate: {report['is_cognate']}")
        if not report['is_known']:
            print(f"  Is morphologically related: {report['is_morphologically_related']}")
        print()

# Normalize a word for comparison: drop non-word characters and lowercase
def normalize_word(word):
    return re.sub(r'\W+', '', word).lower()

# Translate and process each paragraph

@profile_func
def process_paragraph(paragraphs, input_unknown_words, known_words, unknown_words, validated_translations):
    if not paragraphs:
        return [], [], [], [], {}

    current_paragraph_unknown_words = defaultdict(set)
    sentences, words = batch_preprocess_text(paragraphs, nlp_target, state['target_language'])
    entities = perform_ner("\n\n".join(paragraphs), nlp_target)
    translated_sentences = batch_translate(sentences, state['target_language'], state['native_language'])
    threshold = frequency_thresholds[state['level']]

    # Extract Spanish and English words
    spanish_words = [normalize_word(word['text']) for word in words]
    english_words = batch_translate([normalize_word(word['text']) for word in words], state['target_language'], state['native_language'])
    cognates = find_cognates(sentences, translated_sentences, cognet_sp_en)
    cognate_pairs = {normalize_word(sp): normalize_word(en) for sp, en in cognates}

    word_report = []  # Initialize the list to store report details for each word

    for word in words:
        word_text = normalize_word(word['text'])
        state['original_word_mapping'][word_text] = word['text']

        report_details = {
            'text': word['text'],
            'is_known': False,
            'is_input_unknown': word_text in input_unknown_words,
            'frequency_status': 'above' if word['frequency'] >= threshold else 'below',
            'is_entity': word_text in entities,
            'is_cognate': cognate_pairs.get(word_text, 'no'),
            'is_morphologically_related': 'no'  # Placeholder, will be updated later if necessary
        }



        if word_text in entities or word['pos'] == 'PUNCT':
            if word_text not in input_unknown_words:
                known_words.append(word)
                report_details['is_known'] = True
        elif word['frequency'] >= threshold or word_text in cognate_pairs:
            known_words.append(word)
            report_details['is_known'] = True
        else:
            state['final_unknown_words_dict'][word_text].add(word['lemma'])
            current_paragraph_unknown_words[word_text].add(word['lemma'])

        word_report.append(report_details)  # Add the report details to the list

    input_unknown_word_details = []  # Initialization here
    for unknown_word in input_unknown_words:
        processed_words = batch_preprocess_text([unknown_word], nlp_target, state['target_language'], use_cache=False)[1]
        if not processed_words:
            continue  # Nothing to analyze (e.g. empty or single-character input)
        processed_word = processed_words[0]
        word_text = normalize_word(processed_word['text'])
        input_unknown_word_details.append(processed_word)
        state['final_unknown_words_dict'][word_text].add(processed_word['lemma'])
        if word_text in spanish_words:
            current_paragraph_unknown_words[word_text].add(processed_word['lemma'])
        # defaultdict(int) starts missing keys at 0, so += covers new and existing words
        state['final_unknown_word_counts'][word_text] += spanish_words.count(word_text)

    for word in words:
        for unknown_word in input_unknown_word_details:
            if is_similar_morphology(word, unknown_word, threshold):
                state['final_unknown_words_dict'][word['text'].lower()].add(word['lemma'])
                current_paragraph_unknown_words[word['text'].lower()].add(word['lemma'])
                # Update the report if the word is morphologically related
                for report in word_report:
                    if report['text'].lower() == word['text'].lower():
                        report['is_morphologically_related'] = unknown_word['text']
                        break

    for word_text, lemmas in state['final_unknown_words_dict'].items():
        for word in words:
            if is_similar_morphology({'text': word_text, 'lemma': next(iter(lemmas)), 'stem': stemmer.stem(word_text)}, word, threshold):
                current_paragraph_unknown_words[word['text'].lower()].add(word['lemma'])
                # Update the report if the word is morphologically related
                for report in word_report:
                    if report['text'].lower() == word['text'].lower():
                        report['is_morphologically_related'] = word_text
                        break

    for word in words:
        word_text = word['text'].lower()
        if word_text in current_paragraph_unknown_words:
            # defaultdict(int) starts missing keys at 0
            state['final_unknown_word_counts'][word_text] += 1
            # Once a word has been counted 8 times, stop tracking it as unknown
            if state['final_unknown_word_counts'][word_text] >= 8:
                state['final_unknown_words_dict'].pop(word_text, None)


    final_unknown_words = []
    for word_text, lemmas in current_paragraph_unknown_words.items():
        # Exclude cognates from final unknown words
        if word_text in input_unknown_words or word_text not in cognate_pairs:
            final_unknown_words.append({
                'text': word_text,
                'lemma': next(iter(lemmas)),
                'pos': next((word['pos'] for word in words if word['text'].lower() == word_text), 'UNKNOWN'),
                'frequency': next((word['frequency'] for word in words if word['text'].lower() == word_text), 0.0),
                'stem': stemmer.stem(word_text)
            })

    def map_words_to_sentences(sentences, words):
        sentence_word_map = {}
        for i, sentence in enumerate(sentences):
            for word in words:
                if word['text'].lower() in sentence.lower():
                    sentence_word_map[word['text'].lower()] = (sentence, i)
        return sentence_word_map

    sentence_word_map = map_words_to_sentences(sentences, final_unknown_words)
    for word in final_unknown_words:
        if word['text'].lower() not in sentence_word_map:
            continue
        # Reuse batch_translate so single words get the same caching and retry handling
        translation = batch_translate([word['text']], state['target_language'], state['native_language'])[0]
        validated_translation = validate_translation_in_context(
            translation,
            sentences,
            translated_sentences,
            word['pos']
        )
        word['translation'] = validated_translation
        word['sentence'] = sentence_word_map[word['text'].lower()][0]
        word['translated_sentence'] = translated_sentences[sentence_word_map[word['text'].lower()][1]]
        validated_translations.append({
            'original': word['text'],
            'translation': validated_translation,
            'translated_pos': word['pos']
        })

    # Generate the word report
    generate_word_report(word_report)

    return sentences, translated_sentences, final_unknown_words, validated_translations, cognate_pairs



## Starting Processing Function

This section defines the `start_processing` function, which initializes the app state and begins processing the input text. The function sets the native and target languages, the proficiency level, and splits the input text into paragraphs. It then processes the first paragraph to start the app. The profiling decorator is used to measure the performance of this function.

Key steps:
1. **Initialize Variables**: Reset all state variables to ensure a fresh start.
2. **Set Language and Level**: Set the user's native language, target language, and proficiency level.
3. **Split Text into Paragraphs**: Divide the input text into separate paragraphs for step-by-step processing.
4. **Process First Paragraph**: Call `process_next_paragraph` to start processing the first paragraph.
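
Step 3 can be sketched as follows. Note one difference: the notebook keeps empty lines in the list and skips them later during processing, whereas this hypothetical `split_paragraphs` helper filters them out up front.

```python
def split_paragraphs(text):
    """Split text on newlines and drop lines that are empty after stripping."""
    return [line.strip() for line in text.strip().split("\n") if line.strip()]

paragraphs = split_paragraphs("Hola.\n\n  Adiós. \n")
print(paragraphs)
```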


In [None]:
@profile_func
def start_processing(native_language, target_language, level, text):
    # Initialize all state variables
    initialize_variables()

    # Set user language preferences and proficiency level
    state['native_language'] = native_language
    state['target_language'] = target_language
    state['level'] = level

    # Split the input text into paragraphs for processing
    state['paragraphs'] = text.strip().split('\n')
    state['current_paragraph_index'] = 0

    # Start processing the first paragraph
    return process_next_paragraph([])



## Processing Next Paragraph Function

The `process_next_paragraph` function processes the next paragraph of the text, updating the app state as it goes. This function is called repeatedly to process each paragraph in the input text.

Key steps:
1. **Check Paragraph Index**: If there are more paragraphs to process, it proceeds with the next one.
2. **Process Paragraph**: Calls the `process_paragraph` function to handle the current paragraph.
3. **Update State**: Updates the state variables with the processed data.
4. **Display Output**: Formats and returns the processed paragraph with highlights and translations.
5. **Generate Summary**: If all paragraphs are processed, it generates a summary of all unknown words.


In [None]:
def process_next_paragraph(input_unknown_words):
    global state
    # Check if there are more paragraphs to process
    while state['current_paragraph_index'] < len(state['paragraphs']):
        paragraph = state['paragraphs'][state['current_paragraph_index']].strip()

        if paragraph:
            sentences, translated_sentences, final_unknown_words, validated_translations, cognate_pairs = process_paragraph(
                [paragraph],
                input_unknown_words,
                state['known_words'],
                state['unknown_words'],
                state['validated_translations']
            )

            # Update state with processed data
            state['all_final_unknown_words'].extend(final_unknown_words)
            state['all_cognate_pairs'].update(cognate_pairs)
            output = display_output(paragraph, final_unknown_words)
            state['current_paragraph_index'] += 1
            print(f"Paragraph {state['current_paragraph_index']} processed with input unknown words: {input_unknown_words}")  # Debugging: Print paragraph processing status
            return output

        # If the paragraph is empty, skip to the next one
        state['current_paragraph_index'] += 1

    # If all paragraphs are processed, generate a summary
    summary = generate_summary()
    return f"All paragraphs processed<br>{summary}"



## Highlighted Paragraph Function

The `highlighted_paragraph` function highlights unknown words in a paragraph by wrapping them in HTML tags and adding translations.

Key steps:
1. **Preserve Case**: The function preserves the original casing of the words when replacing them with highlighted versions.
2. **Highlight Words**: Each unknown word is wrapped in `<b>` tags and its translation is appended in parentheses.
3. **Return Highlighted Paragraph**: Returns the formatted paragraph with highlighted unknown words and translations.
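
A minimal sketch of the highlighting step, assuming a plain word-to-translation dictionary as input (the `highlight` helper is a hypothetical name). Because the replacement reuses the matched text, the original casing is kept as it appears in the paragraph.

```python
import re

def highlight(paragraph, translations):
    """Wrap each listed word in <b> tags and append its translation in parentheses."""
    for word, translation in translations.items():
        pattern = r"\b{}\b".format(re.escape(word))
        paragraph = re.sub(
            pattern,
            # m.group() is the word exactly as written in the paragraph
            lambda m, t=translation: f"<b>{m.group()}</b>({t})",
            paragraph,
            flags=re.IGNORECASE,
        )
    return paragraph

result = highlight("Gato negro. El gato duerme.", {"gato": "cat"})
print(result)
```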


In [None]:
def highlighted_paragraph(paragraph, final_unknown_words, validated_translations):
    highlighted = paragraph
    for word in final_unknown_words:
        original_word = state['original_word_mapping'].get(word['text'], word['text'])
        translation_info = next((item for item in validated_translations if item['original'] == word['text']), None)
        if translation_info:
            translation = translation_info['translation']
            # match.group() is the text exactly as it appears in the paragraph,
            # so the original casing is preserved without altering the HTML tags
            # or the translation.
            highlighted = re.sub(
                r'\b{}\b'.format(re.escape(original_word)),
                lambda match, translation=translation: f"<b>{match.group()}</b>({translation})",
                highlighted, flags=re.IGNORECASE
            )

    return highlighted


## Display Output Function

This function, `display_output`, formats the processed paragraph and unknown words for display. It highlights unknown words in the paragraph and provides contextual sentences to help learners understand the usage of these words.

Key steps:
1. **Highlight Paragraph**: Calls `highlighted_paragraph` to get the formatted paragraph with highlighted unknown words.
2. **Generate Contextual Sentences**: Creates sentences showing each unknown word in its context.
3. **Format Output**: Combines the highlighted paragraph and contextual sentences into HTML format for display.

In [None]:
def display_output(paragraph, final_unknown_words):
    # Highlight the paragraph once; the result is reused in the final HTML.
    highlighted_para = highlighted_paragraph(paragraph, final_unknown_words, state['validated_translations'])

    # One entry per unknown word: the word, its translation, and the translated sentence.
    context_sentences = []
    for word in final_unknown_words:
        translation = word.get('translation', 'No translation available')
        context_sentence = f"<b>{word['text']}:</b> <b>{translation}</b>.<br>{word.get('translated_sentence', 'No sentence available')}<br>"
        context_sentences.append(context_sentence)

    context_output = "<br>".join(context_sentences)

    return f"<p><b style='font-size: larger;'>Highlighted Text:</b></p><p>{highlighted_para}</p><hr><p><b style='font-size: larger;'>Predicted Unknown Words In Context:</b></p><p>{context_output}</p>"


## Generate Summary Function

The `generate_summary` function compiles a summary of all unknown words encountered during text processing. It provides the count of appearances for each unknown word and its translation, helping learners review new vocabulary.

Key steps:
1. **Initialize Summary**: Starts with a heading for the summary.
2. **Compile Translations**: Gathers translations for all unknown words from the validated translations.
3. **Format Summary**: Creates a summary listing each unknown word, its appearance count, and translation.


In [None]:
def generate_summary():
    summary = "<p style='font-size: larger;'><b>Summary of Unknown Words:</b></p><br>"

    translations_dict = {word: next((item for item in state['validated_translations'] if item['original'] == word), {}).get('translation', 'No translation found')
                         for word in state['final_unknown_word_counts'].keys()}

    for word, count in state['final_unknown_word_counts'].items():
        translation = translations_dict.get(word, 'No translation found')
        summary += f"<b>{word}:</b> {count} appearances, Translation: {translation}<br>"

    return summary

## Next Paragraph Function

This function, `next_paragraph`, is used to process the next paragraph in the text based on additional unknown words input by the user. It calls `process_next_paragraph` with the new unknown words and returns the output for the next paragraph.

Key steps:
1. **Convert Input to List**: Converts the input string of unknown words into a list.
2. **Process Next Paragraph**: Calls `process_next_paragraph` with the list of input unknown words.

In [None]:
def next_paragraph(input_unknown_words):
    if isinstance(input_unknown_words, str):
        input_unknown_words = input_unknown_words.split()
    return process_next_paragraph(input_unknown_words)

## Reset Interface Function

The `reset_interface` function resets the entire interface and state variables, allowing the user to start over with a new text. This function is useful when the user wants to restart the session.

Key steps:
1. **Initialize Variables**: Calls `initialize_variables` to reset all state variables.
2. **Update Interface**: Resets the Gradio interface elements to their initial state.

In [None]:
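def reset_interface():
    initialize_variables()
    # One update per output wired to restart_button, in the same order:
    # native language, target language, level, text, output area, unknown words.
    return (
        gr.update(value='en'),  # native language starts at 'en'
        gr.update(value=None),  # target language starts unselected
        gr.update(value=None),  # level starts unselected
        gr.update(value=''), gr.update(value=''), gr.update(value='')
    )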

## Gradio Interface

This section defines the Gradio interface for the language learning app. The interface includes inputs for the native language, target language, proficiency level, and text. It also provides buttons to start processing, move to the next paragraph, and restart the app.

Key components:
1. **Dropdowns**: For selecting native language, target language, and proficiency level.
2. **Textbox**: For entering the text to be processed.
3. **Buttons**: For starting the processing, moving to the next paragraph, and restarting the app.
4. **Output Area**: Displays the processed text with highlighted unknown words and translations.

In [None]:
# Gradio Interface
iface = gr.Blocks()

with iface:
    native_language_input = gr.Dropdown(choices=['en'], label='Native Language', value='en')
    target_language_input = gr.Dropdown(choices=['es'], label='Target Language')
    level_input = gr.Dropdown(choices=['A1', 'A2', 'B1', 'B2', 'C1', 'C2'], label='Level')
    text_input = gr.Textbox(label='Text', lines=10)
    start_button = gr.Button('Start')
    output_area = gr.HTML()
    unknown_words_input = gr.Textbox(label='Input Unknown Words', lines=2)
    next_button = gr.Button('Next Paragraph')
    restart_button = gr.Button('Restart')

    start_button.click(start_processing, [native_language_input, target_language_input, level_input, text_input], [output_area])
    next_button.click(next_paragraph, [unknown_words_input], [output_area])
    restart_button.click(reset_interface, [], [native_language_input, target_language_input, level_input, text_input, output_area, unknown_words_input])

iface.launch(share=True, debug=True)



## How the code works
The code takes a native language (currently English), a target language (currently Spanish), a proficiency level in the target language (from A1 to C2), and a text. It then uses Stanza to preprocess the text in batches. The preprocessing function divides the text into sentences (tokenization), extracts the base form of the words (lemmatization), and assigns a Part Of Speech tag (grammatical categories such as nouns, verbs, adjectives, adverbs, pronouns, etc.) to each word in a sentence. It also calculates the word frequencies and caches all results to improve performance.

In the `perform_ner` function, the code uses Stanza to identify entities (proper names) in the text, which are stored in a special list to prevent them from being categorized as unknown. In the `find_cognates` function, the code checks if a word from the text is in the CogNet database. If it is, it adds the word and its translation to a cognates list. If the word is not found in the database, the code checks whether the Spanish word shares the same lemma (base form) as the English translation or whether the two words have a similarity score above 65%. Word pairs that meet either condition are added to the cognates list to prevent them from being categorized as unknown.

The code also includes a function called `is_similar_morphology`, which checks if two words are similar in morphology by comparing their stems (the word root, the minimal part of a word) and lemmas. This function predicts further unknown words given an initial list of unknown words by assuming that morphologically related words will also be unknown to the user.

The code translates the preprocessed sentences from the target language to the native language using the Google Translator API. The translated sentences are used to make the word translations more contextually accurate.

Here's how everything comes together. The code identifies unknown words as follows: if a word's frequency is below the threshold for the selected proficiency level, it is listed as unknown. Any word the user inputs is listed as unknown regardless of frequency. Any word in the text that is morphologically similar to a word in the list of unknown words is also marked as unknown. Proper names and cognates are not marked as unknown.
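
The decision rule above can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: the `is_unknown` helper, the threshold values, and the order in which the rules are checked are assumptions, and the morphological check is reduced to a boolean flag standing in for `is_similar_morphology`.

```python
# Illustrative cutoffs only; the notebook's real thresholds may differ.
FREQUENCY_THRESHOLDS = {"A1": 5.0, "B1": 4.0, "C1": 3.0}

def is_unknown(word, frequency, level, user_unknown, protected, related_to_unknown=False):
    """Return True if `word` should be flagged as unknown for this learner."""
    if word in protected:        # proper names and cognates are never flagged
        return False
    if word in user_unknown:     # user-supplied words are flagged regardless of frequency
        return True
    if related_to_unknown:       # morphologically similar to an already unknown word
        return True
    # Otherwise, fall back to the frequency rule for the selected level.
    return frequency < FREQUENCY_THRESHOLDS[level]
```

Checking the exclusions first means that a proper name or cognate stays unflagged even if it is rare; whether the notebook resolves that conflict the same way is not stated in this section.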

Once the code identifies an unknown word, it translates it individually using the Google Translator API. This individual translation (which we can call 'the original translation') is then passed to the `validate_translation_in_context` function. To validate the translation in context, the code first tries to find the exact same word in the corresponding translated sentence. If found, that translation is displayed. If not, the code looks for the most similar word in the translated sentence, with a similarity threshold of at least 80%. If still not found, the code looks for a word with at least 70% similarity and checks if it has the same Part Of Speech (POS) tag as the original word. If both conditions are met, it displays the contextual translation. If no word meets these conditions, the code checks if the unknown word's POS tag is unique in the original Spanish sentence. If it is, it finds the corresponding word in the translated sentence and displays the contextual translation. If multiple words share the same POS tag, only the original translation is displayed.
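
The validation cascade can be sketched as follows. This is a simplified illustration under stated assumptions: `validate_in_context` and its arguments are hypothetical names, the standard library's `difflib` stands in for the `rapidfuzz` scorer the notebook uses, and the "corresponding word" in the last step is taken to be the single translated word with the same POS tag.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def validate_in_context(original_translation, word_pos, source_pos_tags, translated_tokens):
    """translated_tokens: list of (text, pos) pairs from the translated sentence."""
    # 1. Exact match in the translated sentence.
    for text, _ in translated_tokens:
        if text.lower() == original_translation.lower():
            return text
    # 2. Most similar word, accepted at 80% or more.
    scored = [(similarity(original_translation, text), text, pos)
              for text, pos in translated_tokens]
    if scored:
        best_score, best_text, _ = max(scored)
        if best_score >= 80:
            return best_text
    # 3. At least 70% similar and the same POS tag as the unknown word.
    candidates = [(score, text) for score, text, pos in scored
                  if score >= 70 and pos == word_pos]
    if candidates:
        return max(candidates)[1]
    # 4. POS tag unique in the source sentence: take the word with that tag.
    if source_pos_tags.count(word_pos) == 1:
        same_pos = [text for text, pos in translated_tokens if pos == word_pos]
        if len(same_pos) == 1:
            return same_pos[0]
    # 5. Otherwise keep the standalone translation.
    return original_translation
```

Each step is only reached when the previous one fails, so the strictest evidence (an exact match) always wins over the looser heuristics.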

The code displays a paragraph of the text with the identified unknown words highlighted in bold and their validated translations in parentheses. It also displays a list of the unknown words, their translations, and the full translated sentence to ensure the user has all the necessary context. The code tracks how many times a word has been translated, and after eight times, the word is removed from the unknown words list.
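
The eight-translation limit could be tracked with a simple counter. The sketch below is hypothetical: `record_translation` and `MAX_TRANSLATIONS` are illustrative names, and it assumes the word is dropped as soon as its eighth translation has been shown.

```python
from collections import Counter

MAX_TRANSLATIONS = 8  # from the text: a word is dropped after eight translations

def record_translation(word, counts, unknown_words):
    """Count one more translation of `word` and drop it once the limit is reached."""
    counts[word] += 1
    if counts[word] >= MAX_TRANSLATIONS:
        unknown_words.discard(word)  # no longer highlighted in later paragraphs
    return counts[word]
```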

The code iterates through each paragraph until all have been displayed. Finally, it presents a summary of the unknown words, their translations, and the number of times they were translated.

To manage efficiency, the code profiles (measures) the execution time of various functions. This profiling helps identify performance bottlenecks. The specific parts profiled are:
- text preprocessing (the time taken for tokenization, lemmatization, and POS tagging);
- entity recognition (the time taken to identify proper names using the `perform_ner` function);
- cognate finding (the time taken to search the CogNet database and check for morphological similarity);
- the translation of the sentences using the Google Translator API;
- context validation (the time taken to validate translations in context).

By logging the time taken by these functions, the code can be optimized for better performance.
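
One common way to collect such timings is a decorator around each stage. The sketch below is a generic pattern, not the notebook's profiling code: `profiled`, `timings`, and the placeholder `preprocess` function are illustrative names.

```python
import time
from collections import defaultdict
from functools import wraps

timings = defaultdict(list)  # function name -> list of elapsed seconds

def profiled(func):
    """Record the wall-clock time of every call to `func`."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if the call raises, so failed calls are timed too.
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper

@profiled
def preprocess(text):  # placeholder for a real pipeline stage
    return text.split()

preprocess("un texto de ejemplo")
for name, samples in timings.items():
    print(f"{name}: {len(samples)} call(s), {sum(samples):.6f}s total")
```

Keeping every sample, rather than only a running total, makes it possible to spot stages that are slow only on some inputs.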


## Experimentation and Tuning

These are the parts of the code that can be easily modified to experiment with different results and potentially improve the performance or accuracy of the prototype:

### Fuzz Threshold in Cognate Identification

In the `find_cognates` function, the `similarity_threshold` can be adjusted to see if it helps in better identifying cognates. The `similarity_threshold` determines how similar two words need to be to be considered cognates. This threshold ranges from 0 to 100, where a higher value means that the words need to be more similar to be identified as cognates. Setting a high threshold, such as 80 or 90, makes it stricter, so only very similar words are identified as cognates. This reduces false positives but might miss some valid cognates. Conversely, a lower threshold, such as 50 or 60, makes it more lenient, allowing more words to be identified as cognates, which can include false positives.
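
The effect of the threshold can be explored on a handful of word pairs. This is a toy sketch: it uses the standard library's `difflib` as a stand-in for the notebook's `rapidfuzz` scorer, so the exact scores will differ, and `cognate_candidates` is a hypothetical helper rather than part of `find_cognates`.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cognate_candidates(pairs, similarity_threshold):
    """Keep the (Spanish, English) pairs whose similarity reaches the threshold."""
    return [(es, en) for es, en in pairs if similarity(es, en) >= similarity_threshold]

pairs = [("animal", "animal"), ("familia", "family"), ("libro", "book")]
for threshold in (40, 65, 90):
    print(threshold, cognate_candidates(pairs, threshold))
```

With these toy pairs, the strictest cutoff keeps only the identical pair, while the most lenient one also admits the unrelated pair `libro`/`book`, which is the kind of false positive described above.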

### Frequency Thresholds for Different Levels

The `frequency_thresholds` dictionary defines thresholds for different proficiency levels, determining which words are considered known or unknown based on their frequency in the language. These thresholds can be adjusted to see how they affect the identification of unknown words. Because a word is flagged as unknown when its frequency falls below the threshold, lowering the threshold for the A1 level means that fewer words are flagged, which might be less overwhelming for beginners but leaves more words without a translation. Raising the threshold flags more words as unknown, so the learner gets more translations at the cost of a busier display.

### Translation Service

The `safe_translate` function uses GoogleTranslator to translate sentences. Using a different translation service like YandexTranslator or DeepL might offer different levels of accuracy and performance, but each service has its limitations, such as API call limits, costs, or different degrees of language support.

### Similarity Scoring Method

The `rapidfuzz` library is also used for similarity scoring in the `validate_translation_in_context` function. You can experiment with different similarity scorers like `fuzz.token_sort_ratio` or `fuzz.partial_ratio`, and adjust the scoring cutoff values (threshold values) to see if it improves translation validation. Higher cutoff values make the matching criteria stricter, which can reduce false positives but might miss valid matches. Lowering the cutoff values can include more matches but at the risk of increased false positives. Changing the scoring method and cutoff values helps find the optimal balance for accurate translation validation.

### Batch Size for Preprocessing

The `batch_preprocess_text` function processes text in batches. Adjusting the batch size, such as processing paragraphs in smaller or larger batches, can impact efficiency and performance.
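
For experimentation, batching can be isolated in a small helper. The function below is a hypothetical sketch; how `batch_preprocess_text` actually forms its batches is not shown in this section.

```python
def make_batches(paragraphs, batch_size):
    """Split a list of paragraphs into consecutive batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [paragraphs[i:i + batch_size] for i in range(0, len(paragraphs), batch_size)]
```

Larger batches mean fewer calls into the NLP pipeline but more memory per call, so the best value depends on the length of the texts and the available hardware.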

## Evaluating the prototype
Here are some ideas on how to measure the accuracy and effectiveness of the predictions:
### 1. Ground Truth Data
The first and easiest approach, which does not require users to test the app, is to use ground truth data: texts that are pre-annotated with the words expected to be unknown at each proficiency level. We can then compare the app's predictions with these annotations. Data for expected known and unknown words according to the proficiency level can be gathered from CEFR lexical sets (https://cvc.cervantes.es/ensenanza/biblioteca_ele/plan_curricular/niveles/08_nociones_generales_inventario_a1-a2.htm). Another good place to look for data is the [ACL Anthology](https://aclanthology.org/search/?q=spanish+unknown+words), a repository of research papers in computational linguistics that often include datasets.
#### Precision and Recall:
Calculate precision and recall for the predicted unknown words compared to the ground truth annotations. Precision measures the proportion of correctly identified unknown words out of all predicted unknown words, while recall measures the proportion of correctly identified unknown words out of all actual unknown words.
#### F1 Score:
Compute the F1 score, which is the harmonic mean of precision and recall, to provide a single metric for evaluation.
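
These three metrics can be computed directly from the set of predicted unknown words and the set of annotated ones. The function name and the example words below are illustrative.

```python
def precision_recall_f1(predicted, actual):
    """Compare predicted unknown words against ground-truth unknown words."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

predicted = {"desasosiego", "gato", "umbral"}
actual = {"desasosiego", "umbral", "ventana", "aunque"}
print(precision_recall_f1(predicted, actual))
```

In this example two of the three predictions are correct and two of the four annotated words are found, so precision is 2/3 and recall is 1/2.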
### 2. User Feedback
When user feedback becomes available, we can ask users to mark which predicted unknown words they actually found challenging and measure the percentage of user agreement with the app's predictions.
### 3. Weighted Precision and Recall
For a more comprehensive evaluation, we should note that not all errors should be treated with the same importance. The severity of an error in the context of this language learning app can vary based on the type of word and its role in the sentence or text.
To reflect the different severities of errors, we can use weighted precision and recall. Here, different types of errors are assigned different weights based on their importance.
#### Error Severity
Here are some examples of how we can classify the severity of the errors based on their impact on understanding the text:
*   **False Positives (Unnecessary Translations)**: Less severe. These are words that the app translates but the user didn't need help with. They might create noise in the interface but they will not affect understanding.
*   **False Negatives (Missed Translations)**: More severe. These are words that the app didn't translate but the user needed help with. They can significantly affect understanding.
*   **Key Content Words**: Very severe if missed. These include nouns, verbs, adjectives, and adverbs that are crucial for understanding the sentence.
*   **Function Words**: Less severe if missed. These include conjunctions, prepositions, articles, etc., that add grammatical structure but less semantic meaning.

#### Importance-Based Weighting
To implement the above, we can assign weights to different types of words and errors. For example:
*   **False Negatives (Key Content Words)**: Weight = 3
*   **False Negatives (Function Words)**: Weight = 2
*   **False Positives (Key Content Words)**: Weight = 1
*   **False Positives (Function Words)**: Weight = 1
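
One possible way to turn these weights into metrics is to count each true positive as 1 and each error with its weight. This formulation is an assumption made for illustration, as are the function name and the `word_type` mapping; other weighting schemes are equally valid.

```python
# Weights taken from the list above.
FN_WEIGHTS = {"content": 3, "function": 2}
FP_WEIGHTS = {"content": 1, "function": 1}

def weighted_precision_recall(predicted, actual, word_type):
    """word_type maps each word to 'content' or 'function'."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Each error contributes its weight instead of a flat count of 1.
    fp_cost = sum(FP_WEIGHTS[word_type[w]] for w in predicted - actual)
    fn_cost = sum(FN_WEIGHTS[word_type[w]] for w in actual - predicted)
    precision = true_positives / (true_positives + fp_cost) if true_positives + fp_cost else 0.0
    recall = true_positives / (true_positives + fn_cost) if true_positives + fn_cost else 0.0
    return precision, recall
```

Because both false-positive weights are 1, weighted precision equals ordinary precision here; only recall is penalized more heavily when content words are missed.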

## Expanding to different languages
To adapt the code for different language pairs, we need to consider a series of modifications. This involves updating language-specific components such as NLP models, translation services, and frequency data, and handling morphological differences.
1. **NLP Models**: The current code uses Stanza for tokenization, lemmatization, and part-of-speech tagging. Stanza supports many languages, but not all. If we need to support a language that Stanza does not, we may need to switch to another NLP library such as spaCy, which also supports a wide range of languages. It is important to ensure that the new NLP model provides similar functionality (tokenization, lemmatization, POS tagging, NER), and to keep in mind that the quality of NLP tasks may vary across languages, impacting accuracy.
2. **Translation Services**: The code currently uses GoogleTranslator from the deep-translator library. Alternative APIs like DeepL, Yandex, or Microsoft Translator can be considered if a specific language pair is not supported or if a more accurate service is needed for certain languages. Each translation service has different strengths. DeepL is known for high-quality translations but supports fewer languages compared to Google. Costs and API limits also vary.
3. **Frequency Data**: The current code uses the wordfreq library for frequency analysis. While it supports many languages, coverage and accuracy might vary. For languages not well-supported by wordfreq, we need to consider alternative frequency data sources or even corpora specific to those languages. Tools like Lexique3 for French or SUBTLEX for various languages can be useful. Frequency data quality is crucial for accurate unknown word prediction, so it is important to ensure that the frequency data is reliable and representative of modern language use.
4. **Morphological Differences**: Languages with rich morphology (like Finnish or Turkish) require a more robust morphological analysis. Morfessor or a similar tool can be used for such languages. The current code uses Snowball Stemmer for Spanish. For other languages, stemmers or lemmatizers might vary (for example, using the ISRI Stemmer for Arabic).
5. **Named Entity Recognition (NER)**: The current code uses Stanza for NER, which supports many languages, but its performance could vary depending on the language. If NER is not well-supported for a target language, we need to consider alternatives like spaCy, Polyglot, or even custom-trained models.
6. **Cognate Identification**: CogNet technically has data for several language pairs, but its quality varies from language to language. Wiktionary or BabelNet can be potential sources of multilingual cognate data for new language pairs.
7. **User Interface Adjustments**: We will need to update the Gradio interface to ensure that it supports the input and display of characters from various languages, including those with different scripts (like Arabic).

In summary, not all tools support all languages equally. We might need to mix and match different libraries for comprehensive support.
The following are possible libraries and tools for some of the most widely used target languages:
1. **English**
   - NLP: spaCy, Stanza, NLTK
   - Translation: Google Translator, DeepL, Microsoft Translator
   - Frequency Data: SUBTLEX-US, wordfreq
   - Morphological Analysis: NLTK, spaCy
   - NER: spaCy, Stanza
2. **Mandarin Chinese**
   - NLP: Jieba (for tokenization), Stanza, spaCy (with Chinese models)
   - Translation: Google Translator, Baidu Translate, Microsoft Translator
   - Frequency Data: Chinese Linguistic Data Consortium (CLDC), wordfreq
   - Morphological Analysis: Jieba, Stanza
   - NER: Stanford NLP, spaCy, Stanza
3. **French**
   - NLP: spaCy, Stanza, Talismane
   - Translation: Google Translator, DeepL, Microsoft Translator
   - Frequency Data: Lexique3, wordfreq
   - Morphological Analysis: Talismane, spaCy
   - NER: spaCy, Stanza
4. **Arabic**
   - NLP: Farasa, MADAMIRA, Stanza
   - Translation: Google Translator, Microsoft Translator
   - Frequency Data: Aralex (Arabic Lexicon Project), wordfreq
   - Morphological Analysis: Farasa, MADAMIRA
   - NER: Farasa, spaCy (with Arabic models), Stanza
5. **Russian**
   - NLP: spaCy, Stanza, Natasha
   - Translation: Google Translator, Yandex Translate, Microsoft Translator
   - Frequency Data: ruTenTen (Russian Web Corpus), wordfreq
   - Morphological Analysis: Pymorphy2, spaCy
   - NER: Natasha, spaCy, Stanza


## Useful Resources for Identifying Difficult Words for Language Learners

While specific datasets for testing the app prototype are not directly available, the following resources can provide valuable insights into word difficulty and vocabulary acquisition for language learners.

### 1. Selecting Reading Texts Suitable for Incidental Vocabulary Learning by Considering the Estimated Distribution of Acquired Vocabulary
This paper discusses a method for selecting reading texts that support incidental vocabulary learning by considering the distribution of vocabulary already acquired by learners. This approach helps identify which words might be more challenging based on learners' existing vocabulary.
- **Link:** [Selecting Reading Texts Suitable for Incidental Vocabulary Learning](https://educationaldatamining.org/EDM2022/proceedings/2022.EDM-posters.99/)

### 2. More Than Frequency: Exploring Predictors of Word Difficulty for Second Language Learners
This research examines various predictors of word difficulty beyond frequency, including word length, concreteness, and phonological complexity. These insights help predict which words learners are likely to find challenging by considering a range of linguistic factors.
- **Link:** [Exploring Predictors of Word Difficulty](https://www.researchgate.net/publication/333144175_More_Than_Frequency_Exploring_Predictors_of_Word_Difficulty_for_Second_Language_Learners)

### 3. Harvard Dataverse: Second Language Acquisition Data
This dataset includes extensive data on second language acquisition, covering learner interactions and vocabulary tests. Analyzing this dataset provides empirical evidence on common word difficulties and learner patterns, which can inform the app’s word prediction and translation algorithms.
- **Link:** [Second Language Acquisition Data](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8SWHNO)


## Scaling Up
The most computationally intensive processes in the current code are translation and linguistic analysis. Translation, handled by the deep-translator library, adds overhead because every request goes to an external translation API, which slows down processing. Similarly, linguistic analysis performed by the Stanza library requires substantial computational resources, particularly for tasks like part-of-speech tagging (identifying the grammatical roles of words). These tasks demand memory and processing power, and the cost grows with the size of the text.

To address these challenges and improve performance, migrating to the Google Cloud Translation API offers several advantages. The API provides higher quality translations, increased efficiency, and scalability. The basic service has a price of 20 USD per million characters, which corresponds to approximately 667 pages of text (about 1,500 characters per page), or roughly 0.03 USD per page. If we estimate that language learners would read around 50-100 pages per month, the cost per user ranges from about 1.50 USD to 3 USD monthly.
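
A small helper makes this estimate easy to recompute when the price or the reading volume changes. Both constants are taken from the paragraph above; the function name is hypothetical, and the sketch counts only the characters of the text itself, ignoring any free tier and the extra requests for individual words.

```python
PRICE_PER_MILLION_CHARS_USD = 20.0  # basic service price quoted above
PAGES_PER_MILLION_CHARS = 667       # page estimate quoted above

def monthly_cost_usd(pages_per_month):
    """Estimated translation cost for one learner reading `pages_per_month` pages."""
    chars_per_page = 1_000_000 / PAGES_PER_MILLION_CHARS
    chars = pages_per_month * chars_per_page
    return chars / 1_000_000 * PRICE_PER_MILLION_CHARS_USD

for pages in (50, 100):
    print(f"{pages} pages/month: {monthly_cost_usd(pages):.2f} USD")
```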

Additionally, optimizing the code to leverage transformer-based models, such as those available in Hugging Face's Transformers library, can enhance linguistic analysis. These models offer superior capabilities in natural language processing tasks but require substantial computational resources, particularly GPUs, to operate efficiently. While the adoption of transformer models requires some investment, the current code can still be optimized to improve performance without the need for advanced services yet.