# Task 1: Natural Language Processing Toolkit

- Build Gradio UIs to apply transformer models in different NLP contexts (Named Entity Recognition, Translation, Summarization)
- Create and invite me to a GitHub repository for task 1
- Coding and presentation should be based on Jupyter Notebook files (ipynb)
- Additionally, record and upload a screencast demonstrating your application
- Task Share: 20%

## Imports

In this task, we use several Python libraries for data processing, named entity recognition, translation, text summarization, and building the user interface. The imported libraries are described below:

> - re: Python's regular expression library. It provides functions to search, match, and manipulate strings using patterns. In this notebook, it is used to split the NER text into sentences.

> - pandas (pd): A data manipulation and analysis library for Python. Its DataFrame structure handles structured data efficiently. In this notebook, we use pandas to read the CSV files for the translation and summarization tasks.

> - gradio (gr): A library for creating web interfaces for machine learning models without web development knowledge. In this notebook, Gradio is used to build the interactive interfaces for all three tasks.

> - transformers (pipeline, AutoTokenizer): A Hugging Face library that provides pre-trained machine learning models, especially for natural language processing (NLP). The pipeline function offers a high-level API for common tasks such as named entity recognition, translation, and summarization. AutoTokenizer loads the tokenizer that matches the chosen model, which is needed to prepare text for the model.

In [1]:
import re
import pandas as pd
import gradio as gr
from transformers import pipeline, AutoTokenizer

## Data 

In this section, we handle the import and preparation of datasets for the natural language processing (NLP) tasks named entity recognition (NER), translation and summarization. Each task requires its specific dataset, which we read and preprocess accordingly. This setup ensures that our data is structured and ready for the intended NLP analyses.

### Named Entity Recognition

In this section, we handle the import and preparation of data specifically for the Named Entity Recognition (NER) task. We load a text file containing sentences from Wikipedia crawls and map each sentence to a descriptive label. This mapping helps in organizing the sentences by their context, which can be useful for understanding the content and performing NER tasks effectively.

In [2]:
# Load the named entity recognition data from a file
with open('Case Dataset/Data files/NER_text_Wikipedia_crawl.txt', 'r') as file:
    named_entity_recognition_data = file.read()

def map_sentences_to_descriptions(file):
    """
    Maps sentences from a given text file to descriptive labels.

    Args:
        file (str): The content of the text file.

    Returns:
        dict: A dictionary mapping descriptive labels to sentences.
    """
    # Basic sentence splitting using regular expressions
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', file)
    
    # Provide a descriptive word for each sentence
    descriptions = [
        "Introduction", "Importance", "Frequency", "Inspiration", "Founding",
        "Governance", "Evolution", "Adjustments", "Endorsements", "Adaptation",
        "Changes", "Media", "Cancellations", "Components", "Responsibilities",
        "Programme", "Symbols", "Participation", "Awards", "Growth",
        "Challenges", "Exposure", "Showcase", "Ancient", "Legend",
        "Myth", "Tradition", "Religious", "Decline", "Revival", "Modern",
        "Forerunners", "National", "Festival", "Event", "Reconstruction",
        "Historic", "Legacy", "Promotion", "Development", "Re-establishment",
        "Success", "Athens", "Paris", "Stagnation", "Survival", "Rebounding",
        "Popularity", "Winter", "Figure Skating", "Ice Hockey", "Expansion",
        "Congress", "Host", "International", "Agreement", "Official", "Youth",
        "Games", "Opportunities", "Nonprofit", "Effect", "Displacement",
        "Business", "Infrastructure", "Sponsorship", "Revenue", "Expenditures",
        "Financial", "Profit", "Exclusivity", "Marketing", "Broadcasting",
        "Audience", "Television", "Viewership", "Commercialisation",
        "Economic", "Investment", "International", "Symbolism", "Flag",
        "Motto", "Flame", "Mascot", "Ceremony", "Parade", "Athletes",
        "Hosts", "Medals", "Victory", "Events", "Governing", "Demonstration",
        "Recognized", "Professionalism", "Controversy", "Amateurism",
        "Participation", "Boycotts", "Politics", "Protest", "Doping", "Scandal",
        "Testing", "Banning", "Citizenship", "Medallists", "Athletes",
        "Nations", "Hosting"
    ]

    # Make every label unique so that repeated labels (e.g. "International")
    # and surplus sentences do not overwrite earlier dictionary entries
    sentence_description_map = {}
    for index, sentence in enumerate(sentences):
        # Reuse the last description if there are more sentences than labels
        label = descriptions[min(index, len(descriptions) - 1)]
        key, suffix = label, 2
        while key in sentence_description_map:
            key = f"{label} ({suffix})"
            suffix += 1
        sentence_description_map[key] = sentence.strip()

    return sentence_description_map

# Map the named entity recognition data to descriptions
named_entity_recognition_data = map_sentences_to_descriptions(named_entity_recognition_data)

named_entity_recognition_data['Introduction']


'The modern Olympic Games or Olympics (French: Jeux olympiques)[a][1] are the leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions.'
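To make the behaviour of the sentence-splitting pattern easier to follow, the sketch below applies the same regular expression to a short text. The sample text is hypothetical and not taken from the dataset. It illustrates that the pattern splits after `.` or `?`, but not after a capitalised two-letter abbreviation such as `Mr.`.

```python
import re

# Same pattern as in map_sentences_to_descriptions
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

# Hypothetical sample text, not part of the Wikipedia crawl
sample_text = "The Games began in 1896. Mr. Smith attended. Was it a success? Yes."

sentences = SENTENCE_BOUNDARY.split(sample_text)
for sentence in sentences:
    print(sentence)
```

Note that the pattern does not treat `!` as a sentence end, so exclamations stay attached to the following sentence.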

### Translation

In this section, we handle the import and preparation of data specifically for the translation task. We load a CSV file containing the translation training data, adjust the column names to remove any locale-specific suffixes, and prepare the data for translation. This step ensures that the data is in a consistent format and ready for the translation task.

In [3]:
# Load translation training data from a CSV file
translation_data = pd.read_csv('Case Dataset/Data files/Translation_Training.csv', sep=';')

def adjust_column_names(df):
    """
    Adjusts the column names of a DataFrame by removing the locale suffix.

    Args:
        df (pd.DataFrame): The DataFrame whose column names need adjustment.

    Returns:
        pd.DataFrame: The DataFrame with adjusted column names.
    """
    df.columns = [column.split('_')[0] for column in df.columns]
    return df

# Apply the column name adjustment function to the translation data
translation_data = adjust_column_names(translation_data)

translation_data.head()


Unnamed: 0,id,split,en,de,es,fr,it
0,1847,train,order me a cheese burger from tommy's burgers,bestell mir einen cheeseburger von tommy's bur...,pídeme una hamburguesa de queso del mcdonalds,commande moi un burger au fromage chez tommy's...,ordinami un cheese burger da america graffiti
1,876,train,play kari jobe for me,spiel kari jobe für mich,pon melendi para mi,mets jacques brel ne me quitte pas,metti laura pausini per me
2,14494,train,what is i. b. m.'s stock worth,was ist i. b. m.'s aktie wert,cuál es el valor de las acciones del ibm,quelle est la valeur des actions d'i. b. m.,qual è il valore delle azioni generali
3,14366,train,will it be good to buy nike stock today,wäre es gut heute volkswagen aktien zu kaufen,será bueno comprar acciones de nike hoy dia,sera-t-il bon d'acheter des actions nike aujou...,oggi è un buon giorno per comprare le azioni d...
4,1977,train,please remove the alarm which i set for today ...,bitte lösche den wecker den ich für heute früh...,por favor borrar la alarma que tenía activada ...,veuillez retirer l'alarme que j'ai réglée pour...,rimuovi la sveglia impostata per questa mattina
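The renaming rule in `adjust_column_names` keeps everything before the first underscore. The sketch below applies that rule to a plain list of names so it runs without pandas. The raw header shown here is an assumption for illustration, since the original CSV header does not appear in this notebook.

```python
def strip_locale_suffix(column: str) -> str:
    """Keep the part of a column name before the first underscore."""
    return column.split('_')[0]

# Hypothetical raw header; "user_id" is included only to show a side effect
raw_columns = ["Unnamed: 0", "id", "split", "en_US", "de_DE", "user_id"]

renamed = [strip_locale_suffix(column) for column in raw_columns]
print(renamed)
```

Names without an underscore are left unchanged, while any other name containing an underscore is shortened as well (here `user_id` becomes `user`), which is worth keeping in mind if the dataset ever gains such columns.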


### Summarization

In this section, we handle the import of data specifically for the summarization task. 

In [4]:
# Load summarization training data from a CSV file
summarization_data = pd.read_csv('Case Dataset/Data files/Summarization_Training.csv', sep=';')

summarization_data.head()


Unnamed: 0,article,highlights
0,"By . Anthony Bond . PUBLISHED: . 07:03 EST, 2 ...",John and .\nAudrey Cook were discovered alongs...
1,UNITED NATIONS (CNN) -- A rare meeting of U.N....,NEW: Libya can serve as example of cooperation...
2,Cover-up: Former Archbishop Lord Hope allowed ...,Very Reverend Robert Waddington sexually abuse...
3,"By . Kristie Lau . PUBLISHED: . 10:48 EST, 14 ...",Monday night's episode showed Buddy Valastro t...
4,'The lamps are going out all over Europe. We s...,People asked to turn out lights for hour betwe...


## Constants

In this section, we define a set of constants that are crucial for performing various NLP tasks such as Named Entity Recognition (NER) and summarization. These constants include pre-trained model identifiers for different tasks and descriptive labels for entities recognized by the NER models. By organizing these constants in a structured manner, we ensure that our code is both readable and easily configurable for different use cases.

In [5]:
# Named Entity Recognition models
NAMED_ENTITY_RECOGNITION_MODELS = {
    "BERT (CoNLL-03 English)": "dbmdz/bert-large-cased-finetuned-conll03-english",
    "Wikineural Multilingual (Babelscape)": "Babelscape/wikineural-multilingual-ner",
    "BERT (dslim)": "dslim/bert-large-NER",
}

# Named Entity Recognition labels and descriptions
ENTITY_DESCRIPTIONS = {
    "I-PER": "Person",
    "I-ORG": "Organization",
    "I-LOC": "Location",
    "I-MISC": "Miscellaneous",
    "B-PER": "Person",
    "B-ORG": "Organization",
    "B-LOC": "Location",
    "B-MISC": "Miscellaneous"
}

# Models for summarization tasks
# Note: both T5 entries below map to the same "t5-large" checkpoint
SUMMARIZATION_MODELS = {
    "BART (CNN/DailyMail)": "facebook/bart-large-cnn",
    "T5 (CNN/DailyMail)": "t5-large",
    "Pegasus (Newsroom)": "google/pegasus-newsroom",
    "BART (XSum)": "facebook/bart-large-xsum",
    "T5 (XSum)": "t5-large",
    "Pegasus (XSum)": "google/pegasus-xsum"
}
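`ENTITY_DESCRIPTIONS` lists the `B-` and `I-` variant of every entity type separately. As a possible alternative, the sketch below derives the readable name from the type part of the tag. The names `ENTITY_TYPES` and `describe_tag` are hypothetical helpers and not part of the notebook.

```python
# Entity types used by the tags in ENTITY_DESCRIPTIONS
ENTITY_TYPES = {
    "PER": "Person",
    "ORG": "Organization",
    "LOC": "Location",
    "MISC": "Miscellaneous",
}

def describe_tag(tag: str) -> str:
    """Return a readable name for a tag such as 'B-PER' or 'I-LOC'."""
    # Drop the 'B-' / 'I-' position prefix if present
    entity_type = tag.split("-", 1)[-1]
    # Fall back to the raw tag for anything unknown
    return ENTITY_TYPES.get(entity_type, tag)

print(describe_tag("B-PER"))
print(describe_tag("I-MISC"))
print(describe_tag("O"))
```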


## Common Functions

In this section, we define a versatile function that handles different types of input sources for our NLP tasks. This function allows users to select input data from direct text input, predefined sample data, or files. By providing a unified approach to input selection, we ensure that our NLP processing pipeline is flexible and user-friendly, accommodating various use cases and preferences.

In [6]:
def input_selection(selected_input: str = None, text: str = None, sample: str = None, file: str = None):
    """
    Selects the input based on the specified input type.

    Args:
        selected_input (str): The type of input, either "text", "sample", or "file".
        text (str): The text input for processing.
        sample (str): The sample input key for retrieving sample data.
        file (File): The file object containing the input data.

    Returns:
        str: The selected input data as a string.
    """
    if selected_input == "text":
        retval = text
    elif selected_input == "sample":
        retval = named_entity_recognition_data[sample] if sample in named_entity_recognition_data else sample
    elif selected_input == "file":
        with open(file.name, 'r') as f:
            retval = f.read()
    else:
        retval = None
    
    return retval
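Depending on the Gradio version and the `type` setting of `gr.File`, the handler may receive either a path string or an object that exposes the path as `.name`. The sketch below shows one way to accept both. `read_uploaded_text` is a hypothetical helper, and UTF-8 encoding is an assumption about the uploaded files.

```python
import os
import tempfile
from types import SimpleNamespace

def read_uploaded_text(file) -> str:
    """Read text from a path string or from an object with a `.name` path."""
    # Plain strings have no `.name` attribute, so they are used as the path directly
    path = getattr(file, "name", file)
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()

# Create a small temporary file to stand in for an upload
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Hello from a file.")
    tmp_path = tmp.name

from_path = read_uploaded_text(tmp_path)
from_object = read_uploaded_text(SimpleNamespace(name=tmp_path))
os.remove(tmp_path)

print(from_path)
print(from_object)
```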


## Implementations

### Named Entity Recognition

Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that involves identifying and classifying entities in text into predefined categories such as persons, organizations, locations, and more. In this section, we implement NER using pre-trained models and provide functionality for comparing the outputs of two different models. This implementation includes generating a legend of identified entities, which enhances the interpretability of the results.

In [7]:
def generate_legend(entities):
    """
    Generates a legend string for the provided entities.

    Args:
        entities (list): A list of dictionaries containing entity information.

    Returns:
        str: A formatted string that represents the legend of entities.
    """
    legend = {}
    for ent in entities:
        if ent['entity'] not in legend:
            legend[ent['entity']] = 1
        else:
            legend[ent['entity']] += 1

    legend_str = "Entities Legend:\n" + "\n".join(
        [f"{ENTITY_DESCRIPTIONS.get(entity, entity)} ({entity}): {count}" for entity, count in legend.items()]
    )
    return legend_str


def named_entity_recognition(model: str = None, model_to_compare: str = None,
                             selected_input: str = "text", text: str = None,
                             sample: str = None, file: str = None):
    """
    Performs named entity recognition (NER) using the specified models and inputs.

    Args:
        model (str): The primary model to use for NER.
        model_to_compare (str): The secondary model to compare results with.
        selected_input (str): The type of input, e.g., "text", "sample", or "file".
        text (str): The text input for NER analysis.
        sample (str): The sample input for NER analysis.
        file (str): The file input for NER analysis.

    Returns:
        tuple: A tuple containing NER results and their legends for both models.
    """
    entities = {}
    models = [model, model_to_compare]
    text = input_selection(selected_input, text, sample, file)
    
    for model_name in models:
        # Resolve the display name to its Hugging Face model identifier
        model_id = NAMED_ENTITY_RECOGNITION_MODELS[model_name]
        tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=512)
        nlp = pipeline('ner', model=model_id, tokenizer=tokenizer)
        model_entities = nlp(text)
        entities[model_name] = [{"entity": ent["entity"], "score": ent["score"], "index": ent["index"],
                                 "start": ent["start"], "end": ent["end"]} for ent in model_entities]

    legend_text_1 = generate_legend(entities[models[0]])
    legend_text_2 = generate_legend(entities[models[1]])
    
    return ({"text": text, "entities": entities[models[0]]}, legend_text_1,
            {"text": text, "entities": entities[models[1]]}, legend_text_2)
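The counting loop in `generate_legend` can also be written with `collections.Counter`. The sketch below is an equivalent variant under that assumption. The entity list is made-up sample data in the shape the `ner` pipeline returns, reduced to the `entity` key, and only a subset of `ENTITY_DESCRIPTIONS` is repeated to keep the block self-contained.

```python
from collections import Counter

# Subset of the notebook's ENTITY_DESCRIPTIONS
ENTITY_DESCRIPTIONS = {
    "B-PER": "Person",
    "I-PER": "Person",
    "B-LOC": "Location",
}

def generate_legend_with_counter(entities):
    """Build the legend string by counting entity tags with Counter."""
    # Counter keeps tags in order of first appearance
    counts = Counter(ent["entity"] for ent in entities)
    lines = [
        f"{ENTITY_DESCRIPTIONS.get(tag, tag)} ({tag}): {count}"
        for tag, count in counts.items()
    ]
    return "Entities Legend:\n" + "\n".join(lines)

# Made-up sample entities for illustration
sample_entities = [
    {"entity": "B-PER"},
    {"entity": "I-PER"},
    {"entity": "B-LOC"},
    {"entity": "B-PER"},
]

legend = generate_legend_with_counter(sample_entities)
print(legend)
```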


### Translation

In this section, we implement translation using pre-trained models from Helsinki-NLP and Facebook (M2M100). The implementation includes one function per model and a third function that runs both and returns their output together with the reference translation from the dataset, so that the results can be compared side by side.

In [8]:
def helsinki_translation(source_language: str = None, target_language: str = None, text: str = None):
    """
    Translates text using the Helsinki-NLP translation model.

    Args:
        source_language (str): The source language code.
        target_language (str): The target language code.
        text (str): The text to be translated.

    Returns:
        str: The translated text.
    """
    helsinki_translation_pipeline = pipeline(
        'translation', model=f"Helsinki-NLP/opus-mt-{source_language}-{target_language}"
    )
    helsinki_translation = helsinki_translation_pipeline(text)
    return helsinki_translation[0]['translation_text']


def facebook_translation(source_language: str = None, target_language: str = None, text: str = None):
    """
    Translates text using the Facebook M2M100 translation model.

    Args:
        source_language (str): The source language code.
        target_language (str): The target language code.
        text (str): The text to be translated.

    Returns:
        str: The translated text.
    """
    facebook_translation_pipeline = pipeline(
        'translation', model="facebook/m2m100_418M", src_lang=source_language, tgt_lang=target_language
    )
    facebook_translation = facebook_translation_pipeline(text)
    return facebook_translation[0]['translation_text']


def translation(source_language: str = None, target_language: str = None, selected_input: str = None,
                text: str = None, sample: str = None, file: str = None):
    """
    Translates text using both Helsinki-NLP and Facebook translation models and compares with provided translations.

    Args:
        source_language (str): The source language code.
        target_language (str): The target language code.
        selected_input (str): The type of input, e.g., "text", "sample", or "file".
        text (str): The text input for translation.
        sample (str): The sample input for translation.
        file (str): The file input for translation.

    Returns:
        tuple: The translations from Helsinki-NLP, Facebook, and the provided translation data.
    """
    input_text = input_selection(selected_input, text, sample, file)
    first_translation = helsinki_translation(source_language=source_language, target_language=target_language, text=input_text)
    second_translation = facebook_translation(source_language=source_language, target_language=target_language, text=input_text)
    # A reference translation exists only if the input matches a row of the dataset
    matches = translation_data.loc[translation_data[source_language] == input_text, target_language]
    provided_translation = matches.values[0] if not matches.empty else ""
    return first_translation, second_translation, provided_translation
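`helsinki_translation` builds the model identifier directly from the two language codes, and the interface allows the same language to be chosen as source and target. The sketch below shows one way to validate the codes first. `helsinki_model_id` and `SUPPORTED_LANGUAGES` are hypothetical names. The check does not verify that a checkpoint for a given language pair actually exists on the Hugging Face Hub.

```python
# Language codes offered by the translation interface
SUPPORTED_LANGUAGES = ("en", "es", "fr", "de", "it")

def helsinki_model_id(source_language: str, target_language: str) -> str:
    """Return the Helsinki-NLP model identifier for a language pair."""
    for code in (source_language, target_language):
        if code not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language code: {code!r}")
    if source_language == target_language:
        raise ValueError("Source and target language must differ.")
    # Same naming scheme as in helsinki_translation
    return f"Helsinki-NLP/opus-mt-{source_language}-{target_language}"

print(helsinki_model_id("en", "de"))
```

Raising a `ValueError` early gives the user a readable message instead of a failed model download.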


### Summarization

Summarization is a vital task in Natural Language Processing (NLP) that aims to condense large bodies of text into shorter, coherent summaries, capturing the main points and essential information. In this section, we implement a summarization functionality using pre-trained models, enabling the generation of concise summaries from various input sources. This implementation includes handling long texts by chunking them and comparing generated summaries with provided reference summaries.

In [9]:
def chunk_text(tokenizer, text, max_length):
    """
    Splits text into chunks of a specified maximum length using the provided tokenizer.

    Args:
        tokenizer (transformers.PreTrainedTokenizer): The tokenizer to use for splitting text.
        text (str): The text to be split into chunks.
        max_length (int): The maximum length of each chunk.

    Returns:
        list: A list of token chunks, each of max_length or smaller.
    """
    tokens = tokenizer.tokenize(text)
    return [tokens[i:i + max_length] for i in range(0, len(tokens), max_length)]

def summarization(model: str = None, selected_input: str = None, text: str = None, sample: str = None, file: str = None):
    """
    Summarizes the given text using the specified summarization model.

    Args:
        model (str): The model to use for summarization.
        selected_input (str): The type of input, e.g., "text", "sample", or "file".
        text (str): The text input for summarization.
        sample (str): The sample input for summarization.
        file (str): The file input for summarization.

    Returns:
        tuple: The generated summary and the provided summary from the dataset.
    """
    text = input_selection(selected_input, text, sample, file)
    tokenizer = AutoTokenizer.from_pretrained(SUMMARIZATION_MODELS[model])
    chunks = chunk_text(tokenizer, text, 512)
    summarization_pipeline = pipeline('summarization', model=SUMMARIZATION_MODELS[model])
    
    summaries = []
    for chunk in chunks:
        chunked_text = tokenizer.convert_tokens_to_string(chunk)
        summary = summarization_pipeline(chunked_text)
        summaries.append(summary[0]['summary_text'])
    
    summary_text = ' '.join(summaries)
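    provided_summary = ""
    if selected_input == "sample":
        # Look up the reference highlights; stay empty if the sample is not in the dataset
        matches = summarization_data.loc[summarization_data['article'] == sample, 'highlights']
        if not matches.empty:
            provided_summary = matches.values[0]
    return summary_text, provided_summary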
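The chunking step in `chunk_text` is plain list slicing over the token sequence. The sketch below shows the same slicing with whitespace-separated words standing in for model tokens, which is a simplification so that the example runs without downloading a tokenizer. `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens, max_length):
    """Split a token list into consecutive chunks of at most max_length tokens."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [tokens[i:i + max_length] for i in range(0, len(tokens), max_length)]

# Whitespace splitting stands in for tokenizer.tokenize(text)
tokens = "one two three four five six seven".split()
chunks = chunk_tokens(tokens, 3)

for chunk in chunks:
    print(chunk)
```

Because the cut points ignore sentence boundaries, a chunk can end mid-sentence. Each chunk is summarized independently, so the joined summary may read unevenly at those points.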


## Gradio Interface

Gradio lets developers build interactive web interfaces for machine learning models with little code. In this section, we implement Gradio interfaces for the three NLP tasks: Named Entity Recognition (NER), Translation, and Summarization. Each interface lets users enter text directly, choose a predefined sample, or upload a file, and then view and compare the outputs of the pre-trained models.

In [10]:
def update_legend_and_output(model, model_to_compare, selected_input, text, samples, file):
    """
    Updates the legend and output for named entity recognition.

    Args:
        model (str): The model to use for NER.
        model_to_compare (str): The model to compare against.
        selected_input (str): The type of input, e.g., "text", "sample", or "file".
        text (str): The text input for NER.
        samples (str): The sample input for NER.
        file (str): The file input for NER.

    Returns:
        tuple: The highlighted text and legend for both models.
    """
    highlighted_text_output_1, legend_text_1, highlighted_text_output_2, legend_text_2 = named_entity_recognition(
        model, model_to_compare, selected_input, text, samples, file
    )
    return highlighted_text_output_1, legend_text_1, highlighted_text_output_2, legend_text_2

def named_entity_recognition_interface():
    """
    Creates the named entity recognition interface using Gradio.

    Returns:
        gr.Blocks: The Gradio interface for named entity recognition.
    """
    with gr.Blocks() as blocks:
        model = gr.Dropdown(list(NAMED_ENTITY_RECOGNITION_MODELS.keys()), label="Model", value="BERT (CoNLL-03 English)")
        model_to_compare = gr.Dropdown(list(NAMED_ENTITY_RECOGNITION_MODELS.keys()), label="Model to Compare", value="Wikineural Multilingual (Babelscape)")
        text = gr.Textbox(lines=5, label="Input", value="Please enter the text to analyze.")
        samples = gr.Dropdown(list(named_entity_recognition_data.keys()), label="Samples", value="Introduction")
        file = gr.File(label="Upload a file")
        selected_input = gr.Radio(
            [("Text", "text"), ("Sample", "sample"), ("File", "file")], 
            label="Select Input Type", 
            value="sample"
        )

        highlighted_text_1 = gr.HighlightedText(label="Model 1 Result")
        legend_1 = gr.Markdown("Entities Legend for Model 1 will be shown here")
        
        highlighted_text_2 = gr.HighlightedText(label="Model 2 Result")
        legend_2 = gr.Markdown("Entities Legend for Model 2 will be shown here")

        inputs = [model, model_to_compare, selected_input, text, samples, file]
        for input_component in inputs:
            input_component.change(
                update_legend_and_output, 
                inputs=inputs, 
                outputs=[highlighted_text_1, legend_1, highlighted_text_2, legend_2]
            )

        gr.Interface(
            fn=update_legend_and_output, 
            inputs=inputs, 
            outputs=[highlighted_text_1, legend_1, highlighted_text_2, legend_2], 
            title="Named Entity Recognition"
        )
        
    return blocks

def update_samples(source_language):
    """
    Updates the sample choices based on the selected source language.

    Args:
        source_language (str): The selected source language.

    Returns:
        gr.update: Updated dropdown choices for samples.
    """
    samples = list(translation_data[source_language])
    return gr.update(choices=samples, value=samples[0])

def translation_interface():
    """
    Creates the translation interface using Gradio.

    Returns:
        gr.Blocks: The Gradio interface for translation.
    """
    with gr.Blocks() as blocks:
        source_language = gr.Dropdown(
            choices=[("English", "en"), ("Spanish", "es"), ("French", "fr"), ("German", "de"), ("Italian", "it")], 
            label="Source Language", 
            multiselect=False, 
            value="en"
        )

        target_language = gr.Dropdown(
            choices=[("English", "en"), ("Spanish", "es"), ("French", "fr"), ("German", "de"), ("Italian", "it")], 
            label="Target Language", 
            multiselect=False, 
            value="de"
        )

        text = gr.Textbox(lines=5, label="Input", value="Please enter the text to translate.")

        sample = gr.Dropdown(
            choices=list(translation_data["en"]), 
            label="Samples", 
            value=translation_data["en"][0]
        )

        file = gr.File(label="Upload a file")

        selected_input = gr.Radio(
            choices=[("Text", "text"), ("Sample", "sample"), ("File", "file")], 
            label="Select Input Type", 
            value="sample"
        )

        source_language.change(update_samples, inputs=source_language, outputs=sample)

        gr.Interface(
            fn=translation, 
            inputs=[source_language, target_language, selected_input, text, sample, file], 
            outputs=[gr.Textbox(label="Translation"), gr.Textbox(label="Alternative"), gr.Textbox(label="Ground Truth")], 
            title="Translation"
        )
        
    return blocks

def summarization_interface():
    """
    Creates the summarization interface using Gradio.

    Returns:
        gr.Blocks: The Gradio interface for summarization.
    """
    with gr.Blocks() as blocks:
        model = gr.Dropdown(list(SUMMARIZATION_MODELS.keys()), label="Model", value="BART (CNN/DailyMail)")
        
        text = gr.Textbox(lines=5, label="Input", value="Please enter the text to summarize.")
        
        sample = gr.Dropdown(list(summarization_data["article"]), label="Samples", value=summarization_data["article"][0])
        
        file = gr.File(label="Upload a file")
        
        selected_input = gr.Radio(
            choices=[("Text", "text"), ("Sample", "sample"), ("File", "file")], 
            label="Select Input Type", 
            value="sample"
        )

        gr.Interface(
            fn=summarization, 
            inputs=[model, selected_input, text, sample, file], 
            outputs=[gr.Textbox(label="Summary"), gr.Textbox(label="Highlights (Ground Truth)")], 
            title="Summarization"
        )
        
    return blocks

def build_interface():
    """
    Builds the entire Gradio interface with tabs for named entity recognition, translation, and summarization.

    Returns:
        gr.TabbedInterface: The complete Gradio interface.
    """
    interface = gr.TabbedInterface([
        named_entity_recognition_interface(),
        translation_interface(),
        summarization_interface()
    ], ["Named Entity Recognition", "Translation", "Summarization"], title="NLP Toolkit")
    return interface

interface = build_interface()
interface.launch()


Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
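The handler functions in this notebook construct their pipelines on every call, so each interaction in the UI loads the model again. One possible mitigation is to cache loaded pipelines with `functools.lru_cache`. The sketch below uses a stand-in loader (`expensive_load`, hypothetical) so that it runs without downloading a model. In the notebook, the stand-in body would be replaced by a call such as `pipeline(task, model=model_id)`.

```python
from functools import lru_cache

# Records every real load so the effect of the cache is visible
load_calls = []

def expensive_load(task: str, model_id: str):
    """Stand-in for transformers.pipeline(task, model=model_id)."""
    load_calls.append((task, model_id))
    return lambda text: f"[{task}:{model_id}] {text}"

@lru_cache(maxsize=4)
def get_pipeline(task: str, model_id: str):
    """Return a cached pipeline for the given task and model identifier."""
    return expensive_load(task, model_id)

first = get_pipeline("summarization", "facebook/bart-large-cnn")
second = get_pipeline("summarization", "facebook/bart-large-cnn")

print(first is second)
print(len(load_calls))
```

The `maxsize` value is an arbitrary choice here. Every cached model stays in memory, so a small limit is a reasonable starting point on machines with limited RAM.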


