# Task 1

- Build Gradio UIs to apply transformer models in different
NLP contexts (Named Entity Recognition, Translation,
Summarization)
- Create and invite me to a GitHub repository for task 1
- Coding and presentation should be based on Jupyter
Notebook files (ipynb)
- Additionally, record and upload a screencast demonstrating
your application
- Task Share: 20%

**Gradio**

> Gradio is an open-source Python library that allows developers to create and share interactive web interfaces for machine learning models and data science workflows with minimal coding. It supports various input and output types, provides real-time interactivity, and integrates easily with popular machine learning frameworks.

**Hugging Face**

> Hugging Face is a company and open-source community known for its transformers library, which provides state-of-the-art pre-trained models for natural language processing tasks. It also offers a platform for sharing, training, and deploying machine learning models, making advanced NLP technology accessible to researchers and developers.

## NER (Named Entity Recognition)

> Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

In the first step, we import all necessary libraries. The user can then either type text into the interface or upload a text file, and selects the model to apply from a dropdown.

Once the input is submitted, a tokenizer preprocesses it. In the code below, this happens inside the `pipeline` call rather than as a separate step.

> The tokenizer is essential in transforming raw text into a format that a transformer model like BERT can understand by splitting the text into tokens and converting these tokens into numerical input IDs. It ensures consistency with the model's pre-trained vocabulary and includes necessary special tokens and attention masks, enabling accurate and efficient text processing.
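To make the tokenizer description more concrete, here is a minimal sketch of the idea in plain Python. It is not the Hugging Face implementation: the vocabulary `VOCAB` and the helpers `wordpiece` and `encode` are hypothetical names invented for this illustration, and a real BERT vocabulary has roughly 30,000 entries. The sketch splits words into sub-word pieces by greedy longest match, adds the special `[CLS]` and `[SEP]` tokens, maps tokens to input IDs, and builds an attention mask that is 1 for real tokens and 0 for padding.

```python
# Toy vocabulary (an assumption for this sketch, not a real model vocabulary).
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "Angela": 4, "lives": 5, "in": 6, "Ber": 7, "##lin": 8,
}


def wordpiece(word, vocab):
    """Split one word into sub-word tokens by greedy longest match."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are marked with "##"
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens


def encode(text, vocab, max_length=8):
    """Return tokens, padded input IDs, and the attention mask (no truncation)."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")

    input_ids = [vocab[token] for token in tokens]
    attention_mask = [1] * len(input_ids)

    # Pad up to max_length; padded positions get attention mask 0
    padding = max(0, max_length - len(input_ids))
    input_ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding
    return tokens, input_ids, attention_mask


tokens, input_ids, attention_mask = encode("Angela lives in Berlin", VOCAB)
print(tokens)          # ['[CLS]', 'Angela', 'lives', 'in', 'Ber', '##lin', '[SEP]']
print(input_ids)       # [2, 4, 5, 6, 7, 8, 3, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0]
```

Here "Berlin" is not in the toy vocabulary, so it is split into `Ber` and `##lin`. This is also why a single word can later come back from the model as several token-level predictions.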

Once the input has been preprocessed, the model does its job: each recognized entity is highlighted in the text, with a color that indicates its entity type.
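The highlighting step can be sketched without loading a model. The example below assumes token-level predictions shaped like dictionaries with `entity`, `start`, and `end` keys; the sample predictions are written by hand for illustration and are not real model output. The helpers `merge_entities` and `render` are hypothetical names, not part of Gradio or Transformers. The sketch merges adjacent tokens of the same type into one span, then wraps each span in a `<mark>` tag while escaping the surrounding text so that user input cannot inject HTML.

```python
import html


def merge_entities(tokens):
    """Merge adjacent token predictions of the same type into spans."""
    spans = []
    for tok in tokens:
        # "B-PER" -> prefix "B", type "PER"; a bare "PER" -> prefix ""
        prefix, _, etype = tok["entity"].rpartition("-")
        previous = spans[-1] if spans else None
        continues = (
            previous is not None
            and prefix != "B"                       # "B-" always opens a new span
            and previous["type"] == etype
            and tok["start"] - previous["end"] <= 1  # adjacent or one space apart
        )
        if continues:
            previous["end"] = tok["end"]
        else:
            spans.append({"type": etype, "start": tok["start"], "end": tok["end"]})
    return spans


def render(text, spans):
    """Wrap each span in a <mark> tag and escape everything else."""
    parts, last = [], 0
    for span in spans:
        label = span["type"]
        word = html.escape(text[span["start"]:span["end"]])
        parts.append(html.escape(text[last:span["start"]]))
        parts.append(f'<mark title="{label}">{word}</mark>')
        last = span["end"]
    parts.append(html.escape(text[last:]))
    return "".join(parts)


text = "Angela Merkel visited New York."
# Hand-written sample predictions (illustrative only)
predictions = [
    {"entity": "B-PER", "start": 0, "end": 6},
    {"entity": "I-PER", "start": 7, "end": 13},
    {"entity": "B-LOC", "start": 22, "end": 25},
    {"entity": "I-LOC", "start": 26, "end": 30},
]

spans = merge_entities(predictions)
print(render(text, spans))
# <mark title="PER">Angela Merkel</mark> visited <mark title="LOC">New York</mark>.
```

Merging is a design choice: without it, every token is highlighted separately, so a name such as "Angela Merkel" appears as two boxes with an unhighlighted gap between them.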

In [None]:
#!pip install gradio transformers

In [None]:
import gradio as gr
from transformers import pipeline

# Dictionary mapping model names to their identifiers and their respective label mappings
MODELS = {
    "BERT Large (CoNLL-03 English)": ("dbmdz/bert-large-cased-finetuned-conll03-english", None),
    "Wikineural Multilingual NER (Babelscape)": ("Babelscape/wikineural-multilingual-ner", None),
    "BERT Large NER (dslim)": ("dslim/bert-large-NER", None)
}

# Color mapping for entity types, ignoring B- and I- prefixes
ENTITY_COLORS = {
    "PER": ("yellow", "black"),       # Person
    "LOC": ("green", "white"),        # Location
    "ORG": ("blue", "white"),         # Organization
    "MISC": ("gray", "white"),        # Miscellaneous (nationalities, religions, products, or events)
    "DATE": ("orange", "black"),      # Date
    "TIME": ("purple", "white"),      # Time
    "MONEY": ("red", "white"),        # Money
    "PERCENT": ("lightblue", "black"),# Percent
    "QUANTITY": ("pink", "black"),    # Quantity
    "LAW": ("lightgreen", "black"),   # Legal references
    "LANGUAGE": ("beige", "black"),   # Language
    "O": ("lightgray", "black")       # Non-entity
}

# Cache of loaded pipelines, so each model is loaded only once
# instead of on every request and every chunk
_PIPELINES = {}

def get_entities(model_name, text):
    """
    Function to get named entities from the text using the specified model.
    """
    model, label_map = MODELS[model_name]
    if model_name not in _PIPELINES:
        _PIPELINES[model_name] = pipeline("ner", model=model)
    entities = _PIPELINES[model_name](text)

    # Remap labels if a label map is provided
    if label_map:
        for entity in entities:
            try:
                entity['entity'] = label_map[entity['entity']]
            except KeyError:
                raise ValueError(f"Unexpected label '{entity['entity']}' encountered. Please check the label map.")

    return entities

def highlight_text(text, entities):
    """
    Function to highlight entities in the text with different colors based on their type.
    """
    from html import escape  # escape user text so it cannot inject HTML into the output

    highlighted_text = ""
    last_position = 0

    for entity in entities:
        start = entity['start']
        end = entity['end']
        word = escape(text[start:end])

        # Determine entity type and corresponding color, ignoring B- and I- prefixes
        entity_type = entity['entity'].split('-')[-1]
        bg_color, text_color = ENTITY_COLORS.get(entity_type, ("lightgray", "black"))

        # Append the plain text before the entity, then the highlighted entity
        highlighted_text += escape(text[last_position:start])
        highlighted_text += f'<span style="background-color: {bg_color}; color: {text_color};">{word}</span>'
        last_position = end

    # Append the remaining text after the last entity
    highlighted_text += escape(text[last_position:])
    return highlighted_text

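def highlight_entities(model_name, text):
    """
    Function to highlight named entities in the text using the specified model.
    """
    # Split text into chunks of at most max_chunk_size characters. This is a
    # character budget, not the model's token limit. Where possible, a chunk
    # ends right after whitespace so that no word is cut in half.
    max_chunk_size = 512
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        if end < len(text):
            # Look for the last space or newline inside the current window
            split_at = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split_at > start:
                end = split_at + 1
        chunks.append(text[start:end])
        start = end

    highlighted_text = ""
    for chunk in chunks:
        entities = get_entities(model_name, chunk)
        highlighted_text += highlight_text(chunk, entities)

    return highlighted_text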

def process_text_input(model_name, text):
    """
    Process text input and return highlighted entities.
    """
    return highlight_entities(model_name, text)

def process_file_input(model_name, filepath):
    """
    Process uploaded file and return highlighted entities.
    """
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read()
    return highlight_entities(model_name, text)

def gradio_interface(model_name, text, filepath):
    """
    Gradio interface function to handle text input or file upload.
    """
    # The dropdown has no default value, so guard against a missing selection
    if not model_name:
        return "Please select a model."
    if filepath is not None:
        return process_file_input(model_name, filepath)
    elif text:
        return process_text_input(model_name, text)
    else:
        return "Please enter text or upload a file."

# List of available model names
model_list = list(MODELS.keys())

# CSS for the fixed legend
css = """
<style>
#legend {
    position: fixed;
    bottom: 10px;
    left: 10px;
    padding: 10px;
    background: #2f3b4a;
    border: 1px solid #ccc;
    z-index: 1000;
    width: 250px;
    color: white;
}
body {
    background: #2f3b4a;
    color: white;
}
</style>
"""

# HTML for the fixed legend
legend_html = """
<div id="legend">
    <b>Legend:</b><br>
    <span style="background-color: yellow; color: black; padding: 2px;">PER</span>: Person<br>
    <span style="background-color: green; color: white; padding: 2px;">LOC</span>: Location<br>
    <span style="background-color: blue; color: white; padding: 2px;">ORG</span>: Organization<br>
    <span style="background-color: gray; color: white; padding: 2px;">MISC</span>: Miscellaneous<br>
    <span style="background-color: orange; color: black; padding: 2px;">DATE</span>: Date<br>
    <span style="background-color: purple; color: white; padding: 2px;">TIME</span>: Time<br>
    <span style="background-color: red; color: white; padding: 2px;">MONEY</span>: Money<br>
    <span style="background-color: lightblue; color: black; padding: 2px;">PERCENT</span>: Percent<br>
    <span style="background-color: pink; color: black; padding: 2px;">QUANTITY</span>: Quantity<br>
    <span style="background-color: lightgreen; color: black; padding: 2px;">LAW</span>: Legal references<br>
    <span style="background-color: beige; color: black; padding: 2px;">LANGUAGE</span>: Language<br>
</div>
"""

# Creating the Gradio interface
with gr.Blocks() as interface:
    gr.HTML(css)
    with gr.Row():
        with gr.Column():
            model_input = gr.Dropdown(choices=model_list, label="Model")
            text_input = gr.Textbox(label="Input Text", lines=5, placeholder="Enter text here or upload a file below")
            file_input = gr.File(label="Upload a text file", type="filepath")
            submit_btn = gr.Button("Submit")
        with gr.Column():
            output = gr.HTML()
            submit_btn.click(fn=gradio_interface, inputs=[model_input, text_input, file_input], outputs=output)
    gr.HTML(legend_html)

# Launching the Gradio interface
interface.launch()


Running on local URL:  http://127.0.0.1:7871

To create a public link, set `share=True` in `launch()`.




Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).