# TP2 Semantic annotation (Named-Entity Linking) and text enrichment

@Authors : `HADDOU, Amine` and `DE SERROUX, Colin`

The main objective of this lab is to develop an automatic system that takes an 
input text, recognizes the entities mentioned in it, and determines the most 
appropriate DBpedia resource for each recognized entity given its context. 
Then, by querying each resource, the system retrieves 
information used to enrich the original text.


## Installation

In [1]:
%pip install pyspotlight
%pip install SPARQLWrapper
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


## Libraries

In [2]:
import spotlight
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

## Global functions

In [3]:
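def get_annotations(api_url: str, text_to_annotate: str, confidence: float = 0.4, support: int = 20) -> list[dict]:
    """
    Annotate a tweet with DBpedia Spotlight and return the linked entities.

    :param api_url: URL of the DBpedia Spotlight annotate endpoint (public API or a local instance)
    :param text_to_annotate: the tweet to annotate
    :param confidence: disambiguation confidence threshold between 0 and 1; higher values keep fewer, more certain annotations
    :param support: minimum support of a resource (roughly, its number of Wikipedia inlinks); higher values keep only prominent resources
    :return: the list of annotations, or an empty list if Spotlight finds none
    """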
    
    try:
        annotations = spotlight.annotate(
            api_url,
            text_to_annotate,
            confidence=confidence,
            support=support,
        )

        # There is no need to explain what twitter is when talking about a tweet.
        annotations = [item for item in annotations if item["surfaceForm"] != "t.co"]

        return annotations
    except spotlight.SpotlightException as e:
        print(f"Spotlight error : {e}")

        return []

In [4]:
def get_entity_info(uri: str) -> list:
    """
    Retrieves information from an annotation using its URI in DBPedia.
    
    :param uri: URI of the annotation
    """

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?comment
        WHERE {{
            <{uri}> rdfs:comment ?comment .
            FILTER(langMatches(lang(?comment), "EN"))
        }}
    """)
    sparql.setReturnFormat(JSON)

    try:
        results = sparql.query().convert()

        return [result["comment"]["value"] for result in results["results"]["bindings"]]
    except Exception as e:
        print(f"SPARQL error : {e}")

        return []
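
`get_entity_info` sends one request per URI and interpolates the URI directly into the query. As a minimal sketch of an alternative (not part of the original notebook), the hypothetical helper `build_comment_query` below builds a single query for several resources with a SPARQL `VALUES` clause and rejects anything that does not look like a DBpedia resource URI. The function name and the validation pattern are assumptions made for illustration.

```python
import re

# Conservative pattern for DBpedia resource URIs (an assumption for this sketch);
# anything else is rejected rather than interpolated into the query.
_SAFE_URI = re.compile(r'^https?://dbpedia\.org/resource/[^\s<>"{}|\\^`]+$')


def build_comment_query(uris: list[str], lang: str = "en") -> str:
    """Build one SPARQL query returning the rdfs:comment of several resources."""
    safe = [uri for uri in uris if _SAFE_URI.match(uri)]
    if not safe:
        raise ValueError("no valid DBpedia resource URI supplied")

    # VALUES binds ?entity to each URI in turn, so one request covers them all.
    values = " ".join(f"<{uri}>" for uri in safe)
    return f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?entity ?comment
        WHERE {{
            VALUES ?entity {{ {values} }}
            ?entity rdfs:comment ?comment .
            FILTER(langMatches(lang(?comment), "{lang}"))
        }}
    """
```

The resulting string could be passed to `sparql.setQuery(...)` exactly as in `get_entity_info`; each binding would then carry both `entity` and `comment`.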

## The test code

TODO: experiment with the `confidence` and `support` parameters.

In [5]:
api_url = "https://api.dbpedia-spotlight.org/en/annotate"
text_to_annotate = "With an equilibrium temperature of about 4,050 kelvin, the exoplanet KELT-9b (also known as HD 195689b) is an archetype of the class of ultrahot Jupiters that straddle the transition between stars and gas-giant exoplanets and are therefore useful for studying atmospheric chemistry."

annotations = get_annotations(api_url, text_to_annotate)

print("Entities annotated :")

for annotation in annotations:
    print(f"Entity : {annotation['URI']} - Surface : {annotation['surfaceForm']}")
    
    results = get_entity_info(annotation["URI"])

    for result in results:
        print(f"Comment : {result}\n")

Entities annotated :
Entity : http://dbpedia.org/resource/Planetary_equilibrium_temperature - Surface : equilibrium temperature
Comment : The planetary equilibrium temperature is a theoretical temperature that a planet would be if it were a black body being heated only by its parent star. In this model, the presence or absence of an atmosphere (and therefore any greenhouse effect) is irrelevant, as the equilibrium temperature is calculated purely from a balance with incident stellar energy.

Entity : http://dbpedia.org/resource/Exoplanet - Surface : exoplanet
Comment : An exoplanet or extrasolar planet is a planet outside the Solar System. The first possible evidence of an exoplanet was noted in 1917 but was not recognized as such. The first confirmation of detection occurred in 1992. A different planet, initially detected in 1988, was confirmed in 2003. As of 1 December 2022, there are 5,284 confirmed exoplanets in 3,899 planetary systems, with 847 systems having more than one planet.



## Use cases

### Environment

In [6]:
dataset_name = "hf://datasets/zeroshot/twitter-financial-news-topic/topic_train.csv"
api_url = "https://api.dbpedia-spotlight.org/en/annotate"

### Dataset

The dataset consists of tweets with topic labels, retrieved from [twitter-financial-news-topic](https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic). The labels are not used in this lab.

#### Loading the dataset

In [7]:
df = pd.read_csv(dataset_name)

df.head()



| | text | label |
|---|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... | 0 |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... | 0 |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... | 0 |
| 3 | Analysts react to Tesla's latest earnings, bre... | 0 |
| 4 | Netflix and its peers are set for a ‘return to... | 0 |


#### Data pre-processing

Drop the `label` column, which is not used in this lab.

In [8]:
df = df.drop(columns=["label"])

df.head()

| | text |
|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... |
| 3 | Analysts react to Tesla's latest earnings, bre... |
| 4 | Netflix and its peers are set for a ‘return to... |


Rename the `text` column to `tweet`.

In [9]:
df = df.rename(columns={"text": "tweet"})

df.head()

| | tweet |
|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... |
| 3 | Analysts react to Tesla's latest earnings, bre... |
| 4 | Netflix and its peers are set for a ‘return to... |


Add the length of each tweet, measured in whitespace-separated words.

In [10]:
df["tweet_length"] = df["tweet"].apply(lambda x: len(x.split()))

df.head()

| | tweet | tweet_length |
|---|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... | 15 |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... | 13 |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... | 13 |
| 3 | Analysts react to Tesla's latest earnings, bre... | 15 |
| 4 | Netflix and its peers are set for a ‘return to... | 19 |


#### Analysis of the dataset

Only two columns remain: `tweet` and `tweet_length`.

In [11]:
size = len(df)

print(f"There are {size} rows in the dataset.")

There are 16990 rows in the dataset.


In [12]:
df["tweet"].is_unique

True

All our tweets are unique.

In [13]:
mean_length = df["tweet_length"].mean()
median_length = df["tweet_length"].median()

print(f"Average tweet size : {mean_length}")
print(f"Median tweet size : {median_length}")

Average tweet size : 18.31153619776339
Median tweet size : 16.0


### Code

## Benchmarking on `confidence` and `support`

In [14]:
confidences = [0.4, 0.6, 0.9]
supports = [5, 20, 40, 110]

In [15]:
# Randomly sample 5 tweets
tweets_test = df["tweet"].sample(5)

In [16]:
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from rich.text import Text

# Initialize console
console = Console()

for tweet in tweets_test:
    # Create a table for the tweet and confidence/support details
    tweet_table = Table(title="Tweet Analysis", show_lines=True)
    tweet_table.add_column("Category", style="bold cyan")
    tweet_table.add_column("Details", style="magenta")
    
    # Add the original tweet to the table
    tweet_table.add_row("Tweet", tweet)
    console.print(tweet_table)

    for confidence in confidences:
        for support in supports:
            # Display confidence and support values
            console.print(Panel(f"Confidence: [bold green]{confidence}[/] - Support: [bold yellow]{support}[/]",
                                title="Configuration",
                                border_style="bold magenta"))
            
            # Get annotations
            annotations = get_annotations(api_url, tweet, confidence, support)
            
            if annotations:
                # Create a table for annotations
                annotations_table = Table(title="Annotations", show_lines=True)
                # One column per distinct field (the URI was previously shown twice)
                annotations_table.add_column("Surface", style="magenta")
                annotations_table.add_column("URI", style="green")
                
                for annotation in annotations:
                    annotations_table.add_row(
                        annotation['surfaceForm'],
                        annotation['URI']
                    )
                
                console.print(annotations_table)
                
                # # Fetch and display entity information
                # for annotation in annotations:
                #     results = get_entity_info(annotation["URI"])
                #     if results:
                #         results_panel = Panel(
                #             "\n".join([f"[bold cyan]Comment:[/]\n{result}" for result in results]),
                #             title=f"Details for {annotation['surfaceForm']}",
                #             border_style="yellow"
                #         )
                #         console.print(results_panel)
            else:
                console.print(Panel("No annotations found.", border_style="red"))

    console.print("\n" + "="*50 + "\n")

Spotlight error : No Resources found in spotlight response: {'@text': '$SPY - SPY: The Inflation Report From The Abyss.  https://t.co/K4OeVz9hRm #trading #business #investing', '@confidence': '0.9', '@support': '5', '@types': '', '@sparql': '', '@policy': 'whitelist'}


Spotlight error : No Resources found in spotlight response: {'@text': '$SPY - SPY: The Inflation Report From The Abyss.  https://t.co/K4OeVz9hRm #trading #business #investing', '@confidence': '0.9', '@support': '20', '@types': '', '@sparql': '', '@policy': 'whitelist'}


Spotlight error : No Resources found in spotlight response: {'@text': '$SPY - SPY: The Inflation Report From The Abyss.  https://t.co/K4OeVz9hRm #trading #business #investing', '@confidence': '0.9', '@support': '40', '@types': '', '@sparql': '', '@policy': 'whitelist'}


Spotlight error : No Resources found in spotlight response: {'@text': '$SPY - SPY: The Inflation Report From The Abyss.  https://t.co/K4OeVz9hRm #trading #business #investing', '@confidence': '0.9', '@support': '110', '@types': '', '@sparql': '', '@policy': 'whitelist'}
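
The errors above are raised when Spotlight returns no resource for a configuration, which `get_annotations` turns into an empty list. To compare configurations numerically rather than by reading tables, a minimal sketch could tally how many annotations each `(confidence, support)` pair yields. The helper `count_annotations` and the stand-in `fake_annotate` are hypothetical names introduced here; the real annotator would be `get_annotations` with `api_url` bound.

```python
from itertools import product
from typing import Callable


def count_annotations(
    annotate: Callable[[str, float, int], list[dict]],
    tweets: list[str],
    confidences: list[float],
    supports: list[int],
) -> dict[tuple[float, int], int]:
    """Total number of annotations returned for each (confidence, support) pair."""
    counts = {}
    for confidence, support in product(confidences, supports):
        counts[(confidence, support)] = sum(
            len(annotate(tweet, confidence, support)) for tweet in tweets
        )
    return counts


# Hypothetical stand-in for the real annotator, so the sketch runs offline:
# it "annotates" every word unless the confidence threshold is 0.9 or more.
def fake_annotate(text: str, confidence: float, support: int) -> list[dict]:
    if confidence >= 0.9:
        return []
    return [{"surfaceForm": word} for word in text.split()]


counts = count_annotations(fake_annotate, ["a b", "c"], [0.4, 0.9], [5, 20])
```

With the notebook's objects, the call would look like `count_annotations(lambda t, c, s: get_annotations(api_url, t, c, s), list(tweets_test), confidences, supports)`.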


Annotate the first 50 tweets with the default `confidence` and `support` values.

In [17]:
annotated_tweets = []

for tweet in df["tweet"][:50]:
    annotations = get_annotations(api_url, tweet)
    annotated_tweets.append({
        "tweet": tweet,
        "annotations": annotations
    })

Check the annotations of the first tweet.

In [18]:
result = annotated_tweets[0]
print(f"\nTweet : {result['tweet']}")
print("Annotations :")
    
for ann in result["annotations"]:
    print(f"  - {ann['surfaceForm']} ({ann['URI']})")


Tweet : Here are Thursday's biggest analyst calls: Apple, Amazon, Tesla, Palantir, DocuSign, Exxon &amp; more  https://t.co/QPN8Gwl7Uh
Annotations :
  - Apple (http://dbpedia.org/resource/Apple_Inc.)
  - Amazon (http://dbpedia.org/resource/Amazon_Prime_Video)
  - Tesla (http://dbpedia.org/resource/Tesla_Model_3)
  - Palantir (http://dbpedia.org/resource/Palantir_Technologies)
  - DocuSign (http://dbpedia.org/resource/DocuSign)
  - Exxon (http://dbpedia.org/resource/Exxon)


Enrich each annotation with the first English `rdfs:comment` of its DBpedia resource.

In [19]:
enriched_tweets = []

for result in annotated_tweets:
    enriched_annotations = []

    for ann in result["annotations"]:
        summary = get_entity_info(ann["URI"])

        if summary:
            enriched_annotations.append({
                "entity": ann["surfaceForm"],
                "uri": ann["URI"],
                "summary": summary[0]
            })
    
    enriched_tweets.append({
        "tweet": result["tweet"],
        "enriched_annotations": enriched_annotations
    })
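
The same entity (for example a frequently mentioned company) can appear in many tweets, so the loop above may query DBpedia repeatedly for the same URI. A minimal sketch of memoizing the lookup with the standard library follows; `memoize_lookup` is a hypothetical helper, and `fetch` stands for any function with the shape of `get_entity_info`.

```python
from functools import lru_cache
from typing import Callable


def memoize_lookup(fetch: Callable[[str], list[str]], maxsize: int = 1024):
    """Wrap a URI -> comments lookup so each distinct URI is fetched only once."""

    @lru_cache(maxsize=maxsize)
    def cached(uri: str) -> tuple[str, ...]:
        # Return a tuple: it is immutable, so callers cannot alter a cached entry.
        return tuple(fetch(uri))

    return cached
```

In the notebook this would be used as `cached_info = memoize_lookup(get_entity_info)`. One caveat of this sketch: because `get_entity_info` returns an empty list on error, a failed request would also be cached.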

Build the enriched text by inserting each entity's summary after its surface form.

In [20]:
enriched_texts = []

for enriched in enriched_tweets:
    tweet = enriched["tweet"]
    enriched_text = tweet
    
    for annotation in enriched["enriched_annotations"]:
        enriched_text = enriched_text.replace(
            annotation["entity"],
            f"{annotation['entity']}, {annotation['summary']}"
        )
    
    enriched_texts.append(enriched_text)
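
Note that `str.replace` substitutes every occurrence of the surface form, including occurrences inside summaries inserted by earlier iterations, which can inflate the enriched text. A minimal sketch of an offset-based alternative is shown below. It assumes each annotation dictionary also carries the character `offset` of the surface form in the original tweet (Spotlight annotations expose an `offset` field, but the `enriched_annotations` built above do not keep it, so it would have to be added). `insert_summaries` is a hypothetical helper name.

```python
def insert_summaries(text: str, annotations: list[dict]) -> str:
    """Insert each summary right after its entity, located by character offset."""
    enriched = text
    # Work from the end of the text so that earlier offsets remain valid.
    for ann in sorted(annotations, key=lambda a: a["offset"], reverse=True):
        end = ann["offset"] + len(ann["entity"])
        enriched = f"{enriched[:end]}, {ann['summary']}{enriched[end:]}"
    return enriched


example = insert_summaries(
    "Apple and Amazon",
    [
        {"offset": 0, "entity": "Apple", "summary": "a company that sells Amazon gift cards"},
        {"offset": 10, "entity": "Amazon", "summary": "a retailer"},
    ],
)
```

Here the word "Amazon" inside the first summary is left untouched, whereas the `replace`-based loop would expand it a second time.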

Create a new dataframe for the enriched tweets.

In [21]:
df = pd.DataFrame({
    "tweet": [result["tweet"] for result in annotated_tweets],
    "enriched_tweet": enriched_texts,
})

Calculate the length of the enriched tweets.

In [22]:
df["tweet_length"] = df["tweet"].apply(lambda x: len(x.split()))
df["enriched_tweet_length"] = df["enriched_tweet"].apply(lambda x: len(x.split()))

Save the enriched tweets to a CSV file.

In [23]:
df.to_csv("enriched_tweets.csv", index=False, sep=";")

df.head()

| | tweet | enriched_tweet | tweet_length | enriched_tweet_length |
|---|---|---|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... | Here are Thursday's biggest analyst calls: App... | 15 | 480 |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... | Buy, Buy (Russian: Буй) is a town in Kostroma ... | 13 | 265 |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... | Piper, Piper Aircraft, Inc. is a manufacturer ... | 13 | 297 |
| 3 | Analysts react to Tesla's latest earnings, bre... | Analysts react to Tesla, The Tesla Model 3 is ... | 15 | 216 |
| 4 | Netflix and its peers are set for a ‘return to... | Netflix, Netflix, Inc. is an American subscrip... | 19 | 202 |


## Visualization

In [24]:
# Initialize console
console = Console()

for i in range(len(annotated_tweets)):
    annotated_tweet = annotated_tweets[i]
    
    # Create a table for tweet details
    table = Table(title=f"Tweet {i + 1}", show_lines=True)
    table.add_column("Category", style="bold cyan")
    table.add_column("Details", style="magenta")

    table.add_row("Original Tweet", df['tweet'][i])
    table.add_row("Enriched Tweet", df['enriched_tweet'][i])
    table.add_row("Original Length", str(df['tweet_length'][i]))
    table.add_row("Enriched Length", str(df['enriched_tweet_length'][i]))
    
    # Add annotations in a separate section
    annotations_panel = Panel(
        "\n".join(
            [f"[bold green]-[/] {ann['surfaceForm']} ([blue]{ann['URI']}[/])" for ann in annotated_tweet["annotations"]]
        ),
        title="Annotations",
        border_style="bold yellow"
    )
    
    # Print the tweet table and annotations
    console.print(table)
    console.print(annotations_panel)
    console.print("\n" + "="*50 + "\n")

## Results

In [25]:
mean_length_tweet = df["tweet_length"].mean()
median_length_tweet = df["tweet_length"].median()
mean_length_enriched_tweet = df["enriched_tweet_length"].mean()
median_length_enriched_tweet = df["enriched_tweet_length"].median()

print(f"Average tweet size : {mean_length_tweet}, enriched : {mean_length_enriched_tweet}")
print(f"Median tweet size : {median_length_tweet}, enriched : {median_length_enriched_tweet}")

Average tweet size : 16.2, enriched : 268.66
Median tweet size : 15.0, enriched : 214.5
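
To put these figures in perspective, the expansion factor of the enrichment can be computed directly from the values printed above. This is a small worked example using only those four numbers; the variable names are chosen for illustration.

```python
# Values copied from the output above (lengths in words).
mean_original, mean_enriched = 16.2, 268.66
median_original, median_enriched = 15.0, 214.5

mean_ratio = mean_enriched / mean_original
median_ratio = median_enriched / median_original

print(f"Mean expansion factor : {mean_ratio:.1f}x")      # 16.6x
print(f"Median expansion factor : {median_ratio:.1f}x")  # 14.3x
```

On this 50-tweet sample, an enriched tweet is therefore roughly 14 to 17 times longer than the original.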


# Discussion of the Advantages and Limitations of the Text Enrichment Approach

## Advantages

1. **Content enrichment**: 
   - Adds relevant contextual information that makes the tweets easier to understand.

2. **Rich source (DBpedia)**: 
   - Provides structured data on a large number of entities.

3. **Automation and scalability**:
   - Large numbers of tweets can be processed automatically with tools such as `pyspotlight` and `SPARQLWrapper`.

4. **Improved NLP tasks**:
   - Useful for applications such as context-aware classification or sentiment analysis on enriched text.

## Limitations

1. **Length and readability**:
   - Adding detailed summaries makes the tweets long and less readable: on the 50 tweets processed, the average length grows from 16.2 to 268.66 words.

2. **Relevance of the annotations**:
   - The identified entities can be out of context or ambiguous. For example, "Amazon" was linked to `Amazon_Prime_Video` and "Buy" to the Russian town of Buy.

3. **Quality of DBpedia data**:
   - DBpedia may contain outdated or biased information (intentionally or not).

4. **Language limitations**:
   - Works best for English; other languages and informal language are handled less well.

5. **Tool performance**:
   - Depends on the DBpedia servers and on network connectivity.

6. **No human validation**:
   - No manual check guarantees the relevance of the enrichments.

## Suggestions for Improvement

1. **Annotation filtering**:
   - Check the relevance of the entities with contextual models such as BERT.

2. **Concise summaries**:
   - Limit the length of the summaries to balance richness and readability.

3. **Multilingual support**:
   - Add machine translation to handle non-English tweets.

4. **Improved user interface**:
   - Present the enrichments in a compact, interactive format.

5. **Human evaluation**:
   - Add a human review to check quality on a sample.
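
The second suggestion (limiting summary length) can be sketched in a few lines. The helper `shorten_summary` and its default of 25 words are assumptions for illustration; a word limit is used rather than sentence splitting because abbreviations such as "Inc." in DBpedia comments would break a naive split on periods.

```python
def shorten_summary(summary: str, max_words: int = 25) -> str:
    """Keep at most `max_words` words of a summary, marking any truncation."""
    words = summary.split()
    if len(words) <= max_words:
        return summary
    return " ".join(words[:max_words]) + " ..."
```

In the enrichment loop, this could be applied as `shorten_summary(summary[0])` before the summary is stored.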

## Conclusion

This approach is promising for applications such as automated journalism or social media analysis. However, it needs adjustments to maximize its precision and usefulness.
