# TP2 Semantic annotation (Named-Entity Linking) and text enrichment

@Authors : `HADDOU, Amine` and `DE SERROUX, Colin`

The main objective of this TP is to develop an automatic system that takes an 
input text, recognizes the entities mentioned in it, and selects the most appropriate 
DBpedia resource for each recognized entity, depending on the context. 
Then, by querying each resource, the system retrieves 
information that is used to enrich/augment the original text.


## Installation

In [1]:
%pip install pyspotlight
%pip install SPARQLWrapper
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


## Libraries

In [2]:
import spotlight
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

## Global functions

In [3]:
def get_annotations(api_url: str, text_to_annotate: str, confidence: float = 0.4, support: int = 20) -> list[dict]:
    """
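    Annotate a text with DBpedia Spotlight and return the entities found, each with its DBpedia URI.

    :param api_url: URL of the DBpedia Spotlight annotate endpoint (public API or a local instance)
    :param text_to_annotate: the text to annotate (here, a tweet)
    :param confidence: minimum disambiguation confidence, between 0 and 1, for an annotation to be kept
    :param support: minimum support of a candidate resource (its prominence in Wikipedia, counted as inlinks)
    :return: the list of annotations, or an empty list if Spotlight raises an error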
    """
    
    try:
        annotations = spotlight.annotate(
            api_url,
            text_to_annotate,
            confidence=confidence,
            support=support,
        )

        # There is no need to explain what twitter is when talking about a tweet.
        annotations = [item for item in annotations if item["surfaceForm"] != "t.co"]

        return annotations
    except spotlight.SpotlightException as e:
        print(f"Spotlight error : {e}")

        return []
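
The annotations returned above can also be post-filtered locally. The following is a minimal sketch, assuming each annotation dict carries the `similarityScore` and `types` keys that pyspotlight results usually include; the helper name `filter_annotations`, the threshold, and the sample values are hypothetical and only serve as an illustration.

```python
def filter_annotations(annotations, min_similarity=0.9, required_type=None):
    """Keep annotations that pass a similarity threshold and, optionally, a type check."""
    kept = []
    for ann in annotations:
        # A missing score is treated as 0.0, so incomplete entries are dropped.
        if float(ann.get("similarityScore", 0.0)) < min_similarity:
            continue
        if required_type is not None:
            # Assumption: `types` is a comma-separated string of type labels.
            types = [t.strip() for t in ann.get("types", "").split(",")]
            if required_type not in types:
                continue
        kept.append(ann)
    return kept


# Made-up sample annotations (not real Spotlight output).
sample = [
    {"surfaceForm": "Apple", "URI": "http://example.org/A",
     "similarityScore": 0.99, "types": "DBpedia:Company,Schema:Organization"},
    {"surfaceForm": "Buy", "URI": "http://example.org/B",
     "similarityScore": 0.55, "types": "DBpedia:Place"},
]

print([a["surfaceForm"] for a in filter_annotations(sample)])
print([a["surfaceForm"] for a in filter_annotations(sample, 0.5, "DBpedia:Place")])
```

Such a filter trades recall for precision: a stricter threshold removes doubtful links but may also drop correct ones.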

In [4]:
def get_entity_info(uri: str) -> list:
    """
    Retrieves information from an annotation using its URI in DBPedia.
    
    :param uri: URI of the annotation
    """

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?comment
        WHERE {{
            <{uri}> rdfs:comment ?comment .
            FILTER(langMatches(lang(?comment), "EN"))
        }}
    """)
    sparql.setReturnFormat(JSON)

    try:
        results = sparql.query().convert()

        return [result["comment"]["value"] for result in results["results"]["bindings"]]
    except Exception as e:
        print(f"SPARQL error : {e}")

        return []
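
Because the same entity can appear in many tweets, repeated lookups of one URI could be served from memory. The sketch below wraps any URI lookup function with `functools.lru_cache`; the wrapper name `make_cached_lookup` and the stand-in `fake_fetch` are hypothetical, and in the notebook the wrapped function would be `get_entity_info`.

```python
from functools import lru_cache


def make_cached_lookup(fetch, maxsize=1024):
    """Wrap a function mapping a URI to a list of comments with an in-memory cache."""

    @lru_cache(maxsize=maxsize)
    def cached(uri):
        # Return a tuple so that callers cannot mutate the cached value.
        return tuple(fetch(uri))

    return cached


# Stand-in for a real SPARQL lookup, used here to count calls offline.
calls = []


def fake_fetch(uri):
    calls.append(uri)
    return [f"comment for {uri}"]


lookup = make_cached_lookup(fake_fetch)
lookup("http://example.org/A")
lookup("http://example.org/A")  # served from the cache, no second fetch
print(len(calls))
```

One caveat of this design: since `get_entity_info` returns an empty list on error, a failed request would also be cached until the cache is cleared with `lookup.cache_clear()`.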

## The test code

TODO: experiment with the confidence and support parameters.
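
One possible way to run this experiment is a small grid sweep that counts the annotations returned for each (confidence, support) pair. This is only a sketch: `sweep_parameters` and the offline stand-in `fake_annotate` are hypothetical names, and with the notebook's function the callable could be `lambda text, confidence, support: get_annotations(api_url, text, confidence, support)`.

```python
from itertools import product


def sweep_parameters(annotate, text, confidences, supports):
    """Return {(confidence, support): annotation count} for every combination."""
    counts = {}
    for confidence, support in product(confidences, supports):
        annotations = annotate(text, confidence=confidence, support=support)
        counts[(confidence, support)] = len(annotations)
    return counts


def fake_annotate(text, confidence, support):
    # Offline stand-in: pretends that a higher confidence yields fewer entities.
    words = text.split()
    return words[: max(0, len(words) - int(confidence * 10))]


result = sweep_parameters(fake_annotate, "a b c d e f", [0.2, 0.5], [0, 20])
for (confidence, support), count in sorted(result.items()):
    print(f"confidence={confidence} support={support} -> {count} annotations")
```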

In [5]:
api_url = "https://api.dbpedia-spotlight.org/en/annotate"
text_to_annotate = "With an equilibrium temperature of about 4,050 kelvin, the exoplanet KELT-9b (also known as HD 195689b) is an archetype of the class of ultrahot Jupiters that straddle the transition between stars and gas-giant exoplanets and are therefore useful for studying atmospheric chemistry."

annotations = get_annotations(api_url, text_to_annotate)

print("Annotated entities :")

for annotation in annotations:
    print(f"Entity : {annotation['URI']} - Surface : {annotation['surfaceForm']}")
    
    results = get_entity_info(annotation["URI"])

    for result in results:
        print(f"Comment : {result}\n")

Annotated entities :
Entity : http://dbpedia.org/resource/Planetary_equilibrium_temperature - Surface : equilibrium temperature
Comment : The planetary equilibrium temperature is a theoretical temperature that a planet would be if it were a black body being heated only by its parent star. In this model, the presence or absence of an atmosphere (and therefore any greenhouse effect) is irrelevant, as the equilibrium temperature is calculated purely from a balance with incident stellar energy.

Entity : http://dbpedia.org/resource/Exoplanet - Surface : exoplanet
Comment : An exoplanet or extrasolar planet is a planet outside the Solar System. The first possible evidence of an exoplanet was noted in 1917 but was not recognized as such. The first confirmation of detection occurred in 1992. A different planet, initially detected in 1988, was confirmed in 2003. As of 1 December 2022, there are 5,284 confirmed exoplanets in 3,899 planetary systems, with 847 systems having more than one planet.



## Use cases

### Environment

In [6]:
dataset_name = "hf://datasets/zeroshot/twitter-financial-news-topic/topic_train.csv"
api_url = "https://api.dbpedia-spotlight.org/en/annotate"

### Dataset

The dataset comes from [twitter-financial-news-topic](https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic). It consists of tweets, each associated with a label; the labels are not used in this lab.

#### Loading the dataset

In [7]:
df = pd.read_csv(dataset_name)

df.head()



| | text | label |
|---|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... | 0 |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... | 0 |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... | 0 |
| 3 | Analysts react to Tesla's latest earnings, bre... | 0 |
| 4 | Netflix and its peers are set for a ‘return to... | 0 |


#### Data pre-processing

Drop the 'label' column, which is not used in this lab.

In [8]:
df = df.drop(columns=["label"])

df.head()

| | text |
|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... |
| 3 | Analysts react to Tesla's latest earnings, bre... |
| 4 | Netflix and its peers are set for a ‘return to... |


Rename the 'text' column to 'tweet'.

In [9]:
df = df.rename(columns={"text": "tweet"})

df.head()

| | tweet |
|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... |
| 3 | Analysts react to Tesla's latest earnings, bre... |
| 4 | Netflix and its peers are set for a ‘return to... |


Add the length of each tweet, counted in whitespace-separated words.

In [10]:
df["tweet_length"] = df["tweet"].apply(lambda x: len(x.split()))

df.head()

| | tweet | tweet_length |
|---|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... | 15 |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... | 13 |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... | 13 |
| 3 | Analysts react to Tesla's latest earnings, bre... | 15 |
| 4 | Netflix and its peers are set for a ‘return to... | 19 |


#### Analysis of the dataset

Only two columns remain: 'tweet' and 'tweet_length'.

In [11]:
size = len(df)

print(f"There are {size} rows in the dataset.")

There are 16990 rows in the dataset.


In [12]:
df["tweet"].is_unique

True

All our tweets are unique.

In [13]:
mean_length = df["tweet_length"].mean()
median_length = df["tweet_length"].median()

print(f"Average tweet size : {mean_length}")
print(f"Median tweet size : {median_length}")

Average tweet size : 18.31153619776339
Median tweet size : 16.0


### Code

Annotate the first 50 tweets of the dataset.

In [14]:
annotated_tweets = []

for tweet in df["tweet"][:50]:
    annotations = get_annotations(api_url, tweet)
    annotated_tweets.append({
        "tweet": tweet,
        "annotations": annotations
    })

Checking results.

In [15]:
result = annotated_tweets[0]
print(f"\nTweet : {result['tweet']}")
print("Annotations :")
    
for ann in result["annotations"]:
    print(f"  - {ann['surfaceForm']} ({ann['URI']})")


Tweet : Here are Thursday's biggest analyst calls: Apple, Amazon, Tesla, Palantir, DocuSign, Exxon &amp; more  https://t.co/QPN8Gwl7Uh
Annotations :
  - Apple (http://dbpedia.org/resource/Apple_Inc.)
  - Amazon (http://dbpedia.org/resource/Amazon_Prime_Video)
  - Tesla (http://dbpedia.org/resource/Tesla_Model_3)
  - Palantir (http://dbpedia.org/resource/Palantir_Technologies)
  - DocuSign (http://dbpedia.org/resource/DocuSign)
  - Exxon (http://dbpedia.org/resource/Exxon)
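
The tweet above still contains an HTML entity (`&amp;`) and a `t.co` short link. A minimal cleaning sketch using only the standard library is shown here; the function name `clean_tweet` and the URL pattern are assumptions, not part of the original notebook.

```python
import html
import re

# Simple pattern for http/https links; it is not a full URL grammar.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_tweet(text):
    """Decode HTML entities, remove links, and normalize whitespace."""
    text = html.unescape(text)        # "&amp;" becomes "&"
    text = URL_PATTERN.sub("", text)  # drop links such as t.co short URLs
    return " ".join(text.split())     # collapse leftover whitespace


raw = ("Here are Thursday's biggest analyst calls: Apple, Amazon, Tesla, "
       "Palantir, DocuSign, Exxon &amp; more  https://t.co/QPN8Gwl7Uh")
print(clean_tweet(raw))
```

Cleaning changes character positions, so it would need to happen before annotation if the offsets returned by Spotlight are used later.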


Annotation enrichment.

In [16]:
enriched_tweets = []

for result in annotated_tweets:
    enriched_annotations = []

    for ann in result["annotations"]:
        summary = get_entity_info(ann["URI"])

        if summary:
            enriched_annotations.append({
                "entity": ann["surfaceForm"],
                "uri": ann["URI"],
                "summary": summary[0]
            })
    
    enriched_tweets.append({
        "tweet": result["tweet"],
        "enriched_annotations": enriched_annotations
    })

Build the enriched text of each tweet by inserting the summary of each entity after its surface form.

In [17]:
enriched_texts = []

for enriched in enriched_tweets:
    tweet = enriched["tweet"]
    enriched_text = tweet
    
    for annotation in enriched["enriched_annotations"]:
        enriched_text = enriched_text.replace(
            annotation["entity"],
            f"{annotation['entity']}, {annotation['summary']}"
        )
    
    enriched_texts.append(enriched_text)
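
Note that `str.replace` substitutes every occurrence of the surface form, including occurrences inside summaries that were inserted earlier. A possible alternative, sketched below, inserts each summary at the character offset of its annotation, working from right to left. It assumes each enriched annotation also stores an `offset` key (the start position of the surface form, which Spotlight annotations normally provide); the function name `insert_summaries` and the sample data are hypothetical.

```python
def insert_summaries(text, annotations):
    """Insert ', <summary>' right after each annotated span, using character offsets."""
    enriched = text
    # Process the rightmost span first so that earlier offsets stay valid.
    for ann in sorted(annotations, key=lambda a: a["offset"], reverse=True):
        end = ann["offset"] + len(ann["entity"])
        enriched = enriched[:end] + ", " + ann["summary"] + enriched[end:]
    return enriched


# Made-up example: the first summary mentions the second entity.
text = "Apple and Amazon"
annotations = [
    {"entity": "Apple", "offset": 0, "summary": "a company that rivals Amazon"},
    {"entity": "Amazon", "offset": 10, "summary": "a retailer"},
]
print(insert_summaries(text, annotations))
```

With this approach the word "Amazon" inside the first summary is left untouched, because only the original span at offset 10 receives a summary.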

Create a new dataframe for the enriched tweets.

In [18]:
df = pd.DataFrame({
    "tweet": [result["tweet"] for result in annotated_tweets],
    "enriched_tweet": enriched_texts,
})

Calculate the length of the enriched tweets.

In [19]:
df["tweet_length"] = df["tweet"].apply(lambda x: len(x.split()))
df["enriched_tweet_length"] = df["enriched_tweet"].apply(lambda x: len(x.split()))

Save the enriched tweets to a CSV file.

In [20]:
df.to_csv("enriched_tweets.csv", index=False, sep=";")

df.head()

| | tweet | enriched_tweet | tweet_length | enriched_tweet_length |
|---|---|---|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... | Here are Thursday's biggest analyst calls: App... | 15 | 480 |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... | Buy, Buy (Russian: Буй) is a town in Kostroma ... | 13 | 265 |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... | Piper, Piper Aircraft, Inc. is a manufacturer ... | 13 | 297 |
| 3 | Analysts react to Tesla's latest earnings, bre... | Analysts react to Tesla, The Tesla Model 3 is ... | 15 | 216 |
| 4 | Netflix and its peers are set for a ‘return to... | Netflix, Netflix, Inc. is an American subscrip... | 19 | 202 |


## Visualization

In [21]:
for i in range(len(annotated_tweets)):
    annotated_tweet = annotated_tweets[i]
    print(f"\nOriginal tweet : {df['tweet'][i]}")
    print(f"Enriched tweet : {df['enriched_tweet'][i]}")
    print(f"Original tweet length : {df['tweet_length'][i]}")
    print(f"Enriched tweet length : {df['enriched_tweet_length'][i]}")
    print("Annotations :")
    
    for ann in annotated_tweet["annotations"]:
        print(f"  - {ann['surfaceForm']} ({ann['URI']})")


Original tweet : Here are Thursday's biggest analyst calls: Apple, Amazon, Tesla, Palantir, DocuSign, Exxon &amp; more  https://t.co/QPN8Gwl7Uh
Enriched tweet : Here are Thursday's biggest analyst calls: Apple, Apple Inc. is an American multinational technology company headquartered in Cupertino, California, United States. Apple is the largest technology company by revenue (totaling US$365.8 billion in 2021) and, as of June 2022, is the world's biggest company by market capitalization, the fourth-largest personal computer vendor by unit sales and second-largest mobile phone manufacturer. It is one of the Big Five American information technology companies, alongside Alphabet, Amazon, Amazon Prime Video, also known simply as Prime Video, is an American subscription video on-demand over-the-top streaming and rental service of Amazon offered as a standalone service or as part of Amazon's Prime subscription. The service primarily distributes films and television series produced by Amazon S

## Results

In [22]:
mean_length_tweet = df["tweet_length"].mean()
median_length_tweet = df["tweet_length"].median()
mean_length_enriched_tweet = df["enriched_tweet_length"].mean()
median_length_enriched_tweet = df["enriched_tweet_length"].median()

print(f"Average tweet size : {mean_length_tweet}, enriched : {mean_length_enriched_tweet}")
print(f"Median tweet size : {median_length_tweet}, enriched : {median_length_enriched_tweet}")

Average tweet size : 16.2, enriched : 268.66
Median tweet size : 15.0, enriched : 214.5
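
These figures cover the 50 annotated tweets. As a quick worked computation, the expansion factor can be derived directly from the values printed above; the variable names below are illustrative.

```python
# Values copied from the output above (50 annotated tweets).
mean_original, mean_enriched = 16.2, 268.66
median_original, median_enriched = 15.0, 214.5

# Expansion factor: how many times longer the enriched text is, in words.
mean_ratio = mean_enriched / mean_original
median_ratio = median_enriched / median_original

print(f"Mean expansion factor : {mean_ratio:.2f}")
print(f"Median expansion factor : {median_ratio:.2f}")
```

The enriched tweets are therefore roughly 16.6 times longer on average, and about 14.3 times longer at the median.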


TODO results + 6.