# TP-QA
## Installing the packages
To carry out this lab, we install the following packages:
- **nltk**: the Natural Language Toolkit (NLTK) is a Python package for processing natural language.
- **pandas**: a package for creating and manipulating fast, flexible data structures.
- **sparqlwrapper**: a package that builds the URI of a SPARQL query and converts the result into a format that is easier to handle.
- **jellyfish**: a package for approximate and phonetic string matching.

In [105]:
%pip install --upgrade nltk
%pip install --upgrade pandas
%pip install --upgrade sparqlwrapper
%pip install --upgrade jellyfish

Note: you may need to restart the kernel to use updated packages.


## Imports and downloads

In [106]:
import nltk
import re
import pandas as pd
import itertools
import jellyfish
from SPARQLWrapper import SPARQLWrapper, JSON

In [107]:
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Bandh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Bandh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Bandh\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Importing the questions
The questions we are interested in are stored in an **XML** file, inside the "*string*" tags.<br><br>
To get at them, we first read the contents of the file into a string.

In [108]:
# Read the whole file and close it again once the block ends
with open("asserts/questions.xml") as f:
    doc = f.read()

We start by extracting a single sample question.

In [109]:
question = re.search('<string lang="en">(.*?)</string>', doc).group(1)
question

'Which river does the Brooklyn Bridge cross?'

We split the question into "*tokens*".

In [110]:
tokens = nltk.word_tokenize(question)
tokens

['Which', 'river', 'does', 'the', 'Brooklyn', 'Bridge', 'cross', '?']

We then obtain a list of tuples, each pairing a token (a word) with its part-of-speech tag.

In [111]:
tags = nltk.pos_tag(tokens)
print(tags)

[('Which', 'JJ'), ('river', 'NN'), ('does', 'VBZ'), ('the', 'DT'), ('Brooklyn', 'NNP'), ('Bridge', 'NNP'), ('cross', 'NN'), ('?', '.')]
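The tag names in this output come from the Penn Treebank tag set. As a reading aid, here is a small lookup covering only the tags that appear above; the glosses and the names `PENN_GLOSSES` and `describe` are written for this illustration and are not part of NLTK.

```python
# Glosses for the Penn Treebank tags seen in the output above (illustrative subset)
PENN_GLOSSES = {
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "NNP": "proper noun, singular",
    ".": "sentence-final punctuation",
}

# Tagged tokens copied from the notebook output
sample_tags = [('Which', 'JJ'), ('river', 'NN'), ('does', 'VBZ'), ('the', 'DT'),
               ('Brooklyn', 'NNP'), ('Bridge', 'NNP'), ('cross', 'NN'), ('?', '.')]

def describe(tagged):
    """Attach a human-readable gloss to each (word, tag) pair."""
    return [(word, tag, PENN_GLOSSES.get(tag, "unknown tag")) for word, tag in tagged]

for word, tag, gloss in describe(sample_tags):
    print(f"{word:10} {tag:4} {gloss}")
```

Note that the tagger labels "Which" as an adjective and "cross" as a noun here, whereas a human annotator would arguably choose a wh-determiner (`WDT`) and a base-form verb (`VB`). Later steps that rely on these tags inherit such errors.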


Extracting all the questions from the XML file.

In [112]:
def getContentCDATA(s):
    # Unwrap a leading <![CDATA[...]]> section; return the string unchanged otherwise
    reg = r'<!\[CDATA\[(.*?)\]\]>'
    result = re.match(reg, s)
    if result is None:
        return s
    return result.group(1)

questions = re.findall(r'<string lang="en">(.*?)</string>', doc)
questions = list(map(getContentCDATA, questions))
print('\n'.join(questions))

Which river does the Brooklyn Bridge cross?
Who is the author of Wikipedia?
In which country does the Nile start?
What is the highest place of Karakoram?
Who designed the Brooklyn Bridge?
Who created Goofy?
Who is the mayor of New York City?
Which is the source of the Yenisey river
Which museum exhibits The Scream by Munch?
Which states border Illinois?
Who was the wife of U.S. president Lincoln?
In which programming language is GIMP written?
In which country is the Limerick Lake?
What is the currency of the Czech Republic?
Who developed the video game World of Warcraft?
Who founded Aldi?
How many employees does IBM have?
What is the area code of Berlin?
When was the Battle of Gettysburg?
What is the official languages of Italy?
Who wrote the book The Pillars of the Earth?
Who is the author of the WikiLeaks association?
Give me all actors starring in Last Action Hero.
Who is the owner of Universal Studios Lot?
Where is Bruce Carver born?
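Regular expressions are enough here because the file is simple. As an alternative sketch, the standard-library `xml.etree.ElementTree` parser reads CDATA sections transparently, so no unwrapping helper is needed. The inline XML below is a hand-written sample: only the `<string lang="en">` tag comes from the notebook, while the `dataset` and `question` element names are assumptions about the file layout.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the real questions.xml may be structured differently
SAMPLE_XML = """<dataset>
  <question id="1">
    <string lang="en"><![CDATA[Which river does the Brooklyn Bridge cross?]]></string>
    <string lang="de"><![CDATA[Wer hat Goofy erschaffen?]]></string>
  </question>
  <question id="2">
    <string lang="en">Who created Goofy?</string>
  </question>
</dataset>"""

def english_questions(xml_text):
    """Return the text of every <string lang="en"> element, CDATA included."""
    root = ET.fromstring(xml_text)
    return [
        (node.text or "").strip()
        for node in root.iter("string")
        if node.get("lang") == "en"
    ]

questions_sample = english_questions(SAMPLE_XML)
print(questions_sample)
```

A real parser also copes with tags that span several lines or carry extra attributes, which the regular expression would miss.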


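## Identifying the expected answer type

The function below determines the type of a question from the interrogative word it contains. Questions containing "in which" are treated as *where*, and other "which" questions as *what*. The possible types are:
- when
- who
- where
- what
- how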

In [113]:
def identify(question):
    # (pattern, answer type) pairs, tested in order; the first match wins
    types = (
        ('when', 'when'),
        ('who', 'who'),
        ('where', 'where'),
        ('what', 'what'),
        ('how', 'how'),
        ('in which', 'where'),
        ('which', 'what'),
    )
    for (regex, answer_type) in types:
        if re.search(regex, question, re.IGNORECASE):
            return answer_type
    return None

In [114]:
types_questions = [(identify(question), question) for question in questions]
pd.DataFrame(types_questions, columns=['Type', 'Question'])


|   | Type | Question |
| --- | --- | --- |
| 0 | what | Which river does the Brooklyn Bridge cross? |
| 1 | who | Who is the author of Wikipedia? |
| 2 | where | In which country does the Nile start? |
| 3 | what | What is the highest place of Karakoram? |
| 4 | who | Who designed the Brooklyn Bridge? |
| 5 | who | Who created Goofy? |
| 6 | who | Who is the mayor of New York City? |
| 7 | what | Which is the source of the Yenisey river |
| 8 | what | Which museum exhibits The Scream by Munch? |
| 9 | what | Which states border Illinois? |
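The type-detection function above searches for each keyword anywhere in the question, so a pattern such as `how` would also fire on a word like "show". None of the questions listed here triggers that case, but a stricter variant is easy to sketch with word boundaries. The name `identify_strict` and the ordering of the patterns are choices made for this illustration.

```python
import re

# (pattern, answer type) pairs, tested in order; \b restricts matches to whole words
QUESTION_TYPES = [
    (r"\bwhen\b", "when"),
    (r"\bwho\b", "who"),
    (r"\bwhere\b", "where"),
    (r"\bin which\b", "where"),  # must come before the generic 'which' rule
    (r"\bwhat\b", "what"),
    (r"\bwhich\b", "what"),
    (r"\bhow\b", "how"),
]

def identify_strict(question):
    """Return the answer type of the first whole-word pattern found, else None."""
    for pattern, answer_type in QUESTION_TYPES:
        if re.search(pattern, question, re.IGNORECASE):
            return answer_type
    return None

print(identify_strict("In which country does the Nile start?"))  # where
print(identify_strict("Show me all rivers."))                    # None
```

Imperative questions such as "Give me all actors starring in Last Action Hero." still return `None` and would need a rule of their own.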


## Extracting the proper nouns (NNP)

In [115]:
questions_tokens = list(map(nltk.word_tokenize, questions))
questions_tags = list(map(nltk.pos_tag, questions_tokens))

In [116]:
def get_NNP(question):
    # Chunk grammar: a run of proper nouns, or a determiner followed by a noun
    grammar_np = r"""
    NP: {<NNP>+}
        {<DT><NN>}
    """
    tokens = nltk.word_tokenize(question)
    tags = nltk.pos_tag(tokens)
    cp = nltk.RegexpParser(grammar_np)
    tree = cp.parse(tags)

    results = []
    for subtree in tree.subtrees(filter=lambda x: x.label() == 'NP'):
        # Drop the interrogative "which" and join the remaining words with underscores
        words = [word for (word, _) in subtree if not re.match('which', word, re.IGNORECASE)]
        results.append("_".join(words))
    # Discard chunks that ended up empty
    return [value for value in results if value]

In [117]:
NNPs_questions = zip(questions, [get_NNP(question) for question in questions])
pd.DataFrame(NNPs_questions, columns=['Question', "NNP"])


|   | Question | NNP |
| --- | --- | --- |
| 0 | Which river does the Brooklyn Bridge cross? | [Brooklyn_Bridge] |
| 1 | Who is the author of Wikipedia? | [the_author, Wikipedia] |
| 2 | In which country does the Nile start? | [Nile] |
| 3 | What is the highest place of Karakoram? | [Karakoram] |
| 4 | Who designed the Brooklyn Bridge? | [Brooklyn_Bridge] |
| 5 | Who created Goofy? | [Goofy] |
| 6 | Who is the mayor of New York City? | [the_mayor, New_York_City] |
| 7 | Which is the source of the Yenisey river | [the_source, Yenisey] |
| 8 | Which museum exhibits The Scream by Munch? | [The_Scream, Munch] |
| 9 | Which states border Illinois? | [Illinois] |
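To make the `{<NNP>+}` chunk rule concrete, here is a dependency-free sketch that groups consecutive `NNP` tokens with `itertools.groupby`. It works on tokens that are already tagged and deliberately ignores the `{<DT><NN>}` rule; `proper_noun_groups` is a name introduced for this example.

```python
from itertools import groupby

def proper_noun_groups(tagged):
    """Join each maximal run of NNP-tagged words with underscores."""
    groups = []
    for is_nnp, run in groupby(tagged, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            groups.append("_".join(word for word, _ in run))
    return groups

# Tagged tokens copied from the earlier pos_tag output
sample = [('Which', 'JJ'), ('river', 'NN'), ('does', 'VBZ'), ('the', 'DT'),
          ('Brooklyn', 'NNP'), ('Bridge', 'NNP'), ('cross', 'NN'), ('?', '.')]

print(proper_noun_groups(sample))  # ['Brooklyn_Bridge']
```

The underscore-joined form matters later on, because DBpedia resource names use underscores in place of spaces.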


## Extracting the relations

In [118]:
def extract_relations(path: str):
    # Read the whole relations file; the .txt extension is appended here
    with open(f'{path}.txt', 'r') as f:
        f_read = f.read()

    def get(pattern, text):
        # All captures of `pattern`, stripped of surrounding whitespace
        return [value.strip() for value in re.findall(pattern, text)]

    relation_DBO = get(r'dbo:(.+?)\n', f_read)
    relation_DBP = get(r'dbp:(.+?)\n', f_read)
    # Tag each relation name with its prefix: all dbo entries first, then all dbp entries
    return [('dbo', value) for value in relation_DBO] + \
        [('dbp', value) for value in relation_DBP]

In [119]:
relations = extract_relations('asserts/relations')
pd.DataFrame(relations, columns=['Type', 'Nom'])

|   | Type | Nom |
| --- | --- | --- |
| 0 | dbo | album |
| 1 | dbo | areaCode |
| 2 | dbo | author |
| 3 | dbo | battle |
| 4 | dbo | birthDate |
| 5 | dbo | birthPlace |
| 6 | dbo | capital |
| 7 | dbo | child |
| 8 | dbo | city |
| 9 | dbo | country |
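The extraction can be tried without the relations file. The sketch below assumes one `prefix:name` entry per line, which is what the patterns in `extract_relations` suggest; the sample text and the name `parse_relations` are made up for the example. Unlike the notebook function, it keeps the entries in file order instead of listing every `dbo` entry first.

```python
import re

# Made-up sample in the assumed "one prefix:name per line" layout
SAMPLE_RELATIONS = "dbo:album\ndbo:areaCode\ndbp:source\n"

def parse_relations(text):
    """Return (prefix, name) pairs for every dbo:/dbp: line, in file order."""
    pairs = re.findall(r"^(dbo|dbp):(.+)$", text, re.MULTILINE)
    return [(prefix, name.strip()) for prefix, name in pairs]

print(parse_relations(SAMPLE_RELATIONS))
# [('dbo', 'album'), ('dbo', 'areaCode'), ('dbp', 'source')]
```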


## Retrieving synonyms for the relations

In [120]:
def join_synsets(separators: list[str], word: str):
    from nltk.corpus import wordnet
    # Split a camelCase name into its parts, e.g. 'areaCode' -> ['area', 'Code']
    word_split = re.findall('[a-zA-Z][^A-Z]*', word)
    # Rebuild the name once per separator, in lower case
    words = [separator.join(word_split).lower() for separator in separators]
    array_synsets = [wordnet.synsets(item) for item in words]
    array_synsets = list(itertools.chain.from_iterable(array_synsets))
    # Remove duplicates while keeping the original order
    array_synsets = list(dict.fromkeys(array_synsets))
    return array_synsets

In [121]:
def get_dict_synonyms(relations: list[tuple[str, str]]) -> dict:
    d = {}
    for type, word in relations:
        d[(type, word)] = join_synsets(['', ' ', '-', '_', '.', ';'], word)
    return d

In [122]:
dict_synonyms = get_dict_synonyms(relations)
relations_synonyms = [(f'{type}:{relation}', synsets) for (type, relation), synsets in dict_synonyms.items()]
pd.DataFrame(relations_synonyms, columns=['Relation', 'Synsets'])

|   | Relation | Synsets |
| --- | --- | --- |
| 0 | dbo:album | [Synset('album.n.01'), Synset('album.n.02')] |
| 1 | dbo:areaCode | [Synset('area_code.n.01')] |
| 2 | dbo:author | [Synset('writer.n.01'), Synset('generator.n.03... |
| 3 | dbo:battle | [Synset('battle.n.01'), Synset('struggle.n.01'... |
| 4 | dbo:birthDate | [] |
| 5 | dbo:birthPlace | [Synset('birthplace.n.01'), Synset('birthplace... |
| 6 | dbo:capital | [Synset('capital.n.01'), Synset('capital.n.02'... |
| 7 | dbo:child | [Synset('child.n.01'), Synset('child.n.02'), S... |
| 8 | dbo:city | [Synset('city.n.01'), Synset('city.n.02'), Syn... |
| 9 | dbo:country | [Synset('state.n.04'), Synset('country.n.02'),... |
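The lookup depends on how a camelCase relation name is turned into candidate WordNet spellings. The sketch below isolates that step without WordNet; `candidate_lemmas` is a name introduced here, and only four of the six separators used in the notebook are shown.

```python
import re

def candidate_lemmas(name, separators=("", " ", "-", "_")):
    """Split a camelCase name and rejoin its parts with each separator, lower-cased."""
    parts = re.findall(r"[a-zA-Z][^A-Z]*", name)
    return [sep.join(parts).lower() for sep in separators]

print(candidate_lemmas("areaCode"))
# ['areacode', 'area code', 'area-code', 'area_code']
print(candidate_lemmas("birthDate"))
# ['birthdate', 'birth date', 'birth-date', 'birth_date']
```

Judging from the table, at least one spelling of `areaCode` (presumably `area_code`) exists in WordNet, whereas none of the `birthDate` spellings does, which is why that row is empty.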


## Retrieving the answers to the questions
Function that finds the relation closest to the question passed as a parameter.<br>
To do so, we use the *jaro_similarity* function from the **jellyfish** package.

In [123]:
def get_relation_by_distance(token_question: list[str], relations: list[tuple[str, str]]):
    # Jaro similarity lies in [0, 1]; a pair scoring 0 never replaces the initial None
    sim_max = 0.0
    relation = None
    for word in token_question:
        for (type, name) in relations:
            sim = jellyfish.jaro_similarity(word, name)
            if sim > sim_max:
                sim_max = sim
                relation = (type, name)
    return relation
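For intuition about what the score measures, here is a plain-Python sketch of the textbook Jaro definition. It is not jellyfish's implementation, and `jaro_sketch` is a name chosen for this example; the library should give the same values on simple inputs, but that is not verified here.

```python
def jaro_sketch(s1: str, s2: str) -> float:
    """Textbook Jaro similarity: 1.0 for identical strings, 0.0 when nothing matches."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters count as matching only if they are at most `window` positions apart
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order, halved
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

print(round(jaro_sketch("cross", "crosses"), 4))  # 0.9048
print(round(jaro_sketch("MARTHA", "MARHTA"), 4))  # 0.9444
```

The measure is purely character based and case sensitive, so a question token is matched to the relation whose spelling looks most similar, not to the one that is closest in meaning.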
                

Matching each question with a relation.

In [124]:
questions_relations = [(get_relation_by_distance(qt, relations), qt) for qt in questions_tokens]
pd.DataFrame([(f'{type}:{name}', ' '.join(qt)) for (type, name), qt in questions_relations], columns=['Relation', 'Question'])

|   | Relation | Question |
| --- | --- | --- |
| 0 | dbo:crosses | Which river does the Brooklyn Bridge cross ? |
| 1 | dbo:author | Who is the author of Wikipedia ? |
| 2 | dbo:country | In which country does the Nile start ? |
| 3 | dbo:highest | What is the highest place of Karakoram ? |
| 4 | dbp:designer | Who designed the Brooklyn Bridge ? |
| 5 | dbo:creator | Who created Goofy ? |
| 6 | dbo:city | Who is the mayor of New York City ? |
| 7 | dbp:source | Which is the source of the Yenisey river |
| 8 | dbp:museum | Which museum exhibits The Scream by Munch ? |
| 9 | dbp:borderingstates | Which states border Illinois ? |


Function that builds the queries to send to *DBpedia*.

In [125]:
def create_request(type_relation, relation, NNP):
    link = 'http://dbpedia.org/ontology/' if type_relation == 'dbo' else 'http://dbpedia.org/property/'
    return """
    PREFIX %s: <%s>
    PREFIX res: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?uri WHERE {
        res:%s %s ?uri .
    }
    """ % (type_relation, link, NNP.replace(' ', '_'), f'{type_relation}:{relation}')


Function that returns only the first element of a list.

In [126]:
def get_first(array: list[str]) -> str:
    if (len(array) > 0):
        return array[0]
    return ''

Building the queries from the relations and the questions.

In [127]:
requests = [create_request(type, name, get_first(get_NNP(' '.join(qt)))) for (type, name), qt in questions_relations]
f'Number of queries: {len(requests)}'


'Number of queries: 25'
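To see what one generated query looks like, the sketch below rebuilds the same query shape for the first question, using the relation (`dbo:crosses`) and the entity (`Brooklyn_Bridge`) found in the earlier tables. `build_query` is a stand-alone helper written for this illustration; it mirrors the notebook's template but is not the notebook's function.

```python
# Namespace URIs copied from the notebook's query template
NAMESPACES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbp": "http://dbpedia.org/property/",
}

def build_query(prefix, relation, resource):
    """Build a SPARQL query asking for every ?uri linked to the resource by the relation."""
    return (
        f"PREFIX {prefix}: <{NAMESPACES[prefix]}>\n"
        "PREFIX res: <http://dbpedia.org/resource/>\n"
        "SELECT DISTINCT ?uri WHERE {\n"
        f"    res:{resource.replace(' ', '_')} {prefix}:{relation} ?uri .\n"
        "}"
    )

query = build_query("dbo", "crosses", "Brooklyn_Bridge")
print(query)
```

When no entity is found for a question, the resource name is empty and the resulting query is not meaningful, which is one reason some rows come back without an answer.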

Creating the *SPARQLWrapper* object, which sends the queries to the **DBpedia** server and receives the answers (in JSON format).

In [128]:
def Sparql():
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    return sparql
sparql = Sparql()

Function that returns the answers to a query, given a *SPARQLWrapper*.

In [129]:
def getResponses(sparql, quest):
    sparql.setQuery(quest)
    # queryAndConvert() sends the request itself, so no separate query() call is needed
    ret: dict = sparql.queryAndConvert()
    responses = []
    if ('results' in ret and 'bindings' in ret['results']):
        for r in ret['results']['bindings']:
            # Keep only the value bound to the ?uri variable
            if ('uri' in r and 'value' in r['uri']):
                responses.append(r['uri']['value'])
    return responses
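The parsing step can be checked offline against a hand-written payload in the SPARQL JSON results layout (`results` → `bindings` → variable → `value`). The sample dictionary below is illustrative, not a recorded DBpedia response, and `extract_values` is a name introduced for this example.

```python
# Hand-written payload in the SPARQL JSON results layout (not a real server response)
SAMPLE_RESULT = {
    "head": {"vars": ["uri"]},
    "results": {
        "bindings": [
            {"uri": {"type": "uri", "value": "http://dbpedia.org/resource/East_River"}},
            {"other": {"type": "literal", "value": "ignored"}},
        ]
    },
}

def extract_values(result, variable="uri"):
    """Collect the value bound to `variable` in each result row that has one."""
    bindings = result.get("results", {}).get("bindings", [])
    return [
        row[variable]["value"]
        for row in bindings
        if variable in row and "value" in row[variable]
    ]

print(extract_values(SAMPLE_RESULT))  # ['http://dbpedia.org/resource/East_River']
```

Separating the parsing from the network call makes it possible to test the former without contacting the endpoint.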

Function that replaces empty entries with *None*.

In [130]:
def replace_values_empty(array):
    for index in range(len(array)):
        value = array[index]
        if not value:
            array[index] = None

Sending the queries to **DBpedia** and collecting the answers.

In [131]:
responses = [getResponses(sparql, q) for q in requests]
replace_values_empty(responses)
questions_responses = zip(questions, responses)
pd.DataFrame(questions_responses, columns=['Questions', 'Reponses'])

|   | Questions | Reponses |
| --- | --- | --- |
| 0 | Which river does the Brooklyn Bridge cross? | [http://dbpedia.org/resource/East_River] |
| 1 | Who is the author of Wikipedia? |  |
| 2 | In which country does the Nile start? |  |
| 3 | What is the highest place of Karakoram? | [http://dbpedia.org/resource/K2] |
| 4 | Who designed the Brooklyn Bridge? | [http://dbpedia.org/resource/John_Augustus_Roe... |
| 5 | Who created Goofy? | [http://dbpedia.org/resource/Bob_Ogle, http://... |
| 6 | Who is the mayor of New York City? |  |
| 7 | Which is the source of the Yenisey river |  |
| 8 | Which museum exhibits The Scream by Munch? | [National Gallery and Munch Museum] |
| 9 | Which states border Illinois? | [http://dbpedia.org/resource/Kentucky, http://... |


## Checking the answers

To check how reliable the results of our queries are, we compare them with the results of the queries stored in the XML file.<br>
To do so, we extract the queries contained in the *query* tags.

In [132]:
queries = re.findall('<query>(.*?)</query>', doc, re.DOTALL)
queries = [query.replace('\n', "") for query in queries]
queries = list(map(getContentCDATA, queries))
f'Number of queries found in the XML: {len(queries)}'

'Number of queries found in the XML: 25'

Retrieving the expected answers.

In [133]:
responses_waiting = [getResponses(sparql, q) for q in queries]
replace_values_empty(responses_waiting)
questions_responses_waiting = zip(questions, responses_waiting)
pd.DataFrame(questions_responses_waiting, columns=['Questions', 'Reponses'])


|   | Questions | Reponses |
| --- | --- | --- |
| 0 | Which river does the Brooklyn Bridge cross? | [http://dbpedia.org/resource/East_River] |
| 1 | Who is the author of Wikipedia? |  |
| 2 | In which country does the Nile start? |  |
| 3 | What is the highest place of Karakoram? |  |
| 4 | Who designed the Brooklyn Bridge? | [http://dbpedia.org/resource/John_Augustus_Roe... |
| 5 | Who created Goofy? |  |
| 6 | Who is the mayor of New York City? |  |
| 7 | Which is the source of the Yenisey river | [http://dbpedia.org/resource/Ka-Hem, The most ... |
| 8 | Which museum exhibits The Scream by Munch? |  |
| 9 | Which states border Illinois? | [http://dbpedia.org/resource/Kentucky, http://... |


## Conclusion
As we can see below, some results match, but a large share of the answers differ between the two sets.<br>
One possible explanation is that the DBpedia relations have changed since the queries in the XML file were written, so those reference queries are no longer up to date.

In [134]:
results = zip(questions, responses_waiting, responses)
pd.DataFrame(results, columns=['Questions', 'Réponses attendues', 'Réponses'])


|   | Questions | Réponses attendues | Réponses |
| --- | --- | --- | --- |
| 0 | Which river does the Brooklyn Bridge cross? | [http://dbpedia.org/resource/East_River] | [http://dbpedia.org/resource/East_River] |
| 1 | Who is the author of Wikipedia? |  |  |
| 2 | In which country does the Nile start? |  |  |
| 3 | What is the highest place of Karakoram? |  | [http://dbpedia.org/resource/K2] |
| 4 | Who designed the Brooklyn Bridge? | [http://dbpedia.org/resource/John_Augustus_Roe... | [http://dbpedia.org/resource/John_Augustus_Roe... |
| 5 | Who created Goofy? |  | [http://dbpedia.org/resource/Bob_Ogle, http://... |
| 6 | Who is the mayor of New York City? |  |  |
| 7 | Which is the source of the Yenisey river | [http://dbpedia.org/resource/Ka-Hem, The most ... |  |
| 8 | Which museum exhibits The Scream by Munch? |  | [National Gallery and Munch Museum] |
| 9 | Which states border Illinois? | [http://dbpedia.org/resource/Kentucky, http://... | [http://dbpedia.org/resource/Kentucky, http://... |
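The side-by-side comparison can also be summarised programmatically. A minimal sketch, using three rows copied from the table above whose cells are not truncated; `compare` and its three labels are illustrative choices, not part of the notebook.

```python
def compare(expected, obtained):
    """Classify one question by comparing the two answer lists as sets."""
    if not expected and not obtained:
        return "both empty"
    if expected and obtained and set(expected) == set(obtained):
        return "match"
    return "mismatch"

# (question, expected answers, obtained answers) copied from rows 0, 1 and 3
rows = [
    ("Which river does the Brooklyn Bridge cross?",
     ["http://dbpedia.org/resource/East_River"],
     ["http://dbpedia.org/resource/East_River"]),
    ("Who is the author of Wikipedia?", None, None),
    ("What is the highest place of Karakoram?",
     None,
     ["http://dbpedia.org/resource/K2"]),
]

summary = {question: compare(expected, obtained) for question, expected, obtained in rows}
for question, verdict in summary.items():
    print(f"{verdict:10} {question}")
```

Keeping "both empty" separate from "match" avoids counting questions that neither query could answer as successes.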
