# Knowledge Extraction

Given the originality of the text we are dealing with, we combined several automated tools with our own manual interpretation of the text. Our aims were, on the one hand, to <b>achieve a comprehensive understanding of the content of the text on a conceptual level</b> and, on the other, to <b>annotate the text syntactically and semantically</b>, so that we could look more deeply into the relation between Haraway's theory and the semiotics that convey it.


## Topic Modeling

Along with the personal interpretation of our sources, we used <a href="https://github.com/MaartenGr/BERTopic">BERTopic</a> to extract meaningful topics, especially the central theoretical concepts, from the text.

<a href="">Find here</a> the documentation of this step, as well as the visualization of the book's extracted topics.
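As a rough illustration of this step, the sketch below splits a text into passages and fits BERTopic on them. It is a minimal sketch under our own assumptions, not the exact notebook we ran: the helper names `split_into_passages` and `extract_topics`, the `min_words` threshold and the default model settings are all illustrative.

```python
import re


def split_into_passages(text, min_words=5):
    """Split raw text on blank lines and drop passages shorter than min_words."""
    passages = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return [p for p in passages if len(p.split()) >= min_words]


def extract_topics(passages):
    """Fit BERTopic with default settings and return the model and topic ids."""
    # Imported lazily: requires `pip install bertopic` and downloads an
    # embedding model on first use.
    from bertopic import BERTopic

    topic_model = BERTopic()
    topics, _probabilities = topic_model.fit_transform(passages)
    return topic_model, topics
```

With the book loaded as a single string, `topic_model, topics = extract_topics(split_into_passages(text))` followed by `topic_model.get_topic_info()` gives an overview table of the topics found. BERTopic needs a reasonably large number of passages to produce stable topics, so a single short chapter may not be enough.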

## Named Entity Recognition, POS tagging, dependency parsing and more

We used <a href="https://github.com/booknlp/booknlp">BookNLP</a>, a natural language processing pipeline that scales to books and other long documents (in English), to fully annotate the text and extract its main <b>entities</b> as well as their relations and dependencies.

The pipeline produces six files: Tokens, Entities, Supersense and Quotes (TSV), plus two Book files (JSON and HTML).
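The following sketch shows how such a run can be launched, following the usage given in the BookNLP README. The wrapper functions `expected_outputs` and `run_booknlp`, as well as any input path and book id passed to them, are our own illustrative choices.

```python
from pathlib import Path

# File suffixes BookNLP writes for each processed text (per its README).
OUTPUT_SUFFIXES = (".tokens", ".entities", ".supersense", ".quotes", ".book", ".book.html")


def expected_outputs(output_dir, book_id):
    """Return the six paths BookNLP is expected to write for one text."""
    return [Path(output_dir) / f"{book_id}{suffix}" for suffix in OUTPUT_SUFFIXES]


def run_booknlp(input_file, output_dir, book_id):
    """Annotate one plain-text file with the full English pipeline."""
    # Imported lazily: requires `pip install booknlp` and large model files.
    from booknlp.booknlp import BookNLP

    model_params = {
        "pipeline": "entity,quote,supersense,event,coref",
        "model": "big",
    }
    booknlp = BookNLP("en", model_params)
    booknlp.process(input_file, output_dir, book_id)
    return expected_outputs(output_dir, book_id)
```

Calling `run_booknlp` with a plain-text chapter, an output folder and a book id such as `"ch_7"` should leave files like `ch_7.tokens` and `ch_7.entities` in that folder.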

Although the entire output was useful for an in-depth analysis of <i>Staying with the Trouble</i>, the tool did not perform accurately enough for our needs on this text, probably because its content is philosophical and poetic rather than narrative.

For this reason we decided to concentrate on a single chapter, <b>Chapter 7: A Curious Practice</b>, and to manually correct the output of the <b>tokens</b> file. This process allowed us to precisely <a href="">assess the tool's accuracy</a> and to obtain <a href="">useful data</a> for the creation of our ontology vocabulary and model.
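One simple way to quantify accuracy from a manually corrected file is to compare it, column by column, with the automatic output. The sketch below is only an illustration of that idea, not the evaluation we actually ran: it assumes both tables contain the same tokens in the same order, and the "automatic" rows include one error invented for the example.

```python
import csv
import io

COLUMNS = ("lemma", "POS_tag", "dependency_relation", "syntactic_head_ID")


def read_tokens(handle):
    """Read a tab-separated tokens table into a list of row dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def column_accuracy(automatic, corrected, columns=COLUMNS):
    """Per column, the share of tokens whose automatic value matches the corrected one."""
    if not corrected or len(automatic) != len(corrected):
        raise ValueError("both tables must contain the same, non-empty set of tokens")
    total = len(corrected)
    return {
        column: sum(a[column] == c[column] for a, c in zip(automatic, corrected)) / total
        for column in columns
    }


HEADER = "word\tlemma\tPOS_tag\tdependency_relation\tsyntactic_head_ID\n"

# Corrected rows, taken from the first tokens of Chapter 7.
corrected = read_tokens(io.StringIO(
    HEADER
    + "Vinciane\tVinciane\tPROPN\tcompound\t1\n"
    + "Despret\tDespret\tPROPN\tnsubj\t2\n"
    + "thinks\tthink\tVERB\tROOT\t2\n"
    + "-\t-\tPUNCT\tpunct\t2\n"
))

# Hypothetical automatic rows: the first dependency label is an invented error.
automatic = read_tokens(io.StringIO(
    HEADER
    + "Vinciane\tVinciane\tPROPN\tnsubj\t1\n"
    + "Despret\tDespret\tPROPN\tnsubj\t2\n"
    + "thinks\tthink\tVERB\tROOT\t2\n"
    + "-\t-\tPUNCT\tpunct\t2\n"
))

scores = column_accuracy(automatic, corrected)
print(scores)
```

In this toy case three of the four dependency labels agree, so `dependency_relation` scores 0.75 while the other columns score 1.0. On real files, `read_tokens` can be given an open file handle instead of an in-memory string.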

#### Tokens 

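```python
import pandas as pd

# Load the BookNLP tokens file for Chapter 7 (tab-separated).
# Forward slashes avoid invalid escape sequences such as "\B" in the path.
tokens_df = pd.read_csv("script/BookNLP/ch_7/ch_7.tokens", delimiter="\t")
tokens_df.head()
```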

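| Unnamed: 0 | paragraph_ID | sentence_ID | token_ID_within_sentence | token_ID_within_document | word | lemma | byte_onset | byte_offset | POS_tag | fine_POS_tag | dependency_relation | syntactic_head_ID | event |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | Vinciane | Vinciane | 0 | 8 | PROPN | NNP | compound | 1 | O |
| 1 | 0 | 0 | 1 | 1 | Despret | Despret | 9 | 16 | PROPN | NNP | nsubj | 2 | O |
| 2 | 0 | 0 | 2 | 2 | thinks | think | 17 | 23 | VERB | VBZ | ROOT | 2 | O |
| 3 | 0 | 0 | 3 | 3 | - | - | 23 | 24 | PUNCT | HYPH | punct | 2 | O |
| 4 | 0 | 0 | 4 | 4 | with | with | 24 | 28 | ADP | IN | prep | 2 | O |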


#### Entities

```python
# Load the BookNLP entities file for Chapter 7 (tab-separated).
entities_df = pd.read_csv("script/BookNLP/ch_7/ch_7.entities", delimiter="\t")
entities_df.head()
```

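| Unnamed: 0 | COREF | start_token | end_token | prop | cat | text |
|---|---|---|---|---|---|---|
| 0 | 14 | 0 | 1 | PROP | PER | Vinciane Despret |
| 1 | 25 | 5 | 10 | NOM | PER | other beings , human and not |
| 2 | 26 | 33 | 34 | NOM | LOC | the world |
| 3 | 14 | 47 | 47 | PROP | PER | Despret |
| 4 | 14 | 60 | 60 | PRON | PER | her |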
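The two files can be read together: mentions that share a `COREF` id refer to the same entity, and `start_token` / `end_token` point to `token_ID_within_document` in the tokens file (both ends inclusive, as the rows above suggest). The sketch below illustrates this on the rows shown above, rebuilt in memory so that it runs on its own; the helper `span_text` is our own illustrative name.

```python
import pandas as pd

# Rows copied from the tokens and entities outputs shown above.
tokens_df = pd.DataFrame({
    "token_ID_within_document": [0, 1, 2, 3, 4],
    "word": ["Vinciane", "Despret", "thinks", "-", "with"],
})
entities_df = pd.DataFrame({
    "COREF": [14, 25, 26, 14, 14],
    "start_token": [0, 5, 33, 47, 60],
    "end_token": [1, 10, 34, 47, 60],
    "prop": ["PROP", "NOM", "NOM", "PROP", "PRON"],
    "cat": ["PER", "PER", "LOC", "PER", "PER"],
    "text": ["Vinciane Despret", "other beings , human and not",
             "the world", "Despret", "her"],
})

# Group every mention under its coreference id.
mentions = entities_df.groupby("COREF")["text"].apply(list).to_dict()


def span_text(tokens, start, end):
    """Rebuild a mention from the tokens table (start and end inclusive)."""
    ids = tokens["token_ID_within_document"]
    return " ".join(tokens.loc[ids.between(start, end), "word"])


print(mentions[14])
print(span_text(tokens_df, 0, 1))
```

Here entity 14 collects the proper name, its short form and a pronoun, which is the kind of grouping we relied on when deciding which mentions become the same individual in the ontology.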


## FRED and FrameNet

<a href="https://github.com/anuzzolese/fredclient">FRED</a> is a machine reader for the Semantic Web: it parses natural language text in 48 different languages and transforms it into linked data.

Our initial aim was to use FRED's output to add semantic role labeling to the entities we extracted in the previous steps. Unfortunately, this task proved too time-consuming and complex to perform in the context of this project. We therefore decided to mimic FRED's semantic notation through a simpler, semi-automated process: we manually annotated the Frames and Roles identified in <b>the first paragraph</b> of Chapter 7 and built a <b>Python pipeline</b> that automatically creates instances of entities and their relations, according to our Ontology model.

To annotate Frames we studied the <a href="https://framenet.icsi.berkeley.edu/">FrameNet project</a> and went through its lexical unit index to see whether our sentences could be classified under appropriate FrameNet frames. Where suitable, we adopted FrameNet data; otherwise we created our own Frame, as in the case of <b>Collaborative_thinking</b>, given that it evokes a semantic concept that is original to Donna Haraway's theory.

The identification of frames in the text was carried out following this workflow:

1. Identification of all the main concepts
2. Identification of the significant terms related to each concept, and of the relations between concepts
3. Abstraction into Frames evoked by the previous conceptualization
4. Alignment of the identified Frames with FrameNet's, where possible
5. Listing of lexical units (identified concept-related lexical units plus new FrameNet-related lexical units)
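Step 4 of this workflow was done by hand, but the lookup part could be supported by the FrameNet corpus reader shipped with NLTK. The sketch below is an assumption about how that might look, not something we used: `framenet_lookup` and `align_frame` are illustrative names, and the candidates it returns would still need to be checked manually.

```python
import re


def framenet_lookup(lemma):
    """Return names of FrameNet frames with a lexical unit for this lemma."""
    # Imported lazily: requires nltk and nltk.download("framenet_v17").
    from nltk.corpus import framenet as fn

    # Lexical unit names look like "think.v", so anchor on the lemma and the dot.
    pattern = rf"(?i)^{re.escape(lemma)}\."
    return sorted({frame.name for frame in fn.frames_by_lemma(pattern)})


def align_frame(lemma, custom_frame, lookup=framenet_lookup):
    """Propose FrameNet candidates for a lemma, or fall back to a custom Frame."""
    candidates = lookup(lemma)
    if candidates:
        return {"lemma": lemma, "source": "FrameNet", "frames": candidates}
    return {"lemma": lemma, "source": "custom", "frames": [custom_frame]}
```

For a lemma that FrameNet does not cover, `align_frame` falls back to the custom Frame passed in, which mirrors how a Frame such as Collaborative_thinking ends up outside FrameNet. The `lookup` parameter can be swapped for any other function, which also makes the logic easy to try without the corpus installed.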

Find <a href="">here</a> all annotated Frames of our first paragraph.
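To give an idea of what the semi-automated step does, the sketch below turns one manual Frame annotation into subject, predicate, object triples. It is a simplified illustration rather than our actual pipeline: the `ex:` prefix, the `evokedBy` property and the role names `Thinker` and `Co_thinker` are hypothetical placeholders, not terms of our Ontology model, and pairing the lexical unit "thinks-with" with Collaborative_thinking is assumed here for the example.

```python
from dataclasses import dataclass, field

EX = "ex:"  # hypothetical namespace prefix


@dataclass
class FrameAnnotation:
    """One manually annotated Frame occurrence in a sentence."""
    sentence_id: int
    frame: str
    lexical_unit: str
    roles: dict = field(default_factory=dict)  # role name -> entity label


def slug(label):
    """Turn an entity label into an identifier-friendly form."""
    return "_".join(label.split())


def to_triples(annotation):
    """Create an instance of the Frame and link each role to its entity."""
    instance = f"{EX}{annotation.frame}_{annotation.sentence_id}"
    triples = [
        (instance, "rdf:type", f"{EX}{annotation.frame}"),
        (instance, f"{EX}evokedBy", f'"{annotation.lexical_unit}"'),
    ]
    for role, entity in annotation.roles.items():
        triples.append((instance, f"{EX}{role}", f"{EX}{slug(entity)}"))
    return triples


# Hypothetical annotation of the opening sentence of Chapter 7.
annotation = FrameAnnotation(
    sentence_id=0,
    frame="Collaborative_thinking",
    lexical_unit="thinks-with",
    roles={"Thinker": "Vinciane Despret", "Co_thinker": "other beings"},
)
triples = to_triples(annotation)
for triple in triples:
    print(triple)
```

Keeping the annotations as plain records and generating the triples in one function means the manual work stays separate from the serialization, so the same annotations can later be written out with an RDF library.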

# Knowledge Graph