ESBigeard/paper_graph

Paper Graph

Dev/tools repo for a project that mines scientific papers to construct graphs.

The bibliography has been moved to its own file.

If you want to run the code, you can find the corpus here. It consists of TEI XML files.
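As a rough illustration of what reading this corpus can look like, the sketch below parses one TEI XML document with the Python standard library. The function name `read_tei`, the sample document, and the choice to extract only the title and body paragraphs are assumptions made for illustration, not code from this repo.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the sample document below is made up.
TEI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI}


def read_tei(xml_text):
    """Return (title, paragraphs) extracted from a TEI XML string."""
    root = ET.fromstring(xml_text)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    body = root.find(".//tei:body", TEI_NS)
    paragraphs = []
    if body is not None:
        for p in body.iter("{" + TEI + "}p"):
            # Flatten inline markup and normalise whitespace.
            paragraphs.append(" ".join("".join(p.itertext()).split()))
    return title, paragraphs


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>A sample paper</title></titleStmt></fileDesc></teiHeader>
  <text><body><div><p>First paragraph.</p><p>Second <hi>one</hi>.</p></div></body></text>
</TEI>"""

title, paragraphs = read_tei(SAMPLE)
print(title)
print(paragraphs)
```

For files on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`.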

Working pipeline

From Canceropole PDF articles to the website showing the graph

 +--------------------+
 |                    |
 |   PDF articles     |
 |                    |
 +---------+----------+
           |
 +---------v----------+
 |                    |
 |     GROBID         |
 |                    |
 +---------+----------+
           |
           |        +----------------------------------+          +------------------------------------+
           |        |  generate_html_article_pages.py  |          | html pages with                    |
           |    +--->                                  +--------->+ the text of the articles           |
           |    |   +----------------------------------+          | and important sentences in yellow  |
           |    |                                                 |                                    |
           |    |                                                 +------------------------------------+
           |    |
 +---------v----+-----+          +--------------------------------------+
 |                    |          | utilsperso.edit_idf()                |
 |      TEI XML files +--------->+ (check the bottom of utilsperso.py   |
 |                    |          | there's a few lines that allow       |
 +--------+-----------+          | standalone launching of              |
          |                      | edit_idf()                           |
+---------v-----------+          |                                      |
|                     |          +-----------------+--------------------+
| generate_gephi_csv.py                            |
|                     |                            |
+----------+----------+            +---------------v----------------+
           |                       | idf.pickle                     |
 +---------v----------+            | I've added an idf file in the  |
 |  nodes.csv         |            | git for convenience, but       |
 |  edges.csv         <------------+ a new one should be            |
 |                    |            | generated for each corpus      |
 +---------+----------+            |                                |
           |                       +--------------------------------+
 +---------v-----------+
 | Aman's script       |
 | adds coordinates for|
 | similarity view     |
 | and similar nodes   |
 |                     |
 +----------+----------+
            |
 +----------v-------------------------+
 |  convert_id_to_tile.py             |
 |  Aman's script gives similar nodes |
 |  as ID. This converts to           |
 |  node label                        |
 |                                    |
 +---------+--------------------------+
           |
     +-----v-----+
     |   Gephi   |
     +-----+-----+
           |
    +------v------+
    |GEXF XML file|
    +------+------+
           |
+----------v-----------------------+
| this JavaScript website          |
|https://github.com/raphv/gexf-js  |
| with small changes               |
+----------------------------------+
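The diagram notes that a fresh idf.pickle should be generated for each corpus. A minimal sketch of that step, assuming a plain `log(N / df)` definition of IDF and a simple lowercase word tokenizer, is given below. The names `compute_idf`, `save_idf` and `load_idf` are hypothetical; the actual logic in utilsperso.py may differ.

```python
import math
import os
import pickle
import re
import tempfile
from collections import Counter


def compute_idf(documents):
    """Map each term to log(N / df), df being the number of documents containing it."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (assumed tokenizer: runs of letters).
        doc_freq.update(set(re.findall(r"[a-z]+", text.lower())))
    n_docs = len(documents)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def save_idf(idf, path):
    with open(path, "wb") as handle:
        pickle.dump(idf, handle)


def load_idf(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy corpus standing in for the text extracted from the TEI XML files.
corpus = ["tumor cells grow", "tumor growth", "graph mining"]
idf = compute_idf(corpus)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "idf.pickle")
    save_idf(idf, path)
    restored = load_idf(path)

print(restored == idf)
```

Terms that occur in every document get an IDF of 0 under this definition, which is one reason an IDF table computed on one corpus transfers poorly to another.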

For the paper, from several corpora (GSM, DBLP, ACL anthology) to .dat files

generate_aman_features.extract_acm()

glove: /home/sam/work/glove

Main scripts and useful stuff

generate_gephi_csv.py

Main script for Canceropole. Takes a folder of TEI XML files generated by GROBID and outputs nodes.csv and edges.csv, ready for Gephi.
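To give an idea of the output shape, here is a small sketch that writes a node table and an edge table using the column headers Gephi's spreadsheet importer recognises (`Id`, `Label`, `Source`, `Target`, `Weight`). The function name `write_gephi_csv`, the integer IDs, and the toy data are assumptions; the exact columns the real script emits may be different.

```python
import csv
import io


def write_gephi_csv(labels, weighted_edges, nodes_file, edges_file):
    """Write a node table and an edge table to two open text files."""
    node_writer = csv.writer(nodes_file)
    node_writer.writerow(["Id", "Label"])
    for node_id, label in enumerate(labels):
        node_writer.writerow([node_id, label])

    edge_writer = csv.writer(edges_file)
    edge_writer.writerow(["Source", "Target", "Weight"])
    for source, target, weight in weighted_edges:
        edge_writer.writerow([source, target, weight])


# Toy data: two papers linked by one similarity edge.
nodes_buf, edges_buf = io.StringIO(), io.StringIO()
write_gephi_csv(["Paper A", "Paper B"], [(0, 1, 0.8)], nodes_buf, edges_buf)
print(nodes_buf.getvalue())
print(edges_buf.getvalue())
```

When writing to real files, open them with `newline=""` so the `csv` module controls line endings.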

utilsperso.py

Necessary to make anything else run.

generate_html_article_pages.py

Creates the HTML page for each article, with the main sentences highlighted in yellow.
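The highlighting idea can be sketched as follows. How the important sentences are selected is not shown here: the sketch simply takes their indices as input. The function name `render_article` and the inline-style markup are assumptions, not the repo's actual output format.

```python
import html


def render_article(title, sentences, important):
    """Build an HTML page; sentences whose index is in `important` get a yellow background."""
    spans = []
    for index, sentence in enumerate(sentences):
        text = html.escape(sentence)  # keep <, > and & from breaking the page
        if index in important:
            text = '<span style="background-color: yellow">' + text + "</span>"
        spans.append(text)
    safe_title = html.escape(title)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{safe_title}</title></head>\n"
        f"<body><h1>{safe_title}</h1>\n"
        f"<p>{' '.join(spans)}</p></body></html>\n"
    )


page = render_article(
    "A sample paper",
    ["Cells < 5 um divide.", "This is background."],
    important={0},
)
print(page)
```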

convert_id_to_tile.py

For the most similar nodes added by Aman, replaces the ID of each node with its label.
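A minimal sketch of this ID-to-label lookup, assuming the node table has `Id` and `Label` columns and that the similar nodes arrive as a list of ID strings (both are assumptions, as are the names `load_labels` and `ids_to_labels`):

```python
import csv
import io


def load_labels(nodes_file):
    """Build an {id: label} mapping from a node table with Id and Label columns."""
    return {row["Id"]: row["Label"] for row in csv.DictReader(nodes_file)}


def ids_to_labels(similar_ids, labels):
    """Replace each known ID by its label; unknown IDs are left unchanged."""
    return [labels.get(node_id, node_id) for node_id in similar_ids]


# Toy node table standing in for nodes.csv.
nodes_csv = io.StringIO("Id,Label\n0,Paper A\n1,Paper B\n")
labels = load_labels(nodes_csv)
converted = ids_to_labels(["1", "0", "7"], labels)
print(converted)
```

Leaving unknown IDs untouched makes missing nodes easy to spot in the output instead of failing halfway through a conversion.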

generate_aman_features.py

For the paper, generates the .dat files that Aman uses to run the experiments.
