add get_article_by_id replace get_article_by_pmid
EhsanBitaraf committed Dec 30, 2023
1 parent 0a920d3 commit 3e5f936
Showing 27 changed files with 563 additions and 217 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -2,11 +2,21 @@
All notable changes to this project will be documented in this file.

## v0.0.5 2023-12-28

### Task
- Add `get_article_id_list_by_cstate`, replacing `get_article_pmid_list_by_cstate`
- Add `get_article_by_id`, replacing `get_article_by_pmid`
- Add `get_all_article_id_list`, replacing `get_all_article_pmid_list` (usage sketch below)
- `move_state_forward` may fail in TinyDB
- Check all TinyDB operations
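
A minimal usage sketch of the renamed accessors; the module path follows the notebook's `triplea.service.repository.persist`, and the argument values are assumptions, since the entries above give only the function names:

```python
# Hypothetical usage of the renamed accessors; the changelog names the
# functions, but their exact signatures are assumptions.
import triplea.service.repository.persist as persist

article = persist.get_article_by_id("37567487")   # was get_article_by_pmid
ids = persist.get_article_id_list_by_cstate(0)    # was get_article_pmid_list_by_cstate
all_ids = persist.get_all_article_id_list()       # was get_all_article_pmid_list
```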

### Improvements
- Add `print_error` in `utils.general` for unified error printing
- Add `Published`, `ArxivID`, and `SourceBank` fields to `Article`


### Bug Fixes
- Fix session handling in `extract_triple`

## v0.0.4 2023-10-14
### Improvements
14 changes: 6 additions & 8 deletions README.md
@@ -210,7 +210,7 @@ output:

#### Get and save a list of article identifiers based on a search term

Get a list of article identifiers (PMID) based on a search term and save them into the knowledge repository in the first state (0):
Get a list of article identifiers, such as PMID, based on a search term and save them into the knowledge repository in the first state (0):

Use this command:
Expand All @@ -234,13 +234,11 @@ The preparation of the article for extracting the graph has different steps that
|State|Short Description|Description|
|-----|-----------------|-----------|
|0 |article identifier saved|At this stage, the article object stored in the data bank has only one identifier, such as a PMID or DOI|
|1 |article details article info saved (json Form)|Metadata related to the article is stored in the `OreginalArticle` field from the `SourceBank`, but it has not been parsed yet|
|2 |parse details info||
|3 |Get Citation||
<!-- |4|NER Title||
|5|extract graph|| -->
|-1 |Error|if error happend in move state 1 to 2|

|1 |article details info saved (JSON form)|Metadata related to the article is stored in the `OriginalArticle` field from the `SourceBank`, but it has not been parsed yet|
|2 |parse details info|The contents of the `OriginalArticle` field are parsed and placed in the fields of the Article object.|
|3 |Get Citation||
|-1 |Error|if an error happened while moving from state 1 to 2|
|-2 |Error|if an error happened while moving from state 2 to 3|
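
To check how the articles are distributed across these states, this commit's notebook calls `print_article_info_from_repo`; a minimal sketch, using the module alias from the notebook:

```python
import triplea.service.repository.persist as PERSIST

# Prints the total article count and the number of articles in each state.
PERSIST.print_article_info_from_repo()
```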

There are two ways to run a pipeline. In the first method, we give the number of an existing state, and all the articles in that state move forward one state, as sketched below.
In the other method, we give the final state number, and each article below that state keeps moving forward until it reaches the final state we specified.
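
As a concrete illustration of the first method, here is a minimal sketch built on `move_state_forward` (the same function the notebook below imports from `triplea.service.repository.pipeline_core`); the loop bounds are an assumption, chosen to walk articles from state 0 up to state 3:

```python
# Sketch of the first method, assuming the repository already holds articles.
# move_state_forward(n) advances every article currently in state n by one
# state, so looping over states 0..2 moves everything up to state 3.
from triplea.service.repository.pipeline_core import move_state_forward

for state in range(0, 3):
    move_state_forward(state)
```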
1 change: 1 addition & 0 deletions export
@@ -0,0 +1 @@
key,title,pmid,year,publisher,url,abstract,state,doi,journal_issn,journal_iso_abbreviation,language,publication_type,citation
1 change: 1 addition & 0 deletions export_authors
@@ -0,0 +1 @@
key,authors,affiliations,country,university,institute,center,hospital,department,location,email,zipcode
1 change: 1 addition & 0 deletions export_keywords
@@ -0,0 +1 @@
key,keywords
1 change: 1 addition & 0 deletions export_topics
@@ -0,0 +1 @@
key,topics,rank
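
Each of the four new export files carries a leading `key` column, so they can presumably be joined back together. A sketch with pandas, assuming the files are plain comma-separated values as their headers suggest (file names come from the diffs above; everything else is an assumption):

```python
import pandas as pd

# Assumption: each export is a flat CSV keyed by the article's `key` column.
articles = pd.read_csv("export")
keywords = pd.read_csv("export_keywords")

# Left-join the keywords onto the main article export.
merged = articles.merge(keywords, on="key", how="left")
print(merged[["key", "title", "keywords"]].head())
```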
1 change: 0 additions & 1 deletion jupyter_lab/database/Arxiv_test.json

This file was deleted.

237 changes: 208 additions & 29 deletions jupyter_lab/pipeline.ipynb
@@ -10,9 +10,35 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ti%3A%22large%20language%20model%22%20AND%20ti%3ABenchmark\n",
"\u001b[32mTotal number of article is 114\u001b[0m\n",
"\u001b[32m Round (1) : Get another 10 record (Total 10 record)\u001b[0m\n",
"\u001b[32m Round (2) : Get another 10 record (Total 20 record)\u001b[0m\n",
"\u001b[32m Round (3) : Get another 10 record (Total 30 record)\u001b[0m\n",
"\u001b[32m Round (4) : Get another 10 record (Total 40 record)\u001b[0m\n",
"\u001b[32m Round (5) : Get another 10 record (Total 50 record)\u001b[0m\n",
"\u001b[32m Round (6) : Get another 10 record (Total 60 record)\u001b[0m\n",
"\u001b[32m Round (7) : Get another 10 record (Total 70 record)\u001b[0m\n",
"\u001b[32m Round (8) : Get another 10 record (Total 80 record)\u001b[0m\n",
"\u001b[32m Round (9) : Get another 10 record (Total 90 record)\u001b[0m\n",
"\u001b[32m Round (10) : Get another 10 record (Total 100 record)\u001b[0m\n",
"\u001b[32m Round (11) : Get another 10 record (Total 110 record)\u001b[0m\n",
"\n",
"\u001b[31mError in parsing arxiv response. Entry missing.\u001b[0m\n",
"\n",
"\u001b[31mError Line 23\u001b[0m\n",
"\u001b[31mError 'entry'\u001b[0m\n",
"\u001b[32m Round (12) : Get another 10 record (Total 120 record)\u001b[0m\n"
]
}
],
"source": [
"import urllib.parse\n",
"from triplea.service.repository.state.initial_arxiv import get_article_list_from_arxiv_all_store_to_arepo\n",
@@ -27,7 +53,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 4,
"metadata": {},
"outputs": [
{
@@ -59,68 +85,76 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32mNumber of article in article repository is 115\u001b[0m\n",
"[{'State': -2, 'n': 0}, {'State': -1, 'n': 115}, {'State': 0, 'n': 0}, {'State': 1, 'n': 0}, {'State': 2, 'n': 0}, {'State': 3, 'n': 0}, {'State': 4, 'n': 0}]\n",
"\u001b[32m115 article(s) in state -1.\u001b[0m\n"
"\u001b[32m115 article(s) in state 3.\u001b[0m\n"
]
}
],
"source": [
"from triplea.service.click_logger import logger\n",
"from triplea.service.repository import persist\n",
"\n",
"import triplea.service.repository.persist as PERSIST\n",
"import triplea.service.repository.pipeline_core as PIPELINE\n",
"\n",
"logger.INFO(\n",
" \"Number of article in article repository is \"\n",
" + str(persist.get_all_article_count())\n",
")\n",
"\n",
"data = persist.get_article_group_by_state()\n",
"for i in range(-3, 7):\n",
" for s in data:\n",
" if s[\"State\"] == i:\n",
" w = 1\n",
" n = s[\"n\"]\n",
" if n != 0:\n",
" logger.INFO(f\"{n} article(s) in state {i}.\")"
"PERSIST.print_article_info_from_repo() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Moving Forward\n",
"## Moving Forward in core pipeline\n",
"We move from state `0` to state `3`\n",
"The best approach is to finalize state all the article in the `core state`.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define dependency:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from triplea.service.repository.pipeline_core import move_state_forward"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Moving from `0` to `1`"
]
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32m1 Article(s) is in state 0\u001b[0m\n",
"Article 37567487 with state 0 forward to 1\n"
"\u001b[32m0 Article(s) is in state 0\u001b[0m\n"
]
}
],
"source": [
"from triplea.service.repository.pipeline_core import move_state_forward\n",
"\n",
"move_state_forward(0)"
"PIPELINE.move_state_forward(0)"
]
},
{
@@ -130,13 +164,158 @@
"### Moving from `1` to `2`"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32m0 Article(s) is in state 1\u001b[0m\n"
]
}
],
"source": [
"PIPELINE.move_state_forward(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Moving from `2` to `3`"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32m0 Article(s) is in state 2\u001b[0m\n"
]
}
],
"source": [
"PIPELINE.move_state_forward(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check article object info"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32m\u001b[0m\n",
"\u001b[32mTitle : Evaluation High-Quality of Information from ChatGPT (Artificial Intelligence-Large Language Model) Artificial Intelligence on Shoulder Stabilization Surgery.\u001b[0m\n",
"\u001b[32mJournal : Arthroscopy : the journal of arthroscopic & related surgery : official publication of the Arthroscopy Association of North America and the International Arthroscopy Association\u001b[0m\n",
"\u001b[32mDOI : 10.1016/j.arthro.2023.07.048\u001b[0m\n",
"\u001b[32mPMID : 37567487\u001b[0m\n",
"\u001b[32mPMC : None\u001b[0m\n",
"\u001b[32mState : 3\u001b[0m\n",
"\u001b[32mAuthors : Eoghan T Hurley, Bryan S Crook, Samuel G Lorentz, Richard M Danilkowicz, Brian C Lau, Dean C Taylor, Jonathan F Dickens, Oke Anakwenze, Christopher S Klifto, \u001b[0m\n",
"\u001b[32mKeywords: \u001b[0m\n"
]
}
],
"source": [
"PERSIST.print_article_short_description(\"37567487\",\"pmid\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Moving forward in custom pipeline\n",
"These stages in custom pipleline do not have a specific prerequisite and post-requirement relationship, and when the core pipeline is provided and it has reached state 3, each of the actions of this pipeline can be done independently. This pipeline includes the following:\n",
"\n",
"|Action|Tag Name|Description|\n",
"|------|--------|-----------|\n",
"|Triple extraction from article abstract|FlagExtractKG||\n",
"|Topic extraction from article abstract|FlagExtractTopic||\n",
"|Convert Affiliation text to structural data|FlagAffiliationMining|This is simple way for parse Affiliation text |\n",
"|Convert Affiliation text to structural data|FlagAffiliationMining_Titipata|use [Titipat Achakulvisut Repo](https://github.com/titipata/affiliation_parser) for parsing Affiliation text|\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract Topic\n",
"In this method, we convert the article summary and the article title into a list of topics using topic extraction algorithms and save it. Previously, this method was in the program, but in the new versions, it is considered as an external service. The following variables are used to configure the service:\n",
"\n",
"- AAA_TOPIC_EXTRACT_ENDPOINT\n",
"- AAA_CLIENT_AGENT"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import triplea.service.repository.pipeline_flag as cPIPELINE\n",
"cPIPELINE.go_extract_topic()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Affiliation Mining"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cPIPELINE.go_affiliation_mining(method=\"Titipata\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract Triple\n",
"Extract Triple refers to the task of extracting subject-predicate-object triples from natural language text. Specifically:\n",
"\n",
"- A triple consists of a subject, a predicate (typically a verb), and an object. For example:\n",
"\n",
"[John] (subject) [eats] (predicate) [apples] (object)\n",
"\n",
"- Extracting triples involves analyzing sentences in text to identify these key elements and convert them into a structured format.\n",
"\n",
"- This allows capturing semantic relationships in text and representing them in a more machine-readable way for tasks like knowledge base construction, question answering, summarization, etc.\n",
"\n",
"- There are various methods for extract triple extraction ranging from rule-based systems to statistical and neural network models. These models identify the syntactic structure of sentences to detect appropriate noun phrases that can act as entities and predicates.\n",
"\n",
"So in summary, extract triple extraction aims to transform unstructured text into more structured triple representations automatically that provide deeper semantics and understand relationships described in the text. It serves as a key information extraction component for multiple downstream artificial intelligence applications."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"move_state_forward(1)"
"cPIPELINE.go_extract_triple()"
]
}
],
@@ -156,7 +335,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
"version": "3.11.3"
}
},
"nbformat": 4,
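Read end to end, the notebook's custom-pipeline cells reduce to the short script below. This is a sketch rather than part of the commit: it assumes the core pipeline has already brought the articles to state 3, and the endpoint values are placeholders for the two configuration variables the notebook names:

```python
import os

# Placeholder values for the two variables the notebook names; set them before
# importing the package, in case configuration is read at import time.
os.environ.setdefault("AAA_TOPIC_EXTRACT_ENDPOINT", "http://localhost:8000/")
os.environ.setdefault("AAA_CLIENT_AGENT", "triplea-notebook")

import triplea.service.repository.pipeline_flag as cPIPELINE

cPIPELINE.go_extract_topic()                        # FlagExtractTopic
cPIPELINE.go_affiliation_mining(method="Titipata")  # FlagAffiliationMining_Titipata
cPIPELINE.go_extract_triple()                       # FlagExtractKG
```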
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "triplea"
version = "0.0.4"
version = "0.0.5"
license = "Apache-2.0"
description = "Article Analysis Assistant"
authors = ["Ehsan Bitaraf <bitaraf.e@iums.ac.ir>", "Maryam Jafarpour <maryam.jafarpoor@gmail.com>"]
2 changes: 1 addition & 1 deletion triplea/cli/arepo.py
@@ -112,7 +112,7 @@ def arepo(command, pmid, output):
logger.ERROR("Not found.")
sys.exit(1)
return

output_data = a
a_title = a["Title"]
a_journal = a["Journal"]