## All imports

In [3]:
!pip install yake

Collecting yake
  Downloading yake-0.4.8-py2.py3-none-any.whl (60 kB)
Collecting tabulate
  Using cached tabulate-0.8.9-py3-none-any.whl (25 kB)
Collecting segtok
  Using cached segtok-1.5.10.tar.gz (25 kB)
Collecting jellyfish
  Downloading jellyfish-0.8.8-cp38-cp38-win_amd64.whl (28 kB)
Building wheels for collected packages: segtok
  Building wheel for segtok (setup.py): started
  Building wheel for segtok (setup.py): finished with status 'done'
  Created wheel for segtok: filename=segtok-1.5.10-py3-none-any.whl size=25022 sha256=8b5ebb750647755e2f75e321cf63c6ba2ce4c891617d3673395e9545c86998f4
  Stored in directory: c:\users\shweata\appdata\local\pip\cache\wheels\36\6d\90\6d9b11ba404f68f340ef3f6060cfdf9c9f34653b08eceeacf6
Successfully built segtok
Installing collected packages: tabulate, segtok, jellyfish, yake
Successfully installed jellyfish-0.8.8 segtok-1.5.10 tabulate-0.8.9 yake-0.4.8


In [1]:
import os
import glob
import xml.etree.ElementTree as ET
import pathlib
import yake
import subprocess
import logging
from bs4 import BeautifulSoup


## Defining all the functions

In [15]:
logging.basicConfig(level=logging.INFO)
# All the functions
def querying_pygetpapers_sectioning(query, hits, output_directory, using_terms = False, terms_txt=None):
    """queries pygetpapers for specified query. Downloads XML, and sections papers using ami section

    Args:
        query (str): query to pygetpapers
        hits (int): no. of papers to download
        output_directory (str): CProject Directory (where papers get downloaded)
        using_terms (bool, optional): pygetpapers --terms flag. Defaults to False.
        terms_txt (str, optional): path to text file with terms. Defaults to None.
    """
    logging.info('querying pygetpapers')
    if using_terms:
        subprocess.run(f'pygetpapers -q "{query}" -k {hits} -o {output_directory} -x --terms {terms_txt}',
                                shell=True)
    else:  
        subprocess.run(f'pygetpapers -q "{query}" -k {hits} -o {output_directory} -x', 
                                shell=True)
    logging.info('running ami section')
    subprocess.run(f'ami -p {output_directory} section', shell=True)

def parse_xml(output_directory, results_txt, body_section='figure'):
    """globs the specified section parsed xml and dumps the text to a file

    Args:
        output_directory (str): CProject directory
        results_txt (str): name of text file to write parsed XML
        body_section (str, optional): substring of the section file names to match. Defaults to 'figure'.
    """
    WORKING_DIRECTORY = os.getcwd()
    glob_results = glob.glob(os.path.join(WORKING_DIRECTORY,
                                          output_directory,"*", "sections",
                                          "**", f"*{body_section}*.xml"), recursive = True)
    for glob_result in glob_results:
        logging.info(f'sections: {glob_result}')
    # the context manager closes the file once all sections are written
    with open(results_txt, "w", encoding='utf-8') as file1:
        for result in glob_results:
            tree = ET.parse(result)
            root = tree.getroot()
            xmlstr = ET.tostring(root, encoding='utf8', method='xml')
            soup = BeautifulSoup(xmlstr, features='lxml')
            text = soup.get_text(separator="")
            text = text.replace('\n', '')
            print(text, file=file1)
    logging.info(f'wrote text to {results_txt}')
    
def key_phrase_extraction(results_txt, terms_txt):
    """extract key phrases from the text file with parsed xml and saves the phrases in a text file (comma-separated)

    Args:
        results_txt (str): text file with parsed XML text
        terms_txt (str): name of text file with comma-separated extracted key phrases
    """
    text = pathlib.Path(results_txt).read_text(encoding='utf-8')
    custom_kw_extractor = yake.KeywordExtractor(lan='en', n=2, top=50, features=None)
    keywords = custom_kw_extractor.extract_keywords(text)
    keywords_list = []
    for kw in keywords:
        keywords_list.append(kw[0])
    logging.info('extracted key phrases')
    
    keywords_list_string = ', '.join(str(i) for i in keywords_list)
    with open(terms_txt, 'w', encoding='utf-8') as fo:
        fo.write(keywords_list_string)
    logging.info(f'wrote the phrases to {terms_txt}')

## Defining all variables

In [3]:
OD_QUERY = '(cyclic voltammetry) AND batteries'
OD_HITS = 50
OD_OUTPUT='cyclic_voltammetry_20210824_1'
OD_RESULTS= 'cyclic_volammtery_1.txt'
OD_TERMS = 'terms_1.txt'
OD_OUTPUT_2 = 'cyclic_voltammetry_2'
#OD_RESULTS_2= 'cyclic_volammtery_2.txt'




#querying_pygetpapers_sectioning(OD_QUERY, OD_HITS, OD_OUTPUT_2, using_terms=True, terms_txt=OD_TERMS)

## 1. Query [`pygetpapers`](https://pypi.org/project/pygetpapers/)
`pygetpapers` is a command-line tool that downloads open-access scientific papers from repositories such as Europe PMC (EPMC), bioRxiv, and arXiv.
![image](https://user-images.githubusercontent.com/70576776/130623817-73596788-a3b1-4a35-9332-1d0cf375a7d7.png)
In this demo, we used `pygetpapers` to download `50` papers in XML format on `(cyclic voltammetry) AND batteries` from EPMC.
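
The function `querying_pygetpapers_sectioning` builds this call as a single shell string. As a minimal sketch, the same flags (`-q`, `-k`, `-o`, `-x`, `--terms`) can also be assembled as a list of arguments, so the query needs no manual quoting. The helper name `build_pygetpapers_command` is hypothetical and not part of `pygetpapers`.

```python
def build_pygetpapers_command(query, hits, output_directory, terms_txt=None):
    """Return the pygetpapers call as a list of arguments (hypothetical helper)."""
    command = ["pygetpapers", "-q", query, "-k", str(hits),
               "-o", output_directory, "-x"]
    # --terms is only added when a terms file is supplied
    if terms_txt is not None:
        command += ["--terms", terms_txt]
    return command


command = build_pygetpapers_command(
    "(cyclic voltammetry) AND batteries", 50, "cyclic_voltammetry_20210824_1")
print(command)
```

The list could then be passed to `subprocess.run(command)`. Whether that works without `shell=True` depends on how the tools are installed on your system, so treat it as an option to try rather than a drop-in replacement.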
## 2. Section papers using [`ami-section`](https://github.com/petermr/ami3)
We use `ami`'s sectioning functionality to split each paper into smaller sections (such as Introduction, Method, Results, and Figures).
![image](https://user-images.githubusercontent.com/70576776/130624722-aecb3ff3-c26c-490a-92c5-30bb98b25318.png)


In [None]:
querying_pygetpapers_sectioning(OD_QUERY, OD_HITS, OD_OUTPUT)

INFO:root:querying pygetpapers
INFO:root:running ami section


`pygetpapers` gives us: 
![image](https://user-images.githubusercontent.com/70576776/130625542-192e3133-91d7-4b6d-815f-9cc3db924a4f.png)

After ami-section: 
![image](https://user-images.githubusercontent.com/70576776/130625282-407b6f91-7ed6-4735-90e7-6334bd798f97.png)
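
To check what sectioning produced, one option is to count the matching section files per paper. The sketch below assumes the layout that `parse_xml` globs for (`<CProject>/<paper>/sections/**/*<body_section>*.xml`). The helper name `count_section_files` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_section_files(cproject_dir, body_section="figure"):
    """Count matching section XML files per paper folder (hypothetical helper)."""
    cproject = Path(cproject_dir)
    counts = Counter()
    # same pattern as parse_xml, relative to the CProject directory
    for xml_file in cproject.glob(f"*/sections/**/*{body_section}*.xml"):
        # the first path component below the CProject is the paper folder
        paper_id = xml_file.relative_to(cproject).parts[0]
        counts[paper_id] += 1
    return dict(counts)
```

Calling `count_section_files(OD_OUTPUT)` would return a dictionary that maps each paper folder (for example a PMC ID) to its number of figure files. Papers without any matching file are simply absent from the result.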


## 3. Get text from Figure Caption (or section of your choice)
Sectioning makes it easy to select a specific section of each paper and extract its text.

In this demonstration, we write all the text to a single `.txt` file called `cyclic_volammtery_1.txt`.
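
As a rough illustration of the text-extraction step, the sketch below pulls the text out of one XML fragment using only the standard library. It is not the notebook's `parse_xml`, which uses BeautifulSoup. The helper name `xml_to_text` and the sample figure XML are made up for this example.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string):
    """Return the text content of an XML fragment with whitespace collapsed."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text node in document order
    pieces = root.itertext()
    # join with spaces so neighbouring elements do not run together,
    # then collapse newlines and repeated spaces
    return " ".join(" ".join(pieces).split())


# made-up sample resembling a sectioned figure file
sample = """<fig id="F1">
  <label>Figure 1</label>
  <caption><p>Cyclic voltammetry curves at
  a scan rate of 5 mV/s.</p></caption>
</fig>"""
print(xml_to_text(sample))
# prints: Figure 1 Cyclic voltammetry curves at a scan rate of 5 mV/s.
```

Joining with a space keeps the label and caption apart, but it also inserts spaces around inline markup such as subscripts. Joining with an empty string, as `parse_xml` does, has the opposite trade-off.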

In [16]:
parse_xml(OD_OUTPUT,OD_RESULTS)

INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7645205\sections\3_floats-group\0_figure_1.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7645205\sections\3_floats-group\1_figure_2.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7645205\sections\3_floats-group\3_figure_3.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7645205\sections\3_floats-group\4_figure_4.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7645205\sections\3_floats-group\5_figure_5.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7645205\sections\3_floats-group\6_figure_6.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7693081\sections\3_floats-group\0_figure_1.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7693081\sections\3_floats-group\1_figur

INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7957769\sections\3_floats-group\3_figure_4.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7957769\sections\3_floats-group\4_figure_5.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7957769\sections\3_floats-group\5_figure_6.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7957769\sections\3_floats-group\6_figure_7.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7957769\sections\3_floats-group\7_figure_8.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7996271\sections\3_floats-group\0_figure_1.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7996271\sections\3_floats-group\1_figure_2.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC7996271\sections\3_floats-group\2_figur

INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8071024\sections\3_floats-group\11_figure_12.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8071024\sections\3_floats-group\12_figure_13.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8071024\sections\3_floats-group\1_figure_2.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8071024\sections\3_floats-group\2_figure_3.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8071024\sections\3_floats-group\3_figure_4.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8071024\sections\3_floats-group\4_figure_5.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8071024\sections\3_floats-group\5_figure_6.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8071024\sections\3_floats-group\6_f

INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8198776\sections\3_floats-group\0_figure_1.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8198776\sections\3_floats-group\10_figure_11.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8198776\sections\3_floats-group\11_figure_12.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8198776\sections\3_floats-group\12_figure_13.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8198776\sections\3_floats-group\13_figure_14.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8198776\sections\3_floats-group\14_figure_15.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8198776\sections\3_floats-group\15_figure_16.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8198776\sections\3_floats-g

INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8270313\sections\3_floats-group\6_figure_2.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8270313\sections\3_floats-group\7_figure_3.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8270313\sections\3_floats-group\8_figure_4.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8270313\sections\3_floats-group\9_figure_5.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8279017\sections\3_floats-group\0_figure_1.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8279017\sections\3_floats-group\2_figure_2.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8279017\sections\3_floats-group\3_figure_3.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8279017\sections\3_floats-group\5_figur

INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8336491\sections\3_floats-group\2_figure_3.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8336491\sections\3_floats-group\3_figure_4.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8336491\sections\3_floats-group\4_figure_5.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8336491\sections\3_floats-group\5_figure_6.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8336491\sections\3_floats-group\6_figure_7.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8348064\sections\3_floats-group\0_figure_1.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8348064\sections\3_floats-group\1_figure_2.xml
INFO:root:sections: C:\Users\shweata\snowball\cyclic_voltammetry_20210824_1\PMC8348064\sections\3_floats-group\2_figur

## 4. Extract Key Phrases from the text retrieved earlier using [`YAKE!`](https://github.com/LIAAD/yake)
For this step, you can use any unsupervised key phrase extractor of your choice; we used YAKE. We then write all the extracted phrases to `terms_1.txt`.

Extracted key phrases (copied from the text file):
```
Figure, Cyclic voltammetry, scan rate, rate, Electrode, Chemical Society, American Chemical, electrodes, Cyclic, curves, discharge curves, discharge, charge, current, cycles, SEM image, image, KOH, voltammetry, SiO, Nyquist plots, SEM, SEM images, cell, voltammetry curves, cells, scan, images, permission, cycling, RDC, Cyclic voltammogram, TEM image, voltage, Rate capability, cycle, electrolyte, AMT, ZnO, carbon, Nyquist, spectra, copyright, Chemical, current density, LFP, Reproduced, CNT, XRD, CNT interlayer
```

In [17]:
key_phrase_extraction(OD_RESULTS, OD_TERMS)

INFO:root:extracted key phrases
INFO:root:wrote the phrases to terms_1.txt
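
Some extracted phrases, such as `Figure`, `permission`, `copyright`, and `Reproduced`, come from caption boilerplate rather than from the science. The sketch below shows one possible way to drop such terms from the comma-separated string before reusing it. The helper `filter_terms` and the choice of unwanted terms are assumptions for illustration, not part of the notebook's pipeline.

```python
def filter_terms(terms_string, unwanted):
    """Split a comma-separated string and drop unwanted terms (case-insensitive)."""
    unwanted_lower = {term.lower() for term in unwanted}
    terms = [term.strip() for term in terms_string.split(",")]
    # skip empty entries and anything on the unwanted list
    return [term for term in terms
            if term and term.lower() not in unwanted_lower]


# a shortened excerpt of the extracted phrases
extracted = "Figure, Cyclic voltammetry, scan rate, permission, copyright, Reproduced, XRD"
print(filter_terms(extracted, {"figure", "permission", "copyright", "reproduced"}))
# prints: ['Cyclic voltammetry', 'scan rate', 'XRD']
```

The filtered list could be written back to a file with `', '.join(...)`, the same way `key_phrase_extraction` writes its output.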
