# Parse a LaTeX file and extract the paragraphs to be translated

This notebook uses pylatexenc as the LaTeX parser.

Documentation: https://pylatexenc.readthedocs.io/en/latest/

## Convert LaTeX to text using pylatexenc

This section is for demonstration purposes only.

In [1]:
from pylatexenc.latex2text import LatexNodes2Text

input_file = input()  # path to the LaTeX source, e.g. test.tex
with open(input_file) as f:
    latex = f.read()
# Convert the whole document to plain text
text = LatexNodes2Text().latex_to_text(latex)
with open('test.txt', mode='w') as f:
    f.write(text)

 test.tex


## Use pylatexenc parser to analyze

### Read the LaTeX source from a file

In [24]:
from pylatexenc.latexwalker import LatexWalker, LatexEnvironmentNode

input_file = input()
with open(input_file) as f:
    latex = f.read()
w = LatexWalker(latex)

 test.tex


### Check properties of output

In [3]:
(nodelist, pos, len_) = w.get_latex_nodes(pos=0)
len(nodelist)

82

In [4]:
nodelist[0]

LatexCommentNode(parsing_state=<parsing state 140667701332192>, pos=0, len=51, comment='%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%', comment_post_space='\n')

In [5]:
nodelist[1]

LatexMacroNode(parsing_state=<parsing state 140667701332192>, pos=51, len=38, macroname='documentclass', nodeargd=ParsedMacroArgs(argspec='[{', argnlist=[LatexGroupNode(parsing_state=<parsing state 140667701332192>, pos=65, len=15, nodelist=[LatexCharsNode(parsing_state=<parsing state 140667701332192>, pos=66, len=13, chars='12pt, a4paper')], delimiters=('[', ']')), LatexGroupNode(parsing_state=<parsing state 140667701332192>, pos=80, len=9, nodelist=[LatexCharsNode(parsing_state=<parsing state 140667701332192>, pos=81, len=7, chars='article')], delimiters=('{', '}'))]), macro_post_space='')

In [6]:
nodelist[78]

LatexMacroNode(parsing_state=<parsing state 140667701332192>, pos=1820, len=10, macroname='rightskip', nodeargd=ParsedMacroArgs(argspec='', argnlist=[]), macro_post_space='')

In [7]:
nodelist[79]

LatexCharsNode(parsing_state=<parsing state 140667701332192>, pos=1830, len=6, chars='=0pt\n\n')

**The next element, nodelist[80], is the \begin{document} environment.**

In [8]:
nodelist[81]

LatexCharsNode(parsing_state=<parsing state 140667701332192>, pos=86536, len=1, chars='\n')
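
The index 80 is specific to test.tex. As a minimal sketch, the document environment can instead be located by name. The helper `find_environment` and the stand-in nodes are hypothetical and not part of pylatexenc; the sketch only assumes that, among the node classes shown above, an environment node is the one carrying an `environmentname` attribute.

```python
from types import SimpleNamespace

def find_environment(nodes, name):
    """Return (index, node) of the first top-level environment called `name`, or None."""
    return next(
        ((i, node) for i, node in enumerate(nodes)
         # Nodes without an `environmentname` attribute are skipped.
         if getattr(node, "environmentname", None) == name),
        None,
    )

# Stand-in nodes so the sketch runs without a .tex file (assumption, not parser output).
demo_nodes = [
    SimpleNamespace(macroname="documentclass"),
    SimpleNamespace(chars="\n"),
    SimpleNamespace(environmentname="document", nodelist=[]),
]
index, doc = find_environment(demo_nodes, "document")
print(index, doc.environmentname)
```

In the notebook, `find_environment(nodelist, 'document')` would be called on the list returned by `w.get_latex_nodes(pos=0)`.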

### Check properties of the \begin{document} environment

In [9]:
nodelist[80].environmentname

'document'

In [10]:
(nodelist[80].pos, nodelist[80].len)

(1836, 84700)

In [11]:
len(nodelist[80].nodelist)

1782

In [12]:
abstract = nodelist[80].nodelist[2].nodelist[5]
abstract

LatexEnvironmentNode(parsing_state=<parsing state 140667701332192>, pos=2770, len=652, environmentname='abstract', nodelist=[LatexCharsNode(parsing_state=<parsing state 140667701332192>, pos=2786, len=622, chars='\nThe QCD axion or axion-like particles are candidates of dark matter of the universe. On the other hand, axion-like excitations exist in certain condensed matter systems, which implies that there can be interactions of dark matter particles with condensed matter axions. We discuss the relationship between the condensed matter axion and a collective spin-wave excitation in an anti-ferromagnetic insulator at the quantum level. The conversion rate of the light dark matter, such as the elementary particle axion or hidden photon, into the condensed matter axion is estimated for the discovery of the dark matter signals.\n')], nodeargd=ParsedMacroArgs(argspec='', argnlist=[]))

In [13]:
(abstract.environmentname, type(abstract))

('abstract', pylatexenc.latexwalker.LatexEnvironmentNode)

In [34]:
(abstract.pos, abstract.len, abstract.nodeType())

(2770, 652, pylatexenc.latexwalker.LatexEnvironmentNode)

In [14]:
abstract.nodelist[0].chars

'\nThe QCD axion or axion-like particles are candidates of dark matter of the universe. On the other hand, axion-like excitations exist in certain condensed matter systems, which implies that there can be interactions of dark matter particles with condensed matter axions. We discuss the relationship between the condensed matter axion and a collective spin-wave excitation in an anti-ferromagnetic insulator at the quantum level. The conversion rate of the light dark matter, such as the elementary particle axion or hidden photon, into the condensed matter axion is estimated for the discovery of the dark matter signals.\n'
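
The path `nodelist[80].nodelist[2].nodelist[5]` was found by hand and will differ for other files. A possible alternative is a depth-first search through the nested `nodelist` attributes. The functions `iter_nodes` and `find_first_environment` are hypothetical helpers, and the stand-in tree only imitates the nesting seen above. Note that this sketch does not descend into macro arguments (`nodeargd.argnlist`).

```python
from types import SimpleNamespace

def iter_nodes(nodes):
    """Yield every node in `nodes` and, depth-first, every node nested in a `nodelist`."""
    for node in nodes:
        if node is None:
            continue
        yield node
        children = getattr(node, "nodelist", None)
        if children:
            yield from iter_nodes(children)

def find_first_environment(nodes, name):
    """Return the first environment node called `name` at any depth, or None."""
    for node in iter_nodes(nodes):
        if getattr(node, "environmentname", None) == name:
            return node
    return None

# Stand-in tree (assumption): document -> group-like node -> abstract.
demo_tree = [
    SimpleNamespace(macroname="documentclass"),
    SimpleNamespace(
        environmentname="document",
        nodelist=[
            SimpleNamespace(chars="\n"),
            SimpleNamespace(nodelist=[
                SimpleNamespace(
                    environmentname="abstract",
                    nodelist=[SimpleNamespace(chars="\nAbstract text.\n")],
                ),
            ]),
        ],
    ),
]
found = find_first_environment(demo_tree, "abstract")
print(found.nodelist[0].chars.strip())
```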

In [15]:
section1 = nodelist[80].nodelist[25]
section1

LatexMacroNode(parsing_state=<parsing state 140667701332192>, pos=3796, len=22, macroname='section', nodeargd=ParsedMacroArgs(argspec='*[{', argnlist=[None, None, LatexGroupNode(parsing_state=<parsing state 140667701332192>, pos=3804, len=14, nodelist=[LatexCharsNode(parsing_state=<parsing state 140667701332192>, pos=3805, len=12, chars='Introduction')], delimiters=('{', '}'))]), macro_post_space='')

In [16]:
(section1.macroname, type(section1.nodeargd.argnlist[2]), section1.nodeargd.argnlist[2].nodelist[0].chars)

('section', pylatexenc.latexwalker.LatexGroupNode, 'Introduction')
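
The title lookup above relies on the argument sitting at index 2 of `argnlist`. A slightly more general sketch takes the last argument that is not `None` and joins the `chars` of its children. The helper `last_arg_text` is hypothetical, and treating the last non-empty argument as the mandatory one is an assumption that holds for the `\section` node shown above. Titles containing macros or math would lose those parts.

```python
from types import SimpleNamespace

def last_arg_text(macro_node):
    """Join the plain-text children of the last non-None argument of a macro node."""
    nodeargd = getattr(macro_node, "nodeargd", None)
    if nodeargd is None:
        return ""
    args = [arg for arg in nodeargd.argnlist if arg is not None]
    if not args:
        return ""
    # Children without a `chars` attribute (macros, math) contribute nothing.
    return "".join(getattr(child, "chars", "") for child in args[-1].nodelist)

# Stand-in mirroring the structure of the \section node printed above (assumption).
demo_section = SimpleNamespace(
    macroname="section",
    nodeargd=SimpleNamespace(
        argspec="*[{",
        argnlist=[None, None,
                  SimpleNamespace(nodelist=[SimpleNamespace(chars="Introduction")])],
    ),
)
print(last_arg_text(demo_section))
```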

In [33]:
(section1.pos, section1.len, section1.nodeType())

(3796, 22, pylatexenc.latexwalker.LatexMacroNode)

In [43]:
latex[3796:3796+22]

'\\section{Introduction}'
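
Because `pos` and `len` index directly into the source string, they can also be used to put translated text back in place. The following is a sketch under that assumption; `node_source` and `replace_node` are hypothetical helpers, and the replacement title is only an illustrative value.

```python
from types import SimpleNamespace

def node_source(latex, node):
    """Return the exact source text covered by `node`."""
    return latex[node.pos:node.pos + node.len]

def replace_node(latex, node, new_text):
    """Return a copy of `latex` with the span of `node` replaced by `new_text`."""
    return latex[:node.pos] + new_text + latex[node.pos + node.len:]

# Stand-in source and node (assumption); len=22 matches '\section{Introduction}'.
demo_latex = "Preamble\n\\section{Introduction}\nBody text"
demo_node = SimpleNamespace(pos=demo_latex.index("\\section"), len=22)
print(node_source(demo_latex, demo_node))
print(replace_node(demo_latex, demo_node, "\\section{Einleitung}"))
```

When several nodes are replaced in one string, working from the last node to the first keeps the earlier `pos` values valid.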

In [17]:
(type(nodelist[80].nodelist[32]), nodelist[80].nodelist[32].chars)

(pylatexenc.latexwalker.LatexCharsNode,
 '\n\n\nThe QCD axion is a hypothetical elementary particle that solves the strong CP problem')

In [36]:
sp = nodelist[80].nodelist[33]
sp

LatexSpecialsNode(parsing_state=<parsing state 140667701332192>, pos=4000, len=1, specials_chars='~', nodeargd=None)

In [38]:
sp.specials_chars

'~'

In [19]:
nodelist[80].nodelist[34]

LatexMacroNode(parsing_state=<parsing state 140667701332192>, pos=4001, len=51, macroname='cite', nodeargd=ParsedMacroArgs(argspec='*[[{', argnlist=[None, None, None, LatexGroupNode(parsing_state=<parsing state 140667701332192>, pos=4006, len=46, nodelist=[LatexCharsNode(parsing_state=<parsing state 140667701332192>, pos=4007, len=44, chars='Peccei:1977hh,Weinberg:1977ma,Wilczek:1977pj')], delimiters=('{', '}'))]), macro_post_space='')

In [22]:
# Note: the second element reads index 3, not 35, so the '\n\n' below is nodelist[3].chars
(type(nodelist[80].nodelist[35]), nodelist[80].nodelist[3].chars)

(pylatexenc.latexwalker.LatexCharsNode, '\n\n')
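
The nodes 32 to 35 above show that one paragraph is spread over several nodes: text, a `~`, a `\cite` macro, then more text. One possible way to reassemble paragraphs, sketched here, is to concatenate consecutive nodes and start a new paragraph whenever a text node contains a blank line. The functions `split_paragraphs` and `make_nodes` are hypothetical, the stand-in nodes only imitate the `pos`, `len`, and `chars` attributes seen above, and block-level nodes such as figures are not treated specially.

```python
import re
from types import SimpleNamespace

BLANK_LINE = re.compile(r"\n[ \t]*\n\s*")

def split_paragraphs(latex, nodes):
    """Group consecutive nodes into paragraphs, splitting at blank lines in text nodes."""
    paragraphs, current = [], ""
    for node in nodes:
        chars = getattr(node, "chars", None)
        if chars is None:
            # Non-text node (macro, specials, environment): keep its source verbatim.
            current += latex[node.pos:node.pos + node.len]
            continue
        pieces = BLANK_LINE.split(chars)
        current += pieces[0]
        for piece in pieces[1:]:
            # A blank line ends the running paragraph and starts a new one.
            if current.strip():
                paragraphs.append(current.strip())
            current = piece
    if current.strip():
        paragraphs.append(current.strip())
    return paragraphs

def make_nodes(parts):
    """Build stand-in nodes from (kind, text) pairs, tracking pos/len as offsets."""
    nodes, pos = [], 0
    for kind, text in parts:
        node = SimpleNamespace(pos=pos, len=len(text))
        if kind == "chars":
            node.chars = text
        nodes.append(node)
        pos += len(text)
    return nodes

# Shortened stand-in for the nodes inspected above (assumption, not parser output).
demo_parts = [
    ("chars", "\n\n\nThe QCD axion solves the strong CP problem"),
    ("specials", "~"),
    ("macro", "\\cite{Peccei:1977hh}"),
    ("chars", ".\n\nA second paragraph."),
]
demo_source = "".join(text for _, text in demo_parts)
for paragraph in split_paragraphs(demo_source, make_nodes(demo_parts)):
    print(repr(paragraph))
```

Keeping macros such as `\cite` verbatim inside the paragraph leaves them available to be protected from the translation step later.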