Cache regexp
Python 2 does not cache re.sub regexps,
and a precompiled pattern is faster even on Python 3
lopuhin committed May 26, 2017
1 parent 6135ba6 commit 43f1bd4
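
The speed claim in the commit message can be checked with a small timing sketch. This is not part of the commit: `SAMPLE`, `with_module_sub`, `with_precompiled`, and `compare` are hypothetical names chosen for illustration, and the pattern is written as a raw string (`r"\s+"`) to avoid escape-sequence warnings on newer Python 3 versions. Absolute numbers will vary by interpreter and machine.

```python
import re
import timeit

# Hypothetical sample input; not taken from the repository.
SAMPLE = "  some   text\twith \n irregular   whitespace  "

# Compiled once at import time, as the commit does with `_whitespace`.
_whitespace = re.compile(r"\s+")


def with_module_sub(text):
    # re.sub has to find (or build) the compiled pattern on every call.
    return re.sub(r"\s+", " ", text.strip())


def with_precompiled(text):
    # Calls .sub directly on the already compiled pattern object.
    return _whitespace.sub(" ", text.strip())


def compare(number=100_000):
    """Return (module_level_seconds, precompiled_seconds) for `number` calls."""
    module_level = timeit.timeit(lambda: with_module_sub(SAMPLE), number=number)
    precompiled = timeit.timeit(lambda: with_precompiled(SAMPLE), number=number)
    return module_level, precompiled


if __name__ == "__main__":
    module_level, precompiled = compare()
    print("re.sub:          %.3fs" % module_level)
    print("_whitespace.sub: %.3fs" % precompiled)
```

Both functions return the same string; only the per-call overhead of reaching the compiled pattern differs.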
Showing 1 changed file with 4 additions and 1 deletion.
html_text/html_text.py (4 additions, 1 deletion)
@@ -41,13 +41,16 @@ def parse_html(html):
     return lxml.html.fromstring(html.encode('utf8'), parser=parser)
 
 
+_whitespace = re.compile('\s+')
+
+
 def selector_to_text(sel):
     """ Convert a cleaned selector to text.
     Almost the same as xpath normalize-space, but this also
     adds spaces between inline elements (like <span>) which are
     often used as block elements in html markup.
     """
-    fragments = (re.sub('\s+', ' ', x.strip())
+    fragments = (_whitespace.sub(' ', x.strip())
                  for x in sel.xpath('//text()').extract())
     return ' '.join(x for x in fragments if x)
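
To see what the changed lines do without a selector object, here is a minimal sketch that applies the same normalization to a plain list of strings. `fragments_to_text` and `nodes` are hypothetical stand-ins, assuming the list plays the role of the text nodes returned by `sel.xpath('//text()').extract()`; the pattern is written as a raw string here.

```python
import re

# Module-level compiled pattern, mirroring `_whitespace` in the diff.
_whitespace = re.compile(r"\s+")


def fragments_to_text(text_nodes):
    """Hypothetical stand-in for selector_to_text that takes the
    already extracted text nodes instead of a selector."""
    # Strip each node, then collapse internal whitespace runs to one space.
    fragments = (_whitespace.sub(" ", x.strip()) for x in text_nodes)
    # Drop nodes that were empty or whitespace-only, join the rest with spaces.
    return " ".join(x for x in fragments if x)


nodes = ["\n  Hello", "   ", "big\n\tworld ", ""]
print(fragments_to_text(nodes))  # prints "Hello big world"
```

Whitespace-only nodes disappear and each remaining fragment is separated by exactly one space, which is how text from adjacent inline elements ends up space-separated.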
