Open
Description
Hi,
I've been scraping the French Wiktionary and found an issue with WiktionnaireParser: for some words (6,326 out of 1,874,000), the method "get_word_data" raises an exception and returns no word data.
After some investigation, the error is raised in the cssselect call (though I don't know why it happens only on these specific words). For now I've hotfixed the code with a try/except around "for p_ in p:" (line 146 of the parser).
Here is the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
d:\workspaces\peg_words\v2\check_dataframes.py in <module>
51 ERROR_PAGE = df_all_words[df_all_words["status"] == "ERROR_PAGE"]
52 page = wiktp.from_source("lithotypographier")
---> 53 page.get_word_data
D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_word_data(self)
57 'title' : self.get_title(),
58 'etymologies' : self.get_etymology(),
---> 59 'partOfSpeech': self.get_parts_of_speech(),
60 }
61
D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_parts_of_speech(self)
144 ]
145 try:
--> 146 for p_ in p:
147 related = self.get_related_words(p_)
148 if related:
D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_related_words(self, related_word)
302
303 section = section.getparent().getnext()
--> 304 if 'Notes' in value:
305 related = self.get_notes(section)
306 else:
D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\utils.py in extract_related_words(section)
47 url = '/wiki/'
48 while section.tag != 'h3' and section.tag != 'h4':
---> 49 for link in section.cssselect('a'):
50 if 'Annexe:' in link.attrib.get('href'):
51 continue
src\lxml\etree.pyx in lxml.etree._Element.cssselect()
src\lxml\xpath.pxi in lxml.etree.XPath.__call__()
src\lxml\apihelpers.pxi in lxml.etree._rootNodeOrRaise()
ValueError: Input object is not an XML element: HtmlComment
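The last frame suggests that the sibling walk in extract_related_words reached an HTML comment node, and lxml refuses to run a CSS selector on a comment. Below is a minimal sketch of one possible guard, not the library's actual code. The helper names `is_element` and `iter_section_links` are hypothetical, and advancing with `getnext()` is an assumption, since the traceback does not show how the loop moves to the next sibling. The sketch relies on the fact that in lxml an element's `tag` is a string, whereas a comment's `tag` is not.

```python
def is_element(node):
    """Return True for real elements, False for comments and similar nodes.

    In lxml, element nodes have a string ``tag``; comment and
    processing-instruction nodes expose a factory function instead.
    """
    return isinstance(node.tag, str)


def iter_section_links(section, stop_tags=("h3", "h4")):
    """Yield the <a> elements found in `section` and its following siblings.

    Assumption: the walk advances with getnext() and stops at the next
    heading in `stop_tags` (or when there are no more siblings).
    Non-element nodes such as comments are skipped instead of being
    passed to cssselect(), which would raise ValueError.
    """
    while section is not None and section.tag not in stop_tags:
        if is_element(section):
            yield from section.cssselect("a")
        section = section.getnext()
```

With a guard like this, a comment sitting between two blocks of links would be skipped instead of aborting the whole page. The caller can still filter out links whose href contains 'Annexe:', as the original loop does.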
Here are some of the words that fail:
- à croupeton
- lithotypographier
- piloris
- pied au plancher
- clochepied
- cloîtres
Thank you anyway for your work, it's awesome :)