Error with related words #19

@lrosique

Hi,

I've been scraping the French Wiktionary and I've found an issue with WiktionnaireParser: for some words (6,326 out of 1,874,000), the method "get_word_data" crashes and returns no word data.

After some investigation, the error comes from the cssselect module (even though I don't know why it happens on these specific words), and I've hotfixed the code with a try/except around "for p_ in p:" (line 146 of the parser).

Here is the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
d:\workspaces\peg_words\v2\check_dataframes.py in <module>
     51 ERROR_PAGE = df_all_words[df_all_words["status"] == "ERROR_PAGE"]
     52 page = wiktp.from_source("lithotypographier")
---> 53 page.get_word_data

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_word_data(self)
     57             'title'       : self.get_title(),
     58             'etymologies' : self.get_etymology(),
---> 59             'partOfSpeech': self.get_parts_of_speech(),
     60         }
     61 

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_parts_of_speech(self)
    144         ]
    145         try:
--> 146             for p_ in p:
    147                 related = self.get_related_words(p_)
    148                 if related:

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\parser.py in get_related_words(self, related_word)
    302 
    303             section = section.getparent().getnext()
--> 304             if 'Notes' in value:
    305                 related = self.get_notes(section)
    306             else:

D:\dev\Python\Python387\lib\site-packages\wiktionnaireparser\utils.py in extract_related_words(section)
     47     url = '/wiki/'
     48     while section.tag != 'h3' and section.tag != 'h4':
---> 49         for link in section.cssselect('a'):
     50             if 'Annexe:' in link.attrib.get('href'):
     51                 continue

src\lxml\etree.pyx in lxml.etree._Element.cssselect()

src\lxml\xpath.pxi in lxml.etree.XPath.__call__()

src\lxml\apihelpers.pxi in lxml.etree._rootNodeOrRaise()

ValueError: Input object is not an XML element: HtmlComment

Here are some of the words that fail:

  • à croupeton
  • lithotypographier
  • piloris
  • pied au plancher
  • clochepied
  • cloîtres
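
The final line of the traceback ("Input object is not an XML element: HtmlComment") suggests that the sibling walk in extract_related_words reaches an HTML comment node, which does not support cssselect. This has not been confirmed for every word listed. Below is a minimal sketch of a narrower guard than a blanket try/except, assuming that is the cause. The function name extract_links_skipping_comments and the sample markup are hypothetical and are not the library's actual code. The sketch uses iter('a') rather than cssselect('a') so that it needs only lxml.

```python
from lxml import html

# Hypothetical markup: a comment sits between two lists of related words.
SAMPLE = """
<div>
  <h4>Dérivés</h4>
  <ul><li><a href="/wiki/lithotypographie">lithotypographie</a></li></ul>
  <!-- a comment sitting between sections -->
  <ul><li><a href="/wiki/Annexe:Exemple">annexe</a></li></ul>
  <h4>Traductions</h4>
</div>
""".strip()


def extract_links_skipping_comments(section):
    """Collect hrefs from the siblings after `section` up to the next h3/h4.

    Hypothetical helper, loosely modelled on the loop shown in the traceback.
    """
    hrefs = []
    node = section.getnext()
    while node is not None and node.tag not in ('h3', 'h4'):
        # In lxml, comments and processing instructions have a non-string
        # .tag, so this check skips them instead of querying them.
        if isinstance(node.tag, str):
            for link in node.iter('a'):
                # href may be missing; fall back to an empty string.
                href = link.attrib.get('href') or ''
                if 'Annexe:' in href:
                    continue
                hrefs.append(href)
        node = node.getnext()
    return hrefs


root = html.fromstring(SAMPLE)
heading = root.find('h4')
links = extract_links_skipping_comments(heading)
print(links)
```

Skipping only non-element nodes keeps the related words found before and after the comment, whereas a try/except around the whole loop would drop everything for that word.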

Thank you anyway for your work, it's awesome :)
