Tag Len DataError Occuring Regardless of Tag Len Matching Address Len #124

joseandrejv · 2022-05-10T19:30:45Z

I'm trying to retrain a Bpemb model with new address tags, and I'm using CSVDatasetContainer to load the data. I've followed all the guidelines so that the data is read in without errors. The training data is two columns in the required format. None of the addresses are empty or whitespace-only, and I've verified repeatedly that the length of each address matches the length of its tag list. I did this by tokenizing the original addresses and programmatically comparing their lengths with the lengths of the tag lists from the same row (using a pandas version of the same dataframe).

I also dug into the source code and tried the function you have there (_data_tags_is_same_len_then_address). When I run it on the pandas version of my df, the output is True, which is supposed to mean that everything is as it should be. I also tried PickleDatasetContainer instead, using a .p file with the data formatted as requested, and I get the same error.
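For readers who want to repeat the length check described above, here is a minimal sketch. It assumes whitespace tokenization and tag lists that are already Python lists; the function name, sample addresses, and tag names are illustrative and are not taken from deepparse or from the reporter's data.

```python
def find_length_mismatches(rows):
    """Return (row_index, n_tokens, n_tags) for every row whose counts differ.

    `rows` is an iterable of (address, tags) pairs, where `address` is a
    string and `tags` is a list with one tag per address token.
    """
    mismatches = []
    for index, (address, tags) in enumerate(rows):
        n_tokens = len(address.split())  # whitespace tokenization (assumption)
        n_tags = len(tags)
        if n_tokens != n_tags:
            mismatches.append((index, n_tokens, n_tags))
    return mismatches


# Illustrative rows only: the second one has four tokens but three tags.
rows = [
    ("350 rue des Lilas Ouest Quebec",
     ["StreetNumber", "StreetName", "StreetName", "StreetName",
      "Orientation", "Municipality"]),
    ("12 calle mayor madrid",
     ["StreetNumber", "StreetName", "Municipality"]),
]
mismatches = find_length_mismatches(rows)
print(mismatches)
```

A check like this only validates the data as it exists in memory. If the loader parses the tag column differently when it reads the file, the check can pass while the loader still raises the error.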

This is how I'm trying to read in the data:
CSVDatasetContainer(training_dataset_name + "." + file_extension, column_names=['Address', 'Tags'], separator=',')

And this is the error I keep getting:

System Info:

OS: Windows 10
IDE: VS Code
Python Version: 3.9.12
Deepparse Version: 0.7.3
Poutyne Version: 1.9 (I pinned this version to keep the progress bar feature. There's a separate issue where the code compares the Poutyne version to 1.8 as a float, and the latest version, 1.11, is numerically smaller than 1.8 when read as a decimal number.)
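
The version comparison problem mentioned for Poutyne can be illustrated with a small sketch. It is not the deepparse code; `version_tuple` is a hypothetical helper that assumes purely numeric, dot-separated version strings.

```python
def version_tuple(version):
    """Parse a dotted version such as '1.11' into a tuple of ints, (1, 11)."""
    return tuple(int(part) for part in version.split("."))


# Comparing as floats treats 1.11 as smaller than 1.8.
float_check = float("1.11") >= float("1.8")

# Comparing component by component orders the releases correctly.
tuple_check = version_tuple("1.11") >= version_tuple("1.8")

print(float_check, tuple_check)
```

Tuples compare element by element, so the minor version 11 is correctly ranked above 8. Version strings with suffixes such as "rc1" would need a real version parser instead of this helper.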

I'm not 100% sure whether this qualifies as a bug, but it sure is perplexing and I'm not sure where else to ask for help.

I guess this boils down to:

Is there anything about my system that could be causing this?
Is it the separator I'm using? (Without ',' the function won't read in the data correctly, and it has worked with a smaller training set before.)
Is there any other potential factor I haven't considered?

Thanks in advance for your help.

davebulaval · 2022-05-10T19:58:48Z

Is it possible for you to share your dataset with me (in private) to ease the debugging?

joseandrejv · 2022-05-10T20:14:31Z

Hello Dave and thank you for replying so promptly.

I'm going to ask my boss, but chances are that I won't be allowed to, as the information is (even in sample form with limited attributes) confidential. Is there anything else I could do to help? Anything I could check, or something?

davebulaval · 2022-05-10T20:20:53Z

No worries. I have pushed code on a branch to try to debug it.

Install the debug version of the project with pip install -U git+https://github.com/GRAAL-Research/deepparse.git@bug_fix_data_tags_len. You can later return to the stable version with pip install -U deepparse.

Then try running this:

dataset = CSVDatasetContainer(training_dataset_name + "." + file_extension, column_names=['Address', 'Tags'], separator=',')

I have pushed a new method that is invoked when the error occurs and prints the cases where the lengths differ. I have not tested it, however. Tell me if it shows proper details.

Then send a screenshot of the output (if it works properly).

Edits:
*I have simplified the code for you.
*Text improvement.

joseandrejv · 2022-05-10T20:26:05Z

I installed the fix and ran the code. Now there's a unicode error.

"UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte"

I also ran the code after installing -U deepparse again and got the same error.

In pandas I would change the encoding with the encoding argument, how would I fix this with CSVDatasetContainer?

*Edit: Would I just have to create a different file?

davebulaval · 2022-05-10T20:28:59Z

Uhm, which language are the addresses in? It is possible your CSV is not in UTF-8. Try recreating it, specifying UTF-8 encoding. It is also possible that there are special characters that look like whitespace but are not.
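
One way to act on this suggestion is sketched below, using only the standard library. The helper names are hypothetical, and the source encoding `cp1252` is an assumption (it is common for files saved on Windows); use the encoding the file was actually written in.

```python
import os
import tempfile


def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None


def resave_as_utf8(src_path, dst_path, source_encoding="cp1252"):
    """Read a text file in `source_encoding` and write it back as UTF-8."""
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demonstration on a throwaway file written in cp1252 (not UTF-8).
workdir = tempfile.mkdtemp()
legacy_path = os.path.join(workdir, "legacy.csv")
utf8_path = os.path.join(workdir, "utf8.csv")
with open(legacy_path, "wb") as handle:
    handle.write("calle Núñez".encode("cp1252"))

offset_before = first_invalid_utf8_offset(legacy_path)
resave_as_utf8(legacy_path, utf8_path)
offset_after = first_invalid_utf8_offset(utf8_path)
print(offset_before, offset_after)
```

Checking the offset first helps confirm whether the file is really the problem before rewriting it.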

joseandrejv · 2022-05-10T20:35:59Z

Wow, that makes sense!

The addresses are in Spanish, but part of the cleaning process involves programmatically replacing all accented Latin characters with plain letters (as well as removing other characters that aren't part of the language and have no business being in a standardized address), so I didn't think this would be an issue. I'll try recreating the file with the UTF-8 specification and let you know what happens as soon as I can.
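
The cleaning step described here is not shown in the thread. A common way to do it, sketched below as an assumption rather than the reporter's actual code, is Unicode NFKD normalization followed by dropping combining marks. NFKD also maps the non-breaking space (U+00A0) to a regular space, which covers one of the whitespace lookalikes mentioned earlier.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters with their base letters, e.g. 'ñ' -> 'n'."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left behind by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


cleaned_name = strip_accents("Núñez")
cleaned_space = strip_accents("Calle\u00a0Mayor")  # contains a non-breaking space
print(cleaned_name, cleaned_space)
```

Note that this changes the text but not the file encoding: a file containing only plain letters can still fail to decode if it was saved in a non-UTF-8 format or is not a text file at all.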

joseandrejv · 2022-05-10T21:21:11Z

Ok, I've re-saved the csv with utf-8 encoding specified, and ran the code once again. I'm back where I started, with the same DataError about the lengths of the Tag lists not matching the lengths of the addresses.

davebulaval · 2022-05-10T22:55:08Z

OK, and does the bug-fix branch print something useful? (That is the version of deepparse installed with pip install -U git+https://github.com/GRAAL-Research/deepparse.git@bug_fix_data_tags_len.)

joseandrejv · 2022-05-11T17:35:45Z

No. First it was the unicode error (which doesn't seem to be an issue now that I changed the file), and now I'm getting the same len error after reinstalling the bug-fix version and trying it out again.

Would sharing the data help much more? They got back to me and I can send it to you privately.

davebulaval · 2022-05-12T13:40:16Z

It would definitely be easier. david.beauchemin.5 at(@) ulaval.ca

davebulaval · 2022-05-12T16:23:14Z

See #127 for details.
The problem was that the list of tags was not properly split. It is fixed in release 0.7.4.
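
The thread does not show the actual fix, so the following is only a sketch of one plausible way a tag list can be "not properly split": if the tags cell is still a string when its length is measured, the result is a character count rather than a tag count. The `parse_tags` helper and the cell format (a stringified Python list) are assumptions, not the deepparse implementation.

```python
import ast


def parse_tags(cell):
    """Turn a stringified Python list such as "['A', 'B']" into a real list."""
    tags = ast.literal_eval(cell)
    if not isinstance(tags, list):
        raise ValueError(f"expected a list of tags, got {type(tags).__name__}")
    return [str(tag) for tag in tags]


cell = "['StreetNumber', 'StreetName', 'Municipality']"

unparsed_length = len(cell)   # counts characters, not tags
parsed_tags = parse_tags(cell)
parsed_length = len(parsed_tags)
print(unparsed_length, parsed_length)
```

A three-token address would match `parsed_length` but not `unparsed_length`, which is the kind of mismatch that passes a check on in-memory lists and still fails inside a loader.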

joseandrejv added the bug Something isn't working label May 10, 2022

davebulaval self-assigned this May 10, 2022

davebulaval closed this as completed May 10, 2022

davebulaval reopened this May 10, 2022

davebulaval mentioned this issue May 12, 2022

Bug-Fix Data Tags Len #127

Merged

davebulaval closed this as completed May 12, 2022
