Tag Len DataError Occurring Regardless of Tag Len Matching Address Len #124

Closed
joseandrejv opened this issue May 10, 2022 · 11 comments
Labels
bug Something isn't working

Comments

@joseandrejv

I'm trying to retrain a BPEmb model with new address tags, and I'm using CSVDatasetContainer to load the data. I've followed all the documented guidelines so that the data is read without errors. The training data has two columns in the required format. None of the addresses are empty or a single whitespace, and I've verified repeatedly that the number of tokens in each address matches the length of its tag list. I did this by tokenizing the original addresses and programmatically comparing their lengths with the lengths of the tag lists from the same row (using a pandas version of the same dataframe). I also dug into the source code and tried the function listed there (_data_tags_is_same_len_then_address); when I run it on the pandas version of my df, the output is True, which is supposed to mean that everything is as it should be. I also tried PickleDatasetContainer instead, using a .p file with the data formatted as requested, and I get the same error.
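
The row-by-row length check described above could look roughly like the sketch below. It is not deepparse code: it assumes whitespace tokenization, assumes the Tags column holds a stringified Python list, and uses a made-up two-row sample with hypothetical tag names in place of the real (confidential) file.

```python
import ast
import csv
import io

# Hypothetical sample standing in for the real training file.
SAMPLE_CSV = '''Address,Tags
"350 rue des Lilas Ouest Quebec","['StreetNumber', 'StreetName', 'StreetName', 'StreetName', 'Orientation', 'Municipality']"
"12 calle Mayor Madrid","['StreetNumber', 'StreetName', 'StreetName']"
'''


def find_length_mismatches(csv_text):
    """Return (row_index, n_tokens, n_tags) for every row whose counts differ."""
    mismatches = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        tokens = row["Address"].split()        # assumption: whitespace tokenization
        tags = ast.literal_eval(row["Tags"])   # assumption: stringified Python list
        if len(tokens) != len(tags):
            mismatches.append((index, len(tokens), len(tags)))
    return mismatches


mismatches = find_length_mismatches(SAMPLE_CSV)
print(mismatches)  # [(1, 4, 3)]
```

Printing the offending row indices, rather than a single True/False, makes it easier to see whether the mismatch comes from the data or from how the tag column is parsed.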

This is how I'm trying to read in the data:
CSVDatasetContainer(training_dataset_name + "." + file_extension, column_names=['Address', 'Tags'], separator=',')

And this is the error I keep getting:
[Screenshot of the DataError traceback; image not preserved]

System Info:

  • OS: Windows 10
  • IDE: VS Code
  • Python Version: 3.9.12
  • Deepparse Version: 0.7.3
  • Poutyne Version: 1.9 (I used this specific version so I could use the progress bar feature, because another part of the code compares the Poutyne version to 1.8 as a float; the latest version is 1.11, which as a decimal number is smaller than 1.8)
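
The version problem mentioned in the last bullet can be reproduced in a few lines. This is a minimal sketch of the general pitfall, not the actual deepparse check; `version_tuple` is a hypothetical helper and only handles purely numeric versions such as "1.11".

```python
def version_tuple(version):
    """Turn '1.11' into (1, 11) so each component compares as an integer."""
    return tuple(int(part) for part in version.split("."))


# Float comparison treats 1.11 as smaller than 1.8.
float_says_newer = float("1.11") >= float("1.8")

# Component-wise comparison ranks 1.11 after 1.8.
tuple_says_newer = version_tuple("1.11") >= version_tuple("1.8")

print(float_says_newer, tuple_says_newer)  # False True
```

Versions with suffixes (for example a development or release-candidate tag) would need a real version parser instead of this helper.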

I'm not 100% sure whether this qualifies as a bug, but it sure is perplexing and I'm not sure where else to ask for help.

I guess this boils down to:

  • Is there anything about my system that could be causing this?
  • Is it the separator I'm using? (Without ',', the function won't read in the data correctly, and it has worked with a smaller training set before.)
  • Is there any other potential factor I haven't considered?

Thanks in advance for your help.

@joseandrejv joseandrejv added the bug Something isn't working label May 10, 2022
@davebulaval davebulaval self-assigned this May 10, 2022
@davebulaval
Collaborator

Is it possible for you to share your dataset with me (in private) to ease the debugging?

@joseandrejv
Author

Hello Dave and thank you for replying so promptly.

I'm going to ask my boss, but chances are that I won't be allowed to, as the information is (even in sample form with limited attributes) confidential. Is there anything else I could do to help? Anything I could check, or something?

@davebulaval
Collaborator

davebulaval commented May 10, 2022

No worries. I have pushed code on a branch to try to debug it.

Install the following version of the project with pip install -U git+https://github.com/GRAAL-Research/deepparse.git@bug_fix_data_tags_len. You can later return to the stable version with pip install -U deepparse.

Then try running this:

dataset = CSVDatasetContainer(training_dataset_name + "." + file_extension, column_names=['Address', 'Tags'], separator=',')

I have pushed a new method that is invoked when the error occurs and prints the cases where the lengths differ. I have not tested it, however. Tell me if it shows proper details.

Then send a screenshot of the output (if it works properly).

Edits:
*I have simplified the code for you.
*Text improvement.

@joseandrejv
Author

joseandrejv commented May 10, 2022

I installed the fix and ran the code. Now there's a unicode error.

"UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte"

I also ran the code after running pip install -U deepparse again and got the same error.

In pandas I would change the encoding with the encoding argument; how would I do this with CSVDatasetContainer?

*Edit: Would I just have to create a different file?

@davebulaval
Collaborator

Hmm, which language are the addresses in? Your CSV may not be in UTF-8. Try recreating it with UTF-8 encoding specified. There may also be special characters that look like whitespace but are not.
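
One way to recreate the file as UTF-8 is sketched below using only the standard library. The source encoding (cp1252, a common default on Windows) is an assumption, and the function names and the sample row are hypothetical; substitute the file's real encoding if it differs.

```python
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, source_encoding="cp1252"):
    """Read src_path using source_encoding and write the same text as UTF-8."""
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            handle.read()
        return True
    except UnicodeDecodeError:
        return False


# Demo with a temporary legacy-encoded file (made-up content).
workdir = tempfile.mkdtemp()
legacy_path = os.path.join(workdir, "legacy.csv")
utf8_path = os.path.join(workdir, "utf8.csv")
with open(legacy_path, "w", encoding="cp1252", newline="") as handle:
    handle.write("Address,Tags\nCalle Pe\u00f1a 5,\"['StreetName', 'StreetName', 'StreetNumber']\"\n")

before = is_valid_utf8(legacy_path)
reencode_to_utf8(legacy_path, utf8_path)
after = is_valid_utf8(utf8_path)
print(before, after)  # False True
```

Checking the file with `is_valid_utf8` before loading it gives a quick answer to whether the encoding is the source of a UnicodeDecodeError.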

@joseandrejv
Author

Wow, that makes sense!

The addresses are in Spanish, but part of the cleaning process involves programmatically replacing all accented Latin characters with plain letters (as well as removing others that aren't part of the language and have no business being in a standardized address), so I didn't think this would be an issue. I'll try recreating the file with UTF-8 encoding specified and let you know what happens as soon as I can.
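
A cleaning step of the kind described here, plus a check for whitespace look-alikes, might be sketched as follows. This is an illustration with hypothetical helper names, not the author's actual cleaning code; it relies on Unicode NFKD normalization, which also maps a no-break space to a plain space.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def odd_whitespace(text):
    """List (position, Unicode name) for whitespace other than a plain space."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch.isspace() and ch != " "
    ]


raw = "Calle Pe\u00f1a\u00a05"  # made-up address with a no-break space before the 5
print(odd_whitespace(raw))      # [(10, 'NO-BREAK SPACE')]

cleaned = strip_accents(raw)
print(cleaned)                  # Calle Pena 5

# Splitting on a literal space sees a different token count before and after cleaning.
print(len(raw.split(" ")), len(cleaned.split(" ")))  # 2 3
```

A character such as the no-break space changes the token count for any tokenizer that splits only on a literal space, which is one way an address and its tag list can drift apart.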

@joseandrejv
Author

OK, I've re-saved the CSV with UTF-8 encoding specified and ran the code once again. I'm back where I started, with the same DataError about the lengths of the tag lists not matching the lengths of the addresses.

@davebulaval
Collaborator

OK, and does the bug-fix version print something useful? (That is, the deepparse version installed with pip install -U git+https://github.com/GRAAL-Research/deepparse.git@bug_fix_data_tags_len.)

@joseandrejv
Author

joseandrejv commented May 11, 2022

No. First there was the Unicode error (which doesn't seem to be an issue now that I changed the file), and now I'm getting the same length error after reinstalling the bug-fix version and trying again.

Would sharing the data help much more? They got back to me and I can send it to you privately.

@davebulaval
Collaborator

davebulaval commented May 12, 2022

It would definitely be easier. david.beauchemin.5 at(@) ulaval.ca

@davebulaval
Collaborator

davebulaval commented May 12, 2022

See #127 for details.
The problem was that the lists of tags were not properly split. It is fixed in release 0.7.4.
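
To illustrate how an unsplit tag list can trigger a length error even when the data is correct, here is a small sketch. It is only one plausible way such a mismatch arises, not the actual change made in #127; the address and tag names are made up, and the tag cell is assumed to be a stringified Python list.

```python
import ast

address = "350 rue des Lilas"
raw_tags = "['StreetNumber', 'StreetName', 'StreetName', 'StreetName']"  # as read from a CSV cell

tokens = address.split()

# Using the cell as-is compares the token count with a character count.
unsplit_matches = len(tokens) == len(raw_tags)

# Parsing the cell into a list first yields one tag per token.
parsed_tags = ast.literal_eval(raw_tags)
split_matches = len(tokens) == len(parsed_tags)

print(unsplit_matches, split_matches)  # False True
```

This also shows why a check run on an already-parsed pandas dataframe can return True while the container, reading the raw CSV cell, still reports a mismatch.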
