Lookup Table File Version UnicodeError #1781
Comments
If I were opening the file on line 126 of the above, I would pass encoding='utf-8' in the open statement.
Thanks for raising this issue, @paulaWesselmann will get back to you about it soon.
@demongolem how does the data in your
My file is now 80,524 lines long, so I won't include the entire file unless necessary. It has one item per line (each followed by an end of line). The last 10 lines of the file, which likely show some of the characters in question, are
Thank you! I've copied those lines into one of my lookup tables and training went fine. Can you tell me which python version you are using?
Python 3.6.7
As I said, list works fine with that data, file does not work fine. I think it is a Windows specific issue.
I'm sorry but I'm afraid I can't help you with that. You might find better help in the Rasa forum https://forum.rasa.com/.
I think it is a bug. If the open statement on line 126 of rasa_nlu\featurizers\regex_featurizer.py were edited to include encoding='utf-8', it would work. The problem as I see it is that on Windows the default encoding is cp1252, which is not the case on other platforms. My data is not cp1252, nor do I want it to be.
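For readers following along, here is a minimal sketch of the kind of change being proposed. It is not the actual code from regex_featurizer.py: the helper names `read_lookup_entries`, `decodes_as`, and `demo` are hypothetical, and the sample entries are made up for illustration. It only shows that passing `encoding="utf-8"` to `open` makes the read independent of the platform default, and that the same bytes can fail to decode as cp1252.

```python
import io
import os
import tempfile


def read_lookup_entries(path):
    """Read one lookup entry per line, decoding the file as UTF-8.

    Hypothetical helper for illustration; not the Rasa NLU implementation.
    """
    # An explicit encoding avoids falling back to the platform default
    # (for example cp1252 on many Windows installations).
    with io.open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def decodes_as(raw, encoding):
    """Return True if the bytes decode cleanly with the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def demo():
    """Write made-up entries as UTF-8, then read them back explicitly."""
    entries = ["café", "Ávila", "Ωmega"]
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with io.open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(entries) + "\n")
        return read_lookup_entries(path)
    finally:
        os.remove(path)
```

In this sketch, "Á" is encoded in UTF-8 as the bytes C3 81, and 0x81 has no mapping in Python's cp1252 codec, which is one way a file that reads fine elsewhere can raise a UnicodeError when opened with the Windows default.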
@demongolem Do you mind submitting a PR for this?
It will be my first time doing a pull request here, but sure, I will give it a go later today. It benefits me because, with so many rows, the .md file becomes too large under the list approach and it is difficult to find what you are looking for.
Awesome, thank you! Let us know if you need help with anything.
…eates a UnicodeError potentially when trying to use files for Lookup Tables (list method still works in Windows). Apply this fix such that the encoding is explicitly utf-8 (as it is in other places in the repository).
Issue #1781. Resolving UnicodeError for Windows and Lookup Tables via file method
Rasa NLU version:
0.14.4
Operating system (windows, osx, ...):
Windows 10
Content of model configuration file:
Issue:
When I create a Lookup Table in the format
training is successful. (This is not my real lookup table; the real one contains 65,000 entries, and some of its characters are non-ASCII, stored as UTF-8.) When I have the same content but use the filename version (the actual content of my .md file is below)
I get
My first version, the list version is created from these same files. But when I call do_train on the second version, the encoding is mixed up. What can I do so that the filename version works?