Lookup Table File Version UnicodeError #1781

Closed
demongolem opened this issue Mar 13, 2019 · 12 comments · Fixed by #1783

Comments

@demongolem
Contributor

Rasa NLU version:
0.14.4

Operating system (windows, osx, ...):
Windows 10

Content of model configuration file:

language: en
pipeline:
- name: "tokenizer_whitespace"
- name: "intent_entity_featurizer_regex"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
  intent_tokenization_flag: true
  intent_split_symbol: "+"

Issue:
When I create a Lookup Table in the format

## lookup:currencies   <!-- lookup table list -->
- Yen
- USD
- Euro

training is successful. (This is not my real lookup table; the real one contains 65,000 entries, and some of its characters are non-ASCII, encoded as UTF-8.) When I keep the same content but use the filename version (the following is the actual content of my .md file)

## lookup:client   <!-- no list to specify lookup table file -->
entities/Client.txt

I get

  File "C:\Users\gwerner004\eclipse-workspace\RasaKickTheTires\Camunda.py", line 135, in
    num_threads=1)
  File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\train.py", line 154, in do_train
    interpreter = trainer.train(training_data, **kwargs)
  File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\model.py", line 196, in train
    **context)
  File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\featurizers\regex_featurizer.py", line 53, in train
    self._add_lookup_table_regexes(training_data.lookup_tables)
  File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\featurizers\regex_featurizer.py", line 76, in _add_lookup_table_regexes
    regex_pattern = self._generate_lookup_regex(table)
  File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\featurizers\regex_featurizer.py", line 126, in _generate_lookup_regex
    for line in f:
  File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 5770: character maps to <undefined>

My first version, the list version, was created from these same files. But when I call do_train on the second (filename) version, the file is decoded with the wrong encoding. What can I do so that the filename version works?
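
For readers who want to reproduce the error message without the full data set, here is a minimal sketch. It assumes the file is UTF-8 and uses "Í" (U+00CD) purely as an illustrative character, since its UTF-8 encoding ends in the byte 0x8d; the actual character at position 5770 of the real file is not known. The helper name try_decode is hypothetical.

```python
# Illustrative character only: "Í" (U+00CD) encodes to the bytes C3 8D in UTF-8.
utf8_bytes = "Í".encode("utf-8")


def try_decode(data, encoding):
    """Return the decoded text, or an 'error: ...' string if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: {}".format(exc)


# The raw bytes that would be stored in a UTF-8 file.
print(utf8_bytes)
# Decoding as UTF-8 round-trips to the original character.
print(try_decode(utf8_bytes, "utf-8") == "Í")
# Decoding as cp1252 fails, because 0x8d has no mapping in that code page.
print(try_decode(utf8_bytes, "cp1252"))
```

The last call fails with the same kind of 'charmap' error as in the traceback, which is consistent with a UTF-8 file being read with the cp1252 codec.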

@demongolem
Contributor Author

If I were opening the file on line 126 of regex_featurizer.py (see the traceback above), I would pass encoding='utf-8' in the open statement.
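
A minimal sketch of that idea follows. It is not the actual rasa_nlu code: the function name read_lookup_elements and the temporary sample file are assumptions made for illustration, and only the built-in open with an explicit encoding argument is taken from the suggestion above.

```python
import os
import tempfile


def read_lookup_elements(path):
    """Read one lookup entry per line, decoding the file as UTF-8."""
    # Passing encoding="utf-8" avoids falling back to the platform default
    # (for example cp1252 on many Windows installations).
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Hypothetical sample file standing in for entities/Client.txt.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "Client.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("zulily Inc.\nÉdifice Marine\n")

elements = read_lookup_elements(sample_path)
print(len(elements))
```

With the encoding stated explicitly, the result should not depend on the operating system's locale settings.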

@akelad
Contributor

akelad commented Mar 14, 2019

Thanks for raising this issue, @paulaWesselmann will get back to you about it soon.

@paulaWesselmann
Contributor

@demongolem how does the data in your entities/Client.txt file look?

@paulaWesselmann paulaWesselmann added the status:more-details-needed Waiting for the user to provide more details / stacktraces / answer a question label Mar 14, 2019
@demongolem
Contributor Author

demongolem commented Mar 14, 2019

My file is now 80,524 lines long, so I won't include the entire file unless necessary. It has one item per line, each followed by an end-of-line. The last 10 lines of the file (which likely show some of the characters in question) are:

xDUPLICATE - Van Sickle Consulting
xINACTIVE CLIENT - Classey Closets
xINACTIVE CLIENT - Taxolog Inc.
xINACTIVE CLIENT-Schuh William
zulily Inc.
zynerba Pharmaceuticals
École De Technologie Gazière
Édifice Marine
Ópticas Lux, S.A. de C.V.
Übelhör Organic-Germany
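
A side note on these ten lines: assuming they are stored as UTF-8, none of them happens to contain a byte that cp1252 rejects, so on their own they would be silently mis-decoded rather than raise an error. The sketch below checks this; the helper undecodable_lines is hypothetical, and "Í" is only an assumed example of a character that would fail.

```python
# The four non-ASCII sample lines quoted above.
sample_lines = [
    "École De Technologie Gazière",
    "Édifice Marine",
    "Ópticas Lux, S.A. de C.V.",
    "Übelhör Organic-Germany",
]


def undecodable_lines(lines, encoding="cp1252"):
    """Return the lines whose UTF-8 bytes cannot be decoded with `encoding`."""
    failing = []
    for line in lines:
        try:
            line.encode("utf-8").decode(encoding)
        except UnicodeDecodeError:
            failing.append(line)
    return failing


# Number of sample lines that would raise under cp1252.
print(len(undecodable_lines(sample_lines)))
# An assumed counter-example: "Í" encodes to C3 8D, and 0x8d is undefined in cp1252.
print(len(undecodable_lines(["Í"])))
```

If this holds for the real file too, the byte at position 5770 belongs to an entry elsewhere in the list, and running a check like this over the whole file would locate it.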

@no-response no-response bot removed the status:more-details-needed Waiting for the user to provide more details / stacktraces / answer a question label Mar 14, 2019
@paulaWesselmann
Contributor

Thank you! I've copied those lines into one of my lookup tables and training went fine. Can you tell me which python version you are using?

@paulaWesselmann paulaWesselmann added the status:more-details-needed Waiting for the user to provide more details / stacktraces / answer a question label Mar 14, 2019
@demongolem
Contributor Author

Python 3.6.7

@no-response no-response bot removed the status:more-details-needed Waiting for the user to provide more details / stacktraces / answer a question label Mar 14, 2019
@demongolem
Contributor Author

As I said, the list version works fine with that data; the file version does not. I think it is a Windows-specific issue.

@paulaWesselmann
Contributor

I'm sorry but I'm afraid I can't help you with that. You might find better help in the Rasa forum https://forum.rasa.com/.
I'll close this issue for now since this is a place for bugs in the code specifically and not usage issues.
Let us know if you believe there is a bug in the code after further investigation. I hope you can fix this soon!

@demongolem
Contributor Author

demongolem commented Mar 14, 2019

I think it is a bug. If the open statement on line 126 of rasa_nlu\featurizers\regex_featurizer.py were edited to include encoding='utf-8', it would work. The problem as I see it is that on Windows the default encoding is cp1252, which would not be true on other platforms. My data is not cp1252, nor do I care for it to be.
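
To check what a given machine uses by default, something like the following sketch may help. In the Python 3.6 series, open() without an encoding argument falls back to whatever locale.getpreferredencoding(False) reports; the function name default_text_encoding is an assumption for illustration.

```python
import codecs
import locale


def default_text_encoding():
    """Return the normalized codec name of the locale's preferred encoding."""
    name = locale.getpreferredencoding(False)
    # codecs.lookup normalizes aliases (e.g. "UTF-8" and "utf8" both map to "utf-8").
    return codecs.lookup(name).name


print(default_text_encoding())
```

On a Windows machine with a Western European locale this typically reports cp1252, while most Linux and macOS setups report utf-8, which would explain why the same file trains fine elsewhere.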

@paulaWesselmann
Contributor

@demongolem Do you mind submitting a PR for this?

@demongolem
Contributor Author

This will be my first pull request here, but sure, I will give it a go later today. It benefits me because, with so many rows, the list approach makes the .md file too large and it becomes difficult to find what you are looking for.

@paulaWesselmann
Contributor

Awesome, thank you! Let us know if you need help with anything.

demongolem pushed a commit to demongolem/rasa_nlu that referenced this issue Mar 14, 2019
…eates a UnicodeError potentially when trying to

use files for Lookup Tables (list method still works in Windows).  Apply this fix such that the encoding is
explicitly utf-8 (as it is in other places in the repository).
demongolem added a commit to demongolem/rasa_nlu that referenced this issue Mar 15, 2019
…eates a UnicodeError potentially when trying to

use files for Lookup Tables (list method still works in Windows).  Apply this fix such that the encoding is
explicitly utf-8 (as it is in other places in the repository).
paulaWesselmann added a commit that referenced this issue Mar 18, 2019
Issue #1781.  Resolving UnicodeError for Windows and Lookup Tables via file method
taytzehao pushed a commit to taytzehao/rasa that referenced this issue Jul 14, 2023