Lookup Table File Version UnicodeError #1781

demongolem · 2019-03-13T20:57:17Z

Rasa NLU version:
0.14.4

Operating system (windows, osx, ...):
Windows 10

Content of model configuration file:

language: en
pipeline:
- name: "tokenizer_whitespace"
- name: "intent_entity_featurizer_regex"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
  intent_tokenization_flag: true
  intent_split_symbol: "+"

Issue:
When I create a Lookup Table in the format

## lookup:currencies   <!-- lookup table list -->
- Yen
- USD
- Euro

training is successful. (This is not my real lookup table; the real one contains 65,000 entries, some of which include non-ASCII characters encoded as UTF-8.) When I have the same content but use the filename version (this is the actual content of my .md file below)

## lookup:client   <!-- no list to specify lookup table file -->
entities/Client.txt

I get

File "C:\Users\gwerner004\eclipse-workspace\RasaKickTheTires\Camunda.py", line 135, in <module>
num_threads=1)
File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\train.py", line 154, in do_train
interpreter = trainer.train(training_data, **kwargs)
File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\model.py", line 196, in train
**context)
File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\featurizers\regex_featurizer.py", line 53, in train
self._add_lookup_table_regexes(training_data.lookup_tables)
File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\featurizers\regex_featurizer.py", line 76, in _add_lookup_table_regexes
regex_pattern = self._generate_lookup_regex(table)
File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\site-packages\rasa_nlu\featurizers\regex_featurizer.py", line 126, in _generate_lookup_regex
for line in f:
File "C:\Users\gwerner004\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 5770: character maps to <undefined>

My first version (the list version) is created from these same files, but when I call do_train on the second version, the file is decoded with the wrong encoding. What can I do so that the filename version works?

demongolem · 2019-03-14T01:11:35Z

If I were opening the file at line 126 of regex_featurizer.py (see the traceback above), I would pass encoding='utf-8' to the open call.
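
A minimal sketch of that suggestion, assuming the lookup file holds one entry per line. `read_lookup_elements` is a hypothetical helper written for illustration; it is not the code in rasa_nlu's regex_featurizer.py.

```python
def read_lookup_elements(path):
    """Return the non-empty lines of a lookup table file, decoded as UTF-8."""
    # Passing encoding explicitly means the result no longer depends on the
    # platform's default text encoding.
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

With the encoding fixed in the call, the same file should read identically on Windows, Linux, and macOS, provided it really is UTF-8.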

akelad · 2019-03-14T09:01:42Z

Thanks for raising this issue, @paulaWesselmann will get back to you about it soon.

paulaWesselmann · 2019-03-14T12:31:56Z

@demongolem how does the data in your entities/Client.txt file look?

demongolem · 2019-03-14T12:35:32Z

My file is now 80,524 lines long, so I won't include the entire file unless necessary. It has one item per line, each followed by an end of line. The last 10 lines of the file (which likely show some of the characters in question) are

xDUPLICATE - Van Sickle Consulting
xINACTIVE CLIENT - Classey Closets
xINACTIVE CLIENT - Taxolog Inc.
xINACTIVE CLIENT-Schuh William
zulily Inc.
zynerba Pharmaceuticals
École De Technologie Gazière
Édifice Marine
Ópticas Lux, S.A. de C.V.
Übelhör Organic-Germany

paulaWesselmann · 2019-03-14T13:26:04Z

Thank you! I've copied those lines into one of my lookup tables and training went fine. Can you tell me which python version you are using?
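
The ten sample lines may not be enough to reproduce the error, even on a machine that defaults to cp1252. Their accented characters (É, è, Ó, Ü, ö) encode in UTF-8 to byte pairs that cp1252 can still decode, which produces mojibake rather than an exception. A byte such as 0x8d, which Python's cp1252 codec leaves undefined, comes from other characters, for example Í (UTF-8 bytes 0xC3 0x8D). The sketch below is one way to find the offending lines in a large file. `undecodable_lines` is a hypothetical helper and not part of rasa_nlu.

```python
def undecodable_lines(path, encoding="cp1252"):
    """Return 1-based numbers of lines whose raw bytes fail to decode."""
    bad = []
    # Read in binary mode so that no implicit decoding happens while iterating.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append(lineno)
    return bad
```

Calling `undecodable_lines("entities/Client.txt")` would list the lines to inspect; an empty result for `encoding="utf-8"` would suggest the file itself is valid UTF-8.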

demongolem · 2019-03-14T13:47:44Z

Python 3.6.7

demongolem · 2019-03-14T13:48:35Z

As I said, the list version works fine with that data; the file version does not. I think it is a Windows-specific issue.

paulaWesselmann · 2019-03-14T13:55:08Z

I'm sorry but I'm afraid I can't help you with that. You might find better help in the Rasa forum https://forum.rasa.com/.
I'll close this issue for now since this is a place for bugs in the code specifically and not usage issues.
Let us know if you believe there is a bug in the code after further investigation. I hope you can fix this soon!

demongolem · 2019-03-14T14:15:33Z

I think it is a bug. If the open statement at line 126 of rasa_nlu\featurizers\regex_featurizer.py were edited to include encoding='utf-8', it would work. The problem as I see it is that on Windows the default encoding is cp1252, which is not the case on other platforms. My data is not cp1252, nor do I care for it to be.
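
To illustrate the point about platform defaults: when `open()` is called in text mode without an `encoding` argument, Python 3.6 falls back to `locale.getpreferredencoding(False)`, which is typically cp1252 on a Western-locale Windows install and UTF-8 on most Linux and macOS setups. The sketch below simulates both defaults by passing the encoding explicitly, so it behaves the same on any platform. The entry `Índice Uno` is an invented example, chosen because Í encodes to the bytes 0xC3 0x8D, and `reads_cleanly` and `make_sample_file` are hypothetical helpers.

```python
import locale
import os
import tempfile


def reads_cleanly(path, encoding):
    """Return True if the whole file decodes under the given encoding."""
    try:
        with open(path, "r", encoding=encoding) as f:
            f.read()
        return True
    except UnicodeDecodeError:
        return False


def make_sample_file():
    """Write a small UTF-8 lookup file and return its path (caller deletes it)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        # "Índice Uno" is an invented entry; Í is 0xC3 0x8D in UTF-8.
        f.write("Édifice Marine\nÍndice Uno\n".encode("utf-8"))
    return path


if __name__ == "__main__":
    path = make_sample_file()
    try:
        print("platform default:", locale.getpreferredencoding(False))
        print("cp1252:", reads_cleanly(path, "cp1252"))
        print("utf-8:", reads_cleanly(path, "utf-8"))
    finally:
        os.remove(path)
```

Under these assumptions the cp1252 read fails and the UTF-8 read succeeds, which matches the behaviour reported in this thread.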

paulaWesselmann · 2019-03-14T14:21:06Z

@demongolem Do you mind submitting a PR for this?

demongolem · 2019-03-14T14:44:26Z

It will be my first time doing a pull request here, but sure, I will give it a go later today. It is of benefit to me because, with so many rows, the list approach makes the .md file too large and it becomes difficult to find what you are looking for.

paulaWesselmann · 2019-03-14T14:47:08Z

Awesome, thank you! Let us know if you need help with anything.

…eates a UnicodeError potentially when trying to use files for Lookup Tables (list method still works in Windows). Apply this fix such that the encoding is explicitly utf-8 (as it is in other places in the repository).

Issue #1781. Resolving UnicodeError for Windows and Lookup Tables via file method

paulaWesselmann added the status:more-details-needed label Mar 14, 2019

no-response bot removed the status:more-details-needed label Mar 14, 2019

paulaWesselmann added the status:more-details-needed label Mar 14, 2019

no-response bot removed the status:more-details-needed label Mar 14, 2019

paulaWesselmann closed this as completed Mar 14, 2019

paulaWesselmann reopened this Mar 14, 2019

demongolem mentioned this issue Mar 15, 2019

Issue #1781. Resolving UnicodeError for Windows and Lookup Tables via file method #1783

Merged

4 tasks

paulaWesselmann closed this as completed in #1783 Mar 18, 2019

paulaWesselmann added a commit that referenced this issue Mar 18, 2019

Merge pull request #1783 from demongolem/master

560e46f

Issue #1781. Resolving UnicodeError for Windows and Lookup Tables via file method
