[bug] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position (...): invalid continuation byte #14

UtopianElectronics · 2022-08-14T14:21:12Z

After applying this patch and running python main.py --dry_run --photos_dir="D:\gallery" --digikam_db="D:\digiKam_library\digikam4.db" --contacts="%LocalAppData%\Google\Picasa2\contacts\contacts.xml" -vv, I get this error message (excerpt from the whole output):

INFO: ===========================================================================================
INFO: Now migrating D:\gallery\2019
Traceback (most recent call last):
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 138, in migrate_directory
    ini.read(ini_file, encoding='utf8')
  File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\configparser.py", line 712, in read
    self._read(fp, filename)
  File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\configparser.py", line 1035, in _read
    for lineno, line in enumerate(fp, start=1):
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 4451: invalid continuation byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\main.py", line 65, in <module>
    main()
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\main.py", line 55, in main
    migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 107, in migrate_directories_under
    contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 140, in migrate_directory
    raise RuntimeError(f'Failed to read ini file "{ini_file}".') from err
RuntimeError: Failed to read ini file "D:\gallery\2019\.picasa.ini".

I have no idea what <frozen codecs> means, and why it's mentioned as a file.
Looks like the RuntimeError: Failed to read ini file "D:\gallery\2019\.picasa.ini". error is related to the patch.

Philipp91 · 2022-08-14T14:25:36Z

If you open up the file D:\gallery\2019\.picasa.ini, what data is there around position 4451? (An editor like Notepad++ allows you to jump to a certain byte position, but you can also post the entire file contents here if it's not sensitive and not super long.)

I'm not sure why the tool so far assumes that the encoding is UTF-8. Sadly none of my own files (i.e. none of my contacts) contain any non-ASCII characters, so I can't distinguish UTF-8 from ISO encodings, for instance. If you find special characters in one of your files, it would be interesting to know what encoding those were using. E.g. in Notepad++, you can change the encoding with which the file is loaded, until the characters are rendered correctly.

UtopianElectronics · 2022-08-14T14:49:29Z

what data is there around position 4451?

It's the first semicolon at the end of a contact's name (in the [Contacts2] section), same as the lines before and after. The characters in that line have been repeated in the file multiple times earlier. However, some Arabic/Persian characters increment the position number by 2. Double-checked with HxD, and it also shows it to be the ; character.

I'm not sure why the tool so far assumes that the encoding is UTF-8.

Notepad++ opens the .picasa.ini file with UTF-8 encoding by default.

Philipp91 · 2022-08-14T15:04:58Z

Notepad++ opens the .picasa.ini file with UTF-8 encoding by default.

And the Arabic characters are displayed correctly? Then it should indeed be utf-8.

Double-checked with HxD, and it also shows it to be the ; character.

Then the position (4451) is somehow off. Because if it were a plain ASCII character, then ; would be 0x3b, but the error message complains about a 0xd8 value. And because it says "invalid continuation byte", it might actually be confused by the 1 or 2 bytes before (because some bytes in UTF-8 are a whole character, whereas others need to be continued in the next byte, up to 4 in total I believe).
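
To illustrate the point about continuation bytes, here is a minimal sketch (the byte strings are made-up examples, not taken from the reporter's file). It shows that 0xd8 followed by a plain ASCII ; produces exactly this error, and that the reported position is that of the 0xd8 byte rather than the semicolon:

```python
# 0xd8 is 0b11011000: the 110xxxxx pattern marks the first byte of a
# two-byte UTF-8 sequence, so the next byte must look like 10xxxxxx.
valid = b"\xd8\xaf"   # U+062F (Arabic letter dal), a complete sequence
broken = b"\xd8;"     # lead byte followed by ASCII ';' (0x3b)

print(valid.decode("utf-8") == "\u062f")

decode_error = None
try:
    broken.decode("utf-8")
except UnicodeDecodeError as err:
    decode_error = err
    # err.start is a byte offset and points at the lead byte, not at ';'
    print(err.start, hex(broken[err.start]), err.reason)
```

So a 0xd8 directly in front of an ASCII character is one plausible way to end up with this message.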

Philipp91 · 2022-08-14T15:07:39Z

By any chance, does it work if you replace utf-8 with ISO-8859-1 in the code?

UtopianElectronics · 2022-08-14T15:20:46Z

By any chance, does it work if you replace utf-8 with ISO-8859-1 in the code?

Yes!

And the Arabic characters are displayed correctly?

Yes.

Then the position (4451) is somehow off.

HxD shows the binary (8-bit) value of position 4451 as 00111011, and 10001100 for position 4450.

Philipp91 · 2022-08-14T15:27:40Z

0xd8==11011000

Do you see that anywhere around there?

I wonder if we should just commit this change to ISO-8859-1 for everyone. Does the file (incl. the Arabic characters at the beginning) look correct when you load it with that encoding in Notepad++? Or does something else look odd then?

UtopianElectronics · 2022-08-14T15:52:27Z

By any chance, does it work if you replace utf-8 with ISO-8859-1 in the code?

--dry_run gets executed, but it makes the UTF-8 characters (Persian text) in some people's face tags during INFO: Creating digiKam person tag (...) unreadable, but strangely some other tags also containing Farsi text are fine.

When reading contacts from C:\Users\USERNAME\AppData\Local\Google\Picasa2\contacts\contacts.xml, the Persian text is always readable.

Also, after running the same command without --dry_run , I noticed this:

INFO: Creating database backup at %s

Where's %s?

And it gets terminated by this error, after a DEBUG: self_contact_to_tag={(...)}:

Traceback (most recent call last):
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\main.py", line 65, in <module>
    main()
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\main.py", line 55, in main
    migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 107, in migrate_directories_under
    contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\Downloads\picasa2digikam-main\migrator.py", line 154, in migrate_directory
    assert contact_to_tag[contact_id] == tag_id
AssertionError

DougRogers · 2022-08-14T16:01:55Z

Can the .picasa.ini file be posted here?

UtopianElectronics · 2022-08-14T16:11:51Z

Do you see that anywhere around there?

The nearest one at position 4445.

Does the file (incl. the Arabic characters at the beginning) look correct when you load it with that encoding in Notepad++?

No. It only looks correct with UTF-8.

DougRogers · 2022-08-14T16:14:04Z

Load it into Notepad (not Notepad++) and select Save As. What encoding is listed?

UtopianElectronics · 2022-08-14T16:17:50Z

Can the .picasa.ini file be posted here?

Unfortunately no, unless I put some dummy text in there which would make it useless to post.

Load it into Notepad (not Notepad++) and select Save As. What encoding is listed?

If you mean to load the .picasa.ini file, UTF-8 is selected by default, but other options are ANSI, UTF-16 LE, UTF-16 BE, and UTF-8 with BOM.

DougRogers · 2022-08-14T16:25:16Z

@UtopianElectronics
Can you create a sharable .picasa.ini file that has the same issues?

Yes, I was referring to the .picasa.ini file.
Notepad lists the encoding of the current file when saving, so the file is UTF-8.

DougRogers · 2022-08-14T16:31:03Z

When you open the .picasa.ini file in Notepad++, what is listed in the "Encoding" menu?

Philipp91 · 2022-08-14T16:39:18Z

Also, after running the same command without --dry_run , I noticed this: INFO: Creating database backup at %s

That was already fixed: c941174

Philipp91 · 2022-08-14T16:44:57Z

The fact that loading with ISO-8859-1 in Python works but then some other characters are messed up can only mean one of two things, I believe: Either the file legitimately contains multiple different encodings, which would be quite the hassle to deal with, or it's meant to be UTF-8 but somehow a few invalid characters ended up in there. I think we should find out what happens around that 0xd8 byte. Does that byte make sense in ISO-8859-1 encoding, or is it a garbage byte no matter how one would interpret it?
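
One thing worth keeping in mind here: ISO-8859-1 maps every possible byte value to some character, so decoding with it cannot fail, whatever the file really contains. The sketch below (sample text is made up) shows how UTF-8 encoded Persian letters decode "successfully" but come out garbled:

```python
original = "\u062f\u06cc"            # two Persian/Arabic letters
raw = original.encode("utf-8")       # 4 bytes: d8 af db 8c

# ISO-8859-1 turns each byte into one character, so this never raises,
# but two letters become four unrelated characters.
as_latin1 = raw.decode("iso-8859-1")
print(len(original), len(as_latin1))

# Any byte sequence at all decodes without error under ISO-8859-1.
every_byte = bytes(range(256)).decode("iso-8859-1")
print(len(every_byte))

# The damage is reversible as long as the text is not modified.
restored = as_latin1.encode("iso-8859-1").decode("utf-8")
print(restored == original)
```

This is only an illustration of the decoding behaviour. It does not explain why some tags in the actual run still looked fine.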

The nearest one at position 4445.

That's pretty close actually. The discrepancy could be caused by one system counting bytes and the other counting characters. If no other 0xd8 byte is in the vicinity, it's safe to assume it's that one. So what's the context there, i.e. what do the surrounding bytes mean in ASCII? Is it important information and can we deduce something about the meaning of the 0xd8 byte from that?

DougRogers · 2022-08-14T16:54:31Z

I am new to encoding, but it looks like this is not a straightforward issue. It looks like detecting the actual encoding is non-trivial. This file is probably not UTF-8, but is being reported as such.

UtopianElectronics · 2022-08-14T18:13:59Z

When you open the .picasa.ini file in Notepad++, what is listed in the "Encoding" menu?

UTF-8.

the file legitimately contains multiple different encodings

Not sure about that, but I don't think it's the case.

Does that byte make sense in ISO-8859-1 encoding

When I select ISO-8859-1 in Notepad++ (Encoding > Character sets > Western European > ISO 8859-1), position 4451 changes place and moves to the start of a 16-character string at the beginning of another line.

UtopianElectronics · 2022-08-14T18:27:13Z

I deleted the line in .picasa.ini that had the faulty byte at position 4451, plus the two lines before and after it, but it still says UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 4451: invalid continuation byte. Is it really about .picasa.ini, or is it referring to position 4451 somewhere else?

UtopianElectronics · 2022-08-14T18:31:58Z

Is this doable here in this code? How do I test it?

Philipp91 · 2022-08-14T18:51:20Z

Just to double-check, the file in question should be "D:\gallery\2019\.picasa.ini".

Is this doable here in this code?

Well, maybe.

picasa2digikam uses the configparser library. You can open a python shell and hopefully reproduce that same error with these few lines:

import configparser
ini = configparser.ConfigParser(strict=False)
ini.read(''D:\\gallery\\2019\\.picasa.ini", encoding='utf8')

To plug in the codecs package with that errors='ignore' workaround, try this:

import configparser
import codecs
ini = configparser.ConfigParser(strict=False)
with codecs.open("D:\\gallery\\2019\\.picasa.ini", 'r', encoding='utf-8', errors='ignore') as fdata:
    ini.read_file(fdata)

Philipp91 · 2022-08-14T18:54:22Z

This service promises client-side (i.e. privacy-preserving) UTF-8 validation: https://onlineutf8tools.com/validate-utf8

Philipp91 · 2022-08-14T19:05:32Z

position 4451 changes place

Then Notepad++ is clearly counting characters, not bytes. Whereas the error message from Python is most likely based on counting bytes. That explains the discrepancy.
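
A small sketch of that discrepancy, using a made-up line (the [Contacts2] entry below is hypothetical, not the real file format): every two-byte character before a given spot shifts the byte offset by one relative to the character offset.

```python
# Hypothetical ini fragment with two non-ASCII letters before the ';'
text = "[Contacts2]\nabc=\u062f\u06cc;;\n"
data = text.encode("utf-8")

char_index = text.index(";")    # what a character-counting editor reports
byte_index = data.index(b";")   # what Python's UnicodeDecodeError reports

print(char_index, byte_index, byte_index - char_index)
```

Here the two offsets differ by 2 because two two-byte letters precede the semicolon. In a large file with many such letters the gap grows accordingly.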

You can try snipping a section around the byte in question like this:

with open("D:\\gallery\\2019\\.picasa.ini", "rb") as f:
    d = f.read()
print(d[4400:4500])  # Print 100 bytes around the problem byte. If this turns out non-sensitive, you can post it here.
assert d[4451] == 0xd8  # Make sure we understood the offset right

Or decode it like this, which would presumably fail with an error similar to the one raised when the data is decoded during file loading:

d.decode('utf-8')
d[4400:4500].decode('utf-8')
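
If guessing the offset gets tedious, something along these lines could list every byte offset at which UTF-8 decoding fails. This is only a sketch: invalid_utf8_offsets is a hypothetical helper name, and the sample bytes are made up.

```python
def invalid_utf8_offsets(data: bytes) -> list[int]:
    """Return the byte offset of every spot where UTF-8 decoding fails."""
    offsets = []
    start = 0
    while start < len(data):
        try:
            data[start:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice we decoded
            offsets.append(start + err.start)
            start += err.end
    return offsets


# Made-up sample: one valid two-byte letter, then a stray 0xd8 before ';'
sample = b"a=\xd8\xaf;\nb=\xd8;\n"
bad = invalid_utf8_offsets(sample)
print(bad)
for offset in bad:
    # Show a few raw bytes around each problem spot
    print(offset, sample[max(0, offset - 4):offset + 4])
```

For the real file, data would come from open(path, "rb").read() as in the snippet above.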

UtopianElectronics · 2022-08-14T21:34:55Z

ini.read(''D:\\gallery\\2019\\.picasa.ini", encoding='utf8')

It just outputs ['D:\\gallery\\2019\\.picasa.ini'] and no errors. Not sure what it means.
Also, '' is in fact two single quotation marks, which gives a syntax error. It should be a double quotation mark.

To plug in the codecs package with that errors='ignore' workaround, try this

Tried it and it shows nothing.

This service promises client-side (i.e. privacy-preserving) UTF-8 validation

It says it's valid.

as f

Shouldn't it be as fh? Because it raises an error: "NameError: name 'fh' is not defined. Did you mean: 'f'?" I tried running it with as fh and it gives some characters and an AssertionError at the end.

d[4400:4500].decode('utf-8')

It decodes everything smoothly without any problem, and it showed the same characters as Notepad++.
I used it like this:

with open("D:\\gallery\\2019\\.picasa.ini", "rb") as fh:
    d = fh.read()
print(d[4400:4500].decode('utf-8'))

There are multiple subdirectories (folders) inside 2019. Could it be causing any problem?

A mystery to me is that if I edit D:\gallery\2019\.picasa.ini and delete or change characters or lines at 4451 and re-run the program, it still gives the same error about byte 0xd8 in position 4451.

Philipp91 · 2022-08-14T22:13:31Z

Also, '' is in fact two single quotation marks, which gives a syntax error. It should be a double quotation mark.

Shouldn't it be as fh?

Yeah, those were just some typos on my part, sorry.

It just outputs ['D:\\gallery\\2019\\.picasa.ini'] and no errors. Not sure what it means.

Tried it and it shows nothing.

After this, the file has been read, so apparently it did succeed in loading the file. You can then view it by querying the ini object, e.g. by running list(ini.items()) or list(ini['Contacts2'].items()) or so, and see if the contents were correctly loaded.
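
For reference, a self-contained sketch of what querying the ini object looks like. The sample content is invented (the real Contacts2 line format may differ), and read_string is used here only so the example runs without a file. Note that in the interactive shell a bare expression is echoed, whereas in a script file it needs print() to show anything.

```python
import configparser

# Invented sample content, loosely shaped like a .picasa.ini file
sample = (
    "[Contacts2]\n"
    "0123456789abcdef=Jane Doe;;\n"
    "\n"
    "[IMG_0001.jpg]\n"
    "star=yes\n"
)

ini = configparser.ConfigParser(strict=False)
ini.read_string(sample)

print(ini.sections())                   # section names, in file order
print(list(ini["Contacts2"].items()))   # (key, value) pairs of one section
```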

It's plausible that the attempt with the codecs package and errors='ignore' went through, but I'm surprised that apparently the attempt with just ini.read(''D:\\gallery\\2019\\.picasa.ini", encoding='utf8') threw no errors either. That's pretty much what picasa2digikam also runs (or so I believed) when it runs into this 4451 error. Can you check (as detailed just above) that this loading actually worked, i.e. data got loaded properly?

A mystery to me is that if I edit D:\gallery\2019\.picasa.ini and delete or change characters or lines at 4451 and re-run the program, it still gives the same error about byte 0xd8 in position 4451.

Yeah, something is fishy here. Perhaps picasa2digikam doesn't load the ini file as intended. How about:

import configparser
import pathlib
ini = configparser.ConfigParser(strict=False)
ini.read(pathlib.Path("D:\\gallery\\2019\\.picasa.ini"), encoding='utf8')
print(list(ini.items()))

This should really be 100% what picasa2digikam calls when that 4451 error happens.

Perhaps that file has some restrictions on it that make it impossible for picasa2digikam to access (it is a hidden file after all) and then it instead receives some error message that has a non-ASCII character at 4451? You could try patching the following into migrator.py above the ini.read(... line:

with open(ini_file, "rb") as f:
    print(f'Here comes {ini_file}:')
    print(f.read())

I'd expect this to print the whole file's contents onto the console, but perhaps we get something else (like the supposed error message) instead.

UtopianElectronics · 2022-08-15T07:58:29Z

You can then view it by querying the ini object, e.g. by running list(ini.items()) or list(ini['Contacts2'].items()) or so, and see if the contents were correctly loaded.

Can you check (as detailed just above) that this loading actually worked, i.e. data got loaded properly?

Again, it outputs nothing, or maybe I'm doing it wrong? What should be in the code before them?

This should really be 100% what picasa2digikam calls when that 4451 error happens.

It works fine and without any error, also no unreadable characters in the output.

You could try patching the following into migrator.py above the ini.read(... line:

There are two ini.read(... instances, which one do you mean? Also, what should the indentation be exactly? I'm getting some indentation errors and tried fixing them, but I don't know if I changed the meaning of the code. Regardless of that, I still get that old error message.

Philipp91 · 2022-08-15T18:12:47Z

I meant like this: #15

UtopianElectronics · 2022-08-16T17:32:09Z

So I ran gh pr checkout 15 and then python main.py --dry_run --photos_dir="D:\gallery" --digikam_db="D:\digiKam_library\digikam4.db" --contacts="%LocalAppData%\Google\Picasa2\contacts\contacts.xml" -vv. Here's the output:

INFO: Now migrating D:\gallery\2019
Here comes D:\gallery\2019\.picasa.ini in binary:

It shows the non-Latin UTF-8 characters like \xd8\xaf\xd9\x8a etc., which I guess it's because of being binary?
Then:

That was D:\gallery\2019\.picasa.ini.
Here comes D:\gallery\2019\.picasa.ini in UTF-8:
Traceback (most recent call last):
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 145, in migrate_directory
    print(f.read())
          ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 602467: invalid continuation byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 65, in <module>
    main()
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 55, in main
    migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 107, in migrate_directories_under
    contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 150, in migrate_directory
    raise RuntimeError(f'Failed to read ini file "{ini_file}".') from err
RuntimeError: Failed to read ini file "D:\gallery\2019\.picasa.ini".

Philipp91 · 2022-08-16T21:55:19Z

Huh, what's up with that position suddenly jumping to 602467? Wasn't it 4451 before? Is the file even that long (0.6MB)?

It shows the non-Latin UTF-8 characters like \xd8\xaf\xd9\x8a etc., which I guess it's because of being binary?

Yes, that's okay, as long as the other characters (I assume most of the ini file is regular ASCII stuff) are output normally.
How does the end look, i.e. shortly before the That was D:\gallery\2019\.picasa.ini. bit? Does it actually output precisely the end of your ini file too?

I've updated the patch. I guess you can get it with git pull or so, perhaps with -f. Now it also prints the length of the string and it decodes it after reading it as binary, let's see if that also fails or succeeds.

UtopianElectronics · 2022-08-17T08:40:38Z

602467

Checked it with Notepad++. It was a part of a file name, and it was actually shown as one of those strange symbols that Notepad++ shows if you open for example an image file. I deleted that single character and the code now seems to work fine.

Strangely enough, I saved the file to another location and when I opened it, that strange symbol was changed to a readable character. So it was probably an encoding bug or something by Picasa because the actual file doesn't have that extra character in its name (I might have renamed it outside Picasa).

Wasn't it 4451 before?

I think 4451 was for another .picasa.ini file, the one at D:\gallery.
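
Since more than one .picasa.ini turned out to be involved, a sketch like the following could check every one under the photos directory in a single pass and report which file fails at which byte offset. The function names are hypothetical and this is not part of picasa2digikam.

```python
import os


def first_invalid_utf8_offset(path):
    """Return the byte offset of the first UTF-8 error in a file, or None."""
    with open(path, "rb") as fh:
        data = fh.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def scan_picasa_inis(root):
    """Map each .picasa.ini under root that is not valid UTF-8 to its first bad offset."""
    problems = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if ".picasa.ini" in filenames:
            path = os.path.join(dirpath, ".picasa.ini")
            offset = first_invalid_utf8_offset(path)
            if offset is not None:
                problems[path] = offset
    return problems


if __name__ == "__main__":
    # The path is the one used in this thread; adjust as needed.
    for path, offset in scan_picasa_inis(r"D:\gallery").items():
        print(f"{path}: first invalid byte at offset {offset}")
```

Files are opened in binary mode, so the scan itself cannot hit a decoding error halfway through.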

Is the file even that long (0.6MB)?

Yes, it's 595 KB.

Does it actually output precisely the end of your ini file too?

Yes.

I've updated the patch.

As the previous one seems to be working, I'm now going to check further if it's really working. I'll share the findings here.

UtopianElectronics · 2022-08-17T09:03:13Z

When I want to export the output of the command to a text file using both > log.txt and | echo > log.txt, the program gets terminated with errors. The log file ends with Here comes D:\gallery\.picasa.ini in UTF-8: and the non-ASCII UTF-8 characters in the log file are again in the format of "UTF-8 (in literal)" as shown here too.

Here's one of the many errors in the output (not in the log file):

--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\logging\__init__.py", line 1113, in emit
    stream.write(msg + self.terminator)
  File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 63-70: character maps to <undefined>
Call stack:
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 65, in <module>
    main()
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 55, in main
    migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 74, in migrate_directories_under
    logging.debug(f'{contact.attrib}')
Message: "{'id': 'e5ce9e6c386f84fa', 'name': '[REDACTED]', 'modified_time': '2022-01-18T13:03:09+03:30', 'local_contact': '1'}"
Arguments: ()
Traceback (most recent call last):
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 145, in migrate_directory
    print(f.read())
  File "C:\Users\USERNAME\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 65, in <module>
    main()
  File "C:\Users\USERNAME\picasa2digikam\main.py", line 55, in main
    migrator.migrate_directories_under(input_root_dir=args.photos_dir, db=db,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 107, in migrate_directories_under
    contact_tags_per_dir[dir] = migrate_directory(dir, files, db,
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USERNAME\picasa2digikam\migrator.py", line 150, in migrate_directory
    raise RuntimeError(f'Failed to read ini file "{ini_file}".') from err
RuntimeError: Failed to read ini file "D:\gallery\.picasa.ini".

Philipp91 · 2022-08-17T17:13:30Z

Ah great, do I understand correctly that you found a workaround for the problem by removing the weird character from the file?

When I want to export the output of the command to a text file

And I assume you want to do that not to debug the encoding issue any further, but rather just because you want the whole output somewhere? E.g. to see if the dry-run was successful?

cp1252.py

Looks like it's trying to log with non-UTF-8 too. Hopefully this is the fix. It's on the main branch and I've also rebased the other patch, so if you wanted to keep that one, you could check it out anew.
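
For context, a minimal sketch of the failure mode and one possible way around it. This is not necessarily what the linked fix does. The logger name and sample text are made up, and an in-memory buffer stands in for the redirected output.

```python
import io
import logging

text = "\u06cc\u200c\u06a9"  # Persian letters with a zero-width non-joiner

# cp1252 has no mapping for these characters, which is what the
# 'charmap' UnicodeEncodeError in the traceback is complaining about.
cp1252_error = None
try:
    text.encode("cp1252")
except UnicodeEncodeError as err:
    cp1252_error = err
    print(err)

# A stream that is explicitly UTF-8 accepts the same message.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8")
handler = logging.StreamHandler(stream)
logger = logging.getLogger("encoding_demo")  # hypothetical logger name
logger.propagate = False
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Creating tag %s", text)
handler.flush()
logged_bytes = buffer.getvalue()
print(len(logged_bytes))
```

Without touching the code, setting the PYTHONIOENCODING=utf-8 environment variable before running the command may have a similar effect on redirected output.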

UtopianElectronics · 2022-08-17T17:41:09Z

Ah great, do I understand correctly that you found a workaround for the problem by removing the weird character from the file?

Yes. However, I suggest a workaround that would automatically ignore those characters without having to manually remove them. Here it mentions errors='ignore' but I'm not sure if it could also be an option for picasa2digikam.

And I assume you want to do that not to debug the encoding issue any further, but rather just because you want the whole output somewhere? E.g. to see if the dry-run was successful?

Well, I'd like to debug anything and help to make this program as flawless as it could be! But isn't the encoding issue in the dry-run already fixed?
Yes, I want to carefully examine the log.

Hopefully this is the fix. It's on the main branch and I've also rebased the other patch, so if you wanted to keep that one, you could check it out anew.

It's a shame that I'm not very familiar with Git. How can I exactly keep the current code and try the new commit without losing the previous version?

Philipp91 · 2022-08-17T17:47:41Z

How can I exactly keep the current code

So I assume you don't want to lose it. Then it's best to give it a name, which in Git is a branch (or a tag). If you've made local modifications (git status has non-empty output), you need to commit them first. Then you can do git checkout -b thisworks to create a branch, or git tag thisworks to create a tag, with thisworks being a name you'll understand in the future. You can find those again with git branch or git tag. Then to apply the patch on top, download all the new commits (git fetch) so that it becomes known locally, and then git cherry-pick 676e50f9064a3e308532a926d21711a6138b0c94.

UtopianElectronics · 2022-08-17T20:21:57Z

Thanks a lot! After applying this patch, I could successfully export the output to a text file. The log file seems fine, but just a small issue with \u200c instead of real half space, as also mentioned here and here. But it's not such a big deal. Fixing it, however, would be nice.

Philipp91 · 2022-08-17T20:50:20Z

I can't reproduce this. Which log output is this referring to? The one you (only) get from #15 (which I don't intend to merge ever)? Or could you identify another logging.info() or logging.debug() statement that produces this log output?

And why do you care? Besides reading the log file, do you have another use case where you need the characters to be output correctly? When you don't redirect the output to a file but read it on the terminal directly, is it also "wrong" there?

UtopianElectronics · 2022-08-17T22:29:51Z

I had just exported the command output to a text file, and noticed this:

INFO: Traversing input directories
DEBUG: Reading contacts from C:\Users\USERNAME\AppData\Local\Google\Picasa2\contacts\contacts.xml
DEBUG: {'id': '2c112b01a7d580c5', 'name': '[REDACTED] ي\u200cک [REDACTED]', 'modified_time': '2022-01-24T16:27:25+03:30', 'local_contact': '1'}

It's after the recent patch. I don't know if the older ones would result in this as well.

And why do you care?

I don't. I just thought maybe it would result in errors later at final steps.

When you don't redirect the output to a file but read it on the terminal directly, is it also "wrong" there?

It's more than 71000 lines and I can't scroll back to it on the terminal despite changing the buffer size to 100000. However, pausing the output with the Pause/Break key shows that it's the same on the terminal as well.

Philipp91 · 2022-08-18T19:00:07Z

I think it's intended that log output uses \u200c instead of the proper representation. After all, it's meant for debugging purposes and not for end-user output. So I think this won't change and it doesn't affect the subsequent program run. It's just a different representation of the same data, and the program continues operating on the data as it is without (re)presenting it.
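
A short sketch of where the \u200c comes from, assuming the log line is produced by formatting a dict as in logging.debug(f'{contact.attrib}') from the traceback above (the dict below is a made-up stand-in): formatting a dict applies repr() to its values, and repr() escapes non-printable characters such as the zero-width non-joiner while leaving the letters themselves alone.

```python
contact = {"name": "\u064a\u200c\u06a9"}  # hypothetical stand-in for contact.attrib

as_logged = f"{contact}"       # str(dict) uses repr() for keys and values
as_stored = contact["name"]    # the actual string the program works with

print(as_logged)               # shows a literal backslash-u200c escape
print("\u200c" in as_stored)   # the real character is still in the data
```

So the escape only appears in the textual rendering of the dict. The string itself is unchanged.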

I suggest a workaround that would automatically ignore those characters without having to manually remove them.

How many characters in total were affected in your case, and across how many different files? If just a single bit flipped on your disk, I'm inclined to call that a random coincidence and wouldn't change the code.
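
For anyone who runs into this more often, a rough sketch of what such a lenient read could look like. read_ini_lenient is a hypothetical helper, not something picasa2digikam provides, and the sample file is made up. It uses errors='replace' rather than 'ignore' so the damaged spot stays visible as U+FFFD.

```python
import configparser
import os
import tempfile


def read_ini_lenient(path):
    """Read an ini file as UTF-8, replacing undecodable bytes with U+FFFD."""
    ini = configparser.ConfigParser(strict=False)
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        ini.read_file(fh)
    return ini


# Made-up file with a stray 0xd8 byte inside a section (file) name
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, ".picasa.ini")
    with open(demo_path, "wb") as fh:
        fh.write(b"[photo\xd8.jpg]\nstar=yes\n")
    sections = read_ini_lenient(demo_path).sections()

print([ascii(name) for name in sections])
```

The catch is that the affected name no longer matches anything on disk, so data for that one entry would likely be skipped silently rather than reported.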

UtopianElectronics · 2022-08-18T19:34:27Z

So I think this won't change and it doesn't affect the subsequent program run. It's just a different representation of the same data, and the program continues operating on the data as it is without (re)presenting it.

That's fine. Thank you.

How many characters in total were affected in your case, and across how many different files?

Just one character and one file.

This issue seems to be fixed by now. I'm closing it for now.

UtopianElectronics changed the title ~~[bug] UnicodeDecodeError: 'utf-8' codec can't decode byte [...] in position [...]: invalid continuation byte~~ [bug] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position (...): invalid continuation byte Aug 14, 2022

UtopianElectronics closed this as completed Aug 14, 2022

UtopianElectronics reopened this Aug 14, 2022

UtopianElectronics closed this as completed Aug 18, 2022
