[bug] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position (...): invalid continuation byte #14
If you open up the file I'm not sure why the tool so far assumes that the encoding is UTF-8. Sadly none of my own files (i.e. none of my contacts) contain any non-ASCII characters, so I can't distinguish UTF-8 from ISO encodings, for instance. If you find special characters in one of your files, it would be interesting to know what encoding those were using. E.g. in Notepad++, you can change the encoding with which the file is loaded, until the characters are rendered correctly. |
It's the first semicolon at the end of a contact's name (in the
Notepad++ opens the |
And the Arabic characters are displayed correctly? Then it should indeed be utf-8.
Then the position (4451) is somehow off. Because if it were a plain ASCII character, then |
By any chance, does it work if you replace |
Yes!
Yes.
HxD shows the binary (8-bit) value of position |
Do you see that anywhere around there? I wonder if we should just commit this change to |
When reading contacts from Also, after running the same command without
Where's And it gets terminated by this error, after a
|
Can the .picasa.ini file be posted here? |
The nearest one at position 4445.
No. It only looks correct with UTF-8. |
Load it into Notepad (not Notepad++) and select Save As. What encoding is listed? |
Unfortunately no, unless I put some dummy text in there which would make it useless to post.
If you mean to load the |
@UtopianElectronics Yes, I was referring to the .picasa.ini file. |
When you open the .picasa.ini file in Notepad++, what is listed in the "Encoding" menu? |
That was already fixed: c941174 |
The fact that loading with
That's pretty close actually. The discrepancy could be caused by one system counting bytes and the other counting characters. If no other 0xd8 byte is in the vicinity, it's safe to assume it's that one. So what's the context there, i.e. what do the surrounding bytes mean in ASCII? Is it important information and can we deduce something about the meaning of the 0xd8 byte from that? |
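To illustrate the byte-versus-character discrepancy described above, here is a minimal sketch. The sample string is made up for illustration and is not taken from the actual .picasa.ini file.

```python
# Hypothetical sample: ASCII text mixed with four Arabic letters,
# written with escapes so the source stays plain ASCII.
text = "name=\u0633\u0627\u0631\u0629;token"

# The same text as it would be stored on disk in UTF-8.
data = text.encode("utf-8")

# Position of the semicolon counted in characters (what a text editor reports).
char_pos = text.index(";")

# Position of the same semicolon counted in bytes (what the UTF-8 decoder reports).
byte_pos = data.index(b";")

print(char_pos, byte_pos)
```

Each of these Arabic letters takes two bytes in UTF-8, so the byte offset runs ahead of the character offset by one for every such letter before it.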
I am new to encoding, but it looks like this is not a straightforward issue. It looks like detecting the actual encoding is non-trivial. This file is probably not UTF-8, but is being reported as such. |
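A small sketch of why detection is non-trivial: the same bytes can decode without error under several encodings, so a successful decode alone proves little. The function name and the candidate list are assumptions chosen for illustration.

```python
def try_encodings(data, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample: "cafe" with an accented e, stored as ISO 8859-1.
sample = "caf\u00e9".encode("iso-8859-1")
outcome = try_encodings(sample)
print(outcome)
```

Here UTF-8 fails, but both cp1252 and ISO 8859-1 succeed with the same result. ISO 8859-1 in particular accepts every byte value, so it never raises an error even when it is the wrong guess.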
UTF-8.
Not sure about that, but I don't think it's the case.
When I select ISO-8859-1 in Notepad++ (Encoding > Character sets > Western European > ISO 8859-1), position 4451 changes place and goes to the beginning of the 16 characters string at the beginning of another line. |
I deleted the line in |
Is this doable here in this code? How do I test it? |
Just to double-check, the file in question should be
Well, maybe. picasa2digikam uses the configparser library. You can open a
To plug in the codecs package with that
|
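For readers following along, here is a rough sketch of loading an ini file through configparser with an explicit encoding and error policy. It is not the code picasa2digikam uses; `read_ini`, the sample content and `interpolation=None` are assumptions made for this example. The built-in `open` is used here, and `codecs.open` accepts the same `encoding` and `errors` arguments.

```python
import configparser
import os
import tempfile


def read_ini(path, encoding="utf-8", errors="strict"):
    """Parse an ini file, decoding it with an explicit encoding and error policy."""
    # interpolation=None is a choice made for this sketch so '%' stays literal.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path, "r", encoding=encoding, errors=errors) as f:
        parser.read_file(f)
    return parser


# Hypothetical ini content with one broken UTF-8 sequence (0xd8 followed by 0x27).
raw = b"[photo.jpg]\nfaces=abc\ncaption=\xd8'x\n"

with tempfile.TemporaryDirectory() as tmp:
    ini_path = os.path.join(tmp, "sample.ini")
    with open(ini_path, "wb") as f:
        f.write(raw)

    # Strict decoding stops at the first invalid byte.
    try:
        read_ini(ini_path)
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors="replace" substitutes U+FFFD for the bad byte and keeps going.
    tolerant = read_ini(ini_path, errors="replace")
    faces = tolerant["photo.jpg"]["faces"]
    caption = tolerant["photo.jpg"]["caption"]

print(strict_failed, faces, repr(caption))
```

Whether silently replacing bytes is acceptable depends on the data: the replacement character ends up in the parsed value, so a file name containing it would no longer match the file on disk.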
This service promises client-side (i.e. privacy-preserving) UTF-8 validation: https://onlineutf8tools.com/validate-utf8 |
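The same check can also be done locally without uploading anything. The following is a minimal sketch; `find_invalid_utf8` is a made-up helper name, and the sample bytes are illustrative only.

```python
def find_invalid_utf8(data):
    """Return a list of (byte_offset, reason) for every undecodable spot in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice, so add pos back.
            problems.append((pos + exc.start, exc.reason))
            pos += exc.end
    return problems


# Hypothetical sample with two invalid spots, at byte offsets 3 and 7.
sample = b"ok \xd8'x \xff"
report = find_invalid_utf8(sample)
print(report)
```

For a real file, the bytes would come from `open(path, "rb").read()`. The reported offsets are byte offsets, like the ones in Python's error message.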
Then Notepad++ is clearly counting characters, not bytes, whereas the error message from Python is most likely based on counting bytes. That explains the discrepancy. You can try snipping a section around the byte in question like this:
Or decode it like this, which would presumably fail with an error similar to the one raised when the data is decoded during file loading:
|
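The snippets from the comment above are not preserved here. As a stand-in, this is a minimal sketch of the idea, with a made-up helper `snip` and made-up sample bytes: cut out a window around a byte offset, then look at it as hex and as text with the bad byte made visible.

```python
def snip(data, offset, radius=16):
    """Return the bytes within `radius` of `offset`, clamped to the start of data."""
    start = max(0, offset - radius)
    return data[start:offset + radius + 1]


# Hypothetical data with an invalid byte at offset 5.
data = b"name=\xd8'rest of the line"
window = snip(data, 5, radius=4)

# Hex view, similar to what a hex editor such as HxD shows.
hex_view = window.hex(" ")

# Text view: invalid bytes are shown as \xNN instead of raising an error.
text_view = window.decode("utf-8", errors="backslashreplace")

print(hex_view)
print(text_view)
```

With a real file, `data` would be read in binary mode and `offset` would be the position from the error message.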
It just outputs
Tried it and it shows nothing.
It says it's valid.
Shouldn't it be
It decodes everything smoothly without any problem, and it showed the same characters as Notepad++.
There are multiple subdirectories (folders) inside A mystery to me is that if I edit |
Yeah, those were just some typos on my part, sorry.
After this, the file has been read, so apparently it did succeed in loading the file. You can then view it by querying the It's plausible that the attempt with the codecs package and
Yeah, something is fishy here. Perhaps picasa2digikam doesn't load the ini file as intended. How about:
This should really be 100% what picasa2digikam calls when that 4451 error happens. Perhaps that file has some restrictions on it that make it impossible for picasa2digikam to access (it is a hidden file after all) and then it instead receives some error message that has a non-ASCII character at 4451? You could try patching the following into
I'd expect this to print the whole file's contents onto the console, but perhaps we get something else (like the supposed error message) instead. |
Again, it outputs nothing, or maybe I'm doing it wrong? What should be in the code before them?
It works fine and without any error, also no unreadable characters in the output.
There are two |
I meant like this: #15 |
So I ran
It shows the non-Latin UTF-8 characters like
|
Huh, what's up with that position suddenly jumping to 602467? Wasn't it 4451 before? Is the file even that long (0.6MB)?
Yes, that's okay, as long as the other characters (I assume most of the ini file is regular ASCII stuff) are output normally. I've updated the patch. I guess you can get it with |
Checked it with Notepad++. It was part of a file name, and it was shown as one of those strange symbols that Notepad++ displays if you open, for example, an image file. I deleted that single character and the code now seems to work fine. Strangely enough, when I saved the file to another location and opened it, that strange symbol had changed to a readable character. So it was probably an encoding bug or something similar in Picasa, because the actual file doesn't have that extra character in its name (I might have renamed it outside Picasa).
I think 4451 was for another
Yes, it's 595 KB.
Yes.
As the previous one seems to be working, I'm now going to check further if it's really working. I'll share the findings here. |
When I want to export the output of the command to a text file using both Here's one of the many errors in the output (not in the log file):
|
Ah great, do I understand correctly that you found a workaround for the problem by removing the weird character from the file?
And I assume you want to do that not to debug the encoding issue any further, but rather just because you want the whole output somewhere? E.g. to see if the dry-run was successful?
Looks like it's trying to log with non-UTF-8 too. Hopefully this is the fix. It's on the |
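As an illustration only (this is not the fix referred to above), a log file handler can be given an explicit encoding so that non-ASCII names do not depend on the console's code page. The helper name, logger name and file name below are assumptions for this sketch.

```python
import logging
import os
import tempfile


def add_utf8_file_handler(logger, path):
    """Attach a file handler that always writes UTF-8, whatever the console uses."""
    handler = logging.FileHandler(path, encoding="utf-8")
    logger.addHandler(handler)
    return handler


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "run.log")
    logger = logging.getLogger("encoding_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo message off the console

    handler = add_utf8_file_handler(logger, log_path)
    logger.info("contact: %s", "\u0633\u0627\u0631\u0629")

    # Close the handler before the temporary directory is removed.
    handler.close()
    logger.removeHandler(handler)

    with open(log_path, encoding="utf-8") as f:
        logged = f.read()

print(repr(logged))
```

When output is redirected with the shell instead, setting the `PYTHONIOENCODING` environment variable to `utf-8` before running the script is another commonly used option.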
Yes. However, I suggest a workaround that would automatically ignore those characters without having to manually remove them. Here it mentions
Well, I'd like to debug anything and help to make this program as flawless as it could be! But isn't the encoding issue in the dry-run already fixed?
It's a shame that I'm not very familiar with Git. How exactly can I keep the current code and try the new commit without losing the previous version? |
So I assume you don't want to lose it. Then it's best to give it a name, which in Git is a branch (or a tag). If you've made local modifications ( |
I can't reproduce this. Which log output is this referring to? The one you (only) get from #15 (which I don't intend to merge ever)? Or could you identify another And why do you care? Besides reading the log file, do you have another use case where you need the characters to be output correctly? When you don't redirect the output to a file but read it on the terminal directly, is it also "wrong" there? |
I had just exported the command output to a text file, and noticed this:
It's after the recent patch. I don't know if the older ones would result in this as well.
I don't. I just thought maybe it would result in errors later at final steps.
It's more than 71000 lines and I can't scroll back to it on the terminal despite changing the buffer size to 100000. However, by pressing the key |
I think it's intended that log output uses
How many characters in total were affected in your case, and across how many different files? If just a single bit flipped on your disk, I'm inclined to call that a random coincidence and wouldn't change the code. |
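To show how little it takes, here is a sketch of one possible mechanism. It is not confirmed to be what happened to the file in this thread. Flipping a single bit in the second byte of a two-byte Arabic character produces exactly the kind of error from the issue title.

```python
# UTF-8 encoding of U+0627 (Arabic letter alef): lead byte 0xd8, continuation 0xa7.
good = b"\xd8\xa7"

# Flip the top bit of the second byte: 0xa7 becomes 0x27, which is an ASCII apostrophe.
bad = bytes([good[0], good[1] ^ 0x80])

decoded = good.decode("utf-8")

message = None
try:
    bad.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xd8 announces a two-byte sequence, but 0x27 is not a continuation byte.
    message = str(exc)

print(message)
```

The decoder blames the 0xd8 byte because it is the lead byte whose follow-up is missing, even though the damaged byte is the one after it.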
That's fine. Thank you.
Just one character and one file. This issue seems to be fixed by now. I'm closing it for now. |
After applying this patch and by running
python main.py --dry_run --photos_dir="D:\gallery" --digikam_db="D:\digiKam_library\digikam4.db" --contacts="%LocalAppData%\Google\Picasa2\contacts\contacts.xml" -vv
, I get this error message (excerpt from the whole output): I have no idea what
<frozen codecs>
means, and why it's mentioned as a file. Looks like the
RuntimeError: Failed to read ini file "D:\gallery\2019\.picasa.ini".
error is related to the patch.