Saskwoch patch 1 #1

saskwoch · 2015-09-07T13:04:10Z

I needed to extract >70k image attachments from a >140k message mbox archive downloaded from Gmail.

PabloCastellano's script was a tremendous start for me; it just needed some changes to work on Python 3.4.3 (Windows) and with my particular data set.

Note: I ran my script within IDLE and used a literal filename rather than passing the filename in as an argument, hence the version of the script here is untested 'as is'.

List of changes:

1. All print statements updated to the print() function for Python 3.4.3.
2. len(mb) gave me a performance hit, so I used a numeric literal instead; hence this version of the script is untested 'as is'.
3. I was getting "TypeError: Can't convert 'bytes' object to str implicitly" for both the "subject = subject + l[0]" and "em_from = em_from + l[0]" statements, so I added decode to both and encode to "content = content[fh:]".
4. I was getting "TypeError: memoryview: str object does not have the buffer interface" for the "extract_attachment(payl)" statement, so I added decode=False to the preceding statement.
5. I got "TypeError: must be str, not bytes" for the "fp.write(content)" statement, so I opened the file as binary.
6. I was getting "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 27: invalid start byte" for "em_from = em_from + l[0].decode('utf-8')", so I added the "replace_spc" error handler taken from http://www.gossamer-threads.com/lists/python/python/780611#780611
7. A couple of messages threw "AttributeError: 'NoneType' object has no attribute 'find'", so I added exception handling to skip them and didn't investigate further.
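
The bytes/str errors in the list above all come from Python 3 keeping the two types apart. Here is a minimal sketch of that pattern, not the script's actual code. It assumes the header fragments come from `email.header.decode_header`, and the helper names `header_to_text` and `save_attachment` are hypothetical. It also uses the built-in `errors="replace"` instead of the custom "replace_spc" handler.

```python
from email.header import decode_header


def header_to_text(raw_header):
    """Join decode_header() fragments into one str (hypothetical helper)."""
    parts = []
    for fragment, charset in decode_header(raw_header):
        # Fragments may be bytes (encoded words) or str (plain headers).
        if isinstance(fragment, bytes):
            # Assumption: fall back to UTF-8 when no charset is declared.
            fragment = fragment.decode(charset or "utf-8", errors="replace")
        parts.append(fragment)
    return "".join(parts)


def save_attachment(part, path):
    """Write a message part's decoded payload to disk (hypothetical helper)."""
    payload = part.get_payload(decode=True)  # bytes, or None for containers
    if payload is None:
        return False
    # Binary mode, because the payload is bytes rather than str.
    with open(path, "wb") as fp:
        fp.write(payload)
    return True
```

An unknown charset name in a header would still raise LookupError here; the sketch leaves that unhandled.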

[I've written these notes in retrospect, so apologies if they aren't 100% accurate.]

Inserted "codecs.register_error("replace_spc", replace_spc_error_handler)"
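
For context, a custom error handler name only works after it has been registered; otherwise decode() raises "LookupError: unknown error handler name". The sketch below shows the registration step. The handler body is an assumption (one space per undecodable byte) and may differ from the one in the linked thread.

```python
import codecs


def replace_spc_error_handler(error):
    """Assumed behaviour: replace each undecodable byte with a space."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    # Return the replacement text and the position where decoding resumes.
    return (" " * (error.end - error.start), error.end)


# Must run before any decode() call that names "replace_spc".
codecs.register_error("replace_spc", replace_spc_error_handler)

# Byte 0xfc is not a valid UTF-8 start byte, so the handler is invoked.
em_from = b"J\xfcrgen".decode("utf-8", "replace_spc")
```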

saskwoch added 2 commits September 7, 2015 12:47

Missed an edit

784afc1

Inserted "codecs.register_error("replace_spc", replace_spc_error_handler)"
