SeanThomasWilliams committed Mar 10, 2012
1 parent 4bff603 commit 4e92815
Showing 2 changed files with 48 additions and 0 deletions.
1 change: 1 addition & 0 deletions PyCon2012/picloud.rst
@@ -47,3 +47,4 @@ The Journey to Give Every Scientist a Supercomputer
- OpenAFS for environment storage
- Getting data to the cloud is a feature they have
- Over 100,000,000 jobs processed to date
- Pay by the millisecond, not the hour
47 changes: 47 additions & 0 deletions PyCon2012/unicode.rst
@@ -0,0 +1,47 @@
================================================
Pragmatic Unicode - or - How do I stop the pain?
================================================

- Entire presentation made with unicode
- 256 symbols is not enough for the world to communicate using text
- Started with encodings for 1-byte, then 2-byte
- Now we are using unicode
- Assigns characters to code points (integers)
- 1.1M code points
- 110K assigned
- Pile of poo character was covered
- Python can address characters by their unicode name
- str stores bytes
- unicode stores 'code points'
- Unicode encode method turns code points into bytes
- str decode turns bytes into code points
- Encode and decode can fail: a code point or byte may have no mapping in the chosen encoding (the "ordinal not in range" error)
- Error handling: my_unicode.encode("ascii", "replace"),
  or "xmlcharrefreplace" for XML/HTML character references
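
The bullets above can be tried out in a few lines. This is a minimal sketch written for Python 3, where ``str`` holds code points and ``bytes`` holds bytes (the roles the notes assign to Python 2's ``unicode`` and ``str``). The sample word and all variable names are illustrative and not from the talk.

```python
import unicodedata

# A character can be written by its Unicode name, or looked up by name.
poo = "\N{PILE OF POO}"
same_char = unicodedata.lookup("PILE OF POO")
code_point = ord(poo)  # the integer code point, U+1F4A9

# Encoding turns code points into bytes; decoding reverses it.
text = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# ASCII has no mapping for the accented letter, so strict encoding fails.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Error handlers trade fidelity for not raising.
replaced = text.encode("ascii", "replace")               # b"caf?"
xml_escaped = text.encode("ascii", "xmlcharrefreplace")  # b"caf&#233;"
```

The ``"replace"`` handler loses information, while ``"xmlcharrefreplace"`` keeps it in a form that HTML and XML consumers can read back.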

- Python 2 implicitly decodes byte strings, assuming ASCII, when you concatenate them with unicode

There are two things: bytes and Unicode.
You have to know which one you're dealing with.

- str in Python 2 is a byte string; str in Python 3 is a Unicode string
- Python 3 will never implicitly convert bytes into Unicode

Mixing bytes and Unicode is always **PAIN**.
You are forced to keep them straight in Python 3.
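
A small Python 3 sketch of what "forced to keep them straight" looks like in practice. The variable names are made up for illustration.

```python
greeting = "Hello, "   # text (code points)
name_bytes = b"world"  # bytes

# Python 3 refuses to guess an encoding, so mixing the two types raises.
try:
    greeting + name_bytes
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# The fix is an explicit decode at the point where the bytes arrive.
message = greeting + name_bytes.decode("ascii")
```

In Python 2 the same concatenation would succeed silently for ASCII data and fail only later, when non-ASCII bytes showed up.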

In Python 3, the data you get back from a read operation depends on how you opened the file: text mode (r) returns Unicode, binary mode (rb) returns bytes.

locale.getpreferredencoding() returns the default encoding used when reading in text mode.
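
To see both read modes and the locale default side by side, here is a short sketch assuming Python 3. The temporary file, its name, and the sample word are illustrative only.

```python
import locale
import os
import tempfile

payload = "na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "wb") as handle:
        handle.write(payload)

    # Binary mode hands back the raw bytes, untouched.
    with open(path, "rb") as handle:
        raw = handle.read()

    # Text mode decodes for you; naming the encoding avoids relying on the default.
    with open(path, "r", encoding="utf-8") as handle:
        text = handle.read()

# The locale's preferred encoding; it varies from machine to machine.
default_encoding = locale.getpreferredencoding()
```

Passing ``encoding=`` explicitly is the safer habit, since the locale default is a property of the machine and not of the file.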

stdin and stdout are pre-opened files, so their encoding is already chosen; you must handle it or wrap them.
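
One way to wrap a pre-opened stream in Python 3 is ``io.TextIOWrapper``. The helper name ``utf8_writer`` below is hypothetical, and an in-memory ``io.BytesIO`` stands in for the real stream so the result can be inspected.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", errors="replace")


# Stand-in for a real binary stream such as sys.stdout.buffer.
fake_stdout = io.BytesIO()
writer = utf8_writer(fake_stdout)
writer.write("snowman: \N{SNOWMAN}")
writer.flush()  # push the encoded bytes through to the underlying stream

written = fake_stdout.getvalue()
```

For the real stream, the same helper could be applied as ``utf8_writer(sys.stdout.buffer)``. Python 3.7 and later also offer ``sys.stdout.reconfigure(encoding="utf-8")``.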

"There is no string"

Fact of life: Sometimes you are told the wrong encoding. It sucks.
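
A sketch of what being told the wrong encoding looks like, assuming Python 3. The sample word is illustrative.

```python
original = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"
data = original.encode("utf-8")  # what actually arrives over the wire

# Told "latin-1": decoding succeeds, but the text is silently wrong.
garbled = data.decode("latin-1")

# Told "ascii": decoding fails loudly instead.
try:
    data.decode("ascii")
    ascii_worked = True
except UnicodeDecodeError:
    ascii_worked = False

# Only the true encoding recovers the original text.
recovered = data.decode("utf-8")
```

The latin-1 case is the more dangerous one: every byte maps to some character in latin-1, so nothing raises and the garbled text travels onward.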

I/O is always bytes.

Test!

http://bit.ly/unipain
