# Week 6

This week is about getting data into Python from external sources, such as files on your computer or online. When working with these kinds of sources, we need to understand **character encodings** and **streams**. Additionally, this week we will cover **string formatting**, as it is useful when writing to files or to the terminal.

In [1]:
import nltk  # make sure NLTK is installed and loaded

## Unicode

Every character displayed by your computer is assigned a number. Before Unicode, each character set (e.g., for a particular language) assigned its own numbers to characters, which made it difficult to mix character sets in one document. [Unicode](https://unicode.org/) is the modern standard for assigning these numbers, called *code points*. It is one large table covering the characters of the world's writing systems, along with some non-language characters (🥳🦥🌤...). In Python, strings are "pure" sequences of code points.
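
To make the idea of a string as a sequence of code points concrete, here is a minimal sketch. The variable names are illustrative only, and the sample string uses escapes so the exact code points are unambiguous.

```python
# A str is a sequence of Unicode code points, regardless of script.
text = 'a\u00e9\u3042'  # 'a', 'é' (precomposed), 'あ'
codepoints = [ord(ch) for ch in text]
print(codepoints)

# The U+XXXX notation is just the code point written in hexadecimal.
labels = [f'U+{cp:04X}' for cp in codepoints]
print(labels)

# chr() is the inverse of ord(), so the round trip restores the string.
restored = ''.join(chr(cp) for cp in codepoints)
print(restored == text)
```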

## Encodings

Whenever a Unicode string needs to be stored or transmitted outside of Python, it must be encoded into a sequence of bytes. Decoding reverses the process.

In [3]:
'あ'.encode('utf-8')

b'\xe3\x81\x82'

In [7]:
b'\xe3\x81\x82'.decode('utf-8')

'あ'

In [19]:
'\u3042'

'あ'

In [16]:
int('3042', 16)

12354

In [20]:
'café'.encode('ascii', errors='ignore')

b'caf'
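
The `errors` argument accepts other handlers besides `'ignore'`. The sketch below compares a few of the standard ones on the same word; the variable names are chosen here for illustration.

```python
word = 'caf\u00e9'  # 'café' with a precomposed é (U+00E9)

# 'strict' is the default: unencodable characters raise an exception.
try:
    word.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = word.encode('ascii', errors='ignore')            # drops é
replaced = word.encode('ascii', errors='replace')          # substitutes ?
escaped = word.encode('ascii', errors='backslashreplace')  # keeps an escape
print(strict_failed, ignored, replaced, escaped)
```

Silently dropping characters with `'ignore'` loses information, so `'strict'` is usually the safer choice while debugging.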

In [23]:
for c in 'abcdefghijklmnopqrstuvwxyz':
    print(c, ord(c))

a 97
b 98
c 99
d 100
e 101
f 102
g 103
h 104
i 105
j 106
k 107
l 108
m 109
n 110
o 111
p 112
q 113
r 114
s 115
t 116
u 117
v 118
w 119
x 120
y 121
z 122


In [25]:
hex(120)

'0x78'

In [27]:
chr(120)

'x'

In [29]:
ord('あ')

12354

In [30]:
hex(12354)

'0x3042'

## Streams

In [33]:
import urllib.request

bytestring = urllib.request.urlopen('http://gutenberg.org/files/13083/13083-0.txt').read()


In [39]:
type(bytestring)

bytes

In [41]:
string = bytestring.decode('utf-8')
type(string)

str
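
Reading everything with `.read()` and then decoding works for small inputs, but a stream can also be decoded incrementally. The sketch below uses `io.BytesIO` as a stand-in for a network response or binary file (an assumption made so the example runs offline) and wraps it in `io.TextIOWrapper`.

```python
import io

# Stand-in for a network response or binary file: a stream of UTF-8 bytes.
raw = io.BytesIO('první řádek\ndruhý řádek\n'.encode('utf-8'))

# TextIOWrapper decodes on the fly, so text can be read line by line
# instead of loading and decoding everything at once.
text_stream = io.TextIOWrapper(raw, encoding='utf-8')
lines = [line.rstrip('\n') for line in text_stream]
print(lines)
```

The object returned by `urllib.request.urlopen` is also a binary stream, so the same wrapper should apply to it, though that is not tested here.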

In [48]:
with open('myfile.txt', 'wb') as f:
    f.write(bytestring)

In [52]:
# 'utf-8-sig' decodes UTF-8 and drops a leading byte order mark (BOM) if present
with open('myfile.txt', encoding='utf-8-sig') as f:
    string = f.read()

In [53]:
with open('myfile2.txt', 'wt', encoding='utf-16') as f:
    print(string, file=f)

In [None]:
b'\xff\xfe'  # UTF-16 little-endian byte order mark (BOM)
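
The bytes FF FE are the UTF-16 little-endian byte order mark (BOM). As a small sketch of how BOMs show up in practice, the following checks use the constants in the standard `codecs` module. The native-order constant `codecs.BOM_UTF16` is used so the comparison holds on any machine.

```python
import codecs

sample = 'a'

# The 'utf-16' codec prepends a BOM in the machine's native byte order.
utf16 = sample.encode('utf-16')
print(utf16[:2] == codecs.BOM_UTF16)

# 'utf-8-sig' writes the UTF-8 BOM on encoding and strips it on decoding.
utf8_sig = sample.encode('utf-8-sig')
print(utf8_sig == codecs.BOM_UTF8 + b'a')
print(utf8_sig.decode('utf-8-sig') == sample)

# Plain 'utf-8' leaves the BOM in place as the character U+FEFF.
print(utf8_sig.decode('utf-8') == '\ufeff' + sample)
```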

## String Formatting


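String formatting lets us line up output in columns. For example, a character-frequency table might be printed like this (the counts are illustrative):

>>> tabulate_chars(string)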
e      1050233
t       559024
a       480902
o       239402
...


In [54]:
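import nltk
# Pass the encoding explicitly; the default depends on the platform.
with open('myfile.txt', encoding='utf-8-sig') as f:
    # Count only alphabetic characters (letters in any script).
    fd = nltk.FreqDist(c for c in f.read() if c.isalpha())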

In [55]:
fd

FreqDist({'e': 10013, 'o': 7979, 'a': 6453, 'n': 6291, 't': 6016, 'l': 4945, 'i': 4759, 's': 4357, 'r': 3717, 'm': 3482, ...})

In [57]:
maximum = max(fd.values())
width = len(str(maximum))

In [63]:
for c, count in fd.most_common():
    print(f'{c}\t{count:>{width}}')

e	10013
o	 7979
a	 6453
n	 6291
t	 6016
l	 4945
i	 4759
s	 4357
r	 3717
m	 3482
d	 3119
u	 3050
v	 2840
k	 2324
h	 2198
c	 2187
y	 2076
j	 2027
p	 2000
í	 1930
á	 1922
b	 1677
ě	 1270
z	 1095
ž	  941
H	  894
š	  774
ř	  717
D	  699
č	  693
P	  610
A	  544
é	  520
R	  481
N	  474
ý	  433
f	  366
G	  362
T	  352
g	  352
J	  299
ů	  287
B	  263
w	  255
S	  236
q	  210
V	  201
C	  194
F	  192
M	  173
O	  163
U	  149
K	  130
E	  128
ď	  109
I	  108
Z	  107
L	   90
ť	   67
ň	   57
ú	   42
ó	   35
x	   33
Ř	   30
Y	   29
Č	   24
Ž	   24
W	   16
Ó	   14
Š	    9
Ú	    5
Í	    5
Ě	    3
X	    2
Ť	    1
É	    1
Q	    1
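
The steps above can be collected into a single function. `tabulate_chars` is named at the start of this section but never defined, so the version below is only one possible sketch. It assumes tab-separated, right-aligned output as in the loop above, and it uses `collections.Counter` in place of `nltk.FreqDist` so that it runs without NLTK.

```python
from collections import Counter

def tabulate_chars(text):
    """Return lines pairing each letter with its right-aligned count."""
    counts = Counter(c for c in text if c.isalpha())
    if not counts:
        return []
    # Width of the largest count, used to right-align the number column.
    width = len(str(max(counts.values())))
    # most_common() yields (character, count) pairs by descending count.
    return [f'{c}\t{count:>{width}}' for c, count in counts.most_common()]

rows = tabulate_chars('a' * 12 + 'b' * 3 + 'c')
print('\n'.join(rows))
```

Returning the lines instead of printing them directly makes the function easy to reuse with `print(..., file=f)` when writing to a file.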
