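【National Taipei University】Python Programming<br>
【Instructor】[Hsiang-Hui Chen 陳祥輝 (Email: HsiangHui.Chen@gmail.com)](mailto:HsiangHui.Chen@gmail.com)<br>
【Facebook】[Instructor Chen's Facebook page (friend requests welcome)](https://goo.gl/osivhx)<br>
【Reference book】[從零開始學Python程式設計 (Learning Python Programming from Scratch, for Python 3.5 and later)](http://www.drmaster.com.tw/Bookinfo.asp?BookID=MP31821)<br>
【Main topic】Transcoding files between character encodings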

【References】
- [Python 2 standard encodings list](https://docs.python.org/2.4/lib/standard-encodings.html)
- [Python 3 standard encodings list](https://docs.python.org/3.7/library/codecs.html#standard-encodings)
- [Big-5 (code page 950)](https://zh.wikipedia.org/wiki/%E5%A4%A7%E4%BA%94%E7%A2%BC#Code_Page_950)
- [Byte order mark, BOM](https://en.wikipedia.org/wiki/Byte_order_mark)

In [1]:
# -*- coding: utf-8 -*-
import pandas as pd
import os, time, glob, socket

print("【Date/time】{}".format(time.strftime("%Y/%m/%d %H:%M:%S")))
print("【Working directory】{}".format(os.getcwd()))
print("【Host name】{} ({})".format(socket.gethostname(), socket.gethostbyname(socket.gethostname())))

【Date/time】2019/06/16 17:21:44
【Working directory】E:\annie\3下課程\MyPython
【Host name】LAPTOP-7TNGIFIP (172.20.10.3)


### <font color=#0000FF>Encoding and decoding with str.encode() and bytes.decode()</font>

#### <font color=#0000FF>【Encoding and decoding】</font>
- <b>str --> [encode] --> bytes</b>
    - str.encode() encodes a str into bytes
- <b>str <-- [decode] <-- bytes</b>
    - bytes.decode() decodes bytes back into a str
- Commonly used encodings
    - <font color=#0000FF>cp950 (aliases: ms950, 950)</font> : Microsoft code page 950, the Windows variant of Big-5
    - <font color=#0000FF>big5 (aliases: big5-tw, csbig5)</font> : Big-5
    - <font color=#0000FF>utf-8</font> : UTF-8
    - <font color=#0000FF>utf-8-sig</font> : UTF-8 with BOM
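The codec names in the list above can be checked against Python's own codec registry. The following is a minimal sketch using only the standard library; `codecs.lookup()` resolves an alias to its canonical codec name, and the sample character is the one used later in this notebook.

```python
import codecs

# codecs.lookup() resolves an alias to the canonical codec name.
for alias in ('cp950', '950', 'ms950', 'big5', 'big5-tw', 'utf8', 'utf-8-sig'):
    print('{:10s} -> {}'.format(alias, codecs.lookup(alias).name))

# encode() turns a str into bytes; decode() turns bytes back into a str.
text = '是'
raw = text.encode('cp950')
restored = raw.decode('cp950')
print(type(raw).__name__, type(restored).__name__)
```

Note that `ms950` and `950` resolve to `cp950`, not to `big5`; the two codecs are registered separately.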

### <font color=#0000FF>(1) First, understand what encode & decode do</font>

In [None]:
'abc'.encode()     # str.encode(): str -> bytes (default encoding is UTF-8)
b'abc'.decode()    # bytes.decode(): bytes -> str

In [4]:
msg1 = '我是陳祥輝'
msg2 = '是'
print(msg1.encode(encoding='cp950'))                            # str -> bytes
print(msg1.encode(encoding='cp950').decode(encoding='cp950'))   # bytes -> str

print(msg2.encode(encoding='cp950'))                            # str -> bytes
print(msg2.encode(encoding='cp950').decode(encoding='cp950'))   # bytes -> str

b'\xa7\xda\xacO\xb3\xaf\xb2\xbb\xbd\xf7'
我是陳祥輝
b'\xacO'
是


In [2]:
msg1 = '我是陳祥輝'   # the u'' prefix is redundant in Python 3
msg2 = '是'
print(msg1.encode(encoding='utf-8'))                            # str -> bytes
print(msg1.encode(encoding='utf-8').decode(encoding='utf-8'))   # bytes -> str

print(msg2.encode(encoding='utf-8'))                            # str -> bytes
print(msg2.encode(encoding='utf-8').decode(encoding='utf-8'))   # bytes -> str

b'\xe6\x88\x91\xe6\x98\xaf\xe9\x99\xb3\xe7\xa5\xa5\xe8\xbc\x9d'
我是陳祥輝
b'\xe6\x98\xaf'
是


In [6]:
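msg1 = '我是陳祥輝'   # the u'' prefix is redundant in Python 3
msg2 = '是'
print(msg1.encode(encoding='utf-8-sig'))                                # str -> bytes, BOM prepended
print(msg1.encode(encoding='utf-8-sig').decode(encoding='utf-8-sig'))   # bytes -> str, BOM stripped

print(msg2.encode(encoding='utf-8-sig'))                                # str -> bytes, BOM prepended
print(msg2.encode(encoding='utf-8-sig').decode(encoding='utf-8-sig'))   # bytes -> str, BOM stripped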

b'\xef\xbb\xbf\xe6\x88\x91\xe6\x98\xaf\xe9\x99\xb3\xe7\xa5\xa5\xe8\xbc\x9d'
我是陳祥輝
b'\xef\xbb\xbf\xe6\x98\xaf'
是
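The outputs above show that `utf-8-sig` differs from `utf-8` only by the three leading bytes `\xef\xbb\xbf`. As a small sketch of why the choice of decoder matters, the following compares the two codecs on the same bytes; only standard-library names are used.

```python
import codecs

text = '是'
with_bom = text.encode('utf-8-sig')

# The utf-8-sig encoder prepends the 3-byte BOM; the rest is plain UTF-8.
has_bom = with_bom.startswith(codecs.BOM_UTF8)
payload = with_bom[len(codecs.BOM_UTF8):]
print(has_bom, payload == text.encode('utf-8'))

# Decoding with plain utf-8 keeps the BOM as the character U+FEFF.
kept = with_bom.decode('utf-8')
print(len(kept), kept == '\ufeff' + text)

# utf-8-sig strips the BOM and also accepts input that has no BOM.
print(with_bom.decode('utf-8-sig') == text)
print(text.encode('utf-8').decode('utf-8-sig') == text)
```

A stray U+FEFF at the start of a file is a common reason the first column name of a CSV fails to match, which is why a BOM file is read with `utf-8-sig`.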


#### <font color=#0000FF>Reading files with different encodings</font>

In [14]:
path = r'F:\annie\3下課程\MyPython\PyData'
os.chdir(path)
os.getcwd()

'F:\\annie\\3下課程\\MyPython\\PyData'

In [16]:
fname = 'AirQty2016-06-01_CP950.csv'
airQty = pd.read_csv(fname,sep=',',engine='python',encoding='cp950')
airQty.head(n=3)

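| Unnamed: 0 | SiteName | County | PSI | MajorPollutant | Status | SO2 | CO | O3 | PM10 | PM2.5 | NO2 | WindSpeed | WindDirec | FPMI | NOx | NO | PublishTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 麥寮 | 雲林縣 | 36 | | 良好 | 1.2 | 0.11 | 10.0 | 35.0 | 7 | 4.5 | 2.2 | 182.0 | 1 | 6.78 | 2.3 | 2016/6/1 02:00 |
| 1 | 關山 | 臺東縣 | 23 | | 良好 | 1.2 | | 5.1 | 21.0 | 7 | 3.6 | 0.7 | 232.0 | 1 | 4.89 | 1.32 | 2016/6/1 02:00 |
| 2 | 馬公 | 澎湖縣 | 16 | | 良好 | 0.8 | 0.09 | 20.0 | 5.0 | 2 | 1.7 | 2.5 | 173.0 | 1 | 3.26 | 1.53 | 2016/6/1 02:00 |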


In [17]:
fname = 'AirQty2016-06-01_UTF8.csv'
airQty = pd.read_csv(fname,sep=',',engine='python',encoding='utf-8')
airQty.head(n=3)

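| Unnamed: 0 | SiteName | County | PSI | MajorPollutant | Status | SO2 | CO | O3 | PM10 | PM2.5 | NO2 | WindSpeed | WindDirec | FPMI | NOx | NO | PublishTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 麥寮 | 雲林縣 | 36 | | 良好 | 1.2 | 0.11 | 10.0 | 35.0 | 7 | 4.5 | 2.2 | 182.0 | 1 | 6.78 | 2.3 | 2016/6/1 02:00 |
| 1 | 關山 | 臺東縣 | 23 | | 良好 | 1.2 | | 5.1 | 21.0 | 7 | 3.6 | 0.7 | 232.0 | 1 | 4.89 | 1.32 | 2016/6/1 02:00 |
| 2 | 馬公 | 澎湖縣 | 16 | | 良好 | 0.8 | 0.09 | 20.0 | 5.0 | 2 | 1.7 | 2.5 | 173.0 | 1 | 3.26 | 1.53 | 2016/6/1 02:00 |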


In [18]:
fname = 'AirQty2016-06-01_UTF8_BOM.csv'
airQty = pd.read_csv(fname,sep=',',engine='python',encoding='utf-8-sig')
airQty.head(n=3)

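| Unnamed: 0 | SiteName | County | PSI | MajorPollutant | Status | SO2 | CO | O3 | PM10 | PM2.5 | NO2 | WindSpeed | WindDirec | FPMI | NOx | NO | PublishTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 麥寮 | 雲林縣 | 36 | | 良好 | 1.2 | 0.11 | 10.0 | 35.0 | 7 | 4.5 | 2.2 | 182.0 | 1 | 6.78 | 2.3 | 2016/6/1 02:00 |
| 1 | 關山 | 臺東縣 | 23 | | 良好 | 1.2 | | 5.1 | 21.0 | 7 | 3.6 | 0.7 | 232.0 | 1 | 4.89 | 1.32 | 2016/6/1 02:00 |
| 2 | 馬公 | 澎湖縣 | 16 | | 良好 | 0.8 | 0.09 | 20.0 | 5.0 | 2 | 1.7 | 2.5 | 173.0 | 1 | 3.26 | 1.53 | 2016/6/1 02:00 |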
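Each read above works because the `encoding` argument matches the file. When the encoding of a file is not known in advance, one possible approach is to try a short list of candidates in order. The sketch below is only an illustration: `read_text_guess` is a hypothetical helper, not part of pandas or the standard library, and it runs on throwaway files rather than the course data.

```python
import os
import tempfile

def read_text_guess(path, candidates=('utf-8-sig', 'cp950')):
    """Return (text, encoding) for the first candidate that decodes the file.

    Hypothetical helper. The order matters: a strict codec such as UTF-8
    should come first, because a permissive codec may decode bytes it was
    never meant for without raising an error.
    """
    for enc in candidates:
        try:
            with open(path, mode='rt', encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of {} could decode {}'.format(candidates, path))

# Write the same text in three encodings, then read each file back.
sample = 'name\n我是陳祥輝\n'
tmpdir = tempfile.mkdtemp()
results = {}
for enc in ('cp950', 'utf-8', 'utf-8-sig'):
    path = os.path.join(tmpdir, 'sample_{}.csv'.format(enc))
    with open(path, mode='wt', encoding=enc) as f:
        f.write(sample)
    results[enc] = read_text_guess(path)
    print(enc, '->', results[enc][1])
```

The plain UTF-8 file is reported as `utf-8-sig` because that codec also accepts input without a BOM. The encoding found this way could then be passed to `pd.read_csv(..., encoding=...)`.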


### <font color=#0000FF>(2) Encoding & decoding files</font>

#### <font color=#0000FF>Converting from utf-8-sig (UTF-8 with BOM) to cp950 (Big-5)</font>

In [19]:
path = r'F:\annie\3下課程\MyPython\PyData'
os.chdir(path)
os.getcwd()

'F:\\annie\\3下課程\\MyPython\\PyData'

In [20]:
fname = 'AirQty2016-06-01_UTF8_BOM.csv'
# Read as UTF-8 with BOM (utf-8-sig), write as Big-5 (cp950).
# A text-mode file encodes on write, so no explicit encode()/decode() is needed.
with open(file=fname, mode='rt', encoding='utf-8-sig') as inpf, \
     open(file='out_cp950.txt', mode='wt', encoding='cp950') as outf:
    outf.write(inpf.read())
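Converting from UTF-8 to cp950 can fail, because UTF-8 can represent characters that cp950 cannot. The following sketch shows the failure and one possible way around it with the standard `errors` argument; the sample string is made up for illustration.

```python
# The emoji is an illustrative character that cp950 cannot represent.
sample = '我是陳祥輝 😀'

try:
    sample.encode('cp950')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print('cannot encode:', repr(exc.object[exc.start:exc.end]))

# errors='replace' substitutes '?' for each character that cannot be encoded.
replaced = sample.encode('cp950', errors='replace')
print(replaced.decode('cp950'))
```

`open()` accepts the same `errors` argument, so `open(..., encoding='cp950', errors='replace')` would apply this policy while writing a file. Whether silently replacing characters is acceptable depends on the data.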

#### <font color=#0000FF>Converting from cp950 (Big-5) to utf-8-sig (UTF-8 with BOM)</font>

In [21]:
fname = 'AirQty2016-06-01_CP950.csv'
# Read as Big-5 (cp950), write as UTF-8 with BOM (utf-8-sig).
with open(file=fname, mode='rt', encoding='cp950') as inpf, \
     open(file='out_utf-8-sig.txt', mode='wt', encoding='utf-8-sig') as outf:
    outf.write(inpf.read())   # the output file encodes the text itself

#### <font color=#0000FF>Converting from cp950 (Big-5) to utf-8 (UTF-8 without BOM)</font>

In [22]:
fname = 'AirQty2016-06-01_CP950.csv'
# Read as Big-5 (cp950), write as UTF-8 without BOM (utf-8).
with open(file=fname, mode='rt', encoding='cp950') as inpf, \
     open(file='out_utf-8.txt', mode='wt', encoding='utf-8') as outf:
    outf.write(inpf.read())   # the output file encodes the text itself
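The three conversions above differ only in the file names and the two encodings, so they can be folded into one function. This is a minimal sketch, not the course's reference solution: `convert_file` is a hypothetical name, and copying in chunks with `shutil.copyfileobj` and passing `newline=''` are choices made here, not taken from the notebook.

```python
import os
import shutil
import tempfile

def convert_file(src, dest, src_code, dest_code):
    """Re-encode the text file src (in src_code) into dest (in dest_code)."""
    # newline='' passes line endings through unchanged in both directions.
    with open(src, mode='rt', encoding=src_code, newline='') as inpf, \
         open(dest, mode='wt', encoding=dest_code, newline='') as outf:
        shutil.copyfileobj(inpf, outf)   # copies str chunks, not the whole file at once

# Demo on a throwaway file.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'in_cp950.txt')
dest = os.path.join(tmpdir, 'out_utf-8-sig.txt')
with open(src, 'wb') as f:
    f.write('我是陳祥輝\r\n'.encode('cp950'))

convert_file(src, dest, 'cp950', 'utf-8-sig')
with open(dest, 'rb') as f:
    converted = f.read()
print(converted)
```

Reading the result back in binary mode makes the BOM and the preserved `\r\n` line ending visible.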

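Assignment 7: batch transcoding
codeConverter(fromPath, toPath, srcCode, destCode)

* fromPath : source directory
* toPath : output directory
* srcCode : encoding of the source files
* destCode : encoding of the output files <br>
【Output file name】AirQty2016-06-01_destCode.csv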

In [None]:
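# Single-file draft: the variables take the values this cell already hard-coded.
# Here fromPath and toPath are file names; the assignment asks for directories.
fromPath = 'AirQty2016-06-01_CP950.csv'
toPath = 'out_utf-8.txt'
srcCode = 'cp950'
destCode = 'utf-8'
with open(file=fromPath, mode='rt', encoding=srcCode) as inpf, \
     open(file=toPath, mode='wt', encoding=destCode) as outf:
    outf.write(inpf.read())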
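One possible way to extend the draft to whole directories, as the assignment describes, is sketched below. It is not the official solution and rests on assumptions the assignment does not state: every `*.csv` file in `fromPath` is encoded in `srcCode`, the output stem is the part of the file name before the first underscore, and the function returns the list of files it wrote.

```python
import glob
import os
import tempfile

def codeConverter(fromPath, toPath, srcCode, destCode):
    """Convert every *.csv in fromPath from srcCode to destCode, writing to toPath.

    Assumptions (not from the assignment): all CSV files in fromPath use
    srcCode, and 'AirQty2016-06-01_CP950.csv' maps to
    'AirQty2016-06-01_<destCode>.csv'.
    """
    os.makedirs(toPath, exist_ok=True)
    written = []
    for src in sorted(glob.glob(os.path.join(fromPath, '*.csv'))):
        name = os.path.splitext(os.path.basename(src))[0]
        stem = name.split('_', 1)[0]          # drop the old encoding suffix
        dest = os.path.join(toPath, '{}_{}.csv'.format(stem, destCode))
        with open(src, mode='rt', encoding=srcCode) as inpf, \
             open(dest, mode='wt', encoding=destCode) as outf:
            outf.write(inpf.read())
        written.append(dest)
    return written

# Demo with throwaway directories and made-up file content.
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
content = 'SiteName\n我是陳祥輝\n'
with open(os.path.join(src_dir, 'AirQty2016-06-01_CP950.csv'),
          mode='wt', encoding='cp950') as f:
    f.write(content)

outputs = codeConverter(src_dir, dst_dir, 'cp950', 'utf-8')
print([os.path.basename(p) for p in outputs])
```

If a directory mixes encodings, as the course data directory does, files not in `srcCode` would raise `UnicodeDecodeError` or be decoded incorrectly, so the source files would need to be separated first.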