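### NumPy
Creating arrays from tabular data  
NumPy provides the `genfromtxt` function, which creates an array from tabular data. Once the data is in a NumPy array, it is much easier to process.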

In [1]:
# Use StringIO to simulate a file object in memory
import numpy as np
from io import StringIO  # Python 3 location; the separate StringIO module was Python 2 only
in_data = StringIO("10,20,30\n45,65,23\n33,54,62")

In [2]:
# Read the data with NumPy's genfromtxt and create a NumPy array
data = np.genfromtxt(in_data,dtype=int,delimiter=",")
data

array([[10, 20, 30],
       [45, 65, 23],
       [33, 54, 62]])

In [3]:
# Keep only the columns we need (columns 0 and 1) with usecols
in_data = StringIO("10,20,30\n45,65,23\n33,54,62")
data = np.genfromtxt(in_data,dtype=int,delimiter=",",usecols=(0,1))
data

array([[10, 20],
       [45, 65],
       [33, 54]])

In [4]:
# Assign column names; the result is a structured array
in_data = StringIO("10,20,30\n45,65,23\n33,54,62")
data = np.genfromtxt(in_data,dtype=int,delimiter=',',names="a,b,c")
data

array([(10, 20, 30), (45, 65, 23), (33, 54, 62)], 
      dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4')])

The first row can also be used as the column names by passing `names=True`.

In [5]:
in_data = StringIO("a,b,c\n10,20,30\n45,65,23\n33,54,62")
data = np.genfromtxt(in_data,dtype=int,delimiter=',',names=True)
data

array([(10, 20, 30), (45, 65, 23), (33, 54, 62)], 
      dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4')])
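
Once the columns have names, they can be selected by name instead of by position. The following is a minimal sketch that reuses the same sample data; the variable names `col_a` and `first_row` are only illustrative.

```python
import numpy as np
from io import StringIO

# Same sample data as above, with a header row supplying the names
in_data = StringIO("a,b,c\n10,20,30\n45,65,23\n33,54,62")
data = np.genfromtxt(in_data, dtype=int, delimiter=",", names=True)

# Field names taken from the header row
print(data.dtype.names)

# Select a whole column by its name
col_a = data["a"]
print(col_a)

# Select one record, then one field of that record
first_row = data[0]
print(first_row["b"])
```

Indexing with a field name returns an ordinary one-dimensional array, so the usual NumPy operations apply to it.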

### Preprocessing columns

In [6]:
# First, look at the result without any preprocessing
# 30kg,inr2000,31.13,56.45,1
# 45kg,inr3000,34.34,346.2,2

in_data = StringIO('30kg,inr2000,31.13,56.45,1\n45kg,inr3000,34.34,346.2,2')
data = np.genfromtxt(in_data,delimiter=',')
data

array([[    nan,     nan,   31.13,   56.45,    1.  ],
       [    nan,     nan,   34.34,  346.2 ,    2.  ]])

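The first two columns come out as nan, because values such as `30kg` and `inr2000` cannot be parsed as floats. This is not the result we want.

So when handling data like this, we need to preprocess it.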
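
Before writing converters, it can help to find out which columns failed to parse. The following is a small sketch, assuming the default float dtype, so that unparseable fields become nan; the name `bad_cols` is illustrative.

```python
import numpy as np
from io import StringIO

in_data = StringIO("30kg,inr2000,31.13,56.45,1\n45kg,inr3000,34.34,346.2,2")
data = np.genfromtxt(in_data, delimiter=",")

# A column needs preprocessing if any of its entries is nan
bad_cols = np.where(np.isnan(data).any(axis=0))[0]
print(bad_cols)
```

The indices in `bad_cols` are the columns that need a converter.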

In [7]:
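import numpy as np
from io import StringIO

# Define a dataset
in_data = StringIO('30kg,inr2000,31.13,56.45,1\n45kg,inr3000,34.34,346.2,2')

# Converters that strip the unit text before converting to float
strip_func_1 = lambda x:float(x.rstrip('kg'))
strip_func_2 = lambda x:float(x.lstrip('inr'))

# Build a dictionary that maps column index to converter function
convert_funcs = {0:strip_func_1,1:strip_func_2}

# Pass the converters to genfromtxt; an explicit encoding makes them receive str rather than bytes
data = np.genfromtxt(in_data,delimiter=',',converters=convert_funcs,encoding='utf-8')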
data

array([[  3.00000000e+01,   2.00000000e+03,   3.11300000e+01,
          5.64500000e+01,   1.00000000e+00],
       [  4.50000000e+01,   3.00000000e+03,   3.43400000e+01,
          3.46200000e+02,   2.00000000e+00]])
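
Note that `rstrip('kg')` and `lstrip('inr')` remove any run of the listed characters, not the literal text, which happens to be fine for this data. The following is a sketch of converters that remove only an exact suffix or prefix; `strip_suffix` and `strip_prefix` are hypothetical helper names, not part of NumPy.

```python
import numpy as np
from io import StringIO

def strip_suffix(suffix):
    """Return a converter that drops one literal suffix, then parses a float."""
    def convert(text):
        text = text.strip()
        if text.endswith(suffix):
            text = text[:len(text) - len(suffix)]
        return float(text)
    return convert

def strip_prefix(prefix):
    """Return a converter that drops one literal prefix, then parses a float."""
    def convert(text):
        text = text.strip()
        if text.startswith(prefix):
            text = text[len(prefix):]
        return float(text)
    return convert

in_data = StringIO("30kg,inr2000,31.13,56.45,1\n45kg,inr3000,34.34,346.2,2")
data = np.genfromtxt(
    in_data,
    delimiter=",",
    converters={0: strip_suffix("kg"), 1: strip_prefix("inr")},
    encoding="utf-8",  # converters receive str rather than bytes
)
print(data)
```

Named functions are also easier to reuse and test than lambdas when many columns need the same treatment.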

When the data contains missing values, a converter can supply a default value:

In [8]:
in_data = StringIO('10,20,30\n23,,34\n36,31,76')
# Replace an empty field in column 1 with -1
miss_func = lambda x:float(x.strip() or -1)
data = np.genfromtxt(in_data,delimiter=',',converters={1:miss_func},encoding='utf-8')
data

array([[ 10.,  20.,  30.],
       [ 23.,  -1.,  34.],
       [ 36.,  31.,  76.]])
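
As an alternative to a converter, `genfromtxt` also accepts a `filling_values` argument that replaces missing fields. The following is a minimal sketch on the same data, assuming that -1 is an acceptable placeholder in every column; the name `filled` is illustrative.

```python
import numpy as np
from io import StringIO

in_data = StringIO("10,20,30\n23,,34\n36,31,76")

# Empty fields count as missing by default and are replaced by filling_values
filled = np.genfromtxt(in_data, delimiter=",", filling_values=-1)
print(filled)
```

This applies the same default to all columns, whereas a converter only affects the column it is registered for.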