## Getting to Know pandas

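First, pandas centers on two data structures:
- Series

  A one-dimensional sequence that wraps a NumPy 1-D array. Whereas a NumPy array is addressed by integer position, a Series can carry a custom index, such as meaningful strings.
- DataFrame

  A two-dimensional table that wraps NumPy 2-D data. Whereas a NumPy array is addressed by integer position, a DataFrame can use a custom row index and custom column names (columns).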
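
To make the contrast between positional and label-based access concrete, here is a minimal sketch. The fruit labels and numbers are invented for illustration and do not come from the notebook's data.

```python
import numpy as np
import pandas as pd

# A plain NumPy array is addressed by integer position only.
prices_array = np.array([3.5, 2.0, 4.25])

# The same values wrapped in a Series, addressed by (hypothetical) fruit labels.
prices = pd.Series(prices_array, index=["apple", "banana", "cherry"])

# A DataFrame adds column names on top of the row index.
stock = pd.DataFrame(
    {"price": [3.5, 2.0, 4.25], "quantity": [10, 25, 7]},
    index=["apple", "banana", "cherry"],
)

banana_by_position = prices_array[1]               # positional lookup in NumPy
banana_by_label = prices["banana"]                 # label lookup in a Series
cherry_quantity = stock.loc["cherry", "quantity"]  # row label plus column name
```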

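Beyond the index and column labels, both structures come with many handy methods, for example:
- describe: quickly computes descriptive statistics (count, mean, standard deviation, median, quartiles, and so on)
- unique: lists the distinct values in the data (for instance, every value a feature can take)
- value_counts: counts how often each value occurs
- hist: draws a histogram directly
- plot: a thin wrapper around matplotlib for quick, simple plots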
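
The first three methods in the list can be tried on a tiny example. This is only a sketch with made-up values; `hist` and `plot` are left out because they produce figures rather than values.

```python
import pandas as pd

# Hypothetical categorical data.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

distinct_colors = colors.unique()     # distinct values, in order of first appearance
color_counts = colors.value_counts()  # occurrences per value, most frequent first

# Hypothetical numeric data.
scores = pd.Series([1.0, 2.0, 3.0, 4.0])
summary = scores.describe()           # count, mean, std, min, 25%, 50%, 75%, max
```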

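Second, pandas provides many useful utilities for data handling, for example:
- Convenient I/O: read common data files such as Excel and CSV directly
- SQL-like features: groupby, join, and similar operations
- Excel-like features: pivot tables
- Date handling: intuitive enough to read almost like natural language, so there is no need to wrestle with Python's built-in date libraries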
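
The utilities listed above can be sketched on a small, invented sales table. The column names, regions, and manager names are assumptions made for this example, and file I/O is omitted so that the snippet needs no external files.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-19", "2019-08-19", "2019-08-20", "2019-08-20"]),
    "region": ["north", "south", "north", "south"],
    "amount": [100, 150, 120, 80],
})
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ann", "Bo"]})

# SQL-style aggregation: total amount per region.
total_by_region = sales.groupby("region")["amount"].sum()

# SQL-style join: attach the manager of each region to every sale.
joined = sales.merge(managers, on="region", how="left")

# Excel-style pivot table: dates as rows, regions as columns.
pivot = sales.pivot_table(index="date", columns="region", values="amount", aggfunc="sum")

# Date handling: shift a timestamp by a readable offset.
next_week = pd.Timestamp("2019-08-19") + pd.Timedelta(days=7)
```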

Next, let's take a quick tour of pandas' core features.

First, import the modules we need:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Object Creation
Passing a list of values lets pandas create a Series with a default integer index:

In [2]:
s = pd.Series([1,2,3,4,np.nan,6,7,8])
s

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    6.0
6    7.0
7    8.0
dtype: float64
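
The output above shows `float64` even though most of the inputs are integers, because `np.nan` is a floating-point value. As a small sketch of how such a Series behaves, the snippet below rebuilds the same data and checks the dtype, the missing-value mask, and the mean, which skips NaN by default.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, np.nan, 6, 7, 8])

dtype_is_float = s.dtype == np.float64  # the NaN forces a float dtype
missing_mask = s.isna()                 # True only where the value is missing
mean_ignoring_nan = s.mean()            # NaN is skipped: 31 / 7
```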

We can also build a DataFrame that uses dates as the row index and labels as the column names:

In [5]:
dates = pd.date_range('20190819', periods = 6)
dates

DatetimeIndex(['2019-08-19', '2019-08-20', '2019-08-21', '2019-08-22',
               '2019-08-23', '2019-08-24'],
              dtype='datetime64[ns]', freq='D')
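
The `freq='D'` in the output is the default calendar-day frequency. As a brief sketch of the `freq` parameter, the snippet below compares it with business-day frequency (`"B"`), which skips weekends; 2019-08-19 is a Monday.

```python
import pandas as pd

daily = pd.date_range("2019-08-19", periods=6)               # calendar days
business = pd.date_range("2019-08-19", periods=6, freq="B")  # weekdays only

last_daily = daily[-1]        # Saturday, 2019-08-24
last_business = business[-1]  # the following Monday, 2019-08-26
```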

In [6]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=tuple('ABCD')) # tuple('ABCD') is short for ('A','B','C','D')
df

| | A | B | C | D |
|---|---|---|---|---|
| 2019-08-19 | -0.252714 | -0.225335 | -0.950349 | 1.324846 |
| 2019-08-20 | 0.880918 | -0.355636 | 0.949529 | 0.512453 |
| 2019-08-21 | 0.809549 | 0.065785 | 0.47902 | -0.586858 |
| 2019-08-22 | 0.505261 | -0.238466 | -0.311095 | 1.132118 |
| 2019-08-23 | 1.337236 | -0.549709 | -0.24829 | 0.766576 |
| 2019-08-24 | 0.66509 | -0.594375 | 0.83388 | 0.964168 |


A DataFrame can also be created from a dict; scalar values are automatically broadcast to the length of the index, much like NumPy broadcasting:

In [7]:
df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20171026'),
                    'C' : pd.Series(1, index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4, dtype = 'int32'),
                    'E' : pd.Categorical(["test", "train", "test", "train"]),
                    'F' : 'foo'})
df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2017-10-26 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2017-10-26 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2017-10-26 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2017-10-26 | 1.0 | 3 | train | foo |


In [8]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [9]:
df2.C  # select a column by attribute access; equivalent to df2['C']

0    1.0
1    1.0
2    1.0
3    1.0
Name: C, dtype: float32
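
Attribute access is convenient, but it only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute. The sketch below uses an invented frame to show two cases where bracket access is the safer choice.

```python
import pandas as pd

# Hypothetical frame: "count" clashes with DataFrame.count, and
# "unit price" is not a valid Python identifier.
frame = pd.DataFrame({"C": [1.0, 2.0], "count": [10, 20], "unit price": [0.5, 0.75]})

same_column = frame.C.equals(frame["C"])  # both forms return the same column
count_is_method = callable(frame.count)   # the method wins over the column
count_column = frame["count"]             # brackets always reach the column
unit_price = frame["unit price"]          # the only way to reach this column
```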

## Viewing Data
For example, to look at the head and tail of a DataFrame:

In [10]:
df.head()  # head defaults to n=5, i.e. the first 5 rows

| | A | B | C | D |
|---|---|---|---|---|
| 2019-08-19 | -0.252714 | -0.225335 | -0.950349 | 1.324846 |
| 2019-08-20 | 0.880918 | -0.355636 | 0.949529 | 0.512453 |
| 2019-08-21 | 0.809549 | 0.065785 | 0.47902 | -0.586858 |
| 2019-08-22 | 0.505261 | -0.238466 | -0.311095 | 1.132118 |
| 2019-08-23 | 1.337236 | -0.549709 | -0.24829 | 0.766576 |


In [11]:
df.tail(3)

| | A | B | C | D |
|---|---|---|---|---|
| 2019-08-22 | 0.505261 | -0.238466 | -0.311095 | 1.132118 |
| 2019-08-23 | 1.337236 | -0.549709 | -0.24829 | 0.766576 |
| 2019-08-24 | 0.66509 | -0.594375 | 0.83388 | 0.964168 |


We can also inspect the index, the column names, and the underlying NumPy data:

In [12]:
df.index

DatetimeIndex(['2019-08-19', '2019-08-20', '2019-08-21', '2019-08-22',
               '2019-08-23', '2019-08-24'],
              dtype='datetime64[ns]', freq='D')

In [13]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [14]:
df.values

array([[-0.25271442, -0.22533495, -0.95034883,  1.32484585],
       [ 0.88091758, -0.35563601,  0.94952919,  0.51245277],
       [ 0.80954862,  0.06578511,  0.47901969, -0.58685848],
       [ 0.50526099, -0.23846574, -0.31109471,  1.13211788],
       [ 1.33723599, -0.54970916, -0.24828969,  0.76657566],
       [ 0.66508961, -0.59437456,  0.83388035,  0.96416846]])
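
Besides the `.values` attribute, recent pandas versions also offer `DataFrame.to_numpy()` for the same purpose. The sketch below uses two small invented frames to show that an all-float frame converts to a `float64` array, while a frame mixing floats and strings falls back to the generic `object` dtype.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})

numeric_array = numeric.to_numpy()  # homogeneous columns keep their dtype
mixed_array = mixed.to_numpy()      # mixed columns need a common dtype: object
```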

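The describe() method gives a quick statistical summary of the data:
- count: number of non-missing values
- mean: mean
- std: standard deviation
- min: minimum
- 25%: first quartile
- 50%: median
- 75%: third quartile
- max: maximum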

In [15]:
df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 0.657556 | -0.316289 | 0.125449 | 0.68555 |
| std | 0.52665 | 0.242353 | 0.747474 | 0.684295 |
| min | -0.252714 | -0.594375 | -0.950349 | -0.586858 |
| 25% | 0.545218 | -0.501191 | -0.295393 | 0.575983 |
| 50% | 0.737319 | -0.297051 | 0.115365 | 0.865372 |
| 75% | 0.863075 | -0.228618 | 0.745165 | 1.090131 |
| max | 1.337236 | 0.065785 | 0.949529 | 1.324846 |
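
To see where the rows of the summary come from, the sketch below recomputes a few of them by hand on a small invented Series. It assumes the default settings of `describe()`, under which `std` is the sample standard deviation (`ddof=1`).

```python
import pandas as pd

column = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
summary = column.describe()

first_quartile = column.quantile(0.25)  # matches the "25%" row
median = column.median()                # matches the "50%" row
third_quartile = column.quantile(0.75)  # matches the "75%" row
sample_std = column.std(ddof=1)         # matches the "std" row
```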


In [16]:
df.T  # transpose

| | 2019-08-19 | 2019-08-20 | 2019-08-21 | 2019-08-22 | 2019-08-23 | 2019-08-24 |
|---|---|---|---|---|---|---|
| A | -0.252714 | 0.880918 | 0.809549 | 0.505261 | 1.337236 | 0.66509 |
| B | -0.225335 | -0.355636 | 0.065785 | -0.238466 | -0.549709 | -0.594375 |
| C | -0.950349 | 0.949529 | 0.47902 | -0.311095 | -0.24829 | 0.83388 |
| D | 1.324846 | 0.512453 | -0.586858 | 1.132118 | 0.766576 | 0.964168 |


Sort along an axis. Note that this sorts by the axis labels themselves, not by the data; for example, sorting by column name:

In [17]:
df.sort_index(axis = 1, ascending = False)

| | D | C | B | A |
|---|---|---|---|---|
| 2019-08-19 | 1.324846 | -0.950349 | -0.225335 | -0.252714 |
| 2019-08-20 | 0.512453 | 0.949529 | -0.355636 | 0.880918 |
| 2019-08-21 | -0.586858 | 0.47902 | 0.065785 | 0.809549 |
| 2019-08-22 | 1.132118 | -0.311095 | -0.238466 | 0.505261 |
| 2019-08-23 | 0.766576 | -0.24829 | -0.549709 | 1.337236 |
| 2019-08-24 | 0.964168 | 0.83388 | -0.594375 | 0.66509 |


In [21]:
df.sort_index(axis = 0, ascending = True)

| | A | B | C | D |
|---|---|---|---|---|
| 2019-08-19 | -0.252714 | -0.225335 | -0.950349 | 1.324846 |
| 2019-08-20 | 0.880918 | -0.355636 | 0.949529 | 0.512453 |
| 2019-08-21 | 0.809549 | 0.065785 | 0.47902 | -0.586858 |
| 2019-08-22 | 0.505261 | -0.238466 | -0.311095 | 1.132118 |
| 2019-08-23 | 1.337236 | -0.549709 | -0.24829 | 0.766576 |
| 2019-08-24 | 0.66509 | -0.594375 | 0.83388 | 0.964168 |
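
Because `df` already has its rows in date order, sorting on `axis=0` leaves it visibly unchanged. The sketch below uses a small invented frame whose row and column labels start out unordered, so the effect of each axis is easier to see.

```python
import pandas as pd

# Hypothetical frame with unordered row labels and column names.
frame = pd.DataFrame({"B": [1, 2, 3], "A": [4, 5, 6]}, index=["z", "x", "y"])

rows_sorted = frame.sort_index(axis=0)     # reorder rows by row label
columns_sorted = frame.sort_index(axis=1)  # reorder columns by column name
```

In both cases the data travels with its label: the row labeled `x` keeps its values, it only moves to a new position.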


We can also sort by the data values:

In [18]:
df.sort_values(by = 'B')

| | A | B | C | D |
|---|---|---|---|---|
| 2019-08-24 | 0.66509 | -0.594375 | 0.83388 | 0.964168 |
| 2019-08-23 | 1.337236 | -0.549709 | -0.24829 | 0.766576 |
| 2019-08-20 | 0.880918 | -0.355636 | 0.949529 | 0.512453 |
| 2019-08-22 | 0.505261 | -0.238466 | -0.311095 | 1.132118 |
| 2019-08-19 | -0.252714 | -0.225335 | -0.950349 | 1.324846 |
| 2019-08-21 | 0.809549 | 0.065785 | 0.47902 | -0.586858 |
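
`sort_values` also accepts a sort direction and more than one key. The following is a minimal sketch on an invented frame; the team and points columns are assumptions made for the example.

```python
import pandas as pd

scores = pd.DataFrame(
    {"team": ["b", "a", "b", "a"], "points": [3, 7, 9, 1]},
    index=["w", "x", "y", "z"],
)

# Single key, largest value first.
by_points_desc = scores.sort_values(by="points", ascending=False)

# Two keys: team ascending, then points descending within each team.
by_team_then_points = scores.sort_values(by=["team", "points"], ascending=[True, False])
```

Both calls return a new DataFrame and leave `scores` itself in its original order.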
