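pandas is built on top of NumPy and makes NumPy-centric applications simpler to write.

pandas provides three main data structures:

Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both resemble Python's built-in list, with one difference: a list can hold elements of different data types, whereas an array or a Series stores its elements under a single data type, which uses memory more efficiently and speeds up computation.

DataFrame: a two-dimensional, table-like data structure. Much of its functionality resembles R's data.frame. A DataFrame can be thought of as a container of Series. Most of what follows focuses on the DataFrame.

Panel: a three-dimensional array that can be thought of as a container of DataFrames. Panel was deprecated in pandas 0.20 and has been removed from later releases.

pandas documentation, for more features see http://pandas-docs.github.io/pandas-docs-travis/index.html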
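
As a small illustration of the single-data-type point above, the sketch below builds a few Series and prints their dtypes. The variable names (`mixed`, `ints`, `floats`, `objs`, `frame`) are made up for this example and do not come from the notebook.

```python
import numpy as np
import pandas as pd

mixed = [1, "two", 3.0]            # a list can hold values of different types
ints = pd.Series([1, 2, 3])        # all integers: an integer dtype
floats = pd.Series([1, 2.5, 3])    # ints and floats are upcast to float64
objs = pd.Series(mixed)            # mixed values fall back to the generic object dtype

print(ints.dtype, floats.dtype, objs.dtype)

# A DataFrame can be seen as a container of Series: each column is one Series.
frame = pd.DataFrame({"ints": ints, "floats": floats})
print(type(frame["ints"]))
```

A Series built from mixed values still has one dtype (`object`), but it loses most of the memory and speed benefits of a numeric dtype.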


In [6]:
# Import the libraries first
import pandas as pd
import numpy as np

# Series

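A Series consists of a set of data (of any NumPy data type) together with an associated set of labels, its index. Data alone is enough to produce the simplest Series; for example, passing a list creates one: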


In [3]:
s=pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

# Getting the index of a Series:


In [4]:
s.index

RangeIndex(start=0, stop=6, step=1)

# DataFrame

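**A DataFrame is a table-like data structure containing an ordered collection of columns. Every value within a column has the same data type, while different columns may have different types (numeric, string, boolean, and so on). In database terms, each row of a DataFrame is a record identified by an element of the index, and each column is a field, that is, an attribute of the record. A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that all share the same index.**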
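
The "dictionary of Series sharing one index" view can be sketched as follows. The Series names and values (`price`, `label`, `table`) are invented for illustration. The two Series list their labels in a different order, and pandas aligns them by label rather than by position.

```python
import pandas as pd

price = pd.Series([9.5, 9.7, 9.6], index=["a", "b", "c"])
label = pd.Series(["up", "down", "up"], index=["c", "a", "b"])  # different order

# Each dictionary key becomes a column; rows are matched on the index labels.
table = pd.DataFrame({"price": price, "label": label})

print(table)
print(table.dtypes)        # each column keeps its own dtype
print(table.loc["a"])      # one record (row), returned as a Series
```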

# Creating a DataFrame from a NumPy array, with a datetime index and column names:

In [5]:
dates=pd.date_range('20130101',periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [6]:
df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,0.12847,-0.542391,1.515312,0.952603
2013-01-02,-0.772007,0.332301,0.60637,-0.675859
2013-01-03,0.382476,0.447688,0.163751,-0.709277
2013-01-04,0.753636,-0.168071,1.239027,-0.880015
2013-01-05,-0.14769,-0.157131,-0.662839,-2.052546
2013-01-06,1.176746,-0.724703,0.511843,1.27809


# Creating a DataFrame from a dictionary of objects that can be converted to series-like structures:

In [7]:
df2=pd.DataFrame({'A':1.,
                 'B':pd.Timestamp('20130102'),
                 'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                 'D':np.array([3]*4,dtype='int32'),
                 'E':pd.Categorical(['test','train','test','train']),
                 'F':'foo'})
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


## Viewing the data type of each column:

In [8]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

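## Viewing data
#### np.round(np.random.uniform(9,10,10),2) generates 10 random values between 9 and 10, rounded to two decimals

The data is random and the cells were run in different sessions, so the values printed by different cells below do not always agree.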

In [36]:
df=pd.DataFrame({'open':np.round(np.random.uniform(9,10,10),2),
                 'high':np.round(np.random.uniform(9,10,10),2),
                 'low':np.round(np.random.uniform(9,10,10),2),
                 'close':np.round(np.random.uniform(9,10,10),2)},
                index=pd.date_range('20170601',periods=10))
df

Unnamed: 0,open,high,low,close
2017-06-01,9.73,9.93,9.25,9.66
2017-06-02,9.06,9.27,9.91,9.45
2017-06-03,9.49,9.74,9.71,9.56
2017-06-04,9.69,9.72,9.51,9.35
2017-06-05,9.52,9.61,9.24,9.55
2017-06-06,9.99,9.14,9.89,9.89
2017-06-07,9.53,9.96,9.25,9.71
2017-06-08,9.03,9.08,9.06,9.83
2017-06-09,9.54,9.2,9.2,9.17
2017-06-10,9.78,9.44,9.52,9.82


### Viewing the first 3 rows

In [10]:
df.head(3)  

Unnamed: 0,open,high,low,close
2017-06-01,9.54,9.33,9.48,9.57
2017-06-02,9.34,9.95,9.51,9.31
2017-06-03,9.1,9.6,9.49,9.68


### Viewing the last 3 rows

In [11]:
df.tail(3)

Unnamed: 0,open,high,low,close
2017-06-08,9.86,9.37,9.14,9.59
2017-06-09,9.08,9.58,9.8,9.64
2017-06-10,9.88,10.0,9.09,9.99


### Showing the index, the columns, and the underlying NumPy data

In [12]:
df.index

DatetimeIndex(['2017-06-01', '2017-06-02', '2017-06-03', '2017-06-04',
               '2017-06-05', '2017-06-06', '2017-06-07', '2017-06-08',
               '2017-06-09', '2017-06-10'],
              dtype='datetime64[ns]', freq='D')

In [13]:
df.columns

Index(['open', 'high', 'low', 'close'], dtype='object')

In [14]:
df.values

array([[ 9.54,  9.33,  9.48,  9.57],
       [ 9.34,  9.95,  9.51,  9.31],
       [ 9.1 ,  9.6 ,  9.49,  9.68],
       [ 9.5 ,  9.23,  9.27,  9.32],
       [ 9.43,  9.97,  9.43,  9.48],
       [ 9.49,  9.23,  9.22,  9.53],
       [ 9.09,  9.63,  9.94,  9.17],
       [ 9.86,  9.37,  9.14,  9.59],
       [ 9.08,  9.58,  9.8 ,  9.64],
       [ 9.88, 10.  ,  9.09,  9.99]])

### describe() gives a quick statistical summary of the data

In [15]:
df.describe()

Unnamed: 0,open,high,low,close
count,10.0,10.0,10.0,10.0
mean,9.431,9.589,9.437,9.528
std,0.291183,0.301679,0.274552,0.229966
min,9.08,9.23,9.09,9.17
25%,9.16,9.34,9.2325,9.36
50%,9.46,9.59,9.455,9.55
75%,9.53,9.87,9.505,9.6275
max,9.88,10.0,9.94,9.99


### Sorting a table with sort_values and sort_index

In [16]:
df.sort_index(axis=1,ascending=False)
# axis takes one of two values, 0 or 1, matching the two indexes of df:
# the dates in the leftmost column and the column labels in the top row (ABCDE in the video's example).
# axis=0 sorts by the index on the left; ascending=False means descending order, ascending=True ascending order.
# After sort_index(axis=0,ascending=False) the dates on the left are in descending order.
# axis=1 sorts by the labels in the top row, with ascending working the same way.
# After sort_index(axis=1,ascending=False) the column labels in the top row are in descending order.
# This explanation refers to the example used in the video; other DataFrames may differ.

In [38]:
df.sort_values(by='open',ascending=False)

Unnamed: 0,open,high,low,close
2017-06-06,9.99,9.14,9.89,9.89
2017-06-10,9.78,9.44,9.52,9.82
2017-06-01,9.73,9.93,9.25,9.66
2017-06-04,9.69,9.72,9.51,9.35
2017-06-09,9.54,9.2,9.2,9.17
2017-06-07,9.53,9.96,9.25,9.71
2017-06-05,9.52,9.61,9.24,9.55
2017-06-03,9.49,9.74,9.71,9.56
2017-06-02,9.06,9.27,9.91,9.45
2017-06-08,9.03,9.08,9.06,9.83
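
To make the axis and ascending arguments concrete, here is a minimal sketch on a three-row table with fixed values. The table `df_sort` and its numbers are hypothetical and chosen only so that the effect of each call is easy to see.

```python
import pandas as pd

df_sort = pd.DataFrame(
    {"close": [9.6, 9.3, 9.2], "open": [9.5, 9.1, 9.8]},
    index=pd.date_range("20170601", periods=3),
)

rows_desc = df_sort.sort_index(axis=0, ascending=False)   # latest date first
cols_desc = df_sort.sort_index(axis=1, ascending=False)   # column labels in descending order
by_open = df_sort.sort_values(by="open")                   # rows ordered by the open values

print(rows_desc)
print(cols_desc)
print(by_open)
```

None of these calls changes `df_sort` itself; each returns a new, sorted DataFrame.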


# Selecting data

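### (1) Selecting data by subscript
df['open'] and df.open are equivalent: both return the column of df named open, as a Series.

df[0:3] and df['2017-06-01':'2017-06-05'] select rows (records) of the DataFrame. As with a list, positions start at 0 and a positional slice is half-open, so [0:3] selects the three records at positions 0 to 2. Equivalently, you can slice by a start label and an end label, as in df['a':'b']. Note that a label slice also includes the row at the end label. Both kinds of slice return a DataFrame. See the examples below.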

Selecting a single column:

In [17]:
df['open']

2017-06-01    9.54
2017-06-02    9.34
2017-06-03    9.10
2017-06-04    9.50
2017-06-05    9.43
2017-06-06    9.49
2017-06-07    9.09
2017-06-08    9.86
2017-06-09    9.08
2017-06-10    9.88
Freq: D, Name: open, dtype: float64

In [18]:
df[['open','close']]        # note that the column names are passed as a list

Unnamed: 0,open,close
2017-06-01,9.54,9.57
2017-06-02,9.34,9.31
2017-06-03,9.1,9.68
2017-06-04,9.5,9.32
2017-06-05,9.43,9.48
2017-06-06,9.49,9.53
2017-06-07,9.09,9.17
2017-06-08,9.86,9.59
2017-06-09,9.08,9.64
2017-06-10,9.88,9.99


In [19]:
df[0:3]

Unnamed: 0,open,high,low,close
2017-06-01,9.54,9.33,9.48,9.57
2017-06-02,9.34,9.95,9.51,9.31
2017-06-03,9.1,9.6,9.49,9.68


In [20]:
df['2017-06-01':'2017-06-03']

Unnamed: 0,open,high,low,close
2017-06-01,9.54,9.33,9.48,9.57
2017-06-02,9.34,9.95,9.51,9.31
2017-06-03,9.1,9.6,9.49,9.68


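## (2.1) Selecting data by label:
###### df.loc[row_labels, column_labels]
###### df.loc['a':'b']          selects the rows from label a through label b (inclusive)
###### df.loc[:,'open']      selects the open column
###### The first argument of df.loc is the row label and the second is the column label (optional; defaults to all columns). Each argument can be a single label, a list, or a slice. If both are lists or slices the result is a DataFrame; if exactly one is a single label the result is a Series; if both are single labels the result is a single value.
###### loc is short for location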

In [21]:
df.loc['2017-06-01','open']

9.54

In [22]:
df.loc['2017-06-01':'2017-06-03']

Unnamed: 0,open,high,low,close
2017-06-01,9.54,9.33,9.48,9.57
2017-06-02,9.34,9.95,9.51,9.31
2017-06-03,9.1,9.6,9.49,9.68


In [23]:
df.loc[:,'open']

2017-06-01    9.54
2017-06-02    9.34
2017-06-03    9.10
2017-06-04    9.50
2017-06-05    9.43
2017-06-06    9.49
2017-06-07    9.09
2017-06-08    9.86
2017-06-09    9.08
2017-06-10    9.88
Freq: D, Name: open, dtype: float64

In [58]:
df.loc['2017-06-01':'2017-06-05','open']

2017-06-01    9.57
2017-06-02    9.12
2017-06-03    9.80
2017-06-04    9.61
2017-06-05    9.86
Freq: D, Name: open, dtype: float64
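
The return types described for df.loc can be checked with a short sketch. The table `df_loc` below is a made-up stand-in with fixed values, not the notebook's random data.

```python
import pandas as pd

df_loc = pd.DataFrame(
    {"open": [9.5, 9.1, 9.8], "close": [9.6, 9.3, 9.2]},
    index=pd.date_range("20170601", periods=3),
)

value = df_loc.loc["2017-06-01", "open"]      # single label + single label -> one value
column = df_loc.loc[:, "open"]                # slice + single label -> Series
block = df_loc.loc["2017-06-01":"2017-06-02", ["open", "close"]]   # slice + list -> DataFrame

print(type(value), type(column), type(block))
print(block)                                  # the end label 2017-06-02 is included
```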

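## (2.2) Selecting data by position:
###### df.iloc[row_position, column_position]
###### df.iloc[1,1] # the value in the second row, second column; returns a single value
###### df.iloc[[0,2],:] # the first and third rows
###### df.iloc[0:2,:] # from the first row up to, but not including, the third row
###### df.iloc[:,1] # the second column of every record; returns a Series
###### df.iloc[1,:] # the second row; returns a Series
###### PS: iloc is short for integer location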

In [30]:
df.iloc[1,1]      # second row, second column; returns a single value

9.95

In [32]:
df.iloc[[0,2],:]   # all columns of the first and third rows

Unnamed: 0,open,high,low,close
2017-06-01,9.54,9.33,9.48,9.57
2017-06-03,9.1,9.6,9.49,9.68


In [33]:
df.iloc[0:2,:]    # the first and second rows

Unnamed: 0,open,high,low,close
2017-06-01,9.54,9.33,9.48,9.57
2017-06-02,9.34,9.95,9.51,9.31


In [34]:
df.iloc[:,1]      # all rows of the second column, i.e. the second column

2017-06-01     9.33
2017-06-02     9.95
2017-06-03     9.60
2017-06-04     9.23
2017-06-05     9.97
2017-06-06     9.23
2017-06-07     9.63
2017-06-08     9.37
2017-06-09     9.58
2017-06-10    10.00
Freq: D, Name: high, dtype: float64

In [35]:
df.iloc[1,:]      # all columns of the second row, i.e. the second row

open     9.34
high     9.95
low      9.51
close    9.31
Name: 2017-06-02 00:00:00, dtype: float64

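###### A more general way to slice was ix, which decided from the type of the index values whether to slice by position or by label. As the warnings in the output show, .ix is deprecated, and it was removed in pandas 1.0; use .loc or .iloc instead.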

In [43]:
df.ix[1,1]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.


9.95

In [42]:
df.ix[[1,4,2],1:3]    # fancy indexing


Unnamed: 0,high,low
2017-06-02,9.95,9.51
2017-06-05,9.97,9.43
2017-06-03,9.6,9.49


In [44]:
df.ix['2017-06-01':'2017-06-04'] 


Unnamed: 0,open,high,low,close
2017-06-01,9.54,9.33,9.48,9.57
2017-06-02,9.34,9.95,9.51,9.31
2017-06-03,9.1,9.6,9.49,9.68
2017-06-04,9.5,9.23,9.27,9.32
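
On a pandas version without .ix, the three calls above can be written with .iloc and .loc. The following is a sketch on a hypothetical table `df_ix` with fixed values; the comments name the .ix call each line stands in for.

```python
import pandas as pd

df_ix = pd.DataFrame(
    {"open": [9.5, 9.1, 9.8, 9.4, 9.0],
     "high": [9.9, 9.6, 9.9, 9.7, 9.3],
     "low": [9.2, 9.0, 9.5, 9.1, 8.9]},
    index=pd.date_range("20170601", periods=5),
)

single = df_ix.iloc[1, 1]                          # in place of df.ix[1,1]
fancy = df_ix.iloc[[1, 4, 2], 1:3]                 # in place of df.ix[[1,4,2],1:3]
by_label = df_ix.loc["2017-06-01":"2017-06-04"]    # in place of df.ix['2017-06-01':'2017-06-04']

print(single)
print(fancy)
print(by_label)
```

Choosing .iloc or .loc explicitly removes the guesswork that .ix performed, which is the reason the warning gives for the deprecation.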


## (3) Slicing data with boolean conditions
###### df[condition]
###### df[df.one>=2]                             a single condition
###### df[(df.one>=1)&(df.one<3)]         several conditions combined

In [63]:
df[df.open>9.12]       # rows where open is greater than 9.12

Unnamed: 0,open,high,low,close
2017-06-01,9.14,9.07,9.36,9.95
2017-06-02,9.41,9.31,9.84,9.07
2017-06-03,9.69,9.2,9.02,9.14
2017-06-04,9.19,9.84,9.53,9.08
2017-06-05,9.35,9.84,9.17,9.74
2017-06-06,9.15,9.95,9.19,9.99
2017-06-07,9.76,9.59,9.5,9.36
2017-06-08,9.76,9.57,9.4,9.78


In [65]:
df[(df.open>9.12)&(df.close<9.53)]    # rows where open is greater than 9.12 and close is less than 9.53

Unnamed: 0,open,high,low,close
2017-06-02,9.41,9.31,9.84,9.07
2017-06-03,9.69,9.2,9.02,9.14
2017-06-04,9.19,9.84,9.53,9.08
2017-06-07,9.76,9.59,9.5,9.36
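
Conditions can be combined with `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch, using an invented table `df_bool`:

```python
import pandas as pd

df_bool = pd.DataFrame({"open": [9.0, 9.3, 9.6, 9.9],
                        "close": [9.8, 9.4, 9.7, 9.1]})

both = df_bool[(df_bool["open"] > 9.12) & (df_bool["close"] < 9.53)]    # AND
either = df_bool[(df_bool["open"] > 9.5) | (df_bool["close"] > 9.5)]    # OR
negated = df_bool[~(df_bool["open"] > 9.12)]                            # NOT

print(both)
print(either)
print(negated)
```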


##### Using a condition to filter and modify data.

In [66]:
df[df>10]

Unnamed: 0,open,high,low,close
2017-06-01,,,,
2017-06-02,,,,
2017-06-03,,,,
2017-06-04,,,,
2017-06-05,,,,
2017-06-06,,,,
2017-06-07,,,,
2017-06-08,,,,
2017-06-09,,,,
2017-06-10,,,,


In [69]:
df[df>9.12]=0     # set every value greater than 9.12 to 0

In [70]:
df

Unnamed: 0,open,high,low,close
2017-06-01,0.0,9.07,0.0,0.0
2017-06-02,0.0,0.0,0.0,9.07
2017-06-03,0.0,0.0,9.02,0.0
2017-06-04,0.0,0.0,0.0,9.08
2017-06-05,0.0,0.0,0.0,0.0
2017-06-06,0.0,0.0,0.0,0.0
2017-06-07,0.0,0.0,0.0,0.0
2017-06-08,0.0,0.0,0.0,0.0
2017-06-09,9.12,0.0,0.0,0.0
2017-06-10,9.07,0.0,0.0,0.0
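
The assignment above overwrites df in place. If the original values should be kept, one option is `DataFrame.mask` or `DataFrame.where`, which return a new table. This is a sketch on a hypothetical two-row table `df_cond`, not part of the original notebook.

```python
import pandas as pd

df_cond = pd.DataFrame({"open": [9.0, 9.3], "close": [9.8, 9.1]})

masked = df_cond.mask(df_cond > 9.12, 0)    # values above 9.12 become 0 in the copy
kept = df_cond.where(df_cond > 9.12)        # values that fail the condition become NaN

print(masked)
print(kept)
print(df_cond)                              # the source table is unchanged
```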


## Filtering rows by the values in a given column with isin() (important)

In [74]:
df[df['open'].isin([0.00,9.00])] # a very useful feature: keep the rows whose open value is 0.00 or 9.00

Unnamed: 0,open,high,low,close
2017-06-01,0.0,9.07,0.0,0.0
2017-06-02,0.0,0.0,0.0,9.07
2017-06-03,0.0,0.0,9.02,0.0
2017-06-04,0.0,0.0,0.0,9.08
2017-06-05,0.0,0.0,0.0,0.0
2017-06-06,0.0,0.0,0.0,0.0
2017-06-07,0.0,0.0,0.0,0.0
2017-06-08,0.0,0.0,0.0,0.0
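
isin() works on any column type and can be negated with `~`. The sketch below uses an invented table `df_isin` with a text column `code` to show both directions.

```python
import pandas as pd

df_isin = pd.DataFrame({"code": ["A", "B", "C", "A"],
                        "open": [9.0, 9.3, 0.0, 9.0]})

picked = df_isin[df_isin["open"].isin([0.0, 9.0])]     # open equals 0.0 or 9.0
others = df_isin[~df_isin["code"].isin(["A", "B"])]    # code is neither "A" nor "B"

print(picked)
print(others)
```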


## Setting missing values with np.nan and dropping rows that contain them with dropna()

In [4]:
df=pd.DataFrame({'open':np.round(np.random.uniform(9,10,10),2),
                 'high':np.round(np.random.uniform(9,10,10),2),
                 'low':np.round(np.random.uniform(9,10,10),2),
                 'close':np.round(np.random.uniform(9,10,10),2)},
                index=pd.date_range('20170601',periods=10))
df.iloc[2:4,2]=np.nan
df

Unnamed: 0,open,high,low,close
2017-06-01,9.7,9.47,9.92,9.78
2017-06-02,9.87,9.29,9.41,9.33
2017-06-03,9.66,9.27,,9.29
2017-06-04,9.94,9.76,,9.37
2017-06-05,9.5,9.61,9.05,9.96
2017-06-06,9.23,9.01,9.64,9.12
2017-06-07,9.04,9.46,10.0,9.18
2017-06-08,9.38,9.55,9.29,9.2
2017-06-09,9.76,9.35,9.26,9.58
2017-06-10,9.56,9.45,9.64,9.15


In [90]:
df.dropna()

Unnamed: 0,open,high,low,close
2017-06-01,9.95,9.35,9.19,9.26
2017-06-02,9.51,9.06,9.91,9.27
2017-06-05,9.54,9.28,9.49,9.04
2017-06-06,9.62,9.92,9.75,9.96
2017-06-07,9.36,9.15,9.27,9.06
2017-06-08,9.8,9.85,9.47,9.3
2017-06-09,9.11,9.91,9.73,9.54
2017-06-10,9.19,9.94,9.7,9.71
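
dropna() takes a few arguments that control what counts as "missing enough" to drop. A minimal sketch on a hypothetical table `df_na`:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({"open": [9.7, np.nan, 9.6],
                      "low": [9.9, np.nan, np.nan]})

any_missing = df_na.dropna()                  # drop rows with at least one NaN
all_missing = df_na.dropna(how="all")         # drop rows only if every value is NaN
by_column = df_na.dropna(subset=["open"])     # only consider NaN in the open column
no_nan_cols = df_na.dropna(axis=1)            # drop columns instead of rows

print(any_missing)
print(all_missing)
print(by_column)
print(no_nan_cols.shape)
```

Like most pandas methods, dropna() returns a new table and leaves the original untouched.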


## Filling missing values

In [6]:
df.fillna(value=0)

Unnamed: 0,open,high,low,close
2017-06-01,9.7,9.47,9.92,9.78
2017-06-02,9.87,9.29,9.41,9.33
2017-06-03,9.66,9.27,0.0,9.29
2017-06-04,9.94,9.76,0.0,9.37
2017-06-05,9.5,9.61,9.05,9.96
2017-06-06,9.23,9.01,9.64,9.12
2017-06-07,9.04,9.46,10.0,9.18
2017-06-08,9.38,9.55,9.29,9.2
2017-06-09,9.76,9.35,9.26,9.58
2017-06-10,9.56,9.45,9.64,9.15
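
A constant such as 0 is not always a sensible fill value for prices. Two common alternatives are the column mean and the last valid observation; the sketch below shows both on a made-up one-column table `df_fill`.

```python
import numpy as np
import pandas as pd

df_fill = pd.DataFrame({"low": [9.0, np.nan, np.nan, 10.0]},
                       index=pd.date_range("20170601", periods=4))

with_mean = df_fill.fillna(df_fill.mean())    # each column's NaN replaced by that column's mean
carried = df_fill.ffill()                     # last valid value carried forward

print(with_mean)
print(carried)
```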


## Testing whether each value is NaN, which returns a boolean table:

In [8]:
df.isnull()

Unnamed: 0,open,high,low,close
2017-06-01,False,False,False,False
2017-06-02,False,False,False,False
2017-06-03,False,False,True,False
2017-06-04,False,False,True,False
2017-06-05,False,False,False,False
2017-06-06,False,False,False,False
2017-06-07,False,False,False,False
2017-06-08,False,False,False,False
2017-06-09,False,False,False,False
2017-06-10,False,False,False,False
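
The boolean table is most useful when it is summarised. As a sketch (with an invented table `df_null`), summing it counts the missing values per column, and `any(axis=1)` picks out the rows that contain at least one.

```python
import numpy as np
import pandas as pd

df_null = pd.DataFrame({"open": [9.7, 9.9, 9.6],
                        "low": [9.9, np.nan, np.nan]})

per_column = df_null.isnull().sum()                       # True counts as 1
rows_with_nan = df_null[df_null.isnull().any(axis=1)]     # rows with any missing value

print(per_column)
print(rows_with_nan)
```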


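## Function application and mapping
###### The most common methods were introduced above. There are many others that you can explore on your own; some are listed here for reference:
######  count number of non-NA values
######  describe summary statistics for a Series or for each DataFrame column
######  min, max minimum and maximum values
######  argmin, argmax integer index positions of the minimum and maximum values
######  idxmin, idxmax index labels of the minimum and maximum values
######  quantile sample quantile (from 0 to 1)
######  sum sum of the values
######  mean mean of the values
######  median arithmetic median (50% quantile) of the values
######  mad mean absolute deviation from the mean (deprecated in pandas 1.5 and removed in 2.0)
######  var sample variance of the values
######  std sample standard deviation of the values
######  skew sample skewness (third moment) of the values
######  kurt sample kurtosis (fourth moment) of the values
######  cumsum cumulative sum of the values
######  cummin, cummax cumulative minimum and cumulative maximum of the values
######  cumprod cumulative product of the values
######  diff first-order difference (useful for time series)
######  pct_change percentage change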
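
A few of the listed methods, applied to a short hypothetical price series `close`, to show what each one returns:

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 9.0, 12.0],
                  index=pd.date_range("20170601", periods=4))

print(close.idxmax())        # index label of the largest value
print(close.cumsum())        # running total
print(close.diff())          # difference from the previous value; the first entry is NaN
print(close.pct_change())    # relative change from the previous value
print(close.quantile(0.5))   # the 50% quantile, i.e. the median
```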

In [12]:
df.mean()        # mean of each column

open     9.56400
high     9.42200
low      9.52625
close    9.39600
dtype: float64

In [14]:
df.mean(axis=1)      # mean of each row

2017-06-01    9.717500
2017-06-02    9.475000
2017-06-03    9.406667
2017-06-04    9.690000
2017-06-05    9.530000
2017-06-06    9.250000
2017-06-07    9.420000
2017-06-08    9.355000
2017-06-09    9.487500
2017-06-10    9.450000
Freq: D, dtype: float64

In [15]:
df.mean(axis=1,skipna=False)   # skipna defaults to True, which excludes missing values

2017-06-01    9.7175
2017-06-02    9.4750
2017-06-03       NaN
2017-06-04       NaN
2017-06-05    9.5300
2017-06-06    9.2500
2017-06-07    9.4200
2017-06-08    9.3550
2017-06-09    9.4875
2017-06-10    9.4500
Freq: D, dtype: float64

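## Data wrangling         (important)
###### pandas provides many methods for combining Series and DataFrame objects (and Panel, in the old releases that still had it) according to various kinds of logical relationships

###### concat stacks multiple objects together along an axis.
###### append attaches a row to a DataFrame (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat covers the same use).
###### duplicated flags duplicate rows, and drop_duplicates removes them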

In [17]:
df1=pd.DataFrame({'open':np.round(np.random.uniform(9,10,10),2),
                 'high':np.round(np.random.uniform(9,10,10),2),
                 'low':np.round(np.random.uniform(9,10,10),2),
                 'close':np.round(np.random.uniform(9,10,10),2)},
                index=pd.date_range('20170601',periods=10))
df1

Unnamed: 0,open,high,low,close
2017-06-01,9.53,9.86,9.02,9.37
2017-06-02,9.74,9.61,9.94,9.58
2017-06-03,9.95,9.61,9.52,9.07
2017-06-04,9.22,9.85,9.35,9.62
2017-06-05,9.95,9.05,9.09,9.85
2017-06-06,9.17,9.88,9.75,9.14
2017-06-07,9.91,9.86,9.98,9.62
2017-06-08,9.21,9.78,9.17,9.37
2017-06-09,9.99,9.5,9.01,9.9
2017-06-10,9.82,9.88,9.52,9.22


In [18]:
df2=pd.DataFrame({'open':np.round(np.random.uniform(9,10,10),2),
                 'high':np.round(np.random.uniform(9,10,10),2),
                 'low':np.round(np.random.uniform(9,10,10),2),
                 'close':np.round(np.random.uniform(9,10,10),2)},
                index=pd.date_range('20170601',periods=10))
df2

Unnamed: 0,open,high,low,close
2017-06-01,9.89,9.4,9.21,9.73
2017-06-02,9.62,9.38,9.84,9.5
2017-06-03,9.78,9.14,9.51,9.15
2017-06-04,9.1,9.33,9.44,9.9
2017-06-05,9.93,9.51,9.58,9.98
2017-06-06,9.8,9.9,9.65,9.02
2017-06-07,9.97,9.58,9.98,9.08
2017-06-08,9.07,10.0,9.42,9.66
2017-06-09,9.76,9.45,9.76,9.87
2017-06-10,9.44,9.2,9.23,9.96


### concat

In [20]:
pd.concat([df1,df2],axis=0)

Unnamed: 0,open,high,low,close
2017-06-01,9.53,9.86,9.02,9.37
2017-06-02,9.74,9.61,9.94,9.58
2017-06-03,9.95,9.61,9.52,9.07
2017-06-04,9.22,9.85,9.35,9.62
2017-06-05,9.95,9.05,9.09,9.85
2017-06-06,9.17,9.88,9.75,9.14
2017-06-07,9.91,9.86,9.98,9.62
2017-06-08,9.21,9.78,9.17,9.37
2017-06-09,9.99,9.5,9.01,9.9
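2017-06-10,9.82,9.88,9.52,9.22
2017-06-01,9.89,9.4,9.21,9.73
2017-06-02,9.62,9.38,9.84,9.5
2017-06-03,9.78,9.14,9.51,9.15
2017-06-04,9.1,9.33,9.44,9.9
2017-06-05,9.93,9.51,9.58,9.98
2017-06-06,9.8,9.9,9.65,9.02
2017-06-07,9.97,9.58,9.98,9.08
2017-06-08,9.07,10.0,9.42,9.66
2017-06-09,9.76,9.45,9.76,9.87
2017-06-10,9.44,9.2,9.23,9.96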


In [21]:
pd.concat([df1,df2],axis=1)

Unnamed: 0,open,high,low,close,open.1,high.1,low.1,close.1
2017-06-01,9.53,9.86,9.02,9.37,9.89,9.4,9.21,9.73
2017-06-02,9.74,9.61,9.94,9.58,9.62,9.38,9.84,9.5
2017-06-03,9.95,9.61,9.52,9.07,9.78,9.14,9.51,9.15
2017-06-04,9.22,9.85,9.35,9.62,9.1,9.33,9.44,9.9
2017-06-05,9.95,9.05,9.09,9.85,9.93,9.51,9.58,9.98
2017-06-06,9.17,9.88,9.75,9.14,9.8,9.9,9.65,9.02
2017-06-07,9.91,9.86,9.98,9.62,9.97,9.58,9.98,9.08
2017-06-08,9.21,9.78,9.17,9.37,9.07,10.0,9.42,9.66
2017-06-09,9.99,9.5,9.01,9.9,9.76,9.45,9.76,9.87
2017-06-10,9.82,9.88,9.52,9.22,9.44,9.2,9.23,9.96
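
Stacking two tables that share the same dates produces repeated index labels (axis=0) or repeated column names (axis=1). The `ignore_index` and `keys` arguments of pd.concat are one way to keep the result unambiguous. A sketch with two small hypothetical tables `a` and `b`:

```python
import pandas as pd

a = pd.DataFrame({"open": [9.5, 9.7]}, index=pd.date_range("20170601", periods=2))
b = pd.DataFrame({"open": [9.8, 9.1]}, index=pd.date_range("20170601", periods=2))

stacked = pd.concat([a, b], axis=0)                          # 4 rows, each date appears twice
renumbered = pd.concat([a, b], ignore_index=True)            # fresh index 0..3
labelled = pd.concat([a, b], axis=1, keys=["df1", "df2"])    # two-level column names

print(stacked)
print(renumbered)
print(labelled)
print(labelled[("df2", "open")])    # pick one source's column by its key
```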


## append

In [9]:
df1=pd.DataFrame({'open':np.round(np.random.uniform(9,10,10),2),
                 'high':np.round(np.random.uniform(9,10,10),2),
                 'low':np.round(np.random.uniform(9,10,10),2),
                 'close':np.round(np.random.uniform(9,10,10),2)},
                index=pd.date_range('20170601',periods=10))

In [10]:
s=df1.iloc[0]

In [11]:
df1.append(s,ignore_index=False)   # ignore_index=False keeps the existing index labels

Unnamed: 0,open,high,low,close
2017-06-01,9.81,9.94,9.52,9.89
2017-06-02,9.51,9.75,9.11,9.02
2017-06-03,9.52,9.15,9.18,9.52
2017-06-04,9.82,9.14,9.16,9.9
2017-06-05,9.47,9.46,9.36,9.73
2017-06-06,9.33,9.96,9.38,9.38
2017-06-07,9.13,9.69,9.53,9.62
2017-06-08,9.93,9.72,9.76,9.46
2017-06-09,9.96,9.33,9.94,9.46
2017-06-10,9.64,9.5,9.63,9.99
2017-06-01,9.81,9.94,9.52,9.89


In [26]:
df.append(s,ignore_index=True)      # ignore_index=True resets the index to 0, 1, 2, ...

Unnamed: 0,open,high,low,close
0,9.7,9.47,9.92,9.78
1,9.87,9.29,9.41,9.33
2,9.66,9.27,,9.29
3,9.94,9.76,,9.37
4,9.5,9.61,9.05,9.96
5,9.23,9.01,9.64,9.12
6,9.04,9.46,10.0,9.18
7,9.38,9.55,9.29,9.2
8,9.76,9.35,9.26,9.58
9,9.56,9.45,9.64,9.15
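
DataFrame.append no longer exists in pandas 2.0 and later. One way to get the same result there is to turn the row into a one-row DataFrame and pass it to pd.concat. The sketch below assumes a small hypothetical table `df_app` rather than the notebook's df1.

```python
import pandas as pd

df_app = pd.DataFrame({"open": [9.8, 9.5], "close": [9.9, 9.0]},
                      index=pd.date_range("20170601", periods=2))

row = df_app.iloc[0]              # a Series whose name is the row's index label
row_frame = row.to_frame().T      # one-row DataFrame with the same columns

same_index = pd.concat([df_app, row_frame])                     # like append(row, ignore_index=False)
new_index = pd.concat([df_app, row_frame], ignore_index=True)   # like append(row, ignore_index=True)

print(same_index)
print(new_index)
```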


### duplicated shows duplicate rows; drop_duplicates removes them (important)

In [12]:
z=df1.append(s,ignore_index=False)   
z

Unnamed: 0,open,high,low,close
2017-06-01,9.81,9.94,9.52,9.89
2017-06-02,9.51,9.75,9.11,9.02
2017-06-03,9.52,9.15,9.18,9.52
2017-06-04,9.82,9.14,9.16,9.9
2017-06-05,9.47,9.46,9.36,9.73
2017-06-06,9.33,9.96,9.38,9.38
2017-06-07,9.13,9.69,9.53,9.62
2017-06-08,9.93,9.72,9.76,9.46
2017-06-09,9.96,9.33,9.94,9.46
2017-06-10,9.64,9.5,9.63,9.99
2017-06-01,9.81,9.94,9.52,9.89


In [33]:
z.duplicated()            

2017-06-01    False
2017-06-02    False
2017-06-03    False
2017-06-04    False
2017-06-05    False
2017-06-06    False
2017-06-07    False
2017-06-08    False
2017-06-09    False
2017-06-10    False
2017-06-01     True
dtype: bool

In [13]:
z.drop_duplicates()

Unnamed: 0,open,high,low,close
2017-06-01,9.81,9.94,9.52,9.89
2017-06-02,9.51,9.75,9.11,9.02
2017-06-03,9.52,9.15,9.18,9.52
2017-06-04,9.82,9.14,9.16,9.9
2017-06-05,9.47,9.46,9.36,9.73
2017-06-06,9.33,9.96,9.38,9.38
2017-06-07,9.13,9.69,9.53,9.62
2017-06-08,9.93,9.72,9.76,9.46
2017-06-09,9.96,9.33,9.94,9.46
2017-06-10,9.64,9.5,9.63,9.99


In [15]:
z.drop_duplicates(['open'])        # pass column names to drop_duplicates() to de-duplicate by those columns

Unnamed: 0,open,high,low,close
2017-06-01,9.81,9.94,9.52,9.89
2017-06-02,9.51,9.75,9.11,9.02
2017-06-03,9.52,9.15,9.18,9.52
2017-06-04,9.82,9.14,9.16,9.9
2017-06-05,9.47,9.46,9.36,9.73
2017-06-06,9.33,9.96,9.38,9.38
2017-06-07,9.13,9.69,9.53,9.62
2017-06-08,9.93,9.72,9.76,9.46
2017-06-09,9.96,9.33,9.94,9.46
2017-06-10,9.64,9.5,9.63,9.99
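
Both methods accept `subset` (which columns to compare) and `keep` (which occurrence to retain). A minimal sketch on an invented table `df_dup`, where the third row repeats the first and three rows share the same open value:

```python
import pandas as pd

df_dup = pd.DataFrame({"open": [9.8, 9.5, 9.8, 9.8],
                       "close": [9.9, 9.0, 9.9, 9.1]})

flags = df_dup.duplicated()                    # True where a row repeats an earlier row
first_kept = df_dup.drop_duplicates()          # keeps the first of each identical row
by_open_last = df_dup.drop_duplicates(subset=["open"], keep="last")   # compare open only, keep the last

print(flags)
print(first_kept)
print(by_open_last)
```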
