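> Preface:
pandas is built on top of NumPy. Its main data structures are Series (one-dimensional) and DataFrame (two-dimensional). Older releases also offered Panel (three-dimensional), Panel4D (four-dimensional) and PanelND (higher-dimensional), but these have since been removed from pandas. Series and DataFrame are by far the most widely used.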
Series

##### What is a Series?
A Series consists of a set of data (typically a NumPy ndarray) and a matching set of labels, its index.
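As a rough sketch of this definition, the two parts of a Series can be pulled apart and inspected separately. The variable names and sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: three values labelled 'a', 'b', 'c'.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])

values = s.to_numpy()   # the data part, as a NumPy ndarray
labels = list(s.index)  # the label part, one label per value

print(type(values).__name__)
print(labels)
```

Every value keeps its label through filtering and arithmetic, which is what separates a Series from a bare ndarray.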

* Creating a Series

In [13]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

In [3]:
# build directly from a list plus an explicit index
ser01 = Series([1,2,3], index=['a','b','c'])
# build from a dict: keys become the index
ser02 = pd.Series({'a':1,'b':2,'c':3})

In [4]:
ser01

a    1
b    2
c    3
dtype: int64

In [5]:
ser02

a    1
b    2
c    3
dtype: int64

* Indexing and slicing

In [6]:
ser02[0:2]

a    1
b    2
dtype: int64

In [8]:
ser01["a"]

1
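The two cells above mix positional slicing and label lookup. A minimal sketch of the explicit forms, using .iloc for positions and .loc for labels (sample data is hypothetical):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_position = s.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = s.loc["a":"b"]  # labels 'a' through 'b'; the end label is included
single = s.loc["a"]        # one value looked up by its label

print(by_position.tolist(), by_label.tolist(), single)
```

Spelling out .loc or .iloc avoids ambiguity when the index itself consists of integers.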

##### Operations
* ndarray-style operations

In [9]:
ser01[ser01>=2]  # filter by condition (the condition goes inside the square brackets)

b    2
c    3
dtype: int64

In [10]:
ser01>=2  # returns a boolean Series

a    False
b     True
c     True
dtype: bool

In [11]:
ser01+10  # add 10 to every element

a    11
b    12
c    13
dtype: int64

In [14]:
np.exp(ser01)  # element-wise exponential

a     2.718282
b     7.389056
c    20.085537
dtype: float64

* Computing with API methods

In [15]:
# addition
ser01.add(ser02)

a    2
b    4
c    6
dtype: int64

In [16]:
# subtraction
ser01.sub(ser02)

a    0
b    0
c    0
dtype: int64

In [17]:
# multiplication
ser01.mul(ser02)

a    1
b    4
c    9
dtype: int64

In [18]:
# division
ser01.div(ser02)

a    1.0
b    1.0
c    1.0
dtype: float64

In [20]:
# median
ser01.median()

2.0

In [21]:
# maximum
ser01.max()

3

In [22]:
# sum
ser01.sum()

6
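In the cells above both Series share the same index, so the method forms behave exactly like the operators. A small sketch of where they differ: when the labels only partly overlap, add() accepts a fill_value for the missing side. The data below is hypothetical.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20], index=["b", "d"])

plain = left + right                     # labels present on one side only give NaN
filled = left.add(right, fill_value=0)   # the missing side is treated as 0

print(plain)
print(filled)
```

Arithmetic aligns on labels rather than on position, so the result covers the union of both indexes.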

* Handling missing values

In [24]:
ser03 = Series(ser01,index=['a','b','c','d'])
ser03

a    1.0
b    2.0
c    3.0
d    NaN
dtype: float64

In [28]:
pd.isnull(ser03)  # test each element for a missing value

a    False
b    False
c    False
d     True
dtype: bool

In [29]:
ser03

a    1.0
b    2.0
c    3.0
d    NaN
dtype: float64

In [31]:
# filter out the np.nan values
ser03[pd.notnull(ser03)]

a    1.0
b    2.0
c    3.0
dtype: float64
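Besides boolean filtering with pd.notnull, a Series has dropna() and fillna() for the same job. A minimal sketch on made-up data shaped like ser03:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, np.nan], index=["a", "b", "c", "d"])

dropped = s.dropna()         # same result as s[pd.notnull(s)]
filled = s.fillna(s.mean())  # replace NaN with the mean of the other values

print(dropped.tolist())
print(filled.tolist())
```

Note that mean() skips NaN by default, so the fill value here is the mean of 1, 2 and 3.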

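##### DataFrame
* What is a DataFrame?    
A DataFrame is a table-like data structure holding an ordered collection of columns. It has both a row index and a column index, and can be viewed as a dict of Series that share the same index.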
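The "dict of Series" view can be sketched directly: build two Series on the same index, combine them, then pull one column back out. The column names and values here are assumptions chosen for illustration.

```python
import pandas as pd

year = pd.Series([2011, 2013], index=["one", "two"])
profit = pd.Series([100, 22], index=["one", "two"])

# Dict keys become column names; the shared index becomes the row index.
df = pd.DataFrame({"year": year, "profit": profit})

col = df["year"]  # selecting a column gives back a Series
print(type(col).__name__, list(df.columns), list(df.index))
```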

* Creating a DataFrame

In [32]:
df01 =DataFrame([['susan','long','meimei'],[50,60,60]],index=['姓名','成绩'],columns=['Chinese','math','english'])
df01  # index holds the row labels, columns holds the column labels

   Chinese  math english
姓名   susan  long  meimei
成绩      50    60      60


In [42]:
# build a DataFrame from a dict; the scalar 8 is broadcast to every row
data={
    "apart":[121,111,144,122],
    "year":[2011,2013,2022,2003],
    "month":8,
    "profit":[100,22,99,80]
}
df02=DataFrame(data,index=['one','two','three','four'])
df02

       apart  year  month  profit
one      121  2011      8     100
two      111  2013      8      22
three    144  2022      8      99
four     122  2003      8      80


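* Selecting by row and column    
Indexing with df02['name'] selects a column by default; to select a row, use df02.loc['label'] (loc is an indexer on the DataFrame and takes square brackets, it is not a pd function).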

In [43]:
# add a column
df02['address']=['北京','shanghai','shuangzhou','shenzhen']
df02

       apart  year  month  profit     address
one      121  2011      8     100          北京
two      111  2013      8      22    shanghai
three    144  2022      8      99  shuangzhou
four     122  2003      8      80    shenzhen


In [44]:
# delete a column (pop removes it in place and returns it)
df02.pop('apart')
df02

       year  month  profit     address
one    2011      8     100          北京
two    2013      8      22    shanghai
three  2022      8      99  shuangzhou
four   2003      8      80    shenzhen


In [45]:
# modify a column
df02['month']=3
df02

       year  month  profit     address
one    2011      3     100          北京
two    2013      3      22    shanghai
three  2022      3      99  shuangzhou
four   2003      3      80    shenzhen


In [46]:
# select a row by its label
df02.loc['two']

year           2013
month             3
profit           22
address    shanghai
Name: two, dtype: object

In [47]:
df02

       year  month  profit     address
one    2011      3     100          北京
two    2013      3      22    shanghai
three  2022      3      99  shuangzhou
four   2003      3      80    shenzhen
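To pull the selection forms together, here is a small sketch of column, row, single-cell and positional access on a reduced, made-up version of the table:

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2011, 2013, 2022], "profit": [100, 22, 99]},
    index=["one", "two", "three"],
)

col = df["profit"]              # a column, selected by name
row = df.loc["two"]             # a row, selected by label
cell = df.loc["two", "profit"]  # a single value: row label, then column name
first_two = df.iloc[:2]         # the first two rows, selected by position

print(cell, row["year"], len(first_two))
```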


* Reading files

In [49]:
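# read a txt, a csv and an excel file in turn
# sep="\t": fields are tab-separated; header=None: the first line is read as data, not as column names
df03 = pd.read_csv(r"C:\Users\leo.zhangzs\Downloads\test.txt",sep="\t",header=None)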

In [50]:
df03

       0     1      2       3           4
0     id  year  month  profit     address
1    one  2011      3     100          北京
2    two  2013      3      22    shanghai
3  three  2022      3      99  shuangzhou
4   four  2003      3      80    shenzhen


In [66]:
df04 = pd.read_csv(r'C:\Users\leo.zhangzs\Downloads\test.csv',encoding='gb2312')  # CSV file, GB2312-encoded
df04

      id  year  month  profit     address  NULL
0    one  2011      3     100          北京   NaN
1    two  2013      3      22    shanghai   NaN
2  three  2022      3      99  shuangzhou   NaN
3   four  2003      3      80    shenzhen   NaN


In [55]:
df05 = pd.read_excel(r'C:\Users\leo.zhangzs\Downloads\test.xlsx')  # Excel file
df05

      id  year  month  profit     address
0    one  2011      3     100          北京
1    two  2013      3      22    shanghai
2  three  2022      3      99  shuangzhou
3   four  2003      3      80    shenzhen
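The files above live on the author's machine, so here is a self-contained sketch of the same read_csv options using an in-memory string instead of a path. The sample text is made up; io.StringIO simply stands in for the file.

```python
import io
import pandas as pd

# Hypothetical tab-separated content with a header line.
raw = "id\tyear\tprofit\none\t2011\t100\ntwo\t2013\t22\n"

with_header = pd.read_csv(io.StringIO(raw), sep="\t")             # first line becomes the column names
no_header = pd.read_csv(io.StringIO(raw), sep="\t", header=None)  # first line is kept as a data row

print(with_header.shape, list(with_header.columns))
print(no_header.shape, list(no_header.columns))
```

This mirrors what happened with df03: because header=None was passed, the id/year/month line shows up as row 0 of the data.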


* Filtering and slicing

In [56]:
df02[df02.columns[1:]]  # all rows, columns from the second one onward

       month  profit     address
one        3     100          北京
two        3      22    shanghai
three      3      99  shuangzhou
four       3      80    shenzhen


* Missing-value operations    
These work much as they do for a Series.

In [68]:
df04.isnull()

      id   year  month  profit  address  NULL
0  False  False  False   False    False  True
1  False  False  False   False    False  True
2  False  False  False   False    False  True
3  False  False  False   False    False  True


In [69]:
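# drop missing values
# axis=1 drops every column that contains a NaN; the default, axis=0, drops rows.
# This reads differently from statistics methods, where the default axis=0 yields one result per column.
df04.dropna(axis=1)

      id  year  month  profit     address
0    one  2011      3     100          北京
1    two  2013      3      22    shanghai
2  three  2022      3      99  shuangzhou
3   four  2003      3      80    shenzhen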


In [84]:
df04.dropna(how="all")  # drop only rows in which every value is NaN; no such row here, so nothing changes

      id  year  month  profit     address  NULL
0    one  2011      3     100          北京   NaN
1    two  2013      3      22    shanghai   NaN
2  three  2022      3      99  shuangzhou   NaN
3   four  2003      3      80    shenzhen   NaN


In [86]:
# replace missing values
df04.fillna(0)

      id  year  month  profit     address  NULL
0    one  2011      3     100          北京   0.0
1    two  2013      3      22    shanghai   0.0
2  three  2022      3      99  shuangzhou   0.0
3   four  2003      3      80    shenzhen   0.0


In [87]:
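# with a dict, the keys are column labels; df04 has no columns named 0, 1 or 2, so nothing is filled
df04.fillna({0:1,1:2,2:3})

      id  year  month  profit     address  NULL
0    one  2011      3     100          北京   NaN
1    two  2013      3      22    shanghai   NaN
2  three  2022      3      99  shuangzhou   NaN
3   four  2003      3      80    shenzhen   NaN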
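A dict passed to fillna only has an effect when its keys match real column names. The sketch below shows a per-column fill, plus dropping columns that are entirely empty. The frame is a made-up stand-in for df04.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "profit": [100.0, np.nan],
        "address": ["beijing", None],
        "NULL": [np.nan, np.nan],
    }
)

# Keys are column names: each listed column gets its own fill value.
filled = df.fillna({"profit": 0, "address": "unknown"})

# axis=1 with how="all" drops only the columns that contain nothing but NaN.
no_empty_cols = df.dropna(axis=1, how="all")

print(filled)
print(list(no_empty_cols.columns))
```

Columns not named in the dict (here NULL) are left untouched.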


* Descriptive statistics    
Common methods include count, describe, min/max, idxmin/idxmax, quantile, sum, mean, median, mad, var, std, cumsum and pct_change. Note that mad was removed in pandas 2.0.

In [89]:
df02.describe()

              year  month      profit
count     4.0        4.0     4.0
mean   2012.25       3.0    75.25
std       7.804913   0.0    36.673105
min    2003.0        3.0    22.0
25%    2009.0        3.0    65.5
50%    2012.0        3.0    89.5
75%    2015.25       3.0    99.25
max    2022.0        3.0   100.0


In [90]:
df1=df02.dropna(axis=1)
df1

       year  month  profit     address
one    2011      3     100          北京
two    2013      3      22    shanghai
three  2022      3      99  shuangzhou
four   2003      3      80    shenzhen


In [91]:
df02.quantile(0.25, numeric_only=True)  # sample quantile, q between 0 and 1; numeric_only skips the text column address

year      2009.0
month        3.0
profit      65.5
Name: 0.25, dtype: float64

In [92]:
df02.median(numeric_only=True)  # median of each numeric column

year      2012.0
month        3.0
profit      89.5
dtype: float64

In [106]:
df02[['year','month','profit']].pct_change()  # fractional change from the previous row

           year  month     profit
one         NaN    NaN        NaN
two    0.000995    0.0  -0.78
three  0.004471    0.0   3.5
four  -0.009397    0.0  -0.191919
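A few of the other methods from the list above, sketched on the same year and profit figures. The frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2011, 2013, 2022, 2003], "profit": [100, 22, 99, 80]},
    index=["one", "two", "three", "four"],
)

best = df["profit"].idxmax()        # label of the row holding the largest profit
running = df["profit"].cumsum()     # running total, row by row
change = df["profit"].pct_change()  # (current - previous) / previous
col_means = df.mean()               # default axis=0: one mean per column

print(best)
print(running.tolist())
print(col_means["profit"])
```

The first entry of pct_change is always NaN because the first row has no predecessor.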


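* Covariance and correlation coefficient    
cov and corr give a direct measure of how strongly two sets of data are related: cov returns the covariance, corr the correlation coefficient.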

In [109]:
df2=DataFrame({
    "gdp":[2,4,6],
    "chukou":[3,2,1]
})
df2

   gdp  chukou
0    2       3
1    4       2
2    6       1


In [110]:
df2.cov()

        gdp  chukou
gdp     4.0    -2.0
chukou -2.0     1.0


In [111]:
df2.corr()

        gdp  chukou
gdp     1.0    -1.0
chukou -1.0     1.0
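To see where these numbers come from, the sketch below recomputes the gdp/chukou entry by hand: the sample covariance divides by n - 1, and the correlation is the covariance divided by the product of the two sample standard deviations.

```python
import pandas as pd

df = pd.DataFrame({"gdp": [2, 4, 6], "chukou": [3, 2, 1]})

x = df["gdp"].to_numpy(dtype=float)
y = df["chukou"].to_numpy(dtype=float)

# Sample covariance: sum of products of deviations, divided by n - 1.
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Pearson correlation: covariance scaled by both sample standard deviations.
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, corr_xy)
```

A correlation of exactly -1 reflects that chukou falls by a fixed amount for every fixed rise in gdp.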


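* Unique values, value counts, membership   
unique returns the distinct values, value_counts counts how often each value occurs, and isin tests membership: it marks the elements found in the given collection, and that boolean mask can then be used as a filter.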

In [113]:
df3=Series([12,13,14,15,13,13,12,11,14])
df3

0    12
1    13
2    14
3    15
4    13
5    13
6    12
7    11
8    14
dtype: int64

In [114]:
df3.unique()

array([12, 13, 14, 15, 11], dtype=int64)

In [115]:
df3.value_counts()

13    3
14    2
12    2
15    1
11    1
dtype: int64

In [116]:
df3[df3.isin([14,15])]  # membership: keep only the elements equal to 14 or 15

2    14
3    15
8    14
dtype: int64

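* Hierarchical indexing   
An index can have more than one level. unstack(level=1) turns a Series with a two-level index into a DataFrame, and swaplevel swaps the order of the index levels.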

In [122]:
df02.set_index(['year','month'])

            profit     address
year month
2011 3         100          北京
2013 3          22    shanghai
2022 3          99  shuangzhou
2003 3          80    shenzhen
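The cell above only builds the two-level index. A minimal sketch of unstack and swaplevel on a small Series follows; the quarter level and all values are hypothetical, chosen so each year has more than one entry.

```python
import pandas as pd

s = pd.Series(
    [100, 22, 99, 80],
    index=pd.MultiIndex.from_tuples(
        [(2011, "q1"), (2011, "q2"), (2013, "q1"), (2013, "q2")],
        names=["year", "quarter"],
    ),
)

wide = s.unstack(level=1)  # inner level moves to the columns: rows are years, columns are quarters
swapped = s.swaplevel()    # index levels are reordered to (quarter, year)

print(wide)
print(list(swapped.index.names))
```

unstack pivots one index level into columns, which is why a Series comes back as a DataFrame.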
