# The pandas Library for Python

# 1. Introduction to pandas

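① pandas is mainly used for tabular, two-dimensional data.

② pandas is designed for tabular and heterogeneous data, whereas NumPy is better suited to homogeneous numerical data.

③ The main pandas data structures are Series and DataFrame.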
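
The contrast in ② can be illustrated with a small sketch. The sample values and the names arr and df are made up for illustration: a NumPy array coerces mixed values to one common dtype, while a DataFrame keeps a dtype per column.

```python
import numpy as np
import pandas as pd

# Mixed values in a NumPy array are coerced to one common dtype
# (here every element becomes a string).
arr = np.array([[1, 'a'], [2, 'b']])
print(arr.dtype.kind)        # 'U' marks a Unicode string dtype

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({'num': [1, 2], 'text': ['a', 'b']})
print(df.dtypes)
```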

# 2. The Series Type

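① A Series is a one-dimensional array that can hold data of any type (integers, strings, floats, Python objects, and so on).

② A Series is displayed with the index on the left and the values on the right. If no index is specified, an integer index from 0 to N-1 (where N is the length of the data) is created automatically; a custom index can be supplied through the index argument.

③ The values and index attributes of a Series return its array of values and its index.

④ There are two main ways to read values from a Series:

1. Square brackets with an index label return the data for that label; several rows may come back if the label is duplicated.
2. Square brackets with a position return the data at that position. Valid positions run from 0 to len(Series.values)-1, and a negative position counts from the right. This positional form only works when the index labels are not integers, because with an integer index the number is looked up as a label (see In [8] and In [9]). Recent pandas releases also warn about positional access through plain square brackets; .iloc is the explicit alternative.

⑤ NumPy array operations carry over to Series, and the mapping between index and values is preserved when they are applied.

⑥ In practice, a Series can mostly be treated like a NumPy ndarray; the vast majority of ndarray operations also apply to a Series.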

In [1]:
import pandas as pd
obj = pd.Series([4,7,-5,3])      # an integer index from 0 to N-1 (N = length of the data) is created automatically
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [2]:
import pandas as pd
obj2 = pd.Series([4,7,-5,3],index = ['d','b','a','c'])  # custom index
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [3]:
import pandas as pd
obj2 = pd.Series([4,7,-5,3],index = ['d','b','a','c'])
obj2['d']    # read a value by its index label

4

In [4]:
# Creating a Series from a list, an array, and a dict
import numpy as np
mylist = list('qwe')               # list
myarr = np.arange(3)               # array
mydict = dict(zip(mylist,myarr))   # dict
print(mydict)

# Constructors
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)

print(ser3.head())                  # first five rows of ser3 (it only has three)
print(ser3.head(1))                 # first row of ser3
print(ser1,ser2,ser3)               # print ser1, ser2, ser3

{'q': 0, 'w': 1, 'e': 2}
q    0
w    1
e    2
dtype: int64
q    0
dtype: int64
0    q
1    w
2    e
dtype: object 0    0
1    1
2    2
dtype: int32 q    0
w    1
e    2
dtype: int64


In [5]:
import pandas as pd
sdata  = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
states = ['California','Ohio','Oregon','Texas']
obj3   = pd.Series(sdata)
print(obj3)
obj4   = pd.Series(sdata,index = states)   # labels found in sdata keep their value; the others become NaN
print(obj4)
pd.isnull(obj4)                            # True where the value is missing

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64


California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
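
Once missing values have been detected, they usually need to be counted, dropped, or filled. The following is a minimal sketch on made-up data that mirrors obj4; the fill value 0 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring obj4: one label has no value.
s = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
              index=['California', 'Ohio', 'Oregon', 'Texas'])

n_missing = s.isna().sum()   # method form of pd.isnull, then count the True entries
present = s.dropna()         # drop the rows whose value is missing
filled = s.fillna(0)         # or replace NaN with a default value

print(n_missing)
print(present.index.tolist())
print(filled['California'])
```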

In [6]:
import pandas as pd
obj2 = pd.Series([4,7,-5,3],index = ['d','b','a','c'])
print(obj2.values)  # values of the Series
print(obj2.index)   # index of the Series
print(obj2.dtype)   # dtype of the values

[ 4  7 -5  3]
Index(['d', 'b', 'a', 'c'], dtype='object')
int64


In [7]:
obj2 = pd.Series([4,7,-5,3],index = ['d','b','a','a'])
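print(obj2['a'])  # label 'a' is duplicated, so two values are returned
print(obj2[0:2])  # positional slice
print(obj2[0])    # position 0; works because the labels are strings (recent pandas warns, obj2.iloc[0] is explicit)
print(obj2[-1])   # with string labels a negative position also works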

a   -5
a    3
dtype: int64
d    4
b    7
dtype: int64
4
3


In [8]:
obj2 = pd.Series([4,7,-5,3],index = [100,200,300,400])
print(obj2[0:2])
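# print(obj2[0])    # KeyError: with an integer index, 0 is looked up as a label, not a position
# print(obj2[-1])   # KeyError: with an integer index, -1 is looked up as a label as well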

100    4
200    7
dtype: int64


In [9]:
obj2 = pd.Series([4,7,-5,3])
print(obj2[0:2])
print(obj2[0])     # works: with the default index, 0 is also a label
# print(obj2[-1])  # KeyError: -1 is not a label in the default index

0    4
1    7
dtype: int64
4
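
The three cells above show that plain square brackets behave differently depending on the index type. A sketch of the explicit accessors, which avoid that ambiguity: .loc always uses labels and .iloc always uses positions. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=[100, 200, 300, 400])

by_label = s.loc[100]      # label-based: the row whose label is 100
by_position = s.iloc[0]    # position-based: the first row
last = s.iloc[-1]          # negative positions count from the end

print(by_label, by_position, last)
```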


In [10]:
obj2 = pd.Series([4,7,-5,3],index = ['d','b','a','a'])
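obj2['a'] = 100   # assigns to every row labelled 'a'
obj2[1]   = 200   # assigns by position (the labels are strings)
print(obj2)       # values modified through the index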

d      4
b    200
a    100
a    100
dtype: int64


In [11]:
obj2 = pd.Series([4,7,-5,3],index = ['d','b','a','a'])
obj2.index = ['aa','bb','cc','dd']  # replace the index
obj2[1]   = 200                     # modify a value
print(obj2)

aa      4
bb    200
cc     -5
dd      3
dtype: int64


In [12]:
import numpy as np
obj2 = pd.Series([4,7,-5,3],dtype = np.float32,index = ['d','b','a','a'])  # set the dtype of the values
print(obj2)

d    4.0
b    7.0
a   -5.0
a    3.0
dtype: float32


In [13]:
import numpy as np
obj2 = pd.Series([4,7,-5,3],dtype = np.float32,index = ['d','b','a','a'])
print(obj2+100)  # add 100 to every element

d    104.0
b    107.0
a     95.0
a    103.0
dtype: float32
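
Beyond scalar arithmetic, the NumPy-style operations mentioned in ⑤ and ⑥ include universal functions and boolean filtering. A small sketch with made-up values; in each case the index labels stay attached to their values.

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, 9.0, 16.0, 25.0], index=['d', 'b', 'a', 'c'])

roots = np.sqrt(s)      # NumPy ufunc applied element-wise, result is still a Series
large = s[s > 5]        # boolean filtering, as with an ndarray
doubled = s * 2         # scalar arithmetic

print(roots)
print(large.index.tolist())
```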


# 3. The DataFrame Type

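① A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index.

② Internally, the data in a DataFrame is stored as one or more two-dimensional blocks (rather than as a list, dict, or some other collection of one-dimensional structures).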
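
As a quick illustration of the two indexes described in ①, the sketch below builds a small frame from made-up data and inspects its row index, column index, and shape.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2003],
                      'pop': [1.5, 2.4]})

print(frame.index)     # row index (default integer labels 0 and 1)
print(frame.columns)   # column index
print(frame.shape)     # (number of rows, number of columns)
```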

In [14]:
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}
frame = pd.DataFrame(data)
print(frame)
pd1 = pd.DataFrame(data,columns=['year','state','pop'])   # set the column order
pd1                                                       # displayed as a table

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2003  2.4


   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2003  Nevada  2.4


In [15]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}
frame = pd.DataFrame(data)
print(frame)
pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   # set the row labels and the column order
pd1

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2003  2.4


       year   state  pop
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2003  Nevada  2.4


In [16]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1.columns

Index(['year', 'state', 'pop'], dtype='object')

In [17]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
print(type(pd1.year))   # a Series
pd1.year                # method 1: attribute access returns the year column with its index

<class 'pandas.core.series.Series'>


one      2000
two      2001
three    2002
four     2003
Name: year, dtype: int64

In [18]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
print(type(pd1.year))   # a Series
pd1['year']             # method 2: bracket access returns the year column with its index

<class 'pandas.core.series.Series'>


one      2000
two      2001
three    2002
four     2003
Name: year, dtype: int64

In [19]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1[['year','state']]  # pass a list of names to select several columns

       year   state
one    2000    Ohio
two    2001    Ohio
three  2002    Ohio
four   2003  Nevada


In [20]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1[:2]  # slice rows

     year state  pop
one  2000  Ohio  1.5
two  2001  Ohio  1.7


In [21]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1[pd1['year']>2000]  # boolean filter: keep the rows that satisfy the condition

       year   state  pop
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2003  Nevada  2.4


In [22]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
print(type(pd1['year']>2000))
pd1['year']>2000  # the comparison returns a boolean Series

<class 'pandas.core.series.Series'>


one      False
two       True
three     True
four      True
Name: year, dtype: bool

In [23]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1[pd1['year']>2000].state  # filter the rows by the condition, then take the state column

two        Ohio
three      Ohio
four     Nevada
Name: state, dtype: object
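
The filter-then-select pattern above can also be written as a single .loc call that takes the row mask and the column name together. A sketch on the same data; the names mask, states, and subset are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4]}
pd1 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                   index=['one', 'two', 'three', 'four'])

mask = pd1['year'] > 2000                    # boolean Series, one entry per row
states = pd1.loc[mask, 'state']              # rows by mask, one column by name
subset = pd1.loc[mask, ['state', 'pop']]     # rows by mask, several columns

print(states)
print(subset)
```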

In [24]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
print(pd1)
pd1[pd1['year']>2000]=5  # select the rows where year > 2000 and set every element in them to 5
pd1

       year   state  pop
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2003  Nevada  2.4


       year state  pop
one    2000  Ohio  1.5
two       5     5  5.0
three     5     5  5.0
four      5     5  5.0
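
The assignment above overwrites every column of the matching rows, including state. When only one column should change, the row mask can be combined with a column name in .loc. A minimal sketch on the same data; the replacement value 0.0 is an arbitrary assumption.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4]}
pd1 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                   index=['one', 'two', 'three', 'four'])

# Only the pop column of the rows with year > 2000 is modified.
pd1.loc[pd1['year'] > 2000, 'pop'] = 0.0
print(pd1)
```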


In [25]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1.loc['one',['year','state']]  # one row, several columns

year     2000
state    Ohio
Name: one, dtype: object

In [26]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1.loc[['one','two'],['year','state']]  # several rows, several columns

     year state
one  2000  Ohio
two  2001  Ohio


In [27]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1[['year','state']]  # all rows of several columns

       year   state
one    2000    Ohio
two    2001    Ohio
three  2002    Ohio
four   2003  Nevada


In [28]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1.iloc[[0,2],[2,0,1]]  # select rows and columns by integer position

       pop  year state
one    1.5  2000  Ohio
three  3.6  2002  Ohio


In [29]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1.iloc[1:3,[2,0,1]]  # a positional row slice with a list of column positions

       pop  year state
two    1.7  2001  Ohio
three  3.6  2002  Ohio


In [30]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1.iloc[1:3,0:2]  # positional slices for both rows and columns (end positions are excluded)

       year state
two    2001  Ohio
three  2002  Ohio


In [31]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
sub = pd1.iloc[[0,2],[2,0,1]]   # select some rows and columns by position
print(sub)
sub[sub['pop']>1.6]             # filter with a mask built from the selection itself, which avoids the reindexing warning

       pop  year state
one    1.5  2000  Ohio
three  3.6  2002  Ohio


  


       pop  year state
three  3.6  2002  Ohio


In [32]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1.loc[:'two','year']  # from the first row through 'two' (inclusive), year column

one    2000
two    2001
Name: year, dtype: int64
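
Note that the label slice includes its end label, whereas a position slice with .iloc excludes its end position. The sketch below, on the same data, selects the same two rows both ways; the variable names are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4]}
pd1 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                   index=['one', 'two', 'three', 'four'])

by_label = pd1.loc[:'two', 'year']    # label slice: the row 'two' is included
by_position = pd1.iloc[:2, 0]         # position slice: stops before position 2

print(by_label.equals(by_position))
```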

In [33]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
pd1[:3][['pop','year']]  # select rows first, then columns

       pop  year
one    1.5  2000
two    1.7  2001
three  3.6  2002


In [34]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
print(type(pd1.loc['four']))  # a Series
pd1.loc['four']               # read a row with .loc; attribute access (pd1.four) does not work for rows

<class 'pandas.core.series.Series'>


year       2003
state    Nevada
pop         2.4
Name: four, dtype: object

In [35]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],
                    columns=['Ohio','Texas','California'])
print(frame)
frame2 = frame.reindex(['aa','bb','cc'])  # reindex to new labels; labels absent from the original give NaN rows
print(frame2)

frame2 = frame.reindex(['c','b','a','d'])  # existing labels are reordered; the new label 'b' gives a NaN row
print(frame2)  

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
    Ohio  Texas  California
aa   NaN    NaN         NaN
bb   NaN    NaN         NaN
cc   NaN    NaN         NaN
   Ohio  Texas  California
c   3.0    4.0         5.0
b   NaN    NaN         NaN
a   0.0    1.0         2.0
d   6.0    7.0         8.0
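
Because reindex conforms the data to a set of labels rather than renaming them, two related calls are worth contrasting. This is a sketch under stated assumptions: the fill value 0 and the new label names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# reindex conforms the rows to the given labels; fill_value replaces the NaN default
filled = frame.reindex(['a', 'b', 'c', 'd'], fill_value=0)

# rename changes the labels themselves and keeps the data
renamed = frame.rename(index={'a': 'aa', 'c': 'cc', 'd': 'dd'})

print(filled)
print(renamed)
```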


In [36]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],
                    columns=['Ohio','Texas','California'])

frame2 = frame.reindex(['c','b','a','d'])  # reorder the rows; the new label 'b' gives a NaN row
print(frame2)

data = frame2.drop(['d'])  # drop row 'd'
print(data)  

   Ohio  Texas  California
c   3.0    4.0         5.0
b   NaN    NaN         NaN
a   0.0    1.0         2.0
d   6.0    7.0         8.0
   Ohio  Texas  California
c   3.0    4.0         5.0
b   NaN    NaN         NaN
a   0.0    1.0         2.0


In [37]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],
                    columns=['Ohio','Texas','California'])

frame2 = frame.reindex(['c','b','a','d'])  # reorder the rows; the new label 'b' gives a NaN row
print(frame2)

data = frame2.drop('d',axis=0)  # drop row 'd' (axis=0 means rows)
print(data)  

   Ohio  Texas  California
c   3.0    4.0         5.0
b   NaN    NaN         NaN
a   0.0    1.0         2.0
d   6.0    7.0         8.0
   Ohio  Texas  California
c   3.0    4.0         5.0
b   NaN    NaN         NaN
a   0.0    1.0         2.0


In [38]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],
                    columns=['Ohio','Texas','California'])

frame2 = frame.reindex(['c','b','a','d'])  # reorder the rows; the new label 'b' gives a NaN row
print(frame2)

data = frame2.drop('Ohio',axis=1)  # drop column 'Ohio' (axis=1 means columns)
print(data)  

   Ohio  Texas  California
c   3.0    4.0         5.0
b   NaN    NaN         NaN
a   0.0    1.0         2.0
d   6.0    7.0         8.0
   Texas  California
c    4.0         5.0
b    NaN         NaN
a    1.0         2.0
d    7.0         8.0
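
In the three drop examples, frame2 is printed unchanged because drop returns a new object by default. The sketch below shows this, together with the index= and columns= keywords as an alternative to axis; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

dropped_rows = frame.drop(index=['a'])        # same as frame.drop(['a'], axis=0)
dropped_cols = frame.drop(columns=['Ohio'])   # same as frame.drop('Ohio', axis=1)

print(frame.shape)          # the original frame still has all rows and columns
print(dropped_rows.shape)
print(dropped_cols.shape)
```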


In [39]:
import pandas as pd
data = {'state':['Ohio','Ohio','Ohio','Nevada'],
       'year':[2000,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4]}

pd1 = pd.DataFrame(data,columns=['year','state','pop'],index=['one','two','three','four'])   
print(pd1)
pd1+pd1   # element-wise addition (strings are concatenated)

       year   state  pop
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2003  Nevada  2.4


       year         state  pop
one    4000      OhioOhio  3.0
two    4002      OhioOhio  3.4
three  4004      OhioOhio  7.2
four   4006  NevadaNevada  4.8


In [40]:
import numpy as np
df1 = pd.DataFrame(np.arange(9.).reshape(3,3),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
df2 = pd.DataFrame(np.arange(9.).reshape(3,3),columns=list('bcd'),index=['Ohio','Texas','Oregon'])
print(df1)
print(df2)
df1 + df2  # rows are aligned by label; a label missing from either frame yields NaN

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
          b    c    d
Ohio    0.0  1.0  2.0
Texas   3.0  4.0  5.0
Oregon  6.0  7.0  8.0


            b    c     d
Colorado  NaN  NaN   NaN
Ohio      0.0  2.0   4.0
Oregon    NaN  NaN   NaN
Texas     6.0  8.0  10.0
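
If the unmatched rows should keep their values instead of becoming empty, the add method accepts a fill_value. A minimal sketch, assuming that a missing entry should count as 0:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape(3, 3), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(9.).reshape(3, 3), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Oregon'])

# A label present in only one frame is treated as 0 in the other.
total = df1.add(df2, fill_value=0)
print(total)
```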
