# Chapter 3 Data Preprocessing
## 3.2 pandas Basics
### 　3.2.1 pandas Data Structures
#### 　　3.2.1.1 Series
#### 　　3.2.1.2 DataFrame
### 　<font color=gray>3.2.2 pandas Data Operations</font>
#### 　　<font color=gray>3.2.2.1 Sorting</font>
#### 　　<font color=gray>3.2.2.2 Ranking</font>
#### 　　<font color=gray>3.2.2.3 Arithmetic</font>
#### 　　<font color=gray>3.2.2.4 Function Application and Mapping</font>
#### 　　<font color=gray>3.2.2.5 Grouping</font>
#### 　　<font color=gray>3.2.2.6 Merging</font>
#### 　　<font color=gray>3.2.2.7 Categorical Data</font>
#### 　　<font color=gray>3.2.2.8 Time Series</font>
#### 　　<font color=gray>3.2.2.9 Handling Missing Values</font>

## 3.2 pandas Basics

In [1]:
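'''
1. Features of pandas:
-- pandas is built on top of numpy. It combines numpy's high-performance array computation with
   the flexible data handling of spreadsheets and relational data.
-- It provides sophisticated indexing, which makes indexing, slicing, combining, and selecting
   subsets of data more convenient.
-- pandas offers two main classes, Series and DataFrame (also called a data frame or data table).
   DataFrame in particular is a column-oriented two-dimensional table that closely matches the
   variables-and-observations layout common in statistics and data analysis, which makes
   working with large datasets fast and simple.
-- pandas ships a large set of high-performance tools for data processing and analysis, covering
   exploratory data analysis, statistical inference, time series, and more.

2. Installing pandas:
-- Install the pandas package with pip:
   -- pip install pandas

-- Or import the pandas bundled with Anaconda:
   import pandas as pd
   Modules and functions in pandas are then called as "pd.<module or function name>".
'''
import pandas as pd  # the import convention used throughout this book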

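### 　3.2.1 pandas Data Structures

In [3]:
'''
pandas provides two key data structures: Series and DataFrame.
'''

#### 　　3.2.1.1 Series

In [None]:
'''
An instance of the Series class is a one-dimensional, array-like object made up of data values
and their labels (the index). It can hold any data type: integers, strings, floats, Python objects, and so on.

-- 1. Defining a Series (by passing a list or a dict)
-- 2. Getting the values and the index (labels) of a Series
-- 3. Working with the index (labels): changing it, accessing elements by label, reordering labels
'''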

In [2]:
# 1. Define a Series by passing a list or a dict to Series
# Pass a list to Series
s1=pd.Series([100,78,59,63]) 
s1

0    100
1     78
2     59
3     63
dtype: int64

In [3]:
# 2. Get the values and the index via the values and index attributes
s1.values

array([100,  78,  59,  63], dtype=int64)

In [4]:
s1.index

RangeIndex(start=0, stop=4, step=1)

In [None]:
# 3. Change the index (labels):
# -- Method 1: assign to the index attribute
# -- Method 2: pass the index keyword when creating the Series

In [5]:
# Method 1: change the index (labels) by assigning to the index attribute
s1.index=['No.1','No.2','No.3','No.4']
s1

No.1    100
No.2     78
No.3     59
No.4     63
dtype: int64

In [6]:
# Method 2: set the index (labels) with the index keyword when creating the Series
s2=pd.Series([100,78,59,63],index=['Maths','English','Literature','History'])
s2

Maths         100
English        78
Literature     59
History        63
dtype: int64

In [7]:
# 4. Access a single element by its index label
s2['English']

78

In [8]:
# 4. Access several elements by passing a list of index labels
s2[['English','History']]

English    78
History    63
dtype: int64

In [9]:
# 5. Define a Series by passing a dict; the dict keys become the index of the Series.
d3={'Name':'Zhang San','Gender':'Male','Age':'19','Height':178,'Weight':66}
s3=pd.Series(d3)
s3

Name      Zhang San
Gender         Male
Age              19
Height          178
Weight           66
dtype: object

In [10]:
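# Specify the label order with the index keyword when creating the Series.
# Labels that are not keys of the dict can be added at the same time; their values become NaN.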
student_attrib=['ID','Name','Gender','Age','Grade','Height','Weight']
s4=pd.Series(d3,index=student_attrib)
s4

ID              NaN
Name      Zhang San
Gender         Male
Age              19
Grade           NaN
Height          178
Weight           66
dtype: object

In [11]:
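'''
Missing values are marked as NaN.
Missing values are detected with the functions isnull and notnull, or with the isnull and notnull methods of a Series.
'''
pd.isnull(s4)  # equivalent to s4.isnull()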

ID         True
Name      False
Gender    False
Age       False
Grade      True
Height    False
Weight    False
dtype: bool
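The boolean mask returned by `isnull`/`notnull` can be used directly to filter a Series. The following is a minimal sketch, assuming a Series built the same way as `s4` above; the names `profile`, `missing_labels`, `present`, and `dropped` are introduced here for illustration only.

```python
import pandas as pd

# Rebuild a Series shaped like s4: labels absent from the dict become NaN.
d = {'Name': 'Zhang San', 'Gender': 'Male', 'Age': '19', 'Height': 178, 'Weight': 66}
profile = pd.Series(d, index=['ID', 'Name', 'Gender', 'Age', 'Grade', 'Height', 'Weight'])

missing_labels = list(profile.index[profile.isnull()])  # labels whose value is missing
present = profile[profile.notnull()]                    # keep only the non-missing entries
dropped = profile.dropna()                              # same selection via dropna()

print(missing_labels)
print(present)
```

Both `present` and `dropped` leave `profile` itself unchanged.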

In [12]:
# 6. A Series aligns data automatically on matching index labels
s3+s4   # +: concatenates strings and adds numbers; values are paired by index label

Age                     1919
Gender              MaleMale
Grade                    NaN
Height                   356
ID                       NaN
Name      Zhang SanZhang San
Weight                   132
dtype: object
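Alignment is easier to see with purely numeric data and only partially overlapping labels. The sketch below uses two made-up Series (`term1` and `term2` are hypothetical names and values) and contrasts the `+` operator with the `add` method and its `fill_value` argument.

```python
import pandas as pd

# Two hypothetical score Series with partially overlapping labels.
term1 = pd.Series([90, 80, 70], index=['Maths', 'English', 'History'])
term2 = pd.Series([5, 10], index=['English', 'Physics'])

plain_sum = term1 + term2                    # labels present in only one Series give NaN
filled_sum = term1.add(term2, fill_value=0)  # treat the missing side as 0 instead

print(plain_sum)
print(filled_sum)
```

Only `English` appears in both Series, so it is the only label with a numeric result in `plain_sum`.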

In [13]:
# 7. Assign the name attribute of the Series itself and of its index (labels)
s4.name="Student's profile" # the string contains a single quote, so it is wrapped in double quotes
s4.index.name='Attribute'
s4

Attribute
ID              NaN
Name      Zhang San
Gender         Male
Age              19
Grade           NaN
Height          178
Weight           66
Name: Student's profile, dtype: object

In [14]:
# 8. Reindex a Series with the reindex method
# reindex does not modify the original Series; it returns a new Series whose labels follow the requested order.
s4.reindex(index=['Name','ID','Age','Gender','Height','Weight','Grade'])

Attribute
Name      Zhang San
ID              NaN
Age              19
Gender         Male
Height          178
Weight           66
Grade           NaN
Name: Student's profile, dtype: object

In [1]:
s4

Attribute
ID              NaN
Name      Zhang San
Gender         Male
Age              19
Grade           NaN
Height          178
Weight           66
Name: Student's profile, dtype: object

In [16]:
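'''
reindex can also introduce new labels while reordering. For labels that are not in the original
index, the fill content can be set either with a method (backfill/bfill, pad/ffill) or with fill_value.
'''
# Using fill_value: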
s4.index=['b','g','a','c','e','f','d']
s4

b          NaN
g    Zhang San
a         Male
c           19
e          NaN
f          178
d           66
Name: Student's profile, dtype: object

In [2]:
s4.reindex(index=['a','b','c','d','e','f','g','h'],fill_value=0)

a         Male
b          NaN
c           19
d           66
e          NaN
f          178
g    Zhang San
h            0
Name: Student's profile, dtype: object

In [18]:
s4

b          NaN
g    Zhang San
a         Male
c           19
e          NaN
f          178
d           66
Name: Student's profile, dtype: object

In [19]:
# Using ffill (forward fill)
s4.index=[0,2,3,6,8,9,11] # assigning to the index attribute replaces the index entirely
s4

0           NaN
2     Zhang San
3          Male
6            19
8           NaN
9           178
11           66
Name: Student's profile, dtype: object

In [20]:
s4.reindex(range(10),method='ffill') # reindex(index, method='ffill'): forward fill

0          NaN
1          NaN
2    Zhang San
3         Male
4         Male
5         Male
6           19
7           19
8          NaN
9          178
Name: Student's profile, dtype: object

In [21]:
s4.reindex(range(10),method='bfill') # backward fill

0          NaN
1    Zhang San
2    Zhang San
3         Male
4           19
5           19
6           19
7          NaN
8          NaN
9          178
Name: Student's profile, dtype: object

In [22]:
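'''
Note: reindex(index, method='...') requires the original index of the Series to be monotonic.
If the labels of s4 were still 'b','g','a','c','e','f','d', as in this example, pandas would raise an error.
'''
s4.index=['b','g','a','c','e','f','d'] # the index is changed by assigning to the index attribute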
s4

b          NaN
g    Zhang San
a         Male
c           19
e          NaN
f          178
d           66
Name: Student's profile, dtype: object

In [24]:
s4.reindex(range(10),method='ffill') # forward fill also requires the old and new labels to be of comparable types

TypeError: '<' not supported between instances of 'int' and 'str'
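The error above mixes two problems: the labels are strings while the new index is made of integers, and the string labels are not sorted. The sketch below isolates the monotonicity requirement with a small made-up Series (`s`, `new_labels`, and `filled` are hypothetical names). It assumes pandas raises `ValueError` for a fill method on a non-monotonic index, and shows that sorting the index first avoids the error.

```python
import pandas as pd

# Hypothetical Series with an unsorted (non-monotonic) string index.
s = pd.Series([1, 2, 3], index=['c', 'a', 'e'])
new_labels = ['a', 'b', 'c', 'd', 'e']

try:
    s.reindex(new_labels, method='ffill')
    raised = False
except ValueError as err:   # fill methods are rejected on a non-monotonic index
    raised = True
    print('reindex failed:', err)

# Sorting the index first makes it monotonic, so forward fill is allowed.
filled = s.sort_index().reindex(new_labels, method='ffill')
print(filled)
```

After sorting, the new labels `b` and `d` take the values of the nearest preceding labels `a` and `c`.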

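#### 　　3.2.1.2 DataFrame

In [None]:
'''
-- A DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet.
-- Its columns can have different data types.
-- Like Series, it accepts many kinds of input: lists, dicts, numpy ndarrays, Series, other DataFrames, and so on.
-- Besides the data, the constructor also accepts index and columns (the row and column labels).
-- A DataFrame instance has both a row index and a column index.

-- 1. Creation
-- 2. Exporting data
-- 3. Indexing and slicing
-- 4. Row and column operations
'''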

In [None]:
'''
1. Creating a DataFrame instance
(1) From a dict
(2) From a numpy array
(3) By reading a csv or excel file directly
(4) From other data sources
'''

In [25]:
'''
(1) Create a DataFrame from a dict
Each dict key becomes a column label, and the list given as that key's value
supplies all the elements of that column:

'''
dfdata={'Name':['Zhang San','Li Si','Wang Laowu','Zhao Liu','Qian Qi','Sun Ba'],
        'Subject':['Literature','History','English','Maths','Physics','Chemics'],
       'Score':[98,76,84,70,93,83]}
scoresheet=pd.DataFrame(dfdata)
scoresheet

Unnamed: 0,Name,Subject,Score
0,Zhang San,Literature,98
1,Li Si,History,76
2,Wang Laowu,English,84
3,Zhao Liu,Maths,70
4,Qian Qi,Physics,93
5,Sun Ba,Chemics,83


In [26]:
# Inspect a DataFrame: the head and tail methods show a given number of rows
scoresheet.head() # first 5 rows by default

Unnamed: 0,Name,Subject,Score
0,Zhang San,Literature,98
1,Li Si,History,76
2,Wang Laowu,English,84
3,Zhao Liu,Maths,70
4,Qian Qi,Physics,93


In [27]:
scoresheet.head(3) # show the first n rows

Unnamed: 0,Name,Subject,Score
0,Zhang San,Literature,98
1,Li Si,History,76
2,Wang Laowu,English,84


In [28]:
scoresheet.tail() # last 5 rows by default

Unnamed: 0,Name,Subject,Score
1,Li Si,History,76
2,Wang Laowu,English,84
3,Zhao Liu,Maths,70
4,Qian Qi,Physics,93
5,Sun Ba,Chemics,83


In [29]:
scoresheet.tail(3) # show the last 3 rows

Unnamed: 0,Name,Subject,Score
3,Zhao Liu,Maths,70
4,Qian Qi,Physics,93
5,Sun Ba,Chemics,83


In [30]:
# Inspect the column labels and the values of a DataFrame with the columns and values attributes
scoresheet.columns # column labels

Index(['Name', 'Subject', 'Score'], dtype='object')

In [31]:
scoresheet.values # the data as a numpy array, one inner array per row

array([['Zhang San', 'Literature', 98],
       ['Li Si', 'History', 76],
       ['Wang Laowu', 'English', 84],
       ['Zhao Liu', 'Maths', 70],
       ['Qian Qi', 'Physics', 93],
       ['Sun Ba', 'Chemics', 83]], dtype=object)

In [32]:
# Build a DataFrame from a nested dict: outer keys become column labels, inner keys become row labels
dfdata2={'Name':{101:'Zhang San',102:'Li Si',103:'Wang Laowu',
                 104:'Zhao Liu',105:'Qian Qi',106:'Sun Ba'},
         'Subject':{101:'Literature',102:'History',103:'Engish',
                    104:'Maths',105:'Physics',106:'Chemics'},
         'Score':{101:98,102:76,103:84,104:70,105:93,106:83}}
scoresheet2=pd.DataFrame(dfdata2)
scoresheet2

Unnamed: 0,Name,Subject,Score
101,Zhang San,Literature,98
102,Li Si,History,76
103,Wang Laowu,Engish,84
104,Zhao Liu,Maths,70
105,Qian Qi,Physics,93
106,Sun Ba,Chemics,83


In [33]:
# A DataFrame is made up of Series: each of its columns is a Series, for example:
scoresheet2['Score']

101    98
102    76
103    84
104    70
105    93
106    83
Name: Score, dtype: int64

In [34]:
scoresheet2['Name']

101     Zhang San
102         Li Si
103    Wang Laowu
104      Zhao Liu
105       Qian Qi
106        Sun Ba
Name: Name, dtype: object

In [35]:
# A column can also be read as an attribute; the result is the same Series:
scoresheet2.Score

101    98
102    76
103    84
104    70
105    93
106    83
Name: Score, dtype: int64

In [36]:
# Attribute access works for the Name column as well:
scoresheet2.Name

101     Zhang San
102         Li Si
103    Wang Laowu
104      Zhao Liu
105       Qian Qi
106        Sun Ba
Name: Name, dtype: object

In [37]:
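# (2) Build a DataFrame from a numpy array
'''
numpy.random.randn(d0,d1,…,dn)

randn returns one sample, or an array of samples, drawn from the standard normal distribution.
d0 to dn give the size of each dimension.
The return value is an array of the requested shape.
'''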
import numpy as np
numframe=np.random.randn(10,5)
numframe

array([[ 2.05421744, -0.34796082,  2.54376791,  0.92526479,  1.70930582],
       [-2.02652163,  1.00062142, -0.13110848,  0.6821614 ,  0.18275813],
       [ 0.94841864,  1.34920733,  0.07839491,  0.81301191,  0.88353303],
       [-0.39578797,  0.41968657, -1.40764367,  0.04143718, -0.33691126],
       [ 0.56855952,  0.9724694 ,  0.98917314,  0.06019136,  0.28617911],
       [ 0.59762935, -1.39403931, -1.22204714,  0.65765251, -3.3895936 ],
       [ 0.16823372, -0.91597244, -1.28515493,  0.32125322,  0.25197222],
       [-1.27446108,  0.60580062, -0.59073431,  0.19355914,  1.16516286],
       [ 0.23300641, -0.88208188, -1.60186047,  0.46666064, -0.08316368],
       [-0.41152676,  0.39186396, -2.0232177 ,  0.11821536, -1.3692514 ]])

In [38]:
framenum=pd.DataFrame(numframe)
framenum

Unnamed: 0,0,1,2,3,4
0,2.054217,-0.347961,2.543768,0.925265,1.709306
1,-2.026522,1.000621,-0.131108,0.682161,0.182758
2,0.948419,1.349207,0.078395,0.813012,0.883533
3,-0.395788,0.419687,-1.407644,0.041437,-0.336911
4,0.56856,0.972469,0.989173,0.060191,0.286179
5,0.597629,-1.394039,-1.222047,0.657653,-3.389594
6,0.168234,-0.915972,-1.285155,0.321253,0.251972
7,-1.274461,0.605801,-0.590734,0.193559,1.165163
8,0.233006,-0.882082,-1.60186,0.466661,-0.083164
9,-0.411527,0.391864,-2.023218,0.118215,-1.369251
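The frame above receives default integer labels for both rows and columns. As noted earlier, `index` and `columns` can be passed at construction time. A minimal sketch follows; the labels are made up, and a seeded `numpy.random.default_rng` is used in place of `randn` only so that the result is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)        # seeded generator so the sketch is repeatable
values = rng.standard_normal((3, 2))  # 3 rows x 2 columns of standard-normal draws

# Row and column labels are made up for this example.
labeled = pd.DataFrame(values, index=['r1', 'r2', 'r3'], columns=['x', 'y'])
print(labeled)
```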


In [39]:
framenum.head()

Unnamed: 0,0,1,2,3,4
0,2.054217,-0.347961,2.543768,0.925265,1.709306
1,-2.026522,1.000621,-0.131108,0.682161,0.182758
2,0.948419,1.349207,0.078395,0.813012,0.883533
3,-0.395788,0.419687,-1.407644,0.041437,-0.336911
4,0.56856,0.972469,0.989173,0.060191,0.286179


In [40]:
framenum.info() # print summary information about the data frame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       10 non-null     float64
 1   1       10 non-null     float64
 2   2       10 non-null     float64
 3   3       10 non-null     float64
 4   4       10 non-null     float64
dtypes: float64(5)
memory usage: 528.0 bytes


In [41]:
framenum.dtypes # the dtypes attribute shows the data type of each column

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

In [42]:
# Define a custom structured dtype
stock=np.dtype([('name',np.str_,4),
                ('time',np.str_,10),
                ('opening_price',np.float64),
                ('closing_price',np.float64),
                ('lowest_price',np.float64),
                ('highest_price',np.float64),
                ('volume',np.int32)])
# Use the custom dtype to initialize a structured array, i.e. to tell numpy how to interpret the array's memory.
jd_stock=np.loadtxt('data/data.csv',delimiter=',',dtype=stock)
jd_stock

OSError: data/data.csv not found.

In [59]:
jd=pd.DataFrame(jd_stock)
jd.head()

Unnamed: 0,name,time,opening_price,closing_price,lowest_price,highest_price,volume
0,JD,3-Jan-17,25.95,25.82,25.64,26.11,8275300
1,JD,4-Jan-17,26.05,25.85,25.58,26.08,7862800
2,JD,5-Jan-17,26.15,26.3,26.05,26.8,10205600
3,JD,6-Jan-17,26.3,26.27,25.92,26.41,6234300
4,JD,9-Jan-17,26.64,26.26,26.14,26.95,8071500


In [61]:
jd.info() # print summary information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 7 columns):
name             71 non-null object
time             71 non-null object
opening_price    71 non-null float64
closing_price    71 non-null float64
lowest_price     71 non-null float64
highest_price    71 non-null float64
volume           71 non-null int32
dtypes: float64(4), int32(1), object(2)
memory usage: 3.1+ KB
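The same conversion can be tried without the data file. The sketch below builds a small structured array in memory with the field layout defined above and passes it to `pd.DataFrame`. The two records are copied from the output shown earlier; `records` and `frame` are names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Same field layout as the stock dtype above.
stock = np.dtype([('name', np.str_, 4), ('time', np.str_, 10),
                  ('opening_price', np.float64), ('closing_price', np.float64),
                  ('lowest_price', np.float64), ('highest_price', np.float64),
                  ('volume', np.int32)])

# Two records copied from the output shown above; each record is a tuple.
records = np.array([('JD', '3-Jan-17', 25.95, 25.82, 25.64, 26.11, 8275300),
                    ('JD', '4-Jan-17', 26.05, 25.85, 25.58, 26.08, 7862800)],
                   dtype=stock)

frame = pd.DataFrame(records)  # the field names become the column labels
print(frame)
```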


In [62]:
# (3) Build a DataFrame by reading a csv or excel file directly: the read_csv and read_excel functions

jddf=pd.read_csv('data/data.csv',header=None,
                 names=['name','time','opening_price','closing_price',
                       'lowest_price','highest_price','volume'])
# header=None means the first row of the file is not used as column labels
# names specifies the column labels, i.e. the variable names
jddf.head()

Unnamed: 0,name,time,opening_price,closing_price,lowest_price,highest_price,volume
0,JD,3-Jan-17,25.95,25.82,25.64,26.11,8275300
1,JD,4-Jan-17,26.05,25.85,25.58,26.08,7862800
2,JD,5-Jan-17,26.15,26.3,26.05,26.8,10205600
3,JD,6-Jan-17,26.3,26.27,25.92,26.41,6234300
4,JD,9-Jan-17,26.64,26.26,26.14,26.95,8071500


In [68]:
jddf=pd.read_excel('data/data.xlsx',header=None,
                 names=['name','time','opening_price','closing_price',
                       'lowest_price','highest_price','volume'])
jddf.head()

Unnamed: 0,name,time,opening_price,closing_price,lowest_price,highest_price,volume
0,JD,2017-01-03,25.95,25.82,25.64,26.11,8275300
1,JD,2017-01-04,26.05,25.85,25.58,26.08,7862800
2,JD,2017-01-05,26.15,26.3,26.05,26.8,10205600
3,JD,2017-01-06,26.3,26.27,25.92,26.41,6234300
4,JD,2017-01-09,26.64,26.26,26.14,26.95,8071500


In [None]:
# (4) Building a DataFrame from other data sources: see p. 117 of the textbook

In [None]:
# Use one column as the row labels of a DataFrame with the set_index method

In [15]:
jddf=pd.read_table('data/data.csv',sep=',',header=None,
                  names=['name','time','opening_price','closing_price',
                       'lowest_price','highest_price','volume'])
jddfsetindex=jddf.set_index(jddf['time'])
jddfsetindex.head()

time,name,time,opening_price,closing_price,lowest_price,highest_price,volume
3-Jan-17,JD,3-Jan-17,25.95,25.82,25.64,26.11,8275300
4-Jan-17,JD,4-Jan-17,26.05,25.85,25.58,26.08,7862800
5-Jan-17,JD,5-Jan-17,26.15,26.3,26.05,26.8,10205600
6-Jan-17,JD,6-Jan-17,26.3,26.27,25.92,26.41,6234300
9-Jan-17,JD,9-Jan-17,26.64,26.26,26.14,26.95,8071500


In [46]:
'''
Note: the time variable only looks like dates. It is stored as text, not as a
time series, so using it as the index of the DataFrame yields just an
ordinary index:
'''
type(jddfsetindex.index)

pandas.core.indexes.base.Index
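If a real time index is wanted, the text column can be parsed first. The sketch below is one possible approach, not the textbook's: it reads three rows (copied from the output above) from an in-memory string instead of `data/data.csv`, and the format string `'%d-%b-%y'` is an assumption based on the values shown. The names `sample` and `by_time` are hypothetical.

```python
import io
import pandas as pd

# Three rows copied from the output above, held in memory instead of data/data.csv.
csv_text = ("JD,3-Jan-17,25.95,25.82,25.64,26.11,8275300\n"
            "JD,4-Jan-17,26.05,25.85,25.58,26.08,7862800\n"
            "JD,5-Jan-17,26.15,26.3,26.05,26.8,10205600\n")
cols = ['name', 'time', 'opening_price', 'closing_price',
        'lowest_price', 'highest_price', 'volume']
sample = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)

# Parse the text dates; the format string is an assumption based on the values shown.
sample['time'] = pd.to_datetime(sample['time'], format='%d-%b-%y')
by_time = sample.set_index('time')   # passing the column name moves it into the index

print(type(by_time.index))
```

Passing the column name to `set_index` removes `time` from the columns, unlike `set_index(jddf['time'])` above, which keeps both copies.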

In [16]:
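# 2. Exporting data
'''
Besides the I/O API functions listed in Table 3-3, pandas offers methods such as to_dict and
to_latex for writing a DataFrame out to external files, the HDFS distributed file system, or
objects of a given format:
'''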
jddf.to_csv('data/jdstockdata.csv')
jddf.to_excel('data/jdstockdata.xlsx')
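By default `to_csv` also writes the row index as an unnamed first column, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference made by `index=False`; it writes to in-memory buffers rather than to the `data/` directory, and the small frame `small` is made up from values shown above.

```python
import io
import pandas as pd

small = pd.DataFrame({'name': ['JD', 'JD'], 'volume': [8275300, 7862800]})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
small.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
small.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```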

In [55]:
# 3. Indexing and slicing

In [47]:
import pandas as pd
dfdata={'Name':['Zhang San','Li Si','Wang Laowu','Zhao Liu','Qian Qi','Sun Ba'],
        'Subject':['Literature','History','English','Maths','Physics','Chemics'],
       'Score':[98,76,84,70,93,83]}
scoresheet=pd.DataFrame(dfdata)
scoresheet

Unnamed: 0,Name,Subject,Score
0,Zhang San,Literature,98
1,Li Si,History,76
2,Wang Laowu,English,84
3,Zhao Liu,Maths,70
4,Qian Qi,Physics,93
5,Sun Ba,Chemics,83


In [48]:
scoresheet.index=(['No1','No2','No3','No4','No5','No6'])
scoresheet

Unnamed: 0,Name,Subject,Score
No1,Zhang San,Literature,98
No2,Li Si,History,76
No3,Wang Laowu,English,84
No4,Zhao Liu,Maths,70
No5,Qian Qi,Physics,93
No6,Sun Ba,Chemics,83


In [49]:
# A DataFrame can be indexed or sliced by row or by column, i.e. a subset of the data can be extracted
# 3.1 Extract a single column by its column label
scoresheet['Subject'] # scoresheet.Subject

No1    Literature
No2       History
No3       English
No4         Maths
No5       Physics
No6       Chemics
Name: Subject, dtype: object

In [50]:
# 3.2 Extract several columns by passing a list of column labels
scoresheet[['Name','Score']]

Unnamed: 0,Name,Score
No1,Zhang San,98
No2,Li Si,76
No3,Wang Laowu,84
No4,Zhao Liu,70
No5,Qian Qi,93
No6,Sun Ba,83


In [51]:
scoresheet[['No3','No4']] # rows cannot be extracted this way: a list is looked up among the column labels

KeyError: "None of [Index(['No3', 'No4'], dtype='object')] are in the [columns]"

In [53]:
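# 3.3 Extract consecutive rows with a slice
'''
-- With integer slices, the rule is the same as for lists and numpy arrays: the stop position is excluded.
-- With row-label slices, the end label is included.
'''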
scoresheet[:4] 

Unnamed: 0,Name,Subject,Score
No1,Zhang San,Literature,98
No2,Li Si,History,76
No3,Wang Laowu,English,84
No4,Zhao Liu,Maths,70


In [54]:
scoresheet[:'No4'] # a label slice selects rows

Unnamed: 0,Name,Subject,Score
No1,Zhang San,Literature,98
No2,Li Si,History,76
No3,Wang Laowu,English,84
No4,Zhao Liu,Maths,70
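The two slicing rules can be made explicit with `iloc` (positions) and `loc` (labels). The sketch below uses a four-row frame rebuilt from the data above; `sheet`, `by_position`, and `by_label` are names introduced for illustration.

```python
import pandas as pd

sheet = pd.DataFrame({'Name': ['Zhang San', 'Li Si', 'Wang Laowu', 'Zhao Liu'],
                      'Score': [98, 76, 84, 70]},
                     index=['No1', 'No2', 'No3', 'No4'])

by_position = sheet.iloc[:2]    # positions 0 and 1; the stop position 2 (No3) is excluded
by_label = sheet.loc[:'No3']    # labels No1 through No3; the stop label is included

print(by_position)
print(by_label)
```

`No3` sits at position 2, so it is left out of the positional slice but kept in the label slice.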


In [55]:
# 3.4 Extract non-consecutive rows: pass a list of labels to loc, or a list of positions to iloc

scoresheet.loc[['No1','No2','No6']] # by row label

Unnamed: 0,Name,Subject,Score
No1,Zhang San,Literature,98
No2,Li Si,History,76
No6,Sun Ba,Chemics,83


In [56]:
scoresheet.iloc[[0,1,5]] # by integer row position

Unnamed: 0,Name,Subject,Score
No1,Zhang San,Literature,98
No2,Li Si,History,76
No6,Sun Ba,Chemics,83


In [57]:
# 3.5 Extract non-consecutive rows and columns together with loc and iloc

scoresheet.loc[['No1','No5'],['Name','Score']]  # loc indexes by label

Unnamed: 0,Name,Score
No1,Zhang San,98
No5,Qian Qi,93


In [58]:
scoresheet.iloc[[0,4],[0,1]] # iloc indexes by integer row and column position

Unnamed: 0,Name,Subject
No1,Zhang San,Literature
No5,Qian Qi,Physics


In [59]:
# 3.6 Extract data with boolean indexing: returns the full rows whose column values meet the condition
scoresheet

Unnamed: 0,Name,Subject,Score
No1,Zhang San,Literature,98
No2,Li Si,History,76
No3,Wang Laowu,English,84
No4,Zhao Liu,Maths,70
No5,Qian Qi,Physics,93
No6,Sun Ba,Chemics,83


In [60]:
scoresheet[(scoresheet.Score>80)&(scoresheet.Score<=90)]

Unnamed: 0,Name,Subject,Score
No3,Wang Laowu,English,84
No6,Sun Ba,Chemics,83


In [61]:
# scoresheet[rows][columns]: two-step extraction, rows first, then columns.
scoresheet[(scoresheet.Score>80)&(scoresheet.Score<=90)][['Name','Score']]

Unnamed: 0,Name,Score
No3,Wang Laowu,84
No6,Sun Ba,83


In [62]:
# scoresheet[columns][rows]: two-step extraction, columns first, then rows.
scoresheet[['Name','Score']][(scoresheet.Score>80)&(scoresheet.Score<=90)]

Unnamed: 0,Name,Score
No3,Wang Laowu,84
No6,Sun Ba,83
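The two-step forms above can also be written as a single `loc` call that takes the boolean mask and the column list together. This is offered as an alternative sketch rather than the textbook's method; the frame is rebuilt from the data above, and `sheet`, `mask`, and `selected` are hypothetical names.

```python
import pandas as pd

sheet = pd.DataFrame({'Name': ['Zhang San', 'Li Si', 'Wang Laowu', 'Zhao Liu', 'Qian Qi', 'Sun Ba'],
                      'Subject': ['Literature', 'History', 'English', 'Maths', 'Physics', 'Chemics'],
                      'Score': [98, 76, 84, 70, 93, 83]},
                     index=['No1', 'No2', 'No3', 'No4', 'No5', 'No6'])

# Each comparison needs its own parentheses because & binds tighter than > and <=.
mask = (sheet['Score'] > 80) & (sheet['Score'] <= 90)
selected = sheet.loc[mask, ['Name', 'Score']]   # rows and columns chosen in one step
print(selected)
```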


In [63]:
# 4. Row and column operations
# 4.1 Changing the column order
# Either pass columns when constructing the DataFrame or call the reindex method
dfdata={'Name':['Zhang San','Li Si','Wang Laowu','Zhao Liu','Qian Qi','Sun Ba'],
        'Subject':['Literature','History','English','Maths','Physics','Chemics'],
       'Score':[98,76,84,70,93,83]}
scoresheet=pd.DataFrame(dfdata)
scoresheet

Unnamed: 0,Name,Subject,Score
0,Zhang San,Literature,98
1,Li Si,History,76
2,Wang Laowu,English,84
3,Zhao Liu,Maths,70
4,Qian Qi,Physics,93
5,Sun Ba,Chemics,83


In [64]:
scoresheet=pd.DataFrame(dfdata,columns=['ID','Name','Subject','Score'],
                       index=['No1','No2','No3','No4','No5','No6'])
scoresheet

Unnamed: 0,ID,Name,Subject,Score
No1,,Zhang San,Literature,98
No2,,Li Si,History,76
No3,,Wang Laowu,English,84
No4,,Zhao Liu,Maths,70
No5,,Qian Qi,Physics,93
No6,,Sun Ba,Chemics,83


In [None]:
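'''
Note: if the requested order contains a column label that does not exist in the data, its values
are filled with the missing-value marker NaN. Row labels can be set at the same time via index.
'''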

In [65]:
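# Using reindex: it returns a new DataFrame and leaves the original unchanged
# Passing columns to reindex reorders the columns, i.e. changes the order of the variables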
scoresheet.reindex(columns=['Name','Subject','ID','Score'])

Unnamed: 0,Name,Subject,ID,Score
No1,Zhang San,Literature,,98
No2,Li Si,History,,76
No3,Wang Laowu,English,,84
No4,Zhao Liu,Maths,,70
No5,Qian Qi,Physics,,93
No6,Sun Ba,Chemics,,83


In [66]:
# 4.2 Modifying row/column data
# Add a column by assignment
scoresheet['Homeword']=90
scoresheet

Unnamed: 0,ID,Name,Subject,Score,Homeword
No1,,Zhang San,Literature,98,90
No2,,Li Si,History,76,90
No3,,Wang Laowu,English,84,90
No4,,Zhao Liu,Maths,70,90
No5,,Qian Qi,Physics,93,90
No6,,Sun Ba,Chemics,83,90


In [67]:
# Rename a column with the rename() method
scoresheet.rename(columns={'Homeword':'Homework'},inplace=True)
scoresheet

Unnamed: 0,ID,Name,Subject,Score,Homework
No1,,Zhang San,Literature,98,90
No2,,Li Si,History,76,90
No3,,Wang Laowu,English,84,90
No4,,Zhao Liu,Maths,70,90
No5,,Qian Qi,Physics,93,90
No6,,Sun Ba,Chemics,83,90


In [None]:
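'''
Note: methods that modify the data and return a new object usually take an optional inplace
parameter. With inplace=True (the default is False) the original object is modified directly.
With inplace=False the original object is left unchanged, and the result has to be assigned
to a new name or back to the original name.
'''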
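The difference between the two settings can be checked directly. The sketch below uses a two-row frame with the misspelled column from above; `sheet`, `renamed`, and `result` are names introduced for illustration.

```python
import pandas as pd

sheet = pd.DataFrame({'Name': ['Zhang San', 'Li Si'], 'Homeword': [90, 90]})

# inplace=False (the default): a new DataFrame is returned, sheet is untouched.
renamed = sheet.rename(columns={'Homeword': 'Homework'})
columns_before = list(sheet.columns)

# inplace=True: sheet itself is modified and the call returns None.
result = sheet.rename(columns={'Homeword': 'Homework'}, inplace=True)
columns_after = list(sheet.columns)

print(columns_before, columns_after, result)
```

Because the `inplace=True` call returns `None`, writing `sheet = sheet.rename(..., inplace=True)` would discard the frame.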

In [68]:
# Assign a column from a list or array: its length must match the number of rows in the DataFrame
import numpy as np
scoresheet['ID']=np.arange(6)
scoresheet

Unnamed: 0,ID,Name,Subject,Score,Homework
No1,0,Zhang San,Literature,98,90
No2,1,Li Si,History,76,90
No3,2,Wang Laowu,English,84,90
No4,3,Zhao Liu,Maths,70,90
No5,4,Qian Qi,Physics,93,90
No6,5,Sun Ba,Chemics,83,90


In [69]:
# Assign a column from a Series
# Values are matched to the DataFrame's index by label; rows without a matching label get a missing value.
fixed=pd.Series([97,76,83],index=['No1','No3','No6'])
scoresheet['Homework']=fixed
scoresheet

Unnamed: 0,ID,Name,Subject,Score,Homework
No1,0,Zhang San,Literature,98,97.0
No2,1,Li Si,History,76,
No3,2,Wang Laowu,English,84,76.0
No4,3,Zhao Liu,Maths,70,
No5,4,Qian Qi,Physics,93,
No6,5,Sun Ba,Chemics,83,83.0
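The two assignment rules behave differently when the right-hand side is shorter than the frame. The sketch below assumes pandas raises `ValueError` for a list of the wrong length, then shows the label-aligned Series assignment followed by `fillna`. Filling with 0 is a choice made for this example only, and `sheet` is a hypothetical three-row frame.

```python
import pandas as pd

sheet = pd.DataFrame({'Score': [98, 76, 84]}, index=['No1', 'No2', 'No3'])

try:
    sheet['Homework'] = [97, 76]      # two values for three rows
    length_error = False
except ValueError as err:             # a list or array must match the number of rows
    length_error = True
    print('assignment failed:', err)

# A Series is aligned on labels instead, so a shorter one is accepted.
sheet['Homework'] = pd.Series([97, 76], index=['No1', 'No3'])
sheet['Homework'] = sheet['Homework'].fillna(0)   # replacing NaN with 0 is a choice made for this sketch
print(sheet)
```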


In [70]:
# 4.3 Deleting row/column data:
# The del statement deletes a column
# The drop method deletes rows or columns
del scoresheet['Homework']
scoresheet

Unnamed: 0,ID,Name,Subject,Score
No1,0,Zhang San,Literature,98
No2,1,Li Si,History,76
No3,2,Wang Laowu,English,84
No4,3,Zhao Liu,Maths,70
No5,4,Qian Qi,Physics,93
No6,5,Sun Ba,Chemics,83


In [71]:
scoresheet.drop('ID',axis=1,inplace=True)  # axis=1 drops columns, axis=0 drops rows
scoresheet

Unnamed: 0,Name,Subject,Score
No1,Zhang San,Literature,98
No2,Li Si,History,76
No3,Wang Laowu,English,84
No4,Zhao Liu,Maths,70
No5,Qian Qi,Physics,93
No6,Sun Ba,Chemics,83


In [72]:
scoresheet.drop(['No1','No5','No6'],axis=0,inplace=True)
scoresheet

Unnamed: 0,Name,Subject,Score
No2,Li Si,History,76
No3,Wang Laowu,English,84
No4,Zhao Liu,Maths,70
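`drop` also accepts the `columns` and `index` keywords as an alternative to `axis`, and without `inplace=True` it returns a new frame. A minimal sketch on a three-row frame rebuilt from the data above follows; `sheet` and `trimmed` are hypothetical names.

```python
import pandas as pd

sheet = pd.DataFrame({'ID': [0, 1, 2],
                      'Name': ['Zhang San', 'Li Si', 'Wang Laowu'],
                      'Score': [98, 76, 84]},
                     index=['No1', 'No2', 'No3'])

# columns= and index= are an alternative to passing axis=1 / axis=0.
trimmed = sheet.drop(columns=['ID'], index=['No1'])
print(trimmed)
print(sheet.shape)   # the original is untouched because inplace was not set
```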
