## **17. Pandas data transformation functions: map, apply, applymap**

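### Comparison of the transformation functions:

- map: called on a Series; maps each value to another value (value --> value)
- apply: on a Series, processes each value; on a DataFrame, processes one Series at a time along the chosen axis (a column or a row)
- applymap: DataFrame only; applies a function to every element of the DataFrame, one element at a time (deprecated since pandas 2.1 in favor of DataFrame.map)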
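
The three functions can be compared side by side on a small frame. This is a minimal sketch on made-up data (the `toy` frame and its columns are assumptions, not the stock data used in this chapter). Because `applymap` was renamed to `DataFrame.map` in pandas 2.1, the sketch picks whichever of the two exists.

```python
import pandas as pd

# Toy data (hypothetical, not the stock files used in this chapter)
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.map: value -> value, here through a lookup dict
mapped = toy["a"].map({1: "one", 2: "two", 3: "three"})

# Series.apply: the function receives each value
squared = toy["a"].apply(lambda v: v ** 2)

# DataFrame.apply: the function receives one Series per column (axis=0)
# or one Series per row (axis=1)
col_sums = toy.apply(lambda col: col.sum(), axis=0)
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis=1)

# Element-wise on a DataFrame: applymap, or DataFrame.map on pandas 2.1+
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
doubled = elementwise(lambda v: v * 2)

print(mapped.tolist(), squared.tolist(), row_sums.tolist())
```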

### **17.0 Preparing the data**

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

In [2]:
df000651=pd.read_csv('./stocks/000651.csv',dtype={'SaleOrderID':'object','BuyOrderID':'object'})
df000651['Code']='000651'

df601012=pd.read_csv('./stocks/601012.csv',dtype={'SaleOrderID':'object','BuyOrderID':'object'})
df601012['Code']='601012'

df601288=pd.read_csv('./stocks/601288.csv',dtype={'SaleOrderID':'object','BuyOrderID':'object'})
df601288['Code']='601288'

df601318=pd.read_csv('./stocks/601318.csv',dtype={'SaleOrderID':'object','BuyOrderID':'object'})
df601318['Code']='601318'

# Concatenate the four frames into one large DataFrame
df=pd.concat([df000651,df601012,df601288,df601318])

In [3]:
df

Unnamed: 0,TranID,Time,Price,Volume,SaleOrderVolume,BuyOrderVolume,Type,SaleOrderID,SaleOrderPrice,BuyOrderID,BuyOrderPrice,Code
0,1,09:25:00,59.10,400,400,1400,S,1,53.87,1,65.84,000651
1,2,09:25:00,59.10,300,300,1400,S,2,53.87,1,65.84,000651
2,3,09:25:00,59.10,200,200,1400,S,3,53.87,1,65.84,000651
3,4,09:25:00,59.10,200,200,1400,S,4,53.87,1,65.84,000651
4,5,09:25:00,59.10,300,5900,1400,S,5,53.87,1,65.84,000651
...,...,...,...,...,...,...,...,...,...,...,...,...
196762,196763,15:00:00,85.18,100,243258,100,S,8493946,85.18,8461234,85.18,601318
196763,196764,15:00:00,85.18,100,243258,100,S,8493946,85.18,8474391,85.18,601318
196764,196765,15:00:00,85.18,25500,243258,25500,S,8493946,85.18,8492769,85.18,601318
196765,196766,15:00:00,85.18,1000,262500,1000,B,8493946,85.18,8498955,85.18,601318


In [4]:
df['Code'].unique()

array(['000651', '601012', '601288', '601318'], dtype=object)

### **17.1 map works on the values of a Series**

#### Both Series.map(dict) and Series.map(fun) are supported

In [6]:
# Map the stock codes to their Chinese company names
# Define a lookup dict (code -> company name)
dict_name={'000651':'格力电气', '601012':'隆基绿能', '601288':'农业银行', '601318':'中国平安'}

#### **Method 1: Series.map(dict)**

In [13]:
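# Each value in the Code column is used as a key and replaced by the matching dict value
# Name1 is a new column. df['Code'] is a Series, and map is called on that Series.
df['Name1']=df['Code'].map(dict_name)  # the dict acts as a lookup table; codes missing from it become NaN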

In [14]:
df.head(1)

Unnamed: 0,TranID,Time,Price,Volume,SaleOrderVolume,BuyOrderVolume,Type,SaleOrderID,SaleOrderPrice,BuyOrderID,BuyOrderPrice,Code,Name1,name2,name_fun1
0,1,09:25:00,59.1,400,400,1400,S,1,53.87,1,65.84,000651,格力电气,格力电气,格力电气**
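
One detail of `Series.map(dict)` is worth seeing in isolation: a value with no matching key becomes NaN rather than raising an error. The sketch below uses a shortened dict and a made-up code `'999999'` (both are assumptions for illustration) to show this, plus one way to fill in a fallback.

```python
import pandas as pd

# Hypothetical codes; '999999' is deliberately absent from the dict
codes = pd.Series(["000651", "601318", "999999"])
names = {"000651": "格力电气", "601318": "中国平安"}

by_dict = codes.map(names)         # unmapped code -> NaN, no exception
n_missing = by_dict.isna().sum()   # how many codes had no entry

# Supplying a fallback label for unmapped codes
filled = by_dict.fillna("unknown")
print(filled.tolist())
```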


#### **Method 2: Series.map(fun)**
**Here fun receives each element of the Series as its argument**

In [15]:
# Here fun is a lambda
# df['Code'] is a Series, and map is called on that Series
df['name2']=df['Code'].map(lambda x:dict_name[x]) # x is bound in turn to each element of the Series (here df['Code'])

In [16]:
df.head(1)

Unnamed: 0,TranID,Time,Price,Volume,SaleOrderVolume,BuyOrderVolume,Type,SaleOrderID,SaleOrderPrice,BuyOrderID,BuyOrderPrice,Code,Name1,name2,name_fun1
0,1,09:25:00,59.1,400,400,1400,S,1,53.87,1,65.84,000651,格力电气,格力电气,格力电气**


In [17]:
# Use a regular named function. df['Code'] is a Series; each of its elements is passed to fun1(x) automatically
def fun1(x):
    return dict_name[x]+'**'
df['name_fun1']=df['Code'].map(fun1)

In [18]:
df.head(1)

Unnamed: 0,TranID,Time,Price,Volume,SaleOrderVolume,BuyOrderVolume,Type,SaleOrderID,SaleOrderPrice,BuyOrderID,BuyOrderPrice,Code,Name1,name2,name_fun1
0,1,09:25:00,59.1,400,400,1400,S,1,53.87,1,65.84,000651,格力电气,格力电气,格力电气**
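
Unlike the dict form, a function that indexes the dict directly (`dict_name[x]`) raises a `KeyError` for an unknown code and is also called on missing values. A hedged sketch of a more defensive variant follows. The helper `label`, the shortened dict, and the sample Series are assumptions for illustration; `na_action='ignore'` is a documented parameter of `Series.map`.

```python
import pandas as pd

names = {"000651": "格力电气", "601318": "中国平安"}
codes = pd.Series(["000651", None, "601318"])

def label(code):
    # dict.get avoids a KeyError for codes that are not in the dict
    return names.get(code, "unknown") + "**"

# na_action='ignore' propagates missing values instead of passing them to label()
labelled = codes.map(label, na_action="ignore")
print(labelled.tolist())
```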


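### **17.2 apply works on both Series and DataFrame**

#### 1. apply(function) takes a function as its argument; a lookup dict cannot be passed the way it can to map
#### 2. Series.apply(function): function receives each value of the Series
#### 3. DataFrame.apply(function): function receives one Series along the chosen axis of the DataFrame

### **17.2.1 Series.apply(fun)**
#### fun receives each value of the Series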

In [19]:
# Use a lambda
# Similar to Series.map(fun)
df['name3']=df['Code'].apply(lambda x:dict_name[x])

In [20]:
df.head(1)

Unnamed: 0,TranID,Time,Price,Volume,SaleOrderVolume,BuyOrderVolume,Type,SaleOrderID,SaleOrderPrice,BuyOrderID,BuyOrderPrice,Code,Name1,name2,name_fun1,name3
0,1,09:25:00,59.1,400,400,1400,S,1,53.87,1,65.84,000651,格力电气,格力电气,格力电气**,格力电气


In [24]:
# Use a regular named function fun2. Each value of the Series df['Code'] is passed to fun2(x)
def fun2(x):
    return dict_name[x]+'=='
df['name_fun2']=df['Code'].apply(fun2)

In [25]:
df.head(1)

Unnamed: 0,TranID,Time,Price,Volume,SaleOrderVolume,BuyOrderVolume,Type,SaleOrderID,SaleOrderPrice,BuyOrderID,BuyOrderPrice,Code,Name1,name2,name_fun1,name3,name_fun2
0,1,09:25:00,59.1,400,400,1400,S,1,53.87,1,65.84,000651,格力电气,格力电气,格力电气**,格力电气,格力电气==
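
One practical difference from `map` is that `Series.apply` forwards extra arguments to the function, so one function can cover both the `**` and `==` suffixes used above. This is a small sketch with a hypothetical helper `tag` and a shortened dict, not code from the original notebook.

```python
import pandas as pd

names = {"000651": "格力电气", "601318": "中国平安"}
codes = pd.Series(["000651", "601318"])

def tag(code, suffix):
    # code: one element of the Series; suffix: extra argument forwarded by apply
    return names[code] + suffix

with_eq = codes.apply(tag, suffix="==")      # forwarded as a keyword argument
with_star = codes.apply(tag, args=("**",))   # forwarded as a positional argument
print(with_eq.tolist(), with_star.tolist())
```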


### **17.2.2 DataFrame.apply(fun)**
**fun receives one Series along the chosen axis of the DataFrame**

In [26]:
# Slow: here apply is called on the DataFrame and runs the lambda once per row
df['name4']=df.apply(lambda x:dict_name[x['Code']],axis=1)

Note:
- apply is called on the DataFrame df
- In lambda x, x is a Series. Because axis=1 is given, x is one row whose index is the column names of df, so x['Code'] returns that row's value in the Code column
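
The row-wise form can be compared with its vectorized equivalents on a small frame. This is a sketch under assumptions: the `trades` frame and the derived `amount` values are made up for illustration. It shows that `apply(..., axis=1)` and `Series.map` give the same names, and that combining columns often needs no `apply` at all.

```python
import pandas as pd

names = {"000651": "格力电气", "601318": "中国平安"}
trades = pd.DataFrame({
    "Code": ["000651", "601318", "000651"],
    "Price": [59.10, 85.18, 59.10],
    "Volume": [400, 100, 300],
})

# axis=1: `row` is a Series indexed by the column names
via_apply = trades.apply(lambda row: names[row["Code"]], axis=1)

# Same result without a Python-level function call per row
via_map = trades["Code"].map(names)

# Combining several columns: row-wise apply versus plain column arithmetic
amount_apply = trades.apply(lambda row: row["Price"] * row["Volume"], axis=1)
amount_vec = trades["Price"] * trades["Volume"]
print(via_apply.equals(via_map))
```

On a frame with hundreds of thousands of rows, like the one in this chapter, the vectorized forms are usually much faster because they avoid building one Series per row.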

In [27]:
df.head(1)

Unnamed: 0,TranID,Time,Price,Volume,SaleOrderVolume,BuyOrderVolume,Type,SaleOrderID,SaleOrderPrice,BuyOrderID,BuyOrderPrice,Code,Name1,name2,name_fun1,name3,name_fun2,name4
0,1,09:25:00,59.1,400,400,1400,S,1,53.87,1,65.84,000651,格力电气,格力电气,格力电气**,格力电气,格力电气==,格力电气


### **17.3 applymap transforms every value of a DataFrame**

In [28]:
df_sub=df[['Price','SaleOrderPrice','BuyOrderPrice','SaleOrderVolume','BuyOrderVolume']].copy()  # copy, so later in-place edits do not act on a slice of df

In [44]:
df_sub.reset_index(inplace=True)

In [57]:
df_sub.head(5)

Unnamed: 0,index,Price,SaleOrderPrice,BuyOrderPrice,SaleOrderVolume,BuyOrderVolume
0,0,59.1,53.87,65.84,400,1400
1,1,59.1,53.87,65.84,300,1400
2,2,59.1,53.87,65.84,200,1400
3,3,59.1,53.87,65.84,200,1400
4,4,59.1,53.87,65.84,5900,1400


In [58]:
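# Convert every element to an integer (int() drops the fractional part rather than rounding)
df_sub.applymap(lambda x:int(x))  # applymap is deprecated since pandas 2.1; DataFrame.map is its replacement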

Unnamed: 0,index,Price,SaleOrderPrice,BuyOrderPrice,SaleOrderVolume,BuyOrderVolume
0,0,59,53,65,400,1400
1,1,59,53,65,300,1400
2,2,59,53,65,200,1400
3,3,59,53,65,200,1400
4,4,59,53,65,5900,1400
...,...,...,...,...,...,...
534032,196762,85,85,85,243258,100
534033,196763,85,85,85,243258,100
534034,196764,85,85,85,243258,25500
534035,196765,85,85,85,262500,1000
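
The element-wise conversion can also be written with vectorized calls. The sketch below, on a made-up two-row `prices` frame (an assumption for illustration), contrasts truncation via `int` with rounding, and falls back to `applymap` only when `DataFrame.map` is unavailable.

```python
import pandas as pd

prices = pd.DataFrame({"Price": [59.10, 85.18], "BuyOrderPrice": [65.84, 85.99]})

# Element-wise: DataFrame.map on pandas 2.1+, applymap on older versions
elementwise = prices.map if hasattr(prices, "map") else prices.applymap
truncated = elementwise(int)            # int() drops the fractional part

as_int = prices.astype(int)             # vectorized, truncates the same way
rounded = prices.round().astype(int)    # rounds to the nearest integer first
print(truncated.values.tolist(), rounded.values.tolist())
```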


In [61]:
df_sub.drop(columns=['index'],inplace=True)

----

## **18. How to use apply on each groupby group in Pandas**

### **18.0 Preparing the data**

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('./movielens/ml-latest-small/ratings.csv')

In [3]:
df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


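### **18.1 Background**

#### **Applying apply to each groupby group works as follows:**

#### **The process has three steps, split --> apply --> combine, as shown in the figure:**

![GroupBy pattern](groupby.png)

#### Step 1: split. This is the pandas groupby call itself;
#### Step 2: apply. The apply method runs the function we need on each group (the function can be built in or user-defined);
#### Step 3: combine. Pandas assembles the per-group results of apply into a single result and returns it.

#### **GroupBy.apply(function)**
#### 1 The first argument of function is a DataFrame
#### 2 function may return a DataFrame, a Series, a single value, or even a value unrelated to the input DataFrame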
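
The three steps can be spelled out by hand and compared with what `GroupBy.apply` does in one call. This is a minimal sketch on made-up ratings: the `ratings` frame and the helper `rating_range` are assumptions for illustration. The function here returns a single value per group, so the combined result is a Series indexed by the group key.

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2],
    "rating": [4.0, 5.0, 2.0, 3.0, 5.0],
})

def rating_range(group):
    # group is the sub-DataFrame of one user; a single value is returned
    return group["rating"].max() - group["rating"].min()

# split: iterate over (key, sub-DataFrame) pairs
# apply: call the function on each sub-DataFrame
# combine: assemble the per-group results into one Series
manual = pd.Series({key: rating_range(group) for key, group in ratings.groupby("userId")})

# The same three steps done by pandas in one call
auto = ratings.groupby("userId")[["rating"]].apply(rating_range)
print(manual.tolist(), auto.tolist())
```

Selecting `[["rating"]]` before `apply` keeps the grouping column out of the sub-DataFrame, which sidesteps the version-dependent handling of grouping columns inside `apply`.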

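### **Example 1. Normalizing values within each group**

#### Normalization rescales numeric columns with different value ranges onto the interval [0,1]

#### Benefits of normalization:
#### 1. Values become easier to compare across columns, e.g. a price column ranging from hundreds to thousands against a growth column ranging from 0 to 100
#### 2. Machine learning models often train faster and perform better

#### Normalization formula: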

![guiyihua](normalization.png)
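
Assuming the figure shows standard min-max normalization, which is what the `normal(x)` function in this example computes, the formula can be written as:

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

Here $x_{\min}$ and $x_{\max}$ are the smallest and largest values of the column being normalized (in this example, within one group), so $x'$ lies in $[0,1]$ whenever $x_{\max} > x_{\min}$.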

#### **Group by userId, then normalize the rating column: group first, then use apply to run a custom normalization function normal(x)**

In [66]:
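def normal(x): # x is a DataFrame: one of the groups produced by groupby
    max_value=x['rating'].max()
    min_value=x['rating'].min()
    # Add a column normal_rating: apply a lambda to the Series x['rating'] to normalize within the group
    # (if all ratings in a group are equal, max_value == min_value and the division yields NaN)
    x['normal_rating']=x['rating'].apply(lambda y:(y-min_value)/(max_value-min_value))
    return x

# The next line performs all three steps: (1) split into groups, (2) call the per-group normalization function, (3) assemble the results
df.groupby('userId',group_keys=True).apply(normal) # group_keys=True adds the group key to the index of the result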

userId (index),row (index),userId,movieId,rating,timestamp,normal_rating
1,0,1,1,4.0,964982703,0.750000
1,1,1,3,4.0,964981247,0.750000
1,2,1,6,4.0,964982224,0.750000
1,3,1,47,5.0,964983815,1.000000
1,4,1,50,5.0,964982931,1.000000
...,...,...,...,...,...,...
610,100831,610,166534,4.0,1493848402,0.777778
610,100832,610,168248,5.0,1493850091,1.000000
610,100833,610,168250,5.0,1494273047,1.000000
610,100834,610,168252,5.0,1493846352,1.000000
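
The same per-group normalization can be sketched without a custom function by using `groupby(...).transform`, which returns per-group minima and maxima aligned to the original rows. The `ratings` frame below is made up for illustration, and replacing NaN with 0.0 for groups whose ratings are all equal is an arbitrary choice of this sketch, not something the original notebook does.

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3],
    "rating": [1.0, 4.0, 5.0, 2.0, 4.0, 3.0],
})

grouped = ratings.groupby("userId")["rating"]
group_min = grouped.transform("min")   # for each row, the minimum of its own group
group_max = grouped.transform("max")
span = group_max - group_min

# span == 0 means every rating in the group is equal; 0/0 yields NaN there,
# which this sketch replaces with 0.0
ratings["normal_rating"] = ((ratings["rating"] - group_min) / span).fillna(0.0)
print(ratings["normal_rating"].tolist())
```

This variant keeps the original row index instead of producing the two-level index that `groupby(...).apply(normal)` returns with `group_keys=True`.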


### **Example 2. Getting the top n rows of each group**

#### This example retrieves the two highest-rated movies for each user (userId)

In [4]:
df.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


In [73]:
df.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [83]:
df.groupby(by='userId').agg({'rating':['sum','max','min']})  # string names avoid the FutureWarning raised for the builtins sum/max/min

userId,rating sum,rating max,rating min
1,1013.0,5.0,1.0
2,114.5,5.0,2.0
3,95.0,5.0,0.5
4,768.0,5.0,1.0
5,160.0,5.0,1.0
...,...,...,...
606,4078.0,5.0,0.5
607,708.0,5.0,1.0
608,2604.5,5.0,0.5
609,121.0,4.0,3.0


In [None]:
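def gettopn(df1,topn):
    # df1 is one user's group: sort by rating in descending order and keep the first topn rows
    return df1.sort_values(by='rating',ascending=False).head(topn)
# Pass the function itself; extra keyword arguments of apply are forwarded to it
df.groupby(by='userId').apply(gettopn,topn=2)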
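
A top-n-per-group selection can be checked on a small frame. The sketch below uses made-up ratings and a hypothetical helper `top_n` (both assumptions for illustration). It shows the `apply` form next to an `apply`-free alternative that sorts once and then takes the first rows of each group.

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2],
    "movieId": [10, 20, 30, 10, 40],
    "rating":  [3.0, 5.0, 4.0, 2.0, 4.5],
})

def top_n(group, n):
    # group: one user's rows; keep the n highest ratings
    return group.sort_values("rating", ascending=False).head(n)

# Extra keyword arguments of apply (here n=2) are forwarded to top_n
top2 = ratings.groupby("userId")[["movieId", "rating"]].apply(top_n, n=2)

# Alternative without apply: sort once, then take the first 2 rows per user
top2_fast = ratings.sort_values("rating", ascending=False).groupby("userId").head(2)
print(top2["movieId"].tolist())
```

Both forms select the same rows. The `apply` form returns them under a (userId, row) index, while the second keeps the original row labels in sorted order.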