# 基于图的LBSN网络分析

## 参考

- [1] [An Experimental Evaluation of Pointofinterest Recommendation in Locationbased Social Networks（打开思路）](https://www.semanticscholar.org/paper/An-Experimental-Evaluation-of-Point-of-interest-in-Liu-Pham/8c999b9c270340bf57f2113064ab7c0e98710e08?p2df)  
- [2] [Exploiting Geographical Influence for Collaborative Point-of-Interest Recommendation（主文献）](http://www.cse.cuhk.edu.hk/irwin.king.new/_media/presentations/p325.pdf)  
<div id="refer-anchor-3"></div>
- [3] [Augmenting Collaborative Recommender by Fusing Explicit Social Relationships（计算$L+$）](https://www.researchgate.net/profile/Georgios-Pitsilis/publication/230609374_Social_Trust_as_a_solution_to_address_sparsity-inherent_problems_ofRecommender_systems/links/0912f50c4844795591000000/Social-Trust-as-a-solution-to-address-sparsity-inherent-problems-of-Recommender-systems.pdf#page=56)  
<div id="refer-anchor-4"></div>
- [4] [Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation（计算$L+$）](https://d1wqtxts1xzle7.cloudfront.net/2136944/5zoa93jb40vzxxq.pdf?1425083454=&response-content-disposition=inline%3B+filename%3DRandom_walk_computation_of_similarities.pdf&Expires=1624010106&Signature=QN7nQr96QUJdXfH2A9E5qlAjVJKzYPslpb52lpgZdniULKKqL351apNIEkDLrirgoYZZQp8x8B3mOYZ1zpELXIFzJitx5FkGVCtYaJjAnQVWFsUjcAoR83TLCjSxeUCY6-aXpAK6kz-O6HoOdc6v2qXnG4l7Ss9-PBXTIPjttRSkucl8cg7Org6IWvfwVqcvVetZzfGf6DZhF92xFuanwwvtW~GGJJmbjxj9mXKNTQx-Lb7rB38bJvw5ZtRhGRRvW2TM-uMm~HiV1Wunp61pSt4WdP5M8xa5aTODsqxy-DYQ0kBcFmo1mzOfuNurTFz~a2sPkN06anQiXegiRSa5hg__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA)  

## 导入必要的包

* 网络处理：[networkx](https://www.osgeo.cn/networkx/)  
* 稀疏矩阵的处理：[scipy.sparse](https://docs.scipy.org/doc/scipy/reference/sparse.html) & [scipy.sparse(CSDN1)](https://blog.csdn.net/qq_33466771/article/details/80304498) & [scipy.sparse(CSDN2)](https://blog.csdn.net/winycg/article/details/80967112)
* 求解稀疏矩阵方程：[cvxopt](http://cvxopt.org/userguide/spsolvers.html?highlight=cholmod#cvxopt.cholmod.linsolve)
* 加速dataframe计算：[dask](https://dask.org/)
* 加速dataframe的apply计算：[swifter](https://github.com/jmcarpenter2/swifter)   

In [1]:
from IPython.core.interactiveshell import InteractiveShell 
InteractiveShell.ast_node_interactivity = 'all' #默认为'last'
from IPython.display import Image  

import pandas as pd
import numpy as np
import networkx as nx
import os
from scipy import sparse
from cvxopt import cholmod, spmatrix, matrix, umfpack
import dask.dataframe as dd
import swifter
import json
from sklearn.utils import shuffle
from math import exp, sqrt
from tqdm import tqdm
import warnings
import gc

warnings.filterwarnings('ignore')

## 原理分析

### 马尔可夫链

#### 马尔可夫性质和马尔可夫过程

**马尔可夫性质**：对于任何给定的时间，当前时间点往后的未来状态的条件分布仅取决于当前状态，而完全不取决于过去状态（无记忆性）  
**马尔可夫过程**：具有马尔可夫性质的随机过程称为马尔可夫过程。

$$\begin{equation} P(future|present,past) = P(future|present) \end{equation}$$

#### 马尔可夫链

**马尔可夫链**：马尔可夫链是具有离散时间和离散状态空间的马尔可夫过程，是遵循马尔可夫性质的一个离散状态序列。

在数学上，可以表示一个马尔可夫链为

$$\begin{equation} 
X=(X_{n})_{n\in\mathbb{N}}={X_0,X_1,X_2,\ldots}\\
其中,X_{n}\in E \quad \forall n\in\mathbb{N}
\end{equation}$$

那么，马尔可夫性质意味着有
$$\begin{equation} \mathbb{P}(X_{n+1}=s_{n+1}|X_{n}=s_{n},X_{n-1}=s_{n-1},X_{n-2}=s_{n-2},\ldots)
= \mathbb{P}(X_{n+1}=s_{n+1}|X_{n}=s_{n})
\end{equation}$$

#### 社交网络（一个马尔可夫链）

![image.png](attachment:image.png)

<div id="equation-anchor-4"></div>
$Irreduciable$并且对称的用户联系矩阵（或邻接矩阵）
$\\[10px]$
$$
\begin{equation}\begin{pmatrix}
u_{11} & \cdots & u_{1n} \\
\vdots & \ddots & \vdots \\
u_{n1} & \cdots & u_{nn}
\end{pmatrix}\end{equation}
$$

用户$u_{a}$到$u_{b}$的概率可以表达为
$\\[10px]$
$$\begin{equation}
p(b|a) = \frac{u_{ab}}{\sum\limits_{i=1}^{n}u_{ai}}
\end{equation}$$

因此，用户$u_a$的转移矩阵为
$\\[10px]$
$$\begin{equation}
p = [p(u_1|u_a),p(u_2|u_a),\ldots]
\end{equation}$$

### 拉普拉斯矩阵$L$和其伪逆矩阵（也称摩尔-彭若思广义逆）$L+$

以用户联系矩阵[(4)](#equation-anchor-4)为例，其拉普拉斯矩阵定义为
$$\begin{equation}
L(G) = D(G) - A(G)
\\[10px]
\begin{pmatrix}
degree(u_1) & -u_{12} & \cdots & -u_{1n} \\
-u_{21} & degree(u_2) & \cdots & -u_{2n} \\
\vdots & \ddots & \vdots & \vdots \\
-u_{n1} & -u_{n2} & \cdots & degree(u_n)
\end{pmatrix}\end{equation}$$
$\\[30px]$
根据文献[<sup>[4]</sup>](#refer-anchor-4)，$L+$的定义为
$$\begin{equation}
L+ = (L-ee^T/n)^{-1}+ee^T/n
\end{equation}$$
$\\[10px]$
因为大型稀疏矩阵的逆是极难求解的，根据文献[<sup>[4]</sup>](#refer-anchor-4)提供的迭代方法每次求取$L+$的一行$(l_{i}^+)$以减少运算量

定义单位矩阵$I$、列向量$e_{i}$和$e$

$$I=\begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}$$

$$e_{i} = I的第i列向量$$

$$e = 
\begin{bmatrix}
1 & 1 & \cdots & 1
\end{bmatrix}^T
$$

$1.$计算向量$e_{i}在L$的列空间上的投影

$$\begin{equation}
y_{i}=proj_{L}(e_{i})=(I-ee^{T}/n)e_{i}
\end{equation} $$

$2.$找到方程$Ll=y_{i}$的解$l_{i}^{*+}$

$3.$将结果$l_{i}^{*+}$投影到$L$的行空间上

$$\begin{equation}
l_{i}^{+}=proj_{L}(l_{i}^{*+})=(I-ee^{T}/n)l_{i}^{*+}
\end{equation} $$

对**第一步**过程进行再化简
$\\[30px]$
$$
(I-ee^T/n)e_{i} =
\begin{pmatrix}
\frac{n-1}{n} & -\frac{1}{n} & \cdots & -\frac{1}{n} \\
-\frac{1}{n} & \frac{n-1}{n} & \cdots & -\frac{1}{n} \\
\vdots & \ddots & \vdots \\
-\frac{1}{n} & -\frac{1}{n} & \cdots & \frac{n-1}{n}
\end{pmatrix}_{n\times{n}} \cdot e_{i} = 
[\underset{1}{-\frac{1}{n}},\cdots,\underset{i-1}{-\frac{1}{n}},\underset{i}{\frac{n-1}{n}},\underset{i+1}{-\frac{1}{n}},\cdots,\underset{n}{-\frac{1}{n}}]^T = I的第i列每个元素 - \frac{1}{n}
$$
$\\[30px]$

In [None]:
e_i = I.getcol(0)
e_i = e_i - np.ones((length,1)) / length
y_i = e_i

根据文献[<sup>[4]</sup>](#refer-anchor-4)对**第二步**过程再化简  

使用普通的求解稠密矩阵方程的方法求解大规模的稀疏矩阵方程效率很低  

结合拉普拉斯矩阵**半正定性**，我们使用$cvxopt$库提供的$cvxopt.cholmod.linsolve$方法对$L$进行科列斯基分解并求解方程

In [None]:
# Cholesky factorization & Solve equation
coo= L.tocoo()
SP = spmatrix(coo.data, coo.row.tolist(), coo.col.tolist())         # ！耗时较长！
b = matrix(y_i, tc='d')
cholmod.linsolve(SP,b)                                              # ！耗时较长！

对**第三步**过程进行再化简
$$
(I-ee^T/n)l_{i}^{*+} =
\begin{pmatrix}
\frac{n-1}{n} & -\frac{1}{n} & \cdots & -\frac{1}{n} \\
-\frac{1}{n} & \frac{n-1}{n} & \cdots & -\frac{1}{n} \\
\vdots & \ddots & \vdots \\
-\frac{1}{n} & -\frac{1}{n} & \cdots & \frac{n-1}{n}
\end{pmatrix}_{n\times{n}} \cdot l_{i}^{*+} =
l_{i1}^{*+}
\begin{pmatrix}
\frac{n-1}{n} \\ -\frac{1}{n} \\ \vdots \\ -\frac{1}{n} \\
\end{pmatrix} +
l_{i2}^{*+}
\begin{pmatrix}
-\frac{1}{n} \\ \frac{n-1}{n} \\ \vdots \\ -\frac{1}{n} \\
\end{pmatrix} + \cdots +
l_{in}^{*+}
\begin{pmatrix}
-\frac{1}{n} \\ \vdots \\ - \frac{1}{n} \\ \frac{n-1}{n} 
\end{pmatrix}
$$
$\\[30px]$
$$
= \begin{pmatrix}
l_{i1}^{*+} \cdot \frac{n-1}{n} - (\sum\limits_{j=1}^{n} l_{ij}^{*+} - l_{i1}^{*+}) \cdot \frac{1}{n} \\
l_{i2}^{*+} \cdot \frac{n-1}{n} - (\sum\limits_{j=1}^{n} l_{ij}^{*+} - l_{i2}^{*+}) \cdot \frac{1}{n} \\
\vdots \\
l_{in}^{*+} \cdot \frac{n-1}{n} - (\sum\limits_{j=1}^{n} l_{ij}^{*+} - l_{in}^{*+}) \cdot \frac{1}{n}
\end{pmatrix} = 
\begin{pmatrix}
l_{i1}^{*+} - \frac{1}{n}\sum\limits_{j=1}^{n} l_{ij}^{*+} \\
l_{i2}^{*+} - \frac{1}{n}\sum\limits_{j=1}^{n} l_{ij}^{*+} \\
\vdots \\
l_{in}^{*+} - \frac{1}{n}\sum\limits_{j=1}^{n} l_{ij}^{*+} \\
\end{pmatrix} = 
\begin{pmatrix}
l_{i1}^{*+} - avg(l_{i}^{*+}) \\
l_{i2}^{*+} - avg(l_{i}^{*+}) \\
\vdots \\
l_{in}^{*+} - avg(l_{i}^{*+}) \\
\end{pmatrix}
$$

In [None]:
l = np.array(b) - np.mean(np.array(b))

计算出来的$L+$具有什么样的意义？
$\\[0px]$
<center><b>$L+$ provides a <font color="#FF0000">similarity measure $(sim(i,j)=l^{+}_{ij})$</font> since it is the matrix containing</b></center> 
<center><b>the inner products of the node vectors in the Euclidean space where the nodes are exactly separated by the ECTD.</b></center>
<center>(ECTD means Euclidean Commute Time Distance)</center>

$\\[0px]$
$L+$可以用来计算<b><font color="#FF0000">平均通勤时间（Average Commute Time）</font></b>
$\\[0px]$
<center>$ACT(i,j)$<font color="##33CC00">$\downarrow$</font>$= V_G(l_{ii}^{+}+l_{jj}^{+}-2l_{ij}^{+}$<font color="#FF0000">$\uparrow$</font>$)$</center>

$$其中,V_G为所有节点的度的总和$$

$\\[0px]$
以用户$u_{a}$为例，从$u_{a}$转移到每个与其相邻的点的概率即可定义为<b><font color="#FF0000">用户转移矩阵</font></b>
$\\[0px]$
$$p = [p(u_1|u_a),p(u_2|u_a),\ldots]$$

与文献[<sup>[3]</sup>](#refer-anchor-3)[<sup>[4]</sup>](#refer-anchor-4)类似，我们将用户和地点连接成图（即将**用户—用户**和**用户—地点**两个邻接矩阵加入同一张图）  

并用以下的方式生成两个相邻节点间边的权重
$\\[0px]$

| 用户—用户边的权重 |          用户—地点边的权重         | 
| :----: | :----: |
| 是朋友，则 $=1$  | $checkins>$均值 $\&$ $avg(ratings)>$均值 $=1$|
| 不是朋友，则 $=0$ | 反之 $=0$                        |

## 数据来源

### Raw Data

In [2]:
DATA_PATH = "./Raw Data/"
USERS_DATA_PATH = DATA_PATH + "users.csv"
VENUES_DATA_PATH = DATA_PATH + "venues.csv"
SOCIAL_DATA_PATH = DATA_PATH + "socialgraph.csv"
RATINGS_DATA_PATH = DATA_PATH + "ratings.csv"
CHECKINS_DATA_PATH = DATA_PATH + "checkins.csv"

### Clean Data

In [2]:
CLEAN_DATA = "./Clean Data/"
CLEAN_USERS_DATA = CLEAN_DATA + 'users.csv'
CLEAN_RATINGS_DATA = CLEAN_DATA + 'ratings.csv'
CLEAN_SOCIALGRAPH_DATA = CLEAN_DATA + 'socialgraph.csv'
CLEAN_CHECKINS_DATA = CLEAN_DATA + 'checkins.csv'
CLEAN_AGGR_ALL_DATA = CLEAN_DATA + 'aggr.csv'

## 数据结构

In [3]:
all_dtypes = {'checkins': {'id': 'int32', 'user_id': 'int32', 'venue_id': 'int32'},
 'socialgraph': {'first_user_id': 'int32', 'second_user_id': 'int32'},
 'ratings': {'user_id': 'int32', 'venue_id': 'int32', 'rating': 'int8'},
 'users': {'id': 'int32', 'latitude': 'float16', 'longitude': 'float16'},
 'venues': {'id': 'int32', 'latitude': 'float16', 'longitude': 'float16'},
 'aggr': {'rating_mean': 'float16', 'rating_count': 'int16', 'checkins_count': 'float16', 
          'checkins_count_adjust': 'float16'}}

# 数据预处理

## 5个数据集的单独处理

### checkins数据集预处理

#### 读取数据

In [5]:
# 设置chunksize参数，来控制每次迭代数据的大小
checkins_data = pd.DataFrame()
chunker = pd.read_csv(CHECKINS_DATA_PATH,chunksize=500000)
for item in chunker:
    checkins_data = checkins_data.append(item)

In [6]:
checkins_data

Unnamed: 0,id,user_id,venue_id,latitude,longitude,created_at
0,16,539270,1206,41.878114,-87.629798,2011-12-08 05:08:42
1,17,1330941,1206,0.000000,0.000000,2011-12-08 04:32:19
2,18,1330942,1206,0.000000,0.000000,2011-12-08 04:29:38
3,19,282798,1206,41.878114,-87.629798,2011-12-08 04:26:06
4,20,376793,1206,41.878114,-87.629798,2011-12-08 04:17:50
...,...,...,...,...,...,...
1021961,1021977,244608,11138,0.000000,0.000000,2012-04-23 01:47:05
1021962,1021978,2153502,783,0.000000,0.000000,2012-04-23 01:42:42
1021963,1021979,592192,82919,40.239812,-76.919974,2012-04-22 23:26:48
1021964,1021980,494946,68691,32.912624,-96.638883,2012-04-23 00:36:33


#### 去除该表格中venues的经纬度数据并将create_at字段转化为日期格式
* 我们发现在该数据集中相同venue_id对应的latitude和longitude有细微的差别，考虑可能是用户打卡的地点问题，所以**统一使用venue表中的地理位置信息作为每个地点的经纬度**  
* 将**user_id、venue_id和created_at三个字段均相同**的记录**视为重复并删除**

In [7]:
def clean_checkins_data(cdf):
    # 去除经纬度数据
    cdf = cdf.drop(['latitude','longitude'], axis=1)
    # 将object转为日期格式
    cdf['created_at'] = pd.to_datetime(cdf['created_at'])
    # 去重
    print('存在{}个重复值'.format(cdf.duplicated(['user_id','venue_id','created_at']).sum()))
    cdf = cdf.drop_duplicates(['user_id','venue_id','created_at'])
    # 查看空值
    print('数据集中含有{}个空值'.format(cdf.isna().sum().sum()))
    
    return cdf

In [8]:
cdf = clean_checkins_data(checkins_data)

存在5462个重复值
数据集中含有0个空值


### 社交网络数据预处理

#### 读取数据

In [9]:
#设置chunksize参数，来控制每次迭代数据的大小
social_data = pd.DataFrame()
chunker = pd.read_csv(SOCIAL_DATA_PATH, chunksize=5000000)
for item in chunker:
    social_data = social_data.append(item)

In [10]:
social_data

Unnamed: 0,first_user_id,second_user_id
0,1,10
1,10,1
2,1,11
3,11,1
4,1,12
...,...,...
27098485,venues,0
27098486,ratings,0
27098487,schema_migrations,0
27098488,(12 rows),0


#### 去除表格尾部字段并转换类型为int
* 表格尾部有一些杂乱的字段，删除  
* 因为id都是整数，直接将float转换为int
* 将**first_user_id和second_user_id两个字段均相同**的记录**视为重复并删除**

In [11]:
def clean_social_data(sdf):
    # 去除尾部字段
    sdf = sdf[:-18]
    # 重新设置索引
    sdf.reset_index(drop=True, inplace=True)
    # 将类型转换成整数int
    sdf = sdf.astype('int64')
    # 去重
    print('存在{}个重复值'.format(sdf.duplicated().sum()))
    sdf = sdf.drop_duplicates()
    # 查看空值
    print('数据集中含有{}个空值'.format(sdf.isna().sum().sum()))
    
    return sdf

In [12]:
sdf = clean_social_data(social_data)

存在9260220个重复值
数据集中含有0个空值


### 用户评分数据预处理

#### 读取数据

In [13]:
#设置chunksize参数，来控制每次迭代数据的大小
ratings_data = pd.DataFrame()
chunker = pd.read_csv(RATINGS_DATA_PATH,chunksize=1000000)
for item in chunker:
    ratings_data = ratings_data.append(item)

In [14]:
ratings_data

Unnamed: 0,user_id,venue_id,rating
0,1,1,5
1,1,51,4
2,1,51,2
3,1,51,5
4,1,52,5
...,...,...,...
2809575,2153498,91385,2
2809576,2153499,783,2
2809577,2153500,91385,2
2809578,2153501,68691,2


#### 查看空值

In [15]:
def clean_rating_data(rdf):
    # 查看空值
    print('数据集中含有{}个空值'.format(rdf.isna().sum().sum()))
    return rdf

In [16]:
rdf = clean_rating_data(ratings_data)

数据集中含有0个空值


### 用户数据预处理

#### 读取数据

In [80]:
#设置chunksize参数，来控制每次迭代数据的大小
users_data = pd.DataFrame()
chunker = pd.read_csv(USERS_DATA_PATH,chunksize=1000000)
for item in chunker:
    users_data = users_data.append(item)

In [81]:
users_data

Unnamed: 0,id,latitude,longitude
0,1,45.072464,-93.455788
1,2,30.669682,-81.462592
2,3,43.549975,-96.700327
3,4,44.840798,-93.298280
4,5,27.949436,-82.465144
...,...,...,...
2153464,2153498,0.000000,0.000000
2153465,2153499,0.000000,0.000000
2153466,2153500,0.000000,0.000000
2153467,2153501,0.000000,0.000000


#### 去重和查看空值
* 将**id、latitude和longitude三个字段均相同**的记录**视为重复并删除**  
* 经纬度均为0的家乡认为是缺失值，填充Null

![image.png](attachment:9159cca9-c915-470d-ab08-3bf5ddfa9c59.png)

In [82]:
def clean_users_data(udf):
    # 去重
    print('存在{}个重复值'.format(udf.duplicated().sum()))
    udf = udf.drop_duplicates()
    # 查看空值
    print('数据集中含有{}个空值'.format(udf.isna().sum().sum()))
    # 置空值为Null
    null_idx = list(udf[(udf['latitude']==0)&(udf['longitude']==0)].index)
    udf.loc[null_idx, 'latitude':'longitude'] = np.nan
    print('共{}个地点缺失，置为Null！'.format(udf.isna().sum().sum()/2))
    
    return udf

In [83]:
udf = clean_users_data(users_data)

存在0个重复值
数据集中含有0个空值


#### 纬度数据大于90°或小于-90°，或经度数据大于180°或小于-180°，则认为地理信息记录出错，填充Null值

In [99]:
def set_error_lat_and_lon(df):
    has_null = df.isna().sum().sum()/2
    lat_error_idx = list(df[(df['latitude']>90)|(df['latitude']<-90)].index)
    lon_error_idx = list(df[(df['longitude']>180)|(df['longitude']<-180)].index)
    df.loc[lat_error_idx+lon_error_idx, 'latitude':'longitude'] = np.nan
    print('共{}个地点的经纬度数据记录错误，置为Null！'.format(df.isna().sum().sum()/2 - has_null))
    
    return df

In [100]:
udf = set_error_lat_and_lon(udf)

共0.0个地点的经纬度数据记录错误，置为Null！


### 地点数据预处理

#### 读取数据

In [101]:
venues_data = pd.DataFrame()
chunker = pd.read_csv(VENUES_DATA_PATH,chunksize=500000)
for item in chunker:
    venues_data = venues_data.append(item)

In [102]:
venues_data

Unnamed: 0,id,latitude,longitude
0,1,44.882011,-93.212364
1,2,44.883169,-93.213687
2,3,44.883455,-93.214316
3,4,44.881387,-93.213801
4,5,44.882129,-93.214012
...,...,...,...
1143085,1143087,0.000000,0.000000
1143086,1143088,0.000000,0.000000
1143087,1143089,0.000000,0.000000
1143088,1143090,0.000000,0.000000


#### 判断空值

In [103]:
print('数据集中含有{}个空值'.format(venues_data.isna().sum().sum()))

数据集中含有0个空值


#### 地点数据去重
* **将lontitude和latitude均为0**的数据置为Null（因为该数据为缺失数据）

In [104]:
# 对地点数据做应进行修改
def revise_venues_df(vdf):
    # 判断缺失数据
    null_idx = list(vdf[(vdf['latitude']==0)&(vdf['longitude']==0)].index)
    # 将缺失值置为Null
    vdf.loc[null_idx, 'latitude':'longitude'] = np.nan
    print('共{}个地点缺失，置为Null！'.format(vdf.isna().sum().sum()/2))
    return vdf

In [105]:
vdf = revise_venues_df(venues_data)

共1327.0个地点缺失，置为Null！


#### 纬度数据大于90°或小于-90°，或经度数据大于180°或小于-180°，则认为地理信息记录出错，对其置为Null

In [106]:
vdf = set_error_lat_and_lon(vdf)

共1.0个地点的经纬度数据记录错误，置为Null！


## 数据集整合的预处理

### 查看各数据集之间的数据是否均匹配

In [112]:
# 判断cdf的uid是否都在udf表中
set(cdf['user_id']) - set(udf['id'])
# 判断cdf的vid是否都在vdf表中
set(cdf['venue_id']) - set(vdf['id'])
# 判断social表中的uid是否都在udf表中
set(sdf['first_user_id']) - set(udf['id'])
set(sdf['second_user_id']) - set(udf['id'])
# 判断ratings中的uid是否都在udf表中
set(rdf['user_id']) - set(udf['id'])
# 判断ratings中的vid是否都在vdf表中
set(rdf['venue_id']) - set(vdf['id'])

set()

set()

set()

set()

set()

set()

### 分析每个节点的入度和出度是否相等以检查是否存在**“关注”**行为
* 可以观察到：（一）仅有2个节点入度大于出度，差值为1；（二）1个节点的出度大于入度，差值为2。
* 因此，可以**忽略“关注”行为**

In [27]:
def search_no_bothway(G):
    # 记录入度-出度的差值
    point_degree_difference = {}  
    for node in G.nodes():
        # 入度-出度
        point_degree_difference[node] = G.in_degree(node) - G.out_degree(node)
            
    return point_degree_difference

In [28]:
# 创建空的简单有向图
G=nx.DiGraph()  
# 添加边
G.add_edges_from(sdf.values)   
# 分析入度-出度
point_degree_difference = search_no_bothway(G)

In [29]:
pd.DataFrame(sorted(point_degree_difference.items(), key=lambda x : x[1], reverse=True), columns=['node','difference'])

Unnamed: 0,node,difference
0,1562442,1
1,1562480,1
2,1,0
3,10,0
4,11,0
...,...,...
1880401,2090033,0
1880402,1764449,0
1880403,2146571,0
1880404,2146575,0


### 处理评分数据和checkins不匹配的问题
* 如果有rating的数量和checkins的数量**不匹配**，以**最多的那个为访问频率**作为checkins的次数
* 如果**有checkins没有rating**，**将rating设置为1**（不存在该情况）

#### 观察rating数据

In [30]:
rdf.rating.describe()

count    2.809580e+06
mean     3.515076e+00
std      1.275262e+00
min      2.000000e+00
25%      2.000000e+00
50%      4.000000e+00
75%      5.000000e+00
max      5.000000e+00
Name: rating, dtype: float64

#### 使用**dask**进行加速运算

In [113]:
rdfdd = dd.read_csv(RATINGS_DATA_PATH)
cdfdd = dd.read_csv(CHECKINS_DATA_PATH, parse_dates=['created_at'])

In [114]:
# 聚合用户对某一个地点的评分和频率
def aggregation_of_rates(rating):
    # 用户对某地点评分的均值
    df_mean = rating.groupby(['user_id','venue_id']).rating.mean()
    df_mean = pd.DataFrame(df_mean.compute())
    df_mean.columns = ['rating_mean']
    # 用户对某地点评分的次数
    df_count = rating.groupby(['user_id','venue_id']).rating.count()
    df_count = pd.DataFrame(df_count.compute())
    df_count.columns = ['rating_count']
    df = pd.merge(df_mean, df_count, how='inner', left_index=True, right_index=True)
    return df

# 聚合用户对某一个地点的访问频率
def aggregation_of_checks(checkins):
    # 用户对某地点的checkins次数
    df = checkins.groupby(['user_id','venue_id']).id.count()
    df = pd.DataFrame(df.compute())
    df.columns = ['checkins_count']
    return df

In [115]:
aggr_rating_df = aggregation_of_rates(rdfdd)
aggr_check_df = aggregation_of_checks(cdfdd)
aggr_all = pd.merge(aggr_rating_df, aggr_check_df, how='outer', left_index=True, right_index=True)
# 使用rating_count填充checkins_count的缺失值
aggr_all.checkins_count.fillna(aggr_all.rating_count, inplace=True)
# # 调整后的checkins_count
aggr_all['checkins_count_adjust'] = np.maximum(aggr_all.rating_count.values, aggr_all.checkins_count.values)

In [116]:
aggr_all

Unnamed: 0_level_0,Unnamed: 1_level_0,rating_mean,rating_count,checkins_count,checkins_count_adjust
user_id,venue_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1,5.000000,1,1.0,1.0
1,51,3.666667,3,1.0,3.0
1,52,5.000000,1,1.0,1.0
1,53,5.000000,1,1.0,1.0
1,54,5.000000,1,1.0,1.0
...,...,...,...,...,...
2153498,91385,2.000000,1,1.0,1.0
2153499,783,2.000000,1,1.0,1.0
2153500,91385,2.000000,1,1.0,1.0
2153501,68691,2.000000,1,1.0,1.0


## 数据结构优化
* 减少内存，加速计算

### 记录dtype

In [117]:
# 记录每个文件读取时的dtype
all_dtypes = {}

### 减少内存占用

In [118]:
# reduce_mem_usage 函数通过调整数据类型，帮助我们减少数据在内存中占用的空间
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() 
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            elif str(col_type)[:8] == 'datetime':
                pass
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() 
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

# 清洗并保存数据
def clean_and_save_data(df, dict_dt, name):
    df = reduce_mem_usage(df)
    dict_dt[name] = dict(df.dtypes.map(lambda x: str(x)))
    if name != 'aggr':
        df.to_csv('./Clean Data/{}.csv'.format(name), index=0)
    else:
        df.to_csv('./Clean Data/{}.csv'.format(name))
    print('{}数据保存成功！'.format(name))

In [119]:
df_all = [cdf, sdf, rdf, udf, vdf, aggr_all]
name_all = ['checkins', 'socialgraph', 'ratings', 'users', 'venues', 'aggr']
for df,n in zip(df_all, name_all):
    clean_and_save_data(df, all_dtypes, n)

Memory usage of dataframe is 28458696.00 MB
Memory usage after optimization is: 28458696.00 MB
Decreased by 0.0%
checkins数据保存成功！
Memory usage of dataframe is 142706144.00 MB
Memory usage after optimization is: 142706144.00 MB
Decreased by 0.0%
socialgraph数据保存成功！
Memory usage of dataframe is 25286348.00 MB
Memory usage after optimization is: 25286348.00 MB
Decreased by 0.0%
ratings数据保存成功！
Memory usage of dataframe is 152797088.00 MB
Memory usage after optimization is: 118341584.00 MB
Decreased by 22.5%
users数据保存成功！
Memory usage of dataframe is 27434288.00 MB
Memory usage after optimization is: 9144848.00 MB
Decreased by 66.7%
venues数据保存成功！
Memory usage of dataframe is 174803433.00 MB
Memory usage after optimization is: 116322081.00 MB
Decreased by 33.5%
aggr数据保存成功！


# 马尔科夫链预处理

## 读入数据

In [4]:
udf = pd.read_csv(CLEAN_USERS_DATA, dtype=all_dtypes['users'])
sdf = pd.read_csv(CLEAN_SOCIALGRAPH_DATA, dtype=all_dtypes['socialgraph'])
aggr_all = pd.read_csv(CLEAN_AGGR_ALL_DATA, index_col=[0,1], dtype=all_dtypes['aggr'])

## 构造转换字典
* **tdict['uid']:** {uid : node_id}
* **tdict['vid']:** {vid : node_id}
* **rtdict['uid']:** {node_id : uid}
* **rtdict['vid']:** {node_id : vid}

In [15]:
def transfer_dict(uid,vid):
    # 字典总长度
    new_id = list(range(len(uid+vid)))
    # 正向字典：{真实id：node_id}
    transfer_dict = {'uid':dict(zip(uid, new_id[:len(uid)])), 'vid':dict(zip(vid, new_id[len(uid):]))}
    # 逆向字典：{node_id：真实id}
    reverse_transfer_dict = {'uid':dict(zip(new_id[:len(uid)], uid)), 'vid':dict(zip(new_id[len(uid):], vid))}
    return transfer_dict, reverse_transfer_dict

In [16]:
tdict, rtdict = transfer_dict(list(udf.id.values), np.unique(aggr_all.reset_index(drop=False).venue_id.values).tolist())

In [17]:
# 观察node为多少的时候开始是地点
tdict['vid'][1]

2153469

## 使用**scipy.sparse.lil_matrix**初始化邻接矩阵
* 使用该稀疏矩阵类型的原因是**方便后续的修改（加入）**

In [18]:
# 稀疏矩阵长度
len_of_id = len(list(udf.id.values)+np.unique(aggr_all.reset_index(drop=False).venue_id.values).tolist())
# 创建稀疏矩阵
adjacency = sparse.lil_matrix((len_of_id, len_of_id),dtype='int8')

## 构造邻接矩阵（无权拉普拉斯矩阵）

In [19]:
sdf

Unnamed: 0,first_user_id,second_user_id
0,1,10
1,10,1
2,1,11
3,11,1
4,1,12
...,...,...
17838247,456244,97074
17838248,97074,186390
17838249,186390,97074
17838250,97074,143776


### 用户之间的邻接矩阵
* 好友之间的距离设置为 $1$
* 非好友之间的距离设置为 $0$  

In [20]:
# 创建用户-用户邻接矩阵
def create_user_user_adj(tdict, social, adj):
    udict = tdict['uid']
    # 通过字典进行转换
    row = list(social.first_user_id.map(lambda x: udict[x]).values)
    line = list(social.second_user_id.map(lambda x: udict[x]).values)
    # 因为原数据已经是双向了，所以只需要连接一次
    adj[row,line] = 1
    return adj

In [21]:
adjacency = create_user_user_adj(tdict, sdf, adjacency)

### 用户-地点的邻接矩阵  
* 直接使用均值对数据进行划分
* 评分均分 $>=$ 均值 或访问次数 $>=$ 均值，则连接为1；反之则为0
* 使用**swifter**进行加速

In [22]:
aggr_all

Unnamed: 0_level_0,Unnamed: 1_level_0,rating_mean,rating_count,checkins_count,checkins_count_adjust
user_id,venue_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1,5.000000,1,1.0,1.0
1,51,3.666016,3,1.0,3.0
1,52,5.000000,1,1.0,1.0
1,53,5.000000,1,1.0,1.0
1,54,5.000000,1,1.0,1.0
...,...,...,...,...,...
103224,833901,5.000000,1,1.0,1.0
103224,833530,5.000000,1,1.0,1.0
103224,833861,5.000000,1,1.0,1.0
103224,833919,5.000000,1,1.0,1.0


In [13]:
aggr_all = aggr_all.astype('float32')
aggr_all.describe()
aggr_all = aggr_all.astype(all_dtypes['aggr'])

Unnamed: 0,rating_mean,rating_count,checkins_count,checkins_count_adjust
count,2436723.0,2436723.0,2436723.0,2436723.0
mean,3.5596,1.153016,1.151342,1.155248
std,1.290131,0.554087,0.551501,0.5594989
min,2.0,1.0,1.0,1.0
25%,2.0,1.0,1.0,1.0
50%,4.0,1.0,1.0,1.0
75%,5.0,1.0,1.0,1.0
max,5.0,137.0,137.0,137.0


In [14]:
# 将数据按0、1分隔开
aggr_count = aggr_all.swifter.apply(lambda x: (x['rating_mean']>=3.6)|(x['checkins_count_adjust']>=2), axis=1)

In [15]:
# 创建用户-地点邻接矩阵
def create_user_venue_adj(tdict, aggr, adj):
    udict = tdict['uid']
    vdict = tdict['vid']
    # 通过字典进行转换
    row = [udict[i[0]] for i in list(aggr[aggr].index)]
    line = [vdict[i[1]] for i in list(aggr[aggr].index)]
    # 需要双向连接
    adj[row,line] = 1
    adj[line,row] = 1
    return adj

In [16]:
adjacency = create_user_venue_adj(tdict, aggr_count, adjacency)

In [17]:
# g观察邻接矩阵的大小
adjacency.get_shape()

(3293963, 3293963)

In [18]:
# 保存邻接矩阵
sparse.save_npz('adj_matrix.npz', adjacency.tocsc())

## 根据邻接矩阵生成图

In [19]:
# 读取邻接矩阵
adjacency = sparse.load_npz('adj_matrix.npz')
# 根据邻接矩阵生成图
G = nx.from_scipy_sparse_matrix(adjacency)

# 马尔科夫链

## 寻找checkins数量前10的用户

In [20]:
# 找出拥有最多checkins的前10名用户
most_checkins_users = list(aggr_all.groupby(['user_id']).checkins_count_adjust.sum()\
                          .sort_values(ascending=False).index)[:10]

In [21]:
most_checkins_users

[30200, 54953, 103224, 281, 4442, 79082, 41460, 56474, 61219, 46040]

## 获取目标节点两步之内的所有节点构成图网络并计算拉普拉斯矩阵$L$
* 计算整张图的开销很大（$320万\times320万$的稀疏邻接矩阵），并且意义也不是很大。
* 因此取以**每个中心节点（每个most_checkins_user）两步内能到达的所有节点构成的子图**作为分析对象

In [22]:
# 将uid列表转为node_id列表
def transfer_uid_list_2_node(l, tdict):
    nlist = []
    for n in l:
        nlist.append(tdict['uid'][n])
        
    return nlist

# 获取目标节点limit步以内的所有节点构成图网络（这里limit为2，
# 同时我们去除了第二步的user的user节点，保留由地点引出的user节点。因为二层user作用比较少，又会增大计算量）
def get_sub_graphs(G, nlist, tdict, limit=2, drop_none_friends=True):
    # 先将uid列表（nlist）转换为node_id列表
    nlist = transfer_uid_list_2_node(nlist, tdict)
    sub_graphs = []
    for n in tqdm(nlist):
        # 获取limit步以内的图构成图网络
        nodes = list(nx.dfs_preorder_nodes(G, n, limit))
        sub = nx.subgraph(G, nodes)
        # 删除第二步的user的user节点
        if drop_none_friends:
            loc_id = tdict['vid'][1]
            friends_nodes = [user for user in list(sub.adj[n]) if user < loc_id]
            loc_nodes = [loc for loc in list(sub.nodes) if loc >= loc_id]
            nodes = friends_nodes + loc_nodes
            # 记录由地点引出的user节点
            loc_user = []
            loc_nodes = [loc for loc in sub.adj[n] if loc >= loc_id]
            for loc in loc_nodes:
                loc_user.append(list(sub.adj[loc]))
            loc_user = sum(loc_user, [])
            nodes = list(set(nodes + loc_user))  
            nodes.remove(n)
            nodes = [n] + nodes
        
        sub = nx.subgraph(G, nodes)
        sub_graphs.append([nodes, sub])

    return sub_graphs


# 直接获取拉普拉斯矩阵
def get_L(sub_graphs_l, weight='weight'):
    Ls = []
    for sub in tqdm(sub_graphs_l):
        node_list = sub[0]
        graph = sub[1]
        Ls.append(nx.linalg.laplacianmatrix.laplacian_matrix(graph, node_list, weight))
        
    return Ls

In [23]:
# 获取不带第二层user的子图
sub_graphs_list = get_sub_graphs(G, most_checkins_users, tdict)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00,  3.22it/s]


In [24]:
# 保存node_list
nodes_list = [s[0] for s in sub_graphs_list]
np.savez('./graph/node_list.npz', result=np.array(nodes_list, dtype=object))

In [25]:
# 获取子图的拉普拉斯矩阵
Ls = get_L(sub_graphs_list)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:07<00:00,  1.25it/s]


## 改进的图网络生成的拉普拉斯矩阵$L$ 
* 计算用户与其好友的$Jaccard$相似度（包括地点和好友的$Jaccard$相似度）
* 用户与地点之间的邻近程度由以下公式得到
$$用户对该地点的评分均值 \times \sqrt[3]{用户访问的次数}$$

In [157]:
# 计算Jaccard相似系数并返回weight属性更新的graph
def jaccard_sparse(graph, node_l, tdict, keep_nodes):
    target = node_l[0]
    loc_id = tdict['vid'][1]
    user_id = [user for user in keep_nodes if user < loc_id]
    similarity = []
    for u in tqdm(user_id):
        adj = set(graph.adj[u])
        for au in [a for a in adj if ((a < loc_id) and (a in keep_nodes))]:
            au_adj = set(graph.adj[au])
            union = adj | au_adj
            intrsct = adj & au_adj
            jaccard = len(intrsct) / len(union)
            similarity.append((u, au, jaccard))
    
    # jaccard 归一化
    similarity = pd.DataFrame(similarity, columns=['user_1','user_2','similarity'])
    max_ =  similarity['similarity'].max()
    min_ = similarity['similarity'].min() - 0.00001
    similarity['similarity'] = similarity['similarity'].map(lambda x: (x-min_)/(max_-min_))
    similarity = similarity[['user_1', 'user_2', 'similarity']].values
    
    similarity = [(i[0], i[1], {'weight':i[2]}) for i in similarity]
    print('归一化完成')
    
    # 解冻
    graph = nx.Graph(graph)
    
    # 添加第一层的user相似度
    graph.add_edges_from(similarity)
    print('添加第一层user相似度完成')
    
    # 删除多余的user节点
    graph = nx.subgraph(graph, keep_nodes)
    print('删除第二层user完成')
    
    return graph

# 计算用户对地点的偏好程度并返回weight属性更新的graph
def user_preference(aggr, graph, node_l, tdict, rtdict):
    loc_id = tdict['vid'][1]
    user_g = np.array(node_l)[np.array(node_l) < loc_id]
    user = [rtdict['uid'][ug] for ug in user_g]
    # 筛选出那些有checkins的用户
    has_checkins = aggr.reset_index(drop=False, inplace=False)['user_id'].values
    user = list(set(user) & set(has_checkins))
    aggr = aggr.loc[user, :]
    # 筛选出在图中的vid
    loc = [loc for loc in graph.nodes if loc >= loc_id]
    
    # 对评分进行归一化
    max_ = np.max(aggr['rating_mean'])
    min_ = np.min(aggr['rating_mean'])
    aggr['rating_mean'] = aggr['rating_mean'].map(lambda x: (x-min_)/(max_-min_))
    aggr['total_point'] = aggr.apply(lambda x: x['rating_mean']*pow(x['checkins_count_adjust'], 1/3), axis=1)
    aggr.reset_index(drop=False, inplace=True)
    preference = aggr[['user_id', 'venue_id', 'total_point']]
    preference['user_id'] = preference['user_id'].map(lambda x: tdict['uid'][x])
    preference['venue_id'] = preference['venue_id'].map(lambda x: tdict['vid'][x])
    # 筛选出在图中的vid
    preference = preference[preference['venue_id'].isin(loc)]
    # 结果归一化
    max_ = preference['total_point'].max()
    min_ = preference['total_point'].min() - 0.00001
    preference['total_point'] = preference['total_point'].map(lambda x: (x-min_)/(max_-min_))
    preference = preference[['user_id', 'venue_id', 'total_point']].values
    
    preference = [(i[0], i[1], {'weight': i[2]}) for i in preference] + [(i[1], i[0], {'weight': i[2]}) for i in preference]
    print('归一化完成')
    
    # 解冻
    graph = nx.Graph(graph)
    
    # 添加用户对地点的偏好
    graph.add_edges_from(preference)
    print('添加地点用户偏好完成')
    
    return graph

# 根据子图计用户间的相似度
def similarity_of_each_friend(sub_graphs_l, tdict, keep_node_l):
    new_sub_graphs_list = []
    i = 0
    for sub in tqdm(sub_graphs_l):
        node_list = sub[0]
        graph = sub[1]
        keep_nodes = keep_node_l[i]
        new_graph = jaccard_sparse(graph, node_list, tdict, keep_nodes)
        new_sub_graphs_list.append([keep_nodes, new_graph])
        i += 1

    return new_sub_graphs_list

# 地点的评分和访问次数综合打分并修改连接数值
def location_preference(aggr, sub_graphs_l, tdict, rtdcit):
    new_sub_graphs_list = []
    for sub in tqdm(sub_graphs_l):
        aggr = aggr.copy()
        node_list = sub[0]
        graph = sub[1]
        new_graph = user_preference(aggr, graph, node_list, tdict, rtdict)
        new_sub_graphs_list.append([node_list, new_graph])
    
    return new_sub_graphs_list
        
def get_new_sub_graphs_list(aggr, sub_graphs_l, tdict, rtdict, keep_node_l):
    new_sub_graphs_list = similarity_of_each_friend(sub_graphs_l, tdict, keep_node_l)
    new_sub_graphs_list = location_preference(aggr, new_sub_graphs_list, tdict, rtdict)
    
    return new_sub_graphs_list

In [158]:
# 获取带第二层user的子图
sub_graphs_list_with_users = get_sub_graphs(G, most_checkins_users, tdict, drop_none_friends=False)
# 计算Jaccard相似系数和综合评分并返回
new_sub_graphs_list = get_new_sub_graphs_list(aggr_all, sub_graphs_list_with_users, tdict, rtdict, nodes_list)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00,  3.27it/s]
  0%|                                                                                           | 0/10 [00:00<?, ?it/s]
  0%|                                                                                          | 0/113 [00:00<?, ?it/s][A
  1%|▋                                                                                 | 1/113 [00:00<01:21,  1.37it/s][A
  2%|█▍                                                                                | 2/113 [00:02<02:30,  1.35s/it][A
  3%|██▏                                                                               | 3/113 [00:05<03:55,  2.14s/it][A
  4%|██▉                                                                               | 4/113 [00:06<03:06,  1.72s/it][A
  4%|███▋                                                                              | 5/113 [00:08<03:28,  1.93s/it][A
  5%|████▎            

归一化完成


 10%|████████▏                                                                         | 1/10 [02:23<21:28, 143.13s/it]
  0%|                                                                                         | 0/1314 [00:00<?, ?it/s][A

添加第一层user相似度完成
删除第二层user完成



  0%|                                                                                 | 1/1314 [00:00<12:15,  1.79it/s][A
  0%|▏                                                                                | 3/1314 [00:00<04:39,  4.69it/s][A
  0%|▏                                                                                | 4/1314 [00:00<03:53,  5.60it/s][A
  0%|▎                                                                                | 6/1314 [00:01<02:48,  7.76it/s][A
  1%|▌                                                                               | 10/1314 [00:06<18:00,  1.21it/s][A
  1%|▋                                                                               | 11/1314 [00:08<21:26,  1.01it/s][A
  1%|▊                                                                               | 13/1314 [00:14<36:28,  1.68s/it][A
  2%|█▍                                                                              | 24/1314 [00:14<10:22,  2.07it/s][A
  3%|██▎       

 28%|██████████████████████                                                         | 366/1314 [01:03<01:10, 13.40it/s][A
 28%|██████████████████████▏                                                        | 369/1314 [01:03<01:00, 15.75it/s][A
 28%|██████████████████████▎                                                        | 371/1314 [01:03<01:06, 14.09it/s][A
 28%|██████████████████████▍                                                        | 373/1314 [01:03<01:30, 10.45it/s][A
 29%|██████████████████████▌                                                        | 375/1314 [01:04<01:33, 10.03it/s][A
 29%|██████████████████████▋                                                        | 377/1314 [01:04<02:06,  7.43it/s][A
 29%|██████████████████████▋                                                        | 378/1314 [01:04<02:01,  7.72it/s][A
 29%|███████████████████████                                                        | 383/1314 [01:04<01:17, 12.00it/s][A
 29%|███████████

 52%|█████████████████████████████████████████▎                                     | 687/1314 [01:41<01:13,  8.47it/s][A
 52%|█████████████████████████████████████████▍                                     | 689/1314 [01:41<01:08,  9.10it/s][A
 53%|█████████████████████████████████████████▌                                     | 692/1314 [01:42<00:55, 11.14it/s][A
 53%|█████████████████████████████████████████▋                                     | 694/1314 [01:42<00:52, 11.88it/s][A
 53%|█████████████████████████████████████████▊                                     | 696/1314 [01:43<03:05,  3.33it/s][A
 53%|██████████████████████████████████████████                                     | 700/1314 [01:44<01:59,  5.12it/s][A
 53%|██████████████████████████████████████████▏                                    | 702/1314 [01:44<01:42,  5.99it/s][A
 54%|██████████████████████████████████████████▎                                    | 704/1314 [01:44<01:36,  6.30it/s][A
 54%|███████████

 80%|██████████████████████████████████████████████████████████████▋               | 1057/1314 [02:18<00:50,  5.14it/s][A
 81%|██████████████████████████████████████████████████████████████▊               | 1059/1314 [02:18<00:39,  6.45it/s][A
 81%|██████████████████████████████████████████████████████████████▉               | 1061/1314 [02:18<00:36,  6.91it/s][A
 81%|███████████████████████████████████████████████████████████████               | 1063/1314 [02:18<00:31,  8.04it/s][A
 81%|███████████████████████████████████████████████████████████████▏              | 1065/1314 [02:18<00:27,  8.90it/s][A
 81%|███████████████████████████████████████████████████████████████▎              | 1067/1314 [02:19<00:30,  8.23it/s][A
 81%|███████████████████████████████████████████████████████████████▌              | 1070/1314 [02:19<00:25,  9.74it/s][A
 82%|███████████████████████████████████████████████████████████████▋              | 1073/1314 [02:19<00:19, 12.25it/s][A
 82%|███████████

归一化完成


 20%|████████████████▍                                                                 | 2/10 [05:21<21:49, 163.70s/it]
  0%|                                                                                           | 0/92 [00:00<?, ?it/s][A


添加第一层user相似度完成
删除第二层user完成


  1%|▉                                                                                  | 1/92 [00:00<00:16,  5.56it/s][A
  4%|███▌                                                                               | 4/92 [00:00<00:06, 14.17it/s][A
  7%|█████▍                                                                             | 6/92 [00:03<00:57,  1.51it/s][A
 10%|████████                                                                           | 9/92 [00:04<00:45,  1.83it/s][A
 16%|█████████████▎                                                                    | 15/92 [00:04<00:19,  3.88it/s][A
 18%|███████████████▏                                                                  | 17/92 [00:04<00:16,  4.62it/s][A
 21%|████████████████▉                                                                 | 19/92 [00:05<00:19,  3.82it/s][A
 22%|█████████████████▊                                                                | 20/92 [00:05<00:17,  4.03it/s][A
 25%|███████████

归一化完成


 30%|████████████████████████▌                                                         | 3/10 [05:47<11:46, 100.88s/it]
  0%|                                                                                         | 0/4493 [00:00<?, ?it/s][A

添加第一层user相似度完成
删除第二层user完成



  0%|                                                                               | 1/4493 [00:02<2:46:16,  2.22s/it][A
  0%|                                                                               | 2/4493 [00:02<1:19:53,  1.07s/it][A
  0%|                                                                                 | 3/4493 [00:02<48:15,  1.55it/s][A
  0%|                                                                                 | 5/4493 [00:02<28:28,  2.63it/s][A
  0%|                                                                                 | 6/4493 [00:03<22:53,  3.27it/s][A
  0%|▏                                                                              | 8/4493 [00:07<1:27:18,  1.17s/it][A
  0%|▏                                                                              | 9/4493 [00:07<1:07:57,  1.10it/s][A
  0%|▏                                                                             | 10/4493 [00:15<3:24:30,  2.74s/it][A
  0%|▏         

  4%|███▍                                                                           | 193/4493 [02:24<37:33,  1.91it/s][A
  4%|███▍                                                                           | 194/4493 [02:25<35:36,  2.01it/s][A
  4%|███▍                                                                           | 195/4493 [02:25<28:26,  2.52it/s][A
  4%|███▍                                                                           | 197/4493 [02:25<25:57,  2.76it/s][A
  4%|███▍                                                                           | 198/4493 [02:26<23:10,  3.09it/s][A
  4%|███▍                                                                           | 199/4493 [02:26<23:38,  3.03it/s][A
  4%|███▌                                                                           | 201/4493 [02:26<16:24,  4.36it/s][A
  4%|███▌                                                                           | 202/4493 [02:26<16:13,  4.41it/s][A
  5%|███▌       

  8%|██████▌                                                                        | 370/4493 [03:49<23:49,  2.88it/s][A
  8%|██████▌                                                                        | 371/4493 [03:49<24:16,  2.83it/s][A
  8%|██████▌                                                                        | 372/4493 [03:49<22:52,  3.00it/s][A
  8%|██████▌                                                                        | 373/4493 [03:50<24:52,  2.76it/s][A
  8%|██████▌                                                                        | 374/4493 [03:50<23:35,  2.91it/s][A
  8%|██████▌                                                                        | 376/4493 [03:50<15:25,  4.45it/s][A
  8%|██████▋                                                                        | 379/4493 [03:50<12:05,  5.67it/s][A
  8%|██████▋                                                                        | 381/4493 [03:51<11:34,  5.92it/s][A
  9%|██████▋    

 12%|█████████▌                                                                     | 547/4493 [04:44<32:45,  2.01it/s][A
 12%|█████████▋                                                                     | 548/4493 [04:45<26:01,  2.53it/s][A
 12%|█████████▋                                                                     | 550/4493 [04:45<17:32,  3.75it/s][A
 12%|█████████▋                                                                     | 552/4493 [04:45<13:46,  4.77it/s][A
 12%|█████████▋                                                                     | 553/4493 [04:45<12:26,  5.28it/s][A
 12%|█████████▋                                                                     | 554/4493 [04:46<21:10,  3.10it/s][A
 12%|█████████▊                                                                     | 555/4493 [04:46<18:16,  3.59it/s][A
 12%|█████████▊                                                                     | 557/4493 [04:46<13:11,  4.97it/s][A
 12%|█████████▊ 

 16%|████████████▌                                                                  | 715/4493 [05:48<13:18,  4.73it/s][A
 16%|████████████▌                                                                  | 716/4493 [05:49<30:12,  2.08it/s][A
 16%|████████████▌                                                                  | 717/4493 [05:49<24:36,  2.56it/s][A
 16%|████████████▌                                                                  | 718/4493 [05:49<19:22,  3.25it/s][A
 16%|████████████▋                                                                  | 719/4493 [05:50<16:30,  3.81it/s][A
 16%|████████████▋                                                                  | 720/4493 [05:50<15:00,  4.19it/s][A
 16%|████████████▋                                                                  | 722/4493 [05:50<09:45,  6.44it/s][A
 16%|████████████▋                                                                  | 723/4493 [05:50<09:59,  6.29it/s][A
 16%|███████████

 21%|████████████████▏                                                              | 922/4493 [06:45<07:18,  8.14it/s][A
 21%|████████████████▏                                                              | 924/4493 [06:46<11:43,  5.07it/s][A
 21%|████████████████▎                                                              | 926/4493 [06:46<10:04,  5.90it/s][A
 21%|████████████████▎                                                              | 928/4493 [06:47<16:33,  3.59it/s][A
 21%|████████████████▎                                                              | 929/4493 [06:47<16:00,  3.71it/s][A
 21%|████████████████▎                                                              | 930/4493 [06:47<14:37,  4.06it/s][A
 21%|████████████████▍                                                              | 932/4493 [06:48<10:47,  5.50it/s][A
 21%|████████████████▍                                                              | 933/4493 [06:48<16:30,  3.60it/s][A
 21%|███████████

 25%|███████████████████▎                                                          | 1116/4493 [07:55<24:26,  2.30it/s][A
 25%|███████████████████▍                                                          | 1117/4493 [07:55<20:03,  2.81it/s][A
 25%|███████████████████▍                                                          | 1118/4493 [07:56<25:22,  2.22it/s][A
 25%|███████████████████▍                                                          | 1119/4493 [07:56<25:22,  2.22it/s][A
 25%|███████████████████▍                                                          | 1120/4493 [07:56<21:24,  2.63it/s][A
 25%|███████████████████▍                                                          | 1123/4493 [07:57<11:21,  4.94it/s][A
 25%|███████████████████▌                                                          | 1124/4493 [07:57<11:49,  4.75it/s][A
 25%|███████████████████▌                                                          | 1125/4493 [07:57<10:53,  5.15it/s][A
 25%|███████████

 29%|██████████████████████▌                                                       | 1301/4493 [09:08<24:45,  2.15it/s][A
 29%|██████████████████████▌                                                       | 1302/4493 [09:09<26:40,  1.99it/s][A
 29%|██████████████████████▋                                                       | 1304/4493 [09:09<19:22,  2.74it/s][A
 29%|██████████████████████▋                                                       | 1305/4493 [09:09<20:02,  2.65it/s][A
 29%|██████████████████████▋                                                       | 1307/4493 [09:09<14:11,  3.74it/s][A
 29%|██████████████████████▋                                                       | 1308/4493 [09:10<12:43,  4.17it/s][A
 29%|██████████████████████▋                                                       | 1309/4493 [09:10<12:59,  4.09it/s][A
 29%|██████████████████████▊                                                       | 1312/4493 [09:10<09:41,  5.47it/s][A
 29%|███████████

 38%|█████████████████████████████▌                                                | 1702/4493 [09:42<02:21, 19.76it/s][A
 38%|█████████████████████████████▌                                                | 1706/4493 [09:42<02:02, 22.78it/s][A
 38%|█████████████████████████████▋                                                | 1709/4493 [09:42<01:59, 23.39it/s][A
 38%|█████████████████████████████▋                                                | 1713/4493 [09:42<01:59, 23.30it/s][A
 38%|█████████████████████████████▊                                                | 1718/4493 [09:42<01:37, 28.58it/s][A
 38%|█████████████████████████████▉                                                | 1723/4493 [09:42<01:23, 33.19it/s][A
 38%|█████████████████████████████▉                                                | 1728/4493 [09:43<01:27, 31.72it/s][A
 39%|██████████████████████████████                                                | 1732/4493 [09:43<01:29, 30.85it/s][A
 39%|███████████

 50%|███████████████████████████████████████▏                                      | 2256/4493 [10:24<01:40, 22.20it/s][A
 50%|███████████████████████████████████████▏                                      | 2260/4493 [10:24<02:04, 18.00it/s][A
 50%|███████████████████████████████████████▎                                      | 2265/4493 [10:25<01:55, 19.33it/s][A
 51%|███████████████████████████████████████▍                                      | 2272/4493 [10:25<01:40, 22.16it/s][A
 51%|███████████████████████████████████████▌                                      | 2277/4493 [10:25<01:27, 25.41it/s][A
 51%|███████████████████████████████████████▌                                      | 2281/4493 [10:25<01:24, 26.16it/s][A
 51%|███████████████████████████████████████▋                                      | 2285/4493 [10:25<01:30, 24.46it/s][A
 51%|███████████████████████████████████████▊                                      | 2291/4493 [10:25<01:13, 30.13it/s][A
 51%|███████████

 63%|████████████████████████████████████████████████▊                             | 2815/4493 [10:53<01:19, 21.07it/s][A
 63%|████████████████████████████████████████████████▉                             | 2818/4493 [10:53<01:20, 20.83it/s][A
 63%|████████████████████████████████████████████████▉                             | 2821/4493 [10:54<01:40, 16.66it/s][A
 63%|█████████████████████████████████████████████████                             | 2824/4493 [10:54<01:42, 16.21it/s][A
 63%|█████████████████████████████████████████████████                             | 2829/4493 [10:54<01:16, 21.65it/s][A
 63%|█████████████████████████████████████████████████▏                            | 2832/4493 [10:54<01:17, 21.48it/s][A
 63%|█████████████████████████████████████████████████▏                            | 2836/4493 [10:54<01:06, 24.95it/s][A
 63%|█████████████████████████████████████████████████▎                            | 2840/4493 [10:55<01:26, 19.07it/s][A
 63%|███████████

 71%|███████████████████████████████████████████████████████                       | 3171/4493 [11:36<07:05,  3.11it/s][A
 71%|███████████████████████████████████████████████████████                       | 3172/4493 [11:37<07:37,  2.89it/s][A
 71%|███████████████████████████████████████████████████████                       | 3173/4493 [11:37<06:15,  3.51it/s][A
 71%|███████████████████████████████████████████████████████                       | 3174/4493 [11:37<05:56,  3.70it/s][A
 71%|███████████████████████████████████████████████████████                       | 3175/4493 [11:37<05:51,  3.75it/s][A
 71%|███████████████████████████████████████████████████████▏                      | 3177/4493 [11:38<07:18,  3.00it/s][A
 71%|███████████████████████████████████████████████████████▏                      | 3178/4493 [11:38<06:35,  3.33it/s][A
 71%|███████████████████████████████████████████████████████▏                      | 3179/4493 [11:38<05:59,  3.66it/s][A
 71%|███████████

 75%|██████████████████████████████████████████████████████████▌                   | 3375/4493 [12:51<05:02,  3.69it/s][A
 75%|██████████████████████████████████████████████████████████▌                   | 3376/4493 [12:52<05:49,  3.20it/s][A
 75%|██████████████████████████████████████████████████████████▋                   | 3377/4493 [12:52<04:49,  3.86it/s][A
 75%|██████████████████████████████████████████████████████████▋                   | 3378/4493 [12:52<04:33,  4.08it/s][A
 75%|██████████████████████████████████████████████████████████▋                   | 3380/4493 [12:53<06:11,  3.00it/s][A
 75%|██████████████████████████████████████████████████████████▋                   | 3383/4493 [12:53<04:07,  4.49it/s][A
 75%|██████████████████████████████████████████████████████████▋                   | 3384/4493 [12:54<05:46,  3.20it/s][A
 75%|██████████████████████████████████████████████████████████▊                   | 3385/4493 [12:54<05:28,  3.38it/s][A
 75%|███████████

 80%|██████████████████████████████████████████████████████████████▏               | 3579/4493 [15:02<40:18,  2.65s/it][A
 80%|██████████████████████████████████████████████████████████████▏               | 3580/4493 [15:03<32:52,  2.16s/it][A
 80%|██████████████████████████████████████████████████████████████▏               | 3581/4493 [15:03<25:48,  1.70s/it][A
 80%|██████████████████████████████████████████████████████████████▏               | 3582/4493 [15:03<20:35,  1.36s/it][A
 80%|██████████████████████████████████████████████████████████████▏               | 3583/4493 [15:03<15:42,  1.04s/it][A
 80%|██████████████████████████████████████████████████████████████▏               | 3584/4493 [15:04<12:43,  1.19it/s][A
 80%|██████████████████████████████████████████████████████████████▏               | 3585/4493 [15:04<09:52,  1.53it/s][A
 80%|██████████████████████████████████████████████████████████████▎               | 3587/4493 [15:04<06:05,  2.48it/s][A
 80%|███████████

 84%|█████████████████████████████████████████████████████████████████▌            | 3774/4493 [15:47<01:06, 10.78it/s][A
 84%|█████████████████████████████████████████████████████████████████▌            | 3776/4493 [15:48<02:45,  4.34it/s][A
 84%|█████████████████████████████████████████████████████████████████▌            | 3777/4493 [15:49<02:54,  4.10it/s][A
 84%|█████████████████████████████████████████████████████████████████▌            | 3779/4493 [15:49<02:14,  5.33it/s][A
 84%|█████████████████████████████████████████████████████████████████▌            | 3780/4493 [15:49<02:27,  4.85it/s][A
 84%|█████████████████████████████████████████████████████████████████▋            | 3781/4493 [15:50<03:52,  3.06it/s][A
 84%|█████████████████████████████████████████████████████████████████▋            | 3783/4493 [15:50<02:38,  4.48it/s][A
 84%|█████████████████████████████████████████████████████████████████▋            | 3784/4493 [15:50<02:57,  4.00it/s][A
 84%|███████████

 88%|████████████████████████████████████████████████████████████████████▌         | 3952/4493 [16:39<02:26,  3.68it/s][A
 88%|████████████████████████████████████████████████████████████████████▋         | 3953/4493 [16:39<02:18,  3.91it/s][A
 88%|████████████████████████████████████████████████████████████████████▋         | 3955/4493 [16:40<01:47,  5.03it/s][A
 88%|████████████████████████████████████████████████████████████████████▋         | 3956/4493 [16:40<01:36,  5.55it/s][A
 88%|████████████████████████████████████████████████████████████████████▋         | 3959/4493 [16:40<01:22,  6.50it/s][A
 88%|████████████████████████████████████████████████████████████████████▋         | 3960/4493 [16:40<01:41,  5.27it/s][A
 88%|████████████████████████████████████████████████████████████████████▊         | 3962/4493 [16:41<01:47,  4.96it/s][A
 88%|████████████████████████████████████████████████████████████████████▊         | 3964/4493 [16:41<01:40,  5.28it/s][A
 88%|███████████

 92%|███████████████████████████████████████████████████████████████████████▋      | 4126/4493 [17:34<00:56,  6.45it/s][A
 92%|███████████████████████████████████████████████████████████████████████▋      | 4128/4493 [17:40<06:19,  1.04s/it][A
 92%|███████████████████████████████████████████████████████████████████████▋      | 4129/4493 [17:40<05:44,  1.06it/s][A
 92%|███████████████████████████████████████████████████████████████████████▋      | 4130/4493 [17:41<04:49,  1.25it/s][A
 92%|███████████████████████████████████████████████████████████████████████▋      | 4132/4493 [17:41<03:25,  1.76it/s][A
 92%|███████████████████████████████████████████████████████████████████████▊      | 4133/4493 [17:41<02:55,  2.05it/s][A
 92%|███████████████████████████████████████████████████████████████████████▊      | 4134/4493 [17:41<02:34,  2.33it/s][A
 92%|███████████████████████████████████████████████████████████████████████▊      | 4135/4493 [17:42<02:35,  2.31it/s][A
 92%|███████████

 96%|██████████████████████████████████████████████████████████████████████████▌   | 4298/4493 [18:34<01:17,  2.51it/s][A
 96%|██████████████████████████████████████████████████████████████████████████▋   | 4299/4493 [18:34<01:04,  3.02it/s][A
 96%|██████████████████████████████████████████████████████████████████████████▋   | 4300/4493 [18:35<01:04,  2.99it/s][A
 96%|██████████████████████████████████████████████████████████████████████████▋   | 4302/4493 [18:35<00:51,  3.71it/s][A
 96%|██████████████████████████████████████████████████████████████████████████▋   | 4305/4493 [18:35<00:31,  5.95it/s][A
 96%|██████████████████████████████████████████████████████████████████████████▊   | 4306/4493 [18:36<00:44,  4.18it/s][A
 96%|██████████████████████████████████████████████████████████████████████████▊   | 4307/4493 [18:36<00:41,  4.47it/s][A
 96%|██████████████████████████████████████████████████████████████████████████▊   | 4308/4493 [18:36<00:40,  4.61it/s][A
 96%|███████████

归一化完成


 40%|████████████████████████████████▊                                                 | 4/10 [25:39<53:11, 531.89s/it]
  0%|                                                                                         | 0/4252 [00:00<?, ?it/s][A

添加第一层user相似度完成
删除第二层user完成



  0%|                                                                               | 1/4252 [00:02<2:43:52,  2.31s/it][A
  0%|                                                                               | 2/4252 [00:02<1:17:45,  1.10s/it][A
  0%|                                                                                 | 3/4252 [00:03<56:37,  1.25it/s][A
  0%|                                                                                 | 5/4252 [00:03<30:55,  2.29it/s][A
  0%|                                                                               | 6/4252 [00:05<1:01:42,  1.15it/s][A
  0%|▏                                                                              | 7/4252 [00:09<2:04:26,  1.76s/it][A
  0%|▏                                                                              | 8/4252 [00:14<3:12:08,  2.72s/it][A
  0%|▏                                                                              | 9/4252 [00:14<2:20:44,  1.99s/it][A
  0%|▏         

  4%|███▍                                                                           | 187/4252 [02:32<44:41,  1.52it/s][A
  4%|███▌                                                                           | 189/4252 [02:32<26:51,  2.52it/s][A
  4%|███▍                                                                         | 191/4252 [02:36<1:06:21,  1.02it/s][A
  5%|███▍                                                                         | 192/4252 [02:38<1:27:57,  1.30s/it][A
  5%|███▍                                                                         | 193/4252 [02:39<1:15:51,  1.12s/it][A
  5%|███▌                                                                           | 195/4252 [02:39<55:42,  1.21it/s][A
  5%|███▋                                                                           | 197/4252 [02:40<48:23,  1.40it/s][A
  5%|███▋                                                                           | 198/4252 [02:41<51:56,  1.30it/s][A
  5%|███▋       

  9%|██████▊                                                                        | 366/4252 [04:18<33:34,  1.93it/s][A
  9%|██████▊                                                                        | 368/4252 [04:19<35:47,  1.81it/s][A
  9%|██████▊                                                                        | 370/4252 [04:20<25:04,  2.58it/s][A
  9%|██████▉                                                                        | 371/4252 [04:20<21:29,  3.01it/s][A
  9%|██████▉                                                                        | 372/4252 [04:20<25:10,  2.57it/s][A
  9%|██████▉                                                                        | 373/4252 [04:21<29:10,  2.22it/s][A
  9%|██████▉                                                                        | 375/4252 [04:22<25:33,  2.53it/s][A
  9%|██████▉                                                                        | 376/4252 [04:22<22:10,  2.91it/s][A
  9%|███████    

 13%|█████████▉                                                                   | 549/4252 [05:52<1:43:36,  1.68s/it][A
 13%|█████████▉                                                                   | 550/4252 [05:52<1:24:59,  1.38s/it][A
 13%|█████████▉                                                                   | 551/4252 [05:53<1:14:25,  1.21s/it][A
 13%|█████████▉                                                                   | 552/4252 [05:55<1:34:29,  1.53s/it][A
 13%|██████████                                                                   | 553/4252 [05:57<1:34:58,  1.54s/it][A
 13%|██████████                                                                   | 554/4252 [06:01<2:23:11,  2.32s/it][A
 13%|██████████                                                                   | 555/4252 [06:01<1:45:51,  1.72s/it][A
 13%|██████████                                                                   | 556/4252 [06:03<1:38:30,  1.60s/it][A
 13%|██████████ 

 17%|█████████████▌                                                                 | 727/4252 [07:21<23:28,  2.50it/s][A
 17%|█████████████▌                                                                 | 728/4252 [07:21<21:20,  2.75it/s][A
 17%|█████████████▌                                                                 | 729/4252 [07:21<18:25,  3.19it/s][A
 17%|█████████████▌                                                                 | 730/4252 [07:22<17:05,  3.43it/s][A
 17%|█████████████▌                                                                 | 731/4252 [07:22<16:09,  3.63it/s][A
 17%|█████████████▌                                                                 | 732/4252 [07:22<18:39,  3.14it/s][A
 17%|█████████████▌                                                                 | 733/4252 [07:23<23:10,  2.53it/s][A
 17%|█████████████▋                                                                 | 734/4252 [07:24<36:34,  1.60it/s][A
 17%|███████████

 22%|█████████████████                                                              | 919/4252 [08:55<29:23,  1.89it/s][A
 22%|█████████████████                                                              | 920/4252 [08:55<23:26,  2.37it/s][A
 22%|█████████████████                                                              | 921/4252 [08:56<26:39,  2.08it/s][A
 22%|████████████████▋                                                            | 922/4252 [09:03<2:04:40,  2.25s/it][A
 22%|████████████████▋                                                            | 924/4252 [09:04<1:23:50,  1.51s/it][A
 22%|████████████████▊                                                            | 925/4252 [09:04<1:10:39,  1.27s/it][A
 22%|█████████████████▏                                                             | 926/4252 [09:05<58:42,  1.06s/it][A
 22%|█████████████████▏                                                             | 927/4252 [09:05<44:50,  1.24it/s][A
 22%|███████████

 26%|████████████████████▍                                                         | 1111/4252 [10:14<14:47,  3.54it/s][A
 26%|████████████████████▍                                                         | 1113/4252 [10:14<13:07,  3.99it/s][A
 26%|████████████████████▍                                                         | 1114/4252 [10:14<12:52,  4.06it/s][A
 26%|████████████████████▍                                                         | 1117/4252 [10:15<10:01,  5.22it/s][A
 26%|████████████████████▌                                                         | 1119/4252 [10:16<16:06,  3.24it/s][A
 26%|████████████████████▌                                                         | 1120/4252 [10:16<16:41,  3.13it/s][A
 26%|████████████████████▌                                                         | 1121/4252 [10:16<17:29,  2.98it/s][A
 26%|████████████████████▌                                                         | 1122/4252 [10:17<15:33,  3.35it/s][A
 26%|███████████

 31%|████████████████████████▏                                                     | 1321/4252 [11:15<18:15,  2.67it/s][A
 31%|████████████████████████▎                                                     | 1322/4252 [11:15<14:54,  3.28it/s][A
 31%|████████████████████████▎                                                     | 1323/4252 [11:16<20:03,  2.43it/s][A
 31%|████████████████████████▎                                                     | 1324/4252 [11:16<18:06,  2.70it/s][A
 31%|████████████████████████▎                                                     | 1326/4252 [11:17<15:59,  3.05it/s][A
 31%|████████████████████████▎                                                     | 1328/4252 [11:17<14:51,  3.28it/s][A
 31%|████████████████████████▍                                                     | 1330/4252 [11:17<10:36,  4.59it/s][A
 31%|████████████████████████▍                                                     | 1332/4252 [11:18<12:32,  3.88it/s][A
 31%|███████████

 38%|█████████████████████████████▌                                                | 1614/4252 [12:22<13:48,  3.18it/s][A
 38%|█████████████████████████████▋                                                | 1615/4252 [12:23<12:43,  3.45it/s][A
 38%|█████████████████████████████▋                                                | 1618/4252 [12:23<09:31,  4.61it/s][A
 38%|█████████████████████████████▋                                                | 1619/4252 [12:23<09:24,  4.66it/s][A
 38%|█████████████████████████████▋                                                | 1620/4252 [12:24<14:03,  3.12it/s][A
 38%|█████████████████████████████▊                                                | 1622/4252 [12:24<09:45,  4.49it/s][A
 38%|█████████████████████████████▊                                                | 1624/4252 [12:24<07:13,  6.07it/s][A
 38%|█████████████████████████████▊                                                | 1626/4252 [12:24<05:33,  7.88it/s][A
 38%|███████████

 43%|█████████████████████████████████▎                                            | 1819/4252 [13:17<36:45,  1.10it/s][A
 43%|█████████████████████████████████▍                                            | 1822/4252 [13:17<18:43,  2.16it/s][A
 43%|█████████████████████████████████▍                                            | 1823/4252 [13:17<16:34,  2.44it/s][A
 43%|█████████████████████████████████▍                                            | 1824/4252 [13:17<15:42,  2.58it/s][A
 43%|█████████████████████████████████▍                                            | 1825/4252 [13:18<15:29,  2.61it/s][A
 43%|█████████████████████████████████▍                                            | 1826/4252 [13:18<15:59,  2.53it/s][A
 43%|█████████████████████████████████▌                                            | 1827/4252 [13:19<20:18,  1.99it/s][A
 43%|█████████████████████████████████▌                                            | 1829/4252 [13:20<19:49,  2.04it/s][A
 43%|███████████

 47%|████████████████████████████████████▊                                         | 2005/4252 [14:17<11:24,  3.28it/s][A
 47%|████████████████████████████████████▊                                         | 2006/4252 [14:17<10:12,  3.66it/s][A
 47%|████████████████████████████████████▊                                         | 2009/4252 [14:18<09:44,  3.83it/s][A
 47%|████████████████████████████████████▉                                         | 2012/4252 [14:18<09:02,  4.13it/s][A
 47%|████████████████████████████████████▉                                         | 2015/4252 [14:19<07:19,  5.09it/s][A
 47%|████████████████████████████████████▉                                         | 2016/4252 [14:19<09:13,  4.04it/s][A
 47%|█████████████████████████████████████                                         | 2017/4252 [14:19<08:58,  4.15it/s][A
 47%|█████████████████████████████████████                                         | 2018/4252 [14:20<10:51,  3.43it/s][A
 47%|███████████

 52%|███████████████████████████████████████▍                                    | 2207/4252 [16:02<1:13:25,  2.15s/it][A
 52%|████████████████████████████████████████▌                                     | 2208/4252 [16:02<56:53,  1.67s/it][A
 52%|████████████████████████████████████████▌                                     | 2209/4252 [16:03<41:45,  1.23s/it][A
 52%|███████████████████████████████████████▌                                    | 2210/4252 [16:10<1:49:14,  3.21s/it][A
 52%|███████████████████████████████████████▌                                    | 2211/4252 [16:14<1:55:24,  3.39s/it][A
 52%|███████████████████████████████████████▌                                    | 2212/4252 [16:15<1:24:49,  2.49s/it][A
 52%|████████████████████████████████████████▋                                     | 2215/4252 [16:15<41:54,  1.23s/it][A
 52%|████████████████████████████████████████▋                                     | 2217/4252 [16:16<29:29,  1.15it/s][A
 52%|███████████

 56%|███████████████████████████████████████████▊                                  | 2390/4252 [17:25<05:50,  5.31it/s][A
 56%|███████████████████████████████████████████▊                                  | 2391/4252 [17:25<08:31,  3.64it/s][A
 56%|███████████████████████████████████████████▉                                  | 2392/4252 [17:26<09:02,  3.43it/s][A
 56%|███████████████████████████████████████████▉                                  | 2394/4252 [17:26<06:54,  4.49it/s][A
 56%|███████████████████████████████████████████▉                                  | 2395/4252 [17:26<07:42,  4.02it/s][A
 56%|███████████████████████████████████████████▉                                  | 2396/4252 [17:27<08:31,  3.63it/s][A
 56%|███████████████████████████████████████████▉                                  | 2397/4252 [17:27<07:59,  3.87it/s][A
 56%|████████████████████████████████████████████                                  | 2400/4252 [17:27<04:32,  6.79it/s][A
 56%|███████████

 61%|███████████████████████████████████████████████▍                              | 2588/4252 [18:50<07:23,  3.75it/s][A
 61%|███████████████████████████████████████████████▍                              | 2589/4252 [18:50<07:30,  3.69it/s][A
 61%|███████████████████████████████████████████████▌                              | 2590/4252 [18:51<07:11,  3.85it/s][A
 61%|███████████████████████████████████████████████▌                              | 2592/4252 [18:52<09:27,  2.93it/s][A
 61%|███████████████████████████████████████████████▌                              | 2595/4252 [18:52<05:47,  4.77it/s][A
 61%|███████████████████████████████████████████████▋                              | 2597/4252 [18:52<07:12,  3.83it/s][A
 61%|███████████████████████████████████████████████▋                              | 2599/4252 [18:53<05:30,  5.00it/s][A
 61%|███████████████████████████████████████████████▋                              | 2601/4252 [19:00<32:17,  1.17s/it][A
 61%|███████████

 65%|██████████████████████████████████████████████████▉                           | 2778/4252 [20:00<15:09,  1.62it/s][A
 65%|██████████████████████████████████████████████████▉                           | 2780/4252 [20:00<08:53,  2.76it/s][A
 65%|███████████████████████████████████████████████████                           | 2782/4252 [20:00<07:12,  3.40it/s][A
 65%|███████████████████████████████████████████████████                           | 2784/4252 [20:01<05:55,  4.12it/s][A
 65%|███████████████████████████████████████████████████                           | 2785/4252 [20:01<06:14,  3.92it/s][A
 66%|███████████████████████████████████████████████████                           | 2786/4252 [20:01<05:59,  4.08it/s][A
 66%|███████████████████████████████████████████████████▏                          | 2787/4252 [20:02<11:04,  2.20it/s][A
 66%|███████████████████████████████████████████████████▏                          | 2790/4252 [20:03<09:42,  2.51it/s][A
 66%|███████████

 70%|██████████████████████████████████████████████████████▋                       | 2981/4252 [20:55<06:41,  3.17it/s][A
 70%|██████████████████████████████████████████████████████▋                       | 2982/4252 [20:55<05:57,  3.55it/s][A
 70%|██████████████████████████████████████████████████████▋                       | 2983/4252 [20:55<06:55,  3.05it/s][A
 70%|██████████████████████████████████████████████████████▋                       | 2984/4252 [20:56<06:09,  3.43it/s][A
 70%|██████████████████████████████████████████████████████▊                       | 2985/4252 [20:56<06:08,  3.44it/s][A
 70%|██████████████████████████████████████████████████████▊                       | 2988/4252 [20:56<03:40,  5.73it/s][A
 70%|██████████████████████████████████████████████████████▊                       | 2989/4252 [20:57<04:30,  4.68it/s][A
 70%|██████████████████████████████████████████████████████▊                       | 2991/4252 [20:57<03:48,  5.52it/s][A
 70%|███████████

 75%|██████████████████████████████████████████████████████████▏                   | 3172/4252 [21:58<05:12,  3.45it/s][A
 75%|██████████████████████████████████████████████████████████▏                   | 3173/4252 [21:59<06:50,  2.63it/s][A
 75%|██████████████████████████████████████████████████████████▏                   | 3174/4252 [21:59<06:14,  2.88it/s][A
 75%|██████████████████████████████████████████████████████████▎                   | 3177/4252 [21:59<03:52,  4.63it/s][A
 75%|██████████████████████████████████████████████████████████▎                   | 3178/4252 [22:00<03:41,  4.85it/s][A
 75%|██████████████████████████████████████████████████████████▎                   | 3180/4252 [22:00<04:08,  4.31it/s][A
 75%|██████████████████████████████████████████████████████████▎                   | 3182/4252 [22:02<08:45,  2.04it/s][A
 75%|██████████████████████████████████████████████████████████▍                   | 3183/4252 [22:02<07:35,  2.35it/s][A
 75%|███████████

 79%|█████████████████████████████████████████████████████████████▍                | 3348/4252 [23:03<03:56,  3.82it/s][A
 79%|█████████████████████████████████████████████████████████████▍                | 3350/4252 [23:04<04:42,  3.20it/s][A
 79%|█████████████████████████████████████████████████████████████▍                | 3351/4252 [23:05<04:48,  3.12it/s][A
 79%|█████████████████████████████████████████████████████████████▍                | 3352/4252 [23:05<05:08,  2.91it/s][A
 79%|█████████████████████████████████████████████████████████████▌                | 3353/4252 [23:06<05:50,  2.57it/s][A
 79%|█████████████████████████████████████████████████████████████▌                | 3354/4252 [23:06<06:26,  2.32it/s][A
 79%|█████████████████████████████████████████████████████████████▌                | 3355/4252 [23:07<06:39,  2.24it/s][A
 79%|█████████████████████████████████████████████████████████████▋                | 3361/4252 [23:07<02:24,  6.18it/s][A
 79%|███████████

 83%|████████████████████████████████████████████████████████████████▊             | 3533/4252 [24:06<03:43,  3.21it/s][A
 83%|████████████████████████████████████████████████████████████████▊             | 3536/4252 [24:06<01:54,  6.25it/s][A
 83%|████████████████████████████████████████████████████████████████▉             | 3538/4252 [24:07<02:14,  5.29it/s][A
 83%|████████████████████████████████████████████████████████████████▉             | 3540/4252 [24:07<02:23,  4.95it/s][A
 83%|████████████████████████████████████████████████████████████████▉             | 3541/4252 [24:08<03:29,  3.40it/s][A
 83%|████████████████████████████████████████████████████████████████▉             | 3543/4252 [24:08<03:10,  3.72it/s][A
 83%|█████████████████████████████████████████████████████████████████             | 3545/4252 [24:09<03:28,  3.39it/s][A
 83%|█████████████████████████████████████████████████████████████████             | 3546/4252 [24:09<03:21,  3.51it/s][A
 83%|███████████

 88%|████████████████████████████████████████████████████████████████████▋         | 3744/4252 [25:06<02:52,  2.95it/s][A
 88%|████████████████████████████████████████████████████████████████████▋         | 3745/4252 [25:06<02:33,  3.30it/s][A
 88%|████████████████████████████████████████████████████████████████████▊         | 3748/4252 [25:07<02:22,  3.54it/s][A
 88%|████████████████████████████████████████████████████████████████████▊         | 3750/4252 [25:08<02:05,  4.00it/s][A
 88%|████████████████████████████████████████████████████████████████████▊         | 3751/4252 [25:08<02:53,  2.88it/s][A
 88%|████████████████████████████████████████████████████████████████████▊         | 3752/4252 [25:09<02:40,  3.12it/s][A
 88%|████████████████████████████████████████████████████████████████████▊         | 3753/4252 [25:10<04:21,  1.90it/s][A
 88%|████████████████████████████████████████████████████████████████████▊         | 3754/4252 [25:10<03:57,  2.09it/s][A
 88%|███████████

 93%|████████████████████████████████████████████████████████████████████████▊     | 3970/4252 [26:19<00:57,  4.92it/s][A
 93%|████████████████████████████████████████████████████████████████████████▊     | 3972/4252 [26:19<00:45,  6.09it/s][A
 93%|████████████████████████████████████████████████████████████████████████▉     | 3973/4252 [26:20<01:02,  4.48it/s][A
 93%|████████████████████████████████████████████████████████████████████████▉     | 3975/4252 [26:20<00:44,  6.26it/s][A
 94%|████████████████████████████████████████████████████████████████████████▉     | 3978/4252 [26:20<00:30,  9.12it/s][A
 94%|█████████████████████████████████████████████████████████████████████████     | 3980/4252 [26:27<05:01,  1.11s/it][A
 94%|█████████████████████████████████████████████████████████████████████████     | 3981/4252 [26:27<04:12,  1.07it/s][A
 94%|█████████████████████████████████████████████████████████████████████████     | 3982/4252 [26:27<03:31,  1.28it/s][A
 94%|███████████

 98%|████████████████████████████████████████████████████████████████████████████▊ | 4188/4252 [27:26<00:14,  4.49it/s][A
 99%|████████████████████████████████████████████████████████████████████████████▉ | 4191/4252 [27:27<00:09,  6.21it/s][A
 99%|████████████████████████████████████████████████████████████████████████████▉ | 4193/4252 [27:27<00:12,  4.81it/s][A
 99%|████████████████████████████████████████████████████████████████████████████▉ | 4196/4252 [27:27<00:08,  6.50it/s][A
 99%|█████████████████████████████████████████████████████████████████████████████ | 4198/4252 [27:28<00:08,  6.07it/s][A
 99%|█████████████████████████████████████████████████████████████████████████████ | 4201/4252 [27:28<00:06,  7.90it/s][A
 99%|█████████████████████████████████████████████████████████████████████████████ | 4203/4252 [27:28<00:05,  8.25it/s][A
 99%|█████████████████████████████████████████████████████████████████████████████▏| 4205/4252 [27:28<00:05,  9.22it/s][A
 99%|███████████

归一化完成


 50%|████████████████████████████████████████                                        | 5/10 [53:46<1:19:02, 948.43s/it]
  0%|                                                                                           | 0/93 [00:00<?, ?it/s][A
  1%|▉                                                                                  | 1/93 [00:00<00:13,  6.94it/s][A

添加第一层user相似度完成
删除第二层user完成



  4%|███▌                                                                               | 4/93 [00:01<00:40,  2.22it/s][A
  6%|█████▎                                                                             | 6/93 [00:02<00:45,  1.91it/s][A
  8%|██████▏                                                                            | 7/93 [00:04<01:04,  1.32it/s][A
  9%|███████▏                                                                           | 8/93 [00:06<01:23,  1.02it/s][A
 18%|██████████████▉                                                                   | 17/93 [00:06<00:18,  4.11it/s][A
 22%|█████████████████▋                                                                | 20/93 [00:06<00:14,  4.99it/s][A
 24%|███████████████████▍                                                              | 22/93 [00:06<00:12,  5.57it/s][A
 27%|██████████████████████                                                            | 25/93 [00:06<00:09,  6.92it/s][A
 34%|██████████

归一化完成


 60%|█████████████████████████████████████████████████▏                                | 6/10 [54:10<42:15, 633.97s/it]
  0%|                                                                                          | 0/183 [00:00<?, ?it/s][A

添加第一层user相似度完成
删除第二层user完成



  1%|▍                                                                                 | 1/183 [00:00<02:02,  1.49it/s][A
  2%|█▎                                                                                | 3/183 [00:00<00:43,  4.15it/s][A
  2%|█▊                                                                                | 4/183 [00:03<03:23,  1.14s/it][A
  3%|██▏                                                                               | 5/183 [00:04<03:12,  1.08s/it][A
  3%|██▋                                                                               | 6/183 [00:06<04:07,  1.40s/it][A
  4%|███▏                                                                              | 7/183 [00:09<05:28,  1.86s/it][A
  9%|███████▌                                                                         | 17/183 [00:11<01:17,  2.15it/s][A
 14%|███████████                                                                      | 25/183 [00:13<00:55,  2.83it/s][A
 14%|██████████

归一化完成


 70%|█████████████████████████████████████████████████████████▍                        | 7/10 [56:20<23:27, 469.25s/it]
  0%|                                                                                          | 0/173 [00:00<?, ?it/s][A

添加第一层user相似度完成
删除第二层user完成



  1%|▍                                                                                 | 1/173 [00:00<01:25,  2.01it/s][A
  2%|█▍                                                                                | 3/173 [00:02<02:06,  1.35it/s][A
  2%|█▉                                                                                | 4/173 [00:03<02:34,  1.09it/s][A
  3%|██▎                                                                               | 5/173 [00:04<03:03,  1.09s/it][A
  3%|██▊                                                                               | 6/173 [00:06<03:33,  1.28s/it][A
  5%|███▊                                                                              | 8/173 [00:08<02:58,  1.08s/it][A
  5%|████▎                                                                             | 9/173 [00:09<03:01,  1.11s/it][A
  7%|█████▌                                                                           | 12/173 [00:10<02:06,  1.27it/s][A
  8%|██████    

归一化完成


 80%|█████████████████████████████████████████████████████████████████▌                | 8/10 [57:16<11:15, 337.71s/it]
  0%|                                                                                           | 0/77 [00:00<?, ?it/s][A

添加第一层user相似度完成
删除第二层user完成



  1%|█                                                                                  | 1/77 [00:00<00:25,  3.01it/s][A
  3%|██▏                                                                                | 2/77 [00:01<01:17,  1.03s/it][A
  4%|███▏                                                                               | 3/77 [00:02<01:17,  1.05s/it][A
  5%|████▎                                                                              | 4/77 [00:04<01:23,  1.14s/it][A
  6%|█████▍                                                                             | 5/77 [00:05<01:32,  1.29s/it][A
  9%|███████▌                                                                           | 7/77 [00:06<01:02,  1.12it/s][A
 12%|█████████▋                                                                         | 9/77 [00:06<00:37,  1.79it/s][A
 14%|███████████▋                                                                      | 11/77 [00:07<00:24,  2.66it/s][A
 17%|██████████

归一化完成


 90%|█████████████████████████████████████████████████████████████████████████▊        | 9/10 [57:57<04:04, 244.89s/it]
  0%|                                                                                           | 0/63 [00:00<?, ?it/s][A

添加第一层user相似度完成
删除第二层user完成



  2%|█▎                                                                                 | 1/63 [00:00<00:13,  4.67it/s][A
  3%|██▋                                                                                | 2/63 [00:00<00:14,  4.29it/s][A
  5%|███▉                                                                               | 3/63 [00:01<00:39,  1.50it/s][A
  6%|█████▎                                                                             | 4/63 [00:02<00:43,  1.36it/s][A
  8%|██████▌                                                                            | 5/63 [00:03<00:51,  1.14it/s][A
 10%|███████▉                                                                           | 6/63 [00:04<00:50,  1.14it/s][A
 13%|██████████▌                                                                        | 8/63 [00:04<00:26,  2.05it/s][A
 16%|█████████████                                                                     | 10/63 [00:04<00:16,  3.18it/s][A
 17%|██████████

归一化完成


100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [58:23<00:00, 350.31s/it]
  0%|                                                                                           | 0/10 [00:00<?, ?it/s]

添加第一层user相似度完成
删除第二层user完成
归一化完成


 10%|████████▎                                                                          | 1/10 [00:30<04:33, 30.43s/it]

添加地点用户偏好完成
归一化完成


 20%|████████████████▌                                                                  | 2/10 [00:37<02:11, 16.47s/it]

添加地点用户偏好完成
归一化完成
添加地点用户偏好完成


 30%|████████████████████████▉                                                          | 3/10 [00:38<01:06,  9.44s/it]

归一化完成


 40%|█████████████████████████████████▏                                                 | 4/10 [00:47<00:57,  9.51s/it]

添加地点用户偏好完成
归一化完成


 50%|█████████████████████████████████████████▌                                         | 5/10 [00:53<00:41,  8.24s/it]

添加地点用户偏好完成


 60%|█████████████████████████████████████████████████▊                                 | 6/10 [00:56<00:24,  6.21s/it]

归一化完成
添加地点用户偏好完成
归一化完成


 70%|██████████████████████████████████████████████████████████                         | 7/10 [00:57<00:13,  4.54s/it]

添加地点用户偏好完成
归一化完成


 80%|██████████████████████████████████████████████████████████████████▍                | 8/10 [00:58<00:06,  3.40s/it]

添加地点用户偏好完成


 90%|██████████████████████████████████████████████████████████████████████████▋        | 9/10 [00:58<00:02,  2.58s/it]

归一化完成
添加地点用户偏好完成


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:59<00:00,  5.96s/it]

归一化完成
添加地点用户偏好完成





In [159]:
def save_new_sub_graphs(new_sub_graphs_l):
    i = 1
    for nsg in tqdm(new_sub_graphs_l):
        nl = nsg[0]
        graph = nsg[1]
        adj = nx.adjacency_matrix(graph, nl)
        sparse.save_npz('./graph/similar/{}.npz'.format(i), adj)
        i += 1

In [160]:
save_new_sub_graphs(new_sub_graphs_list)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00,  6.90it/s]


In [161]:
# 保存node_list
nodes_list_new = [s[0] for s in new_sub_graphs_list]
np.savez('./graph/node_list_similar.npz', result=np.array(nodes_list_new, dtype=object))

In [162]:
def read_graph():
    new_sub_graphs_list = []
    nodes_list = np.load('./graph/node_list_similar.npz', allow_pickle=True)['result']
    for i in tqdm(list(range(1,11))):
        sub_graph = nx.from_scipy_sparse_matrix(sparse.load_npz('./graph/similar/{}.npz'.format(i)))
        node_list = nodes_list[i-1]
        new_sub_graphs_list.append([node_list, sub_graph])
    
    return new_sub_graphs_list

In [163]:
# 读取子图
new_sub_graphs_list = read_graph()

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00,  6.02it/s]


In [164]:
# 直接获取拉普拉斯矩阵
def get_L_new(sub_graphs_l, weight='weight'):
    Ls = []
    for sub in tqdm(sub_graphs_l):
        graph = sub[1]
        node_list = list(range(len(sub[0])))
        Ls.append(nx.linalg.laplacianmatrix.laplacian_matrix(graph, node_list, weight))
        
    return Ls

In [165]:
Ls_new = get_L_new(new_sub_graphs_list)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:14<00:00,  1.44s/it]


## 参考论文中的4.3 Computational Issues部分计算 ${L+}$

In [166]:
def cal_target_L_sum(Ls):
    L_sum = []
    
    # 对每一个L
    for L in tqdm(Ls):
        
        length = L.get_shape()[0]
        I = sparse.identity(length)
        
        # ======================= 第一步 =======================
        # 计算i=0的情况，即只计算目标与其他点之间的相似度
        e_i = I.getcol(0)
        e_i = e_i - np.ones((length,1)) / length
        y_i = e_i
    
        # ======================= 第二步 =======================
        # Cholesky factorization & Solve equation
        coo= L.tocoo()
        SP = spmatrix(coo.data, coo.row.tolist(), coo.col.tolist())         # ！耗时较长！
        b = matrix(y_i, tc='d')
        try:
            cholmod.linsolve(SP,b)                                          # ！耗时较长！
        except ArithmeticError:
            print('方程无法分解，考虑直接求解！')
            umfpack.linsolve(SP,b)

        # ======================= 第三步 =======================
        l = np.array(b) - np.mean(np.array(b))
        L_sum.append(l)
        
        # ====================================================
    
    return L_sum

In [167]:
L_sum = cal_target_L_sum(Ls)

 40%|█████████████████████████████████▏                                                 | 4/10 [00:00<00:00,  9.61it/s]

方程无法分解，考虑直接求解！


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 12.15it/s]

方程无法分解，考虑直接求解！
方程无法分解，考虑直接求解！
方程无法分解，考虑直接求解！





In [168]:
L_sum_new = cal_target_L_sum(Ls_new)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 16.26it/s]

方程无法分解，考虑直接求解！
方程无法分解，考虑直接求解！





## 保存结果

In [169]:
# 评分归一化和统计量输出
def standard_and_describe(result, other_result=None):
    i = 1
    new_result = []
    describe_df = pd.DataFrame()
    for r in result:
        if isinstance(r, str) and other_result is not None:
            new_result.append(other_result[i-1])
            r = other_result[i-1]
        else:
            r = (r-r.min())/(r.max() - r.min())
            new_result.append(r)

        # 输出r[0]!=1的情况
        if r[0] != 1:
            print('第{}个结果可能存在问题！'.format(i))
            break
        # 记录describe
        r = pd.DataFrame(r).describe()
        r.columns = [i]
        describe_df = pd.concat([describe_df, r], axis=1)
        i += 1
    return describe_df, new_result

In [170]:
describe, result = standard_and_describe(L_sum)

In [171]:
describe

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
count,2944.0,4323.0,1179.0,9881.0,13642.0,1173.0,2808.0,1630.0,1509.0,1104.0
mean,0.457911,0.77984,0.595799,0.801971,0.78104,0.444606,0.746023,0.445451,0.390123,0.421949
std,0.408504,0.170479,0.399588,0.150851,0.117664,0.450859,0.168023,0.394487,0.396646,0.436762
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.110799,0.745881,0.165096,0.778376,0.762557,0.004823,0.696741,0.14988,0.070516,0.029695
50%,0.193095,0.810328,0.948628,0.830734,0.801399,0.100324,0.722107,0.212377,0.156822,0.12231
75%,0.971196,0.874424,0.948628,0.883903,0.843571,0.975606,0.765306,0.964419,0.968254,0.969259
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [172]:
np.savez('L_sum.npz', result=np.array(result, dtype=object))

In [173]:
describe_new, result_new = standard_and_describe(L_sum_new, result)

In [174]:
describe_new

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
count,2944.0,4323.0,1179.0,9881.0,13642.0,1173.0,2808.0,1630.0,1509.0,1104.0
mean,0.97226,0.985667,0.987868,0.99155,0.995888,0.984001,0.969965,0.984849,0.989768,0.983687
std,0.045182,0.088394,0.082685,0.060579,0.034095,0.108881,0.169351,0.09809,0.044379,0.098866
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.957068,0.999767,0.996705,0.999763,0.999504,0.99595,0.999721,0.997509,0.988572,0.988926
50%,0.957536,0.999827,0.999987,0.999836,0.999726,0.99627,0.999722,0.997525,0.988851,0.990474
75%,0.999965,0.999847,0.999987,0.999853,0.99976,0.99999,0.999834,0.999997,0.999983,0.999971
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [175]:
np.savez('L_sum_new.npz', result=np.array(result_new, dtype=object))

# 记录最终的结果

In [176]:
# 加载node_list
nodes_list = np.load('./graph/node_list.npz', allow_pickle=True)['result']
# 加载node_list_new
nodes_list_new = np.load('./graph/node_list_similar.npz', allow_pickle=True)['result']
# 加载L_sum
L_sum = np.load('./L_sum.npz', allow_pickle=True)['result']
# 加载L_sum_new
L_sum_new = np.load('./L_sum_new.npz', allow_pickle=True)['result']

## 获取用户间的$l_{ij}$并且归一化（无权拉普拉斯矩阵）

In [177]:
def get_u_ij_similar(L_sum, node_l, rtdict):
    similar = {}
    for l_add, nodes in tqdm(list(zip(L_sum, node_l))):
        # 获取中心节点（most_checkins_user）的node_id
        user = nodes[0]
        nodes = pd.DataFrame(nodes)
        # 获取node_id中uid的部分
        nodes = nodes[nodes[0].isin(rtdict['uid'])]
        df = pd.DataFrame(l_add).loc[list(nodes.index), :]
        # 获取其他users
        df = df.iloc[1:,:]
        # 归一化
        df = (df-df.min()) / (df.max()-df.min())
        # 匹配uid
        df['uid'] = nodes.iloc[1:,:].apply(lambda x: rtdict['uid'][x[0]], axis=1)
        similar[int(rtdict['uid'][user])] = dict(zip(df['uid'], df[0]))
        
    return similar

In [178]:
similar_dict = get_u_ij_similar(L_sum, nodes_list, rtdict)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00,  4.67it/s]


In [179]:
similar_dict

{30200: {38: 0.07878217032815471,
  40: 0.11757217794420786,
  45: 0.0,
  47: 0.17193274553855012,
  50: 0.0947386439794172,
  52: 0.15577305250799828,
  56: 0.1392953306711737,
  58: 0.18011116466190044,
  59: 0.21491866216043326,
  68401: 0.5210602579936021,
  187: 0.16288723121894885,
  204: 0.2166991501617278,
  147833: 0.10399913613464655,
  147878: 0.05405080989007283,
  465: 0.27830016272256713,
  115180: 0.18683551057677844,
  512: 0.22497045614133854,
  147970: 0.41858759896995396,
  131601: 0.3516482983336641,
  74276: 0.2819632626323146,
  625: 0.25391259080137474,
  754: 0.20489850106301202,
  935: 0.22818240776131357,
  1030: 0.2336530957365302,
  1031: 0.24919898451174496,
  17510: 0.1798961016681618,
  83084: 0.201216591625348,
  28378: 0.3769636506161469,
  140458: 0.48049317198903224,
  83121: 0.17304070396527893,
  148732: 0.11232987642453009,
  148785: 0.2340537339560481,
  17801: 0.28159613014437657,
  1132058: 1.0,
  67118: 0.6541076041253665,
  34504: 0.4014596769

In [180]:
similar_json = json.dumps(similar_dict)
filename='similar.json'
with open(filename,'w') as file_obj:
    json.dump(similar_json,file_obj)

## 获取target用户与没去过的地点 的𝑙𝑖𝑗 并且归一化（加权拉普拉斯矩阵）

In [181]:
def get_loc_ij_similar(L_sum_new, node_l_new, aggr, rtdict):
    similar = {}
    for l_add, nodes in tqdm(list(zip(L_sum_new, node_l_new))):
        # 获取中心节点（most_checkins_user）的node_id
        user = int(rtdict['uid'][nodes[0]])
        nodes = pd.DataFrame(nodes)
        # 获取node_id中vid的部分
        nodes = nodes[nodes[0].isin(rtdict['vid'])]
        df = pd.DataFrame(l_add).loc[list(nodes.index), :]
        # 获取用户去过的地点
        has_gone_loc = list(aggr.loc[user,:].index)
        # 匹配vid
        df['vid'] = nodes.apply(lambda x: rtdict['vid'][int(x[0])], axis=1)
        # 删除用户去过的地点
        df = df[~df['vid'].isin(has_gone_loc)]
        # 归一化
        df[0] = (df[0]-df[0].min()) / (df[0].max()-df[0].min())
        similar[user] = dict(zip(df['vid'], df[0]))
        
    return similar

In [182]:
similar_dict_new = get_loc_ij_similar(L_sum_new, nodes_list_new, aggr_all, rtdict)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00,  6.79it/s]


### 发现大多数的地点评分都很接近

In [183]:
similar_dict_new

{30200: {74822: 0.035427545618121045,
  74827: 0.035427545618121045,
  153828: 0.02883070121184631,
  9304: 0.03369896052277556,
  25718: 0.03535746767647561,
  1174: 0.02834485184515494,
  1175: 0.02834485184515494,
  42147: 0.0,
  9373: 0.03956857762242965,
  1180: 0.04140039599504347,
  1181: 0.04140039599504347,
  1182: 0.04140039599504347,
  1183: 0.04140039599504347,
  1184: 0.04140039599504347,
  1185: 0.04140039599504347,
  1186: 0.04140039599504347,
  1187: 0.037062134417825264,
  1188: 0.04140039599504347,
  1189: 0.053384019703667435,
  1191: 0.04140039599504347,
  1192: 0.04140039599504347,
  1193: 0.04140039599504347,
  1194: 0.032996304854485875,
  1195: 0.04140039599504347,
  1196: 0.04140039599504347,
  1197: 0.04140039599504347,
  1198: 0.04388825702034915,
  1199: 0.04140039599504347,
  1200: 0.04140039599504347,
  1201: 0.04140039599504347,
  1202: 0.04140039599504347,
  1203: 0.04140039599504347,
  1204: 0.037062134417825264,
  1205: 0.03693932472425621,
  1206: 0.0

### 于是改换成无权拉普拉斯矩阵的做法

In [184]:
similar_dict_new = get_u_ij_similar(L_sum_new, nodes_list_new, rtdict)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00,  6.07it/s]


In [185]:
similar_dict_new

{30200: {38: 0.9571338310369235,
  40: 0.9573690097311561,
  45: 0.956719488047973,
  47: 0.9577025610823089,
  50: 0.957087977833034,
  52: 0.9579272149390543,
  56: 0.9577103726893895,
  58: 0.9567714407770326,
  59: 0.9575571862009673,
  68401: 0.9995335009565239,
  187: 0.9571291500834902,
  204: 0.9574226473955982,
  147833: 0.9572316608859561,
  147878: 0.9572733499511891,
  465: 0.9575599000398555,
  115180: 0.9592512023536337,
  512: 0.957447757249807,
  147970: 0.9577128474443276,
  131601: 0.9995064544424094,
  74276: 0.9573913291831648,
  625: 0.9575075825527102,
  754: 0.9574334343914693,
  935: 0.9576202227198268,
  1030: 0.9574434465111052,
  1031: 0.9575103174535485,
  17510: 0.9594054138967487,
  83084: 0.9590805105707781,
  28378: 0.9998604946306143,
  140458: 0.999969161481681,
  83121: 0.9611322566912777,
  148732: 0.957444937660541,
  148785: 0.9575450051706758,
  17801: 0.9917257114688708,
  1132058: 0.0,
  67118: 0.9999876809589002,
  34504: 0.9585498934368049,
  

In [186]:
similar_new_json = json.dumps(similar_dict_new)
filename='similar_new.json'
with open(filename,'w') as file_obj:
    json.dump(similar_new_json,file_obj)

## 无权和加权拉普拉斯矩阵的区别
* 以user_id为30200的user为例，观察前10名具有最高相似度的user

In [189]:
none_weighted = sorted(similar_dict[30200].items(), key=lambda x: x[1], reverse=True)[:10]
none_weighted

[(1132058, 1.0),
 (692019, 1.0),
 (890024, 1.0),
 (767486, 1.0),
 (366273, 1.0),
 (452953, 0.938870205049575),
 (318699, 0.938870205049575),
 (130808, 0.7281056785722662),
 (396815, 0.6804846840922983),
 (63064, 0.6764168603241539)]

In [190]:
has_weighted = sorted(similar_dict_new[30200].items(), key=lambda x: x[1], reverse=True)[:10]
has_weighted

[(452953, 1.0),
 (318699, 0.9999923899726323),
 (67118, 0.9999876809589002),
 (140458, 0.999969161481681),
 (63614, 0.9999652254000443),
 (717737, 0.9999305060345433),
 (120699, 0.9998947205313697),
 (28378, 0.9998604946306143),
 (63064, 0.9998136873391862),
 (65726, 0.9997280098455819)]

### 查看这些节点的关系

In [224]:
# 无权拉普拉斯矩阵
for i in none_weighted:
    print('user_id {}: 关联节点（图中的编号） {}'.format(i[0], list(sub_graphs_list[0][1].adj[tdict['uid'][i[0]]])))

user_id 1132058: 关联节点（图中的编号） [30199]
user_id 692019: 关联节点（图中的编号） [30199]
user_id 890024: 关联节点（图中的编号） [30199]
user_id 767486: 关联节点（图中的编号） [30199]
user_id 366273: 关联节点（图中的编号） [30199]
user_id 452953: 关联节点（图中的编号） [2917600]
user_id 318699: 关联节点（图中的编号） [2976537]
user_id 130808: 关联节点（图中的编号） [2385993]
user_id 396815: 关联节点（图中的编号） [30199, 35653, 35958]
user_id 63064: 关联节点（图中的编号） [28377, 30199, 68400, 2321968, 2330848]


In [225]:
# 加权拉普拉斯矩阵
for i in has_weighted:
    nodes_l = new_sub_graphs_list[0][0]
    print('user_id {}: 关联节点（图中的编号） {}'.format(i[0], list(new_sub_graphs_list[0][1].adj[nodes_l.index(tdict['uid'][i[0]])])))

user_id 452953: 关联节点（图中的编号） [381]
user_id 318699: 关联节点（图中的编号） [803]
user_id 67118: 关联节点（图中的编号） [2173, 2262]
user_id 140458: 关联节点（图中的编号） [330, 626, 1855]
user_id 63614: 关联节点（图中的编号） [10, 502, 1145]
user_id 717737: 关联节点（图中的编号） [322, 336, 1855, 1880, 2183]
user_id 120699: 关联节点（图中的编号） [12, 111, 787, 798, 820]
user_id 28378: 关联节点（图中的编号） [4, 9, 2202, 2256]
user_id 63064: 关联节点（图中的编号） [0, 15, 355, 1487, 1862]
user_id 65726: 关联节点（图中的编号） [15, 186]


## 推荐（无权拉普拉斯矩阵）

In [12]:
filename='similar.json'
with open(filename,'r') as file_obj:
# 读取similar
    similar_dict = json.load(file_obj)
similar_dict = json.loads(similar_dict)
# 读取aggr_all
aggr_all = pd.read_csv(CLEAN_AGGR_ALL_DATA, index_col=[0,1], dtype=all_dtypes['aggr'])

In [13]:
def get_recommend(similar_dict, aggr):
    
    rec = {}
    for key, values in tqdm(similar_dict.items()):
        # 用户之间的相似度
        recommend = pd.DataFrame()
        recommend = recommend.append(values, ignore_index=True).T
        recommend.columns = ['similarity']
        
        # 有些用户没有checkins 
        recommend.index = [int(x) for x in list(recommend.index)]
        has_checkins = set([i[0] for i in aggr.index.tolist()]) & set(recommend.index.tolist())
        relate_checkins = aggr.loc[has_checkins,:].reset_index(drop=False)
        # 去除目标用户已经访问过的地方
        has_been_visited = set(aggr.loc[int(key),:].reset_index(drop=False)['venue_id'].values)
        relate_checkins = relate_checkins[relate_checkins.venue_id.isin(set(relate_checkins.venue_id)-has_been_visited)]
        

        # 加入用户相似度评分
        relate_checkins = pd.merge(relate_checkins, recommend, left_on='user_id', right_index=True, how='inner')

        relate_checkins['synchronize'] = relate_checkins['rating_mean'].map(lambda x: exp(x-2)/3)\
                                * relate_checkins['checkins_count_adjust'].map(lambda x: pow(x, 1/3))\
                                * relate_checkins['similarity']
        relate_checkins = relate_checkins.astype('float64')
        rec_location = relate_checkins.groupby('venue_id').synchronize.sum() / \
                    relate_checkins.groupby('venue_id').similarity.sum()
        # 可能存在分母为0的情况，填充na为0
        rec_location.fillna(0, inplace=True)
        max_ = np.max(rec_location)
        min_ = np.min(rec_location)
        rec_location = rec_location.map(lambda x: (x-min_)/(max_-min_))
        rec[key] = rec_location.to_dict() 
        
    return rec

In [14]:
recommend_dict = get_recommend(similar_dict, aggr_all)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00,  2.49it/s]


In [15]:
recommend_dict['30200']

{60.0: 0.04978706836786393,
 64.0: 0.04978706836786393,
 208.0: 0.3678794411714422,
 380.0: 0.2811263856374208,
 381.0: 0.2811263856374208,
 382.0: 0.2811263856374208,
 383.0: 0.2811263856374208,
 384.0: 0.2811263856374208,
 385.0: 0.2811263856374208,
 386.0: 0.34847000588630744,
 387.0: 0.2811263856374208,
 388.0: 0.3903266745113293,
 389.0: 0.379925431361456,
 390.0: 0.3447490653282655,
 391.0: 0.2811263856374208,
 392.0: 0.2811263856374208,
 393.0: 0.2811263856374208,
 394.0: 0.2811263856374208,
 395.0: 0.35717924128691153,
 396.0: 0.2811263856374208,
 397.0: 0.34847000588630744,
 398.0: 0.2811263856374208,
 399.0: 0.2811263856374208,
 400.0: 0.2811263856374208,
 401.0: 0.2811263856374208,
 402.0: 0.2811263856374208,
 403.0: 0.2811263856374208,
 404.0: 0.379925431361456,
 405.0: 0.2811263856374208,
 406.0: 0.379925431361456,
 407.0: 0.33521086748560946,
 408.0: 0.2811263856374208,
 409.0: 0.2811263856374208,
 410.0: 0.2811263856374208,
 411.0: 0.2811263856374208,
 412.0: 0.281126385

In [16]:
recommend_json = json.dumps(recommend_dict)
# 保存推荐结果
filename='recommend.json'
with open(filename,'w') as file_obj:
    json.dump(recommend_json,file_obj)

## 推荐（加权拉普拉斯矩阵）也使用相同的推荐算法

In [17]:
filename='similar_new.json'
with open(filename,'r') as file_obj:
# 读取similar
    similar_dict_new = json.load(file_obj)
similar_dict_new = json.loads(similar_dict_new)
recommend_dict_new = get_recommend(similar_dict_new, aggr_all)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00,  3.18it/s]


In [18]:
recommend_dict_new['30200']

{60.0: 0.04978706836786394,
 64.0: 0.04978706836786394,
 208.0: 0.36787944117144233,
 380.0: 0.2811263856374209,
 381.0: 0.2811263856374209,
 382.0: 0.2811263856374209,
 383.0: 0.2811263856374209,
 384.0: 0.2811263856374209,
 385.0: 0.2811263856374209,
 386.0: 0.3389721108667551,
 387.0: 0.2811263856374209,
 388.0: 0.3723239200810958,
 389.0: 0.37992543136145607,
 390.0: 0.3245094568785692,
 391.0: 0.2811263856374209,
 392.0: 0.2811263856374209,
 393.0: 0.2811263856374209,
 394.0: 0.2811263856374209,
 395.0: 0.3429765856319122,
 396.0: 0.2811263856374209,
 397.0: 0.3389721108667551,
 398.0: 0.2811263856374209,
 399.0: 0.2811263856374209,
 400.0: 0.2811263856374209,
 401.0: 0.2811263856374209,
 402.0: 0.2811263856374209,
 403.0: 0.2811263856374209,
 404.0: 0.37992543136145607,
 405.0: 0.2811263856374209,
 406.0: 0.37992543136145607,
 407.0: 0.33318237019153574,
 408.0: 0.2811263856374209,
 409.0: 0.2811263856374209,
 410.0: 0.2811263856374209,
 411.0: 0.2811263856374209,
 412.0: 0.28112

In [19]:
recommend_json_new = json.dumps(recommend_dict_new)
# 保存推荐结果
filename='recommend_new.json'
with open(filename,'w') as file_obj:
    json.dump(recommend_json_new,file_obj)

# 评估

## 读取数据

In [4]:
udf = pd.read_csv(CLEAN_USERS_DATA, dtype=all_dtypes['users'])
sdf = pd.read_csv(CLEAN_SOCIALGRAPH_DATA, dtype=all_dtypes['socialgraph'])
aggr_all = pd.read_csv(CLEAN_AGGR_ALL_DATA, index_col=[0,1], dtype=all_dtypes['aggr'])

In [5]:
# 找出拥有最多checkins的前10名用户
most_checkins_users = list(aggr_all.groupby(['user_id']).checkins_count_adjust.sum()\
                          .sort_values(ascending=False).index)[:10]

## 不考虑时间先后，随机抹去$x\%$的地点数据（这里$x=30$）

In [6]:
# most_checkins_users的aggr数据
most_aggr = aggr_all.loc[most_checkins_users,:]
most_aggr

Unnamed: 0_level_0,Unnamed: 1_level_0,rating_mean,rating_count,checkins_count,checkins_count_adjust
user_id,venue_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
30200,836,3.5,2,2.0,2.0
30200,1190,3.0,1,1.0,1.0
30200,168605,4.5,2,2.0,2.0
30200,215113,4.0,1,1.0,1.0
30200,231037,5.0,1,1.0,1.0
...,...,...,...,...,...
46040,876652,5.0,1,1.0,1.0
46040,876653,5.0,1,1.0,1.0
46040,876654,5.0,1,1.0,1.0
46040,876655,5.0,1,1.0,1.0


In [7]:
def mark_off_venues(df, percent=0.3):
    df = df.reset_index(drop=False)
    new_aggr = pd.DataFrame()
    for u, value in df.groupby('user_id'):
        # 随机打乱
        value = shuffle(value)
        # 取计访问次数达到30%后的数据
        value['percent'] = value['checkins_count_adjust'].cumsum() / value['checkins_count_adjust'].sum()
        # 如果第一项就是经常去的地方，则重新shuffle
        while value['percent'].values[0] > percent:
            value = shuffle(value)
            value['percent'] = value['checkins_count_adjust'].cumsum() / value['checkins_count_adjust'].sum()
        value = value[value['percent'] > percent]
        new_aggr = new_aggr.append(value.set_index(['user_id', 'venue_id']))
    
    return new_aggr

In [8]:
most_aggr_mark_off = mark_off_venues(most_aggr)

In [9]:
most_aggr_mark_off

Unnamed: 0_level_0,Unnamed: 1_level_0,rating_mean,rating_count,checkins_count,checkins_count_adjust,percent
user_id,venue_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
281,88994,3.0,1,1.0,1.0,0.301514
281,3213,4.5,2,2.0,2.0,0.304932
281,257902,3.0,1,1.0,1.0,0.306641
281,169011,3.0,1,1.0,1.0,0.308350
281,2576,3.0,1,1.0,1.0,0.310059
...,...,...,...,...,...,...
103224,833901,5.0,1,1.0,1.0,0.993652
103224,833530,5.0,1,1.0,1.0,0.995117
103224,833861,5.0,1,1.0,1.0,0.997070
103224,833919,5.0,1,1.0,1.0,0.998535


In [10]:
mark_off_df = most_aggr.drop(most_aggr_mark_off.index)

In [11]:
mark_off_df

Unnamed: 0_level_0,Unnamed: 1_level_0,rating_mean,rating_count,checkins_count,checkins_count_adjust
user_id,venue_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
30200,1190,3.0,1,1.0,1.0
30200,231037,5.0,1,1.0,1.0
30200,674629,5.0,1,1.0,1.0
30200,823637,5.0,1,1.0,1.0
30200,823640,5.0,1,1.0,1.0
...,...,...,...,...,...
46040,876647,5.0,1,1.0,1.0
46040,876648,5.0,1,1.0,1.0
46040,876652,5.0,1,1.0,1.0
46040,876654,5.0,1,1.0,1.0


## 合并成新的aggr_all

In [12]:
# 删除前100的数据并加入mark_off后的数据
aggr_all_new = aggr_all.drop(most_aggr.index).append(most_aggr_mark_off)
aggr_all_new.drop('percent', axis=1, inplace=True)

In [13]:
aggr_all_new

Unnamed: 0_level_0,Unnamed: 1_level_0,rating_mean,rating_count,checkins_count,checkins_count_adjust
user_id,venue_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,1,5.000000,1,1.0,1.0
1,51,3.666016,3,1.0,3.0
1,52,5.000000,1,1.0,1.0
1,53,5.000000,1,1.0,1.0
1,54,5.000000,1,1.0,1.0
...,...,...,...,...,...
103224,833901,5.000000,1,1.0,1.0
103224,833530,5.000000,1,1.0,1.0
103224,833861,5.000000,1,1.0,1.0
103224,833919,5.000000,1,1.0,1.0


## 计算$L+$矩阵

### 函数准备

In [14]:
def transfer_dict(uid,vid):
    # 字典总长度
    new_id = list(range(len(uid+vid)))
    # 正向字典：{真实id：node_id}
    transfer_dict = {'uid':dict(zip(uid, new_id[:len(uid)])), 'vid':dict(zip(vid, new_id[len(uid):]))}
    # 逆向字典：{node_id：真实id}
    reverse_transfer_dict = {'uid':dict(zip(new_id[:len(uid)], uid)), 'vid':dict(zip(new_id[len(uid):], vid))}
    return transfer_dict, reverse_transfer_dict

# 创建用户-用户邻接矩阵
def create_user_user_adj(tdict, social, adj):
    udict = tdict['uid']
    # 通过字典进行转换
    row = list(social.first_user_id.map(lambda x: udict[x]).values)
    line = list(social.second_user_id.map(lambda x: udict[x]).values)
    # 因为原数据已经是双向了，所以只需要连接一次
    adj[row,line] = 1
    return adj

# 创建用户-地点邻接矩阵
def create_user_venue_adj(tdict, aggr, adj):
    udict = tdict['uid']
    vdict = tdict['vid']
    # 通过字典进行转换
    row = [udict[i[0]] for i in list(aggr[aggr].index)]
    line = [vdict[i[1]] for i in list(aggr[aggr].index)]
    # 需要双向连接
    adj[row,line] = 1
    adj[line,row] = 1
    return adj

# 将uid列表转为node_id列表
def transfer_uid_list_2_node(l, tdict):
    nlist = []
    for n in l:
        nlist.append(tdict['uid'][n])
        
    return nlist

def get_sub_graphs(G, nlist, tdict, limit=2, drop_none_friends=True):
    # 先将uid列表（nlist）转换为node_id列表
    nlist = transfer_uid_list_2_node(nlist, tdict)
    sub_graphs = []
    for n in tqdm(nlist):
        # 获取limit步以内的图构成图网络
        nodes = list(nx.dfs_preorder_nodes(G, n, limit))
        sub = nx.subgraph(G, nodes)
        # 删除第二步的user的user节点
        if drop_none_friends:
            loc_id = tdict['vid'][1]
            friends_nodes = [user for user in list(sub.adj[n]) if user < loc_id]
            loc_nodes = [loc for loc in list(sub.nodes) if loc >= loc_id]
            nodes = friends_nodes + loc_nodes
            # 记录由地点引出的user节点
            loc_user = []
            loc_nodes = [loc for loc in sub.adj[n] if loc >= loc_id]
            for loc in loc_nodes:
                loc_user.append(list(sub.adj[loc]))
            loc_user = sum(loc_user, [])
            nodes = list(set(nodes + loc_user))  
            nodes.remove(n)
            nodes = [n] + nodes
        
        sub = nx.subgraph(G, nodes)
        sub_graphs.append([nodes, sub])

    return sub_graphs

# 计算Jaccard相似系数并返回weight属性更新的graph
def jaccard_sparse(graph, node_l, tdict, keep_nodes):
    target = node_l[0]
    loc_id = tdict['vid'][1]
    user_id = [user for user in keep_nodes if user < loc_id]
    similarity = []
    for u in tqdm(user_id):
        adj = set(graph.adj[u])
        for au in [a for a in adj if ((a < loc_id) and (a in keep_nodes))]:
            au_adj = set(graph.adj[au])
            union = adj | au_adj
            intrsct = adj & au_adj
            jaccard = len(intrsct) / len(union)
            similarity.append((u, au, jaccard))
    
    # jaccard 归一化
    similarity = pd.DataFrame(similarity, columns=['user_1','user_2','similarity'])
    max_ =  similarity['similarity'].max()
    min_ = similarity['similarity'].min() - 0.00001
    similarity['similarity'] = similarity['similarity'].map(lambda x: (x-min_)/(max_-min_))
    similarity = similarity[['user_1', 'user_2', 'similarity']].values
    
    similarity = [(i[0], i[1], {'weight':i[2]}) for i in similarity]
    print('归一化完成')
    
    # 解冻
    graph = nx.Graph(graph)
    
    # 添加第一层的user相似度
    graph.add_edges_from(similarity)
    print('添加第一层user相似度完成')
    
    # 删除多余的user节点
    graph = nx.subgraph(graph, keep_nodes)
    print('删除第二层user完成')
    
    return graph

# 计算用户对地点的偏好程度并返回weight属性更新的graph
def user_preference(aggr, graph, node_l, tdict, rtdict):
    loc_id = tdict['vid'][1]
    user_g = np.array(node_l)[np.array(node_l) < loc_id]
    user = [rtdict['uid'][ug] for ug in user_g]
    # 筛选出那些有checkins的用户
    has_checkins = aggr.reset_index(drop=False, inplace=False)['user_id'].values
    user = list(set(user) & set(has_checkins))
    aggr = aggr.loc[user, :]
    # 筛选出在图中的vid
    loc = [loc for loc in graph.nodes if loc >= loc_id]
    
    # 对评分进行归一化
    max_ = np.max(aggr['rating_mean'])
    min_ = np.min(aggr['rating_mean'])
    aggr['rating_mean'] = aggr['rating_mean'].map(lambda x: (x-min_)/(max_-min_))
    aggr['total_point'] = aggr.apply(lambda x: x['rating_mean']*pow(x['checkins_count_adjust'], 1/3), axis=1)
    aggr.reset_index(drop=False, inplace=True)
    preference = aggr[['user_id', 'venue_id', 'total_point']]
    preference['user_id'] = preference['user_id'].map(lambda x: tdict['uid'][x])
    preference['venue_id'] = preference['venue_id'].map(lambda x: tdict['vid'][x])
    # 筛选出在图中的vid
    preference = preference[preference['venue_id'].isin(loc)]
    # 结果归一化
    max_ = preference['total_point'].max()
    min_ = preference['total_point'].min() - 0.00001
    preference['total_point'] = preference['total_point'].map(lambda x: (x-min_)/(max_-min_))
    preference = preference[['user_id', 'venue_id', 'total_point']].values
    
    preference = [(i[0], i[1], {'weight': i[2]}) for i in preference] + [(i[1], i[0], {'weight': i[2]}) for i in preference]
    print('归一化完成')
    
    # 解冻
    graph = nx.Graph(graph)
    
    # 添加用户对地点的偏好
    graph.add_edges_from(preference)
    print('添加地点用户偏好完成')
    
    return graph

# 根据子图计用户间的相似度
def similarity_of_each_friend(sub_graphs_l, tdict, keep_node_l):
    new_sub_graphs_list = []
    i = 0
    for sub in tqdm(sub_graphs_l):
        node_list = sub[0]
        graph = sub[1]
        keep_nodes = keep_node_l[i]
        new_graph = jaccard_sparse(graph, node_list, tdict, keep_nodes)
        new_sub_graphs_list.append([keep_nodes, new_graph])
        i += 1

    return new_sub_graphs_list

# 地点的评分和访问次数综合打分并修改连接数值
def location_preference(aggr, sub_graphs_l, tdict, rtdcit):
    new_sub_graphs_list = []
    for sub in tqdm(sub_graphs_l):
        aggr = aggr.copy()
        node_list = sub[0]
        graph = sub[1]
        new_graph = user_preference(aggr, graph, node_list, tdict, rtdict)
        new_sub_graphs_list.append([node_list, new_graph])
    
    return new_sub_graphs_list
        
def get_new_sub_graphs_list(aggr, sub_graphs_l, tdict, rtdict, keep_node_l):
    new_sub_graphs_list = similarity_of_each_friend(sub_graphs_l, tdict, keep_node_l)
    new_sub_graphs_list = location_preference(aggr, new_sub_graphs_list, tdict, rtdict)
    
    return new_sub_graphs_list


# 直接获取拉普拉斯矩阵
def get_L(sub_graphs_l, weight='weight'):
    Ls = []
    for sub in tqdm(sub_graphs_l):
        node_list = sub[0]
        graph = sub[1]
        Ls.append(nx.linalg.laplacianmatrix.laplacian_matrix(graph, node_list, weight))
        
    return Ls

# 直接获取拉普拉斯矩阵
def get_L_new(sub_graphs_l, weight='weight'):
    Ls = []
    for sub in tqdm(sub_graphs_l):
        graph = sub[1]
        node_list = list(range(len(sub[0])))
        Ls.append(nx.linalg.laplacianmatrix.laplacian_matrix(graph, node_list, weight))
        
    return Ls

def cal_target_L_sum(Ls):
    L_sum = []
    
    # 对每一个L
    for L in tqdm(Ls):
        
        length = L.get_shape()[0]
        I = sparse.identity(length)
        
        # ======================= 第一步 =======================
        # 计算i=0的情况，即只计算目标与其他点之间的相似度
        e_i = I.getcol(0)
        e_i = e_i - np.ones((length,1)) / length
        y_i = e_i
    
        # ======================= 第二步 =======================
        # Cholesky factorization & Solve equation
        coo= L.tocoo()
        SP = spmatrix(coo.data, coo.row.tolist(), coo.col.tolist())         # ！耗时较长！
        b = matrix(y_i, tc='d')
        try:
            cholmod.linsolve(SP,b)                                          # ！耗时较长！
        except ArithmeticError:
            print('方程无法分解，考虑直接求解！')
            umfpack.linsolve(SP,b)

        # ======================= 第三步 =======================
        l = np.array(b) - np.mean(np.array(b))
        L_sum.append(l)
        
        # ====================================================
    
    return L_sum

# 评分归一化和统计量输出
def standard_and_describe(result):
    i = 1
    new_result = []
    describe_df = pd.DataFrame()
    for r in result:
        r = (r-r.min())/(r.max() - r.min())
        new_result.append(r)
        # 输出r[0]!=1的情况
        if r[0] != 1:
            print('第{}个结果可能存在问题！'.format(i))
            break
        
        # 记录describe
        r = pd.DataFrame(r).describe()
        r.columns = [i]
        describe_df = pd.concat([describe_df, r], axis=1)
        i += 1
        
    return describe_df, new_result

### 计算

In [None]:
# 构造转换字典
tdict, rtdict = transfer_dict(list(udf.id.values),\
                              np.unique(aggr_all_new.reset_index(drop=False).venue_id.values).tolist())
# 观察node为多少的时候开始是地点
print('node序号为{}及之后的为地点'.format(tdict['vid'][1]))

# 稀疏矩阵长度
len_of_id = len(list(udf.id.values)+np.unique(aggr_all_new.reset_index(drop=False).venue_id.values).tolist())
# 创建稀疏矩阵
adjacency = sparse.lil_matrix((len_of_id, len_of_id),dtype='int8')
# 构建用户之间的邻接矩阵
print('正在生成用户-用户邻接矩阵...')
adjacency = create_user_user_adj(tdict, sdf, adjacency)
# 构建用户-地点之间的邻接矩阵
print('正在生成用户-地点邻接矩阵...')
aggr_count = aggr_all_new.swifter.apply(lambda x: (x['rating_mean']>=3.6)|(x['checkins_count_adjust']>=2), axis=1)
adjacency = create_user_venue_adj(tdict, aggr_count, adjacency)
# 根据邻接矩阵生成图
print('正在生成图...')
G = nx.from_scipy_sparse_matrix(adjacency)

# 获取拉普拉斯矩阵L
print('正在获取子图...')
sub_graphs_list = get_sub_graphs(G, most_checkins_users, tdict)
# 获取节点数据
nodes_list = [sub[0] for sub in sub_graphs_list]
# 获取带第二层user的子图
sub_graphs_list_with_users = get_sub_graphs(G, most_checkins_users, tdict, drop_none_friends=False)
# 计算Jaccard相似系数和综合评分并返回
new_sub_graphs_list = get_new_sub_graphs_list(aggr_all_new, sub_graphs_list_with_users, tdict, rtdict, nodes_list)
# 获取节点数据
nodes_list_new = [sub[0] for sub in new_sub_graphs_list]
print('正在计算拉普拉斯矩阵L...')
Ls = get_L(sub_graphs_list)
Ls_new = get_L_new(new_sub_graphs_list)


# # 计算L+
print('正在计算L+...')
L_sum = cal_target_L_sum(Ls)
L_sum_new = cal_target_L_sum(Ls_new)
print('计算完成！')

In [24]:
describe, result = standard_and_describe(L_sum)
describe_new, result_new = standard_and_describe(L_sum_new)

In [None]:
describe

In [43]:
describe_new

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,16,17,18,19,20,21,22,23,24,25
count,200691.0,104956.0,102403.0,139050.0,69136.0,106841.0,172055.0,156101.0,146599.0,111005.0,...,120297.0,74507.0,155166.0,129588.0,70567.0,212139.0,189342.0,140547.0,97447.0,111489.0
mean,0.004267,0.023818,0.004938,0.125126,0.221659,0.003405,0.002791,0.003165,0.002521,0.00294,...,0.002321,0.033535,0.001901,0.002545,0.003462,0.00297,0.001597,0.001768,0.005886,0.002143
std,0.062198,0.075222,0.064888,0.038695,0.038172,0.055796,0.043664,0.048853,0.045797,0.051644,...,0.044601,0.050196,0.038104,0.043859,0.056967,0.028189,0.033753,0.038662,0.051712,0.044338
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.000309,0.01638,0.000524,0.121001,0.216246,0.000165,0.000732,0.000533,0.00032,0.000192,...,0.000192,0.029693,0.000345,0.000478,0.000122,0.001616,0.000366,0.000202,0.002551,0.00014
50%,0.000392,0.017268,0.000588,0.124128,0.228381,0.000198,0.00088,0.000636,0.00038,0.000221,...,0.000226,0.030561,0.000419,0.000564,0.000142,0.002082,0.000459,0.00024,0.0028,0.000164
75%,0.000431,0.018696,0.000637,0.126841,0.231392,0.000219,0.000959,0.000693,0.000417,0.000236,...,0.000245,0.0316,0.000457,0.000613,0.000158,0.002311,0.000502,0.000261,0.002981,0.000177
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## 获取好友用户间的相似性

### 函数准备

In [44]:
def get_u_ij_similar(G, L_sum, node_l, rtdict):
    similar = {}
    for l_add, nodes in tqdm(list(zip(L_sum, node_l))):
        # 获取中心节点（most_checkins_user）的node_id
        user = nodes[0]
        nodes = pd.DataFrame(nodes)
        # 获取node_id中uid的部分
        nodes = nodes[nodes[0].isin(rtdict['uid'])]
        # 去除非好友用户和中心节点
        nodes = nodes[nodes[0].isin(set(G.adj[user]).intersection(set(nodes[0].values)))]
        df = pd.DataFrame(l_add).loc[list(nodes.index), :]
        
        # 归一化
        df = (df-df.min()) / (df.max()-df.min())
        # 匹配uid
        df['uid'] = nodes.apply(lambda x: rtdict['uid'][x[0]], axis=1)
        similar[int(rtdict['uid'][user])] = dict(zip(df['uid'], df[0]))
    return similar

### 计算

In [45]:
similar_dict = get_u_ij_similar(G, L_sum, nodes_list, rtdict)

100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [00:07<00:00,  3.33it/s]


In [None]:
similar_dict_new = get_u_ij_similar(G, L_sum_new, nodes_list_new, rtdict)

## 推荐

In [48]:
def get_recommend(similar_dict, aggr_all_new, most_aggr_mark_off):
    
    rec = {}
    for key, values in tqdm(similar_dict.items()):
        # 用户之间的相似度
        recommend = pd.DataFrame()
        recommend = recommend.append(values, ignore_index=True).T
        recommend.columns = ['similarity']
        
        # 有些用户没有checkins 
        recommend.index = [int(x) for x in list(recommend.index)]
        has_checkins = set([i[0] for i in aggr_all_new.index.tolist()]) & set(recommend.index.tolist())
        relate_checkins = aggr_all_new.loc[has_checkins,:].reset_index(drop=False)
        # 去除用户已经访问过的地方
        has_been_visited = set(most_aggr_mark_off.reset_index(drop=False).venue_id)
        relate_checkins = relate_checkins[relate_checkins.venue_id.isin(set(relate_checkins.venue_id)-has_been_visited)]
        
        # 加入用户相似度评分
        relate_checkins = pd.merge(relate_checkins, recommend, left_on='user_id', right_index=True)

        relate_checkins['synchronize'] = relate_checkins['rating_mean'].map(lambda x: exp(x-3.6))\
                                * relate_checkins['checkins_count_adjust'] * relate_checkins['similarity']
        relate_checkins = relate_checkins.astype('float64')
        rec_location = relate_checkins.groupby('venue_id').synchronize.sum() / \
                    relate_checkins.groupby('venue_id').similarity.sum()
        # 可能存在分母为0的情况，填充na为0
        rec_location.fillna(0, inplace=True)
        rec[key] = rec_location
        
    return rec

In [49]:
rec_location = get_recommend(similar_dict, aggr_all_new, most_aggr_mark_off)

100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [00:38<00:00,  1.54s/it]


In [None]:
rec_location_new = get_recommend(similar_dict_new, aggr_all_new, most_aggr_mark_off)

## 验证

In [52]:
def validation(rec, mark_off_df, percent=[0.01,0.05,0.1,0.3,0.5,0.8,1]):
    evaluation = {}
    for key, values in rec.items():
        # 从大到小排序
        result = pd.DataFrame(values.items()).set_index([0]).sort_values(by=1, ascending=False)
        # 获取朋友去过的所有地点与被遮蔽的地点的交集
        all_loc = set(result.index) & set(mark_off_df.loc[key,:].index)
        if len(all_loc) == 0:
            print('{}的朋友去过的地点与TA被遮盖的地点之间没有重复！'.format(key))
            continue
        has_include = []
        # 观察不同%下推荐的地点覆盖被遮盖地点的比例
        for p in percent:
            # 取二者的交集
            r = set(result.iloc[:int(result.shape[0]*p),:].index).intersection(set(mark_off_df.loc[key,:].index))
            has_include.append([p, len(r)/len(all_loc)])
        evaluation[key] = has_include
    
    eval_df = pd.DataFrame()
    for key, values in evaluation.items():
        df = pd.DataFrame(values, index=[[key]*len(percent), list(range(len(percent)))])
        df['总的交集大小'] = len(set(rec[key].index).intersection(set(mark_off_df.loc[key,:].index)))
        eval_df = eval_df.append(df)
    eval_df.columns = ['推荐地点的前%', '覆盖被遮盖的地点的比率', '总的交集大小']
    
    return eval_df       

In [53]:
evaluation = validation(rec_location, mark_off_df)

30200的朋友去过的地点与TA被遮盖的地点之间没有重复！
54953的朋友去过的地点与TA被遮盖的地点之间没有重复！
103224的朋友去过的地点与TA被遮盖的地点之间没有重复！
281的朋友去过的地点与TA被遮盖的地点之间没有重复！
4442的朋友去过的地点与TA被遮盖的地点之间没有重复！
79082的朋友去过的地点与TA被遮盖的地点之间没有重复！
41460的朋友去过的地点与TA被遮盖的地点之间没有重复！
56474的朋友去过的地点与TA被遮盖的地点之间没有重复！
61219的朋友去过的地点与TA被遮盖的地点之间没有重复！
46040的朋友去过的地点与TA被遮盖的地点之间没有重复！


ValueError: Length mismatch: Expected axis has 0 elements, new values have 3 elements

# 基于共同好友的相似度

## 创建用户-用户的稀疏矩阵

In [52]:
# 稀疏矩阵长度
len_of_id = len(list(udf.id.values))
# 构造转换字典
new_id = list(range(len_of_id))
tdict = {'uid': dict(zip(list(udf.id.values), new_id))}
rtdict = {'uid': dict(zip(new_id, list(udf.id.values)))}
# 创建稀疏矩阵
adjacency = sparse.lil_matrix((len_of_id, len_of_id),dtype='int8')

In [53]:
adjacency = create_user_user_adj(tdict, sdf, adjacency)

## 根据邻接矩阵生成图

In [54]:
G_social = nx.from_scipy_sparse_matrix(adjacency)

## 根据共同好友计算好友间的相似度

In [55]:
def get_social_similarity(G_s, users, tdict, rtdict):
    user_similar = {}
    for u in users:
        node = tdict['uid'][u]
        fri_node = list(G_s.adj[node])
        similar = {}
        for n in tqdm(fri_node):
            # 获取两人的各自的好友并取并集
            all_fri = set(list(G_s.adj[n]) + fri_node)
            # 获取两人的共同好友
            common_fri = set(list(G_s.adj[n])).intersection(set(fri_node))
            # 计算相似度：共同好友数量 / 各自好友的并集的数量
            similar[rtdict['uid'][n]] = len(common_fri) / len(all_fri)
        user_similar[u] = similar
        
    return user_similar

In [56]:
social_similar = get_social_similarity(G_social, most_checkins_users, tdict, rtdict)

100%|█████████████████████████████████████████████████████████████████████████████████| 72/72 [00:00<00:00, 334.88it/s]
100%|████████████████████████████████████████████████████████████████████████████| 1071/1071 [00:00<00:00, 3454.89it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 66/66 [00:00<00:00, 1346.95it/s]
100%|████████████████████████████████████████████████████████████████████████████| 4427/4427 [00:02<00:00, 2089.16it/s]
100%|████████████████████████████████████████████████████████████████████████████| 3045/3045 [00:01<00:00, 2001.93it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 583.88it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 151/151 [00:00<00:00, 888.02it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 156/156 [00:00<00:00, 1367.98it/s]
100%|███████████████████████████████████

## 根据好友相似度进行推荐

In [57]:
rec_location_social = get_recommend(social_similar, aggr_all_new, most_aggr_mark_off)

100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [00:11<00:00,  2.22it/s]


In [58]:
rec_location_social

{30200: venue_id
 208.0       1.491825
 380.0       1.809675
 381.0       1.809675
 382.0       1.809675
 383.0       1.809675
               ...   
 917671.0    7.459123
 920009.0    1.491825
 921472.0    1.491825
 989876.0    4.475474
 997291.0    1.491825
 Length: 2624, dtype: float64,
 54953: venue_id
 1.0          1.809675
 60.0         0.392726
 64.0         0.658385
 246.0        1.809675
 257.0        0.548812
                ...   
 1136787.0    0.000000
 1136790.0    0.000000
 1136794.0    0.000000
 1136797.0    0.000000
 1136819.0    0.000000
 Length: 10308, dtype: float64,
 103224: venue_id
 60.0         2.818752
 208.0        1.491825
 386.0        1.491825
 390.0        1.491825
 395.0        1.491825
                ...   
 910495.0     1.491825
 917671.0     7.459123
 997291.0     1.491825
 1088573.0    4.055200
 1088601.0    2.983649
 Length: 1210, dtype: float64,
 281: venue_id
 1.0          0.201897
 60.0         1.236328
 64.0         0.286894
 74.0         4.055200

## 验证

In [59]:
evaluation_social = validation(rec_location_social, mark_off_df)

46040的朋友去过的地点与TA被遮盖的地点之间没有重复！
87283的朋友去过的地点与TA被遮盖的地点之间没有重复！
34456的朋友去过的地点与TA被遮盖的地点之间没有重复！


# 比较
* https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html

In [60]:
df = pd.merge(evaluation, evaluation_social,left_index=True, right_index=True, \
         suffixes=('_L+', '_social')).drop(['推荐地点的前%_social', '总的交集大小_L+'], axis=1).\
         rename(columns={'推荐地点的前%_L+': '推荐地点的前%', '总的交集大小_social': '总的交集大小'})

In [61]:
def highlight_adv(s):
    '''
    highlight the advantages of L+.
    '''
    if s['覆盖被遮盖的地点的比率_L+'] > s['覆盖被遮盖的地点的比率_social']:
        color = 'background-color: darkorange'
    elif s['覆盖被遮盖的地点的比率_L+'] < s['覆盖被遮盖的地点的比率_social']:
        color = 'background-color: darkgreen'
    else:
        color = ''
    is_adv = ['', color, '', '']
    return is_adv

In [62]:
df.style.apply(highlight_adv, axis=1)

Unnamed: 0,Unnamed: 1,推荐地点的前%,覆盖被遮盖的地点的比率_L+,覆盖被遮盖的地点的比率_social,总的交集大小
30200,0,0.01,0.0,0.0,2
30200,1,0.05,0.0,0.0,2
30200,2,0.1,0.5,0.5,2
30200,3,0.3,0.5,0.5,2
30200,4,0.5,0.5,0.5,2
30200,5,0.8,1.0,0.5,2
30200,6,1.0,1.0,1.0,2
54953,0,0.01,0.0,0.0,3
54953,1,0.05,0.0,0.0,3
54953,2,0.1,0.0,0.0,3
