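# A Recommendation Engine for GitHub Repositories

This engine recommends GitHub repositories.

It uses the GitHub API and is built on collaborative filtering.

The system first fetches every repository I have starred and collects the owners of those repositories.

It then fetches the repositories those owners have starred and compares them with mine to find the users most similar to me. Once the most similar GitHub users are found, the repositories they have starred form a set of recommendations.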

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import requests
import json

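## Get a token

Open the following URL to reach the Personal access tokens page.

https://github.com/settings/tokens

Click "Generate new token", tick the repo and user scopes, and click "Generate token".

GitHub then shows a random string. Assign it to the mypw variable below.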
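Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the username and token have been exported as environment variables beforehand. The names GITHUB_USER and GITHUB_TOKEN and the helper load_credentials are hypothetical and not part of the original notebook:

```python
import os

def load_credentials(env=os.environ):
    """Return (username, token) read from environment variables.

    GITHUB_USER and GITHUB_TOKEN are hypothetical names chosen for this
    sketch; use whatever names the values were exported under.
    """
    user = env.get('GITHUB_USER')
    token = env.get('GITHUB_TOKEN')
    if not user or not token:
        raise RuntimeError('set GITHUB_USER and GITHUB_TOKEN before running the notebook')
    return user, token

# Possible usage in the next cell (commented out so the sketch runs anywhere):
# myun, mypw = load_credentials()
```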

In [2]:
# Your own GitHub username and personal access token are required
myun = '935048000'
mypw = 'xxxxx'

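## Create a function to fetch starred repositories
Create a function that pulls the URLs of the repositories I have starred.

The experiment is only meaningful if you have starred a reasonable number of repositories.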

In [3]:
my_starred_repos = []

def get_starred_by_me():
    """Append the html_url of every repository I have starred to my_starred_repos."""
    resp_list = []
    # First page; GitHub paginates the list of starred repositories
    last_resp = requests.get('https://api.github.com/user/starred', auth=(myun, mypw))
    last_resp.raise_for_status()  # fail early on bad credentials or other HTTP errors
    resp_list.append(last_resp.json())

    # Follow the rel="next" link in the Link header until the last page
    while last_resp.links.get('next'):
        next_url_to_get = last_resp.links['next']['url']
        last_resp = requests.get(next_url_to_get, auth=(myun, mypw))
        last_resp.raise_for_status()
        resp_list.append(last_resp.json())

    # Each page is a list of repository objects
    for i in resp_list:
        for j in i:
            my_starred_repos.append(j['html_url'])


Call the function to fetch my starred repositories and print how many there are.

In [4]:
get_starred_by_me()
len(my_starred_repos)

109
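The pagination loop in get_starred_by_me can also be written as a small reusable generator. The following is only a sketch: iter_pages and FakeResponse are hypothetical names introduced here, and FakeResponse merely imitates the two attributes of a requests response that the loop relies on (.json() and .links) so that the example runs without network access.

```python
def iter_pages(get, first_url):
    """Yield the decoded JSON of every page, following rel="next" links.

    `get` is any callable that takes a URL and returns an object with
    `.json()` and `.links`, as a requests.Response does.
    """
    url = first_url
    while url:
        resp = get(url)
        yield resp.json()
        # requests exposes the parsed Link header as resp.links;
        # a missing 'next' entry means this was the last page
        url = resp.links.get('next', {}).get('url')


class FakeResponse:
    """Hypothetical stand-in for requests.Response, used only in this sketch."""
    def __init__(self, payload, next_url=None):
        self._payload = payload
        self.links = {'next': {'url': next_url}} if next_url else {}

    def json(self):
        return self._payload


# Two fake pages keyed by made-up URLs
fake_pages = {
    'page1': FakeResponse([{'html_url': 'https://github.com/a/x'}], next_url='page2'),
    'page2': FakeResponse([{'html_url': 'https://github.com/b/y'}]),
}
urls = [repo['html_url'] for page in iter_pages(fake_pages.get, 'page1') for repo in page]
print(urls)  # ['https://github.com/a/x', 'https://github.com/b/y']
```

Against the real API, the `get` argument could be something like `lambda u: requests.get(u, auth=(myun, mypw))`.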

Extract the owner's username from each starred repository URL and print how many there are.

In [5]:
my_starred_users = []
for ln in my_starred_repos:
    right_split = ln.split('.com/')[1]
    starred_usr = right_split.split('/')[0]
    my_starred_users.append(starred_usr)
 
len(my_starred_users)

109

Count the users after removing duplicates.

In [6]:
len(set(my_starred_users))

95
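The string splitting above works because every html_url has the form https://github.com/owner/repo. As an alternative sketch, the owner can be taken from the URL path with the standard library. repo_owner is a hypothetical helper name, and the sample URLs are a few of the repositories that appear later in this notebook:

```python
from urllib.parse import urlparse

def repo_owner(html_url):
    """Return the owner segment of a https://github.com/<owner>/<repo> URL."""
    path = urlparse(html_url).path           # e.g. '/requests/requests'
    return path.strip('/').split('/')[0]

sample_urls = [
    'https://github.com/requests/requests',
    'https://github.com/minrk/conda-bundle',
    'https://github.com/minrk/thebelab',
]
owners = [repo_owner(u) for u in sample_urls]
unique_owners = sorted(set(owners))          # duplicates collapse in the set
print(owners)         # ['requests', 'minrk', 'minrk']
print(unique_owners)  # ['minrk', 'requests']
```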

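## Create a function to retrieve a user's starred repositories
Build a function that retrieves every repository starred by each of those users.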

In [7]:
starred_repos = {k: [] for k in set(my_starred_users)}

def get_starred_by_user(user_name):
    """Append the html_url of every repository user_name has starred to starred_repos[user_name]."""
    starred_resp_list = []
    # First page of this user's starred repositories
    first_url_to_get = 'https://api.github.com/users/' + user_name + '/starred'
    last_resp = requests.get(first_url_to_get, auth=(myun, mypw))
    last_resp.raise_for_status()  # fail early on HTTP errors
    starred_resp_list.append(last_resp.json())

    # Follow the rel="next" link in the Link header until the last page
    while last_resp.links.get('next'):
        next_url_to_get = last_resp.links['next']['url']
        last_resp = requests.get(next_url_to_get, auth=(myun, mypw))
        last_resp.raise_for_status()
        starred_resp_list.append(last_resp.json())

    # Each page is a list of repository objects
    for i in starred_resp_list:
        for j in i:
            starred_repos[user_name].append(j['html_url'])


Call the function for every user.

In [8]:
for usr in list(set(my_starred_users)):
    print(usr)
    try:
        get_starred_by_user(usr)
    except Exception as err:
        # Keep going if one user's request fails, but report the reason
        print('failed for user', usr, err)

deepinsight
amusi
Roshanson
apcode
kpu
PAIR-code
PaddlePaddle
apache
HuJieRu
pavelgonchar
edublancas
exacity
davisking
django
torvalds
tensorflow
pandas-dev
MongoEngine
935048000
Kozea
plotly
vinta
keon
jupyterlab
fengdu78
jumpserver
nltk
zotroneneis
pytorch
ApolloAuto
openai
sigmavirus24
ansible
numpy
triaquae
hominlinx
pyecharts
vega
matplotlib
DiscoverML
sentsin
caicloud
kubeflow
tuan3w
chartjs
liuliu
d3
facebookresearch
explosion
toddlerya
python
yorkoliu
floodsung
alibaba
dkarunakaran
racaljk
gperftools
ipython
bendangnuksung
scikit-learn
channelcat
networkx
google
zhaozhiyong19890102
fxsjy
RaRe-Technologies
kpzhang93
mpld3
pallets
dmlc
faif
Microsoft
keras-team
zhedongzheng
scrapy
saltstack
JuliaLang
jupyter
caffe2
pypa
rushter
memcached
kubernetes
ageitgey
openstack
haiwen
scipy
requests
RasaHQ
junyanz
danielfrg
MashiMaroLjc
sloria
donnemartin
NicolasHug


Build a feature set from all the starred repositories and remove duplicates.

In [9]:
repo_vocab = [item for sl in list(starred_repos.values()) for item in sl]
repo_set = list(set(repo_vocab))

Build a binary vector for each user with one entry per repository: 1 if the user starred it, 0 if not.

In [10]:
all_usr_vector = []
for k, v in starred_repos.items():
    starred = set(v)  # set membership is much faster than scanning a list
    usr_vector = []
    for url in repo_set:
        if url in starred:
            usr_vector.append(1)
        else:
            usr_vector.append(0)
    all_usr_vector.append(usr_vector)

In [11]:
df = pd.DataFrame(all_usr_vector, columns=repo_set, index=starred_repos.keys())
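The steps above (collect the repository vocabulary, then build one 0/1 vector per user) can be seen more easily on toy data. This is only an illustrative sketch: the users alice, bob and carol, the repository names, and the helper binary_vectors are all made up and do not come from the notebook's data.

```python
def binary_vectors(starred, vocab):
    """Map each user to a 0/1 list aligned with vocab (1 = starred)."""
    vectors = {}
    for user, repos in starred.items():
        starred_set = set(repos)
        vectors[user] = [1 if repo in starred_set else 0 for repo in vocab]
    return vectors

# Made-up data: carol has starred nothing
toy_starred = {
    'alice': ['repo/a', 'repo/b'],
    'bob': ['repo/b', 'repo/c'],
    'carol': [],
}
# Deduplicated vocabulary, sorted here only to make the column order predictable
toy_repo_set = sorted({repo for repos in toy_starred.values() for repo in repos})
toy_vectors = binary_vectors(toy_starred, toy_repo_set)

print(toy_repo_set)          # ['repo/a', 'repo/b', 'repo/c']
print(toy_vectors['alice'])  # [1, 1, 0]
print(toy_vectors['bob'])    # [0, 1, 1]
print(toy_vectors['carol'])  # [0, 0, 0]
```

A user like carol, whose vector is all zeros, matters later: such a row has no variance, which affects the correlation step.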

In [13]:
df.head()

Unnamed: 0,https://github.com/thearn/webcam-pulse-detector,https://github.com/syang1993/gst-tacotron,https://github.com/requests/requests,https://github.com/greedo/python-xbrl,https://github.com/VinceShieh/spark-ffm,https://github.com/soundcloud/cosine-lsh-join-spark,https://github.com/mitsuhiko/flask-sqlalchemy,https://github.com/scrapinghub/splash,https://github.com/airbnb/aerosolve,https://github.com/RubyLouvre/avalon,...,https://github.com/zhihaozhang/TuringCalendar,https://github.com/ShangtongZhang/reinforcement-learning-an-introduction,https://github.com/cosenary/Simple-PHP-Cache,https://github.com/seaneking/postcss-responsive-type,https://github.com/wistful/SublimeAutoPEP8,https://github.com/fukuta0614/chainer-SeqGAN,https://github.com/dnouri/nolearn,https://github.com/dyve/django-bootstrap3,https://github.com/square/maximum-awesome,https://github.com/apache/arrow
deepinsight,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
amusi,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Roshanson,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
apcode,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
kpu,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


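If you have not starred any of your own repositories, you will not appear among the users above. Since you need to compare yourself with them, add a row for yourself to the DataFrame.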

In [14]:
my_repo_comp = []
for i in df.columns:
    if i in my_starred_repos:
        my_repo_comp.append(1)
    else:
        my_repo_comp.append(0)
mrc = pd.Series(my_repo_comp).to_frame(myun).T


In [15]:
mrc

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6494,6495,6496,6497,6498,6499,6500,6501,6502,6503
935048000,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Set the column names and concatenate the row onto the DataFrame.

In [17]:
mrc.columns = df.columns
fdf = pd.concat([df, mrc])
fdf.head()

Unnamed: 0,https://github.com/thearn/webcam-pulse-detector,https://github.com/syang1993/gst-tacotron,https://github.com/requests/requests,https://github.com/greedo/python-xbrl,https://github.com/VinceShieh/spark-ffm,https://github.com/soundcloud/cosine-lsh-join-spark,https://github.com/mitsuhiko/flask-sqlalchemy,https://github.com/scrapinghub/splash,https://github.com/airbnb/aerosolve,https://github.com/RubyLouvre/avalon,...,https://github.com/zhihaozhang/TuringCalendar,https://github.com/ShangtongZhang/reinforcement-learning-an-introduction,https://github.com/cosenary/Simple-PHP-Cache,https://github.com/seaneking/postcss-responsive-type,https://github.com/wistful/SublimeAutoPEP8,https://github.com/fukuta0614/chainer-SeqGAN,https://github.com/dnouri/nolearn,https://github.com/dyve/django-bootstrap3,https://github.com/square/maximum-awesome,https://github.com/apache/arrow
deepinsight,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
amusi,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Roshanson,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
apcode,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
kpu,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Compute similarity
Compute the similarity between me and every other user with the pearsonr function from scipy.stats.

In [18]:
from scipy.stats import pearsonr

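Compare the last vector in the DataFrame (mine) with every vector to obtain the centered cosine similarity, which is the Pearson correlation coefficient.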

In [20]:
sim_score = {}
for i in range(len(fdf)):
    ss = pearsonr(fdf.iloc[-1,:], fdf.iloc[i,:])
    sim_score.update({i: ss[0]})
sf = pd.Series(sim_score).to_frame('similarity')
sf.head()



Unnamed: 0,similarity
0,NaN
1,0.056237
2,0.056626
3,-0.00658
4,0.023776


Some values are NaN: those users have not starred anything, so their vector is constant and the correlation divides by zero.
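To make the link between Pearson correlation, centered cosine similarity and the NaN values concrete, here is a small pure-Python sketch. The function centered_cosine and the four toy vectors are introduced only for illustration. The function returns NaN explicitly when the denominator is zero, which is an assumption of this sketch that mirrors the NaN values in the output above.

```python
import math

def centered_cosine(x, y):
    """Pearson correlation computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cx = [a - mean_x for a in x]
    cy = [b - mean_y for b in y]
    num = sum(a * b for a, b in zip(cx, cy))
    den = math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy))
    # A user with no stars has a constant vector, so the denominator is 0
    return num / den if den else float('nan')

me = [1, 1, 0, 0]
same_taste = [1, 1, 0, 0]
opposite_taste = [0, 0, 1, 1]
no_stars = [0, 0, 0, 0]

print(centered_cosine(me, same_taste))      # 1.0
print(centered_cosine(me, opposite_taste))  # -1.0
print(centered_cosine(me, no_stars))        # nan
```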

Sort the values to get the indices of the most similar users.

In [21]:
sf.sort_values('similarity', ascending=False)

Unnamed: 0,similarity
95,1.000000
18,1.000000
90,0.146854
80,0.084576
14,0.066024
63,0.063764
2,0.056626
1,0.056237
93,0.054838
91,0.053211


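Both rows with similarity 1 are me: index 95 is the row appended above, and index 18 is my own account, which was already among the users because I starred one of my own repositories.

Look up the usernames for the indices.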

In [22]:
fdf.index[95]

'935048000'

In [24]:
fdf.index[90]

'danielfrg'
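Looking up indices one by one can be avoided by labelling the scores with usernames first. A minimal sketch, assuming pandas is available as in the rest of the notebook. toy_sf is a hypothetical Series holding only a handful of the scores shown above, with my own account appearing twice just as it does in fdf:

```python
import pandas as pd

# A few of the scores from the output above, labelled by username
names = ['deepinsight', 'amusi', '935048000', 'danielfrg', '935048000']
scores = [float('nan'), 0.056237, 1.0, 0.146854, 1.0]
toy_sf = pd.Series(scores, index=names, name='similarity')

# Drop every row for my own account, drop NaN scores, keep the best two
top = toy_sf.drop('935048000').dropna().nlargest(2)
print(top.index.tolist())  # ['danielfrg', 'amusi']
```

In the notebook itself, the same idea would mean building the Series with fdf.index as its index instead of the integer positions.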

List all repositories starred by one of these users.

In [25]:
fdf.iloc[90,:][fdf.iloc[90,:]==1]

https://github.com/requests/requests                             1
https://github.com/JuliaLang/julia                               1
https://github.com/numba/numba                                   1
https://github.com/minrk/conda-bundle                            1
https://github.com/toml-lang/toml                                1
https://github.com/helm/helm                                     1
https://github.com/scipy/scipy                                   1
https://github.com/jakubroztocil/httpie                          1
https://github.com/vuejs/vue                                     1
https://github.com/google/flatbuffers                            1
https://github.com/minrk/thebelab                                1
https://github.com/spf13/cobra                                   1
https://github.com/amueller/futurepast                           1
https://github.com/Kozea/Multicorn                               1
https://github.com/dask/dask                                  

Create a DataFrame containing the repositories starred by the three users most similar to me.

In [26]:

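# Rows of the three most similar users plus my own row (index 95), one column per user
top_rows = fdf.iloc[[90,80,14,95],:]
all_recs = top_rows[top_rows==1].fillna(0).T

# Keep only the repositories I have not starred, then drop my own column
str_recs_tmp = all_recs[all_recs[myun]==0].copy()
str_recs = str_recs_tmp.iloc[:,:-1].copy()
str_recs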

Unnamed: 0,danielfrg,rushter,torvalds
https://github.com/thearn/webcam-pulse-detector,0.0,0.0,0.0
https://github.com/syang1993/gst-tacotron,0.0,0.0,0.0
https://github.com/greedo/python-xbrl,0.0,0.0,0.0
https://github.com/VinceShieh/spark-ffm,0.0,0.0,0.0
https://github.com/soundcloud/cosine-lsh-join-spark,0.0,0.0,0.0
https://github.com/mitsuhiko/flask-sqlalchemy,0.0,0.0,0.0
https://github.com/scrapinghub/splash,0.0,0.0,0.0
https://github.com/airbnb/aerosolve,0.0,1.0,0.0
https://github.com/RubyLouvre/avalon,0.0,0.0,0.0
https://github.com/codetainerapp/codetainer,0.0,0.0,0.0


Check whether any repository is starred by two of them.

In [27]:
# Repositories starred by more than one of the three similar users
str_recs[str_recs.sum(axis=1)>1]

Unnamed: 0,danielfrg,rushter,torvalds
https://github.com/numba/numba,1.0,1.0,0.0
https://github.com/dask/dask,1.0,1.0,0.0
https://github.com/thampiman/reverse-geocoder,1.0,1.0,0.0
https://github.com/mwaskom/seaborn,1.0,1.0,0.0
https://github.com/statsmodels/statsmodels,1.0,1.0,0.0
https://github.com/ambv/black,1.0,1.0,0.0
https://github.com/marcotcr/lime,1.0,1.0,0.0
https://github.com/dmlc/xgboost,1.0,1.0,0.0
https://github.com/wesm/feather,1.0,1.0,0.0
https://github.com/JohnLangford/vowpal_wabbit,1.0,1.0,0.0
