# Scraping citation data from Google Scholar, AceMap, and Altmetric

## Paper identifiers required by the Altmetric API

|idtype|idnumber example|
|:---:|:---:|
|'id'|"108989"|
|'doi'|"10.1126/science.1173146"|
|'ads'|"2009sci...325..578w"|
|'arxiv'|"1212.4819"|
|'pmid'|"19644114"|
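
As a minimal sketch of how these identifier types are used, the Altmetric API v1 exposes one endpoint per type, of the form `https://api.altmetric.com/v1/<idtype>/<idnumber>`. The helper name `altmetric_endpoint` is hypothetical; it only builds the URL and does not send a request.

```python
# Hypothetical helper: build the Altmetric API v1 URL for one identifier.
API_BASE = "https://api.altmetric.com/v1"
ID_TYPES = ("id", "doi", "ads", "arxiv", "pmid")


def altmetric_endpoint(idtype, idnumber):
    # reject identifier types that are not listed in the table above
    if idtype not in ID_TYPES:
        raise ValueError("unsupported idtype: {!r}".format(idtype))
    return "{}/{}/{}".format(API_BASE, idtype, idnumber)


print(altmetric_endpoint("doi", "10.1126/science.1173146"))
```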

In [145]:
from altmetric import Altmetric
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

In [146]:
# select the Altmetric fields to keep: score, rank fractions (overall and within the journal), details URL
def selected_altmetric_items(rsp):
    all_rank = rsp['context']['all']['rank']/rsp['context']['all']['count']
    journal_rank = rsp['context']['journal']['rank']/rsp['context']['journal']['count']
    return [rsp['score'],
            round(all_rank,2),
            round(journal_rank,2),
            rsp['details_url']]
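
To make the rank fractions concrete, here is a small sketch that runs `selected_altmetric_items` on a hand-written response. The dictionary `fake_rsp` and all of its values are invented for illustration; a real Altmetric response contains many more keys. Assuming rank 1 is the highest-scoring item, a fraction of 0.17 would place the paper in roughly the top 17%.

```python
def selected_altmetric_items(rsp):
    # rank divided by count gives the position as a fraction of all tracked items
    all_rank = rsp['context']['all']['rank'] / rsp['context']['all']['count']
    journal_rank = rsp['context']['journal']['rank'] / rsp['context']['journal']['count']
    return [rsp['score'], round(all_rank, 2), round(journal_rank, 2), rsp['details_url']]


# Hypothetical response fragment; the numbers and URL are made up.
fake_rsp = {
    "score": 6.0,
    "details_url": "http://example.org/details",
    "context": {
        "all": {"rank": 170, "count": 1000},
        "journal": {"rank": 9, "count": 100},
    },
}

print(selected_altmetric_items(fake_rsp))
```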

In [147]:
# query Altmetric and return a list with 4 members (all None on failure)
from altmetric import AltmetricHTTPException

def get_altmetric(title, idtype, idnumber):
    a = Altmetric(apikey='', apiver='v1')
    try:
        # resolve the query method ('doi', 'arxiv', ...) by name
        rsp = getattr(a, idtype)(idnumber)
        if rsp is None:
            print("There is no '{}' in Altmetric".format(title))
            return [None]*4
        return selected_altmetric_items(rsp)
    except AltmetricHTTPException as e:
        if e.status_code == 403:
            print("You aren't authorized for this call")
        elif e.status_code == 420:
            print("You are being rate limited")
        elif e.status_code == 502:
            print("The API version you are using is currently down for maintenance.")
        elif e.status_code == 404:
            print("Invalid API function")
            print(e.msg)
        # keep the return shape stable so the caller can still build a row
        return [None]*4
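
Dispatching on `idtype` amounts to resolving a client method by name. The sketch below shows one way to do that with `getattr` plus a whitelist, so that no string built from input is ever evaluated. `FakeClient` and `lookup` are hypothetical stand-ins that let the example run without network access; they are not part of the `altmetric` package.

```python
# Stand-in for the real client so the sketch runs offline.
class FakeClient:
    def doi(self, value):
        return {"kind": "doi", "value": value}

    def arxiv(self, value):
        return {"kind": "arxiv", "value": value}


ALLOWED = {"id", "doi", "ads", "arxiv", "pmid"}


def lookup(client, idtype, idnumber):
    # refuse names outside the whitelist, then resolve the method by name
    if idtype not in ALLOWED:
        raise ValueError("unsupported idtype: {!r}".format(idtype))
    return getattr(client, idtype)(idnumber)


print(lookup(FakeClient(), "arxiv", "1212.4819"))
```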

In [148]:
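# fetch a page and return its text (empty string on failure)
def getHtml(url):
    try:
        req = requests.get(url)
        # raise an HTTPError for 4xx/5xx responses
        req.raise_for_status()
        req.encoding = req.apparent_encoding
        return req.text
    except requests.RequestException:
        # an empty page simply yields no matches when parsed
        return ""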
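
Paper titles contain spaces, colons, and other characters that should be escaped before they go into a query string. A minimal sketch using the standard library follows; the helper name `search_url` is hypothetical. Passing `params={"q": title}` to `requests.get` has the same effect.

```python
from urllib.parse import urlencode


def search_url(base, title):
    # urlencode escapes spaces, colons, quotes, etc. in the query value
    return base + "?" + urlencode({"q": title})


print(search_url("https://scholar.google.com/scholar", "Human activity analysis: a review"))
```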

In [149]:
# get the AceMap citation count for a title (None if the title is not found)
def get_acemap_citation(title):
    url = 'http://acemap.sjtu.edu.cn/result?q=' + title
    text = getHtml(url)
    soup = BeautifulSoup(text, "html.parser")
    # links to paper pages, one per search result
    a_list = soup.find_all('a', href=re.compile('/paper'))
    spans = soup.find_all('span', class_='ace_result_citations_year')
    for count, link in enumerate(a_list, start=1):
        # compare against the title as normalized by str.capitalize()
        if link.text.capitalize() == title:
            # the citation span for result number `count` sits at index count*2-1
            citation = spans[count*2-1].text
            # drop the 12-character label that precedes the number
            return citation[12:]
    print("There is no '{}' in AceMap".format(title))
    return None

In [150]:
# get the Google Scholar citation count for a title (None if not found)
def get_gs_citation(title):
    url = 'https://scholar.google.com/scholar?q=' + title
    text = getHtml(url)
    soup = BeautifulSoup(text, "html.parser")
    # each search result lives in a <div class="gs_ri">
    div_list = soup.find_all('div', class_="gs_ri")
    for i in div_list:
        # check the title
        if i.h3.text.capitalize() == title:
            # the "Cited by N" link points to /scholar?cites=...
            is_citation = i.find(href=re.compile(r"/scholar\?cites"))
            if is_citation is not None:
                # drop the leading "Cited by " (9 characters)
                citation = is_citation.text[9:]
            else:
                citation = None
                print("There is no citation of '{}' in Google Scholar".format(title))
            return citation
    print("There is no '{}' in Google Scholar".format(title))
    return None
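
Slicing a fixed number of characters off the link text works only while the label keeps its exact length. As a hedged alternative, the count can be pulled out with a regular expression; `parse_citation_count` is a hypothetical helper, not part of the notebook.

```python
import re


def parse_citation_count(link_text):
    # return the first run of digits as an int, or None if there is none
    match = re.search(r"\d+", link_text)
    return int(match.group()) if match else None


print(parse_citation_count("Cited by 281"))
```

Returning an integer also keeps the citation column numeric when the rows are later collected into a table.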

In [151]:
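# build one row: title, year, Google Scholar, AceMap, then the 4 Altmetric fields
def get_citations_list(title, year, idtype, idnumber):
    title = title.capitalize()
    # fall back to four empty fields if the Altmetric lookup returns nothing
    altmetric = get_altmetric(title, idtype, idnumber) or [None]*4
    acemap = get_acemap_citation(title)
    gs = get_gs_citation(title)
    return [title, year, gs, acemap] + altmetric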
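
The row returned by `get_citations_list` is positional: title, year, Google Scholar count, AceMap count, then the four Altmetric fields. A small sketch that pairs a row with column names makes that layout explicit. `row_to_record` is a hypothetical helper, and the sample row reuses the first line of the results table with the URL left out.

```python
COLUMNS = ["title", "publish_year", "gs_citation", "acemap_citation",
           "altmetric_score", "top_all", "top_journal", "altmetric_url"]


def row_to_record(row):
    # a row must carry exactly one value per column
    if len(row) != len(COLUMNS):
        raise ValueError("expected {} fields, got {}".format(len(COLUMNS), len(row)))
    return dict(zip(COLUMNS, row))


sample_row = ["Context based object categorization: a critical survey",
              2010, 281.0, 81.0, 6.0, 0.17, 0.09, None]
print(row_to_record(sample_row))
```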

In [152]:
# main function: collect one row per paper into a DataFrame
def get_citations_df(papers):
    columns = ['title', 'publish_year', 'gs_citation', 'acemap_citation',
               'altmetric_score', 'top_all', 'top_journal', 'altmetric_url']
    # each paper is [title, year, idtype, idnumber]
    rows = [get_citations_list(*paper) for paper in papers]
    # build the frame in one go; DataFrame.append was removed in pandas 2.0
    citations = pd.DataFrame(rows, columns=columns)
    citations.set_index('title', inplace=True)
    citations.dropna(axis=0, how='all', inplace=True)
    return citations

In [153]:
import json

In [166]:
# get the paper information: title, publish year, idtype and idnumber
# the JSON file is exported from Zotero in CSL JSON format
ARXIV_PREFIX = 'http://arxiv.org/abs/'
with open("MLS.json", 'r', encoding='utf-8') as load_f:
    load_dict = json.load(load_f)
papers = []
for i in load_dict:
    title = i['title']
    year = i['issued']['date-parts'][0][0]
    if i.get('DOI') is not None:
        idtype = 'doi'
        idnumber = i['DOI']
    elif i.get('URL', '').startswith(ARXIV_PREFIX):
        idtype = 'arxiv'
        # slice off the prefix; str.strip would remove a set of characters instead
        idnumber = i['URL'][len(ARXIV_PREFIX):]
    else:
        # no usable identifier: skip instead of reusing the previous paper's
        print("No DOI or arXiv URL for '{}'".format(title))
        continue
    papers.append([title, year, idtype, idnumber])
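
The identifier extraction can be tried without the Zotero export. In the sketch below, the two CSL JSON records in `SAMPLE` are invented for illustration and `identifier_of` is a hypothetical helper. The last line shows why removing the arXiv prefix by slicing is safer than `str.strip`, which treats its argument as a set of characters and can eat part of the identifier.

```python
import json

ARXIV_PREFIX = "http://arxiv.org/abs/"

# Hypothetical CSL JSON records, made up for this example.
SAMPLE = '''[
  {"title": "Paper A", "issued": {"date-parts": [[2012, 5]]}, "DOI": "10.1000/example"},
  {"title": "Paper B", "issued": {"date-parts": [[1999]]},
   "URL": "http://arxiv.org/abs/hep-th/9901001"}
]'''


def identifier_of(item):
    # prefer the DOI, fall back to an arXiv URL, else report nothing usable
    if item.get("DOI"):
        return "doi", item["DOI"]
    url = item.get("URL", "")
    if url.startswith(ARXIV_PREFIX):
        return "arxiv", url[len(ARXIV_PREFIX):]
    return None, None


for item in json.loads(SAMPLE):
    year = item["issued"]["date-parts"][0][0]
    print(item["title"], year, identifier_of(item))

# strip() removes any leading/trailing characters found in the prefix, including the "h"
print("http://arxiv.org/abs/hep-th/9901001".strip(ARXIV_PREFIX))
```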

In [167]:
df = get_citations_df(papers)
df

There is no 'A review of machine learning for automated planning' in AceMap
There is no 'Regret analysis of stochastic and nonstochastic multi-armed bandit problems' in AceMap
There is no 'Approximate policy iteration: a survey and some new methods' in AceMap
There is no 'Approximate policy iteration: a survey and some new methods' in AceMap
There is no 'Representation learning: a review and new perspectives' in AceMap
There is no 'Two faces of active learning' in AceMap
There is no 'Discrete wavelet transform-based time series analysis and mining' in AceMap
There is no 'A comparative study of palmprint recognition algorithms' in AceMap
There is no 'A survey of emerging approaches to spam filtering' in AceMap
There is no 'Time-series data mining' in AceMap
There is no 'Human activity analysis: a review' in AceMap
There is no 'Xml data clustering: an overview' in AceMap
There is no 'Ontology learning from text: a look back and into the future' in AceMap
There is no 'Subspace methods for

|title|publish_year|gs_citation|acemap_citation|altmetric_score|top_all|top_journal|altmetric_url|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|Context based object categorization: a critical survey|2010|281.0|81.0|6.0|0.17|0.09|http://www.altmetric.com/details.php?citation_...|
|A review of machine learning for automated planning|2012|42.0|4.0|||||
|Regret analysis of stochastic and nonstochastic multi-armed bandit problems|2012|909.0||3.0|0.27|0.81|http://www.altmetric.com/details.php?citation_...|
|Kernels for vector-valued functions: a review|2011|204.0|21.0|4.5|0.22|0.11|http://www.altmetric.com/details.php?citation_...|
|An introduction to conditional random fields|2010|704.0|61.0|15.7|0.07|0.02|http://www.altmetric.com/details.php?citation_...|
|Randomized algorithms for matrices and data|2011|440.0|37.0|41.5|0.03|0.01|http://www.altmetric.com/details.php?citation_...|
|A few useful things to know about machine learning|2012|1045.0|110.0|24.31|0.04|0.02|http://www.altmetric.com/details.php?citation_...|
|Translation techniques in cross-language information retrieval|2012|45.0|77.0|9.064|0.12|0.09|http://www.altmetric.com/details.php?citation_...|
|Approximate policy iteration: a survey and some new methods|2011|129.0||||||
|Representation learning: a review and new perspectives|2012|3185.0||99.316|0.01|0.0|http://www.altmetric.com/details.php?citation_...|


References:
* [GitHub - lnielsen/python-altmetric: Altmetric API v1 wrapper for Python](https://github.com/lnielsen/python-altmetric)
* [Altmetric API Support – Altmetric](https://www.altmetric.com/support/almetric-api/)
* [Getting Started | Altmetric API documentation](http://api.altmetric.com/)