# Getting Started
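Science of Science (SciSci) analyses are built on bibliographic databases of scholarly publications.

Such databases are either commercial (e.g., Scopus) or openly accessible. The data sets below are commonly used and have good Python library support. (Table quoted from the [pyscisci](https://github.com/SciSciCollective/pyscisci) GitHub repository.)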

| Data Set      | Example |
| ----------- | ----------- |
| [Microsoft Academic Graph](https://docs.microsoft.com/en-us/academic-services/graph/) (MAG)      | [Getting Started with MAG](/examples/Getting_Started/Getting%20Started%20with%20MAG.ipynb)       |
| [Clarivate Web of Science](https://clarivate.com/webofsciencegroup/solutions/web-of-science/) (WoS)   | [Getting Started with WOS](/examples/Getting_Started/Getting%20Started%20with%20WOS.ipynb)        |
| [DBLP](https://dblp.uni-trier.de) | [Getting Started with DBLP](/examples/Getting_Started/Getting%20Started%20with%20DBLP.ipynb) |
| [American Physical Society](https://journals.aps.org/datasets) (APS) | [Getting Started with APS](/examples/Getting_Started/Getting%20Started%20with%20APS.ipynb) |
| [PubMed](https://www.nlm.nih.gov/databases/download/pubmed_medline.html) | [Getting Started with PubMed](/examples/Getting_Started/Getting%20Started%20with%20PubMed.ipynb) |
| [OpenAlex](https://openalex.org/) | [Getting Started with OpenAlex](/examples/Getting_Started/Getting%20Started%20with%20OpenAlex.ipynb) |

This tutorial uses [OpenAlex](https://docs.openalex.org/), which is approachable for first-time users and has a browsable [website](https://openalex.org/). We access it through the [pyalex](https://github.com/J535D165/pyalex?tab=readme-ov-file#pyalex) library.

Run the following Python notebook on a service such as Google Colaboratory.

This chapter takes a quick look at what data are available and in what form.

In [1]:
%pip install pyalex

# For details, see https://github.com/J535D165/pyalex?tab=readme-ov-file#pyalex

Defaulting to user installation because normal site-packages is not writeable
Collecting pyalex
  Downloading pyalex-0.13-py3-none-any.whl.metadata (12 kB)
Downloading pyalex-0.13-py3-none-any.whl (10 kB)
Installing collected packages: pyalex
Successfully installed pyalex-0.13

[notice] A new release of pip is available: 23.3.1 -> 24.0
[notice] To update, run: /Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [98]:
import pandas as pd
from pyalex import Works, Authors, Sources, Institutions, Concepts, Funders
import pyalex

# Set your own email address to use OpenAlex's "polite pool"
pyalex.config.email = "your@email.address"


## Works data

In [84]:
# Total number of works indexed in OpenAlex
f"{Works().count():,}"

'248,665,581'

In [125]:
data = Works()["W2741809807"] # Passing an ID returns a single work record

# Metadata
print("---------")
print(data['id']) # OpenAlex internal ID
print(data['doi']) # DOI of the work
print(data['type']) # Work type: article, book, editorial, letter, etc.
print(data['publication_year']) # Publication year; publication_date gives the full date
print(data['language']) # Language the work is written in
print(data['is_retracted']) # Whether the work has been retracted
print(data['apc_list']) # APC (article processing charge for open access publication)

# Author data
print("---------")
print(data['authorships'][0]) # Authorship record: OpenAlex ID, author position, institutions, name, etc.
print(data['institutions_distinct_count']) # Number of distinct institutions among the co-authors; countries_distinct_count also exists

# Content and themes
print("---------")
print(data['topics'][0]) # Topic assigned by OpenAlex, classified under a domain > field > subfield hierarchy
print(data['keywords'][0:3]) # Keywords of the work
print(data['concepts'][0]) # Concept assigned by OpenAlex, based on the Wikidata classification; topics are now recommended instead

# Bibliographic information
print("---------")
print(data['title']) # Title of the work
print(data['primary_location']) # Primary location (the source hosting the main copy of the work)
print(data['cited_by_count']) # Citation count as of OpenAlex's latest update
print(data['referenced_works'][0:3]) # Works cited by this work (references)

# List of available fields
data.keys()

# For details, see https://docs.openalex.org/api-entities/works/work-object

---------
https://openalex.org/W2741809807
https://doi.org/10.7717/peerj.4375
article
2018
en
False
{'value': 1395, 'currency': 'USD', 'value_usd': 1395, 'provenance': 'doaj'}
---------
{'author_position': 'first', 'author': {'id': 'https://openalex.org/A5048491430', 'display_name': 'Heather Piwowar', 'orcid': None}, 'institutions': [{'id': 'https://openalex.org/I4210166736', 'display_name': 'Impact Technology Development (United States)', 'ror': 'https://ror.org/05ppvf150', 'country_code': 'US', 'type': 'company', 'lineage': ['https://openalex.org/I4210166736']}], 'countries': ['US'], 'is_corresponding': False, 'raw_author_name': 'Heather Piwowar', 'raw_affiliation_string': 'Impactstory, Sanford, NC, USA', 'raw_affiliation_strings': ['Impactstory, Sanford, NC, USA']}
7
---------
{'id': 'https://openalex.org/T10102', 'display_name': 'Bibliometric Analysis and Research Evaluation', 'score': 0.9969, 'subfield': {'id': 1804, 'display_name': 'Statistics, Probability and Uncertainty'}, 'fie

dict_keys(['id', 'doi', 'title', 'display_name', 'publication_year', 'publication_date', 'ids', 'language', 'primary_location', 'type', 'type_crossref', 'indexed_in', 'open_access', 'authorships', 'countries_distinct_count', 'institutions_distinct_count', 'corresponding_author_ids', 'corresponding_institution_ids', 'apc_list', 'apc_paid', 'has_fulltext', 'cited_by_count', 'cited_by_percentile_year', 'biblio', 'is_retracted', 'is_paratext', 'primary_topic', 'topics', 'keywords', 'concepts', 'mesh', 'locations_count', 'locations', 'best_oa_location', 'sustainable_development_goals', 'grants', 'referenced_works_count', 'referenced_works', 'related_works', 'ngrams_url', 'abstract_inverted_index', 'cited_by_api_url', 'counts_by_year', 'updated_date', 'created_date'])
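
The nested `authorships` field is usually easier to work with once it is flattened to one row per author. The following is a minimal sketch, assuming records shaped like the output above; `summarize_authorships` and `sample_work` are hypothetical names introduced for illustration, and the second author in the sample is made up.

```python
def summarize_authorships(work):
    """Return one dict per author: position, name, institution names, countries."""
    rows = []
    for authorship in work.get("authorships", []):
        author = authorship.get("author") or {}
        institutions = authorship.get("institutions") or []
        rows.append({
            "position": authorship.get("author_position"),
            "name": author.get("display_name"),
            "institutions": [inst.get("display_name") for inst in institutions],
            "countries": authorship.get("countries") or [],
        })
    return rows


# Hypothetical, trimmed-down record in the shape of the output above
sample_work = {
    "authorships": [
        {
            "author_position": "first",
            "author": {"id": "https://openalex.org/A5048491430",
                       "display_name": "Heather Piwowar", "orcid": None},
            "institutions": [{"display_name": "Impact Technology Development (United States)",
                              "country_code": "US"}],
            "countries": ["US"],
        },
        # Made-up author without any resolved institution
        {"author_position": "last",
         "author": {"display_name": "Example Author"},
         "institutions": [], "countries": []},
    ]
}

rows = summarize_authorships(sample_work)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)` if a table is preferred.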

In [106]:
# Works can be searched with filters. By default at most 25 results are returned, however many match
result = Works().filter(publication_year=2020, is_oa=True).get() # Get open access works published in 2020
print(len(result))
list(map(lambda x: x["title"],result))

25


['Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China',
 'Clinical Characteristics of Coronavirus Disease 2019 in China',
 'A Novel Coronavirus from Patients with Pneumonia in China, 2019',
 'Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study',
 'Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus–Infected Pneumonia in Wuhan, China',
 'A pneumonia outbreak associated with a new coronavirus of probable bat origin',
 'SciPy 1.0: fundamental algorithms for scientific computing in Python',
 'SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor',
 'Cancer statistics, 2020',
 'Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study',
 'Characteristics of and Important Lessons From the Coronavirus Disease 2019 (COVID-19) Outbreak in
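
Because `.get()` stops at 25 records by default, collecting more requires paging through the results. The sketch below is one possible approach, not the only one. It assumes pyalex's `paginate(per_page=200)` method as described in the pyalex README; `collect_titles`, `fetch_oa_titles_2020`, and `fake_pages` are hypothetical names. The network call is wrapped in a function that is not executed here, and the offline demonstration uses made-up pages.

```python
from itertools import islice


def collect_titles(pages, limit=100):
    """Flatten an iterable of result pages and return up to `limit` titles."""
    records = (record for page in pages for record in page)
    # islice stops consuming pages as soon as `limit` records are read
    return [record["title"] for record in islice(records, limit)]


def fetch_oa_titles_2020(limit=100):
    """Needs network access and pyalex; defined for illustration, not called here."""
    from pyalex import Works  # imported lazily so the sketch also runs offline
    pager = Works().filter(publication_year=2020, is_oa=True).paginate(per_page=200)
    return collect_titles(pager, limit)


# Offline demonstration with made-up pages
fake_pages = [[{"title": "A"}, {"title": "B"}], [{"title": "C"}]]
print(collect_titles(fake_pages, limit=2))
```

Keeping the page-flattening logic separate from the API call makes it easy to test without hitting OpenAlex.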

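### Usage examples
- Search titles and abstracts
- List the references of a work
- List the works that cite a work
- Get the works published from an institution, sorted by citation count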

In [121]:
# Search by a string contained in the title or abstract
result = Works().search("fierce creatures").get()
# Works containing "fierce" and "creature" or "creatures" are matched
# The search is case-insensitive
# When no abstract could be obtained, a placeholder such as "A summary is not available for this content so..." may be stored,
# so searches for words like "summary", "link", or "provide" can hit such records

print(result[4]["abstract"])
pd.DataFrame(map(lambda x: [x["title"],x["abstract"]],result),
             columns=["title","abstract"])

A summary is not available for this content so a preview has been provided. Please use the Get access link above for information on how to access this content.


| | title | abstract |
| --- | --- | --- |
| 0 | Lame Ducks or Fierce Creatures? - The Role of ... | In the pathogenesis of multiple sclerosis (MS)... |
| 1 | Fierce creatures | Analysis1 October 2003free access Fierce creat... |
| 2 | Fierce creatures : don't pet them | |
| 3 | Fierce creatures [Newspaper columnists and the... | |
| 4 | “Small, fierce creatures”: Mac Wellman's Aurat... | A summary is not available for this content so... |
| 5 | The primitive Anglo-Saxon calendar | When the Anglo-Saxons invaded England, about t... |
| 6 | Mau Mau: An African Crucible. | Preface. Acknowledgments. 1. There Will Be a G... |
| 7 | Middle Knowledge and the Problem of Evil | TF President Kennedy had not been shot, would ... |
| 8 | How many legs | First Words Non-fiction is a collection of eng... |
| 9 | Taming the Beast: Images of Trained Bears in T... | Amongst the surviving representations of bears... |
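
OpenAlex records carry an `abstract_inverted_index` field (see the key list earlier) that maps each word to its positions, and pyalex rebuilds plain text from it when `["abstract"]` is accessed. The sketch below shows how such a reconstruction can work and how placeholder abstracts could be flagged. It is an illustration under these assumptions, not pyalex's actual implementation; the function names, the prefix constant, and the sample index are hypothetical.

```python
PLACEHOLDER_PREFIX = "A summary is not available for this content"


def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:
        return None
    word_at = {}
    for word, positions in inverted_index.items():
        for position in positions:
            word_at[position] = word
    # Join the words in positional order
    return " ".join(word_at[position] for position in sorted(word_at))


def is_placeholder_abstract(abstract):
    """True when the abstract is the 'summary is not available' placeholder."""
    return bool(abstract) and abstract.startswith(PLACEHOLDER_PREFIX)


# Made-up inverted index for illustration
sample_index = {"Fierce": [0], "creatures": [1, 4], "are": [2], "still": [3]}
text = abstract_from_inverted_index(sample_index)
print(text)
print(is_placeholder_abstract(text))
```

Filtering search results with a check like `is_placeholder_abstract` before text analysis avoids spurious hits on words such as "summary" or "link".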


In [122]:
# List the references of a work
w = Works()["W2741809807"]
# Passing a list of IDs returns the corresponding work records
references = Works()[w["referenced_works"]]
list(map(lambda x: x["id"],references))

['https://openalex.org/W1767272795',
 'https://openalex.org/W2016860460',
 'https://openalex.org/W2463568293',
 'https://openalex.org/W2048185449',
 'https://openalex.org/W2140880926',
 'https://openalex.org/W2089123513',
 'https://openalex.org/W2322381034',
 'https://openalex.org/W2115339903',
 'https://openalex.org/W2160597895',
 'https://openalex.org/W1560783210',
 'https://openalex.org/W2343014812',
 'https://openalex.org/W2003844967',
 'https://openalex.org/W2997143876',
 'https://openalex.org/W3121567788',
 'https://openalex.org/W2753353163',
 'https://openalex.org/W2588027260',
 'https://openalex.org/W2785823074',
 'https://openalex.org/W2587705861',
 'https://openalex.org/W4254015553',
 'https://openalex.org/W2029057325',
 'https://openalex.org/W2520991028',
 'https://openalex.org/W2231201268',
 'https://openalex.org/W2511661767',
 'https://openalex.org/W2306268324',
 'https://openalex.org/W2953072907']
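
For works with long reference lists, it can be safer to look up IDs in batches. The sketch below assumes that a single list lookup accepts only a limited number of IDs (100 is used here as an assumed limit, so check the pyalex and OpenAlex documentation). `short_id`, `chunked`, and `fetch_works_by_ids` are hypothetical helpers, and the function that calls the API is defined but not executed.

```python
def short_id(openalex_url):
    """Turn 'https://openalex.org/W123' into 'W123'."""
    return openalex_url.rsplit("/", 1)[-1]


def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_works_by_ids(ids, size=100):
    """Needs network access and pyalex; defined for illustration, not called here."""
    from pyalex import Works  # imported lazily so the sketch also runs offline
    works = []
    for batch in chunked(ids, size):
        works.extend(Works()[batch])  # list lookup, as in the cell above
    return works


# Offline demonstration using the first three IDs from the output above
ids = [
    "https://openalex.org/W1767272795",
    "https://openalex.org/W2016860460",
    "https://openalex.org/W2463568293",
]
print([short_id(i) for i in ids])
print(list(chunked(ids, size=2)))
```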

In [128]:
# Works that cite the given work (first 25 results by default)
w = Works().filter(cites="W2741809807").get()
w

[{'id': 'https://openalex.org/W3137875885',
  'doi': 'https://doi.org/10.3390/publications9010012',
  'title': 'Web of Science (WoS) and Scopus: The Titans of Bibliographic Information in Today’s Academic World',
  'display_name': 'Web of Science (WoS) and Scopus: The Titans of Bibliographic Information in Today’s Academic World',
  'publication_year': 2021,
  'publication_date': '2021-03-12',
  'ids': {'openalex': 'https://openalex.org/W3137875885',
   'doi': 'https://doi.org/10.3390/publications9010012',
   'mag': '3137875885'},
  'language': 'en',
  'primary_location': {'is_oa': True,
   'landing_page_url': 'https://doi.org/10.3390/publications9010012',
   'pdf_url': 'https://www.mdpi.com/2304-6775/9/1/12/pdf?version=1615975873',
   'source': {'id': 'https://openalex.org/S2738007992',
    'display_name': 'Publications',
    'issn_l': '2304-6775',
    'issn': ['2304-6775'],
    'is_oa': True,
    'is_in_doaj': True,
    'host_organization': 'https://openalex.org/P4310310987',
    'ho
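
A common next step is to summarize the citing works, for example by publication year. The following is a minimal sketch that operates on a list of work records like the one returned above; `citing_works_per_year` is a hypothetical helper, and apart from the first entry (taken from the output above) the sample records are made up.

```python
from collections import Counter


def citing_works_per_year(works):
    """Count works per publication year, skipping records without a year."""
    counts = Counter(
        work["publication_year"]
        for work in works
        if work.get("publication_year") is not None
    )
    return dict(sorted(counts.items()))


sample_citing_works = [
    {"id": "https://openalex.org/W3137875885", "publication_year": 2021},
    {"id": "hypothetical-1", "publication_year": 2021},  # made-up record
    {"id": "hypothetical-2", "publication_year": 2019},  # made-up record
    {"id": "hypothetical-3", "publication_year": None},  # made-up record without a year
]

per_year = citing_works_per_year(sample_citing_works)
print(per_year)
```

Note that a single `.get()` call returns only one page, so counts computed this way cover just the retrieved records unless all pages are collected first.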

In [118]:
# Get the works published from an institution (identified by its ROR ID), sorted by citation count in descending order
result = Works() \
  .filter(authorships={"institutions": {"ror": "02956yf07"}}) \
  .sort(cited_by_count="desc") \
  .get()
list(map(lambda x: x["title"], result))

['A novel potent vasoconstrictor peptide produced by vascular endothelial cells',
 'Electronic properties of two-dimensional systems',
 'Statistical inference in vector autoregressions with possibly integrated processes',
 'Magnetic control of ferroelectric polarization',
 'Edge state in graphene ribbons: Nanometer size effect and edge shape dependence',
 'An Nrf2/Small Maf Heterodimer Mediates the Induction of Phase II Detoxifying Enzyme Genes through Antioxidant Response Elements',
 'Active sites of nitrogen-doped carbon materials for oxygen reduction reaction clarified using model catalysts',
 'The complete genome sequence of the Gram-positive bacterium Bacillus subtilis',
 'The 2015 World Health Organization Classification of Lung Tumors',
 'Asian Working Group for Sarcopenia: 2019 Consensus Update on Sarcopenia Diagnosis and Treatment',
 'Peculiar Localized State at Zigzag Graphite Edge',
 'Cardiorespiratory Fitness as a Quantitative Predictor of All-Cause Mortality and Cardiovasc

## Author data

In [83]:
# Total number of authors indexed in OpenAlex
f"{Authors().count():,}"

'90,207,193'

In [52]:
data = Authors()["A5025323154"] # Authors()["https://orcid.org/0000-0002-4028-3522"] returns the same record

# Name
print("---------")
print(data["display_name"]) # Name; the spelling variants merged by OpenAlex are available in display_name_alternatives

# Metadata and summary statistics
print("---------")
print(data["affiliations"][0]) # Full affiliation history of the author (OpenAlex Institution objects)
print(data["last_known_institutions"][0]["display_name"]) # Most recent institution known to OpenAlex (OpenAlex Institution object)
print(data["summary_stats"]) # 2-year mean citedness, h-index, i10-index, etc.

# Publication data
print("---------")
print(data["counts_by_year"]) # Works count and citation count per year
print(data["cited_by_count"]) # Total number of citations received by the author
print(data["works_count"]) # Total number of works (articles, letters, books, etc.) published by the author

# For details, see https://docs.openalex.org/api-entities/authors/author-object

---------
Albert László Barabási
---------
{'institution': {'id': 'https://openalex.org/I12912129', 'ror': 'https://ror.org/04t5xt781', 'display_name': 'Northeastern University', 'country_code': 'US', 'type': 'education', 'lineage': ['https://openalex.org/I12912129']}, 'years': [2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014]}
Harvard University
{'2yr_mean_citedness': 2.375, 'h_index': 78, 'i10_index': 112}
---------
[{'year': 2024, 'works_count': 1, 'cited_by_count': 798}, {'year': 2023, 'works_count': 11, 'cited_by_count': 7616}, {'year': 2022, 'works_count': 6, 'cited_by_count': 8431}, {'year': 2021, 'works_count': 2, 'cited_by_count': 9492}, {'year': 2020, 'works_count': 9, 'cited_by_count': 9158}, {'year': 2019, 'works_count': 9, 'cited_by_count': 8757}, {'year': 2018, 'works_count': 3, 'cited_by_count': 8278}, {'year': 2017, 'works_count': 3, 'cited_by_count': 7752}, {'year': 2016, 'works_count': 3, 'cited_by_count': 7589}, {'year': 2015, 'works_count': 6, 'cited_by_c
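
The `h_index` and `i10_index` values in `summary_stats` follow the standard definitions: the h-index is the largest h such that h works have at least h citations each, and the i10-index is the number of works with at least 10 citations. The sketch below computes both from a list of per-work citation counts. The sample counts are made up, and the functions illustrate the definitions rather than OpenAlex's own code.

```python
def h_index(citation_counts):
    """Largest h such that h works have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citation_counts):
    """Number of works with at least 10 citations."""
    return sum(1 for cites in citation_counts if cites >= 10)


# Made-up citation counts for six works
sample_counts = [25, 12, 10, 4, 3, 0]
print(h_index(sample_counts))    # 4: four works have at least 4 citations each
print(i10_index(sample_counts))  # 3: three works have at least 10 citations
```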

In [40]:
# Search for authors whose display_name matches "einstein"
Authors().search_filter(display_name="einstein").get()


[{'id': 'https://openalex.org/A5054034686',
  'orcid': None,
  'display_name': 'Albert Einstein',
  'display_name_alternatives': ['A.B Einstein',
   'Albert. Einstein',
   'Albert B. Einstein',
   'A. Einstein',
   'A. B. Einstein'],
  'relevance_score': 11494.396,
  'works_count': 1019,
  'cited_by_count': 48333,
  'summary_stats': {'2yr_mean_citedness': 0.0,
   'h_index': 78,
   'i10_index': 210},
  'ids': {'openalex': 'https://openalex.org/A5054034686'},
  'affiliations': [{'institution': {'id': 'https://openalex.org/I201646274',
     'ror': 'https://ror.org/00cv1e222',
     'display_name': "Saint Peter's University",
     'country_code': 'US',
     'type': 'education',
     'lineage': ['https://openalex.org/I201646274']},
    'years': [2024]},
   {'institution': {'id': 'https://openalex.org/I2801107848',
     'ror': 'https://ror.org/01j17xg39',
     'display_name': 'New York Hospital Queens',
     'country_code': 'US',
     'type': 'healthcare',
     'lineage': ['https://openalex.o

## Other data
Data are also available for sources (journals, repositories, etc.), institutions, concepts, and funders.

In [97]:
# Number of records indexed in OpenAlex
print(f"{Sources().count():,}") # Sources
print(f"{Institutions().count():,}") # Institutions
print(f"{Funders().count():,}") # Funders
print(f"{Concepts().count():,}") # Concepts


251,289
106,631
32,437
65,073


In [96]:
# Sources
source = Sources().filter(works_count=">1000000").get() # Get sources (journals, repositories, etc.) hosting more than 1,000,000 works

pd.DataFrame(map(lambda x: [x["id"],x["display_name"], x["host_organization_name"], x["is_oa"],x["works_count"]],source),
            columns=["id","source","host_organization","is_oa", "works_count"])

| | id | source | host_organization | is_oa | works_count |
| --- | --- | --- | --- | --- | --- |
| 0 | https://openalex.org/S4306525036 | PubMed | National Institutes of Health | False | 33075864 |
| 1 | https://openalex.org/S2764455111 | PubMed Central | National Institutes of Health | True | 8009760 |
| 2 | https://openalex.org/S4306400806 | Europe PMC (PubMed Central) | European Bioinformatics Institute | True | 5316266 |
| 3 | https://openalex.org/S4306400194 | arXiv (Cornell University) | Cornell University | True | 3015170 |
| 4 | https://openalex.org/S4306401280 | DOAJ (DOAJ: Directory of Open Access Journals) | | True | 2672478 |
| 5 | https://openalex.org/S4306402512 | HAL (Le Centre pour la Communication Scientifi... | French National Centre for Scientific Research | True | 2571027 |
| 6 | https://openalex.org/S4306463937 | Springer eBooks | | False | 2519831 |
| 7 | https://openalex.org/S4306400562 | Zenodo (CERN European Organization for Nuclear... | European Organization for Nuclear Research | True | 1405433 |
| 8 | https://openalex.org/S4306401271 | RePEc: Research Papers in Economics | Federal Reserve Bank of St. Louis | True | 1126422 |
| 9 | https://openalex.org/S4210172589 | Social Science Research Network | RELX Group (Netherlands) | False | 1079692 |


In [87]:
# Institutions
institution = Institutions().filter(country_code="JP").get() # Get institutions whose country code is JP (Japan)

pd.DataFrame(map(lambda x: [x["id"],x["display_name"], x["summary_stats"]["2yr_mean_citedness"]],institution),
            columns=["id","institution","2yr_mean_citedness"])

| | id | institution | 2yr_mean_citedness |
| --- | --- | --- | --- |
| 0 | https://openalex.org/I74801974 | The University of Tokyo | 3.622451 |
| 1 | https://openalex.org/I22299242 | Kyoto University | 3.246753 |
| 2 | https://openalex.org/I98285908 | Osaka University | 3.026423 |
| 3 | https://openalex.org/I201537933 | Tohoku University | 3.123361 |
| 4 | https://openalex.org/I135598925 | Kyushu University | 3.173971 |
| 5 | https://openalex.org/I60134161 | Nagoya University | 2.997435 |
| 6 | https://openalex.org/I205349734 | Hokkaido University | 3.011574 |
| 7 | https://openalex.org/I114531698 | Tokyo Institute of Technology | 2.738239 |
| 8 | https://openalex.org/I146399215 | University of Tsukuba | 2.645707 |
| 9 | https://openalex.org/I203951103 | Keio University | 2.989865 |


In [85]:
# Funders
funder_health = Funders().search_filter(display_name="health").filter(country_code="JP").get() # Get Japanese funders with "health" in their name

print(list(map(lambda x: x["display_name"], funder_health)))

['Ministry of Health, Labour and Welfare', 'Japan Health Sciences Foundation', 'Fujita Health University', 'National Center for Global Health and Medicine', 'Daiwa Securities Health Foundation', 'National Center for Child Health and Development', 'Japan Foundation for Aging and Health', 'Global Health Innovative Technology Fund', 'Pfizer Health Research Foundation', 'University of Occupational and Environmental Health', 'Japan Health Foundation', 'Meiji Yasuda Life Foundation of Health and Welfare', 'National Institute of Occupational Safety and Health, Japan', 'Mother and Child Health Foundation', 'Fukuoka Foundation for Sound Health', 'MOA Health Science Foundation', 'Sasakawa Memorial Health Foundation', 'Niigata University of Health and Welfare', 'Juntendo Institute of Mental Health', 'Sasakawa Memorial Health Foundation', 'Yuumi Memorial Foundation for Home Health Care', 'Morinaga Foundation For Health and Nutrition', 'Mitsukoshi Health and Welfare Foundation', 'Hyogo Prefecture H

In [89]:
# Concepts
results, meta = Concepts().get(return_meta=True) # Get the first page of concepts together with the response metadata

print(meta) # Response metadata
print(results) # Concepts, first 25 records

{'count': 65073, 'db_response_time_ms': 22, 'page': 1, 'per_page': 25, 'groups_count': None}
[{'id': 'https://openalex.org/C41008148', 'wikidata': 'https://www.wikidata.org/wiki/Q21198', 'display_name': 'Computer science', 'level': 0, 'description': 'study of computation', 'works_count': 86636941, 'cited_by_count': 489616335, 'summary_stats': {'2yr_mean_citedness': 1.0980407251408955, 'h_index': 3383, 'i10_index': 7873743}, 'ids': {'openalex': 'https://openalex.org/C41008148', 'wikidata': 'https://www.wikidata.org/wiki/Q21198', 'mag': '41008148', 'wikipedia': 'https://en.wikipedia.org/wiki/Computer%20science', 'umls_cui': ['C0599726']}, 'image_url': 'https://upload.wikimedia.org/wikipedia/commons/6/6a/Sorting_quicksort_anim.gif', 'image_thumbnail_url': 'https://upload.wikimedia.org/wikipedia/commons/thumb/6/6a/Sorting_quicksort_anim.gif/100px-Sorting_quicksort_anim.gif', 'international': {'display_name': {'af': 'informatika', 'am': 'የኮምፒውተር፡ጥናት', 'an': 'Informatica', 'ar': 'علم الحاسوب
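
The `count` and `per_page` values in the metadata indicate how many requests a full download would take. The short sketch below works this out for the numbers printed above; `pages_needed` is a hypothetical helper, and the value 200 is assumed to be the largest page size the API accepts.

```python
import math


def pages_needed(count, per_page):
    """Number of pages required to retrieve `count` records."""
    return math.ceil(count / per_page)


# Values copied from the metadata printed above
meta = {"count": 65073, "per_page": 25}

print(pages_needed(meta["count"], meta["per_page"]))  # pages at the default page size
print(pages_needed(meta["count"], 200))               # pages at an assumed maximum of 200 per page
```

With the default page size, retrieving every concept takes 2,603 requests; at 200 records per page it takes 326.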