<a href="https://colab.research.google.com/github/AlexanderFriedrichsen/AlexanderFriedrichsen.github.io/blob/main/ScrapingCitations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Colab is intended for data sample collection and EDA (exploratory data analysis) of our intended GitHub repo > paper links dataset. We want to gather the set of papers that cite a linked paper, the (likely dependent) set of papers that cite a repo that is linked to a paper, and the intersection: those that cite both the paper and the repo.
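
The three sets described above can be sketched with plain Python sets. The identifiers below are made-up placeholders rather than real data, and the variable names are only illustrative.

```python
# Hypothetical identifiers for citing papers; placeholders, not real data.
citing_the_paper = {"paper_a", "paper_b", "paper_c", "paper_d"}
citing_the_repo = {"paper_c", "paper_d", "paper_e"}

# Papers that cite both the linked paper and its repository.
citing_both = citing_the_paper & citing_the_repo

# Papers that cite only one of the two artifacts.
paper_only = citing_the_paper - citing_the_repo
repo_only = citing_the_repo - citing_the_paper

print(sorted(citing_both))
print(sorted(paper_only))
print(sorted(repo_only))
```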

Our intention is to discover the relationship between paper citation and code citation.

"We seek to identify a dataset of (bidirectionally) linked repositories and papers to uncover the potential indirect effects of papers on repositories. We will examine the distribution of original repository citations from papers that cite a linked paper. We intend to analyze how papers cite repositories and other papers, and whether this relationship changes for different fields of study and popularity of the cited artifact. Finally, we intend to examine the evolution of how papers have began citing repositories in addition to other papers. During our analysis we hope to find additional questions that explain these sought indirect effects."

After meeting with Laurent today (12/8/2021), I'm going to go through the Google Scholar API and search for text within cites. We have to be able to find the Google Scholar ID programmatically.


In [2]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt  # pyplot is the plotting interface conventionally aliased as plt

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Exploratory checking process:
# paste link in dataframe below to paper -> google scholar cite by X
# look at first 5 and check if they cite the GitHub link as well
# In df, fill out the number of links back to github columns

#column_names = ["original_paper_link", "cited_by_1", "cited_by_2", "cited_by_3", "cited_by_4", "cited_by_5"]
column_names = ["original_paper_link", "number_citations_to_repo", "number_papers_citing_original_paper"]

data = [["https://arxiv.org/abs/1804.02047", 0, 51],
        ["https://arxiv.org/abs/1703.08619", 0, 1],
        ["https://arxiv.org/abs/1810.10551", 1, 29],
        ["https://www.mdpi.com/1424-8220/19/3/636", 0, 11],
        ["https://arxiv.org/abs/1408.5093", 0, 15980],
        ["https://arxiv.org/abs/1408.5093", 0, 15980],
        ["https://www.biorxiv.org/content/10.1101/170571v1", 0, 10],
        ["https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/WR020i011p01659", 0, 253],
        ["https://arxiv.org/abs/1201.6082", 0, 21]
       ]


# Build the frame from a plain list: np.array would coerce the counts to strings
cites_repo = pd.DataFrame(data, columns=column_names)

In [4]:
cites_repo

Unnamed: 0,original_paper_link,number_citations_to_repo,number_papers_citing_original_paper
0,https://arxiv.org/abs/1804.02047,0,51
1,https://arxiv.org/abs/1703.08619,0,1
2,https://arxiv.org/abs/1810.10551,1,29
3,https://www.mdpi.com/1424-8220/19/3/636,0,11
4,https://arxiv.org/abs/1408.5093,0,15980
5,https://arxiv.org/abs/1408.5093,0,15980
6,https://www.biorxiv.org/content/10.1101/170571v1,0,10
7,https://agupubs.onlinelibrary.wiley.com/doi/ab...,0,253
8,https://arxiv.org/abs/1201.6082,0,21


Last meeting we were working on how to query Google Scholar for the results we wanted. It is possible to do this with the Google Scholar ID, the ID of the search. In Google Scholar we can search for some term and, for one of the articles that comes up, click "Cited by (x)". This brings us to a page listing all the articles that cite that article. Checking the box at the top labeled "Search within citing articles" then lets us use the search bar to find a term inside the articles that cite the original paper of interest.


The URL of the search appears differently when different options are selected: 
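
As a rough sketch of that difference, the two URL shapes can be built with the standard library. The parameter names are assumptions based on what the browser address bar appears to show: `cites` carries the ID and `scipsc=1` is assumed to correspond to the "search within citing articles" checkbox. `scholar_cites_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlencode


def scholar_cites_url(cites_id, term=None):
    """Build a Scholar 'cited by' URL, optionally searching within the citing articles."""
    params = {"cites": cites_id, "hl": "en"}
    if term:
        # Assumption: scipsc=1 mirrors the "search within citing articles" checkbox.
        params["scipsc"] = "1"
        params["q"] = term
    return "https://scholar.google.com/scholar?" + urlencode(params)


# All articles citing the paper, then only those that also match "caffe".
print(scholar_cites_url("1739257544589912763"))
print(scholar_cites_url("1739257544589912763", "caffe"))
```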


We want to systematically get the list of URLs or Google Scholar IDs for each paper that cites the GitHub repository of the original paper from within these citing articles.


To start with, we need to take the spreadsheet of all the repos that have papers connected, and put this into a loop in our code. We also need to put the associated GitHub link into a list. For each paper, search Google Scholar and go to "Cited by"; the search term should then be the GitHub link. However, this may require regular expressions, because not everyone will list the GitHub link exactly the same way in their bibliography. We can only use limited boolean expressions in our search queries, which means we will have to figure out a way to do the regex manually...
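
One way to handle the link variants locally (for example on bibliography text we have already downloaded) is a tolerant regular expression. This is only a sketch: `repo_pattern` is a hypothetical helper, and it assumes links differ only in scheme, a leading `www.`, letter case, and a trailing slash or `.git`.

```python
import re


def repo_pattern(repo_url):
    """Compile a regex matching common ways of writing a GitHub repo link."""
    # Pull "owner/name" out of the URL, ignoring a trailing slash or ".git".
    match = re.search(r"github\.com/([^/\s]+)/([^/\s]+?)(?:\.git)?/?$", repo_url.strip())
    if match is None:
        raise ValueError(f"not a GitHub repository URL: {repo_url}")
    owner, name = (re.escape(part) for part in match.groups())
    # Optional scheme and "www.", then github.com/owner/name, case-insensitively.
    # The lookahead stops "name" from matching a longer repo name such as "name-v2".
    return re.compile(
        rf"(?:https?://)?(?:www\.)?github\.com/{owner}/{name}(?![\w-])",
        re.IGNORECASE,
    )


pattern = repo_pattern("https://github.com/SimBussy/binarsity")
print(bool(pattern.search("Code is at github.com/simbussy/binarsity.")))
```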



In [5]:
#step one: read csv into df
path = "/content/drive/MyDrive/School/HonorsThesis/GithubToPaperLinks.csv"
df = pd.read_csv(path)
df

Unnamed: 0,1,Link to GitHub Repository,Link to paper,Referencing or not,RQ1 Open Access,RQ2 Relationship between Repository and Paper,RQ2 Diversity of GitHub Repository,RQ2 Affiliation,RQ2 Paper References back to the Repository,RQ3 README evolution,RQ3 Paper evolution
0,2,https://github.com/SimBussy/binarsity,https://arxiv.org/abs/1703.08619,yes,yes,official,deep learning,university,no,no,yes
1,3,https://github.com/previtus/AttentionPipeline,https://arxiv.org/abs/1810.10551,yes,yes,official,computer vision,university,yes,no,no
2,4,https://github.com/SFI-Mechatronics/wp3_decomp...,https://www.mdpi.com/1424-8220/19/3/636,yes,yes,official,sensors,university,yes,no,no
3,5,https://github.com/rickyHong/CaffeDLFramework,https://arxiv.org/abs/1408.5093,yes,yes,fork of official,deep learning,unknow,yes to official,no,no
4,6,https://github.com/laycoding/caffe_focal_loss,https://arxiv.org/abs/1408.5093,yes,yes,fork of official,deep learning,unknow,yes to official,no,no
...,...,...,...,...,...,...,...,...,...,...,...
373,375,https://github.com/straylightrun/cytofviz,https://arxiv.org/abs/1201.6082,yes,yes,independent unofficial,ml,industry,no,no,no
374,376,https://github.com/zsylvester/meanderpy,https://agupubs.onlinelibrary.wiley.com/doi/ab...,yes,yes,independent unofficial,computer vision,university,no,no,no
375,377,https://github.com/alexandrovteam/curatr,https://www.biorxiv.org/content/10.1101/170571v1,yes,yes,official,other,university,yes,no,no
376,378,https://github.com/yueruchen/Pedestrian-Synthe...,https://arxiv.org/abs/1804.02047,yes,yes,official,computer vision,both,yes,no,no


In [6]:
#cleaning the sheet a bit
df.columns = df.columns.str.replace(' ', '_')
df.head(0)

df = df[df.Referencing_or_not == "yes"]


In [7]:
df[df.columns[3]]

0      yes
1      yes
2      yes
3      yes
4      yes
      ... 
372    yes
373    yes
374    yes
375    yes
376    yes
Name: Referencing_or_not, Length: 344, dtype: object

In [8]:
!pip install google-search-results
# using the SERP (search engine results pages) API (limited to 100 queries)
# https://serpapi.com/dashboard api key here
from serpapi import GoogleSearch

params = {
  "api_key": "d6da7345463a63444b3312d3d55e7c944bd7ed59078633bf84d2f948d166709a",
  "engine": "google_scholar",
  "q": "caffe",
  "cites": "1739257544589912763",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()


Collecting google-search-results
  Downloading google_search_results-2.4.1.tar.gz (11 kB)
Building wheels for collected packages: google-search-results
  Building wheel for google-search-results (setup.py) ... done
  Created wheel for google-search-results: filename=google_search_results-2.4.1-py3-none-any.whl size=25789 sha256=13823b4806b9f85924b6fe7a909bc64fd35a31a346b8d0d77f5b168560012995
  Stored in directory: /root/.cache/pip/wheels/82/a3/c5/364155118f298722dff2f79ae4dd7c91e92b433ad36d6f7e0e
Successfully built google-search-results
Installing collected packages: google-search-results
Successfully installed google-search-results-2.4.1
https://serpapi.com/search


In [9]:
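# Automating the search process:
# build the column of cites_ids, one entry per paper
Cites_Ids = []

for index, row in df.iterrows():
    params = {
        "api_key": "d6da7345463a63444b3312d3d55e7c944bd7ed59078633bf84d2f948d166709a",
        "engine": "google_scholar",
        "q": row["Link_to_paper"],
        "hl": "en"
    }
    # get_dict() runs the query and returns the parsed JSON response
    results = GoogleSearch(params).get_dict()
    # organic_results is a list: take the top hit and walk down to its cites_id,
    # falling back to a placeholder when any level is missing
    organic = results.get("organic_results") or [{}]
    cites_id = organic[0].get("inline_links", {}).get("cited_by", {}).get("cites_id")
    Cites_Ids.append(cites_id if cites_id else "missing_cites_id")

# len() is a builtin; neither DataFrame nor list has a .len() method
if len(df) == len(Cites_Ids):
    df["Cites_Ids"] = Cites_Ids

df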

In [None]:
# What do we want from the search?
# - the number of papers
# - the field, author, whether they have a github repo, overlapping author, journal?
# this could partially be a good subject for the meeting
Citing_Artifacts_With_GitHub_Citation = []

for index, row in df.iterrows():
    cites = row["Cites_Ids"]

    # No cites_id was found for this paper, so skip the query and keep the rows aligned
    if cites == "missing_cites_id":
        Citing_Artifacts_With_GitHub_Citation.append(None)
        continue

    # Set our search inside the citing articles to be the GitHub repository of the original.
    # Might want to use regex here??
    q = row["Link_to_GitHub_Repository"]

    params = {
        "api_key": "d6da7345463a63444b3312d3d55e7c944bd7ed59078633bf84d2f948d166709a",
        "engine": "google_scholar",
        "q": q,
        "cites": cites,
        "hl": "en"
    }
    # get_dict() runs the query and returns the parsed JSON response
    results = GoogleSearch(params).get_dict()

    # save to harddisk here **

    # edit here for what data we want to pull out, can make more columns etc.

    Citing_Artifacts_With_GitHub_Citation.append(results)

    # cites_id = search["organic_results"]["inline_links"]["cited_by"]["cites_id"]
    # if ((cites_id != None) and (cites_id != "")):
    #     Cites_Ids.append(cites_id)
    # else: 
    #     Cites_Ids.append("missing_cites_id")

# len() is a builtin; neither DataFrame nor list has a .len() method
if len(df) == len(Citing_Artifacts_With_GitHub_Citation):
    df["Citing_Artifacts_With_GitHub_Citation"] = Citing_Artifacts_With_GitHub_Citation

df
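
Because the API is limited to 100 queries, the "save to harddisk" step could be a small JSON cache so that repeated runs do not spend queries again. The sketch below is one possible approach, not something the notebook already does; `cached_search` and the cache directory name are hypothetical, and it assumes the key (for example a cites ID) is safe to use as a file name.

```python
import json
from pathlib import Path


def cached_search(cache_dir, key, fetch):
    """Return the cached result for `key`, calling `fetch()` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        # Cache hit: read the stored response instead of querying again.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = fetch()
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Inside the loop this might be called as `cached_search("scholar_cache", cites, lambda: GoogleSearch(params).get_dict())`, assuming the response is plain JSON-serializable data.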

Divergence of code references and code usage

when software is updated, does this help original authors, contributors, both?

DOIs for github repos!


Search Scholar for just the GitHub URL, without searching inside of citing articles - get the papers that are citing the repo and not the original paper

when searching for github repo, we should search for username/reponame OR github.com/username/repo
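
A minimal sketch of building that OR query from a repository URL follows. `repo_query` is a hypothetical helper, and quoting each alternative as an exact phrase is an assumption about how Scholar treats the query.

```python
def repo_query(repo_url):
    """Build a query matching either 'owner/name' or 'github.com/owner/name'."""
    # Strip the scheme, a leading "www.", and any trailing slash or ".git".
    path = repo_url.strip().split("://", 1)[-1]
    if path.startswith("www."):
        path = path[len("www."):]
    path = path.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    # Raises ValueError if the URL has no owner/name part.
    host, owner, name = path.split("/")[:3]
    return f'"{owner}/{name}" OR "{host}/{owner}/{name}"'


print(repo_query("https://github.com/zsylvester/meanderpy"))
```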

how good is Scholar at searching through PDFs/other forms of documents?

citations are a lagging indicator

look at the sample and check when the papers were published, stratify sample, taking a bit from each year (how many?)
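
Stratifying by publication year could look roughly like this. It is a sketch under assumptions: each record is a dict with a `year` key (the spreadsheet does not currently have such a column), and `per_year` is whatever number we settle on.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_year, seed=0):
    """Pick up to `per_year` records from each publication year."""
    by_year = defaultdict(list)
    for record in records:
        by_year[record["year"]].append(record)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for year in sorted(by_year):
        group = by_year[year]
        # Years with fewer than per_year papers contribute everything they have.
        sample.extend(rng.sample(group, min(per_year, len(group))))
    return sample
```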

Nature papers code availability section - how many papers have it, how many have a url, how many of the url are GitHub?
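
Those three counts could be tallied as below once the code availability text has been extracted for each paper. This is a sketch only: the extraction step is not shown, `tally_code_availability` is a hypothetical helper, and it counts papers (not individual URLs) that contain a GitHub link.

```python
import re

URL_RE = re.compile(r"https?://[^\s)>\]]+", re.IGNORECASE)


def tally_code_availability(sections):
    """Tally sections present, sections with a URL, and sections with a GitHub URL."""
    counts = {"has_section": 0, "has_url": 0, "has_github_url": 0}
    for text in sections:
        if not text:  # None or empty: the paper has no such section
            continue
        counts["has_section"] += 1
        urls = URL_RE.findall(text)
        if urls:
            counts["has_url"] += 1
        if any("github.com/" in url.lower() for url in urls):
            counts["has_github_url"] += 1
    return counts
```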

co-occurrence network: how often do papers join multiple GitHub repos together?
If, for example, I copy code from 3 repos, I might not cite all 3 of those in the paper, but instead only in the repo that is combining the pieces from all 3
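
The edge weights of such a co-occurrence network can be sketched by counting repository pairs per paper. The input mapping (paper identifier to the repos it cites) is hypothetical; we do not have that data yet.

```python
from collections import Counter
from itertools import combinations


def repo_cooccurrence(paper_to_repos):
    """Count how often each pair of repositories is cited by the same paper."""
    pair_counts = Counter()
    for repos in paper_to_repos.values():
        # Deduplicate and sort so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(repos)), 2):
            pair_counts[pair] += 1
    return pair_counts
```

Each key is a pair of repos and each value is the number of papers citing both, which can be used directly as an edge weight.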

Make this into a GitHub repo




# Visualization of Preliminary Results
