# Ideas for SSI policy research

Use existing datasets of software mentions in publications or preprints to analyse how research software projects have changed over time or differ between disciplines.

Input data sources:
- CZI Software Mentions (includes extracted GitHub links)
- Softcite Software Mentions (doesn’t include GitHub links)
- Crawl ePrints to extract GitHub links and associate them with appropriate metadata, to create an equivalent set

Potential questions:
- How has preferred license changed over time?
- How has team size (number of concurrent contributors?) changed over time?
- How has commit frequency changed over time?
- How has linkage to other research outputs (e.g. DOIs of datasets or papers included in READMEs) changed over time?

In [1]:
import pandas as pd
import numpy
pd.set_option('max_colwidth', 1000)

In [18]:
import requests
from github import Github
import datetime

In [3]:
ROOT_DATA_DIR = '../data/'

In [4]:
# Get Config
import configparser
config = configparser.ConfigParser()
config.read('../config.cfg')
access_token = config['ACCESS']['token']
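The cell above expects `../config.cfg` to contain an `[ACCESS]` section with a `token` key. Below is a minimal sketch of that layout, together with a reader that fails with a clear message when the section or key is missing. The helper name `read_token` and the placeholder token value are assumptions made for illustration, not part of the original notebook.

```python
import configparser

# Hypothetical contents of ../config.cfg; the token value is a placeholder.
EXAMPLE_CONFIG = """
[ACCESS]
token = <your-github-personal-access-token>
"""

def read_token(config_text: str) -> str:
    """Return the token from the [ACCESS] section, or raise a descriptive error."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # has_option is False when either the section or the key is absent.
    if not parser.has_option('ACCESS', 'token'):
        raise KeyError("config is missing the 'token' key in section [ACCESS]")
    return parser['ACCESS']['token']

token = read_token(EXAMPLE_CONFIG)
```

Keeping the token in a separate file that is excluded from version control avoids committing credentials along with the notebook.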

### GitHub linked software

Concentrating on software linked on GitHub for now.

In [6]:
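# sep is interpreted as a regex by the python engine;
# on_bad_lines='warn' (pandas >= 1.3) skips malformed rows and reports them
linked_df = pd.read_csv(ROOT_DATA_DIR + 'linked/metadata.tsv.gz', sep='\\t', engine='python', compression='gzip', on_bad_lines='warn')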



Skipping line 79480: Expected 16 fields in line 79480, saw 17. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.


In [7]:
github_df = linked_df[linked_df['source'] == 'Github API']
len(github_df)

112708

In [8]:
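def parse_github_repo_url(url: str):
    """Split a URL of the form https://github.com/<user>/<repo> into (user, repo_name)."""
    try:
        _, _, _, user, repo_name = url.split('/')
    except ValueError:
        # The URL does not consist of exactly five '/'-separated segments.
        print(f"Could not unpack URL {url} into 5 segments. Refactor function parse_github_repo_url.")
        return None, None
    return user, repo_name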
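Splitting on `/` only works for URLs with exactly five segments, so a trailing slash, a `.git` suffix, or a deeper path such as `/tree/main` would not unpack. A more tolerant variant could be sketched with `urllib.parse`; the name `split_github_url` is hypothetical, and returning `(None, None)` for unusable URLs is an assumption about how callers want to handle them.

```python
from urllib.parse import urlparse

def split_github_url(url: str):
    """Return (user, repo_name) for a github.com repository URL, else (None, None)."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ('github.com', 'www.github.com'):
        return None, None
    # Drop empty segments caused by leading, trailing, or doubled slashes.
    segments = [s for s in parsed.path.split('/') if s]
    if len(segments) < 2:
        return None, None
    user, repo_name = segments[0], segments[1]
    # Clone URLs often end in ".git", which is not part of the repository name.
    if repo_name.endswith('.git'):
        repo_name = repo_name[:-4]
    return user, repo_name

example = split_github_url('https://github.com/ShinShil/Analizer/')
```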

#### Play with the GitHub API

Using `requests` or possibly the `PyGithub` package.

In [25]:
test_data = github_df.sample(1)
test_url = test_data.github_repo.values[0]

In [30]:
test_data

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
126059,SM480072,Analizer,Analizer,Github API,,https://github.com/ShinShil/Analizer,https://github.com/ShinShil/,,,,https://github.com/ShinShil/Analizer,,True,,,


In [33]:
test_user, test_repo_name = parse_github_repo_url(test_url)

In [34]:
repo_data = requests.get(f'https://api.github.com/repos/{test_user}/{test_repo_name}').json()
repo_data

{'id': 31628058,
 'node_id': 'MDEwOlJlcG9zaXRvcnkzMTYyODA1OA==',
 'name': 'Analizer',
 'full_name': 'ShinShil/Analizer',
 'private': False,
 'owner': {'login': 'ShinShil',
  'id': 11064696,
  'node_id': 'MDQ6VXNlcjExMDY0Njk2',
  'avatar_url': 'https://avatars.githubusercontent.com/u/11064696?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/ShinShil',
  'html_url': 'https://github.com/ShinShil',
  'followers_url': 'https://api.github.com/users/ShinShil/followers',
  'following_url': 'https://api.github.com/users/ShinShil/following{/other_user}',
  'gists_url': 'https://api.github.com/users/ShinShil/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/ShinShil/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/ShinShil/subscriptions',
  'organizations_url': 'https://api.github.com/users/ShinShil/orgs',
  'repos_url': 'https://api.github.com/users/ShinShil/repos',
  'events_url': 'https://api.github.com/users/ShinShil/events{/privacy

In [39]:
g = Github()
test_repo = g.get_repo(f'{test_user}/{test_repo_name}')

In [54]:
for s in test_repo.get_stats_contributors():
    print(s.author)
    for w in s.weeks:
        if w.a > 0 or w.d > 0 or w.c > 0:
            print(f"Contribution in week {w.w}: Additions {w.a}, Deletions {w.d}, Commits {w.c}")


NamedUser(login="ShinShil")
Contribution in week 2015-03-01 00:00:00: Additions 4, Deletions 0, Commits 1
Contribution in week 2015-05-24 00:00:00: Additions 29, Deletions 0, Commits 1
Contribution in week 2015-05-31 00:00:00: Additions 335, Deletions 0, Commits 1
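The same weekly statistics are available without PyGithub from the REST endpoint `GET /repos/{owner}/{repo}/stats/contributors`, where each week is a dictionary with `w` (start of week as a Unix timestamp), `a`, `d`, and `c`. The sketch below flattens such a payload into one row per active week. The helper names `active_weeks` and `ts` are hypothetical, and the sample payload is hand-built: its first week mirrors the output above, its second week is made up to show the filter. GitHub may answer with status 202 while it is still computing the statistics, so a real request might need to be retried.

```python
from datetime import datetime, timezone

def active_weeks(stats):
    """Flatten a /stats/contributors payload into one row per week with activity."""
    rows = []
    for contributor in stats:
        author = contributor.get('author') or {}  # author can be null in the payload
        login = author.get('login')
        for week in contributor['weeks']:
            if week['a'] > 0 or week['d'] > 0 or week['c'] > 0:
                start = datetime.fromtimestamp(week['w'], tz=timezone.utc)
                rows.append({
                    'author': login,
                    'week_start': start.date().isoformat(),
                    'additions': week['a'],
                    'deletions': week['d'],
                    'commits': week['c'],
                })
    return rows

def ts(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

# Hand-built sample in the shape of the REST response.
sample = [{
    'author': {'login': 'ShinShil'},
    'total': 1,
    'weeks': [
        {'w': ts(2015, 3, 1), 'a': 4, 'd': 0, 'c': 1},
        {'w': ts(2015, 3, 8), 'a': 0, 'd': 0, 'c': 0},  # made-up inactive week
    ],
}]
rows = active_weeks(sample)
```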


#### Apply on full dataset

In [9]:
g = Github(access_token)

In [14]:
# test purposes
smaller_df = github_df.sample(20)

Repo crawling function: goes through all linked repos in the provided dataset and collects weekly contributions, the license key, and the README size. TODO: add a progress bar.

In [58]:
def crawl_repo(df):
    """For all GitHub repositories in the dataset, retrieve contributions and contents.

    Args:
        df (pd.DataFrame): CZI linked dataset with links to GitHub

    Returns:
        (pd.DataFrame, pd.DataFrame): one data frame holding info on contributions, one data frame holding info on contents (license and README).
            - contributions dataframe columns:
                - github_repo: same as in CZI linked dataset
                - author: login of the contributor to the repository
                - year, week: ISO year and ISO week number of the contributions in question
                - commits: number of commits in that specific week
            - contents dataframe columns:
                - github_repo: same as in CZI linked dataset
                - license: license key if license was found (e.g. mit, lgpl-3.0, mpl-2.0, ... (https://docs.github.com/en/rest/licenses?apiVersion=2022-11-28#get-all-commonly-used-licenses))
                - readme_size: size of README file, 0 if none was found
    """
    # Collect rows in lists; appending to a DataFrame row by row gets slow for large crawls.
    contribution_rows = []
    content_rows = []
    for u in df['github_repo']:
        user, repo_name = parse_github_repo_url(u)
        if user is None or repo_name is None:
            continue
        # Catch Exception rather than using a bare except, so Ctrl-C still interrupts the crawl.
        try:
            repo = g.get_repo(f"{user}/{repo_name}")
        except Exception:
            print(f"Could not resolve repository for URL {u}.")
            continue  # no repo object to query, move on to the next URL
        contribution_stats = repo.get_stats_contributors()
        if contribution_stats is not None:
            for s in contribution_stats:
                author = s.author.login if s.author is not None else None
                for w in s.weeks:
                    # ISO year and ISO week belong together; the calendar year can differ around New Year.
                    iso_year, iso_week, _ = w.w.isocalendar()
                    contribution_rows.append([u, author, iso_year, iso_week, w.c])
        try:
            license_entry = repo.get_license().license.key
        except Exception:
            license_entry = None
        try:
            readme_entry = repo.get_readme().size
        except Exception:
            readme_entry = 0
        content_rows.append([u, license_entry, readme_entry])
    contributions_df = pd.DataFrame(contribution_rows, columns=['github_repo', 'author', 'year', 'week', 'commits'])
    contents_df = pd.DataFrame(content_rows, columns=['github_repo', 'license', 'readme_size'])
    return contributions_df, contents_df
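For the progress bar mentioned in the TODO, one dependency-free option is a small generator that wraps the iterable and reports a running count. This is only a sketch: the name `with_progress`, the reporting interval, and the message format are assumptions. The third-party `tqdm` package offers the same wrapping pattern via `tqdm(iterable)` if installing it is acceptable.

```python
import sys

def with_progress(items, every=100, stream=sys.stderr):
    """Yield items unchanged while periodically reporting how many were processed."""
    items = list(items)  # materialise once so the total is known
    total = len(items)
    for i, item in enumerate(items, start=1):
        yield item
        # Report after every `every` items and once more at the very end.
        if i % every == 0 or i == total:
            print(f"processed {i}/{total}", file=stream)

# Usage with any iterable, e.g. a column of repository URLs.
seen = list(with_progress(['url-1', 'url-2', 'url-3'], every=2))
```

Inside the crawling function, the loop header would then read `for u in with_progress(df['github_repo']):`.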

In [59]:
contributions_df, contents_df = crawl_repo(github_df)

Could not resolve repository for URL https://github.com/cheukyin699/ProtParam.


KeyboardInterrupt: 
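The run above was interrupted before `crawl_repo` returned, so nothing collected up to that point was kept. With more than 100,000 repositories and several API calls per repository, a crawl of this size is likely to be stopped or rate-limited at some point. A sketch of one way to make it resumable is to append each result to a file as soon as it is available and to skip URLs already present on the next run. The function name `crawl_with_checkpoint` and the `fetch` callback (a stand-in for the per-repository API calls) are hypothetical.

```python
import csv
import os

def crawl_with_checkpoint(urls, fetch, path):
    """Call fetch(url) for every URL not yet recorded in the TSV file at `path`.

    Each result is appended and flushed immediately, so an interrupted run
    loses at most the repository that was being processed.
    """
    done = set()
    if os.path.exists(path):
        with open(path, newline='') as f:
            # The first column of every stored row is the repository URL.
            done = {row[0] for row in csv.reader(f, delimiter='\t') if row}
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        for url in urls:
            if url in done:
                continue
            writer.writerow([url, *fetch(url)])
            f.flush()
            done.add(url)
    return done
```

Calling the function again with the same `path` after an interruption continues with the first URL that has no row yet.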

In [None]:
# to_csv needs a single-character separator and has no `engine` argument
contributions_df.to_csv(ROOT_DATA_DIR + 'linked/contributions.tsv.gz', sep='\t', compression='gzip')
contents_df.to_csv(ROOT_DATA_DIR + 'linked/contents.tsv.gz', sep='\t', compression='gzip')
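Once a contributions table exists, the questions on team size and commit frequency could be approached with a group-by. The sketch below uses a toy table in the contributions layout (`github_repo`, `author`, `year`, `week`, `commits`); all values are made up. It treats "team size" as the number of distinct authors with at least one commit per repository and year, which is only one possible reading of "concurrent contributors".

```python
import pandas as pd

# Toy rows in the contributions layout; all values are made up for illustration.
toy = pd.DataFrame({
    'github_repo': ['r1', 'r1', 'r1', 'r1', 'r2'],
    'author': ['alice', 'bob', 'alice', 'bob', 'carol'],
    'year': [2015, 2015, 2016, 2016, 2016],
    'week': [10, 10, 3, 4, 7],
    'commits': [2, 1, 5, 0, 3],
})

# Weeks without commits do not count towards team membership.
active = toy[toy['commits'] > 0]

# Team size: distinct authors with at least one commit per repository and year.
team_size = (
    active.groupby(['github_repo', 'year'])['author']
    .nunique()
    .rename('team_size')
    .reset_index()
)

# Commit frequency: total commits per repository and year.
commits_per_year = (
    active.groupby(['github_repo', 'year'])['commits'].sum().reset_index()
)
```

Filtering out zero-commit weeks matters because the weekly statistics list every week for each contributor, including weeks without any activity.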

In [None]:
# play with this
# repo.get_contents("")  # requires a path; "" lists the repository root