# GitHub Commit Data: Getting, Processing, Analysing Pipeline.    

This notebook will:  
    - explore commit data (from GitHub API) processing and analysis  
    - act as documentation about how the project commit data is being obtained, reshaped and analysed.  
    - explain the interactions between different project functions and scripts  
    - clarify what commit data is stored and where  
    - prove why these steps are necessary for generating the dataset which will be used for Research Assistant Personas research work  

## Overall plan

Aims:   

I need to obtain and analyse commit data for given github repositories of research software to be able to explore the ways in which the developers in those research software projects engage and interact with the codebase and associated GH development and project management tools.

I expect that different developers will fall into at least two to three defineable categories of behaviours, and that analysis of commits will be crucial to being able to: a) describe these categories, and b) how to assign individual developers to those different categories, on the basis of their interactions with the repo.  

By exploring the data from many different RS repositories using scripts, I should be able to gather a large dataset to investigate these hypotheses.  

Overview:  


00) Setup, imports, logging  

01) Get RS repo name(s) for study (inclusion/exclusion steps)   
02) Check whether current data already exists; if so, use that      

03) Query API for repo(s) commits data   

04) Save out raw commits json for repo(s)  
05) Convert data to pandas format for analysis  

06) Calculate commits summary stats for repo(s)  

07) Slice commits data by commit author  

08) Calculate commits summary stats for author(s)  

09) Compare author commits stats to repo commits stats; looking for outliers or differences in terms of frequency of commits, change size, files changed, file types changed (vasilescu_variations_2014), key words frequencies (hattori_nature_2008), etc.  

10) Report findings  
11) Return data in useful format for subsequent analyses / visualisation  
12) Save out data 

13) Visualisation of commits data for a) dataset of repo(s); b) individual repo(s); c) individual authors (optional, perhaps only where outliers or interesting differences show up); 






## File system Organisation  

REPO ROOT:   
`/` (coding-smart repo root)   

DOCS FOLDERS:    
`/docs/` - currently this holds .jpg files of UML diagrams referred to in the overall repo readme.   

OUTPUT FOLDERS:   
`/data/` - holds .csv files of data obtained from GH API calls.  
`/logs/` - holds .txt logfiles with name format `[functionname or scriptname]_(NOTEBOOK)_logs.txt`
 where 'NOTEBOOK' is optional and relates to output logs generated by jupyter notebook function runs.  
`/images/` - holds .png files generated by python plotting and visualisation of the GH data.    

SOURCE CODE FOLDERS:    
`/utilities/` - contains utility functions and scripts of general usability   
`/githubanalysis/` -  contains functions and scripts relating to the github API and subsequent data operations     
`/zenodocode/` - contains functions and scripts relating to zenodo API for identifying RS repos' github repo names      
 

## Notebook Imports and Setup  

## Setup 

Ensure you are in the `coding-smart-github` conda environment and have the following packages installed in your environment which match the `requirements.txt` file in the coding-smart repository.  

### Github Authentication 

Create a classic access token via [Github Authentication Settings](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token#creating-a-personal-access-token-classic) and create a file called `config.cfg` in the `githubanalysis/` folder with the following content: 
```bash
[ACCESS]
token = <your-access-token>
```
Ensure you've pasted in your token, but leave `[ACCESS]` and `token = `.

## Import python packages required for notebook 

In [1]:
import os
from os import path
import configparser
from github import Github
import pandas as pd

In [2]:
# set up github access token with github package: 

config = configparser.ConfigParser()
config.read('../config.cfg')
config.sections()

access_token = config['ACCESS']['token']
g = Github(access_token) 

In [None]:
# import existing scripts from repo for use in notebook   

# logging:  

# GH API access/querying  



05) Convert data to pandas format for analysis  

06) Calculate commits summary stats for repo(s)  

07) Slice commits data by commit author  

08) Calculate commits summary stats for author(s)  

09) Compare author commits stats to repo commits stats; looking for outliers or differences in terms of frequency of commits, change size, files changed, file types changed (vasilescu_variations_2014), key words frequencies (hattori_nature_2008), etc.  

In [None]:
# read in commits data from an existing data file  

# check it's in pandas format (05)  

# calculate a summary stat (e.g. changesize of commit; )

In [None]:
# get number of commits for repo of repo_name

import requests
from urllib.parse import parse_qs, urlparse

class CommitsCount: 
    def get_commits_count(self, repo_name: str) -> int:
        """
        Returns the number of commits to a GitHub repository.
        """
        url = f"https://api.github.com/repos/{repo_name}/commits?per_page=1"
        r = requests.get(url)
        links = r.links
        rel_last_link_url = urlparse(links["last"]["url"])
        rel_last_link_url_args = parse_qs(rel_last_link_url.query)
        rel_last_link_url_page_arg = rel_last_link_url_args["page"][0]
        commits_count = int(rel_last_link_url_page_arg)
        return commits_count
    # code via https://brianli.com/2022/07/python-get-number-of-commits-github-repository/  

In [None]:
# get number of commits for 1 named (currently hardcoded) repo
c = CommitsCount()
c.get_commits_count(repo_name="JeschkeLab/DeerLab")

In [None]:
# import githubanalysis.processing.get_repo_connection as ghconn
# repo_con = ghconn.get_repo_connection(repo_name, config_path) 

In [4]:
import pandas as pd
pd.__version__

'1.5.3'

In [10]:
import utilities.get_default_logger as loggit
import utilities.chunker as chunker

import githubanalysis.processing.get_all_pages_commits 
from githubanalysis.processing.get_all_pages_commits import CommitsGetter 

In [3]:
repo_name = 'JeschkeLab/DeerLab'

logger = loggit.get_default_logger(console=True, set_level_to='DEBUG', log_name='../../logs/get_all_pages_commits_NOTEBOOK_logs.txt')  
commits_getter = CommitsGetter(logger)

coms_df = commits_getter.get_all_pages_commits(repo_name=repo_name, config_path='../../githubanalysis/config.cfg', out_filename='all-commits', write_out_location='../../data/')

INFO:>> Running commit grab for repo JeschkeLab/DeerLab, in page 1 of 5.
INFO:>> Running commit grab for repo JeschkeLab/DeerLab, in page 2 of 5.
INFO:>> Running commit grab for repo JeschkeLab/DeerLab, in page 3 of 5.
INFO:>> Running commit grab for repo JeschkeLab/DeerLab, in page 4 of 5.
INFO:>> Running commit grab for repo JeschkeLab/DeerLab, in page 5 of 5.
INFO:Total number of commits grabbed is 497 in 5 page(s).
INFO:Commits data written out to file for repo JeschkeLab/DeerLab at ../../data/all-commits_JeschkeLab-DeerLab_2024-09-19.csv and ../../data/all-commits_JeschkeLab-DeerLab_2024-09-19.json.


In [4]:
locatstr = "../../data/all-commits_JeschkeLab-DeerLab_2024-09-19_lol.json"
coms_df.to_json(path_or_buf=locatstr, orient='records')

In [8]:
pd.read_json(path_or_buf="../../data/all-commits_JeschkeLab-DeerLab_2024-09-19.json", orient='records', lines=True)

Unnamed: 0,sha,node_id,commit,url,html_url,comments_url,author,committer,parents,repo_name,commit_message,author_dev,committer_dev,author_date,committer_date,same_date
0,34e5a3a1d2395b7ad517ac323024537db3b10785,C_kwDOEIKXK9oAKDM0ZTVhM2ExZDIzOTViN2FkNTE3YWMz...,"{'author': {'name': 'Hugo Karas', 'email': 'hk...",https://api.github.com/repos/JeschkeLab/DeerLa...,https://github.com/JeschkeLab/DeerLab/commit/3...,https://api.github.com/repos/JeschkeLab/DeerLa...,"{'login': 'HKaras', 'id': 21962092, 'node_id':...","{'login': 'web-flow', 'id': 19864447, 'node_id...",[{'sha': '55d4eab57ede1d24899088c45fb02f63754f...,JeschkeLab/DeerLab,Update changelog (#484),HKaras,web-flow,2024-09-03T17:55:23Z,2024-09-03T17:55:23Z,True
1,55d4eab57ede1d24899088c45fb02f63754f6acd,C_kwDOEIKXK9oAKDU1ZDRlYWI1N2VkZTFkMjQ4OTkwODhj...,"{'author': {'name': 'Hugo Karas', 'email': 'hk...",https://api.github.com/repos/JeschkeLab/DeerLa...,https://github.com/JeschkeLab/DeerLab/commit/5...,https://api.github.com/repos/JeschkeLab/DeerLa...,"{'login': 'HKaras', 'id': 21962092, 'node_id':...","{'login': 'web-flow', 'id': 19864447, 'node_id...",[{'sha': '4ae181ffc3016f40e552757ff274c0f3d219...,JeschkeLab/DeerLab,"Remove 3.8, Require Numpy 2.0 (#483)\n\n\r\n* ...",HKaras,web-flow,2024-09-03T15:36:26Z,2024-09-03T15:36:26Z,True
2,4ae181ffc3016f40e552757ff274c0f3d219e46d,C_kwDOEIKXK9oAKDRhZTE4MWZmYzMwMTZmNDBlNTUyNzU3...,"{'author': {'name': 'Hugo Karas', 'email': 'hk...",https://api.github.com/repos/JeschkeLab/DeerLa...,https://github.com/JeschkeLab/DeerLab/commit/4...,https://api.github.com/repos/JeschkeLab/DeerLa...,"{'login': 'HKaras', 'id': 21962092, 'node_id':...","{'login': 'web-flow', 'id': 19864447, 'node_id...",[{'sha': '178249ef8cc28700e41d9293f7a3cb926d52...,JeschkeLab/DeerLab,Sophegrid expansion (#482),HKaras,web-flow,2024-09-03T00:19:14Z,2024-09-03T00:19:14Z,True
3,178249ef8cc28700e41d9293f7a3cb926d52d8ad,C_kwDOEIKXK9oAKDE3ODI0OWVmOGNjMjg3MDBlNDFkOTI5...,"{'author': {'name': 'Hugo Karas', 'email': 'hk...",https://api.github.com/repos/JeschkeLab/DeerLa...,https://github.com/JeschkeLab/DeerLab/commit/1...,https://api.github.com/repos/JeschkeLab/DeerLa...,"{'login': 'HKaras', 'id': 21962092, 'node_id':...","{'login': 'web-flow', 'id': 19864447, 'node_id...",[{'sha': '534495e9107565753e622ea438e7b59e41b6...,JeschkeLab/DeerLab,Regparam grid bug fix (#477)\n\n* Update for 3...,HKaras,web-flow,2024-07-15T11:50:49Z,2024-07-15T11:50:49Z,True
4,534495e9107565753e622ea438e7b59e41b6817d,C_kwDOEIKXK9oAKDUzNDQ5NWU5MTA3NTY1NzUzZTYyMmVh...,"{'author': {'name': 'Hugo Karas', 'email': 'hk...",https://api.github.com/repos/JeschkeLab/DeerLa...,https://github.com/JeschkeLab/DeerLab/commit/5...,https://api.github.com/repos/JeschkeLab/DeerLa...,"{'login': 'HKaras', 'id': 21962092, 'node_id':...","{'login': 'web-flow', 'id': 19864447, 'node_id...",[{'sha': '21df49d0096580b67df914ef77565a1d1cad...,JeschkeLab/DeerLab,Numpy2 support (#479)\n\n* Update for 3.12\n\n...,HKaras,web-flow,2024-07-02T05:57:54Z,2024-07-02T05:57:54Z,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
492,cfa9ee9bfa7486382482c2edfc7c04e2fd7cd553,MDY6Q29tbWl0Mjc2OTkzODM1OmNmYTllZTliZmE3NDg2Mz...,"{'author': {'name': 'Luis Fabregas', 'email': ...",https://api.github.com/repos/JeschkeLab/DeerLa...,https://github.com/JeschkeLab/DeerLab/commit/c...,https://api.github.com/repos/JeschkeLab/DeerLa...,"{'login': 'luisfabib', 'id': 48292540, 'node_i...","{'login': 'luisfabib', 'id': 48292540, 'node_i...",[{'sha': '56b48bac1b3c91ad615b12614e55cb35e2a2...,JeschkeLab/DeerLab,minor edits,luisfabib,luisfabib,2020-07-04T16:40:45Z,2020-07-04T16:40:45Z,True
493,56b48bac1b3c91ad615b12614e55cb35e2a23bd7,MDY6Q29tbWl0Mjc2OTkzODM1OjU2YjQ4YmFjMWIzYzkxYW...,"{'author': {'name': 'Luis Fabregas', 'email': ...",https://api.github.com/repos/JeschkeLab/DeerLa...,https://github.com/JeschkeLab/DeerLab/commit/5...,https://api.github.com/repos/JeschkeLab/DeerLa...,"{'login': 'luisfabib', 'id': 48292540, 'node_i...","{'login': 'luisfabib', 'id': 48292540, 'node_i...",[{'sha': '36a45ae97dfffcee2a29a9a4da00e0a9aefb...,JeschkeLab/DeerLab,added selregparam function,luisfabib,luisfabib,2020-07-04T15:05:33Z,2020-07-04T15:05:33Z,True
494,36a45ae97dfffcee2a29a9a4da00e0a9aefb68fe,MDY6Q29tbWl0Mjc2OTkzODM1OjM2YTQ1YWU5N2RmZmZjZW...,"{'author': {'name': 'Luis Fabregas', 'email': ...",https://api.github.com/repos/JeschkeLab/DeerLa...,https://github.com/JeschkeLab/DeerLab/commit/3...,https://api.github.com/repos/JeschkeLab/DeerLa...,"{'login': 'luisfabib', 'id': 48292540, 'node_i...","{'login': 'luisfabib', 'id': 48292540, 'node_i...",[{'sha': 'e74486046004a3211ec3b02d423f265e3f28...,JeschkeLab/DeerLab,added whiteguassnoise function,luisfabib,luisfabib,2020-07-04T13:30:32Z,2020-07-04T13:30:32Z,True
495,e74486046004a3211ec3b02d423f265e3f28dc93,MDY6Q29tbWl0Mjc2OTkzODM1OmU3NDQ4NjA0NjAwNGEzMj...,"{'author': {'name': 'Stefan Stoll', 'email': '...",https://api.github.com/repos/JeschkeLab/DeerLa...,https://github.com/JeschkeLab/DeerLab/commit/e...,https://api.github.com/repos/JeschkeLab/DeerLa...,"{'login': 'stestoll', 'id': 9491761, 'node_id'...","{'login': 'stestoll', 'id': 9491761, 'node_id'...",[{'sha': '781a1808f59adf9acc78f38b3e3535a39eac...,JeschkeLab/DeerLab,add basic regularization functions,stestoll,stestoll,2020-07-04T05:02:38Z,2020-07-04T05:02:38Z,True


In [31]:
import githubanalysis.processing.setup_github_auth as ghauth
import requests
from requests.adapters import HTTPAdapter, Retry
import logging
import traceback
import pandas as pd
import datetime
from datetime import datetime
import numpy as np

import utilities.get_default_logger as loggit
import utilities.chunker as chunker

import githubanalysis.processing.get_all_pages_commits 
from githubanalysis.processing.get_all_pages_commits import CommitsGetter 


logger = loggit.get_default_logger(console=True, set_level_to='DEBUG', log_name='../../logs/get_all_pages_commits_NOTEBOOK_logs.txt')  
commits_getter = CommitsGetter(logger)

config_path='../../githubanalysis/config.cfg'
repo_name = 'JeschkeLab/DeerLab'
write_out_location='../../data/'
out_filename='testfile_all-commits'
per_pg=100

current_date_info = datetime.now().strftime("%Y-%m-%d") # run this at start of script not in loop to avoid midnight/long-run commits
sanitised_repo_name = repo_name.replace("/", "-")
write_out = f'{write_out_location}{out_filename}_{sanitised_repo_name}'
write_out_extra_info = f"{write_out}_{current_date_info}.csv"  

write_out_extra_info_json = f"{write_out}_{current_date_info}.json"

gh_token = ghauth.setup_github_auth(config_path=config_path)
headers = {'Authorization': 'token ' + gh_token}

s = requests.Session()
retries = Retry(total=10, connect=5, read=3, backoff_factor=1.5, status_forcelist=[202, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))


# create empty df to store commits data
all_commits = pd.DataFrame()


#sha="34e5a3a1d2395b7ad517ac323024537db3b10785" # main branch  
#sha="5496cff4fcdbad6be1684deba99870cad02b76aa" # v1.0 release
#sha="34e5a3a1d2395b7ad517ac323024537db3b10785"
#sha="a4f5ec07336ff454978541146019378fecc0ae11" # gh-pages
#sha="gh-pages"
sha="main"

page = 1 # try first page only
repos_api_url = "https://api.github.com/repos/"
commits_url = f"{repos_api_url}{repo_name}/commits?sha={sha}&per_page={per_pg}&page={page}"
    # per_page=30 by default on GH, set to max (100)  
# important bit: API request with auth headers  
api_response = s.get(url=commits_url, headers=headers)
api_response.status_code

commit_links = api_response.links
store_pg = pd.DataFrame()
json_pg = {} # create empty json storage object
pg_count = 0

try: 
    # pg = 1
    # commits_query = f"https://api.github.com/repos/{repo_name}/commits?per_page={per_pg}&page={pg}"

    if 'last' in commit_links:
        commit_links_last = commit_links['last']['url'].split("&page=")[1]
        pages_commits = int(commit_links_last)

        pg_range = range(1, (pages_commits+1))

        for i in pg_range: 
                #pg = i
            pg_count += 1
            logger.info(f">> Running commit grab for repo {repo_name}, in page {pg_count} of {pages_commits}.")        
            page = i
            commits_query = f"https://api.github.com/repos/{repo_name}/commits?sha={sha}&per_page={per_pg}&page={page}"


            logger.debug(f"Commits query for page {pg_count} is {commits_query}")
            api_response = s.get(url=commits_query, headers=headers)
            json_pg = api_response.json()
            if not json_pg: # check emptiness of result.
                logger.debug(f"Result of api_response.json() is empty list.")
                logger.error(f"Result of API request is an empty json. Error - cannot currently handle this result nicely. Traceback: {traceback.format_exc()}")
            store_pg = pd.DataFrame.from_dict(json_pg)  # convert json to pd.df
              # using pd.DataFrame.from_dict(json) instead of pd.read_json(url) because otherwise I lose rate handling 

            if len(store_pg.index) > 0:
                try:
                    store_pg['repo_name'] = repo_name
                    store_pg['commit_message'] = pd.DataFrame.from_dict(store_pg['commit']).apply(lambda x: [x.get('message') for x in x])
                    store_pg['author_dev'] = store_pg['author'].str['login']   # via: @Mozway on SO: https://stackoverflow.com/a/71782066
                    store_pg['committer_dev'] = store_pg['committer'].str['login'] # via: @Mozway on SO: https://stackoverflow.com/a/71782066
                    store_pg['author_date'] = pd.DataFrame.from_dict(store_pg['commit']).apply(lambda x: [x.get('author').get('date') for x in x])
                    store_pg['committer_date'] = pd.DataFrame.from_dict(store_pg['commit']).apply(lambda x: [x.get('committer').get('date') for x in x])
                    store_pg['same_date'] = np.where((store_pg['author_date'] == store_pg['committer_date']), True, False)
                except Exception as e_pages: 
                    logger.debug(f"There seems to be some issue: {e_pages}. Traceback: {traceback.format_exc()}")

                # write out 'completed' page of commits as df to csv via APPEND (use added date filename with reponame inc)
                #store_pg.to_csv(write_out_extra_info, mode='a', index=True, header= not os.path.exists(write_out_extra_info))
                all_commits = pd.concat([all_commits, store_pg], ) # append this page (df) to main commits df

                store_pg = pd.DataFrame() # empty the df of last page      
                            
                
            else: # there's no next page, grab all on this page and proceed.
                    pg_count += 1
                    commits_query = f"https://api.github.com/repos/{repo_name}/commits?sha={sha}&per_page={per_pg}&page={page}"
                    logger.debug(f"getting json via request url {commits_query}.")
                    api_response = s.get(url=commits_query, headers=headers)
                    json_pg = api_response.json()
                    if not json_pg: # check emptiness of result.
                        logger.debug(f"Result of api_response.json() is empty list.")
                        logger.error(f"Result of API request is an empty json. Error - cannot currently handle this result nicely. Traceback: {traceback.format_exc()}")
                    store_pg = pd.DataFrame.from_dict(json_pg)

                    if len(store_pg.index) > 0:
                            try:
                                store_pg['repo_name'] = repo_name
                                store_pg['commit_message'] = pd.DataFrame.from_dict(store_pg['commit']).apply(lambda x: [x.get('message') for x in x])
                                store_pg['author_dev'] = store_pg['author'].str['login']   # via: @Mozway on SO: https://stackoverflow.com/a/71782066
                                store_pg['committer_dev'] = store_pg['committer'].str['login'] # via: @Mozway on SO: https://stackoverflow.com/a/71782066
                                store_pg['author_date'] = pd.DataFrame.from_dict(store_pg['commit']).apply(lambda x: [x.get('author').get('date') for x in x])
                                store_pg['committer_date'] = pd.DataFrame.from_dict(store_pg['commit']).apply(lambda x: [x.get('committer').get('date') for x in x])
                                store_pg['same_date'] = np.where((store_pg['author_date'] == store_pg['committer_date']), True, False)
                            except Exception as e_empty: 
                                logger.debug(f"There seem to be no commits on the only page of the query... {e_empty}. Traceback: {traceback.format_exc()}")

                    all_commits = store_pg
                    # write out the page content to csv via APPEND (use added date filename)
                    
                    all_commits.to_json(write_out_extra_info_json, orient='records', date_format='iso')
                    logger.info(f"Commits data written out to JSON file for repo {repo_name} at {write_out_extra_info_json}")
                    
                
            logger.info(f"Total number of commits grabbed is {len(all_commits.index)} in {pg_count} page(s).")
            logger.info(f"Commits data written out to file for repo {repo_name} at {write_out_extra_info}.")

except Exception as e_commits:
    logger.error(f"Something failed in getting commits for repo {repo_name}: {e_commits}. API response was: {api_response}. Traceback: {traceback.format_exc()}")


INFO:>> Running commit grab for repo JeschkeLab/DeerLab, in page 1 of 5.
INFO:Total number of commits grabbed is 100 in 1 page(s).
INFO:Commits data written out to file for repo JeschkeLab/DeerLab at ../../data/testfile_all-commits_JeschkeLab-DeerLab_2024-09-21.csv.
INFO:>> Running commit grab for repo JeschkeLab/DeerLab, in page 2 of 5.
INFO:Total number of commits grabbed is 200 in 2 page(s).
INFO:Commits data written out to file for repo JeschkeLab/DeerLab at ../../data/testfile_all-commits_JeschkeLab-DeerLab_2024-09-21.csv.
INFO:>> Running commit grab for repo JeschkeLab/DeerLab, in page 3 of 5.
INFO:Total number of commits grabbed is 300 in 3 page(s).
INFO:Commits data written out to file for repo JeschkeLab/DeerLab at ../../data/testfile_all-commits_JeschkeLab-DeerLab_2024-09-21.csv.
INFO:>> Running commit grab for repo JeschkeLab/DeerLab, in page 4 of 5.
INFO:Total number of commits grabbed is 400 in 4 page(s).
INFO:Commits data written out to file for repo JeschkeLab/DeerLab at

In [25]:
len(all_commits)

497

In [3]:
all_commits

NameError: name 'all_commits' is not defined

In [4]:
#https://docs.github.com/en/rest/branches/branches?apiVersion=2022-11-28

import githubanalysis.processing.setup_github_auth as ghauth
import requests
from requests.adapters import HTTPAdapter, Retry
import logging
import traceback
import pandas as pd
import datetime
from datetime import datetime


config_path='../../githubanalysis/config.cfg'
repo_name = 'JeschkeLab/DeerLab'
write_out_location='../../data/'
out_filename='testfile_all-commits'
per_pg=100


repos_api_url = "https://api.github.com/repos/"
api_call = f"{repos_api_url}{repo_name}/branches"

gh_token = ghauth.setup_github_auth(config_path=config_path)
headers = {'Authorization': 'token ' + gh_token}

s = requests.Session()
retries = Retry(total=10, connect=5, read=3, backoff_factor=1.5, status_forcelist=[202, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))


api_response = s.get(url=api_call, headers=headers)
api_response.status_code

200

In [6]:
api_response.json()

[{'name': 'bootsrap_uncertainty_bug',
  'commit': {'sha': '2b7e527c50e34f97455e49a598b6fdc46eb78c58',
   'url': 'https://api.github.com/repos/JeschkeLab/DeerLab/commits/2b7e527c50e34f97455e49a598b6fdc46eb78c58'},
  'protected': False},
 {'name': 'gh-pages',
  'commit': {'sha': 'a4f5ec07336ff454978541146019378fecc0ae11',
   'url': 'https://api.github.com/repos/JeschkeLab/DeerLab/commits/a4f5ec07336ff454978541146019378fecc0ae11'},
  'protected': False},
 {'name': 'gha_darker',
  'commit': {'sha': '24c8fe641b1ad7ae915f9bb5315fcdc493284982',
   'url': 'https://api.github.com/repos/JeschkeLab/DeerLab/commits/24c8fe641b1ad7ae915f9bb5315fcdc493284982'},
  'protected': False},
 {'name': 'main',
  'commit': {'sha': '34e5a3a1d2395b7ad517ac323024537db3b10785',
   'url': 'https://api.github.com/repos/JeschkeLab/DeerLab/commits/34e5a3a1d2395b7ad517ac323024537db3b10785'},
  'protected': True},
 {'name': 'negative_P',
  'commit': {'sha': '14253d711515f7c5bdf179af0f934db4c280a104',
   'url': 'https://

In [7]:
branchesdf = pd.DataFrame.from_dict(api_response.json())

In [8]:
len(branchesdf)

13

In [32]:
branchesdf


Unnamed: 0,name,commit,protected
0,bootsrap_uncertainty_bug,{'sha': '2b7e527c50e34f97455e49a598b6fdc46eb78...,False
1,gh-pages,{'sha': 'a4f5ec07336ff454978541146019378fecc0a...,False
2,gha_darker,{'sha': '24c8fe641b1ad7ae915f9bb5315fcdc493284...,False
3,main,{'sha': '34e5a3a1d2395b7ad517ac323024537db3b10...,True
4,negative_P,{'sha': '14253d711515f7c5bdf179af0f934db4c280a...,False
5,release/v0.13,{'sha': 'dd71314ec74d4fe255d3301e46ad7568a0ca7...,True
6,release/v0.14,{'sha': 'fcdb450d1241e3c6d232bf0eb2bad3a61eaef...,True
7,release/v1.0,{'sha': '5496cff4fcdbad6be1684deba99870cad02b7...,True
8,release/v1.0.1,{'sha': '5c40569959ed684cb93305e06ef30c08961f5...,True
9,release/v1.1,{'sha': '2546b4290fe4530575462fb343182e29bba06...,True


In [5]:
import pandas as pd

import githubanalysis.processing.get_branches as branchgetter

repo_name = 'JeschkeLab/DeerLab'

branch_df = branchgetter.get_branches(repo_name=repo_name, config_path='../../githubanalysis/config.cfg', per_pg=100)

In [6]:
branch_df

Unnamed: 0,name,commit,protected,branch_sha
0,bootsrap_uncertainty_bug,{'sha': '2b7e527c50e34f97455e49a598b6fdc46eb78...,False,2b7e527c50e34f97455e49a598b6fdc46eb78c58
1,gh-pages,{'sha': 'a4f5ec07336ff454978541146019378fecc0a...,False,a4f5ec07336ff454978541146019378fecc0ae11
2,gha_darker,{'sha': '24c8fe641b1ad7ae915f9bb5315fcdc493284...,False,24c8fe641b1ad7ae915f9bb5315fcdc493284982
3,main,{'sha': '34e5a3a1d2395b7ad517ac323024537db3b10...,True,34e5a3a1d2395b7ad517ac323024537db3b10785
4,negative_P,{'sha': '14253d711515f7c5bdf179af0f934db4c280a...,False,14253d711515f7c5bdf179af0f934db4c280a104
5,release/v0.13,{'sha': 'dd71314ec74d4fe255d3301e46ad7568a0ca7...,True,dd71314ec74d4fe255d3301e46ad7568a0ca7ea3
6,release/v0.14,{'sha': 'fcdb450d1241e3c6d232bf0eb2bad3a61eaef...,True,fcdb450d1241e3c6d232bf0eb2bad3a61eaefc72
7,release/v1.0,{'sha': '5496cff4fcdbad6be1684deba99870cad02b7...,True,5496cff4fcdbad6be1684deba99870cad02b76aa
8,release/v1.0.1,{'sha': '5c40569959ed684cb93305e06ef30c08961f5...,True,5c40569959ed684cb93305e06ef30c08961f59f9
9,release/v1.1,{'sha': '2546b4290fe4530575462fb343182e29bba06...,True,2546b4290fe4530575462fb343182e29bba069a3


In [7]:
branch_df.branch_sha

0     2b7e527c50e34f97455e49a598b6fdc46eb78c58
1     a4f5ec07336ff454978541146019378fecc0ae11
2     24c8fe641b1ad7ae915f9bb5315fcdc493284982
3     34e5a3a1d2395b7ad517ac323024537db3b10785
4     14253d711515f7c5bdf179af0f934db4c280a104
5     dd71314ec74d4fe255d3301e46ad7568a0ca7ea3
6     fcdb450d1241e3c6d232bf0eb2bad3a61eaefc72
7     5496cff4fcdbad6be1684deba99870cad02b76aa
8     5c40569959ed684cb93305e06ef30c08961f59f9
9     2546b4290fe4530575462fb343182e29bba069a3
10    178249ef8cc28700e41d9293f7a3cb926d52d8ad
11    34e5a3a1d2395b7ad517ac323024537db3b10785
12    2a420338f2e9c4c3121f4937747a50caf0509959
Name: branch_sha, dtype: object

In [10]:
len(branch_df.branch_sha)

13

In [24]:
branches = pd.DataFrame()

In [28]:
branch_df.commit.keys

<bound method Series.keys of 0     {'sha': '2b7e527c50e34f97455e49a598b6fdc46eb78...
1     {'sha': 'a4f5ec07336ff454978541146019378fecc0a...
2     {'sha': '24c8fe641b1ad7ae915f9bb5315fcdc493284...
3     {'sha': '34e5a3a1d2395b7ad517ac323024537db3b10...
4     {'sha': '14253d711515f7c5bdf179af0f934db4c280a...
5     {'sha': 'dd71314ec74d4fe255d3301e46ad7568a0ca7...
6     {'sha': 'fcdb450d1241e3c6d232bf0eb2bad3a61eaef...
7     {'sha': '5496cff4fcdbad6be1684deba99870cad02b7...
8     {'sha': '5c40569959ed684cb93305e06ef30c08961f5...
9     {'sha': '2546b4290fe4530575462fb343182e29bba06...
10    {'sha': '178249ef8cc28700e41d9293f7a3cb926d52d...
11    {'sha': '34e5a3a1d2395b7ad517ac323024537db3b10...
12    {'sha': '2a420338f2e9c4c3121f4937747a50caf0509...
Name: commit, dtype: object>

In [20]:
# get 1st page of commits for branches, check if they're ALL the same commit hashes or not
import pandas as pd
import githubanalysis.processing.setup_github_auth as ghauth
import requests
from requests.adapters import HTTPAdapter, Retry
import logging
import traceback


config_path='../../githubanalysis/config.cfg'
repo_name = 'JeschkeLab/DeerLab'
per_pg=100
page=1


gh_token = ghauth.setup_github_auth(config_path=config_path)
headers = {'Authorization': 'token ' + gh_token}

s = requests.Session()
retries = Retry(total=10, connect=5, read=3, backoff_factor=1.5, status_forcelist=[202, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))


store_pg = pd.DataFrame() # empty df

for branch in branch_df.branch_sha:
    commits_query = f"https://api.github.com/repos/{repo_name}/commits?sha={branch}&per_page={per_pg}&page={page}"
    api_response = s.get(url=commits_query, headers=headers)
    print(api_response)
    print(commits_query)
    json_pg = api_response.json()

    store_pg = pd.DataFrame.from_dict(json_pg)  # convert json to pd.df
    print((store_pg))

<Response [200]>
https://api.github.com/repos/JeschkeLab/DeerLab/commits?sha=2b7e527c50e34f97455e49a598b6fdc46eb78c58&per_page=100&page=1
                                         sha  \
0   2b7e527c50e34f97455e49a598b6fdc46eb78c58   
1   691cdf0bd8cfb925545e3c8b2546af2f4f6de192   
2   14668464a63904187da84047f937ee6524c69bb0   
3   b8ac0eccb3aca94f5fd0956a953644f9be808c5f   
4   4ae39640c5947c2c21d5f635fd35cd75826dc20b   
..                                       ...   
95  6f80a79186b7064d848b86a75ed3a0f60a0315f0   
96  adcc09f2dfb1c434bf80251e47a1be7f8d053a9b   
97  9161e8a4320240f3c017c8fa99ff8e8d528f6e60   
98  2844b542613231e73e03631d9e62636396ebceac   
99  9f5006715ad35dfff02c381b0d25316abe23abbc   

                                              node_id  \
0   C_kwDOEIKXK9oAKDJiN2U1MjdjNTBlMzRmOTc0NTVlNDlh...   
1   C_kwDOEIKXK9oAKDY5MWNkZjBiZDhjZmI5MjU1NDVlM2M4...   
2   C_kwDOEIKXK9oAKDE0NjY4NDY0YTYzOTA0MTg3ZGE4NDA0...   
3   C_kwDOEIKXK9oAKGI4YWMwZWNjYjNhY2E5NGY1ZmQwOTU2...   


<Response [200]>
https://api.github.com/repos/JeschkeLab/DeerLab/commits?sha=24c8fe641b1ad7ae915f9bb5315fcdc493284982&per_page=100&page=1
                                         sha  \
0   24c8fe641b1ad7ae915f9bb5315fcdc493284982   
1   8b4e91ac02fb4ad88cdb41865b00ee304a9a710b   
2   738f6dd3d60c33bef96030529064f3a86632214e   
3   3947c1c91e489a8aaf5fdb6e0fd19a202de59a83   
4   8dd79ae90b5926ab479ed3ba33b8ebeaa2f80789   
..                                       ...   
95  c641f28f187d020bb66610a35757f838c9c33dab   
96  2f06bdaf4a16ef7f78b260374c2e443fbb2d55c1   
97  a74a4162a9ce67ffcbbd8a5bbe0ebf619772c8f7   
98  84fbb2ecf56344f7f44295be3b4c7a5db6cb5ba5   
99  d79453ce76a8db54c4e5b044bebe786f64dae9dc   

                                              node_id  \
0   C_kwDOEIKXK9oAKDI0YzhmZTY0MWIxYWQ3YWU5MTVmOWJi...   
1   C_kwDOEIKXK9oAKDhiNGU5MWFjMDJmYjRhZDg4Y2RiNDE4...   
2   C_kwDOEIKXK9oAKDczOGY2ZGQzZDYwYzMzYmVmOTYwMzA1...   
3   C_kwDOEIKXK9oAKDM5NDdjMWM5MWU0ODlhOGFhZjVmZGI2...   


<Response [200]>
https://api.github.com/repos/JeschkeLab/DeerLab/commits?sha=14253d711515f7c5bdf179af0f934db4c280a104&per_page=100&page=1
                                         sha  \
0   14253d711515f7c5bdf179af0f934db4c280a104   
1   b9b76e497bdf0462c02a39b51179561010b53208   
2   f7e0340d259cb2c2f5a237ffb0d27b1a7657de25   
3   be55a177e0fe5d4d1eedc72c1225bfb93d8bfc8a   
4   7d1116016e270cffa43bf7410da6eb006a4199e6   
..                                       ...   
95  c998479dc22925003efd3cf5d6bbfd02094eb897   
96  b99c73151b9c0c69036ce4178f2987e81ee58663   
97  60eb505fa91316a8987606d82b3d63418133162c   
98  820b80dda4703b7a1025ef108a63ff2ba8b89350   
99  8925df4d20ec9d1819157d9de657866a8b019ec4   

                                              node_id  \
0   C_kwDOEIKXK9oAKDE0MjUzZDcxMTUxNWY3YzViZGYxNzlh...   
1   C_kwDOEIKXK9oAKGI5Yjc2ZTQ5N2JkZjA0NjJjMDJhMzli...   
2   C_kwDOEIKXK9oAKGY3ZTAzNDBkMjU5Y2IyYzJmNWEyMzdm...   
3   C_kwDOEIKXK9oAKGJlNTVhMTc3ZTBmZTVkNGQxZWVkYzcy...   


<Response [200]>
https://api.github.com/repos/JeschkeLab/DeerLab/commits?sha=fcdb450d1241e3c6d232bf0eb2bad3a61eaefc72&per_page=100&page=1
                                         sha  \
0   fcdb450d1241e3c6d232bf0eb2bad3a61eaefc72   
1   56856d9e282c6129221d232fe4522594b6990642   
2   11f4b1dc5a56957452f3782cc6f726822dcd508c   
3   384deb632dd15680df3bdd4626491122b51ff79e   
4   27675503ae926deb4cd3934e53fb854812037378   
..                                       ...   
95  200320e602a7a610bfee3ff5c846bb258594a277   
96  df26c16d6d4c59e95e9160078493b23782f7c767   
97  1cd1e7a7a9010f2b4ca79981ed9ce75a1db24558   
98  3bf0a39d979828a61d77db6895220684a7608bf3   
99  98af07d93fd5cd7edbd7e1822c7e5dcebceebe88   

                                              node_id  \
0   C_kwDOEIKXK9oAKGZjZGI0NTBkMTI0MWUzYzZkMjMyYmYw...   
1   C_kwDOEIKXK9oAKDU2ODU2ZDllMjgyYzYxMjkyMjFkMjMy...   
2   C_kwDOEIKXK9oAKDExZjRiMWRjNWE1Njk1NzQ1MmYzNzgy...   
3   C_kwDOEIKXK9oAKDM4NGRlYjYzMmRkMTU2ODBkZjNiZGQ0...   


<Response [200]>
https://api.github.com/repos/JeschkeLab/DeerLab/commits?sha=5c40569959ed684cb93305e06ef30c08961f59f9&per_page=100&page=1
                                         sha  \
0   5c40569959ed684cb93305e06ef30c08961f59f9   
1   719c99d07141cfc0aa3d376b8d084e1605c193f9   
2   da387047cce5b574e37ba69beab21088f853e158   
3   9a65fdf549350a60526a1e63f79506e8d9dae51d   
4   99c4b348dc777fdcb727bd023030852b5e30191d   
..                                       ...   
95  06d19417e0c3ba256bdae67ed6bd88111fc5b325   
96  04913965c78769bc491f22518ced7cac1e798cf4   
97  71f8ec442e2320833301ef0540373387570cc01b   
98  fe7c15f81a2c1c3da210a433a1effe499ca47a6f   
99  069e9d55ba2807fd8dca39397e359b67e2527a6f   

                                              node_id  \
0   C_kwDOEIKXK9oAKDVjNDA1Njk5NTllZDY4NGNiOTMzMDVl...   
1   C_kwDOEIKXK9oAKDcxOWM5OWQwNzE0MWNmYzBhYTNkMzc2...   
2   C_kwDOEIKXK9oAKGRhMzg3MDQ3Y2NlNWI1NzRlMzdiYTY5...   
3   C_kwDOEIKXK9oAKDlhNjVmZGY1NDkzNTBhNjA1MjZhMWU2...   


<Response [200]>
https://api.github.com/repos/JeschkeLab/DeerLab/commits?sha=178249ef8cc28700e41d9293f7a3cb926d52d8ad&per_page=100&page=1
                                         sha  \
0   178249ef8cc28700e41d9293f7a3cb926d52d8ad   
1   534495e9107565753e622ea438e7b59e41b6817d   
2   21df49d0096580b67df914ef77565a1d1cada039   
3   a088f3164559b4d1def0b79fca5121b139b1e2b1   
4   fba1c4eb58d8e8b090b71836781c0bb3f4d8a34f   
..                                       ...   
95  cc4b5ee4a6046b4998a3aa21bb8ec6850372cec8   
96  5bf00865dc6f8cd20f93973b708687b365764de5   
97  beb7d28a77d41d838bb85a55bd5b49d9602cbe6a   
98  2e13c55252cbed8fb3f2c7acc719cd7b22f59f35   
99  3dbfc8bea7feed4e6fbc2b62457b14c93088808e   

                                              node_id  \
0   C_kwDOEIKXK9oAKDE3ODI0OWVmOGNjMjg3MDBlNDFkOTI5...   
1   C_kwDOEIKXK9oAKDUzNDQ5NWU5MTA3NTY1NzUzZTYyMmVh...   
2   C_kwDOEIKXK9oAKDIxZGY0OWQwMDk2NTgwYjY3ZGY5MTRl...   
3   C_kwDOEIKXK9oAKGEwODhmMzE2NDU1OWI0ZDFkZWYwYjc5...   


<Response [200]>
https://api.github.com/repos/JeschkeLab/DeerLab/commits?sha=2a420338f2e9c4c3121f4937747a50caf0509959&per_page=100&page=1
                                         sha  \
0   2a420338f2e9c4c3121f4937747a50caf0509959   
1   142a02630cc260cb519ff09afed717ac8f284f44   
2   b158b1e5ff519e4b7f8fdfaf479180a2ca0f711f   
3   a5085d0e6eef59e7156e798b6e61f8a3731db934   
4   d0a21ac4547250094097e250eb55cfc74b99f800   
..                                       ...   
95  3d8434e4eef2c95f37867ce0aca4ec7e63a3b7bd   
96  ba9dc349f655ca54067d3c6291c97fc219c1b135   
97  598eebc6ad416814d4b37846d1aa9fc650053df4   
98  27ff814fb8c2771740bd7928f283c1f70e88215c   
99  5c19035168b1e12dbe831a17a17cbc3243a9a25a   

                                              node_id  \
0   C_kwDOEIKXK9oAKDJhNDIwMzM4ZjJlOWM0YzMxMjFmNDkz...   
1   C_kwDOEIKXK9oAKDE0MmEwMjYzMGNjMjYwY2I1MTlmZjA5...   
2   C_kwDOEIKXK9oAKGIxNThiMWU1ZmY1MTllNGI3ZjhmZGZh...   
3   C_kwDOEIKXK9oAKGE1MDg1ZDBlNmVlZjU5ZTcxNTZlNzk4...   


In [1]:
# get 1st page of commits for branches, check if they're ALL the same commit hashes or not
import pandas as pd
import githubanalysis.processing.setup_github_auth as ghauth
import requests
from requests.adapters import HTTPAdapter, Retry
import logging
import traceback


import githubanalysis.processing.get_branches as branchgetter
import utilities.get_default_logger as loggit
from githubanalysis.processing.get_all_branches_commits import AllBranchesCommitsGetter

config_path='../../githubanalysis/config.cfg'
repo_name = 'JeschkeLab/DeerLab'
per_pg=100
page=1

branchlogger = loggit.get_default_logger(console=True, set_level_to='DEBUG', log_name='../../logs/get_all_branches_commits_NOTEBOOK_logs.txt')  
all_branches_commits_getter = AllBranchesCommitsGetter(branchlogger)

all_branches_commits_getter.get_all_branches_commits(repo_name=repo_name, config_path='../../githubanalysis/config.cfg', per_pg=100)



13
