# GitHub Commit Data: Getting, Processing and Analysing Pipeline

This notebook will:

- explore the processing and analysis of commit data from the GitHub API
- document how the project's commit data is obtained, reshaped and analysed
- explain the interactions between the project's functions and scripts
- clarify what commit data is stored and where
- justify why these steps are necessary for generating the dataset used in the Research Assistant Personas research work

## Overall plan

Aims:   

I need to obtain and analyse commit data for given GitHub repositories of research software (RS), in order to explore how the developers in those projects engage and interact with the codebase and with the associated GitHub (GH) development and project management tools.

I expect that developers will fall into at least two or three definable categories of behaviour, and that analysis of commits will be crucial to a) describing these categories, and b) assigning individual developers to them on the basis of their interactions with the repo.  

By exploring the data from many different RS repositories using scripts, I should be able to gather a large dataset to investigate these hypotheses.  

Overview:  


00) Setup, imports, logging  

01) Get RS repo name(s) for study (inclusion/exclusion steps)   
02) Check whether current data already exists; if so, use that      

03) Query API for repo(s) commits data   

04) Save out raw commits json for repo(s)  
05) Convert data to pandas format for analysis  

06) Calculate commits summary stats for repo(s)  

07) Slice commits data by commit author  

08) Calculate commits summary stats for author(s)  

09) Compare author commit stats to repo commit stats, looking for outliers or differences in commit frequency, change size, files changed, file types changed (vasilescu_variations_2014), keyword frequencies (hattori_nature_2008), etc.  

10) Report findings  
11) Return data in useful format for subsequent analyses / visualisation  
12) Save out data 

13) Visualisation of commits data for a) the dataset of repo(s); b) individual repo(s); c) individual authors (optional, perhaps only where outliers or interesting differences show up).
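
Step 02 (reuse existing data before querying the API) can be sketched as a small caching helper. This is only an illustration of the idea, not the project's implementation: the function name `load_or_fetch`, its parameters, and the use of a CSV file as the cache are assumptions.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_fetch(csv_path: Path, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return cached commits from csv_path if present; otherwise fetch and cache them.

    Hypothetical helper: `fetch` stands in for whatever function queries the API.
    """
    if csv_path.exists():
        # Step 02: reuse data already on disk instead of querying the API again.
        return pd.read_csv(csv_path)
    df = fetch()  # Steps 03-05: query the API and build a DataFrame.
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path, index=False)  # Save out so the next run can skip the query.
    return df
```

Passing the fetch step in as a callable keeps the cache check independent of how the API is actually queried.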






## File system Organisation  

REPO ROOT:   
`/` (coding-smart repo root)   

DOCS FOLDERS:    
`/docs/` - currently this holds .jpg files of UML diagrams referred to in the overall repo readme.   

OUTPUT FOLDERS:   
`/data/` - holds .csv files of data obtained from GH API calls.  
`/logs/` - holds .txt logfiles with name format `[functionname or scriptname]_(NOTEBOOK)_logs.txt`, where `NOTEBOOK` is optional and marks logs generated by Jupyter notebook function runs.  
`/images/` - holds .png files generated by python plotting and visualisation of the GH data.    

SOURCE CODE FOLDERS:    
`/utilities/` - contains general-purpose utility functions and scripts   
`/githubanalysis/` - contains functions and scripts relating to the GitHub API and subsequent data operations     
`/zenodocode/` - contains functions and scripts relating to the Zenodo API, used for identifying the GitHub repo names of RS repos      
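
The `/logs/` naming convention described above could be captured in a small helper, sketched here. The function `log_file_path` and its parameters are hypothetical, and the assumption that the optional `NOTEBOOK` part is simply dropped (giving `[name]_logs.txt`) for non-notebook runs is mine.

```python
from pathlib import Path


def log_file_path(name: str, from_notebook: bool = False, logs_dir: str = "logs") -> Path:
    """Build a log path following the `[name]_(NOTEBOOK)_logs.txt` convention.

    Hypothetical helper, not part of the repository.
    """
    # Assumption: the NOTEBOOK marker is omitted entirely for script runs.
    suffix = "_NOTEBOOK" if from_notebook else ""
    return Path(logs_dir) / f"{name}{suffix}_logs.txt"
```

For example, `log_file_path("get_all_pages_commits", from_notebook=True)` gives a path whose file name is `get_all_pages_commits_NOTEBOOK_logs.txt`.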
 

## Notebook Imports and Setup  

## Setup 

Ensure you are in the `coding-smart-github` conda environment and that the packages listed in the coding-smart repository's `requirements.txt` file are installed in it.  

### GitHub Authentication 

Create a classic personal access token by following [GitHub's instructions](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token#creating-a-personal-access-token-classic), then create a file called `config.cfg` in the `githubanalysis/` folder with the following content: 
```ini
[ACCESS]
token = <your-access-token>
```
Replace `<your-access-token>` with your token, but leave the `[ACCESS]` header and the `token = ` key unchanged.

## Import Python packages required for notebook 

In [1]:
import os
from os import path
import configparser
from github import Github
import pandas as pd

In [2]:
# set up github access token with github package: 

config = configparser.ConfigParser()
config.read('../config.cfg')
config.sections()

access_token = config['ACCESS']['token']
g = Github(access_token) 
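
`ConfigParser.read()` silently skips files it cannot find, so a wrong relative path only shows up later as a `KeyError` on `config['ACCESS']`. The sketch below shows one way to fail earlier with a clearer message. The function `read_access_token` is hypothetical, and treating a value starting with `<` as an unreplaced placeholder is an assumption based on the template above.

```python
import configparser


def read_access_token(config_path: str) -> str:
    """Read the GitHub token from the [ACCESS] section of a config file.

    Hypothetical helper, not part of the repository.
    """
    config = configparser.ConfigParser()
    # read() returns the list of files it parsed and skips missing ones,
    # so an empty list means the path is wrong.
    if not config.read(config_path):
        raise FileNotFoundError(f"No config file found at {config_path}")
    token = config.get("ACCESS", "token", fallback="").strip()
    # Assumption: a value starting with "<" is the unreplaced template placeholder.
    if not token or token.startswith("<"):
        raise ValueError("Config file has no usable token in [ACCESS]")
    return token
```

The returned string can then be passed to `Github(...)` as in the cell above.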

In [None]:
# import existing scripts from repo for use in notebook   

# logging:  

# GH API access/querying  



05) Convert data to pandas format for analysis  

06) Calculate commits summary stats for repo(s)  

07) Slice commits data by commit author  

08) Calculate commits summary stats for author(s)  

09) Compare author commit stats to repo commit stats, looking for outliers or differences in commit frequency, change size, files changed, file types changed (vasilescu_variations_2014), keyword frequencies (hattori_nature_2008), etc.  

In [None]:
# read in commits data from an existing data file  

# check it's in pandas format (05)  

# calculate a summary stat (e.g. changesize of commit; )
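
As a rough sketch of steps 06 to 09, the cell below computes a change-size statistic per repo and per author with pandas. The records are toy values made up for illustration, not data from any repository, and the column names (`author`, `additions`, `deletions`) are assumptions about how the commits table might be laid out.

```python
import pandas as pd

# Toy records for illustration only; not real data from any repository.
commits = pd.DataFrame(
    {
        "author": ["alice", "alice", "bob", "alice", "bob"],
        "additions": [10, 200, 5, 30, 15],
        "deletions": [2, 50, 1, 10, 5],
    }
)

# Change size per commit: lines added plus lines removed.
commits["change_size"] = commits["additions"] + commits["deletions"]

# Step 06: repo-level summary stats.
repo_stats = commits["change_size"].agg(["count", "mean", "median"])

# Steps 07-08: slice by author and summarise each slice.
author_stats = commits.groupby("author")["change_size"].agg(["count", "mean", "median"])

# Step 09: each author's mean change size relative to the repo-wide mean.
author_stats["mean_vs_repo"] = author_stats["mean"] / repo_stats["mean"]
```

A ratio well above or below 1 in `mean_vs_repo` is one simple way to flag authors whose commits differ in size from the repo as a whole. Other measures, such as commit frequency or file types changed, would follow the same group-then-compare pattern.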

In [3]:
# get number of commits for repo of repo_name

import requests
from urllib.parse import parse_qs, urlparse

class CommitsCount: 
    def get_commits_count(self, repo_name: str) -> int:
        """
        Returns the number of commits to a GitHub repository.

        Requests one commit per page, so the page number of the
        rel="last" pagination link equals the total number of commits.
        """
        url = f"https://api.github.com/repos/{repo_name}/commits?per_page=1"
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # fail loudly on missing repos, rate limiting, etc.
        links = r.links
        if "last" not in links:
            # No pagination links: everything fits on this single page.
            return len(r.json())
        rel_last_link_url = urlparse(links["last"]["url"])
        rel_last_link_url_args = parse_qs(rel_last_link_url.query)
        rel_last_link_url_page_arg = rel_last_link_url_args["page"][0]
        commits_count = int(rel_last_link_url_page_arg)
        return commits_count
    # code via https://brianli.com/2022/07/python-get-number-of-commits-github-repository/  

In [6]:
# get number of commits for 1 named (currently hardcoded) repo
c = CommitsCount()
c.get_commits_count(repo_name="JeschkeLab/DeerLab")

497
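
The counting trick in `get_commits_count` relies on the `Link` response header: with `per_page=1`, the page number in the `rel="last"` link equals the number of commits. The offline sketch below parses a header of that shape with the standard library only. The header string is constructed for illustration (the repository ID is a placeholder, and `page=497` simply mirrors the output above), and `last_page` is a hypothetical helper.

```python
import re
from urllib.parse import parse_qs, urlparse

# A Link header in the shape GitHub sends for paginated responses.
# Constructed for illustration: the repository ID is a placeholder.
link_header = (
    '<https://api.github.com/repositories/1/commits?per_page=1&page=2>; rel="next", '
    '<https://api.github.com/repositories/1/commits?per_page=1&page=497>; rel="last"'
)


def last_page(header: str) -> int:
    """Return the page number of the rel="last" link, or 1 if there is none."""
    match = re.search(r'<([^>]+)>;\s*rel="last"', header)
    if match is None:
        return 1  # no "last" link means the results fit on a single page
    query = parse_qs(urlparse(match.group(1)).query)
    return int(query["page"][0])


commits_count = last_page(link_header)  # with per_page=1, pages == commits
```

In the notebook cell, `requests` does this parsing itself and exposes the result as `r.links`.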

In [17]:
# import githubanalysis.processing.get_repo_connection as ghconn
# repo_con = ghconn.get_repo_connection(repo_name, config_path) 

In [25]:
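# check current function(s) for getting all commits for a repo   
# NB: `import a.b.c` binds only the top-level name `a`, so the bare call to
# get_all_pages_commits() below raises NameError; later cells import CommitsGetter directly.
import githubanalysis.processing.get_all_pages_commits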

repo_name = "JeschkeLab/DeerLab"
config_path='../../githubanalysis/config.cfg'

#all_commits = getallcommits.get_all_pages_commits(repo_name, config_path='../../githubanalysis/config.cfg', per_pg=100, verbose=True)

get_all_pages_commits(repo_name=repo_name, config_path=config_path, per_pg=100)

NameError: name 'get_all_pages_commits' is not defined

In [3]:
import utilities.get_default_logger as loggit
import utilities.chunker as chunker

import githubanalysis.processing.get_all_pages_commits 
from githubanalysis.processing.get_all_pages_commits import CommitsGetter 

In [4]:
repo_name = 'JeschkeLab/DeerLab'

In [5]:
#logger = loggit.get_default_logger(console=True, set_level_to='DEBUG', log_name='../../logs/get_all_pages_issues_NOTEBOOK_logs.txt')  
#issues_getter = IssueGetter(logger)
#iss_df = issues_getter.get_all_pages_issues(repo_name=item, config_path='../../githubanalysis/config.cfg', out_filename='all-issues', write_out_location='../../data/')

logger = loggit.get_default_logger(console=True, set_level_to='DEBUG', log_name='../../logs/get_all_pages_commits_NOTEBOOK_logs.txt')  
commits_getter = CommitsGetter(logger)

coms_df = commits_getter.get_all_pages_commits(repo_name=repo_name, config_path='../../githubanalysis/config.cfg', out_filename='all-commits', write_out_location='../../data/')

ERROR:Something failed in getting commits for repo JeschkeLab/DeerLab: local variable 'i' referenced before assignment. Traceback: Traceback (most recent call last):
  File "/home/eidf103/eidf103/flic/clonezone/coding-smart/githubanalysis/processing/get_all_pages_commits.py", line 98, in get_all_pages_commits
    commits_query = f"https://api.github.com/repos/{repo_name}/commits?per_page={per_pg}&page={i}"
UnboundLocalError: local variable 'i' referenced before assignment
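
The traceback shows the page variable `i` being used in the query URL before anything has assigned it. One way to avoid this class of error is to let a loop bind the page number, as in the sketch below. This is not the code in `get_all_pages_commits.py`: the function `commit_page_urls` is hypothetical, and it assumes the total commit count is already known (for example from `get_commits_count`).

```python
import math
from typing import Iterator


def commit_page_urls(repo_name: str, total_commits: int, per_pg: int = 100) -> Iterator[str]:
    """Yield one commits-API URL per page needed to cover total_commits.

    Hypothetical helper, not part of the repository.
    """
    n_pages = math.ceil(total_commits / per_pg)
    # The page number is bound by the loop itself, so it can never be
    # referenced before assignment. GitHub pages are numbered from 1.
    for page in range(1, n_pages + 1):
        yield f"https://api.github.com/repos/{repo_name}/commits?per_page={per_pg}&page={page}"
```

With 497 commits and `per_pg=100`, this yields five URLs, for pages 1 to 5.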



In [None]:
# load data from all commits for 1 repo .csv  

import gi