# GitHub Commit Data: Getting, Processing and Analysing Pipeline

This notebook will:

- explore the processing and analysis of commit data (from the GitHub API)
- act as documentation of how the project's commit data is obtained, reshaped and analysed
- explain the interactions between different project functions and scripts
- clarify what commit data is stored and where
- show why these steps are necessary for generating the dataset that will be used for the Research Assistant Personas research work

## Overall plan

Aims:   

I need to obtain and analyse commit data for given GitHub repositories of research software (RS), so that I can explore how the developers in those projects engage and interact with the codebase and with the associated GitHub development and project management tools.

I expect that developers will fall into at least two or three definable categories of behaviour, and that analysis of commits will be crucial for: a) describing these categories, and b) assigning individual developers to them on the basis of their interactions with the repo.  

By exploring the data from many different RS repositories using scripts, I should be able to gather a large dataset to investigate these hypotheses.  

Overview:  


00) Setup, imports, logging  

01) Get RS repo name(s) for study (inclusion/exclusion steps)   
02) Check whether current data already exists; if so, use that      

03) Query API for repo(s) commits data   

04) Save out raw commits json for repo(s)  
05) Convert data to pandas format for analysis  

06) Calculate commits summary stats for repo(s)  

07) Slice commits data by commit author  

08) Calculate commits summary stats for author(s)  

09) Compare author commits stats to repo commits stats, looking for outliers or differences in frequency of commits, change size, files changed, file types changed (vasilescu_variations_2014), keyword frequencies (hattori_nature_2008), etc.  

10) Report findings  
11) Return data in useful format for subsequent analyses / visualisation  
12) Save out data 

13) Visualisation of commits data for a) dataset of repo(s); b) individual repo(s); c) individual authors (optional, perhaps only where outliers or interesting differences show up); 
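
Steps 06 to 09 can be sketched with pandas on a toy table. The column names `author_username` and `n_changes` follow the dataframes used in this notebook; the data values and the ratio-based comparison are assumptions made for illustration only, not the project's analysis.

```python
import pandas as pd

# Toy commits table; the values are invented for illustration.
commits = pd.DataFrame(
    {
        "author_username": ["alice", "alice", "bob", "bob", "bob", "carol"],
        "n_changes": [2, 4, 40, 60, 50, 6],
    }
)

# Step 06: repo-level summary stat (median change size across all commits).
repo_median = commits["n_changes"].median()

# Steps 07-08: slice by author and summarise each slice.
author_stats = commits.groupby("author_username")["n_changes"].agg(
    n_commits="count", median_changes="median"
)

# Step 09: compare each author's median change size with the repo median.
author_stats["ratio_to_repo_median"] = author_stats["median_changes"] / repo_median

print(author_stats)
```

Authors whose ratio sits far from 1 would be candidates for a closer look; what counts as "far" is left open here.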






## File system Organisation  

REPO ROOT:   
`/` (coding-smart repo root)   

DOCS FOLDERS:    
`/docs/` - currently this holds .jpg files of UML diagrams referred to in the overall repo readme.   

OUTPUT FOLDERS:   
`/data/` - holds .csv files of data obtained from GH API calls.  
`/logs/` - holds .txt logfiles with name format `[functionname or scriptname]_(NOTEBOOK)_logs.txt`, where 'NOTEBOOK' is optional and marks logs generated by Jupyter notebook function runs.  
`/images/` - holds .png files generated by python plotting and visualisation of the GH data.    

SOURCE CODE FOLDERS:    
`/utilities/` - contains general-purpose utility functions and scripts   
`/githubanalysis/` - contains functions and scripts relating to the GitHub API and subsequent data operations     
`/zenodocode/` - contains functions and scripts relating to the Zenodo API for identifying RS repos' GitHub repo names      
 

## Notebook Imports and Setup  

## Setup 

Ensure you are in the `coding-smart-github` conda environment and that the packages installed in it match the `requirements.txt` file in the coding-smart repository.  

### GitHub Authentication 

Create a classic access token via [GitHub Authentication Settings](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token#creating-a-personal-access-token-classic) and create a file called `config.cfg` in the `githubanalysis/` folder with the following content: 
```ini
[ACCESS]
token = <your-access-token>
```
Replace `<your-access-token>` with your own token, but leave `[ACCESS]` and `token = ` unchanged.
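
A file in this layout can be read with Python's standard `configparser` module. The sketch below is an assumption about how the token might be loaded; the project's own `setup_github_auth` function may do this differently. The token value is a placeholder written to a temporary file, not a real credential.

```python
import configparser
import tempfile
from pathlib import Path

# Write a throwaway config file with the same layout as config.cfg.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.cfg"
    config_path.write_text("[ACCESS]\ntoken = example-token\n")

    # Read the token back from the [ACCESS] section.
    config = configparser.ConfigParser()
    config.read(config_path)
    token = config["ACCESS"]["token"]

# Authorization header in the same form as the commented-out request code in this notebook.
headers = {"Authorization": "token " + token}
print(headers["Authorization"])
```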


## Import Python packages required for notebook

In [1]:
import os
from os import path
import configparser
from github import Github
import pandas as pd
import requests
import datetime
from pprint import pprint
import re
from requests.adapters import HTTPAdapter, Retry

import utilities.get_default_logger as loggit
import githubanalysis.processing.setup_github_auth as ghauth
from githubanalysis.processing.get_all_branches_commits import AllBranchesCommitsGetter 
from githubanalysis.processing.get_commit_changes import CommitChanges
import githubanalysis.analysis.hattori_lanza_commit_size_classification as sizecat
from githubanalysis.analysis.hattori_lanza_commit_content_classification import Hattori_Lanza_Content_Classification

In [2]:
# https://docs.github.com/en/rest/branches/branches?apiVersion=2022-11-28

# set the date once at the start of the script, not inside a loop,
# so that a long run crossing midnight keeps a single date stamp
current_date_info = datetime.datetime.now().strftime("%Y-%m-%d")
repo_name = 'JeschkeLab/DeerLab'
write_out_location = '../../data/'
out_filename = 'commits_cats_stats'
sanitised_repo_name = repo_name.replace("/", "-")

# per_pg=100

# config_path='../../githubanalysis/config.cfg'
# repos_api_url = "https://api.github.com/repos/"
# api_call = f"{repos_api_url}{repo_name}/branches"

# gh_token = ghauth.setup_github_auth(config_path=config_path)
# headers = {'Authorization': 'token ' + gh_token}

# s = requests.Session()
# retries = Retry(total=10, connect=5, read=3, backoff_factor=1.5, status_forcelist=[202, 502, 503, 504])
# s.mount('https://', HTTPAdapter(max_retries=retries))


# api_response = s.get(url=api_call, headers=headers)
# api_response.status_code

## Get commits for 1 repo: JeschkeLab/DeerLab

## Get ALL commits from ALL branches for repo 'JeschkeLab/DeerLab' using get_all_branches_commits( )

In [3]:
# logger = loggit.get_default_logger(console=True, set_level_to='DEBUG', log_name='../../logs/get_all_branches_commits_NOTEBOOK_logs.txt')  

# repo_name = 'JeschkeLab/DeerLab'

# allbranchescommitsgetter = AllBranchesCommitsGetter(repo_name = repo_name, in_notebook=True, config_path='../../githubanalysis/config.cfg', logger=logger)

# all_branches_commits = allbranchescommitsgetter.get_all_branches_commits(repo_name=repo_name)

## Process data by feeding dataframe of all_branches_commits into reformat_commits_object( ) function

In [4]:
# from githubanalysis.processing.reformat_commits import CommitReformatter

# logger = loggit.get_default_logger(console=True, set_level_to='DEBUG', log_name='../../logs/commits_reformatters_NOTEBOOK_logs.txt')  

# reformat_commits = CommitReformatter(repo_name = repo_name, in_notebook=True, logger=logger)

# # feed dataframe of all_branches_commits into reformat_commits_object( ) function
# processed_commits = reformat_commits.reformat_commits_object(unique_commits_all_branches=all_branches_commits)

# reformat_commits.save_formatted_commits(write_out_location="data/")


## Load PROCESSED commits data 

In [5]:
# read in PROCESSED (aka dataframe format) commit data for example: 
# load processed data from JeschkeLab/DeerLab 
processed_deerlab = pd.read_csv("../../data/processed-commits_JeschkeLab-DeerLab_2024-09-28.csv")

In [6]:
processed_deerlab

Unnamed: 0.1,Unnamed: 0,repo_name,branch_sha,commit_sha,author_fullname,author_commit_date,comitter_commit_date,commit_message,author_username,comitter_username
0,0,JeschkeLab/DeerLab,2b7e527c50e34f97455e49a598b6fdc46eb78c58,2b7e527c50e34f97455e49a598b6fdc46eb78c58,Hugo Karas,2024-09-16T11:56:06Z,2024-09-16T11:56:06Z,General improvemements,HKaras,HKaras
1,1,JeschkeLab/DeerLab,2b7e527c50e34f97455e49a598b6fdc46eb78c58,691cdf0bd8cfb925545e3c8b2546af2f4f6de192,Hugo Karas,2024-09-16T11:55:37Z,2024-09-16T11:55:37Z,Add model copying,HKaras,HKaras
2,2,JeschkeLab/DeerLab,2b7e527c50e34f97455e49a598b6fdc46eb78c58,14668464a63904187da84047f937ee6524c69bb0,Hugo Karas,2024-09-13T14:01:44Z,2024-09-13T14:01:44Z,Bootrstap Uncertainty sampling reduction\n\nRe...,HKaras,HKaras
3,3,JeschkeLab/DeerLab,2b7e527c50e34f97455e49a598b6fdc46eb78c58,b8ac0eccb3aca94f5fd0956a953644f9be808c5f,Hugo Karas,2024-09-13T13:33:11Z,2024-09-13T13:33:11Z,Minor docstring update,HKaras,HKaras
4,4,JeschkeLab/DeerLab,2b7e527c50e34f97455e49a598b6fdc46eb78c58,4ae39640c5947c2c21d5f635fd35cd75826dc20b,Hugo Karas,2024-09-13T13:33:02Z,2024-09-13T13:33:02Z,Add test for modelUncert output,HKaras,HKaras
...,...,...,...,...,...,...,...,...,...,...
561,561,JeschkeLab/DeerLab,2546b4290fe4530575462fb343182e29bba069a3,2546b4290fe4530575462fb343182e29bba069a3,Hugo Karas,2023-11-03T11:20:13Z,2023-11-03T11:20:13Z,Push patch into V1.1 (#467)\n\n* Increase vers...,HKaras,web-flow
562,562,JeschkeLab/DeerLab,2a420338f2e9c4c3121f4937747a50caf0509959,2a420338f2e9c4c3121f4937747a50caf0509959,Hugo Karas,2023-05-03T20:29:51Z,2023-05-03T20:29:51Z,Adding cvx as an optional test,HKaras,web-flow
563,563,JeschkeLab/DeerLab,2a420338f2e9c4c3121f4937747a50caf0509959,142a02630cc260cb519ff09afed717ac8f284f44,Hugo Karas,2023-05-03T20:20:55Z,2023-05-03T20:20:55Z,Remove test for cvxopt,HKaras,web-flow
564,564,JeschkeLab/DeerLab,2a420338f2e9c4c3121f4937747a50caf0509959,b158b1e5ff519e4b7f8fdfaf479180a2ca0f711f,Hugo Karas,2023-05-03T20:08:31Z,2023-05-03T20:08:31Z,Update installation.rst,HKaras,web-flow


## Get changes per commit hash for each commit in df  

### Apply Vasilescu (filetype) categorisation method

In [7]:

logger = loggit.get_default_logger(console=True, set_level_to='DEBUG', log_name='../../logs/get_commit_changes_NOTEBOOK_logs.txt')  
commitchanges = CommitChanges(logger=logger, repo_name ="JeschkeLab/DeerLab", in_notebook=True, config_path='../../githubanalysis/config.cfg')


from githubanalysis.analysis.vasilescu_commit_files_classification import Vasilescu_Commit_Classifier

logger = loggit.get_default_logger(console=True, set_level_to='DEBUG', log_name='../../logs/vasilescu_commit_files_classification_NOTEBOOK_logs.txt')  
vasilescucommitclassifier = Vasilescu_Commit_Classifier(repo_name = "JeschkeLab/DeerLab", in_notebook=True, logger=logger, config_path='../../githubanalysis/config.cfg')

In [8]:
tmpdf1 = commitchanges.get_commit_changes(commit_hash="a5085d0e6eef59e7156e798b6e61f8a3731db934")
tmpdf1

Unnamed: 0,commit_hash,filename,changes,additions,deletions
0,a5085d0e6eef59e7156e798b6e61f8a3731db934,setup.py,1,0,1


In [9]:
tmpdf2= commitchanges.get_commit_changes(commit_hash="2b7e527c50e34f97455e49a598b6fdc46eb78c58")
tmpdf2

Unnamed: 0,commit_hash,filename,changes,additions,deletions
0,2b7e527c50e34f97455e49a598b6fdc46eb78c58,deerlab/classes.py,2,1,1
1,2b7e527c50e34f97455e49a598b6fdc46eb78c58,deerlab/fit.py,4,2,2
2,2b7e527c50e34f97455e49a598b6fdc46eb78c58,deerlab/fitresult.py,31,17,14


In [10]:
tmpdf3 = commitchanges.get_commit_changes(commit_hash="2546b4290fe4530575462fb343182e29bba069a3")
tmpdf3

Unnamed: 0,commit_hash,filename,changes,additions,deletions
0,2546b4290fe4530575462fb343182e29bba069a3,VERSION,2,1,1
1,2546b4290fe4530575462fb343182e29bba069a3,deerlab/dd_models.py,2,1,1
2,2546b4290fe4530575462fb343182e29bba069a3,deerlab/solvers.py,8,7,1
3,2546b4290fe4530575462fb343182e29bba069a3,deerlab/utils.py,8,4,4
4,2546b4290fe4530575462fb343182e29bba069a3,docsrc/source/changelog.rst,12,12,0
5,2546b4290fe4530575462fb343182e29bba069a3,docsrc/source/installation.rst,23,14,9
6,2546b4290fe4530575462fb343182e29bba069a3,examples/advanced/ex_long_threespin_analysis.py,95,0,95
7,2546b4290fe4530575462fb343182e29bba069a3,test/test_utils.py,16,16,0


In [11]:
vasilescucommitclassifier.vasilescu_commit_files_classification(commit_changes_df=tmpdf1)

('code', 'a5085d0e6eef59e7156e798b6e61f8a3731db934')

In [12]:
vasilescucommitclassifier.vasilescu_commit_files_classification(commit_changes_df=tmpdf2)

('code', '2b7e527c50e34f97455e49a598b6fdc46eb78c58')

In [13]:
vasilescucommitclassifier.vasilescu_commit_files_classification(commit_changes_df=tmpdf3)

('doc', '2546b4290fe4530575462fb343182e29bba069a3')

## Apply Vasilescu (filetype categorisation) method; get N files changed, N changes

In [None]:
logger = loggit.get_default_logger(console=True, set_level_to='DEBUG', log_name='../../logs/get_commit_changes_NOTEBOOK_logs.txt')  
commitchanges = CommitChanges(logger=logger, repo_name ="JeschkeLab/DeerLab", in_notebook=True, config_path='../../githubanalysis/config.cfg')

n_files: list[tuple[int, str]] = []
n_changes = []
v_category = []

total_commits = len(processed_deerlab)
for i, commit in enumerate(processed_deerlab['commit_sha'], start=1):
    print(f"{i} of {total_commits}")

    # get per-file changes for this commit hash
    tmpdf = commitchanges.get_commit_changes(commit_hash=commit)

    # count files changed and total changes for this commit
    n_files.append(commitchanges.get_commit_files_changed(commit_changes_df=tmpdf))
    n_changes.append(commitchanges.get_commit_total_changes(commit_changes_df=tmpdf))

    # apply Vasilescu et al commit classification (filetype) method:
    v_category.append(vasilescucommitclassifier.vasilescu_commit_files_classification(commit_changes_df=tmpdf))

1 of 566
2 of 566
3 of 566
...
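
The two per-commit counts collected in the loop can be illustrated directly on one commit-changes dataframe. This is a sketch under assumptions: `summarise_commit` is a hypothetical helper, not the project's `get_commit_files_changed` or `get_commit_total_changes`, and it assumes "total changes" means the sum of the `changes` column. The rows are copied from the `tmpdf2` output shown earlier.

```python
import pandas as pd

# Per-file changes for one commit, copied from the tmpdf2 output.
commit_changes_df = pd.DataFrame(
    {
        "commit_hash": ["2b7e527c50e34f97455e49a598b6fdc46eb78c58"] * 3,
        "filename": ["deerlab/classes.py", "deerlab/fit.py", "deerlab/fitresult.py"],
        "changes": [2, 4, 31],
        "additions": [1, 2, 17],
        "deletions": [1, 2, 14],
    }
)


def summarise_commit(df: pd.DataFrame) -> tuple[int, int, str]:
    """Hypothetical helper: return (files changed, total line changes, commit hash)."""
    commit_hash = df["commit_hash"].iloc[0]
    return int(df["filename"].nunique()), int(df["changes"].sum()), commit_hash


files_changed, total_changes, commit_hash = summarise_commit(commit_changes_df)
print(files_changed, total_changes)
```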

### Merge changestats and Vasilescu categories on commit hashes to main commits df

In [None]:
# generate changes_df of files changed from zipped lists of results;
# each list holds (value, commit_hash) tuples, so the hash is taken from n_files only
output = [
    [commit_hash, files_changed, number_changes, vasilescu_category]
    for (files_changed, commit_hash), (number_changes, _), (vasilescu_category, _)
    in zip(n_files, n_changes, v_category)
]
changes_df = pd.DataFrame(data=output, columns=["commit_sha", "files_changed", "n_changes", "vasilescu_category"])

# merge changes_df to main commits df
processed_commits = processed_deerlab.merge(changes_df, on="commit_sha", validate="one_to_one")
#processed_commits['v_categories'] = v_category
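
The zip-then-merge pattern in this cell can be tried on toy data. The commit hashes and values below are invented; only the column names and the `validate="one_to_one"` argument come from the cell itself. The last part shows what the validation guards against: a duplicated `commit_sha` on either side raises `pandas.errors.MergeError` instead of silently duplicating rows.

```python
import pandas as pd

# Per-commit results as (value, commit_sha) tuples; values are invented.
n_files = [(3, "aaa"), (1, "bbb")]
n_changes = [(37, "aaa"), (1, "bbb")]
v_category = [("code", "aaa"), ("doc", "bbb")]

rows = [
    [sha, files, changes, category]
    for (files, sha), (changes, _), (category, _) in zip(n_files, n_changes, v_category)
]
changes_df = pd.DataFrame(
    rows, columns=["commit_sha", "files_changed", "n_changes", "vasilescu_category"]
)

commits = pd.DataFrame(
    {"commit_sha": ["aaa", "bbb"], "commit_message": ["first message", "second message"]}
)

# Inner merge (the pandas default) on commit_sha, checking keys are unique on both sides.
merged = commits.merge(changes_df, on="commit_sha", validate="one_to_one")

# A duplicated commit_sha makes the one_to_one validation fail.
duplicated = pd.concat([changes_df, changes_df.iloc[[0]]])
try:
    commits.merge(duplicated, on="commit_sha", validate="one_to_one")
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True

print(merged)
print("duplicate keys rejected:", merge_failed)
```

Because the default merge is an inner join, any commit without a matching row in `changes_df` would be dropped from the result.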

In [None]:
processed_commits

## Run Hattori-Lanza CONTENT classification on commits df  

In [None]:
logger = loggit.get_default_logger(console=True, set_level_to='WARNING', log_name='../../logs/hattori_lanza_commit_content_classification_NOTEBOOK_logs.txt')  
hattorilanzaclassifier = Hattori_Lanza_Content_Classification(logger=logger)

results = []

for msg in processed_commits['commit_message']: 
    rslt = hattorilanzaclassifier.hattori_lanza_commit_content_classification(msg)
    results.append(rslt)
    
#print(results)

processed_commits['hattori_lanza_content_cat'] = results

processed_commits['hattori_lanza_content_cat'].value_counts(dropna=False)
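
As a rough picture of keyword-based message classification, the sketch below assigns a message to the category whose keyword appears earliest in it. Everything specific here is an assumption for illustration: the category names, the short keyword lists, and the earliest-match rule are not taken from the project's `Hattori_Lanza_Content_Classification` class and are not the full lists from the cited paper.

```python
import re
from typing import Optional

# Illustrative keyword subsets only; NOT the project's or the paper's full lists.
KEYWORDS = {
    "forward_engineering": ["add", "implement", "new"],
    "reengineering": ["refactor", "remove", "update"],
    "corrective_engineering": ["fix", "bug", "error"],
    "management": ["merge", "release", "license"],
}


def classify_message(message: str) -> Optional[str]:
    """Return the category whose keyword starts earliest in the message, or None."""
    text = message.lower()
    best_category, best_position = None, len(text) + 1
    for category, words in KEYWORDS.items():
        for word in words:
            # \b anchors the keyword at the start of a word, so "add" also matches "adding".
            match = re.search(r"\b" + re.escape(word), text)
            if match and match.start() < best_position:
                best_category, best_position = category, match.start()
    return best_category


for msg in ["Add model copying", "Remove test for cvxopt", "General improvemements"]:
    print(msg, "->", classify_message(msg))
```

Messages with no matching keyword return `None`, which is why the `value_counts(dropna=False)` call in the cell is useful for seeing how many commits stay unclassified.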


In [None]:
processed_commits

## Run Hattori-Lanza commit SIZE classifier on commits df

In [None]:
results = []

for commit_size in processed_commits['n_changes']: 
    rslt = sizecat.hattori_lanza_commit_size_classification(commit_size=commit_size)
    results.append(rslt)

#print(results)
processed_commits['hattori_lanza_size_cat'] = results
processed_commits['hattori_lanza_size_cat'].value_counts(dropna=False)
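
Size binning of this kind can also be done in one vectorised step with `pd.cut`. The bin edges and labels below are hypothetical values chosen for illustration; they are not confirmed to match the thresholds used by the project's `hattori_lanza_commit_size_classification` function.

```python
import pandas as pd

n_changes = pd.Series([1, 5, 6, 37, 166, 500], name="n_changes")

# Hypothetical bin edges: (0, 5], (5, 25], (25, 125], (125, inf].
bins = [0, 5, 25, 125, float("inf")]
labels = ["tiny", "small", "medium", "large"]

# Each value is assigned the label of the right-closed interval it falls into.
size_cat = pd.cut(n_changes, bins=bins, labels=labels)

print(size_cat.value_counts(dropna=False))
```

With these edges a commit with zero changes falls outside every bin and becomes `NaN`, so it would show up in the `dropna=False` count rather than in a size category.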

## Write out repo df commit data with classifications and changestats

In [None]:
# build the output path from the variables set at the top of the notebook
write_out = f"{write_out_location}{out_filename}_{sanitised_repo_name}_{current_date_info}.csv" 
print(write_out)

processed_commits.to_csv(path_or_buf=write_out, header=True, index=False, na_rep="", mode='w')

In [None]:
# view dataset
processed_commits

In [None]:
# # read back in the dataset after writeout.
# testread = pd.read_csv(filepath_or_buffer="../../data/commits_cats_stats_JeschkeLab-DeerLab_2024-10-01.csv", header=0)
#testread

### Summary stats for dataset fields

In [None]:
processed_commits.vasilescu_category.value_counts(dropna=False)

In [None]:
processed_commits.hattori_lanza_content_cat.value_counts(dropna=False)

In [None]:
processed_commits.hattori_lanza_size_cat.value_counts(dropna=False)

In [None]:
processed_commits.files_changed.describe()

In [None]:
processed_commits.n_changes.describe()