This repository has been archived by the owner on Feb 29, 2024. It is now read-only.

Extract email addresses from commit log #6

Closed
missaugustina opened this issue Jun 6, 2017 · 11 comments

@missaugustina
Contributor

missaugustina commented Jun 6, 2017

How many contributors with minimal GitHub profile info can be identified this way? Do the email addresses improve other identification results?

See: countering-bean-counting/bonnyci_shuffleboard#85

missaugustina added this to the Milestone 2: Contributor Profiles milestone Jun 6, 2017
@missaugustina
Contributor Author

I extracted 226 names and emails from the commit history (many of them belonged to the same person; my eyeballed estimate is closer to 150 individuals). I was able to link 79 of the top contributors to those results. Of the ones I wasn't able to link, many had commit activity prior to the event time frame I analyzed (pre-2016). I saw at least one case where the match failed because I had done a case-sensitive check of name or login. The other non-matches may just have been folks who didn't make the top contributor list based on my initial exploration of that metric. So further analysis is needed, but it's a start!

@missaugustina
Contributor Author

Here's a better way using the user's profile:

Get a list of repos for each user (I wasn't able to filter out repos where fork=true in the request; we should see if this is something we need to do): https://api.github.com/users/0rchard/repos
For each repo the user owns, get a list of the commits they authored: https://api.github.com/repos/0rchard/cage_source/commits?author=0rchard
The commit data includes the committer info along with their GitHub username, so it can be matched up easily.
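
A rough sketch of those steps in Python, working on already-parsed API responses so it makes no network calls. The function names are hypothetical and not part of shuffleboard. It assumes each repo object carries a `fork` flag and each commit object carries `commit.author.name`, `commit.author.email` and a top-level `author.login`, which is how I understand the GitHub REST responses to be shaped.

```python
from urllib.parse import quote

API_ROOT = "https://api.github.com"


def repos_url(login):
    """URL listing the repos of one user."""
    return f"{API_ROOT}/users/{quote(login)}/repos"


def commits_url(owner, repo, author):
    """URL listing the commits in owner/repo authored by `author`."""
    return (f"{API_ROOT}/repos/{quote(owner)}/{quote(repo)}"
            f"/commits?author={quote(author)}")


def non_fork_repos(repos):
    """Filter forks out client-side, using the `fork` flag on each repo object."""
    return [r["name"] for r in repos if not r.get("fork", False)]


def identities(commits):
    """Collect unique (login, name, email) triples from parsed commit JSON."""
    seen = set()
    for c in commits:
        git_author = (c.get("commit") or {}).get("author") or {}
        # The top-level author can be null when the commit email
        # is not linked to any GitHub account.
        gh_author = c.get("author") or {}
        seen.add((gh_author.get("login"),
                  git_author.get("name"),
                  git_author.get("email")))
    # Sort with None treated as "" so the output order is stable.
    return sorted(seen, key=lambda t: tuple(x or "" for x in t))
```

Filtering forks after the fact costs nothing extra in API calls, since the flag is already in the repo listing.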

@missaugustina
Contributor Author

This method seems to return a decent amount of identifying data, but it carries a bit of overhead because it requires several API calls. I will need to implement something to check for that (for example, running into the API request limit) and then be able to pick up where it left off later. In the meantime I'm just dumping all the responses I get so we don't lose any data.
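
One possible shape for the "pick up where it left off" part, as a sketch only. The checkpoint file, the `fetch` callable and the `budget` counter are all hypothetical names, and the budget here simply counts one unit per user rather than tracking real API calls.

```python
import json
import os


def load_done(path):
    """Return the set of logins already processed in earlier runs."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return set(json.load(fh))


def save_done(path, done):
    """Write the checkpoint via a temp file so a crash cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)


def run_batch(logins, fetch, checkpoint_path, budget):
    """Process logins not yet done, stopping once the budget is used up."""
    done = load_done(checkpoint_path)
    results = {}
    for login in logins:
        if login in done:
            continue  # handled in a previous run
        if budget <= 0:
            break  # out of budget; the next run resumes from here
        results[login] = fetch(login)
        budget -= 1
        done.add(login)
        save_done(checkpoint_path, done)
    return results
```

Calling `run_batch` again with the same checkpoint path skips everyone already recorded, so a run that stops partway can be restarted without redoing work.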

@missaugustina
Contributor Author

I found this for SystemML when I was tracking down a Github user that had zero information associated with them: https://systemml.apache.org/community-members

I think for each organization of interest, some time may need to be spent manually compiling info from any contributor lists like this that show a company affiliation. I know other communities have similar lists in various formats.

@missaugustina
Contributor Author

The API call overhead is starting to become an issue now, so making the API calls async should definitely find its way into the pipeline at some point. See: countering-bean-counting/bonnyci_shuffleboard#11
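
For reference, a minimal sketch of what async calls with a concurrency cap could look like using Python's standard `asyncio`. This is not what the linked issue proposes; `gather_limited` is a hypothetical helper and `fake_fetch` stands in for a real HTTP request.

```python
import asyncio


async def gather_limited(items, fetch, max_in_flight=5):
    """Run fetch(item) for every item with at most max_in_flight at once."""
    sem = asyncio.Semaphore(max_in_flight)

    async def one(item):
        async with sem:
            return await fetch(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(i) for i in items))


async def fake_fetch(login):
    """Placeholder for a real request; just yields control once."""
    await asyncio.sleep(0)
    return f"commits for {login}"
```

Something like `asyncio.run(gather_limited(logins, fake_fetch))` would drive it. The cap matters because firing every request at once would burn through the request limit even faster.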

@missaugustina
Contributor Author

missaugustina commented Jun 9, 2017

Before this issue can be closed, these things need to be addressed:

  • If no commits are found in the author's own repos, check the author's commits to the project they are associated with
  • Run this against a list of non-committers to see what proportion can be identified
  • Summarize results in an R Notebook

@missaugustina
Contributor Author

missaugustina commented Jun 12, 2017

I had previously pulled a list of "top contributors" for the mxnet project based on overall event frequency and event type diversity, and I still need to upload that R notebook to this repo. I ran the updated shuffleboard script against this list of contributors (about 900 or so) in chunks, because of the GitHub API limit of 5000 requests per hour, to pull names and emails from commit history. I need to combine the results into a single CSV and then see how well we did at identifying these contributors. The R notebook should show a) the proportion of contributors with name info in their profile, b) the proportion with company info, c) the proportion with a name/email pulled from commits, and d) as a stretch goal, the proportion with a work email pulled from commits. It should also consider whether they already had company info in their profile.
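
A small sketch of the chunking and of proportions a) through c), written in Python rather than R just for illustration. The record keys (`name`, `company`, `commit_email`) are hypothetical column names, and the work-email stretch goal is left out because it needs a rule for what counts as a work address.

```python
def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def proportions(profiles):
    """Share of contributor records that have each kind of identifying info."""
    total = len(profiles)
    if total == 0:
        return {}

    def share(has_field):
        return sum(1 for p in profiles if has_field(p)) / total

    return {
        "name_in_profile": share(lambda p: bool(p.get("name"))),
        "company_in_profile": share(lambda p: bool(p.get("company"))),
        "commit_identity": share(lambda p: bool(p.get("commit_email"))),
    }
```

The chunk size would have to be picked so that the calls made for one chunk stay under the hourly request limit, which depends on how many repos each contributor has.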

missaugustina added a commit that referenced this issue Jun 21, 2017
Also moved to milestone 2 folder where it belongs

Part of Issue #6
Signed-off-by: Augustina Ragwitz <augustina.ragwitz@ibm.com>
@missaugustina
Contributor Author

I just need to finish up the section on email analysis, write up final conclusions, and put together a slide deck summarizing my findings.

missaugustina added a commit that referenced this issue Jun 22, 2017
Also moved to milestone 2 folder where it belongs

Part of Issue #6
Signed-off-by: Augustina Ragwitz <augustina.ragwitz@ibm.com>
@missaugustina
Contributor Author

Finished email analysis! Need to write up final conclusions and put together a slide deck with my findings.

missaugustina added a commit that referenced this issue Jun 23, 2017
Also moved to milestone 2 folder where it belongs

Part of Issue #6
Signed-off-by: Augustina Ragwitz <augustina.ragwitz@ibm.com>
missaugustina added a commit that referenced this issue Jun 23, 2017
Also moved to milestone 2 folder where it belongs

Closes Issue #6
Signed-off-by: Augustina Ragwitz <augustina.ragwitz@ibm.com>
@missaugustina
Contributor Author

I will mark this as complete and open a new issue for the additional discoveries.
