# GitHub Analytics of the Hack for LA website project

By Ava Li

## Abstract

This research aims to answer the question: what is the current and projected velocity of the Hack for LA website team? Velocity, in this scenario, can be thought of as the amount of work accomplished over a set period of time. To measure work done, we examined the timeline of every issue and pull request in the hackforla/website repository. More specifically, we collected information on when an issue is created, assigned, and closed, and when a pull request is opened, reviewed, and closed. With this information, we created graphics that show the velocity of the Hack for LA website team over the last five months. To assess velocity accurately, we examined not only overall velocity but also differentiated issues and pull requests by their role and size to gauge the main and intersectional effects these have on velocity.

## Introduction

As the Hack for LA team expands, we are ready to take on bigger and bigger projects. To track our projects effectively, we need a way to measure the velocity of our team. For this issue, we will gather various metrics that allow us to assess not only the velocity of the team but also the points where velocity can be improved. This allows us to better communicate with stakeholders, as well as team members, about our progress on the website's various projects.

The goal of this research is to a) estimate when features can be expected to be completed, b) identify the points at which we can improve our velocity, and c) assess the accuracy of our labels.

### Defining Velocity

In terms of the website project, velocity can be thought of as the amount of time taken for work to go from one stage to another. For people familiar with the physics definition of velocity, this sounds somewhat counterintuitive. Certainly, the more accurate way to measure velocity would be the inverse: the amount of work done in a certain amount of time. This, however, is not a preferable way to measure velocity for three reasons:

1. The amount of work cannot be easily quantified, as most of the work of coding involves brainstorming and testing and is thus invisible.
2. One applied benefit of velocity is understanding when a new feature or product will be delivered. Because of that, time is a much more valuable measure than work.
3. At certain stages, work is placed into queues, and measuring work done would not capture these moments when a task, or issue, is in a queue. Queue time is an important part of velocity to capture because it indicates areas where velocity can improve.

<br>
<figure>
    <img src="images/issue_sample.png" alt="drawing" width="800" height="800" />
    <center><b>Fig 1. A typical issue or task</b></center>
</figure>
<br>

Of course, despite measuring velocity in terms of time, we still need to decide the point where time starts and the point where time ends. For this particular study, we have identified six key points in the Hack for LA website team's workflow that represent moments of work hand-off. They are: issue opening, issue assignment, pull request opening, pull request reviewed, pull request closing, and issue closing. To clarify what each of these means:

* Issue opening is the moment when an issue has been codified and placed into a queue. Note that at this point no actual progress has been made; the issue is waiting for resources to free up so that work can formally begin.
* Issue assignment is the moment when an issue starts being actively worked on. This work is usually done by one or more individuals.
* Pull request opening is the moment when the first draft of an assigned issue has been completed and is placed into a queue. The next step is for a reviewer to suggest changes to the work.
* Pull request reviewed is the moment when a reviewer has completed a review of a draft. After this, work on the next draft of the issue starts.
* Issue closed and pull request closed are the moments when work is completed and the final draft is accepted.

By isolating these key points, we began to construct graphics on the time it takes for an issue to move from one point to the next. This allowed us to answer how many issues are completed in a month, where the bottleneck in the team's workflow is, and whether our size labels accurately reflect the time commitment of a ticket.
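As a minimal sketch of this idea, assuming each issue is represented by a dictionary of ISO 8601 timestamps, the time between consecutive moments could be computed as follows. The key names, the function `stage_durations`, and the sample values are hypothetical and are not taken from the project's data.

```python
from datetime import datetime

# Hypothetical ordering of workflow moments for an issue with a linked pull request.
MOMENTS = ["issue_opened", "issue_assigned", "pr_opened", "pr_reviewed", "issue_closed"]


def stage_durations(timestamps):
    """Return days elapsed between each pair of consecutive recorded moments.

    Moments missing from `timestamps` (for example, a design issue with no
    pull request) are skipped, so the next recorded moment is used instead.
    """
    recorded = [
        (name, datetime.fromisoformat(timestamps[name]))
        for name in MOMENTS
        if timestamps.get(name)
    ]
    durations = {}
    for (start_name, start), (end_name, end) in zip(recorded, recorded[1:]):
        durations[f"{start_name} -> {end_name}"] = (end - start).total_seconds() / 86400
    return durations


# Made-up timestamps for illustration only.
sample = {
    "issue_opened": "2021-03-01T09:00:00",
    "issue_assigned": "2021-03-04T09:00:00",
    "pr_opened": "2021-03-10T21:00:00",
    "issue_closed": "2021-03-12T09:00:00",
}
durations = stage_durations(sample)
```

Because missing moments are skipped rather than treated as errors, the same function would cover issues that never produce a pull request.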

### Labels

Labels are a way to classify an issue. For example, an issue might involve a certain type of work, such as refactoring, and those issues would be marked with a refactoring label. For this research, we looked at two categories of labels commonly used by the Hack for LA website team: size and role.

<br>
<figure>
    <img src="images/label_sample.png" alt="drawing" width="800" height="800" />
    <center><b>Fig 2. A sample of the labels tagged on an issue.</b></center>
</figure>
<br>

#### Size

Size labels represent, in an ideal world, an accurate assessment of the time it takes for an issue to be completed after assignment. In other words, the size labels themselves should act as an accurate assessment of velocity. Why, then, do we need this research? As it currently stands on the website team, the standards for applying size labels are informal. These labels are applied upon issue creation and thus serve as a best guess of how long an issue will take. This guess is usually based on the judgment of the issue's creator, which might reflect their expertise, their assumptions about an assignee's expertise, and the types of issues that typically receive a given size label. As these judgments vary, size labels often do not give an accurate representation of time to completion.

In addition, individual differences among assignees might inflate or deflate a label's time to completion. One assignee might find an issue quite simple and straightforward, while another might discover that the issue requires substantial research to complete. For instance, a medium issue can be completed in as little as two days or take as long as 30 days! These size labels therefore show a large amount of variance, which can render a label meaningless. For this reason, we examine size labels in this study to assess whether they accurately reflect an issue's time to completion and to set criteria for when to assign each size label.

In this study, we will examine four size labels: 'Good first issue', 'Size: Good second issue', 'Size: Medium', and 'Size: Large'. To simplify matters, this write-up will refer to them as First, Second, Medium and Large labels respectively.
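The spread described above can be illustrated with a short sketch. It assumes each issue is given as a pair of its label names and its days from assignment to close; the function names and sample values are hypothetical, and treating issues without exactly one size label as unsized is an assumption made here, not a rule stated by the study.

```python
from statistics import mean, pstdev

# Short names used in this write-up, keyed by the GitHub label text.
SIZE_LABELS = {
    "Good first issue": "First",
    "Size: Good second issue": "Second",
    "Size: Medium": "Medium",
    "Size: Large": "Large",
}


def size_of(labels):
    """Return the short size name, or None unless exactly one size label is present."""
    sizes = [SIZE_LABELS[label] for label in labels if label in SIZE_LABELS]
    return sizes[0] if len(sizes) == 1 else None


def spread_by_size(issues):
    """Summarize days from assignment to close for each size label."""
    grouped = {}
    for labels, days in issues:
        size = size_of(labels)
        if size is not None:
            grouped.setdefault(size, []).append(days)
    return {
        size: {"min": min(d), "max": max(d), "mean": mean(d), "stdev": pstdev(d)}
        for size, d in grouped.items()
    }


# Made-up (labels, days to completion) pairs for illustration only.
issues = [
    (["Size: Medium", "role: front end"], 2),
    (["Size: Medium", "role: back end"], 30),
    (["Good first issue", "role: front end"], 3),
    (["role: design"], 10),
]
summary = spread_by_size(issues)
```

A large gap between `min` and `max`, or a standard deviation close to the mean, would suggest that a label says little about how long an issue takes.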

#### Role

Role labels at the Hack for LA website team usually denote the department that handles the given issue. For example, a 'role: design' label means the issue is only appropriate for assignment to someone on the design team. This research looks at role labels because different roles perform very different types of work. Designers, for instance, would never create a pull request, and thus their workflow only consists of the issue opened, issue assigned, and issue closed moments. Likewise, each department functions somewhat independently of the others, with its own workflow. This means that methods to improve velocity for one department might not apply to another.

In this study, we will examine three role labels: 'role: front end', 'role: back end', and 'role: design'. To simplify matters, this write-up will refer to them as Front-end, Back-end, and Design labels respectively.

## Method

This study was done in two steps: data gathering and data analysis. For the data gathering step, we gathered timestamps of key moments for every issue in the hackforla/website repository. These timestamps were used to calculate velocity, which in this study means the time it took for an issue to move from one workflow moment to the next. After the data had been cleaned, we separated it by role and size labels so that the velocity of each group could be examined separately.

### GitHub API

As the Hack for LA website team primarily uses GitHub to organize project workflow, the GitHub API is a useful tool for gathering data on issues. The API exposes the data displayed on GitHub, including timestamps of various events, the unique ID of every user, and relationships between issues. For this study specifically, we used two APIs: the issue API and the timeline API.

The issue API is primarily used to get the date when an issue was opened and to extract connections between issues. The connection that we were interested in is the relationship between an issue and its linked pull requests. To clarify, GitHub considers issues and pull requests to be separate sets of data. Often, when a draft of some work is completed, a pull request is opened to signal that the draft is ready for review. When a pull request is made in reference to a certain issue, the pull request and the issue are said to be linked. Sadly, the API does not contain information on this link. The website team, however, was instructed to reference the associated issue in each pull request, so a simple, custom regex parser is able to obtain the link information.
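The parser used in the study is not shown in this write-up. As a rough sketch only, assuming pull request descriptions reference their issue with one of GitHub's closing keywords (close, fix, resolve, and their variants) followed by `#` and the issue number, such a parser might look like this. The pattern, the function `linked_issue_numbers`, and the sample description are all assumptions.

```python
import re

# Assumed pattern: a GitHub closing keyword followed by an issue reference such as "#123".
# This is an illustration, not the pattern used in the study.
LINK_PATTERN = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\b[\s:]*#(\d+)",
    re.IGNORECASE,
)


def linked_issue_numbers(pr_body):
    """Return issue numbers referenced with a closing keyword, in order, without duplicates."""
    numbers = []
    for match in LINK_PATTERN.finditer(pr_body or ""):
        number = int(match.group(1))
        if number not in numbers:
            numbers.append(number)
    return numbers


# Hypothetical pull request description.
body = "Fixes #1234\n\nAlso resolves: #1250. See #999 for background. fixes #1234"
linked = linked_issue_numbers(body)
```

Plain mentions such as `See #999` are ignored on purpose, since a bare reference does not necessarily mean the pull request addresses that issue.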

Through the GitHub API, we were able to consolidate timestamps for issue opened, assigned, and closed, and for pull request opened, reviewed, and closed. Likewise, information about links between issues and pull requests was also consolidated. We separated the data into two .csv files, one with data concerning issues and the other with data concerning pull requests. Tables 1 and 2 show the moments we gathered for each .csv file, along with the definition of each moment.


<center><h3>Issues .csv moments and definitions</h3></center>

| Moment      | Definition  |
| ----------- | ----------- |
| Issue opened | The time when the issue was first opened. |
| Issue assigned | The time when an issue was first assigned. |
| Linked pull request made | The time when the assignee made the pull request. |
| Issue closed | The last time the issue was closed, but only if it was after when it was last reopened. |

<center><b>Table 1</b></center>
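The "first assigned" and "last closed, unless reopened afterwards" definitions in Table 1 could be derived from an issue's timeline roughly as follows. This sketch assumes each timeline event is a dictionary with an `event` name and a `created_at` timestamp; the helper names and the sample events are hypothetical.

```python
from datetime import datetime


def parse(ts):
    """Parse a GitHub-style UTC timestamp such as '2021-03-01T09:00:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def issue_moments(events):
    """Derive the assigned and closed moments from a list of timeline events."""
    times = {"assigned": [], "closed": [], "reopened": []}
    for event in events:
        if event.get("event") in times:
            times[event["event"]].append(parse(event["created_at"]))

    first_assigned = min(times["assigned"], default=None)
    last_closed = max(times["closed"], default=None)
    last_reopened = max(times["reopened"], default=None)

    # An issue reopened after its last close is still open, so it has no closed moment.
    if last_closed is not None and last_reopened is not None and last_reopened > last_closed:
        last_closed = None
    return {"issue_assigned": first_assigned, "issue_closed": last_closed}


# Made-up timeline for illustration only.
events = [
    {"event": "labeled", "created_at": "2021-03-01T09:00:00Z"},
    {"event": "assigned", "created_at": "2021-03-02T10:00:00Z"},
    {"event": "closed", "created_at": "2021-03-05T10:00:00Z"},
    {"event": "reopened", "created_at": "2021-03-06T10:00:00Z"},
    {"event": "assigned", "created_at": "2021-03-07T10:00:00Z"},
    {"event": "closed", "created_at": "2021-03-09T10:00:00Z"},
]
moments = issue_moments(events)
```

The same rule would apply to the pull request closed moment in the pull requests .csv.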

<br>
<br>

<center><h3>Pull Requests .csv moments and definitions</h3></center>

| Moment      | Definition  |
| ----------- | ----------- |
| Pull request opened | The time when the pull request was opened. |
| Pull request reviewed | The time when the pull request was first reviewed. |
| Pull request closed | The last time when the pull request was closed, but only if it was after when it was last reopened. |

<center><b>Table 2</b></center>

### Matplotlib

Matplotlib is a popular Python library for data visualization. Its pyplot interface is modeled on the plotting commands of MATLAB. For this study, we used Matplotlib to visualize the data obtained from the GitHub API.

To measure the velocity of the Hack for LA website team, we first grouped the data to calculate the average time it took for issues and pull requests to move from one key moment to the next. These averages were separated by month so that projections about velocity can be made. For example, to examine the velocity of issues going from opened to assigned, we would take all the issues, separate them by the month in which they were assigned, and find the average difference between the opened and assigned timestamps.
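The monthly grouping in this example can be sketched with the standard library alone. The record layout, the key names, and the function `monthly_average_days` are assumptions for illustration, and the sample timestamps are made up.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_average_days(issues, start_key, end_key):
    """Average days between two moments, grouped by the month of the end moment."""
    by_month = defaultdict(list)
    for issue in issues:
        if not issue.get(start_key) or not issue.get(end_key):
            continue  # skip issues that have not reached both moments
        start = datetime.fromisoformat(issue[start_key])
        end = datetime.fromisoformat(issue[end_key])
        by_month[end.strftime("%Y-%m")].append((end - start).total_seconds() / 86400)
    return {month: mean(days) for month, days in sorted(by_month.items())}


# Made-up records for illustration only.
issues = [
    {"opened": "2021-03-01T00:00:00", "assigned": "2021-03-03T00:00:00"},
    {"opened": "2021-02-20T00:00:00", "assigned": "2021-03-02T00:00:00"},
    {"opened": "2021-03-30T00:00:00", "assigned": "2021-04-04T00:00:00"},
    {"opened": "2021-04-10T00:00:00", "assigned": None},
]
averages = monthly_average_days(issues, "opened", "assigned")
```

The resulting month-to-average mapping is the kind of series that can then be plotted as one point per month.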

In addition to separating the averages by month, we also created graphics that separated these averages by role and size labels. In doing so, we were able to analyze the velocity of different teams as well as the accuracy of our size labels.



## Results

## Conclusion

### Ways to Improve Velocity

### Limitations and Future of the automation