-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add repository metrics #177
base: master
Are you sure you want to change the base?
Conversation
@steffen: Thank you so much for this huge contribution 👍! I’ll want to look at this in detail, but I’ll be off the next days unfortunately. It will take a week or two until I can come back to your pull request. I hope that’s okay 🙂. |
@pluehne No worries! Thanks for looking into it! Enjoy your time off. 🙂 |
Hi @pluehne 👋, Just a "quick" question about my question about passing repository name with owner to bash scripts from the original post. The repository name with owner has to be passed to the bash scripts, such as to this one: I'm not sure yet what the best way is to accomplish that via the executeScript method. Basically I want to pass Thanks!! |
@steffen: The best way to achieve that would be to extend the And sorry that I still haven’t had the time to look over your pull request! |
@pluehne I've thought about using environment variables as well. Any pointers on how |
@steffen: |
@pluehne Great, thanks! Speaking of reviewing and merging the PR. I thought about keeping this one here as small as possible, but big enough to get the whole picture of how I imagine the integration with repository specific metrics. I do have a few changes locally I want to push here which include the functionality of having working directories of the actual repositories in a temporary folder next to |
"trees", | ||
"blobs", | ||
"tags", | ||
"refs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These series can have dramatically different ranges. Should we put them into individual charts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I separated counts and bytes into different charts. The charts allow to view selected series by clicking on the legend items, which I thought is better than too many charts. But yeah, worth thinking about splitting it up. Which series would you suggest to separate, e.g. tags
and refs
into a separate chart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if refs
include tags
? If they do, then I would even drop tags
.
Plus, I am not sure if the total "number of trees" and/or "number of blobs" are actionable metrics. Maybe number of trees/files of HEAD in the default branch represents the current performance characteristics of the repo for the user better?
In general, I believe we should track all this data but only visualize the most important/actionable subset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@larsxschneider Great feedback! I changed the git-sizer
report to include all values and now we are flexible in picking what to display in the charts.
The refs
count includes the tags
, so I agree that we only need to show the number of refs
.
I tweaked the charts to show more actionable metrics and included some comments from your article:
https://steffen.github.io/hubble/repository/git-repository-size?nwo=test-org/large-repo
Great point on the number of trees/files of HEAD in the default branch. git-sizer
doesn't include that metric, but we can just get it directly from the repo.
Thank you! And Happy Eastern! 🙂🐰
return round(bytes / 1024 ** 2, 3) | ||
|
||
def updateData(self): | ||
stdout = self.executeScript(["ghe-repo", self.repository, "-c", self.configuration["gitSizerPath"] + " --json --json-version=2"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is nice! But don't we run git-sizer
here twice per repo? Here and there: https://github.com/Autodesk/hubble/pull/177/files#diff-6b012ceb81adedd4a22e2089a0102dd3R10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'm aware and thought it's ok to optimize later. 🙂It would need a bit of refactoring I think, as the common design in the updater is to have one report file and class per report. But it shouldn't be too hard to find a solution for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see two options:
(1) All reports are executed sequentially. That means we could pass all reports a "cache dictionary" which is used to use data that a previous report calculated.
(2) We add a cache
flag to executeScript
. If this flag is set, then the output is cached and reused in a subsequent call with the same command.
@pluehne/@steffen Do you see another option? If no, which one would you prefer? I am leaning towards (2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@larsxschneider Thanks for these ideas! I consolidated the git-sizer
reports into one: 7008ec9#diff-66c529dd9266003f0fc5892af8ac0cb8
All quantity values are now tracked in one report. This gives us more flexibility on what to show while having all the data as discussed here. I didn't really think of this idea of using the same report for different charts until I worked on the git-download
reports which does exactly that. 🙂
uniq -ic | | ||
sort -rn | | ||
head -20 | | ||
awk '{ printf("%s\t%s\n", $2, $1) }' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks pretty similar to https://github.com/Autodesk/hubble/blob/master/updater/scripts/api-requests-by-user.sh ... should we add a repo option to the original script to avoid code duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! There are other reports I wanted to add for repo metrics that are pretty much the same as well. I'll explore adding a repo option to the original scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@larsxschneider I added a repo option for git-download.sh
. Let me know if you think that's a good solution and if it's ok to use for other reports/scripts as well.
I ran the same greps by hand against my top repo as reported in Hubble:API requests per repo. The results were oddly formatted. "/repositories/:repository_id/contents/?" 79534 Whereas on another repo, the results looked OK. "/repositories/:repository_id/pulls/:id/comments" 25860 This was the api-request-types zcat -f /var/log/github/unicorn.log.1* | |
Here's the raw data up to the uniq -ic command if it helps. $ zcat -f /var/log/github/unicorn.log.1* | |
Hi @toddocon, thank you for running the API queries!! Thanks again for your feedback! |
This allows running `python3 update-stats.py` on the GitHub Enterprise appliance
Which aligns with Hubble install instructions
Description
Earlier this week I presented the idea of a modified Hubble to display repository metrics to a few peers (including @larsxschneider) which received very positive feedback.
The proof of concept can be seen here: https://steffen.github.io/rosat/
I called the tool Rosat as that was the next satellite that has been deployed to space after Hubble and fittingly it was a German satellite, built in collaboration with the US and UK that used X-ray imaging to study its orbiting plane (earth).
@larsxschneider suggested that it would simplify maintenance for users if the repository metrics tool could be integrated directly into Hubble, which received a 👍 from @pluehne to my knowledge. 🙂
Therefore I'm creating this PR as a proof of concept to explore how an integration could look like.
Demo
https://steffen.github.io/hubble/
https://steffen.github.io/hubble/repos-monitored
https://steffen.github.io/hubble/repository/?nwo=test-org/large-repo
See Overview with the repository name and size (fetched via a SQL query)
https://steffen.github.io/hubble/repository/github-api-requests?nwo=test-org/large-repo
See API Request Types by Count (gathered from the log files)
Screenshots
Comments
Configuration
Repositories to be monitored can be configured through a new configuration option in
config.py
:https://github.com/steffen/hubble/blob/13ee2ec9af40962b5f4225d1743fa82e7379b6da/updater/config.py.example#L75-L81
Color
I made the header background blue to make it obvious to the user that they are in "repository view mode". I found the blue fitting as it represents the sky which becomes bluer as one gets closer to planet earth (to follow-up on that Rosat analogy). 🙂
Monitored repositories page
Right now I added the list of monitored repositories and the links to the repository view mode to the Repositories section. I believe it would be handier to the user if we'd display the monitored repositories on Hubble's overview dashboard page, which does not exist yet. I would find it convenient to have an overview dashboard page similar to Rosat's overview dashboard page that displays a few key metrics of the instance. We can discuss this in a separate issue though.
"Back to instance statistics" link
I'm open for style and position suggestions of the "Back to instance statistics" link. For now I found the top left position within the navigation to be the best place.
Question about passing repository name with owner to bash scripts
The repository name with owner has to be passed to the bash scripts, such as to this one_
https://github.com/steffen/hubble/blob/13ee2ec9af40962b5f4225d1743fa82e7379b6da/updater/scripts/repository/github-api-request-types-by-count.sh
I'm not sure yet what the best way is to accomplish that via the executeScript method. Any ideas?
The code
I programmed with Python for the first time this week. 😁 As I have a background in Ruby and JavaScript, it was not hard and I rather enjoyed it. Though, don't expect the best Python code just yet. 🙂 I'm open for suggestions!
Also, obviously it needs more testing, but I think the changes are a good base for discussing the integration.
Having said that, huge kudos to the current Hubble code base, it's fun building on it! ✨
Looking forward to your thoughts!