Scraping all GitHub commit dates of a given user
Many GitHub users spend long hours improving and upgrading their code. Any change made in a particular repository is saved and documented as a "commit". These commits allow users to track the changes they have made and document them in an orderly fashion. This repo also has a commits page, which you can see here: commits
The idea behind this code is simple: use these commits not only as documentation, but as data. This data allows users to learn about themselves and their behavior. Since every commit is stamped with its execution date, you can learn about the user's work schedule. Therefore, a calendar of the user's activity can be drawn from their commits across their time on GitHub. For this purpose, the code gitime.py was created.
The code is based on a number of key steps:
- The code receives the name of a GitHub user profile and uses it to reach the page that centralizes all of the user's repositories.
- All links to the various repositories are scraped using bs4 (see the sketch after this list).
- The code then goes over each repository and crawls its commits.
- The crawled commits are converted into a summary table, which can be saved or converted to a calendar output.
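The first two steps might look roughly like the following sketch. Note that the URL pattern and the `itemprop="name codeRepository"` selector are assumptions about GitHub's repositories-tab markup, not taken from gitime.py itself:

```python
# a minimal sketch of collecting a user's repository links, assuming the
# "?tab=repositories" page and its itemprop="name codeRepository" anchors
import requests
from bs4 import BeautifulSoup

def get_repo_links(user):
    """Collect the links to all repositories of a given GitHub user."""
    url = f'https://github.com/{user}?tab=repositories'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # each repository name is an <a> tag pointing to '/<user>/<repo>'
    return [a['href'] for a in soup.find_all('a', itemprop='name codeRepository')]
```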
During the scraping process, there is a one-second break between crawl requests. This interval can be changed using the "slp" variable.
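Such throttling can be sketched with a small wrapper around requests (polite_get is a hypothetical helper name, shown only for illustration):

```python
# a sketch of a throttled request, pausing "slp" seconds after each fetch
import time
import requests

def polite_get(url, slp=1):
    """Fetch a page, then pause so GitHub is not flooded with requests."""
    response = requests.get(url)
    time.sleep(slp)  # the "slp" variable controls this break
    return response
```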
In addition, the code is built to deal with the difference between the "main" and "master" default branches. As you may know, in 2020 GitHub decided to stop using the term "master" in its default routing and to use "main" instead (you can read more about this here: renaming). Therefore, the code deals with the fact that some commits are kept under the "main" route while others are still under "master".
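A minimal sketch of such a fallback, assuming the standard /commits/&lt;branch&gt; URL layout (commits_url is a hypothetical helper, not necessarily the one in gitime.py):

```python
# try the 'main' route first, then fall back to the legacy 'master' route
import requests

def commits_url(repo):
    """Return the commits-page URL of a repo, preferring 'main' over 'master'."""
    for branch in ('main', 'master'):
        url = f'https://github.com{repo}/commits/{branch}'
        if requests.get(url).status_code == 200:
            return url
    return None  # empty repo or a custom default branch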
Also, GitHub displays only 35 commit records per page. Therefore, to scrape all the commits of a repository with more than 35 commits, the code uses the link behind the scroll-down ("Older") button. By doing so, it is possible to reach the pages where the additional commits are stored.
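The pagination loop might look like this sketch. The assumption that the button is an &lt;a&gt; tag whose text is "Older" reflects the classic commits-page markup and may differ from the actual implementation:

```python
# a sketch of walking through paginated commits pages via the 'Older' link
import time
import requests
from bs4 import BeautifulSoup

def crawl_commit_pages(url, slp=1):
    """Yield the parsed soup of every commits page of a repository."""
    while url:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        yield soup
        older = soup.find('a', string='Older')   # pagination button
        url = older['href'] if older else None   # None stops the loop
        time.sleep(slp)  # one-second break between requests
```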
After the scraping process ends, the code isolates the dates of the commits and arranges them into a summary table. It saves this table on the user's local computer (by default, in the user's working directory). An example of such a table can be seen here:
repo | year | month | day | hour | minutes | seconds | dates | weekday |
---|---|---|---|---|---|---|---|---|
/EtzionR/Github-Commits-Crawling | 2021 | 10 | 01 | 10 | 59 | 43 | 2021-10-01 | Friday |
/EtzionR/Github-Commits-Crawling | 2021 | 10 | 01 | 10 | 53 | 18 | 2021-10-01 | Friday |
/EtzionR/Github-Commits-Crawling | 2021 | 10 | 01 | 10 | 52 | 52 | 2021-10-01 | Friday |
/EtzionR/Github-Commits-Crawling | 2021 | 10 | 01 | 10 | 52 | 44 | 2021-10-01 | Friday |
/EtzionR/Github-Commits-Crawling | 2021 | 10 | 01 | 10 | 51 | 43 | 2021-10-01 | Friday |
You can see an example of such a table (based on my own GitHub profile) here: GiTime_EtzionR.csv.
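The conversion from raw timestamps to this layout could be sketched as follows. It assumes the commit timestamps were collected as ISO strings (as GitHub embeds them in datetime attributes); dates_to_table is a hypothetical helper:

```python
# a sketch of turning ISO timestamps into the summary-table layout above
from datetime import datetime
import pandas as pd

def dates_to_table(repo, iso_dates):
    """Build one summary-table row per commit timestamp."""
    rows = []
    for iso in iso_dates:
        # older Python versions cannot parse the trailing 'Z', so swap it out
        dt = datetime.fromisoformat(iso.replace('Z', '+00:00'))
        rows.append({'repo': repo,
                     'year': dt.year, 'month': dt.month, 'day': dt.day,
                     'hour': dt.hour, 'minutes': dt.minute, 'seconds': dt.second,
                     'dates': dt.date(), 'weekday': dt.strftime('%A')})
    return pd.DataFrame(rows)
```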
The data is also converted into a calendar table, which can be accessed directly (GiTime('user').clndr) and plotted automatically:
```python
# import GiTime object
from gitime import GiTime

# using the object on the "EtzionR" GitHub user
GiTime('EtzionR').plot()
```

Resulting plot:
This example is based on my own GitHub profile. As you can see, the calendar it produces shows that I do not work at all between 1:00 and 7:00, and that I refrain from working on weekends. You can also see that most of my work on GitHub is concentrated in the middle of the week.
This output can teach users how they divide their working hours throughout the week, allowing them to become more efficient and learn about their own activity.
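Such a calendar can also be reproduced from the summary table itself. Here is a hedged sketch with pandas, pivoting commit counts by hour and weekday (this layout is an assumption about what the clndr attribute holds, not taken from the source):

```python
# a sketch of building an hour-by-weekday calendar from the summary csv
import pandas as pd

df = pd.read_csv('GiTime_EtzionR.csv')
calendar = df.pivot_table(index='hour', columns='weekday',
                          values='repo', aggfunc='count').fillna(0)
print(calendar)  # rows: hours of the day, columns: weekdays, values: commit counts
```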
Note: the code indicates at each stage which repository it is scraping. If the user's GitHub contains a large number of commits and repositories, the code will take some time to run.
The code uses the following libraries in Python:
- BeautifulSoup (bs4)
- matplotlib
- requests
- datetime
- pandas
An application of the code is attached to this page, and the example outputs are attached here as well.
To use this code, you just need to import it as follows:
```python
# import GiTime object
from gitime import GiTime

# apply it to a given user
GiTime('user')
```
Where the parameters are (see the usage example after this list):
- user: the GitHub user whose commits you want to scrape
- save: save the commits to a csv file (bool, default: True)
- slp: sleep time between crawling requests (default: one second)
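For example, a run that skips the csv export and slows the crawl down might look like this (assuming the parameters are passed as keyword arguments):

```python
from gitime import GiTime

# scrape without saving a csv, with a two-second pause between requests
GiTime('EtzionR', save=False, slp=2)
```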
MIT © Etzion Harari