Added a script to import git repository to ClickHouse #14471

Merged
merged 24 commits into from Sep 12, 2020

Conversation

alexey-milovidov
Member

@alexey-milovidov alexey-milovidov commented Sep 4, 2020

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added a script to import a git repository into ClickHouse.

Detailed description / Documentation draft:
Better implementation of #12577

It allows answering questions like:

  • list files with the maximum number of authors;
  • show me the oldest lines of code in the repository;
  • show me the files with the longest history;
  • list favorite files for an author;
  • list the largest files with the lowest number of authors;
  • on which weekday the code has the highest chance to stay in the repository;
  • the distribution of code age across the repository;
  • files sorted by average code age;
  • quickly show a file with blame info (rough);
  • commits and lines of code distribution by time, by weekday, by author; for specific subdirectories;
  • show the history of every subdirectory, file, or line of a file; the number of changes (lines and commits) over time; how the number of contributors changed over time;
  • list files with the most modifications;
  • list files that were rewritten the most times or by the most authors;
  • what percentage of each author's code is removed by other authors;
  • the matrix of authors showing which authors tend to rewrite other authors' code;
  • what is the worst time to write code, in the sense that the code has the highest chance of being rewritten;
  • the average time before code is rewritten, and the median (half-life of code decay);
  • how the comments/code percentage changes over time / by author / by location;
  • who tends to write more tests / cpp code / comments.

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Sep 4, 2020
@KochetovNicolai KochetovNicolai self-assigned this Sep 4, 2020
@robot-clickhouse robot-clickhouse added the submodule changed At least one submodule changed in this PR. label Sep 6, 2020
@alexey-milovidov
Member Author

alexey-milovidov commented Sep 6, 2020

Now the tool can easily process the ClickHouse, LLVM, Linux and Chromium repositories.

ClickHouse repository preprocessing takes 31 seconds.
LLVM - 8 minutes 45 seconds.
Linux - 12 minutes 35 seconds.
Chromium - 67 minutes 3 seconds.

(I picked them as examples of somewhat large repositories. What is the largest repository on GitHub?)

The processing time is in fact comparable to the time it takes to clone the repository.

@robot-clickhouse robot-clickhouse removed the submodule changed At least one submodule changed in this PR. label Sep 7, 2020
@alexey-milovidov
Member Author

SELECT 
    database,
    table,
    sum(rows),
    formatReadableSize(sum(bytes))
FROM system.parts
WHERE database LIKE 'git\\_%'
GROUP BY 
    database,
    table
ORDER BY 
    database ASC,
    table ASC

┌─database───────┬─table────────┬─sum(rows)─┬─formatReadableSize(sum(bytes))─┐
│ git_chromium   │ commits      │    920847 │ 88.93 MiB                      │
│ git_chromium   │ file_changes │   8280114 │ 274.98 MiB                     │
│ git_chromium   │ line_changes │ 941256784 │ 18.65 GiB                      │
│ git_clickhouse │ commits      │     31900 │ 2.44 MiB                       │
│ git_clickhouse │ file_changes │    135270 │ 4.87 MiB                       │
│ git_clickhouse │ line_changes │   3300441 │ 62.27 MiB                      │
│ git_linux      │ commits      │    875169 │ 74.33 MiB                      │
│ git_linux      │ file_changes │   2088558 │ 116.73 MiB                     │
│ git_linux      │ line_changes │  85152422 │ 1.71 GiB                       │
│ git_llvm       │ commits      │    363055 │ 37.06 MiB                      │
│ git_llvm       │ file_changes │   1468936 │ 67.69 MiB                      │
│ git_llvm       │ line_changes │  62253479 │ 1.20 GiB                       │
└────────────────┴──────────────┴───────────┴────────────────────────────────┘
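As a rough reading of the table above, the line_changes tables work out to about 20 to 22 bytes per stored row in all four repositories. The sketch below only redoes that arithmetic. The helper name `bytesPerRow` and the constants are illustrative, the sizes are the rounded values printed by formatReadableSize, and it assumes the `bytes` column of system.parts reflects on-disk size, so the results are approximate.

```cpp
#include <cstdint>

// Binary units, as printed by formatReadableSize.
constexpr double MiB = 1024.0 * 1024.0;
constexpr double GiB = 1024.0 * MiB;

// Illustrative helper (not part of the PR): average bytes per stored row.
constexpr double bytesPerRow(double total_bytes, uint64_t rows)
{
    return total_bytes / static_cast<double>(rows);
}

// line_changes figures copied from the query output; sizes are rounded,
// so the quotients are only accurate to a couple of digits.
constexpr double chromium_bytes_per_row   = bytesPerRow(18.65 * GiB, 941256784); // about 21.3
constexpr double clickhouse_bytes_per_row = bytesPerRow(62.27 * MiB, 3300441);   // about 19.8
constexpr double linux_bytes_per_row      = bytesPerRow(1.71 * GiB, 85152422);   // about 21.6
constexpr double llvm_bytes_per_row       = bytesPerRow(1.20 * GiB, 62253479);   // about 20.7
```

The per-row cost stays in a narrow band even though the row counts differ by a factor of almost 300, which suggests storage for this table scales roughly linearly with the number of line changes.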

@alexey-milovidov
Member Author

Integration tests (release) — fail: 0, passed: 786, error: 2

name = 'connect', self = <socket._socketobject object at 0x7f44784b8980>
args = (('172.19.0.3', 9000),)

    def meth(name,self,*args):
>       return getattr(self._sock,name)(*args)
E       error: [Errno 113] No route to host

Infrastructure failure.

@alexey-milovidov
Member Author

#14553

*/
void setLineInfo(std::string full_line)
{
    indent = 0;

Member

It is too easy to overflow.
We could probably calculate using uint32_t and clamp to 255 on overflow.

Member Author

Ok.

@alexey-milovidov alexey-milovidov merged commit cc3d1a3 into master Sep 12, 2020
@alexey-milovidov alexey-milovidov deleted the git-to-clickhouse branch September 12, 2020 00:56