Exposed: the cvs2git process we used for ensembl-otter
Shell
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.gitignore
Ana_VC_general.mediawiki
IGM Git copyright.pdf
IGM Git copyright.pptx
README.mediawiki
cvs2git
cvs2git-ensembl-foo
cvs2git-example.options
cvs2git-funcs
cvs2git-intcvs1-foo
cvs2git-sourceforge-foo
cvs2git-wrapper
email-addr-remap.sh
mk-known-good.sh
reimport.sh

README.mediawiki

Table of Contents

Introduction

These files were created as an internal project during the migration of https://github.com/Anacode/ensembl-otter from CVS.

The team-internal project was announced in a presentation GIT - making a hash of it by Michael to the rest of the Informatics Group on 29 May 2012. (pptx) (pdf). Internal - pdf from IGM Talks.

This public project is an export of those files, primarily for the benefit of other teams on the Genome Campus wishing to migrate code from these CVS repositories; but also the process may be useful elsewhere.

In the interest of getting the files out - this release will include junk, broken links and references to stuff you can't see. Please let us know if this troubles you, otherwise fork and be merry.

History

This page sprouted from Anacode: Version Control in order to document the process after the fact.

Originally it lived in a MediaWiki instance named SangerWiki.

The wiki markup is retained and renders well enough on Github to make sense, but many links are broken.

Licences

In accordance with Sanger Institute release guidelines,

This is not made clear in all individual files, which reflects only the hasty (but blessed by team leader) publication of internal files.

Ensembl

Ensembl migration to Git is under discussion (here and here).

Paul requested that views be mailed direct to him.

#Projects owners control the history. Anacode recommends against using this cvs2git process or publishing its output, except

  • when you are the code owner trying the process
  • as part of your own internal process, by which CVS patches are generated
  • to demonstrate something (temporary and disposable Git import)

Old README

Before import of SangerWiki (*.mediawiki) docs, there was a sparse README.

 Some files are historic / vestigial.
 
 
 The ensembl-otter module was imported regularly
 via a crontab line for mca@deskpro17119
 
   9,39 8-18       * * *   PUSH_AND_CLEAN=1 $C2R/add -Q cvs2git  $ANACODE_TEAM_TOOLS/git-importing/reimport.sh ensembl-otter
 
 ("$C2R/add -Q cvs2git" is a cron2rss wrapper)
 
 
 Other modules are imported ad-hoc.

Successes

  • Migration of ensembl-otter from to and https://github.com/Anacode
    We imported the entire module as one Git repository, and it makes sense this way.
  • Imports of other ensembl modules - these are for our internal (read-only) use. Many have branch structure problems imported from CVS.
  • The dbchk/ module (subdirectory) from /repos/cvs/anacode CVS, mostly as proof of concept after extending the importer wrapper.

Migration recipe

(incomplete - this is a reconstruction of what we did)

Look before you leap

Be clear about what you lose, that CVS provides. Many of these neat tricks turn out to be not such a good idea in use, anyway.

  • Ease of mixing multiple projects in one repository.
  • Ability to compose a set of these subprojects via CVSROOT/modules
  • Some attempt to track where checkouts have been scattered, via CVSROOT/history
  • Easy substitution of keywords in source files
  • Appending of commit-comment text to affected files via $Log$
  • Ability to quietly rewrite history without making a commit. Git explicitly supports history rewrite, and explicitly prevents "silent" history rewrite aka. corruption.
  • Apparent atomicity of commits between subprojects (even different repositories?).
    Older CVS commits were never atomic, but importers can patch them up so they looked like they were. Heuristics are used.
    Newer CVS commits have a commitid field, visible in cvs log output.
    If two CVS modules (directories) are transferred to separate Git repositories, atomicity of commits between them is lost. This can (has in our experience) caused deployment problems.
    • e.g. a module and its calling scripts having the meaning of some variable changed, but only one of them being deployed.
    Git supports the notion of submodules, as a nesting of repositories. These can record simultaneous state of multiple projects, but I haven't found them very practical. -- mca
  • Expect that what you learn during one migration may affect how you prefer to do another.

Sub-projects

Some CVS repositories contain one well-defined project.

Others contain multiple sub-projects, either split into well-defined "CVS modules" (subdirectories) or scattered more loosely. These may need breaking into pieces, in a way that they would not if you switched to Subversion.

If you have sub-projects, choose one or two to extract as Git repositories. Bear in mind

  • The issues of breaking atomic commits.
    You might prefer to avoid breaking atomic-in-CVS commits across two Git repositories,
    unless they are only atomic because someone ran cvs ci -m . in the top level.
  • Whether it would be meaningful to make a branch or tag covering all sub-projects.
    If you have existing CVS branches or tags which cover a subset of the files present at the time, ask what this means for the relationship of those files across those set boundaries.
  • At the code dependency level, which project requires the others?
    We preferred to reduce circular dependencies. Some projects are cross-linked like a thermoset polymer.
When migration of these is underway, you can continue with the remainder.
In the #Hackery phase, simply unlink all previously-imported files.

Select an importer

You're here because you want to generate an accurate Git repository from your old history.

You could omit this and start with one Git commit containing the most recent state at import time.

  • Throwing away history like this, or making it very difficult to access across the VCS type boundary, is likely to make future maintenance more difficult.
  • It is difficult to patch up the history later.

Why cvs2git?

  • Other tools are listed on https://git.wiki.kernel.org/index.php/Interfaces,_frontends,_and_tools#Interaction_with_other_Revision_Control_Systems
  • Reading Lessons from PostgreSQL's Git transition was a strong influence on Anacode's choice of cvs2git. I was encouraged by their attention to detail. -- mca
  • cvs2git is based on the cvs2svn importer, which is maintained alongside CVS itself. We took the reasonable view that these maintainers understand CVS ,v files (meaning of and problems with) best.
  • For imports on several projects, for every branch and tag, the import wrapper runs a diff -r . Nothing significant has shown up.
    There are some corner cases relating to $Keyword$ expansion, possibly during CVS vendor branch setup, which cause minor diffs. I ignored these. -- mca
  • Where Git commitids have been unstable, there has always been a sane explanation.
    • Non-historic permissions changed manually.
    • Same-second commits being reordered, for no discernible reason. Hash enumeration order?
We avoided git-cvsimport because it dropped branches it did not understand, such as humpub-branch-52.

After this I saw no need to investigate Tailor and other tools. -- mca

Perform a trial import

  • Expect to be doing many of these and start automating things.
  • Look at the result with gitk and run some diffs against CVS checkouts.
  • Check the branches and tags. There should be 1:1 correspondence, except for MAIN vs. master.
  • Expect to find that your project's #CVS history is broken.

Grow the import process

Make a script which performs your import in a repeatable way.

  • Put it on a crontab. Fresh CVS commits should then start appearing in a Git repository automatically.
  • Full import & checking runs may take 30 minutes. This may be reduced by some caching.

Run with both

Run with both systems for a while, considering CVS to be the master.

For early adopters,:
  1. build commits in Git
  2. squash or otherwise commit them to CVS, being careful to watch the outgoing diffs to avoid accidental reverts.
  3. wait for the importer to re-run
  4. git pull --rebase or similar, to rebase your un-pushed commits onto the latest CVS.
You should find that your "pushed" Git commits disappear from the rebased branch. This works almost as well as the git-svn workflow.
  • beware merges and branches. We stuck to linear -- mca
  • start replacing any CVS-dependent machinery in the project.
For late adopters,:
  • Ensure you can clone the new repositories, commit and push.
  • Look at the shiny new tools. gitk --all is my favourite -- mca
We ran in this state for some months. -- mca

Stability of import

Various small things may happen that cause the Git history to be wildly different in commitid on subsequent run of cvs2git.

  • Changes to the import process, including the #Hackery phase.
  • Changing the nature of the initial commit in the Git repository.
  • Fixes to permissions of CVS ,v files with chmod.
  • The historic ordering of non-atomic same-second CVS commits, as perceived by the importer.
    This seems to have instabilities, which we removed by bumping one commit on a second during #Hackery.
After such a commitid remapping event, you will need to rescue every derived clone,

  • branches taken off CVS history need rebasing
  • tags will be stale (left on the old branch), and need updating with git fetch -t
These problems are more likely to affect early adopters of the import. If you plan to #Run with both for a long time or with large numbers of developers taking clones, the risk exposure of needing to do a lot of fixing up will of course be higher.

Disentangle from CVS

While it is still in CVS, start to remove from the codebase any CVS-specific features.

  • $Keyword$ expansion, and any code that uses it e.g. my $VERSION = (qw$Revision: 1.1 $)[1]; . XXX: suggestions for replacements - distzilla, home-rolled version, git-describe
  • Make one last pass of fixing up execute permissions
    You can change them (via commit) in Git. Such commits would need to be rebased forward for each reimport, or merged after the importer is finally stopped.
    Changing them in CVS (via chmod foo,v) will change all subsequent commitids on the next cvs2git run after the first appearance of that file. See #Stability of import.
  • Fix -kb (binary flag) - relevant only for Windows users? I think Git handles this with smudge filters.
    It must be said that we paid little attention to this and related line-ending problems, living entirely in the Un*x world.
  • Stop using CVSROOT/modules .
  • Terminate as many CVS branches as possible (if any).
  • Find and remove explicit references to CVS
    • Mostly in POD + comments
    • Also in install/make_dist scripts.
  • Create or update any deployment scripts.
    • If you are considering git describe --tags to extract a label for the code, note that imported history may not be good sample material. The release tags are likely to be separated from their branch by fixup commits.
      Fixup commits are created by cvs2svn (before the "2git" part of cvs2git), to maintain the accuracy of the tagged set of files. Sometimes they are null commits, byt generally consist of removal of files which were not tagged in CVS.
      git describe will not look up the short side-branch for a label, so may be unable to return useful results. During Git development you may place Git tags to allow a meaningful version number to be extracted automatically.

Pick a date

Decide when to throw the switch.

  • Ensure all other repository users are following, by whatever means.
  • Arrange support from Systems group, if necessary.
  • Chase down old CVS working copies and commit until "clean enough".
    Later, commits will be impossible.
    Diffs and updates will remain possible until the CVS server goes away. You may plan to leave it up indefinitely, or keep a tarball of it "just in case".

Switch

  1. Stop the cvs2git importer loop
  2. Commit a prominent MOVED.txt or similar file to CVS, explaining where development is continuing.
  3. Make the old CVS modules (directories) read-only.
    This will be trickier if you are importing a scattered set of files. Have the script prepared.
(Optional) Separate the "CVS history" from "the future".
  • Leave the complete CVS history in one archive repository,
    with all branches and tags renamed into a cvs/ namespace, e.g. refs/heads/cvs/humpub-branch-52 . The import wrapper does this for you.
    the entire repository named one level down. Our import wrapper did this by pushing successful conversions to two places, one containing only master.
    made read-only after a final import run for MOVED.txt.
  • Continue working from a master branch, plus any other branches carried over, stored in another Git repository.
    You proceed on a relatively clean (long, but linear and uncluttered with tags) Git history.
  • It is easy to add (git fetch ancient) and remove (rm -r .git/refs/heads/cvs/ or similar) refs for the old CVS history, in any Git working copy.
  • Nothing is lost, but the namespace is nicely clean.

Afterwards

Git is now the primary source - migration complete.

  • Be prepared to deal with the inevitable mistakes that may be caused by adopting a new tool.
    We were pleased that ours were not published to Github, due to a decision to push release branches but not master.

CVS history is broken

To start with, you should suspect your CVS history.

Unless you are certain nobody has every moved any tags or taken a text editor to any ,v files, on a ten year old project it is inevitable there will be broken files and branches.

I have some experience patching these up. -- mca

Branch tags lost or moved

  1. Find unlabelled branches. They will import as numeric branch names like 1.1.2.2
  2. Restore their branch tags. This will require some sleuthing, local knowledge and/or guesswork.
    • In many cases the branch tag is still there on another numeric version.
    • Often it will have no commits on the branch (be grateful).
    • Otherwise, note that your history is ambiguous.
Also note,
  • Attempting to move a branch with cvs tag -F -B -b cannot be done right.
    Instead try cvs rtag and inspect the results with ViewVC's graphviz generator (view=graph).
  • If your CVS admin had the foresight to include a taglog, write and say thanks.
  • Problems may be fixed directly in the repository, if you have the necessary access and no less confidence than the person who broke it. Don't forget to take a backup first.
  • Or you can code up a bodge script to run in the #Hackery phase.

Revisions lost

  • Possibly caused by cvs admin -ko ?
  • These cause cvs2git to abort with an error. You won't get a repository out.
  • We used #Hackery to insert some bogus deltatext.

CVS merges are broken

Branch merges made with cvs up -j are (probably? did I miss something?) not propagated into Git - the files merely change with no record of why.

It is possible to update the Git history to record the additional parents. The cvs2git wrapper currently does not help with that. I did this for perlunit.sf.net , it is tedious -- mca

Old content, to revise

(delivery here triggerd by mg13's IGM talk 2012-05)

Import process

Our wrapper scripts perform repeatable imports with stable commitids, and check the contents of tags and the tips of all branches against CVS. This process has proved fairly stable but can be disrupted by the movement of CVS tags.

Output was pushed to the internal server at

  • /repos/git/anacode/cvs/ensembl-otter.git (all CVS tags and branches, no other Git pushes) and
  • /repos/git/anacode/ensembl-otter.git (MAIN branch only, and any other Git pushes)
By keeping all CVS branches and tags under the cvs/ namespace, we have a clean slate for the future but retain access to old versions. By keeping CVS history and the Git future in separate repositories, we keep the branch list short and initial history is clear of cvs2git fixups and strange merges.

Hackery

Happens between "get a copy of the ,v files" and "run cvs2git", if the relevant shellscript is present, to apply pre-import patches to the ,v files.

Examples of this are in the history of master branch and remain on the anacode branch.

Related

Initial Empty Commit

  • When viewing commit diffs in gitk, there is no diff for the initial commit. This in turns means that searches for commit "adding/removing string" will not show that commit, even if the initial tree contains the string.
  • When grafting a new commit tree onto an existing project, it makes sense to me for both have a common tree at some point. That point it the initial commit.
For these reasons, I routinely begin new repositories with git commit --allow-empty -m 'initial empty commit'

During the cvs2git munging process, such a commit is inserted at the beginning of the git-fast-import(1) stream.

  • For subsequent cvs2git import processes to be repeatable, all properties of this IEC must be identical.
  • To easily distinguish different repositories from each other, or match up clones, one property that could be useful is the commitid(s) of the set of commits with no parents.
In resolving these useful properties, the code generating IEC (do_import) changed several times. Currently the dates are fixed to arbitrary values and the author is changed via $IEC_NAME . There is also an escape hatch whereby fast-import data can be given directly in an override file via $IEC_FILE .

Goals

These were either implicit or the statement of them has been lost - sorry. Reconstruction is a work in progress. They are sorted by ascending "meta"ness.

Import from CVS to Git should be faithful

The aim is to not introduce any additional diffs. Bugs and ugly formatting from CVS should be faithfully reproduced.

Revisionism

If you want a revisionist import, you could experiment with that in several ways.

  • The most invisible would be to rewrite each commit with any formatting changes applied automatically. This would introduce a time-smeared patch. Any bugs generated by that process would be invisible until you diff against an original CVS checkout.
  • The most transparent would be to publish an extra set of branches and tags in the same repo, under another namespace (e.g. cvs-cleaned/foo). These can contain original commits interspersed with explicit revisionist commits.
Either way, round-tripping through CVS is likely to be impaired for the duration of the migration process.

An alternative might be a smudge filter which reformats code. The corresponding clean filter could be tricky to implement.

Mapping of CVS commit to Git commitid should be stable

The migration process makes repeated imports of the still-growing set of CVS commits. Git branches are expected to be taken from these CVS-derived commits.

When re-import process generates a different commitid to the one it made last time, downstream branches have to be recovered from what is effectively an upstream rebase

Care has been taken to make the import deterministic and reproducible, with a hackery phase to patch up any non-determinism. It also requires that non-fast-forward pushes of imports are manually approved.

Sometimes rebase is inevitable. Where CVS history-so-far makes them necessary, it is worth chasing them down to do them early and in one pass.

Projects owners control the history

Technically, nothing short of secrecy will stop you rewriting repository history with different version control, author names or code formatting.

This section aims to persuade you that doing this publicly without consent of the project owner may be a short-term win, but the longer term cost is too high.

Private migration

Used privately, this migration process allows Git to be used as a more friendly front to the CVS backside.

CVS commits you make can be

  • squashed Git branches. Information will be lost.
  • commits from interactively rebased and curated branches. You appear to get it right first time.
  • straight off the editor. Only the timestamps are lost.
  • onto a CVS branch, with a later merge to MAIN. Merge info will be lost.
Your history when you contribute it. Your choice.

Public project-owner migration

If the project-owner publishes the Git repository during migration you have one central and up-to-date Git repository. This would be useful for anyone wanting to see the CVS history.

An "official early-access Git repository" is more efficient than several individuals repeating the import.

The split between CVS and Git looks a bit like an old fashioned "bad" project fork.

  • Does your project have the cohesion to prevent this becoming a real split?
  • Would an early-access Git repository make it difficult for you to choose to NOT migrate to Git?

Public third-party migration

By publishing your own migration of somebody's project, you encourage others to take branches but you have no way to round up the results.

  • Who is going to push those commits back to CVS?
  • When rebase occurs, who is going to pick up the pieces?
  • If there is an official migration it is very likely the commitids will differ.
Code could be left behind on dead Git branches. These forks will requires more effort to merge than the usual Git type of fork.

  • Does it look like you are wrestling for control of the project?
Then we have left programming and entered politics.

Imports of selected versions

This has been done to good effect e.g. gitPAN for released versions of CPAN distributions.

Its minimal form is a fresh Git repository containing a checkout of MAIN from somewhere else.

It is a useful compromise when access to original code history is impossible, or too difficult in aggregate across a large number of projects.

  • History committed later onto such an import is disconnected from the full code history.
  • It could later be rebased onto a full history, but who will do that?
So this looks like a fresh start, possibly a project take-over. This is appropriate for some CPAN distributions but will not suit all projects.