Design and implementation of the Mirror module (formerly: Incremental/resumable GraphQL loading notes) #622

Closed
wchargin opened this issue Aug 8, 2018 · 18 comments

@wchargin wchargin commented Aug 8, 2018

Documenting design efforts here.

Editor’s note: Comments before 2018-08-10 describe older ideas of how the system might be designed, and were never implemented. Discussion of the design that we use today starts here: #622 (comment)


@wchargin wchargin commented Aug 8, 2018

Loosely organized brain dump, 2018-08-07.

Observations

  • At the value level, only objects (and primitives) exist. Interfaces
    and unions exist only at the level of the type system.

  • We want to assume that every node is identifiable by an id. This
    wreaks havoc on the Actor interface, which does not structurally
    extend Node (awful design), even though all its implementations
    do.

Schemata

We make the concession that an interface is treated as a union of its
known implementations. This resolves the issue that the Actor
interface in the GitHub schema does not even expose an ID property,
which significantly limits how we may interact with it. It has slightly
negative effects on code generation compared to hand-written code
(common fieldsets will not be factored out). This concession is really
not too bad in our case; it incurs an additional ~30 bytes per request
sent to GitHub.

Every object-type schema is required to have exactly one field of type
id, and may have zero or more other fields.

const schema = s.schema({
  Repository: s.object([
    s.field("id", s.id()),
    s.field("url", s.primitive()),
    s.field("issues", s.connection("Issue")),
  ]),
  Issue: s.object([
    s.field("id", s.id()),
    s.field("url", s.primitive()),
    s.field("author", s.node("Actor")),
    s.field("parent", s.node("Repository")),
    s.field("title", s.primitive()),
    s.field("comments", s.connection("IssueComment")),
  ]),
  IssueComment: s.object([
    s.field("id", s.id()),
    // ...
  ]),
  Actor: s.union(["User", "Bot", "Organization"]),  // actually an interface
  User: s.object([
    s.field("id", s.id()),
    s.field("url", s.primitive()),
    s.field("login", s.primitive()),
  ]),
  Bot: /* same */,
  Organization: /* same */,
});

Schema module (s) interface:

export interface SchemaModule {
  schema: ({[Typename]: NodeType}) => Schema;
  object: ($ReadOnlyArray<Field>) => NodeType;
  union: ($ReadOnlyArray<Typename>) => NodeType;
  field: (name: string, type: FieldType) => Field;
  id: () => FieldType;
  primitive: () => FieldType;
  node: (Typename) => FieldType;
  connection: (Typename) => FieldType;
}
type Typename = string;
declare opaque type Schema;
declare opaque type NodeType;
declare opaque type Field;
declare opaque type FieldType;

Entry point

We need to somehow specify the objects that we care about, and how to
get them in the first place. A simple way to do that is the following:

const rootRepo = entryPoint({
  type: "Repository",
  extract: (response) => response.repository,
  body: (fields) => [
    b.query(
      "FetchData",
      [b.param("owner", "String!"), b.param("name", "String!")],
      [
        b.field(
          "repository",
          {owner: b.variable("owner"), name: b.variable("name")},
          fields,
        ),
      ]
    ),
  ],
});

There may be other similar approaches, some of them perhaps nicer, but
the simplicity of this approach is appealing.

Another high-level approach would be to require that the system work
only on Id-identified nodes. This would require that the client first
perform one query outside of the system to get a repository’s Id…

{ repository(owner: "foo", name: "bar") { id } }

and then simply feed the system {type: "Repository", id} to get
started. This is maybe simpler from the perspective of the system, but
maybe more complex for clients. Either is a viable option.

Persistent representations of nodes

i987 [SerializedObject<Issue>] {
  meta [ObjectMeta<Issue>]: {
    last-updated [Timestamp]: @12345
    kind [Kind]: "object"
    type [Typename]: "Issue"
  }
  data [Fields<Issue>]: {
    id [Id<Issue>]: i987
    url [Primitive]: "https://example.com/"
    author [Id<Actor>]: i444
    parent [Id<Repository>]: i555
    title [Primitive]: "my favorite issue"
    comments [Id<Connection<IssueComment>>]: i987-comments
  }
}

i987-comments [SerializedConnection<IssueComment>] {
  meta [ConnectionMeta<Issue>]: {
    last-updated [Timestamp]: @12346
    kind [Kind]: "connection"
    type [Typename]: "IssueComment"
    parent [Id<Issue>]: i987
    field [Primitive]: "comments"  # an index into the schema fields
    end-cursor [Primitive]: "uiodfwej"
  }
  nodes [List<ID<IssueComment>>]: [
    i765
    i766
    # ...
  ]
}

@wchargin wchargin commented Aug 9, 2018

Daily dump of some stuff that is reasonably solid.

Object model:

// A reference to a GraphQL node. The typename is its GraphQL type, as
// defined in a relevant schema. The ID is the unique ID assigned by the
// remote server (i.e., GitHub).
type Ref = {|+typename: Typename, +id: string|};

// The interface used by the GitHub plugin. Described below.
declare class GraphqlMirror {
  constructor(options: {|
    +storage: Storage,
    +schema: Schema,
    +ttl: Duration
  |}): void;
  update(root: Ref): void;
  denormalize(root: Ref): any;
}

// An abstraction over persistent storage.
interface Storage {
  write(filename: string, contents: string): void;
  read(filename: string): string | void;
  sync(): Promise<void>;
}

// Storage backed by a physical filesystem; `sync` writes to disk.
declare class DiskStorage implements Storage {
  constructor(basedir: string): void;
  ...Storage;
}

// Memory-only storage, for tests and such; `sync` is NOP.
declare class MemoryStorage implements Storage {
  constructor(initialState: Map<string, string>): void;
  ...Storage;
}

Top-level options (constructor to GraphqlMirror):

  • storage: Persistent storage, wholly scoped to the plugin and
    repository. In principle, this could be structurally shared across
    repositories, but we’ll keep it simple for now.
  • schema: A structured sub-GraphQL schema, as described above. This
    is required to remain fixed across invocations with the same cache
    directory; it may be persisted on disk to check for
    data-incompatible code upgrades.
  • ttl: Time to live; a duration after which data is to be considered
    stale.* This may vary across invocations.

* For liveness, we may wish to relax the definition of staleness to
prevent a case wherein our fetch rate can’t keep up with the TTL.
Semantics to be determined. This isn’t very important for an initial
version of the system, so we omit discussion.

Desired operations (methods on GraphqlMirror):

  • update(root: Ref): Make any needed updates for root and any
    transitive dependencies. Writes to persistent storage, syncing
    periodically (in particular, syncing at each network request).
  • denormalize(root: Ref): Create a structured form where IDs are
    replaced by their referents and connections are inlined as arrays,
    failing if any transitive dependencies are not met. The result will
    conform to the schema entry for type root.typename.

Workflow:

  • The GitHub plugin maintains a standalone registry mapping Repo to
    Ref. A query repository(owner: "foo", name: "bar") { id } is
    used to populate the repo-ref registry.
  • At node ./bin/sourcecred.js load time:
    • Retrieve the repository’s GraphQL ID id from the repo-ref
      registry, computing via a single GraphQL query if absent. Let
      ref be {typename: "Repository", id}.
    • Execute update(ref).
    • Let repo be denormalize(ref).
    • Construct a RelationalView from repo, and store it on disk
      in the data directory.
  • At runtime, the relational view is loaded from disk as usual.
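
A minimal sketch of this load-time flow, assuming the GraphqlMirror and
DiskStorage interfaces above (`hours` and `repoRefRegistry` are
hypothetical helpers):

```js
// Sketch only: wires together the interfaces declared above.
const storage = new DiskStorage("/path/to/cache/github");
const mirror = new GraphqlMirror({storage, schema, ttl: hours(12)});

// Look up (or fetch and cache) the repository's GraphQL ID.
const id = repoRefRegistry.lookupOrFetch({owner: "foo", name: "bar"});
const ref = {typename: "Repository", id};

mirror.update(ref); // bring `ref` and its transitive dependencies up to date
const repo = mirror.denormalize(ref); // nested data conforming to the schema
// ...then construct a RelationalView from `repo` and write it to disk.
```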

Update semantics:

  • A node is up to date if (a) it is present, (b) it was updated within
    the TTL, and (c) if it is a connection, it is exhausted.
  • Updating an object node entails fetching its data and replacing the
    existing content.
  • Append-updating a connection node entails fetching any remaining
    elements, starting from its end cursor.
  • Full-updating a connection node essentially entails removing it and
    then append-updating it. Only a full update will pick up deletions,
    reorders, etc. of elements in the list. We may or may not support
    full-updating in the first version, as we do not expect this to be a
    big issue; the data that we care about is append/edit-only in the
    vast majority of cases.
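
As a sketch, the staleness rule above might read as follows (field names
are illustrative, not a final schema):

```js
// A node is up to date iff it is present, fresh, and (for connections) exhausted.
function isUpToDate(node, nowMs, ttlMs) {
  if (node == null) return false; // (a) present
  if (nowMs - node.lastUpdateMs > ttlMs) return false; // (b) fresh
  if (node.kind === "connection" && node.hasNextPage) return false; // (c) exhausted
  return true;
}
```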

@wchargin wchargin commented Aug 9, 2018

-- Current thoughts about representation follow.


-- A singleton table. We store the stable-stringified JSON schema
-- here. At each load, we check that it is consistent to prevent
-- data incompatibility. This is redundant with the actual table
-- structure, but is convenient and only incurs a small constant
-- additive increase in space usage.
CREATE TABLE meta (
    json_schema TEXT NOT NULL
);

-- An "update" is an invocation of `node ./bin/sourcecred.js load`. Such
-- an update may write to the database multiple times, updating the
-- timestamp each time, so that the final timestamp associated with the
-- update is the time that the update completes.
CREATE TABLE updates (
    time_epoch_seconds INTEGER NOT NULL /* epoch time in seconds */
);

-- Each node has an entry in `objects`. Its data is stored across three
-- tables:
--   * for primitives, a type-specific "data" table;
--   * for singleton fields (e.g., author of issue), the "links" table;
--   * for connections (e.g., comments on PR), the "connections" table.
CREATE TABLE objects (
    id TEXT PRIMARY KEY NOT NULL,
    last_update INTEGER,
    typename TEXT NOT NULL,  -- needed for union/interface deref
    FOREIGN KEY(last_update) REFERENCES updates(rowid)
) WITHOUT ROWID;

CREATE TABLE links (
    parent_id TEXT NOT NULL,
    fieldname TEXT NOT NULL,
    child_id TEXT NOT NULL,
    PRIMARY KEY(parent_id, fieldname),
    FOREIGN KEY(parent_id) REFERENCES objects(id),
    FOREIGN KEY(child_id) REFERENCES objects(id)
) WITHOUT ROWID;
CREATE TABLE connections (
    object_id TEXT NOT NULL,
    fieldname TEXT NOT NULL,
    end_cursor TEXT /* may be null if no items */,
    UNIQUE(object_id, fieldname)
);
CREATE TABLE connection_entries (
    conn_id INTEGER NOT NULL,
    idx INTEGER NOT NULL,  -- impose an ordering
    child_id TEXT NOT NULL,
    FOREIGN KEY(conn_id) REFERENCES connections(rowid)
);

-- Primitives on objects.
CREATE TABLE data_Repository (
    id TEXT NOT NULL PRIMARY KEY,
    description TEXT NOT NULL,
    starcount INTEGER NOT NULL
) WITHOUT ROWID;
CREATE TABLE data_Organization (
    id TEXT NOT NULL PRIMARY KEY,
    login TEXT NOT NULL
) WITHOUT ROWID;
CREATE TABLE data_User (
    id TEXT NOT NULL PRIMARY KEY,
    login TEXT NOT NULL
) WITHOUT ROWID;
CREATE TABLE data_Issue (
    id TEXT NOT NULL PRIMARY KEY,
    num INTEGER NOT NULL,
    title TEXT NOT NULL
) WITHOUT ROWID;

-- Example data.
INSERT INTO updates VALUES (1533851800);
INSERT INTO objects (last_update, typename, id) VALUES
    (1, 'Repository', 'repo:sc/sc'),
    (1, 'Organization', 'org:sourcecred'),
    (1, 'Issue', 'issue:sc#1'),
    (1, 'User', 'user:wchargin')
;
INSERT INTO connections (object_id, fieldname, end_cursor) VALUES
    ('repo:sc/sc', 'issues', 'cr123'),
    ('issue:sc#1', 'comments', NULL)
;
INSERT INTO links (parent_id, fieldname, child_id) VALUES
    ('repo:sc/sc', 'owner', 'org:sourcecred'),
    ('issue:sc#1', 'author', 'user:wchargin')
;
INSERT INTO connection_entries (conn_id, idx, child_id) VALUES
    (1 /* (repo:sc/sc).issues */, 0, 'issue:sc#1')
;

INSERT INTO data_Repository (id, description, starcount) VALUES
    ('repo:sc/sc', 'open source is pretty cool', 57);
INSERT INTO data_Organization (id, login) VALUES
    ('org:sourcecred', 'sourcecred');
INSERT INTO data_User (id, login) VALUES
    ('user:wchargin', 'wchargin');
INSERT INTO data_Issue (id, num, title) VALUES
    ('issue:sc#1', 1, 'todo: write the code');

-- We can find transitive dependencies...
WITH RECURSIVE cte_deps (id) AS (
    SELECT 'repo:sc/sc'
    UNION SELECT connection_edges.child FROM cte_deps JOIN (
        SELECT parent_id AS parent, child_id AS child FROM links
        UNION
        SELECT
            connections.object_id AS parent,
            connection_entries.child_id AS child
        FROM connections JOIN connection_entries
        ON connections.rowid = connection_entries.conn_id
    ) AS connection_edges
    ON cte_deps.id = connection_edges.parent
)
--
-- ...and see which are out of date:
SELECT typename, objects.id, last_update FROM objects JOIN cte_deps
ON objects.id = cte_deps.id
WHERE last_update < 2;  -- could do a join with timestamp to be more specific

-- vim: ft=sql

@wchargin wchargin commented Aug 10, 2018

Let me elaborate on the previous dump of SQL code.

Our goal is to persistently store and update a subset of entities
fetched from a remote GraphQL server. There are three evident approaches
to the storage aspect:

  • Store everything in a db.json file.
  • Store each object in a separate file, like issue_17.json.
  • Use a SQL database or other similar tool.

The first option has serious disadvantages. Reading any object requires
deserializing a massive blob of JSON. Writing any object or making any
change requires serializing a similarly massive blob. These limitations
mean that you have to store the entire string in Node memory (unless you
have some on-the-fly JSON decoder that reads from a stream directly),
and you have to store the whole database in memory to read a single
object. Node also has limits on the size of a string, which we have
already hit on large repositories.

The second option also has disadvantages. We might easily have tens of
thousands of objects to store—for instance, the GitHub graph for
metamask/metamask-extension has 31 853 nodes at last count. This
incurs a lot of syscalls, a lot of context switches, and a lot of
interaction with the filesystem. Typical filesystems also struggle with
directories that have a large number of entries—operations like
pathname-to-inode resolution perform a linear scan over the directory
entries—so we’d want to perform some kind of chunking, like Git does
with .git/objects/??/*.

In light of the above discussion, it is silly to deal with these
problems when an obvious solution exists: just use an actual database.
SQLite is an attractive choice: it stores its database in a
single file, it is lightweight and portable, and it has a plethora of
Node bindings. Thus, we adopt this approach.
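
For instance, with the better-sqlite3 bindings (one option among
several), the whole store is a single file and queries are synchronous:

```js
// A minimal sketch of opening and querying the store with better-sqlite3.
const Database = require("better-sqlite3");

const db = new Database("/tmp/mirror.db");
db.prepare("CREATE TABLE IF NOT EXISTS meta (json_schema TEXT NOT NULL)").run();
const row = db.prepare("SELECT COUNT(1) AS n FROM meta").get();
console.log(row.n); // 0 on a fresh database
```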

The remainder of this document is a literate SQL script. You can run it
with

src=/path/to/this/document &&
db=/tmp/test.db &&
rm -f "$db" &&
sqlite3 "$db" < <(print_code_fences -v lang=sql "$src")

where print_code_fences comes from this gist.

Boilerplate

First, some setup for pretty-printing, and an indication that SQLite
should check foreign key constraints. (Without the PRAGMA, everything
would work, but the foreign key annotations would be ignored.)

.mode line
PRAGMA foreign_keys = ON;

Our goal is to turn a schema for a subset of a GraphQL database into a
schema for a SQL database. We will demonstrate on a subset of the GitHub
schema.

We start with a safety check. If the schema changes between invocations,
we should fail and require the user to blow away all data. The simplest
way to do this is to store a JSON-serialized version of the GraphQL
schema from which we generated the SQL schema, and check at each
invocation that it matches the schema that we expect. There’s a bit of
redundancy here, but that’s okay (the additional space used is totally
negligible). We store this data in a singleton table.

CREATE TABLE meta (
    json_schema TEXT NOT NULL
);
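
A sketch of the corresponding startup check (better-sqlite3 plus a
stable JSON stringifier; the exact serialization format is an
implementation detail):

```js
const stringify = require("json-stable-stringify");

// Fail fast if the schema on disk differs from the schema in the code.
function checkSchema(db, graphqlSchema) {
  const expected = stringify(graphqlSchema);
  const row = db.prepare("SELECT json_schema FROM meta").get();
  if (row === undefined) {
    db.prepare("INSERT INTO meta (json_schema) VALUES (?)").run(expected);
  } else if (row.json_schema !== expected) {
    throw new Error("schema mismatch: please delete the cache and re-load");
  }
}
```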

Next, we define what it means for data to be out of date. We say that a
single invocation of node ./bin/sourcecred.js load is an update. An
update has an associated timestamp, but an update may write to the
database multiple times (say, between network requests), updating its
timestamp each time. Then the final timestamp associated with an update
is the time that the update completes.

CREATE TABLE updates (
    rowid INTEGER PRIMARY KEY,
    time_epoch_seconds INTEGER NOT NULL
);

Our GraphQL objects will be stored as follows. Each GraphQL object has
an entry in the objects table, which is only a registry. It stores the
last time that the object was updated, and the GraphQL type associated
with the object, but does not store any actual data.

CREATE TABLE objects (
    id TEXT PRIMARY KEY NOT NULL,
    last_update INTEGER,
    typename TEXT NOT NULL,
    FOREIGN KEY(last_update) REFERENCES updates(rowid)
) WITHOUT ROWID;

The data for an object is stored across three tables, depending on the type of each field.

  • Primitive data is stored in a type-specific table, like
    data_Repository. The fact that this table is type-specific means
    that the actual data can be stored with locality and with strong
    typing.
  • A reference to a single other GraphQL object, like the author of an
    issue or the owner of a repository, is stored in the links table.
  • A connection, like the labels on a pull request or the comments on a
    review, is stored in the connections table, and individual entries
    in the connection are stored in the connection_entries table.

Here are the schemata for the link and connection tables:

CREATE TABLE links (
    parent_id TEXT NOT NULL,
    fieldname TEXT NOT NULL,
    child_id TEXT NOT NULL,
    PRIMARY KEY(parent_id, fieldname),
    FOREIGN KEY(parent_id) REFERENCES objects(id),
    FOREIGN KEY(child_id) REFERENCES objects(id)
) WITHOUT ROWID;

CREATE TABLE connections (
    rowid INTEGER PRIMARY KEY,
    object_id TEXT NOT NULL,
    fieldname TEXT NOT NULL,
    -- The end cursor may be NULL if no items are in the connection;
    -- this is a consequence of GraphQL and the Relay pagination spec.
    end_cursor TEXT,
    UNIQUE(object_id, fieldname)
);

CREATE TABLE connection_entries (
    rowid INTEGER PRIMARY KEY,
    conn_id INTEGER NOT NULL,
    idx INTEGER NOT NULL,  -- impose an ordering
    child_id TEXT NOT NULL,
    UNIQUE(conn_id, idx),
    FOREIGN KEY(conn_id) REFERENCES connections(rowid),
    FOREIGN KEY(child_id) REFERENCES objects(id)
);

I considered using (object_id, fieldname) as the primary key for
connections. This makes logical sense to me, but it incurs a
non-trivial increase in the space usage of connection_entries. Typical
GitHub IDs range in length from ~20 to ~40 characters (Base64 code
units), for a total PK size of maybe ~30 to ~50 bytes. TensorFlow, a
large repository, has ~32K issues and pull requests combined; if we take
each of these to have 8 or 16 or even 128 transitive connections on
average, then a numeric connection ID would have around 16 to 22 bits
total, making it a factor of ~18 smaller than the structured primary
key.
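
As a sketch, append-updating a connection under this schema might look
like the following (better-sqlite3; the page shape loosely follows the
Relay connection spec, and the child objects are assumed to already be
registered in objects):

```js
// Append one fetched page of child IDs to a connection and advance its cursor.
function appendConnectionPage(db, objectId, fieldname, page) {
  const conn = db
    .prepare("SELECT rowid FROM connections WHERE object_id = ? AND fieldname = ?")
    .get(objectId, fieldname);
  const nextIdx = db
    .prepare("SELECT COALESCE(MAX(idx) + 1, 0) AS idx FROM connection_entries WHERE conn_id = ?")
    .get(conn.rowid).idx;
  const insertEntry = db.prepare(
    "INSERT INTO connection_entries (conn_id, idx, child_id) VALUES (?, ?, ?)"
  );
  page.nodes.forEach((node, i) => insertEntry.run(conn.rowid, nextIdx + i, node.id));
  db.prepare("UPDATE connections SET end_cursor = ? WHERE rowid = ?")
    .run(page.endCursor, conn.rowid);
}
```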

Here are some sample type-specific data tables:

CREATE TABLE data_Repository (
    id TEXT NOT NULL PRIMARY KEY,
    description TEXT NOT NULL,
    starcount INTEGER NOT NULL,
    FOREIGN KEY(id) REFERENCES objects(id)
) WITHOUT ROWID;
CREATE TABLE data_Organization (
    id TEXT NOT NULL PRIMARY KEY,
    login TEXT NOT NULL,
    FOREIGN KEY(id) REFERENCES objects(id)
) WITHOUT ROWID;
CREATE TABLE data_User (
    id TEXT NOT NULL PRIMARY KEY,
    login TEXT NOT NULL,
    FOREIGN KEY(id) REFERENCES objects(id)
) WITHOUT ROWID;
CREATE TABLE data_Issue (
    id TEXT NOT NULL PRIMARY KEY,
    num INTEGER NOT NULL,
    title TEXT NOT NULL,
    FOREIGN KEY(id) REFERENCES objects(id)
) WITHOUT ROWID;

Note that the foreign keys ensure that any referenced object must exist
in the objects table, but this doesn’t mean that we have to have data
for any referenced object. This is good!

Implicit in this schema is the key restriction that the set of data that
we care about is fully determined by an object’s type. For instance, we
cannot record a user’s email address when we reach them via a Git commit
yet omit it when they merely posted an issue. On the contrary: a User is
a User; we define a certain set of User fields to store, and we store
them for every user.

Storing the typename of an object in the objects table is necessary to
handle interface and union types. For instance, from the GraphQL schema
for Repository, we know that the owner property has type Actor.
But an actor could be a user, an organization, or a bot. If all we know
is that a repository’s owner has a particular ID, we won’t know a priori
whether to look up its data in the data_User table or the
data_Organization table. The typename field disambiguates this.
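
A sketch of such a dereference (the typename comes from our own objects
table, not from user input, so interpolating it into the table name is
safe here):

```js
// Read the primitive data for an object whose concrete type we may not
// know statically (e.g., the owner of a repository, which is an Actor).
function readData(db, id) {
  const obj = db.prepare("SELECT typename FROM objects WHERE id = ?").get(id);
  if (obj === undefined) throw new Error("no such object: " + id);
  return db.prepare(`SELECT * FROM data_${obj.typename} WHERE id = ?`).get(id);
}
```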

Here is some example data. We use human-readable IDs for this example,
but the IDs would actually be GitHub’s opaque Base64-encoded strings.

INSERT INTO updates (time_epoch_seconds) VALUES (1533851800), (1533940400);
INSERT INTO objects (last_update, typename, id) VALUES
    (1, 'Repository', 'repo:sc/sc'),
    (1, 'Organization', 'org:sourcecred'),
    (1, 'Issue', 'issue:sc#1'),
    (2, 'Issue', 'issue:sc#2'),
    (1, 'User', 'user:wchargin'),
    (2, 'User', 'user:decentralion'),
    (1, 'User', 'user:bystander')
;

-- Now that the objects have been defined, the following insert
-- statements can be commuted in any order.

INSERT INTO connections (object_id, fieldname, end_cursor) VALUES
    ('repo:sc/sc', 'issues', 'cr123'),
    ('issue:sc#1', 'comments', NULL),
    ('issue:sc#2', 'comments', NULL)
;
INSERT INTO links (parent_id, fieldname, child_id) VALUES
    ('repo:sc/sc', 'owner', 'org:sourcecred'),
    ('issue:sc#1', 'author', 'user:wchargin'),
    ('issue:sc#2', 'author', 'user:decentralion')
;
INSERT INTO connection_entries (conn_id, idx, child_id) VALUES
    (1, 0, 'issue:sc#1'),  -- 1 = (repo:sc/sc).issues
    (1, 1, 'issue:sc#2')   -- 1 = (repo:sc/sc).issues
;

INSERT INTO data_Repository (id, description, starcount) VALUES
    ('repo:sc/sc', 'open source is pretty cool', 58);
INSERT INTO data_Organization (id, login) VALUES
    ('org:sourcecred', 'sourcecred');
INSERT INTO data_User (id, login) VALUES
    ('user:wchargin', 'wchargin'),
    ('user:decentralion', 'decentralion'),
    ('user:bystander', 'bystander');
INSERT INTO data_Issue (id, num, title) VALUES
    ('issue:sc#1', 1, 'todo: write the code'),
    ('issue:sc#2', 2, 'todo: ship the thing');

Note that the foreign key constraints prevent some kinds of errors—any
of the following queries would raise an error because issue:sc#3 and
user:someone-else are not registered as objects:

/*
INSERT INTO data_Issue (id, num, title) VALUES
    ('issue:sc#3', 1, 'todo: add this object to the database');
INSERT INTO links (parent_id, fieldname, child_id) VALUES
    ('issue:sc#1', 'assignee', 'user:someone-else');
INSERT INTO links (parent_id, fieldname, child_id) VALUES
    ('issue:sc#3', 'author', 'user:wchargin');
INSERT INTO connection_entries (conn_id, idx, child_id) VALUES
    (1, 2, 'issue:sc#3');  -- 1 = (repo:sc/sc).issues
*/

By expressing the links and the connections in the relational model, we
can leverage the expressive power of SQL to do useful work. For
instance, if we’ve been asked to bring a particular repository node up
to date, we can ask SQLite exactly which objects are transitively
referenced and have not been updated since a certain time (which falls
between update 1 and update 2):

WITH RECURSIVE cte_deps (id) AS (
    VALUES ('repo:sc/sc')
    UNION SELECT connection_edges.child FROM cte_deps JOIN (
        SELECT parent_id AS parent, child_id AS child FROM links
        UNION
        SELECT object_id AS parent, child_id AS child
        FROM connections JOIN connection_entries
        ON connections.rowid = connection_entries.conn_id
    ) AS connection_edges
    ON cte_deps.id = connection_edges.parent
)
SELECT typename, cte_deps.id, last_update
FROM objects
LEFT OUTER JOIN updates ON last_update = updates.rowid
JOIN cte_deps ON objects.id = cte_deps.id
WHERE last_update IS NULL OR updates.time_epoch_seconds < 1533851801;

This prints:

   typename = Issue
         id = issue:sc#1
last_update = 1

   typename = Organization
         id = org:sourcecred
last_update = 1

   typename = Repository
         id = repo:sc/sc
last_update = 1

   typename = User
         id = user:wchargin
last_update = 1

Note that issue:sc#2 and user:decentralion are not printed because
they were updated recently enough, and user:bystander is not printed
because there is no transitive reference to that user.

Next on the table: given a GraphQL schema, what kinds of GraphQL queries
do we generate? (In particular: how are initial fetches different from
updates?)

@wchargin wchargin self-assigned this Aug 22, 2018

@wchargin wchargin commented Aug 22, 2018

As alluded to previously, the remaining piece of design is to establish
what kinds of GraphQL queries we generate given a schema and a database
state.

Suppose that we want to fetch the contents of a repository, which has
issues, which have comments. We could execute a query to get this all at
once (eliding pageInfo, etc. for brevity):

repository {
  name
  issues(first: 100) {
    nodes {
      title
      comments(first: 100) {
        nodes {
          body
        }
      }
    }
  }
}

Or, we could use a number of “shallow queries”, in which we always only
request the id of a node in a connection:

# Query 1
repository {
  name
  issues(first: 100) { nodes { id } }
}

# Query 2
nodes(ids: ["issue1", "issue2"]) {
  ... on Issue {
    title
    comments(first: 100) { nodes { id } }
  }
}

# Query 3
nodes(ids: ["comment1", "comment2"]) {
  ... on IssueComment {
    body
  }
}

The most striking difference is that shallow queries require more
queries. This is true, but the difference is not as bad as it may sound.
Due to interleaving, we only experience an additive increase in query
count, not a multiplicative increase. We present a numerical example
below.

A worked example

Suppose that we have a repository with 300 issues, each with 300
comments. Consider the two approaches.

With deep queries, we learn the following information at each query:

    • data for the repo
    • issues 1–100
    • comments 1–100 on issues 1–100
    • issues 101–200
    • comments 1–100 on issues 101–200
    • comments 101–200 on issues 1–100
    • issues 201–300
    • comments 1–100 on issues 201–300
    • comments 101–200 on issues 101–200
    • comments 201–300 on issues 1–100
    • comments 101–200 on issues 201–300
    • comments 201–300 on issues 101–200
    • comments 201–300 on issues 201–300

With shallow queries, we learn the following information at each query:

    • data for the repo
    • IDs for issues 1–100
    • data for issues 1–100
    • IDs for comments 1–100 on issues 1–100
    • IDs for issues 101–200
    • data for comments 1–100 on issues 1–100
    • data for issues 101–200
    • IDs for comments 1–100 on issues 101–200
    • IDs for comments 101–200 on issues 1–100
    • IDs for issues 201–300
    • data for comments 1–100 on issues 101–200
    • data for comments 101–200 on issues 1–100
    • data for issues 201–300
    • IDs for comments 1–100 on issues 201–300
    • IDs for comments 101–200 on issues 101–200
    • IDs for comments 201–300 on issues 1–100
    • IDs for issues 201–300
    • data for comments 1–100 on issues 201–300
    • data for comments 101–200 on issues 101–200
    • data for comments 201–300 on issues 1–100
    • data for issues 201–300
    • IDs for comments 101–200 on issues 201–300
    • IDs for comments 201–300 on issues 101–200
    • data for comments 101–200 on issues 201–300
    • data for comments 201–300 on issues 101–200
    • IDs for comments 201–300 on issues 201–300
    • data for comments 201–300 on issues 201–300

The fact that we have 2 levels of nesting translates to 2 extra queries,
not a factor-2 increase in the number of queries, because we can fetch
in parallel across different levels of nesting. That is, in each shallow
query, we can fetch data for all objects whose IDs we know.

A major advantage of using shallow queries is that it significantly
decreases the worst-case query complexity and thus the size limit. If we
request 100 pulls with 100 reviews with 100 comments, then GitHub notes
that this query could in principle return a million nodes—even if it is
actually very rare for a pull to have more than (say) 5 reviews. With
shallow queries, we can ask for 100 comments on each of only the 5
reviews that actually exist.

Another advantage of shallow queries is that they reduce data duplication
in the result. Duplication of object data will be common: if you fetch
the first 100 pulls from sourcecred/sourcecred and their authors,
you’ll get 43 copies of the user data for decentralion and 57 copies
of the user data for wchargin. This seems a bit silly, though not
actively problematic as long as the data is consistent.

It’s a bit more complicated, though, when the objects being fetched have
connections. For instance, issue #483 currently appears twice in the
results of the below query:

{
  repository(owner: "sourcecred", name: "sourcecred") {
    labels(first: 5) {
      nodes {
        name
        issues(first: 3) {
          nodes {
            number
            comments(first: 3) {
              nodes {
                id
              }
            }
          }
        }
      }
    }
  }
}

We now have to take care when processing the results. Usually, if we
receive issue comments for an issue that is already in the database, we
append the comments to those already stored. But here the comments must
be stored only once even though they appear twice.

Another potential advantage of using shallow queries is that this
might help prime GitHub’s servers’ caches, decreasing the effect of
out-of-band rate limiting. See discussion on #350 for more
details. This is purely speculative, but may be helpful.

To summarize, shallow queries:

  • require an additive increase in the number of queries;
  • may have smaller worst-case result size, decreasing API cost;
  • eliminate data duplication (saving bandwidth);
  • eliminate connection entry duplication (simplifying implementation).

For these reasons, I currently plan to use shallow queries as described
above. This decision is somewhat but not wholly fundamental; if we need
to reverse it later, it may take some work, but it will be doable
without needing to rip out the entire loading system.


@wchargin wchargin commented Aug 22, 2018

At this point, I turn to implementation, with the following rough plan:

  • Create the schema module.
  • Generate a SQL schema from a GraphQL schema.
  • Implement an “initial query” API that adds the root object (e.g., a
    GitHub repository).
  • Implement updates.

@decentralion decentralion commented Aug 24, 2018

Just a thought as you're working on this:
It would be really great if we could show a progress bar on the GitHub load.


@wchargin wchargin commented Aug 24, 2018

It would be really great if we could show a progress bar on the GitHub
load.

I’ve been thinking about that. On the one hand, it’s tricky, because
we’re exploring an object graph of unknown size. But on the other hand,
we do have some information available. We can pull the totalCount off
of each connection, so we can immediately know how many issues and pull
requests a repository has. In practice, this is probably a pretty good
heuristic for overall progress. It’s not perfect, because it obscures
the fact that some issues might have a thousand comments while others
have none. But it should at least be monotonically increasing (unless
new issues are posted while querying, which is an inevitable
concession).

So, what I have in mind is the following: for each connection on the
root-level object (i.e., the repository’s issues, pull requests, labels,
etc.), sum the number of nodes fetched and divide by the sum of the
appropriate totalCounts, and use this as the progress indicator. The
granularity will be that the progress bar updates once per GitHub query,
which should be every few seconds. (Here is a run of the current system
against TensorFlow; the median duration is 5.25 seconds, the 90th
percentile is 10.4 seconds, and the maximum is 10.6 seconds. Query times
may be different in the new system, but not by orders of magnitude.)
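
In code, the heuristic might look roughly like this (a sketch with
hypothetical field names):

```js
// Estimate progress as nodes fetched so far over the sum of totalCounts
// across the repository's root-level connections.
function estimateProgress(rootConnections) {
  let fetched = 0;
  let total = 0;
  for (const c of rootConnections) {
    fetched += c.fetchedCount;
    total += c.totalCount;
  }
  return total === 0 ? 1 : Math.min(1, fetched / total);
}
```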

Does this sound good to you?


@decentralion decentralion commented Aug 24, 2018

That sounds great.


@wchargin wchargin commented Sep 11, 2018

Another benefit to using shallow queries is that it provides a natural
means to work with recursive structures. For instance, a pull request
review comment has a replyTo property that points to an (optional)
pull request review comment. So a query may look like this:

{
  repository(owner: "sourcecred", name: "sourcecred") {
    pullRequests(first: 10) {
      nodes {
        reviews(first: 10) {
          nodes {
            comments(first: 10) {
              nodes {
                replyTo {
                  id
                  replyTo {
                    id
                    replyTo {
                      id
                      # when do we stop?!
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

With deep queries, we’d have to arbitrarily cut this off somewhere,
which means both logic to identify when we’re in this case (more
complicated code) and heuristics for the cutoff (more complicated
systems).

With shallow queries, we just fetch one id at a time, and this problem
is solved before it starts.

wchargin added a commit that referenced this issue Sep 14, 2018
Summary:
This commit introduces a module for declaratively specifying the schema
of a GraphQL database. See `buildGithubSchema` in `schema.test.js` for
an example of the API.

This makes progress toward #622, though the design has evolved some
since its original specification there.

Test Plan:
Unit tests added, with full coverage; `yarn unit` suffices.

wchargin-branch: graphql-schema
wchargin added a commit that referenced this issue Sep 17, 2018
Summary:
In implementing #622, we’ll want to run lots of things inside of
transactions. This commit introduces a JavaScript API to do so more
easily, properly handling success and failure cases.

Test Plan:
Unit tests included, with full coverage; run `yarn unit`.

wchargin-branch: mirror-transaction-helper
wchargin added a commit that referenced this issue Sep 17, 2018
Summary:
This commit introduces the `Mirror` class that will be the centerpiece
of the persistent-loading API as described in #622. An instance of this
class represents a mirror of a remote GraphQL database, defined by a
particular schema. In this commit, we add the construction logic, which
includes a safety measure to ensure that the database is used within one
version of the code and schema.

Test Plan:
Unit tests included, with full coverage; run `yarn unit`.

wchargin-branch: mirror-class
wchargin added a commit that referenced this issue Sep 17, 2018
Summary:
This commit augments the `Mirror` constructor to turn the provided
GraphQL schema into a SQL schema, with which it initializes the backing
database. The schema is roughly as originally described in #622, with
some changes (primarily: we omit `WITHOUT ROWID`; we add indexes; we
store `total_count` on connections; and we use milliseconds instead of
seconds for epoch time).

Test Plan:
Unit tests included, with full coverage; run `yarn unit`.

wchargin-branch: mirror-sql-schema

@wchargin wchargin commented Oct 2, 2018

Having all of our data available in a local SQL database is powerful. It
enables us to ask questions of our data that we just can’t with the
GitHub APIs.

For instance: “who, other than the SourceCred maintainers, has reacted
to posts in the SourceCred repository? What are these posts and
reactions?”

The answer is just a query away:

$ cat whoreacts.sql
.mode line  -- pretty-printing
SELECT
    reactor_data.login AS reactor_name,
    content AS reaction_type,
    reactables.typename AS reactable_typename,
    reactable_id AS reactable_id,
    COALESCE(issue_data.url, issuecomment_data.url, '(no url found)') AS url
FROM
(
    SELECT rowid AS connection_id, object_id AS reactable_id
    FROM connections
    WHERE fieldname = 'reactions'
) AS reaction_connections
JOIN objects AS reactables
    ON reaction_connections.reactable_id = reactables.id
JOIN connection_entries USING (connection_id)
JOIN objects AS reactions
    ON connection_entries.child_id = reactions.id
JOIN (
    SELECT parent_id AS reaction_id, child_id AS reactor_id FROM links
    WHERE fieldname = 'user'
) AS reaction_authors
    ON reactions.id = reaction_authors.reaction_id
JOIN objects AS reactors
    ON reaction_authors.reactor_id = reactors.id
JOIN primitives_User AS reactor_data
    ON reactors.id = reactor_data.id
JOIN primitives_Reaction AS reaction_data
    ON reactions.id = reaction_data.id
LEFT OUTER JOIN primitives_Issue AS issue_data
    ON reaction_connections.reactable_id = issue_data.id
LEFT OUTER JOIN primitives_IssueComment AS issuecomment_data
    ON reaction_connections.reactable_id = issuecomment_data.id
WHERE reactor_name NOT IN ('"wchargin"', '"decentralion"')
;

And the results:

$ sqlite3 /tmp/mirror.db <whoreacts.sql
      reactor_name = "owocki"
     reaction_type = "THUMBS_UP"
reactable_typename = Issue
      reactable_id = MDU6SXNzdWUzNTI2NjA5ODA=
               url = "https://github.com/sourcecred/sourcecred/issues/696"

      reactor_name = "ashwaniYDV"
     reaction_type = "THUMBS_UP"
reactable_typename = Issue
      reactable_id = MDU6SXNzdWUzNjU3MDYwMjc=
               url = "https://github.com/sourcecred/sourcecred/issues/905"

      reactor_name = "claireandcode"
     reaction_type = "THUMBS_UP"
reactable_typename = IssueComment
      reactable_id = MDEyOklzc3VlQ29tbWVudDQxMDA0NzY3MA==
               url = "https://github.com/sourcecred/sourcecred/issues/586#issuecomment-410047670"

      reactor_name = "campoy"
     reaction_type = "HEART"
reactable_typename = IssueComment
      reactable_id = MDEyOklzc3VlQ29tbWVudDQxMDEzMzYxMw==
               url = "https://github.com/sourcecred/sourcecred/pull/598#issuecomment-410133613"

      reactor_name = "bzz"
     reaction_type = "HOORAY"
reactable_typename = IssueComment
      reactable_id = MDEyOklzc3VlQ29tbWVudDQxMDEzMzYxMw==
               url = "https://github.com/sourcecred/sourcecred/pull/598#issuecomment-410133613"

@wchargin wchargin commented Oct 2, 2018

Here are some file size comparisons using the sourcecred/sourcecred
repository data:

| Compression | Post bodies | RelationalView (B) | SQLite (B) |
|-------------|-------------|--------------------|------------|
| None        | Included    | 4 945 499          | 6 569 984  |
| Gzip        | Included    | 910 654            | 1 656 186  |
| None        | Removed     | 3 219 701          | N/A        |
| Gzip        | Removed     | 244 632            | N/A        |

(All sizes are in bytes. I did not run VACUUM.)

Note that tables account for 67.8% of the total database size; indices
account for the other 32.2%. Given that our performance is currently
plenty fast (and honestly we don’t do very much CPU-work), we could
probably drop all the indices without a big performance hit.

Note also that about 16% of the space in the SQLite database is unused.
This proportion should decrease as the size of the database increases,
and the unused space does not further increase the on-disk size.

wchargin added a commit that referenced this issue Oct 3, 2018
Summary:
This completes the minimum viable public API for the `Mirror` class. As
described on the docstring, the typical workflow for a client is:

  - Invoke the constructor to acquire a `Mirror` instance.
  - Invoke `registerObject` to register a root object of interest.
  - Invoke `update` to update all transitive dependencies.
  - Invoke `extract` to retrieve the data in structured form.

It is the third step that is added in this commit.

In this commit, we also include a command-line demo of the Mirror
module, which exercises exactly the workflow above with a hard-coded
GitHub schema. This can be used to test the module’s behavior with
real-world data. I plan to remove this entry point once we integrate
this module into SourceCred.

This commit makes progress toward #622.

Test Plan:
Unit tests included for the core functionality. The imperative shell
does not have automated tests. You can test it as follows.

First, run `yarn backend` to build `bin/mirror.js`. Then, run:

```shell
$ node bin/mirror.js /tmp/mirror-demo.db \
> Repository MDEwOlJlcG9zaXRvcnkxMjMyNTUwMDY= \
> 60
```

Here, the big base64 string is the ID for the sourcecred/example-github
repository. (This is listed in `graphql/demo.js`, alongside the ID for
the SourceCred repository itself.) The value 60 is a TTL in seconds. The
database filename is arbitrary.

This will print out a ton of output to stderr (all intermediate queries
and responses, for debugging purposes), and then the full contents of
the example repository to stdout.

Run the command again, and it should finish instantly. On my machine,
the main function runs faster than the Node startup time (50ms vs 60ms).

Then, re-run the command, changing the TTL from `60` to `1`. Note that
it sends off some queries and then prints the same repository.

It is safe to kill the program at any time, either while waiting for a
response from GitHub or while processing the results, because the mirror
module takes advantage of SQLite’s transaction-safety. Intermediate
updates will be persisted, so you’ll lose just a few seconds of
progress.

You can also of course dive into the generated database yourself to
explore the data. It’s good fun.

wchargin-branch: mirror-e2e-update
@wchargin wchargin changed the title Incremental/resumable GraphQL loading notes Design and implementation of the Mirror module (formerly: Incremental/resumable GraphQL loading notes) Oct 28, 2018

@wchargin wchargin commented Oct 29, 2018

The Mirror system works and is integrated into SourceCred! 🎉

I’ve noted some enhancements that I would like to see moving forward in
#938, #939, and #940. But the core is complete, so I think that this
issue can be closed.

@wchargin wchargin closed this Oct 29, 2018
wchargin added a commit that referenced this issue Oct 30, 2018
Summary:
This points to #622 as the blanket issue, though really there was a long
series of pull requests worth of implementation.

Test Plan:
None.

wchargin-branch: changelog-mirror
wchargin added a commit that referenced this issue Oct 31, 2018
Summary:
It is time. (Replaced with #622.)

Test Plan:
Running `yarn flow` suffices. Running `yarn test --full` also passes.

wchargin-branch: remove-legacy-graphql

@wchargin wchargin commented Oct 31, 2018

A debugging/investigation story, per @decentralion’s request.

I had been running the following command to track the progress of the
Mirror load for twbs/bootstrap:

watch -n 1 '
    sqlite3 /tmp/sourcecred/cache/twbs/bootstrap/github/mirror_*.db \
    "SELECT typename, COUNT(1) FROM objects GROUP BY typename" |
    tr "|" "\t" | sort -n -k2 | expand -t 20
'

This printed:

Organization        1
Ref                 1
Repository          1
PullRequestReview   2890
Reaction            3519
Commit              3628
PullRequest         9218
User                17019
Issue               17843
IssueComment        60563

I noted that all the pull requests and issues had been fetched, but
there were far fewer commits than pull requests. This was strange,
because surely each merged pull request has a merge commit—were so few
pull requests really merged? Searching for “is:pr is:merged ” on GitHub
confirmed that there were in fact 4457 merged pull requests in this
repository, so something was inconsistent with my expectations.

I opened a SQLite shell to investigate:

$ sqlite3 /tmp/sourcecred/cache/twbs/bootstrap/github/mirror_*.db
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> 

Perhaps I had registered all these pull requests but not fetched all of
their data:

sqlite> SELECT z, COUNT(1) FROM (
   ...>   SELECT last_update IS NULL AS z
   ...>   FROM objects WHERE typename = 'PullRequest'
   ...> ) GROUP BY z;
0|9218

No: all 9218 pull requests had non-NULL update times.

Did the Mirror module fail to follow the links properly, or are the pull
requests really missing merge commits?

sqlite> SELECT z, COUNT(1) FROM (
   ...>   SELECT child_id IS NULL AS z
   ...>   FROM links WHERE fieldname = 'mergeCommit'
   ...> ) GROUP BY z;
0|2698
1|6520

Okay: most of the pull requests really do have null mergeCommit. Which
ones? Are there observable patterns? Let’s list some of them:

sqlite> SELECT url FROM
   ...> primitives_PullRequest JOIN links ON primitives_PullRequest.id = parent_id
   ...> WHERE fieldname IS 'mergeCommit' AND child_id IS NOT NULL
   ...> ORDER BY CAST(number AS INTEGER) DESC
   ...> LIMIT 10;
"https://github.com/twbs/bootstrap/pull/27566"
"https://github.com/twbs/bootstrap/pull/27563"
"https://github.com/twbs/bootstrap/pull/27562"
"https://github.com/twbs/bootstrap/pull/27560"
"https://github.com/twbs/bootstrap/pull/27556"
"https://github.com/twbs/bootstrap/pull/27547"
"https://github.com/twbs/bootstrap/pull/27544"
"https://github.com/twbs/bootstrap/pull/27540"
"https://github.com/twbs/bootstrap/pull/27539"
"https://github.com/twbs/bootstrap/pull/27533"

Comparing to the first page of search results for “is:pr is:merged ” on
GitHub, these match up. How about the ones at the very beginning?

sqlite> SELECT url FROM
   ...> primitives_PullRequest JOIN links ON primitives_PullRequest.id = parent_id
   ...> WHERE fieldname IS 'mergeCommit' AND child_id IS NOT NULL
   ...> ORDER BY CAST(number AS INTEGER) ASC
   ...> LIMIT 10;
"https://github.com/twbs/bootstrap/pull/1"
"https://github.com/twbs/bootstrap/pull/3"
"https://github.com/twbs/bootstrap/pull/12"
"https://github.com/twbs/bootstrap/pull/16"
"https://github.com/twbs/bootstrap/pull/21"
"https://github.com/twbs/bootstrap/pull/30"
"https://github.com/twbs/bootstrap/pull/51"
"https://github.com/twbs/bootstrap/pull/58"
"https://github.com/twbs/bootstrap/pull/71"
"https://github.com/twbs/bootstrap/pull/87"

Comparing to the last page of search results, we see that PR #26 was
“merged” on GitHub, but it is absent from our list—we do not have a
mergeCommit for it!

Nothing seems amiss on that pull request’s page. Indeed, it says, “mdo
merged commit 7852619 into twbs:master on Aug 20, 2011”.

What happens if we query this manually?

$ ghquery '{
>   resource(url: "https://github.com/twbs/bootstrap/pull/26") {
>     ... on PullRequest { url merged mergeCommit { id } }
>   }
> }' | jq .
{
  "data": {
    "resource": {
      "url": "https://github.com/twbs/bootstrap/pull/26",
      "merged": true,
      "mergeCommit": null
    }
  }
}

(Here, ghquery is my shell function for querying GitHub.
The same query works in the GraphiQL instance.)

Indeed, this pull request is merged but has no mergeCommit. Strange.

I’m still investigating what’s actually going on here, but I hope that
this helps demonstrate how inspecting a SQLite database can be valuable
for debugging.


@decentralion decentralion commented Jul 4, 2019

@wchargin, a question for you.

In the documentation, the child_id field on the connection_entries table is declared NOT NULL, which makes sense to me:

CREATE TABLE connection_entries (
    conn_id INTEGER NOT NULL,
    idx INTEGER NOT NULL,  -- impose an ordering
    child_id TEXT NOT NULL,
    FOREIGN KEY(conn_id) REFERENCES connections(rowid)
);

However, as implemented, the field is nullable. Why is this?

decentralion added a commit that referenced this issue Aug 6, 2019
The mirror wraps a SQLite database which will store all of the data we
download from Discourse.

On a call to `update`, it downloads new data from the server and stores
it. Then, when it is asked for information like the topics and posts, it
can just pull from its local copy. This means that we don't need to
re-download the content every time we load a Discourse instance, which
makes the load more performant, more robust to network failures, etc.

Thanks to @wchargin, whose work on the GraphQL mirror for GitHub (#622)
inspired this mirror.

Test plan: I've written unit tests that use a mock fetcher to validate
the update logic. I've also used this to do a full load of the real
SourceCred Discourse instance, and to create a corresponding graph
(using subsequent commits).

Progress towards #865.

@wchargin wchargin commented Feb 2, 2020

In the documentation, the child_id field on the connection_entries
table is declared NOT NULL […] However, as implemented, the field is nullable.
Why is this?

Because the elements of a connection actually are nullable: for example,
a UserConnection has nodes of type [User], not [User!].

I can’t recall off the top of my head if we’ve ever run into this in
practice, but it seemed best for the storage layer to be sound with
respect to the schema provided by the server.
