Proposal: handle directly_collected as a separate table. #8

Open
SamHames opened this issue Feb 23, 2022 · 0 comments
Labels
V1 Needs to be resolved for V1 release

Comments

@SamHames
Collaborator

SamHames commented Feb 23, 2022

Currently, directly_collected is a column on the tweet table. This has a couple of limitations:

  1. Because the tweet table is populated with insert or ignore, the first version of a tweet seen 'wins'. The column is therefore only correct if tweets are inserted in chronological order, which isn't guaranteed (especially when there is more than one file to insert).
  2. Filtering on directly_collected from any other table requires joining against and processing the tweet table, which is the largest table in the collection.
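
To make the first limitation concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two-column `tweet` table and the `load` helper are simplified stand-ins invented for illustration, not the project's real schema or loader.

```python
import sqlite3


def load(batches):
    """Insert batches of (id, directly_collected) rows with insert-or-ignore.

    Hypothetical helper: each batch stands in for one input file.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "create table tweet ("
        " id integer primary key,"
        " directly_collected integer not null)"
    )
    for batch in batches:
        # A row whose id already exists is silently skipped.
        db.executemany("insert or ignore into tweet values (?, ?)", batch)
    return dict(db.execute("select id, directly_collected from tweet"))


# Tweet 1 appears once as a referenced tweet (0) and once as a
# directly collected tweet (1); only the insertion order differs.
referenced_first = load([[(1, 0)], [(1, 1)]])
collected_first = load([[(1, 1)], [(1, 0)]])

print(referenced_first, collected_first)
```

The same two observations produce different stored values depending only on which file happens to be processed first.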

I think we can avoid the insertion-order problem and get more functionality out of a structure like one of the following. This would also let us consider more complex tagging of collection properties later, and possibly allow us to do some nicer things to make #5 more legible for analysis.

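-- One row per (tweet, directly_collected) observation.
-- Only the composite key is declared: SQLite rejects a table that has
-- both a column-level and a table-level primary key.
create table if not exists tweet_source (
     id integer not null references tweet(id),
     directly_collected integer not null,
     primary key (directly_collected, id)
);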

-- As above, with an extra label to tag other collection properties.
create table if not exists tweet_source_label (
     label text not null,
     directly_collected integer not null,
     id integer not null references tweet(id),
     primary key (label, directly_collected, id)
);

-- Or just a table that only has the tweet_ids of the directly collected tweets.
create table if not exists tweet_directly_collected (
     id integer primary key
);
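
As a rough check of the first option, the sketch below creates a `tweet_source` table with a single composite primary key (SQLite raises an error if a table declares more than one primary key) and records the same tweet several times in different orders. The `record` helper and the one-column `tweet` table are assumptions made for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table tweet (id integer primary key);
    create table tweet_source (
        id integer not null references tweet(id),
        directly_collected integer not null,
        primary key (directly_collected, id)
    );
""")


def record(tweet_id, directly_collected):
    """Hypothetical helper: note one observation of a tweet."""
    db.execute("insert or ignore into tweet values (?)", (tweet_id,))
    db.execute(
        "insert or ignore into tweet_source values (?, ?)",
        (tweet_id, int(directly_collected)),
    )


record(1, False)  # first seen as a referenced tweet
record(1, True)   # later seen as directly collected
record(1, True)   # repeat observations are ignored
record(2, False)

rows = db.execute(
    "select id, directly_collected from tweet_source"
    " order by id, directly_collected"
).fetchall()
direct_ids = [
    r[0]
    for r in db.execute(
        "select id from tweet_source where directly_collected = 1"
    )
]
print(rows, direct_ids)
```

Because each distinct observation gets its own row, the result no longer depends on which one arrives first.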

All of these allow any of the other tables to filter against a much more compact (and correct) view of the directly_collected attribute.
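
A sketch of how another table might filter against the compact table, again with Python's sqlite3. The `hashtag` table and the `insert_tweet` helper are hypothetical stand-ins for any per-tweet table and for the real loader.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table tweet (id integer primary key, text text);
    create table tweet_directly_collected (id integer primary key);
    -- Hypothetical child table standing in for any per-tweet table.
    create table hashtag (tweet_id integer, tag text);
""")


def insert_tweet(tweet_id, text, directly_collected):
    """Hypothetical loader step for one observed tweet."""
    db.execute("insert or ignore into tweet values (?, ?)", (tweet_id, text))
    if directly_collected:
        db.execute(
            "insert or ignore into tweet_directly_collected values (?)",
            (tweet_id,),
        )


insert_tweet(1, "hello", False)  # first seen as a referenced tweet
insert_tweet(1, "hello", True)   # later seen as directly collected
insert_tweet(2, "world", False)
db.executemany("insert into hashtag values (?, ?)", [(1, "a"), (2, "b")])

# Filter hashtags without touching the large tweet table.
direct_tags = [
    row[0]
    for row in db.execute(
        "select tag from hashtag"
        " where tweet_id in (select id from tweet_directly_collected)"
    )
]
print(direct_tags)
```

The subquery only reads the small id-only table, so the filter stays cheap however large the tweet table grows.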

SamHames added the V1 Needs to be resolved for V1 release label on May 5, 2022