Currently, directly_collected is a column on the tweet table. This has a couple of limitations:
Because this table is updated with insert or ignore, the first version of a tweet seen 'wins'. The column is therefore only correct if the tweets are inserted in chronological order, which isn't guaranteed (especially if there is more than one file to insert).
Additionally, because the column lives on the tweet table, filtering on directly_collected in any other table requires joining against and processing the largest table in the collection.
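A minimal sketch of the first limitation, assuming a hypothetical two-column stand-in for the tweet table (the real schema has more columns, and `insert_tweets` is an illustrative helper, not a function from this project):

```python
import sqlite3

# Hypothetical, simplified stand-in for the real tweet table.
db = sqlite3.connect(":memory:")
db.execute(
    "create table tweet (id integer primary key, directly_collected integer not null)"
)


def insert_tweets(rows):
    # insert or ignore: the first version of each id wins,
    # later versions of the same id are silently dropped.
    db.executemany("insert or ignore into tweet values (?, ?)", rows)


# Tweet 1 is first seen embedded in another tweet (not directly collected),
# then seen again later as a directly collected tweet.
insert_tweets([(1, 0)])
insert_tweets([(1, 1)])

flag = db.execute("select directly_collected from tweet where id = 1").fetchone()[0]
print(flag)  # the stored flag is still 0
```

Swapping the two `insert_tweets` calls would store 1 instead, so the value depends purely on insertion order.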
I think we can avoid the order-of-operation problems and get more functionality out of structures like the following. This would also let us consider more complex tagging of collection properties later, and possibly allow us to do some nicer things to make #5 more legible for analysis.
create table if not exists tweet_source (
    -- id is not declared primary key on the column: a table can only have one
    -- primary key, and the composite key below is the one intended.
    id integer references tweet(id),
    directly_collected integer not null,
    primary key (directly_collected, id)
);
create table if not exists tweet_source_label (
    label text,
    directly_collected integer,
    id integer references tweet(id),
    primary key (label, directly_collected, id)
);
-- Or just a table that only has the tweet_ids of the directly collected tweets.
create table if not exists tweet_directly_collected (
    id integer primary key
);
All of these allow any of the other tables to filter against a much more compact (and correct) view of the directly_collected attribute.
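As a rough sketch of how the simplest option (tweet_directly_collected) could be used: the `tweet_hashtag` table, the `text` column, and the `record` helper below are hypothetical and only stand in for whatever tables and insert logic actually exist.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    create table tweet (id integer primary key, text text);
    -- Hypothetical child table, standing in for any other table keyed by tweet id.
    create table tweet_hashtag (tweet_id integer references tweet(id), hashtag text);
    create table if not exists tweet_directly_collected (id integer primary key);
    """
)


def record(tweet_id, text, directly_collected):
    # The tweet row itself is still first-version-wins.
    db.execute("insert or ignore into tweet values (?, ?)", (tweet_id, text))
    # The flag is tracked separately, so any directly collected sighting
    # marks the tweet regardless of insertion order.
    if directly_collected:
        db.execute(
            "insert or ignore into tweet_directly_collected values (?)", (tweet_id,)
        )


record(1, "quoted tweet", False)  # first seen embedded in tweet 2
record(2, "quoting tweet", True)
record(1, "quoted tweet", True)  # later seen directly, e.g. in a second file
record(3, "retweeted tweet", False)

db.executemany(
    "insert into tweet_hashtag values (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)

# Filter another table without touching the large tweet table.
rows = db.execute(
    """
    select hashtag from tweet_hashtag
    where tweet_id in (select id from tweet_directly_collected)
    order by hashtag
    """
).fetchall()
direct_hashtags = [row[0] for row in rows]
print(direct_hashtags)
```

Tweet 1 ends up marked as directly collected even though its first sighting was indirect, and the filter only reads the small id table.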