-
Notifications
You must be signed in to change notification settings - Fork 61
feat(script): add content seeding script and sample data #324
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly formatting nits.
The one blocker I see is telling people how to {determine, specify} EMBLEM_URL
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Ace Nassri <anassri@google.com>
Co-authored-by: Ace Nassri <anassri@google.com>
Co-authored-by: Ace Nassri <anassri@google.com>
Co-authored-by: Ace Nassri <anassri@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Great job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great foundation for content seeding! Thanks Kelsey.
I've left a number of feedback points on this PR. While there are a number of things I'd like to see fixed pretty much everything can be deferred to a follow-up. I'd appreciate if you could file an issue or issues for anything you don't think is worth tackling as part of this PR.
The only blocker here is the open question on Wikimedia image copyright.
To run the website locally, use the `flask run` command. By default, the website will run on port `8080`. | ||
|
||
## Seed Database |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than maintain duplicate seeding instructions, shouldn't this be removed here and assumed is done as part of properly setting up the Content API?
client = firestore.Client(project) | ||
print("Adding content to the database, this may take a few minutes...") | ||
for item in content: | ||
doc_ref = client.collection(item["collection"]).document(item["id"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Content API uses an optional test prefix on the collection name:
emblem/content-api/main_approver_test.py
Line 26 in dcba905
os.environ["EMBLEM_DB_ENVIRONMENT"] = "TEST" |
This script should respect that, so the seeding can be part of a test environment.
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from google.cloud import firestore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a standalone script, please include some basic inline docs, such as what the script does and how to run it in a basic way.
import json | ||
|
||
|
||
project = os.getenv("PROJECT_ID") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PROJECT_ID is a pretty vague variable name. Is this meant to override for this script, or set as a general shell variable? The convention is GOOGLE_CLOUD_PROJECT
.
|
||
def seed_database(content): | ||
|
||
client = firestore.Client(project) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I tend to do set project is if GOOGLE_CLOUD_PROJECT variable is set, use that, otherwise default to no project so the client can retrieve the project from the metadata server. Using metadata server is probably how we'd prefer to run this on Cloud Build.
|
||
|
||
with open("sample_data.json", "r") as f: | ||
seed_content = json.load(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future: We should probably have a linter configured to check the JSON doc is well-formed as a PR check. It's pretty easy to break JSON structure, especially through things like git conflict resolution.
content-api/data/sample_data.json
Outdated
"data": { | ||
"name": "Careers for Fish", | ||
"description": "This cause provides careers for gifted fish.", | ||
"imageUrl": "https://upload.wikimedia.org/wikipedia/commons/2/23/Georgia_Aquarium_-_Giant_Grouper_edit.jpg", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wikimedia recommends against hotlinking, but supports it.
Either way, copyright is left to the user. https://commons.wikimedia.org/wiki/Commons:Reusing_content_outside_Wikimedia/technical
Do you have a read on the copyrights of the images? Do we need to add a general copyright link or a per image attribution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each image I added had the CC0, CC BY or CC BY-SA license, which all permit free commercial use of the images.
However, I missed the requirement on CC BY and CC BY-SA to include attribution (the creator's name and a link to the license).
Instead of adding attribution for each image, I'll replace any CC BY/BY-SA licensed images with ones that have a CC0 license or that are in the public domain.
(And instead of hotlinking, we should upload the images to a storage bucket, yes?)
content-api/data/sample_data.json
Outdated
@@ -0,0 +1,3052 @@ | |||
[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
future: This file is pretty large for human-managed JSON. We may want to think about partitioning the data into separate files and run the seed script multiple times.
@@ -0,0 +1,42 @@ | |||
# Copyright 2021 Google LLC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copyright block should use the year the file is created.
# Copyright 2021 Google LLC | |
# Copyright 2022 Google LLC |
|
||
client = firestore.Client(project) | ||
print("Adding content to the database, this may take a few minutes...") | ||
for item in content: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
future: consider parallelization, we could probably speed up seeding time.
Follow-up for the blocker in #325. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With blocker moved out, switching to approve. Still good to raid other comments here for future improvements.
website
andcontent-api
READMEs to include instructions on seeding the databaseFixes #316 & #315