This repository has been archived by the owner on Jul 25, 2024. It is now read-only.

feat(script): add content seeding script and sample data #324

Merged
10 commits merged into main from content-seeding on Feb 12, 2022

Contributor

@kelsk kelsk commented Feb 11, 2022

  • Add a Python script to inject sample data into a Firestore database
  • Add sample data in JSON format
  • Update the website and content-api READMEs to include instructions on seeding the database

Fixes #316 & #315

@kelsk kelsk added the "component: content-api" (Related to the Content API) and "component: demo services" (Related to interactive learning using the app) labels Feb 11, 2022
@kelsk kelsk requested a review from a team as a code owner February 11, 2022 09:03
Contributor

@ace-n ace-n left a comment

Mostly formatting nits.

The one blocker I see is telling people how to {determine, specify} EMBLEM_URL.

content-api/README.md (outdated, resolved)
content-api/README.md (outdated, resolved)
website/README.md (resolved)
website/README.md (outdated, resolved)
website/README.md (outdated, resolved)
Contributor

@engelke engelke left a comment

LGTM

kelsk and others added 5 commits February 11, 2022 13:42
Co-authored-by: Ace Nassri <anassri@google.com>
Co-authored-by: Ace Nassri <anassri@google.com>
Co-authored-by: Ace Nassri <anassri@google.com>
Co-authored-by: Ace Nassri <anassri@google.com>
@kelsk kelsk requested a review from ace-n February 11, 2022 21:43
@grayside grayside added this to the v0.6.0 milestone Feb 11, 2022
Collaborator

@iennae iennae left a comment

LGTM! Great job.

Collaborator

@grayside grayside left a comment

This is a great foundation for content seeding! Thanks Kelsey.

I've left a number of feedback points on this PR. While there are a number of things I'd like to see fixed, pretty much everything can be deferred to a follow-up. I'd appreciate it if you could file an issue or issues for anything you don't think is worth tackling as part of this PR.

The only blocker here is the open question on Wikimedia image copyright.

To run the website locally, use the `flask run` command. By default, the website will run on port `8080`.

## Seed Database

Collaborator

Rather than maintaining duplicate seeding instructions, shouldn't this be removed here and assumed to be done as part of properly setting up the Content API?

client = firestore.Client(project)
print("Adding content to the database, this may take a few minutes...")
for item in content:
    doc_ref = client.collection(item["collection"]).document(item["id"])

Collaborator

Content API uses an optional test prefix on the collection name:

os.environ["EMBLEM_DB_ENVIRONMENT"] = "TEST"

This script should respect that, so the seeding can be part of a test environment.
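
One way the seeding script could honor that setting is sketched below. The source only shows the `EMBLEM_DB_ENVIRONMENT` variable and its `"TEST"` value; the helper name `collection_name` and the `test_` prefix format are assumptions made for illustration and would need to match whatever the Content API actually uses.

```python
import os
from typing import Mapping, Optional


def collection_name(base: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the collection name, prefixed when running against a test environment.

    `collection_name` and the "test_" prefix are hypothetical; only the
    EMBLEM_DB_ENVIRONMENT variable and its "TEST" value come from the review.
    """
    env = os.environ if env is None else env
    if env.get("EMBLEM_DB_ENVIRONMENT") == "TEST":
        return f"test_{base}"  # assumed prefix format
    return base
```

The seeding loop could then call `client.collection(collection_name(item["collection"]))` instead of using `item["collection"]` directly.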

# See the License for the specific language governing permissions and
# limitations under the License.

from google.cloud import firestore

Collaborator

As a standalone script, please include some basic inline docs, such as what the script does and how to run it in a basic way.

import json


project = os.getenv("PROJECT_ID")

Collaborator

PROJECT_ID is a pretty vague variable name. Is this meant as an override for this script, or to be set as a general shell variable? The convention is GOOGLE_CLOUD_PROJECT.


def seed_database(content):

    client = firestore.Client(project)

Collaborator

The way I tend to set the project is: if the GOOGLE_CLOUD_PROJECT variable is set, use that; otherwise default to no project so the client can retrieve the project from the metadata server. Using the metadata server is probably how we'd prefer to run this on Cloud Build.
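
A minimal sketch of that fallback, assuming the script keeps using `firestore.Client`. The helper names `resolve_project` and `make_client` are hypothetical. Passing `project=None` leaves project discovery to the client library's default credentials lookup, which is what allows the metadata server to supply it.

```python
import os
from typing import Mapping, Optional


def resolve_project(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return GOOGLE_CLOUD_PROJECT if it is set and non-empty, otherwise None."""
    env = os.environ if env is None else env
    return env.get("GOOGLE_CLOUD_PROJECT") or None


def make_client():
    """Build a Firestore client; with project=None the library infers the project."""
    # Imported lazily so resolve_project can be used without the dependency installed.
    from google.cloud import firestore

    return firestore.Client(project=resolve_project())
```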



with open("sample_data.json", "r") as f:
    seed_content = json.load(f)

Collaborator

Future: We should probably have a linter configured to check the JSON doc is well-formed as a PR check. It's pretty easy to break JSON structure, especially through things like git conflict resolution.
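
Such a check does not need a dedicated linter to get started. The sketch below uses only the standard library; the function names `check_json` and `main` are hypothetical, and how it would be wired into a PR check is left open.

```python
import json
import sys
from pathlib import Path
from typing import List, Optional


def check_json(path: str) -> Optional[str]:
    """Return an error message if the file is not well-formed JSON, else None."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        return f"{path}:{err.lineno}:{err.colno}: {err.msg}"
    return None


def main(paths: List[str]) -> int:
    """Check every path, print problems to stderr, and return a shell exit code."""
    errors = [msg for msg in (check_json(p) for p in paths) if msg]
    for msg in errors:
        print(msg, file=sys.stderr)
    return 1 if errors else 0
```

A CI step could call `main` with the list of JSON files (for example `sample_data.json`) and fail the build on a non-zero return value.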

"data": {
"name": "Careers for Fish",
"description": "This cause provides careers for gifted fish.",
"imageUrl": "https://upload.wikimedia.org/wikipedia/commons/2/23/Georgia_Aquarium_-_Giant_Grouper_edit.jpg",

Collaborator

Wikimedia recommends against hotlinking, but supports it.

Either way, copyright is left to the user. https://commons.wikimedia.org/wiki/Commons:Reusing_content_outside_Wikimedia/technical

Do you have a read on the copyrights of the images? Do we need to add a general copyright link or a per image attribution?

Contributor Author

Each image I added had the CC0, CC BY or CC BY-SA license, which all permit free commercial use of the images.

However, I missed the requirement on CC BY and CC BY-SA to include attribution (the creator's name and a link to the license).

Instead of adding attribution for each image, I'll replace any CC BY/BY-SA licensed images with ones that have a CC0 license or that are in the public domain.

(And instead of hotlinking, we should upload the images to a storage bucket, yes?)

@@ -0,0 +1,3052 @@
[

Collaborator

future: This file is pretty large for human-managed JSON. We may want to think about partitioning the data into separate files and running the seed script multiple times.
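
As a variation on that idea, the script could read every JSON file in a directory in one run rather than being invoked once per file. This is only a sketch: the directory layout and the helper name `load_seed_files` are assumptions, and each file is assumed to hold a JSON array of items shaped like the entries in `sample_data.json`.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_seed_files(directory: str) -> List[Dict[str, Any]]:
    """Load and concatenate the JSON arrays from every *.json file in a directory."""
    items: List[Dict[str, Any]] = []
    # Sort by file name so the seeding order is deterministic.
    for path in sorted(Path(directory).glob("*.json")):
        with path.open("r", encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, list):
            raise ValueError(f"{path} must contain a JSON array")
        items.extend(data)
    return items
```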

@@ -0,0 +1,42 @@
# Copyright 2021 Google LLC

Collaborator

Copyright block should use the year the file is created.

Suggested change
- # Copyright 2021 Google LLC
+ # Copyright 2022 Google LLC


client = firestore.Client(project)
print("Adding content to the database, this may take a few minutes...")
for item in content:

Collaborator

future: consider parallelization; we could probably speed up seeding time.
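
One possible shape for this is a thread pool around the per-item write, sketched below with the standard library. The helper name `seed_in_parallel` and the `write_item` callback are hypothetical. In the real script the callback would wrap the Firestore write for one item, and whether concurrent writes through a shared client behave well should be verified before relying on this.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, Mapping


def seed_in_parallel(
    content: Iterable[Mapping[str, Any]],
    write_item: Callable[[Mapping[str, Any]], Any],
    max_workers: int = 8,
) -> int:
    """Apply write_item to every item using a thread pool; return the number written."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so an exception raised by any write surfaces here.
        results = list(pool.map(write_item, content))
    return len(results)
```

Threads suit this workload because each write mostly waits on the network rather than on the CPU.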

@grayside
Collaborator

Follow-up for the blocker in #325.

Collaborator

@grayside grayside left a comment

With blocker moved out, switching to approve. Still good to raid other comments here for future improvements.

@grayside grayside merged commit a403a87 into main Feb 12, 2022
@grayside grayside deleted the content-seeding branch February 12, 2022 00:11
Labels: component: content-api, component: demo services
Issues this pull request may close: Demo: Create example data for content seeding
5 participants