WiP: initial doc processing file #140
Conversation
    sa_helpers.KeyStateType.INTERNAL)
await kvs.set(auth_key, request[constants.AUTH_HEADERS])

# store teacher roster info
I'm curious why roster is stored and passed, rather than retrieved in the background process. It feels like the whole compute DAG should happen there.
In the background process we use the doc id to find the student, which finds the teacher (via the roster), which finds the teachers credentials, which are used to fetch the text of the document.
This code makes me unhappy and feels wrong, but the alternatives I've come up with are similarly bad, for other reasons.
I think the best option I've come up with -- not in time for the next study -- is to have a document maintain:
- a list of users associated with it (students and teachers)
- some metadata about those users (e.g. when they last accessed it, permissions issues, etc.)
Might be good to have a conversation at some point.
We also will need some design documentation here:
- Stepping through all documents feels crazy, as does the path document -> roster -> student
- It's less crazy once you dig into it. Key points are that only active documents are processed, which will be non-obvious very quickly.
Again, not for next study.
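The per-document user list suggested above could be sketched as a small data structure. This is purely illustrative: the class names (`DocumentUser`, `DocumentRecord`) and fields are assumptions, not anything in the codebase.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a document that tracks its associated users
# (students and teachers) plus per-user metadata, as discussed above.

@dataclass
class DocumentUser:
    user_id: str
    role: str                    # 'student' or 'teacher'
    last_accessed: float = 0.0   # unix timestamp of last access
    has_permission: bool = True  # e.g. false after a sharing/auth failure

@dataclass
class DocumentRecord:
    doc_id: str
    users: list = field(default_factory=list)

    def active_users(self, now, window=3600):
        """Users who touched the document within `window` seconds of `now`."""
        return [u for u in self.users if now - u.last_accessed < window]
```

With something like this, "only active documents are processed" becomes an explicit query over `active_users` rather than an implicit property of the key-scanning path.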
@@ -99,13 +101,28 @@ async def get_active_user(request):
    return session[constants.USER]


def google_stored_auth():
I might consider using _store_teacher_info_for_background_process (perhaps renamed).
    {sa_helpers.KeyField.TEACHER: user['user_id']},
    sa_helpers.KeyStateType.INTERNAL)
await kvs.remove(remove_auth_key)
I would probably remove this when auth expires (e.g. we make a request and it fails with an auth exception), rather than on logout.
I believe we will also get an auth exception when the document isn't shared, which is likely to happen. I'll see about adding this to the workflow for when the auth expires.
I would expect different auth exceptions (even in the text if not the type), but it's worth double-checking.
We don't want to keep making requests when auth has expired, and we do want to keep track of access permissions.
If we get an auth error, stop trying that document.
student_specifier = 'STUDENT:'
students = [k[k.find(student_specifier) + len(student_specifier):] for k in matching_keys]
# TODO handle more than 1 student per document, but for now just grab the first
student = students[0]
This assumption is broken. The TODO should be done reasonably soon. Perhaps this is an okay hack for one study, but this adds a lot of technical debt. There is a many-to-many mapping of docs to students to teachers, and the APIs ought to reflect that. Everything which depends on this will fail to implement corner cases which will later need to be fixed and lead to bugs.
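A minimal sketch of an API that reflects the many-to-many mapping would return all students for a document rather than assuming one. The key format here is inferred from the snippet above, and the function name is hypothetical:

```python
def students_for_document(matching_keys, specifier='STUDENT:'):
    """Return ALL student ids encoded in the matching KVS keys,
    rather than assuming exactly one student per document.

    The key layout (a 'STUDENT:' segment followed by the user id)
    is an assumption based on the snippet above."""
    students = []
    for k in matching_keys:
        idx = k.find(specifier)
        if idx != -1:
            students.append(k[idx + len(specifier):])
    return students
```

Callers would then loop over the result (or fan out async tasks per student) instead of taking `students[0]`, which avoids baking the one-student assumption into everything downstream.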
if doc_id in failed_fetch:
    last_try = failed_fetch[doc_id]
    if last_try > now - 300:
        return False
return True
I don't quite get what you're trying to accomplish or why. Better docstring?
Will update the docstring, but failed_fetch is updated whenever we fail to fetch a document's text. We don't want to retry constantly, since some documents will never be shared and will always fail. This limits it to checking every 5 minutes.
Random first thoughts:
- We should leave a bookmark or todo here.
- Trying to do unauthorized things every five minutes seems like a really good way to get ourselves flagged and banned by some buggy, automated Google system, so I'm a bit worried about leaving as-is.
- At the very least, I'd do an exponential backoff, but I'd be inclined to just not retry if it's a sharing issue until we are tracking sharing events.
- In practice, in the classroom, 100% was shared, so this might be a special case. We can ask the other teachers how this works there.
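The exponential backoff suggested above might look like the sketch below. The shape of `failed_fetch` (mapping `doc_id` to a `(last_try, consecutive_failures)` pair) and the constants are assumptions for illustration, not the existing code:

```python
import time

BASE_DELAY = 300    # 5 minutes, matching the current fixed retry interval
MAX_DELAY = 86400   # cap the backoff at one day

def should_retry(doc_id, failed_fetch, now=None):
    """Exponential backoff on fetch failures.

    failed_fetch is assumed to map doc_id -> (last_try_timestamp,
    consecutive_failures); each failure doubles the wait before the
    next attempt, up to MAX_DELAY."""
    if doc_id not in failed_fetch:
        return True
    now = time.time() if now is None else now
    last_try, failures = failed_fetch[doc_id]
    delay = min(BASE_DELAY * (2 ** (failures - 1)), MAX_DELAY)
    return now - last_try >= delay
```

For a known sharing failure, the caller could skip retries entirely (set `failures` high, or keep a separate "do not retry" set) until a sharing event is observed.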
Make it so that if the doc is in the failed-to-fetch dict, we return False instead of retrying every 5 minutes.
async def process_documents(docs):
    results = await asyncio.gather(*(process_document(d) for d in docs))
I'm not a big fan of the gather operations. This would work better as an asynchronous chain:

def fetch_docs(doc_list):
    for doc_id in doc_list:
        yield fetch_doc(doc_id)

def process_docs(docs):
    for doc in docs:
        yield process_doc(doc)

and def store_docs(). This way, every doc is available as soon as it's processed, and perhaps things are a little more responsive. This doesn't give the same feel of "last retrieved 5 hours ago", but you potentially see things come in as they're ready.
Compute can also run in the background for network and vice versa.
I disagree. I think using gather here is correct. I could be wrong, but the asynchronous chain will block the next document from being started, whereas with the gather operation, while one document is waiting for a response from the Google API or LanguageTool (or another await item), another document can begin its own processing.
Also, we ought to create an async wrapper for the NLP AWE components. Currently they are blocking.
We're both wrong, but we should discuss tomorrow.
https://docs.python.org/3/library/asyncio-task.html#waiting-primitives
Design requirements:
- The thread pool of LanguageTool / AWE / etc. is finite. Running 1000 things at once is bad. Limiting to just one, as I had, is bad too. This should be based on the number of cores or similar, so specified in config.
- I expect we'll be NLP-limited (redis writes will be fast, and hopefully so will Google API calls). If this is the case, we'd want to pull things shortly before NLP capacity opens up, and store things right after.
- We do want things to be available to teachers once ready, and not with a big delay.
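A common middle ground between gathering everything at once and a strictly serial chain is bounding concurrency with a semaphore. This is a sketch only: `process_document` is a stand-in for the real fetch + NLP work, and the limit would come from the settings file rather than a default argument:

```python
import asyncio

async def process_document(doc_id):
    # Stand-in for fetch + NLP; the real version awaits Google API,
    # LanguageTool, redis, etc.
    await asyncio.sleep(0)
    return doc_id.upper()

async def process_documents(doc_ids, limit=4):
    # At most `limit` documents are in flight at once, so the finite
    # NLP thread pool isn't swamped, but slow documents don't block
    # the others from starting.
    sem = asyncio.Semaphore(limit)

    async def bounded(doc_id):
        async with sem:
            return await process_document(doc_id)

    return await asyncio.gather(*(bounded(d) for d in doc_ids))
```

To surface each document to teachers as soon as it finishes (rather than after the whole batch), the same `bounded` tasks could be consumed with `asyncio.as_completed` instead of `gather`.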
        matching_teachers.append(roster['teacher_id'])

# TODO handle multiple teachers
teacher = matching_teachers[0]
Same one-to-many comment as above.
if __name__ == '__main__':
    import asyncio
    asyncio.run(start())
I might run this as a process when starting learning_observer, rather than as a separate python command. Running it separately adds devops overhead, which is expensive.
A good way to do this is to check which processes to start from our settings file. By default, we run all of them, but it's possible to break it up at some point if we ever hit our millionth user.
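The settings-driven startup described here could be sketched as below. The settings shape and process names are hypothetical, chosen only to illustrate "run everything by default, allow splitting later":

```python
# Hypothetical sketch: learning_observer reads which background workers
# to launch from its settings file. All names here are illustrative.

DEFAULT_SETTINGS = {
    'processes': {
        'event_server': True,
        'dashboard_server': True,
        'doc_processor': True,   # this PR's background worker
    }
}

def processes_to_start(settings):
    """Return the enabled process names; everything runs by default."""
    procs = settings.get('processes', DEFAULT_SETTINGS['processes'])
    return [name for name, enabled in procs.items() if enabled]
```

The main entry point would then iterate this list and spawn (or run in-process) each enabled worker, so deploying remains a single command.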
Collin mentioned that currently there are two servers running: one for incoming events and one for viewing the dashboards. He mentioned that the incoming events were causing the rest of the system to slow down a lot.
OT: I'd be curious where the redis is. Network latency can be high with a non-local db.
# "ASSIGNMENT" # E.g. A collection of Google Docs (e.g. notes, outline, draft)
# "TEACHER" #
"TEACHER" #
My issue, but I am not sold on the teacher / student split. We should talk about this when we're less busy.
@@ -120,7 +120,6 @@ def process_text(text, options=None):
    if options is None:
        # Do we want options to be everything initially or nothing?
        options = writing_observer.nlp_indicators.INDICATORS.keys()
        options = []
I don't understand this. We should explain when we talk.
If no options for AWE components were passed in, did we want to run all of them or none of them? We never came to a conclusion on this, and the code sat with both coded in, but with the none-of-them option as the path used. I made the decision that we should choose all of them, so I removed the none-of-them path.
Ah. Thank you. Makes sense. We should probably do all of them all the time in the current workflow.
408b645 to 6310e28 (Compare)
@@ -41,6 +43,21 @@

source_selector = q.call('source_selector')

pmss.register_field(
    name='get_awe_lt_from_sep_process',
Choice and not boolean:

NLP_Source: choice[compute_in_dag, async_in_process, async_out_of_process]

Optional class:

.awe {
    NLP_Source
}
.language_tool {
    NLP_Source
}
# overall language tool nodes
'overall_lt': languagetool(texts=q.variable('docs')),
'lt_combined': q.join(LEFT=q.variable('overall_lt'), LEFT_ON='provenance.provenance.STUDENT.value.user_id', RIGHT=q.variable('roster'), RIGHT_ON='user_id'),
'overall_lt_sep_proc': q.select(q.keys('writing_observer.lt_process', STUDENTS=q.variable('roster'), STUDENTS_path='user_id', RESOURCES=q.variable("doc_ids"), RESOURCES_path='doc_id'), fields='All'),
'lt_combined': q.join(LEFT=q.variable(lt_group_source), LEFT_ON='provenance.provenance.STUDENT.value.user_id', RIGHT=q.variable('roster'), RIGHT_ON='user_id'),
This is horrible, but a fine hack for the school pilots this month. We do need to figure out, on a 10,000-foot level, how we handle settings changing query DAGs. That's going to take time.