WiP: initial doc processing file #140

Merged: 19 commits into master, Jun 26, 2024

Conversation

bradley-erickson (Collaborator):

No description provided.

@bradley-erickson changed the title from "initial doc processing file" to "WiP: initial doc processing file" on May 2, 2024
sa_helpers.KeyStateType.INTERNAL)
await kvs.set(auth_key, request[constants.AUTH_HEADERS])

# store teacher roster info
Contributor:

I'm curious why roster is stored and passed, rather than retrieved in the background process. It feels like the whole compute DAG should happen there.

Collaborator Author:

In the background process we use the doc id to find the student, which finds the teacher (via the roster), which finds the teacher's credentials, which are used to fetch the text of the document.
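
A minimal sketch of that chain (find_student_for_doc, find_teacher_for_student, teacher_auth_key, and fetch_doc_text are hypothetical placeholders, and kvs.get is assumed to mirror the kvs.set/kvs.remove calls visible in the diff):

async def fetch_text_for_doc(doc_id, kvs):
    # doc id -> student (from the doc/student keys in the KVS)
    student_id = await find_student_for_doc(kvs, doc_id)
    # student -> teacher, via the stored roster
    teacher_id = await find_teacher_for_student(kvs, student_id)
    # teacher -> Google auth headers saved at login
    auth_headers = await kvs.get(teacher_auth_key(teacher_id))
    # credentials -> document text, via the Google Docs API
    return await fetch_doc_text(doc_id, auth_headers)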

Contributor:

This code makes me unhappy and feels wrong, but the alternatives I've come up with are similarly bad, for other reasons.

I think the best option I've come up with -- not in time for the next study -- is to have a document maintain (see the sketch below):

  • a list of users associated with it (students and teachers)
  • some metadata about those users (e.g. when they last accessed it, permissions issues, etc.)

Might be good to have a conversation at some point.

We also will need some design documentation here:

  • Stepping through all documents feels crazy, as does the path document -> roster -> student
  • It's less crazy once you dig into it. The key point is that only active documents are processed, which will become non-obvious very quickly.

Again, not for next study.
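
For reference, a rough sketch of what such a per-document record might look like (all field names here are illustrative, not a proposed schema):

doc_record = {
    'doc_id': 'abc123',
    'users': [
        {'user_id': 'student-1', 'role': 'student',
         'last_accessed': '2024-05-02T14:00:00Z', 'access': 'ok'},
        {'user_id': 'teacher-1', 'role': 'teacher',
         'last_accessed': None, 'access': 'not-shared'},
    ],
}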

@@ -99,13 +101,28 @@ async def get_active_user(request):
    return session[constants.USER]


def google_stored_auth():
Contributor:

I might consider using _store_teacher_info_for_background_process (perhaps renamed).

{sa_helpers.KeyField.TEACHER: user['user_id']},
sa_helpers.KeyStateType.INTERNAL)
await kvs.remove(remove_auth_key)

Contributor:

I would probably remove this when auth expires (e.g. we make a request and it fails with an auth exception), rather than on logout.
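
A hedged sketch of that alternative (AuthExpiredError is a placeholder exception, and the key construction only mirrors the call shape visible in the diff; the actual key-builder name isn't shown there):

try:
    text = await fetch_doc_text(doc_id, auth_headers)
except AuthExpiredError:
    # Stale credentials: drop the stored headers so the background
    # process stops using them until the teacher logs in again.
    remove_auth_key = make_auth_key(  # hypothetical key builder
        {sa_helpers.KeyField.TEACHER: teacher_id},
        sa_helpers.KeyStateType.INTERNAL)
    await kvs.remove(remove_auth_key)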

Collaborator Author:

I believe we will also get an auth exception when the document isn't shared, which is likely to happen. I'll see about adding this to the workflow for when the auth expires.

Contributor:

I would expect different auth exceptions (even in the text if not the type), but it's worth double-checking.

We don't want to keep making requests when auth has expired, and we do want to keep track of access permissions.
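
If the fetch goes through google-api-python-client (an assumption; the client isn't shown in this diff), the two cases can usually be distinguished by status code:

import time
from googleapiclient.errors import HttpError

async def try_fetch(service, doc_id, teacher_id, failed_fetch):
    try:
        # Google Docs API call (synchronous in google-api-python-client)
        return service.documents().get(documentId=doc_id).execute()
    except HttpError as err:
        if err.resp.status == 401:
            # Expired/revoked credentials: stop using them entirely
            await remove_stored_auth(teacher_id)  # hypothetical helper
        elif err.resp.status in (403, 404):
            # Doc not shared with (or not visible to) this teacher:
            # record the failure so we back off on this doc
            failed_fetch[doc_id] = time.time()
        return None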

Collaborator Author:

If we get an auth error, stop trying that document.

student_specifier = 'STUDENT:'
students = [k[k.find(student_specifier) + len(student_specifier):] for k in matching_keys]
# TODO handle more than 1 student per document, but for now just grab the first
student = students[0]
Contributor:

This assumption is broken. The TODO should be done reasonably soon. Perhaps this is an okay hack for one study, but it adds a lot of technical debt. There is a many-to-many mapping of docs to students to teachers, and the APIs ought to reflect that. Everything that depends on this will fail to handle corner cases, which will later need to be fixed and will lead to bugs.

# Skip documents whose last failed fetch was under 5 minutes ago
if doc_id in failed_fetch:
    last_try = failed_fetch[doc_id]
    if last_try > now - 300:
        return False
return True
Contributor:

I don't quite get what you're trying to accomplish or why. Better docstring?

Collaborator Author:

Will update the docstring, but failed_fetch is updated whenever a document fails to fetch its text. We don't want to retry constantly, since some documents will never be shared and will always fail. This limits the checking to once every 5 minutes.

Contributor:

Random first thoughts:

  • We should leave a bookmark or todo here.
  • Trying to do unauthorized things every five minutes seems like a really good way to get ourselves flagged and banned by some buggy, automated Google system, so I'm a bit worried about leaving as-is.
  • At the very least, I'd do an exponential backoff (sketched after this list), but I'd be inclined to just not retry if it's a sharing issue until we are tracking sharing events.
  • In practice, in the classroom, 100% was shared, so this might be a special case. We can ask the other teachers how this works there.
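
A minimal sketch of the backoff idea (here failed_fetch stores a failure count and a next-allowed time instead of just the last attempt; the delay values are illustrative):

import time

failed_fetch = {}  # doc_id -> (n_failures, next_retry_time)

def should_try_fetch(doc_id):
    # Allow a fetch unless the doc is still in its backoff window
    _, next_retry = failed_fetch.get(doc_id, (0, 0))
    return time.time() >= next_retry

def record_failure(doc_id, base_delay=300, max_delay=6 * 3600):
    # Double the wait after each failure, capped at max_delay
    n, _ = failed_fetch.get(doc_id, (0, 0))
    delay = min(base_delay * 2 ** n, max_delay)
    failed_fetch[doc_id] = (n + 1, time.time() + delay)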

Collaborator Author:

Made it so that if the doc is in the failed-to-fetch dict, we return False instead of retrying every 5 minutes.



async def process_documents(docs):
    results = await asyncio.gather(*(process_document(d) for d in docs))
Contributor:

I'm not a big fan of the gather operations. This would work better as an asynchronous chain:

async def fetch_docs(doc_list):
    for doc_id in doc_list:
        yield await fetch_doc(doc_id)

async def process_docs(docs):
    async for doc in docs:
        yield await process_doc(doc)

and def store_docs(). This way, every doc is available as soon as it's processed, and perhaps things are a little more responsive. This doesn't give the same feel of "last retrieved 5 hours ago", but you potentially see things come in as they're ready.

Compute can also run in the background while waiting on the network, and vice versa.

Collaborator Author:

I disagree. I think using gather here is correct. I could be incorrect, but the asynchronous chain will block the next document from being started, whereas with the gather operation, while one document is waiting for a response from the Google API or LanguageTool (or another awaited item), another document can begin its own processing.

Also, we ought to create an async wrapper for the NLP AWE components; currently they are blocking.

Contributor:

We're both wrong, but we should discuss tomorrow.

https://docs.python.org/3/library/asyncio-task.html#waiting-primitives

Design requirements:

  1. The thread pool for LanguageTool / AWE / etc. is finite. Running 1000 things at once is bad. Limiting to just one, as I had, is bad too. The limit should be based on the number of cores or similar, and so specified in config (see the sketch after this list).

  2. I expect we'll be NLP-limited (redis writes will be fast, and hopefully so will Google API calls). If this is the case, we'd want to pull things in shortly before NLP capacity opens up, and store things right after.

  3. We do want things to be available to teachers once ready, and not with a big delay.
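
One way to meet requirement 1, sketched with asyncio.Semaphore and asyncio.as_completed (the pool size and the store_result step are placeholders; a real version would read the limit from config):

import asyncio

NLP_CONCURRENCY = 4  # placeholder; should come from config / core count

async def process_documents(docs):
    sem = asyncio.Semaphore(NLP_CONCURRENCY)

    async def bounded(doc):
        async with sem:
            return await process_document(doc)

    # Surface each result as soon as it finishes (requirement 3),
    # rather than waiting for the whole batch
    for task in asyncio.as_completed([bounded(d) for d in docs]):
        await store_result(await task)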

matching_teachers.append(roster['teacher_id'])

# TODO handle multiple teachers
teacher = matching_teachers[0]
Contributor:

Same one-to-many comment as above.


if __name__ == '__main__':
    import asyncio
    asyncio.run(start())
Contributor:

I might run this as a process when starting learning_observer, rather than as a separate python command; a separate command adds devops overhead, which is expensive.

A good way to do this is to check which processes to start from our settings file. By default, we run all of them, but it's possible to break it up at some point if we ever hit our millionth user.
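
A hedged sketch of that settings-driven startup (the settings shape, the BACKGROUND_PROCESSES table, and the start_doc_processing entry point are all assumptions, not the actual learning_observer config):

import asyncio

BACKGROUND_PROCESSES = {
    'doc_processing': start_doc_processing,  # hypothetical entry point
}

def start_background_processes(settings):
    # Default to running everything; settings can turn pieces off
    # if we ever need to split services across machines
    enabled = settings.get('background_processes',
                           list(BACKGROUND_PROCESSES))
    for name in enabled:
        asyncio.ensure_future(BACKGROUND_PROCESSES[name]())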

Collaborator Author:

Collin mentioned that there are currently two servers running: one for incoming events and one for viewing the dashboards. He mentioned that the incoming events were causing the rest of the system to slow down a lot.

Contributor:

OT: I'd be curious where the redis is. Network latency can be high with a non-local db.

# "ASSIGNMENT" # E.g. A collection of Google Docs (e.g. notes, outline, draft)
# "TEACHER" #
"TEACHER" #
Contributor:

My issue, but I am not sold on the teacher / student split. We should talk about this when we're less busy.

@@ -120,7 +120,6 @@ def process_text(text, options=None):
if options is None:
    # Do we want options to be everything initially or nothing?
    options = writing_observer.nlp_indicators.INDICATORS.keys()
    options = []
Contributor:

I don't understand this. We should explain when we talk.

Collaborator Author (bradley-erickson, May 8, 2024):

If no options for AWE components were passed in, did we want to run all of them or none of them? We never came to a conclusion on this, and the code sat with both coded in, with "none of them" as the path used. I made the decision that we should choose all of them, so I removed the "none of them" option.

Contributor:

Ah. Thank you. Makes sense. We should probably do all of them all the time in the current workflow.

@@ -41,6 +43,21 @@

source_selector = q.call('source_selector')

pmss.register_field(
    name='get_awe_lt_from_sep_process',
Contributor:

Choice and not boolean:

NLP_Source: choice[compute_in_dag, async_in_process, async_out_of_process]

Optional class

.awe {
    NLP_Source
}

.language_tool {
    NLP_Source
}

# overall language tool nodes
'overall_lt': languagetool(texts=q.variable('docs')),
'lt_combined': q.join(LEFT=q.variable('overall_lt'), LEFT_ON='provenance.provenance.STUDENT.value.user_id', RIGHT=q.variable('roster'), RIGHT_ON='user_id'),
'overall_lt_sep_proc': q.select(q.keys('writing_observer.lt_process', STUDENTS=q.variable('roster'), STUDENTS_path='user_id', RESOURCES=q.variable("doc_ids"), RESOURCES_path='doc_id'), fields='All'),
'lt_combined': q.join(LEFT=q.variable(lt_group_source), LEFT_ON='provenance.provenance.STUDENT.value.user_id', RIGHT=q.variable('roster'), RIGHT_ON='user_id'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is horrible, but a fine hack for the school pilots this month. We do need to figure out, at a 10,000-foot level, how we handle settings changing query DAGs. That's going to take time.

@bradley-erickson merged commit a22ac15 into master on Jun 26, 2024 (1 check failed).