WiP: initial doc processing file #140

Merged: 19 commits into master, Jun 26, 2024

Conversation

bradley-erickson (Collaborator):

No description provided.

@bradley-erickson changed the title from "initial doc processing file" to "WiP: initial doc processing file" on May 2, 2024
sa_helpers.KeyStateType.INTERNAL)
await kvs.set(auth_key, request[constants.AUTH_HEADERS])

# store teacher roster info
Contributor:

I'm curious why roster is stored and passed, rather than retrieved in the background process. It feels like the whole compute DAG should happen there.

Collaborator Author:

In the background process we use the doc id to find the student, which finds the teacher (via the roster), which finds the teacher's credentials, which are used to fetch the text of the document.
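
A minimal sketch of that chain (find_student_for_doc, find_teacher_for_student, teacher_auth_key, and fetch_doc_text are hypothetical placeholders, and kvs.get is assumed to mirror the kvs.set/kvs.remove calls visible in the diff):

async def fetch_text_for_doc(doc_id, kvs):
    # doc id -> student (from the doc/student keys in the KVS)
    student_id = await find_student_for_doc(kvs, doc_id)
    # student -> teacher, via the stored roster
    teacher_id = await find_teacher_for_student(kvs, student_id)
    # teacher -> Google auth headers saved at login
    auth_headers = await kvs.get(teacher_auth_key(teacher_id))
    # credentials -> document text, via the Google Docs API
    return await fetch_doc_text(doc_id, auth_headers)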

Contributor:

This code makes me unhappy and feels wrong, but the alternatives I've come up with are similarly bad, for other reasons.

I think the best option I've come up with -- not in time for the next study -- is to have a document maintain (see the sketch below):

  • a list of users associated with it (students and teachers)
  • some metadata about those users (e.g. when they last accessed it, permissions issues, etc.)

Might be good to have a conversation at some point.

We also will need some design documentation here:

  • Stepping through all documents feels crazy, as does the path document -> roster -> student
  • It's less crazy once you dig into it. The key point is that only active documents are processed, which will become non-obvious very quickly.

Again, not for next study.
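
For reference, a rough sketch of what such a per-document record might look like (all field names here are illustrative, not a proposed schema):

doc_record = {
    'doc_id': 'abc123',
    'users': [
        {'user_id': 'student-1', 'role': 'student',
         'last_accessed': '2024-05-02T14:00:00Z', 'access': 'ok'},
        {'user_id': 'teacher-1', 'role': 'teacher',
         'last_accessed': None, 'access': 'not-shared'},
    ],
}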

@@ -99,13 +101,28 @@ async def get_active_user(request):
    return session[constants.USER]


def google_stored_auth():
Contributor:

I might consider using _store_teacher_info_for_background_process (perhaps renamed).

{sa_helpers.KeyField.TEACHER: user['user_id']},
sa_helpers.KeyStateType.INTERNAL)
await kvs.remove(remove_auth_key)

Contributor:

I would probably remove this when auth expires (e.g. we make a request and it fails with an auth exception), rather than on logout.
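
A hedged sketch of that alternative (AuthExpiredError is a placeholder exception, and the key construction only mirrors the call shape visible in the diff; the actual key-builder name isn't shown there):

try:
    text = await fetch_doc_text(doc_id, auth_headers)
except AuthExpiredError:
    # Stale credentials: drop the stored headers so the background
    # process stops using them until the teacher logs in again.
    remove_auth_key = make_auth_key(  # hypothetical key builder
        {sa_helpers.KeyField.TEACHER: teacher_id},
        sa_helpers.KeyStateType.INTERNAL)
    await kvs.remove(remove_auth_key)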

Collaborator Author:

I believe we will also get an auth exception when the document isn't shared, which is likely to happen. I'll see about adding this to the workflow for when the auth expires.

Contributor:

I would expect different auth exceptions (even in the text if not the type), but it's worth double-checking.

We don't want to keep making requests when auth has expired, and we do want to keep track of access permissions.
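
If the fetch goes through google-api-python-client (an assumption; the client isn't shown in this diff), the two cases can usually be distinguished by status code:

import time
from googleapiclient.errors import HttpError

async def try_fetch(service, doc_id, teacher_id, failed_fetch):
    try:
        # Google Docs API call (synchronous in google-api-python-client)
        return service.documents().get(documentId=doc_id).execute()
    except HttpError as err:
        if err.resp.status == 401:
            # Expired/revoked credentials: stop using them entirely
            await remove_stored_auth(teacher_id)  # hypothetical helper
        elif err.resp.status in (403, 404):
            # Doc not shared with (or not visible to) this teacher:
            # record the failure so we back off on this doc
            failed_fetch[doc_id] = time.time()
        return None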

Collaborator Author:

If we get an auth error, stop trying that document.

student_specifier = 'STUDENT:'
students = [k[k.find(student_specifier) + len(student_specifier):] for k in matching_keys]
# TODO handle more than 1 student per document, but for now just grab the first
student = students[0]
Contributor:

This assumption is broken. The TODO should be done reasonably soon. Perhaps this is an okay hack for one study, but it adds a lot of technical debt. There is a many-to-many mapping of docs to students to teachers, and the APIs ought to reflect that. Everything that depends on this will fail to handle corner cases, which will later need to be fixed and will lead to bugs.

# Skip documents whose last failed fetch was under 5 minutes ago
if doc_id in failed_fetch:
    last_try = failed_fetch[doc_id]
    if last_try > now - 300:
        return False
return True
Contributor:

I don't quite get what you're trying to accomplish or why. Better docstring?

Collaborator Author:

Will update the docstring, but failed_fetch is updated whenever a document fails to fetch its text. We don't want to retry constantly, since some documents will never be shared and will always fail. This limits the checking to once every 5 minutes.

Contributor:

Random first thoughts:

  • We should leave a bookmark or todo here.
  • Trying to do unauthorized things every five minutes seems like a really good way to get ourselves flagged and banned by some buggy, automated Google system, so I'm a bit worried about leaving as-is.
  • At the very least, I'd do an exponential backoff (sketched after this list), but I'd be inclined to just not retry if it's a sharing issue until we are tracking sharing events.
  • In practice, in the classroom, 100% was shared, so this might be a special case. We can ask the other teachers how this works there.
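
A minimal sketch of the backoff idea (here failed_fetch stores a failure count and a next-allowed time instead of just the last attempt; the delay values are illustrative):

import time

failed_fetch = {}  # doc_id -> (n_failures, next_retry_time)

def should_try_fetch(doc_id):
    # Allow a fetch unless the doc is still in its backoff window
    _, next_retry = failed_fetch.get(doc_id, (0, 0))
    return time.time() >= next_retry

def record_failure(doc_id, base_delay=300, max_delay=6 * 3600):
    # Double the wait after each failure, capped at max_delay
    n, _ = failed_fetch.get(doc_id, (0, 0))
    delay = min(base_delay * 2 ** n, max_delay)
    failed_fetch[doc_id] = (n + 1, time.time() + delay)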

Collaborator Author:

Made it so that if the doc is in the failed-to-fetch dict, we return False instead of retrying every 5 minutes.



async def process_documents(docs):
    results = await asyncio.gather(*(process_document(d) for d in docs))
Contributor:

I'm not a big fan of the gather operations. This would work better as an asynchronous chain:

async def fetch_docs(doc_list):
    for doc_id in doc_list:
        yield await fetch_doc(doc_id)

async def process_docs(docs):
    async for doc in docs:
        yield await process_doc(doc)

and def store_docs(). This way, every doc is available as soon as it's processed, and perhaps things are a little more responsive. This doesn't give the same feel of "last retrieved 5 hours ago", but you potentially see things come in as they're ready.

Compute can also run in the background while waiting on the network, and vice versa.

Collaborator Author:

I disagree. I think using gather here is correct. I could be incorrect, but the asynchronous chain will block the next document from being started, whereas with the gather operation, while one document is waiting for a response from the Google API or LanguageTool (or another awaited item), another document can begin its own processing.

Also, we ought to create an async wrapper for the NLP AWE components; currently they are blocking.

Contributor:

We're both wrong, but we should discuss tomorrow.

https://docs.python.org/3/library/asyncio-task.html#waiting-primitives

Design requirements:

  1. The thread pool for LanguageTool / AWE / etc. is finite. Running 1000 things at once is bad. Limiting to just one, as I had, is bad too. The limit should be based on the number of cores or similar, and so specified in config (see the sketch after this list).

  2. I expect we'll be NLP-limited (redis writes will be fast, and hopefully so will Google API calls). If this is the case, we'd want to pull things in shortly before NLP capacity opens up, and store things right after.

  3. We do want things to be available to teachers once ready, and not with a big delay.
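
One way to meet requirement 1, sketched with asyncio.Semaphore and asyncio.as_completed (the pool size and the store_result step are placeholders; a real version would read the limit from config):

import asyncio

NLP_CONCURRENCY = 4  # placeholder; should come from config / core count

async def process_documents(docs):
    sem = asyncio.Semaphore(NLP_CONCURRENCY)

    async def bounded(doc):
        async with sem:
            return await process_document(doc)

    # Surface each result as soon as it finishes (requirement 3),
    # rather than waiting for the whole batch
    for task in asyncio.as_completed([bounded(d) for d in docs]):
        await store_result(await task)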

matching_teachers.append(roster['teacher_id'])

# TODO handle multiple teachers
teacher = matching_teachers[0]
Contributor:

Same one-to-many comment as above.


if __name__ == '__main__':
    import asyncio
    asyncio.run(start())
Contributor:

I might run this as a process when starting learning_observer, rather than as a separate python command; a separate command adds devops overhead, which is expensive.

A good way to do this is to check which processes to start from our settings file. By default, we run all of them, but it's possible to break it up at some point if we ever hit our millionth user.
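
A hedged sketch of that settings-driven startup (the settings shape, the BACKGROUND_PROCESSES table, and the start_doc_processing entry point are all assumptions, not the actual learning_observer config):

import asyncio

BACKGROUND_PROCESSES = {
    'doc_processing': start_doc_processing,  # hypothetical entry point
}

def start_background_processes(settings):
    # Default to running everything; settings can turn pieces off
    # if we ever need to split services across machines
    enabled = settings.get('background_processes',
                           list(BACKGROUND_PROCESSES))
    for name in enabled:
        asyncio.ensure_future(BACKGROUND_PROCESSES[name]())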

Collaborator Author:

Collin mentioned that there are currently two servers running: one for incoming events and one for viewing the dashboards. He mentioned that the incoming events were causing the rest of the system to slow down a lot.

Contributor:

OT: I'd be curious where the redis is. Network latency can be high with a non-local db.

# "ASSIGNMENT" # E.g. A collection of Google Docs (e.g. notes, outline, draft)
# "TEACHER" #
"TEACHER" #
Contributor:

My issue, but I am not sold on the teacher / student split. We should talk about this when we're less busy.

@@ -120,7 +120,6 @@ def process_text(text, options=None):
if options is None:
    # Do we want options to be everything initially or nothing?
    options = writing_observer.nlp_indicators.INDICATORS.keys()
    options = []
Contributor:

I don't understand this. We should explain when we talk.

Collaborator Author (bradley-erickson, May 8, 2024):

If no options for AWE components were passed in, did we want to run all of them or none of them? We never came to a conclusion on this, and the code sat with both coded in, with "none of them" as the path used. I made the decision that we should choose all of them, so I removed the "none of them" option.

Contributor:

Ah. Thank you. Makes sense. We should probably do all of them all the time in the current workflow.

@@ -41,6 +43,21 @@

source_selector = q.call('source_selector')

pmss.register_field(
    name='get_awe_lt_from_sep_process',
Contributor:

Choice and not boolean:

NLP_Source: choice[compute_in_dag, async_in_process, async_out_of_process]

Optional class

.awe {
    NLP_Source
}

.language_tool {
    NLP_Source
}

# overall language tool nodes
'overall_lt': languagetool(texts=q.variable('docs')),
'lt_combined': q.join(LEFT=q.variable('overall_lt'), LEFT_ON='provenance.provenance.STUDENT.value.user_id', RIGHT=q.variable('roster'), RIGHT_ON='user_id'),
'overall_lt_sep_proc': q.select(q.keys('writing_observer.lt_process', STUDENTS=q.variable('roster'), STUDENTS_path='user_id', RESOURCES=q.variable("doc_ids"), RESOURCES_path='doc_id'), fields='All'),
'lt_combined': q.join(LEFT=q.variable(lt_group_source), LEFT_ON='provenance.provenance.STUDENT.value.user_id', RIGHT=q.variable('roster'), RIGHT_ON='user_id'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is horrible, but a fine hack for the school pilots this month. We do need to figure out, at a 10,000-foot level, how we handle settings changing query DAGs. That's going to take time.

@bradley-erickson merged commit a22ac15 into master on Jun 26, 2024 (1 check failed).