Doc imp cop by Numerlor · Pull Request #1 · Numerlor/bot

Numerlor · 2020-08-08T23:05:51Z

No description provided.

# Conflicts: # bot/cogs/doc/cog.py

Numerlor · 2020-08-08T23:09:43Z


+class TODO_PLACEHOLDER:
+    __slots__ = ("queue", "item_events", "started", "fetch_lock")
+    results_dict: Dict[str, discord.Embed] = None


This class handles parsing of all items on one page. For example, for the URL https://docs.python.org/3/library/random.html it'll parse the module itself, all the classes, etc. The list is used as a FIFO queue so that items can be prioritized when a doc request is made.

I'm not quite sure how to handle the results here because wait_for_item is an async task that is started on the first request to a URL. Currently it is done by setting the above results_dict to the dict which the cog fetches results from, but it doesn't feel great. Something with callbacks came to mind, but the interface for the caller wouldn't be much better.

Yeah, I don't like it either. Yielding/returning the results makes more sense to me. Whatever is using this class should be responsible for storing the results.

The goal here was to simplify it a bit for the user class.
The only thing that came to mind was starting a task in the cog that'd iterate over parse_items (which would yield the descriptions) and assign to its result dict, but that doesn't feel like a great interface either, as it puts the responsibility of starting the long-running parsing on the class that uses it.

Yeah. Then how about making this class act as the cache container itself, or at least storing the cache solely in this class rather than passing a dict around? You can re-design this class a bit so that you're not going to need a separate instance for each page, which will make my suggestion viable. The queue can hold futures for all pages; you'll just need to keep track of which pages have already been added to the queue to avoid redundant parsing.

Just so I'm sure we're thinking of the same thing.
There'd be a "master" queue where all pages would get queued as requested from an internal url: symbol_ids mapping; the results of that would be saved to something like self.objects in the cog. Then the public API of the class could just be an add_symbol_id(url, symbol_id) when initiating inventories and a get_symbol(symbol_id) for requesting them.
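A minimal sketch of that two-method API, to make the proposal concrete. Everything here is hypothetical: the class name SymbolParser, the parse_page callable standing in for fetching and parsing a page, and the use of a per-page lock in place of the prioritized queue discussed above.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict, List


class SymbolParser:
    """Hypothetical sketch: one shared object that owns the mapping and the results."""

    def __init__(self, parse_page: Callable[[str, List[str]], Awaitable[Dict[str, str]]]):
        # parse_page(url, symbol_ids) is an assumed stand-in for fetching and parsing a page.
        self._parse_page = parse_page
        self._page_symbols: Dict[str, List[str]] = defaultdict(list)  # url -> symbol ids
        self._symbol_urls: Dict[str, str] = {}  # symbol id -> url
        self._results: Dict[str, str] = {}  # symbol id -> parsed markdown
        self._page_locks: Dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    def add_symbol_id(self, url: str, symbol_id: str) -> None:
        """Register a symbol while an inventory is loaded; nothing is parsed yet."""
        self._page_symbols[url].append(symbol_id)
        self._symbol_urls[symbol_id] = url

    async def get_symbol(self, symbol_id: str) -> str:
        """Return the parsed symbol, parsing its whole page on the first request."""
        url = self._symbol_urls[symbol_id]
        # The lock keeps concurrent requests for the same page from parsing it twice.
        async with self._page_locks[url]:
            if symbol_id not in self._results:
                parsed = await self._parse_page(url, self._page_symbols[url])
                self._results.update(parsed)
        return self._results[symbol_id]
```

With this shape the caller never sees the result dict; it only registers symbols and awaits get_symbol.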

Yes, something along those lines. You could probably have those methods accept a DocItem, but if you only need those two attributes then I suppose being more explicit is better. That being said, I thought of two problems:

When should the queue start? Currently it runs for only one page once a specific symbol is requested. If the queue is shared, I'm not sure how to still accomplish this. Possibly store the symbols in a separate data structure and move them to the queue later, once one symbol from the page is requested. That's just one idea for a solution.

How should the soup objects be stored? How will it know when to get rid of them?

Yeah, passing the DocItems through is better since we need the url anyway, so there's no need for it to be passed by the caller, and they already exist.
Started trying out the implementation; the results and self.urls are moved from the Cog to this class. On an item request, if the symbols need to be queued, the DocItems of the page that symbol is located in (from self.urls) are moved to the queue and the queue is started. Together with each symbol we can also pass the soup in a container which takes care of its storage (once we're done with the symbols from the page, it gets completely dereferenced on our part). When a symbol is parsed it's handled like before, but the result is returned, with the events and result dict handled internally.
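A rough sketch of the container idea, assuming each queue entry simply carries a reference to its page's soup. FakeSoup, QueueItem, queue_page and drain are hypothetical names, and FakeSoup is only a stand-in for a BeautifulSoup object so the example needs nothing outside the standard library. The weak reference is returned purely so the lifetime can be observed.

```python
import weakref
from typing import List, NamedTuple


class FakeSoup:
    """Assumed stand-in for a parsed BeautifulSoup document."""

    def __init__(self, url: str):
        self.url = url


class QueueItem(NamedTuple):
    """A queued symbol together with the soup of the page it lives on."""

    symbol_id: str
    soup: FakeSoup


def queue_page(queue: List[QueueItem], url: str, symbol_ids: List[str]) -> "weakref.ref":
    """Queue every symbol of a page; the queue items hold the only strong references."""
    soup = FakeSoup(url)
    queue.extend(QueueItem(symbol_id, soup) for symbol_id in symbol_ids)
    return weakref.ref(soup)


def drain(queue: List[QueueItem]) -> List[str]:
    """Pop items until the queue is empty, keeping only plain strings as results."""
    parsed = []
    while queue:
        item = queue.pop()
        parsed.append(f"{item.symbol_id} @ {item.soup.url}")
    return parsed
```

Once the last item of a page has been popped and handled, nothing references the soup any more, so no separate cleanup step is needed.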

Should the parsing task be stopped when the queue ends up empty again? We can either start it every time symbols are added to an empty queue, or have it run indefinitely from the first request and just skip the code when there's nothing to handle.

That sounds like what I had in mind. I'm happy with that solution. What about you?

Should the parsing task be stopped when the queue ends up empty again?

Skip how? If it could await until an item was put in the queue, that'd be fine. Unfortunately, you can't use an asyncio queue since you need to be able to move items from arbitrary positions to the front. Speaking of which, is there a more efficient data structure than a list for doing that? Earlier, I was looking into priority queues. Anyway, restarting the tasks sounds better, but I don't know for sure until you clarify the implementation.

I haven't found a more fitting data structure. I tried queue.PriorityQueue, but it looks like it only does the priority matching on item insertion, and it's done by comparing against all other items, which also isn't very efficient.
Unless there's something that has a better interface than a list and doesn't need to be implemented, I don't think it's worth putting much time into it, as the queue will never be long enough for the occasional remove and append to become an issue.
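For reference, the "remove and append" on a plain list can be sketched as below. This assumes the consumer takes the next item with queue.pop(), so the "front" of the queue is the end of the list; the standalone function is only an illustration of the put_to_front method seen in the diff, not its actual body.

```python
from typing import List, TypeVar

T = TypeVar("T")


def put_to_front(queue: List[T], item: T) -> None:
    """Make `item` the next one returned by queue.pop()."""
    try:
        # Linear scan; acceptable while a page only has a modest number of symbols.
        queue.remove(item)
    except ValueError:
        pass  # The item wasn't queued yet, so it is simply added.
    queue.append(item)
```

For example, starting from ["a", "b", "c"], put_to_front(queue, "a") leaves ["b", "c", "a"], and the next pop() returns "a".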

Pushed the changed class in c1a516d

Numerlor · 2020-08-08T23:14:46Z

+            event = self.item_events.get(item)
+            if event is not None:
+                event.set()
+            await asyncio.sleep(0.1)


We need to give the event loop a bit of control so the bot can stay alive; 0 as the sleep period maxes out the core the process is running on. Locally, 0.1 halved that utilization (with around a 3x increase in the total time to parse some larger pages), but I don't know how this would translate to the production bot.
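The pattern in question, reduced to a sketch. The function below is a hypothetical simplification of the loop in the diff: parse stands in for the blocking per-item parsing and pause for the sleep period being discussed.

```python
import asyncio
from typing import Callable, Iterable, List


async def parse_items(
    items: Iterable[str],
    parse: Callable[[str], str],
    pause: float = 0.1,
) -> List[str]:
    """Parse items one by one, yielding to the event loop after each of them."""
    results = []
    for item in items:
        # Blocking, CPU-bound work; nothing else on the loop runs during this call.
        results.append(parse(item))
        # Suspend this coroutine so other tasks (heartbeats, commands) get a turn.
        # pause=0 still yields, but resumes as soon as the loop comes back around.
        await asyncio.sleep(pause)
    return results
```

The trade-off is the one described above: a longer pause leaves more idle time for the rest of the bot at the cost of a longer total parse time.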

I have no idea. Seems kind of hacky but I've seen this sort of thing done elsewhere before.

Another option would be to run this in a thread executor, but then we'd need to introduce some locks to prevent potential race conditions from the queue item movement. Giving the event loop control in this way works fine since the individual parsing doesn't take that long, but I agree that it feels a bit hacky, and the sleep period may affect things differently on different machines.
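A sketch of what the executor variant could look like, assuming a threading.Lock around every queue access. ThreadedParser and its methods are hypothetical names; loop.run_in_executor with None uses the loop's default thread pool.

```python
import asyncio
import threading
from typing import Callable, Dict, List


class ThreadedParser:
    """Hypothetical sketch: drain the queue in a worker thread instead of on the loop."""

    def __init__(self, parse: Callable[[str], str]):
        self._parse = parse
        self._queue: List[str] = []
        self._lock = threading.Lock()  # guards _queue, which both threads touch

    def put_to_front(self, item: str) -> None:
        """Called from the event loop thread when a symbol is requested."""
        with self._lock:
            if item in self._queue:
                self._queue.remove(item)
            self._queue.append(item)

    def _drain(self) -> Dict[str, str]:
        """Runs in the worker thread until the queue is empty."""
        results: Dict[str, str] = {}
        while True:
            with self._lock:
                if not self._queue:
                    return results
                item = self._queue.pop()
            # Parse outside the lock so put_to_front never waits on a parse.
            results[item] = self._parse(item)

    async def run(self) -> Dict[str, str]:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._drain)
```

The lock only covers the list operations, so the event loop thread is blocked for at most one remove and append.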

This is where my concurrency knowledge fails me. I thought the separate thread would still eat all the cycles. At what point would the context switch occur if one thread is giving no breathing room and is maxing out the core?

The main thread that's running asyncio is doing lots of IO, which gives the executor threads plenty of opportunities to acquire the GIL. Even when the executors are fully utilized, a waiting thread sends a forced request after a certain amount of time to the thread that's holding the GIL, which then has to release it.
The behaviour you're describing only happens when the thread is doing something during which Python can't release the GIL, like a huge calculation where all the time is spent on a single instruction.
As an example:

    import threading

    def cpu_and_memory_consumer():
        a = []
        while True:
            a.append(0)

    def cpu_consumer():
        while True:
            pass

    threads = [
        threading.Thread(target=cpu_consumer),
        threading.Thread(target=cpu_and_memory_consumer),
        #threading.Thread(target=lambda: 1234567890**1234567890),
    ]
    for thread in threads:
        thread.start()
        print("started", thread)

Neither function voluntarily releases the GIL, and both take all the CPU that's available to them. Still, the memory usage clearly keeps increasing even with cpu_consumer started first, because the main thread and then the memory consumer both got the GIL to do their own thing. If we uncomment the computational thread, everything stops once it gets the GIL: there's no big memory usage from the appends and the "started" print never happens.

I'm of course not familiar with the devops of the bot and don't know whether maxing out the available CPU is acceptable, be it in threads without any sleep periods or through a 0 asyncio sleep.
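A bounded variant of the experiment above that terminates on its own, in case anyone wants to check the switching behaviour without watching memory usage. run_busy_threads is a hypothetical helper; sys.getswitchinterval() reports how long CPython lets a thread run before a waiting thread's request forces it to give up the GIL.

```python
import sys
import threading
import time
from typing import List


def run_busy_threads(duration: float = 0.5, workers: int = 2) -> List[int]:
    """Spin several pure-Python busy loops and report how far each one got."""
    counts = [0] * workers
    stop = threading.Event()

    def spin(index: int) -> None:
        # No sleeping and no IO: the loop never releases the GIL voluntarily.
        while not stop.is_set():
            counts[index] += 1

    threads = [threading.Thread(target=spin, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    time.sleep(duration)  # the main thread has to win the GIL back to continue
    stop.set()
    for thread in threads:
        thread.join()
    return counts


if __name__ == "__main__":
    print("switch interval:", sys.getswitchinterval())
    print("iterations per thread:", run_busy_threads())
```

Every counter ending up above zero, and the main thread being able to stop the workers at all, shows that each thread was given the GIL despite none of them yielding.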

Numerlor · 2020-08-08T23:32:52Z

@@ -3,12 +3,13 @@
 import logging


The changes here replace the caching of BeautifulSoup objects done by the current implementation, python-discord#1014 (which can pile up and potentially become a memory issue). Instead, we store the lookup ids of each symbol, and when one symbol on a page is requested, all the others on that page are parsed alongside it so we can throw out the BS object once that's done.

MarkKoz · 2020-08-15T01:01:14Z

+            if not q.started:
+                async with self.bot.http_session.get(symbol_data.url) as response:
+                    soup = BeautifulSoup(await response.text(encoding="utf8"), 'lxml')
+                    soup.find("head").decompose()


I find this conditional behaviour to be a confusing API for the class, i.e. the fact that soup is only needed sometimes. I think it'll be clearer if this is done inside the class to simplify the API.

MarkKoz · 2020-08-15T01:16:28Z


+class TODO_PLACEHOLDER:
+    __slots__ = ("queue", "item_events", "started", "fetch_lock")
+    results_dict: Dict[str, discord.Embed] = None


Yeah, I don't like it either. Yielding/returning the results makes more sense to me. Whatever is using this class should be responsible for storing the results.

MarkKoz · 2020-08-15T01:18:03Z

+            embed = discord.Embed(
+                title=discord.utils.escape_markdown(item.name),
+                url=item.url,
+                description=get_symbol_markdown(soup, item)
+            )


Returning an embed feels like piling too much responsibility onto one class. Let's strip it down and keep it to the minimum: returning the markdown description.

MarkKoz · 2020-08-15T01:26:45Z

+        self.put_to_front(symbol_info)
+        item_event = asyncio.Event()
+        # First request
+        if not self.item_events:


This looks wrong. Not every item in the queue is going to have an event associated with it. Only items that are being waited for will have events. Therefore, parse_items could still be running despite an empty item_events.
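One possible way around this, sketched under the assumption that the class keeps a handle to the parsing task and checks that instead of item_events. QueueParser, add_items, get and the _task attribute are hypothetical names, and parse stands in for the real per-item parsing.

```python
import asyncio
from typing import Callable, Dict, Iterable, List, Optional


class QueueParser:
    """Hypothetical sketch: decide whether to start parsing from the task's own state."""

    def __init__(self, parse: Callable[[str], str]):
        self._parse = parse
        self._queue: List[str] = []
        self._item_events: Dict[str, asyncio.Event] = {}
        self._results: Dict[str, str] = {}
        self._task: Optional[asyncio.Task] = None

    def add_items(self, items: Iterable[str]) -> None:
        self._queue.extend(items)

    async def get(self, item: str) -> str:
        if item not in self._results:
            # Move the requested item to the end, which is popped first.
            if item in self._queue:
                self._queue.remove(item)
            self._queue.append(item)
            event = self._item_events.setdefault(item, asyncio.Event())
            # Start a task only if none is running; waiters play no part in this check.
            if self._task is None or self._task.done():
                self._task = asyncio.create_task(self._parse_items())
            await event.wait()
        return self._results[item]

    async def _parse_items(self) -> None:
        while self._queue:
            item = self._queue.pop()
            self._results[item] = self._parse(item)
            # Only items somebody is waiting for have an event.
            event = self._item_events.pop(item, None)
            if event is not None:
                event.set()
            await asyncio.sleep(0)  # let waiters and other tasks run
```

Because the task finishes when the queue is empty, a later request simply starts a fresh one, which matches the "restart the task" option discussed earlier in the thread.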

MarkKoz · 2020-08-15T01:36:53Z

+            event = self.item_events.get(item)
+            if event is not None:
+                event.set()
+            await asyncio.sleep(0.1)


I have no idea. Seems kind of hacky but I've seen this sort of thing done elsewhere before.

Numerlor added 3 commits July 17, 2020 20:39

TODO: commit description

0f61728

todo description

d65c937

Merge branch 'doc-imp' into doc-imp-cop

48eeb34

# Conflicts: # bot/cogs/doc/cog.py

Numerlor commented Aug 8, 2020

MarkKoz reviewed Aug 15, 2020

k

c1a516d

Numerlor force-pushed the doc-imp branch from 4a90297 to 6c4dd4c Compare October 1, 2020 21:09

Numerlor pushed a commit that referenced this pull request Oct 4, 2020

Merge pull request #1 from python-discord/master

2bfa1c2

merge back new changes

Numerlor force-pushed the doc-imp branch 2 times, most recently from e77e2e5 to 2836ce6 Compare November 10, 2020 12:46

Numerlor force-pushed the doc-imp branch from 0ebce26 to 5258c79 Compare December 15, 2020 04:06

Numerlor force-pushed the doc-imp branch 5 times, most recently from 1d22c3c to 5c97efa Compare January 11, 2021 01:21

Numerlor force-pushed the doc-imp branch from 7b7fad4 to d620a38 Compare February 23, 2021 02:34

Numerlor force-pushed the doc-imp branch from df5c7de to f7b56c5 Compare March 5, 2021 01:46

Numerlor pushed a commit that referenced this pull request Jun 30, 2021

Merge pull request #1 from JakeM0001/griff/helpdm

c808ca0

Griff/helpdm

Numerlor closed this Jul 10, 2021

Numerlor deleted the doc-imp-cop branch July 10, 2021 21:36

Numerlor pushed a commit that referenced this pull request Aug 5, 2023

Merge pull request #1 from VirdanTheBurden/feat-add-reason-argument-t…

2e01be7

…o-pardons feat: add reason argument to pardon commands

2 participants
