Doc imp cop#1
Conversation
# Conflicts: # bot/cogs/doc/cog.py
|
|
||
| class TODO_PLACEHOLDER: | ||
| __slots__ = ("queue", "item_events", "started", "fetch_lock") | ||
| results_dict: Dict[str, discord.Embed] = None |
There was a problem hiding this comment.
This class handles parsing of all items on one page, for example behind the url https://docs.python.org/3/library/random.html it'll parse the module itself, all the classes etc. using the list as a FIFO queue for prioritizing items when a doc request is made.
I'm not quite sure how to handle the results here because wait_for_item is an async task that is started on the first request to an url. Currently it is done by setting the above results_dict to the dict which the cog fetches results from, but it doesn't feel great. Something with callbacks came to mind but the interface for the caller wouldn't be much better
There was a problem hiding this comment.
Yeah, I don't like it either. Yielding/returning the results makes more sense to me. Whatever is using this class should be responsible for storing the results.
There was a problem hiding this comment.
The goal here was to simplify it a bit for the user class
The only thing that came to mind was starting a task in the cog that'd iterate over parse_items (which would yield the descriptions) and assign to its result dict but that doesn't feel like a great interface either as it puts the responsibility of starting the long running parsing on the class that uses it.
There was a problem hiding this comment.
Yeah. Then how about making this class act as the cache container itself, or at least storing the cache solely in this class rather than passing a dict around? You can re-design this class a bit so that you're not going to need a separate instance for each page, which will make my suggestion viable. The queue can hold futures for all pages; you'll just need to keep track of which pages have already been added to the queue to avoid redundant parsing.
There was a problem hiding this comment.
Just so I'm sure we're thinking of the thing.
There'd be a "master" queue where all pages would get queued as requested from an internal url: symbol_ids mapping, the results of that would be saved to something like self.objects in the cog. Then the public API of the class could just be an add_symbol_id(url, symbol_id) when initiating inventories and get_symbol(symbol_id) for requesting them
There was a problem hiding this comment.
Yes, something along those lines. You could probably have those methods accept a DocItem, but if you only need those two attributes then I suppose being more explicit is better. That being said, I thought of two problems:
- When should the queue start? Currently it runs for only one page once a specific symbol is requested. If the queue is shared, I'm not sure how to still accomplish this. Possibly store the symbols in a separate data structure and move them to the queue later, once one symbol from the page is requested. That's just one idea for a solution.
- How should the soup objects be stored? How will it know when to get rid of them?
There was a problem hiding this comment.
Yeah passing the DocItems through is better since we need the url anyway so there's no need for it to be passed by the caller, and they already exist.
Started trying out the implementation, the results and self.urls are moved from the Cog to this class. On a item request , if the symbols need to be queued, the DocItems of the page that symbol is located in (from self.urls) is moved to the queue and the queue is started. Together witch each symbol we can also pass the soup in a container which takes care of their storage (Once we're done with the symbols from the page it gets completely dereferenced on our part). When a symbol is parsed it's handled like before, but the result is returned with the events and result dict handled internally.
Should the parsing task be stopped when the queue ends up empty again? We can either start it every time symbols are added to an empty queue, or have it run infinitely from the first request and just skip the code when there's nothing to handle
There was a problem hiding this comment.
That sounds like what I had in mind. I'm happy with that solution. What about you?
Should the parsing task be stopped when the queue ends up empty again?
Skip how? If it could await until an item was put in the queue, that'd be fine. Unfortunately, you can't use an asyncio queue since you need to be able to move items from arbitrary positions to the front. Speaking of which, is there a more efficient data structure than a list for doing that? Earlier, I was looking into priority queues. Anyway, restarting the tasks sounds better, but I don't know for sure until you clarify the implementation.
There was a problem hiding this comment.
I haven't found a more fitting data structure, tried queue.PriorityQueue but it looks like they only do the priority matching on item insertion, and it's done by comparing it against all other items which also isn't very efficient.
Unless there's something that has a better interface than a list and doesn't need to be implemented, I don't think it's worth putting much time into it as the queue will never be long enough for the occasional remove and append to become an issue
| event = self.item_events.get(item) | ||
| if event is not None: | ||
| event.set() | ||
| await asyncio.sleep(0.1) |
There was a problem hiding this comment.
We need to give the event loop a bit of control so the bot can stay alive, 0 as the sleep period maxes out the core the process is running in. Locally 0.1 halved that utilization (with around a 3x increase in the total time to parse some larger pages) but I don't know how this would reflect in the production bot
There was a problem hiding this comment.
I have no idea. Seems kind of hacky but I've seen this sort of thing done elsewhere before.
There was a problem hiding this comment.
Another option would be to run this in a thread executor, but then we'd need to introduce some locks to prevent potential race conditions from the queue item movement. Giving the event loop control in this way works fine since the individual parsing doesn't take that long but I agree that it feels a bit hacky and the sleep period may influence things differently on different machines
There was a problem hiding this comment.
This is where my concurrency knowledge fails me. I thought the separate thread would still eat all the cycles. At what point would the context switch occur if one thread is giving no breathing room and is maxing out the core?
There was a problem hiding this comment.
The main thread that's running asyncio is doing lots of IO so that's getting it a lot of opportunities for the executors to acquire the GIL; for a full utilization in the executors, the first thread should send a forced request after a certain amount of time to the thread that's holding the GIL which has to release it.
The behaviour you're describing only happens when the thread is doing something during which python can't handle the GIL like huge calculations where time is spent on a single instruction.
As an example
import threading
def cpu_and_memory_consumer():
a = []
while True:
a.append(0)
def cpu_consumer():
while True:
pass
threads = [
threading.Thread(target=cpu_consumer),
threading.Thread(target=cpu_and_memory_consumer),
#threading.Thread(target=lambda: 1234567890**1234567890),
]
for thread in threads:
thread.start()
print("started", thread)Both of the functions take all that's available to them and shouldn't voluntarily release the GIL, but when you look at the memory usage it's clearly increasing even with the cpu_consumer started first (main thread and then the memory consumer both got the GIL to do their own thing), but then if we uncomment the computational thread everything will stop once it gets the GIL (No big memory usage from the appends and the started print never happens).
I'm of course not familiar with the devops of the bot and don't know if maxing out the available cpu is a good thing or not, in the threads without any sleep periods or through a 0 asyncio sleep
| @@ -3,12 +3,13 @@ | |||
| import logging | |||
There was a problem hiding this comment.
The functionality the changes here implement is that instead of caching beautifulsoup objects like in the current implementation python-discord#1014 (which can bubble up and potentially become a memory issue) we store the lookup ids of each symbol, and when one symbol on a page is requested all others on that page are parsed alongside it so we can throw out the BS object when that's done.
| if not q.started: | ||
| async with self.bot.http_session.get(symbol_data.url) as response: | ||
| soup = BeautifulSoup(await response.text(encoding="utf8"), 'lxml') | ||
| soup.find("head").decompose() |
There was a problem hiding this comment.
I find this conditional behaviour to be a confusing API for the class i.e. the fact that soup is only needed sometimes. I think it'll be clearer if this is done inside the class to simplify the API.
|
|
||
| class TODO_PLACEHOLDER: | ||
| __slots__ = ("queue", "item_events", "started", "fetch_lock") | ||
| results_dict: Dict[str, discord.Embed] = None |
There was a problem hiding this comment.
Yeah, I don't like it either. Yielding/returning the results makes more sense to me. Whatever is using this class should be responsible for storing the results.
| embed = discord.Embed( | ||
| title=discord.utils.escape_markdown(item.name), | ||
| url=item.url, | ||
| description=get_symbol_markdown(soup, item) | ||
| ) |
There was a problem hiding this comment.
Returning an embed feels like piling on too much responsibility to one class. Let's strip it down and keep to its minimum - returning the markdown description.
| self.put_to_front(symbol_info) | ||
| item_event = asyncio.Event() | ||
| # First request | ||
| if not self.item_events: |
There was a problem hiding this comment.
This looks wrong. Not every item in the queue is going to have an event associated with it. Only items that are being waited for will have events. Therefore, the parse_items could still be running despite an empty item_events.
| event = self.item_events.get(item) | ||
| if event is not None: | ||
| event.set() | ||
| await asyncio.sleep(0.1) |
There was a problem hiding this comment.
I have no idea. Seems kind of hacky but I've seen this sort of thing done elsewhere before.
e77e2e5 to
2836ce6
Compare
1d22c3c to
5c97efa
Compare
…o-pardons feat: add reason argument to pardon commands
No description provided.