Skip to content

Doc imp cop#1

Closed
Numerlor wants to merge 4 commits into
doc-impfrom
doc-imp-cop
Closed

Doc imp cop#1
Numerlor wants to merge 4 commits into
doc-impfrom
doc-imp-cop

Conversation

@Numerlor
Copy link
Copy Markdown
Owner

@Numerlor Numerlor commented Aug 8, 2020

No description provided.

Comment thread bot/cogs/doc/cog.py Outdated

class TODO_PLACEHOLDER:
__slots__ = ("queue", "item_events", "started", "fetch_lock")
results_dict: Dict[str, discord.Embed] = None
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class handles parsing of all items on one page, for example behind the url https://docs.python.org/3/library/random.html it'll parse the module itself, all the classes etc. using the list as a FIFO queue for prioritizing items when a doc request is made.

I'm not quite sure how to handle the results here because wait_for_item is an async task that is started on the first request to an url. Currently it is done by setting the above results_dict to the dict which the cog fetches results from, but it doesn't feel great. Something with callbacks came to mind but the interface for the caller wouldn't be much better

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't like it either. Yielding/returning the results makes more sense to me. Whatever is using this class should be responsible for storing the results.

Copy link
Copy Markdown
Owner Author

@Numerlor Numerlor Aug 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here was to simplify it a bit for the user class
The only thing that came to mind was starting a task in the cog that'd iterate over parse_items (which would yield the descriptions) and assign to its result dict but that doesn't feel like a great interface either as it puts the responsibility of starting the long running parsing on the class that uses it.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Then how about making this class act as the cache container itself, or at least storing the cache solely in this class rather than passing a dict around? You can re-design this class a bit so that you're not going to need a separate instance for each page, which will make my suggestion viable. The queue can hold futures for all pages; you'll just need to keep track of which pages have already been added to the queue to avoid redundant parsing.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I'm sure we're thinking of the thing.
There'd be a "master" queue where all pages would get queued as requested from an internal url: symbol_ids mapping, the results of that would be saved to something like self.objects in the cog. Then the public API of the class could just be an add_symbol_id(url, symbol_id) when initiating inventories and get_symbol(symbol_id) for requesting them

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, something along those lines. You could probably have those methods accept a DocItem, but if you only need those two attributes then I suppose being more explicit is better. That being said, I thought of two problems:

  1. When should the queue start? Currently it runs for only one page once a specific symbol is requested. If the queue is shared, I'm not sure how to still accomplish this. Possibly store the symbols in a separate data structure and move them to the queue later, once one symbol from the page is requested. That's just one idea for a solution.
  2. How should the soup objects be stored? How will it know when to get rid of them?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah passing the DocItems through is better since we need the url anyway so there's no need for it to be passed by the caller, and they already exist.
Started trying out the implementation, the results and self.urls are moved from the Cog to this class. On a item request , if the symbols need to be queued, the DocItems of the page that symbol is located in (from self.urls) is moved to the queue and the queue is started. Together witch each symbol we can also pass the soup in a container which takes care of their storage (Once we're done with the symbols from the page it gets completely dereferenced on our part). When a symbol is parsed it's handled like before, but the result is returned with the events and result dict handled internally.

Should the parsing task be stopped when the queue ends up empty again? We can either start it every time symbols are added to an empty queue, or have it run infinitely from the first request and just skip the code when there's nothing to handle

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like what I had in mind. I'm happy with that solution. What about you?

Should the parsing task be stopped when the queue ends up empty again?

Skip how? If it could await until an item was put in the queue, that'd be fine. Unfortunately, you can't use an asyncio queue since you need to be able to move items from arbitrary positions to the front. Speaking of which, is there a more efficient data structure than a list for doing that? Earlier, I was looking into priority queues. Anyway, restarting the tasks sounds better, but I don't know for sure until you clarify the implementation.

Copy link
Copy Markdown
Owner Author

@Numerlor Numerlor Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found a more fitting data structure, tried queue.PriorityQueue but it looks like they only do the priority matching on item insertion, and it's done by comparing it against all other items which also isn't very efficient.
Unless there's something that has a better interface than a list and doesn't need to be implemented, I don't think it's worth putting much time into it as the queue will never be long enough for the occasional remove and append to become an issue

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed the changed class in c1a516d

Comment thread bot/cogs/doc/cog.py
event = self.item_events.get(item)
if event is not None:
event.set()
await asyncio.sleep(0.1)
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to give the event loop a bit of control so the bot can stay alive, 0 as the sleep period maxes out the core the process is running in. Locally 0.1 halved that utilization (with around a 3x increase in the total time to parse some larger pages) but I don't know how this would reflect in the production bot

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea. Seems kind of hacky but I've seen this sort of thing done elsewhere before.

Copy link
Copy Markdown
Owner Author

@Numerlor Numerlor Aug 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to run this in a thread executor, but then we'd need to introduce some locks to prevent potential race conditions from the queue item movement. Giving the event loop control in this way works fine since the individual parsing doesn't take that long but I agree that it feels a bit hacky and the sleep period may influence things differently on different machines

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where my concurrency knowledge fails me. I thought the separate thread would still eat all the cycles. At what point would the context switch occur if one thread is giving no breathing room and is maxing out the core?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main thread that's running asyncio is doing lots of IO so that's getting it a lot of opportunities for the executors to acquire the GIL; for a full utilization in the executors, the first thread should send a forced request after a certain amount of time to the thread that's holding the GIL which has to release it.
The behaviour you're describing only happens when the thread is doing something during which python can't handle the GIL like huge calculations where time is spent on a single instruction.
As an example

import threading


def cpu_and_memory_consumer():
    a = []
    while True:
        a.append(0)


def cpu_consumer():
    while True:
        pass


threads = [
    threading.Thread(target=cpu_consumer),
    threading.Thread(target=cpu_and_memory_consumer),
    #threading.Thread(target=lambda: 1234567890**1234567890),
]
for thread in threads:
    thread.start()
    print("started", thread)

Both of the functions take all that's available to them and shouldn't voluntarily release the GIL, but when you look at the memory usage it's clearly increasing even with the cpu_consumer started first (main thread and then the memory consumer both got the GIL to do their own thing), but then if we uncomment the computational thread everything will stop once it gets the GIL (No big memory usage from the appends and the started print never happens).

I'm of course not familiar with the devops of the bot and don't know if maxing out the available cpu is a good thing or not, in the threads without any sleep periods or through a 0 asyncio sleep

Comment thread bot/cogs/doc/cog.py Outdated
Comment thread bot/cogs/doc/cog.py
Comment thread bot/cogs/doc/cog.py
@@ -3,12 +3,13 @@
import logging
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functionality the changes here implement is that instead of caching beautifulsoup objects like in the current implementation python-discord#1014 (which can bubble up and potentially become a memory issue) we store the lookup ids of each symbol, and when one symbol on a page is requested all others on that page are parsed alongside it so we can throw out the BS object when that's done.

Comment thread bot/cogs/doc/cog.py Outdated
Comment thread bot/cogs/doc/cog.py Outdated
Comment on lines +267 to +270
if not q.started:
async with self.bot.http_session.get(symbol_data.url) as response:
soup = BeautifulSoup(await response.text(encoding="utf8"), 'lxml')
soup.find("head").decompose()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this conditional behaviour to be a confusing API for the class i.e. the fact that soup is only needed sometimes. I think it'll be clearer if this is done inside the class to simplify the API.

Comment thread bot/cogs/doc/cog.py Outdated

class TODO_PLACEHOLDER:
__slots__ = ("queue", "item_events", "started", "fetch_lock")
results_dict: Dict[str, discord.Embed] = None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't like it either. Yielding/returning the results makes more sense to me. Whatever is using this class should be responsible for storing the results.

Comment thread bot/cogs/doc/cog.py Outdated
Comment on lines +106 to +110
embed = discord.Embed(
title=discord.utils.escape_markdown(item.name),
url=item.url,
description=get_symbol_markdown(soup, item)
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an embed feels like piling on too much responsibility to one class. Let's strip it down and keep to its minimum - returning the markdown description.

Comment thread bot/cogs/doc/cog.py Outdated
self.put_to_front(symbol_info)
item_event = asyncio.Event()
# First request
if not self.item_events:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong. Not every item in the queue is going to have an event associated with it. Only items that are being waited for will have events. Therefore, the parse_items could still be running despite an empty item_events.

Comment thread bot/cogs/doc/cog.py
Comment thread bot/cogs/doc/cog.py
event = self.item_events.get(item)
if event is not None:
event.set()
await asyncio.sleep(0.1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea. Seems kind of hacky but I've seen this sort of thing done elsewhere before.

Numerlor pushed a commit that referenced this pull request Oct 4, 2020
@Numerlor Numerlor force-pushed the doc-imp branch 2 times, most recently from e77e2e5 to 2836ce6 Compare November 10, 2020 12:46
@Numerlor Numerlor force-pushed the doc-imp branch 5 times, most recently from 1d22c3c to 5c97efa Compare January 11, 2021 01:21
Numerlor pushed a commit that referenced this pull request Jun 30, 2021
@Numerlor Numerlor closed this Jul 10, 2021
@Numerlor Numerlor deleted the doc-imp-cop branch July 10, 2021 21:36
Numerlor pushed a commit that referenced this pull request Aug 5, 2023
…o-pardons

feat: add reason argument to pardon commands
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants