Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Whoosh search work #61

Merged
merged 37 commits into from Sep 6, 2022
Merged

Initial Whoosh search work #61

merged 37 commits into from Sep 6, 2022

Conversation

kura
Copy link
Contributor

@kura kura commented Sep 3, 2022

Summary

Provide a short overview ...

Details

Context to describe changes if necessary

Checks

  • In case of new feature, add short overview in docs/<corresponding file>
  • Tested changes

@kura
Copy link
Contributor Author

kura commented Sep 3, 2022

@Linbreux I have created this as a draft since I'm not done yet but this is my initial work on moving search over to Whoosh. I thought I'd create a draft PR so you could have a look to see if you're OK with the approach.

Currently it only indexes documents when the wiki is started and the search index is empty.

TODO:

  • Run search.index when a file is created in the wiki
  • Run search.update when a file is updated in the wiki - I ended up going with a simpler method of deleting the old version and indexing the new one which is a supported and seems to be how the update method works anyway. It has the advantage of making the code cleaner imo, no special cases in the save method in wiki.py and removes the update method from search.py entirely.
  • Run search.delete when a file is deleted in the wiki
  • Figure out how to run the 3 above calls when files are created/updated/deleted on the filesystem outside of wikmd
  • Write search index to same parent directory as images i.e. /wiki
    • Hide search index directory from directory listings etc
  • Add highlights to search results

I am really not sure how to force reindexing files that are created/updated/deleted directly on the filesystem. It seems to me like we'd need a thread running that is monitoring filesystem changes and triggering indexing based on that, and also trigger a reindex every time wikmd is started.

@Linbreux
Copy link
Owner

Linbreux commented Sep 3, 2022

Perfect, I'll look into it when I have some free time and maybe add some commits myself. Appreciate the work you put into it!

@kura
Copy link
Contributor Author

kura commented Sep 3, 2022

No problem, I was bored and wanted something to do. :D I'll make some more changes this evening or tomorrow to tie up loose ends.

@kura kura force-pushed the whoosh-search branch 2 times, most recently from c90524e to a195d83 Compare September 3, 2022 22:17
@kura
Copy link
Contributor Author

kura commented Sep 4, 2022

@Linbreux I am pretty happy with how this works now.

When the Flask app is started 2 things will happen;

  1. A search index will be created, overwriting the existing index. This is done because if the Markdown files were modified on disk before the Flask app was started, without a reindex the file changes would never be picked up.
  2. A watchdog thread is started that monitors the wiki directory for file changes

When creating, modifying or deleting a file with extension .md in the wiki directory via the Flask app or on the file system:

  1. The watchdog process will notice this change and update the search index, it will insert if the document is new, delete and insert if updated and delete if removed.

The search index directory lives in the wiki directory and is hidden from the results of page listing.

I have also enabled scoring and highlighting, all search results are sorted by score with the highest score first. The results returned also contain a sentence of two with the keyword highlighted. The highlighting itself comes directly from Whoosh and by default makes the highlight bold, this is probably configurable but I kind of like it as it is.
Click me to enlarge

I still need to write some tests for this.

 * Moved _searchindex to wiki directory
 * Hide searchindex from list pages
 * Added highlights to search results
For some reason search is also broken...
Removed all index operations from Flask HTTP methods and do them via
watchdog instead. This was done because when modifying a file via the
web UI an indexing operation would be triggered by the web UI and then
also very soon after by watchdog because it monitors for file changes.
@Linbreux
Copy link
Owner

Linbreux commented Sep 5, 2022

Oh alright perfect. I thought it kept some cache in the folder or something :)

@kura
Copy link
Contributor Author

kura commented Sep 5, 2022

There is one bug I still need to figure out a solution to, it involves the git_manager and _searchindex. Git tries to commit the search index and it breaks things, maybe the _searchindex should live outside of the Markdown folder like caching does?

@kura
Copy link
Contributor Author

kura commented Sep 5, 2022

I see passing tests! 🚀

@Linbreux
Copy link
Owner

Linbreux commented Sep 5, 2022

Yes I guess we could put it in the same location as the cache folder!

@kura
Copy link
Contributor Author

kura commented Sep 5, 2022

@Linbreux I just submitted a rather large set of changes. When the path was changed to just be a relative path it broke edits and deletes, edit causing duplication in results and delete deleting everything. I have fixed this now by adding an additional ID on the records: filename. This will have the added benefit of making files with the same name in different directories unique.

I have switched the writer used for indexing and deleting to an asynchronous writer because I was running in to lock contention when edit a file (delete + index).

I have entirely rewritten the watchdog so that it is easier to test, it now has a start and stop function that can be called from within Python to trigger the background indexer.

* Search indexing and deleting was broken when the path was changed, I've now added in a filename that is used alongside the path
* Changed to using an AsyncWriter because there was lock contention
* Rewritten the watchdog to be more amenable to unit tests
@Linbreux
Copy link
Owner

Linbreux commented Sep 6, 2022

@kura Prefect, I noticed that the index does not update when removing a file. This was the only bug I could find for know.

@kura
Copy link
Contributor Author

kura commented Sep 6, 2022

@Linbreux you are right, I broke it when rewriting the watchdog code. I'll fix it now.

@Linbreux
Copy link
Owner

Linbreux commented Sep 6, 2022

@kura perfect! Do you have some things you'd like to test or are you ready to merge?

@kura
Copy link
Contributor Author

kura commented Sep 6, 2022

@Linbreux I would like to write some tests around the search class and watchdog. Give me 24 hours or so to get a few tests in if you wouldn't mind.

@Linbreux
Copy link
Owner

Linbreux commented Sep 6, 2022

@kura No perfect, the more tests the better! Just let me know when you'd like to merge

@kura
Copy link
Contributor Author

kura commented Sep 6, 2022

@Linbreux I have added a set of tests for Search and Watchdog.

I have also renamed the cache_dir to /dev/shm/wikd/cache so that I could also add /dev/shm/wikmd/searchindex to separate the searchindex from the content so that git didn't try to add them to the local repo.

I'm pretty happy with this now, if you want to give it a once over and re-test.

…d it was causing problems on my Docker container so I moved search to /dev/shm.

Changes:
* Move cachedir to `wikmd/cache`
* Move searchdir to `wikmd/searchindex`
* Exclude all non-.md files from setup_search
@kura
Copy link
Contributor Author

kura commented Sep 6, 2022

I'd suggest when you're ready to merge that you also use the squash + merge function of GitHub rather than straight up merging. This is a ton of commits and it'll pollute the history so I'd suggest just squash to 1 commit and merging.

Alternatively when you're happy you can prod me and I'll squash locally and force push a squashed commit you can merge instead.

@kura kura marked this pull request as ready for review September 6, 2022 19:17
@Linbreux
Copy link
Owner

Linbreux commented Sep 6, 2022

Alright, this looks very good! I'll squash them ;) Thanks again for the work you put into it.

@Linbreux Linbreux merged commit 69b78e0 into Linbreux:main Sep 6, 2022
@kura
Copy link
Contributor Author

kura commented Sep 6, 2022

@Linbreux commented on 6 Sept 2022, 21:06 UTC:

Alright, this looks very good! I'll squash them ;) Thanks again for the work you put into it.

Thanks. And no worries, I've been doing other non-programming stuff at work recently and the lack of coding was driving me a little crazy so I needed to find an outlet. :D

@Linbreux
Copy link
Owner

Linbreux commented Sep 6, 2022

@kura Haha I get it. Always welcome to make further adjustments on wikmd!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants