
Save the searches #2786

Merged
merged 2 commits into from
Jan 23, 2024
Conversation

heckj
Sponsor Collaborator

@heckj heckj commented Dec 14, 2023

Kind of like save the 🐋 ...

I meant to get this set up and iterating on it with you weeks ago, but it's here now. This is a first cut at a dead-simple (brute force) means to save search results as they're requested, with the idea to collect them over an extended period of time and do a bit of analysis.

It adds a new dependency, SPISearchResult, which I lovingly crafted as a sort of DTO object structure to convert the [Search.Result] in the current source into something I could stash, re-hydrate, and then build into reviews. The file this generates is meant to be imported pretty directly by SPISearch. (SPISearch has also been gutted and rebuilt to focus on collections up front rather than individual searches, and it's up and available on TestFlight again - although I've got a lot of macOS issues, so I'd recommend only fiddling with the iOS version at the moment; I've screwed up its in-app navigation.) But back to this PR.

I ran into a few "Huh, I wonder what they'd prefer" questions while smacking this thing together:
1. Would you prefer that I write strings to a more standard Logger so you could route and track them in some way?
2. I'm routing around the internal FileManager proxy and using Foundation.FileManager directly - would you prefer I extend that proxy and use it?
3. I'm using a stock JSONEncoder - would you prefer another (there's a speed/performance impact, I guess)?
4. I'm running this all as a static func on an object so that nothing gets initialized up front - but it's easy to change this and stash the initialization pieces (such as FileManager and JSONEncoder); if so, would you prefer me to stash the relevant logger instance in the Search Controller, or maybe the App Environment setup?
5. In local dev usage, it'll write to your home directory; on Docker, to /var/log/searches.txt. Would you like that to be configurable from an environment variable or such?

I've verified that "it works" as in it's functional, and that it wasn't noticeably slower, but didn't formally benchmark anything here. I wasn't even sure what the current number of searches per hour/day/month were to know how much effort I should expend in streamlining this, but I'm more than happy to tweak and twiddle to match it up to anything you'd prefer.

I'll be traveling (yes, again!) between Dec 18th and Jan 15th - but I'll do my best to iterate and make this happy, or we can back-burner this effort until I'm back to iterate if deeper changes are needed.

@heckj heckj self-assigned this Dec 14, 2023
@cla-bot cla-bot bot added the cla-signed label Dec 14, 2023
@finestructure
Member

Hey Joe, thanks for raising the PR! Writing this out to a file is problematic for a couple of reasons:

  • we run a 3 node docker swarm in prod and you'd be writing files in each container that we'd have to gather and merge
  • unless you perform logic to save away the file to some persistent store, each deployment is going to reset it
  • files in containers are awkward to get to and we can't really open up prod access to get them

The way I was thinking we'd deal with this is to use our existing logging mechanism to capture these queries. We're using promtail to gather the logs from all nodes and Loki + Grafana as a query/display interface:

[Screenshot: Loki/Grafana log query interface - CleanShot 2023-12-15 at 10 10 47@2x]

We should be able to log JSON and export it with its structure intact (or if that's hard I'm sure we can come up with some workable format). The only problem I see with that is the size of the structure. It seems to be capturing the full search result. I'm not sure what the limits are and at what point we'd hit practical limits but I think we could work something out where perhaps the result gets unrolled across the packages, with a search id to be able to stitch them back together.

It might be worth setting up the whole thing with a very simple payload first and to test the process end-to-end. I think it could be as simple as

struct SearchResult {
  // don't need any timestamps as the logging will provide them
  let id = UUID()
  var query: String
  var packages: [String]
}

let res = SearchResult(query: ..., packages: ...)
logger.info("SearchResult: \(res)")  // We'd want this to be JSON and not CustomStringConvertible,
    // but you know what I mean - this is where we need to figure out how to log JSON and retrieve it
    // again in a workable way. In fact it might be easiest to just invent a little parsable format of our
    // own here.

where packages: [String] is [CanonicalPackageURL].prefix(5).map(\.path) to give us readable unique package ids for the top 5 (maybe 10?) search results. (Of course we could also just use Package.Ids.)
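A minimal self-contained sketch of that compact payload, made Codable so it logs as a single parsable JSON line (type and field names here are illustrative, not the project's actual API):

```swift
import Foundation

// Minimal sketch of the compact search-log payload idea above.
// Names are illustrative; the real project types differ.
struct LoggedSearch: Codable {
    let id: UUID
    let query: String
    let packages: [String]  // e.g. "owner/repo" paths for the top results
}

let record = LoggedSearch(
    id: UUID(),
    query: "crdt",
    packages: ["yorkie-team/yorkie-ios-sdk", "y-crdt/yswift"]
)

let encoder = JSONEncoder()
encoder.outputFormatting = [.sortedKeys]  // stable key order makes parsing/diffing easier
let json = String(decoding: try! encoder.encode(record), as: UTF8.self)
// In the server this would go through logger.info(...)
print("SearchResult: \(json)")
```

One JSON object per line keeps each search a single log record, which is the property the rest of the discussion relies on.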

If this is problematic with the logging we could try

struct SearchResult {
  let id = UUID()
  var query: String

  struct Item {
    var id: UUID
    var index: Int
    var package: String
  }
}

let res = SearchResult(query: ...)
logger.info("SearchResult: \(res)")
for (idx, p) in packages.enumerated() {
  let item = SearchResult.Item(id: res.id, index: idx, package: p)
  logger.info("SearchResult.Item: \(item)")
}

Does that make sense?

For reference, this is a JSON export of the log query above:

Explore-logs-2023-12-15 10_36_11.json.zip

We might not be able to attach JSON as JSON to this log but as I said, we can probably come up with some parsable format of our own and it should be easy to process from there.

@daveverwer
Member

Happy to see this, Joe! Thanks for working on it

I'd recommend only fiddling with the iOS version at the moment. I've screwed up its in-app navigation.

I saw the TestFlight notification yesterday and downloaded the app, but on macOS, and couldn't get very far with it! Now I understand 😂

@heckj
Sponsor Collaborator Author

heckj commented Dec 15, 2023

Ah, thank you @finestructure - I'm greatly relieved. I thought you'd suggested writing to a file directly and was worried about the logistics of collecting those things.

Dropping it all through promtail sounds fine to me conceptually, although I'm worried about the potential volume. I'm just not sure how much search traffic you're getting, to know whether this will be slamming your system or not.

I do want to capture the whole search result - there are a couple of bits that are fairly variable (last activity date and stars) that I was hoping to catch from the current state instead of trying to regurgitate them later from a secondary query. If it makes more sense to drop these into an extended log and re-assemble later, I'm more than happy to write a parser/scraper that will iterate through a text file and re-assemble the bits. The detail I want to capture in addition to what's IN the search is the timestamp of the search request; and if we log by package, then I'd like to add an identifier to each search so that if two (or more) searches are happening at the same time, we have a consistent way of re-assembling these things.

I'll do a little digging today to see what the limit is on logging payloads with promtail - I'm sure we're abusing the system, potentially more than it wants to be abused. ;-) If we can't fit a whole search result into a single line (I did intentionally try to keep the JSON encoding tight) then dropping back to iterating over the [Search.Result] and just adding enough detail to uniquely identify it and the query is probably going to make the most sense, with a plan to parse it/re-assemble it later.

Oh - and I've found the worst of the issues with the SPISearch app and reset its navigation and views just slightly (esp. on macOS), and pushed that to TestFlight yesterday evening. So it's still butt-ugly, but at least you can navigate around if you have a starting file to import (there's one in the repo if you want to fiddle - a raw dump file and a searchrank file). Turns out the latest SwiftUI+macOS will happily explode the window size beyond whatever screen real estate the device has - so what I thought was blanking was actually a window expanding to huge proportions - but also the selection idioms with the now-deprecated NavigationView were pretty whack...

Oh - found the limits. Apparently the latest Docker (Moby) splits and limits log lines longer than 16K (ref: grafana/loki#2281, grafana/loki#4392), while Grafana reports that the max log line it handles is significantly larger: 2MB compressed, with a compressed block size of 4096 bytes. I'll take a whack at rebuilding this to split things down. I don't know what the max search size is, but so far it's been pretty safely under 16K - better safe than not.
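A trivial sketch of the guard this implies: check the encoded payload against the ~16 KiB per-line split threshold before emitting it as a single log line (the constant is an assumption taken from the Docker/Loki references above):

```swift
import Foundation

// Sketch: decide whether an encoded payload fits in one log line,
// using the ~16 KiB Docker/Moby split threshold discussed above.
// The threshold value is an assumption from this thread's references.
let maxLogLineBytes = 16 * 1024

func fitsInOneLogLine(_ payload: Data) -> Bool {
    payload.count <= maxLogLineBytes
}

let small = Data("{\"query\":\"crdt\"}".utf8)
let large = Data(repeating: UInt8(ascii: "x"), count: 20_000)

print(fitsInOneLogLine(small))  // true
print(fitsInOneLogLine(large))  // false - split into per-result fragments instead
```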

@heckj
Sponsor Collaborator Author

heckj commented Dec 15, 2023

Updated the PR to a MUCH simpler experiment setup - damn near what Sven had above, really.

The docker log output from this looks like:

[ INFO ] searchfragment: SearchFragment(searchID: E9EE96AB-B4FF-45A0-9904-1ACF2B9C2908, query: "CRDT", result: App.Search.Result.package(App.Search.PackageResult(packageId: FCC598FA-54F5-42BB-8553-10F67DC0B915, packageName: Optional("Yorkie"), packageURL: "/yorkie-team/yorkie-ios-sdk", repositoryName: "yorkie-ios-sdk", repositoryOwner: "yorkie-team", stars: Optional(13), lastActivityAt: Optional(2023-11-08 01:38:37 +0000), summary: Optional("Yorkie iOS SDK"), keywords: Optional(["crdt", "hacktoberfest", "ios", "realtime-collaboration", "swift", "yorkie"]), hasDocs: false))) [component: server]
[ INFO ] searchfragment: SearchFragment(searchID: E9EE96AB-B4FF-45A0-9904-1ACF2B9C2908, query: "CRDT", result: App.Search.Result.package(App.Search.PackageResult(packageId: 3043A497-933A-411D-A0E4-D226471AFB22, packageName: Optional("YSwift"), packageURL: "/y-crdt/yswift", repositoryName: "yswift", repositoryOwner: "y-crdt", stars: Optional(48), lastActivityAt: Optional(2023-05-17 21:38:59 +0000), summary: Optional("Swift language bindings to Y-CRDT"), keywords: Optional(["crdt", "rust", "swift", "yjs"]), hasDocs: true))) [component: server]
[ INFO ] searchfragment: SearchFragment(searchID: E9EE96AB-B4FF-45A0-9904-1ACF2B9C2908, query: "CRDT", result: App.Search.Result.package(App.Search.PackageResult(packageId: 19CC257C-1128-4D6D-9C5A-0671D95DF2D2, packageName: Optional("ReplicatingTypes"), packageURL: "/appdecentral/replicatingtypes", repositoryName: "replicatingtypes", repositoryOwner: "appdecentral", stars: Optional(91), lastActivityAt: Optional(2021-11-18 15:44:11 +0000), summary: Optional("Code for the tutorial series on Conflict-Free Replicated Data Types (CRDTs) at appdecentral.com"), keywords: Optional([]), hasDocs: false))) [component: server]

I think the logger isn't doing JSON encoding, but rather dumping a debug representation in place (hence those Optionals) - so I'm tempted to pre-encode the result and just log the resulting JSON string, but I wanted to let you look first and see if this is along the lines you were thinking.
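The pre-encoding idea could look roughly like this - encode the fragment to JSON first, then log the resulting string, so the line is machine-parsable rather than a Swift debug description full of `Optional(...)` wrappers (the types here are illustrative, not the project's actual SearchFragment):

```swift
import Foundation

// Sketch: pre-encode the fragment to JSON and log the string, instead of
// interpolating the value directly (which prints a debug description).
// Type and field names are illustrative.
struct SearchFragment: Codable {
    let searchID: UUID
    let query: String
    let packageURL: String
}

let fragment = SearchFragment(
    searchID: UUID(),
    query: "CRDT",
    packageURL: "/y-crdt/yswift"
)

let line = String(decoding: try! JSONEncoder().encode(fragment), as: UTF8.self)
// In the server: AppEnvironment.logger.info("searchfragment: \(line)")
print("searchfragment: \(line)")
```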

@heckj heckj force-pushed the savesearches branch 2 times, most recently from 7c895ae to a80ab8b Compare January 19, 2024 13:50
@heckj
Sponsor Collaborator Author

heckj commented Jan 19, 2024

I'm back from holiday, and back to poking at code. I made the updates as suggested by @finestructure - with the goal of being able to post-process those log files and pull them back together later into JSON files that I can stash and process. SPISearch won't read them directly in that raw format, but I thought getting this in place would be the first major step, with a post-processing tool to parse and collapse log files into search results something that could come after - and that I'd maybe stash in with heckj/SPISearchResult.

I've rebased against the latest master, and run the code locally to verify it's happy and functioning, so when you have a good time to consider it, I'd love another look-through

Package.swift (outdated review thread, resolved)
AppEnvironment.logger.info("searchresult: \(stringdata)")
} catch {
AppEnvironment.logger.warning("unable to encode search fragment: \(error)")
}
Member


So I've run this locally with your branch and I'm getting the following when I log this:

[ INFO ] GET /search [component: server, request-id: 5AB91460-7BB6-4876-9106-9BC04769ADF1]
[ INFO ] search: 65 bytes [component: server]
[ INFO ] searchresult: 90 bytes [component: server]
[ INFO ] searchresult: 91 bytes [component: server]
[ INFO ] searchresult: 94 bytes [component: server]
[ INFO ] searchresult: 96 bytes [component: server]
[ INFO ] searchresult: 422 bytes [component: server]
[ INFO ] searchresult: 436 bytes [component: server]
[ INFO ] searchresult: 416 bytes [component: server]
[ INFO ] searchresult: 437 bytes [component: server]
[ INFO ] searchresult: 451 bytes [component: server]
[ INFO ] searchresult: 407 bytes [component: server]

which is expected I think, because the return type is Data. We'd need to wrap that with String(decoding:as:) in lines 41 and 51:

            let json = String(decoding: try jsonEncoder.encode(baseQuery), as: UTF8.self)
            AppEnvironment.logger.info("search: \(json)")

That then gives us:

[ INFO ] search: {"query":"foo","searchID":"DD410B89-DF59-437F-9D41-65C06ED39E7E"} [component: server]
[ INFO ] searchresult: {"id":"DD410B89-DF59-437F-9D41-65C06ED39E7E","r":{"author":{"_0":{"name":"adamfootdev"}}}} [component: server]
[ INFO ] searchresult: {"id":"DD410B89-DF59-437F-9D41-65C06ED39E7E","r":{"author":{"_0":{"name":"onefootprint"}}}} [component: server]
[ INFO ] searchresult: {"id":"DD410B89-DF59-437F-9D41-65C06ED39E7E","r":{"author":{"_0":{"name":"autoreleasefool"}}}} [component: server]
[ INFO ] searchresult: {"id":"DD410B89-DF59-437F-9D41-65C06ED39E7E","r":{"keyword":{"_0":{"keyword":"food-ordering"}}}} [component: server]
[ INFO ] searchresult: {"id":"DD410B89-DF59-437F-9D41-65C06ED39E7E","r":{"package":{"_0":{"hasDocs":false,"keywords":["ios","swiftui","xcode"],"lastActivityAt":"2024-01-11T12:05:14Z","packageId":"BF917587-94B7-4E4E-933D-DBAF1AD99AF7","packageName":"AboutKit","packageURL":"\/adamfootdev\/AboutKit","repositoryName":"AboutKit","repositoryOwner":"adamfootdev","stars":98,"summary":"Add an about screen to your app in just a few lines of code."}}}} [component: server]

which looks better but is perhaps not ideal to parse.

I think we'll want to drop the non-package results since they don't really take part in normal ordering:

for result in results.filter(\.isPackage)

in line 47. Then we get

[ INFO ] GET /search [component: server, request-id: 33710481-DE9C-4A78-81B4-13E467D10CAA]
[ INFO ] search: {"query":"foo","searchID":"6D35A7E2-29E1-4AEC-A4F6-4A224160AFB2"} [component: server]
[ INFO ] searchresult: {"id":"6D35A7E2-29E1-4AEC-A4F6-4A224160AFB2","r":{"package":{"_0":{"hasDocs":false,"keywords":["ios","swiftui","xcode"],"lastActivityAt":"2024-01-11T12:05:14Z","packageId":"BF917587-94B7-4E4E-933D-DBAF1AD99AF7","packageName":"AboutKit","packageURL":"\/adamfootdev\/AboutKit","repositoryName":"AboutKit","repositoryOwner":"adamfootdev","stars":98,"summary":"Add an about screen to your app in just a few lines of code."}}}} [component: server]

That's better but might still be too awkward to parse. We could also encode this as CSV:

  • searchID, query for the SearchQuery
  • searchID, index, packageID, + whatever else is needed for the fragments. NB: we can easily query package data via the API, so the packageID itself is enough if we don't want to log lots of detail - but logging the details might be easier if they're needed.
[ INFO ] search: 6D35A7E2-29E1-4AEC-A4F6-4A224160AFB2,foo [component: server]
[ INFO ] searchresult: 6D35A7E2-29E1-4AEC-A4F6-4A224160AFB2,0,BF917587-94B7-4E4E-933D-DBAF1AD99AF7 [component: server]

I'd imagine this would be easier to parse when we grab the logs. That said, the prefix and suffix should be static, so stripping them reliably and then decoding the JSON should also be feasible.
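For illustration, stripping the static prefix/suffix and decoding could look like this (the wrapper type is a placeholder; the log format is taken from the samples above):

```swift
import Foundation

// Sketch: recover the JSON payload from a captured log line by stripping
// the static prefix and suffix seen in the samples above, then decoding.
// The wrapper type is illustrative, not the project's actual type.
struct LoggedFragment: Codable {
    let id: UUID
}

let logLine = #"[ INFO ] searchresult: {"id":"6D35A7E2-29E1-4AEC-A4F6-4A224160AFB2"} [component: server]"#
let prefix = "[ INFO ] searchresult: "
let suffix = " [component: server]"

precondition(logLine.hasPrefix(prefix) && logLine.hasSuffix(suffix))
let payload = String(logLine.dropFirst(prefix.count).dropLast(suffix.count))
let fragment = try! JSONDecoder().decode(LoggedFragment.self, from: Data(payload.utf8))
print(fragment.id)
```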

What's definitely missing if we don't use my proposed CSV encoding is the record index. We can probably rely on the log ordering, but it'd be safer to embed it in addition:

        for (idx, result) in results.filter(\.isPackage).enumerated() {
            let fragment = SearchResultFragment(searchID: uniqueSearchID, result: result)
            do {
                let json = String(decoding: try jsonEncoder.encode(fragment), as: UTF8.self)
                AppEnvironment.logger.info("searchresult[\(idx)]: \(json)")
            } catch {

yielding:

[ INFO ] GET /search [component: server, request-id: 7B7EA929-F2D0-45F6-9430-900DF7E9AB6A]
[ INFO ] search: {"query":"foo","searchID":"2B9FD884-B20B-4319-AB41-08DDBD54A446"} [component: server]
[ INFO ] searchresult[0]: {"id":"2B9FD884-B20B-4319-AB41-08DDBD54A446","r":{"package":{"_0":{"hasDocs":false,"keywords":["ios","swiftui","xcode"],"lastActivityAt":"2024-01-11T12:05:14Z","packageId":"BF917587-94B7-4E4E-933D-DBAF1AD99AF7","packageName":"AboutKit","packageURL":"\/adamfootdev\/AboutKit","repositoryName":"AboutKit","repositoryOwner":"adamfootdev","stars":98,"summary":"Add an about screen to your app in just a few lines of code."}}}} [component: server]
[ INFO ] searchresult[1]: {"id":"2B9FD884-B20B-4319-AB41-08DDBD54A446","r":{"package":{"_0":{"hasDocs":false,"keywords":["parser-combinators","swift","swift-library"],"lastActivityAt":"2019-04-07T21:44:16Z","packageId":"4B10B099-BC77-4687-8B47-ECF0D3A6C8B1","packageName":"FootlessParser","packageURL":"\/kareman\/FootlessParser","repositoryName":"FootlessParser","repositoryOwner":"kareman","stars":63,"summary":"A simple parser combinator written in Swift"}}}} [component: server]

I'd say let's try this in our dev environment with the JSON format and test the log-grabbing and decoding process there. Then we can see if it works well enough or if we want it as CSV.

To gate it:

    static func log(query: String, results: [Search.Result]) {
        guard Current.environment() != .production else { return }

I'll see if I can sprinkle these individual snippets in as suggestions as well.

Does that all make sense?

Sponsor Collaborator Author


Oooh - adding the index on the result to preserve its position information is a good idea - hopefully it'll be redundant in practice. I dropped the filter part, mostly because I do want to keep the keywords and author bits that are returned. What I want to assemble in the end (for SPISearch) is the equivalent of what the user sees: the search query and the totality of the results, to provide all the same contextual clues as to why a result was relevant (or not). That may be overkill, but I think there's value there, especially with evaluation being so essentially subjective.

And yeah, I know it'll be an interesting "joy" to reconstitute from the logs, but I think all the critical bits are there to enable it, so that we can walk away with a set/collection of search results (post-processed from the logs) that we can load up into SPISearch, evaluate, share, and run analysis on over time - sort of the big end goal for me.
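The reconstitution step described above boils down to grouping decoded fragments by their searchID and restoring the logged order - roughly like this (types are illustrative placeholders):

```swift
import Foundation

// Sketch of the log post-processing described above: group decoded
// fragments by searchID and sort by the embedded index, so each
// search's full result list can be reassembled. Types are illustrative.
struct Fragment {
    let searchID: UUID
    let index: Int
    let package: String
}

let searchA = UUID()
let searchB = UUID()
let fragments = [
    Fragment(searchID: searchA, index: 1, package: "kareman/FootlessParser"),
    Fragment(searchID: searchB, index: 0, package: "y-crdt/yswift"),
    Fragment(searchID: searchA, index: 0, package: "adamfootdev/AboutKit"),
]

// Group by searchID, then restore the logged ordering via the index.
let grouped = Dictionary(grouping: fragments, by: \.searchID)
    .mapValues { $0.sorted { $0.index < $1.index }.map(\.package) }

print(grouped[searchA]!)  // ["adamfootdev/AboutKit", "kareman/FootlessParser"]
```

This keeps interleaved concurrent searches separable, which is exactly why the per-search identifier was added.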

@finestructure
Member

Thanks for following up with this, Joe! I've added a comment that collectively talks about a few suggestions, which I hope all make sense together.

- to be captured and processed externally from the main processing of
  SPI
@finestructure
Member

I'll push this branch into the SPI org so the CI tests can run.

@finestructure finestructure merged commit 182f952 into SwiftPackageIndex:main Jan 23, 2024
5 checks passed
@heckj heckj deleted the savesearches branch January 23, 2024 18:08