[FEATURE] Search articles #41

Closed
BrightDV opened this issue Dec 18, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@BrightDV
Owner

Is your feature request related to a problem? Please describe.
/

Describe the solution you'd like
A search functionality, at least for articles.

Describe alternatives you've considered
/

Additional context
#26

BrightDV added the enhancement (New feature or request) label Dec 18, 2022
@BrightDV
Owner Author

For now, the app searches for articles using SearXNG instances. The selected instances can return results in JSON format, which avoids scraping. However, the requests are often blocked by rate limits.
The search restricts results with a query of the form "formula1.com/en/latest/article" $query, so the URL of each result must contain the string between the double quotes. As a result, only articles are returned.
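
As an illustration only (not the app's actual code), here is a minimal Python sketch of how such a query could be built and how the JSON results could be filtered. It assumes the standard SearXNG `/search` endpoint with `format=json` and a response containing a `results` list whose entries have a `url` field. The instance URL and the function names are hypothetical.

```python
from urllib.parse import urlencode

# URL fragment every article result must contain (taken from the comment above).
ARTICLE_PREFIX = "formula1.com/en/latest/article"


def build_search_url(instance: str, query: str) -> str:
    """Build a SearXNG JSON search URL that restricts results to articles."""
    # The quoted prefix asks the engines for an exact match on that string.
    q = f'"{ARTICLE_PREFIX}" {query}'
    params = urlencode({"q": q, "format": "json"})
    return f"{instance.rstrip('/')}/search?{params}"


def keep_articles(payload: dict) -> list:
    """Keep only the results whose URL contains the article prefix."""
    return [
        result
        for result in payload.get("results", [])
        if ARTICLE_PREFIX in result.get("url", "")
    ]


# Hypothetical instance, used only to show the shape of the URL.
url = build_search_url("https://searx.example.org/", "verstappen")
```

Filtering again on the client side is a safety net: some engines treat the quoted string as a hint rather than a strict requirement.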

One workaround is to use the Formula 1 RSS feed and search within it. So far, I haven't found a way to get more than 22 articles from it. Furthermore, I don't think fetching 1000 articles and then searching among them is a good solution, as it would use a lot of bandwidth and be very slow.
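
For reference, a rough Python sketch of what searching inside an RSS feed could look like. It assumes a standard RSS 2.0 layout (`item` elements with `title` and `link` children); the sample feed and the function name are made up for the example, and the real feed may be structured differently.

```python
import xml.etree.ElementTree as ET


def search_feed(rss_xml: str, query: str) -> list:
    """Return (title, link) pairs for feed items whose title contains the query."""
    root = ET.fromstring(rss_xml)
    needle = query.casefold()
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        # Case-insensitive substring match on the title only.
        if needle in title.casefold():
            hits.append((title, link))
    return hits


# Made-up sample feed, standing in for the downloaded RSS document.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Hamilton wins in Britain</title><link>https://example.org/a</link></item>
<item><title>Tyre choices explained</title><link>https://example.org/b</link></item>
</channel></rss>"""

hits = search_feed(SAMPLE_FEED, "hamilton")
```

This only ever sees the items present in the feed, which is why the 22-article limit makes the approach impractical as a full search.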

@sinfullad

For now, the app searches for articles using SearXNG instances. The selected instances can return results in JSON format, which avoids scraping.

Sorry for the dumb question, but what do you mean by avoid scraping in this context?

Also, in the worst-case scenario of all the selected instances going down, are there any search engines you plan to use as a fallback, or will you use other instances? So far I have found MetaGer (a metasearch engine similar to SearX), Mojeek (UK, uses its own crawler), and Swisscows (data center in Switzerland, uses Bing Search and Bing Ads, though it uses its own index for Germany) to be viable options as well.

@BrightDV
Owner Author

Sorry for the dumb question, but what do you mean by avoid scraping in this context?

I don't like fetching a page and then extracting the content from it, but I will check whether the rate limits still apply in that case. If they don't, I will add scraping as a fallback when the first method returns no results.
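
To make "fetching a page and then extracting the content" concrete, here is a small Python sketch using only the standard library. It does not rely on the markup of any particular SearXNG theme: it simply collects every link whose `href` contains the article prefix. The class and function names are hypothetical, and a real implementation would probably want titles as well.

```python
from html.parser import HTMLParser

# URL fragment that identifies an article link.
ARTICLE_PREFIX = "formula1.com/en/latest/article"


class ArticleLinkParser(HTMLParser):
    """Collect unique hrefs of <a> tags that point to articles."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get("href") or ""
        if ARTICLE_PREFIX in href and href not in self.links:
            self.links.append(href)


def extract_article_links(html: str) -> list:
    """Scrape article links out of a results page given as an HTML string."""
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.links
```

Compared with the JSON format, this is more fragile because it depends on the page's HTML rather than on a documented response structure.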

Also, in the worst-case scenario of all the selected instances going down, are there any search engines you plan to use as a fallback, or will you use other instances? So far I have found MetaGer (a metasearch engine similar to SearX), Mojeek (UK, uses its own crawler), and Swisscows (data center in Switzerland, uses Bing Search and Bing Ads, though it uses its own index for Germany) to be viable options as well.

Thanks for these suggestions! However, I chose SearXNG because its backend is open source, even though these alternatives are also designed to be private.
With scraping, up to 106 instances are available, so I am going to try this approach.

@BrightDV
Owner Author

The good news is that requesting the page in HTML format is not rate limited, so it will work better.
I have implemented basic scraping as a fallback for when the five previous requests fail, but I will improve it later.
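
The fallback order described above could be sketched in Python roughly as follows. This is a simplified illustration under several assumptions, not the code shipped in the app: the two fetch functions are hypothetical callables passed in by the caller, they return `None` or an empty list when a request is blocked or finds nothing, and the scraping step is assumed to walk the same instance list.

```python
from typing import Callable, Iterable, List, Optional

# A fetcher takes (instance, query) and returns result URLs, or None on failure.
Fetcher = Callable[[str, str], Optional[List[str]]]


def search_with_fallback(
    instances: Iterable[str],
    query: str,
    fetch_json: Fetcher,
    fetch_html: Fetcher,
    json_attempts: int = 5,
) -> List[str]:
    """Try the JSON format on a few instances, then fall back to scraping."""
    pool = list(instances)

    # First method: JSON requests, limited to `json_attempts` instances.
    for instance in pool[:json_attempts]:
        results = fetch_json(instance, query)
        if results:
            return results

    # Fallback: request the HTML page and scrape it.
    for instance in pool:
        results = fetch_html(instance, query)
        if results:
            return results

    return []
```

Passing the fetchers as arguments keeps the retry logic separate from the networking code, so each part can be tested without hitting a real instance.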

@BrightDV
Owner Author

Added in latest release (v0.4.0).
