
reCAPTCHA proxy #103

Closed
Extravi opened this issue Dec 26, 2023 · 50 comments

Comments

@Extravi (Owner) commented Dec 26, 2023:

I'm working on a system that lets users interact with reCAPTCHA. Whenever Araa gets rate-limited, it will load a web driver to proxy the captcha, allowing users to interact with it. If the user successfully completes the captcha, the web driver will capture the "GOOGLE_ABUSE_EXEMPTION=ID" cookie and send it in the request header to Google using makeHTMLRequest. Neither SearXNG nor other projects do this, so this will be the first.

[screenshots]
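The cookie hand-off described above could be sketched as follows. This is a hedged illustration, not Araa's actual code: the helper name is hypothetical, and the web-driver step is only described in comments.

```python
# Hypothetical sketch of the hand-off, not Araa's actual implementation.
# 1. A web-driver session serves the captcha to the user.
# 2. On success, the driver reads the GOOGLE_ABUSE_EXEMPTION cookie,
#    e.g. driver.get_cookie("GOOGLE_ABUSE_EXEMPTION")["value"].
# 3. That value is attached to subsequent scraping requests so Google
#    treats them as exempt.

def exemption_headers(exemption_id: str) -> dict:
    """Build the request headers carrying the abuse-exemption cookie."""
    return {"Cookie": f"GOOGLE_ABUSE_EXEMPTION={exemption_id}"}

# Usage: hand these headers to whatever issues the Google request,
# e.g. makeHTMLRequest(url, headers=exemption_headers(captured_id)).
```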

@Extravi (Owner) commented Dec 27, 2023:

I might have to do something like this and update the image in real time, because it's not simple to scrape.

[screenshot]

@Extravi (Owner) commented Dec 27, 2023:

[screenshot]

@Extravi (Owner) commented Dec 27, 2023:

It will first display that to the user, because each user needs their own session to prevent more than one user doing the same captcha at once.

@amogusussy (Contributor):

I think a good solution could be having a backup search engine.
When the instance gets rate limited, it should just switch to another engine, like Qwant, then wait ~30 minutes and retry Google. Then repeat until Google stops rate limiting.
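The suggested fallback-with-cooldown could be sketched roughly like this (a hedged sketch: the engine names and 30-minute window are illustrative, not Araa's actual configuration):

```python
import time
from typing import Optional

# Sketch of the fallback suggestion: use Google normally, switch to a
# backup engine while rate limited, retry Google after a cooldown.
class EngineFallback:
    def __init__(self, cooldown: float = 30 * 60):
        self.cooldown = cooldown
        self.rate_limited_at: Optional[float] = None

    def mark_rate_limited(self, now: Optional[float] = None) -> None:
        """Record the moment Google started rate limiting us."""
        self.rate_limited_at = time.time() if now is None else now

    def current_engine(self, now: Optional[float] = None) -> str:
        """Pick the engine to use for the next query."""
        now = time.time() if now is None else now
        if self.rate_limited_at is None:
            return "google"
        if now - self.rate_limited_at >= self.cooldown:
            self.rate_limited_at = None  # cooldown over: retry Google
            return "google"
        return "qwant"
```

The `now` parameter exists so the cooldown logic can be exercised without actually waiting 30 minutes.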

@Extravi (Owner) commented Dec 27, 2023:

> I think a good solution could be having a backup search engine.
> When the instance gets rate limited, it should just switch to another engine, like Qwant, then wait ~30 minutes and retry Google. Then repeat until Google stops rate limiting.

I already wrote most of the code for the captcha proxy.

@Extravi (Owner) commented Dec 27, 2023:

Also, look at LibreY and its fallback system; it's not necessarily clean or well done.

@Extravi (Owner) commented Dec 27, 2023:

I will likely add support for other engines at some point, but the user should be able to use Google if they want, rate limited or not.

@Extravi (Owner) commented Dec 27, 2023:

That's why I'm working on the proxy.

@Extravi (Owner) commented Dec 27, 2023:

@amogusussy I found a better way to proxy the captcha: using sessions and an iframe, with the server sending data like the sitekey and s-data to the iframe.
[screenshot]
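The server-to-iframe hand-off might look roughly like this. It is only a sketch: the `/captcha` endpoint path and parameter names are assumptions, chosen to mirror reCAPTCHA's sitekey and s-data fields.

```python
from urllib.parse import urlencode

# Sketch: render the per-session iframe markup that hosts the proxied
# captcha. The server injects the sitekey ("k"), s-data ("s"), and the
# user's session id so each challenge stays bound to one session.
def captcha_iframe(sitekey: str, s_data: str, session_id: str) -> str:
    query = urlencode({"k": sitekey, "s": s_data, "session": session_id})
    return f'<iframe src="/captcha?{query}" title="captcha"></iframe>'
```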

@amogusussy (Contributor):

Have you tried that with a different device? If you send an iframe to the user's device, whatever site it's loading will just think it's a request from the user, so it won't be rate limited.
This will probably only seem like it's working, since you're testing it on the device that's rate limited. A non-rate-limited user will just get sent the normal page.

@Extravi (Owner) commented Dec 28, 2023:

> Have you tried that with a different device? If you send an iframe to the user's device, whatever site it's loading will just think it's a request from the user, so it won't be rate limited. This will probably only seem like it's working, since you're testing it on the device that's rate limited. A non-rate-limited user will just get sent the normal page.

Yes, I'm aware of the normal-page issue; I've been testing for hours. It'll be fine once it's out/done.

@Extravi (Owner) commented Dec 28, 2023:

It's going to need an entire local proxy server for this to work. I found https://mitmproxy.org/, but do you know any better HTTP proxies?

@Extravi (Owner) commented Dec 28, 2023:

reCAPTCHA needs to be done using the server's IPs, so I need to proxy everything to the end user.
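The relay idea in one hedged sketch: fetch captcha resources server-side so the request originates from the server's IP, then hand the body to the end user. A real deployment would use a proper proxy (e.g. mitmproxy) and handle cookies, streaming, and response headers; this only shows the core fetch-and-forward step.

```python
from urllib.request import Request, urlopen
from typing import Optional

# Sketch: the server fetches the resource itself (so Google sees the
# server's IP) and returns the raw body to be re-served to the user.
def relay(url: str, headers: Optional[dict] = None) -> bytes:
    req = Request(url, headers=headers or {})
    with urlopen(req, timeout=10) as resp:
        return resp.read()
```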

@amogusussy (Contributor):

I've found this list of alternatives for Linux, but I don't really know what makes a proxy better/worse.

@Extravi (Owner) commented Dec 30, 2023:

I think I should use a paid captcha-solving service.

@Extravi (Owner) commented Dec 30, 2023:

Because it's not necessarily practical to proxy it to my users, or even possible.

@Extravi (Owner) commented Dec 30, 2023:

I'm going to drop this for now and add support for a different engine as a backup.

Extravi closed this as completed Dec 30, 2023
@Extravi (Owner) commented Dec 30, 2023:

Any ideas for what engine I should use for the backup?

@Extravi (Owner) commented Dec 30, 2023:

Also, I will be implementing the backup engine, so there is a template to build off of.

@Extravi (Owner) commented Dec 30, 2023:

I want everything to look like it belongs, unlike LibreY and its broken system.

@Extravi (Owner) commented Dec 30, 2023:

A captcha proxy costs less than the Google Search API.

@amogusussy (Contributor):

Qwant has a free API. The only problem is that it doesn't show the Wikipedia results in the API, so you'll have to scrape that yourself. There's also DuckDuckGo, Startpage, Yahoo, and Brave.
If you need help with scraping them, you can have a look at SearXNG's source code, since all the engines I mentioned are included in SearXNG.

I think we should standardize the results into one dict/json object, like what's been done with the torrent results. If we do that, it'll be 10x easier to add new engines, and maybe even give the user the ability to choose what engines they want to use.
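One possible shape for that standardized result object (the field names here are assumptions, not an agreed schema): every engine's scraper maps its raw output into one dict, so the renderer and a future engine picker stay engine-agnostic.

```python
# Sketch of a normalized search-result dict shared by all engines.
def normalize_result(title: str, url: str, description: str,
                     engine: str) -> dict:
    return {
        "title": title.strip(),
        "url": url,
        "description": description.strip(),
        "engine": engine,  # which backend produced this result
    }
```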

@Extravi (Owner) commented Dec 30, 2023:

Question: what do you think of anonymous data submission of search results as an opt-in feature? It would only collect the subdomain and domain for each result for the query, but it wouldn't record the query itself. So, for example, www.youtube.com or GitHub.com, nothing after the /.
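The domain-only rule described above could be as simple as this sketch, which keeps only the hostname of each result and discards the path and query entirely:

```python
from urllib.parse import urlsplit

# Sketch: reduce a result URL to its (sub)domain only, so nothing after
# the "/" and no search query is ever recorded.
def anonymize_result_url(url: str) -> str:
    return urlsplit(url).hostname or ""
```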

@Extravi (Owner) commented Dec 30, 2023:

I would use that data to index and improve aspects of the search results and make results more visual

@Extravi (Owner) commented Dec 30, 2023:

Such as favicon indexing, etc. I might collect YouTube channel URLs too (so the / for that), but that's so I can index all channels over 10k subscribers.

@Extravi (Owner) commented Dec 30, 2023:

So I can do things like this:
[screenshot]
and this:
[screenshots]

@Extravi (Owner) commented Dec 30, 2023:

The data collection code would be open source and anonymous.

@Extravi (Owner) commented Dec 30, 2023:

And if the user wants to opt out of the setting (turned on by default in settings), they can.

@Extravi (Owner) commented Dec 30, 2023:

I want to index some stuff in Araa that each engine can use, like Qwant, Google, etc.

@Extravi (Owner) commented Dec 30, 2023:

I want to make results look more visual and modern.

@Extravi (Owner) commented Dec 30, 2023:

I want it to be on par with closed-source meta search engines, and for that to work, some data collection may be required.

@Extravi (Owner) commented Dec 30, 2023:

It's only an idea, and that doesn't mean it will happen.

@Extravi (Owner) commented Dec 30, 2023:

It's something I want to do, but if I do decide to develop it, something might change, resulting in it getting dropped.

@amogusussy (Contributor):

Something like that seems too far out of the reach of this project.
I do think a feature like that could be good, though. If there's a link within the first 3 results that links to youtube.com/c/ (to check if it's likely that the user's searching for that channel), then you could scrape Social Blade for info about the channel.
That could also expand further into other widgets, like for weather or sports results.

@Extravi (Owner) commented Dec 30, 2023:

It's not necessarily far out of reach. It's common to come across the same websites in the search results for different queries, and many people will search for or request the same websites from time to time. So after the first request, it will index the favicon, etc., and pair it with that subdomain and domain.

@Extravi (Owner) commented Dec 30, 2023:

Like Medium and other article sites are quite common, or even Stack Overflow for a coding-related query.

@Extravi (Owner) commented Dec 30, 2023:

Most people only really go to the top 1,000 or so sites, and it will naturally index information for those sites over time, along with many other sites.

@Extravi (Owner) commented Dec 30, 2023:

It may seem far out of reach, but when you really think about it and about user habits, it isn't impossible.

@Extravi (Owner) commented Dec 30, 2023:

The indexer application would have to be a separate project from this repo, and this repo would only use the data it produces from the data collected by this repository.

@Extravi (Owner) commented Dec 30, 2023:

> Something like that seems too far out of the reach of this project.
> I do think a feature like that could be good, though. If there's a link within the first 3 results that links to youtube.com/c/ (to check if it's likely that the user's searching for that channel), then you could scrape Social Blade for info about the channel.
> That could also expand further into other widgets, like for weather or sports results.

Also, due to speed, it will only show that data after it has been indexed by the other application.

@Extravi (Owner) commented Dec 30, 2023:

The indexer might be MIT-licensed or something, I'm not sure, but any data it produces likely won't be subject to the GPL.

@Extravi (Owner) commented Dec 30, 2023:

> Something like that seems too far out of the reach of this project.
> I do think a feature like that could be good, though. If there's a link within the first 3 results that links to youtube.com/c/ (to check if it's likely that the user's searching for that channel), then you could scrape Social Blade for info about the channel.
> That could also expand further into other widgets, like for weather or sports results.

Yes, I want to add things like weather, news, and sports; those are also topics I want to index.

@Extravi (Owner) commented Dec 30, 2023:

I wouldn't index text results, because they're more complicated than news or other topics/subjects.

@Extravi (Owner) commented Dec 30, 2023:

I would only get data to associate with text results

@Extravi (Owner) commented Dec 30, 2023:

This is a local autocomplete demo with some data collection. It could improve a ton, and then there would be no need to rely on DuckDuckGo, making it faster with some optimization.
https://github.com/Extravi/araa-search/assets/98912029/9227502a-7a65-43b1-8e90-4912c157a86a

@Extravi (Owner) commented Dec 30, 2023:

> This is a local autocomplete demo with some data collection. It could improve a ton, and then there would be no need to rely on DuckDuckGo, making it faster with some optimization. https://github.com/Extravi/araa-search/assets/98912029/9227502a-7a65-43b1-8e90-4912c157a86a

Read how many lines it has as of right now.

@amogusussy (Contributor):

I'm more talking about how we'd need to use things like databases for all the favicons. If you want good speeds for it, you'll need to use a dedicated database, like SQLite, rather than using Python dicts. It probably could be done, but it might take a bit of time to do it right.
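A minimal sketch of that SQLite-backed favicon store (the schema and helper names are assumptions for illustration):

```python
import sqlite3
from typing import Optional

# Sketch: a host -> favicon cache in SQLite instead of a Python dict.
def open_favicon_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS favicons (host TEXT PRIMARY KEY, icon BLOB)"
    )
    return db

def cache_favicon(db: sqlite3.Connection, host: str, icon: bytes) -> None:
    db.execute("INSERT OR REPLACE INTO favicons VALUES (?, ?)", (host, icon))
    db.commit()

def get_favicon(db: sqlite3.Connection, host: str) -> Optional[bytes]:
    row = db.execute(
        "SELECT icon FROM favicons WHERE host = ?", (host,)
    ).fetchone()
    return row[0] if row else None
```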

Do the search suggestions deal with misspelled words? If I go to DuckDuckGo and type 'liux', it gives a suggestion of 'linux', because it can guess what I was probably going for. Does this have anything similar yet?

@Extravi (Owner) commented Dec 30, 2023:

> I'm more talking about how we'd need to use things like databases for all the favicons. If you want good speeds for it, you'll need to use a dedicated database, like SQLite, rather than using Python dicts. It probably could be done, but it might take a bit of time to do it right.
>
> Do the search suggestions deal with misspelled words? If I go to DuckDuckGo and type 'liux', it gives a suggestion of 'linux', because it can guess what I was probably going for. Does this have anything similar yet?

Yes, it does check for spelling:
[screenshot]
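A misspelling-tolerant lookup like the one shown could be sketched with the standard library's difflib (the vocabulary and the 0.75 cutoff are illustrative; Araa's real suggestion list would be far larger):

```python
import difflib

# Sketch: exact prefix matches first, then fuzzy matches so that a typo
# like 'liux' still surfaces 'linux'.
VOCAB = ["linux", "library", "license", "github", "python"]

def suggest(prefix: str, limit: int = 5) -> list:
    prefix = prefix.lower()
    exact = [w for w in VOCAB if w.startswith(prefix)]
    fuzzy = difflib.get_close_matches(prefix, VOCAB, n=limit, cutoff=0.75)
    return (exact + [w for w in fuzzy if w not in exact])[:limit]
```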

@Extravi (Owner) commented Dec 30, 2023:

[screenshot]

@amogusussy (Contributor):

That looks good then. I think it should still keep DuckDuckGo by default though, unless you make a way for it to actually guess what the user's going to type, besides using a list to look it up.
15% of Google's search queries are unique, so relying on a pre-generated list of possible queries, even with every previously searched query, would still leave you with a large chunk without a good result.
