
reCAPTCHA proxy #103

Closed
Extravi opened this issue Dec 26, 2023 · 50 comments

Comments

@Extravi (Owner) commented Dec 26, 2023:

I'm working on a system that lets users interact with reCAPTCHA. Whenever Araa gets rate-limited, it will load a web driver to proxy the captcha, allowing users to interact with it. If the user successfully completes the captcha, the web driver will capture the "GOOGLE_ABUSE_EXEMPTION=ID" cookie and send it in the request header to Google using makeHTMLRequest. Neither SearXNG nor other projects do this, so this will be the first.

[screenshots]
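The cookie hand-off described above could be sketched as follows. This is a hedged illustration, not Araa's actual code: the helper name is hypothetical, and the web-driver step is only described in comments.

```python
# Hypothetical sketch of the hand-off, not Araa's actual implementation.
# 1. A web-driver session serves the captcha to the user.
# 2. On success, the driver reads the GOOGLE_ABUSE_EXEMPTION cookie,
#    e.g. driver.get_cookie("GOOGLE_ABUSE_EXEMPTION")["value"].
# 3. That value is attached to subsequent scraping requests so Google
#    treats them as exempt.

def exemption_headers(exemption_id: str) -> dict:
    """Build the request headers carrying the abuse-exemption cookie."""
    return {"Cookie": f"GOOGLE_ABUSE_EXEMPTION={exemption_id}"}

# Usage: hand these headers to whatever issues the Google request,
# e.g. makeHTMLRequest(url, headers=exemption_headers(captured_id)).
```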

@Extravi (Owner) commented Dec 27, 2023:

I might have to do something like this and update the image in real time, because it's not simple to scrape.

[screenshot]

@Extravi (Owner) commented Dec 27, 2023:

[screenshot]

@Extravi (Owner) commented Dec 27, 2023:

It will first display that to the user, because each user needs their own session to prevent more than one user doing the same captcha at once.

@amogusussy (Contributor):

I think a good solution could be having a backup search engine.
When the instance gets rate limited, it should just switch to another engine, like Qwant, then wait ~30 minutes and retry Google. Then repeat until Google stops rate limiting.
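The suggested fallback-with-cooldown could be sketched roughly like this (a hedged sketch: the engine names and 30-minute window are illustrative, not Araa's actual configuration):

```python
import time
from typing import Optional

# Sketch of the fallback suggestion: use Google normally, switch to a
# backup engine while rate limited, retry Google after a cooldown.
class EngineFallback:
    def __init__(self, cooldown: float = 30 * 60):
        self.cooldown = cooldown
        self.rate_limited_at: Optional[float] = None

    def mark_rate_limited(self, now: Optional[float] = None) -> None:
        """Record the moment Google started rate limiting us."""
        self.rate_limited_at = time.time() if now is None else now

    def current_engine(self, now: Optional[float] = None) -> str:
        """Pick the engine to use for the next query."""
        now = time.time() if now is None else now
        if self.rate_limited_at is None:
            return "google"
        if now - self.rate_limited_at >= self.cooldown:
            self.rate_limited_at = None  # cooldown over: retry Google
            return "google"
        return "qwant"
```

The `now` parameter exists so the cooldown logic can be exercised without actually waiting 30 minutes.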

@Extravi (Owner) commented Dec 27, 2023:

> I think a good solution could be having a backup search engine.
> When the instance gets rate limited, it should just switch to another engine, like Qwant, then wait ~30 minutes and retry Google. Then repeat until Google stops rate limiting.

I already wrote most of the code for the captcha proxy.

@Extravi (Owner) commented Dec 27, 2023:

Also, look at LibreY and its fallback system; it's not necessarily clean or well done.

@Extravi (Owner) commented Dec 27, 2023:

I will likely add support for other engines at some point, but the user should be able to use Google if they want, rate limited or not.

@Extravi (Owner) commented Dec 27, 2023:

That's why I'm working on the proxy.

@Extravi (Owner) commented Dec 27, 2023:

@amogusussy I found a better way to proxy the captcha: using sessions and an iframe, with the server sending data like the sitekey and s-data to the iframe.
[screenshot]
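The server-to-iframe hand-off might look roughly like this. It is only a sketch: the `/captcha` endpoint path and parameter names are assumptions, chosen to mirror reCAPTCHA's sitekey and s-data fields.

```python
from urllib.parse import urlencode

# Sketch: render the per-session iframe markup that hosts the proxied
# captcha. The server injects the sitekey ("k"), s-data ("s"), and the
# user's session id so each challenge stays bound to one session.
def captcha_iframe(sitekey: str, s_data: str, session_id: str) -> str:
    query = urlencode({"k": sitekey, "s": s_data, "session": session_id})
    return f'<iframe src="/captcha?{query}" title="captcha"></iframe>'
```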

@amogusussy (Contributor):

Have you tried that with a different device? If you send an iframe to the user's device, whatever site it's loading will just think it's a request from the user, so it won't be rate limited.
This will probably only seem like it's working, since you're testing it on the device that's rate limited. A non-rate-limited user will just get sent the normal page.

@Extravi (Owner) commented Dec 28, 2023:

> Have you tried that with a different device? If you send an iframe to the user's device, whatever site it's loading will just think it's a request from the user, so it won't be rate limited. This will probably only seem like it's working, since you're testing it on the device that's rate limited. A non-rate-limited user will just get sent the normal page.

Yes, I'm aware of the normal-page issue; I've been testing for hours. It'll be fine once it's out/done.

@Extravi (Owner) commented Dec 28, 2023:

It's going to need an entire local proxy server for this to work. I found https://mitmproxy.org/, but do you know any better HTTP proxies?

@Extravi (Owner) commented Dec 28, 2023:

reCAPTCHA needs to be done using the server's IPs, so I need to proxy everything to the end user.
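The relay idea in one hedged sketch: fetch captcha resources server-side so the request originates from the server's IP, then hand the body to the end user. A real deployment would use a proper proxy (e.g. mitmproxy) and handle cookies, streaming, and response headers; this only shows the core fetch-and-forward step.

```python
from urllib.request import Request, urlopen
from typing import Optional

# Sketch: the server fetches the resource itself (so Google sees the
# server's IP) and returns the raw body to be re-served to the user.
def relay(url: str, headers: Optional[dict] = None) -> bytes:
    req = Request(url, headers=headers or {})
    with urlopen(req, timeout=10) as resp:
        return resp.read()
```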

@amogusussy (Contributor):

I've found this list of alternatives for Linux, but I don't really know what makes a proxy better/worse.

@Extravi (Owner) commented Dec 30, 2023:

I think I should use a paid captcha-solving service.

@Extravi (Owner) commented Dec 30, 2023:

Because it's not necessarily practical to proxy it to my users, or even possible.

@Extravi (Owner) commented Dec 30, 2023:

I'm going to drop this for now and add support for a different engine as a backup.

Extravi closed this as completed Dec 30, 2023
@Extravi (Owner) commented Dec 30, 2023:

Any ideas for what engine I should use for the backup?

@Extravi (Owner) commented Dec 30, 2023:

Also, I will be implementing the backup engine, so there is a template to build off of.

@Extravi (Owner) commented Dec 30, 2023:

I want everything to look like it belongs, unlike LibreY and its broken system.

@Extravi (Owner) commented Dec 30, 2023:

A captcha proxy costs less than the Google Search API.

@amogusussy (Contributor):

Qwant has a free API. The only problem is that it doesn't show the Wikipedia results in the API, so you'll have to scrape that yourself. There's also DuckDuckGo, Startpage, Yahoo, and Brave.
If you need help with scraping them, you can have a look at SearXNG's source code, since all the engines I mentioned are included in SearXNG.

I think we should standardize the results into one dict/json object, like what's been done with the torrent results. If we do that, it'll be 10x easier to add new engines, and maybe even give the user the ability to choose what engines they want to use.
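One possible shape for that standardized result object (the field names here are assumptions, not an agreed schema): every engine's scraper maps its raw output into one dict, so the renderer and a future engine picker stay engine-agnostic.

```python
# Sketch of a normalized search-result dict shared by all engines.
def normalize_result(title: str, url: str, description: str,
                     engine: str) -> dict:
    return {
        "title": title.strip(),
        "url": url,
        "description": description.strip(),
        "engine": engine,  # which backend produced this result
    }
```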

@Extravi (Owner) commented Dec 30, 2023:

Question: what do you think of anonymous data submission of search results as an opt-in feature? It would only collect the subdomain and domain for each result for the query, but it wouldn't record the query itself. So, for example, www.youtube.com or GitHub.com, nothing after the /.
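The domain-only rule described above could be as simple as this sketch, which keeps only the hostname of each result and discards the path and query entirely:

```python
from urllib.parse import urlsplit

# Sketch: reduce a result URL to its (sub)domain only, so nothing after
# the "/" and no search query is ever recorded.
def anonymize_result_url(url: str) -> str:
    return urlsplit(url).hostname or ""
```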

@Extravi (Owner) commented Dec 30, 2023:

I would use that data to index and improve aspects of the search results and make results more visual

@Extravi (Owner) commented Dec 30, 2023:

Such as favicon indexing, etc. I might collect YouTube channel URLs too (so the / for that), but that's so I can index all channels over 10k subscribers.

@Extravi (Owner) commented Dec 30, 2023:

So I can do things like this:
[screenshot]
and this:
[screenshots]

@Extravi (Owner) commented Dec 30, 2023:

The data collection code would be open source and anonymous.

@Extravi (Owner) commented Dec 30, 2023:

And if the user wants to opt out of the setting (turned on by default in settings), they can.

@Extravi (Owner) commented Dec 30, 2023:

I want to index some stuff in Araa that each engine can use, like Qwant, Google, etc.

@Extravi (Owner) commented Dec 30, 2023:

I want to make results look more visual and modern.

@Extravi (Owner) commented Dec 30, 2023:

I want it to be on par with closed-source meta search engines, and for that to work, some data collection may be required.

@Extravi (Owner) commented Dec 30, 2023:

It's only an idea, and that doesn't mean it will happen.

@Extravi (Owner) commented Dec 30, 2023:

It's something I want to do, but if I do decide to develop it, something might change, resulting in it getting dropped.

@amogusussy (Contributor):

Something like that seems too far out of the reach of this project.
I do think a feature like that could be good, though. If there's a link within the first 3 results that links to youtube.com/c/ (to check if it's likely that the user's searching for that channel), then you could scrape Social Blade for info about the channel.
That could also expand further into other widgets, like for weather or sports results.

@Extravi (Owner) commented Dec 30, 2023:

It's not necessarily far out of reach. It's common to come across the same websites in the search results for different queries, and many people will search for or request the same websites from time to time. So after the first request, it will index the favicon, etc., and pair it with that subdomain and domain.

@Extravi (Owner) commented Dec 30, 2023:

Like Medium and other article sites are quite common, or even Stack Overflow for a coding-related query.

@Extravi (Owner) commented Dec 30, 2023:

Most people only really go to the top 1,000 or so sites, and it will naturally index information for those sites over time, along with many other sites.

@Extravi (Owner) commented Dec 30, 2023:

It may seem far out of reach, but when you really think about it and about user habits, it isn't impossible.

@Extravi (Owner) commented Dec 30, 2023:

The indexer application would have to be a separate project from this repo, and this repo would only use the data it produces from the data collected by this repository.

@Extravi (Owner) commented Dec 30, 2023:

> Something like that seems too far out of the reach of this project.
> I do think a feature like that could be good, though. If there's a link within the first 3 results that links to youtube.com/c/ (to check if it's likely that the user's searching for that channel), then you could scrape Social Blade for info about the channel.
> That could also expand further into other widgets, like for weather or sports results.

Also, due to speed, it will only show that data after it has been indexed by the other application.

@Extravi (Owner) commented Dec 30, 2023:

The indexer might be MIT-licensed or something, I'm not sure, but any data it produces likely won't be subject to the GPL.

@Extravi (Owner) commented Dec 30, 2023:

> Something like that seems too far out of the reach of this project.
> I do think a feature like that could be good, though. If there's a link within the first 3 results that links to youtube.com/c/ (to check if it's likely that the user's searching for that channel), then you could scrape Social Blade for info about the channel.
> That could also expand further into other widgets, like for weather or sports results.

Yes, I want to add things like weather, news, and sports; those are also topics I want to index.

@Extravi (Owner) commented Dec 30, 2023:

I wouldn't index text results, because they're more complicated than news or other topics/subjects.

@Extravi (Owner) commented Dec 30, 2023:

I would only get data to associate with text results

@Extravi (Owner) commented Dec 30, 2023:

This is a local autocomplete demo with some data collection. It could improve a ton, and then there would be no need to rely on DuckDuckGo, making it faster with some optimization.
https://github.com/Extravi/araa-search/assets/98912029/9227502a-7a65-43b1-8e90-4912c157a86a

@Extravi (Owner) commented Dec 30, 2023:

> This is a local autocomplete demo with some data collection. It could improve a ton, and then there would be no need to rely on DuckDuckGo, making it faster with some optimization. https://github.com/Extravi/araa-search/assets/98912029/9227502a-7a65-43b1-8e90-4912c157a86a

Read how many lines it has as of right now.

@amogusussy (Contributor):

I'm more talking about how we'd need to use things like databases for all the favicons. If you want good speeds for it, you'll need to use a dedicated database, like SQLite, rather than using Python dicts. It probably could be done, but it might take a bit of time to do it right.
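A minimal sketch of that SQLite-backed favicon store (the schema and helper names are assumptions for illustration):

```python
import sqlite3
from typing import Optional

# Sketch: a host -> favicon cache in SQLite instead of a Python dict.
def open_favicon_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS favicons (host TEXT PRIMARY KEY, icon BLOB)"
    )
    return db

def cache_favicon(db: sqlite3.Connection, host: str, icon: bytes) -> None:
    db.execute("INSERT OR REPLACE INTO favicons VALUES (?, ?)", (host, icon))
    db.commit()

def get_favicon(db: sqlite3.Connection, host: str) -> Optional[bytes]:
    row = db.execute(
        "SELECT icon FROM favicons WHERE host = ?", (host,)
    ).fetchone()
    return row[0] if row else None
```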

Do the search suggestions deal with misspelled words? If I go to DuckDuckGo and type 'liux', it gives a suggestion of 'linux', because it can guess what I was probably going for. Does this have anything similar yet?

@Extravi (Owner) commented Dec 30, 2023:

> I'm more talking about how we'd need to use things like databases for all the favicons. If you want good speeds for it, you'll need to use a dedicated database, like SQLite, rather than using Python dicts. It probably could be done, but it might take a bit of time to do it right.
>
> Do the search suggestions deal with misspelled words? If I go to DuckDuckGo and type 'liux', it gives a suggestion of 'linux', because it can guess what I was probably going for. Does this have anything similar yet?

Yes, it does check for spelling:
[screenshot]
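A misspelling-tolerant lookup like the one shown could be sketched with the standard library's difflib (the vocabulary and the 0.75 cutoff are illustrative; Araa's real suggestion list would be far larger):

```python
import difflib

# Sketch: exact prefix matches first, then fuzzy matches so that a typo
# like 'liux' still surfaces 'linux'.
VOCAB = ["linux", "library", "license", "github", "python"]

def suggest(prefix: str, limit: int = 5) -> list:
    prefix = prefix.lower()
    exact = [w for w in VOCAB if w.startswith(prefix)]
    fuzzy = difflib.get_close_matches(prefix, VOCAB, n=limit, cutoff=0.75)
    return (exact + [w for w in fuzzy if w not in exact])[:limit]
```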

@Extravi (Owner) commented Dec 30, 2023:

[screenshot]

@amogusussy (Contributor):

That looks good then. I think it should still keep DuckDuckGo by default though, unless you make a way for it to actually guess what the user's going to type, besides using a list to look it up.
15% of Google's search queries are unique, so relying on a pre-generated list of possible queries, even with every previously searched query, would still leave you with a large chunk without a good result.
