Search on global fediverse #824

Closed
ballsystemlord opened this issue Jul 18, 2018 · 32 comments

Comments

@ballsystemlord

I can find no search bar, and neither your faq.md nor the online FAQ lists a method of searching PeerTube, leaving me with only the option of manually going to each site and searching it.
Is there something I am not seeing, or is there a tool, such as a browser plugin or a command-line tool like surfraw, to search all of PeerTube?
Thanks!

@Chocobozzz Chocobozzz changed the title How to search? Search on global fediverse Jul 19, 2018
@Chocobozzz
Owner

No, you can't for now. I have a few ideas for doing such a thing:

  • Have special nodes that follow every public PeerTube instance, index videos and provide a public search endpoint. These endpoints would be defined, for example, in the configuration file
  • Redirect the user to a public search engine
  • ?

@rigelk
Collaborator

rigelk commented Aug 17, 2018

After some chit-chat about that, we discussed a few possibilities:

  • a single search endpoint on instances.joinpeertube.org which would index videos from all instances it knows about.
    • pros: no-brainer, it's just a single endpoint we do requests to, consistent answers for everyone
    • cons: SPOF, instances may not want to rely on or use project infra at all (so far instances.joinpeertube.org is just used as a convenience, but this would turn it into essential infrastructure), added cost to Framasoft, trust model is centralized
  • each instance federates by flood-fill discovery (through its users' interactions with new actors, or through the discovery of new actors in the comments, or via the addition of a relay to speed up that process) and keeps a local index that is larger than what is displayed. The search is done on that extended local database.
    • pros: decentralized, trust model is not modified
    • cons: small instances might have a hard time finding enough content (alleviated by relays), small instances might have a hard time searching through content as they federate with more instances (to verify).
  • same as the previous option, but some instances declare themselves as indexes (most likely because they feel they have the capacity to do so) and others look for such indexes among the instances they explicitly follow (either automatically or by pinning a few of them). These indexes effectively serve as remote search endpoints for instances that don't have the resources to do so.
    • pros: decentralized, trust model is modified but community-based (essentially we use indexes that we already trusted somehow, and pinning some of them ensures we effectively use trusted ones)
    • cons: answer delay might be higher, and we need to keep track of a score of which indexes are more efficient (in terms of time to answer, or in terms of completeness of the index for instance)
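The last "cons" item mentions keeping a score for each index. A minimal sketch of what such scoring could look like follows, assuming each instance records an average response time and an index size per remote index. The field names, the 0.6/0.4 weights, and the rule that pinned indexes always come first are illustrative assumptions, not part of PeerTube.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IndexStats:
    """Observed behaviour of one remote index (all fields are hypothetical)."""
    host: str
    avg_response_ms: float   # mean time to answer a search request
    indexed_videos: int      # how many videos the index reports knowing
    pinned: bool = False     # explicitly pinned by the local admin


def score(stats: IndexStats, largest_index: int, slowest_ms: float) -> float:
    """Combine completeness and speed into one number in [0, 1].

    The weights are arbitrary and only serve the illustration.
    """
    completeness = stats.indexed_videos / largest_index if largest_index else 0.0
    speed = 1.0 - stats.avg_response_ms / slowest_ms if slowest_ms else 0.0
    return 0.6 * completeness + 0.4 * speed


def rank_indexes(candidates: list[IndexStats]) -> list[IndexStats]:
    """Return pinned indexes first, then the others by descending score."""
    if not candidates:
        return []
    largest = max(c.indexed_videos for c in candidates)
    slowest = max(c.avg_response_ms for c in candidates)
    return sorted(
        candidates,
        key=lambda c: (not c.pinned, -score(c, largest, slowest)),
    )
```

An instance could query the first few entries of `rank_indexes(...)` and refresh the statistics after each answer.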

@ghost

ghost commented Oct 20, 2018

One downside of following everything, as an instance administrator, is that there's a lot less ability to moderate, curate, or reliably categorize content. This can create problems for search quality, but also for community safety or legality.

Systems like the automated flood-fill discovery above would place a somewhat arbitrary moderation burden on every administrator, if I understand the description correctly.

On the other hand, a centralized server centralizes that moderation/curation/abuse-response role, for better or for worse.

Personally, as an administrator, I would not want this responsibility. I would disable a promiscuous federation feature like this, since I wouldn't want to be both the admin of my instance's content and the view of the entire fediverse that my instance provides. Other admins may be more open to take on that role, but then how will those instances deal with problems of illegal or otherwise undesirable content that threatens the instance or its users?

A central search server or multiple central search servers would avoid putting this extra responsibility on every instance administrator, and that may end up being a more useful service for users who are new to peertube anyway.

@Booteille
Contributor

Isn't it possible to have an endpoint listing all the available instances, with each instance categorized by theme? (Themes could be "nature", "general", "politics", etc.)
Then the instance owner chooses from the list which instances he wants to follow, by theme and/or one by one, as he prefers.

@ghost

ghost commented Oct 20, 2018

@Booteille, who would decide what the categories mean and what content to include or exclude? This is still a moderation job that has to be done; perhaps tags of some kind could help the instance admins and search admins share this work, but the work still exists.

@r4dh4l

r4dh4l commented Nov 10, 2018

Hi all and thx @ballsystemlord for opening this issue!

I'm quite new to PeerTube but very interested in the concept because, as a home server administrator, I want to support decentralized IT services in general.

When promoting PeerTube among my friends, the fact that there is currently no "global search" was unfortunately the "walk-away point" for them. Looking for a "global PeerTube search", I found this issue, whose pro/contra list is extremely interesting to me. I don't know which solution would best preserve a strictly decentralized concept, but what I can say for now is this: the explanation video What is PeerTube? (English subtitles) is not enough to explain to users how to use PeerTube, especially when it comes to video search. People not used to decentralized concepts simply don't understand the problem with a central search and refuse a service with a different usability concept (even though the different concept is for their own good).

Anyway for now I would suggest: There should be an explanation text under every search box of a PeerTube node (or a symbol linking to it) that

  1. explains that there is no global search (with a link to another text explaining why, maybe this issue)
  2. lists an overview of all other PeerTube nodes the current one is connected to, so as to indicate the range of the current search

Edit: Maybe the "there is no global search" explanation should be placed above any search results as well with a text like:

PeerTube is decentralized and has no global search, which means the results listed here don't reflect all PeerTube content on the web. The results listed here come from the following PeerTube nodes that this node is connected to:

- peertube.one
- peertube.three
- peertube.four
- peertube.z

[Other PeerTube nodes](https://joinpeertube.org/en/#getting-started) are not listed because of the settings chosen by the administrator of this node (for personal or legal reasons).

I understand that a sustainable solution to this issue needs a lot of time, but until then PeerTube as software needs to meet people where they currently are (and they are expecting a global search). Explaining to users why there is no global search would be the best interim solution: it doesn't make anything worse, just better. I'm very sad to say it, but in the current state of PeerTube's usability most people won't accept it as long as they are not better educated in decentralized concepts (which unfortunately has to be done by PeerTube; What is PeerTube? (English subtitles) is just a (great) start for the needed "elucidation").

@ghost

ghost commented Jan 22, 2019

We could use Yacy to solve this issue.
Yacy is a FOSS, decentralized search engine, so we can host instances of Yacy ourselves and define an operator for PeerTube videos. For example, if someone searches Yacy for:

video:peertube cats

then it should output only PeerTube links about "cats". We could then integrate this into PeerTube's website and pass search results from Yacy into PeerTube's UI seamlessly, without requiring the user to type any special commands, only the keywords he needs results for.

Note: It's worth mentioning that for this to work properly, most likely we'll need to crawl peertube websites ourselves (it's very simple to do), just as shown in this video.
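To make the "no special commands" idea concrete, here is a small sketch of how a front end might add the operator on the user's behalf. The `video:peertube` operator is the one proposed in this comment and is not known to exist in Yacy. The `/search` path, the `query` parameter name, and the host `yacy.example` are placeholders too, not Yacy's actual API.

```python
from urllib.parse import urlencode

# Operator proposed in the comment above; hypothetical, not an existing Yacy feature.
PEERTUBE_OPERATOR = "video:peertube"


def build_query(user_keywords: str) -> str:
    """Prefix the user's keywords with the operator so they never type it themselves."""
    keywords = " ".join(user_keywords.split())  # collapse stray whitespace
    if not keywords:
        raise ValueError("empty search")
    return f"{PEERTUBE_OPERATOR} {keywords}"


def build_search_url(base_url: str, user_keywords: str) -> str:
    """Build a request URL; the path and parameter name are placeholders."""
    params = urlencode({"query": build_query(user_keywords)})
    return f"{base_url.rstrip('/')}/search?{params}"
```

The user would type only `cats`, and the UI would send `video:peertube cats` to the search back end.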

@elevenpassin
Contributor

@Zig-03 If it's possible to plug into Yacy, we can run Yacy right alongside PeerTube. Yacy will index each and every node, instance, channel & user as members of an instance explore the fediverse. When searching, we can plug into Yacy for search results (we don't have to do everything on our own! We can stand on the shoulders of other free software).

@scanlime As far as I can tell, an instance owner shouldn't have to moderate content hosted on other instances. Our instance will not host the content permanently (unless you manually choose to seed it), so I don't think we have to worry about any content-related issues.

@rigelk
Collaborator

rigelk commented Apr 11, 2019

@buoyantair @Zig-03 while doing something with Yacy outside of the PeerTube codebase is certainly fine, I don't see us requiring Yacy on instances just to bring them more global search results. It is yet another external tool and it doesn't simplify the deployment of PeerTube at all.

@elevenpassin
Contributor

@rigelk Why don't we explicitly ask the instance owner at installation time? We could also give them a CLI tool to install new plugins (say, global search in this case).

This means that we would keep the current instance-follow-specific search by default, and anyone else looking for global search could just install and enable it? We wouldn't have to integrate Yacy into our code base, just interface with it (we send the Yacy server search strings and it gives us back search results to display in our main PeerTube client).

@ghost

ghost commented Apr 11, 2019

Do we really want instances to be searching among servers they aren’t normally following?

For instances that want “everything”, they’ll be following as many servers as possible anyway. Many instances though don’t actually want all the content, they’re trying to be more focused.

Do you as an administrator want the ability to include search results for videos that aren’t otherwise available?

Do you as a user want to force all servers to include global search results?

I’m not sure what the intended result is here, and it might be worth making sure that the technical capability you’re envisioning will be useful and enabled by admins.

@elevenpassin
Contributor

I agree with @scanlime
An instance is either an introvert or an extrovert. Instances which are extroverted will attempt to make a connection with every new instance they discover. Instances which are introverted will limit their connections to a close-knit set of instances.

@ballsystemlord
Author

ballsystemlord commented Apr 21, 2019

You guys are ignoring one REALLY big thing with respect to global search engines like Yacy. Evil instances of peertube.

Let me elaborate. If I'm running Google (Heaven forbid!), I can tell my search engine not to go to certain websites, to profile what websites users visit, and to perform a fuzzy search of the web database I have and rule out websites containing certain keywords that should not be used together (like "C event oriented multi-threaded programming made easy", with "easy" being the operative keyword. :) ).
This allows me to train my search engine, Google, to be good at avoiding sending users to sites that are malicious and/or click bait.
In the case of peertube, it is decentralized like the web. Unlike the web, we don't want to follow users (AFAIK), we have a hard time finding enough information in the peertube instances' video descriptions (and entering an accurate description takes time), and we don't even verify whether the instance is still hosting videos or has turned into a JavaScript-powered operation that mines bitcoin in your browser: https://thehackernews.com/2017/11/cryptocurrency-mining-javascript.html
Here's an instance that is telling my browser to run some JS from ajax.cloudflare.com: https://luttube.tk/ . Normal instances don't require this: https://devtube.dev-wiki.de/ .
Yes, I did search Yacy's FAQ, there is no mention of how they intend to solve this. Nor does freenet for that matter, but they have moderators and hand built indexes (EDIT: They don't allow javascript in their webpages either).

We can't expect to accomplish this like Google does. We can't have a global search without a set of moderators (Requiring no JS and telling people's browsers to disable it when viewing peertube sites would set the bar for evil sites much higher though).

I recommend the following (sorry, I don't know much JS so I can't help much):
1: We tell instances to include a user defined classification. We could use something like the original usenet system: https://en.wikipedia.org/wiki/Usenet This could be expanded to be more like the common tag systems of blogs.
2: A language tag: https://en.wikipedia.org/wiki/IETF_language_tag
3.a: A user entered rating, strictly for telling user agents not to show this result to kids (or me when I'm trying to trouble shoot a strange computer error).
3.b: I recommend: Safe, Iffy, No, like the various current search engines have.
4.a: We create a JavaScript-powered client that sends requests asynchronously. All it would have to do is parse and aggregate the results from many sites into one or more local pages.
4.b: It would use a set of check boxes for setting preferred languages (which would be relayed to the individual search engines).
4.c: It would have a radio button for setting the rating (which would be relayed to the individual search engines).
4.d: It would have a set of check boxes for setting which classifications should be searched under.
4.e: It would have a black list of instances which would be created and destroyed by setting cookies in the browser and would instruct the user agent to not search X or Y site.
4.f: Just in case people want to only search a set amount of sites we could add that too.
4.g: It must sort the results according to the search terms.
4.h: It must include a timeout.
4.i: It must include a way to limit the amount of simultaneous connections.
5: Each instance of peertube would have a page that the user could load the JS program from. If users did not trust the instance they could go to a git repo and download the webpage with the JS program, or something else.
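The client described in point 4 is proposed as a JavaScript program. The sketch below uses Python only to illustrate the control flow: a block list (4.e), sorting by the search terms (4.g), a timeout (4.h), and a cap on simultaneous connections (4.i). The `searcher` callback, the result shape (a dict with a `title` key), and the relevance rule are assumptions made for the example; a real client would call each instance's own search API.

```python
import asyncio
from typing import Awaitable, Callable

# A "searcher" is any coroutine taking (instance, query) and returning a list of
# result dicts. Its shape is hypothetical; plug in real HTTP calls as needed.
Searcher = Callable[[str, str], Awaitable[list]]


async def search_all(
    instances: list,
    query: str,
    searcher: Searcher,
    *,
    blocklist: frozenset = frozenset(),
    timeout: float = 5.0,
    max_connections: int = 4,
) -> list:
    """Query many instances concurrently and merge their results."""
    semaphore = asyncio.Semaphore(max_connections)  # 4.i: limit connections

    async def search_one(instance: str) -> list:
        async with semaphore:
            try:
                # 4.h: give up on slow instances
                return await asyncio.wait_for(searcher(instance, query), timeout)
            except (asyncio.TimeoutError, OSError):
                return []  # unreachable or slow instances contribute nothing

    targets = [i for i in instances if i not in blocklist]  # 4.e: block list
    per_instance = await asyncio.gather(*(search_one(i) for i in targets))
    merged = [result for results in per_instance for result in results]

    terms = query.lower().split()

    def relevance(result: dict) -> int:
        """Naive relevance: how often the query terms occur in the title."""
        title = result["title"].lower()
        return sum(title.count(term) for term in terms)

    return sorted(merged, key=relevance, reverse=True)  # 4.g: sort by terms
```

Language and rating filters (4.b, 4.c) would be passed through to each instance by the `searcher`, so they are left out here.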

You already maintain a list of peertube instances, this could be decentralized so that each instance has a list and users would not have to request this information from only the main site.
Individual people could create and allow users to download and "install" lists of sites that are "evil" or "click bait".
It would then only be a matter of having individual instances increase the power of their search engines (case sensitive, don't include results with X word, etc.).

Advantages:
It would work on all computers that could watch videos, including phones.
Censorship would be harder than in a centralized system.
It would be much more powerful than the current system.
It would hold up much better against abusive instances than a centralized system.
We would not have to administer/censor/block/check or link tax anything: https://www.wired.co.uk/article/what-is-article-13-article-11-european-directive-on-copyright-explained-meme-ban
If users complain about lots of bad results we can tell them to block that peertube instance and make peertube vids as to how.
No trust model.

Drawbacks:
It would be slower than a centralized system, especially on a slow (cell/modem) network.
It would use up some B/W, but it's only extra text (no or limited preloading of images/vids), so it should not be too much (this might be further helped by using UDP as per HTTP 3.0: https://www.zdnet.com/article/http-over-quic-to-be-renamed-http3/ ).

@silicium14

silicium14 commented Jun 16, 2019

Hello,
I created a prototype of centralized search engine hosted at https://peertube-index.net. The source code is at https://github.com/silicium14/peertube_index.

@Aluriak

Aluriak commented Jun 17, 2019

@silicium14 that is a very good idea for a first step.

I bet the final solution will be something like that, with the decentralization given by Yacy, and fair and open recommendation algorithms along the way. Decoupling hosting and search seems to me an obvious improvement.

@EvgenijM86

> Hello,
> I created a prototype of centralized search engine hosted at https://peertube-index.net. The source code is at https://github.com/silicium14/peertube_index.

Thanks. It is better than nothing, but we can already see the problem: the people who host that search engine have already decided to censor search results. Probably not because they wanted to, but to avoid being the sole person responsible for whatever is shared. Maybe something like that should be hosted on the Tor network to be truly uncensored.

@ballsystemlord
Author

ballsystemlord commented Jul 28, 2019

> Hello,
> I created a prototype of centralized search engine hosted at https://peertube-index.net. The source code is at https://github.com/silicium14/peertube_index.
>
> Thanks. It is better than nothing, but we can already see the problem: the people who host that search engine have already decided to censor search results. Probably not because they wanted to, but to avoid being the sole person responsible for whatever is shared. Maybe something like that should be hosted on the Tor network to be truly uncensored.

In the US (a "free country") Tor has the noted drawback that most places that offer internet access for free block access to the sites from which you get Tor and Tails. Many block connections to the network and any other proxies that they're aware of. Some even go so far as to block access to the websites where you can download Linux distros which might have Tor installed. I speak with over 4 years' experience hopping from one internet cafe to another.
And that's just the US.
Peertube search over Tor is a fine idea, but it's leaving a lot of ground uncovered. Especially computers where you can't just install the tor browser or anything else you feel like.

@magus777

magus777 commented Apr 29, 2020

> Hello,
> I created a prototype of centralized search engine hosted at https://peertube-index.net. The source code is at https://github.com/silicium14/peertube_index.

This is great. And definitely needed.
I'd like to suggest some ideas:

  • Maybe port it into PHP, as there are many more developers who could then work on it. I personally never heard of Elixir. I would love to work on it but the learning curve looks very steep for a PHP developer.
  • Add some filters such as Order by Date, would be really useful.
  • Also the search engine could do with some work. It brings up many unrelated results.
    For example searching for 'ufo' brings up results with 'uno' in the title.

@peetss

peetss commented May 4, 2020

I want to take on the work to evolve this into a YouTube-style interface where videos across all instances can be viewed. I'm glad to see there is recent discussion on this topic. Censorship on YouTube continues to grow, and at an accelerated rate. The time for this is now.

@thomask-gh
Contributor

thomask-gh commented May 27, 2020

Hi, there's an idea that I don't see having already been discussed and that I think could be relevant for this global search feature: you might get some useful inspiration from the way distributed search engines such as Yacy (for instance) work. As a disclaimer, I don't know much about them nor about their inner workings, but I know they exist and it seems to me that they might be a relevant model for PeerTube. What do you think? 🙂

(sorry if what I'm bringing up is already covered in previous discussions, I honestly didn't take the time to read the detail of all the options mentioned)

@ballsystemlord
Author

ballsystemlord commented May 27, 2020 via email

@thomask-gh
Contributor

> I decided against because, as I said earlier, a search engine, even distributed, can be attacked by govs that favor censorship.

Well, I don't really get your point. What I understand is that you're saying that you don't think we should use Yacy because it's a search engine and that any search engine, even distributed, is vulnerable to censorship and should thus be avoided. But if you follow this logic, that would mean we have no search engine at all, which means no search feature. Maybe when I say "search engine" you think of external services like Bing or Google, but any piece of software that looks for specific content in a larger pool of content is a search engine. That includes the search feature in a blog, in Twitter, on Mastodon or on your local file system for instance. So building the "search on the global fediverse" feature would definitely amount to building a search engine into PeerTube. And, on the Internet, a distributed search engine is as close as you get to being censorship-resistant. :)

So let me clarify: I'm not suggesting using Yacy itself, nor any other existing third-party service. I suggest building into PeerTube a mechanism similar to the one Yacy (or other distributed search engines) uses, to power a fediverse-wide search feature. That is, a mechanism in which each instance indexes a part of the content on the fediverse (making up a "local index" on each instance) and in which, when a search is performed, requests are sent to other peer instances, searches are performed on those instances' indexes, and results are combined by the instance or user who made the search request (or maybe by a centralized "ranking server"?).

That's just the base idea, and in fact it's somehow similar to the third option mentioned in this comment
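The mechanism described above can be sketched in a few lines, under strong simplifying assumptions: each peer's "local index" is an in-memory dict mapping a video URL to its title, matching is plain word overlap, and the requester merges results itself. None of these names or structures come from PeerTube or Yacy.

```python
from __future__ import annotations


def search_local(index: dict[str, str], query: str) -> dict[str, int]:
    """Search one peer's local index (video URL -> title).

    Returns a map of URL -> number of distinct query terms found in the title.
    """
    terms = set(query.lower().split())
    hits: dict[str, int] = {}
    for url, title in index.items():
        matched = len(terms & set(title.lower().split()))
        if matched:
            hits[url] = matched
    return hits


def federated_search(peers: dict[str, dict[str, str]], query: str) -> list[str]:
    """Ask every peer, then combine results on the requesting side.

    A video known to several peers appears once, with its best score.
    """
    combined: dict[str, int] = {}
    for peer_index in peers.values():
        for url, hit_score in search_local(peer_index, query).items():
            combined[url] = max(combined.get(url, 0), hit_score)
    # Best score first; ties broken by URL so the order is deterministic.
    return sorted(combined, key=lambda url: (-combined[url], url))
```

In a real deployment each `search_local` call would be a network request to a peer, and the merge step is where a shared ranking policy would live.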

@onlyjob

onlyjob commented May 28, 2020

> IIRC, I did think of using Yacy as a base or whole search engine for peertube. I decided against because, as I said earlier, a search engine, even distributed, can be attacked by govs that favor censorship.

Wrong. Everything can potentially be abused/attacked, but there has got to be something to abuse first.
Prioritise local search; make search on the fediverse optional/configurable; let node admins manage white/black lists of nodes to search, etc. A lot could be done to keep multi-node search a useful and valuable feature.
Discoverability of information is crucial, therefore fediverse search must be implemented.
Not necessarily using YaCy, but by one means or another it should be possible to search the fediverse.
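As an illustration of the admin-side controls suggested here, the following sketch picks which nodes a search should reach: the local node always comes first, federated search is opt-in, and allow/block lists filter the rest. `SearchPolicy` and its fields are hypothetical names, not PeerTube configuration keys.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SearchPolicy:
    """Hypothetical per-node search settings managed by the admin."""
    federated_enabled: bool = False          # fediverse search is opt-in
    allowlist: frozenset[str] = frozenset()  # if non-empty, only these nodes are searched
    blocklist: frozenset[str] = frozenset()  # never searched


def nodes_to_search(
    policy: SearchPolicy, local_node: str, known_nodes: list[str]
) -> list[str]:
    """Return the nodes to query, with the local node always first."""
    targets = [local_node]
    if not policy.federated_enabled:
        return targets
    for node in known_nodes:
        if node == local_node or node in policy.blocklist:
            continue
        if policy.allowlist and node not in policy.allowlist:
            continue
        targets.append(node)
    return targets
```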

@ballsystemlord
Author

ballsystemlord commented May 28, 2020

> I decided against because, as I said earlier, a search engine, even distributed, can be attacked by govs that favor censorship.
>
> Well, I don't really get your point. What I understand is that you're saying that you don't think we should use Yacy because it's a search engine and that any search engine, even distributed, is vulnerable to censorship and should thus be avoided.

Sorry, my bad. I should have re-read my comment above.
The problem with Yacy type instances is that they lend themselves to being deceived. Google can censor websites that abuse search terms, for example I search for "butter" and I get directed to a dating site. Or you could get directed to a site that hosts peertube videos, but has been hacked to make your browser do bitcoin mining, or spectre/meltdown attacks (sitting on the same site for a long time would be ideal for such an attacker), or such in the background. Yacy can't solve that AFAIK.

@onlyjob

onlyjob commented May 28, 2020

This is not just a problem of malicious actors. The challenge and art of searching online is to cherry-pick valuable information and separate it from the noise. Any search engine has a lot of noise.
But without a search engine how and where do you even begin to discover information that you are after?
The DHT-based search on the aMule/Kademlia network is amazing, despite all the noise, because it is up to users to filter through the noise since they are the only ones who know what they are searching for. I think we can all agree that even bad search is better than nothing.

@ghost

ghost commented May 28, 2020

> The challenge and art of searching online is to cherry-pick valuable information and separate it from the noise. Any search engine has a lot of noise.

Most search engines have some sort of filters (I guess YaCy has them too) and we could use them to remove all the noise. For example, in google you can paste this site:github.com "activitypub" and you'll get clean results only from the github website that include the word activitypub.

Ok, that's nice, YaCy has them too! https://wiki.yacy.net/index.php/En:SearchParameters

If we need some custom search parameters that would fit our use case - we could submit a PR on yaCy's github page!

@onlyjob

onlyjob commented May 29, 2020

IMHO YaCy is great for indexing web pages and RSS feeds, but the federated universe should have its own built-in search based on a DHT, similar to aMule's Kademlia implementation.
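For readers unfamiliar with the idea, here is a toy sketch of the core Kademlia notion that a keyword is stored on the nodes whose IDs are closest to the keyword's ID under the XOR metric. Deriving node IDs by hashing host names with SHA-1 is an assumption made for the example; it is not aMule's actual ID scheme or wire format, and nothing like this exists in PeerTube.

```python
from __future__ import annotations

import hashlib


def key_id(text: str) -> int:
    """Map a node name or keyword to a 160-bit integer ID (SHA-1 digest)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric: bitwise XOR of the two IDs."""
    return a ^ b


def responsible_nodes(keyword: str, nodes: list[str], k: int = 2) -> list[str]:
    """Return the k nodes whose IDs are closest to the keyword's ID.

    Those nodes would store, and answer queries for, that keyword's index entries.
    """
    target = key_id(keyword.lower())
    return sorted(nodes, key=lambda n: xor_distance(key_id(n), target))[:k]
```

Publishing and searching use the same rule, so a search for a keyword only needs to contact the few nodes returned by `responsible_nodes` rather than every instance.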

@1000i100
Contributor

For French speakers who want to discuss this: https://framacolibri.org/t/recherche-globale-federee/8155

@Chocobozzz
Owner

Implemented in #2852

@Chocobozzz Chocobozzz removed their assignment Aug 26, 2020
@r4dh4l

r4dh4l commented Dec 13, 2021

Sorry for misusing this issue, but I don't know where else to ask: I used to use https://peertube-index.net/ for federated search requests, but for some time now the website has seemed to be offline. Are there any alternatives?

@Booteille
Contributor

Booteille commented Dec 14, 2021

Hi. Take a look at https://sepiasearch.org/

@r4dh4l

r4dh4l commented Dec 15, 2021

> Hi. Take a look at https://sepiasearch.org/

Thank you very much!
