
Swarm size community: content popularity #2783

Closed
synctext opened this issue Feb 6, 2017 · 43 comments

Comments

@synctext
Member

synctext commented Feb 6, 2017

The popularity of content is essential. Currently we do not have a method to determine whether a magnet link points to a large swarm or a dying swarm.

This work enables novel caching approaches, see "Efficient analysis of caching strategies under dynamic content popularity".

Each peer in the network must contact a swarm to determine its size. Checking 1 swarm every 5 seconds to conserve bandwidth means it takes many hours to check several thousand swarms (for example, 5,000 swarms at 5 seconds each already takes roughly 7 hours). It is possible to share this information with your neighbors. By gossiping this swarm size information around, each swarm needs to be checked only once and bandwidth is saved.

The swarm size community is a mechanism to collaboratively determine the size of swarms. Going from a magnet link to a swarm size estimate involves numerous steps. Resolving a magnet link means using the DHT. Checking a swarm and doing a BitTorrent handshake also requires numerous UDP packets and means dealing with non-responsive IPv4 addresses.

Experiments:

  • App on Android Google Play market
  • determine the exact bytes for each packet in the DHT and Bittorrent handshake and total lookup.
  • determine the influence of swarm size on Video-on-demand performance (what is a healthy swarm size)
  • how many neighbors do you need to collaborate with to make cooperation efficient
  • credit mining / automagic caching of most popular content to boost system efficiency
  • caching on Android: donate all your resources to Tribler (boost network with old/unused phones)
  • boost with Raspberry Pi 3

Science:

@synctext
Member Author

first startup task (a minimal sketch of such a tool follows the list below):

  • create standalone program
  • Python wrapper around Libtorrent
  • cmdline tool: provide magnet links as arguments
  • the tool then fetches swarm info from the DHT and contacts the swarm
  • downloads a single piece from this swarm
  • print statistics of how many seeders and leechers we connected with
  • print total swarm piece availability
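A minimal sketch of such a tool, assuming the libtorrent 1.1.x Python bindings (exact call names such as add_magnet_uri, has_metadata, and listen_on vary between libtorrent versions):

```python
# Sketch: resolve a magnet link via the DHT, wait for the first piece, and
# print peer statistics. Assumes libtorrent-rasterbar 1.1.x Python bindings;
# several of these calls are deprecated or renamed in later versions.
import sys
import time
import libtorrent as lt

def check_swarm(magnet_link):
    ses = lt.session()
    ses.listen_on(6881, 6891)
    ses.add_dht_router('router.bittorrent.com', 6881)
    ses.start_dht()

    params = {'save_path': '.',
              'storage_mode': lt.storage_mode_t.storage_mode_sparse}
    handle = lt.add_magnet_uri(ses, magnet_link, params)

    # Wait until the metadata (.torrent info) arrives via the DHT.
    while not handle.has_metadata():
        time.sleep(1)

    # Wait until at least one piece has been downloaded.
    while handle.status().num_pieces == 0:
        time.sleep(1)

    st = handle.status()
    print('connected seeders: %d, connected leechers: %d'
          % (st.num_seeds, st.num_peers - st.num_seeds))
    print('distributed copies (piece availability): %.2f' % st.distributed_copies)

if __name__ == '__main__':
    for link in sys.argv[1:]:
        check_swarm(link)
```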

@MaChengxin
Contributor

I found that the official libtorrent tutorial is a good reference, though the documentation isn't entirely consistent with the code.
I am running the sample code in nix-shell and it uses libtorrent-rasterbar-1.1.1. (The one in Ubuntu's package repository is a bit old. I could also build it myself, but the build process looks complicated according to the docs.)
The sample code is here. It has only downloading functionality and prints the status every second. A .torrent file is used as the command line input.
I am wondering what you mean by "downloads a single piece of this swarm". Does that mean downloading a piece of the content that the swarm is sharing, or the info of the peers existing in the swarm?
I am also confused about "print total swarm piece availability". Does it mean printing out the availability of every piece of the content?

@MaChengxin
Contributor

screenshot from 2017-04-05 14-38-06
Task almost finished.
In the screenshot: download progress, number of peers, number of seeders, number of leechers, and total swarm piece availability.
Reference: https://github.com/Tribler/tribler/blob/90b85b55e80a2e6713dea85519a5ee29627ce9ea/Tribler/Core/Libtorrent/LibtorrentDownloadImpl.py

@synctext
Member Author

synctext commented Apr 6, 2017

Can we obtain accurate statistics from Libtorrent? We desire the raw number of packets and their lengths. Sadly, peer_info's total_download only counts payload bytes: "the total number of bytes downloaded from this peer. These numbers do not include the protocol chatter, but only the payload data".

There are session statistics with net.sent_payload_bytes and net.sent_ip_overhead_bytes

It seems accurate message counters are kept:

name	type
ses.num_incoming_choke	counter
ses.num_incoming_unchoke	counter
ses.num_incoming_have	counter
ses.num_incoming_bitfield	counter
ses.num_incoming_request	counter
ses.num_incoming_piece	counter
ses.num_incoming_pex	counter
ses.num_incoming_metadata	counter
ses.num_incoming_extended	counter
ses.num_outgoing_choke	counter
ses.num_outgoing_unchoke	counter

Next step would be to determine the exact number of packets and bytes for downloading various Ubuntu images. The whole chain from magnet link to first downloaded piece.

The goal is single-byte-accurate statistics. Then we can calculate optimal strategies for connecting to swarms, sharing the swarm size handshake, and crawling.
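A hedged sketch of how these counters might be read from Python, assuming a libtorrent version whose bindings expose post_session_stats() and a session_stats_alert with a values dictionary (older bindings differ and may require the find_metric_idx route instead):

```python
# Sketch: periodically dump libtorrent's internal message counters.
# Assumes Python bindings where session_stats_alert.values is a dict mapping
# counter names to values; this differs between libtorrent versions.
import time
import libtorrent as lt

def dump_message_counters(ses):
    ses.post_session_stats()          # ask libtorrent to emit a session_stats_alert
    time.sleep(1)
    for alert in ses.pop_alerts():
        if isinstance(alert, lt.session_stats_alert):
            stats = alert.values
            for name in ('ses.num_incoming_choke',
                         'ses.num_incoming_unchoke',
                         'ses.num_incoming_have',
                         'ses.num_incoming_bitfield',
                         'ses.num_incoming_request',
                         'ses.num_incoming_piece',
                         'net.sent_payload_bytes',
                         'net.sent_ip_overhead_bytes'):
                print(name, stats.get(name))
```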

http://torrent.ubuntu.com:6969/file?info_hash=%2B%90%24%1F%8E%95%D5%3C%AC%F0%F8%C7%2A%92%E4kW%91%16%00
http://torrent.ubuntu.com:6969/file?info_hash=c%2B%8Dw%2A%90I%BD%27j%20%10%B2Q%B1J%5Cd%FC%DC

Final note: swarm size is not equal to the number of people that have seen this online video. Media consumption (content popularity) or downloads over time are very hard to estimate or measure. You would need to know the average in-swarm time, average downloading time, and the age of the swarm to calculate media consumption.

@MaChengxin
Contributor

The goal is single-byte-accurate statistics. Then we can calculate optimal strategies for connecting to swarms, sharing the swarm size handshake, and crawling.

Can you please elaborate on how optimal strategies can be calculated based on "single-byte accurate statistics"?

Meanwhile, I was reading some literature about how to get (near) real-time info on the number of seeders and leechers and came across "DHT crawling".
One such reference is a presentation found here: https://www.defcon.org/images/defcon-18/dc-18-presentations/Wolchok/DEFCON-18-Wolchok-Crawling-Bittorrent-DHTS.pdf
And the overview is in the screenshot:
image
It also shows the popularity of a specific content item over time:
image
I am wondering if this "DHT crawling" approach is in line with what you have in mind for determining swarm size.
If so, I will read more about that topic to get more insight.

@MaChengxin
Contributor

Found a BitTorrent DHT crawler example (also in Python) on GitHub: https://github.com/nitmir/btdht-crawler
Will try this first to see how seeder/leecher info is retrieved from DHT.

@MaChengxin
Contributor

MaChengxin commented Apr 16, 2017

Two more references:
https://github.com/dontcontactme/simDHT
https://github.com/blueskyz/DHTCrawler

-- Update -- Sun 16 Apr 2017 21:32:22 CEST --
simDHT just prints the infohash and the seeder's IP address and port number. It shows nothing about the number of seeders and leechers.
screenshot from 2017-04-16 21-28-01
DHTCrawler shows the infohash and the "popularity" (as claimed in its README) of the torrent, so this one could be helpful.
screenshot from 2017-04-16 21-30-33

@MaChengxin
Contributor

@synctext
I have some doubts before proceeding with the experiment.
Our ultimate goal is to fetch info about the number of seeders and leechers (i.e. the swarm size) from the DHT to show the real-time popularity of torrents. Naturally, one possible solution path would be:

  1. make a peer detect the number of seeders and leechers of a certain torrent
  2. make a peer detect the number of seeders and leechers of multiple torrents
  3. make multiple peers share their information by means of gossiping or something

The program made in the first startup task can only show the number of seeders and leechers that the peer is currently connected to. It has no knowledge of how large the total swarm is. So I prefer to solve this problem first.
Two questions:

  1. Do you agree with the solution path above?
  2. Why do we desire the raw number of packets and length (single-byte accurate statistics)?

@synctext
Member Author

Indeed.
Crawling to see the whole swarm is difficult; we can only partially observe swarms. We can, however, measure various swarms and compare their relative sizes. With the age plus current swarm size we estimate popularity.

please ignore DHT crawling, too much spam.

@MaChengxin
Contributor

Current progress:
I tried with a Raspbian ISO torrent instead of a Ubuntu one (will change it later).
Now the program stops as soon as some pieces are downloaded, but it may not be exactly one piece (could be two or more).
From the downloading process, we can get the following info:

  • block size (a block is a sub piece)
  • number of downloaded pieces
  • total download
  • total payload download

These values come from the torrent status.
It is likely that the overhead can be calculated by subtracting the total payload download from the total download.
The following screen shot shows what can be acquired.
image

It still requires some effort to figure out which pieces of info matter, and why the total download and total payload download are 0 when the number of downloaded pieces is 2.

@MaChengxin
Contributor

Succeeded in retrieving statistics (see session stats in the screenshot), but failed to interpret them.
image

See this issue for more info: arvidn/libtorrent#1946

@synctext
Member Author

synctext commented Apr 25, 2017

Next step is to get this operational and count the exact number of packets + their type. Goal is to measure all packets starting from a magnet link until exactly the first block is completed.

When this information from Libtorrent is combined with the packet length we get a sufficiently accurate picture. Experimental results for thesis: do this for 100-ish legal magnet links (e.g. magnet links from all Ubuntu torrents + other sources).

@MaChengxin
Contributor

After some effort it is finally possible to interpret the session statistics.
The most relevant commit is: https://github.com/MaChengxin/expBT/commit/d0ef388aa99dbea6248292601a0ada6b71840492
In this commit, BitTorrent messages of various types are counted.
According to Understanding BitTorrent: An Experimental Perspective, message types (described on pp. 13-14) in BitTorrent version 4.0.x include:

Type Size (bytes)
HANDSHAKE (HS) 68
KEEP ALIVE (KA) 4
CHOKE (C) 5
UNCHOKE (UC) 5
INTERESTED (I) 5
NOT INTERESTED (NI) 5
HAVE (H) 9
BITFIELD (BF) upper((# of pieces)/8)+5
REQUEST (R) 17
PIECE (P) 2^14+13 (if the block size is 2^14)
CANCEL (CA) 17

Most of them can be measured using libtorrent, as we can see in the commit above.
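As a rough illustration of how this table can be used, the sketch below estimates the protocol overhead (non-payload bytes) of fetching one piece from a single peer; the assumed message counts are illustrative, not measured values:

```python
# Back-of-the-envelope estimate of BitTorrent protocol overhead, using the
# message sizes from the table above. The assumed per-connection message
# counts are illustrative assumptions, not measurements.
import math

MSG_SIZE = {
    'HANDSHAKE': 68,
    'KEEP_ALIVE': 4,
    'CHOKE': 5,
    'UNCHOKE': 5,
    'INTERESTED': 5,
    'NOT_INTERESTED': 5,
    'HAVE': 9,
    'REQUEST': 17,
    'CANCEL': 17,
}

def bitfield_size(num_pieces):
    # BITFIELD = ceil(num_pieces / 8) + 5 bytes
    return math.ceil(num_pieces / 8) + 5

def overhead_first_piece(num_pieces, blocks_in_piece, assumed_counts):
    """Protocol bytes (everything except payload) to fetch one piece."""
    total = bitfield_size(num_pieces)
    total += sum(MSG_SIZE[m] * n for m, n in assumed_counts.items())
    total += 13 * blocks_in_piece          # 13-byte header of each PIECE message
    return total

# Example: a swarm with 1200 pieces, 16 blocks per piece.
counts = {'HANDSHAKE': 1, 'UNCHOKE': 1, 'INTERESTED': 1, 'REQUEST': 16, 'HAVE': 4}
print(overhead_first_piece(1200, 16, counts), 'bytes of protocol overhead (estimate)')
```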

@MaChengxin
Contributor

MaChengxin commented May 2, 2017

The next steps would be to:

  • remove the hard coded test torrent and automate the process of measuring different torrents
  • store the measured statistics (in JSON)
  • run the experiment
  • find a way to skip the "bad" torrents that take a long time to download the first piece
  • analyze the statistics to profile the performance

@MaChengxin
Contributor

MaChengxin commented May 2, 2017

The experiment is running now using torrent files instead of magnet links (temporarily, otherwise it would be too slow).
Maybe we need to add a timing function for each attempt to download the first piece? Time is also an important factor IMO.

@synctext
Member Author

synctext commented May 2, 2017

True. Timing is also needed for dead torrent filtering.

@MaChengxin
Contributor

Time measurement is done: https://github.com/MaChengxin/expBT/commit/5a8f3293d1daafb2cfc1b72552bd08d3662e535f
After running the experiment, I have got a folder containing JSON files, each of them holding the stats for a single torrent.
I am wondering how we can make use of these stats to draw some meaningful conclusions. What info do we expect to retrieve from the data?

@MaChengxin
Contributor

First trial of stats analysis: plot the histogram of the download time of all the Ubuntu images.
figure_1

We see that most torrents can download the first piece within 10 seconds.

@MaChengxin
Contributor

MaChengxin commented May 9, 2017

figure_1-1
Update: reduced download time for each torrent by changing the way of checking pieces from time-driven to event-driven.

@MaChengxin
Contributor

Personal repository for the experiments: https://github.com/MaChengxin/expBT

@MaChengxin
Contributor

Just noticed that there is a page describing this issue with many details: https://www.tribler.org/SwarmSize/

@synctext
Member Author

synctext commented May 17, 2017

Real swarm measurements. Roughly 15-KByte-ish of cost for sampling a swarm (also receive bytes?). Uses magnet links only. 160 Ubuntu swarms crawled:
image
Experiments are serial. Only 1 swarm at a time, 500 second timeout per swarm or abort when the first piece is completed.

@synctext
Member Author

synctext commented Jun 9, 2017

It seems to work! Further polish the plots made during the measurements (many more here).
Next step is to find more magnet links and run a detailed measurement of the correlation.

image

keep reading the related work

Thoughts: downloading 1 piece gives a good correlation. How good is downloading 16 pieces?

Does our content popularity estimation get more accurate if you spend more bandwidth and time within a swarm?

Dispersy community design sketch. Create a community which every second starts a new content popularity check. The outcome of this check is shared across the Dispersy overlay with 10 neighbors (fanout). That's it.
Each peer in the network now obtains UDP packets with popularity stats on several swarms.
20170609_172509
Future steps: align with existing TorrentCollecting in Tribler, AllChannel, and channels in general for spam prevention.
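A schematic sketch of the check-and-gossip loop described above, in plain Python rather than the actual Dispersy API; check_swarm_size and send_to are hypothetical placeholders:

```python
# Schematic of the proposed community behaviour: every second check one random
# swarm and gossip the result to 10 random neighbours (fanout).
# Not the real Dispersy API; check_swarm_size() and send_to() are placeholders.
import random
import time

FANOUT = 10

def popularity_loop(known_infohashes, neighbours, check_swarm_size, send_to):
    while True:
        infohash = random.choice(known_infohashes)
        seeders, leechers = check_swarm_size(infohash)    # e.g. first-piece check
        message = {'infohash': infohash,
                   'seeders': seeders,
                   'leechers': leechers,
                   'timestamp': time.time()}
        for peer in random.sample(neighbours, min(FANOUT, len(neighbours))):
            send_to(peer, message)                        # one UDP packet per neighbour
        time.sleep(1)                                     # one check per second
```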

@MaChengxin
Contributor

MaChengxin commented Jun 23, 2017

The design sketch of the Dispersy community is almost done: https://github.com/MaChengxin/tribler/tree/swarm_size_community/Tribler/community/swarmsize
Currently a swarm size detector simulator is used to mimic the behavior of the real one, because the real one still needs some polishing-up work. Once it is done it can easily replace the fake one.

@MaChengxin
Contributor

MaChengxin commented Jun 26, 2017

Result of the experiment with the swarm size detector simulator: https://jenkins.tribler.org/job/pers/job/SwarmSizeCommunity_Chengxin/32/artifact/output/localhost/node321/00000.out/*view*/
The nodes in the community are capable of receiving the results measured by other nodes. Note that the results are generated randomly, and will eventually be replaced by actual data.

@MaChengxin
Contributor

Questions

My task has three parts:

A. Distribute the entire work across different nodes (i.e. divide thousands of torrents into smaller sets and assign each set of torrents to a node)
B. Each single node solves its assigned task (i.e. estimate the swarm size from statistics or just ask the tracker)
C. Aggregate the partial results (i.e. nodes share their information)

The questions are:

  1. For B I will invest more time into it, and for C I already know that I can use the Gossip protocol to share the info. However, I am not very clear about A. How can I do that in an elegant way?
  2. Between B and C, which subtask is more important?

@synctext
Member Author

synctext commented Jun 26, 2017

Solution: no scheduling, division of work or result aggregation.

The proposed architecture in the above design sketch is that each node just randomly checks 1 swarm per second. A total of 10 random content popularity checks are then shared with random neighbors. With this probabilistic approach duplicate checks are very likely, but the code is simplified.

Next steps:

  • first prototype of community with random check and random exchange
  • create PR with a community that is DISABLED by default
  • align with existing TorrentCollecting in Tribler, AllChannel, and channels in general for spam prevention.
  • integrate with search results (or task for Jelle)
  • enhance from random to semantic clustering and similarity functions (only check torrents similar to your taste)
  • thesis pictures:
    • experimental Beta release on Tribler forum (25...250 users)
    • run crawler in community
    • size and growth of community (number of unique public keys in weeks)
    • number of unique swarm hashes over time
    • number of duplicate checks over time
    • figures for dead/spam swarm with and without the content popularity check
    • streaming experiment (assume DHT-only)
      • check content popularity of random swarms
      • correlation of content popularity and time to download first piece of content (or first 1-5 MByte)
      • correlation of content popularity and measured on-demand streaming start delay
      • correlation of content popularity and measured sustained streaming speed
    • Run on Android...?...
  • DONE

@MaChengxin
Contributor

MaChengxin commented Jul 12, 2017

figure_1

This figure shows how many checks one node has to make in order to cover all the torrents using a random pick-up policy (with replacement). It takes about 800 - 900 checks to cover 162 swarms.

More generally, if m denotes the number of torrents, then the expected number of checks required for total coverage is m*Hm, where Hm is the m-th Harmonic number. This gives us an impression of the order of magnitude of the required checks.
In this experiment, m is 162, so m*Hm is about 918. The experiment result is close to the mathematical expectation.
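In formula form, this is the standard coupon-collector expectation (restating the numbers above):

E[T] = m \cdot H_m = m \sum_{k=1}^{m} \frac{1}{k}, \qquad m = 162 \;\Rightarrow\; E[T] \approx 162 \times 5.67 \approx 918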

We could improve this simple swarm selection policy with the following strategies:

  • random pick-up with replacement => random pick-up without replacement
  • pick up new swarms according to others' results

The new architecture looks like this:
untitled diagram

Description:
The basic idea is to keep two pools of swarms. One contains swarms to check (the running pool), and the other one (the standby pool) contains the checked ones. When the running pool becomes empty, we move the swarms in the standby pool back to it.

So for a certain swarm, it will first be selected from the running pool and its size will be measured. After this, a message ({swarm: size}) will be created. This message will go to a message handler. This handler does three things: store the info locally, gossip it to other nodes, and move the swarm from the running pool to pool 1 in the standby pool.

Pool 2 in the standby pool is responsible for receiving info about swarms measured by others. Upon receiving a new message {swarm: size}, it will store the info and put the swarm into pool 2 of the standby pool. (So when selecting a swarm to check from the running pool, we first need to check whether it is already in the standby pool.)

Bootstrapping: Initially, the two pools are both empty. A node will first load the swarms it already knows into the running pool. Then the flow described above can start working.

Steady state: Since a node will measure swarm sizes by itself and also receive such info from others, it will eventually know all the swarms in the ecosystem (e.g. Tribler). One interesting question is: if every node uses the same strategy, what is the expected time needed to get every swarm measured at least once? I've already abstracted this question in a mathematical way: https://math.stackexchange.com/questions/2351742/expected-number-of-steps-to-walk-through-points-by-multiple-walkers
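A minimal sketch of the two-pool flow described above; SwarmPools, measure_swarm_size, and gossip are hypothetical names, not existing Tribler code:

```python
# Sketch of the running/standby pool design: check swarms from the running
# pool, park them in the standby pool (own results in pool 1, remote results
# in pool 2), and refill the running pool when it runs empty.
import random

class SwarmPools:
    def __init__(self, known_swarms):
        self.running = set(known_swarms)   # swarms still to check this round
        self.standby_own = set()           # pool 1: checked by us
        self.standby_remote = set()        # pool 2: reported by other nodes
        self.sizes = {}                    # infohash -> (seeders, leechers)

    def next_swarm(self):
        # Refill the running pool from the standby pools when it runs empty.
        if not self.running:
            self.running = self.standby_own | self.standby_remote
            self.standby_own.clear()
            self.standby_remote.clear()
        return random.choice(tuple(self.running))

    def check_one(self, measure_swarm_size, gossip):
        infohash = self.next_swarm()
        size = measure_swarm_size(infohash)
        self.sizes[infohash] = size        # store locally
        gossip({infohash: size})           # share with neighbours
        self.running.discard(infohash)
        self.standby_own.add(infohash)     # pool 1

    def on_remote_result(self, infohash, size):
        self.sizes[infohash] = size
        self.running.discard(infohash)     # avoid re-checking what others measured
        self.standby_remote.add(infohash)  # pool 2
```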

@synctext What do you think about this design? Do you see any design flaws?

@devos50
Contributor

devos50 commented Jul 13, 2017

@MaChengxin Please take a look at our existing torrent checker solution: https://github.com/Tribler/tribler/tree/devel/Tribler/Core/TorrentChecker. This check simply queries the tracker for the size of the swarm. It is compatible with both HTTP and UDP trackers (it also supports DHT queries).

@MaChengxin
Contributor

untitled diagram

This is the new design for checking swarms.
The dashed rounded square contains the checking module. A swarm is picked up randomly (without replacement) from the checking queue and put into one of three lists according to its health: healthy swarms, unhealthy swarms, and dead swarms.
When the checking queue is empty, swarms in the first two lists (healthy and unhealthy swarms) will be moved back to the checking queue to start a new round of checking.
(The criteria for defining the health of a swarm are not yet determined, but basically the larger the swarm, the healthier it is.)
Outside the dashed rounded square are modules for communication, i.e. sending local measurement results to other nodes and receiving remote results from them. The remote results are further processed and put into the three swarm lists (healthy, unhealthy, and dead).
Conflicting categorization: what if a swarm is said to be healthy by one node but unhealthy by others? The simplest solution is to ignore the conflict and put it into both categories. Before swarms are moved back to the checking queue, duplicates across the two lists shall be removed.

@synctext
Member Author

Would suggest to remove the checking of dead swarms. Additionally, split the checking of healthy and unhealthy swarms. Use a three-strikes algorithm for swarms: a swarm is declared dead and will never be checked again if zero seeders and zero leechers are found three times, with at least 24h between measurements.
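A minimal sketch of such a three-strikes rule (data structures and names are illustrative, not existing Tribler code):

```python
# Sketch of the proposed three-strikes rule: a swarm is declared dead (and never
# checked again) after three checks that found 0 seeders and 0 leechers, with at
# least 24 hours between those checks.
import time

DAY = 24 * 60 * 60
strikes = {}      # infohash -> (strike_count, time_of_last_strike)
dead = set()      # infohashes that will never be checked again

def record_check(infohash, seeders, leechers, now=None):
    now = now or time.time()
    if seeders > 0 or leechers > 0:
        strikes.pop(infohash, None)       # any activity resets the strikes
        return
    count, last = strikes.get(infohash, (0, 0.0))
    if now - last < DAY:
        return                            # too soon; this empty result does not count
    count += 1
    strikes[infohash] = (count, now)
    if count >= 3:
        dead.add(infohash)                # three strikes: declared dead
```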

@MaChengxin
Contributor

Would suggest to remove the checking of dead swarms.

As shown in the figure, only healthy and unhealthy swarms will be put back in the checking queue. The dead swarm list, colored in red, is like a black hole from which no swarm can escape.

Additionally, split the checking of healthy and unhealthy swarms.

It also seems to work if I check the healthy and unhealthy swarms together and, after checking, tag them differently to indicate their status.

@MaChengxin
Contributor

figure_2

This figure shows the experiment result of the initially proposed architecture (with one checking queue and two standby pools).

Experiment setup:
Number of swarms to check: 162
Number of nodes: 10

The results are:
Checks done by each node: around 40
Local unique swarms: 35
Total unique swarms: 159
Time: around 2 minutes
Turning point of growth of unique swarms: around 65 seconds
Corresponding number of checks: 65/120*400 = 217, for each node: around 22

The conclusion is that using 10 nodes, each of them only needs to check about 22 swarms to cover (almost) all the swarms.
Coverage efficiency: 159/217 * 100% = 73.3%
(Coverage efficiency is defined as the number of unique swarms divided by the total number of checks.)

If there were only one node, the theoretical coverage efficiency would be 100% (namely, no duplicate checks). However, checking by one node would be time-consuming: by the time 10 nodes have covered almost all the swarms (159), a single node has only checked around 25 swarms.

Therefore, we can sacrifice some coverage efficiency to achieve high speed of checking.
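Restating the arithmetic above:

\text{checks at the turning point} \approx \frac{65}{120} \times 400 \approx 217, \qquad \text{coverage efficiency} = \frac{159}{217} \approx 73.3\%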

@synctext
Member Author

synctext commented Jul 18, 2017

First task: spend a few days writing problem description thesis chapter and intro.
(possible storyline: numerous scientific papers require content popularity as a critical input. We want a distributed YouTube. However, determining distributed state in a distributed system is hard. Very few designs have been proposed for this problem.)

Simple model:

  • Time between swarms checks

  • Cost in Bytes for an average swarm size check

  • Number of peers you are sharing swarm size data with

  • Size in bytes of a UDP packet when sharing N swarm size checks

  • Total number of swarms in the known universe / the universe we check regularly

  • Cycle time is the time after which a swarm will be re-checked for content popularity

Some simple parameters:

  • Every 30 seconds, check a random swarm not yet checked in the past 96 hours by you or others (120 checks/hour/peer)
  • Bandwidth usage is magnet DHT resolution, .torrent download, and the first content piece, guesstimated to be a total of 500 KByte.
  • Demanding the actual first piece of content creates a basic level of attack-resilience (hash_check)
  • Filling an entire UDP packet (1480 Bytes on Ethernet) with swarm checks (20 Bytes SHA1, 2 Bytes seeders, 2 Bytes leechers, 2 Bytes how many seconds ago the check was done) {1480 / (20+2+2+2) = 56 entries} (a packing sketch follows this list)
  • Share results of various last checks with a semi-fixed group of peers every 5 seconds {in a one-by-one manner} (thus 720 incoming checks/hour with Max. 56 entries = 40320)
  • Key question: how big is the group you share with?
  • Optimization criteria: cost (both checking and sharing) versus amount of known content popularity
  • Key problem is to devise a distributed algorithm to avoid double checking (vulnerable to misbehaving peers) or brute force it randomly with known overlap.
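A sketch of how these 26-byte entries could be packed into a single UDP payload with Python's struct module; the field order and the use of signed seeder/leecher fields (to allow the -1 dead-swarm marker mentioned below) are assumptions:

```python
# Sketch of packing swarm-check entries into one UDP payload, following the
# 26-byte record layout proposed above: 20-byte SHA1 infohash, 2 bytes seeders,
# 2 bytes leechers, 2 bytes "seconds since check". Field order is an assumption;
# seeders/leechers are signed shorts so that -1 can mark dead/timeout swarms.
import struct

ENTRY_FMT = '!20shhH'                       # network byte order, 26 bytes per entry
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)     # 26
MAX_ENTRIES = 1480 // ENTRY_SIZE            # 56 entries per 1480-byte packet

def pack_checks(checks):
    """checks: list of (infohash_bytes, seeders, leechers, age_seconds)."""
    return b''.join(struct.pack(ENTRY_FMT, infohash, seeders, leechers, age)
                    for infohash, seeders, leechers, age in checks[:MAX_ENTRIES])

def unpack_checks(payload):
    return [struct.unpack(ENTRY_FMT, payload[i:i + ENTRY_SIZE])
            for i in range(0, len(payload), ENTRY_SIZE)]
```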

Code:

  • refactor the torrentchecker
  • add new ability to check size of torrent by downloading first piece
  • more costly in terms of bandwidth, but gives a wealth of statistics, is attack-resilient, and only relies on the DHT.
  • Keep it simple: store as little as possible in the database
    • keep checked swarms in memory only (results from a previous session are stale and therefore useless)
    • grow this list with one entry every 30 seconds, until it reaches a max of 56 entries.
    • 3-minute timeout for first piece download.
    • You start checking after 30 sec and always have something to share after 3 min 30 sec
    • dead/timeout swarms have seeders -1 and leechers -1, not zero.
    • store incoming swarm checks in a simple database (how to control growth of this DB? remove dead swarms?)
  • test on Gumby using Leaseweb servers

@devos50
Contributor

devos50 commented Jul 18, 2017

You can use one of the Leaseweb servers for testing purposes; we should disable the exit nodes running on these servers for one or two days. Technical specifications of one of the servers: https://gyazo.com/5eb8aa07d0a767afc7586acdc787ac7f (I would say you can run up to 24 Dispersy instances on one machine, but you could try to scale this up using system load statistics).

@MaChengxin
Contributor

Filling an entire UDP packet (1480 Bytes on Ethernet) with swarm checks (20Byte SHA1, 2 Bytes Seeders, 2 Bytes leechers, 2 Bytes how many seconds ago check was done) {1480/ (20+2+2+2) = 56 entries}

According to Wikipedia, the maximum data length of a UDP packet is 65,507 bytes. Where does the 1480 bytes come from?

The advantage of filling an entire UDP packet is that it reduces the cost of communication. However, this also means some data will not be sent out until a full UDP packet is filled. The consequence of such a delay might be that other nodes will do a double check because they don't know someone has already done the task.
I will first try the eager-sharing strategy (share a result as soon as it is generated) and see if the communication cost is acceptable or not.

Share results of various last checks with a semi-fixed group of peers every 5 seconds.

Can we just gossip the last checks as soon as we have the results instead of forming a group and sharing the info within the group?

@qstokkink
Contributor

1480 bytes: https://stackoverflow.com/a/10239583

@MaChengxin
Contributor

Link to my thesis: https://github.com/MaChengxin/msc-thesis
It is currently so badly written that only I can understand what it is talking about. I will improve its readability.

@synctext
Member Author

synctext commented Aug 23, 2017

Comments on first thesis draft

  • introduction
    Companies like Facebook continuously need to decide what to show to their users, primarily in their news feed. The problem, as shown, is that a single profit-driven company now controls media access for 2 billion people. These algorithms work as a closed black box, without any accountability or consideration for public health, as seen in the emotional contagion experiments. With the dramatic social media usage levels of teens, the question has even been posed: "has the smartphone destroyed a generation?"
  • problem description
    Our aim is to create a media system which is not controlled by a single commercial entity. We want a distributed YouTube/Facebook/Twitter media system. Within this grand challenge we focus on estimating popularity. Knowing how popular content is without a central accounting entity is the key difficulty. Determining distributed state in a distributed system is hard. Our scientific goal is to create a content popularity estimation algorithm without any central entity, controlling server, or organisational embedding. Key related work is the DAO investment organisation, which does not contain any central element but was vulnerable to attack due to multiple programming bugs. Attack-resilience is also our key concern: an attacker should not be able to easily influence the content popularity. As academic studies have shown, attacks on the media are real. With current social media technology, fake news is shown to be substantial. Fraud with media systems has been studied extensively and fake accounts have been detected. Our aim is to devise open source algorithms and open innovation processes for distinguishing honest entities from attackers.
  • Content popularity in decentralised social media
    • Our experimental research requires an operational system to expand with attack-resilient content popularity estimation algorithms and to evaluate their effectiveness. Very few examples exist of decentralised social media systems. Over the past decades, systems like Gnutella, Freenet, Publius, Eternity, Vanish, Bittorrent, Tribler, and Tangler have been developed. However, very few systems have seen any popularity and continued active development. Tribler rules! :-)
    • Existing methods and algorithms
    • new proposal: Measure various statistics when downloading a first swarm piece. Guess popularity.
    • initial 162 swarms results, how accurate is it ? +- 50% ?
  • global system design and implementation
    In the previous chapter we looked at our content popularity measurement algorithm. We now build a complete system out of this single component. Due to the limited scope of our work, the accuracy is not yet very high.
  • evaluation and experiments
    If it works on Android, the experiments are much more impressive (and don't need to be very extensive). {two months spend on Android; redo Paul .apk code}

@synctext
Member Author

synctext commented Sep 13, 2017

Possible first sentence of chapter 2: Our aim is to create a media system which is not controlled by a single commercial entity.
Content discovery, browsing, and search are critical for any media system. A key part of that is content popularity. The front page of a newspaper contains the most important news. In the digital age it is difficult to determine what is the most popular website, YouTube clip, tweet, blog post, Wikipedia edit, or discussion comment, especially in a distributed system. See the most popular tweet:
image
3,693,759,511 views: https://www.youtube.com/watch?time_continue=5&v=kJQP7kiw5Fk

possible Chapter 3: Within this thesis we expanded an Internet-deployed open source distributed media system with content popularity. We focus on video because of the popularity of Youtube, Netflix, and Bittorrent.
3.1 The Tribler social media system
3.2 Content discovery and popularity in Bittorrent
3.3 Measuring popularity and attacks
3.5 : Create own chapter for this design and measurement work.
Pearson correlation coefficient for 162 swarms?

Basic graphs of content popularity gossip community. Keep to the default Gumby plots. Screenshots of Android. Plots of Android. thesis DONE.

@MaChengxin
Contributor

MaChengxin commented Oct 2, 2017

ch4.pdf
Up-to-date Chapter 4

To be discussed: outline of Chapter 5; experiments to be included in Chapter 5
Current outline:

  • Rationale
  • Some theoretical analysis (based on simulation instead of pure mathematical analysis)
  • Implementation using Tribler

For Chapters 1 to 3, I think it's better to revise them after finishing Chapters 4 and 5, so that I will know which aspects to emphasize.

@synctext
Member Author

synctext commented Oct 2, 2017

Comments:

  • another potential problem description chapter add-on: Twitter needs to keep its sponsors happy. Olympics, trouble with branding.
  • Problem Description. First Decentral Social Media, Determining distributed state in a distributed system is hard. Then sub-problem to describe is content popularity.
  • Section 3.1 keep it short to 2 Max. pages (you're not a Tribler developer). + explain when presenting community results.
  • Attacks on BitTorrent. Keep in sync with the thesis title, attack-resilience of BitTorrent. Make it positive: it's pretty attack-resilient. Paragraphs are too short; merge all into 1.
  • Chapter 4: informative title. Plus few intro sentences with keywords like: decentral, swarm sampling, sharing results. Mentioning 1936 distracts from storyline.
  • Language bindings of Libtorrent are not important.
  • max. figure size 4.1
  • has double text, below and above figure.
  • rename num_peers
  • Section 4.1: Limit the why you do this experiment. not 2 pages of engineering details.
  • remove all these details. My code can be found here.
  • Please remove trivial Algorithm 1.
  • Split single metric measurements and combining swarm metrics (voting is confusing).
  • Tables are Appendix details
  • Nobody in the system knows the Big Picture; everyone only has partial knowledge of it: Chapter 1 and 2 content.
  • focus is a lot on 95+ % accuracy. Real system also needs to deal with 100k items and dead items. Show that in Chapter 5 and DONE. Get degree!

@synctext
Member Author

First prototype deployment, moving work to new issue.
