Rafayel Mkrtchyan edited this page Oct 13, 2015 · 32 revisions

Welcome to the GSoC CernVM-FS Wiki!

Index

* [Abstract](https://github.com/MicBrain/GSoC_CernVM-FS/wiki/Home#abstract)
* [Personal Information](https://github.com/MicBrain/GSoC_CernVM-FS/wiki/Home#personal-information)
* [Background](https://github.com/MicBrain/GSoC_CernVM-FS/wiki/Home#background)
* [Proposal](https://github.com/MicBrain/GSoC_CernVM-FS/wiki/Home#proposal)
* [Timeline](https://github.com/MicBrain/GSoC_CernVM-FS/wiki/Home#timeline)
* [Works Cited](https://github.com/MicBrain/GSoC_CernVM-FS/wiki/Home#works-cited)

Abstract

The CernVM-FS software system offers a rich set of features, but it still needs additional software tools to make the technology more advanced and to improve its performance. This proposal suggests implementing new ideas and reworking some portions of previously written code. The approach is to propose new techniques that will make the CernVM File System a better, stronger and faster product.

Personal Information

My name is Rafayel Mkrtchyan and I am currently a second-year international student at the University of California at Berkeley. I intend to double major in Computer Science and Applied Mathematics and minor in Design Innovation. I expect to graduate from UC Berkeley in May 2017. I currently live in Berkeley, CA, USA, but I plan to live in Yerevan, Armenia, which is in the UTC+04:00 time zone, for the entire duration of the summer. I can be reached at rafamian@berkeley.edu and 1- (310) - 347 - 5442.

Proposal

Proposal Title: HTTP/2 Support for CernVM File System

Proposal Description: All the projects on your “Ideas” page are very interesting to me, and it was hard to choose the task I want to contribute to. In the end I decided to work on the “HTTP/2 Support” open-source project, because I think my knowledge and experience are most relevant to that one. Moreover, the use of the most recent technologies in this software system inspires me a lot, because it gives me the opportunity to become one of the first developers to learn and deploy them.

Because performance is one of the primary focuses of this software system, I plan to keep the implementation as low-level as possible. Besides C++, I will most probably also use C to implement some portions of the project. The reason for using C is that the libcurl library provides fairly well-developed C interfaces for vectorized and asynchronous transfers. C++, on the other hand, could be very practical for extending the file system’s meta-data [2] and its code with a prefetch thread.

Currently the libcurl library is working to support both the HTTP/1.1 and HTTP/2 protocols. For instance, the --http1.1 option enables the use of HTTP/1.1, which is the version preferred internally by default [3], and --http2 tells curl to issue its requests using HTTP/2. Unfortunately, libcurl does not yet provide an option that lets developers use a multiplexed connection rather than parallel connections. However, this functionality is on the libcurl team’s TODO list [4] and will hopefully be available soon. Also, libcurl is not flexible enough to have both HTTP/2 and HTTP/1.1 connections at the same time, because HTTP/2 support is still at an early stage and lacks a few important things. That is why I plan to design a strategy that associates parallel connections with the HTTP/1.1 protocol and multiplexed connections with the HTTP/2 protocol. The program will first try to use multiplexed HTTP/2 connections. The core idea is to check whether multiplexed requests return erroneous responses or content that differs from the requested files. In that case we can always fall back to parallel connections over HTTP/1.1.
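The fallback strategy described above can be sketched in a few lines of C++. This is only a minimal illustration under stated assumptions: `Transport`, `FetchResult`, `Fetcher`, `Verifier` and `FetchWithFallback` are hypothetical names invented for this sketch, not part of CernVM-FS or libcurl, and the actual transfer and the content check (for example a comparison against a known content hash) are injected as callbacks rather than implemented.

```cpp
#include <cassert>
#include <functional>
#include <string>

// All names below are hypothetical; none of this is CernVM-FS or libcurl API.
enum class Transport { kHttp2Multiplexed, kHttp1Parallel };

struct FetchResult {
  bool ok;           // true if the transfer finished without a transport error
  std::string body;  // payload that was received
};

// Performs the transfer for a URL over the requested transport.
using Fetcher = std::function<FetchResult(const std::string &url, Transport t)>;
// Checks the payload, e.g. against a known content hash.
using Verifier = std::function<bool(const std::string &body)>;

struct Outcome {
  Transport used;      // transport of the last attempt
  FetchResult result;  // result.ok is true only if the payload also verified
};

// Try HTTP/2 multiplexing first. If the transfer fails or the payload does
// not verify, retry once over HTTP/1.1 parallel connections.
Outcome FetchWithFallback(const std::string &url, const Fetcher &fetch,
                          const Verifier &verify) {
  assert(fetch && verify);
  FetchResult r = fetch(url, Transport::kHttp2Multiplexed);
  if (r.ok && verify(r.body)) {
    return {Transport::kHttp2Multiplexed, r};
  }
  r = fetch(url, Transport::kHttp1Parallel);
  r.ok = r.ok && verify(r.body);
  return {Transport::kHttp1Parallel, r};
}
```

Injecting the fetcher keeps the decision logic testable without a network; in a real implementation the callback would presumably wrap the download manager's libcurl handles.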

I believe that CernVM-FS’s download manager would be more convenient for users if it supported not only a vectorized interface but also an asynchronous one. I came up with this idea when I learned that libcurl has a well-developed multi interface. Because libcurl supports asynchronous transfers, we can make the download manager more functional and practical. Changes to the JobInfo structure [5] can make the whole process more user-friendly. This idea is just an additional suggestion, which still needs a little more research. To extend the CernVM-FS download manager with a vectorized interface, I plan to implement a tool that downloads files in several threads. The number of threads will depend on throughput. Even though the HTTP standard specifies a low limit, many servers do not enforce it, so each time we can choose a reasonable number of threads that maximizes the speed of the whole procedure.
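One way to make the thread count depend on throughput is a simple hill-climbing rule: add a thread while the aggregate throughput keeps improving noticeably, remove one when it drops, and otherwise hold. The sketch below is an assumption about how such a policy could look, not a description of existing CernVM-FS code; the function name `NextThreadCount`, its parameters and the 5% default threshold are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical policy, not CernVM-FS code.
// current     : number of download threads used in the last interval
// prev_bps    : aggregate throughput measured before the last adjustment
// cur_bps     : aggregate throughput measured in the last interval
// max_threads : hard upper bound chosen by the caller
// min_gain    : relative change that counts as significant (assumed 5%)
int NextThreadCount(int current, double prev_bps, double cur_bps,
                    int max_threads, double min_gain = 0.05) {
  assert(current >= 1 && max_threads >= 1);
  if (prev_bps <= 0.0) {
    // No baseline measurement yet: probe upward.
    return std::min(current + 1, max_threads);
  }
  const double gain = (cur_bps - prev_bps) / prev_bps;
  if (gain >= min_gain) return std::min(current + 1, max_threads);
  if (gain <= -min_gain) return std::max(current - 1, 1);
  return std::min(current, max_threads);  // change too small: hold
}
```

The download manager would call this once per measurement interval and resize its worker pool to the returned value; the cap keeps the client from opening an unreasonable number of connections to one server.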

It is also essential to understand how to extend the CernVM-FS file system code with a prefetch thread. After getting familiar with the techniques and implementation details used in cvmfs/cvmfs/ [6], I realized that this could be a good starting point. Here we can implement the logic for prefetching a given list of data chunks. My strategy is to prefetch the data as accurately as possible and early enough to reduce or eliminate latency. Deciding where to store the prefetched data is an important question too. I believe a convenient solution would be to store prefetched data in a “prefetch buffer”, which would be part of the file system’s metadata. Even though this technique requires a slightly more sophisticated memory system design, it separates non-prefetched data from prefetched data and therefore causes no cache pollution.
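To make the “prefetch buffer” idea concrete, here is a minimal sketch of a bounded buffer that holds prefetched chunks separately from the regular cache. It is an illustration under assumptions: the class name `PrefetchBuffer`, the string chunk identifiers and the oldest-first eviction are hypothetical choices for this sketch, and thread safety (needed once a background prefetch thread fills the buffer) is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch, not CernVM-FS code. Not thread-safe.
class PrefetchBuffer {
 public:
  explicit PrefetchBuffer(std::size_t capacity_bytes)
      : capacity_(capacity_bytes) {
    assert(capacity_bytes > 0);
  }

  // Stores a prefetched chunk, evicting the oldest prefetched chunks if
  // space is needed. Returns false if the chunk is larger than the whole
  // buffer or is already present.
  bool Put(const std::string &chunk_id, std::string data) {
    if (data.size() > capacity_ || index_.count(chunk_id)) return false;
    while (used_ + data.size() > capacity_) EvictOldest();
    used_ += data.size();
    order_.emplace_back(chunk_id, std::move(data));
    index_[chunk_id] = std::prev(order_.end());
    return true;
  }

  // On a demand hit, moves the chunk out of the buffer so the caller can
  // promote it into the regular cache. Returns false on a miss.
  bool Take(const std::string &chunk_id, std::string *out) {
    auto it = index_.find(chunk_id);
    if (it == index_.end()) return false;
    *out = std::move(it->second->second);
    used_ -= out->size();
    order_.erase(it->second);
    index_.erase(it);
    return true;
  }

  std::size_t used_bytes() const { return used_; }

 private:
  using Entry = std::pair<std::string, std::string>;  // (chunk id, data)

  void EvictOldest() {
    used_ -= order_.front().second.size();
    index_.erase(order_.front().first);
    order_.pop_front();
  }

  std::size_t capacity_;
  std::size_t used_ = 0;
  std::list<Entry> order_;  // oldest prefetched chunk at the front
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

Because a wrong prefetch only ever evicts other prefetched chunks, data that was actually requested stays in the regular cache, which is the "no cache pollution" property argued for above.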

Timeline

Prework Period: Even though HTTP/2 has many advantages, such as using fewer TCP connections, it is still at a developmental stage and considered a work in progress. Therefore, we should be very careful with HTTP/2, because it might still contain some flaws. That is why I plan to spend my pre-work period learning how to use the full power of this technology. Researching libcurl will be another primary goal during that period, because HTTP/2 support in libcurl is not secure enough. For example, when the developer uses the https:// or http:// schemes, libcurl can stop using HTTP/2 and go back to HTTP/1.1.

Weeks 1 - 3: Extend and modify the CernVM-FS download manager so that it supports multiple download requests and responses at the same time, using the throughput-based multi-threading strategy that I described in the “Proposal Description” section. Additionally, write extensive unit tests to make sure that the whole process works correctly and performs as efficiently as possible.

Weeks 4 - 6: Implement the tool that decides whether to use parallel connections or multiplexed connections for file transfers. This tool will try to use a multiplexed connection and will go back to parallel connections only if it finds anomalies related to HTTP/2. I will use real-world and application testing methods to make sure that the strategy works accurately. An important thing to note here is that even though HTTP/2 is considered one of the fastest growing technological trends, the number of servers that support the HTTP/2 protocol is unfortunately very limited [7]. However, to examine the general procedures we can certainly use “easy to use” web servers like Mongoose, Fenix Web Server, Aprelium or other similar web servers.

Weeks 7 - 8: Develop the download manager’s replication module to use the vectorized interface. I will try to build a prototype of this module in the 7th week, which we can discuss and test before the 8th week. Right now I have several ideas about how to approach this problem. However, before the Work Period I will have a solid understanding of the best possible approach. This is closely related to the work that I plan to do in Weeks 1 - 3. Basically, I am going to add some new classes and modify certain methods in the cvmfs/cvmfs/ directory so that we can replicate files easily and effectively.

Weeks 9 - 12: Write functional code that adds new capabilities to the CernVM-FS file system, such as the ability to prefetch a provided list of data chunks in the background, and that extends the system’s meta-data with a storage area that saves prefetched data. I plan to think this through very carefully, because these capabilities involve conceptual challenges.

Weeks 13 - 14: Write advanced unit, integration and application tests to detect as many system anomalies as possible and to make sure that the whole project is free of known bugs. Also write a well-documented wiki with relevant information for users and for other developers. If applicable, create and update application descriptions for the public.

Works Cited

  1. https://github.com/MicBrain
  2. http://research.microsoft.com/pubs/72896/fast07-final.pdf
  3. http://curl.haxx.se/docs/manpage.html
  4. http://curl.haxx.se/dev/readme-http2.html
  5. https://ecsft.cern.ch/dist/cvmfs/doc/html/df/da1/structdownload_1_1_job_info.html
  6. https://github.com/cvmfs/cvmfs/tree/devel/cvmfs
  7. http://daniel.haxx.se/blog/2015/02/10/http2-is-at-5/