# Buffered versus unbuffered I/O

## Objective

For this laboratory, we will implement transliteration programs `tr2b` and `tr2u` that use buffered and unbuffered I/O respectively, and compare the resulting implementations and performance.

## Specification

Each implementation will be a **main program** that takes **two** operands, `from` and `to`, which are **byte strings** of the same length. The program reads standard input, replaces every byte that occurs in `from` with the byte at the same index in `to`, and writes the result to standard output.

The implementations will report an error if `from` and `to` are not of the same length, or if `from` has duplicate bytes. 
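
The two operand checks above can be sketched as follows. This is a minimal sketch, not the code actually submitted for the lab; the helper name `check_operands` and its return convention are assumptions made for illustration.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not from the lab code): returns NULL when the
   operands are acceptable, or a short diagnostic message otherwise. */
const char *check_operands(const char *from, const char *to)
{
    size_t len = strlen(from);
    if (len != strlen(to))
        return "from and to must have the same length";

    bool seen[256] = { false };   /* one flag per possible byte value */
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)from[i];
        if (seen[b])
            return "from contains duplicate bytes";
        seen[b] = true;
    }
    return NULL;
}
```

A `main` function could call `check_operands(argv[1], argv[2])` after confirming that `argc` is 3, print any returned message to `stderr`, and exit with a nonzero status.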

To summarize, the implementations will act like the standard utility `tr`, except that they have no options; characters like \[, - and \\ have no special meaning in the operands; operand errors will be diagnosed; the implementations act on bytes rather than on (possibly multibyte) characters.

### Background

A **byte string** is in essence the internal representation of a string as it is stored in the computer. The commonly known UTF-8 and US-ASCII are **character encodings** that convert between a string and its byte string; Unicode is the character set that encodings such as UTF-8 represent. Depending on the encoding, the same string can be encoded into different byte strings; similarly, the same byte string can be decoded into different strings. When we see garbage characters (mojibake) when opening a file, it is usually because the file was decoded with an encoding other than the one it was written in.
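
As a small illustration of the difference between bytes and characters, the sketch below counts UTF-8 characters by skipping continuation bytes. The helper name `utf8_char_count` is hypothetical and the function assumes valid UTF-8 input; it is not part of `tr2b` or `tr2u`, which only ever see bytes.

```c
#include <stddef.h>

/* Hypothetical helper: counts UTF-8 characters (code points) by skipping
   continuation bytes, which have the bit pattern 10xxxxxx.
   Assumes the input is valid, NUL-terminated UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For example, the character `é` is the two bytes `0xC3 0xA9` in UTF-8, so `strlen("\xC3\xA9")` is 2 while `utf8_char_count("\xC3\xA9")` is 1. Decoding those same two bytes as ISO-8859-1 yields the two characters `Ã©`, which is a typical case of mojibake.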

This [answer](https://stackoverflow.com/a/31322359) on Stack Overflow explains the concept well.

## Lab Log

### Create the program: tr2u, tr2b

- Because a hash table could significantly speed up both programs, I first looked for a hash table library in C.

1. Unfortunately, the C standard library does not seem to include a hash table implementation. So, we are going to write a simpler but rather inefficient algorithm.

1. Now we will start building tr2b and tr2u.

1. First, we need to parse the arguments from the command line.

1. After checking that there are two operands, we will then check that `from` and `to` are of the same length and that `from` has no duplicate bytes.

1. As part of the main algorithm, we need a way to find if the current character being processed is in the `from` string; if it is, then we need to get its index, so we can find the character it needs to be converted into. We call this function `findChar`.

1. After finishing `findChar`, we can start writing the main part of the algorithm.

1. When writing the main algorithm, we decided that instead of just finding the index of the character in `findChar`, we will let the function simply return the result of the character conversion. The function is now renamed `convertChar`.

1. The rest of the work is straightforward. Iterate through each character from `stdin` obtained through `getchar`, convert it, and then output it to `stdout` with `putchar`.

1. Tested the program. Passed. Then copied its content into tr2u.c as the starting template. Changed buffered I/O into unbuffered I/O. Tested. Passed.
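
The conversion step and the buffered loop described in the log can be sketched roughly as below. The name `convertChar` comes from the log, but its signature here is an assumption, and `translate_buffered` is a hypothetical wrapper added for illustration. It uses `getc` and `putc` on explicit streams; with `stdin` and `stdout` these behave like `getchar` and `putchar`.

```c
#include <stdio.h>

/* Sketch of convertChar (signature assumed): linear search through from;
   returns the byte at the same index in to, or c unchanged if not found. */
int convertChar(int c, const char *from, const char *to)
{
    for (size_t i = 0; from[i] != '\0'; i++) {
        if ((unsigned char)from[i] == c)
            return (unsigned char)to[i];
    }
    return c;
}

/* Hypothetical wrapper: copies in to out through stdio buffers,
   converting each byte. Returns 0 on success and -1 on an I/O error. */
int translate_buffered(FILE *in, FILE *out, const char *from, const char *to)
{
    int c;
    while ((c = getc(in)) != EOF) {
        if (putc(convertChar(c, from, to), out) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

In a program shaped like this, `main` would validate the operands and then call `translate_buffered(stdin, stdout, argv[1], argv[2])`. The linear search costs time proportional to the length of `from` for every input byte, which is the inefficiency mentioned earlier in the log.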

### Performance testing

We will first generate a 5,000,000-byte file to test the performance of `tr2u` and `tr2b`.

In [None]:
dd if=/dev/zero of=test.txt count=5000 bs=1000

Following the lab instructions, we will use `strace` to count the system calls made when running these two programs.

In [None]:
strace -o strace_tr2b ./tr2b.out '1' '2' < test.txt > output_tr2b

In [None]:
strace -o strace_tr2u ./tr2u.out '1' '2' < test.txt > output_tr2u

As the traces show, the buffered version of the program makes significantly fewer system calls than the unbuffered version.
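
To see where the difference comes from, here is a minimal sketch of an unbuffered loop, assuming that `tr2u` reads and writes exactly one byte per call. The function name `translate_unbuffered` and the call counter are hypothetical additions for illustration and are not taken from the lab code.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical sketch of a tr2u-style loop: one read() and one write()
   per input byte. Returns the number of read/write calls issued,
   or -1 on an I/O error. */
long translate_unbuffered(int in_fd, int out_fd, const char *from, const char *to)
{
    long calls = 0;
    unsigned char byte;
    ssize_t n;

    while ((n = read(in_fd, &byte, 1)) > 0) {
        calls++;                              /* the read that returned a byte */
        for (size_t i = 0; from[i] != '\0'; i++) {
            if ((unsigned char)from[i] == byte) {
                byte = (unsigned char)to[i];
                break;
            }
        }
        if (write(out_fd, &byte, 1) != 1)
            return -1;
        calls++;                              /* the write of the converted byte */
    }
    if (n < 0)
        return -1;
    return calls + 1;                         /* the final read that reported end of input */
}
```

Under this assumption, a 5,000,000-byte input needs about 10,000,000 `read` and `write` calls. A stdio-based loop instead transfers a whole buffer per call, so with a buffer of a few kilobytes the number of calls would be on the order of thousands.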

Now we are going to compare their performance when copying a file to the terminal. Again we will use `test.txt`, which contains 5,000,000 bytes.

In [None]:
strace -c ./tr2b.out '1' '2' < test.txt

In [None]:
strace -c ./tr2u.out '1' '2' < test.txt

Compared with redirecting the output to a file, writing the output to the terminal results in significantly more system calls, especially for the buffered version, most likely because stdio flushes its output more often when standard output is a terminal.

Now we are going to time these two programs. Because `tr2b` needs far fewer system calls to complete, we predict that `tr2b` will take much less time to finish than `tr2u`.

In [None]:
time ./tr2b.out '1' '2' < test.txt > strace_tr2b

In [None]:
time ./tr2u.out '1' '2' < test.txt > strace_tr2u