##### Introduction to Information Theory (Fall 2023/4)

# Home Assignment 4

#### Topics:
- Lossless compression

### 1. Operational Capacity of a Noiseless Channel
1. Prove that for any channel $P_{Y^n|X^n}$ we have $C_{op} \leq  \log|\mathcal X|$ without using the channel coding theorem (that is, directly from the definition). Use the following steps:
 - Suppose that there exists a $(2^{nR},n)$-coding scheme with $R >\log|\mathcal X|$. Find the number of messages that cannot be uniquely mapped to a channel input.
 - Find the proportion of this number of messages out of the entire message space.
 - Argue that the average error in decoding is at least half this proportion.
 - Complete the proof.




#### <span style="color: LIGHTgreen;">Answer</span>

Suppose there exists a $(2^{nR},n)$-coding scheme with $R>\log|\mathcal{X}|$, i.e. a scheme that attempts to transmit at a rate higher than the logarithm of the input alphabet size. By definition of a $(2^{nR},n)$-coding scheme, the number of distinct messages to be sent over $n$ channel uses is $2^{nR}$. However, the number of distinct input sequences that can be formed from the input alphabet over $n$ channel uses is only $|\mathcal{X}|^n$.  
Since $R>\log|\mathcal{X}|$, we have $2^{nR}>|\mathcal{X}|^n$, which means that there are more messages that we wish to send than there are input sequences available: $$\text{Number of messages that cannot be uniquely mapped to a channel input} \geq 2^{nR}-|\mathcal{X}|^n$$ Hence, the proportion of these messages out of the entire message space is at least: $$\frac{2^{nR}-|\mathcal{X}|^n}{2^{nR}}=1-\frac{|\mathcal{X}|^n}{2^{nR}}$$
Now, the messages that cannot be uniquely mapped (whose proportion we just identified) cannot be reliably decoded because of ambiguity: messages mapped to the same input sequence induce exactly the same distribution on the received sequence, so the decoder cannot tell them apart. If $k \geq 2$ messages share an input sequence, then for any received sequence the decoder's output is correct for at most one of them, so the error probability averaged over these $k$ messages is at least $1-\frac{1}{k} \geq \frac{1}{2}$. In the mildest case of two messages per shared sequence, the chance of choosing the correct one is $0.5$, i.e. a 50% error rate. Since at least a fraction $1-\frac{|\mathcal{X}|^n}{2^{nR}}$ of the messages lies in such groups, the average decoding error is at least half this proportion: $$P_{\mathrm{err}}^{(n)} \geq \frac{1}{2}\left(1-\frac{|\mathcal{X}|^n}{2^{nR}}\right)$$  
Therefore, if $R>\log|\mathcal{X}|$, a non-vanishing portion of the message space cannot be uniquely mapped to the input sequences. Because $\frac{|\mathcal{X}|^n}{2^{nR}} = 2^{-n(R-\log|\mathcal{X}|)} \to 0$, the lower bound on the average decoding error (half the proportion of excess messages) tends to $\frac{1}{2}$ as $n \to \infty$. This contradicts the requirement for reliable communication, which needs the average decoding error probability to tend to zero as $n$ tends to infinity. Thus, for any channel $P_{Y^n|X^n}$, reliable communication is impossible at any rate $R$ exceeding $\log|\mathcal{X}|$. Hence, $C_{op} \leq  \log|\mathcal X|$.
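
As a quick numerical illustration of this bound (a sketch only: the helper name `error_lower_bound` and the sample values $|\mathcal X| = 2$, $R = 1.1$ are illustrative assumptions, not part of the assignment), the following evaluates half the proportion of excess messages, $\frac{1}{2}\left(1-\frac{|\mathcal X|^n}{2^{nR}}\right)$, for growing $n$:

```python
import math

def error_lower_bound(alphabet_size: int, rate: float, n: int) -> float:
    """Half the fraction of messages that cannot get their own channel input.

    Illustrative helper: evaluates 0.5 * (1 - |X|^n / 2^(nR)),
    and returns 0 when R <= log2|X| (no excess messages).
    """
    max_rate = math.log2(alphabet_size)
    if rate <= max_rate:
        return 0.0
    # |X|^n / 2^(nR) = 2^(-n (R - log2|X|)); working in the exponent avoids huge integers.
    ratio = 2.0 ** (-n * (rate - max_rate))
    return 0.5 * (1.0 - ratio)

# Binary alphabet, rate slightly above log2(2) = 1 bit per channel use.
for n in (1, 10, 100, 1000):
    print(n, error_lower_bound(2, 1.1, n))
```

The printed values increase towards $\frac{1}{2}$, so the average error cannot tend to zero for this rate.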

2. Consider a channel with identical input and output alphabets $\mathcal X$. Assume that for some $m \in \mathbb N$ the channel is noiseless, i.e. 
$$
P_{Y^m|X^m}(y^m|x^m) = \begin{cases} 1 & x^m = y^m \\
0 & x^m \neq y^m.
\end{cases}
$$
Show that the operational capacity of this channel is $C_{op} = \log |\mathcal X|$ by proving achievability and converse claims (you are not allowed to use the channel coding theorem).

#### <span style="color: LIGHTgreen;">Answer</span>

#### Achievability
* Claim: It is possible to achieve a rate of $R=\log|\mathcal{X}|$ bits per channel use with arbitrarily low error probability for this noiseless channel.  $\\$
* Proof: $\\$
Since the channel is noiseless, the received block $y^m$ always equals the transmitted block $x^m$, so the transmission has no errors. This means that one can encode a message directly as a sequence of symbols in $\mathcal{X}^m$ and expect the same sequence at the output.  
With $m$ channel uses we can send any of the $|\mathcal{X}|^m$ possible sequences, so the message set can have size $M=|\mathcal{X}|^m$. Hence, the rate $R$ in bits per channel use is:
$$R=\frac{1}{m}\log M=\frac{1}{m}\log|\mathcal{X}|^m=\log|\mathcal{X}|$$
For each use of the channel, $\log|\mathcal{X}|$ bits of information can be transmitted without error (since there is no noise).  
Finally, since $P_{Y^m|X^m}(y^m|x^m) = 1$ whenever $y^m = x^m$, a decoder that maps the received sequence back to its message is always correct, so the error probability is 0 for any message sent.

#### Converse
* Claim: It is not possible to reliably transmit at a rate higher than $\log|\mathcal{X}|$ bits per channel use. $\\$
* Proof by contradiction: $\\$
Assume there exists a coding scheme that transmits at a rate $R'>\log|\mathcal{X}|$ bits per channel use with arbitrarily low error probability.  
That rate implies that the number of messages $M'$ to be sent over $m$ uses of the channel exceeds $|\mathcal{X}|^m$, since $M'=2^{mR'}>|\mathcal{X}|^m$. But the channel accepts only $|\mathcal{X}|^m$ distinct input sequences over $m$ uses. This is where the contradiction arises: there are more messages than input sequences, which makes it impossible to map each message to a distinct sequence of channel inputs. Messages that share an input sequence produce the same output and cannot be told apart by the decoder, even though the channel itself is noiseless, so every message must have its own input sequence to be transmitted without error. Hence, exceeding $\log|\mathcal{X}|$ bits per channel use violates this requirement, making such a rate unachievable without introducing errors.  

We can therefore conclude that a rate of $\log|\mathcal{X}|$ is attainable with zero error probability because of the noiseless nature of the channel over sequences of length $m$, while attempting to exceed this rate leads to a contradiction. Therefore, the operational capacity of this channel is $C_{op} = \log|\mathcal{X}|$.
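
The achievability argument can be sketched in code. This is a minimal illustration under assumptions: the three-letter alphabet, the block length $m=4$, and the function names `encode`, `decode`, and `noiseless_channel` are all hypothetical choices made for the example. Each message index is written in base $|\mathcal X|$ with exactly $m$ symbols, sent through an identity channel, and read back.

```python
import math

def encode(message: int, alphabet: str, m: int) -> str:
    """Write the message index in base |X| using exactly m symbols."""
    q = len(alphabet)
    if not 0 <= message < q ** m:
        raise ValueError("message index out of range")
    digits = []
    for _ in range(m):
        message, remainder = divmod(message, q)
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))

def noiseless_channel(x: str) -> str:
    """Noiseless block channel: the output equals the input."""
    return x

def decode(received: str, alphabet: str) -> int:
    """Invert `encode`: read the received block as a base-|X| number."""
    q = len(alphabet)
    value = 0
    for symbol in received:
        value = value * q + alphabet.index(symbol)
    return value

alphabet, m = "abc", 4                      # assumed example values
num_messages = len(alphabet) ** m           # M = |X|^m
decoded_ok = all(
    decode(noiseless_channel(encode(w, alphabet, m)), alphabet) == w
    for w in range(num_messages)
)
rate = math.log2(num_messages) / m          # (1/m) log M
print(decoded_ok, rate, math.log2(len(alphabet)))
```

Every one of the $|\mathcal X|^m$ messages is recovered, and the computed rate coincides with $\log_2|\mathcal X|$ up to floating-point rounding.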

### 2. Channel Capacity
(Based on Exc. 7.6 in Cover \& Thomas) Consider a 26-key typewriter.
1. If pushing a key results in printing the associated letter, what is the capacity $C$ in bits?



#### <span style="color: LIGHTgreen;">Answer</span>

If the typewriter prints whatever key is struck, then the output $Y$ is the same as the input $X$, and $C=\max_{p(x)} I(X;Y)=\max_{p(x)} H(X)=\log 26 \approx 4.70$ bits, attained by a uniform distribution over the letters.

2. Now suppose that pushing a key results in printing the associated letter or the next letter in the alphabet with equal probability. That is, $A \to A$ or $B$, $B \to B$ or $C$, ..., $Z \to Z$ or $A$. What is the capacity?

#### <span style="color: LIGHTgreen;">Answer</span>

In this case, the output is either equal to the input (with probability 0.5) or equal to the next letter (with probability 0.5). Hence $H(Y|X)=\log 2$ regardless of the distribution of $X$, and therefore $C=\max_{p(x)} I(X;Y)=\max_{p(x)} H(Y) - \log 2=\log 26 - \log 2 =\log 13$, attained for a uniform distribution over the output, which in turn is attained by a uniform distribution on the input.
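
The value $\log 13$ can be checked numerically. The sketch below is only an illustration (the helper names `mutual_information` and `noisy_typewriter` are assumptions introduced here): it builds the $26 \times 26$ transition matrix and evaluates $I(X;Y)$ in bits for a uniform input.

```python
import math

def mutual_information(p_x, channel):
    """I(X;Y) in bits for input pmf p_x and row-stochastic matrix channel[x][y]."""
    n_out = len(channel[0])
    # Output distribution: p(y) = sum_x p(x) p(y|x).
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_out)]
    info = 0.0
    for x, px in enumerate(p_x):
        for y, pyx in enumerate(channel[x]):
            if px > 0 and pyx > 0:
                info += px * pyx * math.log2(pyx / p_y[y])
    return info

def noisy_typewriter(size=26):
    """Each key prints itself or the next letter (cyclically), each with probability 1/2."""
    channel = [[0.0] * size for _ in range(size)]
    for x in range(size):
        channel[x][x] = 0.5
        channel[x][(x + 1) % size] = 0.5
    return channel

uniform = [1 / 26] * 26
typewriter_info = mutual_information(uniform, noisy_typewriter())
print(typewriter_info, math.log2(13))
```

Both printed numbers agree up to floating-point rounding, consistent with $C = \log 26 - \log 2$.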

3. What is the highest rate code with block length one $(n=1)$ that you can find that achieves *zero* probability of error for the channel in part (2)?

*Hint for (2)*: 
Show first that for a channel with transition matrix in which the rows are permutations of each other and the columns are permutations of each other, the capacity is 
$$
C = \log |\mathcal Y| - H( \text{row of transition matrix}).
$$
(such a channel is called *symmetric*)

#### <span style="color: LIGHTgreen;">Answer</span>

A simple zero-error code of block length one uses every other letter, say $A,C,E,\ldots,W,Y$. In this case, no two codewords can be confused, since $A$ will produce either $A$ or $B$, $C$ will produce $C$ or $D$, etc. The rate of this code is:

$$
R= \frac{\log(\#\text{codewords})}{\text{block length}} = \frac{\log 13}{1} = \log 13
$$

In this case, we can achieve capacity with a simple code with zero error.
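
The zero-error property can be verified mechanically. In this sketch (the function names `possible_outputs` and `is_zero_error` are hypothetical, chosen for illustration), a block-length-one code has zero error exactly when the sets of letters its codewords can print are pairwise disjoint.

```python
import math
import string

LETTERS = string.ascii_uppercase

def possible_outputs(letter: str) -> set:
    """Letters that may be printed when `letter` is pressed: itself or the next one."""
    i = LETTERS.index(letter)
    return {letter, LETTERS[(i + 1) % 26]}

def is_zero_error(codewords) -> bool:
    """True if no printed letter can come from two different codewords."""
    seen = set()
    for c in codewords:
        outputs = possible_outputs(c)
        if seen & outputs:
            return False
        seen |= outputs
    return True

codebook = LETTERS[::2]                 # A, C, E, ..., W, Y
code_rate = math.log2(len(codebook))    # block length 1, so R = log2(#codewords)
print(is_zero_error(codebook), len(codebook), code_rate)
```

Adding any further letter to this codebook would create an overlap, since the 13 codewords already account for all 26 printed letters.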

### 3. Channel Capacity and Random Codes
(based on Exc. 7.8 and 7.9 in Cover \& Thomas)
The $Z$-channel has binary input and output alphabets and transition probabilities $P_{Y|X}$ given by the matrix
$$
P_{Y|X} = \begin{bmatrix} 1 & 0 \\
1/2 & 1/2 
\end{bmatrix}
$$
Namely, $\Pr[Y=0|X=0]=1$ while $\Pr[Y=0|X=1]=1/2$ ($0$ goes noiselessly, while a $1$ may turn into a zero with probability $1/2$). 
1. Find the capacity of this channel and the maximizing input probability distribution. It may help to know that 
$$
\frac{d}{dp} h_2(p) = \log_2((1-p)/p). 
$$


#### <span style="color: LIGHTgreen;">Answer</span>

First we express $I(X;Y)$, the mutual information between the input and output of the $Z$-channel, as a function of $x=P(X=1)$:

$$
H(Y|X) = P(X=0)\cdot 0+P(X=1)\cdot 1 = x 
$$

$$
H(Y) = H(P(Y=1))=H(x/2) 
$$

$$
I(X;Y) = H(Y) - H(Y|X)=H(x/2) - x 
$$

Since $I(X;Y)=0$ when $x=0$ and when $x=1$, the maximum mutual information is obtained for some value of $x$ such that $0<x<1$. Using elementary calculus, we determine that
$$
\frac{d}{dx}I(X;Y)=\frac{1}{2}\log_2\frac{1 - \frac{x}{2}}{\frac{x}{2}} - 1, 
$$

which is equal to zero when $\frac{1 - x/2}{x/2}=4$, i.e. for $x=\frac{2}{5}$. (It is reasonable that $P(X=1)< 0.5$ because $X=1$ is the noisy input to the channel.) So the capacity of the $Z$-channel in bits is $H(\frac{1}{5}) - \frac{2}{5}=0.722 - 0.4 = 0.322$.
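
The maximizer can be cross-checked numerically. The following is a rough sketch, not part of the original solution: the names `h2`, `z_channel_information`, and `maximize`, and the grid resolution, are assumptions. It evaluates $H(x/2) - x$ on a fine grid of $x \in [0,1]$ and reports the best point.

```python
import math

def h2(p: float) -> float:
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def z_channel_information(x: float) -> float:
    """I(X;Y) for the Z-channel when P(X=1) = x: H(Y) - H(Y|X) = h2(x/2) - x."""
    return h2(x / 2) - x

def maximize(steps: int = 100_000):
    """Brute-force grid search over x in [0, 1]."""
    best_x = max((i / steps for i in range(steps + 1)), key=z_channel_information)
    return best_x, z_channel_information(best_x)

best_x, capacity = maximize()
print(best_x, capacity)
```

The search lands at $x \approx 0.4$ with a value of about $0.322$ bits, in agreement with the calculus above.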

2. Assume that we draw a random $(2^{nR},n)$ code (as in the proof of the channel coding theorem) for this channel in which each codeword is a sequence of *fair* coin tosses (this may not achieve capacity). Find the maximum rate $R$ such that the probability of error $P_{\mathrm{err}}^{(n)}$ averaged over the randomly generated codes, tends to zero as $n$ tends to infinity. 



#### <span style="color: LIGHTgreen;">Answer</span>

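From the proof of the channel coding theorem, it follows that using a random code with codewords generated according to a distribution $p(x)$, we can send information at any rate below the $I(X;Y)$ corresponding to that $p(x)$ with an arbitrarily low probability of error. For the $Z$-channel described in the previous question, we can calculate $I(X;Y)$ for a uniform distribution on the input. The distribution on $Y$ is $(\frac{3}{4}, \frac{1}{4})$, and therefore:
$$
I(X;Y)=H(Y) - H(Y|X) = H\left(\frac{3}{4}, \frac{1}{4}\right) - \frac{1}{2}H\left(\frac{1}{2}, \frac{1}{2}\right)=\frac{3}{2} -  \frac{3}{4}\log_2 3 \approx 0.311
$$
Hence the maximum rate for which the error probability averaged over fair-coin random codes tends to zero is $R = \frac{3}{2} - \frac{3}{4}\log_2 3 \approx 0.311$ bits per channel use, slightly below the capacity of about $0.322$ bits.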
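
The fair-coin value can also be computed directly from the transition matrix. This is a small sketch under assumptions (the helper names `entropy` and `information_rate` are introduced here for illustration) that evaluates $H(Y) - H(Y|X)$ for the $Z$-channel with $P(X=1)=\frac{1}{2}$ and compares it with the closed form $\frac{3}{2} - \frac{3}{4}\log_2 3$.

```python
import math

def entropy(pmf) -> float:
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def information_rate(p_x, channel) -> float:
    """I(X;Y) = H(Y) - H(Y|X) for input pmf p_x and rows channel[x] = P(Y|X=x)."""
    n_out = len(channel[0])
    p_y = [sum(px * row[y] for px, row in zip(p_x, channel)) for y in range(n_out)]
    h_y_given_x = sum(px * entropy(row) for px, row in zip(p_x, channel))
    return entropy(p_y) - h_y_given_x

Z_CHANNEL = [[1.0, 0.0],   # X = 0 is received noiselessly
             [0.5, 0.5]]   # X = 1 flips to 0 with probability 1/2

fair_rate = information_rate([0.5, 0.5], Z_CHANNEL)
closed_form = 1.5 - 0.75 * math.log2(3)
print(fair_rate, closed_form)
```

Both numbers come out near $0.311$, a little under the capacity obtained with $P(X=1)=\frac{2}{5}$.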