# Exercise 1: Classical Cipher Cryptanalysis

## Objectives
1. **Caesar/ROT Cipher Breaking** - Frequency analysis, chi-squared test, bigram analysis
2. **RC4 Brute Force** - Breaking short-key RC4 encryption using entropy validation

## Part 1: Caesar Cipher (Cyclic Substitution)

### What is Caesar Cipher?
A simple substitution cipher in which each letter is shifted by a fixed number of positions in the alphabet, wrapping around at the end.

Example with shift=3:
```
Plaintext:  HELLO
Ciphertext: KHOOR
```
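A minimal sketch of the shift itself. The function name `caesar_shift` and the choice to preserve case and pass non-letters through unchanged are assumptions for illustration, not necessarily what the exercise code does.

```python
def caesar_shift(text: str, shift: int, alphabet: str = "abcdefghijklmnopqrstuvwxyz") -> str:
    """Shift every character of `text` that is in `alphabet`; leave the rest unchanged."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = len(alphabet)
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in index:
            shifted = alphabet[(index[lower] + shift) % n]
            # Preserve the case of the input character.
            out.append(shifted.upper() if ch.isupper() else shifted)
        else:
            out.append(ch)  # spaces, digits, punctuation pass through
    return "".join(out)

print(caesar_shift("HELLO", 3))   # KHOOR
print(caesar_shift("KHOOR", -3))  # HELLO
```

Decryption is the same operation with the negated shift.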

### Key Space
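For an alphabet of size $n$ there are only $n-1$ non-trivial shifts (shift=0 is the identity). With the character sets listed under Multilingual Support:
- English (26 characters): 25 possible keys
- French (42 characters, including accented letters): 41 possible keys
- Polish (35 characters, including q, v, x and the diacritic letters): 34 possible keys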
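Because the key space is this small, every candidate plaintext can simply be listed. A rough sketch for the English alphabet (the helper name `all_decryptions` is hypothetical):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def all_decryptions(ciphertext: str, alphabet: str = ALPHABET) -> dict:
    """Return {shift: candidate plaintext} for every non-trivial shift."""
    candidates = {}
    for shift in range(1, len(alphabet)):
        # Decrypting with `shift` moves every letter back by `shift` positions.
        rotated = alphabet[-shift:] + alphabet[:-shift]
        candidates[shift] = ciphertext.lower().translate(str.maketrans(alphabet, rotated))
    return candidates

candidates = all_decryptions("khoor")
print(len(candidates))  # 25
print(candidates[3])    # hello
```

The attacks below only automate picking the right entry from this list.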

## Attack Method 1: Frequency Analysis

### Principle
Each language has characteristic letter frequencies. By comparing ciphertext frequencies with expected language frequencies, we can identify the shift.

### Chi-Squared Test
Statistical measure of how well observed frequencies match expected:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

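where:
- $O_i$ = observed count of character $i$ in the candidate plaintext
- $E_i$ = expected count of character $i$ (its relative frequency in the language times the number of letters)

**Lower $\chi^2$ = better match**, so the shift with the smallest value is the most likely key.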
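The test can be sketched as follows for English. The frequency table holds commonly cited approximate percentages and is an assumption here, as are the helper names `shift_text`, `chi_squared` and `break_caesar`; the exercise code may use different values.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Approximate English letter frequencies in percent (assumed values).
ENGLISH_FREQ = dict(zip(ALPHABET, [
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
    0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
    6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074]))

def shift_text(text, shift, alphabet=ALPHABET):
    """Shift lowercase letters forward by `shift` (negative shifts decrypt)."""
    s = shift % len(alphabet)
    return text.lower().translate(str.maketrans(alphabet, alphabet[s:] + alphabet[:s]))

def chi_squared(text, expected_freq=ENGLISH_FREQ):
    """Chi-squared distance between the letter counts of `text` and the expected counts."""
    letters = [c for c in text.lower() if c in expected_freq]
    if not letters:
        return float("inf")
    counts = Counter(letters)
    total_weight = sum(expected_freq.values())
    score = 0.0
    for ch, freq in expected_freq.items():
        expected = len(letters) * freq / total_weight  # E_i as a count
        score += (counts[ch] - expected) ** 2 / expected
    return score

def break_caesar(ciphertext):
    """Try every non-trivial shift and return the one with the lowest chi-squared."""
    return min(range(1, len(ALPHABET)),
               key=lambda s: chi_squared(shift_text(ciphertext, -s)))

plain = ("it was the best of times it was the worst of times "
         "it was the age of wisdom it was the age of foolishness")
cipher = shift_text(plain, 7)
print(break_caesar(cipher))  # 7
```

Very short ciphertexts give noisy counts, so the minimum is less reliable there.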

## Attack Method 2: Smart Frequency Attack

### Optimization
Instead of testing all shifts, predict the most likely shift:

1. Find most frequent character in ciphertext
2. Assume it corresponds to most frequent character in language
3. Calculate predicted shift
4. Verify with chi-squared test

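**Complexity**: once the letter counts are known (one pass over the ciphertext), this needs a single chi-squared evaluation, O(n), instead of one per shift, O(n²), where n is the alphabet size. The prediction can be wrong on short texts, which is why the verification step matters.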
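Steps 1 to 3 might look like the sketch below. It assumes English, where "e" is taken as the most frequent letter; `predict_shift` is a hypothetical name, and the verification step is left to whatever chi-squared scorer is in use.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def predict_shift(ciphertext, most_common_plain="e", alphabet=ALPHABET):
    """Guess the shift by aligning the top ciphertext letter with `most_common_plain`."""
    counts = Counter(c for c in ciphertext.lower() if c in alphabet)
    if not counts:
        raise ValueError("ciphertext contains no letters from the alphabet")
    top_cipher = counts.most_common(1)[0][0]
    return (alphabet.index(top_cipher) - alphabet.index(most_common_plain)) % len(alphabet)

# "the secret meeting begins at seven" encrypted with shift=3
print(predict_shift("wkh vhfuhw phhwlqj ehjlqv dw vhyhq"))  # 3
```

If the guess fails verification, a natural fallback is to try the next most frequent ciphertext letters or to test all shifts.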

## Attack Method 3: Bigram Analysis

### What are bigrams?
Two-character sequences (e.g., "th", "he", "in" in English)

### Method
1. For each possible shift, encrypt the common bigrams with that shift
2. Count occurrences of the shifted bigrams in the ciphertext
3. The shift with the highest bigram score is likely correct

### Example
```
Common English bigrams: ["th", "he", "in"]
Shift=3: ["wk", "kh", "lq"]
Count these patterns in ciphertext
```
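A small sketch of this scoring, using only the three bigrams from the example above (a real run would likely use a longer list). The function names are illustrative assumptions.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
COMMON_BIGRAMS = ["th", "he", "in"]

def shift_text(text, shift, alphabet=ALPHABET):
    """Shift lowercase letters forward by `shift` positions."""
    s = shift % len(alphabet)
    return text.translate(str.maketrans(alphabet, alphabet[s:] + alphabet[:s]))

def bigram_score(ciphertext, shift, bigrams=COMMON_BIGRAMS):
    """Count how often the bigrams, encrypted with `shift`, occur in the ciphertext."""
    text = ciphertext.lower()
    return sum(text.count(shift_text(bigram, shift)) for bigram in bigrams)

def best_shift_by_bigrams(ciphertext):
    """Return the non-trivial shift with the highest bigram score."""
    return max(range(1, len(ALPHABET)), key=lambda s: bigram_score(ciphertext, s))

print([shift_text(b, 3) for b in COMMON_BIGRAMS])                    # ['wk', 'kh', 'lq']
print(best_shift_by_bigrams("wkh vhfuhw phhwlqj ehjlqv dw vhyhq"))   # 3
```

Ties are resolved in favour of the smallest shift here, so on short texts the result is worth cross-checking with the frequency-based score.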

## Part 2: RC4 Stream Cipher

### What is RC4?
- Stream cipher (encrypts byte-by-byte)
- Key-scheduling algorithm (KSA) + Pseudo-random generation algorithm (PRGA)
- Generates keystream XORed with plaintext
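The two phases can be sketched in a few lines. This is the textbook RC4 construction written as a single hypothetical function `rc4`; since encryption is an XOR with the keystream, the same call decrypts.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt `data` with RC4 under `key`."""
    if not key:
        raise ValueError("key must not be empty")

    # KSA: permute the state array S using the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]

    # PRGA: produce one keystream byte per input byte and XOR it in.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)

ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())         # bbf316e8d940af0ad3
print(rc4(b"Key", ciphertext))  # b'Plaintext'
```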

### Vulnerability
**Short keys** (e.g., 3 characters [a-z]) can be brute-forced:
- Key space: $26^3 = 17{,}576$ combinations
- Feasible in seconds on modern hardware
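A sketch of the search loop under these assumptions. The names `brute_force_rc4` and `is_printable_ascii` are hypothetical, and the plausibility check is a simple printable-ASCII stand-in that can be swapped for any other validator.

```python
from itertools import product
from string import ascii_lowercase

def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4 (KSA followed by PRGA); the same call encrypts and decrypts."""
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)

def is_printable_ascii(data: bytes) -> bool:
    """Stand-in validator: every byte is a printable ASCII character."""
    return all(32 <= b < 127 for b in data)

def brute_force_rc4(ciphertext, key_length=3, looks_valid=is_printable_ascii):
    """Yield (key, plaintext) for every lowercase key whose output passes `looks_valid`."""
    for letters in product(ascii_lowercase, repeat=key_length):
        key = "".join(letters).encode("ascii")
        candidate = rc4(key, ciphertext)
        if looks_valid(candidate):
            yield key, candidate

ciphertext = rc4(b"cab", b"meet me at the old bridge at nine")
key, plaintext = next(brute_force_rc4(ciphertext))
print(key, plaintext)
```

Writing the search as a generator lets the caller stop at the first plausible key or collect all of them.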

## Entropy-Based Validation

### Shannon Entropy
Measure of randomness/unpredictability in data:

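$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$

where $P(x_i)$ is the fraction of bytes equal to $x_i$. For byte data $n = 256$, so $0 \le H \le 8$ bits/byte.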

### Thresholds
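- **Encrypted data**: 7.5-8.0 bits/byte (close to uniformly random)
- **Natural language**: roughly 4.5-5.5 bits/byte (patterns present)
- **Validation**: Entropy < 7.0 suggests successful decryption
- **Caveat**: a sample of $N$ bytes can never measure more than $\log_2 N$ bits/byte, so inputs shorter than 128 bytes always fall below 7.0, decrypted or not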
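The byte-level entropy can be computed as in this sketch (the function name `shannon_entropy` is an assumption):

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(data).values())

print(shannon_entropy(bytes(range(256))))  # 8.0, every byte value exactly once
print(shannon_entropy(b"aabb"))            # 1.0, two equally likely values

english = b"the quick brown fox jumps over the lazy dog " * 20
print(shannon_entropy(english) < 7.0)            # True
print(shannon_entropy(os.urandom(65536)) > 7.5)  # True for random-looking data
```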

### Additional Checks
- Printable character ratio > 80%
- UTF-8 decoding success
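These two checks could be combined as sketched below. The helper names are hypothetical. The ratio is computed over bytes and counts ASCII printables only, which is an assumption: accented letters encoded as multi-byte UTF-8 count against it.

```python
import string

PRINTABLE_BYTES = set(string.printable.encode("ascii"))

def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII (including whitespace)."""
    if not data:
        return 0.0
    return sum(b in PRINTABLE_BYTES for b in data) / len(data)

def decodes_as_utf8(data: bytes) -> bool:
    """True if `data` is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def looks_like_plaintext(data: bytes, min_ratio: float = 0.8) -> bool:
    """Apply both checks: printable ratio above `min_ratio` and valid UTF-8."""
    return printable_ratio(data) > min_ratio and decodes_as_utf8(data)

print(looks_like_plaintext(b"attack at dawn"))     # True
print(looks_like_plaintext(b"\xff\xfe\x00\x01"))   # False
```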

## Implementation Notes

### Multilingual Support
The code supports three languages with different character sets:
```python
ALPHABETS = {
    'english': "abcdefghijklmnopqrstuvwxyz",
    'french': "abcdefghijklmnopqrstuvwxyzàâäéèêëïîôöùûüÿç",
    'polish': "abcdefghijklmnopqrstuvwxyząćęłńóśźż"
}
```
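As a quick sanity check, the key-space size for each language follows directly from these strings. The loop below is an illustrative sketch that also verifies no character is listed twice:

```python
ALPHABETS = {
    'english': "abcdefghijklmnopqrstuvwxyz",
    'french': "abcdefghijklmnopqrstuvwxyzàâäéèêëïîôöùûüÿç",
    'polish': "abcdefghijklmnopqrstuvwxyząćęłńóśźż"
}

for name, alphabet in ALPHABETS.items():
    # A duplicated character would make the shift mapping ambiguous.
    assert len(set(alphabet)) == len(alphabet), f"duplicate character in {name}"
    print(f"{name}: {len(alphabet)} characters, {len(alphabet) - 1} non-trivial shifts")
# english: 26 characters, 25 non-trivial shifts
# french: 42 characters, 41 non-trivial shifts
# polish: 35 characters, 34 non-trivial shifts
```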

### File Handling
- UTF-8 with latin-1 fallback for reading
- Results saved to `../decrypted/` directory
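One way to implement this behaviour, as a sketch with hypothetical helper names. Note that latin-1 maps every byte to a character, so the fallback never raises; it may, however, produce the wrong characters for files in other encodings.

```python
from pathlib import Path

def read_text_with_fallback(path) -> str:
    """Read a file as UTF-8, falling back to latin-1 if the bytes are not valid UTF-8."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def save_result(name: str, text: str, out_dir: str = "../decrypted") -> Path:
    """Write `text` as UTF-8 to `out_dir/name`, creating the directory if needed."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_text(text, encoding="utf-8")
    return target
```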

## Security Lessons

### Why Caesar Cipher is Insecure
1. **Tiny key space** - Only $n-1$ possible keys for an alphabet of size $n$, a few dozen at most
2. **Statistical patterns preserved** - Letter frequencies remain
3. **No diffusion** - Each character encrypted independently

### Why Short RC4 Keys are Insecure
1. **Brute-forceable** - 17,576 combinations for 3-char keys
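2. **No authentication** - RC4 has no integrity check, so tampering goes undetected and a correct key can only be recognized heuristically (entropy, printable ratio)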
3. **RC4 weaknesses** - Known biases in keystream

### Modern Alternatives
- **Encryption**: AES-256 in GCM mode
- **Key derivation**: Argon2, PBKDF2
- **Key length**: Minimum 128 bits of entropy
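To put the key-length point in numbers, and to show what password-based key derivation looks like with the standard library, here is a hedged sketch. The function name `derive_key`, the 16-byte salt and the iteration count are assumptions chosen for illustration, not values prescribed by this exercise.

```python
import hashlib
import math
import secrets
from typing import Optional, Tuple

# Entropy of a 3-letter lowercase key, far below the 128-bit minimum.
print(f"{math.log2(26 ** 3):.1f} bits")  # 14.1 bits

def derive_key(password: str, salt: Optional[bytes] = None,
               iterations: int = 600_000) -> Tuple[bytes, bytes]:
    """Derive a 256-bit key from a password with PBKDF2-HMAC-SHA256.

    Returns (key, salt); the salt must be stored to re-derive the same key.
    """
    if salt is None:
        salt = secrets.token_bytes(16)  # fresh random salt per password
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                              salt, iterations, dklen=32)
    return key, salt

key, salt = derive_key("correct horse battery staple")
print(len(key) * 8)  # 256
```

Key derivation slows down guessing but does not add entropy: a weak password remains weak.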