<a href="https://colab.research.google.com/github/JakeyV8/cs417-exercises/blob/main/encoding_conversions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoding Conversions: UTF-8, UTF-16LE, and Latin-1

In this exercise you will explore how the **same string** is represented differently under three common encodings:

| Encoding | Description |
|----------|-------------|
| **UTF-8** | Variable-length (1–4 bytes per character). ASCII characters use 1 byte. |
| **UTF-16LE** | Uses 2 or 4 bytes per character (little-endian byte order). |
| **Latin-1 (ISO 8859-1)** | Fixed 1 byte per character. Only covers code points U+0000–U+00FF. |

## Instructions

1. Run the setup cell below; it provides helper functions you'll use throughout.
2. In each exploration cell, **try different strings** and observe the byte output.
3. Record your observations in the markdown cells provided.
4. Save this notebook to your GitHub repo under `cs417/exercises/` and submit the URL on Canvas.

---
## Setup: Run This First

In [1]:
def show_encoding(text, encoding):
    """Encode a string and display the raw bytes, hex, and length."""
    try:
        encoded = text.encode(encoding)
        hex_bytes = ' '.join(f'{b:02x}' for b in encoded)
        print(f"  {encoding:12s} | {len(encoded):2d} bytes | {hex_bytes}")
        return encoded
    except UnicodeEncodeError as e:
        print(f"  {encoding:12s} | ERROR: {e}")
        return None


def compare_encodings(text):
    """Show a side-by-side comparison of a string in all three encodings."""
    print(f'String: "{text}"')
    print(f'Characters: {len(text)}  |  Unicode code points: {[f"U+{ord(c):04X}" for c in text]}')
    print('-' * 70)
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"  {'Encoding':12s} | Bytes | Hex representation")
    print('-' * 70)
    for enc in ['utf-8', 'utf-16-le', 'latin-1']:
        show_encoding(text, enc)
    print()


def roundtrip(text, encode_as, decode_as):
    """Encode with one encoding, then try to decode with another. What happens?"""
    print(f'Original string: "{text}"')
    print(f'Encode as {encode_as}, then decode as {decode_as}:')
    try:
        raw = text.encode(encode_as)
        result = raw.decode(decode_as)
        print(f'  Result: "{result}"')
    except (UnicodeDecodeError, UnicodeEncodeError) as e:
        print(f'  ERROR: {e}')
    print()


print("✅ Helper functions loaded: compare_encodings(), show_encoding(), roundtrip()")

✅ Helper functions loaded: compare_encodings(), show_encoding(), roundtrip()


---
## Part 1: ASCII-Only Strings

Start with plain ASCII text. Try different strings below and observe byte counts.

In [7]:
# Try changing the string and re-running
compare_encodings("jjjj")

String: "jjjj"
Characters: 4  |  Unicode code points: ['U+006A', 'U+006A', 'U+006A', 'U+006A']
----------------------------------------------------------------------
  Encoding     | Bytes | Hex representation
----------------------------------------------------------------------
  utf-8        |  4 bytes | 6a 6a 6a 6a
  utf-16-le    |  8 bytes | 6a 00 6a 00 6a 00 6a 00
  latin-1      |  4 bytes | 6a 6a 6a 6a



In [9]:
# Try another ASCII string
compare_encodings("CS 400")

String: "CS 400"
Characters: 6  |  Unicode code points: ['U+0043', 'U+0053', 'U+0020', 'U+0034', 'U+0030', 'U+0030']
----------------------------------------------------------------------
  Encoding     | Bytes | Hex representation
----------------------------------------------------------------------
  utf-8        |  6 bytes | 43 53 20 34 30 30
  utf-16-le    | 12 bytes | 43 00 53 00 20 00 34 00 30 00 30 00
  latin-1      |  6 bytes | 43 53 20 34 30 30



### ✏️ Your Observations: Part 1

*Double-click this cell to edit. Write 2–3 sentences about what you noticed for ASCII strings across the three encodings.*

YOUR OBSERVATIONS HERE
Each character has a unique Unicode code point, and each encoding maps that code point to bytes. For ASCII text, UTF-8 and Latin-1 produce identical output, one byte per character, so the space character is always `20`. UTF-16LE stores the full code point U+0020 as a 2-byte unit with the low byte first, which is where the `00` after the `20` comes from, so it uses twice as many bytes.
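
The byte patterns in the Part 1 outputs can also be checked programmatically. The following is a minimal sketch (the sample string is arbitrary, and the variable names are only illustrative) confirming that UTF-8 and Latin-1 agree on pure ASCII text and that UTF-16LE is the same bytes, each followed by `00`:

```python
text = "CS 400"  # any pure-ASCII string works here

utf8 = text.encode("utf-8")
latin1 = text.encode("latin-1")
utf16le = text.encode("utf-16-le")

# For ASCII, UTF-8 and Latin-1 produce exactly the same bytes.
print(utf8 == latin1 == text.encode("ascii"))  # True

# UTF-16LE stores each 2-byte unit low byte first, so the even-indexed
# bytes are the ASCII values and the odd-indexed bytes are all 0x00.
print(utf16le[0::2] == utf8)       # True
print(set(utf16le[1::2]) == {0})   # True
```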

---
## Part 2: Accented / Latin Characters

Now try characters from Western European languages (U+0080–U+00FF range).

In [10]:
# Accented characters: try café, naïve, résumé, über, etc.
compare_encodings("café")

String: "café"
Characters: 4  |  Unicode code points: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
----------------------------------------------------------------------
  Encoding     | Bytes | Hex representation
----------------------------------------------------------------------
  utf-8        |  5 bytes | 63 61 66 c3 a9
  utf-16-le    |  8 bytes | 63 00 61 00 66 00 e9 00
  latin-1      |  4 bytes | 63 61 66 e9



In [11]:
# Try more strings with accented characters
compare_encodings("résumé")

String: "résumé"
Characters: 6  |  Unicode code points: ['U+0072', 'U+00E9', 'U+0073', 'U+0075', 'U+006D', 'U+00E9']
----------------------------------------------------------------------
  Encoding     | Bytes | Hex representation
----------------------------------------------------------------------
  utf-8        |  8 bytes | 72 c3 a9 73 75 6d c3 a9
  utf-16-le    | 12 bytes | 72 00 e9 00 73 00 75 00 6d 00 e9 00
  latin-1      |  6 bytes | 72 e9 73 75 6d e9



### ✏️ Your Observations: Part 2

*How did the byte counts change for accented characters compared to Part 1? Which encoding was most compact? Which was least?*

YOUR OBSERVATIONS HERE:
In UTF-8, the byte count increases by one for each accented character: a plain `e` (U+0065) is one byte, but `é` (U+00E9) takes two bytes (`c3 a9`). Latin-1 and UTF-16LE are unchanged from Part 1, at 1 and 2 bytes per character respectively. Latin-1 is the most compact; UTF-16LE is the least compact.
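
The two bytes `c3 a9` that UTF-8 produces for `é` follow a fixed bit layout: code points from U+0080 to U+07FF are split into a `110xxxxx` lead byte and a `10xxxxxx` continuation byte. The sketch below hand-encodes that layout and compares it against Python's codec; `utf8_two_byte` is a hypothetical helper written for illustration, not part of the exercise's setup cell.

```python
def utf8_two_byte(code_point):
    """Hand-encode a code point in the range U+0080..U+07FF as UTF-8."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points U+0080..U+07FF use two bytes")
    lead = 0b11000000 | (code_point >> 6)          # 110xxxxx: top 5 bits
    cont = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: low 6 bits
    return bytes([lead, cont])


manual = utf8_two_byte(ord("é"))     # é is U+00E9
print(manual.hex(" "))               # c3 a9
print(manual == "é".encode("utf-8")) # True
```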

---
## Part 3: Characters Beyond Latin-1

Try characters that **cannot** be represented in Latin-1: CJK, emoji, Greek, Arabic, etc.

In [12]:
# Emoji / CJK: watch what happens to Latin-1
compare_encodings("Hello 🌍")

String: "Hello 🌍"
Characters: 7  |  Unicode code points: ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F', 'U+0020', 'U+1F30D']
----------------------------------------------------------------------
  Encoding     | Bytes | Hex representation
----------------------------------------------------------------------
  utf-8        | 10 bytes | 48 65 6c 6c 6f 20 f0 9f 8c 8d
  utf-16-le    | 16 bytes | 48 00 65 00 6c 00 6c 00 6f 00 20 00 3c d8 0d df
  latin-1      | ERROR: 'latin-1' codec can't encode character '\U0001f30d' in position 6: ordinal not in range(256)



In [13]:
# Try Greek, Japanese, Arabic, etc.
compare_encodings("日本語")

String: "日本語"
Characters: 3  |  Unicode code points: ['U+65E5', 'U+672C', 'U+8A9E']
----------------------------------------------------------------------
  Encoding     | Bytes | Hex representation
----------------------------------------------------------------------
  utf-8        |  9 bytes | e6 97 a5 e6 9c ac e8 aa 9e
  utf-16-le    |  6 bytes | e5 65 2c 67 9e 8a
  latin-1      | ERROR: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)



In [14]:
# Try your own string with non-Latin characters
compare_encodings("Ελληνικά")

String: "Ελληνικά"
Characters: 8  |  Unicode code points: ['U+0395', 'U+03BB', 'U+03BB', 'U+03B7', 'U+03BD', 'U+03B9', 'U+03BA', 'U+03AC']
----------------------------------------------------------------------
  Encoding     | Bytes | Hex representation
----------------------------------------------------------------------
  utf-8        | 16 bytes | ce 95 ce bb ce bb ce b7 ce bd ce b9 ce ba ce ac
  utf-16-le    | 16 bytes | 95 03 bb 03 bb 03 b7 03 bd 03 b9 03 ba 03 ac 03
  latin-1      | ERROR: 'latin-1' codec can't encode characters in position 0-7: ordinal not in range(256)



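### ✏️ Your Observations: Part 3

*What happened when Latin-1 tried to encode these characters? Compare UTF-8 vs UTF-16LE byte counts for CJK characters vs emoji.*

YOUR OBSERVATIONS HERE: Latin-1 raises a `UnicodeEncodeError` because these code points are not in range(256). For CJK characters, UTF-16LE is smaller: 2 bytes per character versus 3 in UTF-8 (6 vs 9 bytes for "日本語"). The emoji itself takes 4 bytes in both encodings, so UTF-8 only wins on "Hello 🌍" (10 vs 16 bytes) because of the ASCII letters. Greek ties at 2 bytes per character in both.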
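
To see where the byte counts in Part 3 come from, it can help to measure one character from each range on its own. The sketch below does that, and then derives the UTF-16 surrogate pair for the emoji by hand; the `samples` dictionary and variable names are illustrative choices, not part of the exercise.

```python
samples = {
    "ASCII": "A",
    "Latin-1 range": "é",
    "Greek": "λ",
    "CJK": "日",
    "emoji": "🌍",
}

for label, ch in samples.items():
    n8 = len(ch.encode("utf-8"))
    n16 = len(ch.encode("utf-16-le"))
    print(f"{label:14s} U+{ord(ch):04X}  utf-8: {n8}  utf-16-le: {n16}")

# Code points above U+FFFF do not fit in one 16-bit unit, so UTF-16
# splits them into a high and a low surrogate (a "surrogate pair").
offset = ord("🌍") - 0x10000          # U+1F30D -> 0xF30D
high = 0xD800 + (offset >> 10)        # top 10 bits
low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
print(f"{high:04x} {low:04x}")        # d83c df0d
```

Written little-endian, `d83c df0d` becomes the bytes `3c d8 0d df` seen in the `utf-16-le` row of the "Hello 🌍" output.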


---
## Part 4: Encoding Mismatches (Mojibake)

What happens when you encode with one scheme but decode with another? This is a **very** common source of bugs in real systems.

In [15]:
# Encode as UTF-8, decode as Latin-1
roundtrip("café", "utf-8", "latin-1")

Original string: "café"
Encode as utf-8, then decode as latin-1:
  Result: "cafÃ©"



In [16]:
# Encode as Latin-1, decode as UTF-8
roundtrip("café", "latin-1", "utf-8")

Original string: "café"
Encode as latin-1, then decode as utf-8:
  ERROR: 'utf-8' codec can't decode byte 0xe9 in position 3: unexpected end of data



In [17]:
# Encode as UTF-16LE, decode as UTF-8
roundtrip("Hello", "utf-16-le", "utf-8")

Original string: "Hello"
Encode as utf-16-le, then decode as utf-8:
  Result: "H e l l o "
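
The result above looks as if it has spaces between the letters, but the extra characters are NUL characters (U+0000): each `00` byte in the UTF-16LE data is a valid one-byte UTF-8 sequence, so no error is raised. A short sketch using `repr()` to make them visible:

```python
raw = "Hello".encode("utf-16-le")
decoded = raw.decode("utf-8")

# repr() shows the NUL characters that print() renders as blanks.
print(repr(decoded))   # 'H\x00e\x00l\x00l\x00o\x00'
print(len(decoded))    # 10, not 5

# Stripping the NULs happens to recover the text here, but only
# because every character in "Hello" is ASCII.
print(decoded.replace("\x00", ""))  # Hello
```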



In [18]:
# Try your own mismatch combinations
roundtrip("résumé", "utf-8", "utf-16-le")

Original string: "résumé"
Encode as utf-8, then decode as utf-16-le:
  Result: "썲玩浵꧃"



### ✏️ Your Observations: Part 4

*Describe what "mojibake" looked like in your experiments. Why does encoding as UTF-8 and decoding as Latin-1 produce garbled text instead of an error? When did you get an actual error instead?*

YOUR OBSERVATIONS HERE: Decoding UTF-8 bytes as Latin-1 never raises an error because Latin-1 assigns a character to every byte value from 0x00 to 0xFF, so the two bytes `c3 a9` for `é` simply come out as two characters (`Ã©`). The reverse direction fails because UTF-8 has structural rules: the Latin-1 byte `e9` is a UTF-8 lead byte that must be followed by continuation bytes, and none follow, so the decoder throws an error.

---
## Part 5: Free Exploration

Use the cells below to try anything else you're curious about. Some ideas:
- What's the longest UTF-8 encoding for a single character you can find?
- Can you find a string where UTF-16LE is actually *smaller* than UTF-8?
- What happens with the BOM (Byte Order Mark)? Try `compare_encodings('\ufeff')`

In [24]:
# Your exploration here
roundtrip("salt", "latin-1", "utf-8")

Original string: "salt"
Encode as latin-1, then decode as utf-8:
  Result: "salt"



In [25]:
# More exploration
compare_encodings(썲玩浵꧃)

SyntaxError: invalid character '꧃' (U+A9C3) (ipython-input-3534152433.py, line 2)

### ✏️ Your Observations: Part 5

*Share anything interesting you discovered during free exploration.*

YOUR OBSERVATIONS HERE: I passed the result of the Part 4 experiment (encode as UTF-8, decode as UTF-16LE) to `compare_encodings` and it threw an error. I was expecting an error for at least Latin-1, but not for all of them. The error is a `SyntaxError`, not an encoding error: the text was passed without quotes, so Python tried to parse it as an identifier and rejected `꧃` (U+A9C3) before `compare_encodings` ever ran.
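
With the string wrapped in quotes, the same experiment runs. The sketch below writes out a loop similar to `compare_encodings` so that it is self-contained; it is an approximation of the helper, not a copy of it.

```python
text = "썲玩浵꧃"  # quoted, so Python treats it as a string literal

for enc in ["utf-8", "utf-16-le", "latin-1"]:
    try:
        encoded = text.encode(enc)
        print(f"{enc:10s} {len(encoded):2d} bytes  {encoded.hex(' ')}")
    except UnicodeEncodeError as e:
        print(f"{enc:10s} ERROR: {e}")

# Encoding the garbled text as UTF-16LE gives back the original
# UTF-8 bytes of "résumé", since the mismatch lost no data.
print(text.encode("utf-16-le") == "résumé".encode("utf-8"))  # True
```

Only the Latin-1 row reports an error here, which matches the original expectation.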


---
## Summary

### ✏️ Key Takeaways

*Write 3–5 bullet points summarizing what you learned about character encodings from this exercise.*

- Latin-1 is the most compact encoding for Western European text, but it cannot represent anything above U+00FF (no emoji, CJK, or Greek).
- Each encoding has pros and cons; there is no one-size-fits-all choice.
- The characters that show up in mojibake were never in the original text; they are what the same bytes mean under the wrong encoding.


---
## Submission

1. **File → Save a copy in GitHub**
2. Save to your repository under the path: `cs417/exercises/encoding_conversions.ipynb`
3. Submit the GitHub URL to Canvas.