# Activity 4 - Unicode

## Number Base

In mathematics, the number base refers to the number of unique digits used to represent numbers in a positional numeral system. The most commonly used number base is the decimal system, which has a base of 10 and uses ten digits (0-9) to represent numbers.

However, there are other number bases as well, including:

1. Binary (base 2): Uses two digits (0 and 1) to represent numbers. It is commonly used in computer science and digital systems.
2. Octal (base 8): Uses eight digits (0-7) to represent numbers.
3. Hexadecimal (base 16): Uses sixteen digits (0-9 and A-F) to represent numbers. It is commonly used in computer programming and represents four bits (a nibble) with each digit.
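
As a rough illustration of positional notation, the sketch below converts between digit strings and integers by hand. The helper names `to_decimal` and `from_decimal` are hypothetical, introduced only for this example, and they do no input validation.

```python
DIGITS = '0123456789ABCDEF'

def to_decimal(digits, base):
    """Interpret a digit string in the given base (2 to 16) as an integer."""
    value = 0
    for d in digits.upper():
        # shift the running value one position left, then add the new digit
        value = value * base + DIGITS.index(d)
    return value

def from_decimal(number, base):
    """Write a non-negative integer as a digit string in the given base (2 to 16)."""
    if number == 0:
        return '0'
    out = ''
    while number > 0:
        # the remainder is the next digit, starting from the rightmost one
        number, remainder = divmod(number, base)
        out = DIGITS[remainder] + out
    return out

print(to_decimal('2A', 16))   # 2*16 + 10 = 42
print(from_decimal(42, 2))    # 101010
print(from_decimal(42, 8))    # 52
```

Repeated division by the base is the same procedure you would follow on paper.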

**Question 1**

Manually convert the numbers below from one base to another.

- The binary of decimal number 10.
- The octal of decimal number 10.
- The hexadecimal of decimal number 10.
- The decimal of hexadecimal number FF.
- The decimal of binary number 1111. 

### Representation

In Python, you can write

- binary numbers using the prefix **0b** or **0B** followed by a sequence of binary digits (0 and 1),

- octal numbers using the prefix **0o** or **0O** followed by a sequence of octal digits (0 to 7),

- hexadecimal numbers using the prefix **0x** or **0X** followed by a sequence of hexadecimal digits (0 to 9 and A to F, in either case).

A number in any of these bases can appear in two forms:

- as an integer literal (*int*), which Python always prints in base 10;
- as a string (*str*) that holds the prefix and digits as plain characters.

In [None]:
binary_number1 = 0b100
binary_number2 = '0b1001'
print(type(binary_number1), binary_number1) # shown as a base-10 integer 
print(type(binary_number2), binary_number2) # shown as a sequence of characters prefixed with 0b. 

In [None]:
octal_number1 = 0o100
octal_number2 = '0o107'
print(type(octal_number1), octal_number1)
print(type(octal_number2), octal_number2)

In [None]:
hex_number1 = 0x101
hex_number2 = '0xF07'
print(type(hex_number1), hex_number1)
print(type(hex_number2), hex_number2)

<class 'int'> 257
<class 'str'> 0xF07


### Base conversion

Call the following functions to convert a base-10 integer (*int*) to a prefixed string (*str*) in another base:

- bin: to base 2
- oct: to base 8
- hex: to base 16



In [None]:
print(bin(100))
print(oct(100))
print(hex(100))

0b1100100
0o144
0x64


Call **int** with the *base* argument to convert a string of digits in another base to a base-10 integer.

The *base* can be 0 if the string carries a prefix; Python then infers the base from the prefix.

In [None]:
# no prefix
print(int('1100100', base=2))
print(int('144', base=8))
print(int('64', base=16))

In [None]:
# prefix 
print(int('0b1100100', base=2))
print(int('0o144', base=8))
print(int('0x64', base=16))

In [None]:
print(int('0b1100100', 0)) # use specified base in the string

100
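
A brief sketch, not part of the original activity: the conversions above are inverses of each other, and the built-in `format` function (or an f-string) gives the same digits without the prefix.

```python
n = 100

# bin/oct/hex produce prefixed strings; int(..., 0) reads the prefix back
for text in (bin(n), oct(n), hex(n)):
    assert int(text, 0) == n

# format specifiers b, o, x give the digits without a prefix
print(format(n, 'b'))    # 1100100
print(format(n, 'o'))    # 144
print(format(n, 'x'))    # 64
print(f'{n:08b}')        # 01100100 (zero-padded to width 8)
print(f'{n:#x}')         # 0x64 (the # flag adds the prefix)
```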


**Question 2**

Verify your answers to the previous question using Python.

## ASCII


ASCII (American Standard Code for Information Interchange) is a character encoding standard that assigns unique numeric values to characters used in English text and certain control characters. It was developed in the 1960s and became widely adopted as a standard character set for computers and communication systems.

In ASCII, each character is represented by a 7-bit binary number, allowing a total of 128 different characters to be encoded. The first 32 ASCII codes (0-31) represent control characters such as newline, carriage return, tab, and others, and code 127 is the delete (DEL) control character. The remaining 95 ASCII codes (32-126) represent printable characters including the space, uppercase and lowercase letters, digits, punctuation marks, and a few special characters.
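
The printable range can be listed with a short sketch. It uses `chr` and `ord`, which are covered later in this activity.

```python
# Codes 32 to 126 are the printable ASCII characters
printable = ''.join(chr(code) for code in range(32, 127))
print(len(printable))   # 95
print(printable)

# The largest printable code still fits in 7 bits
print(max(ord(c) for c in printable).bit_length())   # 7
```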

The following chart is from 
https://www.asciitable.com/

![](https://www.asciitable.com/asciifull.gif)

**Question 3**

- What are the decimal and hex codes of the letter 'h'?
- What are the decimal and hex codes of `backspace`?

## Escape Characters

Here are the commonly used escape characters in Python:

- `\\`: Backslash
- `\'`: Single quote
- `\"`: Double quote
- `\n`: Newline
- `\r`: Carriage return
- `\t`: Tab
- `\b`: Backspace
- `\f`: Form feed
- `\v`: Vertical tab
- `\a`: Alert (bell)
- `\0`: Null character
- `\xhh`: Hexadecimal escape (e.g., `\x41` represents the character 'A')
- `\uhhhh`: Unicode escape for 16-bit hexadecimal (e.g., `\u0041` represents the character 'A')
- `\Uhhhhhhhh`: Unicode escape for 32-bit hexadecimal (e.g., `\U00000041` represents the character 'A')

These escape characters are used to represent special characters or control sequences within a string. They allow you to include characters that are otherwise difficult to type or have specific meanings in Python strings.

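If a backslash is followed by a character that does not form a recognized escape sequence (for example `\d`), Python keeps the backslash as a literal character. Recent Python versions also emit a warning for such sequences, so it is safer to write `\\` explicitly.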

In [None]:
print('\\hello\\')
print('\nhello\n')
print('\\hello\b')

In [None]:
# a regular backslash 
print('\\nhello\n')
print('\\hello\\b')

**Question 4**

Explain the output of each of the following expressions.

In [None]:
'\x64'

In [None]:
'\x64\x7DFF'

In [None]:
'\bhello' 

In [None]:
'\ohello' == '\\ohello'

**r** is a special prefix of a string. It denotes a raw string literal. Raw strings treat backslashes (`\`) as literal characters instead of escape sequences. This can be helpful when dealing with regular expressions, file paths, or any other scenario where you want to preserve backslashes as-is. Here is an example that prints the same sequences as above using the r prefix:

In [None]:
print(r'\\hello\\')
print(r'\nhello\n')
print(r'\\hello\b')
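
One way to see the difference, as a small sketch: an escape sequence counts as a single character in a regular string, while a raw string keeps the backslash and the letter as two separate characters.

```python
regular = '\n'
raw = r'\n'

print(len(regular))     # 1: a single newline character
print(len(raw))         # 2: a backslash followed by the letter n
print(raw == '\\n')     # True: the same two characters, written with an escaped backslash
```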

## bytes and bytearray

There are two basic built-in types for binary sequences: the
immutable **bytes** type and the mutable **bytearray**. They are **NOT** string type *str*. 

Each item in bytes or bytearray is an integer from 0 to 255 in dec or from 00 to FF in hex. 

**b** is a prefix on a string literal indicating that it is a *bytes* object.

In [None]:
my_bytes = b'Hello'  # Immutable bytes
my_bytearray = bytearray(b'Hello')  # Mutable bytearray
print(type(my_bytes), my_bytes)
print(type(my_bytearray), my_bytearray)

<class 'bytes'> b'Hello'
<class 'bytearray'> bytearray(b'Hello')


In [None]:
# Each item in bytes is an integer.
for letter in my_bytes:
    print(letter)

72
101
108
108
111


In [None]:
my_bytes[0] = b'A' # raises TypeError because bytes is immutable

TypeError: 'bytes' object does not support item assignment

In [None]:
my_bytearray[0] = 100 # 100 is letter d 
print(my_bytearray)

bytearray(b'dello')
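
As a further sketch (not from the original notebook), a bytes object can also be built directly from integers, and values outside 0 to 255 are rejected.

```python
data = bytes([72, 101, 108, 108, 111])
print(data)            # b'Hello'
print(data.hex())      # 48656c6c6f
print(list(data))      # [72, 101, 108, 108, 111]

try:
    bytes([256])       # 256 does not fit in one byte
except ValueError as err:
    print('ValueError:', err)
```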


## Unicode

The concept of **string** is simple enough: a string is a sequence of characters. The
problem lies in the definition of **character**.
In 2021, the best definition of “character” we have is a Unicode character. [1]


Unicode is a standard encoding system that assigns a unique numerical value, called a code point, to every character, symbol, and emoji used in writing systems around the world. It provides a universal way to represent and manipulate text in various languages and scripts, regardless of the platform, software, or language being used.

The Unicode Consortium, a non-profit organization, is responsible for maintaining and updating the Unicode Standard. This standard includes a vast range of characters, including those from commonly used scripts like Latin, Cyrillic, Arabic, Chinese, and many others. It also encompasses less widely used scripts, historical scripts, mathematical symbols, punctuation marks, and various special characters.

Unicode is a universal character encoding standard that provides a consistent way to represent and process text across different languages and scripts, ensuring interoperability and multilingual support in modern computer systems.

The identity of a character, its **code point**, is a number from 0 to 1,114,111 (base 10), shown in the Unicode standard as 4 to 6 hex digits with a “U+” prefix, from U+0000 to U+10FFFF (base 16). For example, the code point for the letter A is U+0041, the Euro sign is U+20AC, and the musical symbol G clef is assigned to code point U+1D11E. About 13% of the valid code points have characters assigned to them in Unicode 13.0.0, the standard used in Python 3.10.0b4.

That is, Unicode contains 1,114,112 code points. Reference [7], written against an older version of the standard, counts more than 96,000 assigned characters; Unicode 13.0 assigns roughly 144,000, including alphabets, ideographs, diacritical marks, symbols, and more. The Unicode code space is divided into 17 planes, and each plane has 65,536 code points. [7]
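
The arithmetic behind these figures can be checked with a short sketch. The helper `plane_of` is a hypothetical name used only for this illustration.

```python
PLANE_SIZE = 0x10000          # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF

print(MAX_CODE_POINT + 1)                  # 1114112
print((MAX_CODE_POINT + 1) // PLANE_SIZE)  # 17

def plane_of(char):
    """Return the plane number (0 to 16) that holds the character."""
    return ord(char) // PLANE_SIZE

print(plane_of('A'))             # 0, the Basic Multilingual Plane
print(plane_of('\u20AC'))        # 0, the Euro sign
print(plane_of('\U0001D11E'))    # 1, the musical symbol G clef
```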



Use the following websites to quickly look up Unicode characters.

1. http://xahlee.info/comp/unicode_emoji.html
2. Official chart https://www.unicode.org/charts/
3. Unihan, to search for the code points of Chinese characters. This is my last name in Chinese characters: https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E6%A2%81





**Question 5**

Go to the first website and find at least three very different Unicode characters. Copy and paste the following information for each character you found, for example:

- name: GRINNING FACE WITH ONE LARGE AND ONE SMALL EYE
- dec: 129322
- hex: U+1F92A

### chr and ord

In Python, the `ord()` and `chr()` functions are used to convert between Unicode characters and their corresponding integer code points.

1. `ord(character)`: This function takes a Unicode character as input and returns its corresponding integer code point. 

2. `chr(code_point)`: This function takes an integer code point as input and returns the corresponding Unicode character. 

For example:


In [None]:
"""
Use print_large_unicode instead of print to
show a large character. 
"""
from IPython.display import display, HTML

def print_large_unicode(char, font_size=50):
    html = f'<div style="font-size: {font_size}px; font-family: monospace;">{char}</div>'
    display(HTML(html))

In [None]:
# define a string of a face 
face = '🥳'
print_large_unicode(face)

In [None]:
# call ord to find the Unicode code point
print(ord('ÿ'))
print(ord('é'))
print(hex(ord('😂')))

255
233
0x1f602


In [None]:
# f-string padding, width, and hex 
f"U+{ord('😂'):08x}".upper()

'U+0001F602'

In [None]:
"""
The string 'こんにちは' is a greeting in the Japanese language, 
and it is commonly romanized as "Konnichiwa." 
In English, it translates to "Hello" or "Good day."
"""
greeting_in_japanese = 'こんにちは'
for letter in greeting_in_japanese:
    print(ord(letter))

12371
12435
12395
12385
12399


In [None]:
"""
My Chinese name 梁敬赛 of Jingsai Liang.
"""
name = '梁敬赛'
for c in name:
    print(hex(ord(c)))

name2 = '罗新甜'
for c in name2:
    print(hex(ord(c)))

0x6881
0x656c
0x8d5b
0x7f57
0x65b0
0x751c


In [None]:
# call chr to find corresponding characters 
print_large_unicode(chr(255))
print_large_unicode(chr(0x1f602))

**Question 6**

Please print out the Unicode code points, in hex, of the characters in your name. (If you have a name in a language other than English, please use that one.)

**Question 7**

Please use chr to print out the characters you found in  question 5. 

In [None]:
chr(129322)

'🤪'

### \u and \U

In Python, both `\u` and `\U` are escape sequences used to represent Unicode characters in string literals. However, they have slightly different meanings and usage.

1. `\u`: This escape sequence is used to represent a Unicode character using a 16-bit hex value. It is followed by exactly four hex digits (`\uXXXX`). For example, `\u0041` represents the capital letter 'A'. This escape sequence is typically used for characters in the Basic Multilingual Plane (BMP), which includes most commonly used characters.

2. `\U`: This escape sequence is used to represent a Unicode character using a 32-bit hex value. It is followed by exactly eight hex digits (`\UXXXXXXXX`). This escape sequence can represent any Unicode character, including those outside the BMP. For example, `\U0001F602` represents the "Face with Tears of Joy" emoji (😂). 

In general, you will use `\u` for characters within the BMP and `\U` for characters outside the BMP. Because the maximum Unicode code point is U+10FFFF, the first two hex digits after `\U` are always `00`, and any value above `\U0010FFFF` is a syntax error.

Here's a simple example:



In [None]:
print_large_unicode('\u263a')      
print_large_unicode('\U0001F3C0')

In [None]:
# My Chinese name 梁敬赛 of Jingsai Liang.
print_large_unicode('\u6881\u656C\u8D5B')
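
The equivalence of the different notations can be checked directly. This is a small sketch using only built-in comparisons:

```python
# The same character written four ways
print('\x41' == '\u0041' == '\U00000041' == 'A')   # True

# Outside the BMP, \U (or chr) is needed to name the code point directly
print('\U0001F602' == chr(0x1F602))                # True
print(len('\U0001F602'))                           # 1: one character, despite eight hex digits
```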

**Question 8**

Please use \u or \U to print out the characters you found in question 5. 

### \N


In Python, the `\N` escape sequence is used to represent a Unicode character by its name. It is followed by the Unicode character name enclosed in curly braces (`\N{}`). 

For example:

In [None]:
print_large_unicode('\N{SMILING FACE WITH SMILING EYES}')

In [None]:
a = "\N{Face with Tears of Joy}"
print_large_unicode(a)

**Question 9**

Please use \N to print out the characters you found in  question 5. 

In [None]:
print_large_unicode('\N{GRINNING FACE WITH ONE LARGE AND ONE SMALL EYE}')
print_large_unicode('\N{SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES}')
print_large_unicode('\N{SMILING FACE WITH HORNS}')

### Skin modifier

The term "Fitzpatrick" in the context of Unicode refers to the Fitzpatrick Scale. The Fitzpatrick Scale is a recognized standard for classifying human skin tones based on a range of six categories. It was developed by Thomas B. Fitzpatrick, a Harvard dermatologist, in 1975.

In the context of Unicode, the Fitzpatrick Scale is used to provide skin tone modifiers for certain emoji characters. These modifiers allow for the representation of different skin tones when depicting human figures or body parts in emoji.

The Fitzpatrick Scale modifiers can be associated with these terms as follows:

1. 🏻 Light Skin Tone (Fitzpatrick Type-1-2): U+1F3FB
2. 🏼 Medium-Light Skin Tone (Fitzpatrick Type-3): U+1F3FC
3. 🏽 Medium Skin Tone (Fitzpatrick Type-4): U+1F3FD
4. 🏾 Medium-Dark Skin Tone (Fitzpatrick Type-5): U+1F3FE
5. 🏿 Dark Skin Tone (Fitzpatrick Type-6): U+1F3FF

Here are some Unicode code points for the emoji characters that support skin tone modifiers:

1. 👍 Thumbs Up: U+1F44D
2. 👎 Thumbs Down: U+1F44E
3. 👌 OK Hand: U+1F44C
4. 👊 Fist Bump: U+1F44A
5. ✊ Raised Fist: U+270A
6. 🤛 Left-facing Fist: U+1F91B

    ....

We can use `+` to concatenate a hand gesture character with a skin tone modifier.

In [None]:
print('\U0001F44D') # original thumbs up
print('\U0001F44D' + '\U0001F3FE') # medium-dark version

👍
👍🏾


In [None]:
base_emoji = '\U0001F44D' # Thumbs-up emoji (👍) 

base_skin_hex = 0x1F3FB # base tone - light
skin_names = ["base", "light", "medium-light", "medium", "medium-dark", "dark"]

# Print the base one
print(base_emoji, skin_names[0])
# Combine the base emoji and modifier
for i in range(5):
    skin_tone_modifier = chr(base_skin_hex + i) 
    modified_emoji = base_emoji + skin_tone_modifier
    print(modified_emoji, skin_names[i+1])


👍 base
👍🏻 light
👍🏼 medium-light
👍🏽 medium
👍🏾 medium-dark
👍🏿 dark
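
Although a modified emoji is drawn as one symbol, it is still two code points. The sketch below checks this; it also uses `str.encode`, which is introduced in the UTF-8 section that follows.

```python
thumbs_up = '\U0001F44D'
medium_dark = '\U0001F3FE'
combined = thumbs_up + medium_dark

print(len(combined))                      # 2 code points
print([hex(ord(c)) for c in combined])    # ['0x1f44d', '0x1f3fe']
print(len(combined.encode('utf-8')))      # 8 bytes, 4 per code point
```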


**Question 10**

Please choose one hand gesture and one skin tone modifier to make a combined character.

### UTF-8

To represent Unicode characters in digital systems, various encoding schemes are used, such as UTF-8, UTF-16, and UTF-32. These encoding schemes determine how the code points are stored and represented as binary data. UTF-8 is the most commonly used encoding scheme on the internet, as it is efficient, backward-compatible with ASCII, and supports the entire Unicode character set.

- UTF-8 is a variable-width encoding built from 8-bit code units. It uses between 1 and 4 bytes to encode a character; it may use fewer, the same, or more bytes than UTF-16 to encode the same character. In UTF-8, every code point from 0 to 127 (U+0000 to U+007F) is stored in a single byte. Code points 128 (U+0080) and above are stored using 2 to 4 bytes.
- UTF-16 is a variable-width encoding built from 16-bit code units. It is relatively compact, and all the most commonly used characters (those in the BMP) fit into a single 16-bit code unit. Other characters are encoded as a pair of 16-bit code units, called a surrogate pair.
- UTF-32 requires 4 bytes to encode any character. In most cases, a document encoded in UTF-32 will be nearly twice as large as the same document encoded in UTF-16. Each character is encoded in a single, fixed-width, 32-bit code unit. You would use UTF-32 if memory space is not an issue and you want to be able to use a single code unit for every character. [7]
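
The size differences described above can be checked with a short sketch. The helper `encoded_sizes` is a hypothetical name for this illustration. The `-le` codec variants are used so that no byte order mark is added to the count.

```python
def encoded_sizes(char):
    """Return the byte counts of one character in UTF-8, UTF-16, and UTF-32."""
    codecs_to_try = ('utf-8', 'utf-16-le', 'utf-32-le')
    return tuple(len(char.encode(codec)) for codec in codecs_to_try)

for ch in ['a', 'ÿ', '梁', '😂']:
    print(f'U+{ord(ch):04X}', encoded_sizes(ch))

# U+0061 (1, 2, 4)
# U+00FF (2, 2, 4)
# U+6881 (3, 2, 4)
# U+1F602 (4, 4, 4)
```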

For example:


In [None]:
bytes('acÿ', encoding='utf-8')

Although binary sequences are really sequences of integers, their literal notation reflects the fact that ASCII text is often embedded in them. Therefore, four different displays are used, depending on each byte value:

- For bytes with decimal codes 32 to 126 (from space to `~`, tilde), the ASCII character itself is used.
- For bytes corresponding to tab, newline, carriage return, and `\`, the escape sequences `\t`, `\n`, `\r`, and `\\` are used.
- If both string delimiters `'` and `"` appear in the byte sequence, the whole sequence is delimited by `'`, and any `'` inside are escaped as `\'`.
- For other byte values, a hexadecimal escape sequence is used (e.g., `\x00` is the null byte).
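
Each display rule can be seen by printing a few small byte sequences. This is a sketch with arbitrary sample values:

```python
print(bytes([65, 126]))          # b'A~'        printable ASCII shown as-is
print(bytes([9, 10, 13, 92]))    # b'\t\n\r\\'  tab, newline, carriage return, backslash
print(b"it's")                   # b"it's"      only ' is present, so " is the delimiter
print(b'\'"')                    # b'\'"'       both quotes present, so ' is escaped
print(bytes([0, 255]))           # b'\x00\xff'  everything else as hex escapes
```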

In [None]:
bytes('smiling\n🥳', encoding='utf-8')

b'smiling\n\xf0\x9f\xa5\xb3'

In [None]:
bytes('ac\b', encoding='utf-32')

b'\xff\xfe\x00\x00a\x00\x00\x00c\x00\x00\x00\x08\x00\x00\x00'

**Question 11**

Please print out the encoding, in one encoding scheme of your choice, of the characters you found in Question 5.

In [None]:
bytes('🤪', encoding='utf8')

b'\xf0\x9f\xa4\xaa'

### unicodedata

The `unicodedata` module in Python provides functions and utilities for working with Unicode characters. It allows you to retrieve information about Unicode characters such as their name, category, numeric values, and perform various operations related to Unicode.

Here are some commonly used functions from the `unicodedata` module:

- `unicodedata.category(char)`: Returns the general category of a Unicode character.
- `unicodedata.name(char)`: Returns the name assigned to a Unicode character.
- `unicodedata.lookup(name)`: Returns the Unicode character with the specified name.

For example,

In [None]:
import unicodedata

char = '🐇'
print(unicodedata.category(char))
print(unicodedata.name(char))

name = 'Mouse Face'
print(unicodedata.lookup(name))

So
RABBIT
🐭
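
The same functions can be combined into a small lookup helper. The name `describe` is hypothetical, introduced only for this sketch, and `unicodedata.numeric` illustrates the numeric values mentioned above.

```python
import unicodedata

def describe(char):
    """Return (code point, name, category) for a single character."""
    return (f'U+{ord(char):04X}', unicodedata.name(char), unicodedata.category(char))

for ch in 'A€5':
    print(describe(ch))

# ('U+0041', 'LATIN CAPITAL LETTER A', 'Lu')
# ('U+20AC', 'EURO SIGN', 'Sc')
# ('U+0035', 'DIGIT FIVE', 'Nd')

print(unicodedata.numeric('½'))   # 0.5
```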


## Encoding and Decoding

The Python distribution bundles more than 100 codecs (encoder/decoders) for text to
byte conversion and vice versa. Each codec has a name, like 'utf_8', and often
aliases, such as 'utf8', 'utf-8', and 'U8', which you can use as the encoding argument
in functions like open(), str.encode(), bytes.decode(), and so on.
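
A minimal sketch of a round trip through `str.encode()` and `bytes.decode()`, which also shows that the aliases select the same codec:

```python
text = 'café'

encoded = text.encode('utf_8')
print(encoded)                       # b'caf\xc3\xa9'

# The aliases all name the same codec
same = encoded == text.encode('utf8') == text.encode('utf-8') == text.encode('U8')
print(same)                          # True

print(encoded.decode('utf-8'))       # café
```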

Here is an example from [1]:

![](https://cs.westminstercollege.edu/~jingsai/courses/CMPT300J/handouts/encoding.png)

Each operating system and locale has its own default encoding.

In [None]:
# get default encoding [13, 14]
import locale
locale.getpreferredencoding()

Writing a file with one encoding and reading it back with a different one causes trouble.

In [None]:
# write file using utf-16
with open('myfile.bin', 'w', encoding="UTF-16") as file:
    # Write text to the file; it is stored as UTF-16 bytes
    file.write('Hello World')

In [None]:
# read file using utf-8
with open('myfile.bin', 'r', encoding='utf-8') as file:
    # Decoding the UTF-16 bytes as UTF-8 raises UnicodeDecodeError
    lines = file.readlines()
    print(lines)
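
The same mismatch can be reproduced in memory, without a file. This sketch also shows the `errors` argument of `bytes.decode()`, which is one possible way to keep going when some bytes cannot be decoded:

```python
data = 'Hello World'.encode('utf-16')   # starts with a byte order mark

try:
    data.decode('utf-8')
except UnicodeDecodeError as err:
    print('decode failed:', err.reason)

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising
print(data.decode('utf-8', errors='replace'))

# Decoding with the codec that wrote the bytes recovers the text
print(data.decode('utf-16'))            # Hello World
```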

Determining the encoding method used for decoding a given text can be a challenging task, especially if the encoding information is not explicitly provided or known. However, here are a few approaches you can try:

1. **Metadata or Documentation**: If the text is associated with metadata or documentation that specifies the encoding, check for any available information about the encoding. This might include file headers, content type headers, or accompanying documentation that indicates the encoding used.

2. **Common Encoding Detection Algorithms**: There are some common encoding detection algorithms that can analyze the byte patterns in the text data and make an educated guess about the encoding. Examples include libraries like `chardet` for Python, which uses statistical techniques to estimate the encoding.

3. **BOM (Byte Order Mark)**: If the text begins with a BOM, which is a special marker at the beginning of a text file, it can indicate the encoding. For instance, a BOM of `b'\xEF\xBB\xBF'` at the beginning of the data usually suggests UTF-8 encoding.

4. **Heuristics and Patterns**: Certain encodings have specific patterns or characteristics that can provide hints. For example, if you see a lot of high bytes (e.g., values above 127) in the data, it could be an indicator of an encoding like UTF-8.

5. **Context and Prior Knowledge**: If you have prior knowledge about the source or nature of the text data, it can provide clues about the likely encoding. For example, if the text is expected to be in a specific language or from a specific region, it may suggest a commonly used encoding for that language or region.

It's important to note that automatic encoding detection may not always be accurate, especially if the text data is incomplete or contains ambiguous byte sequences. In some cases, manual inspection or consultation with the data source or provider might be necessary to determine the correct encoding.
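
The BOM approach can be sketched with the constants in the standard `codecs` module. The function name `sniff_bom` and the `BOMS` table are hypothetical, written for this illustration. Data without a BOM returns `None`, so the other approaches are still needed.

```python
import codecs

# UTF-32 marks are checked before UTF-16 because the UTF-32-LE mark
# begins with the same two bytes as the UTF-16-LE mark.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def sniff_bom(data):
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return codec
    return None

print(sniff_bom('Hello'.encode('utf-8-sig')))   # utf-8-sig
print(sniff_bom('Hello'.encode('utf-16')))      # utf-16
print(sniff_bom('Hello'.encode('utf-8')))       # None
```

The returned name can be passed straight to `bytes.decode()`; the `utf-8-sig`, `utf-16`, and `utf-32` codecs consume the BOM while decoding.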

## Reference

1. https://www.fluentpython.com/
2. https://www.asciitable.com/
3. https://docs.python.org/3/library/functions.html#int
4. https://stackoverflow.com/questions/209513/convert-hex-string-to-integer-in-python
5. https://www.geeksforgeeks.org/python-program-to-print-emojis/#
6. http://xahlee.info/comp/unicode_index.html?q=
7. https://pro.arcgis.com/en/pro-app/2.9/help/data/geodatabases/overview/a-quick-tour-of-unicode.
8. https://www.unicode.org/charts/
9. https://www.unicode.org/cgi-bin/GetUnihanData.pl?
10. https://superuser.com/questions/826061/relationship-between-unicode-and-utf-8-16-32
11. https://docs.python.org/3/library/unicodedata.html
12. https://www.youtube.com/watch?v=sgHbC6udIqc
13. https://docs.python.org/3/library/functions.html#open
14. https://www.learnbyexample.org/python-open-function/