<a href="https://colab.research.google.com/github/OJB-Quantum/Navaho-Linguistics/blob/main/Python%20Scripts%20for%20Navaho%20Linguistics/Navaho_Characters_UTF8_Conversion_U_Plus_Notation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This notebook shows example Python scripts that convert Navaho characters into Unicode "U+" notation (code points) and save the result as UTF-8 text. The motivation is convenience: the notation makes it easy to generate the characters needed to spell Navaho words correctly. Generating characters this way may provide a reliable basis for using the language in digital form and could support future initiatives to automate Navaho translation.

In [1]:
!pip install openpyxl



If you run the code below yourself, first upload the XLSX file to Colab or place it in the folder that contains your IPYNB file.

The XLSX file that works with the code below is available at (https://github.com/OJB-Quantum/Navaho-Linguistics/blob/main/Python%20Scripts%20for%20Navaho%20Linguistics/Navaho_Characters_UTF8_Conversion_v2.xlsx)

In [3]:
# Convert the Navaho characters stored in the first column of an XLSX file into "U+" notation. The output is a txt file.
# You will notice that the (ǫ́) character does not break down into a single "U+" code.
# It has no precomposed code point, so it is stored as a base letter followed by a combining acute accent (U+0301).
# In general, complications arise when a character is made of more than 1 component (code point).
# In that case, the solution is to string 2 or more "U+" notations together.
# The result is a single letter built from all the code points that make it up.
# In most cases, however, only 1 code point is required to generate a character.
# Examples of Unicode formats...
# "U+" notation for 1 vs. 2 vs. 3 code points (respectively): U+0041 vs. U+0105 U+0301 vs. U+0043 U+0068 U+0027  # Each of these is a different character.
# "\u" notation for 1 vs. 2 vs. 3 code points (respectively): \u0041 vs. \u0105\u0301 vs. \u0043\u0068\u0027 # Each of these is a different character as well.

import openpyxl

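def convert_to_utf8_code(input_file, output_file):
    """Write one "U+" code point per line for every character in the first column."""
    # Load the workbook and select the first sheet
    workbook = openpyxl.load_workbook(input_file)
    first_sheet = workbook.worksheets[0]

    with open(output_file, mode='w', encoding='utf-8') as output:
        # Read only the first column, one cell per row
        for row in first_sheet.iter_rows(min_row=1, max_col=1, values_only=True):
            text = row[0]
            if text is not None:
                # str() guards against numeric cells, which cannot be iterated
                for char in str(text):
                    # ord() returns the Unicode code point; a combining mark such as U+0301 gets its own line
                    utf8_code = "U+{:04X}".format(ord(char))
                    output.write(utf8_code + '\n')
                    print(f"Character '{char}' -> UTF-8 Code: {utf8_code}")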

# Example usage
convert_to_utf8_code('Navaho_Characters_UTF8_Conversion_v2.xlsx', 'NV_UTF8_UPLUS_v3.txt')

Character 'N' -> UTF-8 Code: U+004E
Character 'a' -> UTF-8 Code: U+0061
Character 'v' -> UTF-8 Code: U+0076
Character 'a' -> UTF-8 Code: U+0061
Character 'h' -> UTF-8 Code: U+0068
Character 'o' -> UTF-8 Code: U+006F
Character ' ' -> UTF-8 Code: U+0020
Character 'C' -> UTF-8 Code: U+0043
Character 'h' -> UTF-8 Code: U+0068
Character 'a' -> UTF-8 Code: U+0061
Character 'r' -> UTF-8 Code: U+0072
Character 'a' -> UTF-8 Code: U+0061
Character 'c' -> UTF-8 Code: U+0063
Character 't' -> UTF-8 Code: U+0074
Character 'e' -> UTF-8 Code: U+0065
Character 'r' -> UTF-8 Code: U+0072
Character 'A' -> UTF-8 Code: U+0041
Character 'a' -> UTF-8 Code: U+0061
Character 'B' -> UTF-8 Code: U+0042
Character 'b' -> UTF-8 Code: U+0062
Character 'C' -> UTF-8 Code: U+0043
Character 'h' -> UTF-8 Code: U+0068
Character 'c' -> UTF-8 Code: U+0063
Character 'h' -> UTF-8 Code: U+0068
Character 'C' -> UTF-8 Code: U+0043
Character 'h' -> UTF-8 Code: U+0068
Character ''' -> UTF-8 Code: U+0027
Character 'c' -> UTF-8 Code:
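
The cell above prints one line per code point, so a letter such as ǫ́ ends up split across two lines. A minimal sketch of one way to keep a combining mark on the same line as its base letter is shown here. The helper name `to_uplus_clusters` is hypothetical and not part of the original notebook, and the sketch only attaches combining marks; it does not group digraphs such as Ch.

```python
import unicodedata

def to_uplus_clusters(text):
    """Return one "U+" string per base character, with combining marks attached."""
    clusters = []
    for char in text:
        code = "U+{:04X}".format(ord(char))
        # unicodedata.combining() is nonzero for combining marks such as U+0301
        if clusters and unicodedata.combining(char):
            clusters[-1] += " " + code
        else:
            clusters.append(code)
    return clusters

print(to_uplus_clusters("\u01EB\u0301"))   # ['U+01EB U+0301']
print(to_uplus_clusters("a\u0105\u0301"))  # ['U+0061', 'U+0105 U+0301']
```

Each list entry could then be written to the output file as a single line in place of the per-code-point lines.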

Note:
The character ǫ́ can be represented in Unicode, but not as a single code point. Its representation is:

In normalised NFC form, it is composed of two code points (one from Latin Extended-B, one from Combining Diacritical Marks):

Capital Ǫ́: U+01EA U+0301
Small ǫ́: U+01EB U+0301
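
The claim about NFC can be checked with Python's standard `unicodedata` module. The sketch below starts from a fully decomposed sequence (o, combining ogonek, combining acute) and normalises it; the variable names are illustrative only.

```python
import unicodedata

# o + combining ogonek (U+0328) + combining acute (U+0301)
decomposed = "o\u0328\u0301"
nfc = unicodedata.normalize("NFC", decomposed)
nfc_codes = ["U+{:04X}".format(ord(c)) for c in nfc]
print(nfc_codes)  # ['U+01EB', 'U+0301']

# By contrast, a + combining acute does have a precomposed form (U+00E1)
a_acute = unicodedata.normalize("NFC", "a\u0301")
print(len(a_acute))  # 1
```

NFC merges the o and the ogonek into U+01EB, but the acute accent stays separate because Unicode has no single code point for the combination.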

In Python strings, the equivalent of "U+" notation is the "\u" escape, which is used in the cells below.

For example:
U+0105 U+0301 = \u0105\u0301 # Notice that the "U+" codes are separated by a space, while the "\u" escapes are not.
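
The conversion between the two notations can also be done programmatically. The following is a small sketch, not part of the original notebook: `uplus_to_text` and `text_to_escapes` are hypothetical helper names, and `text_to_escapes` assumes every character fits in four hex digits, which holds for the Navaho characters listed in this notebook.

```python
def uplus_to_text(notation):
    """Convert a string such as "U+0105 U+0301" into the text it names."""
    chars = []
    for token in notation.split():
        if not token.upper().startswith("U+"):
            raise ValueError(f"not a U+ code: {token!r}")
        # The hex digits after "U+" are the code point
        chars.append(chr(int(token[2:], 16)))
    return "".join(chars)

def text_to_escapes(text):
    """Return the "\\u" escape form of text (assumes code points up to U+FFFF)."""
    return "".join("\\u{:04X}".format(ord(c)) for c in text)

letter = uplus_to_text("U+0105 U+0301")
print(letter)                   # the same two code points as "\u0105\u0301"
print(text_to_escapes(letter))  # \u0105\u0301
```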

---



In [4]:
print("\u0105\u0301")

ą́


In [5]:
print("\u0119\u0301")

ę́


In [6]:
print("\u012F\u0301")


į́


In [7]:
print("\u01EB\u0301")

ǫ́


In [8]:
print("\u0043\u0068\u0027")

Ch'


In [9]:
print("\u0105\u0301\u0105")

ą́ą


In [10]:
print("\u0054\u0142\u0027")

Tł'


In [11]:
# Here is a whole Navaho word written with Unicode "\u" escapes

print("\u004E\u0061\u0062\u00ED\u006B\u0027\u00ED\u0074\u0073\u00ED\u0064\u007A\u00ED\u0142\u006B\u0065\u0065\u0073")
# English meaning: careful thought or consideration

Nabík'ítsídzíłkees
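
The `\u` escapes used above name Unicode code points, whereas UTF-8 is the byte encoding applied when text is saved to a file. The two differ for every non-ASCII character, as this short sketch illustrates (variable names are illustrative only; `bytes.hex` with a separator needs Python 3.8 or later):

```python
char = "\u0142"  # ł
code_point = "U+{:04X}".format(ord(char))
utf8_bytes = char.encode("utf-8").hex(" ").upper()
print(code_point)  # U+0142
print(utf8_bytes)  # C5 82

# The same word as in the cell above
word = "Nab\u00EDk'\u00EDts\u00EDdz\u00ED\u0142kees"
print(len(word))                  # 18 code points
print(len(word.encode("utf-8")))  # 23 bytes
```

The word needs five more bytes than code points because each of its five non-ASCII characters takes two bytes in UTF-8.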


## **A corrected list of all the characters above (excluding long vowels):**
#### Long vowels can be generated by simply repeating or combining the single vowels shown at the end of the list.
_________________________________________________________________________

Character 'A' -> UTF-8 Code: U+0041

Character 'B' -> UTF-8 Code: U+0042

Character 'Ch' -> UTF-8 Code: U+0043 U+0068

Character 'Ch'' -> UTF-8 Code: U+0043 U+0068 U+0027

Character 'D' -> UTF-8 Code: U+0044

Character 'Dl' -> UTF-8 Code: U+0044 U+006C

Character 'Dz' -> UTF-8 Code: U+0044 U+007A

Character 'E' -> UTF-8 Code: U+0045

Character 'G' -> UTF-8 Code: U+0047

Character 'Gh' -> UTF-8 Code: U+0047 U+0068

Character 'H' -> UTF-8 Code: U+0048

Character 'Hw' -> UTF-8 Code: U+0048 U+0077

Character 'I' -> UTF-8 Code: U+0049

Character 'J' -> UTF-8 Code: U+004A

Character 'K' -> UTF-8 Code: U+004B

Character 'K'' -> UTF-8 Code: U+004B U+0027

Character 'Kw' -> UTF-8 Code: U+004B U+0077

Character 'L' -> UTF-8 Code: U+004C

Character 'Ł' -> UTF-8 Code: U+0141

Character 'M' -> UTF-8 Code: U+004D

Character 'N' -> UTF-8 Code: U+004E

Character 'O' -> UTF-8 Code: U+004F

Character 'S' -> UTF-8 Code: U+0053

Character 'Sh' -> UTF-8 Code: U+0053 U+0068

Character 'T' -> UTF-8 Code: U+0054

Character 'T'' -> UTF-8 Code: U+0054 U+0027

Character 'Tł' -> UTF-8 Code: U+0054 U+0142

Character 'Tł'' -> UTF-8 Code: U+0054 U+0142 U+0027

Character 'Ts' -> UTF-8 Code: U+0054 U+0073

Character 'Ts'' -> UTF-8 Code: U+0054 U+0073 U+0027

Character 'W' -> UTF-8 Code: U+0057

Character 'X' -> UTF-8 Code: U+0058

Character 'Y' -> UTF-8 Code: U+0059

Character 'Z' -> UTF-8 Code: U+005A

Character 'a' -> UTF-8 Code: U+0061

Character 'á' -> UTF-8 Code: U+00E1

Character 'ą' -> UTF-8 Code: U+0105

Character 'ą́' -> UTF-8 Code: U+0105 U+0301

Character 'é' -> UTF-8 Code: U+00E9

Character 'ę' -> UTF-8 Code: U+0119

Character 'ę́' -> UTF-8 Code: U+0119 U+0301

Character 'í' -> UTF-8 Code: U+00ED

Character 'į' -> UTF-8 Code: U+012F

Character 'į́' -> UTF-8 Code: U+012F U+0301

Character 'ó' -> UTF-8 Code: U+00F3

Character 'ǫ' -> UTF-8 Code: U+01EB

Character 'ǫ́' -> UTF-8 Code: U+01EB U+0301

Character 'ń' -> UTF-8 Code: U+0144
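
The list above treats digraphs and trigraphs such as Ch, Tł and Tł' as single letters. A minimal sketch of splitting a word into such letters by longest match follows. The name `split_letters` and the `MULTI_LETTERS` table are hypothetical, the table contains only the multi-character entries from the list above, and the matching is purely orthographic (it assumes lowercasing does not change the length of the word, which holds for the characters listed here).

```python
import unicodedata

# Multi-character letters taken from the list above
MULTI_LETTERS = ["Ch'", "Tł'", "Ts'", "Ch", "Dl", "Dz", "Gh", "Hw",
                 "K'", "Kw", "Sh", "T'", "Tł", "Ts"]

def split_letters(word):
    """Split word into letters, preferring the longest multi-character match."""
    candidates = sorted((m.lower() for m in MULTI_LETTERS), key=len, reverse=True)
    lowered = word.lower()
    letters = []
    i = 0
    while i < len(word):
        for cand in candidates:
            if lowered.startswith(cand, i):
                letters.append(word[i:i + len(cand)])
                i += len(cand)
                break
        else:
            if letters and unicodedata.combining(word[i]):
                # Keep a combining mark (e.g. U+0301) with its base letter
                letters[-1] += word[i]
            else:
                letters.append(word[i])
            i += 1
    return letters

print(split_letters("\u0054\u0142\u0027"))                   # one letter: T + ł + '
print(split_letters("\u0105\u0301\u0105"))                   # two letters
print(split_letters("Nab\u00EDk'\u00EDts\u00EDdz\u00ED\u0142kees"))
```

Trying longer candidates first means Tł' is not mistaken for Tł followed by a stray apostrophe.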