# Strings

Julia has a `Char` type for a single character and a `String` type for a finite sequence of characters.
Julia characters and strings are built to work natively with Unicode code points, so Julia can handle international languages straight out of the box.
The complexity comes when we talk about the length of strings and indexing of strings: do we mean the internal byte position or the logical character position?

Let's begin with the most basic single character.

## Character

`Char` is a primitive type that can represent any Unicode character; it also has some limited arithmetic properties.
Julia makes a distinction between a `Char` and a length-1 string; they are not the same thing.
A character literal is enclosed in *single quotes*; double quotes produce a `String` instead:

In [1]:
'a'

'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

In [2]:
typeof('a')

Char

In [3]:
typeof("a")     # Not a Char

String

A character can be converted to its numeric value by the `Int()` function:

In [4]:
Int('a')

97

The traditional ASCII character escape sequences can be used as well:

In [5]:
Int('\r'), Int('\n'), Int('\t')           # carriage return, newline, tab

(13, 10, 9)

An integer can be converted to a Char via the `Char()` function (not all integers are valid Unicode code points):

In [6]:
Char(97)

'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

In [7]:
'好'     # Chinese character for good

'好': Unicode U+597D (category Lo: Letter, other)

In [8]:
Int('好')

22909

In [9]:
Char(22909)

'好': Unicode U+597D (category Lo: Letter, other)
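
As a small sketch of the remark that not all integers are valid Unicode code points, `isvalid(Char, x)` reports whether an integer can be a character. The variable names here are only for illustration:

```julia
# Surrogate code points and values above U+10FFFF are not valid characters.
valid_a         = isvalid(Char, 0x61)       # code point of 'a'
valid_surrogate = isvalid(Char, 0xD800)     # a surrogate, not a character
valid_too_high  = isvalid(Char, 0x110000)   # beyond the Unicode range

println((valid_a, valid_surrogate, valid_too_high))   # (true, false, false)
```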

Characters can be compared according to the integer value of their code point:

In [10]:
'A' < 'M' < 'Z' < 'a' < 'm' < 'z'

true

You can perform limited arithmetic on `Char` values based on their integer code points:

In [11]:
'a' + 1

'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

In [12]:
'd' - 'a'

3

A Unicode code point can be entered with `\u` followed by up to four hexadecimal digits or `\U` followed by up to eight hexadecimal digits:

In [13]:
'\u0061'

'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

In [14]:
'\U00000061'

'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

In [15]:
'\u597d'

'好': Unicode U+597D (category Lo: Letter, other)

In [16]:
'牛'

'牛': Unicode U+725B (category Lo: Letter, other)

## String

A string is a finite sequence of characters; its type is `String`.
String literals are delimited by double quotes or triple double quotes:

In [17]:
print("abcdef")

abcdef

To put Unicode characters in the string literal, you can paste them in or use `\u` and `\U`:

In [18]:
"Julia \u771f\u725b\u771f\u597d\u7528!"

"Julia 真牛真好用!"

Use `\"` to enter a double quote and `\\` to enter a backslash, as usual.

Triple double quotes allow easy entry of long multi-line strings, and of strings that contain double quotes without escaping them:

In [19]:
str = """abc " def"""
print(str)

abc " def

Here's a multi-line string:

In [20]:
str = 
"""
abc
def
ghi
"""
print(str)

abc
def
ghi


If there's a newline immediately after the opening `"""`, it is stripped, i.e., it is not part of the string.

Triple-quoted strings are dedented to the level of the least indented line (the line with the opening triple quote is ignored).
This makes it possible to indent strings along with the code without the indentation becoming part of the string, which can improve code readability:

In [21]:
str = """
      abc
        def
          ghi  
    """
print(str)

  abc
    def
      ghi  


The automatic dedent is set by the least indented line; in the example above, that is the line with the closing triple quote.

In [22]:
str = """
      abc
        def
          ghi
      """
print(str)

abc
  def
    ghi


This example shows that the dedent ignores the opening triple-quote line. Here the closing triple quote is indented as far as the least indented text line, so the common indentation is removed entirely.

## Interpolation

It is easy to substitute expressions into a string: prefix the expression with **$**, and use parentheses if needed.
This is especially handy for printing:

In [23]:
a = 123
b = 68
str = """
    a is $a,
    b is $b,
    a + b is $(a+b),
    a - b is $(a-b),
    is a > b? $(a>b).
    """
print(str)

a is 123,
b is 68,
a + b is 191,
a - b is 55,
is a > b? true.


If you need to print a literal $, escape it with a backslash (`\$`):

In [24]:
print("\$$a")

$123

## String Basics

The length of a string, i.e., its number of characters, is obtained with the `length()` function:

In [25]:
str = "Welcome!"
length(str)

8

Strings can be indexed; the index scheme is 1-based, not 0-based:

In [26]:
print(str[1])

W

The keywords `begin` and `end` inside an indexing operation are shorthand for the first and last index of the string:

In [27]:
print(str[begin], str[end])

W!

In [28]:
print(str[end-4])

c

You can loop over the indexes of a string:

In [29]:
for i in firstindex(str):lastindex(str)
    println("$i : $(str[i])")
end

1 : W
2 : e
3 : l
4 : c
5 : o
6 : m
7 : e
8 : !


But strings naturally act as a collection of characters:

In [30]:
for c in str
    println(c)
end

W
e
l
c
o
m
e
!


Out-of-bounds indexes, such as a negative index, are illegal and generate an error; a negative index does not count from the back of the string as in some other languages.

In [31]:
println(str[-1])

BoundsError: BoundsError: attempt to access String
  at index [-1]

In [32]:
println(str[end+1])

BoundsError: BoundsError: attempt to access String
  at index [9]

Range indexing will copy a portion of the string into another string:

In [33]:
str[1:2]        # string of 2 bytes

"We"

In [34]:
str[1:1]        # 1 byte string

"W"

In [35]:
str[3:1]        # empty string, length 0

""

## Unicode strings

Strings use UTF-8, which is a variable-width encoding.
ASCII characters, i.e., those with code points less than 0x80 (128), are encoded as they are in ASCII with a single byte; code points 0x80 and above are encoded using up to four bytes per character.
If the string contains only ASCII characters, then byte indexing into the string works like it does in C.
However, if the string contains non-ASCII characters, then the only valid byte indexes are those of the first byte of each character.
If you index into a string at an invalid index, a `StringIndexError` is thrown.

In [36]:
str = "Julia 真好用！"
println("[$str], length is $(length(str)), last index is $(lastindex(str)).")

[Julia 真好用！], length is 10, last index is 16.


You can see that the length is not the same as the last index. This is because each of the three characters 真好用 takes three bytes to encode, and `lastindex()` returns the byte index at which the last character starts.
Julia's iterator works as before:

In [37]:
for c in str
    println(c)
end

J
u
l
i
a
 
真
好
用
！
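
To see where the gap between the length and the last index comes from, here is a short sketch that counts characters and bytes separately. It relies on `ncodeunits()`, which is not introduced above, and it assumes a reasonably recent Julia version in which `ncodeunits()` also accepts a `Char`:

```julia
str = "Julia 真好用！"

n_chars = length(str)        # number of characters
n_bytes = ncodeunits(str)    # number of UTF-8 code units (bytes)
last_i  = lastindex(str)     # byte index at which the last character starts

println("characters: $n_chars, bytes: $n_bytes, last index: $last_i")
# characters: 10, bytes: 18, last index: 16

# Number of bytes used by each character
for c in str
    println(c, " => ", ncodeunits(c), " byte(s)")
end
```

The final ！ is a full-width character that also takes three bytes, which is why the byte count (18) is larger than the last index (16).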


To index over the string, use `eachindex()`:

In [38]:
for i in eachindex(str)
    println("$i : $(str[i])")
end

1 : J
2 : u
3 : l
4 : i
5 : a
6 :  
7 : 真
10 : 好
13 : 用
16 : ！
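
The following sketch shows what happens at a byte index that falls in the middle of a character, and how to stay on character boundaries. The functions `isvalid()`, `nextind()` and `prevind()` are not covered above; the variable names are for illustration only:

```julia
str = "Julia 真好用！"

# Index 7 is the first byte of '真'; index 8 is in the middle of it.
println(isvalid(str, 7))     # true
println(isvalid(str, 8))     # false

# nextind/prevind step between valid character indexes.
println(nextind(str, 7))     # 10, where '好' starts
println(prevind(str, 16))    # 13, where '用' starts

# The byte index of the 8th character (a logical character position).
i8 = nextind(str, 0, 8)
println(str[i8])             # 好

# Indexing in the middle of a character throws a StringIndexError.
err = try
    str[8]
catch e
    e
end
println(typeof(err))         # StringIndexError
```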


## String Operations

The function `string(...)` concatenates arguments of any type, which is very useful for printing:

In [39]:
string('a', 'b', "cd", '好', 123, 456.78)

"abcd好123456.78"

Concatenation of Char and String can also be done with the **\*** operator:

In [40]:
'a' * 'b' * "cd" * '好'

"abcd好"

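One difference between the two approaches is worth a quick sketch: `*` only accepts strings and characters, while `string(...)` converts other values for you. The variable names below are for illustration only:

```julia
n = 42

# string() converts each argument to text before joining.
s1 = string("n = ", n)

# * needs strings or characters, so convert the number first.
s2 = "n = " * string(n)

# Passing the integer directly to * raises a MethodError.
err = try
    "n = " * n
catch e
    e
end

println(s1)             # n = 42
println(s2)             # n = 42
println(typeof(err))    # MethodError
```
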
## Non-Standard String Literals

String literals undergo a mild transformation of backslash escape sequences, e.g., `\n` is converted to a single newline character.
There are cases where you want to turn off this transformation or use a different interpretation.
Julia has several non-standard string literals that can make your life easier. The normal literal is listed first for comparison, followed by four commonly used non-standard ones:

- `"..."` is the normal string literal; `\` escapes and `\u`, `\U` are processed, as is `$variable` interpolation.

- `raw"..."` is a raw string literal; the text is taken as written, and neither `\` escapes nor `$` have special meaning (only an embedded double quote still needs a backslash). One place where it is quite useful is turning off `\` escape processing in Windows path names.

- `r"..."` is a regular expression literal; this makes it much easier to type in a regular expression without interference from normal string escape processing. Regular expression support is built into Julia.

- `b"..."` is a byte array literal; it accepts ASCII characters, `\x` and octal escapes, and Unicode escape sequences, and translates them into bytes.

- `v"..."` is a version number literal, e.g., `v"major.minor.patch-annotation"`, useful for testing software versions.
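
Here is a brief sketch of the four non-standard literals in action. The path, the pattern and the version numbers are made up for illustration:

```julia
# raw"...": backslashes are kept literally, handy for Windows paths.
path = raw"C:\new\table"
println(path)                        # C:\new\table
println(path == "C:\\new\\table")    # true

# r"...": a Regex, usable with occursin() and match().
re = r"^\d{3}-\d{4}$"
println(occursin(re, "555-1234"))    # true

# b"...": the bytes of the literal, including raw byte escapes.
bytes = b"abc\xff"
println(collect(bytes))

# v"...": a VersionNumber, compared component by component.
println(v"1.10.0" > v"1.9.3")        # true
```

Note that the version comparison is numeric per component, so `1.10.0` is newer than `1.9.3`, which a plain string comparison would get wrong.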