# Chapter-7 Strings
This notebook contains the sample source code explained in the book *Hands-On Julia Programming, Sambit Kumar Dash, 2021, bpb Publications. All Rights Reserved*.

In [171]:
using Pkg
pkg"activate ."
pkg"instantiate"

[32m[1m  Activating[22m[39m project at `C:\Users\WoU_AI_ML`


## 7.1 Introduction

A string can be thought of as a sequence of characters. For a detailed treatment, please refer to the book chapter.

## 7.2 String

The following examples create strings using the different literal forms: double quotes, triple quotes, interpolation, and escape sequences.

In [172]:
str = "This is a string"

"This is a string"

In [173]:
str = """ 
        This is a preformatted 
        "string" """

" \nThis is a preformatted \n\"string\" "

In [174]:
a = "Jack"
b = "Jill"
c = "100"

str = "$a owes $b $c dollars"

"Jack owes Jill 100 dollars"

In [175]:
str = "This is a \"quoted\\  ' string"

"This is a \"quoted\\  ' string"

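Interpolation is not limited to plain variables, and it can be switched off entirely. The following is a minimal sketch; the variable names and the Windows-style path are purely illustrative.

```julia
a = "Jack"
n = 3

# $(...) interpolates the value of an arbitrary expression, not just a variable.
msg = "$a owes Jill $(n * 50) dollars"
println(msg)

# A raw string literal performs no interpolation and keeps backslashes as written.
path = raw"C:\temp\$a"
println(path)
```
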
## 7.3 String Methods

Strings are immutable: they cannot be modified in place. String methods either return an attribute of a string or produce a new string derived from one or more existing strings.
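
As a small illustration of immutability, the sketch below tries to overwrite one character and then builds a new string instead. It assumes only that assigning into a `String` raises an error, which is the behaviour of current Julia releases as far as I know.

```julia
s = "julia"

# Assigning to a character position is not supported for String, so this raises an error.
modified = try
    s[1] = 'J'
    true
catch
    false
end

# The usual approach is to derive a new string and leave the original untouched.
t = uppercasefirst(s)
println((modified, s, t))
```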

### Comparisons

In [176]:
s1 = "abc"
s2 = "def"
s1 < s2

true

In [177]:
s2 > s1

true

In [178]:
s1 = "abc"
s2 = "abc"
s1 == s2

true

In [179]:
s1 === s2

true
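
String comparison is lexicographic by character code point, so uppercase ASCII letters sort before lowercase ones. The sketch below illustrates this with `cmp` and `sort`; the word list is made up for the example.

```julia
# cmp returns -1, 0, or 1 depending on the lexicographic order.
order = cmp("abc", "abd")

# 'Z' has a smaller code point than 'a', so "Zebra" sorts first.
upper_first = "Zebra" < "apple"

words = ["pear", "Fig", "apple"]
by_codepoint = sort(words)                 # case-sensitive order
ignoring_case = sort(words, by = lowercase) # compare lowercased copies instead
println(by_codepoint, " ", ignoring_case)
```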

### Iteration

Strings can be iterated as collections of characters. However, string indices are byte offsets, so only indices that fall on a character boundary are valid.

In [180]:
s = "Julia"
for c in s
    println(c)
end

J
u
l
i
a


In [181]:
s[1], s[2], s[3], s[4], s[5] 

('J', 'u', 'l', 'i', 'a')

In [182]:
s[begin], s[begin+2], s[end-1], s[end]

('J', 'l', 'i', 'a')

In [183]:
s = "\u2200 x \u2203 y"

"∀ x ∃ y"

In [184]:
length(s)

7

In [185]:
sizeof(s)

11

In [186]:
s[1]

'∀': Unicode U+2200 (category Sm: Symbol, math)

In [187]:
s[2]

LoadError: StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>' '

In [188]:
s[4]

' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

In [189]:
for c in s
    println(c)
end

∀
 
x
 
∃
 
y


In [190]:
i, l = firstindex(s), lastindex(s)
while i <= l
    println(s[i])
    i = nextind(s, i)
end

∀
 
x
 
∃
 
y
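
The valid indices can also be listed directly rather than stepped through with `nextind`. The following sketch uses `eachindex`, `isvalid`, and `pairs` on the same string as above.

```julia
s = "\u2200 x \u2203 y"

# eachindex yields only the byte indices at which a character starts.
idx = collect(eachindex(s))

# isvalid reports whether a given byte index is a character boundary.
boundary = [isvalid(s, i) for i in 1:4]

# pairs iterates over index => character together.
for (i, c) in pairs(s)
    println(i, " => ", c)
end
```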


### Split and Concatenate

Both splitting and concatenation return new strings; the original string is not modified.

In [191]:
str = "This is a String"
str[1:4]

"This"

In [192]:
str[1:4]*str[end-6:end]

"This String"

In [193]:
repeat("A:-", 5)

"A:-A:-A:-A:-A:-"

In [194]:
"A:="^4

"A:=A:=A:=A:="

In [195]:
join(["1", "2", "3", "4", "5"])

"12345"

In [196]:
join(["Jack", "Jill", "Cathy", "Trevor"], ", ", " and ")

"Jack, Jill, Cathy and Trevor"

In [197]:
str = "This is a\nString\n"
chomp(str)

"This is a\nString"

In [198]:
chop("October")

"Octobe"

In [199]:
chop("October", head=2, tail=3)

"to"

In [200]:
s = "\u2200 x \u2203 y"
ss = split(s)

4-element Vector{SubString{String}}:
 "∀"
 "x"
 "∃"
 "y"

In [201]:
s = "\u2200,x,\u2203,y"
ss = split(s, ',', limit=2)

2-element Vector{SubString{String}}:
 "∀"
 "x,∃,y"

In [202]:
s = "\u2200,x,\u2203,y"
ss = rsplit(s, ',', limit=2)

2-element Vector{SubString{String}}:
 "∀,x,∃"
 "y"

In [203]:
lpad("string", 10, "p")

"ppppstring"

In [204]:
rpad("string", 10, "s")

"stringssss"

In [205]:
strip("     string 123  ")

"string 123"

In [206]:
strip(" {a}     string 123  ", ['{', 'a', '}', ' '])

"string 123"

In [207]:
strip("     string 123  aaa") do x
    return x == ' ' || x == 'a'
end

"string 123"

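The operations in this section can be combined. The sketch below splits a string, rejoins the pieces in reverse order, and concatenates mixed values with `string`; it also checks that the original is left unchanged.

```julia
str = "This is a String"

# split returns SubString views; the original string is not altered.
words = split(str)

# join builds a new string from the pieces.
reversed = join(reverse(words), " ")

# string() concatenates the printed form of any values.
combined = string(str[1:4], " ", 42)

println(reversed)
println(combined)
```
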
### Case Conversion

In [208]:
uppercase("Julia")

"JULIA"

In [209]:
lowercase("JUliA")

"julia"

In [210]:
titlecase("hands on programming in julia")

"Hands On Programming In Julia"

In [211]:
uppercasefirst("julia")

"Julia"

In [212]:
lowercasefirst("Julia")

"julia"

### Match and Replace

In [213]:
str = "Introduction to Julia"
startswith(str, "Intro")

true

In [214]:
endswith(str, "Julia")

true

In [215]:
contains(str, "to")

true

In [216]:
occursin("to", str)

true

In [217]:
r = findfirst("o", "Introduction to Julia")
while r !== nothing 
    println(r)
    r = findnext("o", "Introduction to Julia", r.stop+1)
end

5:5
11:11
15:15


In [218]:
findlast("o", "Introduction to Julia")

15:15

In [219]:
replace("Introduction to Julia", "o"=>"a")

"Intraductian ta Julia"

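A few related helpers are often useful alongside the search functions above. This sketch counts occurrences, collects all match positions, and limits the number of replacements with the `count` keyword.

```julia
str = "Introduction to Julia"

# Number of non-overlapping occurrences of the pattern.
n = count("o", str)

# Start index of every occurrence.
positions = [first(r) for r in findall("o", str)]

# Replace only the first occurrence.
first_only = replace(str, "o" => "0"; count = 1)

println((n, positions, first_only))
```
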
#### Regular Expressions

Regular expressions are a compact language for describing text patterns. Readers are advised to consult a dedicated text on the topic for a detailed understanding.

In [220]:
rx = Regex("a.a")

r"a.a"

In [221]:
m = match(rx, "abracadabra")

RegexMatch("aca")

In [222]:
m.match

"aca"

In [223]:
m = match(rx, "abracadabra", 5)

RegexMatch("ada")

In [224]:
rx = Regex("a(.)a")
m = match(rx, "abracadabra")
m.captures

1-element Vector{Union{Nothing, SubString{String}}}:
 "c"

In [225]:
rx = Regex("a(?<key>.)a")
m = match(rx, "abracadabra")
m.captures

1-element Vector{Union{Nothing, SubString{String}}}:
 "c"

In [226]:
m["key"]

"c"

In [227]:
rx = r"a.a"
m = eachmatch(rx, "abracadabra", overlap=true)

Base.RegexMatchIterator(r"a.a", "abracadabra", true)

In [228]:
collect(m)

2-element Vector{RegexMatch}:
 RegexMatch("aca")
 RegexMatch("ada")

In [229]:
m = eachmatch(rx, "abracadabra", overlap=false)

Base.RegexMatchIterator(r"a.a", "abracadabra", false)

In [230]:
collect(m)

1-element Vector{RegexMatch}:
 RegexMatch("aca")
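
Regular expressions also work with `replace`, and a failed `match` returns `nothing` rather than raising an error. The sketch below shows both, using the `r"..."` literal and an `s"..."` substitution string; the sample text is made up.

```julia
# Named groups can be read back by name from the match object.
rx = r"(?<first>\w+)\s+(?<second>\w+)"
m = match(rx, "Jack Jill")
println(m["first"], " / ", m["second"])

# In a substitution string, \1 and \2 refer to the captured groups.
swapped = replace("Jack Jill", r"(\w+) (\w+)" => s"\2 \1")

# No match yields nothing, so test the result before using it.
missing_match = match(r"z.z", "abracadabra")
found = missing_match !== nothing
```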

## 7.4 Encodings

`String` objects are stored internally in the UTF-8 encoding. However, they can be converted to and from other Unicode encodings such as UTF-16 or UTF-32 with `transcode`.

In [231]:
s = "\u2200 x \u2203 y"

"∀ x ∃ y"

In [232]:
transcode(UInt16, s)

7-element Vector{UInt16}:
 0x2200
 0x0020
 0x0078
 0x0020
 0x2203
 0x0020
 0x0079

In [233]:
transcode(UInt8, s)

11-element Base.CodeUnits{UInt8, String}:
 0xe2
 0x88
 0x80
 0x20
 0x78
 0x20
 0xe2
 0x88
 0x83
 0x20
 0x79

In [234]:
transcode(UInt32, s)

7-element Vector{UInt32}:
 0x00002200
 0x00000020
 0x00000078
 0x00000020
 0x00002203
 0x00000020
 0x00000079

In [235]:
transcode(String, transcode(UInt16, s))

"∀ x ∃ y"

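The number of code units per character depends on the encoding. As a sketch, a character outside the Basic Multilingual Plane (here U+1F600, chosen only for illustration) takes four bytes in UTF-8, a surrogate pair in UTF-16, and a single unit in UTF-32.

```julia
s = "\U1F600"   # one character outside the Basic Multilingual Plane

u8 = transcode(UInt8, s)    # UTF-8 code units
u16 = transcode(UInt16, s)  # UTF-16 code units (a surrogate pair)
u32 = transcode(UInt32, s)  # UTF-32 code units

println((length(s), length(u8), length(u16), length(u32)))
```
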
### Some Useful Functions

In [236]:
isascii("∀ x ∃ y"), isascii("abcd ef")

(false, true)

In [237]:
iscntrl('a'), iscntrl('\x1')

(false, true)

In [238]:
isdigit('a'), isdigit('9')

(false, true)

In [239]:
isxdigit('a'), isxdigit('x')

(true, false)

In [240]:
isletter('1'), isletter('a')

(false, true)

In [241]:
isnumeric('1'), isnumeric('௰') # '௰' is the Tamil numeral for ten

(true, true)

In [242]:
isuppercase('A'), islowercase('a')

(true, true)

In [243]:
isspace('\n'), isspace('\r'), isspace(' '), isspace('\x20')

(true, true, true, true)
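
Because these predicates take a single character, they can be passed to higher-order functions that iterate over a string. A brief sketch with made-up input strings:

```julia
# Keep only the characters for which the predicate is true; the result is a String.
only_digits = filter(isdigit, "a1b2c3")

# Check that every character satisfies the predicate.
letters_only = all(isletter, "Julia")

# Count the characters that satisfy the predicate.
n_spaces = count(isspace, "a b\tc\n")

println((only_digits, letters_only, n_spaces))
```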

## 7.5 Character Arrays

If you need to manipulate a string character by character, it may be best to convert the `String` into a `Vector{Char}`.

In [244]:
collect("∀ x ∃ y")

7-element Vector{Char}:
 '∀': Unicode U+2200 (category Sm: Symbol, math)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 '∃': Unicode U+2203 (category Sm: Symbol, math)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'y': ASCII/Unicode U+0079 (category Ll: Letter, lowercase)
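
Unlike a `String`, the resulting vector is mutable. The sketch below edits it in place and converts it back with the `String` constructor.

```julia
chars = collect("julia")

# Elements of a Vector{Char} can be reassigned.
chars[1] = 'J'

# In-place array operations are available too.
reverse!(chars)

# Build a new String from the modified characters.
s = String(chars)
println(s)
```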

## 7.6 Custom Strings

If the Unicode-based `String` type does not meet all your needs, you may have to implement your own string type as a subtype of `AbstractString`. If the character codes you plan to use do not map to a Unicode `Char`, you can create your own character type as a subtype of `AbstractChar`. The `LegacyStrings.jl` package has some sample implementations of such string types for reference.

In [245]:
eltype("abcd")

Char

The following command may take several minutes to complete if your environment has never been updated.

In [246]:
]add LegacyStrings

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `C:\Users\WoU_AI_ML\Project.toml`
[32m[1m  No Changes[22m[39m to `C:\Users\WoU_AI_ML\Manifest.toml`


In [247]:
using LegacyStrings

In [248]:
s = ASCIIString("abcd")

"abcd"

In [249]:
ncodeunits(s)

4

In [250]:
codeunit(s)

UInt8

In [251]:
s16 = UTF16String(transcode(UInt16, "abcd\0"))

"abcd"

In [252]:
codeunit(s16)

UInt16

In [253]:
typeof(s16)

UTF16String

In [254]:
ncodeunits(s16)

4

Both `UTF16String` and `ASCIIString` behave like collections of `Char`, while internally they store their data as 16-bit and 8-bit code units respectively. Hence, a string type derived from `AbstractString` does not necessarily need its own `AbstractChar` subtype.

In [255]:
eltype(s), eltype(s16)

(Char, Char)
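
To make the idea concrete, here is a minimal sketch of a custom string type. `Latin1String` is a hypothetical name invented for this example and is not part of `LegacyStrings.jl`; it assumes one byte per character, with each byte value used directly as the code point. Only the core methods are defined, so a production type would likely need more.

```julia
# Hypothetical type: stores one byte per character (Latin-1 style).
struct Latin1String <: AbstractString
    data::Vector{UInt8}
end

# Core methods the generic AbstractString code relies on.
Base.ncodeunits(s::Latin1String) = length(s.data)
Base.codeunit(::Latin1String) = UInt8
Base.codeunit(s::Latin1String, i::Integer) = s.data[i]
Base.isvalid(s::Latin1String, i::Integer) = 1 <= i <= length(s.data)

# Iteration yields ordinary Char values, so no custom AbstractChar is needed.
function Base.iterate(s::Latin1String, i::Integer = 1)
    i > length(s.data) && return nothing
    return (Char(s.data[i]), i + 1)
end

l1 = Latin1String([0x63, 0x61, 0x66, 0xe9])  # the bytes of "café" in Latin-1
println(length(l1), " ", collect(l1))
```

The stored data is four bytes, whereas the equivalent UTF-8 `String` needs five, yet both compare equal character by character.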