# 13. Strings and Text Processing

## The Anatomy of a String
---

In $C\#$, a **string** is a **sequence of characters** stored at a certain address in memory.

In $.NET\,Framework$, each character is represented by a number (a code point) from the $Unicode$ table. The $Unicode$ standard's predecessor, $ASCII$, can encode only $128$ characters with its *7-bit* table, or $256$ with the *8-bit* extended variants. This often does not meet user needs: those $128$ characters have room only for *digits*, *uppercase* and *lowercase Latin letters*, and a handful of punctuation and control characters. When you have to work with text in Cyrillic or in another script (e.g. Chinese or Arabic), $128$ or $256$ characters are far from sufficient.

As such, $.NET$ uses a *16-bit* code table for its characters, which holds $2^{16} = 65,536$ distinct values.

What's more, some characters are encoded in a special way: **two** 16-bit values from the $Unicode$ table (a *surrogate pair*) combine to form a single character, which raises the number of representable characters well beyond 100,000.
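The 16-bit storage and the two-value encoding can be observed directly. The following is a minimal sketch, not part of the original notebook; the variable names `letter` and `emoji` are illustrative, and the sample characters were chosen only because one fits in a single 16-bit value and the other does not.

```csharp
using System;

// A char occupies 2 bytes, i.e. one 16-bit value.
Console.WriteLine(sizeof(char));

// U+0416 (a Cyrillic letter) fits in a single 16-bit value.
string letter = "\u0416";

// U+1F600 lies above U+FFFF, so it is stored as two 16-bit values.
string emoji = "\U0001F600";

// Length counts 16-bit values, not the characters a reader sees.
Console.WriteLine(letter.Length);
Console.WriteLine(emoji.Length);

// The two values form a surrogate pair that maps back to one code point.
Console.WriteLine(char.IsSurrogatePair(emoji[0], emoji[1]));
Console.WriteLine(char.ConvertToUtf32(emoji, 0).ToString("X"));
```

One practical consequence is that `Length` and the indexer work on 16-bit values, so text containing such characters may report a length larger than the number of visible symbols.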

<br>

### The `System.String` Class

The `System.String` class is what enables us to directly handle strings in $C\#$.    
   
To declare strings, we will keep using the keyword `string`, which in $C\#$ is an alias for the `System.String` class from $.NET\,Framework$.
   
The `string` type is *different from other data types*.

It is itself a `class`, and as such, it follows the principles of *object-oriented programming*. Its values are stored in the **dynamic memory** (managed **heap**), and variables of type `string` hold a **reference to an object** in the **heap**.
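A short check of these claims is sketched below. It is an illustration added here rather than code from the original notebook, and the variable names (`viaKeyword`, `viaClass`, `secondReference`) are hypothetical.

```csharp
using System;

// The keyword and the class name declare exactly the same type.
string viaKeyword = "Hello";
System.String viaClass = "Hello";
bool sameType = typeof(string) == typeof(System.String);

// string is a class (a reference type), not a value type.
bool isClass = typeof(string).IsClass;

// Assigning one string variable to another copies the reference,
// so both variables point at the same object on the heap.
string secondReference = viaKeyword;
bool sameObject = object.ReferenceEquals(viaKeyword, secondReference);

Console.WriteLine(sameType);
Console.WriteLine(isClass);
Console.WriteLine(sameObject);
```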

<br>

#### Not a Universal Solution

The usage of `System.String` is **not the ideal and universal solution** – sometimes it is appropriate to use different character structures. Take for example the following **2 reasons to consider another character structure**:   

##### 1. Even Though They May Be Easier To Deal With, `string`s Are Also Immutable

One pretty significant feature (or restriction, depending on how you look at it) of the `string` class is that, although each character may be *read*, the character sequence stored in a `string` object **can never be changed** (hence, strings are **immutable**).

In [3]:
string completeThisMessage = "_ : (enter 'y' or 'n'), I can change characters in a string";

In [4]:
// we may access the characters in a string for read purposes...
completeThisMessage[ 0 ]

In [5]:
// ...but we may not change them
completeThisMessage[ 0 ] = 'n';

Error: (2,1): error CS0200: Property or indexer 'string.this[int]' cannot be assigned to -- it is read only
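Because the indexer is read-only, any "change" to a string has to produce a new object. The sketch below shows two common ways of doing that, assuming a shortened version of the message above: `Replace` returns a new string and leaves the original alone, and `System.Text.StringBuilder` (a class this section does not otherwise cover) offers a buffer whose characters can be assigned. The variable names are illustrative.

```csharp
using System;
using System.Text;

string message = "_ : I can change characters in a string";

// Replace does not edit message; it returns a new string.
string replaced = message.Replace('_', 'n');
Console.WriteLine(message);   // still starts with '_'
Console.WriteLine(replaced);  // starts with 'n'

// StringBuilder holds a mutable buffer, so its indexer can be assigned.
var builder = new StringBuilder(message);
builder[0] = 'n';
string built = builder.ToString();
Console.WriteLine(built);
```

Which approach fits best depends on how many modifications are needed; for a single substitution, a method that returns a new string is usually enough.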

##### 2. Despite Being More Manually Intensive To Set Up, `char[]`s Are In Fact Mutable

Instead of using a `string` type, we *could*, alternatively, declare a variable of type `char[]`, and fill in the array’s elements *character by character*:

In [1]:
char[] stringIsh = new char[]{'T','h','i','s',' ','i','s',' ','t','e','d','i','o','u','s'};

In [2]:
stringIsh

index,value
0,T
1,h
2,i
3,s
4,
5,i
6,s
7,
8,t
9,e


However, there are some considerable *disadvantages* to doing so:
1. *Filling in the array happens character by character, not at once*.
2. *We have to know the length of the text in advance, to be sure it will fit into the space already allocated for the array*.
3. *The text processing is manual*.

While `string`s are very similar to char arrays (`char[]`), they differ in that each character in a `char[]` *can* be modified, whereas the characters of a `string` *cannot*.

In [9]:
// Modify the first letter in the character array to read '_' instead of 'T'
stringIsh[ 0 ] = '_';

In [8]:
stringIsh

index,value
0,_
1,h
2,i
3,s
4,
5,i
6,s
7,
8,t
9,e
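The two representations can be converted into each other, which makes it possible to edit characters in an array and then go back to a `string`. The following is a minimal sketch using the `string(char[])` constructor and `ToCharArray()`; the names `letters`, `asText`, and `copy` are illustrative and not taken from the notebook.

```csharp
using System;

// Start from a character array and modify one element in place.
char[] letters = { 'T', 'h', 'i', 's' };
letters[0] = '_';

// Build a string from the array; the string receives its own copy.
string asText = new string(letters);
Console.WriteLine(asText);

// Go the other way: ToCharArray returns a new, independent array.
char[] copy = asText.ToCharArray();
copy[0] = 'T';

// Editing the copy leaves the string untouched.
Console.WriteLine(asText);
Console.WriteLine(new string(copy));
```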


<br>

#### Declaring `string`s

In [6]:
string greeting = "Hello, C#";

Above, we have just declared the variable `greeting` of type `string`, whose content is the text "Hello, C#".   
The content of the string can be pictured roughly like this:

<table style="margin: auto; background: white; color: black;">
    <thead>
        <th style="border: 1px solid black;">H</th>
        <th style="border: 1px solid black;">e</th>
        <th style="border: 1px solid black;">l</th>
        <th style="border: 1px solid black;">l</th>
        <th style="border: 1px solid black;">o</th>
        <th style="border: 1px solid black;">,</th>
        <th style="border: 1px solid black;"> </th>
        <th style="border: 1px solid black;">C</th>
        <th style="border: 1px solid black;">#</th>
    </thead>
</table>

The internal representation of the class is quite simple – an **array of characters**. 
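Seen from the outside, this array-like layout shows up through the `Length` property and the zero-based indexer. The sketch below walks over the `greeting` text position by position; it is an illustration added here, and the loop and the `first` and `last` variables are not part of the original notebook.

```csharp
using System;

string greeting = "Hello, C#";

// Length reports how many char values the string holds.
Console.WriteLine(greeting.Length);

// Positions are zero-based, just like array indices.
for (int i = 0; i < greeting.Length; i++)
{
    Console.WriteLine($"greeting[{i}] = '{greeting[i]}'");
}

// The first and last characters sit at index 0 and Length - 1.
char first = greeting[0];
char last = greeting[greeting.Length - 1];
Console.WriteLine($"{first} ... {last}");
```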