# 13. Strings and Text Processing

## Comparing Strings
---

There are many ways to compare strings. Depending on what we need in a particular case, we can take advantage of various features of the `string` class.

<br>

### Comparison for Equality

If we want to **compare two strings for equality**, the most convenient method is `Equals(…)`, which works equivalently to the `==` operator but is invoked as a static method on the `string` class:

In [1]:
string.Equals( "duck", "duck" )

In [2]:
string.Equals( "duck", "goose" )

In [3]:
"duck" == "duck"

In [4]:
"duck" == "goose"

<br>

#### `.Equals()` Is Case Sensitive ( By Default )

In [5]:
string.Equals(
    "even bigger", 
    "EVEN BIGGER"
)

In practice, we are often interested only in the actual text content when comparing two strings, regardless of the character casing (uppercase / lowercase). To **ignore case** in string comparison, we can use the `Equals(…)` method with the parameter `StringComparison.CurrentCultureIgnoreCase`:

In [6]:
string.Equals(
    "even bigger", 
    "EVEN BIGGER", 
    StringComparison.CurrentCultureIgnoreCase
)
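
As a small illustration of where this is useful, the sketch below checks a user's answer regardless of casing. The helper name `IsYes` is hypothetical and not part of the `string` class; it only wraps the `Equals(…)` overload shown above.

```csharp
using System;

// Hypothetical helper: true when the answer is "yes" in any letter casing.
bool IsYes(string answer)
{
    return string.Equals(answer, "yes", StringComparison.CurrentCultureIgnoreCase);
}

Console.WriteLine(IsYes("YES"));   // True
Console.WriteLine(IsYes("Yes"));   // True
Console.WriteLine(IsYes("no"));    // False
```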

<br>

### Comparison For Alphabetical Order

The `<`, `<=`, `>`, and `>=` operators work handily for numeric types like `int`, `long`, `float`, and `double`, but they are not defined for the `string` type:

In [7]:
"Apple" < "Banana"

Error: (1,1): error CS0019: Operator '<' cannot be applied to operands of type 'string' and 'string'

<br>

The `.CompareTo(…)` method from the `string` class returns a **negative value**, **zero**, or **a positive value** depending on the **lexical order of the two compared strings**.    

- A **negative value** means that the first string is lexicographically *before* the second
- A **zero** means that the two strings are *equal* 
- A **positive value** means that the first string is lexicographically *after* the second

In [8]:
// Apple is BEFORE Banana
"Apple".CompareTo( "Banana" )

In [9]:
// Apple is THE SAME THING AS Apple
"Apple".CompareTo( "Apple" )

In [10]:
// Banana is AFTER Apple
"Banana".CompareTo( "Apple" )
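
Because only the sign of the result carries meaning, code that uses `CompareTo(…)` usually tests for `< 0`, `== 0`, or `> 0` rather than for a specific number. The following is a minimal sketch of that pattern; the helper `DescribeOrder` is a hypothetical name introduced only for illustration.

```csharp
using System;

// Hypothetical helper: reduces the result of CompareTo to one of three words.
// Only the sign of the result is inspected, never its exact value.
string DescribeOrder(string first, string second)
{
    int result = first.CompareTo(second);

    if (result < 0) return "before";
    if (result > 0) return "after";
    return "same";
}

Console.WriteLine(DescribeOrder("Apple", "Banana"));  // before
Console.WriteLine(DescribeOrder("Apple", "Apple"));   // same
Console.WriteLine(DescribeOrder("Banana", "Apple"));  // after
```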

<br>

#### `.CompareTo()` Is Case Sensitive...But `string.Compare()` Has An Option To Ignore Case

In [11]:
// A horse is a Horse? 
// Not so fast....
"horse".CompareTo( "Horse" )

If we have to compare the strings lexicographically, but also **ignore the case**, then we can use either of the following:
- `string.Compare(string strA, string strB, bool ignoreCase)`
- `string.Compare(string strA, string strB, StringComparison comparisonType)`, passing `StringComparison.CurrentCultureIgnoreCase` as the last argument

These are overloads of the static `Compare(…)` method of the `string` class, which returns its result in the same way as `CompareTo(…)`:

In [12]:
// A horse is a Horse? 
// OF COURSE, OF COURSE!
string.Compare( "horse", "Horse", true )

In [13]:
// It can also accept a StringComparison.CurrentCultureIgnoreCase argument,
// which works the same way as it does in the .Equals() method
string.Compare( "horse", "Horse", StringComparison.CurrentCultureIgnoreCase )
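
One place where a case-insensitive `Compare(…)` is handy is sorting. The sketch below passes it to `Array.Sort` as the comparison rule; the helper `SortIgnoringCase` and the sample words are assumptions made up for this example, and the printed order assumes an English or invariant culture.

```csharp
using System;

// Hypothetical helper: returns a sorted copy of the words, ignoring letter case.
string[] SortIgnoringCase(string[] words)
{
    string[] sorted = (string[])words.Clone();
    Array.Sort(sorted, (a, b) =>
        string.Compare(a, b, StringComparison.CurrentCultureIgnoreCase));
    return sorted;
}

string[] animals = { "horse", "Zebra", "ant", "Cat" };
Console.WriteLine(string.Join(", ", SortIgnoringCase(animals)));
// ant, Cat, horse, Zebra
```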

<br>

#### Lexicographical Comparison Does Not Follow The Arrangement In The Unicode Table

Please note that, according to the `Compare(…)` and `CompareTo(…)` methods, **the small letters are lexicographically before the capital ones**:

In [14]:
// apple is BEFORE Apple
"apple".CompareTo( "Apple" )

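This rule can be surprising, because in the $Unicode$ table the capital letters come before the small ones. For example, according to the $Unicode$ standard, the letter $A$ has the code $65$, which is smaller than the code of the letter $a$, which is $97$. The difference arises because `Compare(…)` and `CompareTo(…)` use the sorting rules of the current culture by default rather than the raw character codes.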
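
For contrast, an ordinal comparison looks only at the numeric character codes and therefore does follow the table. A minimal sketch using `string.CompareOrdinal` and `StringComparison.Ordinal`; only the sign of each result is checked, since the exact value is not something to rely on.

```csharp
using System;

// Ordinal comparison looks only at the numeric character codes,
// so 'A' (65) sorts before 'a' (97).
int ordinal = string.CompareOrdinal("apple", "Apple");
Console.WriteLine(ordinal > 0);   // True: "apple" is AFTER "Apple" ordinally

// The same rule is available through the StringComparison enumeration.
int viaEnum = string.Compare("apple", "Apple", StringComparison.Ordinal);
Console.WriteLine(viaEnum > 0);   // True

// The character codes themselves:
Console.WriteLine((int)'A');      // 65
Console.WriteLine((int)'a');      // 97
```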

<br>

### Memory Optimization For `string`s

#### The `==` and `!=` Operators Can Lead To Confusing Results

Suppose we have some arbitrary `string`:

In [15]:
string someArbitraryString = "Ham and Cheese";

<br>

Now, let's say that we have some other `string` which is initialized with a reference to the aforementioned arbitrary `string`:

In [16]:
string someOtherString = someArbitraryString;

<img src="_img/string_object_references2.jpg" style="display: block; margin: auto;"></img>

<br>

The `==` and `!=` operators work for strings through an internal call of `Equals(…)`, so they compare the characters of the two strings rather than their references.
   
Take for example the comparison below, which returns the expected result:

In [17]:
someArbitraryString == someOtherString

<br>

Now, let's take a look at another example that's a bit more tricky:

In [18]:
string hel   = "Hel";
string hello = "Hello";
string copy  = hel + "lo";

<img src="_img/string_object_references3.jpg" style="display: block; margin: auto;"></img>

In [19]:
// Perhaps confusingly, although hello and copy do NOT point to the same object,
// C# still tells us that they are in fact equal.
hello == copy
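
To see both facts side by side, the comparison by content can be set against a comparison by reference. The sketch below uses `object.ReferenceEquals`, which reports whether two variables point at the very same object.

```csharp
using System;

string hel   = "Hel";
string hello = "Hello";
string copy  = hel + "lo";   // built at run time, so it is a separate object

Console.WriteLine(hello == copy);                        // True: same characters
Console.WriteLine(object.ReferenceEquals(hello, copy));  // False: different objects
```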

<br>

#### Interning

Suppose we have created a `string` as a literal representation of some word:

In [20]:
string word = "Bird";

<br>

Now, let's create a second `string` (which we might expect to live at a completely different place in memory) that is **initialized using the same literal expression** as the `word` above:

In [21]:
string sameWord = "Bird";

<br>

The $CLR$ applies an optimization called **interning**, which **prevents duplicate strings from being created in memory**. If a `string` object with the same content already exists in the intern pool on the heap, and a new variable is later initialized with an identical literal, the new variable is **redirected to point at the existing object** rather than at a new object with duplicate data. This applies to string literals; a string assembled at run time, such as `copy` above, is a separate object unless it is interned explicitly.
   
The following figure illustrates what actually occurs subsequent to the declarations made above:

<img src="_img/string_object_references4.jpg" style="display: block; margin: auto;"></img>
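
The effect of interning can be observed with `object.ReferenceEquals`. The sketch below is only an illustration: the variables `bi`, `built`, and `pooled` are names introduced here, and `string.Intern` is used to fetch the pooled instance for a string that was assembled at run time.

```csharp
using System;

string word     = "Bird";
string sameWord = "Bird";

// Equal literals in the same program refer to one shared object.
Console.WriteLine(object.ReferenceEquals(word, sameWord));   // True

// A string assembled at run time is not interned automatically...
string bi    = "Bi";
string built = bi + "rd";
Console.WriteLine(built == word);                            // True: same characters
Console.WriteLine(object.ReferenceEquals(word, built));      // False: separate object

// ...but string.Intern returns the pooled instance with the same content.
string pooled = string.Intern(built);
Console.WriteLine(object.ReferenceEquals(pooled, string.Intern("Bird")));  // True
```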