# 13. Strings and Text Processing

## Building Strings
---

### Never Concatenate Strings in a Loop!

Serious **performance problems** may be encountered when trying to **concatenate strings in a loop**.
    
Consider the example illustrated below:

In [1]:
// Initialize the empty string which will be iteratively concatenated 
string iterativelyConcatenated = "";


// Iterate over every integer value from 1 to 10:
for( int currentInteger = 1; currentInteger <= 10; currentInteger++ )
{

    // Iteratively concatenate the currentInteger to the string
    iterativelyConcatenated += currentInteger;

}

The problem with doing this is directly related to how the `string` type handles dynamic memory, which is where string values are stored.  

To understand why we get **poor performance when concatenating strings in a loop**,  
we must first consider what happens when using the `+` operator on strings.

<br>

#### How Does the String Concatenation Work?

Let’s now examine **what happens in memory when concatenating strings**.    
   
Consider two `string` type variables, `str1` and `str2`,  which have values of $Super$ and $Star$: 

In [2]:
string str1 = "Super",
       str2 = "Star";

<br>

There are **two areas** in the **heap** (**dynamic memory**) in which the values are stored.    

The task of `str1` and `str2` is to **keep a reference to the memory addresses** where our data is stored:

<img src="_img/string_object_references6.jpg" style="display: block; margin: auto;"></img>

<br>

Now, let's consider the following **string concatenation** which stores the `result` in a **new string**:

In [3]:
string result = str1 + str2;

In [4]:
result

SuperStar

<br>

What's happening with the memory?    

Creating the variable `result` will **allocate a new area in dynamic memory**,    
which will record the outcome of the `str1 + str2`, which is $SuperStar$.   
Then, the variable itself will **keep the address of the allocated area**. 
   
Consequently, we will have **three areas in memory** and **three references to them**:

<img src="_img/string_object_references7.jpg" style="display: block; margin: auto;"></img>

This is *convenient*, but we must consider the steps taken to achieve this result: 
1. allocating a new memory area 
2. recording the concatenated value in it 
3. creating a new variable
4. storing the address of the new memory area in that variable 

Executing these steps is a **time-consuming process** that becomes **a problem if repeated many times**, typically inside a **loop**.
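
The steps above can be observed directly. The following is a minimal sketch (the variable names `original` and `alias` are made up for illustration) showing that `+=` does not change the existing string, but allocates a new one and points the variable at it:

```csharp
using System;

string original = "Super";
string alias = original;   // a second reference to the same "Super" object

// Concatenation allocates a new string; the old "Super" object is untouched
original += "Star";

Console.WriteLine(alias);                                    // Super
Console.WriteLine(original);                                 // SuperStar
Console.WriteLine(object.ReferenceEquals(alias, original));  // False
```

Because `alias` still refers to the old object, it keeps printing the old value, which is the behaviour we would expect from an immutable type.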

<br>

#### Concatenating in a Loop of 50,000 Iterations – the Inefficient Way

Suppose we defined the following method, taking, as an argument, the endpoint of a `for` **loop** which will: 
- execute the inefficient iterative string concatenation
- return the **amount of time it took to perform the concatenation**:

In [5]:
string InefficentIterativeConcatenation( int lastInteger )
{

    // Observe the time before the loop begins
    DateTime timeBeforeLoop = DateTime.Now;


    // Initialize a string to be iteratively concatenated 
    string iterativelyConcatenated = "";


    // Iterate over every integer from 1 up to and including the last integer:
    for( int currentInteger= 1; currentInteger <= lastInteger; currentInteger++ )
    {

        // Iteratively concatenate the current integer to the string
        iterativelyConcatenated += currentInteger;

    }


    // return the amount of time it took to execute the loop
    return(
        $"Time to iteratively concatenate the string inside the loop:\t"
        +
        $"{ DateTime.Now - timeBeforeLoop }\n\n"
    );

}

<br>

Below, you'll see it takes a **pretty long time** (~5-8 seconds) to execute a loop of $50,000$ iterations when performing string concatenation.  

As the number of loop iterations increases, we can expect the running time to **grow roughly quadratically**, because each iteration copies the entire string built so far.

In [6]:
InefficentIterativeConcatenation( 50000 )

Time to iteratively concatenate the string inside the loop:	00:00:06.0224977



<br>

##### Processing Concatenations in Memory is Expensive!

The problem which makes processing iterative concatenations so time-consuming is related to the way strings work in memory.    
   
**Each iteration creates a new object in the heap** and **points a reference to it**.    
This process requires a certain amount of physical time, each and every time.   

Several things happen at each step: 
1. An area of memory is allocated to hold the result of the next concatenation. This memory is used only temporarily while concatenating, and is called a buffer.
2. The old string is copied into the new buffer. If the string is *long* (say $500 KB$, $5 MB$ or $50 MB$), this can be quite slow!
3. The next number is appended to the buffer. 
4. The buffer is converted to a string.
5. The old string and the temporary buffer become unused. Later they are destroyed by the garbage collector, which may also be a slow operation.
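
To get a feeling for how much work step 2 adds up to, here is a rough sketch that only *counts* the characters that would be copied, without building any string. The helper `CharactersCopied` is hypothetical, and the model is deliberately simplified: it ignores allocation and garbage collection costs and assumes every iteration copies the whole old string once.

```csharp
using System;

// Hypothetical helper: total number of characters copied by the naive loop
// when appending the integers 1..lastInteger one at a time.
long CharactersCopied(int lastInteger)
{
    long copied = 0;          // characters copied so far
    long currentLength = 0;   // length of the string built so far

    for (int currentInteger = 1; currentInteger <= lastInteger; currentInteger++)
    {
        // The whole old string is copied into the new buffer
        copied += currentLength;

        // Then the digits of the current integer are appended
        currentLength += currentInteger.ToString().Length;
    }

    return copied;
}

Console.WriteLine(CharactersCopied(10));       // 45
Console.WriteLine(CharactersCopied(50_000));
Console.WriteLine(CharactersCopied(100_000));
```

Under this model, doubling the number of iterations roughly quadruples the number of copied characters, which is why the loop slows down so quickly as it gets longer.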

<br>

### Building and Changing Strings with the `StringBuilder` Class

`StringBuilder` is a class that serves to **build and change strings**.    
   
It **overcomes the performance problems** that arise when concatenating strings.    
The class is built in the form of an **array of characters**, and what we need to know about it is that **the information in it can be freely modified** (hence, it is **mutable**).   
   
Changes to a `StringBuilder` object are carried out in the **same area of memory** (**buffer**), which saves time and resources.

Changing the content **does not create a new object**, but simply *changes the current one* **in place**.

<br>

#### How Does the `StringBuilder` Class Work?

The `StringBuilder` class is an implementation of a string in $C\#$, but *different* from the `string` class.   
  
Unlike `string` types, the objects of the `StringBuilder` class are **mutable**, and performing operations on them **does not require creating a new object in the memory**. 
   
This reduces the unnecessary transfer of data in memory when performing basic operations such as string concatenation.
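
A small sketch of what mutability means in practice (the names `builder` and `sameBuilder` are made up for illustration): appending changes the existing object, so every reference to it sees the new content.

```csharp
using System;
using System.Text;

StringBuilder builder = new StringBuilder("Super");
StringBuilder sameBuilder = builder;   // a second reference to the same object

// Append modifies the existing object in place
builder.Append("Star");

Console.WriteLine(sameBuilder.ToString());                        // SuperStar
Console.WriteLine(object.ReferenceEquals(builder, sameBuilder));  // True
```

Compare this with the `string` case, where the second reference would still see the old value after a concatenation.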

<br>

`StringBuilder` keeps a **buffer** with a certain $Capacity$ (*16 characters* by default):   

In [7]:
new StringBuilder()

Capacity,MaxCapacity,Length
16,2147483647,0


The **buffer** is implemented as an **array of characters**, exposed to the developer through a user-friendly interface whose methods quickly and easily **add** and **edit** the **elements** of the string.   
   
Once the internal **buffer** of the `StringBuilder` is full, **its capacity is increased automatically** (the internal **buffer** grows, typically by doubling, while its content is kept unchanged).    

**Resizing is a slow operation**, but it happens rarely, so the total performance is good.
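
One way to watch the buffer grow is to append one character at a time and print `Capacity` whenever it changes. This is only an observation sketch: the exact growth steps are an implementation detail of the runtime, so no particular sequence of capacities is assumed here beyond the documented default of 16.

```csharp
using System;
using System.Text;

StringBuilder builder = new StringBuilder();
int initialCapacity = builder.Capacity;
int lastCapacity = initialCapacity;

Console.WriteLine($"Length {builder.Length,4}  Capacity {lastCapacity,4}");

for (int i = 0; i < 100; i++)
{
    builder.Append('x');

    // Report only the appends that forced the buffer to grow
    if (builder.Capacity != lastCapacity)
    {
        lastCapacity = builder.Capacity;
        Console.WriteLine($"Length {builder.Length,4}  Capacity {lastCapacity,4}");
    }
}
```

Only a handful of lines should be printed for the 100 appends, which illustrates why the occasional resize does not hurt overall performance.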

<br>

#### Concatenating in a Loop of 100,000 Iterations – The Right Way

Suppose we defined the following method, taking, as an argument, the endpoint of a `for` **loop** which will: 
- execute efficient iterative string concatenation using the `StringBuilder` class
- return the **amount of time it took to perform the concatenation**:

In [8]:
string EfficentIterativeConcatenation( int lastInteger )
{

    // Observe the time before the loop begins
    DateTime timeBeforeLoop = DateTime.Now;


    // Initialize an empty StringBuilder object
    // that will be appended to iteratively
    StringBuilder sBuilder = new StringBuilder();


    // Iterate over every integer from 1 up to and including the last integer:
    for( int currentInteger= 1; currentInteger <= lastInteger; currentInteger++ )
    {

        // Iteratively append the current integer to the character array
        sBuilder.Append( currentInteger );

    }


    // return the amount of time it took to execute the loop
    return(
        $"Time to iteratively concatenate the string inside the loop:\t"
        +
        $"{ DateTime.Now - timeBeforeLoop }\n\n"
    );

}

<br>

Below, you'll see it takes **far less time** (< 1 second) to execute a loop of $100,000$ iterations when concatenating with the `StringBuilder` class than it did with the `+` operator.  

In [9]:
EfficentIterativeConcatenation( 100000 )

Time to iteratively concatenate the string inside the loop:	00:00:00.0012853
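
The methods above measure time with `DateTime.Now`, which is fine for a rough impression. As an alternative sketch, `System.Diagnostics.Stopwatch` is intended for measuring elapsed time, and the two approaches can be compared side by side while also checking that they build the same text. The iteration count of 20,000 is an arbitrary choice to keep the run short, and the timings will differ from machine to machine.

```csharp
using System;
using System.Diagnostics;
using System.Text;

const int lastInteger = 20_000;   // arbitrary size, kept small so the sketch runs quickly

// Time the + operator
Stopwatch stopwatch = Stopwatch.StartNew();
string concatenated = "";
for (int currentInteger = 1; currentInteger <= lastInteger; currentInteger++)
{
    concatenated += currentInteger;
}
stopwatch.Stop();
TimeSpan plusOperatorTime = stopwatch.Elapsed;

// Time the StringBuilder
stopwatch.Restart();
StringBuilder builder = new StringBuilder();
for (int currentInteger = 1; currentInteger <= lastInteger; currentInteger++)
{
    builder.Append(currentInteger);
}
string built = builder.ToString();
stopwatch.Stop();
TimeSpan builderTime = stopwatch.Elapsed;

Console.WriteLine($"+ operator:    {plusOperatorTime}");
Console.WriteLine($"StringBuilder: {builderTime}");
Console.WriteLine(concatenated == built);   // True
```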



<br>

#### The More Important `StringBuilder` Methods

The `StringBuilder` class provides us with a set of **methods** that help us to easily and efficiently edit text data and construct text. The **most important** are:

- `StringBuilder(int capacity)` <br> constructor with an initial capacity parameter. It may be used to set the buffer size in advance if we have estimates of the number of iterations and concatenations, which will be performed. This way we can save unnecessary dynamic memory allocations.
- `Capacity` <br> returns the buffer size (total number of used and unused positions in the buffer).
- `Length` <br> returns the length of the string stored in the builder (number of used positions in the buffer).
- `Indexer [int index]` <br> returns the character stored at the given position.
- `Append(…)` <br> appends a string, number or other value after the last character in the buffer.
- `Clear()` <br> removes all characters from the buffer (the length becomes 0).
- `Remove(int startIndex, int length)` <br> removes (deletes) a run of characters from the buffer, given a start position and a length.
- `Insert(int offset, string str)` <br> inserts a string at a given start position (offset).
- `Replace(string oldValue, string newValue)` <br> replaces all occurrences of a given substring with another substring.
- `ToString()` <br> returns the StringBuilder object content as a string object.
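
The following short sketch exercises most of the members listed above on one builder. The variable name `text`, the sample strings and the initial capacity of 32 are arbitrary choices for illustration.

```csharp
using System;
using System.Text;

// Reserve room for 32 characters up front
StringBuilder text = new StringBuilder(32);

text.Append("Hello");                            // Hello
text.Append(", ").Append(2).Append(" worlds");   // Hello, 2 worlds
text.Insert(0, ">> ");                           // >> Hello, 2 worlds
text.Replace("worlds", "strings");               // >> Hello, 2 strings
text.Remove(0, 3);                               // Hello, 2 strings

Console.WriteLine(text[0]);        // H
Console.WriteLine(text.Length);    // 16
string result = text.ToString();
Console.WriteLine(result);         // Hello, 2 strings

text.Clear();
Console.WriteLine(text.Length);    // 0
```

Most of these methods return the builder itself, which is what makes the chained `Append` calls on one line possible.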

<br>