# Software security
The requirements for a program are:
* has to be correct
* has to be efficient
* has to be secure

When a program is able to perform actions outside of its intended behaviour, it can lead to insecurity.

## Security issues
### Improper implementation
When a program is not implemented properly, it can allow attackers to deviate from the programmer's intent.

### Unanticipated input
When the attacker is able to supply unanticipated input, it can cause the process to:
* access sensitive information
* deviate from the intended execution path
* execute injected code

This is a form of privilege escalation.

Programming languages offer a huge wealth of functionality.
Not knowing the nuances of a feature can lead to subtle, unintended behaviour.
This results in the program doing more than the developer expected.

## Computer architecture
### Code vs Data
Modern computers use the **Von Neumann architecture**.
This means that code and data are stored together in memory, so there is no clear distinction between the two.
This is unlike the **Harvard architecture**, which has separate hardware for storing code and data.

As a result, programs can be tricked into treating data as code, which is the basis for **code-injection attacks**.
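
The same confusion between data and code also shows up at the language level. As a minimal sketch (the function names and the harmless payload are made up for illustration), passing user-supplied text to Python's `eval` lets the text run as code, whereas `ast.literal_eval` only accepts literals:

```python
import ast

def parse_number_unsafely(text):
    # eval treats the input as Python code, not as plain data.
    return eval(text)

def parse_number_safely(text):
    # literal_eval only accepts Python literals (numbers, strings, ...)
    # and raises ValueError for anything else, such as a function call.
    value = ast.literal_eval(text)
    if not isinstance(value, int):
        raise ValueError("expected an integer")
    return value

# Hypothetical stand-in for attacker-supplied code; it only reads the
# current working directory.
harmless_payload = "__import__('os').getcwd()"

print(parse_number_unsafely("42"))
print(type(parse_number_unsafely(harmless_payload)))  # the payload was executed

try:
    parse_number_safely(harmless_payload)
except ValueError:
    print("payload rejected")
```

A real attacker would of course supply something far less benign than `getcwd()`.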

### Attacks on software
#### Integer overflow

Integer arithmetic in most programming languages is actually modular arithmetic.
Because integers are (usually) stored in a fixed number of bits, there is a maximum value that an integer can hold.
If the datatype is made to hold a value larger than its maximum, it will overflow.
In some languages this raises an explicit error; in many others, the value implicitly wraps around.

(Note that because Python does not have fixed-size integer primitives, this discussion does not apply to Python's built-in integers.)

For example, if the datatype has a size of 8 bits (1 byte), then the range of values it can take as an unsigned integer is 0-255.
Now suppose that `x=255`. If we increment `x` by 5, then we get `x=4`: the first increment wraps it around to 0, and the remaining 4 increments bring it to 4.
This causes the condition `x < x + 1` to be false at this boundary.
If the programmer assumed that this condition always holds when developing the program, this can lead to a vulnerability.
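
The wrap-around can be reproduced as a small sketch. Python's own integers do not overflow, so this borrows the fixed-size `c_uint8` type from the standard `ctypes` module to mimic an 8-bit unsigned integer; the helper name `add_uint8` is made up for illustration.

```python
import ctypes

def add_uint8(x, increment):
    # c_uint8 keeps only the lowest 8 bits, like an unsigned 8-bit integer in C.
    return ctypes.c_uint8(x + increment).value

x = 255
print(add_uint8(x, 5))        # same as (255 + 5) % 256
print(x < add_uint8(x, 1))    # the assumption x < x + 1 does not hold here
```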

Consider the following banking system.
The system allows the user to withdraw funds up to a certain limit, after which further withdrawals are denied.

In [1]:
%cd software-security-example

/home/own3d/wellspring/cyber-security/software-security-example


In [2]:
!cat withdraw.c

#include <stdio.h>

int process_withdraw(int previous_withdraw_amt, int requested_amt, int withdraw_limit) {
    if (previous_withdraw_amt + requested_amt < withdraw_limit)
        return requested_amt;
    else
        return 0;
}

int main(int argc, char *argv[])  {
   if(argc != 2) {
   	printf("Wrong number of arguments.\n");
	return 1;
   }
   
   int requested_amt;
   sscanf(argv[1], "%d", &requested_amt);

   int previous_withdraw_amt = 80;
   int WITHDRAW_LIMIT = 100;

   printf("You have previously withdrawn $%d\n", previous_withdraw_amt);
   printf("You have requested to withdraw $%d\n", requested_amt);
   
   int payout = process_withdraw(previous_withdraw_amt, requested_amt, WITHDRAW_LIMIT);
   
   if (payout > 0)
      printf("Here is your $%d. Have a nice day!\n", payout);
   else
      printf("Sorry, the requested amount is beyond the limit\n");
}


In [3]:
!./withdraw 10

You have previously withdrawn $80
You have requested to withdraw $10
Here is your $10. Have a nice day!


Since we have withdrawn \\$80, we can still withdraw up to \\$20, thus the above transaction works.

In [4]:
!./withdraw 30

You have previously withdrawn $80
You have requested to withdraw $30
Sorry, the requested amount is beyond the limit


As we can see, we cannot withdraw beyond our limit, or so it seems.

In [5]:
!./withdraw 2147483647

You have previously withdrawn $80
You have requested to withdraw $2147483647
Here is your $2147483647. Have a nice day!


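As we can see, by requesting a large value, the sum `previous_withdraw_amt + requested_amt = 80 + 2147483647` exceeds what a 32-bit signed `int` can hold and wraps around to `-2147483569`, which is less than the withdraw limit of 100.
(Strictly speaking, signed integer overflow is undefined behaviour in C; wrap-around is what this build exhibits.)
Keen readers will recognize `2147483647` as the maximum value that a 32-bit signed integer can hold.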

Thus, we can bypass the check and withdraw more than the allowed amount.
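
The arithmetic can be checked with a short Python sketch that emulates 32-bit signed wrap-around. The helper `wrap_int32` and the two `process_withdraw_*` functions are illustrative names, not part of `withdraw.c`. The second variant shows one possible way to avoid the problem, assuming the previous amount and the limit are themselves trusted, non-negative values: compare the request against the remaining allowance instead of adding first.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def wrap_int32(n):
    # Reduce n modulo 2**32 and reinterpret it as a signed 32-bit value.
    return (n - INT32_MIN) % 2**32 + INT32_MIN

def process_withdraw_wrapping(previous_amt, requested_amt, limit):
    # Mirrors the check in withdraw.c, with the sum wrapped to 32 bits.
    if wrap_int32(previous_amt + requested_amt) < limit:
        return requested_amt
    return 0

def process_withdraw_checked(previous_amt, requested_amt, limit):
    # Compare against the remaining allowance, so no sum can overflow.
    # Non-positive requests are rejected as well.
    if 0 < requested_amt < limit - previous_amt:
        return requested_amt
    return 0

print(wrap_int32(80 + INT32_MAX))
print(process_withdraw_wrapping(80, INT32_MAX, 100))
print(process_withdraw_checked(80, INT32_MAX, 100))
```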

#### Inconsistent data string representation
When different parts of a program adopt different representations of the same data, it can lead to a vulnerability.

We have seen one such vulnerability in [null-byte injection of domain](./public_key_infrastructure.ipynb#null-byte-injection), which arose because certificate verification used non-null-terminated strings while the address check used null-terminated strings.

We can also consider the following system.
The system stores each user's documents in their own home directory.
It exposes a public interface where users are allowed to query for files using their file names.
The user's home directory will be searched for the desired file.

In [6]:
from urllib.parse import unquote

BASE_URL = '/home'
USER = 'alice'

def _get_file(base_url, user, file_name):
    file_name = unquote(file_name)
    target_file = f'{base_url}/{user}/{file_name}'
    print(f'Giving the user the file: {target_file}')

alice_get_file = lambda f: _get_file(BASE_URL, USER, f)

Note the use of `unquote`.
Because the file name may be sent via a query parameter in the URL, and the file name may contain special characters that are not allowed in a URL, the file name may be percent-encoded to represent these characters.

For example, suppose the user requests the file `e=mc2.txt`. The encoded representation that is sent to the system will be `e%3Dmc2.txt`, because `=` is a reserved character in URLs.
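
As a quick check of the encoding, the standard library's `quote` and `unquote` convert between the two representations:

```python
from urllib.parse import quote, unquote

encoded = quote("e=mc2.txt")   # what the client would send
print(encoded)
print(unquote(encoded))        # what the server recovers
```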

In [7]:
alice_get_file('note.txt')

Giving the user the file: /home/alice/note.txt


Typical usage involves the user requesting the file name, and the system returning the file within their home directory.

In [8]:
alice_get_file('../bob/note.txt')

Giving the user the file: /home/alice/../bob/note.txt


However, notice that the user can inject `../` to change the directory that the system reads from.
In UNIX systems, `../` refers to the parent directory, thus the file path will resolve to `/home/bob/note.txt`.
With this, Alice can illegally obtain Bob's documents.

To counter this, a programmer may implement a sanitization function which ensures that `../` is not part of the requested file name.

In [9]:
from urllib.parse import unquote

BASE_URL = '/home'
USER = 'alice'

def _sanitized_get_file(base_url, user, file_name):
    if '../' in file_name:
        print('"../" detected in file name')
        return
    file_name = unquote(file_name)
    target_file = f'{base_url}/{user}/{file_name}'
    print(f'Giving the user the file: {target_file}')

alice_get_file = lambda f: _sanitized_get_file(BASE_URL, USER, f)

In [10]:
alice_get_file('../bob/note.txt')

"../" detected in file name


As we can see, the system (seemingly) works.
However, consider the following:

In [11]:
alice_get_file('%2e./bob/note.txt')
alice_get_file('..%2fbob/note.txt')

Giving the user the file: /home/alice/../bob/note.txt
Giving the user the file: /home/alice/../bob/note.txt


Since the input is checked in one representation (percent-encoded) but used in another (decoded), the attacker can bypass the sanitization by supplying percent-encoded strings that the check does not recognise.
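
One possible way to close this gap, sketched here under the assumption that paths are POSIX-style and that symbolic links are not a concern, is to decode first and then validate the final, normalized path rather than the raw input. The function name `resolve_user_file` is made up for illustration.

```python
import posixpath
from urllib.parse import unquote

def resolve_user_file(base_dir, user, file_name):
    # Decode first, so the check sees the same representation that is used.
    decoded = unquote(file_name)
    home = posixpath.join(base_dir, user)
    # Collapse any "../" segments before deciding whether the path is allowed.
    target = posixpath.normpath(posixpath.join(home, decoded))
    if posixpath.commonpath([home, target]) != home:
        return None  # the path escapes the user's home directory
    return target

for name in ["note.txt", "../bob/note.txt", "%2e./bob/note.txt", "..%2fbob/note.txt"]:
    print(name, "->", resolve_user_file("/home", "alice", name))
```

Checking the resolved path rather than searching for a specific pattern means the check does not depend on anticipating every way `../` might be spelled.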

#### Buffer overflow

##### Background
Refer to [computer organization](../computer-organization/stack.ipynb).

In C/C++, memory is managed by the programmer, and out-of-bounds memory accesses are not checked.
Notice that local variables are stored next to each other on the stack, so one variable can be reached from another by indexing outside the bounds of its memory.
This allows variables to be illegally read or written.

Consider the following code:

In [12]:
!cat buffer_overflow.c

#include <stdio.h>

void change_value(char index, char value) {
	char arr[10];
	int b = 100;
	printf("The value of b is %d\n", b);	
	printf("The return address is %p\n", __builtin_return_address(0));
	
	printf("\nChanging index %d of a to %d\n\n", index, value);

	arr[index] = value;
	
	printf("The value of b is %d\n", b);	
	printf("The return address is %p\n", __builtin_return_address(0));
}


int main(int argc, char *argv[])  {
	if(argc != 3) {
   		printf("Wrong number of arguments.\n");
		return 1;
   	}

	int index, value;
	sscanf(argv[1], "%d", &index);
	sscanf(argv[2], "%d", &value);

	change_value(index, value);	
}


Note that the array is named `arr` in the code but is referred to as `a` in the program's output and in the discussion below. Because it is a `char` array, each index corresponds to exactly one byte on the stack.
Notice that since `b` is stored right next to `a` on the stack, even though we are only modifying `a`, it is possible for us to modify `b` as well, as shown below.

In [13]:
!./buffer_overflow 10 42

The value of b is 100
The return address is 0x55a60280527d

Changing index 10 of a to 42

The value of b is 42
The return address is 0x55a60280527d


Notice that the value of `b` is now 42.

From the background on how variables are stored, readers may have noticed that the return address is also stored on the stack.
Thus, attackers can overwrite the return address, causing the function to return to a different location rather than to its original caller.
This attack is called **stack smashing**.

In [14]:
!./buffer_overflow 22 0 || echo "PROGRAM CRASHED"

The value of b is 100
The return address is 0x556c3ce2127d

Changing index 22 of a to 0

The value of b is 100
The return address is 0x556c3ce21200
PROGRAM CRASHED


(Note that the program was actually stopped with a `segmentation fault`, but the Jupyter environment silenced it)

As we can see, we have overwritten the lowest byte (the last two hex digits) of the return address to be `00` instead of `7d`.

This leads to a segmentation fault because the program now jumps to an invalid address after the function completes.
The processor treats whatever bytes are at the modified address as instructions and tries to execute them.
If the attacker is able to rewrite certain addresses with their malicious instructions (**shell code**), and also rewrite the return address of the function to point to their **injected shell code**, then they can cause the program to perform their desired operations.

This can be combined with [privilege escalation from the set-uid bit](./access_control.ipynb#set-uid) to perform actions that they otherwise are unable to.

One might wonder whether this vulnerability is too narrow to matter, because it can only be exploited if the attacker is allowed to choose which index of the array to modify.
However, consider the common C function `strcpy`.
The end of a string in C is signified by a null byte.
Thus, `strcpy` copies bytes into the destination buffer until it finds a null byte in the source, without checking the size of the destination.
An attacker can therefore supply a string longer than the buffer (placing the null byte only after many bytes), which leads to writes outside the memory allocated for the array.
Hence, it becomes a vulnerability that the attacker can exploit using the method described previously.
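
The effect can be illustrated with a toy model in Python. This is only a simulation: a `bytearray` stands in for the stack frame, with 10 bytes for the array followed directly by 4 bytes for `b`, which is an assumed layout. The helpers `unchecked_copy` and `bounded_copy` are made-up names that imitate an unbounded and a bounded string copy respectively.

```python
ARR_SIZE = 10  # size of the simulated char array

def make_frame():
    # Bytes 0-9 model the array; bytes 10-13 model int b = 100 (little-endian).
    return bytearray(ARR_SIZE) + (100).to_bytes(4, "little")

def read_b(frame):
    return int.from_bytes(frame[ARR_SIZE:ARR_SIZE + 4], "little")

def unchecked_copy(frame, src):
    # Like strcpy: copy until the null byte, ignoring the size of the array.
    i = 0
    while src[i] != 0:
        frame[i] = src[i]
        i += 1
    frame[i] = 0

def bounded_copy(frame, src, size):
    # Copy at most size - 1 bytes, then terminate inside the buffer.
    i = 0
    while i < size - 1 and src[i] != 0:
        frame[i] = src[i]
        i += 1
    frame[i] = 0

attacker_input = b"A" * 12 + b"\x00"  # longer than the 10-byte array

frame = make_frame()
unchecked_copy(frame, attacker_input)
print(read_b(frame))  # b has been overwritten by the input

frame = make_frame()
bounded_copy(frame, attacker_input, ARR_SIZE)
print(read_b(frame))  # b is intact
```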

#### SQL injection

Suppose we have the following system, where the program only allows access to users who can provide a valid user name and secret name pair.
The users are stored in a SQL database, and the list of users is retrieved via SQL commands.
If the system finds a user whose `(name, secret_name)` pair matches the supplied input, it will authenticate the entity with that identity.

In [15]:
import sqlite3
from sqlite3 import Error

class GateKeeper:
    def __init__(self):
        def create_connection(path):
            connection = sqlite3.connect(path)
        
            return connection
    
        self.connection = create_connection('./users.sqlite')

    def authenticate(self, name, secret_name):
        query = f"SELECT * FROM users WHERE name='{name}' AND secret_name='{secret_name}'"
        print(f"Executing query: {query}")
        try:
            cursor = self.connection.cursor()
            cursor.execute(query)
            user = cursor.fetchone()
        except Error as e:
            print(f"The error '{e}' occurred")
            return

        if not user:
            print("Name and secret name does not match. Villains are not allowed!")
        else:
            _, name, secret_name = user
            print(f"Name and secret name matches. Welcome to the club, {name} (alias {secret_name}).")

    def dump_data(self):        
        cursor = self.connection.cursor()
        cursor.execute("SELECT * FROM USERS")
        users = cursor.fetchall()
        
        for user in users:
            print(user)

gate_keeper = GateKeeper()

In [16]:
gate_keeper.dump_data()

(1, 'Alice', 'Diana')
(2, 'Bob', 'Clark')
(3, '0WN3D', '0WN463')


The above is the list of users currently in the database.

In [17]:
input_name, input_secret_name = "0WN3D", "0WN463"
gate_keeper.authenticate(input_name, input_secret_name)

Executing query: SELECT * FROM users WHERE name='0WN3D' AND secret_name='0WN463'
Name and secret name matches. Welcome to the club, 0WN3D (alias 0WN463).


In [18]:
input_name, input_secret_name = "Hacker", "pwner_1337"
gate_keeper.authenticate(input_name, input_secret_name)

Executing query: SELECT * FROM users WHERE name='Hacker' AND secret_name='pwner_1337'
Name and secret name does not match. Villains are not allowed!


As we can see, valid users are allowed while invalid users are denied, or so it seems.

Notice that to build the query, the user's input is directly substituted into the command.
Consider what happens if the user's input contains a `'`, for instance `'something something` for the name field.
The resultant SQL command would be `SELECT * FROM users WHERE name=''something something' AND secret_name='SOME_SECRET'`

In [19]:
input_name, input_secret_name = "'something something", "SOME_SECRET"
gate_keeper.authenticate(input_name, input_secret_name)

Executing query: SELECT * FROM users WHERE name=''something something' AND secret_name='SOME_SECRET'
The error 'near "something": syntax error' occurred


Notice that the resultant SQL command is invalid, thus causing an error.
This is an indication that we are able to modify the underlying SQL command.
Now, suppose that the attacker sets the name field to be `' OR 1=1 --`.


The resultant SQL command will be: 
```
SELECT * FROM users WHERE name='' OR 1=1 --' AND secret_name='SOME_SECRET'
```

In SQL, `--` symbolizes that the characters after that are comments, thus the functional command is actually:
```
SELECT * FROM users WHERE name='' OR 1=1
```

The `--` is there to cut off the rest of the original SQL template, because the remaining part would otherwise likely trigger a syntax error.

Now, notice that in the SQL statement, we are checking if the name is blank, which is false for all users in the system.
However, we perform an `OR` operation against `1=1`, which is always true.
Thus, the resultant statement is always true for all users.
Hence, we can trick the system into thinking we provided credentials that matched one of the users.

In [20]:
input_name, input_secret_name = "' OR 1=1 --", "SOME_SECRET"
gate_keeper.authenticate(input_name, input_secret_name)

Executing query: SELECT * FROM users WHERE name='' OR 1=1 --' AND secret_name='SOME_SECRET'
Name and secret name matches. Welcome to the club, Alice (alias Diana).


Thus, we have authenticated at an endpoint without knowing the credentials.


Below is a more secure implementation of the class.
Note that we pass the user input to `cursor.execute` as parameters bound to `?` placeholders, so that it is treated purely as data and cannot alter the structure of the SQL command.

In [21]:
import sqlite3
from sqlite3 import Error

class GateKeeper:
    def __init__(self):
        def create_connection(path):
            connection = sqlite3.connect(path)
        
            return connection
    
        self.connection = create_connection('./users.sqlite')

    def authenticate(self, name, secret_name):
        query = "SELECT * FROM users WHERE name=? AND secret_name=?"
        print(f"Executing query: {query}")
        try:
            cursor = self.connection.cursor()
            # Main difference: the values are bound to the ? placeholders by sqlite3
            # rather than being formatted into the query string by Python
            cursor.execute(query, [name, secret_name])
            user = cursor.fetchone()
        except Error as e:
            print(f"The error '{e}' occurred")
            return

        if not user:
            print("Name and secret name does not match. Villains are not allowed!")
        else:
            _, name, secret_name = user
            print(f"Name and secret name matches. Welcome to the club, {name} (alias {secret_name}).")

    def dump_data(self):        
        cursor = self.connection.cursor()
        cursor.execute("SELECT * FROM USERS")
        users = cursor.fetchall()
        
        for user in users:
            print(user)

gate_keeper = GateKeeper()

In [22]:
input_name, input_secret_name = "' OR 1=1 --", "SOME_SECRET"
gate_keeper.authenticate(input_name, input_secret_name)
print()
input_name, input_secret_name = "Alice", "Diana"
gate_keeper.authenticate(input_name, input_secret_name)

Executing query: SELECT * FROM users WHERE name=? AND secret_name=?
Name and secret name does not match. Villains are not allowed!

Executing query: SELECT * FROM users WHERE name=? AND secret_name=?
Name and secret name matches. Welcome to the club, Alice (alias Diana).


As we can see, SQL injection is now prevented.

#### Undocumented access points
Programmers may include **undocumented access points** for various reasons:
* Debugging purposes
* For fun and publicity, aka "for the lawlz"
* Malicious intent due to unhappiness

For example, a debugging mode or an authentication bypass can be included as part of the program.
These access points can be accessed via a special combination of inputs, such as a certain combination of keys or a certain type of string input.

However, these access points can serve as **backdoors** for attackers if they are discovered.
A **backdoor** is a covert method of bypassing authentication.
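
As a purely hypothetical sketch of what such an access point might look like, consider a login check that kept a debugging shortcut. The credential store, the debug password and the function name are all invented for illustration, and passwords are kept in plain text only to keep the sketch short.

```python
USERS = {"alice": "correct horse"}   # hypothetical credential store
DEBUG_PASSWORD = "letmein-debug"     # hypothetical leftover from debugging

def login_with_backdoor(user, password):
    # Undocumented access point: this branch skips the credential check
    # entirely, for any user name.
    if password == DEBUG_PASSWORD:
        return True
    return USERS.get(user) == password

print(login_with_backdoor("alice", "wrong password"))
print(login_with_backdoor("mallory", DEBUG_PASSWORD))
```

Anyone who learns the debug password, for instance by inspecting the binary, can authenticate as any user.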



### Preventive measures
Note that because it is inevitable that programmers make implementation mistakes, there is no "solution" per se to the issue.
However, there are a number of preventive measures that programmers can take to make their software more secure.

#### Input validation/filtering
From the above cases, we have seen that many issues stem from unexpected input being passed into the program.
Thus, we can perform **input validation**, and reject inputs that do not follow our specified format.

##### Issues
However, as discussed previously, it is difficult to ensure that a filter does not miss any malicious payloads.
Hence, a filter that blocks all bad inputs while accepting all legitimate inputs is difficult to design.

Thus, there are 2 general approaches to filtering:
* White list
    * Certain inputs are considered "safe".
    Thus, we only allow these safe inputs and reject all others.
    * Some legitimate inputs may be rejected
* Black list
    * Certain inputs are considered "dangerous" (*eg* the characters `"`, `.`, `/`, `*`, `$`, `'`).
    Thus, we reject these dangerous inputs and accept the rest.
    * Some malicious inputs may be accepted
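
The trade-off between the two approaches can be seen in a small sketch. Both policies below are hypothetical: the white list accepts only simple `.txt` names, and the black list rejects a few known-dangerous substrings.

```python
import re

ALLOWED_NAME = re.compile(r"[A-Za-z0-9_-]+\.txt")  # hypothetical white list policy
DENIED_SUBSTRINGS = ["../", "'", "$"]              # hypothetical black list policy

def passes_white_list(file_name):
    return ALLOWED_NAME.fullmatch(file_name) is not None

def passes_black_list(file_name):
    return not any(bad in file_name for bad in DENIED_SUBSTRINGS)

for name in ["note.txt", "my notes.txt", "%2e./bob/note.txt"]:
    print(name, passes_white_list(name), passes_black_list(name))
```

Here the white list rejects the legitimate name `my notes.txt` because of the space, while the black list accepts the percent-encoded traversal from the earlier example.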

#### Using safer functions
For functions that are considered unsafe, it is likely that there is a safer variant available.

For example, a safer alternative to `strcpy` is `strncpy`, which accepts an additional argument `n` and copies at most `n` bytes.
This helps prevent buffer overflow.

(However, note that this does not make it completely safe: the attacker may be able to influence the value of `n`, and `strncpy` does not null-terminate the destination if the source is `n` bytes or longer.)

Another example is the one shown in the SQL injection section, where we passed user input to the query as bound parameters instead of formatting it into the string.

#### Bounds checking
Many modern programming languages automatically perform bounds checking.
Thus, they raise an error when memory outside of the intended range is accessed, rather than implicitly allowing it.
Even though the checks add overhead to the program, many consider this a worthwhile price for preventing buffer overflows.
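
Python is one such language. As a brief sketch, repeating the earlier out-of-bounds write on a Python list raises an error instead of silently touching a neighbouring variable:

```python
arr = [0] * 10
b = 100

try:
    arr[10] = 42  # one element past the end of the list
except IndexError as error:
    print(f"rejected: {error}")

print(b)  # unaffected by the attempted write
```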

#### Type checking
Modern languages also perform type checking to ensure that the values assigned to variables match the expected types.
Thus, any conversion from one data type to another must be explicit, rather than handled implicitly.
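
For instance, Python checks types at run time and refuses to add a number to a string; the conversion has to be written out. The variable names below are borrowed from the withdrawal example purely for illustration.

```python
previous_withdraw_amt = 80
requested_amt = "30"  # e.g. text read from the command line

try:
    total = previous_withdraw_amt + requested_amt  # int + str is not allowed
except TypeError as error:
    print(f"rejected: {error}")

total = previous_withdraw_amt + int(requested_amt)  # explicit conversion
print(total)
```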

#### Memory protection
##### Canaries
**Canaries** are secret values inserted at certain memory locations at run time.
If, during run time, the program discovers that a canary has been modified, it knows that an illegal memory access has happened and stops the process.

Note that canaries should be kept secret so that attackers are not able to avoid triggering the mechanism by writing the canary values instead of their payload at those locations.
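
The idea can be sketched with a toy model. This is a simulation rather than how a compiler implements it: a `bytearray` stands in for the stack frame, and the layout (array, then canary, then return address) and all names are assumptions made for illustration.

```python
import secrets

ARR_SIZE = 10
CANARY_SIZE = 8

def make_frame(canary):
    # Layout: array (10 bytes) | canary (8 bytes) | return address (8 bytes).
    return bytearray(ARR_SIZE) + canary + (0x1234).to_bytes(8, "little")

def canary_intact(frame, canary):
    return bytes(frame[ARR_SIZE:ARR_SIZE + CANARY_SIZE]) == canary

canary = secrets.token_bytes(CANARY_SIZE)  # fresh random value on every run

frame = make_frame(canary)
frame[3] = 42  # in-bounds write
print(canary_intact(frame, canary))

frame = make_frame(canary)
for i in range(len(frame)):  # overflow that reaches the return address
    frame[i] = 0x41
print(canary_intact(frame, canary))
```

To reach the return address, a contiguous overflow has to pass through the canary, so the check before returning catches it unless the attacker can guess the canary value.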

With GCC on many Linux distributions, canaries are enabled by default, thus you need the `-fno-stack-protector` flag to compile without them.
Our previous example was compiled with this flag so that the program would not halt when the illegal memory access occurs.

In [23]:
!./buffer_overflow_with_canaries 10 42

The value of b is 100
The return address is 0x55fce3be42bf

Changing index 10 of a to 42

The value of b is 100
The return address is 0x55fce3be42bf
*** stack smashing detected ***: terminated


As we can see, the canary detected the write outside of the array, and the program was terminated before the function could return.

##### Memory randomization
Suppose that the memory locations of variables and functions are fixed.
With this knowledge, attackers can access these variables by peeking at the values at those locations, or they can cause the code to jump to their desired function and run it.

**Address space layout randomization (ASLR)** aims to make this harder for attackers by randomizing the memory locations of the key data areas of every process.

Notice that in the previous examples, each run of the binary reported a different return address.

#### Code inspection
**Taint analysis** is a type of automated source-code checking.
It identifies sources of user input and checks whether they can affect critical functions within the program.
If such a situation is found, it is flagged for further investigation by the developers.
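
The core idea can be sketched as taint propagation over a data-flow description. This is a heavily simplified illustration rather than how a real analysis tool works; the function name and the data-flow table (loosely modelled on the vulnerable `authenticate` method) are assumptions.

```python
def find_tainted_sinks(assignments, sources, sinks):
    """Return the sink variables that can be influenced by a taint source.

    assignments maps each variable to the variables it is computed from.
    """
    tainted = set(sources)
    changed = True
    while changed:  # propagate until nothing new becomes tainted
        changed = False
        for target, inputs in assignments.items():
            if target not in tainted and tainted & set(inputs):
                tainted.add(target)
                changed = True
    return sorted(tainted & set(sinks))

# Hypothetical data flow: the query is built from user input,
# while the table name is a constant.
assignments = {
    "query": ["name", "secret_name"],
    "table": [],
}
flagged = find_tainted_sinks(
    assignments,
    sources={"name", "secret_name"},
    sinks={"query", "table"},
)
print(flagged)
```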

#### Testing
Through testing, vulnerabilities can be discovered and patched before attackers find them.

Types of testing:
* White-box testing
    * Tester has access to the source code
* Black-box testing
    * Tester does not have access to the source code
* Gray-box testing
    * A combination of the above 2
    * *eg* has access to code that was disassembled from the binary
    
##### Fuzzing
**Fuzzing** is the act of sending malformed inputs to a program in an attempt to uncover vulnerabilities.
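
A minimal random fuzzer might look like the sketch below. The target `parse_record` is a hypothetical parser with a deliberate bug, and `fuzz` is an illustrative helper; real fuzzers are far more sophisticated about how they generate and mutate inputs.

```python
import random

def parse_record(data):
    # Hypothetical target: the first byte is a length, followed by the payload.
    length = data[0]
    payload = data[1:1 + length]
    return payload[length - 1]  # bug: trusts the declared length

def fuzz(target, iterations=500, seed=0):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    crashes = []
    for _ in range(iterations):
        size = rng.randrange(1, 5)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except Exception as error:  # any unexpected exception is a finding
            crashes.append((data, type(error).__name__))
    return crashes

crashes = fuzz(parse_record)
print(len(crashes), "crashing inputs found")
```

Each recorded input can then be replayed against the target to investigate the underlying bug.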

#### Apply principle of least privilege
For example:
* Be conservative about elevating the privilege of a program
* Do not give users more rights than they require
* Do not activate unnecessary options



#### Patching
When a vulnerability becomes known, patches that fix it may be issued.
Thus, it is important to keep systems up to date with patches in order to close these vulnerabilities.
This is all the more important because attackers may learn of a vulnerability once its patch is released; more attackers may then try to exploit it on systems that have not yet applied the fix.

While it is important to patch in a timely manner, patching also brings about a few issues.

The patch may not have been rigorously tested, thus applying it may cause instability in the system.
Also, the very act of patching can make the system unavailable to users while the patch is applied.