# Builtins Module: The String Class (str)

The name ```str``` is an abbreviation for string. A ```str``` instance is an immutable sequence of Unicode characters.

## Categorize_Identifiers Module

This notebook uses the functions ```dir2```, ```variables``` and ```view``` from the custom module ```categorize_identifiers```, which is found in the same directory as this notebook file. ```dir2``` is a variant of ```dir``` that groups identifiers into a ```dict``` under categories, ```variables``` is an IPython-based variable inspector and ```view``` is used to view a ```Collection``` in more detail:

In [1]:
from categorize_identifiers import dir2, variables, view

## Initialisation Signature

The initialisation signature of the ```str``` class may be printed using:

In [2]:
str?

Init signature: str(self, /, *args, **kwargs)
Docstring:
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.
Type:           type
Subclasses:     StrEnum, DeferredConfigString, FoldedCase, _rstr, _ScriptTarget, _ModuleTarget, LSString, include, Keys, InputMode, ...

The purpose of the initialisation signature is to provide the data required to initialise a new instance. For the ```str``` class, the initialisation signature and its docstring show three alternative ways of supplying the required instance data.

If the first way is examined:

```python
str(self, /, *args, **kwargs)
```

To recap:

* The parentheses ```( )``` are used to call a function and supply any necessary input arguments.
* The comma ```,``` is used as a delimiter to separate out any input arguments.
* In Python ```self``` is used to denote *this instance*, in this case the new ```str``` instance being initialised. The data used to initialise it is supplied through the remaining input arguments, for example an existing ```str```. This is a special case as a ```str``` is a fundamental datatype and has a shorthand way of instantiation.
* Any input argument before a ```/``` must be provided positionally.
* ```*args``` indicates a variable number of additional positional input arguments. These are typically not used for the string class.
* ```**kwargs``` indicates a variable number of additional named input arguments. These are typically not used for the string class.
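The markers recapped above can be tried out on a small user-defined function. The following is a minimal sketch: the function name ```describe``` and its parameter ```value``` are hypothetical and are only used to illustrate ```/```, ```*args``` and ```**kwargs```:

```python
def describe(value, /, *args, **kwargs):
    """Return the inputs grouped by how they were supplied."""
    return value, args, kwargs


# value is positional only, additional positional input arguments are
# collected in a tuple and additional named input arguments in a dict
result = describe('hello', 1, 2, sep='-')
print(result)

# a positional-only parameter cannot be supplied by name: the name is
# captured by **kwargs instead and value is reported as missing
try:
    describe(value='hello')
except TypeError as error:
    error_message = str(error)
    print(error_message)
```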

The instance data can be provided positionally using an existing ```str``` instance:

In [3]:
str('hello')

'hello'

However because the ```str``` is a fundamental datatype, it is normally instantiated using the following shorthand:

In [4]:
'hello'

'hello'

The characters in a ```str``` instance must be enclosed in quotations. These are used to distinguish a ```str``` of characters from an instance name.

Notice the difference in the syntax colour highlighting between the ```str``` instance (top) and the instance name (below). The instance name does not exist and the Python interpreter will flag a ```NameError``` when attempting to look it up:

```python
'hello'
```

```python
hello
```
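The ```NameError``` described above can be observed without stopping the notebook by catching it. This is a minimal sketch, assuming the name ```hello``` has not been assigned anywhere in the current session:

```python
try:
    hello  # an instance name that has not been assigned
except NameError as error:
    message = str(error)

# the message names the identifier that could not be looked up
print(message)
```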

In VSCode the Variables button can be selected to view Variables present. In this notebook, the custom function ```variables``` will instead be used which has a similar form:

In [5]:
variables()

|Instance Name|Type|Size/Shape|Value|
|---|---|---|---|


If the following code is input:

In [6]:
'hello'

'hello'

Notice the value ```'hello'``` is returned to the cell output. When a value is returned to the cell output, it is not stored elsewhere. 

In [7]:
variables()

|Instance Name|Type|Size/Shape|Value|
|---|---|---|---|


This Python ```str``` instance has no instance name and therefore cannot be reselected. Conceptualise an instance name as a label which points to the ```str``` instance and is therefore used to select the ```str``` instance.

A ```str``` instance can be assigned to an instance name during instantiation:

In [8]:
greeting = 'hello'

Notice now that the cell has no output. Instead the ```str``` instance is stored under the instance name ```greeting```, which displays in Variables:

In [9]:
variables()

|Instance Name|Type|Size/Shape|Value|
|---|---|---|---|
|greeting|str|5|hello|


The value of the ```str``` instance can be referenced via the instance name:

In [10]:
greeting

'hello'

In the above cell, the Python interpreter recognised the instance name. This instance name was used to point to the ```str``` instance and the value retrieved was not assigned to another instance name and is therefore shown in the cell output. 

If the instance is instead assigned to another instance name:

In [11]:
greeting2 = greeting

Then in Variables, the ```str``` instance ```'hello'``` is shown with two different instance names ```greeting``` and ```greeting2```:

In [12]:
variables()

|Instance Name|Type|Size/Shape|Value|
|---|---|---|---|
|greeting|str|5|hello|
|greeting2|str|5|hello|


These two instance names act as aliases of one another. If an instance name is conceptualised as a label, then this ```str``` instance has two labels. If either instance name is used, the same value is retrieved:

In [13]:
greeting

'hello'

In [14]:
greeting2

'hello'

A check is made to see if the value retrieved from each instance name is equal. Because they are the same ```str``` instance, the boolean ```True``` is returned:

In [15]:
greeting == greeting2

True

Each instance in Python has a unique identification, which can be checked using the ```id``` function:

In [16]:
id(greeting)

2064235241776

In [17]:
id(greeting2)

2064235241776

Notice that the id is the same, because both these instance names are references to the same ```str``` instance. Therefore the following is ```True```:

In [18]:
greeting is greeting2

True

Recall that this is shorthand for:

In [19]:
id(greeting) == id(greeting2)

True
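The relationship between the two labels can be summarised in a single cell. This is a minimal sketch using the same instance names as above:

```python
greeting = 'hello'
greeting2 = greeting  # a second label on the same str instance

same_value = greeting == greeting2              # compares the values
same_identity = greeting is greeting2           # compares the instances
same_id = id(greeting) == id(greeting2)         # what "is" checks

print(same_value, same_identity, same_id)
```

Equality compares values whereas identity compares instances, so two labels on one instance always satisfy both checks.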

The delete statement ```del``` can be used to delete an instance name. Note that deleting an instance name only deletes a label, leaving the instance unchanged:

In [20]:
del greeting

Notice that the instance name ```greeting``` is deleted i.e. this label is removed. However the label ```greeting2``` is still present and the instance ```'hello'``` is unaltered:

In [21]:
variables()

|Instance Name|Type|Size/Shape|Value|
|---|---|---|---|
|greeting2|str|5|hello|


If ```del``` is used to also delete the instance name ```greeting2```:

In [22]:
del greeting2

In [23]:
variables()

|Instance Name|Type|Size/Shape|Value|
|---|---|---|---|


Then there are no instance names for the ```str``` instance ```'hello'```. When an instance has no instance name it cannot be referenced and is considered orphaned. Orphaned instances are automatically cleaned up by Python's garbage collection.

If a new instance is created:

In [24]:
greeting = 'Hello World'

Then the instance name displays on variables:

In [25]:
variables(show_id=True)

|Instance Name|Type|Size/Shape|Value|ID|
|---|---|---|---|---|
|greeting|str|11|Hello World|2064295823280|


If a reassignment is carried out:

In [26]:
greeting = 'hi'

The instance name remains on Variables but the instance it points to has changed. In other words the label ```greeting``` has been peeled off from the old ```str``` instance ```'Hello World'``` and placed on the new ```str``` instance ```'hi'```. The old ```str``` instance now has no instance name and therefore no reference. Because it is orphaned, it is cleaned up by Python's garbage collection:

In [27]:
variables(show_id=True)

|Instance Name|Type|Size/Shape|Value|ID|
|---|---|---|---|---|
|greeting|str|2|hi|140727149708216|


Reassignment **moves the instance name** from the old ```str``` instance to the new ```str``` instance and does not change either ```str``` instance. A ```str``` instance is **immutable** and cannot be modified after it has been instantiated.
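The difference between reassignment and modification can be sketched as follows. The extra instance name ```original``` is an assumption added here so that the first instance stays referenced and can still be inspected after the reassignment:

```python
greeting = 'Hello World'
original = greeting   # a second label keeps the first instance referenced

greeting = 'hi'       # the label greeting moves to a new str instance

# the first instance is unchanged and is a different instance to the new one
print(original)
print(greeting is original)

# attempting to modify a character in place is not supported
try:
    original[0] = 'J'
except TypeError as error:
    message = str(error)
    print(message)
```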

The initialisation signature of the ```str``` class shows instantiation using a named keyword input argument ```object``` which has a default value of an empty ```str```:

```python
str(object='') -> str
```

This is used to cast instances of other Python ```builtins``` classes to ```str``` instances:

In [28]:
str(object='hello')

'hello'

In [29]:
str(object=b'hello')

"b'hello'"

In [30]:
str(object=bytearray(b'hello'))

"bytearray(b'hello')"

In [31]:
str(object=2)

'2'

In [32]:
str(object=True)

'True'

In [33]:
str(object=3.14)

'3.14'

If not assigned, it takes on its default value which returns an empty ```str``` instance:

In [34]:
str()

''
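The docstring also lists a second form, ```str(bytes_or_buffer[, encoding[, errors]])```, which decodes the data instead of returning its formal representation. The following is a minimal sketch of that form; the sample text ```'café'``` and the instance names are chosen here only for illustration:

```python
encoded = 'café'.encode('utf-8')   # a bytes instance

# without an encoding, the formal representation of the bytes is returned
formal = str(encoded)

# with an encoding, the bytes are decoded into a str
decoded = str(encoded, encoding='utf-8')

# errors may optionally be supplied as well, 'strict' being the default
decoded_strict = str(encoded, encoding='utf-8', errors='strict')

print(formal)
print(decoded)
```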

## Spacing and PEP8

If the following is examined:

```python
instance = str(object='hello')
```

Notice the assignment operator is used to assign a value to a named parameter within the function call and the ```return``` value of the function call is also assigned to an instance name. 

Notice the subtlety in the above spacing. Within a function call spacing is typically used to visually separate out input arguments:

```python
func(a=1, b=2, c=3)
```

Outside the function call, spacing is used to visually emphasise an operator:

```python
result = 2 * 3
```

Operators within a function call are not visually separated as the spacing is used to visually separate out the parameters:

```python
result = func(a=1, b=2, c=2*3)
```

The code below will work but is harder to read:

```python
result=func(a=1,b=2,c=2*3)
```

```python
result=func(a = 1,b = 2,c = 2 * 3)
```
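The spacing conventions can be tried out with a small function. This is a minimal sketch: ```func``` and its parameters ```a```, ```b``` and ```c``` are hypothetical and simply add their inputs together:

```python
def func(a, b, c):
    """Return the sum of the three input arguments."""
    return a + b + c


# preferred spacing: spaces separate the input arguments and surround
# the assignment operator outside the function call
result = func(a=1, b=2, c=2*3)

# the same call with spacing that is valid but harder to read
cramped=func(a = 1,b = 2,c = 2 * 3)

print(result == cramped)
```

Both calls behave identically, so the spacing is purely a readability convention.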

More details are given in the [Python Enhancement Proposal 8: Style Guide](https://peps.python.org/pep-0008/).

Use of the Python formatters such as autopep8 was previously discussed in the tutorial on installing VSCode.

## String Quotations

In Python single and double quotations can be used to enclose the characters in a ```str``` instance and are seen as equivalent:

In [35]:
"Hello World!"

'Hello World!'

In [36]:
'Hello World!'

'Hello World!'

Notice that the Python interpreter itself prefers single quotations and the value returned to the cell output in each case is the printed formal representation and is enclosed in single quotations.

The ```'``` is a formatting character in a ```str``` instance and is used to enclose the characters of the ```str``` itself. This causes a problem if an attempt is made to construct a ```str``` containing a ```str``` literal:

```python
'greeting = 'hello world!''
```

Notice that the syntax highlighting above displays:

* ```'greeting = '``` as a ```str```
* ```hello``` as an instance name
* ```world!``` as an instance name
* ```''``` as an empty string 
  
This results in a ```SyntaxError```. 

The ```\``` is another formatting character that is used to insert an escape character or escape character sequence. ```\'``` will incorporate the single quotation into the ```str```:

In [37]:
'greeting = \'hello world!\''

"greeting = 'hello world!'"

Notice that the ```str``` returned in the cell output is now enclosed in double quotations and is more readable. The main purpose of the double quotations is to make it easier to create a ```str``` instance which includes a ```str``` literal.
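The two ways of including a single quotation can be compared directly. This is a minimal sketch using the same text as above:

```python
escaped = 'greeting = \'hello world!\''      # escape each inner quotation
double_quoted = "greeting = 'hello world!'"  # enclose in double quotations

# both spellings create a str with the same value
print(escaped == double_quoted)
print(escaped)
```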

Triple double quotations are typically used for a multiline string. Double quotations are preferred over single quotations for multiline ```str``` instances as they are commonly used as docstrings and a docstring has a high probability of including a ```str``` literal. A very basic function can be created which takes in two input ```str``` instances and prints them within a formatted ```str``` instance:

In [38]:
def fun(string1='hello', string2='world'):
    print(f'{string1} {string2}')    

The function can be tested:

In [39]:
fun()

hello world


In [40]:
fun(string1='bye')

bye world


Because it has no docstring, it has no documentation:

In [41]:
fun?

Signature: fun(string1='hello', string2='world')
Docstring: <no docstring>
File:      c:\users\phili\appdata\local\temp\ipykernel_3712\1566935369.py
Type:      function

A docstring is normally added at the start of the function's code block and although this is only a single line, it is typically input using triple double quotations:

In [42]:
def fun(string1='hello', string2='world'):
    """Prints string1 string2"""
    print(f'{string1} {string2}')    

In [43]:
fun?

Signature: fun(string1='hello', string2='world')
Docstring: Prints string1 string2
File:      c:\users\phili\appdata\local\temp\ipykernel_3712\3973104799.py
Type:      function

The triple double quotations allow it to be readily expanded later on with optional ```str``` literals:

In [44]:
def fun(string1='hello', string2='world'):
    """Prints string1 string2
    For example fun(string1='hello', string2='world') prints hello world"""
    print(f'{string1} {string2}')    

In [45]:
fun?

Signature: fun(string1='hello', string2='world')
Docstring:
Prints string1 string2
For example fun(string1='hello', string2='world') prints hello world
File:      c:\users\phili\appdata\local\temp\ipykernel_3712\1096554159.py
Type:      function
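Outside of IPython the ```?``` operator is not available, but the docstring is still stored on the function. The following is a minimal sketch that reads it back using the ```__doc__``` attribute and the standard library function ```inspect.getdoc```, which additionally removes the indentation of the continuation lines:

```python
import inspect


def fun(string1='hello', string2='world'):
    """Prints string1 string2
    For example fun(string1='hello', string2='world') prints hello world"""
    print(f'{string1} {string2}')


raw = fun.__doc__               # the docstring exactly as it was typed
cleaned = inspect.getdoc(fun)   # the docstring with indentation cleaned up

print(cleaned)
```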

The [Python Enhancement Proposal 8: Style Guide](https://peps.python.org/pep-0008/) does not explicitly make a recommendation for quotation style:

> In Python, single-quoted strings and double-quoted strings are the same. This PEP does not make a recommendation for this. Pick a rule and stick to it.

However the Python interpreter and the official Python documentation prefer single quotations over double quotations. Double quotations are used when the ```str``` instance contains a ```str``` literal. A docstring (which is likely to later be updated to include a ```str``` literal) uses triple double quotations. It is generally a good practice to make your code look as close as possible to the code in the official Python documentation when getting started, as these tutorials attempt to do. Popular third-party libraries ```numpy```, ```matplotlib```, ```scipy``` and ```sklearn``` in the scientific stack are written using a consistent quotation style.

Python has a popular opinionated autoformatter ```black``` which unfortunately has a preference for double quotations, differing from the style used in Python itself. Moreover ```black``` is used for the development of some popular third-party libraries such as ```pandas``` and ```seaborn``` which are also in the scientific stack. The quotation style for the official documentation for libraries in the scientific stack therefore is unfortunately inconsistent. Finally because of the popularity of ```pandas``` in particular, double quotations tend to be more prevalent in data science tutorials.

## Identifiers

Two ```str``` instances can be instantiated:

In [46]:
greeting = 'hello'
farewell = 'bye'

In [47]:
variables()

|Instance Name|Type|Size/Shape|Value|
|---|---|---|---|
|greeting|str|5|hello|
|farewell|str|3|bye|


The ```dir``` function can be used to view a list of identifiers from an instance:

In [48]:
dir(greeting)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'stri

These aren't grouped by category. This can be done by using the custom function ```dir2```:

In [49]:
dir2(greeting)

{'method': ['capitalize',
            'casefold',
            'center',
            'count',
            'encode',
            'endswith',
            'expandtabs',
            'find',
            'format',
            'format_map',
            'index',
            'isalnum',
            'isalpha',
            'isascii',
            'isdecimal',
            'isdigit',
            'isidentifier',
            'islower',
            'isnumeric',
            'isprintable',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'ljust',
            'lower',
            'lstrip',
            'maketrans',
            'partition',
            'removeprefix',
            'removesuffix',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rsplit',
            'rstrip',
            'split',
            'splitlines',
            'startswith',
            'strip',
            'swa

Notice the same identifier names display when the other instance is examined:

In [50]:
dir2(farewell)

{'method': ['capitalize',
            'casefold',
            'center',
            'count',
            'encode',
            'endswith',
            'expandtabs',
            'find',
            'format',
            'format_map',
            'index',
            'isalnum',
            'isalpha',
            'isascii',
            'isdecimal',
            'isdigit',
            'isidentifier',
            'islower',
            'isnumeric',
            'isprintable',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'ljust',
            'lower',
            'lstrip',
            'maketrans',
            'partition',
            'removeprefix',
            'removesuffix',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rsplit',
            'rstrip',
            'split',
            'splitlines',
            'startswith',
            'strip',
            'swa

This is because both ```greeting``` and ```farewell``` are instances of the ```str``` class:

In [51]:
type(greeting)

str

In [52]:
type(farewell)

str

And the identifiers are defined in the ```str``` class:

In [53]:
dir2(str)

{'method': ['capitalize',
            'casefold',
            'center',
            'count',
            'encode',
            'endswith',
            'expandtabs',
            'find',
            'format',
            'format_map',
            'index',
            'isalnum',
            'isalpha',
            'isascii',
            'isdecimal',
            'isdigit',
            'isidentifier',
            'islower',
            'isnumeric',
            'isprintable',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'ljust',
            'lower',
            'lstrip',
            'maketrans',
            'partition',
            'removeprefix',
            'removesuffix',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rsplit',
            'rstrip',
            'split',
            'splitlines',
            'startswith',
            'strip',
            'swa
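The observation that both instances share the identifiers of their class can be checked with the ```builtins``` function ```dir``` alone. This is a minimal sketch that does not rely on the custom function ```dir2```:

```python
greeting = 'hello'
farewell = 'bye'

# both instances are created from the same class
same_class = type(greeting) is type(farewell) is str

# the identifier names do not depend on the value of the instance
same_identifiers = dir(greeting) == dir(farewell)

# every identifier seen on the instance is available from the class
defined_on_class = set(dir(greeting)) <= set(dir(str))

print(same_class, same_identifiers, defined_on_class)
```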

If the class's method resolution order is examined:

In [54]:
str.mro()

[str, object]

Notice that there is a ```list``` instance containing the classes ```str``` and ```object```. This means the ```str``` instance has all the ```object``` based datamodel identifiers:

In [55]:
dir2(str, object, consistent_only=True)

{'datamodel_attribute': ['__doc__'],
 'datamodel_method': ['__class__',
                      '__delattr__',
                      '__dir__',
                      '__eq__',
                      '__format__',
                      '__ge__',
                      '__getattribute__',
                      '__getstate__',
                      '__gt__',
                      '__hash__',
                      '__init__',
                      '__init_subclass__',
                      '__le__',
                      '__lt__',
                      '__ne__',
                      '__new__',
                      '__reduce__',
                      '__reduce_ex__',
                      '__repr__',
                      '__setattr__',
                      '__sizeof__',
                      '__str__',
                      '__subclasshook__']}


Alongside the following additions:

In [56]:
dir2(str, object, unique_only=True)

{'method': ['capitalize',
            'casefold',
            'center',
            'count',
            'encode',
            'endswith',
            'expandtabs',
            'find',
            'format',
            'format_map',
            'index',
            'isalnum',
            'isalpha',
            'isascii',
            'isdecimal',
            'isdigit',
            'isidentifier',
            'islower',
            'isnumeric',
            'isprintable',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'ljust',
            'lower',
            'lstrip',
            'maketrans',
            'partition',
            'removeprefix',
            'removesuffix',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rsplit',
            'rstrip',
            'split',
            'splitlines',
            'startswith',
            'strip',
            'swa

The method resolution order is an instruction to preferentially use the method defined in the ```str``` class and to fall back on the method defined in the ```object``` class when it is not defined in the ```str``` class. More details about these two classes can be seen using ```help```:

In [57]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |
 |  Methods defined here:
 |
 |  __add__(self, value, /)
 |      Return self+value.
 |
 |  __contains__(self, key, /)
 |      Return bool(key in self).
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |
 |  __getitem__(

In [58]:
help(object)

Help on class object in module builtins:

class object
 |  The base class of the class hierarchy.
 |
 |  When called, it accepts no arguments and returns a new featureless
 |  instance that has no instance attributes and cannot be given any.
 |
 |  Built-in subclasses:
 |      anext_awaitable
 |      async_generator
 |      async_generator_asend
 |      async_generator_athrow
 |      ... and 90 other subclasses
 |
 |  Methods defined here:
 |
 |  __delattr__(self, name, /)
 |      Implement delattr(self, name).
 |
 |  __dir__(self, /)
 |      Default dir() implementation.
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __format__(self, format_spec, /)
 |      Default object formatter.
 |
 |      Return str(self) if format_spec is empty. Raise TypeError otherwise.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |
 |  __getstate__(self, /)
 |      Helper for pickle.
 |
 |  __gt__(self, 
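The fall back behaviour of the method resolution order can be checked by looking at which class actually defines an identifier. This is a minimal sketch using the ```builtins``` function ```vars```, which returns the identifiers defined directly in a class; ```__repr__``` and ```__setattr__``` are used as examples:

```python
mro = str.mro()

# __repr__ is defined in the str class, so the str version is used
repr_in_str = '__repr__' in vars(str)

# __setattr__ is not defined in the str class, so the lookup falls back
# on the next class in the method resolution order, which is object
setattr_in_str = '__setattr__' in vars(str)
setattr_in_object = '__setattr__' in vars(object)

print(mro)
print(repr_in_str, setattr_in_str, setattr_in_object)
```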

## Datamodel Identifiers

The ```str``` class has the ```object``` based datamodel identifiers. Recall from the previous tutorial that these define the behaviour of the following ```builtins``` identifiers:

|Datamodel Identifier|Builtins Identifier|Builtins Identifier Type|Description|
|---|---|---|---|
|\_\_new\_\_|||constructs the instance self|
|\_\_init\_\_|||initialise an instance with instance data (automatically invoked after \_\_new\_\_)|
|\_\_doc\_\_|?|operator|view the docstring or initialisation signature docstring if a class|
|\_\_class\_\_|type|class|display the class type of an instance|
|\_\_dir\_\_|dir|function|list the directory of identifiers|
|\_\_repr\_\_|repr|function|formal str representation|
|\_\_str\_\_|str|class|informal str representation|
|\_\_hash\_\_|hash|function|hash value if immutable, if mutable \_\_hash\_\_ = None and the hash function cannot be used|
|\_\_getattribute\_\_|getattr|function|access an attribute (immutable)|
|\_\_setattr\_\_|setattr|function|set an attribute (mutable)|
|\_\_delattr\_\_|delattr|function|delete an attribute (mutable)|
|\_\_eq\_\_|==|operator|check if self is equal to value|
|\_\_ne\_\_|!=|operator|check if self is not equal to value|
|\_\_lt\_\_|<|operator|check if self is less than value|
|\_\_le\_\_|<=|operator|check if self is less than or equal to value|
|\_\_gt\_\_|>|operator|check if self is greater than value|
|\_\_ge\_\_|>=|operator|check if self is greater than or equal to value|
|\_\_sizeof\_\_|sys.getsizeof|function|check the size of the instance in bytes|

The identifiers used by the pickle module or for subclassing are not mentioned here and were covered in the previous tutorial on the ```object``` class.
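The correspondence between a datamodel identifier and its ```builtins``` counterpart can be checked for a ```str``` instance. This is a minimal sketch covering a few rows of the table; the instance names are chosen here for illustration:

```python
import sys

greeting = 'hello'

# each builtins identifier gives the same result as its datamodel method
class_matches = type(greeting) is greeting.__class__
repr_matches = repr(greeting) == greeting.__repr__()
hash_matches = hash(greeting) == greeting.__hash__()
equal_matches = (greeting == 'hello') == greeting.__eq__('hello')
less_than_matches = ('apple' < 'banana') == 'apple'.__lt__('banana')

# the size in bytes depends on the Python version and platform
size_in_bytes = sys.getsizeof(greeting)

print(class_matches, repr_matches, hash_matches,
      equal_matches, less_than_matches)
```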

These are supplemented by the following datamodel methods:

In [59]:
dir2(str, object, unique_only=True, print_output=False)['datamodel_method']

['__add__',
 '__contains__',
 '__getitem__',
 '__getnewargs__',
 '__iter__',
 '__len__',
 '__mod__',
 '__mul__',
 '__rmod__',
 '__rmul__']

The ```str``` follows the design pattern of an immutable ```Collection```. A ```Collection``` has the following datamodel identifiers:

|Datamodel Identifier|Builtins Identifier|Builtins Identifier Type|Description|
|---|---|---|---|
|\_\_len\_\_|len|function|the number of Unicode characters in a str|
|\_\_contains\_\_|in|keyword|check if str contains a substr|
|\_\_getitem\_\_|[]||uses square brackets to index into a str|
|\_\_iter\_\_|iter|function|returns a str iterator|
|\_\_add\_\_|+|operator|concatenates two str instances|
|\_\_mul\_\_|*|operator|replicates a str by multiplication with an int instance ```'hello' * 2```|
|\_\_rmul\_\_|*|operator|replicates a str by reverse multiplication with an int instance ```2 * 'hello'```|
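Each row of the table above can be tried on a short ```str``` instance. This is a minimal sketch; the instance names are chosen here for illustration:

```python
greeting = 'hello'

length = len(greeting)              # __len__
contains = 'ell' in greeting        # __contains__
first = greeting[0]                 # __getitem__
characters = list(iter(greeting))   # __iter__
joined = greeting + ' world'        # __add__
doubled = greeting * 2              # __mul__
reverse_doubled = 2 * greeting      # __rmul__

print(length, contains, first)
print(characters)
print(joined, doubled, reverse_doubled)
```

None of these operations modify ```greeting```; each one returns a new instance.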

There are also some ```str``` specific additions:

|Datamodel Identifier|Builtins Identifier|Builtins Identifier Type|Description|
|---|---|---|---|
|\_\_mod\_\_|%|operator|create a formatted str by inserting values into the str using a tuple ```'%s and %s make %s' % (2, 3, 5)```|
|\_\_rmod\_\_|%|operator|create a formatted str with the operands reversed, ```'world'.__rmod__('hello %s')``` evaluates ```'hello %s' % 'world'```|
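The ```%``` operator needs a conversion specifier such as ```%s``` at each position where a value is to be inserted. This is a minimal sketch of the operator; the instance names are chosen here for illustration:

```python
# each %s is replaced by the str form of the matching item in the tuple
formatted = '%s and %s make %s' % (2, 3, 5)

# a single value does not need to be enclosed in a tuple
single = 'hello %s' % 'world'

# a literal percent sign is written as %%
percentage = '%s%%' % 50

print(formatted)
print(single)
print(percentage)
```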

The ```__getnewargs__``` datamodel method is used by the ```pickle``` module to serialise the ```str```.

Using ```?``` on the ```str``` class shows the docstring of the ```__init__``` signature:

In [60]:
str?

Init signature: str(self, /, *args, **kwargs)
Docstring:
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.
Type:           type
Subclasses:     StrEnum, DeferredConfigString, FoldedCase, _rstr, _ScriptTarget, _ModuleTarget, LSString, include, Keys, InputMode, ...

The datamodel identifier ```__new__``` constructs the instance ```greeting``` and ```__init__``` is then invoked so that the ```str``` has the required instance data:

In [61]:
greeting = 'Hello\tWorld!'

Using ```?``` with the ```str``` instance gives the same docstring from the ```str``` class but also displays instance specific details:

In [62]:
greeting?

Type:        str
String form: Hello	World!
Length:      12
Docstring:
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.

One of these details is the type, which can also be checked using ```type```:

In [63]:
type(greeting)

str

## Formal (\_\_repr\_\_) and Informal (\_\_str\_\_) str

The ```print``` function prints out the informal ```str``` form:

In [64]:
print(greeting)

Hello	World!


Recall that there is the formal and informal ```str``` representation and the difference between these can be seen when an instance is printed (above) and examined in the cell output below:

In [65]:
greeting

'Hello\tWorld!'

The informal ```str``` (```__str__``` datamodel method) defines the behaviour of the ```str``` class. Casting a ```str``` instance to a ```str``` instance leaves it unchanged:

In [66]:
str(greeting)

'Hello\tWorld!'

Therefore the two are equivalent:

In [67]:
print(str(greeting))

Hello	World!


In [68]:
print(greeting)

Hello	World!


The formal ```repr``` (```__repr__``` datamodel method) defines the behaviour of the ```repr``` function:

In [69]:
repr(greeting)

"'Hello\\tWorld!'"

Notice the print out of this shows the formal representation, which is the form used to instantiate a new ```str``` instance:

In [70]:
print(repr(greeting))

'Hello\tWorld!'
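Because the formal representation is the form used to instantiate the instance, it can be evaluated to recreate an equal ```str```. This is a minimal sketch using the ```builtins``` function ```eval```, which should only be used on trusted input:

```python
greeting = 'Hello\tWorld!'

informal = str(greeting)    # the characters themselves
formal = repr(greeting)     # includes the quotations and the escape \t

# evaluating the formal representation gives back an equal str
round_trip = eval(formal)

print(len(informal), len(formal))
print(round_trip == greeting)
```

The formal representation is three characters longer: two for the enclosing quotations and one because the tab is written as the two characters ```\``` and ```t```.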


### Indexing and Slicing (\_\_len\_\_, \_\_contains\_\_, \_\_getitem\_\_)

The length function ```len``` returns the number of Unicode characters in the ```str```:

In [71]:
len(greeting)

12

Notice that ```\t``` is used to represent a single Unicode character. The custom function ```view```, imported earlier from the custom module ```categorize_identifiers```, can be used to view the ```str``` instance in more detail:

Notice that the ```str``` uses zero-based indexing where each index is an ```int```. The "first" index, known as the start index, is ```0``` and the index increases in ```int``` steps of ```1``` up to but excluding the stop index, which is the length of the collection. The last index is therefore 1 less than the length of the ```str``` instance.

Notice that the datatype for each character is itself a ```str``` and each of these ```str``` instances have a length of 1 corresponding to a value that is a single Unicode character:

In [72]:
view(greeting)

Index 	 Type                 	 Size   	 Value                         
0 	 str                  	 1      	 H                              	
1 	 str                  	 1      	 e                              	
2 	 str                  	 1      	 l                              	
3 	 str                  	 1      	 l                              	
4 	 str                  	 1      	 o                              	
5 	 str                  	 1      	 	                              	
6 	 str                  	 1      	 W                              	
7 	 str                  	 1      	 o                              	
8 	 str                  	 1      	 r                              	
9 	 str                  	 1      	 l                              	
10 	 str                  	 1      	 d                              	
11 	 str                  	 1      	 !                              	


Square brackets can be used to select an index:

In [73]:
greeting[0]

'H'

In [74]:
greeting[len(greeting)-1]

'!'

In [75]:
greeting[11]

'!'

The ```slice``` class can be used to select a substr using a slice:

In [76]:
slice?

Init signature: slice(self, /, *args, **kwargs)
Docstring:
slice(stop)
slice(start, stop[, step])

Create a slice object.  This is used for extended slicing (e.g. a[0:10:2]).
Type:           type
Subclasses:

To select the first word the following slice can be used:

```python
slice(0, 5, 1)
```

Note that because zero-based indexing is used, the start bound is inclusive and the stop bound is exclusive. A slice is therefore selected up to but excluding the stop bound:

|Index|Type|Size|Value|
|---|---|---|---|
|0|str|1|H|
|1|str|1|e|                   	
|2|str|1|l|           	
|3|str|1|l|                     	
|4|str|1|o|                         	
|5|||| 

In [77]:
start = 0
stop = 5
step = 1

In [78]:
greeting[slice(start, stop, step)]

'Hello'

Because the default step is 1:

In [79]:
greeting[slice(start, stop)]

'Hello'

Because the default start is 0:

In [80]:
greeting[slice(stop)]

'Hello'

Slicing is usually done shorthand using colons to separate out the start, stop and step values:

In [81]:
greeting[start:stop:step]

'Hello'

Because the default step is 1, this can be simplified to:

In [82]:
greeting[start:stop:]

'Hello'

The last colon can also be dropped:

In [83]:
greeting[start:stop]

'Hello'

Because the default start is 0 this can be simplified to:

In [84]:
greeting[:stop]

'Hello'

The default stop is the length of the ```str``` and therefore the following returns the whole ```str```:

In [85]:
greeting[:]

'Hello\tWorld!'

Normally numbers are used in the slices directly:

In [86]:
greeting[0:5:1]

'Hello'

In [87]:
greeting[6:]

'World!'

The shorthand notation is generally preferred however a slice is sometimes used with a constant to make code more readable:

In [88]:
FIRST_WORD = slice(0, 5, 1)
greeting[FIRST_WORD]

'Hello'
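As a minimal sketch of the named-slice idea, the cell below defines two constants. ```SECOND_WORD``` is a hypothetical name introduced here for illustration, and ```None``` is supplied as the stop so that the default stop is used:

```python
greeting = 'Hello\tWorld!'

# Constants naming the two words in greeting (SECOND_WORD is hypothetical)
FIRST_WORD = slice(0, 5)        # equivalent to greeting[0:5]
SECOND_WORD = slice(6, None)    # None keeps the default stop, equivalent to greeting[6:]

first = greeting[FIRST_WORD]
second = greeting[SECOND_WORD]
print(first, second)

# A slice instance stores its bounds as attributes
print(FIRST_WORD.start, FIRST_WORD.stop, FIRST_WORD.step)
```

An omitted step is stored as ```None``` and is treated as the default step of ```1``` when the slice is applied.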

The index before ```0``` is ```-1``` and is taken to be the last Unicode character in the ```str```. Conceptualise the ```str``` wrapping around itself, so that a negative index can be assigned to each character until the "first" character is reached, which has a negative index equal to minus the length of the ```str``` instance:

In [89]:
view(greeting, neg_index=True)
view(greeting)

Index 	 Type                 	 Size   	 Value                         
-12 	 str                  	 1      	 H                              	
-11 	 str                  	 1      	 e                              	
-10 	 str                  	 1      	 l                              	
-9 	 str                  	 1      	 l                              	
-8 	 str                  	 1      	 o                              	
-7 	 str                  	 1      	 	                              	
-6 	 str                  	 1      	 W                              	
-5 	 str                  	 1      	 o                              	
-4 	 str                  	 1      	 r                              	
-3 	 str                  	 1      	 l                              	
-2 	 str                  	 1      	 d                              	
-1 	 str                  	 1      	 !                              	
Index 	 Type                 	 Size   	 Value                         
0 	 str        

When a negative step of ```-1``` is used, the character order in the ```str``` instance is reversed:

In [90]:
greeting[::-1]

'!dlroW\tolleH'

With a negative step, the default start is therefore index ```-1``` and the default stop is ```-len(greeting)-1```, because zero-based indexing is still used, which is inclusive of the start bound and exclusive of the stop bound:

In [91]:
start = -1
stop = -len(greeting) - 1
step = -1
greeting[start:stop:step]

'!dlroW\tolleH'

In [92]:
greeting[-1:-len(greeting)-1:-1]

'!dlroW\tolleH'
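The equivalence of these forms can be confirmed directly. The sketch below also compares against the ```builtins``` function ```reversed```, and reverses only the last word; the instance names are illustrative:

```python
greeting = 'Hello\tWorld!'

shorthand = greeting[::-1]
explicit = greeting[-1:-len(greeting) - 1:-1]
joined = ''.join(reversed(greeting))    # reversed gives an iterator of characters

print(shorthand == explicit == joined)

# Only the last word reversed: start at index -1 and stop before index -7 (the tab)
last_word_reversed = greeting[-1:-7:-1]
print(last_word_reversed)
```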

The ```__contains__``` datamodel method defines the behaviour of the ```in``` keyword:

In [93]:
greeting.__contains__?

Signature:      greeting.__contains__(key, /)
Call signature: greeting.__contains__(*args, **kwargs)
Type:           method-wrapper
String form:    <method-wrapper '__contains__' of str object at 0x000001E0A1A51B70>
Docstring:      Return bool(key in self).

It can be used to check whether a substr is present within a ```str```:

In [94]:
greeting.__contains__('Hello')

True

It is more common to use the ```in``` keyword to perform this check:

In [95]:
'Hello' in greeting

True

In [96]:
'hello' in greeting

False
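Because the check is case sensitive, a caseless check needs both sides to be normalised first. The following sketch uses the ```str``` method ```casefold``` for this; the instance names are illustrative:

```python
greeting = 'Hello\tWorld!'

case_sensitive = 'hello' in greeting                       # exact characters are compared
caseless = 'hello'.casefold() in greeting.casefold()       # both sides normalised first
absent = 'Goodbye' not in greeting                         # not in negates the check
empty = '' in greeting                                     # the empty str is always contained

print(case_sensitive, caseless, absent, empty)
```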

## Iteration (\_\_iter\_\_) and looping

If the ```str``` instance ```letters``` (plural) is instantiated:

In [97]:
letters = 'Hello World!'

In [98]:
view(letters)

Index 	 Type                 	 Size   	 Value                         
0 	 str                  	 1      	 H                              	
1 	 str                  	 1      	 e                              	
2 	 str                  	 1      	 l                              	
3 	 str                  	 1      	 l                              	
4 	 str                  	 1      	 o                              	
5 	 str                  	 1      	                                	
6 	 str                  	 1      	 W                              	
7 	 str                  	 1      	 o                              	
8 	 str                  	 1      	 r                              	
9 	 str                  	 1      	 l                              	
10 	 str                  	 1      	 d                              	
11 	 str                  	 1      	 !                              	


It can be cast into an iterator using ```iter```:

In [99]:
forward = iter(letters)

```forward``` is a ```str``` ASCII iterator that iterates through a ```str``` of ASCII characters, displaying a single character at a time:

In [100]:
forward

<str_ascii_iterator at 0x1e0a1a90f70>

The iterator has a number of datamodel identifiers:

In [101]:
dir2(forward, object, unique_only=True)

{'datamodel_method': ['__iter__',
                      '__length_hint__',
                      '__next__',
                      '__setstate__']}


The most important one is ```__next__``` which controls the behaviour of the ```builtins``` function ```next```. ```next``` is used to advance to the next value in the iterator. An iterator displays a single value at a time and each previous value is consumed when advanced:

In [102]:
next(forward)

'H'

In [103]:
next(forward)

'e'

In [104]:
next(forward)

'l'

In each case the ```return``` value can be assigned to the instance name ```letter``` (note singular):

In [105]:
letter = next(forward)

In [106]:
letter

'l'

```next``` can continue to be used on the ASCII ```iter``` instance until all the letters are exhausted. In other words ```next``` can be called on the ASCII ```iter``` instance ```len(letters)``` times. Alternatively all of the remaining elements in an ```iter``` instance can be consumed by casting using the ```tuple``` class:

In [107]:
tuple(forward)

('o', ' ', 'W', 'o', 'r', 'l', 'd', '!')
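Once every value has been consumed the iterator is exhausted. The sketch below shows what happens when ```next``` is used on an exhausted iterator, with and without a default value; the instance names are illustrative:

```python
letters = 'Hello World!'
forward = iter(letters)

first = next(forward)        # consumes 'H'
remaining = tuple(forward)   # consumes everything that is left

# With a default supplied, next returns the default instead of raising
after_exhaustion = next(forward, None)

# Without a default, an exhausted iterator raises StopIteration
raised = False
try:
    next(forward)
except StopIteration:
    raised = True

print(first, len(remaining), after_exhaustion, raised)
```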

A ```range``` instance can be constructed using ```len(letters)```. Note the similarities between the ```range``` class and the ```slice``` class:

In [108]:
range?

Init signature: range(self, /, *args, **kwargs)
Docstring:
range(stop) -> range object
range(start, stop[, step]) -> range object

Return an object that produces a sequence of integers from start (inclusive)
to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
These are exactly the valid indices for a list of 4 elements.
When step is given, it specifies the increment (or decrement).
Type:           type
Subclasses:

In [109]:
slice?

Init signature: slice(self, /, *args, **kwargs)
Docstring:
slice(stop)
slice(start, stop[, step])

Create a slice object.  This is used for extended slicing (e.g. a[0:10:2]).
Type:           type
Subclasses:

In [110]:
indexes = range(len(letters))

The ```range``` instance is not an ```iter``` instance and does not have the identifier ```__next__``` but each index in it can be viewed by casting to a ```tuple```:

In [111]:
dir2(indexes, object, unique_only=True)

{'attribute': ['start', 'step', 'stop'],
 'method': ['count', 'index'],
 'datamodel_method': ['__bool__',
                      '__contains__',
                      '__getitem__',
                      '__iter__',
                      '__len__',
                      '__reversed__']}


In [112]:
tuple(indexes)

(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

A ```for``` loop can be constructed from it:

In [113]:
for index in indexes:
    print(index)

0
1
2
3
4
5
6
7
8
9
10
11


Notice the instructions in the ```for``` loop body were repeated 12 times and the ```index``` printed was updated each loop iteration.

The ```str``` instance ```letters``` can be cast into an ```iter``` instance and ```next``` can be used to advance through the iterator within the ```for``` loop:

In [114]:
forward = iter(letters)

for index in indexes:
    print(next(forward))

H
e
l
l
o
 
W
o
r
l
d
!


Creating an ```iter``` instance and advancing through all its elements in a ```for``` loop is a common task and is simplified using the syntax below:

In [115]:
for letter in letters:
    print(letter)

H
e
l
l
o
 
W
o
r
l
d
!
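Conceptually, the simplified ```for``` loop creates the iterator and advances through it until it is exhausted. A rough sketch of the equivalent steps, written out with a ```while``` loop (the names ```iterator``` and ```collected``` are illustrative), is:

```python
letters = 'Hello World!'

collected = []
iterator = iter(letters)            # uses letters.__iter__
while True:
    try:
        letter = next(iterator)     # uses iterator.__next__
    except StopIteration:           # raised once the iterator is exhausted
        break
    collected.append(letter)        # the loop body

print(''.join(collected))
```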


Sometimes it is useful to have both the index and the letter being looped through. This can be done using the ```enumerate``` class:

In [116]:
enumerate?

Init signature: enumerate(iterable, start=0)
Docstring:
Return an enumerate object.

  iterable
    an object supporting iteration

The enumerate object yields pairs containing a count (from start, which
defaults to zero) and a value yielded by the iterable argument.

enumerate is useful for obtaining an indexed list:
    (0, seq[0]), (1, seq[1]), (2, seq[2]), ...
Type:           type
Subclasses:

In [117]:
enumerated_letters = enumerate(letters)

In [118]:
enumerated_letters

<enumerate at 0x1e0a1ab9d50>

Note that an ```enumerate``` instance is also an ```iter``` instance and has the datamodel identifier ```__next__```:

In [119]:
dir2(enumerated_letters, object, unique_only=True)

{'datamodel_method': ['__class_getitem__', '__iter__', '__next__']}


When ```next``` is used a ```tuple``` is output:

In [120]:
next(enumerated_letters)

(0, 'H')

This can be unpacked to two variables using an explicit ```tuple``` instance:

In [121]:
(index, letter) = next(enumerated_letters)

In [122]:
index

1

In [123]:
letter

'e'

However it is more common to use implicit ```tuple``` unpacking:

In [124]:
index, letter = next(enumerated_letters)

In [125]:
index

2

In [126]:
letter

'l'

A ```for``` loop can be constructed with two loop variables using the ```enumerate``` instance:

In [127]:
for index, letter in enumerate(letters):
    print(f'{index}: {letter}')

0: H
1: e
2: l
3: l
4: o
5:  
6: W
7: o
8: r
9: l
10: d
11: !


Sometimes this is useful when the index and letter are both required:

In [128]:
for index, letter in enumerate(letters):
    print(index * letter)


e
ll
lll
oooo
     
WWWWWW
ooooooo
rrrrrrrr
lllllllll
dddddddddd
!!!!!!!!!!!
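The ```start``` parameter shown in the initialisation signature of ```enumerate``` changes the first count without changing which letters are yielded. A short sketch, with the illustrative instance names ```position``` and ```numbered```, is:

```python
letters = 'Hello World!'

# Count from 1 instead of 0; every letter is still yielded
numbered = [f'{position}: {letter}' for position, letter in enumerate(letters, start=1)]

print(numbered[0])
print(numbered[-1])
```

Note that with a non-zero ```start``` the count no longer matches the index of the letter in the ```str```.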


## Immutability and hash (\_\_hash\_\_)

The ```__hash__``` datamodel identifier is not equal to ```None```:

In [129]:
str.__hash__ == None

False

This means the ```str``` is immutable. Recall immutable means once an instance is created, it cannot be modified. As a consequence each method has a ```return``` value which returns a new instance, normally a new ```str``` instance and leaves the original ```str``` unmodified:

In [130]:
greeting = 'Hello World!'

In [131]:
greeting[-1:-len(greeting)-1:-1] #return value shown in cell output

'!dlroW olleH'

In [132]:
greeting # unchanged

'Hello World!'

As mentioned above reassignment should not be confused with mutability.

In [133]:
greeting = 'Hello World!'

In [134]:
hash(greeting), id(greeting)

(-7437652338063058407, 2064296737520)

When reassignment is used, the operation on the right is carried out first, in this case the operation highlighted in parentheses. The instance data ```'Hello World!'``` is used. The ```return``` value of this operation ```'!dlroW olleH'``` is then assigned to the instance name ```greeting``` on the left:

In [135]:
greeting = (greeting[-1:-len(greeting)-1:-1])

In [136]:
hash(greeting), id(greeting)

(-2364074818600270120, 2064296728944)

Therefore the instance name ```greeting``` which can be conceptualised as a label has been unpeeled from the old instance and now is affixed to the new instance:

In [137]:
greeting

'!dlroW olleH'
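The distinction can be demonstrated by attempting to modify a character in place, which is refused, and then reassigning instead. In the sketch below ```original``` is an illustrative second instance name that keeps a reference to the first instance:

```python
greeting = 'Hello World!'

# Item assignment would mutate the instance, so it is refused
try:
    greeting[0] = 'J'
except TypeError as error:
    print(f'TypeError: {error}')

# Reassignment builds a new instance and moves the label to it
original = greeting                  # a second label on the original instance
greeting = 'J' + greeting[1:]

print(original)
print(greeting)
print(original is greeting)
```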

Because a ```str``` is hashable and therefore immutable it can be used in a mapping such as a ```dict``` which recall has the form:

```python
{key: value,
 key: value,
 key: value}
```

A ```dict``` can be conceptualised as a collection of storage locations and an immutable key is used to access each storage location which then gives a reference to an ```object```. The key must be immutable as a key that is modified will no longer fit the lock and therefore cannot be used.

Because ```str``` instances are immutable they are commonly used as keys. An example is given in the 2 ```dict``` instances below:

In [138]:
from matplotlib.colors import BASE_COLORS, CSS4_COLORS

In [139]:
BASE_COLORS

{'b': (0, 0, 1),
 'g': (0, 0.5, 0),
 'r': (1, 0, 0),
 'c': (0, 0.75, 0.75),
 'm': (0.75, 0, 0.75),
 'y': (0.75, 0.75, 0),
 'k': (0, 0, 0),
 'w': (1, 1, 1)}

In [140]:
CSS4_COLORS

{'aliceblue': '#F0F8FF',
 'antiquewhite': '#FAEBD7',
 'aqua': '#00FFFF',
 'aquamarine': '#7FFFD4',
 'azure': '#F0FFFF',
 'beige': '#F5F5DC',
 'bisque': '#FFE4C4',
 'black': '#000000',
 'blanchedalmond': '#FFEBCD',
 'blue': '#0000FF',
 'blueviolet': '#8A2BE2',
 'brown': '#A52A2A',
 'burlywood': '#DEB887',
 'cadetblue': '#5F9EA0',
 'chartreuse': '#7FFF00',
 'chocolate': '#D2691E',
 'coral': '#FF7F50',
 'cornflowerblue': '#6495ED',
 'cornsilk': '#FFF8DC',
 'crimson': '#DC143C',
 'cyan': '#00FFFF',
 'darkblue': '#00008B',
 'darkcyan': '#008B8B',
 'darkgoldenrod': '#B8860B',
 'darkgray': '#A9A9A9',
 'darkgreen': '#006400',
 'darkgrey': '#A9A9A9',
 'darkkhaki': '#BDB76B',
 'darkmagenta': '#8B008B',
 'darkolivegreen': '#556B2F',
 'darkorange': '#FF8C00',
 'darkorchid': '#9932CC',
 'darkred': '#8B0000',
 'darksalmon': '#E9967A',
 'darkseagreen': '#8FBC8F',
 'darkslateblue': '#483D8B',
 'darkslategray': '#2F4F4F',
 'darkslategrey': '#2F4F4F',
 'darkturquoise': '#00CED1',
 'darkviolet': '#9400D3

Note in each case the key is an easy to remember letter or English word and the value it corresponds to is a harder to remember ```tuple``` of the format ```(r, g, b)``` or hexadecimal value of the form ```'#rrggbb'```.
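A small sketch of the same idea is shown below, using an illustrative ```dict``` called ```colors``` with ```str``` keys. An attempt to use a mutable ```list``` as a key is included for contrast:

```python
colors = {'r': (1, 0, 0), 'g': (0, 0.5, 0)}

colors['b'] = (0, 0, 1)    # a str key is hashable and can be added
blue = colors['b']         # the same key retrieves the value
print(blue)

# A list is mutable and unhashable, so it is refused as a key
try:
    colors[['k']] = (0, 0, 0)
except TypeError as error:
    unhashable_message = str(error)
    print(f'TypeError: {unhashable_message}')
```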

Because a ```str``` is immutable, the function ```getattr``` can be used to access the identifier as a ```str```:

In [141]:
getattr(str, '__len__')

<slot wrapper '__len__' of 'str' objects>

In [142]:
str.__len__

<slot wrapper '__len__' of 'str' objects>

The counterparts ```setattr``` and ```delattr```, which mutate an object, cannot be used because a ```str``` is immutable and therefore an attribute cannot be changed or deleted.
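This can be checked on an instance. In the sketch below both exception classes are caught, as an assumption that the exact exception raised may depend on the Python version and on whether an instance or the class is targeted; the instance names are illustrative:

```python
greeting = 'Hello World!'

# getattr looks up an identifier from its name supplied as a str
method = getattr(greeting, 'upper')
shouted = method()
print(shouted)

# setattr and delattr are both refused for a str instance
failures = []
try:
    setattr(greeting, 'upper', None)
except (AttributeError, TypeError) as error:
    failures.append(type(error).__name__)

try:
    delattr(greeting, 'upper')
except (AttributeError, TypeError) as error:
    failures.append(type(error).__name__)

print(failures)
```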

## Comparison Operators (\_\_gt\_\_, \_\_ge\_\_, \_\_lt\_\_, \_\_le\_\_, \_\_eq\_\_ and \_\_ne\_\_)

Early computers were based on a typewriter that essentially prints English characters onto a sheet of paper. In order to achieve such a task a number of non-printable commands such as the carriage return (moving the carriage back to the left) and the form feed (moving the piece of paper up by the width of a line) are required as well as the printable characters such as the English letters, numbers, and whitespace:

<img src='./images/img_001.png' alt='img_001' width='800'/>

Each command has to be mapped physically into the computer's memory. Fundamentally the computer can only store data in the form of a bit which is essentially a digital switch.

A single switch has the possible values ```0```, ```1``` which is ```2 ** 1``` combinations which is a total of ```2```. Note the combination ```0``` is included so ```0:2``` is inclusive of the lower bound ```0``` and exclusive of the upper bound ```2```.

<img src='./images/img_002.png' alt='img_002' width='400'/>

More typically ```8``` of these switches are combined into a single logical unit called a byte. A byte has ```2 ** 8``` combinations which is a total of ```256```. Note the combination ```0``` is included so ```0:256``` is inclusive of the lower bound ```0``` and exclusive of the upper bound ```256```.

<img src='./images/img_003.png' alt='img_003' width='400'/>

One of the most popular sets of commands was developed in America and is known as the American Standard Code for Information Interchange (ASCII). The first ```33``` combinations correspond to non-printable characters such as the carriage return and form feed as previously discussed, in addition to a number of hardware related commands.

Each bit can be ```0``` or ```1``` and the byte sequence corresponds to the physical position of the ```8``` switches. As binary is not human readable the hexadecimal system is also used which has ```16``` characters ```0```, ```1```, ```2```, ```3```, ```4```, ```5```, ```6```, ```7```, ```8```, ```9```, ```a```, ```b```, ```c```, ```d```, ```e```, ```f```. ```2 ** 4``` is ```16``` combinations and therefore each half of the byte is represented by its own hexadecimal character. These numbering systems are shown alongside the number in decimal.


|byte|hex|num|command|
|---|---|---|---|
|00000000|00|000|null|
|00000001|01|001|start of heading|
|00000010|02|002|start of text|
|00000011|03|003|end of text|
|00000100|04|004|end of transmission|
|00000101|05|005|enquiry|
|00000110|06|006|acknowledge|
|00000111|07|007|bell|
|00001000|08|008|**backspace**|
|00001001|09|009|**horizontal tab**|
|00001010|0a|010|**new line**|
|00001011|0b|011|**vertical tab**|
|00001100|0c|012|**form feed**|
|00001101|0d|013|**carriage return**|
|00001110|0e|014|shift out|
|00001111|0f|015|shift in|
|00010000|10|016|data link escape|
|00010001|11|017|device control 1|
|00010010|12|018|device control 2|
|00010011|13|019|device control 3|
|00010100|14|020|device control 4|
|00010101|15|021|negative acknowledge|
|00010110|16|022|synchronous idle|
|00010111|17|023|end of transmission block|
|00011000|18|024|cancel|
|00011001|19|025|end of medium|
|00011010|1a|026|substitute|
|00011011|1b|027|**escape**|
|00011100|1c|028|file separator|
|00011101|1d|029|group separator|
|00011110|1e|030|record separator|
|00011111|1f|031|unit separator|
|00100000|20|032|**space**|

The remaining combinations, spanning up to half of the combinations available in a byte (```0:128```), contain the characters most commonly used in the English language.

|byte|hex|num|character|
|---|---|---|---|
|00100001|21|033|!|
|00100010|22|034|"|
|00100011|23|035|#|
|00100100|24|036|$|
|00100101|25|037|%|
|00100110|26|038|&|
|00100111|27|039|'|
|00101000|28|040|(|
|00101001|29|041|)|
|00101010|2a|042|*|
|00101011|2b|043|+|
|00101100|2c|044|,|
|00101101|2d|045|-|
|00101110|2e|046|.|
|00101111|2f|047|/|
|00110000|30|048|0|
|00110001|31|049|1|
|00110010|32|050|2|
|00110011|33|051|3|
|00110100|34|052|4|
|00110101|35|053|5|
|00110110|36|054|6|
|00110111|37|055|7|
|00111000|38|056|8|
|00111001|39|057|9|
|00111010|3a|058|:|
|00111011|3b|059|;|
|00111100|3c|060|<|
|00111101|3d|061|=|
|00111110|3e|062|>|
|00111111|3f|063|?|
|01000000|40|064|@|
|01000001|41|065|A|
|01000010|42|066|B|
|01000011|43|067|C|
|01000100|44|068|D|
|01000101|45|069|E|
|01000110|46|070|F|
|01000111|47|071|G|
|01001000|48|072|H|
|01001001|49|073|I|
|01001010|4a|074|J|
|01001011|4b|075|K|
|01001100|4c|076|L|
|01001101|4d|077|M|
|01001110|4e|078|N|
|01001111|4f|079|O|
|01010000|50|080|P|
|01010001|51|081|Q|
|01010010|52|082|R|
|01010011|53|083|S|
|01010100|54|084|T|
|01010101|55|085|U|
|01010110|56|086|V|
|01010111|57|087|W|
|01011000|58|088|X|
|01011001|59|089|Y|
|01011010|5a|090|Z|
|01011011|5b|091|[|
|01011100|5c|092|\|
|01011101|5d|093|]|
|01011110|5e|094|^|
|01011111|5f|095|_|
|01100000|60|096|`|
|01100001|61|097|a|
|01100010|62|098|b|
|01100011|63|099|c|
|01100100|64|100|d|
|01100101|65|101|e|
|01100110|66|102|f|
|01100111|67|103|g|
|01101000|68|104|h|
|01101001|69|105|i|
|01101010|6a|106|j|
|01101011|6b|107|k|
|01101100|6c|108|l|
|01101101|6d|109|m|
|01101110|6e|110|n|
|01101111|6f|111|o|
|01110000|70|112|p|
|01110001|71|113|q|
|01110010|72|114|r|
|01110011|73|115|s|
|01110100|74|116|t|
|01110101|75|117|u|
|01110110|76|118|v|
|01110111|77|119|w|
|01111000|78|120|x|
|01111001|79|121|y|
|01111010|7a|122|z|
|01111011|7b|123|{|
|01111100|7c|124|\||
|01111101|7d|125|}|
|01111110|7e|126|~|
|01111111|7f|127|DEL|
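Rows of this table can be regenerated from the decimal number using the ```builtins``` function ```chr``` and format specifiers for binary, hexadecimal and zero-padded decimal. A brief sketch for the characters ```A``` to ```E``` (the instance name ```rows``` is illustrative) is:

```python
# 08b: 8 binary digits, 02x: 2 hexadecimal digits, 03d: 3 decimal digits
rows = [f'|{num:08b}|{num:02x}|{num:03d}|{chr(num)}|' for num in range(65, 70)]

for row in rows:
    print(row)
```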


The Unicode ```str``` uses a single encoding table, the Unicode Transformation Format ```'utf-8'```, which encodes each Unicode character to a numeric combination. This numeric combination is recognised by a human as a decimal integer but is stored on a computer using bits. ```'utf-8'``` uses 8 bits (1 byte) for each ASCII character and 2-4 bytes for each character outside the ASCII range.

The function ```getsizeof``` in the ```sys``` module returns the number of bytes occupied by the ```str``` instance. Note that there is a base memory allocation for a ```str``` instance:

In [143]:
import sys
sys.getsizeof('') # 41

41

There is then a memory allocation for each character in the ```str``` instance:

In [144]:
sys.getsizeof('a') # 41 + 1

42

In [145]:
sys.getsizeof('ab') # 41 + (2 * 1)

43

Use of non-English characters requires a higher memory overhead and requires a larger number of bytes per character:

In [146]:
sys.getsizeof('α') # 41 + 17 + (1 * 2)

60

In [147]:
sys.getsizeof('αβ') # 41 + 17 + (2 * 2)

62

Python also has additional text classes such as the ```bytes``` class which can use additional encoding tables, usually from older standards which will be explored in the next notebook.

Each character is ordinal, the characters ```'a'``` and ```'A'``` are ASCII characters:

In [148]:
ord('a')

97

In [149]:
ord('A')

65

Because these are ASCII they are stored over a single byte. Recall a single byte has the following number of combinations:

In [150]:
2 ** (1 * 8)

256

The character ```'α'``` is non-ASCII and has a value that exceeds this and is therefore stored over multiple bytes:

In [151]:
ord('α')

945

In this case, the Greek letter is stored over 2 bytes:

In [152]:
2 ** (2 * 8)

65536
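The number of bytes that ```'utf-8'``` uses for a character can be counted by encoding the character with the ```str``` method ```encode```. The sketch below uses four sample characters chosen for illustration; it shows the encoded bytes, which is an assumption-free way of counting them but is not necessarily how the interpreter lays the ```str``` out in memory:

```python
samples = ['a', 'α', '€', '😀']

# Number of bytes each character occupies once encoded to utf-8
byte_counts = {char: len(char.encode('utf-8')) for char in samples}
print(byte_counts)

# The two bytes used for the Greek letter
alpha_bytes = 'α'.encode('utf-8')
print(alpha_bytes)
```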

Because the ```str``` instance is ordinal, the six comparison operators can be used to compare the numeric values of ```str``` instances:

In [153]:
'a' > 'A'

True

The above is essentially a comparison between the two ordinal values:

In [154]:
97 > 65

True

This can be used with longer ```str``` instances:

In [155]:
'apples' > 'bananas'

False

A check is made letter by letter:

In [156]:
'a' > 'b'

False

If the first letters are equal, the second letters are compared:

In [157]:
'aa' > 'ab'

False

The 6 comparison operators can be used:

In [158]:
'aa' < 'aa', 'aa' <= 'aa', 'aa' == 'aa', 'aa' >= 'aa', 'aa' > 'aa', 'aa' != 'aa'

(False, True, True, True, False, False)

In [159]:
'aa' < 'ab', 'aa' <= 'ab', 'aa' == 'ab', 'aa' >= 'ab', 'aa' > 'ab', 'aa' != 'ab'

(True, True, False, False, False, True)
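These comparisons are what the ```builtins``` function ```sorted``` uses to order ```str``` instances. Because upper case letters have lower ordinal values than lower case letters, a caseless sort needs a key such as ```str.casefold```. A short sketch with an illustrative ```list``` of words is:

```python
words = ['banana', 'Cherry', 'apple']

by_ordinal = sorted(words)                     # 'C' (67) sorts before 'a' (97)
caseless = sorted(words, key=str.casefold)     # compared after case folding

print(by_ordinal)
print(caseless)

# When one str is a prefix of the other, the shorter str is the smaller
prefix_first = 'app' < 'apple'
print(prefix_first)
```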

### Instance Methods

If the ```str``` instance ```greeting``` is instantiated:

In [160]:
greeting = 'Hello World!'

Most of the additional identifiers available to it are instance methods:

In [161]:
dir2(greeting, print_output=False)['method']

['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

Recall that the identifiers themselves are defined in the ```str``` class:

In [162]:
dir2(str, print_output=False)['method']

['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

Instance methods are accessed via an instance and therefore have access to the instance data. The docstring of the ```capitalize``` method can be examined from a ```str``` instance:

In [163]:
greeting.capitalize?

Signature: greeting.capitalize()
Docstring:
Return a capitalized version of the string.

More specifically, make the first character have upper case and the rest lower
case.
Type:      builtin_function_or_method

Or it can be examined from the class ```str``` itself:

In [164]:
str.capitalize?

Signature: str.capitalize(self, /)
Docstring:
Return a capitalized version of the string.

More specifically, make the first character have upper case and the rest lower
case.
Type:      method_descriptor

Note that the identifier name is in American English:

|Word|English Dialect|
|---|---|
|capitali**z**e|American|
|capitali**s**e|British|

When the method ```capitalize``` is called from an instance, it has access to the instance data. As a consequence this method requires no additional data to operate, which is why its parentheses are empty.


```python
greeting.capitalize()
```

In contrast when the method is called from the class itself, it has no instance data to work from therefore an instance must be provided. In Python ```self``` means *this instance*:

```python
str.capitalize(self, /)
```
```self``` occurs before a ```/``` and therefore must be provided positionally.

As the ```str``` is immutable the method has a ```return``` value and returns a new ```str``` instance that has been capitalised:

```python
Docstring:
Return a capitalized version of the string.
```

When the method is called from an instance:

In [165]:
greeting.capitalize()

'Hello world!'

The new capitalised ```str``` instance displays in the cell output. This is a new instance and the original instance is unchanged in variables:

In [166]:
variables()

|Instance Name|Type|Size/Shape|Value|
|---|---|---|---|
|greeting|str|12.0|Hello World!|
|farewell|str|3.0|bye|
|start|int||-1|
|stop|int||-13|
|step|int||-1|
|letters|str|12.0|Hello World!|
|letter|str|1.0|!|
|indexes|range|12.0|range(0, 12)|
|index|int||11|
|BASE_COLORS|dict|8.0|{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}|


Since this new instance is not assigned an instance name it has no references and is automatically removed by Python's garbage collection. It can be assigned to an instance name using:

In [167]:
cap_greeting = greeting.capitalize()

Notice no cell output as the new instance is now assigned to the instance name instead of being shown in the cell output. This can be seen in Variables:

In [168]:
variables()

|Instance Name|Type|Size/Shape|Value|
|---|---|---|---|
|greeting|str|12.0|Hello World!|
|farewell|str|3.0|bye|
|start|int||-1|
|stop|int||-13|
|step|int||-1|
|letters|str|12.0|Hello World!|
|letter|str|1.0|!|
|indexes|range|12.0|range(0, 12)|
|index|int||11|
|BASE_COLORS|dict|8.0|{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}|


If the instance method is invoked from a class, the instance ```self``` must be provided positionally as the first input argument:

In [169]:
str.capitalize(farewell)

'Bye'

Failure to supply an instance will result in a ```TypeError```. This can be seen by inputting the following into the blank code cell below:

```python
str.capitalize()
```
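The two ways of calling the method, and the ```TypeError``` when no instance is supplied, can be sketched as follows. The instance names ```from_instance```, ```from_class``` and ```missing_instance``` are illustrative:

```python
farewell = 'bye'

from_instance = farewell.capitalize()      # self is supplied implicitly
from_class = str.capitalize(farewell)      # self is supplied positionally
print(from_instance == from_class)

# Calling from the class without an instance gives a TypeError
try:
    str.capitalize()
except TypeError as error:
    missing_instance = True
    print(f'TypeError: {error}')
else:
    missing_instance = False
```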

## Case Methods

The ```str``` case method ```capitalize``` has already been examined:

In [170]:
greeting.capitalize?

Signature: greeting.capitalize()
Docstring:
Return a capitalized version of the string.

More specifically, make the first character have upper case and the rest lower
case.
Type:      builtin_function_or_method

In [171]:
greeting.capitalize()

'Hello world!'

There are associated identifiers such as:

* ```lower```
* ```casefold```
* ```upper```
* ```title```
* ```swapcase```

The docstrings of these can all be examined:

In [172]:
greeting.lower?

Signature: greeting.lower()
Docstring: Return a copy of the string converted to lowercase.
Type:      builtin_function_or_method

In [173]:
greeting.casefold?

Signature: greeting.casefold()
Docstring: Return a version of the string suitable for caseless comparisons.
Type:      builtin_function_or_method

In [174]:
greeting.upper?

Signature: greeting.upper()
Docstring: Return a copy of the string converted to uppercase.
Type:      builtin_function_or_method

In [175]:
greeting.title?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0mtitle[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return a version of the string where each word is titlecased.

More specifically, words start with uppercased characters and all remaining
cased characters have lower case.
[1;31mType:[0m      builtin_function_or_method

In [176]:
greeting.swapcase?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0mswapcase[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m Convert uppercase characters to lowercase and lowercase characters to uppercase.
[1;31mType:[0m      builtin_function_or_method

All of these case identifiers only require instance data and return a new ```str``` instance:

In [178]:
'hEllo wOrld'.lower()

'hello world'

In [179]:
'hEllo wOrld'.casefold()

'hello world'

In [180]:
'hEllo wOrld'.upper()

'HELLO WORLD'

In [181]:
'hEllo wOrld'.swapcase()

'HeLLO WoRLD'

In [182]:
'hEllo wOrld'.title()

'Hello World'

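```casefold``` is similar to ```lower``` but is more aggressive, as it is intended for caseless comparisons that include non-English characters. This can be seen with the additional German characters, where ```'ß'``` is folded to ```'ss'```, and with the Greek characters, where the final sigma ```'ς'``` is folded to ```'σ'```: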

In [183]:
'ÄäÜüÖöẞß'.lower()

'ääüüöößß'

In [184]:
'ÄäÜüÖöẞß'.casefold()

'ääüüöössss'

In [185]:
'ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω'.lower()

'ααββγγδδεεζζηηθθιικκλλμμννξξοοππρρσσςττυυφφχχψψωω'

In [186]:
'ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω'.casefold()

'ααββγγδδεεζζηηθθιικκλλμμννξξοοππρρσσσττυυφφχχψψωω'
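The docstring of ```casefold``` mentions caseless comparisons. A minimal sketch of such a comparison follows, where the instance names ```street1``` and ```street2``` are chosen only for illustration:

```python
street1 = 'Straße'
street2 = 'STRASSE'

# lower leaves 'ß' unchanged so the comparison fails
lower_match = street1.lower() == street2.lower()

# casefold converts 'ß' to 'ss' so the comparison succeeds
casefold_match = street1.casefold() == street2.casefold()

print(lower_match)     # False
print(casefold_match)  # True
```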

## Boolean Identifiers

A number of identifiers are used to examine a specific property of a ```str``` and return a boolean of ```True``` if it has that property and ```False``` otherwise: 

In [187]:
greeting.isupper?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0misupper[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is an uppercase string, False otherwise.

A string is uppercase if all cased characters in the string are uppercase and
there is at least one cased character in the string.
[1;31mType:[0m      builtin_function_or_method

In [188]:
greeting.islower?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0mislower[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is a lowercase string, False otherwise.

A string is lowercase if all cased characters in the string are lowercase and
there is at least one cased character in the string.
[1;31mType:[0m      builtin_function_or_method

In [189]:
greeting.istitle?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0mistitle[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is a title-cased string, False otherwise.

In a title-cased string, upper- and title-case characters may only
follow uncased characters and lowercase characters only cased ones.
[1;31mType:[0m      builtin_function_or_method

For example:

In [190]:
'HELLO'.isupper()

True

In [191]:
'Hello'.isupper()

False

In [192]:
'hello'.islower()

True

In [193]:
'Hello'.islower()

False

In [194]:
'Hello'.istitle()

True
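The docstrings above state that there must be at least one cased character. A few edge cases, chosen here for illustration, show the consequence of this:

```python
# uncased characters such as digits and punctuation are ignored
print('HELLO 123!'.isupper())     # True

# without any cased character both checks are False
print('123!'.isupper())           # False
print('123!'.islower())           # False

# every word has to begin with an uppercase character
print('Hello world'.istitle())    # False
print('Hello World 2'.istitle())  # True
```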

## Valid Identifier Names

The ```str``` method ```isidentifier``` will check to see if the ```str``` is valid for an identifier name. This can be useful to check before assignment of an instance to an instance name:

In [195]:
greeting.isidentifier?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0misidentifier[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is a valid Python identifier, False otherwise.

Call keyword.iskeyword(s) to test whether string s is a reserved identifier,
such as "def" or "class".
[1;31mType:[0m      builtin_function_or_method

A lowercase ```str``` instance without spaces or special characters can be checked to see if the identifier is an acceptable identifier name:

In [196]:
'hello'.isidentifier()

True

This means the following is acceptable:

```python
hello = 'some string'
```

In [197]:
hello = 'some string'

In [198]:
variables()

Unnamed: 0_level_0,Type,Size/Shape,Value
Instance Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
greeting,str,12.0,Hello World!
farewell,str,3.0,bye
start,int,,-1
stop,int,,-13
step,int,,-1
letters,str,12.0,Hello World!
letter,str,1.0,!
indexes,range,12.0,"range(0, 12)"
index,int,,11
BASE_COLORS,dict,8.0,"{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}"


A space is not acceptable and attempted use of such an identifier will give a ```SyntaxError```:

In [199]:
'hello world'.isidentifier()

False

This means the following is not acceptable:

```python
hello world = 'some string'
```

because the Python interpreter sees two instance names to the left of the assignment operator.

An underscore is acceptable and identifier names generally use ```snake_case```:

In [200]:
'hello_world'.isidentifier()

True

This means the following is acceptable:

```python
hello_world = 'some string'
```

In [201]:
hello_world = 'some string'

In [202]:
variables()

Unnamed: 0_level_0,Type,Size/Shape,Value
Instance Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
greeting,str,12.0,Hello World!
farewell,str,3.0,bye
start,int,,-1
stop,int,,-13
step,int,,-1
letters,str,12.0,Hello World!
letter,str,1.0,!
indexes,range,12.0,"range(0, 12)"
index,int,,11
BASE_COLORS,dict,8.0,"{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}"


Numbers can be included in an identifier name:

In [203]:
'hello_world2'.isidentifier()

True

This means the following is acceptable:

```python
hello_world2 = 'some string'
```

In [204]:
hello_world2 = 'some string'

In [205]:
variables()

Unnamed: 0_level_0,Type,Size/Shape,Value
Instance Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
greeting,str,12.0,Hello World!
farewell,str,3.0,bye
start,int,,-1
stop,int,,-13
step,int,,-1
letters,str,12.0,Hello World!
letter,str,1.0,!
indexes,range,12.0,"range(0, 12)"
index,int,,11
BASE_COLORS,dict,8.0,"{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}"


However an identifier cannot begin with a number and attempted use of such an identifier will give a ```SyntaxError```:

In [206]:
'2hello_world'.isidentifier()

False

This means the following is not acceptable:

```python
2hello_world = 'some string'
```

Python reads the leading ```2``` as the start of a number, but the letters that follow are not valid within a decimal number.

Special characters cannot be used as part of an identifier as they are recognised by Python as operators. Including them in an identifier will give a ```SyntaxError```:

In [207]:
'hello-world2'.isidentifier()

False

This means the following is not acceptable:

```python
hello-world2 = 'some string'
```

because the Python interpreter is seeing an operation to carry out subtraction.

Upper case identifiers can be used but generally ```PascalCase``` is reserved for a class name:

In [208]:
'PascalCase'.isidentifier()

True

This means the following is acceptable:

```python
PascalCase = 'some string'
```

However this naming convention is normally reserved for a class.

In [209]:
PascalCase = 'some string'

In [210]:
variables()

Unnamed: 0_level_0,Type,Size/Shape,Value
Instance Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
greeting,str,12.0,Hello World!
farewell,str,3.0,bye
start,int,,-1
stop,int,,-13
step,int,,-1
letters,str,12.0,Hello World!
letter,str,1.0,!
indexes,range,12.0,"range(0, 12)"
index,int,,11
BASE_COLORS,dict,8.0,"{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}"


All capitals identifiers can be used but generally ```ALL_CAPS``` is reserved for a constant:

In [211]:
'ALL_CAPS'.isidentifier()

True

This means the following is acceptable:

```python
ALL_CAPS = 'some string'
```

and the capitalisation states that this instance name is intended to be a constant, that should not be reassigned later on in the code:

In [212]:
ALL_CAPS = 'some string'

In [213]:
variables()

Unnamed: 0_level_0,Type,Size/Shape,Value
Instance Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
greeting,str,12.0,Hello World!
farewell,str,3.0,bye
start,int,,-1
stop,int,,-13
step,int,,-1
letters,str,12.0,Hello World!
letter,str,1.0,!
indexes,range,12.0,"range(0, 12)"
index,int,,11
BASE_COLORS,dict,8.0,"{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}"


An instance name shouldn't match any of the identifiers in ```__builtins__```, otherwise it will shadow the builtin (until the instance name is deleted or the kernel is restarted), which leads to confusion when the builtin is later used.

One mistake that beginners often make is to reassign the class name to an instance:

In [214]:
str = 'hello'

Then when they attempt to use the ```str``` class they get the instance instead:

In [215]:
str

'hello'

To rectify this issue ```str``` can be reassigned from the ```builtins``` module:

In [216]:
str = __builtins__.str

In [217]:
str

str

Another mistake beginners make when working with modules is to give the file they are working on the same name as the module they are trying to learn. When they attempt to import the module they are trying to learn, they accidentally import their own file instead, which typically flags up an error mentioning a circular import, such as an ```ImportError```.

There are some identifiers which are reserved. These can be seen by importing the ```keyword``` module. ```pprint``` will also be imported to allow pretty printing of a ```Collection```:

In [218]:
import keyword
import pprint

The ```list``` instance ```kwlist``` can be examined:

In [219]:
pprint.pprint(keyword.kwlist)

['False',
 'None',
 'True',
 'and',
 'as',
 'assert',
 'async',
 'await',
 'break',
 'class',
 'continue',
 'def',
 'del',
 'elif',
 'else',
 'except',
 'finally',
 'for',
 'from',
 'global',
 'if',
 'import',
 'in',
 'is',
 'lambda',
 'nonlocal',
 'not',
 'or',
 'pass',
 'raise',
 'return',
 'try',
 'while',
 'with',
 'yield']


If an attempt is made to assign to a keyword, a ```SyntaxError``` will display:

```python
with = 'hello'
```
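The docstring of ```isidentifier``` seen earlier points out that reserved keywords are not excluded by the check. A minimal sketch combining both checks follows, where the function name ```is_usable_name``` is hypothetical and not part of the standard library:

```python
import keyword


def is_usable_name(name):
    """Return True if name is a valid identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


# a keyword passes the isidentifier check on its own
print('with'.isidentifier())          # True
print(keyword.iskeyword('with'))      # True

# the combined check rejects it
print(is_usable_name('with'))         # False
print(is_usable_name('hello_world'))  # True
print(is_usable_name('2hello'))       # False
```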

There is also the soft keyword list ```softkwlist```:

In [220]:
pprint.pprint(keyword.softkwlist)

['_', 'case', 'match', 'type']


```case``` and ```match``` were introduced in Python 3.10 and ```type``` became a soft keyword in Python 3.12. They should be regarded as keywords for new code. They are only soft keywords so that older code which uses them as identifier names continues to work.

```_``` by default holds the last output in an interactive session. However ```_``` is also commonly used to indicate skipping of an ```object``` during ```tuple``` unpacking for example.

As each character maps to a Unicode code point, it is ordinal. The builtins ordinal function ```ord``` will return the ordinal value of the character as a decimal ```int```:

In [221]:
ord?

[1;31mSignature:[0m [0mord[0m[1;33m([0m[0mc[0m[1;33m,[0m [1;33m/[0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m Return the Unicode code point for a one-character string.
[1;31mType:[0m      builtin_function_or_method

For example the ordinal value of the ```str``` instance ```'3'``` can be checked, and the builtins function ```chr``` carries out the reverse operation:

In [222]:
ord('3')

51

In [223]:
chr(51)

'3'

Notice the difference in syntax highlighting between the ```str``` of the number ```'3'``` and the number ```51```. This number can be converted into a binary string or hex string using the builtins ```bin``` and ```hex``` functions respectively:

In [224]:
bin?

[1;31mSignature:[0m [0mbin[0m[1;33m([0m[0mnumber[0m[1;33m,[0m [1;33m/[0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return the binary representation of an integer.

>>> bin(2796202)
'0b1010101010101010101010'
[1;31mType:[0m      builtin_function_or_method

In [225]:
hex?

[1;31mSignature:[0m [0mhex[0m[1;33m([0m[0mnumber[0m[1;33m,[0m [1;33m/[0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return the hexadecimal representation of an integer.

>>> hex(12648430)
'0xc0ffee'
[1;31mType:[0m      builtin_function_or_method

For example:

In [226]:
bin(ord('3'))

'0b110011'

This can be conceptualised as the following with the leading zeros:

In [227]:
'0b' + bin(ord('3')).removeprefix('0b').zfill(8)

'0b00110011'

Note the prefix ```0b``` indicates a binary number and ```bin``` does not display the two leading zeros. The hexadecimal representation is more compact:

In [228]:
hex(ord('3'))

'0x33'

Note the prefix ```0x``` indicates a hexadecimal number. Each hexadecimal digit corresponds to four binary digits because 16 is 2 to the power of 4:

In [229]:
bin(16)

'0b10000'
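The padded forms can also be produced with the builtins ```format``` function, and the builtins ```int``` class converts such a ```str``` back when given the base. The following is a sketch using instance names chosen for illustration:

```python
code_point = ord('3')

# 0 pads with zeros, 8 is the total width and b requests binary
padded_binary = format(code_point, '08b')

# the # option adds the prefix, which counts towards the width
prefixed_binary = format(code_point, '#010b')
prefixed_hex = format(code_point, '#04x')

print(padded_binary)    # 00110011
print(prefixed_binary)  # 0b00110011
print(prefixed_hex)     # 0x33

# int with a base argument reverses the conversion
from_binary = int('0b110011', 2)
from_hex = int('0x33', 16)

print(from_binary == from_hex == code_point)  # True
```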

## The string module

The ```string``` module contains a number of useful strings which group characters. It can be imported using:

In [230]:
import string

The identifiers can be viewed:

In [231]:
dir2(string, object, unique_only=True)

{'attribute': ['ascii_letters',
               'ascii_lowercase',
               'ascii_uppercase',
               'digits',
               'hexdigits',
               'octdigits',
               'printable',
               'punctuation',
               'whitespace'],
 'method': ['capwords'],
 'upper_class': ['Formatter', 'Template'],
 'datamodel_attribute': ['__all__',
                         '__builtins__',
                         '__cached__',
                         '__file__',
                         '__loader__',
                         '__name__',
                         '__package__',
                         '__spec__'],
 'internal_attribute': ['_re', '_sentinel_dict', '_string'],
 'internal_method': ['_ChainMap']}


Most of the identifiers are attributes and in this case are ```str``` instances. ```ascii_letters``` is a ```str``` instance containing all English letters:

In [232]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

This can be split into lowercase and uppercase using the ```str``` instances ```ascii_lowercase``` and ```ascii_uppercase``` respectively: 

In [233]:
string.ascii_lowercase

'abcdefghijklmnopqrstuvwxyz'

In [234]:
string.ascii_uppercase

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

```digits``` is a ```str``` instance that contains the ```10``` digits used in the decimal system:

In [235]:
string.digits

'0123456789'

```hexdigits``` is a ```str``` instance that contains the characters that can be used for the ```16``` hexadecimal digits. Note ```a``` and ```A``` are aliases of one another:

In [236]:
string.hexdigits

'0123456789abcdefABCDEF'

```printable``` is a ```str``` instance that contains the printable ASCII characters:

In [237]:
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

```punctuation``` is a ```str``` instance that contains all the ASCII punctuation characters:

In [238]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

```whitespace``` is a ```str``` instance containing the whitespace characters:

In [239]:
string.whitespace

' \t\n\r\x0b\x0c'

With the exception of the space, these are shown using escape sequences which will be further explored in a moment.
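These groupings are often used for membership checks with the ```in``` keyword. A minimal sketch which counts the characters of a ```str``` instance in each grouping follows, where ```text```, ```groups``` and ```counts``` are names chosen for illustration:

```python
import string

text = 'Hello World!'

groups = {
    'lowercase': string.ascii_lowercase,
    'uppercase': string.ascii_uppercase,
    'digits': string.digits,
    'punctuation': string.punctuation,
    'whitespace': string.whitespace,
}

# count how many characters of text belong to each grouping
counts = {}
for name, group in groups.items():
    counts[name] = 0
    for character in text:
        if character in group:
            counts[name] += 1

print(counts)
```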

Now that the ASCII groupings within the ```string``` module have been seen, the additional boolean identifiers can be examined. These boolean identifiers all act upon instance data and return a ```bool```. Their docstrings are:

In [240]:
greeting.isprintable?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0misprintable[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is printable, False otherwise.

A string is printable if all of its characters are considered printable in
repr() or if it is empty.
[1;31mType:[0m      builtin_function_or_method

In [241]:
greeting.isascii?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0misascii[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if all characters in the string are ASCII, False otherwise.

ASCII characters have code points in the range U+0000-U+007F.
Empty string is ASCII too.
[1;31mType:[0m      builtin_function_or_method

In [242]:
greeting.isalnum?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0misalnum[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is an alpha-numeric string, False otherwise.

A string is alpha-numeric if all characters in the string are alpha-numeric and
there is at least one character in the string.
[1;31mType:[0m      builtin_function_or_method

In [243]:
greeting.isalpha?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0misalpha[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is an alphabetic string, False otherwise.

A string is alphabetic if all characters in the string are alphabetic and there
is at least one character in the string.
[1;31mType:[0m      builtin_function_or_method

In [244]:
greeting.isspace?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0misspace[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is a whitespace string, False otherwise.

A string is whitespace if all characters in the string are whitespace and there
is at least one character in the string.
[1;31mType:[0m      builtin_function_or_method

In [245]:
greeting.isdecimal?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0misdecimal[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is a decimal string, False otherwise.

A string is a decimal string if all characters in the string are decimal and
there is at least one character in the string.
[1;31mType:[0m      builtin_function_or_method

In [246]:
greeting.isdigit?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0misdigit[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is a digit string, False otherwise.

A string is a digit string if all characters in the string are digits and there
is at least one character in the string.
[1;31mType:[0m      builtin_function_or_method

In [247]:
greeting.isnumeric?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0misnumeric[0m[1;33m([0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Return True if the string is a numeric string, False otherwise.

A string is numeric if all characters in the string are numeric and there is at
least one character in the string.
[1;31mType:[0m      builtin_function_or_method

For example:

In [248]:
'hello Γειά σου 123'.isprintable()

True

In [249]:
'hello Γειά σου 123'.isascii()

False

In [250]:
'hello 123 !'.isascii()

True

In [251]:
'hello 123 !'.isalnum()

False

In [252]:
'hello123'.isalnum()

True

In [253]:
'hello123'.isalpha()

False

In [254]:
'hello'.isalpha()

True

In [255]:
'hello'.isspace()

False

The numeric boolean ```str``` methods ```isdecimal```, ```isdigit``` and ```isnumeric``` have subtle differences. These can be seen by examining the response of the methods for each of the following number groupings:

In [256]:
numeric_groups = {'ascii': '0123456789', 
                  'font1': '𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿', 
                  'font2': '𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵', 
                  'font3': '𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡', 
                  'subscript': '₀₁₂₃₄₅₆₇₈₉',
                  'superscript': '⁰¹²³⁴⁵⁶⁷⁸⁹',
                  'circled1': '➀➁➂➃➄➅➆➇➈',
                  'circled2': '➉',
                  'fractions': '½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉', 
                  'asciihex': '0123456789abcdef', }

In [257]:
for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isdecimal())

ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ False
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ False
circled1 ➀➁➂➃➄➅➆➇➈ False
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False


In [258]:
for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isdigit())

ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False


In [259]:
for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isnumeric())

ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef False


In [260]:
for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isalnum())

ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef True


The boolean identifiers are often used for checks, and these checks are used to create conditions and set up loops for example.
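One such check is validating a ```str``` instance before casting it to an ```int```. The following is a minimal sketch, where the function name ```parse_whole_number``` is hypothetical. ```isdecimal``` is used because it is the strictest of the three numeric checks:

```python
def parse_whole_number(text):
    """Return text cast to an int, or None if text is not a decimal string."""
    if text.isdecimal():
        return int(text)
    return None


print(parse_whole_number('123'))  # 123
print(parse_whole_number('12a'))  # None
print(parse_whole_number(''))     # None

# isdigit is too permissive for this purpose
print('²'.isdigit())              # True
print(parse_whole_number('²'))    # None
```

Note that a sign is not a decimal character, so a ```str``` instance such as ```'-5'``` is rejected by this sketch.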

## Escape Characters

The ```\``` is a special symbol used to insert an escape character. The most commonly used escape characters have the form:

In [261]:
print('|  |') # no escape character

|  |


In [262]:
print('| \t |') # the tab

| 	 |


In [263]:
print('| \n |') # the new line

| 
 |


In [264]:
print('| \\ |') # the backslash itself

| \ |


In [265]:
print('| \' |') # the single quotation

| ' |


In [266]:
print('| \" |') # the double quotation

| " |


An ASCII character, or any character within the range of a single byte, can be inserted using the escape sequence ```\x``` followed by 2 hexadecimal digits:

In [267]:
hex(ord('!')) 

'0x21'

In [268]:
'\x21' # a byte (2 hexadecimal digits)

'!'

In [269]:
print('| \x09 |') # the tab as a byte (2 hexadecimal digits)

| 	 |


Note the two hexadecimal digits have to be provided as otherwise there is an incomplete byte specified. 

The most commonly used Unicode characters outside of the ASCII range have a code point that fits within 2 bytes and can therefore be inserted using the escape sequence ```\u``` followed by 4 hexadecimal digits. For example:

In [270]:
hex(ord('α'))

'0x3b1'

In [271]:
'\u03b1' # a Unicode character (4 hexadecimal digits, 2 hexadecimal digits × 2 bytes)

'α'

Note all four hexadecimal digits have to be provided, otherwise the escape sequence is incomplete. The next line of code shows a common problem when attempting to input a Windows path:

```python
'c:\users\philip'
```

In the above, the Python interpreter sees the first ```\``` as an instruction to insert an escape character. ```u``` is an instruction to expect a Unicode escape sequence and therefore the Python interpreter attempts to read the next four characters ```sers``` as hexadecimal digits. ```s``` and ```r``` are not valid hexadecimal digits. Recall that the hexadecimal system has 16 digits ```0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f``` and therefore a ```SyntaxError``` is flagged up.

To insert a Windows path ```\\``` should be used to indicate insertion of the escape character ```\```:

```python
'c:\\users\\philip'
```

Note that the hex form is normally used to represent a byte that is not printable. If the 6 whitespace characters are examined in more detail this can be seen:

In [272]:
string.whitespace

' \t\n\r\x0b\x0c'

|name|escape sequence|byte|
|---|---|---|
|space|' '|'\\x20'|
|tab|'\\t'|'\\x09'|
|new line|'\\n'|'\\x0a'|
|carriage return|'\\r'|'\\x0d'|
|vertical tab||'\\x0b'|
|form feed||'\\x0c'|

In [273]:
' ' == '\x20'

True

In [274]:
'\t' == '\x09'

True

In [275]:
'\n' == '\x0a'

True

In [276]:
'\r' == '\x0d'

True

It is not common to do so, however each ASCII character in a string can also be inserted as an escape character:

In [277]:
'\x68\x65\x6c\x6c\x6f\x20\x77\x6f\x72\x6c\x64\x21'

'hello world!'

The ```unicodedata``` module can be imported:

In [278]:
import unicodedata

Its identifiers can be viewed using:

In [279]:
dir2(unicodedata, object, unique_only=True)

{'attribute': ['ucd_3_2_0', 'unidata_version'],
 'method': ['bidirectional',
            'category',
            'combining',
            'decimal',
            'decomposition',
            'digit',
            'east_asian_width',
            'is_normalized',
            'lookup',
            'mirrored',
            'name',
            'normalize',
            'numeric'],
 'upper_class': ['UCD'],
 'datamodel_attribute': ['__file__',
                         '__loader__',
                         '__name__',
                         '__package__',
                         '__spec__'],
 'internal_attribute': ['_ucnhash_CAPI']}


The Unicode version can be checked using:

In [280]:
unicodedata.unidata_version

'15.0.0'

And once the version number is known, more details about the supported characters can be examined using the [Unicode Documentation](https://unicode.org/versions/Unicode15.0.0/).

A Unicode character can also be specified over 4 bytes and can therefore be inserted using the escape sequence ```\U``` followed by 8 hexadecimal digits. For example:

In [281]:
'\U0000303a'

'〺'
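The escape sequences can be compared with one another and with the functions ```name``` and ```lookup``` from the ```unicodedata``` module listed above. The following is a sketch, and the ```\N{...}``` named escape sequence used in it is an addition which is not otherwise covered in this notebook:

```python
import unicodedata

# the same character from a 4 digit, 8 digit and named escape sequence
print('\u03b1' == '\U000003b1' == '\N{GREEK SMALL LETTER ALPHA}')  # True

# name and lookup convert between a character and its Unicode name
alpha_name = unicodedata.name('α')
alpha = unicodedata.lookup('GREEK SMALL LETTER ALPHA')
print(alpha_name)  # GREEK SMALL LETTER ALPHA
print(alpha)       # α

# a code point above 0xffff needs the 8 digit form
face = '\U0001f600'
print(hex(ord(face)))          # 0x1f600
print(unicodedata.name(face))  # GRINNING FACE
```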

## Translation Table

A translation table can be created for use with the instance method ```translate```:

In [282]:
greeting.translate?

[1;31mSignature:[0m [0mgreeting[0m[1;33m.[0m[0mtranslate[0m[1;33m([0m[0mtable[0m[1;33m,[0m [1;33m/[0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m
Replace each character in the string using the given translation table.

  table
    Translation table, which must be a mapping of Unicode ordinals to
    Unicode ordinals, strings, or None.

The table must implement lookup/indexing via __getitem__, for instance a
dictionary or list.  If this operation raises LookupError, the character is
left untouched.  Characters mapped to None are deleted.
[1;31mType:[0m      builtin_function_or_method

```maketrans``` is a static method, which is essentially a function that is bound to neither the instance nor the class. This function merely exists in the namespace of the class as this is the most logical place to find it (conceptualise the class as a Python module):

In [283]:
str.maketrans?

[1;31mDocstring:[0m
Return a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode
ordinals (integers) or characters to Unicode ordinals, strings or None.
Character keys will be then converted to ordinals.
If there are two arguments, they must be strings of equal length, and
in the resulting dictionary, each character in x will be mapped to the
character at the same position in y. If there is a third argument, it
must be a string, whose characters will be mapped to None in the result.
[1;31mType:[0m      builtin_function_or_method

In [284]:
greektolatin = str.maketrans('αβγδε', 'abcde')
greektolatin

{945: 97, 946: 98, 947: 99, 948: 100, 949: 101}

In [285]:
hex(945)

'0x3b1'

In [286]:
hex(97)

'0x61'

This translation table can be used on the example ```str``` instance to replace the Greek letters (keys) with the Latin letters (values):

In [287]:
'αββγγγδδδδεεεεε'.translate(greektolatin)

'abbcccddddeeeee'
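The docstring of ```maketrans``` also describes a one argument form and a three argument form. A minimal sketch of both follows, where the instance names ```remove_punctuation``` and ```expand_eszett``` are chosen for illustration:

```python
import string

# three arguments: characters in the third str are mapped to None (deleted)
remove_punctuation = str.maketrans('', '', string.punctuation)
print('Hello, World!'.translate(remove_punctuation))  # Hello World

# one argument: a dict can map a character to a longer str
expand_eszett = str.maketrans({'ß': 'ss'})
print(expand_eszett)                      # {223: 'ss'}
print('Straße'.translate(expand_eszett))  # Strasse
```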

## File Paths and Raw Strings

In a Python string, the ```\``` is a special character that is an instruction to insert an escape character. Unfortunately the ```\``` is also the default directory separator used for a file path in Windows.

To incorporate an ```\``` into a ```str``` instance ```\\``` has to be used; the first ```\``` is an instruction to insert an escape character and the second ```\``` states that the escape character to be inserted is the ```\``` itself:

In [288]:
windows_file_path = 'C:\\Users\\Philip'

This problem does not occur on Linux because ```/``` is used as a directory separator in a file path:

In [289]:
linux_file_path = '/users/philip'

Windows can also use ```/``` as an alternative directory separator however when copying file paths from Windows Explorer for example, the default separator ```\``` will be used.

Compare the cell output with the output from a ```print``` statement:

In [290]:
windows_file_path

'C:\\Users\\Philip'

In [291]:
print(windows_file_path)

C:\Users\Philip


In Windows the file path is of the form ```'C:\Users\Philip'``` using the default separator ```\```. A ```SyntaxError``` displays when it is used, because ```\U``` is read as the start of a Unicode escape sequence:

```python
windows_file_path = 'C:\Users\Philip'
```

For the file path to be recognised as a Python string each ```\``` has to be converted into a ```\\```:

```python
windows_file_path = 'C:\\Users\\Philip'
```

This can be quite cumbersome for long file paths. Python also has a raw string which does not process escape characters and any ```\``` is recognised as being part of the ```str``` instance. A raw ```str``` has the prefix ```r``` or ```R```:

In [292]:
raw_windows_file_path1 = r'C:\Users\Philip'

In [293]:
raw_windows_file_path2 = R'C:\Users\Philip'

Both ```r``` and ```R``` give the same raw ```str``` instance:

In [294]:
raw_windows_file_path1 == raw_windows_file_path2

True

In [295]:
raw_windows_file_path2

'C:\\Users\\Philip'

In [296]:
print(raw_windows_file_path2)

C:\Users\Philip


The subtle difference between the two is in the syntax highlighting shown by some editors. Uppercase ```R``` shows no formatting around the special characters which is appropriate for the file path. Lowercase ```r``` on the other hand shows syntax highlighting following the escape character and is used to construct regular expressions which will be briefly mentioned in the next section.
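The effect of the raw prefix can be confirmed by comparing lengths. This is a short sketch using literals chosen for illustration:

```python
# in a regular str, \n is a single new line character
print(len('\n'))   # 1

# in a raw str, the \ and the n are two separate characters
print(len(r'\n'))  # 2

# a raw file path is equal to its escaped equivalent
print(r'C:\Users\Philip' == 'C:\\Users\\Philip')  # True
```

One limitation is that a raw ```str``` cannot end in a single ```\```, as the closing quotation would then be escaped and a ```SyntaxError``` would display.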

## Find and Index

Previously indexing using an ```int``` or a ```slice``` was discussed:

In [297]:
greeting

'Hello World!'

In [298]:
greeting[0]

'H'

In [299]:
greeting[:5]

'Hello'

The ```str``` instance methods ```index``` and ```find``` perform the inverse operation and retrieve the positive index corresponding to the first occurrence of a character or the start of a substring:

In [300]:
greeting.find?

[1;31mDocstring:[0m
S.find(sub[, start[, end]]) -> int

Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end].  Optional
arguments start and end are interpreted as in slice notation.

Return -1 on failure.
[1;31mType:[0m      builtin_function_or_method

In [301]:
greeting.index?

[1;31mDocstring:[0m
S.index(sub[, start[, end]]) -> int

Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end].  Optional
arguments start and end are interpreted as in slice notation.

Raises ValueError when the substring is not found.
[1;31mType:[0m      builtin_function_or_method

These two instance methods behave identically upon success:

In [302]:
greeting.find('l')

2

In [303]:
greeting.index('l')

2

However they give ```-1``` and a ```ValueError``` respectively upon failure:

In [304]:
greeting.find('L')

-1

```python
greeting.index('L')
```
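The two failure behaviours lead to two styles of handling a missing substring. The following is a sketch, where ```greeting``` is reassigned so the example is self-contained and the instance names ```position``` and ```wrong_character``` are chosen for illustration:

```python
greeting = 'Hello World!'

# index: the failure is handled as an exception
try:
    position = greeting.index('L')
except ValueError:
    position = None

print(position)  # None

# find: the failure has to be checked for explicitly
print(greeting.find('L') == -1)  # True

# forgetting the check is a silent error as -1 is a valid index
wrong_character = greeting[greeting.find('L')]
print(wrong_character)  # !
```

The last case is a reason to prefer ```index``` when the substring is expected to be present.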

These instance methods take consistent ```start``` and ```end``` input arguments, like the ```start``` and ```stop``` of the ```slice``` and ```range``` classes seen earlier, which can be used to restrict the search range. For example to find the index of all the occurrences of ```'l'```:

In [305]:
greeting.find('l')

2

In [306]:
greeting.find('l', 2+1)

3

In [307]:
greeting.find('l', 3+1)

9

In [308]:
greeting.find('l', 9+1)

-1
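The stepwise search above can be automated with a loop that stops when ```find``` returns ```-1```. This is a minimal sketch; the helper name ```find_all``` is hypothetical and is not a ```str``` method:

```python
def find_all(text, sub):
    """Return a list of every index at which sub starts in text."""
    indexes = []
    start = text.find(sub)
    while start != -1:
        indexes.append(start)
        # Resume the search one position past the previous match
        start = text.find(sub, start + 1)
    return indexes


greeting = 'Hello World!'
l_indexes = find_all(greeting, 'l')
print(l_indexes)
```

The returned ```list``` contains the same three indexes that were found manually above.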

A substring of more than one character can also be searched for, in which case the index of its first character is returned:

In [309]:
greeting.find('World')

6

In [310]:
greeting.find('W')

6

The ```index``` and ```find``` methods search the ```str``` instance for a substring from left to right. They are complemented by the reverse find and reverse index methods, ```rfind``` and ```rindex``` respectively, which search from right to left and return the highest matching index:

In [311]:
greeting.rfind('l')

9

In [312]:
greeting.rfind('l', 0, 9)

3

In [313]:
greeting.rfind('l', 0, 3)

2

In [314]:
greeting.rfind('l', 0, 2)

-1

In [315]:
greeting.rfind('l')

9

The ```str``` instance method ```count``` returns the number of non-overlapping occurrences of a substring in the ```str``` instance:

In [316]:
greeting.count('l')

3

The ```str``` instance methods ```startswith``` and ```endswith``` return a ```bool``` that indicates whether the ```str``` instance starts with the substring ```prefix``` or ends with the substring ```suffix```. These also take optional ```start``` and ```end``` input arguments which can be used to restrict the search range:

In [317]:
greeting.startswith?

Docstring:
S.startswith(prefix[, start[, end]]) -> bool

Return True if S starts with the specified prefix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
prefix can also be a tuple of strings to try.
Type:      builtin_function_or_method

In [318]:
greeting.endswith?

Docstring:
S.endswith(suffix[, start[, end]]) -> bool

Return True if S ends with the specified suffix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
suffix can also be a tuple of strings to try.
Type:      builtin_function_or_method

In [319]:
greeting

'Hello World!'

In [320]:
greeting.startswith('hello')

False

In [321]:
greeting.startswith('hello', 1)

False

In [322]:
greeting.endswith('!')

True

In [323]:
greeting.endswith('!', 0, 11)

False
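The docstrings above note that ```prefix``` and ```suffix``` can also be a ```tuple``` of ```str``` instances, in which case ```True``` is returned if any one of them matches. A short sketch, where the candidate prefixes and suffixes are chosen only for illustration:

```python
greeting = 'Hello World!'

# True if greeting starts with any one of the prefixes in the tuple
starts_with_salutation = greeting.startswith(('Hi', 'Hello', 'Hey'))

# True if greeting ends with any one of the suffixes in the tuple
ends_with_punctuation = greeting.endswith(('.', '?', '!'))

print(starts_with_salutation, ends_with_punctuation)
```
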

The ```str``` instance method ```replace``` returns a copy in which an ```old``` substring is replaced with a ```new``` substring. The match is case-sensitive, so if ```old``` is not found the copy is unchanged. The optional argument ```count``` has a default value of ```-1```, which means every occurrence is replaced:

In [324]:
greeting.replace?

Signature: greeting.replace(old, new, count=-1, /)
Docstring:
Return a copy with all occurrences of substring old replaced by new.

  count
    Maximum number of occurrences to replace.
    -1 (the default value) means replace all occurrences.

If the optional argument count is given, only the first count occurrences are
replaced.
Type:      builtin_function_or_method

In [325]:
greeting

'Hello World!'

In [326]:
greeting.replace('hello', 'bye')

'Hello World!'

In [327]:
greeting.replace('l', 'L')

'HeLLo WorLd!'

In [328]:
greeting.replace('l', 'L', 1)

'HeLlo World!'

## The re module

The regular expressions module is used for advanced pattern searching:

In [329]:
text = 'Email example@example.com, example2@example.com Telephone 0000000000 Website https://www.example.com'

For example, raw ```str``` instances with the prefix ```r``` can be used to create regular expression patterns for an email address, a 10 digit number and a website:

In [330]:
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
number_pattern = r'\b\d{10}\b'
website_pattern = r'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

Notice the difference in syntax highlighting when uppercase ```R``` is used:

In [331]:
email_pattern = R'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
number_pattern = R'\b\d{10}\b'
website_pattern = R'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

The regular expression module can be imported:

In [332]:
import re

In [333]:
dir2(re, object, unique_only=True)

{'constant': ['A',
              'ASCII',
              'DEBUG',
              'DOTALL',
              'I',
              'IGNORECASE',
              'L',
              'LOCALE',
              'M',
              'MULTILINE',
              'NOFLAG',
              'S',
              'T',
              'TEMPLATE',
              'U',
              'UNICODE',
              'VERBOSE',
              'X'],
 'module': ['copyreg', 'enum', 'functools'],
 'method': ['compile',
            'escape',
            'findall',
            'finditer',
            'fullmatch',
            'match',
            'purge',
            'search',
            'split',
            'sub',
            'subn',
            'template'],
 'lower_class': ['error'],
 'upper_class': ['Match', 'Pattern', 'RegexFlag', 'Scanner'],
 'datamodel_attribute': ['__all__',
                         '__builtins__',
                         '__cached__',
                         '__file__',
                         '__loader__',
      

The ```re.findall``` function can be used to search for every non-overlapping occurrence of a pattern and returns the matches in a ```list```:

In [334]:
re.findall?

Signature: re.findall(pattern, string, flags=0)
Docstring:
Return a list of all non-overlapping matches in the string.

If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.

Empty matches are included in the result.
File:      c:\users\phili\anaconda3\envs\vscode-env\lib\re\__init__.py
Type:      function

For example a search for the ```email_pattern``` can be made in ```text```:

In [335]:
email_search = re.findall(email_pattern, text)

The results can be seen in the output ```list``` instance:

In [336]:
email_search

['example@example.com', 'example2@example.com']

A search can also be made for the ```number_pattern``` and ```website_pattern```:

In [337]:
number_search = re.findall(number_pattern, text)

In [338]:
number_search

['0000000000']

In [339]:
website_search = re.findall(website_pattern, text)

In [340]:
website_search

['https://www.example.com']
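When the position of each match is needed as well as the matched text, ```re.finditer``` (listed among the functions above) yields ```Match``` instances instead of ```str``` instances. The sketch below uses a shortened copy of ```text``` and an email pattern assumed for illustration; ```email_positions``` is a hypothetical name:

```python
import re

text = 'Email example@example.com, example2@example.com Telephone 0000000000'
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

# Each Match instance records the matched text and the index where it starts
email_positions = [(match.group(), match.start())
                   for match in re.finditer(email_pattern, text)]

print(email_positions)
```
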

## The print function

The ```print``` function has previously been used with its default named parameters. More details about these can be seen in the docstring:

In [341]:
print?

Signature: print(*args, sep=' ', end='\n', file=None, flush=False)
Docstring:
Prints the values to a stream, or to sys.stdout by default.

sep
  string inserted between values, default a space.
end
  string appended after the last value, default a newline.
file
  a file-like object (stream); defaults to the current sys.stdout.
flush
  whether to forcibly flush the stream.
Type:      builtin_function_or_method

```*args``` indicates that a variable number of positional input arguments are used. ```sep``` and ```end``` are named input arguments which have a default value of a space and a new line respectively. ```file``` and ```flush``` are for advanced purposes when the print stream is to be directed for example to a file instead of a cell output:

```python
print(*args, sep=' ', end='\n', file=None, flush=False)
```

The effect of overriding the default value of ```sep``` can be seen:

In [342]:
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')

the brown fox jumps over the lazy dog


In [343]:
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', sep='')

thebrownfoxjumpsoverthelazydog


The effect of overriding the default value of ```end``` can be seen:

In [344]:
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')

the brown fox jumps over the lazy dog
the brown fox jumps over the lazy dog


In [345]:
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', end='')
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')

the brown fox jumps over the lazy dogthe brown fox jumps over the lazy dog
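The ```file``` parameter can be explored without creating a file on disk by printing to an in-memory text stream. This is a minimal sketch using ```io.StringIO``` from the standard library; the names ```buffer``` and ```captured``` are chosen for illustration:

```python
import io

buffer = io.StringIO()

# Write to the in-memory text stream instead of the cell output
print('hello', 'world', sep='-', end='!\n', file=buffer)

captured = buffer.getvalue()
print(repr(captured))
```

The captured ```str``` instance shows both the overridden ```sep``` and the overridden ```end```.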


## Formatted Strings

Supposing a ```str``` body has the form:

In [346]:
body = 'The string to 0 is 1 2!'

And there are three ```str``` instances:

In [347]:
var0 = 'print'
var1 = 'hello'
var2 = 'world'

The objective of a formatted string is to insert these instances into the ```str``` body so that a formatted ```str``` instance of the following form can be returned:

In [348]:
'The string to print is hello world!'

'The string to print is hello world!'

If the docstring of the ```str``` method ```format``` is examined:

In [349]:
body.format?

Docstring:
S.format(*args, **kwargs) -> str

Return a formatted version of S, using substitutions from args and kwargs.
The substitutions are identified by braces ('{' and '}').
Type:      builtin_function_or_method

Then it can be seen that substitutions are identified by braces so the ```str``` body should be modified to have the following form:

In [350]:
body = 'The string to {0} is {1} {2}!'

Notice the syntax highlighting clearly distinguishes these placeholders.

```*args``` represents a variable number of positional input arguments. Each placeholder index in the ```str``` body refers to the positional input argument at that index, so an argument must be supplied for every index used. Now the ```format``` method can be used:

In [351]:
body.format(var0, var1, var2)

'The string to print is hello world!'

The ```str``` instance body can alternatively be set up to contain named placeholders:

In [352]:
body = 'The string to {var0_} is {var1_} {var2_}!'

```**kwargs``` represents a variable number of named keyword input arguments, whose names should match the named placeholders in the ```str``` instance ```body```:

In [353]:
body.format(var0_=var0, var1_=var1, var2_=var2)

'The string to print is hello world!'

The two lines above can be combined:

In [354]:
'The string to {var0_} is {var1_} {var2_}!'.format(var0_=var0, var1_=var1, var2_=var2)

'The string to print is hello world!'

It is more common for the placeholders to be given the same names as the instances to be inserted:

In [355]:
'The string to {var0} is {var1} {var2}!'.format(var0=var0, var1=var1, var2=var2)

'The string to print is hello world!'

Notice in the above that each instance name is used 3 times, which is cumbersome. A shorthand way of writing the expression above is to use the prefix ```f``` or ```F```, which means formatted string:

In [356]:
f'The string to {var0} is {var1} {var2}!'

'The string to print is hello world!'

In [357]:
F'The string to {var0} is {var1} {var2}!'

'The string to print is hello world!'

There is no difference for uppercase and lowercase in formatted ```str``` instances and the syntax highlighting is the same in either case.

If the ```object``` datamodel method ```__format__``` is examined:

In [358]:
object.__format__?

Signature: object.__format__(self, format_spec, /)
Docstring:
Default object formatter.

Return str(self) if format_spec is empty. Raise TypeError otherwise.
Type:      method_descriptor

Notice there is a format specification ```format_spec```:

In [359]:
greeting

'Hello World!'

The format specification for a ```str``` instance has the form:

```python
'0ns'
```

where ```n``` is an integer that gives the minimum width in characters, ```s``` means ```str``` and the optional prefix ```0``` fills any unused width with ```0``` instead of a space.

In [360]:
greeting.__format__('s')

'Hello World!'

In [361]:
greeting.__format__('22s')

'Hello World!          '

In [362]:
greeting.__format__('022s')

'Hello World!0000000000'

The format specifier options differ for each datatype. Normally a colon is used to include the format specifier beside the variable in the formatted ```str```:

In [363]:
f'The string to {var0:s} is {var1} {var2}!'

'The string to print is hello world!'

The ```str``` format specifier can specify a minimum width as an integer number of characters:

In [364]:
f'The string to {var0:10s} is {var1} {var2}!'

'The string to print      is hello world!'

If prefixed with ```0``` then trailing spaces will be displayed using ```0```:

In [365]:
f'The string to {var0:010s} is {var1:s} {var2:s}!'

'The string to print00000 is hello world!'

In the above ```str``` instances were inserted into a ```str``` instance body. It is more common to insert numeric variables into the ```str``` instance body:

In [366]:
num1 = 1
num2 = 0.0000123456789
num3 = 12.3456789

In [367]:
f'The numbers are {num1}, {num2} and {num3}.' 

'The numbers are 1, 1.23456789e-05 and 12.3456789.'

The format specifier for an integer decimal (```d```) can be used:

In [368]:
f'The numbers are {num1:d}, {num2} and {num3}.' 

'The numbers are 1, 1.23456789e-05 and 12.3456789.'

In [369]:
f'The numbers are {num1:5d}, {num2} and {num3}.' 

'The numbers are     1, 1.23456789e-05 and 12.3456789.'

In [370]:
f'The numbers are {num1:05d}, {num2} and {num3}.' 

'The numbers are 00001, 1.23456789e-05 and 12.3456789.'

In [371]:
f'The numbers are {num1: 05d}, {num2} and {num3}.' 

'The numbers are  0001, 1.23456789e-05 and 12.3456789.'

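Again the number of characters that the number should occupy can be specified. Unlike the ```str``` format specifier, the padding is leading as opposed to trailing. If the width is prefixed with a ```0```, the padding is shown as ```0```.

Notice that in the last example one of the five characters is a space. The space in the format specifier reserves a leading position for the sign, which is shown as a space for a positive number. Compare this with the previous example, where the space is absent and all four padding characters are ```0```.

The general format specifier (```g```) can be used for ```float``` instances and displays 6 significant figures by default: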

In [372]:
f'The numbers are {num1}, {num2:g} and {num3:g}.' 

'The numbers are 1, 1.23457e-05 and 12.3457.'

The ```e``` format specifier can be used for ```float``` exponential format:

In [373]:
f'The numbers are {num1}, {num2:e} and {num3:e}.' 

'The numbers are 1, 1.234568e-05 and 1.234568e+01.'

The number of places after the decimal point can be specified:

In [374]:
f'The numbers are {num1}, {num2:0.3e} and {num3:0.3e}.' 

'The numbers are 1, 1.235e-05 and 1.235e+01.'

A fixed format can also be used:

In [375]:
f'The numbers are {num1}, {num2:f} and {num3:f}.' 

'The numbers are 1, 0.000012 and 12.345679.'

Once again the number of digits after the decimal point can be specified:

In [376]:
f'The numbers are {num1}, {num2:0.3f} and {num3:0.3f}.' 

'The numbers are 1, 0.000 and 12.346.'

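```float``` instances can use the general (```g```), exponential (```e```) and fixed (```f```) format specifiers. With ```e``` and ```f``` the precision ```.3``` specifies rounding to ```3``` digits past the decimal point, whereas with ```g``` it specifies ```3``` significant figures. The ```0``` before the decimal point in ```0.3``` has no visible effect here, as no width is given.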
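The effect of the same precision on the three ```float``` format specifiers can be compared side by side. This is a small sketch in which the names ```fixed```, ```exponential``` and ```general``` are chosen for illustration:

```python
num3 = 12.3456789

fixed = f'{num3:.3f}'         # 3 digits after the decimal point
exponential = f'{num3:.3e}'   # 3 digits after the decimal point of the mantissa
general = f'{num3:.3g}'       # 3 significant figures in total

print(fixed, exponential, general)
```
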

If the keys in a ```dict``` instance match the placeholder names in the ```str``` body:

In [377]:
numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}

In [378]:
body = 'The numbers are {num1:d}, {num2:.3e} and {num3:.3e}.'

The ```format_map``` method can be used with the mapping to insert the instances:

In [379]:
body.format_map?

Docstring:
S.format_map(mapping) -> str

Return a formatted version of S, using substitutions from mapping.
The substitutions are identified by braces ('{' and '}').
Type:      builtin_function_or_method

In [380]:
body.format_map(numbers)

'The numbers are 1, 1.235e-05 and 1.235e+01.'
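Because ```format_map``` uses the mapping directly, a ```dict``` subclass can decide what happens when a placeholder has no matching key. The sketch below is an illustration only; the class name ```KeepMissing``` is hypothetical and the placeholders deliberately carry no format specifier:

```python
class KeepMissing(dict):
    """Hypothetical dict subclass that leaves unknown placeholders in place."""

    def __missing__(self, key):
        # Called by dict lookup when key is absent
        return '{' + key + '}'


body = 'The numbers are {num1}, {num2} and {num3}.'
partial = body.format_map(KeepMissing(num1=1, num2=2.5))
print(partial)
```

With an ordinary ```dict``` instance the missing key ```'num3'``` would instead raise a ```KeyError```.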

Notice that the syntax for a format specifier ```{variable:format_spec}``` is similar to the form of a Python ```dict``` instance ```{key:value}```. However, spacing to the right of the colon is often present in a dictionary ```{key: value}``` and does not change the value, whereas a space added after the colon in a placeholder becomes part of the format specifier.

The older style of formatted ```str``` instances uses the datamodel identifier ```__mod__``` (*dunder mod*), which controls the behaviour of the operator ```%```. Older style string formatting also uses ```%``` as the placeholder prefix, as opposed to the braces ```{}```:

In [381]:
body = 'The numbers are %d, %0.3f and %0.3g.' 
nums = (1, 0.0000123456789, 12.3456789)

In [382]:
body.__mod__?

Signature:      body.__mod__(value, /)
Call signature: body.__mod__(*args, **kwargs)
Type:           method-wrapper
String form:    <method-wrapper '__mod__' of str object at 0x000001E0A227B960>
Docstring:      Return self%value.

In [383]:
body % nums

'The numbers are 1, 0.000 and 12.3.'
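The older style also accepts a mapping on the right of the ```%``` operator when each placeholder names its key in parentheses. A brief sketch, reusing the values above under the assumed name ```numbers```:

```python
body = 'The numbers are %(num1)d, %(num2)0.3f and %(num3)0.3g.'
numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}

# Each %(key) placeholder looks up its value in the mapping
formatted = body % numbers
print(formatted)
```
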

## Multiline Strings

A ```str``` instance can be written over multiple lines using triple double quotes:

In [384]:
multiline = """the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog"""

In [385]:
multiline

'the quick brown fox jumps over the lazy dog\nthe quick brown fox jumps over the lazy dog\nthe quick brown fox jumps over the lazy dog\nthe quick brown fox jumps over the lazy dog'

In [386]:
print(multiline)

the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog


Note that any spacing added will be incorporated into the multiline ```str``` instance:

In [387]:
multiline = """
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            """

In [388]:
multiline

'\n            the quick brown fox jumps over the lazy dog\n            the quick brown fox jumps over the lazy dog\n            the quick brown fox jumps over the lazy dog\n            the quick brown fox jumps over the lazy dog\n            '
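If the indentation is only there to keep the source code tidy, it can be removed afterwards. The following is a minimal sketch using ```textwrap.dedent``` from the standard library on a shortened example; the name ```cleaned``` is chosen for illustration:

```python
import textwrap

multiline = """
            the quick brown fox
            jumps over the lazy dog
            """

# dedent removes the whitespace common to every line,
# strip then removes the leading and trailing newlines
cleaned = textwrap.dedent(multiline).strip()
print(cleaned)
```
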

In [389]:
print(multiline)


            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            


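Triple double quotes are preferred over triple single quotes because multiline ```str``` instances are commonly used for docstrings. A docstring is often written briefly during development and expanded later to include ```str``` literals, which are normally enclosed in single quotes and can therefore appear inside the triple double quotes without being escaped: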

In [390]:
print?

Signature: print(*args, sep=' ', end='\n', file=None, flush=False)
Docstring:
Prints the values to a stream, or to sys.stdout by default.

sep
  string inserted between values, default a space.
end
  string appended after the last value, default a newline.
file
  a file-like object (stream); defaults to the current sys.stdout.
flush
  whether to forcibly flush the stream.
Type:      builtin_function_or_method

In [391]:
doc = """Prints the values

sep
  string inserted between values, default a space ' '.
end
  string appended after the last value, default a newline '\\n'."""

In [392]:
print(doc)

Prints the values

sep
  string inserted between values, default a space ' '.
end
  string appended after the last value, default a newline '\n'.


## Center and Justify

A ```str``` instance can be centered and justified using the ```str``` methods ```center```, ```ljust``` and ```rjust```:

In [393]:
greeting.center?

Signature: greeting.center(width, fillchar=' ', /)
Docstring:
Return a centered string of length width.

Padding is done using the specified fill character (default is a space).
Type:      builtin_function_or_method

In [394]:
greeting.ljust?

Signature: greeting.ljust(width, fillchar=' ', /)
Docstring:
Return a left-justified string of length width.

Padding is done using the specified fill character (default is a space).
Type:      builtin_function_or_method

In [395]:
greeting.rjust?

Signature: greeting.rjust(width, fillchar=' ', /)
Docstring:
Return a right-justified string of length width.

Padding is done using the specified fill character (default is a space).
Type:      builtin_function_or_method

In [396]:
len(greeting)

12

In [397]:
greeting.center(20)

'    Hello World!    '

In [398]:
greeting.center(20, 'X')

'XXXXHello World!XXXX'

In [399]:
greeting.ljust(20, 'X')

'Hello World!XXXXXXXX'

In [400]:
greeting.rjust(20, 'X')

'XXXXXXXXHello World!'
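A typical use of ```ljust``` and ```rjust``` is lining up columns of text. The sketch below is illustrative only; the data in ```rows``` and the column widths of ```10``` and ```5``` are assumptions:

```python
rows = [('apples', 3), ('kiwis', 12), ('cherries', 250)]

lines = []
for name, quantity in rows:
    # Left-justify the name in 10 characters, right-justify the number in 5
    line = name.ljust(10, '.') + str(quantity).rjust(5, '.')
    lines.append(line)
    print(line)
```

Every line has the same length of ```15``` characters, so the quantities align on the right.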

The opposite operation can be carried out using the ```str``` methods left strip and right strip, ```lstrip``` and ```rstrip``` respectively. These strip leading or trailing whitespace by default, or any leading or trailing characters found in ```chars``` when it is specified. ```chars``` is treated as a set of characters rather than as an exact sequence:

In [401]:
padded_greeting = greeting.center(20)

In [402]:
padded_greeting

'    Hello World!    '

In [403]:
padded_greeting.lstrip?

Signature: padded_greeting.lstrip(chars=None, /)
Docstring:
Return a copy of the string with leading whitespace removed.

If chars is given and not None, remove characters in chars instead.
Type:      builtin_function_or_method

In [404]:
padded_greeting.rstrip?

Signature: padded_greeting.rstrip(chars=None, /)
Docstring:
Return a copy of the string with trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.
Type:      builtin_function_or_method

In [405]:
padded_greeting.lstrip()

'Hello World!    '

In [406]:
padded_greeting.rstrip()

'    Hello World!'

In [407]:
padded_greeting.lstrip().rstrip()

'Hello World!'

In [408]:
padded_greeting = greeting.center(20, 'X')

In [409]:
padded_greeting

'XXXXHello World!XXXX'

In [410]:
padded_greeting.lstrip('X').rstrip('X')

'Hello World!'

The associated ```str``` methods ```removeprefix``` and ```removesuffix``` are more precise and will only remove an exact ```prefix``` or ```suffix```, at most once:

In [411]:
padded_greeting.removeprefix?

Signature: padded_greeting.removeprefix(prefix, /)
Docstring:
Return a str with the given prefix string removed if present.

If the string starts with the prefix string, return string[len(prefix):].
Otherwise, return a copy of the original string.
Type:      builtin_function_or_method

In [412]:
padded_greeting.removesuffix?

Signature: padded_greeting.removesuffix(suffix, /)
Docstring:
Return a str with the given suffix string removed if present.

If the string ends with the suffix string and that suffix is not empty,
return string[:-len(suffix)]. Otherwise, return a copy of the original
string.
Type:      builtin_function_or_method

In [413]:
padded_greeting

'XXXXHello World!XXXX'

In [414]:
padded_greeting.removeprefix('X')

'XXXHello World!XXXX'

Earlier the ordinal value of the string ```'3'``` was examined. The prefix ```'0b'``` of its binary representation can be removed using ```removeprefix```:

In [415]:
string_3 = bin(ord('3'))

In [416]:
string_3

'0b110011'

In [417]:
string_3 = bin(ord('3')).removeprefix('0b')

In [418]:
string_3

'110011'
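The distinction between stripping characters and removing a prefix matters when the digits that follow the prefix are themselves in the set of characters to strip. A short sketch with an assumed binary ```str``` instance that has leading zeros:

```python
binary = '0b0011'

stripped = binary.lstrip('0b')       # strips every leading '0' or 'b'
removed = binary.removeprefix('0b')  # removes the exact prefix '0b' once

print(stripped, removed)
```

```lstrip``` also discards the two leading zeros of the number, whereas ```removeprefix``` preserves them.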

There is also the zero fill string method ```zfill``` which is used to zero fill a string and is mainly intended for ```str``` instances of numeric values:

In [419]:
string_3.zfill?

Signature: string_3.zfill(width, /)
Docstring:
Pad a numeric string with zeros on the left, to fill a field of the given width.

The string is never truncated.
Type:      builtin_function_or_method

Since this binary number represents a byte, which has ```8``` bits, the width can be set to ```8```:

In [420]:
string_3.zfill(8)

'00110011'
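The same result can be reached in a single step with a format specifier, which avoids creating and then removing the ```'0b'``` prefix. A minimal sketch using the builtin ```format``` function; the names ```code_point``` and ```bits``` are chosen for illustration:

```python
code_point = ord('3')

# 'b' formats the int in binary without the '0b' prefix,
# '08' zero fills the result to a width of 8 characters
bits = format(code_point, '08b')
print(bits)
```
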

## Binary Operators

```__add__``` is a binary datamodel method used to concatenate two ```str``` instances:

In [421]:
greeting.__add__?

Signature:      greeting.__add__(value, /)
Call signature: greeting.__add__(*args, **kwargs)
Type:           method-wrapper
String form:    <method-wrapper '__add__' of str object at 0x000001E0A216D7B0>
Docstring:      Return self+value.

In [422]:
'hello' + 'world'

'helloworld'

In [423]:
'hello' + ' ' + 'world'

'hello world'

```__mul__``` is a binary datamodel method used to replicate the characters in a ```str``` instance using an ```int``` instance:

In [424]:
greeting.__mul__?

Signature:      greeting.__mul__(value, /)
Call signature: greeting.__mul__(*args, **kwargs)
Type:           method-wrapper
String form:    <method-wrapper '__mul__' of str object at 0x000001E0A216D7B0>
Docstring:      Return self*value.

In [425]:
greeting * 3

'Hello World!Hello World!Hello World!'

The reverse multiplication datamodel method is also defined:

In [426]:
greeting.__rmul__?

Signature:      greeting.__rmul__(value, /)
Call signature: greeting.__rmul__(*args, **kwargs)
Type:           method-wrapper
String form:    <method-wrapper '__rmul__' of str object at 0x000001E0A216D7B0>
Docstring:      Return value*self.

This makes multiplication of a ```str``` instance and an ```int``` instance commutative around the ```*``` operator:

In [427]:
3 * greeting

'Hello World!Hello World!Hello World!'

Binary operators are frequently used with assignment:

In [428]:
variables(['greeting',], show_id=True)

| Instance Name | Type | Size/Shape | Value | ID |
|---|---|---|---|---|
| greeting | str | 12 | Hello World! | 2064303708080 |


Recall that the operation on the right of the assignment operator is carried out first using the original instance. The ```return``` value of the operation is then assigned to the instance name, which now refers to a new instance with a different ID:

In [429]:
greeting = greeting + ' world!'

In [430]:
variables(['greeting',], show_id=True)

| Instance Name | Type | Size/Shape | Value | ID |
|---|---|---|---|---|
| greeting | str | 19 | Hello World! world! | 2064304980336 |


A binary operator, for example addition ```+```, can be combined with the assignment operator ```=``` resulting in the "inplace" addition operator ```+=```. Because the ```str``` instance is immutable, the operation is not carried out in place but is equivalent to the two separate operations shown above, concatenation followed by reassignment:

In [431]:
greeting += ' world!'

In [432]:
variables(['greeting',], show_id=True)

| Instance Name | Type | Size/Shape | Value | ID |
|---|---|---|---|---|
| greeting | str | 26 | Hello World! world! world! | 2064296437168 |
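
A consequence of this immutability is that a second instance name bound to the original ```str``` instance is unaffected by ```+=```. This is a small sketch; the names ```original``` and ```alias``` are chosen for illustration:

```python
original = 'Hello'
alias = original        # both names refer to the same str instance

original += ' World!'   # builds a new str and rebinds the name original

print(original)
print(alias)
```

Only ```original``` refers to the concatenated ```str``` instance; ```alias``` still refers to ```'Hello'```.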


## Splitting and Joining Strings

A number of ```str``` methods are available for splitting and joining ```str``` instances. These generally return, or operate on, a Python collection such as a ```tuple``` of ```str``` instances or a ```list``` of ```str``` instances.

For example, the ```str``` instance method ```partition``` and its right counterpart ```rpartition``` partition a ```str``` instance into a ```tuple``` of three ```str``` instances: the substring before the separator, the separator itself and the substring after the separator. ```partition``` uses the first occurrence of the separator and ```rpartition``` uses the last. To make this more obvious, the following ```str``` instance will be instantiated:

In [433]:
greeting = 'hello|world|!'

In [434]:
greeting.partition?

Signature: greeting.partition(sep, /)
Docstring:
Partition the string into three parts using the given separator.

This will search for the separator in the string.  If the separator is found,
returns a 3-tuple containing the part before the separator, the separator
itself, and the part after it.

If the separator is not found, returns a 3-tuple containing the original string
and two empty strings.
Type:      builtin_function_or_method

In [435]:
greeting.partition('|')

('hello', '|', 'world|!')

In [436]:
greeting.rpartition?

Signature: greeting.rpartition(sep, /)
Docstring:
Partition the string into three parts using the given separator.

This will search for the separator in the string, starting at the end. If
the separator is found, returns a 3-tuple containing the part before the
separator, the separator itself, and the part after it.

If the separator is not found, returns a 3-tuple containing two empty strings
and the original string.
Type:      builtin_function_or_method

In [437]:
greeting.rpartition('|')

('hello|world', '|', '!')
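Because ```partition``` always returns exactly three elements, its result can be unpacked directly, which is convenient for key and value pairs. A minimal sketch with an assumed ```str``` instance named ```setting```:

```python
setting = 'name=hello=world'

# Only the first '=' is used as the separator
key, sep, value = setting.partition('=')
print(key, value)

# When the separator is absent the second and third elements are empty
missing = 'no separator here'.partition('=')
print(missing)
```
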

More generally the ```str``` instance methods ```split``` and ```join``` can be used to split a ```str``` instance into a ```list``` of ```str``` instances or join a ```list``` of ```str``` instances up into a single ```str``` instance. For example if the following sentence is created:

In [438]:
sentence = 'the fat black cat sat on the mat!'

The ```str``` instance method ```split``` can be examined:

In [439]:
sentence.split?

Signature: sentence.split(sep=None, maxsplit=-1)
Docstring:
Return a list of the substrings in the string, using sep as the separator string.

  sep
    The separator used to split the string.

    When set to None (the default value), will split on any whitespace
    character (including \n \r \t \f and spaces) and will discard
    empty strings from the result.
  maxsplit
    Maximum number of splits (starting from the left).
    -1 (the default value) means no limit.

Note, str.split() is mainly useful for data that has been intentionally
delimited.  With natural text that includes punctuation, consider using
the regular expression module.
Type:      builtin_function_or_method

Since the words are separated by whitespace, both input arguments can be left at their default values. This gives a ```list``` of ```str``` instances:

In [440]:
words = sentence.split()

In [441]:
variables(['sentence', 'words'], show_id=True)

| Instance Name | Type | Size/Shape | Value | ID |
|---|---|---|---|---|
| sentence | str | 33 | the fat black cat sat on the mat! | 2064304679920 |
| words | list | 8 | ['the', 'fat', 'black', 'cat', 'sat', 'on', 'the', 'mat!'] | 2064305096496 |


There is also the ```str``` instance method ```rsplit``` (right split). The difference is subtle: the two methods only behave differently when ```maxsplit``` is assigned a value that limits the number of splits:

In [442]:
sentence.rsplit?

Signature: sentence.rsplit(sep=None, maxsplit=-1)
Docstring:
Return a list of the substrings in the string, using sep as the separator string.

  sep
    The separator used to split the string.

    When set to None (the default value), will split on any whitespace
    character (including \n \r \t \f and spaces) and will discard
    empty strings from the result.
  maxsplit
    Maximum number of splits (starting from the left).
    -1 (the default value) means no limit.

Splitting starts at the end of the string and works to the front.
Type:      builtin_function_or_method

In [443]:
words_r = sentence.rsplit()

In [444]:
variables(['sentence', 'words', 'words_r'], show_id=True)

| Instance Name | Type | Size/Shape | Value | ID |
|---|---|---|---|---|
| sentence | str | 33 | the fat black cat sat on the mat! | 2064304679920 |
| words | list | 8 | ['the', 'fat', 'black', 'cat', 'sat', 'on', 'the', 'mat!'] | 2064305097168 |
| words_r | list | 8 | ['the', 'fat', 'black', 'cat', 'sat', 'on', 'the', 'mat!'] | 2064305098960 |


The difference can be seen when ```maxsplit``` is used:

In [445]:
words = sentence.split(' ', maxsplit=3)

In [446]:
words_r = sentence.rsplit(' ', maxsplit=3)

In [447]:
variables(['sentence', 'words', 'words_r'], show_id=True)

| Instance Name | Type | Size/Shape | Value | ID |
|---|---|---|---|---|
| sentence | str | 33 | the fat black cat sat on the mat! | 2064304679920 |
| words | list | 4 | ['the', 'fat', 'black', 'cat sat on the mat!'] | 2064305097392 |
| words_r | list | 4 | ['the fat black cat sat', 'on', 'the', 'mat!'] | 2064305161872 |
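
Two details of the cells above are easy to miss: passing ```' '``` explicitly is not the same as leaving ```sep``` at ```None```, and ```rsplit``` with ```maxsplit=1``` is a common way to peel off the last field. The sketch below illustrates both, using hypothetical values (```spaced``` and ```filename```) that are not part of this notebook:

```python
# Hypothetical input: two spaces, then a tab.
spaced = 'the  fat\tcat'

default_split = spaced.split()      # any run of whitespace is one separator
explicit_split = spaced.split(' ')  # every single space is a separator

print(default_split)   # ['the', 'fat', 'cat']
print(explicit_split)  # ['the', '', 'fat\tcat']

# Hypothetical file name with two dots.
filename = 'archive.tar.gz'

# split counts splits from the left, rsplit counts them from the right.
name, rest = filename.split('.', maxsplit=1)
stem, extension = filename.rsplit('.', maxsplit=1)

print(name, rest)       # archive tar.gz
print(stem, extension)  # archive.tar gz
```

With an explicit separator, empty ```str``` instances are kept in the result, which is why ```explicit_split``` contains ```''``` between the two consecutive spaces.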


To join the words, the ```str``` method ```join``` can be called from a delimiter ```str``` instance:

In [448]:
delimiter = ' '

In [449]:
delimiter.join?

Signature: delimiter.join(iterable, /)
Docstring:
Concatenate any number of strings.

The string whose method is called is inserted in between each given string.
The result is returned as a new string.

Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
Type:      builtin_function_or_method

In [450]:
variables(show_id=True).loc[['delimiter', 'words']]

| Instance Name | Type | Size/Shape | Value | ID |
|---|---|---|---|---|
| delimiter | str | 1 |   | 140727149724840 |
| words | list | 4 | ['the', 'fat', 'black', 'cat sat on the mat!'] | 2064305101312 |


In [451]:
delimiter.join(words)

'the fat black cat sat on the mat!'

```join``` is typically called directly from a ```str``` literal, such as a space or a pipe:

In [452]:
' '.join(words)

'the fat black cat sat on the mat!'

In [453]:
'|'.join(words)

'the|fat|black|cat sat on the mat!'

If a multiline ```str``` instance is created:

In [454]:
paragraph = """The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog"""

In [455]:
paragraph

'The quick brown fox jumps over the lazy dog\nThe quick brown fox jumps over the lazy dog\nThe quick brown fox jumps over the lazy dog\nThe quick brown fox jumps over the lazy dog'

There is an associated ```str``` method ```splitlines```, which splits the ```str``` into a ```list``` at each line boundary. It has an input argument ```keepends``` which defaults to ```False```, so the newline characters are excluded from the result:

In [456]:
paragraph.splitlines?

Signature: paragraph.splitlines(keepends=False)
Docstring:
Return a list of the lines in the string, breaking at line boundaries.

Line breaks are not included in the resulting list unless keepends is given and
true.
Type:      builtin_function_or_method

In [457]:
paragraph.splitlines()

['The quick brown fox jumps over the lazy dog',
 'The quick brown fox jumps over the lazy dog',
 'The quick brown fox jumps over the lazy dog',
 'The quick brown fox jumps over the lazy dog']
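
The effect of ```keepends```, and the difference from ```split('\n')```, can be sketched with a short hypothetical ```text``` value that mixes Unix and Windows line endings and ends with a newline:

```python
# Hypothetical text: '\n' and '\r\n' line endings plus a trailing newline.
text = 'first line\nsecond line\r\nthird line\n'

without_ends = text.splitlines()
with_ends = text.splitlines(keepends=True)
naive = text.split('\n')

print(without_ends)
# ['first line', 'second line', 'third line']
print(with_ends)
# ['first line\n', 'second line\r\n', 'third line\n']
print(naive)
# ['first line', 'second line\r', 'third line', '']
```

```splitlines``` recognises both line endings and does not produce a trailing empty ```str```, whereas ```split('\n')``` leaves the ```'\r'``` in place and appends ```''``` for the final newline.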

If the multiline string is created with tabs:

In [458]:
paragraph = """\tThe quick brown fox jumps over the lazy dog
\tThe quick brown fox jumps over the lazy dog
\tThe quick brown fox jumps over the lazy dog
\tThe quick brown fox jumps over the lazy dog"""

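Each tab can be expanded into spaces using the ```str``` method ```expandtabs```. The input argument ```tabsize``` sets the spacing of the tab stops, so a tab at the start of a line is replaced by ```tabsize``` spaces: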

In [459]:
paragraph.expandtabs?

Signature: paragraph.expandtabs(tabsize=8)
Docstring:
Return a copy where all tab characters are expanded using spaces.

If tabsize is not given, a tab size of 8 characters is assumed.
Type:      builtin_function_or_method

In [460]:
paragraph.expandtabs(4)

'    The quick brown fox jumps over the lazy dog\n    The quick brown fox jumps over the lazy dog\n    The quick brown fox jumps over the lazy dog\n    The quick brown fox jumps over the lazy dog'

In [461]:
print(paragraph)

	The quick brown fox jumps over the lazy dog
	The quick brown fox jumps over the lazy dog
	The quick brown fox jumps over the lazy dog
	The quick brown fox jumps over the lazy dog


In [462]:
print(paragraph.expandtabs(4))

    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog
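
In the output above every tab sits at the start of a line, so each one becomes exactly four spaces. A tab in the middle of a line is padded only as far as the next tab stop. The following sketch uses a hypothetical ```row``` value to show this:

```python
# Hypothetical row with tabs between fields of different widths.
row = 'a\tbb\tccc'

expanded = row.expandtabs(4)

# 'a' ends at column 1, so 3 spaces reach the tab stop at column 4.
# 'bb' ends at column 6, so 2 spaces reach the tab stop at column 8.
print(repr(expanded))  # 'a   bb  ccc'
print(len(expanded))   # 11
```

The number of spaces inserted therefore depends on the column at which each tab occurs, not only on ```tabsize```.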


## Bytes Related Identifiers

The ```bytes``` class is another text based class. Instead of having the fundamental unit of a Unicode character, it has the fundamental unit of a byte.

The ```str``` instance method ```encode``` encodes the ```str``` to a ```bytes``` instance. A ```str``` instance is a sequence of Unicode characters, and a translation table (encoding) is needed to map each character to one or more bytes. The default encoding is ```'utf-8'``` but another one can be specified:

In [463]:
greeting.encode?

Signature: greeting.encode(encoding='utf-8', errors='strict')
Docstring:
Encode the string using the codec registered for encoding.

encoding
  The encoding in which to encode the string.
errors
  The error handling scheme to use for encoding errors.
  The default is 'strict' meaning that encoding errors raise a
  UnicodeEncodeError.  Other possible values are 'ignore', 'replace' and
  'xmlcharrefreplace' as well as any other name registered with
  codecs.register_error that can handle UnicodeEncodeErrors.
Type:      builtin_function_or_method

Since each ASCII character is encoded as a single byte, the ASCII character is used to represent its corresponding byte and therefore the two instances look similar:

In [464]:
greeting.encode()

b'hello|world|!'

Recall ASCII characters are encoded over the values ```0:128```, which is the first half of the 256 values available in a byte. Legacy translation tables use the second half of the byte for additional characters. The ```£``` sign, for example, is not an ASCII character. In ```'latin1'``` it is encoded as a single byte:

In [465]:
'£'.encode(encoding='latin1')

b'\xa3'

In [466]:
0xa3

163
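
The ```errors``` argument listed in the ```encode``` docstring becomes relevant as soon as a character falls outside the chosen translation table. The following is a minimal sketch using a hypothetical ```price``` value and the ```'ascii'``` encoding, which cannot represent the ```£``` sign:

```python
# Hypothetical value containing one non-ASCII character.
price = 'cost: £5'

try:
    price.encode(encoding='ascii')  # errors='strict' is the default
except UnicodeEncodeError as error:
    strict_error = type(error).__name__
    print(strict_error)  # UnicodeEncodeError

ignored = price.encode(encoding='ascii', errors='ignore')
replaced = price.encode(encoding='ascii', errors='replace')
xml_ref = price.encode(encoding='ascii', errors='xmlcharrefreplace')

print(ignored)   # b'cost: 5'
print(replaced)  # b'cost: ?5'
print(xml_ref)   # b'cost: &#163;5'
```

The ```163``` in the XML character reference is the same value as ```0xa3``` above, because ```'latin1'``` maps each of its characters to the byte with the same number as the Unicode code point.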

In ```'utf-16'``` each of the characters shown here spans 2 bytes (characters outside the Basic Multilingual Plane take 4 bytes). There are variations of ```utf-16``` depending on the byte order. The byte order, or endianness, can be conceptualised by encoding the number twelve (in decimal) as 12 (big endian) or 21 (little endian). 

Humans normally write numbers using big endian but Intel processors work using little endian. Because both byte orders were in use when ```utf-16``` was introduced, there are 2 variations, ```'utf-16-be'``` and ```'utf-16-le'```. A third variant, common on Microsoft Windows, is little endian with a 2 byte BOM prefix. The BOM is a byte order mark used to quickly identify the byte order; in the output below it is ```b'\xff\xfe'```, which indicates little endian:

In [467]:
'£'.encode(encoding='utf-16-be')

b'\x00\xa3'

In [468]:
'£'.encode(encoding='utf-16-le')

b'\xa3\x00'

In [469]:
'£'.encode(encoding='utf-16')

b'\xff\xfe\xa3\x00'

In [470]:
'££'.encode(encoding='utf-16')

b'\xff\xfe\xa3\x00\xa3\x00'
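
The BOM prefix can be compared against the constants in the standard library ```codecs``` module. The sketch below assumes only the standard library; the byte order written by the plain ```'utf-16'``` codec follows the machine the code runs on, so the output of the first two ```print``` calls depends on the platform and is not shown:

```python
import codecs
import sys

encoded = '£'.encode(encoding='utf-16')
bom = encoded[:2]      # the first 2 bytes are the byte order mark
payload = encoded[2:]  # the remaining bytes are the character itself

print(sys.byteorder)   # 'little' on Intel processors
print(bom)

# codecs.BOM_UTF16 is the BOM for the native byte order of this machine.
native_bom = bom == codecs.BOM_UTF16
print(native_bom)  # True

# Decoding with 'utf-16' reads the BOM and removes it again.
decoded = encoded.decode(encoding='utf-16')
print(decoded == '£')  # True

# The explicit variants neither write nor expect a BOM.
from_be = b'\x00\xa3'.decode(encoding='utf-16-be')
from_le = b'\xa3\x00'.decode(encoding='utf-16-le')
print(from_be == from_le == '£')  # True
```

The BOM is written once at the start of the ```bytes``` instance and not once per character, which matches the output for ```'££'``` above.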

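The current standard is ```'utf-8'```, a variable-length encoding that is backwards compatible with ASCII and uses between 1 and 4 bytes per character. It uses a different ```bytes``` combination to the previous translation tables and needs 2 bytes to encode the ```£``` sign: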

In [471]:
'£'.encode(encoding='utf-8')

b'\xc2\xa3'

The Greek letters also require 2 bytes each. Each of the characters in the ```str``` instance below, except for the spaces and the exclamation mark, is not an ASCII character and is therefore represented by two hexadecimal escape sequences:

In [472]:
greek_greeting = 'Γειά σου Κόσμε!'

In [473]:
greek_greeting.encode(encoding='utf-8')

b'\xce\x93\xce\xb5\xce\xb9\xce\xac \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xcf\x8c\xcf\x83\xce\xbc\xce\xb5!'

In [474]:
'Γ'.encode(encoding='utf-8')

b'\xce\x93'
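
The number of bytes per character in ```'utf-8'``` can be checked by taking the ```len``` of each encoded character. The characters below are hypothetical examples chosen to cover 1, 2, 3 and 4 bytes; the last one is written with a ```\N{...}``` escape so that the cell does not depend on an emoji font:

```python
# One example character for each possible utf-8 width.
characters = ['a', '£', 'Γ', '€', '\N{SNAKE}']

byte_counts = {}
for character in characters:
    encoded = character.encode(encoding='utf-8')
    byte_counts[character] = len(encoded)
    print(len(character), len(encoded), encoded)

# 1 1 b'a'
# 1 2 b'\xc2\xa3'
# 1 2 b'\xce\x93'
# 1 3 b'\xe2\x82\xac'
# 1 4 b'\xf0\x9f\x90\x8d'
```

Each item is a single character in the ```str``` instance, so ```len``` on the ```str``` gives ```1``` every time, while ```len``` on the ```bytes``` instance counts bytes.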

The ```bytes``` class and the concept of encoding will be covered in more detail in the next notebook.

[Return to Python Tutorials](../readme.md)