# Text Sequence Type — str

- **(Main Source)** Python Docs: [Text Sequence Type — str](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)
    - [String methods](https://docs.python.org/3/library/stdtypes.html#string-methods)
    - [Text Processing Services](https://docs.python.org/3/library/text.html#textservices)
    - [Common string operations](https://docs.python.org/3/library/string.html)

## Character Encoding

**Character encoding** is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers.

- **ASCII** codes represent text in digital devices
- ASCII has just `128` code points, of which only `95` are printable characters (English-only)
- The set of available punctuation had a significant impact on the syntax of computer languages and text markup
- **ANSI** code pages (extended ASCII) contain further characters from `128` to `255`, which differ based on language
- **Unicode** has over a **million code points**, but the first `128` of these are the same as ASCII
    - Languages like Arabic are included in the Unicode code points

Figure below: if we try to save Arabic text (Unicode) using an encoding that cannot represent it, such as ASCII or ANSI, we get a warning, and the unsupported characters are lost if we continue.

<img src="../assets/save_ascii_arabic.png" title="Save Popup in Notepad" width="400" />

The warning reads: "This file contains characters in Unicode format which will be lost if you save this file as an ANSI encoded text file. To keep the Unicode information, click Cancel below and then select one of the Unicode options from the Encoding drop down list. Continue?"
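The same situation can be reproduced in Python with `str.encode`. This is a minimal sketch: the sample strings and variable names are illustrative, and the Arabic letters are written with `\u` escapes so that the example does not depend on how the file itself is saved.

```python
# Encoding converts a str (Unicode code points) into bytes.
english = "Salam"
arabic = "\u0623\u0628\u064a"  # three Arabic letters: أ, ب, ي

# ASCII can represent the English text with one byte per character.
ascii_bytes = english.encode("ascii")
print(ascii_bytes)

# UTF-8 can represent the Arabic letters, using two bytes for each of them.
utf8_bytes = arabic.encode("utf-8")
print(len(arabic), len(utf8_bytes))

# ASCII cannot represent Arabic, so Python raises an error instead of losing data.
try:
    arabic.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)

# errors="replace" behaves like a lossy save: each unsupported character becomes "?".
lossy = arabic.encode("ascii", errors="replace")
print(lossy)
```

Unlike the Notepad dialog, Python does not silently drop characters by default; the lossy behavior has to be requested with the `errors` argument.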

In [172]:
# ASCII (English-only) characters are represented by numbers between 0 and 127
print(ord("A"), ord("Z"))
print(ord("a"), ord("z"))

65 90
97 122


In [173]:
# The Arabic Unicode block spans code points 1536 to 1791 (U+0600 to U+06FF)
print(ord("أ"), ord("ب"), ord("ي"))

1571 1576 1610


See [Wikipedia: Arabic script in Unicode](https://en.wikipedia.org/wiki/Arabic_script_in_Unicode) for details.
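`ord` has an inverse, `chr`, which maps a code point back to its character. The short sketch below (variable names are illustrative) round-trips a letter and checks that it falls inside the Arabic block mentioned above.

```python
# chr() is the inverse of ord(): code point -> character
print(chr(65), chr(97))   # A a

letter = chr(1576)        # the Arabic letter ب
code_point = ord(letter)
print(letter, code_point, hex(code_point))

# Converting back and forth gives the original character.
round_trip = chr(ord(letter)) == letter
print(round_trip)

# Check that the letter lies in the Arabic block (U+0600 to U+06FF).
in_arabic_block = 0x0600 <= code_point <= 0x06FF
print(in_arabic_block)
```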

In [164]:
import string

string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [None]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

### `string.printable`

String of ASCII characters which are considered printable. This is a combination of [`digits`](https://docs.python.org/3/library/string.html#string.digits), [`ascii_letters`](https://docs.python.org/3/library/string.html#string.ascii_letters), [`punctuation`](https://docs.python.org/3/library/string.html#string.punctuation), and [`whitespace`](https://docs.python.org/3/library/string.html#string.whitespace).

In [None]:
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
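As a quick check of the description above, the sketch below rebuilds `string.printable` from its four parts. It also counts how many of its characters `str.isprintable()` accepts, which should match the `95` printable ASCII characters mentioned earlier (the remaining five are whitespace control characters such as tab and newline).

```python
import string

# Rebuild string.printable from the four constants it is made of.
parts = string.digits + string.ascii_letters + string.punctuation + string.whitespace
same = parts == string.printable
print(same)

# 10 digits + 52 letters + 32 punctuation marks + 6 whitespace characters
total = len(string.printable)
print(total)

# str.isprintable() accepts the space but not tab, newline, and similar characters.
strictly_printable = sum(ch.isprintable() for ch in string.printable)
print(strictly_printable)
```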

### Whitespace Characters

In [167]:
string.whitespace

' \t\n\r\x0b\x0c'

In [30]:
# note that this will remove leading and trailing whitespace,
# but not whitespace in the middle of the string
text = '  hello \nthis is a \ttab test.\n\n\n\n\t'
print(text.strip())

hello 
this is a 	tab test.


In [11]:
# Tab character: "\t"
print('A\tB')

A	B


In [12]:
# Newline character: '\n'
print('A\nB')

A
B


### Carriage Return `'\r'` Character

In [21]:
import time
# progress bar using the Carriage Return "\r" character
for x in range(10):
    time.sleep(0.5)
    print(f'{x}/10' + '=' * x + '>', end='\r')




**Common use cases for strings:**

- Strings can be used to represent text, such as:
    - names
    - addresses
    - messages

- Textual data in Python is handled with [`str`](https://docs.python.org/3/library/stdtypes.html#str) objects, or *strings*.
- Strings are immutable [sequences](https://docs.python.org/3/library/stdtypes.html#typesseq) of Unicode code points.
- String literals are written in a variety of ways:
    - Single quotes: `'allows embedded "double" quotes'`
    - Double quotes: `"allows embedded 'single' quotes"`
    - Triple quoted:
        - `'''Three single quotes'''`
        - `"""Three double quotes"""`

In [44]:
first_name = "John"
last_name = 'Doe'
address = "Riyadh, Saudi Arabia"
phone = "00966555555555"

# Triple quoted strings may span multiple lines.
# All associated whitespace will be included in the string literal.
message = """Hello everyone,
I hope you are enjoying the course,

Thank you.
"""

Note: there is no separate “character” type; a single character is simply a string of length 1.
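Two of the points above can be checked directly: indexing a string gives another `str`, and strings are immutable. The variable names in this sketch are illustrative.

```python
word = "hello"

# Indexing a string produces another string of length 1, not a "char".
first = word[0]
print(type(first), len(first))

# Strings are immutable: item assignment raises a TypeError.
try:
    word[0] = "J"
    assignment_failed = False
except TypeError as error:
    assignment_failed = True
    print("TypeError:", error)

# To "change" a string, build a new one instead.
jello = "J" + word[1:]
print(word, jello)
```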

Python has great support for strings:

In [45]:
hello = 'hello' # String literals can use single quotes
world = "world" # or double quotes; it does not matter

Length of a string:

In [3]:
print(len(first_name))
print(len(address))

4
20


In [49]:
print(len([15, 20, 10]))

3


**Note**: the length of a string is the number of characters in the string, including spaces and punctuation.

In [52]:
len(address) + 3 == len(address + '\n\t ')

True

Repeating strings with the `*` operator:

In [4]:
s = "Salam " * 3
print(s)

Salam Salam Salam 


In [55]:
zeros = "0" * 8
x = int("1" + zeros)
x

100000000

In [58]:
# alternatively, use scientific notation (1e+8 is a float, so convert it with int):
x = int(1e+8)
x

100000000

#### Exercise

- find the length of the variable `phone`
- find the length of the variable `message`

### Membership Operator: `in`

The `in` operator is used to check if a value is present in a sequence (`str`, `list`, `range`, etc.).

In [6]:
vowels = "aeiou"
print("a" in vowels)

True


In [59]:
# same as above
# since both are sequences in Python
vowels = ["a", "e", "i", "o", "u"]
print("a" in vowels)

True
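One difference between the two cases is worth a small sketch: for a `str`, `in` tests for a substring, while for a `list` it tests whether an element is equal to the value. The sample values are illustrative.

```python
word = "hello"
letters = ["h", "e", "l", "l", "o"]

# For strings, `in` checks whether the left side is a substring.
substring_in_str = "ell" in word
print(substring_in_str)

# For lists, `in` checks whether any single element equals the value.
substring_in_list = "ell" in letters
print(substring_in_list)

# `not in` is the negated form.
missing = "x" not in word
print(missing)
```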


In Python, strings are **objects**.

- Objects have **attributes** that can be **accessed** with the `.` operator.
- Objects have **methods** that can be **called** using the `.` operator and parentheses `()`.
- ... more on objects later.

In [7]:
"hello".upper()

'HELLO'

In [63]:
"HeLLO".lower()

'hello'

In [64]:
name = "john doe"

In [65]:
# Capitalize 
print(name.capitalize()) 

John doe


In [66]:
# Title case
print(name.title())

John Doe


In [72]:
help(str.title)
# str.title? # in Jupyter

Help on method_descriptor:

title(self, /)
    Return a version of the string where each word is titlecased.
    
    More specifically, words start with uppercased characters and all remaining
    cased characters have lower case.



In [75]:
# Check case
print(name.islower())
print(name.isupper())

True
False


In [76]:
# Count occurrences
print(name.count('o')) 

2


In [87]:
name[5:7]

'do'

In [90]:
# Find position
print(name)
print(name.find('o')) # first occurrence

john doe
1
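A few related lookups, sketched with the same `name` value: `rfind` searches from the right, `find` accepts an optional start index, and a missing substring gives `-1` with `find` but raises `ValueError` with `index`.

```python
name = "john doe"

first_o = name.find("o")        # first occurrence
last_o = name.rfind("o")        # last occurrence
after_two = name.find("o", 2)   # start searching at index 2
not_found = name.find("z")      # -1 means "not found"
print(first_o, last_o, after_two, not_found)

# index() works like find(), but raises an error when nothing is found.
try:
    name.index("z")
    index_failed = False
except ValueError:
    index_failed = True
print(index_failed)
```

Because `-1` is also a valid index (the last character), the result of `find` should be compared with `-1` before it is used for indexing.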


In [92]:
# Replace
print(name.replace('o', 'w'))

jwhn dwe


In [98]:
# 3rd argument is `count`
# Maximum number of occurrences to replace.
# -1 (the default value) means replace all occurrences
print(name.replace('o', 'w', 1))

jwhn doe


In [108]:
string.whitespace

' \t\n\r\x0b\x0c'

In [109]:
# Strip leading and trailing whitespace
text = f"\t  {name}  \n"
print(text.strip())

john doe


In [114]:
# Split 
print("Hello, world".split()) # default separator is any run of whitespace

['Hello,', 'world']


In [112]:
print("Hello, world".split("l"))

['He', '', 'o, wor', 'd']


In [118]:
# `maxsplit` argument: Maximum number of splits to do.
# -1 (the default value) means no limit.
print("Hello, world".split("l", 1)) # 1 means split only once

['He', 'lo, world']
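Calling `split()` with no argument is not the same as `split(" ")`. The sketch below, using an illustrative string with extra spaces, shows that the default collapses runs of whitespace and drops empty strings, while an explicit separator does not.

```python
text = "  Hello,   world \n"

# No argument: split on any run of whitespace and ignore leading/trailing whitespace.
default_split = text.split()
print(default_split)

# Explicit " ": every single space is a separator, so empty strings appear.
space_split = text.split(" ")
print(space_split)
```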


See: [Splitlines](https://docs.python.org/3/library/stdtypes.html#str.splitlines)

In [120]:
# multi-line string
text = '''
Hello
World

How are you?
'''

In [123]:
text # <-- displays the string as is (showing whitespace characters)

'\nHello\nWorld\n\nHow are you?\n'

In [132]:
text.splitlines()

['', 'Hello', 'World', '', 'How are you?']

In [144]:
# Join
names = ["Adam", "Belal", "Camal"]
separator = ','
print(separator.join(names))

Adam,Belal,Camal


In [148]:
' + '.join(names)

'Adam + Belal + Camal'
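`join` only accepts strings, so other values need to be converted first. This is a small sketch with an illustrative list of numbers; it also shows that `split` can undo a `join` when the separator does not occur in the items.

```python
scores = [90, 85, 77]

# join() raises a TypeError if any item is not a string.
try:
    ", ".join(scores)
    join_failed = False
except TypeError:
    join_failed = True
print(join_failed)

# Convert each item with str() before joining.
joined = ", ".join(str(score) for score in scores)
print(joined)

# split() with the same separator recovers the pieces.
back = [int(part) for part in joined.split(", ")]
print(back)
```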

In [19]:
# String formatting
print(f"Name: {name}")

Name: john doe


In [20]:
# Alignment
print(name.ljust(15)) 
print(name.center(15))

john doe       
    john doe   


In [153]:
# Padding  
num = 123
print(f'{num:05}') # pad with zeros
print(format(num, "05d"))

00123
00123
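The same padding can also be done with string methods. The sketch below compares `zfill`, `rjust`, and the format specification, and shows where they differ for negative numbers; the values are illustrative.

```python
num = 123

# For positive numbers, all of these give the same result.
zfilled = str(num).zfill(5)
rjusted = str(num).rjust(5, "0")
print(zfilled, rjusted)

# For negative numbers, zfill() and the format spec keep the sign in front...
negative = -42
neg_zfilled = str(negative).zfill(5)
neg_formatted = f"{negative:05}"
print(neg_zfilled, neg_formatted)

# ...while rjust() simply pads on the left, which puts zeros before the sign.
neg_rjusted = str(negative).rjust(5, "0")
print(neg_rjusted)
```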


In [22]:
# Case conversion
print(name.upper())
print(name.lower())

JOHN DOE
john doe


In [23]:
# Check start/end
print(name.startswith('j'))
print(name.endswith('e'))

True
True


A fun way to decorate a string using the `center` method:

In [24]:
name = 'John'
width = 20
decorator = '*'

print(decorator * width)
print(name.center(width, decorator))
print(decorator * width)

********************
********John********
********************


#### Exercise

- Change the above code to print your `name` in all uppercase
- Change the `width`
- Change the `decorator` to some other character like `#`

## Indexing and Slicing

- A string is a sequence of characters
- Sequences can be indexed using `[]`
    - 1st element is at index `0`
    - 2nd element is at index `1`
    - last element is at index `-1`

<img src="../assets/pythonista.png">

In [155]:
title = "Pythonista"

In [156]:
print(title[2:]) # thonista

thonista


In [157]:
print(title[0])  # P
print(title[-1]) # a
print(title[9])  # a
# title[10] # IndexError: string index out of range

P
a
a


In [161]:
# TODO: weird case that needs to be understood
# print(title[None:-4:-1])
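One way to reason about this case, sketched below: `None` means "use the default", and with a negative step the default start is the last index. `slice.indices` shows the concrete `(start, stop, step)` values Python uses for a sequence of a given length.

```python
title = "Pythonista"

# None as start is the same as leaving the start out.
with_none = title[None:-4:-1]
omitted = title[:-4:-1]
print(with_none, omitted)

# Ask Python which concrete indices the slice resolves to for this length.
resolved = slice(None, -4, -1).indices(len(title))
print(resolved)

# Start at index 9 ('a'), step backwards, and stop before index 6 ('i').
explicit = title[9:6:-1]
print(explicit)
```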

### Slicing

Sequences can also be sliced using `[start:end]`

<img src="../assets/pythonista.png">

In [180]:
title[0] == title[0:1]

True

In [None]:
print(title[2:5])
print(title[:5])
print(title[-4:])
print(title[-4:None])

tho
Pytho
ista
ista
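A few properties of `[start:end]` slices, as a minimal sketch with the same `title` value: the `end` index is excluded, out-of-range bounds are clipped instead of raising `IndexError`, and splitting at any position gives two pieces that add back up to the original.

```python
title = "Pythonista"

# The end index is excluded, so the length of the slice is end - start.
middle = title[2:5]
print(middle, len(middle))

# Out-of-range bounds are clipped rather than raising IndexError.
clipped = title[2:100]
beyond = title[20:]
print(clipped, repr(beyond))

# A start that is not before the end gives an empty string.
empty = title[5:2]
print(repr(empty))

# Splitting at any index and concatenating restores the string.
restored = title[:4] + title[4:]
print(restored == title)
```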


#### Exercise

Ex: Given that `name = "Johnson"`, what is the value of `name[0]`? `name[1]`? `name[-1]`? `name[-2]`?

In [None]:
# try it

#### Exercise

Ex: try `name[1:3]` and `name[3:5]`

In [31]:
# try it

We can also add a step to slicing `[start:end:step]`

In [32]:
s = "ABCDEF"

In [33]:
s[0::1]

'ABCDEF'

We can also omit the `start` or `end` of the slice, which would implicitly mean the beginning or end of the string:

In [34]:
s[0::2]

'ACE'

In [35]:
s[-1:0:-1]

'FEDCB'

In [36]:
s[-1:0:-2]

'FDB'
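Note that `s[-1:0:-1]` stops before index `0`, so the first character is missing from the result. The sketch below shows that the stop has to be omitted to reach index `0` when stepping backwards, because a stop of `-1` refers to the last character.

```python
s = "ABCDEF"

# The stop index is excluded, so index 0 ('A') is not part of the result.
without_first = s[-1:0:-1]
print(without_first)

# Omitting the stop walks all the way to the beginning.
full_reverse = s[-1::-1]
print(full_reverse, full_reverse == s[::-1])

# A stop of -1 means "the last character", so this slice is empty.
empty = s[-1:-1:-1]
print(repr(empty))
```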


Ex: run and try to understand the following code

```python
name = "Johnson"
print(name[::2])
print(name[::-1])
print(name[1:5:2])
```

In [37]:
# try it


Ex: For each of the following, specify the `start`, `end`, and `step`:

- `name[::2]`
- `name[::-1]`
- `name[1:5:2]`

Answer: ...

#### Exercise


Ex: Write a program that takes a string and prints the string in reverse.


In [38]:
# try it


Ex: Write a program that takes a string and prints every other character in the string. Example: `abcdef` -> `bdf`


In [39]:
# try it

Ex: Write a program that takes a string and prints it in reverse order, keeping only every other character and converting the result to uppercase. Example: `abcdef` -> `ECA`

In [40]:
# try it

Ex: Count the number of `o` in the string `hello world`. Hint: use the `.count()` method.

In [41]:
# try it

## String formatting

**(Main)** docs reference: [`printf`-style String Formatting](https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting)

Three common ways to build a string from several values in Python:

1. Joining individual strings with + operator
2. `format` string method
3. f-strings

In [175]:
name = "John"
age = 30
# 1. Using the + operator (non-string values must be converted with str())
print("My name is " + name + " and my age is " + str(age))
# 2. Using the format method
print("My name is {} and my age is {}".format(name, age))
# 3. Using an f-string
print(f"My name is {name} and my age is {age}")

My name is John and my age is 30
My name is John and my age is 30
My name is John and my age is 30
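For completeness, here is a sketch of the `printf`-style `%` operator from the docs reference above, along with two variations on the other approaches: named fields in `format` and an expression inside an f-string. The field names `n` and `a` are illustrative.

```python
name = "John"
age = 30

# printf-style formatting: %s for strings, %d for integers.
printf_style = "My name is %s and my age is %d" % (name, age)
print(printf_style)

# format() also accepts named fields.
named = "My name is {n} and my age is {a}".format(n=name, a=age)
print(named)

# f-strings can contain expressions, not only variable names.
next_year = f"Next year I will be {age + 1}"
print(next_year)
```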


#### Exercise

- concatenate the strings `first_name` and `last_name` using the + operator
- concatenate the strings `first_name` and `last_name` using the `format` method
- concatenate the strings `first_name` and `last_name` using f-strings

In [43]:
# try it

Ex: Use f-strings to print `Hello, my name is John Doe, and I am 30 years old.` using the variables `first_name`, `last_name`, and `age`.

In [44]:
# try it

## Numbers formatting

In [45]:
# Basic number formatting
num = 10.5679
print(num) 

10.5679


In [46]:
# Limit decimal places to 2
print("{:.2f}".format(num))

10.57


Instead of using the `.format()` method, we can use an f-string to format numbers:

In [47]:
# Limit decimal places to 2
print(f"{num:.2f}")


10.57


In [35]:
# Right align
print(f"{num:>10.2f}")
print(f"{num*10:>10.2f}")
print(f"{num*100:>10.2f}")


     10.57
    105.70
   1057.00
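The same format specification supports other alignments. This sketch shows left alignment (`<`), centering with a fill character (`^`), and a width taken from a variable; the trailing `|` and the `width` value are only there to make the padding visible.

```python
num = 10.5679

# Left align in a field of 10 characters.
left = f"{num:<10.2f}|"
print(left)

# Center in a field of 11 characters, filling the padding with '*'.
centered = f"{num:*^11.2f}"
print(centered)

# The width can come from a variable by nesting braces.
width = 8
dynamic = f"{num:>{width}.2f}"
print(dynamic)
```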


In [36]:
# Add thousands separator to integers
big_num = 1000000
print(f"{big_num:,}")

1,000,000


In [50]:
# Format percent
percent = 0.235
print(f"{percent:.2%}")


23.50%


In [51]:
# Format currency (USD)
price = 19.95
print(f"${price:,.2f}")


$19.95
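With a price this small, the `,` in the format specification has no visible effect. A larger, illustrative value makes the thousands separator show up; the sketch also includes the `_` separator as an alternative.

```python
# Thousands separator and two decimal places on a larger amount.
big_price = 1234567.891
formatted_price = f"${big_price:,.2f}"
print(formatted_price)

# An underscore can be used as the separator instead of a comma.
big_num = 1000000
underscored = f"{big_num:_}"
print(underscored)
```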


In [52]:
# Scientific Notation
number = 9000
print(f"{number:.2e}")

9.00e+03
