# Working with text

So far we have worked mostly with numbers; the only time we used text was to print something on screen with `print('some text goes here')`.

The text type in Python (as in most programming languages) is called a **String** (`str` in Python).

If we want to use **Strings** in our code, we always have to enclose them in either `"` or `'` quote signs.

In [4]:
'this is text in single quotes'

'this is text in single quotes'

In [5]:
"this is text in double quotes"

'this is text in double quotes'

Does it matter which one we use? Only if we want to use that specific sign in the text itself: if you want to use `'` in your text, then use `"` to enclose your **String**, and the other way around:

In [9]:
"I don't have a car"

"I don't have a car"

In [8]:
'I have read "The Art of War" recently'

'I have read "The Art of War" recently'

What if we want to use both in the same text? We have to **escape** that sign, which means marking it as a regular quote character that belongs to the text rather than a quote that ends the string.

To escape a character we put a `\` (backslash) in front of it.

In [24]:
print("I haven\'t read \"The Art of War\" yet")
print('I haven\'t read \"The Art of War\" yet')

I haven't read "The Art of War" yet
I haven't read "The Art of War" yet
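
A small sketch to check the claim above: whichever quote style and escaping we choose, Python stores the same text. The variable names are only illustrative.

```python
# Two ways of writing the same text; only the quoting and escaping differ.
double_quoted = "I haven't read \"The Art of War\" yet"
single_quoted = 'I haven\'t read "The Art of War" yet'

# The backslashes are not part of the stored text, so both values are equal.
print(double_quoted)
print(double_quoted == single_quoted)
```

So escaping only affects how we type the string in code, not the text itself.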


Let's see what we can do with **Strings**. What kind of "arithmetic" can we do on them?

1. `+` - concatenates two strings together
1. `*` - repeats a string: `string * x` concatenates `x` copies of it

In [31]:
string_a = 'Test'
string_b = 'String'

In [32]:
string_a + string_b

'TestString'

In [34]:
string_a + ' ' + string_b

'Test String'

In [51]:
string_a * 1

'Test'

In [52]:
string_a * 3

'TestTestTest'
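
Here is a short sketch of how the two operators combine. The values are made up for illustration; the point is that `*` is applied before `+`, just like with numbers.

```python
word = 'ha'

laugh = word * 3 + '!'        # repetition first, then concatenation
separator = '-' * 10          # a handy way to build a line of dashes

mixed = 'a' + 'b' * 3         # only 'b' is repeated
grouped = ('a' + 'b') * 3     # brackets change the order, as in arithmetic
empty = word * 0              # zero copies give an empty string

print(laugh)
print(separator)
print(mixed, grouped)
```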

We can compare strings. Comparison on strings uses lexicographical ordering: the strings are compared one character at a time until the first difference is found, and the order of those two characters is decided by their Unicode values. If one string runs out first, the shorter string is considered smaller.

**Note: Unicode values for lowercase characters are higher than for uppercase ones, so for example `a` is considered larger than both `A` and `Z`. Make sure such a comparison makes sense for your use case if you are trying to decide whether one string comes alphabetically before or after another (for example when sorting).**

Testing for equality (`==`) and inequality (`!=`) is not affected by this and works as you would expect.

In [46]:
print('a' <= 'b')
print('aaaaa' > 'b')
print('Alphabet' > 'Apple')
print('a' == 'A')
print('B' < 'b')
print(string_a != string_b)

True
False
False
False
True
True
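
To see where this ordering comes from, we can look at the Unicode values directly. This sketch uses the built-in `ord()` function, which is not covered in this text; it returns the Unicode value of a single character.

```python
# ord() returns the Unicode value (code point) of a single character.
upper_a = ord('A')
upper_z = ord('Z')
lower_a = ord('a')
print(upper_a, upper_z, lower_a)

# Every uppercase letter has a lower value than every lowercase one,
# so 'Zebra' is considered smaller than 'apple'.
print('Zebra' < 'apple')

# When one string is the beginning of the other, the shorter one is smaller.
print('app' < 'apple')
```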


With strings we can introduce two new operators: `in` and `not`.

`X in Y` tests if `X` is part of `Y`. For strings it looks through the entire string `Y` and checks if string `X` is a substring of `Y`. It is case sensitive.

The `not` operator is a general negation operator that can be placed before any boolean expression to reverse its value. It can be combined with `in` as `X not in Y` to test if `X` is not in `Y`. We will get the same result if we use `not X in Y`.

The `in` operator is evaluated at the same time as the other comparison operators (so after any arithmetic operations).

The `not` operator is evaluated after the comparison operators.

In [48]:
print('T' in 'Test')
print('e' in 'Test')
print('E' in 'Test')
print('Tes' in 'Test')

True
True
False
True


In [49]:
print('T' not in 'Test')
print('e' not in 'Test')
print('E' not in 'Test')
print('Tes' not in 'Test')

False
False
True
False


In [50]:
print(not 'T' in 'Test')
print(not 'e' in 'Test')
print(not 'E' in 'Test')
print(not 'Tes' in 'Test')

False
False
True
False


In [56]:
not 'a' == 'a'

False

In [57]:
not 'a' != 'a'

True
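
The evaluation order described above can be checked with a small sketch. The variable names are only illustrative.

```python
# + is evaluated before in, so this asks: is 'Test' inside 'Test String'?
combined = 'Te' + 'st' in 'Test String'

# not is evaluated after in, so the next two lines mean the same thing.
negated = not 'E' in 'Test'
explicit = not ('E' in 'Test')

# This form gives the same result and is usually the easiest to read.
preferred = 'E' not in 'Test'

print(combined, negated, explicit, preferred)
```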

# Changing types

Very often you will find yourself trying to include a number in a string. For example, you may want to customise the answer given by your program depending on the result of some computation or input, like `"I'm X years old"`, where X is a number you want injected into that string.

Let's try a simple addition operation and see if it works:

In [59]:
result = "I'm " + 7 + " years old"

TypeError: can only concatenate str (not "int") to str

If you run the cell above you will get an error like this one:

`TypeError: can only concatenate str (not "int") to str`

We get a `TypeError`, which means that whatever is causing the error is most likely related to the types of the objects used in this expression.

Next we see additional information, `can only concatenate str (not "int") to str`, telling us that we can't use an `int` in a string concatenation operation.

So what can we do? The first option is to change the type of the object using the built-in type function corresponding to the type we want to get:

1. `int` for Integers
1. `float` for Floats
1. `str` for Strings
1. etc...

Let's try it for our example:

In [63]:
result = "I'm " + str(7) + " years old"
print(result)

I'm 7 years old


**Success**. Let's take a look at a few other examples of type casting (changing type) and its limitations.

Most importantly:

1. We can change any number to a string, but only strings representing numbers can be cast into number types (like `int` or `float`).
1. We can change an `int` into a `float` easily, but when changing a `float` to an `int` we lose the decimal part (it is cut off, not rounded).


In [72]:
str(7)


'7'

In [73]:
str(7.5)

'7.5'

In [74]:
int('7')


7

In [75]:
float('7')

7.0

In [76]:
float('7.5')

7.5

In [77]:
int('aaa')

ValueError: invalid literal for int() with base 10: 'aaa'

In [78]:
int('7.5')

ValueError: invalid literal for int() with base 10: '7.5'

In [79]:
int(7.5)

7

In [80]:
float(7)

7.0
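
A few more conversions, as a sketch of the limitations listed above. The values are made up for illustration.

```python
# int() cuts off the decimal part; it does not round.
positive = int(7.9)
negative = int(-7.9)        # cut off toward zero, so -7 and not -8

# int('7.5') fails, but we can convert in two steps: text -> float -> int.
from_text = int(float('7.5'))

# Spaces around the number are ignored when converting from text.
with_spaces = int(' 42 ')

print(positive, negative, from_text, with_spaces)
```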

Let's try it out in the wild. We have two numbers that are the lengths of the two legs of a right triangle.

In [8]:
leg_a = 10
leg_b = 20
hypotenuse = (leg_a ** 2 + leg_b ** 2) ** 0.5

Now we want to print a sentence saying that, given the lengths of those two legs (shown with decimals), the length of the hypotenuse is x (also shown with decimal places).

In [11]:
print("Given two triangle legs of length " + str(float(leg_a)) + " and " + str(float(leg_b)) + " hypotenuse length is " + str(float(hypotenuse)))

Given two triangle legs of length 10.0 and 20.0 hypotenuse length is 22.360679774997898


It works, but I think we can agree that it's a lot of work to concatenate all those strings and to remember to change the type every time. Also, `22.360679774997898` is far more precise than we need to convey the answer to our users. Luckily, we can solve both problems with string formatting.

# Proper string formatting

String formatting is a way to incorporate other objects inside the string without manually type casting into `str`. It also allows us to specify how the values should be printed - for example how many decimal places it should print.

If you google `python string formatting` you will most likely find examples showing 3 completely different syntaxes:

1. Python 2 syntax, using `%` sign -  `"My name is %s, I'm %d years old" % (name, age)`
1. Python 3 syntax, using `.format` and `{}` - `"My name is {}, I'm {} years old".format(name, age)`
1. Python 3.6+ syntax, using `f-strings` - `f"My name is {name}, I'm {age} years old"`

Unless you maintain old code written before Python 3.6, I would only use f-strings; they are (in my opinion) more readable than the older methods. In this course we will focus on the `f-string` method, but `.format` is still supported, so you may still find it when looking through code online, and there are use cases where we still have to use it. I will provide an example in the **Extra** section.

An `f-string` is just a string where we put `f` in front of the opening quote sign. It tells Python that we may want to put variables or other Python expressions inside it. We mark the locations where we want to put Python objects with `{}` curly brackets. Any Python code placed inside them is evaluated when the string is created, and the result is converted to a string and inserted in that place.

In [12]:
name = 'Marcin'
age = 40
f'My name is {name}, I\'m {age} years old'

"My name is Marcin, I'm 40 years old"

This is much shorter than concatenation, requires no casting, and makes it very clear to the reader what the string will look like.

As I mentioned earlier, we don't have to use variables; we can use any Python expression that produces a result:

In [13]:
f'My name is {"Mar" + "cin"}, I\'m {20 + 5 * 4} years old'

"My name is Marcin, I'm 40 years old"

But we have to put something inside curly brackets, otherwise:

In [14]:
f'My name is {}, I\'m {20 + 5 * 4} years old'

SyntaxError: f-string: empty expression not allowed (3286315803.py, line 1)

Let's try to format our triangle example using f-strings and fix that decimal problem we had:

In [25]:
leg_a = 10
leg_b = 20
hypotenuse = (leg_a ** 2 + leg_b ** 2) ** 0.5
print(f"Given two triangle legs of length {leg_a} and {leg_b} hypotenuse length is {hypotenuse}")

Given two triangle legs of length 10 and 20 hypotenuse length is 22.360679774997898


Wait a minute, we wanted `leg_a` and `leg_b` shown as floats with decimal places! And there is still that looong tail of decimals for the hypotenuse!

Let's fix that with format specifiers. First we decide how many decimal places we want for each number. Let's say we want 3 decimal places for the hypotenuse and 1 for the legs.

To add a format specifier we put a `:` sign inside the curly brackets after the expression, followed by the formatting options, which depend on what we want to print. For floating point precision we use `:.xf`, where `x` is the number of decimal places:

In [26]:
leg_a = 10
leg_b = 20
hypotenuse = (leg_a ** 2 + leg_b ** 2) ** 0.5
print(f"Given two triangle legs of length {leg_a} and {leg_b} hypotenuse length is {hypotenuse}")
print(f"Given two triangle legs of length {leg_a:.1f} and {leg_b:.1f} hypotenuse length is {hypotenuse:.3f}")

Given two triangle legs of length 10 and 20 hypotenuse length is 22.360679774997898
Given two triangle legs of length 10.0 and 20.0 hypotenuse length is 22.361


Notice that the decimal places were not simply cut off to `22.360`; the value was correctly rounded to `22.361`.

Here are a few more examples of string formatting tricks:

In [87]:
print(f'this fraction (0.235676) is now shown as a percentage: {0.235676:.2%}')
print(f'this float (4.735676) is now rounded to integer: {4.735676:.0f}')
print(f'this float (0.00735676) is now shown with scientific notation: {0.00735676:.2e}')
print(f'this number (2346) is now a hexadecimal number: {2346:X}')

this fraction (0.235676) is now shown as a percentage: 23.57%
this float (4.735676) is now rounded to integer: 5
this float (0.00735676) is now shown with scientific notation: 7.36e-03
this number (2346) is now a hexadecimal number: 92A
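
A few more format specifiers that are not covered above, as a sketch. The numbers are made up for illustration, and the variable names are hypothetical.

```python
population = 1234567

with_separator = f'{population:,}'          # comma as thousands separator
money = f'{1234567.891:,.2f}'               # separator and 2 decimal places
zero_padded = f'{42:05d}'                   # at least 5 characters, padded with zeros
signed = f'{3.14159:+.2f}'                  # always show the sign
binary = f'{10:b}'                          # binary representation

print(with_separator)
print(money)
print(zero_padded, signed, binary)
```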


# Indexing - accessing parts of the strings

TODO: do it

# Taking input from the user

It is quite rare to write a piece of software that only works with the values we write inside our code. Especially when working with data, we will want to bring data from outside of our program into it. Later in the course we will be reading files, accessing data from the internet and querying databases, but for now let's start small.

We want to allow our user to provide requested values during program run time. The easiest way to do it is to use `input()`.

Let's run the cell below and see what happens:

In [125]:
input('What is your name: ')

''

We were asked to input a value, with the string passed as the argument used as the prompt message. If you are NOT using Jupyter it might have looked different, but generally a prompt for input should pop up somewhere on the screen.

The value we provided was returned as output of that cell. We can store the output in a variable instead for later use:

In [127]:
user_name = input('What is your name: ')
print(f"Your name is {user_name}")

Your name is Marcin


We can ask for multiple values:

In [128]:
user_name = input('What is your name: ')
user_age = input('How old are you: ')
print(f"Your name is {user_name} and you are {user_age} years old.")

Your name is Marcin and you are 40 years old.


Note: the value we get is always a string, even if we type a number. Run the cell below and make sure to type something that is a valid number.

In [131]:
number = input('Give me a number and I will add 5 to it for you: ')
print(number + 5)

TypeError: can only concatenate str (not "int") to str

We already know that error. We have to cast the value we get into the type we want.

In [132]:
number = input('Give me a number and I will add 5 to it for you: ')
print(float(number) + 5)

10.0
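
The sketch below shows why the cast matters. To keep it runnable without typing anything, `raw_value` is a stand-in for what `input()` would return if the user typed `5`.

```python
# Stand-in for: raw_value = input('Give me a number: ')
raw_value = '5'

as_float = float(raw_value) + 5      # numeric addition with a float
as_int = int(raw_value) + 5          # numeric addition with an int

# Without a cast the value behaves like text, which can hide mistakes:
as_text = raw_value + '5'            # concatenation, not addition
repeated = raw_value * 5             # repetition, not multiplication

print(f'{raw_value} + 5 = {as_int}')
print(as_float, as_text, repeated)
```

Note that the last two lines of the calculation do not raise any error, they just give a result we probably did not want.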


TODO: provide some bigger example before excercises

We now know how to operate on numbers and strings, convert between types, format the output we print to the screen and take input from the users. We even know how to access parts of the string. Now let's take a look at how else we can interact with them.

# Functions and methods

## Functions

Since the very beginning we have been using a few **functions** without specifying what they are and why we use them in this specific way.

`print()` is a **function**, and `int()`/`str()`/`float()`/`bool()` are **functions** too (sort of).

Those are built-in **functions**; they are part of the Python language. There are a lot more of them for various purposes, but before we learn a few of them let's talk about what **functions** are and how to use them properly.

**Functions** are pieces of code designed to do a specific action, bundled into a single named package that can later be accessed using that name. `print` does a lot of magic under the hood to let us print the output of what we are doing to the screen, but we don't have to write all that low-level code every time we want to print something; we just use `print`.

All functions are called by typing their name followed by `()` brackets. For some **functions** we can pass **parameters** inside those brackets. Each function has its own set of parameters: it can have none, one, or many. Those **parameters** can be expected to be of a specific type or can accept any type.

Functions can return a value that we can store, print or do whatever we want with. Some functions, like `print`, do not return any meaningful value (technically they return the special value `None`); they only perform some task.

How do we know what kind of parameters a function accepts? How do we know what a function does? How do we know what a function returns?

If you are using a Jupyter notebook you can access the help on a **function** by typing that **function's** name and pressing `shift+tab`. It will show you that **function's** signature (the set of **parameters** it accepts) and its docstring: a text describing what it does. You can also use another **function**, `help()`, which prints the docstring of the function.


In [118]:
help(print)

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.



We can now see that `print` accepts more than just the values to be printed; it has other parameters. `sep`, which by default is a space, is the string put between the printed elements. So when we print 3 numbers they are separated by whatever `sep` value is provided:

In [119]:
print(1,2,3)
print(1,2,3, sep='-')

1 2 3
1-2-3


`end` is the character (or characters) placed at the end of the printed text. It defaults to the new line character (`\n`). Let's see what happens when we change it:

In [120]:
print(1)
print(2)
print(3)

1
2
3


In [122]:
print(1, end='*')
print(2, end='&')
print(3, end='!!!!!!')

1*2&3!!!!!!

When we replaced the `\n`, each `print` result no longer ends up on a separate line, which can sometimes be very useful.
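
To see exactly which characters `print` writes, including the newlines, we can use its `file` parameter from the help text above. This sketch sends the output to `io.StringIO`, a standard-library object not covered in this text that collects text in memory instead of showing it on screen.

```python
import io

# An in-memory "file" that collects everything printed into it.
buffer = io.StringIO()

print(1, 2, 3, sep='-', end='!\n', file=buffer)   # custom sep and end
print('done', file=buffer)                         # defaults: sep=' ', end='\n'

written = buffer.getvalue()

# Compare with the exact text we expect, newline characters included.
print(written == '1-2-3!\ndone\n')
```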

`print` does not return any value, it just prints. But another function we used is a different story: `input` returns what the user wrote in the prompt as its return value.

Let's take a look at a different function that returns a value: `len`.


In [139]:
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



We can use `len` to check how many characters are in a string or any other sequence-like object (we will talk about those later):

In [141]:
long_string = "The Python interpreter has a number of functions and types built into it that are always available."
length = len(long_string)
print(length)

99
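
Because `len` returns a number, we can use its result in further calculations. A small sketch with made-up values:

```python
empty_length = len('')            # an empty string has no characters
with_space = len('a b')           # spaces count as characters too
digits = len(str(12345))          # number of digits, via conversion to text

# Use the length of a title to draw a line of exactly the same width.
title = 'Working with text'
underline = '-' * len(title)

print(empty_length, with_space, digits)
print(title)
print(underline)
```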


Let's take a look at a few other built-in functions that are useful for us at this stage of the course. Feel free to play with them, inspect their arguments and see what they return:

In [150]:
# return absolute value of a number
print(abs(-10))
print(abs(5))

10
5


In [151]:
# get min/max of a set of numbers
print(min(10, 20, 30))
print(max(10, 20, 30))

10
30


In [160]:
# it works on strings too
print(max('Tom', 'Jerry', 'Adam'))
print(min('Tom', 'Jerry', 'Adam'))

Tom
Adam


In [156]:
# rounds a number to the given precision (defaults to integer rounding); with negative precision it can even round to tens or hundreds
print(round(123.5725))
print(round(123.5725, 0))
print(round(123.5725, 1))
print(round(123.5725, 2))
print(round(123.5725, -1))
print(round(123.5725, -2))


124
124.0
123.6
123.57
120.0
100.0
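
One thing worth knowing about `round`: values exactly halfway between two integers are rounded to the nearest even one, not always up. The sketch below shows this, plus a case where the result looks surprising because floats cannot store every decimal value exactly.

```python
# Halves go to the nearest EVEN integer.
half_0 = round(0.5)
half_1 = round(1.5)
half_2 = round(2.5)
half_3 = round(3.5)
print(half_0, half_1, half_2, half_3)

# 2.675 is stored as a value slightly below 2.675, so it rounds down.
surprising = round(2.675, 2)
print(surprising)
```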


In [163]:
# type tells us what type is the value
print(type(123))
print(type(123.4))
print(type(False))
print(type('123'))


<class 'int'>
<class 'float'>
<class 'bool'>
<class 'str'>


TODO: add some examples before excercises

We played a little with some functions, now let's look at a very similar concept: methods.

## Methods

**Methods** are functions attached to a specific object, like a number or a string.

Before explaining further, let's take a look at an example:

In [164]:
'this was typed using lower case'.upper()

'THIS WAS TYPED USING LOWER CASE'

**Methods** are called by placing a dot `.` after the object, followed by the **method** name, `upper()` in this example.

In [169]:
help('some string'.upper)

Help on built-in function upper:

upper() method of builtins.str instance
    Return a copy of the string converted to uppercase.



**Methods** are functions; they are just attached to a specific object, and the actions they perform and the values they return are usually linked in some way to that object. In the example of `str.upper()`, it produced a new version of our string with all letters replaced by their upper case versions.

To see what **methods** are available for an object, just type it followed by a dot `.` and press `tab`. Depending on the tool you are using, you should get a list of methods (and other properties of the object).

While numbers don't have many **methods** that are useful in everyday use cases, strings have tons of useful **methods** that can produce a changed copy of the string, search within the string, test something about the string and much more. Here are just a few examples:

In [179]:
name = 'my name is Marcin'

In [180]:
# make the first letter a capital letter (and the rest lower case)
name.capitalize()

'My name is marcin'

In [182]:
# replace part of string with another
name.replace('cin', 'tinez')

'my name is Martinez'

In [189]:
# test if the string ends with another specific string
name.endswith('cin')

True
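
A few more string methods, as a sketch. All of them are standard `str` methods; the values are made up. Note that none of them change the original string, they all return a new value.

```python
name = 'my name is Marcin'

shouted = name.upper()              # new string, name itself is unchanged
title_case = name.title()           # capital letter at the start of each word
count_of_m = name.count('m')        # case sensitive: 'M' is not counted
starts = name.startswith('my')

# Methods can be chained: strip the spaces first, then lower the case.
cleaned = '  Marcin  '.strip().lower()

# Lowering both sides gives a case-insensitive comparison.
alphabetical = 'apple'.lower() < 'Zebra'.lower()

print(name)
print(shouted, title_case, count_of_m, starts, cleaned, alphabetical)
```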

TODO: some examples to play with before excercise

# Extra - more formatting tricks

We can decide how wide the text has to be and how it is aligned, which allows us, for example, to make decent-looking tables:

In [105]:
# each column will have a width of at least 10 characters
# 1st column aligned to the left
# 2nd column to the right
# 3rd centered

print('-'*34)
print(f'|{"leg_a":<10}|{"leg_b":>10}|{"hypo":^10}|')
print('-'*34)
print(f'|{10:<10.1f}|{5000:>10.1f}|{5000.00999:^10.1f}|')
print(f'|{20:<10.1f}|{100:>10.1f}|{101.980390:^10.1f}|')
print(f'|{5:<10.1f}|{25:>10.1f}|{25.495097:^10.1f}|')
print(f'|{1000.58:<10.1f}|{0.73:>10.1f}|{1000.580266:^10.1f}|')
print('-'*34)

----------------------------------
|leg_a     |     leg_b|   hypo   |
----------------------------------
|10.0      |    5000.0|  5000.0  |
|20.0      |     100.0|  102.0   |
|5.0       |      25.0|   25.5   |
|1000.6    |       0.7|  1000.6  |
----------------------------------


In [106]:
# but if the resulting text is longer, it will no longer look so good,
# as the width does not cut the string, it only makes sure it takes AT LEAST that many characters

print('-'*34)
print(f'|{"this is leg_a":<10}|{"this is leg_b":>10}|{"this is hypotenuse":^10}|')
print('-'*34)

----------------------------------
|this is leg_a|this is leg_b|this is hypotenuse|
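
If we would rather cut long text than break the table, one option is to add a precision to the specifier. For strings, the number after the dot is the maximum number of characters to keep. A minimal sketch:

```python
# <10.10 means: align left, at least 10 characters wide, at most 10 characters kept
header = f'|{"this is leg_a":<10.10}|{"this is leg_b":>10.10}|{"this is hypotenuse":^10.10}|'

print('-' * 34)
print(header)
print('-' * 34)
```

The table keeps its shape, but the column names lose their endings, so this only helps when the beginning of the text is enough to recognise it.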
----------------------------------


What if we want to allow our user to create a template string that we only fill with values? That's where we have to go back to the `.format` way of formatting strings. In a way it's very similar to an `f-string`:

In [96]:
leg_a = 10
leg_b = 20
hypotenuse = (leg_a ** 2 + leg_b ** 2) ** 0.5
print(f"Given two triangle legs of length {leg_a} and {leg_b} hypotenuse length is {hypotenuse}")
print("Given two triangle legs of length {} and {} hypotenuse length is {}".format(leg_a, leg_b, hypotenuse))

Given two triangle legs of length 10 and 20 hypotenuse length is 22.360679774997898
Given two triangle legs of length 10 and 20 hypotenuse length is 22.360679774997898


Here we don't put `f` in front of the string; instead we call the string's `.format()` method.

As parameters of this method we pass the comma-separated values we want to include in the string:

`.format(leg_a, leg_b, hypotenuse)`

They will fill the `{}` placeholders in the string in the same order.

You can also give names to the placeholders in the string, and then refer to those names in the `format` method:

In [88]:
print("Given two triangle legs of length {a} and {b} hypotenuse length is {h}".format(a=leg_a, b=leg_b, h=hypotenuse))

Given two triangle legs of length 10 and 20 hypotenuse length is 22.360679774997898


We can also use formatting specifications:

In [97]:
print("Given two triangle legs of length {a:.1f} and {b:.1f} hypotenuse length is {h:.3f}".format(a=leg_a, b=leg_b, h=hypotenuse))

Given two triangle legs of length 10.0 and 20.0 hypotenuse length is 22.361


So let's assume that our user provides us with a string containing `{}` placeholders, but with the text part translated to their language:

In [107]:
english = "Given two triangle legs of length a={a:.1f}cm and b={b:.1f}cm hypotenuse length is {h:.3f}cm"
polish = "Znając długość dwóch przyprostokątnych a={a:.1f}cm oraz b={b:.1f}cm długość przeciwprostokątnej wynosi {h:.3f}cm"


If we want to print this with the filled parts we have to use `.format()`:

In [108]:
print(english.format(a=leg_a, b=leg_b, h=hypotenuse))
print(polish.format(a=leg_a, b=leg_b, h=hypotenuse))

Given two triangle legs of length a=10.0cm and b=20.0cm hypotenuse length is 22.361cm
Znając długość dwóch przyprostokątnych a=10.0cm oraz b=20.0cm długość przeciwprostokątnej wynosi 22.361cm
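
The advantage of a template is that it is written once and filled many times, possibly long before the values are known. Here is a sketch with a shorter, made-up template; it also shows numbered placeholders, which let us reuse the same value more than once.

```python
# The template is defined first, without any values.
template = 'Legs a={a:.1f}cm and b={b:.1f}cm give hypotenuse {h:.3f}cm'

# The same template is filled twice with different values.
first = template.format(a=3, b=4, h=(3 ** 2 + 4 ** 2) ** 0.5)
second = template.format(a=6, b=8, h=(6 ** 2 + 8 ** 2) ** 0.5)

# Numbers inside {} refer to the position of the value in format().
reused = '{0} and {1}, then {0} again'.format('a', 'b')

print(first)
print(second)
print(reused)
```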


# Extra - reading error messages