# 1. Strings in Python

* Getting substrings
* Building strings
* Modifications
* Boolean methods
* Case
* Numbers
* Which string method?

Computers represent language using the string datatype. Python's support for string manipulation is excellent. If you want to work with (human) language, Python is the (computer) language of choice for most purposes.

## Getting substrings

Let's start by creating a Python string containing the word "antihumanitarianism", with the variable name S (for string), and access its letters using basic indexing (square brackets, i.e. \[\]). Note that indexing supports forward as well as backward (negative) indices.

In [1]:
S = "antihumanitarianism" 

In [2]:
S

'antihumanitarianism'

In [3]:
S[0]

'a'

In [4]:
S[-1]

'm'

In [5]:
S[-4]

'n'

Slicing (using the : operator inside square brackets) applied to strings produces substrings. Note that:
- The letter at the second index (the stop) is NOT included.
- Leaving the start/stop index empty means that the beginning/end is assumed (you should ALWAYS do this rather than writing 0 or the length explicitly!)

Exercise: Let's pull out good English words that are substrings of "antihumanitarianism":

| a | n | t | i | h | u | m | a | n | i | t | a | r | i | a | n | i | s | m |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| -19 | -18 | -17 | -16 | -15 | -14 | -13 | -12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 |


In [6]:
S[0:4]

'anti'

In [7]:
S[:4]

'anti'

In [8]:
S[4:9]

'human'

In [9]:
S[-3:]

'ism'

In [10]:
S[4:16]

'humanitarian'

In [11]:
S[10:13]

'tar'

In [12]:
S[12:9:-1]

'rat'

In [13]:
S[4:]

'humanitarianism'
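
The two slicing rules above can be checked directly. The following is a minimal sketch; the variable names `prefix`, `head`, `tail` and `reversed_S` are only illustrative.

```python
S = "antihumanitarianism"

# The stop index is excluded, so a slice S[a:b] has length b - a.
prefix = S[0:4]              # 'anti'
assert len(prefix) == 4 - 0

# Omitting an index means "from the beginning" or "to the end".
head, tail = S[:4], S[4:]    # 'anti', 'humanitarianism'

# Because the stop is excluded, cutting at any position k gives two
# pieces that concatenate back to the original string.
for k in range(len(S) + 1):
    assert S[:k] + S[k:] == S

# A negative step walks backwards; S[::-1] reverses the whole string.
reversed_S = S[::-1]         # 'msinairatinamuhitna'
```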

Sometimes, you don't know the exact index where you want to make a cut or cuts, but you know there's a special character (or set of characters) there. 

The [*split*](https://docs.python.org/3/library/stdtypes.html#str.split) method is used for breaking a string at a delimiter. The delimiter can be more than one character. If no delimiter is given, the string is split on whitespace.

It returns a list of strings corresponding to the parts. The delimiter is not included.

In [14]:
S1 = "merry-go-round"
S2 = "red, green, blue, and yellow"

In [15]:
S1.split()

['merry-go-round']

In [16]:
S2.split()

['red,', 'green,', 'blue,', 'and', 'yellow']

In [17]:
S1.split("-")

['merry', 'go', 'round']

In [18]:
S3 = S2.split(" ")
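
A few more cases may help here. This sketch shows a multi-character delimiter, the difference between `split()` and `split(" ")`, and the optional limit on the number of splits; the names `spaced`, `loose`, `strict`, `first` and `rest` are made up for the example.

```python
S2 = "red, green, blue, and yellow"

# A delimiter may be longer than one character.
parts = S2.split(", ")
# ['red', 'green', 'blue', 'and yellow']

# split() with no argument splits on any run of whitespace,
# while split(" ") splits at every single space.
spaced = "a  b"                # two spaces between the letters
loose = spaced.split()         # ['a', 'b']
strict = spaced.split(" ")     # ['a', '', 'b']

# The optional second argument limits the number of splits.
first, rest = S2.split(", ", 1)
# first is 'red', rest is 'green, blue, and yellow'
```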

Another common situation is a string with some junk at the beginning or end (or both) that you want to remove.

The [strip](https://docs.python.org/3/library/stdtypes.html#str.strip) method removes whitespace (spaces, newlines, tabs) by default, but it can remove any set of characters you like from the edges of your string.

It stops when it hits something it hasn't been told to remove. 

There are other methods (*rstrip* and *lstrip*) which apply this from only one direction.

In [19]:
S = " +-+-+strip me+-+-+ "

In [20]:
S3[0].strip(",")

'red'

In [21]:
S.strip()

'+-+-+strip me+-+-+'

In [22]:
S.strip(" +")

'-+-+strip me+-+-'

In [23]:
S.lstrip()

'+-+-+strip me+-+-+ '

In [24]:
S.rstrip("+ -em")

' +-+-+strip'

If you want to remove a specific substring from the start or end of a string, use [removeprefix](https://docs.python.org/3/library/stdtypes.html#str.removeprefix) and [removesuffix](https://docs.python.org/3/library/stdtypes.html#str.removesuffix), which were added in Python 3.9.

In [25]:
S.removeprefix(" +-")

'+-+strip me+-+-+ '

In [26]:
words = ['extend', 'excited', 'exex1223', 'ex123ex']
for word in words:
    print(word.removeprefix('ex'))

tend
cited
ex1223
123ex
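
It is easy to confuse *lstrip* with *removeprefix*. As a rough illustration on the same word list (the names `stripped` and `removed` are only for this sketch, and Python 3.9 or later is assumed):

```python
words = ["extend", "excited", "exex1223", "ex123ex"]

# lstrip("ex") treats its argument as a set of characters {'e', 'x'}
# and keeps removing them until it meets a different character.
stripped = [w.lstrip("ex") for w in words]
# ['tend', 'cited', '1223', '123ex']

# removeprefix("ex") removes the exact substring "ex" at most once.
removed = [w.removeprefix("ex") for w in words]
# ['tend', 'cited', 'ex1223', '123ex']
```

The results differ only for 'exex1223', where *lstrip* strips both copies of "ex".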


## Building strings

For many purposes, the concatenation operator "+" is all you need.

In [27]:
S1 = "hello"
S2 = "world"

In [28]:
S1 + S2

'helloworld'

In [29]:
S1 + " " + S2

'hello world'

However, this is wasteful for more than a few strings, so if you have many strings you want to put together, particularly if they are already in a Python list, use [join](https://docs.python.org/3/library/stdtypes.html#str.join).

The join method has a funky syntax: it is called as a method of the delimiter string, with the list of strings as the argument.

In [30]:
L1 = ["merry","go","round"]
L2 = ["anti","dis","establish","ment","arian","ism"]

In [31]:
# L1.join()
# Raises AttributeError: 'list' object has no attribute 'join'

In [32]:
"-".join(L1)

'merry-go-round'

In [33]:
" ".join(L2)

'anti dis establish ment arian ism'

In [34]:
"".join(L2)

'antidisestablishmentarianism'

In [35]:
S3 = "world"
S4 = "hello"

In [36]:
"".join([S4, S3])

'helloworld'
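
One thing to watch for is that *join* only accepts strings. The sketch below (with an illustrative list called `numbers`) converts the items first, and also shows that *split* and *join* undo each other when they use the same delimiter.

```python
numbers = [3, 1, 4, 1, 5]

# ", ".join(numbers) would raise a TypeError because the items are ints,
# so convert each item to a string first.
joined = ", ".join(str(n) for n in numbers)
# '3, 1, 4, 1, 5'

# Splitting and re-joining on the same delimiter gives back the original.
S1 = "merry-go-round"
round_trip = "-".join(S1.split("-"))
# 'merry-go-round'
```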

If you want to construct a string from one or more variables, use [f-strings](https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals) (add an f in front of the quotes) with the variables (or even longer expressions) between curly brackets ({}).

In [37]:
noun = "dog"
verb = "run"

In [38]:
f"The {noun} loves to {verb}"

'The dog loves to run'
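
To illustrate the "longer expressions" mentioned above, here is a small sketch; the variable `count` and the sentences are invented for the example.

```python
noun = "dog"
verb = "run"
count = 3

# Any expression can go between the braces, including arithmetic
# and method calls.
sentence = f"{count} {noun}s love to {verb}; that is {count * 4} legs."
# '3 dogs love to run; that is 12 legs.'

shout = f"{noun.upper()}!"
# 'DOG!'

# Doubling the braces produces a literal brace.
literal = f"{{noun}} is replaced by {noun}"
# '{noun} is replaced by dog'
```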

## Modifications

First, remember that Python strings are not mutable. Whenever you "modify" a string, you are not actually changing it; you are creating a new string based on the old one.
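
A quick way to see this is to try assigning to a single position. This is only a sketch; `message` and `T` are illustrative names.

```python
S = "the lords of the ring"

message = None
try:
    S[8] = ""                # strings do not support item assignment
except TypeError as err:
    message = str(err)       # the error text mentions "item assignment"

# String methods return a new string and leave the original untouched.
T = S.upper()
# T is 'THE LORDS OF THE RING', S is still 'the lords of the ring'
```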

Of course, you can "modify" strings by using a combination of what we have already seen. 

Let's turn the string "the lords of the ring" into "the lord of the rings" using slicing and concatenation. Then let's try doing it by splitting and joining.

In [39]:
S = "the lords of the ring"

In [40]:
S1 = S.split("s ")
S2 = " ".join(S1) + "s"
S2

'the lord of the rings'

In [41]:
S2 = S[:8] + S[9:] + "s"
S2

'the lord of the rings'

If there is a particular substring within a string that you wish to change, use the [replace](https://docs.python.org/3/library/stdtypes.html#str.replace) method.

It is called on the whole string, and takes as its two arguments the string to be replaced and the string to replace it with. An optional third argument limits the number of replacements.

By default it applies to all instances of the substring, even inside longer words, so be careful!

In [42]:
S = "hands and feet"

In [43]:
S1 = S.replace("feet", "toes")
S1

'hands and toes'

In [44]:
S

'hands and feet'

In [45]:
S1 = S.replace("and", "or", 2)
S1

'hors or feet'
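
The "hors" above comes from the "and" hidden inside "hands". Two possible ways around this are sketched below, assuming the word you want is surrounded by spaces; the names `once` and `word_only` are illustrative.

```python
S = "hands and feet"

# The third argument caps the number of replacements, counted from the
# left, so a limit of 1 still hits the "and" inside "hands".
once = S.replace("and", "or", 1)
# 'hors and feet'

# Including the surrounding spaces targets only the standalone word.
word_only = S.replace(" and ", " or ")
# 'hands or feet'
```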

## Boolean methods

Here we mean string methods (and an operator) which return a boolean; they check to see if a string has a particular property.

Two of the most useful boolean methods are [startswith](https://docs.python.org/3/library/stdtypes.html#str.startswith) and [endswith](https://docs.python.org/3/library/stdtypes.html#str.endswith), which check whether a string starts/ends with another string. Very handy for checking for morphological affixes, for instance.

In [46]:
S1 = "reranked"
S2 = "disabled"

In [47]:
S1.startswith("re")

True

In [48]:
S1.endswith("ed")

True

In [49]:
S1.startswith("dis")

False

In [50]:
if S2.endswith("ed"):
    print(S2[:-2])

disabl
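
Both methods also accept a tuple of strings, which means "any of these". A short sketch for affix checking follows; the word list and the names `prefixes`, `prefixed` and `suffixed` are made up for the example.

```python
words = ["reranked", "disabled", "unhappy", "table"]

# A tuple argument matches if the string starts with any of its items.
prefixes = ("re", "dis", "un")
prefixed = [w for w in words if w.startswith(prefixes)]
# ['reranked', 'disabled', 'unhappy']

# The same works for endswith.
suffixed = [w for w in words if w.endswith(("ed", "ing"))]
# ['reranked', 'disabled']
```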


If you want to check if a string appears anywhere within another string, use the *in* operator.

In [51]:
"rank" in S1

True

In [52]:
"rank" in S2

False

In [53]:
if "rank" in S1:
    print("Success!")

Success!


Another thing we often want to know is whether a string is in fact a word (consisting only of letters). The [isalpha](https://docs.python.org/3/library/stdtypes.html#str.isalpha) method does this. 

Note that whitespace, punctuation, and some characters that appear regularly within English words (e.g. "-" and "'") are not considered alphabetic.

In [54]:
S1 = "hello"
S2 = "hello world"
S3 = "a can't-do-it attitude"

In [55]:
S1.isalpha()

True

In [56]:
S2.isalpha()

False

In [57]:
S2.replace(" ", "").isalpha()

True

In [58]:
S3.isalpha()

False

In [59]:
S4 = "été"
S4.isalpha()  # True: accented letters are alphabetic (only the last result in a cell is displayed)
S5 = "99"
S5.isalpha()

False
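
If you do want to accept words containing hyphens or apostrophes, one possible approach is to ignore those characters before calling *isalpha*. The function `is_wordlike` below is a hypothetical helper written for this sketch, not a built-in.

```python
def is_wordlike(token, extra="-'"):
    """Hypothetical helper: True if token is letters plus characters in `extra`."""
    # Drop the allowed extra characters, then test what remains.
    core = "".join(ch for ch in token if ch not in extra)
    return core.isalpha()

checks = {
    "can't-do-it": is_wordlike("can't-do-it"),   # True
    "hello world": is_wordlike("hello world"),   # False: contains a space
    "--": is_wordlike("--"),                     # False: no letters left
}
```

Note that an empty string is not alphabetic, so a token made up only of hyphens is rejected.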

## Case


Upper and lower case versions of the same word are considered entirely different in Python! This means Python is _case-sensitive_. Case-insensitive means it doesn't matter what case you use (your browser's search function is probably case-insensitive by default).

In [60]:
S1 = "case"
S2 = "CASE"
S3 = "Case"


In [61]:
S1.islower()

True

In [62]:
S2.isupper()

True

In [63]:
S3.isupper()

False

To convert between upper and lower case, use the [upper](https://docs.python.org/3/library/stdtypes.html#str.upper) and [lower](https://docs.python.org/3/library/stdtypes.html#str.lower) methods. If the string is already upper/lower case, this has no effect. It is common practice to lowercase all words to neutralize the effects of English capitalization rules. You can also convert a string to have capitalized letters at the beginning of each word with [title](https://docs.python.org/3/library/stdtypes.html#str.title).

In [64]:
S1.upper()

'CASE'

In [65]:
S1.upper().isupper()

True

In [66]:
S2.lower()

'case'

In [67]:
S2.title()

'Case'
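
Lowercasing is what makes case-insensitive comparison and counting possible. The sketch below uses an invented word list; *casefold* is a standard string method that behaves like a more aggressive *lower* for some non-English letters.

```python
S1, S2, S3 = "case", "CASE", "Case"

# Compare after lowercasing to ignore case.
same = S1.lower() == S2.lower() == S3.lower()
# True

# Count words regardless of capitalization.
words = ["The", "the", "THE", "cat"]
counts = {}
for w in words:
    key = w.lower()
    counts[key] = counts.get(key, 0) + 1
# {'the': 3, 'cat': 1}

# casefold also handles letters such as the German sharp s.
folded_match = "Straße".casefold() == "STRASSE".casefold()
# True
```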

To check whether a string is entirely upper or lower case, use [isupper](https://docs.python.org/3/library/stdtypes.html#str.isupper) and [islower](https://docs.python.org/3/library/stdtypes.html#str.islower). Both return False for strings that contain no cased characters at all, such as digits or punctuation.

In [68]:
"00".isupper()

False

In [69]:
"/$&%&$".isupper()

False

In [70]:
"/$&%&$".islower()

False
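
Mixed strings behave differently from the examples above: digits and punctuation are simply ignored as long as at least one cased letter is present. A small sketch with made-up strings:

```python
results = {
    "ABC123": "ABC123".isupper(),   # True: the only cased letters are uppercase
    "abc!": "abc!".islower(),       # True: punctuation is ignored
    "Abc": "Abc".isupper(),         # False: contains lowercase letters
    "123": "123".islower(),         # False: no cased letters at all
}
```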

## Numbers

Be careful about numbers represented as strings, versus real numbers.

In [71]:
S = "42"
i = 42

In [72]:
S == i

False

To check if a string consists of only digits (and is thus appropriate for conversion to an integer), use [isdigit](https://docs.python.org/3/library/stdtypes.html#str.isdigit).




In [73]:
S.isdigit()

True

In [74]:
S2 = "Hello"
S2.isdigit()

False

In [75]:
S3 = "3.14"
S3.isdigit()

False

In [76]:
S4 = "%"
S4.isdigit()

False

Convert from strings to ints or floats using the built-in functions [int](https://docs.python.org/3/library/functions.html#int) and [float](https://docs.python.org/3/library/functions.html#float), and from a number to a string using [str](https://docs.python.org/3/library/stdtypes.html#str).

In [77]:
int(S)

42

In [78]:
int(S) == i

True

In [79]:
str(i) == S

True

In [80]:
float(S)

42.0

In [81]:
float(S) == i

True

In [82]:
str(float(S)) == S

False
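
Note that *isdigit* is stricter than *int*: it rejects signs and surrounding whitespace, which *int* accepts. An alternative is to attempt the conversion and catch the failure. The function `to_int` below is a hypothetical helper for this sketch.

```python
def to_int(text, default=None):
    """Hypothetical helper: return int(text), or `default` if conversion fails."""
    try:
        return int(text)
    except ValueError:
        return default

negative = ("-42".isdigit(), to_int("-42"))     # (False, -42)
padded = (" 42 ".isdigit(), to_int(" 42 "))     # (False, 42)
decimal = ("3.14".isdigit(), to_int("3.14"))    # (False, None)
```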

When using f-strings, you can format numbers by adding a format specification after a ":" inside the braces. Here we just set the number of decimal places with ".3f" (a precision of 3 in fixed-point notation), but there are many more [options](https://docs.python.org/3/library/string.html#format-specification-mini-language).

In [83]:
import math
math.pi

3.141592653589793

In [84]:
f"{math.pi}"

'3.141592653589793'

In [85]:
f"{math.pi:.3f}"

'3.142'
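
A few of the other format options, as a brief sketch with made-up values:

```python
big = 1234567.891
ratio = 0.256
n = 42

with_commas = f"{big:,.2f}"   # '1,234,567.89' (thousands separator, 2 decimals)
percent = f"{ratio:.1%}"      # '25.6%' (multiplied by 100, 1 decimal)
right = f"{n:>6}"             # '    42' (right-aligned in a width of 6)
zero_padded = f"{n:06d}"      # '000042' (padded with zeros to a width of 6)
```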

In [86]:
import string
print("Punctuation: ", string.punctuation)
print("Digits: ", string.digits)
print("ASCII: ", string.ascii_letters)

Punctuation:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Digits:  0123456789
ASCII:  abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
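
These constants combine well with the methods above. For instance, passing string.punctuation to *strip* is one way to clean the commas off the tokens from the earlier colour example; the name `tokens` is just for this sketch.

```python
import string

S2 = "red, green, blue, and yellow"

# Split on whitespace, then strip any punctuation from both edges
# of each token.
tokens = [t.strip(string.punctuation) for t in S2.split()]
# ['red', 'green', 'blue', 'and', 'yellow']

# The constants are ordinary strings, so the in operator works on them.
all_digits = all(ch in string.digits for ch in "2024")
# True
```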


For other constants and utilities in the string module, see: https://docs.python.org/3/library/string.html