# Strings solutions

## [Download exercises zip](../../_static/strings-exercises.zip)

[Browse files online](https://github.com/DavidLeoni/datasciprolab/tree/master/exercises/strings)






## What to do

- unzip the exercises into a folder; you should get something like this:

```

-jupman.py
-my_lib.py
-other stuff ...
-exercises
     |- strings
         |- strings-exercise.ipynb     
         |- strings-solution.ipynb
         |- other stuff ..
```

<div class="alert alert-warning">

**WARNING 1**: to correctly visualize the notebook, it MUST be in an unzipped folder !
</div>


- open Jupyter Notebook from that folder. Two things should open: first a console, then a browser. The browser should show a file list: navigate the list and open the notebook `exercises/strings/strings-exercise.ipynb`

<div class="alert alert-warning">

**WARNING 2**: DO NOT use the _Upload_ button in Jupyter, instead navigate to the unzipped folder while in Jupyter browser!
</div>


- Go on reading that notebook, and follow the instructions inside.


Shortcut keys:

- to execute Python code inside a Jupyter cell, press `Control + Enter`
- to execute Python code inside a Jupyter cell AND select next cell, press `Shift + Enter`
- to execute Python code inside a Jupyter cell AND create a new cell afterwards, press `Alt + Enter`
- if the notebook looks stuck, try to select `Kernel -> Restart`





## Introduction

Strings are **immutable objects** (the actual type is **str**) that Python uses to handle text data. A string is a sequence of *Unicode code points*, which can represent characters but also formatting information (e.g. '\\n' for a new line). Unlike some other programming languages, Python has no separate character data type: a single character is simply a string of length 1.
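A quick way to check the last point is to take one character out of a string and look at its type. This is only a small sketch, and the variable names are illustrative:

```python
word = "python"
first = word[0]          # indexing returns a string of length 1, not a separate char type

print(type(word))        # <class 'str'>
print(type(first))       # <class 'str'>
print(len(first))        # 1

newline = "\n"           # an escape sequence counts as a single code point
print(len(newline))      # 1
```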

There are several ways to define a string:


In [1]:
S = "my first string, in double quotes"

S1 = 'my second string, in single quotes'

S2 = '''my third string is 
in triple quotes
therefore it can span several lines'''

S3 = """my fourth string, in triple double-quotes
can also span
several lines"""

print(S, '\n') #let's add a new line at the end of the string with \n
print(S1,'\n')
print(S2, '\n')
print(S3, '\n')

my first string, in double quotes 

my second string, in single quotes 

my third string is 
in triple quotes
therefore it can span several lines 

my fourth string, in triple double-quotes
can also span
several lines 



To put special characters like ', " and so on inside a string you need to "escape" them (i.e. write them right after a backslash).

![](img/escapes.png)

**Example**:
Let's print a string containing a quote and double quote (i.e. ' and ").

In [2]:
myString = "This is how I \'quote\' and \"double quote\" things in strings"
print(myString)

This is how I 'quote' and "double quote" things in strings
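Other common escape sequences work the same way. The sketch below uses illustrative variable names and shows that each escape sequence becomes a single character in the resulting string:

```python
path = "C:\\data"            # \\ produces one backslash
table = "name\tage"          # \t produces a tab
two_lines = "first\nsecond"  # \n produces a new line

print(path)
print(table)
print(two_lines)
print(len(path))             # 7: the escaped backslash counts as one character
```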


Strings can be converted to and from numbers with the functions ```str()```, ```int()``` or ```float()```.

**Example**:
Let's define a string *myString* with the value "47001" and convert it into an `int`. Try adding one and print the result.

In [3]:
myString = "47001"
print(myString, " has type ", type(myString))

myInt = int(myString)

print(myInt, " has type ", type(myInt))

myInt = myInt + 1   #adds one

myString = myString + "1" #cannot add 1 (we need to use a string). 
                          #This will append 1 at the end of the string

print(myInt)
print(myString)

47001  has type  <class 'str'>
47001  has type  <class 'int'>
47002
470011
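The same idea applies to `float()` and `str()`. Note that `int()` raises a `ValueError` when the text does not represent an integer. A minimal sketch follows, with made-up values:

```python
price = float("3.50")        # string -> float
label = str(42)              # int -> string
print(price + 1)             # 4.5
print(label + "!")           # 42!

bad = "47a01"
try:
    number = int(bad)
except ValueError:
    number = None            # int() raised ValueError: "47a01" is not an integer
print(number)                # None
```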


Python defines some operators to work with strings. Recall the slides shown during the lecture:

![](img/stringoperators.png)


**Example** 
A tandem repeat is a short sequence of DNA that is repeated several times in a row. Let's create a string representing the tandem repeat of the motif "ATTCG" repeated 5 times. What is the length of the whole repetitive region? Is the motif "TCGAT" (m1) present in the region? The motif "TCCT" (m2)? Let's give an orientation to the tandem repeat by adding the string `"|-"` on the left and `"->"` to the right.

In [4]:
motif = "ATTCG"

tandem_repeat = motif * 5

print(motif)
print(tandem_repeat, " has length", len(tandem_repeat))
m1 = "TCGAT"
m2 = "TCCT"

print("Is ", m1, " in ", tandem_repeat, " ? ", m1 in tandem_repeat )
print("Is ", m2, " in ", tandem_repeat, " ? ", m2 in tandem_repeat )
oriented_tr = "|-" + tandem_repeat + "->"
print(oriented_tr)

ATTCG
ATTCGATTCGATTCGATTCGATTCG  has length 25
Is  TCGAT  in  ATTCGATTCGATTCGATTCGATTCG  ?  True
Is  TCCT  in  ATTCGATTCGATTCGATTCGATTCG  ?  False
|-ATTCGATTCGATTCGATTCGATTCG->


We can access the character at a specific position (indexing) or get the substring going from a position S to a position E (slicing). The only thing to remember is that numbering starts from 0, so the ```i```-th character of a string can be accessed as ```str[i-1]```. Substrings can be accessed as ```str[S:E]```; optionally, a third parameter can be specified to set the step (i.e. ```str[S:E:STEP]```).

<div class="alert alert-warning">

**Important note.**
Remember that when you do str[S:E], **S is inclusive, while E is exclusive** (see S[0:6] below).

</div>


![](img/slicingstring.png)

Let's see these aspects in action with an example:

In [5]:
S = "Luther College"

print(S) #print the whole string
print(S == S[:]) #S[:] selects the whole string, so it is equal to S
print(S[0]) #first character
print(S[3]) #fourth character
print(S[-1]) #last character
print(S[0:6]) #first six characters
print(S[-7:]) #final seven characters
print(S[0:len(S):2]) #every other character starting from the first
print(S[1:len(S):2]) #every other character starting from the second

Luther College
True
L
h
e
Luther
College
Lte olg
uhrClee
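A few more slicing cases, shown as a short sketch on the same string: a negative step walks the string backwards, and a slice that goes past the end is simply clipped instead of raising an error.

```python
S = "Luther College"

reversed_S = S[::-1]     # negative step walks the string backwards
last_three = S[-3:]      # from the third-last character to the end
too_far = S[7:100]       # slices are clipped to the string length, no error is raised

print(reversed_S)        # egelloC rehtuL
print(last_three)        # ege
print(too_far)           # College
```

Indexing a single position outside the string (e.g. `S[100]`) is different: it raises an `IndexError`.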


### Methods for the str object

The object ```str``` has some methods that can be applied to it (remember methods are things you can do on objects). Recall from the lecture that the main methods are:

![](img/strmethods.png)

<div class="alert alert-warning">

**IMPORTANT NOTE**:
Since strings are immutable, every operation that changes a string actually produces a new *str* object having the modified string as its value.

</div>

Recall that **strings are immutable**. For this reason we cannot change one of their characters in place by assigning to a position.

**Example:** Since the genetic code is degenerate, several codons can encode the same amino acid. Consider proline: it can be encoded by the codons `CCU`, `CCA`, `CCG` and `CCC`. Let's create a string `proline` and assign its possible codons to it one after the other.

In other words, if you have a string, how do you obtain another string from the first one by changing only one character? Can we directly change `"CCU"` into `"CCA"`  ? If not, are there alternatives to produce a new string from the first one ?



```python
"""
Wrong solution. We cannot directly replace the value of a string
"""

proline = "CCU"
print("Proline can be encoded by: ", proline)
proline[2]="A"
print(".. or by: ", proline)

```


```
Proline can be encoded by:  CCU

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-8802d0c749a4> in <module>
      1 proline = "CCU"
      2 print("Proline can be encoded by: ", proline)
----> 3 proline[2]="A"
      4 print(".. or by: ", proline)

TypeError: 'str' object does not support item assignment

```

In [6]:
"""
Correct solution. Using str.replace
"""
proline = "CCU"
print("Proline can be encoded by: ", proline)
proline = proline.replace("U","A")
print(".. or by: ", proline)
proline = proline.replace("A","G")
print(".. or by: ", proline)
proline = proline.replace("G","C")
print(".. or by: ", proline)

Proline can be encoded by:  CCU
.. or by:  CCA
.. or by:  CCG
.. or by:  CCC


In [7]:
"""
Another correct solution. Using string slicing and concatenation.
"""
proline = "CCU"
print("Proline can be encoded by: ", proline)
proline = proline[:-1]+"A" #equal to proline[0:-1] or proline[0:2]
print(".. or by: ", proline)
proline = proline[:-1]+"G"
print(".. or by: ", proline)
proline = proline[:-1]+"C"
print(".. or by: ", proline)

Proline can be encoded by:  CCU
.. or by:  CCA
.. or by:  CCG
.. or by:  CCC
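Keep in mind that `str.replace` substitutes every occurrence of the given substring, not only the last character. The sketch below uses illustrative variable names and shows the optional third argument, which limits the number of replacements:

```python
codon = "CCU"

all_replaced = codon.replace("C", "G")     # every "C" is replaced
first_only = codon.replace("C", "G", 1)    # third argument: replace at most 1 occurrence

print(all_replaced)   # GGU
print(first_only)     # GCU
print(codon)          # CCU, the original string is unchanged
```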


**Example**:
Given the DNA sequence S = "   aTATGCCCATatcgctAAATTGCTGCCATTACA    ", print its length (after removing any blank spaces on either side) and the number of adenines, cytosines, guanines and thymines it contains. Is the sequence "ATCG" present in S? Print how many times the substring "TGCC" appears in S and all the corresponding indexes.

In [8]:
S = "   aTATGCCCATatcgctAAATTGCTGCCATTACA    "

print(S)
S = S.strip(" ")
print(S)

print(len(S))
tmpS = S.upper() #for simplicity to count only 4 different nucleotides
print("A count: ", tmpS.count("A"))
print("C count: ", tmpS.count("C"))
print("T count: ", tmpS.count("T"))
print("G count: ", tmpS.count("G"))
print("Is ATCG in ", tmpS, "? ", tmpS.find("ATCG") != -1) #or tmpS.count("ATCG") > 0
print("TGCC is present ", tmpS.count("TGCC"), " times in ", tmpS)
print("TGCC is present at pos ", tmpS.find("TGCC")) 
print("TGCC is present at pos ", tmpS.rfind("TGCC")) #or tmpS.find("TGCC",4)


   aTATGCCCATatcgctAAATTGCTGCCATTACA    
aTATGCCCATatcgctAAATTGCTGCCATTACA
33
A count:  10
C count:  9
T count:  10
G count:  4
Is ATCG in  ATATGCCCATATCGCTAAATTGCTGCCATTACA ?  True
TGCC is present  2  times in  ATATGCCCATATCGCTAAATTGCTGCCATTACA
TGCC is present at pos  3
TGCC is present at pos  23
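With only two occurrences, `find` and `rfind` are enough. For an arbitrary number of occurrences, one possible approach is to call `find` repeatedly, each time starting just after the previous hit. This is a sketch that assumes you are already familiar with lists and `while` loops; the names `motif` and `positions` are illustrative:

```python
S = "ATATGCCCATATCGCTAAATTGCTGCCATTACA"
motif = "TGCC"

positions = []
start = S.find(motif)
while start != -1:                     # find returns -1 when there are no more hits
    positions.append(start)
    start = S.find(motif, start + 1)   # continue searching after the last hit

print(positions)   # [3, 23]
```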


## Exercises

### extract_email


<div class="alert alert-info" >

**COMMANDMENT 4 (adapted for strings): You shall never ever reassign function parameters**
</div>

```python

    def myfun(s):

        # You shall not do any such evil, no matter what the type of the parameter is:
        s = "evil"          # strings
```
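A minimal sketch of what following this rule can look like: the hypothetical function `shout` below leaves the parameter `s` alone and stores intermediate results under new names.

```python
def shout(s):
    # s is never reassigned: intermediate results get their own names
    cleaned = s.strip()
    return cleaned.upper() + "!"

print(shout("  hello "))   # HELLO!
```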


In [9]:
def extract_email(s):
    """ Takes a string s formatted like 
    
        "lun 5 nov 2018, 02:09 John Doe <john.doe@some-website.com>"
        
        and RETURN the email "john.doe@some-website.com"
        
        NOTE: the string MAY contain spaces before and after, but your function must be able to extract email anyway.
        
        If the string for some reason is found to be ill formatted, raises ValueError
    """
    #jupman-raise
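    stripped = s.strip()
    i = stripped.find('<')
    # the docstring promises a ValueError for ill-formatted input
    if i == -1 or not stripped.endswith('>'):
        raise ValueError("Ill-formatted string: " + repr(s))
    return stripped[i+1:len(stripped)-1]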
    #/jupman-raise

assert extract_email("lun 5 nov 2018, 02:09 John Doe <john.doe@some-website.com>") == "john.doe@some-website.com"
assert extract_email("lun 5 nov 2018, 02:09 Foo Baz <mrfoo.baz@blabla.com>") == "mrfoo.baz@blabla.com"
assert extract_email(" lun 5 nov 2018, 02:09 Foo Baz <mrfoo.baz@blabla.com>  ") == "mrfoo.baz@blabla.com"  # with spaces

## Further resources

Have a look at [LeetCode string problems](https://leetcode.com/tag/string/), filtering by _Easy_ difficulty and sorting by _Acceptance_.

In particular, you may check:

* [Unique email addresses](https://leetcode.com/problems/unique-email-addresses/description/)
* [Unique Morse codes](https://leetcode.com/problems/unique-morse-code-words/description/)
* [Robot return to origin](https://leetcode.com/problems/robot-return-to-origin/description/)
* [Reverse Words in a String III](https://leetcode.com/problems/reverse-words-in-a-string-iii/description/)
* [Goat Latin](https://leetcode.com/problems/goat-latin/description/)
* [Detect Capital](https://leetcode.com/problems/detect-capital/description/)
* [Count Binary Substrings](https://leetcode.com/problems/count-binary-substrings/description/)