# String Manipulation



Building on our work with strings, lists, and dictionaries, let's tackle another [Rosalind Problem](https://rosalind.info/problems/rna/) that focuses on string manipulation.

This problem introduces us to a fundamental concept in molecular biology: **transcription** - the process of creating RNA from a DNA template.



## String Methods

Before we dive into the problem, let's review some useful string methods.

We can modify the case of the whole string with various methods of the format 
`my_string.method()`.


In [10]:
my_string = "HELLO world"
print("lower: ", my_string.lower())  # Convert to lowercase
print("upper: ", my_string.upper())  # Convert to uppercase
print("title: ", my_string.title())  # Convert to title case
print("caps : ", my_string.capitalize())  # Convert to capitalized case
print("swap : ", my_string.swapcase())  # Swap case

lower:  hello world
upper:  HELLO WORLD
title:  Hello World
caps :  Hello world
swap :  hello WORLD


There are also methods that test a string's case, check how it starts or ends, 
and count or locate characters within it.

In [15]:
print("AA upper: ", "AA".isupper())
print("aa lower: ", "aa".islower())
print("AA title: ", "AA".istitle())
print("Start H :", my_string.startswith("H"))  # Check if starts with 'H'
print("End d   :", my_string.endswith("d"))  # Check if ends with 'd'
print("Contains 'lo' :", "lo" in my_string)  # Check if contains 'lo'
print("Count 'l's    :", my_string.count("l"))  # Count occurrences of 'l'
print("Index of 'l'  :", my_string.index("l"))  # Find index of first 'l'

AA upper:  True
aa lower:  True
AA title:  False
Start H : True
End d   : True
Contains 'lo' : False
Count 'l's    : 1
Index of 'l'  : 9
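
One detail worth knowing about `index`: it raises an error when the substring is missing, while the closely related `find` returns `-1`. A minimal sketch (the variable names `found_at` and `missing` are just for illustration):

```python
my_string = "HELLO world"

# find() returns the first position, or -1 when nothing matches
found_at = my_string.find("l")
missing = my_string.find("z")
print("find 'l':", found_at)  # 9
print("find 'z':", missing)  # -1

# index() raises a ValueError instead of returning -1
try:
    my_string.index("z")
except ValueError:
    print("index 'z': not found")
```

Checking with `in` first, or using `find`, avoids the exception when a missing character is an expected case.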


We can swap out specific characters. Note that `replace` is case-sensitive: 
`L` and `l` are treated as different characters.

In [6]:
print(my_string.replace("L", "X"))

HEXXO world
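
If both cases should be replaced, one option is to chain two calls. `replace` also accepts an optional third argument that limits how many replacements are made. A short sketch (`both_cases` and `first_only` are illustrative names):

```python
my_string = "HELLO world"

# Chain calls to handle upper- and lowercase separately
both_cases = my_string.replace("L", "X").replace("l", "x")
print(both_cases)  # HEXXO worxd

# The optional count argument limits the number of replacements
first_only = my_string.replace("L", "X", 1)
print(first_only)  # HEXLO world
```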


## Iterating on Strings

We can also treat strings in some of the same ways as lists:

In [26]:
assembled = ""
for char in my_string:
    if char.isupper():
        print("Upper: ", char)
    elif char.islower():  # Skips the space
        assembled += char.upper()  # append a new capitalized letter
print("Assembled: ", assembled)

Upper:  H
Upper:  E
Upper:  L
Upper:  L
Upper:  O
Assembled:  WORLD


In [10]:
print(my_string[0:5])  # Slice the string from index 0 to 4

HELLO
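
Slices accept the same start, stop, and step values as list slices, including negative ones. A few variations on the slice above (the variable names are illustrative):

```python
my_string = "HELLO world"

last_five = my_string[-5:]  # Count from the end
every_other = my_string[::2]  # Step of 2: indices 0, 2, 4, ...
backwards = my_string[::-1]  # Negative step reverses the string

print(last_five)  # world
print(every_other)  # HLOwrd
print(backwards)  # dlrow OLLEH
```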


And even generate strings with operators:

In [8]:
print("*" * 10)  # Repeat the asterisk 10 times

**********


We can also split a string on a specific character to get a list:

In [11]:
print(my_string.split(" "))  # Split the string by spaces

['HELLO', 'world']
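
Calling `split(" ")` and calling `split()` with no argument behave differently when there are repeated spaces. A small sketch using a made-up string, `spaced`:

```python
spaced = "HELLO   world"  # Three spaces between the words

by_single_space = spaced.split(" ")  # Every single space is a separator
by_whitespace = spaced.split()  # Runs of whitespace count as one separator

print(by_single_space)  # ['HELLO', '', '', 'world']
print(by_whitespace)  # ['HELLO', 'world']
```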


In the next cell, try adding a string you can split based on commas:

In [None]:
your_string = ""  # Try splitting based on a different character
print()

This can be helpful for reordering the pieces of a given input.

In [13]:
new_string = "Third.First.Second"
new_list = new_string.split(".")  # Split the string by '.'
reordered = [new_list[1], new_list[2], new_list[0]]  # Reorder the list
print("Reordered: ", ".".join(reordered))  # Join this list into a string

Reordered:  First.Second.Third


## Special cases

We need to be explicit about some special characters in strings, such as:

- Quotes `\"`
- Tabs `\t`
- Newlines `\n`

Because Python accepts either `'` or `"` to delimit a string, escaping quotes 
is usually only needed when a string contains both kinds.

In [19]:
print('I have "double" quotes')
print("I have 'single' quotes")
print("I have both 'single' and \"double\" quotes")
print("I have \n\t- new lines \n\t- tabs")

I have "double" quotes
I have 'single' quotes
I have both 'single' and "double" quotes
I have 
	- new lines 
	- tabs
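
Each escape sequence is stored as a single character, which matters when measuring or cleaning up text. The sketch below uses a made-up value, `line`, standing in for a line of text as it might be read from a file:

```python
line = "GATTACA\n"  # Hypothetical line ending in a newline

print(len(line))  # 8: the newline counts as one character
print(repr(line.strip()))  # 'GATTACA': strip() removes surrounding whitespace
print(len(r"\n"))  # 2: a raw string keeps the backslash and the n
```

This is why text read from a file is often passed through `strip()` before it is processed.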


For longer text, triple-quoting lets a string span multiple lines and keeps the line breaks.


In [23]:
print(
    """My 
multiline 
string"""
)

My 
multiline 
string


## Problem

## Bio Background

In cells, DNA serves as the template for creating RNA through transcription:

- DNA uses bases: A, T, G, C
- RNA uses bases: A, U, G, C
- The key difference: T in DNA becomes U in RNA

## Problem Statement

- **Given**: A DNA string `t` of length at most 1000 nt.
- **Return**: The transcribed RNA string of `t`.

Let's look at the sample:

In [1]:
t = "GATGGAACTTGACTACGTAAATT"
expected = "GAUGGAACUUGACUACGUAAAUU"

Notice how every 'T' in the DNA string becomes 'U' in the RNA string, while A, G, and C remain unchanged.
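
This rule can be checked position by position on the sample. The following is only an illustrative sketch, not the repository's `validator`; the names `mismatches`, `dna_base`, and `rna_base` are made up for this example.

```python
t = "GATGGAACTTGACTACGTAAATT"
expected = "GAUGGAACUUGACUACGUAAAUU"

mismatches = []
# zip pairs up the characters at the same position in both strings
for position, (dna_base, rna_base) in enumerate(zip(t, expected)):
    if dna_base == "T":
        rule_holds = rna_base == "U"  # T must become U
    else:
        rule_holds = rna_base == dna_base  # A, G, C stay the same
    if not rule_holds:
        mismatches.append(position)

print("Same length:", len(t) == len(expected))  # True
print("Positions breaking the rule:", mismatches)  # []
```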

This repository comes with a validator to check solutions. No peeking!

In [2]:
import os

if os.getcwd().endswith("notebooks"):
    os.chdir("..")

from src.rna import validator

validator(t, expected)

True

What can you do to process this input t and return the transcribed RNA?

In [4]:
# your code here
rna_result = "AAAAA"

validator(t, rna_result)  # Validate your answer

Length mismatch: DNA has 23 characters, RNA has 5 characters


False

<details><summary>Hint 1</summary>

Think about this step by step:

- Look at each character in the DNA string
- If it's a 'T', change it to 'U' in your output

</details>

Now, let's try on a larger dataset:

In [5]:
with open("data/rosalind_rna.txt", "r") as file:
    t = file.read().strip()  # Get a larger DNA sequence from a file

# Your code here
rna_result = "something"

validator(t, rna_result)  # Validate your answer with the larger sequence

RNA string contains invalid character(s): {'s', 'h', 'e', 't', 'i', 'n', 'g', 'o', 'm'}


False

<details><summary>Hint 2</summary>

You could solve this with a loop:

```python
for nucleotide in t:
    if nucleotide == "T":
        "Do something"
    else:
        "Do another thing"
```
</details>

<details><summary>Hint 3</summary>

For each character in the input, conditionally append to your output:

```python
rna_result = ""
for char in t:
    if condition:
        rna_result += "X"
    else:
        rna_result += char
```

</details>

Some solutions (there are many!):

<details><summary>Solution 1</summary>

Using a loop:

```python
rna_result = ""
for nucleotide in t:
    if nucleotide == "T":
        rna_result += "U"
    else:
        rna_result += nucleotide
```
</details>


<details><summary>Solution 2</summary>

Using a `join`

```python
rna_result = "U".join(t.split("T"))
```
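
To see why this works, it may help to look at the intermediate list on a short made-up string (`small` is a hypothetical example, not part of the problem data):

```python
small = "GATTC"

pieces = small.split("T")  # Every T becomes a boundary between pieces
print(pieces)  # ['GA', '', 'C']

rejoined = "U".join(pieces)  # A U is placed at every boundary
print(rejoined)  # GAUUC
```

The empty string between the two adjacent T's is what keeps both of them represented after joining.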
</details>

<details><summary>Solution 3</summary>

Using a `replace`

```python
rna_result = t.replace("T", "U")
```
</details>


# Advanced

There are several other string methods and techniques that could help solve this problem efficiently.

<details><summary>Other methods</summary>

- `str.translate()`: Create a translation table for character mapping
- `map()`: Apply a function to each character

Example with translate:

```python
translation_table = str.maketrans("T", "U")
rna_result = t.translate(translation_table)
```
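
A translation table can map several characters in one pass. As a sketch, assuming the input might also contain lowercase bases (an assumption for illustration; the Rosalind data is uppercase):

```python
# Each character in the first string maps to the character
# at the same position in the second string
table = str.maketrans("Tt", "Uu")

mixed_case = "GATtaca"  # Hypothetical mixed-case input
translated = mixed_case.translate(table)
print(translated)  # GAUuaca
```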

Example with map: 

```python
def swap_u(s):
    return "U" if s == "T" else s


"".join(map(swap_u, "ABCTTG"))  # 'ABCUUG'
```

</details>

Try implementing the solution using a different method, or create your own validator function that checks if the transcription follows the biological rules correctly.

In [None]:
# Your alternative solution here


# Bonus: Create a reverse function that converts RNA back to DNA
def rna_to_dna(rna_string):
    # Your code here
    pass