# Review print, variables, and data types

In the first session, we learned how to print values to the screen using `print`. We also learned that we can store values in **variables** and use them by referring to their names. Variables can store different types of data, such as integers (`int`), decimals (`float`), strings (`str`), and booleans (`bool`).

Which of the following variables do you think are **strings**? (*hint*: A string stores a sequence of characters, typically text)

```
a = "string"
b = 'two'
c = 3
d = '3'

```

I hope you can tell which variables are strings, but we can also let Python tell us!

First, let's create these variables by running the cell below:

In [None]:
a = 'string'
b = 'two'
c = 3
d = '3'

Python has a built-in function called `type` that tells you the type of information contained in a variable. Since we want to ask Python which of the variables are strings, we can simply run the function `type` on each variable. Try it by completing the cell below:

In [None]:
print(type(a))

### Use the print() and type() functions for variables b, c, and d ###

Did you notice that variables `c` and `d` have different data types even though both of them seem to store `3`?
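
One way to see why the type matters is to apply the same operator to both kinds of value. The sketch below uses made-up variable names (`count_int`, `count_str`) rather than the variables from the cells above:

```python
# Hypothetical variables, for illustration only
count_int = 3      # an integer
count_str = '3'    # a string that happens to look like a number

sum_of_ints = count_int + count_int   # + adds numbers
sum_of_strs = count_str + count_str   # + joins (concatenates) strings

print(sum_of_ints, type(sum_of_ints))
print(sum_of_strs, type(sum_of_strs))
```

The same `+` gives `6` for the integers but `33` for the strings, so it is worth checking the type whenever a value looks like a number.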

Write code in the cell below to do the following:

1. Create a new variable, `e`, and assign it the value of `1.5` as a string
2. Print the type of variable `e` and confirm it is a string
3. Create another new variable, `f`, and assign it the value of `1.5` as a float
4. Print the type of variable `f` and confirm it is a float

In [None]:
### Write your code here for 1 & 2 ###
e = 


In [None]:
### Write your code here for 3 & 4 ###
f =


****

# How to name variables

Before we go further, let's consider some variable names with the following experimental dataset:

|Group|Number of Mice|Average Weight(g)|Group Id|
|-----|--------------|---------------|--------|
|alpha|3|17.0|CGJ28371|
|beta|5|16.4|SJW99399|
|gamma|6|17.8|PWS29382|

Discuss with your partner what variable names you would use to describe:

* Number of mice in beta group
* Average weight of mice in gamma group

Create the variables and use a `print` statement to display the value of each variable, as well as its `type`.

In [None]:
### Write your code here ###



Did your variable names work? Did you get an ideal variable type for each variable? Also, hopefully, your variable names were easy-to-read and as clear as possible. For example, would the following names work?


* `alpha = 3` (3 what?)
* `beta_weight = 16.4` (maybe okay, but didn't we mean average weight?)
* `number_of_mice_in_beta_group = 5` (very descriptive but long and hard to read)

In the end, you will have to decide on names that balance how explicit they are with how easy they are to read. Here are some simple rules and conventions on how to name variables (more rules in the [Python style guide](https://peps.python.org/pep-0008/#naming-conventions)):

1. **Names can contain letters, numbers, and underscores** but cannot start with a number or include a dash `-`.
2. **Avoid Python's keywords and built-in names** (like `for`, `True`, `print`, etc.) since they already have special meanings in Python.
3. **Be descriptive** with your names so that it's clear what data the variable holds (not `a`, `b` or `c`).
4. **Use lowercase letters** and underscores to separate words (`my_variable_name`).
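
If you are unsure whether a name follows the first two rules, Python can check it for you. The sketch below uses the string method `isidentifier` and the standard-library `keyword` module; the candidate names are made up for illustration:

```python
import keyword

# Rule 1: letters, numbers, and underscores only; no leading number, no dash
ok_name = 'beta_mice_count'.isidentifier()
leading_digit = '2nd_group'.isidentifier()
with_dash = 'gamma-weight'.isidentifier()
print(ok_name, leading_digit, with_dash)

# Rule 2: keywords are reserved by Python and cannot be used as names
for_is_keyword = keyword.iskeyword('for')
print_is_keyword = keyword.iskeyword('print')
print(for_is_keyword, print_is_keyword)
```

Note that `print` is not a keyword, so Python would let you assign to it. Doing so hides the built-in function, which is why such names are best avoided.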


****

# Examining strings

Much of the data bioinformaticians work with comes in the form of strings; a DNA sequence `ATGCGCCGTA` is a string as far as Python is concerned.

Let's look at a few Python functions for working with strings. First create three new variables that represent the `Group Id` for each of the mouse groups in the table above:

In [None]:
### Write your code here ###
alpha_id = "CGJ28371"
beta_id = 
gamma_id = 

Next, let's look at `alpha_id`. To investigate the **length** of the id, we can use the function `len`!

In [None]:
print(len(alpha_id))

We can also examine each character in the string `alpha_id`. To do this, we specify which character to see by adding its position (index) in square brackets after the variable name. See what happens in the next cell:

In [None]:
print(alpha_id[0])

The statement above means "print the **0th element** of `alpha_id`".

But what is the 0th element?

In most computer languages, counting starts with 0 instead of 1. Therefore, breaking apart the string `CGJ28371`, here is how we could count it:

|Index|0|1|2|3|4|5|6|7|
|-----|-|-|-|-|-|-|-|-|
|Value|C|G|J|2|8|3|7|1|

In the cell below, print the last character of `beta_id`:

In [None]:
### Write your code here ###


## Slicing strings

Sometimes, we will want to extract more than one character from a string. **Slicing** allows you to extract a subset of a string by specifying the start and end positions.

The syntax for slicing is `string[start:end]`, where `start` is inclusive and `end` is exclusive. Run the cell below to see some examples:

In [None]:
# Example of slicing
my_string = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
print(my_string[0])
print(my_string[2])
print(my_string[0:3])
print(my_string[26:36])
print(my_string[26:])  # prints all the characters from the 26th element
print(my_string[:26]) # prints all the characters before the 26th element
print(my_string[:])  # prints all the characters

If you leave `start` empty, Python will assume you want to slice from the beginning, while if you leave `end` empty, it will slice until the end.\
What do you think will happen in the following code? Make a guess before running it and discuss the output with your partner.

In [None]:
print(my_string[-10:])

You can also specify a **step** with slicing, using `string[start:end:step]`, which can be used to skip characters or reverse a string.

In [None]:
# Example of slicing with a step
my_numbers = "123456789"
print(my_numbers[0:9])
print(my_numbers[0:9:2])  # prints every second character
print(my_numbers[::3])  # prints every third character
print(my_numbers[::-1]) # step = -1 will let you reverse the string

**Challenge**: In the `Group Id` data, the first three characters are the experimenter's initials, and the numbers are a unique ID number. Use the cells below to do the following:

1. Create new variables that contain the initials of the experimenter for each mouse group.
2. Print the value of these new variables.
3. Create new variables that contain the ID of the experimenter for each mouse group.
4. Print the value of these new variables.


In [None]:
### Write your code here for 1 & 2 ###



In [None]:
### Write your code here for 3 & 4 ###



****

# Investigate biological data

Now that we have learned how to examine strings, we are going to investigate biological sequences (DNA, RNA, or protein), which are often stored in **[FASTA](https://en.wikipedia.org/wiki/FASTA_format)** files. The FASTA format is one of the most widely used file formats in biology/bioinformatics. A typical FASTA file consists of one or more sequences, each represented as follows:

1. **Header Line**: Begins with a `>` character followed by the sequence identifier and optionally a description
2. **Sequence Lines**: Follow the header line and contain the sequence data. The sequence may span multiple lines, but should not include spaces or line breaks within the sequence itself.

Here is an example of a FASTA file containing two DNA sequences:

```
>sequence 001
ATTCGAGGATCGATTTCGATCGATGCTTAGCTTTAGCTTTTTTAGATCTCCCA

>sequence 002
AAGCTGACGGGGAGCTAGTCTTAGTCGTACGTTCGAT
```

**Challenge**: Write code that will do the following:

1. Create a variable that holds the name of the sequence (e.g., `sequence 003`)
2. Create a variable that holds a sequence string (e.g., `ATCGATCGATCG`)
3. Print the name and sequence in the proper FASTA format like above

Discuss what you think you will need to do with your partner and use the cell below to complete the challenge.

*Tip*: You can do this in 3 lines of code by calling `print` only once instead of twice. To make a new line in the printed output, use the new line character `\n`. Python will not print `\n` to the screen but interpret that you mean to end one line at that location and begin a new line.
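
To see the new line character on its own, here is a small sketch with made-up text (not the challenge answer). It also shows that Python treats `\n` as a single character:

```python
# Made-up text, for illustration only
message = 'first line\nsecond line'
print(message)            # the output is split across two lines

short_text = 'a\nb'
print(len(short_text))    # \n counts as one character
```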

In [None]:
### Write your code here ###




## Count and substitutions

Another important string tool is the ability to check various properties of a string. As you have already seen, the `len` function allows you to count the number of characters in a string.

In [None]:
alphabets = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
print(len(alphabets))

You can also count how many times a specific character appears in a string using `count()`.

In [None]:
my_string = 'ABCDABCDABCD'
print(my_string.count('A'))

Here, we are using a **method** by calling `.count` on your string **object**. In Python, **methods** are functions that are associated with an object and can operate on data within that object. When you see `string.count()`, it means we're calling a method on a `string` object. This is a part of Python's **[object-oriented programming](https://www.geeksforgeeks.org/introduction-of-object-oriented-programming/)** features.

A `string` in Python is an object, and it has several built-in methods that perform specific tasks. This is why you can use `.method()` syntax directly on string variables to manipulate them or to get information about them.
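
The difference between a function and a method is mostly in how you call it. A minimal sketch, using a made-up sequence:

```python
# A made-up sequence, for illustration only
sequence = 'GATTACA'

# len is a function: the string is passed in as an argument
length = len(sequence)
print(length)

# count is a method: it is called on the string object with a dot
number_of_a = sequence.count('A')
print(number_of_a)

# Methods can also be called directly on a string value
number_of_t = 'GATTACA'.count('T')
print(number_of_t)
```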

Let's see what Python knows about strings and all the methods for string objects by using the `help` function:

In [None]:
help(str)

Wow, that is a lot of help text! But its structure is pretty simple: the output begins with `class str(object)`, indicating that `str` is a **class** (a blueprint for creating objects). After the description of the string class, there is a long section describing all the methods defined for the string class, including `count`. Are there any other methods you might be interested in using?

Much of it might be hard (and even unnecessary) for you to read right now, but it's good to know this exists! This, combined with resources like ChatGPT and Stack Overflow, will help you answer a lot of your questions. Looking at the documentation, you might find capitalization methods like `upper` and `lower`. What do you think will happen if you run the cell below?

In [None]:
my_uppercase_string = 'ABCDEFG'
my_new_string = my_uppercase_string.lower()
print(my_new_string)

In this example, `my_uppercase_string` is initially set to `'ABCDEFG'`. We then apply the `.lower()` method, which converts all the uppercase letters to lowercase, and assign the result to `my_new_string`. Notice how the output from the method directly becomes the value of the new variable when we print it.

Can you write code that will print the uppercase version of a lowercase string? First, create a variable for a lowercase string and then apply the `upper` method.

In [None]:
### Write your code here ###



## Replace

Another key string method is `replace`, which (literally) replaces characters in a string. The method works like this:

```
str.replace(old, new)
```
where 

* old: character(s) to be replaced
* new: character(s) to replace with

Let's see how it can be used!

In [None]:
string_1 = 'ABCEEFG'
print(f"original: {string_1}\nafter replace: {string_1.replace('A', 'Z')}")
string_2 = 'I like to eat ?'
print(f"original: {string_2}\nafter replace: {string_2.replace('?', 'spaghetti')}")

Using the `replace` method, we can easily convert a DNA sequence to an RNA sequence! In DNA, the bases are `A`, `T`, `G`, and `C`. In RNA, the base `U` replaces `T`. Thus, to transcribe DNA into RNA, simply replace all `T`'s in the DNA sequence with `U`'s:

In [None]:
DNA = 'AGATGGGCTTACTGATCGACCCAGTACGATCGTATTTTTCATCGT'
RNA = DNA.replace('T','U')
print(RNA)

`replace` is convenient when you want to replace all occurrences of a specified value, but if you need to replace just one character at a specific position, string slicing is often the best approach. For example, suppose we want to replace the `T` in the middle of the following DNA with a `G`:

In [6]:
DNA = 'TTTATATTT'
print(f"replace: {DNA.replace('T', 'G')}")

pos = 4  # the T in the middle is the 4th element (starting with 0th)
new_DNA = DNA[:pos] + 'G' + DNA[pos+1:]
print(f"string slicing: {new_DNA}")

replace: GGGAGAGGG
string slicing: TTTAGATTT


Here, `DNA[:pos]` slices everything before the `pos` (`TTTA`), `G` is our new nucleotide at the `pos`, and `DNA[pos+1:]` slices everything from `pos+1` (`ATTT`).
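
As an aside, `replace` also accepts an optional third argument that limits how many occurrences are replaced. It always works from the left, though, so it still cannot target a position in the middle of the string. A short sketch (the variable names are made up):

```python
example_dna = 'TTTATATTT'

# Replace only the first T, counting from the left
first_only = example_dna.replace('T', 'G', 1)
print(first_only)

# Replace only the first three Ts
first_three = example_dna.replace('T', 'G', 3)
print(first_three)
```

Neither call changes only the middle `T`, which is why slicing is the better tool for that job.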

Now try transcribing the following DNA to RNA, and mutate one of the `G`s to `U`:

In [None]:
### Write your code here ###
DNA = 'ATGAATCGT'
RNA = 
mutated_RNA = 

print(mutated_RNA)

Note: You might wonder why we can't simply modify a character using an index like `str[i] = 'U'`. This isn't allowed because strings in Python are **immutable**, meaning their content cannot be changed directly. Instead, we use string slicing to create a *new* string with the desired change.
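
You can see immutability in action with the sketch below. It uses a made-up sequence and a `try`/`except` block, which we have not covered yet; here it simply catches the error so the rest of the code can keep running:

```python
rna = 'AUGGCC'  # made-up sequence, for illustration only

try:
    rna[0] = 'C'              # attempt to change a character in place
except TypeError as error:
    print('Not allowed:', error)

new_rna = 'C' + rna[1:]       # build a new string instead
print(rna)                    # the original string is unchanged
print(new_rna)
```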


****

# Examine the HIV genome

Human Immunodeficiency Virus (HIV) is a retrovirus that attacks the immune system, specifically targeting CD4 cells, which play a critical role in immune response. Over time, this can lead to Acquired Immunodeficiency Syndrome (AIDS), which is characterized by the immune system's failure to protect against infections and certain cancers.

The [HIV genome](https://www.hiv.lanl.gov/components/sequence/HIV/asearch/query_one.comp?se_id=K03455) is relatively small and complex, consisting of two identical strands of RNA enclosed within the virus particle. Each RNA strand contains nine genes encoded by just over 9,000 nucleotides. 

In our exercise, we will use Python string slicing and manipulation techniques to extract specific genes from the HIV-1 (HXB2) genome sequence, providing insight into how computational tools can assist in understanding and researching genetics.


In [None]:
# Human immunodeficiency virus type 1 (HXB2), complete genome;
# ACCESSION   K03455
# VERSION     K03455.1 GI:1906382

hiv_genome = 'tggaagggctaattcactcccaacgaagacaagatatccttgatctgtggatctaccacacacaaggctacttccctgattagcagaactacacaccagggccagggatcagatatccactgacctttggatggtgctacaagctagtaccagttgagccagagaagttagaagaagccaacaaaggagagaacaccagcttgttacaccctgtgagcctgcatggaatggatgacccggagagagaagtgttagagtggaggtttgacagccgcctagcatttcatcacatggcccgagagctgcatccggagtacttcaagaactgctgacatcgagcttgctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatcctgcatataagcagctgctttttgcctgtactgggtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagacccttttagtcagtgtggaaaatctctagcagtggcgcccgaacagggacctgaaagcgaaagggaaaccagaggagctctctcgacgcaggactcggcttgctgaagcgcgcacggcaagaggcgaggggcggcgactggtgagtacgccaaaaattttgactagcggaggctagaaggagagagatgggtgcgagagcgtcagtattaagcgggggagaattagatcgatgggaaaaaattcggttaaggccagggggaaagaaaaaatataaattaaaacatatagtatgggcaagcagggagctagaacgattcgcagttaatcctggcctgttagaaacatcagaaggctgtagacaaatactgggacagctacaaccatcccttcagacaggatcagaagaacttagatcattatataatacagtagcaaccctctattgtgtgcatcaaaggatagagataaaagacaccaaggaagctttagacaagatagaggaagagcaaaacaaaagtaagaaaaaagcacagcaagcagcagctgacacaggacacagcaatcaggtcagccaaaattaccctatagtgcagaacatccaggggcaaatggtacatcaggccatatcacctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtgatacccatgttttcagcattatcagaaggagccaccccacaagatttaaacaccatgctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaagctgcagaatgggatagagtgcatccagtgcatgcagggcctattgcaccaggccagatgagagaaccaaggggaagtgacatagcaggaactactagtacccttcaggaacaaataggatggatgacaaataatccacctatcccagtaggagaaatttataaaagatggataatcctgggattaaataaaatagtaagaatgtatagccctaccagcattctggacataagacaaggaccaaaggaaccctttagagactatgtagaccggttctataaaactctaagagccgagcaagcttcacaggaggtaaaaaattggatgacagaaaccttgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattgggaccagcggctacactagaagaaatgatgacagcatgtcagggagtaggaggacccggccataaggcaagagttttggctgaagcaatgagccaagtaacaaattcagctaccataatgatgcagagaggcaattttaggaaccaaagaaagattgttaagtgtttcaattgtggcaaagaaggg
cacacagccagaaattgcagggcccctaggaaaaagggctgttggaaatgtggaaaggaaggacaccaaatgaaagattgtactgagagacaggctaattttttagggaagatctggccttcctacaagggaaggccagggaattttcttcagagcagaccagagccaacagccccaccagaagagagcttcaggtctggggtagagacaacaactccccctcagaagcaggagccgatagacaaggaactgtatcctttaacttccctcaggtcactctttggcaacgacccctcgtcacaataaagataggggggcaactaaaggaagctctattagatacaggagcagatgatacagtattagaagaaatgagtttgccaggaagatggaaaccaaaaatgatagggggaattggaggttttatcaaagtaagacagtatgatcagatactcatagaaatctgtggacataaagctataggtacagtattagtaggacctacacctgtcaacataattggaagaaatctgttgactcagattggttgcactttaaattttcccattagccctattgagactgtaccagtaaaattaaagccaggaatggatggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcattagtagaaatttgtacagagatggaaaaggaagggaaaatttcaaaaattgggcctgaaaatccatacaatactccagtatttgccataaagaaaaaagacagtactaaatggagaaaattagtagatttcagagaacttaataagagaactcaagacttctgggaagttcaattaggaataccacatcccgcagggttaaaaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttcagttcccttagatgaagacttcaggaagtatactgcatttaccatacctagtataaacaatgagacaccagggattagatatcagtacaatgtgcttccacagggatggaaaggatcaccagcaatattccaaagtagcatgacaaaaatcttagagccttttagaaaacaaaatccagacatagttatctatcaatacatggatgatttgtatgtaggatctgacttagaaatagggcagcatagaacaaaaatagaggagctgagacaacatctgttgaggtggggacttaccacaccagacaaaaaacatcagaaagaacctccattcctttggatgggttatgaactccatcctgataaatggacagtacagcctatagtgctgccagaaaaagacagctggactgtcaatgacatacagaagttagtggggaaattgaattgggcaagtcagatttacccagggattaaagtaaggcaattatgtaaactccttagaggaaccaaagcactaacagaagtaataccactaacagaagaagcagagctagaactggcagaaaacagagagattctaaaagaaccagtacatggagtgtattatgacccatcaaaagacttaatagcagaaatacagaagcaggggcaaggccaatggacatatcaaatttatcaagagccatttaaaaatctgaaaacaggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaattaacagaggcagtgcaaaaaataaccacagaaagcatagtaatatggggaaagactcctaaatttaaactgcccatacaaaaggaaacatgggaaacatggtggacagagtattggcaagccacctggattcctgagtgggagtttgttaatacccctcccttagtgaaattatggtaccagttagagaaagaacccatagtaggagcagaaaccttctatgtagatggggcagctaacagggagactaaattaggaaaagcaggatatgttactaatagaggaagacaaaaagttgtcaccctaactgacacaacaaatcagaagactgagtta
caagcaatttatctagctttgcaggattcgggattagaagtaaacatagtaacagactcacaatatgcattaggaatcattcaagcacaaccagatcaaagtgaatcagagttagtcaatcaaataatagagcagttaataaaaaaggaaaaggtctatctggcatgggtaccagcacacaaaggaattggaggaaatgaacaagtagataaattagtcagtgctggaatcaggaaagtactatttttagatggaatagataaggcccaagatgaacatgagaaatatcacagtaattggagagcaatggctagtgattttaacctgccacctgtagtagcaaaagaaatagtagccagctgtgataaatgtcagctaaaaggagaagccatgcatggacaagtagactgtagtccaggaatatggcaactagattgtacacatttagaaggaaaagttatcctggtagcagttcatgtagccagtggatatatagaagcagaagttattccagcagaaacagggcaggaaacagcatattttcttttaaaattagcaggaagatggccagtaaaaacaatacatactgacaatggcagcaatttcaccggtgctacggttagggccgcctgttggtgggcgggaatcaagcaggaatttggaattccctacaatccccaaagtcaaggagtagtagaatctatgaataaagaattaaagaaaattataggacaggtaagagatcaggctgaacatcttaagacagcagtacaaatggcagtattcatccacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaagaatagtagacataatagcaacagacatacaaactaaagaattacaaaaacaaattacaaaaattcaaaattttcgggtttattacagggacagcagaaatccactttggaaaggaccagcaaagctcctctggaaaggtgaaggggcagtagtaatacaagataatagtgacataaaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaagtagacaggatgaggattagaacatggaaaagtttagtaaaacaccatatgtatgtttcagggaaagctaggggatggttttatagacatcactatgaaagccctcatccaagaataagttcagaagtacacatcccactaggggatgctagattggtaataacaacatattggggtctgcatacaggagaaagagactggcatttgggtcagggagtctccatagaatggaggaaaaagagatatagcacacaagtagaccctgaactagcagaccaactaattcatctgtattactttgactgtttttcagactctgctataagaaaggccttattaggacacatagttagccctaggtgtgaatatcaagcaggacataacaaggtaggatctctacaatacttggcactagcagcattaataacaccaaaaaagataaagccacctttgcctagtgttacgaaactgacagaggatagatggaacaagccccagaagaccaagggccacagagggagccacacaatgaatggacactagagcttttagaggagcttaagaatgaagctgttagacattttcctaggatttggctccatggcttagggcaacatatctatgaaacttatggggatacttgggcaggagtggaagccataataagaattctgcaacaactgctgtttatccattttcagaattgggtgtcgacatagcagaataggcgttactcgacagaggagagcaagaaatggagccagtagatcctagactagagccctggaagcatccaggaagtcagcctaaaactgcttgtaccaattgctattgtaaaaagtgttgctttcattgccaagtttgtttcataacaaaagccttaggcatctcctatggcaggaagaagcgg
agacagcgacgaagagctcatcagaacagtcagactcatcaagcttctctatcaaagcagtaagtagtacatgtaacgcaacctataccaatagtagcaatagtagcattagtagtagcaataataatagcaatagttgtgtggtccatagtaatcatagaatataggaaaatattaagacaaagaaaaatagacaggttaattgatagactaatagaaagagcagaagacagtggcaatgagagtgaaggagaaatatcagcacttgtggagatgggggtggagatggggcaccatgctccttgggatgttgatgatctgtagtgctacagaaaaattgtgggtcacagtctattatggggtacctgtgtggaaggaagcaaccaccactctattttgtgcatcagatgctaaagcatatgatacagaggtacataatgtttgggccacacatgcctgtgtacccacagaccccaacccacaagaagtagtattggtaaatgtgacagaaaattttaacatgtggaaaaatgacatggtagaacagatgcatgaggatataatcagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactctgtgttagtttaaagtgcactgatttgaagaatgatactaataccaatagtagtagcgggagaatgataatggagaaaggagagataaaaaactgctctttcaatatcagcacaagcataagaggtaaggtgcagaaagaatatgcatttttttataaacttgatataataccaatagataatgatactaccagctataagttgacaagttgtaacacctcagtcattacacaggcctgtccaaaggtatcctttgagccaattcccatacattattgtgccccggctggttttgcgattctaaaatgtaataataagacgttcaatggaacaggaccatgtacaaatgtcagcacagtacaatgtacacatggaattaggccagtagtatcaactcaactgctgttaaatggcagtctagcagaagaagaggtagtaattagatctgtcaatttcacggacaatgctaaaaccataatagtacagctgaacacatctgtagaaattaattgtacaagacccaacaacaatacaagaaaaagaatccgtatccagagaggaccagggagagcatttgttacaataggaaaaataggaaatatgagacaagcacattgtaacattagtagagcaaaatggaataacactttaaaacagatagctagcaaattaagagaacaatttggaaataataaaacaataatctttaagcaatcctcaggaggggacccagaaattgtaacgcacagttttaattgtggaggggaatttttctactgtaattcaacacaactgtttaatagtacttggtttaatagtacttggagtactgaagggtcaaataacactgaaggaagtgacacaatcaccctcccatgcagaataaaacaaattataaacatgtggcagaaagtaggaaaagcaatgtatgcccctcccatcagtggacaaattagatgttcatcaaatattacagggctgctattaacaagagatggtggtaatagcaacaatgagtccgagatcttcagacctggaggaggagatatgagggacaattggagaagtgaattatataaatataaagtagtaaaaattgaaccattaggagtagcacccaccaaggcaaagagaagagtggtgcagagagaaaaaagagcagtgggaataggagctttgttccttgggttcttgggagcagcaggaagcactatgggcgcagcctcaatgacgctgacggtacaggccagacaattattgtctggtatagtgcagcagcagaacaatttgctgagggctattgaggcgcaacagcatctgttgcaactcacagtctggggcatcaagcagctccaggcaagaatcctggctgtggaaagatacctaa
aggatcaacagctcctggggatttggggttgctctggaaaactcatttgcaccactgctgtgccttggaatgctagttggagtaataaatctctggaacagatttggaatcacacgacctggatggagtgggacagagaaattaacaattacacaagcttaatacactccttaattgaagaatcgcaaaaccagcaagaaaagaatgaacaagaattattggaattagataaatgggcaagtttgtggaattggtttaacataacaaattggctgtggtatataaaattattcataatgatagtaggaggcttggtaggtttaagaatagtttttgctgtactttctatagtgaatagagttaggcagggatattcaccattatcgtttcagacccacctcccaaccccgaggggacccgacaggcccgaaggaatagaagaagaaggtggagagagagacagagacagatccattcgattagtgaacggatccttggcacttatctgggacgatctgcggagcctgtgcctcttcagctaccaccgcttgagagacttactcttgattgtaacgaggattgtggaacttctgggacgcagggggtgggaagccctcaaatattggtggaatctcctacagtattggagtcaggaactaaagaatagtgctgttagcttgctcaatgccacagccatagcagtagctgaggggacagatagggttatagaagtagtacaaggagcttgtagagctattcgccacatacctagaagaataagacagggcttggaaaggattttgctataagatgggtggcaagtggtcaaaaagtagtgtgattggatggcctactgtaagggaaagaatgagacgagctgagccagcagcagatagggtgggagcagcatctcgagacctggaaaaacatggagcaatcacaagtagcaatacagcagctaccaatgctgcttgtgcctggctagaagcacaagaggaggaggaggtgggttttccagtcacacctcaggtacctttaagaccaatgacttacaaggcagctgtagatcttagccactttttaaaagaaaaggggggactggaagggctaattcactcccaaagaagacaagatatccttgatctgtggatctaccacacacaaggctacttccctgattagcagaactacacaccagggccaggggtcagatatccactgacctttggatggtgctacaagctagtaccagttgagccagataagatagaagaggccaataaaggagagaacaccagcttgttacaccctgtgagcctgcatgggatggatgacccggagagagaagtgttagagtggaggtttgacagccgcctagcatttcatcacgtggcccgagagctgcatccggagtacttcaagaactgctgacatcgagcttgctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatcctgcatataagcagctgctttttgcctgtactgggtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagacccttttagtcagtgtggaaaatctctagca'

The following diagram shows genes in the HIV genome, with the first and last position of nucleotides in each gene. Given what we have learned and the info below, complete the tasks in the following code cells! (More info on [HIV Genome Landmarks](http://www.hiv.lanl.gov/content/sequence/HIV/MAP/landmark.html))

![HIV Genome Landmarks](img/hxb2genome.gif)

1. Check and print the length of the HIV genome

In [None]:
### Write your code here ###



2. Create variables for the following HIV genes and print them out (one at a time)

<details>
  <summary>Hint 1</summary>
  
   *hint 1*: Use `hiv_genome[start:end]`
</details>


<details>
  <summary>Hint 2</summary>
  
   *hint 2*: Python's counting starts with zero but counting in the diagram starts with one
</details>

<details>
  <summary>Hint 3</summary>  
    
   *hint 3*: When you print them out, every gene should start with `atg` (the genome above is written in lowercase). This is the **start codon**, which encodes 'M' (methionine) when translated into a protein. The one exception is the `pol` gene, which is unique! If there is a rule at the start of a gene, should there be a rule at the end too?
</details>
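
The counting difference mentioned in the hints can be tried out on a toy string first. The sketch below assumes that positions on a diagram count from 1 and include both the first and the last position; the string and the positions are made up:

```python
# Toy string standing in for a genome (hypothetical)
toy = 'abcdefghij'

# Suppose a feature runs from position 3 to position 6,
# counting from 1 and including both ends: c, d, e, f
start_1based = 3
end_1based = 6

# Python counts from 0 and excludes the end index,
# so only the start needs to be shifted by one
feature = toy[start_1based - 1:end_1based]
print(feature)
print(len(feature))
```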

In [None]:
### Write your code here ###
# gag
# pol
# vif
# vpr
# env


3. Generate the RNA sequence from the DNA sequence for each of the genes you have isolated above, store them in variables, and check by printing them out

In [None]:
### Write your code here ###







4. For each gene, count the number of each nucleotide in that gene (# of `A`s, # of `U`s, # of `G`s, # of `C`s) and store the values in variables (always good to check by printing them out!)

In [None]:
### Write your code here ###







5. For each gene, calculate the **GC content** (%). GC content is the percentage of nucleotides in the gene that are `G` or `C`, calculated by:
```
GC content (%) = (number of Gs + number of Cs) / total number of nucleotides in the gene * 100
```
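
As a worked example on a made-up sequence (not one of the HIV genes), the calculation might look like the sketch below. Note the parentheses around the sum: without them, Python would divide before adding.

```python
toy_rna = 'GGCCAAUU'  # made-up sequence, for illustration only

g_count = toy_rna.count('G')
c_count = toy_rna.count('C')

# Add the G and C counts first, then divide by the length and scale to percent
gc_content = (g_count + c_count) / len(toy_rna) * 100
print(gc_content)
```

Here 4 of the 8 nucleotides are `G` or `C`, so the GC content is 50%.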

In [None]:
### Write your code here ###





