<h1><center><ins>Lists and Loops</ins></center></h1>


<div class = "alert alert-block alert-info">

<h3><center><ins>Background</ins></center></h3>
    
> As a Bioinformaticist/Bioinformatician, you will almost certainly be dealing with Big Data. 

> Data that is too big to be stored in an Excel file.

> And data that has to be parsed very quickly without our file or system freezing.

> At the same time, you will very often have to read multiple sets of data, which may be stored in multiple files or data structures (lists, dictionaries, sets, tuples, etc.).

> The process of reading over multiple sets of data is called **looping.**

> And a data type/structure often used by Bioinformaticians is the **list** - which we will be looking at today.

> Together, these two allow us to automate a lot of processes.

> For example, you may want to scan through all the SNPs in one file belonging to a participant

> And match it to those of another participant, in a different file, to find common SNPs (**see below**).

**<center>File 1</center>** 

|        Gene              |                 SNP          |
|--------------------      |------------------------------|                
|    <center>TP53</center> | <center>**rs54** </center>   |
|    <center>TP53</center> | <center>rs111</center>       |    
|    <center>TP53</center> | <center>rs1081</center>      | 
|    <center>TP53</center> | <center>**rs5099** </center> |  
    
<br><br>    
**<center>File 2</center>** 

| Gene                 |          SNP                     |
|--------------------  |----------------------------------|                
|<center>TP53</center> |  <center>rs19</center>           |
|<center>TP53</center> | <center>**rs54**</center>        |  
|<center>TP53</center> | <center>**rs5099** </center>     |
|<center>TP53</center> | <center>rs7878</center>          |  
    

> This is fine to do manually if the files contain only four lines each

> But it becomes far more problematic if you have thousands of lines or more - which is often the case in Biology.




<h2><center><ins>What are lists?</ins></center></h2>

<div class = "alert alert-block alert-info">
    
   

> A data type/data structure often used by Bioinformaticians is the **list** 

> Lists allow us to store multiple pieces of data under one variable name, which we can easily access or tease out using some of the **methods** that belong to lists.

> They may contain integers, floats, strings, Booleans, or even other lists, or more complicated data types  

> Lists can be recognized by their square brackets.

> For example, you may have a list containing all the SNPs of participant 1


> E.g. **`Participant_1_SNPs = ["rs54", "rs111", "rs1081", "rs7878"]`**

> Each SNP in the list would be called an **element** of that list

> Elements are separated by commas

> An empty list looks like this 

> **`empty_list = [ ]`**



<h2><center><ins>List Indexing</ins></center></h2>

<div class = "alert alert-block alert-info">


> If you wish to access an **element** in the list

> You may do so by using **indexing**

> Remember from **strings**, that in Python, indexing starts at **0** 

> So for the list **`Participant_1_SNPs = ["rs54", "rs111", "rs1081", "rs7878"]`**

> The SNP **rs54**, is **element 1**



<div class = "alert alert-block alert-danger">

<h2>Is that correct?</h2>


> ***NO***

> SNP **rs54**, should be **element 0**

<div class = "alert alert-block alert-info">

> To have access to an element in the list, for whatever reason (maybe to print it), you give the **list name**, followed by **square brackets** and the **index number** that corresponds to the element in your list

> For example, if we want the first **element**, which is SNP **rs54** in this case, we write:

> `Participant_1_SNPs[0]`

> If we want to print this to STDOUT:

> `print (Participant_1_SNPs[0])`

> Similarly, the last element in the list named `Participant_1_SNPs`, is **rs7878**

> Which corresponds to `Participant_1_SNPs[3]`

> OR

> `Participant_1_SNPs[-1]`


In [8]:
# You may also slice out multiple elements
Participant_1_SNPs = ["rs54", "rs111", "rs1081", "rs7878"]
snps_2_and_3 = Participant_1_SNPs[1:3]
# 1 <= index < 3, so you get the elements at index 1 and index 2

print(snps_2_and_3) # you will see that a slice of a list is still a list

['rs111', 'rs1081']
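
> The indexing and slicing rules above can be tried out together. This is only a minimal sketch; the variable names `first_snp`, `last_snp`, `first_two` and `last_two` are illustrative and not part of the original notebook.

```python
Participant_1_SNPs = ["rs54", "rs111", "rs1081", "rs7878"]

# Positive indices count from the front, starting at 0
first_snp = Participant_1_SNPs[0]

# Negative indices count from the back, starting at -1
last_snp = Participant_1_SNPs[-1]

# Leaving out the start of a slice means "from the beginning"
first_two = Participant_1_SNPs[:2]

# Leaving out the stop of a slice means "up to the end"
last_two = Participant_1_SNPs[-2:]

print(first_snp)  # rs54
print(last_snp)   # rs7878
print(first_two)  # ['rs54', 'rs111']
print(last_two)   # ['rs1081', 'rs7878']
```

> Note that a single index gives you back the element itself (here a string), while a slice always gives you back a list.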


<div class = "alert alert-block alert-info">

> But lists also have a method to retrieve the position of an element

> It's called the **`.index ()`** method

> E.g. if you know that the SNP **rs111** is in your list, and you want to know where it is,

> You may use: **`Participant_1_SNPs.index("rs111")`**


In [None]:
Participant_1_SNPs = ["rs54", "rs111", "rs1081", "rs7878"]

Participant_1_SNPs.index("rs111")
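
> As a small sketch of what `.index()` gives back, the example below uses a made-up list called `snps` in which one SNP appears twice (this list is an assumption for illustration, not data from the files above).

```python
snps = ["rs54", "rs111", "rs1081", "rs111"]  # hypothetical list with a repeated SNP

# .index() returns the position of the FIRST matching element
position = snps.index("rs111")
print(position)  # 1

# The returned position can be used to index back into the list
print(snps[position])  # rs111
```

> If the element is not in the list at all, `.index()` stops with a `ValueError`, so it is safest to use it when you already know the element is there.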

<h2><center><ins>Adding elements to a list</ins></center></h2>

<div class = "alert alert-block alert-info">
    
> One of the technical jargon words that you'll often hear with lists is that they are **mutable**

> This is just a nerdy way of saying that they can be changed without needing to create a new variable/list name to store the newly changed list

> You can change a list at any time by **adding**, **removing,** or **changing** objects/elements.

> Let's say you discovered a new SNP that needs to be added to Participant 1's list

> You will then use the built-in list method called

> **`.append( )`**

> Appending adds the new element to the **END** of the list. That is, after appending, the new element is **`your_list[-1]`**

In [9]:
Participant_1_SNPs = ["rs54", "rs111", "rs1081", "rs7878"]

print (Participant_1_SNPs)

print (len(Participant_1_SNPs)) # we can also check the length to double-check if the list is growing

['rs54', 'rs111', 'rs1081', 'rs7878']
4


In [10]:
# To add SNP rs9999 to the list Participant_1_SNPs
Participant_1_SNPs.append("rs9999")

print (Participant_1_SNPs) # can you see that the new element is at the end?
print (len(Participant_1_SNPs))

['rs54', 'rs111', 'rs1081', 'rs7878', 'rs9999']
5
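
> Appending covers the **adding** part of mutability. The sketch below hints at the **changing** and **removing** parts. It uses a separate list called `snps` and a made-up SNP ID (`rs222`), both of which are assumptions for illustration only.

```python
snps = ["rs54", "rs111", "rs1081", "rs7878"]

# Change an element in place by assigning to its index
snps[1] = "rs222"  # "rs222" is a made-up SNP ID
print(snps)  # ['rs54', 'rs222', 'rs1081', 'rs7878']

# Remove an element by its value with .remove()
snps.remove("rs1081")
print(snps)  # ['rs54', 'rs222', 'rs7878']

# Remove the last element with .pop(), which also hands it back to you
removed_snp = snps.pop()
print(removed_snp)  # rs7878
print(len(snps))    # 2
```

> In all three cases the list keeps its name; no new list is created.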


<div class = "alert alert-block alert-info">

> You can also add lists together

> Let's create three lists, and add them, and see what happens

In [11]:
dna_sequences =['ATGATTCGCC', 'GGGCCTAA', 'ATCCNTT'] #strings
positions = [5, 9041.21, 7984512] # integers and floats
booleans = [True, True, False] # booleans

mixed_list  =  dna_sequences + positions + booleans 

print(mixed_list)

['ATGATTCGCC', 'GGGCCTAA', 'ATCCNTT', 5, 9041.21, 7984512, True, True, False]


In [12]:
# let's create a new list and append it to `mixed_list`

IDs = [10.2, 78,45, 99.9974]

mixed_list.append(IDs)

print(mixed_list)


['ATGATTCGCC', 'GGGCCTAA', 'ATCCNTT', 5, 9041.21, 7984512, True, True, False, [10.2, 78, 45, 99.9974]]


<div class = "alert alert-block alert-info">

> I want you to notice two things

> Firstly, when you use the **`.append ()`** method, the whole list **IDs** is added as ONE element (a list inside a list). This may not be what you intended, if your goal was to add each element to the list individually

> For that, it may be better to just concatenate the lists (using the **`+`**), or to use the method we will discuss next 

> Lastly, if you keep running the block above, check what happens to the list - this is PART of what we mean, when we say that lists are **mutable**



<h2><center><ins>Quick Hands-on exercise</ins></center></h2>

**Exercise 1**

> Create an empty list and call it **foods**

> Then populate the list with **three** of your favourite foods or types of cuisine

> **print** your list **before and after** you added your elements

In [13]:
foods = []
print(foods)
foods.append("thai food")
foods.append("sushi")
foods.append("italian")
print(foods)

[]
['thai food', 'sushi', 'italian']


<div class = "alert alert-block alert-info">
    
> Now let's go back to the list called **mixed_list**

> If your intention was to add each element in the list called **IDs** to the list called **mixed_list**, then apart from using normal concatenation, you may also use the

> **`.extend ( )`** method

> Extending also adds the elements to the **end** of the list

> Note that this changes the list called **mixed_list**; it does not create a new list

In [14]:
# Note that this works, because this block creates the lists afresh
# So we've reassigned all the lists and their elements

dna_sequences =['ATGATTCGCC', 'GGGCCTAA', 'ATCCNTT']
positions = [5, 9041, 7984512]
booleans = [True, True, False]

mixed_list  =  dna_sequences + positions + booleans
# print(mixed_list)


IDs = [10.2, 78,45, 99.9974]
mixed_list.extend(IDs)
print(mixed_list)



['ATGATTCGCC', 'GGGCCTAA', 'ATCCNTT', 5, 9041, 7984512, True, True, False, 10.2, 78, 45, 99.9974]


<h2><center><ins>Quick Hands-on exercise</ins></center></h2>

**Exercise 2**

> You have your list called **foods** and you want to add the items from the list called

> `drinks = ["coffee", "cappuccino", "flat white", "smoothie", "sweet red wine"]`

> If you don't want to add these one at a time using `.append()`, which method(s) can you use?

In [15]:
foods = ['thai food', 'sushi', 'italian']
drinks = ["coffee", "cappuccino", "flat white", "smoothie", "sweet red wine"]


foods = foods + drinks

print (foods)

['thai food', 'sushi', 'italian', 'coffee', 'cappuccino', 'flat white', 'smoothie', 'sweet red wine']


In [16]:
foods = ['thai food', 'sushi', 'italian']
drinks = ["coffee", "cappuccino", "flat white", "smoothie", "sweet red wine"]

foods.extend(drinks)
print (foods)

['thai food', 'sushi', 'italian', 'coffee', 'cappuccino', 'flat white', 'smoothie', 'sweet red wine']


<div class = "alert alert-block alert-danger">

> Note that with concatenation, you can create a new list by adding the two lists

> But `.extend ()` changes the existing list in place and returns `None`, so you cannot use it to create a new list

In [17]:
# this will work

foods = ['thai food', 'sushi', 'italian']
drinks = ["coffee", "cappuccino", "flat white", "smoothie", "wine"]

foods_and_drinks = foods + drinks

print (foods_and_drinks )

['thai food', 'sushi', 'italian', 'coffee', 'cappuccino', 'flat white', 'smoothie', 'wine']


In [18]:
# this will NOT work: .extend() changes foods in place and returns None

foods = ['thai food', 'sushi', 'italian']
drinks = ["coffee", "cappuccino", "flat white", "smoothie", "wine"]

foods_and_drinks  = foods.extend(drinks)
print (foods_and_drinks )

None


<h2><center><ins>Sorting and Reversing</ins></center></h2>

<div class = "alert alert-block alert-info">
    
> Lists come with many of their own methods

> Don't forget that you can use **`help(list)`** to see all the things you can do with lists

> We will look at two more now

> The **`.reverse ()`** and the **`.sort ()`** methods, are quite intuitive

> **`.reverse ()`**  takes no argument

> With **`.sort ()`**, you may give no argument, or you may use two optional arguments (**key** and **reverse = True**)

> The **`.sort ()`** method arranges the list in **alphabetical order** if the elements are **strings** (note that uppercase letters sort before lowercase letters)

> And **ascending numerical order** when the elements are **numbers** (1,2,3,4,5,6...)

In [19]:
# This time, let's use the list from the book

apes = ["Gorilla gorilla", "Pan troglodytes",  "Homo sapiens"]
print(apes)

# Let's say you don't like the order of this list, 
# because you believe that humans should sit at the top of the hierarchy

# You may use the .reverse () method

apes.reverse()
print(apes)


['Gorilla gorilla', 'Pan troglodytes', 'Homo sapiens']
['Homo sapiens', 'Pan troglodytes', 'Gorilla gorilla']


In [20]:
# the .sort () method arranges the list in alphabetical order if the elements are strings
# and ascending numerical order when the elements are numbers
apes.sort()
print(apes)

['Gorilla gorilla', 'Homo sapiens', 'Pan troglodytes']
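
> The `apes` example only shows strings, so here is a minimal sketch of `.sort()` on numbers and of the two optional arguments. The lists `positions` and `genes` are made-up examples, and using `str.lower` as the **key** is just one possible choice.

```python
positions = [7984512, 5, 9041]  # hypothetical genomic positions

positions.sort()
print(positions)  # [5, 9041, 7984512]

positions.sort(reverse=True)
print(positions)  # [7984512, 9041, 5]

# With strings, uppercase letters are sorted before lowercase letters
genes = ["tp53", "BRCA1", "apoe", "TP53"]
genes.sort()
print(genes)  # ['BRCA1', 'TP53', 'apoe', 'tp53']

# A key function changes what the elements are compared on;
# here every name is compared as if it were written in lowercase
genes.sort(key=str.lower)
print(genes)  # ['apoe', 'BRCA1', 'TP53', 'tp53']
```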


In [21]:
help(list)

Help on class list in module builtins:

class list(object)
 |  list(iterable=(), /)
 |  
 |  Built-in mutable sequence.
 |  
 |  If no argument is given, the constructor creates a new empty list.
 |  The argument must be an iterable if specified.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self))

In [3]:
apes = ["Gorilla gorilla", "Pan troglodytes",  "Homo sapiens"]
apes.sort(key = len, reverse = True)

print(apes)

# "Gorilla gorilla" and "Pan troglodytes" are of equal length.
# Python's sort is stable, so elements with equal keys keep their original order
# (it does NOT sort them alphabetically as a tie-breaker)

print(len("Gorilla gorilla"))
print(len("Pan troglodytes"))
print(len("Homo sapiens"))


['Gorilla gorilla', 'Pan troglodytes', 'Homo sapiens']
15
15
12


<div class = "alert alert-block alert-info">

> In the last two blocks you can really see how **mutable** lists are

> In both cases, when we printed **apes** again, it just modified the existing list and we didn't need to assign it to a new variable for the change to be effective

> **Strings are immutable**, which means they cannot be changed in place

> When you "change" a string, Python creates a new string and the variable name is re-assigned to it



In [22]:
string = '"Gorilla gorilla", "Pan troglodytes",  "Homo sapiens"'
# print(string)
print("This is my memory address:", id(string))

string = '"Gorilla gorilla", "Pan troglodytes",  "Homo sapiens"' + ', "Macaca Fascicularis"'
# print(string)
print("This is my memory address:", id(string))

lists = ["Gorilla gorilla", "Pan troglodytes",  "Homo sapiens"]
# print(lists)
print("This is my memory address:", id(lists))

lists.append("Macaca Fascicularis")
# print(lists)
print("This is my memory address:", id(lists))


# it's a bit difficult to see, but with "string", the + created a brand-new string object
# and the variable name "string" now points to it (the memory address changed)

# but with "lists", you've JUST modified the content of the existing list (same memory address)


This is my memory address: 139765417253568
This is my memory address: 139765420863920
This is my memory address: 139765417276744
This is my memory address: 139765417276744


In [23]:
# this is why THIS will not work: .append() changes the list in place and returns None
lists = ["Gorilla gorilla", "Pan troglodytes",  "Homo sapiens"]

lists = lists.append("Macaca Fascicularis") 
print(lists)


None


In [24]:
# If you change the list name, will it still be the same list?

# Or did you create a copy of the list that you can now modify?

apes = ["Gorilla gorilla", "Pan troglodytes",  "Homo sapiens"]
print(apes)

new_apes = apes # change the list name from "apes" to "new_apes"
# print(new_apes)

new_apes.append("new_species")
print(new_apes)


# Let's check if the list apes is still the original list?
print(apes)



['Gorilla gorilla', 'Pan troglodytes', 'Homo sapiens']
['Gorilla gorilla', 'Pan troglodytes', 'Homo sapiens', 'new_species']
['Gorilla gorilla', 'Pan troglodytes', 'Homo sapiens', 'new_species']


<div class = "alert alert-block alert-danger">
    
> By giving the list a new name, you have not created a copy of the list

> You've basically just given the list a **nickname**

> So when you append something using the new list name, you are changing the ONE list that BOTH names refer to

> **To create a copy of the list to do new work on, you have to use `.copy()`**

In [25]:
apes = ["Gorilla gorilla", "Pan troglodytes",  "Homo sapiens"]
print(apes)


newest_apes = apes.copy()
newest_apes.append("new_species")
print(newest_apes)
print(apes)

# Let's check if the list apes is still the original list?
# print(apes)

['Gorilla gorilla', 'Pan troglodytes', 'Homo sapiens']
['Gorilla gorilla', 'Pan troglodytes', 'Homo sapiens', 'new_species']
['Gorilla gorilla', 'Pan troglodytes', 'Homo sapiens']
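
> One way to check whether two names refer to the same list is the `is` operator. This is a minimal sketch; the names `nickname` and `real_copy` are illustrative.

```python
apes = ["Gorilla gorilla", "Pan troglodytes", "Homo sapiens"]

nickname = apes          # a second name for the SAME list
real_copy = apes.copy()  # a new, separate list with the same elements

print(nickname is apes)   # True: one list, two names
print(real_copy is apes)  # False: two different lists
print(real_copy == apes)  # True: but their contents are equal (for now)

apes.append("new_species")
print(len(nickname))   # 4, because it is the same list as apes
print(len(real_copy))  # 3, because the copy was not touched
```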


<h2><center><ins>Quick Hands-on exercise</ins></center></h2>


**Exercise 3**
> Use the **`help ()`** function to find a method that you may use 

> To add up the number of times you see the element

> **human** in the list called **`species = ["human", "fish", "human", "bear", "human", "eagle"]`**

In [None]:
help (list)

In [26]:
species = ["human", "fish", "human", "bear", "human", "eagle"]

count_humans = species.count("human")

print(count_humans)


3


**Exercise 4**

> Use the **`help ()`** function to find a method that you may use 

> To add **jaguar** to position **3** of the species list

In [27]:
species.insert(3, "jaguar")
print(species)

['human', 'fish', 'human', 'jaguar', 'bear', 'human', 'eagle']


<h2><center><ins>Looping</ins></center></h2>

<div class = "alert alert-block alert-info">

> The process of reading over multiple sets of data, in a somewhat repetitive manner, is called **looping.**

> Let's look at the example of the two files again

**<center>File 1</center>** 

|        Gene              |                 SNP          |
|--------------------      |------------------------------|                
|    <center>TP53</center> | <center>**rs54** </center>   |
|    <center>TP53</center> | <center>rs111</center>       |    
|    <center>TP53</center> | <center>rs1081</center>      | 
|    <center>TP53</center> | <center>**rs5099** </center> |  
    
<br><br>    
**<center>File 2</center>** 

| Gene                 |          SNP                     |
|--------------------  |----------------------------------|                
|<center>TP53</center> |  <center>rs19</center>           |
|<center>TP53</center> | <center>**rs54**</center>        |  
|<center>TP53</center> | <center>**rs5099** </center>     |
|<center>TP53</center> | <center>rs7878</center>          |  
    

> In this case, we would read the SNP in **Line 1** in **File 1** and compare it to the SNPs in **Lines 1, 2, 3 and 4** in the second file.

> Then we may read the SNP in **Line 2** of **File 1** and compare it to all the SNPs in **File 2**

> We would do this until we are sure that we've compared all the SNPs of **File 1** to those in **File 2**

> With Python, you could use something known as the **`for loop`** to do this automatically, quickly, and with far fewer errors than doing it by hand.

> But in this case, we will not use this example, because we haven't done **Conditionals** yet. We will see the full power of **looping** once we combine it with the next chapter.

> For now, let's look at how we can use **looping** with **lists**


<div class = "alert alert-block alert-info">
    
> Before we look at the `for` loop, it's worth noting, that just as with strings

> You can use **`in`** to check if an element is in your list

In [28]:
species = ['human', 'fish', 'human', 'jaguar', 'bear', 'human', 'eagle']

fish = "fish" in species

bat = "bat" in species


print(fish)
print(bat)

True
False


<div class = "alert alert-block alert-info">
    
> Now if you had 10 000 species in your list

> And you wanted to print that each one of those animals is a species

> It could take you a long time, it would be error-prone, and not to mention BORING

> If you were Excel-savvy, you could do it in Excel

> But why not script it?



<div class = "alert alert-block alert-info">

> By using a **`for`** loop, you can instruct your code to read through each element in your list, from the first element to the last, and do WHATEVER you want to it.

> In this case, it will just be the same thing to every element

> But once we do **Conditionals** in the next chapter, we can do different things to different elements

> The basic structure of a **`for`** loop is 

> **`for element in list:
       do X`**
         
         
> So if you wanted to print that each one of those **elements** is not a computer

> You don't have to go:

> `species[0] is not a computer`

> `species[1] is not a computer`

> `species[2] is not a computer`

> etc.
       
> You can simply say:


In [29]:
species = ['human', 'fish', 'human', 'jaguar', 'bear', 'human', 'eagle']

for animal in species:
    print (animal, "is not a computer.")

human is not a computer.
fish is not a computer.
human is not a computer.
jaguar is not a computer.
bear is not a computer.
human is not a computer.
eagle is not a computer.


<div class = "alert alert-block alert-info">

> Special mention must be made of the format of this loop

> Note that the first line ends in a **colon**

> And the second line does not start directly below the first

> In Python, once you use a **colon**, the following line must be indented, typically with **four spaces** or a **tab**

> **NEVER mix spaces and tabs** - Choose one

> Your IDE should allow you to set your tab key to insert four spaces

> The indented lines that follow the line with the colon

> Are called a **block** of code

> Or the **body** of the loop
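
> To see the difference that indentation makes, here is a minimal sketch with one line inside the body of the loop and one line outside it. The shortened `species` list and the counter `lines_printed` are illustrative assumptions.

```python
species = ["human", "fish", "bear"]
lines_printed = 0

for animal in species:
    # These indented lines form the body: they run once for EVERY element
    print(animal, "is not a computer.")
    lines_printed = lines_printed + 1

# This line is not indented, so it runs only ONCE, after the loop has finished
print("Finished after", lines_printed, "animals.")
```

> If you indent the last `print` so that it lines up with the body, it becomes part of the loop and runs three times instead of once.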

<div class = "alert alert-block alert-info">
    
> When using a `for` loop, you're basically telling the code that you will do something 

> To every element in the list, and your task is not done until you reach the last element

> Unlike before when we assigned an object to a variable name, e.g. `animal = "bear"`

> Now our variable name `animal` has no object assigned to it

> Until the `for` loop starts

> It then takes on the first element in the list, `species[0]`, which is equal to `"human"`

> After it prints, it recognizes that due to it being a `for` loop

> It is not yet done, so it goes back into the list and continues with the next element

> And then the next, until it reaches `species[6]` or `species[-1]`

> The last object the variable `animal` will store, will be your last element

> So if you exit the `for` loop 

> By writing a line that is not indented, in line with the first line of the loop

> You will see that `print(animal)` will print out `eagle`



In [30]:
print(animal)

eagle


<div class = "alert alert-block alert-info">

> It is worth noting, that because you're working with a list

> You can control your `for` loop by using slicing

> For example, if you wanted your `for` loop to only print

> From `"fish"` to `"bear"`

> You can use the following

In [33]:
# can we slice a list?
# what is the variable name of the list in the code below?
# Which element is "fish"?
# Which element is "bear"?
# what would that slicing look like?

species = ['human', 'fish', 'human', 'jaguar', 'bear', 'human', 'eagle']

for animal in species [1:5]:
    print (animal, "is not a computer.")

fish is not a computer.
human is not a computer.
jaguar is not a computer.
bear is not a computer.


<h2><center><ins>Quick Hands-on exercise</ins></center></h2>

> We can give all sorts of instructions inside the block of code

> It doesn't have to be just one line following the `for` statement

> Let's try giving it some things that we previously learnt

> I've written some code at the bottom, which contains a few errors

> and some code is redundant

> Let's fix it together

<br><br>  

> First, let's practice reading other people's code

> Initially, just read through the code and check if you have an idea of what the code will do

> Then let's do some troubleshooting

In [None]:
species = ['human', 'fish', 'human', 'jaguar', 'bear', 'human', 'eagle']


for animal in species[]:
    len_animal = len(animal)
     print (animal, "is not a computer.")
    print (animal, "has a length of", str(len_animal)
    answer_eagle = eagle in species
    print (answer_eagle)

<div class = "alert alert-block alert-success">

> `for animal in species[]:`                           # there's nothing inside the brackets to slice

> `print (animal, "is not a computer.")`               # indentation error due to extra space

> `print (animal, "has a length of", str(len_animal)`  # the (str) is redundant, because we already use commas 

> `print (animal, "has a length of", str(len_animal)`  # there's also a missing bracket at the end                                                   

> `answer_eagle = eagle in species`                    # eagle must be surrounded by quotes, since it's a string

> `print (answer_eagle)`                               # the answer to this can be printed outside the loop, since we only need this to be answered ONCE

In [35]:
species = ['human', 'fish', 'human', 'jaguar', 'bear', 'human', 'eagle']

for animal in species:
    len_animal = len(animal)
    print (animal, "is not a computer.")
    print (animal, "has a length of", len_animal)

# this only needs to be answered ONCE, so it sits outside the loop
answer_eagle = "eagle" in species
print (answer_eagle)

human is not a computer.
human has a length of 5
fish is not a computer.
fish has a length of 4
human is not a computer.
human has a length of 5
jaguar is not a computer.
jaguar has a length of 6
bear is not a computer.
bear has a length of 4
human is not a computer.
human has a length of 5
eagle is not a computer.
eagle has a length of 5
True


<h2><center><ins>Using the .split() method </ins></center></h2>

<div class = "alert alert-block alert-info">

> The **`.split ()`** method allows us to turn a string into a list by splitting it on a common character or group of characters, called a **delimiter**

> It's important to note that it splits your string at that character, so that character, (in the example below the character is a comma), will be excluded

In [36]:
# Example from book

names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")
print(species)


['melanogaster', 'simulans', 'yakuba', 'ananassae']


In [37]:
dna_sequence = "ATGATTCGCCAAAAAGGGCCTAAAAAGGGGTCCNTTAAAAACCGAATCNN"
dna_sequences = dna_sequence.split("AAAAA")
print(dna_sequences )


['ATGATTCGCC', 'GGGCCT', 'GGGGTCCNTT', 'CCGAATCNN']
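
> The delimiter does not have to be a comma. The sketch below assumes a made-up, tab-separated line (gene name, then SNP), similar to a row in the tables shown earlier, and also shows what happens when `.split()` is given no argument.

```python
line = "TP53\trs54\n"  # hypothetical line: gene and SNP separated by a tab

# .strip() removes the newline at the end, .split("\t") cuts at the tab
fields = line.strip().split("\t")
print(fields)  # ['TP53', 'rs54']

gene = fields[0]
snp = fields[1]
print(gene, "has SNP", snp)

# With no argument, .split() cuts at any run of spaces, tabs or newlines
words = "ATG  CCC\tGGA".split()
print(words)  # ['ATG', 'CCC', 'GGA']
```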


<h2><center><ins>Iterating over lines in a file</ins></center></h2>

<div class = "alert alert-block alert-info">

> We can use the `for` loop to iterate over (loop through) the lines in a file as well

> Previously we used `.read()`, `.readline()`, and `.readlines()`

> But when using a for loop, we do **NOT** use any of the `.read..()` methods in combination with it

> Consider the `for` loop as doing the **reading**

In [46]:
# REMEMBER!!!!
# If you're running this from GitHub, then this will work if you just
# use the line: dna_file = open("dna.txt")
# If you're running this from your own computer, then you need to include the file path
# where the file "dna.txt" is located

# e.g. "/home/your_laptop_name/Desktop/dna.txt"
#
dna_file = open("/home/tracey/Desktop/dna.txt")


for line in dna_file:
    print(line)
    break


ATGGCAATAACCCCCCGTTTCTACTTCTAGAGGAGAAAAGTATTGACAT
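
> Without the `break`, the loop visits every line in the file. The sketch below is self-contained: it first writes a small file with the made-up name `example_dna.txt` (an assumption, so that the code runs anywhere) and then loops over its lines.

```python
# Write a small example file so that this sketch can run on its own
with open("example_dna.txt", "w") as out_file:
    out_file.write("ATGGCAATAACC\nTTGACAT\nGGCCA\n")

sequence_lengths = []

with open("example_dna.txt") as dna_file:
    for line in dna_file:
        # each line still ends with a newline character, so strip it off
        sequence = line.strip()
        sequence_lengths.append(len(sequence))
        print(sequence, len(sequence))

print(sequence_lengths)  # [12, 7, 5]
```

> Stripping the newline also explains the empty line you see when you `print(line)` directly: the line brings its own newline, and `print` adds another one.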



<h2><center><ins>The range( ) function</ins></center></h2>

> A range represents a series of integers

> It can take up to three arguments **`range(start, stop, step)`**

> If only one is given, then it is used as the **stop**: **`range(stop)`**

> In fact, let's look at **`help(range)`**

In [47]:
help (range)

Help on class range in module builtins:

class range(object)
 |  range(stop) -> range object
 |  range(start, stop[, step]) -> range object
 |  
 |  Return an object that produces a sequence of integers from start (inclusive)
 |  to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
 |  start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
 |  These are exactly the valid indices for a list of 4 elements.
 |  When step is given, it specifies the increment (or decrement).
 |  
 |  Methods defined here:
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |

<div class = "alert alert-block alert-info">
    
> `class range(object)` # this only tells us that range is a class; at least one argument (the **stop**) is compulsory

> `range(stop) -> range object` # it will give you all the integers from 0 up to, but not including, that number

> `range(start, stop[, step]) -> range object` # it will give you all the integers from the start integer you gave as an argument, up to, but not including, the stop integer. The **step** can only be used if both the **start** and the **stop** are given.

> Each argument is separated by a comma

In [48]:
for integer in range(10):
    print(integer, end = "") # the `end = ""` keeps the numbers on one line instead of printing one below the other
print("\n")    

for integer in range(2,10):
    print(integer, end = "")
print("\n")  
    

for integer in range(2,10,2):
    print(integer, end = "")
print("\n")  

0123456789

23456789

2468
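
> A range does not show its integers when you print it directly, so a handy trick is to turn it into a list first. This is a minimal sketch; the variable names and the short `dna` string are illustrative assumptions.

```python
first_five = list(range(5))
evens = list(range(2, 10, 2))
countdown = list(range(10, 0, -3))  # a negative step counts downwards

print(first_five)  # [0, 1, 2, 3, 4]
print(evens)       # [2, 4, 6, 8]
print(countdown)   # [10, 7, 4, 1]

# range(len(...)) gives exactly the valid index positions of a string or list
dna = "ATGC"
for position in range(len(dna)):
    print(position, dna[position])
```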



> Imagine we have a protein sequence:

> `protein = "vlspadktnv"`

> And we want to print out the first three residues, then the first four residues, etc.

> We could create a list of all the stop positions for that protein, and print from the start to that stop position

In [49]:
protein = "vlspadktnv"

stop_positions = [3,4,5,6,7,8,9,10]

for stop_position in stop_positions:
    mini_protein = protein[0:stop_position]
    print(mini_protein)

vls
vlsp
vlspa
vlspad
vlspadk
vlspadkt
vlspadktn
vlspadktnv


In [50]:
# But if this protein was really long, you don't want to type all the stop positions
# So you could use the range ( ) function

protein = "vlspadktnv"

stop_positions = range(3, 11)
# remember, the range is a sequence of integers
# So it's 3 <= integer < 11

for stop_position in stop_positions:
    mini_protein = protein[0:stop_position]
    print(mini_protein)

vls
vlsp
vlspa
vlspad
vlspadk
vlspadkt
vlspadktn
vlspadktnv


<h2><center><ins>Nested Loops</ins></center></h2>

> A nested loop is a **loop within a loop**

> We will not go into much detail about this

> But we'll see its uses become more apparent with the next chapter

> This is just a foretaste of what's coming

> But can you see how this might be useful?

In [7]:
names = ["Ray", "Shay", "Kay"]
prizes = ["car", "house", "holiday"]

for name in names:
    for prize in prizes:
        print(name, "gets a", prize)
    

Ray gets a car
Ray gets a house
Ray gets a holiday
Shay gets a car
Shay gets a house
Shay gets a holiday
Kay gets a car
Kay gets a house
Kay gets a holiday
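
> One place where this is useful is the two-file SNP example from the start of this chapter. The sketch below assumes the SNPs from File 1 and File 2 have already been placed in two lists (the list names and the `comparisons` counter are illustrative). It compares every SNP in the first list to every SNP in the second and prints `True` or `False` for each pair, without needing Conditionals yet.

```python
file_1_snps = ["rs54", "rs111", "rs1081", "rs5099"]
file_2_snps = ["rs19", "rs54", "rs5099", "rs7878"]

comparisons = 0

for snp_1 in file_1_snps:
    for snp_2 in file_2_snps:
        # == gives True when the two SNPs are the same, otherwise False
        print(snp_1, snp_2, snp_1 == snp_2)
        comparisons = comparisons + 1

# 4 SNPs in the first list x 4 SNPs in the second list
print(comparisons)  # 16
```

> The inner loop runs from start to finish for every single element of the outer loop, which is why 4 and 4 elements give 16 comparisons.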


<h2><center><ins>Exercises from Python for Biologists</ins></center></h2>

***Exercise 1: Processing DNA in a file***
<br><br> 

The file **input.txt** contains a number of DNA sequences, one per line. 

Each sequence starts with the same 14 base pair fragment – a sequencing adapter that should have been removed. 
<br><br>  


Write a program that will:

**(a)** trim this adapter and write the cleaned sequences to a new file and 

**(b)** print the length of each sequence to the screen.

In [4]:
###############
# Solution 1 #
###############

# REMEMBER!!!!
# If you're running this from GitHub, then this will work
# If you're running this from your own computer, then you need to include the file path
# where the file "input.txt" is located

# e.g. "/home/your_laptop_name/Desktop/input.txt"


with open("input.txt") as input_file,\
    open("adapter_trimmed_sequences.txt", "w") as out_file: 

    for line in input_file:
        # indexing starts at 0, so positions 0 -> 13 are the first 14 nucleotides
        # and position 14 is the first nucleotide after the adapter
        trimmed_sequence = line.strip()[14:]
        out_file.write(trimmed_sequence + "\n")
        # print the length of the sequence itself, without the newline character
        print(len(trimmed_sequence))
   

42
37
48
33
47


***Exercise 2: Multiple exons from genomic DNA***
<br><br> 

The file **genomic_dna.txt** contains a section of genomic DNA, and the file **exons.txt**
contains a list of start/stop positions of exons. 

Each exon is on a separate line and the start and stop positions are separated by a comma. 

Write a program that will extract the exon segments, concatenate them, and write them to a new file.

In [None]:
###############
# Solution 2a #
###############

genomic_dna_file = open("genomic_dna.txt")
exons_file = open("exons.txt")
combined_exons_file = open("combined_exons.txt", "w")

# Below: we want to concatenate, but you can't stick something onto something that does not exist.
# So create an empty string called "combined_exons"
# so that something exists that we can attach to.

combined_exons = ""

#########################
# Format of the exons file (start,stop on each line) BELOW

# 5,58
# 72,133

########################

genomic_dna = genomic_dna_file.read()

for exons in exons_file:
    exons = exons.strip()
    start = int(exons.split(",")[0])   
    end = int(exons.split(",")[1])
    exon = genomic_dna[start:end]    
    combined_exons = combined_exons + exon 
   
    # print(combined_exons)

combined_exons_file.write(combined_exons)

# Close every file that was opened, including the genomic DNA file
genomic_dna_file.close()
exons_file.close()
combined_exons_file.close()


In [None]:
###############
# Solution 2b #
###############

with open("genomic_dna.txt") as genomic_dna_file,\
    open("exons.txt") as exons_file,\
        open("combined_exons.txt", "w") as combined_exons_file: 

    genomic_dna = genomic_dna_file.read()

    # We want to concatenate, but you can't stick something onto something that doesn't exist,
    # so start with an empty string
    combined_exons = ""
    for exons in exons_file:
        start = int(exons.strip().split(",")[0])
        end = int(exons.strip().split(",")[1])
        exon = genomic_dna[start:end]
        combined_exons = combined_exons + exon
    # print(combined_exons)

    combined_exons_file.write(combined_exons)
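
The same idea can be sketched without files. This is a minimal example under stated assumptions: the DNA string and the exon coordinates are invented, and the coordinates are treated as Python slice positions (start included, stop excluded), as in the solutions above. It collects the exons in a list and joins them once at the end, a common alternative to repeated string concatenation.

```python
# Invented example data: a short "genomic" string and two start,stop lines.
genomic_dna = "AAAACCCCGGGGTTTTACGT"
exon_lines = ["4,8\n", "12,16\n"]

exons = []
for line in exon_lines:
    # "4,8\n" -> ["4", "8"] -> start=4, stop=8
    start, stop = (int(position) for position in line.strip().split(","))
    exons.append(genomic_dna[start:stop])   # slice includes start, excludes stop

combined_exons = "".join(exons)
print(combined_exons)
```

Unpacking `start, stop` in one step also means each line is split only once.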


***Exercise 3: Processing DNA in a file***

The file **input.txt** contains a number of DNA sequences, one per line.

Each sequence starts with the same 14 base pair fragment – a sequencing adapter that should have been removed.

Write a program that will:

**(a)** trim this adapter and 

**(b)** write **EACH** cleaned **sequence** to its **own file**.

In [5]:
###############
# Solution 3 #
###############


###########################################################################
# No output file names are given, which is often the case in real life,
# so we have to be creative about how we assign these file names
##########################################################################

############################################################################
# The output file CANNOT be opened here, before the loop,
# because we do NOT want all the sequences written to the same file
############################################################################


line_number = 0    # start a counter so that you can count up   

with open("input.txt") as input_file:
    for line in input_file:
        newline = line.strip()[14:]        # index 14 is the 15th nucleotide, the first one after the adapter
        line_number = line_number + 1      # the same as: line_number += 1
        file_name = "file_" + str(line_number)
        print(file_name)
        # Open a new output file for each sequence; "with" closes it again
        # before the next one is opened, so no file is left open
        with open(file_name, "w") as out_file:
            out_file.write(newline)
    

file_1
file_2
file_3
file_4
file_5
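
The manual counter can also be replaced by `enumerate`. The sketch below is an illustration only: the sequences are invented stand-ins for the trimmed lines of **input.txt**, and the files are written to a temporary directory (via the standard-library `tempfile` module) so that nothing is left behind.

```python
import os
import tempfile

# Invented, already-trimmed sequences standing in for the lines of input.txt.
sequences = ["TCGATCGA", "ACTG", "GGGTTT"]

# Write into a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as out_dir:
    written = []
    # enumerate(..., start=1) supplies the counter, so no manual increment is needed
    for number, sequence in enumerate(sequences, start=1):
        path = os.path.join(out_dir, "file_" + str(number))
        with open(path, "w") as out_file:   # closed automatically after each write
            out_file.write(sequence)
        written.append(os.path.basename(path))

    # Read the files back to confirm that each one holds exactly one sequence
    contents = []
    for name in written:
        with open(os.path.join(out_dir, name)) as in_file:
            contents.append(in_file.read())

print(written)
```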


In [6]:
###############
# Solution 4 #
###############

# Now let's pretend that the first 9 nucleotides of each trimmed sequence are unique.
# We can then use them as the file name.
# In practice it is often a bit easier than in these two examples, because you usually read in multiple files,
# such as multiple VCF files within a directory,
# and these VCF files usually have unique names,
# so you can create the output file name from the input file name with a slight tweak.
# Here we have just one input file, so we have to improvise.


with open("input.txt") as input_file:
    for line in input_file:
        newline = line.strip()[14:]    # index 14 is the 15th nucleotide, the first one after the adapter

        #########################################
        # Let's write these files to the Desktop
        #########################################
        file_name = newline[:9] + ".txt"    # pretend the first 9 nucleotides are unique
#         print(file_name)
        out_file_dir = "/home/tracey/Desktop/" + file_name    # add your own directory here
        # e.g. "/home/tracey/Desktop/TCGATCGAT.txt"

#         print(out_file_dir)
        # "with" closes each output file as soon as its sequence has been written
        with open(out_file_dir, "w") as out_file:
            out_file.write(newline)
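
The comments above mention building the output file name from the input file name. A minimal sketch of that idea, using the standard-library `pathlib` module, might look like this. The file names and the `_trimmed` suffix are hypothetical, and no files are actually read or written.

```python
from pathlib import Path

# Hypothetical input files; in practice these might come from Path(directory).glob("*.vcf").
input_paths = [Path("data/sample_A.vcf"), Path("data/sample_B.vcf")]

output_paths = []
for input_path in input_paths:
    # stem is the file name without its extension, e.g. "sample_A"
    output_name = input_path.stem + "_trimmed.txt"
    # with_name swaps the file name but keeps the directory
    output_paths.append(input_path.with_name(output_name))

for output_path in output_paths:
    print(output_path.as_posix())
```

Because each output name comes from a unique input name, the output files cannot overwrite one another.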