# Python Basics

## Why Python?

There are many programming languages that offer different functionalities. Programmers usually end up using multiple languages depending on what they want to achieve. In bioinformatics, we are often interested in manipulating "text", such as DNA and protein sequences... 

Python is therefore a good choice for bioinformatics analysis because
* It has consistent syntax
* It has built-in libraries that one can use
* It allows for easy manipulation of text


## Data Types

Data types *classify* or *categorize* data: they tell Python what kind of *value* an item holds. Named items that hold a value are called *variables*.

<img src="https://drive.google.com/uc?id=1fiIM9_ydwnK8gYn1mi40QDEiljkZkKy3">

Variables can have any name and any value. For example:

A numerical variable is defined as:

```
numerical_var = 5
```
A "string", or character variable, is defined as:
```
string_var = "Hello World"
```
A combination of different data types assigned to one variable can be achieved using a list, for example like this:
```
my_list = [1, 1.34, "Hello World!"]
```
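
To check which data type Python has given to a value, one option is the built-in function type(). The sketch below is only an illustration: the variable names are made up for this example, and print() is explained in Detour 1 right after this.

```python
# Hypothetical variables, used only for this illustration
example_number = 5
example_decimal = 1.34
example_text = "Hello World"
example_list = [1, 1.34, "Hello World!"]

# type() reports the data type of whatever is passed to it
print(type(example_number))   # <class 'int'>
print(type(example_decimal))  # <class 'float'>
print(type(example_text))     # <class 'str'>
print(type(example_list))     # <class 'list'>
```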

##### Detour 1: Print Function

In Python you can use print() to print, or display, a specified message or value on the screen.

We can print a message like this:

```
print("Hi there")
```

We can even print a numerical value:
```
print(193.39)
```

Also, we can print variables created before. We just have to enter the name of the variable, like this:
```
print(string_var)
```




Try printing the values of all variables assigned so far, as well as your own message. We have started this problem for you. 

Note 1: each print statement should be started on a new line or in a new cell.

Note 2: To execute the code, press Shift + Enter, or press the play button in the cell.

In [None]:
print(numerical_var)



##### Detour 2: Comments
Comments are very useful in programming.

In Python, everything that follows the character "#" on a line is ignored when the program runs. This is often used for commenting, or annotating, code, or for preventing certain parts of the code from running (for example, parts that are no longer needed).

Here is an example of how comments work:

In [1]:
print('This is a line 1')
#print('This is line 2')
print('This is line 3')

This is a line 1
This is line 3


As you can see from the cell above, when commenting a line of code, the line changes color, allowing us to quickly spot comments. 

We will be using comments to give you instructions and hints.

Try it out yourself! Comment out and uncomment the lines below.

In [None]:
print('Comment me')
print('Hello World')
#print('Uncomment me')

Comment me
Hello World


### Exercises on Data Types

Create a variable called **my_dna** and assign as its value the string: **ATGCGTA**


In [None]:
# Write your code here


Now, let's create a numerical variable called **my_dna_length** and assign as its value the number of nucleotides in my_dna

In [None]:
# Write your code here


##### Detour 3: Function len( )

While it is easy to count the length of my_dna, as the number of characters increases the manual counting gets more difficult ...
There is an easier way to do this in Python! We use a function called:
```
len()
```
We will describe functions like this one later in more detail, but in practice len() will count every character in a string or every item in a list. For example:
```
len("ABC")
```
will count how many characters constitute "ABC".
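
As a small sketch of both uses, the example below applies len() to a string and to a list. The variable names are hypothetical and chosen only for this illustration.

```python
# Hypothetical examples of len() on a string and on a list
letters = "ABC"
mixed_list = [1, 1.34, "Hello World!"]

print(len(letters))     # 3 characters
print(len(mixed_list))  # 3 items; the string inside the list counts as one item
```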

Try to get the number of nucleotides for my_dna using len( ).

In [None]:
# Place your code here


## Operators

Scientists constantly have to manipulate, modify and interpret their data. To do this with programming, we can use arithmetic symbols and operators.

We can use them for truth value testing, comparisons, data type conversions, and many other things.

<img src="https://drive.google.com/uc?id=1-73goDs7Igl3jAfwmsOyy4vXHdbX78il">

Arithmetic and operator symbols can work with different data types, but they might do different things depending on whether the variable is a string or an integer, for example.
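
To illustrate how the same symbol can behave differently, here is a minimal sketch with made-up values (a, b and word are not part of the exercises).

```python
# Hypothetical values chosen only for this illustration
a = 2
b = 5
word = "ab"

number_sum = a + b          # addition of two integers
number_product = a * b      # multiplication of two integers
text_joined = word + word   # + joins (concatenates) two strings
text_repeated = word * 3    # * repeats a string a whole number of times
is_equal = (a == b)         # == compares two values and gives True or False

print(number_sum)      # 7
print(number_product)  # 10
print(text_joined)     # abab
print(text_repeated)   # ababab
print(is_equal)        # False
```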

### Exercises on Operators

In [None]:
# Here we assigned some variables that you will need
x = 3
y = 7
true_bool = True
false_bool = False
str_1 = "Hello"
str_2 = "World"

Use operators to show if true_bool is equal to false_bool, and do some calculations with x and y.


Hint: Refer to the diagram above to find the operation you need to use.


In [None]:
# Place your code here


Now, let's use operators for strings. The operators + and * can also be used with strings. For * you will need to use a number. Try it out, and ask for help if you are stuck using + and * with strings.

In [None]:
# Place your code here



To conclude the Exercises on Operators:


Print the variable my_dna three times.

In [None]:
# Your code here


Create a variable "my_dna_2" with the value "CATCGGGTA" and print the concatenation of my_dna with my_dna_2

*Hint: concatenation is the string version for adding*

In [None]:
# Your code here


On the screen show the following message: 
```
My dna sequence is ATGCGTA and it has length 7
```
There are multiple ways to do this! 

*Hint 1: you can use previously assigned variables*

*Hint 2: when using + to concatenate, you can only join the same data type together*
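
Two possible ways to combine text and numbers in one message are sketched below. The variables sample_name and read_count are hypothetical and deliberately different from the ones in the exercise.

```python
# Hypothetical variables, not the ones from the exercise
sample_name = "sample A"
read_count = 12

# Option 1: convert the number to a string with str() before using +
message = "Reads in " + sample_name + ": " + str(read_count)
print(message)  # Reads in sample A: 12

# Option 2: pass several values to print(), separated by commas;
# print() converts each one and puts a space between them
print("Reads in", sample_name, "is", read_count)  # Reads in sample A is 12
```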

In [None]:
# Your code here


## Functions

Functions are blocks of code that perform a specific task. They typically do one single task, and ideally they do that one task well! We already introduced you to some functions, like print() and len(). 

There are two types of functions: built-in functions and functions we write ourselves. Built-in functions come with Python; we can use them but not modify them (for example print() and len()). The most common built-in functions are listed [here](https://docs.python.org/3/library/functions.html).

A cool thing about programming is that we can create our own functions to serve our needs. To create a function in Python, we use the following format:

In [None]:
# This is called defining a function
def say(): # say is the name of the function that we will later use to "call" the function
    greeting = 'Hello'
    print(greeting)

To "execute" or run a function, we write the name of the function followed by round brackets. When executing, or running the code, the function will perform its task. For example:

In [None]:
say()

Hello


Functions can take in *parameters* and do something with them. To do this, we specify the name of the parameter inside the parentheses, like this:

In [4]:
def say_my_name(name):
    greeting = 'Hello my name is ' + name
    print(greeting)

To execute with specific parameters, we input the values of the parameters. We can use the same function with different values for its parameters:

In [5]:
# You can do this:
say_my_name('Karla')
say_my_name('Arjana')

Hello my name is Karla
Hello my name is Arjana


In [6]:
# Or you can do this:
say_my_name(name = 'Karla')
say_my_name(name = 'Arjana')

Hello my name is Karla
Hello my name is Arjana


Now try printing your name!

In [None]:
# Your code here


Sometimes it is useful to have a function that returns a value. For this, we use "return" at the end of the function. A returned value can then be used outside the function, in contrast to a printed value which can only be viewed.

In [None]:
def summation(var1,var2):
    s = var1 + var2
    return s


summation(1,4) # If function is called at the end of the cell, the returned value will be automatically "printed". 
# If the function is called within other code you need to do print(summation(1,4)) to see the returned value.

5

You can save the returned value as a new variable, like this:

In [None]:
s = summation(1,4)
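
The difference between printing and returning can be sketched as follows. The function names show_sum and give_sum are hypothetical and exist only for this comparison.

```python
# Hypothetical functions for comparison
def show_sum(var1, var2):
    print(var1 + var2)   # displays the result but does not hand it back

def give_sum(var1, var2):
    return var1 + var2   # hands the result back to the caller

shown = show_sum(1, 4)   # prints 5
given = give_sum(1, 4)   # prints nothing

print(shown)       # None, because show_sum has no return statement
print(given * 2)   # 10, the returned value can be used in further calculations
```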

If the operator is valid for different data types, one function can be used in multiple ways. For example, "summation(var1,var2)" can be used for numbers, strings, and even lists...

In [None]:
print(summation(4,3.5))
print(summation('hello ','world'))
list_1 = [1,2,3]
print(summation(list_1,[4,5,6]))

7.5
hello world
[1, 2, 3, 4, 5, 6]


### Exercises on Functions

Create a function called up_seq_down that does the following:

1. Receives as parameter a DNA sequence
2. Creates variable called upstream with value AAA
3. Creates variable called downstream with value GGG
4. Concatenates upstream, the DNA sequence (input) and downstream
5. Returns the concatenated sequence

Next, run your function with some different DNA sequences as input.

*Hint: Your output should look something like this*
```
AAATCGAGTGCACTCGGG
```


In [None]:
# Create your function here



## Files

Python is especially good for working with text files. It is easy to open, write, read, save, or edit the content of a text file. 

You can learn more about files [here](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) (optional).


Whether we want to access the content of an existing file or create a file from scratch, we use the function "open( )" and specify what we intend to do with the file. We can, for example, write ("w") a file like this:
```
f = open('my_file.txt','w')
f.write('My first file\n')
f.write('This is line two\n')
f.write('And line three')
f.close()
```
Above we created a text file called "my_file.txt", assigned it to the variable f, and wrote in it using the function "write( )" (the "\n" starts a new line). Finally, we used f.close( ) to tell the computer we are done with the file, which makes sure everything we wrote is saved to it.

### Exercises on Files

Let's try to recreate the example above, creating our first file.

In [None]:
# Place your code here


Now that we have created a file, we can try to read and print its content! To do so, we will use the function "read( )" as follows:
```
f = open('my_file.txt','r')
txt = f.read()
f.close()
print(txt)
```

Use the template above to read and print the content of the file you created in the previous step.

In [None]:
# Place your code here


We can also use the function "readlines( )" to create a list containing each line of the file.
```
f = open('my_file.txt','r')
txt = f.readlines()
f.close()
print(txt)
```
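
As an alternative that is not used elsewhere in this notebook, Python also offers the "with" statement, which closes the file automatically. The sketch below writes a small file and reads it back with read() and readlines(). The file name example_notes.txt is hypothetical, and the file is placed in the system's temporary folder so that it does not clutter the notebook folder.

```python
import os
import tempfile

# Hypothetical file name; the file is placed in the system's temporary folder
path = os.path.join(tempfile.gettempdir(), "example_notes.txt")

# "with" closes the file automatically at the end of the indented block
with open(path, "w") as f:
    f.write("line one\n")
    f.write("line two\n")

with open(path, "r") as f:
    whole_text = f.read()        # one string with the entire content

with open(path, "r") as f:
    line_list = f.readlines()    # a list with one string per line

print(whole_text)
print(line_list)  # ['line one\n', 'line two\n']
```

Note that readlines() keeps the "\n" at the end of each line.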

There is a file called "sequences.fasta" available to you in the same folder as this notebook. Open and read the file, then print its content using read() or readlines().

In [None]:
# Place your code here


## Loops

Loops are a way to iteratively go over a list or string (or any other data type that is iterable) and do the same thing to each item. (For more info see [here](https://www.learnpython.org/en/Loops)).

In [None]:
# Run this cell to see how loops work
prime_nums = [2, 3, 5, 7]
for prime in prime_nums:
    print(prime)
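
Loops work on strings in the same way. The sketch below, with made-up variable names, goes through a string one character at a time and builds up a new string along the way.

```python
# Hypothetical example: loop over the characters of a string
letters = "ACGT"
doubled = ""                  # empty string that grows inside the loop
for letter in letters:
    print(letter)             # prints A, C, G, T on separate lines
    doubled = doubled + letter * 2

print(doubled)                # AACCGGTT
```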

Next, go through a list of the numbers from 1 to 10 and display their square. 

In [None]:
# Place your code here.


# Measurements Along the Central Dogma

Now that we have some Python experience, let's try to do some bioinformatics analyses.

To start, create a variable called dna_seq and assign **agtCAgtaactaGGatgcatatgacgTGatcGtGA** as its value

In [None]:
# Place your code here


Now, save in dna_seq_len the number of nucleotides of dna_seq. Remember to use a Python built-in function to do so.

In [None]:
# Place your code here


Show the following message on screen: “The length of my dna sequence X is Y”, where X is dna_seq and Y dna_seq_len 


In [None]:
# Place your code here


##### Detour 4: Indexing

Sometimes we need to access only a portion of a sequence (a *substring*). To do this, we use "[x]" to get a single position in a string or list, or "[x : y]" to get everything from position x up to position y, where x is included and y is excluded. 

Substrings can be very informative in bioinformatics. For example, substrings from the genome of different organisms can show a pattern and help us infer how closely related they are.
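
A minimal sketch of indexing on a made-up string (the variable word and its value are hypothetical):

```python
word = "PYTHON"  # hypothetical string for illustration

first_letter = word[0]     # position 0 is the first character
middle = word[2:5]         # positions 2, 3 and 4; position 5 is excluded
last_letter = word[-1]     # negative positions count from the end
from_third_on = word[3:]   # leaving out y means "until the end"

print(first_letter)   # P
print(middle)         # THO
print(last_letter)    # N
print(from_third_on)  # HON
```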

Let's print some substrings from dna_seq.

1. Print the 3rd nucleotide in dna_seq. (*Note: In Python we start counting from 0, so if you want the first item you use [0]*)
2. Print the substring from the 3rd to 5th position (inclusive) (*Pay attention to whether the 5th position is really included in your implementation*)
3. Print the last nucleotide of the sequence
4. Print position 1, 4, and what remains from the 6th position to the end of the sequence. (*Hint: use operators if helpful.*)


In [None]:
# Place your code here


##### Detour 5: .lower(), .upper(), .count( )

While we can work with dna_seq in a mix of lower and upper case letters, it is better practice to use one or the other consistently. To do this, we can convert the whole sequence into lower or upper case like this:
```
'aTAcgg'.lower()
'aTAcgg'.upper()
```

Assign the lower case version of dna_seq to dna_seq_l. Assign the upper case version of dna_seq to dna_seq_u. Use whichever version you prefer for the remainder of these exercises.

In [None]:
# Place your code here


As we discussed in lecture zero, it is useful to know the distribution of A, T, C and G's in our DNA sequences. To do this, we can use the function .count( ). To use it, we need to give it the string and specify within the parenthesis what element we want to count, like this:
```
'att'.count('t')
```

How many A,T,C and G's are in dna_seq?

In [None]:
# Place your code here


It is a good practice to do a quality check of the data before starting any analysis. Calculating the GC content (percentage of G and C bases in the genome) can help us determine if the distribution of bases is as expected.

A simple way to do this is to count the number of G's and C's in our genome and divide by the total number of bases. The formula to calculate GC content (as a fraction; multiply by 100 for a percentage) is:
```
(Count C's + Count G's) / (Count A's + Count C's + Count G's + Count T's)
```
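
One possible way to turn this formula into code is sketched below. The function name gc_content is our own choice, and the sketch assumes the sequence is not empty and contains only the letters A, C, G and T, so that its length equals the sum of the four counts.

```python
# Hypothetical helper function; the name gc_content is our own choice
def gc_content(seq):
    seq = seq.upper()                       # count upper and lower case letters alike
    gc = seq.count("G") + seq.count("C")    # number of G's and C's
    return gc / len(seq)                    # assumes seq contains only A, C, G and T

print(gc_content("ATGC"))      # 0.5
print(gc_content("ggccat"))    # 4 of 6 bases are G or C
```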


For this part, we are going to do some calculations with dna_seq. Use what we have learned so far to show the following output using variables and functions.
```
The dna sequence is AGTCAGTAACTAGGATGCATATGACGTGATCGTGA
length : 35
A count: 11
T count: 9
C count: 5
G count: 10
GC content is 0.42857142857142855
```

If your output doesn't quite look like this, ask your classmates or the TA for help.

In [None]:
# Place your code here


After the quality check, we are ready to do our first bioinformatic data manipulation.

We are going to learn how to convert a DNA sequence into a protein sequence, similarly to what we did in lecture zero, just this time using Python!
<img src="https://drive.google.com/uc?id=1q-IZtNL_Do7Gcoc0LsOc87ggv0UVzHLq">


First, we need to get the messenger RNA (mRNA) based on our DNA sequence. This can be tricky but we will walk you through each step and also introduce you to some additional useful things. In bioinformatics, it is important to break down a problem into smaller subproblems.

<img src="https://drive.google.com/uc?id=11Ed_ZQozSvacKkjpg75l9YTPfG7riSVm">


Take a close look at the image above. Assume that the DNA sequence we have given you is the template strand in the 5'->3' direction. The steps to getting the mRNA include:

1. Reversing our template DNA sequence. 
2. Transcribing the reversed DNA sequence, meaning getting a C when there is a G, a G when there is a C, an A when there is a T, and a U where there is an A.

To solve subproblem 1, we can use
```
dna_seq[::-1]
```
Do not worry about the details of this operation for now, just see below how it does the job.


In [None]:
seq = 'ABC'
print(seq)
print(seq[::-1])

ABC
CBA


Now reverse dna_seq and assign it to a new variable, and print it to check that it is correct.

Note: use either the lower or upper case version of dna_seq you made earlier.

In [None]:
# We have started this for you
rev_dna_seq =
print(rev_dna_seq)

##### Detour 6: Dictionaries

Next, we need to do the transcription. A useful data type for this and many other things is the **dictionary**. Dictionaries contain keys and values, just like a real world dictionary contains words and definitions, respectively. Unlike in a real world dictionary, items in a Python dictionary are not sorted alphabetically; since Python 3.7 they are kept in the order in which they were added. Regardless, we can easily access the value for a particular key like this: 
```
my_dic[key]
```
Below we created a dictionary that has integers as the keys and the spelled out numbers as the values. In dictionaries, the keys are always unique, meaning you cannot have the same key assigned to different values at the same time.

In [7]:
my_dic = {1:'one',2:'two',3:'three'}
print(my_dic)
print('value for 1: ', my_dic[1])
print('value for 3: ', my_dic[3])

{1: 'one', 2: 'two', 3: 'three'}
value for 1:  one
value for 3:  three
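
Dictionaries become especially handy inside a loop, where each item is looked up one at a time. The sketch below uses a made-up dictionary (digit_words) that has nothing to do with DNA, but follows the same pattern of looking up a value and adding it to a growing string.

```python
# Hypothetical dictionary mapping digit characters to words
digit_words = {"1": "one", "2": "two", "3": "three"}

digit_words["4"] = "four"      # add a new key-value pair
print("2" in digit_words)      # True, the key "2" exists
print(len(digit_words))        # 4 key-value pairs

# Look up each character of a string, one at a time
spelled = ""
for digit in "314":
    spelled = spelled + digit_words[digit] + " "

print(spelled)  # three one four (followed by a trailing space)
```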


Now let's create a dictionary that will be useful to do the transcription. We have started this problem for you to give you an idea of what that dictionary looks like:

In [8]:
# Complete the dictionary (make sure you use either upper or lower case letters, depending on your rev_dna_seq)
transcription_dic = {'C':'G', }

Now comes the hard part! Take rev_dna_seq and transcribe it.
 

In [None]:
# We have started this for you and added comments along the way
mrna_seq = '' # This is an empty string that we will add to in order to create the final mRNA sequence
for n in rev_dna_seq: # Use a loop to go through each nucleotide at a time
  # Get the complement nucleotide from transcription_dic, and assign it to a new temporary variable

  # Now add the complement nucleotide to the mrna_seq string we initiated at the beginning. mrna_seq will be updated at each iteration to reflect a growing transcript

# Lastly print out your transcript! (This should be done outside the loop.)


Extra exercise: try doing what you did above now using a function that takes in a reversed DNA template sequence and returns the transcript!

In [None]:
# Optional



Last but not least, let's translate the mRNA sequence into a protein. 


To translate our mRNA, we first need to create the dictionary of codons - sequences of three consecutive bases.
Use the codon table below to figure out the amino acid for each codon.

<img src="https://drive.google.com/uc?id=1xhzYFwNAQmUyHFxKLBK9SeQVL1F6ZVJf">

In [None]:
# Complete the codon dictionary (again pay attention to whether you should be using lower or upper case letters for your codons)
aa_codon_dic = {'aug':'M'}

aa_codon_dic

{'aug': 'M'}

##### Detour 7: Loops with Range

In contrast to transcription where each DNA nucleotide is converted into an RNA nucleotide, for translation we need to look at three nucleotides at a time to decide which amino acid they code for. This requires that we loop through blocks of the mRNA sequence.

To do this, we can use the function *range( )*, which creates a sequence of integers that we can then use in a loop. range(x, y, z) generates a sequence of integers that starts at x, increases in steps of z, and stops before reaching y (y itself is not included). 
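
To see the numbers that a range produces all at once, one option is to turn it into a list with the built-in function list(). The variable names below are hypothetical.

```python
# list() turns a range into a list so that all of its values can be seen at once
evens = list(range(0, 10, 2))     # start at 0, stop before 10, step 2
threes = list(range(2, 11, 3))    # start at 2, stop before 11, step 3
short = list(range(4))            # one input: start at 0, step 1, stop before 4

print(evens)   # [0, 2, 4, 6, 8]
print(threes)  # [2, 5, 8]
print(short)   # [0, 1, 2, 3]
```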

Before moving on to performing the translation of the mRNA sequence, play around with range( ).

In [None]:
# Change x, y and z until you understand how range( ) works
x, y, z = 0, 10, 2
for i in range(x, y, z):
    print(i)

A fun thing about programming is that you can combine many things together!

We can, for example, use range() and len() together. This is especially helpful when we don't know the length of a sequence beforehand.

range( ) also works with only one input. In that case, Python starts at 0 and increases by one, stopping just before the input value is reached.

See what the code below prints.

In [None]:
word = 'ABCDEFG'
for i in range(len(word)):
    print(i)

To instead print the first 0, 1, 2, etc. characters of word (the first line is empty because word[0:0] contains no characters), we can also incorporate indexing of a string (refer to *Detour 4: Indexing* if you forgot how that works.)

In [9]:
# Try it out
word = 'ABCDEFG'
for i in range(len(word)):
    print(word[0:i])
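
Combining range() with a step and indexing also makes it possible to walk through a string in blocks. The sketch below is one possible approach on a made-up string; the names blocks and block are hypothetical, and the string length is assumed to be a multiple of 3.

```python
word = "ABCDEFGHI"  # hypothetical string whose length is a multiple of 3
blocks = []         # empty list that will collect the blocks

# i takes the values 0, 3 and 6, the start position of each block
for i in range(0, len(word), 3):
    block = word[i:i + 3]   # three characters starting at position i
    blocks.append(block)    # append() adds an item to the end of a list

print(blocks)  # ['ABC', 'DEF', 'GHI']
```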

Now that you know how to use range() in loops, try converting mrna_seq to its protein sequence using the aa_codon_dic dictionary you created earlier. (Note: this is a tricky question so do not get discouraged if it takes you a couple of tries to get it right!)

You can save the protein in a new variable, and print it at the end, similar to what we did for transcription.

In [None]:
# Place your code here


WELL DONE! You completed various real world bioinformatics problems and hopefully learned many new Python functions and tricks!

# References/Resources
[Python Documentation](https://www.python.org/doc/)

[Operators](https://docs.python.org/3/)

Topics and explanations in this notebook were inspired by [yourgenome.org](https://www.yourgenome.org).

Some exercises in this notebook were inspired by [Python for biologists](https://pythonforbiologists.com/upcoming-workshops/introduction-to-python-for-biologists-online-course-13th-24th-july-2020)





# About this Notebook
This notebook was created by Karla Godinez-Macias and Arjana Begzati for STARTneuro at UC San Diego.