![Alt Text](./Imgs/python-logo-generic.png)
# Introduction to Python.
***

## About.

This course will focus on teaching basic programming skills in Python. It is intended for individuals who have never written in Python, but wish to learn the ropes in a quick and efficient manner. We encourage students to pursue external teaching resources if they find this course useful.

## Using Jupyter.

Before delving into Python, let's try to understand ***what Jupyter is and how to use it***. Jupyter is an interactive notebook environment designed to ease the process of annotating programs and teaching one's code to others. If you have access to this document, it's because either (i) you were handed a PDF/Jupyter file copy of this course, or (ii) you are currently accessing a GitHub or Jupyter server hosted by one of the course administrators. Jupyter essentially allows one to write and run Python code in cells, one piece at a time, with notes and results displayed right next to the code.

![Alt Text](./Imgs/Jupyter_Use.png)

## Why Python.

Python stands out amongst other scripting languages based on two criteria: **simplicity** and **accessibility**.  
 
- **Simplicity**: Easy to write and easy to read. Since there is no compilation step in Python, the edit-test-debug cycle is incredibly easy & fast. Wanna see? Let's have a look - try running the line of code below.

In [8]:
print(Hello World!)

SyntaxError: invalid syntax (<ipython-input-8-898d2b40ca9e>, line 1)

> You should see a "Syntax Error" displayed above on screen, along with the line number and content that is causing said error. This run-report dynamic makes troubleshooting a breeze once you know the basics.

- **Accessibility**: Incredibly easy to create and/or use packages with user-made functions, and easy to integrate into high-level projects. Here's an example of an application with GUI entirely written in Python:

![Alt Text](./Imgs/WangLabBrowserGenome.png)

With an ever-growing collection of user-made packages for the majority of fields in science, and the ability to package and license/sell your projects, it is no wonder that Python is becoming the 'go-to' choice for many programmers across the globe.

## What will this tutorial teach YOU?

- [Basic syntax of Python: Keywords, Variables, and Data types](#Basic-Syntax.)
- [Operations](#Operations.)
- [Conditionals: If/elif/else statements](#Conditionals:-If/elif/else-statements.)
- [Lists](#Lists.)
- [Loops/cycles](#Loops/Cycles.)
- [Functions](#Functions.)
- [Read user input in Python](#Read-user-input-in-Python.)
- [Read from files in Python](#Read-from-files-in-Python.)
- [Packages](#Packages.)
- [Final exercise](#Final-Exercise.) (Skip to this part if you think you already know Python!)

***

## Basic Syntax.

We've all stumbled upon horribly written questions on Yahoo! Answers at some point. 
> **"Goten fet widin the lest 7 months, am i preganart?"**

While humans reading this work of art get what the user is asking, computers can't understand Python if you don't write it correctly - this is what **Syntax** is about. Python possesses a set of reserved **keywords** (e.g. if, for) and built-in functions (e.g. print, len) that tell the computer what to do when used correctly. For simplicity, this tutorial refers to both as keywords.

Below you will find a list of common keywords in Python. In this section, we cover examples on how to use the keywords outlined in bold - we'll cover the rest in later sections.

- **print**(ANYTHING)
- **str**(ANYTHING)
- **int**(ANYTHING)
- **float**(ANYTHING)
- range(NUMBER)
- len(LIST/STRING)
- if STATEMENT :
- elif STATEMENT :
- else :
- while STATEMENT :
- for ELEMENT/INDEX in LIST/RANGE :

#### -Print: Output a message on screen
The 'print' method will tell the computer to output whatever has been handed to it on the computer screen.

In Python 2.7:
> print "Hello World!"

In Python 3:

In [3]:
print("Hello World!")

Hello World!


We can use **variables** to store information, and then pass it over to methods such as 'print'. Note that variable names cannot begin with a number, and they cannot be identical to already existing keywords. Here are a few examples:

In [4]:
variable_1 = "Bee" #This is a comment, Python will ignore everything in the line after the pound symbol. Use comments to guide your future self and others who might read your code.
variable_2 = "Movie"
variable_3 = 2
print(variable_1, variable_2, variable_3)
print("should never happen.")

Bee Movie 2
should never happen.
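
The naming rules mentioned above can be checked with Python itself. The following is a minimal sketch that uses the standard-library `keyword` module and the string method `isidentifier()`; the candidate names are made up for illustration, and the loop it relies on is covered in a later section.

```python
import keyword

# Candidate variable names (hypothetical, for illustration only)
candidates = ["variable_1", "2nd_variable", "for", "movie_title"]

results = {}
for name in candidates:
    # A usable name must be a valid identifier AND must not be a reserved keyword
    usable = name.isidentifier() and not keyword.iskeyword(name)
    results[name] = usable
    print(name, "->", usable)
```

Names that start with a digit fail the identifier check, while reserved words such as `for` are caught by the keyword check.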


You can also store values in multiple variables within the span of a single line in the following way: 

In [5]:
variable_1, variable_2, variable_3 = "Bee", "Movie", 2
print(variable_1, variable_2, variable_3)
print("should never happen.")

Bee Movie 2
should never happen.


**PRACTICE 1.** Write a program that outputs the following message: *Hello world!*.

In [7]:
#Write your program here.


In the real world, you will often rely on consistent formatting conventions when reading from or writing to files. The 'print' method accepts optional keyword arguments that help out with formatting output. The keyword 'sep' is used to specify the separator between the terms provided to the print method. In our "Bee Movie 2" examples, you will notice that each variable is separated by a single space. Hence, the default separator of print is sep=' '. In the following line of code, we have changed the default separator of the first 'print' to the symbol *'@'* :

In [6]:
variable_1, variable_2, variable_3 = "Bee", "Movie", 2
print(variable_1, variable_2, variable_3, sep='@')
print("should never happen.")

Bee@Movie@2
should never happen.


The keyword 'end' is also used to determine the trailing character in the output of the 'print' method. By default, the 'print' method ends each output with the hidden new line character *'\n'*. In the following line of code, we have changed the default ending character of 'print' to a tab ('\t') :

In [7]:
variable_1, variable_2, variable_3 = "Bee", "Movie", 2
print(variable_1, variable_2, variable_3, end='\t')
print("should never happen.")

Bee Movie 2	should never happen.


However, 'print' has a catch: it can only output recognizable **data types**. A **data type** simply refers to the type of information that is being handled by a method and/or stored in a variable. Although Python possesses a plethora of data types (heck, you can even create your own), you can build a surprising number of programs in Python just by knowing and using five different data types:

- Strings
- Integers
- Floating numbers
- Booleans
- Lists

**Strings** are essentially a sequence of characters. Strings are usually encased within apostrophes or quotation marks (e.g. 'String' or "String") when provided to a method.

**Integers** are positive or negative whole numbers (i.e. ...,-1,0,1,2,3,...), typically used for counting purposes.

**Floating numbers** are numbers that possess a decimal point, typically used for mathematical operations. There are other data types similar to that of floating numbers that take up less space in memory, but we won't discuss them in this tutorial.

**Booleans** can only assume one of two values: True or False. 

**Lists** are arrangements of elements consisting of identical or varying data types. We will talk more about lists in later sections.
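
To see which data type a value belongs to, you can hand it to the built-in `type()` function. This is a small sketch with one arbitrary example value per data type described above; the variable names are only illustrative.

```python
a_string = "Bee Movie"
an_integer = 2
a_float = 88.657
a_boolean = True
a_list = ["Ok", 2, 7]

print(type(a_string))    # <class 'str'>
print(type(an_integer))  # <class 'int'>
print(type(a_float))     # <class 'float'>
print(type(a_boolean))   # <class 'bool'>
print(type(a_list))      # <class 'list'>
```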

Data types dictate the **operations** that are in turn allowed for the data. In the scope of this course, an **operation** refers to any type of manipulation that is performed on a piece of data. This includes events like concatenation, slicing, addition, multiplication, division, etc.

To understand how data types limit the operations available to a given data type, let's consider the following example. While you can perform the operation *1+2*, which equals 3, you can't perform something like *1+"Hello"*. In other words, while you can concatenate a string and a string, and sum a number and another number, you can't sum a number and a string. Luckily, Python possesses methods to force data type conversions when needed. 

- **str()** is a method that tries to convert data types into the string data type.

In [9]:
#Running this will print out a data type error
print("There were " + 11.0) 

TypeError: Can't convert 'float' object to str implicitly

In [10]:
#Using the str() method to convert 11.0 into a string ('11.0') fixes the error displayed above.
print("There were " + str(11.0))

There were 11.0


- **int()** is a method that tries to convert data types into the integer data type.

In [None]:
int(11.0)

- **float()** is a method that tries to convert data types into the floating number data type.

In [None]:
float(11)
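
The conversion methods above can behave in ways that are easy to miss, so here is a hedged sketch of a few typical cases. The `try`/`except` block at the end is not covered in this tutorial; it is used here only to show that a conversion can fail without stopping the program.

```python
print(int(11.9))      # 11: int() drops the decimal part, it does not round
print(float(11))      # 11.0
print(int("42") + 1)  # 43: the string "42" became a number
print(str(42) + "1")  # 421: the number 42 became a string

# Conversions fail when the text does not look like a number
try:
    int("eleven")
except ValueError as error:
    conversion_error = str(error)
    print("Conversion failed:", conversion_error)
```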

## Operations.
Okay, so now that we know about data types let's discuss basic mathematical operations in Python:

In [None]:
print(5 + 3)
print(5 - 3)
print(5 * 3)
print(5 ** 3)
print(5 / 3)
print(5 // 3) #This will only yield the integer portion of the quotient
print(5 % 3)  #This is called the modulo operator, it yields the remainder of a division. This is really useful - remember it.
              # 5 = 3(1) + 2 <- 
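
The floor division and modulo operators work as a pair, as hinted by the comment above. This short sketch shows how the two results rebuild the original number, along with one common use of modulo (checking whether a number is even):

```python
quotient = 5 // 3    # 1
remainder = 5 % 3    # 2

# The two results always rebuild the original number: 5 = 3 * 1 + 2
print(3 * quotient + remainder)   # 5

# A number is even when dividing it by 2 leaves no remainder
print(10 % 2 == 0)   # True
print(7 % 2 == 0)    # False
```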

**PRACTICE 2.** Write a program that calculates the average of four given numbers. The result must be an integer.

In [18]:
number_1, number_2, number_3, number_4 = 124, 276, 67, 22
#Write your program here.


122.25


## Conditionals: If/elif/else statements.

**Conditional statements** can be considered a set of rules that are followed only if a certain condition is met. Before getting into conditionals, we should first describe **structure** in Python. Python uses indentation as a way to discern blocks of code that are to be executed together if a given condition is met. In order to keep your code organized, we suggest using four spaces for each level of indentation (as in the examples below), and never mixing tabs and spaces within the same block.

- **if** : evaluates an overarching statement and performs an action if said statement is true.
> Note that conditionals often use comparison operators :
  - thing_1 **==** thing_2 : evaluates to **True** if two 'things' are equal.
  - thing_1 **!=** thing_2 : evaluates to **True** if two 'things' are NOT equal.

In [None]:
variable_1 = True
if variable_1 == True :
    print("Statement is true!")
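
Conditionals are not limited to == and !=. The sketch below shows the ordering comparisons (>, <, >=, <=), each of which evaluates to a boolean; the temperature value is made up for illustration.

```python
temperature = 30  # hypothetical value

print(temperature > 25)    # True
print(temperature < 25)    # False
print(temperature >= 30)   # True
print(temperature <= 29)   # False

# The result of a comparison can be stored and used in an 'if' later on
is_warm = temperature > 25
if is_warm :
    print("It is warm.")
```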

- **else** : if the statements before the *'else'* in question evaluate to false, indented code will be executed.

In [None]:
variable_1 = False
if variable_1 == True :
    print("Statement is true!")
else :
    print("Statement was not satisfied.")

- **elif** : Short for 'else if'. If the first 'if' statement evaluates to false, the next 'elif' statement is checked. If that 'elif' statement also evaluates to false, the next 'elif' statement is checked, and so on. Only the first branch whose statement evaluates to true is executed.

In [None]:
variable_1 = False
if variable_1 == True :
    print("Statement is true!")
elif variable_1 == False :
    print("Statement is false!")
elif variable_1 == 100 :
    print("The variable contained the value 100.")
else :
    print("Statement was not satisfied.")

#### Logical operators used in conditionals: **and** | **or**
- **and** : both conditions in question must be met in order for the statement to evaluate to True.

In [None]:
variable_1, variable_2 = True, True
if (variable_1 == True) and (variable_2 == True) :
    print("Both conditions are met.")

- **or** : at least one of two conditions in question must be met in order for the statement to evaluate to True.

In [None]:
variable_1, variable_2 = True, False
if (variable_1 == True) or (variable_2 == True) :
    print("At least one condition is met.")
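
To make the difference between the two operators concrete, here is a small sketch listing what each combination evaluates to, followed by a range check that combines two comparisons. The `age` value and the limits are hypothetical.

```python
print(True and True)    # True
print(True and False)   # False
print(False and False)  # False
print(True or False)    # True
print(False or False)   # False

age = 30  # hypothetical value
in_range = (age >= 18) and (age < 65)
print(in_range)         # True
```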

## Lists.
**Lists** are arrangements of elements consisting of identical or varying data types. They are the backbone of the majority of programs written in Python.

In [None]:
#A list in Python can be initialized in the following manner:
list_1 = ["a","b","c","d","e","f"]
list_2 = [ "Hello", "Bye", 88, True, 88.657, ["Ok",2,7] ] #Note that lists can also store other lists
print("list_1 =",list_1)
print("list_2 =",list_2)
print("list_1 + list_2 =",list_1 + list_2) #Yields a new list with the elements of both; list_1 and list_2 are left unchanged
list_1.append(list_2) #Adds content of list_2 to list_1 as a sub-list
print("list_1.append(list_2) =", list_1)
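
The difference between `+` and `append` is easy to mix up, so here is a minimal sketch using two short, made-up lists. It shows that `+` builds a new list, while `append` modifies the existing list and adds the other list as a single element.

```python
letters = ["a", "b", "c"]
numbers = [1, 2]

combined = letters + numbers   # builds a new list
print(combined)                # ['a', 'b', 'c', 1, 2]
print(letters)                 # ['a', 'b', 'c']  (unchanged)

letters.append(numbers)        # modifies letters in place
print(letters)                 # ['a', 'b', 'c', [1, 2]]
print(len(letters))            # 4: the sub-list counts as one element
```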

#### Slicing and indexing lists: Selecting elements within a list

One of the most common operations in lists is that of selecting its elements. **Indexing** refers to the selection of individual elements from a list, while **slicing** refers to the selection of a portion or range of a list. Note that neither indexing nor slicing alters the list being handled. Also, list indexes in Python start with 0 when counting from start to end, see image below.

![Alt Text](./Imgs/SlicingPython_2.png)

Have a try in the following cell:

In [13]:
list_1 = ["a","b","c","d","e","f"]
#Try indexing or slicing list_1


In [None]:
list_1 = ["a","b","c","d","e","f"]
list_2 = [ "Hello", "Bye", 88, ["Ok",2,7] ]

#Indexing examples
print("|Indexing examples|")
print("list_1[0] =", list_1[0]) #Prints first element within list_1
print("list_2[-1] =", list_2[-1]) #Prints last element of list_2. Last element of list_2 is a list.
print("list_2[-1][0] =", list_2[-1][0]) #Prints first element of list that was within list_2
print("------------")

#Slicing examples
print("|Slicing examples|")
print("list_1[:2] =", list_1[:2]) #Prints the first two elements
print("list_1[2:] =", list_1[2:]) #Prints everything after the first two elements
print("list_1[:-2] =", list_1[:-2]) #Prints everything except the last two elements
print("list_1[-2:] =", list_1[-2:]) #Prints the last two elements


#While lists were indexed and sliced, they were not altered
print("------------")
print("list_1 =", list_1)
print("list_2 =", list_2)

In [None]:
#Strings can also be treated as lists of characters
string = "Hello World!"
print("string[0] =",string[0])
print("------------")
print("string =", string)
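
Slicing works on strings in the same way it does on lists, and the `len` method from the keyword list reports how many characters (or list elements) there are. A short sketch using the same string as above:

```python
string = "Hello World!"

print(len(string))              # 12: spaces and punctuation count as characters
print(string[:5])               # Hello
print(string[-6:])              # World!
print(string[len(string) - 1])  # !  (last character, same as string[-1])
```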

## Loops/Cycles.
A **loop** is a sequence of instructions that is repeated until a certain condition is reached. There are two types of loops in Python: **while** loops and **for** loops.

- **while** loops keep running for as long as a condition holds. They usually rely on one or more counter variables that you must update yourself.

In [None]:
counter = 1
while counter <= 5 :
    print("Counter:",counter)
    counter += 1 #This is the same as writing counter = counter + 1

- **for** loops use a loop variable that automatically takes the next value of a range or list with each cycle of the loop.

In [None]:
for counter in range(1,6) : #range(1,5) would yield a counter that only goes up to 4. The end integer of the range is always excluded.
    print("Counter:",counter)
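
Since `range` does not show its numbers when printed directly, it helps to convert it into a list to see what a for loop will walk through. The sketch below does that for a few ranges; the optional third number (the step) is not used elsewhere in this tutorial.

```python
print(list(range(5)))         # [0, 1, 2, 3, 4]
print(list(range(1, 6)))      # [1, 2, 3, 4, 5]
print(list(range(0, 10, 2)))  # [0, 2, 4, 6, 8]

# range(len(...)) yields every valid index of a list
letters = ["a", "b", "c"]
indexes = list(range(len(letters)))
print(indexes)                # [0, 1, 2]
```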

Let's assume we have a list consisting of 1.2 million sub-lists, containing 5 elements each.

> list = [ ..., [element_1, element_2, element_3, element_4, element_5], [element_1, element_2, element_3, element_4, element_5], ... ]

Imagine that a company wants you to verify if one of the sublists contains a particular number within it. That's a lot, right? Luckily we can use for loops to easily traverse through lists in two different ways:

In [None]:
#Way 1: For loop increases counter i, which is in turn used to index list_1

#The company is looking for the ID number 6789
list_1 = [ ["Darling", "don't", "you", "miss", "me"], ["Darling", "don't", "you", "miss", "me"], ["Darling", "don't", "you", "miss", 6789]]
for i in range(len(list_1)) :
    if 6789 in list_1[i] :
        print("Match found in list number",i+1) #'i + 1' because indexing in Python starts at 0

In [None]:
#Way 2: For loop automatically indexes elements in each cycle

#The company is looking for the ID number 6789
list_1 = [ ["Darling", "don't", "you", "miss", "me"], ["Darling", "don't", "you", "miss", "me"], ["Darling", "don't", "you", "miss", 6789]]
counter = 1 #Counter was only used to report list position if a match is found
for element in list_1 :
    if 6789 in element :
        print("Match found in list number",counter) #No need for external operations thanks to user-controlled counter
    counter += 1
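
As an optional third way, the built-in `enumerate` function hands you both the position and the element in each cycle, which removes the need for a manual counter. This is only a sketch with a shortened, made-up list; `records` and `matches` are illustrative names.

```python
# Shortened, hypothetical data; the ID number 6789 sits in the second sub-list
records = [["a", "b", "c"], ["d", "e", 6789], ["f", "g", "h"]]

matches = []
for position, record in enumerate(records, start=1) : #start=1 makes positions begin at 1 instead of 0
    if 6789 in record :
        matches.append(position)
        print("Match found in list number", position)
```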

## Functions.
We can modularize programs by assigning repetitive tasks to **functions**. We can structure and define functions in Python in the following manner:

In [None]:
#This is a function that calculates the average of n numbers stored in a list
def calculateAverage(list_whatever) :
    sumNum = 0
    for i in range(len(list_whatever)) :
        sumNum +=  list_whatever[i]
    return sumNum/len(list_whatever)

#This is a function that calculates the population standard deviation, given a list of numbers and an average
def calculateSD(average, list_whatever) :
    sumNum = 0
    for number in list_whatever :
        sumNum += (number - average)**2 
    return (sumNum/len(list_whatever))**0.5 #Square root of the average squared deviation (i.e. of the variance)

list_1 = [57,67,65,67,69] #Initialize list_1 with values
average = calculateAverage(list_1) #Calls calculateAverage and stores returned value into 'average' variable
print("The average of list_1 is", average)
print("The SD of list_1 is", calculateSD(average, list_1))
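
Python's standard library ships a `statistics` module with ready-made versions of these calculations, which is handy for cross-checking hand-written functions. The sketch below uses the same numbers as the cell above. Keep in mind that the population standard deviation is the square root of the population variance (the average squared deviation from the mean).

```python
import statistics

values = [57, 67, 65, 67, 69]

mean = statistics.mean(values)           # average of the values
variance = statistics.pvariance(values)  # average squared deviation from the mean
sd = statistics.pstdev(values)           # square root of the variance

print("Mean:", mean)
print("Population variance:", variance)
print("Population standard deviation:", sd)
```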

**PRACTICE 3.** Make a program that utilizes **functions** to calculate the maximum value stored within a list.

In [None]:
list_1 = [57,47,65,76,69]

#Code to help you get started
maximum = list_1[0]
for number in list_1 :
    #Write your code here (think about > and < conditionals!)



## Read user input in Python.
Reading input from users is key to creating interactive terminal programs.

In [23]:
#Hit 'Enter' once a response is provided in the input space
end = False
while end == False :
    userInput = str(input("Do you want this program to continue running? (Y/y | N/n)")) #Reads user input and specifies data type
    if userInput in ['y','Y'] :
        print("Continue process...")
    elif userInput in ['N','n'] :
        print("Ending program...")
        end = True
    else :
        print("Error: Unknown input.")

Do you want this program to continue running? (Y/y | N/n)Y
Continue process...
Do you want this program to continue running? (Y/y | N/n)N
Ending program...
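
Users rarely type exactly what a program expects, so it often helps to clean up the text before comparing it. The following is a minimal sketch of that idea; `parse_answer` is a hypothetical helper written for this example, and it is called with fixed strings so it can run without waiting for keyboard input.

```python
def parse_answer(text) :
    #Remove surrounding spaces and ignore upper/lower case
    cleaned = text.strip().lower()
    if cleaned in ['y', 'yes'] :
        return True
    elif cleaned in ['n', 'no'] :
        return False
    return None #Unknown input

print(parse_answer(" Y "))    # True
print(parse_answer("no"))     # False
print(parse_answer("maybe"))  # None
```

In an interactive program, the value returned by `input(...)` would be handed to the helper instead of a fixed string.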


## Read from files in Python.
Being able to read content from files and properly organize it in lists is perhaps one of the most important skills in Python programming. Let's provide an example of a program capable of reading a tab delimited file (i.e. each column is separated by '\t') containing annotations of an organism's genome. The program will be splitting the annotations based on strands (can be '+' or '-').

> Chromosome Accession | Gene Name | Feature | Start | End | Score | **Strand** | Coverage | Comments

> NC_000913.3 	thrL	CDS 	190 	255 	. 	**+** 	0 	ID=cds0;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB/Swiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;Name=NP_414542.1;Ontology_term=GO:0009088;gbkey=CDS;gene=thrL;go_process=threonine biosynthetic process|0009088||;product=thr operon leader peptide;protein_id=NP_414542.1;transl_table=11

In [None]:
#It gets a little complicated from now on guys.
gff_file = './Files/RNA_FeaturesNoRedGeneNames.gff'

def readGff(gff_file) :
    splitContent = [ [],[] ] #We will store all of the '+' stranded genes in the first sublist, and all the '-' stranded genes in the second sublist
    with open(gff_file) as f:
        for line in f:
            splitLine = line.strip().split('\t') #Strip removes surrounding whitespace (including the newline character '\n'), and split splits a string into a list based on the separator used ('\t' in this case, since it is a tab delimited file)
            if len(splitLine) < 7 : #Skip header, comment, or incomplete lines that lack a strand column
                continue
            strand = splitLine[6].strip() #Remove stray spaces around the strand symbol before comparing
            if strand == "+" :
                #print("Test_1") #Checker
                splitContent[0] += [splitLine]
            elif strand == "-" :
                #print("Test_2") #Checker
                splitContent[1] += [splitLine]
    return splitContent

#Let's test our function
splitStrand = readGff(gff_file)
#print(splitStrand[0][0]) #This should yield the first entry of the first sublist (i.e. a single '+' stranded annotation)
print('-----------------------------------------------------------')
#print(splitStrand[1][0]) #This should yield the first entry of the second sublist (i.e. a single '-' stranded annotation)
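
If you do not have the annotation file at hand, the same read-strip-split pattern can be tried on a throwaway file. The sketch below writes two made-up, tab delimited lines (following the column layout shown above) to a temporary file and then sorts them by strand. All names and values in it are hypothetical and are not taken from the course file.

```python
import os
import tempfile

# Two made-up annotation lines, tab delimited, with the strand in the 7th column
lines = [
    "chr_demo\tgeneA\tCDS\t190\t255\t.\t+\t0\tID=demo0\n",
    "chr_demo\tgeneB\tCDS\t300\t420\t.\t-\t0\tID=demo1\n",
]

# Write the lines to a temporary file so the example needs no external data
with tempfile.NamedTemporaryFile("w", suffix=".gff", delete=False) as handle:
    handle.writelines(lines)
    demo_path = handle.name

plus_strand, minus_strand = [], []
with open(demo_path) as f:
    for line in f:
        fields = line.strip().split("\t")
        strand = fields[6].strip()  # tolerate stray spaces around the strand symbol
        if strand == "+":
            plus_strand.append(fields)
        elif strand == "-":
            minus_strand.append(fields)

os.remove(demo_path)  # clean up the temporary file

print("First + stranded gene:", plus_strand[0][1])
print("First - stranded gene:", minus_strand[0][1])
```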

## Packages.
In this section we will simply be exemplifying the art of importing packages, while mentioning some of the most popular packages in Python. We will make use of some of these packages in future courses.

In [None]:
#To import a whole package
import math as mt
print(mt.sqrt(4))

In [None]:
#To import a single method from a package
from math import sqrt
print(sqrt(4))

Additionally, you can save functions you have made in external files in Python, and import them within your program as packages.
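
As a rough sketch of that idea, the cell below writes a tiny file called `my_tools.py` (a hypothetical name) into a temporary folder and then imports it like any other package. In everyday use you would simply place the file next to your notebook or script, in which case the `sys.path` step is usually unnecessary.

```python
import os
import sys
import tempfile

# Create a throwaway folder holding a small module file (hypothetical name: my_tools.py)
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "my_tools.py"), "w") as f:
    f.write("def double(number):\n")
    f.write("    return number * 2\n")

# Python looks for importable files in the folders listed in sys.path
sys.path.insert(0, module_dir)

import my_tools  # imports the file we just wrote

print(my_tools.double(21))  # 42
```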

**Common packages:**

- Math : Part of Python's standard library, so no installation is needed. Expands upon Python's basic math methods.
- [Pandas](http://pandas.pydata.org/) : Great for handling tabular data sets.
- [Seaborn](https://seaborn.pydata.org/) : Our favorite standalone plotting and graphing package - outside those available for R.
- [Turtle](https://docs.python.org/2/library/turtle.html) : Simplest pixel-by-pixel art tool.
- [PyQt](https://www.riverbankcomputing.com/software/pyqt/download5) : Best documented package for GUI design. Great in combination with the C++ Qt Workflow.
- [Scikit-learn](http://scikit-learn.org/stable/) : Most complete machine learning library for Python.
- [BioPython](https://biopython.org/) : Great for simple tasks in the field of Bioinformatics.

## Final Exercise.
If you've done most of the exercises found throughout this tutorial, you're most likely ready to start writing your own programs. As a final test of your skills, we will be making a program requested by the hypothetical Pharmaceutical company *GotPharma?*. 

- *GotPharma?*'s request:
> "Using available tools, we have produced a series of genome annotation files for the bacterial strain, *E. imaginaris*. These tools are only able to annotate genic regions, yet we are interested in studying expression patterns found within intergenic regions. We need a tool that can add intergenic annotations to our genic annotation files."

![Alt Text](./Imgs/IntergenicExample.png)

In [1]:
gff_file = './Files/RNA_FeaturesNoRedGeneNames.gff' #See 'Read from files in Python' section for annotation info

def readGff(gff_file) :
    splitContent = [ [],[] ] #We will store all of the '+' stranded genes in the first sublist, and all the '-' stranded genes in the second sublist
    with open(gff_file) as f:
        for line in f:
            splitLine = line.strip().split('\t') #Strip removes surrounding whitespace (including the newline character '\n'), and split splits a string into a list based on the separator used ('\t' in this case, since it is a tab delimited file)
            if len(splitLine) < 7 : #Skip header, comment, or incomplete lines that lack a strand column
                continue
            strand = splitLine[6].strip() #Remove stray spaces around the strand symbol before comparing
            if strand == "+" :
                #print("Test_1") #Checker
                splitContent[0] += [splitLine]
            elif strand == "-" :
                #print("Test_2") #Checker
                splitContent[1] += [splitLine]
    return splitContent

gffContent = readGff(gff_file) #Runs the function above and stores result in a variable
#print(gffContent[0]) #All genes shown here will be within the positive (+) strand. Uncomment me and see!
#print(gffContent[1]) #All genes shown here will be within the negative (-) strand. Uncomment me and see!
#print(gffContent[0][0]) #This shows a single positive strand gene entry. Uncomment me and see!
#print(gffContent[1][0]) #This shows a single negative strand gene entry. Uncomment me and see!

#GFF Format:                                    v      v
#Chromosome Accession | Gene Name | Feature | Start | End | Score | Strand | Coverage | Comments
#Assume that the records in the gff file are ordered according to start and end positions
#for record in gffContent[0] : #Loop over positive strand
    #Write your code here.


#for record in gffContent[1] : #Loop over negative strand
    #Write your code here.


***
#### Course authors: 
- [Charles Sanfiorenzo](https://github.com/CharlesSanfiorenzo/Bioinformatics) - csanfior@caltech.edu
- [Victor Irizarry](#GithubLink) - victor.irizarry2@upr.edu