# Workshop 3

## User Defined Functions

As well as the functions built in to Python, sometimes we want to write our own functions.

There are two main reasons for this:
* If you want to repeat the same task lots of times, it is easier to use a function than to copy and paste the same piece of code.
* When you are writing longer scripts, the code is easier to read if it is divided into functions.

A function has an input and an output.  We define a function using the `def` command.

In [None]:
def myfunction(mystring):
    newstring = "%s is my string" % mystring
    return newstring

When you run this code, Python loads the function into its memory, but doesn't actually run the function - it is just stored ready to use.
The input for the function above is `mystring` and the output is `newstring`.

The `return` value of a function is another name for its output. You can assign the output of a function to a variable.

In [None]:
x = myfunction('hello')
print (x)

In [None]:
y = 'goodbye'
x = myfunction(y)
print (x)

To define functions, we need to remember the concept of `arguments` as the input to functions.

The names that you give to the arguments when you pass them to the function do not need to be the same as the arguments inside the function.  This can be slightly confusing.

In [None]:
def another_function(x, y):
    tot = x + y
    return (tot)

In [None]:
another_function(1, 2)

In [None]:
v = 1
w = 2
another_function(v, w)

In [None]:
v = 1
w = 2
another_function(x=v, y=w)

Just like for in-built functions, the function decides which argument is which based on either their order or the names inside the brackets when you call the function.
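
To make this concrete, here is a minimal sketch using a made-up `subtract` function (it is not part of the workshop material). Subtraction is used because swapping the arguments changes the answer, so it is easy to see which argument went where.

```python
# Hypothetical function for illustration: the order of x and y matters here
def subtract(x, y):
    return x - y

# Positional arguments are matched by their order
by_position = subtract(10, 3)

# Named arguments are matched by name, so their order no longer matters
by_name = subtract(y=3, x=10)

# Swapping the positional arguments changes the result
swapped = subtract(3, 10)

print(by_position, by_name, swapped)  # 7 7 -7
```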

**Exercise** 

Write a function which returns the mean of two numbers and call the function in Python.

*Bonus: Write and call a function which returns the mean of any number of integers, using a for loop and a list as an argument.*

Inside a function we can combine several steps.

For example, this function checks how many lines are inside a file.

In [None]:
def countLines(infile):
    lines = open(infile).readlines()
    x = 0
    for line in lines:
        x += 1
    return (x)

In [None]:
countLines('lines1.txt')

In [None]:
countLines('lines2.txt')

We can add to this function to instead return True if the number of lines is even or False if the number of lines is odd.

In [None]:
def countLinesOddEven(infile):
    lines = open(infile).readlines()
    x = 0
    for line in lines:
        x += 1
    if x % 2 == 0:
        iseven = True
    else:
        iseven = False
    return (iseven)

In [None]:
countLinesOddEven('lines1.txt')

In [None]:
countLinesOddEven('lines2.txt')

**Exercise**
* Modify the function to return the total number of characters in the file (you can use the `len` function to find the number of characters in a string).

*Bonus: Instead, return a list of the lengths of all the lines in the file*

It's also possible to generate functions which have no arguments, or functions which return nothing.  These are sometimes useful.

For example, if you always need the same list of column names for your file, you can use a function to get them each time.

In [None]:
def getColumnNames():
    return (["name", "start_position", "end_position", "strand"])

In [None]:
x = getColumnNames()
print (x)

If you do something to the output file inside the function you may not need to return anything.

In [None]:
def removeSpaces(infile, outfile):
    inf = open(infile).readlines()
    out = open(outfile, "w")
    for line in inf:
        out.write(line.replace(" ", ""))
    out.close()

In [None]:
removeSpaces('removespaces.txt', "nospaces.txt")

If you look at the file `nospaces.txt` you will see the result of running this function.
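
A function with no `return` statement still gives something back: the special value `None`. The sketch below shows this with a hypothetical variant of the function, here called `remove_spaces_with`. As an alternative to calling `close`, it opens both files in a `with` block, which closes them automatically. The file names and contents are made up for the example and are written to a temporary folder.

```python
import os
import tempfile

# Hypothetical variant of removeSpaces that uses a "with" block, so both
# files are closed automatically even if an error occurs part-way through
def remove_spaces_with(infile, outfile):
    with open(infile) as inf, open(outfile, "w") as out:
        for line in inf:
            out.write(line.replace(" ", ""))

# Make a small input file in a temporary folder (illustrative data only)
folder = tempfile.mkdtemp()
infile = os.path.join(folder, "example_in.txt")
outfile = os.path.join(folder, "example_out.txt")
with open(infile, "w") as f:
    f.write("a b c\nd e f\n")

result = remove_spaces_with(infile, outfile)
print(result)  # None, because the function has no return statement

# The work the function did is in the output file instead
with open(outfile) as f:
    cleaned = f.read()
print(cleaned)
```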

## Pandas

We will now briefly look at a specific Python module - `pandas` - for dealing with dataframes (tables of data), as it is often useful.

The convention is to rename `pandas` as `pd` when we import it.

In [None]:
import pandas as pd

Pandas can be used to read, write and parse tab or comma delimited tables.

These are text files where the columns are separated by either tabs or commas.

We can read a table using the `pd.read_csv` function.

In [None]:
comma_delim = pd.read_csv("commadelim.txt")

In Jupyter, if you just type the name of a `pandas` dataframe it will display it nicely.

In [None]:
comma_delim

For a tab delimited table we have to add an extra argument, to tell Python that the table is tab delimited.

`\t` represents a tab character in Python.

In [None]:
tab_delim = pd.read_csv("tabdelim.txt", sep="\t")

In [None]:
tab_delim
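
To see what the `sep` argument changes, here is a small sketch that does not need a file on disk. It assumes some made-up tab delimited text and uses `io.StringIO` so that pandas can read the text as if it were a file.

```python
import io
import pandas as pd

# Illustrative tab delimited text (made-up data)
text = "Name\tAge\tCity\nBob\t25\tOxford\nMary\t31\tLondon\n"

# With sep="\t" the three columns are recognised
example = pd.read_csv(io.StringIO(text), sep="\t")
print(example.shape)          # (2, 3): two rows, three columns
print(list(example.columns))  # ['Name', 'Age', 'City']

# Without sep="\t" pandas looks for commas, finds none and
# puts everything into a single column
wrong = pd.read_csv(io.StringIO(text))
print(wrong.shape)            # (2, 1)
```

If a table you have read in has only one column when you expected several, the wrong delimiter is a likely cause.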

We can easily sort tables in pandas with the `sort_values` method.

In [None]:
tab_delim = tab_delim.sort_values('Age')

In [None]:
tab_delim

It's also easy to delete or add a row or column.

We can delete rows or columns using the `drop` method.  It refers to rows as axis 0 and columns as axis 1.

We access rows using the row label (the index) shown on the left side of the table.

In [None]:
tab_delim = tab_delim.drop(0, axis=0)

In [None]:
tab_delim

We access columns using the column names.

In [None]:
tab_delim = tab_delim.drop('Name', axis=1)

In [None]:
tab_delim

To add a column or row, we can use a list.

For columns, we give the column name we want to add in square brackets after the table variable name.

In [None]:
tab_delim['Name'] = ['Bob', 'Mary']

In [None]:
tab_delim

For rows, we do the same but we add `.loc` before the first square bracket.

In [None]:
tab_delim.loc[0] = [33, 'Cambridge', 'Katy']

In [None]:
tab_delim

We access the data in the rows and columns in a similar way.

In [None]:
tab_delim['Age']

In [None]:
tab_delim.loc[2]
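
Note that `.loc` looks rows up by their label, not by their current position in the table, and the two can differ after sorting or dropping rows. The sketch below uses a small made-up table to show this. It also uses `.iloc`, which is not covered elsewhere in this workshop and selects rows by position instead.

```python
import pandas as pd

# Illustrative table; the row labels 0, 1, 2 are created automatically
people = pd.DataFrame({"Name": ["Bob", "Mary", "Katy"],
                       "Age": [40, 25, 33]})

# Sorting reorders the rows but each row keeps its original label
by_age = people.sort_values("Age")
print(list(by_age.index))      # [1, 2, 0]

# .loc uses the label: label 0 is still Bob, although he is now the last row
print(by_age.loc[0, "Name"])   # Bob

# .iloc uses the position: position 0 is now the youngest person
print(by_age.iloc[0]["Name"])  # Mary
```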

If we want to use this data, for example in a loop, it can be easier to convert it to a list first.  The list will always be in the same order as the data in the table.

In [None]:
ages = list(tab_delim['Age'])
print (ages)

In [None]:
bobs_data = list(tab_delim.loc[2])
print (bobs_data)

To write a table to file, we use the table name plus the `to_csv` method.

In [None]:
tab_delim.to_csv("mytable.tsv", sep="\t")

Regardless of the input file format, the output file will be comma delimited unless we specify `sep="\t"`.
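
By default `to_csv` also writes the row labels as an extra first column. The sketch below, using a small made-up table, shows the difference the `index=False` argument makes. It writes to `io.StringIO` objects rather than real files so that the output can be printed straight away.

```python
import io
import pandas as pd

# Illustrative table (made-up data)
table = pd.DataFrame({"Name": ["Bob", "Mary"], "Age": [25, 31]})

# Default behaviour: the row labels 0 and 1 are written as a first column
with_index = io.StringIO()
table.to_csv(with_index, sep="\t")
print(with_index.getvalue())

# index=False leaves the row labels out
without_index = io.StringIO()
table.to_csv(without_index, sep="\t", index=False)
print(without_index.getvalue())
```

Leaving the labels out is often what you want if the file will be read by another program that does not expect them.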

**Exercise**
* Make a tab delimited table in a text editor and save it in the folder with your notebook.  Make sure there are column headings and at least one column is numerical (integers or floats).
* Read the table into Python
* Sort the table by the numerical column
* Add a new column in Python
* Add a new row in Python
* Delete a column
* Add a column
* Output the new table to a text file.

*Bonus: Try to make an additional column with a transformation applied to the numerical column, e.g. add 1 to all the numbers and put this in a new column.*

# Plotting

It is possible to draw graphs directly in Python, which makes it much easier to update the graphs when you update your data.

To do this we'll use a module called `matplotlib`.

The simplest way to use matplotlib is through its `pyplot` interface, which is conventionally imported as `plt`.

In [None]:
import matplotlib.pyplot as plt

In [None]:
import seaborn as sns

First we'll make two lists of random integers, so that we have something to plot.

We need to import the random module to be able to do this.

In [None]:
import random

We can generate the numbers in a for loop and store them in a list.

In [None]:
data1 = list()
data2 = list()
for i in range(0, 50):
    data1.append(random.randint(0, 100))
    data2.append(random.randint(0, 100))

In [None]:
print (data1)

In [None]:
print (data2)

The simplest `plt` command plots a line graph of a dataset.

In [None]:
# plot the first set of data points
plt.plot(data1)

# Remove extra axis
sns.despine()

# Save the file as "test.png"
plt.savefig("test.png", bbox_inches='tight')

# show the plot and remove it from memory
plt.show()


We can edit this plot in various ways.  We need to regenerate the plot each time, because after using plt.show() matplotlib no longer stores the plot in memory.  This also means we always need to save the plot before displaying it.
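
An alternative, not used in the rest of this workshop, is to keep the figure in a variable using `plt.subplots`, so that it is explicit which plot is being saved. The sketch below is only an illustration: the data and the file name are made up, and the `matplotlib.use("Agg")` line is an assumption for running the code as a script without a display (leave it out in Jupyter).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit this line in Jupyter
import matplotlib.pyplot as plt

values = [3, 7, 1, 8, 5]  # illustrative data

# plt.subplots returns the figure and its axes as variables
fig, ax = plt.subplots()
ax.plot(values)
ax.set_xlabel("Day")
ax.set_ylabel("Frequency")

# Save through the figure variable, here into a temporary folder
outpath = os.path.join(tempfile.mkdtemp(), "example.png")
fig.savefig(outpath, bbox_inches="tight")

# Close the figure to free its memory once it has been saved
plt.close(fig)
```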

We can easily add additional datasets to the plot.

In [None]:
# plot the first set of data points
plt.plot(data1)

# plot the second set of data points
plt.plot(data2)

# Remove extra axis
sns.despine()

# Save the file as "test.png"
plt.savefig("test.png", bbox_inches='tight')

# show the plot and remove it from memory
plt.show()

We can also label the axes, set the axis limits and add a title.


In [None]:
# plot the first set of data points
plt.plot(data1)

# plot the second set of data points
plt.plot(data2)

# Remove extra axis
sns.despine()

# Label the x and y axis
plt.xlabel("Day")
plt.ylabel("Frequency")

# Set the axis limits
plt.xlim([0, 30])
plt.ylim([0, 150])

# Add a title
plt.title("My Graph")

# Save the file as "test.png"
plt.savefig("test.png", bbox_inches='tight')

# show the plot and remove it from memory
plt.show()


You can change the colours using the argument `color`.

In [None]:
# plot the first set of data points
plt.plot(data1, color='red')

# plot the second set of data points
plt.plot(data2, color='blue')

# Remove extra axis
sns.despine()

# Label the x and y axis
plt.xlabel("Day")
plt.ylabel("Frequency")

# Set the axis limits
plt.xlim([0, 30])
plt.ylim([0, 150])

# Add a title
plt.title("My Graph")

# Save the file as "test.png"
plt.savefig("test.png", bbox_inches='tight')

# show the plot and remove it from memory
plt.show()

You can set the positions and labels on the `ticks` on the x and y axis.

In [None]:
# plot the first set of data points
plt.plot(data1, color='red')

# plot the second set of data points
plt.plot(data2, color='blue')

# Remove extra axis
sns.despine()

# Label the x and y axis
plt.xlabel("Day")
plt.ylabel("Frequency")

# Set the axis limits
plt.xlim([0, 30])
plt.ylim([0, 150])

# Set the x and y tick positions and labels
plt.xticks([0, 15, 30], ['Start', 'Middle', 'End'])
plt.yticks([0, 100, 150], ['Low', 'Medium', 'High'])

# Add a title
plt.title("My Graph")

# Save the file as "test.png"
plt.savefig("test.png", bbox_inches='tight')

# show the plot and remove it from memory
plt.show()

**Exercise**
* Generate three sets of 100 random integers.
* Plot them using `matplotlib`.
* Change the axis labels and title.
* Change the colours.
* Change the positions and labels of the ticks on the y axis.
* Save the plot.
* View the plot.

 *Bonus: Store the three sets of integers in a dictionary and use the dictionary in the `plt` function calls.  Use a second dictionary to store the colour for each data series*

You can also generate other types of plot and change the parameters in a similar way.

For a bar chart, you need to generate a `range` to set the positions of the bars on the x axis.

In [None]:
n = range(0, len(data1))
plt.bar(n, data1, color='red')
plt.title("Bar Chart")
sns.despine()
plt.show()

For a scatter plot, you always need two data series and you provide the x positions as the first argument and the y positions as the second argument.

In [None]:
plt.scatter(data1, data2)
plt.title("Scatter Plot")
sns.despine()
plt.show()

For a pie chart we'll use the first few data points in the list.

In [None]:
plt.pie(data1[0:6])
plt.axis('equal') # make the x and y axis the same length
plt.show()

**Exercise**

Generate your own bar chart, pie chart and scatter plot.

*Bonus: Use the `plt.hist` function to generate a histogram*

There is much more information about matplotlib in another tutorial I wrote (https://cgatoxford.wordpress.com/2017/05/10/matplotlib-tutorial/) and on the matplotlib website (https://matplotlib.org/).

# Other Useful Methods

## Strings

`split`: Split a string into a list based on a character in the string.

In [None]:
x = 'This_Is_My_String'
xbits = x.split("_")

In [None]:
print (xbits)

`join`: Join a list of strings together using a separator string.  This works the other way around from `split`: `join` is a method of the separator, and the list is passed as the argument.

In [None]:
y = "!".join(xbits)

In [None]:
print (y)
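
The short sketch below, using a made-up list, shows that `join` and `split` undo each other when they use the same separator. It also shows that `join` only works on strings, so numbers need converting with `str` first.

```python
parts = ["This", "Is", "My", "String"]

# The separator string comes first and the list is the argument
joined = "_".join(parts)
print(joined)  # This_Is_My_String

# Splitting on the same separator gives back the original list
same = joined.split("_") == parts
print(same)  # True

# join only accepts strings, so convert numbers first
numbers = [1, 2, 3]
as_text = ",".join(str(n) for n in numbers)
print(as_text)  # 1,2,3
```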

`replace`: Replace part of the string with something else

In [None]:
z = y.replace("!", "!!!!!!")

In [None]:
print (z)

`find`: Find the position of a substring in a string (gives the first position only).

In [None]:
print(z.find("!"))
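
Two details of `find` are worth knowing: positions are counted from 0, and the result is -1 if the substring is not there. A minimal sketch with a made-up string:

```python
text = "This!Is!My!String"

first = text.find("!")
print(first)    # 4: positions are counted from 0

missing = text.find("?")
print(missing)  # -1: the substring was not found

# An optional second argument gives the position to start searching from
second = text.find("!", first + 1)
print(second)   # 7
```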

`count`: Count the number of times a substring occurs in a string.

In [None]:
print(z.count("!"))

`upper` makes everything upper case

In [None]:
z.upper()

`lower` makes everything lower case

In [None]:
z.lower()

## Floats

`round`: rounds the float to a certain number of decimal places

In [None]:
x = 0.035125

In [None]:
round(x, 5)

`math.ceil` rounds a float to the integer above.

In [None]:
import math
math.ceil(x)

`math.floor` rounds a float to the integer below.

In [None]:
math.floor(x)
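
"Above" and "below" matter for negative numbers: `math.ceil` always moves towards positive infinity and `math.floor` towards negative infinity, whereas converting with `int` simply cuts off the decimal part. A small sketch with made-up values:

```python
import math

up = math.ceil(2.1)
down = math.floor(2.9)
print(up, down)          # 3 2

# For negative numbers, "above" means closer to zero
neg_up = math.ceil(-2.1)
neg_down = math.floor(-2.1)
print(neg_up, neg_down)  # -2 -3

# int() cuts off the decimal part, so it rounds towards zero
cut = int(-2.9)
print(cut)               # -2
```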

## Lists

`extend` adds all the items of one list to the end of another list

In [None]:
L1 = ['a', 'b', 'c', 'd']
L2 = ['A', 'B', 'C', 'D']

In [None]:
L1.extend(L2)
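
It is easy to mix up `extend` and `append`.  The sketch below, with made-up lists, shows the difference: `extend` adds each item separately, while `append` adds the whole second list as a single item.

```python
# extend adds the items one by one
letters = ['a', 'b']
letters.extend(['c', 'd'])
print(letters)  # ['a', 'b', 'c', 'd']

# append adds the whole list as one new item (a list inside a list)
nested = ['a', 'b']
nested.append(['c', 'd'])
print(nested)   # ['a', 'b', ['c', 'd']]

print(len(letters), len(nested))  # 4 3
```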

`enumerate` automatically generates an index for each item in a list.

In [None]:
for item in enumerate(L1):
    print (item)
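
Each item produced by `enumerate` is a pair containing the index and the value.  It is common to unpack the pair into two variables directly in the `for` statement, as in this small sketch with a made-up list.  The optional `start` argument changes the number the index begins at.

```python
names = ['Bob', 'Mary', 'Katy']

# Unpack each (index, value) pair into two variables
lines = []
for i, name in enumerate(names):
    lines.append("%i: %s" % (i, name))
print(lines)  # ['0: Bob', '1: Mary', '2: Katy']

# start=1 makes the index begin at 1 instead of 0
numbered = list(enumerate(names, start=1))
print(numbered)  # [(1, 'Bob'), (2, 'Mary'), (3, 'Katy')]
```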

## Dictionaries

`update` combines two dictionaries (the second dictionary will replace items in the first with the same key)

In [None]:
x = {'a': 1, 'b': 2, 'c': 4}
y = {'c': 6, 'd': 8}

In [None]:
x.update(y)

In [None]:
print (x)

# Some common modules to look up

* re - regular expression operations: https://docs.python.org/3/library/re.html
* os - operating system functions: https://docs.python.org/3/library/os.html
* shutil - file operations (e.g. copying and moving files): https://docs.python.org/3/library/shutil.html
* glob - finding files using wildcards: https://docs.python.org/3/library/glob.html
* importlib - additional import options: https://docs.python.org/3/library/importlib.html
* math - mathematical functions: https://docs.python.org/3/library/math.html
* sys - system options and parameters: https://docs.python.org/3/library/sys.html

# Final Exercise

* Read the tab delimited table `student_data.txt` into Python using pandas
* Display the table in jupyter
* Import the module `numpy` and use the `numpy.mean` function to calculate the mean age and mean score.
* Write a function called `score_percentage` which returns the score as a percentage of 50.
* Using a `for` loop calculate the score percentage for each person and store it in a list.
* Add this list to a table column called "Score_Percentage"
* Plot the score percentage using `matplotlib`, with student names as labels on the x axis and score percentage on the y axis
* Write a function to assign each student a grade of "A" if their score percentage is greater than or equal to 70, "B" if their score percentage is greater than or equal to 60 but less than 70 and "C" if their score percentage is less than 60.
* Use this function to generate a "Final_Grade" column and add it to the table
* Plot a bar chart using matplotlib to show how many students got an "A", how many got a "B" and how many got a "C".  Change the colour and save the bar chart.
* Assign each student a random ID number using the random.randint function and add these to the table.
* Store the table in a tab delimited output file.
* Using string formatting and a `for` loop, make an output file called [Name].txt for each student which says "Dear [Name], your score was [score_percentage] and your grade was [Final_Grade]", based on the data in the table (fill in the square brackets with the correct data).

# Resources

**Online Material**

https://docs.python.org has the official documentation for all Python functions and modules.

* Software Carpentry: http://swcarpentry.github.io/python-novice-inflammation/

    Beginner's Python course with some additional content (especially debugging)


* Codeacademy http://www.codeacademy.com

    Interactive course for beginners


* Learn Pandas: https://bitbucket.org/hrojas/learn-pandas

    Lots of additional information about using pandas dataframes


* Numpy: https://docs.scipy.org/doc/numpy/user/quickstart.html

    Arrays, matrices and numerical operations


* Datacamp: http://www.datacamp.com

    Lots of courses about specific aspects of Python and other languages.  Not free.


* Learn Python http://www.learnpython.org


* Coursera (http://www.coursera.org/courses?query=python), EdX (https://www.edx.org/learn/python) and YouTube have lots of different courses.

    I would recommend:

    http://www.coursera.org/learn/python For beginners

    https://www.edx.org/course/introduction-to-computer-science-and-programming-using-python-0 Slightly more complex and computer science orientated




**Forums**
* Stack Overflow: http://www.stackoverflow.com

    Scary but comprehensive
 
 
* Reddit Learn Python http://www.reddit.com/r/learnpython/
    
    More accessible

**Exercises**
* http://www.reddit.com/r/dailyprogrammer/
* http://www.leetcode.com/problemset/all/
* http://www.codingbat.com/python
* http://rosalind.info

**Books**
* Learn Python the Hard Way http://www.souravsengupta.com/cds2015/python/LPTHW.pdf
* Python for Biologists http://userpages.fu-berlin.de/digga/p4b.pdf and Advanced Python for Biologists 
* Learn Python http://mmc.geofisica.unam.mx/edp/Herramientas/Lenguajes/Python/Learning%20Python,%205th%20Edition.pdf