This is the page for Bio720 (2018), a practical introduction to fundamental computational skills for biologists. This is taught through the Biology Department at McMaster University, but most course materials are freely available to anyone interested.
Class time and location
Mondays 6-8PM, GS101
Background assumed for students
For this class we are not assuming students have a background in programming/scripting or in bioinformatics. We do assume that students have a working knowledge of basic molecular biology and genetics and basic familiarity with using computers, i.e. you can figure out how to install basic software on a Mac (OS X) or a PC (Windows).
What students will need
A laptop with internet access and the ability to install several programs (in particular, Python and a shell emulator if not using a Mac (OS X) or Linux).
The primary goal of this course is to provide graduate students with an opportunity to develop the fundamental computational skills needed to go on and (in the future) develop the appropriate (and more advanced) skills for bioinformatics, genomics, etc.
What this course is not
Because of time limitations (one two-hour lecture a week for 13-14 weeks), we are purposefully making this a course about fundamental skills. As such, this course will not cover in any detail:
- Genomic analysis pipelines (RNAseq, variant calling and population genomics). These are covered in the winter-spring in Bio722.
- Theory of computer science (nor theory of programming, algorithms, data structures, etc).
- Statistics. Despite using `R` for much of this course, it is most definitely not a statistics course. Bio708 (taught by Dr. Ben Bolker and Dr. Jonathan Dushoff) is such a course (also using `R` as the primary programming environment for statistical modeling).
- Bioinformatics itself (i.e. we will not teach any conceptual or theoretical background in bioinformatics. All examples will be real examples, but mostly to illustrate the computational skills necessary to run an analysis, not the why).
Topics (some TBD)
To keep things flexible depending on how the class goes, these topics are subject to change if necessary. We will discuss any changes in class.
A. Introduction to UNIX and the command line. (Brian)
- Introduction to basic shell commands, logging onto remote systems
- Standard UNIX utilities that make your day to day computer work (and bioinformatics) easier.
- Using pipes in UNIX (and the model of streaming data), batch processing of data.
- Writing shell tools.
- Using your UNIX skills for practical bioinformatic problems (probably setting up a BLAST database, and querying some sequences)
- (maybe) Regular expressions are your friend. No really. Using `grep` and its variants, and `awk` for file manipulation and processing.
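To give a flavour of the pipes, `grep` and `awk` topics listed above, here is a minimal sketch (not taken from the course materials). The file name `example.fasta`, its contents, and the variable names are all made up for illustration; it assumes a standard UNIX shell with `grep`, `cut`, `sort`, `uniq` and `awk` available.

```shell
#!/usr/bin/env bash
# Hypothetical example data: a tiny FASTA file created on the fly.
tmpdir=$(mktemp -d)
fasta="$tmpdir/example.fasta"
printf '>seq1 Homo sapiens\nACGT\nACGT\n>seq2 Mus musculus\nGGGCCC\n>seq3 Homo sapiens\nTTTT\n' > "$fasta"

# grep with a regular expression: count header lines (those starting with ">").
n_records=$(grep -c '^>' "$fasta")

# A pipeline: each tool reads the stream produced by the previous one.
# Keep the headers, drop the sequence ID, then tally the species names.
species_counts=$(grep '^>' "$fasta" | cut -d ' ' -f 2- | sort | uniq -c | sort -rn)

# awk: add up the length of every non-header line (the sequence itself).
total_bases=$(awk '!/^>/ { n += length($0) } END { print n }' "$fasta")

echo "records: $n_records"
echo "$species_counts"
echo "total bases: $total_bases"

# Clean up the temporary example data.
rm -r "$tmpdir"
```

Nothing in the pipeline needs the whole file in memory at once, which is the point of the streaming model: the same commands would work on a file far too large to open in an editor.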
B. Fundamentals of programming using `R` (Ian). Link to R portal for class
- Fundamentals of programming in R.
- How to avoid repetitive strain injury while programming. Control flow in `R` (`if else`, etc). Using the `apply` family of functions in `R`. Simple simulations.
- Working with data in `R`. Getting data in. Data munging (subsetting, merging, cleaning). Working with strings in `R`.
- Basics of plotting in `R`. Other topics TBD.
- Reproducible research using markdown for reports and git for version control.
- An Introduction to bioinformatic tools in R. Primarily an introduction to BioConductor, and genomic range data.
C. (This will likely not be taught this year.) Fundamentals of programming using
After successfully completing this course you will:
- Have a much higher degree of comfort using your computer!
- Be able to write custom UNIX shell scripts to do file copying, moving, editing, parsing and manipulation.
- Be able to write simple R programs to do simple simulations, data parsing (munging), plotting.
- Be able to perform computationally reproducible research, and use version control on your source code.
- Be able to utilize genomic range data and incorporate simple genomic features.
- Understand the fundamental framework of UNIX programs, scripting and why streaming data is so useful for genomics and bioinformatics.
- Know that troubleshooting for installing and using programs, and troubleshooting when writing and using code are normal. You will have developed some tenacity in dealing with such issues and have some ideas on how to approach finding solutions (including your *google-fu*).
You are responsible for ordering your own copies of these books. Both are excellent with only a small amount of overlap, but we are only highly recommending the first book (BDS) for this class. The reason for this is that this year we are only using UNIX (and shell scripting) and programming in `R`, which are both covered a bit in the BDS book.
Bioinformatics Data Skills, BDS. HIGHLY RECOMMENDED. This book fills an important gap in that it is oriented towards the day-to-day skills for anyone working in the fields of genomics and bioinformatics. In addition to covering the basic UNIX skills (and why we use UNIX in bioinformatics and genomics), it also covers subjects like overviews of the essential file types (`.gff`, etc) that are ubiquitous in the field. There is also a nice but brief introduction to the essentials of `R`, using Bioconductor and in particular range data, and two important chapters on how to organize computational projects (and maximize their reproducibility). Currently (August 29th 2018) this is ~$52.44 on amazon.ca. It is available as an e-book as well from the publisher. The author is still a PhD student (in population genomics), and wrote this in their first year of graduate school, so definitely worth supporting.
Practical Computing for Biologists. This book provides a nice, gentle introduction to the basic computational skills all biologists should have. In particular, it introduces the UNIX command line, shell scripting, basic `python` programming, regular expressions, working on remote machines and a few other topics. The book is written to be agnostic with respect to discipline (i.e. it is not a bioinformatics book per se), but does a great job of being both very accessible and immediately useful. It seems a bit pricey on Amazon.ca, but look around for used copies (it is 4 years old). If you plan to continue in computational research, this is a fantastic resource.
For Brian's section. This will have pertinent links to Brian's section of the course.
R tutorials and screencasts. A link to the exercises, in-class activities, and playlists for the screencasts I have put together for the `R` tutorials. I will also be putting assignments up here, and adding more as the semester progresses. Mostly we will be using the excellent DataCamp online interactive 'courses' for the introductory material, and moving on from there.
Week 1 - BDS Chapter 1, Chapter 2 pages 21-30, Chapter 3 pages 37-45, Chapter 4 pages 57-59.
Also check out here for a review of organizing computational data analysis projects. You don't need to read about version control or using markdown yet.
Week 2 - Chapter 3 pages 45-56, Chapter 7 pages 125-156. Maybe also worth looking at pages 395-398 in Chapter 12.
Week 3 - Chapter 7 pages 140-145 and 157-169 might be really useful. I also recommend the first tutorial on regular expressions listed here. This takes you gently through regular expressions and within an hour you will realize what amazing things you can do.
Week 6 - For more information on some of the basic file types used in genomics (.fastq, .SAM, .BAM) see chapter 10 (only 13 pages) and chapter 11 (pages 355-365). I also suggest reading chapter 6 (pages 109-115). Also, here is a link to a few tools for doing QC at various steps. Not meant to be comprehensive, but if you use some google, chances are someone has already written some tool for QC and sanity checks for some steps similar to your own. In case you want to see another example here is a tutorial on making a BLAST database and querying sequences with it.
Week 7 (Beginning R) - Start with DataCamp assignments (Courses Introduction To R and Intermediate R). Associated readings in BDS chapter 8 pages 175-206.