# Bootcamp day 1
- who are we
- what are we trying to teach you
- what we **can't** teach you in this workshop
  - and why we should pay someone to come teach you that

# Python and R
**Python**
 - full-fledged general purpose programming language
 - 2nd most popular language (depending on who you ask)
 - "Not the best at any one thing, but at least second best in everything", "a glue language" -- somebody probably
 
  - two things Python is very good at:
    1. the web
      - used for Dropbox, YouTube, Instagram, Pinterest, many other sites
    2. science
      - many popular libraries: NumPy, SciPy, scikit-learn, Jupyter
      - many libraries for specific fields

**R**


# A brief digression: Jupyter notebooks
* what are they?
 - documents that you read in a browser
 - that let you mix code and text
 - and program in multiple languages
* why would you use them?
 - to present your results to other people
 - to do exploratory analysis, so you don't have to write a bunch of little scripts and run them from the command line
 - for interactive tutorials, like this one

# Outline:
  - Preliminary Python
    - ye olde Print function
    - Variables, How Do They Work?
    - Objects
  - A Case Study
    - parsing a csv file
    - rolling your own (functions)

# Python preliminaries

## Displaying text
To display text, we use the ```print``` function.

A _function_ is a reusable piece of code.
Python comes with a lot of built-in functions. That's part of its 'batteries included' philosophy. You can also write your own functions.
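As a rough sketch of what writing your own function looks like, the cell below defines a made-up function called `shout` (the name is for illustration only, not part of this workshop's code) and then calls it.

```python
def shout(text):
    """Return the text in upper case with an exclamation mark added."""
    return text.upper() + "!"

# calling the function: the argument goes inside the parentheses
message = shout("p < 0.05")
print(message)
```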

When you call a function, you pass it _arguments_ in parentheses, as shown in the cell below.

To execute the code in the cell, click on it so that it is surrounded by a green box, and then hit ```shift``` and ```enter``` at the same time.

In [1]:
print("p < 0.05")

p < 0.05


Does an argument always have to be in quotes? Let's experiment and see.

In [2]:
print(significance)

NameError: name 'significance' is not defined

Okay, so that gave us an error.
What is a ```NameError```? I'll explain that in a second.

What you should know is that whenever you put something in quotes, you're telling Python:
"Hey, this is text. Don't try to interpret it as part of a program."

The computer jargon word for text is a ```string```, but no one remembers why (http://stackoverflow.com/questions/880195/the-history-behind-the-definition-of-a-string).

In [3]:
print("Hey, I'm a string")
print('Hey I\'m also a string but I\'m enclosed by single quotes') # Notice backslash to "escape" apostrophe 


"""
Hey I'm a multiple line string.
That's why I'm enclosed with three quotes.
I'm often used for something called doc strings.
You'll find out about them later.
"""

Hey, I'm a string
Hey I'm also a string but I'm enclosed by single quotes


"\nHey I'm a multiple line string.\nThat's why I'm enclosed with three quotes.\nI'm often used for something called doc strings.\nYou'll find out about them later.\n"

# Variables: like algebra, but not really

Many times you want to store a value in the computer's memory so you can use it later.
To store values somewhere, we assign them to what are called _variables_.
Just like in math, a _variable_ can take on whatever value we want it to.
But most of the time, a variable in a computer program just holds one specific value at a time (unlike in math, where we use variables to solve entire classes of problems).

To stick some value inside a variable, we use the _assignment operator_.
That's fancy computer science jargon for the equals sign.
Like so:

In [4]:
significance = 0.04 # at least, hopefully <-- by the way, this is a comment. your comments should be more useful.

Note that the equals sign is _not_ acting like it does in a math equation.

It's saying: 'take some chunk of memory that I will refer to with the name on the left side of the equals sign, and then put the value on the right side of the equals sign inside that chunk of memory'.
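One way to see that `=` is assignment rather than a math equation is a line like `count = count + 1`, which would be nonsense in algebra but is perfectly fine in Python. A minimal sketch (the variable `count` is just an example name):

```python
count = 1
# the right side is evaluated first (1 + 1), then the result is stored under the name on the left
count = count + 1
print(count)
```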

Notice that you can `print` a variable just like we did with a string.

In [5]:
print(significance)

0.04


Okay, so now we don't get that ```NameError``` anymore.

But ... why?


A ```NameError``` is the Python interpreter's way of saying that it thinks you gave it the name of an object, but the interpreter doesn't find that name in its list of objects.

So once we **defined** a variable named `significance`, using the assignment operator, the interpreter was able to find that object and `print` its value.
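To watch the interpreter go from not knowing a name to knowing it, here is a small sketch. It uses `try` and `except`, which we haven't covered yet; all they do here is catch the error so the cell keeps running. The name `effect_size` is made up for the example.

```python
try:
    print(effect_size)  # not defined yet, so this raises a NameError
    found_before = True
except NameError as error:
    print("NameError:", error)
    found_before = False

effect_size = 0.8   # the assignment defines the name
print(effect_size)  # now the interpreter can find it
```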


# everything you need to know about object-oriented programming (for now)

An _object_ is a high falutin' computer term that comes from an even higher falutin' sub-field called _object-oriented programming_.

Everything in Python is an object.

So if you have some vague idea of what objects are, it will help you use other people's code and write your own.

Here's all you have to know about objects for now:

* you can have _classes_ of objects
 - for example, you could have a class called "car"
 - if you type

`car1 = car()`

then you would have created a variable that contains an _instance_ of the "car" class.

* Objects have _properties_
 - for example, your car might have a "speed" property

`print(car1.speed)`

`45`

* Objects have _methods_
 - a method is like a *function* that "belongs" to a class of objects
 - for example, your car might have an "accelerate" method that increases the speed by some amount.


`car1.accelerate(amount=3)`

`print(car1.speed)`

`48`
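
The `car` class above is imaginary, but here is a minimal sketch of how it could be written so that the snippets run. The starting speed of 45 and the details of `accelerate` are assumptions chosen to match the numbers above. (Python programmers usually capitalize class names, as in `Car`; lowercase is kept here to match the text.)

```python
class car:
    """A hypothetical car, just for illustration."""

    def __init__(self):
        self.speed = 45  # a property: data stored on each instance

    def accelerate(self, amount):
        """A method: a function that belongs to the class."""
        self.speed = self.speed + amount

car1 = car()               # create an instance of the class
print(car1.speed)
car1.accelerate(amount=3)  # call a method on the instance
print(car1.speed)
```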

Strings are instances of a class in Python, called `str`.
One property of a string would be the actual sequence of characters you assign to it.
One method of a string is `title`.
Let's explore that a bit.
First we'll assign a string to a variable.

In [6]:
book = "mostly harmless"

We can use the ```type``` function to find out what kind of object our variable is.

In [7]:
type(book)

str

We call a method of an object by following the object's name with a period, then the method's name and a pair of parentheses.

In [None]:
book.title() # the title method capitalizes words in a string as if the string were the title, e.g., of a book

Notice that this is just like when we called the `print` function, except that the `title` method doesn't take any arguments.
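
A few more string methods, as a sketch. All of these belong to the built-in `str` class; unlike `title`, some of them take arguments. Note that none of them change `book` itself: each call returns a new string.

```python
book = "mostly harmless"

titled = book.title()                          # capitalize each word
shouted = book.upper()                         # all upper case
swapped = book.replace("harmless", "armless")  # this method takes two arguments

print(titled)
print(shouted)
print(swapped)
print(book)  # the original string is unchanged
```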

You can see the methods and properties of an object by calling the `dir` function on it.

In [8]:
dir(book)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
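
The names surrounded by double underscores are special methods that Python uses behind the scenes. If you only want the everyday ones, one possible way to filter the list is sketched below. It uses a *list comprehension*, a compact kind of loop that we haven't covered yet; `public_names` is a made-up variable name.

```python
book = "mostly harmless"

# keep only the names that don't start with an underscore
public_names = [name for name in dir(book) if not name.startswith("_")]
print(public_names[:5])
```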

Notice that everything in Python is an object.

In [9]:
type("Oh I wonder what kind of object I am")

str

In [10]:
type(4)

int

In [11]:
type(str) # meta

type

In [12]:
type(type) # uber-meta

type

Okay, that's enough preliminary stuff. Let's do some actual coding.

# Literary fiction interlude:

Your name is Lalama Evans, half-Hawaiian, former server at Ruby Tuesdays, and currently a psychology graduate student at the Metropolitan University of Fruitville, Florida.

Your advisor, who shall remain nameless, looks exactly like the advisor in Ph.D. comics.

He wants your project to involve the neurotransmitter serotonin, mainly because he finds it really interesting that this amine is also found in plants, and he rambles on about it constantly. Especially at departmental mixers. Especially if he has had too much wine.

You on the other hand would like your project to be about something useful that will help you get a job after grad school.

So you compromise: you'll study something useful that happens to involve serotonin.

You are also a bit of a hacker, because you had a good teacher for computing class in middle school. She was also the mayor of your small town but that's a story for another day.

You figure that you are probably going to study serotonin in some strains of mice, so you go to the Jackson labs website, where they have a ton of data about different strains.

A lot of the data is in files that have the _comma-separated value_ format, called csv files for short. You start by downloading a csv file that has information about serotonin receptor levels in the brains of different mouse strains.

#  Parsing a csv file, part 1

In [14]:
import urllib.request
import csv

url_for_file = "http://phenome.jax.org/tmp/Wiltshire3_means.csv"
with urllib.request.urlopen(url_for_file) as response:
   csv_file = response.read().decode('utf-8')

reader = csv.reader(csv_file, delimiter=',')
parsed_file = list(reader)

### So what did we just do?

First we `import`ed two modules (sets of functions).

You can `import` wherever you want in a script, but it is considered good form to put your list of `import` statements at the top.

There are three common ways to use the import command.

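1. wholesale:
    - e.g., `import csv`
    - this loads the whole module, and you reach its functions through the module's name, e.g., `csv.reader`
    - for a package that has "sub-modules", such as `urllib`, name the sub-module you want, e.g., `import urllib.request`
2. selective:
    - e.g., `from urllib import request, parse`
    - this lets you load only the sub-modules you want to use. Convenient if you only need those and you don't want to type the whole name of the module followed by its sub-module every time.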
3. abbreviated
    - e.g., `import numpy as np`
    - So now you can type `np.mean()` instead of `numpy.mean()`
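
The three styles side by side, as a sketch that only uses modules that ship with Python. The standard-library `statistics` module stands in for `numpy` here so the cell runs without installing anything, and the URL is a made-up example.

```python
# 1. wholesale: import the module, then use its full name
import csv
rows = list(csv.reader(["a,b,c"]))

# 2. selective: pull a specific sub-module out of a package
from urllib import parse
parts = parse.urlparse("http://example.com/data/file.csv")

# 3. abbreviated: give the module a shorter nickname
import statistics as st
average = st.mean([1, 2, 3])

print(rows, parts.path, average)
```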

The `urllib` library lets us get stuff off the web. You don't have to know anything else about this for now.

We used it to load the file from phenome.jax.org into the variable `csv_file`.
Let's see what's in that variable.

In [15]:
csv_file

'"// Phenotype data set from Mouse Phenome Database (phenome.jax.org)",,,,,,,,\n"// Data set: Wiltshire3    Title: Drug study: Neurobiochemical analytes in response to chronic fluoxetine treatment in males of 30 inbred mouse strains    Year: 2011",,,,,,,,\n"// List of strain means and summary statistics",,,,,,,,,,,\n"// For more info on these data visit phenome.jax.org and type Wiltshire3 into search box",,,,,,,,,,\n,,,,,,,,,,,\nmeasnum,varname,strain,sex,mean,sd,sem,nmice,cv,zscore,,,\n38101,ACTH_cont,"129S1/SvImJ",m,1.80,0.0660,0.0381,3,0.0366,-1.40\n38101,ACTH_cont,"A/J",m,1.86,0.0710,0.0410,3,0.0382,-1.17\n38101,ACTH_cont,"AKR/J",m,1.95,0.110,0.0635,3,0.0563,-0.82\n38101,ACTH_cont,"BALB/cJ",m,2.05,0.112,0.0647,3,0.0547,-0.43\n38101,ACTH_cont,"BTBR T<+> Itpr3<tf>/J",m,2.25,0.0490,0.0283,3,0.0218,0.34\n38101,ACTH_cont,"BUB/BnJ",m,2.28,0.0720,0.0416,3,0.0316,0.46\n38101,ACTH_cont,"C3H/HeJ",m,2.26,0.0600,0.0346,3,0.0266,0.38\n38101,ACTH_cont,"C57BL/6J",m,2.49,0.0360,0.0208,3,0.0145,1.2

The `csv` module is for parsing csv files (obvs).

When we ran these lines of code...

`
reader = csv.reader(csv_file, delimiter=',')
parsed_file = list(reader)
`

...it should in theory have split each line of the file up wherever it found a comma.
Let's see if it did that.

In [22]:
parsed_file[57:73] # <-- the numbers inside square brackets are indices, we'll explain that in a bit

[['m'],
 ['e'],
 ['a'],
 ['s'],
 ['n'],
 ['u'],
 ['m'],
 ['', ''],
 ['v'],
 ['a'],
 ['r'],
 ['n'],
 ['a'],
 ['m'],
 ['e'],
 ['', '']]

Hmm, looks like it didn't.
Instead it split up the string at each letter, except for the quoted bits.
Maybe there's something we're not understanding about `urllib`.

Let's turn to our old friend Google, who takes us to a stackoverflow post.
(All programmers should know stackoverflow.)
http://stackoverflow.com/questions/21351882/reading-data-from-a-csv-file-online-in-python-3

Oh, I get it now: `csv_file` is one big long string. `csv.reader` expects a sequence of lines and parses each item it is given as one line. Since a string is a sequence of characters, the reader is treating every single character as its own line.

What we need to do is split up our big long string so there's some other kind of sequence.
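
Here is a small sketch of the difference, using a two-line csv made up for the example. Given a plain string, `csv.reader` treats every character as its own line; given a list of lines, it splits each line at the commas.

```python
import csv

one_big_string = "a,b\nc,d"
list_of_lines = ["a,b", "c,d"]  # the same data, already split into lines

char_rows = list(csv.reader(one_big_string))
line_rows = list(csv.reader(list_of_lines))

print(char_rows[:3])  # one "row" per character
print(line_rows)      # one row per line, split at the commas
```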

The different lines in the file are actually separated by a newline character, '\n'.

But we have to tell Python to split the string up whenever it sees that character.

To do that we can call the `splitlines` method on the string, as shown in the cell below.

In [24]:
url_for_file = "http://phenome.jax.org/tmp/Wiltshire3_means.csv"
with urllib.request.urlopen(url_for_file) as response:
   csv_file = response.read().decode('utf-8').splitlines() # <-- calling the splitlines method


**Notice that we can chain multiple method calls together using the period.**

Here's the line again from the cell above where we did that:

`csv_file = response.read().decode('utf-8').splitlines()`

What this line of code does is say:
1. call the `read` method of the `response` object, which returns the raw bytes of the file
2. then `decode` those bytes into text using the "utf-8" scheme (one of many ways to decode bytes into text)
3. then finally take the text and split it up with the `splitlines` method, which splits wherever it finds a line boundary, such as the special character '\n' that represents the end of a line (like when you type `enter`)
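
The same chain can be tried without the web. In this sketch a made-up `bytes` value stands in for what `response.read()` returns, and the chain is also written out one step at a time to show that both forms give the same result.

```python
raw_bytes = b"strain,mean\nA/J,1.86\n"  # stand-in for response.read()

# chained: each method is called on the result of the one before it
chained = raw_bytes.decode('utf-8').splitlines()

# the same thing, one step at a time
text = raw_bytes.decode('utf-8')
lines = text.splitlines()

print(chained)
print(chained == lines)
```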

So what does `csv_file` look like after we run `splitlines` on it?

In [33]:
csv_file[:4]

['"// Phenotype data set from Mouse Phenome Database (phenome.jax.org)",,,,,,,,',
 '"// Data set: Wiltshire3    Title: Drug study: Neurobiochemical analytes in response to chronic fluoxetine treatment in males of 30 inbred mouse strains    Year: 2011",,,,,,,,',
 '"// List of strain means and summary statistics",,,,,,,,,,,',
 '"// For more info on these data visit phenome.jax.org and type Wiltshire3 into search box",,,,,,,,,,']

Notice that now when we display the variable, it's surrounded by square brackets.
That's because `splitlines` returns a *list* of strings.

## Lists

Lists are one of the main data types in Python.

Other languages often call them *arrays*.

A list is just a way of grouping things together to make them easier to work with.
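
A quick sketch of the basics, using a made-up list of strain names taken from the file above. The numbers inside square brackets are *indices*: they count from zero, and a *slice* like `[1:3]` starts at the first index and stops just before the second.

```python
strains = ["129S1/SvImJ", "A/J", "AKR/J", "BALB/cJ"]

first = strains[0]       # indexing starts at zero
middle = strains[1:3]    # items at index 1 and 2 (3 is not included)
first_two = strains[:2]  # leaving out the start means "from the beginning"

strains.append("C3H/HeJ")  # lists can grow
print(first, middle, first_two, len(strains))
```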

For example, now that we have a list of strings, we can use the csv.reader on them.

Recall that we figured out from a stackoverflow post that the csv.reader will take whatever sequence we give it and split it up.
Since a list, like a string, is a sequence, we can get the reader to split it up.

In [41]:
reader = csv.reader(csv_file, delimiter=',')
#parsed_file = list(reader)

Okay, so now we made a `reader` object (from the `csv` module).
Did we parse our file yet?

In [42]:
reader

<_csv.reader at 0x2baed1b4868>

What the heckin' heck is that?

That's the object's type, followed by its location in memory.

But it's not what we wanted at all. We want the lines from our csv file, split up at the commas.

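The `reader` is actually an `iterator` object.
In Pythonese, we say that an object is *iterable* if it implements the `__iter__` method, and that it is an `iterator` if it also implements the `__next__` method, which hands back one item at a time.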

In [40]:
'__iter__' in dir(reader)

True

Ok, so what?

You probably already know what iteration is.

It's just that you call it "looping".
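
To see what a loop does behind the scenes, you can drive an iterator by hand with the built-in `next` function. A minimal sketch, with a two-line csv made up for the example:

```python
import csv

reader = csv.reader(["a,b", "c,d"])

first_row = next(reader)   # ask the iterator for one item
second_row = next(reader)  # and the next one
print(first_row, second_row)

# once the items run out, the iterator is exhausted
leftover = list(reader)
print(leftover)
```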

Let's iterate over the lines of the file that we put inside the reader object:

In [None]:
parsed_file = [] # an empty list
for row in reader:
    parsed_file.append(row)

## for loops

So above, we made an empty list.

Then we used a *for loop* to *iterate* over the rows of the file.

Here's what happens each time through the loop:
   1. the reader object parses one line of the file and puts the result inside the variable `row`
   2. we call the `append` method of our list object `parsed_file` to add `row` to the end of our list
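
The same pattern on a tiny, made-up csv, so you can see the result. The last lines show that `list(reader)` is a shorthand that does the looping and appending for you.

```python
import csv

lines = ["strain,mean", "A/J,1.86", "AKR/J,1.95"]

parsed = []  # an empty list
for row in csv.reader(lines):
    parsed.append(row)  # add each parsed row to the end of the list

print(parsed)

shorthand = list(csv.reader(lines))
print(parsed == shorthand)
```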

## significant white space

How does the Python interpreter know which lines of code belong inside our *for loop*?

Many other languages use punctuation such as curly braces to identify blocks of code.
Here's a for loop in Java:

```Java
int k = 0;
for (int i = 0; i < 10; i++)
    { // <-- curly braces
    int j = i * 10;
    k = k + j;
    }
System.out.println(k);
```

In Python, white space is used to organize code.

All consecutive lines that are indented by the same amount are taken to belong to the same block.

```Python
k = 0  # k has to exist before we can add to it
for i in range(10):
    # beginning of code block in for loop
    j = i * 10
    k = k + j
    # end of code block
print(k)
```

Many people who are used to other languages find this weird at first.

You'll get over it.

Once you do, you'll realize that it enforces readability in a way that other languages do not.

For example, in Java, people can choose how much they indent. As long as there are opening and closing curly braces, it doesn't matter where they are. In practice, good Java coders use consistent indentation, but there's actually nothing about the language that says they have to.
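
Because indentation decides what belongs to the loop, moving a line in or out changes what the program does. A small sketch with made-up variable names:

```python
total = 0
inside_count = 0
outside_count = 0

for i in range(3):
    total = total + i
    inside_count = inside_count + 1  # indented: runs every time through the loop

outside_count = outside_count + 1    # not indented: runs once, after the loop ends

print(total, inside_count, outside_count)
```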



How's our `parsed_file` look now?

In [31]:
parsed_file[3:6]

[['// For more info on these data visit phenome.jax.org and type Wiltshire3 into search box',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''],
 ['', '', '', '', '', '', '', '', '', '', '', ''],
 ['measnum',
  'varname',
  'strain',
  'sex',
  'mean',
  'sd',
  'sem',
  'nmice',
  'cv',
  'zscore',
  '',
  '',
  '']]

Okay, good, the csv reader is no longer splitting up the file into individual characters.

In [None]:
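def parse_jackson_csv(csv_string):
    """
    Parses csv files from Jackson labs website.
    Deals with some of the idiosyncrasies that csv.Sniffer doesn't recognize.
    Returns the part of the string that starts at the header row.
    """
    SEPARATOR_BEFORE_HEADER = ",,,,,,,,,,\n,,,,,,,,,,,\n"
    index = csv_string.rfind(SEPARATOR_BEFORE_HEADER)
    if index == -1:  # separator not found, so leave the string unchanged
        return csv_string
    # the header row starts right after the separator
    new_start_index = index + len(SEPARATOR_BEFORE_HEADER)
    csv_string = csv_string[new_start_index:]
    return csv_string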

In [None]:
csv_string = parse_jackson_csv("\n".join(csv_file))  # csv_file is a list of lines by now, so join it back into one string

In [None]:
dialect = csv.Sniffer().sniff(csv_string)
reader = csv.reader(csv_string.splitlines(), dialect)  # the reader needs a list of lines, not one big string
thing = list(reader)

In [None]:
thing
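
Another possible approach, sketched here under assumptions: rather than searching for a run of commas, skip any row that is blank or starts with `//`, and drop the empty cells that the trailing commas leave behind. The function name `clean_rows` is made up, and the sample text is a shortened copy of the file shown earlier.

```python
import csv

sample = '''"// Phenotype data set from Mouse Phenome Database (phenome.jax.org)",,,,,,,,
,,,,,,,,,,,
measnum,varname,strain,sex,mean,sd,sem,nmice,cv,zscore,,,
38101,ACTH_cont,"129S1/SvImJ",m,1.80,0.0660,0.0381,3,0.0366,-1.40
38101,ACTH_cont,"A/J",m,1.86,0.0710,0.0410,3,0.0382,-1.17
'''

def clean_rows(csv_text):
    """Return the header and data rows, without comment rows or blank rows."""
    rows = []
    for row in csv.reader(csv_text.splitlines()):
        # drop the empty cells left behind by trailing commas
        while row and row[-1] == '':
            row.pop()
        # skip rows that are now empty, and comment rows that start with //
        if not row or row[0].startswith('//'):
            continue
        rows.append(row)
    return rows

rows = clean_rows(sample)
header, data = rows[0], rows[1:]
print(header)
print(data[0])
```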

In [None]:
url_for_file = "http://phenome.jax.org/tmp/Willott1_table.csv"
with urllib.request.urlopen(url_for_file) as response:
   csv_file = response.read().decode('utf-8')
reader = csv.reader(csv_file.splitlines(), delimiter=',', quotechar='"')  # split into lines first, as we learned above

## Lists

## the easy way

In [None]:
!pip install openpyxl