# literary fiction, part 2
When last we left Lalama, she had figured out how to load her csv file into the `reader` and parse all the lines.

But there were still a bunch of useless lines at the top of the file.

Let's get that file back in memory now.

In [2]:
import urllib.request

url_for_file = "https://raw.githubusercontent.com/NickleDave/EWIN-coding-bootcamp/master/Python/Wiltshire3_means.csv"
with urllib.request.urlopen(url_for_file) as response:
   csv_file = response.read().decode('utf-8').splitlines()

Let's look again at the first seven items in the list of lines from the file.

In [3]:
for index,line in enumerate(csv_file[0:7]):
    print("line {0} is: {1}".format(index,line))

line 0 is: "// Phenotype data set from Mouse Phenome Database (phenome.jax.org)",,,,,,,,
line 1 is: "// Data set: Wiltshire3    Title: Drug study: Neurobiochemical analytes in response to chronic fluoxetine treatment in males of 30 inbred mouse strains    Year: 2011",,,,,,,,
line 2 is: "// List of strain means and summary statistics",,,,,,,,,,,
line 3 is: "// For more info on these data visit phenome.jax.org and type Wiltshire3 into search box",,,,,,,,,,
line 4 is: ,,,,,,,,,,,
line 5 is: measnum,varname,strain,sex,mean,sd,sem,nmice,cv,zscore,,,
line 6 is: 38101,ACTH_cont,"129S1/SvImJ",m,1.80,0.0660,0.0381,3,0.0366,-1.40


Notice the following:

* Lines 0-3 are comments, and begin with "//"

* Line 4 was just a bunch of commas

* Line 5 looks like it's the **header**, the row of a csv file that tells us the name of each column of values.

* So really we want to skip lines 0-5, although we might want to keep line 5, the header, for later.

Also, let's talk about what we just did in the cell above.

## enumerate

When you pass `enumerate` a `sequence`, it returns an `iterator`. Each time `next` is called on that iterator, it yields a pair: a `counter` (starting from zero by default) and the corresponding item from the sequence.

https://docs.python.org/3.5/library/functions.html#enumerate

This is useful e.g. when you need to modify every item in a list

```Python
list_of_grades = [70, 82, 65]  # example values, made up for illustration
POINTS_FOR_CURVE = 20
for ind,val in enumerate(list_of_grades):
    list_of_grades[ind] = val + POINTS_FOR_CURVE
```
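To see what `enumerate` actually hands back, here is a minimal sketch. The `grades` list is made up for illustration; the only thing assumed beyond the text above is the optional `start` argument, which is part of the built-in function.

```Python
grades = ["A", "B", "C"]  # made-up example data

# Each item that enumerate yields is a (count, item) tuple.
pairs = list(enumerate(grades))
print(pairs)  # [(0, 'A'), (1, 'B'), (2, 'C')]

# The optional start argument changes where the counting begins.
numbered = ["{0}. {1}".format(count, grade)
            for count, grade in enumerate(grades, start=1)]
print(numbered)  # ['1. A', '2. B', '3. C']
```

Writing `for index, line in enumerate(...)` simply unpacks each of those tuples into two names.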

## format

In the bad old days, programmers used a function called `sprintf` to write formatted data to a string.

Old school Python used a syntax similar to `sprintf` for string formatting.

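For our example above, we didn't really have to format the data.
We could have also written
```Python
print("line " + str(index) + " is: " + line)
```
because Python **overloads** the plus sign: it not only adds numbers together, it also **concatenates** strings. Note that `index` is an integer, so we have to convert it with `str` first; adding a string to an integer raises a `TypeError`.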

An example of formatting would be "show this number in scientific notation with only 2 significant digits".

Python `strings` now have a `format` method to make formatting more human readable.

Here's a good page on how `format` works: https://pyformat.info/
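As a rough illustration of the scientific notation case mentioned above, here is a short sketch. The numbers are made up; the format specs `.1e` and `.2f` are standard parts of Python's format mini-language.

```Python
value = 0.0000123456  # made-up number for illustration

# ".1e" asks for scientific notation with one digit after the decimal point,
# which is two significant digits in total.
sci = "{:.1e}".format(value)
print(sci)  # 1.2e-05

# Positional indices, like the ones in the loop earlier, pick which argument goes where.
message = "line {0} is: {1}".format(3, "some text")
print(message)  # line 3 is: some text

# ".2f" rounds to two digits after the decimal point.
fixed = "{:.2f}".format(3.14159)
print(fixed)  # 3.14
```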

Okay, so we want to skip the first six lines (lines 0-5).

Remember that we can get the next line out of the reader by calling the built-in `next` function on the iterator.

So we could do this:

In [4]:
with urllib.request.urlopen(url_for_file) as response:
   csv_file = response.read().decode('utf-8').splitlines()

import csv

reader = csv.reader(csv_file, delimiter=',')

for skip_index in range(0,6):
    next(reader)

parsed_file = list(reader)
print("First item in parsed_file list is now:\n {}".format(parsed_file[0]))

First item in parsed_file list is now:
 ['38101', 'ACTH_cont', '129S1/SvImJ', 'm', '1.80', '0.0660', '0.0381', '3', '0.0366', '-1.40']


## range

Recall that since Python lets us iterate over sequences, we don't have to make counters for every for loop as we do in other languages.

But sometimes you do need a counter.

* the **range** function gives you an iterable that spits out counters (kind of like `enumerate` but more flexible)
    - syntax: `range(start,stop[,step])`
    - the `stop` value is not included, just like when you slice a list with `list[start:stop]`
        ```Python
        list(range(10)) # range is lazy, so we collect its values with the list function
        >>>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        ```
    - if you don't give a value for `start`, the default is zero
    - you can give `range` a `step` value so that the iterator yields `[start, start + step, start + 2 * step, ...]`
        ```Python
        [val for val in range(5,25,5)] # <-- a list comprehension
        >>>[5, 10, 15, 20]
        list(range(25,5,-5))
        >>>[25, 20, 15, 10]
        ```

So the first item of our `parsed_file` list is now the first line of values from the csv file.

Ok, great.

* what if we have another csv file and it has a different number of lines with comments?

In [None]:
url_for_file2 = ''
with urllib.request.urlopen(url_for_file2) as response:
   csv_file2 = response.read().decode('utf-8').splitlines()

* We could:
  - open up each file
  - look at how many lines there are with comments and the header
  - and then change our little script so it skips the right number of lines

* boring
* takes time
* computer can do this for us

Let's write a function that will figure out how many lines we need to skip in each file.

## writing functions

* functions
  - begin with the **def** keyword
  - followed by the function name
  - and then the argument list in parentheses
  - then a colon
  - then (if you are a nice programmer) a docstring, enclosed in quotes
      - so other people, and you, months later, can figure out what your code does
  - then the code, indented
  - and, if necessary, a **return** statement

```Python
def adder_function(argument1,argument2):
    """
    adds argument1 and argument2
    """
    return argument1 + argument2
```
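Here is a small sketch of what calling that function looks like. The argument values are made up; nothing here goes beyond the `adder_function` defined above, which is repeated so the cell runs on its own.

```Python
def adder_function(argument1, argument2):
    """
    adds argument1 and argument2
    """
    return argument1 + argument2

# The return value can be stored in a variable like any other value.
total = adder_function(2, 3)
print(total)  # 5

# Because the plus sign is overloaded, the same function also joins strings.
joined = adder_function("line ", "5")
print(joined)  # line 5

# The docstring is stored on the function, so help() and tools can show it.
print(adder_function.__doc__)
```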

* simple function that lets us skip a certain number of lines in our file

In [6]:
def skip_lines(reader,num_to_skip):
    """
    skips lines in a csv file that's being parsed by csv.reader
    
    arguments:
        reader -- csv.reader object containing csv file to be parsed
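        num_to_skip -- zero-based index of the last line to skip;
                       lines 0 through num_to_skip are skipped
    
    returns:
        reader (with the skipped lines parsed but not kept)
    """
    
    last_index = num_to_skip + 1 # range(n) stops at n - 1, so add 1 to include line num_to_skip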
    for index in range(last_index):
        next(reader)
    return reader

This function isn't much of an improvement though. We still have to count how many lines to skip.

In [7]:
reader = csv.reader(csv_file, delimiter=',')

reader = skip_lines(reader,5)

parsed_file = list(reader)

print("First item in parsed_file list is now:\n {}".format(parsed_file[0]))

First item in parsed_file list is now:
 ['38101', 'ACTH_cont', '129S1/SvImJ', 'm', '1.80', '0.0660', '0.0381', '3', '0.0366', '-1.40']


Let's get Python to parse these weird csv files for us.

Remember the format.

That will help us plan what we want our function to do.

* First there's lines with comments; they begin with "//"
* Then there's a line with a bunch of commas

**We want to skip these lines.**
 
* Then there's the **header**, the row of a csv file that tells us the name of each column of values.

**We want to give the user the option to extract the header in a separate variable, in case they need to use the names of the columns.**
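The plan above could be sketched roughly like this. It is only one possible approach, not a finished parser: `parse_commented_csv` and its `return_header` flag are hypothetical names, and `sample_lines` is a shortened, made-up stand-in for the real file.

```Python
import csv

def parse_commented_csv(lines, return_header=False):
    """
    Sketch: skip leading '//' comment rows and comma-only rows,
    then treat the next row as the header.
    (hypothetical helper, not part of the csv module)
    """
    reader = csv.reader(lines, delimiter=',')
    header = None
    for row in reader:
        is_blank = all(field == '' for field in row)
        is_comment = len(row) > 0 and row[0].startswith('//')
        if is_blank or is_comment:
            continue  # keep skipping
        # first row that is neither blank nor a comment: the header
        header = [name for name in row if name != '']  # drop empty trailing fields
        break
    rows = list(reader)  # everything after the header
    if return_header:
        return header, rows
    return rows

# made-up, shortened stand-in for the real file
sample_lines = [
    '"// a comment line",,,',
    ',,,,',
    'measnum,varname,strain,,',
    '38101,ACTH_cont,"129S1/SvImJ"',
]
header, rows = parse_commented_csv(sample_lines, return_header=True)
print(header)
print(rows)
```

Since the function checks each row instead of counting, it should not matter how many comment lines a given file starts with.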


## logic and control flow statements

Python, like other languages, has many keywords that allow for control flow.

* Boolean logic
  - True or False
  

In [12]:
type(csv_file) is list

True

In [10]:
True is True # tautological AF

True

In [8]:
False is False # woah

True
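Boolean values get useful once they steer an `if` / `elif` / `else` statement. Here is a minimal sketch that labels a raw line of text; `classify_line` is a hypothetical helper and the sample lines are made up to resemble the file.

```Python
def classify_line(line):
    """label a raw line of text (hypothetical helper for illustration)"""
    stripped = line.strip()
    if stripped.lstrip('"').startswith("//"):
        # the comment lines in the file are quoted, so ignore a leading quote
        return "comment"
    elif stripped.strip(",") == "":
        # nothing left once the commas are removed
        return "separator"
    else:
        return "data"

sample_lines = ['"// a comment",,,', ",,,,", "38101,ACTH_cont,m"]
labels = [classify_line(line) for line in sample_lines]
print(labels)  # ['comment', 'separator', 'data']
```

Each condition is an expression that evaluates to `True` or `False`, and only the first branch whose condition is `True` runs.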



How many commas are we looking for?

In [None]:
len(csv_file[4])

In [None]:


def parse_jackson_csv(csv_string):
    """
    Parses csv files from Jackson labs website.
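    Deals with some of the idiosyncrasies that csv.Sniffer doesn't recognize.
    """
    SEPARATOR_BEFORE_HEADER = ",,,,,,,,,,\n,,,,,,,,,,,\n"
    index = csv_string.rfind(SEPARATOR_BEFORE_HEADER)
    if index == -1:
        raise ValueError("separator before header not found")
    # the separator already ends with a newline, so the header starts right after it
    new_start_index = index + len(SEPARATOR_BEFORE_HEADER)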
    csv_string = csv_string[new_start_index:]
    return csv_string

In [None]:
csv_string = parse_jackson_csv("\n".join(csv_file))  # csv_file is a list of lines; the function expects one string

In [None]:
dialect = csv.Sniffer().sniff(csv_string)
reader = csv.reader(csv_string.splitlines(), dialect)  # csv.reader needs an iterable of lines, not one string
thing = list(reader)

In [None]:
thing

In [None]:
url_for_file = "http://phenome.jax.org/tmp/Willott1_table.csv"
with urllib.request.urlopen(url_for_file) as response:
   csv_file = response.read().decode('utf-8')
reader = csv.reader(csv_file.splitlines(), delimiter=',', quotechar='"')  # split into lines first; a bare string is iterated character by character

## the easy way

In [None]:
!pip install openpyxl