# Contents

- [22 October 2018](#22-october-2018)
    - [Namespace and Variable Scope](#Namespace-and-Variable-Scope)
- [24 October 2018](#24-october-2018) 
    - [Lists and Tuples, cont'd: Tuples](#Lists-and-Tuples,-cont'd:-Tuples)
        - [Using tuples for exchanges](#Using-tuples-for-exchanges:)
    - [Some useful list functions](#Some-useful-list-functions)
        - [List comprehension](#List-comprehension)
    - [String Parsing](#String-Parsing)
        - [Some useful string functions](#Some-useful-string-functions)
        - [Regular expressions](#Regular-expressions:)
        - [Search and replace](#Search-and-replace:-`re.sub`)
        - [Split according to a regular expression](#Split-according-to-a-regular-expression)

-----

In [82]:
import os
os.getcwd()
#os.chdir(os.getcwd() + "/data")
os.chdir("/Users/Alexa/Desktop/swc-python/data")

-----


# 22 October 2018

## Namespace and Variable Scope

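Within a session, say I set `a = 5`. The name `a` is now part of my session's namespace, bound to the integer object `5`.
Every variable is a name bound to an object. Immutable objects (ints, floats, strings, tuples) cannot be changed in place, so "changing" one really binds the name to a new object.
Mutable objects (lists, dictionaries) can be changed in place, and every name bound to the same object sees the change.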

Functions have separate namespaces:

In [3]:
def add1scalar(x):
	"""adds 1 to scalar input"""
	x += 1  # rebinds the local name x to a new integer
	print("after add1scalar: ", x)

def add1array(x):
	"""adds 1 to each element of a list, in place"""
	for i in range(len(x)):
		x[i] += 1  # assign through the index; `for el in x: el += 1` would only rebind el

a = 5; print(a)
add1scalar(a)
print("and yet, a = %s still" % a)
b = [5, 4, 3]; print(b)
add1array(b)
print("lists are mutable, so now b = %s" % b)

5
after add1scalar:  6
and yet, a = 5 still
[5, 4, 3]
lists are mutable, so now b = [6, 5, 4]


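In the example above, my session's namespace contains `a` and `b`. The function `add1scalar()` has its own namespace, in which the argument is known as `x`. Inside the function, `x += 1` rebinds that local `x` to a new integer, so running the function doesn't affect my session's `a`. To change `a`, the function would have to `return x` and I would have to assign the result back (`a = add1scalar(a)`).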

This happens because integers and floats are immutable: `x += 1` cannot change the object `5`, so it creates a new object and rebinds the local name to it. Lists are mutable: the function's `x` and my session's `b` are two names for the same list object. If a function changes a list in place (for example by assigning to `x[i]`), the change is visible under every name, in every namespace.

A `for` loop, however, does not get a namespace of its own: it runs in the enclosing namespace (here, the session's). Variables defined or redefined during the loop keep their last value after the loop finishes.
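
A minimal sketch of the difference between rebinding a name and mutating an object in place. The function names `rebind` and `mutate` are hypothetical, chosen only for illustration.

```python
def rebind(x):
    """Rebinds the local name x; the caller's object is untouched."""
    x = x + [99]      # builds a new list and binds the local name to it
    return x

def mutate(x):
    """Modifies the object that x refers to, in place."""
    x.append(99)      # same list object the caller sees

nums = [1, 2, 3]
result = rebind(nums)
print(nums)            # [1, 2, 3]: caller's list unchanged
print(result is nums)  # False: rebind returned a different object

mutate(nums)
print(nums)            # [1, 2, 3, 99]: changed through the shared object

# A for loop does not create a new scope: its variables survive the loop.
for k in range(3):
    last_square = k ** 2
print(k, last_square)  # 2 4
```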

----

# 24 October 2018

## Lists and Tuples, cont'd: Tuples

Tuples are immutable, unlike lists. They are useful for:

- array sizes: `(60,40)` earlier
- types of arguments to functions: like (float, int) for instance
- functions can return multiple objects in a tuple

A tuple with a single value, say 6.5, is written with a trailing comma: `(6.5,)`.
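
A short sketch of the points above: returning several values as a tuple, the one-element tuple, and immutability. The helper `min_max` is a hypothetical name used only for this example.

```python
def min_max(values):
    """Return the smallest and largest value as a 2-tuple."""
    return min(values), max(values)   # the comma builds the tuple

lo, hi = min_max([3, 9, 1, 7])        # unpack the returned tuple
print(lo, hi)                         # 1 9

single = (6.5,)                       # one-element tuple: trailing comma required
not_a_tuple = (6.5)                   # just a float in parentheses
print(type(single).__name__, type(not_a_tuple).__name__)  # tuple float

try:
    single[0] = 7.0                   # tuples cannot be modified in place
except TypeError as err:
    print("TypeError:", err)
```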

### Using tuples for exchanges:

In [5]:
left = 'L'
right = 'R'

temp = left
left = right
right = temp
print("left =",left,"and right =",right)

left = R and right = L


In [6]:
left = 'L'
right = 'R'

(left, right) = (right, left)
print("left =",left,"and right =",right)

left = R and right = L


In [7]:
left, right = right, left
print("now left =",left,"and right =",right)

now left = L and right = R


## Some useful list functions

- `.append(x)`: Adds a single value `x` to a list as the new last element.
- `.extend([x])`: Appends all the elements from a second list `[x]` onto the first list, 
  in order.
- `.insert(i, x)`: Adds a value `x` to a list in position `i`.
- `.reverse()`: Flips the order of elements in a list, in place.
- `.pop()`: Removes the last element from a list, then returns it. `.pop(i)` does the same for the element at index `i`.
    - `x = mylist.pop()`: Removes the last element from `mylist`, but saves it as `x`.
- `.sort()`: Sorts the list in place and returns `None`.
- `sorted(mylist)`: Returns a new sorted list and leaves `mylist` unchanged.
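
A small sketch of the methods in the list above that are easiest to confuse: `.append` versus `.extend`, and `.sort()` versus `sorted()`. The variable names are made up for the example.

```python
letters = ['b', 'd']
letters.append(['x', 'y'])     # appends the list itself as ONE element
print(letters)                 # ['b', 'd', ['x', 'y']]

letters = ['b', 'd']
letters.extend(['x', 'y'])     # appends each element in turn
print(letters)                 # ['b', 'd', 'x', 'y']

letters.insert(1, 'c')         # 'c' goes to index 1; later items shift right
print(letters)                 # ['b', 'c', 'd', 'x', 'y']

scores = [3, 1, 2]
ordered = sorted(scores)       # new list; scores is unchanged at this point
returned = scores.sort()       # sorts scores in place and returns None
print(ordered, scores, returned)   # [1, 2, 3] [1, 2, 3] None
```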

In [8]:
odds = [1, 3, 5, 7]
print('odds before:', odds)
odds.append(11)
print('odds after adding a value:', odds)

odds before: [1, 3, 5, 7]
odds after adding a value: [1, 3, 5, 7, 11]


Note that the following does not do what an R user would expect from `c(odds, 11)`: it builds a new two-element list whose first element is the original list.

In [10]:
odds = [1, 3, 5, 7]
odds = [odds, 11]
print('odds=',odds)

odds= [[1, 3, 5, 7], 11]


In [12]:
odds = [1, 3, 5, 7, 11]
del odds[0]
print('odds after removing the first element:', odds)
odds.reverse()
print('odds after reversing:', odds)
a = odds.pop()
print('odds after popping last element:', odds)
print("this last element was",a)
a = odds.pop(1)
print("popped element number 1 (2nd element):",a)
odds

odds after removing the first element: [3, 5, 7, 11]
odds after reversing: [11, 7, 5, 3]
odds after popping last element: [11, 7, 5]
this last element was 3
popped element number 1 (2nd element): 7
[11, 5]


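What `+` and `*` do to lists is not all that intuitive. Keep in mind that lists are mutable and that `primes = odds` below does not copy anything: both names refer to the same list, so `+=` changes it under both names.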

In [13]:
odds = [1, 3, 5, 7]
primes = odds
primes += [2]
print('primes:', primes)
print('odds:', odds)

primes: [1, 3, 5, 7, 2]
odds: [1, 3, 5, 7, 2]
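
If that aliasing is not wanted, one option is to copy the list before modifying it. A minimal sketch, reusing the names `odds` and `primes` from the cell above (the names `alias` and `combined` are added for illustration):

```python
odds = [1, 3, 5, 7]

alias = odds             # second name for the SAME list
primes = odds.copy()     # shallow copy: a new list with the same elements
                         # (list(odds) and odds[:] behave the same way)
primes += [2]            # in-place extend, but only the copy changes
print(odds)              # [1, 3, 5, 7]
print(primes)            # [1, 3, 5, 7, 2]

combined = odds + [2]    # + always builds a new list
print(combined is odds)  # False
print(alias is odds)     # True
```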


In [23]:
counts = [2, 4, 6, 8, 10]
repeats = counts * 2
print("The list with repeats, sorted numerically:", sorted(repeats))    # all integers
print("The list with repeats, unsorted:", repeats) # sorted() leaves original lists unchanged
repeats.sort() # .sort() modifies the list in place
print(".sort modifies the list in place:", repeats)
print(sorted([10,5.5,2,4,30,27.1])) # sorted() treats ints and floats the same.
print(sorted(["jan","feb","mar","dec"]))  # It sorts strings alphabetically.
print(sorted(["jan",20,1,"dec"]))  # But it can't handle mixed numbers & strings -- this 
                                   # line throws an error.

The list with repeats, sorted numerically: [2, 2, 4, 4, 6, 6, 8, 8, 10, 10]
The list with repeats, unsorted: [2, 4, 6, 8, 10, 2, 4, 6, 8, 10]
.sort modifies the list in place: [2, 2, 4, 4, 6, 6, 8, 8, 10, 10]
[2, 4, 5.5, 10, 27.1, 30]
['dec', 'feb', 'jan', 'mar']


TypeError: '<' not supported between instances of 'int' and 'str'

### List comprehension

`[xxx for y in z]` where `z` is a collection, `y` introduces a local variable name, and `xxx` is an expression, typically some simple function of `y`:

In [84]:
counts = [2, 4, 6, 8, 10]
# counts + 5 # error
added = [num+5 for num in counts] # new object, a bit like a for loop in a list
print(added)

[7, 9, 11, 13, 15]


In [31]:
for i in range(0,len(counts)):
    counts[i] += 5 # modifies "counts" in place
counts

[7, 9, 11, 13, 15]
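
A comprehension can also filter with a trailing `if` clause, which the cells above do not show. The sketch below compares one with its explicit-loop equivalent, and uses slice assignment as one possible way to update `counts` in place without indexing. The names `big_plus5` and `loop_result` are made up for the example.

```python
counts = [2, 4, 6, 8, 10]

# Comprehension with a filter: keep only values above 5, then add 5.
big_plus5 = [num + 5 for num in counts if num > 5]
print(big_plus5)                  # [11, 13, 15]

# The same result written as an explicit loop.
loop_result = []
for num in counts:
    if num > 5:
        loop_result.append(num + 5)
print(loop_result == big_plus5)   # True

# Slice assignment replaces the contents of the existing list object.
counts[:] = [num + 5 for num in counts]
print(counts)                     # [7, 9, 11, 13, 15]
```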

## String Parsing

### Some useful string functions:

See also __`dir("")`__

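- `.strip()`: Returns a copy with leading and trailing whitespace (or the given characters) removed.
- `.split(sep)`: Splits the string at each occurrence of `sep` (any run of whitespace by default) and returns a list of the pieces.
- `.join(mylist)`: Concatenates the strings in `mylist`, using the string it is called on as the separator.
- `.replace(old, new)`: Returns a copy with every occurrence of `old` replaced by `new`.
- `.index(sub)`: Returns the index of the first occurrence of `sub`; raises `ValueError` if `sub` is absent.
- `.find(sub)`: Like `.index`, but returns `-1` if `sub` is absent.
- `.count(sub)`: Returns the number of non-overlapping occurrences of `sub`.
- `.startswith(prefix)`: `True` if the string begins with `prefix`.
- `.endswith(suffix)`: `True` if the string ends with `suffix`.
- `.upper()`: Returns an upper-case copy.
- `.lower()`: Returns a lower-case copy.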
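
A brief sketch that exercises several of these methods on one made-up line of text (the string `line` and its gene names are invented for the example):

```python
line = "  gene_A, gene_B ,gene_C\n"

clean = line.strip()                             # drop leading/trailing whitespace
fields = [f.strip() for f in clean.split(",")]   # split on commas, tidy each piece
print(fields)                             # ['gene_A', 'gene_B', 'gene_C']

print("_".join(fields))                   # gene_A_gene_B_gene_C
print(clean.replace("gene_", "g"))        # gA, gB ,gC
print(clean.count("gene"))                # 3
print(clean.upper().startswith("GENE"))   # True
print(clean.endswith("_C"))               # True

print(clean.find("zzz"))                  # -1: find signals "absent" with -1
try:
    clean.index("zzz")                    # index raises an error instead
except ValueError as err:
    print("ValueError:", err)
```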

In [35]:
a = "hello world"
print(a.startswith("h"))
print(a.startswith("he"))
print("h" in a)
print("low" in a)
print("lo w" in a)

True
True
True
False
True


In [38]:
print("aha".find("a")) # it shows up at index 0
print("hohoho".find("oh"))
mylist = ["AA","BB","CC"]
"-coolseparator-".join(mylist)

0
1


'AA-coolseparator-BB-coolseparator-CC'

In [39]:
print(type(""))
print(dir(""))

<class 'str'>
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']


### Regular expressions:

Use the `re` library and its functions `re.search`, `re.findall`, `re.sub`, `re.split`, etc.
(Recall regular expression syntax.)

- `r''` to write the regular expression pattern as a “raw” string: a `\n` is then read as a backslash followed by an `n`, not as a newline character.
- multipliers (`*`, `+`, `?`) are greedy by default. Add `?` after one (`*?`, `+?`, `??`) to make it non-greedy
- info from match objects: `.group()`, `.start()`, `.end()`
- when the pattern is not found, `re.search` returns `None` instead of a match object; `None` counts as `False` in a boolean context
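
A small sketch of why the raw-string prefix matters, and of the `None` result for a failed search. The string `text` is invented for the example.

```python
import re

print(len('\n'))     # 1: a single newline character
print(len(r'\n'))    # 2: a backslash followed by the letter n

# \b means "backspace" in a normal string but "word boundary" in a regex,
# so the raw prefix changes what the pattern matches.
text = "cat concat cat"
with_raw = re.findall(r'\bcat\b', text)
without_raw = re.findall('\bcat\b', text)
print(with_raw)      # ['cat', 'cat']
print(without_raw)   # []: the pattern contains literal backspace characters

miss = re.search(r'dog', text)
print(miss, bool(miss))   # None False
```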

In [67]:
import glob
filenames = glob.glob('*.csv')
print(filenames)

['inflammation-01.csv', 'inflammation-02.csv', 'inflammation-03.csv', 'inflammation-04.csv', 'inflammation-05.csv', 'inflammation-06.csv', 'inflammation-07.csv', 'inflammation-08.csv', 'inflammation-09.csv', 'inflammation-10.csv', 'inflammation-11.csv', 'inflammation-12.csv', 'small-01.csv', 'small-02.csv', 'small-03.csv']


In [57]:
import re
mo = re.search(r'i.*n',filenames[0]) # multiplier * is greedy. This looks for an "i", then any 
    # number of characters, then an "n" -- and it'll take the last possible "n"
print(mo)  # "match object", stores much info. search: first instance only.

<re.Match object; span=(0, 12), match='inflammation'>


In [58]:
print(mo.group()) # what string matched the pattern
print(mo.start()) # the position where the match starts (zero-indexed)
print(mo.end())   # where the match ends: the index just *after* the last matched character

inflammation
0
12


In [59]:
mo = re.search(r'i.*?n',filenames[0]) # This is non-greedy matching -- it takes the first "n"
print(mo)
print(mo.group())
print(mo.start())
print(mo.end())

<re.Match object; span=(0, 2), match='in'>
in
0
2


When there is no match, `re.search` returns `None`, which is interpreted as `False` in a boolean context. Say I have a set of DNA sequences and I want to check whether they have any non-DNA symbols:

In [60]:
sequences = ["ATCGGGGATCGAAGTTGAG", "ACGGCCAGUGUACN"]
for dna in sequences:
    mo = re.search(r'[^ATCG]', dna) # find the first non-ACTG, then stop
    if mo: # this works because no result = None = False
        print("non-ACGT found in sequence",dna,": found", mo.group())

non-ACGT found in sequence ACGGCCAGUGUACN : found U


Compare with this less-efficient code, which has to run the search twice. That can get **slow** with large datasets.

In [61]:
for dna in sequences:
    if re.search(r'[^ATCG]', dna):
        mo = re.search(r'[^ATCG]', dna)
        print("non-ACGT found in sequence",dna,": found", mo.group())

non-ACGT found in sequence ACGGCCAGUGUACN : found U
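
A possible variation on the single-search loop, assuming the same pattern is applied to many sequences: compile the pattern once with `re.compile` and reuse it. The `re` module already caches recently used patterns, so the gain is mostly in readability. The names `non_dna` and `problems` are hypothetical.

```python
import re

sequences = ["ATCGGGGATCGAAGTTGAG", "ACGGCCAGUGUACN"]
non_dna = re.compile(r'[^ATCG]')      # compile the pattern once

problems = {}
for dna in sequences:
    mo = non_dna.search(dna)          # one search per sequence
    if mo:
        # record the first offending character and its zero-based position
        problems[dna] = (mo.group(), mo.start())
print(problems)   # {'ACGGCCAGUGUACN': ('U', 8)}
```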


In [62]:
print(re.findall(r'i.*n',filenames[0])) # greedy. non-overlapping matches, but won't find all of them (b/c greedy).
mo = re.findall(r'i.*?n',filenames[0])  # non-greedy. Still not finding all possible matches.
print(mo)
mo

['inflammation']
['in', 'ion']


['in', 'ion']

In [63]:
for f in filenames:
    if not re.search(r'^i', f): # if no match: search object interpreted as False
        print("file name",f,"does not start with i")

file name small-01.csv does not start with i
file name small-02.csv does not start with i
file name small-03.csv does not start with i


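The big difference between `re.findall` and `re.search` is that `findall` returns a list of all the non-overlapping matched strings, whereas `search` returns a single match object (or `None`) for the first match only.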
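
When both every match and its position are needed, `re.finditer` is one way to get a match object per match rather than plain strings. A minimal sketch on the first file name from the cells above:

```python
import re

name = 'inflammation-01.csv'
for mo in re.finditer(r'i.*?n', name):   # yields one match object per match
    print(mo.group(), mo.start(), mo.end())
# in 0 2
# ion 9 12

spans = [mo.span() for mo in re.finditer(r'i.*?n', name)]
print(spans)   # [(0, 2), (9, 12)]
```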

### Search and replace: `re.sub`

- Capture groups with parentheses in the regular expression
- Captured elements are stored as `.group(1)`, `.group(2)` etc. in the match object
- You can recall captured elements with `\1`, `\2` etc., either later in the same regular expression or in the replacement string given to `re.sub`

(It's very much like `sed`.)

In [68]:
filenames = glob.glob('*.csv')
re.sub(r'^(\w)\w+-(\d+)\.csv', r'\1\2.csv', filenames[0])

'i01.csv'

In [69]:
for i in range(0,len(filenames)):
    filenames[i] = re.sub(r'^(\w)\w+-(\d+)\.csv', r'\1\2.csv', filenames[i])
print(filenames)

['i01.csv', 'i02.csv', 'i03.csv', 'i04.csv', 'i05.csv', 'i06.csv', 'i07.csv', 'i08.csv', 'i09.csv', 'i10.csv', 'i11.csv', 'i12.csv', 's01.csv', 's02.csv', 's03.csv']


In [70]:
filenames = glob.glob('*.csv')

In [71]:
# Can do this by hand, but it'd be nice to do it automatically on many species names...

taxa = ["Drosophila melanogaster", "Homo sapiens"]
for taxon in taxa:
    mo = re.search(r'^([^\s]+) ([^\s]+)$', taxon) # Get everything before a space as \1, the next word as \2, and then the end.
    if mo:
        genus = mo.group(1)
        species = mo.group(2)
        print("genus=" + genus + ", species=" + species)

print(taxon)
print(mo) # variables defined inside "for" are available outside
print(mo.start(1))
print(mo.start(2))


genus=Drosophila, species=melanogaster
genus=Homo, species=sapiens
Homo sapiens
<re.Match object; span=(0, 12), match='Homo sapiens'>
0
5


In [76]:
# Now abbreviate the taxon name to its first letter and replace space with underscore:

taxa_abbrev = []
for taxon in taxa:
    taxa_abbrev.append(
        re.sub(r'^(\S).* ([^\s]+)$', r'\1_\2', taxon)
    )
print(taxa_abbrev)

['D_melanogaster', 'H_sapiens']
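
The same abbreviation can be sketched as a list comprehension with a precompiled pattern. Note the assumption: the pattern below is slightly stricter than the one in the cell above (`\S*` instead of `.*`), so it only matches names made of exactly two words. The name `binomial` is hypothetical.

```python
import re

taxa = ["Drosophila melanogaster", "Homo sapiens"]
binomial = re.compile(r'^(\S)\S* (\S+)$')   # genus initial as \1, species as \2

taxa_abbrev = [binomial.sub(r'\1_\2', taxon) for taxon in taxa]
print(taxa_abbrev)    # ['D_melanogaster', 'H_sapiens']

# .groups() returns every captured group at once, as a tuple.
captured = binomial.search(taxa[1]).groups()
print(captured)       # ('H', 'sapiens')
```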


### Split according to a regular expression

- removes the matched substrings
- returns a list of the pieces in between

In [78]:
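coolstring = "Homo sapiens is pretty super"
# A notebook cell only displays the value of its last expression, so only the
# third result appears below. Wrap each call in print() to see all three.
re.split(r's.p', coolstring)    # "s", any one character, "p": ['Homo ', 'iens i', 'retty ', 'er']
re.split(r's.*p', coolstring)   # greedy: first "s" through last "p": ['Homo ', 'er']
re.split(r's.*?p', coolstring)  # non-greedy: each "s" through the next "p"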

['Homo ', 'ien', 'retty ', 'er']
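
A more practical sketch of `re.split`: splitting a record whose separators are inconsistent. The string `record` is invented for the example.

```python
import re

record = "Homo sapiens;Pan troglodytes ,  Mus musculus"

# Split on a comma or semicolon, together with any surrounding whitespace.
names = re.split(r'\s*[,;]\s*', record)
print(names)   # ['Homo sapiens', 'Pan troglodytes', 'Mus musculus']

# maxsplit limits the number of cuts (here: only the first one).
first_cut = re.split(r'\s*[,;]\s*', record, maxsplit=1)
print(first_cut)   # ['Homo sapiens', 'Pan troglodytes ,  Mus musculus']
```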