This document is a Python exploration of this R-based document: http://m-clark.github.io/data-processing-and-visualization/more.html. Code is *not* optimized for anything but learning.  In addition, all the content is located with the main document, not here, so many sections may not be included.  I only focus on reproducing the code chunks.

# More things to think about

### Coding Style

A lot has been written about coding <span class="emph" style="font-family:'Alex Brush'; font-size:1.5em">style</span> over the decades.  If there were a definitive answer, you would have heard of it by now.  However, there are a couple of things you can do early on in your programming that go a long way toward making your code notably better.

In Python, the [PEP 8](https://www.python.org/dev/peps/pep-0008/) style guide will help out a lot. However, it was not developed with data science in mind, so it may not help for some things.  Here is [Google's](https://google.github.io/styleguide/pyguide.html).  

The R style guides would mostly be applicable to Python as well, and specifically assume interactive data science, so feel free to peruse. 
- [Google](https://google.github.io/styleguide/Rguide.xml)
- [Hadley Wickham](http://adv-r.had.co.nz/Style.html)

### Why does your code exist?

Either use text in a Jupyter Notebook or comment your Python script.  Explain *why*, not *what*, the code is doing.  Think of it as leaving your future self a note (they will thank you!).  Be clear, and don't assume you'll remember why you were doing what you did.
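
As a small illustration of the difference, here is a hypothetical snippet (the `scores` values and the cutoff are made up for this example). The first comment only restates the code, while the second records the reasoning your future self will need.

```python
scores = [3.2, 4.8, 150.0, 5.1]   # made-up values for illustration

# What (not much help): keep scores below 100

# Why (helpful): the hypothetical scale only runs from 0 to 10, so values of
# 100 or more are assumed to be data entry errors and are dropped
clean_scores = [s for s in scores if s < 100]
```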

### Code length

When doing interactive data science, if your script is becoming hundreds of lines long, you probably need to compartmentalize your operations into separate scripts.  For example, separate your data processing from your model scripts.

#### Spacing

Don't be stingy with spaces. As you start out, err on the side of using them.  Just note there are exceptions (e.g. no space between function name and parenthesis, unless that function is something like <span class="func">if</span> or <span class="func">else</span>), but you'll get used to those over time.  Personally, a lot of the Python code I come across seems to be problematic with spacing, and even the autocomplete within functions will not include spaces for arguments, so do mind it when you can.

In [1]:
import numpy as np

In [2]:
x=np.random.normal(size=10, loc=0, scale=1)            # harder to read
                                                       # space between lines too!
x = np.random.normal(size = 10, loc = 0, scale = 1)    # easier to read

### Naming things

You might not think of it as such initially, but one of the more difficult challenges in programming is naming things.  Even if we can come up with a name for an object or file, there are different styles we can use for the name.

Here is a brief list of things to keep in mind.

- The name should make sense to you, your future self, and others that will use the code
- Try to be concise, but see the previous point
- Make liberal use of suffixes/prefixes for naming the same types of things e.g. model_x, model_z
- For function names, try for verbs that describe what they do


- Don't name anything with 'final'
- Don't name something that is already a popular function/object (e.g. `list`, <span class="func">sum</span>, <span class="func">id</span>, etc.)
- Avoid distinguishing names only by number, e.g. <span class="objclass">data1</span> <span class="objclass">data2</span>


Naming styles include:

- snake_case
- CamelCase or camelCase
- spinal-case (e.g. for file names)
- dot.case

For objects and functions, I find snake case easier to read and less prone to issues[^style_claim]. For example, camel case can fail miserably when acronyms are involved. Dots already have specific uses (file name extensions, function methods, etc.), so probably should be avoided unless you're using them for that specific purpose (they can also make selecting the whole name difficult depending on the context).
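
Here is a short sketch of these suggestions in Python. All names are hypothetical and chosen only for illustration. Note that in Python a dot or hyphen cannot appear inside a variable name at all, so snake case and camel case are the practical choices for objects and functions.

```python
# A shared prefix groups related objects; the suffix says how they differ
model_linear = {'formula': 'hwy ~ cyl'}
model_extended = {'formula': 'hwy ~ cyl + displ + year'}


# A verb-based function name in snake_case
def standardize_scores(scores):
    """Center the scores at zero and scale them to unit variance."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return [(s - mean) / variance ** 0.5 for s in scores]


# Camel case gets awkward with acronyms: parseHTMLToJSON or parseHtmlToJson?
# Snake case sidesteps the question: parse_html_to_json
standardized = standardize_scores([1.0, 2.0, 3.0])
```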

### Vectorization

### Boolean Indexing

Assume <span class="objclass">x</span> is a vector of numbers. How would we create an index representing any value greater than 2?




In [3]:
x = np.array([-1, 2, 10, -5])
idx = x > 2
idx

array([False, False,  True, False])

In [4]:
x[idx]

array([10])

As mentioned previously, <span class="objclass">logicals</span> (Booleans in Python) are objects with values of `True` or `False`, like the <span class="objclass">idx</span> variable above.  While sometimes we want to deal with the logical object as an end, it is extremely commonly used as an index in data processing. Note how we don't have to create an explicit index object first (though often you should), as NumPy indexing is ridiculously flexible.  Here are more examples, not necessarily recommended, but just to demonstrate the flexibility of Boolean indexing.


In [5]:
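x[x > 2]                                         # index directly, no intermediate object
x[(x != 'cat')]                                  # comparing numbers to a string; not recommended
x[~(x > 2)]                                      # ~ negates the Boolean array
x[np.where((x > 0) & (x != 10), True, False)]    # only this last result is displayed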

  


array([2])

This approach will transfer to using things like the <span class="func" style = "">query</span> function in pandas.


In [6]:
import pandas as pd
d = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
d.query('x >= 2')

|   | x | y |
|---|---|---|
| 1 | 2 | b |
| 2 | 3 | c |
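
For comparison, here is a minimal sketch showing that the same rows can be selected with a Boolean index directly, using the same toy data frame. The names `by_index`, `by_query` and `same_rows` are only for illustration.

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})

idx = d.x >= 2                   # Boolean Series, one value per row
by_index = d[idx]                # Boolean indexing
by_query = d.query('x >= 2')     # the same condition written as a string

same_rows = by_index.equals(by_query)   # True when both select the same rows
```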


Boolean indexing allows us to take <span class="emph">vectorized</span> approaches to dealing with data. Consider the following unfortunately coded loop, where we create a variable `y`, which takes on the value of **Yes** if the variable `x` is greater than 2, and **No** if otherwise.



In [7]:
mydf = d.copy()

for i in range(mydf.shape[0]):

    check = mydf.x[i] > 2

    if check == True :
        mydf.loc[i, 'y'] = 'Yes'    # .loc sets the value in place, avoiding chained assignment
    else:
        mydf.loc[i, 'y'] = 'No'

mydf


|   | x | y |
|---|---|---|
| 0 | 1 | No |
| 1 | 2 | No |
| 2 | 3 | Yes |


Compare with Boolean indexing:



In [8]:
mydf = d.copy()                       # copy so the original d is left untouched

mydf['y'] = 'No'

mydf.loc[mydf.x > 2, 'y'] = 'Yes'     # the Boolean index selects the rows to change

mydf


|   | x | y |
|---|---|---|
| 0 | 1 | No |
| 1 | 2 | No |
| 2 | 3 | Yes |


This gets us the same thing, and would be much faster than the looped approach. Boolean indexing is an example of a vectorized operation.  The whole vector is considered, rather than each element individually.  The result is that any preprocessing is done once rather than in each of the `n` iterations of the loop.  As in R, this will almost always be faster.
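
Another vectorized option is `np.where`, which picks between two values element by element based on a condition. A minimal sketch using the same toy data:

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})

# Where the condition holds use 'Yes', otherwise 'No'
mydf['y'] = np.where(mydf.x > 2, 'Yes', 'No')

mydf
```

This does the whole recode in one step, with no need to fill in a default value first.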

Example: Log all values in a matrix.



In [9]:
mymatrix = np.random.uniform(size = 10000).reshape(100, 100)

In [10]:
mymatrix_log = np.log(mymatrix)

This is way faster than looping over elements, rows or columns. Here we'll let the <span class="func">np.apply_along_axis</span> function stand in for our loop, logging the elements of each column.

In [11]:
import timeit

timeit.Timer(lambda: np.log(mymatrix)).timeit(number = 1000)

0.06370465800000003

In [12]:
# ?np.apply_along_axis # use as shorthand loop

In [13]:
timeit.Timer(lambda: np.apply_along_axis(np.log, 0, mymatrix)).timeit(number = 1000)

0.335478516

A more explicit loop.

In [14]:
timeit.Timer(lambda: [np.log(mymatrix[:, c]) for c in range(mymatrix.shape[1])]).timeit(number = 1000)

0.16723843000000005

As we can see, loops are pretty fast, and not as big an issue as with R, but the vectorized approach is still roughly 2.5 to 5 times faster in the timings above, and it allows for more succinct code that requires much less programming effort.
  

### Regular Expressions

A <span class="emph">regular expression</span>, regex for short, is a sequence of characters that can be used as a search pattern for a string. Common operations are to merely detect, extract, or replace the matching string.  There are actually many different flavors of regex for different programming languages, which are all flavors that originate with the Perl approach, or can enable the Perl approach to be used.  However, knowing one means you pretty much know the others with only minor modifications if any.

To be clear, not only is regex another language, it's nigh on indecipherable.  You will not learn much regex, but what you do learn will save a potentially enormous amount of time you'd otherwise spend trying to do things in a more haphazard fashion. Furthermore, practically every situation that will come up has already been asked and answered on [Stack Overflow](https://stackoverflow.com/questions/tagged/regex), so you'll almost always be able to search for what you need.

Here is an example of a pattern we might be interested in:

`^python.*shiny[0-9]$`

What is *that* you may ask?  Well, here is an example of strings it would and wouldn't match.  We're using a compiled pattern with <span class="func">filter</span> to keep only the strings that match the pattern.

In [15]:
import re

str_vec = ['python is the shiniest 1', 'python is the shiny1', 'python shines brightly']

In [16]:
r = re.compile(r'^python.*shiny[0-9]$')      # the pattern shown above, as a raw string

result = list(filter(r.match, str_vec))

result

['python is the shiny1']

What the regex is esoterically attempting to match is any string that starts with 'python' and ends with 'shiny_' where _ is some single digit.  Specifically it breaks down as follows:

- **^** : starts with, so ^python means starts with python
- **.** : any character
- **\*** : match the preceding zero or more times
- **shiny** : match 'shiny'
- **[0-9]** : any digit 0-9 (note that we are still talking about strings, not actual numbered values)
- **$** : ends with preceding
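
To get one Boolean per string rather than the matching strings themselves, one option is a list comprehension with `re.search`. A minimal sketch with the same strings (the names `pattern` and `matches` are only for illustration):

```python
import re

str_vec = ['python is the shiniest 1', 'python is the shiny1', 'python shines brightly']
pattern = re.compile(r'^python.*shiny[0-9]$')

# re.search returns a match object or None; bool() turns that into True/False
matches = [bool(pattern.search(s)) for s in str_vec]

matches   # [False, True, False]
```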


### Typical Uses

None of it makes sense, so don't attempt to make sense of it. Just try to remember a couple key approaches, and search the web for the rest.

Along with ^ . * [0-9] $, a couple more common ones are:

- **[a-z]** : letters a-z
- **[A-Z]** : capital letters
- **+** : match the preceding one or more times
- **()** : groupings
- **|** : logical or e.g. [a-z]|[0-9]  (a lower case letter or a number)
- **?** : preceding item is optional, and will be matched at most once. It also appears in the syntax for 'look ahead' and 'look behind'
- **\** : escape a character, like if you actually wanted to search for a period instead of using it as a regex pattern, you'd use \\. (in Python, typically written in a raw string, e.g. `r'\.'`)
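
Here are a few of these in action, as a minimal sketch. The example strings and variable names are made up for illustration.

```python
import re

# + : one or more of the preceding item
one_or_more = bool(re.search(r'a+b', 'caaab'))              # True

# () and | : grouping with alternatives
either_animal = bool(re.search(r'^(cat|dog)s$', 'dogs'))    # True

# ? : the preceding item is optional
spellings = [bool(re.search(r'^colou?r$', w))
             for w in ['color', 'colour', 'colouur']]       # [True, True, False]

# \ : escape the period so it matches a literal '.', not any character
csv_files = [bool(re.search(r'\.csv$', f))
             for f in ['data.csv', 'datacsv']]              # [True, False]
```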





See if you can guess which of the following will turn up `True`.

In [17]:
fruit = ['apple', 'pear', 'banana']

[bool(re.search(r'a', f)) for f in fruit]

[True, True, True]

In [18]:
[bool(re.search(r'^a', f)) for f in fruit]

[True, False, False]

In [19]:
[bool(re.search(r'^a|a$', f)) for f in fruit]

[True, False, True]
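
If the strings live in a pandas Series, the same check can be written without a list comprehension via the `.str` accessor. A minimal sketch, assuming the same fruit names (the variable name is only for illustration):

```python
import pandas as pd

fruit = pd.Series(['apple', 'pear', 'banana'])

# .str.contains treats the pattern as a regex by default and checks every element
starts_or_ends_with_a = fruit.str.contains(r'^a|a$')

starts_or_ends_with_a.tolist()   # [True, False, True]
```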

Scraping the web, munging data, just finding things in your scripts ... you can potentially use this all the time, and not only with text analysis.

-- I personally find using regex in Python verbose and unintuitive, whereas in R, grep and/or packages like stringr are very straightforward, and being vectorized, almost never require more than one line, return the whole element (which is typically desired in my experience), etc. --

## Code Style Exercises

### Exercise 1

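For the following model-related output, come up with a name for each object. These are regression models using the `statsmodels` formula interface, here assumed to be imported as `sm` (e.g. `import statsmodels.formula.api as sm`).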

`sm.ols('hwy ~ cyl', data = mpg).fit()                 # hwy mileage predicted by number of cylinders`

`sm.ols('hwy ~ cyl', data = mpg).fit().summary()       # the summary of that`

`sm.ols('hwy ~ cyl + displ + year', data = mpg).fit()  # an extension of that`

### Exercise 2

Fix this code. You don't have to run it, just clean it. Otherwise you'll need to import pandas and statsmodels.

In [20]:
# x=np.random.normal(10, 2, 100)
# y=.2* x+ np.random.normal(size = 100)
# data = pd.DataFrame(x,y)
# q = sm.ols('y~x', data=data)
# q.summary()

## Vectorization Exercises

Before we do this, did you remember to fix the names in the previous exercise?

### Exercise 1

Show a non-vectorized (e.g. a loop) and a vectorized way to add two to the numbers 1 through 3.

?

### Exercise 2

For the following, get the column sums for `x` without a loop.

In [21]:
x = np.random.poisson(lam = 5, size = 100000).reshape((1000, 100))

## Regex Exercises

### Exercise 1

Using `str.replace`, replace all the a’s in the following strings with nothing.

In [22]:
x = ['abc', 'a', 'bd', 'eaf', 'abracadabra']