## Iterative Programming

Almost everything you do when dealing with data will need to be done again, and again, and again. If you copy and paste your way through repeated tasks, you are not only working inefficiently, you are almost certainly setting yourself up for trouble when anything about the data or the underlying process changes.

To avoid this, you need to be familiar with basic programming, and a good starting point is an iterative approach to repetitive problems.

In [None]:
import pandas as pd
import numpy as np

weather = pd.read_csv('../data/weather.csv')

### For Loops

The first cell below is the sort of thing we don't want. The second does the same work with a loop.

In [None]:
np.mean(weather.humid)
np.mean(weather.temp)
np.mean(weather.wind_speed)
np.mean(weather.precip)

In [None]:
for column in ['temp', 'humid', 'wind_speed', 'precip']:
    # single brackets select a Series, so np.mean returns one number
    print(np.mean(weather[column]))

Now if the name of the data changes, the columns we want change, or we want to calculate something else, we usually only have to change one thing, rather than *at least* one and probably many more. In addition, the amount of code is the same whether the loop goes over 100 columns or 4.
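
As a minimal sketch of that point, the loop below runs unchanged whether the list holds three columns or three hundred. The small data frame `toy` and its values are made up for illustration and only stand in for the weather data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the weather data (values are made up)
toy = pd.DataFrame({
    'temp': [50.0, 60.0, 70.0],
    'humid': [40.0, 50.0, 60.0],
    'wind_speed': [5.0, 10.0, 15.0],
})

# Changing what we summarise means editing only this list
toy_columns = ['temp', 'humid', 'wind_speed']

toy_means = {}
for column in toy_columns:
    # Single brackets select a Series, so np.mean returns one number
    toy_means[column] = np.mean(toy[column])

print(toy_means)
```

Storing the results in a dictionary keeps each mean next to the name of its column.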

Let's do things a little differently.  The following will provide a usable result and is coded in the same fashion as the R example (not necessarily optimal).

In [None]:
?np.mean

In [None]:
columns = ['temp', 'humid', 'wind_speed', 'precip']
nyc_means = np.repeat(None, len(columns))

for i in range(len(columns)):
  column = columns[i]
  nyc_means[i] = np.mean(weather[column])  # single brackets give a Series, so the mean is one number

print(nyc_means)

Unlike in R, loops in Python are fast enough to be viable. This doesn't get around the verbosity issue, but it means we needn't be as wary of them as we are in R. Loops in Python are also more flexible than those in R.

Python provides what is called a *list comprehension*, a shorthand for a loop that builds a new list from any *iterable* object, such as a list or an array.

To demonstrate, we'll just get the squared values of 0, 1 and 2.

In [None]:
[x**2 for x in range(3)]

Now let's try it for our weather data.

In [None]:
[np.mean(weather[x]) for x in columns]  # columns was created above

While not too dissimilar from how we use sapply or lapply in R, there is no special function to call.
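
To make the shorthand concrete, here is a small sketch that writes the same computation as an explicit loop, as a list comprehension, and as a dictionary comprehension. The `readings` dictionary and its values are made up for illustration.

```python
from statistics import mean

# Hypothetical readings, made up for illustration
readings = {
    'temp': [50.0, 60.0, 70.0],
    'humid': [40.0, 50.0, 60.0],
}

# Explicit loop that appends to a list
means_loop = []
for name in readings:
    means_loop.append(mean(readings[name]))

# Equivalent list comprehension
means_comp = [mean(readings[name]) for name in readings]

# A dictionary comprehension keeps each name next to its result
means_by_name = {name: mean(values) for name, values in readings.items()}

print(means_comp)
print(means_by_name)
```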

Another thing I like about Python loops versus R loops is that there is an easy way to create multiple objects within the loop. It's not intuitive at first for our example, so let's build some intuition.

First, let's just do a simple double assignment.

In [None]:
x, y = [1, 2]

In [None]:
x

In [None]:
y

Well that was easy enough!  Let's try it with a standard loop.

In [None]:
nyc_means = np.repeat(None, len(columns))
nyc_sds = np.repeat(None, len(columns))

for i in range(len(columns)):
    column = weather[columns[i]]  # single brackets select a Series
    nyc_means[i], nyc_sds[i] = np.mean(column), np.std(column)
    
nyc_means

In [None]:
nyc_sds

We can now use a list comprehension to do this in one line. The comprehension produces a list of (mean, standard deviation) pairs. The `*` unpacks that list so each pair becomes a separate argument to `zip`, which then regroups the values into one tuple of means and one tuple of standard deviations. This gets us what we want in a very succinct fashion.

In [None]:
nyc_means, nyc_sds = zip(*[(np.mean(weather[x]), np.std(weather[x])) for x in columns])

In [None]:
nyc_means

In [None]:
nyc_sds
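
The `zip(*...)` idiom is easier to see on a tiny example. In this sketch the pairs are made-up numbers, not values from the weather data.

```python
# Made-up (mean, sd) pairs, one tuple per column
pairs = [(55.0, 17.0), (62.0, 19.0), (10.0, 5.0)]

# zip(*pairs) is the same call as zip(pairs[0], pairs[1], pairs[2])
means, sds = zip(*pairs)

print(means)  # (55.0, 62.0, 10.0)
print(sds)    # (17.0, 19.0, 5.0)
```

Each result is a tuple rather than a list, so wrap it in `list()` if you need to modify it afterwards.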

In the end though, creating a function and using `map` or another approach like the R way may be best for a particular problem.
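
One way that could look is sketched below, using Python's built-in `map` with a small helper. The data frame `toy` and the function `column_summary` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the weather data (values are made up)
toy = pd.DataFrame({'temp': [50.0, 60.0, 70.0], 'humid': [40.0, 50.0, 60.0]})

def column_summary(name):
    """Return the mean and standard deviation of one column of toy."""
    values = toy[name].to_numpy()
    # np.std on a NumPy array uses the population formula (ddof=0) by default
    return np.mean(values), np.std(values)

# map applies the function to each name lazily; list() forces evaluation
summaries = list(map(column_summary, ['temp', 'humid']))
toy_means, toy_sds = zip(*summaries)

print(toy_means)
print(toy_sds)
```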

### Using while

As in other programming languages, a `while` statement in Python is another way to write a loop. If you use one, you can take advantage of the `+=` operator, whose absence from R is a baffling oversight. Note that we start at zero and change `<=` to `<` as a result, but otherwise this is identical to the R example.

In [None]:
nyc_means = np.repeat(None, len(columns))
i = 0

while i < len(columns):
    nyc_means[i] = np.mean(weather[columns[i]])
    i += 1

nyc_means
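
A `while` loop is most useful when the number of iterations is not known in advance. The sketch below, with a made-up starting value and threshold, halves a number until it drops below the threshold and uses `+=` to count the steps.

```python
value = 100.0      # made-up starting value
threshold = 1.0    # made-up stopping point
steps = 0

while value >= threshold:
    value /= 2     # augmented assignment works for other operators too
    steps += 1

print(steps, value)  # 7 0.78125
```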

Understanding loops is fundamental to spending less time processing data and more time exploring it. Your code will be more succinct and better able to handle the usual changes that come with dealing with data.

### Apply-type approaches

In [None]:
def stdize(x):
    return (x - np.mean(x)) / np.std(x)  # center first, then scale

weather[columns].apply(stdize, axis = 1)   # 0 for columns, 1 for rowwise application

Sadly the above shows how much slower working with data frames can be in Python vs. R.  The above operation took several seconds.  But as a counterpoint, Python's string capabilities are very easy to use and fast relative to R.  The following provides an example with list comprehension.

In [None]:
x = ['aba', 'abb', 'abc', 'abd', 'abe']

print([i.strip('ab') for i in x]) 
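
One caveat about `strip` is worth a quick sketch: its argument is a set of characters to remove from both ends, not a literal prefix. If the aim is to drop only a leading `'ab'`, `str.removeprefix` (available from Python 3.9) is one alternative.

```python
x = ['aba', 'abb', 'abc', 'abd', 'abe']

# strip('ab') removes any leading or trailing 'a' or 'b' characters
stripped = [i.strip('ab') for i in x]

# removeprefix('ab') removes only the literal leading 'ab' (Python 3.9+)
no_prefix = [i.removeprefix('ab') for i in x]

print(stripped)   # ['', '', 'c', 'd', 'e']
print(no_prefix)  # ['a', 'b', 'c', 'd', 'e']
```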

Here is an example of a rowwise application.

In [None]:
df = pd.DataFrame(
    {
        'a': range(1,4),
        'b': range(4,7)
    }
)

df

df.apply(np.sum, axis=1)  # axis=1 applies the function to each row
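
Row-wise `apply` calls a Python function once per row, which is a large part of why it can be slow. Where a built-in vectorized method exists, it is usually the better choice. The following is a sketch on the same small data frame, not a benchmark. The names `row_sums_apply`, `row_sums_vec`, and `standardized` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': range(1, 4), 'b': range(4, 7)})

# Row sums via apply (one Python-level function call per row)
row_sums_apply = df.apply(lambda row: row.sum(), axis=1)

# Same result from the built-in vectorized method
row_sums_vec = df.sum(axis=1)

# Column-wise standardization without apply; ddof=0 matches np.std's default
standardized = (df - df.mean()) / df.std(ddof=0)

print(row_sums_vec.tolist())  # [5, 7, 9]
print(standardized)
```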

### Map functionality

While we have apply functionality, we also have map functionality similar to that demonstrated with R. Base R has a `Map` function, but purrr adds both flexibility and some rigor to its use. The main point here is that we can use something similar in Python.

In [None]:
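# note: this name shadows Python's built-in round for the rest of the session
round = lambda x: '%.2f' % x

# applymap works element-wise; pandas 2.1+ prefers the name DataFrame.map
weather[columns].applymap(round)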

The `map` method of a pandas Series applies a function to each element of that one vector, typically a column. The following does for a single column what `applymap` does for the whole data frame.

In [None]:
df.a.map(round)
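
The relationship between the two can be sketched by mapping each column in turn. The helper `two_decimals` is a hypothetical name used here so that the built-in `round` is left alone.

```python
import pandas as pd

df = pd.DataFrame({'a': range(1, 4), 'b': range(4, 7)})

def two_decimals(x):
    """Format a number as a string with two decimal places."""
    return '%.2f' % x

# Series.map: element-wise over one column
a_formatted = df['a'].map(two_decimals)

# Element-wise over every column by mapping each Series in turn;
# this avoids relying on applymap, which newer pandas versions deprecate
all_formatted = df.apply(lambda col: col.map(two_decimals))

print(a_formatted.tolist())  # ['1.00', '2.00', '3.00']
print(all_formatted)
```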

### Working with lists

List objects make it very easy to iterate some form of data processing.

Let’s say you have models of increasing complexity, and you want to easily summarise and/or compare them. We create a list in which each element is a model object. We then apply a function to each element, e.g. to get its AIC value or adjusted R-squared.

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data
results = list()


# Fit regression models of increasing complexity
results.append(smf.ols('mpg ~ wt', data = mtcars).fit())
results.append(smf.ols('mpg ~ wt*hp', data = mtcars).fit())
results.append(smf.ols('mpg ~ wt + hp + vs + am', data = mtcars).fit())


In [None]:
results

In [None]:
# round here is the two-decimal formatter defined earlier, not the built-in
print([round(x.rsquared_adj) for x in results])

In [None]:
print([round(x.aic) for x in results])
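
The same pattern works for any list of fitted objects. As a rough sketch that needs no data download, the block below swaps in NumPy polynomial fits on synthetic data for the statsmodels fits. The helper `adjusted_r2`, the list `fits`, and the data are illustrative assumptions, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
# Synthetic response, made up for illustration
y = 2 + 0.5 * x + 0.3 * x**2 + rng.normal(scale=1.0, size=x.size)

def adjusted_r2(y, fitted, n_predictors):
    """Adjusted R-squared for a least-squares fit with an intercept."""
    n = len(y)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# One list element per model: polynomial fits of increasing degree
fits = [
    {'degree': d, 'coefs': np.polyfit(x, y, d)}
    for d in (1, 2, 3)
]

# Apply the same summary function to every element of the list
adj_r2 = [
    adjusted_r2(y, np.polyval(f['coefs'], x), f['degree'])
    for f in fits
]

print(['%.3f' % v for v in adj_r2])
```

Keeping the fits in a list means that adding a fourth model changes one line, and every summary that follows picks it up automatically.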