# Python Pointers (vol. 3)

## Fun with f-strings

Reminder: there are several ways to make formatted strings in Python (`%`-formatting, `str.format`, and f-strings), but f-strings are usually the easiest to write and to read.




In [138]:
#
# f-string application: insert values into a string and print it
#
values = [2, 3, 6, 5, 1]

[print(f'the {i}th element of values is {v}') for i, v in enumerate(values)]

# from python 3.8 onward:
print(f"No need to `a = {{a}}`, just write `{{a = }}`: {values[2] = }")

the 0th element of values is 2
the 1th element of values is 3
the 2th element of values is 6
the 3th element of values is 5
the 4th element of values is 1
No need to `a = {a}`, just write `{a = }`: values[2] = 6


In the next cell we look at a list of items with mixed types (float, integer, and string). The first `print` statement just prints each value, which is sometimes fine but may not be tidy enough for generating a table or annotating a plot. The second `print` statement loops through the list and uses `:` to apply a format specification to each element. I included a string in the list, which can't be formatted as a float, to show how to check whether a format is appropriate: we test whether the element is a number using `isinstance(v, numbers.Number)` and only then apply the float format. This is also an example of conditional formatting. Note that the second statement reuses double quotes inside the f-string expression, which requires Python 3.12 or later (discussed further below).

In [139]:
#
# f-string application 2: use the colon (":") to provide a formatting string. Can be used for precision or string formatting. 
#
import numbers

values2 = [10.9999848587273, 1/3, 44.200001, 9, -10, -999, "a string???"]

[print(v) for v in values2]  # no formatting

[print(f"the {i}th element of values is {v:{"6.4f" if isinstance(v, numbers.Number) else ""}}") for i, v in enumerate(values2)]


10.9999848587273
0.3333333333333333
44.200001
9
-10
-999
a string???
the 0th element of values is 11.0000
the 1th element of values is 0.3333
the 2th element of values is 44.2000
the 3th element of values is 9.0000
the 4th element of values is -10.0000
the 5th element of values is -999.0000
the 6th element of values is a string???


[None, None, None, None, None, None, None]
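As a complement to the cell above, here is a minimal sketch of a few other things the format specification after `:` can do, plus a small helper that applies a numeric format only to numbers. The helper name `fmt_value` and the sample numbers are illustrative and not part of the original notebook. The helper avoids nested quotes, so it should also run on Python versions before 3.12.

```python
import numbers

value = 1234.56789
width = 8  # the width can itself come from a variable via nested braces

examples = {
    "fixed": f"{value:10.2f}",        # right-aligned in 10 characters, 2 decimals
    "thousands": f"{value:,.0f}",     # thousands separator, no decimals
    "percent": f"{0.256:.1%}",        # multiply by 100 and append '%'
    "padded": f"{42:05d}",            # zero-padded integer
    "nested": f"{value:{width}.1f}",  # width taken from a variable
}
for name, text in examples.items():
    print(f"{name:>10}: '{text}'")


def fmt_value(v, spec="6.4f"):
    """Format numbers with `spec`; fall back to plain `str` for anything else."""
    if isinstance(v, numbers.Number):
        return format(v, spec)
    return str(v)


print([fmt_value(v) for v in [1 / 3, 9, -10, "a string???"]])
```

Moving the type check into a function keeps the f-string itself short, which helps once the conditional logic grows beyond a single test.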

In the following cell, we import `datetime` and get today's date as a `datetime` object. Datetime objects can be formatted using syntax described here:
https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

Use `datetime.strptime` to take a plain text string and convert it to a `datetime` object by specifying the format. Here we do that for the release dates of two Python versions.

Then we use f-strings to print how long ago those Python versions were released. To do that, we take the difference between the `datetime` objects for today and each release date, use the `.days` attribute of the resulting `timedelta` to get the number of days, and divide by 365 to convert to (approximate) years. Then we use `:` to apply formatting to that value.

In [140]:
from datetime import datetime

print(f'The current date and time is {datetime.today():%Y-%m-%d %H:%M %p}')


p3p6release = datetime.strptime("23 December 2016", "%d %B %Y")
p3p8release = datetime.strptime("14 October 2019", "%d %B %Y")
today = datetime.today()

print(f"Python 3.6 was released {(today - p3p6release).days / 365 : 2.1f} years ago")
print(f"Python 3.8 was released {(today - p3p8release).days / 365 : 2.1f} years ago")

The current date and time is 2024-01-25 13:54 PM
Python 3.6 was released  7.1 years ago
Python 3.8 was released  4.3 years ago
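The same ideas can be sketched with fixed dates so the result is reproducible. This is only an illustration: the variable names and the choice of 365.25 days per year are assumptions, not what the cell above uses. It also contrasts `%H` (24-hour clock) with `%I` (12-hour clock), which is the directive that normally pairs with `%p`.

```python
from datetime import datetime

# Fixed dates keep the example reproducible (the notebook cell uses datetime.today()).
release_36 = datetime.strptime("23 December 2016", "%d %B %Y")
release_38 = datetime.strptime("14 October 2019", "%d %B %Y")

gap = release_38 - release_36          # a timedelta object
print(f"{gap.days = }")                # `.days` is an attribute, not a method
print(f"{gap.total_seconds() = }")     # `.total_seconds()` is a method
print(f"Python 3.8 followed Python 3.6 by {gap.days / 365.25:.1f} years")

# %H is the 24-hour clock; %I is the 12-hour clock that pairs with %p (AM/PM).
moment = datetime(2024, 1, 25, 13, 54)
print(f"24-hour: {moment:%Y-%m-%d %H:%M}")
print(f"12-hour: {moment:%Y-%m-%d %I:%M %p}")
```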


Starting in Python 3.12, the f-string parser was improved. The most obvious impact for most users is that we no longer need to switch quote styles between the f-string itself and the expressions inside it. That is, you can delimit the f-string with double quotes `"` and then use double quotes inside expressions without raising an exception. Backslashes such as `\n` are now also allowed inside expressions. **Before Python 3.12** reusing the quote character was a syntax error, and we had to switch styles, as in `f"a string with some additional string inside like {'this'}"`.

In [141]:
#
# f-string application 4: better parsing in python 3.12, includes being able to repeat quote-mark character inside expressions
#
import sys
pyversion = sys.version
print(f"{"\n".join(["Python version:",pyversion])} \n \t note that the same double-quotes used in f-string expression.")


Python version:
3.12.0 | packaged by conda-forge | (main, Oct  3 2023, 08:36:57) [Clang 15.0.7 ] 
 	 note that the same double-quotes used in f-string expression.
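If the code also has to run on Python versions before 3.12, one option is to move anything containing quotes or backslashes out of the braces. The following is a sketch of that workaround, with illustrative variable names:

```python
import sys

# Hoist anything containing quotes or backslashes out of the braces.
newline = "\n"
header = newline.join(["Python version:", sys.version])

# Single quotes inside a double-quoted f-string work on every version with f-strings.
message = f"{header}\n\tparser: {'new (3.12+)' if sys.version_info >= (3, 12) else 'old'}"
print(message)
```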


### Why am I still harping on f-strings?

To me, there are a few key advantages to using f-strings compared to other options. The main one is that f-strings are more readable than most other options in most circumstances. Second, I think f-strings tend to be more reliable than string concatenation or similar string operations. Third, it turns out that f-strings are somewhat more efficient than older string-formatting options.

In the next cell, we show how string concatenation works using the `+` operator. That works well for strings, and one is tempted to use it for simple printing, but it fails as soon as one of the operands is not a string. The failing case is wrapped in a `try/except` block so that the full cell can run. The error is that you can't concatenate a string with an integer.

The third section shows printing comma-separated values. The Python `print` function accepts any number of arguments and prints them separated by a space (the default `sep`). This is great for many cases, but remember that it is a feature of `print`. If the intention is to construct a string to be used somewhere else, the comma-separated values follow normal Python behavior and produce a tuple.


In [142]:
catstring = "abc" + "def"
print(catstring)
# note that another method to concatenate strings is:
catstring2 = "".join(["abc", "def"])
print(catstring2)

# can not concatenate string and non-string:
try: 
    for i in values:
        print("My value is" + i)
except Exception as error:
    # handle the exception
    print(f"The statement raised an exception:\n\n{error}")


# comma-separated values are treated nicely by `print`
for i in values:
    print("My value is",i)

# separate from `print` things break down
s = "My value is",values[3]
print(s)  # since s is already a tuple, print can't help here
print(type(s))
print(f"Enter s as a label somewhere and get: {s}")



abcdef
abcdef
The statement raised an exception:

can only concatenate str (not "int") to str
My value is 2
My value is 3
My value is 6
My value is 5
My value is 1
('My value is', 5)
<class 'tuple'>
Enter s as a label somewhere and get: ('My value is', 5)
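For comparison, here is a short sketch of ways to build that label as an actual string rather than a tuple. The variable names are illustrative. All three produce the same text, and the f-string is the only one that needs no explicit conversion.

```python
values = [2, 3, 6, 5, 1]

# Three ways to build the same string without relying on `print`.
label_concat = "My value is " + str(values[3])                # explicit conversion
label_join = " ".join(map(str, ["My value is", values[3]]))   # convert, then join
label_fstring = f"My value is {values[3]}"                    # conversion is implicit

print(label_concat, label_join, label_fstring, sep="\n")

# `sep` and `end` control how `print` joins and terminates its arguments.
print("My value is", values[3], sep=": ", end="!\n")
```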


A common need is to build up some name as a string that is then used to access data. In the next cell, I show a simple example. 

First, to make this notebook stand-alone, we construct a synthetic Xarray dataset. 

Then we construct the variable name using an f-string and use it to get a value out of the dataset.

The end shows how to reduce the resulting xarray from a `DataArray` to just a scalar by first using `.values` to reduce to just the underlying numpy array and then the numpy `.item()` method to convert to a scalar. 

There are other ways to build that string for the variable name. For example, 
```
"_".join([z,frq, 'all', smp])
```
This is perfectly fine, but I find that consistently using f-strings is the fastest way to get code working. I usually only move to other approaches if the operations are getting too complicated to be able to easily read the f-string construction.

In [143]:
import xarray as xr
import numpy as np
# construct fake dataset
rng = np.random.default_rng(77665809) # random number generator
lat = xr.DataArray(np.arange(-90, 95, 5), dims='lat', attrs={'units':'degrees_north'})
lon = xr.DataArray(np.arange(0, 365, 5), dims='lon', attrs={'units':'degrees_east'})
data1 = xr.DataArray(rng.random((len(lat), len(lon))), dims=['lat','lon'], coords={'lat':lat,'lon':lon})
data1.name = 'toa_lw_all_mon'
data2 = xr.DataArray(rng.random((len(lat), len(lon))), dims=['lat','lon'], coords={'lat':lat,'lon':lon})
data2.name = 'sfc_sw_all_mon'
ds = xr.merge([data1,data2])


# let's assume we have somehow determined which parameters we need:
z = "toa"; frq = "lw"; smp = "mon"; latpt = 30; lonpt = 280

# construct the variable name:
varName = f"{z}_{frq}_all_{smp}"

# get the data at specified point:
data = ds[varName].sel(lat=latpt, lon=lonpt, method='nearest')

# try to print it:
print(f"My data value is {data}")

print("😖") # 😖 -- it prints a bunch of info along with the value!

print(f"My scalar data value is {data.values.item()} (\U0001f600)") # bonus: use unicode for symbols/emoji


My data value is <xarray.DataArray 'toa_lw_all_mon' ()>
array(0.31482101)
Coordinates:
    lat      int64 30
    lon      int64 280
😖
My scalar data value is 0.3148210128195048 (😀)


In [144]:
# bonus: 
# what if you don't know the dimension names (i.e., maybe 'lat' maybe 'latitude')
dims = ds[varName].dims
# if you *do* know the order, then can use it
ds[varName].sel({dims[0]:latpt, dims[1]:lonpt}, method='nearest')
# otherwise you probably have to do some kind of testing to infer which dimension is the one you want
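
One possible way to do that testing is to compare the dimension names against a list of likely candidates. This is only a sketch: the helper `find_dim` and the candidate names are assumptions, and real files may use other conventions.

```python
def find_dim(dims, candidates):
    """Return the first name in `dims` that matches one of `candidates` (case-insensitive)."""
    wanted = {c.lower() for c in candidates}
    for name in dims:
        if name.lower() in wanted:
            return name
    raise KeyError(f"none of {sorted(wanted)} found in {tuple(dims)}")


# Tuples standing in for `ds[varName].dims` from two differently named files.
for dims in [("lat", "lon"), ("time", "Latitude", "Longitude")]:
    lat_name = find_dim(dims, ["lat", "latitude", "y"])
    lon_name = find_dim(dims, ["lon", "longitude", "x"])
    print(f"{dims} -> {lat_name = }, {lon_name = }")
```

The returned names can then be passed to `.sel` in a dictionary, in the same way as `dims[0]` and `dims[1]` in the cell above.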

## Simple Performance Profiling

### Jupyter-specific "magics"
When using Jupyter, "magic" commands can be used on individual lines ("%") or for the entire cell ("%%"). General information about Jupyter magics is here:
https://ipython.readthedocs.io/en/stable/interactive/magics.html

See also: https://www.aboutdatablog.com/post/top-8-magic-commands-in-jupyter-notebook

For purposes of timing code, there are several of these commands:
- `%%time`
- `%%timeit`
- `%%prun`

These all correspond to tools that are built into Python, but have been wrapped up as convenience functions in Jupyter and IPython.

In the following few cells, I just demonstrate simple uses for these. The examples load files that are on my machine, so they won't work for you unless you point to some files that can be loaded with `xr.open_mfdataset()`. 

The `%timeit` line-magic is shown to demonstrate its basic use. It gathers timing statistics on a particular expression by running the code many times (so be careful!). Some trivial functions are defined just to have something to time, and then several cells show how to control how many loops and runs `%timeit` performs.

Two cells show how `%%prun` can be used for detailed profiling. It goes deep into the code, so it usually provides more fine-grained information than you will want. If you find yourself needing to really get into your code at this level of detail, I suggest you find some additional tools. Start here: https://realpython.com/python-profiling/

After that, there's a section about `perf_counter`. I find `perf_counter` to be the most useful code-timing tool on a day-to-day basis.

In [145]:
%%time

# %%time provides a timer for the whole cell.

from pathlib import Path
import xarray as xr
dataLocation = Path("/Users/brianpm/Dropbox/Data/CERES/Ed4.1")
ds = xr.open_mfdataset(sorted(dataLocation.glob("CERES_EBAF_Ed4.1_Subset_*.nc")))

CPU times: user 6.51 s, sys: 493 ms, total: 7 s
Wall time: 7.23 s


In [146]:
# basic use of %timeit on a single line of code:
%timeit ds = xr.open_mfdataset(sorted(dataLocation.glob("CERES_EBAF_Ed4.1_Subset_*.nc")))

5.87 s ± 80.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [147]:
import time # use for `sleep()` method

def sleepy(n):
    time.sleep(1/n)

def dopey(n):
    return sum(range(n))

def grumpy(num):
    if not hasattr(num, 'is_integer'):
        print(f"You have to provide a number")
        return None
    if not num.is_integer():
        print(f"You have to provide a whole number")
        return None
    if num > 1:
        # Iterate from 2 to n / 2
        for i in range(2, int(num/2)+1):
            # If num is divisible by any number between
            # 2 and n / 2, it is not prime
            if (num % i) == 0:
                print(f"{num} is not a prime number")
                return False
        else:
            print(f"{num} is a prime number")
            return True
    else:
        print(f"{num} is not a prime number")
        return False

In [148]:
# `%timeit` magic to measure a line
%timeit [dopey(i) for i in range(10)]

1.15 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [149]:
# control the sampling: -n sets the number of loops per run (here 5 instead of the automatic 1,000,000)
%timeit -n5 [dopey(i) for i in range(10)]

1.22 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [150]:
# change number of runs
%timeit -n5 -r10 [dopey(i) for i in range(10)]

1.26 µs ± 177 ns per loop (mean ± std. dev. of 10 runs, 5 loops each)


In [151]:
# can also use timeit directly in python (no jupyter required)

# https://ioflood.com/blog/timeit-python/

import timeit
timeit.repeat('n=10; output = sum(range(n))', repeat=5)

[0.0975983329990413,
 0.09732383298978675,
 0.09365462500136346,
 0.09434850000252482,
 0.09346799999184441]
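
Each value returned by `timeit.repeat` is the total time for `number` executions, and `number` defaults to 1,000,000, so the values above are seconds per million executions rather than per call. The sketch below sets `number` and `repeat` explicitly and converts to a per-loop time. The variable names are illustrative.

```python
import timeit

number = 10_000   # executions per run (the `-n` option of %timeit)
repeat = 5        # number of runs (the `-r` option of %timeit)

totals = timeit.repeat("sum(range(n))", setup="n = 10", repeat=repeat, number=number)

# Each entry is the total time for `number` executions, so divide to get per-loop time.
per_loop = [t / number for t in totals]
print(f"best of {repeat}: {min(per_loop) * 1e9:.0f} ns per loop")
```

Reporting the minimum rather than the mean is a common choice here, because slower runs usually reflect interference from other processes rather than the code itself.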

In [152]:
%%prun

# simple example of prun

n = 3 * 8 * 2

 

         3 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)

In [153]:
%%prun
print("start")
time.sleep(1)
for i in range(10):
    sleepy(1)
    grumpy(i)
print("finish")


start
0 is not a prime number
1 is not a prime number
2 is a prime number
3 is a prime number
4 is not a prime number
5 is a prime number
6 is not a prime number
7 is a prime number
8 is not a prime number
9 is not a prime number
finish
 

         5659 function calls (5386 primitive calls) in 11.048 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       11    8.653    0.787    8.653    0.787 {built-in method time.sleep}
     35/1    2.371    0.068    0.000    0.000 {method 'control' of 'select.kqueue' objects}
    105/1    0.003    0.000    0.000    0.000 socket.py:621(send)
       43    0.002    0.000    0.002    0.000 attrsettr.py:65(_get_attr_opt)
       37    0.001    0.000   21.291    0.575 base_events.py:1874(_run_once)
       24    0.001    0.000    0.003    0.000 iostream.py:655(write)
        2    0.001    0.000    0.001    0.000 {method '__exit__' of 'sqlite3.Connection' objects}
       40    0.001    0.000    0.001    0.000 encoder.py:205(iterencode)
     35/1    0.000    0.000    0.000    0.000 selectors.py:553(select)
       11    0.000    0.000    0.007    0.001 iostream.py:616(_flush)
       43    0.000    0.000    0.003    0.000 attrsettr.py:
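
Outside Jupyter, a similar report can be produced with the standard-library `cProfile` and `pstats` modules (the `_lsprof.Profiler` entry in the output above is part of `cProfile`). The following is a minimal sketch. The function `work` is a made-up workload, and sorting by cumulative time with a five-line limit is just one reasonable choice.

```python
import cProfile
import io
import pstats


def work(n):
    """Toy workload: sum of squares."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
work(100_000)
profiler.disable()

# Send the report to a string buffer, sorted by cumulative time, top 5 entries only.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

Limiting the number of printed entries helps avoid the long list of Jupyter-internal calls seen in the `%%prun` output above.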

### perf_counter

In the following, we define a function, `foo`, that stands in for something that takes a while. Here I open the files defined above, but you could put something else in.

The `perf_counter` function is in the `time` module, so `import time` is needed.

The easy way to use `perf_counter` is to record a starting time with it, call it again later, and take the difference.

For doing more detailed timing through blocks of code, I find it is convenient to have a decorator that can be applied to all my functions. This is instead of typing lots of calls to `perf_counter` and print statements (helps with clean up when finished). 

We define a decorator called `func_timer`. A decorator is defined as a function that gets another function as the input and "wraps" the input function with some other code. In our case, we just put `perf_counter` calls around the input function and return the result of the function without changing it. For more on decorators:
- https://realpython.com/primer-on-python-decorators
- https://machinelearningmastery.com/a-gentle-introduction-to-decorators-in-python/
- https://ioflood.com/blog/python-decorator

In [154]:
def foo(fils):
    return xr.open_mfdataset(fils)

In [155]:
time.perf_counter()

99805.195132166

In [156]:
%%time
start_time = time.perf_counter()
some_stuff = foo(sorted(dataLocation.glob("CERES_EBAF_Ed4.1_Subset_*.nc")))
end_time = time.perf_counter()
print(f"The elapsed time was {end_time - start_time} seconds.")

The elapsed time was 6.79295191600977 seconds.
CPU times: user 6.26 s, sys: 468 ms, total: 6.73 s
Wall time: 6.79 s
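
When the start/stop pattern is needed around an arbitrary block of code rather than a whole function, one option is a small context manager. This is a sketch, not something the notebook uses. The helper name `timed` is hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print the elapsed time of the enclosed block (hypothetical helper)."""
    start_time = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the timing is always reported.
        elapsed = time.perf_counter() - start_time
        print(f"{label}: {elapsed:.4f} seconds")


with timed("short sleep"):
    time.sleep(0.05)
```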


In [157]:
def func_timer(func):
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        value = func(*args, **kwargs)
        end_time = time.perf_counter()
        run_time = end_time - start_time
        print(f"Elapsed time for function {repr(func.__name__)} was {run_time} seconds.")
        return value
    return wrapper
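
A common refinement of a decorator like the one above is to apply `functools.wraps` to the inner function, so that the decorated function keeps its own name and docstring. The sketch below shows that variant. The names `func_timer_wrapped` and `add` are illustrative, and the six-decimal formatting is an arbitrary choice.

```python
import functools
import time


def func_timer_wrapped(func):
    """Variant of `func_timer` that keeps the wrapped function's metadata."""
    @functools.wraps(func)  # copy __name__, __doc__, etc. onto the wrapper
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        value = func(*args, **kwargs)
        run_time = time.perf_counter() - start_time
        print(f"Elapsed time for function {func.__name__!r} was {run_time:.6f} seconds.")
        return value
    return wrapper


@func_timer_wrapped
def add(a, b):
    """Add two numbers."""
    return a + b


print(add(2, 3))        # the result passes through unchanged
print(add.__name__)     # 'add' rather than 'wrapper'
```

Without `functools.wraps`, `add.__name__` would report `'wrapper'`, which can be confusing in tracebacks and in `help()` output.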

In [158]:
# decorate foo ... usually just go back to original place where it is defined,
# but we repeat it here for clarity.
@func_timer
def foo(fils):
    return xr.open_mfdataset(fils)

In [159]:
# now run "foo" but it has been "decorated" so it will print the timing information.
some_stuff = foo(sorted(dataLocation.glob("CERES_EBAF_Ed4.1_Subset_*.nc")))

Elapsed time for function 'foo' was 7.238571957990644 seconds.


In [160]:
# apply the same timer decorator to more functions.

@func_timer
def sleepy(n):
    time.sleep(1/n)

@func_timer
def grumpy(num):
    if not hasattr(num, 'is_integer'):
        print(f"You have to provide a number")
        return None
    if not num.is_integer():
        print(f"You have to provide a whole number")
        return None
    if num > 1:
        # Iterate from 2 to n / 2
        for i in range(2, int(num/2)+1):
            # If num is divisible by any number between
            # 2 and n / 2, it is not prime
            if (num % i) == 0:
                print(f"{num} is not a prime number")
                return False
        else:
            print(f"{num} is a prime number")
            return True
    else:
        print(f"{num} is not a prime number")
        return False

In [161]:
%%time
print("start")
time.sleep(1)
for i in range(22):
    sleepy((i+1))
    grumpy(i)
print("finish")

start
Elapsed time for function 'sleepy' was 1.0041664589953143 seconds.
0 is not a prime number
Elapsed time for function 'grumpy' was 2.12500017369166e-05 seconds.
Elapsed time for function 'sleepy' was 0.5027005000010831 seconds.
1 is not a prime number
Elapsed time for function 'grumpy' was 2.2957989131100476e-05 seconds.
Elapsed time for function 'sleepy' was 0.33838704200752545 seconds.
2 is a prime number
Elapsed time for function 'grumpy' was 2.808299905154854e-05 seconds.
Elapsed time for function 'sleepy' was 0.2544782920012949 seconds.
3 is a prime number
Elapsed time for function 'grumpy' was 5.6541000958532095e-05 seconds.
Elapsed time for function 'sleepy' was 0.20951795800647233 seconds.
4 is not a prime number
Elapsed time for function 'grumpy' was 5.1749986596405506e-05 seconds.
Elapsed time for function 'sleepy' was 0.16874633300176356 seconds.
5 is a prime number
Elapsed time for function 'grumpy' was 5.4916992667131126e-05 seconds.
Elapsed time for function 'sleepy'

In [162]:
# Example

# Use the same decorator on three versions of the same calculation:
# compare summing a long array of numbers using an explicit loop, Python's built-in `sum`, or NumPy

arrayOfNumbers = rng.random(10_000_000)


@func_timer
def sum_by_loop(a):
    """Sum with a naive loop."""
    tot = 0
    for val in a:
        tot = tot + val
    return tot


@func_timer
def sum_py(a):
    """Use python's native `sum` method."""
    return sum(a)

@func_timer
def sum_by_np(a):
    """Use numpy's `sum` method."""
    return np.sum(a)


In [163]:
sum_by_loop(arrayOfNumbers)

sum_py(arrayOfNumbers)

sum_by_np(arrayOfNumbers)



Elapsed time for function 'sum_by_loop' was 0.5169932919961866 seconds.
Elapsed time for function 'sum_py' was 0.44800083299924154 seconds.
Elapsed time for function 'sum_by_np' was 0.0019196659995941445 seconds.


4999065.117409178

In [164]:
# Bonus: 
# compare getting the unique values in a list

import string
letters = list(string.ascii_lowercase)

@func_timer
def get_uniq_set(x):
    unique_values = set(x)
    number_unique = len(unique_values)
    return number_unique, unique_values

@func_timer
def get_uniq_np(x):
    unique_values = np.unique(x)
    number_unique = len(unique_values)
    return number_unique, unique_values

In [173]:
for i in [10, 100, 1000, 100_000, 1_000_000, 10_000_000]:
    llist = np.random.choice(letters, i, replace=True)
    print(f"{i = }")
    n1, v1 = get_uniq_set(llist)
    n2, v2 = get_uniq_np(llist)
    print(f"answers: {n1 = }, {n2 = }")
    

i = 10
Elapsed time for function 'get_uniq_set' was 1.0749994544312358e-05 seconds.
Elapsed time for function 'get_uniq_np' was 8.90829978743568e-05 seconds.
answers: n1 = 8, n2 = 8
i = 100
Elapsed time for function 'get_uniq_set' was 2.1333005861379206e-05 seconds.
Elapsed time for function 'get_uniq_np' was 9.241700172424316e-05 seconds.
answers: n1 = 24, n2 = 24
i = 1000
Elapsed time for function 'get_uniq_set' was 0.0001429999974789098 seconds.
Elapsed time for function 'get_uniq_np' was 0.000184959004400298 seconds.
answers: n1 = 26, n2 = 26
i = 100000
Elapsed time for function 'get_uniq_set' was 0.01293075000285171 seconds.
Elapsed time for function 'get_uniq_np' was 0.0034943329956149682 seconds.
answers: n1 = 26, n2 = 26
i = 1000000
Elapsed time for function 'get_uniq_set' was 0.12930979100929108 seconds.
Elapsed time for function 'get_uniq_np' was 0.036655916992458515 seconds.
answers: n1 = 26, n2 = 26
i = 10000000
Elapsed time for function 'get_uniq_set' was 1.277741540994611

In [168]:
# look for an individual letter
type(llist)
# try set intersection

list

In [174]:
# find all indices where llist equals some value
llist == 'b'

array([False, False, False, ..., False, False, False])

In [171]:
llist[0:100]

['k',
 'y',
 'l',
 'p',
 'n',
 'i',
 'z',
 'l',
 'c',
 'x',
 'c',
 'i',
 'o',
 'b',
 'c',
 'x',
 'z',
 'l',
 'g',
 'k',
 'm',
 'h',
 'f',
 'l',
 'c',
 'i',
 's',
 'l',
 'n',
 'o',
 'f',
 's',
 'v',
 'c',
 'r',
 'u',
 'r',
 'f',
 'e',
 'b',
 'c',
 'o',
 'j',
 'q',
 'o',
 'y',
 'l',
 'f',
 'b',
 't',
 'c',
 'o',
 'n',
 'f',
 'h',
 't',
 'b',
 'y',
 'd',
 'l',
 'c',
 'g',
 'e',
 'y',
 'p',
 'p',
 't',
 't',
 'w',
 't',
 'h',
 'w',
 'o',
 's',
 'x',
 'y',
 's',
 'k',
 'e',
 'b',
 'c',
 'w',
 'g',
 'i',
 'm',
 'a',
 'q',
 'c',
 'v',
 'f',
 'y',
 'm',
 't',
 't',
 'g',
 'h',
 'q',
 's',
 'g',
 'f']