# Primer on Python for R users

You may find yourself wanting to read and understand some Python, or even port some Python to R. This guide is designed to enable you to do these tasks as quickly as possible. As you’ll see, R and Python are similar enough that this is possible without necessarily learning all of Python. We start with the basics of container types and work up to the mechanics of classes, dunders, the iterator protocol, the context protocol, and more!

## Whitespace

Whitespace matters in Python. In R, expressions are grouped into a code block with {}. In Python, that is done by making the expressions share an indentation level. For example, an expression with an R code block might be:

```r
if (TRUE) {
  cat("This is one expression. \n")
  cat("This is another expression. \n")
}
#> This is one expression. 
#> This is another expression.
```

The equivalent in Python:

In [None]:
if True:
  print("This is one expression.")
  print("This is another expression.")

Python accepts tabs or spaces as the indentation spacer, but the rules get tricky when they’re mixed. Most style guides suggest (and IDEs default to) using spaces only.

## Container Types

In R, the `list()` is a container you can use to organize R objects. R’s `list()` is feature packed, and there is no single direct equivalent in Python that supports all the same features. Instead there are (at least) 4 different Python container types you need to be aware of: lists, dictionaries, tuples, and sets.

### Lists

Python lists are typically created using bare brackets `[]`. The Python built-in `list()` function is more of a coercion function, closer in spirit to R’s `as.list()`. The most important thing to know about Python lists is that they are modified in place. If two symbols, say `x` and `y`, point to the same list, then changes made through `x` are also visible through `y`, because the underlying list object is modified in place rather than copied.
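
A minimal sketch of what modify-in-place means in practice (the symbol names `x`, `y`, and `z` are only illustrative):

```python
x = [1, 2, 3]
y = x             # `y` and `x` now refer to the same list object
x.append(4)       # modifies the list in place

print("x is", x)  # x is [1, 2, 3, 4]
print("y is", y)  # y is [1, 2, 3, 4]

z = x.copy()      # an explicit (shallow) copy is a separate list
x.append(5)
print("z is", z)  # z is [1, 2, 3, 4]
```

An R user would expect `y` to keep its old value after `x` changes; in Python that only happens if you copy the list explicitly.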

Some syntactic sugar around Python lists you might encounter is the usage of `+` and `*` with lists. These are concatenation and replication operators, akin to R’s `c()` and `rep()`.

In [None]:
x = [1]
x

In [None]:
x + x

In [None]:
x*3

You can index into lists with integers using trailing `[]`, but note that indexing is 0-based.

In [None]:
x = [1, 2, 3]
x[0]

In [None]:
x[1]

In [None]:
x[2]

In [None]:
try:
  x[3]
except Exception as e:
  print(e)

When indexing, negative numbers count from the end of the container.

In [None]:
x = [1, 2, 3]
x[-1]

In [None]:
x[-2]

In [None]:
x[-3]

You can slice ranges of lists using the `:` inside brackets. Note that the slice syntax is **not** inclusive of the end of the slice range. You can optionally also specify a stride.

In [None]:
x = [1, 2, 3, 4, 5, 6]
x[0:2] # get items at index positions 0, 1

In [None]:
x[1:]  # get items from index position 1 to the end

In [None]:
x[:-2] # get items from the beginning up to, but not including, the 2nd to last

In [None]:
x[:]   # get all the items (idiom used to copy the list so as not to modify in place)

In [None]:
x[::2] # get all the items, with a stride of 2

In [None]:
x[1::2] # get all the items from index 1 to the end, with a stride of 2

### Tuples

Tuples behave like lists, except they are not mutable, and they don’t have the same modify-in-place methods like `append()`. They are typically constructed using bare `()`, but parentheses are not strictly required, and you may see an implicit tuple being defined just from a comma separated series of expressions. Because parentheses can also be used to specify order of operations in expressions like `(x + 3) * 4`, a special syntax is required to define tuples of length 1: a trailing comma. Tuples are most commonly encountered in functions that take a variable number of arguments.

In [None]:
x = (1, 2) # tuple of length 2
type(x)

In [None]:
len(x) # equivalent of R's length

In [None]:
x

In [None]:
x = (1,) # tuple of length 1
type(x)

In [None]:
len(x)

In [None]:
x

In [None]:
x = () # tuple of length 0
print(f"{type(x) = }; {len(x) = }; {x = }")

In [None]:
x = 1, 2 # also a tuple
type(x)

In [None]:
len(x)

In [None]:
x = 1, # beware a single trailing comma! This is a tuple!
type(x)

In [None]:
len(x)
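
To make the immutability point concrete, here is a small sketch (the names `t` and `t2` are illustrative) showing that a tuple rejects item assignment and has no `append()`, and that "changing" a tuple really means building a new one:

```python
t = (1, 2, 3)

# Item assignment is not supported on tuples
try:
  t[0] = 100
except TypeError as e:
  print("TypeError:", e)

# There is no in-place append() method either
print(hasattr(t, "append"))  # False

# Concatenation builds a new tuple and leaves the original untouched
t2 = t + (4,)
print(t2)                    # (1, 2, 3, 4)
print(t)                     # (1, 2, 3)
```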

#### Packing and Unpacking

Tuples are the container that powers the *packing* and *unpacking* semantics in Python. Python provides the convenience of allowing you to assign multiple symbols in one expression. This is called *unpacking*.

For example:

In [None]:
x = (1, 2, 3)
a, b, c = x
a

In [None]:
b

In [None]:
c

(You can access similar unpacking behavior from R using zeallot::`%<-%`.)

Tuple unpacking can occur in a variety of contexts, such as iteration:

In [None]:
xx = (("a", 1),
      ("b", 2))
for x1, x2 in xx:
  print("x1 = ", x1)
  print("x2 = ", x2)

If you attempt to unpack a container to the wrong number of symbols, Python raises an error:

In [None]:
x = (1, 2, 3)
a, b, c = x # success

In [None]:
a, b = x # error: too many values to unpack

In [None]:
a, b, c, d = x # error: not enough values to unpack

It is possible to unpack a variable number of arguments, using `*` as a prefix to a symbol. (You’ll see the `*` prefix again when we talk about functions).

In [None]:
x = (1, 2, 3)
a, *the_rest = x
a

In [None]:
the_rest

You can also unpack nested structures:

In [None]:
x = ((1, 2), (3, 4))
(a, b), (c, d) = x
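
The examples above focus on unpacking. As a complementary sketch, the snippet below shows *packing*, where a comma separated series of values is collected into a single tuple. It uses the built-in `divmod()`, which returns a tuple; the other names are illustrative.

```python
# Packing: comma separated values are packed into one tuple
point = 3, 4
print(point)                # (3, 4)

# Packing on the right and unpacking on the left swaps two symbols
a, b = 1, 2
a, b = b, a
print(a, b)                 # 2 1

# Functions that "return multiple values" return one packed tuple
result = divmod(7, 2)
print(result)               # (3, 1)

# ...which the caller is free to unpack immediately
quotient, remainder = divmod(7, 2)
print(quotient, remainder)  # 3 1
```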

### Dictionaries

Dictionaries are most similar to R environments. They are a container where you can retrieve items by name, though in Python the name (called a *key* in Python’s parlance) does not need to be a string like in R. It can be any hashable Python object, that is, any object with a `__hash__()` method (which covers most objects, with mutable built-in containers like lists being the notable exception). They can be created using syntax like `{key: value}`. Like Python lists, they are modified in place.

In [None]:
d = {"key1": 1,
     "key2": 2}
d2 = d
d

In [None]:
d["key1"]

In [None]:
d["key3"] = 3
d2 # modified in place!

Like R environments (and unlike R’s named lists), you cannot index into a dictionary with an integer to get an item at a specific index position; an integer inside `[]` is treated as a key. Dictionaries were historically unordered containers. (However, beginning with Python 3.7, dictionaries do preserve item insertion order.)

In [None]:
d = {"key1": 1, "key2": 2}
d[1] # error

The container that most closely matches the semantics of R’s named list is the `OrderedDict`, but that’s relatively uncommon in Python code so we don’t cover it further.
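
As a rough sketch of the point that keys need not be strings, the dictionary below mixes a string, an integer, and a tuple as keys. The contents are made up for illustration.

```python
# Keys of several different (hashable) types in one dictionary
d = {
  "a string": 1,
  42: 2,         # an integer key
  (1, 2): 3,     # a tuple key
}
print(d[42])      # 2
print(d[(1, 2)])  # 3

# Mutable built-in containers such as lists cannot be used as keys
try:
  d[[1, 2]] = 4
except TypeError as e:
  print("TypeError:", e)

# Looking up a missing key with [] raises KeyError;
# `in` and get() are ways to check without raising
print("missing" in d)       # False
print(d.get("missing"))     # None
print(d.get("missing", 0))  # 0
```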

### Sets

Sets are a container that can be used to efficiently track unique items or deduplicate lists. They are constructed using `{val1, val2}` (like a dictionary, but without `:`). Think of them as a dictionary where you only use the keys. Sets have many efficient methods for membership operations, like `intersection()`, `issubset()`, `union()` and so on.

In [None]:
s = {1, 2, 3}
type(s)

In [None]:
s

In [None]:
s.add(1)
s
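
A short sketch of deduplication and the membership methods mentioned above; the values are arbitrary.

```python
# Deduplicate a list by converting it to a set
# (a set does not promise to keep the original order)
values = [3, 1, 3, 2, 1]
unique = set(values)
print(sorted(unique))              # [1, 2, 3]

a = {1, 2, 3}
b = {3, 4}

print(a.intersection(b))           # {3}
print(a.union(b) == {1, 2, 3, 4})  # True
print({1, 2}.issubset(a))          # True
print(2 in a)                      # True

# {} creates an empty dictionary, not an empty set; use set() instead
print(type({}), type(set()))       # <class 'dict'> <class 'set'>
```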

## Iteration with `for`

The `for` statement in Python can be used to iterate over any kind of container.

In [None]:
for x in [1, 2, 3]:
  print(x)

R has a relatively limited set of objects that can be passed to `for`. Python, by comparison, provides an iterator protocol interface, which means that authors can define custom objects with custom behavior that is invoked by `for`.
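
As a sketch of what such a custom object might look like, the hypothetical `Countdown` class below implements the two dunder methods the iterator protocol relies on, `__iter__()` and `__next__()`. Classes are not otherwise covered in this shorter primer, so treat this as an illustration rather than a full explanation.

```python
class Countdown:
  """Hypothetical example class: counts down from `start` to 1."""

  def __init__(self, start):
    self.current = start

  def __iter__(self):
    # `for` calls iter() on the object, which invokes this method
    return self

  def __next__(self):
    # `for` then calls next() repeatedly until StopIteration is raised
    if self.current <= 0:
      raise StopIteration
    value = self.current
    self.current -= 1
    return value

for n in Countdown(3):
  print(n)
# prints 3, then 2, then 1
```

Anything that accepts an iterable, such as `list()`, works with the object too: `list(Countdown(3))` gives `[3, 2, 1]`.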

Iterating over dictionaries first requires understanding if you are iterating over the keys, values, or both. Dictionaries have methods that allow you to specify which.

In [None]:
d = {"key1": 1, "key2": 2}
for key in d:
  print(key)

In [None]:
for value in d.values():
  print(value)

In [None]:
for key, value in d.items():
  print(key, ":", value)

#### Comprehensions

Comprehensions are special syntax that allows you to construct a container like a list or a dict, while also executing a small operation or single expression on each element. You can think of it as special syntax for R’s `lapply()`.

For example:

In [None]:
x = [1, 2, 3]

# a list comprehension built from x, where you add 100 to each element
l = [element + 100 for element in x]
l

In [None]:
# a dict comprehension built from x, where the key is a string.
# Python's str() is like R's as.character()
d = {str(element) : element + 100
     for element in x}
d
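
Comprehensions can also filter elements with a trailing `if`. The sketch below shows a filtered list comprehension next to the explicit loop it stands in for, plus a set comprehension; the variable names are illustrative.

```python
x = [1, 2, 3, 4, 5, 6]

# A list comprehension with a filter: keep even elements, then square them
squares_of_evens = [element ** 2 for element in x if element % 2 == 0]
print(squares_of_evens)         # [4, 16, 36]

# The same result written as an explicit loop
result = []
for element in x:
  if element % 2 == 0:
    result.append(element ** 2)
print(result)                   # [4, 16, 36]

# A set comprehension uses braces without `key: value` pairs
remainders = {element % 3 for element in x}
print(remainders == {0, 1, 2})  # True
```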

## Defining Functions with `def`

Python functions are defined with the `def` statement. The syntax for specifying function arguments and default values is very similar to R.

In [None]:
def my_function(name = "World"):
  print("Hello", name)

my_function()

In [None]:
my_function("Friend")

The equivalent R snippet would be

```r
my_function <- function(name = "World") {
  cat("Hello", name, "\n")
}

my_function()
#> Hello World
my_function("Friend")
#> Hello Friend
```
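
The `*` prefix seen earlier in unpacking also appears in function definitions, where it plays a role loosely comparable to R’s `...`. The sketch below uses a hypothetical function named `describe` to show that extra positional arguments arrive as a tuple and extra keyword arguments arrive as a dictionary.

```python
def describe(*args, **kwargs):  # hypothetical function, for illustration
  # extra positional arguments are collected into a tuple
  print(type(args).__name__, args)
  # extra keyword arguments are collected into a dictionary
  print(type(kwargs).__name__, kwargs)
  return args, kwargs

args, kwargs = describe(1, 2, name="Friend")
# tuple (1, 2)
# dict {'name': 'Friend'}

# The same prefixes unpack containers at the call site
positional = (1, 2)
named = {"name": "Friend"}
describe(*positional, **named)  # same as describe(1, 2, name="Friend")
```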

## Modules and `import`

In R, authors can bundle their code into shareable extensions called R packages, and R users can access objects from R packages via `library()` or `::`. In Python, authors bundle code into modules, and users access modules using `import`. Consider the line:

In [None]:
import numpy

This statement has Python go out to the file system, find an installed Python module named ‘numpy’, load it (commonly meaning: evaluate its `__init__.py` file and construct an object of type `module`), and bind it to the symbol `numpy`.

The closest equivalent to this in R might be:
```r
dplyr <- loadNamespace("dplyr")
```

#### Where are modules found?

In Python, the file system locations where modules are searched can be accessed (and modified) from the list found at `sys.path`. This is Python’s equivalent to R’s `.libPaths()`. `sys.path` will typically contain paths to the current working directory, the Python installation which contains the built-in standard library, administrator installed modules, user installed modules, values from environment variables like `PYTHONPATH`, and any modifications made directly to `sys.path` by other code in the current Python session (though this is relatively uncommon in practice).

In [None]:
import sys
sys.path

You can inspect where a module was loaded from by accessing the dunder `__path__` or `__file__` (especially useful when troubleshooting installation issues):

In [None]:
import os
os.__file__

In [None]:
numpy.__path__

Once a module is loaded, you can access symbols from the module using `.` (equivalent to `::`, or maybe `$.environment`, in R).

In [None]:
numpy.abs(-1)

There is also special syntax for specifying the symbol a module is bound to upon import, and for importing only some specific symbols.

```python
import numpy        # import
import numpy as np  # import and bind to a custom symbol `np`
from numpy import abs # import only `numpy.abs`, bind it to `abs`
from numpy import abs as abs2 # import only `numpy.abs`, bind it to `abs2`
```

If you’re looking for the Python equivalent of R’s `library()`, which makes all of a package’s exported symbols available, it might be using `import` with a `*` wildcard, though it’s relatively uncommon to do so. The `*` wildcard will expand to include all the symbols in the module whose names do not begin with an underscore, or all the symbols listed in `__all__`, if it is defined.

```python
from numpy import *
```

Python doesn’t make a distinction like R does between package exported and internal symbols. In Python, all module symbols are equal, though there is the naming convention that intended-to-be-internal symbols are prefixed with a single leading underscore. (Two leading underscores invoke an advanced language feature called “name mangling”, which is outside the scope of this introduction.)

## Integers and Floats

R users generally don’t need to be aware of the difference between integers and floating point numbers, but that’s not the case in Python. If this is your first exposure to numeric data types, here are the essentials:

- integer types can only represent whole numbers like `1` or `2`, not floating point numbers like `1.2`.

- floating-point types can represent fractional numbers as well as whole numbers, but only to a finite precision.

In R, writing a bare literal number like `12` produces a floating point type, whereas in Python, it produces an integer. You can produce an integer literal in R by appending an `L`, as in `12L`. Many Python functions expect integers, and will error when provided a float.
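
A brief sketch of how the distinction shows up in everyday Python, using the built-in `range()` as one example of a function that insists on integers:

```python
print(type(12))          # <class 'int'>
print(type(12.0))        # <class 'float'>

# Division with / always returns a float; // is floor division
print(7 / 2)             # 3.5
print(6 / 2)             # 3.0
print(7 // 2)            # 3

# Floats are stored with finite precision
print(0.1 + 0.2 == 0.3)  # False

# range() requires an integer and raises TypeError for a float
try:
  range(3.0)
except TypeError as e:
  print("TypeError:", e)

# Convert explicitly when the value is known to be a whole number
n = int(3.0)
print(list(range(n)))    # [0, 1, 2]
```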

## What about R vectors?

R is a language designed for numerical computing first. Numeric vector data types are baked deep into the R language, to the point that the language doesn’t even distinguish scalars from vectors. By comparison, numerical computing capabilities in Python are generally provided by third party packages (modules, in Python parlance).

In Python, the `numpy` module is most commonly used to handle contiguous arrays of data. The closest equivalent to an R numeric vector is a numpy array, or sometimes, a list of scalar numbers (some Pythonistas might argue for array.array() here, but that’s so rarely encountered in actual Python code we don’t mention it further).

Teaching the NumPy interface is beyond the scope of this primer, but it’s worth pointing out some potential tripping hazards for users accustomed to R arrays:

- When indexing into multidimensional numpy arrays, trailing dimensions can be omitted and are implicitly treated as missing. The consequence is that iterating over arrays means iterating over the first dimension. For example, this iterates over the rows of a matrix.


In [None]:
import numpy as np
m = np.arange(12).reshape((3,4))
m

In [None]:
m[0, :] # first row

In [None]:
m[0]    # also first row

In [None]:
for row in m:
  print(row)

- Many numpy operations modify the array in place! This is surprising to R users, who are used to the convenience and safety of R’s copy-on-modify semantics. Unfortunately, there is no simple scheme or naming convention you can rely on to quickly determine if a particular method modifies in-place or creates a new array copy. The only reliable way is to consult the [documentation](https://numpy.org/doc/stable/reference/index.html#reference).
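
As one illustration of the in-place hazard, the sketch below contrasts `np.sort()`, which returns a new array, with the `.sort()` method, which sorts in place, and shows that a basic slice is a view onto the same data. It assumes numpy is installed; the array contents are arbitrary.

```python
import numpy as np

a = np.array([3, 1, 2])

sorted_copy = np.sort(a)  # function form: returns a new, sorted array
print(a)                  # [3 1 2]
print(sorted_copy)        # [1 2 3]

a.sort()                  # method form: sorts `a` in place, returns None
print(a)                  # [1 2 3]

# A basic slice is a view onto the same data, not a copy
b = a[:2]
b[0] = 100
print(a.tolist())         # [100, 2, 3]

# Use .copy() when an independent array is needed
c = a.copy()
c[0] = -1
print(a.tolist())         # [100, 2, 3]
```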

This is a shorter version of the primer available [here](https://rstudio.github.io/reticulate/articles/python_primer.html).