## Arithmetic, Function Application, Mapping with pandas
### Working with pandas
*Curtis Miller*

Here we will see several examples of concepts discussed in the slides.

### `Series` Arithmetic
Let's first set up.

In [1]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np

srs1 = Series([1, 9, -4, 3, 3])
srs2 = Series([2, 3, 4, 5, 10], index=[0, 1, 2, 3, 5])
print(srs1)

0    1
1    9
2   -4
3    3
4    3
dtype: int64


In [2]:
print(srs2)

0     2
1     3
2     4
3     5
5    10
dtype: int64


Notice that the indices do not line up, even though the `Series` are of the same length.

Predict the outcomes:

In [3]:
srs1 + srs2

0     3.0
1    12.0
2     0.0
3     8.0
4     NaN
5     NaN
dtype: float64

In [4]:
srs1 * srs2

0     2.0
1    27.0
2   -16.0
3    15.0
4     NaN
5     NaN
dtype: float64

In [5]:
srs1 ** srs2

0      1.0
1    729.0
2    256.0
3    243.0
4      NaN
5      NaN
dtype: float64
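
The `NaN` values above arise because arithmetic aligns on index labels, and labels 4 and 5 each appear in only one of the two `Series`. A minimal sketch, assuming you would rather treat a missing label as 0 than get `NaN`, uses the `add` method with its `fill_value` argument:

```python
import pandas as pd

srs1 = pd.Series([1, 9, -4, 3, 3])
srs2 = pd.Series([2, 3, 4, 5, 10], index=[0, 1, 2, 3, 5])

# A label present in only one Series is filled with 0 before adding,
# so labels 4 and 5 no longer produce NaN.
filled_sum = srs1.add(srs2, fill_value=0)
print(filled_sum)
```

Whether 0 is a sensible fill depends on the data; for multiplication, for instance, 1 may be the more natural choice.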

In [9]:
# Comparisons are different: they do not align, so the indices must match exactly
srs1 > srs2

ValueError: Can only compare identically-labeled Series objects

In [8]:
srs1 <= srs2    # Opposite of above

ValueError: Can only compare identically-labeled Series objects

In [10]:
srs1 > Series([1, 2, 3, 4, 5], index = [4, 3, 2, 1, 0])

ValueError: Can only compare identically-labeled Series objects
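
Comparison operators refuse to align for you, even when the two `Series` hold the same labels in a different order. One possible workaround, sketched here under the assumption that a label-by-label comparison is what you want, is to align the two objects explicitly first with `align`:

```python
import pandas as pd

srs1 = pd.Series([1, 9, -4, 3, 3])
other = pd.Series([1, 2, 3, 4, 5], index=[4, 3, 2, 1, 0])

# align() reindexes both Series onto the union of their labels,
# so the two results share an identical index and can be compared.
left, right = srs1.align(other)
aligned_gt = left > right
print(aligned_gt)
```

If the label sets differ, `align` introduces `NaN`, and any comparison against `NaN` evaluates to `False`.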

In [11]:
np.sqrt(srs2)

0    1.414214
1    1.732051
2    2.000000
3    2.236068
5    3.162278
dtype: float64

In [12]:
np.abs(srs1)

0    1
1    9
2    4
3    3
4    3
dtype: int64

In [13]:
type(np.abs(srs1))

pandas.core.series.Series

In [14]:
# Define a custom ufunc: notice the decorator notation?
@np.vectorize
def trunc(x):
    return x if x > 0 else 0

trunc(np.array([-1, 5, 4, -3, 0]))

array([0, 5, 4, 0, 0])

In [15]:
trunc(srs1)

array([1, 9, 0, 3, 3], dtype=int64)

In [16]:
type(trunc(srs1))

numpy.ndarray
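
Because a function built with `np.vectorize` hands back a plain `ndarray`, the index is lost. A short sketch of two ways to keep it, assuming the goal is simply to replace negative values with 0: rewrap the array with the original index, or use the built-in `clip` method.

```python
import numpy as np
import pandas as pd

@np.vectorize
def trunc(x):
    return x if x > 0 else 0

srs1 = pd.Series([1, 9, -4, 3, 3])

# Option 1: rebuild a Series around the ndarray, reusing the original index
rewrapped = pd.Series(trunc(srs1), index=srs1.index)

# Option 2: clip() is a Series method, so it returns a Series directly
clipped = srs1.clip(lower=0)

print(rewrapped)
print(clipped)
```

The second option also avoids the Python-level loop that `np.vectorize` runs internally.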

### `Series` Methods and Function Application
Having seen basic arithmetic with Series, let's look at useful Series methods.

In [17]:
# Mean of a series
srs1.mean()

2.4

In [18]:
srs1.std()

4.669047011971501

In [19]:
srs1.max()

9

In [20]:
srs1.argmax()   # Returns the position of the maximum (idxmax() returns its index label)

1
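
With the default integer index, position and label coincide, which hides the difference between `argmax` and `idxmax`. A small sketch with a hypothetical string-labeled `Series` (not part of the examples above), assuming a recent pandas release (1.0 or later):

```python
import pandas as pd

# Hypothetical data with non-integer labels
labeled = pd.Series([1, 9, -4], index=["a", "b", "c"])

position = labeled.argmax()   # integer position of the maximum
label = labeled.idxmax()      # index label of the maximum
print(position, label)
```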

In [21]:
srs1.cumsum()

0     1
1    10
2     6
3     9
4    12
dtype: int64

In [22]:
srs1.abs()    # An alternative to the abs function in NumPy

0    1
1    9
2    4
3    3
4    3
dtype: int64

Now let's look at function application and mapping.

In [23]:
srs1.apply(lambda x: x if x > 2 else 2)

0    2
1    9
2    2
3    3
4    3
dtype: int64

In [24]:
srs3 = Series(['alpha', 'beta', 'gamma', 'delta'], index = ['a', 'b', 'c', 'd'])
print(srs3)

a    alpha
b     beta
c    gamma
d    delta
dtype: object


In [25]:
obj = {"alpha": 1, "beta": 2, "gamma": -1, "delta": -3}
srs3.map(obj)

a    1
b    2
c   -1
d   -3
dtype: int64

In [None]:
srs4 = Series(obj)
print(srs4)

In [None]:
srs3.map(srs4)

In [None]:
srs1.map(lambda x: x if x > 2 else 2)    # Works like apply
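
One detail worth knowing when mapping with a dictionary or a `Series`: values with no matching key become `NaN`. A minimal sketch, using a hypothetical mapping that deliberately omits two of the keys:

```python
import pandas as pd

srs3 = pd.Series(["alpha", "beta", "gamma", "delta"], index=["a", "b", "c", "d"])

# Hypothetical partial mapping: "gamma" and "delta" are missing
partial = {"alpha": 1, "beta": 2}

mapped = srs3.map(partial)
print(mapped)   # entries without a key in the mapping come back as NaN
```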

### `DataFrame`s
Many of the tricks that work with `Series` also work with `DataFrame`s, with some added complications.

In [26]:
df = DataFrame(np.arange(15).reshape(5, 3), columns=["AAA", "BBB", "CCC"])
print(df)

   AAA  BBB  CCC
0    0    1    2
1    3    4    5
2    6    7    8
3    9   10   11
4   12   13   14


In [27]:
# Should get 0's, and CCC gets NaN because no match
df - df.loc[:,["AAA", "BBB"]]

   AAA  BBB  CCC
0    0    0  NaN
1    0    0  NaN
2    0    0  NaN
3    0    0  NaN
4    0    0  NaN
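
As with `Series`, the method form of the operator accepts a `fill_value`. The following is a sketch, assuming you want a column that is missing from one operand to be treated as 0 rather than propagated as `NaN`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=["AAA", "BBB", "CCC"])

# CCC does not exist in the right-hand frame; fill_value=0 stands in for it,
# so CCC keeps its original values instead of becoming NaN.
diff = df.sub(df.loc[:, ["AAA", "BBB"]], fill_value=0)
print(diff)
```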


In [28]:
df.mean()

AAA    6.0
BBB    7.0
CCC    8.0
dtype: float64

In [29]:
df.std()

AAA    4.743416
BBB    4.743416
CCC    4.743416
dtype: float64

In [30]:
# This is known as standardization
(df - df.mean())/df.std()

        AAA       BBB       CCC
0 -1.264911 -1.264911 -1.264911
1 -0.632456 -0.632456 -0.632456
2  0.000000  0.000000  0.000000
3  0.632456  0.632456  0.632456
4  1.264911  1.264911  1.264911
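
Standardization works here because `df.mean()` returns a `Series` indexed by column name, and pandas matches that index against the columns of `df`. Standardizing each row instead needs the `axis` argument of the method forms. A sketch for comparison (row-wise standardization is shown only to illustrate the alignment, not because it is meaningful for this data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=["AAA", "BBB", "CCC"])

# Column-wise: the statistics are indexed by column name and match df's columns
col_z = (df - df.mean()) / df.std()

# Row-wise: the statistics are indexed by row label, so axis=0 tells pandas
# to match them against the row index instead of the columns
row_z = df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1), axis=0)

print(col_z)
print(row_z)
```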


Let's now look at vectorization.

In [31]:
np.sqrt(df)

        AAA       BBB       CCC
0  0.000000  1.000000  1.414214
1  1.732051  2.000000  2.236068
2  2.449490  2.645751  2.828427
3  3.000000  3.162278  3.316625
4  3.464102  3.605551  3.741657


In [32]:
# trunc is a custom ufunc: does not give a DataFrame
trunc(df)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
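
If the row and column labels matter, one alternative to a `np.vectorize` function is the `where` method, which stays inside pandas. This sketch uses a hypothetical shifted copy of the data (so that some entries are negative) and assumes the same truncate-at-zero rule as `trunc`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=["AAA", "BBB", "CCC"])
shifted = df - 7   # hypothetical data: values now range from -7 to 7

# where() keeps entries for which the condition holds and substitutes 0 elsewhere;
# the result is still a DataFrame with the original labels
truncated = shifted.where(shifted > 0, 0)
print(truncated)
```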

In [33]:
# Mixed data
df2 = DataFrame({"AAA": [1, 2, 3, 4], "BBB": [0, -9, 9, 3], "CCC": ["Bob", "Terry", "Matt", "Simon"]})
print(df2)

   AAA  BBB    CCC
0    1    0    Bob
1    2   -9  Terry
2    3    9   Matt
3    4    3  Simon


In [34]:
# Produces an error
np.sqrt(df2)

AttributeError: 'int' object has no attribute 'sqrt'

In [35]:
# Let's select JUST the numeric columns
# The select_dtypes() method selects columns based on their dtype,
# and np.number matches all numeric dtypes
df2.select_dtypes([np.number])

   AAA  BBB
0    1    0
1    2   -9
2    3    9
3    4    3


In [37]:
np.sqrt(df2.select_dtypes([np.number]))

        AAA       BBB
0  1.000000  0.000000
1  1.414214       NaN
2  1.732051  3.000000
3  2.000000  1.732051
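
The missing entry in `BBB` comes from taking the square root of -9. If you prefer to make that explicit, one option (a sketch, not the only approach) is to mask the negative entries before calling `np.sqrt`:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"AAA": [1, 2, 3, 4],
                    "BBB": [0, -9, 9, 3],
                    "CCC": ["Bob", "Terry", "Matt", "Simon"]})

numeric = df2.select_dtypes([np.number])

# where() keeps non-negative entries and replaces the rest with NaN,
# so np.sqrt never sees a negative number
roots = np.sqrt(numeric.where(numeric >= 0))
print(roots)
```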


Now a brief look at function application. Here we work with a function that computes the geometric mean, which is defined as:

$$\text{geometric mean} = \left(\prod_{i = 1}^n x_i\right)^{\frac{1}{n}}$$

In [38]:
# Define a function for the geometric mean
def geomean(srs):
    return srs.prod() ** (1 / len(srs))   # prod method is product of all elements of srs

# Demo
geomean(Series([2, 3, 4]))

2.8844991406148166

In [None]:
df.apply(geomean)

In [None]:
df.apply(geomean, axis='columns')
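
The two calls above differ only in the direction of application. A short sketch that makes the difference visible by inspecting the index of each result:

```python
import numpy as np
import pandas as pd

def geomean(srs):
    # n-th root of the product of the n elements
    return srs.prod() ** (1 / len(srs))

df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=["AAA", "BBB", "CCC"])

by_column = df.apply(geomean)                  # default: one result per column
by_row = df.apply(geomean, axis="columns")     # one result per row

print(by_column.index.tolist())   # the column names
print(by_row.index.tolist())      # the row labels
```

Note that any column or row containing a 0 has a geometric mean of 0, which is the case for `AAA` and for the first row here.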

In [None]:
# Apply a truncation function to each element of df
# (pandas 2.1 and later deprecate applymap in favor of DataFrame.map)
df.applymap(lambda x: x if x > 3 else 3)
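
Since `applymap` is deprecated in newer pandas releases in favor of `DataFrame.map`, a version-tolerant sketch might pick whichever method is available. For this particular rule, the built-in `clip` method gives the same result without calling a Python function per element:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=["AAA", "BBB", "CCC"])

# Assumption: DataFrame.map is present in pandas 2.1+; older versions only have applymap
elementwise = df.map if hasattr(df, "map") else df.applymap
floored = elementwise(lambda x: x if x > 3 else 3)

# Vectorized equivalent for this specific rule
clipped = df.clip(lower=3)

print(floored)
print(clipped)
```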