Learning about .apply() is fundamental to the data cleaning process. It also encapsulates a key programming concept: writing functions. The .apply() method takes a function and applies it (i.e., runs it) across each row or column of a DataFrame, so you do not have to write the code for each element separately.

# Primer on Functions

In [1]:
import pandas as pd
df = pd.DataFrame({"a": [10, 20, 30], "b": [20, 30, 40]})
print(df)

    a   b
0  10  20
1  20  30
2  30  40


In [2]:
def print_me(x):
  print(x)

In [3]:
df.apply(print_me, axis=0)

0    10
1    20
2    30
Name: a, dtype: int64
0    20
1    30
2    40
Name: b, dtype: int64


a    None
b    None
dtype: object

In [4]:
def avg_3(x, y, z):
   return (x + y + z) / 3

In [5]:
# will cause an error
print(df.apply(avg_3))

TypeError: avg_3() missing 2 required positional arguments: 'y' and 'z'

In [7]:
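def avg_3_apply(col):
   """Take the average of the values in a column.
   Assumes that there are exactly 3 values in the column.
   """
   # the whole column arrives as one Series, so unpack it here
   x = col[0]
   y = col[1]
   z = col[2]
   return (x + y + z) / 3

print(df.apply(avg_3_apply))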

a    20.0
b    30.0
dtype: float64


In [8]:
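def avg_2_apply(row):
  """Take the average of the values in a row.
  Assumes that there are only 2 values in a row.
  """
  # each row is a Series indexed by column name, so select by position
  x = row.iloc[0]
  y = row.iloc[1]
  return (x + y) / 2

# axis=1 passes each row (rather than each column) to the function
print(df.apply(avg_2_apply, axis=1))

0    15.0
1    25.0
2    35.0
dtype: float64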
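
The difference between the two directions is easy to mix up, so here is a minimal sketch that applies one function both ways to the same DataFrame. The helper name value_range is hypothetical and is not part of the original example; it only needs to work on a whole Series.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [20, 30, 40]})

def value_range(values):
    """Hypothetical helper: largest value minus smallest value."""
    return values.max() - values.min()

# axis=0 hands each column to the function: one result per column
by_column = df.apply(value_range, axis=0)

# axis=1 hands each row to the function: one result per row
by_row = df.apply(value_range, axis=1)

print(by_column)
print(by_row)
```

The column-wise result is indexed by column name (a, b), while the row-wise result is indexed by the original row labels (0, 1, 2).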


# Vectorized Functions

When we use .apply(), we can make a function work on a column-by-column or row-by-row basis. In the previous section (Section 5.2), we had to rewrite our function before we could apply it, because the entire column or row was passed into the first parameter of the function. However, there may be times when it is not feasible to rewrite a function in this way. In those cases we can use np.vectorize(), either as a function or as a decorator, to vectorize any function so that it is called once per element. Vectorizing your code can also lead to performance gains, although np.vectorize() itself is provided mainly for convenience: the NumPy documentation notes that it is essentially a for loop.

In [9]:
df = pd.DataFrame({"a": [10, 20, 30], "b": [20, 30, 40]})
print(df)

    a   b
0  10  20
1  20  30
2  30  40


In [10]:
def avg_2(x, y):
  return (x + y) / 2

In [11]:
print(avg_2(df['a'], df['b']))

0    15.0
1    25.0
2    35.0
dtype: float64


In [12]:
import numpy as np

def avg_2_mod(x, y):
   """Calculate the average, unless x is 20
   If the value is 20, return a missing value
   """
   if x == 20:
      # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
      return np.nan
   else:
      return (x + y) / 2

In [13]:
# will cause an error
print(avg_2_mod(df['a'], df['b']))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [14]:
print(avg_2_mod(10, 20))

15.0
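
To see why the Series call fails while the scalar call succeeds, the following sketch isolates the comparison inside the if statement. It is only an illustration of the error message above; the variable names mask and raised are introduced here and are not part of the original function.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [20, 30, 40]})

# comparing a whole Series to a scalar gives one boolean per element
mask = df["a"] == 20
print(mask)

# an if statement needs exactly one True/False, so a Series raises ValueError
try:
    if mask:
        pass
    raised = False
except ValueError:
    raised = True
print(raised)

# reducing the mask to a single value is unambiguous
print(mask.any(), mask.all())
```

This is why the function has to be called once per element (which is what vectorizing arranges) rather than once on the whole column.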


# Vectorize with NumPy

In [15]:


# np.vectorize actually creates a new function
avg_2_mod_vec = np.vectorize(avg_2_mod)

# use the newly vectorized function
print(avg_2_mod_vec(df['a'], df['b']))

[15. nan 35.]


This method works well if you do not have the source code for an existing function. However, if you are writing your own function, you can use a Python decorator to vectorize it directly, without having to create a second, separately named function. A decorator is a function that takes another function as input and modifies how that function behaves.
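
As a minimal sketch of the decorator mechanism on its own, independent of NumPy, the example below defines a small decorator and applies it in two equivalent ways. The names announce, add, add_plain, and add_manual are hypothetical and exist only for illustration.

```python
def announce(func):
    """Hypothetical decorator: wraps func and labels its return value."""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        return f"{func.__name__} returned {result}"
    return wrapper

# the @ syntax passes the function through the decorator at definition time
@announce
def add(x, y):
    return x + y

def add_plain(x, y):
    return x + y

# the same thing done by hand, which creates a second name
add_manual = announce(add_plain)

print(add(1, 2))
print(add_manual(1, 2))
```

Applying np.vectorize with the @ symbol follows the same pattern: the decorated name refers to the wrapped function, so no separately named copy is needed.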

In [16]:
# to use the vectorize decorator
# we use the @ symbol before our function definition
@np.vectorize
def v_avg_2_mod(x, y):
  """Calculate the average, unless x is 20
  Same as before, but we are using the vectorize decorator
  """
  if x == 20:
    # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
    return np.nan
  else:
    return (x + y) / 2

# we can then directly use the vectorized function
# without having to create a new function
print(v_avg_2_mod(df['a'], df['b']))

[15. nan 35.]
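
For comparison, when the condition is simple enough, the same result can sometimes be obtained without np.vectorize by working on whole columns. The sketch below uses Series.where, which is a different technique from the one shown in this section and is offered only as an alternative under that assumption.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [20, 30, 40]})

# compute the average for every row at once
avg = (df["a"] + df["b"]) / 2

# keep the average where a is not 20; other positions become NaN
result = avg.where(df["a"] != 20)
print(result)
```

Unlike np.vectorize, this returns a pandas Series that keeps the original index rather than a plain NumPy array.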


# Vectorize with Numba

In [17]:
import numba

@numba.vectorize
def v_avg_2_numba(x, y):
  """Calculate the average, unless x is 20
  Using the numba decorator.
  """
  # we now have to add type information to our function
  if int(x) == 20:
    # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
    return np.nan
  else:
    return (x + y) / 2

In [18]:
print(v_avg_2_numba(df['a'], df['b']))

0    15.0
1     NaN
2    35.0
dtype: float64


In [19]:
# passing in the numpy array
print(v_avg_2_numba(df['a'].values, df['b'].values))

[15. nan 35.]


# Lambda Functions (Anonymous Functions)

In [20]:
df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})
print(df)

    a   b
0  10  20
1  20  30
2  30  40


In [21]:
def my_sq(x):
  return x ** 2

df['a_sq'] = df['a'].apply(my_sq)
print(df)

    a   b  a_sq
0  10  20   100
1  20  30   400
2  30  40   900


In [22]:
df['a_sq_lamb'] = df['a'].apply(lambda x: x ** 2)
print(df)

    a   b  a_sq  a_sq_lamb
0  10  20   100        100
1  20  30   400        400
2  30  40   900        900
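
A lambda can also be passed to DataFrame.apply rather than to a single column. The sketch below recomputes the row average from the earlier sections this way; the column name avg_lamb is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [20, 30, 40]})

# axis=1 passes each row to the lambda as a Series indexed by column name
df["avg_lamb"] = df.apply(lambda row: (row["a"] + row["b"]) / 2, axis=1)
print(df)
```

Lambdas suit short, one-off expressions like this one; a longer calculation is usually easier to read as a named function with a docstring.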
