# <font color=#14F278>Unit 5 - Element-Wise Operations</font>
---

## <font color=#14F278>1. Element-Wise Operations - Definition:</font>
An <font color=#14F278>**Element-Wise Operation**</font> is an operation that can be easily distributed over the elements of a data container. One of the essential advantages of Pandas is its ability to perform quick element-wise operations.

<font color=#14F278>**Universal Functions (ufuncs)**</font> are a subset of element-wise operations: they take an array of data and operate in an <font color=#14F278>**element-by-element**</font> fashion, producing another array of data. Pandas leverages **NumPy**'s universal functions, which include <font color=#14F278>**addition, subtraction, multiplication, logs, exponentials,**</font> and many others. Important features of **ufuncs**:
- index-aligned operations
- fast (vectorized) - prefer them over custom functions whenever one exists
- available ufuncs can be found here: https://docs.scipy.org/doc/numpy/reference/ufuncs.html

In [1]:
import pandas as pd
import numpy as np
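As a minimal sketch of the features listed above (the values and labels are made up for illustration), a NumPy ufunc such as `np.sqrt` can be called directly on a Series. The result is a new Series that keeps the original index:

```python
import numpy as np
import pandas as pd

# Illustrative data, not part of the course dataset
s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

# np.sqrt is a NumPy ufunc: it is applied to every element at once
roots = np.sqrt(s)

print(roots)                  # values 1.0, 2.0, 3.0 labelled 'a', 'b', 'c'
print(type(roots).__name__)   # Series
```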

---
## <font color=#14F278>2. The Reshape() Function:</font>

Before exploring element-wise operations, let's take a look at a quick way of generating DataFrames. To do this, we will use the `reshape()` method of NumPy arrays:
- it is called on an <font color=#14F278>**array**</font> of data (arrays are NumPy objects, created with the `np.array()` function)
- it <font color=#14F278>**reshapes**</font> the array into the requested number of rows and columns; passing `-1` for one dimension lets NumPy infer it from the array's length

<center>
    <div>
        <img src="..\images\elementwise_001.png"/>
    </div>
</center>
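A small sketch of how `reshape()` behaves, using an illustrative six-element array (the variable names are placeholders, not part of the notebook):

```python
import numpy as np

flat = np.arange(6)                  # array([0, 1, 2, 3, 4, 5])

two_by_three = flat.reshape(2, 3)    # 2 rows, 3 columns
inferred = flat.reshape(-1, 2)       # -1 lets NumPy infer the row count (3 here)

print(two_by_three.shape)   # (2, 3)
print(inferred.shape)       # (3, 2)
```

The number of elements has to fit the requested shape exactly; otherwise `reshape()` raises a `ValueError`.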


In [8]:
# Create a function that generates a dataframe using the .reshape() function
def make_df2():
    data = np.array(range(10)).reshape(-1,2)
    return pd.DataFrame(data, columns=['col1', 'col2'])

In [9]:
# Create a dataframe using the above function
df = make_df2()
display(df)

Unnamed: 0,col1,col2
0,0,1
1,2,3
2,4,5
3,6,7
4,8,9


In [7]:
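# Variant: np.resize() repeats the input from the start to reach the requested length,
# so the 9 values 0..8 are padded to 10 elements (the last one wraps around to 0)
def make_df2():
    data = np.resize(np.array(range(9)), 10).reshape(5,2)
    return pd.DataFrame(data, columns=['col1', 'col2'])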

df = make_df2()
display(df)

Unnamed: 0,col1,col2
0,0,1
1,2,3
2,4,5
3,6,7
4,8,0


---
## <font color=#14F278>3. Element-Wise Operations on Series:</font>

A main characteristic of element-wise operations (and universal functions) is that they operate in an <font color=#14F278>**index-aligned fashion**</font>. Series are 1-dimensional objects, so the operation is applied along this dimension. Here we will try some basic math functions - note how each operation is applied to the pair of elements that share the same index label:
- <font color=#14F278>**addition**</font>: use the `+` operator or the `.add()` method
- <font color=#14F278>**multiplication**</font>: use the `*` operator or the `.multiply()` method (also available as `.mul()`)

In [10]:
# Add a constant to a Series
df['col1'] + 9000

0    9000
1    9002
2    9004
3    9006
4    9008
Name: col1, dtype: int32

In [11]:
# Add two Series together
df['col1'] + df['col2']

0     1
1     5
2     9
3    13
4    17
dtype: int32

In [12]:
# Alternatively, use the .add() method
df['col1'].add(df['col2'])

0     1
1     5
2     9
3    13
4    17
dtype: int32

In [13]:
# Multiply two Series together
df['col1']*df['col2']

0     0
1     6
2    20
3    42
4    72
dtype: int32

In [14]:
# Alternatively, use the .multiply() method
df['col1'].multiply(df['col2'])

0     0
1     6
2    20
3    42
4    72
dtype: int32

<font color=#FF8181>**Important:**</font> All of the above operations <font color=#FF8181>**produce a new Series object**</font>. If you want to keep this object, store it in a variable (or in a new column of a dataframe):
- `s = df['col1'] + df['col2']` to store in a **stand-alone** variable
- `df['col3'] = df['col1'] + df['col2']` to store as a **column** in the original dataframe

In [None]:
# Let's store the sum of the two columns in a third column 'col3'
df['col3'] = df['col1'] + df['col2']
display(df)

---
## <font color=#14F278>4. Element-Wise Operations on DataFrames:</font>

Element-wise operations on DataFrames work in the same way as on Series. The application is again index-aligned: values are matched on both their row and column labels. An operation between 2 dataframes with different labels returns a dataframe that covers the union of their rows and columns, with `NaN` wherever a label is missing from one of the operands.

In [15]:
df = make_df2()
display(df)

Unnamed: 0,col1,col2
0,0,1
1,2,3
2,4,5
3,6,7
4,8,9


In [16]:
# DataFrame + scalar -- the scalar is added to all elements of the df
df + 9000

Unnamed: 0,col1,col2
0,9000,9001
1,9002,9003
2,9004,9005
3,9006,9007
4,9008,9009


In [17]:
# DataFrame + List -- operation called broadcasting
# Adds 1 to all elements in col1 and 2 to all elements in col2
df + [1,2] 

Unnamed: 0,col1,col2
0,1,3
1,3,5
2,5,7
3,7,9
4,9,11
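The list in the cell above is matched against the columns. As a sketch of the other direction, the `.add()` method with `axis=0` matches a Series against the row labels instead (the `per_row` values are a hypothetical example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['col1', 'col2'])

# One value per row; the Series index (0..4) matches the DataFrame's row index
per_row = pd.Series([10, 20, 30, 40, 50])

# axis=0 aligns the Series with the row labels instead of the column labels
result = df.add(per_row, axis=0)
print(result)
```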


In [18]:
# DataFrame + DataFrame -- sums fields of 1st df with their counterpart from 2nd df
df + df

Unnamed: 0,col1,col2
0,0,2
1,4,6
2,8,10
3,12,14
4,16,18


---
## <font color=#14F278>5. Lambda Functions:</font>

<font color=#14F278>**Lambda Functions**</font> (also known as **Anonymous Functions**) are a short-hand notation for function definitions, when the body of the function consists of <font color=#14F278>**a single expression**</font>, whose value is returned automatically. Lambda functions are great for defining simple functions that are needed only once. They are also handy for <font color=#14F278>**passing**</font> to other functions that require a function as an argument.

Syntax:
`lambda x: <expression>`
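As an illustration of passing a lambda to a function that expects a function argument, here is a short sketch using the built-ins `sorted()` and `map()` (the list of names is only sample data):

```python
clients = ['McDonald', 'White', 'Adams', 'Hirani']

# sorted() accepts a key function; the lambda is defined inline and used once
by_length = sorted(clients, key=lambda name: len(name))
print(by_length)   # ['White', 'Adams', 'Hirani', 'McDonald']

# map() applies the lambda to every item of the list
initials = list(map(lambda name: name[0], clients))
print(initials)    # ['M', 'W', 'A', 'H']
```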

In [19]:
# A world without lambda functions:
def my_func(x):
    return x**3

my_func(4)

64

In [20]:
# The above can be written as a lambda function:
my_func2 = lambda x: x**3
my_func2(4)

64

In [21]:
# Lambda functions become powerful when their expression uses other short-hand notations, such as the ternary operator
# ternary operator = a one-line if-else expression: <value_if_true> if <condition> else <value_if_false>

# A world without lambda functions:
def my_func3(x,y):
    if x>=y:
        return x
    else:
        return y

my_func3(5,10)

10

In [None]:
# The above can be written as a lambda function:
my_func4 = lambda x,y: x if x>=y else y
my_func4(5,10)

In [25]:
# A lambda that returns the first character of its argument as a string
my_lambda = lambda name: str(name[0])
my_lambda('alskjdhsjdklss')

'a'

---
## <font color=#14F278>6. The `apply()` Method:</font>

The `.apply()` method is used to <font color=#14F278>**apply an element-wise function**</font> to a <font color=#14F278>**Series**</font> or <font color=#14F278>**along one of the axes of a DataFrame**</font>.

Syntax:
- <font color=#14F278>**On a Series**</font>: `series_name.apply(function)`
- <font color=#14F278>**On a DataFrame**</font>: `df_name.apply(function, axis = ...)` where `axis = 0 (default)` or `axis = 1`

Since __DataFrames__ are 2-dimensional, we need to instruct the `.apply()` method on which axis to perform the operation:
- <font color=#14F278>**axis = 1**</font> - the function is applied to each row, i.e. **across the columns** - the result has one value per row, so it has the shape of a **column**
- <font color=#14F278>**axis = 0**</font> - the function is applied to each column, i.e. **across the rows** - the result has one value per column, so it has the shape of a **row**

<font color=#FF8181>**Important:**</font> Note that the `apply()` method expects a **function** as an argument. This can either be **pre-defined** in the conventional way, or we can define it on the spot as a **lambda function** (if possible)!
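A minimal sketch of the two options on a Series, assuming a hypothetical helper named `label_parity` (it is not part of the notebook):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

def label_parity(x):
    """Return 'even' or 'odd' for a single value."""
    return 'even' if x % 2 == 0 else 'odd'

# Pass the function object itself (no parentheses after the name)
named = s.apply(label_parity)

# The same logic defined on the spot as a lambda
inline = s.apply(lambda x: 'even' if x % 2 == 0 else 'odd')

print(named.tolist())         # ['odd', 'even', 'odd', 'even']
print(named.equals(inline))   # True
```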


---
### <font color=#14F278>6.1 `apply()` on Series:</font>

In [26]:
# Setting up our data
df = make_df2()
display(df)

Unnamed: 0,col1,col2
0,0,1
1,2,3
2,4,5
3,6,7
4,8,9


In [27]:
# When there is no suitable ufunc, we can use the .apply() method to run our own element-wise function
# (recall lambda functions)

# Here we are checking which values in 'col2' are divisible by 3
df['divisible_by_3'] = df['col2'].apply(lambda x: 'yes' if x%3==0 else 'no') # assigning to the name of an existing column would overwrite it
display(df) 

Unnamed: 0,col1,col2,divisible_by_3
0,0,1,no
1,2,3,yes
2,4,5,no
3,6,7,no
4,8,9,yes


---
### <font color=#14F278>6.2 `apply()` on DataFrames:</font>

`.apply()` can be used in 3 different ways on DataFrames:
- we can apply a function to each individual element (cell) in the DataFrame
- we can apply a function across the rows of the dataframe, to produce a row
- we can apply a function across the columns of the dataframe, to produce a column

<center>
    <div>
        <img src="..\images\elementwise_002.png"/>
    </div>
</center>


<font color=#FF8181>**Warning:**</font> Always think about the object that the function argument takes, when applying `.apply()` on a particular axis!
- if you want to produce an output that looks like a **column**, your function takes a **row** as argument
- if you want to produce an output that looks like a **row**, your function takes a **column** as argument
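The following sketch makes the warning concrete by computing a range (max minus min) along each axis of the same small dataframe used in this unit:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['col1', 'col2'])

# axis=0: the function receives each column, giving one result per column
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)
print(col_ranges)   # indexed by 'col1', 'col2'

# axis=1: the function receives each row, giving one result per row
row_ranges = df.apply(lambda row: row.max() - row.min(), axis=1)
print(row_ranges)   # indexed by 0..4
```

In both cases the result is a Series: its index is the column names for `axis=0` and the row labels for `axis=1`.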

In [28]:
# Setting up our data
df = make_df2()
display(df)

Unnamed: 0,col1,col2
0,0,1
1,2,3
2,4,5
3,6,7
4,8,9


In [29]:
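# Applying a function to each element in our dataframe:
# .apply() passes each column (a Series) to the lambda, and x**3 is then computed element-wise on that column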
df.apply(lambda x: x**3)

Unnamed: 0,col1,col2
0,0,1
1,8,27
2,64,125
3,216,343
4,512,729
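Because `df.apply()` hands the function a whole column, a function that only works on a single value (for example one built on `if`/`else`) cannot be passed to it directly. A hedged sketch of a per-cell alternative follows. It assumes the DataFrame method is called `map` in newer pandas releases and `applymap` in older ones, and picks whichever is available:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['col1', 'col2'])

# Assumption: newer pandas exposes DataFrame.map, older versions use applymap.
# Choosing at runtime keeps this sketch portable across versions.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# The lambda now receives one cell value at a time
parity = elementwise(lambda v: 'even' if v % 2 == 0 else 'odd')
print(parity)
```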


In [30]:
# Using .apply with axis=0
df.apply(lambda col: sum(col), axis=0)

col1    20
col2    25
dtype: int64

In [31]:
# Using .apply method to compute df['col1'] + 2*df['col2'] with axis = 1
df.apply(lambda row: row['col1'] + 2*row['col2'], axis=1)

0     2
1     8
2    14
3    20
4    26
dtype: int32

Using `.apply()` with `axis=1` is very common when you want to <font color=#14F278>**produce a new custom column**</font>, which is not easily created via simple universal functions. When the logic behind the construction of this custom column is too complicated, it might even be difficult to use `apply()` together with a `lambda` function.

In such moments, you can simply <font color=#14F278>**define your own function**</font> and pass its name in the `.apply()` method. 

Example:
- Suppose we have a dataset of clients and their Risk Profile and Risk Budget
- Both Risk Profile and Risk Budget are numeric scores (1 to 5)
- We need to translate these numbers to categories - 'very low', 'low', 'medium', 'high', 'very high'
- Additionally we need to bin clients in 3 groups - 'RP>RB', 'RP=RB', 'RP<RB'

In [32]:
data = {'client':['Adams', 'White', 'McDonald', 'Hirani', 'Jackson'],
         'risk_profile':[3,5,1,3,2],
         'risk_budget':[4,4,3,5,2]}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,client,risk_profile,risk_budget
0,Adams,3,4
1,White,5,4
2,McDonald,1,3
3,Hirani,3,5
4,Jackson,2,2


In [33]:
# Define 2 functions that convert the numeric scores to risk categories

def rp_category(row):
    if row['risk_profile'] == 1:
        value = 'very low'
    elif row['risk_profile'] == 2:
        value = 'low'
    elif row['risk_profile'] == 3:
        value = 'medium'
    elif row['risk_profile'] == 4:
        value = 'high'
    elif row['risk_profile'] == 5:
        value = 'very high'
    else:
        value = 'NA'
    return value


def rb_category(row):
    if row['risk_budget'] == 1:
        value = 'very low'
    elif row['risk_budget'] == 2:
        value = 'low'
    elif row['risk_budget'] == 3:
        value = 'medium'
    elif row['risk_budget'] == 4:
        value = 'high'
    elif row['risk_budget'] == 5:
        value = 'very high'
    else:
        value = 'NA'
    return value

In [34]:
# Apply the functions to the dataframe to create 2 new columns

df['rp_cat'] = df.apply(rp_category, axis = 1) # axis = 1: the function receives one row at a time and the result is a column
df['rb_cat'] = df.apply(rb_category, axis = 1)
display(df)

Unnamed: 0,client,risk_profile,risk_budget,rp_cat,rb_cat
0,Adams,3,4,medium,high
1,White,5,4,very high,high
2,McDonald,1,3,very low,medium
3,Hirani,3,5,medium,very high
4,Jackson,2,2,low,low


We can make this more **concise** by adding the column name that needs transforming as an <font color=#14F278>**argument to the function definition**</font> - this allows us to <font color=#14F278>**refactor**</font> the above 2 function definitions into 1 function - `risk_category()`:

In [35]:
data = {'client':['Adams', 'White', 'McDonald', 'Hirani', 'Jackson'],
         'risk_profile':[3,5,1,3,2],
         'risk_budget':[4,4,3,5,2]}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,client,risk_profile,risk_budget
0,Adams,3,4
1,White,5,4
2,McDonald,1,3
3,Hirani,3,5
4,Jackson,2,2


In [36]:
# We can make this more concise by adding the column name that needs transforming as an argument
# to the function definition - this allows us to refactor the above 2 function definitions into 1

def risk_category(row, column_name):
    if row[column_name] == 1:
        value = 'very low'
    elif row[column_name] == 2:
        value = 'low'
    elif row[column_name] == 3:
        value = 'medium'
    elif row[column_name] == 4:
        value = 'high'
    elif row[column_name] == 5:
        value = 'very high'
    else:
        value = 'NA'
    return value

In [37]:
# pass the extra column_name argument value in the .apply() call
df['rp_cat'] = df.apply(risk_category, axis = 1, column_name = 'risk_profile')
df['rb_cat'] = df.apply(risk_category, axis = 1, column_name = 'risk_budget')
display(df)

Unnamed: 0,client,risk_profile,risk_budget,rp_cat,rb_cat
0,Adams,3,4,medium,high
1,White,5,4,very high,high
2,McDonald,1,3,very low,medium
3,Hirani,3,5,medium,very high
4,Jackson,2,2,low,low
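The `if`/`elif` chain in `risk_category()` is in effect a lookup from a number to a label. As an alternative sketch (not the approach used in this unit), the same columns can be built with a dictionary and `Series.map()`. The name `labels` is a placeholder chosen for this example:

```python
import pandas as pd

df = pd.DataFrame({'client': ['Adams', 'White', 'McDonald', 'Hirani', 'Jackson'],
                   'risk_profile': [3, 5, 1, 3, 2],
                   'risk_budget': [4, 4, 3, 5, 2]})

# Hypothetical lookup table mirroring the branches of risk_category()
labels = {1: 'very low', 2: 'low', 3: 'medium', 4: 'high', 5: 'very high'}

# Series.map looks each value up in the dict; values without a key become NaN,
# which fillna() replaces with the same 'NA' fallback used above
df['rp_cat'] = df['risk_profile'].map(labels).fillna('NA')
df['rb_cat'] = df['risk_budget'].map(labels).fillna('NA')
print(df)
```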


In [48]:
# Lastly, build a function that bins the clients by comparing Risk Profile to Risk Budget.
# Note: read the values through row[...] and remember to return the result.
def compare_risk(row, risk_budg, risk_prof):
    if row[risk_budg] > row[risk_prof]:
        value = 'non-risky'
    elif row[risk_budg] == row[risk_prof]:
        value = 'semi-risky'
    elif row[risk_budg] < row[risk_prof]:
        value = 'risky'
    else:
        # Reached only if the values cannot be compared (e.g. missing data)
        value = 'NA'
    return value

In [49]:
df['risk_summary'] = df.apply(compare_risk, axis = 1, risk_budg = 'risk_budget', risk_prof = 'risk_profile')
display(df)

Unnamed: 0,client,risk_profile,risk_budget,rp_cat,rb_cat,risk_summary
0,Adams,3,4,medium,high,non-risky
1,White,5,4,very high,high,risky
2,McDonald,1,3,very low,medium,non-risky
3,Hirani,3,5,medium,very high,non-risky
4,Jackson,2,2,low,low,semi-risky
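For comparison, the same binning can be sketched without `.apply()` by comparing whole columns at once and letting `np.select()` pick a label per row. This is an alternative under the assumption that the three labels above are the desired output. The `'unknown'` default is a placeholder added for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'client': ['Adams', 'White', 'McDonald', 'Hirani', 'Jackson'],
                   'risk_profile': [3, 5, 1, 3, 2],
                   'risk_budget': [4, 4, 3, 5, 2]})

# Each condition is a boolean Series computed for all rows at once
conditions = [
    df['risk_budget'] > df['risk_profile'],
    df['risk_budget'] == df['risk_profile'],
    df['risk_budget'] < df['risk_profile'],
]
choices = ['non-risky', 'semi-risky', 'risky']

# np.select takes the choice of the first condition that is True for each row
df['risk_summary'] = np.select(conditions, choices, default='unknown')
print(df)
```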


---
## <font color=#14F278> 7. Summary:</font>
- An __Element-Wise Operation__ is an operation that can easily be distributed over the elements of a data container
- Pandas leverages __NumPy's universal functions__, which include addition, subtraction, multiplication, logs, exponentials, etc
- The __.apply()__ method is used to apply an element-wise function to a Series or along one of the axes of a DataFrame - a great tool to use when there is no ufunc existing for the operation we want to conduct!

---
## <font color=#FF8181> 8. Concept Check: </font>

1. If you use df.apply(..., axis=0), what is the shape of the output? Suppose df.shape = (10,5)
2. If you use df.apply(..., axis=1), what is the shape of the output? Suppose df.shape = (10,5)
3. Use .apply on a series. The input to the function is x. If x is even, return 0. Else, return x.
    - Eg. `pd.Series([1,2,3,4,5])` -> `[1,0,3,0,5]`
4. Suppose you had a dataframe `df = pd.DataFrame({'currency':['USD','AUD','USD'], 'amount':[100,200,300]})`. Use the apply method to get a column with the amounts in GBP.
   - To go from AUD to GBP, divide by 1.94.
   - To go from USD to GBP, divide by 1.32.
5. Compare the time it takes to use apply vs a ufunc. Use %timeit.

In [None]:
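#1 axis = 0: the function is called once per column, so a reducing function (e.g. a sum) returns a Series with one value per column - it looks like a row and has shape (5,)
#2 axis = 1: the function is called once per row, so a reducing function returns a Series with one value per row - it looks like a column and has shape (10,)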

In [71]:
#3
import pandas as pd

s = pd.Series([1,2,3,4,5])
# Return the integer 0 (not the string '0') so that the result stays numeric
result = s.apply(lambda x: 0 if x % 2 == 0 else x)
display(result)


0    1
1    0
2    3
3    0
4    5
dtype: int64

In [96]:
#4
df = pd.DataFrame({'currency':['USD','AUD','USD'], 'amount':[100,200,300]})

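def convert(row, currency, amount):
    # currency and amount are the names of the columns to read from the row
    if row[currency] == 'USD':
        GBP = row[amount] / 1.32
    elif row[currency] == 'AUD':
        GBP = row[amount] / 1.94
    else:
        # Fail loudly instead of returning an undefined value
        raise ValueError(f"Unsupported currency: {row[currency]}")
    return GBP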

df['GBP'] = df.apply(convert, axis = 1, currency = 'currency', amount = 'amount')
display(df)


# Alternative, more compact solution (a one-line lambda; it treats every non-AUD amount as USD)
# df['amount_gbp'] = df.apply(lambda row: row['amount']/1.94 if row['currency'] == 'AUD' else row['amount']/1.32, axis=1)
# display(df)

Unnamed: 0,currency,amount,GBP
0,USD,100,75.757576
1,AUD,200,103.092784
2,USD,300,227.272727


In [93]:
#5 .apply() calls the Python lambda once per element
s = pd.Series(range(10000))
%timeit s.apply(lambda x:x*10)

3.53 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [94]:
#5 the vectorized multiplication processes the whole Series in one call
s = pd.Series(range(10000))
%timeit s*10

54.8 µs ± 3.79 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
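`%timeit` is only available inside IPython and Jupyter. As a rough sketch for a plain Python script, the standard-library `timeit` module can run the same comparison. The repeat count of 20 is an arbitrary choice, and the timings depend on the machine:

```python
import timeit
import pandas as pd

s = pd.Series(range(10000))

# Total time for 20 runs of each approach
apply_time = timeit.timeit(lambda: s.apply(lambda x: x * 10), number=20)
vector_time = timeit.timeit(lambda: s * 10, number=20)

print(f'apply:      {apply_time:.4f} s for 20 runs')
print(f'vectorized: {vector_time:.4f} s for 20 runs')

# Both approaches produce the same values
same = bool((s.apply(lambda x: x * 10) == s * 10).all())
print(same)   # True
```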
