<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

In [None]:
import pandas as pd
import numpy as np
import datetime

pd.options.display.float_format = '{:,.2f}'.format

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Advanced column creation

* so far we've created simple derived columns
<br><br>

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    parse_dates=['Date']
)

In [None]:
data.head()

In [None]:
data['Experience'] = (data['Helpfulness'] * 3 + 
                      data['Courtesy'] * 2 + 
                      data['Empathy']) / 6

In [None]:
data.head()

### We can also:

* create columns based on a condition (similar to using the Excel `IF` function)
<br><br>
* fill in a column by running any custom function
    * can get the value from an API call, by performing some complex computations, etc.

<br><br><br>

### Create a column based on a condition

* several ways to do it:
    * using `np.where`
    * using the `apply()` method

Let's say we want to add a new column called `Bad` that has the value **yes** if the scores in `Helpfulness`, `Empathy`, and `Courtesy` are all below 2, and the value **no** otherwise.

### Using `np.where`

<div>
<img src="https://edlitera-images.s3.amazonaws.com/np_where.png" width="500"/>
</div>

In [None]:
numbers = np.array([-1, 2, -3, -4, 5, 6])

# In the ndarray below, we'll replace negative numbers with
# 0, but keep the original positive numbers

positive_numbers = np.where(numbers > 0, numbers, 0)

In [None]:
positive_numbers

<br><br><br>

**It turns out `np.where` also works with Series objects**

In [None]:
data['Helpfulness']

In [None]:
np.where( data['Helpfulness'] > 2, 'Good score', 'Bad score')

<br><br>
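
One detail worth checking is what `np.where` returns when it is given a Series. The sketch below uses a small made-up Series (`scores` is a hypothetical stand-in for `data['Helpfulness']`) and shows that the result is a plain NumPy array, which can be wrapped back into a Series if the index is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, standing in for data['Helpfulness']
scores = pd.Series([1, 3, 2, 5], name='Helpfulness')

# Same call shape as above: condition, value if True, value if False
labels = np.where(scores > 2, 'Good score', 'Bad score')

# np.where hands back a NumPy array, not a Series
print(type(labels))

# Wrapping the array in a Series restores the original index
labeled = pd.Series(labels, index=scores.index, name='Label')
print(labeled)
```

Assigning the array directly to a DataFrame column also works, as long as it has one value per row.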

Let's say we want to add a new column called `Bad` that has the value **yes** if the scores in `Helpfulness`, `Empathy`, and `Courtesy` are all below 2, and the value **no** otherwise.

In [None]:
data['Bad'] = np.where(
    ((data['Helpfulness'] < 2) &
    (data['Empathy'] < 2) & 
    (data['Courtesy'] < 2)),
    'yes', 
    'no'
)

In [None]:
data.head()

In [None]:
data[ data['Bad'] == 'yes' ].head()

<br><br><br><br><br><br>
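
The condition passed to `np.where` above combines three comparisons with `&`, each wrapped in its own parentheses. Here is a minimal sketch of the same pattern on a tiny made-up DataFrame (`toy` and its values are assumptions for illustration, not the survey data), together with an equivalent way of writing the "all below 2" test.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for three survey responses
toy = pd.DataFrame({
    'Helpfulness': [1, 1, 4],
    'Empathy':     [1, 3, 5],
    'Courtesy':    [1, 1, 4],
})

# Each comparison yields a boolean Series; & combines them element by element.
# The parentheses matter, because & binds more tightly than <.
all_low = (toy['Helpfulness'] < 2) & (toy['Empathy'] < 2) & (toy['Courtesy'] < 2)

toy['Bad'] = np.where(all_low, 'yes', 'no')

# Equivalent test: compare all three columns at once, then require
# every value in a row to be True
score_columns = ['Helpfulness', 'Empathy', 'Courtesy']
all_low_alt = (toy[score_columns] < 2).all(axis=1)

print(toy)
```

Only the first row ends up marked `yes`, because it is the only one where all three scores are below 2.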

### Using the `apply()` method

* allows maximum customization
* can use this to fill in columns with the result of complex computations, API calls, etc.
* **basic idea: a custom function is applied for each row (or for each column) and the returned values are gathered in a new column (or a new row)**

<br><br>

#### How to use

**Step 1.** Create a function that takes in one input (representing either a column or a row from the DataFrame). 
* This function will need to return a value corresponding to that row (or column)
* This is a regular Python function, so it can:
    * call other functions / modules
    * access APIs, databases, files, etc.
<br><br>  

**Step 2.** Apply this function to each row (or column) in the DataFrame, using the `apply()` method.    
<br><br>
**Step 3.** 
* Create a new column by combining the values returned by applying the custom function to each row 
<br><br>
OR
<br><br>
* Create a new row by combining the values returned by applying the custom function to each column.

**We specify whether we apply the custom function to each row or to each column by using the `axis` argument.**
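
The three steps can be sketched on a tiny made-up DataFrame. The names `toy` and `total` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data: two columns, three rows
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Step 1: a function that takes one row (or one column) and returns a value
def total(values):
    return values.sum()

# Step 2: apply it to each row (axis=1) or to each column (axis=0)
row_totals = toy.apply(total, axis=1)  # one value per row
col_totals = toy.apply(total, axis=0)  # one value per column

# Step 3: store the per-row results as a new column
toy['total'] = row_totals

print(toy)
print(col_totals)
```

The same function works for both directions here because it only relies on receiving a collection of values it can sum.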

<br><br>

### Remember the DataFrame axes?

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_axis.png" />

* we have two axes: `axis=1` and `axis=0`

* these allow us to specify the direction in which computations should be applied:
    * for example, when we compute the mean in a DataFrame: we can compute the mean **for each column (`axis=0`)** or the mean **for each row (`axis=1`)**

* **think about it this way:**
    * all cells in your DataFrame are in a 2-D space, determined by the two axes
    * if you want to get the cells in a row, you have to move along the horizontal axis (`axis=1`)
    * if you want to get the cells in a column, you have to move along the vertical axis (`axis=0`)

* the `axis` argument also allows us to specify how we should apply our custom functions, using the `apply()` method
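
To make the mean example concrete, here is a short sketch on a made-up 2x2 DataFrame (`toy` is a hypothetical name, and the numbers are chosen so the results are easy to check by hand).

```python
import pandas as pd

# Hypothetical data: two columns, two rows
toy = pd.DataFrame({'x': [1.0, 3.0], 'y': [5.0, 7.0]})

# axis=0: move down the rows, producing one mean per column
col_means = toy.mean(axis=0)

# axis=1: move across the columns, producing one mean per row
row_means = toy.mean(axis=1)

print(col_means)
print(row_means)
```

The result of `axis=0` is labeled by column names, while the result of `axis=1` is labeled by row labels.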

<br><br>

#### How the `apply()` method works (row example)

**STEP 1**

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_apply_1.png?" />

<br>

**STEP 2**

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_apply_2.png?" />

<br>

**STEP 3**

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_apply_3.png?" />

<br>

**STEP 4**

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_apply_4.png?" />

<br>

**STEP 5**

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_apply_5.png?" />

<br>

**STEP 6**

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_apply_6.png?" />

<br><br><br><br><br><br>

#### Code example

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    parse_dates=['Date']
)

In [None]:
data.head()

Let's say we want to add a new column called `Bad Score` that has the value **yes** if the scores in `Helpfulness`, `Empathy`, and `Courtesy` are all below 2, and the value **no** otherwise.

In [None]:
def bad_score(row):
    if ((row['Helpfulness'] < 2) and
        (row['Empathy'] < 2) and
        (row['Courtesy'] < 2)):
        return 'yes'
    return 'no'

In [None]:
data['Bad Score'] = data.apply(bad_score, axis=1)

data.head()

In [None]:
data[ data['Bad Score'] == 'yes' ]

<br><br><br><br><br><br>
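
As a sanity check, the `apply()` version and the `np.where` version of this rule should agree. The sketch below compares them on a small made-up DataFrame (`toy` and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical scores for four survey responses
toy = pd.DataFrame({
    'Helpfulness': [1, 1, 4, 1],
    'Empathy':     [1, 3, 5, 1],
    'Courtesy':    [1, 1, 4, 2],
})

def bad_score(row):
    # Inside the function each row[...] is a single number,
    # so the plain Python `and` is the right operator here
    if ((row['Helpfulness'] < 2) and
        (row['Empathy'] < 2) and
        (row['Courtesy'] < 2)):
        return 'yes'
    return 'no'

via_apply = toy.apply(bad_score, axis=1)

# With whole columns, the comparisons are Series, so & is needed instead
via_where = np.where(
    (toy['Helpfulness'] < 2) & (toy['Empathy'] < 2) & (toy['Courtesy'] < 2),
    'yes',
    'no'
)

# Both approaches should produce the same labels
print(via_apply.tolist() == via_where.tolist())
```

The difference in operators is the main thing to notice: `and` for single values inside the function, `&` for whole columns.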

## Another `apply()` example

Add a new column, called `Check Facilities`, which has the value `CHECK` when the `Facilities` score is less than 2 and `OK` otherwise.

In [None]:
def check_facilities(row):
    if row['Facilities'] < 2:
        return 'CHECK'
    return 'OK'

In [None]:
data.apply(check_facilities, axis=1)

In [None]:
data.apply(lambda row: 'CHECK' if row['Facilities'] < 2 else 'OK', axis=1)

In [None]:
data['Check Facilities'] = data.apply(check_facilities, axis=1)

In [None]:
data.head(7)

In [None]:
data[ data['Check Facilities'] == 'CHECK' ]

<br><br><br><br><br>

## `axis=1` vs `axis=0`

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_axis.png" />

In [None]:
def print_input(input_):
    print(input_)
    print('---------------')

In [None]:
data.head()

In [None]:
data.apply(print_input, axis=0)

In [None]:
data.head()

In [None]:
data.apply(print_input, axis=1)

In [None]:
data.head()

<br><br><br><br><br><br>
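
The printed output above can be long, so here is a more compact sketch of the same idea. It uses a hypothetical helper, `describe_input`, on a made-up DataFrame to record what the function receives in each direction: the name of the Series it was given and the labels inside it.

```python
import pandas as pd

# Hypothetical data: two columns, two rows
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

def describe_input(values):
    # values is a Series: .name identifies which column (or row) it is,
    # and .index holds the labels of the cells inside it
    labels = ", ".join(str(label) for label in values.index)
    return f"name={values.name}, labels={labels}"

# axis=0: the function receives one column at a time
by_column = toy.apply(describe_input, axis=0)

# axis=1: the function receives one row at a time
by_row = toy.apply(describe_input, axis=1)

print(by_column)
print(by_row)
```

With `axis=0` each input is named after a column and labeled by row, while with `axis=1` each input is named after a row and labeled by column.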

## Another way of using `apply()`

**You don't have to send the entire row as the input**

In [None]:
def check_facilities(row):
    if row['Facilities'] < 2:
        return 'CHECK'
    return 'OK'

In [None]:
data['Check Facilities'] = data.apply(check_facilities, axis=1)

The example above works, but there are two things we can improve:
* the `check_facilities` function is not very reusable
* we pass in an entire row of data, but the `check_facilities` function only needs the `Facilities` score

**We can improve this!**

In [None]:
data = pd.read_csv(
    "https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv", 
    parse_dates=['Date']
)

In [None]:
data.head()

In [None]:
def check_facilities(facilities_score):
    if facilities_score < 2:
        return 'CHECK' 
    return 'OK'

In [None]:
data['Facilities']

In [None]:
data['Facilities'].apply(check_facilities)

In [None]:
data['Check Facilities'] = data['Facilities'].apply(check_facilities)

In [None]:
data.head()

<br><br>
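
One possible next step toward reusability is to make the cutoff a parameter. The sketch below is an assumption-laden illustration, not part of the original example: `flag_low` and `threshold` are hypothetical names, and the data is made up. It relies on `Series.apply()` forwarding extra keyword arguments to the function.

```python
import pandas as pd

def flag_low(score, threshold=2):
    # Works on a single value, so it can be reused for any score column
    return 'CHECK' if score < threshold else 'OK'

# Hypothetical scores
toy = pd.DataFrame({'Facilities': [1, 2, 4], 'Courtesy': [3, 1, 2]})

# Default cutoff of 2, matching the rule used above
toy['Check Facilities'] = toy['Facilities'].apply(flag_low)

# Extra keyword arguments are passed through to flag_low
toy['Check Courtesy'] = toy['Courtesy'].apply(flag_low, threshold=3)

print(toy)
```

Because the function no longer mentions a column name, the same definition serves both columns.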