![DSB logo](img/Dolan.jpg)
# Apply Functions to Your DataFrame

## PD4E Chapter 9: Apply
### How do you read/manipulate/store data in Python?

# What You Learned in Python/Pandas that could Apply Here

You will need the following knowledge from the first half of this course:
1. functions
2. subsetting/slicing data
3. Loops

# What You will Learn in this Chapter

You will learn the following techniques in this chapter:
1. how to apply functions to columns, rows, or the whole DataFrame
2. the different use cases of `.apply()`, `.map()`, and `.applymap()`
3. `lambda` - nameless functions written without a `def` statement

# Review of Functions

- Functions are __reusable__ code blocks 
    - where we group some statements together
- In `pandas`, we use functions a lot, particularly in the data preprocessing step
    - e.g., write a function to calculate some values, then use it on all applicable columns for consistency
- Functions can be categorized as _fruitful_ and _void_
    - here we mostly care about _fruitful_ functions

In [3]:
# example of a fruitful function
def avg_2(x, y = 10):
    return (x + y) / 2

avg_2(4)

7.0

# Why `.apply()`?

- when you want to use a function on a DataFrame, calling the function directly on it or on its columns often works
    - but sometimes it does not work as expected
- think of `.apply()` as the `pandas` way of calling functions
    - note that you still have to define your function

In [4]:
# an example
import pandas as pd

df1=pd.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})
df1

Unnamed: 0,a,b
0,10,20
1,20,30
2,30,40


In [5]:
# function def. - calculate square
def my_sq(x):
    return x ** 2

In [6]:
# let's try calling the function the normal way
my_sq(df1['a'])

0    100
1    400
2    900
Name: a, dtype: int64

In [7]:
# how about the whole DF?
my_sq(df1)

Unnamed: 0,a,b
0,100,400
1,400,900
2,900,1600


# What happened above?

- Looks like we can call the function (`my_sq()`) the normal way, and it does work on either a column or the whole DF
- Now why do we need `.apply()`?
    - we know functions can take arguments, maybe it does not work with arguments?
    - Look at the example below

In [8]:
# function def. - calculate square
def my_exp(x, e):
    return x ** e

In [9]:
my_exp(2, 3)

8

In [10]:
# it appears that taking parameters is not a problem
# let's come back to the 'why' part later
my_exp(df1['a'], 2)

0    100
1    400
2    900
Name: a, dtype: int64

# How Does `.apply()` Work?

- `.apply()` is essentially a Series method 
    - which means natively we can _apply_ a function to a Series (column)
    - what `.apply()` does is that for every element in the series, the function is applied to it
        - and the results are returned as a Series of the same length

In [11]:
sq = df1['a'].apply(my_sq)
sq

0    100
1    400
2    900
Name: a, dtype: int64

In [12]:
cb = df1['a'].apply(my_exp, e=2)
cb

0    100
1    400
2    900
Name: a, dtype: int64

In [13]:
cb1 = []
for v in df1['a'].values:
    #print(v)
    cb1.append(my_exp(v, 2))
pd.Series(cb1)

0    100
1    400
2    900
dtype: int64

# What happened above?

- as you saw in these examples, `.apply()` works as if a `for` loop were embedded in it
    - the function (e.g., `my_exp()`) is called on every value in the Series (`df1['a']`)
    - and the return values are automatically collected into a `pandas.Series`
- this is how we avoid writing `for` loops in `pandas`
    - as we said before, `for` loops are expensive, so avoid them whenever you can
    - this is the first benefit of using `.apply()`
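
The equivalence described above can be checked directly. This is a minimal sketch that redefines `df1` and `my_exp` from the earlier cells so it runs on its own; the names `via_apply`, `via_loop`, and `vectorized` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def my_exp(x, e):
    return x ** e

# .apply() calls my_exp once per element and collects the results
via_apply = df1['a'].apply(my_exp, e=2)

# the explicit loop does the same work by hand
via_loop = pd.Series([my_exp(v, 2) for v in df1['a']], name='a')

# for plain arithmetic, operating on the whole Series at once also works
vectorized = df1['a'] ** 2

print(via_apply.tolist() == via_loop.tolist())
print(via_apply.tolist() == vectorized.tolist())
```

All three approaches should give the same values; which one to prefer depends on how complex the function is.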

In [14]:
# we can do the same to a DF
# note that one difference between `.apply()` and the regular function call is
# that extra arguments are passed by keyword (here `e=2`) or through `args=(2,)`
df1.apply(my_exp, e=2)

Unnamed: 0,a,b
0,100,400
1,400,900
2,900,1600


In [15]:
# but if you try to apply a function with unmatched number of inputs
# it will raise an error - see this example

# this function takes three inputs
def avg_3(x, y, z):
    return (x + y + z) / 3

In [16]:
# when you apply the function to `df1` - since `df1` only has two columns
# this will raise an error
df1.apply(avg_3)

TypeError: ("avg_3() missing 2 required positional arguments: 'y' and 'z'", 'occurred at index a')

In [17]:
# consider the logic above - maybe we want to take the average of each column?
# we can rewrite the function like below
def avg_3_apply(col):
    x = col[0]
    y = col[1]
    z = col[2]
    return (x + y + z) / 3

In [18]:
# now it works
df1.apply(avg_3_apply)

a    20.0
b    30.0
dtype: float64

# Your Turn Here

Explain why the above code works.

# `.apply()` Works on Columns Natively

- The above example raises an important question
    - do you want to apply the function to each column or to each row?
    - by default, `.apply()` passes each _column_ to the function (`axis=0`)
    - you can change that with the argument `axis=1`, so that each _row_ is passed instead

- note that in `DataFrame.apply()`, `axis=0` means "apply the function to each column" and `axis=1` means "apply it to each row"
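
To make the two directions concrete, here is a small sketch on the same `df1`. The names `col_sums` and `row_sums` are illustrative and not part of the original notebook.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

# axis=0 (the default): the function receives one column at a time,
# so the result has one entry per column
col_sums = df1.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives one row at a time,
# so the result has one entry per row
row_sums = df1.apply(lambda row: row.sum(), axis=1)

print(col_sums)
print(row_sums)
```

A quick way to check which direction you got is to look at the length of the result: two entries here means per column, three means per row.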

In [19]:
df1.apply(avg_3_apply, axis=1)

IndexError: ('index out of bounds', 'occurred at index 0')

In [20]:
df1.apply(avg_3_apply, axis=0)

a    20.0
b    30.0
dtype: float64

In [21]:
# another example
def avg_2_apply(row):
    # use .iloc for positions, since a row is labeled by column names ('a', 'b')
    x = row.iloc[0]
    y = row.iloc[1]
    return (x + y) / 2

In [22]:
df1.apply(avg_2_apply, axis=1)

0    15.0
1    25.0
2    35.0
dtype: float64

In [21]:
# another way of doing this - note that this is much more expensive than `.apply()`
for index, row in df1.iterrows(): # `.iterrows()` iterates through the rows of a DF
    # print(index, row)
    # break
    # index is the index value of the row
    print(index, avg_2_apply(row))

0 15.0
1 25.0
2 35.0


# A More Complex Example of `.apply()`

- So far we have been playing with a very simple DF
- We actually use `.apply()` for more complicated use cases
    - e.g., testing the _missingness_ in a dataset

In [23]:
# load a dataset
# the `titanic` dataset is one of the most popular datasets in analytics
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [24]:
# in lecture 9, we had a way of calculating missingness
# count of missing values by column
titanic.isna().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [24]:
# we can also calculate the ratio of missing values
(titanic.isna().sum()/titanic.shape[0]).round(4) * 100

survived        0.00
pclass          0.00
sex             0.00
age            19.87
sibsp           0.00
parch           0.00
fare            0.00
embarked        0.22
class           0.00
who             0.00
adult_male      0.00
deck           77.22
embark_town     0.22
alive           0.00
alone           0.00
dtype: float64

In [25]:
# `.apply()` needs a function to call, so we wrap the logic in `count_missing()`
# inside it we use np.sum(); the method form `null_col.sum()` would work as well
import numpy as np
def count_missing(col):
    """Counts the number of missing values in a column
    """
    null_col = pd.isna(col)
    null_count = np.sum(null_col)
    return null_count

In [26]:
cmis_col = titanic.apply(count_missing)
cmis_col

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [29]:
cmis_row = titanic.apply(count_missing, axis=1)
cmis_row

0      1
1      0
2      1
3      0
4      1
      ..
886    1
887    0
888    2
889    0
890    1
Length: 891, dtype: int64

# Your Turn Here

Please explain the results of the above code block.

In [None]:
# Anthonie Hollaar
# Date: 11-26-2019

# axis=1: `count_missing` receives one row at a time,
# so the result has one value per row (891 in total): the number of missing values for that passenger

# axis=0 (the default): `count_missing` receives one column at a time,
# so the result has one value per column

# Lambda Functions

- Regular Python functions require a definition (`def`) and a name
- Sometimes a function is so simple that it does not deserve a separate definition and name
- We call such functions anonymous functions; in Python they are written with __lambda__
    - a _lambda_ has no name; it takes _arguments_ and specifies a single _expression_ (usually a one-liner)
    - you do not write `return` - the value of the expression is returned automatically
- A __lambda__ has the following structure:

```python
lambda arguments: expression
```

In [31]:
# same as the my_exp function earlier
exp_lambda = lambda x, y: x ** y
exp_lambda(2, 3)

8

In [32]:
exp_lambda(3, 2)

9

In [33]:
exp_lambda(df1['a'], 2)

0    100
1    400
2    900
Name: a, dtype: int64

# When is the Best Time to use `lambda`?

- `lambda` is particularly useful when you deal with _lists_, _Series_, and _DataFrames_
    - In particular, when we need to transform a column in a DF
- the expression in a `lambda` has to be simple
    - if the operation is complex, define it in a regular function and call that function from the lambda (or pass it to `.apply()` directly)
    - if the operation contains `if` statements or a `for` loop, you should consider using a function rather than a `lambda`
- Passing a `lambda` straight to `.apply()` is one of its most convenient uses: a column can be transformed in a single line
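
As a rough illustration of where that line falls: a single comparison fits comfortably in a `lambda`, while a rule with several branches reads better as a named function. The `ages` Series and the `age_group()` helper below are made up for this sketch.

```python
import pandas as pd

ages = pd.Series([4, 15, 38, 71], name='age')  # made-up values

# simple enough for a lambda: one expression
is_adult = ages.apply(lambda x: x >= 18)

# several branches: clearer as a named function
def age_group(x):
    if x < 13:
        return 'child'
    elif x < 18:
        return 'teen'
    elif x < 65:
        return 'adult'
    return 'senior'

groups = ages.apply(age_group)

print(is_adult.tolist())
print(groups.tolist())
```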

In [31]:
# this is how you apply lambda to a column
df1['a'].apply(exp_lambda, y=2)

0    100
1    400
2    900
Name: a, dtype: int64

In [32]:
# an even easier way 
# you do not need any definition or function name
df1['a'].apply(lambda x: x**2)

0    100
1    400
2    900
Name: a, dtype: int64

In [33]:
# a complex function
def my_gender(x):
    if x == 'female':
        return 'f'
    else:
        return 'm'

In [34]:
# using lambda
titanic['sex'].apply(lambda x: my_gender(x)).head()

0    m
1    f
2    f
3    f
4    m
Name: sex, dtype: object

In [35]:
# equivalent of above
titanic['sex'].apply(my_gender).head()

0    m
1    f
2    f
3    f
4    m
Name: sex, dtype: object

# Other Ways to Use Functions in `pandas`

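- `.map()` is another method
- the difference between `.map()` and `.apply()` is that
    - `Series.map()` works on a single Series (column), element by element
    - `.apply()` works on a Series or on the whole DataFrame
    - note: in the pandas version used here, DataFrames have no `.map()`; pandas 2.1 added `DataFrame.map()` as the new name for `.applymap()`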

In [36]:
df1.apply(lambda x: x**2)

Unnamed: 0,a,b
0,100,400
1,400,900
2,900,1600


In [37]:
# in the pandas version used here (before 2.1) this raises an error
df1.map(lambda x: x**2)

AttributeError: 'DataFrame' object has no attribute 'map'

# Other Ways to Use Functions in `pandas`

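- since `Series.map()` is limited to one column, it cannot transform a whole DataFrame by itself
- for that we have `.applymap()`
    - it combines the ideas of `.map()` and `.apply()`: the function is called on every single element of the DataFrame
    - use `.applymap()` when the function expects one scalar value at a time; `.apply()` instead hands the function a whole column or row
    - since pandas 2.1, `.applymap()` is deprecated in favor of `DataFrame.map()`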
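
The three methods can be compared side by side. This is a sketch, not the notebook's own code: the dictionary of labels is made up, and the `elementwise` helper is only a hypothetical way to pick whichever element-wise DataFrame method the installed pandas version offers.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

# Series.map() accepts a function ...
squared = df['a'].map(lambda x: x ** 2)

# ... or a dictionary that maps old values to new ones
labels = df['a'].map({10: 'low', 20: 'mid', 30: 'high'})

# element-wise over the whole DataFrame:
# DataFrame.map() on pandas >= 2.1, DataFrame.applymap() on older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap
squared_df = elementwise(lambda x: x ** 2)

print(squared.tolist())
print(labels.tolist())
print(squared_df)
```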

In [38]:
df1.applymap(lambda x: x**2)

Unnamed: 0,a,b
0,100,400
1,400,900
2,900,1600


# Popular Use Cases of `.apply()` and `lambda`

- We use the combination of `.apply()` and `lambda` in `pandas` when we are dealing with these scenarios
    - creating a new column based on an existing column
    - filtering a DataFrame (selecting a subset of rows)
    - extracting data from a column

In [34]:
# let's read a dataset as an example
# please change your PATH to `'/srv/data/my_shared_data_folder/ba505-data/IMDB-Movie-Data.csv'`
imdb_data = pd.read_csv('/srv/data/my_shared_data_folder/ba505-data/IMDB-Movie-Data.csv')
imdb_data.head(2)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0


In [40]:
# we can calculate the average rating of a movie
# by averaging the `Rating` and a tenth of the `Metascore`
imdb_data['AvgRating'] = (imdb_data['Rating'] + imdb_data['Metascore']/10)/2
imdb_data['AvgRating'].head()

0    7.85
1    6.75
2    6.75
3    6.55
4    5.10
Name: AvgRating, dtype: float64

In [41]:
# we can filter the DF by the values of a certain column
# say we want to filter `imdb_data` by the `Title` column
# if the title contains 4 or more words, we select the row

long_title_movie_data = imdb_data[imdb_data['Title'].apply(lambda x: len(x.split())>=4)]
long_title_movie_data.head(3)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,AvgRating
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,7.85
8,9,The Lost City of Z,"Action,Adventure,Biography","A true-life drama, centering on British explor...",James Gray,"Charlie Hunnam, Robert Pattinson, Sienna Mille...",2016,141,7.1,7188,8.01,78.0,7.45
10,11,Fantastic Beasts and Where to Find Them,"Adventure,Family,Fantasy",The adventures of writer Newt Scamander in New...,David Yates,"Eddie Redmayne, Katherine Waterston, Alison Su...",2016,133,7.5,232072,234.02,66.0,7.05


In [42]:
name_df = pd.DataFrame(data = ['Braund, Mr. Owen Harris',
 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
 'Heikkinen, Miss. Laina',
 'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
 'Allen, Mr. William Henry',
 'Moran, Mr. James',
 'McCarthy, Mr. Timothy J',
 'Palsson, Master. Gosta Leonard',
 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
 'Nasser, Mrs. Nicholas (Adele Achem)'], columns = ['Name'] )

#Take a look at the Data 
name_df.head(3)

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"


In [43]:
# we observe that the title always comes after the comma (`,`)
# and is separated from the first name by a period (`.`)
# the following code does the trick
name_df['Title'] = name_df['Name'].apply(lambda x: x.split(" ")[1].replace(".", ""))
name_df.head(3)

Unnamed: 0,Name,Title
0,"Braund, Mr. Owen Harris",Mr
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Mrs
2,"Heikkinen, Miss. Laina",Miss
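
Splitting on spaces works for the names above because every surname is a single word. A sketch that follows the comma/period observation more literally is shown here; the `extract_title()` helper is hypothetical and the three names are taken from `name_df`.

```python
import pandas as pd

name_df = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris',
                                 'Heikkinen, Miss. Laina',
                                 'Palsson, Master. Gosta Leonard']})

def extract_title(name):
    # keep the text after the first comma ...
    after_comma = name.split(',', 1)[1]
    # ... then keep the text before the first period, without spaces
    return after_comma.split('.', 1)[0].strip()

name_df['Title'] = name_df['Name'].apply(extract_title)
print(name_df['Title'].tolist())
```

Because the rule needs two steps, a named function is used here instead of a `lambda`.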


# Your Turn Here
Finish the exercises below by following the instructions for each of them.

## Q1. Coding Problem

Complete the exercises regarding data types of the given DataFrame (`itinery_df`).

In [23]:
import random
import pandas as pd
# generating the DF
duration_mins = pd.Series(random.sample(range(1, 1800), 20), name='duration_mins')
work_types = ['lecture', 'consulting', 'research']
work_type_series = pd.Series(random.choices(work_types, k=20), name='work_types')
locations = ['Beijing, China', 'London, England', 'Paris, France', 'Munich, Germany', 
             'Sydney, Australia', 'Mumbai, India', 'Madrid, Spain']
loc_series = pd.Series(random.choices(locations, k=20), name='locations')
hour_rates = pd.Series([round(random.uniform(10.0, 20.0), 2) for i in range(20)], name='hour_rates')
hour_rates.loc[random.sample(range(1, 20), 5)] = 'missing'
duration_mins.loc[random.sample(range(1, 20), 5)] = 'missing'
itinery_df = pd.concat([duration_mins, work_type_series, loc_series, hour_rates], axis=1)
#itinery_df['duration_mins'] = itinery_df['duration_mins'].astype(str)
itinery_df.head()

Unnamed: 0,duration_mins,work_types,locations,hour_rates
0,803,research,"Madrid, Spain",18.21
1,1404,consulting,"Beijing, China",10.94
2,missing,consulting,"Munich, Germany",12.46
3,missing,research,"London, England",18.64
4,missing,consulting,"Paris, France",14.16


## Part 1:

Use `.apply()` and `lambda` to create a new column `duration_hrs` by converting `duration_mins` to hours (divide by `60`).
- make sure you handle all `'missing'` values in `duration_mins` - use the average of the column to replace missing values.

In [27]:
# replace missing with value 0
itinery_df['duration_mins'].replace(to_replace='missing', inplace=True, value=0)
# show dataframe with missing replaced by 0 - test if this is correct
itinery_df['duration_mins']
# show the mean of the dataset
print(itinery_df['duration_mins'].mean())
# replace the 0's with the mean (note: this mean was computed with the 0's included, so it understates the true average)
itinery_df['duration_mins'].replace(to_replace=0, inplace=True, value=itinery_df['duration_mins'].mean())
# show dataframe with averages
itinery_df['duration_mins']

# method 1
itinery_df['duration_hrs'] = itinery_df['duration_mins']/60
# show dataframe
itinery_df



764.5


Unnamed: 0,duration_mins,work_types,locations,hour_rates,duration_hrs
0,803.0,research,"Madrid, Spain",18.21,13.383333
1,1404.0,consulting,"Beijing, China",10.94,23.4
2,611.6,consulting,"Munich, Germany",12.46,10.193333
3,611.6,research,"London, England",18.64,10.193333
4,611.6,consulting,"Paris, France",14.16,10.193333
5,1137.0,consulting,"Mumbai, India",missing,18.95
6,527.0,research,"Mumbai, India",15.55,8.783333
7,611.6,consulting,"Sydney, Australia",10.73,10.193333
8,798.0,consulting,"Beijing, China",14.47,13.3
9,112.0,research,"Beijing, China",12.96,1.866667


In [29]:
# method 2
itinery_df['duration_hrs'] = itinery_df['duration_mins'].apply(lambda x: x/60)
# show dataframe and test if it worked - visually inspect
itinery_df

Unnamed: 0,duration_mins,work_types,locations,hour_rates,duration_hrs
0,803.0,research,"Madrid, Spain",18.21,13.383333
1,1404.0,consulting,"Beijing, China",10.94,23.4
2,611.6,consulting,"Munich, Germany",12.46,10.193333
3,611.6,research,"London, England",18.64,10.193333
4,611.6,consulting,"Paris, France",14.16,10.193333
5,1137.0,consulting,"Mumbai, India",missing,18.95
6,527.0,research,"Mumbai, India",15.55,8.783333
7,611.6,consulting,"Sydney, Australia",10.73,10.193333
8,798.0,consulting,"Beijing, China",14.47,13.3
9,112.0,research,"Beijing, China",12.96,1.866667
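
One caveat about the solution above: replacing `'missing'` with `0` before taking the mean counts those zeros in the average, which pulls it toward zero. A possible alternative, sketched on a small made-up column instead of the random `itinery_df`, converts `'missing'` to `NaN` first, since `.mean()` skips `NaN` values.

```python
import pandas as pd

# small made-up column in the same shape as `duration_mins`
durations = pd.Series([600, 'missing', 1200, 'missing', 300], name='duration_mins')

# 'missing' becomes NaN, everything else becomes a number
numeric = pd.to_numeric(durations, errors='coerce')

# .mean() skips NaN, so the average uses only the observed values
observed_mean = numeric.mean()
filled = numeric.fillna(observed_mean)

# convert minutes to hours with .apply() and a lambda
duration_hrs = filled.apply(lambda x: x / 60)

print(observed_mean)
print(filled.tolist())
```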


## Part 2:

Use `.apply()` and `lambda` to create two new columns `cities` and `countries`.

- `cities` refer to the first part in `locations` - before the `,`
- `countries` refer to the second part in `locations`
- note that there is a space after `,` that you need to remove

In [85]:
# method 2
# split the locations and take the left part 
itinery_df['cities'] = itinery_df['locations'].apply(lambda x: x.split(",")[0])
# test if cities are added correctly
itinery_df
# split the locations, take the right part, and strip the leading space
itinery_df['countries'] = itinery_df['locations'].apply(lambda x: x.split(",")[1].strip())
# test if countries were added correctly
itinery_df

Unnamed: 0,duration_mins,work_types,locations,hour_rates,duration_hrs,cities,countries,work_load
0,803.0,research,"Madrid, Spain",18.21,13.383333,Madrid,Spain,part_time
1,1404.0,consulting,"Beijing, China",10.94,23.4,Beijing,China,full_time
2,611.6,consulting,"Munich, Germany",12.46,10.193333,Munich,Germany,part_time
3,611.6,research,"London, England",18.64,10.193333,London,England,part_time
4,611.6,consulting,"Paris, France",14.16,10.193333,Paris,France,part_time
5,1137.0,consulting,"Mumbai, India",missing,18.95,Mumbai,India,part_time
6,527.0,research,"Mumbai, India",15.55,8.783333,Mumbai,India,part_time
7,611.6,consulting,"Sydney, Australia",10.73,10.193333,Sydney,Australia,part_time
8,798.0,consulting,"Beijing, China",14.47,13.3,Beijing,China,part_time
9,112.0,research,"Beijing, China",12.96,1.866667,Beijing,China,part_time


## Part 3:

Use `.apply()` and `lambda` to create a column `work_load` using the following logic:

```python
if duration_hrs >= 20:
    # 'full_time'
else:
    # 'part_time'
```

In [86]:
# create a function whereby if the duration_hrs is larger or equal to 20 hours assign full_time else assign part_time
def work_loadfunc(duration_hrs):
    if duration_hrs >= 20:
        return "full_time"
    else:
        return "part_time"

# create a column called 'work_load' for part_time/full_time duration_hrs
itinery_df['work_load'] = itinery_df['duration_hrs'].apply(lambda duration_hrs: work_loadfunc(duration_hrs))

# show new dataframe with column work_load added and test if it is added correctly
itinery_df

Unnamed: 0,duration_mins,work_types,locations,hour_rates,duration_hrs,cities,countries,work_load
0,803.0,research,"Madrid, Spain",18.21,13.383333,Madrid,Spain,part_time
1,1404.0,consulting,"Beijing, China",10.94,23.4,Beijing,China,full_time
2,611.6,consulting,"Munich, Germany",12.46,10.193333,Munich,Germany,part_time
3,611.6,research,"London, England",18.64,10.193333,London,England,part_time
4,611.6,consulting,"Paris, France",14.16,10.193333,Paris,France,part_time
5,1137.0,consulting,"Mumbai, India",missing,18.95,Mumbai,India,part_time
6,527.0,research,"Mumbai, India",15.55,8.783333,Mumbai,India,part_time
7,611.6,consulting,"Sydney, Australia",10.73,10.193333,Sydney,Australia,part_time
8,798.0,consulting,"Beijing, China",14.47,13.3,Beijing,China,part_time
9,112.0,research,"Beijing, China",12.96,1.866667,Beijing,China,part_time


##  Part 4:

Use `.apply()` and `lambda` to calculate the total payment for each row, $ payment_{total} = duration\_hrs \times hour\_rates $.

In order to do that, you need to:
1. verify the `duration_hrs` and `hour_rates` are in the numerical (float) type.
2. handle all `'missing'` values in the `hour_rates` column - use the average of the column to replace missing values.
3. create a new column namely `payments`, then put the calculation results in it.

In [95]:
# step 1
# find the datatypes for the two panda series 'duration_hrs' and 'hour_rates'
print(itinery_df[['duration_hrs', 'hour_rates']].dtypes)

# step 2
# replace missing with value 0
itinery_df['hour_rates'].replace(to_replace='missing', inplace=True, value=0)
# show dataframe with missing replaced by 0 - test if this is correct
itinery_df['hour_rates']
# show the mean of the dataset
print(itinery_df['hour_rates'].mean())
# replace the 0's with the mean (note: this mean was computed with the 0's included, so it understates the true average)
itinery_df['hour_rates'].replace(to_replace=0, inplace=True, value=itinery_df['hour_rates'].mean())
# show dataframe with averages
itinery_df['hour_rates']

# step 3
# create a new column 'payments' which is the duration_hrs multiplied by the 'hour_rates'
itinery_df['payments'] = itinery_df['duration_hrs']*itinery_df['hour_rates']
# show dataframe and test if payments is correctly added
itinery_df

duration_hrs    float64
hour_rates      float64
dtype: object
13.33875


Unnamed: 0,duration_mins,work_types,locations,hour_rates,duration_hrs,cities,countries,work_load,payments
0,803.0,research,"Madrid, Spain",18.21,13.383333,Madrid,Spain,part_time,243.7105
1,1404.0,consulting,"Beijing, China",10.94,23.4,Beijing,China,full_time,255.996
2,611.6,consulting,"Munich, Germany",12.46,10.193333,Munich,Germany,part_time,127.008933
3,611.6,research,"London, England",18.64,10.193333,London,England,part_time,190.003733
4,611.6,consulting,"Paris, France",14.16,10.193333,Paris,France,part_time,144.3376
5,1137.0,consulting,"Mumbai, India",10.671,18.95,Mumbai,India,part_time,202.21545
6,527.0,research,"Mumbai, India",15.55,8.783333,Mumbai,India,part_time,136.580833
7,611.6,consulting,"Sydney, Australia",10.73,10.193333,Sydney,Australia,part_time,109.374467
8,798.0,consulting,"Beijing, China",14.47,13.3,Beijing,China,part_time,192.451
9,112.0,research,"Beijing, China",12.96,1.866667,Beijing,China,part_time,24.192
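
The solution above multiplies the two columns directly. Since the exercise asks for `.apply()` and `lambda`, a row-wise version might look like the sketch below. The `pay_df` DataFrame is made up for illustration and only borrows the column names from `itinery_df`.

```python
import pandas as pd

# made-up rows using the same column names as `itinery_df`
pay_df = pd.DataFrame({'duration_hrs': [10.0, 23.5, 2.0],
                       'hour_rates': [18.0, 10.0, 12.5]})

# axis=1 passes each row to the lambda as a Series
pay_df['payments'] = pay_df.apply(
    lambda row: row['duration_hrs'] * row['hour_rates'], axis=1)

# the column-wise product gives the same values
same = (pay_df['payments'] == pay_df['duration_hrs'] * pay_df['hour_rates']).all()

print(pay_df)
print(same)
```

Both versions produce the same column; the direct product is shorter and usually faster, while the row-wise form generalizes to logic that cannot be written as column arithmetic.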


## Part 5:

Create a new column `final_pay` using the following logic (note that `work_load`, `payments` and `final_pay` are column names):

```python

if work_load == 'full_time':
    final_pay = payments * 1.05
elif work_load == 'part_time':
    final_pay = payments * 0.95
```

In [116]:
# create a function whereby if the element in column work_load equals full_time, multiply the payments column element by 1.05
# else multiply payments column element by 0.95
# return the result and call it final_pay
def final_pay_function(df, work_load, payments):
    if df[work_load] == 'full_time':
        final_pay = df[payments] * 1.05
        return final_pay
    else:
        final_pay = df[payments] * 0.95
        return final_pay

# create a new column called final_pay using the previously defined function; axis=1 passes each row to the function (row-wise)
itinery_df['final_pay'] = itinery_df.apply(lambda df: final_pay_function(df, 'work_load', 'payments'), axis=1)

# show the new dataframe and test visually if new column is added and correctly calculates the final payment adjusted for full_time or part_time
itinery_df

Unnamed: 0,duration_mins,work_types,locations,hour_rates,duration_hrs,cities,countries,work_load,payments,final_pay
0,803.0,research,"Madrid, Spain",18.21,13.383333,Madrid,Spain,part_time,243.7105,231.524975
1,1404.0,consulting,"Beijing, China",10.94,23.4,Beijing,China,full_time,255.996,268.7958
2,611.6,consulting,"Munich, Germany",12.46,10.193333,Munich,Germany,part_time,127.008933,120.658487
3,611.6,research,"London, England",18.64,10.193333,London,England,part_time,190.003733,180.503547
4,611.6,consulting,"Paris, France",14.16,10.193333,Paris,France,part_time,144.3376,137.12072
5,1137.0,consulting,"Mumbai, India",10.671,18.95,Mumbai,India,part_time,202.21545,192.104677
6,527.0,research,"Mumbai, India",15.55,8.783333,Mumbai,India,part_time,136.580833,129.751792
7,611.6,consulting,"Sydney, Australia",10.73,10.193333,Sydney,Australia,part_time,109.374467,103.905743
8,798.0,consulting,"Beijing, China",14.47,13.3,Beijing,China,part_time,192.451,182.82845
9,112.0,research,"Beijing, China",12.96,1.866667,Beijing,China,part_time,24.192,22.9824
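
For comparison, the same rule can be written without a row-wise `.apply()`. This is only a sketch on made-up data, using `numpy.where` to choose the multiplier. Like the `else` branch in the solution above, it treats every value other than `'full_time'` as part time.

```python
import numpy as np
import pandas as pd

# made-up rows using the same column names as `itinery_df`
pay_df = pd.DataFrame({'work_load': ['part_time', 'full_time', 'part_time'],
                       'payments': [100.0, 200.0, 300.0]})

# pick the multiplier for every row at once, then multiply the column
multiplier = np.where(pay_df['work_load'] == 'full_time', 1.05, 0.95)
pay_df['final_pay'] = pay_df['payments'] * multiplier

print(pay_df)
```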


# Classwork (start here in class)
You can start working on them right now:
- Read Chapter 9 in PD4E 
- If time permits, start in on your homework. 
- Ask questions when you need help. Use this time to get help from the professor!

# Homework (do at home)
The following is due before class next week:
  - Any remaining classwork from tonight
  - DataCamp “Speed efficient methods for iterating through a DataFrame” assignment

Note: All work on DataCamp is logged. Don't try to fake it!

Please email [me](mailto:jtao@fairfield.edu) if you have any problems or questions.
