## Applying Functions

In [2]:
import pandas as pd
air_quality = pd.read_pickle('air_quality.pkl')

## .apply() method

### Built In Functions With .apply()

In [7]:
air_quality[['PM2.5','PM10']]

Unnamed: 0,PM2.5,PM10
0,9.0,9.0
1,4.0,4.0
2,4.0,4.0
3,5.0,5.0
4,3.0,6.0
...,...,...
95680,9.0,9.0
95681,10.0,29.0
95682,18.0,32.0
95683,15.0,42.0


In [4]:
# Apply the built-in mean function
air_quality[['PM2.5','PM10']].apply('mean')

PM2.5     83.477884
PM10     111.899959
dtype: float64

In [5]:
# axis defaults to 0, so the result below is identical to the call above.
# axis can be 0 or 1:
# 0 or 'index' applies the function to each column in its entirety.
# 1 or 'columns' applies the function to each row.
air_quality[['PM2.5','PM10']].apply('mean', axis = 0)

PM2.5     83.477884
PM10     111.899959
dtype: float64
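
The axis rule described in the comments can be checked by hand on a frame small enough to verify mentally. This is only a sketch: `toy` is a hypothetical two-row stand-in for the real `air_quality` data, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the PM2.5 and PM10 columns
toy = pd.DataFrame({'PM2.5': [3.0, 10.0], 'PM10': [6.0, 29.0]})

# axis=0 (or 'index'): the function sees each column, one result per column
col_means = toy.apply('mean', axis=0)   # PM2.5 -> 6.5, PM10 -> 17.5

# axis=1 (or 'columns'): the function sees each row, one result per row
row_means = toy.apply('mean', axis=1)   # row 0 -> 4.5, row 1 -> 19.5

print(col_means)
print(row_means)
```

The result has one entry per column for `axis=0` and one entry per row for `axis=1`, which is a quick way to tell which direction the function ran in.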

In [6]:
# With axis = 1, the mean is computed across each row instead.
# Row 4 shows it best: its mean is 4.5, the mean of the 3.0 (PM2.5) and 6.0 (PM10) in row 4 of the table above.
# The last rows of the output give more examples, such as 19.5 and 25.0.
air_quality[['PM2.5','PM10']].apply('mean', axis = 1)

0         9.0
1         4.0
2         4.0
3         5.0
4         4.5
         ... 
95680     9.0
95681    19.5
95682    25.0
95683    28.5
95684    32.5
Length: 95685, dtype: float64

#### Now, how are the next cells different from the above?

#### This is the issue with .apply(). It's simpler, and at least as efficient, to call the methods themselves, like .mean(), instead of .apply('mean').
#### .apply() is capable but not necessary for an example like this one.

### .apply() is considered a last resort and should only be used for complicated functions.

In [8]:
air_quality[['PM2.5','PM10']].mean()

PM2.5     83.477884
PM10     111.899959
dtype: float64

In [9]:
air_quality[['PM2.5','PM10']].mean(axis = 1)

0         9.0
1         4.0
2         4.0
3         5.0
4         4.5
         ... 
95680     9.0
95681    19.5
95682    25.0
95683    28.5
95684    32.5
Length: 95685, dtype: float64
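
Rather than comparing the printed outputs by eye, the equivalence can be checked programmatically. The sketch below assumes a hypothetical three-row frame named `toy` in place of `air_quality`; `Series.equals` is used to compare the two results.

```python
import pandas as pd

# Hypothetical toy data standing in for the PM2.5 and PM10 columns
toy = pd.DataFrame({'PM2.5': [9.0, 3.0, 10.0], 'PM10': [9.0, 6.0, 29.0]})

# Column means: .apply('mean') versus the method itself
same_by_column = toy.apply('mean').equals(toy.mean())

# Row means: .apply('mean', axis=1) versus .mean(axis=1)
same_by_row = toy.apply('mean', axis=1).equals(toy.mean(axis=1))

print(same_by_column, same_by_row)
```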

## Custom Functions With .apply() method

In [12]:
# We want to divide PM2.5 by PM10. No built-in aggregation such as 'mean' performs this, so this example uses a custom function.
# Use axis = 1 because the function should be applied to each row.
def pm_ratio(row):
    return row['PM2.5']/row['PM10']

air_quality.apply(pm_ratio, axis = 1)

0        1.000000
1        1.000000
2        1.000000
3        1.000000
4        0.500000
           ...   
95680    1.000000
95681    0.344828
95682    0.562500
95683    0.357143
95684    0.300000
Length: 95685, dtype: float64

## Lambda Function .apply() method

In [14]:
# Doing the same as above but as a lambda function
air_quality[['PM2.5','PM10']].apply(lambda row: row['PM2.5']/row['PM10'], axis = 1)

0        1.000000
1        1.000000
2        1.000000
3        1.000000
4        0.500000
           ...   
95680    1.000000
95681    0.344828
95682    0.562500
95683    0.357143
95684    0.300000
Length: 95685, dtype: float64

## How is this different from the above?
#### The next cell returns the same values but is simpler.
#### This is another case where .apply() was unnecessary.

In [15]:
air_quality['PM2.5']/air_quality['PM10']

0        1.000000
1        1.000000
2        1.000000
3        1.000000
4        0.500000
           ...   
95680    1.000000
95681    0.344828
95682    0.562500
95683    0.357143
95684    0.300000
Length: 95685, dtype: float64
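
To get a feel for the cost of the row-wise version, the two approaches can be compared on synthetic data. This is a rough sketch, not a benchmark: `toy` is a hypothetical frame of random values, the size of 10,000 rows is arbitrary, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data; values start at 1 so there is no division by zero
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'PM2.5': rng.uniform(1, 100, size=10_000),
    'PM10': rng.uniform(1, 200, size=10_000),
})

# Row-wise: a Python function is called once per row
row_wise = toy.apply(lambda row: row['PM2.5'] / row['PM10'], axis=1)

# Vectorized: one operation on whole columns
vectorized = toy['PM2.5'] / toy['PM10']

# Both approaches should produce the same numbers
same = np.allclose(row_wise, vectorized)

# Time a single run of each approach
t_apply = timeit.timeit(
    lambda: toy.apply(lambda row: row['PM2.5'] / row['PM10'], axis=1), number=1
)
t_vec = timeit.timeit(lambda: toy['PM2.5'] / toy['PM10'], number=1)

print(f"same values: {same}")
print(f"apply: {t_apply:.4f}s, vectorized: {t_vec:.4f}s")
```

The row-wise version calls a Python function for every row, while the column division is a single operation on whole arrays, so the gap generally widens as the frame grows.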

## So when is .apply() best used?
#### It's best used for complicated functions. See the example below.
#### Assume we want to add a new column to air_quality with the values 'Go outside' and 'Stay inside',
#### based on PM2.5_category, TEMP_category, and RAIN.

In [19]:
# Create the custom function.
# With axis = 1, .apply() passes each row to the function as a single Series, but activity_decision expects three separate values,
# so use a lambda to pull the needed columns out of each row.
# VERY IMPORTANT! I tried the standard air_quality[['PM2.5....]] selection and got an error, so use a lambda.
def activity_decision(pm25_category, temp_category, rain):
    if pm25_category in ['Good', 'Moderate'] and temp_category in ['Warm', 'Hot'] and rain == 0:
        return 'Go outside'
    else:
        return 'Stay inside'

air_quality.apply(lambda row: activity_decision(row['PM2.5_category'], row['TEMP_category'], row['RAIN']), axis = 1)

0        Stay inside
1        Stay inside
2        Stay inside
3        Stay inside
4        Stay inside
            ...     
95680     Go outside
95681     Go outside
95682     Go outside
95683     Go outside
95684    Stay inside
Length: 95685, dtype: object
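
The need for the lambda can be reproduced on a small frame. The sketch below is an illustration under assumptions: `toy` and its three rows are hypothetical, and passing `activity_decision` directly to `.apply()` is just one plausible way to hit an error, not necessarily the exact call that failed in the original attempt.

```python
import pandas as pd

def activity_decision(pm25_category, temp_category, rain):
    if pm25_category in ['Good', 'Moderate'] and temp_category in ['Warm', 'Hot'] and rain == 0:
        return 'Go outside'
    return 'Stay inside'

# Hypothetical toy rows using the same column names as air_quality
toy = pd.DataFrame({
    'PM2.5_category': ['Good', 'Good', 'Moderate'],
    'TEMP_category': ['Very Cold', 'Warm', 'Hot'],
    'RAIN': [0.0, 0.0, 1.5],
})

# Passing the function directly fails: .apply(axis=1) hands it ONE argument
# (the row as a Series), but the function requires three.
try:
    toy.apply(activity_decision, axis=1)
    direct_call_failed = False
except TypeError:
    direct_call_failed = True

# The lambda receives the row and unpacks the three values the function needs
decisions = toy.apply(
    lambda row: activity_decision(row['PM2.5_category'], row['TEMP_category'], row['RAIN']),
    axis=1,
)

print(direct_call_failed)
print(decisions.tolist())
```

Only the middle row satisfies all three conditions: the first is too cold and the third has rain.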

In [20]:
# Now let's assign the result to a new column
air_quality['activity'] = air_quality.apply(lambda row: activity_decision(row['PM2.5_category'], row['TEMP_category'], row['RAIN']), axis = 1)

In [21]:
# Check new column with other columns
air_quality[['activity', 'PM2.5_category','TEMP_category','RAIN']]

Unnamed: 0,activity,PM2.5_category,TEMP_category,RAIN
0,Stay inside,Good,Very Cold,0.0
1,Stay inside,Good,Very Cold,0.0
2,Stay inside,Good,Very Cold,0.0
3,Stay inside,Good,Very Cold,0.0
4,Stay inside,Good,Very Cold,0.0
...,...,...,...,...
95680,Go outside,Good,Warm,0.0
95681,Go outside,Good,Warm,0.0
95682,Go outside,Moderate,Warm,0.0
95683,Go outside,Moderate,Warm,0.0


In [22]:
# Now we can analyze how often we could go outside.
# We could only go outside about 12.5% of the time... not good
air_quality['activity'].value_counts(normalize=True)

activity
Stay inside    0.874777
Go outside     0.125223
Name: proportion, dtype: float64
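
For reference, `normalize=True` is what turns the raw counts into shares that sum to 1. A minimal sketch on a hypothetical eight-row column (the values are made up so the arithmetic is easy to check):

```python
import pandas as pd

# Hypothetical toy column: 7 'Stay inside' rows and 1 'Go outside' row
activity = pd.Series(['Stay inside'] * 7 + ['Go outside'], name='activity')

# Raw counts per category
counts = activity.value_counts()

# Shares per category: each count divided by the total number of rows
shares = activity.value_counts(normalize=True)

print(counts)
print(shares)
```

Here 1 of 8 rows is 'Go outside', so its share is 0.125.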