# Applying functions to pandas `Series` or `DataFrames`

In [1]:
## Base imports

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)  # use the full option name; the short form is ambiguous in recent pandas

In [2]:
# read the titanic dataset from Kaggle's Titanic competition into a DataFrame
titanic = pd.read_csv('./Data/titanic.csv')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**Aim:** Map the existing values of a Series to a different set of values

**Method:** [**`map`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) (Series method)

In [3]:
# map 'female' to 0 and 'male' to 1
titanic['Sex_num'] = titanic.Sex.map({'female':0, 'male':1})
titanic.loc[0:4, ['Sex', 'Sex_num']]

Unnamed: 0,Sex,Sex_num
0,male,1
1,female,0
2,female,0
3,female,0
4,male,1
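
One detail worth knowing about dictionary-based `map`: values that are missing from the dictionary become `NaN`. The sketch below uses a small made-up Series (the `'unknown'` entry is hypothetical and not part of the Titanic file) to show this.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
sex = pd.Series(['male', 'female', 'unknown'])

# Values that are not keys of the dictionary are mapped to NaN
sex_num = sex.map({'female': 0, 'male': 1})
print(sex_num)

# Count how many values were left unmapped
n_unmapped = int(sex_num.isna().sum())
print(n_unmapped)
```

Checking for `NaN` after a `map` is a cheap way to catch categories that were not anticipated in the dictionary.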


**Aim:** Apply a function to each element in a Series

**Method:** [**`apply`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html) (Series method)

**Note:** **`map`** can be substituted for **`apply`** in many cases, but **`apply`** is more flexible and thus is recommended

In [4]:
# calculate the length of each string in the 'Name' Series
titanic['Name_length'] = titanic.Name.apply(len)
titanic.loc[0:4, ['Name', 'Name_length']]

Unnamed: 0,Name,Name_length
0,"Braund, Mr. Owen Harris",23
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",51
2,"Heikkinen, Miss. Laina",22
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",44
4,"Allen, Mr. William Henry",24


In [5]:
# round up each element in the 'Fare' Series to the next integer
titanic['Fare_ceil'] = titanic.Fare.apply(np.ceil)
titanic.loc[0:4, ['Fare', 'Fare_ceil']]

Unnamed: 0,Fare,Fare_ceil
0,7.25,8.0
1,71.2833,72.0
2,7.925,8.0
3,53.1,54.0
4,8.05,9.0
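
For these two particular tasks a vectorised alternative happens to exist, so `apply` is not strictly required. The following is a minimal sketch on a toy sample (two names and fares copied from the output above), assuming the `.str` accessor and NumPy ufuncs are acceptable substitutes.

```python
import numpy as np
import pandas as pd

# Small sample copied from the Titanic output shown above
names = pd.Series(['Braund, Mr. Owen Harris', 'Allen, Mr. William Henry'])
fares = pd.Series([7.25, 8.05])

# Vectorised string length via the .str accessor
name_length = names.str.len()

# NumPy ufuncs can be called on a Series directly
fare_ceil = np.ceil(fares)

print(name_length.tolist())
print(fare_ceil.tolist())
```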


In [6]:
# we want to extract the last name of each person
titanic.Name.head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

In [7]:
# use a string method to split the 'Name' Series at commas (returns a Series of lists)
titanic.Name.str.split(',').head()

0                           [Braund,  Mr. Owen Harris]
1    [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                            [Heikkinen,  Miss. Laina]
3      [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                          [Allen,  Mr. William Henry]
Name: Name, dtype: object

In [8]:
# define a function that returns an element from a list based on position
def get_element(my_list, position):
    return my_list[position]

In [9]:
# apply the 'get_element' function and pass 'position' as a keyword argument
titanic.Name.str.split(',').apply(get_element, position=0).head()

0       Braund
1      Cumings
2    Heikkinen
3     Futrelle
4        Allen
Name: Name, dtype: object
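
As a sketch of how `apply` forwards extra arguments, the example below passes `position` both as a keyword argument and through the `args` tuple. The two-row `parts` Series is a hypothetical stand-in for the split names.

```python
import pandas as pd

def get_element(my_list, position):
    """Return the element of my_list at the given position."""
    return my_list[position]

# Hypothetical stand-in for titanic.Name.str.split(',')
parts = pd.Series([['Braund', ' Mr. Owen Harris'],
                   ['Allen', ' Mr. William Henry']])

# Extra keyword arguments are forwarded to the function
last_names = parts.apply(get_element, position=0)

# Extra positional arguments go in the 'args' tuple
rest = parts.apply(get_element, args=(1,))

print(last_names.tolist())
print(rest.tolist())
```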

In [10]:
# alternatively, use a lambda function
titanic.Name.str.split(',').apply(lambda x: x[0]).head()

0       Braund
1      Cumings
2    Heikkinen
3     Futrelle
4        Allen
Name: Name, dtype: object
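
The `.str` accessor can also index into each list, which avoids `apply` for this specific case. A minimal sketch on two names taken from the output above:

```python
import pandas as pd

# Two names copied from the Titanic output shown above
names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])

# .str[0] takes the first element of each list produced by the split
last_names = names.str.split(',').str[0]

print(last_names.tolist())
```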

**Aim:** Apply a function along either axis of a `DataFrame`

**Method:** [**`apply`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) (DataFrame method)

In [11]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('./Data/alcohol.csv')
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [12]:
# select a subset of the DataFrame to work with
drinks.loc[:, 'beer_servings':'wine_servings'].head()

Unnamed: 0,beer_servings,spirit_servings,wine_servings
0,0,0,0
1,89,132,54
2,25,0,14
3,245,138,312
4,217,57,45


In [13]:
# apply the 'max' function along axis 0 to calculate the maximum value in each column
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0)

beer_servings      376
spirit_servings    438
wine_servings      370
dtype: int64

In [14]:
# apply the 'max' function along axis 1 to calculate the maximum value in each row
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=1).head()

0      0
1    132
2     25
3    312
4    217
dtype: int64
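
For a common reduction such as the maximum, the built-in `DataFrame.max` gives the same results as `apply(max, ...)`. The sketch below uses the first three rows of the servings columns shown above to illustrate what `axis=0` and `axis=1` mean.

```python
import pandas as pd

# First three rows of the servings columns shown above
df = pd.DataFrame({
    'beer_servings': [0, 89, 25],
    'spirit_servings': [0, 132, 0],
    'wine_servings': [0, 54, 14],
})

# axis=0: reduce down the rows, giving one value per column
col_max = df.max(axis=0)

# axis=1: reduce across the columns, giving one value per row
row_max = df.max(axis=1)

print(col_max.tolist())
print(row_max.tolist())
```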

In [15]:
# use 'idxmax' to find which column has the maximum value for each row
# ('np.argmax' returns an integer position rather than a column label in recent pandas versions)
drinks.loc[:, 'beer_servings':'wine_servings'].apply(pd.Series.idxmax, axis=1).head()

0      beer_servings
1    spirit_servings
2      beer_servings
3      wine_servings
4      beer_servings
dtype: object
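
A label-based way to get this result is `DataFrame.idxmax`, which returns the column label of each row's maximum without going through `apply`. The sketch below reuses the first three rows of the servings columns; when a row has ties, the first matching column is returned.

```python
import pandas as pd

# First three rows of the servings columns shown above
df = pd.DataFrame({
    'beer_servings': [0, 89, 25],
    'spirit_servings': [0, 132, 0],
    'wine_servings': [0, 54, 14],
})

# Column label holding the maximum of each row (first column wins on ties)
top_beverage = df.idxmax(axis=1)

print(top_beverage.tolist())
```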

**Aim:** Apply a function to every element in a `DataFrame`

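**Method:** [**`applymap`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html) (DataFrame method)

**Note:** Since pandas 2.1, **`applymap`** is deprecated in favour of **`DataFrame.map`**, which has the same element-wise behaviour.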

In [16]:
# convert every DataFrame element into a float
drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float).head()

Unnamed: 0,beer_servings,spirit_servings,wine_servings
0,0.0,0.0,0.0
1,89.0,132.0,54.0
2,25.0,0.0,14.0
3,245.0,138.0,312.0
4,217.0,57.0,45.0


In [17]:
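# overwrite the existing DataFrame columns
# (drinks[cols] = ... replaces the columns; in pandas 2.x, assigning through .loc[:, ...] may keep the original integer dtype)
cols = ['beer_servings', 'spirit_servings', 'wine_servings']
drinks[cols] = drinks[cols].applymap(float)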
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0.0,0.0,0.0,0.0,Asia
1,Albania,89.0,132.0,54.0,4.9,Europe
2,Algeria,25.0,0.0,14.0,0.7,Africa
3,Andorra,245.0,138.0,312.0,12.4,Europe
4,Angola,217.0,57.0,45.0,5.9,Africa
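
When the goal is only a type conversion, `astype` is a vectorised alternative to an element-wise function. The sketch below, on a two-row toy frame, also shows an element-wise route built from `apply` and `Series.map`, which is one way to avoid depending on `applymap`.

```python
import pandas as pd

# Two rows of the servings columns shown above
df = pd.DataFrame({'beer_servings': [0, 89], 'wine_servings': [0, 54]})
cols = ['beer_servings', 'wine_servings']

# Vectorised conversion of whole columns
as_float = df[cols].astype(float)

# Element-wise equivalent: map each column's values through float()
elementwise = df[cols].apply(lambda col: col.map(float))

# Assigning with df[cols] replaces the columns, so the new dtype is kept
df[cols] = as_float
print(df.dtypes)
```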


## Exercises
***

__1)__ For each country in the alcohol dataset, which beverage (beer, wine or spirits) has the lowest average consumption per person? Add the results as a new column.

__2)__ Calculate the difference in average servings between the most popular beverage and the least popular beverage for each country. Which country has the highest difference?

__3)__ For the titanic dataset, add a new column that shows the number of travellers travelling on each ticket.

# Recap
***

1. `map` and `apply` let you apply functions across pandas data structures. They are especially useful when no built-in vectorised method exists for the task.


2. **map** `Series` method - Map the existing values of a Series to a different set of values. 


3. **apply** `Series` method - Apply a function to each element in a Series. 


4. **apply** `DataFrame` method - Apply a function along either axis of a `DataFrame`.


5. **applymap** `DataFrame` method - Apply a function to every element in a `DataFrame`.




# __Further Reading__

## [Group By: split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/groupby.html)

In [32]:
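# Example: mean and minimum of each numeric column, per continent
# ('country' is dropped first because it is not numeric)
drinks.drop(columns='country').groupby('continent').agg(['mean', lambda x: x.min()])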

,beer_servings,beer_servings,spirit_servings,spirit_servings,wine_servings,wine_servings,total_litres_of_pure_alcohol,total_litres_of_pure_alcohol,diff,diff
continent,mean,<lambda>,mean,<lambda>,mean,<lambda>,mean,<lambda>,mean,<lambda>
Africa,61.471698,0.0,16.339623,0.0,16.264151,0.0,3.007547,0.0,61.433962,0.0
Asia,37.045455,0.0,60.840909,0.0,9.068182,0.0,2.170455,0.0,67.454545,0.0
Europe,193.777778,0.0,132.555556,0.0,142.222222,0.0,8.617778,0.0,160.222222,0.0
North America,145.434783,1.0,165.73913,68.0,24.521739,1.0,5.995652,2.2,178.521739,67.0
Oceania,89.6875,0.0,58.4375,0.0,35.625,0.0,3.38125,0.0,94.125,0.0
South America,175.083333,93.0,114.75,25.0,62.416667,1.0,6.308333,3.8,184.583333,48.0
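
To make the split-apply-combine idea concrete, here is a minimal sketch on four rows taken from the `drinks` output earlier in this notebook (Albania and Andorra for Europe, Algeria and Angola for Africa). It aggregates a single column with two named functions.

```python
import pandas as pd

# Four rows copied from the drinks output shown earlier
df = pd.DataFrame({
    'continent': ['Europe', 'Europe', 'Africa', 'Africa'],
    'beer_servings': [89, 245, 25, 217],
})

# Split by continent, apply mean and min, combine into one table
summary = df.groupby('continent')['beer_servings'].agg(['mean', 'min'])

print(summary)
```

Passing function names as strings gives readable column labels (`mean`, `min`) instead of `<lambda>`.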


## [Reshaping & Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)

In [36]:
# Example

# pandas 2.0 and later require keyword arguments here
drinks.pivot(index='continent', columns='country', values='total_litres_of_pure_alcohol')

continent,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua & Barbuda,Argentina,Armenia,Australia,Austria,Azerbaijan,Bahamas,Bahrain,Bangladesh,Barbados,Belarus,Belgium,Belize,Benin,Bhutan,Bolivia,Bosnia-Herzegovina,Botswana,Brazil,Brunei,...,Syria,Tajikistan,Tanzania,Thailand,Timor-Leste,Togo,Tonga,Trinidad & Tobago,Tunisia,Turkey,Turkmenistan,Tuvalu,USA,Uganda,Ukraine,United Arab Emirates,United Kingdom,Uruguay,Uzbekistan,Vanuatu,Venezuela,Vietnam,Yemen,Zambia,Zimbabwe
Africa,,,0.7,,5.9,,,,,,,,,,,,,,1.1,,,,5.4,,,...,,,5.7,,,1.3,,,1.3,,,,,8.3,,,,,,,,,,2.5,4.7
Asia,0.0,,,,,,,,,,,,2.0,0.0,,,,,,0.4,,,,,0.6,...,1.0,0.3,,6.4,0.1,,,,,1.4,2.2,,,,,2.8,,,2.4,,,2.0,0.1,,
Europe,,4.9,,12.4,,,,3.8,,9.7,1.3,,,,,14.4,10.5,,,,,4.6,,,,...,,,,,,,,,,,,,,,8.9,,10.4,,,,,,,,
North America,,,,,,4.9,,,,,,6.3,,,6.3,,,6.8,,,,,,,,...,,,,,,,,6.4,,,,,8.7,,,,,,,,,,,,
Oceania,,,,,,,,,10.4,,,,,,,,,,,,,,,,,...,,,,,,,1.1,,,,,1.0,,,,,,,,0.9,,,,,
South America,,,,,,,8.3,,,,,,,,,,,,,,3.8,,,7.2,,...,,,,,,,,,,,,,,,,,,6.6,,,7.7,,,,
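
The shape of a pivoted table is easier to see on a small input. The sketch below pivots three rows copied from the `drinks` output earlier in this notebook; each country appears under exactly one continent, so every other cell is `NaN`.

```python
import pandas as pd

# Three rows copied from the drinks output shown earlier
df = pd.DataFrame({
    'continent': ['Asia', 'Europe', 'Africa'],
    'country': ['Afghanistan', 'Albania', 'Algeria'],
    'total_litres_of_pure_alcohol': [0.0, 4.9, 0.7],
})

# Rows become continents, columns become countries
wide = df.pivot(index='continent', columns='country',
                values='total_litres_of_pure_alcohol')

print(wide)
```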


## [Time Series/Date Functionality](https://pandas.pydata.org/pandas-docs/stable/timeseries.html)
