# Creating Calculated Columns in `pandas`

In this notebook, you'll see a few ways to create calculated columns in pandas.

In [1]:
import pandas as pd
import numpy as np

In [38]:
weather = pd.read_csv('data/weather.csv')

In [39]:
weather.head()

Unnamed: 0,Date,High_Temp,Low_Temp,Avg_Temp,Precipitation
0,2020-08-01,82,70,76.0,0.33
1,2020-08-02,86,66,76.0,0.0
2,2020-08-03,89,67,78.0,0.0
3,2020-08-04,84,70,77.0,0.0
4,2020-08-05,87,67,77.0,0.0


## Method 1: Vectorized Operations

A vectorized operation applies a calculation to an entire column (or several columns) at once, rather than to one value at a time. This is the preferred method because it is almost always the fastest, so use it whenever possible.

Example: Let's say we want to convert our high temperature, which is currently in degrees Fahrenheit, to degrees Celsius. Recall that to convert from Fahrenheit to Celsius, you subtract 32 and then multiply by 5/9.

In [40]:
weather['High_Temp_Celsius'] = (weather['High_Temp'] - 32) * 5/9

In [41]:
weather.head()

Unnamed: 0,Date,High_Temp,Low_Temp,Avg_Temp,Precipitation,High_Temp_Celsius
0,2020-08-01,82,70,76.0,0.33,27.777778
1,2020-08-02,86,66,76.0,0.0,30.0
2,2020-08-03,89,67,78.0,0.0,31.666667
3,2020-08-04,84,70,77.0,0.0,28.888889
4,2020-08-05,87,67,77.0,0.0,30.555556


You can also create new columns by combining two or more other columns. Let's say we want to calculate the range of temperature values.

In [42]:
weather['Temp_Range'] = weather['High_Temp'] - weather['Low_Temp']

You can also use many NumPy functions, which are vectorized.

In [43]:
weather['Sqrt_Temp'] = np.sqrt(weather['High_Temp'])
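
As a small sketch of the same idea with other NumPy functions, the block below uses a hypothetical three-row DataFrame named `temps` (not the notebook's `weather` data) so that it can run on its own. The new column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mirrors the High_Temp and Low_Temp columns.
temps = pd.DataFrame({'High_Temp': [82, 86, 89], 'Low_Temp': [70, 66, 67]})

# np.log works element-wise on a Series and returns a Series.
temps['Log_High'] = np.log(temps['High_Temp'])

# np.maximum compares two columns element by element.
temps['Max_Temp'] = np.maximum(temps['High_Temp'], temps['Low_Temp'])

# np.abs also operates on the whole column at once.
temps['Abs_Diff'] = np.abs(temps['Low_Temp'] - temps['High_Temp'])

print(temps)
```

Each of these calls processes the whole column in one step, with no explicit Python loop.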

In [44]:
weather.head()

Unnamed: 0,Date,High_Temp,Low_Temp,Avg_Temp,Precipitation,High_Temp_Celsius,Temp_Range,Sqrt_Temp
0,2020-08-01,82,70,76.0,0.33,27.777778,12,9.055385
1,2020-08-02,86,66,76.0,0.0,30.0,20,9.273618
2,2020-08-03,89,67,78.0,0.0,31.666667,22,9.433981
3,2020-08-04,84,70,77.0,0.0,28.888889,14,9.165151
4,2020-08-05,87,67,77.0,0.0,30.555556,20,9.327379


## Using `np.where` for single conditions

In [45]:
weather['Above_80'] = np.where(
    weather['Avg_Temp'] > 80, 1, 0
)
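
To spell out what the call above does: `np.where(condition, x, y)` returns `x` where the condition is true and `y` where it is false. The following minimal sketch uses a hypothetical DataFrame named `avg`, and the `Label` column is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same Avg_Temp column name as the notebook.
avg = pd.DataFrame({'Avg_Temp': [76.0, 80.5, 83.0]})

# 1 where the average temperature exceeds 80, otherwise 0.
avg['Above_80'] = np.where(avg['Avg_Temp'] > 80, 1, 0)

# The two outcomes do not have to be numbers; strings work as well.
avg['Label'] = np.where(avg['Avg_Temp'] > 80, 'hot', 'mild')

print(avg)
```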

In [50]:
weather

Unnamed: 0,Date,High_Temp,Low_Temp,Avg_Temp,Precipitation,High_Temp_Celsius,Temp_Range,Sqrt_Temp,Above_80
0,2020-08-01,82,70,76.0,0.33,27.777778,12,9.055385,0
1,2020-08-02,86,66,76.0,0,30.0,20,9.273618,0
2,2020-08-03,89,67,78.0,0,31.666667,22,9.433981,0
3,2020-08-04,84,70,77.0,0,28.888889,14,9.165151,0
4,2020-08-05,87,67,77.0,0,30.555556,20,9.327379,0
5,2020-08-06,89,68,78.5,0,31.666667,21,9.433981,0
6,2020-08-07,90,68,79.0,0,32.222222,22,9.486833,0
7,2020-08-08,94,67,80.5,0,34.444444,27,9.69536,1
8,2020-08-09,95,71,83.0,0.04,35.0,24,9.746794,1
9,2020-08-10,97,74,85.5,0,36.111111,23,9.848858,1


## Using `np.select` for multiple conditions

In [51]:
conditions = [
    weather['High_Temp'] > 90,
    weather['Low_Temp'] < 70
]

choices = [
    'Crazy Hot', 'Crazy Cold'
]

weather['Extreme_Temp'] = np.select(conditions, choices, default='A Good Day')
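
One detail worth noting is that `np.select` checks the conditions in order and uses the choice for the first condition that is true. The sketch below uses a hypothetical DataFrame named `days` whose last row satisfies both conditions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the last row has a high above 90 AND a low below 70.
days = pd.DataFrame({'High_Temp': [82, 86, 94], 'Low_Temp': [70, 66, 67]})

conditions = [
    days['High_Temp'] > 90,
    days['Low_Temp'] < 70
]
choices = ['Crazy Hot', 'Crazy Cold']

# Rows matching no condition receive the default value.
days['Extreme_Temp'] = np.select(conditions, choices, default='A Good Day')

print(days)
```

Because the first matching condition wins, the last row is labeled 'Crazy Hot' rather than 'Crazy Cold'. Reordering the lists would change that result.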

In [52]:
weather

Unnamed: 0,Date,High_Temp,Low_Temp,Avg_Temp,Precipitation,High_Temp_Celsius,Temp_Range,Sqrt_Temp,Above_80,Extreme_Temp
0,2020-08-01,82,70,76.0,0.33,27.777778,12,9.055385,0,A Good Day
1,2020-08-02,86,66,76.0,0,30.0,20,9.273618,0,Crazy Cold
2,2020-08-03,89,67,78.0,0,31.666667,22,9.433981,0,Crazy Cold
3,2020-08-04,84,70,77.0,0,28.888889,14,9.165151,0,A Good Day
4,2020-08-05,87,67,77.0,0,30.555556,20,9.327379,0,Crazy Cold
5,2020-08-06,89,68,78.5,0,31.666667,21,9.433981,0,Crazy Cold
6,2020-08-07,90,68,79.0,0,32.222222,22,9.486833,0,Crazy Cold
7,2020-08-08,94,67,80.5,0,34.444444,27,9.69536,1,Crazy Hot
8,2020-08-09,95,71,83.0,0.04,35.0,24,9.746794,1,Crazy Hot
9,2020-08-10,97,74,85.5,0,36.111111,23,9.848858,1,Crazy Hot


## Method 2: `.apply`

You can use functions with `.apply`. Generally, `.apply` will be slower than using a vectorized solution.

In [9]:
weather['Sqrt_Temp'] = weather['High_Temp'].apply(np.sqrt)
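
If you want to check the speed difference on your own machine, one rough way is to time both approaches with `timeit`. This is only a sketch: the DataFrame `big`, its size, and the helper `to_celsius` are assumptions made for illustration, and the timings will vary from system to system.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical column of 100,000 random temperatures.
rng = np.random.default_rng(0)
big = pd.DataFrame({'High_Temp': rng.integers(60, 100, size=100_000)})

def to_celsius(temp):
    # Plain Python function, called once per value by .apply.
    return (temp - 32) * 5 / 9

# Both approaches compute the same values.
vectorized = (big['High_Temp'] - 32) * 5 / 9
applied = big['High_Temp'].apply(to_celsius)

# Time five repetitions of each approach.
t_vec = timeit.timeit(lambda: (big['High_Temp'] - 32) * 5 / 9, number=5)
t_apply = timeit.timeit(lambda: big['High_Temp'].apply(to_celsius), number=5)

print(f'vectorized: {t_vec:.4f}s  apply: {t_apply:.4f}s')
```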

You can also write your own functions and use them with `.apply`.

In [10]:
def convert_fahrenheit_to_celsius(temp):
    return (temp - 32) * 5/9

In [11]:
weather['Low_Temp_Celsius'] = weather['Low_Temp'].apply(convert_fahrenheit_to_celsius)

In [12]:
weather.head()

Unnamed: 0,Date,High_Temp,Low_Temp,Avg_Temp,Precipitation,High_Temp_Celsius,Temp_Range,Sqrt_Temp,Low_Temp_Celsius
0,2020-08-01,82,70,76.0,0.33,27.777778,12,9.055385,21.111111
1,2020-08-02,86,66,76.0,0.0,30.0,20,9.273618,18.888889
2,2020-08-03,89,67,78.0,0.0,31.666667,22,9.433981,19.444444
3,2020-08-04,84,70,77.0,0.0,28.888889,14,9.165151,21.111111
4,2020-08-05,87,67,77.0,0.0,30.555556,20,9.327379,19.444444


## Method 2b: `.apply` with a lambda function

Recall that a **lambda function** is an anonymous function. Lambda functions are useful if you only need to use a function a single time.

A lambda passed to `.apply` is still called once per value, so it is no faster than a named Python function and is generally much slower than a vectorized solution. Avoid it if possible.

In [13]:
weather['Low_Temp_Celsius'] = weather['Low_Temp'].apply(lambda x: (x - 32) * 5/9)

If you have a function that involves the values from two or more columns, you can use `.apply` on the whole DataFrame with a lambda function. In this case, you need to specify `axis = 1` so that the function is applied to each row rather than to each column.

Note: for this particular example you would definitely just use vectorized operations, but for more complicated or nontrivial operations on the columns, you may need to use the `.apply` approach.

In [14]:
def difference(a, b):
    return a - b

In [15]:
# Note the axis = 1 argument
weather['Temp_Range'] = weather.apply(lambda row: difference(row['High_Temp'], row['Low_Temp']), axis = 1)
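
As an illustration of a row-wise function that combines several columns, here is a minimal sketch. The DataFrame `sample` and the function `summarize_day` are hypothetical and are not part of the weather notebook.

```python
import pandas as pd

# Hypothetical sample with the same column names as the weather data.
sample = pd.DataFrame({
    'Date': ['2020-08-01', '2020-08-02'],
    'High_Temp': [82, 86],
    'Low_Temp': [70, 66],
})

def summarize_day(row):
    # With axis=1, each row arrives as a Series indexed by column name.
    return f"{row['Date']}: {row['Low_Temp']}-{row['High_Temp']} F"

sample['Summary'] = sample.apply(summarize_day, axis=1)

print(sample['Summary'].tolist())
```

Passing the named function directly works the same way as wrapping it in a lambda, since `.apply` only needs a callable that accepts one row.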

In [16]:
weather.head()

Unnamed: 0,Date,High_Temp,Low_Temp,Avg_Temp,Precipitation,High_Temp_Celsius,Temp_Range,Sqrt_Temp,Low_Temp_Celsius
0,2020-08-01,82,70,76.0,0.33,27.777778,12,9.055385,21.111111
1,2020-08-02,86,66,76.0,0.0,30.0,20,9.273618,18.888889
2,2020-08-03,89,67,78.0,0.0,31.666667,22,9.433981,19.444444
3,2020-08-04,84,70,77.0,0.0,28.888889,14,9.165151,21.111111
4,2020-08-05,87,67,77.0,0.0,30.555556,20,9.327379,19.444444


## Method 3: Iteration

Two ways to iterate through a dataframe are the `iterrows` and the `itertuples` methods.

The first method, `iterrows`, yields one tuple per row, containing the index value of the row and the content of that row as a `pandas` Series.

In [17]:
for idx, row in weather.iterrows():
    print(idx)
    print(row)
    print('-----------')

0
Date                 2020-08-01
High_Temp                    82
Low_Temp                     70
Avg_Temp                     76
Precipitation              0.33
High_Temp_Celsius       27.7778
Temp_Range                   12
Sqrt_Temp               9.05539
Low_Temp_Celsius        21.1111
Name: 0, dtype: object
-----------
1
Date                 2020-08-02
High_Temp                    86
Low_Temp                     66
Avg_Temp                     76
Precipitation                 0
High_Temp_Celsius            30
Temp_Range                   20
Sqrt_Temp               9.27362
Low_Temp_Celsius        18.8889
Name: 1, dtype: object
-----------
2
Date                 2020-08-03
High_Temp                    89
Low_Temp                     67
Avg_Temp                     78
Precipitation                 0
High_Temp_Celsius       31.6667
Temp_Range                   22
Sqrt_Temp               9.43398
Low_Temp_Celsius        19.4444
Name: 2, dtype: object
-----------
3
Date                 20

Since the second component of this tuple is a Series, you can access its elements by label, using square brackets.

In [18]:
for idx, row in weather.iterrows():
    print('Date: {}'.format(row['Date']))
    print('High Temperature: {}'.format(row['High_Temp']))
    print('----------')

Date: 2020-08-01
High Temperature: 82
----------
Date: 2020-08-02
High Temperature: 86
----------
Date: 2020-08-03
High Temperature: 89
----------
Date: 2020-08-04
High Temperature: 84
----------
Date: 2020-08-05
High Temperature: 87
----------
Date: 2020-08-06
High Temperature: 89
----------
Date: 2020-08-07
High Temperature: 90
----------
Date: 2020-08-08
High Temperature: 94
----------
Date: 2020-08-09
High Temperature: 95
----------
Date: 2020-08-10
High Temperature: 97
----------
Date: 2020-08-11
High Temperature: 92
----------
Date: 2020-08-12
High Temperature: 90
----------
Date: 2020-08-13
High Temperature: 94
----------
Date: 2020-08-14
High Temperature: 88
----------
Date: 2020-08-15
High Temperature: 90
----------
Date: 2020-08-16
High Temperature: 91
----------
Date: 2020-08-17
High Temperature: 91
----------
Date: 2020-08-18
High Temperature: 91
----------
Date: 2020-08-19
High Temperature: 85
----------
Date: 2020-08-20
High Temperature: 89
----------
Date: 2020-08-21
Hig
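
One caveat with `iterrows` is that each row becomes a single Series, so the original column dtypes are not necessarily preserved. The sketch below, which uses a hypothetical all-numeric DataFrame named `mixed`, shows an integer column being converted to float.

```python
import pandas as pd

# Hypothetical sample: one integer column and one float column.
mixed = pd.DataFrame({'High_Temp': [82, 86], 'Avg_Temp': [76.0, 76.5]})

# High_Temp is stored as an integer column.
print(mixed.dtypes)

for idx, row in mixed.iterrows():
    # A Series has a single dtype, so the integer is upcast to float here.
    print(idx, row['High_Temp'], type(row['High_Temp']))
```

If exact dtypes matter, this is one more reason to prefer vectorized operations or `itertuples`.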

The `itertuples` method is similar, but it instead yields a `namedtuple` for each row. Because it does not build a Series for each row, it is generally faster than `iterrows`.

In [19]:
for item in weather.itertuples():
    print(item)
    print('------')

Pandas(Index=0, Date='2020-08-01', High_Temp=82, Low_Temp=70, Avg_Temp=76.0, Precipitation='0.33', High_Temp_Celsius=27.777777777777779, Temp_Range=12, Sqrt_Temp=9.0553851381374173, Low_Temp_Celsius=21.111111111111111)
------
Pandas(Index=1, Date='2020-08-02', High_Temp=86, Low_Temp=66, Avg_Temp=76.0, Precipitation='0', High_Temp_Celsius=30.0, Temp_Range=20, Sqrt_Temp=9.2736184954957039, Low_Temp_Celsius=18.888888888888889)
------
Pandas(Index=2, Date='2020-08-03', High_Temp=89, Low_Temp=67, Avg_Temp=78.0, Precipitation='0', High_Temp_Celsius=31.666666666666668, Temp_Range=22, Sqrt_Temp=9.4339811320566032, Low_Temp_Celsius=19.444444444444443)
------
Pandas(Index=3, Date='2020-08-04', High_Temp=84, Low_Temp=70, Avg_Temp=77.0, Precipitation='0', High_Temp_Celsius=28.888888888888889, Temp_Range=14, Sqrt_Temp=9.1651513899116797, Low_Temp_Celsius=21.111111111111111)
------
Pandas(Index=4, Date='2020-08-05', High_Temp=87, Low_Temp=67, Avg_Temp=77.0, Precipitation='0', High_Temp_Celsius=30.55
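
The following sketch shows two ways of working with `itertuples`, using a hypothetical DataFrame named `sample`. The `index` and `name` parameters are documented options of `DataFrame.itertuples`.

```python
import pandas as pd

# Hypothetical sample with the same column names as the weather data.
sample = pd.DataFrame({
    'Date': ['2020-08-01', '2020-08-02'],
    'High_Temp': [82, 86],
    'Low_Temp': [70, 66],
})

# Default behavior: a namedtuple whose first field is the index.
for row in sample.itertuples():
    print(row.Index, row.Date, row.High_Temp - row.Low_Temp)

# index=False drops the Index field; name=None yields plain tuples,
# which can be unpacked directly in the loop header.
for date, high, low in sample.itertuples(index=False, name=None):
    print(date, high - low)
```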

Using either of these iteration methods, you can create a new calculated column. However, you should only use this as a last resort, or if you are doing some operation for which vectorized operations or `.apply` will not work.

Note that to access an element of a namedtuple, you use dot notation: a `.` followed by the field name, as in `row.Avg_Temp`.

In [20]:
# Write the converted value into the new column one row at a time.
for row in weather.itertuples():
    weather.loc[row.Index, "Avg_Temp_Celsius"] = (row.Avg_Temp - 32) * 5/9
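
An alternative pattern, sketched here under the assumption that the rows are processed in their original order, is to collect the results in a list and assign the whole list as a column once the loop finishes. The DataFrame `sample` and the list name `avg_celsius` are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the same Avg_Temp column name as the notebook.
sample = pd.DataFrame({'Avg_Temp': [76.0, 76.0, 78.0]})

# Collect one converted value per row, in row order.
avg_celsius = []
for row in sample.itertuples():
    avg_celsius.append((row.Avg_Temp - 32) * 5 / 9)

# Assign the full list as a new column in a single step.
sample['Avg_Temp_Celsius'] = avg_celsius

print(sample)
```

This avoids a separate `.loc` assignment on every iteration, although a vectorized expression remains the simpler choice when one is available.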

In [21]:
weather.head()

Unnamed: 0,Date,High_Temp,Low_Temp,Avg_Temp,Precipitation,High_Temp_Celsius,Temp_Range,Sqrt_Temp,Low_Temp_Celsius,Avg_Temp_Celsius
0,2020-08-01,82,70,76.0,0.33,27.777778,12,9.055385,21.111111,24.444444
1,2020-08-02,86,66,76.0,0.0,30.0,20,9.273618,18.888889,24.444444
2,2020-08-03,89,67,78.0,0.0,31.666667,22,9.433981,19.444444,25.555556
3,2020-08-04,84,70,77.0,0.0,28.888889,14,9.165151,21.111111,25.0
4,2020-08-05,87,67,77.0,0.0,30.555556,20,9.327379,19.444444,25.0
