<img src=images/gdd-logo.png width=300px align=right> 

# Creating New Columns

You will often want to derive a new column from existing ones so that you can use it later in your analysis.

This notebook covers:

* [Creating new columns: avoid common bad practice](#bad-pract)
* [Using `.assign()` to create new columns](#assign)
    * [<mark>Exercise: Make new weight columns</mark>](#ex-weight)

First of all, let's load Pandas and the dataset again:

In [1]:
import pandas as pd

chickweight = pd.read_csv('data/chickweight.csv').rename(str.lower, axis='columns')

## A note on style...

When the chickweight data is read in, the columns are renamed straight away, so several functions and methods are chained together in a single statement.

Chaining lets you do several things in one expression: not just renaming columns but also adding columns, filtering rows, and aggregating data. As a result, your lines of code can become very long.

For this reason, it is best to wrap your code in brackets so that you can spread it over multiple lines. Python only lets an expression continue onto the next line when it sits inside some form of brackets (or when each line ends with a backslash, which is harder to read).

In [None]:
chickweight = (
    pd.read_csv('data/chickweight.csv')
    .rename(str.lower, axis='columns') # add a new line at each new method 
)
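As a minimal sketch of the same style on a longer chain, the example below uses a small made-up DataFrame (`toy` and its values are hypothetical stand-ins for the CSV file) and puts one method on each line:

```python
import pandas as pd

# Hypothetical stand-in for the CSV file, with capitalised column names
toy = pd.DataFrame({'Weight': [42, 51, 59], 'Diet': [1, 1, 2]})

result = (
    toy
    .rename(str.lower, axis='columns')        # lower-case the column names
    .query('diet == 1')                       # keep only the rows on diet 1
    .sort_values('weight', ascending=False)   # heaviest first
)
print(result)
```

Each step can be commented out or reordered on its own line, which is much harder to do when the whole chain sits on one line.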

<a id='bad-pract'></a>
## Creating new columns: avoid common bad practice

First let's add a piece of code that explains something about the data:

In [3]:
print("There are", chickweight.shape[1], "columns in the DataFrame")

There are 5 columns in the DataFrame


Now say you want to create a new column where the weight is doubled.

You could assign to a new column name using square brackets, as seen below.

In [4]:
chickweight['weight_doubled'] = chickweight['weight'] * 2

In [5]:
chickweight.head()

Unnamed: 0,rownum,weight,time,chick,diet,weight_doubled
0,1,42,0,1,1,84
1,2,51,2,1,1,102
2,3,59,4,1,1,118
3,4,64,6,1,1,128
4,5,76,8,1,1,152


However, adding columns like this is considered bad practice, because it modifies the original DataFrame in place.

**Code should always perform in the same way regardless of where it is in the project**

Let's rerun that same code from the cell above. Does the output change?

In [None]:
print("There are", chickweight.shape[1], "columns in the DataFrame")

Now the same code reports something different about the data. This causes confusion about which data is being used and how it has been edited along the way, leading to Pandas frustration...

<img src='images/04_Creating_Columns/panda.gif' width='300px' align='left'>

To avoid this, you **do not want to overwrite your data frame.** This is a philosophy that is being adopted more and more, and being built into pandas itself.
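The effect can be reproduced in isolation. The sketch below assumes a tiny made-up DataFrame (`toy` is hypothetical, not the chickweight data) and evaluates the same expression before and after an in-place assignment:

```python
import pandas as pd

# Hypothetical toy data standing in for chickweight
toy = pd.DataFrame({'weight': [42, 51, 59], 'diet': [1, 1, 2]})

columns_before = toy.shape[1]               # column count before the change
toy['weight_doubled'] = toy['weight'] * 2   # modifies toy in place
columns_after = toy.shape[1]                # same expression, different answer

print(columns_before, columns_after)
```

The result of `toy.shape[1]` now depends on whether the assignment has already run, which is exactly the kind of order dependence to avoid.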

***Thought experiment: "Ok so how about I just copy the DataFrame and make the changes I need"***

In [None]:
chickweight_temp = chickweight.copy()
chickweight_temp['weight_doubled'] = chickweight_temp['weight'] * 2

<details>
    <summary><font color=blue>How could this lead to issues?</font></summary>

In this case you didn't overwrite the DataFrame, but you may end up with many versions of it, which wastes memory and quickly becomes confusing.

</details>

<mark>***So what's the answer?***</mark>

<a id='assign'></a>
## Using `.assign()` to create new columns
You can tell pandas to make a new column with `.assign()`, and specify **how** to calculate it, either with an expression or with a lambda function.

In [9]:
(
    chickweight
    .assign(weight_doubled = chickweight['weight'] * 2)
)

Unnamed: 0,rownum,weight,time,chick,diet,weight_doubled
0,1,42,0,1,1,84
1,2,51,2,1,1,102
2,3,59,4,1,1,118
3,4,64,6,1,1,128
4,5,76,8,1,1,152
...,...,...,...,...,...,...
573,574,175,14,50,4,350
574,575,205,16,50,4,410
575,576,234,18,50,4,468
576,577,264,20,50,4,528


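Note that `.assign()` returns a new DataFrame and leaves the original unchanged. (The `weight_doubled` column in the output below is left over from the earlier in-place assignment, not from `.assign()`.)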

In [7]:
chickweight.head()

Unnamed: 0,rownum,weight,time,chick,diet,weight_doubled
0,1,42,0,1,1,84
1,2,51,2,1,1,102
2,3,59,4,1,1,118
3,4,64,6,1,1,128
4,5,76,8,1,1,152
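Because `chickweight` was already modified in place earlier in this notebook, the point is easier to see on a fresh DataFrame. A minimal sketch, assuming a made-up `toy` DataFrame:

```python
import pandas as pd

# Hypothetical toy data that has never been modified in place
toy = pd.DataFrame({'weight': [42, 51, 59]})

# .assign() returns a new DataFrame; toy itself is left alone
with_doubled = toy.assign(weight_doubled=toy['weight'] * 2)

print(list(toy.columns))           # original columns only
print(list(with_doubled.columns))  # original columns plus the new one
```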


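But how can you use this new column in the same chain? For example, to create a new column called `weight_quadrupled`. On a DataFrame that does not already contain `weight_doubled`, the code below would raise a `KeyError`, because `chickweight['weight_doubled']` looks in the original DataFrame rather than in the intermediate result of the first `.assign()`. It only runs here because the earlier in-place assignment already added that column to `chickweight`: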

In [12]:
(
    chickweight
    .assign(weight_doubled = chickweight['weight'] * 2)
    .assign(weight_quadrupled = chickweight['weight_doubled'] * 2)
).head()

Unnamed: 0,rownum,weight,time,chick,diet,weight_doubled,weight_quadrupled
0,1,42,0,1,1,84,168
1,2,51,2,1,1,102,204
2,3,59,4,1,1,118,236
3,4,64,6,1,1,128,256
4,5,76,8,1,1,152,304
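To see the failure in isolation, here is a minimal sketch on a fresh, made-up DataFrame (`toy` is hypothetical) that has never had `weight_doubled` added in place:

```python
import pandas as pd

# Hypothetical toy data without a weight_doubled column
toy = pd.DataFrame({'weight': [42, 51, 59]})

try:
    (
        toy
        .assign(weight_doubled=toy['weight'] * 2)
        # toy itself has no weight_doubled column, so this lookup fails
        .assign(weight_quadrupled=toy['weight_doubled'] * 2)
    )
    error_raised = False
except KeyError as err:
    error_raised = True
    print("KeyError:", err)
```

The lookup `toy['weight_doubled']` is evaluated against `toy` before the second `.assign()` is even called, so the column created by the first `.assign()` is never seen.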


This is where the anonymous function `lambda` can come into play! 

Let's make a lambda function with a `DataFrame` as its argument. Typically "nameless" DataFrames get given the shorthand `df`.

In [None]:
my_lambda_function = lambda df: df['weight_doubled'] * 2

Calling this lambda function looks like this:

In [None]:
my_lambda_function(chickweight)

Now this logic can be used inside of the assign method:

In [11]:
(
    chickweight
    .assign(weight_doubled = chickweight['weight'] * 2)
    .assign(weight_quadrupled = lambda df: df['weight_doubled'] * 2)
).head()

Unnamed: 0,rownum,weight,time,chick,diet,weight_doubled,weight_quadrupled
0,1,42,0,1,1,84,168
1,2,51,2,1,1,102,204
2,3,59,4,1,1,118,236
3,4,64,6,1,1,128,256
4,5,76,8,1,1,152,304


It can even go in the same `.assign()` call, as long as it comes after `weight_doubled` is created.

In [None]:
(
    chickweight
    .assign(weight_doubled = chickweight['weight'] * 2, 
            weight_quadrupled = lambda df: df['weight_doubled'] * 2)
).head()

In fact, for the sake of consistency and scalability, using a lambda function is always recommended:

In [None]:
(
    chickweight
    .assign(weight_doubled = lambda df: df['weight'] * 2, 
            weight_quadrupled = lambda df: df['weight_doubled'] * 2)
).head()
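One reason the lambda form scales better is that `df` always refers to the DataFrame at that point in the chain, including any earlier filtering. The sketch below assumes a made-up `toy` DataFrame and hypothetical column names, and contrasts a lambda with a direct reference to the original:

```python
import pandas as pd

# Hypothetical toy data: three chicks on diet 1, one on diet 2
toy = pd.DataFrame({'weight': [40, 50, 60, 100], 'diet': [1, 1, 1, 2]})

result = (
    toy
    .query('diet == 1')
    .assign(
        # df is the filtered DataFrame, so the mean covers diet 1 only
        diff_from_mean=lambda df: df['weight'] - df['weight'].mean(),
        # toy is the unfiltered DataFrame, so this mean includes diet 2
        diff_from_overall_mean=toy['weight'] - toy['weight'].mean(),
    )
)
print(result)
```

Both columns are created without error, but only the lambda version reflects the filter applied earlier in the chain.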

<a id=ex-weight></a>

### <mark>Exercise: Make new weight columns</mark>

1. Assuming that the chick weights are in grams, can you add a column that gives the chick weights in kg?
2. In the same `.assign()`, also add the chick weights in pounds.

*1000 g = 1 kg = 2.205 pounds*

In [28]:
(
    chickweight
    # both columns are created in the same .assign(), as the exercise asks
    .assign(weight_kg = lambda df: df['weight'] / 1000,
            weight_FREEDOM_units = lambda df: df['weight_kg'] * 2.205)
)

Unnamed: 0,rownum,weight,time,chick,diet,weight_doubled,weight_kg,weight_FREEDOM_units
0,1,42,0,1,1,84,0.042,0.092610
1,2,51,2,1,1,102,0.051,0.112455
2,3,59,4,1,1,118,0.059,0.130095
3,4,64,6,1,1,128,0.064,0.141120
4,5,76,8,1,1,152,0.076,0.167580
...,...,...,...,...,...,...,...,...
573,574,175,14,50,4,350,0.175,0.385875
574,575,205,16,50,4,410,0.205,0.452025
575,576,234,18,50,4,468,0.234,0.515970
576,577,264,20,50,4,528,0.264,0.582120


In [None]:
# %load answers/04_Creating_Columns/new-column.py

## Dropping columns

Note that you can also drop columns if required!

In [None]:
(
    chickweight
    .drop(columns = ['rownum', 'time'])
).head()
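Like `.assign()`, `.drop()` returns a new DataFrame rather than changing the original. A minimal sketch, assuming a made-up `toy` DataFrame with the same column names:

```python
import pandas as pd

# Hypothetical toy data with some of the chickweight column names
toy = pd.DataFrame({'rownum': [1, 2], 'weight': [42, 51], 'time': [0, 2]})

trimmed = toy.drop(columns=['rownum', 'time'])

print(list(trimmed.columns))  # only the columns that were kept
print(list(toy.columns))      # the original still has all three
```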

## Conclusion

The `.assign()` method is best practice for creating new columns. DataFrames **are mutable objects**, so when you create new columns or make any other changes, take care not to modify the original DataFrame by accident.