# Mutating 

In this lesson we will consider how to **change** the data held in a `DataFrame`: in data analysis this is often termed **mutation**. We will see that we can persist changes either to the originating `DataFrame` or to a copy, and we will consider the `SettingWithCopyWarning` that often occurs when using `pandas`.  

# Setup

In [1]:
import pandas as pd
import numpy as np

stock = pd.DataFrame({
    'item_no': pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='Int64'),
    'cost_class': pd.Series(['1st', '2nd', '3rd', '4th', '4th', '3rd', '2nd', np.nan, '1st', '3rd'], dtype='string'),
    'cost': pd.Series([10.99, np.nan, 2.99, np.nan, 2.99, 2.45, 5.99, 5.99, 3.00, None], dtype='float64'),
    'stock_code': pd.Series(['a', 'a', 'c', 'b', 'a', 'b', np.nan, np.nan, 'a', 'c'], dtype='string'),
    'priority_code': pd.Series([np.nan, None, 'a', 'b', None, 'a', 'e', None, 'a', 'd'], dtype='string'),
    'tax_rate': pd.Series([0, 0, 20, 20, 20, 0, 20, 20, 5, 20])
}).set_index('item_no')


# Recap: Adding and removing columns and rows

Adding new columns to a `DataFrame` is straightforward in `pandas`. Either:

* assign to the new column using `.loc[]`

For example, to tag our stock items with a column `year`, we might do the following 

In [2]:
stock.loc[:, 'year'] = 2020
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year
1,1st,10.99,a,,0,2020
2,2nd,,a,,0,2020
3,3rd,2.99,c,a,20,2020
4,4th,,b,b,20,2020
5,4th,2.99,a,,20,2020
6,3rd,2.45,b,a,0,2020
7,2nd,5.99,,e,20,2020
8,,5.99,,,20,2020
9,1st,3.0,a,a,5,2020
10,3rd,,c,d,20,2020


* use the `.assign()` method

Note that a new column created with `.assign()` won't persist in the original `DataFrame`: it is largely intended to be used as part of a 'chain' of methods that is then assigned to a variable

In [3]:
stock.assign(new_year=2021, checked=True)

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,new_year,checked
1,1st,10.99,a,,0,2020,2021,True
2,2nd,,a,,0,2020,2021,True
3,3rd,2.99,c,a,20,2020,2021,True
4,4th,,b,b,20,2020,2021,True
5,4th,2.99,a,,20,2020,2021,True
6,3rd,2.45,b,a,0,2020,2021,True
7,2nd,5.99,,e,20,2020,2021,True
8,,5.99,,,20,2020,2021,True
9,1st,3.0,a,a,5,2020,2021,True
10,3rd,,c,d,20,2020,2021,True


In [4]:
# new_year and checked columns don't persist
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year
1,1st,10.99,a,,0,2020
2,2nd,,a,,0,2020
3,3rd,2.99,c,a,20,2020
4,4th,,b,b,20,2020
5,4th,2.99,a,,20,2020
6,3rd,2.45,b,a,0,2020
7,2nd,5.99,,e,20,2020
8,,5.99,,,20,2020
9,1st,3.0,a,a,5,2020
10,3rd,,c,d,20,2020
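
To keep columns created with `.assign()`, assign the result of the method chain to a variable. The following is a minimal sketch on a small, made-up `DataFrame`; the names `items` and `tagged` and the values are illustrative assumptions, not part of the lesson's data.

```python
import pandas as pd

# illustrative data, not the lesson's stock DataFrame
items = pd.DataFrame({'cost': [10.0, 2.5, 4.0]}, index=[1, 2, 3])

# chain methods, then assign the result of the whole chain to a new variable
tagged = (
    items
    .assign(new_year=2021, checked=True)
    .sort_values('cost')
)

print(tagged.columns.tolist())  # includes new_year and checked
print(items.columns.tolist())   # the original DataFrame is unchanged
```

Here `tagged` holds the new columns, while `items` still has only its original column.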


# Adding columns via list comprehension

Let's do something a bit more ambitious! Percentage adjustment factors for `cost` depend upon `cost_class` as follows: 

| cost_class | adjustment |
|---|---|
| 1st | +12.5% |
| 2nd | +5% | 
| 3rd | 0 | 
|4th | -5% | 

Let's create a new `cost_adjustment` column based on these values. First, let's create a `dictionary` to hold the adjustments. Generally speaking, a `dictionary` is often the right choice of data structure to use if your operation involves some sort of 'lookup' of values.

In [5]:
adjust_lookup = {
    '1st': 12.5,
    '2nd': 5,
    '3rd': 0,
    '4th': -5,
    pd.NA: np.nan
}

We will create the new column using a list comprehension. The `.get()` method of a `dictionary` is useful here, as it lets us specify a **default value** to return if a key is not found in the dictionary. Let's see this in action first

In [6]:
adjust_lookup.get('4th', np.nan)

-5

In [7]:
adjust_lookup.get('5th', np.nan)

nan

OK, we're ready to go! Let's create the new column using a list comprehension: look up each value of the `cost_class` column in the `adjust_lookup` dictionary (defaulting to `np.nan` if the value is not found).

In [8]:
stock.loc[:, 'cost_adjustment'] = [adjust_lookup.get(cc, np.nan) for cc in stock.cost_class]
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,10.99,a,,0,2020,12.5
2,2nd,,a,,0,2020,5.0
3,3rd,2.99,c,a,20,2020,0.0
4,4th,,b,b,20,2020,-5.0
5,4th,2.99,a,,20,2020,-5.0
6,3rd,2.45,b,a,0,2020,0.0
7,2nd,5.99,,e,20,2020,5.0
8,,5.99,,,20,2020,
9,1st,3.0,a,a,5,2020,12.5
10,3rd,,c,d,20,2020,0.0
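
As an aside, the same kind of lookup can also be written with the `.map()` method of a `Series`, which accepts a dictionary and returns `NaN` for any value not found among its keys. The sketch below uses made-up data: the `classes` `Series` and the variable names are assumptions for illustration, and the lookup omits the `pd.NA` key because unmatched values become `NaN` anyway.

```python
import numpy as np
import pandas as pd

lookup = {'1st': 12.5, '2nd': 5, '3rd': 0, '4th': -5}
classes = pd.Series(['1st', '4th', None, '5th'])

# list comprehension with a default value, as in the lesson
via_list = [lookup.get(cc, np.nan) for cc in classes]

# the same lookup with Series.map(); unmatched values become NaN
via_map = classes.map(lookup)

print(via_list)
print(via_map.tolist())
```

Both approaches give the same values; the list comprehension makes the default explicit, while `.map()` keeps the result as a `Series` aligned to the original index.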


# Adding columns from other columns

We can also add new columns that depend upon the values in other columns. For example, say we wish to add a column `cost_inc_tax`, treating the values in `tax_rate` as percentage increases to apply to the `cost` column.

In [9]:
stock.loc[:, 'cost_inc_tax'] = stock.cost + stock.tax_rate * stock.cost / 100
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment,cost_inc_tax
1,1st,10.99,a,,0,2020,12.5,10.99
2,2nd,,a,,0,2020,5.0,
3,3rd,2.99,c,a,20,2020,0.0,3.588
4,4th,,b,b,20,2020,-5.0,
5,4th,2.99,a,,20,2020,-5.0,3.588
6,3rd,2.45,b,a,0,2020,0.0,2.45
7,2nd,5.99,,e,20,2020,5.0,7.188
8,,5.99,,,20,2020,,7.188
9,1st,3.0,a,a,5,2020,12.5,3.15
10,3rd,,c,d,20,2020,0.0,
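
Note that rows with a missing `cost` also end up with a missing `cost_inc_tax`: column arithmetic is applied row by row, and arithmetic on a missing value yields a missing value. A minimal sketch with made-up values (the `prices` `DataFrame` is illustrative only):

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({
    'cost': [10.0, np.nan, 2.5],
    'tax_rate': [0, 20, 20],
})

# element-wise arithmetic: each row uses its own cost and tax_rate
prices.loc[:, 'cost_inc_tax'] = prices.cost + prices.tax_rate * prices.cost / 100

print(prices)
```

The second row has no `cost`, so its `cost_inc_tax` is missing as well.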


<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

You can see above that some of the values in `cost_inc_tax` have more than two decimal places, making it difficult to interpret them as currency values. Use the `np.round()` function (with appropriate choice for the `decimals` argument) to round these values to two decimal places.

**Hint** Think of the process this way: 

* Take the values in `cost_inc_tax`
* Process them through `np.round()`
* Store the result back in column `cost_inc_tax`

**Solution**

In [10]:
stock.loc[:, 'cost_inc_tax'] = np.round(stock.cost_inc_tax, decimals=2)
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment,cost_inc_tax
1,1st,10.99,a,,0,2020,12.5,10.99
2,2nd,,a,,0,2020,5.0,
3,3rd,2.99,c,a,20,2020,0.0,3.59
4,4th,,b,b,20,2020,-5.0,
5,4th,2.99,a,,20,2020,-5.0,3.59
6,3rd,2.45,b,a,0,2020,0.0,2.45
7,2nd,5.99,,e,20,2020,5.0,7.19
8,,5.99,,,20,2020,,7.19
9,1st,3.0,a,a,5,2020,12.5,3.15
10,3rd,,c,d,20,2020,0.0,


# Removing columns and rows

To remove a column, we can use the `.drop()` method. Let's get rid of the new column `cost_inc_tax` we just added

In [11]:
stock.drop('cost_inc_tax', axis='columns', inplace=True)
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,10.99,a,,0,2020,12.5
2,2nd,,a,,0,2020,5.0
3,3rd,2.99,c,a,20,2020,0.0
4,4th,,b,b,20,2020,-5.0
5,4th,2.99,a,,20,2020,-5.0
6,3rd,2.45,b,a,0,2020,0.0
7,2nd,5.99,,e,20,2020,5.0
8,,5.99,,,20,2020,
9,1st,3.0,a,a,5,2020,12.5
10,3rd,,c,d,20,2020,0.0


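We can also use `.drop()` to remove rows, passing the `index` values of the rows to drop. For example, let's find the `index` value of the row with a missing `cost_class` and use it to `.drop()` that row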

In [12]:
# we could set inplace=True to persist this change
stock.drop(stock.index[stock.cost_class.isna()], axis='rows')

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,10.99,a,,0,2020,12.5
2,2nd,,a,,0,2020,5.0
3,3rd,2.99,c,a,20,2020,0.0
4,4th,,b,b,20,2020,-5.0
5,4th,2.99,a,,20,2020,-5.0
6,3rd,2.45,b,a,0,2020,0.0
7,2nd,5.99,,e,20,2020,5.0
9,1st,3.0,a,a,5,2020,12.5
10,3rd,,c,d,20,2020,0.0


## Dedicated methods for dropping and imputing missing values

While we can get pretty far using `.drop()`, it is easier and more efficient to drop missing values with the `.dropna()` method, which is designed specifically for the purpose.

First, let's make sure we still have a missing value in `cost_class` (recall we haven't persisted our earlier effort to drop it)

In [13]:
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,10.99,a,,0,2020,12.5
2,2nd,,a,,0,2020,5.0
3,3rd,2.99,c,a,20,2020,0.0
4,4th,,b,b,20,2020,-5.0
5,4th,2.99,a,,20,2020,-5.0
6,3rd,2.45,b,a,0,2020,0.0
7,2nd,5.99,,e,20,2020,5.0
8,,5.99,,,20,2020,
9,1st,3.0,a,a,5,2020,12.5
10,3rd,,c,d,20,2020,0.0


Now let's use `.dropna()` to drop rows with a missing value in `cost_class` (and this time, persist the change)

In [14]:
stock.dropna(axis='rows', subset=['cost_class'], inplace=True)
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,10.99,a,,0,2020,12.5
2,2nd,,a,,0,2020,5.0
3,3rd,2.99,c,a,20,2020,0.0
4,4th,,b,b,20,2020,-5.0
5,4th,2.99,a,,20,2020,-5.0
6,3rd,2.45,b,a,0,2020,0.0
7,2nd,5.99,,e,20,2020,5.0
9,1st,3.0,a,a,5,2020,12.5
10,3rd,,c,d,20,2020,0.0


Have a look at the documentation for the `.dropna()` method: it offers additional arguments such as `how=` and `thresh=` that give a useful degree of control.
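
As a rough illustration of these arguments, the sketch below applies `how=` and `thresh=` to a small, made-up `DataFrame` (the data and variable names are assumptions for illustration). `how='any'` drops a row if any value is missing, `how='all'` drops it only if every value is missing, and `thresh=n` keeps rows with at least `n` non-missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, 2.0, np.nan, np.nan],
    'c': [1.0, 2.0, np.nan, np.nan],
}, index=['w', 'x', 'y', 'z'])

any_missing = df.dropna(how='any')   # drop rows with at least one missing value
all_missing = df.dropna(how='all')   # drop rows where every value is missing
at_least_two = df.dropna(thresh=2)   # keep rows with 2 or more non-missing values

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
```

Row `w` is complete, row `x` has two values, row `y` has none and row `z` has one, so the three calls keep different sets of rows.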

An often preferable alternative to dropping missing values is **imputation**: filling them with an appropriate value chosen arbitrarily or computed from existing values in the `DataFrame`. Let's see a simple case of imputation using the `.fillna()` method

> **Fill any missing values in cost with the median cost in the whole dataset**

The `.fillna()` method accepts a `dictionary` that lets you specify the imputation value for each column

In [15]:
stock.fillna({'cost': np.round(stock.cost.median(), 2)}, inplace=True)
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,10.99,a,,0,2020,12.5
2,2nd,3.0,a,,0,2020,5.0
3,3rd,2.99,c,a,20,2020,0.0
4,4th,3.0,b,b,20,2020,-5.0
5,4th,2.99,a,,20,2020,-5.0
6,3rd,2.45,b,a,0,2020,0.0
7,2nd,5.99,,e,20,2020,5.0
9,1st,3.0,a,a,5,2020,12.5
10,3rd,3.0,c,d,20,2020,0.0
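
Because the dictionary is keyed by column name, several columns can be imputed in one call, each with its own value, and columns not named in the dictionary are left alone. A minimal sketch on made-up data (the fill value `'unknown'` is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cost': [1.0, np.nan, 3.0],
    'stock_code': ['a', None, 'c'],
    'tax_rate': [20.0, np.nan, 5.0],
})

# only the columns named in the dictionary are filled
filled = df.fillna({'cost': df.cost.median(), 'stock_code': 'unknown'})

print(filled)
```

Without `inplace=True`, `.fillna()` returns a new `DataFrame` and leaves `df` as it was.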


# `SettingWithCopyWarning` - a common nuisance!

Imagine we check our records and find that: 

> **The costs of items in the 1st cost_class with stock_code 'a' are wrong: they need to be reduced by 10%.**

We might try something like this

In [16]:
# set up a mask to select the appropriate rows
mask = (stock.cost_class == '1st') & (stock.stock_code == 'a')
stock.loc[mask]

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,10.99,a,,0,2020,12.5
9,1st,3.0,a,a,5,2020,12.5


In [17]:
stock[mask]['cost'] = (stock[mask]['cost'] * 0.9).round(decimals=2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  stock[mask]['cost'] = (stock[mask]['cost'] * 0.9).round(decimals=2)


Note that we get a warning: `SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.`

What does this mean?

`SettingWithCopyWarning` is one of the most confusing aspects of starting to work with `pandas`. When we extract parts of a `DataFrame` using **'chained indexing'**, in this case using `[mask]` followed separately by `['cost']`, we can't be sure whether we are receiving:

* a **copy** of that part of the `DataFrame`, i.e. a **new  object with its own data**
* or a **view** of the original `DataFrame`, i.e. a **link to some part of the original `DataFrame`**

So when we assign to the returned object, **we can't be sure whether we are assigning to a view or a copy**! 

* If we assigned to a **view**, the changes **will persist** in the original `DataFrame`
* If we assigned to a **copy**, the changes **will not persist** in the original `DataFrame`

Let's check whether the changes have persisted in `stock`

In [18]:
stock[mask]

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,10.99,a,,0,2020,12.5
9,1st,3.0,a,a,5,2020,12.5


They have not! So it looks like we were assigning to a **copy**. Let's try the assignment another way

In [19]:
stock['cost'][mask] = (stock[mask]['cost'] * 0.9).round(decimals=2)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  stock['cost'][mask] = (stock[mask]['cost'] * 0.9).round(decimals=2)


Again we get the same `SettingWithCopyWarning`! Have the changes persisted to `stock`?

In [20]:
stock[mask]

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,9.89,a,,0,2020,12.5
9,1st,2.7,a,a,5,2020,12.5


Yes they have, so this time we assigned to a **view**. This sounds like chaos! How do you know whether `pandas` has provided you with a view or a copy? The internal rules it uses are fairly complex, but fortunately the solution is simple...

Let's reset the `cost` column so we ensure we are starting with a clean slate

In [21]:
original_stock_costs = pd.Series([10.99, np.nan, 2.99, np.nan, 2.99, 2.45, 5.99, 5.99, 3.00, None],
                                 index = range(1, 11))
stock.loc[:, 'cost'] = original_stock_costs

# Persisting changes: avoid chained indexing, use `.loc[]` or `.iloc[]` and assign on same line

To persist changes in a `DataFrame`, perform all indexing in a single call to `.loc[]` or `.iloc[]` and do the assignment on the same line.

In [22]:
stock[mask]

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,10.99,a,,0,2020,12.5
9,1st,3.0,a,a,5,2020,12.5


In [23]:
# perform all indexing using .loc[], assign on the same line
stock.loc[mask, 'cost'] = (stock.cost[mask] * 0.9).round(decimals=2)
stock.loc[mask]

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,9.89,a,,0,2020,12.5
9,1st,2.7,a,a,5,2020,12.5


Note this time we got no warning message, and the original `stock` `DataFrame` has been updated (for a second time)
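
The same pattern can be tried on a small, made-up `DataFrame`. The sketch below (data and names are illustrative assumptions) updates values with a single `.loc[]` call, and then with `.iloc[]`, which takes integer positions rather than labels or a boolean mask.

```python
import pandas as pd

df = pd.DataFrame({
    'cost': [10.0, 20.0, 30.0],
    'code': ['a', 'b', 'a'],
}, index=[1, 2, 3])

mask = df.code == 'a'

# rows and column selected in a single .loc[] call, assigned on the same line
df.loc[mask, 'cost'] = df.loc[mask, 'cost'] * 0.9

# .iloc[] works by position: first row, and the position of the 'cost' column
cost_pos = df.columns.get_loc('cost')
df.iloc[0, cost_pos] = 0.0

print(df)
```

In both cases the rows and the column are chosen inside one indexer, so the assignment acts on `df` itself.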

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 5 mins</u>**

***Add 5.00 to the cost of all the items with stock_code 'a'.*** 

Make sure you persist this change to the **original `stock` `DataFrame`** and not to a copy.

**Solution**

In [24]:
stock.loc[stock.stock_code == 'a', 'cost'] = stock.loc[stock.stock_code == 'a', 'cost'] + 5.00
stock

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
1,1st,14.89,a,,0,2020,12.5
2,2nd,,a,,0,2020,5.0
3,3rd,2.99,c,a,20,2020,0.0
4,4th,,b,b,20,2020,-5.0
5,4th,7.99,a,,20,2020,-5.0
6,3rd,2.45,b,a,0,2020,0.0
7,2nd,5.99,,e,20,2020,5.0
9,1st,7.7,a,a,5,2020,12.5
10,3rd,,c,d,20,2020,0.0


***

<hr style="border:8px solid black"> </hr>

# Leaving original `DataFrame` unchanged: use `.copy()`

***If you want to leave your original `DataFrame` unchanged, make an explicit copy of it using `.copy()` and work subsequently with the copied object.***

Now let's say we want to:

***Increase the cost of 'low cost' items by 2.00 ('low cost' here means less than or equal to 3.00)*** 

and we want to do this in a copy of `stock`. Let's code a solution

In [25]:
# first, copy stock
stock_copy = stock.copy()

# now get a mask for low cost items
low_cost_mask = stock.cost <= 3.00

stock_copy.loc[low_cost_mask, 'cost'] = stock_copy.cost[low_cost_mask] + 2.00

So now `stock_copy` has been changed 

In [26]:
stock_copy[low_cost_mask]

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
3,3rd,4.99,c,a,20,2020,0.0
6,3rd,4.45,b,a,0,2020,0.0


and the original `stock` `DataFrame` remains unchanged

In [27]:
stock[low_cost_mask]

item_no,cost_class,cost,stock_code,priority_code,tax_rate,year,cost_adjustment
3,3rd,2.99,c,a,20,2020,0.0
6,3rd,2.45,b,a,0,2020,0.0
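
Note that plain assignment does not make a copy: writing `other = stock` would simply give a second name to the same object. A minimal sketch on made-up data (all names here are illustrative) contrasting this with `.copy()`:

```python
import pandas as pd

original = pd.DataFrame({'cost': [1.0, 2.0, 3.0]})

alias = original               # a second name for the same DataFrame
independent = original.copy()  # a new object with its own data

independent.loc[0, 'cost'] = 100.0  # does not affect original
alias.loc[1, 'cost'] = 200.0        # changes original too

print(original.cost.tolist())
print(independent.cost.tolist())
```

Only the change made through `alias` shows up in `original`; the change made to `independent` stays in the copy.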


<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

* **Copy** the `cost` and `cost_class` columns from `stock` to a new `DataFrame` called `cost_copy` 
* Add a new column to `cost_copy` called `cost_zscore`, obtained by passing `cost` into the `z_score()` function defined below

In [28]:
def z_score(series):
    mean = series.mean()
    std = series.std()
    return (series - mean) / std

**Hint** - this will involve code like `... = z_score(cost_copy.cost)` after you have created `cost_copy`

**Solution**

In [29]:
cost_copy = stock[['cost', 'cost_class']].copy()
cost_copy.loc[:, 'cost_zscore'] = z_score(cost_copy.cost)

cost_copy

item_no,cost,cost_class,cost_zscore
1,14.89,1st,1.750588
2,,2nd,
3,2.99,3rd,-0.890274
4,,4th,
5,7.99,4th,0.219332
6,2.45,3rd,-1.010111
7,5.99,2nd,-0.22451
9,7.7,1st,0.154975
10,,3rd,


The `z-score` is commonly used in data analysis to indicate 'how far' each value lies from the centre of a distribution - the higher the value (positive or negative), the further from the centre.
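
One way to see what `z_score()` does is to check its output on a small, made-up `Series`: the scores have mean 0 and standard deviation 1 (up to floating-point error), and missing values stay missing. The sketch below repeats the function from the task; the `values` data is illustrative. Note that the `.std()` method in `pandas` computes the sample standard deviation by default.

```python
import numpy as np
import pandas as pd

def z_score(series):
    mean = series.mean()   # missing values are skipped
    std = series.std()     # sample standard deviation
    return (series - mean) / std

values = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
scores = z_score(values)

print(scores.mean())           # close to 0
print(scores.std())            # close to 1
print(scores.isna().tolist())  # the missing value is still missing
```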

***

<hr style="border:8px solid black"> </hr>

