<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

In [1]:
# import the helper functions from the parent directory,
# these help with things like graph plotting and notebook layout
import sys
sys.path.append('..')
from helper_functions import *

# set things like fonts etc - comes from helper_functions
set_notebook_preferences()

# add a show/hide code button - also from helper_functions
toggle_code(title = "import functions")

# Updating Values


## 5.1 Important! Copies and Views

The selecting of columns and filtering of rows that you've done above effectively **copies** the original dataframe to make a new one based on your conditions. This is great for creating a pared-down dataframe to work with, or when filtering data to produce statistics for particular subsets of data.

However, when we want to actually update cell values we need to use an approach that guarantees we are editing the original dataframe (a **view** of it) and not a **copy** of it. If we edit cells on a copy of a dataframe, we might find that these values are not updated in the original dataframe when we come to analyse it!

This is a technical point that you don't need to worry too much about, just remember to use the following approaches when updating values!

In the titanic data, given what we know so far, we might try to update cell values using what's called 'chained indexing':

```python
titanic[titanic['embarked'] == 'C']['embarked'] = 'c'
```
The above code makes sense - filter the rows, select the column and assign the value you want. However, pandas will give you a warning (specifically a 'SettingWithCopyWarning') that you're not doing things properly!

Instead we'll do the following:

```python
titanic.loc[titanic['embarked'] == 'C','embarked'] = 'c'
```

The code looks very similar, but in the second (correct) example we're using `.loc[]`.
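
To see the difference without touching the real data, here is a minimal sketch on a toy DataFrame. The `toy` frame and its values are made up for illustration, and the warning is silenced only to keep the output tidy.

```python
import warnings
import pandas as pd

# Toy stand-in for the titanic data (hypothetical values)
toy = pd.DataFrame({'name': ['A', 'B', 'C'],
                    'embarked': ['C', 'S', 'C']})

# Chained indexing: the filter step returns a new object, so the
# assignment lands on that temporary object rather than on `toy`
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # hide the pandas warning for this demo
    toy[toy['embarked'] == 'C']['embarked'] = 'c'
after_chained = toy['embarked'].tolist()
print(after_chained)  # the original values are still there

# .loc[]: rows and column are given in a single step, so `toy` is updated
toy.loc[toy['embarked'] == 'C', 'embarked'] = 'c'
after_loc = toy['embarked'].tolist()
print(after_loc)
```

The first print shows the untouched values, the second shows the updated ones.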

`.loc[]` is actually almost identical to the selecting and filtering we've already done. We just specify the rows and columns within the square bracket like: `.loc[row_indexer, col_indexer]` where:
* `row_indexer` is the mask if we are filtering, or a colon, `:`, if we want to include all rows.
* `col_indexer` is a single column name, or a list of column names, or a colon, `:`, if we want to include all columns.

You can actually use the `.loc[]` approach to do all the selecting and filtering we've already done if you want.

In addition to `.loc[]` we also have `.iloc[]` which behaves very similarly, except that it selects rows and columns by their integer position rather than by their labels.
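
The difference is easiest to see when the row labels are not simply 0, 1, 2 and so on. The following is a small sketch on a made-up DataFrame (the `toy` frame and its index labels are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data with row labels 10, 20, 30 instead of 0, 1, 2
toy = pd.DataFrame({'pclass': [1, 3, 2], 'fare': [71.3, 7.9, 13.0]},
                   index=[10, 20, 30])

by_label = toy.loc[10, 'fare']      # row labelled 10, column named 'fare'
by_position = toy.iloc[0, 1]        # first row, second column

# Slices differ too: .loc[] includes the end label,
# .iloc[] excludes the end position (like normal python slicing)
label_slice = toy.loc[10:20, :]     # rows labelled 10 and 20
position_slice = toy.iloc[0:1, :]   # first row only

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```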

Some examples using `.iloc[]`:

In [None]:
# Select all rows, and the first column
titanic.iloc[:,0].head()

In [None]:
# Or select a range of columns
titanic.iloc[:,3:5].head()

In [None]:
# Another head-like indexing approach
titanic.iloc[0:5,:]

Now some examples using `.loc[]`

In [None]:
# Often researchers use column selection to reorder the columns in a DataFrame.
cols = list(titanic.columns)
cols.reverse()
titanic.loc[:,cols].head()

In [None]:
# loc takes row conditions as usual.
titanic.loc[titanic['pclass'].isin([1,2]),['pclass','name']].sample(5)

Often pandas users will use chained indexing when selecting and filtering, and the `.loc[]` and `.iloc[]` indexers when cleaning values on a cell-by-cell basis. There is no reason why you can't use these indexers for selecting and filtering too. However, if you want to assign the subset DataFrame to a new variable, you'll need to use the `.copy()` method to ensure that it is a different DataFrame in memory, rather than a reference to the original.
```python
new_df = titanic.loc[:,['survived','embarked']].copy()
```
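
As a quick check of what `.copy()` gives you, the sketch below edits the copy and confirms the source is left alone. The `toy` DataFrame is a made-up stand-in for the titanic data.

```python
import pandas as pd

# Hypothetical stand-in for the titanic data
toy = pd.DataFrame({'survived': [0, 1, 1],
                    'embarked': ['S', 'C', 'Q'],
                    'age': [22.0, 38.0, 26.0]})

# Take an independent copy of two columns
new_df = toy.loc[:, ['survived', 'embarked']].copy()

# Edit the copy only
new_df.loc[new_df['embarked'] == 'C', 'embarked'] = 'c'

print(new_df['embarked'].tolist())  # edited values
print(toy['embarked'].tolist())     # original values, unchanged
```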

## 5.2 Propagating Missing Data Values with `.loc[]`

Earlier, when we created a new binary variable called 'child' the condition we used created a `Series` of `True` and `False` values based on whether the condition was met or not.

An important consideration when creating a new variable is the presence of missing data. A condition will automatically evaluate to `False` for a missing value. This is fine for filtering data, but misrepresents data in a new variable - it makes a derived variable appear more complete than it is in reality.
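
A short sketch of the problem, using a made-up `age` Series rather than the titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical ages: a child, an adult, and a missing value
age = pd.Series([5.0, 40.0, np.nan])

is_child = age < 18               # the comparison with NaN evaluates to False
child_flag = is_child.astype(int)

print(is_child.tolist())
print(child_flag.tolist())        # the missing age now looks like 'not a child'
```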

In order to update the 'child' column to incorporate rows we know are missing we can use `.loc[]`:
```python
# This was the column originally generated.
titanic['child'] = (titanic['age'] < 18).astype(int)
# Now, update the missing data values
titanic.loc[titanic['age'].isnull(),'child'] = titanic[titanic['age'].isnull()]['age']
```
In the second statement, we use the row filter `titanic['age'].isnull()` to pick out the rows with missing data in the 'age' column, and we select the 'child' column. These are used to create a 'view' of the titanic DataFrame using `.loc[]`.

Then, we assign to that combination of rows and columns the values given by: `titanic[titanic['age'].isnull()]['age']` which will be a Series of missing values. We should now see missing values in the 'child' column.

In [None]:
# initial composition of 'child'
titanic['child'].value_counts(dropna=False)

In [None]:
# update and check new field
titanic.loc[titanic['age'].isnull(),'child'] = titanic[titanic['age'].isnull()]['age']
titanic['child'].value_counts(dropna=False)
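
The same pattern can be tried on a toy DataFrame. This is a sketch under two assumptions that differ from the cells above: the flag is stored as a float so the column can hold missing values without changing type (recent pandas versions may warn when a missing value is written into an integer column), and `np.nan` is assigned directly instead of copying the missing values across from 'age'.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: a child, an adult, and a missing value
toy = pd.DataFrame({'age': [5.0, 40.0, np.nan]})

# Build the flag as a float column (assumption: float, not int)
toy['child'] = (toy['age'] < 18).astype(float)

# Where age is missing, make the flag missing too
toy.loc[toy['age'].isnull(), 'child'] = np.nan

print(toy['child'].value_counts(dropna=False))
```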

## 5.3 Changing Column Data Types

As with basic Python data types, we can cast columns of data from one data type to another. This can be useful as part of a data cleaning process. Sometimes we find that data we expect to be numeric is actually string data. This often occurs because the numbers are stored as text in the original dataset, in which case they have to be converted. Other times it is because the original dataset includes characters that pandas won't immediately interpret as a numeric value. This is particularly the case if a dataset contains missing data that is coded to a special character. Reading in these data as text is the safest option, as it preserves all of the information and requires the data analyst to make a decision as to how to handle the conversion to a numeric data type.

There are two ways to change a data type. The first is the Series method `.astype()`, in which the parameter is a data type, e.g.
```python
titanic['survived'].astype(str)
```
There is also a 'top-level' pandas function `pd.to_numeric()` which is a good option for data cleaning.
```python
pd.to_numeric(titanic['survived'])
```

In [None]:
# convert survived to a string column
titanic['survived'] = titanic['survived'].astype(str)
titanic.dtypes['survived']

In [None]:
# convert survived back to a numeric column
titanic['survived'] = pd.to_numeric(titanic['survived'])
titanic.dtypes['survived']
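
For the case described above where missing data is coded to a special character, `pd.to_numeric()` has an `errors` parameter that controls what happens to values it cannot convert. The sketch below uses a made-up Series in which `'*'` is a hypothetical missing-data code; whether turning such codes into missing values is appropriate is a decision for the analyst.

```python
import pandas as pd

# Hypothetical text column where '*' was used to mean 'missing'
raw = pd.Series(['22', '38', '*', '26'])

# errors='coerce' turns anything that can't be parsed into NaN
ages = pd.to_numeric(raw, errors='coerce')

print(ages.tolist())
print(ages.dtype)
print(ages.isnull().sum())  # how many values could not be converted
```

Counting the missing values before and after the conversion is a simple way to check how many entries were affected.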

## 5.4 Updating Column Names

We've been using a dataset that has good, descriptive column names. However, sometimes you're faced with columns that have difficult to use names.

* column names might be really long, and these are a pain to type out.
* spaces between words, and the presence of special characters (e.g. %^&£ etc.) can be annoying.
* column names may be ambiguous or not sufficiently descriptive.

We can use the pandas `.rename()` method to update column names. To update the column names we pass a dictionary to the parameter 'columns', e.g.
```python
df.rename(columns={'two':'new_name'}, inplace=True)
```
For a DataFrame `df`, the `rename` method allows us to change the names of columns based upon a dictionary. Here, the dictionary key `'two'` is the current name of the column, and the dictionary value `'new_name'` is what we want to rename the column.

In [None]:
titanic.rename(columns={'surname':'family_name'}, inplace=True)
titanic.head()
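
The following sketch shows `.rename()` on a made-up DataFrame with awkward column names (the `messy` frame and its names are assumptions for illustration). It also shows two behaviours worth knowing: without `inplace=True` the method returns a new DataFrame, and by default a dictionary key that doesn't match any column is silently ignored.

```python
import pandas as pd

# Hypothetical DataFrame with hard-to-use column names
messy = pd.DataFrame({'Passenger Class ': [1, 3], 'Fare (£)': [71.3, 7.9]})

# Without inplace=True, rename returns a new DataFrame
tidy = messy.rename(columns={'Passenger Class ': 'pclass', 'Fare (£)': 'fare'})
print(list(tidy.columns))
print(list(messy.columns))  # the original names are untouched

# A key that matches no column is ignored by default...
unchanged = messy.rename(columns={'not_a_column': 'x'})

# ...unless we ask pandas to raise an error instead
try:
    messy.rename(columns={'not_a_column': 'x'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```

Using `errors='raise'` can help catch a mistyped column name that would otherwise go unnoticed.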