# Updating data frames

## Updating values in a dataframe 

Let's start by importing packages and data:

In [2]:
import numpy as np
import pandas as pd
import random  # used to generate random ID codes

# Set the seed 
random.seed(42)

# Import data
URL = 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'
penguins = pd.read_csv(URL)

In [4]:
# Add column body mass in kg 

penguins['body_mass_kg'] = penguins['body_mass_g']/1000

# Confirm the new column is in the data frame

print('body_mass_kg' in penguins.columns)

True


In [5]:
# Create random 3-digit codes 

codes = random.sample(range(100,1000), len(penguins))

In [6]:
# If there is code you do not understand, take it apart and run the pieces separately

len(penguins)

344

In [None]:
# Insert codes at the front of the data frame

penguins.insert(loc = 0, # Index 
               column = 'id_code', 
               value = codes)

## A single value

Access a single value in a `pandas.DataFrame` using one of these locators:

- `at[]` to select by labels, or 
- `iat[]` to select by integer position 

Syntax:

```
df.at[single_index_value, 'column_name'] 

```

* `at[]` is the equivalent of `loc[]` when accessing a single value 
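
As a minimal sketch of that equivalence, the cell below uses a small toy data frame (the values and the name `df` are made up for illustration, not taken from the penguins data). It reads the same cell with `at[]` and `loc[]`, then uses `at[]` to assign a single value.

```python
import pandas as pd

# Toy data frame (hypothetical values) indexed by an ID label
df = pd.DataFrame(
    {"species": ["Adelie", "Gentoo"], "bill_length_mm": [39.1, 46.8]},
    index=[754, 155],
)

# Both locators return the same scalar for one row label and one column label
value_at = df.at[754, "bill_length_mm"]
value_loc = df.loc[754, "bill_length_mm"]
print(value_at == value_loc)

# `at[]` can also assign a single value in place
df.at[155, "bill_length_mm"] = 47.0
print(df.at[155, "bill_length_mm"])
```

Unlike `loc[]`, `at[]` only accepts one row label and one column label, so it cannot select slices or lists.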

In [15]:
penguins = penguins.set_index('id_code') 

penguins

id_code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,body_mass_kg
754,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,3.750
214,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,3.800
125,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007,3.250
859,Adelie,Torgersen,,,,,,2007,
381,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007,3.450
...,...,...,...,...,...,...,...,...,...
140,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009,4.000
183,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009,3.400
969,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009,3.775
635,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009,4.100


What is the bill length of the penguin with ID 859?

In [18]:
# Check bill length of penguin with ID 859

penguins.at[859, 'bill_length_mm']

nan

In [19]:
# Correct bill length of penguin with ID 859

penguins.at[859, 'bill_length_mm'] = 38.3 

# Use `loc` to view a full row (here ID 127; `penguins.loc[859]` shows the updated row)

penguins.loc[127]

species              Adelie
island               Biscoe
bill_length_mm         38.2
bill_depth_mm          18.1
flipper_length_mm     185.0
body_mass_g          3950.0
sex                    male
year                   2007
body_mass_kg           3.95
Name: 127, dtype: object

If we want to access or update a single value by integer position, we use the `iat[]` locator:

Syntax: 

```
df.iat[index_integer_location, column_integer_location]

```

To get the integer location of a single column dynamically, use `get_loc()`:

```
df.columns.get_loc('column_name')

```
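
The two pieces of syntax above can be combined. The sketch below uses a toy data frame with made-up values (`df`, `col_pos`, and `row_pos` are illustrative names). It looks up integer positions with `get_loc()` and passes them to `iat[]`.

```python
import numpy as np
import pandas as pd

# Toy data frame (hypothetical values) indexed by an ID label
df = pd.DataFrame(
    {"species": ["Adelie", "Gentoo"], "bill_length_mm": [39.1, 46.8]},
    index=[754, 155],
)

col_pos = df.columns.get_loc("bill_length_mm")  # integer position of the column
row_pos = df.index.get_loc(155)                 # integer position of the row labelled 155

# Read one value by position, then overwrite another with a missing value
first_bill_length = df.iat[0, col_pos]
df.iat[row_pos, col_pos] = np.nan
print(first_bill_length, df.iat[row_pos, col_pos])
```

Looking positions up by name avoids hard-coding integers that would silently point at the wrong cell if the columns or rows were reordered.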

## Check-in 

1. Obtain the location of the `bill_length_mm` column programmatically.
2. Use `iat[]` to access the bill length for the penguin with ID 859 (you may have a different ID) and revert it back to NA. Confirm the changes.

In [20]:
penguins.head(2)

id_code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,body_mass_kg
754,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,3.75
214,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,3.8


In [23]:
# Set to NaN using iat

bill_length_index = penguins.columns.get_loc('bill_length_mm') # Use get_loc to find location 

In [29]:
penguins.iat[3, bill_length_index] = np.nan
penguins.iloc[3] # Check that the update worked 

species                 Adelie
island               Torgersen
bill_length_mm             NaN
bill_depth_mm              NaN
flipper_length_mm          NaN
body_mass_g                NaN
sex                        NaN
year                      2007
body_mass_kg               NaN
Name: 859, dtype: object

In [27]:
type(np.nan)

float

## Multiple values in a column 

### Using a condition

Example: 
We want to classify the Palmer penguins such that:
- penguins with body mass < 3 kg are small
- penguins with 3 kg <= body mass < 5 kg are medium 
- penguins with 5 kg <= body mass are large 

In [None]:
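# Create a list with the conditions 

conditions = [penguins.body_mass_kg < 3, 
              (3 <= penguins.body_mass_kg) & (penguins.body_mass_kg < 5), 
              5 <= penguins.body_mass_kg]

# Create a list with the choices 

choices = ['small',
           'medium', 
           'large']


# Add the selections using np.select
# default=None marks rows that match no condition (e.g. missing body mass) as missing;
# mixing string choices with default=np.nan can raise a TypeError on recent NumPy versions

penguins['size'] = np.select(conditions, choices, default=None)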
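
To see how `np.select` treats rows that match none of the conditions, here is a small sketch on a toy data frame. The values and the names `df` and `sizes` are made up for illustration. It assumes `default=None` is an acceptable marker for "no match".

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the last penguin has a missing body mass
df = pd.DataFrame({"body_mass_kg": [2.5, 3.0, 5.0, np.nan]})

conditions = [
    df.body_mass_kg < 3,
    (3 <= df.body_mass_kg) & (df.body_mass_kg < 5),
    5 <= df.body_mass_kg,
]
choices = ["small", "medium", "large"]

# Comparisons with NaN are always False, so the last row matches no
# condition and receives the default value
sizes = np.select(conditions, choices, default=None)
df["size"] = sizes
print(df)
```

Each row gets the choice paired with the first condition that is true for it, so the order of the conditions matters whenever they overlap.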

## Update values by selecting them 

We can do this by selecting values with `loc` or `iloc` and assigning new values to the selection.


Syntax:
```
df.loc[row_selection, column_name] = new_values
```

Using `loc[]` in assignment modifies the data frame directly without the need for reassignment. 

### Example 

Update the 'male' values in the sex column to 'M'. 

In [30]:
# Select rows with sex = male and simplify values in 'sex' column 

penguins.loc[penguins.sex == 'male', 'sex'] = 'M'

In [31]:
# Check changes in 'sex' column

print(penguins.sex.unique())

['M' 'female' nan]


### Best practices

We want to update the 'female' values in the 'sex' column to 'F'.


In [32]:
# Select rows where 'sex' is female and attempt to update the values

penguins[penguins.sex == 'female']['sex'] = 'F' # Raises SettingWithCopyWarning

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


**Avoid chained indexing** `[][]` and use `.loc[]` instead. 

This warning generally appears when we use chained indexing:

```
df[row_selection][column_selection] = new_value

```

## Check-in 
Update the 'female' values without triggering the warning, and check the result. 

In [33]:
penguins.loc[penguins.sex == 'female', 'sex'] = 'F'

penguins.sex.unique()

array(['M', 'F', nan], dtype=object)

This warning comes up because some `pandas` operations return a view and others return a copy of your data.

- **Views**: actual subsets of the original data. When we update them, we are modifying the original data frame. 
- **Copies**: unique objects, independent of our original data frame. When we update a copy, we are not modifying the original data. 
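
A small sketch of the practical consequence, on a toy data frame with made-up values (`df`, `after_chained`, and `after_loc` are illustrative names). A chained assignment through a boolean selection lands on a temporary object, while the same assignment through `.loc[]` reaches the original. The pandas warning is silenced here only to keep the demonstration output clean.

```python
import warnings

import pandas as pd

# Toy data frame (hypothetical values)
df = pd.DataFrame({"sex": ["male", "female", "female"]})

# Chained indexing: the first [] builds a new object, so the assignment
# goes to that temporary object and `df` is left unchanged
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the pandas warning for this demo
    df[df.sex == "female"]["sex"] = "F"
after_chained = df.sex.tolist()

# A single .loc[] call selects and assigns on `df` itself
df.loc[df.sex == "female", "sex"] = "F"
after_loc = df.sex.tolist()

print(after_chained)
print(after_loc)
```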

## Example

We only want to use data from Biscoe island. After doing some analysis, we want to add a new column. 


In [34]:
biscoe2 = penguins[penguins['island'] == 'Biscoe']
biscoe2

id_code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,body_mass_kg
338,Adelie,Biscoe,37.8,18.3,174.0,3400.0,F,2007,3.40
617,Adelie,Biscoe,37.7,18.7,180.0,3600.0,M,2007,3.60
716,Adelie,Biscoe,35.9,19.2,189.0,3800.0,F,2007,3.80
127,Adelie,Biscoe,38.2,18.1,185.0,3950.0,M,2007,3.95
674,Adelie,Biscoe,38.8,17.2,180.0,3800.0,M,2007,3.80
...,...,...,...,...,...,...,...,...,...
578,Gentoo,Biscoe,,,,,,2009,
155,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,F,2009,4.85
200,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,M,2009,5.75
162,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,F,2009,5.20


In [37]:
# Select penguins from Biscoe island

biscoe = penguins[penguins.island == 'Biscoe']
biscoe

# Add a column 

biscoe['sample_column'] = 100 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  biscoe['sample_column'] = 100


We can also explicitly ask for a copy of a dataset when subsetting by using the `copy()` method. 

In [39]:
'sample_column' in penguins.columns

False
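
The explicit-copy approach mentioned above can be sketched as follows, on a toy data frame with made-up values (`df` and `subset` are illustrative names). Calling `copy()` on the subset makes it an independent object, so adding a column raises no warning and leaves the original untouched.

```python
import pandas as pd

# Toy data frame (hypothetical values)
df = pd.DataFrame(
    {
        "island": ["Biscoe", "Dream", "Biscoe"],
        "body_mass_g": [3400.0, 3700.0, 4850.0],
    }
)

# Explicitly request an independent copy of the subset
subset = df[df["island"] == "Biscoe"].copy()

# Adding a column modifies only the copy
subset["sample_column"] = 100

print("sample_column" in subset.columns)
print("sample_column" in df.columns)
```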