In [1]:
# Updating data frames

import numpy as np
import pandas as pd
import random  # Used for randomly sampling integers

# Set the seed
random.seed(42)

# Import data
URL = 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'
penguins = pd.read_csv(URL)

In [2]:
penguins

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
...,...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009


In [3]:
# Add a column



In [4]:
# Create random 3-digit codes
codes = random.sample(range(100,1000), len(penguins))  # Sampling w/o replacement

# Insert codes at the front of data frame
penguins.insert(loc=0,  # Index
                column='id_code',
                value=codes)
        
penguins.head()

Unnamed: 0,id_code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,754,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,214,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,125,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,859,Adelie,Torgersen,,,,,,2007
4,381,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


# A single value

Accessing a single value in a 'pandas.DataFrame' using locators:

- 'at[]' to select by labels, or
- 'iat[]' to select by index position

Syntax:

```
df.at[single_index_value, 'col_name']
```

*'at[]' is equivalent to 'loc[]' when accessing a single value.*

### Example

In [5]:
penguins = penguins.set_index("id_code")
penguins

Unnamed: 0_level_0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
id_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
754,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
214,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
125,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
859,Adelie,Torgersen,,,,,,2007
381,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
...,...,...,...,...,...,...,...,...
140,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
183,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
969,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
635,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009


What was the bill length of the penguin with ID 754?

In [7]:
penguins.at[754, 'bill_length_mm']

39.1

In [8]:
# Check one penguin with NaN values

penguins.at[859, 'bill_length_mm']

nan

In [9]:
# Let's update the bill length for the penguin with ID 859

penguins.at[859, 'bill_length_mm'] = 38.3

# Confirm the change

penguins.loc[859]

species                 Adelie
island               Torgersen
bill_length_mm            38.3
bill_depth_mm              NaN
flipper_length_mm          NaN
body_mass_g                NaN
sex                        NaN
year                      2007
Name: 859, dtype: object
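The equivalence between 'at[]' and 'loc[]' for a single value can be checked on a small data frame. The sketch below is only illustrative: the data frame 'toy' and its values are hypothetical stand-ins, with index labels chosen to resemble the 'id_code' values above.

```python
import pandas as pd

# Hypothetical toy data frame; the index labels mimic id_code values
toy = pd.DataFrame(
    {"species": ["Adelie", "Adelie", "Gentoo"],
     "bill_length_mm": [39.1, 39.5, None]},
    index=[754, 214, 859],
)

# Both locators take (row label, column label) and return the same scalar
value_at = toy.at[754, "bill_length_mm"]
value_loc = toy.loc[754, "bill_length_mm"]
print(value_at, value_loc)

# Assigning through at[] modifies the data frame in place
toy.at[859, "bill_length_mm"] = 38.3
print(toy.loc[859, "bill_length_mm"])
```

Since 'at[]' only ever handles one cell, it is a natural choice when the row label and column name are both known.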

If we want to access or update a single value by position, we use the
'iat[]' locator.

Syntax:
```
df.iat[index_integer_location, column_integer_location]
```

To dynamically get the integer location of a single column:
```
df.columns.get_loc('col_name')
```
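As a minimal sketch of how these two pieces fit together, the example below looks up a column position with 'get_loc()' and then reads and updates one cell with 'iat[]'. The data frame 'toy' and its values are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame with non-sequential index labels
toy = pd.DataFrame(
    {"species": ["Adelie", "Gentoo", "Chinstrap"],
     "bill_length_mm": [39.1, 46.8, 49.6]},
    index=[754, 155, 969],
)

# Look up the integer position of the column instead of hard-coding it
col_pos = toy.columns.get_loc("bill_length_mm")

# Row position 1 is the second row, whatever its index label is (155 here)
before = toy.iat[1, col_pos]
toy.iat[1, col_pos] = np.nan
after = toy.iat[1, col_pos]
print(col_pos, before, after)
```

Note that 'iat[]' counts rows from 0 and ignores the index labels, so position 1 and label 155 refer to the same row only because of how this toy frame is ordered.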

In [26]:
# Check-in
# a) Obtain the integer location of the bill_length_mm column
# b) Use iat[] to access the bill length of the penguin at row position 127 and change the value to NaN

bill_length_index = penguins.columns.get_loc("bill_length_mm")
penguins.iat[127, bill_length_index] = np.nan  # Use np.nan (a float), not the string 'NA', so the column stays numeric

# Confirm the change
penguins.iat[127, bill_length_index]

nan

## Multiple values in a column

### Using conditions

Example:

We want to classify the Palmer penguins such that:

- penguins with body mass < 3 kg are small,
- penguins with 3 kg <= body mass < 5 kg are medium,
- penguins with 5 kg <= body mass are large.

In [29]:
# Create a list with the conditions
# Note: body_mass_g is in grams, so 3 kg = 3000 g and 5 kg = 5000 g

conditions = [penguins.body_mass_g < 3000,
              (3000 <= penguins.body_mass_g) & (penguins.body_mass_g < 5000),
              5000 <= penguins.body_mass_g
              ]

# Create a list with the choices

choices = ['small',
           'medium',
           'large']

# Add the selections using np.select

penguins['size'] = np.select(conditions,
                             choices,
                             default=np.nan)  # Value for anything outside the conditions


# Display the updated data frame

penguins.head()

Unnamed: 0_level_0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,size
id_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
754,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,medium
214,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,medium
125,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007,medium
859,Adelie,Torgersen,,,,,,2007,
381,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007,medium
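To see how 'np.select' treats each row, including rows with missing data, here is a small sketch on a hypothetical series of body masses in grams. Using the string 'unknown' as the default is an assumption made for this illustration (it keeps every label a string); it is not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical body masses in grams, including one missing value
mass_g = pd.Series([2900.0, 3750.0, 5200.0, np.nan])

conditions = [
    mass_g < 3000,
    (mass_g >= 3000) & (mass_g < 5000),
    mass_g >= 5000,
]
choices = ["small", "medium", "large"]

# Comparisons with NaN are False, so the missing value matches no condition
# and receives the default label
size = np.select(conditions, choices, default="unknown")
print(size.tolist())  # ['small', 'medium', 'large', 'unknown']
```

The conditions are evaluated in order and the first match wins, so listing them from smallest to largest keeps the boundaries easy to audit.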


## Update values by selecting them

We can do this by selecting values with 'loc[]' or 'iloc[]' and assigning new values.


Syntax:

```
df.loc[row_selection, col_selection] = new_values
```

Using 'loc[]' in an assignment modifies the data frame directly, without the need for reassignment.


### Example 


Update the 'male' values in the 'sex' column to 'M'.

In [32]:
# Select rows where sex is male and simplify the sex column

penguins.loc[penguins.sex=='male', 'sex'] = 'M'

In [34]:
# Check changes in sex column 

print(penguins.sex.unique())

['M' 'female' nan]
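The same pattern can be broken into its two parts, a boolean mask for the rows and a label for the column. The following is a minimal sketch on a hypothetical data frame 'toy'; the names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data frame with one missing value in 'sex'
toy = pd.DataFrame({"sex": ["male", "female", None, "male"],
                    "year": [2007, 2007, 2008, 2009]})

# The boolean mask selects the rows; 'sex' selects the column to overwrite
mask = toy["sex"] == "male"
toy.loc[mask, "sex"] = "M"

# Only the masked rows changed; the missing value and other columns are untouched
print(toy)
```

Storing the mask in a variable makes it easy to inspect how many rows will be affected (for example with 'mask.sum()') before overwriting anything.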


### Best Practices

We want to update the 'female' values in the 'sex' column to 'F'.

In [37]:
# Select rows where sex is female and update
penguins[penguins.sex == 'female']['sex'] = 'F'  # Raises SettingWithCopyWarning



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  penguins[penguins.sex =='female']['sex'] = 'F'


**Avoid chained indexing** '[][]' and use 'loc[]' instead. This warning generally appears when we assign through chained indexing:

```
df[row_selection][column_selection] = new_value
```

## Check-in

Update the 'female' values to 'F' without triggering the warning, and check that the values were updated.

In [42]:
penguins.loc[penguins.sex=='female', 'sex'] = 'F'
penguins.sex.unique()

array(['M', 'F', nan], dtype=object)

This warning comes up because some pandas operations return a view of your data and others return a copy.


- Views are actual subsets of the data: when we update them, we are modifying the original data.
- Copies are unique objects, independent of the original data: when we update them, we are not modifying the original data.
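The independence of a copy can be verified directly. The sketch below uses a hypothetical data frame 'toy' and only demonstrates the copy case; whether a plain selection behaves as a view or a copy depends on the operation and the pandas version, so that case is deliberately left out.

```python
import pandas as pd

# Hypothetical toy data frame
toy = pd.DataFrame({"island": ["Biscoe", "Dream", "Biscoe"],
                    "body_mass_g": [3400.0, 3900.0, 5000.0]})

# Take an explicit, independent copy of a subset
subset = toy[toy["island"] == "Biscoe"].copy()

# Overwrite values in the copy only
subset.loc[:, "body_mass_g"] = 0.0

print(subset["body_mass_g"].tolist())  # [0.0, 0.0]
print(toy["body_mass_g"].tolist())     # [3400.0, 3900.0, 5000.0]
```

Because 'subset' is an explicit copy, the assignment raises no warning and the original data frame keeps its values.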

## Example 
We want to work with the data from Biscoe Island only and, after some analysis, add a new column.

In [44]:
# Select Biscoe penguins
biscoe = penguins[penguins.island == 'Biscoe']

# Other analysis

# Add a column
biscoe['sample_column'] = 100  # Raises SettingWithCopyWarning

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  biscoe['sample_column'] = 100


We can also explicitly ask for a copy of a dataset when subsetting, using the 'copy()' method. Note the parentheses: 'copy' without them refers to the method itself instead of calling it, and assigning a column to it fails with a TypeError.

In [47]:
# Select Biscoe penguins and make an explicit copy
biscoe = penguins[penguins.island == 'Biscoe'].copy()

# Other analysis

# Add a column
biscoe['sample_column'] = 100

In [49]:
# Confirm the new column exists in the copy
'sample_column' in biscoe.columns

True

In [48]:
'sample_column' in penguins.columns

False