![alt text](pandas.png "Title")

In [1]:
import pandas as pd

# Dataframes: accessing values

In [2]:
# Test data

patients = [10010, 10011, 10012]
data = {
    'gender': ['M', 'F', 'F'],
    'age':    [20, 25, 23],
}

# 'race' is not a key in data, so that column is created and filled with NaN
df = pd.DataFrame(data, index=patients, columns=['age', 'gender', 'race'])
df

Unnamed: 0,age,gender,race
10010,20,M,
10011,25,F,
10012,23,F,


In [3]:
# Accessing a value in a Python dict: 
data['age']

[20, 25, 23]

In [4]:
# A dataframe works as a dictionary of pandas Series. Let's look at a column:
age = df['age']

# This column is a pandas Series:
print('Type=', type(age) )

age

Type= <class 'pandas.core.series.Series'>


10010    20
10011    25
10012    23
Name: age, dtype: int64

In [5]:
df.gender.str.lower()

10010    m
10011    f
10012    f
Name: gender, dtype: object

In [6]:
# You can also use the dot syntax:
print( df.age )

# This syntax only works if the column name is a valid Python identifier
# and doesn't clash with an existing DataFrame attribute or method.
# You can't use the dot syntax with a column named 'My Var'; use df['My Var'] instead.

10010    20
10011    25
10012    23
Name: age, dtype: int64
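
To make the dot-syntax limits concrete, here is a minimal sketch. The `demo` frame and its column names are made up for illustration and are not part of the test data above.

```python
import pandas as pd

# Hypothetical frame: 'My Var' contains a space, 'count' is also a DataFrame method name
demo = pd.DataFrame({'My Var': [1, 2, 3], 'count': [4, 5, 6], 'age': [20, 25, 23]})

# Brackets work with any column name
my_var = demo['My Var']
count_col = demo['count']

# Dot syntax finds the DataFrame.count method first, not the 'count' column
dot_is_method = callable(demo.count)

# Dot syntax is fine for a plain identifier such as 'age'
age = demo.age
```

When in doubt, brackets are the safer choice because they never collide with DataFrame attributes.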


In [7]:
# Switch between pandas and Python

# You can always convert a pandas object to a core Python object:
ages = tuple(df['age'])

# Now you're free to iterate, manipulate, aggregate, do whatever, in pure Python.
# Sometimes it actually feels easier, but there's probably a simpler way in pandas...

# Let's admit for now our ignorance of pandas and calculate the age average using tuples:
mean = sum(ages) / len(ages)

# Done with the "heavy lifting"; back in pandas, assigning the scalar broadcasts it to every row of a new column
df['mean'] = mean
df

Unnamed: 0,age,gender,race,mean
10010,20,M,,22.666667
10011,25,F,,22.666667
10012,23,F,,22.666667
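
For comparison, a minimal sketch of the same average computed directly in pandas with `Series.mean()`. The frame is rebuilt here (without the empty `race` column) so the snippet runs on its own.

```python
import pandas as pd

# Same ages and patient ids as the test data above
df = pd.DataFrame({'age': [20, 25, 23]}, index=[10010, 10011, 10012])

# Pure-Python route, as in the cell above
ages = tuple(df['age'])
mean_python = sum(ages) / len(ages)

# pandas route: Series.mean() gives the same number in one call
mean_pandas = df['age'].mean()

# Assigning the scalar fills every row of the new column
df['mean'] = mean_pandas
```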


In [8]:
# pandas, like Python, is case sensitive!
print (df.Age)

AttributeError: 'DataFrame' object has no attribute 'Age'
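
A small sketch of how you might check a column name before using it. The frame is rebuilt so the snippet is self-contained, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 25, 23]}, index=[10010, 10011, 10012])

# Membership tests on df.columns are case sensitive too
has_lower = 'age' in df.columns
has_upper = 'Age' in df.columns

# With brackets, a wrong name raises KeyError instead of AttributeError
try:
    df['Age']
    bracket_error = None
except KeyError as err:
    bracket_error = type(err).__name__
```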

In [None]:
# Accessing a value inside the df: df > series > value
age_10010 = df['age'][10010]

# We retrieved a single value: an integer stored as a NumPy int64
print('Type=', type(age_10010) )

# let's print that
age_10010

Type= <class 'numpy.int64'>


20

## loc and iloc

These two DataFrame indexers (used with square brackets, not called like methods) are useful for selecting, filtering and setting values.
* loc selects by label ([reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc))
* iloc selects by integer position ([reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html?highlight=iloc#pandas.DataFrame.iloc))
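
The three uses mentioned above (selection, filtering, setting) can be sketched as follows. This is a minimal illustration on a rebuilt copy of the test data; the new age of 24 is made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 25, 23], 'gender': ['M', 'F', 'F']},
    index=[10010, 10011, 10012],
)

# Selection: same cell, by label (loc) and by position (iloc)
by_label = df.loc[10011, 'age']
by_position = df.iloc[1, 0]

# Slices differ: loc includes the end label, iloc excludes the end position
label_slice = df.loc[10010:10011]   # two rows
position_slice = df.iloc[0:1]       # one row

# Filtering: loc accepts a boolean mask
females = df.loc[df['gender'] == 'F']

# Setting: assign through loc with row label and column name
df.loc[10012, 'age'] = 24
```

Note the slicing difference: it is an easy one to trip over when switching between the two indexers.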

In [None]:
df

Unnamed: 0,age,gender,race,mean
10010,20,M,,22.666667
10011,25,F,,22.666667
10012,23,F,,22.666667


In [9]:
# loc retrieves a given row in the df, returning a Series where the index is the df columns
df.loc[10010]

# Kind of a "transpose" of that single row...

age              20
gender            M
race            NaN
mean      22.666667
Name: 10010, dtype: object
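
To see why the row looks "transposed", here is a minimal sketch on a rebuilt two-column copy of the test data: the row comes back as a Series whose index is the column names and whose name is the row label.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 25, 23], 'gender': ['M', 'F', 'F']},
    index=[10010, 10011, 10012],
)

row = df.loc[10010]

row_labels = list(row.index)   # the df column names
row_name = row.name            # the row label we asked for
row_dtype = row.dtype          # object, because the row mixes an int and a string
age_value = row['age']         # values are then reachable by column name
```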

In [10]:
# iloc does the same, but instead of the index name it uses the index position in the df

# Retrieves the first df row
df.iloc[0] 

# Retrieves the last df row
df.iloc[-1] 

# and we can take bigger slices too, e.g. with first two rows:
df.iloc[0:2] 

Unnamed: 0,age,gender,race,mean
10010,20,M,,22.666667
10011,25,F,,22.666667


In [11]:
# We can access values using iloc/loc too:

age_10010 = df.loc[10010, 'age']  # row index, column index
# df.loc[10010]['age'] does the same

print ('Type', type(age_10010))

age_10010

Type <class 'numpy.int64'>


20

In [12]:
# Same result, different syntax:
df.loc[10010, 'age'] == df['age'][10010]

True

In [13]:
# You can also easily subset a df with the head() method. I'm keeping the first 2 rows here:
df.head(2)

# df.head() is the default and returns the first 5 rows

Unnamed: 0,age,gender,race,mean
10010,20,M,,22.666667
10011,25,F,,22.666667


In [14]:
# tail() returns the last 5 rows by default; this df has only 3, so we get all of them
df.tail()

Unnamed: 0,age,gender,race,mean
10010,20,M,,22.666667
10011,25,F,,22.666667
10012,23,F,,22.666667
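
A short sketch tying `head()` and `tail()` back to `iloc`, again on a rebuilt copy of the ages so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 25, 23]}, index=[10010, 10011, 10012])

# head(2) returns the same rows as a positional slice
first_two = df.head(2)
same_as_iloc = first_two.equals(df.iloc[:2])

# tail() asks for 5 rows but the frame only has 3, so all rows come back
last_rows = df.tail()

# Pass n explicitly to get fewer rows
last_one = df.tail(1)
```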


__________________________________________________
Nicolas Dupuis, Methodology and Innovation (IDAR C&SP), 2020+