![alt text](pandas.png "Title")

In [1]:
import pandas as pd

# Dataframes

## What's a Dataframe?

A pandas DataFrame represents a rectangular table of data. In SAS, you'd call it a dataset. In pandas, it's an ordered collection of columns, each of which can have a different type. You can think of it as a dictionary of pandas Series sharing the same row index.
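
To make the "dictionary of Series" picture concrete, here is a minimal sketch. The names `idx`, `df_demo` and `age_col` and the patient labels are made up for illustration.

```python
import pandas as pd

# Two Series built on the same (hypothetical) patient labels
idx = [10010, 10011, 10012]
gender = pd.Series(['M', 'F', 'F'], index=idx)
age = pd.Series([20, 25, 23], index=idx)

# A dict of Series becomes a DataFrame: the keys are the column names
df_demo = pd.DataFrame({'gender': gender, 'age': age})

# Selecting a column gives back a Series carrying the same row index
age_col = df_demo['age']
print(type(age_col).__name__)                # Series
print(age_col.index.equals(df_demo.index))   # True

# Each column keeps its own type
print(df_demo.dtypes)
```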

## Construct a dataframe from scratch

The DataFrame constructor accepts many Python and pandas objects as input: a dict of lists, a dict of Series, a dict of dicts, a list of dicts, and so on.

Many options are available; see the [reference site](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame).
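
As a quick illustration of two of these input shapes, the sketch below builds the same small table from a list of dicts and from a dict of dicts. The variable names and values are illustrative only.

```python
import pandas as pd

# List of dicts: one dict per row, the keys become column names
rows = [
    {'gender': 'M', 'age': 20},
    {'gender': 'F', 'age': 25},
    {'gender': 'F', 'age': 23},
]
from_rows = pd.DataFrame(rows)

# Dict of dicts: outer keys are columns, inner keys are row labels
nested = {
    'gender': {10010: 'M', 10011: 'F', 10012: 'F'},
    'age':    {10010: 20, 10011: 25, 10012: 23},
}
from_nested = pd.DataFrame(nested)

print(from_rows.shape)           # (3, 2)
print(list(from_nested.index))   # [10010, 10011, 10012]
```

With a list of dicts the row index defaults to a range of integers, whereas the dict of dicts carries its own row labels.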

In [3]:
# Let's create a simple Dataframe

# First, we can define the raw data as a dictionary of lists:
rawdata = {
    'gender': ['M', 'F', 'F'],
    'age'   : [20, 25, 23],
}

# The DataFrame constructor converts the dict into a Dataframe (df)
df = pd.DataFrame(rawdata)

# Jupyter renders the DataFrame nicely. Note that the row index defaults to a range of integers, here 0 to 2.
df

|   | gender | age |
|---|--------|-----|
| 0 | M      | 20  |
| 1 | F      | 25  |
| 2 | F      | 23  |


In [4]:
# The shape property returns a tuple with the number of rows and columns:
df.shape

(3, 2)

In [5]:
f"This df dataframe has {df.shape[0]} rows."

'This df dataframe has 3 rows.'

In [6]:
# The info() method returns more info:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   gender  3 non-null      object
 1   age     3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes


In [7]:
# Let's add some complexity:
# 1) an explicitly labelled index
# 2) a column order that is not driven by the dict
# 3) a column without raw data

patients = [10010, 10011, 10012]
rawdata = {
    'gender': ['M', 'F', 'F'],
    'age':    [20, 25, 23]
}

pd.DataFrame(rawdata, index=patients, columns=['age', 'gender', 'race'])

|       | age | gender | race |
|-------|-----|--------|------|
| 10010 | 20  | M      | NaN  |
| 10011 | 25  | F      | NaN  |
| 10012 | 23  | F      | NaN  |
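
A short sketch of what these three choices imply in practice: the column without raw data is filled with missing values, and rows can be looked up by patient label. The name `demo` is only used here for illustration.

```python
import pandas as pd

patients = [10010, 10011, 10012]
rawdata = {'gender': ['M', 'F', 'F'], 'age': [20, 25, 23]}
demo = pd.DataFrame(rawdata, index=patients, columns=['age', 'gender', 'race'])

# 'race' has no source data, so every value is missing (NaN)
print(demo['race'].isna().all())   # True

# Rows are now addressed by patient label rather than by position
print(demo.loc[10011, 'age'])      # 25
```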


In [8]:
# A more readable alternative: a list of lists, one inner list per row.
# Tuples work the same way; sets do not, because they are unordered.
rawdata = [
    [10010, 'M', 20],
    [10011, 'F', 25],
    [10012, 'F', 23],
]

pd.DataFrame(rawdata, columns=['subjid', 'gender', 'age'])

|   | subjid | gender | age |
|---|--------|--------|-----|
| 0 | 10010  | M      | 20  |
| 1 | 10011  | F      | 25  |
| 2 | 10012  | F      | 23  |
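
The same row-wise construction can be sketched with tuples. As an optional extra step (an assumption about what you might want next, not something the cell above does), `set_index` promotes `subjid` from a regular column to the row index.

```python
import pandas as pd

# One tuple per row, in the same order as the column names
rawdata = [
    (10010, 'M', 20),
    (10011, 'F', 25),
    (10012, 'F', 23),
]
df_rows = pd.DataFrame(rawdata, columns=['subjid', 'gender', 'age'])

# Optionally use subjid as the row index instead of 0, 1, 2
df_indexed = df_rows.set_index('subjid')
print(df_indexed.shape)   # (3, 2)
```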


## Convert an existing file

pandas provides plenty of functions that read many kinds of input files and create a DataFrame from them. For example:
* CSV : [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv)
* XLS : [read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html?highlight=read_excel#pandas.read_excel)
* SAS7BDAT or XPORT: [read_sas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sas.html?highlight=read_sas#pandas.read_sas)
* JSON: [read_json](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)
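
These readers share a similar style of options. The sketch below uses `read_csv` on an in-memory string so that it runs without any file on disk; the CSV content is made up, and a real file path could be passed instead of the `StringIO` object.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk (made-up content)
csv_text = """subjid;gender;age
10010;M;20
10011;F;25
10012;F;23
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),          # a file path would work the same way
    sep=';',                        # delimiter used in this file
    index_col='subjid',             # use subjid as the row index
    dtype={'gender': 'category'},   # store gender as a categorical column
)
print(df_csv.dtypes)
```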

In [10]:
# Import the Excel file directly. This is an easy case; for more complex files, use the many options read_excel provides.
df = pd.read_excel(io="data.xlsx")
df

|   | usubjid | age | gender |
|---|---------|-----|--------|
| 0 | 10010   | 25  | F      |
| 1 | 10011   | 26  | M      |
| 2 | 10013   | 23  | F      |


In [None]:
# This JSON file is created in a later chapter. It can be imported back as a DataFrame:
pd.read_json('my_dm.json')

|   | subjid | age | gender |
|---|--------|-----|--------|
| 0 | 10010  | 20  | M      |
| 1 | 10011  | 25  | F      |
| 2 | 10012  | 23  | F      |
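
For a self-contained view of the JSON round trip, here is a sketch that writes a small DataFrame to a JSON string and reads it back. The `orient='records'` layout is an assumption made for this example; the actual layout of `my_dm.json` is not shown here.

```python
import io
import pandas as pd

df_dm = pd.DataFrame({
    'subjid': [10010, 10011, 10012],
    'age': [20, 25, 23],
    'gender': ['M', 'F', 'F'],
})

# Serialise to a JSON string, one object per row
json_text = df_dm.to_json(orient='records')

# Read it back; StringIO stands in for a file such as my_dm.json
restored = pd.read_json(io.StringIO(json_text), orient='records')
print(restored)
```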


In [None]:
# This feather file is also created in a later chapter:
pd.read_feather('dm.feather')

|   | subjid | age | gender |
|---|--------|-----|--------|
| 0 | 10010  | 20  | M      |
| 1 | 10011  | 25  | F      |
| 2 | 10012  | 23  | F      |


__________________________________________________
Nicolas Dupuis, Methodology and Innovation (IDAR C&SP), 2020+