# Pandas DataFrame and tabular data

## Pandas and pandas DataFrames

In data science we often encounter data that are organized in tables. Our main tool to handle such data is the third-party module `pandas`, which we import as:

In [1]:
import pandas as pd

As before, the `as pd` part is optional but is relatively standard. It defines a shorthand `pd` for the content of the `pandas` package.

Just as `ndarray` is the central object offered by numpy, pandas offers a central object called `DataFrame`. Structurally, pandas DataFrames are row homogeneous (i.e., every row has the same structure as the next) but column heterogeneous (i.e., one column may hold a different type of data from the next). This makes the pandas DataFrame a good representation of tabular data, since tabular data tends to be row homogeneous and column heterogeneous too.

To create a new DataFrame, we can use the `pd.DataFrame()` function, to which we supply key-value pairs enclosed in curly braces `{}`, with `:` separating each key from its value. In this specification, the keys correspond to column names, while the values (usually a Python list or a numpy array) hold the data for each column. For example:

In [2]:
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": ["this", "that", "here", "there"],
    "C": [1.3, 2.4, 7.5, 8.1]
})

display(df)

,A,B,C
0,1,this,1.3
1,2,that,2.4
2,3,here,7.5
3,4,there,8.1
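
The column heterogeneity described earlier can be checked with the `.dtypes` attribute, which reports one data type per column. The following is a minimal sketch that rebuilds the same DataFrame so that it runs on its own; the variable name `column_types` is only illustrative:

```python
import pandas as pd

# Same data as in the example above
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": ["this", "that", "here", "there"],
    "C": [1.3, 2.4, 7.5, 8.1]
})

# One dtype per column: integers, strings, and floats sit side by side
column_types = df.dtypes
print(column_types)
```

Column `A` should be reported as an integer type and column `C` as a floating-point type. How the string column `B` is reported depends on the pandas version (often `object`).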


*Note #1*: the syntax `{key1: value1, key2: value2, ...}` defines a Python **dictionary**. It is a useful data structure from core Python, but we won't be making much use of it in this course other than to supply it as an argument to functions.
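
Because the argument is an ordinary dictionary, it can also be built first and passed to `pd.DataFrame()` afterwards, and its values can be numpy arrays rather than lists. A small sketch, in which the names `data`, `depth_m`, `temp_degC`, and `df_from_arrays` as well as the numbers are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: keys become column names, values become the columns
data = {
    "depth_m": np.array([0, 10, 20, 30]),
    "temp_degC": np.array([12.1, 12.0, 11.8, 11.2]),
}

df_from_arrays = pd.DataFrame(data)
print(df_from_arrays.shape)  # (number of rows, number of columns)
```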

*Note #2*: In the above we used the `display()` function to display a pandas DataFrame. The `display()` function is built into the Jupyter notebook and renders output using webpage (HTML) technology, which tends to be richer than the plain-text output you get from `print()`.

In addition to being able to handle column-heterogeneous data, when compared to a 2D ndarray, a pandas DataFrame also has the advantage that it retains row and column labels. We can extract these using the `.columns` attribute and the `.index` attribute:

In [3]:
df.columns # column labels

Index(['A', 'B', 'C'], dtype='object')

In [4]:
df.index # row labels

RangeIndex(start=0, stop=4, step=1)

Note that because we didn't supply an index to the DataFrame, the row labels default to a numerical range that starts counting from 0. Instead, we could have specified the row labels using the `index` argument of `pd.DataFrame()`:

In [5]:
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "B": ["this", "that", "here", "there"],
        "C": [1.3, 2.4, 7.5, 8.1]
    }, index=["2017", "2018", "2019", "2020"]
)

display(df)

,A,B,C
2017,1,this,1.3
2018,2,that,2.4
2019,3,here,7.5
2020,4,there,8.1


A single column of a DataFrame is a pandas Series. We can extract a Series from a DataFrame using the square brackets `[]` syntax. For example:

In [6]:
df["A"]

2017    1
2018    2
2019    3
2020    4
Name: A, dtype: int64

Notice that a pandas Series comes with its index and name.
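
The index and name of a Series can be read back through its `.index` and `.name` attributes. A minimal sketch, using a shortened version of the DataFrame above so that it runs on its own (the variable name `col` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "C": [1.3, 2.4, 7.5, 8.1]},
    index=["2017", "2018", "2019", "2020"],
)

col = df["A"]            # a pandas Series
print(col.name)          # the column label the Series came from
print(list(col.index))   # the row labels, shared with the DataFrame
```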

We can extract the data contained within the Series using the `.values` attribute:

In [7]:
df["A"].values

array([1, 2, 3, 4], dtype=int64)

Similarly, we can extract the data contained in the column labels and row labels of a DataFrame using the `.values` attribute:

In [8]:
df.columns.values

array(['A', 'B', 'C'], dtype=object)

In [9]:
df.index.values

array(['2017', '2018', '2019', '2020'], dtype=object)
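
As a side note, the pandas documentation recommends the `.to_numpy()` method over `.values` for new code. For a simple numeric column like the one used here, the two are expected to give the same array. A minimal sketch (the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4]},
    index=["2017", "2018", "2019", "2020"],
)

from_values = df["A"].values       # attribute form used in the text
from_method = df["A"].to_numpy()   # method form

print(type(from_method))
print(np.array_equal(from_values, from_method))
```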

## Importing data into pandas DataFrame

More often, your DataFrame will be created from external tabular data. The most portable format for such data is a CSV (comma-separated values) file, which can be read conveniently using the `pd.read_csv()` function. Moreover, pandas by default does not store strings in the most efficient format. To force conversion, apply the `.convert_dtypes()` method.
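
To see what the conversion does, the sketch below reads a tiny, made-up CSV from an in-memory string (so no data file is needed) and compares the column types before and after calling `.convert_dtypes()`. The column names `station` and `count` and the variable names are hypothetical:

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so that no file is needed
csv_text = "station,count\nalpha,1\nbeta,2\n"

raw = pd.read_csv(io.StringIO(csv_text))
converted = raw.convert_dtypes()

# Compare the column types before and after the conversion
print(raw.dtypes)
print(converted.dtypes)
```

The exact type names printed depend on the pandas version. The point is that the text column should end up with a dedicated string type rather than the generic `object` type.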

For example, a subset of the CalCOFI dataset ([https://calcofi.org/](https://calcofi.org/)) can be loaded as follows (*note*: the data file can be downloaded [here](https://github.com/OCEAN-215-2025/preclass/tree/main/week_06/data/CalSOFI_subset.csv)):

In [10]:
CalSOFI = pd.read_csv("data/CalSOFI_subset.csv")
display(CalSOFI)

,Cast_Count,Station_ID,Datetime,Depth_m,T_degC,Salinity,SigmaTheta
0,14172,060.0 060.0,1965-01-11 04:43:00,0,12.12,33.030,25.030
1,14172,060.0 060.0,1965-01-11 04:43:00,10,12.08,33.040,25.050
2,14172,060.0 060.0,1965-01-11 04:43:00,20,12.06,33.040,25.050
3,14172,060.0 060.0,1965-01-11 04:43:00,30,12.06,33.040,25.050
4,14172,060.0 060.0,1965-01-11 04:43:00,50,11.18,33.280,25.400
...,...,...,...,...,...,...,...
81369,25948,090.5 043.0,1988-09-22 18:45:00,250,7.82,34.168,26.651
81370,25948,090.5 043.0,1988-09-22 18:45:00,275,7.66,34.203,26.701
81371,25948,090.5 043.0,1988-09-22 18:45:00,300,7.44,34.225,26.750
81372,25948,090.5 043.0,1988-09-22 18:45:00,350,7.17,34.260,26.817


You can find the official documentation of `pd.read_csv()` at [https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). The function has plenty of (keyword-only, optional) arguments. Some highlights are:

+ `names`: the names to use as column names. If not supplied, the first line being read is treated as a header row.
+ `skiprows`: if an `int`, the number of rows to skip before contents are read in; if a Python `list` of `int`, the (0-based) line indices to skip.
+ `usecols`: the columns to read into the DataFrame. Can be a list of column indices or column names.
+ `na_values`: the values to be treated as indicating missing data.
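
As a quick illustration of `names` together with the integer form of `skiprows`, the sketch below reads a made-up, headerless CSV from an in-memory string. The file content, the column names `sample` and `mass_g`, and the variable name `df_named` are all hypothetical:

```python
import io
import pandas as pd

# Hypothetical file content: one comment line, then data without a header row
csv_text = "# made-up metadata line\n1,3.2\n2,-999\n"

df_named = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=1,                  # skip the first line of the file
    names=["sample", "mass_g"],  # supply the column names ourselves
    na_values="-999",            # treat -999 as missing data
)
print(df_named)
```

Since `names` is supplied, no line of the file is consumed as a header, and both data lines end up as rows.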

As an example, here is the content of a simple CSV file (don't worry about the details of the code; we don't expect you to read a file as plain text in this course):

In [11]:
with open("data/header_example.csv") as infile:
    print(infile.read())

# Two lines of metadata, 
# followed by an empty line

A,B,C,D
m,g,cm,L
1,3.2,4,2
2,7.9,7,5
2,-999,5,3


Suppose we learned that "-999" is the code for a missing value, and that we want to skip the metadata lines (indices 0 and 1), the empty line (index 2), and the units line (index 4), and read only the columns "A", "B", and "D". We may then do:

In [12]:
pd.read_csv(
    "data/header_example.csv", 
    skiprows=[0, 1, 2, 4], 
    usecols=["A", "B", "D"], 
    na_values="-999"
)

,A,B,D
0,1,3.2,2
1,2,7.9,5
2,2,NaN,3
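
To confirm that the `-999` entry was read as a missing value, one option is the `.isna()` method, which flags missing entries. The sketch below recreates the file content from the printout above as an in-memory string (an assumption made so that the example runs without the data file); the names `csv_text` and `df_small` are only illustrative:

```python
import io
import pandas as pd

# File content copied from the printout above
csv_text = (
    "# Two lines of metadata, \n"
    "# followed by an empty line\n"
    "\n"
    "A,B,C,D\n"
    "m,g,cm,L\n"
    "1,3.2,4,2\n"
    "2,7.9,7,5\n"
    "2,-999,5,3\n"
)

df_small = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=[0, 1, 2, 4],
    usecols=["A", "B", "D"],
    na_values="-999",
)

# Count the missing entries in each column
print(df_small.isna().sum())
```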


Incidentally, pandas can also be used to read Microsoft Excel files. The relevant function is `pd.read_excel()`, which is documented at [https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html). Most of its arguments are the same as those of `pd.read_csv()`, with the notable exception of the `sheet_name` argument, which is used to specify the sheet to load the data from.

As an example, the CalCOFI data is also saved as an Excel .xlsx document, and we can load it as such:

In [15]:
CalSOFI_2 = pd.read_excel("data/CalSOFI_subset.xlsx", sheet_name="Sheet1")
display(CalSOFI_2)

,Cast_Count,Station_ID,Datetime,Depth_m,T_degC,Salinity,SigmaTheta
0,14172,060.0 060.0,1965-01-11 04:43:00,0,12.12,33.030,25.030
1,14172,060.0 060.0,1965-01-11 04:43:00,10,12.08,33.040,25.050
2,14172,060.0 060.0,1965-01-11 04:43:00,20,12.06,33.040,25.050
3,14172,060.0 060.0,1965-01-11 04:43:00,30,12.06,33.040,25.050
4,14172,060.0 060.0,1965-01-11 04:43:00,50,11.18,33.280,25.400
...,...,...,...,...,...,...,...
81369,25948,090.5 043.0,1988-09-22 18:45:00,250,7.82,34.168,26.651
81370,25948,090.5 043.0,1988-09-22 18:45:00,275,7.66,34.203,26.701
81371,25948,090.5 043.0,1988-09-22 18:45:00,300,7.44,34.225,26.750
81372,25948,090.5 043.0,1988-09-22 18:45:00,350,7.17,34.260,26.817
