# Pandas DataFrame and tabular data

## Pandas and pandas DataFrames

In data science we often encounter data that are organized in tables. Our main tool to handle such data is the third-party module `pandas`, which we import as:

In [2]:
import pandas as pd

As before, the `as pd` part is optional but is relatively standard. It defines a shorthand `pd` for the content of the `pandas` package.

Just as `ndarray` is the central object offered by numpy, pandas offers a central object called `DataFrame`. Structurally, pandas DataFrames are row homogeneous (i.e., each row is similar to the next row) but column heterogeneous (i.e., one column may differ from the next one). This makes the pandas DataFrame a good representation of tabular data, since tabular data tends to be row homogeneous and column heterogeneous too.

To create a new DataFrame, we can use the `pd.DataFrame()` function, to which we supply key-value pairs enclosed by curly braces `{}`, with `:` separating each key from its value. In this specification, the keys correspond to column names, while the values (usually a python list or a numpy array) represent the data in that column. For example:

In [3]:
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": ["this", "that", "here", "there"],
    "C": [1.3, 2.4, 7.5, 8.1]
})

display(df)

Unnamed: 0,A,B,C
0,1,this,1.3
1,2,that,2.4
2,3,here,7.5
3,4,there,8.1


*Note #1*: the syntax `{key1: value1, key2: value2, ...}` defines a python **dictionary**. It is a useful data structure from core python but we won't be making much use of it in this course other than to supply it as arguments to functions.

*Note #2*: In the above we used the `display()` function to display a pandas DataFrame. The `display()` function is provided by the Jupyter notebook environment and is used to display information using webpage (HTML) technology, which tends to be richer than the plain-text output you'll get from using `print()`.
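To make the idea of column heterogeneity concrete, here is a minimal sketch that rebuilds the same small DataFrame and inspects the data type of each column through the `.dtypes` attribute. The exact type names printed (especially for the text column) may depend on your pandas version.

```python
import pandas as pd

# Same small table as in the example above
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": ["this", "that", "here", "there"],
    "C": [1.3, 2.4, 7.5, 8.1],
})

# Each column holds a single data type, but the types differ between columns:
# integers in "A", text in "B", floating-point numbers in "C"
column_types = df.dtypes
print(column_types)
```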

In addition to being able to handle column-heterogeneous data, when compared to a 2D ndarray, a pandas DataFrame also has the advantage that it retains row and column labels. We can extract these using the `.columns` attribute and the `.index` attribute:

In [4]:
df.columns # column labels

Index(['A', 'B', 'C'], dtype='object')

In [5]:
df.index # row labels

RangeIndex(start=0, stop=4, step=1)

Note that because we didn't supply an index to the DataFrame, the row labels default to a numerical range that starts counting from 0. Instead, we could have specified the row labels using the `index` argument of `pd.DataFrame()`:

In [6]:
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "B": ["this", "that", "here", "there"],
        "C": [1.3, 2.4, 7.5, 8.1]
    }, index=["2017", "2018", "2019", "2020"]
)

display(df)

Unnamed: 0,A,B,C
2017,1,this,1.3
2018,2,that,2.4
2019,3,here,7.5
2020,4,there,8.1


A single column of a DataFrame is a pandas Series. We can extract a Series from a DataFrame using the square brackets `[]` syntax. For example:

In [7]:
df["A"]

2017    1
2018    2
2019    3
2020    4
Name: A, dtype: int64

Notice that a pandas Series comes with its own index and name.

We can extract the data contained within the Series using the `.values` attribute:

In [9]:
df["A"].values

array([1, 2, 3, 4], dtype=int64)

Similarly, we can extract the data contained in the column labels and row labels of a DataFrame using the `.values` attribute:

In [10]:
df.columns.values

array(['A', 'B', 'C'], dtype=object)

In [11]:
df.index.values

array(['2017', '2018', '2019', '2020'], dtype=object)
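As a side note, pandas also provides a `.to_numpy()` method, which the pandas documentation recommends over `.values` in new code, and a `.tolist()` method for getting a plain python list. The sketch below uses a trimmed-down version of the DataFrame above (only columns "A" and "C") to show both; it is an illustration rather than a required change to the examples in this reading.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "C": [1.3, 2.4, 7.5, 8.1]},
    index=["2017", "2018", "2019", "2020"],
)

a_values = df["A"].to_numpy()    # numpy array holding the data of column "A"
labels = df.index.to_numpy()     # row labels as a numpy array
labels_list = df.index.tolist()  # row labels as a plain python list

print(type(a_values), a_values)
print(labels_list)
```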

## Importing data into pandas DataFrame

More often, your DataFrame will be created from external tabular data. The most portable format for such data is a CSV (comma-separated values) file, which can be read conveniently using the `pd.read_csv()` function. By default, pandas does not store strings in the most efficient format. To force conversion, apply the `.convert_dtypes()` method to the resulting DataFrame.

For example, a subset of the CalCOFI dataset ([https://calcofi.org/](https://calcofi.org/)) can be loaded as follows (*note*: the data file can be downloaded [here](https://github.com/OCEAN-215-2025/preclass/tree/main/week_06/data/CalCOFI_subset.csv)):

In [12]:
CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
display(CalCOFI)

Unnamed: 0,Cast_Count,Station_ID,Datetime,Depth_m,T_degC,Salinity,SigmaTheta
0,992,090.0 070.0,1950-02-06 19:54:00,0,14.040,33.1700,24.76600
1,992,090.0 070.0,1950-02-06 19:54:00,10,13.950,33.2100,24.81500
2,992,090.0 070.0,1950-02-06 19:54:00,20,13.900,33.2100,24.82600
3,992,090.0 070.0,1950-02-06 19:54:00,23,13.880,33.2100,24.83000
4,992,090.0 070.0,1950-02-06 19:54:00,30,13.810,33.2180,24.85100
...,...,...,...,...,...,...,...
10052,35578,090.0 070.0,2021-01-21 13:36:00,300,7.692,34.1712,26.67697
10053,35578,090.0 070.0,2021-01-21 13:36:00,381,7.144,34.2443,26.81386
10054,35578,090.0 070.0,2021-01-21 13:36:00,400,7.031,34.2746,26.85372
10055,35578,090.0 070.0,2021-01-21 13:36:00,500,6.293,34.3126,26.98372
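The `.convert_dtypes()` method mentioned earlier can be tried out without any data file. The following is a minimal sketch that reads a few lines of made-up CSV text (the column names and values are invented for illustration, and `io.StringIO` is used only to stand in for a file) and compares the column types before and after the conversion. The type names printed depend on your pandas version.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file
csv_text = (
    "station,depth_m,temp_degC\n"
    "A1,0,14.04\n"
    "A1,10,13.95\n"
    "B2,0,13.90\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
converted = raw.convert_dtypes()  # returns a new DataFrame; raw is unchanged

print(raw.dtypes)        # column types as read by pd.read_csv()
print(converted.dtypes)  # column types after the conversion
```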


You can find the official documentation of `pd.read_csv()` at [https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). The function has plenty of (keyword-only, optional) arguments. Some highlights are:

+ `names`: The names to use as column names. If not supplied the first line being read is treated as a header row
+ `skiprows`: if an `int`, the number of rows to skip before contents are read in; if a python `list` of `int`, the line indices to skip.
+ `usecols`: the columns to read into the DataFrame. Can be a list of column indices or column names
+ `na_values`: which values are to be treated as indication of missing data

As an example, here is the content of a simple CSV file (don't worry about the details of the code; we don't expect you to read a file as plain text in this course):

In [13]:
with open("data/header_example.csv") as infile:
    print(infile.read())

# Two lines of metadata, 
# followed by an empty line

A,B,C,D
m,g,cm,L
1,3.2,4,2
2,7.9,7,5
2,-999,5,3


Suppose we learned that "-999" is the code for a missing value, and that we want to skip the metadata lines (indices 0, 1), the empty line (index 2), and the units line (index 4), and read only the columns "A", "B", and "D". We may do:

In [14]:
pd.read_csv(
    "data/header_example.csv", 
    skiprows=[0, 1, 2, 4], 
    usecols=["A", "B", "D"], 
    na_values="-999"
)

Unnamed: 0,A,B,D
0,1,3.2,2
1,2,7.9,5
2,2,,3


As a second example, consider the Seattle tide prediction data from Jan 1, 2025 to Jan 31, 2025, obtained from [NOAA](https://tidesandcurrents.noaa.gov/noaatidepredictions.html?id=9447130) (a copy of the file can be found [here](https://github.com/OCEAN-215-2025/preclass/tree/main/week_06/data/tide_prediction_2025-01.txt)). If you open the file in Jupyter Hub, you'll find that this is a text file with 13 lines of metadata, followed by a header at line 14, and data from that point on. Note also that the entries are separated by whitespace (actually tabs). We can load this file as a pandas DataFrame like so:

In [15]:
tides = pd.read_csv(
    "data/tide_prediction_2025-01.txt", 
    sep="\\s+", skiprows=13
)

display(tides)

Unnamed: 0,Date,Day,Time,Pred,High/Low
2025/01/01,Wed,06:58,AM,3.80,H
2025/01/01,Wed,12:17,PM,2.47,L
2025/01/01,Wed,04:41,PM,3.19,H
2025/01/01,Wed,11:54,PM,-0.73,L
2025/01/02,Thu,07:28,AM,3.83,H
...,...,...,...,...,...
2025/01/30,Thu,04:44,PM,3.28,H
2025/01/30,Thu,11:36,PM,-0.57,L
2025/01/31,Fri,06:45,AM,3.84,H
2025/01/31,Fri,12:29,PM,1.68,L


In the above, `sep="\\s+"` tells pandas that the entries are separated by one or more whitespace characters. Unfortunately, the header row of this file has one fewer entry than the data rows (each time is split into two pieces, such as `06:58` and `AM`), which causes pandas to turn the first column into the index of the DataFrame. To deal with this, we use `.reset_index()` to turn the index back into a regular column:

In [16]:
tides = tides.reset_index()
display(tides)

Unnamed: 0,index,Date,Day,Time,Pred,High/Low
0,2025/01/01,Wed,06:58,AM,3.80,H
1,2025/01/01,Wed,12:17,PM,2.47,L
2,2025/01/01,Wed,04:41,PM,3.19,H
3,2025/01/01,Wed,11:54,PM,-0.73,L
4,2025/01/02,Thu,07:28,AM,3.83,H
...,...,...,...,...,...,...
114,2025/01/30,Thu,04:44,PM,3.28,H
115,2025/01/30,Thu,11:36,PM,-0.57,L
116,2025/01/31,Fri,06:45,AM,3.84,H
117,2025/01/31,Fri,12:29,PM,1.68,L


Additional cleanup will be needed to fix the header row, etc., and you will learn some of these techniques in the later parts of this week's readings.
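For the curious, one possible way to avoid the shifted header is to skip the file's own header line and supply a full set of column names through the `names` argument. The sketch below works on two made-up lines that only imitate the layout of the tide file (the real file also has 13 metadata lines and uses tabs), and the column name `"AM/PM"` is a hypothetical label chosen for illustration.

```python
import io
import pandas as pd

# Two made-up lines imitating the tide file: 5 header names but 6 data fields
text = (
    "Date Day Time Pred High/Low\n"
    "2025/01/01 Wed 06:58 AM 3.80 H\n"
    "2025/01/01 Wed 12:17 PM 2.47 L\n"
)

# Default behavior: the extra first field becomes the index,
# so every column name ends up one position off
raw = pd.read_csv(io.StringIO(text), sep="\\s+")

# Alternative: skip the header line and name all six fields ourselves
column_names = ["Date", "Day", "Time", "AM/PM", "Pred", "High/Low"]
fixed = pd.read_csv(
    io.StringIO(text),
    sep="\\s+", skiprows=1, names=column_names
)

print(raw)
print(fixed)
```

With the names supplied explicitly, the dates stay in the "Date" column and the row labels fall back to the default range starting from 0.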

Incidentally, pandas can also be used to read Microsoft Excel files. The relevant function is `pd.read_excel()`, which is documented at [https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html). Most of its arguments are the same as those of `pd.read_csv()`, with the notable exception of the `sheet_name` argument, which is used to specify the sheet to load the data from.

As an example, the CalCOFI data is also saved as an Excel .xlsx document (which you can find [here](https://github.com/OCEAN-215-2025/preclass/tree/main/week_06/data/CalCOFI_subset.xlsx)), and we can load it as such:

In [18]:
CalCOFI_2 = pd.read_excel("data/CalCOFI_subset.xlsx", sheet_name="Sheet1")
display(CalCOFI_2)

Unnamed: 0,Cast_Count,Station_ID,Datetime,Depth_m,T_degC,Salinity,SigmaTheta
0,992,090.0 070.0,1950-02-06 19:54:00,0,14.040,33.1700,24.76600
1,992,090.0 070.0,1950-02-06 19:54:00,10,13.950,33.2100,24.81500
2,992,090.0 070.0,1950-02-06 19:54:00,20,13.900,33.2100,24.82600
3,992,090.0 070.0,1950-02-06 19:54:00,23,13.880,33.2100,24.83000
4,992,090.0 070.0,1950-02-06 19:54:00,30,13.810,33.2180,24.85100
...,...,...,...,...,...,...,...
10052,35578,090.0 070.0,2021-01-21 13:36:00,300,7.692,34.1712,26.67697
10053,35578,090.0 070.0,2021-01-21 13:36:00,381,7.144,34.2443,26.81386
10054,35578,090.0 070.0,2021-01-21 13:36:00,400,7.031,34.2746,26.85372
10055,35578,090.0 070.0,2021-01-21 13:36:00,500,6.293,34.3126,26.98372


## Combining multiple dataframes

Sometimes it is necessary to first save data in separate files and then combine them during analysis. For example, suppose we are interested in tide measurements. The [NOAA](https://tidesandcurrents.noaa.gov/noaatidepredictions.html?id=9447130) interface only lets us download 31 days of data at a time. So if you want to analyze tide patterns from January to March of 2025, your data will consist of 3 files:

In [19]:
tides_Jan = pd.read_csv(
    "data/tide_prediction_2025-01.txt", 
    sep="\\s+", skiprows=13
).reset_index()

tides_Feb = pd.read_csv(
    "data/tide_prediction_2025-02.txt", 
    sep="\\s+", skiprows=13
).reset_index()

tides_Mar = pd.read_csv(
    "data/tide_prediction_2025-03.txt", 
    sep="\\s+", skiprows=13
).reset_index()

(As above, these need some further manipulation to be useful. Nevertheless, since the 3 dataframes are at least *consistent* in their column names, we can proceed to combine them.)

We can combine the three dataframes into a single one using `pd.concat()`:

In [20]:
pd.concat([tides_Jan, tides_Feb, tides_Mar])

Unnamed: 0,index,Date,Day,Time,Pred,High/Low
0,2025/01/01,Wed,06:58,AM,3.80,H
1,2025/01/01,Wed,12:17,PM,2.47,L
2,2025/01/01,Wed,04:41,PM,3.19,H
3,2025/01/01,Wed,11:54,PM,-0.73,L
4,2025/01/02,Thu,07:28,AM,3.83,H
...,...,...,...,...,...,...
114,2025/03/30,Sun,06:52,PM,3.46,H
115,2025/03/31,Mon,12:37,AM,1.10,L
116,2025/03/31,Mon,06:25,AM,3.70,H
117,2025/03/31,Mon,01:09,PM,-0.48,L
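Notice that the row labels in the combined output end at 117 again: by default `pd.concat()` keeps the original row labels of each piece, so the same label can appear more than once. If you would rather have the rows renumbered from 0, you can pass `ignore_index=True`. Here is a minimal sketch using two tiny made-up DataFrames (the values are for illustration only):

```python
import pandas as pd

# Two tiny made-up pieces with the same column names
part_1 = pd.DataFrame({"Date": ["2025/01/30", "2025/01/31"], "Pred": [3.28, 3.84]})
part_2 = pd.DataFrame({"Date": ["2025/02/01", "2025/02/02"], "Pred": [3.50, 3.60]})

# Default: each piece keeps its own row labels (0, 1, 0, 1)
stacked = pd.concat([part_1, part_2])

# With ignore_index=True the rows are relabeled 0, 1, 2, 3
renumbered = pd.concat([part_1, part_2], ignore_index=True)

print(stacked)
print(renumbered)
```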


## Writing dataframe to file

Sometimes it is useful to save your intermediate data into a new file (e.g., so that other people can access it outside of python, or so that you don't have to carry out the same data cleaning steps every time). To export your DataFrame into a CSV file, all you need to do is call the `.to_csv()` method on it. For example:

In [21]:
# generate a new dataframe

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "B": ["this", "that", "here", "there"],
        "C": [1.3, 2.4, 7.5, 8.1]
    }, index=["2017", "2018", "2019", "2020"]
)

# export to a csv file named "new_data.csv"
# NOTE: the output folder needs to already exist
df.to_csv("output/new_data.csv")
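To check that the export worked, you can read the file back in. The sketch below is one way to do this under a few assumptions: it writes to a temporary folder (so that it does not touch your own files), creates the output folder first using `pathlib`, and uses `index_col=0` to tell `pd.read_csv()` that the first column holds the row labels.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "B": ["this", "that", "here", "there"],
        "C": [1.3, 2.4, 7.5, 8.1]
    }, index=["2017", "2018", "2019", "2020"]
)

# Create the output folder inside a temporary location if it does not exist yet
out_dir = Path(tempfile.mkdtemp()) / "output"
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / "new_data.csv"

df.to_csv(out_file)

# Read the file back, treating the first column as the row labels
restored = pd.read_csv(out_file, index_col=0)
print(restored)
print(restored.index)
```

One thing to watch for: a CSV file stores only text, so the row labels that were strings (`"2017"`, ...) come back as integers when the file is read again.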