# Pandas

The `pandas` module is a powerful Python library for data manipulation and analysis. It makes tabular data easier to read in and lets you perform arithmetic or logical operations on it. The results can be output as tables or plots.

Working with `pandas` objects is similar to working with lists, numpy arrays, and dictionaries: you can add new elements, update them, and manipulate them. However, there are some subtle differences and new features that make `pandas` particularly powerful for data analysis.

We shall start by importing the `pandas` module. Conventionally, pandas is imported as "pd":

In [None]:
import pandas as pd

## DataFrames - The primary pandas structure

Just as `numpy` introduced the `array` object, `pandas` introduces new structures which allow more advanced functionality. The primary structure is called a `DataFrame`, which looks a lot like a table of data.

One way to create a `DataFrame` is using a dictionary where the keys are column names and the values are the column entries:

In [None]:
molecule_data_dict = {"formula": ["CH4", "C2H6", "C3H8"],
                      "name": ["methane", "ethane", "propane"], 
                      "mol_mass": [16., 30.1, 44.1]}

molecule_df = pd.DataFrame(molecule_data_dict)
molecule_df

If we examine the pandas DataFrame above, we can see the following features:
- This is a two-dimensional "table" of data with rows and columns
- Initially, each row has been labelled by number (0, 1, 2). This is known as the **index**
- Each column has been labelled by a **column name** ("formula", "name", "mol_mass")

## Working with Rows: The Index

The index is an important feature of pandas DataFrames. It builds upon other ordered objects (like lists and numpy arrays) but also adds labelling (like dictionaries). By default, rows are indexed as 0, 1, 2... but we can replace these numbers with any labels that describe our rows in a useful way.

To make the rows more meaningful, we can label them using the *formula column*. We can set a new index from a column using the `set_index` method:

In [None]:
molecule_df = molecule_df.set_index("formula")
molecule_df

*Note: by default the `set_index()` method returns a new DataFrame and leaves the original unchanged. To update our original DataFrame, the result needs to be assigned back to our original variable name.*
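The behaviour described in the note can be checked with a small sketch. The names `example_df` and `indexed_df` are hypothetical and used only for illustration; the data mirrors the molecule table above.

```python
import pandas as pd

# Hypothetical example data, mirroring the structure used above
example_df = pd.DataFrame({"formula": ["CH4", "C2H6"],
                           "mol_mass": [16.0, 30.1]})

# set_index returns a new DataFrame; example_df itself is not modified
indexed_df = example_df.set_index("formula")

print(list(example_df.columns))  # "formula" is still an ordinary column here
print(list(indexed_df.columns))  # "formula" has moved into the index here
print(indexed_df.index.name)
```

Because the result is a separate object, it is only kept if it is assigned to a variable, which is why the cell above reuses the name `molecule_df`.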

The benefit is that we can now access a row using our meaningful index name. We do this using the `.loc[]` syntax:

In [None]:
molecule_df.loc["CH4"]

Note that we use *square brackets not round brackets* after `.loc`. This is because `.loc` is an indexer rather than a method, so it is used like the square-bracket indexing of a list or dictionary.

We can access the full set of index values using the `.index` attribute:

In [None]:
molecule_df.index

Even though this returns an `Index` object, we can access elements using positional indexing as we would for a list or numpy array:

In [None]:
molecule_df.index[0]

## Working with Columns

Each column can be accessed from a pandas DataFrame by using the name of the column:

In [None]:
molecule_df["mol_mass"]

This is similar to accessing values using keys in a dictionary. Each column is stored as a pandas `Series`, which is a one-dimensional version of a DataFrame. This is similar to a list or a 1D numpy array but has an index just like a DataFrame.
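As a minimal sketch of the relationship between a DataFrame and its columns, the snippet below extracts one column and inspects it. The variable names (`example_df`, `mass_column`) are hypothetical; the data is a shortened copy of the molecule table.

```python
import pandas as pd

# Hypothetical shortened copy of the molecule table, indexed by formula
example_df = pd.DataFrame({"name": ["methane", "ethane"],
                           "mol_mass": [16.0, 30.1]},
                          index=["CH4", "C2H6"])

mass_column = example_df["mol_mass"]

# The column is a Series and shares the row labels of the DataFrame
print(type(mass_column))
print(mass_column.index.equals(example_df.index))

# A single value can be looked up by its row label
print(mass_column["C2H6"])
```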

To find the full names of all columns in a DataFrame, we can access the `.columns` attribute:

In [None]:
molecule_df.columns

In [None]:
molecule_df.columns[0]

## Working with Values and Data Types

A pandas DataFrame contains values for each row and column. Each column (Series) has a particular `dtype`, which is the data type (e.g. float, int, str) associated with that data. This is the same terminology we saw with `numpy`.

In [None]:
molecule_df["mol_mass"].dtype

To access one element in a DataFrame, we can:
- Select the row using `.loc[]` and then the column using `[]` indexing
- Use `.loc[]` with both row and column (similar to multi-indexing in numpy)

In [None]:
# Method 1: Sequential selection
row = molecule_df.loc["CH4"]
element = row["mol_mass"]
print(element)

In [None]:
# Method 2: Using .loc with both dimensions (comma-separated)
element = molecule_df.loc["CH4", "mol_mass"]
print(element)

## Applying Operations

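The `pandas` module is built on underlying `numpy` arrays and brings across similar functionality. You can even access the underlying numpy array using the `.values` attribute (the `.to_numpy()` method does the same job and is the form recommended in the pandas documentation):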

In [None]:
column = molecule_df["mol_mass"]
underlying_values = column.values
print(underlying_values, type(underlying_values))

As with numpy arrays, we can perform arithmetic or logical operations across a whole column (a pandas Series) at once:

In [None]:
column / 2

We can also combine multiple pandas Series with the same indices:

In [None]:
# Create a Series object directly
index = column.index
number_of_carbon_atoms = pd.Series([1, 2, 3], index=index)

column / number_of_carbon_atoms

Just as with numpy, this operation combines these two objects element by element. For a pandas Series, this matches on the index rather than the position.
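To illustrate matching on the index, here is a small sketch in which the two Series carry the same labels in a different order. The Series names are hypothetical; the values are taken from the molecule example.

```python
import pandas as pd

masses = pd.Series([16.0, 30.1, 44.1], index=["CH4", "C2H6", "C3H8"])

# Same labels as above, deliberately listed in a different order
carbons = pd.Series([3, 1, 2], index=["C3H8", "CH4", "C2H6"])

# Each mass is divided by the carbon count that has the same label,
# regardless of where that label sits in either Series
mass_per_carbon = masses / carbons
print(mass_per_carbon)
```

With plain numpy arrays the same division would pair values purely by position, so propane's carbon count would be applied to methane.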

## Statistical Operations

Similar to numpy arrays, pandas provides many methods for performing statistical operations. It is generally better to use the pandas methods when dealing with pandas objects, because they handle missing data sensibly: by default, missing values (`NaN`) are skipped rather than propagated into the result.
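The difference in how missing data is treated can be seen in a short sketch. The Series `values_with_gap` is a hypothetical example containing one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
values_with_gap = pd.Series([1.0, np.nan, 3.0])

# The pandas method skips NaN by default (skipna=True)
pandas_mean = values_with_gap.mean()

# Applying numpy to the raw array lets the NaN propagate into the result
numpy_mean = np.mean(values_with_gap.to_numpy())

print(pandas_mean)
print(numpy_mean)
```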

Common statistical operations include:

In [None]:
# Calculate the mean
column.mean()

In [None]:
# Calculate the sum
column.sum()

In [None]:
# Find the maximum
column.max()

In [None]:
# Find the position of the maximum
column.argmax()

In [None]:
# Find the index of the maximum
column.idxmax()

In [None]:
# Use this index to extract the relevant row
index_max = column.idxmax()
molecule_df.loc[index_max]

## Updating DataFrames: Adding New Columns

Let's create a new DataFrame to demonstrate how to update and add new information:

In [None]:
city_data = {"city": ["Melbourne", "Mumbai", "Paris", "Tokyo"],
             "population": [4850740, 20667665, 11078546, 37339804],
             "area (km2)": [9992, 603, 105, 2188]}

city_df = pd.DataFrame(city_data)
city_df = city_df.set_index("city")
city_df

We can extract a column and perform operations on it:

In [None]:
city_df["population"] / 1e6

To add this as a new column, we use syntax similar to assigning new values to dictionaries - define a new key and assign the new data:

In [None]:
city_df["population (million)"] = city_df["population"] / 1e6
city_df

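When assigning a list or array as a new column, it must have the same length as the number of rows in the DataFrame. A single value can also be assigned, in which case it is repeated for every row.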
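A brief sketch of what happens with different kinds of assigned values. The DataFrame and the column names `checked` and `too_short` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical two-row table
example_df = pd.DataFrame({"population": [4850740, 20667665]},
                          index=["Melbourne", "Mumbai"])

# A single value is repeated for every row
example_df["checked"] = False

# A list whose length does not match the number of rows is rejected
try:
    example_df["too_short"] = [1.0]
    length_error = None
except ValueError as err:
    length_error = err

print(example_df)
print(length_error)
```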

We can also apply operations between columns to create new columns. For instance, we can calculate population density:

In [None]:
population_million = city_df["population (million)"]
area_sqkm = city_df["area (km2)"]
population_density = population_million / area_sqkm

city_df["population density (million/km2)"] = population_density
city_df

## Filtering Data

Similar to Boolean array indexing in numpy, we can filter pandas Series or DataFrames based on conditions.

Let's filter to find cities with populations over 15 million:

In [None]:
population_million = city_df["population (million)"]
high_population = population_million[population_million > 15]
high_population

We can break this down into steps. First, create the filter:

In [None]:
pop_filt = (population_million > 15)
print(pop_filt)

The filter contains True and False values for each index. We can apply this filter to our original column:

In [None]:
high_population = population_million[pop_filt]
high_population

We can also filter the entire DataFrame:

In [None]:
city_df_high_population = city_df[pop_filt]
city_df_high_population

This is equivalent to writing:

In [None]:
city_df_high_population = city_df[city_df["population (million)"] > 15]
city_df_high_population

### Filtering with Multiple Conditions

We can filter using multiple conditions with bitwise operators `&` (and) and `|` (or). Each condition must be surrounded by round brackets.

To find populations in a range between 15 and 25 million:

In [None]:
population_million[(population_million > 15) & (population_million <= 25)]

We can filter on multiple columns at once:

In [None]:
filt = (city_df["population (million)"] > 15) & (city_df["area (km2)"] > 2000)
city_df[filt]

This shows all rows where the city population is greater than 15 million **and** the area is greater than 2000 square kilometres.
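The sketch below shows the `|` (or) operator, and why the bitwise operators are needed instead of Python's `and`/`or` keywords. The Series `populations` is a hypothetical stand-in, with values rounded from the city data above.

```python
import pandas as pd

# Populations in millions, rounded from the city data above
populations = pd.Series([4.85, 20.67, 11.08, 37.34],
                        index=["Melbourne", "Mumbai", "Paris", "Tokyo"])

# "or": keep cities below 5 million or above 30 million
extremes = populations[(populations < 5) | (populations > 30)]
print(extremes)

# The `and` keyword asks for a single True/False value for a whole Series,
# which pandas refuses to provide, so a ValueError is raised
try:
    (populations > 15) and (populations <= 25)
    keyword_error = None
except ValueError as err:
    keyword_error = err
print(type(keyword_error).__name__)
```

The round brackets matter because `&` and `|` bind more tightly than comparisons such as `>`; without them the expression is grouped in an unintended way.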

## Reading Data from Files

One of the main strengths of pandas is its powerful tools for reading tabular data. The `read_csv()` function is extremely versatile for reading comma-separated values (CSV) files.

### What is a CSV file?

CSV stands for "comma separated values". These are plain text files where data columns are separated by commas. They are much simpler than Excel files and can be read by any software capable of reading text.

Example of a CSV file content:

```
formula,name,mol_mass
H2O,Water,18.01528
CO2,Carbon Dioxide,44.0095
C6H12O6,Glucose,180.156
```

### Reading CSV files with pandas

Here's an example of reading a csv file:

In [1]:
import pandas as pd

filename = "data/kinetics_example_ArN2.csv"
df = pd.read_csv(filename, skiprows=2)

df

| | time (s) | conc (nmol/dm3) |
|---|---|---|
| 0 | 0 | 39.75 |
| 1 | 100 | 30.8 |
| 2 | 150 | 27.02 |
| 3 | 200 | 23.79 |
| 4 | 250 | 20.99 |
| 5 | 270 | 19.88 |
| 6 | 300 | 18.47 |
| 7 | 545 | 9.93 |

In this code:
- We provide the `read_csv()` function with the filename (required input)
- We use the optional `skiprows` parameter to skip the first two lines of the file, which come before the column names
- This creates a DataFrame object from the data in the file
- Column names are pulled from the first row that is read (after skipped rows)


This reads the data from the csv file into a pandas DataFrame. From here, we can manipulate and analyze the data using all the tools we've discussed above.
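To make the effect of `skiprows` concrete, here is a minimal sketch that reads from an in-memory string instead of a file (using `StringIO` from the standard `io` module). The two leading description lines are hypothetical: the actual layout of the kinetics file is not shown here, so this only assumes a file with two lines of text before the column names.

```python
from io import StringIO
import pandas as pd

# Hypothetical file content: two description lines, then the column names,
# then the data rows
file_text = """Kinetics measurement (hypothetical description line)
Recorded for illustration only
time (s),conc (nmol/dm3)
0,39.75
100,30.8
"""

# Skip the two description lines; the next line supplies the column names
example_df = pd.read_csv(StringIO(file_text), skiprows=2)

print(list(example_df.columns))
print(example_df)
```

Without `skiprows=2`, pandas would try to use the first description line as the column names.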


A key option for `read_csv()` is the `sep` parameter, which allows you to specify a different delimiter if your data is not comma-separated. For example, if your data is tab-separated, you can use `sep='\t'`.


The examples below use (for convenience) the `StringIO` class from the `io` module to simulate reading from a file, so that the content of the "file" is immediately apparent in the code snippet itself:

In [4]:
from io import StringIO
import pandas as pd
# Create a tab-delimited string
tab_delimited_data = "Name\tAge\tCity\nAlice\t25\tMelbourne\nBob\t30\tParis\nCharlie\t35\tTokyo"

# Treat it as a file and read with pandas
df_tab = pd.read_csv(StringIO(tab_delimited_data), sep='\t')
df_tab

| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | Melbourne |
| 1 | Bob | 30 | Paris |
| 2 | Charlie | 35 | Tokyo |


A similar example follows, this time using triple quotes for the multi-line string and a different separator (`;`):

In [8]:
# Create a semicolon-delimited, multi-line string
semicolon_delimited_data = """Name;Age;City
Alice;25;Melbourne
Bob0;3;Paris
Charlie;55;Tokyo"""

# Treat it as a file and read with pandas
df_semicolon = pd.read_csv(StringIO(semicolon_delimited_data), sep=';')
df_semicolon

| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | Melbourne |
| 1 | Bob0 | 3 | Paris |
| 2 | Charlie | 55 | Tokyo |


## Conclusion

We covered the basics of the `pandas` module, including creating and manipulating DataFrames, accessing rows and columns, performing operations, filtering data, and reading data from CSV files.

`pandas` is the central tool for data analysis in Python.

It may appear to duplicate some of the operations we have seen earlier with `numpy` or *vanilla Python*. However, the case for using `pandas` lies in its data organisation and the construction of coherent pipelines.


**Caveats.** 

- `pandas` has a rather specific syntax, and learning it can take a while. It is important to explore the documentation, notably its **tutorials**:

    - https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html

- `pandas` is also a good example of a situation where copying and adapting code snippets from online resources (e.g. Stack Overflow) is a common practice.
- It is also a good idea to keep a "cheat sheet" of common operations handy for reference.
- Finally, LLMs usually have a good knowledge of `pandas` and can be a useful resource for quick questions.