# Panel Data

Data often arrives with many observations sharing common features. For example, several measurements can be made in the same location, under the same conditions, or for the same subject. To understand the data and extract meaningful insights, we often need to aggregate these observations. This is where the `groupby()` method comes into play.

## Loading

As always, let's start by importing pandas and loading and cleaning our dataset.

In [3]:
import pandas as pd

df = pd.read_csv("data/sp500_q1_2025.csv")
df

df.info()

df.DlyCalDt.head()

df["DlyCalDt"] = pd.to_datetime(df.DlyCalDt)

df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29882 entries, 0 to 29881
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DlyCalDt     29882 non-null  object 
 1   Ticker       29882 non-null  object 
 2   SecurityNm   29882 non-null  object 
 3   DlyOpen      29789 non-null  float64
 4   DlyHigh      29789 non-null  float64
 5   DlyLow       29789 non-null  float64
 6   DlyClose     29789 non-null  float64
 7   DlyVol       29882 non-null  int64  
 8   SICCD        29882 non-null  int64  
 9   PrimaryExch  29882 non-null  object 
 10  PERMNO       29882 non-null  int64  
 11  PERMCO       29882 non-null  int64  
dtypes: float64(4), int64(4), object(4)
memory usage: 2.7+ MB

We'll stop short of setting the datetime column as our index, though. An index works best when its values are unique (pandas allows duplicate labels, but they make label-based lookups ambiguous). Because this panel data contains many different company stocks for just one quarter of a year, we'll see the same date many times.

## Cleaning

Let's not forget data cleaning! Do we have missing data? Where?
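One way to answer both questions is sketched below on a small toy frame. The column names follow the data set, but the values are made up for illustration, and `toy`, `missing_per_column` and `rows_with_missing` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "Ticker": ["A", "A", "AAL", "AAL"],
    "DlyOpen": [135.0, np.nan, 17.5, 17.0],
    "DlyClose": [134.0, np.nan, 17.2, 16.9],
    "DlyVol": [1000, 0, 5000, 4000],
})

# Do we have missing data? Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Where? Keep only the rows that have at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```

Whether to drop or fill such rows depends on the analysis, so this sketch only locates them.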

## Exploring

Let's explore this panel data a bit more, to answer some questions:

- How many tickers are considered?
- How many securities are considered?
- How many companies are considered?
- Which exchanges are considered?
- Which exchanges appear most often?
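The questions above can be approached with `nunique()`, `unique()` and `value_counts()`. The sketch below uses a toy frame with made-up values (including the exchange codes), and assumes that `PERMNO` identifies a security and `PERMCO` identifies a company.

```python
import pandas as pd

# Toy stand-in for the real data; all values are made up for illustration.
toy = pd.DataFrame({
    "Ticker": ["A", "A", "AAL", "GOOG", "GOOGL"],
    "PERMNO": [1, 1, 2, 3, 4],
    "PERMCO": [10, 10, 20, 30, 30],
    "PrimaryExch": ["N", "N", "Q", "Q", "Q"],
})

# Count distinct tickers, securities and companies.
n_tickers = toy["Ticker"].nunique()
n_securities = toy["PERMNO"].nunique()   # assumes PERMNO identifies a security
n_companies = toy["PERMCO"].nunique()    # assumes PERMCO identifies a company

# List the distinct exchanges, then count how many rows each one has.
exchanges = toy["PrimaryExch"].unique()
exchange_counts = toy["PrimaryExch"].value_counts()

print(n_tickers, n_securities, n_companies)
print(exchanges)
print(exchange_counts)
```

`value_counts()` sorts from most to least frequent, so the first row is the exchange that appears most often.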


## Grouping

What if we wanted to calculate daily returns in this data set? Is it as simple as using `pct_change()`? Let's try.
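A minimal sketch of that attempt, on a toy frame with two tickers and made-up prices (`naive_return` is a hypothetical column name):

```python
import pandas as pd

# Toy stand-in: two tickers stacked one after the other, made-up prices.
toy = pd.DataFrame({
    "Ticker": ["A", "A", "AAL", "AAL"],
    "DlyCalDt": pd.to_datetime(
        ["2025-01-02", "2025-01-03", "2025-01-02", "2025-01-03"]
    ),
    "DlyClose": [100.0, 110.0, 20.0, 22.0],
})

# Apply pct_change() to the whole column, ignoring which ticker each row belongs to.
toy["naive_return"] = toy["DlyClose"].pct_change()
print(toy)
```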

Can you see what's gone wrong here? Our first calculated daily return for American Airlines uses Agilent's last closing price, because `pct_change()` simply compares each row with the row above it. This shows why *grouping* matters, particularly with this kind of panel data.


We can solve this with the `groupby()` method of data frames.
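A minimal sketch of the grouped version, again on made-up prices. Grouping by `Ticker` is an assumption here; any column that identifies a single security would do.

```python
import pandas as pd

# Toy stand-in: two tickers stacked one after the other, made-up prices.
toy = pd.DataFrame({
    "Ticker": ["A", "A", "AAL", "AAL"],
    "DlyCalDt": pd.to_datetime(
        ["2025-01-02", "2025-01-03", "2025-01-02", "2025-01-03"]
    ),
    "DlyClose": [100.0, 110.0, 20.0, 22.0],
})

# Make sure rows are in date order within each ticker before differencing.
toy = toy.sort_values(["Ticker", "DlyCalDt"])

# pct_change() now restarts for every ticker, so the first row of each group is NaN.
toy["daily_return"] = toy.groupby("Ticker")["DlyClose"].pct_change()
print(toy)
```

The first return for each ticker is missing rather than borrowed from the previous ticker's last close.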

Perfect! Grouping is a very powerful way to manipulate panel data. Once you've grouped, you can call functions and they will be applied groupwise as we saw above. Here are some other common functions with groups:
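As a hedged illustration, three pandas methods that work groupwise in the same way are `shift()`, `cumsum()` and `cummax()`. The toy values and the new column names (`prev_close`, `cum_vol`, `running_high`) are made up for this sketch.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "Ticker": ["A", "A", "A", "AAL", "AAL", "AAL"],
    "DlyClose": [100.0, 110.0, 105.0, 20.0, 22.0, 21.0],
    "DlyVol": [10, 20, 30, 1, 2, 3],
})

# Previous day's close within each ticker (NaN on each ticker's first row).
toy["prev_close"] = toy.groupby("Ticker")["DlyClose"].shift(1)

# Running total of volume within each ticker.
toy["cum_vol"] = toy.groupby("Ticker")["DlyVol"].cumsum()

# Highest close seen so far within each ticker.
toy["running_high"] = toy.groupby("Ticker")["DlyClose"].cummax()

print(toy)
```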

Let's see what else we can do with grouping. Recall that we had more tickers than companies. Let's see why that is by looking at how many unique tickers belong to each company (using `Ticker` and `PERMCO`). Then let's list those companies.
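One possible approach is sketched below, assuming `PERMCO` identifies a company. The rows and security names are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; rows and security names are made up.
toy = pd.DataFrame({
    "Ticker": ["A", "A", "GOOG", "GOOG", "GOOGL", "GOOGL"],
    "SecurityNm": ["EXAMPLE A", "EXAMPLE A", "EXAMPLE G CL C",
                   "EXAMPLE G CL C", "EXAMPLE G CL A", "EXAMPLE G CL A"],
    "PERMCO": [10, 10, 30, 30, 30, 30],
})

# Number of distinct tickers for each company.
tickers_per_company = toy.groupby("PERMCO")["Ticker"].nunique()

# Keep only the companies with more than one ticker.
multi = tickers_per_company[tickers_per_company > 1]

# List those companies with one row per ticker.
companies = (
    toy[toy["PERMCO"].isin(multi.index)][["PERMCO", "Ticker", "SecurityNm"]]
    .drop_duplicates()
)
print(tickers_per_company)
print(companies)
```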

### Exercise: Tick Tick

**Part 1** Identify the number of unique tickers traded on each exchange.

In [None]:
## YOUR CODE GOES HERE

**Part 2** Then identify any securities that share a ticker.

In [None]:
## YOUR CODE GOES HERE

## Aggregation

Aggregation functions like `mean()`, `median()`, `sum()`, `min()`, `max()`, `first()`, `last()` and `std()` can be applied to grouped data to give insights across panel data. Say we wanted the average daily return of each traded security, or the max volume traded on any given day for each security?

The exercises above helped us identify that the `PERMNO` column corresponds to unique securities, so let's use that for grouping from now on. 
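A minimal sketch of both questions, grouped by `PERMNO`, on made-up prices and volumes (`daily_return`, `avg_return` and `max_volume` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "PERMNO": [1, 1, 1, 2, 2, 2],
    "DlyClose": [100.0, 110.0, 99.0, 20.0, 22.0, 33.0],
    "DlyVol": [10, 20, 30, 1, 2, 3],
})

# Daily return within each security.
toy["daily_return"] = toy.groupby("PERMNO")["DlyClose"].pct_change()

# Average daily return for each security (mean() skips the NaN first row).
avg_return = toy.groupby("PERMNO")["daily_return"].mean()

# Largest single-day volume for each security.
max_volume = toy.groupby("PERMNO")["DlyVol"].max()

print(avg_return)
print(max_volume)
```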

Useful, but only to a point. The `PERMNO` value is just a number to most of us. What if we want a ticker or name for the security? Let's look at grouping by multiple columns to help!
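As a sketch, adding `Ticker` and `SecurityNm` to the grouping keys carries them through to the result. This assumes each `PERMNO` maps to a single ticker and name; the rows and names below are made up.

```python
import pandas as pd

# Toy stand-in for the real data; rows and security names are made up.
toy = pd.DataFrame({
    "PERMNO": [1, 1, 2, 2],
    "Ticker": ["A", "A", "AAL", "AAL"],
    "SecurityNm": ["EXAMPLE A", "EXAMPLE A", "EXAMPLE AIR", "EXAMPLE AIR"],
    "DlyClose": [100.0, 110.0, 20.0, 22.0],
})

# Group by several columns at once; as_index=False keeps the keys as
# ordinary columns instead of moving them into the index.
avg_close = toy.groupby(
    ["PERMNO", "Ticker", "SecurityNm"], as_index=False
)["DlyClose"].mean()

print(avg_close)
```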

Once we've done these sorts of aggregation, we're often curious to see who sits at the top or the bottom of the distribution. We can use `nlargest()` and its counterpart `nsmallest()` here. Note that `as_index=False` doesn't combine easily with these, since their results are labelled by the index!
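A short sketch on made-up prices, keeping the group keys in the index so that the top and bottom rows stay labelled:

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "PERMNO": [1, 1, 2, 2, 3, 3],
    "Ticker": ["A", "A", "AAL", "AAL", "GOOG", "GOOG"],
    "DlyClose": [100.0, 110.0, 20.0, 22.0, 190.0, 200.0],
})

# Average close per security, with the group keys kept in the index.
avg_close = toy.groupby(["PERMNO", "Ticker"])["DlyClose"].mean()

# Two highest and single lowest average closes; the index says who they are.
top2 = avg_close.nlargest(2)
bottom1 = avg_close.nsmallest(1)

print(top2)
print(bottom1)
```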

Grouping by multiple columns is also helpful when doing aggregation, for example, to find high performers in each month. Because our date is a regular column rather than the index, we need the `.dt` accessor to use any datetime functions.
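A minimal sketch, treating "high performer" as the highest average close in a month. That definition is an assumption for illustration, as are the `Month` column and the made-up prices.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "DlyCalDt": pd.to_datetime([
        "2025-01-02", "2025-01-03", "2025-01-02", "2025-01-03",
        "2025-02-03", "2025-02-04", "2025-02-03", "2025-02-04",
    ]),
    "Ticker": ["A", "A", "AAL", "AAL", "A", "A", "AAL", "AAL"],
    "DlyClose": [100.0, 110.0, 20.0, 22.0, 90.0, 92.0, 120.0, 124.0],
})

# The date is a regular column, so datetime parts come from the .dt accessor.
toy["Month"] = toy["DlyCalDt"].dt.month

# Average close for each (month, ticker) pair.
monthly = toy.groupby(["Month", "Ticker"], as_index=False)["DlyClose"].mean()

# For each month, pick the row with the highest average close.
best = monthly.loc[monthly.groupby("Month")["DlyClose"].idxmax()]

print(monthly)
print(best)
```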

### Exercise: Good Days

Which two days of the week see the highest average close in this data set, and what is the average close for those days?  

In [None]:
## YOUR CODE GOES HERE

### Exercise: Trading Exchanges

Next identify the total trading volume of each exchange.

In [None]:
## YOUR CODE GOES HERE

### Exercise: The 1000 Club

For securities that reached a closing price above 1000, how many times in each month did they achieve this?

In [None]:
## YOUR CODE GOES HERE

## Multiple Aggregation

We can use the `agg()` method, and pass it a dictionary to do multiple aggregations at once on grouped data. This can be helpful for further analyses, or for producing a more descriptive aggregated data frame.
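As a sketch, the dictionary maps each column to the aggregation to apply to it. The values below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "Ticker": ["A", "A", "AAL", "AAL"],
    "DlyHigh": [101.0, 112.0, 21.0, 23.0],
    "DlyLow": [99.0, 108.0, 19.0, 21.5],
    "DlyVol": [10, 20, 1, 2],
})

# One aggregation per column: highest high, lowest low and total volume.
summary = toy.groupby("Ticker").agg({
    "DlyHigh": "max",
    "DlyLow": "min",
    "DlyVol": "sum",
})

print(summary)
```

The result has one row per ticker and one column per dictionary key, so further columns can be derived from it in the usual way.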

### Quick Quarter Query

Using multiple aggregation, create an aggregated data frame with ticker and security name, the first open price in the period for each security and the last close price in the period for each security. Create a new column in this aggregated data frame that shows the price difference between final close and initial open for each security.

In [None]:
## YOUR CODE GOES HERE