# Panel Data

Sometimes data arrives with many observations sharing common features. For example, several measurements may be taken at the same location, under the same conditions, or for the same subject. To understand such data and extract meaningful insights, we often need to aggregate these observations. This is where the pandas `groupby()` method comes into play.

## Exploring Panel Data

As always, let's start by importing pandas and loading our dataset. This time our conversion to datetime will be a bit different.

In [None]:
import pandas as pd

# Load the data
df = pd.read_csv("https://raw.githubusercontent.com/ImperialCollegeLondon/efds-ta-python/refs/heads/main/data/sec_data.csv")

#df.info()
df.datadate = pd.to_datetime(df.datadate, format="%d/%m/%Y")  # capital %Y is the four-digit year
df

Unnamed: 0,GVKEY,iid,datadate,tic,conm,cshtrd,prcod,prcld,prchd,prccd,exchg
0,1004,1,2023-01-03,AIR,AAR CORP,260279,45.09,44.21,45.5800,44.60,11
1,1004,1,2023-01-04,AIR,AAR CORP,258372,44.56,44.56,45.6600,45.24,11
2,1004,1,2023-01-05,AIR,AAR CORP,132574,44.86,44.50,45.0400,44.82,11
3,1004,1,2023-01-06,AIR,AAR CORP,301259,45.20,45.20,46.5200,46.09,11
4,1004,1,2023-01-09,AIR,AAR CORP,372930,46.84,45.94,47.1500,46.18,11
...,...,...,...,...,...,...,...,...,...,...,...
9954,3358,2,2023-03-27,CMTL,COMTECH TELECOMMUN,39911,12.28,12.21,12.4799,12.25,14
9955,3358,2,2023-03-28,CMTL,COMTECH TELECOMMUN,47057,12.15,12.01,12.2150,12.03,14
9956,3358,2,2023-03-29,CMTL,COMTECH TELECOMMUN,65026,12.26,11.94,12.2600,12.11,14
9957,3358,2,2023-03-30,CMTL,COMTECH TELECOMMUN,49142,12.36,12.03,12.3661,12.12,14


We'll stop short of setting the datetime column as our index, though. A datetime index works best when each value identifies a single row, and because this panel data contains many different company stocks over just one quarter of a year, the same date appears many times.
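
To make the repeated dates concrete, here is a minimal sketch on a made-up toy panel. The tickers `AAA` and `BBB` and their prices are hypothetical, not taken from the dataset. It shows that the date column alone is not unique, while the (ticker, date) pair is.

```python
import pandas as pd

# Hypothetical toy panel: two tickers observed on the same two dates
toy = pd.DataFrame({
    "tic": ["AAA", "AAA", "BBB", "BBB"],
    "datadate": pd.to_datetime(
        ["2023-01-03", "2023-01-04", "2023-01-03", "2023-01-04"]
    ),
    "prccd": [10.0, 11.0, 20.0, 19.0],
})

# Each date appears once per ticker, so dates alone do not identify a row
dates_unique = toy.datadate.is_unique

# The (ticker, date) pair does identify each row
pairs_unique = not toy.duplicated(subset=["tic", "datadate"]).any()

print(dates_unique, pairs_unique)
```

In panel data it is usually the combination of entity and date that identifies an observation, which is why grouping by ticker becomes the natural tool later on.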

In [None]:
df.datadate.min()  # first date in the sample
df.datadate.max()  # last date in the sample

df.datadate.nunique()  # number of distinct trading days

62

Let's explore this panel data a bit more, to answer some simple questions:

- How many companies are considered in the data?
- How many stocks are considered in the data?
- Which exchanges are considered in the data?
- Which exchanges appear most often?


In [None]:
df.GVKEY.nunique()  # number of companies (GVKEY identifies the company)
# df.conm.nunique()  # alternative: count distinct company names

df.tic.nunique()  # number of stocks (tic is the ticker symbol)

df.exchg.unique()  # exchange codes that appear in the data

# value_counts() gives the number of rows per exchange code, not the number of
# securities. With 62 trading days, dividing by 62 approximates the number of
# securities per exchange, provided every security trades on every day.
# Exchange 11 does not divide evenly: one of its assets has no rows in March,
# perhaps because it listed (an IPO, initial public offering) or stopped trading
# during the period. Filtering to exchg == 11 would show what happened.
df.exchg.value_counts() / 62
df.exchg.value_counts()  # raw row counts per exchange

exchg
11    6301
14    3100
12     496
19      62
Name: count, dtype: int64

## Grouping


Grouping is a powerful way to manipulate panel data. Once you've grouped, you can call functions and they will be applied groupwise. The most common application of grouping is to calculate returns on a stock-by-stock basis, but there are many other uses!

In [None]:
# A plain pct_change() runs row by row and ignores stock boundaries, so the
# first return of each stock would be computed from the previous stock's last price:
# df["returns"] = df.prccd.pct_change()

# Grouping by ticker computes returns within each stock instead
df["returns"] = df.groupby("tic").prccd.pct_change()
df.head(65)

df.groupby("tic").size()  # number of rows per ticker

mask = df.groupby("tic").size() < 62  # True for tickers with fewer than 62 rows
mask[mask]  # keep only the tickers where the condition holds

tic
RFP    True
dtype: bool
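
The difference between row-by-row and groupwise returns is easiest to see on a small example. The sketch below uses a hypothetical two-ticker frame (`AAA` and `BBB` are made-up names and prices) and computes both versions side by side.

```python
import pandas as pd

# Hypothetical toy panel, sorted by ticker and then by date
toy = pd.DataFrame({
    "tic": ["AAA", "AAA", "BBB", "BBB"],
    "prccd": [10.0, 11.0, 20.0, 19.0],
})

# Row-by-row: the first BBB row is compared with the last AAA price
toy["naive"] = toy.prccd.pct_change()

# Groupwise: each ticker starts afresh, so its first return is NaN
toy["returns"] = toy.groupby("tic").prccd.pct_change()

print(toy)
```

In the `naive` column the first `BBB` row shows a large spurious return (from 11 to 20), whereas in the `returns` column it is `NaN`, because there is no earlier `BBB` price to compare against.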

Let's see what else we can do with grouping. Recall that we had more stocks than companies. Let's see why that is by counting how many unique stocks (distinct `tic` values) each company has issued. Then let's list the companies with more than one.

In [29]:
# A company with more than 62 rows suggests more than one stock in the data
condition = df.groupby("conm").size() > 62
condition[condition]

# Count the distinct tickers per company and keep those with more than one
unique_tickers = df.groupby("conm").tic.nunique()
unique_tickers[unique_tickers > 1]

conm
BEL FUSE INC                2
BERKSHIRE HATHAWAY          2
BIO-RAD LABORATORIES INC    2
BROWN FORMAN CORP           2
U-HAUL HOLDING CO           2
Name: tic, dtype: int64

### Exercise: Excellent Exchanges

**Part 1** Identify the number of unique stocks traded on each exchange.

In [25]:
df.groupby("exchg").tic.nunique()

exchg
11    102
12      8
14     50
19      1
Name: tic, dtype: int64

**Part 2** Then identify any companies that trade on more than one exchange.

In [30]:
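# Boolean view: True for companies listed on more than one exchange
condition = df.groupby("conm").exchg.nunique() > 1
condition[condition]

# Same idea, keeping the counts rather than True/False
unique_exchg = df.groupby("conm").exchg.nunique()
unique_exchg[unique_exchg > 1]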

conm
BIO-RAD LABORATORIES INC    2
Name: exchg, dtype: int64

## Aggregation

Aggregation functions like `mean()`, `median()`, `sum()`, `min()`, `max()`, `first()`, `last()` and `std()` can be applied to grouped data to give insights across panel data. For example, we might want the average daily return of each traded stock, or the maximum volume traded on any single day for each stock.
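
As a minimal sketch of both examples, the code below uses a hypothetical toy frame rather than the real dataset. It assumes, as in the data above, that `prccd` is the closing price and that `cshtrd` is the daily trading volume. The tickers and numbers are made up.

```python
import pandas as pd

# Hypothetical toy panel: three days for each of two tickers
toy = pd.DataFrame({
    "tic": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "prccd": [10.0, 11.0, 12.1, 20.0, 19.0, 19.0],
    "cshtrd": [100, 300, 200, 50, 70, 60],  # assumed to be daily volume
})
toy["returns"] = toy.groupby("tic").prccd.pct_change()

# Average daily return per ticker (NaN first-day returns are skipped)
mean_returns = toy.groupby("tic").returns.mean()

# Largest single-day volume per ticker
max_volume = toy.groupby("tic").cshtrd.max()

print(mean_returns)
print(max_volume)
```

Each result is a Series indexed by `tic`, with one value per group.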

Once we've done these sorts of aggregation, we're often curious to see which stocks sit at the top or the bottom of the distribution. We can use `nlargest()` and its counterpart `nsmallest()` here.
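
A short illustration of picking out the extremes, using a hypothetical Series of average returns (the tickers and values are invented for the example):

```python
import pandas as pd

# Hypothetical average daily returns per ticker
mean_returns = pd.Series(
    [0.004, -0.002, 0.010, 0.001],
    index=pd.Index(["AAA", "BBB", "CCC", "DDD"], name="tic"),
)

top2 = mean_returns.nlargest(2)      # two highest values, largest first
bottom2 = mean_returns.nsmallest(2)  # two lowest values, smallest first

print(top2)
print(bottom2)
```

Both methods return a Series that is already sorted, so no separate `sort_values()` call is needed.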

We can also group by multiple columns! This can be helpful when doing aggregation, for example, to find high performers in each month.
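
One possible way to do this is sketched below on a hypothetical toy frame. The `month` column, the tickers and the return values are all made up for illustration. Grouping by both month and ticker gives one average per pair, and unstacking the tickers into columns makes it easy to pick the best one in each month.

```python
import pandas as pd

# Hypothetical toy panel: two tickers, two days in each of two months
toy = pd.DataFrame({
    "tic": ["AAA"] * 4 + ["BBB"] * 4,
    "datadate": pd.to_datetime(
        ["2023-01-03", "2023-01-04", "2023-02-01", "2023-02-02"] * 2
    ),
    "returns": [0.01, 0.03, -0.02, 0.00, 0.00, 0.01, 0.02, 0.04],
})
toy["month"] = toy.datadate.dt.month  # illustrative helper column

# One average return per (month, ticker) pair
monthly = toy.groupby(["month", "tic"]).returns.mean()

# Tickers become columns; idxmax picks the best ticker in each month
best = monthly.unstack("tic").idxmax(axis=1)

print(monthly)
print(best)
```

The grouped result has a two-level index (`month`, `tic`), which is why `unstack("tic")` is available to reshape it into a month-by-ticker table.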

### Exercise: Good Days

Which two days of the week see the highest average close in this data set, and what is the average close for those days?  

### Exercise: Trading Exchanges

Next identify the total trading volume of each exchange.

### Exercise: The 500 Club

For stocks that reached a closing price above 500, how many times in each month did they achieve this?