# Reshaping

We've now seen some of the benefits of panel data, and how we can use pandas to manipulate it and draw out insights. Sometimes, though, we'll need to reshape our data to work with it more easily.

In [1]:
import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv("data/sp500_q1_2025.csv")

# Convert 'DlyCalDt' to datetime
df["DlyCalDt"] = pd.to_datetime(df["DlyCalDt"])

# Data cleaning as we did previously
df.dropna(inplace=True)

print("Missing data after cleaning", df.isnull().sum().sum())

# Handling duplicates
print("Checking for duplicates, which we forgot to do previously!", df.duplicated().sum())

df.drop_duplicates(inplace=True)


Missing data after cleaning 0
Checking for duplicates, which we forgot to do previously! 2


## Pivot

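`pivot` reshapes *long* panel data into a *wide* data frame. We can use it to put each stock in its own column, with dates in the rows. Each cell of the wide frame then holds a single value for one stock on one date, so we have to choose which column supplies those values (here, the closing price). Note that `pivot` raises an error if any date/stock pair appears more than once, which is one reason to remove duplicate rows first.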

In [9]:
df.shape
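# One row per date, one column per security, closing prices as the values
pivot_df = df.pivot(index="DlyCalDt", columns="SecurityNm", values="DlyClose")
# Partial string indexing on the DatetimeIndex selects February 2025
pivot_df.loc["2025-02"]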

DlyCalDt,3M CO; COM NONE; CONS,A E S CORP; COM NONE; CONS,A P A CORP; COM NONE; CONS,A T & T INC; COM NONE; CONS,ABBOTT LABORATORIES; COM NONE; CONS,ABBVIE INC; COM NONE; CONS,ACCENTURE PLC IRELAND; COM A; CONS,ADOBE INC; COM NONE; CONS,ADVANCED MICRO DEVICES INC; COM NONE; CONS,AFLAC INC; COM NONE; CONS,...,WILLIAMS COS; COM NONE; CONS,WILLIS TOWERS WATSON PUB LTD CO; COM NONE; CONS,WYNN RESORTS LTD; COM NONE; CONS,X C E L ENERGY INC; COM NONE; CONS,XYLEM INC; COM NONE; CONS,YUM BRANDS INC; COM NONE; CONS,ZEBRA TECHNOLOGIES CORP; COM A; CONS,ZIMMER BIOMET HOLDINGS INC; COM NONE; CONS,ZIONS BANCORPORATION N A; COM NONE; CONS,ZOETIS INC; COM A; CONS
2025-02-03,150.04,10.71,21.47,24.25,128.45,190.14,385.21,438.6,114.27,106.71,...,56.2,330.67,83.56,67.75,122.8,131.5,386.09,108.16,56.25,171.94
2025-02-04,151.68,10.61,22.39,24.25,129.1,189.95,391.62,440.23,119.5,106.76,...,55.72,320.31,83.19,67.32,129.24,131.47,383.46,107.8,57.07,172.24
2025-02-05,152.45,10.48,22.19,24.47,132.06,191.75,398.25,437.63,112.01,107.29,...,56.85,320.65,81.66,67.95,129.81,131.25,388.04,108.25,57.43,175.67
2025-02-06,152.32,10.82,21.64,24.45,128.22,192.97,387.34,435.4,110.16,103.08,...,56.01,326.91,80.65,67.12,131.42,144.01,376.8,102.69,58.27,174.12
2025-02-07,149.87,10.57,21.67,24.54,129.07,190.6,385.98,433.07,107.56,103.58,...,55.94,325.79,80.58,66.6,131.09,143.56,363.44,100.93,57.26,171.43
2025-02-10,149.69,10.75,22.99,24.86,131.31,190.34,386.89,451.1,110.48,102.61,...,56.17,325.18,78.97,66.88,131.26,148.15,358.44,100.42,56.03,171.91
2025-02-11,150.07,10.23,23.31,25.15,131.44,191.83,390.01,458.82,111.1,102.99,...,55.24,319.23,77.0,67.19,132.04,146.65,354.4,98.98,56.84,174.29
2025-02-12,148.87,10.09,22.37,25.36,130.49,193.0,388.83,462.76,111.72,102.94,...,54.82,318.96,78.37,67.4,129.91,146.51,352.92,98.99,55.47,173.88
2025-02-13,148.72,10.14,22.49,25.63,131.79,193.45,389.53,459.22,111.81,104.08,...,57.46,322.28,80.47,68.4,131.09,148.75,323.42,99.91,55.45,164.93
2025-02-14,148.62,9.93,23.14,25.87,130.61,192.87,388.0,460.16,113.1,103.34,...,56.98,320.4,88.82,68.61,129.38,147.91,318.36,100.52,55.74,157.52


In [12]:
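# Total number of missing values in the wide frame
pivot_df.isnull().sum().sum()
# Forward-fill the gaps in McCormick's price series with the last known close
pivot_df["MCCORMICK & CO INC; COM V; CONS"] = pivot_df["MCCORMICK & CO INC; COM V; CONS"].ffill()
# Drop any stock (column) that still has missing values
pivot_df.dropna(axis=1, inplace=True)
# Confirm that no column with missing values remains
missing = pivot_df.isnull().sum()
missing[missing > 0]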

Series([], dtype: int64)

We generally favour returns over close prices, as they give us a better picture of relative performance. Because our data frame is only holding close prices, it is straightforward to calculate returns.
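As a minimal sketch of that calculation, the block below uses a small made-up frame in place of `pivot_df` (the tickers `AAA` and `BBB` and their prices are hypothetical). It assumes simple returns, that is, today's close divided by the previous close, minus one, which is what `pct_change()` computes.

```python
import pandas as pd

# Toy stand-in for pivot_df: dates in the index, one column per stock.
# Tickers and prices are invented purely for illustration.
toy_prices = pd.DataFrame(
    {"AAA": [100.0, 102.0, 101.0], "BBB": [50.0, 49.0, 51.0]},
    index=pd.to_datetime(["2025-02-03", "2025-02-04", "2025-02-05"]),
)

# Simple daily return: close / previous close - 1, computed column by column.
toy_returns = toy_prices.pct_change()

# The first date has no previous close to compare against, so drop that row.
toy_returns = toy_returns.dropna()

print(toy_returns)
```

The same two calls should work unchanged on the real pivoted frame, since every column there is a close price.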

The other really neat thing we can do with this kind of pivoted dataframe is visualise correlations with ease.
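One possible way to get at those correlations is sketched below, again on invented data (the tickers and prices are hypothetical). It assumes we correlate daily returns rather than price levels, and it stops at the correlation matrix; plotting it, for example as a heatmap, is left to whichever plotting library you prefer.

```python
import pandas as pd

# Toy wide frame of close prices; tickers and values are made up.
toy_prices = pd.DataFrame(
    {
        "AAA": [100.0, 102.0, 101.0, 104.0, 103.0],
        "BBB": [50.0, 50.5, 50.8, 51.0, 50.2],
        "CCC": [20.0, 19.5, 19.8, 19.0, 19.4],
    },
    index=pd.date_range("2025-02-03", periods=5, freq="B"),
)

# Work with returns, not prices, so trends in price levels do not dominate.
toy_returns = toy_prices.pct_change().dropna()

# Pairwise Pearson correlation between every pair of columns (stocks).
corr = toy_returns.corr()

print(corr.round(2))
```

Because each stock is its own column, `corr()` needs no grouping or merging, which is what makes the wide layout convenient here.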

### Exercise: Trading Top Ten

Pivot our panel data, this time using trading volume `DlyVol` for values. Find the max trading volume for each stock and display the top 10.

In [71]:
## YOUR CODE GOES HERE

## Resample

The other kind of reshaping we can do is called *resampling*, which changes the frequency of our data. When we resample to a lower frequency we normally aggregate the values that fall into each period (for example with `mean()` or `last()`), although this is not strictly required. Let's resample our pivoted data to get the mean closing price for each month.

We can use `resample()` to help us calculate returns for different periods. When we calculated daily returns we took the last price on the day and the last price on the day before.

For other periods we apply the same thinking. For monthly returns, for example, we take the last price of the month and the last price of the month before. We'll need `last()` to make it work.

There are many possible resampling frequencies. Here are a few:

- **D** - Daily (calendar days)
- **W** - Weekly
- **ME** - Monthly (month end)
- **QE** - Quarterly (quarter end)
- **YE** - Annually (year end)

### Exercise: Losing Days

Resample your *trading volume* pivot df to calendar days. Do you need to do some cleaning? What do you propose?

In [74]:
## YOUR CODE GOES HERE