In [507]:
library(xts)
library(tidyr)
library(DBI)
con <- dbConnect(odbc::odbc(), "JupyterLab", timeout = 10)

# RPNA story count

In this section, we build four time series of daily story counts:
- total_count - all stories
- macro_count - stories that mention no company
- comp_count - stories that mention only companies
- mix_count - stories that mention both companies and non-company entities
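
As a minimal sketch of how the three categories partition the stories, the toy data frame below classifies each story and checks that the category counts add up to the total. The story IDs and entity types are made up for illustration and are not taken from the database.

```r
# Toy data: one row per (story, entity) mention; IDs and types are made up.
mentions <- data.frame(
    RP_STORY_ID = c("A", "A", "B", "C", "C", "D"),
    ENTITY_TYPE = c("COMP", "ORGA", "COMP", "CMDT", "PLCE", "COMP"),
    stringsAsFactors = FALSE
)

# For each story, check whether it mentions a company and/or another entity type.
has_comp  <- tapply(mentions$ENTITY_TYPE == "COMP", mentions$RP_STORY_ID, any)
has_other <- tapply(mentions$ENTITY_TYPE != "COMP", mentions$RP_STORY_ID, any)

# Classify each story into exactly one category.
category <- ifelse(has_comp & has_other, "mix",
            ifelse(has_comp, "comp", "macro"))

counts <- table(factor(category, levels = c("macro", "comp", "mix")))
print(counts)

# The three categories partition the stories.
stopifnot(sum(counts) == length(unique(mentions$RP_STORY_ID)))
```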

In [508]:
qry <-
"SELECT COUNT(DISTINCT RP_STORY_ID) AS NO_STORIES
 FROM dbo.RPNA_WSJ;"

dbGetQuery(con, qry)

NO_STORIES
<int>
845222


The dataset contains 845,222 distinct Wall Street Journal stories published between January 1, 2001 and August 31, 2019.

In [509]:
qry <-
"SELECT
    ENTITY_TYPE,
    COUNT(DISTINCT RP_STORY_ID) AS NO_STORIES,
    100.0 * COUNT(DISTINCT RP_STORY_ID) / (SELECT COUNT(DISTINCT RP_STORY_ID) FROM dbo.RPNA_WSJ) AS FRAC_STORIES
FROM dbo.RPNA_WSJ
GROUP BY ENTITY_TYPE
ORDER BY NO_STORIES DESC;"

dbGetQuery(con, qry)

ENTITY_TYPE,NO_STORIES,FRAC_STORIES
<chr>,<int>,<dbl>
COMP,678952,80.3282451
ORGA,416262,49.2488364
CMDT,125606,14.8607112
PLCE,10517,1.2442885
CURR,1286,0.1521494


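We see that 80% of the stories mention at least one company. Among the non-company entity types, organizations (49%) and commodities (15%) make up most of the "macro" content, while places (1.24%) and currencies (0.15%) are rarely covered. The shares sum to more than 100% because a single story can mention several entity types.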

In [510]:
qry <-
"WITH
 not_comp AS
 (
    SELECT RP_STORY_ID
    FROM dbo.RPNA_WSJ
    WHERE ENTITY_TYPE<>'COMP'
 ),
 only_comp AS
 (
    SELECT RP_STORY_ID
    FROM dbo.RPNA_WSJ
    WHERE ENTITY_TYPE='COMP'
 )

 SELECT
    t.[TYPE],
    t.NO_STORIES,
    100.0 * t.NO_STORIES / SUM(NO_STORIES) OVER() AS FRAC_STORIES
 FROM (
    SELECT
        'Only companies mentioned' AS [TYPE],
        COUNT(DISTINCT RP_STORY_ID) AS [NO_STORIES]
    FROM dbo.RPNA_WSJ
    WHERE RP_STORY_ID NOT IN (SELECT * FROM not_comp)

    UNION ALL

    SELECT
        'No companies mentioned',
        COUNT(DISTINCT RP_STORY_ID)
    FROM dbo.RPNA_WSJ
    WHERE RP_STORY_ID NOT IN (SELECT * FROM only_comp)

    UNION ALL

    SELECT
        'Both mentioned',
        COUNT(DISTINCT RP_STORY_ID)
    FROM dbo.RPNA_WSJ
    WHERE RP_STORY_ID IN (SELECT * FROM not_comp)
      AND RP_STORY_ID IN (SELECT * FROM only_comp)
 ) t
 ORDER BY FRAC_STORIES DESC;"

dbGetQuery(con, qry)

TYPE,NO_STORIES,FRAC_STORIES
<chr>,<int>,<dbl>
Only companies mentioned,371901,44.00039
Both mentioned,307051,36.32785
No companies mentioned,166270,19.67175


Going forward, we define stories that mention no company as "macro" news and all stories that mention at least one company as "equity" news. Strictly speaking, the distinction is therefore between "equity" and "non-equity" news.

**TBD:** Decide whether macro news should cover only stories with no company mention, or also the stories that mention both companies and other entities.

In [511]:
# Get count of total stories per day
qry <-
"DECLARE @time_shift int = -6;

 SELECT
    CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date) AS [DATE],
    COUNT(DISTINCT RP_STORY_ID) AS total_count
 FROM dbo.RPNA_WSJ
 GROUP BY CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date)
 ORDER BY [date];"
df <- dbGetQuery(con, qry)
total_count <- xts(df[,2], order.by=df$DATE) # No need to set time zone as everything is done in SQL and we only work with dates in R.
names(total_count) <- names(df)[2]
head(total_count)

           total_count
2001-01-01          95
2001-01-03         104
2001-01-04         136
2001-01-07         145
2001-01-08         124
2001-01-09         121
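
The `@time_shift` of -6 hours in the query above moves every timestamp back before it is truncated to a date, so a story stamped before 06:00 is counted on the previous calendar day. The sketch below reproduces that behaviour in R with made-up timestamps; the reading of the cutoff is inferred from the query itself and is not documented elsewhere in this notebook.

```r
# Hypothetical timestamps, treated as plain clock times (UTC is used only to avoid DST effects).
ts <- as.POSIXct(c("2001-01-03 05:59:00",
                   "2001-01-03 06:00:00",
                   "2001-01-03 23:30:00"), tz = "UTC")

time_shift <- -6  # hours, mirrors @time_shift in the SQL query

# Equivalent of CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date)
shifted_date <- as.Date(ts + time_shift * 3600, tz = "UTC")
print(data.frame(timestamp = ts, date = shifted_date))
```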

In [512]:
# Get count of non-company stories per day
qry <-
"DECLARE @time_shift int = -6;

 SELECT
    CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date) AS [DATE],
    COUNT(DISTINCT RP_STORY_ID) AS macro_count
 FROM dbo.RPNA_WSJ
 WHERE RP_STORY_ID NOT IN (
    SELECT RP_STORY_ID
    FROM dbo.RPNA_WSJ
    WHERE ENTITY_TYPE='COMP'
 )
 GROUP BY CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date)
 ORDER BY [date];"
df <- dbGetQuery(con, qry)
macro_count <- xts(df[,2], order.by=df$DATE) # No need to set time zone as everything is done in SQL and we only work with dates in R.
names(macro_count) <- names(df)[2]
head(macro_count)

           macro_count
2001-01-01           9
2001-01-03          13
2001-01-04          17
2001-01-07          20
2001-01-08          15
2001-01-09          13

In [513]:
# Get count of companies-only stories per day
qry <-
"DECLARE @time_shift int = -6;

 SELECT
    CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date) AS [DATE],
    COUNT(DISTINCT RP_STORY_ID) AS comp_count
 FROM dbo.RPNA_WSJ
 WHERE RP_STORY_ID NOT IN (
    SELECT RP_STORY_ID
    FROM dbo.RPNA_WSJ
    WHERE ENTITY_TYPE<>'COMP'
 )
 GROUP BY CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date)
 ORDER BY [date];"
df <- dbGetQuery(con, qry)
comp_count <- xts(df[,2], order.by=df$DATE) # No need to set time zone as everything is done in SQL and we only work with dates in R.
names(comp_count) <- names(df)[2]
head(comp_count)

           comp_count
2001-01-01         49
2001-01-03         44
2001-01-04         63
2001-01-07         62
2001-01-08         60
2001-01-09         68

In [514]:
# Merge counts
news <- merge(total_count, macro_count, fill=0, join="outer")
news <- merge(news, comp_count, fill=0, join="outer")

# Mix_count contains stories that include both companies and non-company entities
news$mix_count <- news$total_count - (news$macro_count + news$comp_count)
head(news)

# Housekeeping
rm(total_count, macro_count, comp_count, df)

           total_count macro_count comp_count mix_count
2001-01-01          95           9         49        37
2001-01-03         104          13         44        47
2001-01-04         136          17         63        56
2001-01-07         145          20         62        63
2001-01-08         124          15         60        49
2001-01-09         121          13         68        40
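
The outer merge with `fill=0` matters on days where one category has no stories at all: such a day would otherwise be missing from that series and the subtraction would yield `NA`. A base-R sketch of the same idea on made-up daily counts, with data frames standing in for the xts objects:

```r
# Made-up daily counts; the second day has no macro and no companies-only stories.
total <- data.frame(date = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03")),
                    total_count = c(10, 4, 7))
macro <- data.frame(date = as.Date(c("2001-01-01", "2001-01-03")),
                    macro_count = c(2, 1))
comp  <- data.frame(date = as.Date(c("2001-01-01", "2001-01-03")),
                    comp_count = c(5, 3))

# Outer join on date, then replace missing counts by zero.
toy <- merge(merge(total, macro, by = "date", all = TRUE), comp, by = "date", all = TRUE)
for (col in c("macro_count", "comp_count")) {
    toy[[col]][is.na(toy[[col]])] <- 0
}

# Mixed stories are what remains after removing the two pure categories.
toy$mix_count <- toy$total_count - (toy$macro_count + toy$comp_count)
print(toy)
```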

---

# S&P 500 prices and returns

In this section, we will build time series of returns of companies that were included in the S&P 500 at any point between January 1, 2001 and August 31, 2019.

However, we will include the year 2000 to have enough observations for the stocks that get excluded from the S&P 500 during the first few months of 2001.

### Prices
First, let's get the prices of the S&P constituents and the levels of the index itself.

In [554]:
# Get prices of S&P 500's constituents
qry <-
"SELECT
    'P' + CONVERT(varchar(100),PERMNO) AS permno,
    [DATE] AS [date],
    PRC AS price
 FROM dbo.CRSP_DSF
 ORDER BY permno, [date];"

dbGetQuery(con, qry) %>%
    pivot_wider(names_from = permno, values_from = price) -> df

# Convert to xts object
prices <- xts(df[,-1], order.by=df$date)


# Get levels of S&P 500 index
qry <-
"SELECT
    caldt AS [date],
    spindx AS SP500
 FROM dbo.CRSP_SP500
 ORDER BY [date];"
df <- dbGetQuery(con, qry)

# Convert to xts object
df_xts <- xts(df$SP500, order.by=df$date)
names(df_xts) <- "SP500"


# Housekeeping
rm(qry, df)

### Excess Returns
Once we have the prices, we can merge them into a single xts object and compute their daily returns.
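
For reference, the arithmetic (simple) return of asset $i$ between two consecutive trading days is

$$
R^i_t = \frac{P^i_t}{P^i_{t-1}} - 1,
$$

where $P^i_t$ denotes the price of asset $i$ on day $t$. The symbol $P$ is introduced here for illustration only.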

In [555]:
# Merge xts objects
prices <- merge(df_xts, prices, join="outer")

# Compute arithmetic returns
rets <- (prices / lag(prices, k=1, na.pad=FALSE)) - 1

The last step is to subtract the risk-free rate of return from the daily returns of the stocks and the index.
We use the 3-month Treasury bill as a proxy for the risk-free rate, as is common in the literature.
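
A minimal sketch of one way to turn an annualized rate into a daily one, assuming the rate is expressed as a decimal (0.05 for 5%) and using the conventions of this notebook: divide by 4 to get the 3-month rate, then compound over 90 days. The function name `yearly_to_daily` is hypothetical.

```r
# Convert an annualized rate (decimal) into a daily rate.
yearly_to_daily <- function(rate_yearly, days = 90) {
    rate_3m <- rate_yearly / 4        # simple 3-month rate
    (1 + rate_3m)^(1 / days) - 1      # daily rate that compounds to the 3-month rate
}

daily <- yearly_to_daily(0.05)
print(daily)

# Compounding the daily rate over 90 days recovers the 3-month rate.
stopifnot(isTRUE(all.equal((1 + daily)^90 - 1, 0.05 / 4)))
```

If the source table stores rates in percent, as FRED does for its Treasury bill series, divide by 100 before calling the function.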

In [556]:
# Get 3-month T-Bill (daily observations, yearly rates)
qry <-
"SELECT
    [DATE] AS [date],
    TBILL AS tbill_yearly
 FROM dbo.FRED_TBILL3M
 ORDER BY [date];"
df <- dbGetQuery(con, qry)

# Convert to xts object
df_xts <- xts(df$tbill_yearly, order.by=df$date)

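# Compute daily returns from yearly rates
# NOTE: assumes TBILL is stored as a decimal (0.05 = 5%); divide by 100 first if it is in percent.
tbill_3months <- na.locf(df_xts) / 4     # 3-month rates
tbill <- (1 + tbill_3months)^(1/90) - 1  # Daily rates (gross return compounded over 90 days)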
names(tbill) <- "TBill"

# Merge T-bill rate with returns to match the dates
rets <- merge(rets, tbill, join="inner")

# Compute excess returns
excess_rets <- xts(order.by=index(rets))
for (i in 1:(dim(rets)[2]-1)) { # Last column is the T-bill rate itself
    excess_rets <- merge(excess_rets, rets[,i] - rets$TBill)
}

# Housekeeping
rm(qry, df, df_xts, tbill_3months, tbill)
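
The loop above subtracts the same risk-free series from every return column. As a sketch of a vectorized alternative, shown on a plain matrix with toy numbers, `sweep()` subtracts a per-row value from all columns in one call. Applying this to the xts objects (for example via `coredata()`) is an assumption and is not tested here.

```r
# Toy returns: 3 days x 2 assets, plus a daily risk-free rate for the same days.
rets_toy <- matrix(c(0.010, -0.020, 0.005,
                     0.000,  0.015, 0.010),
                   nrow = 3, dimnames = list(NULL, c("SP500", "P1")))
rf_toy <- c(0.001, 0.001, 0.002)

# Subtract the risk-free rate of each day (row) from every column.
excess_toy <- sweep(rets_toy, 1, rf_toy, "-")
print(excess_toy)
```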

---

# Analysis - Discussed Approach

In a first step, we estimate the model below for each of the 944 stocks that were included in the S&P 500 at any point between January 1, 2001 and August 31, 2019. Note, however, that we estimate $\beta_i$ for stock $i$ over 2000 to 2019, so that enough observations are available even for stocks that were dropped from the S&P 500 in the first few months of 2001.

Formally, the model is


$$
\begin{align*}
R^i_t-R^f_t &= \alpha_i + \beta_i (R^m_t - R^f_t) + \epsilon^i_t
\\
    &= \alpha_i + \mathrm{Systematic \ part} + \mathrm{Idiosyncratic \ part},
\end{align*}
$$

where $\alpha_i$ is a constant in the regression and not Jensen's alpha.
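
The quantity carried forward from each regression is the fitted residual. Writing $\hat{\alpha}_i$ and $\hat{\beta}_i$ for the OLS estimates (the hat notation is introduced here for illustration), it is

$$
\hat{\epsilon}^i_t = (R^i_t - R^f_t) - \hat{\alpha}_i - \hat{\beta}_i (R^m_t - R^f_t),
$$

which is defined only for the days on which stock $i$ has an observed return.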

In [559]:
# /!\ The loop may be simplified by using na.action="na.exclude" in lm()!

# Create xts object to hold residuals of the regressions
res <- xts(order.by=index(excess_rets))

# Temp variables to help regressions
permno <- colnames(excess_rets)[-1] # First name is SP500.
time_index <- index(excess_rets)

for (p in permno) {
    # Prepare data
    r <- as.data.frame(excess_rets[, c("SP500",p)])

    # Estimate model
    fit <- lm(r[,p] ~ r$SP500)

    # Prepare temporary xts to merge
    index_not_na <- as.numeric(attributes(fit$residuals)$names) # The attribute $names of fit$residuals contains the indices of the residuals.
    tmp <- xts(fit$residuals, order.by=time_index[index_not_na])
    names(tmp) <- p

    # Merge xts
    res <- merge(res, tmp)
}

# Housekeeping
rm(permno, time_index, p, r, fit, index_not_na, tmp)
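
The simplification hinted at in the first comment of the cell can be sketched as follows on simulated data (all numbers are made up). With `na.action = na.exclude`, `residuals()` returns a vector as long as the input, with `NA` where an observation was dropped, so the index bookkeeping is no longer needed.

```r
set.seed(1)
n <- 10
market <- rnorm(n, sd = 0.01)
stock  <- 0.0002 + 1.2 * market + rnorm(n, sd = 0.005)
stock[c(2, 7)] <- NA   # pretend the stock has no return on two days

fit <- lm(stock ~ market, na.action = na.exclude)

# residuals() pads the dropped rows with NA; fit$residuals does not.
eps <- residuals(fit)
print(length(eps))        # length of the padded residual vector
print(which(is.na(eps)))  # positions of the missing observations
```

Because the padded vector lines up with the original rows, it could be attached to the full time index directly instead of being merged through a temporary object.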

### Cross-sectional Correlation

We then use the residuals $\epsilon^i_t$ of stock $i \in N$ to compute the cross-sectional correlation. Formally, we have

$$
\rho_t = f(\epsilon^i_t) = ?
\nonumber
$$
At time $t$, we have only one residual $\epsilon^i_t$ for each stock $i \in N$, so the cross-section at time $t$ is a single $1\times N$ vector.

**How can we compute a correlation based on only 1 vector?**

In [560]:
cor(first(res[,1:3], n="4 years"), use="complete.obs")
cor(first(res[,1:3], n="4 years"), use="na.or.complete")

,P10078,P10104,P10107
P10078,1.0,0.2680071,0.1118562
P10104,0.2680071,1.0,0.1743708
P10107,0.1118562,0.1743708,1.0


,P10078,P10104,P10107
P10078,1.0,0.2680071,0.1118562
P10104,0.2680071,1.0,0.1743708
P10107,0.1118562,0.1743708,1.0
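
One way to condense such a matrix into a single number is the average of its off-diagonal entries. This is offered only as a possible summary, not as the measure the analysis settles on. Using the three-stock matrix printed above:

```r
# Correlation matrix copied from the output above.
ids <- c("P10078", "P10104", "P10107")
m <- matrix(c(1.0,       0.2680071, 0.1118562,
              0.2680071, 1.0,       0.1743708,
              0.1118562, 0.1743708, 1.0),
            nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# Average pairwise correlation: mean of the entries above the diagonal.
avg_cor <- mean(m[upper.tri(m)])
print(avg_cor)
```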


---

# Analysis - Alternative Approach

In [564]:
# This cell can be ignored and will later be removed!
p <- 18

reg_data_xts <- merge(excess_rets[,c(1,p)], news, join="inner") 
r <- as.data.frame(reg_data_xts)


r$ratio <- (r$macro_count + r$mix_count) / r$total_count

fit <- lm(r[,2] ~ r$SP500 + r$ratio)

summary(fit)
#str(fit)