# Analysis

In [47]:
library(xts)
library(DBI)
con <- dbConnect(odbc::odbc(), "JupyterLab", timeout = 10)

## RPNA story count

In this section, we build four daily story-count time series:
- total_count - all stories
- macro_count - stories containing no mention of any company
- comp_count - stories mentioning companies only
- mix_count - stories mentioning both companies and non-company entities
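
The three categories partition the set of stories, so every story falls into exactly one of them. A minimal base-R sketch of that classification, using a made-up `mentions` table (the story IDs and rows are hypothetical; only the column names `RP_STORY_ID` and `ENTITY_TYPE` come from the database):

```r
# Toy data: one row per (story, entity mention). Values are invented for illustration.
mentions <- data.frame(
  RP_STORY_ID = c("A", "A", "B", "C", "C", "D"),
  ENTITY_TYPE = c("COMP", "COMP", "ORGA", "COMP", "CMDT", "PLCE"),
  stringsAsFactors = FALSE
)

# Per story: does it contain any company mention, and any non-company mention?
has_comp  <- tapply(mentions$ENTITY_TYPE == "COMP", mentions$RP_STORY_ID, any)
has_other <- tapply(mentions$ENTITY_TYPE != "COMP", mentions$RP_STORY_ID, any)

# Assign each story to exactly one category.
category <- ifelse(has_comp & has_other, "mix",
            ifelse(has_comp, "comp", "macro"))

counts <- table(factor(category, levels = c("macro", "comp", "mix")))
print(counts)
```

Because the categories are mutually exclusive, the three counts add up to the number of distinct stories, which is what allows one of them to be derived as a remainder.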

In [48]:
qry <-
"SELECT COUNT(DISTINCT RP_STORY_ID) AS NO_STORIES
 FROM dbo.RPNA_WSJ;"

dbGetQuery(con, qry)

NO_STORIES
<int>
860401


The dataset contains 860,401 distinct Wall Street Journal stories published from January 1, 2000 to August 31, 2019.

In [49]:
qry <-
"SELECT
    ENTITY_TYPE,
    COUNT(DISTINCT RP_STORY_ID) AS NO_STORIES,
    100.0 * COUNT(DISTINCT RP_STORY_ID) / (SELECT COUNT(DISTINCT RP_STORY_ID) FROM dbo.RPNA_WSJ) AS FRAC_STORIES
FROM dbo.RPNA_WSJ
GROUP BY ENTITY_TYPE
ORDER BY NO_STORIES DESC;"

dbGetQuery(con, qry)

ENTITY_TYPE,NO_STORIES,FRAC_STORIES
<chr>,<int>,<dbl>
COMP,692211,80.452138
ORGA,423095,49.1741641
CMDT,127743,14.8469144
PLCE,10654,1.2382598
CURR,1313,0.1526033


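About 80% of the stories mention at least one company. Among the non-company entity types, organizations (49%) and commodities (15%) account for most of the "macro" content, while places (1.2%) and currencies (0.15%) are rarely covered. The shares sum to more than 100% because a single story can mention several entity types.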
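
To see why per-type shares computed with `COUNT(DISTINCT RP_STORY_ID)` need not add up to 100%, here is a small base-R sketch on invented data (the rows are hypothetical; only the column names mirror the table):

```r
# Toy data: story A mentions a company and an organization, story C two non-company types.
mentions <- data.frame(
  RP_STORY_ID = c("A", "A", "B", "C", "C", "D"),
  ENTITY_TYPE = c("COMP", "ORGA", "COMP", "ORGA", "CMDT", "COMP"),
  stringsAsFactors = FALSE
)

n_total <- length(unique(mentions$RP_STORY_ID))

# Count each story at most once per entity type, as COUNT(DISTINCT ...) does.
pairs <- unique(mentions[, c("RP_STORY_ID", "ENTITY_TYPE")])
frac  <- 100 * table(pairs$ENTITY_TYPE) / n_total

print(frac)
print(sum(frac))
```

Stories A and C each contribute to two entity types, so the total exceeds 100% even though every story is counted only once within a type.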

In [50]:
qry <-
"WITH
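 -- Stories with at least one non-company mention
 not_comp AS
 (
    SELECT RP_STORY_ID
    FROM dbo.RPNA_WSJ
    WHERE ENTITY_TYPE<>'COMP'
 ),
 -- Stories with at least one company mention (despite the name, not company-only)
 only_comp AS
 (
    SELECT RP_STORY_ID
    FROM dbo.RPNA_WSJ
    WHERE ENTITY_TYPE='COMP'
 )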

 SELECT
    t.[TYPE],
    t.NO_STORIES,
    100.0 * t.NO_STORIES / SUM(NO_STORIES) OVER() AS FRAC_STORIES
 FROM (
    SELECT
        'Only companies mentioned' AS [TYPE],
        COUNT(DISTINCT RP_STORY_ID) AS [NO_STORIES]
    FROM dbo.RPNA_WSJ
    WHERE RP_STORY_ID NOT IN (SELECT * FROM not_comp)

    UNION ALL

    SELECT
        'No companies mentioned',
        COUNT(DISTINCT RP_STORY_ID)
    FROM dbo.RPNA_WSJ
    WHERE RP_STORY_ID NOT IN (SELECT * FROM only_comp)

    UNION ALL

    SELECT
        'Both mentioned',
        COUNT(DISTINCT RP_STORY_ID)
    FROM dbo.RPNA_WSJ
    WHERE RP_STORY_ID IN (SELECT * FROM not_comp)
      AND RP_STORY_ID IN (SELECT * FROM only_comp)
 ) t
 ORDER BY FRAC_STORIES DESC;"

dbGetQuery(con, qry)

TYPE,NO_STORIES,FRAC_STORIES
<chr>,<int>,<dbl>
Only companies mentioned,379337,44.0884
Both mentioned,312874,36.36374
No companies mentioned,168190,19.54786


Going forward, we define stories containing no mention of a company as "macro" news and all stories mentioning at least one company as "equity" news. Strictly speaking, the distinction is therefore between "equity" and "non-equity" news.

**TBD:** Decide whether macro news should comprise only stories with no company mention, or those stories plus the ones that mention both companies and other entities.

In [51]:
# Get count of total stories per day
qry <-
"DECLARE @time_shift int = -6;

 SELECT
    CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date) AS [DATE],
    COUNT(DISTINCT RP_STORY_ID) AS total_count
 FROM dbo.RPNA_WSJ
 GROUP BY CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date)
 ORDER BY [date];"
df <- dbGetQuery(con, qry)
total_count <- xts(df[,2], order.by=df$DATE) # No need to set time zone as everything is done in SQL and we only work with dates in R.
names(total_count) <- names(df)[2]
head(total_count)

           total_count
2000-04-02           2
2000-04-03           2
2000-04-04           3
2000-04-05           2
2000-04-06           2
2000-04-09           3
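
The `@time_shift` of -6 hours means that a story whose `TIMESTAMP_EST` falls before 06:00 is assigned to the previous calendar day. A rough base-R equivalent of `CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date)`, using hypothetical timestamps and parsing them as UTC purely to keep the arithmetic free of daylight-saving effects:

```r
# Hypothetical wall-clock timestamps; tz = "UTC" is an assumption made only so that
# subtracting hours behaves like plain clock arithmetic.
ts <- as.POSIXct(
  c("2019-08-30 05:59:00", "2019-08-30 06:00:00", "2019-08-30 23:30:00"),
  tz = "UTC"
)

time_shift <- -6  # hours, same value as @time_shift in the query

# Shift the timestamp, then truncate to the calendar date.
shifted_date <- as.Date(ts + time_shift * 3600, tz = "UTC")

print(data.frame(timestamp = format(ts), assigned_date = format(shifted_date)))
```

Only the first timestamp moves to the previous day (2019-08-29); the other two stay on 2019-08-30.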

In [52]:
# Get count of non-company stories per day
qry <-
"DECLARE @time_shift int = -6;

 SELECT
    CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date) AS [DATE],
    COUNT(DISTINCT RP_STORY_ID) AS macro_count
 FROM dbo.RPNA_WSJ
 WHERE RP_STORY_ID NOT IN (
    SELECT RP_STORY_ID
    FROM dbo.RPNA_WSJ
    WHERE ENTITY_TYPE='COMP'
 )
 GROUP BY CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date)
 ORDER BY [date];"
df <- dbGetQuery(con, qry)
macro_count <- xts(df[,2], order.by=df$DATE) # No need to set time zone as everything is done in SQL and we only work with dates in R.
names(macro_count) <- names(df)[2]
head(macro_count)

           macro_count
2000-04-16           1
2000-04-23           1
2000-04-26           2
2000-04-30          13
2000-05-01           8
2000-05-02          13

In [53]:
# Get count of companies-only stories per day
qry <-
"DECLARE @time_shift int = -6;

 SELECT
    CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date) AS [DATE],
    COUNT(DISTINCT RP_STORY_ID) AS comp_count
 FROM dbo.RPNA_WSJ
 WHERE RP_STORY_ID NOT IN (
    SELECT RP_STORY_ID
    FROM dbo.RPNA_WSJ
    WHERE ENTITY_TYPE<>'COMP'
 )
 GROUP BY CAST(DATEADD(HOUR, @time_shift, TIMESTAMP_EST) AS date)
 ORDER BY [date];"
df <- dbGetQuery(con, qry)
comp_count <- xts(df[,2], order.by=df$DATE) # No need to set time zone as everything is done in SQL and we only work with dates in R.
names(comp_count) <- names(df)[2]
head(comp_count)

           comp_count
2000-04-03          1
2000-04-04          1
2000-04-10          1
2000-04-18          4
2000-04-19          1
2000-04-24          2

In [54]:
# Merge counts
news <- merge(total_count, macro_count, fill=0, join="outer")
news <- merge(news, comp_count, fill=0, join="outer")

# mix_count: stories that mention both companies and non-company entities (the remainder)
news$mix_count <- news$total_count - (news$macro_count + news$comp_count)
head(news)

# Housekeeping
rm(total_count, macro_count, comp_count, df)

           total_count macro_count comp_count mix_count
2000-04-02           2           0          0         2
2000-04-03           2           0          1         1
2000-04-04           3           0          1         2
2000-04-05           2           0          0         2
2000-04-06           2           0          0         2
2000-04-09           3           0          0         3
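
The merge step can be reproduced without a database connection. The sketch below is a base-R analogue of the outer join with `fill=0` (it uses `merge()` on data frames rather than xts, which is a substitution made only to keep the example dependency-free), applied to three dates taken from the output above:

```r
# Daily counts for three dates copied from the printed output.
total <- data.frame(DATE = as.Date(c("2000-04-02", "2000-04-03", "2000-04-16")),
                    total_count = c(2L, 2L, 2L))
macro <- data.frame(DATE = as.Date("2000-04-16"), macro_count = 1L)
comp  <- data.frame(DATE = as.Date("2000-04-03"), comp_count = 1L)

# Outer joins keep every date; a date missing from one series becomes NA.
news_df <- merge(total, macro, by = "DATE", all = TRUE)
news_df <- merge(news_df, comp, by = "DATE", all = TRUE)

# A missing count means no such story on that day, so replace NA with 0.
for (col in c("macro_count", "comp_count")) {
  news_df[[col]][is.na(news_df[[col]])] <- 0L
}

# The mixed stories are whatever remains of the total.
news_df$mix_count <- news_df$total_count - (news_df$macro_count + news_df$comp_count)

# Sanity check: a remainder can never be negative if the categories partition the total.
stopifnot(all(news_df$mix_count >= 0))
print(news_df)
```

The resulting `mix_count` values (2, 1, 1) match the corresponding rows of `news`. The non-negativity check is a cheap guard worth running on the full series as well.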

In [44]:
str(news)
news[1:20,]
summary(news)

An 'xts' object on 2000-04-02/2019-08-30 containing:
  Data: int [1:6558, 1:4] 2 2 3 2 2 3 1 1 1 2 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:4] "total_count" "macro_count" "comp_count" "mix_count"
  Indexed by objects of class: [Date] TZ: UTC
  xts Attributes:  
 NULL


           total_count macro_count comp_count mix_count
2000-04-02           2           0          0         2
2000-04-03           2           0          1         1
2000-04-04           3           0          1         2
2000-04-05           2           0          0         2
2000-04-06           2           0          0         2
2000-04-09           3           0          0         3
2000-04-10           1           0          1         0
2000-04-12           1           0          0         1
2000-04-13           1           0          0         1
2000-04-16           2           1          0         1
2000-04-17           2           0          0         2
2000-04-18           5           0          4         1
2000-04-19           7           0          1         6
2000-04-20           1           0          0         1
2000-04-23           2           1          0         1
2000-04-24           2           0          2         0
2000-04-25           1           0          0   

     Index             total_count      macro_count       comp_count    
 Min.   :2000-04-02   Min.   :   1.0   Min.   :  0.00   Min.   :  0.00  
 1st Qu.:2005-07-15   1st Qu.:  74.0   1st Qu.: 13.00   1st Qu.: 26.00  
 Median :2010-03-25   Median : 113.0   Median : 21.00   Median : 46.00  
 Mean   :2010-02-21   Mean   : 131.2   Mean   : 25.65   Mean   : 57.84  
 3rd Qu.:2014-09-23   3rd Qu.: 157.0   3rd Qu.: 31.00   3rd Qu.: 71.00  
 Max.   :2019-08-30   Max.   :1120.0   Max.   :315.00   Max.   :674.00  
   mix_count     
 Min.   :  0.00  
 1st Qu.: 28.00  
 Median : 44.00  
 Mean   : 47.71  
 3rd Qu.: 61.00  
 Max.   :450.00  

## S&P 500 prices and returns

In this section, we will build time series of returns of companies that were included in the S&P 500 at any point between January 1, 2001 and August 31, 2019.