Python implementation of a causal topic modeling paper.

This program implements the following paper:
<blockquote>
    <p>Hyun Duk Kim, Malu Castellanos, Meichun Hsu, ChengXiang Zhai, Thomas Rietz, and Daniel Diermeier. 2013. Mining causal topics in text data: Iterative topic modeling with time series feedback. In Proceedings of the 22nd ACM international conference on information & knowledge management (CIKM 2013). ACM, New York, NY, USA, 885-890. DOI=10.1145/2505515.2505612</p>
</blockquote>

In [74]:
# Import libraries
import plsa
import numpy as np
import pandas as pd
import nltk
import statsmodels.api as sm
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.api import VAR

In [2]:
# Import data
pres_market = pd.read_csv("./data/PRES00_WTA.csv")
AAMRQ = pd.read_csv("./data/AAMRQ.csv")
AAPL = pd.read_csv("./data/AAPL.csv")

In [3]:
pres_market

Unnamed: 0,Date,Contract,Units,$Volume,LowPrice,HighPrice,AvgPrice,LastPrice
0,5/1/2000,Dem,224,112.043,0.490,0.550,0.500,0.550
1,5/1/2000,Reform,2,0.067,0.019,0.048,0.034,0.019
2,5/1/2000,Rep,116,57.95,0.488,0.501,0.500,0.500
3,5/2/2000,Dem,87,44.369,0.501,0.522,0.510,0.508
4,5/2/2000,Reform,50,0.196,0.003,0.005,0.004,0.003
...,...,...,...,...,...,...,...,...
571,11/9/2000,Reform,2065,2.062,0.000,0.001,0.001,0.000
572,11/9/2000,Rep,10055,542.973,0.025,0.109,0.054,0.050
573,11/10/2000,Dem,3454,3328.02,0.950,0.980,0.964,0.969
574,11/10/2000,Reform,23,0.02,0.000,0.001,0.001,0.000


In [4]:
AAMRQ

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,7/3/2000,26.63,26.63,26.00,26.13,26.13,483100
1,7/5/2000,27.25,28.88,27.06,28.38,28.38,1840000
2,7/6/2000,28.44,29.56,27.81,29.00,29.00,1820000
3,7/7/2000,29.81,29.94,29.13,29.13,29.13,1150000
4,7/10/2000,29.75,30.13,29.19,30.00,30.00,711800
...,...,...,...,...,...,...,...
368,12/24/2001,21.72,21.73,20.77,21.19,21.19,1350000
369,12/26/2001,21.37,21.74,21.18,21.57,21.57,938900
370,12/27/2001,21.35,21.79,21.20,21.50,21.50,1190000
371,12/28/2001,21.60,22.19,21.55,22.00,22.00,853000


In [5]:
AAPL

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,7/3/2000,0.930804,0.969866,0.930804,0.952009,0.821251,70828800
1,7/5/2000,0.950893,0.985491,0.906250,0.921875,0.795256,265216000
2,7/6/2000,0.937500,0.945313,0.886161,0.925223,0.798145,309545600
3,7/7/2000,0.939174,0.978795,0.930804,0.972098,0.838581,263603200
4,7/10/2000,0.965960,1.040179,0.959821,1.020089,0.879981,397796000
...,...,...,...,...,...,...,...
368,12/21/2001,0.375179,0.384643,0.371429,0.375000,0.323494,256334400
369,12/24/2001,0.373214,0.383036,0.373214,0.381429,0.329040,50629600
370,12/26/2001,0.381250,0.398214,0.377500,0.383750,0.331042,146400800
371,12/27/2001,0.385357,0.397321,0.385357,0.394107,0.339977,191508800


# Granger Test

"Granger tests...measur[e] statistical significance at different time lags using auto regression to identify causal relationships. Let $y_{t}$ and $x_{t}$ be two time series. To see if $x_{t}$ 'Granger causes' $y_{t}$ with maximum $p$ time lag, run the following regression:

$$
y_{t} = a_{0} + a_{1}y_{t-1} + \cdots + a_{p}y_{t-p} + b_{1}x_{t-1} + \cdots + b_{p}x_{t-p}
$$

Then, use F-tests to evaluate the significance of the lagged $x$ terms. The coefficients of lagged $x$ terms estimate the impact of $x$ on $y$. We average the $x$ term coefficients, $\frac{\sum_{i=1}^{p}b_{i}}{|b|}$, as an impact value."
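
To make the quoted regression concrete, here is a minimal sketch that fits it by ordinary least squares on synthetic data and runs the F-test on the lagged $x$ terms. The helper names `lagged` and `granger_f_test`, the synthetic series, and the use of `scipy.stats` for the p-value are assumptions made for illustration. They are not part of the paper or of this notebook's pipeline, which relies on `statsmodels` instead.

```python
import numpy as np
from scipy import stats


def lagged(series, p):
    """Return the columns [s_{t-1}, ..., s_{t-p}] for t = p, ..., n-1."""
    n = len(series)
    return np.column_stack([series[p - k:n - k] for k in range(1, p + 1)])


def granger_f_test(y, x, p):
    """F-test of H0: b_1 = ... = b_p = 0 in the regression quoted above (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[p:]
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, lagged(y, p)])   # a_0 and the lags of y only
    full = np.hstack([restricted, lagged(x, p)])    # plus the lags of x

    def fit(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return coef, float(resid @ resid)

    _, ssr_restricted = fit(restricted)
    coef, ssr_full = fit(full)
    df_num = p
    df_denom = len(target) - full.shape[1]
    f_stat = ((ssr_restricted - ssr_full) / df_num) / (ssr_full / df_denom)
    p_value = float(stats.f.sf(f_stat, df_num, df_denom))
    b = coef[-p:]                                   # coefficients b_1, ..., b_p
    return f_stat, p_value, float(b.mean())         # mean of b is the impact value


# Synthetic example (an assumption): y depends on its own past and on x one step back.
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.3 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

f_stat, p_value, impact = granger_f_test(y, x, p=2)
print(f"F={f_stat:.2f}, p={p_value:.3g}, impact={impact:.3f}")
```

Because only $b_{1}$ is nonzero in the synthetic data, the averaged impact value lands near half of 0.8. This illustrates how averaging over $p$ lags dilutes an effect concentrated at a single lag.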

In [49]:
close = pd.concat([AAMRQ["Close"], AAPL["Close"]], axis=1, keys=["AAMRQ", "AAPL"])
close

Unnamed: 0,AAMRQ,AAPL
0,26.13,0.952009
1,28.38,0.921875
2,29.00,0.925223
3,29.13,0.972098
4,30.00,1.020089
...,...,...
368,21.19,0.375000
369,21.57,0.381429
370,21.50,0.383750
371,22.00,0.394107


In [50]:
close = close.rolling(3, center=True, min_periods=2).mean()
close

Unnamed: 0,AAMRQ,AAPL
0,27.255000,0.936942
1,27.836667,0.933036
2,28.836667,0.939732
3,29.376667,0.972470
4,29.523333,1.002976
...,...,...
368,21.550000,0.375179
369,21.420000,0.380060
370,21.690000,0.386429
371,21.933333,0.392798


In [53]:
close = close.diff()[1:]
close

Unnamed: 0,AAMRQ,AAPL
1,0.581667,-0.003906
2,1.000000,0.006696
3,0.540000,0.032738
4,0.146667,0.030506
5,0.500000,0.026414
...,...,...
368,-0.026667,-0.001547
369,-0.130000,0.004881
370,0.270000,0.006369
371,0.243333,0.006369
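
The smoothing and differencing steps above can be traced on the first five AAMRQ closing prices shown earlier. This is a small sketch for checking the arithmetic, not part of the original pipeline. The variable names `prices`, `smoothed`, and `changes` are chosen here for illustration.

```python
import pandas as pd

# First five AAMRQ closing prices from the table above.
prices = pd.Series([26.13, 28.38, 29.00, 29.13, 30.00], name="AAMRQ")

# Centered 3-day window; min_periods=2 lets the two edge rows average
# over the two values that exist instead of producing NaN.
smoothed = prices.rolling(3, center=True, min_periods=2).mean()

# Day-over-day change of the smoothed series; the first row has no
# predecessor, so it is dropped.
changes = smoothed.diff().iloc[1:]

print(smoothed)
print(changes)
```

The first smoothed value is the mean of only two prices, (26.13 + 28.38) / 2 = 27.255, which matches row 0 of the smoothed table. The last row of this toy series differs from row 4 of the notebook output because the toy series ends there, so its window also has only two values. Differencing removes the price level, which matters because Granger tests assume stationary series.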


In [60]:
# Test whether the second column (AAPL) Granger-causes the first column (AAMRQ)
# for every lag from 1 up to the given maximum (5).
gc_res = grangercausalitytests(close, 5)


Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.4416  , p=0.5068  , df_denom=368, df_num=1
ssr based chi2 test:   chi2=0.4452  , p=0.5046  , df=1
likelihood ratio test: chi2=0.4449  , p=0.5048  , df=1
parameter F test:         F=0.4416  , p=0.5068  , df_denom=368, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=0.5103  , p=0.6008  , df_denom=365, df_num=2
ssr based chi2 test:   chi2=1.0345  , p=0.5961  , df=2
likelihood ratio test: chi2=1.0331  , p=0.5966  , df=2
parameter F test:         F=0.5103  , p=0.6008  , df_denom=365, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=0.3828  , p=0.7655  , df_denom=362, df_num=3
ssr based chi2 test:   chi2=1.1706  , p=0.7601  , df=3
likelihood ratio test: chi2=1.1687  , p=0.7605  , df=3
parameter F test:         F=0.3828  , p=0.7655  , df_denom=362, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=1.0379  , p=0.3875  

In [70]:
# Collect the p-value of the parameter F-test for each lag (gc_res is keyed by lag, 1..maxlag)
p_vals = [gc_res[lag][0]['params_ftest'][1] for lag in range(1, len(gc_res) + 1)]

In [71]:
p_vals

[0.5067873616230052,
 0.6007552098554052,
 0.7654686970619549,
 0.3874795496740359,
 0.4398020489287572]

In [72]:
np.argmin(p_vals)  # zero-based index of the smallest p-value; the lag is index + 1

3

In [79]:
np.argmax(np.subtract(1, p_vals))

3
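
Both calls above return a zero-based position in `p_vals`, while the Granger results are keyed by lag starting at 1. The sketch below converts the index to a lag using the p-values printed earlier. The names `best_index`, `best_lag`, and `significance` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

# p-values of the parameter F-test for lags 1..5, copied from the output above.
p_vals = [0.5067873616230052,
          0.6007552098554052,
          0.7654686970619549,
          0.3874795496740359,
          0.4398020489287572]

best_index = int(np.argmin(p_vals))       # position in the list (zero-based)
best_lag = best_index + 1                 # lags start at 1
significance = 1 - p_vals[best_index]     # same ordering as np.argmax(1 - p)

print(best_lag, round(significance, 4))
```

Index 3 therefore corresponds to lag 4. Even at that lag the p-value is about 0.387, so none of the five lags is significant at the 0.05 level for this pair of series.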

In [86]:
# Summary of the restricted OLS model (own lag only) fitted for lag 1
gc_res[1][1][0].summary()

Dep. Variable:,y,R-squared:,0.414
Model:,OLS,Adj. R-squared:,0.413
Method:,Least Squares,F-statistic:,260.8
Date:,"Sat, 07 Nov 2020",Prob (F-statistic):,9.31e-45
Time:,15:45:58,Log-Likelihood:,-233.09
No. Observations:,371,AIC:,470.2
Df Residuals:,369,BIC:,478.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
x1,0.6428,0.040,16.150,0.000,0.565,0.721
const,-0.0061,0.024,-0.259,0.796,-0.053,0.040

Omnibus:,161.879,Durbin-Watson:,1.785
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3606.831
Skew:,-1.284,Prob(JB):,0.0
Kurtosis:,18.058,Cond. No.,1.69


In [99]:
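# Requires a fitted VAR model in `results`; tests whether the other
# series in the system (AAPL) Granger-causes AAMRQ.
results.test_causality('AAMRQ').summary()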

Test statistic,Critical value,p-value,df
0.4416,3.854,0.507,"(1, 736)"


In [111]:
results.params

Unnamed: 0,AAMRQ,AAPL
const,-0.009421,-0.000509
L1.AAMRQ,0.800274,-0.001446
L1.AAPL,1.023943,0.727527
L2.AAMRQ,-0.00981,0.002245
L2.AAPL,-2.897335,0.124881
L3.AAMRQ,-0.465091,3.4e-05
L3.AAPL,-0.247153,-0.529698
L4.AAMRQ,0.367037,-2.7e-05
L4.AAPL,1.238642,0.347019
L5.AAMRQ,-0.082454,-0.00246


In [112]:
'L' + str(3) + '.' + results.params.columns.values[1]

'L3.AAPL'

In [115]:
type(results.params)

pandas.core.frame.DataFrame

In [120]:
# A range whose stop is not greater than its start is empty, so this prints nothing.
for i in range(1, -5 + 1):
    print(i)

In [110]:
model = VAR(close)
results = model.fit(5)
results.summary()

  Summary of Regression Results   
Model:                         VAR
Method:                        OLS
Date:           Sat, 07, Nov, 2020
Time:                     16:07:45
--------------------------------------------------------------------
No. of Equations:         2.00000    BIC:                   -10.2797
Nobs:                     367.000    HQIC:                  -10.4208
Log likelihood:           909.784    FPE:                2.71597e-05
AIC:                     -10.5138    Det(Omega_mle):     2.56020e-05
--------------------------------------------------------------------
Results for equation AAMRQ
              coefficient       std. error           t-stat            prob
---------------------------------------------------------------------------
const           -0.009421         0.021853           -0.431           0.666
L1.AAMRQ         0.800274         0.052984           15.104           0.000
L1.AAPL          1.023943         1.797251            0.570           0.569
L2.A

In [None]:
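def impact_value(data, lag):
    """Average the lagged coefficients of the second column in the first column's VAR equation."""
    results = VAR(data).fit(lag)
    # results.params: rows are 'const', 'L1.<name>', ...; columns are the equations.
    target, source = results.params.columns.values[:2]
    numerator = 0
    for i in range(1, lag + 1):
        numerator += results.params.loc['L' + str(i) + '.' + source, target]
    # Divide by the number of lagged coefficients (|b| in the paper's formula).
    return numerator / np.abs(lag)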

# Topic Purity

In [None]:
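def topic_purity(num_pos_impact, num_neg_impact, num_significant):
    """Purity = 100 + 100 * entropy, where entropy = sum of p * log(p) and is never positive."""
    p_prob = num_pos_impact / num_significant
    n_prob = num_neg_impact / num_significant
    # Skip zero probabilities: p * log(p) tends to 0 as p -> 0, but NumPy would return nan.
    entropy = sum(p * np.log(p) for p in (p_prob, n_prob) if p > 0)
    return 100 + 100 * entropy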
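
Two boundary cases show how the purity score behaves. The sketch below restates the function so that it runs on its own. The counts passed in are made-up inputs, and skipping zero probabilities (treating 0 · log 0 as 0) is an assumption about the intended behaviour.

```python
import numpy as np


def topic_purity(num_pos_impact, num_neg_impact, num_significant):
    """100 + 100 * sum(p * log(p)) over the positive and negative shares."""
    probs = np.array([num_pos_impact, num_neg_impact], dtype=float) / num_significant
    nonzero = probs[probs > 0]                 # treat 0 * log(0) as 0
    entropy = float(np.sum(nonzero * np.log(nonzero)))
    return 100 + 100 * entropy


pure = topic_purity(10, 0, 10)    # all significant words share one sign
mixed = topic_purity(5, 5, 10)    # signs split evenly

print(pure, mixed)
```

With the natural logarithm used here, an even split gives 100 + 100 · ln(0.5), roughly 30.7, rather than 0. Switching to `np.log2` would make the score range from 0 to 100. Which base the paper intends is not confirmed by this notebook.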