Objective:
To use the Granger causality approach to show how sentiment affects stock prices.
About Granger Causality:
1. According to Wikipedia:
The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another, first proposed in 1969. Ordinarily, regressions reflect "mere" correlations, but Clive Granger argued that causality in economics could be tested for by measuring the ability to predict the future values of a time series using prior values of another time series. Since the question of "true causality" is deeply philosophical, and because of the post hoc ergo propter hoc fallacy of assuming that one thing preceding another can be used as a proof of causation, econometricians assert that the Granger test finds only "predictive causality".

Intuitively, Granger causality can be used here to test whether news sentiment helps predict stock market prices.
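
To make the idea of "predictive causality" concrete, here is a minimal sketch of the F-test behind a Granger test, written with NumPy only. The function name `granger_f_stat` and the synthetic series are illustrative assumptions, not part of this notebook's pipeline. It compares a regression of y on its own lags against one that also includes lags of x.

```python
import numpy as np

def granger_f_stat(y, x, n_lags):
    """F statistic for H0: lags of x do not help predict y beyond lags of y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    target = y[n_lags:]
    # Column k holds the series shifted back by k steps
    y_lags = np.column_stack([y[n_lags - k:n - k] for k in range(1, n_lags + 1)])
    x_lags = np.column_stack([x[n_lags - k:n - k] for k in range(1, n_lags + 1)])
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, y_lags])            # own lags only
    unrestricted = np.hstack([const, y_lags, x_lags])  # own lags + lags of x

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / n_lags) / (rss_u / dof)

# Synthetic data (assumption): y follows x with a one-step delay
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=499)

f_x_to_y = granger_f_stat(y, x, n_lags=2)  # x leads y: large F expected
f_y_to_x = granger_f_stat(x, y, n_lags=2)  # y does not lead x: small F expected
print(f_x_to_y, f_y_to_x)
```

A large F statistic in one direction and a small one in the other is the pattern the test looks for. It only says that past values of x improve the forecast of y, not that x truly causes y.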

How to determine it?
The general observation is that stock prices tend to drop after a sentiment is released. The price change should be calculated both before and after the sentiment is released, and these changes should be converted to standard z-scores before being used in the Granger causality tests.
I have used a single asset to make these observations.
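
As a small illustration of the before/after differences and their z-scores, here is a sketch on made-up numbers. The prices and the helper `to_zscore` are assumptions for illustration only. The helper uses the population standard deviation, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical open prices on the day before, the day of, and the day after
# four separate sentiment dates (illustrative numbers only)
open_prev = np.array([10.0, 12.0, 11.0, 15.0])
open_day = np.array([10.5, 11.0, 11.2, 14.0])
open_next = np.array([10.2, 10.5, 11.8, 13.0])

# Price change leading into the sentiment date and following it
diff_before = open_day - open_prev
diff_after = open_day - open_next

def to_zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    return (values - values.mean()) / values.std()

z_before = to_zscore(diff_before)
z_after = to_zscore(diff_after)
print(z_before, z_after)
```

Standardizing puts the price changes on a comparable scale before they are passed to the test.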




In [3]:
import numpy as np
import pandas as pd


from datetime import timedelta
from kaggle.competitions import twosigmanews
from scipy.stats import zscore

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

# Get 2Sigma environment
env = twosigmanews.make_env()



ModuleNotFoundError: No module named 'kaggle.competitions'

In [None]:
# Get the data
mt_df, nt_df = env.get_training_data()

We will try to measure the relationship between market data and sentiment, using one particular asset.

In [None]:
mt_df['time'] = pd.to_datetime(mt_df['time'])
nt_df['time'] = pd.to_datetime(nt_df['time'])

In [None]:
mt_df.head()

In [None]:
nt_df.head()

In [None]:
# Understand the relationship between sentiment and open values for one asset, so that we can try to generalize it later.
asset = 'PetroChina Co Ltd'
mt_df_petro_china = mt_df[mt_df['assetName']==asset]
nt_df_petro_china = nt_df[nt_df['assetName']==asset]

In [None]:
def date_(x):
    return x.date()

In [None]:
# consider the open prices and time from market data, consider the sentiments and time from sentiment data
mt_df_petro_china = mt_df_petro_china[['time','open']]
mt_df_petro_china['date_only'] = mt_df_petro_china['time'].map(date_)
mt_df_petro_china['date_only'] = pd.to_datetime(mt_df_petro_china['date_only'])
mt_df_petro_china.head()

In [None]:
nt_df_petro_china = nt_df_petro_china[['time','headline','sentimentNegative','sentimentNeutral','sentimentPositive']]
nt_df_petro_china['date_only'] = nt_df_petro_china['time'].map(date_)
nt_df_petro_china['date_only'] = pd.to_datetime(nt_df_petro_china['date_only'])
nt_df_petro_china['sentiment_max'] = nt_df_petro_china[['sentimentNegative','sentimentNeutral','sentimentPositive']].max(axis=1)
nt_df_petro_china.head()

To complete this data, we approximate the missing values using a concave function. If the open value on a given day is x and the next available data point is y, with n days missing in between, we estimate the first day after x as (y+x)/2 and then apply the same method recursively until all gaps are filled. This approximation is justified because stock data usually follows a concave function, except of course at anomaly points of sudden rise and fall.
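
The recursive midpoint rule described above can be sketched in plain Python as follows. The function name `fill_gaps_midpoint` and the use of NaN as the gap marker are assumptions for illustration, and the sketch assumes that the first and last values are known.

```python
import math

def fill_gaps_midpoint(values):
    """Fill NaN gaps: each missing day becomes the average of the previous
    (already filled) value and the next available value."""
    filled = []
    for i, value in enumerate(values):
        if not math.isnan(value):
            filled.append(value)
            continue
        # Next known value after the gap (assumes the series ends on a known value)
        next_known = next(v for v in values[i + 1:] if not math.isnan(v))
        filled.append((filled[-1] + next_known) / 2.0)
    return filled

example = fill_gaps_midpoint([10.0, float("nan"), float("nan"), 18.0])
print(example)  # [10.0, 14.0, 16.0, 18.0]
```

With x = 10 and y = 18, the first missing day becomes (10+18)/2 = 14 and the second becomes (14+18)/2 = 16, so the filled values approach y from below along a concave path.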

In [None]:
# Build a continuous daily calendar between the first and last trading day
date_ranges = pd.date_range(start=mt_df_petro_china['date_only'].min(), end=mt_df_petro_china['date_only'].max())
# Keep the first open price per day, indexed by date
daily_open = mt_df_petro_china.drop_duplicates('date_only').set_index('date_only')['open']
# Days without market data get 0.0 as a placeholder, to be filled in the next cells
# (DataFrame.append was removed in pandas 2.0, so reindex onto the calendar instead)
appended_df = (daily_open.reindex(date_ranges, fill_value=0.0)
               .rename('open_value')
               .rename_axis('date')
               .reset_index())


        
    


In [None]:
true_open_values = appended_df['open_value'].values
modified_ = []
for i in range(len(true_open_values)):
    open_ = true_open_values[i]
    if open_!=0.0:
        modified_.append(open_)
    else:
        # Next available (non-zero) open value after day i
        i_p_2 = next((x for x in true_open_values[i+1:] if x), None)
        # Average it with the previous day's value, which has already been filled
        modified_.append((modified_[-1]+i_p_2)/2.0)
        
    

In [None]:
# Second pass: fill any placeholder zeros still left after the previous cell
true_open_values = modified_
modified_new = []
for i in range(len(true_open_values)):
    open_ = true_open_values[i]
    if open_!=0.0:
        modified_new.append(open_)
    else:
        # Next available (non-zero) open value after day i
        i_p_2 = next((x for x in true_open_values[i+1:] if x), None)
        modified_new.append((modified_new[-1]+i_p_2)/2.0)

In [None]:
appended_df['open_value'] = modified_new
appended_df.head()

In [None]:
nt_df_petro_china_avg_sentiments = nt_df_petro_china.groupby('date_only').agg({'sentiment_max':'mean'}).reset_index()
nt_df_petro_china_avg_sentiments.head()

Compute the difference in open values before and after each sentiment date

In [None]:
# Sort the common dates: a set has no order, and the Granger test needs time-ordered observations
dates_common = sorted(set(nt_df_petro_china_avg_sentiments['date_only'].values).intersection(set(appended_df['date'].values)))
frames = []
for date in dates_common:
    # Skip the first and last common dates, which may lack a previous or next day
    if date == dates_common[0] or date == dates_common[-1]:
        continue
    internal_df = pd.DataFrame(columns=['sentiment_value','difference_before_sentiment','difference_after_sentiment'],index=[0])
    internal_df['sentiment_value'] = nt_df_petro_china_avg_sentiments[nt_df_petro_china_avg_sentiments['date_only']==date]['sentiment_max'].values[0]
    internal_df['difference_before_sentiment'] = appended_df[appended_df['date']==pd.to_datetime(date)]['open_value'].values[0] - appended_df[appended_df['date']==(pd.to_datetime(date)-timedelta(days=1))]['open_value'].values[0]
    internal_df['difference_after_sentiment'] = appended_df[appended_df['date']==pd.to_datetime(date)]['open_value'].values[0] - appended_df[appended_df['date']==(pd.to_datetime(date)+timedelta(days=1))]['open_value'].values[0]
    internal_df['date'] = date
    frames.append(internal_df)
# DataFrame.append was removed in pandas 2.0; concatenate once instead
difference_df = pd.concat(frames, ignore_index=True)
    
    

In [None]:
#get the zscores
difference_df['z_score_difference_before_sentiment'] = zscore(difference_df['difference_before_sentiment'])
difference_df['z_score_difference_after_sentiment'] = zscore(difference_df['difference_after_sentiment'])

In [None]:
difference_df.head()

In [None]:
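from statsmodels.tsa.stattools import grangercausalitytests
# Note: grangercausalitytests checks whether the SECOND column Granger-causes the FIRST,
# so this call tests whether the price change helps predict the sentiment value.
# Swap the two columns to test whether sentiment helps predict the price change.
granger_test_result = grangercausalitytests(difference_df[['sentiment_value','z_score_difference_before_sentiment']].values,maxlag=2)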

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:

ts = pd.Series(difference_df['z_score_difference_before_sentiment'].values[:100], index=difference_df['date'].values[:100])
ts1 = pd.Series(difference_df['sentiment_value'].values[:100], index=difference_df['date'].values[:100])

plt.figure(figsize=(20,10))
ts1.plot(label='sentiment',color='r')
ts.plot(label='z_score_before_sentiment',color='b')

plt.legend()

Observations:
1. “lags” is the offset (the number of past observations) used in the test. In this example, the lag is how many days earlier a change in trend occurs before it affects the opening price.
2. For lags=2, p<0.05, hence we can try to build features from the two-day offset: (price two days before the sentiment is observed - price on the day the sentiment is observed).
3. The p-values can be interpreted as follows:
If p > .10 → “not significant”
If .05 < p ≤ .10 → “marginally significant”
If .01 < p ≤ .05 → “significant”
If p ≤ .01 → “highly significant”
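
The p-value thresholds above can be wrapped in a small helper when scanning results for many assets or lags. This is only a sketch, and the function name `significance_label` is a hypothetical choice rather than part of statsmodels.

```python
def significance_label(p_value):
    """Map a p-value to the labels used in the observations above."""
    if p_value <= 0.01:
        return "highly significant"
    if p_value <= 0.05:
        return "significant"
    if p_value <= 0.10:
        return "marginally significant"
    return "not significant"

# Illustrative p-values (assumptions, not results from this notebook)
labels = [significance_label(p) for p in (0.003, 0.04, 0.08, 0.5)]
print(labels)
```

Checking the strictest threshold first means each p-value gets exactly one label.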


Stay tuned! I will be building features based on these observations.