# News Data analysis and engineering

The news data can be located along three dimensions:
- Provider
- Subjects(related assets)
- Audiences

Provider: Over 75% of the news items are provided by RTRS, while the rest of the top 5 providers are in the 5-7% range. Under these circumstances, it may make sense to group news providers into channels: each of the top 5 providers gets its own channel, and the rest are grouped together as "others".
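
The channel grouping could be sketched as below. This is only an illustration under assumptions: the helper name `to_channels` is hypothetical, and the toy provider codes other than RTRS are made up rather than taken from the dataset.

```python
import pandas as pd

def to_channels(providers: pd.Series, top_n: int = 5, other_label: str = "others") -> pd.Series:
    """Keep the top_n most frequent providers as channels; map the rest to other_label."""
    top = providers.value_counts().nlargest(top_n).index
    return providers.where(providers.isin(top), other_label)

# Toy data, made up for illustration only
toy = pd.Series(["RTRS"] * 6 + ["P2"] * 2 + ["P3"] * 2 + ["P4", "P5"])
channels = to_channels(toy, top_n=3)
print(channels.value_counts())
```

On the real data, the same helper would presumably be applied to the `provider` column with `top_n=5`.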

Subjects: Numerous assets are mentioned in the dataset. The idea is to represent each asset by an asset embedding. The embeddings should have a linear similarity property, i.e. we can tell how similar two assets are by calculating their dot product normalised by the product of their lengths (their cosine similarity).
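
As a small illustration of the similarity measure described above, here is a sketch with NumPy. The function name `cosine_similarity` and the two-dimensional toy vectors are assumptions for the example; real asset embeddings would be learned and of higher dimension.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of a and b divided by the product of their Euclidean lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 3.0])  # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```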

Suppose that for news $n$ we originally have news data features $F^n=(F_1^n,...,F_M^n)$, a vector of dimension $M$, and $S$ subject assets with respective asset embeddings $A_1^n,...,A_S^n$. For an asset $A$, the relevancy of the news signal with respect to the asset $A$ is

$$ A \cdot \overline{A_n}$$, where $\overline{A_n}= \frac{(\Sigma_i A_i^n)}{S}$.

The intuition is that asset similarity will direct the news signals to the relevant assets, using the same mechanism as the well-known attention mechanism.

By doing so, we have made a few assumptions:
1. The attention mechanism is effective. (If it is not, summing over a lot of weak signals would confuse the model.) Ideally, we would want $A \cdot B$ to be close to either 1 or 0 for any two assets $A$ and $B$.
2. The news features $F^n$ have independent entries, so summing two news feature vectors will not break the association nature of the features. For that, we may need a non-linear transformation of the features that turns them into independent entries.
3. Each news item is asset-specific, i.e. it is related to a certain kind of assets with high similarity to one another, because taking the average would otherwise destroy the distribution information.

Within a unit timeframe containing $N$ news items, the signal directed towards a particular asset $A$ can be estimated as

$$\Sigma_n^N (A \cdot \overline{A_n}  \times F^n)$$.
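
The estimate above can be sketched in NumPy as follows. The helper names `news_relevance` and `asset_signal` and all toy values are assumptions made for this illustration; the plain dot product is used exactly as in the formula, so the embeddings are assumed to be already normalised.

```python
import numpy as np

def news_relevance(asset: np.ndarray, subject_embeddings: np.ndarray) -> float:
    """A . mean_i(A_i^n): dot product of the asset with the mean subject embedding."""
    return float(asset @ subject_embeddings.mean(axis=0))

def asset_signal(asset, subjects_per_news, features_per_news) -> np.ndarray:
    """Sum over the news items of relevance(A, news n) * F^n."""
    signal = np.zeros_like(features_per_news[0], dtype=float)
    for subjects, features in zip(subjects_per_news, features_per_news):
        signal += news_relevance(asset, subjects) * features
    return signal

asset = np.array([1.0, 0.0])
subjects_per_news = [
    np.array([[1.0, 0.0], [1.0, 0.0]]),  # news about assets identical to `asset`
    np.array([[0.0, 1.0]]),              # news about an orthogonal asset
]
features_per_news = [np.array([0.5, 2.0, 1.0]), np.array([9.0, 9.0, 9.0])]

signal = asset_signal(asset, subjects_per_news, features_per_news)
print(signal)
```

In this toy case the second news item has zero relevance, so only the features of the first item reach the asset.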

However, by doing so, the signals may be magnified when multiple articles talk about the same asset within the timeframe. There are two possible approaches to address this problem:

1. Apply another mechanism that can summarise the news asset-wise with some kind of recurrent structure or attention weightings.
2. Shrink the news data partition (i.e. shorten the unit timeframe and spread the news channels further) so that duplicated news within a short period becomes rare, and rely on the convolutional network to take care of summarising the news over a longer period of time. To test this approach, we would need to create the partition and do a partition count on the assets.




# Observations

The data is highly replicative. The same news can be issued to different audiences at different times. Also, multiple news items can mention the same asset within the same unit timeframe. There are properties describing the novelty of the news and the volume issued. Also, news can be replicated with only a small difference in word count.
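
One way to quantify this replication is to count, per headline, how many copies exist and across how many distinct times and audiences. The sketch below uses a made-up toy frame; only the column names `headline`, `time` and `audiences` come from the news data, and the headlines and audience codes are hypothetical.

```python
import pandas as pd

# Toy frame, made up for illustration only
toy = pd.DataFrame({
    "headline": ["X beats estimates", "X beats estimates", "Y names new CEO"],
    "time": pd.to_datetime(
        ["2007-01-02 20:04:09", "2007-01-02 20:09:30", "2007-01-03 08:00:00"], utc=True
    ),
    "audiences": ["{'E'}", "{'U'}", "{'E'}"],
})

# Copies per headline, and how many distinct times and audiences they span
replication = (
    toy.groupby("headline")
    .agg(
        copies=("time", "size"),
        distinct_times=("time", "nunique"),
        distinct_audiences=("audiences", "nunique"),
    )
    .reset_index()
)
print(replication)
```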

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
sns.set(font_scale=1.6)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
def trainDataLoad(env=False,market=True,news=True):
    '''
    Load the training data from the kaggle environment, falling back to local csv files.
    return market_df, news_df or (market_df,news_df), depending on the market/news flags
    '''
    market_df,news_df=None,None
    try:
        from kaggle.competitions import twosigmanews

        if(not env):
            env = twosigmanews.make_env()
        (market_df, news_df) = env.get_training_data()

        print('Data fetched from kaggle with {} rows of market data and {} rows of news data'.format(market_df.shape[0], news_df.shape[0]))
    except Exception:
        print('failed to load data from kaggle, loading data from local directory.')
        if(market):
            market_df=pd.read_csv('./sampleData/market_train.csv')
        if(news):
            news_df=pd.read_csv('./sampleData/news_train.csv')
        print('Train data loaded!')
    # Return a single frame when only one of the two was requested
    if(market and not news):
        return market_df
    if(news and not market):
        return news_df
    return (market_df,news_df)

In [None]:
def timeCut(df,timeStart,timeEnd=None, replace=True):
    '''
    df: dataFrame with attribute time in datetime64 format (or parseable strings)
    timeStart: start time as a string or Timestamp (exclusive)
    timeEnd: end time as a string or Timestamp (exclusive); None means no upper bound
    replace: kept for backward compatibility; it has no effect on the caller's df
    return a copy of the df slice strictly between timeStart and timeEnd
    '''
    df.time=pd.to_datetime(df.time)
    mask=df.time>pd.Timestamp(timeStart)
    if timeEnd is not None:
        mask&=df.time<pd.Timestamp(timeEnd)
    return df[mask].copy()

def formatCodeSet(df,field):
    '''
    df:dataframe
    field:name of the field holding codes as a string in set format, e.g. "{'HPQ.DE', 'HPQ.N'}"
    return the field formatted into a list of codes per row
    '''
    # Raw string so that \w and \. reach the regex engine unchanged
    return df[field].str.findall(r"'([\w\./]+)'")

In [None]:
#Load Data
(market_train_df,news_train_df)=trainDataLoad()

In [None]:
# Cut it into a slice for exploratory purpose
df_2007=timeCut(news_train_df,'2007-1-1 22:00:00+00:00','2007-12-31 22:00:00+00:00')

In [None]:
# Adding time partition for cuts
df_2007=df_2007.copy()  # work on a copy so the assignments do not target a slice of news_train_df
df_2007['year']=df_2007.time.dt.year
df_2007['month']=df_2007.time.dt.month
df_2007['day']=df_2007.time.dt.day
df_2007['hour']=df_2007.time.dt.hour
df_2007['minute']=df_2007.time.dt.minute
df_2007['second']=df_2007.time.dt.second

In [None]:
df_2007.columns

In [249]:
time_partition=['month','day','hour']
features1=['headline', 'urgency', 'provider', 'subjects', 'audiences', 'bodySize', 'companyCount', 'marketCommentary', 'sentenceCount', 'wordCount', 'assetCodes', 'assetName', 'firstMentionSentence', 'relevance', 'sentimentClass', 'sentimentNegative', 'sentimentNeutral', 'sentimentPositive', 'sentimentWordCount', 'noveltyCount12H', 'noveltyCount24H', 'noveltyCount3D', 'noveltyCount5D', 'noveltyCount7D', 'volumeCounts12H', 'volumeCounts24H', 'volumeCounts3D', 'volumeCounts5D', 'volumeCounts7D']
features2=['provider','assetCodes','audiences']
partition=time_partition+features2

In [250]:
partition_2007=df_2007[partition+['sourceTimestamp']].groupby(partition)['sourceTimestamp'].count().reset_index()
partition_2007['sourceTimestamp'].value_counts()

1     417420
2      56638
3      14875
4       7438
5       4177
6       2630
7       1834
8       1233
9        846
10       628
11       432
12       254
13       209
14       140
16        80
15        77
17        51
18        31
20        19
21        18
19        18
22        16
24        12
23        10
25         8
26         8
28         5
27         4
31         3
32         1
34         1
33         1
39         1
30         1
29         1
57         1
Name: sourceTimestamp, dtype: int64

In [None]:
partition_2007[partition_2007['sourceTimestamp']>30]

In [None]:
df_2007[(df_2007.month == 2) & (df_2007.day==20) & (df_2007.hour==21) & (df_2007.assetCodes=="{'HPQ.DE', 'HPQ.N'}") ]

In [None]:
df_2007[df_2007.time=='2007-01-17 12:00:18+00:00']

In [None]:
# Keep the first row for each (provider, assetCodes, audiences) combination
df_2007_u=df_2007.drop_duplicates(subset=features2)

In [None]:
partition_2007=df_2007_u[partition+['sourceTimestamp']].groupby(partition)['sourceTimestamp'].count().reset_index()
partition_2007['sourceTimestamp'].value_counts()

In [None]:
time_partition=['time']
features=['provider','assetCodes','audiences']
partition=time_partition+features
partition_2007=df_2007_u[partition+['sourceTimestamp']].groupby(partition)['sourceTimestamp'].count().reset_index()
partition_2007['sourceTimestamp'].value_counts()

In [None]:
partition_2007[partition_2007['sourceTimestamp']>1]

In [None]:
df_2007_u[df_2007_u.time=='2007-01-02 20:04:09+00:00']

In [None]:
partition_2007=df_2007[partition+['sourceTimestamp']].groupby(partition)['sourceTimestamp'].count().reset_index()


# Data pipeline

1. Determine news features and unit timeframe.
2. Apply a non-linear transformation to the news features, in the hope of turning them into independent entries.
3. Recurrent summary of news within the same channel.
    - Use an attention key to do weighted average of features.
    - Use recurrent networks to filter important features.
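
The attention-key option in step 3 could look like the sketch below. It is only an illustration under assumptions: the function name `attention_summary` is hypothetical, the key would be learned in practice rather than fixed, and the softmax of dot-product scores is one common choice of weighting, not necessarily the one the final model would use.

```python
import numpy as np

def attention_summary(features: np.ndarray, key: np.ndarray) -> np.ndarray:
    """Weighted average of the rows of `features` with softmax(features @ key) weights."""
    scores = features @ key
    weights = np.exp(scores - scores.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ features

# Three news items in one channel, two features each (toy values)
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

uniform = attention_summary(features, key=np.zeros(2))          # zero key: plain average
focused = attention_summary(features, key=np.array([50.0, -50.0]))  # strongly prefers item 0
print(uniform, focused)
```

Because the weights sum to 1, the summary stays on the scale of a single news item however many articles fall into the same channel and timeframe, which addresses the magnification concern raised earlier.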