# News Data analysis and engineering

Each news item can be located along three dimensions:
- Provider
- Subjects (related assets)
- Audiences

Provider: Over 75% of the news is provided by RTRS, and the rest of the top 5 providers account for 5-7%. Under these circumstances, it may make sense to group news providers into channels: each of the top 5 gets its own channel and the rest are grouped together as "others".
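The channel grouping could be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `assign_channels`, its parameters, and the provider codes other than RTRS are assumptions made up for the example.

```python
from collections import Counter

def assign_channels(providers, top_k=5, other_label="others"):
    """Give each of the top_k most frequent providers its own channel
    and map every other provider to a shared channel."""
    counts = Counter(providers)
    top = {name for name, _ in counts.most_common(top_k)}
    return [p if p in top else other_label for p in providers]

# Hypothetical provider codes; only RTRS is named in the notes.
providers = ["RTRS"] * 8 + ["P1", "P1", "P2", "P3"]
channels = assign_channels(providers, top_k=2)
```

With `top_k=2`, RTRS and P1 keep their own channels while P2 and P3 fall into "others".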

Subjects: Numerous assets are mentioned in the datasets. The idea is to represent each asset by an embedding. The embeddings should have a linear similarity property, i.e. we can tell how similar two assets are by calculating their dot product normalised by the product of their lengths (their cosine similarity).
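As a small sketch of that similarity measure, assuming the embeddings are plain NumPy vectors (the function name `cosine_similarity` is chosen here for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b normalised by the product of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Vectors pointing in the same direction score 1 regardless of length, and orthogonal vectors score 0.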

Suppose news item $n$ has a feature vector $F^n=(F_1^n,...,F_M^n)$ of dimension $M$ and $S$ subject assets with respective asset embeddings $A_1^n,...,A_S^n$. For an asset $A$, the relevancy of the news signal with respect to $A$ is

$$ A \cdot \overline{A_n}, \quad \text{where } \overline{A_n} = \frac{1}{S}\sum_{i=1}^{S} A_i^n. $$
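A minimal sketch of this relevancy score, assuming the embeddings are already normalised to unit length so that the dot product behaves like a similarity (the function name `news_relevancy` and the toy vectors are illustrative only):

```python
import numpy as np

def news_relevancy(asset, subject_embeddings):
    """Dot product of the asset embedding with the mean subject embedding."""
    mean_embedding = np.mean(np.asarray(subject_embeddings, dtype=float), axis=0)
    return float(np.dot(np.asarray(asset, dtype=float), mean_embedding))

asset = [1.0, 0.0]
# Toy case: one subject identical to the asset, one orthogonal to it.
subjects = [[1.0, 0.0], [0.0, 1.0]]
relevancy = news_relevancy(asset, subjects)
```

Here the averaged subject embedding is halfway between a perfect match and an unrelated asset, so the relevancy is halved.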

The intuition is that asset similarity directs news signals to the relevant assets, in the same way that the well-known attention mechanism weights its inputs.

By doing so, we have made a few assumptions:
1. The attention mechanism is effective. (If it is not, summing over a lot of weak signals would confuse the model.) Ideally, we would want $A \cdot B$ to be close to either 1 or 0 for any two assets $A$ and $B$.
2. The entries of the news feature vector $F^n$ are independent, so summing two news feature vectors does not break the associations between features. To achieve this, we may need a non-linear transformation that turns the features into independent entries.
3. Each news item is asset-specific, i.e. its subject assets are highly similar to one another, because averaging the subject embeddings discards information about how they are distributed.

Within a unit timeframe containing $N$ news items, the signal directed towards a particular asset $A$ can be estimated as

$$\sum_{n=1}^{N} \left(A \cdot \overline{A_n}\right) F^n.$$
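The aggregation over a timeframe could look like the sketch below. It assumes each news item comes with a list of subject embeddings and an $M$-dimensional feature vector; the function name `aggregate_signal` and the toy numbers are made up for illustration.

```python
import numpy as np

def aggregate_signal(asset, subject_embeddings_per_news, features):
    """Sum the news feature vectors, each weighted by its relevancy to the asset."""
    asset = np.asarray(asset, dtype=float)
    features = np.asarray(features, dtype=float)             # shape (N, M)
    weights = np.array([
        np.dot(asset, np.mean(np.asarray(subj, dtype=float), axis=0))
        for subj in subject_embeddings_per_news
    ])                                                        # shape (N,)
    return weights @ features                                 # shape (M,)

asset = [1.0, 0.0]
subjects_per_news = [
    [[1.0, 0.0]],               # fully relevant, weight 1
    [[0.0, 1.0]],               # unrelated, weight 0
    [[1.0, 0.0], [0.0, 1.0]],   # mixed subjects, weight 0.5
]
features = [[1.0, 2.0], [10.0, 20.0], [4.0, 6.0]]
signal = aggregate_signal(asset, subjects_per_news, features)
```

The unrelated item contributes nothing, and the mixed item contributes half of its features.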

However, the signal may be magnified when multiple articles talk about the same asset within the timeframe. There are two possible approaches to this problem:

1. Apply another mechanism that summarises the news asset-wise, using some kind of recurrent structure or attention weighting.
2. Shrink the news data partitions (i.e. shorten the unit timeframe and spread the news over more channels) so that duplicated news within a single partition becomes rare, and rely on the convolution network to summarise the news over a longer period of time. To test this approach, we would need to create the partitions and count the news per asset in each partition.
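The partition count for the second approach might be sketched as follows. The input layout (tuples of timestamp in seconds, channel, and asset list), the function names, and the asset codes are all assumptions for the example.

```python
from collections import Counter

def partition_counts(news_items, timeframe_seconds):
    """Count news items per (time bucket, channel, asset) partition."""
    counts = Counter()
    for timestamp, channel, assets in news_items:
        bucket = int(timestamp // timeframe_seconds)
        for asset in set(assets):  # count an asset once per news item
            counts[(bucket, channel, asset)] += 1
    return counts

def duplicate_rate(counts):
    """Fraction of non-empty partitions that hold more than one news item."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c > 1) / len(counts)

# Hypothetical news items: (timestamp in seconds, channel, subject assets).
news_items = [
    (0, "RTRS", ["AAA"]),
    (30, "RTRS", ["AAA", "BBB"]),
    (70, "RTRS", ["AAA"]),
    (20, "others", ["AAA"]),
]
rate_coarse = duplicate_rate(partition_counts(news_items, 60))
rate_fine = duplicate_rate(partition_counts(news_items, 20))
```

Comparing the rate across candidate timeframes gives a rough sense of how short the timeframe must be before repeated mentions become rare.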




# Observations

The data is highly repetitive. The same news may be issued to different audiences at different times. Also, multiple news items may mention the same asset within the same unit timeframe. The data includes properties describing the novelty of the news and the volume issued. News can also be replicated with only a small difference in word count.

# Data pipeline

1. Determine news features and unit timeframe.
2. Apply a non-linear transformation to the news features, with the aim of turning them into independent entries.
3. Recurrent summary of news within the same channel.
    - Use an attention key to do weighted average of features.
    - Use recurrent networks to filter important features.
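The attention-key option in step 3 could be sketched as a softmax-weighted average. This is one possible reading of the step, not a confirmed design: the function name `attention_pool` is hypothetical, and in a real model the key would be a learned parameter rather than a fixed vector.

```python
import numpy as np

def attention_pool(features, key):
    """Weighted average of feature vectors with softmax(features @ key) weights."""
    features = np.asarray(features, dtype=float)        # shape (N, M)
    scores = features @ np.asarray(key, dtype=float)    # shape (N,)
    scores = scores - scores.max()                      # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()     # sums to 1
    return weights @ features, weights

features = [[1.0, 0.0], [3.0, 4.0]]
# A zero key scores every item equally, so pooling reduces to a plain mean.
pooled_uniform, weights_uniform = attention_pool(features, [0.0, 0.0])
# A non-zero key shifts weight towards items that align with it.
pooled_keyed, weights_keyed = attention_pool(features, [0.0, 1.0])
```

Because the weights sum to 1, the pooled vector stays on the same scale no matter how many news items fall in the channel, which addresses the magnification concern raised earlier.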