In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
sns.set(font_scale=1.6)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [27]:
def trainDataLoad(market=True,news=True):
    '''
    Load the training data from the Kaggle environment,
    falling back to the local CSV files when that fails
    '''
    # Default to None so the return below never hits an unbound name
    market_df, news_df = None, None
    try:
        from kaggle.competitions import twosigmanews

        env = twosigmanews.make_env()
        (market_df, news_df) = env.get_training_data()
    except Exception:
        print('failed to load data from kaggle, loading data from local directory.')
        if(market):
            market_df=pd.read_csv('./sampleData/market_train.csv')
        if(news):
            news_df=pd.read_csv('./sampleData/news_train.csv')
    print('Train data loaded!')
    return (market_df,news_df)

In [28]:
def timeCut(df,time, replace=True):
    '''
    df: DataFrame with a time column (converted to datetime64 here)
    time: cut-off time as a string
    replace: kept for backward compatibility; it has no effect, so assign the return value
    return the slice of df with rows strictly after the cut-off time
    '''
    df.time=pd.to_datetime(df.time)
    time=pd.Timestamp(time)
    df_slice = df[df.time>time]
    return df_slice

def formatCodeSet(df,field):
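    '''
    df: DataFrame
    field: name of a column holding sets of codes serialised as strings
    return the column with each entry parsed into a list of codes
    '''
    # Raw string avoids invalid escape warnings; no f-string is needed here
    return df[field].str.findall(r"'([\w./]+)'")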
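
To see what the pattern in `formatCodeSet` extracts, here is a minimal sketch on a single made-up value. The sample string is an assumption about how set-like columns such as `assetCodes` are serialised in the CSV; `Series.str.findall` applies the same pattern to every row.

```python
import re

# Hypothetical raw value, assuming the column stores a set as its string repr
raw = "{'AAPL.O', 'AAPL.OQ', 'BRK/A.N'}"

# Same pattern as formatCodeSet: capture runs of word characters,
# dots and slashes found between single quotes
codes = re.findall(r"'([\w./]+)'", raw)
print(codes)
```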

In [29]:
#Load Data
(market_train_df,news_train_df)=trainDataLoad()

failed to load data from kaggle, loading data from local directory.
Train data loaded!


In [43]:
# Optionally cut the data down to a smaller size for development, to save resources

time='2012-12-31'
# It is best to work on a time-cut subset of the data
if time:
    market_train_df = timeCut(market_train_df,time)
    news_train_df = timeCut(news_train_df,time)
    
news_train_df['subjects'] = formatCodeSet(news_train_df,'subjects')
news_train_df['audiences'] = formatCodeSet(news_train_df,'audiences')
news_train_df['assetCodes'] = formatCodeSet(news_train_df,'assetCodes')

# Part 2 - News data features

The news dataset already includes many engineered features for prediction. However, it is worth exploring the headlines further with different kinds of embeddings. To make use of the headline text, we apply document embedding so that the model can "understand" the document contents and improve its predictions accordingly.

In [57]:
import nltk
from nltk.tokenize import RegexpTokenizer


def tokeniser_wrapper(tokeniser):
    '''
    Wrap a tokeniser so that inputs it cannot handle (e.g. NaN) yield an empty list
    '''
    def wrapped(text):
        try:
            return tokeniser(text)
        except Exception:
            print('Failed tokenisation on input:',text)
            return []
    return wrapped
    
    
def getTokens(tokeniser, textCol, iterator=True):
    '''
    Take in a text column and return its tokenised entries,
    lazily as a map object, or as a list when iterator=False
    '''
    # Series.as_matrix() was removed in pandas 1.0; .values works across versions
    if not iterator:
        return list(map(tokeniser,textCol.values))
    return map(tokeniser,textCol.values)

#Tokenisers
tknsr=tokeniser_wrapper(nltk.word_tokenize)
tknsr_noPunc = tokeniser_wrapper(RegexpTokenizer(r'\w+').tokenize)


In [59]:
# Series.as_matrix() was removed in pandas 1.0; .values works across versions
news_train_df['headline_tokens']=list(map(tknsr,news_train_df['headline'].values))

Failed tokenisation on input: nan
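
The failures reported in the output appear to come from missing headlines: `nan` is a float, not a string, so the tokeniser rejects it. A minimal sketch of handling this explicitly is shown below. `safe_tokenise` is a hypothetical helper, and `re.findall(r"\w+", ...)` stands in for the punctuation-free `RegexpTokenizer` so that the example runs without NLTK.

```python
import re

def safe_tokenise(text):
    """Return word tokens, or an empty list for non-string input such as NaN."""
    if not isinstance(text, str):
        return []
    return re.findall(r"\w+", text)

# Made-up headlines, one of them missing
headlines = ["Apple shares rise 3%", float("nan"), "Fed holds rates"]
tokens = [safe_tokenise(h) for h in headlines]
print(tokens)
```

Checking the type up front avoids printing one message per missing row, which matters on a dataset with many empty headlines.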

## Word embeddings

Embedding means turning categorical or text data into meaningful vectors that can be "ingested" by a machine learning model. Word embedding is a very common technique in NLP. Once words are turned into vectors, we can compose them into a vector that represents the meaning of each document.

Several models can further encode word embeddings into document embeddings; these are discussed later.
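
As an illustration of composing word vectors into a document vector, the sketch below averages the vectors of the words in a headline. Mean pooling is only a simple baseline chosen for this example, not the method used later in this notebook, and the 3-dimensional vectors and the `mean_pool` helper are made up.

```python
import numpy as np

# Toy 3-dimensional word vectors; real models typically use 100 to 300 dimensions
word_vectors = {
    "oil":    np.array([0.9, 0.1, 0.0]),
    "prices": np.array([0.7, 0.3, 0.1]),
    "fall":   np.array([0.0, 0.2, 0.9]),
}

def mean_pool(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "sharply" has no vector, so it is skipped
doc_vec = mean_pool(["oil", "prices", "fall", "sharply"], word_vectors)
print(doc_vec)
```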

## Document embeddings

While some models create embeddings for words, others directly create embeddings for documents of varying length.

To train embeddings, we first need to tokenise the headline sentences.

### Word embeddings - 1. Word2Vec
Original paper: https://arxiv.org/pdf/1310.4546.pdf

Blog reference: https://www.knime.com/blog/word-embedding-word2vec-explained

Further formalisation: https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf

Pretrained embeddings:

1. google news: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
2. freebase entity: https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit


Word2Vec is one of the best-known word embedding models, first [published by Google in 2013](https://arxiv.org/pdf/1310.4546.pdf). It comes in two architectures: CBOW, which predicts a word from its surrounding context, and skip-gram, which predicts the context from a word. The learnt representation is a vector that encodes the co-occurrence probabilities between words and their contexts. The model uses several techniques to reduce the cost of training, such as `negative sampling` and `hierarchical softmax`, which are also applied in embedding models developed afterwards.
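
To make the two architectures concrete, the sketch below builds the training examples each one would see for a single sentence. It is a simplified illustration: the helper names are made up, the window is fixed, and details of real implementations (such as randomly shrinking the window or subsampling frequent words) are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: skip-gram predicts each context word from the center."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def cbow_examples(tokens, window=2):
    """Yield (context_list, center) examples: CBOW predicts the center from its context."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            yield context, center

sentence = ["fed", "raises", "interest", "rates"]
sg = list(skipgram_pairs(sentence, window=1))
cb = list(cbow_examples(sentence, window=1))
print(sg)
print(cb)
```

In gensim, the `sg` argument of `Word2Vec` selects between the two: `sg=0` for CBOW and `sg=1` for skip-gram.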

Word2Vec can be viewed either as an auto-encoder-like model over word-context pairs, in deep-learning terms, or as [an implicit factorisation (approximation) of the word-context pointwise mutual information matrix](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf). 

Understanding the latter view helps when combining the model with other mathematical or statistical models to produce quantitative results.
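
The matrix-factorisation view can be sketched on a toy corpus: count word-context co-occurrences, convert them to pointwise mutual information, and factorise the result with SVD. This is an illustration under assumptions rather than what Word2Vec does internally: it uses positive PMI (a common variant) instead of the shifted PMI analysed in the paper, and the corpus, window size and dimension are made up.

```python
import numpy as np

corpus = [["oil", "prices", "rise"], ["oil", "prices", "fall"], ["stocks", "rise"]]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a window of 1
counts = np.zeros((len(vocab), len(vocab)))
window = 1
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[index[w], index[sent[j]]] += 1

# PMI(w, c) = log( p(w, c) / (p(w) * p(c)) )
p_wc = counts / counts.sum()
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))

# Positive PMI: clip negative values and the -inf of unseen pairs to zero
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Low-rank factorisation: rows of `embeddings` are 2-dimensional word vectors
U, S, _ = np.linalg.svd(ppmi)
dim = 2
embeddings = U[:, :dim] * np.sqrt(S[:dim])
print(embeddings.shape)
```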

##### Custom embedding - Model training

In [2]:
from gensim.models import Word2Vec

In [69]:
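# Note: gensim >= 4.0 renames the `size` argument to `vector_size`
model_w2vCustom = Word2Vec(sentences=news_train_df['headline_tokens'], size=100, window=5, min_count=5, workers=4, sg=0)
model_w2vCustom.save('./sampleData/word2vecCustom.model')
# To reuse a previously trained model instead of retraining:
#model_w2vCustom=Word2Vec.load('./sampleData/word2vecCustom.model')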

#### Pretrained-models

In [4]:
import gensim
# Load Google's pre-trained Word2Vec model.
fileDir='C:/Users/CK/Downloads/'
freebase_entity='knowledge-vectors-skipgram1000.bin'
google_news='GoogleNews-vectors-negative300.bin'
model_w2vGoogleNews = gensim.models.KeyedVectors.load_word2vec_format(fileDir+google_news, binary=True)

#### Transfer learning

In [None]:
#news_train_df=pd.read_csv('./sampleData/news_train.csv', encoding = "ISO-8859-1")
#Create the model (300 dimensions to match the Google News vectors)
model_w2vGoogleNewsPlus=gensim.models.Word2Vec(size=300)
#Build the new vocabs
model_w2vGoogleNewsPlus.build_vocab(news_train_df['headline_tokens'])
#Read in the pretrained vectors for words already in the vocabulary
#lockf=1.0 lets the imported vectors keep training; the default 0.0 freezes them
model_w2vGoogleNewsPlus.intersect_word2vec_format(fileDir+google_news, binary=True, lockf=1.0)
model_w2vGoogleNewsPlus.train(news_train_df['headline_tokens'],total_examples=model_w2vGoogleNewsPlus.corpus_count,epochs=10)
model_w2vGoogleNewsPlus.save('./sampleData/word2vecTransfer.model')

### Word embeddings - 2. FastText

Original paper: https://arxiv.org/pdf/1607.04606.pdf

Blog reference: https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c

Pre-trained embeddings:

1. https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md

FastText is very similar to Word2Vec, but instead of treating each word as a single unit, it splits words into subwords (character n-grams) and sums their vectors to form the embedding of the word. This lets the model generalise to words that were not seen during training.
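
The subword idea can be sketched in a few lines. The helper below is a hypothetical illustration, not gensim's implementation: it wraps the word in `<` and `>` boundary markers and lists its character n-grams, as described in the FastText paper (which uses n from 3 to 6 and also keeps the whole word as one extra unit).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Trigrams only, to keep the output short
grams = char_ngrams("where", min_n=3, max_n=3)
print(grams)

# A word unseen in training still shares n-grams with a seen one
shared = sorted(set(grams) & set(char_ngrams("whereas", min_n=3, max_n=3)))
print(shared)
```

Because the vector of a word is built from the vectors of its n-grams, the shared n-grams give an unseen word such as "whereas" a usable embedding.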

##### Custom embedding - Model training

In [None]:
from gensim.models import FastText

In [69]:
model_FTCustom = FastText(sentences=news_train_df['headline_tokens'], size=100, window=5, min_count=5, workers=4, sg=0)
model_FTCustom.save('./sampleData/FastTextCustom.model')
#model_FTCustom=FastText.load('./sampleData/FastTextCustom.model')

#### Pretrained-models

In [4]:
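from gensim.models.fasttext import load_facebook_model
# The Google News file is in word2vec format and cannot be read as a FastText model.
# Assumption: the English crawl vectors (cc.en.300.bin) from the page linked above are in fileDir.
# Older gensim releases used FastText.load_fasttext_format instead of load_facebook_model.
model_FTCrawl = load_facebook_model(fileDir+'cc.en.300.bin')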

#### Transfer learning

In [None]:
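#news_train_df=pd.read_csv('./sampleData/news_train.csv', encoding = "ISO-8859-1")
from gensim.models.fasttext import load_facebook_model
# gensim has no intersect_*_format for FastText, so start from a full pretrained model instead.
# Assumption: cc.en.300.bin (English crawl vectors) is in fileDir; the variable name is kept for the next cell.
model_FTgoogleNewsPlus=load_facebook_model(fileDir+'cc.en.300.bin')
# Add the headline vocabulary to the pretrained one
model_FTgoogleNewsPlus.build_vocab(news_train_df['headline_tokens'], update=True)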

In [33]:
model_FTgoogleNewsPlus.train(news_train_df['headline_tokens'],total_examples=model_FTgoogleNewsPlus.corpus_count,epochs=10)

(1442456173, 5314372690)

### Document embedding - 1. Doc2Vec

Original paper: https://arxiv.org/pdf/1405.4053.pdf

Blog summary: https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e

Doc2Vec is similar to Word2Vec: both use the same training techniques and encoder-decoder structure. The difference is that Doc2Vec adds a document vector, which is fed in alongside the context word vectors in the CBOW-style variant (PV-DM) and is used on its own to predict the words of the document in the skip-gram-style variant (PV-DBOW). The goal is to learn a vector that represents each document based on its content. 

Because the model learns a summary vector for each document, the distribution of documents in the training set also matters: carefully controlling the source of the training documents helps encode their meaning into the vectors. 
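
The two Doc2Vec training modes can be sketched by listing the prediction tasks each one sets up for a single headline. The helper names are made up and the sampling details of real implementations are left out; the point is only what goes in and what is predicted.

```python
def pv_dm_examples(doc_id, tokens, window=2):
    """PV-DM: predict each word from the document tag plus its surrounding context words."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield (doc_id, context), target

def pv_dbow_examples(doc_id, tokens):
    """PV-DBOW: predict the words of the document from the document tag alone."""
    for target in tokens:
        yield doc_id, target

headline = ["oil", "prices", "fall"]
dm = list(pv_dm_examples(0, headline, window=1))
dbow = list(pv_dbow_examples(0, headline))
print(dm)
print(dbow)
```

In gensim, the `dm` argument of `Doc2Vec` selects the mode: `dm=1` (the default) for PV-DM and `dm=0` for PV-DBOW.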

##### Custom embedding - Model training

In [64]:
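from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from ast import literal_eval
# Tokens are lists when produced in this notebook, but strings when reloaded from a CSV
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
documents = [TaggedDocument(content if isinstance(content, list) else literal_eval(content), [i]) for i,content in news_train_df['headline_tokens'].items()]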

In [82]:
model_d2vCustom = Doc2Vec(documents=documents,vector_size=100, window=5, min_count=1, workers=4)

In [None]:
model_d2vCustom.save('./sampleData/model_d2vCustom100.model')