# Technical Test - Clare.AI

This is a technical test with a few tasks to complete, the coding can be done either python 2 or 3.

For each task, it can be implemented and documented in Jupyter Notebook or a seperate .py file

For all the functions defined, it shall be within a class called class SentenceSimilarity():

## 1. Data Crawling

This part involves how to crawl data from webpages

Suggests tools to use: 
Beautiful soup, Scarpy (https://scrapy.org)

Crawl the questions and answers from the following page

https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp

The output format should be in the CSV with following columns

Category, Question, Answer, Language

## 2. Language Vector Space Model

This part is to build language specific model for simliarity comparison later. Word2Vec is a powerful deep learning models that google used to compare text similarity, however it requires big data and computing power to build one

For Chinese, it requires chinese tokenizer to break sentences into words

Jieba“结巴”中文分词 is the popular tool in python
https://github.com/fxsjy/jieba

To build language model, gensim is popular and highly scalable
https://radimrehurek.com/gensim/

### 2.1 Tokenize questions into words

Define a function to tokenize the questions from 1 into words using Jieba, it might require custom dictionary to make it correctly. Jieba has built-in dictionary but it's optimized for simplified chinese, so for words in cantonese, it would need to add it manually in the custom dictionary.

In [2]:
import jieba
import pandas as pd

jieba.initialize()
jieba.set_dictionary('dict.txt.big.txt')


def tokenize_question(file):
    questions = []
    data = pd.read_csv(file)

    for text in data.Question:
        words = jieba.cut(text)
        tokens = ','.join(words)
        questions.append(tokens.split(','))

    return questions

Building prefix dict from the default dictionary ...


Loading model from cache /tmp/jieba.cache


Loading model cost 0.529 seconds.


Prefix dict has been built succesfully.


### 2.2 Build a TFIDF model using questions and answers

Build a TFIDF model using questions and answers from part 1, together with the function in 2.1

Reference to build the model

https://radimrehurek.com/gensim/tutorial.html

https://radimrehurek.com/gensim/tut2.html

In [4]:
import logging
from gensim import corpora, models, similarities

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 

questions = tokenize_question('CNCBCrawlling/quotes_products.csv')
dictionary = corpora.Dictionary(questions)
dictionary.save('CNCB.dict')

corpus = [dictionary.doc2bow(text) for text in questions]
corpora.MmCorpus.serialize('CNCB.mm', corpus)


tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=400)

index = similarities.MatrixSimilarity(lsi[corpus])

2017-04-26 14:08:12,481 : INFO : adding document #0 to Dictionary(0 unique tokens: [])


2017-04-26 14:08:12,485 : INFO : built Dictionary(676 unique tokens: ['網上', '理財', ' ', '/', '流動']...) from 192 documents (total 3274 corpus positions)


2017-04-26 14:08:12,485 : INFO : saving Dictionary object under CNCB.dict, separately None


2017-04-26 14:08:12,486 : INFO : saved CNCB.dict


2017-04-26 14:08:12,490 : INFO : storing corpus in Matrix Market format to CNCB.mm


2017-04-26 14:08:12,491 : INFO : saving sparse matrix to CNCB.mm


2017-04-26 14:08:12,491 : INFO : PROGRESS: saving document #0


2017-04-26 14:08:12,495 : INFO : saved 192x676 matrix, density=1.904% (2471/129792)


2017-04-26 14:08:12,496 : INFO : saving MmCorpus index to CNCB.mm.index


2017-04-26 14:08:12,497 : INFO : collecting document frequencies


2017-04-26 14:08:12,497 : INFO : PROGRESS: processing document #0


2017-04-26 14:08:12,498 : INFO : calculating IDF weights for 192 documents and 675 features (2471 matrix non-zeros)


2017-04-26 14:08:12,499 : INFO : using serial LSI version on this node


2017-04-26 14:08:12,500 : INFO : updating model with new documents


2017-04-26 14:08:12,503 : INFO : preparing a new chunk of documents


2017-04-26 14:08:12,505 : INFO : using 100 extra samples and 2 power iterations


2017-04-26 14:08:12,506 : INFO : 1st phase: constructing (676, 500) action matrix


2017-04-26 14:08:12,511 : INFO : orthonormalizing (676, 500) action matrix


2017-04-26 14:08:12,546 : INFO : 2nd phase: running dense svd on (500, 192) matrix


2017-04-26 14:08:12,553 : INFO : computing the final decomposition


2017-04-26 14:08:12,553 : INFO : keeping 189 factors (discarding 0.000% of energy spectrum)


2017-04-26 14:08:12,555 : INFO : processed documents up to #192


2017-04-26 14:08:12,556 : INFO : topic #0(3.933): -0.843*" " + -0.162*"I" + -0.155*"?" + -0.122*"the" + -0.121*"" + -0.111*"What" + -0.103*"can" + -0.099*"order" + -0.091*"Securities" + -0.089*"is"


2017-04-26 14:08:12,557 : INFO : topic #1(2.640): -0.323*"我" + -0.273*"可以" + -0.189*"证券" + -0.186*"的" + -0.184*"證券" + -0.170*"渠道" + -0.169*"或" + -0.164*"买卖" + -0.164*"，" + -0.160*"買賣"


2017-04-26 14:08:12,557 : INFO : topic #2(1.958): 0.350*"证券" + 0.299*"买卖" + 0.295*"沪" + 0.290*"什么" + 0.266*"股通" + 0.220*"于" + -0.161*"如果" + -0.161*"想" + -0.161*"做" + -0.152*"或"


2017-04-26 14:08:12,558 : INFO : topic #3(1.929): -0.362*"證券" + -0.335*"買賣" + -0.284*"什麼" + -0.281*"於" + -0.243*"滬股通" + -0.204*"渠道" + 0.172*"或" + -0.163*"美國" + 0.161*"交易" + 0.144*"想"


2017-04-26 14:08:12,559 : INFO : topic #4(1.892): -0.338*"交易" + 0.202*"可以" + -0.183*"安排" + 0.174*"做" + 0.174*"想" + 0.171*"如果" + -0.169*"交收" + -0.154*"有" + 0.147*"我" + 0.146*"正在"




2017-04-26 14:08:12,569 : INFO : creating matrix with 192 documents and 189 features


## 3. Similarity Comparison

Define a function for question simliarity comparison

def similarity(self,sentence):

the input is sentence, where it will be tokenized first and then compare against the model defined in 2.2

With using TFIDF, each document will be represented as bag-of-words counts and applies a weighting. Reference - Last paragraph https://radimrehurek.com/gensim/tutorial.html

In [5]:
def similarity(sentence):
    vec = dictionary.doc2bow(sentence.lower().split())
    print(vec)
    vec_lsi = lsi[vec]
    sims = index[vec_lsi]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    print(sims)
    
similarity(sentence='購買力')
print(dictionary)

[(42, 1)]
[(3, 0.68763924), (101, 0.056910545), (15, 0.030518956), (5, 0.029729415), (19, 0.028071254), (122, 0.025732804), (133, 0.025271848), (103, 0.025137503), (52, 0.023241285), (4, 0.022675358), (24, 0.020066548), (170, 0.018275268), (50, 0.017309155), (178, 0.01718832), (100, 0.017077576), (177, 0.017041652), (20, 0.016628839), (22, 0.016509712), (12, 0.015742214), (49, 0.015342262), (136, 0.015261725), (134, 0.015122531), (25, 0.013574764), (32, 0.013573896), (46, 0.013445649), (176, 0.013203029), (118, 0.013190262), (48, 0.012876686), (62, 0.012839887), (11, 0.012825537), (18, 0.012536442), (102, 0.012059662), (28, 0.011912441), (188, 0.011683389), (119, 0.01156247), (2, 0.011163667), (39, 0.01047378), (132, 0.010230407), (155, 0.010187369), (43, 0.0099757705), (167, 0.0094788708), (180, 0.0094556268), (171, 0.0091288202), (151, 0.0089919381), (67, 0.0086237807), (10, 0.0082975775), (187, 0.0082548903), (9, 0.0062697716), (117, 0.0058897585), (0, 0.0055863932), (173, 0.0054644

## 4. Named Entity Recognition

NER (命名实体), short for Named Entity Recognition is probably the first step towards information extraction from unstructured text.

It basically means extracting what is a real world entity from the text (Person, Organization, Event etc …).

There are few popular libraries which support in Chinese: Stanford NLP/HanL
https://nlp.stanford.edu/software/CRF-NER.shtml

https://github.com/hankcs/HanLP

https://github.com/hankcs/HanLP/wiki/%E8%AE%AD%E7%BB%83%E5%91%BD%E5%90%8D%E5%AE%9E%E4%BD%93%E8%AF%86%E5%88%AB%E6%A8%A1%E5%9E%8B

Define a function to extract and print Named Entity on the input sentence

def get_entities(self, sentence)