# Technical Test - Clare.AI

This is a technical test with a few tasks to complete. The code can be written in either Python 2 or Python 3.

Each task can be implemented and documented in a Jupyter Notebook or in a separate .py file.

All functions defined for the tasks should be methods of a single class, declared as class SentenceSimilarity():

## 1. Data Crawling

This part covers crawling data from web pages.

Suggested tools:
Beautiful Soup, Scrapy (https://scrapy.org)

Crawl the questions and answers from the following page:

https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp

The output should be a CSV file with the following columns:

Category, Question, Answer, Language
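As a minimal sketch of that output format, the rows could be written with Python's standard `csv` module. The `write_faq_csv` helper and the sample row are hypothetical and only illustrate the layout; the four column names are the only part taken from the task. The sketch assumes Python 3.

```python
import csv
import io

# Column names as required by the task description.
FIELDNAMES = ["Category", "Question", "Answer", "Language"]


def write_faq_csv(rows, stream):
    """Write an iterable of dicts keyed by FIELDNAMES to a text stream as CSV."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample row, used only to show the shape of the data.
sample_rows = [
    {
        "Category": "Account opening",
        "Question": "如何開立證券戶口?",
        "Answer": "(answer text crawled from the page)",
        "Language": "tc",
    }
]

# In-memory demonstration. For a real file, pass
# open("faq.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_faq_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing through `csv.DictWriter` rather than joining strings by hand means answers that contain commas, quotes, or line breaks are quoted correctly.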

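import scrapy


class CNCBQuotesSpider(scrapy.Spider):
    # Spider name used by Scrapy to identify this spider.
    name = "quotes"
    start_urls = [
        'https://www.cncbinternational.com/personal/investments/securities-trading-service-guide-and-faq/tc/index.jsp',
    ]

    def parse(self, response):
        # NOTE: the selectors and field names below follow the Scrapy tutorial's
        # quotes example and are placeholders. They still need to be replaced with
        # selectors for the FAQ page and the Category/Question/Answer/Language fields.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the pagination link if the page has one.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

# Scrapy calls parse() itself for every downloaded response, so it is not called here.
# Run the spider with: scrapy runspider <this_file>.py -o faq.csv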

## 2. Language Vector Space Model

This part builds a language-specific model for the similarity comparison in part 3. Word2Vec is a powerful neural embedding model, developed at Google, that can be used to compare text similarity; however, training one requires a large corpus and considerable computing power.

For Chinese, a tokenizer is required to break sentences into words.

Jieba (“结巴”中文分词, "Jieba" Chinese word segmentation) is a popular tool for this in Python:
https://github.com/fxsjy/jieba

For building the language model, gensim is popular and highly scalable:
https://radimrehurek.com/gensim/

### 2.1 Tokenize questions into words

Define a function that tokenizes the questions from part 1 into words using Jieba. A custom dictionary may be needed for correct segmentation: Jieba's built-in dictionary is optimized for Simplified Chinese, so Cantonese words need to be added manually to the custom dictionary.

### 2.2 Build a TF-IDF model using questions and answers

Build a TF-IDF model from the questions and answers crawled in part 1, using the tokenizer function from 2.1.

References for building the model:

https://radimrehurek.com/gensim/tutorial.html

https://radimrehurek.com/gensim/tut2.html

## 3. Similarity Comparison

Define a function for question similarity comparison:

def similarity(self, sentence):

The input is a sentence, which is tokenized first and then compared against the model built in 2.2.

With TF-IDF, each document is represented as bag-of-words counts to which a weighting is applied. Reference (last paragraph): https://radimrehurek.com/gensim/tutorial.html
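To make the weighting concrete, here is a small pure-Python sketch of one common TF-IDF variant: the raw count of a term multiplied by log2(N / document frequency), with each document vector then scaled to unit length. It is meant as an illustration of the idea rather than a description of any particular library's internals, and the toy documents are made up. It assumes Python 3.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)  # bag-of-words counts
        weights = {
            term: count * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Scale to unit length so documents of different sizes are comparable.
        if norm:
            weights = {term: w / norm for term, w in weights.items()}
        vectors.append(weights)
    return vectors


docs = [
    ["fee", "account", "fee"],
    ["account", "password"],
    ["fee", "trading"],
]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print(vector)
```

A term that occurs often in one document but in few documents overall ends up with a high weight, while a term present in every document gets a weight of zero.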

## 4. Named Entity Recognition

NER (命名实体识别), short for Named Entity Recognition, is often the first step in extracting information from unstructured text.

It means identifying mentions of real-world entities in the text (person, organization, event, etc.).

A few popular libraries support Chinese: Stanford NLP and HanLP.
https://nlp.stanford.edu/software/CRF-NER.shtml

https://github.com/hankcs/HanLP

https://github.com/hankcs/HanLP/wiki/%E8%AE%AD%E7%BB%83%E5%91%BD%E5%90%8D%E5%AE%9E%E4%BD%93%E8%AF%86%E5%88%AB%E6%A8%A1%E5%9E%8B

Define a function that extracts and prints the named entities in an input sentence:

def get_entities(self, sentence):
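The sketch below shows one way the method could be structured without tying it to a specific library. It assumes a tagger that returns (word, tag) pairs using ICTCLAS-style part-of-speech tags, where `nr`, `ns`, and `nt` mark person, place, and organization names; HanLP's segmenter is one tool that emits tags of this kind. The `tagger` argument, the `fake_tagger` function, and the hand-tagged sentence are hypothetical stand-ins for a real NER backend.

```python
# Tag prefixes treated as named entities (an assumed ICTCLAS-style tag set).
ENTITY_TAGS = {"nr": "Person", "ns": "Location", "nt": "Organization"}


def extract_entities(tagged_tokens):
    """Return (word, entity_type) pairs for tokens tagged as named entities."""
    entities = []
    for word, tag in tagged_tokens:
        for prefix, label in ENTITY_TAGS.items():
            if tag.startswith(prefix):
                entities.append((word, label))
                break
    return entities


class SentenceSimilarity(object):
    def __init__(self, tagger):
        # `tagger` is a hypothetical callable: sentence -> [(word, tag), ...]
        self.tagger = tagger

    def get_entities(self, sentence):
        """Print and return the named entities found in the sentence."""
        entities = extract_entities(self.tagger(sentence))
        for word, label in entities:
            print(word, label)
        return entities


def fake_tagger(sentence):
    # Hand-tagged output standing in for a real tagger's result.
    return [
        ("陳大文", "nr"), ("在", "p"), ("香港", "ns"),
        ("的", "u"), ("中信銀行", "nt"), ("開戶", "v"),
    ]


entities = SentenceSimilarity(fake_tagger).get_entities("陳大文在香港的中信銀行開戶")
```

Keeping the tag filtering separate from the tagger makes it possible to swap in Stanford NER or HanLP later by changing only the callable passed to the constructor.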