# Get Word Embedding

We use [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors) for the preliminary round and [Tencent AI Lab Embedding](https://ai.tencent.com/ailab/nlp/embedding.html) for the semi-final.

This notebook demonstrates the latter, covering both word-level and character-level embeddings.

## 1 Download embeddings

In [0]:
from google.colab import drive
drive.mount('/gdrive')

In [0]:
!wget https://ai.tencent.com/ailab/nlp/data/Tencent_AILab_ChineseEmbedding.tar.gz

In [0]:
!tar -xzvf Tencent_AILab_ChineseEmbedding.tar.gz

In [0]:
!ls -lh

In [0]:
!head Tencent_AILab_ChineseEmbedding.txt

## 2 Extract the needed embeddings

The full embedding file (about 8.8 million entries) is too large to load into memory at once.
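
As a rough back-of-the-envelope check of this claim, the sketch below estimates the size of the full matrix as a dense array. The row count 8,824,330 is the `total` passed to `tqdm` later in this notebook; the dimension of 200 is an assumption taken from the published description of the Tencent embeddings and is not stated in this notebook.

```python
# Rough memory estimate for loading every vector into one dense array.
n_words = 8824330   # row count, same value as the tqdm total used in this notebook
dim = 200           # ASSUMPTION: vector dimension, not stated in this notebook


def dense_size_gib(rows, cols, bytes_per_value):
    """Size in GiB of a rows x cols array with the given element width."""
    return rows * cols * bytes_per_value / 2**30


size_float64 = dense_size_gib(n_words, dim, 8)  # dtype=float means 8 bytes per value
size_float32 = dense_size_gib(n_words, dim, 4)  # half of that with 4-byte floats
print(f"float64: {size_float64:.1f} GiB, float32: {size_float32:.1f} GiB")
```

Under these assumptions the float64 array alone comes to roughly 13 GiB, before counting the Python strings for the words themselves.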

To save disk space and memory, we extract only the words and characters that appear in the dataset.

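This shortcut is suitable for competitions only: the vocabulary is built from both the training and the test data, so any word outside these files has no vector.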
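
The extraction idea can be sketched on a toy example before running it on the real file. The snippet below assumes the word2vec-style text layout (a header line, then one `word v1 v2 ...` line per entry); the sample data, the `filter_embeddings` helper, and all names in it are hypothetical and only for illustration.

```python
import io


def filter_embeddings(lines, vocab):
    """Keep only the vectors whose word is in vocab.

    `lines` is any iterable of text lines in word2vec text format:
    a header line first, then "word v1 v2 ..." on each following line.
    Returns (word_index, vectors), where word_index[w] is the row of w.
    """
    it = iter(lines)
    next(it)  # skip the header line
    word_index, vectors = {}, []
    for line in it:
        parts = line.rstrip('\n').split(' ')
        word = parts[0]
        if word in vocab:
            word_index[word] = len(vectors)
            vectors.append([float(x) for x in parts[1:]])
    return word_index, vectors


# Hypothetical toy file: 3 entries, 2 dimensions each.
toy_file = io.StringIO("3 2\n我 0.1 0.2\n你 0.3 0.4\n他 0.5 0.6\n")
toy_vocab = {"我", "他", "她"}  # "她" has no vector in the toy file

toy_index, toy_vectors = filter_embeddings(toy_file, toy_vocab)
```

Words in the vocabulary that are missing from the file, such as "她" here, are simply absent from `toy_index`, so downstream code has to handle them.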

In [0]:
!pip install jieba tqdm > /dev/null

In [0]:
import numpy as np
import pandas as pd
import pickle
import jieba
from tqdm import tqdm

jieba.setLogLevel(20)

In [0]:
path = '/gdrive/My Drive/'
contents = pd.read_csv(path + 'train.csv')['content'].tolist() \
           + pd.read_csv(path + 'test_public.csv')['content'].tolist()

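wordset = set()
for content in contents:
    # word-level tokens from jieba's search-mode segmentation
    wordset.update(jieba.lcut_for_search(content))
    # character-level tokens: every single character of the text
    wordset.update(content)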

In [0]:
len(wordset)

In [0]:
word_index, embedding_matrix = {}, []
with open('Tencent_AILab_ChineseEmbedding.txt', encoding='utf-8') as f:
    next(f)  # skip the header line
    i = 0
    for line in tqdm(f, total=8824330):
        e = line.rstrip('\n').split(' ')
        w = e[0]
        if w in wordset:
            # row i of embedding_matrix holds the vector for word w
            word_index[w] = i
            i += 1
            embedding_matrix.append(np.array(e[1:], dtype=float))
embedding_matrix = np.array(embedding_matrix)
embeddings = [word_index, embedding_matrix]

In [0]:
len(word_index)

In [0]:
embedding_matrix.shape

In [0]:
# use a context manager so the file is flushed and closed before it is copied
with open('embeddings_.p', 'wb') as f:
    pickle.dump(embeddings, f)

In [0]:
!ls -lh

In [0]:
!cp embeddings_.p /gdrive/My\ Drive/
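
A minimal sketch of how the saved pair might be consumed later, assuming the `[word_index, embedding_matrix]` layout pickled in this notebook. The `lookup` helper, the zero-vector fallback for out-of-vocabulary tokens, and the toy data are assumptions for illustration, not part of the original pipeline.

```python
import os
import pickle
import tempfile

import numpy as np


def lookup(tokens, word_index, embedding_matrix):
    """Return one vector per token; unknown tokens map to a zero vector."""
    dim = embedding_matrix.shape[1]
    out = np.zeros((len(tokens), dim), dtype=embedding_matrix.dtype)
    for row, tok in enumerate(tokens):
        idx = word_index.get(tok)
        if idx is not None:
            out[row] = embedding_matrix[idx]
    return out


# Hypothetical toy embeddings in the same [word_index, embedding_matrix] layout.
toy = [{"我": 0, "他": 1}, np.array([[0.1, 0.2], [0.5, 0.6]])]

# Round-trip through pickle, mirroring how embeddings_.p is written and read.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_embeddings.p")
    with open(toy_path, "wb") as f:
        pickle.dump(toy, f)
    with open(toy_path, "rb") as f:
        loaded_index, loaded_matrix = pickle.load(f)

# "她" is not in the toy vocabulary, so its row stays all zeros.
vectors = lookup(["我", "她", "他"], loaded_index, loaded_matrix)
```

A zero vector is only one possible fallback; a random or averaged vector would work with the same helper shape.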