# Get Word Embedding

We use [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors) for the preliminary round and [Tencent AI Lab Embedding](https://ai.tencent.com/ailab/nlp/embedding.html) for the final round.

This notebook demonstrates the latter, covering both word-level and character-level embeddings.

## 1 Download & Decompress

In [0]:
from google.colab import drive
drive.mount('/gdrive')

In [0]:
!wget https://ai.tencent.com/ailab/nlp/data/Tencent_AILab_ChineseEmbedding.tar.gz

--2018-11-04 03:04:05--  https://ai.tencent.com/ailab/nlp/data/Tencent_AILab_ChineseEmbedding.tar.gz
Resolving ai.tencent.com (ai.tencent.com)... 140.207.123.162
Connecting to ai.tencent.com (ai.tencent.com)|140.207.123.162|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6778940358 (6.3G) [application/x-gzip]
Saving to: ‘Tencent_AILab_ChineseEmbedding.tar.gz’


2018-11-04 03:18:56 (7.27 MB/s) - ‘Tencent_AILab_ChineseEmbedding.tar.gz’ saved [6778940358/6778940358]



In [0]:
!tar -xzvf Tencent_AILab_ChineseEmbedding.tar.gz

Tencent_AILab_ChineseEmbedding.txt
README.txt


In [0]:
!ls -lh

total 22G
-rw-r--r-- 1 root root 1.8K Oct 19 10:41 README.txt
drwxr-xr-x 2 root root 4.0K Nov  1 16:42 sample_data
-rw-r--r-- 1 root root 6.4G Oct 19 11:29 Tencent_AILab_ChineseEmbedding.tar.gz
-rw-r--r-- 1 root root  16G Oct 19 10:50 Tencent_AILab_ChineseEmbedding.txt


In [0]:
!head Tencent_AILab_ChineseEmbedding.txt
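
The file appears to follow the word2vec text layout: a header line with the vocabulary size and the dimension, then one token per line followed by its vector components, separated by single spaces. This matches how the file is parsed later in this notebook. The sketch below parses a tiny in-memory stand-in for the file; the helper names `read_header` and `parse_line` and the 3-dimensional sample are illustrative assumptions, not part of the original data.

```python
import io

def read_header(f):
    """Read the 'vocab_size dim' header line of a word2vec-style text file."""
    vocab_size, dim = f.readline().split()
    return int(vocab_size), int(dim)

def parse_line(line, dim):
    """Split one line into (token, vector) and check the vector length."""
    parts = line.rstrip('\n').split(' ')
    token, values = parts[0], parts[1:]
    if len(values) != dim:
        raise ValueError(f'expected {dim} values, got {len(values)}')
    return token, [float(x) for x in values]

# Toy stand-in for the real file (3 dimensions instead of 200).
sample = io.StringIO("2 3\n的 0.1 0.2 0.3\n中国 0.4 0.5 0.6\n")
vocab_size, dim = read_header(sample)
rows = [parse_line(line, dim) for line in sample]
```

Reading the header first gives the expected dimension, so a malformed line can be detected instead of silently producing a ragged matrix.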

## 2 Extract Needed Embeddings

The full embedding file is about 16 GB uncompressed, which is too large to load into memory in one piece in this Colab environment.
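
As a rough sanity check on that size, the arithmetic below estimates the memory a single dense array would need. It assumes the 8,824,330 vectors of 200 dimensions that appear later in this notebook (the `tqdm` total and the matrix width); the function name `dense_size_gib` is only illustrative.

```python
# Rough memory estimate for holding every vector in one dense array.
# The counts come from this notebook; 8 and 4 are the byte widths of
# float64 and float32 values.
N_VECTORS = 8824330
DIM = 200

def dense_size_gib(n_rows, n_cols, bytes_per_value):
    """Size in GiB of an n_rows x n_cols array of fixed-width values."""
    return n_rows * n_cols * bytes_per_value / 2**30

size_f64 = dense_size_gib(N_VECTORS, DIM, 8)
size_f32 = dense_size_gib(N_VECTORS, DIM, 4)
print(f"float64: {size_f64:.1f} GiB, float32: {size_f32:.1f} GiB")
```

This comes to roughly 13 GiB in float64 and about half that in float32, before counting the token strings and any Python object overhead.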

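To save disk space and memory, we extract only the words and characters that appear in the dataset.

This shortcut is suitable for competitions only: the vocabulary is built from both the training set and the public test set, so anything outside those files has no embedding.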

In [0]:
!pip install jieba tqdm > /dev/null

In [0]:
import numpy as np
import pandas as pd
import pickle
import jieba
from tqdm import tqdm

jieba.setLogLevel(20)  # 20 == logging.INFO; hides jieba's debug messages

In [0]:
path = '/gdrive/My Drive/'
contents = pd.read_csv(path + 'train.csv')['content'].tolist() \
           + pd.read_csv(path + 'test_public.csv')['content'].tolist()

# Collect every search-mode jieba token (word level) and every single character (char level).
wordset = set()
for content in contents:
    wordset.update(jieba.lcut_for_search(content))
    wordset.update(content)

In [0]:
len(wordset)

25095

In [0]:
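word_index, embedding_matrix = {}, []
# Set the encoding explicitly instead of relying on the system default.
with open('Tencent_AILab_ChineseEmbedding.txt', encoding='utf-8') as f:
    next(f)  # skip the header line
    for line in tqdm(f, total=8824330):
        e = line.rstrip('\n').split(' ')
        w = e[0]
        if w in wordset:
            # The row index is the number of vectors collected so far.
            word_index[w] = len(embedding_matrix)
            embedding_matrix.append(np.array(e[1:], dtype=float))
embedding_matrix = np.array(embedding_matrix)
embeddings = [word_index, embedding_matrix]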

100%|██████████| 8824330/8824330 [12:19<00:00, 11933.98it/s]


In [0]:
len(word_index)

22815
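
Of the 25095 tokens in `wordset`, 22815 were found in the embedding file, which is about 91% coverage. A minimal sketch of how one might compute the coverage and list the missing tokens is shown below; the `coverage` helper and the toy data are assumptions for illustration, and in the notebook one would pass the real `wordset` and `word_index`.

```python
def coverage(wordset, word_index):
    """Return (fraction of wordset found in word_index, sorted missing tokens)."""
    known = set(word_index)
    found = wordset & known
    missing = sorted(wordset - known)
    return len(found) / len(wordset), missing

# Toy stand-ins for the notebook's wordset and word_index.
toy_wordset = {'中国', '的', '😀', 'abc'}
toy_word_index = {'的': 0, '中国': 1}

ratio, missing = coverage(toy_wordset, toy_word_index)
print(f'coverage: {ratio:.1%}, missing: {missing}')
```

Inspecting the missing tokens can show whether the gaps are mostly rare symbols, emoji, or Latin-script strings, which helps decide how to handle out-of-vocabulary items downstream.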

In [0]:
embedding_matrix.shape

(22815, 200)

In [0]:
# Use a context manager so the file is flushed and closed before it is copied.
with open('embeddings_.p', 'wb') as f:
    pickle.dump(embeddings, f)

In [0]:
!ls -lh

total 22G
-rw-r--r-- 1 root root  36M Nov  4 03:53 embeddings_.p
-rw-r--r-- 1 root root 1.8K Oct 19 10:41 README.txt
drwxr-xr-x 2 root root 4.0K Nov  1 16:42 sample_data
-rw-r--r-- 1 root root 6.4G Oct 19 11:29 Tencent_AILab_ChineseEmbedding.tar.gz
-rw-r--r-- 1 root root  16G Oct 19 10:50 Tencent_AILab_ChineseEmbedding.txt


In [0]:
!cp embeddings_.p /gdrive/My\ Drive/
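
For completeness, here is a minimal sketch of how the saved file might be read back and queried. It writes a toy pickle with the same `[word_index, embedding_matrix]` layout to a temporary directory so that it runs on its own; the toy matrix is a plain list standing in for the NumPy array, and the `lookup` helper and its `default` parameter are illustrative assumptions.

```python
import os
import pickle
import tempfile

# Toy stand-in for embeddings_.p with the same [word_index, matrix] layout.
toy = [{'的': 0, '中国': 1}, [[0.1, 0.2], [0.3, 0.4]]]
path = os.path.join(tempfile.mkdtemp(), 'embeddings_toy.p')
with open(path, 'wb') as f:
    pickle.dump(toy, f)

# Load it back the same way the real file would be loaded.
with open(path, 'rb') as f:
    word_index, embedding_matrix = pickle.load(f)

def lookup(token, default=None):
    """Return the vector for `token`, or `default` if it was not extracted."""
    i = word_index.get(token)
    return default if i is None else embedding_matrix[i]

vec = lookup('中国')
oov = lookup('未登录词', default=[0.0, 0.0])
```

Tokens that were not found in the embedding file have no row in the matrix, so the caller has to choose a fallback such as a zero vector.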