<br>
<br>

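What numerically encoding text should achieve:
* In essence: compression + vectorization (yielding a `vocabulary/dictionary` and `word vectors`, respectively).
* In addition, all texts, whatever their length, should have the same dimensionality after processing. In other words, every resulting vector should have the same number of dimensions.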
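
As a rough illustration of these two goals, the following pure-Python sketch builds a vocabulary from a small corpus and then maps every sentence to a count vector of the same length. The names `tokenize`, `build_vocabulary` and `to_count_vector` are hypothetical helpers, and the tokenization rule (lowercase, words of two or more characters) is an assumption chosen to mimic the default behaviour of scikit-learn's `CountVectorizer`.

```python
import re

def tokenize(text):
    # Lowercase and keep only tokens of 2+ word characters (assumed rule).
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(corpus):
    # "Compression": every distinct token gets exactly one integer index.
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_count_vector(text, vocabulary):
    # "Vectorization": one slot per vocabulary entry, whatever the text length.
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in vocabulary:
            vector[vocabulary[tok]] += 1
    return vector

corpus = ["This is a sentence is",
          "This is another sentence",
          "third document is here"]

vocabulary = build_vocabulary(corpus)
vectors = [to_count_vector(doc, vocabulary) for doc in corpus]
print(vocabulary)
for vec in vectors:
    print(vec)   # every vector has len(vocabulary) entries
```

Sentences of four and five words both end up as vectors of length 7, because the length is fixed by the vocabulary rather than by the sentence.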

<br>

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

<br>

## Bag-of-Words Model: CountVectorizer

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

corpus = ["This is a sentence is", 
          "This is another sentence",
          "third document is here"]

X = cv.fit(corpus)   # Not model training: fit tokenizes the corpus and builds the vocabulary. It returns cv itself.
print(X.vocabulary_)
print(cv.get_feature_names_out())   # get_feature_names() was removed in scikit-learn 1.2

{'this': 6, 'is': 3, 'sentence': 4, 'another': 0, 'third': 5, 'document': 1, 'here': 2}
['another' 'document' 'here' 'is' 'sentence' 'third' 'this']


In [3]:
print(X)
print(X.vocabulary_)
print(X.vocabulary)
print(cv.get_feature_names_out())

CountVectorizer()
{'this': 6, 'is': 3, 'sentence': 4, 'another': 0, 'third': 5, 'document': 1, 'here': 2}
None
['another' 'document' 'here' 'is' 'sentence' 'third' 'this']


In [4]:
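X = cv.transform(corpus)   # Produces the count vectors, stored compactly as a sparse matrix
# X = cv.fit_transform(corpus)   # For how this differs from cv.transform(), see the Markdown cell below.

print(X.shape)
print(X)
print(X.toarray())  # Convert the sparse matrix to a dense array. Every sentence now has the same dimension.

# Each line printed by print(X) has the form "(row, col)  value":
# row is the index of the sentence in corpus;
# col is the index in cv.vocabulary_ (i.e. word_index), which identifies one word in the vocabulary;
# value is the count of that word in that sentence.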

(3, 7)
  (0, 3)	2
  (0, 4)	1
  (0, 6)	1
  (1, 0)	1
  (1, 3)	1
  (1, 4)	1
  (1, 6)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	1
  (2, 5)	1
[[0 0 0 2 1 0 1]
 [1 0 0 1 1 0 1]
 [0 1 1 1 0 1 0]]
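
The `(row, column)  value` lines printed above list only the non-zero counts. A minimal pure-Python sketch of that correspondence follows, with the dense rows copied from the output above. `to_triples` and `to_dense` are hypothetical helper names, and this is not how SciPy stores the matrix internally.

```python
dense = [[0, 0, 0, 2, 1, 0, 1],
         [1, 0, 0, 1, 1, 0, 1],
         [0, 1, 1, 1, 0, 1, 0]]

def to_triples(rows):
    # Keep only non-zero cells as (sentence_index, word_index, count).
    return [(i, j, v) for i, row in enumerate(rows)
            for j, v in enumerate(row) if v != 0]

def to_dense(triples, shape):
    # Rebuild the full matrix; cells that are not listed stay 0.
    n_rows, n_cols = shape
    rows = [[0] * n_cols for _ in range(n_rows)]
    for i, j, v in triples:
        rows[i][j] = v
    return rows

triples = to_triples(dense)
print(len(triples))   # number of stored (non-zero) elements
print(triples[0])     # first stored element
restored = to_dense(triples, (3, 7))
```

Only the non-zero cells need to be stored, which is why the sparse form pays off once the vocabulary grows to thousands of words.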


```python
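# fit tokenizes the corpus and builds the vocabulary; the fitted vocabulary is state stored on the vectorizer.
# transform uses the vocabulary obtained by fit to convert raw text into count vectors.
# fit only needs to be done once; calling it again rebuilds the vocabulary from scratch.
#
# fit_transform = fit + transform, i.e. both steps in one call. It returns only the matrix,
# so the vocabulary must be read from the vectorizer (cv.vocabulary_), not from the return value.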

```
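
To make the fit/transform split concrete, here is a minimal pure-Python sketch of a stateful vectorizer. `TinyCountVectorizer` is a hypothetical class written for illustration, not scikit-learn's implementation, and it assumes lowercase tokens of two or more characters. It shows that the vocabulary lives on the vectorizer object and that words unseen during `fit` are ignored by `transform`.

```python
import re

class TinyCountVectorizer:
    """Illustrative stand-in for a count vectorizer (hypothetical)."""

    def fit(self, corpus):
        # Build the vocabulary once and store it as state on the object.
        tokens = sorted({t for doc in corpus for t in self._tokenize(doc)})
        self.vocabulary_ = {t: i for i, t in enumerate(tokens)}
        return self  # returning self mirrors the "X = cv.fit(corpus)" pattern

    def transform(self, corpus):
        # Reuse the stored vocabulary; tokens that are not in it are dropped.
        rows = []
        for doc in corpus:
            row = [0] * len(self.vocabulary_)
            for t in self._tokenize(doc):
                if t in self.vocabulary_:
                    row[self.vocabulary_[t]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, corpus):
        # Same as fit followed by transform on the same corpus.
        return self.fit(corpus).transform(corpus)

    @staticmethod
    def _tokenize(text):
        return re.findall(r"\b\w\w+\b", text.lower())

train = ["This is a sentence is", "This is another sentence"]
vec = TinyCountVectorizer().fit(train)
new_rows = vec.transform(["this is a brand new sentence"])
print(vec.vocabulary_)
print(new_rows)
```

The words `brand` and `new` were not in the training corpus, so they contribute nothing to the output row, and its length stays equal to the size of the fitted vocabulary.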

In [5]:
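# X.vocabulary_    
# raises: AttributeError: vocabulary_ not found
# 
# Reason: the value returned by transform is a sparse matrix, which has no .vocabulary_ attribute.
# The vocabulary is still available as cv.vocabulary_.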

In [6]:
df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
df

|   | another | document | here | is | sentence | third | this |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |


<br>

### The ngram_range and max_features parameters

<br>

<a href="https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#review-bag-of-words" style="text-decoration:none;font-size:120%">Review: Bag-of-words</a>: Bag-of-words models are surprisingly effective, but have several weaknesses.
* First, they lose all information about word order: “John likes Mary” and “Mary likes John” correspond to identical vectors. <br>There is a solution: <font color=maroon>bag of **n-grams** models</font> consider word phrases of length n to represent documents as fixed-length vectors to capture local word order but <font color=maroon>suffer from data sparsity and high dimensionality.</font>


* Second, the model does not attempt to learn the meaning of the underlying words, and as a consequence, the distance between vectors doesn’t always reflect the difference in meaning. <br>The <font color=maroon>**Word2Vec** model</font> addresses this second problem.
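
The word-order point can be checked with a few lines of pure Python. In this sketch `ngrams` and `bag` are hypothetical helpers, and tokenization is a plain lowercase whitespace split (an assumption). It compares the unigram and bigram bags of the two example sentences.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous runs of n tokens, each joined with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag(text, n):
    # Multiset (bag) of n-grams for one sentence.
    return Counter(ngrams(text.lower().split(), n))

a, b = "John likes Mary", "Mary likes John"

same_unigrams = bag(a, 1) == bag(b, 1)
same_bigrams = bag(a, 2) == bag(b, 2)
print(same_unigrams)   # unigram bags ignore word order
print(same_bigrams)    # bigram bags keep local word order
print(sorted(bag(a, 2)), sorted(bag(b, 2)))
```

The price is a larger vocabulary: every additional value of n contributes its own set of features, which is the sparsity and dimensionality problem mentioned above.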

<br>

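The ngram_range parameter is usually used together with max_features. In the second example below, max_features=10 keeps at most 10 features, namely those with the highest total counts across the corpus.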

<br>

In [7]:
cv = CountVectorizer(ngram_range=(2,3))

X = cv.fit_transform(corpus)
print(X.shape)

df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
df

(3, 14)


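|   | another sentence | document is | document is here | is another | is another sentence | is here | is sentence | is sentence is | sentence is | third document | third document is | this is | this is another | this is sentence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |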


<br>

In [8]:
cv = CountVectorizer(ngram_range=(2,3), max_features=10)

X = cv.fit_transform(corpus)
print(X.shape)

df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
df

(3, 10)


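|   | another sentence | document is | document is here | is another | is another sentence | is here | is sentence | is sentence is | sentence is | this is |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |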


<br>
<br>
<br>

## TF-IDF

* **TF: Term Frequency**

$$ \mathsf{tf = \frac{No. \ of \ times \ word \ appears}{No. \ of \ total \ terms \ in \ document}}$$

<br>

* **IDF: Inverse Document Frequency** `(IDF borrows the idea of information entropy: the more documents a word appears in, the less important it is.)`
<br>
<br>
$$\mathsf{idf = - \ log \ (ratio \ of \ documents \ that \ include \ the \ word)}$$

Combined, the formula is:
<div align=center>
    <font size=5 face="normal">$\mathsf{tfidf_{i,j} = tf_{i,j} \times log(\frac{N}{df_i}) }$</font>
</div>

* $\mathsf{tf_{i,j} = }$ total number of occurrences of (word) `i` in (document) `j`.
* $\mathsf{df_i = }$ total number of documents (speeches) containing (word) `i`.
* $\mathsf{N = }$ total number of documents (speeches).
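
Under the textbook formula above, a word that appears in every document gets a weight of exactly 0, because $\mathsf{log(N/df_i) = log(1) = 0}$. By default, scikit-learn's `TfidfVectorizer` uses a slightly different variant: raw counts for tf, a smoothed idf of `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each document vector. The following pure-Python sketch applies both variants to the sentence "This is a sentence is" from the three-sentence corpus used in this notebook. The counts and document frequencies are worked out by hand, and the function names `sklearn_style_tfidf` and `textbook_tfidf` are hypothetical.

```python
import math

N = 3  # number of documents in the corpus

# Raw term counts for the first sentence, "This is a sentence is"
# (the one-letter token "a" is dropped by the default tokenizer).
counts = {"is": 2, "sentence": 1, "this": 1}

# Number of documents in the corpus that contain each term.
doc_freq = {"is": 3, "sentence": 2, "this": 2}

def sklearn_style_tfidf(counts, doc_freq, n_docs):
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1, so idf is never 0.
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, c in counts.items()}
    # L2-normalize so that the document vector has unit length.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

def textbook_tfidf(counts, doc_freq, n_docs):
    # tf * log(N / df), the combined formula given above.
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}

weights = sklearn_style_tfidf(counts, doc_freq, N)
plain = textbook_tfidf(counts, doc_freq, N)
print({t: round(w, 4) for t, w in weights.items()})
print({t: round(w, 4) for t, w in plain.items()})
```

The smoothed, normalized variant gives `is` a weight of about 0.7394 and `sentence` about 0.4761, whereas the textbook formula gives `is` a weight of 0 because it occurs in all three sentences.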

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()

corpus = ["This is a sentence is", 
          "This is another sentence",
          "third document is here"]

X = tfidf_vec.fit(corpus)
print(X.vocabulary_)
print(tfidf_vec.get_feature_names_out())   # get_feature_names() was removed in scikit-learn 1.2

{'this': 6, 'is': 3, 'sentence': 4, 'another': 0, 'third': 5, 'document': 1, 'here': 2}
['another' 'document' 'here' 'is' 'sentence' 'third' 'this']


In [10]:
print(X)
X

TfidfVectorizer()


<br>

In [11]:
X = tfidf_vec.transform(corpus)

<br>

In [12]:
print(X)
X

  (0, 6)	0.4760629392767929
  (0, 4)	0.4760629392767929
  (0, 3)	0.7394106813498714
  (1, 6)	0.4804583972923858
  (1, 4)	0.4804583972923858
  (1, 3)	0.3731188059313277
  (1, 0)	0.6317450542765208
  (2, 5)	0.546454011634009
  (2, 3)	0.3227445421804912
  (2, 2)	0.546454011634009
  (2, 1)	0.546454011634009


<3x7 sparse matrix of type '<class 'numpy.float64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [13]:
X.toarray()

array([[0.        , 0.        , 0.        , 0.73941068, 0.47606294,
        0.        , 0.47606294],
       [0.63174505, 0.        , 0.        , 0.37311881, 0.4804584 ,
        0.        , 0.4804584 ],
       [0.        , 0.54645401, 0.54645401, 0.32274454, 0.        ,
        0.54645401, 0.        ]])

<br>

In [14]:
df = pd.DataFrame(X.toarray(), columns=tfidf_vec.get_feature_names_out())
df

|   | another | document | here | is | sentence | third | this |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.739411 | 0.476063 | 0.0 | 0.476063 |
| 1 | 0.631745 | 0.0 | 0.0 | 0.373119 | 0.480458 | 0.0 | 0.480458 |
| 2 | 0.0 | 0.546454 | 0.546454 | 0.322745 | 0.0 | 0.546454 | 0.0 |


<br>

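The corpus matrix built from the vectors produced by the approaches above is sparse and high-dimensional. One solution is word embeddings. `See gemsi.ipynb for details.`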

<br>
<br>
<br>

## Edit Distance

* Substitution
* Insertion
* Deletion
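
With these three operations, the edit (Levenshtein) distance between two strings is the minimum number of single-character substitutions, insertions and deletions needed to turn one string into the other. Here is a minimal dynamic-programming sketch, assuming a cost of 1 for every operation. The function name `edit_distance` is hypothetical.

```python
def edit_distance(s1, s2):
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
print(edit_distance("hello", "hello"))
```

Keeping only the previous row makes the memory use proportional to the length of `s2` instead of the product of both lengths.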

In [15]:
# This example is not an edit distance: it computes the Jaccard distance between the two word sets.
def title_share_distance(s1, s2):
    s1_words = set(s1.split(" "))
    s2_words = set(s2.split(" "))
    
    return 1 - len(s1_words.intersection(s2_words)) / len(s1_words.union(s2_words))

title_share_distance("hello to you", "hello")

0.6666666666666667

<br>
<br>
<br>