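## Famous Quotes Curator
“Although I cannot write famous quotes myself, I became a quotes curator in order to be of value to society. How should newly entered quotes be tagged? Let Python help.”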

### Splitting the data
Split the 100 crawled quotes (spider_items.jl) into 80 "stocked" quotes, stocked_in, and 20 "new arrival" quotes, new_arrival.

In [28]:
import pandas as pd
from collections import Counter

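# Each line of the .jl file is one JSON object (JSON Lines).
df = pd.read_json('spider_items.jl', lines=True)

stocked_in = df[:80]
# Copy the slice so that removing and re-adding 'tags' does not touch df.
new_arrival = df[80:].copy()
del new_arrival['tags']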
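
The split can be tried without the crawled file. The sketch below is illustrative only: the `toy` frame, its contents, and the `split_at` value are made up, and it uses `drop(columns=...)` as one way to obtain an untagged frame that is independent of the original.

```python
import pandas as pd

# Hypothetical stand-in for spider_items.jl: 5 rows instead of 100.
toy = pd.DataFrame({
    "text": [f"quote {i}" for i in range(5)],
    "author": ["A", "B", "C", "D", "E"],
    "tags": [["x"], ["y"], ["x", "y"], ["z"], ["x"]],
})

split_at = 4  # 80 in the notebook
toy_stocked = toy.iloc[:split_at]
# drop() returns a new frame, so the original keeps its tags column.
toy_new = toy.iloc[split_at:].drop(columns="tags")

print(len(toy_stocked), len(toy_new))
print(list(toy_new.columns))
```

Note that the new-arrival rows keep their original index labels (4 here, 80 and up in the notebook) unless the index is reset explicitly.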

### Counting word frequencies
Select the 100 most frequent words in the stocked quotes, top_words, as the word-frequency features.

In [29]:
from collections import Counter

counter = Counter()
for _, row in stocked_in.iterrows():
    counter.update(row['text'].split())
    
print(*counter.most_common(100))
top_words, _ = zip(*counter.most_common(100))
# top_words

('you', 70) ('is', 59) ('to', 54) ('a', 47) ('the', 36) ('of', 33) ('and', 33) ('not', 29) ('I', 28) ('that', 23) ('it', 22) ('be', 22) ('in', 22) ('your', 18) ('but', 17) ('can', 15) ('have', 13) ('who', 13) ('will', 12) ('as', 11) ('what', 11) ('are', 11) ('or', 11) ('all', 11) ('“The', 10) ('with', 10) ('love', 10) ('think', 10) ('my', 10) ('more', 9) ("it's", 9) ('“I', 9) ('never', 9) ('make', 9) ('up', 9) ('her', 9) ('she', 9) ('no', 9) ('for', 8) ('do', 8) ('we', 7) ('“It', 7) ('than', 7) ('only', 7) ('The', 7) ('like', 7) ('going', 7) ('But', 7) ('if', 7) ('so', 7) ('“If', 7) ('may', 7) ('one', 7) ('our', 6) ('without', 6) ('live', 6) ('just', 6) ('them', 6) ('give', 6) ('because', 6) ('at', 6) ("don't", 6) ('when', 6) ('good', 5) ('must', 5) ('-', 5) ("doesn't", 5) ('keep', 5) ('“You', 5) ('from', 5) ('“There', 4) ('nothing', 4) ('man', 4) ('“A', 4) ('know', 4) ("you're", 4) ('get', 4) ('let', 4) ('find', 4) ('makes', 4) ('great', 4) ('about', 4) ('every', 4) ('opposite', 4) ...
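
The counts above treat '“The' and 'The' as different words, because `str.split()` keeps punctuation and case. One possible refinement, which the rest of this notebook does not use, is to normalize tokens before counting. The `tokenize` helper, the `norm_counter` name, and the sample sentences below are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

# Made-up sample sentences, not taken from the crawled data.
samples = [
    "“The world is what you make of it.”",
    "The world is not enough, but it's a start.",
]

norm_counter = Counter()
for text in samples:
    norm_counter.update(tokenize(text))

print(norm_counter.most_common(3))
```

With this tokenizer the curly quotation mark and the trailing period are dropped, so both sentences contribute to the same count for "the".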

Define a function, build_word_count_vector(), that builds a word-count vector; apply it to each stocked quote and stack the resulting vectors into the matrix word_count_matrix.

In [39]:
import numpy as np

def build_word_count_vector(word_counts):  # word_counts is a dict-like Counter
    vector = [word_counts.get(word,0) for word in top_words]
    return np.array(vector)

In [97]:
vectors = [build_word_count_vector(Counter(row['text'].split())) for _, row in stocked_in.iterrows()]

# print(vectors)
word_count_matrix = np.stack(vectors, axis=0)
word_count_matrix

array([[0, 1, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 4, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 2, 2, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
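
To make the shape of the matrix concrete, here is a small self-contained sketch of the same idea. The three-word `vocab`, the `to_vector` helper, and the sample texts are hypothetical stand-ins for top_words, build_word_count_vector(), and the stocked quotes.

```python
import numpy as np
from collections import Counter

# Hypothetical three-word vocabulary standing in for top_words.
vocab = ("you", "is", "to")

def to_vector(text, vocab):
    """Count how often each vocabulary word occurs in text."""
    counts = Counter(text.split())
    return np.array([counts.get(word, 0) for word in vocab])

texts = ["you is you", "to be or not to be", "nothing here"]
# One row per text, one column per vocabulary word.
matrix = np.stack([to_vector(t, vocab) for t in texts], axis=0)

print(matrix.shape)
print(matrix)
```

Words outside the vocabulary are ignored, so a text that shares no words with it becomes an all-zero row.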

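### Normalization
Compute the L2 norm of each row (the square root of the row's sum of squares) and divide every element of the row by it, so that the squares of the elements of each new vector sum to 1.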

In [101]:
word_count_matrix = word_count_matrix / np.linalg.norm(word_count_matrix, axis=1, keepdims=True)
word_count_matrix

array([[0.        , 0.25819889, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.30151134, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.69631062, 0.17407766, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.45883147, 0.45883147, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
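
The row normalization can be checked on a tiny matrix. The sketch below is a minimal illustration with made-up numbers; the guard for all-zero rows is an added assumption (the notebook cell divides directly), included because a row with no vocabulary words would otherwise cause a division by zero.

```python
import numpy as np

counts = np.array([[3.0, 4.0, 0.0],
                   [0.0, 0.0, 0.0],   # a row with no vocabulary words
                   [1.0, 1.0, 1.0]])

# keepdims=True keeps the norms as a column, so no reshape is needed.
norms = np.linalg.norm(counts, axis=1, keepdims=True)
# Assumption: all-zero rows are left as zeros instead of dividing by 0.
safe_norms = np.where(norms == 0, 1.0, norms)
unit_rows = counts / safe_norms

print(unit_rows[0])
print(np.sum(unit_rows**2, axis=1))
```

Every non-zero row ends up with a sum of squares of 1, while the all-zero row stays at 0.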

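### Automatic tagging
Count the words of each "new arrival" quote, find the stocked quote whose word-count vector has the highest dot-product similarity with it, and copy that quote's tags into the tags field of the new quote. Because the stocked rows are unit vectors and rescaling the query vector does not change the argmax, this is equivalent to picking the stocked quote with the highest cosine similarity.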

In [114]:
new_tags = []
for _, row in new_arrival.iterrows():
    word_count = build_word_count_vector(Counter(row['text'].split()))
    sim_vector = word_count_matrix @ word_count.reshape(-1,1)
    most_sim = int(np.argmax(sim_vector))
    row_tags = stocked_in.iloc[most_sim]['tags']
    new_tags.append(row_tags)

new_arrival['tags'] = new_tags
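
The same nearest-neighbour lookup can be written as a single matrix product instead of a loop. This is a sketch on made-up data, not the notebook's code: `stocked_vectors`, `stocked_tags`, and `new_vectors` are hypothetical stand-ins for word_count_matrix, the stocked tags, and the new quotes' count vectors.

```python
import numpy as np

# Hypothetical unit-length rows for three stocked quotes and their tags.
stocked_vectors = np.array([[1.0, 0.0, 0.0],
                            [0.0, 0.6, 0.8],
                            [0.6, 0.8, 0.0]])
stocked_tags = [["love"], ["life", "humor"], ["books"]]

# Raw (unnormalized) word counts for two new quotes.
new_vectors = np.array([[0.0, 1.0, 2.0],
                        [5.0, 1.0, 0.0]])

# Entry (i, j) is the dot product of new quote i with stocked quote j.
similarity = new_vectors @ stocked_vectors.T
# For each new quote, the index of the most similar stocked quote.
nearest = similarity.argmax(axis=1)
assigned = [stocked_tags[j] for j in nearest]

print(nearest)
print(assigned)
```

When several stocked quotes tie for the highest similarity, `argmax` returns the first of them, which is also what the loop version does.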

### Results
(Because this dataset is fairly random, the assigned tags may not match the real ones; accuracy is not evaluated here.)

In [34]:
new_arrival

| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [books, inspirational, reading, tea] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [humor] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [books, contentment, friends, friendship, life] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [music] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [the-hunger-games] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [imagination] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [inspirational] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [the-hunger-games] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [death, inspirational] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [humor, insanity, lies, lying, self-indulgence... |
