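# Famous Quotes Administrator - Supercharged

“The tagging from last time was interesting, but are there any other approaches?”

The parts repeated from the previous lab are already implemented. Run the earlier cells as they are, then skip ahead to the “Automatic Tagging” section and implement a series of models to complete the task.

## Splitting the Collection

Split the 100 scraped quotes (`spider_items.jl`) into 80 “stocked-in” quotes, `stocked_in`, and 20 “new-arrival” quotes, `new_arrival`.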

In [61]:
import pandas as pd
from collections import Counter

df = pd.read_json('spider_items.jl', lines=True)

stocked_in = df[:80]
new_arrival = df[80:]
del new_arrival['tags']
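
`new_arrival` above is a slice of `df`, and assigning a column to such a slice later can make pandas emit a `SettingWithCopyWarning`. A minimal sketch of one way to avoid that, assuming explicit copies are acceptable; the small `df` below is made-up stand-in data, not the scraped file:

```python
import pandas as pd

# Made-up stand-in for the scraped quotes (hypothetical data).
df = pd.DataFrame({
    "text": [f"quote {i}" for i in range(10)],
    "author": ["someone"] * 10,
    "tags": [["love"] if i % 2 else [] for i in range(10)],
})

# Take explicit copies so later column assignments act on independent frames.
stocked_in = df.iloc[:8].copy()
new_arrival = df.iloc[8:].drop(columns="tags").copy()

# Assigning a new column now only affects new_arrival, not df.
new_arrival["tags"] = [[] for _ in range(len(new_arrival))]
```

With this split, `df` keeps its original `tags` column, which can later serve as a reference when checking predictions.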

## Counting Word Frequencies

Select the 100 most frequent words in the stocked-in quotes, `top_words`, as the word-count features.

In [62]:
from collections import Counter

counter_all = Counter()
counter_list = []

for _, data in stocked_in.iterrows():
    words = data['text'].split()
    counter_all.update(words)
    counter_list.append(Counter(words))

top_words, _ = zip(*counter_all.most_common(100))
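
Note that `str.split()` keeps punctuation and case, so `love,`, `Love` and `love` are counted as different words. A minimal sketch of a stricter tokenizer, assuming lowercasing and stripping surrounding punctuation is acceptable for this task; `tokenize` is a hypothetical helper, not part of the notebook:

```python
import string
from collections import Counter

def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each token."""
    # The curly quotes used in the scraped quotes are not in string.punctuation.
    strip_chars = string.punctuation + "“”‘’"
    tokens = (word.strip(strip_chars).lower() for word in text.split())
    return [token for token in tokens if token]

text = "“Love is love, and LOVE is all.”"
raw_counts = Counter(text.split())       # counts "“Love", "love," and "LOVE" separately
clean_counts = Counter(tokenize(text))   # merges them into a single "love" entry
```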

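Define a function `build_word_count_vector()` that creates a word-count vector, apply it to every stocked-in quote, and stack the resulting vectors into the matrix `word_count_matrix`.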

In [63]:
def build_word_count_vector(word_counts):
    return [word_counts.get(word, 0) for word in top_words]

In [64]:
import numpy as np

vectors = []

for word_counts in counter_list:
    vectors.append(build_word_count_vector(word_counts))

word_count_matrix = np.array(vectors)
word_count_matrix

array([[0, 1, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 4, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 2, 2, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
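
To make the shape of this matrix concrete, the same steps can be traced on two toy quotes with only two feature words. This is an illustrative sketch; the quotes are made up:

```python
from collections import Counter
import numpy as np

quotes = ["to be or not to be", "to love is to live"]  # hypothetical toy quotes

counter_all = Counter()
counter_list = []
for quote in quotes:
    words = quote.split()
    counter_all.update(words)
    counter_list.append(Counter(words))

# Keep only the 2 most frequent words as features.
top_words, _ = zip(*counter_all.most_common(2))

# One row per quote, one column per feature word.
matrix = np.array([[counts.get(word, 0) for word in top_words]
                   for counts in counter_list])
```

Each row counts how often the feature words occur in one quote; words outside `top_words` are ignored.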

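## Normalization

Compute the L2 norm of each row (one vector), that is, the square root of its sum of squares, and divide the row by it so that the squared elements of each new vector sum to 1.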

In [65]:
s = (np.sum(word_count_matrix ** 2, axis=1) ** 0.5)[:, np.newaxis]  # same as (np.sum(word_count_matrix ** 2, axis=1) ** 0.5).reshape(-1, 1)
word_count_matrix = word_count_matrix / s
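
A quote that contains none of the feature words gives an all-zero row, and dividing it by its zero norm yields `NaN`. A minimal sketch of a guarded version, assuming such rows should simply stay zero; `l2_normalize_rows` is a hypothetical helper:

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale each row to unit L2 norm; all-zero rows are left as zeros."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Replace zero norms by 1 so empty rows stay zero instead of becoming NaN.
    safe_norms = np.where(norms == 0, 1.0, norms)
    return matrix / safe_norms

counts = np.array([[3, 4, 0], [0, 0, 0], [1, 1, 1]])
normalized = l2_normalize_rows(counts)
```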

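Take the “new arrival” data and turn it into normalized word-count vectors in the same way. Each model below then predicts on this matrix to produce its results.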

In [66]:
vectors = []

for _, data in new_arrival.iterrows():
    words = data['text'].split()
    word_counts = Counter(words)
    vectors.append(build_word_count_vector(word_counts))

new_arrival_word_count_matrix = np.array(vectors)

s = (np.sum(new_arrival_word_count_matrix ** 2, axis=1) ** 0.5)[:, np.newaxis]
new_arrival_word_count_matrix = new_arrival_word_count_matrix / s
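
The new arrivals must be vectorized with the same feature words as the stocked-in quotes, otherwise the columns of the two matrices would not line up. One way to make that explicit is a single helper used for both sets. This is a sketch; `texts_to_matrix` and the toy vocabulary are hypothetical:

```python
from collections import Counter
import numpy as np

def texts_to_matrix(texts, vocabulary):
    """Turn texts into L2-normalized word-count rows over a fixed vocabulary."""
    rows = []
    for text in texts:
        counts = Counter(text.split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    matrix = np.array(rows, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: rows without vocabulary words stay zero
    return matrix / norms

vocabulary = ("to", "be")  # hypothetical stand-in for top_words
train_matrix = texts_to_matrix(["to be or not to be"], vocabulary)
new_matrix = texts_to_matrix(["to love is to live", "nothing here"], vocabulary)
```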

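## Automatic Tagging

Build several kinds of models to predict tags.

### Building the Target Labels

Taking one tag, such as `love`, as an example, build the target labels `stocked_in_target`: a value of 1 means the quote's tags include `love`, and 0 means they do not.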

In [67]:
stocked_in_target = []  # Your code here to build stocked_in_target
for _, row in stocked_in.iterrows():
    if "love" in row["tags"]:
        stocked_in_target.append(1)
    else:
        stocked_in_target.append(0)
        
# stocked_in_target
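
The loop above can also be written as a small reusable function. The sketch below assumes, as in the notebook, that each `tags` entry is a list of strings, so `in` tests exact membership (on a plain string it would match substrings instead). `build_target` and the toy rows are hypothetical:

```python
import pandas as pd

# Hypothetical toy rows; `tags` holds a list of strings per quote.
stocked_in = pd.DataFrame({
    "text": ["a", "b", "c"],
    "tags": [["love", "life"], ["humor"], ["glove"]],
})

def build_target(frame, tag):
    """Return 1 for rows whose tag list contains `tag`, else 0."""
    return [int(tag in row_tags) for row_tags in frame["tags"]]

stocked_in_target = build_target(stocked_in, "love")
```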

### Logistic Regression

In [68]:
# Your code here to train a model on word_count_matrix and stocked_in_target
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(word_count_matrix, stocked_in_target)

tags = model.predict(new_arrival_word_count_matrix)  # Your code here to predict tags for new_arrival

new_arrival["tags"] = list(map(lambda x: ["love"] if x else [], tags))
new_arrival

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_arrival["tags"] = list(map(lambda x: ["love"] if x else [], tags))


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |
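
None of the rows shown above receives the `love` tag. `predict()` only reports the positive class when its estimated probability exceeds 0.5, which can be rare when few training quotes carry the tag. A minimal sketch of inspecting the probabilities and applying a lower cutoff, on synthetic data; the data and the `threshold` value are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 80 samples, 5 features, few positives.
X = rng.random((80, 5))
y = (X[:, 0] > 0.9).astype(int)
y[:2] = [0, 1]  # make sure both classes are present

model = LogisticRegression()
model.fit(X, y)

# Column 1 of predict_proba is the estimated probability of class 1.
positive_proba = model.predict_proba(X)[:, 1]

threshold = 0.2  # hypothetical cutoff, lower than the default 0.5
default_pred = model.predict(X)
lowered_pred = (positive_proba >= threshold).astype(int)
```

A lower cutoff can only add positive predictions, never remove them; whether the extra tags are correct has to be checked against known labels.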


### Perceptron

In [69]:
# Your code here to train a model on word_count_matrix and stocked_in_target
from sklearn.linear_model import Perceptron

model = Perceptron()
model.fit(word_count_matrix, stocked_in_target)

tags = model.predict(new_arrival_word_count_matrix)  # Your code here to predict tags for new_arrival


new_arrival["tags"] = list(map(lambda x: ["love"] if x else [], tags))
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [love] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [love] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [love] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |


### K-Nearest Neighbors

In [70]:
# Your code here to train a model on word_count_matrix and stocked_in_target
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors = 1)
model.fit(word_count_matrix, stocked_in_target)

tags = model.predict(new_arrival_word_count_matrix)  # Your code here to predict tags for new_arrival


new_arrival["tags"] = list(map(lambda x: ["love"] if x else [], tags))
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |
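
Because the rows were scaled to unit L2 norm, the Euclidean nearest neighbor is also the row with the highest cosine similarity: for unit vectors a and b, the squared distance equals 2 - 2(a·b). A small check of this on synthetic unit vectors (all data below is made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic non-negative rows, scaled to unit L2 norm like the word-count vectors.
X = rng.random((20, 6))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.arange(20) % 2  # arbitrary labels, for illustration only

query = rng.random((1, 6))
query /= np.linalg.norm(query, axis=1, keepdims=True)

squared_dist = ((X - query) ** 2).sum(axis=1)  # squared Euclidean distances
cosine_sim = X @ query[0]                      # dot products of unit vectors

# The default metric of KNeighborsClassifier is Euclidean distance.
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)
_, neighbor_index = model.kneighbors(query)
```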


### Decision Tree

In [71]:
# Your code here to train a model on word_count_matrix and stocked_in_target
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(word_count_matrix, stocked_in_target)

tags = model.predict(new_arrival_word_count_matrix)  # Your code here to predict tags for new_arrival


new_arrival["tags"] = list(map(lambda x: ["love"] if x else [], tags))
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [love] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [love] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [love] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [love] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |
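
`DecisionTreeClassifier` breaks ties between equally good splits at random unless `random_state` is set, so repeated runs of the cell above may tag different quotes. A minimal sketch on a made-up word-count matrix that fixes the seed and prints the learned rules with `export_text`; the feature words and labels are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up word-count matrix: columns are hypothetical feature words.
feature_words = ["love", "life", "book"]
X = np.array([[2, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 2], [0, 2, 1], [3, 0, 1]])
y = np.array([1, 1, 0, 0, 0, 1])  # 1 = tagged "love" in this toy example

model = DecisionTreeClassifier(random_state=0)  # fixed seed for repeatable splits
model.fit(X, y)

# Text rendering of the tree, one line per decision or leaf.
rules = export_text(model, feature_names=feature_words)
print(rules)
```

In this toy data the label depends only on the `love` column, so a single split on that feature is enough.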


### Support Vector Machine

In [72]:
# Your code here to train a model on word_count_matrix and stocked_in_target
from sklearn.svm import SVC

model = SVC(kernel = "rbf")
model.fit(word_count_matrix, stocked_in_target)

tags = model.predict(new_arrival_word_count_matrix)  # Your code here to predict tags for new_arrival


new_arrival["tags"] = list(map(lambda x: ["love"] if x else [], tags))
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |
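
The SVM also leaves every row shown above untagged. One option worth trying when positives are scarce is `class_weight="balanced"`, which makes mistakes on the rarer class cost more during training. The sketch below only shows how to set it up and count predicted positives on synthetic data; whether it helps on the real quotes is not shown here:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data with few positive labels.
X = rng.random((80, 5))
y = (X[:, 0] > 0.85).astype(int)
y[:2] = [0, 1]  # make sure both classes are present

plain = SVC(kernel="rbf").fit(X, y)
balanced = SVC(kernel="rbf", class_weight="balanced").fit(X, y)

# Compare how many training samples each variant labels as positive.
plain_positives = int(plain.predict(X).sum())
balanced_positives = int(balanced.predict(X).sum())
print(plain_positives, balanced_positives)
```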


### Multilayer Perceptron

In [73]:
# Your code here to train a model on word_count_matrix and stocked_in_target
from sklearn.neural_network import MLPClassifier

model = MLPClassifier()
model.fit(word_count_matrix, stocked_in_target)

tags = model.predict(new_arrival_word_count_matrix)  # Your code here to predict tags for new_arrival


new_arrival["tags"] = list(map(lambda x: ["love"] if x else [], tags))
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |


### Parameters

So what are the parameters of these models? Try to find them!

In [74]:
# your code here

model.coefs_

[array([[-0.16842759,  0.0256788 ,  0.17043096, ...,  0.16801243,
          0.19342452, -0.10038806],
        [-0.09095679,  0.19337519, -0.07441755, ..., -0.03746284,
         -0.10787008,  0.14702808],
        [-0.10725369, -0.10366946,  0.18416241, ...,  0.03655457,
          0.12002631, -0.14417346],
        ...,
        [ 0.01754532, -0.04635326,  0.21300727, ..., -0.0105447 ,
          0.32428292, -0.14495346],
        [ 0.04909764, -0.09955436,  0.06112149, ...,  0.00050988,
          0.32704594, -0.18206671],
        [ 0.0027749 , -0.06031123,  0.01927935, ..., -0.04481897,
         -0.05926587, -0.08302163]]),
 array([[ 0.01686598],
        [ 0.50837749],
        [-0.16237567],
        [ 0.20866586],
        [-0.14249878],
        [ 0.15445196],
        [ 0.19083406],
        [-0.19513066],
        [ 0.30145315],
        [ 0.26595801],
        [-0.3937626 ],
        [ 0.13572222],
        [ 0.37781472],
        [-0.15004766],
        [-0.2636684 ],
        [-0.27008497],
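
`coefs_` above belongs to the last model trained, the `MLPClassifier`. Other estimators expose their fitted parameters under different attributes. A minimal sketch on synthetic data that prints only shapes; `KNeighborsClassifier` is left out because it stores the training samples rather than learning weights, and the MLP settings are assumptions chosen to keep the example small:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 4))           # synthetic stand-in features
y = (X[:, 0] > 0.5).astype(int)   # synthetic stand-in labels
y[:2] = [0, 1]                    # make sure both classes are present

logreg = LogisticRegression().fit(X, y)
perceptron = Perceptron().fit(X, y)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
svc = SVC(kernel="rbf").fit(X, y)
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0).fit(X, y)

shapes = {
    "logreg.coef_": logreg.coef_.shape,                  # (1, n_features)
    "logreg.intercept_": logreg.intercept_.shape,        # (1,)
    "perceptron.coef_": perceptron.coef_.shape,          # (1, n_features)
    "tree.feature_importances_": tree.feature_importances_.shape,  # (n_features,)
    "svc.support_vectors_": svc.support_vectors_.shape,  # (n_support, n_features)
    "mlp.coefs_": [w.shape for w in mlp.coefs_],         # one weight matrix per layer
    "mlp.intercepts_": [b.shape for b in mlp.intercepts_],  # one bias vector per layer
}
for name, shape in shapes.items():
    print(name, shape)
```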
     

### Other Tags

Try other tags as the target label yourself and see how well they work.
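
“How well” can be made concrete by comparing predictions with known tags. A minimal sketch, assuming true 0/1 labels for one tag are available for the new arrivals (for instance from the `tags` column that `df` still holds for those rows); the two label lists below are made up:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical 0/1 labels for ten held-out quotes and one tag.
true_labels = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]
predicted_labels = [0, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Precision: share of predicted tags that are correct.
precision = precision_score(true_labels, predicted_labels, zero_division=0)
# Recall: share of true tags that were found.
recall = recall_score(true_labels, predicted_labels, zero_division=0)
```

Here two of the three predicted tags are correct and two of the three true tags are found, so both scores are 2/3. `zero_division=0` avoids a warning when a model predicts no positives at all.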

In [75]:
tags = set()
for _, row in stocked_in.iterrows():
    tags.update(row["tags"])

In [76]:
from sklearn.neighbors import KNeighborsClassifier
# from sklearn.linear_model import Perceptron  # alternative model

models = {}
for tag in tags:
    # Build the 0/1 target for this tag
    stocked_in_target = []
    for _, row in stocked_in.iterrows():
        if tag in row["tags"]:
            stocked_in_target.append(1)
        else:
            stocked_in_target.append(0)

    # Train one model per tag on word_count_matrix and stocked_in_target
    model = KNeighborsClassifier(n_neighbors=3)
    # model = Perceptron()  # alternative model
    model.fit(word_count_matrix, stocked_in_target)
    models[tag] = model

In [77]:
tag_list = []
for _, row in new_arrival.iterrows():
    tag_list.append([])

for tag in tags:
    model = models[tag]
    pred = model.predict(new_arrival_word_count_matrix[:])  # Your code here to predict tags for new_arrival
    for i, pr in enumerate(pred):
        if pr == 1:
            tag_list[i].append(tag)
new_arrival["tags"] = tag_list
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [reading, books] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [love] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [friends, life] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [friendship, love] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [love] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [inspirational] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [inspirational] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |
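
Training one classifier per tag, as above, is one way to handle multiple tags. As an alternative sketch (not the notebook's method), `MultiLabelBinarizer` can encode all tags into a single 0/1 indicator matrix, which `KNeighborsClassifier` accepts directly as a target. The feature rows and tag lists below are made up:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up unit-length feature rows and tag lists, for illustration only.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
X /= np.linalg.norm(X, axis=1, keepdims=True)
tag_lists = [["love"], ["love", "life"], ["books"], ["books", "life"]]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tag_lists)  # one 0/1 column per tag

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, Y)  # a 2-D indicator target makes this a multilabel model

new_X = np.array([[1.0, 0.02], [0.02, 1.0]])
new_X /= np.linalg.norm(new_X, axis=1, keepdims=True)

# Map the predicted indicator rows back to tuples of tag names.
predicted = binarizer.inverse_transform(model.predict(new_X))
```

Each new row receives the full tag set of its nearest stocked-in neighbor, so a single model replaces the dictionary of per-tag models.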


# Epilogue

“It seems I am no longer an administrator, but an MLer.”