# Intent Classification Model Training

In [None]:
import pandas as pd
import numpy as np

# Word Embeddings
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim
print(f'gensim: {gensim.__version__}')

# Text
from nltk.tokenize import word_tokenize 
from nltk.tokenize import TweetTokenizer
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import get_tmpfile

# Storing as objects via serialization
from tempfile import mkdtemp
import pickle
import joblib

# Visualization 
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)

# Directory
import os
import yaml
import collections
import scattertext as st
import math

# Cool progress bars
from tqdm import tqdm_notebook as tqdm
tqdm().pandas()  # Enable tracking of execution progress

## LOADING OBJECTS
processed_inbound = pd.read_pickle('objects/processed_inbound_extra.pkl')
processed = pd.read_pickle('objects/processed.pkl')

# Reading back in intents
with open(r'objects/intents.yml') as file:
    intents = yaml.load(file, Loader=yaml.FullLoader)

# Previewing
print(f'\nintents:\n{intents}')
print(f'\nprocessed:\n{processed.head()}')

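## Collecting Tweets with Doc2Vec

With my Doc2Vec representation, I can find the top 1000 Tweets most similar to a generalized version of each intent, ranked by cosine similarity.

Heuristic search is a search strategy that tries to optimize a problem by iteratively improving the solution according to a given heuristic function or cost measure. My cost measure is cosine distance: I look for the Tweets closest to the idealized one.

This is really cool. I train my doc2vec model on my own training data, `processed_inbound`, and can then compute a vector for any new piece of text based on what the model learned from that data.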
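To make the ranking step concrete, here is a minimal sketch of a cosine-similarity top-N search in plain Python. The function names and the toy vectors are assumptions made up for illustration; in the notebook itself gensim does this lookup internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def top_n_similar(query, doc_vectors, n):
    """Return the n (tag, similarity) pairs closest to `query`, best first."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy 2-dimensional "document vectors" keyed by tag (made-up values)
toy_vectors = {'0': [1.0, 0.0], '1': [0.9, 0.1], '2': [0.0, 1.0], '3': [-1.0, 0.0]}
ranked = top_n_similar([1.0, 0.0], toy_vectors, n=2)
print(ranked)
```

The result has the same `(tag, similarity)` shape as the tuples shown later in this notebook, which is why only the tags need to be kept to pull the matching Tweets back out.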


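## Training my Doc2Vec
Doc2Vec is a method developed to generalize word2vec to paragraphs. Rather than leaving you to average word vectors yourself, it represents each Tweet as a single embedding of fixed size, so every document has the same dimensionality no matter how long it is.

Word2Vec comes in two variants. Continuous Bag of Words (CBOW) slides a window over the text and predicts each word from its context (the surrounding words), while Skip-Gram predicts the context from the word. Doc2Vec builds on exactly this.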

* https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e
* https://radimrehurek.com/gensim/models/doc2vec.html
* https://rare-technologies.com/doc2vec-tutorial/

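It is essentially word2vec: it takes the standard word2vec model and adds one extra vector that represents the paragraph, called the paragraph vector. A sequence of words goes in, and the model uses those words together with the paragraph vector to predict the next word, then checks whether the prediction was correct and updates its weights accordingly. This is repeated many times over different word combinations.

So it is the same idea as word2vec, but at the document level rather than the word level. My implementation below is based on [this article](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e) and on scrolling through Gensim's documentation to understand each step in finer detail.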
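As a rough illustration of the idea above, the sketch below lists the (context, target) training pairs that a sliding window produces, first in the CBOW style and then with a document tag added to every context, which is the gist of the paragraph-vector variant. The helper names and the `DOC_0` tag are hypothetical, and no actual training happens here.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs in the CBOW style."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

def paragraph_vector_pairs(doc_tag, tokens, window):
    """Same pairs, but the document tag joins every context."""
    return [([doc_tag] + context, target) for context, target in cbow_pairs(tokens, window)]

tweet = ['battery', 'drain', 'fast', 'help']
for context, target in paragraph_vector_pairs('DOC_0', tweet, window=1):
    print(context, '->', target)
```

Because the tag appears in the context of every word of its document, the vector learned for that tag ends up summarizing the whole Tweet.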


## My Intents:

<img src="visualizations/intent_list.png" alt="Drawing" style="width: 300px;"/>

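Update:
* Decided to remove lost_replace because it is hard to distinguish from repair: most customers who lost something technically also need a problem fixed

### Putting it together
Basically, I have three ways to get my intent training data (1000 examples each):
* **Doc2Vec:** I use doc2vec to synthetically generate intent examples from idealized examples
* **Manual:** I write a few examples by hand and duplicate them (for intents such as greeting, which the current data does not represent)
* **Hybrid:** For some intents I take a hybrid approach, where part of the data (50%, say) is written by me and the rest is generated with doc2vec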

In [None]:
# Making my idealized dataset - generating N Tweets similar to this artificial Tweet
# This will then be concatenated to current inbound data so it can be included in the doc2vec training

# Version 1
ideal = {'Greeting': 'hi hello yo hey whats up howdy morning',
        'Update': 'have problem with update'}
# Version 2 - I realized that keywords might get the job done, and it's less risky to 
# add more words for the association power because it's doc2vec
ideal = {'battery': 'battery power', 
         'forgot_password': 'password account login',
         'payment': 'credit card payment pay',
         'update': 'update upgrade',
         'info': 'info information'
#          ,'lost_replace': 'replace lost gone missing trade'
         ,'location': 'nearest apple location store'
        }

def add_extra(current_tokenized_data, extra_tweets):
    ''' Adding extra tweets to current tokenized data'''
    
    # Storing these extra Tweets in a list to concatenate to the inbound data
    extra_tweets = pd.Series(extra_tweets)

    # Making string form
    print('Converting to string...')
    string_processed_data = current_tokenized_data.progress_apply(" ".join)

    # Adding it to the data, updating processed_inbound
    string_processed_data = pd.concat([string_processed_data, extra_tweets], axis = 0)

    # We want a tokenized version
    tknzr = TweetTokenizer(strip_handles = True, reduce_len = True)
#     print('Tokenizing...')
#     string_processed_data.progress_apply(tknzr.tokenize)
    return string_processed_data

# Getting the lengthened data
processed_inbound_extra = add_extra(processed['Processed Inbound'], list(ideal.values()))

# Saving updated processed inbound into a serialized saved file
processed_inbound_extra.to_pickle('objects/processed_inbound_extra.pkl')

processed_inbound_extra

In [None]:
processed_inbound_extra[-7:]

In [None]:
intents_repr

In [None]:
processed_inbound_extra.shape

In [None]:
ideal

In [None]:
processed.shape

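I tag the documents first and then start training my model! It is just like training a neural network. As for parameters, I set each vector to 20 dimensions.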

In [None]:
def train_doc2vec(string_data, max_epochs, vec_size, alpha):
    # Tagging each document with an ID; using just its index is the most memory-efficient option
    tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) 
                   for i, _d in enumerate(string_data)]
    
    # Instantiating my model. Note: `size` and, below, `model.iter` are the gensim 3.x names;
    # gensim 4 renamed them to `vector_size` and `epochs`
    model = Doc2Vec(size=vec_size, alpha=alpha, min_alpha=0.00025, min_count=1, dm =1)

    model.build_vocab(tagged_data)

    for epoch in range(max_epochs):
        print('iteration {0}'.format(epoch))
        model.train(tagged_data, total_examples = model.corpus_count, epochs=model.iter)
        # Decrease the learning rate
        model.alpha -= 0.0002
        # Fix the learning rate, no decay
        model.min_alpha = model.alpha

    # Saving model
    model.save("models/d2v.model")
    print("Model Saved")
    
# Training
train_doc2vec(processed_inbound_extra, max_epochs = 100, vec_size = 20, alpha = 0.025)

In [None]:
# Loading in my model
model = Doc2Vec.load("models/d2v.model")

# Storing my data into a list - this is the data I will cluster
inbound_d2v = np.array([model.docvecs[i] for i in range(processed_inbound_extra.shape[0])])

# Saving
with open('objects/inbound_d2v.pkl', 'wb') as f:
    pickle.dump(inbound_d2v, f)

inbound_d2v

In [None]:
inbound_d2v.shape

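Previously, the count-based vectorizers gave us no notion of semantic distance: their dimensions are just word counts and carry no particular meaning. Doc2Vec is a better approach because it captures the contextual relationships between words! My clustering should be much better than with tfidf or bag of words.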
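A small sketch of what is missing from count-based vectors: under bag of words, two Tweets about the same problem that share no vocabulary have a cosine similarity of exactly zero. The example Tweets are made up, and `bow_cosine` is a hypothetical helper, not part of the notebook.

```python
import math
from collections import Counter

def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Same topic, no shared words: bag of words sees them as unrelated
same_topic = bow_cosine(['battery', 'drain', 'fast'], ['power', 'die', 'quickly'])
# One shared word is the only thing that moves the score
shared_word = bow_cosine(['battery', 'drain', 'fast'], ['battery', 'die', 'quickly'])
print(same_topic, shared_word)
```

A learned embedding can place "battery" and "power" near each other, so the first pair would no longer look unrelated.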


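### What was doc2vec trained on?
One thing to keep in mind is what an embedding was trained on. We don't want one trained on academic data, because Tweets are a completely different domain from academic papers.

According to the gensim [documentation](https://radimrehurek.com/gensim/models/doc2vec.html), doc2vec is trained in the same way as word2vec, except that it also uses a paragraph context vector. Gensim's `Doc2Vec` class does not come with pretrained weights such as the Google News word2vec vectors, so the model above is trained from scratch on my own Tweets (`processed_inbound_extra`) and is already in the right domain.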

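## Methodology

Initially, to get the top 1000 similar Tweets, I tried to use existing Tweets as the starting point. I don't think that yields the most accurate results, though, because the Tweet you pick is not necessarily the most representative one. For this reason I wrote the basic representative Tweets myself (you can see them in the `ideal` dict above). The goal is to find an idealized, complete representation of each intent. I then use my doc2vec representation to find the top 1000 most similar Tweets based on cosine similarity.

### Exploring the package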

In [None]:
# Finding and making idealized versions of each intent so that I can find top 1000 to it:
intents_ideal = {'app': ['app', 'prob']}
inferred_vectors = []

for keywords in intents_ideal.values():
    inferred_vectors.append(model.infer_vector(keywords))
    
inferred_vectors

In [None]:
# model.similarity(inferred_vectors[0], inbound_d2v[0])

In [None]:
'hi hello yo hey whats up'.split()

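### Finding the intent tags
I want to get the tags of my representative Tweets, because a tag is what doc2vec's `model.docvecs.most_similar` method takes as its argument to return the top N Tweets most similar to it.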

In [None]:
ideal

In [None]:
# Getting the indexes

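The code block below is not the most efficient, and it takes a long time to run, but it works! It searches through all of my processed inbound Tweets and looks up the tag of each representative Tweet, as shown in the output and in the `intents_tags` dictionary.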
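If the repeated full scan ever becomes a bottleneck, one possible alternative is to index the tokenized Tweets once and then look each representative Tweet up directly. This is only a sketch on toy data; `build_token_index` and `find_tags` are hypothetical names, not functions used elsewhere in this notebook.

```python
def build_token_index(tokenized_docs):
    """Map each distinct token sequence to the position of its first occurrence."""
    first_seen = {}
    for position, tokens in enumerate(tokenized_docs):
        # Lists are unhashable, so the tokens are stored as a tuple
        first_seen.setdefault(tuple(tokens), position)
    return first_seen

def find_tags(representatives, token_index):
    """Return {intent: tag} for every representative Tweet found in the index."""
    tags = {}
    for intent, tokens in representatives.items():
        position = token_index.get(tuple(tokens))
        if position is not None:
            tags[intent] = str(position)
    return tags

toy_docs = [['my', 'phone', 'slow'], ['battery', 'power'],
            ['update', 'upgrade'], ['battery', 'power']]
toy_repr = {'battery': ['battery', 'power'],
            'update': ['update', 'upgrade'],
            'location': ['nearest', 'store']}
toy_tags = find_tags(toy_repr, build_token_index(toy_docs))
print(toy_tags)
```

Keying the result by intent name also means that an intent with no match is simply absent, instead of shifting the tags of the intents that come after it.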

In [None]:
# Storing my representative tweets and intents in this dictionary
# Just need to add to this dictionary and the rest of the code block does the work for you
# To find a suitable representative tweet for this: I used the keyword EDA functions in notebook 1.1

# Version 1
intents_repr = {'Battery': ['io', 'drain', 'battery', 'iphone', 'twice', 'fast', 'io', 'help'],
    'Update': ['new', 'update', 'i️', 'make', 'sure', 'download', 'yesterday'],
    'iphone': ['instal', 'io', 'make', 'iphone', 'slow', 'work', 'properly', 'help'],
    'app': ['app', 'still', 'longer', 'able', 'control', 'lockscreen'],
    'mac': ['help','mac','app','store','open','can','not','update','macbook','pro','currently','run','o','x',
  'yosemite'], 'greeting': ['hi', 'hello', 'yo', 'hey', 'whats', 'up']
    }
# You could see that in version 1, I try to use existing tweets, but that isn't really the best strategy and
# it doesn't yield the best results

In [None]:
# Version 2
tknzr = TweetTokenizer(strip_handles = True, reduce_len = True)
## Just tokenizing all the values of ideal' values to be able to be fed in to matching function
# intents_repr = dict(zip(ideal.keys(), [tknzr.tokenize(v) for v in ideal.values()]))
# Pythonic way
intents_repr = {k:tknzr.tokenize(v) for k, v in ideal.items()}
print(intents_repr)

# Saving intents_repr into YAML
with open('objects/intents_repr.yml', 'w') as outfile:
    yaml.dump(intents_repr, outfile, default_flow_style=False)

# Storing tags in order of the dictionary above
tags = []

tokenized_processed_inbound = processed_inbound.apply(tknzr.tokenize)
# Find the index locations of specific Tweets
def report_index_loc(tweet, intent_name):
    ''' Takes in the Tweet to find the index for and returns a report of that Tweet index along with what the 
    representative Tweet looks like'''
    # Positions of every Tweet whose tokens exactly match the representative Tweet
    index = [i for i, tokens in enumerate(tokenized_processed_inbound) if tokens == tweet]
    if not index:
        print('Index not in list, move on')
        return

    preview = processed_inbound.iloc[index]

    # Appending to indexes for dictionary
    tags.append(str(index[0]))

    return intent_name, str(index[0]), preview

# Reporting and storing indexes with the function
print('TAGGED INDEXES TO LOOK FOR')
for j,i in intents_repr.items():
    try:
        print('\n{} \nIndex: {}\nPreview: {}'.format(*report_index_loc(i,j)))
    except Exception as e:
        print('Index ended')

# Pythonic way of making new dictionary from 2 lists
intents_tags = dict(zip(intents_repr.keys(), tags))
intents_tags

In [None]:
# Great! Now I can get the training data for my battery intent (as an example)
similar_doc = model.docvecs.most_similar('76066',topn = 1000)
# Preview
similar_doc[:5]

In [None]:
similar_doc = model.docvecs.most_similar('76070',topn = 1000)
similar_doc

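## Training Data Synthesis
### 1. Adding intents to the training data based on similarity
As shown above, the right-hand element of each tuple is the cosine similarity.

We simply take the top 1000 Tweets most similar to the basic idealized version of the intent (which is mostly keyword-based).

### 2. Adding intents manually
These intents are generated with a different, more manual approach.

I write as many examples as I can, then brute-force it by duplicating them until there are 1000 training examples, to keep the classes balanced.
Once again, here are all the intents I want to add:
<img src="visualizations/intent_list.png" alt="Drawing" style="width: 300px;"/>

### 3. Adding hybrid intents
Using the keyword exploration shown in the previous notebook, I found a lot of overlap between update and repair.

So for these cases I generate part of the data with doc2vec and insert the rest manually. The idea is to balance overfitting against noise while feeding in the right signal.

_A special case may be out-of-scope, which I might handle some other way because I cannot generate examples of everything that falls under that intent_

Step 4 converts the data to a long format that the NN can be fed, and the last step saves it.

I learned about the catastrophic forgetting problem from the spaCy documentation: you should not iterate over the same values again and again, because doing so effectively changes the loss function and you end up with a model that does not generalize well. This ended up being a long process, because I had to experiment to see what works best.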

In [None]:
# Checking for stopwords because I don't want to include them in the manually representative intents
# This is something that I manually tune to the dataframe (for step 2 of this process)
import nltk
from nltk.corpus import stopwords

stopwords.words('english').index('to')

In [None]:
intents_tags

In [None]:
model.docvecs.most_similar('10')

In [None]:
intents_tags

Prompt the user to choose between update and broken.

In [None]:
# Testing how to tokenize numpy array
vals = [word_tokenize(tweet) for tweet in list(processed_inbound.iloc[[10,1]].values)]
vals

In [None]:
## Getting top n tweets similar to the 0th Tweet
# This will return the a list of tuples (i,j) where i is the index and j is 
# the cosine similarity to the tagged document index

# Storing all intents in this dataframe
train = pd.DataFrame()
# intent_indexes = {}

# 1. Adding intent content based on similarity
def generate_intent(target, itag):
    similar_doc = model.docvecs.most_similar(itag,topn = target)
    # Getting just the indexes
    indexes = [int(i[0]) for i in similar_doc]
#     intent_indexes[intent_name] = indexes
    # Actually seeing the top 1000 Tweets similar to the 0th Tweet which seems to be about updates
    # Adding just the values, not the index
    # Tokenizing the output
    return [word_tokenize(tweet) for tweet in list(processed_inbound.iloc[indexes].values)]

# Updating train data
for intent_name, itag in intents_tags.items():
    train[intent_name] = generate_intent(1000, itag)

# 2. Manually added intents
# These are the remainder intents
manually_added_intents = {
    'speak_representative': [['talk','human','please'],
                             ['let','me','talk','to','apple','support'], 
                             ['can','i','speak','agent','person']], 
    'greeting': [['hi'],['hello'], ['whats','up'], ['good','morning'],
                 ['good','evening'], ['good','night']],
    'goodbye': [['goodbye'],['bye'],['thank'],['thanks'], ['done']], 
    'challenge_robot': [['robot','human'], ['are','you','robot'],
                       ['who','are','you']]
}

# Inserting manually added intents to data

def insert_manually(target, prototype):
    ''' Taking a prototype tokenized document to repeat until
    you get length target'''
    factor = math.ceil(target / len(prototype))
    content = prototype * factor
    return [content[i] for i in range(target)]

# Updating training data
for intent_name in manually_added_intents.keys():
    train[intent_name] = insert_manually(1000, [*manually_added_intents[intent_name]])

# 3. Adding in the hybrid intents

hybrid_intents = {'update':(300,700,[['want','update'], ['update','not','working'], 
                                     ['phone','need','update']], 
                            intents_tags['update']),
                  'info': (800,200, [['need','information'], 
                                       ['want','to','know','about'], ['what','are','macbook','stats'],
                                    ['any','info','next','release','?']], 
                             intents_tags['info']),
                  'payment': (300,700, [['payment','not','through'], 
                                       ['iphone', 'apple', 'pay', 'but', 'not', 'arrive'],
                                       ['how','pay','for', 'this'],
                                       ['can','i','pay','for','this','first']], 
                             intents_tags['payment']),
                  'forgot_password': (600,400, [['forgot','my','pass'], ['forgot','my','login'
                                ,'details'], ['cannot','log','in','password'],['lost','account','recover','password']], 
                             intents_tags['forgot_password'])
                 }

def insert_hybrid(manual_target, generated_target, prototype, itag):
    return insert_manually(manual_target, prototype) + list(generate_intent(generated_target, itag))

# Updating training data
for intent_name, args in hybrid_intents.items():
    train[intent_name] = insert_hybrid(*args)

# 4. Converting to long dataframe from wide that my NN model can read in for the next notebook - and wrangling
neat_train = pd.DataFrame(train.T.unstack()).reset_index().iloc[:,1:].rename(columns={'level_1':'Intent', 0: 'Utterance'})
# Reordering
neat_train = neat_train[['Utterance','Intent']]

# 5. Saving this raw training data into a serialized file
neat_train.to_pickle('objects/train.pkl')

# Styling display
show = lambda x: x.head(10).style.set_properties(**{'background-color': 'black',                                                   
                                    'color': 'lawngreen',                       
                                    'border-color': 'white'})\
.applymap(lambda x: f"color: {'lawngreen' if isinstance(x,str) else 'red'}")\
.background_gradient(cmap='Blues')

print(train.shape)
show(train)

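I am happy with these. If you inspect them, they look promising! I am not too worried about the emojis my preprocessor missed: they are rare and only add noise. The same goes for other things such as language, since I also saw an Indonesian Tweet. This may even be a good thing, because we don't want our model to overfit, and it may help my model generalize.

The worst results probably come from the `lost_replace` intent because, as the keyword EDA showed, there is not much of that content anyway. I might take it out.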

In [None]:
print(neat_train.shape)
show(neat_train)

In [None]:
neat_train.tail(44)

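In fact, these all look promising, since they all seem to have some relation to their respective buckets. An emoji escaped my preprocessing function, but there are few enough of them that I don't feel the need to remove them right now; they are just noise.

Also note that if you compare the tail and the head of this data, `update` is generated as a mix of templates and real Tweets.

Lost and replace: something is wrong with the product. "My iphone is hot, can you replace it?" "I lost it."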

In [None]:
# Seeing the real data for an intent (needs the `intent_indexes` dict that is commented out in the cell above)
intent_name = 'lost_replace'
view = processed.iloc[intent_indexes[intent_name]]['Real Inbound']
[*view]

# Intent Bucket Evaluation

In [None]:
# Storing word rank table dataframes in this dict
wordranks = {}

# For visualizing top 10
def top10_bagofwords(data, output_name, title):
    ''' Taking as input the data and plots the top 10 words based on counts in this text data'''
    bagofwords = CountVectorizer()
    # Output will be a sparse matrix
    inbound = bagofwords.fit_transform(data)
    # Inspecting how often contractions and colloquial language are used
    word_counts = np.array(np.sum(inbound, axis=0)).reshape((-1,))
    words = np.array(bagofwords.get_feature_names())
    words_df = pd.DataFrame({"word":words, 
                             "count":word_counts})
    words_rank = words_df.sort_values(by="count", ascending=False)
    wordranks[output_name] = words_rank
    # words_rank.to_csv('words_rank.csv') # Storing it in a csv so I can inspect and go through it myself
    # Visualizing top 10 words
    plt.figure(figsize=(12,6))
    # Keyword arguments work across seaborn versions; counts stay numeric so bar heights are meaningful
    sns.barplot(x = words_rank['word'][:10], y = words_rank['count'][:10], palette = 'inferno')
    plt.title(title)
    
    # Saving
    plt.savefig(f'visualizations/next_ver/{output_name}.png')
    
    plt.show()

In [None]:
# Doing my bucket evaluations here - seeing what each distinct bucket intent means
for i in train.columns:
    top10_bagofwords(train[i].apply(" ".join), f'bucket_eval/{i}', f'Top 10 Words in {i} Intent')

Initial thoughts:

To be honest, I feel like the way I should get my training data for greeting is not the best. There are a lot of words that are similar between buckets. As an example, for mac, it's a little concerning that iphone is the most common word!

After changing method (version 2):

The words and results make a lot more sense.

In [None]:
# Investigating bag of word frequencies at a more granular level
wordranks['bucket_eval/mac'].head(50)

In [None]:
[*train.columns]

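### Generating text files for Rasa
The Rasa API requires data in this format to feed into their bot. I use my own training data for training, but this is for experimenting with their tool.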

In [None]:
# Getting NLU.md training data in correct form for Rasa Bot
with open('data/train_rasa/train_v3.txt', 'w') as t:
    for intent in train.columns:
        t.write(f'## intent: {intent}\n')
        for tweet in train[intent]:
            t.write('- ' + " ".join(tweet) + '\n')
        t.write('\n')

### This is just a cell to log my progress of how my method was doing at first

Tweets similar to `['new', 'update', 'i️', 'make', 'sure', 'download', 'yesterday']`, without emojis:

The format is `(index tag, cosine similarity)`:

[('72326', 0.8154675364494324),
 ('32166', 0.8151031732559204),
 ('29461', 0.8027088642120361),
 ('5942', 0.7968393564224243),
 ('54836', 0.7879305481910706),
 ('30359', 0.7861931324005127),
 ('66201', 0.7817540168762207),
 ('50109', 0.7796376943588257),
 ('59490', 0.7793254852294922),
 ('46644', 0.7775745391845703),
 ('58410', 0.7734568119049072),
 ('26164', 0.7674931287765503),
 ('14867', 0.7673683166503906),
 ('25813', 0.766610860824585),
 ('47880', 0.7642890214920044),
 ('30945', 0.76273113489151),
 ('74155', 0.7582229971885681),
 ('33346', 0.7577282190322876),
 ('9502', 0.7569847702980042),
 ('64871', 0.7567278146743774)]

### Using scattertext from the spaCy universe for EDA
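This [kernel](https://www.kaggle.com/psbots/customer-support-meets-spacy-universe) showed me what the scattertext tool from the spaCy universe can do! So I wanted to try it myself, hoping to gain useful insights.

As the documentation puts it, scattertext is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in a sexy, interactive scatter plot with non-overlapping term labels."

However, the `CorpusFromParsedDocuments` used in that kernel seems to be deprecated or to have dependency issues, so I looked at the documentation and used `CorpusFromPandas`, which I think suits the data I have very well.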

In [None]:
def term_freqs(intent_name):
    bagofwords = CountVectorizer()
    # Output will be a sparse matrix
    inbound = bagofwords.fit_transform(visualize_train[visualize_train['Intent'] == intent_name]['Utterance'])
    # Inspecting how often contractions and colloquial language are used
    word_counts = np.array(np.sum(inbound, axis=0)).reshape((-1,))
    words = np.array(bagofwords.get_feature_names())
    words_df = pd.DataFrame({"word":words, 
                                 "count":word_counts})
    words_rank = words_df.sort_values(by="count", ascending=False)
    return words_rank

update_df = term_freqs('update')
repair_df = term_freqs('repair')

combined = pd.concat([update_df, repair_df], axis = 0)

In [None]:
import spacy
import scattertext as st

In [None]:
# Data munging
visualize_train = neat_train.copy()
visualize_train['Utterance'] = visualize_train['Utterance'].progress_apply(" ".join)

# Subsetting to the two intents I want to compare
visualize_train = visualize_train[(visualize_train['Intent'] == 'repair') | 
                                 (visualize_train['Intent'] == 'update')]

# Load spacy model
# The keyword for skipping pipeline components at load time is `disable`
nlp = spacy.load('en', disable=["tagger","ner"])
visualize_train['parsed'] = visualize_train['Utterance'].progress_apply(nlp)

In [None]:
visualize_train.head()

In [None]:
corpus = st.CorpusFromParsedDocuments(visualize_train,
                             category_col='Intent',
                             parsed_col='parsed').build()

In [None]:
html = st.produce_scattertext_explorer(corpus,
          category='Intent',
          category_name='repair',
          not_category_name='update',
          width_in_pixels=600,
          minimum_term_frequency=10,
        term_significance = st.LogOddsRatioUninformativeDirichletPrior(),
          )

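# Intent Classification with Keras
In my previous notebooks, the goal was to obtain labelled data for my chatbot. The focus of this notebook is to use Keras to classify the intent of new, unseen text that a user might type. Now that the labels have been generated through the unsupervised learning in the previous notebook, the modelling switches to a supervised learning approach.

### Rasa comparison
Rasa trains this intent classification step with an SVM and GridSearchCV, so that it can try different configurations ([source](https://medium.com/bhavaniravi/intent-classification-demystifying-rasanlu-part-4-685fc02f5c1d)). When deploying, the preprocessing pipeline should stay identical between training and testing.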

In [None]:
# Data science
import pandas as pd
print(f"Pandas: {pd.__version__}")
import numpy as np
print(f"Numpy: {np.__version__}")

# Deep Learning 
import tensorflow as tf
print(f"Tensorflow: {tf.__version__}")
from tensorflow import keras
print(f"Keras: {keras.__version__}")
import sklearn
print(f"Sklearn: {sklearn.__version__}")

# Visualization 
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)

import collections
import yaml
import re
import os

# Preprocessing and Keras
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import OneHotEncoder
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Bidirectional, Embedding, Dropout
from keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Input


# Reading in training data
train = pd.read_pickle('objects/train.pkl')
print(f'Training data: {train.head()}')

# Keras Preprocessing

### Keras Tokenizer
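The Keras Tokenizer builds a dictionary of all the words in the vocab and stores an index for each one. Each sequence passed in is converted word by word into the indices of that dictionary. When you feed sentences into the model, they all have to be the same length. Some Tweets are longer than others, so `pad_sequences` pads the shorter ones with 0s until they are as long as the longest one. A shorter maximum length is usually preferable, because longer sequences are harder to train on.

I have already done most of the main preprocessing, but Keras needs a few more specific things for modelling.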
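The two steps described above, integer encoding and padding, can be sketched in plain Python as follows. This imitates the behaviour in spirit only: the helper names are hypothetical, the documents are toy data, and details such as truncation (done at the end here for simplicity) and tie-breaking between equally frequent words may differ from the Keras implementation.

```python
from collections import Counter

def fit_word_index(tokenized_docs):
    """Assign integer ids by descending frequency, starting at 1 (0 is reserved for padding)."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(tokenized_docs, word_index, max_length):
    """Replace tokens by ids, drop unknown words, then pad (or cut) at the end."""
    padded = []
    for doc in tokenized_docs:
        ids = [word_index[token] for token in doc if token in word_index][:max_length]
        padded.append(ids + [0] * (max_length - len(ids)))
    return padded

docs = [['update', 'not', 'working'], ['update', 'phone'], ['hi']]
index = fit_word_index(docs)
max_len = max(len(doc) for doc in docs)
print(index)
print(encode_and_pad(docs, index, max_len))
```

Every row comes out with the same length, which is the property the model's input layer relies on.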

In [None]:
!pwd

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import one_hot

# Label encoding the target
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# For the text data
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence

# I use Keras' Tokenizer API - helpful link I followed: https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/
# Train test split
# Split in to train and test (stratify for class imbalance and random state for reproducibility)
X_train, X_val, y_train, y_val = train_test_split(train['Utterance'], train['Intent'], test_size = 0.3, 
                                                   shuffle = True, stratify = train['Intent'], random_state = 7)
print(f'\nShape checks:\nX_train: {X_train.shape} X_val: {X_val.shape}\ny_train: {y_train.shape} y_val: {y_val.shape}')

In [None]:
y_train

In [None]:
# Encoding the target variable

le = LabelEncoder()
le.fit(y_train)

y_train = le.transform(y_train)
y_val = le.transform(y_val)

In [None]:
le.classes_

In [None]:
## 1. ENCODING THE TEXT DATA

# NOTE: Since we use an embedding matrix, we use the Tokenizer API to integer encode our data - https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
t = Tokenizer()
t.fit_on_texts(X_train)

print("Document Count: \n{}\n".format(t.document_count))
# print("Word index: \n{} \n ".format(t.word_index))
# print("Word Counts: \n{} \n".format(len(t.word_counts) + 1))
# print("Word docs: \n{} \n ".format(t.word_docs))

def convert_to_padded(tokenizer, docs):
    ''' Taking in Keras API Tokenizer and documents and returns their padded version '''
    ## Using API's attributes
    # Integer encoding with the tokenizer that was passed in
    embedded = tokenizer.texts_to_sequences(docs)
    # Padding
    padded = pad_sequences(embedded, maxlen = max_length, padding = 'post')
    return padded

## Defining useful variables for later
# Adding 1 because of reserved 0 index
vocab_size = len(t.word_counts) + 1
print(f'Vocab size:\n{vocab_size}')

# Pad documents to a max length: the longest integer-encoded training document
embedded_X_train = t.texts_to_sequences(X_train)
max_length = len(max(embedded_X_train, key = len))

print(f'Max length:\n{max_length}')

padded_X_train = convert_to_padded(tokenizer = t, docs = X_train)
padded_X_val = convert_to_padded(tokenizer = t, docs = X_val)

print(f'padded_X_train\n{padded_X_train}')
print(f'padded_X_val\n{padded_X_val}')

In [None]:
padded_X_train.shape, padded_X_val.shape, y_train.shape, y_val.shape

In [None]:
padded_X_train[1]

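In the tutorial this follows, the example fits the Tokenizer on 5 small documents and prints the details of the fitted Tokenizer. The 5 documents are then encoded using word counts.

Each document is encoded as a 9-element vector, with one position per word and the value chosen by the encoding scheme at each position. In that case, the simple word count mode is used.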

### Embedding matrix

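Keras models often expect a one-hot encoded y variable. For multiclass problems, many people pass the target in as one-hot encoded vectors, but that is only one design choice. Here I keep the integer labels from `LabelEncoder` and train with `sparse_categorical_crossentropy` instead.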
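To show the difference between the two target formats, here is a small plain-Python sketch that turns string labels into integer ids and then into one-hot rows. The helpers are hypothetical stand-ins, not the scikit-learn or Keras functions, and the labels are toy data.

```python
def fit_label_index(labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def to_one_hot(label_ids, num_classes):
    """Turn integer class ids into one-hot rows."""
    return [[1 if col == label_id else 0 for col in range(num_classes)]
            for label_id in label_ids]

labels = ['update', 'battery', 'greeting', 'battery']
label_index = fit_label_index(labels)
label_ids = [label_index[label] for label in labels]

print(label_ids)                               # Integer targets
print(to_one_hot(label_ids, len(label_index)))  # One-hot targets
```

Both formats carry the same information; the integer form is simply more compact, and the loss function has to match whichever one is used.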


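If you use doc2vec embeddings instead, how do you pass in your Tweets? You would probably need to pass each one in as a whole Tweet, so look at how the Tweets are passed in; you may need to tokenize at the Tweet level. If you pass in Tweet 57, it activates the node that multiplies it with the embedding of the 57th document.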

In [None]:
# We can see that there are 4 different dimensionality options
!ls models/glove.twitter.27B

Here, we compute an index mapping words to known embeddings by parsing the data dump of pre-trained embeddings:

I use 50D because my X_train has a max_length of 32.

In [None]:
# Using gloVe word embeddings
embeddings_index = {}
# Explicit encoding, since the GloVe dump contains non-ASCII tokens
with open('models/glove.twitter.27B/glove.twitter.27B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Now we can leverage our embedding_index dictionary and our word_index to compute our embedding matrix:

In [None]:
# Initializing required objects
word_index = t.word_index
EMBEDDING_DIM = 50 # Because we are using the 50D gloVe embeddings

# Getting my embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

In [None]:
embedding_matrix, embedding_matrix.shape

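Great, now we can start modelling.

With regular word embeddings, the rows of the embedding matrix must be ordered to match the way words appear in my Keras tokenizer's word index. The tokenizer puts the most common words first, and the embedding matrix has to line up with that.

I also made sure that the embeddings are in the same order as the words in my model.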
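The alignment requirement can be illustrated with a tiny plain-Python version of the matrix construction that also reports which vocabulary words have no pretrained vector. The function name, the 2-dimensional vectors and the word lists are made up for this sketch.

```python
def build_embedding_matrix(word_index, embeddings_index, dim):
    """Row i holds the pretrained vector of the word whose tokenizer index is i."""
    # Row 0 stays all-zero because index 0 is reserved for padding
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = []
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is None:
            missing.append(word)  # This word keeps an all-zero row
        else:
            matrix[i] = list(vector)
    return matrix, missing

toy_word_index = {'update': 1, 'macbook': 2, 'lockscreen': 3}
toy_embeddings = {'macbook': [0.3, 0.4], 'update': [0.1, 0.2]}
matrix, missing = build_embedding_matrix(toy_word_index, toy_embeddings, dim=2)
print(matrix)
print(missing)
```

The order of the pretrained file does not matter; what matters is that each vector lands in the row given by the tokenizer's index. The `missing` list is a quick way to spot domain-specific words that the pretrained embeddings do not cover.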

Here I also check that domain-specific words are in my Twitter embeddings. One example is "macbook", which you can clearly see is in the embeddings file, which is good:

<img src="visualizations/macbook-glove.png" alt="Drawing" style="width: 400px;"/>

# Keras Modelling
I will create a neural network with Keras with the output layer having the same number of nodes as there are intents. The following is my architecture:

In [None]:
def make_model(vocab_size, max_token_length):
    ''' In this function I define all the layers of my neural network'''
    # Initialize
    model = Sequential()
    #model.add(Input(shape = (32,), dtype = 'int32'))

    # Adding layers - For embedding layer, I made sure to add my embedding matrix into the weights parameter
    model.add(Embedding(vocab_size, embedding_matrix.shape[1], input_length = max_token_length, 
                        trainable = False, weights = [embedding_matrix]))
    
    model.add(Bidirectional(LSTM(128)))
#    model.add(LSTM(128)) 
    # Try 100
    model.add(Dense(600, activation = "relu",kernel_regularizer ='l2')) # Try 50, another dense layer? This takes a little bit of exploration
    
    # Adding another dense layer to increase model complexity
    model.add(Dense(600, activation = "relu",kernel_regularizer ='l2'))
    
    # Randomly zero out 50 percent of the activations during training - helps with overfitting
    model.add(Dropout(0.5))
    
    # This last layer should be the size of the number of your intents!
    # Use sigmoid for multilabel classification, otherwise, use softmax!
    model.add(Dense(10, activation = "softmax"))
    
    return model

# Actually creating my model with 32 as the max token length
model = make_model(vocab_size, 32)
model.compile(loss = "sparse_categorical_crossentropy", 
              optimizer = "adam", metrics = ["accuracy"])
model.summary()

In [None]:
# Initializing checkpoint settings to view progress and save model
filename = 'models/intent_classification_b.h5'

# Learning rate scheduling
# This function keeps the initial learning rate for the first ten epochs  
# and decreases it exponentially after that.  
def scheduler(epoch, lr):
    if epoch < 10:
        return lr
    else:
        return lr * tf.math.exp(-0.1)

lr_sched_checkpoint = tf.keras.callbacks.LearningRateScheduler(scheduler)

# Early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto',
    baseline=None, restore_best_weights=True
)


# This saves the best model
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, 
                             save_best_only=True, mode='min')

# The model you get at the end of training holds the weights from the last epoch, but those
# might not be the weights with the lowest validation loss

# Only save the weights when the model has the lowest val loss so far. Early stopping

# Fitting model with all the callbacks above
hist = model.fit(padded_X_train, y_train, epochs = 20, batch_size = 32, 
                 validation_data = (padded_X_val, y_val), 
                 callbacks = [checkpoint, lr_sched_checkpoint, early_stopping])

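Note: any new test data must be in exactly the same format. So if you called `fit_on_texts` on documents that were already pre-tokenized, the strings you pass in must also be passed in as pre-tokenized strings.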

In [None]:
# Visualizing Training Loss vs Validation Loss (the loss is how wrong your model is)
plt.figure(figsize=(10,7))
plt.plot(hist.history['val_loss'], label = 'Validation Loss', color = 'cyan')
plt.plot(hist.history['loss'], label = 'Training Loss', color = 'purple')
plt.title('Training Loss vs Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Visualizing Training Accuracy vs Validation Accuracy
plt.figure(figsize=(10,7))
plt.plot(hist.history['val_accuracy'], label = 'Validation Accuracy', color = 'cyan')
plt.plot(hist.history['accuracy'], label = 'Training Accuracy', color = 'purple')
plt.title('Training Accuracy vs Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

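After 20 epochs the curve flattens out and the loss no longer changes much. The floor effect is that the loss cannot go below 0. The model very quickly learns what it needs to learn from the training data. If you keep training, you are basically overfitting to the training data: you are fitting signals that don't matter.

For example, in an image setting, once a model has learned to recognize what a cat is, it may become too specific and learn that a cat must also be black.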

### Model improvements
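The model overfits after only a few epochs. It is clearly overfitting; plot the accuracy to see it.

There is no need to train for 100 epochs.

Look into learning rate scheduling, which lowers the learning rate after a certain number of epochs.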
* Learning rate scheduling
* Early stopping or reducing epochs
* Dropout layers
* Regularization
* Improve distinctiveness between intent data

After I applied these improvements, my accuracy went up.
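To see what two of these improvements do numerically, the sketch below replays them in plain Python: a learning rate that is held for ten epochs and then decays by a factor of exp(-0.1) per epoch, and a patience-based early stopping rule. The validation losses are made-up numbers, and both functions are simplified stand-ins rather than the Keras callbacks themselves.

```python
import math

def scheduled_learning_rates(initial_lr, epochs, hold_epochs=10, decay=0.1):
    """Learning rate per epoch: constant at first, then multiplied by exp(-decay) each epoch."""
    rates, lr = [], initial_lr
    for epoch in range(epochs):
        if epoch >= hold_epochs:
            lr = lr * math.exp(-decay)
        rates.append(lr)
    return rates

def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) under a patience rule on validation loss."""
    best_epoch, best_loss = 0, val_losses[0]
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

rates = scheduled_learning_rates(0.001, epochs=12)
print(rates[9], rates[10], rates[11])

made_up_val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.50]
print(early_stopping_epoch(made_up_val_losses, patience=3))
```

With these toy losses, training stops three epochs after the best one, and the weights worth keeping are those from the best epoch rather than the last one.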

In [None]:
# I have to redefine and load in the model saved by my model checkpoint 
from keras.models import load_model
model = load_model('models/intent_classification_b.h5')

In [None]:
def infer_intent(user_input):
    ''' Making a function that receives a user input and outputs a 
    dictionary of predictions '''
    assert isinstance(user_input, str), 'User input must be a string!'
    user_input = [user_input]
    print(user_input)
    
    # Converting to Keras form
    padded_text = convert_to_padded(t, user_input)
    x = padded_text[0]
    
    # Prediction for each document
    probs = model.predict(padded_text)
#     print('Prob array shape', probs.shape)
    
    # Get the classes from label encoder
    classes = le.classes_
    
    # Getting predictions dict and sorting
    predictions = dict(zip(classes, probs[0]))
    sorted_predictions = {k: v for k, v in sorted(predictions.items(), key=lambda item: item[1], reverse = True)}
    
    return sorted_predictions

In [None]:
infer_intent('hi')

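I spent a lot of time refining the training data I feed into this model, and especially trying to work out the correct mapping for this model. What finally fixed my mapping was using a label encoder rather than a one-hot encoder for my target variable, and making sure my user input is in the correct format (it should be a list so that it passes the dimension check).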

## Sanity Checks

In [None]:
classes.shape

In [None]:
probs

In [None]:
probs.shape

In [None]:
padded_text

In [None]:
X_train[0]

In [None]:
test = [X_train[0]]

In [None]:
embedded_text = t.texts_to_sequences(test)

padded_text = pad_sequences(embedded_text, maxlen=max_length, padding='post')

In [None]:
embedded_text

In [None]:
padded_text

In [None]:
t.word_index['battery']

# Future Step: Multilabel Classification

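In the future, if I want to recognize mixed intents in an utterance, I can do multilabel classification.

For multilabel classification, use the sigmoid activation function in the last layer. You would still have about 10 different intents, but you need to model them so that each intent is independent of the others.
The prediction for intent 1 should not affect intent 2. Softmax takes the scores of all the classes together: the highest score gets the highest output probability, and all outputs sum to 1. That does not work for this case.

Instead, you classify each intent separately, and the probabilities can sum to more than 1.

It is similar to one-vs-rest logreg in the multiclass setting: one curve for class 0 versus not class 0, and the probabilities across those curves can sum to more than 1.

For class 1, it is class 1 or not class 1, and so on. You then look at the output layer and take every node whose output probability is greater than 0.5; if 2 nodes pass, those 2 are your final output. You could allow up to 3, depending on how many nodes you have.

When you feed in your target vectors, they need to be one-hot style (multi-hot) vectors with one column per intent, so the target will have 10 columns. Within each node, the two outcomes add up to one: each node has its own sigmoid function (P and 1 - P). Across nodes, the probabilities can sum to more than 1. This is one-vs-rest classification; read up on it in terms of logreg. For multilabel classification, the most important thing is that your labels are encoded this way. The loss function would be binary cross-entropy.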
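The contrast between softmax and per-node sigmoids can be checked with a few lines of plain Python. The intent names, raw scores and target below are made-up values chosen only to illustrate the point; this is not output from the model in this notebook.

```python
import math

def softmax(scores):
    """Probabilities that compete with each other and sum to 1."""
    shifted = [s - max(scores) for s in scores]  # Subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Independent probability for a single output node."""
    return 1.0 / (1.0 + math.exp(-score))

def binary_cross_entropy(targets, probs):
    """Mean binary cross-entropy over the output nodes of one example."""
    eps = 1e-12  # Avoid log(0)
    losses = [-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
              for t, p in zip(targets, probs)]
    return sum(losses) / len(losses)

intents = ['battery', 'update', 'payment']
raw_scores = [2.0, 1.5, -3.0]   # Made-up outputs of the last layer
multi_hot_target = [1, 1, 0]    # This utterance is about battery AND update

softmax_probs = softmax(raw_scores)
sigmoid_probs = [sigmoid(s) for s in raw_scores]
predicted = [name for name, p in zip(intents, sigmoid_probs) if p > 0.5]

print(sum(softmax_probs), sum(sigmoid_probs))
print(predicted)
print(binary_cross_entropy(multi_hot_target, sigmoid_probs))
```

The softmax outputs are forced to share a total of 1, so two strong intents pull probability away from each other, while the sigmoid outputs can both clear the 0.5 threshold at once.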

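Potential problem: the more classes you have, the harder the task is for your model. Accuracy tends to start dropping especially for the second and third labels.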