# Applying Machine Learning to Sentiment Analysis

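This chapter covers the following topics:

* Cleaning and preparing text data
* Building feature vectors from text documents
* Training a machine learning model to classify positive and negative movie reviews
* Working with large text datasets using out-of-core learning
* Inferring topics from document collections for categorization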

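## Preparing the IMDb movie review data for text processing

### Obtaining the movie review dataset

### Preprocessing the movie dataset into a more convenient format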

In [1]:
import pyprind
import pandas as pd
import os

# change the 'basepath' to the directory of the
# unzipped movie dataset

basepath = 'aclImdb'

labels = {'pos':1,'neg':0}
pbar = pyprind.ProgBar(50000)
# collect [text, label] rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for s in ('test','train'):
    for l in ('pos','neg'):
        path = os.path.join(basepath,s,l)
        for file in os.listdir(path):
            with open(os.path.join(path,file),
                     'r',encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt,labels[l]])
            pbar.update()
df = pd.DataFrame(rows,columns=['review','sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:04:47


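The code above first initializes a new progress bar object, pbar, with 50,000 iterations, which is the number of documents to be read. The nested for loops iterate over the train and test subdirectories of the main aclImdb directory and read the individual text files from the pos and neg subdirectories; each review is eventually stored in the pandas DataFrame df together with an integer class label (1 = positive, 0 = negative).

Because the class labels in the assembled dataset are sorted, we now shuffle the DataFrame using the permutation function from the np.random submodule. This will be useful later when we split the dataset into training and test sets.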

In [2]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv',index=False,encoding='utf-8')

In [3]:
df = pd.read_csv('movie_data.csv',encoding='utf-8')
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


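## Introducing the bag-of-words model

The idea behind the bag-of-words model is quite simple and can be summarized as follows:

1. Create a vocabulary of unique tokens (for example, words) from the entire set of documents.
2. Construct a feature vector for each document that contains the counts of how often each word occurs in that particular document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors consist mostly of zeros, which is why we call them sparse vectors.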
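
The two steps above can be sketched in plain Python, without scikit-learn. This is only a minimal illustration: the helper names `tokenize`, `build_vocabulary`, and `to_count_vector` are hypothetical, and the tokenization rule (lowercase, words of two or more characters) is an assumption chosen to resemble the default behavior of `CountVectorizer`.

```python
import re
from collections import Counter

def tokenize(doc):
    # Assumption: lowercase the text and keep runs of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def build_vocabulary(docs):
    # Step 1: map every unique token in the corpus to a column index
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_count_vector(doc, vocabulary):
    # Step 2: count how often each vocabulary entry occurs in one document
    counts = Counter(tokenize(doc))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        vector[vocabulary[tok]] = n
    return vector

example_docs = [
    'The sun is shining',
    'The weather is sweet',
]
vocab = build_vocabulary(example_docs)
vectors = [to_count_vector(d, vocab) for d in example_docs]
print(vocab)
print(vectors)
```

Even in this tiny example each vector contains zeros for the words its document does not use; with a realistic vocabulary of many thousands of words, almost all entries would be zero.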

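### Transforming words into feature vectors

To construct a bag-of-words model based on the word counts in the respective documents, we can use the CountVectorizer class implemented in scikit-learn.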

In [4]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'
])
bag = count.fit_transform(docs)

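By calling the fit_transform method of CountVectorizer, we constructed the vocabulary of the bag-of-words model and transformed the three sentences above into sparse feature vectors.

Now let us print the contents of the vocabulary to get a better understanding of the underlying concepts: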

In [5]:
print(count.vocabulary_)
print(bag.toarray())

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


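Each index position in the feature vectors corresponds to the integer value stored for a word in the CountVectorizer vocabulary dictionary. For example, the first feature at index position 0 is the count of the word 'and', which occurs only in the last document, and the word 'is' at index position 1 (the second feature in the document vectors) occurs in all three sentences. The values in the feature vectors are also called the __raw term frequencies__: tf(t,d), the number of times a term t occurs in a document d.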

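> The bag-of-words model we just created is also called a __1-gram__ or __unigram__ model: each item or token in the vocabulary represents a single word. More generally, contiguous sequences of items in NLP (words, letters, or symbols) are called __n-grams__. The choice of n in the n-gram model depends on the particular application. For example, a study by Kanaris and others found that n-grams of size 3 and 4 performed best in anti-spam filtering. To summarize the concept, the 1-gram and 2-gram representations of the first document, "the sun is shining", are constructed as follows:
> * 1-gram: "the", "sun", "is", "shining"
> * 2-gram: "the sun", "sun is", "is shining"

> The CountVectorizer class in scikit-learn lets us use different n-gram models via its ngram_range parameter. A 1-gram representation is used by default; we can switch to a 2-gram representation by initializing a new CountVectorizer instance with ngram_range=(2,2).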
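
The n-gram construction described in the note above can be sketched with a few lines of plain Python. The function name `ngrams` is a hypothetical helper for illustration and is not part of scikit-learn.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list and join each window
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'the sun is shining'.split()
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
```

A document with fewer than n tokens simply yields an empty list.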

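### Assessing word relevance via term frequency-inverse document frequency

__Term frequency-inverse document frequency (tf-idf)__ can be used to downweight words that occur frequently in the feature vectors. The tf-idf is defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=tf(t,d)\times idf(t,d)$$

Here tf(t,d) is the term frequency introduced in the previous section, and idf(t,d) is the inverse document frequency, which is calculated as follows:

$$idf(t,d)=\log \frac{n_d}{1+df(d,t)}$$

Here $n_d$ is the total number of documents and $df(d,t)$ is the number of documents d that contain the term t. Note that adding the constant 1 to the denominator is optional; it serves to assign a non-zero value to terms that occur in all training samples. The log is used to ensure that low document frequencies are not given too much weight.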
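
As a small worked example of the textbook definition above, the sketch below evaluates the inverse document frequency for a corpus of three documents. It assumes the natural logarithm (the text does not fix the base), and `textbook_idf` is a hypothetical helper name.

```python
import math

def textbook_idf(n_docs, doc_freq):
    # idf(t, d) = log(n_d / (1 + df(d, t))), natural logarithm assumed
    return math.log(n_docs / (1 + doc_freq))

n_docs = 3
idf_common = textbook_idf(n_docs, doc_freq=3)  # word found in every document
idf_rare = textbook_idf(n_docs, doc_freq=1)    # word found in one document
print(round(idf_common, 3), round(idf_rare, 3))
```

Under these assumptions the word that occurs in all three documents receives a small negative value (about -0.288) rather than exactly zero, while the rarer word receives a larger weight (about 0.405).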

scikit-learn implements another transformer, the TfidfTransformer class, which takes the raw term frequencies from CountVectorizer as input and transforms them into tf-idfs:

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


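As we saw in the previous section, the word 'is' had the largest term frequency in the third document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, the word 'is' is associated with a relatively small tf-idf (0.45) in the third document, because it also occurs in the first and second documents and is therefore unlikely to carry any useful discriminatory information.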

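However, if we calculate the tf-idfs of the individual terms in the feature vectors by hand, we notice that TfidfTransformer computes them slightly differently from the standard textbook equations defined earlier. The equation for the inverse document frequency implemented in scikit-learn is:
$$idf(t,d)=\log\frac{1+n_d}{1+df(d,t)}$$
Similarly, the tf-idf computed in scikit-learn deviates slightly from the default equation defined earlier:
$$\text{tf-idf}(t,d)=tf(t,d)\times (idf(t,d)+1)$$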

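While it is more typical to normalize the raw term frequencies before calculating the tf-idfs, TfidfTransformer normalizes the tf-idfs directly. By default (norm='l2'), scikit-learn's TfidfTransformer applies L2 normalization, which returns a vector of length 1 by dividing an unnormalized feature vector v by its L2 norm:
$$v_{norm}=\frac{v}{\|v\|_2}=\frac{v}{\sqrt{v_1^2+v_2^2+\dots+v_n^2}}=\frac{v}{\left(\sum_{i=1}^n v_i^2\right)^{\frac{1}{2}}}$$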
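
The L2 normalization above amounts to dividing each component by the Euclidean length of the vector. A minimal sketch in plain Python follows; `l2_normalize` is a hypothetical helper, and the handling of an all-zero vector is an assumption made here to avoid division by zero.

```python
import math

def l2_normalize(vector):
    # Divide every component by the Euclidean length of the vector
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        return list(vector)  # assumption: an all-zero vector is returned unchanged
    return [x / length for x in vector]

v = [3.0, 4.0]
v_norm = l2_normalize(v)
print(v_norm)
print(math.sqrt(sum(x * x for x in v_norm)))  # length after normalization
```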

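To make sure we understand how TfidfTransformer works, let us walk through an example and calculate the tf-idf of the word 'is' in the third document.

The word 'is' has a term frequency of 3 (tf=3) in the third document, and its document frequency is also 3 because it occurs in all three documents (df=3). We can therefore calculate the inverse document frequency as follows:
$$idf("is",d3)=\log\frac{1+3}{1+3}=0$$
Now, to calculate the tf-idf, we simply add 1 to the inverse document frequency and multiply it by the term frequency:
$$\text{tf-idf}("is",d3)=3\times(0+1)=3$$
If we repeat this calculation for all terms in the third document, we obtain the tf-idf vector [3.39,3.0,3.39,1.29,1.29,1.29,2.0,1.69,1.29]. Note, however, that the values in this feature vector differ from those we obtained from TfidfTransformer earlier. The final step missing from this tf-idf calculation is the L2 normalization:
$$\text{tf-idf}(d3)_{norm}=\frac{[3.39,3.0,3.39,1.29,1.29,1.29,2.0,1.69,1.29]}{\sqrt{3.39^2+3.0^2+3.39^2+1.29^2+1.29^2+1.29^2+2.0^2+1.69^2+1.29^2}}$$
$$\text{tf-idf}("is",d3)_{norm}=0.45$$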
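
The hand calculation above can be reproduced with the standard library alone. The sketch below is an illustration rather than scikit-learn's implementation: it assumes the natural logarithm and a tokenization rule (lowercase, words of two or more characters) that mimics the earlier CountVectorizer output, and the names `tokenize` and `smoothed_idf` are hypothetical.

```python
import math
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(doc):
    # Assumption: lowercase tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))  # alphabetical column order
n_docs = len(docs)

def smoothed_idf(term):
    # idf(t, d) = log((1 + n_d) / (1 + df(d, t)))
    df = sum(1 for c in counts if term in c)
    return math.log((1 + n_docs) / (1 + df))

# tf-idf(t, d) = tf(t, d) * (idf(t, d) + 1) for the third document
raw = [counts[2][t] * (smoothed_idf(t) + 1) for t in vocab]
length = math.sqrt(sum(x * x for x in raw))
normalized = [x / length for x in raw]

print([round(x, 2) for x in raw])
print([round(x, 2) for x in normalized])
```

The first printed list matches the unnormalized vector in the text, and the second matches the third row of the TfidfTransformer output shown earlier.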

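### Cleaning text data

Before we build the bag-of-words model, the first important step is to clean the text data by stripping it of all unwanted characters.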

In [7]:
df.loc[0,'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [8]:
df.loc[0,'review']

'In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich famil

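As we can see here, the text contains HTML markup as well as punctuation and other non-letter characters. While HTML markup does not contain much useful semantics, punctuation marks can carry useful additional information in certain NLP contexts. For simplicity, however, we will remove all punctuation marks except for emoticon characters such as :), since these are certainly useful for sentiment analysis.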

In [9]:
import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>','',text)
    # emoticon "eyes" may be ':', ';' or '='
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub(r'[\W]+',' ',text.lower()) +
            ' '.join(emoticons).replace('-',''))
    return text

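The first regex, <[^>]\*>, attempts to remove all HTML markup from the movie reviews. Next, a slightly more complex regex finds the emoticons, which are stored temporarily as emoticons. Finally, the regex [\W]+ removes all non-word characters, and the text is converted to lowercase.

Although appending the emoticon characters to the end of the cleaned document string may not look like the most elegant approach, note that the order of the words does not matter in a bag-of-words model if the vocabulary consists only of one-word tokens.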
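
To make the individual cleaning stages easier to follow, the sketch below applies them one at a time to a made-up sample string and keeps every intermediate result. The sample text and variable names are illustrative assumptions; the emoticon pattern accepts ':', ';' or '=' as the eyes.

```python
import re

sample = '<b>Great</b> movie :-) !!'

# Stage 1: drop anything that looks like an HTML tag
no_html = re.sub(r'<[^>]*>', '', sample)

# Stage 2: collect emoticons before the punctuation is removed
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# Stage 3: lowercase and collapse every run of non-word characters to one space
words_only = re.sub(r'[\W]+', ' ', no_html.lower())

# Stage 4: append the emoticons, with the optional nose character removed
cleaned = words_only + ' '.join(emoticons).replace('-', '')

print(repr(no_html))
print(emoticons)
print(repr(cleaned))
```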

In [10]:
preprocessor(df.loc[0,'review'][-50:])

'is seven title brazil not available'

In [11]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [12]:
df['review'] = df['review'].apply(preprocessor) # clean every review

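### Processing documents into tokens

One way to tokenize documents is to split the cleaned documents into individual words at their whitespace characters: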

In [13]:
def tokenizer(text):
    return text.split()
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

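Another useful technique in the context of tokenization is __word stemming__, the process of transforming a word into its root form so that related words map to the same stem. The original stemming algorithm, developed by Martin Porter in 1979, is known as the __Porter stemmer__ algorithm.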

In [14]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

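Using the PorterStemmer class from the NLTK package, we modified the tokenizer function to reduce related words to their root form.

__Stop-word removal.__ Removing stop words can be useful when working with raw or normalized term frequencies rather than tf-idfs, since tf-idfs already downweight frequently occurring words.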

In [15]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lewisbase\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [16]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

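## Training a logistic regression model for document classification

First we divide the DataFrame of cleaned text documents into a training set and a test set: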

In [17]:
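# .loc slices include the end label, so stop at 24999 to keep the
# training and test sets disjoint (25,000 reviews each)
X_train = df.loc[:24999,'review'].values
y_train = df.loc[:24999,'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test = df.loc[25000:,'sentiment'].values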

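Next we use a GridSearchCV object to find the optimal set of parameters for the logistic regression model using 5-fold stratified cross-validation: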

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range':[(1,1)],
               'vect__stop_words':[stop,None],
               'vect__tokenizer':[tokenizer,
                                  tokenizer_porter],
               'clf__penalty':['l1','l2'],
               'clf__C':[1.0,10.0,100.0]},
              {'vect__ngram_range':[(1,1)],
               'vect__stop_words':[stop,None],
               'vect__tokenizer':[tokenizer,
                                 tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty':['l1','l2'],
               'clf__C':[1.0,10.0,100.0]}
             ]
# the 'l1' penalty needs a solver that supports it, e.g. 'liblinear'
lr_tfdif = Pipeline([('vect',tfidf),
                     ('clf',
                      LogisticRegression(random_state=0,
                                         solver='liblinear'))])
gs_lr_tfdif = GridSearchCV(lr_tfdif,param_grid,
                           scoring='accuracy',
                           cv=5,verbose=1,
                           n_jobs=2)
gs_lr_tfdif.fit(X_train,y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


KeyboardInterrupt: 
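
The candidate count reported in the log above follows directly from the parameter grid: each of the two dictionaries contributes the product of its option counts, and every candidate is fitted once per fold. The sketch below redoes that arithmetic in plain Python; the dictionary keys are shortened labels used only for illustration.

```python
from math import prod

# Number of options per parameter in each of the two sub-grids
grid_sizes = [
    {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2,
     'penalty': 2, 'C': 3},
    {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2,
     'use_idf': 1, 'norm': 1, 'penalty': 2, 'C': 3},
]

candidates = sum(prod(sizes.values()) for sizes in grid_sizes)
n_folds = 5
total_fits = candidates * n_folds
print(candidates, total_fits)
```

Because the number of fits grows multiplicatively with every added option, it is worth doing this count before launching a search on 25,000 documents.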

Once the grid search has finished, we can print the best parameter set:

In [74]:
print('Best parameter set: %s'%gs_lr_tfdif.best_params_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l1', 'vect__ngram_range': (1, 1), 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each'

In [75]:
print('CV Accuracy:%.3f'
      % gs_lr_tfdif.best_score_)

CV Accuracy:0.506


In [76]:
clf = gs_lr_tfdif.best_estimator_
print('Test Accuracy: %.3f'
      % clf.score(X_test,y_test))

Test Accuracy: 0.506


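## Working with bigger data: online algorithms and out-of-core learning

__Out-of-core learning__ makes it possible to work with large datasets by fitting the classifier incrementally on small batches of the data.

In this section we use the partial_fit function of scikit-learn's SGDClassifier to stream the documents directly from the local drive and train a logistic regression model on small mini-batches of documents.

First we define a tokenizer function that cleans the unprocessed text data from the movie_data.csv file and splits it into word tokens while removing stop words: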

In [7]:
import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub(r'<[^>]*>','',text)
    # note: matching on text.lower() means ':D' and ':P' are never found
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = re.sub(r'[\W]+',' ',text.lower()) \
            + ' '.join(emoticons).replace('-','')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

In [78]:
def stream_docs(path):
    with open(path, 'r',encoding='utf-8') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3],int(line[-2])
            yield text,label

In [79]:
next(stream_docs(path='movie_data.csv'))

('"In the New Year\'s Eve, the tuberculous sister of the Salvation Army Edit (Astrid Holm) asks her mother and her colleague Maria (Lisa Lundholm) to call David Holm (Victor Sjöström) to visit her in her deathbed. Meanwhile, the alcoholic David is telling to two other drunkards in the cemetery the legend of the Phantom Coach and his coachman: in accordance with the legend, the last sinner to die in the turn of the New Year becomes the soul collector, gathering souls in his coach. When David denies to visit Edit, his friends have an argument with him, they fight and David dies. When the coachman arrives, he recognizes his friend Georges (Tore Svennberg), who died in the end of the last year. George revisits parts of David\'s obnoxious life and in flashbacks, he shows how mean and selfish David was.<br /><br />""Körkarlen"" is an impressive and stylish silent movie, with magnificent special effects (for a 1921 movie). The characters are very well developed; however, the story is dated an

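We now define a function, get_minibatch, that takes a document stream from the stream_docs function and returns a particular number of documents specified by the size parameter: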

In [100]:
def get_minibatch(doc_stream,size):
    docs,y = [],[]
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
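
Note that get_minibatch, as written, discards a final batch that is shorter than size, because it returns None, None as soon as the stream is exhausted. The sketch below shows one possible variant that keeps the remainder. The name `iter_minibatches` is hypothetical, and the toy stream merely stands in for stream_docs.

```python
from itertools import islice

def iter_minibatches(doc_stream, size):
    # Yield (docs, labels) lists of up to `size` items until the stream ends
    while True:
        batch = list(islice(doc_stream, size))
        if not batch:
            return
        docs, labels = zip(*batch)
        yield list(docs), list(labels)

# Toy stream of (text, label) pairs standing in for stream_docs(...)
toy_stream = iter([('good film', 1), ('bad film', 0), ('great', 1),
                   ('awful', 0), ('fine', 1)])
for docs, labels in iter_minibatches(toy_stream, size=2):
    print(docs, labels)
```

For the 50,000-review file used here the difference does not matter, since 45 training batches of 1,000 plus one test batch of 5,000 consume the stream exactly.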

Another useful vectorizer implemented in scikit-learn is HashingVectorizer, which is data-independent and relies on the hashing trick.

In [101]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
# 'log_loss' selects logistic regression (it was named 'log' before scikit-learn 1.1)
clf = SGDClassifier(loss='log_loss',random_state=1,max_iter=1)
doc_stream = stream_docs(path='movie_data.csv')

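We initialized HashingVectorizer with our tokenizer function and set the number of features to 2\*\*21. We also reinitialized a logistic regression classifier by setting the loss parameter of SGDClassifier to the logistic loss (spelled 'log' in older scikit-learn releases and 'log_loss' from version 1.1 onward). Note that choosing a large number of features in HashingVectorizer reduces the chance of hash collisions, but it also increases the number of coefficients in the logistic regression model.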
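
The trade-off between the number of features and hash collisions can be illustrated with a toy version of the hashing trick. This is only a sketch under stated assumptions: `hashed_counts` is a hypothetical helper, and MD5 is used purely because it gives stable results across runs; it is not the hash function that scikit-learn uses.

```python
import hashlib

def hashed_counts(tokens, n_features):
    # Map each token to a column by hashing, so no vocabulary is stored.
    # Assumption: MD5 stands in for the real hash function.
    vector = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        vector[int(digest, 16) % n_features] += 1
    return vector

tokens = 'this movie was a great great movie'.split()
small = hashed_counts(tokens, n_features=4)      # 5 distinct words, 4 columns
large = hashed_counts(tokens, n_features=2**21)  # collisions become unlikely
print(small)
print(sum(1 for x in large if x))  # number of occupied columns
```

With only 4 columns, at least two of the 5 distinct words must share a column, so their counts are merged. The larger table makes this unlikely, at the cost of one model coefficient per column.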

In [102]:
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0,1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream,size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train,y_train,classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:18


In [103]:
X_test, y_test = get_minibatch(doc_stream,size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test,y_test))

Accuracy: 0.868


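## Topic modeling with Latent Dirichlet Allocation

Topic modeling describes the broad task of assigning topics to unlabeled text documents. This section introduces __Latent Dirichlet Allocation (LDA)__.

### Decomposing text documents with LDA

LDA is a generative probabilistic model that tries to find groups of words that appear frequently together across different documents. Assuming that each document is a mixture of different words, these frequently co-occurring words represent the topics.

Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:
* A document-to-topic matrix
* A word-to-topic matrix

A downside of LDA is that the number of topics is a hyperparameter that must be specified in advance.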

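### LDA with scikit-learn

In this section we use the LatentDirichletAllocation class implemented in scikit-learn to decompose the movie review dataset and categorize it into different topics. We restrict the analysis to 10 topics.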

In [1]:
import pandas as pd
df = pd.read_csv('movie_data.csv',encoding='utf-8')

Next we use CountVectorizer to create the bag-of-words matrix that serves as input to LDA:

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)

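Note that we set the maximum document frequency of the words to be considered to 10% (max_df=.1) in order to exclude words that occur too frequently across documents.

We also limit the number of words to be considered to 5000 (max_features=5000).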

In [3]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

In [4]:
n_top_words = 5
# get_feature_names() was removed in scikit-learn 1.2
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d" % (topic_idx + 1))
    print(" ".join([feature_names[i]
                    for i in topic.argsort() \
                       [:-n_top_words - 1:-1]]))

Topic 1
worst minutes awful script stupid
Topic 2
family mother father children girl
Topic 3
american dvd music tv war
Topic 4
human audience cinema art sense
Topic 5
police guy car dead murder
Topic 6
horror house sex gore blood
Topic 7
role performance comedy actor performances
Topic 8
series war episode episodes season
Topic 9
book version original effects read
Topic 10
action fight guy guys cool
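
The slice `[:-n_top_words - 1:-1]` used in the loop above reads the last n_top_words entries of the ascending argsort result back to front, which gives the indices of the largest weights in descending order. A plain-Python sketch of the same idea follows; `top_indices` is a hypothetical helper and the weights are made-up numbers.

```python
def top_indices(weights, n_top):
    # Indices sorted by ascending weight (what an argsort returns) ...
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # ... then read the last n_top of them back to front
    return ascending[:-n_top - 1:-1]

weights = [0.1, 0.7, 0.05, 0.4, 0.3]  # hypothetical word weights for one topic
best = top_indices(weights, 3)
print(best)
```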


In [5]:
horror = X_topics[:,5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300],'...')


Horror movie #1:
Emilio Miraglia's first Giallo feature, The Night Evelyn Came Out of the Grave, was a great combination of Giallo and Gothic horror - and this second film is even better! We've got more of the Giallo side of the equation this time around, although Miraglia doesn't lose the Gothic horror stylings tha ...

Horror movie #2:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #3:
This film marked the end of the "serious" Universal Monsters era (Abbott and Costello meet up with the monsters later in "Abbott and Costello Meet Frankentstein"). It was a somewhat desparate, yet fun attempt to revive the classic monsters of the Wolf Man, Frankenstein's monster, and Dracula one "la ...


In [6]:
X_topics

array([[0.00133369, 0.36123602, 0.00133371, ..., 0.0013338 , 0.00133385,
        0.00133371],
       [0.68446348, 0.00175486, 0.001755  , ..., 0.00175481, 0.00175509,
        0.05914702],
       [0.07260985, 0.00217456, 0.00217436, ..., 0.00217423, 0.65445083,
        0.00217429],
       ...,
       [0.0055566 , 0.0055565 , 0.00555695, ..., 0.00555708, 0.00555655,
        0.00555733],
       [0.3517427 , 0.00117687, 0.00117709, ..., 0.0421556 , 0.0011767 ,
        0.00117678],
       [0.00243961, 0.20154876, 0.00244013, ..., 0.00244037, 0.51408794,
        0.00243966]])
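
Each row of X_topics holds the weights of one document over the 10 topics. A minimal sketch of turning such rows into a single topic label per document is shown below. The rows are made-up numbers with only three topics, not values from the dataset, and `dominant_topic` is a hypothetical helper.

```python
def dominant_topic(row):
    # Return the 1-based number of the topic with the largest weight
    best = max(range(len(row)), key=lambda k: row[k])
    return best + 1

# Hypothetical document-topic weights for two documents and three topics
rows = [
    [0.10, 0.85, 0.05],
    [0.60, 0.10, 0.30],
]
labels = [dominant_topic(r) for r in rows]
print(labels)
```

The 1-based numbering matches the "Topic 1" to "Topic 10" labels printed earlier, whereas indexing into X_topics itself is 0-based (column 5 is topic 6).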