# Lab 5

## Libraries

- [Natural Language Toolkit (NLTK)](https://www.nltk.org/) - one of the oldest and best-known libraries.
- [spacy](https://spacy.io/) - currently one of the most full-featured libraries for natural language processing; it also supports Russian. See the [description on nlpub](https://nlpub.ru/SpaCy) and an [article](https://habr.com/ru/post/531940/).
- [natasha](https://github.com/natasha/natasha) - originally created as a library for Russian. [Article with a description.](https://habr.com/ru/post/516098/)
- [pymorphy2](https://pymorphy2.readthedocs.io/en/stable/) - its main task is lemmatization.
- [pymystem3](https://github.com/nlpub/pymystem3) - a wrapper around the library at https://yandex.ru/dev/mystem/. Its main task is also lemmatization.

In [19]:
import numpy as np
import pandas as pd
from typing import Dict, Tuple
from scipy import stats
from IPython.display import Image
from sklearn.datasets import load_iris  # load_boston was removed in scikit-learn 1.2 and is not used here
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score 
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.svm import SVC, NuSVC, LinearSVC, OneClassSVM, SVR, NuSVR, LinearSVR
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
sns.set(style="ticks")

## Loading the data and initial analysis

We use the [Sentiment140](http://help.sentiment140.com/for-students/) dataset.

In [20]:
filename="../RK2/trainingandtestdata/testdata.manual.2009.06.14.csv"
#filename="../RK2/trainingandtestdata/training.1600000.processed.noemoticon.csv"

In [21]:
# Load the sentiment140 data
# on_bad_lines='skip' replaces error_bad_lines=False, which is deprecated since pandas 1.3
sentiment140_df = pd.read_csv(filename, delimiter=None, header=None, encoding='utf-8', on_bad_lines='skip')





## Preprocessing the text data

In [22]:
sentiment140_df.shape

(498, 6)

In [23]:
sentiment140_df = pd.DataFrame(sentiment140_df,columns=[0,5])
sentiment140_df.columns = ['value','text']
# Remove @username mentions (an '@' handle followed by whitespace)
sentiment140_df['text'] = sentiment140_df['text'].str.replace(r"\@[a-zA-Z0-9_]{1,}\s",'',regex=True)
# Remove '#' characters (the hashtag text itself is kept)
sentiment140_df['text'] = sentiment140_df['text'].str.replace('#','',regex=False)
sentiment140_df.head()

|   | value | text |
|---|-------|------|
| 0 | 4 | I loooooooovvvvvveee my Kindle2. Not that the ... |
| 1 | 4 | Reading my kindle2... Love it... Lee childs i... |
| 2 | 4 | Ok, first assesment of the kindle2 ...it fucki... |
| 3 | 4 | You'll love your Kindle2. I've had mine for a ... |
| 4 | 4 | Fair enough. But i have the Kindle2 and I thi... |
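The effect of the two cleaning steps in the cell above can be checked on a single string. The sketch below uses only the standard `re` module; `clean_tweet` and the sample tweets are illustrative names and data, not part of the notebook. Note that the pattern requires whitespace after the handle, so a mention at the very end of a tweet is left in place.

```python
import re

# Same idea as the pattern above: '@', one or more handle characters,
# then exactly one whitespace character.
MENTION = re.compile(r"@[a-zA-Z0-9_]+\s")

def clean_tweet(text: str) -> str:
    """Drop @mentions that are followed by whitespace, then drop '#' signs."""
    without_mentions = MENTION.sub("", text)
    return without_mentions.replace("#", "")

print(clean_tweet("@alice I love my #Kindle2"))  # I love my Kindle2
print(clean_tweet("thanks @bob"))                # thanks @bob
```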


In [24]:
# Collect all texts into one list; the vocabulary for the models is built from it
vocab_list = sentiment140_df['text'].tolist()
vocab_list[1:10]

['Reading my kindle2...  Love it... Lee childs is good read.',
 'Ok, first assesment of the kindle2 ...it fucking rocks!!!',
 "You'll love your Kindle2. I've had mine for a few months and never looked back. The new big one is huge! No need for remorse! :)",
 " Fair enough. But i have the Kindle2 and I think it's perfect  :)",
 "no. it is too big. I'm quite happy with the Kindle2.",
 'Fuck this economy. I hate aig and their non loan given asses.',
 'Jquery is my new best friend.',
 'Loves twitter',
 'how can you not love Obama? he makes jokes about himself.']

### Spacy

In [25]:
from spacy.lang.en import English
import spacy
import spacy.cli 
#spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")

### Tokenization

In [26]:
spacy_text = [nlp(v) for v in vocab_list]
for t in spacy_text[0]:
    print(t)

I
loooooooovvvvvveee
my
Kindle2
.
Not
that
the
DX
is
cool
,
but
the
2
is
fantastic
in
its
own
right
.


In [27]:
spacy_text[1]

Reading my kindle2...  Love it... Lee childs is good read.

In [28]:
spacy_text[2]

Ok, first assesment of the kindle2 ...it fucking rocks!!!
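For intuition only, tokenization can be roughly approximated with one regular expression. This is a toy sketch, not spaCy's algorithm, and `simple_tokenize` is a hypothetical helper: it splits every punctuation mark into its own token, so it would break `:)` or `'s` apart, whereas spaCy keeps them whole (as later outputs in this notebook show).

```python
import re
from typing import List

def simple_tokenize(text: str) -> List[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Not that the DX is cool, but the 2 is fantastic."))
# ['Not', 'that', 'the', 'DX', 'is', 'cool', ',', 'but', 'the', '2', 'is', 'fantastic', '.']
```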

### Part-of-speech tagging (POS tagging)

Some libraries run part-of-speech tagging first and then perform lemmatization based on its result.
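Why the tag matters for lemmatization can be shown with a toy lookup table: the same surface form maps to different lemmas depending on its part of speech. The table `LEMMA_TABLE` and the function `lemmatize` below are hypothetical and cover only a few words; real libraries use far larger rule sets and lexicons.

```python
from typing import Dict, Tuple

# Hypothetical toy lexicon: (lowercased word, POS tag) -> lemma.
LEMMA_TABLE: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}

def lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("saw", "VERB"))     # see
print(lemmatize("saw", "NOUN"))     # saw
print(lemmatize("Leaves", "NOUN"))  # leaf
```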

In [29]:
for token in spacy_text[0]:
    print('{} - {} - {}'.format(token.text, token.pos_, token.dep_))

I - PRON - nsubj
loooooooovvvvvveee - VERB - ROOT
my - PRON - poss
Kindle2 - PROPN - dobj
. - PUNCT - punct
Not - PART - neg
that - SCONJ - mark
the - DET - det
DX - PROPN - nsubj
is - AUX - ROOT
cool - ADJ - acomp
, - PUNCT - punct
but - CCONJ - cc
the - DET - det
2 - NUM - nsubj
is - AUX - conj
fantastic - ADJ - acomp
in - ADP - prep
its - PRON - poss
own - ADJ - amod
right - NOUN - pobj
. - PUNCT - punct


### Lemmatization

In [30]:
for token in spacy_text[4]:
    # token.lemma is the integer hash ID of the lemma; token.lemma_ is its string form
    print(token, token.lemma, token.lemma_)

  8532415787641010193  
Fair 6400965522219763706 Fair
enough 5083403373732563023 enough
. 12646065887601541794 .
But 14560795576765492085 but
i 4690420944186131903 I
have 14692702688101715474 have
the 7425985699627899538 the
Kindle2 16112180422796512521 Kindle2
and 2283656566040971221 and
I 4690420944186131903 I
think 16875814820671380748 think
it 10239237003504588839 it
's 10382539506755952630 be
perfect 1665682026658446649 perfect
  8532415787641010193  
:) 5920004935509210957 :)


### Named-entity recognition (NER)

In [31]:
for ent in spacy_text[100].ents:
    print(ent.text, ent.label_)

Cheney PERSON
Bush PERSON


In [32]:
from spacy import displacy
displacy.render(spacy_text[100], style='ent', jupyter=True)

In [33]:
print(spacy.explain("PERSON"))

People, including fictional


### Dependency parsing

In [34]:
displacy.render(spacy_text[0], style='dep', jupyter=True)

In [35]:
displacy.render(spacy_text[3], style='dep', jupyter=True)

## Classifying the text data

### Method 1. Using CountVectorizer or TfidfVectorizer

Using the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class.

It counts how many times each vocabulary word occurs in a given text.
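The counting step can be sketched in plain Python. This is an illustration of the idea, not scikit-learn's implementation; the helper names and the two-document corpus are made up. The token pattern (lowercased tokens of two or more word characters) follows CountVectorizer's documented default.

```python
import re
from collections import Counter
from typing import Dict, List

def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(corpus: List[str]) -> Dict[str, int]:
    """Map every distinct token in the corpus to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(text: str, vocabulary: Dict[str, int]) -> List[int]:
    """Count occurrences of each vocabulary token; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

corpus = ["Loves twitter", "love love my kindle"]
vocab = build_vocabulary(corpus)
print(vocab)  # {'kindle': 0, 'love': 1, 'loves': 2, 'my': 3, 'twitter': 4}
print(count_vector("love love my kindle", vocab))  # [1, 2, 0, 1, 0]
```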

Using the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class.

It weights each word by how specific it is to a text within the corpus, using the [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) metric.
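A minimal sketch of the weighting, assuming scikit-learn's documented defaults (raw term counts, smoothed `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The function `tfidf` and the three tokenized documents are illustrative only; the sketch does not reproduce the library's sparse matrices.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf(corpus: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in corpus for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1.0 for tok, d in df.items()}
    rows = []
    for doc in corpus:
        weights = {tok: n * idf[tok] for tok, n in Counter(doc).items()}
        # L2-normalize the row; an empty document keeps norm 1 to avoid dividing by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

docs = [["love", "my", "kindle"], ["hate", "my", "phone"], ["love", "twitter"]]
rows = tfidf(docs)
for row in rows:
    print({tok: round(w, 3) for tok, w in row.items()})
```

In the first document "kindle" occurs in only one document of the corpus, so it receives a larger weight than "my" or "love", which each occur in two.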

In [51]:
#filename="../RK2/trainingandtestdata/testdata.manual.2009.06.14.csv"
filename="../RK2/trainingandtestdata/training.1600000.processed.noemoticon.csv"
# Load the sentiment140 data
# on_bad_lines='skip' replaces error_bad_lines=False, which is deprecated since pandas 1.3
sentiment140_df = pd.read_csv(filename, delimiter=None, header=None, encoding='utf-8', on_bad_lines='skip')





In [52]:
sentiment140_df = pd.DataFrame(sentiment140_df,columns=[0,5])
sentiment140_df.columns = ['value','text']
# Remove @username mentions (an '@' handle followed by whitespace)
sentiment140_df['text'] = sentiment140_df['text'].str.replace(r"\@[a-zA-Z0-9_]{1,}\s",'',regex=True)
# Remove '#' characters (the hashtag text itself is kept)
sentiment140_df['text'] = sentiment140_df['text'].str.replace('#','',regex=False)
# Collect all texts into one list; the vocabulary for the models is built from it
vocab_list = sentiment140_df['text'].tolist()

In [53]:
def VectorizeAndClassify(vectorizers_list, classifiers_list):
    """Cross-validate every vectorizer/classifier pair on the tweets and print the mean accuracy."""
    for v in vectorizers_list:
        for c in classifiers_list:
            # The vectorizer and the classifier are fitted together inside each CV fold
            pipeline1 = Pipeline([("vectorizer", v), ("classifier", c)])
            score = cross_val_score(pipeline1, sentiment140_df['text'], sentiment140_df['value'], scoring='accuracy', cv=3).mean()
            print('Vectorization - {}'.format(v))
            print('Classification model - {}'.format(c))
            print('Accuracy = {}'.format(score))
            print('===========================')
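What `cv=3` followed by `.mean()` computes can be sketched without scikit-learn. The code below is a simplified illustration under stated assumptions: it uses plain contiguous folds (scikit-learn uses stratified folds for classifiers), and `majority_baseline` is a hypothetical stand-in model that always predicts the most frequent training label. The texts and labels are made up.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds; return (train, test) index lists."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, test))
        start += size
    return splits

def majority_baseline(train_labels: Sequence[int]) -> Callable[[str], int]:
    """Stand-in model: always predict the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _text: most_common

def mean_cv_accuracy(texts: Sequence[str], labels: Sequence[int], k: int = 3) -> float:
    """Fit on k-1 folds, score accuracy on the held-out fold, average over the k folds."""
    scores = []
    for train, test in k_fold_indices(len(texts), k):
        model = majority_baseline([labels[i] for i in train])
        hits = sum(model(texts[i]) == labels[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / k

texts = ["good", "great", "nice", "bad", "fine", "cool"]
labels = [4, 4, 4, 0, 4, 4]
print(mean_cv_accuracy(texts, labels, k=3))
```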

In [54]:
vocabVect = CountVectorizer()
vocabVect.fit(vocab_list)
corpusVocab = vocabVect.vocabulary_
print('Number of generated features - {}'.format(len(corpusVocab)))

Number of generated features - 371467


In [55]:
vectorizers_list = [CountVectorizer(vocabulary = corpusVocab), TfidfVectorizer(vocabulary = corpusVocab)]
#classifiers_list = [LogisticRegression(C=3.0), LinearSVC(), KNeighborsClassifier()]
classifiers_list = [MultinomialNB(alpha=0.5),LogisticRegression(C=5.0)]
VectorizeAndClassify(vectorizers_list, classifiers_list)

Vectorization - CountVectorizer(vocabulary={'00': 0, '000': 1, '0000': 2, '00000': 3, ...})
Classification model - MultinomialNB(alpha=0.5)
Accuracy = 0.7696475000365388

Vectorization - CountVectorizer(vocabulary={'00': 0, '000': 1, '0000': 2, '00000': 3, ...})
Classification model - LogisticRegression(C=5.0)
Accuracy = 0.7893631262342096

Vectorization - TfidfVectorizer(vocabulary={'00': 0, '000': 1, '0000': 2, '00000': 3, ...
(the rest of this entry, including the MultinomialNB accuracy, is cut off in the saved output)

Vectorization - TfidfVectorizer(vocabulary={'00': 0, '000': 1, '0000': 2, '00000': 3, ...})
Classification model - LogisticRegression(C=5.0)
Accuracy = 0.7895050007521789

The LogisticRegression fits stopped at the solver's iteration limit ("STOP: TOTAL NO. of ITERATIONS REACHED LIMIT"); the scikit-learn warning suggests increasing max_iter or scaling the data.


### Method 2. Using FastText models

In [56]:
import fasttext
#from gensim.models import word2vec

#### Training dataset

In [57]:
filename="../RK2/trainingandtestdata/training.1600000.processed.noemoticon.csv"
# Load the sentiment140 data
# on_bad_lines='skip' replaces error_bad_lines=False, which is deprecated since pandas 1.3
train_df = pd.read_csv(filename, delimiter=None, header=None, encoding='utf-8', on_bad_lines='skip')
train_df = pd.DataFrame(train_df,columns=[0,5])
train_df.columns = ['value','text']
# Remove @username mentions (an '@' handle followed by whitespace)
train_df['text'] = train_df['text'].str.replace(r"\@[a-zA-Z0-9_]{1,}\s",'',regex=True)
# Remove '#' characters (the hashtag text itself is kept)
train_df['text'] = train_df['text'].str.replace('#','',regex=False)
train_df.head()





|   | value | text |
|---|-------|------|
| 0 | 0 | http://twitpic.com/2y1zl - Awww, that's a bumm... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | I dived many times for the ball. Managed to sa... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | no, it's not behaving at all. i'm mad. why am ... |


In [58]:
# fastText expects every label to carry the '__label__' prefix
train_df['value'] = '__label__' + train_df['value'].astype(str)
# One example per line: the label first, then the text, separated by a space
train_df.to_csv('./train.csv',sep=' ',header=False,index=False)

In [59]:
train_df['value'].describe()

count        1600000
unique             2
top       __label__0
freq          800000
Name: value, dtype: object
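The file that fastText reads is plain text with one example per line in the form `__label__<class> <text>`. The sketch below builds such lines directly; `to_fasttext_lines` and the sample rows are illustrative and not part of the notebook. Writing lines by hand also sidesteps CSV quoting: with `sep=' '`, pandas' default quoting is likely to wrap texts that contain spaces in double quotes, and those quotes then become part of the tokens.

```python
from typing import Iterable, List, Tuple

def to_fasttext_lines(rows: Iterable[Tuple[int, str]]) -> List[str]:
    """Format (label, text) pairs as '__label__<label> <text>' lines."""
    lines = []
    for label, text in rows:
        # One example per line: collapse embedded newlines and runs of whitespace.
        one_line = " ".join(text.split())
        lines.append(f"__label__{label} {one_line}")
    return lines

sample = [(0, "my whole body feels itchy"), (4, "Loves  twitter\n")]
for line in to_fasttext_lines(sample):
    print(line)
# __label__0 my whole body feels itchy
# __label__4 Loves twitter
```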

#### Test dataset

In [61]:
filename="../RK2/trainingandtestdata/testdata.manual.2009.06.14.csv"
# Load the sentiment140 data
# on_bad_lines='skip' replaces error_bad_lines=False, which is deprecated since pandas 1.3
test_df = pd.read_csv(filename, delimiter=None, header=None, encoding='utf-8', on_bad_lines='skip')
test_df = pd.DataFrame(test_df,columns=[0,5])
test_df.columns = ['value','text']
# Remove @username mentions (an '@' handle followed by whitespace)
test_df['text'] = test_df['text'].str.replace(r"\@[a-zA-Z0-9_]{1,}\s",'',regex=True)
# Remove '#' characters (the hashtag text itself is kept)
test_df['text'] = test_df['text'].str.replace('#','',regex=False)
test_df.head()





|   | value | text |
|---|-------|------|
| 0 | 4 | I loooooooovvvvvveee my Kindle2. Not that the ... |
| 1 | 4 | Reading my kindle2... Love it... Lee childs i... |
| 2 | 4 | Ok, first assesment of the kindle2 ...it fucki... |
| 3 | 4 | You'll love your Kindle2. I've had mine for a ... |
| 4 | 4 | Fair enough. But i have the Kindle2 and I thi... |


In [62]:
# fastText expects every label to carry the '__label__' prefix
test_df['value'] = '__label__' + test_df['value'].astype(str)
# One example per line: the label first, then the text, separated by a space
test_df.to_csv('./test.csv',sep=' ',header=False,index=False)

In [63]:
test_df['value'].describe()

count            498
unique             3
top       __label__4
freq             182
Name: value, dtype: object

In [64]:
model = fasttext.train_supervised('./train.csv')

Read 24M words
Number of words:  1118940
Number of labels: 2
Progress: 100.0% words/sec/thread: 3615350 lr:  0.000000 avg.loss:  0.417807 ETA:   0h 0m 0s


In [65]:
#print(model.words)
print(model.labels)

['__label__4', '__label__0']


In [66]:
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('./train.csv'))

N	1600000
P@1	0.829
R@1	0.829


In [67]:
print_results(*model.test('./test.csv'))

N	359
P@1	0.802
R@1	0.802
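P@1 and R@1 coincide in both runs because every tweet has exactly one true label and the model returns exactly one prediction. A sketch of the computation, assuming precision is the number of correct predictions divided by the number of predictions and recall is the number of correct predictions divided by the number of true labels; `precision_recall_at_1` and the sample labels are illustrative.

```python
from typing import Sequence, Tuple

def precision_recall_at_1(
    true_labels: Sequence[Sequence[str]], top1: Sequence[str]
) -> Tuple[int, float, float]:
    """Return (N, P@1, R@1) for one predicted label per example."""
    n = len(top1)
    correct = sum(pred in gold for gold, pred in zip(true_labels, top1))
    n_predicted = n                                  # one prediction per example at k=1
    n_gold = sum(len(gold) for gold in true_labels)  # total number of true labels
    return n, correct / n_predicted, correct / n_gold

gold = [["__label__4"], ["__label__0"], ["__label__4"], ["__label__0"]]
pred = ["__label__4", "__label__4", "__label__4", "__label__0"]
print(precision_recall_at_1(gold, pred))  # (4, 0.75, 0.75)
```

With a single true label per example both denominators equal N, so the two values reduce to plain accuracy.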


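### Compare the quality of the resulting models

In this run fastText scores higher than the CountVectorizer and TfidfVectorizer pipelines: P@1 = 0.802 on the test file, against a 3-fold cross-validation accuracy of about 0.77-0.79 for the bag-of-words models. The figures are measured on different data (the manually labeled test file versus the training file), so the comparison is indicative rather than exact. fastText also reports N = 359 although the test file has 498 rows; presumably the rows carrying the third label, which never occurs in the training data, were not counted.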