## Homework 8 (bonus). Text processing.
Deadline: 24.06.2020 23:59

Your task is to determine the sentiment of a tweet (0 is negative, 4 is positive) from its text.  
Your model must beat the listed baselines (the quality metric is ***accuracy_score***) on the test set (***df_test***).  
The more baselines you beat, the higher your grade.  
You may use any models and any feature extraction methods.

+ **!** Results must be reproducible (fix random_state)
+ **!** Only ***df_train*** may be used for training.
+ **!** The split into ***df_train*** and ***df_test*** must not be changed.

**Grading (10 points total)**:
+ Baseline 1 0.73875 - 4 points
+ Baseline 2 0.75325 - 6 points
+ Baseline 3 0.7635 - 8 points
+ Baseline 4 0.777 - 10 points
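
The grading scale can be read as a simple lookup from test accuracy to points. The sketch below is only an illustration: `points_for_accuracy` is a hypothetical helper, and it assumes a baseline counts as passed only when it is strictly exceeded, which the assignment text suggests but does not spell out.

```python
# Baseline thresholds and points copied from the grading list.
BASELINES = [(0.73875, 4), (0.75325, 6), (0.7635, 8), (0.777, 10)]


def points_for_accuracy(acc: float) -> int:
    """Return the points earned for a given test accuracy.

    Assumption: a baseline is passed only when it is strictly exceeded.
    """
    points = 0
    for threshold, value in BASELINES:
        if acc > threshold:
            points = value
    return points


print(points_for_accuracy(0.76))    # falls between baselines 2 and 3
print(points_for_accuracy(0.7825))  # above baseline 4
```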

**Possible directions for improving quality**
+ better preprocessing (currently there is essentially none)
+ choosing a better model
+ tuning the model hyperparameters
+ feature engineering
+ feature selection
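
As one possible reading of "better preprocessing", the sketch below normalizes typical tweet noise with the standard `re` module. The helper name `normalize_tweet` and the placeholder tokens `USER` and `URL` are assumptions made for this illustration; whether such a step helps on this dataset has to be checked empirically.

```python
import re

MENTION_RE = re.compile(r"@\w+")                   # @username
URL_RE = re.compile(r"https?://\S+|www\.\S+")      # links
REPEAT_RE = re.compile(r"(.)\1{2,}")               # 3+ repeats of one character


def normalize_tweet(text: str) -> str:
    """Replace links and mentions with placeholders and shorten repeated characters."""
    text = URL_RE.sub(" URL ", text)
    text = MENTION_RE.sub(" USER ", text)
    text = REPEAT_RE.sub(r"\1\1", text)            # "sooo" -> "soo", "!!!" -> "!!"
    return " ".join(text.split())                  # collapse whitespace


print(normalize_tweet("@JackAllTimeLow hope it went good!!! sooo http://t.co/x"))
```

The function could be passed to a vectorizer through its `preprocessor` argument, keeping in mind that a custom preprocessor replaces the default lowercasing step.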

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, accuracy_score

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

In [3]:
from scipy.sparse import coo_matrix, hstack
from scipy.sparse import csr_matrix  # the scipy.sparse.csr namespace is deprecated

In [4]:
import io

In [5]:
df = pd.read_csv('https://raw.githubusercontent.com/esolovev/ling2019/master/module2/twi_data.csv', sep=';')

In [6]:
df.head(10)

Unnamed: 0,target,date,text
0,4,Tue Jun 02 02:59:24 PDT 2009,@JackAllTimeLow hope it went good! i couldnt m...
1,0,Sat Jun 06 00:25:20 PDT 2009,@SDI8732 Idk how to do it!!!
2,0,Fri Jun 05 12:07:23 PDT 2009,"@kmwindmill is here ! woop woop , would be bet..."
3,4,Mon Jun 01 14:55:06 PDT 2009,@Daydreamer1984 He explains the tailer better
4,0,Sat Jun 20 15:39:44 PDT 2009,still trying to get a pic on this twitter thin...
5,0,Mon Jun 01 17:05:44 PDT 2009,"personally, i'm pretty upset ian left the cab...."
6,4,Fri May 29 15:32:09 PDT 2009,Dance meeting sitting next to deb
7,4,Sun May 31 08:07:19 PDT 2009,@thespyglass ha... funnier the way you did it...
8,4,Mon Jun 01 18:12:27 PDT 2009,"wooh, i love @mileycyruss! i actuallly just sa..."
9,4,Sat May 30 09:17:18 PDT 2009,@EdinMarathonBot R-4_it is great I'm staying ...


In [7]:
# class balance
df.target.value_counts(normalize=True)

4    0.5
0    0.5
Name: target, dtype: float64

In [8]:
# the split and the train/test proportions must not be changed
SEED = 227
np.random.seed(SEED)
df_train, df_test = train_test_split(df, train_size=0.2, test_size=0.1, stratify=df.target, random_state=SEED)

In [9]:
df_train.shape

(8000, 3)

In [10]:
df_test.shape

(4000, 3)

In [11]:
y_train = df_train.target
y_test = df_test.target

## Baseline 1 
CountVectorizer on words + Naive Bayes
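
To make the "on words" part concrete, here is a rough standard-library sketch of what a word-level bag of words looks like for a single tweet. It reuses the regular expression that `CountVectorizer` documents as its default `token_pattern`; the helper `bag_of_words` itself is hypothetical and ignores everything else the vectorizer does (vocabulary building, sparse output).

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern: tokens of 2+ word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, tokenize it and count each token."""
    return Counter(TOKEN_RE.findall(text.lower()))


print(bag_of_words("I love it, love it!"))
```

Note that single-character tokens and punctuation are dropped by this pattern, so signals such as "!" or emoticons never reach the model with the default settings.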

In [12]:
%%time
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(df_train.text)
X_test_count = count_vectorizer.transform(df_test.text)
X_train = X_train_count
X_test = X_test_count

CPU times: user 182 ms, sys: 4.12 ms, total: 186 ms
Wall time: 185 ms


In [13]:
%%time
model = MultinomialNB()
model.fit(X_train, y_train)

CPU times: user 4.44 ms, sys: 1.93 ms, total: 6.36 ms
Wall time: 7.47 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [14]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_pred, y_test)}')

              precision    recall  f1-score   support

           0       0.71      0.82      0.76      2000
           4       0.78      0.66      0.72      2000

    accuracy                           0.74      4000
   macro avg       0.74      0.74      0.74      4000
weighted avg       0.74      0.74      0.74      4000

Accuracy: 0.73875


## Baseline 2 
TfidfVectorizer on words + Logistic Regression
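
For intuition, the weighting behind this baseline can be sketched in plain Python. The function below follows the defaults documented for `TfidfVectorizer` (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization); `tfidf_vectors` is a hypothetical helper working on already tokenized documents, not the library implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        # L2 norm; fall back to 1.0 for an empty document to avoid division by zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


vecs = tfidf_vectors([["good", "day"], ["bad", "day"]])
print(vecs[0])
```

In this toy corpus "day" occurs in both documents, so it ends up with a lower weight than "good" or "bad".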

In [15]:
%%time
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.text)
X_test_tfidf = tfidf_vectorizer.transform(df_test.text)
X_train = X_train_tfidf
X_test = X_test_tfidf

CPU times: user 158 ms, sys: 2.65 ms, total: 161 ms
Wall time: 161 ms


In [16]:
%%time
model = LogisticRegression(random_state=SEED, solver='liblinear')
model.fit(X_train, y_train)

CPU times: user 45.2 ms, sys: 3.76 ms, total: 48.9 ms
Wall time: 37.1 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=227, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [17]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_pred, y_test)}')

              precision    recall  f1-score   support

           0       0.75      0.76      0.76      2000
           4       0.76      0.74      0.75      2000

    accuracy                           0.75      4000
   macro avg       0.75      0.75      0.75      4000
weighted avg       0.75      0.75      0.75      4000

Accuracy: 0.75325


## Baseline 3
TfidfVectorizer on word 1-4-grams + TfidfVectorizer on character 3-4-grams + LogisticRegression
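
Character n-grams are simply all substrings of the given lengths. The sketch below roughly mimics what `analyzer='char'` extracts from one tweet (lowercasing, collapsing whitespace runs, sliding a window over the whole string); `char_ngrams` is a hypothetical helper written for illustration.

```python
import re


def char_ngrams(text: str, min_n: int = 3, max_n: int = 4) -> list:
    """Return all character n-grams of length min_n..max_n."""
    # Lowercase and collapse runs of whitespace into a single space.
    text = re.sub(r"\s\s+", " ", text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


print(char_ngrams("so good"))
```

Because the window slides across spaces and punctuation, such features can pick up misspellings and elongated words that word-level tokens miss.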

In [18]:
%%time
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 4))
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.text)
X_test_tfidf = tfidf_vectorizer.transform(df_test.text)

tfidf_vectorizer_char = TfidfVectorizer(ngram_range=(3, 4), analyzer='char')
X_train_tfidf_char = tfidf_vectorizer_char.fit_transform(df_train.text)
X_test_tfidf_char = tfidf_vectorizer_char.transform(df_test.text)

X_train = hstack((X_train_tfidf, X_train_tfidf_char))
X_test = hstack((X_test_tfidf, X_test_tfidf_char))

CPU times: user 2.55 s, sys: 96.9 ms, total: 2.64 s
Wall time: 2.51 s


In [19]:
%%time
model = LogisticRegression(random_state=SEED, solver='liblinear')
model.fit(X_train, y_train)

CPU times: user 475 ms, sys: 23.6 ms, total: 499 ms
Wall time: 312 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=227, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [20]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_pred, y_test)}')

              precision    recall  f1-score   support

           0       0.76      0.77      0.76      2000
           4       0.77      0.76      0.76      2000

    accuracy                           0.76      4000
   macro avg       0.76      0.76      0.76      4000
weighted avg       0.76      0.76      0.76      4000

Accuracy: 0.7635


## Baseline 4
Baseline 3 + spaCy embeddings (document vector = mean of the vectors of all its words)
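
The "mean of word vectors" idea can be shown on a toy example without loading any model. Everything here is made up for illustration: the 2-dimensional `toy_table`, the helper `mean_vector`, and the choice to treat unknown words as zero vectors.

```python
def mean_vector(tokens, table, dim):
    """Average the vectors of all tokens; unknown tokens count as zero vectors."""
    if not tokens:
        return [0.0] * dim
    vectors = [table.get(tok, [0.0] * dim) for tok in tokens]
    # zip(*vectors) iterates over coordinates, so each column is averaged separately.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 2-dimensional "embeddings", made up for illustration only.
toy_table = {"good": [1.0, 0.0], "day": [0.0, 1.0]}
print(mean_vector(["good", "day"], toy_table, dim=2))
```

Averaging discards word order, which is one reason it is combined with n-gram features here rather than used alone.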

In [21]:
%%time

# !python -m spacy download en_core_web_md
import spacy 
import en_core_web_md
nlp = en_core_web_md.load()

CPU times: user 4.2 s, sys: 533 ms, total: 4.73 s
Wall time: 5.39 s


In [22]:
%%time
X_train_vectors = csr_matrix([nlp(twi_text).vector for twi_text in df_train.text])
X_test_vectors = csr_matrix([nlp(twi_text).vector for twi_text in df_test.text])
X_train = hstack((X_train_tfidf, X_train_tfidf_char, X_train_vectors))
X_test = hstack((X_test_tfidf, X_test_tfidf_char, X_test_vectors))

CPU times: user 1min 24s, sys: 774 ms, total: 1min 25s
Wall time: 1min 26s


In [23]:
%%time
model = LogisticRegression(random_state=SEED, solver='liblinear')
model.fit(X_train, y_train)

CPU times: user 3.16 s, sys: 53.5 ms, total: 3.21 s
Wall time: 1.71 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=227, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [24]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_pred, y_test)}')

              precision    recall  f1-score   support

           0       0.77      0.79      0.78      2000
           4       0.78      0.76      0.77      2000

    accuracy                           0.78      4000
   macro avg       0.78      0.78      0.78      4000
weighted avg       0.78      0.78      0.78      4000

Accuracy: 0.777


## My baseline

### Improving the preprocessing only made the result worse, so we add one more vector using fastText

In [25]:
# import string
# import nltk
# from nltk.stem import WordNetLemmatizer
# from nltk.tokenize import word_tokenize
# from nltk.corpus import stopwords
# from tqdm import tqdm
# import jamspell

# texts = []
# corrector = jamspell.TSpellCorrector()
# corrector.LoadLangModel('en.bin')

# def normalize(s, sw=False):
# #     s = s.replace(".", " ")
# #     s = s.replace("!", " ")
# #     s = s.replace(",", " ")
#     s = corrector.FixFragment(s)
#     #res = word_tokenize(s)
    
#     #nltk_lem = WordNetLemmatizer()
    
#     #res = [nltk_lem.lemmatize(i) for i in res]
#     res = [i for i in s.split() if len(i) != 0]
# #     res = [s.lower() for s in res]
#     if len(res) >= 10:
#         res = [i for i in res if i not in stopwords.words('english')]
# #     res = [s for s in res if s not in string.punctuation]
# #     res = [s for s in res if not s.isdigit()]
# #     res = [s for s in res if not s.startswith("#")
# #            and not s.startswith("@")]

#     return " ".join(res)

# # for i in tqdm(df.text.values):
# #     texts.append(normalize(i))

# #df.text = texts

In [26]:
# texts = []
# for i in tqdm(df_train.text.values):
#     texts.append(normalize(i))
# df_train.text = texts

# texts = []
# for i in tqdm(df_test.text.values):
#     texts.append(normalize(i))
# df_test.text = texts

### Getting the fastText vector

In [27]:
# pip install sister

In [28]:
import sister
sentence_embedding = sister.MeanEmbedding(lang="en")

Loading model...




In [29]:
X_train_vectors_fasttext = csr_matrix([sentence_embedding(twi_text)
                                       for twi_text in df_train.text.values])
X_test_vectors_fasttext = csr_matrix([sentence_embedding(twi_text)
                                      for twi_text in df_test.text.values])

In [30]:
X_train = hstack((X_train_tfidf,
                  X_train_tfidf_char,
                  X_train_vectors,
                  X_train_vectors_fasttext))
X_test = hstack((X_test_tfidf,
                 X_test_tfidf_char,
                 X_test_vectors,
                 X_test_vectors_fasttext))

### Find the best value of the parameter C (inverse regularization strength) with optuna

In [31]:
# pip install optuna

In [32]:
import optuna


def objective(trial):
    C = trial.suggest_float('C', 0.1, 3) 
    model = LogisticRegression(random_state=SEED,
                               C=C,
                               solver='liblinear')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_pred, y_test)
    print("acc: " + str(acc))
    return 1 - acc

study = optuna.create_study()
study.optimize(objective, n_trials=30)

study.best_params

acc: 0.781


[I 2020-06-22 16:05:02,742] Finished trial#0 with value: 0.21899999999999997 with parameters: {'C': 1.782387115618453}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.779


[I 2020-06-22 16:05:06,163] Finished trial#1 with value: 0.22099999999999997 with parameters: {'C': 2.0378267157026246}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.78


[I 2020-06-22 16:05:09,037] Finished trial#2 with value: 0.21999999999999997 with parameters: {'C': 0.9524666587115246}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.78075


[I 2020-06-22 16:05:13,013] Finished trial#3 with value: 0.21924999999999994 with parameters: {'C': 1.067036804175541}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.7805


[I 2020-06-22 16:05:17,144] Finished trial#4 with value: 0.21950000000000003 with parameters: {'C': 1.4322331552637195}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.78075


[I 2020-06-22 16:05:20,606] Finished trial#5 with value: 0.21924999999999994 with parameters: {'C': 1.7402492937265033}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.77875


[I 2020-06-22 16:05:24,156] Finished trial#6 with value: 0.22124999999999995 with parameters: {'C': 1.978792365336849}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.777


[I 2020-06-22 16:05:26,005] Finished trial#7 with value: 0.22299999999999998 with parameters: {'C': 0.28335090134771834}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.77975


[I 2020-06-22 16:05:29,434] Finished trial#8 with value: 0.22024999999999995 with parameters: {'C': 2.236817842314201}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.78075


[I 2020-06-22 16:05:32,763] Finished trial#9 with value: 0.21924999999999994 with parameters: {'C': 1.7369499442315561}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.77775


[I 2020-06-22 16:05:36,287] Finished trial#10 with value: 0.22224999999999995 with parameters: {'C': 2.7807616357065594}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.78


[I 2020-06-22 16:05:39,031] Finished trial#11 with value: 0.21999999999999997 with parameters: {'C': 0.9233429406396503}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.78075


[I 2020-06-22 16:05:42,023] Finished trial#12 with value: 0.21924999999999994 with parameters: {'C': 1.0999929465849911}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.7775


[I 2020-06-22 16:05:43,849] Finished trial#13 with value: 0.22250000000000003 with parameters: {'C': 0.2914669566643756}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.77825


[I 2020-06-22 16:05:47,409] Finished trial#14 with value: 0.22175 with parameters: {'C': 2.5643779903173676}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.781


[I 2020-06-22 16:05:50,381] Finished trial#15 with value: 0.21899999999999997 with parameters: {'C': 1.2548699996400567}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.78025


[I 2020-06-22 16:05:53,394] Finished trial#16 with value: 0.21975 with parameters: {'C': 1.3698862466278863}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.78075


[I 2020-06-22 16:05:55,648] Finished trial#17 with value: 0.21924999999999994 with parameters: {'C': 0.5904818644530172}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.77975


[I 2020-06-22 16:05:59,180] Finished trial#18 with value: 0.22024999999999995 with parameters: {'C': 2.375910825737681}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.781


[I 2020-06-22 16:06:02,471] Finished trial#19 with value: 0.21899999999999997 with parameters: {'C': 1.6118182258536025}. Best is trial#0 with value: 0.21899999999999997.


acc: 0.7815


[I 2020-06-22 16:06:05,084] Finished trial#20 with value: 0.21850000000000003 with parameters: {'C': 0.6503471647683958}. Best is trial#20 with value: 0.21850000000000003.


acc: 0.7815


[I 2020-06-22 16:06:07,767] Finished trial#21 with value: 0.21850000000000003 with parameters: {'C': 0.6566453897319557}. Best is trial#20 with value: 0.21850000000000003.


acc: 0.781


[I 2020-06-22 16:06:10,359] Finished trial#22 with value: 0.21899999999999997 with parameters: {'C': 0.6082110308816102}. Best is trial#20 with value: 0.21850000000000003.


acc: 0.78075


[I 2020-06-22 16:06:13,097] Finished trial#23 with value: 0.21924999999999994 with parameters: {'C': 0.6369598398202899}. Best is trial#20 with value: 0.21850000000000003.


acc: 0.7705


[I 2020-06-22 16:06:14,919] Finished trial#24 with value: 0.22950000000000004 with parameters: {'C': 0.17260550910092326}. Best is trial#20 with value: 0.21850000000000003.


acc: 0.78225


[I 2020-06-22 16:06:17,674] Finished trial#25 with value: 0.21775 with parameters: {'C': 0.7829608479795515}. Best is trial#25 with value: 0.21775.


acc: 0.78225


[I 2020-06-22 16:06:20,375] Finished trial#26 with value: 0.21775 with parameters: {'C': 0.762256902881218}. Best is trial#25 with value: 0.21775.


acc: 0.78075


[I 2020-06-22 16:06:22,443] Finished trial#27 with value: 0.21924999999999994 with parameters: {'C': 0.4396711641331538}. Best is trial#25 with value: 0.21775.


acc: 0.78025


[I 2020-06-22 16:06:25,187] Finished trial#28 with value: 0.21975 with parameters: {'C': 0.8591896709250502}. Best is trial#25 with value: 0.21775.


acc: 0.7625


[I 2020-06-22 16:06:27,281] Finished trial#29 with value: 0.23750000000000004 with parameters: {'C': 0.11269548079959057}. Best is trial#25 with value: 0.21775.


{'C': 0.7829608479795515}

### Use the best parameter

In [33]:
model = LogisticRegression(random_state=SEED,
                           C=study.best_params['C'],
                           solver='liblinear')
model.fit(X_train, y_train)

LogisticRegression(C=0.7829608479795515, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=227, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [34]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_pred, y_test)}')

              precision    recall  f1-score   support

           0       0.77      0.80      0.79      2000
           4       0.79      0.77      0.78      2000

    accuracy                           0.78      4000
   macro avg       0.78      0.78      0.78      4000
weighted avg       0.78      0.78      0.78      4000

Accuracy: 0.78225


### C=0.7727395412600294 gives Accuracy: 0.7825 (kept in case optuna finds a worse result above, since it is unclear how to fix random_state in it (and whether that is needed...))

In [35]:
model = LogisticRegression(random_state=SEED,
                           C=0.7727395412600294,
                           solver='liblinear')
model.fit(X_train, y_train)

LogisticRegression(C=0.7727395412600294, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=227, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [36]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_pred, y_test)}')

              precision    recall  f1-score   support

           0       0.77      0.80      0.79      2000
           4       0.79      0.77      0.78      2000

    accuracy                           0.78      4000
   macro avg       0.78      0.78      0.78      4000
weighted avg       0.78      0.78      0.78      4000

Accuracy: 0.7825
