# Topic: "Text Classification"

As a starting point for this assignment, we rerun part of homework 2. We need to build sparse matrices using CountVectorizer and TfidfVectorizer for the 'tweet_stemmed' and 'tweet_lemmatized' columns (4 matrices in total).
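The four matrices could be produced roughly as follows. This is a minimal sketch on a toy frame: `toy_df` and the `matrices` dictionary are illustrative names, and the default vectorizer settings are an assumption rather than the settings used later in this notebook.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in for combine_df (hypothetical data, not the real tweets).
toy_df = pd.DataFrame({
    'tweet_stemmed': ['father dysfunct selfish', 'thank lyft credit', 'model love take time'],
    'tweet_lemmatized': ['father dysfunctional selfish', 'thank lyft credit', 'model love take time'],
})

matrices = {}
for col in ['tweet_stemmed', 'tweet_lemmatized']:
    for name, vectorizer_cls in [('bow', CountVectorizer), ('tfidf', TfidfVectorizer)]:
        # A fresh vectorizer per (column, method) pair, so each matrix has its own vocabulary.
        matrices[(col, name)] = vectorizer_cls().fit_transform(toy_df[col])

for key, matrix in matrices.items():
    print(key, matrix.shape, sparse.issparse(matrix))
```

Each value in `matrices` is a SciPy sparse matrix with one row per document and one column per vocabulary term.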

**Task 1.**
<br>Build a LogisticRegression model using Bag-of-Words features for the combine_df['tweet_stemmed'] column.
- Split the Bag-of-Words features into train and test (train ends at row 31962 of combine_df)
- The targets are the train_df['label'] column
- Compute predict_proba and binarize the prediction: 1 if the predicted probability is >= 0.3, otherwise 0; cast the result to int
- Compute f1_score

Repeat the same for the combine_df['tweet_lemmatized'] column.

**Task 2.**
<br>Build a LogisticRegression model using TF-IDF features for the combine_df['tweet_stemmed'] column.
- Split the TF-IDF features into train and test (train ends at row 31962 of combine_df)
- The targets are the train_df['label'] column
- Compute predict_proba and binarize the prediction: 1 if the predicted probability is >= 0.3, otherwise 0; cast the result to int
- Compute f1_score

Repeat the same for the combine_df['tweet_lemmatized'] column.
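The binarization step shared by both tasks can be illustrated in isolation. The probabilities and labels below are made up for the example; in the notebook they would come from `predict_proba(...)[:, 1]` and the label column.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical positive-class probabilities and true labels (illustrative only).
proba_pos = np.array([0.05, 0.31, 0.29, 0.80, 0.30, 0.10])
y_true = np.array([0, 1, 0, 1, 0, 0])

# Threshold at 0.3 (inclusive) and cast the boolean mask to int.
y_pred = (proba_pos >= 0.3).astype(int)
f1_value = f1_score(y_true, y_pred)

print(y_pred)    # [0 1 0 1 1 0]
print(f1_value)  # 2 true positives, 1 false positive, 0 false negatives -> F1 = 0.8
```

A threshold below the default 0.5 labels more objects as positive, which usually trades precision for recall on an imbalanced target.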



In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm

import warnings 
warnings.filterwarnings("ignore", category=Warning)

In [None]:
import pickle

with open('combine_df.pickle', 'rb') as f:
    combine_df = pickle.load(f)

In [None]:
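# Keep only the first 3196 rows; .copy() avoids assigning into a slice of the original frame
combine_df = combine_df[0:3196].copy()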

for col in ['tweet_stemmed', 'tweet_lemmatized']:
    combine_df[col] = combine_df[col].str.join(' ')

combine_df.head(n=5)

| | id | label | tweet | tweet_token | tweet_token_filtered | tweet_stemmed | tweet_lemmatized |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | when father is dysfunctional and is so selfish... | [when, father, is, dysfunctional, and, is, so,... | [father, dysfunctional, selfish, drags, kids, ... | father dysfunct selfish drag kid dysfunct run | father dysfunctional selfish drag kid dysfunct... |
| 1 | 2 | 0.0 | thanks for lyft credit cannot use cause they d... | [thanks, for, lyft, credit, can, not, use, cau... | [thanks, lyft, credit, use, cause, offer, whee... | thank lyft credit use caus offer wheelchair va... | thank lyft credit use cause offer wheelchair v... |
| 2 | 3 | 0.0 | bihday your majesty | [bihday, your, majesty] | [bihday, majesty] | bihday majesti | bihday majesty |
| 3 | 4 | 0.0 | model love you take with you all the time in ur | [model, love, you, take, with, you, all, the, ... | [model, love, take, time, ur] | model love take time ur | model love take time ur |
| 4 | 5 | 0.0 | factsguide society now motivation | [factsguide, society, now, motivation] | [factsguide, society, motivation] | factsguid societi motiv | factsguide society motivation |


In [None]:
X = combine_df[['tweet_stemmed', 'tweet_lemmatized']]
y = combine_df['label']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)
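To see what `stratify=y` does, here is a small check on synthetic labels. The names `toy_X` and `toy_y` and the 80/20 class ratio are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives and 20 positives.
toy_y = pd.Series([0] * 80 + [1] * 20)
toy_X = pd.DataFrame({'text': [f'doc {i}' for i in range(100)]})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.25,
    random_state=42,
    stratify=toy_y,
)

# With stratification the share of positives is the same in both parts.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a rare positive class can end up noticeably over- or under-represented in the test part, which makes F1 less stable between runs.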

Vectorizers and logistic regression (with L1 regularization and cross-validation):

In [None]:
count_vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
tfidf_vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)

clf = LogisticRegressionCV(
    cv=3,
    penalty='l1',
    scoring='f1',
    solver='saga',
    class_weight='balanced',
    n_jobs=-1,
    random_state=42
)
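The practical effect of the L1 penalty is that many coefficients are driven to exactly zero. The sketch below compares L1 and L2 on synthetic data. It uses plain `LogisticRegression` with the `liblinear` solver instead of the `LogisticRegressionCV` and `saga` setup above, and the data and `C=0.05` are arbitrary assumptions chosen for a quick demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): 50 features, of which 5 are informative.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=50, n_informative=5, n_redundant=0, random_state=42
)

# Same regularization strength, different penalty type.
l1_model = LogisticRegression(penalty='l1', C=0.05, solver='liblinear').fit(X_toy, y_toy)
l2_model = LogisticRegression(penalty='l2', C=0.05, solver='liblinear').fit(X_toy, y_toy)

nonzero_l1 = int(np.count_nonzero(l1_model.coef_))
nonzero_l2 = int(np.count_nonzero(l2_model.coef_))
print(f'non-zero coefficients: L1={nonzero_l1}, L2={nonzero_l2}')
```

On text features with thousands of columns, this sparsity acts as a built-in feature selection.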

A function that vectorizes the training and test sets, fits the logistic regression, and returns the F1 score:

In [None]:
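def train_and_eval_pipeline(X_train, y_train, X_test, y_test, vectorizer, clf):
    """Vectorize the texts, fit clf, and return F1 on the test set at a 0.3 threshold."""
    # Fit the vocabulary on the training texts only, then reuse it for the test texts
    X_train_vectorized = vectorizer.fit_transform(X_train)
    X_test_vectorized = vectorizer.transform(X_test)

    clf.fit(X_train_vectorized, y_train)

    # Column 1 holds the probability of the positive class
    y_proba = clf.predict_proba(X_test_vectorized)[:, 1]

    # Binarize at 0.3 and cast to int, as the task requires
    y_pred = (y_proba >= 0.3).astype(int)

    return f1_score(y_test, y_pred)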
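The same fit-on-train, transform-on-test logic can also be expressed with a scikit-learn `Pipeline`, which is a different technique from the hand-written function in this notebook. The texts and labels below are invented for the sketch, and plain `LogisticRegression` with default settings is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not the real tweets).
train_texts = ['love happy day', 'hate ugly racist', 'happy love friend', 'racist hate speech']
train_labels = [0, 1, 0, 1]
test_texts = ['happy friend', 'ugly hate']

# The pipeline fits the vectorizer on the training texts only and reuses it at predict time.
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(train_texts, train_labels)

proba_pos = pipe.predict_proba(test_texts)[:, 1]
y_pred = (proba_pos >= 0.3).astype(int)
print(proba_pos, y_pred)
```

Bundling the vectorizer with the classifier makes it harder to fit the vocabulary on test data by accident.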

F1 for each combination of word normalization and vectorization:

In [None]:
for n_type, col in zip(['Stemming', 'Lemmatization'],
                       ['tweet_stemmed', 'tweet_lemmatized']):
    print(n_type)
    for v_type, vectorizer in zip(['Bag-of-Words', 'TF-IDF'],
                                  [count_vectorizer, tfidf_vectorizer]):
        f1_value = train_and_eval_pipeline(X_train[col], y_train,
                                           X_test[col], y_test,
                                           vectorizer, clf)
        print(f' {v_type}: {f1_value:.6f}')
    print()

Stemming
 Bag-of-Words: 0.452174
 TF-IDF: 0.458716

Lemmatization
 Bag-of-Words: 0.464286
 TF-IDF: 0.477064

