В этом задании Вам предлагается решить проблему классификации текстов разными методами.

Среди таких методов мы можем предложить Вам:

1) Простой Байесовский классификатор на основе мультиномиальной модели или модели Бернулли

>Достоинства: идейная простота и простота реализации, неплохая интерпретируемость

>Недостатки: относительно слабая предсказательная способность

> Frameworks: `numpy`

2) Логистическая регрессия на основе векторов TF-IDF

>Достоинства: достаточно высокая скорость обучения, простой метод составления эмбеддингов

>Недостатки: также довольно слабая предсказательная способность, слишком высокая размерность задачи

> Frameworks: `sklearn`, `numpy`

3) Логистическая регрессия или нейронная сеть + word2vec embeddings

> Достоинства: оптимальная размерность эмбеддингов, довольно простые модели, сравнительно неплохое качество

> Недостатки: устаревший метод построения эмбеддингов. Эмбеддинги не контекстуальные

> Frameworks: `gensim`, `pytorch`, `sklearn`

4) Рекуррентная нейронная сеть + word2vec:

> Достоинства: Более современная нейронная сеть

> Недостатки: недоступно распараллеливание

> Frameworks: `pytorch`, `gensim`

5) ELMO + любая нейронная сеть

> Достоинства: отличный контекстуальный метод векторизации текстов, мощная модель

> Недостатки: сложность моделей

> Frameworks: `elmo`, `pytorch`

6) Bert + любая нейронная сеть

> Достоинства: отличный контекстуальный метод векторизации текстов, мощная модель

> Недостатки: сложность моделей

> Frameworks: `transformers`, `pytorch`

Вы также можете исследовать любые комбинации методов векторизации и моделей ML, которые сочтете нужными.

Ваша задача: провести сравнительный анализ не менее 3 алгоритмов классификации текстов. Сравнение стоит проводить по следующим параметрам:

- Качество классификации (актуальную метрику выберите самостоятельно)
- Время обучения модели
- Характерное время инференса модели

Данные можно загрузить по ссылке: https://drive.google.com/drive/folders/14hR7Pm2sH28rQttkD906PTLvtwHFLBRm?usp=sharing

Для упрощения Вашей работы предлагаем ряд функций для предобработки текстов.

In [None]:
import re, string
regex = re.compile('[%s]' % re.escape(string.punctuation))
def clear(text: str) -> str:
    text = regex.sub('', text.lower())
    text = re.sub(r'[«»\n]', ' ', text)
    text = text.replace('ё', 'е')
    return text.strip()

In [None]:
import nltk #natural language toolkit
from nltk.stem import WordNetLemmatizer

nltk.download('omw-1.4')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
eng_stopwords = stopwords.words("english")

remove_stopwords = lambda tokenized_text, stopwords: [w for w in tokenized_text if not w in stopwords]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
path = './drive/MyDrive/Classification texts/'
train_texts =[]
train_labels = []

test_texts =[]
test_labels = []


fp_train_texts = open(path+'train.texts','r',encoding='utf-8')
for text in fp_train_texts:
    train_texts.append(text)

fp_train_labels = open(path+'train.labels','r',encoding='utf-8')
for label in fp_train_labels:
    train_labels.append(label)

fp_test_texts = open(path+'dev.texts','r',encoding='utf-8')
for text in fp_test_texts:
    test_texts.append(text)

fp_test_labels = open(path+'dev.labels','r',encoding='utf-8')
for label in fp_test_labels:
    test_labels.append(label)


print('Длина тренировочного набора текстов: ', len(train_texts))
print('Длина тестового набора текстов: ',len(test_texts))

Длина тренировочного набора текстов:  15000
Длина тестового набора текстов:  10000


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

df = pd.DataFrame({'category' : train_labels[:15000],
                  'text' : train_texts[:15000]})


df_test = pd.DataFrame({'category' : test_labels[:10000],
                  'text' : test_texts[:10000]})

df.head()

Unnamed: 0,category,text
0,neg\n,If the myth regarding broken mirrors would be ...
1,pos\n,I gave this movie a 10 because it needed to be...
2,neg\n,After watching the first 20mn of Blanche(sorry...
3,neg\n,"Weak plot, unlikely car malfunction, and helpl..."
4,pos\n,Where the Sidewalk Ends (1950)<br /><br />Wher...


In [None]:
df_test.head()

Unnamed: 0,category,text
0,neg\n,"First of all, I have to say I have worked for ..."
1,neg\n,"With a cast list like this one, I expected far..."
2,pos\n,Some guys think that sniper is not good becaus...
3,neg\n,"The film is about a young man, Michael, who ca..."
4,pos\n,"This is an ""odysessy through time"" via compute..."


In [None]:
df.describe()

Unnamed: 0,category,text
count,15000,15000
unique,2,14941
top,pos\n,The BFG is one of Roald Dahl's most cherished ...
freq,7520,2


In [None]:
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer=ToktokTokenizer()
from bs4 import BeautifulSoup
stopword_list=nltk.corpus.stopwords.words('english')
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_between_square_brackets(text):
    return re.sub('\[[^]]*\]', '', text)

def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text
df['text']=df['text'].apply(denoise_text)
df_test['text'] = df['text'].apply(denoise_text)

In [None]:
def remove_special_characters(text, remove_digits=True):
    pattern=r'[^a-zA-z0-9\s]'
    text=re.sub(pattern,'',text)
    return text
df['text']=df['text'].apply(remove_special_characters)
df_test['text']=df_test['text'].apply(remove_special_characters)

In [None]:
def simple_stemmer(text):
    ps=nltk.porter.PorterStemmer()
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text
df['text']=df['text'].apply(simple_stemmer)
df_test['text']=df_test['text'].apply(simple_stemmer)

In [None]:
stop=set(stopwords.words('english'))
print(stop)

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
df['text']=df['text'].apply(remove_stopwords)
df_test['text']=df_test['text'].apply(remove_stopwords)

{'shan', 'of', 'during', 'such', 'against', 'on', 'shouldn', "won't", 'or', "didn't", 've', "that'll", 'but', 'these', 'after', 'mustn', 'mightn', 'am', 'own', 's', "shan't", 'its', 'they', "shouldn't", "haven't", 'myself', 'again', 'was', "isn't", 'been', 'you', 'there', 'down', 'y', 'in', 'will', 'out', 'herself', 't', 'her', 'whom', 'under', 'this', 'isn', "hadn't", "couldn't", 'themselves', 'doing', 'd', 'with', 'through', 'here', 'we', 'he', 'between', 'should', "weren't", 'his', 'other', "wasn't", 'wasn', 'as', 'itself', 'couldn', 'about', "mightn't", 'have', 'she', "you'd", 'while', 'no', "you'll", 'most', 'some', 'be', 'yourself', 'which', 'what', 'i', 'being', "you've", "mustn't", 'from', 'their', 'that', 'then', 'than', 'll', "doesn't", 'an', 'hers', 'won', 'further', 'hadn', 'into', 'ma', 'needn', 'up', 'over', 'did', 'each', 'does', 'hasn', 'can', 'him', 'them', "wouldn't", 'do', 'once', 'below', 'more', 'so', "you're", 'where', 'weren', 'yours', 'ain', 'aren', 'too', 'does

In [None]:
df.text[0]

'myth regard broken mirror would accur everybodi involv thi product would face approxim 170 year bad luck becaus lot mirror fall littl piec onli script wa shatter glass broken would brilliant film sadli overlong deriv dull movi onli hand remark idea memor sequenc sean elli made veri stylish elegantli photograph movi stori lacklust total absenc logic explan realli frustrat got discuss friend regard basic concept mean film think elli found inspir old legend claim spot doppelgang forebod go die interest theori im familiar thi legend couldnt find anyth internet thi neither person think broken yet anoth umpteenth variat theme invas bodi snatcher without alien interfer broken center american mcvey famili live london particularli gina mirror spontan break dure birthday celebr thi trigger whole seri mysteri seemingli supernatur event gina spot drive car follow mirror imag apart build whilst drive home state mental confus caus terribl car accid end hospit dismiss gina feel like whole surround c

In [None]:
norm_train = df.text[::]
norm_test = df_test.text[::]

In [None]:
norm_train.dtype

dtype('O')

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv=TfidfVectorizer(min_df=0,max_df=1,use_idf=True)
tv_train_reviews=tv.fit_transform(norm_train)
tv_test_reviews=tv.transform(norm_test)
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)

Tfidf_train: (15000, 52872)
Tfidf_test: (10000, 52872)


In [None]:
from sklearn.preprocessing import LabelBinarizer
lb=LabelBinarizer()
sentiment_train=lb.fit_transform(df['category'])
sentiment_test=lb.fit_transform(df_test['category'])
print(sentiment_train.shape)
print(sentiment_test.shape)

(15000, 1)
(10000, 1)


In [None]:
sentiment_train

array([[0],
       [1],
       [0],
       ...,
       [0],
       [1],
       [1]])

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
lr.fit(tv_train_reviews, sentiment_train)

  y = column_or_1d(y, warn=True)


LogisticRegression(C=1, max_iter=500, random_state=42)

In [None]:
predict = lr.predict(tv_test_reviews)

In [None]:
print(classification_report(sentiment_test, predict))

              precision    recall  f1-score   support

           0       0.51      0.41      0.46      5020
           1       0.50      0.60      0.55      4980

    accuracy                           0.51     10000
   macro avg       0.51      0.51      0.50     10000
weighted avg       0.51      0.51      0.50     10000



In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(sentiment_test, predict))

0.5053


In [None]:
from sklearn.naive_bayes import MultinomialNB
mnb=MultinomialNB()
mnb.fit(tv_train_reviews,sentiment_train)
mnb_predict=mnb.predict(tv_test_reviews)

  y = column_or_1d(y, warn=True)


In [None]:
print(accuracy_score(sentiment_test,mnb_predict))

0.5053
