<a href="https://colab.research.google.com/github/Temish09/ML_CS-course/blob/main/ML_CS_work.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Classification


Work by Temirlan Bidzhiev, group 519/2.

Special course: "Machine Learning and Artificial Intelligence".

Instructor: Ilya Nikolaevich Smirnov.



---



The task is to classify news texts by topic.

### 1. Load the 20 Newsgroups dataset.

In [35]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

print(newsgroups_train.keys())

for topic in newsgroups_train.target_names:
  print(topic)

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


### 2. Preprocess the text:
  1. Remove stop words (words that carry no meaning);
  2. Remove punctuation and other non-letter characters;
  3. Convert all uppercase letters to lowercase;
  4. Tokenize the text.

In [4]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import re
import tqdm

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
def preprocess_text(texts):
  stop_words = set(stopwords.words('english'))
  regex = re.compile('[^a-z A-Z]')
  preprocess_texts = []

  for text in tqdm.tqdm(texts):
    text = text.lower()
    text = regex.sub(' ', text)
    word_tokens = word_tokenize(text)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    preprocess_texts.append(' '.join(filtered_sentence))

  return preprocess_texts
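To see what these steps do to a single string, here is a minimal sketch that does not depend on NLTK. The stop-word set `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list, and a plain whitespace split stands in for `word_tokenize`; the helper name `preprocess_one` is also made up for this illustration.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "is", "a", "of", "to", "in", "and", "it", "s"}
NON_LETTERS = re.compile(r"[^a-z ]")

def preprocess_one(text):
    """Lowercase, replace non-letters with spaces, split, drop stop words."""
    text = text.lower()
    text = NON_LETTERS.sub(" ", text)
    tokens = text.split()  # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in STOP_WORDS)

sample = "The Shuttle's launch is delayed to 9:30, engineers say!"
cleaned = preprocess_one(sample)
print(cleaned)  # shuttle launch delayed engineers say
```

Note that the apostrophe is replaced by a space, so `shuttle's` becomes `shuttle` and `s`, and the stray `s` is then dropped as a stop word.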


In [18]:
print(newsgroups_train.keys())
newsgroups_train['preprocess_data'] = preprocess_text(newsgroups_train.data)
print('\n', newsgroups_train.keys())

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


100%|██████████| 11314/11314 [00:16<00:00, 703.34it/s]


 dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'preprocess_data'])





In [19]:
print(newsgroups_test.keys())
newsgroups_test['preprocess_data'] = preprocess_text(newsgroups_test.data)
print('\n', newsgroups_test.keys())

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


100%|██████████| 7532/7532 [00:10<00:00, 729.70it/s]


 dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'preprocess_data'])





### 3. Apply stemming to reduce words to their stems, so that words with different endings are treated as the same word.

In [14]:
from nltk.stem.lancaster import LancasterStemmer

In [15]:
def stemming_texts(texts):
  st = LancasterStemmer()
  stem_text = []
  for text in tqdm.tqdm(texts):
    word_tokens = word_tokenize(text)
    stem_text.append(' '.join([st.stem(word) for word in word_tokens]))
  
  return stem_text
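The idea behind stemming can be sketched with a toy suffix stripper. This is not the Lancaster algorithm used above, which applies a much larger rule table; the suffix list and the `toy_stem` helper are assumptions made only for illustration.

```python
# Toy suffix stripper, NOT the Lancaster stemmer: it only illustrates the idea
# that several surface forms collapse to one stem.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connects"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['connect', 'connect', 'connect', 'connect']
```

After stemming, all four forms map to the same vocabulary entry, which shrinks the feature space the vectorizers have to deal with.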

In [20]:
newsgroups_train['data_stemming'] = stemming_texts(newsgroups_train.preprocess_data)
newsgroups_test['data_stemming'] = stemming_texts(newsgroups_test.preprocess_data)

100%|██████████| 11314/11314 [00:44<00:00, 253.71it/s]
100%|██████████| 7532/7532 [00:27<00:00, 271.64it/s]


### 4. Vectorize the data in several ways and compare the quality of an SVM model for each vectorization method.

In [23]:
def bow(vectorizer, train, test):     # Bag Of Words
  train_bow = vectorizer.fit_transform(train)
  test_bow = vectorizer.transform(test)

  return train_bow, test_bow

Apply vectorization with CountVectorizer.

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

X_train_bow_stem, X_test_bow_stem = bow(CountVectorizer(), newsgroups_train.data_stemming, newsgroups_test.data_stemming)
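The `bow` helper fits the vocabulary on the training texts only and reuses it for the test texts. A minimal pure-Python sketch of that fit/transform split is shown below; `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn functions, and they skip details such as CountVectorizer's tokenization rules and sparse output.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Collect the sorted set of words seen in the training documents."""
    return sorted({word for doc in train_docs for word in doc.split()})

def transform(docs, vocabulary):
    """Count each vocabulary word per document; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

train_docs = ["space shuttle launch", "hockey game tonight"]
test_docs = ["shuttle launch launch delayed"]

vocabulary = fit_vocabulary(train_docs)
train_counts = transform(train_docs, vocabulary)
test_counts = transform(test_docs, vocabulary)

print(vocabulary)   # ['game', 'hockey', 'launch', 'shuttle', 'space', 'tonight']
print(test_counts)  # [[0, 0, 2, 1, 0, 0]]
```

The word `delayed` never appears in the training texts, so it has no column and is silently dropped from the test vector.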

Apply TF-IDF vectorization.

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

X_train_tfdidf_stem, X_test_tfidf_stem = bow(TfidfVectorizer(), newsgroups_train.data_stemming, newsgroups_test.data_stemming)
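As a rough illustration of the weighting, the sketch below reimplements the formula that `TfidfVectorizer` uses with its default settings: a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The toy documents and the `tfidf_row` helper are assumptions for this example only.

```python
import math
from collections import Counter

docs = ["space shuttle launch", "space station crew", "hockey game tonight"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

vocab = sorted({word for tokens in tokenized for word in tokens})
# Document frequency: in how many documents each word occurs.
df = {word: sum(word in tokens for tokens in tokenized) for word in vocab}
# Smoothed inverse document frequency.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

def tfidf_row(tokens):
    """Raw term counts times IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]
print(round(idf["space"], 3), round(idf["hockey"], 3))
```

The word `space` occurs in two of the three documents and therefore gets a lower IDF than `hockey`, which occurs in only one.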

Apply TF-IDF vectorization with n-grams (`ngram_range=(1,2)`, i.e. single words and pairs of adjacent words).

In [26]:
X_train_ngram_stem, X_test_ngram_stem = bow(TfidfVectorizer(ngram_range=(1,2)), newsgroups_train.data_stemming, newsgroups_test.data_stemming)
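With `ngram_range=(1,2)` each document contributes its single words and its pairs of adjacent words as features. A small sketch of that extraction, using a hypothetical `word_ngrams` helper:

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

features = word_ngrams("space shuttle launch".split())
print(features)
# ['space', 'shuttle', 'launch', 'space shuttle', 'shuttle launch']
```

Bigrams keep some word-order information that single words lose, at the cost of a much larger vocabulary.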

### 5. Compare the vectorization methods by the quality of the SVM classifier.

Train one model per feature set.

In [40]:
from sklearn.svm import LinearSVC

# CountVectorizer
lsvc_count = LinearSVC()
lsvc_count.fit(X_train_bow_stem, newsgroups_train.target)
lsvc_count_test_pred = lsvc_count.predict(X_test_bow_stem)

# TFIDF
lsvc_tfidf = LinearSVC()
lsvc_tfidf.fit(X_train_tfdidf_stem, newsgroups_train.target)
lsvc_tfidf_test_pred = lsvc_tfidf.predict(X_test_tfidf_stem)

# NGRAM
lsvc_ngram = LinearSVC()
lsvc_ngram.fit(X_train_ngram_stem, newsgroups_train.target)
lsvc_ngram_test_pred = lsvc_ngram.predict(X_test_ngram_stem)
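By default `LinearSVC` treats a multi-class problem as one-vs-rest: it learns one linear scoring function per class and predicts the class with the highest score. The sketch below shows only that prediction step; the weights, biases and the `predict_ovr` helper are made-up illustrative values, not anything learned from the newsgroups data.

```python
def predict_ovr(weights, biases, x):
    """Score x with one linear function per class and return the best class."""
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    best_class = max(range(len(scores)), key=scores.__getitem__)
    return best_class, scores

# Hypothetical parameters for 3 classes over 3 features.
weights = [[1.0, -0.5, 0.0], [-0.2, 0.8, 0.1], [0.0, -0.3, 0.9]]
biases = [0.0, -0.1, 0.05]
x = [0.0, 0.2, 0.9]

predicted, scores = predict_ovr(weights, biases, x)
print(predicted)  # 2
```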



Compute several metrics for each model:

1. Accuracy
2. Precision
3. Recall
4. F1
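To make the weighted averages concrete, here is a pure-Python sketch of the four metrics on toy labels. The `weighted_metrics` helper is hypothetical; it follows scikit-learn's convention of taking the true labels first and weighting each class by its count among the true labels. The toy data also shows that weighted recall equals accuracy, and that swapping the two arguments changes weighted precision.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)  # class counts among the true labels
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        p_k = tp / predicted if predicted else 0.0
        r_k = tp / count
        f_k = 2 * p_k * r_k / (p_k + r_k) if p_k + r_k else 0.0
        weight = count / n
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k
    return accuracy, precision, recall, f1

y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2]

acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
_, prec_swapped, _, _ = weighted_metrics(y_pred, y_true)
print(acc, rec)            # both equal 5/7
print(prec, prec_swapped)  # 5/6 versus 6/7: argument order matters
```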

In [62]:
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)

Accuracy

TF-IDF with n-grams achieves the highest accuracy (about 0.856, versus 0.844 for plain TF-IDF and 0.781 for CountVectorizer).

In [45]:
# scikit-learn metrics take (y_true, y_pred); accuracy is symmetric, so the values are unchanged
lsvc_count_acc = accuracy_score(newsgroups_test.target, lsvc_count_test_pred)
print(lsvc_count_acc)

lsvc_tfidf_acc = accuracy_score(newsgroups_test.target, lsvc_tfidf_test_pred)
print(lsvc_tfidf_acc)

lsvc_ngram_acc = accuracy_score(newsgroups_test.target, lsvc_ngram_test_pred)
print(lsvc_ngram_acc)

0.781465746149761
0.8437334041423261
0.8555496548061604


Precision

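Again, TF-IDF with n-grams gives the highest weighted precision. Note that scikit-learn's metric functions expect `(y_true, y_pred)`; the calls below pass the predictions first, which swaps the roles of the true and predicted labels, so the values may differ slightly from the standard weighted precision.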

In [66]:
lsvc_count_prec = precision_score(lsvc_count_test_pred, newsgroups_test.target, average='weighted')
print(lsvc_count_prec)

lsvc_tfidf_prec = precision_score(lsvc_tfidf_test_pred, newsgroups_test.target, average='weighted')
print(lsvc_tfidf_prec)

lsvc_ngram_prec = precision_score(lsvc_ngram_test_pred, newsgroups_test.target, average='weighted')
print(lsvc_ngram_prec)

0.7860258338011773
0.849480236165773
0.8620500765154635


Recall

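The ranking is the same. With `average='weighted'`, recall coincides with accuracy, which is why the numbers below match the accuracy values exactly.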

In [68]:
# scikit-learn metrics take (y_true, y_pred); weighted recall equals accuracy, so the values are unchanged
lsvc_count_recall = recall_score(newsgroups_test.target, lsvc_count_test_pred, average='weighted')
print(lsvc_count_recall)

lsvc_tfidf_recall = recall_score(newsgroups_test.target, lsvc_tfidf_test_pred, average='weighted')
print(lsvc_tfidf_recall)

lsvc_ngram_recall = recall_score(newsgroups_test.target, lsvc_ngram_test_pred, average='weighted')
print(lsvc_ngram_recall)

0.781465746149761
0.8437334041423261
0.8555496548061604


F1 score

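The weighted F1 score gives the same ranking. The calls pass the predictions as the first argument (scikit-learn expects `y_true` first), so the class weights come from predicted rather than true label counts.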

In [49]:
lsvc_count_f1 = f1_score(lsvc_count_test_pred, newsgroups_test.target, average='weighted')
print(lsvc_count_f1)

lsvc_tfidf_f1 = f1_score(lsvc_tfidf_test_pred, newsgroups_test.target, average='weighted')
print(lsvc_tfidf_f1)

lsvc_ngram_f1 = f1_score(lsvc_ngram_test_pred, newsgroups_test.target, average='weighted')
print(lsvc_ngram_f1)

0.7823339419720996
0.8448702531127927
0.8569422775357723


### Conclusion: of the three vectorization methods, TF-IDF with n-grams performed best on every metric (accuracy of about 0.856, versus 0.844 for plain TF-IDF and 0.781 for raw counts).