<a href="https://colab.research.google.com/github/CamelGoong/NLP/blob/main/%5BTutorial%5DText_Classificastion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Text Classification

reference: [Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK.](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a)

# Description

Document (text) classification is one of the most important and typical tasks in supervised machine learning (ML).
Assigning categories to documents, which can be web pages, library books, media articles, galleries, etc., has many applications, such as spam filtering, email routing, and sentiment analysis.
In this article, I would like to demonstrate how to do text classification using Python, scikit-learn, and a little bit of NLTK.



1. Prerequisite and setting up the environment.
2. Loading the data set in Jupyter.
3. Extracting features from text files.
4. Running ML algorithms.
5. Grid Search for parameter tuning.
6. Useful tips and a touch of NLTK.



# Data preprocessing

In [2]:
# Load the dataset
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset = "train", shuffle = True)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [3]:
# Inspect the data
print(twenty_train.target_names)
print("\n".join(twenty_train.data[0].split("\n")[:3]))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu


In [4]:
# Extract features from the text files (convert them to vectors)
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape # (n_samples, n_features)

(11314, 130107)
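
To make the `(n_samples, n_features)` shape concrete, here is a minimal sketch on a hypothetical three-document toy corpus (not part of 20 Newsgroups). Each row is a document, each column is a token from the learned vocabulary, and each cell is a count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)  # sparse matrix, shape (n_samples, n_features)

# vocabulary_ maps token -> column index; columns are in alphabetical order.
vocab = sorted(vect.vocabulary_)
print(vocab)             # ['bark', 'cat', 'dogs', 'mat', 'on', 'sat', 'the']
print(counts.shape)      # (3, 7)
print(counts.toarray())  # dense view is fine for a toy corpus only
```

On the real data set the matrix stays sparse, which is why 11314 x 130107 entries fit in memory.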

In [5]:
# TF/IDF
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)
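
TF-IDF keeps the matrix shape but reweights the counts: tokens that occur in many documents are scaled down, and with the default settings each row is normalized to unit length. A small sketch on a hypothetical toy corpus, assuming the default `TfidfTransformer` parameters:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus: "the" occurs in every document, "cat" in only one.
docs = ["the cat sat", "the dog sat", "the bird flew"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

row = tfidf.toarray()[0]
the_w = row[vect.vocabulary_["the"]]  # common token, low weight
cat_w = row[vect.vocabulary_["cat"]]  # rare token, high weight
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

print(the_w, cat_w)  # the weight of "cat" is larger than that of "the"
print(row_norms)     # every row has L2 norm 1 with the default norm='l2'
```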

# Method 1. Naive Bayes

In [6]:
# Run an ML algorithm: Naive Bayes
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
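
When the vectorizer, transformer, and classifier are fitted separately like this, new documents have to go through `transform` (not `fit_transform`) so they are mapped onto the vocabulary learned from the training data. A minimal sketch with a hypothetical toy corpus and made-up labels:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 0 = graphics, 1 = baseball.
train_docs = [
    "the graphics card renders images",
    "the pitcher threw the baseball",
    "render the image with opengl",
    "the team won the baseball game",
]
train_labels = [0, 1, 0, 1]

count_vect = CountVectorizer()
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(count_vect.fit_transform(train_docs))
clf = MultinomialNB().fit(X_train, train_labels)

# New documents: reuse the fitted objects with transform only.
new_docs = ["opengl on the graphics card", "a baseball game"]
X_new = tfidf.transform(count_vect.transform(new_docs))
predicted = clf.predict(X_new)
print(predicted)  # [0 1]
```

Keeping these three steps in sync by hand is error-prone, which is the motivation for wrapping them in a `Pipeline`.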

In [7]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [8]:
import numpy as np
twenty_test = fetch_20newsgroups(subset = 'test', shuffle = True)
predicted = text_clf.predict(twenty_test.data)

print(np.mean(predicted == twenty_test.target)) # accuracy

0.7738980350504514
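
The `np.mean(predicted == target)` expression is the plain accuracy. A small sketch with hypothetical label arrays shows that it matches `sklearn.metrics.accuracy_score`, and that a confusion matrix gives a per-class view a single number hides:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for a 3-class problem, for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

acc_manual = np.mean(y_pred == y_true)        # fraction of matching labels
acc_sklearn = accuracy_score(y_true, y_pred)  # same quantity via scikit-learn
cm = confusion_matrix(y_true, y_pred)         # rows: true class, columns: predicted class

print(acc_manual, acc_sklearn)
print(cm)
```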


# Method 2. SVM (Support Vector Machine)

In [9]:
from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([
                         ('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss = 'hinge', penalty = 'l2', alpha=1e-3, random_state = 42)),
])

_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)

predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target) # accuracy

0.8240839086563994
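
`SGDClassifier` with `loss='hinge'` fits a linear SVM by stochastic gradient descent. For more than two classes it trains one-vs-rest classifiers, and the predicted class is the one with the largest decision score. A minimal sketch on a hypothetical six-document corpus with three made-up classes:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = graphics, 1 = baseball, 2 = space.
docs = [
    "the gpu renders the image", "opengl draws a triangle",
    "the pitcher threw the ball", "the team won the game",
    "the rocket reached orbit", "nasa launched a satellite",
]
labels = [0, 0, 1, 1, 2, 2]

svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42)),
])
svm.fit(docs, labels)

scores = svm.decision_function(docs)  # one column of scores per class
pred = svm.predict(docs)
classes = svm.named_steps['clf-svm'].classes_

print(scores.shape)  # (6, 3)
print(np.array_equal(pred, classes[scores.argmax(axis=1)]))  # True
```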

# Method 3. Grid Search (this approach appears to be a technique that helps Method 1, Naive Bayes, and Method 2, SVM, produce better results)

> Improving Method 1. Naive Bayes

In [10]:
from sklearn.model_selection import GridSearchCV

parameters = {'vect__ngram_range': [(1, 1), (1, 2)], # n-gram range: how many consecutive words are treated as a single token
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
              }

gs_clf = GridSearchCV(text_clf, parameters, n_jobs = -1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)
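
To see what the `vect__ngram_range` values in the grid change, here is a small sketch on a hypothetical one-document corpus. `(1, 1)` keeps single words only, while `(1, 2)` adds every pair of adjacent words as an extra token.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["new york times"]  # hypothetical example document

unigrams = sorted(CountVectorizer(ngram_range=(1, 1)).fit(doc).vocabulary_)
uni_bi = sorted(CountVectorizer(ngram_range=(1, 2)).fit(doc).vocabulary_)

print(unigrams)  # ['new', 'times', 'york']
print(uni_bi)    # ['new', 'new york', 'times', 'york', 'york times']
```

The parameter names in the grid follow the `<step name>__<parameter>` convention, so `vect__ngram_range` sets `ngram_range` on the pipeline step named `vect`.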



In [12]:
print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.9157684864695698
{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
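
Note that `best_score_` is the mean cross-validated score of the best parameter combination, computed on the training data only, so it is not directly comparable to an accuracy measured on the test subset. A minimal sketch of scoring the tuned model on held-out documents, assuming a hypothetical toy corpus and `cv=2` so that the tiny data set can be split:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = graphics, 1 = baseball.
train_docs = [
    "the gpu renders the image", "opengl draws a triangle",
    "render pixels on the screen", "the shader draws pixels",
    "the pitcher threw the ball", "the team won the game",
    "a home run won the game", "the batter hit the ball",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_docs = ["the shader renders a triangle", "the batter won the game"]
test_labels = [0, 1]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])
grid = {'vect__ngram_range': [(1, 1), (1, 2)],
        'clf__alpha': (1e-2, 1e-3)}

gs = GridSearchCV(pipe, grid, cv=2)  # cv=2 only because the toy corpus is tiny
gs.fit(train_docs, train_labels)

cv_score = gs.best_score_                       # mean accuracy over the CV folds
test_score = gs.score(test_docs, test_labels)   # accuracy of the refit best model on held-out data
n_candidates = len(gs.cv_results_['params'])    # 2 x 2 = 4 combinations tried

print(gs.best_params_, cv_score, test_score, n_candidates)
```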


> Improving Method 2. SVM

In [13]:
from sklearn.model_selection import GridSearchCV
parameters_svm = {'vect__ngram_range' : [(1,1), (1,2)],
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3),
                  }

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs = -1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)

print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)

0.9051618841994754
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


# Useful tips and a touch of NLTK

In [None]:
# Apply stop-word removal (not done above)
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer(stop_words = 'english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])

# With this preprocessing in place, running the rest of the Naive Bayes steps as before is expected to improve accuracy.
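
A quick sketch of what `stop_words='english'` does to the vocabulary, using two hypothetical sentences. Very common function words are dropped by scikit-learn's built-in English list, leaving the content words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentences.
docs = ["the car is in the garage", "this is a fast car"]

plain = CountVectorizer().fit(docs)
no_stop = CountVectorizer(stop_words='english').fit(docs)

plain_vocab = sorted(plain.vocabulary_)
filtered_vocab = sorted(no_stop.vocabulary_)

print(plain_vocab)     # includes 'the', 'is', 'in', 'this'
print(filtered_vocab)  # function words removed, 'car' and 'garage' kept
```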

> About NLTK (SnowballStemmer)

With NLTK, words that come from the same stem can be reduced to their root-word form.
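
A minimal sketch of what stemming does, using a hypothetical word list. Without `ignore_stopwords=True` the stemmer should not need any NLTK corpus download.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hypothetical example words: inflected forms collapse to a shared stem.
words = ["running", "runs", "run", "connected", "connection"]
stems = [stemmer.stem(w) for w in words]

print(stems)  # ['run', 'run', 'run', 'connect', 'connect']
```

Collapsing inflected forms this way shrinks the vocabulary, so counts for related words are pooled into a single feature.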

In [28]:
import nltk
nltk.download("stopwords")  # corpus required by ignore_stopwords=True
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    # Stem every token produced by the standard CountVectorizer analyzer
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

stemmed_count_vect = StemmedCountVectorizer(stop_words='english')
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                             ('tfidf', TfidfTransformer()),
                             ('mnb', MultinomialNB(fit_prior=False)),
                             ])
text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)
predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)
np.mean(predicted_mnb_stemmed == twenty_test.target)