# 1. Text Classification
* Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK.
* Reference: [https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a)

### Steps
1. Prerequisite and setting up the environment.
2. Loading the data set in jupyter.
3. Extracting features from text files.
4. Running ML algorithms.
5. Grid Search for parameter tuning.
6. Useful tips and a touch of NLTK.

## Step 1: Prerequisite and setting up the environment.

## Step 2: Loading the data set in jupyter.
* The data set used in this example is the famous **"20 Newsgroups"** data set.

In [1]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [2]:
print(twenty_train.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [3]:
# prints all the categories
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [12]:
import numpy as np
from pprint import pprint

pprint(twenty_train.target)
pprint(np.unique(twenty_train.target))

array([7, 4, 4, ..., 3, 1, 8])
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])


In [8]:
# prints the first three lines of the first data file
print("\n".join(twenty_train.data[0].split('\n')[:3]))

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu


## Step 3: Extracting features from text files.

In order to run machine learning algorithms, we need to convert the text files into **numerical feature vectors**. We will use the **bag of words** model for our example.  
Briefly, we segment each text file into words (for English, by splitting on whitespace), **count the number of times each word occurs in each document**, and finally **assign each word an integer id**. **Each unique word** in our dictionary will correspond to a **feature**.
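
The three steps just described (split, count, assign ids) can be sketched in plain Python. This is only an illustration on a made-up two-document corpus; the names `docs`, `vocab`, and `vectors` are hypothetical and are not part of the notebook.

```python
from collections import Counter

# Toy corpus (an assumption for illustration, not the 20 Newsgroups data).
docs = [
    "the cat sat on the mat",
    "the dog sat",
]

# Assign each unique word an integer id, in order of first appearance.
vocab = {}
for doc in docs:
    for word in doc.split():
        vocab.setdefault(word, len(vocab))

# Count how many times each vocabulary word occurs in each document.
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts.get(word, 0) for word in vocab])

print(vocab)    # word -> integer id
print(vectors)  # one count vector per document, one column per word
```

Each row of `vectors` is the feature vector of one document, and each column corresponds to one unique word.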

#### Notes
Reference: [http://hleecaster.com/nlp-bag-of-words-concept/](http://hleecaster.com/nlp-bag-of-words-concept/)

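* Statistical Language Model (SLM)
    - A statistical language model is an approach that lets a computer understand language on the basis of a **"probability distribution"**.  
    
    
* Neural Network Language Model (NNLM)  


* BoW (Bag-of-Words)
    - **A representation obtained by applying a statistical language model to vocabulary frequencies (counts)**
    - In other words, count how often each word appears and work with those counts
    - Because the corpus is thrown into a bag and only the word counts are taken out and examined, **word order is ignored**
    - It has several limitations, but it is widely used because it performs comparatively well and is broadly applicable.
    - In practice, a **vocabulary (word dictionary)** is built first, and each document (text) is then converted into a vector (vectorization).  
    
    
* Document-Term matrix
    - **A matrix that holds the frequency of each word in each document**
    - It is not a separate concept from BoW; think of it as BoW expressed in matrix form so that it can actually be used
    - Turning your documents into a document-term matrix is usually the real starting point of natural language processing, analysis, and text mining
    - In scikit-learn, **CountVectorizer** makes this very easy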
    

* TF-IDF (Term Frequency-Inverse Document Frequency)
    - When there are many documents, building the document-term matrix from raw **term frequency** alone can be problematic
    - **Even if a word is used many times in one document, it has low specificity if it is a common word that also appears widely across all the other documents**
    - TF-IDF was introduced to go beyond raw term frequency: it also takes into account whether a word that appears often in one document appears rarely in the others.
    - In other words, TF-IDF is not a plain term frequency but a weighted one
    - Document-term matrices built with TF-IDF are therefore very commonly used for analysis
    - Besides CountVectorizer, which uses raw counts, scikit-learn provides the **TfidfVectorizer** class, which uses TF-IDF
    
    
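* n-gram
    - A concept introduced to address the limitations of BoW
    - **Group words n at a time and treat each group as a single feature (word)**
        - ex) "This food is so delicious."
            - BoW: "This", "food", "is", "so", "delicious"
            - bigram: "This food", "food is", "is so", "so delicious"
        - BoW considers each word independently, whereas bigram or trigram models effectively take word order and neighboring words into account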
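
To make the n-gram idea concrete, here is a minimal sketch that builds unigrams and bigrams from a whitespace-tokenized sentence. The helper `ngrams` and the sample sentence are assumptions for illustration only; in scikit-learn the same effect is requested through the `ngram_range` parameter of `CountVectorizer`.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, tokenized by whitespace.
tokens = "this food is so delicious".split()

unigrams = ngrams(tokens, 1)  # plain bag-of-words features
bigrams = ngrams(tokens, 2)   # pairs of neighboring words

print(unigrams)
print(bigrams)
```

A sentence with `k` tokens yields `k - n + 1` n-grams, so the feature space grows quickly as `n` increases.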
        
   
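#### Conclusion
- Even with an n-gram model applied, BoW only counts frequencies, so it is **hard to use in situations that require rich context, such as generating or predicting text.**
- Moreover, because it relies on a vocabulary learned from the training data, it has no way to handle new words that are not in that vocabulary. Because it depends so heavily on the training data, **overfitting occurs.** (In fact, overfitting is to be expected in statistical language models.)
- There is a strategy called **smoothing** that mitigates this problem to some extent.
    - Smoothing estimates the probability of unknown words from the probabilities of the words that are already known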
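
One common form of smoothing is additive (Laplace) smoothing, which is also what the `alpha` parameter of `MultinomialNB` controls. The sketch below is an illustration under stated assumptions: the function `smoothed_prob`, the toy counts, and the choice to reserve exactly one vocabulary slot for an unseen word are all hypothetical.

```python
from collections import Counter

def smoothed_prob(word, counts, vocab_size, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

counts = Counter("the cat sat on the mat".split())

# Assumption: the vocabulary is the 5 seen words plus 1 slot for an unseen word.
vocab_size = len(counts) + 1

p_seen = smoothed_prob("the", counts, vocab_size)    # (2 + 1) / (6 + 6)
p_unseen = smoothed_prob("dog", counts, vocab_size)  # (0 + 1) / (6 + 6)
p_unsmoothed = smoothed_prob("dog", counts, vocab_size, alpha=0.0)

print(p_seen, p_unseen, p_unsmoothed)
```

Without smoothing the unseen word gets probability zero, which would wipe out the whole product of word probabilities in a Naive Bayes model; with `alpha > 0` it gets a small but non-zero share.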

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

In [15]:
X_train_counts[0]

<1x130107 sparse matrix of type '<class 'numpy.int64'>'
	with 89 stored elements in Compressed Sparse Row format>

As a result, `fit_transform` learns the vocabulary dictionary and returns a Document-Term matrix of shape  
[n_samples, n_features]
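
The full matrix above is too large to inspect, so here is a small sketch on a made-up corpus that shows what the learned vocabulary and the Document-Term matrix look like. The corpus and variable names are assumptions; only `CountVectorizer`, `fit_transform`, and `vocabulary_` come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption for illustration).
docs = ["the cat sat on the mat", "the dog sat"]

vect = CountVectorizer()
dtm = vect.fit_transform(docs)  # sparse matrix of shape [n_samples, n_features]

# vocabulary_ maps each word to its column index (assigned alphabetically).
print(sorted(vect.vocabulary_.items(), key=lambda kv: kv[1]))
print(dtm.shape)
print(dtm.toarray())  # dense view, only sensible for tiny examples
```

Calling `toarray()` on the real 11314 x 130107 matrix would allocate a very large dense array, which is why scikit-learn keeps the result sparse.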


* **TF**: Just counting the number of words in each document has one issue: it gives more weight to longer documents than to shorter ones.  
To avoid this, we can use term frequencies (TF) instead, i.e.  
\# count(word) / # total words, in each document.  
* **TF-IDF**: Additionally reduces the weight of more common words such as 'the', 'is', 'an', etc., which occur in many documents.
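
The two definitions above can be worked through by hand. The sketch below uses the textbook form `tf * log(N / df)` on a hypothetical three-document corpus; it is not what `TfidfTransformer` computes exactly, since scikit-learn by default uses a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, applied to raw counts and then L2-normalizes each row.

```python
import math
from collections import Counter

# Toy tokenized corpus (an assumption for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]

def tf(word, doc):
    """Term frequency: count(word) / total words in the document."""
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    """Inverse document frequency; assumes the word occurs in at least one document."""
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf("the", docs[0]))            # frequent inside the first document
print(tf_idf("the", docs[0], docs))  # but it occurs in every document
print(tf_idf("cat", docs[0], docs))  # rarer across documents, so weighted higher
```

Here 'the' has the highest term frequency in the first document, yet its TF-IDF weight drops to zero because it appears in all three documents.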

In [16]:
# Transform a count matrix to a normalized tf or tf-idf representation
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)  # feed in the counts produced by CountVectorizer
X_train_tfidf.shape

(11314, 130107)

## Step 4: Running ML algorithms.

* a. **Naive Bayes**: [https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes)

In [17]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

***Building a pipeline:*** *We can write less code and do all of the above, by building a pipeline as follows:*

In [19]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
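
To see what the pipeline saves, the sketch below runs the same three steps by hand and through a `Pipeline` on a tiny made-up corpus and checks that both give the same predictions. The documents and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (an assumption): 1 = spam-like, 0 = work-like.
train_docs = ["cheap pills buy now", "buy cheap watches",
              "meeting at noon", "project meeting notes"]
train_labels = [1, 1, 0, 0]
test_docs = ["buy pills", "notes from the meeting"]

# Manual chain: every transformer must be fitted on train and re-applied to test.
vect, tfidf, clf = CountVectorizer(), TfidfTransformer(), MultinomialNB()
X_train = tfidf.fit_transform(vect.fit_transform(train_docs))
clf.fit(X_train, train_labels)
manual_pred = clf.predict(tfidf.transform(vect.transform(test_docs)))

# Pipeline: the same steps, but fit/predict take raw text directly.
pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(test_docs)

print(manual_pred, pipe_pred)
```

Besides being shorter, the pipeline makes it harder to accidentally call `fit_transform` on the test data, because `predict` only ever calls `transform` on the intermediate steps.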

***Performance of NB Classifier:*** *Now we will test the performance of the NB classifier on the test set.*

In [21]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
print(np.round(np.mean(predicted == twenty_test.target), 5)*100, "%")

77.39 %


In [23]:
np.mean(predicted == twenty_test.target)

0.7738980350504514

In [30]:
dict(zip(*np.unique((predicted == twenty_test.target), return_counts=True)))

{False: 1703, True: 5829}

* b. **Support Vector Machines(SVM)**: [https://scikit-learn.org/stable/modules/svm.html](https://scikit-learn.org/stable/modules/svm.html)

If you want to fit **a large-scale linear classifier without copying a dense numpy C-contiguous double precision array as input**, consider using the **SGDClassifier (stochastic gradient descent)** class instead. With `loss='hinge'`, as below, it trains a linear SVM.

In [38]:
from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('clf-svm', SGDClassifier(loss='hinge',
                                                 penalty='l2',
                                                 alpha=1e-3,
                                                 random_state=42)),
                        ])

_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)

In [39]:
pred_svm = text_clf_svm.predict(twenty_test.data)
print(np.round(np.mean(pred_svm == twenty_test.target), 5)*100, "%")

82.408 %


## Step 5: Grid Search for parameter tuning.

Almost all classifiers have various parameters that can be tuned to obtain optimal performance. scikit-learn provides an extremely useful tool for this, `GridSearchCV`.

* a. Naive Bayes

In [58]:
# find optimal performance among various parameters
from sklearn.model_selection import GridSearchCV

# Each parameter name starts with the name of its pipeline step, followed by a double underscore!
parameters = {'vect__ngram_range':[(1, 1), (1, 2)],    # try unigrams only and unigrams + bigrams, and choose whichever is optimal.
             'tfidf__use_idf':(True, False),
             'clf__alpha': (1e-2, 1e-3),              # 1/100, 1/1000
             }
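
As a quick sanity check before launching a search, the sketch below verifies that every key in the grid is a real pipeline parameter and counts how many combinations will be tried. `ParameterGrid` and `get_params` are scikit-learn APIs; the variable names are assumptions, and the pipeline is rebuilt here only so that the snippet is self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])

# Keys follow the pattern <step name>__<parameter name>.
grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

# A misspelled key would be missing from get_params() and fail during the search.
unknown = [name for name in grid if name not in pipe.get_params()]
combos = list(ParameterGrid(grid))

print(unknown)      # names that the pipeline does not recognize
print(len(combos))  # 2 * 2 * 2 candidate settings
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is the number of combinations multiplied by the number of folds.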

In [59]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs = -1)  # n_jobs = -1 : to use multiple cores from user machine.
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)



In [60]:
# see the best mean score and the params

print(np.round(gs_clf.best_score_ *100, 2), '%')
print(gs_clf.best_params_)

90.68 %
{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


***Performance of NB Classifier after Grid Search***

In [61]:
text_clf_final = Pipeline([('vect', CountVectorizer(ngram_range = (1, 2))),
                           ('tfidf', TfidfTransformer(use_idf = True)),
                           ('clf', MultinomialNB(alpha = 0.01))])
text_clf_final = text_clf_final.fit(twenty_train.data, twenty_train.target)

pred_final = text_clf_final.predict(twenty_test.data)

In [62]:
print(np.round(np.mean(pred_final == twenty_test.target), 4)*100, '%')

83.44 %


***77.39% => 83.44%***
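
Note that `best_score_` (90.68% above) is the mean cross-validated score on the training data, so it is not directly comparable to the 83.44% measured on the test set. Also, because `GridSearchCV` refits the best setting on the whole training set by default (`refit=True`), the fitted search object can predict directly, without rebuilding the pipeline by hand. A minimal sketch on made-up data, where the corpus, labels, and `cv=2` are assumptions chosen to keep the example tiny:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (an assumption): 1 = spam-like, 0 = work-like.
docs = ["buy cheap pills now", "cheap watches buy today",
        "win money now", "cheap money offer",
        "meeting at noon today", "project meeting notes",
        "lunch with the team", "team project update"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
grid = {"vect__ngram_range": [(1, 1), (1, 2)], "clf__alpha": (1e-2, 1e-3)}

search = GridSearchCV(pipe, grid, cv=2)  # refit=True by default
search.fit(docs, labels)

best_pipe = search.best_estimator_       # pipeline refitted with best_params_
new_docs = ["buy cheap pills"]
print(search.best_params_)
print(search.predict(new_docs))          # delegates to best_pipe.predict
```

On the real data this would amount to calling `gs_clf.predict(twenty_test.data)` instead of copying the best parameters into a new pipeline.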

* b. Support Vector Machine(SVM)

In [66]:
param_svm = {'vect__ngram_range': [(1,1), (1,2)],
            'tfidf__use_idf': (True, False),
            'clf-svm__alpha': (1e-2, 1e-3),
            }
gs_clf_svm = GridSearchCV(text_clf_svm, param_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)



In [68]:
print(np.around(gs_clf_svm.best_score_ * 100, 2), '%')
print(gs_clf_svm.best_params_)

90.01 %
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


***Performance of SVM Classifier after Grid Search***

In [69]:
text_clf_svm_final = Pipeline([('vect', CountVectorizer(ngram_range = (1,2))),
                              ('tfidf', TfidfTransformer(use_idf =True)),
                              ('clf-svm', SGDClassifier(loss = 'hinge',
                                                       penalty = 'l2',
                                                       alpha = 1e-3,
                                                       random_state = 42))])
text_clf_svm_final = text_clf_svm_final.fit(twenty_train.data, twenty_train.target)
pred_svm_final = text_clf_svm_final.predict(twenty_test.data)

In [70]:
print(np.around(np.mean(pred_svm_final == twenty_test.target), 3)*100, '%')

83.5 %


***82.408 % => 83.5%***

## Step 6: Useful tips and a touch of NLTK.

(1) **Removing stop words** (the, then, etc.) from the data. In most text classification problems, stop words are indeed not useful. Update the code that creates the CountVectorizer object as follows:

**Let's see if removing stop words increases the accuracy.**
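
Before measuring accuracy, it can help to see what `stop_words='english'` actually does to the vocabulary. The sketch below compares the vocabularies learned with and without it on two made-up sentences; the sentences and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences (an assumption for illustration).
docs = ["the car is in the garage", "then the engine started"]

plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words="english").fit(docs)

print(sorted(plain.vocabulary_))     # includes words such as 'the' and 'then'
print(sorted(filtered.vocabulary_))  # built-in English stop words are dropped
```

The filtered vocabulary is smaller, so the resulting document-term matrix has fewer columns and the remaining counts are dominated by content words.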

* a. Naive Bayes

In [82]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [83]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
pred = text_clf.predict(twenty_test.data)
print(np.round(np.mean(pred == twenty_test.target), 4)*100, '%')

81.69 %


***77.39% => 81.69 %***

You can try the same for SVM and also while doing grid search.

In [84]:
# find optimal performance among various parameters
from sklearn.model_selection import GridSearchCV

# Each parameter name starts with the name of its pipeline step, followed by a double underscore!
param = {'vect__ngram_range':[(1, 1), (1, 2)],   # unigrams only vs. unigrams + bigrams: whichever is optimal
         'tfidf__use_idf':(True, False),
         'clf__alpha': (1e-2, 1e-3)}             # 1/100, 1/1000

In [88]:
gs_clf = GridSearchCV(text_clf, param, n_jobs = -1)      # n_jobs = -1 : to use multiple cores from user machine.
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)



In [94]:
# see the best mean score and the params

print(np.round(gs_clf.best_score_ * 100, 2), '%')
print(gs_clf.best_params_)

90.58 %
{'clf__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}


***Performance of NB Classifier after Grid Search***

In [95]:
# text_clf_final = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range = (1, 2))),
#                            ('tfidf', TfidfTransformer(use_idf = True)),
#                            ('clf', MultinomialNB(alpha = 0.01))])
# text_clf_final = text_clf_final.fit(twenty_train.data, twenty_train.target)

# pred_final = text_clf_final.predict(twenty_test.data)

In [96]:
# print(np.round(np.mean(pred_final == twenty_test.target), 5)*100, "%")

83.006 %


In [97]:
text_clf_final = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range = (1, 1))),
                           ('tfidf', TfidfTransformer(use_idf = False)),
                           ('clf', MultinomialNB(alpha = 0.01))])
text_clf_final = text_clf_final.fit(twenty_train.data, twenty_train.target)

pred_final = text_clf_final.predict(twenty_test.data)

In [98]:
print(np.round(np.mean(pred_final == twenty_test.target), 5)*100, "%")

84.134 %


***81.69% => 84.134%***

* b. Support Vector Machine(SVM)

In [99]:
from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')),
                              ('tfidf', TfidfTransformer()),
                              ('clf-svm', SGDClassifier(loss = 'hinge',
                                                       penalty = 'l2',
                                                       alpha = 1e-3,
                                                       random_state = 42))])
_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)

In [100]:
pred_svm = text_clf_svm.predict(twenty_test.data)
print(np.round(np.mean(pred_svm == twenty_test.target), 5)*100, "%")

82.262 %


***82.408% => 82.262%***

In [101]:
param_svm = {'vect__ngram_range':[(1, 1), (1, 2)],
             'tfidf__use_idf':(True, False),
             'clf-svm__alpha':(1e-2, 1e-3)}
gs_clf_svm = GridSearchCV(text_clf_svm, param_svm, n_jobs = -1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)



In [102]:
print(np.around(gs_clf_svm.best_score_ * 100, 2), '%')
print(gs_clf_svm.best_params_)

89.61 %
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


***Performance of SVM Classifier after Grid Search***

In [103]:
text_clf_svm_final = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range = (1,2))),
                              ('tfidf', TfidfTransformer(use_idf =True)),
                              ('clf-svm', SGDClassifier(loss = 'hinge',
                                                       penalty = 'l2',
                                                       alpha = 1e-3,
                                                       random_state = 42))])
text_clf_svm_final = text_clf_svm_final.fit(twenty_train.data, twenty_train.target)
pred_svm_final = text_clf_svm_final.predict(twenty_test.data)

In [104]:
print(np.round(np.mean(pred_svm_final == twenty_test.target), 3)*100, '%')

83.2 %


***82.408% => 83.2%***

(2) **fit_prior=False**: When `fit_prior` is set to `False` for MultinomialNB, a uniform prior is used instead of class priors learned from the training data.

In [105]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB(fit_prior = False))])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [106]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
pred = text_clf.predict(twenty_test.data)
print(np.round(np.mean(pred == twenty_test.target), 4)*100, "%")

82.14 %


***81.69% => 82.14%***
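
The effect of `fit_prior` is easiest to see on an imbalanced toy problem. In the sketch below, the count matrix `X` and labels `y` are made up for illustration; `class_log_prior_` is the fitted attribute in which `MultinomialNB` stores the log of the class priors.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features (an assumption): three samples of class 0, one of class 1.
X = np.array([[2, 0], [1, 0], [3, 1], [0, 2]])
y = np.array([0, 0, 0, 1])

learned = MultinomialNB(fit_prior=True).fit(X, y)
uniform = MultinomialNB(fit_prior=False).fit(X, y)

learned_prior = np.exp(learned.class_log_prior_)  # class frequencies in y
uniform_prior = np.exp(uniform.class_log_prior_)  # equal weight per class

print(learned_prior)
print(uniform_prior)
```

With a learned prior the model leans toward the majority class; with a uniform prior only the word likelihoods decide, which can help when class frequencies in the training set are not representative.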

(3) **Stemming**: Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. (e.g., a stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word "fish".)

We need NLTK, which comes with various stemmers that can help reduce words to their root form.
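
As a quick illustration of what a stemmer does, the sketch below applies NLTK's Snowball stemmer to a few sample words. The word list is an assumption for illustration. The stemmer is created here without `ignore_stopwords=True`, on the assumption that this form does not need any downloaded NLTK data; the `ignore_stopwords=True` variant relies on the NLTK stop word corpus being available.

```python
from nltk.stem.snowball import SnowballStemmer

# Plain English Snowball stemmer (no stop word handling).
stemmer = SnowballStemmer("english")

# Sample words (an assumption for illustration).
words = ["fishing", "fished", "running", "runs"]
stems = [stemmer.stem(w) for w in words]

print(dict(zip(words, stems)))
```

Different inflections collapse onto one stem, so they end up sharing a single column in the document-term matrix instead of being counted as unrelated features.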

In [1]:
# Snowball stemmer works very well for English language
import nltk

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Users\user\Anaconda3\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "C:\Users\user\Anaconda3\lib\site-packages\nltk\downloader.py", line 2106, in run
    for msg in self.data_server.incr_download(self.items):
  File "C:\Users\user\Anaconda3\lib\site-packages\nltk\downloader.py", line 623, in incr_download
    for msg in self._download_list(info_or_id, download_dir, force):
  File "C:\Users\user\Anaconda3\lib\site-packages\nltk\downloader.py", line 669, in _download_list
    for msg in self.incr_download(item, download_dir, force):
  File "C:\Users\user\Anaconda3\lib\site-packages\nltk\downloader.py", line 637, in incr_download
    for msg in self.incr_download(info.children, download_dir, force):
  File "C:\Users\user\Anaconda3\lib\site-packages\nltk\downloader.py", line 623, in incr_download
    for msg in self._download_list(info_or_id, download_dir, force):
  File "C:\Users\user\Anaco

True

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

In [None]:
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Wrap the default analyzer so that every token it yields is stemmed
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

In [None]:
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                            ('tfidf', TfidfTransformer()),
                            ('mnb', MultinomialNB(fit_prior = False))])
text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)

In [None]:
predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)
print(np.round(np.mean(predicted_mnb_stemmed == twenty_test.target)*100,3), '%')