# Text Analytics

NLP (Natural Language Processing):  
A field that has developed with a focus on enabling machines to understand and interpret human language.

Text Analytics:  
Also called text mining, this field has developed with more of a focus on extracting meaningful information from unstructured text.

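- Text Classification:  
    - Also called text categorization.  
    - Predicts which class or category a document belongs to.  
    - Applies supervised learning.  
- Sentiment Analysis:  
    - Analyzes the "subjective elements" expressed in text, such as emotions, judgments, beliefs, opinions, and moods.  
    - The most actively used area of text analytics.  
    - Can be done with unsupervised learning as well as supervised learning.  
- Text Summarization:  
    - Extracts the important topics or central ideas from a text.  
    - Topic modeling  
- Text Clustering and Similarity Measurement:  
    - Groups documents of a similar type into clusters.  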

### Tokenization

Sentence Tokenization

In [1]:
from nltk import sent_tokenize
import nltk
nltk.download('punkt')

text_sample = 'The Matrix is everywhere its all around us, here even in this room. \
               You can see it out your window or on your television. \
               You feel it when you go to work, or go to church or pay your taxes.'

sentences = sent_tokenize(text=text_sample)

print(type(sentences), len(sentences))
print(sentences)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Donggeol.Yang\AppData\Roaming\nltk_data...


<class 'list'> 3
['The Matrix is everywhere its all around us, here even in this room.', 'You can see it out your window or on your television.', 'You feel it when you go to work, or go to church or pay your taxes.']


[nltk_data]   Unzipping tokenizers\punkt.zip.


Word Tokenization:  
Splits a sentence into words on whitespace, commas (,), periods (.), newline characters, and similar delimiters.

In [2]:
from nltk import word_tokenize

sentence = "The Matrix is everywhere its all around us, here even in this room."
words = word_tokenize(sentence)
print(type(words), len(words))
print(words)

<class 'list'> 15
['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.']
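
The splitting rule described above can be roughly approximated with a regular expression. This is only a sketch for intuition: `simple_word_tokenize` is a hypothetical helper, not the algorithm NLTK's `word_tokenize` uses, and it will differ from NLTK on contractions and other edge cases.

```python
import re

def simple_word_tokenize(sentence):
    # \w+ matches a run of letters/digits; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", sentence)

sentence = "The Matrix is everywhere its all around us, here even in this room."
tokens = simple_word_tokenize(sentence)
print(len(tokens))  # 15
print(tokens)
```

For this particular sentence the result has the same 15 tokens as the NLTK output above, with the comma and the period kept as separate tokens.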


### Stop Word Removal  
Removes words that carry little meaning for the analysis.
- The NLTK library ships with built-in stop word lists for each supported language

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Donggeol.Yang\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [5]:
len(nltk.corpus.stopwords.words('english'))

179

In [7]:
print(nltk.corpus.stopwords.words('english')[:50])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be']


In [8]:
from nltk import word_tokenize, sent_tokenize

def tokenize_text(text):
    sentences = sent_tokenize(text)
    word_tokens = [word_tokenize(sentence) for sentence in sentences]
    
    return word_tokens


word_tokens = tokenize_text(text_sample)
print(type(word_tokens), len(word_tokens))
print(word_tokens)

<class 'list'> 3
[['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.'], ['You', 'can', 'see', 'it', 'out', 'your', 'window', 'or', 'on', 'your', 'television', '.'], ['You', 'feel', 'it', 'when', 'you', 'go', 'to', 'work', ',', 'or', 'go', 'to', 'church', 'or', 'pay', 'your', 'taxes', '.']]


In [10]:
import nltk

stopwords = nltk.corpus.stopwords.words('english')
all_tokens=[]

for sentence in word_tokens:
    filtered_words=[]
    for word in sentence:
        word = word.lower()
        if word not in stopwords:
            filtered_words.append(word)

    all_tokens.append(filtered_words)


print(all_tokens)

[['matrix', 'everywhere', 'around', 'us', ',', 'even', 'room', '.'], ['see', 'window', 'television', '.'], ['feel', 'go', 'work', ',', 'go', 'church', 'pay', 'taxes', '.']]
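
Note that the filtered output above still contains punctuation tokens such as `','` and `'.'`. A minimal sketch of one way to drop them as well, assuming a small hand-picked stop word set (`STOP_WORDS` below is hypothetical and is not the NLTK list) and using a `set` for fast membership tests:

```python
# Hypothetical, hand-picked subset for illustration only (not NLTK's full list)
STOP_WORDS = {"the", "is", "its", "all", "here", "in", "this",
              "you", "can", "it", "out", "your", "or", "on"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep alphabetic tokens only, lowercase them, and drop stop words
    return [t.lower() for t in tokens
            if t.isalpha() and t.lower() not in stop_words]

tokens = ['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around',
          'us', ',', 'here', 'even', 'in', 'this', 'room', '.']
filtered = remove_stop_words(tokens)
print(filtered)  # ['matrix', 'everywhere', 'around', 'us', 'even', 'room']
```

Whether punctuation should be removed depends on the downstream task; `str.isalpha()` also drops numbers and hyphenated tokens, which may or may not be desirable.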


### Stemming, Lemmatization

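- Stemming: extracts a root form by applying simplified rules, so the result may have damaged spelling  
- Lemmatization: extracts a more accurate base form by taking grammatical elements (such as part of speech) and meaning into account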
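
The difference can be illustrated with a toy sketch. `naive_stem` and `naive_lemmatize` are hypothetical helpers written for this note: the first is a crude suffix stripper (not the Porter, Lancaster, or Snowball algorithm), and the second uses a tiny lookup table standing in for a lexical database such as WordNet.

```python
def naive_stem(word):
    # Strip the first matching suffix; there is no dictionary lookup,
    # so the result is not guaranteed to be a real word
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lexical database
LEMMA_TABLE = {"amusing": "amuse", "amuses": "amuse", "amused": "amuse"}

def naive_lemmatize(word):
    # Return the dictionary base form when known, else the word unchanged
    return LEMMA_TABLE.get(word, word)

for w in ("amusing", "amuses", "amused"):
    print(w, naive_stem(w), naive_lemmatize(w))
# amusing amus amuse
# amuses amus amuse
# amused amus amuse
```

The rule-based stem `amus` is not a valid word, while the lookup-based lemma `amuse` is, which is the trade-off between speed and accuracy described above.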

Stemmers provided by NLTK  
: Porter, Lancaster, Snowball Stemmer  

Lemmatizer provided by NLTK  
: WordNetLemmatizer

In [11]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

print(stemmer.stem('working'), stemmer.stem('works'), stemmer.stem('worked'))

work work work


In [13]:
from nltk.stem import WordNetLemmatizer
import nltk
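# nltk.download('wordnet') would fetch only the WordNet corpus;
# nltk.download() with no arguments opens the interactive downloader instead
nltk.download()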

lemma = WordNetLemmatizer()
print(lemma.lemmatize('amusing', 'v'), lemma.lemmatize('amuses', 'v'), lemma.lemmatize('amused', 'v'))

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
amuse amuse amuse


### Bag of Words - BOW

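A model that ignores context and word order and extracts feature values by assigning a frequency value to every word.  

Advantages  
- Easy and fast to build  
- Although it relies only on word occurrence counts, it captures the characteristics of a document better than one might expect  

Limitations  
- Lacks semantic context  
- Sparse matrix problem (sparsity)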
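
A minimal sketch of the idea in plain Python, assuming whitespace tokenization and a two-sentence toy corpus chosen for this note. It shows both the counting step and the loss of word order:

```python
from collections import Counter

docs = ["the dog bit the man", "the man bit the dog"]

# Vocabulary: every distinct word across the corpus, in a fixed order
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    # Count each word, then lay the counts out in vocabulary order
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bow_vector(d) for d in docs]
print(vocab)    # ['bit', 'dog', 'man', 'the']
print(vectors)  # [[1, 1, 1, 2], [1, 1, 1, 2]]
```

The two sentences mean different things but produce identical vectors, which is the missing-context limitation listed above. With a realistic vocabulary, most entries of each vector would be zero, which is the sparsity problem.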

BOW Feature Vectorization

CountVectorizer: count-based vectorization   
TfidfVectorizer: TF-IDF-based vectorization
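
To make the difference between the two concrete, here is a sketch of the textbook TF-IDF weight, tf × log(N / df), on a toy corpus invented for this note. This is an assumption-laden illustration: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

docs = [
    "the matrix is everywhere",
    "the matrix is a system",
    "you can see the window",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))

def tf_idf(doc_tokens):
    # Term frequency times inverse document frequency (textbook variant)
    tf = Counter(doc_tokens)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

weights = tf_idf(tokenized[0])
for word, weight in weights.items():
    print(f"{word}: {weight:.3f}")
```

A word such as `the`, which appears in every document, gets a weight of 0, while a word unique to one document (`everywhere`) gets the highest weight. A pure count-based vector would instead treat all four words of the first document equally.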

Sparse Matrix - COO (Coordinate)   

Stores only the non-zero values in a separate data array, and stores the row and column positions of those values in separate arrays.

In [14]:
import numpy as np

dense = np.array([[3,0,1], 
                  [0,2,0]])

In [15]:
from scipy import sparse

# Extract the non-zero values
data = np.array([3,1,2])

# Build separate arrays for the row and column positions
row_pos = np.array([0,0,1])
col_pos = np.array([0,2,1])

sparse_coo = sparse.coo_matrix((data, (row_pos, col_pos)))

In [16]:
sparse_coo.toarray()

array([[3, 0, 1],
       [0, 2, 0]])
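
In the cells above the three COO arrays were typed in by hand. As a sketch of how they follow from the dense matrix, the hypothetical helper `dense_to_coo` below walks the matrix row by row in plain Python (SciPy performs this conversion itself when given a dense array; this is only for illustration):

```python
def dense_to_coo(dense):
    # Collect each non-zero value together with its (row, column) position
    data, rows, cols = [], [], []
    for r, row in enumerate(dense):
        for c, value in enumerate(row):
            if value != 0:
                data.append(value)
                rows.append(r)
                cols.append(c)
    return data, rows, cols

dense = [[3, 0, 1],
         [0, 2, 0]]
data, row_pos, col_pos = dense_to_coo(dense)
print(data)     # [3, 1, 2]
print(row_pos)  # [0, 0, 1]
print(col_pos)  # [0, 2, 1]
```

The three lists match the `data`, `row_pos`, and `col_pos` arrays passed to `sparse.coo_matrix` above.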

Sparse Matrix - CSR (Compressed Sparse Row)

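Addresses a drawback of the COO format, which has to store repeated position data to represent the row and column positions.  
- Instead of the full row position array, CSR keeps a separate array that holds only the start position of each distinct row value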

In [17]:
from scipy import sparse

dense2 = np.array([[0,0,1,0,0,5],
                   [1,4,0,3,2,5],
                   [0,6,0,3,0,0],
                   [2,0,0,0,0,0],
                   [0,0,0,7,0,8],
                   [1,0,0,0,0,0]])

# Extract the non-zero values
data2 = np.array([1,5,1,4,3,2,5,6,3,2,7,8,1])

row_pos = np.array([0,0,1,1,1,1,1,2,2,3,4,4,5])
col_pos = np.array([2,5,0,1,3,4,5,1,3,0,3,5,0])


# COO
sparse_coo = sparse.coo_matrix((data2, (row_pos, col_pos)))

# Start offset of each row within data2; the final entry is the total number of non-zero values
row_pos_ind = np.array([0,2,7,9,10,12,13])

# CSR
sparse_csr = sparse.csr_matrix((data2, col_pos, row_pos_ind))

print(sparse_coo.toarray())
print(sparse_csr.toarray())

[[0 0 1 0 0 5]
 [1 4 0 3 2 5]
 [0 6 0 3 0 0]
 [2 0 0 0 0 0]
 [0 0 0 7 0 8]
 [1 0 0 0 0 0]]
[[0 0 1 0 0 5]
 [1 4 0 3 2 5]
 [0 6 0 3 0 0]
 [2 0 0 0 0 0]
 [0 0 0 7 0 8]
 [1 0 0 0 0 0]]
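
The `row_pos_ind` array in the cell above can be derived from `row_pos` rather than counted by hand. A plain-Python sketch, where `row_pointers` is a hypothetical helper name: count the non-zero values per row, then take the running total.

```python
def row_pointers(row_pos, n_rows):
    # Count the non-zero entries in each row
    counts = [0] * n_rows
    for r in row_pos:
        counts[r] += 1
    # Running total: pointers[i] is where row i starts in the data array
    pointers = [0]
    for c in counts:
        pointers.append(pointers[-1] + c)
    return pointers

data2 = [1, 5, 1, 4, 3, 2, 5, 6, 3, 2, 7, 8, 1]
row_pos = [0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5]

indptr = row_pointers(row_pos, n_rows=6)
print(indptr)  # [0, 2, 7, 9, 10, 12, 13]

# The non-zero values of row 1 are the slice between two consecutive pointers
row1_values = data2[indptr[1]:indptr[2]]
print(row1_values)  # [1, 4, 3, 2, 5]
```

The 13 row positions of COO shrink to 7 pointers here, and the saving grows with the number of non-zero values per row.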


## 20 NewsGroup Classification

Train a model on the training data to classify the given texts, then use it to predict the categories of other documents.

In [3]:
from sklearn.datasets import fetch_20newsgroups

news_data_all = fetch_20newsgroups(subset='all', random_state=156)

news_data_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), random_state=156)
X_train = news_data_train.data
y_train = news_data_train.target
print(len(news_data_train.data))

news_data_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), random_state=156)
X_test = news_data_test.data
y_test = news_data_test.target
print(len(news_data_test.data))

11314
7532


In [33]:
news_data_all.data[0]

'From: egreen@east.sun.com (Ed Green - Pixel Cruncher)\nSubject: Re: Observation re: helmets\nOrganization: Sun Microsystems, RTP, NC\nLines: 21\nDistribution: world\nReply-To: egreen@east.sun.com\nNNTP-Posting-Host: laser.east.sun.com\n\nIn article 211353@mavenry.altcit.eskimo.com, maven@mavenry.altcit.eskimo.com (Norman Hamer) writes:\n> \n> The question for the day is re: passenger helmets, if you don\'t know for \n>certain who\'s gonna ride with you (like say you meet them at a .... church \n>meeting, yeah, that\'s the ticket)... What are some guidelines? Should I just \n>pick up another shoei in my size to have a backup helmet (XL), or should I \n>maybe get an inexpensive one of a smaller size to accomodate my likely \n>passenger? \n\nIf your primary concern is protecting the passenger in the event of a\ncrash, have him or her fitted for a helmet that is their size.  If your\nprimary concern is complying with stupid helmet laws, carry a real big\nspare (you can put a big or small 

In [34]:
news_data_train.data[0]

"\n\nWhat I did NOT get with my drive (CD300i) is the System Install CD you\nlisted as #1.  Any ideas about how I can get one?  I bought my IIvx 8/120\nfrom Direct Express in Chicago (no complaints at all -- good price & good\nservice).\n\nBTW, I've heard that the System Install CD can be used to boot the mac;\nhowever, my drive will NOT accept a CD caddy is the machine is off.  How can\nyou boot with it then?\n\n--Dave\n"

In [35]:
y_train[0]

4

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

In [7]:
cnt_vect = CountVectorizer()
cnt_vect.fit(X_train)
X_train_cnt_vect = cnt_vect.transform(X_train)
X_test_cnt_vect = cnt_vect.transform(X_test)

X_train_cnt_vect.shape

(11314, 101631)

=> 101,631 unique words (features) were extracted from the 11,314 documents

In [8]:
lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train_cnt_vect, y_train)
pred = lr_clf.predict(X_test_cnt_vect)

print(f'CountVectorized Logistic Regression - Accuracy : {accuracy_score(y_test, pred):.3f}')

CountVectorized Logistic Regression - Accuracy : 0.616


In [9]:
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train_tfidf_vect, y_train)
pred = lr_clf.predict(X_test_tfidf_vect)

print(f'TF_IDF Logistic Regression - Accuracy : {accuracy_score(y_test, pred):.3f}')

TF_IDF Logistic Regression - Accuracy : 0.678


In [23]:
tfidf_vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=300)
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train_tfidf_vect, y_train)
pred = lr_clf.predict(X_test_tfidf_vect)

print(f'TF_IDF Logistic Regression - Accuracy : {accuracy_score(y_test, pred):.3f}')

TF_IDF Logistic Regression - Accuracy : 0.690
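
The vectorizer in the cell above adds `ngram_range=(1,2)` and `max_df=300`. A rough sketch of what these two options mean, on a toy corpus invented for this note; `ngrams` is a hypothetical helper, and the threshold is set to 1 only because the corpus is tiny. An integer `max_df` is an absolute document count, and terms that appear in more documents than that are dropped.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    # All contiguous word sequences of length n_min..n_max
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

docs = ["go to work", "go to church", "pay your taxes"]
print(ngrams(docs[0].split()))  # ['go', 'to', 'work', 'go to', 'to work']

# Document frequency of every unigram and bigram
doc_terms = [set(ngrams(d.split())) for d in docs]
df = Counter(t for terms in doc_terms for t in terms)

max_df = 1  # hypothetical threshold for this tiny corpus
kept = sorted(t for t, count in df.items() if count <= max_df)
print(kept)
```

Here `go`, `to`, and `go to` appear in two documents and are filtered out, while the terms that distinguish the documents survive. Bigrams recover a little of the word-order information that single-word counts discard, at the cost of a much larger vocabulary.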


In [38]:
from sklearn.model_selection import GridSearchCV

params = {'solver':['liblinear', 'lbfgs'],
          'penalty':['l2','l1'],
          'C':[0.01, 0.1, 1, 5, 10]}
grid_cv_lr = GridSearchCV(lr_clf, param_grid=params, cv=3, scoring='accuracy', verbose=1)
grid_cv_lr.fit(X_train_tfidf_vect, y_train)
print(f'Logistic Regression best parameter : {grid_cv_lr.best_params_}')


pred = grid_cv_lr.predict(X_test_tfidf_vect)
print(f'TF-IDF Vectorized Logistic Regression Accuracy : {accuracy_score(y_test,pred):.3f}')

Fitting 3 folds for each of 20 candidates, totalling 60 fits
Logistic Regression best parameter : {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
TF-IDF Vectorized Logistic Regression Accuracy : 0.701
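
The "20 candidates, totalling 60 fits" line in the output above follows directly from the parameter grid. A sketch of the enumeration using only the standard library (this mirrors the counting, not scikit-learn's internal implementation):

```python
from itertools import product

params = {'solver': ['liblinear', 'lbfgs'],
          'penalty': ['l2', 'l1'],
          'C': [0.01, 0.1, 1, 5, 10]}
cv = 3

# Every combination of one value per parameter is one candidate
names = list(params)
candidates = [dict(zip(names, combo)) for combo in product(*params.values())]

print(len(candidates))       # 20
print(len(candidates) * cv)  # 60
print(candidates[0])         # {'solver': 'liblinear', 'penalty': 'l2', 'C': 0.01}
```

One caveat worth checking: in scikit-learn the `lbfgs` solver does not support the `l1` penalty, so the five candidates with that combination are expected to fail during fitting. With warnings suppressed, as in this notebook, those failures can go unnoticed.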


In [39]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=300)),
    ('lr_clf', LogisticRegression(solver='liblinear', C=10))
])

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)
print(f'Pipeline Logistic Regression Accuracy : {accuracy_score(y_test, pred):.3f}')

Pipeline Logistic Regression Accuracy : 0.704
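
The main convenience of `Pipeline` is that raw text goes in at both `fit` and `predict`, and the vectorizer is fitted on the training texts only. A small self-contained sketch on a toy corpus invented for this note (the texts and labels are hypothetical, and default vectorizer settings are used because `max_df=300` is meaningless for four documents):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data: label 0 = cars, label 1 = baseball
train_texts = ["the engine and the brakes of my car",
               "car tires and engine oil",
               "the pitcher threw the baseball",
               "a baseball game and a home run"]
train_labels = [0, 0, 1, 1]
test_texts = ["engine oil for the car", "baseball pitcher"]

toy_pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer()),
    ('lr_clf', LogisticRegression(solver='liblinear')),
])

# fit() runs fit_transform on the vectorizer, then fits the classifier;
# predict() reuses the fitted vocabulary via transform only
toy_pipeline.fit(train_texts, train_labels)
pred = toy_pipeline.predict(test_texts)
print(pred)
```

Because each step has a name, a grid search over the whole pipeline can address step parameters with the `step__param` convention, for example `tfidf_vect__max_df` or `lr_clf__C`.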


In [12]:
from sklearn.model_selection import GridSearchCV

svm_clf = SVC()
params = {'C':[0.1, 1, 10],
          'gamma':[1, 0.1, 0.01]}
grid_cv_svm = GridSearchCV(svm_clf, param_grid=params, cv=3, scoring='accuracy', verbose=1)
grid_cv_svm.fit(X_train_tfidf_vect, y_train)
print(f'SVM best parameter : {grid_cv_svm.best_params_}')


pred = grid_cv_svm.predict(X_test_tfidf_vect)
print(f'TF-IDF Vectorized SVM Accuracy : {accuracy_score(y_test,pred):.3f}')

Fitting 3 folds for each of 9 candidates, totalling 27 fits
SVM best parameter : {'C': 10, 'gamma': 0.1}
TF-IDF Vectorized SVM Accuracy : 0.663


In [13]:
from sklearn.naive_bayes import MultinomialNB

mod = MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
mod.fit(X_train_tfidf_vect, y_train)

pred = mod.predict(X_test_tfidf_vect)
print(f'TF-IDF Vectorized Naive Bayes Accuracy : {accuracy_score(y_test,pred):.3f}')


TF-IDF Vectorized Naive Bayes Accuracy : 0.685


In [15]:
from sklearn.model_selection import GridSearchCV

mod_clf = MultinomialNB()
params = {'alpha':[0.01, 0.1, 0.5, 1.0]}
grid_cv_mod = GridSearchCV(mod_clf, param_grid=params, cv=3, scoring='accuracy', verbose=1)
grid_cv_mod.fit(X_train_tfidf_vect, y_train)
print(f'Bayes Classifier best parameter : {grid_cv_mod.best_params_}')

pred = grid_cv_mod.predict(X_test_tfidf_vect)
print(f'TF-IDF Vectorized Bayes Accuracy : {accuracy_score(y_test,pred):.3f}')

Fitting 3 folds for each of 4 candidates, totalling 12 fits
Bayes Classifier best parameter : {'alpha': 0.01}
TF-IDF Vectorized Bayes Accuracy : 0.700
