# Text Mining Basics


## Sentence Tokenization
- Source: 파이썬 머신러닝 완벽 가이드 (Python Machine Learning Complete Guide), p. 492

In [1]:
from nltk import sent_tokenize
import nltk

nltk.download('punkt') # Download the Punkt data used to detect sentence boundaries (periods, line breaks).

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
# text_sample plays the role of a document (several sentences)
text_sample = 'The Matrix is everywhere its all around us, here even in this room. \
               You can see it out your window or on your television. \
               You feel it when you go to work, or go to church or pay your taxes.'

sentences = sent_tokenize(text=text_sample)
print(type(sentences))
print(len(sentences))
print(sentences)

<class 'list'>
3
['The Matrix is everywhere its all around us, here even in this room.', 'You can see it out your window or on your television.', 'You feel it when you go to work, or go to church or pay your taxes.']
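
As a rough illustration of what sentence tokenization does, the sketch below splits the text wherever `.`, `!`, or `?` is followed by whitespace. It is a simplified stand-in and not NLTK's Punkt algorithm, which also deals with abbreviations and similar cases. The helper name `naive_sent_tokenize` is hypothetical.

```python
import re

def naive_sent_tokenize(text):
    """Split text after ., ! or ? followed by whitespace (simplified; not Punkt)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sample = ('The Matrix is everywhere its all around us, here even in this room. '
          'You can see it out your window or on your television. '
          'You feel it when you go to work, or go to church or pay your taxes.')
naive_sentences = naive_sent_tokenize(sample)
print(len(naive_sentences))
print(naive_sentences)
```

A rule this simple breaks on inputs such as "Dr. Smith", which is one reason to prefer `sent_tokenize` in practice.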


## Word Tokenization
- Each sentence is tokenized again into individual words

In [5]:
from nltk import word_tokenize

sentence = "The Matrix is everywhere its all around us, here even in this room."
words = word_tokenize(sentence)
print(type(words))
print(len(words))
print(words)

<class 'list'>
15
['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.']
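
The same idea can be sketched with a single regular expression that treats runs of word characters and individual punctuation marks as tokens. This is only an approximation under that assumption; `word_tokenize` handles contractions and other cases differently. The helper name `naive_word_tokenize` is hypothetical.

```python
import re

def naive_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

demo_tokens = naive_word_tokenize(
    "The Matrix is everywhere its all around us, here even in this room."
)
print(len(demo_tokens))
print(demo_tokens)
```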


## A Function That Tokenizes a Document into Words
- Document => sentences => words: the words of each sentence are grouped into one list per sentence

In [6]:
from nltk import word_tokenize, sent_tokenize

# Tokenize a multi-sentence input into words, sentence by sentence
def tokenize_text(text):

    # Split the text into sentences
    sentences = sent_tokenize(text) # list of sentence strings

    # Tokenize each sentence into words
    word_tokens = [word_tokenize(sentence) for sentence in sentences]
    return word_tokens

# Tokenize the document into words
word_tokens = tokenize_text(text_sample)
print(type(word_tokens),len(word_tokens))
print(word_tokens)

<class 'list'> 3
[['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.'], ['You', 'can', 'see', 'it', 'out', 'your', 'window', 'or', 'on', 'your', 'television', '.'], ['You', 'feel', 'it', 'when', 'you', 'go', 'to', 'work', ',', 'or', 'go', 'to', 'church', 'or', 'pay', 'your', 'taxes', '.']]


In [7]:
text_sample = "“I have withdrawn from the musical Tiananmen,” Piser said in a brief signed statement on Instagram. Piser was on tour performing hit Broadway songs in the Chinese metropolis of Shanghai when he made the announcement, according to his Instagram posts and Chinese state media reports. In China, the Tiananmen Square pro-democracy protests, which swept Beijing and dozens of other Chinese cities in 1989 – and ended in a bloody military crackdown that cost the lives of hundreds, if not thousands, of protesters – remain a major political taboo."
word_tokens = tokenize_text(text_sample)
print(type(word_tokens),len(word_tokens))
print(word_tokens)

<class 'list'> 3
[['“', 'I', 'have', 'withdrawn', 'from', 'the', 'musical', 'Tiananmen', ',', '”', 'Piser', 'said', 'in', 'a', 'brief', 'signed', 'statement', 'on', 'Instagram', '.'], ['Piser', 'was', 'on', 'tour', 'performing', 'hit', 'Broadway', 'songs', 'in', 'the', 'Chinese', 'metropolis', 'of', 'Shanghai', 'when', 'he', 'made', 'the', 'announcement', ',', 'according', 'to', 'his', 'Instagram', 'posts', 'and', 'Chinese', 'state', 'media', 'reports', '.'], ['In', 'China', ',', 'the', 'Tiananmen', 'Square', 'pro-democracy', 'protests', ',', 'which', 'swept', 'Beijing', 'and', 'dozens', 'of', 'other', 'Chinese', 'cities', 'in', '1989', '–', 'and', 'ended', 'in', 'a', 'bloody', 'military', 'crackdown', 'that', 'cost', 'the', 'lives', 'of', 'hundreds', ',', 'if', 'not', 'thousands', ',', 'of', 'protesters', '–', 'remain', 'a', 'major', 'political', 'taboo', '.']]


In [8]:
text_sample = "삼성전자와 LG전자가 스마트홈 플랫폼에서 손을 잡았다. 8월 31일 관련업계에 따르면 삼성전자와 LG전자는 올해 안에 스마트홈 플랫폼으로 양사 가전을 연동하는 것을 목표로 협력하고 있다. 연내 삼성전자와 LG전자의 가전 관리용 전용 앱을 통해 양사는 물론 다른 회사의 가전제품까지 무선 및 원격으로 작동하거나 제어할 수 있도록 하겠다는 계획이다. 두 회사는 ‘홈 커넥티비티 얼라이언스(HCA)’ 표준을 설계·적용해 타사의 브랜드 가전제품을 자사 앱에서 제어할 수 있도록 지원하는 것으로 알려졌따. 지난해 설립된 HCA는 삼성전자와 LG전자, 튀르키예의 베스텔, 일본 샤프를 비롯한 15개 회원사가 참여하고 있다. 15개사 가전 관리용 앱으로 다른 회원사의 가전제품을 제어하는 표준 기술을 개발 중이다. 삼성전자와 LG전자는 HCA 의장사로 이 같은 ‘스마트홈 가전 동맹’을 주도하고 있다. 당장 9월 가전 관리용 전용 앱인 삼성전자의 ‘스마트싱스’로 베스텔, 샤프 등 글로벌 가전업체 제품을 제어할 수 있게 된다. 연내에는 LG전자 가전제품도 작동할 수 있게 된다. 예컨대 소비자들은 삼성전자 스마트싱스 앱으로 이와 연결된 LG전자 TV, 세탁기 등의 가전을 작동하거나 설정을 조작할 수 있다. LG전자의 가전 관리용 전용 앱 ‘LG 씽큐’로도 올해 안에 삼성전자 가전제품을 조작하는 게 가능해진다. 베스텔 가전제품을 연동하는 방안도 추진하고 있다. 삼성전자와 LG전자의 ‘스마트홈 플랫폼 동맹’은 올해 한국 미국 유럽 등 글로벌 주요 시장에서 순차적으로 이어질 계획이다. 대상 제품은 냉장고 세탁기 에어컨 건조기 식기세척기 오븐 로봇청소기 TV 공기청정기 등이다. 양사는 앞으로 연동 대상 제품을 확대해 나갈 계획이다."
word_tokens = tokenize_text(text_sample)
print(type(word_tokens),len(word_tokens))
print(word_tokens)

<class 'list'> 15
[['삼성전자와', 'LG전자가', '스마트홈', '플랫폼에서', '손을', '잡았다', '.'], ['8월', '31일', '관련업계에', '따르면', '삼성전자와', 'LG전자는', '올해', '안에', '스마트홈', '플랫폼으로', '양사', '가전을', '연동하는', '것을', '목표로', '협력하고', '있다', '.'], ['연내', '삼성전자와', 'LG전자의', '가전', '관리용', '전용', '앱을', '통해', '양사는', '물론', '다른', '회사의', '가전제품까지', '무선', '및', '원격으로', '작동하거나', '제어할', '수', '있도록', '하겠다는', '계획이다', '.'], ['두', '회사는', '‘', '홈', '커넥티비티', '얼라이언스', '(', 'HCA', ')', '’', '표준을', '설계·적용해', '타사의', '브랜드', '가전제품을', '자사', '앱에서', '제어할', '수', '있도록', '지원하는', '것으로', '알려졌따', '.'], ['지난해', '설립된', 'HCA는', '삼성전자와', 'LG전자', ',', '튀르키예의', '베스텔', ',', '일본', '샤프를', '비롯한', '15개', '회원사가', '참여하고', '있다', '.'], ['15개사', '가전', '관리용', '앱으로', '다른', '회원사의', '가전제품을', '제어하는', '표준', '기술을', '개발', '중이다', '.'], ['삼성전자와', 'LG전자는', 'HCA', '의장사로', '이', '같은', '‘', '스마트홈', '가전', '동맹', '’', '을', '주도하고', '있다', '.'], ['당장', '9월', '가전', '관리용', '전용', '앱인', '삼성전자의', '‘', '스마트싱스', '’', '로', '베스텔', ',', '샤프', '등', '글로벌', '가전업체', '제품을', '제어할', '수', '있게', '된다', '.'], ['연내에는', 'L

## Removing Stop Words
- Stop words: words that carry little meaning for the analysis
- p. 495

In [9]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [11]:
print("Number of English stop words:", len(nltk.corpus.stopwords.words('english')))
print(nltk.corpus.stopwords.words('english')[:20])

Number of English stop words: 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


- Removing the stop words

In [16]:
text_sample = 'The Matrix is everywhere its all around us, here even in this room. \
               You can see it out your window or on your television. \
               You feel it when you go to work, or go to church or pay your taxes.'
word_tokens = tokenize_text(text_sample)

# Suppose the analysis targets a specific domain
stopwords = nltk.corpus.stopwords.words('english')
stopwords = stopwords + ['everywhere', 'us'] # add domain-specific words that should also be dropped

all_tokens = []
# Loop over the per-sentence token lists (3 sentences in this example) and remove stop words
for sentence in word_tokens:
    filtered_words=[]
    # Loop over the tokens of one sentence
    for word in sentence:
        # Convert to lowercase
        word = word.lower()
        # Keep the token in filtered_words only if it is not a stop word
        if word not in stopwords:
            filtered_words.append(word)
    all_tokens.append(filtered_words)

print(all_tokens)

[['matrix', 'around', ',', 'even', 'room', '.'], ['see', 'window', 'television', '.'], ['feel', 'go', 'work', ',', 'go', 'church', 'pay', 'taxes', '.']]
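
The output above still contains punctuation tokens such as `,` and `.`. A possible variant, sketched here with a small hypothetical stop-word set (`demo_stopwords`, far shorter than NLTK's English list), uses a `set` for fast membership tests and drops punctuation in the same pass. The function name `remove_stopwords` is an assumption of this sketch.

```python
import string

# Hypothetical mini stop-word list for illustration only.
demo_stopwords = {'the', 'is', 'its', 'all', 'us', 'here', 'in', 'this', 'everywhere'}
punctuation = set(string.punctuation)

def remove_stopwords(tokenized_sentences, stop_set):
    """Lowercase tokens, then drop stop words and single punctuation tokens."""
    cleaned = []
    for sent in tokenized_sentences:
        cleaned.append([
            tok.lower() for tok in sent
            if tok.lower() not in stop_set and tok not in punctuation
        ])
    return cleaned

demo_input = [['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',',
               'here', 'even', 'in', 'this', 'room', '.']]
demo_filtered = remove_stopwords(demo_input, demo_stopwords)
print(demo_filtered)
```

With the real list, `set(nltk.corpus.stopwords.words('english'))` could be passed as `stop_set` instead.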


## Stemming and Lemmatization

In [17]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

print(stemmer.stem('working'),stemmer.stem('works'),stemmer.stem('worked'))
print(stemmer.stem('amusing'),stemmer.stem('amuses'),stemmer.stem('amused'))
print(stemmer.stem('happier'),stemmer.stem('happiest'))
print(stemmer.stem('fancier'),stemmer.stem('fanciest'))

work work work
amus amus amus
happy happiest
fant fanciest
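
Stemmers apply mechanical suffix rules, which is why results such as `amus` are not dictionary words. The toy suffix stripper below illustrates that behaviour. It is not the Lancaster rule set, and both `toy_stem` and its suffix list are assumptions made for this sketch.

```python
def toy_stem(word, suffixes=('ing', 'ed', 'es', 's')):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for group in (['working', 'works', 'worked'], ['amusing', 'amuses', 'amused']):
    print([toy_stem(w) for w in group])
```

A lemmatizer, by contrast, looks the word up in a dictionary together with its part of speech, so it can return a real word such as `amuse`.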


In [18]:
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemma = WordNetLemmatizer()
print(lemma.lemmatize('amusing','v'),lemma.lemmatize('amuses','v'),lemma.lemmatize('amused','v'))
print(lemma.lemmatize('happier','a'),lemma.lemmatize('happiest','a'))
print(lemma.lemmatize('fancier','a'),lemma.lemmatize('fanciest','a'))

[nltk_data] Downloading package wordnet to /root/nltk_data...


amuse amuse amuse
happy happy
fancy fancy


## Loading the Data

In [19]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [21]:
import pandas as pd
DATA_PATH = '/content/drive/MyDrive/Colab Notebooks/text_mining/'

# p.520
review_df = pd.read_csv(DATA_PATH + 'labeledTrainData.tsv', header=0, sep="\t", quoting=3)
review_df.head(3)

|   | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |


In [22]:
review_df['review'][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

## Text Preprocessing

In [23]:
import re
# Replace the HTML line-break tag with a space (literal match, not a regex)
review_df['review'] = review_df['review'].str.replace('<br />', ' ', regex=False)
review_df['review'][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.  Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.  The actual feature film bit when it finally starts is only on f

In [25]:
# Replace every character that is not an English letter with a space
review_df['review'] = review_df['review'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x))
review_df['review'][0]

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him   The actual feature film bit when it finally starts is only on for 
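
The two cleaning steps can also be bundled into one function that works on plain strings. The sketch below adds a whitespace-collapsing step that the notebook itself does not perform; `clean_review` and the sample string `raw` are hypothetical.

```python
import re

def clean_review(text):
    """Replace <br /> tags and non-letters with spaces, then collapse whitespace."""
    text = text.replace('<br />', ' ')
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Extra step (assumption of this sketch): squeeze runs of spaces into one.
    return re.sub(r'\s+', ' ', text).strip()

raw = '"Drugs are bad m\'kay.<br /><br />Visually impressive, 10/10!"'
cleaned = clean_review(raw)
print(cleaned)
```

On the DataFrame this could be applied as `review_df['review'].apply(clean_review)`.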

## Splitting into Training and Test Data

In [26]:
from sklearn.model_selection import train_test_split
class_df = review_df['sentiment'] # y: 1 = positive, 0 = negative
feature_df = review_df.drop(['id', 'sentiment'], axis = 1) # only the review column remains

X_train, X_test, y_train, y_test = train_test_split(
    feature_df, class_df, test_size = 0.3, random_state = 156
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((17500, 1), (7500, 1), (17500,), (7500,))
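
To see where the 17,500 / 7,500 row counts come from, here is a minimal sketch of a seeded shuffle split using only the standard library. It is not scikit-learn's implementation and will not reproduce the same row assignment; `simple_split` is a hypothetical helper.

```python
import random

def simple_split(items, test_size=0.3, seed=156):
    """Shuffle indices with a fixed seed and hold out a test_size share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

train_rows, test_rows = simple_split(list(range(25000)))
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state`: the split stays identical across runs.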

## Model Training

In [27]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words = 'english', ngram_range = (1, 2))),
    ('lr_clf', LogisticRegression(solver = 'liblinear', C = 10))
])

pipeline.fit(X_train['review'], y_train)
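
The `ngram_range = (1, 2)` setting means that both single words and pairs of adjacent words become features. A minimal sketch of that counting step on an already tokenized review follows; `ngram_counts` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def ngram_counts(tokens, ngram_range=(1, 2)):
    """Count all n-grams whose length lies in ngram_range (inclusive)."""
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

demo_counts = ngram_counts(['good', 'movie', 'good', 'acting'])
print(demo_counts)
```

Bigrams let the model tell apart phrases whose individual words are the same, at the cost of a much larger vocabulary.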

In [28]:
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]

print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred),
                                                  roc_auc_score(y_test, pred_probs)))

Accuracy: 0.8861, ROC-AUC: 0.9503
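
The two metrics measure different things: accuracy looks at the hard 0/1 predictions, while ROC-AUC looks at how well the predicted probabilities rank positives above negatives. The sketch below computes both by hand on made-up toy values (`toy_y` and `toy_scores` are not taken from the review data). The pairwise formulation is one common way to define ROC-AUC and is used here for clarity, not speed.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_y = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.8]
toy_pred = [int(s >= 0.5) for s in toy_scores]  # threshold at 0.5
print(accuracy(toy_y, toy_pred), roc_auc(toy_y, toy_scores))
```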


- Evaluating with TF-IDF features

In [29]:
%%timeit

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2))),
    ('lr_clf', LogisticRegression(solver = 'liblinear', C = 10))
])

pipeline.fit(X_train['review'], y_train)

pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]

print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred),
                                                  roc_auc_score(y_test, pred_probs)))

Accuracy: 0.8936, ROC-AUC: 0.9598
Accuracy: 0.8936, ROC-AUC: 0.9598
Accuracy: 0.8936, ROC-AUC: 0.9598
Accuracy: 0.8936, ROC-AUC: 0.9598
Accuracy: 0.8936, ROC-AUC: 0.9598
Accuracy: 0.8936, ROC-AUC: 0.9598
Accuracy: 0.8936, ROC-AUC: 0.9598
Accuracy: 0.8936, ROC-AUC: 0.9598
24 s ± 1.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
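
TF-IDF down-weights terms that appear in many documents. The sketch below computes it by hand with the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row, which is the variant scikit-learn documents for its defaults. The toy corpus and the helper name `tfidf_rows` are assumptions of this sketch.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows, one dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_docs = [['good', 'movie'], ['bad', 'movie'], ['good', 'good', 'acting']]
toy_rows = tfidf_rows(toy_docs)
for toy_row in toy_rows:
    print({term: round(w, 3) for term, w in toy_row.items()})
```

In the second toy document, `bad` (found in one document) ends up with a higher weight than `movie` (found in two).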


In [30]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(review_df['review'])
X.shape

(25000, 73246)

In [32]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(review_df['review'])
X.shape

(25000, 73246)

In [33]:
print(X[0])

  (0, 71786)	4
  (0, 1558)	4
  (0, 65032)	11
  (0, 62272)	1
  (0, 26679)	3
  (0, 18635)	1
  (0, 3639)	2
  (0, 64811)	19
  (0, 42129)	1
  (0, 41901)	11
  (0, 69374)	2
  (0, 61392)	1
  (0, 37621)	1
  (0, 65592)	9
  (0, 29773)	3
  (0, 43124)	2
  (0, 70705)	1
  (0, 45213)	1
  (0, 18164)	1
  (0, 29351)	1
  (0, 2147)	10
  (0, 64904)	1
  (0, 70699)	2
  (0, 71838)	1
  (0, 42364)	2
  :	:
  (0, 66844)	1
  (0, 63836)	1
  (0, 21498)	1
  (0, 27026)	1
  (0, 48796)	1
  (0, 71002)	2
  (0, 3772)	1
  (0, 25678)	1
  (0, 62417)	1
  (0, 29861)	1
  (0, 18346)	1
  (0, 9203)	1
  (0, 5208)	1
  (0, 17121)	1
  (0, 5517)	1
  (0, 11758)	1
  (0, 18459)	1
  (0, 22262)	1
  (0, 19924)	1
  (0, 22137)	1
  (0, 62320)	1
  (0, 58569)	1
  (0, 37217)	1
  (0, 30273)	1
  (0, 36529)	1
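
Each printed line has the form `(row, column)  count`: the document index, the vocabulary index of a term, and how often that term occurs. Only non-zero entries are stored. The sketch below rebuilds this layout for a toy corpus with the standard library. The alphabetical index assignment and the two-or-more-character token pattern mirror `CountVectorizer`'s defaults as an assumption of this sketch, and all variable names are hypothetical.

```python
import re
from collections import Counter

toy_corpus = ['the movie was good', 'the acting was bad bad']

# Vocabulary: sorted unique lowercase tokens of two or more word characters.
tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in toy_corpus]
vocab = {term: idx
         for idx, term in enumerate(sorted({t for doc in tokenized for t in doc}))}
index_to_term = {idx: term for term, idx in vocab.items()}

# Sparse view of one document: only non-zero (row, column) -> count entries.
row = 1
sparse_row = {(row, vocab[t]): c for t, c in Counter(tokenized[row]).items()}
for (r, col), count in sorted(sparse_row.items()):
    print((r, col), count, index_to_term[col])
```

In the notebook, `vectorizer.get_feature_names_out()` (available in recent scikit-learn versions) gives the same column-to-term lookup for the real matrix.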
