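Natural Language Processing (NLP)

- Installing the relevant NLP packages
- Tokenizing text data
- Converting words to their base form with stemming
- Dividing text data into chunks
- Extracting a document-term matrix with the Bag of Words model
- Category predictor
- Building a gender identifier
- Building a sentiment analyzer
- Modeling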

In [1]:
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True

Tokenizing text data

In [5]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer

In [6]:
input_text = "do you know how tokenize works? \
It's actually quite interesting!"
sent_tokenize(input_text)

['do you know how tokenize works?', "It's actually quite interesting!"]

In [7]:
print("Word tokenizer")
print(word_tokenize(input_text))

Word tokenizer
['do', 'you', 'know', 'how', 'tokenize', 'works', '?', 'It', "'s", 'actually', 'quite', 'interesting', '!']


In [8]:
print("WordPunctTokenizer")
print(WordPunctTokenizer().tokenize(input_text))

WordPunctTokenizer
['do', 'you', 'know', 'how', 'tokenize', 'works', '?', 'It', "'", 's', 'actually', 'quite', 'interesting', '!']
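
The two word tokenizers differ mainly in how they split punctuation: `word_tokenize` keeps `'s` together, while `WordPunctTokenizer` separates the apostrophe from the `s`. As a rough standard-library sketch, the regular expression below reproduces the `WordPunctTokenizer` result for this input. The helper name `punct_tokenize` is made up for illustration, and the pattern is an assumption about the behaviour, not NLTK's code.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
PATTERN = re.compile(r"\w+|[^\w\s]+")

def punct_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    return PATTERN.findall(text)

input_text = "do you know how tokenize works? It's actually quite interesting!"
tokens = punct_tokenize(input_text)
print(tokens)
```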


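Converting words to their base form with stemming

- sing
 - appears in many forms: singer, singing, song, sung, and so on
 - these forms share a similar meaning; reducing them to a common base form is called stemming
 - a stemmer produces the base form by heuristically stripping the morphological endings from a word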
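
To make the idea concrete, here is a deliberately naive suffix stripper. It is a toy written for this note, not the Porter, Lancaster, or Snowball algorithm; the suffix list and the `min_stem_len` parameter are assumptions chosen for the `sing` example.

```python
# Toy suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "ers", "er", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["sing", "singer", "singing", "sings", "sung", "song"]:
    print(f"{w:10s}{toy_stem(w)}")
```

Regular forms such as `singer` and `singing` collapse to `sing`, but irregular forms such as `sung` and `song` are left untouched, which is the kind of gap that rule-based stemming cannot close on its own.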

In [9]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

In [10]:
input_word = [
    'writing', 'calves','be','branded','horse','radomize','possibly','provision','hospital','kept','scratchy','code'
]

In [11]:
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

In [12]:
print(f"{'input_word':20s}{'porter':20s}{'lancaster':20s}{'snowball':20s}")
print(f"="*80)
for word in input_word:
  print(f"{word:20s}{porter.stem(word):20s}{lancaster.stem(word):20s}{snowball.stem(word):20s}")

input_word          porter              lancaster           snowball            
writing             write               writ                write               
calves              calv                calv                calv                
be                  be                  be                  be                  
branded             brand               brand               brand               
horse               hors                hors                hors                
radomize            radom               radom               radom               
possibly            possibl             poss                possibl             
provision           provis              provid              provis              
hospital            hospit              hospit              hospit              
kept                kept                kept                kept                
scratchy            scratchi            scratchy            scratchi            
code                code    

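Converting words to their base form with lemmatization (Lemmatizer)

- Normalizing words with a stemmer sometimes produces strings that are not real words
- A lemmatizer is similar to a stemmer but takes context (the part of speech) into account, so related word forms are mapped to a single dictionary word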
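
The role of the part of speech can be sketched with a small lookup table. This is only an illustration: the real `WordNetLemmatizer` consults the WordNet database, whereas the table below is hand-written from a few of the WordNet results shown in this notebook, and `toy_lemmatize` is a hypothetical name.

```python
# (word, part of speech) -> lemma. Hand-written toy table, not WordNet itself.
LEMMA_TABLE = {
    ("calves", "n"): "calf",
    ("calves", "v"): "calve",
    ("writing", "v"): "write",
    ("branded", "v"): "brand",
    ("kept", "v"): "keep",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("calves", pos="n"))
print(toy_lemmatize("calves", pos="v"))
print(toy_lemmatize("writing", pos="n"))
```

The same surface form maps to different lemmas depending on the part of speech, and every result is a real word, unlike stems such as `calv`.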

In [13]:
from nltk.stem import WordNetLemmatizer

In [14]:
lemmatizer = WordNetLemmatizer()

In [15]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [16]:
print(f"{'input_word':20s}{'noun':20s}{'verb':20s}")
for word in input_word:
  print(f"{word:20s}{lemmatizer.lemmatize(word,pos='n'):20s}{lemmatizer.lemmatize(word,pos='v'):20s}")

input_word          noun                verb                
writing             writing             write               
calves              calf                calve               
be                  be                  be                  
branded             branded             brand               
horse               horse               horse               
radomize            radomize            radomize            
possibly            possibly            possibly            
provision           provision           provision           
hospital            hospital            hospital            
kept                kept                keep                
scratchy            scratchy            scratchy            
code                code                code                


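Dividing text data into chunks

- Splitting text data into several pieces for further analysis is called chunking
- Chunking is not the same as tokenization: tokenization splits text into individual words or sentences, while chunking here groups a fixed number of words into larger pieces
- When working with large text documents, the text is divided into chunks so that meaningful information can be extracted from each piece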
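
A compact way to see the idea on a toy sentence is list slicing. This is a sketch for illustration; the name `chunk_words` is made up here, and the last chunk is simply shorter when the word count is not a multiple of `n`.

```python
def chunk_words(text, n):
    """Group the whitespace-separated words of text into chunks of n words."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]

sample = "one two three four five six seven"
print(chunk_words(sample, 3))
```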

In [17]:
import numpy as np
from nltk.corpus import brown

In [18]:
def chunker(input_data, N):
  input_words = input_data.split()
  output = []
  cur_chunk = []
  count = 0
  # Iterate over the words, closing a chunk every N words
  for word in input_words:
    cur_chunk.append(word)
    count += 1
    if count == N:
      output.append(' '.join(cur_chunk))
      count, cur_chunk = 0, []

  # Append the leftover words, if any, as the final (shorter) chunk
  if cur_chunk:
    output.append(' '.join(cur_chunk))
  return output

In [19]:
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [20]:
input_data = ' '.join(brown.words()[:12000])
# Number of words in each chunk
chunk_size = 700
# Split the input data into chunks and display the result
chunks = chunker(input_data,chunk_size)
print(f"number of chunks : {len(chunks)}")
for i, chunk in enumerate(chunks):
  print(f"chunk {i+1} ==> {chunk[:50]}")

number of chunks : 18
chunk 1 ==> The Fulton County Grand Jury said Friday an invest
chunk 2 ==> '' . ( 2 ) Fulton legislators `` work with city of
chunk 3 ==> . Construction bonds Meanwhile , it was learned th
chunk 4 ==> , anonymous midnight phone calls and veiled threat
chunk 5 ==> Harris , Bexar , Tarrant and El Paso would be $451
chunk 6 ==> set it for public hearing on Feb. 22 . The proposa
chunk 7 ==> College . He has served as a border patrolman and 
chunk 8 ==> of his staff were doing on the address involved co
chunk 9 ==> plan alone would boost the base to $5,000 a year a
chunk 10 ==> nursing homes In the area of `` community health s
chunk 11 ==> of its Angola policy prove harsh , there has been 
chunk 12 ==> system which will prevent Laos from being used as 
chunk 13 ==> reform in recipient nations . In Laos , the admini
chunk 14 ==> . He is not interested in being named a full-time 
chunk 15 ==> said , `` to obtain the views of the general publi
chunk 16 ==> '' . Mr. Reama

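Bag of Words model

- Extracts word frequencies from text
- Builds a vocabulary from all the words in the documents and uses it to construct a document-term model
- Document-term model
 - keeps only the word counts and ignores word order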
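
A minimal standard-library sketch of a document-term matrix, using two made-up sentences as documents. It only illustrates the counting idea; `CountVectorizer` additionally handles tokenization rules, lowercasing, and frequency cut-offs.

```python
from collections import Counter

# Two toy documents (made up for this illustration)
docs = ["the cat sat on the mat", "the dog sat"]

# One word-count table per document
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is the sorted set of all words seen in any document
vocabulary = sorted(set().union(*counts))

# Rows are documents, columns are vocabulary words
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Shuffling the words of a document leaves its row unchanged, which is exactly what "ignores word order" means.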

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

In [22]:
input_data = ' '.join(brown.words()[:5400])
chunk_size = 800
text_chuncks = chunker(input_data, chunk_size)

In [23]:
chunks = []
for count, chunk in enumerate(text_chuncks):
  d = {'index':count, 'text':chunk}
  chunks.append(d)

In [24]:
count_vectorizer = CountVectorizer(min_df=7,max_df=20)
document_term_matrix =  count_vectorizer.fit_transform([  chunk['text']    for chunk in chunks])

In [25]:
vocabulary =  count_vectorizer.get_feature_names_out()
print(f"vocabulary : {vocabulary}")

vocabulary : ['and' 'are' 'be' 'by' 'county' 'for' 'in' 'is' 'it' 'of' 'on' 'one'
 'said' 'state' 'that' 'the' 'to' 'two' 'was' 'which' 'with']


In [26]:
chunk_names = []
for i in range(len(text_chuncks)):
  chunk_names.append(f'Chunk-{i+1}')
chunk_names

['Chunk-1', 'Chunk-2', 'Chunk-3', 'Chunk-4', 'Chunk-5', 'Chunk-6', 'Chunk-7']

In [27]:
print("Document term matrix....")
print(chunk_names)
for word, item in zip(vocabulary, document_term_matrix.T):
  temp = ''.join([f"{str(i):10s}" for i in item.toarray().ravel()])  # dense row, so zero counts are printed too
  print(f"{word:10s}{temp}")

Document term matrix....
['Chunk-1', 'Chunk-2', 'Chunk-3', 'Chunk-4', 'Chunk-5', 'Chunk-6', 'Chunk-7']
and       23        9         9         11        9         17        10        
are       2         2         1         1         2         2         1         
be        6         8         7         7         6         2         1         
by        3         4         4         5         14        3         6         
county    6         2         7         3         1         2         2         
for       7         13        4         10        7         6         4         
in        15        11        15        11        13        14        17        
is        2         7         3         4         5         5         2         
it        8         6         8         9         3         1         2         
of        31        20        20        30        29        35        26        
on        4         3         5         10        6         5         2         
one   

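Category predictor

- Predicts the category that a piece of text belongs to
- Search engines can use category predictors to order search results by relevance
- Uses the TF-IDF metric
 - TF (Term Frequency): how often a word occurs in a document
 - IDF (Inverse Document Frequency): how unique a word is across the collection of documents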
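
The weighting can be worked through by hand on a toy corpus. The sketch below assumes raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row, which is my understanding of the scikit-learn `TfidfTransformer` defaults. The function name `tfidf_rows` and the three sentences are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, idf per word, L2-normalized TF-IDF rows)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    # Smoothed IDF (assumed formula, see the note above)
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}
    rows = []
    for c in counts:
        weights = [c[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocab, idf, rows

docs = [
    "the car drives on the road",
    "the puck slides on the ice",
    "the debate was heated",
]
vocab, idf, rows = tfidf_rows(docs)
for word in ("the", "on", "car"):
    print(f"{word:6s}{idf[word]:.3f}")
```

A word that appears in every document (`the`) gets the smallest IDF, while a word confined to one document (`car`) gets the largest, so distinctive words dominate the feature vector.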

In [28]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer

In [29]:
category_map = {
    'talk.politics.misc':'Politics', 'rec.autos' : 'Autos', 'rec.sport.hockey':'Hockey','sci.electronics':'Electronics',
    'sci.med':'Medicine'
}

In [30]:
training_data =  fetch_20newsgroups(subset='train', categories=category_map.keys(), shuffle=True, random_state=5)

In [31]:
count_vectorizer = CountVectorizer()
train_tc =  count_vectorizer.fit_transform(training_data.data)
train_tc.shape

(2844, 40321)

In [32]:
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)

In [33]:
input_data = [
    'You need to be careful with cars when you are driving on slippery roads',
    'A lot of devices can be operated wirelessly',
    'Players need to be careful when they are close to goal posts',
    'Political debates help us understand the perspectives of both sides'
]

In [34]:
classifier = MultinomialNB().fit(train_tfidf,training_data.target)

In [35]:
input_tc = count_vectorizer.transform(input_data)

In [36]:
input_tfidf = tfidf.transform(input_tc)

In [37]:
predictions = classifier.predict(input_tfidf)

In [38]:
for sent,category in zip(input_data, predictions):
  print(f"input:{sent}\n predicted category : {category_map[training_data.target_names[category]]}")
  print("="*50)

input:You need to be careful with cars when you are driving on slippery roads
 predicted category : Autos
input:A lot of devices can be operated wirelessly
 predicted category : Electronics
input:Players need to be careful when they are close to goal posts
 predicted category : Hockey
input:Political debates help us understand the perspectives of both sides
 predicted category : Politics
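
For intuition about what the classifier does with word counts, here is a tiny multinomial Naive Bayes with add-one (Laplace) smoothing, written with the standard library only. It is a sketch under simplifying assumptions (whitespace tokenization, raw counts instead of TF-IDF, a four-sentence made-up training set), not a reimplementation of scikit-learn's `MultinomialNB`; `train_nb` and `predict_nb` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {w for d in docs for w in d.lower().split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.lower().split())
    model = {}
    for y in class_docs:
        total = sum(word_counts[y].values())
        log_prior = math.log(class_docs[y] / len(docs))
        log_like = {
            w: math.log((word_counts[y][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[y] = (log_prior, log_like)
    return model

def predict_nb(model, text):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    words = text.lower().split()
    scores = {
        y: log_prior + sum(log_like[w] for w in words if w in log_like)
        for y, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)

# Toy training data (made up for this illustration)
train_docs = [
    "the car engine needs oil",
    "drive the car on the road",
    "the goalie stopped the puck",
    "the team scored a goal on ice",
]
train_labels = ["Autos", "Autos", "Hockey", "Hockey"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "car on the road"))
print(predict_nb(model, "puck on the ice"))
```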
