### Lemmatization
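- The process of finding a word's base form (lemma).
- Looks the word up in a dictionary and returns a form that actually exists there.
- NLTK uses WordNet as that dictionary.
- e.g. am, are, is => be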

In [1]:
import nltk
nltk.download("wordnet")   # the lemmatizer needs a dictionary, so download WordNet first

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [2]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('am')   # no POS given: the default is noun ('n'), so 'am' comes back unchanged

'am'

In [3]:
lemmatizer.lemmatize('am', 'v')   # the POS must be specified ('v' = verb)

'be'

In [4]:
lemmatizer.lemmatize('understanding', 'v')   # the POS must be specified ('v' = verb)

'understand'
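
The calls above differ only in the POS argument. A minimal sketch of why that matters, using a hand-written lookup table rather than WordNet: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names for illustration, and the noun default is chosen to mirror the `pos='n'` default of `WordNetLemmatizer.lemmatize`.

```python
# Toy lemma table keyed by (word, POS); a hand-written stand-in for WordNet.
TOY_LEMMAS = {
    ("am", "v"): "be",
    ("are", "v"): "be",
    ("is", "v"): "be",
    ("understanding", "v"): "understand",
    ("cars", "n"): "car",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos); fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("am"))        # no noun entry, so the word comes back unchanged
print(toy_lemmatize("am", "v"))   # verb entry found
```

Because the key includes the POS, the same surface form can map to different lemmas, or to none at all, depending on the tag that is passed in.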

#### WordNet
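- A semantically oriented dictionary.
- Words are grouped into synonym sets (synsets).
- Hypernym (broader term) and hyponym (narrower term) relations are well developed.
- Contains roughly 160,000 words.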

In [6]:
from nltk.corpus import wordnet as wn

wn.synsets("motorcar")   # find the synsets (synonym sets) that contain "motorcar"

[Synset('car.n.01')]

In [7]:
wn.synset('car.n.01').lemma_names()   # list the lemma names that belong to that synset

['car', 'auto', 'automobile', 'machine', 'motorcar']

In [8]:
wn.synset('car.n.01').definition()

'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

In [9]:
wn.synset('car.n.01').examples()

['he needs a car to get to work']

In [10]:
wn.synset('car.n.01').lemmas()

[Lemma('car.n.01.car'),
 Lemma('car.n.01.auto'),
 Lemma('car.n.01.automobile'),
 Lemma('car.n.01.machine'),
 Lemma('car.n.01.motorcar')]

In [11]:
wn.synsets('be')   # synsets() returns every synset the word belongs to

[Synset('beryllium.n.01'),
 Synset('be.v.01'),
 Synset('be.v.02'),
 Synset('be.v.03'),
 Synset('exist.v.01'),
 Synset('be.v.05'),
 Synset('equal.v.01'),
 Synset('constitute.v.01'),
 Synset('be.v.08'),
 Synset('embody.v.02'),
 Synset('be.v.10'),
 Synset('be.v.11'),
 Synset('be.v.12'),
 Synset('cost.v.01')]
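
The hypernym/hyponym structure mentioned in the WordNet notes can be pictured as links that are followed upward from a specific word to more general ones. A minimal sketch with a hand-written hierarchy: `TOY_HYPERNYMS` and `hypernym_path` are hypothetical names, and the entries are illustrative rather than copied from WordNet.

```python
# Each word maps to its direct hypernym (a more general term). Hand-written toy data.
TOY_HYPERNYMS = {
    "motorcar": "motor vehicle",
    "motor vehicle": "vehicle",
    "vehicle": "artifact",
}

def hypernym_path(word: str) -> list:
    """Follow hypernym links upward until a word with no hypernym is reached."""
    path = [word]
    while path[-1] in TOY_HYPERNYMS:
        path.append(TOY_HYPERNYMS[path[-1]])
    return path

print(hypernym_path("motorcar"))
```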

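#### Why stem or lemmatize?
1. It can improve accuracy, because inflected forms of the same word are treated as one feature.
2. It reduces computation => the vocabulary shrinks (dimensionality reduction).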
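
The dimensionality-reduction point can be checked by counting distinct tokens before and after mapping each one to its lemma. A minimal sketch, assuming a small hand-written lemma map (`TOY_LEMMAS` is hypothetical, not WordNet data):

```python
# Toy lemma map (hand-written, not WordNet) used to merge inflected forms.
TOY_LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be",
              "cars": "car", "running": "run", "runs": "run"}

tokens = ["am", "are", "is", "was", "car", "cars", "run", "running", "runs"]

vocab_before = set(tokens)                            # one feature per surface form
vocab_after = {TOY_LEMMAS.get(t, t) for t in tokens}  # one feature per lemma

print(len(vocab_before), "->", len(vocab_after))  # 9 -> 3
```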

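### stopwords
- Words that contribute little to the meaning of a sentence.
- They occur very frequently but carry little information, so they are usually removed.
- e.g. articles (a, an, the)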

In [None]:
from nltk.corpus import stopwords
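nltk.download('stopwords')
nltk.download('punkt')   # tokenizer models for word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead
stopwords.words('english')   # lists for other languages (e.g. spanish) are also available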

In [15]:
stop_words = stopwords.words('english')
text = '''KT's networks stopped working Monday morning due to a technical error, disrupting the communications of millions of individuals, stores and companies across the nation for an hour or longer. The country's second-largest telecommunication operator with over 10 million users initially suggested a distributed denial-of-service (DDoS) attack was the cause, but a few hours later blamed a glitch in network routing. "We suspected the DDoS attack because of the heavy traffic detected, but it turns out that a routing error caused the problem after we closely looked into the case," KT said in a statement.'''

In [None]:
# remove the words in text that appear in the stopword list
[w for w in text.split() if w not in stop_words]

In [22]:
# the version above misses capitalized stopwords such as 'The', so lowercase each word before comparing
case1 = [w for w in text.split() if w.lower() not in stop_words]

In [23]:
# punctuation still stays attached to the words; tokenize the text to separate it
from nltk.tokenize import word_tokenize
case2 = [w for w in word_tokenize(text) if w.lower() not in stop_words]

In [28]:
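# What fraction of the tokens survives in case1 and case2? (lower = more removed)
# Divide by the token count, not len(text): len(text) counts characters.
# Note: case1 was built with split(), so its ratio against the word_tokenize count is only approximate.
print(len(case1) / len(word_tokenize(text)))
print(len(case2) / len(word_tokenize(text)))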

0.5277777777777778
0.6481481481481481
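
The gap between `case1` and `case2` comes from tokenization: `split()` leaves punctuation glued to words, so a stopword followed by a period no longer matches the list. A minimal sketch, assuming a tiny hand-written stopword set and a regex tokenizer as stand-ins for NLTK's list and `word_tokenize` (`STOP` and `simple_tokenize` are hypothetical, and the regex is far cruder than NLTK's tokenizer):

```python
import re

STOP = {"the", "a", "of", "it", "but", "we"}   # tiny hand-written stopword set
sentence = "We suspected the attack, but it turns out a routing error caused it."

def simple_tokenize(s: str) -> list:
    """Split into runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", s)

by_split = [w for w in sentence.split() if w.lower() not in STOP]
by_token = [w for w in simple_tokenize(sentence) if w.lower() not in STOP]

print(by_split)  # 'it.' survives because the period is attached
print(by_token)  # 'it' is removed; punctuation becomes separate tokens
```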


In [29]:
# customizing
print(len(stop_words))
stop_words.append("hey")
print(len(stop_words))

179
180


There is no Korean list => Korean is an agglutinative language, so it is unclear where to cut a word.  
  e.g. 먹었었다: 었? 어? -ㅆ?

### unusual words
Uses the `words` corpus.

In [None]:
from nltk.corpus import words
nltk.download('words')
words.words()

In [32]:
len(words.words())   # the corpus covers most common words, so a word missing from it can be presumed unusual

236736

In [35]:
words.words()

['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron',
 'Aaronic',
 ...]

In [43]:
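def unusual_words(text: str):
  vocab = set(words.words())   # known words; a set makes each membership test fast
  tokens = set(w for w in word_tokenize(text) if w.isalpha())  # alphabetic tokens only
  # keep the tokens whose lowercase form is missing from the corpus
  return [w for w in tokens if w.lower() not in vocab]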

In [44]:
unusual_words(text)

['users',
 'glitch',
 'communications',
 'Monday',
 'detected',
 'looked',
 'companies',
 'caused',
 'networks',
 'individuals',
 'DDoS',
 'suggested',
 'hours',
 'KT',
 'stores',
 'disrupting']
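
The output above flags inflected forms such as 'users' and 'looked', which suggests the word list mostly holds base forms. One possible refinement is to lemmatize each token before the lookup. A minimal sketch with hand-written data: `TOY_VOCAB`, `TOY_LEMMAS`, and `unusual` are hypothetical stand-ins for the `words` corpus and a real lemmatizer.

```python
TOY_VOCAB = {"user", "look", "network", "error"}   # hand-written word list
TOY_LEMMAS = {"users": "user", "looked": "look", "networks": "network"}

def unusual(tokens, lemmatize=False):
    """Return alphabetic tokens whose (optionally lemmatized) lowercase form is not in TOY_VOCAB."""
    result = []
    for w in tokens:
        if not w.isalpha():
            continue
        key = w.lower()
        if lemmatize:
            key = TOY_LEMMAS.get(key, key)   # map inflected forms to their base form
        if key not in TOY_VOCAB:
            result.append(w)
    return result

tokens = ["users", "looked", "networks", "error", "DDoS", "KT"]
print(unusual(tokens))                  # inflected forms are flagged too
print(unusual(tokens, lemmatize=True))  # only 'DDoS' and 'KT' remain
```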

### Spelling

In [45]:
!pip install autocorrect

Collecting autocorrect
  Downloading autocorrect-2.5.0.tar.gz (622 kB)
[?25l[K     |▌                               | 10 kB 25.2 MB/s eta 0:00:01[K     |█                               | 20 kB 30.9 MB/s eta 0:00:01[K     |█▋                              | 30 kB 22.1 MB/s eta 0:00:01[K     |██                              | 40 kB 17.4 MB/s eta 0:00:01[K     |██▋                             | 51 kB 8.1 MB/s eta 0:00:01[K     |███▏                            | 61 kB 8.8 MB/s eta 0:00:01[K     |███▊                            | 71 kB 9.1 MB/s eta 0:00:01[K     |████▏                           | 81 kB 8.0 MB/s eta 0:00:01[K     |████▊                           | 92 kB 6.9 MB/s eta 0:00:01[K     |█████▎                          | 102 kB 7.6 MB/s eta 0:00:01[K     |█████▉                          | 112 kB 7.6 MB/s eta 0:00:01[K     |██████▎                         | 122 kB 7.6 MB/s eta 0:00:01[K     |██████▉                         | 133 kB 7.6 MB/s eta 0:00:01[K 

In [46]:
from autocorrect import Speller
speller = Speller()
speller("naaturll")

'natural'
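
To give an intuition for how a correction like this can be found, the sketch below picks the candidate with the smallest edit distance from a small word list. This is an illustration only and not necessarily how `autocorrect` works internally; `TOY_VOCAB`, `edit_distance`, and `toy_correct` are hypothetical names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping two rows at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

TOY_VOCAB = ["natural", "nature", "neutral", "language"]  # hand-written candidates

def toy_correct(word: str) -> str:
    """Return the candidate with the smallest edit distance to `word`."""
    return min(TOY_VOCAB, key=lambda cand: edit_distance(word, cand))

print(toy_correct("naaturll"))
```

Real spell checkers usually also weigh word frequency, so that a common word wins over a rare one at the same distance.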

#### exercise

In [47]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


In [64]:
PATH = "/gdrive/MyDrive/01. Programming/Practical NLP/data/Week 08/"
with open(PATH + "exercise.txt", "r") as f:
  data = f.readlines()

In [65]:
data

['In this book authored by Sohom Ghosh and Dwight Gunning, we shall learnning how to pracess Natueral Language and extract insights from it. The first four chapter will introduce you to the basics of NLP. Later chapters will describe how to deal with complex NLP prajects. If you want to get early access of it, you should book your order now.']

In [68]:
# spelling correction
a = []
for line in data:
  a.append(speller(line))

data = " ".join(a)

# [speller(i) for i in data]

In [None]:
# speller: does not distinguish upper and lower case, and accepts a whole text as input.

In [70]:
# remove stopwords
data = data.strip()
result = [w for w in word_tokenize(data) if w.lower() not in stop_words]
len(result) / len(word_tokenize(data))

0.5671641791044776

In [75]:
with open(PATH + "file_out.txt", "w") as f:
  for w in result:
    f.write(w + "\n")