# Tutorial 7-1. Natural Language Processing (NLP)

**GOAL**: Get a first taste of processing natural language text with Python!

### 0. Installation

Before starting this tutorial, please install the following two packages:
- English: `nltk` package
- Korean: `konlpy` package (+Java Development Kit)

You can install them by running the following commands in the Anaconda Prompt:
```
>> conda install nltk
>> pip install konlpy
```

For details, see slides.
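
To confirm that both packages are visible to your Python environment before continuing, a quick check along these lines may help. This is only a sketch: the helper name `is_installed` is made up for this illustration, and it only tests whether the packages can be found, not whether Java is set up correctly for `konlpy`.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the import system can locate `package_name`."""
    return importlib.util.find_spec(package_name) is not None

# Package names taken from the installation step above.
for name in ("nltk", "konlpy"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, rerun the install commands in the same environment that your notebook uses.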

### 1. English NLP using the NLTK package

[NLTK](https://www.nltk.org/) is a pioneering NLP package built in Python.

First, import the `nltk` package and download a tokenizer and a POS (part-of-speech) tagger.

In [1]:
import nltk

nltk.download('punkt') # tokenizer
nltk.download('averaged_perceptron_tagger') # pos tagger

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\01wkd\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\01wkd\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

Try your own sentences. You can tokenize them, tag each token with its part of speech, or extract the nouns.

In [10]:
sentence = "I'd like to drink some water. But I have only beer."

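# Split the sentence into word and punctuation tokens
tokens = nltk.word_tokenize(sentence)

# Attach a Penn Treebank POS tag to each token
tagged = nltk.pos_tag(tokens)
print(tagged)

# Extract nouns with an explicit loop (noun tags all start with 'NN')
nn_list = []
for word, tag in tagged:
    if tag[:2] == 'NN':
        nn_list.append(word)

# The same extraction written as a list comprehension
nouns = [word for word, tag in tagged if tag[:2] == 'NN']
print(nn_list)
print(nouns)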

[('I', 'PRP'), ("'d", 'MD'), ('like', 'VB'), ('to', 'TO'), ('drink', 'VB'), ('some', 'DT'), ('water', 'NN'), ('.', '.'), ('But', 'CC'), ('I', 'PRP'), ('have', 'VBP'), ('only', 'RB'), ('beer', 'NN'), ('.', '.')]
['water', 'beer']
['water', 'beer']
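
The prefix test works because Penn Treebank tags are grouped by prefix: noun tags start with `NN` (`NN`, `NNS`, `NNP`, `NNPS`) and verb tags start with `VB` (`VB`, `VBD`, `VBG`, `VBN`, `VBP`, `VBZ`). The sketch below reuses the tagged output printed above as a literal list, so it runs without NLTK. The helper `words_with_prefix` is a made-up name for this illustration, not part of NLTK.

```python
# Tagged output copied from the cell above.
tagged = [('I', 'PRP'), ("'d", 'MD'), ('like', 'VB'), ('to', 'TO'),
          ('drink', 'VB'), ('some', 'DT'), ('water', 'NN'), ('.', '.'),
          ('But', 'CC'), ('I', 'PRP'), ('have', 'VBP'), ('only', 'RB'),
          ('beer', 'NN'), ('.', '.')]

def words_with_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with `prefix`."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_with_prefix(tagged, 'NN')  # NN, NNS, NNP, NNPS
verbs = words_with_prefix(tagged, 'VB')  # VB, VBD, VBG, VBN, VBP, VBZ
print(nouns)
print(verbs)
```

Changing the prefix (for example to `'JJ'` for adjectives) selects a different word class with the same function.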


In [11]:
sentence = "By the end of this course I will be a great data scientist! 100% sure!"

tokens = nltk.word_tokenize(sentence.lower()) # lowercase
print(tokens)

tagged = nltk.pos_tag(tokens)
print(tagged)

nouns = [word for word, tag in tagged if tag[:2] == 'NN']
print(nouns)

['by', 'the', 'end', 'of', 'this', 'course', 'i', 'will', 'be', 'a', 'great', 'data', 'scientist', '!', '100', '%', 'sure', '!']
[('by', 'IN'), ('the', 'DT'), ('end', 'NN'), ('of', 'IN'), ('this', 'DT'), ('course', 'NN'), ('i', 'NN'), ('will', 'MD'), ('be', 'VB'), ('a', 'DT'), ('great', 'JJ'), ('data', 'NNS'), ('scientist', 'NN'), ('!', '.'), ('100', 'CD'), ('%', 'NN'), ('sure', 'JJ'), ('!', '.')]
['end', 'course', 'i', 'data', 'scientist', '%']
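
Note that the noun list above contains `'i'` and `'%'`. In the first example the capitalized `'I'` was tagged `PRP`, so lowercasing before tagging is a likely cause of the `'i'` mis-tag. One possible cleanup, sketched below on the tagged output copied from above, keeps only alphabetic tokens and drops a small stopword set. The `STOPWORDS` set and the helper `clean_nouns` are assumptions made for this illustration; a real project would likely use a fuller stopword list or tag the text before lowercasing it.

```python
# Tagged output copied from the cell above.
tagged = [('by', 'IN'), ('the', 'DT'), ('end', 'NN'), ('of', 'IN'),
          ('this', 'DT'), ('course', 'NN'), ('i', 'NN'), ('will', 'MD'),
          ('be', 'VB'), ('a', 'DT'), ('great', 'JJ'), ('data', 'NNS'),
          ('scientist', 'NN'), ('!', '.'), ('100', 'CD'), ('%', 'NN'),
          ('sure', 'JJ'), ('!', '.')]

# Hypothetical stopword set, used only for this example.
STOPWORDS = {'i'}

def clean_nouns(tagged_tokens, stopwords=STOPWORDS):
    """Keep noun-tagged words that are alphabetic and not stopwords."""
    return [word for word, tag in tagged_tokens
            if tag.startswith('NN')
            and word.isalpha()          # drops symbols such as '%'
            and word not in stopwords]  # drops mis-tagged words such as 'i'

print(clean_nouns(tagged))
```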


### 2. Korean NLP using the KoNLPy package

[KoNLPy](https://konlpy.org/) is a Python package for Korean NLP.

It contains well-known Korean POS taggers such as Hannanum (`Hannanum`), Kkma (`Kkma`), and Open Korean Text (`Okt`, formerly known as Twitter).

Here, we will use the Okt tagger. Let's load it.

In [12]:
from konlpy.tag import Okt

okt = Okt()

Try your own sentence. You can tokenize it, tag each token with its part of speech, or extract nouns and phrases.

In [14]:
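# Alternative sentence to try ("The data analysis class is so much fun that it makes me dizzy..."):
# sentence = '데이터 분석 수업이 넘나 재밌어서 현기증이 나요...ㅋ_ㅠ'
# "The more I study, the more I realize how much I don't know. I learned a lot...
#  but I've forgotten it all? lol. I keep studying anyway. Because it's fun!"
sentence = '공부를 하면할수록 모르는게 많다는 것을 알게 됩니다. 배운건 많았는데... 다 까먹어버렸네요? ㅋㅋ 그래도 계속 공부합니다. 재밌으니까!'

# Split the sentence into morphemes
tokens = okt.morphs(sentence)
print(tokens)

# Attach a POS tag to each morpheme
tagged = okt.pos(sentence)
print(tagged)

# Extract nouns only
nouns = okt.nouns(sentence)
print(nouns)

# Extract phrases
phrases = okt.phrases(sentence)
print(phrases)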

['공부', '를', '하면', '할수록', '모르는게', '많다는', '것', '을', '알', '게', '됩니다', '.', '배운건', '많았는데', '...', '다', '까먹어', '버렸네요', '?', 'ㅋㅋ', '그래도', '계속', '공부', '합니다', '.', '재밌으니까', '!']
[('공부', 'Noun'), ('를', 'Josa'), ('하면', 'Verb'), ('할수록', 'Verb'), ('모르는게', 'Verb'), ('많다는', 'Adjective'), ('것', 'Noun'), ('을', 'Josa'), ('알', 'Noun'), ('게', 'Josa'), ('됩니다', 'Verb'), ('.', 'Punctuation'), ('배운건', 'Verb'), ('많았는데', 'Adjective'), ('...', 'Punctuation'), ('다', 'Adverb'), ('까먹어', 'Verb'), ('버렸네요', 'Verb'), ('?', 'Punctuation'), ('ㅋㅋ', 'KoreanParticle'), ('그래도', 'Adverb'), ('계속', 'Noun'), ('공부', 'Noun'), ('합니다', 'Verb'), ('.', 'Punctuation'), ('재밌으니까', 'Adjective'), ('!', 'Punctuation')]
['공부', '것', '알', '계속', '공부']
['공부', '많다는 것', '계속', '계속 공부']
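
Once the text is tagged, a common next step is to pick out one word class and count it. The sketch below works on the tagged output copied from above, so it runs without KoNLPy or Java. The helper `extract_by_tag` and its `min_length` parameter are assumptions made for this illustration. Dropping single-character nouns is a rough heuristic: here it removes `'알'`, which looks like a mis-tagged piece of a verb, but it also removes the legitimate noun `'것'`.

```python
from collections import Counter

# Tagged output copied from the cell above.
tagged = [('공부', 'Noun'), ('를', 'Josa'), ('하면', 'Verb'), ('할수록', 'Verb'),
          ('모르는게', 'Verb'), ('많다는', 'Adjective'), ('것', 'Noun'), ('을', 'Josa'),
          ('알', 'Noun'), ('게', 'Josa'), ('됩니다', 'Verb'), ('.', 'Punctuation'),
          ('배운건', 'Verb'), ('많았는데', 'Adjective'), ('...', 'Punctuation'),
          ('다', 'Adverb'), ('까먹어', 'Verb'), ('버렸네요', 'Verb'), ('?', 'Punctuation'),
          ('ㅋㅋ', 'KoreanParticle'), ('그래도', 'Adverb'), ('계속', 'Noun'),
          ('공부', 'Noun'), ('합니다', 'Verb'), ('.', 'Punctuation'),
          ('재밌으니까', 'Adjective'), ('!', 'Punctuation')]

def extract_by_tag(tagged_tokens, tag_name, min_length=1):
    """Return words tagged `tag_name` that have at least `min_length` characters."""
    return [word for word, tag in tagged_tokens
            if tag == tag_name and len(word) >= min_length]

# Nouns of two or more characters, counted by frequency
nouns = extract_by_tag(tagged, 'Noun', min_length=2)
noun_counts = Counter(nouns)
print(noun_counts.most_common())

# The same helper selects other word classes, e.g. adjectives
adjectives = extract_by_tag(tagged, 'Adjective')
print(adjectives)
```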
