# Natural Language Processing
### Goal of lesson
- How the simple syntax of language can be parsed
- What Context-Free Grammar (CFG) is
- Use it to parse text
- Understand text in trigrams
- See how it can be used to generate predictions

### What is Natural Language Processing
- Automatic computational processing of human languages
- Includes 
    - Algorithms that take human written language as input
    - Algorithms that produce natural text

- Examples include
    - Automatic summarization
    - Language identification
    - Translation

### Syntax
- One basic description of a language's syntax is the sequence in which the subject, verb, and object usually appear in sentences.

### Formal Grammar
- A system of rules for generating sentences in a language
- A grammar is usually thought of as a language generator ([wiki](https://en.wikipedia.org/wiki/Formal_grammar))

### Context-Free Grammar (CFG)
- A formal grammar is "context free" if its production rules can be applied regardless of the context of a nonterminal ([wiki](https://en.wikipedia.org/wiki/Context-free_grammar)).
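To make the "language generator" view concrete, here is a minimal sketch in plain Python that enumerates every sentence a small grammar can produce. The dictionary encoding and the `expand` helper are assumptions made for illustration and are not part of nltk. The rules mirror the toy grammar used in this lesson. Each nonterminal is expanded the same way wherever it appears, which is what "context free" means.

```python
from itertools import product

# Hypothetical dict encoding of a small grammar (illustration only):
# each nonterminal maps to a list of alternatives,
# and each alternative is a list of symbols.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "N"], ["N"]],
    "VP": [["V"], ["V", "NP"]],
    "D": [["the"], ["a"]],
    "N": [["she"], ["city"], ["car"]],
    "V": [["saw"], ["walked"]],
}


def expand(symbol):
    """Yield every word sequence that can be derived from `symbol`."""
    if symbol not in GRAMMAR:
        # A terminal is an actual word, so it expands to itself.
        yield [symbol]
        return
    for alternative in GRAMMAR[symbol]:
        # Each symbol is expanded independently of its neighbours.
        parts = [list(expand(s)) for s in alternative]
        for combo in product(*parts):
            yield [word for part in combo for word in part]


sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))
print(sentences[:3])
```

This only terminates because the grammar has no recursive rules. A grammar with recursion describes infinitely many sentences, so a full enumeration would need a depth limit.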

> #### Programming Notes:
> - Libraries used
>     - [**nltk**](https://www.nltk.org) - Natural Language Toolkit
>     - [**os**](https://docs.python.org/3/library/os.html) - Miscellaneous operating system interfaces
>     - [**collections**](https://docs.python.org/3/library/collections.html) - Container datatypes
>     - [**markovify**](https://pypi.org/project/markovify/) - A simple, extensible Markov chain generator
> - Functionality and concepts used
>     - [**ChartParser**](https://tedboy.github.io/nlps/generated/generated/nltk.ChartParser.html) generic chart parser
>     - **List Comprehension** to convert data ([Lecture on **List Comprehension**](https://youtu.be/vCYEvtfXdig))
>     - [**Counter**](https://docs.python.org/3/library/collections.html#collections.Counter) a dict subclass for counting hashable objects
>     - [**markovify.Text**](https://pypi.org/project/markovify/) to create your Markov Model

In [2]:
# Goal: generate predictions from text.
# Context can change how a question is interpreted.
# nltk: Natural Language Toolkit
!pip install nltk



In [3]:
import nltk

In [21]:
# Define a context-free grammar (CFG).
# The rules below describe how a sentence (S) can be constructed:
grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> D N | N
    VP -> V | V NP

    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"    
""")

parser = nltk.ChartParser(grammar)

sentence = input().split()

for tree in parser.parse(sentence):
    tree.pretty_print()

DGRdg


ValueError: Grammar does not cover some of the input words: "'DGRdg'".

### Challenge with CFG
- You need to encode all possibilities
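One consequence is that any word missing from the grammar makes parsing impossible, as the `ValueError` above shows. The following is a minimal sketch of such a vocabulary check in plain Python. The `TERMINALS` set and the `uncovered_words` helper are assumptions for illustration and are not nltk functions.

```python
# The words (terminals) that the toy grammar above knows about.
TERMINALS = {"the", "a", "she", "city", "car", "saw", "walked"}


def uncovered_words(sentence, vocabulary=TERMINALS):
    """Return the words of `sentence` that the vocabulary does not contain."""
    return [word for word in sentence.split() if word not in vocabulary]


print(uncovered_words("she saw the city"))
print(uncovered_words("she saw the dog"))
```

The check is case sensitive, so "She" would count as uncovered unless the input is lowercased first.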

### Idea
- Understand text in small subsets
- **$n$-gram**
    - a contiguous sequence of $n$ items from a sample text
- **Word $n$-gram**
    - a contiguous sequence of $n$ words from a sample text
- **unigram**
    - 1 item in sequence
- **bigram**
    - 2 items in sequence
- **trigram**
    - 3 items in sequence
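As a small illustration of these definitions, the sketch below builds word n-grams with plain Python. The `ngrams` helper and the sample sentence are assumptions for this example. It is written to behave like a sliding window of size $n$ over the list.

```python
def ngrams(items, n):
    """Return all contiguous runs of n items as tuples."""
    # zip stops at the shortest slice, so only full windows are produced.
    return list(zip(*(items[i:] for i in range(n))))


words = "it was a dark and stormy night".split()

unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

print(bigrams[:2])
print(trigrams[:2])
```

A sequence of $k$ words yields $k - n + 1$ n-grams, so the seven-word sample has six bigrams and five trigrams.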

### Word Tokenization
- The task of splitting a sequence of characters into tokens (for example, words)
- Considerations: commas, punctuation, etc.
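To show why punctuation needs care, here is a rough tokenizer sketch based on a regular expression. It is a simplification for illustration and is not how `nltk.word_tokenize` works: the pattern and the `simple_tokenize` name are assumptions, and the tokens nltk produces will differ in details such as contractions.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # A word (optionally with an apostrophe part), or one punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


text = "Mr. Holmes said, \"It's elementary!\""
tokens = simple_tokenize(text)

# Keep only tokens that contain a letter, and lowercase them.
words = [t.lower() for t in tokens if any(c.isalpha() for c in t)]

print(tokens)
print(words)
```

Splitting on whitespace alone would leave tokens such as `said,` with the comma attached, which would then be counted as a different word from `said`.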

In [4]:
import os
from collections import Counter

In [5]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [8]:
content = []
for filename in os.listdir("files/holmes/"):
    #print(filename)
    with open(f"files/holmes/{filename}") as f:
        content.append(f.read())

In [9]:
len(content)

21

In [11]:
# Build the corpus as one flat list of lowercase words:
# 1. item is the full text of one file.
# 2. nltk.word_tokenize splits that text into tokens.
# 3. A token is kept only if it contains at least one alphabetic character,
#    and it is converted to lower case.

corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item) if any(c.isalpha() for c in word)])

In [12]:
len(corpus) # number of words

178246

In [14]:
# Create trigrams from the corpus:
# nltk.ngrams yields every run of three consecutive words,
# and Counter counts how many times each trigram appears.
ngrams = Counter(nltk.ngrams(corpus, 3))

In [15]:
# 80: ('it', 'was', 'a') means this sequence of three words appears 80 times in the 178246-word corpus.
# 71: ('one', 'of', 'the') means this sequence appears 71 times.

# Which ones are the most likely?
# After 'it' and 'was', the most likely next word is 'a'.

for ngram, freq in ngrams.most_common(10): 
    print(f"{freq}: {ngram}")

80: ('it', 'was', 'a')
71: ('one', 'of', 'the')
65: ('i', 'think', 'that')
59: ('out', 'of', 'the')
55: ('there', 'was', 'a')
55: ('that', 'he', 'had')
55: ('that', 'it', 'was')
55: ('that', 'he', 'was')
52: ('it', 'is', 'a')
49: ('i', 'can', 'not')
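
The counts can be turned into probabilities for the next word. The sketch below does this for a tiny made-up word list instead of the Holmes corpus. The `next_word_distribution` helper and the sample text are assumptions for illustration.

```python
from collections import Counter


def next_word_distribution(trigram_counts, w1, w2):
    """Return {word: probability} for words that follow the pair (w1, w2)."""
    followers = Counter()
    for (a, b, c), count in trigram_counts.items():
        if (a, b) == (w1, w2):
            followers[c] += count
    total = sum(followers.values())
    # most_common() orders the result from most to least likely.
    return {word: count / total for word, count in followers.most_common()}


toy_words = "it was a dog it was a cat it was the end".split()
toy_counts = Counter(zip(toy_words, toy_words[1:], toy_words[2:]))

dist = next_word_distribution(toy_counts, "it", "was")
print(dist)
```

In the toy text, "it was" is followed by "a" twice and by "the" once, so "a" gets probability 2/3 and "the" gets 1/3. A pair that never occurs gives an empty dictionary.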


### Markov Model
- A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event ([wiki](https://en.wikipedia.org/wiki/Markov_chain))
- Or, as in the example above:
    - Given any two words -> you have probabilities for the next word
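
A minimal sketch of this idea in plain Python: the state is the last two words, and the next word is sampled in proportion to how often it followed that pair. The helper names `build_model` and `generate` and the toy text are assumptions for illustration. This is not how markovify is implemented.

```python
import random
from collections import Counter, defaultdict


def build_model(words):
    """Map each pair of consecutive words to a Counter of following words."""
    model = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        model[(w1, w2)][w3] += 1
    return model


def generate(model, start, length, rng):
    """Extend the pair `start` by up to `length` sampled words."""
    w1, w2 = start
    output = [w1, w2]
    for _ in range(length):
        followers = model.get((w1, w2))
        if not followers:
            break  # this pair was never followed by anything in training
        # Sample the next word, weighted by its observed count.
        next_word = rng.choices(list(followers), weights=list(followers.values()))[0]
        output.append(next_word)
        w1, w2 = w2, next_word
    return output


toy_words = "it was a dog and it was a cat and it was the end".split()
toy_model = build_model(toy_words)

rng = random.Random(0)  # seeded so that runs are repeatable
generated = generate(toy_model, ("it", "was"), 8, rng)
print(" ".join(generated))
```

Every three consecutive words in the generated text already occur in the training text, which is why the output of such a model looks locally plausible even when the whole sentence makes little sense.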

In [None]:
# Given any two words, the model has probabilities for the next word.
# Start from two words that are likely to occur together, pick a likely
# third word, then repeat with the last two words to continue the text.

In [17]:
!pip install markovify

Collecting markovify
  Downloading markovify-0.9.4.tar.gz (27 kB)
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py): started
  Building wheel for markovify (setup.py): finished with status 'done'
  Created wheel for markovify: filename=markovify-0.9.4-py3-none-any.whl size=18628 sha256=08a8033045ab84672db8eaea8e860deb7fcbd14678f1ebd0c0d36e064456a1fe
  Stored in directory: c:\users\user\appdata\local\pip\cache\wheels\76\0a\ab\8727d219981e57e6036316dd2ec2037e61ccea0c016f7ae0c1
Successfully built markovify
Installing collected packages: markovify
Successfully installed markovify-0.9.4


In [19]:
# Use a Markov model to create text.
import markovify
with open("files/shakespeare.txt") as f:
    text = f.read()
model = markovify.Text(text)
model.make_sentence()
# The model is based on the same idea as the trigram counts:
# given two words that occur together, e.g. ('it', 'was'),
# which word is likely to come third?

"I will discase me, and now he writes that he is arm'd for him, say I. Nurse."

In [20]:
# Project description: explore tweets using statistics that we compute.
# Can you figure out who tweeted them?