# Ngram Language Models 
## Background
-----
### Significant Collocations
#### Tools NLTK provides for finding significant collocations
* `CollocationFinder`: collects collocation candidate frequencies, then filters and ranks them.
* `NgramAssocMeasures`: generic association measures. Each public method returns a score. (Available methods: `chi_sq`, `jaccard`, `likelihood_ratio`, `mi_like`, `pmi`, `poisson_stirling`, `raw_freq`, `student_t`, ...)
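
As a minimal sketch of how a finder and an association measure work together, the snippet below uses the bigram variants (`BigramCollocationFinder`, `BigramAssocMeasures`) on a made-up word list. The sentence and the frequency threshold of 2 are illustrative assumptions, not part of the original notes.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics.association import BigramAssocMeasures

# Toy token list (hypothetical data, no corpus download needed).
words = "the quick brown fox jumps over the lazy dog the quick brown cat".split()

# Collect bigram candidates and their frequencies.
finder = BigramCollocationFinder.from_words(words)

# Filter: drop candidates seen fewer than 2 times.
finder.apply_freq_filter(2)

# Rank: the association measure returns one score per n-gram, and
# score_ngrams sorts the (ngram, score) pairs from highest to lowest.
scored = finder.score_ngrams(BigramAssocMeasures.pmi)
for bigram, score in scored:
    print(bigram, round(score, 3))
```

Swapping `BigramAssocMeasures.pmi` for another method such as `likelihood_ratio` changes only the ranking criterion; the collect, filter, rank flow stays the same.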

#### `rank_quadgrams`
* `ngrams = QuadgramCollocationFinder.from_words(corpus.words())`
    * `corpus.words()`: returns the text inside the corpus in already word-tokenized form.
    * `from_words`: a class method of `QuadgramCollocationFinder`. It constructs a `QuadgramCollocationFinder` for the quadgrams (n = 4) in the given sequence.
* `scored = ngrams.score_ngrams(metric)`: `metric` is one of the methods provided by `QuadgramAssocMeasures`. The quadgrams are scored with `metric`, and the result is stored in `scored` as `(ngram, score)` pairs ordered from highest to lowest score.
    
``` 
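# Full code
from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures

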
def rank_quadgrams(corpus, metric, path=None):
    """
    Find and rank quadgrams from the supplied corpus using the given
    association metric. Write the quadgrams out to the given path if
    supplied otherwise return the list in memory.
    """

    # Create a collocation ranking utility from corpus words.
    ngrams = QuadgramCollocationFinder.from_words(corpus.words())

    # Rank collocations by an association metric
    scored = ngrams.score_ngrams(metric)

    if path:
        with open(path, 'w') as f:
            f.write("Collocation\tScore ({})\n".format(metric.__name__))
            for ngram, score in scored:
                f.write("{}\t{}\n".format(repr(ngram), score))
    else:
        return scored
```
Therefore, calling the function as follows
```
rank_quadgrams(
    corpus, QuadgramAssocMeasures.likelihood_ratio, "quadgrams.txt"
)
```
ranks the quadgrams of the corpus by `likelihood_ratio` and writes them to the file "quadgrams.txt" in rank order.
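
To see what `scored` and the output file look like without a corpus reader, here is a small sketch under stated assumptions: the token list is made up and stands in for `corpus.words()`, the metric is `raw_freq` (chosen only because its values are easy to check by hand), and the rows go to an in-memory buffer instead of a file.

```python
import io

from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures

# Toy token list standing in for corpus.words() (hypothetical data).
words = "a b c d a b c d a b c e".split()

finder = QuadgramCollocationFinder.from_words(words)
metric = QuadgramAssocMeasures.raw_freq
scored = finder.score_ngrams(metric)  # list of (quadgram, score) pairs

# Write the same tab-separated layout as rank_quadgrams,
# but to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
buffer.write("Collocation\tScore ({})\n".format(metric.__name__))
for ngram, score in scored:
    buffer.write("{}\t{}\n".format(repr(ngram), score))

print(buffer.getvalue())
```

The first line of the buffer is the header, and each following line holds one quadgram and its score, with the highest-scoring quadgrams first.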


#### `SignificantCollocations`
* `BaseEstimator, TransformerMixin`: base classes from the `sklearn` library, used to build a custom pipeline step for preprocessing. (See: [BaseEstimator in sklearn.base](https://stackoverflow.com/questions/15233632/baseestimator-in-sklearn-base-python), [Building transformers for preprocessing in scikit-learn (in Korean)](https://databuzz-team.github.io/2018/11/11/make_pipeline/))
* `def fit(self, docs, target)`: takes `docs`, a collection of documents, as input, builds their n-grams, and stores them in `ngrams`. It then stores the significant collocations (n-grams), ranked by `self.metric`, in `self._scored_` as a `dict`.
* `def transformation(self, docs, target)`: for the top 50 n-grams by `raw_freq`, stores each n-gram together with its score (the one computed in the `fit` method) in a `dict`.
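
The description above can be sketched roughly as follows. This is not the original class, only an illustration under assumptions: the constructor parameters `ngram_class` and `metric` and their defaults are hypothetical, `fit` returns `self` so calls can be chained, and the second method is named `transform` here because that is the name scikit-learn pipelines call. The two documents are made-up data.

```python
from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures
from sklearn.base import BaseEstimator, TransformerMixin


class SignificantCollocations(BaseEstimator, TransformerMixin):
    """Sketch: map each document to a dict of {quadgram: corpus-level score}."""

    def __init__(self, ngram_class=QuadgramCollocationFinder,
                 metric=QuadgramAssocMeasures.pmi):
        # Both parameters and their defaults are assumptions for this sketch.
        self.ngram_class = ngram_class
        self.metric = metric

    def fit(self, docs, target=None):
        # Score every n-gram found across all documents with self.metric.
        ngrams = self.ngram_class.from_documents(docs)
        self._scored_ = dict(ngrams.score_ngrams(self.metric))
        return self

    def transform(self, docs):
        # For each document, take its 50 most frequent n-grams and look up
        # the score learned in fit (0.0 if the n-gram was not scored there).
        features = []
        for doc in docs:
            finder = self.ngram_class.from_words(list(doc))
            top = finder.nbest(QuadgramAssocMeasures.raw_freq, 50)
            features.append({ng: self._scored_.get(ng, 0.0) for ng in top})
        return features


# Hypothetical tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the cat sat on the sofa".split(),
]
features = SignificantCollocations().fit(docs).transform(docs)
print(features[0])
```

Each document becomes one dictionary, which is a shape that a downstream vectorizer such as `sklearn.feature_extraction.DictVectorizer` can consume.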

## Ngram Language Model
-----
### NgramCounter
```
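from collections import defaultdict

from nltk.probability import ConditionalFreqDist, FreqDist
from nltk.util import ngrams


class NgramCounter(object):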
    """
    The NgramCounter class counts ngrams given a vocabulary and ngram size.
    """

    def __init__(self, n, vocabulary, unknown="<UNK>"):
        """
        n is the size of the ngram
        """
        if n < 1:
            raise ValueError("ngram size must be greater than or equal to 1")

        self.n = n
        self.unknown = unknown
        self.padding = {
            "pad_left": True,
            "pad_right": True,
            "left_pad_symbol": "<s>",
            "right_pad_symbol": "</s>"
        }

        self.vocabulary = vocabulary
        self.allgrams = defaultdict(ConditionalFreqDist)
        self.ngrams = FreqDist()
        self.unigrams = FreqDist()

    def train_counts(self, training_text):
        for sent in training_text:
            checked_sent = (self.check_against_vocab(word) for word in sent)
            sent_start = True
            for ngram in self.to_ngrams(checked_sent):
                self.ngrams[ngram] += 1
                context, word = tuple(ngram[:-1]), ngram[-1]
                if sent_start:
                    for context_word in context:
                        self.unigrams[context_word] += 1
                    sent_start = False

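                # Count the lower-order contexts too. Slice the full context
                # each time: re-slicing an already shortened context would
                # drop too many words once n >= 4.
                for window, ngram_order in enumerate(range(self.n, 1, -1)):
                    self.allgrams[ngram_order][context[window:]][word] += 1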
                self.unigrams[word] += 1

    def check_against_vocab(self, word):
        if word in self.vocabulary:
            return word
        return self.unknown

    def to_ngrams(self, sequence):
        """
        Wrapper for NLTK ngrams method
        """
        return ngrams(sequence, self.n, **self.padding)
```
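* `def __init__(self, n, vocabulary, unknown="<UNK>")`: stores the n-gram size `n`, the `vocabulary`, and the `unknown` token that replaces out-of-vocabulary words; sets up sentence padding with `<s>` and `</s>`; and initializes the count containers `allgrams`, `ngrams`, and `unigrams`. It raises `ValueError` if `n < 1`.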
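
The padding and the context/word split that `NgramCounter` relies on can be tried in isolation. The sketch below is a simplified illustration for trigrams only, not the class itself; the vocabulary and sentences are hypothetical.

```python
from nltk.probability import ConditionalFreqDist, FreqDist
from nltk.util import ngrams

# Hypothetical vocabulary and tokenized sentences.
vocabulary = {"the", "cat", "dog", "sat"}
sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]

# Same padding options as NgramCounter.padding.
padding = {
    "pad_left": True,
    "pad_right": True,
    "left_pad_symbol": "<s>",
    "right_pad_symbol": "</s>",
}

trigram_counts = FreqDist()
context_counts = ConditionalFreqDist()

for sent in sentences:
    # Replace out-of-vocabulary words with the unknown token.
    checked = [w if w in vocabulary else "<UNK>" for w in sent]
    for trigram in ngrams(checked, 3, **padding):
        trigram_counts[trigram] += 1
        # Everything but the last token is the context; the last is the word.
        context, word = trigram[:-1], trigram[-1]
        context_counts[context][word] += 1

print(list(ngrams(["the", "cat", "sat"], 3, **padding)))
print(context_counts[("<s>", "the")].most_common())
```

Because of the padding, a sentence of three tokens yields five trigrams, so sentence starts and ends are counted as contexts as well.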


   
      

