# Language Models 
***

# Overview

- What is Language Modeling?
- Statistical Language Models
- N-grams
- Neural Language Models

## What is Language Modeling?

#### Determining the probability that a sequence of words occurs together in a sentence

#### Application Areas

- Machine Translation
    - P(high winds tonite) > P(large winds tonite)
- Speech Recognition
    - P(I saw a van) >> P(eyes awe of an) 
- Spell Correction
    - The office is about fifteen minuets from my house 
    - P(about fifteen minutes from) > P(about fifteen minuets from)



<br>
<br>
<br>

## Autocomplete


![title](https://i.chzbgr.com/full/5734002944/h1074620A/google-autocomplete-fail)

## Statistical Language Models

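Using the Azerbaijani example sentence "bugün hava soyuq olacaq" ("the weather will be cold today"), the chain rule gives:

P("bugün hava soyuq olacaq") = P("bugün") * P("hava" | "bugün") * P("soyuq" | "bugün", "hava") * P("olacaq" | "bugün", "hava", "soyuq")

- Difficult to estimate directly: the longer the history, the less often that exact word sequence appears in the training corpus, so the counts behind each conditional probability become sparse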
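
In general form, the factorization above is the chain rule of probability. The symbols $w_1, \dots, w_m$ for the words of an $m$-word sentence are notation introduced here for illustration and do not appear in the original slides.

```latex
P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
```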


## N-grams

- Look only at the previous n-1 words when predicting the current word (an n-gram is the current word plus its n-1 predecessors)

- **Unigram:**
    - P("bugün hava soyuq olacaq") = P("bugün") * P("hava") * P("soyuq") * P("olacaq")
- **Bigram:**
    - P("bugün hava soyuq olacaq") = P("bugün") * P("hava" | "bugün") * P("soyuq" | "hava") * P("olacaq" | "soyuq")
- **Trigram:**
    - P("bugün hava soyuq olacaq") = P("bugün") * P("hava" | "bugün") * P("soyuq" | "bugün", "hava") * P("olacaq" | "hava", "soyuq")
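
The bigram factorization can be tried out on a tiny corpus. The sketch below is a minimal illustration under stated assumptions: `toy_corpus` is invented for this example, the `<s>` start marker is an added convention (so the first factor is P("bugün" | `<s>`) rather than P("bugün")), and the estimates are plain unsmoothed relative frequencies.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only
toy_corpus = [
    "bugün hava soyuq olacaq",
    "bugün hava isti olacaq",
    "sabah hava soyuq olacaq",
]

history_counts = Counter()  # how often each word appears as a history
bigram_counts = Counter()   # how often each (previous, current) pair appears
for sentence in toy_corpus:
    words = ["<s>"] + sentence.split()  # <s> marks the sentence start (assumption)
    history_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))


def bigram_sentence_probability(sentence):
    """P(sentence) under an unsmoothed maximum-likelihood bigram model."""
    words = ["<s>"] + sentence.split()
    probability = 1.0
    for prev, cur in zip(words, words[1:]):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: the unsmoothed estimate is zero
        probability *= bigram_counts[(prev, cur)] / history_counts[prev]
    return probability


p_seen = bigram_sentence_probability("bugün hava soyuq olacaq")
p_unseen = bigram_sentence_probability("bugün hava sabah olacaq")
print(p_seen, p_unseen)
```

On this toy corpus the first sentence gets 2/3 * 1 * 2/3 * 1 = 4/9, while the second gets 0 because "hava sabah" never occurs. That zero is why practical n-gram models usually add some form of smoothing.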

<br>
<br>
<br>

## Simple Sentence Generator using bigrams

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import model_from_json
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
import pickle, re
import operator

In [2]:
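# count bigrams: probabilities[w1][w2] = number of times w2 directly follows w1
# (these are raw counts, not normalized probabilities)
probabilities = dict()

seeds = ["sən","mən", "avarə", "camahat", "pulsuz", "kasıb", "oxumaq", "fəhlə", "kişi", "qadın", "fələk", "ey", "heyhat"]


with open("sabir.txt", "r", encoding="utf8") as fr:
    for line in fr:
        # map the dotted capital "İ" to "i" before lower(), which would otherwise
        # produce "i" plus a combining dot; then split the line into word tokens
        words = re.findall(r"\w+", line.strip().replace("İ", "i").lower())
        # pair every word with its successor
        for word, next_word in zip(words, words[1:]):
            followers = probabilities.setdefault(word, dict())
            followers[next_word] = followers.get(next_word, 0) + 1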
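

The `probabilities` dictionary built above stores raw follower counts. If actual conditional probabilities P(next | word) are wanted, each inner dictionary can be divided by its total. The helper name `normalize_counts` and the `example_counts` data are hypothetical, introduced only for this sketch.

```python
def normalize_counts(follower_counts):
    """Turn {word: {next_word: count}} into {word: {next_word: probability}}."""
    normalized = {}
    for word, followers in follower_counts.items():
        total = sum(followers.values())
        normalized[word] = {nxt: count / total for nxt, count in followers.items()}
    return normalized


# Hypothetical counts in the same shape as the `probabilities` dict
example_counts = {"hava": {"soyuq": 3, "isti": 1}}
example_probs = normalize_counts(example_counts)
print(example_probs)  # {'hava': {'soyuq': 0.75, 'isti': 0.25}}
```

Greedy generation does not need this step, since the follower with the highest count is also the follower with the highest probability.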

In [3]:
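def generate_with_bigram(seed_words, num_words):
    """Greedily extend each seed with its most frequent follower, up to num_words times."""
    texts = []
    for seed in seed_words:
        text = seed
        current = seed
        for _ in range(num_words):
            # stop early if the current word was never seen with a follower
            if current not in probabilities:
                break
            # pick the follower with the highest bigram count
            pr_word = max(probabilities[current].items(), key=operator.itemgetter(1))[0]
            text += " " + pr_word
            current = pr_word

        texts.append(text)

    return texts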

In [4]:
generate_with_bigram(seeds, 5)

['sən də özün daxili insan edir',
 'mən kimi bir də özün daxili',
 'avarə səbr eylə qeybətdə rübərüdə',
 'camahat',
 'pulsuz kişi insanlığı asanmı sanırsan sənin',
 'kasıb',
 'oxumaq suglu kitab açdır fala bax',
 'fəhlə də özün daxili insan edir',
 'kişi insanlığı asanmı sanırsan sənin ancaq',
 'qadın',
 'fələk tərsinə dövran edir imdi duayə',
 'ey əmu',
 'heyhat ki razi pünhan xahəd şod']
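
Because the generator always takes the most frequent follower, different seeds quickly fall into the same continuation (several outputs above share "də özün daxili insan edir"). One common alternative is to sample the next word in proportion to its count. The sketch below assumes the same `{word: {next_word: count}}` shape as `probabilities`; the function name `sample_with_bigram` and the `example_counts` data are hypothetical.

```python
import random


def sample_with_bigram(follower_counts, seed, num_words, rng=None):
    """Extend seed by sampling each next word proportionally to its bigram count."""
    rng = rng or random.Random()
    text = seed
    current = seed
    for _ in range(num_words):
        followers = follower_counts.get(current)
        if not followers:
            break  # no known follower for the current word
        candidates = list(followers)
        weights = [followers[word] for word in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        text += " " + current
    return text


# Hypothetical counts, invented for illustration
example_counts = {"sən": {"də": 2, "nə": 1}, "də": {"özün": 1}}
sampled = sample_with_bigram(example_counts, "sən", 3, random.Random(0))
print(sampled)
```

Passing a seeded `random.Random` keeps runs reproducible; omitting it gives a different sentence on each call.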

<br>
<br>
<br>

## Neural Network Language Model


In [5]:
sentences = []

with open("sabir.txt", "r", encoding="utf8") as fr:
    for line in fr:
        line = line.strip().replace("İ", "i").lower()
        sentences.append(line)

In [6]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

# reverse lookup: token id -> word
index_to_word = {id_:word for word, id_ in tokenizer.word_index.items()}

# persist the fitted tokenizer so it can be reloaded together with the model
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=3)

# word ids start at 1 and id 0 is reserved for padding, hence the +1
total_words = len(tokenizer.word_index) + 1

In [7]:
# build the training examples: every prefix of length >= 2 of each line becomes
# one sequence, whose last token is the word to predict

input_sequences = []

for line in sentences:
    token_list = tokenizer.texts_to_sequences([line])[0]  # convert the words of the line to ids
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
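

To see what the loop above produces, here is the same prefix expansion applied to a single made-up line. The ids in `example_ids` are hypothetical and `prefix_sequences` is a helper name introduced for this sketch.

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2, mirroring the loop in the cell above."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]


example_ids = [4, 7, 2, 9]  # hypothetical ids for a four-word line
example_prefixes = prefix_sequences(example_ids)
print(example_prefixes)  # [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
```

A line of k words therefore yields k-1 training sequences, and a one-word line yields none.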

In [8]:
# length of the longest training sequence; all others are padded up to it
MAX_LEN = max(len(seq) for seq in input_sequences)

input_sequences = sequence.pad_sequences(input_sequences,
                                         maxlen=MAX_LEN,
                                         padding="pre")
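

The effect of `padding="pre"` can be illustrated without TensorFlow. The function `pad_pre` below is a hypothetical stand-in that only covers the case used in this notebook, where no sequence is longer than the target length; it does not reproduce the truncation behaviour of `pad_sequences`.

```python
def pad_pre(sequences, max_len, pad_id=0):
    """Left-pad each sequence with pad_id so that all have length max_len.

    Assumes no sequence is longer than max_len (no truncation is performed).
    """
    return [[pad_id] * (max_len - len(seq)) + list(seq) for seq in sequences]


example_prefixes = [[4, 7], [4, 7, 2], [4, 7, 2, 9]]  # hypothetical ids
example_padded = pad_pre(example_prefixes, max_len=4)
print(example_padded)  # [[0, 0, 4, 7], [0, 4, 7, 2], [4, 7, 2, 9]]
```

Padding on the left keeps the word to predict in the last column of every row, which is what makes the simple column slicing in the next cell possible.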

In [9]:
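# all tokens except the last are the model input; the last token is the target word
xs = input_sequences[:, :-1]
labels = input_sequences[:, -1]

# one-hot encode the targets to match the categorical cross-entropy loss
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)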
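

As a small illustration of the split and the one-hot encoding above, the following pure-Python sketch applies the same two steps to three padded rows. The ids, the vocabulary size of 10, and the names `split_inputs_and_labels` and `one_hot` are assumptions made for this example.

```python
def split_inputs_and_labels(padded):
    """Split each padded row into its input prefix and its final target id."""
    inputs = [row[:-1] for row in padded]
    targets = [row[-1] for row in padded]
    return inputs, targets


def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


toy_padded = [[0, 0, 4, 7], [0, 4, 7, 2], [4, 7, 2, 9]]  # hypothetical ids
toy_xs, toy_labels = split_inputs_and_labels(toy_padded)
toy_ys = [one_hot(label, num_classes=10) for label in toy_labels]
print(toy_xs)      # [[0, 0, 4], [0, 4, 7], [4, 7, 2]]
print(toy_labels)  # [7, 2, 9]
```

Each one-hot row has as many entries as there are words in the vocabulary, so for a large corpus the `ys` array can become very large.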

In [10]:
model = Sequential()
model.add(Embedding(total_words, 100, input_length=MAX_LEN-1))
model.add(Bidirectional(LSTM(150)))

model.add(Dense(total_words, activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(xs, ys, epochs=1, verbose=1)  # increase the number of epochs to train for longer

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


<tensorflow.python.keras.callbacks.History at 0x7fe92ac301d0>

In [11]:
#To save the trained model

model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights("model.h5")

In [12]:
#to use your model to generate sentences

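def generate_with_lstm(tokenizer, seeds, num_words, max_len):
    generated_texts = []

    for seed in seeds:
        text = seed
        for _ in range(num_words):
            # encode the text so far and left-pad it to the model's input length
            token_list = tokenizer.texts_to_sequences([text])[0]
            token_list = sequence.pad_sequences([token_list], maxlen=max_len-1, padding="pre")

            # greedy decoding: take the id with the highest softmax probability
            # (np.argmax over predict() also works where predict_classes is unavailable)
            predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)
            output_word = index_to_word.get(int(predicted[0]))
            # id 0 is the padding id and maps to no word; stop if the model predicts it
            if output_word is None:
                break

            # append the prediction unless it repeats the last word
            if output_word != text.split(" ")[-1]:
                text += " " + output_word
        generated_texts.append(text)

    return generated_texts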

In [13]:
generate_with_lstm(tokenizer, seeds, 5, MAX_LEN)

['sən nə',
 'mən bir',
 'avarə bir',
 'camahat bir',
 'pulsuz bir',
 'kasıb bir',
 'oxumaq bir',
 'fəhlə nə',
 'kişi nə',
 'qadın bir',
 'fələk nə',
 'ey bir',
 'heyhat bir']
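
One plausible reading of the two-word outputs above is that, after a single epoch, the model keeps predicting the same frequent word, and the `output_word != ...` check in `generate_with_lstm` then discards every repeat. This is an assumption about the trained model's behaviour, not something the notebook confirms. The sketch below reproduces only the append rule, with the network replaced by a hypothetical stub.

```python
def generate_with_stub(predict_next, seed, num_words):
    """Apply the append rule of generate_with_lstm, with the model replaced by a callable."""
    text = seed
    for _ in range(num_words):
        output_word = predict_next(text)
        # a prediction equal to the last word is dropped, so the text stops growing
        if output_word != text.split(" ")[-1]:
            text += " " + output_word
    return text


def always_bir(text):
    """Hypothetical stand-in for an undertrained model: always predicts 'bir'."""
    return "bir"


stalled = generate_with_stub(always_bir, "mən", 5)
print(stalled)  # mən bir
```

If that reading holds, training for more epochs, as the comment on `model.fit` suggests, would be the first thing to try.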

In [16]:
# load a previously saved tokenizer and model so they can be reused
def load_model(tokenizer_path, model_js, model_w):
    with open(tokenizer_path, "rb") as handle:
        tokenizer = pickle.load(handle)

    # rebuild the architecture from JSON, then restore the trained weights
    with open(model_js, "r") as json_file:
        model = model_from_json(json_file.read())
    model.load_weights(model_w)

    index_to_word = {id_:word for word, id_ in tokenizer.word_index.items()}

    return tokenizer, model, index_to_word

tokenizer, model, index_to_word = load_model("tokenizer.pickle", "model.json", "model.h5")
generate_with_lstm(tokenizer, seeds, 5, MAX_LEN)

['sən nə',
 'mən bir',
 'avarə bir',
 'camahat bir',
 'pulsuz bir',
 'kasıb bir',
 'oxumaq bir',
 'fəhlə nə',
 'kişi nə',
 'qadın bir',
 'fələk nə',
 'ey bir',
 'heyhat bir']