# Natural Language Processing

### Table of contents
01. [README](#README)
02. [Installing Addons](#addons)
03. [Importing Libraries](#libraries)
04. [Importing Text File](#textfile)
05. [Preprocessing](#preprocessing)
    * [Converting Text Into Lower Case](#lowercase)
    * [Cleaning The Text](#textcleaning)
    * [Creating A Sequence Of The Token](#token)
    * [Creating A New File Of Sequences](#sequence)
    * [Loading The New File](#newfile)
    * [Encoding](#encode)
    * [Input & Output Separation](#seperation)

06. [Model](#model)
    * [Building The Model](#training)
    * [Model Summary](#summary)
    * [Compiling The Model](#modelcompiling)
    * [Training The Model](#modeltraining)
    * [Saving Model & Tokenizers](#save)

07. [Prediction](#prediction)
    * [Generating Text Sequence From The Saved Model](#generation)
    * [Loading Text File](#loading)
    * [Loading The Saved Model & Tokens](#loading)
    * [Selecting Seed Text](#selection)
    * [Text Prediction](#results)

### README

**Points to Follow** <br>

* You will need `Jupyter Notebook` or `Google Colab` to run the code.

* `Link = "/content/File.txt"`: replace this path with the location of the raw text file you want to use.

* `Out_File = 'Sequences.txt'`: set this to the path where the file of token sequences should be saved.

* `F_Name = 'Sequences.txt'`: set this to the path where the sequences file was saved, that is, the same path as `Out_File`.

* `Model.save('Model.h5')` and `dump(tokenizer, open('Tokenizer.pkl', 'wb'))`: set these to the paths where the trained model and the tokenizer should be saved.

* `textfile = 'Sequences.txt'`: use the same path as `F_Name`, where 'Sequences.txt' is stored.

* `saved_model = load_model('/content/Model.h5')` and `tokenizer = load(open('/content/Tokenizer.pkl', 'rb'))`: use the same paths, with the same capitalization, under which the model and the tokenizer were saved.

After making these changes you will be able to run the code.
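
One way to keep these paths consistent is to define them once and reuse them. The following is a minimal sketch and not part of the original notebook: `BASE_DIR` and `PATHS` are hypothetical names introduced here for illustration, and the file names are the ones listed above.

```python
from pathlib import Path

# Hypothetical base folder; change this to wherever your files live
BASE_DIR = Path("/content")

# Hypothetical lookup table for every path the notebook uses
PATHS = {
    "raw_text": BASE_DIR / "File.txt",        # input for Loading()
    "sequences": BASE_DIR / "Sequences.txt",  # Out_File, F_Name and textfile
    "model": BASE_DIR / "Model.h5",           # Model.save() and load_model()
    "tokenizer": BASE_DIR / "Tokenizer.pkl",  # dump() and load()
}

for name, path in PATHS.items():
    print(f"{name}: {path}")
```

With a table like this, only `BASE_DIR` has to change when the files move to another folder.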

### Installing Addons

In [None]:
  pip install tensorflow-addons



### Importing Libraries

In [None]:
# Basics
import string
import random
import re
import requests
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow.keras as keras
import tensorflow.keras.layers as layers

# Model & Pre-Processing
from numpy import array
from pickle import dump, load
from random import randint
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, LSTM, Embedding
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import get_file, to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

### Importing Text File

In [None]:
# Loading Text File Using 'Loading()' Function
def Loading(Link):
    with open(Link, 'r') as File: # Opening (Closed Automatically on Exit)
        Text = File.read() # Reading the File into 'Text'
    return Text # Returning 'Text'

Link = "/content/File.txt" # Path for the Raw Text
Raw = Loading(Link) # Giving Path in the 'Loading' Function
print(Raw[:200])

The Project Gutenberg EBook of Poirot Investigates, by Agatha Christie

This eBook is for the use of anyone anywhere in the United States and most
other parts of the world at no cost and with almost n


### Pre-Processing

###### Converting Text Into Lower Case

In [None]:
Processed = Raw.lower() # Converting into Lower
print(Processed[:200]) # Printing Results

the project gutenberg ebook of poirot investigates, by agatha christie

this ebook is for the use of anyone anywhere in the united states and most
other parts of the world at no cost and with almost n


###### Cleaning The Text

In [None]:
Processed = Processed.replace('--', ' ') # Replacing '--' with Space

Tokens_Proce = Processed.split() # Splitting Tokens by White Space

# Removing Punctuation
Punc_Table = str.maketrans('', '', string.punctuation)
Tokens_Proce = [w.translate(Punc_Table) for w in Tokens_Proce]

Tokens_Proce = [word for word in Tokens_Proce if word.isalpha()] # Removing Remaining Tokens that are not Alphabetic

print(Tokens_Proce) # Printing the Results
print('Total Tokens:', len(Tokens_Proce)) # Printing Total Tokens Number
print('Unique Tokens:', len(set(Tokens_Proce))) # Printing Unique Tokens Number

Total Tokens: 51057
Unique Tokens: 5916
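
To see what each cleaning step does, the same steps can be run on a single sentence. This is a small sketch for illustration: `clean_text` and the sample sentence are hypothetical and do not appear in the notebook, but the operations are the ones used above.

```python
import string

def clean_text(text):
    """Lowercase, split on whitespace, strip punctuation, keep alphabetic tokens."""
    text = text.lower().replace('--', ' ')
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in text.split()]
    return [w for w in tokens if w.isalpha()]

sample = "Poirot smiled--a little. It's 9 o'clock, mon ami!"
print(clean_text(sample))
# ['poirot', 'smiled', 'a', 'little', 'its', 'oclock', 'mon', 'ami']
```

Note that stripping punctuation also removes apostrophes, so "it's" becomes "its", and that the purely numeric token "9" is dropped by the `isalpha()` filter.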


###### Creating A Sequence Of The Token

In [None]:
Length = 50 + 1 # 50 Input Words + 1 Output Word
Sequences = list() # Initializing Sequence List

# Using 'FOR' Loop to Create the Sequences
for i in range(Length, len(Tokens_Proce)):
  Seq = Tokens_Proce[i-Length:i] # Selecting a Window of 'Length' Tokens
  Line = ' '.join(Seq) # Converting into Line
  Sequences.append(Line) # Storing

print(Sequences[:200]) # Printing Results

['the project gutenberg ebook of poirot investigates by agatha christie this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever you may copy it give it away or reuse it under the', 'project gutenberg ebook of poirot investigates by agatha christie this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever you may copy it give it away or reuse it under the terms', 'gutenberg ebook of poirot investigates by agatha christie this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever you may copy it give it away or reuse it under the terms of', 'ebook of poirot investigates by agatha christie this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and wi
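
The sliding window is easier to follow on a short token list. The sketch below uses a hypothetical helper, `make_sequences`, with the same loop bounds as the cell above and a window of 3 input words plus 1 output word instead of 50 plus 1.

```python
def make_sequences(tokens, length):
    """Return every window of `length` consecutive tokens, joined into a line."""
    return [' '.join(tokens[i - length:i]) for i in range(length, len(tokens))]

tokens = ['the', 'project', 'gutenberg', 'ebook', 'of', 'poirot']
for line in make_sequences(tokens, 3 + 1):  # 3 input words + 1 output word
    print(line)
# the project gutenberg ebook
# project gutenberg ebook of
```

Each line shifts the window by one token. With these loop bounds the window that ends on the very last token is not emitted.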

###### Creating A New File Of Sequences

In [None]:
# Saving Tokens
def Save_Token(Lines, File):
    Data = '\n'.join(Lines) # Joining the Sequences, One per Line
    with open(File, 'w') as file: # Opening in Write Mode (Closed Automatically on Exit)
        file.write(Data) # Writing

Out_File = 'Sequences.txt' # Saved Tokens File Name
Save_Token(Sequences, Out_File) # Parameter for 'Save_Token' Function

###### Loading The New File

In [None]:
# Loading Saved Tokens
def Load(File_Name):
    with open(File_Name, 'r') as F: # Opening File in Read Mode (Closed Automatically on Exit)
        TXT = F.read() # Reading
    return TXT # Returning the File Contents

F_Name = 'Sequences.txt' # Filename for the Function
Document = Load(F_Name) # Loading the File Contents
Lines = Document.split('\n') # Splitting the Contents into Lines

###### Encoding

In [None]:
tokenizer = Tokenizer() # Creating an Instance
tokenizer.fit_on_texts(Lines) # Fitting: Building the Word-to-Integer Mapping
seq = tokenizer.texts_to_sequences(Lines) # Encoding Each Line as a List of Integers
Vocab = len(tokenizer.word_index) + 1 # Vocabulary Size
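
The idea behind this encoding can be sketched in plain Python. This is not the Keras implementation: `build_word_index` and `encode` are hypothetical helpers, and the sketch assumes that indices start at 1 and follow descending word frequency, which is how the Keras `Tokenizer` is understood to number words. It also shows why the vocabulary size is the number of words plus one.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    # most_common() orders by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(lines, word_index):
    """Replace every word with its integer index."""
    return [[word_index[w] for w in line.split()] for line in lines]

lines = ['the cat sat', 'the cat ran', 'the dog ran']
word_index = build_word_index(lines)
vocab = len(word_index) + 1  # +1 because index 0 is never assigned to a word

print(word_index)                 # {'the': 1, 'cat': 2, 'ran': 3, 'sat': 4, 'dog': 5}
print(encode(lines, word_index))  # [[1, 2, 4], [1, 2, 3], [1, 5, 3]]
print('Vocabulary size:', vocab)  # Vocabulary size: 6
```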

###### Input & Output Separation

In [None]:
Seque = array(seq) # Creating an Array
X, y = Seque[:, :-1], Seque[:, -1] # Inputs: All but the Last Word, Output: the Last Word
y = to_categorical(y, num_classes = Vocab) # One Hot Encoding
Seq_Len = X.shape[1] # Specifying the Length of the Input
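
The split and the one-hot encoding can be illustrated without NumPy or Keras. In this sketch `split_inputs_outputs` is a hypothetical helper and the sequences are toy data; it only mirrors what the slicing and `to_categorical` do above.

```python
def split_inputs_outputs(sequences, vocab):
    """Split each sequence into inputs (all but last) and a one-hot target (last)."""
    X = [seq[:-1] for seq in sequences]
    y = []
    for seq in sequences:
        one_hot = [0] * vocab
        one_hot[seq[-1]] = 1  # mark the index of the word to predict
        y.append(one_hot)
    return X, y

sequences = [[1, 2, 4], [1, 2, 3], [1, 5, 3]]
X, y = split_inputs_outputs(sequences, vocab=6)
print(X)  # [[1, 2], [1, 2], [1, 5]]
print(y)  # [[0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 1, 0, 0]]
```

Each target row has exactly one 1, at the position of the word that follows the input words.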

### Model

###### Building The Model

In [None]:
Model = Sequential() # Initializing the Model
Model.add(Embedding(Vocab, 50, input_length=Seq_Len)) # Embedding Layer
Model.add(LSTM(255, return_sequences = True)) # LSTM Layer
Model.add(LSTM(200)) # Second LSTM Layer
Model.add(Dense(100, activation = 'relu')) # Dense Layer
Model.add(Dense(Vocab, activation = 'softmax')) # Output Layer

###### Model Summary

In [None]:
Model.summary() 

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 50)            295850    
                                                                 
 lstm (LSTM)                 (None, 50, 255)           312120    
                                                                 
 lstm_1 (LSTM)               (None, 200)               364800    
                                                                 
 dense (Dense)               (None, 100)               20100     
                                                                 
 dense_1 (Dense)             (None, 5917)              597617    
                                                                 
Total params: 1,590,487
Trainable params: 1,590,487
Non-trainable params: 0
_________________________________________________________________
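
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are hypothetical, and the layer sizes are taken from the model defined above with a vocabulary of 5917 (5916 unique tokens plus one).

```python
def embedding_params(vocab, dim):
    # One vector of size `dim` per vocabulary index
    return vocab * dim

def lstm_params(inputs, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((inputs + units) * units + units)

def dense_params(inputs, units):
    # Weight matrix plus one bias per unit
    return inputs * units + units

vocab = 5917
counts = [
    embedding_params(vocab, 50),  # embedding
    lstm_params(50, 255),         # lstm
    lstm_params(255, 200),        # lstm_1
    dense_params(200, 100),       # dense
    dense_params(100, vocab),     # dense_1
]
print(counts)                      # [295850, 312120, 364800, 20100, 597617]
print('Total params:', sum(counts))  # Total params: 1590487
```

The output layer is the largest dense layer because it has one unit per vocabulary entry.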


###### Compiling The Model

In [None]:
Model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

###### Training The Model

In [None]:
Model.fit(X, y, batch_size = 128, epochs = 255)

###### Saving Model & Tokenizers

In [None]:
Model.save('Model.h5') # Saving the Model
with open('Tokenizer.pkl', 'wb') as F: # Opening in Binary Write Mode (Closed Automatically on Exit)
    dump(tokenizer, F) # Saving Tokenizer
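
Saving and reloading an object with `pickle` can be tried on its own. This sketch uses a plain dictionary as a stand-in for the fitted tokenizer and writes to a temporary folder, both of which are assumptions made for illustration only.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted tokenizer
word_index = {'the': 1, 'cat': 2, 'ran': 3}

path = os.path.join(tempfile.mkdtemp(), 'Tokenizer.pkl')

with open(path, 'wb') as f:  # the file is closed when the block exits
    pickle.dump(word_index, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == word_index)  # True
```

The file name used when loading has to match the one used when saving exactly, including capitalization on case-sensitive file systems.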

### Prediction

###### Generating Text Sequence From The Saved Model

In [None]:
# Loading the Document
def load_doc(textfile):
  F = open(textfile, 'r') # Loading File in Read Mode
  txt = F.read() # Reading the File
  F.close() # Closing the File
  return txt


def model_sequence(saved_model, tokenizer, sequence_length, seed_TXT, n):
    res = list() # Initializing a List to Store the Results
    in_text = seed_TXT # Starting from the Seed Text

    # Using 'FOR' Loop for Generating a Fixed Number of Words
    for _ in range(n):
        Encod = tokenizer.texts_to_sequences([in_text])[0] # Encoding the Text as Integers
        Encod = pad_sequences([Encod], maxlen=sequence_length, truncating='pre') # Truncating & Padding to a Fixed Length
        y = np.argmax(saved_model.predict(Encod, verbose=0), axis=-1)[0] # Index of the Most Probable Word

        Out = '' # Predicted Word

        # Using 'FOR' Loop for Decoding the Index into a Word
        for word, index in tokenizer.word_index.items():
            if index == y:
                Out = word
                break

        in_text += ' ' + Out # Appending the Predicted Word to the Input Text
        res.append(Out) # Appending Results

    return ' '.join(res)
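
Two steps of the generation loop can be shown in isolation: fixing the input length and turning the most probable index back into a word. The sketch below is plain Python with hypothetical helpers (`pad_left`, `argmax`) and made-up probabilities; `pad_left` imitates truncating and padding at the front, as `pad_sequences` is used above. A reverse dictionary replaces the linear search over `word_index`.

```python
def pad_left(encoded, length):
    """Keep the last `length` ids and left-pad with zeros."""
    encoded = encoded[-length:]
    return [0] * (length - len(encoded)) + encoded

def argmax(values):
    """Return the position of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

word_index = {'the': 1, 'cat': 2, 'ran': 3}
index_word = {i: w for w, i in word_index.items()}  # reverse lookup

print(pad_left([3, 1, 2, 3], 3))  # [1, 2, 3]
print(pad_left([2], 3))           # [0, 0, 2]

probabilities = [0.05, 0.10, 0.15, 0.70]  # made-up model output, one value per index
print(index_word[argmax(probabilities)])  # ran
```

Building the reverse dictionary once avoids scanning the whole vocabulary for every generated word.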

###### Loading Text File

In [None]:
textfile = 'Sequences.txt' # File Name
Doc = load_doc(textfile) # Calling Function
Lines = Doc.split('\n') # Splitting Lines
sequence_length = len(Lines[0].split()) - 1 # Input Length: Words per Line Minus the Output Word

###### Loading The Saved Model & Tokens

In [None]:
saved_model = load_model('/content/Model.h5') # Loading the Saved Model
tokenizer = load(open('/content/Tokenizer.pkl', 'rb')) # Loading the Saved Tokenizer (Same File Name as Saved)


###### Selecting Seed Text

In [None]:
seed_TXT = Lines[randint(0, len(Lines) - 1)] # Selecting Seed Text ('randint' Includes Both Ends)
print(seed_TXT + '\n') # Printing Results

but the only way on the way over he had conferred with norman in a low voice and the latter had despatched a sheaf of telegrams from dover owing to the special passes held by norman we got through everywhere in record time in london a large police car was waiting
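
Because `randint` includes both of its bounds, the upper bound for a list index has to be the length minus one. The sketch below shows two safe ways to pick a random line; the `lines` list and the fixed random seed are assumptions made only so the example is small and repeatable.

```python
import random

lines = ['first line', 'second line', 'third line']

random.seed(0)  # fixed seed so the sketch is repeatable

seed_text = random.choice(lines)           # picks an element directly, never out of range
index = random.randint(0, len(lines) - 1)  # randint includes both ends

print(seed_text)
print(lines[index])
```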



###### Text Prediction

In [None]:
Prediction = model_sequence(saved_model, tokenizer, sequence_length, seed_TXT, 50) # Calling Function
print(Prediction) # Generating Text

for us with some plainclothes men one of whom handed a typewritten sheet of paper to my friend he answered my inquiring glance list of the cottage hospitals within a certain radius west of london i wired for it from we were whirled rapidly through the london streets we were

