**Question** : Load, preprocess, and split the part-of-speech (POS) tagged corpora from NLTK. Print the features and labels of the training, validation, and testing sets.



**Description** :

Load the POS-tagged corpora from NLTK: brown, treebank, and conll2000. Import these three corpus readers from nltk.corpus.

Divide the data into words ( X ) and tags ( Y ): start with two empty lists, X and Y, and append each sentence's words and tags to them.

Convert the text to integer sequences by using texts_to_sequences

Pad short sentences and truncate long ones to a fixed length of 100

Convert the tag classes to one-hot (binary) vectors by using to_categorical

Split the data into training and testing sets ( test set as 0.15 )

Split the training data into training and validation sets ( validation set as 0.15 of the training data )
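
The preprocessing steps above can be illustrated on a toy corpus without NLTK or Keras. The sketch below is only an approximation of what the solution does: `toy_tagged`, `build_index`, `encode`, `pad_pre_truncate_post`, and `one_hot` are hypothetical names introduced for illustration, and the indices are assigned in order of first appearance, whereas the Keras `Tokenizer` assigns them by frequency. Index 0 is kept free for padding, which is why the one-hot vectors have one more position than there are tags.

```python
# Toy stand-in for NLTK's tagged sentences: lists of (word, tag) pairs.
toy_tagged = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("Cats", "NOUN"), ("sleep", "VERB")],
]

# Separate words (X) and tags (Y), one list per sentence.
X_toy = [[word for word, _ in sent] for sent in toy_tagged]
Y_toy = [[tag for _, tag in sent] for sent in toy_tagged]


def build_index(sequences):
    """Map each distinct lower-cased token to an integer starting at 1 (0 is left for padding)."""
    index = {}
    for seq in sequences:
        for token in seq:
            index.setdefault(token.lower(), len(index) + 1)
    return index


def encode(sequences, index):
    """Replace every token by its integer index."""
    return [[index[token.lower()] for token in seq] for seq in sequences]


def pad_pre_truncate_post(seq, maxlen):
    """Keep at most the first maxlen items, then left-pad with zeros."""
    seq = seq[:maxlen]
    return [0] * (maxlen - len(seq)) + seq


def one_hot(labels, num_classes):
    """Turn each integer label into a 0/1 vector of length num_classes."""
    return [[1 if i == label else 0 for i in range(num_classes)] for label in labels]


word_index = build_index(X_toy)
tag_index = build_index(Y_toy)

MAX_LEN = 4  # the exercise uses 100; 4 keeps the toy output readable
X_ids = [pad_pre_truncate_post(s, MAX_LEN) for s in encode(X_toy, word_index)]
Y_ids = [pad_pre_truncate_post(s, MAX_LEN) for s in encode(Y_toy, tag_index)]
Y_onehot = [one_hot(s, len(tag_index) + 1) for s in Y_ids]

print(X_ids)        # [[0, 1, 2, 3], [0, 0, 4, 5]]
print(Y_ids)        # [[0, 1, 2, 3], [0, 0, 2, 3]]
print(Y_onehot[1])  # [[1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

Each padded position maps to class 0, so any model trained on these labels also sees the padding class.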



**Level**: Medium


**Input Format** : 
POS tagged corpora from NLTK


**Output Format** : 
Features and labels


**Sample Input** : 
Load the dataset using the NLTK corpus readers brown, treebank, and conll2000


**Sample Output** : 
Features and labels for the training, validation, and testing sets

**Solution**

In [None]:
from nltk.corpus import brown
from nltk.corpus import treebank
from nltk.corpus import conll2000

# load POS tagged corpora from NLTK
treebank_corpus = treebank.tagged_sents(tagset='universal')
brown_corpus = brown.tagged_sents(tagset='universal')
conll_corpus = conll2000.tagged_sents(tagset='universal')
tagged_sentences = treebank_corpus + brown_corpus + conll_corpus

# let's look at the data
tagged_sentences[2]

X = [] # store input sequence
Y = [] # store output sequence

for sentence in tagged_sentences:
    X_sentence = []
    Y_sentence = []
    for word, tag in sentence:  # each entity is a (word, tag) pair
        X_sentence.append(word)
        Y_sentence.append(tag)

    X.append(X_sentence)
    Y.append(Y_sentence)

X[1]

num_words = len({word.lower() for sentence in X for word in sentence})
num_tags = len({tag.lower() for sentence in Y for tag in sentence})
print("Total number of tagged sentences: {}".format(len(X)))
print("Vocabulary size: {}".format(num_words))
print("Total number of tags: {}".format(num_tags))

print('sample X: ', X[0], '\n')
print('sample Y: ', Y[0], '\n')


print("Length of first input sequence  : {}".format(len(X[0])))
print("Length of first output sequence : {}".format(len(Y[0])))

X[1]
from keras.preprocessing.text import Tokenizer

# encode X

word_tokenizer = Tokenizer()                      # instantiate the tokenizer
word_tokenizer.fit_on_texts(X)                    # build the word index from the data
X_encoded = word_tokenizer.texts_to_sequences(X)  

# encode Y

tag_tokenizer = Tokenizer()
tag_tokenizer.fit_on_texts(Y)
Y_encoded = tag_tokenizer.texts_to_sequences(Y)

# look at first encoded data point

print("** Raw data point **", "\n", "-"*100, "\n")
print('X: ', X[0], '\n')
print('Y: ', Y[0], '\n')
print()
print("** Encoded data point **", "\n", "-"*100, "\n")
print('X: ', X_encoded[0], '\n')
print('Y: ', Y_encoded[0], '\n')

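# sanity check: every sentence must have as many tags as words
# (x_seq / y_seq avoid shadowing the built-in input())
different_length = [1 if len(x_seq) != len(y_seq) else 0 for x_seq, y_seq in zip(X_encoded, Y_encoded)]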
print("{} sentences have disparate input-output lengths.".format(sum(different_length)))

lengths = [len(seq) for seq in X_encoded]
print("Length of longest sentence: {}".format(max(lengths)))

import seaborn as sns
from matplotlib import pyplot as plt

sns.boxplot(x=lengths)  # pass the data by keyword; positional use is ambiguous across seaborn versions
plt.show()

from keras.preprocessing.sequence import pad_sequences

MAX_SEQ_LENGTH = 100  
X_padded = pad_sequences(X_encoded, maxlen=MAX_SEQ_LENGTH, padding="pre", truncating="post")
Y_padded = pad_sequences(Y_encoded, maxlen=MAX_SEQ_LENGTH, padding="pre", truncating="post")

# print the first sequence
print(X_padded[0], "\n"*3)
print(Y_padded[0])

# assign padded sequences to X and Y
X, Y = X_padded, Y_padded

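from keras.utils import to_categorical  # keras.utils.np_utils is not available in recent Keras releases

# one-hot encode the tag indices; index 0 is the padding class
Y = to_categorical(Y)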
Y.shape

X
X.shape

from sklearn.model_selection import train_test_split


TEST_SIZE = 0.15
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=TEST_SIZE, random_state=4)
X_train.shape


VALID_SIZE = 0.15
X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, Y_train, test_size=VALID_SIZE, random_state=4)

# print number of samples in each set
print("TRAINING DATA")
print('Shape of input sequences: {}'.format(X_train.shape))
print('Shape of output sequences: {}'.format(Y_train.shape))
print("-"*50)
print("VALIDATION DATA")
print('Shape of input sequences: {}'.format(X_validation.shape))
print('Shape of output sequences: {}'.format(Y_validation.shape))
print("-"*50)
print("TESTING DATA")
print('Shape of input sequences: {}'.format(X_test.shape))
print('Shape of output sequences: {}'.format(Y_test.shape))
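
Because the validation set is carved out of the training portion that remains after the test split, the two 0.15 values do not yield equal-sized validation and test sets. The short calculation below is a sketch of the resulting overall proportions. It assumes only the two constants used above, and the exact sample counts reported by `train_test_split` may differ by one because of rounding.

```python
TEST_SIZE = 0.15
VALID_SIZE = 0.15

# Fractions of the full dataset that end up in each split.
test_frac = TEST_SIZE
valid_frac = (1 - TEST_SIZE) * VALID_SIZE        # 15% of the remaining 85%
train_frac = (1 - TEST_SIZE) * (1 - VALID_SIZE)  # what is left for training

print(f"train={train_frac:.4f} valid={valid_frac:.4f} test={test_frac:.4f}")
# train=0.7225 valid=0.1275 test=0.1500
```

So roughly 72% of the sentences are used for training, about 13% for validation, and 15% for testing.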
