## Text Data

Deep learning has taken the field of natural language processing (NLP) by storm, particularly by using models that repeatedly consume a combination of new input and
 previous model output. These models are called recurrent neural networks, and they’ve
 been applied with great success to text categorization, text generation, and automated
 translation systems.

Previous NLP workloads were characterized by sophisticated multistage pipelines that included rules encoding the grammar of a language.

The state-of-the-art work trains networks end to end on large corpora starting from
 scratch, letting those rules emerge from data. For the past several years, the most-used
 automated translation systems available as services on the internet have been based on
 deep learning.

Networks operate on text at two levels: at character level, by processing one character at a time, and at word level, in which individual words are the finest-grained entities seen by the network.
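
To make the two levels concrete, here is a minimal sketch of how the same string breaks into tokens at each level. The sample sentence and the variable names are hypothetical and used only for illustration.

```python
# Hypothetical sample sentence, used only for illustration.
sample = "Impossible, Mr. Bennet"

# Character level: every character, including spaces and punctuation, is a token.
char_tokens = list(sample)

# Word level: whitespace-separated words are the finest-grained units.
word_tokens = sample.split()

print(len(char_tokens), char_tokens[:5])
print(len(word_tokens), word_tokens)
```

The character-level sequence is much longer but draws on a small, fixed set of symbols; the word-level sequence is shorter but needs a vocabulary.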

#### Note: 
The technique you use to encode text information into
 tensor form is the same whether you operate at character level or at word level. It’s one-hot encoding. 

1. Start with a character-level example. First, get some text to process. An amazing resource is Project Gutenberg, a volunteer effort that digitizes and archives cultural work and makes it available for free in open formats, including plain-text files.

#### Note:
If you’re aiming at larger-scale corpora, the Wikipedia corpus stands out: it’s the
 complete collection of Wikipedia articles containing 1.9 billion words and more
 than 4.4 million articles. You can find several other corpora at the English Corpora
 website.
 
 Project Gutenberg: http://www.gutenberg.org
 English Corpora: https://www.english-corpora.org

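> Load Jane Austen’s Pride and Prejudice from the Project Gutenberg website. Save
 the file and read it in, as shown in the following listing.
http://www.gutenberg.org/files/1342/1342-0.txt

Note that the cell below reads a different local text file, so the outputs in this section come from that file rather than from the novel.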

In [1]:
with open('C:/Users/Haier/Desktop/Sch_Apply.txt', encoding='utf8') as f:   
    text = f.read()

In [9]:
len(text)

8123

Encoding: ASCII encodes 128 characters
 using 128 integers (0 through 127). The letter a, for example, corresponds to binary 1100001 or decimal
 97; the letter b corresponds to binary 1100010 or decimal 98, and so on.
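
As a quick check of those values, the following sketch uses Python’s built-in ord and chr to move between characters and their integer codes, and format to show the 7-bit binary form.

```python
# Print each letter with its decimal code and its 7-bit binary representation.
for ch in "ab":
    code = ord(ch)                      # character -> integer code
    print(ch, code, format(code, "07b"))

# chr goes the other way: integer code -> character.
letter_from_code = chr(97)
print(letter_from_code)
```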

##### NOTE:
Clearly, 128 characters aren’t enough to account for all the glyphs,
 accents, ligatures, and other features that are needed to properly represent
 written text in languages other than English. To this end, other encodings
 have been developed, using a larger number of bits as a code for a wider
 range of characters. That wider range of characters was standardized as Unicode, which maps all known characters to numbers, with the representation in
 bits of those numbers being provided by a specific encoding. Popular encodings include UTF-8, UTF-16, and UTF-32, in which the numbers are a sequence
 of 8-, 16-, or 32-bit integers. Strings in Python 3.x are Unicode strings.
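
The difference between a Unicode code point and its encoded bytes can be seen with a small sketch. The sample character is an arbitrary choice, and the little-endian codec variants are used here only so that no byte-order mark is added to the output.

```python
s = "\u00e9"  # the single character 'e with acute accent', outside the ASCII range

code_point = ord(s)              # the Unicode number assigned to the character
utf8 = s.encode("utf-8")         # sequence of 8-bit units
utf16 = s.encode("utf-16-le")    # sequence of 16-bit units
utf32 = s.encode("utf-32-le")    # sequence of 32-bit units

print(code_point, len(utf8), len(utf16), len(utf32))
```

The same code point occupies a different number of bytes depending on the encoding, which is why a file must be opened with the right encoding argument.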

You need to parse the characters in the text and provide a one-hot encoding for each of them. Each character will be represented by a vector of length equal to the
 number of characters in the encoding. This vector will contain all zeros except for a 1 at
 the index corresponding to the location of the character in the encoding. 
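
Before encoding a whole line, it may help to see the idea for a single character. This is a minimal sketch assuming the 128-slot ASCII encoding described above; the variable names are illustrative.

```python
import torch

letter = "a"

# One slot per ASCII code, all zeros to start with.
one_hot = torch.zeros(128)

# Set a single 1 at the position given by the character's code (97 for 'a').
one_hot[ord(letter)] = 1

print(one_hot.shape, one_hot.argmax().item(), one_hot.sum().item())
```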

> First, split your text into a list of lines and pick an arbitrary line to focus on:


In [27]:
lines = text.split('\n') 
line = lines[2] 
line

'Instead of doing analysis, we make a pre-supposition that the particular method will not work well. We are more inclined to the hype created by NN. '

> Create a tensor that can hold the total number of one-hot encoded characters for the
 whole line:

In [29]:
import torch
letter_tensor = torch.zeros(len(line), 128)  # 128 is hardcoded due to the limits of ASCII
letter_tensor.shape

torch.Size([148, 128])

Note that letter_tensor holds a one-hot encoded character per row. Now set a 1 on
 each row in the right position so that each row represents the right character. The index
 where the 1 has to be set corresponds to the index of the character in the encoding:

In [33]:
for i, letter in enumerate(line.lower().strip()):
    # The text may use directional double quotes, which aren't valid ASCII,
    # so screen them out here by mapping any non-ASCII character to index 0.
    letter_index = ord(letter) if ord(letter) < 128 else 0
    letter_tensor[i][letter_index] = 1

# strip() can make the loop shorter than len(line); any leftover rows of
# letter_tensor stay all zeros.

In [51]:
# What the loop above does, shown for the first three characters
for i, l in enumerate(line.lower().strip()):
    if i < 3:
        print(i, l)
        print(ord(l))  # integer code of the character (its ASCII value in decimal)
    else:
        break

0 i
105
1 n
110
2 s
115
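
The character-by-character loop above can also be expressed without an explicit assignment loop. The sketch below is one possible alternative, not the approach used in this notebook: it relies on torch.nn.functional.one_hot, and the sample string and variable names are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

sample_line = "Impossible, Mr. Bennet"  # hypothetical stand-in for `line`

# Map each character to its ASCII code, using 0 for anything outside ASCII.
codes = torch.tensor(
    [ord(c) if ord(c) < 128 else 0 for c in sample_line.lower().strip()]
)

# F.one_hot returns an integer tensor of shape (num_chars, 128);
# cast to float to match the tensor built with torch.zeros.
letter_tensor_vec = F.one_hot(codes, num_classes=128).float()

print(letter_tensor_vec.shape)
```

Because the tensor is sized from the stripped string, every row holds exactly one 1.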


You’ve one-hot encoded your sentence into a representation that a neural network
 can digest. You could do word-level encoding the same way by establishing a vocabulary and one-hot encoding sentences (sequences of words) along the rows of your tensor.

##### Note:
Because a vocabulary contains many words, this method produces wide encoded
 vectors that may not be practical. Later in this chapter, you’ll see a more efficient way to
 represent text at word level by using embeddings. For now, stick with one-hot encodings to see what happens.
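
A rough back-of-the-envelope sketch shows why width becomes a problem. The sentence length and vocabulary sizes below are hypothetical, and the helper functions are illustrative only; they assume float32 one-hot tensors and int64 indexes.

```python
def one_hot_bytes(sentence_length, vocab_size, bytes_per_element=4):
    """Bytes for a float32 one-hot tensor of shape (sentence_length, vocab_size)."""
    return sentence_length * vocab_size * bytes_per_element


def index_bytes(sentence_length, bytes_per_element=8):
    """Bytes for the same sentence stored as one int64 index per word."""
    return sentence_length * bytes_per_element


for vocab_size in (1_000, 100_000):  # hypothetical vocabulary sizes
    print(vocab_size, one_hot_bytes(26, vocab_size), index_bytes(26))
```

The one-hot size grows linearly with the vocabulary, while the index representation depends only on the sentence length.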

 Define clean_words, which takes text and returns it lowercase and stripped of punctuation.

In [63]:
def clean_words(input_str): 
    punctuation = '.,;:"!?”“_-'   
    word_list = input_str.lower().replace('\n',' ').split()
    # strip leading and trailing punctuation from each word
    word_list = [word.strip(punctuation) for word in word_list]
    return word_list

In [59]:
word_l = 'abc "cde'.split()
word_l

['abc', '"cde']

In [55]:
type('abc cde'.split())

list

In [62]:
word_l[1].strip('"')

'cde'

In [69]:
set(word_l) # convert into a set (duplicates removed, order not guaranteed)

{'"cde', 'abc'}

In [70]:
sorted(set(word_l))

['"cde', 'abc']

In [67]:
# clean data 
words_in_line = clean_words(line)
line, words_in_line[0:4]

('Instead of doing analysis, we make a pre-supposition that the particular method will not work well. We are more inclined to the hype created by NN. ',
 ['instead', 'of', 'doing', 'analysis'])
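
One detail of clean_words worth checking is that str.strip removes the listed characters only from the two ends of each word. The sketch below uses hypothetical tokens to illustrate this; the final filtering step is an optional addition, not part of clean_words as defined above.

```python
punctuation = '.,;:"!?”“_-'  # same characters as in clean_words

# Hypothetical tokens chosen to show the edge cases.
tokens = ['analysis,', 'pre-supposition', '"well."', '--']

# Only leading and trailing punctuation is removed; the inner hyphen survives.
stripped = [t.strip(punctuation) for t in tokens]

# A token made only of punctuation becomes an empty string, which could be dropped.
non_empty = [t for t in stripped if t]

print(stripped)
print(non_empty)
```

This explains why pre-supposition keeps its hyphen in the output above.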

Next, build a mapping of words to indexes in your encoding:

In [73]:
word_list = sorted(set(clean_words(text)))
word2index_dict = {word: i for (i, word) in enumerate(word_list)}
len(word2index_dict), word2index_dict['nn']

(388, 224)

Note: word2index_dict is now a dictionary with words as keys and integers as values. You’ll use this dictionary to efficiently find the index of a word as you one-hot encode it.
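
The following sketch rebuilds the same kind of mapping on a tiny, hypothetical word list to show how lookups behave. The toy names are illustrative, and using dict.get for unknown words is one possible convention, not something this notebook does.

```python
toy_words = ['the', 'hype', 'the', 'nn']      # hypothetical cleaned words
toy_word_list = sorted(set(toy_words))        # unique words in alphabetical order
toy_word2index = {word: i for i, word in enumerate(toy_word_list)}

print(toy_word2index)

# A known word maps directly to its index.
known_index = toy_word2index['nn']

# Indexing with an unknown word raises KeyError; .get returns None instead.
unknown_index = toy_word2index.get('cnn')

print(known_index, unknown_index)
```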

Now focus on your sentence. Break it into words and one-hot encode it; that is,
 populate a tensor with one one-hot encoded vector per word. Create a tensor of zeros,
 and set a 1 in each row at the index of the corresponding word in the sentence:


In [74]:
word_tensor = torch.zeros(len(words_in_line), len(word2index_dict))
for i, word in enumerate(words_in_line):
    word_index = word2index_dict[word] 
    word_tensor[i][word_index] = 1 
    print('{:2} {:4} {}'.format(i, word_index, word))

 0  174 instead
 1  229 of
 2  103 doing
 3   25 analysis
 4  371 we
 5  196 make
 6    6 a
 7  257 pre-supposition
 8  336 that
 9  337 the
10  245 particular
11  206 method
12  381 will
13  226 not
14  384 work
15  374 well
16  371 we
17   38 are
18  214 more
19  171 inclined
20  352 to
21  337 the
22  162 hype
23   88 created
24   56 by
25  224 nn


In [75]:
len(word_tensor)

26

In [77]:
len(word_tensor[0])

388

In [78]:
print(word_tensor.shape)

torch.Size([26, 388])


##### Note: 
At this point, word_tensor represents one sentence of length 26 in an encoding space of size 388, the number of words in your dictionary.
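
One way to sanity-check a one-hot encoded sentence is to decode it back into words. The sketch below does this on a small, hypothetical sentence; the toy names and the inverse dictionary are assumptions introduced for illustration.

```python
import torch

toy_words = ['we', 'make', 'a', 'method', 'we']   # hypothetical cleaned sentence
toy_word2index = {w: i for i, w in enumerate(sorted(set(toy_words)))}
toy_index2word = {i: w for w, i in toy_word2index.items()}  # inverse mapping

# One row per word in the sentence, one column per word in the vocabulary.
toy_tensor = torch.zeros(len(toy_words), len(toy_word2index))
for i, w in enumerate(toy_words):
    toy_tensor[i][toy_word2index[w]] = 1

# argmax along dim 1 recovers the column index of the single 1 in each row.
decoded = [toy_index2word[idx] for idx in toy_tensor.argmax(dim=1).tolist()]

print(toy_tensor.shape, decoded)
```

If the decoded list matches the original words, each row was set at the right index.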