# Preprocessing text
- [NLTK Natural Language Processing video](https://www.youtube.com/watch?v=XFoehWRzG-I)
- [NLP video](https://www.youtube.com/watch?v=xvqsFTUsOmc)

## 01 Tokenization
#### convert text to sentences and sentences to words

In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger') # for POS_tag
nltk.download('maxent_ne_chunker') # for NER
nltk.download('words') # for NER

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/jovyan/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [2]:
sentence = "'Nero!' shouted miss Hendrix as the quick red fox jumped over the lazy dog"

In [3]:
# With NLTK we can do a lot of text preprocessing, such as turning a string into a list of words
import nltk
from nltk.tokenize import word_tokenize
# split the text into word tokens
tokens = word_tokenize(sentence)
tokens

["'Nero",
 '!',
 "'",
 'shouted',
 'miss',
 'Hendrix',
 'as',
 'the',
 'quick',
 'red',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog']

## 02 Lowercasing

In [4]:
tokens = [word.lower() for word in tokens]
# equivalent, using map
tokens = list(map(str.lower, tokens))
tokens

["'nero",
 '!',
 "'",
 'shouted',
 'miss',
 'hendrix',
 'as',
 'the',
 'quick',
 'red',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog']

## 03 Remove stop words

In [5]:
# Toy example
stopwords=['this','that','and','a','we','it','to','is','of','up','need']
text="this is a text full of content and we need to clean it up"
words=text.split(" ")
shortlisted_words=[]

# replace each stop word with the placeholder "W" so the removed positions stay visible
for w in words:
    if w not in stopwords:
        shortlisted_words.append(w)
    else:
        shortlisted_words.append("W")

print("original sentence = ",text)    
print("sentence with stop words removed= ",' '.join(shortlisted_words))    

original sentence =  this is a text full of content and we need to clean it up
sentence with stop words removed=  W W W text full W content W W W W clean W W


In [6]:
# more useful example
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]
print(tokens)

["'nero", '!', "'", 'shouted', 'miss', 'hendrix', 'quick', 'red', 'fox', 'jumped', 'lazy', 'dog']


## 04 Stemming
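Reduce complexity by reducing each word to its stem. Stemming simply chops characters off the end of a word using fixed rules, so the result is often not a real word (for example `house` becomes `hous`) and words with different meanings can collapse to the same stem.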
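
To make the idea of "chopping off the end" concrete, here is a minimal sketch of a naive suffix stripper. The suffix list and the `naive_stem` helper are hypothetical and chosen only for illustration; this is not the Porter algorithm, which applies a much more careful sequence of rules.

```python
# Hypothetical suffix list, for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ions", "ion", "ing", "ed", "s", "e")


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["connect", "connected", "connections", "house", "housing"]:
    print(w, "->", naive_stem(w))
```

Even this crude version shows the trade-off: variations of a word collapse to one stem, but the stem itself need not be a dictionary word.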

In [7]:
import nltk
import pandas as pd
from nltk.stem import PorterStemmer # Porter is most popular stemmer algorithm

# init stemmer
porter_stemmer=PorterStemmer()

In [8]:
# stem connect variations
words=["connect","connected","connection","connections","connects","house","housing"]
stemmed_words=[porter_stemmer.stem(word=word) for word in words]

stemdf= pd.DataFrame({'original_word': words,'stemmed_word': stemmed_words})
stemdf

| | original_word | stemmed_word |
|---|---|---|
| 0 | connect | connect |
| 1 | connected | connect |
| 2 | connection | connect |
| 3 | connections | connect |
| 4 | connects | connect |
| 5 | house | hous |
| 6 | housing | hous |


In [9]:
# stem trouble variations
words=["trouble","troubled","troubles","troublesome"]
stemmed_words=[porter_stemmer.stem(word=word) for word in words]

stemdf= pd.DataFrame({'original_word': words,'stemmed_word': stemmed_words})
stemdf

| | original_word | stemmed_word |
|---|---|---|
| 0 | trouble | troubl |
| 1 | troubled | troubl |
| 2 | troubles | troubl |
| 3 | troublesome | troublesom |


## 05 Lemmatization
Lemmatization uses a vocabulary and the word's part of speech to convert each word to its meaningful base form, called the lemma. The same word can have different lemmas depending on its part of speech, which is why the examples below pass `pos` explicitly.
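
As a rough illustration of why the part of speech matters, the sketch below does lemmatization by plain dictionary lookup. The `LEMMA_TABLE` entries and the `lookup_lemma` helper are hypothetical; a real lemmatizer such as the WordNet one relies on a far larger vocabulary plus morphological rules.

```python
# Hypothetical lookup table, for illustration only; WordNet is far larger.
LEMMA_TABLE = {
    ("saw", "v"): "see",      # "I saw the film"
    ("saw", "n"): "saw",      # "a saw for cutting wood"
    ("geese", "n"): "goose",
    ("troubling", "v"): "trouble",
}


def lookup_lemma(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), falling back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(lookup_lemma("saw", "v"))
print(lookup_lemma("saw", "n"))
print(lookup_lemma("geese", "n"))
```

The same surface form `saw` maps to two different lemmas, and only the part of speech tells them apart.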

In [10]:
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

#lemmatize trouble variations
words=["trouble","troubling","troubled","troubles"]
lemmatized_words=[lemmatizer.lemmatize(word=word,pos='v') for word in words]
lemmatizeddf= pd.DataFrame({'original_word': words,'lemmatized_word': lemmatized_words})
lemmatizeddf=lemmatizeddf[['original_word','lemmatized_word']]
lemmatizeddf

| | original_word | lemmatized_word |
|---|---|---|
| 0 | trouble | trouble |
| 1 | troubling | trouble |
| 2 | troubled | trouble |
| 3 | troubles | trouble |


In [11]:
#lemmatize goose variations
words=["goose","geese"]
lemmatized_words=[lemmatizer.lemmatize(word=word,pos='n') for word in words]
lemmatizeddf= pd.DataFrame({'original_word': words,'lemmatized_word': lemmatized_words})
lemmatizeddf=lemmatizeddf[['original_word','lemmatized_word']]
lemmatizeddf

| | original_word | lemmatized_word |
|---|---|---|
| 0 | goose | goose |
| 1 | geese | goose |


## 04-1 Noise Removal before stemming
Necessary for text from sources such as social media and blog comments.

In [12]:
import nltk
import pandas as pd
import re
from nltk.stem import PorterStemmer

porter_stemmer=PorterStemmer()


In [13]:
# stem raw words with noise
raw_words=["..trouble..","trouble<","trouble!","<a>trouble</a>",'1.trouble']
stemmed_words=[porter_stemmer.stem(word=word) for word in raw_words]
stemdf= pd.DataFrame({'raw_word': raw_words,'stemmed_word': stemmed_words})
stemdf

| | raw_word | stemmed_word |
|---|---|---|
| 0 | `..trouble..` | `..trouble..` |
| 1 | `trouble<` | `trouble<` |
| 2 | `trouble!` | `trouble!` |
| 3 | `<a>trouble</a>` | `<a>trouble</a>` |
| 4 | `1.trouble` | `1.troubl` |


In [14]:
def scrub_words(text):
    """Basic cleaning of texts."""

    # remove HTML markup
    text = re.sub(r"(<.*?>)", "", text)

    # replace non-word characters (punctuation, symbols) and digits with a space
    text = re.sub(r"(\W|\d)", " ", text)

    # strip leading and trailing whitespace
    text = text.strip()
    return text

In [15]:
# stem words already cleaned
cleaned_words=[scrub_words(w) for w in raw_words]
cleaned_stemmed_words=[porter_stemmer.stem(word=word) for word in cleaned_words]
stemdf= pd.DataFrame({'raw_word': raw_words,'cleaned_word':cleaned_words,'stemmed_word': cleaned_stemmed_words})
stemdf=stemdf[['raw_word','cleaned_word','stemmed_word']]
stemdf

| | raw_word | cleaned_word | stemmed_word |
|---|---|---|---|
| 0 | `..trouble..` | trouble | troubl |
| 1 | `trouble<` | trouble | troubl |
| 2 | `trouble!` | trouble | troubl |
| 3 | `<a>trouble</a>` | trouble | troubl |
| 4 | `1.trouble` | trouble | troubl |


## 06 POS tagging (Part Of Speech)
The POS tagger in the NLTK library labels every token with a part-of-speech tag (noun, verb, article, adjective and so on).
The tags are useful for lemmatization and for extracting relationships between words.

Find the [tag categories here](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b)

In [16]:
from nltk.tokenize import sent_tokenize, word_tokenize
from  nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
txt = """Natural Language Processing is a technique that is widely used in the field of AI and Machine Learning. 
In this video, you learn about the NLTK library and its use for natural language processing and text mining tasks. 
You will look at Speech Recognition, Spam Filtering, and Sentiment Analysis. You will understand text extraction and NLP workflow. 
Using the NLTK Python library, you will perform a hands-on demo on processing brown corpus and structuring sentences. 
Let's get started."""
tokenized = sent_tokenize(txt)
for i,t in enumerate(tokenized):
    print(i,t)
    words_list = word_tokenize(t)
    words_list = [word for word in words_list if word not in stop_words]
    tagged = nltk.pos_tag(words_list)
    print(tagged)
    print()

0 Natural Language Processing is a technique that is widely used in the field of AI and Machine Learning.
[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('technique', 'NN'), ('widely', 'RB'), ('used', 'VBD'), ('field', 'NN'), ('AI', 'NNP'), ('Machine', 'NNP'), ('Learning', 'NNP'), ('.', '.')]

1 In this video, you learn about the NLTK library and its use for natural language processing and text mining tasks.
[('In', 'IN'), ('video', 'NN'), (',', ','), ('learn', 'VBP'), ('NLTK', 'NNP'), ('library', 'NN'), ('use', 'NN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('text', 'NN'), ('mining', 'NN'), ('tasks', 'NNS'), ('.', '.')]

2 You will look at Speech Recognition, Spam Filtering, and Sentiment Analysis.
[('You', 'PRP'), ('look', 'VBP'), ('Speech', 'JJ'), ('Recognition', 'NNP'), (',', ','), ('Spam', 'NNP'), ('Filtering', 'NNP'), (',', ','), ('Sentiment', 'NNP'), ('Analysis', 'NNP'), ('.', '.')]

3 You will understand text extraction and NLP workflow.
[(

## 07 NER (Named Entity Recognition)
NER locates real-world entities in text and sorts them into predefined categories (persons, organisations, locations etc.)

In [17]:
# NER
txt = "Jim and Jake Brown from Bakersville in Minnesota bought 300 shares of Acme Corp. in 2006."
tokenized = word_tokenize(txt)
tagged = nltk.pos_tag(tokenized)
chunked = nltk.ne_chunk(tagged)

# extract all named entities
named_entities = []
for tagged_tree in chunked:
    if hasattr(tagged_tree, 'label'):
        entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
        entity_type = tagged_tree.label()
        named_entities.append((entity_name,entity_type))
print(named_entities)

[('Jim', 'PERSON'), ('Jake Brown', 'PERSON'), ('Bakersville', 'GPE'), ('Minnesota', 'GPE'), ('Acme Corp.', 'ORGANIZATION')]


## NLTK
The Natural Language Toolkit (NLTK) is a Python library; its models and corpora are downloaded separately  
`nltk.download('punkt')`  Punkt sentence tokenizer  
`nltk.download('popular')`  
`nltk.download('all')` (3.2+ GB in models and data)  
To see installed models look in `/home/jovyan/nltk_data`  

To download a large collection from NLTK without getting an error on `panlex_lite`:
```python
$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
$ python

>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index into treating panlex_lite as already installed.
>>> dler.download('popular')

```

## Syntax tree and Chunk parsing (Chunking)
ghostscript is for rendering **syntax trees**   
Install from `https://ghostscript.com/download/gsdnld.html`

## nltk regexp parser
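`parser = nltk.RegexpParser(grammar); parser.parse(tagged)`, where `grammar` is a chunk grammar string and `tagged` is a list of POS-tagged tokens (not raw text)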
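
A minimal chunking sketch with `nltk.RegexpParser` follows. The noun-phrase grammar and the hand-written POS tags are assumptions made for this example (tagging by hand avoids needing any downloaded tagger model); in practice the tags would come from `nltk.pos_tag`.

```python
import nltk

# Assumed chunk grammar: an optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)

# Hand-tagged tokens, for illustration only.
tagged = [
    ("the", "DT"), ("quick", "JJ"), ("red", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"),
    ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]

tree = parser.parse(tagged)

# Collect the text of every NP chunk found in the tree.
noun_phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```

The parser returns an `nltk.Tree`; tokens that match no rule (here the verb and the preposition) stay as plain leaves under the root.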

# Preparing for the machine
The mapping from textual data to real valued vectors is called **feature extraction**

Source: https://freecontent.manning.com/deep-learning-for-text/  

Deep learning models don't take raw text as input: they only work with numeric tensors. Three popular ways to create numeric tensors from text are:
1. **One-hot encoding** is the most common, most basic way to turn a token into a vector. It consists of associating a **unique integer index** to every word, then turning this integer index i into a binary vector of size N, the size of the vocabulary, that is all zeros except for the i-th entry, which is one.
2. **Word embeddings**, also called (dense) word vectors
3. **Bag-of-words** (N-grams) represents a text by the counts of its 1-, 2- or 3-word combinations, ignoring word order beyond each N-gram. It is used for lightweight, shallow text-processing models such as logistic regression and random forests
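
Of the three approaches, the bag of N-grams is the simplest to sketch in plain Python. The helper names `ngrams` and `bag_of_ngrams` are hypothetical, and tokenization is a bare `split()`, so this is only an illustration of the idea.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Count every 1-gram up to max_n-gram in a whitespace-tokenized text."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag


bag = bag_of_ngrams("the cat sat on the mat")
print(bag[("the",)])        # how often the unigram "the" occurs
print(bag[("the", "cat")])  # how often the bigram "the cat" occurs
```

The counts form an unordered bag: apart from the order preserved inside each N-gram, the position of words in the sentence is lost.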

### One-Hot-Encoding

In [18]:
# One-Hot-Encoding simplified (not for production)
import numpy as np
  
# This is our initial data; one entry per "sample"
# (in this toy example, a "sample" is just a sentence, but
# it could be an entire document).
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# First, build an index of all tokens in the data.
token_index = {}
for sample in samples:
    # We simply tokenize the samples via the `split` method.
    # In real life, we would also strip punctuation and special characters
    # from the samples.
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word
            token_index[word] = len(token_index) + 1
            # Note that we don't attribute index 0 to anything.

# Next, we vectorize our samples.
# We will only consider the first `max_length` words in each sample.
max_length = 10

# This is where we store our results:
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
print('Sparse vectors with one-hot:\n',results) # sparse since most entries are zeros

Sparse vectors with one-hot:
 [[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]


In [19]:
# Keras example
from keras.preprocessing.text import Tokenizer
  
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We create a tokenizer, configured to only take
# into account the top-1000 most common words
tokenizer = Tokenizer(num_words=1000)
# This builds the word index
tokenizer.fit_on_texts(samples)

# This turns strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

# You could also directly get the one-hot binary representations.
# Note that other vectorization modes than one-hot encoding are supported!
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# This is how you can recover the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 9 unique tokens.


## One-hot encoding with hashing trick
(Below example for academic reference)  

A variant of one-hot encoding is the “one-hot hashing trick”, which can be used when the **number of unique tokens** in your vocabulary is **too large** to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference of these indices in a dictionary, one may hash words into vectors of fixed size. This is typically done with a lightweight hashing function. The main advantage of this method is that it **does away with maintaining an explicit word index lookup table**, which saves memory and allows online encoding of the data (starting to generate token vectors right away, before having seen all of the available data).  

The one **drawback** of this method is that it’s susceptible to **“hash collisions”**: two different words may end up with the same hash, and subsequently any machine learning model looking at these hashes won’t be able to tell the difference between these words. The likelihood of hash collisions decreases when the dimensionality of the hashing space (how much of the initial hash is kept) is much larger than the total number of unique tokens being hashed.
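
The sketch below illustrates both points with a deterministic hash. Using `hashlib.md5` is a choice made for this example (the built-in `hash()` gives different values for strings in every Python process unless `PYTHONHASHSEED` is fixed); the helpers `stable_index` and `count_collisions` are hypothetical names.

```python
import hashlib


def stable_index(word: str, dimensionality: int) -> int:
    """Map a word to an index in [0, dimensionality) using a stable hash."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dimensionality


def count_collisions(words, dimensionality):
    """Count unique words that land on an index already taken by another word."""
    seen = {}
    collisions = 0
    for word in sorted(set(words)):
        idx = stable_index(word, dimensionality)
        if idx in seen:
            collisions += 1
        else:
            seen[idx] = word
    return collisions


words = "the cat sat on the mat the dog ate my homework".split()
for dim in (4, 1000):
    print(dim, count_collisions(words, dim))
```

With only 4 slots for 9 unique words, collisions are unavoidable; a hashing space much larger than the vocabulary makes them far less likely.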

In [20]:
# Toy example
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
  
# We will store our words as vectors of size 1000.
# Each sample will be a matrix shape: (10,1000)
# result will be a list of matrices of shape (samples,max_length,dimensionality) (2,10,1000)
# Note that if you have close to 1000 words (or more)
# you will start seeing many hash collisions, which
# will decrease the accuracy of this encoding method.
dimensionality = 1000
max_length = 10

# Fill initial cube with zeros
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
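        # Hash the word into an integer index between 0 and dimensionality - 1.
        # Note: Python randomizes str hashes per process, so the indices differ
        # between runs unless PYTHONHASHSEED is set.
        index = abs(hash(word)) % dimensionality
        # for each sentence, for each word, set the entry at the hashed index to 1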
        results[i, j, index] = 1.

print(hash('total'))
print(hash('total') % 1000)
print(results.shape)
print('Second sample, third word. Of thousand indices one will contain the value 1\n',results[1,2,650:750])

-874219843211407601
399
(2, 10, 1000)
Second sample, third word. Of thousand indices one will contain the value 1
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]


## Word embeddings
[Article](https://machinelearningmastery.com/what-are-word-embeddings/)  

Word embeddings are dense, low-dimensional floating-point vectors. They are the opposite of one-hot vectors, which are sparse, high-dimensional (dimension = number of words in the vocabulary) and binary.
Unlike word vectors obtained via one-hot encoding, word embeddings are learned from data. It’s common to see word embeddings that are 256-dimensional, 512-dimensional, or 1024-dimensional when dealing with massive vocabularies.
#### Two ways to obtain word embeddings
1. Learn word embeddings jointly with the main task you care about (e.g. document classification or sentiment prediction). In this setup, you **start with random word vectors**, then **learn** your word vectors in the same way that you learn the **weights of a neural network**.
2. Load into your model word embeddings that were pre-computed using a different machine learning task than the one you are trying to solve. These are called **“pre-trained word embeddings”**.  

Word embeddings are meant to map human language into a geometric space. For instance, in a reasonable embedding space, we would expect synonyms to be embedded into similar word vectors, and in general we would expect the geometric distance (e.g. L2 distance) between any two word vectors to relate to the semantic distance of the associated words (words meaning very different things would be embedded to points far away from each other, while related words would be closer).  
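
A small numeric sketch of this geometric view, using NumPy. The three 3-dimensional vectors are hand-picked, hypothetical values (real embeddings are learned and have far more dimensions); they are only meant to show how L2 distance and cosine similarity are computed.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "accurate": np.array([0.9, 0.1, 0.0]),
    "exact": np.array([0.8, 0.2, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


near = l2_distance(embeddings["accurate"], embeddings["exact"])
far = l2_distance(embeddings["accurate"], embeddings["banana"])
print("accurate vs exact :", near)
print("accurate vs banana:", far)
```

In a well-structured space the near-synonyms end up closer together than the unrelated pair, which is exactly what the toy vectors were chosen to show.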

The simplest way to associate a dense vector to a word would be to pick the vector at random. The problem with this approach is that the resulting embedding space would have no structure: for instance, the words “accurate” and “exact” may end up with completely different embeddings, even though they are interchangeable in most sentences.
<img width=400, src="images/word-embeddings_simple.png">

In real world examples, what makes a good word embedding space depends heavily on your task: the perfect word embedding space for an English-language movie review sentiment analysis model may look different from the perfect embedding space for an English-language legal document classification model. It is therefore reasonable to learn a new embedding space with every new task.

The Embedding layer is best understood as a dictionary mapping integer indices (which stand for specific words) to dense vectors. These vectors encode the meaning of a word in relation to other words: for example, a "gender vector" could map "king" to "queen", and a "plural vector" could map "child" to "children".  

### keras.layers.Embedding
The Embedding layer takes as input a 2D tensor of integers, of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable lengths; for instance, we could feed it batches of shape (32, 10) (batch of 32 sequences of length 10) or (64, 15) (batch of 64 sequences of length 15). All sequences in a batch must have the same length, though, because we need to pack them into a single tensor: sequences that are shorter than others should be padded with zeros, and sequences that are longer should be truncated. 
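
The padding and truncation step can be sketched in plain Python. The `pad_or_truncate` helper is hypothetical; it pads and truncates at the start of the sequence, which to my knowledge is also the default behaviour of Keras `pad_sequences`, and it assumes `maxlen` is positive.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Keep the last maxlen items; left-pad shorter sequences with value."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated


batch = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
padded = [pad_or_truncate(seq, maxlen=4) for seq in batch]
print(padded)  # [[0, 5, 8, 2], [9, 4, 3, 6]]
```

After this step every row has the same length, so the batch can be packed into a single 2D integer tensor.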

This layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality). Such a 3D tensor can then be processed by a RNN layer or a 1D convolution layer.
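
The shape bookkeeping can be checked with a NumPy stand-in for the layer. This is only a sketch: `embedding_table` is a random matrix playing the role of the layer's weights, and the sizes are hypothetical.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4

# Random table standing in for the layer's weights: one row per token index.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 integer sequences of length 4, shape (samples, sequence_length).
batch = np.array([[0, 5, 8, 2],
                  [9, 4, 3, 6]])

# Fancy indexing looks up one row per integer, adding a trailing dimension.
embedded = embedding_table[batch]
print(embedded.shape)  # (2, 4, 4)
```

Each integer is replaced by its row of the table, which is why a 2D integer tensor becomes a 3D float tensor.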

When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, like with any other layer. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something that the downstream model can exploit. Once fully trained, your embedding space shows a lot of structure—a kind of structure specialized for the specific problem you were training your model for.

In [21]:
from keras.layers import Embedding
  
# The Embedding layer takes at least two arguments:
# the number of possible tokens, here 1000 (1 + maximum word index),
# and the dimensionality of the embeddings, here 64.
embedding_layer = Embedding(1000, 64)
embedding_layer

<tensorflow.python.keras.layers.embeddings.Embedding at 0x7f471b55a750>

In [22]:
# IMDB movie review sample used with word embedding
# We'll restrict the movie reviews to the top 10,000 most common words and cut the reviews
# after only 20 words (all samples must have the same length). Our network learns 8-dimensional
# embeddings for each of the 10,000 words, turns the input integer sequences (2D integer tensor)
# into embedded sequences (3D float tensor), flattens the tensor to 2D, and trains a single
# Dense layer on top for classification.
from keras.datasets import imdb
from keras import preprocessing
  
# Number of words to consider as features
max_features = 10000
# Cut texts after this number of words (among top max_features most common words)
maxlen = 20

# Load the data as lists of integers.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(x_train[0])
# This turns our lists of integers
# into a 2D integer tensor of shape `(samples, maxlen)`
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 

In [23]:
#  Using an Embedding layer and classifier on the IMDB data.
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()

# We specify the maximum input length to our Embedding layer so we can later flatten the embedded inputs
model.add(Embedding(10000, 8, input_length=maxlen))
# After the Embedding layer, our activations have shape `(samples, maxlen, 8)`.

# We flatten the 3D tensor of embeddings into a 2D tensor of shape `(samples, maxlen * 8)`
model.add(Flatten())

# We add the classifier on top
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                 epochs=10,
                 batch_size=32,
                 validation_split=0.2)
history

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten (Flatten)            (None, 160)               0         
_________________________________________________________________
dense (Dense)                (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f46ea02bfd0>

We get to a validation accuracy of ~75%, which is pretty good considering that we’re only looking at the first twenty words in every review. But note that merely flattening the embedded sequences and training a single Dense layer on top leads to a model that treats each word in the input sequence separately, without considering inter-word relationships and sentence structure (it would likely treat both “this movie is shit” and “this movie is the shit” as negative reviews). It would be much better to add recurrent layers or 1D convolutional layers on top of the embedded sequences to learn features that take each sequence into account as a whole.
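
The parameter counts reported by `model.summary()` above can be reproduced by hand. The short calculation below is just that arithmetic, using the sizes from the model definition (10,000 words, 8 dimensions, 20 words per review).

```python
vocab_size, embedding_dim, maxlen = 10000, 8, 20

# Embedding layer: one embedding_dim-sized vector per word in the vocabulary.
embedding_params = vocab_size * embedding_dim

# Flatten has no weights; it reshapes (maxlen, embedding_dim) into one vector.
flattened_size = maxlen * embedding_dim

# Dense(1): one weight per flattened input plus one bias.
dense_params = flattened_size * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flattened_size, dense_params, total_params)
# 80000 160 161 80161
```

Almost all of the model's capacity sits in the embedding table; the classifier on top has only 161 parameters.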

## Putting it all together: from raw text to word embeddings


In [24]:
! tree -la data/aclImdb -L 2

data/aclImdb
├── imdbEr.txt
├── imdb.vocab
├── README
├── test
│   ├── labeledBow.feat
│   ├── neg
│   ├── pos
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg
    ├── pos
    ├── unsup
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt

7 directories, 11 files


In [25]:
import os

imdb_dir = 'data/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

# neg and pos are 2 folders in the aclImdb/train folder containing reviews
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    # get the filename in the folder
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # be explicit about the encoding so the platform default does not matter
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

In [26]:
# Tokenizing the text of the raw IMDB data

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100  # We will cut reviews after 100 words
max_words = 10000  # We will only consider the top 10,000 words in the dataset

tokenizer = Tokenizer(num_words=max_words)

# Updates internal vocabulary based on a list of texts.
tokenizer.fit_on_texts(texts)

# Transforms each text in texts to a sequence of integers. Only words known by the tokenizer will be taken into account.
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique words.' % len(word_index))

# Sequences that are shorter than `maxlen` are padded with `value`; longer ones are truncated.
data = pad_sequences(sequences, maxlen=maxlen, value=0.0)
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('label tensor:', labels)

# Split the data into a training set and a validation set. But first, shuffle the data, since we started from data
# where the samples are ordered (all negative first, then all positive).
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

Found 88582 unique words.
Shape of data tensor: (25000, 100)
label tensor: [0 0 0 ... 1 1 1]


In [27]:
training_samples = 200  # We will be training on 200 samples
validation_samples = 10000  # We will be validating on 10000 samples
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
  

## Precomputed embeddings from the GloVe algorithm
https://nlp.stanford.edu/projects/glove/

### Download pre-trained word vectors
http://nlp.stanford.edu/data/glove.6B.zip (822MB zip file)

In [28]:
!wget -O 'data/glove.6B.zip' http://nlp.stanford.edu/data/glove.6B.zip

--2021-05-05 11:29:25--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-05-05 11:29:26--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-05-05 11:29:26--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘data/glove.6B.zip’



In [None]:
!unzip -d data/glove.6B data/glove.6B.zip

Archive:  data/glove.6B.zip
replace data/glove.6B/glove.6B.50d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [None]:
glove_dir = 'data/glove.6B'
  
embeddings_index = {}
# use only one of the 4 files from the GloVe vector set (the 100-dimensional vectors)
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
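
The parsing loop above relies on the layout of the GloVe text files: one word per line, followed by its vector components separated by spaces. The sketch below runs the same logic on a tiny in-memory stand-in; the two 3-dimensional lines and the `load_vectors` helper are made up for illustration.

```python
import io

import numpy as np

# Hypothetical 3-dimensional vectors in the same "word v1 v2 v3" layout.
fake_glove_file = io.StringIO("the 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n")


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> float32 vector."""
    index = {}
    for line in lines:
        word, *coefs = line.split()
        index[word] = np.asarray(coefs, dtype="float32")
    return index


vectors = load_vectors(fake_glove_file)
print(len(vectors), vectors["cat"].shape)
```

The real file works the same way, only with 100 components per line and one line per word in the GloVe vocabulary.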

In [None]:
# Preparing the GloVe word embeddings matrix
# Build an embedding matrix that we can load into an Embedding layer. It must be a matrix of
# shape (max_words, embedding_dim), where each entry i contains the embedding_dim-dimensional
# vector for the word of index i in our reference word index (built during tokenization).
# Note that index 0 isn't supposed to stand for any word or token; it's a placeholder.
embedding_dim = 100
  
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
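
To see which rows end up filled, here is the same loop on toy data. The word index, the 2-dimensional vectors and the sizes are all hypothetical; only the logic mirrors the cell above.

```python
import numpy as np

# Hypothetical tokenizer word index (index 0 is reserved as a placeholder).
word_index = {"the": 1, "cat": 2, "zyzzyva": 3, "dog": 4}

# Hypothetical pre-trained vectors; "zyzzyva" is deliberately missing.
embeddings_index = {
    "the": np.array([0.1, 0.2], dtype="float32"),
    "cat": np.array([0.3, 0.4], dtype="float32"),
    "dog": np.array([0.5, 0.6], dtype="float32"),
}

max_words, embedding_dim = 4, 2
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    # Skip words beyond the vocabulary cutoff and words without a vector.
    if i < max_words and vector is not None:
        embedding_matrix[i] = vector

print(embedding_matrix)
```

Row 0 (the placeholder) and row 3 (no pre-trained vector) stay all zeros, and "dog" is dropped because its index falls outside `max_words`.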

In [None]:
# Define the model
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
  
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
  

In [None]:
# Loading the matrix of pre-trained word embeddings into the Embedding layer
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

In [None]:
# Training and evaluation
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
              epochs=10,
              batch_size=32,
              validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')

In [None]:
# Plot performance over time
import matplotlib.pyplot as plt
  
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

## Word embeddings
See [Explanation here](https://machinelearningmastery.com/what-are-word-embeddings/)



**Yes we can!** [See scikit-learn tutorial here](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

![](images/word-embeddings.png)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document? Is it?',
]
vectorizer = CountVectorizer()

fit = vectorizer.fit_transform(corpus)
print(type(fit))
res = fit.todense() # returns a dense numpy matrix of the same shape
document_idx = vectorizer.vocabulary_['document']
print(document_idx)
document_count = int(res[:, document_idx].sum()) # sum the column whose index is document_idx
print('document occurs {} times in the text'.format(document_count))
print('{} is the index for document'.format(document_idx))
mat = fit.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('There are 9 different words in the 4 sentences\n',vectorizer.get_feature_names_out())
print('In second sentence document occurs twice, which tells us that "document" is in second column')
print(res)
print('------------------------')
print(mat)
print('------------------------')

In [None]:
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

## Exercise

* Use the `CountVectorizer` from `sklearn.feature_extraction` to read the book `data/moby_dick.txt`
  * How many times does the word 'wood' appear?
* Use the `load_digits` function from the `sklearn.datasets` package to load a `sklearn` dataset
  * The package contains `.data` of 8x8 images. Extract the first image in an 8x8 array
  * Use the `plt.imshow` function to plot the image