![Practicum AI Logo image](https://github.com/PracticumAI/practicumai.github.io/blob/main/images/logo/PracticumAI_logo_250x50.png?raw=true)
***
# *Practicum AI:* NLP - Alice in Wonderland

This exercise is adapted from Baig et al. (2020) <i>The Deep Learning Workshop</i> from <a href="https://www.packtpub.com/product/the-deep-learning-workshop/9781839219856">Packt Publishing</a> (Activity 4.01, page 176, and Activity 4.02, page 210).

#### Downloading Text Corpora Using NLTK (Data Setup)

In [None]:
import nltk

<div style="padding: 10px;margin-bottom: 20px;border: thin solid #E5C250;border-left-width: 10px;background-color: #fff"><strong>Tip:</strong> When executed without arguments, nltk.download() does not open the NLTK downloader in a new window as pictured in the textbook. The Unix version has a command-line interface instead. Type 'l' at the downloader prompt and hit Enter to list the packages available for installation. The list is long, so you will need to hit Enter several times to scroll through it; the gutenberg package is in this list. To install it, type 'd' and hit Enter, then type 'gutenberg' and hit Enter again. The downloader will respond with an installation message. Finally, type 'q' to exit the downloader.</div>
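
If the interactive downloader is awkward in your environment, packages can also be requested by name. The sketch below is one possible way to do that. The helper name `fetch_corpora` and the package list are assumptions based on what the later steps appear to need (the Gutenberg corpus, the tokenizer models, and the stop word list). Newer NLTK releases may ask for `punkt_tab` instead of `punkt`; the error message names the missing resource.

```python
def fetch_corpora(packages=("gutenberg", "punkt", "stopwords")):
    """Download the named NLTK packages and report which ones succeeded."""
    import nltk  # imported here so the helper can be defined before NLTK is loaded

    # nltk.download(name) returns True on success and False otherwise
    return {name: nltk.download(name, quiet=True) for name in packages}

# Usage (hypothetical, needs network access):
# fetch_corpora()
```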

In [None]:
nltk.download()

In [None]:
alice_raw = nltk.corpus.gutenberg.raw('carroll-alice.txt')

In [None]:
alice_raw[:500]

***
#### Activity 4.01 (Text Preprocessing) - Page 176

#### 1. Set the raw text to lowercase and split it into sentences

```python
from nltk import tokenize
txt_sents = tokenize.sent_tokenize(alice_raw.lower())
```

In [None]:
# Code it!

#### 2. Tokenize each sentence into words

```python
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sents]
```

In [None]:
# Code it!

#### 3. Import punctuation from string module & stop words from NLTK

```python
from string import punctuation
stop_punct = list(punctuation)

from nltk.corpus import stopwords
stop_nltk = stopwords.words("english")
```

In [None]:
# Code it!

#### 4. Create variable to contain the stop words: "--" and "said"

```python
stop_context = ["--", "said"]
```

In [None]:
# Code it!

#### 5. Create a master list of stop words

```python
stop_final = stop_punct + stop_nltk + stop_context
```

In [None]:
# Code it!

#### 6. Define a function to drop these tokens from any input sequence

```python

def drop_stop(input_tokens):
    return [token for token in input_tokens if token not in stop_final]

alice_words_nostop = [drop_stop(sent) for sent in txt_words]
print(alice_words_nostop[:2])
```
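
Membership tests against a list scan the whole list, so on longer texts it can help to hold the stop words in a set. The following is a small self-contained sketch of the same filtering idea on made-up tokens. The names `toy_stop_words`, `toy_sentences`, and `drop_stop_set` are illustrative and not part of the workshop code.

```python
from string import punctuation

# Hypothetical stand-ins for stop_final and txt_words from the steps above
toy_stop_words = set(punctuation) | {"the", "a", "was", "to", "--", "said"}
toy_sentences = [
    ["alice", "was", "beginning", "to", "get", "very", "tired", "."],
    ["``", "oh", "dear", "!", "''", "said", "the", "rabbit", "--"],
]

def drop_stop_set(input_tokens, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in input_tokens if token not in stop_words]

toy_nostop = [drop_stop_set(sent, toy_stop_words) for sent in toy_sentences]
print(toy_nostop)
```

Note that multi-character tokens such as the two-character quote marks pass through, because `punctuation` lists single characters only. If such tokens matter for your analysis, they would need to be added to the context-specific stop list.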

In [None]:
# Code it!

In [None]:
# Code it!

#### 7. Use PorterStemmer from NLTK to perform stemming on the result

```python
from nltk.stem import PorterStemmer
stemmer_p = PorterStemmer()

alice_words_stem = [[stemmer_p.stem(token) for token in sent] for sent in alice_words_nostop]
```

In [None]:
# Code it!

In [None]:
# Code it!

#### 8. Print the first five sentences

```python
print(alice_words_stem[:5])
```

In [None]:
# Code it!

***
#### Activity 4.02 (Text Representation) - Page 210

#### 1. Print the data you will work with

```python
print(alice_words_nostop[:3])
```

In [None]:
# Code it!

#### 2. Train your word embeddings with word2vec

```python
from gensim.models import word2vec

model = word2vec.Word2Vec(alice_words_nostop)
```

In [None]:
# Code it!

#### 3. Find the 5 terms most similar to rabbit

```python
model.wv.most_similar("rabbit", topn = 5)
```

In [None]:
# Code it!

#### 4. Using a window size of 2, retrain the word vectors

```python
model = word2vec.Word2Vec(alice_words_nostop, window = 2)
```

In [None]:
# Code it!

#### 5. Find the terms most similar to 'rabbit'

```python
model.wv.most_similar("rabbit", topn = 5)
```

In [None]:
# Code it!

#### 6. Retrain the word vectors using the skip-gram method with a window size of 5

```python
model = word2vec.Word2Vec(alice_words_nostop, window = 5, sg = 1)
```
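
To make the `window` and `sg` settings more concrete: in skip-gram training, each word is used to predict the words up to `window` positions on either side of it. The sketch below lists those (center, context) pairs for a made-up sentence. It only illustrates the idea; `skipgram_pairs` is a hypothetical helper, and gensim's actual training adds refinements such as subsampling and negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """List (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Made-up tokens, for illustration only
toy_sentence = ["white", "rabbit", "pink", "eyes", "ran", "close"]

narrow = skipgram_pairs(toy_sentence, window=2)
wide = skipgram_pairs(toy_sentence, window=5)
rabbit_contexts = [ctx for center, ctx in narrow if center == "rabbit"]

print(len(narrow), len(wide))
print(rabbit_contexts)
```

A wider window produces more pairs per word, which tends to pull in broader, more topical context, while a narrow window keeps only the immediate neighbours.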

In [None]:
# Code it!

#### 7. Find the terms most similar to 'rabbit'

```python
model.wv.most_similar("rabbit", topn = 5)
```

In [None]:
# Code it!

#### 8. Find the representation for the phrase 'white rabbit' by averaging the vectors for 'white' and 'rabbit'

```python
v1 = model.wv['white']
v2 = model.wv['rabbit']
res1 = (v1+v2)/2
```

In [None]:
# Code it!

#### 9. Find the representation for the phrase 'mad hatter' by averaging the vectors for 'mad' and 'hatter'

```python
v1 = model.wv['mad']
v2 = model.wv['hatter']
res2 = (v1+v2)/2
```
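
Steps 8 and 9 repeat the same averaging, so it can be convenient to wrap it in a helper. The sketch below uses plain Python lists and a made-up three-dimensional lookup table so that it runs without a trained model. `phrase_vector` and `toy_vectors` are hypothetical names; with gensim one would look the words up in `model.wv` instead.

```python
def phrase_vector(phrase, vectors):
    """Average the vectors of the words in `phrase`, element by element."""
    words = phrase.lower().split()
    missing = [w for w in words if w not in vectors]
    if missing:
        raise KeyError(f"no vector for: {missing}")
    # zip(*...) groups the values of each dimension across all words
    dims = zip(*(vectors[w] for w in words))
    return [sum(values) / len(words) for values in dims]

# Made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "white": [1.0, 0.0, 2.0],
    "rabbit": [3.0, 2.0, 0.0],
}

toy_result = phrase_vector("white rabbit", toy_vectors)
print(toy_result)
```

The explicit check for missing words mirrors a practical concern: gensim's `Word2Vec` drops words seen fewer than `min_count` times (5 by default), so a lookup for a rare word can raise a `KeyError`.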

In [None]:
# Code it!

#### 10. Find the cosine similarity between these two phrases

```python 
model.wv.cosine_similarities(res1, [res2])
```
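
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 to 1 and ignores vector magnitude. As a sketch of the quantity the call above reports, here is the same formula written out in plain Python for made-up vectors. `cosine_similarity` is a hypothetical helper, not a gensim function.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1; perpendicular ones score 0
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```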

In [None]:
# Code it!

#### 11. Load the pre-trained 100-dimensional GloVe embeddings using the formatted keyed vectors

```python
from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format(
    "data/glove.6B.100d.w2vformat.txt", 
    binary = False)
```
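
The file name above suggests that the GloVe vectors were first converted to word2vec text format. As far as the text layout goes, the difference is a header line giving the number of vectors and their dimensionality. The following sketch shows that conversion using only the standard library. `add_word2vec_header` and the file names are hypothetical, and the code assumes a well-formed GloVe text file with one space-separated vector per line.

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file, prepending the 'count dimensions' header line."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line for line in src if line.strip()]
    if not lines:
        raise ValueError("input file is empty")
    # The first field on each line is the word; the rest are vector components
    dimensions = len(lines[0].rstrip("\n").split(" ")) - 1
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dimensions}\n")
        dst.writelines(lines)
    return len(lines), dimensions

# Demonstration on a tiny made-up file
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "toy_glove.txt")
    out_path = os.path.join(tmp, "toy_w2v.txt")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("white 0.1 0.2 0.3\nrabbit 0.4 0.5 0.6\n")
    shape = add_word2vec_header(src_path, out_path)
    with open(out_path, encoding="utf-8") as f:
        converted = f.read()

print(shape)
print(converted)
```

This version reads the whole file into memory, which is simple but may be heavy for large embedding files. Recent gensim versions may also be able to read headerless files directly through the `no_header` argument of `load_word2vec_format`; check the documentation for your installed version.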

In [None]:
# Code it!

#### 12. Find representations for 'white rabbit' and 'mad hatter'

```python
v1 = glove_model['white']
v2 = glove_model['rabbit']
res1 = (v1+v2)/2

v1 = glove_model['mad']
v2 = glove_model['hatter']
res2 = (v1+v2)/2
```

In [None]:
# Code it!

#### 13. Find the cosine similarity between these two phrases

```python 
glove_model.cosine_similarities(res1, [res2])
```

Has the cosine similarity changed?

In [None]:
# Code it!

Here, we can see that the cosine similarity between the two phrases "**mad hatter**" and "**white rabbit**" is far lower with the GloVe model. This is because the GloVe model has not seen these terms together in its training data as often as they appear together in the book. In the book, the terms **mad** and **hatter** appear together a lot because they form the name of an important character. In other contexts, of course, we don't see **mad** and **hatter** together as often.
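
One rough way to probe this explanation is to count how often two words fall within a few tokens of each other in the tokenized sentences. The sketch below does this on made-up sentences. `count_near` is a hypothetical helper, and on the real data one would pass `alice_words_nostop` in place of `toy_sentences`.

```python
def count_near(sentences, word_a, word_b, window=5):
    """Count occurrences of word_a that have word_b within `window` tokens."""
    hits = 0
    for sent in sentences:
        for i, token in enumerate(sent):
            if token != word_a:
                continue
            # Tokens up to `window` positions before and after position i
            nearby = sent[max(0, i - window): i] + sent[i + 1: i + window + 1]
            if word_b in nearby:
                hits += 1
    return hits

# Made-up tokenized sentences, for illustration only
toy_sentences = [
    ["hatter", "mad", "tea", "party"],
    ["white", "rabbit", "ran", "past"],
    ["mad", "march", "hare", "dormouse", "asleep", "table", "near", "hatter"],
]

near_5 = count_near(toy_sentences, "mad", "hatter", window=5)
near_10 = count_near(toy_sentences, "mad", "hatter", window=10)
print(near_5, near_10)
```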