# Getting Started With Natural Language Processing

## 2. Text Preprocessing

**Task 1**  
- We used NLTK’s `PorterStemmer` to normalize the text — run the code to see how it does.

<br>

**Task 2**  
- In the output terminal you’ll see our program counts `"go"` and `"went"` as different words! 
- Also, what’s up with `"mani"` and `"hardli"`? 
- A lemmatizer will fix this. Let’s do it.
- Where `lemmatizer` is defined, replace `None` with `WordNetLemmatizer()`.
- Where we defined `lemmatized`, replace the empty list with a list comprehension that uses `lemmatizer` to `lemmatize()` each `token` in `tokenized`.

<br>

**Task 3**  
- Why are the lemmatized verbs like `"went"` still conjugated? 
- By default `lemmatize()` treats every word as a noun.
- Give `lemmatize()` a second argument: 
    - `get_part_of_speech(token)`. 
- This will tell our lemmatizer what part of speech the word is.

In [None]:
import re
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from helper_functions import get_part_of_speech


text = "So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise. I went to the dentist the other day, and sure enough I saw an angry one jump out of my dentist's bag within minutes of arriving. She hardly even noticed."

cleaned = re.sub(r'\W+', ' ', text)  # \W+ matches runs of non-word characters (raw string avoids an invalid-escape warning)
tokenized = word_tokenize(cleaned)  # split text into words/tokens

# Stemming and Lemmatization
# Example word: are
#    Stemming = are
#    Lemmatization = be

# Stemming strips suffixes using heuristic rules, so the result is not always a real word
# Example: the stem of 'running' is 'run', 'are' stays 'are', and 'many' becomes 'mani'
# Stemming is usually faster than lemmatization
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]

# Lemmatization is the process of converting a word to its base form
# Example: The lemma of the word 'running' is 'run'
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]

print("Stemmed text:")
print(stemmed)
print("\nLemmatized text:")
print(lemmatized)

Stemmed text:
['so', 'mani', 'squid', 'are', 'jump', 'out', 'of', 'suitcas', 'these', 'day', 'that', 'you', 'can', 'bare', 'go', 'anywher', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightli', 'pack', 'valis', 'i', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'i', 'saw', 'an', 'angri', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minut', 'of', 'arriv', 'she', 'hardli', 'even', 'notic']

Lemmatized text:
['So', 'many', 'squid', 'be', 'jump', 'out', 'of', 'suitcase', 'these', 'day', 'that', 'you', 'can', 'barely', 'go', 'anywhere', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightly', 'pack', 'valise', 'I', 'go', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'I', 'saw', 'an', 'angry', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minute', 'of', 'arrive', 'She', 'hardly', 'even', 'notice']
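
The stray `'s'` token in both outputs comes from the cleaning step rather than from the stemmer or lemmatizer: `\W+` turns the apostrophe in "dentist's" into a space before tokenization. A minimal sketch using only the standard library, where `str.split()` stands in for `word_tokenize` (an assumption made to keep the example dependency-free):

```python
import re

sample = "I saw one jump out of my dentist's bag."

# \W+ matches runs of characters that are not letters, digits, or underscores,
# so the apostrophe and the final period are both replaced by a space.
cleaned = re.sub(r"\W+", " ", sample)

# str.split() is a stand-in for word_tokenize (assumption for illustration only).
tokens = cleaned.split()

print(repr(cleaned))
print(tokens)
```

If possessives matter for your task, one option is to tokenize before stripping punctuation, at the cost of handling punctuation tokens yourself.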


## 3. Parsing Text

**Task 1**  
- Run the code to see the silly squid sentences parsed into dependency trees visually!

<br>

**Task 2**  
- Change `my_sentence` to a sentence of your choosing and run the code again to see it parsed out as a tree!

In [39]:
import spacy
from nltk import Tree


squids_text = "So many squids are jumping out of suitcases these days. You can barely go anywhere without seeing one. I went to the dentist the other day. Sure enough, I saw an angry one jump out of my dentist's bag. She hardly even noticed."

# Load the 'en_core_web_sm' model
# More info can be found here: https://spacy.io/models/en#en_core_web_sm
# This pipeline provides tokenization, part-of-speech tagging, dependency parsing, and named entity recognition
# (the small model does not ship with static word vectors)
dependency_parser = spacy.load('en_core_web_sm')
parsed_squids = dependency_parser(squids_text)

my_sentence = "How does spaCy know how to parse this sentence?"
my_parsed_sentence = dependency_parser(my_sentence)

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        parsed_child_nodes = [to_nltk_tree(child) for child in node.children]
        return Tree(node.orth_, parsed_child_nodes)
    else:
        return node.orth_

for sent in parsed_squids.sents:
    print(sent)
    to_nltk_tree(sent.root).pretty_print()

for sent in my_parsed_sentence.sents:
    print(sent)
    to_nltk_tree(sent.root).pretty_print()

So many squids are jumping out of suitcases these days.
        jumping                
  _________|________________    
 |   |   squids    out      |  
 |   |     |        |       |   
 |   |    many      of     days
 |   |     |        |       |   
are  .     So   suitcases these

You can barely go anywhere without seeing one.
          go                       
  ________|____________________     
 |   |    |       |      |  without
 |   |    |       |      |     |    
 |   |    |       |      |   seeing
 |   |    |       |      |     |    
You can barely anywhere  .    one  

I went to the dentist the other day.
          went               
  _________|_________         
 |   |     to        |       
 |   |     |         |        
 |   |  dentist     day      
 |   |     |      ___|____    
 I   .    the   the     other

Sure enough, I saw an angry one jump out of my dentist's bag.
                   saw                           
  __________________|_________                    

## 4. Language Models: Bag-of-Words

**Task 1**  
- We’ve turned a passage from *Through the Looking Glass* by Lewis Carroll into a list of words (aside from stopwords, which we’ve removed) using `nltk` preprocessing.

<br>

**Task 2**  
- Now let’s turn this list into a bag-of-words using `Counter()`!
- Comment out the print statement and set `bag_of_looking_glass_words` equal to a call of `Counter()` on `normalized`. 
- Print `bag_of_looking_glass_words`. 
- What are the most common words?

<br>

**Task 3**  
- Try changing `text` to another string of your choosing and see what happens!

In [43]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter
from texts import looking_glass_text
from helper_functions import get_part_of_speech


text = looking_glass_text
# Task 3: override the passage with a string of your choosing
text = "Such an excellent bag of words and an excellent word 'bags'."

# Clean and tokenize text
cleaned = re.sub(r'\W+', ' ', text).lower()  # \W+ matches runs of non-word characters
tokenized = word_tokenize(cleaned) 

# Remove stop words
# Stop words are words like "a", "the", or "in" which don't convey significant meaning
stop_words = stopwords.words('english')
filtered = list(filter(lambda x: x not in stop_words, tokenized))

# Lemmatize the tokens
# Lemmatization is the process of converting a word to its base form
# Example: The lemma of the word 'running' is 'run'
normalizer = WordNetLemmatizer()
normalized = list(map(lambda x: normalizer.lemmatize(x, get_part_of_speech(x)), filtered))
print(normalized)

# Define bag_of_looking_glass_words & print:
bag_of_looking_glass_words = Counter(normalized)
print(bag_of_looking_glass_words)

['excellent', 'bag', 'word', 'excellent', 'word', 'bag']
Counter({'excellent': 2, 'bag': 2, 'word': 2})


In [47]:
print("Cleaned text (Removing non-word characters):")
print(cleaned, end="\n\n")
print("Tokenized text (Splitting text into tokens):")
print(tokenized, end="\n\n")
print("Filtered text (Removing stop words):")
print(filtered, end="\n\n")
print("Normalized text (Lemmatizing the tokens):")
print(normalized, end="\n\n")

Cleaned text (Removing non-word characters):
such an excellent bag of words and an excellent word bags 

Tokenized text (Splitting text into tokens):
['such', 'an', 'excellent', 'bag', 'of', 'words', 'and', 'an', 'excellent', 'word', 'bags']

Filtered text (Removing stop words):
['excellent', 'bag', 'words', 'excellent', 'word', 'bags']

Normalized text (Lemmatizing the tokens):
['excellent', 'bag', 'word', 'excellent', 'word', 'bag']
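
To make the "bag" part of bag-of-words concrete, here is a small sketch that reuses the normalized tokens printed above. The `shuffled` list is a made-up reordering for illustration; the point is that the counts survive while word order does not.

```python
from collections import Counter

tokens = ["excellent", "bag", "word", "excellent", "word", "bag"]
shuffled = ["word", "word", "bag", "bag", "excellent", "excellent"]  # hypothetical reordering

bag = Counter(tokens)

# A bag-of-words keeps counts but discards word order,
# so both orderings produce the same bag.
print(bag == Counter(shuffled))

# most_common(n) returns the n highest counts; ties keep first-seen order.
print(bag.most_common(2))
```

On a longer passage, `most_common(10)` is a quick way to answer the "most common words" question in Task 2.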



## 5. Language Models: N-Gram and NLM

**Task 1**  
- If you run the code, you’ll see the 10 most commonly used words in *Through the Looking Glass*, counted with NLTK’s `ngrams` function. If you’re thinking this looks like a bag of words, that’s because it is one!

<br>

**Task 2**  
- What do you think the most common phrases in the text are? 
- Let’s find out…
- Where `looking_glass_bigrams` is defined, change the second argument to `2` to see bigrams. 
- Change `n` to `3` for `looking_glass_trigrams` to see trigrams.

<br>

**Task 3**  
- Change `n` to a number greater than `3` for `looking_glass_ngrams`. 
- Try increasing the number.
- At what `n` are you just getting lines from poems repeated in the text? 
- This is where there may be too few examples of each sequence within your training corpus to make any helpful predictions.

In [49]:
import re
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from collections import Counter
from texts import looking_glass_full_text


# Clean and tokenize text
cleaned = re.sub(r'\W+', ' ', looking_glass_full_text).lower()   # \W+ matches runs of non-word characters
tokenized = word_tokenize(cleaned)

# Change the n value to 2:
looking_glass_bigrams = ngrams(tokenized, 2)
looking_glass_bigrams_frequency = Counter(looking_glass_bigrams)

# Change the n value to 3:
looking_glass_trigrams = ngrams(tokenized, 3)
looking_glass_trigrams_frequency = Counter(looking_glass_trigrams)

# Change the n value to a number greater than 3:
n = 7
looking_glass_ngrams = ngrams(tokenized, n)
looking_glass_ngrams_frequency = Counter(looking_glass_ngrams)

print("Looking Glass Bigrams:")
print(looking_glass_bigrams_frequency.most_common(10))

print("\nLooking Glass Trigrams:")
print(looking_glass_trigrams_frequency.most_common(10))

print(f"\nLooking Glass {n} n-grams:")
print(looking_glass_ngrams_frequency.most_common(10))

Looking Glass Bigrams:
[(('of', 'the'), 101), (('said', 'the'), 98), (('in', 'a'), 97), (('in', 'the'), 90), (('as', 'she'), 82), (('you', 'know'), 72), (('a', 'little'), 68), (('the', 'queen'), 67), (('said', 'alice'), 67), (('to', 'the'), 66)]

Looking Glass Trigrams:
[(('the', 'red', 'queen'), 54), (('the', 'white', 'queen'), 31), (('said', 'in', 'a'), 21), (('she', 'went', 'on'), 18), (('said', 'the', 'red'), 17), (('thought', 'to', 'herself'), 16), (('the', 'queen', 'said'), 16), (('said', 'to', 'herself'), 14), (('said', 'humpty', 'dumpty'), 14), (('the', 'knight', 'said'), 14)]

Looking Glass 7 n-grams:
[(('one', 'and', 'one', 'and', 'one', 'and', 'one'), 7), (('and', 'one', 'and', 'one', 'and', 'one', 'and'), 6), (('twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did'), 3), (('brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre'), 3), (('and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and'), 3), (('the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble'), 3), (('slithy'

In [7]:
sentence = "The cat is asleep. The cat purrs."
cleaned = re.sub(r'\W+', ' ', sentence).lower()
tokenized = word_tokenize(cleaned)
Counter(ngrams(tokenized, 3))

Counter({('the', 'cat', 'is'): 1,
         ('cat', 'is', 'asleep'): 1,
         ('is', 'asleep', 'the'): 1,
         ('asleep', 'the', 'cat'): 1,
         ('the', 'cat', 'purrs'): 1})
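
The sliding-window idea behind n-grams can be sketched in plain Python. `simple_ngrams` below is a hypothetical helper written for illustration, not NLTK's implementation, and `str.split()` again stands in for `word_tokenize`. The loop also shows the sparsity problem from Task 3: as `n` grows, fewer sequences repeat.

```python
from collections import Counter


def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, which yields exactly len(tokens) - n + 1 windows.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "the cat is asleep the cat purrs".split()

for n in (1, 2, 3):
    counts = Counter(simple_ngrams(tokens, n))
    repeated = sum(1 for count in counts.values() if count > 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {repeated} seen more than once")
```

With only seven tokens, every trigram is already unique, which mirrors what happens to a full novel at larger `n`.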

## 6. Topic Models

**Task 1**  
- Check out how the bag of words model and tf-idf models stack up when faced with a new Sherlock Holmes text!
- Run the code as is to see what topics they uncover…

<br>

**Task 2**  
- Tf-idf has some interesting findings, but the regular bag of words is full of words that tell us very little about the topic of the texts!
- Let’s fix this. 
- Add some words to `stop_list` that don’t tell you much about the topic and then run your code again. 
- Do this until you have at least 10 words in `stop_list` so that the bag of words LDA model has some interesting topics.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from texts import bohemia_ch1, bohemia_ch2, bohemia_ch3, boscombe_ch1, boscombe_ch2, boscombe_ch3
from helper_functions import preprocess_text


stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()


# preparing the text
corpus = [bohemia_ch1, bohemia_ch2, bohemia_ch3, boscombe_ch1, boscombe_ch2, boscombe_ch3]
preprocessed_corpus = [preprocess_text(chapter, lemmatizer, stop_words) for chapter in corpus]


# Update stop_list:
stop_list = ["say", "see", "holmes", "shall", "man", "upon", "know", "quite", "one", "well", "could", "would", "take", "may", "think", "come", "go", "little", "must", "look"]

# filtering topics for stop words
def filter_out_stop_words(corpus):
    no_stops_corpus = []
    for chapter in corpus:
        no_stops_chapter = " ".join([word for word in chapter.split(" ") if word not in stop_list])
        no_stops_corpus.append(no_stops_chapter)
    return no_stops_corpus
filtered_for_stops = filter_out_stop_words(preprocessed_corpus)

# creating the bag of words model
bag_of_words_creator = CountVectorizer()
bag_of_words = bag_of_words_creator.fit_transform(filtered_for_stops)

# creating the tf-idf model
tfidf_creator = TfidfVectorizer(min_df=0.2)
tfidf = tfidf_creator.fit_transform(preprocessed_corpus)

# creating the bag of words LDA model
lda_bag_of_words_creator = LatentDirichletAllocation(learning_method='online', n_components=10)
lda_bag_of_words = lda_bag_of_words_creator.fit_transform(bag_of_words)

# creating the tf-idf LDA model
lda_tfidf_creator = LatentDirichletAllocation(learning_method='online', n_components=10)
lda_tfidf = lda_tfidf_creator.fit_transform(tfidf)

print("~~~ Topics found by bag of words LDA ~~~")
for topic_id, topic in enumerate(lda_bag_of_words_creator.components_):
    message = "Topic #{}: ".format(topic_id + 1)
    message += " ".join([bag_of_words_creator.get_feature_names_out()[i] for i in topic.argsort()[:-5 :-1]])
    print(message)

print("\n\n~~~ Topics found by tf-idf LDA ~~~")
for topic_id, topic in enumerate(lda_tfidf_creator.components_):
    message = "Topic #{}: ".format(topic_id + 1)
    message += " ".join([tfidf_creator.get_feature_names_out()[i] for i in topic.argsort()[:-5 :-1]])
    print(message)

~~~ Topics found by bag of words LDA ~~~
Topic #1: mccarthy find father hand
Topic #2: find street cry sit
Topic #3: majesty understand touch eye
Topic #4: majesty king photograph sherlock
Topic #5: cry mccarthy turner right
Topic #6: find father case part
Topic #7: son case mccarthy young
Topic #8: street find time two
Topic #9: find mr call make
Topic #10: give leave mccarthy back


~~~ Topics found by tf-idf LDA ~~~
Topic #1: client investigation lens utter
Topic #2: say holmes one upon
Topic #3: mccarthy say father holmes
Topic #4: form resolution ruin retain
Topic #5: complete stand disappear bristol
Topic #6: king majesty holmes photograph
Topic #7: harness depend whatever factor
Topic #8: overpower crime moment recognise
Topic #9: eye pronounce couple easy
Topic #10: agent whether felt james
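
Why does tf-idf tend to surface more distinctive words than raw counts? A rough sketch of the textbook formula on a made-up, pre-tokenized corpus may help. The `docs` list is hypothetical. The formula is also not the one `TfidfVectorizer` applies by default: scikit-learn smooths the idf term and normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tiny "chapters", already tokenized.
docs = [
    ["holmes", "say", "king", "photograph"],
    ["holmes", "say", "mccarthy", "father"],
    ["holmes", "say", "street", "king"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))


def tf_idf(doc):
    """Textbook tf-idf: term count times log(N / document frequency)."""
    counts = Counter(doc)
    return {word: counts[word] * math.log(n_docs / doc_freq[word]) for word in counts}


scores = tf_idf(docs[0])
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

Words that appear in every document score zero under this unsmoothed formula, which is the same effect `stop_list` achieves by hand for the bag-of-words model.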


## 7. Text Similarity

**Task 1**  
- Assign the variable `three_away_from_code` a word with a Levenshtein distance of `3` from “code”.
- Assign `two_away_from_chunk` a word with a Levenshtein distance of `2` from “chunk”.

In [None]:
from nltk.metrics import edit_distance


def print_levenshtein(string1, string2):
    print("The Levenshtein distance from '{0}' to '{1}' is {2}!".format(string1, string2, edit_distance(string1, string2)))

# Check the distance between
# any two words here!
print_levenshtein("fart", "target")

# Assign passing strings here:
three_away_from_code = "order"
two_away_from_chunk = "chunker"

print_levenshtein("code", three_away_from_code)
print_levenshtein("chunk", two_away_from_chunk)

The Levenshtein distance from 'fart' to 'target' is 3!
The Levenshtein distance from 'code' to 'order' is 3!
The Levenshtein distance from 'chunk' to 'chunker' is 2!
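
For readers curious about what `edit_distance` computes, below is a minimal dynamic-programming sketch with unit costs for insertion, deletion, and substitution. The function name `levenshtein` is my own, and this is not NLTK's implementation, but it agrees with the three distances printed above.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


for pair in [("fart", "target"), ("code", "order"), ("chunk", "chunker")]:
    print(pair, levenshtein(*pair))
```

Keeping only two rows of the table keeps memory proportional to the length of the second string.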


## 8. Language Prediction & Text Generation

**Task 1**  
- Add three short stories by your favorite author or the lyrics to three songs by your favorite artist to `document1.py`, `document2.py`, and `document3.py`. 
- Then run to see a short example of text prediction.
- Does it look like something by your favorite author or artist?

In [None]:
import random
import re
from nltk.tokenize import word_tokenize
from collections import defaultdict, deque
from documents import training_doc1, training_doc2, training_doc3


class MarkovChain:
    def __init__(self):
        self.lookup_dict = defaultdict(list)
        self._seeded = False
        self.__seed_me()
    
    def __seed_me(self, rand_seed=None):
        if self._seeded is not True:
            try:
                if rand_seed is not None:
                    random.seed(rand_seed)
                else:
                    random.seed()
                self._seeded = True
            except NotImplementedError:
                self._seeded = False
    
    def add_document(self, text):
        preprocessed_list = self._preprocess(text)
        pairs = self.__generate_tuple_keys(preprocessed_list)
        for pair in pairs:
            self.lookup_dict[pair[0]].append(pair[1])
    
    def _preprocess(self, text):
        cleaned = re.sub(r'\W+', ' ', text).lower()
        tokenized = word_tokenize(cleaned)
        return tokenized
    
    def __generate_tuple_keys(self, data):
        if len(data) < 1:
            return
        
        for i in range(len(data) - 1):
            yield [ data[i], data[i + 1] ]
    
    def generate_text(self, max_length=50):
        context = deque()
        output = []
        if len(self.lookup_dict) > 0:
            # Note: __init__ already seeded the generator, so this call is a no-op
            self.__seed_me(rand_seed=len(self.lookup_dict))
            chain_head = [list(self.lookup_dict)[0]]
            context.extend(chain_head)
            
            while len(output) < (max_length - 1):
                next_choices = self.lookup_dict[context[-1]]
                if len(next_choices) > 0:
                    next_word = random.choice(next_choices)
                    context.append(next_word)
                    output.append(context.popleft())
                else:
                    break
            output.extend(list(context))
        return " ".join(output)

my_markov = MarkovChain()
my_markov.add_document(training_doc1)
my_markov.add_document(training_doc2)
my_markov.add_document(training_doc3)
generated_text = my_markov.generate_text()
print(generated_text)

i find the memories baby it s make up the love sparks will fly they ignite our bones and all the world we light when they strike we ll testify our hearts are you left your love i tried to save us what is over that s beyond us let
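
The heart of the `MarkovChain` class is its lookup table, which maps each word to every word that followed it in the training text. A stripped-down sketch on a one-line corpus is shown here. The corpus is made up, and `str.split()` stands in for the class's preprocessing.

```python
import random
from collections import defaultdict

tokens = "the cat is asleep the cat purrs".split()

# Map each word to the list of words that followed it in the training text.
lookup = defaultdict(list)
for current_word, next_word in zip(tokens, tokens[1:]):
    lookup[current_word].append(next_word)

print(dict(lookup))

# Walk the chain: followers that occur more often are proportionally
# more likely to be drawn, because they appear more often in the list.
random.seed(0)  # fixed seed so the walk is repeatable
word = tokens[0]
walk = [word]
while lookup.get(word) and len(walk) < 8:
    word = random.choice(lookup[word])
    walk.append(word)

print(" ".join(walk))
```

The walk stops early whenever it reaches a word with no recorded follower, such as "purrs" here, which matches the `break` branch in `generate_text`.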


## 9. Advanced NLP Topics

**Task 1**  
- Assign `review` a string with a brief review of this lesson so far. 
- Next, run your code. Is the Naive Bayes Classifier accurately classifying your review?

```python
from reviews import counter, training_counts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB



# Add your review:
review = "This was a bad movie"
review_counts = counter.transform([review])

classifier = MultinomialNB()
training_labels = [0] * 1000 + [1] * 1000

classifier.fit(training_counts, training_labels)

neg = (classifier.predict_proba(review_counts)[0][0] * 100).round()
pos = (classifier.predict_proba(review_counts)[0][1] * 100).round()

if pos > 50:
    print("Thank you for your positive review!")
elif neg > 50:
    print("We're sorry this hasn't been the best possible lesson for you! We're always looking to improve.")
else:
    print("Naive Bayes cannot determine if this is negative or positive. Thank you or we're sorry?")

print("\nAccording to our trained Naive Bayes classifier, the probability that your review was negative was {0}% and the probability it was positive was {1}%.".format(neg, pos))


# Output
# We're sorry this hasn't been the best possible lesson for you! We're always looking to improve.
#
# According to our trained Naive Bayes classifier, the probability that your review was negative was 67.0% and the probability it was positive was 33.0%.
```
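
To see roughly what `MultinomialNB` does with the word counts, here is a from-scratch sketch with add-one (Laplace) smoothing. The four training reviews are invented for illustration, since the contents of the `reviews` module are not shown, and `str.split()` stands in for the vectorizer's tokenization. As in the cell above, label `0` means negative and `1` means positive.

```python
import math
from collections import Counter

# Hypothetical toy training set: label 0 = negative, 1 = positive.
training = [
    ("a bad boring movie", 0),
    ("terrible plot and bad acting", 0),
    ("a great fun movie", 1),
    ("wonderful acting and a great plot", 1),
]

word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocabulary = set(word_counts[0]) | set(word_counts[1])


def predict_proba(text):
    """Return {label: probability} using add-one (Laplace) smoothing."""
    log_scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / len(training))  # class prior
        for word in text.split():
            if word in vocabulary:  # words never seen in training are ignored
                smoothed = word_counts[label][word] + 1
                score += math.log(smoothed / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Convert log scores back into probabilities that sum to 1.
    normalizer = sum(math.exp(score) for score in log_scores.values())
    return {label: math.exp(score) / normalizer for label, score in log_scores.items()}


probabilities = predict_proba("this was a bad movie")
print(probabilities)
```

Because "bad" appears only in the negative reviews, the negative class wins here. With so little training data a single word can swing the result, which is one reason short reviews are sometimes misclassified.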