# [Getting Started with NLP](https://dphi.tech/bootcamps/getting-started-with-natural-language-processing?utm_source=header)
by [CSpanias](https://cspanias.github.io/aboutme/), 28/01 - 06/02/2022 <br>

Bootcamp organized by **[DPhi](https://dphi.tech/community/)**, lectures given by [**Dipanjan (DJ) Sarkar**](https://www.linkedin.com/in/dipanzan/) ([GitHub repo](https://github.com/dipanjanS/nlp_essentials)) <br>

## Fundamental Tutorials for NLP:
* [NLTK Book](https://www.nltk.org/book/)
* [spaCy Tutorials](https://course.spacy.io/en/chapter1)

# CONTENT

1. Text Wrangling
    1. [Step-by-Step](#Steps)
        1. [Tokenization](#Tokenization)
        2. [Removing HTML tags & Noise](#Noise)
        3. [Removing Accented Characters](#Accented)
        4. [Removing Special Characters, Numbers and Symbols](#Special)
        5. [Expanding Contractions](#Contractions)
        6. [Stemming](#Stemming)
        7. [Lemmatization](#Lemmatization)
        8. [Stopword Removal](#Stopwords)
    2. [Automate Lemmatization](#AutoLemm)
        1. [Lemmatization with NLTK](#NLTKProcess)
        2. [Lemmatization with spaCy](#SpacyProcess)

In [1]:
# Import NLTK; on first run, uncomment the lines below to download the NLTK data and install the extra packages
import nltk
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger')
#!pip install contractions
#!pip install textsearch

<a name="Steps"></a>
# 1.1 Steps

<a name="Tokenization"></a>
## 1.1.1 Tokenization
The process of **splitting a string** into a list of **sentences** or **words** (tokens).

In [7]:
# build the sample text as one string from adjacent string literals
sample_text = ("US unveils world's most powerful supercomputer, beats China. " 
               "The US has unveiled the world's most powerful supercomputer called 'Summit', " 
               "beating the previous record-holder China's Sunway TaihuLight. With a peak performance "
               "of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, "
               "which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, "
               "which reportedly take up the size of two tennis courts.")

# print sample text
print("This is our sample text of type {}:\n\n{}".format(type(sample_text), sample_text))

This is our sample text of type <class 'str'>:

US unveils world's most powerful supercomputer, beats China. The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts.


### Tokenization with NLTK
Split a string into a list of **sentences**.

**`from nltk.tokenize import sent_tokenize`**

In [18]:
# split a string into sentences and print first sentence
print(nltk.sent_tokenize(sample_text)[0])

US unveils world's most powerful supercomputer, beats China.


Split a string into a list of **words**.

**`from nltk.tokenize import word_tokenize`**

In [21]:
# split a string into tokens and print the first 10 tokens
print(nltk.word_tokenize(sample_text)[:10])

['US', 'unveils', 'world', "'s", 'most', 'powerful', 'supercomputer', ',', 'beats', 'China']


### Tokenization with spaCy

Load a **spaCy pipeline**.

**`spacy.load('en_core_web_sm')`**

In [44]:
import spacy

# load a pipeline using the name of an installed package
nlp = spacy.load('en_core_web_sm')

Create an **NLP object** (a spaCy `Doc`) by calling the pipeline on the text.

**`nlp(text)`**

In [51]:
# create an NLP construct
text_spacy = nlp(sample_text)

# print NLP construct
print("The first 5 tokens are: \"{}\".\n".format(text_spacy[:5]))

# check type
print("The type of the NLP construct is: {}.\n".format(type(text_spacy)))

# check length
print("The length of the NLP construct is: {}.\n".format(len(text_spacy)))

The first 5 tokens are: "US unveils world's most".

The type of the NLP construct is: <class 'spacy.tokens.doc.Doc'>.

The length of the NLP construct is: 84.



Tokenize a string into **sentences**.

**`obj.text for obj in text_object.sents`**

In [49]:
# tokenize text into sentences
print([obj.text for obj in text_spacy.sents][:2])

["US unveils world's most powerful supercomputer, beats China.", "The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight."]


Tokenize a string into **words**.

**`obj.text for obj in text_object`**

In [50]:
# tokenize text into words
print([obj.text for obj in text_spacy][:15])

['US', 'unveils', 'world', "'s", 'most', 'powerful', 'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has', 'unveiled']
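
To make the two kinds of splitting concrete, here is a rough sketch that uses only Python's `re` module. It is not how NLTK or spaCy tokenize (both handle abbreviations, clitics such as `'s`, numbers like `4,608`, and many other cases); the helper names `naive_sent_tokenize` and `naive_word_tokenize` are made up for this illustration.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (no abbreviation handling)."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]

def naive_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "US unveils supercomputer, beats China. Summit has 4,608 servers."
sentences = naive_sent_tokenize(text)
words = naive_word_tokenize(sentences[0])

print(sentences)
print(words)
```

Note that this naive word pattern would split `4,608` into three tokens, which is one reason to prefer a library tokenizer on real text.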


<a name="Noise"></a>
## 1.1.2 Removing HTML tags & Noise

In [54]:
import requests

# request information from a site
data = requests.get('http://www.gutenberg.org/cache/epub/8001/pg8001.html')

# get the text of the content
content = data.text

# print a sample
print(content[2745:2900])

************************* */
hr {
    width: 45%;
    /* adjust to ape original work */
    margin-top: 1em;
    /* space above & below */
    margin


In [56]:
import re
from bs4 import BeautifulSoup

# function to remove HTML tags
def strip_html_tags(text):
    """Remove HTML tags and get just the text of a request."""
    
    # instantiate BeautifulSoup
    soup = BeautifulSoup(text, "html.parser")
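    # drop <iframe> and <script> elements together with their contents
    for tag in soup(['iframe', 'script']):
        tag.extract()
    # get just the text without HTML tags
    stripped_text = soup.get_text()
    # collapse runs of carriage returns / newlines into a single newline
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    # return the cleaned text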
    return stripped_text

# call function passing our request
clean_content = strip_html_tags(content)
# print sample text
print(clean_content[1163:1957])

d of schedule]
[This file was first posted on June 7, 2003]
Edition: 10
Language: English
*** START OF THE PROJECT GUTENBERG EBOOK, THE BIBLE, KING JAMES, BOOK 1***
This eBook was produced by David Widger
with the help of Derek Andrew's text from January 1992
and the work of Bryan Taylor in November 2002.
Book 01        Genesis
01:001:001 In the beginning God created the heaven and the earth.
01:001:002 And the earth was without form, and void; and darkness was
           upon the face of the deep. And the Spirit of God moved upon
           the face of the waters.
01:001:003 And God said, Let there be light: and there was light.
01:001:004 And God saw the light, that it was good: and God divided the
           light from the darkness.
01:001:005 And God called the light Day, and the


<a name="Accented"></a>
## 1.1.3 Removing Accented Characters

**`unicodedata`** info [here](https://docs.python.org/3/library/unicodedata.html). <br>
General info about how to work with **Unicode** [here](https://docs.python.org/3/howto/unicode.html).

In [57]:
import unicodedata

def remove_accented_chars(text):
    """Remove accented characters from the text."""
    
    # decompose accented characters (NFKD), drop the non-ASCII combining marks, then decode back to str
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    
    # return clean text
    return text

In [60]:
# create text with accented chars
s = 'Sómě Áccěntěd těxt'
# call function to clean text
print(remove_accented_chars(s))

Some Accented text
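
A short sketch of what the NFKD step does, using only `unicodedata`. The helper `strip_combining_marks` is a hypothetical alternative written for this illustration: it removes only the combining marks, whereas the ASCII round trip used in `remove_accented_chars` also discards any character that has no ASCII equivalent.

```python
import unicodedata

def strip_combining_marks(text):
    """Decompose characters (NFKD) and drop only the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# an accented letter decomposes into a base letter plus a combining accent
names = [unicodedata.name(ch) for ch in unicodedata.normalize("NFKD", "\u00e9")]
print(names)

stripped = strip_combining_marks("Sómě Áccěntěd těxt")
print(stripped)

# 'ø' has no decomposition, so the ASCII round trip drops it entirely
ascii_only = unicodedata.normalize("NFKD", "Søren").encode("ascii", "ignore").decode("ascii")
marks_only = strip_combining_marks("Søren")
print(ascii_only, marks_only)
```

Which behaviour is preferable depends on whether the later steps expect pure ASCII.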


<a name="Special"></a>
## 1.1.4 Removing Special Characters, Numbers and Symbols

In [63]:
import re

def remove_special_characters(text, remove_digits=False):
    """Remove special characters from a text using regex."""
    
    # create regex pattern
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    # remove text that matches pattern
    text = re.sub(pattern, '', text)
    # return text without pattern
    return text

# create text with special characters
s = "Well this was fun! See you at 7:30, What do you think!!? #$@@9318@ 🙂🙂🙂"
# call function to clean text
print(remove_special_characters(s, remove_digits=True), "\n")
# call function to clean text, but keep numbers
print(remove_special_characters(s))

Well this was fun See you at  What do you think   

Well this was fun See you at 730 What do you think 9318 
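
The outputs above keep the spaces that surrounded the removed characters, so double spaces and trailing whitespace remain. A possible follow-up step, sketched here with a hypothetical helper `normalize_whitespace` (not part of the original notebook), collapses them:

```python
import re

def remove_special_characters(text, remove_digits=False):
    """Same idea as the function above: keep letters, whitespace and optionally digits."""
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    return re.sub(pattern, "", text)

def normalize_whitespace(text):
    """Collapse any run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

raw = "See you at 7:30, What do you think!!? #$@@9318@"
cleaned = remove_special_characters(raw, remove_digits=True)
normalized = normalize_whitespace(cleaned)

print(repr(cleaned))
print(repr(normalized))
```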


<a name="Contractions"></a>
## 1.1.5 Expanding Contractions
**`import contractions`**

In [69]:
# create text with contractions
s = "Y'all can't expand contractions I'd think! You wouldn't be able to. How'd you do it?"

import contractions

# check the first 10 contraction pairs
print(list(contractions.contractions_dict.items())[:10], "\n")

# expand contractions
print("Original text:\n{}\n".format(s))
print("Expanded text:\n{}".format(contractions.fix(s)))

[("I'm", 'I am'), ("I'm'a", 'I am about to'), ("I'm'o", 'I am going to'), ("I've", 'I have'), ("I'll", 'I will'), ("I'll've", 'I will have'), ("I'd", 'I would'), ("I'd've", 'I would have'), ('Whatcha', 'What are you'), ("amn't", 'am not')] 

Original text:
Y'all can't expand contractions I'd think! You wouldn't be able to. How'd you do it?

Expanded text:
You all cannot expand contractions I would think! You would not be able to. How did you do it?
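
Conceptually, contraction expansion is a dictionary lookup applied through a regular expression. The sketch below shows that idea with a three-entry toy dictionary; `TOY_CONTRACTIONS` and `expand_toy_contractions` are made-up names, and this is not the implementation used by the `contractions` package. Note that some contractions are ambiguous (`I'd` can mean "I would" or "I had"), so any fixed mapping has to pick one reading.

```python
import re

# toy mapping for illustration only
TOY_CONTRACTIONS = {"can't": "cannot", "wouldn't": "would not", "how'd": "how did"}

_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in TOY_CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_toy_contractions(text):
    """Replace every known contraction, preserving a leading capital letter."""
    def replace(match):
        found = match.group(0)
        expanded = TOY_CONTRACTIONS[found.lower()]
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _pattern.sub(replace, text)

expanded_text = expand_toy_contractions("You wouldn't be able to. How'd you do it?")
print(expanded_text)
```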


<a name="Stemming"></a>
## 1.1.6 Stemming
**`from nltk.stem import PorterStemmer`**

In [78]:
from nltk.stem import PorterStemmer

# instantiate Stemmer
ps = PorterStemmer()

# apply stemmer
print('Original text: "jumping"\nStemmed text: "{}"\n'.format(ps.stem('jumping')))
print('Original text: "jumps"\nStemmed text: "{}"\n'.format(ps.stem('jumps'))) 
print('Original text: "jumped"\nStemmed text: "{}"\n'.format(ps.stem('jumped')))
print('Original text: "strange"\nStemmed text: "{}"\n'.format(ps.stem('strange')))
print('Original text: "lying"\nStemmed text: "{}"\n'.format(ps.stem('lying')))

Original text: "jumping"
Stemmed text: "jump"

Original text: "jumps"
Stemmed text: "jump"

Original text: "jumped"
Stemmed text: "jump"

Original text: "strange"
Stemmed text: "strang"

Original text: "lying"
Stemmed text: "lie"
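
Stemming is rule-based suffix stripping, which is why it can produce non-words such as "strang". The sketch below is a deliberately naive suffix stripper, not the Porter algorithm (which applies several ordered rule phases); `naive_stem` and its `min_stem` parameter are hypothetical and exist only to show the idea and its limits.

```python
def naive_stem(word, min_stem=3):
    """Strip one common suffix if at least `min_stem` characters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for word in ("jumping", "jumps", "jumped", "sing", "running"):
    print(word, "->", naive_stem(word))
```

The last case yields "runn", showing that plain suffix stripping needs extra rules (here, undoing the doubled consonant) before it becomes useful.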



<a name="Lemmatization"></a>
## 1.1.7 Lemmatization
**`from nltk.stem import WordNetLemmatizer`**

In [84]:
from nltk.stem import WordNetLemmatizer

# instantiate Lemmatizer
wnl = WordNetLemmatizer()

# lemmatize nouns
print("Lemmatize nouns ('n'):")
print('Original text: "cars"\nLemmatized text: "{}"\n'.format(wnl.lemmatize('cars', 'n')))
print('Original text: "boxes"\nLemmatized text: "{}"\n'.format(wnl.lemmatize('boxes', 'n')))

# lemmatize verbs
print("Lemmatize verbs ('v'):")
print('Original text: "running"\nLemmatized text: "{}"\n'.format(wnl.lemmatize('running', 'v')))
print('Original text: "ate"\nLemmatized text: "{}"\n'.format(wnl.lemmatize('ate', 'v')))

# lemmatize adjectives
print("Lemmatize adjectives ('a'):")
print('Original text: "saddest"\nLemmatized text: "{}"\n'.format(wnl.lemmatize('saddest', 'a')))
print('Original text: "fancier"\nLemmatized text: "{}"\n'.format(wnl.lemmatize('fancier', 'a')))

Lemmatize nouns ('n'):
Original text: "cars"
Lemmatized text: "car"

Original text: "boxes"
Lemmatized text: "box"

Lemmatize verbs ('v'):
Original text: "running"
Lemmatized text: "run"

Original text: "ate"
Lemmatized text: "eat"

Lemmatize adjectives ('a'):
Original text: "saddest"
Lemmatized text: "sad"

Original text: "fancier"
Lemmatized text: "fancy"
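
The second argument matters because the same surface form can have different lemmas depending on its part of speech; `WordNetLemmatizer.lemmatize` treats a word as a noun (`'n'`) when no tag is given and returns the word unchanged when it finds no lemma. The toy lookup below mimics only that interface; the table `TOY_LEMMAS` and the function `toy_lemmatize` are invented for this illustration and have nothing to do with WordNet's actual data.

```python
# (word, pos) -> lemma; a hand-written table for illustration only
TOY_LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("leaves"))        # treated as a noun by default
print(toy_lemmatize("leaves", "v"))   # treated as a verb
print(toy_lemmatize("quickly", "r"))  # unknown entry, returned unchanged
```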



<a name="Stopwords"></a>
## 1.1.8 Stopword Removal
**`stop_words = nltk.corpus.stopwords.words('english')`**

In [109]:
def remove_stopwords(text, is_lower_case=False, stopwords=None):
    """Remove stopwords from a text; pass is_lower_case=True if the text is already lowercased."""
    if not stopwords:
        stopwords = nltk.corpus.stopwords.words('english')
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text


stop_words = nltk.corpus.stopwords.words('english')
print("The first 10 words of the stopwords list are:\n{}\n".format(stop_words[:10]))

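# create text as a single string
s = 'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'
# call function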
print('Original text:\n{}\n'.format(s))
print('Text with no stopwords:\n{}'.format(remove_stopwords(s, is_lower_case=False)))

The first 10 words of the stopwords list are:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Original text:
The brown foxes are quick and they are jumping over the sleeping lazy dogs!

Text with no stopwords:
brown foxes quick jumping sleeping lazy dogs !


We can **remove words from and/or add words to** the list as required.

In [108]:
# assign default stopword list to a variable
stop_words = nltk.corpus.stopwords.words('english')
# remove the word 'the'
stop_words.remove('the')
# add the word 'brown'
stop_words.append('brown')

# call function
print('Original text:\n{}\n'.format(s))
print('Text with no stopwords:\n{}'.format(remove_stopwords(s, is_lower_case=False, stopwords=stop_words)))

Original text:
The brown foxes are quick and they are jumping over the sleeping lazy dogs!

Text with no stopwords:
The foxes quick jumping the sleeping lazy dogs !
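
The same filtering logic can be tried without downloading the NLTK corpus. This sketch assumes a tiny hand-picked stopword set (`TOY_STOPWORDS`, not NLTK's list) and a regex tokenizer in place of `nltk.word_tokenize`; using a `set` rather than a list also makes each membership test constant-time on average.

```python
import re

# tiny hand-picked set for illustration; NLTK's English list is much longer
TOY_STOPWORDS = {"the", "are", "and", "they", "over"}

def remove_toy_stopwords(text, stopwords=TOY_STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(token for token in tokens if token.lower() not in stopwords)

sentence = "The brown foxes are quick and they are jumping over the sleeping lazy dogs!"

# default set
default_result = remove_toy_stopwords(sentence)
# customised set: keep 'the', drop 'brown'
custom_stopwords = (TOY_STOPWORDS - {"the"}) | {"brown"}
custom_result = remove_toy_stopwords(sentence, stopwords=custom_stopwords)

print(default_result)
print(custom_result)
```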


<a name="AutoLemm"></a>
# 1.2 Automate Lemmatization 

<a name="NLTKProcess"></a>
## 1.2.1 Lemmatization with NLTK

### Process
`tokenization` &rarr; `POS-tagging` &rarr; `WordNet-tagging` &rarr; `lemmatization` <br>

### Corresponding functions
`word_tokenize(text)` &rarr; `pos_tag(tokens)` &rarr; `pos_tag_wordnet(tagged_tokens)` &rarr; `WordNetLemmatizer().lemmatize(word, tag)` for each tagged token

In [1]:
# create text as a single string
s = 'The brown foxes are quick and they are jumping over the sleeping lazy dogs!'
print('Original text:\n{}\n'.format(s))

# tokenize string
tokens = nltk.word_tokenize(s)
print('Tokenized text:\n{}\n'.format(tokens))

# POS tagging
tagged_tokens = nltk.pos_tag(tokens)
print('POS-tagged tokens:\n{}\n'.format(tagged_tokens))

# convert tags to WordNet form
from nltk.corpus import wordnet

def pos_tag_wordnet(tagged_tokens):
    """Convert POS-tagged tokens to WordNet form tags."""
    tag_map = {'j': wordnet.ADJ, 'v': wordnet.VERB, 'n': wordnet.NOUN, 'r': wordnet.ADV}
    # if a word does not belong to one of the 4 categories make it a NOUN
    new_tagged_tokens = [(word, tag_map.get(tag[0].lower(), wordnet.NOUN)) for word, tag in tagged_tokens]
    return new_tagged_tokens

# call function
wordnet_tokens = pos_tag_wordnet(tagged_tokens)
print("WordNet-tagged tokens:\n{}\n".format(wordnet_tokens))

# lemmatize tokens
lemmatized_text = " ".join(wnl.lemmatize(word, tag) for word, tag in wordnet_tokens)
print("Lemmatized text:\n{}".format(lemmatized_text))

Original text:
The brown foxes are quick and they are jumping over the sleeping lazy dogs!




### Define a function that performs lemmatization using NLTK and WordNet.

In [98]:
def wordnet_lemmatize_text(text):
    """Lemmatize a single string of text."""
    
    # tokenize and POS-tag tokens
    tagged_tokens = nltk.pos_tag(nltk.word_tokenize(text))
    # convert tags into WordNet tags
    wordnet_tokens = pos_tag_wordnet(tagged_tokens)
    # lemmatize tagged tokens and join words back to a single string
    lemmatized_text = ' '.join(wnl.lemmatize(word, tag) for word, tag in wordnet_tokens)
    # return the lemmatized string
    return lemmatized_text

# call function
print('Original text:\n{}\n'.format(s))
print('Lemmatized text:\n{}'.format(wordnet_lemmatize_text(s)))

Original text:
The brown foxes are quick and they are jumping over the sleeping lazy dogs!

Lemmatized text:
The brown fox be quick and they be jump over the sleep lazy dog !
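
The tag-conversion step is the least obvious part of this pipeline, so here it is in isolation, runnable without any NLTK data. The Penn Treebank tags below are written by hand for illustration (they are not the output of `nltk.pos_tag`), and the single-letter values are the ones behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`. The names `WORDNET_POS` and `to_wordnet_pos` are hypothetical.

```python
# first letter of a Penn Treebank tag -> WordNet POS letter
WORDNET_POS = {"j": "a", "v": "v", "n": "n", "r": "r"}

def to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return WORDNET_POS.get(penn_tag[0].lower(), "n")

# hand-tagged tokens, assumed for this example
hand_tagged = [("foxes", "NNS"), ("are", "VBP"), ("quick", "JJ"),
               ("jumping", "VBG"), ("over", "IN")]

converted = [(word, to_wordnet_pos(tag)) for word, tag in hand_tagged]
print(converted)
```

Tags outside the four categories (such as `IN` for prepositions) fall back to noun, which in practice leaves those words unchanged.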


<a name="SpacyProcess"></a>
## 1.2.2 Lemmatization with spaCy
**`token.lemma_`** [Info](https://spacy.io/api/lemmatizer) <br>
**`token.text`** [Info](https://spacy.io/api/token)

In [111]:
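def spacy_lemmatize_text(text):
    """Lemmatize a single string of text using spaCy."""
    # create an NLP object
    text = nlp(text)
    # lemmatize each token (keep the original text where spaCy v2 returns the '-PRON-' placeholder) & join tokens back into a string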
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    # return lemmatized string
    return text

# call function
print('Original text:\n{}\n'.format(s))
print('Lemmatized text:\n{}'.format(spacy_lemmatize_text(s)))

Original text:
The brown foxes are quick and they are jumping over the sleeping lazy dogs!

Lemmatized text:
the brown fox be quick and they be jump over the sleep lazy dog !
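
A quick way to compare the two lemmatizers is to line up their outputs token by token. The sketch below hard-codes the two result strings printed in this section, so it runs without NLTK or spaCy; the variable names are chosen for this example only.

```python
# outputs copied from the two lemmatization cells in this section
nltk_output = "The brown fox be quick and they be jump over the sleep lazy dog !"
spacy_output = "the brown fox be quick and they be jump over the sleep lazy dog !"

# pair tokens by position and keep only the positions that differ
differences = [
    (nltk_token, spacy_token)
    for nltk_token, spacy_token in zip(nltk_output.split(), spacy_output.split())
    if nltk_token != spacy_token
]
print(differences)
```

For this sentence the only difference is the casing of the first token.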
