# Introduction to Natural Language Processing for Text

**Goal**: Learn the basic techniques to extract features from some text, so you can use these features as input for machine learning models.

## 1. What is NLP (Natural Language Processing)?

NLP is a subfield of computer science and artificial intelligence concerned with interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to **text** and **speech**.

The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.

Examples:
    <li>Speech/Audio Recognition - Shazam</li> 
    <li>Autocomplete - MS Word</li> 
    <li>Question Answering - Zuri</li>
    <li>Text Recognition - CamScanner</li>
    <li>Sentiment Analysis - Twitter</li>
    <li>Topic Modeling - Quora</li>
    <li>Email Filtering/Spam Detection - Gmail</li> 
    <li>Language Translation - Google Translate</li>
    <li>Document Summarization - Associated Press</li> 

## 2. Introduction to the NLTK library for Python

In [2]:
!pip install nltk





In [6]:
import nltk
#nltk.download("popular")

NLTK (<b>Natural Language Toolkit</b>) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to <b>many corpora</b> and <b>lexical resources</b>. Also, it contains a suite of <b>text processing libraries</b> for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Best of all, NLTK is a free, open source, community-driven project.

We’ll use this toolkit to demonstrate some basics of natural language processing. The examples below assume that NLTK has already been imported with `import nltk`.

## 3. The Basics of NLP for Text

    1. Sentence Tokenization
    2. Word Tokenization
    3. Text Lemmatization and Stemming
    4. Stop Words
    5. Regex
    6. Bag-of-Words
    7. TF-IDF
    8. Part of Speech tagging

### i. Sentence Tokenization

Sentence tokenization (also called sentence segmentation) is the problem of <b>dividing a string of written language into its component sentences</b>.

In [9]:
# Get a paragraph of text
text = """To dream big, the family had to start small.
That meant looking at what was beneath their feet. 
When the land was bought, much of it was overgrazed, with barren patches and gullies eroded in the earth.
Fences were removed along with the livestock, and the rewilding effort began literally at grassroots level.
"Despite being a semi-arid region, there's a remarkable amount of biodiversity, particularly endemic plants," 
says Isabelle, adding that five of South Africa's nine types of plant habitat exist within Samara.

"""

In [10]:
from nltk.tokenize import sent_tokenize

In [11]:
# Perform sentence segmentation
sentences = nltk.sent_tokenize(text)
sentences

['To dream big, the family had to start small.',
 'That meant looking at what was beneath their feet.',
 'When the land was bought, much of it was overgrazed, with barren patches and gullies eroded in the earth.',
 'Fences were removed along with the livestock, and the rewilding effort began literally at grassroots level.',
 '"Despite being a semi-arid region, there\'s a remarkable amount of biodiversity, particularly endemic plants," \nsays Isabelle, adding that five of South Africa\'s nine types of plant habitat exist within Samara.']
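To see why a trained tokenizer such as `sent_tokenize` is worth using, here is a minimal sketch of a naive splitter built only on Python's `re` module. The helper name `naive_sent_split` and the sample strings are assumptions made for illustration; they are not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

simple = "To dream big, the family had to start small. That meant looking at what was beneath their feet."
simple_sentences = naive_sent_split(simple)
print(simple_sentences)

# The same rule breaks on abbreviations, because it treats every period as a boundary.
tricky = "Dr. Smith visited Samara. He stayed a week."
tricky_sentences = naive_sent_split(tricky)
print(tricky_sentences)
```

The first string splits cleanly into two sentences, while the second is cut after "Dr.". Handling abbreviations, quotes and similar cases is what the pretrained model behind `sent_tokenize` is designed for.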

### ii. Word Tokenization

Word tokenization (also called word segmentation) is the problem of dividing a string of written language into its component words.

In [12]:
from nltk.tokenize import word_tokenize 

In [14]:
word_tok = nltk.word_tokenize(text)
word_tok

['To',
 'dream',
 'big',
 ',',
 'the',
 'family',
 'had',
 'to',
 'start',
 'small',
 '.',
 'That',
 'meant',
 'looking',
 'at',
 'what',
 'was',
 'beneath',
 'their',
 'feet',
 '.',
 'When',
 'the',
 'land',
 'was',
 'bought',
 ',',
 'much',
 'of',
 'it',
 'was',
 'overgrazed',
 ',',
 'with',
 'barren',
 'patches',
 'and',
 'gullies',
 'eroded',
 'in',
 'the',
 'earth',
 '.',
 'Fences',
 'were',
 'removed',
 'along',
 'with',
 'the',
 'livestock',
 ',',
 'and',
 'the',
 'rewilding',
 'effort',
 'began',
 'literally',
 'at',
 'grassroots',
 'level',
 '.',
 '``',
 'Despite',
 'being',
 'a',
 'semi-arid',
 'region',
 ',',
 'there',
 "'s",
 'a',
 'remarkable',
 'amount',
 'of',
 'biodiversity',
 ',',
 'particularly',
 'endemic',
 'plants',
 ',',
 "''",
 'says',
 'Isabelle',
 ',',
 'adding',
 'that',
 'five',
 'of',
 'South',
 'Africa',
 "'s",
 'nine',
 'types',
 'of',
 'plant',
 'habitat',
 'exist',
 'within',
 'Samara',
 '.']

In [15]:
for sentence in sentences:
    word = nltk.word_tokenize(sentence)
    print(word)

['To', 'dream', 'big', ',', 'the', 'family', 'had', 'to', 'start', 'small', '.']
['That', 'meant', 'looking', 'at', 'what', 'was', 'beneath', 'their', 'feet', '.']
['When', 'the', 'land', 'was', 'bought', ',', 'much', 'of', 'it', 'was', 'overgrazed', ',', 'with', 'barren', 'patches', 'and', 'gullies', 'eroded', 'in', 'the', 'earth', '.']
['Fences', 'were', 'removed', 'along', 'with', 'the', 'livestock', ',', 'and', 'the', 'rewilding', 'effort', 'began', 'literally', 'at', 'grassroots', 'level', '.']
['``', 'Despite', 'being', 'a', 'semi-arid', 'region', ',', 'there', "'s", 'a', 'remarkable', 'amount', 'of', 'biodiversity', ',', 'particularly', 'endemic', 'plants', ',', "''", 'says', 'Isabelle', ',', 'adding', 'that', 'five', 'of', 'South', 'Africa', "'s", 'nine', 'types', 'of', 'plant', 'habitat', 'exist', 'within', 'Samara', '.']
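For comparison, word tokenization can be roughly approximated with a single regular expression. The sketch below uses only the standard library; `naive_word_tokenize` is a hypothetical helper, and its rules are a simplification rather than what NLTK does internally.

```python
import re

def naive_word_tokenize(text):
    """Return words (keeping internal hyphens and apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

tokens = naive_word_tokenize("Despite being a semi-arid region, there's a lot of biodiversity.")
print(tokens)
```

Note one difference from the NLTK output above: this pattern keeps `there's` as a single token, whereas `word_tokenize` splits it into `there` and `'s`. Which behaviour is preferable depends on the downstream task.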


### iii. Text Lemmatization and Stemming

For grammatical reasons, documents can contain different forms of a word such as drive, drives, driving. Also, sometimes we have related words with a similar meaning, such as nation, national, nationality.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

Examples:

    am, are, is => be
    dog, dogs, dog’s, dogs’ => dog

<li>Stemming usually refers to a crude <b>heuristic process</b> that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.</li>

<li>Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.</li>

Examples:

    The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
    The word “play” is the base form for the word “playing”, and hence this is matched in both stemming and lemmatization.
    The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context; e.g., “in our last meeting” or “We are meeting again tomorrow”. Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.
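
The difference between the two approaches can be sketched with a toy example. Everything here is an assumption made for illustration: `toy_stem` applies three hard-coded suffix rules and `TOY_LEMMAS` is a four-entry hand-written dictionary. Real stemmers and lemmatizers, such as the NLTK classes used later in this section, are far more complete.

```python
def toy_stem(word):
    """Crude suffix stripping: no dictionary, just rules (illustrative only)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping (word, part of speech) to a lemma.
TOY_LEMMAS = {
    ("better", "adj"): "good",
    ("playing", "verb"): "play",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}

def toy_lemmatize(word, pos):
    """Dictionary look-up that falls back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

examples = [("better", "adj"), ("playing", "verb"), ("meeting", "noun"), ("meeting", "verb")]
for word, pos in examples:
    print(word, pos, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word, pos))
```

The stemmer maps "meeting" to "meet" regardless of context and leaves "better" untouched, while the dictionary look-up can return "good" for "better" and can distinguish the noun from the verb.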


In [5]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               """

In [29]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

In [30]:
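# lemmatize() defaults to pos='n' (noun), so the verb form 'going' comes back unchanged;
# lemmatizer.lemmatize('going', pos='v') would return 'go'
lemmatizer.lemmatize('going')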

'going'

In [28]:
stemmer.stem('going')

'go'

**Exercise**: Create a function that takes a word and prints both its stem and its lemma.

In [31]:
def stem_lemma(word):
    """
    Print the stem and the lemma of the given word, using the module-level
    stemmer (PorterStemmer) and lemmatizer (WordNetLemmatizer).
    """
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word))
    print()

In [19]:
stem_lemma('going')

Stemmer: go
Lemmatizer: going


In [32]:
stem_lemma('create')

Stemmer: creat
Lemmatizer: create



### iv. Stop words

Stop words are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise. That’s why we want to remove these irrelevant words.

Stop words usually refer to the most common words such as “and”, “the”, “a” in a language, but there is no single universal list of stopwords. The list of the stop words can change depending on your application.
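
As a minimal sketch of how the list can depend on the application, the example below uses a tiny hand-written stop list (an assumption for illustration; NLTK's English list is much longer). For a task such as sentiment analysis, a negation like "not" carries meaning, so we may want to keep it.

```python
# A small, hand-written stop list (assumed for this sketch).
base_stop_words = {"the", "a", "and", "is", "not", "this", "to"}

# For sentiment analysis, negations carry meaning, so we take "not" out of the list.
sentiment_stop_words = base_stop_words - {"not"}

tokens = "this is not the best way to learn".split()

filtered_default = [t for t in tokens if t not in base_stop_words]
filtered_sentiment = [t for t in tokens if t not in sentiment_stop_words]

print(filtered_default)
print(filtered_sentiment)
```

With the base list the negation disappears and the sentence seems to say the opposite of what was meant; with the adjusted list "not" survives.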

In [35]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

**Exercise:** Remove all stop words from the `text` variable.

In [36]:
text='''Are we the best? We endeavour to positively impact the lives of those in the community through giving back. 
The M-PESA foundation and Safaricom Foundation are just a few of our various avenues towards positive change in society. 
All this is because we believe in a future where opportunity is available to all regardless of background. 
Be it the underprivileged young ones, we believe that their future should not be negatively affected by their present or past. 
And thus we strive to change the present to improve the future.'''

In [45]:
stop_words = set(stopwords.words("english"))

def remove_stop_words(p):
    # Show original version
    print('Original Paragraph')
    print(p)
    
    # Get words from block
    words = nltk.word_tokenize(p.lower())
    
    # Remove stopwords
    p_stop = [word for word in words if word not in stop_words]
    print()
    print('Without Stop words')
    print(' '.join(p_stop))

In [43]:
remove_stop_words(text)

Original Paragraph
Are we the best? We endeavour to positively impact the lives of those in the community through giving back. 
The M-PESA foundation and Safaricom Foundation are just a few of our various avenues towards positive change in society. 
All this is because we believe in a future where opportunity is available to all regardless of background. 
Be it the underprivileged young ones, we believe that their future should not be negatively affected by their present or past. 
And thus we strive to change the present to improve the future.

Without Stop words
best ? endeavour positively impact lives community giving back . m-pesa foundation safaricom foundation various avenues towards positive change society . believe future opportunity available regardless background . underprivileged young ones , believe future negatively affected present past . thus strive change present improve future .


### vi. Bag of words

Machine learning algorithms cannot work with raw text directly, so we need to convert the text into vectors of numbers. This is called feature extraction.

The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.

To use this model, we need to:

    1. Design a vocabulary of known words (also called tokens)
    2. Choose a measure of the presence of known words

Any information about the order or structure of words is discarded, which is why it’s called a bag of words. The model only records whether (or how often) a known word occurs in a document, not where it occurs.

The intuition is that similar documents have similar content, and that the content alone tells us something about the meaning of the document.
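
The two steps above can be sketched with the standard library alone. The sample sentences and the helper `to_vector` are assumptions made for illustration; splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

docs = ["school is great", "jane has gone to school", "great school great teachers"]

# 1. Design the vocabulary: every distinct word, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# 2. Choose a measure of presence: here, the raw count of each word.
def to_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]

print(vocabulary)
for vec in vectors:
    print(vec)

# Word order is discarded: a reordered sentence yields the same vector.
print(to_vector("great is school") == to_vector("school is great"))
```

Each document becomes a list with one slot per vocabulary word, so all vectors have the same length no matter how long the original text was.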

In [55]:
def preprocess(p):
    
    # Get words from block
    words = nltk.word_tokenize(p.lower())
    
    # Remove stopwords
    p_stop = [word for word in words if word not in stop_words]
    
    # Stem words
    result = [stemmer.stem(word) for word in p_stop]
    
    return ' '.join(result)

In [56]:
documents = nltk.sent_tokenize(preprocess(text))
documents

['best ?',
 'endeavour posit impact live commun give back .',
 'm-pesa foundat safaricom foundat variou avenu toward posit chang societi .',
 'believ futur opportun avail regardless background .',
 'underprivileg young one , believ futur neg affect present past .',
 'thu strive chang present improv futur .']

### Create document vectors



CountVectorizer is a tool provided by the scikit-learn library. It transforms each text into a vector based on the frequency (count) of each word that occurs in it. This is helpful when we have multiple texts and want to convert each of them into a vector for further analysis.

CountVectorizer creates a matrix in which each unique word is represented by a column and each text sample is a row. The value of each cell is the count of that word in that particular text sample.
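
One detail that affects the vocabulary is the tokenizer. The default `token_pattern` documented for CountVectorizer is `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. The sketch below applies that pattern with Python's `re` module to one of the preprocessed sentences from this section. It only imitates the tokenization step and is not a substitute for the vectorizer itself.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

doc = "m-pesa foundat safaricom foundat variou avenu toward posit chang societi ."
tokens = re.findall(TOKEN_PATTERN, doc)
print(tokens)
```

The single-character `m` from `m-pesa` and the final `.` are dropped, which is why `pesa` rather than `m-pesa` ends up in the vocabulary.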

In [59]:
import pandas as pd  # used below to display the matrices as DataFrames
from sklearn.feature_extraction.text import CountVectorizer

In [60]:
# Design the vocabulary
# Note: the default token pattern drops single-character tokens
count_vectorizer = CountVectorizer()

In [63]:
# Create bag of words model
bag_of_words = count_vectorizer.fit_transform(documents)

In [64]:
bag_of_words

<6x32 sparse matrix of type '<class 'numpy.int64'>'
	with 38 stored elements in Compressed Sparse Row format>

In [68]:
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(bag_of_words.toarray(), columns = feature_names)

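| | affect | avail | avenu | back | background | believ | best | chang | commun | endeavour | ... | present | regardless | safaricom | societi | strive | thu | toward | underprivileg | variou | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |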


### vii. TF-IDF

One problem with scoring word frequency is that the most frequent words in the document start to have the highest scores. These frequent words may not contain as much “informational gain” to the model compared with some rarer and domain-specific words. One approach to fix that problem is to penalize words that are frequent across all the documents. This approach is called TF-IDF.

TF-IDF, short for term frequency-inverse document frequency, is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

The TF-IDF scoring value increases proportionally to the number of times a word appears in the document, but it is offset by the number of documents in the corpus that contain the word.
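
To make the scoring concrete, here is a minimal sketch that computes TF-IDF by hand with the standard library. It assumes one common variant: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector, which is the default behaviour documented for scikit-learn's `TfidfVectorizer`. The helper name `tfidf` is hypothetical, and the documents are the preprocessed sentences used in the `TfidfVectorizer` example in this section.

```python
import math
import re
from collections import Counter

docs = [
    "jane gone school .",
    "mom pick jane park lot .",
    "school great .",
    "graduat june .",
    "best teacher school .",
]

# Tokenize: keep tokens of two or more word characters (drops the lone ".").
tokenized = [re.findall(r"\b\w\w+\b", d) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Raw count times smoothed IDF, then L2-normalized (assumed variant)."""
    counts = Counter(tokens)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first = tfidf(tokenized[0])
for term, score in sorted(first.items()):
    print(f"{term}: {score:.6f}")
```

In the first document, "school" gets the lowest score because it appears in three of the five documents, while "gone" appears in only one and gets the highest. The values should closely match the first row of the `TfidfVectorizer` output shown in this section.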

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [74]:
text2 = """
Jane has gone to school.
Mom picked Jane at the parking lot.
School is great.
I will graduate in June.
I have the best teachers at my school.
"""
documents = nltk.sent_tokenize(preprocess(text2))
documents

['jane gone school .',
 'mom pick jane park lot .',
 'school great .',
 'graduat june .',
 'best teacher school .']

In [75]:
tfidf_vectorizer = TfidfVectorizer()
values = tfidf_vectorizer.fit_transform(documents)

In [76]:
# Show the Model as a pandas DataFrame
feature_names = tfidf_vectorizer.get_feature_names_out()
pd.DataFrame(values.toarray(), columns = feature_names)

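| | best | gone | graduat | great | jane | june | lot | mom | park | pick | school | teacher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.690159 | 0.0 | 0.0 | 0.556816 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.462208 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.374105 | 0.0 | 0.463693 | 0.463693 | 0.463693 | 0.463693 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.830881 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.556451 | 0.0 |
| 3 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.63907 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.427993 | 0.63907 |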


## Read on Regular Expressions in NLP
- https://towardsdatascience.com/regex-essential-for-nlp-ee0336ef988d
    