# NLP Basics

### Natural Language Processing (NLP):
Field concerned with the ability of a computer to understand, analyze, manipulate, and potentially generate human language.

NLP encompasses many topics, such as
- Sentiment Analysis
- Topic Modeling
- Text Classification
- Sentence Segmentation or Part-of-Speech Tagging

### Natural Language Toolkit (NLTK):
Suite of open-source tools created to make **NLP** processes in Python easier to build.

#### How to install NLTK on your local machine?

Both sets of instructions below assume you already have Python installed. These instructions are taken directly from [http://www.nltk.org/install.html](http://www.nltk.org/install.html).

**Mac/Unix**
From the terminal:
1. Install NLTK: run `pip install -U nltk`
2. Test installation: run `python` then type `import nltk`

**Windows**
1. Install NLTK: [http://pypi.python.org/pypi/nltk](http://pypi.python.org/pypi/nltk)
2. Test installation: `Start>Python35`, then type `import nltk`

**Jupyter Notebook** 
Run `jupyter notebook`, then in a notebook cell:
1. Install NLTK: `!pip install -U nltk`
2. Test installation: `import nltk`

In [1]:
# Installing the NLTK package
!pip install -U nltk

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)
Collecting regex
  Downloading https://files.pythonhosted.org/packages/9c/d1/d2ecb51a8cb38c8278e77a2731c1366881e0dea9671f135d2625f15a73a4/regex-2020.7.14-cp37-cp37m-win_amd64.whl (268kB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py): started
  Building wheel for nltk (setup.py): finished with status 'done'
  Created wheel for nltk: filename=nltk-3.5-cp37-none-any.whl size=1434681 sha256=df369ff016c3b4ad10df0753f7adf4c89f72eb384098fa989db194db682bcd63
  Stored in directory: C:\Users\lokma\AppData\Local\pip\Cache\wheels\ae\8c\3f\b1fe0ba04555b08b57ab52ab7f86023639a526d8bc8d384306
Successfully built nltk
Installing collected packages: regex, nltk
  Found existing installation: nltk 3.4.5
    Uninstalling nltk-3.4.5:
      Successfully uninstalled nltk-3.4.5
Successfully installed nltk-3.5 regex-2020

In [2]:
# Import nltk library
import nltk
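
Before importing, it can be handy to check programmatically whether NLTK is available in the current environment. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical and not part of NLTK.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None

print("nltk installed:", is_installed("nltk"))
```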

## Installing NLTK Data

NLTK comes with many corpora, toy grammars, trained models, etc. To download them, use the interactive installer by running the following command.

A new window should open, showing the NLTK Downloader. Click on the File menu and select the packages or collections you want to download. For this demonstration, we recommend downloading all packages.

In [3]:
# Download NLTK data, it's going to take a few minutes
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
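
The interactive downloader is not always convenient, for example on a remote machine without a display. As an alternative sketch, `nltk.download()` also accepts a package identifier, so a single resource such as `stopwords` can be fetched directly. The wrapper name `fetch_nltk_resource` is hypothetical; calling it assumes NLTK is installed and a network connection is available.

```python
def fetch_nltk_resource(resource_id="stopwords"):
    """Download one NLTK data package without opening the interactive downloader."""
    import nltk  # imported lazily so the function can be defined without NLTK
    # quiet=True suppresses progress output; the call returns True on success
    return nltk.download(resource_id, quiet=True)

# Example (requires NLTK and network access):
# fetch_nltk_resource("stopwords")
```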

In [4]:
# Check all the functions and attributes inside the nltk library
dir(nltk)

['AbstractLazySequence',
 'AffixTagger',
 'AlignedSent',
 'Alignment',
 'AnnotationTask',
 'ApplicationExpression',
 'Assignment',
 'BigramAssocMeasures',
 'BigramCollocationFinder',
 'BigramTagger',
 'BinaryMaxentFeatureEncoding',
 'BlanklineTokenizer',
 'BllipParser',
 'BottomUpChartParser',
 'BottomUpLeftCornerChartParser',
 'BottomUpProbabilisticChartParser',
 'Boxer',
 'BrillTagger',
 'BrillTaggerTrainer',
 'CFG',
 'CRFTagger',
 'CfgReadingCommand',
 'ChartParser',
 'ChunkParserI',
 'ChunkScore',
 'Cistem',
 'ClassifierBasedPOSTagger',
 'ClassifierBasedTagger',
 'ClassifierI',
 'ConcordanceIndex',
 'ConditionalExponentialClassifier',
 'ConditionalFreqDist',
 'ConditionalProbDist',
 'ConditionalProbDistI',
 'ConfusionMatrix',
 'ContextIndex',
 'ContextTagger',
 'ContingencyMeasures',
 'CoreNLPDependencyParser',
 'CoreNLPParser',
 'Counter',
 'CrossValidationProbDist',
 'DRS',
 'DecisionTreeClassifier',
 'DefaultTagger',
 'DependencyEvaluator',
 'DependencyGrammar',
 'DependencyGrap

In [6]:
# Try out the stopwords corpus from the nltk package
from nltk.corpus import stopwords

# List a sample of the English stop words from the package:
# every 25th element, starting at index 0 and stopping before index 500
stopwords.words('english')[0:500:25]

['i', 'herself', 'been', 'with', 'here', 'very', 'doesn', 'won']
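
The slice `[0:500:25]` reads as: start at index 0, stop before index 500, step by 25. Python clamps a stop value that lies beyond the end of the list, so the result is simply every 25th element. A stand-in list is used below so the example runs without NLTK; its length of 179 is an assumption chosen to mirror the eight-element output above.

```python
# Stand-in for the stop word list (hypothetical placeholder values)
words = ["word{}".format(i) for i in range(179)]

sample = words[0:500:25]
print(len(sample))   # 8
print(sample[:3])    # ['word0', 'word25', 'word50']
```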

## Structured Data vs. Unstructured Data

Most text data lacks the formal structure of numeric data. A commonly cited estimate is that more than 80% of business-relevant information originates in unstructured form, primarily text.

#### Unstructured Data
- Binary data
- No delimiters
- No indication of rows

## NLP Basics: Reading in and Cleaning Text Data

### Read in Semi-Structured Text Data

In [10]:
# Read in the raw text; the with block closes the file automatically
with open('./data/SMSSpamCollection.tsv') as f:
    rawData = f.read()

# Print the first 500 characters of the data
rawData[0:500]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aid"

### Cleaning the Text Data

As you can see, the text data is somewhat structured: each line follows the same pattern, starting with either **ham** or **spam**, followed by a tab and the message text. We need to split the data by line and separate each label from its message. First, we replace each "\t" (tab) with "\n" (newline). Then we split the data on "\n", which yields a list that alternates between labels and messages.

In [14]:
# Parse the data and split by \n
parsedData = rawData.replace('\t', '\n').split('\n')

In [15]:
# Print the first 10 elements from the data
parsedData[0:10]

['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'ham',
 "Nah I don't think he goes to usf, he lives around here though",
 'ham',
 'Even my brother is not like to speak with me. They treat me like aids patent.',
 'ham',
 'I HAVE A DATE ON SUNDAY WITH WILL!!']

Next, we extract the labels (**ham** or **spam**) from the parsed data into one list and the message texts into another. Labels sit at the even indices and messages at the odd indices.

In [18]:
labelList = parsedData[0::2]
textList = parsedData[1::2]

In [19]:
print(labelList[0:5])
print(textList[0:5])

['ham', 'spam', 'ham', 'ham', 'ham']
["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", "Nah I don't think he goes to usf, he lives around here though", 'Even my brother is not like to speak with me. They treat me like aids patent.', 'I HAVE A DATE ON SUNDAY WITH WILL!!']


Now that we have created the two lists, we need to check that they have the same length before storing them in a dataframe.

In [21]:
# Check the length of the two lists
print(len(labelList))
print(len(textList))

5571
5570


It seems the label list captured one extra element from the parsed data. We can see what it is by printing the last few elements of the list.

In [22]:
print(labelList[-5:])

['ham', 'ham', 'ham', 'ham', '']


An empty string was captured at the end of the label list. We can remove it so that the two lists have equal length.

In [23]:
labelList = labelList[:-1]
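
The empty string most likely appears because the file ends with a newline, so the final `split('\n')` yields one trailing empty element. A hedged alternative is to parse line by line and split each line on the first tab only, which keeps every label paired with its message and skips blank lines. The sample text and the function name `parse_labeled_lines` are illustrative assumptions.

```python
def parse_labeled_lines(raw_text):
    """Split label<TAB>text lines into two parallel lists.

    Assumes every non-blank line contains at least one tab.
    """
    labels, texts = [], []
    for line in raw_text.split("\n"):
        if not line.strip():
            continue  # skip blank lines such as the one after a trailing newline
        label, text = line.split("\t", 1)  # split on the first tab only
        labels.append(label)
        texts.append(text)
    return labels, texts

sample = "ham\tSee you at lunch\nspam\tWin a free\tprize now\n"
labels, texts = parse_labeled_lines(sample)
print(labels)  # ['ham', 'spam']
print(texts)   # ['See you at lunch', 'Win a free\tprize now']
```

Because labels and messages are appended together, the two lists always have the same length.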

Now we have two lists with equal length. We can construct a dataframe to store the structured data.

In [25]:
import pandas as pd

# Create a dataframe to store the data
fullCorpus = pd.DataFrame({'label':labelList, 'body_list': textList})
fullCorpus.head()

| | label | body_list |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


## Alternative for Reading the Semi-Structured Data

Another easy way to read this data file is to use the pandas `read_csv()` function. One thing to be careful about when using `read_csv()` on this file is that the **header** parameter must be set to **None**, because the raw text data does not contain column names. If `header` is not set to `None`, pandas will use the first row of data as the column names.

In [26]:
# Read the raw text data with read_csv()
dataset = pd.read_csv('./data/SMSSpamCollection.tsv', sep='\t', header=None)
dataset.head()

| | 0 | 1 |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


## NLP Basics: Exploring the Dataset

#### Read in text data

In [28]:
import pandas as pd

fullCorpus = pd.read_csv('./data/SMSSpamCollection.tsv', sep='\t', header=None)
fullCorpus.columns = ['label', 'body_text']

fullCorpus.head()

| | label | body_text |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


In [29]:
# What is the shape of the dataset?
print("Input data has {} rows and {} columns"
      .format(len(fullCorpus),len(fullCorpus.columns)))

Input data has 5568 rows and 2 columns
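
Note that 5568 rows is slightly fewer than the 5570 messages obtained from the manual split earlier. One possible explanation, offered as an assumption rather than a verified diagnosis, is quote handling: CSV parsers treat a double quote at the start of a field as the opening of a quoted field, which can merge several lines into one record. The standard-library sketch below reproduces the effect on made-up data; `pandas.read_csv` accepts the same `quoting=csv.QUOTE_NONE` argument if you want to test this on the real file.

```python
import csv
import io

# Made-up sample: the second field of line 1 opens a quote that closes on line 2
sample = 'ham\t"hello\nspam\tworld"\nham\tbye\n'

# Default quoting: the quoted field swallows the line break
default_rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# QUOTE_NONE: quote characters are treated as ordinary text
literal_rows = list(csv.reader(io.StringIO(sample), delimiter="\t",
                               quoting=csv.QUOTE_NONE))

print(len(default_rows))  # 2
print(len(literal_rows))  # 3
```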


In [30]:
# How many spam/ham are there?

print("Out of {} rows, {} are spam, {} are ham".format(
    len(fullCorpus),
    len(fullCorpus[fullCorpus['label']=='spam']),
    len(fullCorpus[fullCorpus['label']=='ham'])))

Out of 5568 rows, 746 are spam, 4822 are ham
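
Filtering the DataFrame once per label works, but the same tally can be produced in a single pass. The sketch below uses `collections.Counter` on a made-up list of labels; on the DataFrame itself, `fullCorpus['label'].value_counts()` gives the equivalent result.

```python
from collections import Counter

# Made-up labels standing in for fullCorpus['label']
labels = ["ham", "spam", "ham", "ham", "spam", "ham"]

counts = Counter(labels)
print("Out of {} rows, {} are spam, {} are ham".format(
    len(labels), counts["spam"], counts["ham"]))
```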


In [31]:
# How much missing data is there?
print("Number of null in label: {}".format(fullCorpus['label'].isnull().sum()))
print("Number of null in text: {}".format(fullCorpus['body_text'].isnull().sum()))

Number of null in label: 0
Number of null in text: 0


## NLP Basics: Regular Expressions

#### Using Regular Expressions in Python

Python's `re` module is the most commonly used regex resource. More detail can be found [here](https://docs.python.org/3/library/re.html).

In [32]:
import re

re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy1 = 'This     is a made up    string to test 2     different regex methods'
re_test_messy2 = 'This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'

### Splitting a sentence into a list of words

In [33]:
re.split(r'\s', re_test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [34]:
re.split(r'\s', re_test_messy1)

['This',
 '',
 '',
 '',
 '',
 'is',
 'a',
 'made',
 'up',
 '',
 '',
 '',
 'string',
 'to',
 'test',
 '2',
 '',
 '',
 '',
 '',
 'different',
 'regex',
 'methods']

In [35]:
re.split(r'\s+', re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [37]:
re.split(r'\s+', re_test_messy2)

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

In [38]:
re.split(r'\W+', re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [39]:
re.findall(r'\S+', re_test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [41]:
re.findall(r'\S+', re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [42]:
re.findall(r'\S+', re_test_messy2)

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

In [43]:
re.findall(r'\w+', re_test_messy2)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
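
One practical difference between the two approaches: `re.split()` returns an empty string whenever the text starts or ends with a separator, while `re.findall()` returns only the matches. A short sketch on a made-up sentence:

```python
import re

text = "Hello, world!"

# Splitting on non-word runs leaves an empty string after the final "!"
split_tokens = re.split(r"\W+", text)

# Collecting word runs returns only the words themselves
found_tokens = re.findall(r"\w+", text)

print(split_tokens)  # ['Hello', 'world', '']
print(found_tokens)  # ['Hello', 'world']
```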

### Replacing a specific string

In [44]:
pep8_test = 'I try to follow PEP8 guidelines'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'

In [None]:
re.findall('[a-z]+', pep8_test)

In [None]:
re.findall('[A-Z]+', pep8_test)

In [None]:
re.findall('[A-Z]+[0-9]+', peep8_test)

In [None]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', peep8_test)
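
The four cells above show no output. The sketch below runs the same calls in one place, with raw-string patterns, so the behaviour can be checked; the results in the comments follow from the character classes used.

```python
import re

pep8_test = 'I try to follow PEP8 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'

# Runs of lowercase letters only
print(re.findall(r'[a-z]+', pep8_test))   # ['try', 'to', 'follow', 'guidelines']

# Runs of uppercase letters only (digits are excluded)
print(re.findall(r'[A-Z]+', pep8_test))   # ['I', 'PEP']

# Uppercase letters followed immediately by digits
print(re.findall(r'[A-Z]+[0-9]+', peep8_test))   # ['PEEP8']

# Replace that match with a fixed string
fixed = re.sub(r'[A-Z]+[0-9]+', 'PEP8 Python Styleguide', peep8_test)
print(fixed)   # I try to follow PEP8 Python Styleguide guidelines
```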

### Other examples of regex methods

- re.search()
- re.match()
- re.fullmatch()
- re.finditer()
- re.escape()
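
A brief sketch of what each of these functions does, using a made-up sentence. All five are part of the standard `re` module.

```python
import re

text = "PEP8 was followed by PEP20"

# search: first match anywhere in the string
first = re.search(r"[0-9]+", text)
print(first.group())          # 8

# match: only matches at the start of the string
print(re.match(r"PEP", text) is not None)      # True
print(re.match(r"was", text) is not None)      # False

# fullmatch: the whole string must match the pattern
print(re.fullmatch(r"PEP[0-9]+", "PEP8") is not None)   # True
print(re.fullmatch(r"PEP[0-9]+", text) is not None)     # False

# finditer: iterate over match objects, which also carry positions
spans = [(m.group(), m.start()) for m in re.finditer(r"PEP[0-9]+", text)]
print(spans)                  # [('PEP8', 0), ('PEP20', 21)]

# escape: make a literal string safe to use inside a pattern
print(re.escape("a.b"))       # a\.b
```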

## NLP Basics: Pipeline to Clean Text

### Pre-processing Text Data
Cleaning up the text data is necessary to highlight the attributes that you want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:

1. **Remove punctuation**
2. **Tokenization**
3. **Remove stopwords**
4. Lemmatize/Stem

The first three steps are covered in this section because they are implemented in almost every text-cleaning pipeline. Lemmatizing and stemming are covered in a later section, as they are helpful but not critical.

In [45]:
import pandas as pd

# Reading the raw text data
pd.set_option('display.max_colwidth', 100)
data = pd.read_csv("./data/SMSSpamCollection.tsv", sep='\t', header=None)
data.columns = ['label', 'body_text']

data.head()

| | label | body_text |
|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


In [47]:
# What does the cleaned version look like?
data_cleaned = pd.read_csv("./data/SMSSpamCollection_cleaned.tsv", sep='\t')
data_cleaned.head()

| | label | body_text | body_text_nostop |
|---|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | ['ive', 'searching', 'right', 'words', 'thank', 'breather', 'promise', 'wont', 'take', 'help', '... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | ['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though | ['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though'] |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | ['even', 'brother', 'like', 'speak', 'treat', 'like', 'aids', 'patent'] |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | ['date', 'sunday'] |


### Remove Punctuation

In [48]:
# Import the string module and check the punctuation characters it defines
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [49]:
# Punctuation characters are part of a string, so they affect comparisons.
# We can check this with the following comparison
"I like NLP."  == "I like NLP"

False

In [50]:
# Create a remove punctuation function
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

# Remove punctuation by the remove_punct()
data['body_text_clean'] = data['body_text'].apply(lambda x: remove_punct(x))

data.head()

| | label | body_text | body_text_clean |
|---|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL |
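
An alternative to the character-by-character comprehension is `str.translate()` with a deletion table built by `str.maketrans()`, which is typically faster on long strings. This is a sketch of a drop-in variant; the name `remove_punct_translate` is hypothetical.

```python
import string

# Translation table that maps every punctuation character to None (deletion)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_translate(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

print(remove_punct_translate("I HAVE A DATE ON SUNDAY WITH WILL!!"))
# I HAVE A DATE ON SUNDAY WITH WILL
```

Like the original function, this only removes the ASCII characters listed in `string.punctuation`; typographic quotes or other Unicode punctuation would pass through unchanged.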


### Tokenization

In [51]:
# Create a function for tokenization
import re

def tokenize(text):
    # Split on runs of non-word characters
    tokens = re.split(r'\W+', text)
    return tokens

data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))

data.head()

| | label | body_text | body_text_clean | body_text_tokenized |
|---|---|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your... | [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... | [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though | [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will] |


### Remove Stopwords

In [52]:
# Create a list of stopwords from the nltk package
import nltk

stopword = nltk.corpus.stopwords.words('english')

In [53]:
# Create a function to remove stopwords
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword]
    return text

data['body_text_nostop'] = data['body_text_tokenized'].apply(lambda x: remove_stopwords(x))

data.head()

| | label | body_text | body_text_clean | body_text_tokenized | body_text_nostop |
|---|---|---|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | Ive been searching for the right words to thank you for this breather I promise i wont take your... | [ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ... | [ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e... | [free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to... | [free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though | Nah I dont think he goes to usf he lives around here though | [nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though] | [nah, dont, think, goes, usf, lives, around, though] |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | Even my brother is not like to speak with me They treat me like aids patent | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] | [even, brother, like, speak, treat, like, aids, patent] |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will] | [date, sunday] |
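
The three steps can also be chained into a single function and applied in one pass. The sketch below is a minimal, standard-library-only version: the short `stopword_list` is a hypothetical stand-in for the NLTK list, `clean_text` is an illustrative name, and converting the list to a set is a suggested optimisation because set membership tests are faster than scanning a list.

```python
import re
import string

# Hypothetical, deliberately tiny stand-in for nltk.corpus.stopwords.words('english')
stopword_list = ["i", "have", "a", "on", "with", "will", "the", "to"]
stopword_set = set(stopword_list)  # average O(1) membership tests

def clean_text(text):
    """Remove punctuation, lowercase and tokenize, then drop stop words."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\W+", no_punct.lower())
    # "tok and" also discards empty strings left over by re.split
    return [tok for tok in tokens if tok and tok not in stopword_set]

print(clean_text("I HAVE A DATE ON SUNDAY WITH WILL!!"))  # ['date', 'sunday']
```

With the real data, this could be applied as `data['body_text'].apply(clean_text)` after replacing the stand-in list with the NLTK stop words.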
