# NLP - An Introduction to the Basics

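In this notebook I outline the main text preprocessing techniques in NLP using the NLTK library: tokenization, normalisation, stop word removal, stemming, lemmatization, part-of-speech tagging and named entity recognition.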

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer, PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\apala\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\apala\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\apala\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker.zip.
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\apala\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\apala\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

## Tokenization

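Tokenization splits text into separate tokens. With `word_tokenize` the tokens are words and punctuation marks; with `sent_tokenize` they are whole sentences.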

In [2]:
text = "EDHEC is a French Business School based in Lille and Nice."

In [3]:
text_tokens = word_tokenize(text)

In [4]:
print(text_tokens)

['EDHEC', 'is', 'a', 'French', 'Business', 'School', 'based', 'in', 'Lille', 'and', 'Nice', '.']
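
To make the idea concrete, here is a minimal regex-based sketch of word tokenization. It is not how `word_tokenize` works internally; the function name `simple_word_tokenize` and the pattern are illustrative assumptions.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

text = "EDHEC is a French Business School based in Lille and Nice."
print(simple_word_tokenize(text))
# ['EDHEC', 'is', 'a', 'French', 'Business', 'School', 'based', 'in', 'Lille', 'and', 'Nice', '.']
```

On this sentence the sketch gives the same tokens as above, but it is cruder in general: it splits "don't" into `don`, `'` and `t`, whereas `word_tokenize` returns `do` and `n't`.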


In [5]:
from nltk.tokenize import sent_tokenize

In [6]:
text = "EDHEC is a French Business School. It is based in Lille and Nice."

In [7]:
text_tokens = sent_tokenize(text)

In [8]:
print(text_tokens)

['EDHEC is a French Business School.', 'It is based in Lille and Nice.']
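
Sentence splitting looks easy, but a naive rule shows why a trained tokenizer is useful. The sketch below simply splits after `.`, `!` or `?` when whitespace follows; `naive_sent_tokenize` is a hypothetical helper, not part of NLTK.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sent_tokenize("EDHEC is a French Business School. It is based in Lille and Nice."))
# ['EDHEC is a French Business School.', 'It is based in Lille and Nice.']

print(naive_sent_tokenize("Dr. Smith teaches in Nice."))
# ['Dr.', 'Smith teaches in Nice.']
```

The second call breaks the sentence after the abbreviation "Dr.", which is the kind of case the Punkt model behind `sent_tokenize` is designed to handle better.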


## Normalising

Normalising cleans the text by removing tokens that are not words, such as numbers and punctuation.

In [9]:
doc = ("The FT Group employs more than 2300 people worldwide "
       "including 700 journalists in 40 countries. It includes the "
       "Financial Times, FT Specialist, and a number of services and joint ventures.")

words = nltk.word_tokenize(doc)

for word in words:
    if word.isalpha():  # keep only tokens made up entirely of letters
        print(word, end=" ")

The FT Group employs more than people worldwide including journalists in countries It includes the Financial Times FT Specialist and a number of services and joint ventures 

Next, convert the words to lower case so that, for example, "The" and "the" are treated as the same word.

In [10]:
for word in words[0:10]:
    if word.isalpha():  # keep only tokens made up entirely of letters
        print(word.lower())

the
ft
group
employs
more
than
people
worldwide
including
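
The two steps above can be combined into one small helper. This is a sketch; the name `normalise` and the shortened token list are assumptions made for illustration.

```python
def normalise(tokens):
    """Keep purely alphabetic tokens and convert them to lower case."""
    return [token.lower() for token in tokens if token.isalpha()]

tokens = ["The", "FT", "Group", "employs", "more", "than", "2300", "people", "worldwide", "."]
print(normalise(tokens))
# ['the', 'ft', 'group', 'employs', 'more', 'than', 'people', 'worldwide']
```

Note that `isalpha()` is a blunt filter: it also discards tokens such as "U.S." or hyphenated words, which may or may not be what you want.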


## Stop Words

Stop words are very common words, such as "is", "a" and "in", that carry little informational value and are often removed before analysis.

In [11]:
nltk.download('stopwords')

text = "EDHEC is a French Business School. It is based in Lille and Nice."

stop_words = set(stopwords.words("english"))
text_tokens = word_tokenize(text)
tokens_without_stopwords = [word for word in text_tokens
                            if word not in stop_words]

print(tokens_without_stopwords)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\apala\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


['EDHEC', 'French', 'Business', 'School', '.', 'It', 'based', 'Lille', 'Nice', '.']
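
The output above still contains "It": NLTK's stop word list is lower case and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows; to stay self-contained it uses a tiny hand-written stop word set (an assumption for illustration) rather than the full NLTK list.

```python
# Hypothetical mini stop word list for illustration; NLTK's English list is much longer.
STOP_WORDS = {"is", "a", "it", "in", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form is a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ['EDHEC', 'is', 'a', 'French', 'Business', 'School', '.',
          'It', 'is', 'based', 'in', 'Lille', 'and', 'Nice', '.']
print(remove_stop_words(tokens))
# ['EDHEC', 'French', 'Business', 'School', '.', 'based', 'Lille', 'Nice', '.']
```

The same `token.lower() not in stop_words` test should work unchanged with `set(stopwords.words("english"))`.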


## Stemming

Stemming reduces words to a root (stem) form by stripping suffixes with simple rules. The stem does not have to be a real word.

In [12]:
stemmer = SnowballStemmer("english")
text = "studies studying cries cry"
tokens = word_tokenize(text)

for token in tokens:  # avoid reusing the name `text` for the loop variable
    print(token, "->", stemmer.stem(token))

studies -> studi
studying -> studi
cries -> cri
cry -> cri
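
To show what "cutting off the suffix" means, here is a deliberately simplified suffix stripper. It is a toy with three hand-picked rules, not the Snowball algorithm, and `toy_stem` is a hypothetical name; it happens to reproduce the four stems above but will not match Snowball in general.

```python
def toy_stem(word):
    """Very rough suffix stripping; NOT the Snowball algorithm."""
    word = word.lower()
    # Try the suffix rules in order and apply only the first one that matches.
    # The length check keeps very short words such as "is" or "sing" intact.
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            word = word[: -len(suffix)] + replacement
            break
    # Map a final "y" to "i" so that "cry" and "cries" share a stem.
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

for word in ["studies", "studying", "cries", "cry"]:
    print(word, "->", toy_stem(word))
# studies -> studi
# studying -> studi
# cries -> cri
# cry -> cri
```

Real stemmers apply many more rules and conditions, which is why a library implementation such as `SnowballStemmer` is preferable in practice.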


## Lemmatization

Stemming can create words which do not exist, such as "studi" above. Lemmatization instead maps each word it recognises to its dictionary form (lemma), so the result is an actual word.

In [14]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\apala\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\omw-1.4.zip.


True

In [15]:
text = "studies studying cries cry"
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
for token in tokens:  # avoid reusing the name `text` for the loop variable
    print(token, "->", lemmatizer.lemmatize(token))

studies -> study
studying -> studying
cries -> cry
cry -> cry


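Lemmatization is slower than stemming, largely because it looks words up in WordNet rather than applying simple suffix rules. Note that "studying" is left unchanged above: `lemmatize` treats every word as a noun unless told otherwise, and passing `pos="v"` returns "study".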

## Parts of Speech

In [16]:
text = "EDHEC is a French Business School. It is based in Lille and Nice."
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))

[('EDHEC', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('French', 'JJ'), ('Business', 'NNP'), ('School', 'NNP'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('based', 'VBN'), ('in', 'IN'), ('Lille', 'NNP'), ('and', 'CC'), ('Nice', 'NNP'), ('.', '.')]


Let's find out what these tags mean.

In [17]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\apala\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping help\tagsets.zip.


$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

Sometimes the tagger is able to distinguish different uses of the same word, e.g. "desert" as a noun (an arid region) or as a verb (to abandon).
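
The tags can also be passed on to the lemmatizer, whose `lemmatize` method accepts an optional `pos` argument (`'n'`, `'v'`, `'a'` or `'r'`). A common approach is to map Penn Treebank tags to these letters by their prefix. The helper `penn_to_wordnet` below is a sketch of that mapping, not an NLTK function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, also used as the fallback for all other tags

tagged = [("EDHEC", "NNP"), ("is", "VBZ"), ("a", "DT"), ("French", "JJ"), ("based", "VBN")]
for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))
# EDHEC NNP -> n
# is VBZ -> v
# a DT -> n
# French JJ -> a
# based VBN -> v
```

With the `lemmatizer` defined earlier, the result could be used as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.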

## Named Entity Recognition

In [18]:
text = 'SpaceX is an aerospace manufacturer and space transport services company headquartered in California. It was founded in 2002 by entrepreneur and investor Elon Musk with the goal of reducing space transportation costs and enabling the colonization of Mars.'

In [19]:
text_tag = nltk.pos_tag(nltk.word_tokenize(text))

In [20]:
text_ch = nltk.ne_chunk(text_tag)

In [21]:
for chunk in text_ch:
    if hasattr(chunk, 'label'):
        print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))

GPE SpaceX
GPE California
PERSON Elon Musk
PERSON Mars

The chunker is not perfect: here it labels SpaceX as a geo-political entity (GPE) rather than an organisation, and Mars as a PERSON.
