# Introduction to Natural Language Processing with Python

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof.

(Ref: https://en.wikipedia.org/wiki/Natural_language_processing)

This introductory notebook shows how to use Python for some basic natural language processing tasks.

Most of these tasks are handled by the 'nltk' package (the Natural Language Toolkit), so we will start by importing it.

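If you have not installed the nltk package yet, the YouTube video below shows how to do it. The examples in this notebook also need some nltk data (tokenizer models, the stop-word list, the tagger model, and WordNet), which can be fetched with nltk.download().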
https://www.youtube.com/watch?v=68aHmFcO-W4

In [1]:
import nltk

In [2]:
# Create a text (multiple sentences)
text = 'Mary had a little lamb. Her fleece was white as snow.'

We can split a text into sentences with sent_tokenize and into words with word_tokenize, both from nltk.tokenize.

In [3]:
# Import word_tokenize,sent_tokenize
from nltk.tokenize import word_tokenize,sent_tokenize

In [4]:
# Using sent_tokenize we will convert the text into a list of strings or sentences.
sents = sent_tokenize(text)
sents

['Mary had a little lamb.', 'Her fleece was white as snow.']

In [5]:
# Using word_tokenize we will convert each sentence into a list of words.
words = [word_tokenize(sent) for sent in sents]
words

[['Mary', 'had', 'a', 'little', 'lamb', '.'],
 ['Her', 'fleece', 'was', 'white', 'as', 'snow', '.']]
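
To get a feel for what the two tokenizers do, here is a rough stand-in that uses only the standard library's re module. It is a simplified sketch, not how nltk implements sent_tokenize or word_tokenize, and the helper names simple_sent_tokenize and simple_word_tokenize are made up for this illustration. It will break on abbreviations such as "Dr." that the real tokenizers are built to handle.

```python
import re

sample = 'Mary had a little lamb. Her fleece was white as snow.'

def simple_sent_tokenize(text):
    # Split on whitespace that follows a sentence-ending ., ! or ?
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_word_tokenize(sentence):
    # A token is a run of word characters or a single punctuation mark
    return re.findall(r'\w+|[^\w\s]', sentence)

simple_sents = simple_sent_tokenize(sample)
simple_words = [simple_word_tokenize(s) for s in simple_sents]
print(simple_sents)
print(simple_words)
```

For this short text the sketch gives the same sentences and words as the nltk output above.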

English text contains many very common words such as a, an, the, is, and was, as well as punctuation marks such as the period (.). These tokens carry little meaning on their own and are not useful for our analysis, so we will remove them.

This is done using stopwords from nltk.corpus and punctuation from the string module.

In [6]:
from nltk.corpus import stopwords
from string import punctuation
# We will make a set of stopwords and punctuation and name it as customStopWords
# If you are aware of a list of words which are not useful, you may add that to your set
customStopWords = set(stopwords.words('english')+list(punctuation))

In [7]:
# Below we are creating wordsWOstopWords which is all the words from text which are not present
# in the customStopWords set
wordsWOstopWords = [word for word in word_tokenize(text) if word not in customStopWords]
wordsWOstopWords

['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']
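
Notice that 'Her' is still in the result. The stop-word list from nltk is all lowercase and the membership test is case-sensitive, so the capitalised 'Her' does not match 'her'. The sketch below shows the difference on the same tokens. To keep it self-contained it uses a small hand-picked set, demo_stop_words, as a stand-in for stopwords.words('english'); that set is an assumption for this example only.

```python
from string import punctuation

# Hypothetical, hand-picked stand-in for stopwords.words('english')
demo_stop_words = {'a', 'had', 'her', 'was', 'as'} | set(punctuation)

tokens = ['Mary', 'had', 'a', 'little', 'lamb', '.',
          'Her', 'fleece', 'was', 'white', 'as', 'snow', '.']

# Case-sensitive test: 'Her' does not equal 'her', so it is kept
case_sensitive = [t for t in tokens if t not in demo_stop_words]

# Lowercase each token only for the comparison, keeping the original spelling
case_insensitive = [t for t in tokens if t.lower() not in demo_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering depends on the analysis; it removes 'Her' here, but it would also treat a name that happens to look like a stop word as one.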

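N-Gram:
An n-gram is a sequence of n words that appear next to each other in a text. Words that frequently occur together are often worth keeping together as a unit.
If we keep two words together, the pair is called a bigram.

This is done using the collocations module in nltk. BigramCollocationFinder.from_words() builds a finder from a list of words, and its ngram_fd attribute holds each bigram together with its frequency.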

In [8]:
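from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
# bigram_measures is not needed for plain counts; it provides scoring functions for ranking bigrams
bigram_measures = BigramAssocMeasures()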
finder = BigramCollocationFinder.from_words(wordsWOstopWords)
sorted(finder.ngram_fd.items())

[(('Her', 'fleece'), 1),
 (('Mary', 'little'), 1),
 (('fleece', 'white'), 1),
 (('lamb', 'Her'), 1),
 (('little', 'lamb'), 1),
 (('white', 'snow'), 1)]
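
The counting itself is simple enough to reproduce with the standard library, which may help make clear what the finder returns. This is a minimal sketch, not the nltk implementation; make_ngrams is a helper name invented for this example.

```python
from collections import Counter

filtered = ['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']

def make_ngrams(tokens, n):
    # Zip the list with copies of itself shifted by 1 .. n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

bigram_counts = Counter(make_ngrams(filtered, 2))
trigrams = make_ngrams(filtered, 3)

print(sorted(bigram_counts.items()))
print(trigrams[:2])
```

Every bigram has a count of 1 because the sample text is so short; on a larger corpus the counts start to reveal word pairs that really do belong together.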

Stemming:

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.

For example, a stemmer for English should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stems", "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".

(Ref: https://en.wikipedia.org/wiki/Stemming)

In [9]:
text2 = "Mary closed on the closing night when she wanted to close"
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
stemmerWords = [st.stem(word) for word in word_tokenize(text2)]
print(stemmerWords)
# Note that closed, closing, and close are all reduced to the same stem, clos

['mary', 'clos', 'on', 'the', 'clos', 'night', 'when', 'she', 'want', 'to', 'clos']
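
The core idea behind a stemmer is suffix stripping. The toy function below is only a sketch of that idea and is far simpler than the Lancaster algorithm: the suffix list and the minimum stem length are arbitrary choices made for this example, and toy_stem is a made-up name.

```python
def toy_stem(word, suffixes=('ing', 'ed', 'e', 's'), min_stem=3):
    # Lowercase, then strip the first matching suffix if enough of the word remains
    word = word.lower()
    for suffix in suffixes:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

sentence = 'Mary closed on the closing night when she wanted to close'
toy_stems = [toy_stem(w) for w in sentence.split()]
print(toy_stems)
```

On this one sentence the toy rules happen to give the same stems as the LancasterStemmer output above. On real text they would quickly go wrong, which is why the established stemmers use much larger, carefully ordered rule sets.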


Part-of-speech:

Sometimes we need to know whether a certain word in a sentence is a noun, a verb, or a preposition. This is called part-of-speech (POS) tagging. We can use the nltk.pos_tag() function for that.

In [10]:
# Part-of-speech tagging
print(nltk.pos_tag(word_tokenize(text2)))
# Each word is paired with a tag for its part of speech, e.g. NN for a noun or VBD for a past-tense verb

[('Mary', 'NNP'), ('closed', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('closing', 'NN'), ('night', 'NN'), ('when', 'WRB'), ('she', 'PRP'), ('wanted', 'VBD'), ('to', 'TO'), ('close', 'VB')]
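
The tags come from the Penn Treebank tag set. As a small follow-up, the sketch below works on the tagged output copied from above, so it runs without nltk. It spells out the tags that occur in this sentence and pulls out the verbs and nouns by tag prefix. The names tag_names, verbs, and nouns are chosen for this example.

```python
# Tagged output copied from the cell above
tagged = [('Mary', 'NNP'), ('closed', 'VBD'), ('on', 'IN'), ('the', 'DT'),
          ('closing', 'NN'), ('night', 'NN'), ('when', 'WRB'), ('she', 'PRP'),
          ('wanted', 'VBD'), ('to', 'TO'), ('close', 'VB')]

# Penn Treebank descriptions for the tags that occur in this sentence
tag_names = {
    'NNP': 'proper noun, singular',
    'VBD': 'verb, past tense',
    'IN': 'preposition or subordinating conjunction',
    'DT': 'determiner',
    'NN': 'noun, singular or mass',
    'WRB': 'wh-adverb',
    'PRP': 'personal pronoun',
    'TO': 'to',
    'VB': 'verb, base form',
}

# All verb tags start with VB and all noun tags start with NN
verbs = [word for word, tag in tagged if tag.startswith('VB')]
nouns = [word for word, tag in tagged if tag.startswith('NN')]

print(verbs)
print(nouns)
for word, tag in tagged:
    print(word, tag, tag_names[tag])
```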


Meaning of words:

We can look up the meanings of a word using the wordnet module from nltk.corpus.
Each meaning is returned as a synset. Printing the synsets shows every meaning of the word, and the synset name also encodes the part of speech (for example, n for noun).

In [11]:
# Meaning of word
from nltk.corpus import wordnet as wn
for ss in wn.synsets('bass'):
    print(ss,ss.definition())

Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range


Word Sense Disambiguation:

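As the output above shows, the word 'bass' has different meanings in different contexts. 
Synset('bass.n.07') is the lowest-ranged member of a family of musical instruments, whereas Synset('sea_bass.n.01') refers to a saltwater fish.

nltk can estimate which meaning is intended from the surrounding words. Its lesk function implements the Lesk algorithm, which picks the sense whose dictionary definition shares the most words with the context.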

In [12]:
# bass has different meanings, sometimes a sea fish and sometimes a low music tone
# lesk takes the tokenized context sentence and the ambiguous word, and returns the best-matching synset
from nltk.wsd import lesk
sensel = lesk(word_tokenize('Sing in a lower tone, along with the bass'),'bass')
print(sensel,sensel.definition())

sensel2 = lesk(word_tokenize('This sea bass is really hard to catch'),'bass')
print(sensel2,sensel2.definition())

Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
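
The word-overlap idea behind Lesk can be sketched in a few lines. The version below is a simplified illustration and not the nltk implementation: it chooses between just two senses, whose definitions are copied from the WordNet output above, and the sense labels, the helper names, and the second example sentence are made up for this sketch.

```python
import re

# Toy sense inventory: definitions copied from the WordNet output above,
# labels invented for this sketch
toy_glosses = {
    'bass (instrument)': 'the member with the lowest range of a family of musical instruments',
    'bass (fish)': 'the lean flesh of a saltwater fish of the family Serranidae',
}

def word_set(text):
    # Lowercased set of the words in a string
    return set(re.findall(r'\w+', text.lower()))

def simple_lesk(context_sentence, glosses):
    # Return the sense whose definition shares the most words with the context
    context = word_set(context_sentence)
    return max(glosses, key=lambda sense: len(context & word_set(glosses[sense])))

sense_a = simple_lesk('Sing in a lower tone, along with the bass', toy_glosses)
sense_b = simple_lesk('He caught a saltwater fish, a big bass', toy_glosses)
print(sense_a)
print(sense_b)
```

The first decision rests on function words such as 'with' and 'the', which shows how fragile plain word overlap can be; when no words overlap at all, the choice comes down to how ties are broken.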
