**Getting Started with Natural Language Processing.**

Let's get started with a sample project. 

Start by importing the nltk library. To make sure you have all the packages, run nltk.download() and select "all" in the downloader window.

In [None]:
'''
from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir('/content/drive/MyDrive/FTDS Student Material/04MLApplications/03NLP/01NLPIntro_Auto-Summarization')
'''

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [1]:
#!pip3 install nltk
import nltk


In [3]:
nltk.download()
# Opens the NLTK downloader; fetching every package takes a long time

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True


## Tokenize
Let's start with two short sentences of text.

**Q1: Start by tokenizing the text into sentences using nltk.**

In [2]:
text="I live in Hong Kong. Hong Kong is an amazing city"

from nltk.tokenize import word_tokenize, sent_tokenize
# sent_tokenize separates the text into sentences
sents=sent_tokenize(text)
print(sents)

['I live in Hong Kong.', 'Hong Kong is an amazing city']


In [4]:
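# sent_tokenize and word_tokenize rely on the 'punkt' models; run this once if tokenizing raises a LookupError
# (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')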


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**Q2: Split each tokenized sentence into words and print them.**

In [4]:
#word_tokenize is used to split the sentences into words
words=[word_tokenize(sent) for sent in sents]
print(words)

[['I', 'live', 'in', 'Hong', 'Kong', '.'], ['Hong', 'Kong', 'is', 'an', 'amazing', 'city']]


**Q3: Now it's time to remove stopwords and punctuation. Start by importing the required packages.**

In [5]:
# Stop words are common words (such as "the" or "is") that are filtered out before or after processing natural language data.
# We do not want these words taking up space in our database or valuable processing time, so we remove them using a list of words we consider to be stop words.
from nltk.corpus import stopwords
from string import punctuation
# Combine NLTK's English stop words with every punctuation character
customStopWords=stopwords.words('english')+list(punctuation)
print(customStopWords)
# You can also extend the list manually with your own stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

**Q4: Now print out the tokenized words without punctuation and stopwords.**

In [8]:
wordsWOStopwords=[word for word in word_tokenize(text) if word not in customStopWords]
print(wordsWOStopwords)

['I', 'live', 'Hong', 'Kong', 'Hong', 'Kong', 'amazing', 'city']
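
Note that 'I' survives the filter: the stop word list is all lowercase, and the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows. The token list is copied from the word_tokenize output above, and stop_words is a small illustrative subset (an assumption) so the snippet runs without NLTK data; in the notebook you would use customStopWords instead.

```python
from string import punctuation

# Tokens as printed by word_tokenize(text) earlier, hardcoded for this sketch
tokens = ['I', 'live', 'in', 'Hong', 'Kong', '.',
          'Hong', 'Kong', 'is', 'an', 'amazing', 'city']

# Illustrative subset of the stop word list (assumption), plus punctuation
stop_words = {'i', 'in', 'is', 'an'} | set(punctuation)

# Case-sensitive test, as in the cell above: 'I' is kept
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercase each token only for the comparison; the original casing is kept in the result
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only inside the test keeps proper nouns such as 'Hong' and 'Kong' capitalized in the result.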


## Ngram
**Q5: Now it's time to use Bigram Association measures and print the ngram items.**

In [9]:
# Count pairs of adjacent words (bigrams) to see which pairs occur together more often than others.
# NLTK's collocation facilities can then score the pairs; the higher-scoring pairs are the more likely collocations.
from nltk.collocations import BigramCollocationFinder

finder = BigramCollocationFinder.from_words(wordsWOStopwords)

# frequency of each pair
print(finder.ngram_fd.items())

# sorted() orders the pairs alphabetically (capitals first), not by frequency
sorted(finder.ngram_fd.items())

dict_items([(('I', 'live'), 1), (('live', 'Hong'), 1), (('Hong', 'Kong'), 2), (('Kong', 'Hong'), 1), (('Kong', 'amazing'), 1), (('amazing', 'city'), 1)])


[(('Hong', 'Kong'), 2),
 (('I', 'live'), 1),
 (('Kong', 'Hong'), 1),
 (('Kong', 'amazing'), 1),
 (('amazing', 'city'), 1),
 (('live', 'Hong'), 1)]
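
The cell above only counts the pairs. To score them with an association measure, something along these lines should work. This is a sketch, not the notebook's own solution: it uses raw_freq (bigram count divided by the number of tokens) as the simplest measure, and the word list is copied from the Q4 output so the snippet stands alone.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Tokens left after stop word removal (copied from the Q4 output)
words = ['I', 'live', 'Hong', 'Kong', 'Hong', 'Kong', 'amazing', 'city']

finder = BigramCollocationFinder.from_words(words)
measures = BigramAssocMeasures()

# score_ngrams returns (bigram, score) pairs, highest score first
scored = finder.score_ngrams(measures.raw_freq)
print(scored[0])  # (('Hong', 'Kong'), 0.25): 2 occurrences out of 8 tokens

# nbest returns only the top-n bigrams under the chosen measure
top = finder.nbest(measures.raw_freq, 1)
print(top)  # [('Hong', 'Kong')]
```

Other measures on BigramAssocMeasures, such as pmi, can be passed in the same way and may rank the pairs differently.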

## Stemming OR Lemmatization

**Q6: Now take another line of text and print the stemmed words using LancasterStemmer.**

In [10]:
# Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form
text2 = "Our feet are wet. My foot is wet.  "

from nltk.stem.lancaster import LancasterStemmer
st=LancasterStemmer()
stemmedWords=[st.stem(word) for word in word_tokenize(text2)]
print(stemmedWords)

['our', 'feet', 'ar', 'wet', '.', 'my', 'foot', 'is', 'wet', '.']


In [11]:
# Lemmatization is the process of converting a word to its base (dictionary) form.
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
text2 = "Our feet are wet. My foot is wet.  "

# WordNet is a large, freely and publicly available lexical database for the English language.
# Its aim is to establish structured semantic relationships between words.
# lemmatize() treats every word as a noun unless a pos argument is given.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(word) for word in word_tokenize(text2)] )

['Our', 'foot', 'are', 'wet', '.', 'My', 'foot', 'is', 'wet', '.']


**Q7: Tokenize the words and tag each one with its part of speech (POS).**

In [12]:
# POS stands for part of speech.
# nltk.pos_tag() returns a list of (word, tag) tuples. The key here is to map NLTK's POS tags to the format the WordNet lemmatizer accepts.
nltk.pos_tag(word_tokenize(text2))

# e.g. JJ = adjective

[('Our', 'PRP$'),
 ('feet', 'NNS'),
 ('are', 'VBP'),
 ('wet', 'JJ'),
 ('.', '.'),
 ('My', 'PRP$'),
 ('foot', 'NN'),
 ('is', 'VBZ'),
 ('wet', 'JJ'),
 ('.', '.')]
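
The mapping from Penn Treebank tags to WordNet POS letters could be sketched as follows. penn_to_wordnet is a hypothetical helper (not part of NLTK), and the tagged list is copied from the output above so the snippet runs on its own. The letters 'a', 'v', 'r' and 'n' are the values WordNet uses for adjective, verb, adverb and noun.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # everything else falls back to noun

# (word, tag) pairs as printed by nltk.pos_tag above
tagged = [('Our', 'PRP$'), ('feet', 'NNS'), ('are', 'VBP'), ('wet', 'JJ'), ('.', '.'),
          ('My', 'PRP$'), ('foot', 'NN'), ('is', 'VBZ'), ('wet', 'JJ'), ('.', '.')]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

In the notebook, the mapped letter can be passed on as lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)). With the verb tag supplied, forms such as "are" and "is" should reduce to "be" instead of being left unchanged.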

**Q8: Now import wordnet and print out the synsets.**

In [13]:
# WordNet is a lexical database for the English language.
# It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, 
# and records a number of relations among these synonym sets or their members. 

# In that sense it works like a dictionary

for ss in wn.synsets('bass'):
    print(ss, ss.definition())

Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range


**Q9: It's time to play around with lesk now. Import it and tokenize a new sentence to find the matching Synset.**

*Suggested: Sing in a lower tone, along with the bass*

In [14]:
from nltk.wsd import lesk
# Lesk is a classical algorithm for word sense disambiguation, available in the nltk.wsd module.
# The Lesk algorithm is based on the assumption that words in a given "neighborhood" (section of text) will tend to share a common topic.
sense1 = lesk(word_tokenize("Sing in a lower tone, along with the bass"),'bass')
print(sense1, sense1.definition())

Synset('bass.n.07') the member with the lowest range of a family of musical instruments
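
To see why this sense wins, here is a simplified sketch of the overlap idea behind Lesk: pick the sense whose definition shares the most words with the context. simple_lesk is a hypothetical helper written for illustration, not NLTK's implementation, and the glosses are copied from the wn.synsets('bass') output above so the snippet needs no WordNet data.

```python
def simple_lesk(context_tokens, glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Definitions copied from the Q8 output
glosses = {
    'bass.n.01': 'the lowest part of the musical range',
    'bass.n.02': 'the lowest part in polyphonic music',
    'bass.n.03': 'an adult male singer with the lowest voice',
    'sea_bass.n.01': 'the lean flesh of a saltwater fish of the family Serranidae',
    'freshwater_bass.n.01': 'any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)',
    'bass.n.06': 'the lowest adult male singing voice',
    'bass.n.07': 'the member with the lowest range of a family of musical instruments',
    'bass.n.08': 'nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes',
    'bass.s.01': 'having or denoting a low vocal or instrumental range',
}

# Tokens of the suggested sentence
context = ['Sing', 'in', 'a', 'lower', 'tone', ',', 'along', 'with', 'the', 'bass']
best = simple_lesk(context, glosses)
print(best)  # bass.n.07
```

In this sketch the winning overlap consists of "the", "with" and "a", so the choice is driven by function words, not by topic. Removing stop words from the context before calling lesk may be worth trying.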


**Q10: Try it again with a different sentence.**

In [15]:
sense2 = lesk(word_tokenize("This sea bass was really hard to catch"),'bass')
print(sense2, sense2.definition())

Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae


**I like to code. The code should be clean. What is the code of conduct?**




In [None]:
#This blew my mind in high school, and I wasn’t the only one.

In [17]:
from nltk.wsd import lesk

sense2 = lesk(word_tokenize("Because a rose is a rose is a rose is a rose."),'rose')
print(sense2, sense2.definition())

Synset('rose.s.01') of something having a dusty purplish pink color


In [5]:
from nltk.tokenize import word_tokenize, sent_tokenize
text1 = 'I am planning to visit Wan Chai to attend Tech Wan Chai Hackathon'
nltk.pos_tag(word_tokenize(text1))

[('I', 'PRP'),
 ('am', 'VBP'),
 ('planning', 'VBG'),
 ('to', 'TO'),
 ('visit', 'VB'),
 ('Wan', 'NNP'),
 ('Chai', 'NNP'),
 ('to', 'TO'),
 ('attend', 'VB'),
 ('Tech', 'NNP'),
 ('Wan', 'NNP'),
 ('Chai', 'NNP'),
 ('Hackathon', 'NNP')]
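
One simple use of these tags is to pull out candidate names by joining runs of consecutive proper nouns (NNP). The snippet below is a rough sketch, not a full named-entity recognizer; the tagged list is copied from the output above.

```python
from itertools import groupby

# (word, tag) pairs as printed by nltk.pos_tag above
tagged = [('I', 'PRP'), ('am', 'VBP'), ('planning', 'VBG'), ('to', 'TO'),
          ('visit', 'VB'), ('Wan', 'NNP'), ('Chai', 'NNP'), ('to', 'TO'),
          ('attend', 'VB'), ('Tech', 'NNP'), ('Wan', 'NNP'), ('Chai', 'NNP'),
          ('Hackathon', 'NNP')]

# groupby splits the list into runs of NNP and non-NNP tokens; keep only the NNP runs
phrases = [' '.join(word for word, _ in run)
           for is_nnp, run in groupby(tagged, key=lambda pair: pair[1] == 'NNP')
           if is_nnp]

print(phrases)  # ['Wan Chai', 'Tech Wan Chai Hackathon']
```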

In [6]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or