## Natural Language Processing

Applications:

- Text Classification
- Information Extraction
- Conversational Agent
- Information Retrieval
- QA Systems
- Text Summarization
- Topic Modelling
- Language Translation

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 

In [2]:
Image(url= "nlptasks.png")

In [3]:
Image(url= "buildingblocks.png")

## NLP Pipeline

In [4]:
Image(url= "nlp_pipeline.png")

#### Data Acquisition

- Public Dataset
- Web Scraping
- Enterprise Dataset


#### Text Cleaning

- Parsing (HTML)
- Unicode Normalization
- Spelling Correction
- System-Specific Error Correction
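
As a rough illustration of the Unicode normalization step, the sketch below uses only the standard library. The helper name `normalize_text` and the sample string are assumptions made for this example, and the explicit quote mapping is one possible choice rather than a required part of the step.

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Apply NFKC normalization, then map curly quotes to ASCII quotes."""
    # NFKC folds compatibility characters (ligatures, full-width letters)
    # into their plain equivalents.
    normalized = unicodedata.normalize("NFKC", text)
    # Curly quotes are left untouched by NFKC, so they are mapped explicitly.
    quote_map = str.maketrans({
        "\u2018": "'", "\u2019": "'",
        "\u201c": '"', "\u201d": '"',
    })
    return normalized.translate(quote_map)

# Sample with an "fi" ligature, curly single quotes and full-width letters
sample = "\ufb01nd the \u2018\ufb01le\u2019 in \uff30\uff59\uff54\uff48\uff4f\uff4e"
print(normalize_text(sample))
```

Normalizing before tokenization helps keep visually identical strings from being treated as different tokens later in the pipeline.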


In [6]:
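from bs4 import BeautifulSoup
from urllib.request import urlopen

# Download the page and parse it with the built-in HTML parser
myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"
html = urlopen(myurl).read()
soupified = BeautifulSoup(html, "html.parser")

# Locate the question container, then the post body inside it
question = soupified.find("div", {"class": "question"})
questiontext = question.find("div", {"class": "s-prose js-post-body"})
# Uncomment to print only the cleaned question text:
# print("Question: \n", questiontext.getText().strip())
print(question)  # prints the raw HTML of the question block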


<div class="question" data-questionid="415511" id="question">
<style>
</style>
<div class="js-zone-container zone-container-main">
<div class="everyonelovesstackoverflow everyoneloves__top-leaderboard everyoneloves__leaderboard" id="dfp-tlb"></div>
<div class="js-report-ad-button-container" style="width: 728px"></div>
</div>
<div class="post-layout">
<div class="votecell post-layout--left">
<div class="js-voting-container grid jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="415511">
<button aria-label="Up vote" aria-pressed="false" class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer" data-controller="s-tooltip" data-s-tooltip-placement="right" data-selected-classes="fc-theme-primary" title="This question shows research effort; it is useful and clear"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" height="36" viewbox="0 0 36 36" width="36"><path d="M2 26h32L18 10 2 26z"></path></svg></button>
<div class="js-vote-count grid--cell fc-black-500 fs-title 

In [7]:
from PIL import Image
from pytesseract import image_to_string
filename = "scanned_text.png"
text = image_to_string(Image.open(filename))
print(text)

in the nineteenth century the only Kind of linguistics considered
seriously was this comparative and historical study of words in languages
known or believed to be cognate—say the Semitic languages, or the Indo-
European languages. It is significant that the Germans who really made
the subject what it was, used the term Indo-germanisch. Those who know
the popular works of Otto Jespersen will remember how fitmly he
declares that linguistic science is historical. And those who have noticed



### Pre-Processing

Preliminaries
- Sentence segmentation and word tokenization.

Frequent steps
- Stop word removal, stemming and lemmatization, removing digits/punctuation, lowercasing, etc.

Other steps
- Normalization, language detection, code mixing, transliteration, etc.

Advanced processing
- POS tagging, parsing, coreference resolution, etc.

#### Preliminaries

In [8]:
## SENTENCE SEGMENTATION

from nltk.tokenize import sent_tokenize, word_tokenize

mytext = """A letter of interest is a document used to get your name
in front of hiring managers at organizations at which you’re interested in working, 
but there are currently no open roles that fit your qualifications. 
This letter has also been referred to as a ‘letter of intent,’ 
‘statement of interest’ or ‘letter of inquiry’. 
It contains a broad statement indicating that you’re intrigued 
by a company and hope to find opportunities there. From there, 
the hiring manager or recruiter can see if any current or 
upcoming open roles are in alignment with your skills and experience."""

my_sentences = sent_tokenize(mytext)


In [9]:
print(my_sentences)

['A letter of interest is a document used to get your name\nin front of hiring managers at organizations at which you’re interested in working, \nbut there are currently no open roles that fit your qualifications.', 'This letter has also been referred to as a ‘letter of intent,’ \n‘statement of interest’ or ‘letter of inquiry’.', 'It contains a broad statement indicating that you’re intrigued \nby a company and hope to find opportunities there.', 'From there, \nthe hiring manager or recruiter can see if any current or \nupcoming open roles are in alignment with your skills and experience.']
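
To see why a dedicated sentence tokenizer is useful, here is a deliberately naive splitter that breaks on sentence-ending punctuation followed by whitespace. The function name `naive_sent_split` and the sample sentence are made up for this sketch; it is not how `sent_tokenize` works internally.

```python
import re

def naive_sent_split(text: str) -> list:
    # Split wherever '.', '!' or '?' is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "Dr. Smith arrived at 9 a.m. He left early."
for sentence in naive_sent_split(sample):
    print(repr(sentence))
```

The rule wrongly treats "Dr." as a complete sentence, which is the kind of abbreviation problem that trained tokenizers are designed to reduce.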


In [89]:
### WORD TOKENIZATION

In [10]:
for sentence in my_sentences:
    print(sentence)
    print(word_tokenize(sentence))

A letter of interest is a document used to get your name
in front of hiring managers at organizations at which you’re interested in working, 
but there are currently no open roles that fit your qualifications.
['A', 'letter', 'of', 'interest', 'is', 'a', 'document', 'used', 'to', 'get', 'your', 'name', 'in', 'front', 'of', 'hiring', 'managers', 'at', 'organizations', 'at', 'which', 'you', '’', 're', 'interested', 'in', 'working', ',', 'but', 'there', 'are', 'currently', 'no', 'open', 'roles', 'that', 'fit', 'your', 'qualifications', '.']
This letter has also been referred to as a ‘letter of intent,’ 
‘statement of interest’ or ‘letter of inquiry’.
['This', 'letter', 'has', 'also', 'been', 'referred', 'to', 'as', 'a', '‘', 'letter', 'of', 'intent', ',', '’', '‘', 'statement', 'of', 'interest', '’', 'or', '‘', 'letter', 'of', 'inquiry', '’', '.']
It contains a broad statement indicating that you’re intrigued 
by a company and hope to find opportunities there.
['It', 'contains', 'a', 'bro

#### Frequent Steps

In [91]:
### Remove Stop words and convert to lower case

In [11]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) 

In [12]:
word_tokens = word_tokenize(mytext.lower())

# Keep only the tokens that are not in the stop word list
filtered_sentence = [w for w in word_tokens if w not in stop_words]

In [14]:
print(word_tokens)


['a', 'letter', 'of', 'interest', 'is', 'a', 'document', 'used', 'to', 'get', 'your', 'name', 'in', 'front', 'of', 'hiring', 'managers', 'at', 'organizations', 'at', 'which', 'you', '’', 're', 'interested', 'in', 'working', ',', 'but', 'there', 'are', 'currently', 'no', 'open', 'roles', 'that', 'fit', 'your', 'qualifications', '.', 'this', 'letter', 'has', 'also', 'been', 'referred', 'to', 'as', 'a', '‘', 'letter', 'of', 'intent', ',', '’', '‘', 'statement', 'of', 'interest', '’', 'or', '‘', 'letter', 'of', 'inquiry', '’', '.', 'it', 'contains', 'a', 'broad', 'statement', 'indicating', 'that', 'you', '’', 're', 'intrigued', 'by', 'a', 'company', 'and', 'hope', 'to', 'find', 'opportunities', 'there', '.', 'from', 'there', ',', 'the', 'hiring', 'manager', 'or', 'recruiter', 'can', 'see', 'if', 'any', 'current', 'or', 'upcoming', 'open', 'roles', 'are', 'in', 'alignment', 'with', 'your', 'skills', 'and', 'experience', '.']


In [15]:
print(filtered_sentence) 

['letter', 'interest', 'document', 'used', 'get', 'name', 'front', 'hiring', 'managers', 'organizations', '’', 'interested', 'working', ',', 'currently', 'open', 'roles', 'fit', 'qualifications', '.', 'letter', 'also', 'referred', '‘', 'letter', 'intent', ',', '’', '‘', 'statement', 'interest', '’', '‘', 'letter', 'inquiry', '’', '.', 'contains', 'broad', 'statement', 'indicating', '’', 'intrigued', 'company', 'hope', 'find', 'opportunities', '.', ',', 'hiring', 'manager', 'recruiter', 'see', 'current', 'upcoming', 'open', 'roles', 'alignment', 'skills', 'experience', '.']
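
The filtered list still contains punctuation tokens such as `','`, `'.'` and the curly quotes. One possible way to drop them, sketched with the standard library only, is to test each token's Unicode category. The helper `is_punctuation` and the shortened token list are assumptions for this example.

```python
import unicodedata

def is_punctuation(token: str) -> bool:
    # True when every character belongs to a Unicode punctuation category (P*).
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

# A short excerpt of the filtered tokens shown above
tokens = ['letter', 'interest', '’', 'working', ',', 'qualifications', '.', '‘', 'intent']
content_tokens = [t for t in tokens if not is_punctuation(t)]
print(content_tokens)
```

Checking Unicode categories also catches the curly quotes, which `string.punctuation` would miss because it only covers ASCII characters.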


In [95]:
### STEMMING AND LEMMATIZATION

In [16]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
word1 = "programs"
word2 = "implementation"
print(stemmer.stem(word1), stemmer.stem(word2))

program implement


In [18]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a")) #a is for adjective

good


In [19]:
import spacy
sp = spacy.load('en_core_web_sm')
doc = sp('better')  # a Doc holding a single token
for word in doc:
    print(word.text, word.lemma_)

better well


#### Other Pre-Processing Steps

- Text Normalization
- Language Detection
- Code Mixing and Transliteration
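
As a crude illustration of spotting code-mixed text, the sketch below tags each token with the script of its letters, using Unicode character names. The function `token_script` and the sample sentence are assumptions for this example. This is a heuristic only: it cannot tell languages apart when they share a script, and it will not catch text transliterated into Latin characters.

```python
import unicodedata
from collections import Counter

def token_script(token: str) -> str:
    # The first word of a character's Unicode name is usually its script,
    # e.g. "LATIN SMALL LETTER A" or "DEVANAGARI LETTER MA".
    scripts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in token
        if ch.isalpha()
    )
    return scripts.most_common(1)[0][0] if scripts else "OTHER"

# Hindi sentence with an English word mixed in
sentence = "मुझे pizza पसंद है"
tagged = [(tok, token_script(tok)) for tok in sentence.split()]
print(tagged)
```

A sentence whose tokens map to more than one script is a candidate for code-mixing handling before it goes further down the pipeline.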

### Advanced Processing

- Part-of-Speech (POS) Tagging
- Named Entity Recognition (NER)
- Relation Extraction
- Coreference Resolution

#### Part-of-Speech Tagging

In [99]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Charles Spencer Chaplin was born on 16 April 1889 to Hannah Chaplin')
for token in doc:
    print(token.text, token.lemma_, token.pos_,
          token.shape_, token.is_alpha, token.is_stop)

Charles Charles PROPN Xxxxx True False
Spencer Spencer PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
was be AUX xxx True True
born bear VERB xxxx True False
on on ADP xx True True
16 16 NUM dd False False
April April PROPN Xxxxx True False
1889 1889 NUM dddd False False
to to ADP xx True True
Hannah Hannah PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False


In [100]:
doc = nlp('Chaplin wrote, directed and composed music for most of his movies')
for token in doc:
    print(token.text, token.lemma_, token.pos_,
          token.shape_, token.is_alpha, token.is_stop)

Chaplin Chaplin PROPN Xxxxx True False
wrote write VERB xxxx True False
, , PUNCT , False False
directed direct VERB xxxx True False
and and CCONJ xxx True True
composed compose VERB xxxx True False
music music NOUN xxxx True False
for for ADP xxx True True
most most ADJ xxxx True True
of of ADP xx True True
his -PRON- DET xxx True True
movies movie NOUN xxxx True False
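
Once tokens carry POS tags, simple downstream features become easy to compute. The sketch below works on (text, lemma, POS) triples copied by hand from the output above, so it runs without spaCy. The variable names are assumptions for this example.

```python
from collections import Counter

# (text, lemma, pos) triples copied from the spaCy output above
tagged = [
    ("Chaplin", "Chaplin", "PROPN"), ("wrote", "write", "VERB"),
    (",", ",", "PUNCT"), ("directed", "direct", "VERB"),
    ("and", "and", "CCONJ"), ("composed", "compose", "VERB"),
    ("music", "music", "NOUN"), ("for", "for", "ADP"),
    ("most", "most", "ADJ"), ("of", "of", "ADP"),
    ("his", "-PRON-", "DET"), ("movies", "movie", "NOUN"),
]

# Count how often each POS tag occurs in the sentence
pos_counts = Counter(pos for _, _, pos in tagged)

# Collect the lemmas of all verbs, a simple proxy for the actions described
verb_lemmas = [lemma for _, lemma, pos in tagged if pos == "VERB"]

print(pos_counts["VERB"], pos_counts["NOUN"])
print(verb_lemmas)
```

With a live spaCy `doc`, the same triples could be built from `token.text`, `token.lemma_` and `token.pos_`, the attributes already used in the cells above.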


In [103]:
# Re-import IPython's Image, since the earlier PIL import reused the name Image
from IPython.display import Image
Image(url="nlp_feature_engineering.png")