# NLP (Natural Language Processing)

Natural languages (English, Japanese, etc.) differ from computer languages: they evolve and change over time.
In NLP we are concerned with how to program computers to process and analyze natural language data (text).
NLP is a field of computer science and artificial intelligence with roots in linguistics.

# Why?
- We have large volumes of unstructured text data
    + How can we apply statistical analysis or machine learning to extract useful insights from it?
- Most data analysis techniques expect numeric input
    + We need specialized techniques such as NLP to turn text into something they can work with

# Applications
- Machine translation (Google translate)
- Speech Recognition Systems (Smart assistants)
- Question Answering Systems (Autodiagnostic on company websites)
- Text summarization, categorization/classification/clustering
    - Sentiment analysis
    - Chatbots
    - Spam detection



Building any of the above applications is a fairly involved process, because text is free-flowing, unstructured data.
It requires
- cleaning (misspelled text, duplicates, removing stop words),
- tokenization: splitting text into a list of words,
- POS tagging, stemming, lemmatization, and
- conversion to word vectors before applying any machine learning or statistical technique.

- POS tagging: each word gets a POS tag indicating its part of speech.
  + Here is a list of the [Penn POS tags](https://www.clips.uantwerpen.be/pages/mbsp-tags)
  + Here is a demo: [Parts-of-speech.Info](https://parts-of-speech.info/)
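The steps listed above can be sketched end to end with nothing but the standard library. This is a minimal illustration, not what nltk does internally: the stop-word set, the `preprocess` and `to_count_vector` helpers, and the two sample documents are all made up for the example, and the "word vector" here is a plain bag-of-words count.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration
STOP_WORDS = {"a", "the", "is", "of", "and"}

def preprocess(text):
    """Lowercase the text, tokenize on word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_count_vector(tokens, vocabulary):
    """Turn a token list into a list of counts aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

docs = ["The price of milk is high", "Milk and honey"]
processed = [preprocess(d) for d in docs]

# The vocabulary is the sorted set of all remaining tokens
vocabulary = sorted({t for doc in processed for t in doc})
vectors = [to_count_vector(doc, vocabulary) for doc in processed]

print(vocabulary)
print(vectors)
```

Once every document is a numeric vector of the same length, the usual statistical and machine learning tools can be applied to it.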



# Stemming and lemmatization
In NLP, stemming and lemmatization are text normalization techniques that prepare text (words, sentences, etc.) for further processing. In web search and information retrieval they are commonly used to increase recall. Both are usually applied after the text has been tokenized.

- Stem: a stem is the part of a word to which [inflectional](https://en.wikipedia.org/wiki/Inflection) affixes **(ed, ing, ize, s, de)** can be attached. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root, which may not itself be a word. For example, the [Porter stemmer](https://tartarus.org/martin/PorterStemmer/) reduces both apple and apples to appl.

- Lemma: a lemma is the base form of a set of words and is itself a word of the language; for example, the lemma of geese is goose.
    + Note: the Porter stemmer would stem these words to gees and goos. Lemmatization (a more careful approach to removing inflections) is the process of finding the base form, typically with the help of a lexical knowledge base like [WordNet](https://wordnet.princeton.edu/).

Stemming is not perfect. The Porter stemmer reduces both meanness and meaning to mean, creating a false equivalence.
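To see how such a false equivalence arises, here is a deliberately naive suffix stripper. It is not the Porter algorithm; the `naive_stem` function and its suffix list are assumptions made for this sketch. Grouping words by their stem shows unrelated words landing in the same bucket.

```python
from collections import defaultdict

def naive_stem(word, suffixes=("ness", "ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["meanness", "meaning", "means", "apples", "apple"]

# Group the words by the stem they are reduced to
groups = defaultdict(list)
for w in words:
    groups[naive_stem(w)].append(w)

print(dict(groups))
```

Here "meanness", "meaning" and "means" all collapse to "mean" even though they are different words with different senses, which is the kind of collision a lemmatizer with lexical knowledge tries to avoid.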


Let's look at some available text data (text corpora).
# Text corpora

A text corpus is a large collection of written or spoken text. It usually comes with associated metadata.


# Some popular corpora
- Brown Corpus: This was the first million-word corpus for the English language. It was compiled by W. N. Francis and H. Kucera at Brown University (the README shown later dates it to 1964) from American English texts printed in 1961.
- WordNet: This corpus is a semantic-oriented lexical database for the English language. It was created at Princeton University.
- Penn Treebank: This corpus consists of tagged and parsed English sentences including annotations like POS tags and grammar-based parse trees.
- Google N-gram Corpus: The Google N-gram Corpus consists of over a trillion words from various sources including books, web pages etc.
- Web, chat, email, tweets: We can gather this kind of textual data from social media.



# Some popular frameworks for text analysis
- nltk: the Natural Language Toolkit
- gensim: The gensim library has a rich set of capabilities for semantic analysis, including topic modeling and similarity analysis.
- textblob: text processing, phrase extraction, classification, POS tagging, and sentiment analysis
- spacy: claims to provide industrial-strength NLP capabilities by providing the best implementation of each technique and algorithm

# Let's use nltk to access some corpora

In [1]:
# If you are on a Mac you may also need:
# xcode-select --install
# This installs the command-line developer tools needed to build the C dependencies of the regex package used by nltk


# To install the nltk library:
# !pip install nltk

In [1]:
import nltk

In [2]:
# Download the nltk data files (corpora, models); 'all' fetches every package, which can take a while
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_d

[nltk_data]    |   Package moses_sample is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package mte_teip5 to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package mte_teip5 is already up-to-date!
[nltk_data]    | Downloading package mwa_ppdb to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package mwa_ppdb is already up-to-date!
[nltk_data]    | Downloading package names to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Downloading package nombank.1.0 to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package nombank.1.0 is already up-to-date!
[nltk_data]    | Downloading package nonbreaking_p

[nltk_data]    |   Package toolbox is already up-to-date!
[nltk_data]    | Downloading package treebank to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package udhr to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package udhr is already up-to-date!
[nltk_data]    | Downloading package udhr2 to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package udhr2 is already up-to-date!
[nltk_data]    | Downloading package unicode_samples to
[nltk_data]    |     C:\Users\ddalton\AppData\Roaming\nltk_data...
[nltk_data]    |   Package unicode_samples is already up-to-date!
[nltk_data]    | Downloading package universal_tagset

True

# Accessing the Brown Corpus

In [3]:
from nltk.corpus import brown
brown.readme()

'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'

In [4]:
brown.categories()  # genres (categories) into which the corpus texts are grouped

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

brown.words() returns the given file(s) as a list of words and punctuation symbols.

In [12]:
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [14]:
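words = brown.words()
# Caution: ' '.join(word) joins the characters of each word, so 'The' becomes 'T h e'.
# To rebuild running text from the tokens, use ' '.join(words[:20]) instead.
[' '.join(word) for word in words]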

['T h e',
 'F u l t o n',
 'C o u n t y',
 'G r a n d',
 'J u r y',
 's a i d',
 'F r i d a y',
 'a n',
 'i n v e s t i g a t i o n',
 'o f',
 "A t l a n t a ' s",
 'r e c e n t',
 'p r i m a r y',
 'e l e c t i o n',
 'p r o d u c e d',
 '` `',
 'n o',
 'e v i d e n c e',
 "' '",
 't h a t',
 'a n y',
 'i r r e g u l a r i t i e s',
 't o o k',
 'p l a c e',
 '.',
 'T h e',
 'j u r y',
 'f u r t h e r',
 's a i d',
 'i n',
 't e r m - e n d',
 'p r e s e n t m e n t s',
 't h a t',
 't h e',
 'C i t y',
 'E x e c u t i v e',
 'C o m m i t t e e',
 ',',
 'w h i c h',
 'h a d',
 'o v e r - a l l',
 'c h a r g e',
 'o f',
 't h e',
 'e l e c t i o n',
 ',',
 '` `',
 'd e s e r v e s',
 't h e',
 'p r a i s e',
 'a n d',
 't h a n k s',
 'o f',
 't h e',
 'C i t y',
 'o f',
 'A t l a n t a',
 "' '",
 'f o r',
 't h e',
 'm a n n e r',
 'i n',
 'w h i c h',
 't h e',
 'e l e c t i o n',
 'w a s',
 'c o n d u c t e d',
 '.',
 'T h e',
 'S e p t e m b e r - O c t o b e r',
 't e r m',
 'j u 

We can access the tokenized sentences:

brown.sents() returns the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

In [5]:
sentences = brown.sents(categories='adventure')

In [6]:
sentences

[['Dan', 'Morgan', 'told', 'himself', 'he', 'would', 'forget', 'Ann', 'Turner', '.'], ['He', 'was', 'well', 'rid', 'of', 'her', '.'], ...]

In [7]:
# Rebuild each sentence as a single string (tokens are joined with spaces, so punctuation stays separated)
[' '.join(sent) for sent in sentences]

['Dan Morgan told himself he would forget Ann Turner .',
 'He was well rid of her .',
 "He certainly didn't want a wife who was fickle as Ann .",
 "If he had married her , he'd have been asking for trouble .",
 'But all of this was rationalization .',
 'Sometimes he woke up in the middle of the night thinking of Ann , and then could not get back to sleep .',
 'His plans and dreams had revolved around her so much and for so long that now he felt as if he had nothing .',
 "The easiest thing would be to sell out to Al Budd and leave the country , but there was a stubborn streak in him that wouldn't allow it .",
 'The best antidote for the bitterness and disappointment that poisoned him was hard work .',
 'He found that if he was tired enough at night , he went to sleep simply because he was too exhausted to stay awake .',
 'Each day he found himself thinking less often of Ann ; ;',
 'each day the hurt was a little duller , a little less poignant .',
 'He had plenty of work to do .',
 'Bec

# POS tagged sentences

In [10]:
brown.sents(categories='humor')

[['It', 'was', 'among', 'these', 'that', 'Hinkle', 'identified', 'a', 'photograph', 'of', 'Barco', '!', '!'], ['For', 'it', 'seems', 'that', 'Barco', ',', 'fancying', 'himself', 'a', "ladies'", 'man', '(', 'and', 'why', 'not', ',', 'after', 'seven', 'marriages', '?', '?'], ...]

In [16]:
# tagged_sents() returns the given file(s) as a list of sentences, each encoded as a list of (word, tag) tuples.
tagged_sents = brown.tagged_sents(categories='humor')
tagged_sents

[[('It', 'PPS'), ('was', 'BEDZ'), ('among', 'IN'), ('these', 'DTS'), ('that', 'CS'), ('Hinkle', 'NP'), ('identified', 'VBD'), ('a', 'AT'), ('photograph', 'NN'), ('of', 'IN'), ('Barco', 'NP'), ('!', '.'), ('!', '.')], [('For', 'CS'), ('it', 'PPS'), ('seems', 'VBZ'), ('that', 'CS'), ('Barco', 'NP'), (',', ','), ('fancying', 'VBG'), ('himself', 'PPL'), ('a', 'AT'), ("ladies'", 'NNS$'), ('man', 'NN'), ('(', '('), ('and', 'CC'), ('why', 'WRB'), ('not', '*'), (',', ','), ('after', 'IN'), ('seven', 'CD'), ('marriages', 'NNS'), ('?', '.'), ('?', '.')], ...]

In [21]:
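# List the Penn Treebank part-of-speech tags.
# Note: the Brown corpus output above uses the Brown tagset (PPS, BEDZ, AT, ...); see nltk.help.brown_tagset() for those.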
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

# Let's get the top nouns in humor and see their distribution

In [17]:
from collections import Counter

# Iterate over the tagged sentences and keep only the words whose tag we are interested in
noun_tags = {'NN', 'NP', 'NNS'}  # NN - common noun, NP - proper noun, NNS - plural common noun
noun = []
for sent in tagged_sents:
    noun += [w for w, tag in sent if tag in noun_tags]

# Count the occurrences in our text
noun_counter = Counter(noun)

# Get the top 3:
noun_counter.most_common(3)

[('time', 43), ('Mr.', 36), ('way', 28)]
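The same filter-and-count idea works on any list of (word, tag) tuples. As a small self-contained sketch, the first tagged humor sentence from the output above is copied in by hand, so no corpus download is needed; the variable names are made up for the example.

```python
from collections import Counter

# First tagged sentence of the humor category, copied from the output above
tagged_sentence = [
    ('It', 'PPS'), ('was', 'BEDZ'), ('among', 'IN'), ('these', 'DTS'),
    ('that', 'CS'), ('Hinkle', 'NP'), ('identified', 'VBD'), ('a', 'AT'),
    ('photograph', 'NN'), ('of', 'IN'), ('Barco', 'NP'), ('!', '.'), ('!', '.'),
]

noun_tags = {'NN', 'NP', 'NNS'}

# Keep the words whose tag is one of the noun tags
nouns = [word for word, tag in tagged_sentence if tag in noun_tags]

# Count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged_sentence)

print(nouns)
print(tag_counts.most_common(3))
```

Counting tags instead of words gives a quick view of how a text is put together, for example how many prepositions or proper nouns it contains.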

# We can also use nltk's FreqDist

In [22]:
nouns_freq = nltk.FreqDist(noun)  # get frequency distribution of the different nouns
nouns_freq

FreqDist({'time': 43, 'Mr.': 36, 'way': 28, 'things': 27, 'Arlene': 24, 'man': 21, 'years': 21, 'children': 20, 'day': 19, 'people': 19, ...})

In [23]:
type(nouns_freq)

nltk.probability.FreqDist

In [24]:
# number of original documents (file ids) in the corpus
len(brown.fileids())

500

# WordNet

In [25]:
from nltk.corpus import wordnet

Look up a word using synsets(); this function has an optional pos argument which lets you constrain the part of speech of the word.  

Synset: a set of synonyms that share a common meaning.

In [35]:
word_synsets= wordnet.synsets('dog')
print(word_synsets)
print(wordnet.synsets('dog', pos=wordnet.VERB))  # only the synsets in which dog is a verb
print(wordnet.synset('dog.n.01').examples()[0])  # example usage of dog
print(wordnet.synset('dog.n.01').lemmas())  # print lemmas

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
[Synset('chase.v.01')]
the dog barked all night
[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]


In [37]:
[str(lemma.name()) for lemma in wordnet.synset('dog.n.01').lemmas()]

['dog', 'domestic_dog', 'Canis_familiaris']

In [29]:
print(wordnet.synset('dog.n.01').definition())

a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds


In [38]:
# Italian for dog: cane is masculine and cagna is feminine
wordnet.synset('dog.n.01').lemma_names('ita')

['cane', 'Canis_familiaris']

In [39]:
#We can extract the name of the synsets, definition and even examples
for synset in word_synsets:
    print('name:',synset.name())
    print('definition:',synset.definition())
    print('examples:',synset.examples())

name: dog.n.01
definition: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
examples: ['the dog barked all night']
name: frump.n.01
definition: a dull unattractive unpleasant girl or woman
examples: ['she got a reputation as a frump', "she's a real dog"]
name: dog.n.03
definition: informal term for a man
examples: ['you lucky dog']
name: cad.n.01
definition: someone who is morally reprehensible
examples: ['you dirty dog']
name: frank.n.02
definition: a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
examples: []
name: pawl.n.01
definition: a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
examples: []
name: andiron.n.01
definition: metal supports for logs in a fireplace
examples: ['the andirons were too hot to touch']
name: chase.v.01
definition: go after with the intent to catch
exa

# Text Preprocessing

# Tokenization

Tokenization means breaking down or splitting textual data into smaller meaningful components.
- Sentence tokenization
- Word tokenization

# Tokenizers in nltk
- sent_tokenize
- RegexpTokenizer

Check the documentation for other tokenizers.
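Before using the nltk tokenizers, it helps to see why sentence splitting is harder than it looks. The sketch below is a naive rule (split after `.`, `!` or `?` followed by whitespace); `naive_sent_split` is a hypothetical helper, not part of nltk. It works on simple text but breaks on abbreviations, which is the kind of case a trained model such as the Punkt tokenizer behind sent_tokenize is designed to handle.

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace (a naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

simple = "*Simpler* pricing and smarter coverage. It has unlimited call and text at $20!"
tricky = "Mr. Smith arrived. He sat down."

simple_sents = naive_sent_split(simple)
tricky_sents = naive_sent_split(tricky)

print(simple_sents)
print(tricky_sents)  # the abbreviation "Mr." is wrongly treated as a sentence
```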

In [40]:
import nltk
from nltk.corpus import gutenberg

In [41]:
edgeworth = gutenberg.raw(fileids='edgeworth-parents.txt')
edgeworth[0:1000]

'[The Parent\'s Assistant, by Maria Edgeworth]\r\n\r\n\r\nTHE ORPHANS.\r\n\r\nNear the ruins of the castle of Rossmore, in Ireland, is a small cabin,\r\nin which there once lived a widow and her four children.  As long as she\r\nwas able to work, she was very industrious, and was accounted the best\r\nspinner in the parish; but she overworked herself at last, and fell ill,\r\nso that she could not sit to her wheel as she used to do, and was obliged\r\nto give it up to her eldest daughter, Mary.\r\n\r\nMary was at this time about twelve years old.  One evening she was\r\nsitting at the foot of her mother\'s bed spinning, and her little brothers\r\nand sisters were gathered round the fire eating their potatoes and milk\r\nfor supper.  "Bless them, the poor young creatures!" said the widow, who,\r\nas she lay on her bed, which she knew must be her deathbed, was thinking\r\nof what would become of her children after she was gone.  Mary stopped\r\nher wheel, for she was afraid that the nois

In [42]:
edgeworth_sent_tkn = nltk.sent_tokenize(edgeworth)  # sentence tokenize: split the raw text into sentences (uses the Punkt sentence tokenizer)
edgeworth_sent_tkn

["[The Parent's Assistant, by Maria Edgeworth]\r\n\r\n\r\nTHE ORPHANS.",
 'Near the ruins of the castle of Rossmore, in Ireland, is a small cabin,\r\nin which there once lived a widow and her four children.',
 'As long as she\r\nwas able to work, she was very industrious, and was accounted the best\r\nspinner in the parish; but she overworked herself at last, and fell ill,\r\nso that she could not sit to her wheel as she used to do, and was obliged\r\nto give it up to her eldest daughter, Mary.',
 'Mary was at this time about twelve years old.',
 "One evening she was\r\nsitting at the foot of her mother's bed spinning, and her little brothers\r\nand sisters were gathered round the fire eating their potatoes and milk\r\nfor supper.",
 '"Bless them, the poor young creatures!"',
 'said the widow, who,\r\nas she lay on her bed, which she knew must be her deathbed, was thinking\r\nof what would become of her children after she was gone.',
 'Mary stopped\r\nher wheel, for she was afraid that

In [44]:
print('Total sentences  {}'.format(len(edgeworth_sent_tkn)))  # total count; the loop below prints the first three sentences with their index
for sidx, s in enumerate(edgeworth_sent_tkn[0:3]):
    print(sidx,"::", s, '\n')

Total sentences  10096
0 :: [The Parent's Assistant, by Maria Edgeworth]


THE ORPHANS. 

1 :: Near the ruins of the castle of Rossmore, in Ireland, is a small cabin,
in which there once lived a widow and her four children. 

2 :: As long as she
was able to work, she was very industrious, and was accounted the best
spinner in the parish; but she overworked herself at last, and fell ill,
so that she could not sit to her wheel as she used to do, and was obliged
to give it up to her eldest daughter, Mary. 



In [49]:
[nltk.word_tokenize(t) for t in nltk.sent_tokenize(edgeworth)]

[['[',
  'The',
  'Parent',
  "'s",
  'Assistant',
  ',',
  'by',
  'Maria',
  'Edgeworth',
  ']',
  'THE',
  'ORPHANS',
  '.'],
 ['Near',
  'the',
  'ruins',
  'of',
  'the',
  'castle',
  'of',
  'Rossmore',
  ',',
  'in',
  'Ireland',
  ',',
  'is',
  'a',
  'small',
  'cabin',
  ',',
  'in',
  'which',
  'there',
  'once',
  'lived',
  'a',
  'widow',
  'and',
  'her',
  'four',
  'children',
  '.'],
 ['As',
  'long',
  'as',
  'she',
  'was',
  'able',
  'to',
  'work',
  ',',
  'she',
  'was',
  'very',
  'industrious',
  ',',
  'and',
  'was',
  'accounted',
  'the',
  'best',
  'spinner',
  'in',
  'the',
  'parish',
  ';',
  'but',
  'she',
  'overworked',
  'herself',
  'at',
  'last',
  ',',
  'and',
  'fell',
  'ill',
  ',',
  'so',
  'that',
  'she',
  'could',
  'not',
  'sit',
  'to',
  'her',
  'wheel',
  'as',
  'she',
  'used',
  'to',
  'do',
  ',',
  'and',
  'was',
  'obliged',
  'to',
  'give',
  'it',
  'up',
  'to',
  'her',
  'eldest',
  'daughter',
  ',',
  'M

# Word Tokenization

Word tokenization converts a sentence into word tokens. It is typically done before stemming or lemmatizing.

- word_tokenize
- TreebankWordTokenizer: based on the Penn Treebank conventions; uses various regular expressions to tokenize the text.
- RegexpTokenizer

In [45]:
s = 'I just absolutely adore Denver and the Boulder area.'

In [46]:
nltk.word_tokenize(s)  # has punctuation

['I',
 'just',
 'absolutely',
 'adore',
 'Denver',
 'and',
 'the',
 'Boulder',
 'area',
 '.']

In [50]:
word_regex= nltk.RegexpTokenizer(pattern=r'\w+', gaps=False)  # no punctuation in the output
word_regex.tokenize(s)

['I', 'just', 'absolutely', 'adore', 'Denver', 'and', 'the', 'Boulder', 'area']

We can also get the start and end indices of each token:

In [51]:
list(word_regex.span_tokenize(s))  # index of start and end of each word

[(0, 1),
 (2, 6),
 (7, 17),
 (18, 23),
 (24, 30),
 (31, 34),
 (35, 38),
 (39, 46),
 (47, 51)]

In [52]:
#Using the start and end indices to extract the words :
[s[st:en] for st,en in word_regex.span_tokenize(s)]

['I', 'just', 'absolutely', 'adore', 'Denver', 'and', 'the', 'Boulder', 'area']
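The span idea is not specific to nltk. As a rough standard-library equivalent, `re.finditer` yields match objects whose `span()` gives the same (start, end) pairs for the `\w+` pattern; slicing the original string with those pairs recovers the tokens.

```python
import re

s = 'I just absolutely adore Denver and the Boulder area.'

# Each match object knows where it starts and ends in the original string
spans = [m.span() for m in re.finditer(r"\w+", s)]

# Slicing with the spans gives back the tokens themselves
words = [s[start:end] for start, end in spans]

print(spans)
print(words)
```

Keeping the spans is useful when results have to be mapped back to the original text, for example to highlight a matched word.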

There are other word tokenizer classes; check the documentation. To give you a flavour, here is WordPunctTokenizer, which tokenizes a sentence into separate alphabetic and non-alphabetic tokens.

In [53]:
wordpunkt_tkn = nltk.WordPunctTokenizer()
wordpunkt_tkn.tokenize("He couldn't swim" )  # the contraction is split into 'couldn', "'" and 't'

['He', 'couldn', "'", 't', 'swim']

# RegexpTokenizer

In [54]:
s ="Price of a gallon milk is $3.50.  I'll buy 2. Thanks."

#Let's try the built in word tokenizer on for size :
nltk.word_tokenize(s)  # keeps punctuation

['Price',
 'of',
 'a',
 'gallon',
 'milk',
 'is',
 '$',
 '3.50',
 '.',
 'I',
 "'ll",
 'buy',
 '2',
 '.',
 'Thanks',
 '.']

In [55]:
# A custom pattern gives us more control: match words, dollar amounts such as $3.50, or runs of whitespace
sent_regex = nltk.tokenize.RegexpTokenizer(r'\w+|\$\d+\.\d+|\s+')

In [56]:
sent_regex.tokenize(s)  # drops punctuation, keeps whitespace as tokens, and splits I'll into 'I' and 'll'

['Price',
 ' ',
 'of',
 ' ',
 'a',
 ' ',
 'gallon',
 ' ',
 'milk',
 ' ',
 'is',
 ' ',
 '$3.50',
 '  ',
 'I',
 'll',
 ' ',
 'buy',
 ' ',
 '2',
 ' ',
 'Thanks']
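The output above still contains whitespace tokens and a split contraction. One possible refinement, sketched here with the standard `re` module, tries the price pattern first and lets a word carry an optional apostrophe part. The `PRICE_OR_WORD` name and the pattern itself are assumptions for this example; the same pattern string could also be tried with RegexpTokenizer.

```python
import re

# Price first (e.g. $3.50 or $20), then words with an optional 'll / 't style ending
PRICE_OR_WORD = re.compile(r"\$\d+(?:\.\d+)?|\w+(?:'\w+)?")

s = "Price of a gallon milk is $3.50.  I'll buy 2. Thanks."
tokens = PRICE_OR_WORD.findall(s)

print(tokens)
```

Because whitespace and stray punctuation match neither alternative, they simply never show up as tokens.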

# Text normalization or wrangling
Apart from tokenization, normalization typically includes:
- Cleaning
- Case conversion
- Spell correction
- Removing stop words
- Stemming
- Lemmatization

# Cleaning text
Remove any unnecessary tokens.

For example, when the text comes from HTML we don't care about the tags.
- Use regex or Beautiful Soup to strip them

In [57]:
# example to get text from html
import requests
from bs4 import BeautifulSoup as bsp
response = requests.get('https://en.wikipedia.org/wiki/World_economy')
response.status_code

200

In [65]:
response.text[:100]  # raw html

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'

In [59]:
response.headers

{'Date': 'Mon, 14 Mar 2022 02:43:26 GMT', 'Server': 'mw1441.eqiad.wmnet', 'X-Content-Type-Options': 'nosniff', 'P3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'Content-Language': 'en', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'Last-Modified': 'Mon, 14 Mar 2022 02:38:26 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Encoding': 'gzip', 'Age': '81995', 'X-Cache': 'cp1079 hit, cp1081 hit/16', 'X-Cache-Status': 'hit-front', 'Server-Timing': 'cache;desc="hit-front", host;desc="cp1081"', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'Report-To': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'NEL': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'Permissions-Policy': 'interest-cohort=()', 'Set-

In [60]:
soupify = bsp(response.text, 'lxml')  # parse the HTML into a navigable tree (needs the lxml parser installed)

In [61]:
soupify.get_text()[5039:7861]  # lots of format related junk, but no html headers and tags

'\nMiddle East and Central Asia\n\n4,598,192\n\n2021\n\n32\n\n\xa0Iran\xa0Saudi Arabia\xa0United Arab Emirates\n\n\nSub-Saharan Africa\n\n1,882,229\n\n2021\n\n45\n\n\xa0Nigeria\xa0South Africa\n\n\n\n\n\n\n\nCountry group\n\nGDP (PPP)\n\nPeak year\n\nNumber of countries\n\n35 largest economies\n\n\nWorld\n\n144,636,376\n\n2021\n\n196\n\n\n\n\nEmerging and developing Asia\n\n47,120,772\n\n2021\n\n30\n\n\xa0Bangladesh\xa0China\xa0India\xa0Indonesia\xa0Malaysia\xa0Philippines\xa0Thailand\xa0Vietnam\n\n\nMajor advanced economies (G7)\n\n44,739,437\n\n2021\n\n7\n\n\xa0Canada\xa0France\xa0Germany\xa0Italy\xa0Japan\xa0United Kingdom\xa0United States\n\n\nOther advanced economies(advanced economies excluding the G7)\n\n16,308,885\n\n2021\n\n33\n\n\xa0Australia\xa0South Korea\xa0Netherlands\xa0Spain\xa0\xa0Switzerland\xa0Taiwan\n\n\nEmerging and developing Europe\n\n11,189,072\n\n2021\n\n16\n\n\xa0Poland\xa0Russia\xa0Turkey\n\n\nLatin America and the Caribbean\n\n10,555,374\n\n2021\n\n33\n\n\xa

In [67]:
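# Remove newlines; replacing them with " " instead of "" would keep words from adjacent lines from running together
soupify.get_text().replace("\n","").strip()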
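If Beautiful Soup is not available, the same kind of tag stripping can be approximated with the standard library's `html.parser`. This is only a sketch: `TextExtractor`, `html_to_text` and the sample HTML string are made up for the example, and real pages usually need more care than this.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script/style
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse any repeated whitespace
    return " ".join(" ".join(parser.parts).split())

sample = ("<html><head><title>World economy</title>"
          "<style>p {color: red;}</style></head>"
          "<body><p>GDP  (PPP)</p><p>2021</p></body></html>")
text = html_to_text(sample)
print(text)
```

Joining the pieces with a space, rather than removing newlines outright, keeps neighbouring words from being glued together.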



# Working on a corpus
# Tokenization

In [68]:
corpus = ['Meet Google Fi, a different kind of phone plan @@plan', '*Simpler* pricing and smarter coverage. It has unlimited call and text at $20!']
corpus


['Meet Google Fi, a different kind of phone plan @@plan',
 '*Simpler* pricing and smarter coverage. It has unlimited call and text at $20!']

In [69]:
import nltk
sent_tokens = []
for doc in corpus:
    sent_tokens.append(nltk.sent_tokenize(doc))  # the two documents yield three sentences in total
sent_tokens

[['Meet Google Fi, a different kind of phone plan @@plan'],
 ['*Simpler* pricing and smarter coverage.',
  'It has unlimited call and text at $20!']]

In [70]:
words_tokens = []
for doc in corpus:
    doc_sents = nltk.sent_tokenize(doc)  # first split the document into sentences
    words_tokens.append([nltk.word_tokenize(sent) for sent in doc_sents])  # then split each sentence into words

In [72]:
words_tokens  # words_tokens[0] is the first document (one sentence); words_tokens[1] is the second document (two sentences)

[[['Meet',
   'Google',
   'Fi',
   ',',
   'a',
   'different',
   'kind',
   'of',
   'phone',
   'plan',
   '@',
   '@',
   'plan']],
 [['*', 'Simpler', '*', 'pricing', 'and', 'smarter', 'coverage', '.'],
  ['It', 'has', 'unlimited', 'call', 'and', 'text', 'at', '$', '20', '!']]]

In [73]:
import re
import string
pattern = '[{}]'.format(re.escape(string.punctuation))
pattern

'[!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~]'

In [74]:
# Build a regex that removes punctuation characters
# and apply it to the first sentence of the first document, words_tokens[0][0]
punc_regex = re.compile(pattern)
# filter(None, ...) drops tokens that became empty strings after the substitution
clean_sent = list(filter(None, [punc_regex.sub('', token) for token in words_tokens[0][0]]))

In [75]:
clean_sent

['Meet',
 'Google',
 'Fi',
 'a',
 'different',
 'kind',
 'of',
 'phone',
 'plan',
 'plan']
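The same cleaning step can be applied to every sentence of every document rather than to one sentence only. In this sketch the nested token lists are copied by hand from the words_tokens output above so that it runs on its own; `strip_punctuation` is a helper name chosen for the example.

```python
import re
import string

punc_regex = re.compile("[{}]".format(re.escape(string.punctuation)))

def strip_punctuation(tokens):
    """Remove punctuation characters and drop tokens that end up empty."""
    cleaned = (punc_regex.sub("", tok) for tok in tokens)
    return [tok for tok in cleaned if tok]

# Documents -> sentences -> tokens, copied from the output above
words_tokens = [
    [["Meet", "Google", "Fi", ",", "a", "different", "kind", "of",
      "phone", "plan", "@", "@", "plan"]],
    [["*", "Simpler", "*", "pricing", "and", "smarter", "coverage", "."],
     ["It", "has", "unlimited", "call", "and", "text", "at", "$", "20", "!"]],
]

clean_corpus = [[strip_punctuation(sent) for sent in doc] for doc in words_tokens]
print(clean_corpus)
```

The nested list comprehension keeps the document and sentence structure intact, so later steps can still work sentence by sentence.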

# Removing stop words

Stop words are very frequent words such as a, the, and am that carry little meaning on their own.

In [76]:
stopwords = nltk.corpus.stopwords.words('english')
# The stop-word list is all lowercase, so compare against the lowercased token
stop_clean_sent = [w for w in clean_sent if w.lower() not in stopwords]
print(stopwords)
stop_clean_sent

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

['Meet', 'Google', 'Fi', 'different', 'kind', 'phone', 'plan', 'plan']
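One detail worth checking when filtering stop words is letter case: the English list printed above is all lowercase, so a capitalized token such as 'It' only matches after lowercasing. The sketch below uses a four-word subset of that list (an assumption made to keep the example self-contained) and the cleaned third sentence of the corpus.

```python
# Small subset of the stop-word list shown above, for illustration only
STOP_WORDS = {"it", "has", "and", "at"}

tokens = ["It", "has", "unlimited", "call", "and", "text", "at", "20"]

# A plain membership test misses the capitalized 'It'
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Lowercasing the token before the test catches it
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

Using a set for the stop words also makes each lookup fast, which matters once the corpus gets large.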

# Stemming

Stemming reduces words to their base or root form.

In [77]:
from nltk.stem import PorterStemmer
pstemmer = PorterStemmer()

In [78]:
pstemmer.stem('helped'), pstemmer.stem('helping')

('help', 'help')

In [79]:
pstemmer.stem('strange')

'strang'

In [80]:
from nltk.stem import LancasterStemmer
ls_stemmer = LancasterStemmer()
ls_stemmer.stem('strange')

'strange'

# Regex based stemmer

In [81]:
# Here we have a regex for words ending with ed, ing or es; min=4 leaves words shorter than four characters untouched
from nltk.stem import RegexpStemmer
regex_stemmer = RegexpStemmer(r'ed$|ing$|es$', min=4)

In [82]:
regex_stemmer.stem('played'), regex_stemmer.stem('apples')

('play', 'appl')
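What a regex-based stemmer does can be sketched in a few lines with `re.sub`. This is an approximation written for illustration, not nltk's source: `regex_stem` and its `min_length` parameter are names chosen here. It also makes the main weakness visible, because the pattern knows nothing about the word it is cutting.

```python
import re

def regex_stem(word, pattern=r"ed$|ing$|es$", min_length=4):
    """Remove a matching suffix; words shorter than min_length are left alone."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

stems = {w: regex_stem(w) for w in ["played", "apples", "red", "bring"]}
print(stems)
```

"red" is protected by the length check, but "bring" loses its "ing" and becomes "br", a typical case of over-stemming.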

# Lemmatization
Lemmatization maps a word to its dictionary form (lemma).

In [83]:
from nltk.stem import WordNetLemmatizer
wnetl = WordNetLemmatizer()

In [84]:
# noun
wnetl.lemmatize('buses', 'n')

'bus'

In [85]:
# verb
wnetl.lemmatize('running', 'v'), wnetl.lemmatize('ate', 'v')

('run', 'eat')

In [86]:
# adjective
wnetl.lemmatize('easier', 'a')

'easy'

Use the right part of speech: with the wrong tag, the word is returned unchanged.

In [87]:
wnetl.lemmatize('ate','n')

'ate'
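Why the part of speech matters can be illustrated with a toy lookup table keyed by (word, pos). This is not how WordNetLemmatizer works internally; `LEMMA_TABLE` and `lookup_lemma` are hypothetical and only contain the examples used above. The point is that the same surface form is looked up under a different key for each part of speech, and an unknown key falls back to the word itself.

```python
# Toy lemma table keyed by (word, part of speech); entries mirror the examples above
LEMMA_TABLE = {
    ("buses", "n"): "bus",
    ("running", "v"): "run",
    ("ate", "v"): "eat",
    ("easier", "a"): "easy",
}

def lookup_lemma(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word, pos), word)

as_verb = lookup_lemma("ate", "v")
as_noun = lookup_lemma("ate", "n")
print(as_verb, as_noun)
```

In practice this is why a POS tagger is often run before lemmatization: its tags tell the lemmatizer which entry to look for.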