# Natural Language Processing

### Applications - Auto-tagging, Spam Detection, Siri, and AutoComplete.

## Tasks in NLP

### 1. Tokenization - Breaking down text into words and sentences.
### 2. Stopword Removal - Filtering common words.
### 3. N-Grams - Identifying commonly occurring groups of words.
### 4. Word Sense Disambiguation - Identifying the sense in which a word is used from its context.
### 5. Parts-of-Speech Tagging - Identifying the part of speech of each word.
### 6. Stemming - Reducing words to their stems by stripping word endings.

### 1. Tokenizing

In [1]:
import nltk

In [2]:
## nltk.download()

In [3]:
text = "Mary had a little lamb. Her fleece was white as snow"
from nltk.tokenize import word_tokenize, sent_tokenize
sents = sent_tokenize(text)
print(sents)

['Mary had a little lamb.', 'Her fleece was white as snow']


In [4]:
words = [word_tokenize(sent) for sent in sents]
print(words)

[['Mary', 'had', 'a', 'little', 'lamb', '.'], ['Her', 'fleece', 'was', 'white', 'as', 'snow']]
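To make the idea of tokenization concrete, here is a minimal regex-based sketch using only the standard library. It is not how NLTK tokenizes (NLTK uses trained models and many more rules); the helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical and only meant to illustrate the splitting step on simple text.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

def simple_word_tokenize(sentence):
    """Return runs of word characters, and each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "Mary had a little lamb. Her fleece was white as snow"
sample_sents = simple_sent_tokenize(sample)
sample_words = [simple_word_tokenize(s) for s in sample_sents]
print(sample_sents)
print(sample_words)
```

A rule this naive breaks on abbreviations such as "U.S." or "Dr.", which is one reason to prefer `sent_tokenize` on real text.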


### 2. Removing Stopwords

In [5]:
from nltk.corpus import stopwords
from string import punctuation
customStopWords = set(stopwords.words('english')+list(punctuation))

In [6]:
# The NLTK stopword list is lowercase, so the capitalized 'Her' is not filtered out below.
wordsWOStopwords = [word for word in word_tokenize(text) if word not in customStopWords]
print(wordsWOStopwords)

['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']
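Note that 'Her' survives in the output above because the comparison is case-sensitive. The sketch below shows one way to make the filter case-insensitive by lowercasing each token before the lookup. It uses a tiny hand-written stopword set (an assumption for illustration, not the NLTK list) so that it runs without any downloads.

```python
from string import punctuation

# Hypothetical mini stopword list, standing in for stopwords.words('english')
mini_stopwords = set(['had', 'a', 'her', 'was', 'as'] + list(punctuation))

tokens = ['Mary', 'had', 'a', 'little', 'lamb', '.',
          'Her', 'fleece', 'was', 'white', 'as', 'snow']

# Lowercase only for the membership test, so the original casing is kept in the result
filtered = [tok for tok in tokens if tok.lower() not in mini_stopwords]
print(filtered)
```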


### 3. N-gram Identification

In [7]:
## Construction of bigrams (collocations)
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(wordsWOStopwords)
sorted(finder.ngram_fd.items())

[(('Her', 'fleece'), 1),
 (('Mary', 'little'), 1),
 (('fleece', 'white'), 1),
 (('lamb', 'Her'), 1),
 (('little', 'lamb'), 1),
 (('white', 'snow'), 1)]
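The bigram counts above can also be reproduced without NLTK, which may help clarify what the finder is counting. This sketch pairs each word with its successor and counts the pairs; `make_bigrams` is a hypothetical helper name, not an NLTK function.

```python
from collections import Counter

def make_bigrams(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

filtered_words = ['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']
bigram_counts = Counter(make_bigrams(filtered_words))
print(sorted(bigram_counts.items()))
```

Because stopwords were removed first, pairs such as ('Mary', 'little') are adjacent only in the filtered list, not in the original sentence.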

### 5 & 6. Stemming and Parts-of-Speech Tagging

In [8]:
text2 = "Mary closed on closing night when she was in the mood to close."
# stemming
from nltk.stem.lancaster import LancasterStemmer
st=LancasterStemmer()
stemmedWords=[st.stem(word) for word in word_tokenize(text2)]
print(stemmedWords)

['mary', 'clos', 'on', 'clos', 'night', 'when', 'she', 'was', 'in', 'the', 'mood', 'to', 'clos', '.']


In [9]:
# tag to part of speech
nltk.pos_tag(word_tokenize(text2))

[('Mary', 'NNP'),
 ('closed', 'VBD'),
 ('on', 'IN'),
 ('closing', 'NN'),
 ('night', 'NN'),
 ('when', 'WRB'),
 ('she', 'PRP'),
 ('was', 'VBD'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mood', 'NN'),
 ('to', 'TO'),
 ('close', 'VB'),
 ('.', '.')]

### 4. Word Sense Disambiguation

In [10]:
# WordNet is like a thesaurus/dictionary, and a synset represents one single definition (sense) of a word
from nltk.corpus import wordnet as wn
for ss in wn.synsets('bass'):
    print(ss, ss.definition())

Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range


In [11]:
# Lesk is an algorithm for word sense disambiguation: it picks the sense whose definition overlaps most with the context words
from nltk.wsd import lesk
sense1 = lesk(word_tokenize("Sing in a lower tone, along with the bass"), 'bass')
print(sense1, sense1.definition())

Synset('bass.n.07') the member with the lowest range of a family of musical instruments


In [12]:
# The same algorithm with a different context picks a different sense
from nltk.wsd import lesk
sense2 = lesk(word_tokenize("This sea bass was really hard to catch"), 'bass')
print(sense2, sense2.definition())

Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
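The core idea behind Lesk, choosing the sense whose definition shares the most words with the context, can be sketched in a few lines. This is a simplified illustration, not NLTK's implementation: `simple_lesk` is a hypothetical helper, and the two glosses are hand-written for this example rather than taken from WordNet.

```python
def simple_lesk(context_tokens, glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = set(token.lower() for token in context_tokens)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical, hand-written glosses (not WordNet data)
glosses = {
    'bass_music': 'the lowest tone or part in music that singers sing',
    'bass_fish': 'a fish that people catch in the sea or in fresh water',
}

fish_sense = simple_lesk("This sea bass was really hard to catch".split(), glosses)
music_sense = simple_lesk("Sing in a lower tone along with the bass".split(), glosses)
print(fish_sense, music_sense)
```

With such short glosses, common words like "in" and "the" contribute to the overlap, so the result is sensitive to wording; removing stopwords before counting is a common refinement.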


## --------------------------------------------------------------------------------------------

## Typical Machine Learning Workflow

### 1. Pick your problem - Identify which type of problem we need to solve.
### 2. Represent Data - Represent data using numeric attributes.
### 3. Apply an Algorithm - Use a standard algorithm to find a model.

### 1. Pick your problem - Identify which type of problem we need to solve: classification, clustering, recommendation, or regression.

### Classification

#### * Examples: spam detection, sentiment analysis; the set of possible decisions is finite.
#### * A problem instance (email/tweet) needs to be classified by assigning a category or label to it.
#### * Algorithms that perform classification are known as classifiers.
#### * Prerequisite of a classifier - It needs a set of instances for which the correct category membership is known.
#### * Training data with known classifications must be made available.
#### * Example algorithms: Naive Bayes, Support Vector Machines.

### Clustering

#### * Groups are created from data for which no categories are assigned initially.
#### * The algorithm divides the sample into groups.
#### * It is used to explore the text.
#### * Example algorithms: K-Means, Hierarchical Clustering.

### 2. Represent Data - Represent data using numeric attributes;

#### * Machine learning algorithms take numeric data as input, so text data needs to be converted into numeric form.
#### * Use meaningful numeric attributes to represent text.
#### * Common methods: Term Frequency (TF) and Term Frequency - Inverse Document Frequency (TF-IDF).
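As a small illustration of the term frequency representation, the sketch below turns two toy documents into count vectors over a shared vocabulary. The documents and variable names are made up for this example.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# One column per distinct word, in a fixed (sorted) order
vocabulary = sorted(set(word for doc in docs for word in doc.split()))

vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Each document becomes a list of numbers of the same length, which is the form a standard learning algorithm expects.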

### 3. Apply an Algorithm - Use a standard algorithm to find a model;

#### * Use an algorithm to find patterns in the historical data.
#### * A model is created; a model can be a mathematical equation or a set of rules.

## -----------------------------------------------------------------------------------------------------

## Auto-summarizing Text

#### Auto-summarize text using a rule-based model.
#### Scrape websites for text data using BeautifulSoup.
#### Use NLTK for munging text - tokenization, stopword removal, etc.
#### Auto-summarization is done via abstract extraction: important sentences are extracted from the text and together form the abstract.

#### Auto-summarization - rule-based approach.
#### i. Find the most important words.
#### ii. Compute a significance score for each sentence based on the words it contains.
#### iii. Pick the most significant sentences.

#### Word importance - the greater a word's frequency, the more important it is.
#### Sentence significance score = sum(word importance).
#### Pick the most significant sentences.
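The scoring rule can be checked by hand on a toy example. This sketch uses three made-up sentences, splits on whitespace, and skips stopword removal for brevity, so it only illustrates the arithmetic of the rule.

```python
from collections import Counter

sentences = ["cats chase mice", "mice fear cats and dogs", "dogs bark"]

# Word importance = how often the word occurs in the whole text
word_importance = Counter(word for s in sentences for word in s.split())

# Sentence significance = sum of the importance of its words
scores = [sum(word_importance[w] for w in s.split()) for s in sentences]

best_index = max(range(len(sentences)), key=lambda i: scores[i])
print(scores, sentences[best_index])
```

Longer sentences tend to score higher under a plain sum, which is a known bias of this rule.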


### Steps in Abstract Extraction
#### i. Retrieve Text - Download the HTML page from a URL and parse it with BeautifulSoup, which strips tags, etc.
#### ii. Preprocess Text - Tokenize the text into words and remove stopwords.
#### iii. Extract Sentences - Rank words and sentences: compute word frequencies and the significance score of each sentence.

In [13]:
import urllib.request ##urllib to download a webpage
from bs4 import BeautifulSoup

In [14]:
articleURL = "https://www.washingtonpost.com/opinions/global-opinions/we-think-north-korea-is-crazy-what-if-were-wrong/2017/07/06/d13044b0-6286-11e7-a4f7-af34fc1d9d39_story.html?utm_term=.53f07e903405"

In [15]:
page = urllib.request.urlopen(articleURL).read().decode('utf8', 'ignore')  # download the HTML page, skipping undecodable bytes
soup = BeautifulSoup(page, 'lxml')  # BeautifulSoup builds a tree representing the HTML structure of the page
soup

<!DOCTYPE html>
<html class="article layout_article rendering-context-www" itemscope="" itemtype="http://schema.org/NewsArticle" lang="en"> <head> <script id="_$cookiemonster">(function(b,m){function d(b,d){this.wl={map:f.map.concat(b||[]),reg:f.reg.concat(d||[])}}var f={reg:[],map:[]};d.prototype.ommNom=function(){return this.nom(!0,void 0)};d.prototype.allows=function(b){return!(-1<this.nom(!1,[b]).indexOf(b))};d.prototype.nom=function(d,f){for(var c=[],l=b.location.hostname.split("").reverse().join("").slice(0,18),g=f||b.cookie.split(";"),a,h,e=0;e<g.length,a=g[e];e++)a=a.trim().split("\x3d")[0].toLowerCase(),-1<this.wl.map.indexOf(a)||c.push(a);for(var k=0;k<this.wl.reg.length,
h=this.wl.reg[k];k++)for(e=h.lastIndex=0;e<g.length,a=g[e];e++)a=a.trim().split("\x3d")[0].toLowerCase(),h.test(a)?-1<c.indexOf(a)&&c.splice(c.indexOf(a),1):0>c.indexOf(a)&&0>this.wl.map.indexOf(a)&&c.push(a);d&&("moc.tsopnotgnihsaw"==l&&(this.wl.reg.length||this.wl.map.length)?setTimeout(function(a){return 

In [16]:
soup.find('article')

<article itemprop="articleBody"> <p id="U12202260910680mJF">In Washington, there is a conventional wisdom on North Korea that spans both parties and much of elite opinion. It goes roughly like this: North Korea is the world’s most bizarre country, run by a crackpot dictator with a strange haircut. He is unpredictable and irrational and cannot be negotiated with. Eventually this weird and cruel regime will collapse. Meanwhile, the only solution is more and more pressure. But what if the conventional wisdom is wrong?</p> <p>The North Korean regime has survived for almost seven decades, preserving not just its basic form of government but also its family dynasty, <a href="http://time.com/4681304/north-korea-kim-family-album/" shape="rect" title="time.com">father to son to grandson</a>. It has persisted through the fall of the Soviet Union and its communist satellites, the Orange Revolution, the Arab Spring and the demise of other Asian dictatorships, from South Korea to Taiwan to Indonesi

In [17]:
soup.find('article').text

' In Washington, there is a conventional wisdom on North Korea that spans both parties and much of elite opinion. It goes roughly like this: North Korea is the world’s most bizarre country, run by a crackpot dictator with a strange haircut. He is unpredictable and irrational and cannot be negotiated with. Eventually this weird and cruel regime will collapse. Meanwhile, the only solution is more and more pressure. But what if the conventional wisdom is wrong? The North Korean regime has survived for almost seven decades, preserving not just its basic form of government but also its family dynasty, father to son to grandson. It has persisted through the fall of the Soviet Union and its communist satellites, the Orange Revolution, the Arab Spring and the demise of other Asian dictatorships, from South Korea to Taiwan to Indonesia. The Kim dynasty has been able to achieve striking success in its primary objective — survival. Of course, this is because it rules in a brutal and oppressive fa

In [18]:
text = ' '.join(map(lambda p:p.text, soup.find_all('article')))
text

' In Washington, there is a conventional wisdom on North Korea that spans both parties and much of elite opinion. It goes roughly like this: North Korea is the world’s most bizarre country, run by a crackpot dictator with a strange haircut. He is unpredictable and irrational and cannot be negotiated with. Eventually this weird and cruel regime will collapse. Meanwhile, the only solution is more and more pressure. But what if the conventional wisdom is wrong? The North Korean regime has survived for almost seven decades, preserving not just its basic form of government but also its family dynasty, father to son to grandson. It has persisted through the fall of the Soviet Union and its communist satellites, the Orange Revolution, the Arab Spring and the demise of other Asian dictatorships, from South Korea to Taiwan to Indonesia. The Kim dynasty has been able to achieve striking success in its primary objective — survival. Of course, this is because it rules in a brutal and oppressive fa

#### There may be special (non-ASCII) characters within the text that need to be handled.

In [19]:
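# In Python 3, encode() returns bytes, so decode back to str before replacing (this also replaces genuine question marks):
# text.encode('ascii', errors='replace').decode('ascii').replace("?", " ")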
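To see what the two standard error handlers do, here is a short sketch on a made-up string containing a curly apostrophe and an accented letter. Both `errors='ignore'` and `errors='replace'` are standard options of `str.encode`.

```python
sample = "world\u2019s caf\u00e9"  # contains a curly apostrophe and an accented e

# 'ignore' silently drops characters that ASCII cannot represent
ignored = sample.encode('ascii', errors='ignore').decode('ascii')

# 'replace' substitutes a question mark for each such character
replaced = sample.encode('ascii', errors='replace').decode('ascii')

print(ignored)
print(replaced)
```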

In [20]:
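# Parsing logic
def getTextParsed(url):
    """Download a page and return the text of its <article> elements."""
    page = urllib.request.urlopen(url).read().decode('utf8', 'ignore')  # skip undecodable bytes, as in cell 15
    soup = BeautifulSoup(page, 'lxml')
    text = ' '.join(article.text for article in soup.find_all('article'))
    return text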

In [21]:
text = getTextParsed(articleURL)

In [22]:
text

' In Washington, there is a conventional wisdom on North Korea that spans both parties and much of elite opinion. It goes roughly like this: North Korea is the world’s most bizarre country, run by a crackpot dictator with a strange haircut. He is unpredictable and irrational and cannot be negotiated with. Eventually this weird and cruel regime will collapse. Meanwhile, the only solution is more and more pressure. But what if the conventional wisdom is wrong? The North Korean regime has survived for almost seven decades, preserving not just its basic form of government but also its family dynasty, father to son to grandson. It has persisted through the fall of the Soviet Union and its communist satellites, the Orange Revolution, the Arab Spring and the demise of other Asian dictatorships, from South Korea to Taiwan to Indonesia. The Kim dynasty has been able to achieve striking success in its primary objective — survival. Of course, this is because it rules in a brutal and oppressive fa

In [23]:
## Collect the individual sentences in the article into a list.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation

In [24]:
sents = sent_tokenize(text)  # uses the Punkt sentence tokenizer, which detects sentence boundaries rather than splitting on every ". "
sents

[' In Washington, there is a conventional wisdom on North Korea that spans both parties and much of elite opinion.',
 'It goes roughly like this: North Korea is the world’s most bizarre country, run by a crackpot dictator with a strange haircut.',
 'He is unpredictable and irrational and cannot be negotiated with.',
 'Eventually this weird and cruel regime will collapse.',
 'Meanwhile, the only solution is more and more pressure.',
 'But what if the conventional wisdom is wrong?',
 'The North Korean regime has survived for almost seven decades, preserving not just its basic form of government but also its family dynasty, father to son to grandson.',
 'It has persisted through the fall of the Soviet Union and its communist satellites, the Orange Revolution, the Arab Spring and the demise of other Asian dictatorships, from South Korea to Taiwan to Indonesia.',
 'The Kim dynasty has been able to achieve striking success in its primary objective — survival.',
 'Of course, this is because i

In [25]:
# list of words
word_sent = word_tokenize(text.lower())
word_sent

['in',
 'washington',
 ',',
 'there',
 'is',
 'a',
 'conventional',
 'wisdom',
 'on',
 'north',
 'korea',
 'that',
 'spans',
 'both',
 'parties',
 'and',
 'much',
 'of',
 'elite',
 'opinion',
 '.',
 'it',
 'goes',
 'roughly',
 'like',
 'this',
 ':',
 'north',
 'korea',
 'is',
 'the',
 'world’s',
 'most',
 'bizarre',
 'country',
 ',',
 'run',
 'by',
 'a',
 'crackpot',
 'dictator',
 'with',
 'a',
 'strange',
 'haircut',
 '.',
 'he',
 'is',
 'unpredictable',
 'and',
 'irrational',
 'and',
 'can',
 'not',
 'be',
 'negotiated',
 'with',
 '.',
 'eventually',
 'this',
 'weird',
 'and',
 'cruel',
 'regime',
 'will',
 'collapse',
 '.',
 'meanwhile',
 ',',
 'the',
 'only',
 'solution',
 'is',
 'more',
 'and',
 'more',
 'pressure',
 '.',
 'but',
 'what',
 'if',
 'the',
 'conventional',
 'wisdom',
 'is',
 'wrong',
 '?',
 'the',
 'north',
 'korean',
 'regime',
 'has',
 'survived',
 'for',
 'almost',
 'seven',
 'decades',
 ',',
 'preserving',
 'not',
 'just',
 'its',
 'basic',
 'form',
 'of',
 'gove

In [26]:
# remove the stop words
_stopwords = set(stopwords.words('english') + list(punctuation))
_stopwords

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 'd',
 'did',
 'didn',
 'do',
 'does',
 'doesn',
 'doing',
 'don',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 'has',
 'hasn',
 'have',
 'haven',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 'it',
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 'more',
 'most',
 'mustn',
 'my',
 'myself',
 'needn',
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out'

In [27]:
word_sent = [word for word in word_sent if word not in _stopwords]

In [28]:
word_sent

['washington',
 'conventional',
 'wisdom',
 'north',
 'korea',
 'spans',
 'parties',
 'much',
 'elite',
 'opinion',
 'goes',
 'roughly',
 'like',
 'north',
 'korea',
 'world’s',
 'bizarre',
 'country',
 'run',
 'crackpot',
 'dictator',
 'strange',
 'haircut',
 'unpredictable',
 'irrational',
 'negotiated',
 'eventually',
 'weird',
 'cruel',
 'regime',
 'collapse',
 'meanwhile',
 'solution',
 'pressure',
 'conventional',
 'wisdom',
 'wrong',
 'north',
 'korean',
 'regime',
 'survived',
 'almost',
 'seven',
 'decades',
 'preserving',
 'basic',
 'form',
 'government',
 'also',
 'family',
 'dynasty',
 'father',
 'son',
 'grandson',
 'persisted',
 'fall',
 'soviet',
 'union',
 'communist',
 'satellites',
 'orange',
 'revolution',
 'arab',
 'spring',
 'demise',
 'asian',
 'dictatorships',
 'south',
 'korea',
 'taiwan',
 'indonesia',
 'kim',
 'dynasty',
 'able',
 'achieve',
 'striking',
 'success',
 'primary',
 'objective',
 '—',
 'survival',
 'course',
 'rules',
 'brutal',
 'oppressive',
 'f

In [29]:
## Frequency distribution of words
from nltk.probability import FreqDist
freq = FreqDist(word_sent)
freq

FreqDist({'1953': 1,
          '30,000': 1,
          '9/11': 1,
          'able': 1,
          'accurately': 1,
          'achieve': 1,
          'act': 1,
          'affairs': 1,
          'allegedly': 1,
          'ally': 2,
          'almost': 1,
          'already': 1,
          'also': 2,
          'alternative': 1,
          'american': 1,
          'andrei': 1,
          'announced': 1,
          'anyone': 1,
          'arab': 1,
          'archive': 1,
          'armistice': 1,
          'army': 1,
          'arsenal': 2,
          'ash': 1,
          'asian': 1,
          'attacked': 1,
          'authority': 1,
          'basic': 1,
          'becomes': 1,
          'beijing': 3,
          'best': 1,
          'bitter': 1,
          'bizarre': 1,
          'border': 1,
          'break': 1,
          'brother': 1,
          'bruce': 1,
          'brutal': 1,
          'burma': 1,
          'bush': 1,
          'buy': 2,
          'cake': 1,
          'calculated': 1,
       

In [30]:
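### Most frequent words
# Note: the dash in the result is not ASCII, so string.punctuation does not cover it and it survives stopword removal.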
from heapq import nlargest
nlargest(10, freq, key=freq.get)

['korea',
 'north',
 'china',
 'washington',
 'regime',
 'would',
 'pressure',
 'korean',
 'south',
 '—']
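For reference, `nlargest` with `key=freq.get` iterates over the keys and ranks them by their counts. The sketch below shows the same call on a plain `Counter` with made-up data, next to the equivalent `most_common` method.

```python
from collections import Counter
from heapq import nlargest

counts = Counter("a b a c b a d".split())

# Iterating a Counter yields its keys; key=counts.get ranks them by count
top_two = nlargest(2, counts, key=counts.get)

# most_common returns (key, count) pairs in descending order of count
top_two_pairs = counts.most_common(2)

print(top_two, top_two_pairs)
```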

In [31]:
## Compute sentence significance scores
from collections import defaultdict  # defaultdict creates a missing key with a default value (0 for int) instead of raising KeyError
ranking = defaultdict(int)

# keys: sentence index; values: significance score
for i, sent in enumerate(sents):
    for w in word_tokenize(sent.lower()):
        if w in freq:
            ranking[i] += freq[w]
            
ranking

defaultdict(int,
            {0: 57,
             1: 55,
             2: 3,
             3: 13,
             4: 7,
             5: 6,
             6: 50,
             7: 42,
             8: 18,
             9: 10,
             10: 44,
             11: 12,
             12: 20,
             13: 28,
             14: 42,
             15: 74,
             16: 60,
             17: 87,
             18: 3,
             19: 112,
             20: 37,
             21: 13,
             22: 74,
             23: 100,
             24: 69,
             25: 14,
             26: 2,
             27: 37,
             28: 12,
             29: 125,
             30: 71,
             31: 53,
             32: 37,
             33: 25,
             34: 74,
             35: 3,
             36: 11,
             37: 146,
             38: 1,
             39: 115,
             40: 2})
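The accumulation pattern used for `ranking` can be seen in isolation in the following sketch. The word scores and tokenized sentences are made up for illustration.

```python
from collections import defaultdict

word_scores = {'korea': 3, 'north': 2}
toy_sentences = [['north', 'korea'], ['weather'], ['korea']]

toy_ranking = defaultdict(int)
for i, sentence in enumerate(toy_sentences):
    for word in sentence:
        if word in word_scores:
            # The first access creates toy_ranking[i] with the default value 0
            toy_ranking[i] += word_scores[word]

print(dict(toy_ranking))
```

A sentence with no scored words never gets a key, so asking `nlargest` for more sentences than there are keys returns fewer results than requested.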

In [32]:
# Pick the most significant sentences.
sents_idx = nlargest(4, ranking, key=ranking.get)
sents_idx

[37, 29, 39, 19]

In [33]:
[sents[j] for j in sorted(sents_idx)]

['Current U.S. policy toward Iran, Secretary of State Rex Tillerson recently said, is to “work toward support of those elements inside of Iran that would lead to a peaceful transition of that government.” And regarding North Korea, President Trump wants China to “end this nonsense once and for all,” which again can only mean getting rid of the Kim government in some way.',
 'Beijing faces an understandable nightmare — under sanctions and pressure, North Korea collapses and the newly unified country becomes a giant version of South Korea, with a defense treaty with Washington, nearly 30,000 American troops and possibly dozens of Pyongyang’s nuclear weapons — all on China’s border.',
 'Read more on this topic:   The Post’s View: What Trump can do about North Korea   Jake Sullivan and Victor Cha: The right way to play the China card on North Korea   Bruce Klingner and Sue Mi Terry: We participated in talks with North Korean representatives.',
 'Andrei Lankov: The inconvenient truth about 

In [34]:
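def summarize(text, n):
    """Return the n highest-scoring sentences of text, in their original order."""
    sents = sent_tokenize(text)
    assert n <= len(sents), "n cannot exceed the number of sentences"

    # Word importance: frequency of each non-stopword token
    _stopwords = set(stopwords.words('english') + list(punctuation))
    word_sent = [word for word in word_tokenize(text.lower()) if word not in _stopwords]
    freq = FreqDist(word_sent)

    # Sentence significance: sum of the frequencies of the words it contains
    ranking = defaultdict(int)
    for i, sent in enumerate(sents):
        for w in word_tokenize(sent.lower()):
            if w in freq:
                ranking[i] += freq[w]

    # Pick the n top-ranked sentence indices and restore document order
    sents_idx = nlargest(n, ranking, key=ranking.get)
    return [sents[j] for j in sorted(sents_idx)]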

In [35]:
summarize(text, 3)

['Beijing faces an understandable nightmare — under sanctions and pressure, North Korea collapses and the newly unified country becomes a giant version of South Korea, with a defense treaty with Washington, nearly 30,000 American troops and possibly dozens of Pyongyang’s nuclear weapons — all on China’s border.',
 'Read more on this topic:   The Post’s View: What Trump can do about North Korea   Jake Sullivan and Victor Cha: The right way to play the China card on North Korea   Bruce Klingner and Sue Mi Terry: We participated in talks with North Korean representatives.',
 'Andrei Lankov: The inconvenient truth about North Korea and China   Chung Min Lee: North Korea is already testing South Korea’s new president.']

In [36]:
text = getTextParsed("https://www.washingtonpost.com/news/energy-environment/wp/2017/07/05/scientists-are-starting-to-clear-up-one-of-the-biggest-controversies-in-climate-science/?utm_term=.054a7a6342b3")

In [37]:
text

'      This color-coded map displays a progression of changing global surface temperature anomalies from 1880 through 2015. Higher than normal temperatures are shown in red and lower then normal temperatures are shown in blue. (NASA)   How much Earth will warm in response to future greenhouse gas emissions may be one of the most fundamental questions in climate science —\xa0but it’s also one of the most difficult to answer. And it’s growing more controversial: In recent years, some scientists have suggested that our climate models may actually be predicting too much future\xa0warming, and that climate change will be less severe than the projections suggest. But\xa0new research is helping lay these suspicions to rest. A study, out Wednesday in the journal Science Advances, joins a growing body of literature that suggests the models are on track after all. And while that may be worrisome for the planet, it’s good news for the scientists working to understand its future. The new study add

In [38]:
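def _removeNonAscii(s):
    # Drops every non-ASCII character; non-breaking spaces (\xa0) are dropped too, which fuses neighboring words
    return "".join(i for i in s if ord(i) < 128)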

In [39]:
text = _removeNonAscii(text)
text

'      This color-coded map displays a progression of changing global surface temperature anomalies from 1880 through 2015. Higher than normal temperatures are shown in red and lower then normal temperatures are shown in blue. (NASA)   How much Earth will warm in response to future greenhouse gas emissions may be one of the most fundamental questions in climate science but its also one of the most difficult to answer. And its growing more controversial: In recent years, some scientists have suggested that our climate models may actually be predicting too much futurewarming, and that climate change will be less severe than the projections suggest. Butnew research is helping lay these suspicions to rest. A study, out Wednesday in the journal Science Advances, joins a growing body of literature that suggests the models are on track after all. And while that may be worrisome for the planet, its good news for the scientists working to understand its future. The new study addresses a basic c

In [40]:
summarize(text,3)

['And its growing more controversial: In recent years, some scientists have suggested that our climate models may actually be predicting too much futurewarming, and that climate change will be less severe than the projections suggest.',
 'Certain slow-developingclimate processes could amplify warming to a greater extent in the future, putting the models in the right after all.But these processes take time, even up to several hundred years, to really take effect  and because not enough time has passed since the Industrial Revolution for their signal to really develop, the historical record is whats actually misleading at the moment.',
 'According to Piers Forster, a University of Leeds climate scientist who has also studied climate sensitivity, the models tend to rely on certain assumptions that have not yet unfolded in real life  for instance, that the eastern Pacific will eventually warm to a greater extent than the western Pacific.']

In [41]:
text = getTextParsed('https://www.washingtonpost.com/local/social-issues/the-mystery-of-why-the-best-african-american-figure-skater-in-history-went-bankrupt-and-lives-in-a-trailer/2016/02/25/a191972c-ce99-11e5-abc9-ea152f0b9561_story.html?tid=hybrid_experimentrandom_1_na&utm_term=.354b1b0c811f')
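# Note: the return value of _removeNonAscii is discarded here, so text is summarized unchanged
# (the output below still contains non-ASCII quotes); use text = _removeNonAscii(text) to apply the cleanup.
_removeNonAscii(text)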
summarize(text, 4)

['They’re really tight, though, because your feet grow after you don’t wear them for a long time.” Her medals — from the World Figure Skating Championships, from the Olympics — were equally elusive: “They’re in some bag somewhere.” Uncertainty is not a feeling Debi Thomas has often experienced in her 48 years.',
 'She instead inveighs against shadowy authorities in the nomenclature of conspiracy theorists — “the powers that be”; “corporate media”; “brainwashing” — and composes opinion pieces for the local newspaper that carry headlines such as “Pain, No Gain” and “Driven to Insanity.” She thinks that hoarding gold will insulate us from a looming financial meltdown, and recruits people to sell bits of gold bullion called “Karatbars.”  There’s a conventional narrative of how Thomas went from where she was to where she is — that of a talented figure undone by internal struggles and left penniless.',
 '“And most people don’t get that.”     She says she wants to help a community she frequen

## ------------------------------------------------------------------------------------------------

## Classification of Texts Using Machine Learning

#### i. Feature extraction using the bag-of-words model on a corpus (a large body of text).
#### ii. Use K-Means clustering to identify a set of topics/themes.
#### iii. Use the K-Nearest Neighbors model to classify text into those topics/themes.
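Step iii can be sketched with a small K-Nearest Neighbors classifier written in plain Python. This is an illustrative sketch under simple assumptions, not the scikit-learn API: `knn_classify` is a hypothetical helper, the 2-D points stand in for document vectors, and the labels are made up.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(point, labeled_points, k=3):
    """Label a point by majority vote among its k nearest labeled neighbors."""
    nearest = sorted(labeled_points, key=lambda item: euclidean(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (vector, theme label)
training = [((0.0, 0.1), 'tech'), ((0.2, 0.0), 'tech'), ((0.1, 0.2), 'tech'),
            ((1.0, 0.9), 'finance'), ((0.9, 1.1), 'finance')]

label_a = knn_classify((0.15, 0.1), training)
label_b = knn_classify((0.95, 1.0), training)
print(label_a, label_b)
```

In the workflow described here, the labels would come from the K-Means cluster assignments, and the vectors from the TF-IDF representation.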

### i. Feature extraction

In [42]:
import urllib.request
from bs4 import BeautifulSoup

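def getAllPosts(url, links):
    """Recursively follow the "Older Posts" links and collect their URLs in links."""
    response = urllib.request.urlopen(url)
    soup = BeautifulSoup(response, 'lxml')
    for a in soup.find_all('a'):
        # .get() returns None for anchors without href/title, so no try/except is needed
        href = a.get('href')
        title = a.get('title')
        if href and title == "Older Posts":
            print(title, href)
            links.append(href)
            getAllPosts(href, links)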

blogUrl = "http://doxydonkey.blogspot.in/"
links = []
getAllPosts(blogUrl, links)

Older Posts http://doxydonkey.blogspot.in/search?updated-max=2017-05-23T19:53:00-07:00&max-results=7
Older Posts http://doxydonkey.blogspot.in/search?updated-max=2017-05-14T19:02:00-07:00&max-results=7&start=7&by-date=false


In [43]:
## Parse the article text from each URL
def getText(textUrl):
    response = urllib.request.urlopen(textUrl)
    soup = BeautifulSoup(response, 'lxml')
    mydivs = soup.find_all("div", {"class": 'post-body'})

    posts = []
    for div in mydivs:
        # Each <li> is one post; encode() drops non-ASCII characters and yields bytes
        posts += [li.text.encode('ascii', 'ignore') for li in div.find_all("li")]
    return posts

In [44]:
allPosts = []
for link in links:
    allPosts += getText(link)

In [45]:
allPosts

[b"SoftBank's $100 Billion Tech Fund Rankles VCs as Valuations Soar:In the months since Softbank Group Corp. unveiled plans for a $100 billion technology fund, the Japanese company has been making its presence deeply felt across the industry. The Vision Fund closed a few days ago with $93 billion in initial commitments, and already venture firms from London to Silicon Valley are fretting about a behemoth with the resources,cloutand name recognition to snatch away the most promising deals. Just last week, SoftBank swooped in and pumped $1.4 billion into Paytm, Indias largestdigital-paymentsstartup. The deal boosted Paytm's valuation by about 40 percent to $7 billion. That's not outlandish givenPaytm'sdominant market position, but the valuations of other SoftBank deals have prompted head-scratching and ignited alarm that a funding atmosphere that only recently cooled off will heat up again. there's the concern that SoftBank will ladle out more money than startups need or can absorb. Alre

## 2. Identifying underlying themes among the articles
### Follow a Typical Machine Learning Workflow
### 1. Pick your problem - Identify which type of problem we need to solve.
### 2. Represent Data - Represent data using numeric attributes.
### 3. Apply an algorithm - Use a standard algorithm to find a model.

### 1. Pick a problem
#### We are given a large group of articles and need to divide them into groups based on some common attributes.
### Clustering - Items in one group are similar to one another, while items in different groups are dissimilar: maximize intracluster similarity and minimize intercluster similarity.

### 2. Represent Data
#### Machine learning algorithms work with numeric data, so text must be represented using meaningful numeric attributes.
#### Different methods:
#### 1. Term Frequency Representation - each term is represented by the number of times it occurs (the bag-of-words model).
#### 2. Term Frequency - Inverse Document Frequency (TF-IDF) - weight the term frequencies to take the rarity of each word into account.
#### weight = 1 / (number of documents the word appears in); this is a simplified form, and TfidfVectorizer uses a smoothed, log-scaled variant.
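The simplified weighting above (term count divided by the number of documents containing the word) can be worked through on toy data. This sketch follows that simplified formula only; scikit-learn's `TfidfVectorizer` applies a log-scaled inverse document frequency and normalizes each row, so its numbers will differ. The documents and the helper name `simple_tfidf` are made up for illustration.

```python
from collections import Counter

toy_docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["banana", "apple"]]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

def simple_tfidf(doc):
    """Term frequency times the simplified weight 1 / document frequency."""
    term_freq = Counter(doc)
    return {word: term_freq[word] / doc_freq[word] for word in term_freq}

weights = simple_tfidf(toy_docs[0])
print(weights)
```

"banana" appears in every document, so its weight is pulled down, while "apple" occurs twice in the first document and in only two documents overall, so it scores higher.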

### 3. Apply the algorithm
#### K-Means Clustering
#### i. Documents are represented using TF-IDF.
#### ii. Each document is a tuple of N numbers, where N is the total number of distinct words across all documents.
#### Document <-> a tuple of N numbers <-> a point in an N-dimensional hypercube <-> measure the distance between points (distances between points within a cluster are minimized, whereas distances between points in different clusters are maximized).

### N-Dimensional Hypercube
#### line -> 1-D shape;
#### square -> 2-D shape;
#### cube -> 3-D shape;
#### A set of N numbers represents a point in an N-dimensional hypercube.
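Once documents are points, "similar" can be measured as a distance between them. The sketch below computes the Euclidean distance between made-up 3-dimensional document vectors; Euclidean distance is an assumption here, since other measures such as cosine similarity are also commonly used for text.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical document vectors with N = 3 distinct words
doc_a = (1.0, 0.0, 2.0)
doc_b = (1.0, 0.0, 0.0)
doc_c = (0.0, 3.0, 0.0)

dist_ab = euclidean(doc_a, doc_b)
dist_ac = euclidean(doc_a, doc_c)
print(dist_ab, dist_ac)
```

Here `doc_a` is closer to `doc_b` than to `doc_c`, so a clustering algorithm would be more inclined to group the first two together.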

### K-Means Clustering
#### 1. Initialize a set of points as "K" Means - Centroids of the cluster you want to find - K = number of clusters;
#### 2. Assign each point to the cluster belonging to the nearest mean;
#### 3. Find the new means/centroids of the clusters;

#### Convergence - Repeat steps 2 and 3 until the means no longer change, or stop after a fixed number of iterations.
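
The steps above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation. It assumes a simple random choice of initial means (rather than `k-means++`), and the function name `kmeans` and the sample points are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k of the points as the initial means (simple random init).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster of the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Convergence: if no assignment changed, the means will not change either.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each mean from the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old mean if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

# Two well-separated groups of hypothetical 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary. Only the grouping matters, which is why the notebook later inspects the keywords of each cluster to interpret it.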

In [46]:
## Converts Text to TF-IDF Representation
from sklearn.feature_extraction.text import TfidfVectorizer

In [47]:
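## Ignore English stop words, words that appear in more than half of the documents (max_df=0.5),
## and words that appear in fewer than 2 documents (min_df=2)
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')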

In [48]:
### the fit_transform method takes a list of strings and returns a sparse 2-D matrix with one row per document:
# (number of articles) x (number of distinct words), holding the TF-IDF values
X = vectorizer.fit_transform(allPosts)
X

<38x733 sparse matrix of type '<class 'numpy.float64'>'
	with 2377 stored elements in Compressed Sparse Row format>

In [49]:
X[0]

<1x733 sparse matrix of type '<class 'numpy.float64'>'
	with 60 stored elements in Compressed Sparse Row format>

In [50]:
print(X[0]) # prints the row and decimal numbers - TF-IDF for the text

  (0, 605)	0.663392891939
  (0, 3)	0.137383022014
  (0, 106)	0.193327356547
  (0, 646)	0.0554713674627
  (0, 295)	0.240542079543
  (0, 696)	0.0801806931809
  (0, 693)	0.147420642653
  (0, 431)	0.0686915110072
  (0, 311)	0.0611237801748
  (0, 178)	0.0611237801748
  (0, 482)	0.0737103213265
  (0, 647)	0.0425306237539
  (0, 366)	0.0737103213265
  (0, 409)	0.050958003078
  (0, 273)	0.0801806931809
  (0, 346)	0.0611237801748
  (0, 146)	0.0737103213265
  (0, 196)	0.0686915110072
  (0, 63)	0.0801806931809
  (0, 42)	0.0801806931809
  (0, 349)	0.0645908443946
  (0, 155)	0.0801806931809
  (0, 699)	0.0801806931809
  (0, 279)	0.0801806931809
  (0, 393)	0.0801806931809
  :	:
  (0, 197)	0.0645908443946
  (0, 692)	0.0737103213265
  (0, 415)	0.0439821852889
  (0, 506)	0.0801806931809
  (0, 322)	0.0801806931809
  (0, 296)	0.0686915110072
  (0, 528)	0.0645908443946
  (0, 429)	0.277356837314
  (0, 621)	0.160361386362
  (0, 437)	0.0686915110072
  (0, 290)	0.0737103213265
  (0, 45)	0.0910678627769
  (0, 35

In [51]:
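# n_clusters - number of clusters;
# init - method of choosing the initial centroids;
# max_iter - maximum number of iterations for a single run;
# n_init - number of runs with different initial centroids;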
from sklearn.cluster import KMeans
km = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 100, n_init = 1, verbose = True)

In [52]:
km.fit(X)

Initialization complete
Iteration  0, inertia 60.211
Iteration  1, inertia 31.664
Converged at iteration 1: center shift 0.000000e+00 within tolerance 1.238796e-07


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=3, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=True)

### Every document in array X has been assigned a number that represents the cluster to which it belongs.
### These numbers are stored in the array km.labels_;
### The unique function in the numpy module returns the distinct cluster numbers;
#### every article has been assigned one of these numbers as its cluster number.
#### With return_counts=True, it also returns the number of articles in each cluster.

In [53]:
import numpy as np
np.unique(km.labels_, return_counts=True)

(array([0, 1, 2]), array([10, 20,  8]))

### Create a dictionary (text), in which the keys are the cluster numbers and the values are the aggregated texts.

In [54]:
## enumerate yields (index, label) pairs: the first element is the index of an article
## and the second is the cluster number assigned to it;
text = {}
for i, cluster in enumerate(km.labels_):
    oneDocument = allPosts[i]
    if cluster not in text.keys():
        text[cluster] = oneDocument
    else:
        text[cluster] += oneDocument

In [55]:
text

{0: b"Thanks! or Thanks.? Google will tailor suggested email replies to your preferences.If you use Google's Gmail app, you may have seen something new pop up on your screen this week: suggested responses for your emails.The move illustrates one way that Google is using its increased focus on artificial intelligence and machine learning. If you're wondering why and how Google can make these suggestions, here are some answers about the feature and how it works. Google calls the feature Smart Reply, and its pretty much what it sounds like. Google algorithms are scanning your messages and using the information it gleans to suggest ways that you could reply to any given message. Smart Reply relies on machine learning to scan the subject line and body of an email and make suggestions based on what it sees. The company said it has built up a huge bank of anonymized customer messages and response decisions to help accomplish this.It is also designed to remember your individual preferences for

In [56]:
text[0]

b"Thanks! or Thanks.? Google will tailor suggested email replies to your preferences.If you use Google's Gmail app, you may have seen something new pop up on your screen this week: suggested responses for your emails.The move illustrates one way that Google is using its increased focus on artificial intelligence and machine learning. If you're wondering why and how Google can make these suggestions, here are some answers about the feature and how it works. Google calls the feature Smart Reply, and its pretty much what it sounds like. Google algorithms are scanning your messages and using the information it gleans to suggest ways that you could reply to any given message. Smart Reply relies on machine learning to scan the subject line and body of an email and make suggestions based on what it sees. The company said it has built up a huge bank of anonymized customer messages and response decisions to help accomplish this.It is also designed to remember your individual preferences for thi

In [57]:
## find the most frequent words
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from collections import defaultdict
from string import punctuation
from heapq import nlargest
import nltk

In [58]:
_stopwords = set(stopwords.words('english') + list(punctuation)+["million","billion","year","millions","billions","y/y","'s","''"])

In [68]:
## top keywords in each cluster
keywords = {}
counts={}
for cluster in range(3):
    strclust = text[cluster].decode("utf-8") # convert the cluster text into string
    word_sent = word_tokenize(strclust.lower()) # tokenize the cluster into words
    word_sent=[word for word in word_sent if word not in _stopwords] # filter out the stop words
    freq = FreqDist(word_sent) # use FreqDist to compute the frequency distribution of words;
    keywords[cluster] = nlargest(100, freq, key=freq.get) # top 100 frequent words;
    counts[cluster]=freq # all words and their counts stored in the dictionary;

In [69]:
## Find 10 unique keywords in each cluster
unique_keys={}
for cluster in range(3):
    other_clusters=list(set(range(3))-set([cluster])) # the other two clusters, e.g. for cluster 0 this gives 1 and 2
    keys_other_clusters=set(keywords[other_clusters[0]]).union(set(keywords[other_clusters[1]])) # keywords of the other clusters
    unique=set(keywords[cluster])-keys_other_clusters # keep only the keywords that do not appear in the other clusters
    unique_keys[cluster]=nlargest(10, unique, key=counts[cluster].get) # of the remaining keywords, take the 10 most frequent

In [70]:
unique_keys

{0: ['google',
  'self-driving',
  'alexa',
  'usb',
  'alphabet',
  'app',
  'echo',
  'device',
  'phone',
  'could'],
 1: ['quarter',
  'softbank',
  'shares',
  'investors',
  'trading',
  'fund',
  'earnings',
  'cents',
  'per',
  'stock'],
 2: ['drivers',
  'bitcoin',
  'online',
  'hackers',
  'fix',
  'city',
  'developers',
  'pay',
  'driverless',
  'meters']}

### Step 3. Assign Themes to new articles
### Typical machine learning workflow
#### 1. Pick your problem - Identify which type of problem we need to solve;
#### 2. Represent Data - Represent data using numeric attributes;
#### 3. Apply an algorithm - use a standard algorithm to find a model;

### 1. Pick your problem - Identify which type of problem we need to solve; - Classify the article into one of the identified themes. - Classification;

### Typical classification setup:
#### 1. Problem statement - Define the problem statement; (article/problem instance -> classifier/Blackbox -> theme/label)

#### 2. Features - Represent the training data and test data using numerical attributes; (Use the TF-IDF representation)

#### 3. Training - "Train a model" - using the training data;
#### 4. Test - "Test the model" using the test data;

### K-Nearest Neighbors Algorithm
#### Find the K "nearest" neighbors and take a majority vote.
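
The majority vote can be sketched in plain Python. This is a minimal illustration assuming Euclidean distance and uniform votes. The function name `knn_predict` and the sample data are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort the training indices by distance to the query and keep the k closest.
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )[:k]
    # Each neighbor casts one vote for its own label.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data: three points labeled 0, two labeled 1.
train_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 0.9), (0.9, 1.1)]
train_labels = [0, 0, 0, 1, 1]

print(knn_predict(train_points, train_labels, query=(0.8, 0.8), k=3))
print(knn_predict(train_points, train_labels, query=(0.8, 0.8), k=5))
```

With `k=3` the query takes the label of the points around it. With `k=5` every training point votes, so the larger class wins wherever the query lies. A K that is large relative to the training set therefore biases predictions toward the biggest cluster.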

In [73]:
article = "Sören Schwertfeger finished his postdoctorate research on autonomous robots in Germany, and seemed set to go to Europe or the United States, where artificial intelligence was pioneered and established. Instead, he went to China. The balance of power in technology is shifting. China, which for years watched enviously as the West invented the software and the chips powering today’s digital age, has become a major player in artificial intelligence, what some think may be the most important technology of the future. Experts widely believe China is only a step behind the United States. China’s ambitions mingle the most far-out sci-fi ideas with the needs of an authoritarian state: Philip K. Dick meets George Orwell. There are plans to use it to predict crimes, lend money, track people on the country’s ubiquitous closed-circuit cameras, alleviate traffic jams, create self-guided missiles and censor the internet. Beijing is backing its artificial intelligence push with vast sums of money. Having already spent billions on research programs, China is readying a new multibillion-dollar initiative to fund moonshot projects, start-ups and academic research, all with the aim of growing China’s A.I. capabilities, according to two professors who consulted with the government on the plan. China’s private companies are pushing deeply into the field as well, though the line between government and private in China sometimes blurs. Baidu — often called the Google of China and a pioneer in artificial-intelligence-related fields, like speech recognition — this year opened a joint company-government laboratory partly run by academics who once worked on research into Chinese military robots. China is spending more just as the United States cuts back. This past week, the Trump administration released a proposed budget that would slash funding for a variety of government agencies that have traditionally backed artificial intelligence research."

In [90]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=10)
classifier.fit(X,km.labels_) ## Training phase - X = articles (TF-IDF tuples); km.labels_ = array of cluster numbers assigned to the articles;

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [91]:
## Before passing the article to the K-Nearest Neighbors classifier, it must be converted into
# the same TF-IDF representation used for training
test=vectorizer.transform([article.encode('ascii', errors='ignore')])

In [92]:
test

<1x733 sparse matrix of type '<class 'numpy.float64'>'
	with 53 stored elements in Compressed Sparse Row format>

In [93]:
# Test phase - predict or classify the article
classifier.predict(test)

array([1])

In [103]:
article2 = "India's ambitious plan to push electric vehicles at the expense of other technologies could benefit Chinese car makers seeking to enter the market, but is worrying established automakers in the country who have so far focused on making hybrid models. India's most influential government think-tank unveiled a policy blueprint this month aimed at electrifying all vehicles in the country by 2032, in a move that is catching the attention of car makers that are already investing in electric technology in China such as BYD and SAIC. The May 12 report by Niti Aayog, the planning body headed by Prime Minister Narendra Modi, recommends lower taxes and loan interest rates on electric vehicles while capping sales of petrol and diesel cars, seen as a radical shift in policy. India also plans to impose higher taxes on hybrid vehicles compared with electric, under a new unified tax regime set to come into effect from July 1, upsetting car makers like Maruti Suzuki and Toyota Motor. Earlier this year SAIC set up a local unit called MG Motor which is finalising plans to buy a car manufacturing plant in western India. A spokesman at SAIC did not comment specifically on the company's India plans. Warren Buffett-backed BYD already builds electric buses in the country, while rival Chongqing Changan has said it may enter India by 2020."

In [113]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=20)
classifier.fit(X,km.labels_)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=20, p=2,
           weights='uniform')

In [114]:
test2=vectorizer.transform([article2.encode('ascii', errors='ignore')])

In [115]:
classifier.predict(test2)

array([1])