## Collecting Data

In [87]:
# we start by collecting a set of articles from a blog to create our dataset
from urllib.request import urlopen
from bs4 import BeautifulSoup

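def getAllDoxyDonkeyPosts(url, links):
    # Follow the "Older Posts" link on each page and collect every such URL in links.
    response = urlopen(url)
    soup = BeautifulSoup(response, "lxml")
    for a in soup.findAll('a'):
        # Not every anchor has both attributes; .get() returns None instead of raising,
        # so no bare except is needed and real errors are no longer swallowed.
        href = a.get('href')
        title = a.get('title')
        if href and title == "Older Posts":
            print(title, href)
            links.append(href)
            getAllDoxyDonkeyPosts(href, links)
    return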

blogUrl = "http://doxydonkey.blogspot.in"
links = []
getAllDoxyDonkeyPosts(blogUrl, links)

Older Posts http://doxydonkey.blogspot.com/search?updated-max=2017-05-23T19:53:00-07:00&max-results=7
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2017-05-14T19:02:00-07:00&max-results=7&start=7&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2017-05-02T19:43:00-07:00&max-results=7&start=14&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2017-04-17T19:26:00-07:00&max-results=7&start=21&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2017-04-10T18:56:00-07:00&max-results=7&start=28&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2017-03-30T19:57:00-07:00&max-results=7&start=35&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2017-03-20T19:47:00-07:00&max-results=7&start=42&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2017-03-02T17:42:00-08:00&max-results=7&start=49&by-date=false
Older Posts http://doxyd

Older Posts http://doxydonkey.blogspot.com/search?updated-max=2015-04-23T20:19:00-07:00&max-results=7&start=462&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2015-04-14T19:40:00-07:00&max-results=7&start=469&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2015-04-05T20:22:00-07:00&max-results=7&start=476&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2015-03-24T20:12:00-07:00&max-results=7&start=483&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2015-03-15T20:41:00-07:00&max-results=7&start=490&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2015-03-03T19:30:00-08:00&max-results=7&start=497&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2015-02-22T19:55:00-08:00&max-results=7&start=504&by-date=false
Older Posts http://doxydonkey.blogspot.com/search?updated-max=2015-02-11T20:02:00-08:00&max-results=7&start=511&by-dat
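
The crawl above makes one nested call per page. A loop can do the same job without deepening the call stack, and it makes it easy to cap the number of pages or stop on a repeated URL. The following is a minimal sketch, not the notebook's method: `collect_older_links` and the `find_older_link` callable are hypothetical names, and in the notebook `find_older_link` would wrap the `urlopen` and `BeautifulSoup` lookup.

```python
def collect_older_links(start_url, find_older_link, max_pages=1000):
    """Follow "Older Posts" links iteratively and return them in visit order.

    find_older_link is a hypothetical callable (not part of the notebook):
    given a page URL it returns the URL behind that page's "Older Posts"
    link, or None when the page has no such link.
    """
    links = []
    seen = {start_url}
    url = start_url
    while len(links) < max_pages:
        next_url = find_older_link(url)
        # Stop at the last page, or if the site ever links back to a visited page.
        if next_url is None or next_url in seen:
            break
        links.append(next_url)
        seen.add(next_url)
        url = next_url
    return links


# Stand-in for the real blog: a tiny chain of pages held in a dict.
fake_blog = {"page0": "page1", "page1": "page2", "page2": None}
print(collect_older_links("page0", fake_blog.get))  # ['page1', 'page2']
```

Passing the lookup in as a function keeps the traversal logic testable without network access.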

In [88]:
def getDoxyDonkeyText(testUrl):
    response = urlopen(testUrl)
    soup = BeautifulSoup(response, "lxml")
    mydivs = soup.findAll("div", {"class":'post-body'})
    
    posts = []
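    for div in mydivs:
        # Each <li> is one article summary. Non-ASCII characters become "?" and are then
        # blanked out, which also blanks any genuine question marks in the text.
        posts += [li.text.encode('ascii', errors='replace').decode().replace("?", " ") for li in div.findAll("li")]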
    return posts

In [89]:
doxyDonkeyPosts = []
for link in links:
    doxyDonkeyPosts += getDoxyDonkeyText(link)

In [90]:
len(doxyDonkeyPosts)

2804

In [91]:
# so we now have a large number of documents stored
doxyDonkeyPosts[0]

"SoftBank's $100 Billion Tech Fund Rankles VCs as Valuations Soar: In the months since Softbank Group Corp. unveiled plans for a $100 billion technology fund, the Japanese company has been making its presence deeply felt across the industry. The Vision Fund closed a few days ago with $93 billion in initial commitments, and already venture firms from London to Silicon Valley are fretting about a behemoth with the resources, clout and name recognition to snatch away the most promising deals. Just last week, SoftBank swooped in and pumped $1.4 billion into Paytm, India s largest digital-payments startup. The deal boosted Paytm's valuation by about 40 percent to $7 billion. That's not outlandish given Paytm's dominant market position, but the valuations of other SoftBank deals have prompted head-scratching and ignited alarm that a funding atmosphere that only recently cooled off will heat up again. there's the concern that SoftBank will ladle out more money than startups need or can absorb

## Clustering

In [92]:
# the objective is to find "themes" or "topics" and group the posts accordingly
from sklearn.feature_extraction.text import TfidfVectorizer

In [93]:
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')

In [94]:
X = vectorizer.fit_transform(doxyDonkeyPosts)

In [95]:
# shape: number of articles x number of terms kept in the vocabulary (after the max_df/min_df and stop-word filters)
X

<2804x13220 sparse matrix of type '<class 'numpy.float64'>'
	with 280835 stored elements in Compressed Sparse Row format>
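
To make the matrix above less of a black box, here is a simplified pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts times a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, with each row scaled to unit L2 length. The function name `tfidf_sketch` and the toy documents are made up for illustration, and stop-word removal is left out to keep it short.

```python
import math
import re
from collections import Counter


def tfidf_sketch(docs, max_df=0.5, min_df=2):
    """Simplified TF-IDF: raw counts times smoothed IDF, L2-normalised rows."""
    # Lowercase and keep tokens of two or more word characters.
    tokenised = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Keep terms found in at least min_df documents and at most max_df of all documents.
    vocab = sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows


# Toy documents, invented for this example.
docs = [
    "drone maker ships new drone",
    "drone sales rise this quarter",
    "quarter revenue beats estimates",
    "startup raises new funding",
]
vocab, rows = tfidf_sketch(docs)
print(vocab)                  # ['drone', 'new', 'quarter']
print(len(rows), len(vocab))  # 4 3
```

Terms that appear in only one document are dropped by `min_df=2`, which is why the vocabulary is much smaller than the set of distinct words.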

In [113]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 100, n_init = 1, verbose = True)

In [114]:
km.fit(X)

Initialization complete
Iteration  0, inertia 2788.000
Iteration  1, inertia 2707.063
Converged at iteration 1: center shift 0.000000e+00 within tolerance 7.307995e-09


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=3, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=True)

In [115]:
# the 3 cluster labels and how many articles fall in each; in this run two clusters hold a single article each
import numpy as np
np.unique(km.labels_, return_counts=True)

(array([0, 1, 2]), array([   1,    1, 2802], dtype=int64))
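
The counts above put nearly every article in one cluster, which is worth flagging before interpreting the keywords. The helper below is a small sketch for spotting such lopsided results; `small_clusters` and its `min_share` threshold are hypothetical, not part of scikit-learn. Since the model was fitted with `n_init = 1` and no `random_state`, one thing to try is a larger `n_init` or a fixed `random_state` and then comparing the counts.

```python
from collections import Counter


def small_clusters(labels, min_share=0.01):
    """Return {cluster: size} for clusters holding less than min_share of the points."""
    sizes = Counter(labels)
    total = len(labels)
    return {c: n for c, n in sorted(sizes.items()) if n / total < min_share}


# Label counts copied from the output above: 1, 1 and 2802 documents.
labels = [0] + [1] + [2] * 2802
print(small_clusters(labels))  # {0: 1, 1: 1}
```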

In [116]:
# let's see if we understand what the clusters are about
text = {}
for i, cluster in enumerate(km.labels_):
    oneDocument = doxyDonkeyPosts[i]
    if cluster not in text:
        text[cluster] = oneDocument
    else:
        # add a space so the last word of one article does not fuse with the first word of the next
        text[cluster] += " " + oneDocument

In [117]:
# we now have a dictionary where each key (a cluster label) maps to the concatenated text of all the articles in that cluster
len(text)

3

In [118]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from collections import defaultdict
from string import punctuation
from heapq import nlargest
import nltk

In [119]:
_sw = set(stopwords.words('english') + list(punctuation) + ["million", "billion", "year", "millions","billions",  "y/y", "'s", "''"])

keywords = {}
counts = {}
for cluster in range(3):
    word_sent = word_tokenize(text[cluster].lower())
    word_sent = [word for word in word_sent if word not in _sw]
    freq = FreqDist(word_sent)
    keywords[cluster] = nlargest(100, freq, key=freq.get)
    counts[cluster]=freq

In [120]:
unique_keys = {}
for cluster in range(3):
    other_clusters = list(set(range(3))-set([cluster]))
    keys_other_clusters = set(keywords[other_clusters[0]]).union(set(keywords[other_clusters[1]]))
    unique=set(keywords[cluster])-keys_other_clusters
    unique_keys[cluster]=nlargest(20, unique, key=counts[cluster].get)
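
The cell above hard-codes three clusters. As a sketch of the same idea for any number of clusters, the function below keeps, for each cluster, the frequent words that do not appear in any other cluster's top list. The name `unique_top_words` and the toy counts are made up for illustration. It expects `collections.Counter` objects; NLTK's `FreqDist` subclasses `Counter`, so the `counts` dictionary built earlier should work as input.

```python
from collections import Counter


def unique_top_words(counts_by_cluster, top_n=100, keep=20):
    """For each cluster, keep frequent words absent from every other cluster's top list."""
    top = {
        c: {w for w, _ in cnt.most_common(top_n)}
        for c, cnt in counts_by_cluster.items()
    }
    result = {}
    for c, cnt in counts_by_cluster.items():
        # Union of the top words of all the other clusters.
        others = set().union(*(words for k, words in top.items() if k != c))
        ranked = [w for w, _ in cnt.most_common(top_n) if w not in others]
        result[c] = ranked[:keep]
    return result


# Toy word counts, invented for this example; "market" is shared by all clusters.
toy_counts = {
    0: Counter("drone drone project loon market".split()),
    1: Counter("snapchat snapchat valuation market".split()),
    2: Counter("revenue revenue shares market".split()),
}
toy_unique = unique_top_words(toy_counts)
print(toy_unique[0][0], toy_unique[1][0], toy_unique[2][0])  # drone snapchat revenue
```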

In [121]:
# cluster 0 holds a single article in this run; its keywords suggest Alphabet's experimental projects (Titan, Loon, drones)
unique_keys[0]

['titan',
 'project',
 'alphabet',
 'drone',
 'x',
 'parent',
 'projects',
 'loon',
 'parts',
 'economics',
 'strong',
 'present',
 'workers',
 'promising',
 'exploration',
 'work',
 'develop',
 'rural',
 'via',
 'going']

In [122]:
# cluster 1 also holds a single article in this run; its keywords suggest Snapchat and startup funding/valuations
unique_keys[1]

['snapchat',
 'venture',
 'send',
 'offer',
 'photos',
 'xiaomi',
 '--',
 'latest',
 'valuation',
 'increase',
 'raised',
 'friends',
 'acquisition',
 'popular',
 'investing',
 'valued',
 '2013.',
 'ranked',
 'among',
 'hot']

In [123]:
# this cluster holds nearly all the articles, so its keywords are broad business terms (revenue, sales, shares, quarter), as in earnings reports
unique_keys[2]

['percent',
 'new',
 'also',
 'apple',
 'revenue',
 'companies',
 'like',
 'amazon',
 'business',
 'china',
 'market',
 'online',
 'sales',
 'first',
 'quarter',
 'services',
 'shares',
 'twitter',
 'could',
 'time']

## Classification

In [124]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=10)
classifier.fit(X, km.labels_)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [125]:
articleText = "DJI’s new selfie drone is controlled with just a wave of your hand: DJI, the world’s biggest drone company, has a tiny new drone called the Spark. It’s the most affordable, accessible drone yet from the Chinese drone maker, costing $499. The Spark weighs only half a pound and is about the size of a can of soda. It’s designed to be carried for daily, spontaneous use, like in a backpack. And unlike DJI’s other drones, which are piloted via a smartphone or a separate controller, the Spark uses gesture recognition, meaning it moves in the direction you wave your hand, making it super easy to position in front of you. The Spark can even land using gesture control, as was demonstrated in an unveiling event today when the presenter landed the small drone on his palm. The Sparks flies at about 31 miles per hour. Like other consumer drones, the Spark has a short flight time. It only flies for 16 minutes before needing to swap batteries or be recharged (though its batteries can be recharged with a micro USB on the go). The larger Mavic can fly for 27 minutes and GoPro’s Karma clocks about 20 minutes of flight time. But short flying time hasn’t stopped people from buying new drones, and analysts predict the market will only continue to grow. The analyist firm Gartner estimates that this year the global personal drone market will be valued at $2.8 billion."

In [126]:
test = vectorizer.transform([articleText.encode('ascii', errors='ignore').decode()])

In [127]:
classifier.predict(test)

array([2])

In [128]:
classifier.predict_proba(test)

array([[0., 0., 1.]])
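
The probabilities above follow from how uniform-weight k-NN votes: each of the `n_neighbors` nearest training points casts one vote, and the reported probability is the share of votes per class. A cluster that contains a single article can supply at most 1 of the 10 votes, so with these labels the classifier can only ever predict cluster 2. The sketch below reproduces that arithmetic in plain Python; `knn_vote_shares` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter


def knn_vote_shares(neighbor_labels, classes):
    """Share of neighbours per class, as uniform-weight k-NN voting would compute it."""
    votes = Counter(neighbor_labels)
    k = len(neighbor_labels)
    return [votes[c] / k for c in classes]


# Best case for cluster 0: its single article is among the 10 nearest neighbours.
print(knn_vote_shares([0] + [2] * 9, classes=[0, 1, 2]))  # [0.1, 0.0, 0.9]
# What the output above implies: all 10 neighbours came from cluster 2.
print(knn_vote_shares([2] * 10, classes=[0, 1, 2]))       # [0.0, 0.0, 1.0]
```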

In [129]:
articleText.encode('ascii', errors='ignore').decode()

'DJIs new selfie drone is controlled with just a wave of your hand: DJI, the worlds biggest drone company, has a tiny new drone called the Spark. Its the most affordable, accessible drone yet from the Chinese drone maker, costing $499. The Spark weighs only half a pound and is about the size of a can of soda. Its designed to be carried for daily, spontaneous use, like in a backpack. And unlike DJIs other drones, which are piloted via a smartphone or a separate controller, the Spark uses gesture recognition, meaning it moves in the direction you wave your hand, making it super easy to position in front of you. The Spark can even land using gesture control, as was demonstrated in an unveiling event today when the presenter landed the small drone on his palm. The Sparks flies at about 31 miles per hour. Like other consumer drones, the Spark has a short flight time. It only flies for 16 minutes before needing to swap batteries or be recharged (though its batteries can be recharged with a m
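
Encoding with `errors='ignore'` drops the curly apostrophes entirely, which is why "DJI’s" becomes "DJIs" in the output above. One option, sketched below under the assumption that only a handful of typographic characters matter, is to map them to ASCII equivalents before encoding. `ASCII_REPLACEMENTS` and `to_ascii` are hypothetical names, and the mapping is deliberately incomplete.

```python
# Typographic characters mapped to ASCII stand-ins; extend as needed.
ASCII_REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
}


def to_ascii(text):
    """Swap common typographic characters for ASCII, then drop any remaining non-ASCII."""
    table = str.maketrans(ASCII_REPLACEMENTS)
    return text.translate(table).encode("ascii", errors="ignore").decode()


print(to_ascii("DJI\u2019s new selfie drone"))  # DJI's new selfie drone
```

Keeping the apostrophe means the tokenizer sees "DJI's" rather than the fused token "DJIs".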