# Text Classification 2

From ["Classifying Text Using Machine Learning"](https://app.pluralsight.com/player?course=python-natural-language-processing&author=swetha-kolalapudi&name=python-natural-language-processing-m3&clip=0&mode=live)

* Extract features using the bag-of-words model
* Use K-means clustering to identify a set of topics
* Use K-Nearest Neighbors to classify text into those topics

### Typical ML Workflow
1. Pick your problem
  * Identify which type of problem we need to solve
2. Represent data
  * Represent the data using numeric attributes (features)
3. Apply an algorithm
  * Use a standard algorithm to find a model

### 1. Clustering
Group items together based on some measure of similarity

* Items in a group must be "similar" to one another
  * Maximize intracluster similarity
* Items in different groups must be "dissimilar" to one another
  * Minimize intercluster similarity

### 2. Represent Data
Use meaningful numeric attributes to represent text

* Term frequency
* TF-IDF

#### Features
Create a list representing the universe of all words that can appear in any text

$(w_1, w_2, ..., w_N)$

Given any piece of text, we can represent it using a tuple of $N$ numbers, where each element is the frequency of one of these words in the text. This is the **Term Frequency Representation**. Note that the order of the words is lost, hence the name "bag of words" model.
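
As a minimal sketch of the term frequency representation described above, the following builds a vocabulary from two toy sentences and turns each sentence into a tuple of counts. The helper names (`build_vocabulary`, `term_frequency_vector`) and the whitespace tokenization are assumptions made for illustration, not part of the course material.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def term_frequency_vector(document, vocabulary):
    """Return one count per vocabulary word; word order is discarded."""
    counts = Counter(document.lower().split())
    return tuple(counts[word] for word in vocabulary)

# Toy corpus (hypothetical data, whitespace-tokenized for simplicity).
docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
vectors = [term_frequency_vector(doc, vocab) for doc in docs]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [(1, 0, 1, 1, 1, 2), (0, 1, 0, 0, 1, 1)]
```

Shuffling the words of a sentence gives the same tuple, which is exactly the loss of word order that the "bag of words" name refers to.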

**Term Frequency - Inverse Document Frequency**: Weight the term frequencies to take the rarity of a word (throughout the corpus) into account.

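$\text{Weight} = \dfrac{1}{\text{number of documents the word appears in}}$

Then, given a tuple of word frequencies, weight each word's frequency by the inverse of the number of documents the word appears in. In other words, we multiply the term frequency (tf) by the inverse document frequency (idf). One common formulation, which normalizes the term frequency and takes the logarithm of the inverse document frequency, is: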

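$$\mathrm{tf}(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}$$

$$\mathrm{idf}(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)$$

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, and $N$ in the idf formula is the number of documents in the corpus $D$ (not the vocabulary size used elsewhere in these notes).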
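
The three formulas above can be sketched directly in plain Python. This is only an illustration on a toy corpus: the function names and the data are assumptions, and `idf` assumes the term occurs in at least one document. scikit-learn's `TfidfVectorizer`, used later in these notes, applies a smoothed idf and L2-normalizes each row by default, so its values will not match this sketch.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Augmented term frequency: 0.5 + 0.5 * f / (largest f in the document)."""
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, corpus_tokens):
    """log(N / number of documents containing the term).

    Assumes the term appears in at least one document.
    """
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(float(len(corpus_tokens)) / n_containing)

def tfidf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Toy corpus (hypothetical data): three tokenized documents.
corpus = [d.split() for d in ["the cat sat", "the dog sat", "the cat ran"]]

print(tfidf("the", corpus[0], corpus))  # 0.0, since "the" is in every document
print(tfidf("dog", corpus[1], corpus))  # log(3), since "dog" is in one of three
```

A word that appears in every document gets an idf of zero and therefore carries no weight, while a word confined to a single document gets the largest weight.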

#### K-Means Clustering
Documents are represented using TF-IDF

Each document is a tuple of $N$ numbers, where $N$ is the total number of distinct words across all documents, so each document is a point in an $N$-dimensional space.

So imagining our points in an $N$-dimensional space, we can now measure the distance between those points.
For clustering, we want to minimize the distance between points within a cluster, and maximize the distance between points in different clusters.

For K-Means clustering, we first decide how many clusters ($K$) we want to divide the data into.
  1. Initialize a set of $K$ points as the "K" means (the centroids of the clusters you want to find).
  2. Assign each point to the cluster belonging to the nearest mean.
  3. Find the new means/centroids of the clusters by taking the average of the coordinates of all the points in each cluster.

Rinse and repeat steps 2 & 3 until we reach convergence (i.e. the point at which the means don't change anymore). If we can't reach convergence, we may want to set a maximum number of iterations the algorithm should run before stopping.
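
The three steps can be sketched in a few lines of NumPy. This is a simplified illustration, not what scikit-learn does internally: it assumes Euclidean distance and picks the initial means at random from the data (rather than with `k-means++`), and the function name `k_means`, its parameters, and the toy points are all hypothetical.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.RandomState(seed)
    # Step 1: pick k distinct data points as the initial means.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest mean (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each mean as the average of its assigned points
        # (an empty cluster keeps its previous mean).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the means no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of 2-D points (toy data).
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

The `max_iter` argument plays the role of the iteration cap mentioned above: the loop stops either when the means stop moving or when the cap is reached.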
  

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [14]:
import csv

# Read every cell of the CSV into one flat list; each cell is treated as one document.
testText = []
with open('C:/Users/dclynch/Desktop/test.csv') as f:
    for row in csv.reader(f):
        testText.extend(row)

print(len(testText))

90


In [15]:
vectorizer = TfidfVectorizer(max_df=0.5,min_df=2,stop_words='english')

In [17]:
X = vectorizer.fit_transform(testText)
X

<90x1169 sparse matrix of type '<type 'numpy.float64'>'
	with 4793 stored elements in Compressed Sparse Row format>

In [18]:
X[0]

<1x1169 sparse matrix of type '<type 'numpy.float64'>'
	with 56 stored elements in Compressed Sparse Row format>

In [20]:
print(X[0])

  (0, 27)	0.049023091094
  (0, 239)	0.277209207046
  (0, 241)	0.637300184222
  (0, 537)	0.098046182188
  (0, 816)	0.344938297829
  (0, 203)	0.0467321399441
  (0, 665)	0.0370166663892
  (0, 800)	0.049023091094
  (0, 957)	0.195578874933
  (0, 621)	0.0518269886937
  (0, 610)	0.049023091094
  (0, 927)	0.038022438479
  (0, 941)	0.110883682819
  (0, 1032)	0.0467321399441
  (0, 607)	0.0518269886937
  (0, 340)	0.049023091094
  (0, 1086)	0.0352185408793
  (0, 376)	0.038022438479
  (0, 999)	0.207307954775
  (0, 843)	0.110883682819
  (0, 240)	0.147069273282
  (0, 377)	0.0554418414093
  (0, 692)	0.0298475174277
  (0, 328)	0.0554418414093
  (0, 844)	0.221767365637
  :	:
  (0, 351)	0.0554418414093
  (0, 818)	0.098046182188
  (0, 944)	0.0273757654159
  (0, 194)	0.0309906181314
  (0, 1010)	0.0554418414093
  (0, 455)	0.0554418414093
  (0, 1111)	0.0467321399441
  (0, 662)	0.0554418414093
  (0, 1040)	0.140196419832
  (0, 824)	0.0467321399441
  (0, 44)	0.134385505039
  (0, 446)	0.0554418414093
  (0, 740)	

In [21]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters = 6, init = 'k-means++', max_iter = 100, n_init = 1, verbose = True)
km.fit(X)

Initialization complete
Iteration  0, inertia 142.042
Iteration  1, inertia 75.725
Converged at iteration 1: center shift 0.000000e+00 within tolerance 8.124096e-08


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=6, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=True)

In [22]:
import numpy as np
np.unique(km.labels_, return_counts=True)

(array([0, 1, 2, 3, 4, 5]), array([15, 18,  9,  8, 17, 23], dtype=int64))

In [23]:
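# Concatenate the documents of each cluster into one string per cluster.
text = {}
for i, cluster in enumerate(km.labels_):
    oneDocument = testText[i]
    if cluster not in text:
        text[cluster] = oneDocument
    else:
        # Join with a space so words at document boundaries do not run together.
        text[cluster] += " " + oneDocument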

In [24]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from collections import defaultdict
from string import punctuation
from heapq import nlargest
import nltk

In [25]:
_stopwords = set(stopwords.words('english') + list(punctuation) + ["million", "billion"])

In [26]:
keywords = {}
counts = {}
for cluster in range(6):
    word_sent = word_tokenize(text[cluster].lower())
    word_sent=[word for word in word_sent if word not in _stopwords]
    freq = FreqDist(word_sent)
    keywords[cluster] = nlargest(100, freq, key=freq.get)
    counts[cluster]=freq

In [29]:
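unique_keys={}
for cluster in range(6):
    other_clusters = list(set(range(6))-set([cluster]))
    # NOTE: only two of the five other clusters are compared here, so a keyword
    # can still be shared with the remaining three (e.g. 'repair' is listed for
    # both clusters 2 and 3 in the output below).
    keys_other_clusters = set(keywords[other_clusters[0]]).union(set(keywords[other_clusters[1]]))
    unique = set(keywords[cluster])-keys_other_clusters
    # Keep the 20 most frequent keywords that survive the filter.
    unique_keys[cluster]=nlargest(20, unique, key=counts[cluster].get)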
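
The loop above subtracts the keywords of only two other clusters, which is enough for three clusters but not for six. A possible variant that compares each cluster against every other cluster is sketched below on toy data. The function name `unique_keywords` and the toy dictionaries are hypothetical, and this version was not run on the dataset in these notes, so no output for the real clusters is shown.

```python
from heapq import nlargest

def unique_keywords(keywords, counts, top_n=20):
    """For each cluster, keep the top_n keywords that no other cluster shares."""
    result = {}
    for cluster in keywords:
        # Union of the keyword lists of every other cluster.
        others = set()
        for other, words in keywords.items():
            if other != cluster:
                others.update(words)
        unique = set(keywords[cluster]) - others
        # Rank the remaining words by their frequency within this cluster.
        result[cluster] = nlargest(top_n, unique, key=counts[cluster].get)
    return result

# Toy stand-ins for the `keywords` and `counts` dictionaries built earlier.
toy_keywords = {
    0: ["heart", "failure", "repair"],
    1: ["pelvic", "repair"],
    2: ["surgical", "repair", "nerve"],
}
toy_counts = {
    0: {"heart": 5, "failure": 3, "repair": 2},
    1: {"pelvic": 4, "repair": 1},
    2: {"surgical": 6, "repair": 4, "nerve": 1},
}

result = unique_keywords(toy_keywords, toy_counts)
print(result)  # {0: ['heart', 'failure'], 1: ['pelvic'], 2: ['surgical', 'nerve']}
```

Because "repair" occurs in all three toy clusters, it is dropped from every list, which is the behavior one would expect from a "unique keywords" filter.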

In [30]:
unique_keys

{0: ['x2009',
  'x2013',
  'isolates',
  'ili',
  'age',
  'gentamicin',
  'dose',
  'group',
  'subjects',
  'hfrs',
  'aki',
  'pregnancy',
  'igg',
  '1',
  'rates',
  'weeks',
  'population',
  'susceptibility',
  'young',
  'ltbi'],
 1: ['heart',
  'failure',
  'hf',
  'hfpef',
  'fraction',
  'ejection',
  'cancer',
  'hba1c',
  'higher',
  'preserved',
  'ckd',
  'methods',
  'ef',
  'follow-up',
  'serum',
  'hfref',
  'pressure',
  'hfnef',
  'haemodynamic',
  'studies'],
 2: ['surgical',
  'incontinence',
  'fecal',
  'techniques',
  'anal',
  'nerve',
  'fi',
  'repair',
  'stimulation',
  'patient',
  'option',
  'surgeon',
  'anaesthetists',
  'placement',
  'device',
  'procedures',
  'success',
  'anaesthetist',
  'options',
  'management'],
 3: ['pelvic',
  'prolapse',
  'floor',
  'management',
  'rectal',
  'review',
  'posterior',
  'crc',
  'repair',
  'surgical',
  'disorders',
  'complex',
  'approaches',
  'organ',
  'takotsubo',
  'perineal',
  'grade',
  'pain'