# K-Means Clustering with Scikit-Learn

In the previous lesson, we learned about a text analysis method called *term frequency–inverse document frequency*, often abbreviated *tf-idf*. Tf-idf is a method that tries to identify the most distinctively frequent or significant words in a document. We specifically learned how to calculate tf-idf scores using word frequencies per page—or "extracted features"—made available by the HathiTrust Digital Library.

In this lesson, we're going to calculate tf-idf scores using a collection of plain text (.txt) files and the Python library scikit-learn, which has a quick and nifty module called [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). We'll then use those scores to group similar documents with K-Means clustering.

We will cover how to:
- Calculate and normalize tf-idf scores for U.S. Inaugural Addresses with scikit-learn
- Cluster the addresses, and then a collection of *New York Times* obituaries, with K-Means
- Plot the resulting clusters with PCA and t-SNE

## Dataset

### U.S. Inaugural Addresses

```{epigraph}
This is the meaning of our liberty and our creed; why men and women and children of every race and every faith can join in celebration across this magnificent Mall, and why a man whose father less than 60 years ago might not have been served at a local restaurant can now stand before you to take a most sacred oath.  So let us mark this day with remembrance of who we are and how far we have traveled.

--  Barack Obama, Inaugural Presidential Address, January 2009 
```

During Barack Obama's Inaugural Address in January 2009, he mentioned "women" four different times, including in the passage quoted above. How distinctive is Obama's inclusion of women in this address compared to all other U.S. Presidents? This is one of the questions that we're going to try to answer with tf-idf.

## Breaking Down the TF-IDF Formula

But first, let's quickly discuss the tf-idf formula. The idea is pretty simple.

**tf-idf = term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1**\***

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), and you flip that fraction on its head. Taking the logarithm of the flipped fraction and adding 1 gives the inverse document frequency. Finally, you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the *inverse*, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word "said" vs. the word "pigeons." The term "said" appears in 13 (document frequency) of the 14 (total documents) *Lost in the City* stories (14 / 13 --> a smaller inverse document frequency), while the term "pigeons" occurs in only 2 (document frequency) of the 14 stories (14 / 2 --> a bigger inverse document frequency, and so a bigger tf-idf boost).

\*There are several slightly different ways to calculate inverse document frequency. The version of idf that we're going to use is the [scikit-learn default](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer), which uses "smoothing": it adds 1 to both the numerator and the denominator:

**inverse_document_frequency** = log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1

```{margin}
> If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.  
> -[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)
```
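To make the smoothed formula concrete, here is a small worked sketch in plain Python. It plugs in the document counts from the "said" and "pigeons" example above (14 stories; 13 contain "said", 2 contain "pigeons"). The function name `smoothed_idf` and the term frequency of 3 are assumptions made for illustration, and the logarithm is the natural log, as in the scikit-learn formula quoted in the margin.

```python
import math

def smoothed_idf(total_docs, docs_with_term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + total_docs) / (1 + docs_with_term)) + 1

total_docs = 14  # total number of stories in the example above

idf_said = smoothed_idf(total_docs, 13)    # "said" appears in 13 of 14 stories
idf_pigeons = smoothed_idf(total_docs, 2)  # "pigeons" appears in 2 of 14 stories

# Hypothetical term frequency: suppose each word occurs 3 times in one story
term_frequency = 3
tfidf_said = term_frequency * idf_said
tfidf_pigeons = term_frequency * idf_pigeons

print(f"idf(said)       = {idf_said:.3f}")
print(f"idf(pigeons)    = {idf_pigeons:.3f}")
print(f"tf-idf(said)    = {tfidf_said:.3f}")
print(f"tf-idf(pigeons) = {tfidf_pigeons:.3f}")
```

With the same term frequency, the rarer word ends up with the higher score. These are raw scores, before any length normalization is applied.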

## TF-IDF with scikit-learn

[scikit-learn](https://scikit-learn.org/stable/index.html), imported as `sklearn`, is a popular Python library for machine learning approaches such as clustering, classification, and regression. In this lesson, we're going to use scikit-learn's `TfidfVectorizer` to calculate tf-idf scores and its `KMeans` class to cluster the documents.

Install scikit-learn

In [45]:
!pip install scikit-learn



Import necessary modules and libraries

In [1]:
import altair as alt

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option("display.max_rows", 600)
from pathlib import Path  
import glob

We also imported `pandas` and changed its default display setting so that up to 600 rows are shown. And we imported two libraries that will help us work with files and the file system: [`pathlib`](https://docs.python.org/3/library/pathlib.html#basic-use) and [`glob`](https://docs.python.org/3/library/glob.html).

#### Set Directory Path

Below we're setting the directory filepath that contains all the text files that we want to analyze.

In [3]:
directory_path = "../texts/history/US_Inaugural_Addresses/"

Then we're going to use `glob` and `Path` to make a list of all the filepaths in that directory and a list of all the address titles (the filenames without their extensions).

In [4]:
text_files = glob.glob(f"{directory_path}/*.txt")

In [5]:
text_files

['../texts/history/US_Inaugural_Addresses/13_van_buren_1837.txt',
 '../texts/history/US_Inaugural_Addresses/47_nixon_1973.txt',
 '../texts/history/US_Inaugural_Addresses/50_reagan_1985.txt',
 '../texts/history/US_Inaugural_Addresses/53_clinton_1997.txt',
 '../texts/history/US_Inaugural_Addresses/17_pierce_1853.txt',
 '../texts/history/US_Inaugural_Addresses/14_harrison_1841.txt',
 '../texts/history/US_Inaugural_Addresses/56_obama_2009.txt',
 '../texts/history/US_Inaugural_Addresses/25_cleveland_1885.txt',
 '../texts/history/US_Inaugural_Addresses/03_adams_john_1797.txt',
 '../texts/history/US_Inaugural_Addresses/12_jackson_1833.txt',
 '../texts/history/US_Inaugural_Addresses/11_jackson_1829.txt',
 '../texts/history/US_Inaugural_Addresses/36_hoover_1929.txt',
 '../texts/history/US_Inaugural_Addresses/45_johnson_1965.txt',
 '../texts/history/US_Inaugural_Addresses/51_bush_george_h_w_1989.txt',
 '../texts/history/US_Inaugural_Addresses/21_grant_1869.txt',
 '../texts/history/US_Inaugural_A

In [6]:
text_titles = [Path(text).stem for text in text_files]
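Note that `glob.glob` returns files in arbitrary order, which is why the list above is not chronological. If you would like the addresses in order, a minimal sketch is to sort the filepaths before taking their stems. The variable names below are made up for illustration, and the approach assumes every filename starts with a zero-padded number, as the ones shown above do.

```python
from pathlib import Path

# A few filepaths copied from the output above (in glob's arbitrary order)
example_files = [
    "../texts/history/US_Inaugural_Addresses/13_van_buren_1837.txt",
    "../texts/history/US_Inaugural_Addresses/47_nixon_1973.txt",
    "../texts/history/US_Inaugural_Addresses/03_adams_john_1797.txt",
]

# Sorting the strings puts the zero-padded prefixes (03, 13, 47) in order
sorted_files = sorted(example_files)

# .stem drops the directory and the ".txt" extension
sorted_titles = [Path(filepath).stem for filepath in sorted_files]
print(sorted_titles)
```

The rest of this lesson keeps the unsorted order, so the outputs below follow the order that `glob` happened to return.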

In [7]:
text_titles

['13_van_buren_1837',
 '47_nixon_1973',
 '50_reagan_1985',
 '53_clinton_1997',
 '17_pierce_1853',
 '14_harrison_1841',
 '56_obama_2009',
 '25_cleveland_1885',
 '03_adams_john_1797',
 '12_jackson_1833',
 '11_jackson_1829',
 '36_hoover_1929',
 '45_johnson_1965',
 '51_bush_george_h_w_1989',
 '21_grant_1869',
 '41_truman_1949',
 '33_wilson_1917',
 '49_reagan_1981',
 '30_roosevelt_theodore_1905',
 '07_madison_1813',
 '09_monroe_1821',
 '48_carter_1977',
 '32_wilson_1913',
 '19_lincoln_1861',
 '01_washington_1789',
 '29_mckinley_1901',
 '04_jefferson_1801',
 '34_harding_1921',
 '52_clinton_1993',
 '35_coolidge_1925',
 '39_roosevelt_franklin_1941',
 '28_mckinley_1897',
 '24_garfield_1881',
 '22_grant_1873',
 '15_polk_1845',
 '54_bush_george_w_2001',
 '02_washington_1793',
 '38_roosevelt_franklin_1937',
 '37_roosevelt_franklin_1933',
 '18_buchanan_1857',
 '16_taylor_1849',
 '05_jefferson_1805',
 '26_harrison_1889',
 '44_kennedy_1961',
 '23_hayes_1877',
 '20_lincoln_1865',
 '57_obama_2013',
 '1

## Calculate tf–idf

To calculate tf–idf scores for every word, we're going to use scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

When you initialize TfidfVectorizer, you can choose to set it with different parameters. These parameters will change the way you calculate tf–idf.

The recommended way to run `TfidfVectorizer` is with smoothing (`smooth_idf = True`) and normalization (`norm='l2'`) turned on. These parameters will better account for differences in text length, and overall produce more meaningful tf–idf scores. Smoothing and L2 normalization are actually the default settings for `TfidfVectorizer`, so to turn them on, you don't need to include any extra code at all.
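As a quick check of what these defaults do, here is a small sketch on three made-up sentences (not the inaugural corpus). It compares a vectorizer created with no arguments to one with `smooth_idf=True` and `norm='l2'` written out, and then measures the length of each document's row of scores. The variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny made-up documents of different lengths
toy_documents = [
    "peace and freedom for the nation",
    "the union and the constitution",
    "freedom freedom freedom",
]

# The defaults, and the same settings written out explicitly
default_vectorizer = TfidfVectorizer()
explicit_vectorizer = TfidfVectorizer(smooth_idf=True, norm="l2")

default_scores = default_vectorizer.fit_transform(toy_documents).toarray()
explicit_scores = explicit_vectorizer.fit_transform(toy_documents).toarray()

# The two matrices match because these settings are the defaults
print(np.allclose(default_scores, explicit_scores))

# With L2 normalization, every document's row has Euclidean length 1
row_lengths = np.linalg.norm(default_scores, axis=1)
print(row_lengths)
```

Because every row is scaled to the same length, a long address does not get larger scores simply for containing more words.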

Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

In [8]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

Run TfidfVectorizer on our `text_files`

In [9]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

In [10]:
tfidf_vector.shape

(58, 8999)

Make a DataFrame out of the resulting tf–idf vector, setting the "feature names" or words as columns and the titles as rows

In [11]:
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())

Display the resulting DataFrame, which has one row per address and one column per word

In [12]:
tfidf_df

Unnamed: 0,000,03,04,05,100,120,125,13,14th,151,...,young,younger,youngest,youth,youthful,zachary,zeal,zealous,zealously,zone
13_van_buren_1837,0.0,0.011681,0.011924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020958,0.0
47_nixon_1973,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.020361,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50_reagan_1985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034165,0.0,0.0,...,0.034977,0.0,0.0,0.027025,0.0,0.0,0.0,0.0,0.0,0.0
53_clinton_1997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.018566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17_pierce_1853,0.0,0.013476,0.013757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14_harrison_1841,0.0,0.006286,0.006417,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011279,0.0
56_obama_2009,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.020657,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25_cleveland_1885,0.0,0.019727,0.020138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.032556,0.0,0.0,0.0
03_adams_john_1797,0.0,0.016127,0.016463,0.0,0.0,0.0,0.0,0.0,0.0,0.040513,...,0.0,0.0,0.0,0.0,0.0,0.0,0.026615,0.0,0.0,0.0
12_jackson_1833,0.0,0.025599,0.026133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## K Means From DataFrame Manipulation

In [28]:
tfidf_vector = tfidf_df.values

## K Means

In [29]:
from sklearn.cluster import KMeans

num_clusters = 4

km = KMeans(n_clusters=num_clusters, n_init=10) # 10 was the default in older scikit-learn releases; newer ones default to 'auto', so we set it explicitly

km.fit(tfidf_vector)

# km.labels_ gives you the cluster assignments
clusters = km.labels_.tolist()
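To give a feel for what `KMeans` is doing, here is a minimal from-scratch sketch of Lloyd's algorithm, the basic assign-then-update loop behind k-means. This is not scikit-learn's implementation, which chooses its starting centroids more carefully and repeats the whole procedure `n_init` times. The toy points, the hand-picked starting centroids, and the function names are all assumptions for illustration.

```python
import math

def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
        for point in points
    ]

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old_centroid in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:  # leave an empty cluster's centroid where it was
            new_centroids.append(old_centroid)
            continue
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids

# Toy 2-D "documents": two near the origin, two near (1, 1)
points = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
centroids = [points[0], points[2]]  # hand-picked starting centroids

for _ in range(10):  # a fixed number of rounds, for simplicity
    labels = assign_clusters(points, centroids)
    centroids = update_centroids(points, labels, centroids)

print(labels)
print(centroids)
```

Because scikit-learn starts from random centroids, cluster numbers (and sometimes memberships) can differ from run to run. Passing a `random_state` value to `KMeans` makes the result repeatable.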

In [30]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
words = tfidf_vectorizer.get_feature_names_out()

top_words_list = []
for number in range(num_clusters):
    
    top_ten_words = [words[index] for index in order_centroids[number, :10]]
    top_words_list.append(" ".join(top_ten_words))
    print(f"---\nCluster {number}:\n{top_ten_words}")

---
Cluster 0:
['government', 'world', 'peace', 'people', 'great', 'nation', 'shall', 'men', 'life', 'nations']
---
Cluster 1:
['government', 'people', 'shall', 'states', 'constitution', 'public', 'country', 'congress', 'union', 'laws']
---
Cluster 2:
['america', 'world', 'freedom', 'new', 'nation', 'people', 'today', 'let', 'americans', 'know']
---
Cluster 3:
['union', 'war', 'government', 'states', 'public', 'great', 'people', 'country', 'united', 'peace']
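The `argsort()[:, ::-1]` step above is compact, so here is a small sketch of what it does on a hypothetical set of centroids. The four-word vocabulary and the scores are invented for illustration; in the lesson, each centroid has one score for every word in the corpus.

```python
import numpy as np

# Hypothetical centroids for 2 clusters over a 4-word vocabulary
vocabulary = np.array(["constitution", "freedom", "union", "world"])
cluster_centers = np.array([
    [0.40, 0.05, 0.30, 0.10],  # cluster 0
    [0.02, 0.50, 0.01, 0.35],  # cluster 1
])

# argsort() orders the column indices from lowest to highest score;
# [:, ::-1] reverses each row so the highest-scoring words come first
order = cluster_centers.argsort()[:, ::-1]

# Take the first two indices in each row and look up the matching words
top_two = [vocabulary[order[number, :2]].tolist() for number in range(len(cluster_centers))]
print(top_two)
```

In other words, a cluster's "top words" are the words with the highest average tf-idf score among the documents in that cluster.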


In [31]:
results = pd.DataFrame()
results['text'] = text_titles
results['category'] = km.labels_

In [32]:
results

Unnamed: 0,text,category
0,13_van_buren_1837,1
1,47_nixon_1973,2
2,50_reagan_1985,2
3,53_clinton_1997,2
4,17_pierce_1853,3
5,14_harrison_1841,1
6,56_obama_2009,2
7,25_cleveland_1885,1
8,03_adams_john_1797,1
9,12_jackson_1833,3
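A flat table like this can be hard to scan, so it may help to list the titles under each cluster instead. Here is a minimal sketch in plain Python, using only the ten rows shown above. The variable names are made up for illustration.

```python
from collections import defaultdict

# The ten (title, cluster) pairs shown in the table above
rows = [
    ("13_van_buren_1837", 1), ("47_nixon_1973", 2), ("50_reagan_1985", 2),
    ("53_clinton_1997", 2), ("17_pierce_1853", 3), ("14_harrison_1841", 1),
    ("56_obama_2009", 2), ("25_cleveland_1885", 1), ("03_adams_john_1797", 1),
    ("12_jackson_1833", 3),
]

# Collect the titles that belong to each cluster number
titles_by_cluster = defaultdict(list)
for title, cluster in rows:
    titles_by_cluster[cluster].append(title)

for cluster in sorted(titles_by_cluster):
    print(cluster, titles_by_cluster[cluster])
```

With the full `results` DataFrame, `results.groupby('category')['text'].apply(list)` should give a similar grouping in one line.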


In [33]:
X = tfidf_vector  # already a dense NumPy array (from tfidf_df.values), so .todense() is not needed

## PCA Plot

In [18]:
from sklearn.decomposition import PCA

In [34]:
pca_num_components = 2
reduced_data = PCA(n_components=pca_num_components).fit_transform(tfidf_vector)
# print reduced_data
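A two-dimensional PCA plot can only show part of the variation in the data. As a rough check, if you keep the fitted `PCA` object in a variable (the hypothetical `pca` below), its `explained_variance_ratio_` attribute reports the share of the variance that each component retains. The random toy matrix here is an assumption that stands in for the real tf–idf array.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data: 6 "documents" with 5 "features" each
rng = np.random.default_rng(0)
toy_matrix = rng.random((6, 5))

# Keep the fitted PCA object so we can inspect it afterwards
pca = PCA(n_components=2)
reduced = pca.fit_transform(toy_matrix)

print(reduced.shape)  # one row per document, two columns (x and y)

# Share of the total variance kept by each of the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the two ratios add up to a small number, points that look close together in the plot may still be far apart in the full tf–idf space.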

In [35]:
reduced_df = pd.DataFrame(reduced_data)

In [36]:
reduced_df['title'] = text_titles

In [37]:
reduced_df['cluster'] = clusters

In [38]:
reduced_df = reduced_df.rename(columns= {0: 'x', 1: 'y'})

In [39]:
# Look up the top words for each document's cluster so they can be added to the DataFrame
top_words_column = [top_words_list[cluster] for cluster in reduced_df['cluster']]

In [40]:
reduced_df['top_words'] = top_words_column

In [41]:
reduced_df

Unnamed: 0,x,y,title,cluster,top_words
0,-0.244415,-0.041917,13_van_buren_1837,1,government people shall states constitution pu...
1,0.350447,-0.018783,47_nixon_1973,2,america world freedom new nation people today ...
2,0.299843,0.030772,50_reagan_1985,2,america world freedom new nation people today ...
3,0.384835,-0.029631,53_clinton_1997,2,america world freedom new nation people today ...
4,-0.189791,-0.037642,17_pierce_1853,3,union war government states public great peopl...
5,-0.316962,-0.033491,14_harrison_1841,1,government people shall states constitution pu...
6,0.348495,-0.095925,56_obama_2009,2,america world freedom new nation people today ...
7,-0.218587,0.127601,25_cleveland_1885,1,government people shall states constitution pu...
8,-0.198035,-0.064308,03_adams_john_1797,1,government people shall states constitution pu...
9,-0.24664,-0.129377,12_jackson_1833,3,union war government states public great peopl...


In [42]:
alt.Chart(reduced_df).mark_circle(size=60).encode(
    x='x',
    y='y',
    color=alt.Color('cluster', scale=alt.Scale(scheme='dark2'), type='ordinal'),
    tooltip=['title', 'cluster', 'top_words']
).interactive()

## TSNE Plot

In [218]:
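from sklearn.manifold import TSNE

tsne_num_components = 2
embeddings = TSNE(n_components=tsne_num_components)
# tfidf_vector is the dense tf-idf array that was also passed to PCA above
Y = embeddings.fit_transform(tfidf_vector)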
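Unlike PCA, t-SNE is stochastic, so the layout can change each time the cell is run. Here is a minimal sketch, on random stand-in data, of two `TSNE` parameters that are worth knowing about: `random_state`, which makes the layout repeatable, and `perplexity`, which in recent scikit-learn releases has to be smaller than the number of documents. The toy matrix and the particular values are assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in data: 20 "documents" with 5 "features" each
rng = np.random.default_rng(0)
toy_matrix = rng.random((20, 5))

# perplexity=5 is below the 20 documents; random_state fixes the layout
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
toy_embedding = tsne.fit_transform(toy_matrix)

print(toy_embedding.shape)  # one (x, y) pair per document
```

The default perplexity is 30, which works for the 58 inaugural addresses but would need to be lowered for a much smaller collection.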

In [219]:
tsne_df = pd.DataFrame(Y)

In [220]:
tsne_df['title'] = text_titles

In [221]:
tsne_df['cluster'] = clusters

In [222]:
tsne_df = tsne_df.rename(columns= {0: 'x', 1: 'y'})

In [223]:
# Look up the top words for each document's cluster so they can be added to the DataFrame
top_words_column = [top_words_list[cluster] for cluster in tsne_df['cluster']]

In [224]:
tsne_df['top_words'] = top_words_column

In [228]:
alt.Chart(tsne_df).mark_circle(size=60).encode(
    x='x',
    y='y',
    color=alt.Color('cluster', scale=alt.Scale(scheme='dark2'), type='ordinal'),
    tooltip=['title', 'cluster', 'top_words']
).interactive()

## K Means — NYT Obituaries

In [229]:
directory_path = "../texts/history/NYT-Obituaries/"

Then we're going to use `glob` and `Path` to make a list of all the filepaths in that directory and a list of all the obituary titles.

In [230]:
text_files = glob.glob(f"{directory_path}/*.txt")

In [231]:
text_files

['../texts/history/NYT-Obituaries/1852-Ada-Lovelace.txt',
 '../texts/history/NYT-Obituaries/1870-Robert-E-Lee.txt',
 '../texts/history/NYT-Obituaries/1875-Andrew-Johnson.txt',
 '../texts/history/NYT-Obituaries/1877-Bedford-Forrest.txt',
 '../texts/history/NYT-Obituaries/1880-Lucretia-Mott.txt',
 '../texts/history/NYT-Obituaries/1882-Charles-Darwin.txt',
 '../texts/history/NYT-Obituaries/1885-Ulysses-Grant.txt',
 '../texts/history/NYT-Obituaries/1886-Mary-Ewing-Outerbridge.txt',
 '../texts/history/NYT-Obituaries/1887-Emma-Lazarus.txt',
 '../texts/history/NYT-Obituaries/1888-Louisa-M-Alcott.txt',
 '../texts/history/NYT-Obituaries/1891-P-T-Barnum.txt',
 '../texts/history/NYT-Obituaries/1894-R-L-Stevenson.txt',
 '../texts/history/NYT-Obituaries/1895-Fred-Douglass.txt',
 '../texts/history/NYT-Obituaries/1896-Harriet-Beecher-Stowe.txt',
 '../texts/history/NYT-Obituaries/1900-Nietzsche.txt',
 '../texts/history/NYT-Obituaries/1900-Stephen-Crane.txt',
 '../texts/history/NYT-Obituaries/1901-Benj

In [232]:
text_titles = [Path(text).stem for text in text_files]
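Each obituary title packs a year and a name into one string. If you want them separately, for example to show the year in a tooltip, a minimal sketch is to split each title at its first hyphen. This assumes every filename follows the `YEAR-Name` pattern seen in the filepaths above, and the variable names are made up for illustration.

```python
# A few titles taken from the obituary filenames shown above
example_titles = ["1852-Ada-Lovelace", "1870-Robert-E-Lee", "1900-Nietzsche"]

parsed_titles = []
for title in example_titles:
    # partition() splits only at the first hyphen: year on the left, name on the right
    year, _, name = title.partition("-")
    parsed_titles.append({"year": int(year), "name": name.replace("-", " ")})

print(parsed_titles)
```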

In [233]:
text_titles

['1852-Ada-Lovelace',
 '1870-Robert-E-Lee',
 '1875-Andrew-Johnson',
 '1877-Bedford-Forrest',
 '1880-Lucretia-Mott',
 '1882-Charles-Darwin',
 '1885-Ulysses-Grant',
 '1886-Mary-Ewing-Outerbridge',
 '1887-Emma-Lazarus',
 '1888-Louisa-M-Alcott',
 '1891-P-T-Barnum',
 '1894-R-L-Stevenson',
 '1895-Fred-Douglass',
 '1896-Harriet-Beecher-Stowe',
 '1900-Nietzsche',
 '1900-Stephen-Crane',
 '1901-Benjamin-Harrison',
 '1901-Queen-Victoria',
 '1901-William-McKinley',
 '1902-Elizabeth-Cady-Stanton',
 '1903-Emily-Warren-Roebling',
 '1903-James-M-N-Whistler',
 '1906-Susan-B-Anthony',
 '1907-Qiu-Jin',
 '1908-Cleveland',
 '1909-Geronimo',
 '1909-Sarah-Orne-Jewett',
 '1910-Florence-Nightingale',
 '1910-Tolstoy',
 '1910-William-James',
 '1911-Joseph-Pulitzer',
 '1914-Alfred-Thayer-Mahan',
 '1914-John-Muir',
 '1914-John-P-Holland',
 '1915-B-T-Washington',
 '1915-F-W-Taylor',
 '1916-J-J-Hill',
 '1916-Jack-London',
 '1916-Martian-Theory',
 '1917-Hilaire-G-E-Degas',
 '1919-Anna-H-Shaw',
 '1919-C-J-Walker',
 '1

## Calculate tf–idf

To calculate tf–idf scores for every word, we're going to use scikit-learn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).


Initialize TfidfVectorizer with desired parameters (default smoothing and normalization)

In [234]:
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')

Run TfidfVectorizer on our `text_files`

In [235]:
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

Make a DataFrame out of the resulting tf–idf vector, setting the "feature names" or words as columns and the titles as rows

In [237]:
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())

In [261]:
tfidf_df

Unnamed: 0,00,000,000f,001,006,01,010,021,025,028,...,zrathustra,zuber,zuker,zukor,zukors,zula,zululand,zurich,zvai,zwilich
1852-Ada-Lovelace,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1870-Robert-E-Lee,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1875-Andrew-Johnson,0.0,0.007602,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1877-Bedford-Forrest,0.0,0.018923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1880-Lucretia-Mott,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1882-Charles-Darwin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1885-Ulysses-Grant,0.0,0.053234,0.0,0.00351,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1886-Mary-Ewing-Outerbridge,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1887-Emma-Lazarus,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1888-Louisa-M-Alcott,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

Run K-Means on the tf–idf vectors, this time with 10 clusters

In [262]:
from sklearn.cluster import KMeans

num_clusters = 10

km = KMeans(n_clusters=num_clusters, n_init=10) # 10 was the default in older scikit-learn releases; newer ones default to 'auto', so we set it explicitly

km.fit(tfidf_vector)

# km.labels_ gives you the cluster assignments
clusters = km.labels_.tolist()

In [263]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
words = tfidf_vectorizer.get_feature_names_out()

top_words_list = []
for number in range(num_clusters):
    
    top_ten_words = [words[index] for index in order_centroids[number, :10]]
    top_words_list.append(" ".join(top_ten_words))
    print(f"---\nCluster {number}:\n{top_ten_words}")

---
Cluster 0:
['tennis', 'game', 'ashe', 'runs', 'rockne', 'wills', 'outerbridge', 'golf', 'abbott', 'jones']
---
Cluster 1:
['mr', 'israel', 'soviet', 'sadat', 'jewish', 'pulitzer', 'brezhnev', 'buber', 'said', 'borges']
---
Cluster 2:
['years', 'war', 'time', 'new', 'said', 'life', '000', 'world', 'general', 'work']
---
Cluster 3:
['mr', 'cleveland', 'baseball', 'douglass', 'hemingway', 'rockefeller', 'thorpe', 'catton', 'ochs', 'ruth']
---
Cluster 4:
['sullivan', 'keller', 'murrow', 'mr', 'dempsey', 'mrs', 'miss', 'durocher', 'zaharias', 'macy']
---
Cluster 5:
['mr', 'jazz', 'band', 'music', 'blues', 'armstrong', 'piano', 'basie', 'goodman', 'gillespie']
---
Cluster 6:
['miss', 'jackson', 'years', 'said', 'new', 'ziegfeld', 'jewett', 'hellman', 'balch', 'graham']
---
Cluster 7:
['dr', 'dewey', 'university', 'research', 'negro', 'institute', 'vaccine', 'atomic', 'professor', 'jung']
---
Cluster 8:
['mr', 'president', 'roosevelt', 'said', 'years', 'kennedy', 'hoover', 'mrs', 'state',

In [264]:
results = pd.DataFrame()
results['text'] = text_titles
results['category'] = km.labels_

In [265]:
results

Unnamed: 0,text,category
0,1852-Ada-Lovelace,2
1,1870-Robert-E-Lee,2
2,1875-Andrew-Johnson,8
3,1877-Bedford-Forrest,2
4,1880-Lucretia-Mott,8
5,1882-Charles-Darwin,2
6,1885-Ulysses-Grant,2
7,1886-Mary-Ewing-Outerbridge,0
8,1887-Emma-Lazarus,2
9,1888-Louisa-M-Alcott,6


In [266]:
X = tfidf_vector.toarray()  # a plain NumPy array; recent scikit-learn versions reject the np.matrix that .todense() returns

## PCA Plot

In [267]:
pca_num_components = 2
reduced_data = PCA(n_components=pca_num_components).fit_transform(X)
# print reduced_data

In [268]:
reduced_df = pd.DataFrame(reduced_data)

In [269]:
reduced_df['title'] = text_titles

In [270]:
reduced_df['cluster'] = clusters

In [271]:
reduced_df = reduced_df.rename(columns= {0: 'x', 1: 'y'})

In [272]:
# Look up the top words for each document's cluster so they can be added to the DataFrame
top_words_column = [top_words_list[cluster] for cluster in reduced_df['cluster']]

In [273]:
reduced_df['top_words'] = top_words_column

In [274]:
reduced_df

Unnamed: 0,x,y,title,cluster,top_words
0,-0.097039,-0.024191,1852-Ada-Lovelace,2,years war time new said life 000 world general...
1,0.030303,-0.145161,1870-Robert-E-Lee,2,years war time new said life 000 world general...
2,0.230276,-0.164679,1875-Andrew-Johnson,8,mr president roosevelt said years kennedy hoov...
3,-0.006883,-0.087962,1877-Bedford-Forrest,2,years war time new said life 000 world general...
4,-0.062079,-0.041888,1880-Lucretia-Mott,8,mr president roosevelt said years kennedy hoov...
5,-0.036459,-0.047472,1882-Charles-Darwin,2,years war time new said life 000 world general...
6,0.12722,-0.208108,1885-Ulysses-Grant,2,years war time new said life 000 world general...
7,-0.061829,-0.02892,1886-Mary-Ewing-Outerbridge,0,tennis game ashe runs rockne wills outerbridge...
8,-0.077708,-0.032835,1887-Emma-Lazarus,2,years war time new said life 000 world general...
9,-0.129666,-0.02139,1888-Louisa-M-Alcott,6,miss jackson years said new ziegfeld jewett he...


In [275]:
alt.Chart(reduced_df).mark_circle(size=60).encode(
    x='x',
    y='y',
    color=alt.Color('cluster', scale=alt.Scale(scheme='dark2'), type='ordinal'),
    tooltip=['title', 'cluster', 'top_words']
).interactive()

## TSNE Plot

In [276]:
tsne_num_components = 2
embeddings = TSNE(n_components=tsne_num_components)
Y = embeddings.fit_transform(X)

In [277]:
tsne_df = pd.DataFrame(Y)

In [278]:
tsne_df['title'] = text_titles

In [279]:
tsne_df['cluster'] = clusters

In [280]:
tsne_df = tsne_df.rename(columns= {0: 'x', 1: 'y'})

In [281]:
# Look up the top words for each document's cluster so they can be added to the DataFrame
top_words_column = [top_words_list[cluster] for cluster in tsne_df['cluster']]

In [282]:
tsne_df['top_words'] = top_words_column

In [283]:
alt.Chart(tsne_df).mark_circle(size=60).encode(
    x='x',
    y='y',
    color=alt.Color('cluster', scale=alt.Scale(scheme='dark2'), type='ordinal'),
    tooltip=['title', 'cluster', 'top_words']
).interactive()