In [None]:
# Import all of the things you need to import!

In [2]:
import scipy
import sklearn
import nltk
import pandas as pd



# Homework 14 (or so): TF-IDF text analysis and clustering

Hooray, we kind of figured out how text analysis works! Some of it is still magic, but at least the **TF** and **IDF** parts make a little sense. Kind of. Somewhat.

No, just kidding, we're *professionals* now.

## Investigating the Congressional Record

The [Congressional Record](https://en.wikipedia.org/wiki/Congressional_Record) is more or less a transcript of what happens in Congress every single day. Speeches and all that. A good large source of text data, maybe?

Let's pretend it's totally secret but we just got it leaked to us in a data dump, and we need to check it out. It was leaked from [this page here](http://www.cs.cornell.edu/home/llee/data/convote.html).

In [3]:
# If you'd like to download it through the command line...
!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9607k  100 9607k    0     0  3071k      0  0:00:03  0:00:03 --:--:-- 3072k


In [3]:
# And then extract it through the command line...
!tar -zxf convote_v1.1.tar.gz

You can explore the files if you'd like, but we're going to get the ones from `convote_v1.1/data_stage_one/development_set/`. It's a bunch of text files.

In [4]:
# glob finds files matching a certain filename pattern
import glob

# Give me all the text files
paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')
paths[:5]

['convote_v1.1/data_stage_one/development_set/052_400011_0327014_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327025_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327044_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327046_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_1479036_DON.txt']

In [5]:
len(paths)

702

So great, we have 702 of them. Now let's import them.

In [6]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
speeches_df = pd.DataFrame(speeches)
speeches_df.head()

Unnamed: 0,content,filename,pathname
0,"mr. chairman , i thank the gentlewoman for yie...",052_400011_0327014_DON.txt,convote_v1.1/data_stage_one/development_set/05...
1,"mr. chairman , i want to thank my good friend ...",052_400011_0327025_DON.txt,convote_v1.1/data_stage_one/development_set/05...
2,"mr. chairman , i rise to make two fundamental ...",052_400011_0327044_DON.txt,convote_v1.1/data_stage_one/development_set/05...
3,"mr. chairman , reclaiming my time , let me mak...",052_400011_0327046_DON.txt,convote_v1.1/data_stage_one/development_set/05...
4,"mr. chairman , i thank my distinguished collea...",052_400011_1479036_DON.txt,convote_v1.1/data_stage_one/development_set/05...


In class we had the `texts` variable. For the homework you can just use `speeches_df['content']` to get the same sort of list of texts.

**Take a look at the contents of the first 5 speeches**

In [7]:
speeches_df['content'].head(5)

0    mr. chairman , i thank the gentlewoman for yie...
1    mr. chairman , i want to thank my good friend ...
2    mr. chairman , i rise to make two fundamental ...
3    mr. chairman , reclaiming my time , let me mak...
4    mr. chairman , i thank my distinguished collea...
Name: content, dtype: object

# Doing our analysis

Use the `sklearn` package and a plain boring `CountVectorizer` to get a list of all of the tokens used in the speeches. If it won't list them all, that's ok! Make a dataframe with those terms as columns.

**Be sure to pass the English-language stopword list (`stop_words='english'`) so those words get filtered out**

In [50]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(max_features=100, stop_words='english')
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.stem.porter import PorterStemmer

In [51]:
X = count_vectorizer.fit_transform(speeches_df['content'])

In [52]:
X.toarray()

array([[0, 1, 3, ..., 0, 0, 1],
       [0, 0, 1, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 2, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

Okay, it's **far** too big to even look at. Let's try to get a list of features from a new `CountVectorizer` that only takes the top 100 words.

In [53]:
tophundred_df = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())

In [54]:
tophundred_df

Unnamed: 0,000,11,act,allow,amendment,america,american,amp,association,balance,...,trade,united,urge,vote,want,way,work,year,years,yield
0,0,1,3,0,0,0,3,0,0,0,...,0,1,0,0,1,1,0,0,0,1
1,0,0,1,1,1,0,0,0,0,1,...,0,0,0,1,1,3,0,1,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,0,0,0,0,0,1,0,0,0,1,...,0,1,0,1,1,1,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,2
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,1,0,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0


Now let's push all of that into a dataframe with nicely named columns.

Everyone seems to start their speeches with "mr chairman" - how many speeches are there in total, how many don't mention "chairman", and how many mention neither "mr" nor "chairman"?

In [55]:
mrchairman_df = pd.DataFrame([tophundred_df['mr'], tophundred_df['chairman'], tophundred_df['mr'] + tophundred_df['chairman']], index=["mr", "chairman", "mr + chairman"]).T

In [56]:
mrchairman_df

Unnamed: 0,mr,chairman,mr + chairman
0,2,3,5
1,4,2,6
2,3,2,5
3,3,2,5
4,2,1,3
5,0,0,0
6,1,1,2
7,0,0,0
8,1,1,2
9,1,2,3


In [83]:
num_speeches = len(mrchairman_df)

In [87]:
mrmention_df = mrchairman_df[(mrchairman_df['mr'] > 0)]
mr_mention = len(mrmention_df)

In [88]:
mrorchairmanmention_df = mrchairman_df[(mrchairman_df['mr + chairman'] > 0)]
mrorchair_mention = len(mrorchairmanmention_df)

In [93]:
print("There are",num_speeches,"speeches. Only", num_speeches - mr_mention, "do not mention mr and ", num_speeches - mrorchair_mention, "do not mention mr or chairman")

There are 702 speeches. Only 79 do not mention mr and  76 do not mention mr or chairman
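The cell above filters on `mr` for its first figure, while the question asks about `chairman`. As a hedged sketch of another way to count, boolean masks can answer both parts directly. The `toy` dataframe below is invented for illustration and is not the real count matrix.

```python
import pandas as pd

# Invented counts: one row per speech
toy = pd.DataFrame({"mr": [2, 0, 1, 0], "chairman": [3, 0, 0, 1]})

# Speeches where "chairman" never appears
no_chairman = (toy["chairman"] == 0).sum()

# Speeches where neither word appears (both counts are zero)
neither = ((toy["mr"] == 0) & (toy["chairman"] == 0)).sum()

print("total:", len(toy), "no chairman:", no_chairman, "neither:", neither)
```

Summing a boolean mask counts the `True` values, so no intermediate dataframe is needed.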


In [95]:
tophundred_df.columns

Index(['000', '11', 'act', 'allow', 'amendment', 'america', 'american', 'amp',
       'association', 'balance', 'based', 'believe', 'bipartisan', 'chairman',
       'children', 'china', 'civil', 'colleagues', 'committee', 'congress',
       'country', 'days', 'debate', 'discrimination', 'does', 'education',
       'election', 'elections', 'fact', 'faith', 'federal', 'frivolous',
       'funding', 'gentleman', 'going', 'good', 'government', 'gt', 'head',
       'health', 'help', 'house', 'important', 'issue', 'just', 'know', 'law',
       'legislation', 'let', 'like', 'lt', 'make', 'member', 'members',
       'million', 'money', 'mr', 'nation', 'national', 'nbsp', 'need', 'new',
       'order', 'organizations', 'people', 'percent', 'policy', 'president',
       'process', 'program', 'programs', 'provide', 'religious', 'right',
       'rights', 'rule', 'rules', 'say', 'school', 'services', 'speaker',
       'start', 'state', 'states', 'support', 'teachers', 'thank', 'think',
       'time

What is the index of the speech that is the most thankful, a.k.a. includes the word 'thank' the most times?

In [96]:
tophundred_df['thank'].sort_values(ascending=False).head(1) # thank is not in the top 100 words unless you remove stop words

577    9
Name: thank, dtype: int64
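Sorting and taking the first row works fine. If only the index is needed, `Series.idxmax()` is a possible shortcut: it returns the index label of the first maximum. The series below is made up for illustration.

```python
import pandas as pd

# Made-up 'thank' counts for four speeches
thank_counts = pd.Series([1, 0, 9, 3], name="thank")

most_thankful = thank_counts.idxmax()   # index label of the largest count
print(most_thankful, thank_counts.max())
```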

If I'm searching for `China` and `trade`, what are the top 3 speeches to read according to the `CountVectorizer`?

In [106]:
chinatrade_df = pd.DataFrame([tophundred_df['china'] + tophundred_df['trade']], index=["China + trade"]).T

In [110]:
chinatrade_df['China + trade'].sort_values(ascending=False).head(3)

379    92
399    36
345    27
Name: China + trade, dtype: int64

Now what if I'm using a `TfidfVectorizer`?

In [121]:
porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

from sklearn.feature_extraction.text import TfidfVectorizer
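# use_idf=False with norm='l1' gives plain term frequencies (each non-empty row sums to 1), not full TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False, norm='l1', max_features = 100)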
X = tfidf_vectorizer.fit_transform(speeches_df['content'])
chinatrade_tfidfpd = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())

In [125]:
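# NOTE: this rebuilds the frame from tophundred_df (raw counts), which overwrites the
# TfidfVectorizer result from the previous cell, so the ranking below repeats the CountVectorizer one.
chinatrade_tfidfpd = pd.DataFrame([tophundred_df['china'] + tophundred_df['trade']], index=["China + trade"]).T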

In [128]:
# chinatrade_tfidfpd

In [126]:
chinatrade_tfidfpd['China + trade'].sort_values(ascending=False).head(3)

379    92
399    36
345    27
Name: China + trade, dtype: int64

**What's the content of the speeches?** Here's a way to get them:

In [129]:
# index 0 is the first speech, which was the first one imported.
paths[0]

'convote_v1.1/data_stage_one/development_set/052_400011_0327014_DON.txt'

In [130]:
# Pass that into 'cat' using { } which lets you put variables in shell commands
# that way you can pass the path to cat
!cat {paths[0]}

mr. chairman , i thank the gentlewoman for yielding me this time . 
my good colleague from california raised the exact and critical point . 
the question is , what happens during those 45 days ? 
we will need to support elections . 
there is not a single member of this house who has not supported some form of general election , a special election , to replace the members at some point . 
but during that 45 days , what happens ? 
the chair of the constitution subcommittee says this is what happens : martial law . 
we do not know who would fill the vacancy of the presidency , but we do know that the succession act most likely suggests it would be an unelected person . 
the sponsors of the bill before us today insist , and i think rightfully so , on the importance of elections . 
but to then say that during a 45-day period we would have none of the checks and balances so fundamental to our constitution , none of the separation of powers , and that the presidency would be filled b

**Now search for something else!** Try another two terms that might show up. `elections` and `chaos`? Whatever you think might be interesting.
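One thing to watch for: with `max_features=100`, a rarer word such as `chaos` may not be a column at all, and indexing it would raise a `KeyError`. A possible helper for repeated searches is sketched below. `top_speeches` and the toy dataframe are hypothetical names invented here, not part of the homework.

```python
import pandas as pd

def top_speeches(term_df, terms, n=3):
    """Rank rows by the summed score of `terms`, skipping terms that are not columns."""
    present = [t for t in terms if t in term_df.columns]
    return term_df[present].sum(axis=1).sort_values(ascending=False).head(n)

# Invented document-term counts; note there is no 'chaos' column
toy_term_df = pd.DataFrame({
    "elections": [0, 3, 1, 0],
    "house": [1, 0, 2, 5],
})

best = top_speeches(toy_term_df, ["elections", "chaos"], n=2)
print(best)
```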

# Enough of this garbage, let's cluster

Using a **simple counting vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

Using a **term frequency vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

Using a **term frequency inverse document frequency vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

In [152]:
# Initialize a vectorizer
vectorizer = TfidfVectorizer(use_idf=True, tokenizer=stemming_tokenizer, stop_words='english', max_features =8)
X = vectorizer.fit_transform(speeches_df['content'])

In [153]:
X

<702x8 sparse matrix of type '<class 'numpy.float64'>'
	with 2593 stored elements in Compressed Sparse Row format>

In [154]:
pd.DataFrame(X.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.125349,0.144859,0.000000,0.074980,0.071464,0.000000,0.962678,0.160706
1,0.116806,0.179982,0.000000,0.279478,0.000000,0.176796,0.897067,0.199671
2,0.000000,0.146615,0.000000,0.170749,0.000000,0.000000,0.974345,0.000000
3,0.000000,0.187550,0.000000,0.218422,0.000000,0.000000,0.934786,0.208067
4,0.207346,0.159747,0.000000,0.248056,0.236425,0.000000,0.884677,0.177222
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,0.000000,0.187830,0.000000,0.145832,0.277988,0.000000,0.832161,0.416754
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8,0.000000,0.594065,0.000000,0.461234,0.000000,0.000000,0.000000,0.659052
9,0.000000,0.580268,0.569997,0.225261,0.429398,0.000000,0.321353,0.000000


In [155]:
from sklearn.cluster import KMeans
number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=8, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [156]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: mr chairman time thi amend
Cluster 1: thi mr time amend chairman
Cluster 2: time mr chairman thi amend
Cluster 3: start head thi amend mr
Cluster 4: time thi start s mr
Cluster 5: mr thi time start s
Cluster 6: amend mr chairman thi time
Cluster 7: s thi mr amend time


In [157]:
km.labels_

array([1, 1, 1, 1, 1, 4, 1, 4, 2, 7, 1, 1, 1, 7, 0, 1, 1, 2, 5, 2, 1, 2, 1,
       6, 5, 2, 2, 5, 6, 7, 2, 2, 0, 1, 1, 0, 7, 2, 6, 0, 2, 1, 1, 6, 6, 2,
       0, 6, 0, 4, 1, 0, 6, 0, 1, 1, 1, 1, 0, 7, 0, 2, 2, 7, 4, 7, 1, 1, 0,
       1, 4, 0, 0, 4, 1, 7, 0, 0, 6, 6, 0, 0, 6, 0, 4, 1, 2, 2, 2, 2, 2, 7,
       4, 4, 2, 0, 2, 2, 6, 2, 7, 2, 2, 1, 1, 0, 0, 2, 1, 2, 6, 0, 2, 5, 2,
       2, 1, 0, 7, 7, 4, 0, 7, 1, 1, 2, 1, 7, 1, 1, 2, 2, 1, 1, 4, 2, 7, 1,
       1, 0, 6, 6, 2, 6, 6, 7, 1, 6, 6, 7, 6, 0, 5, 1, 2, 1, 1, 5, 7, 6, 7,
       1, 4, 6, 1, 7, 7, 6, 1, 0, 6, 1, 6, 6, 6, 1, 5, 7, 5, 5, 1, 1, 6, 7,
       6, 4, 0, 7, 1, 6, 5, 5, 6, 2, 6, 6, 7, 1, 2, 0, 7, 1, 1, 7, 1, 1, 0,
       5, 6, 1, 7, 5, 6, 0, 4, 1, 5, 5, 1, 5, 5, 2, 5, 5, 1, 2, 5, 1, 1, 6,
       1, 1, 6, 1, 7, 2, 2, 6, 6, 0, 5, 6, 2, 1, 4, 0, 6, 2, 2, 6, 6, 6, 0,
       6, 6, 0, 6, 6, 0, 0, 1, 1, 6, 5, 0, 0, 6, 6, 0, 0, 0, 0, 7, 6, 2, 2,
       2, 6, 4, 2, 6, 2, 4, 2, 2, 2, 0, 6, 6, 0, 7, 1, 6, 1, 6, 1, 2, 2, 2,
       5, 2,

In [159]:
speeches_df['content']

0      mr. chairman , i thank the gentlewoman for yie...
1      mr. chairman , i want to thank my good friend ...
2      mr. chairman , i rise to make two fundamental ...
3      mr. chairman , reclaiming my time , let me mak...
4      mr. chairman , i thank my distinguished collea...
5            i yield to the gentleman from illinois . \n
6      mr. chairman , reclaiming my time , the fact i...
7            i yield to the gentleman from illinois . \n
8      mr. chairman , reclaiming my time , i would be...
9      mr. chairman , i do not have it on the top of ...
10     okay . \nso we do not have that answer . \nlet...
11     mr. chairman , i would suggest that with these...
12     mr. chairman , i yield myself such time as i m...
13     mr. chairman , i yield myself the balance of t...
14          mr. chairman , i demand a recorded vote . \n
15     mr. chairman , i appreciated the gentleman fro...
16     mr. chairman , i am pleased to join the gentle...
17     mr. chairman , i thank t

In [160]:
results = pd.DataFrame()
results['content'] = speeches_df['content']
results['category'] = km.labels_
results

Unnamed: 0,content,category
0,"mr. chairman , i thank the gentlewoman for yie...",1
1,"mr. chairman , i want to thank my good friend ...",1
2,"mr. chairman , i rise to make two fundamental ...",1
3,"mr. chairman , reclaiming my time , let me mak...",1
4,"mr. chairman , i thank my distinguished collea...",1
5,i yield to the gentleman from illinois . \n,4
6,"mr. chairman , reclaiming my time , the fact i...",1
7,i yield to the gentleman from illinois . \n,4
8,"mr. chairman , reclaiming my time , i would be...",2
9,"mr. chairman , i do not have it on the top of ...",7


In [161]:
vectorizer.get_feature_names()

['amend', 'chairman', 'head', 'mr', 's', 'start', 'thi', 'time']
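The `thi` and `s` features are probably side effects of the custom tokenizer: scikit-learn applies the stopword list after tokenizing, so a stemmed form like `thi` (from "this") no longer matches the stopword "this", and a stray `s` left over from "'s" is not in the list either. The sketch below demonstrates the effect with `crude_stem`, a deliberately simple, hypothetical stand-in for `PorterStemmer` that only strips a trailing "s".

```python
import re
import warnings
from sklearn.feature_extraction.text import CountVectorizer

def crude_stem(word):
    # Hypothetical stand-in for PorterStemmer: strip one trailing "s" from longer words
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def crude_stemming_tokenizer(text):
    # Same cleanup regex as stemming_tokenizer above, then the crude stemmer
    words = re.sub(r"[^A-Za-z0-9\-]", " ", text).lower().split()
    return [crude_stem(w) for w in words]

docs = ["this amendment helps the gentleman 's district"]

with warnings.catch_warnings():
    # scikit-learn warns that the stopword list may not match a custom tokenizer
    warnings.simplefilter("ignore")
    vec = CountVectorizer(tokenizer=crude_stemming_tokenizer, stop_words="english")
    vec.fit(docs)

terms = sorted(vec.vocabulary_)
print(terms)
```

"the" is removed because it survives tokenizing unchanged, while "this" slips through as `thi`. Filtering stopwords inside the tokenizer before stemming is one possible way around it.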

In [167]:
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
df


Unnamed: 0,amend,chairman,head,mr,s,start,thi,time
0,0.125349,0.144859,0.000000,0.074980,0.071464,0.000000,0.962678,0.160706
1,0.116806,0.179982,0.000000,0.279478,0.000000,0.176796,0.897067,0.199671
2,0.000000,0.146615,0.000000,0.170749,0.000000,0.000000,0.974345,0.000000
3,0.000000,0.187550,0.000000,0.218422,0.000000,0.000000,0.934786,0.208067
4,0.207346,0.159747,0.000000,0.248056,0.236425,0.000000,0.884677,0.177222
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,0.000000,0.187830,0.000000,0.145832,0.277988,0.000000,0.832161,0.416754
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8,0.000000,0.594065,0.000000,0.461234,0.000000,0.000000,0.000000,0.659052
9,0.000000,0.580268,0.569997,0.225261,0.429398,0.000000,0.321353,0.000000


**Which one do you think works the best?**

# Harry Potter time

I have a scraped collection of Harry Potter fanfiction at https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip.

I want you to read them in, vectorize them and cluster them. Use this process to find out **the two types of Harry Potter fanfiction**. What is your hypothesis?

In [29]:
!curl -LO https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   149  100   149    0     0    106      0  0:00:01  0:00:01 --:--:--   106
100 9226k  100 9226k    0     0  2668k      0  0:00:03  0:00:03 --:--:-- 7714k
