<h2>Text Analysis with Scikit-learn</h2>

Thanks to everyone for coming today. Excited to be here and talk about a neat topic I really enjoy, which is text analysis with Python, specifically with scikit-learn. While I have been able to use this at work occassionally, I am definitely not an Natural Language Processing expert. However, one of the cool things about the python scientific computing ecosystem is that you can *not* be an expert and still do some neat stuff.

Title slide. Contact info. github info.

::get slides from work::

Text analysis is the extraction of information from unstructured (in most cases) text. This is difficult because computer only understand numbers. So in order to analyze text with computers, we have to do some kind of processing to it. There are two basic ways to approach text analysis: look for patterns in the text (these words often appear together/near each other (multi-word term), these words are often used in the same context but never appear together (synonyms), etc.); or you can convert the text to numbers and do math on them. 

The technique we're going to discuss today is called Term Frequency-Inverse Document Frequency (TF-IDF), and it is a fairly common method in the area of Information Retrieval (IR) for comparing and searching documents. What we're going to end up doing is taking our big set of documents, called the corpus, and these can be any type of document: news articles, emails, online comments (although the longer the documents the better the results), and we're going to turn them into matrices (number), and then we're going to do some pretty basic math on them. That math is going to help us define what's called a distance metric between any two documents, which we can then use for comparisions and searches. Ok, but first, let's think about this problem intuitively. 

Let's say we have three documents. Our documents are just going to be sentences. 
<li>The dog is jumping over the fence.</li>
<li>The dog is climbing up the fence.</li>
<li>The cat is sitting on the window.</li>
So how similar/disimilar are these sentences? Well, that's a little hard to answer right now because that is trying to quantify something about the text. What about a difference question: which two sentences are the most similar? The first two?  Why? Because they have the same subject? What if you didn't understand English? You'd probably still say the first two. They share a lot of the same words, so we can think of them as being similar in their contents based on the inclusion of identical words. 
<br><br>

So this is good. We could define a similarity metric as the number of words two sentences have in common. But is that good enough? What about these sentences?
<li>A dog was jumping over this fence and then ran over to the next yard so very quickly.</li>
<li>A cat was leaning over this edge and then jumped over to the next table so very quickly.</li>
They have many more words in common than our first two sentences, so are they more similar? Probably not. But this is function of the length of the sentences, so let's try to normalize by dividing by the number of words in the sentence. This will give us a percent of similarity. Well it turns out these longer sentences still get a higher similarity score. So what's the problem? Intiutively we know the first two sentences are similar, but it seems that using numbers suggests the latest two sentences are more similar. So what's the problem? The problem is that we're considering all the words equal, when, for purposes of distiguishing sentences from one another, they are not equal. Why not? 
<li>A dog was jumping over this fence and then ran over to the next yard so very quickly.</li>
<li>A cat was leaning over this fence and then jumped over to the next yard so very quickly.</li>
<li>A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.</li>
<li>A bird was peering over this branch and then flew over to the next tree.</li>
If we were going to distiguish between these sentences, any word that appears in all the sentences is worthless. It doesn't help us to determine if any pair of sentences is similar or disimilar because it's common to all of them. It's almost a characteristic of a sentence in general, as opposed to a charateristics of a particular sentence. So we don't want to use words that appear in all sentences. We either throw them out or weight them very low (more on that later). What about words that appear in a lot but not all sentences? "so very quickly" appears in all but one sentence....
Somewhat useful, but not super useful.
Tree, branch, fence, yard: appear in some, but not all or most, so they are usefula nd will get weighted the highest.
This is the Inverse Document Frequency part of the TF-IDF. The more documents a word appears in the, the less useful it is in distinguishing between the documents, so it get weighted less. The Inverse essentially means we divide by the number of documents the word (or term) appears in. 
For the first part, the Term Frequency, we're simply going to calculate the frequency of each term in each document by counting the number of times it appears in the document and dividing by the total number of words in the documnets. The assumption here is that the more often a word appears in a document, the more closely it represents the topic of the document.


Ok, so let's actually calculate some TF-IDFs by hand, and then we'll do it in Python.<br>
The first thing we do it a list of all the unique words in the corpus.
::example with a few sentences::
::create list of all the words. Count number of docs each word appear in. create vector for each doc. Calculate Euclidean distance::
Show distances between pairs of sentences. Show the dog and cat are similar and the squirrel and bird are similar.




<div style="margin-top:200px"/>

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pandas as pd

###### Example 1: Short Sentences

In [2]:
short_sentences = ['The dog is jumping over the fence.', 
                   'The dog is climbing up the fence.',
                   'The cat is sitting on the window.']

In [3]:
# Create an instance of the CountVectorizer.
cv = CountVectorizer()

# Create a count matrix from the list of documents.
count_matrix = cv.fit_transform(short_sentences)

In [4]:
# Create a dataframe out of the count matrix with labels and columns names for easier reading.
short_sent_df = pd.DataFrame(count_matrix.todense(), index=short_sentences, columns=sorted(cv.vocabulary_))
short_sent_df

Unnamed: 0,cat,climbing,dog,fence,is,jumping,on,over,sitting,the,up,window
The dog is jumping over the fence.,0,0,1,1,1,1,0,1,0,2,0,0
The dog is climbing up the fence.,0,1,1,1,1,0,0,0,0,2,1,0
The cat is sitting on the window.,1,0,0,0,1,0,1,0,1,2,0,1


<div style="margin-top:200px"/>

###### Example 2: Long sentences

In [14]:
long_sentences = ['A dog was jumping over this fence and then ran over to the next yard so very quickly.',
                  'A cat was leaning over this fence and then jumped over to the next yard so very quickly.',
                  'A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.',
                  'A bird was peering over this branch and then flew over to the next tree.']

In [63]:
# Create an instance of the CountVectorizer.
cv = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')

# Create a count matrix from the list of documents.
long_count_matrix = cv.fit_transform(long_sentences)

Most of the time, you'll use the default value for this parameter, but I'm using non-default to make this example match our basic explanation above.

In [74]:
# Create a dataframe out of the count matrix with labels and columns names for easier reading.
long_sent_df = pd.DataFrame(long_count_matrix.todense(), index=[s.split()[1] for s in long_sentences], columns=sorted(cv.vocabulary_))
long_sent_df

Unnamed: 0,a,and,bird,branch,cat,dog,fence,flew,jumped,jumping,...,so,squirrel,the,then,this,to,tree,very,was,yard
dog,1,1,0,0,0,1,1,0,0,1,...,1,0,1,1,1,1,0,1,1,1
cat,1,1,0,0,1,0,1,0,1,0,...,1,0,1,1,1,1,0,1,1,1
squirrel,1,1,0,1,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
bird,1,1,1,1,0,0,0,1,0,0,...,0,0,1,1,1,1,1,0,1,0


In [75]:
# Create an instance of a TfidfTransformer.
tfidf = TfidfTransformer(norm='l1', use_idf=False, smooth_idf=False)

# Created a weighted matrix.
weighted_matrix = tfidf.fit_transform(long_count_matrix)

Most of the time, you'll use the default values for these three parameters, but I'm using non-defaults to make this example match our basic explanation above.

In [76]:
weighted_matrix

<4x28 sparse matrix of type '<type 'numpy.float64'>'
	with 65 stored elements in Compressed Sparse Row format>

In [77]:
# Create a dataframe out of the count matrix with labels and columns names for easier reading.
long_sent_weighted_df = pd.DataFrame(weighted_matrix.todense(), index=[s.split()[1] for s in long_sentences], columns=sorted(cv.vocabulary_))
long_sent_weighted_df

Unnamed: 0,a,and,bird,branch,cat,dog,fence,flew,jumped,jumping,...,so,squirrel,the,then,this,to,tree,very,was,yard
dog,0.055556,0.055556,0.0,0.0,0.0,0.055556,0.055556,0.0,0.0,0.055556,...,0.055556,0.0,0.055556,0.055556,0.055556,0.055556,0.0,0.055556,0.055556,0.055556
cat,0.055556,0.055556,0.0,0.0,0.055556,0.0,0.055556,0.0,0.055556,0.0,...,0.055556,0.0,0.055556,0.055556,0.055556,0.055556,0.0,0.055556,0.055556,0.055556
squirrel,0.055556,0.055556,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,...,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.055556,0.0
bird,0.066667,0.066667,0.066667,0.066667,0.0,0.0,0.0,0.066667,0.0,0.0,...,0.0,0.0,0.066667,0.066667,0.066667,0.066667,0.066667,0.0,0.066667,0.0


Above we can see each word is scored with its normalized frequency in each sentence. This ignores the frequency of the word in the corpus as a whole, which would indicate its importance in distinguishing documents from each other.

'quickly' appears in three of the four sentences, so it has a score of 0 for the last one. The other three scores are equal because 'quickly' appears the same number of times (1) in each of those sentences, and they all have the same length.

In [78]:
long_sent_weighted_df['quickly']

dog         0.055556
cat         0.055556
squirrel    0.055556
bird        0.000000
Name: quickly, dtype: float64

In [80]:
# Print the number of words in each sentence.
for s in long_sentences:
    print(s.split()[1] + ' ' + str(len(s.split())))

dog 18
cat 18
squirrel 18
bird 15


In [79]:
# Score for 'a' in the last sentence.
1/15.0

0.06666666666666667

In [82]:
# Score for 'quickly' in the first three sentences.
1/18.0

0.05555555555555555

<div style="margin-top:200px"/>

###### Example 3: An interesting one

###### Example 3: An interesting one

In [46]:
# Load song lyrics dataset.
import json
song_lyrics_2 = json.load(open('data\song_lyrics_2.json','rt'), encoding='utf8')

###### Need to include this?

In [154]:
a =[tup for tup in song_lyrics_2 if 'Piece Of My Heart Lyrics  Tara Kemp'==tup[0][2]]

In [155]:
song_lyrics_2.index(a[0])

363

In [152]:
song_lyrics_2[0]

[[u'http://www.songlyrics.com/news/top-songs/',
  u'Maria Muldaur',
  u'Midnight At The Oasis Lyrics  Maria Muldaur'],
 u"Midnight at the oasis\nSend your camel to bed\nShadows paintin' our faces\nTraces of romance in our heads\nHeaven's holdin' a half-moon\nShinin' just for us\nLet's slip off to a sand dune\nReal soon and kick up a little dust\nCome on, Cactus is our friend\nHe'll point out the way\nCome on, till the evenin' ends\nTill the evenin' ends\nYou don't have to answer\nThere's no need to speak\nI'll be your belly dancer\nPrancer and you can be my sheik\nI know your daddy's a sultan\nA nomad known to all\nWith fifty girls to attend him, they all send him\nJump at his beck and call\nBut you won't need no harem, honey\nWhen I'm by your side\nAnd you won't need no camel, no no\nWhen I take you for a ride\nCome on, Cactus is our friend\nHe'll point out the way\nCome on, till the evenin' ends\nTill the evenin' ends\nMidnight at the oasis\nSend your camel to bed\nShadows paintin' o

In [47]:
song_lyrics ={tuple(tup[0]):tup[1] for tup in song_lyrics_2[2000:]}

In [48]:
# Split the titles and lyrics into two lists.
titles, lyrics = zip(*song_lyrics.items())

In [49]:
# Create an instance of the CountVectorizer.
cv = CountVectorizer()

# Create a count matrix from the list of documents.
lyrics_matrix = cv.fit_transform(lyrics)

This matrix is too big to make dense, so no dataframe.

In [50]:
# Create an instance of a TfidfTransformer.
tfidf = TfidfTransformer()

# Created a weighted matrix.
weighted_matrix = tfidf.fit_transform(lyrics_matrix)

In [51]:
weighted_matrix

<3093x17052 sparse matrix of type '<type 'numpy.float64'>'
	with 302992 stored elements in Compressed Sparse Row format>

In [52]:
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np

In [53]:
# Compute distance matrix.
distance_matrix = pairwise_distances(lyrics_matrix)

In [54]:
def get_closest_pairs(dist_mat, names, num=10, ignore_zeros=False):
    """Finds pairs with closest non-zero distances."""
    dist_mat = np.copy(dist_mat)
    # Set zero values to infinity so they are ignore in the partitioning.
    dist_mat[np.triu_indices_from(dist_mat)] = np.inf
    if ignore_zeros:
        dist_mat[dist_mat == 0] = np.inf
    # Unravel the matrix indices and partition num lowest numbers.
    unr_index = np.unravel_index(dist_mat.argpartition(num, axis=None), dist_mat.shape)
    # Get document names for pairs.
    dist_pairs = [(names[unr_index[0][i]],        # First doc
                   names[unr_index[1][i]],        # Second doc
                   dist_mat[unr_index[0][i], unr_index[1][i]])   # Distance between docs
                   for i in xrange(num)]                # First num indices
    return sorted(dist_pairs, key=lambda tup: tup[2])

In [55]:
# Find the 10 closests 
for tup in get_closest_pairs(distance_matrix, titles, num=10):
    print(str(tup[0][2]) + ', ' + str(tup[1][2]) + '  :  ' + str(tup[2]))

Zorba The Greek Lyrics  Herb Alpert and The Tijuana Brass, Asia Minor Lyrics  Kokomo  :  0.0
Soul Twist Lyrics  King Curtis, No Matter What Shape (Your Stomach Is In) Lyrics  T-Bones  :  0.0
Soul Twist Lyrics  King Curtis, Love Is Blue Lyrics  Paul Mauriat  :  0.0
Zorba The Greek Lyrics  Herb Alpert and The Tijuana Brass, Jungle Fever Lyrics  Chakachas  :  0.0
Soul Twist Lyrics  King Curtis, Wheels Lyrics  String-a-longs  :  0.0
Zorba The Greek Lyrics  Herb Alpert and The Tijuana Brass, Java Lyrics Al Hirt  :  0.0
Theme From  Lyrics Rhythm Heritage, Wild Weekend Lyrics  Rebels  :  0.0
Soul Twist Lyrics  King Curtis, A Swingin' Safari Lyrics  Billy Vaughn  :  0.0
Soul Twist Lyrics  King Curtis, Last Night Lyrics  Mar-keys  :  0.0
Soul Twist Lyrics  King Curtis, Love Theme From Romeo And Juliet Lyrics Henry Mancini & His Orchestra  :  0.0


*** This may not be true anymore***

Here we see a number of songs appear in the dataset more than once, sometimes with different names. In fact these songs appear in the top 100 songs from multiple years, which is why they show up more than one. Because they titles are slightly different, the dict created new entries instead of over-riding the values of an existing key.

So now let's ignore all pairs with a distance of 0.

In [56]:
# Find the 10 closests 
closest = get_closest_pairs(distance_matrix, titles, num=10, ignore_zeros=True)
for tup in closest:
    print(str(tup[0][2]) + ', ' + str(tup[1][2]) + '  :  ' + str(tup[2]))

Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel, Soul Twist Lyrics  King Curtis  :  1.0
Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel, Star Wars (Main Title) Lyrics London Symphony Orchestra  :  1.0
Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel, Zorba The Greek Lyrics  Herb Alpert and The Tijuana Brass  :  1.0
Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel, The Rockford Files Lyrics  Mike Post  :  1.0
Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel, Cotton Candy Lyrics Al Hirt  :  1.0
Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel, Teen Beat Lyrics  Sandy Nelson  :  1.0
Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel, Wild Weekend Lyrics  Rebels  :  1.0
Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel, Bongo Rock Lyrics  Preston Epps  :  1.0
Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel, Mexico Lyrics  Bob Moore  :  1.0
Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel, Asia Minor Lyrics  Kokomo  :  1.0


So these are our closest matches. Let's see what they look like.

In [57]:
# Print out closest matching pairs and their lyrics
for i in xrange(len(closest)):
    title1, title2 = closest[i][0], closest[i][1]
    print(str(title1) + ' : ' + lyrics[titles.index(title1)])
    print(str(title2) + ' : ' + lyrics[titles.index(title2)])
    print('')

(u'http://www.songlyrics.com/news/top-songs/', u'Eric Weissberg and Steve Mandel', u'Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel') : Instrumental
**Instrumental**
(u'http://www.songlyrics.com/news/top-songs/', u'King Curtis', u'Soul Twist Lyrics  King Curtis') : Instrumental

(u'http://www.songlyrics.com/news/top-songs/', u'Eric Weissberg and Steve Mandel', u'Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel') : Instrumental
**Instrumental**
(u'http://www.songlyrics.com/news/top-songs/', u'London Symphony Orchestra', u'Star Wars (Main Title) Lyrics London Symphony Orchestra') : Instrumental

(u'http://www.songlyrics.com/news/top-songs/', u'Eric Weissberg and Steve Mandel', u'Dueling Banjos Lyrics  Eric Weissberg and Steve Mandel') : Instrumental
**Instrumental**
(u'http://www.songlyrics.com/news/top-songs/', u'Herb Alpert and The Tijuana Brass', u'Zorba The Greek Lyrics  Herb Alpert and The Tijuana Brass') : Instrumental

(u'http://www.songlyrics.com/news/top-songs/', u

This is another instance where TFIDF has found something we didn't expect. Let's remove these songs from our dataset.

In [58]:
to_remove = [x for x in lyrics if x.startswith('We do not have') 
                                or x.startswith('[Instrumental]')
                                or x.startswith('Instrumental')
                                or x.startswith('Sorry, we have no')]
to_remove

[u'Sorry, we have no Bobby Sherman - Little woman lyrics at the moment.\nPlease check the spelling and try again to\nsearch Bobby Sherman Little woman lyrics\nvar GOOG_FIXURL_LANG = "en";var GOOG_FIXURL_SITE = "http://www.songlyrics.com/";',
 u'Sorry, we have no Huey Lewis and The News - I want a new drug lyrics at the moment.\nPlease check the spelling and try again to\nsearch Huey Lewis and The News I want a new drug lyrics\nvar GOOG_FIXURL_LANG = "en";var GOOG_FIXURL_SITE = "http://www.songlyrics.com/";',
 u'[Instrumental]',
 u'Instrumental intro to "Venus"\nSugar, ah honey honey\nYou are my candy girl and you got me wantin\' you\nHoney, ah sugar sugar\nYou are my candy girl and you got me wantin\' you\n1- 2- 3- 4\nThis happened once before when I came to your door\nNo reply\nThey said it wasn\'t you, but I saw you peak through your window\nYou know, if you break my heart I\'ll go\nBut I\'ll be back again\n\'Cause I told you once before goodbye\nBut I came back again\nAsked the girl

In [131]:
# Filter out songs that do not have lyrics in the dataset.
song_lyrics_filtered = {k:v for k,v in song_lyrics.items() if not (v.startswith('We do not have')
                                                                   or v.startswith('Sorry, we have no')
                                                                   or 'Instrumental' in v[:20])}

#### An interesting example: Attempt #2

In [132]:
# Split the titles and lyrics into two lists.
titles, lyrics = zip(*song_lyrics_filtered.items())

In [133]:
# Create an instance of the CountVectorizer.
cv = CountVectorizer()

# Create a count matrix from the list of documents.
lyrics_matrix = cv.fit_transform(lyrics)

In [134]:
# Create an instance of a TfidfTransformer.
tfidf = TfidfTransformer()

# Created a weighted matrix.
weighted_matrix = tfidf.fit_transform(lyrics_matrix)

In [135]:
# Make the matrix memory footprint smaller by changing the dtype.
weighted_matrix = weighted_matrix.astype(np.float16)

In [136]:
weighted_matrix

<3017x17034 sparse matrix of type '<type 'numpy.float16'>'
	with 302187 stored elements in Compressed Sparse Row format>

In [137]:
# Compute distance matrix.
distance_matrix = pairwise_distances(lyrics_matrix)

In [138]:
# Find the 10 closests 
closest = get_closest_pairs(distance_matrix, titles, num=20, ignore_zeros=True)
for tup in closest:
    print(str(tup[0][2]) + ', ' + str(tup[1][2]) + '  :  ' + str(tup[2]))

This Girl's In Love With You Lyrics  Dionne Warwick, This Guy's In Love With You Lyrics  Herb Alpert  :  2.82842712475
A Groovy Kind Of Love Lyrics  Phil Collins, A Groovy Kind Of Love Lyrics The Mindbenders  :  3.16227766017
We Got The Beat Lyrics  Go-Go's, We Got The Beat Lyrics GLEE Cast  :  4.0
Sealed With A Kiss Lyrics  Brian Hyland, Sealed With A Kiss Lyrics  Bobby Vinton  :  4.35889894354
Ooh Baby Baby Lyrics Linda Ronstadt, Ooh Baby Baby Lyrics The Miracles  :  5.74456264654
(Oh) Pretty Woman Lyrics  Van Halen, Oh, Pretty Woman Lyrics Roy Orbison  :  6.16441400297
Walk This Way Lyrics Aerosmith, Walk This Way Lyrics  Run-D.M.C.  :  7.07106781187
All By Myself Lyrics  Celine Dion, All By Myself Lyrics Eric Carmen  :  7.14142842854
Puppy Love Lyrics  Donny Osmond, Puppy Love Lyrics  Paul Anka  :  7.61577310586
Sukiyaki Lyrics  4 P.M., Sukiyaki Lyrics  A Taste Of Honey  :  7.68114574787
Oh Girl Lyrics  Chi-Lites, Oh Girl Lyrics  Paul Young  :  8.60232526704
Special Lady Lyrics  Ra

In [139]:
from itertools import izip_longest

In [140]:
for i in xrange(len(closest)):
    title1, title2 = closest[i][0], closest[i][1]
    two_cols = izip_longest(lyrics[titles.index(title1)].encode('utf8').split('\n'), 
                            lyrics[titles.index(title2)].encode('utf8').split('\n'), fillvalue='')
    print('____{0:46} | ____{1}'.format(title1[2], title2[2]))
    for tup in two_cols:
        #print('{0:50} | {1}'.format(*map(tup))
        print('{0:50} | {1}'.format(*tup))
    print('')

____This Girl's In Love With You Lyrics  Dionne Warwick | ____This Guy's In Love With You Lyrics  Herb Alpert
You see this girl                                  | You see this guy, this guy's in love with you
This girl's in love with you                       | Yes, I'm in love who looks at you the way I do
Yes I'm in love                                    | When you smile I can tell it know each other very well
Who looks at you the way I do                      | How can I show you I'm glad? I got to know you
When you smile I can tell                          | 'Cause I've heard some talk they say you think I'm fine
It know each other very well                       | This guy's in love and what I'd do to make you mine
How can I show you                                 | Tell me, now, is it so? Don't let me be the last to know
I'm glad I got to know you 'cause                  | My hands are shakin', don't let my heart keep breaking
I've heard some talk                               

In [141]:
from collections import Counter

In [142]:
# Find songs that appear most often in closest pairs.
closest_50 = get_closest_pairs(distance_matrix, titles, num=50, ignore_zeros=True)

count = Counter([t[2] for tup in closest_50 for t in tup[:2]])

count.most_common(10)

[(u'Special Lady Lyrics  Ray, Goodman and Brown', 9),
 (u'Moon River Lyrics  Henry Mancini', 9),
 (u'Also Sprach Zarathustra (2001) Lyrics  Deodato', 8),
 (u'Feels So Good Lyrics Chuck Mangione', 8),
 (u'The Tide Is High Lyrics  Blondie', 6),
 (u'I Take It Back Lyrics  Sandy Posey', 4),
 (u'Days Of Wine And Roses Lyrics  Henry Mancini', 4),
 (u"What A Diff'rence A Day Makes Lyrics  Dinah Washington", 3),
 (u'Dear One Lyrics  Larry Finnegan', 3),
 (u'The More I See You Lyrics  Chris Montez', 2)]

In [143]:
count2 = Counter([t for tup in closest_50 for t in tup[:2]])

In [144]:
count2

Counter({(u'http://www.songlyrics.com/news/top-songs/',
          u'4 P.M.',
          u'Sukiyaki Lyrics  4 P.M.'): 1,
         (u'http://www.songlyrics.com/news/top-songs/',
          u'A Taste Of Honey',
          u'Sukiyaki Lyrics  A Taste Of Honey'): 1,
         (u'http://www.songlyrics.com/news/top-songs/',
          u'Aerosmith',
          u'Walk This Way Lyrics Aerosmith'): 1,
         (u'http://www.songlyrics.com/news/top-songs/',
          u'Az Yet feat. Peter Cetera',
          u"Hard To Say I'm Sorry Lyrics  Az Yet feat. Peter Cetera"): 1,
         (u'http://www.songlyrics.com/news/top-songs/',
          u'Bent Fabric',
          u'Alley Cat Lyrics  Bent Fabric'): 1,
         (u'http://www.songlyrics.com/news/top-songs/',
          u'Bert Kaempfert',
          u'Red Roses For A Blue Lady Lyrics Bert Kaempfert'): 1,
         (u'http://www.songlyrics.com/news/top-songs/',
          u'Blondie',
          u'The Tide Is High Lyrics  Blondie'): 6,
         (u'http://www.songlyrics

In [145]:
count.most_common(1)[0][0]

u'Special Lady Lyrics  Ray, Goodman and Brown'

In [146]:
count2.most_common(10)

[((u'http://www.songlyrics.com/news/top-songs/',
   u'Henry Mancini',
   u'Moon River Lyrics  Henry Mancini'),
  9),
 ((u'http://www.songlyrics.com/news/top-songs/',
   u'Ray, Goodman and Brown',
   u'Special Lady Lyrics  Ray, Goodman and Brown'),
  9),
 ((u'http://www.songlyrics.com/news/top-songs/',
   u'Deodato',
   u'Also Sprach Zarathustra (2001) Lyrics  Deodato'),
  8),
 ((u'http://www.songlyrics.com/news/top-songs/',
   u'Chuck Mangione',
   u'Feels So Good Lyrics Chuck Mangione'),
  8),
 ((u'http://www.songlyrics.com/news/top-songs/',
   u'Blondie',
   u'The Tide Is High Lyrics  Blondie'),
  6),
 ((u'http://www.songlyrics.com/news/top-songs/',
   u'Henry Mancini',
   u'Days Of Wine And Roses Lyrics  Henry Mancini'),
  4),
 ((u'http://www.songlyrics.com/news/top-songs/',
   u'Sandy Posey',
   u'I Take It Back Lyrics  Sandy Posey'),
  4),
 ((u'http://www.songlyrics.com/news/top-songs/',
   u'Dinah Washington',
   u"What A Diff'rence A Day Makes Lyrics  Dinah Washington"),
  3),
 

In [91]:
titles

((u'http://www.songlyrics.com/news/top-songs/',
  u'James Brown and The Famous Flames',
  u"Papa's Got A Brand New Bag Lyrics James Brown and The Famous Flames"),
 (u'http://www.songlyrics.com/news/top-songs/',
  u'Aretha Franklin',
  u'I Never Loved A Man (The Way I Love You) Lyrics  Aretha Franklin'),
 (u'http://www.songlyrics.com/news/top-songs/',
  u'Technotronic',
  u'Pump Up The Jam Lyrics  Technotronic'),
 (u'http://www.songlyrics.com/news/top-songs/',
  u'Jagged Edge and Nelly',
  u'Where The Party At Lyrics  Jagged Edge and Nelly'),
 (u'http://www.songlyrics.com/news/top-songs/',
  u'Godspell',
  u'Day By Day Lyrics  Godspell'),
 (u'http://www.songlyrics.com/news/top-songs/',
  u'Jan and Dean',
  u'Surf City Lyrics  Jan and Dean'),
 (u'http://www.songlyrics.com/news/top-songs/',
  u'Kanye West feat. Dwele',
  u'Flashing Lights Lyrics  Kanye West feat. Dwele'),
 (u'http://www.songlyrics.com/news/top-songs/',
  u'Pink',
  u"Please Don't Leave Me Lyrics  Pink"),
 (u'http://www.so

In [96]:
count2.most_common(1)[0]

((u'http://www.songlyrics.com/news/top-songs/',
  u'Deodato',
  u'Also Sprach Zarathustra (2001) Lyrics  Deodato'),
 8)

In [147]:
print(lyrics[titles.index(count2.most_common(1)[0][0])])

Moon River, wider than a mile
I'm crossing you in style some day
Oh, dream maker, you heart breaker
Wherever you're going I'm going your way
Two drifters off to see the world
There's such a lot of world to see
We're after the same rainbow's end
Waiting 'round the bend my Huckleberry friend
Moon River and me


You Must Love Me is getting repeated a lot. Let's take a look at the terms in the song and their weights.

In [148]:
# Get the index of a song in the titles list.
#titles.index('You Must Love Me Lyrics  Madonna')
titles.index('Piece Of My Heart Lyrics  Tara Kemp')
# This index corresponds the the row for this song in the matrix.
ind = np.nonzero(weighted_matrix[754])[1]
# Create a term to index dict.
t = zip(*cv.vocabulary_.items())
term_ind = dict(zip(t[1], t[0]))
sorted([(weighted_matrix[754, i], term_ind[i]) for i in ind])

ValueError: tuple.index(x): x not in tuple

The term with the most weight is "chorus." Sometimes the most weighted term is a strong indicator of related documents. In this case, however, all songs have a chorus, and some of the songs have labeled it in their lyrics. Let's take a look.

In [None]:
print(song_lyrics_filtered['Piece Of My Heart Lyrics  Tara Kemp'])

#### An interesting example: Attempt 3

How can we deal with words like "chorus" that appear in most/all documents (or even a few documents), but do not help us distinguish between different kinds of documents?

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
ENGLISH_STOP_WORDS

Let's look at the CountVectorizer documentation to get a little more detail on the stop_words parameter.

In [None]:
# Split the titles and lyrics into two lists.
titles, lyrics = zip(*song_lyrics_filtered.items())

In [None]:
# Create an instance of the CountVectorizer.
#cv = CountVectorizer(stop_words='english')
custom_stop_words = list(ENGLISH_STOP_WORDS) + ['chorus']
cv = CountVectorizer(stop_words=custom_stop_words)

# Create a count matrix from the list of documents.
lyrics_matrix = cv.fit_transform(lyrics)

In [None]:
# Create an instance of a TfidfTransformer.
tfidf = TfidfTransformer()

# Created a weighted matrix.
weighted_matrix = tfidf.fit_transform(lyrics_matrix)

In [None]:
# Make the matrix memory footprint smaller by changing the dtype.
weighted_matrix = weighted_matrix.astype(np.float16)

In [None]:
weighted_matrix

In [None]:
# Compute distance matrix.
distance_matrix = pairwise_distances(lyrics_matrix).astype(np.float16)
distance_matrix

In [None]:
# Find the 10 closests 
closest = get_closest_pairs(distance_matrix, titles, num=20, ignore_zeros=True)
for tup in closest:
    print(tup[0] + ', ' + tup[1] + '  :  ' + str(tup[2]))

Still seeing a lot of Tara Kemp and Madonna. Let's look at the lyrics more closely.

In [13]:
def get_weighted_vocab(name, names, matrix, cv):
    # Get the index of a song in the titles list.
    name_index = names.index(name)
    # This index corresponds the the row for this song in the matrix.
    # Get indices for all non-zero elements in the document vector.
    ind = np.nonzero(matrix[name_index])[1]
    # Create a term to index dict.
    t = zip(*cv.vocabulary_.items())
    term_ind = dict(zip(t[1], t[0]))
    return sorted([(matrix[name_index, i], term_ind[i]) for i in ind])

In [None]:
tara = get_weighted_vocab('Piece Of My Heart Lyrics  Tara Kemp', titles, weighted_matrix, cv)
madonna = get_weighted_vocab('You Must Love Me Lyrics  Madonna', titles, weighted_matrix, cv)
{tup[1] for tup in tara}.intersection({tup[1] for tup in madonna})

In [None]:
tara, madonna

In [None]:
from itertools import izip_longest

In [None]:
for i in xrange(len(closest)):
    title1, title2 = closest[i][0], closest[i][1]
    two_cols = izip_longest(lyrics[titles.index(title1)].split('\n'), lyrics[titles.index(title2)].split('\n'), fillvalue='')
    print('____{0:46} | ____{1}'.format(title1, title2))
    for tup in two_cols:
        print('{0:50} | {1}'.format(*tup))
    print('')

In [None]:
min(tfidf.idf_), max(tfidf.idf_)

In [None]:
zip(cv.vocabulary_, tfidf.idf_)

In [None]:
vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(["I'd like an apple",
                             "An apple a day keeps the doctor away",
                             "Never compare an apple to an orange",
                             "I prefer scikit-learn to Orange"])
(tfidf * tfidf.T).A
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(["I'd like an apple",
                             "An apple a day keeps the doctor away",
                             "Never compare an apple to an orange",
                             "I prefer scikit-learn to Orange"])


In [None]:
pairwise_distances(tfidf)

In [12]:
dists = distance_matrix

NameError: name 'distance_matrix' is not defined

http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
TextBlob