<h2>Text Analysis with Scikit-learn</h2>

Thanks to everyone for coming today. I'm excited to be here and talk about a neat topic I really enjoy: text analysis with Python, specifically with scikit-learn. While I have been able to use this at work occasionally, I am definitely not a Natural Language Processing expert. However, one of the cool things about the Python scientific computing ecosystem is that you can *not* be an expert and still do some neat stuff.

Title slide. Contact info. github info.

::get slides from work::

Text analysis is the extraction of information from (in most cases) unstructured text. This is difficult because computers only understand numbers, so in order to analyze text with computers we have to process it in some way. There are two basic ways to approach text analysis: look for patterns in the text (these words often appear together or near each other (multi-word terms), these words are often used in the same context but never appear together (synonyms), etc.), or convert the text to numbers and do math on them.

The technique we're going to discuss today is called Term Frequency-Inverse Document Frequency (TF-IDF), and it is a fairly common method in the area of Information Retrieval (IR) for comparing and searching documents. What we're going to end up doing is taking our big set of documents, called the corpus, and turning them into matrices (numbers). These can be any type of document: news articles, emails, online comments (although the longer the documents, the better the results). Then we're going to do some pretty basic math on those matrices. That math is going to help us define what's called a distance metric between any two documents, which we can then use for comparisons and searches. Ok, but first, let's think about this problem intuitively.

Let's say we have three documents. Our documents are just going to be sentences. 
<li>The dog is jumping over the fence.</li>
<li>The dog is climbing up the fence.</li>
<li>The cat is sitting on the window.</li>
So how similar/dissimilar are these sentences? Well, that's a little hard to answer right now because it means quantifying something about the text. What about a different question: which two sentences are the most similar? The first two? Why? They share a lot of the same words, so we can think of them as being similar in their contents based on the inclusion of identical words.
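The "shared words" intuition can be sketched in a few lines of plain Python. This is only an illustration: the `words` and `shared_words` helpers are made up for this example, and the tokenization (lowercase, letters only) is an assumption.

```python
import re

sentences = [
    "The dog is jumping over the fence.",
    "The dog is climbing up the fence.",
    "The cat is sitting on the window.",
]

def words(sentence):
    """Lowercase a sentence and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def shared_words(a, b):
    """Return the words that appear in both sentences."""
    return words(a) & words(b)

# Compare every pair of sentences and list the words they have in common.
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        common = shared_words(sentences[i], sentences[j])
        print(i + 1, j + 1, len(common), sorted(common))
```

The first two sentences share four words (dog, fence, is, the), while each of them shares only two (is, the) with the third.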

::Add another sentence and point out which words end up being the important ones.::
<li>A dog was sitting in this window.</li>

So this is good. We could define a similarity metric as the number of words two sentences have in common. But is that good enough? What about these sentences?
<li>A dog was jumping over this fence and then ran over to the next yard so very quickly.</li>
<li>A cat was leaning over this edge and then jumped over to the next table so very quickly.</li>
They have many more words in common than our first two sentences, so are they more similar? Probably not. But this is a function of the length of the sentences, so let's try to normalize by dividing by the number of words in the sentence. This will give us a percent of similarity. Well, it turns out these longer sentences still get a higher similarity score. Intuitively we know the first two sentences are similar, but the numbers suggest the last two sentences are more similar. So what's the problem? The problem is that we're treating all the words as equal when, for purposes of distinguishing sentences from one another, they are not. Why not?
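To check the normalization claim, here is a rough sketch. The exact normalization is an assumption: `overlap_ratio` (a made-up helper) divides the number of shared distinct words by the distinct-word count of the longer sentence.

```python
import re

def distinct_words(sentence):
    """Lowercase a sentence and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def overlap_ratio(a, b):
    """Shared distinct words divided by the distinct-word count of the longer sentence."""
    words_a, words_b = distinct_words(a), distinct_words(b)
    return len(words_a & words_b) / max(len(words_a), len(words_b))

short_score = overlap_ratio(
    "The dog is jumping over the fence.",
    "The dog is climbing up the fence.",
)
long_score = overlap_ratio(
    "A dog was jumping over this fence and then ran over to the next yard so very quickly.",
    "A cat was leaning over this edge and then jumped over to the next table so very quickly.",
)

print(round(short_score, 3), round(long_score, 3))
```

Under this normalization the short pair scores 4/6 and the long pair 12/17, so the long pair still comes out slightly ahead even though it describes two different scenes.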
<li>A dog was jumping over this fence and then ran over to the next yard so very quickly.</li>
<li>A cat was leaning over this fence and then jumped over to the next yard so very quickly.</li>
<li>A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.</li>
<li>A bird was peering over this branch and then flew over to the next tree.</li>
If we were going to distinguish between these sentences, any word that appears in all the sentences is worthless. It doesn't help us determine whether any pair of sentences is similar or dissimilar because it's common to all of them. It's almost a characteristic of sentences in general, as opposed to a characteristic of a particular sentence. So we don't want to use words that appear in all sentences. We either throw them out or weight them very low (more on that later). What about words that appear in a lot of sentences, but not all? "so very quickly" appears in all but one sentence, so those words are somewhat useful, but not super useful.
Tree, branch, fence, yard: these appear in some sentences, but not all or most, so they are useful and will get weighted more heavily.
This is the Inverse Document Frequency part of TF-IDF. The more documents a word appears in, the less useful it is in distinguishing between the documents, so it gets weighted less. The Inverse essentially means we divide by the number of documents the word (or term) appears in.
For the first part, the Term Frequency, we're simply going to calculate the frequency of each term in each document by counting the number of times it appears in the document and dividing by the total number of words in the document. The assumption here is that the more often a word appears in a document, the more closely it represents the topic of the document.
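The two pieces described above can be sketched by hand as follows. This takes the description literally (IDF as the number of documents divided by the number of documents containing the term); many formulations, scikit-learn's included, take a logarithm of that ratio instead. The helper names are made up for this illustration.

```python
import re
from collections import Counter

docs = [
    "A dog was jumping over this fence and then ran over to the next yard so very quickly.",
    "A cat was leaning over this fence and then jumped over to the next yard so very quickly.",
    "A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.",
    "A bird was peering over this branch and then flew over to the next tree.",
]

def tokenize(text):
    """Lowercase the text and split it into words (letters only)."""
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def term_frequency(term, tokens):
    """Occurrences of the term divided by the total number of words in the document."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term):
    """Number of documents divided by the number of documents containing the term."""
    return n_docs / doc_freq[term]

def tf_idf(term, tokens):
    return term_frequency(term, tokens) * inverse_document_frequency(term)

# Weights for a few terms in the first (dog) sentence.
for term in ["over", "quickly", "fence", "dog"]:
    print(term, doc_freq[term], inverse_document_frequency(term),
          round(tf_idf(term, tokenized[0]), 3))
```

Words found in every sentence ("over") get the lowest IDF, and words unique to one sentence ("dog") get the highest.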


Ok, so let's actually calculate some TF-IDFs by hand, and then we'll do it in Python.
::example with a few sentences::
::create list of all the words. Count number of docs each word appear in. create vector for each doc. Calculate Euclidean distance::
Show distances between pairs of sentences. Show the dog and cat are similar and the squirrel and bird are similar.
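One possible version of the by-hand walkthrough outlined above, as a sketch under stated assumptions: the IDF uses a logarithm (so words found in every sentence get a weight of zero), and each vector is scaled to unit length before taking the Euclidean distance so that sentence length does not dominate. Neither choice comes from the notes; they mirror common practice.

```python
import math
import re
from collections import Counter

docs = {
    "dog": "A dog was jumping over this fence and then ran over to the next yard so very quickly.",
    "cat": "A cat was leaning over this fence and then jumped over to the next yard so very quickly.",
    "squirrel": "A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.",
    "bird": "A bird was peering over this branch and then flew over to the next tree.",
}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

tokens = {name: tokenize(text) for name, text in docs.items()}

# 1. Create the list of all the words.
vocabulary = sorted({t for toks in tokens.values() for t in toks})

# 2. Count the number of docs each word appears in.
doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
n_docs = len(docs)

# 3. Create a TF-IDF vector for each doc (assumption: log IDF, unit length).
def tfidf_vector(toks):
    counts = Counter(toks)
    vec = [(counts[t] / len(toks)) * math.log(n_docs / doc_freq[t]) for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

vectors = {name: tfidf_vector(toks) for name, toks in tokens.items()}

# 4. Calculate the Euclidean distance between each pair of docs.
def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

names = list(docs)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(euclidean(vectors[a], vectors[b]), 3))
```

With these choices the dog and cat sentences are closer to each other than either is to the squirrel sentence, and the squirrel sentence is closer to the bird sentence than to the cat sentence.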




<div style="margin-top:200px"/>

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pandas as pd

###### Example 1: Short Sentences

In [2]:
short_sentences = ['The dog is jumping over the fence.', 
                   'The dog is climbing up the fence.',
                   'The cat is sitting on the window.']

In [3]:
# Create an instance of the CountVectorizer.
cv = CountVectorizer()

# Create a count matrix from the list of documents.
count_matrix = cv.fit_transform(short_sentences)

In [28]:
# Create a dataframe out of the count matrix with row labels and column names for easier reading.
short_sent_df = pd.DataFrame(count_matrix.todense(), index=short_sentences, columns=sorted(cv.vocabulary_))
short_sent_df

Unnamed: 0,cat,climbing,dog,fence,is,jumping,on,over,sitting,the,up,window
The dog is jumping over the fence.,0,0,1,1,1,1,0,1,0,2,0,0
The dog is climbing up the fence.,0,1,1,1,1,0,0,0,0,2,1,0
The cat is sitting on the window.,1,0,0,0,1,0,1,0,1,2,0,1


<div style="margin-top:200px"/>

###### Example 2: Long sentences

In [150]:
long_sentences = ['A dog was jumping over this fence and then ran over to the next yard so very quickly.',
                  'A cat was leaning over this fence and then jumped over to the next yard so very quickly.',
                  'A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.',
                  'A bird was peering over this branch and then flew over to the next tree.']

In [140]:
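# Alternative first sentence for experimenting with a heavily repeated term.
# The outputs shown below were produced with the previous definition (cell In [150]).
long_sentences = ['dog dog dog dog dog dog dog dog something else.',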
                  'A cat was leaning over this fence and then jumped over to the next yard so very quickly.',
                  'A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.',
                  'A bird was peering over this branch and then flew over to the next tree.']

In [151]:
# Create an instance of the CountVectorizer.
cv = CountVectorizer()

# Create a count matrix from the list of documents.
long_count_matrix = cv.fit_transform(long_sentences)

In [152]:
# Create a dataframe out of the count matrix with row labels and column names for easier reading.
long_sent_df = pd.DataFrame(long_count_matrix.todense(), index=long_sentences, columns=sorted(cv.vocabulary_))
long_sent_df

Unnamed: 0,and,bird,branch,cat,dog,fence,flew,jumped,jumping,leaning,...,so,squirrel,the,then,this,to,tree,very,was,yard
A dog was jumping over this fence and then ran over to the next yard so very quickly.,1,0,0,0,1,1,0,0,1,0,...,1,0,1,1,1,1,0,1,1,1
A cat was leaning over this fence and then jumped over to the next yard so very quickly.,1,0,0,1,0,1,0,1,0,1,...,1,0,1,1,1,1,0,1,1,1
A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.,1,0,1,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,0
A bird was peering over this branch and then flew over to the next tree.,1,1,1,0,0,0,1,0,0,0,...,0,0,1,1,1,1,1,0,1,0


In [153]:
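# Create an instance of a TfidfTransformer.
# With norm='l1' and use_idf=False, each row is just the term counts divided by the row total (plain term frequency).
tfidf = TfidfTransformer(norm='l1', use_idf=False, smooth_idf=False)

# Create a weighted matrix.
weighted_matrix = tfidf.fit_transform(long_count_matrix)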

In [154]:
weighted_matrix

<4x27 sparse matrix of type '<type 'numpy.float64'>'
	with 61 stored elements in Compressed Sparse Row format>

In [155]:
# Create a dataframe out of the weighted matrix, labelled by each sentence's second word (the animal), for easier reading.
long_sent_weighted_df = pd.DataFrame(weighted_matrix.todense(), index=[s.split()[1] for s in long_sentences], columns=sorted(cv.vocabulary_))
long_sent_weighted_df

Unnamed: 0,and,bird,branch,cat,dog,fence,flew,jumped,jumping,leaning,...,so,squirrel,the,then,this,to,tree,very,was,yard
dog,0.058824,0.0,0.0,0.0,0.058824,0.058824,0.0,0.0,0.058824,0.0,...,0.058824,0.0,0.058824,0.058824,0.058824,0.058824,0.0,0.058824,0.058824,0.058824
cat,0.058824,0.0,0.0,0.058824,0.0,0.058824,0.0,0.058824,0.0,0.058824,...,0.058824,0.0,0.058824,0.058824,0.058824,0.058824,0.0,0.058824,0.058824,0.058824
squirrel,0.058824,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.058824,0.058824,0.058824,0.058824,0.058824,0.058824,0.058824,0.058824,0.058824,0.0
bird,0.071429,0.071429,0.071429,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,...,0.0,0.0,0.071429,0.071429,0.071429,0.071429,0.071429,0.0,0.071429,0.0


In [156]:
long_sent_weighted_df['quickly']

dog         0.058824
cat         0.058824
squirrel    0.058824
bird        0.000000
Name: quickly, dtype: float64

In [165]:
1/14.0

0.07142857142857142

In [164]:
# Total number of tokens counted per sentence (row sums of the count matrix).
long_count_matrix.todense().sum(axis=1)

matrix([[17],
        [17],
        [17],
        [14]], dtype=int64)

In [158]:
# Print the number of words in each sentence.
for s in long_sentences:
    print(s.split()[1] + ' ' + str(len(s.split())))

dog 18
cat 18
squirrel 18
bird 15
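The whitespace counts above (18 and 15) are one higher than the row sums of the count matrix (17 and 14). A likely explanation is that `CountVectorizer`'s default `token_pattern` only keeps tokens of two or more word characters, so the leading "A" is dropped. A small sketch using that pattern directly with `re`, without scikit-learn:

```python
import re

# Default token pattern documented for CountVectorizer: two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentence = "A dog was jumping over this fence and then ran over to the next yard so very quickly."

whitespace_tokens = sentence.split()
vectorizer_style_tokens = re.findall(TOKEN_PATTERN, sentence.lower())

# Words that the pattern skips because they are a single character long.
dropped = [w for w in whitespace_tokens if len(w.strip('.')) < 2]

print(len(whitespace_tokens), len(vectorizer_style_tokens), dropped)
```

That would also explain the weights in the table: 1/17 is about 0.058824 and 1/14 is about 0.071429.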


In [159]:
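# tfidf was created with use_idf=False, so it has no learned IDF weights to zip over; hence the error below.
zip(sorted(cv.vocabulary_), tfidf.idf_)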

TypeError: zip argument #2 must support iteration
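The error above appears to come from the transformer having been created with `use_idf=False`, which leaves no IDF weights to iterate over. A minimal sketch of one way around it, assuming it is acceptable to fit a second transformer with `use_idf=True` (the name `tfidf_with_idf` is made up for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

long_sentences = [
    'A dog was jumping over this fence and then ran over to the next yard so very quickly.',
    'A cat was leaning over this fence and then jumped over to the next yard so very quickly.',
    'A squirrel was leaping over this branch and then scampered over to the next tree so very quickly.',
    'A bird was peering over this branch and then flew over to the next tree.',
]

cv = CountVectorizer()
counts = cv.fit_transform(long_sentences)

# use_idf=True (the default) makes fit() learn one IDF weight per vocabulary term.
tfidf_with_idf = TfidfTransformer(use_idf=True).fit(counts)

# Pair each term with its learned IDF weight.
idf_by_term = dict(zip(sorted(cv.vocabulary_), tfidf_with_idf.idf_))
for term in ['over', 'quickly', 'fence', 'dog']:
    print(term, round(idf_by_term[term], 3))
```

The rarer the term across the four sentences, the larger its IDF weight, which matches the intuition from earlier.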

<div style="margin-top:200px"/>

###### Example 3: An interesting one

In [5]:
# Load song lyrics dataset.
import json
with open('data/song_lyrics.json', 'rt', encoding='utf8') as f:
    song_lyrics = json.load(f)

In [14]:
# Split the titles and lyrics into two lists.
titles, lyrics = zip(*song_lyrics.items())

In [16]:
# Create an instance of the CountVectorizer.
cv = CountVectorizer()

# Create a count matrix from the list of documents.
lyrics_matrix = cv.fit_transform(lyrics)

In [17]:
# Create a dataframe out of the count matrix with row labels and column names for easier reading.
lyrics_df = pd.DataFrame(lyrics_matrix.todense(), index=titles, columns=sorted(cv.vocabulary_))
lyrics_df

Unnamed: 0,00,000,02,03,06,07,0h,10,100,1000,...,½this,½ttâ,½what,½why,½will,½yeah,½yeahâ,½yee,½you,ôem
If I Had No Loot Lyrics Tony! Toni! Tone!,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
No One Else Lyrics Total,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Who Says you can't Go Home Lyrics Bon Jovi,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Power Of Love-Love Power Lyrics Luther Vandross,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Westside Lyrics TQ,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Heartless Lyrics Kanye West,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Someday Lyrics Mariah Carey,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
White Flag Lyrics Dido,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U Don't Have To Call Lyrics Usher,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Just Take My Heart Lyrics Mr. Big,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# Create an instance of a TfidfTransformer.
tfidf = TfidfTransformer()

# Create a weighted matrix.
weighted_matrix = tfidf.fit_transform(lyrics_matrix)

In [19]:
weighted_matrix

<2046x17015 sparse matrix of type '<type 'numpy.float64'>'
	with 263955 stored elements in Compressed Sparse Row format>

In [21]:
# Create a dataframe with row labels and column names for easier reading.
# NOTE: this passes lyrics_matrix (raw counts), so the table below shows counts; pass weighted_matrix.todense() to see the TF-IDF weights.
# TODO: When the new dataset with (artist, title) is loaded, set index=[tup[1] for tup in titles]
lyrics_weighted_df = pd.DataFrame(lyrics_matrix.todense(), index=titles, columns=sorted(cv.vocabulary_))
lyrics_weighted_df

Unnamed: 0,00,000,02,03,06,07,0h,10,100,1000,...,½this,½ttâ,½what,½why,½will,½yeah,½yeahâ,½yee,½you,ôem
If I Had No Loot Lyrics Tony! Toni! Tone!,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
No One Else Lyrics Total,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Who Says you can't Go Home Lyrics Bon Jovi,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Power Of Love-Love Power Lyrics Luther Vandross,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Westside Lyrics TQ,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Heartless Lyrics Kanye West,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Someday Lyrics Mariah Carey,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
White Flag Lyrics Dido,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
U Don't Have To Call Lyrics Usher,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Just Take My Heart Lyrics Mr. Big,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np

In [25]:
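# Compute the distance matrix (pairwise_distances defaults to Euclidean distance).
# NOTE: this is computed on lyrics_matrix (raw counts); pass weighted_matrix to compare TF-IDF vectors instead.
distance_matrix = pairwise_distances(lyrics_matrix)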

In [69]:
def get_closest_pairs(dist_mat, names, num=10, ignore_zeros=False):
    """Finds the num pairs with the smallest distances (optionally skipping zero distances)."""
    dist_mat = np.copy(dist_mat)
    # Set the diagonal and upper triangle to infinity so self-pairs and
    # mirrored duplicates are ignored in the partitioning.
    dist_mat[np.triu_indices_from(dist_mat)] = np.inf
    if ignore_zeros:
        dist_mat[dist_mat == 0] = np.inf
    # Partition the flattened matrix so the num smallest distances come first, then unravel to (row, col) indices.
    unr_index = np.unravel_index(dist_mat.argpartition(num, axis=None), dist_mat.shape)
    # Get document names for pairs.
    dist_pairs = [(names[unr_index[0][i]],        # First doc
                   names[unr_index[1][i]],        # Second doc
                   dist_mat[unr_index[0][i], unr_index[1][i]])   # Distance between docs
                   for i in range(num)]                # First num indices
    return sorted(dist_pairs, key=lambda tup: tup[2])

In [70]:
# Find the 10 closest pairs.
for tup in get_closest_pairs(distance_matrix, titles, num=10):
    print(tup[0] + ', ' + tup[1] + '  :  ' + str(tup[2]))

She Will Be Loved Lyrics  Maroon 5, She Will Be Loved Lyrics  Maroon5  :  0.0
All Cried Out Lyrics  Allure feat. 112, All Cried Out Lyrics Allure feat. 112  :  0.0
No Diggity Lyrics  BLACKstreet feat. Dr. Dre, No Diggity Lyrics  BLACKstreet (feat. Dr. Dre)  :  0.0
Step In The Name Of Love Lyrics  R. Kelly, Step In The Name of Love Lyrics  R. Kelly  :  0.0
We Be Burnin' Lyrics  Sean Paul, We be Burnin' Lyrics  Sean Paul  :  0.0
All For You Lyrics Sister Hazel, All For You Lyrics  Sister Hazel  :  0.0
Party in the U.S.A. Lyrics  Miley Cyrus, Party In The U.S.A. Lyrics  Miley Cyrus  :  0.0
Because of You Lyrics 98 Degrees, Because Of You Lyrics  98 Degrees  :  0.0
Forever In Love Lyrics  Kenny G, Children Lyrics  Robert Miles  :  0.0
Use Somebody Lyrics  Kings Of Leon, Use Somebody Lyrics  Kings of Leon  :  0.0


Here we see that a number of songs appear in the dataset more than once, sometimes with slightly different names. In fact, these songs appear in the top 100 songs from multiple years, which is why they show up more than once. Because the titles are slightly different, the dict created new entries instead of overwriting the values of an existing key.
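One way to spot these near-duplicate titles is to compare them after stripping case, spacing and punctuation. This is only a sketch: `normalize_title` is a made-up helper, not something used elsewhere in this notebook, and the sample titles are copied from the outputs above.

```python
import re
from collections import defaultdict

def normalize_title(title):
    """Hypothetical helper: lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

titles = [
    "She Will Be Loved Lyrics  Maroon 5",
    "She Will Be Loved Lyrics  Maroon5",
    "No Diggity Lyrics  BLACKstreet feat. Dr. Dre",
    "No Diggity Lyrics  BLACKstreet (feat. Dr. Dre)",
    "Use Somebody Lyrics  Kings Of Leon",
    "Use Somebody Lyrics  Kings of Leon",
    "White Flag Lyrics Dido",
]

# Group the original titles by their normalized form.
groups = defaultdict(list)
for title in titles:
    groups[normalize_title(title)].append(title)

# Any group with more than one spelling is a likely duplicate entry.
duplicates = [variants for variants in groups.values() if len(variants) > 1]
for variants in duplicates:
    print(variants)
```

Using the normalized form as the dict key while scraping would let a repeated song overwrite its earlier entry instead of creating a new one.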

So now let's ignore all pairs with a distance of 0.

In [72]:
# Find the 10 closest pairs with a non-zero distance.
closest = get_closest_pairs(distance_matrix, titles, num=10, ignore_zeros=True)
for tup in closest:
    print(tup[0] + ', ' + tup[1] + '  :  ' + str(tup[2]))

Hero Lyrics  Mariah Carey, My All Lyrics Mariah Carey  :  1.73205080757
Always Be My Baby Lyrics  Mariah Carey, My All Lyrics Mariah Carey  :  2.0
Hero Lyrics  Mariah Carey, Love Story Lyrics  Taylor Swift  :  2.2360679775
Hero Lyrics  Mariah Carey, Always Be My Baby Lyrics  Mariah Carey  :  2.2360679775
My All Lyrics Mariah Carey, Love Story Lyrics  Taylor Swift  :  2.44948974278
Always Be My Baby Lyrics  Mariah Carey, Love Story Lyrics  Taylor Swift  :  2.82842712475
Bad Romance Lyrics  Lady Gaga, Hero Lyrics  Mariah Carey  :  2.82842712475
Bad Romance Lyrics  Lady Gaga, My All Lyrics Mariah Carey  :  3.0
Hero Lyrics  Mariah Carey, Children Lyrics  Robert Miles  :  3.16227766017
Hero Lyrics  Mariah Carey, Forever In Love Lyrics  Kenny G  :  3.16227766017


So these are our closest matches. Let's see what they look like.

In [82]:
# Print the lyrics of both songs in each of the closest pairs.
for title1, title2, _ in closest:
    print(title1 + ' : ' + lyrics[titles.index(title1)])
    print(title2 + ' : ' + lyrics[titles.index(title2)])
    print('')

Hero Lyrics  Mariah Carey : We do not have the lyrics for Hero  yet.
My All Lyrics Mariah Carey : We do not have the lyrics for My All  yet.

Always Be My Baby Lyrics  Mariah Carey : We do not have the lyrics for Always Be My Baby  yet.
My All Lyrics Mariah Carey : We do not have the lyrics for My All  yet.

Hero Lyrics  Mariah Carey : We do not have the lyrics for Hero  yet.
Love Story Lyrics  Taylor Swift : We do not have the lyrics for Love Story (Taylor Swift) yet.

Hero Lyrics  Mariah Carey : We do not have the lyrics for Hero  yet.
Always Be My Baby Lyrics  Mariah Carey : We do not have the lyrics for Always Be My Baby  yet.

My All Lyrics Mariah Carey : We do not have the lyrics for My All  yet.
Love Story Lyrics  Taylor Swift : We do not have the lyrics for Love Story (Taylor Swift) yet.

Always Be My Baby Lyrics  Mariah Carey : We do not have the lyrics for Always Be My Baby  yet.
Love Story Lyrics  Taylor Swift : We do not have the lyrics for Love Story (Taylor Swift) yet.

B

This is another instance where the distance comparison has found something we didn't expect: these "lyrics" are just placeholder messages. Let's remove these songs from our dataset.

In [87]:
to_remove = [x for x in lyrics if x.startswith('We do not have') 
                                or x.startswith('[Instrumental]')
                                or x.startswith('Sorry, we have no')]
to_remove

[u'Sorry, we have no Beyonce Knowles - Sweet dreams lyrics at the moment.\nPlease check the spelling and try again to\nsearch Beyonce Knowles Sweet dreams lyrics\nvar GOOG_FIXURL_LANG = "en";var GOOG_FIXURL_SITE = "http://www.songlyrics.com/";',
 u'Sorry, we have no Sean Kingston - Eenie meenie with justin bieber lyrics at the moment.\nPlease check the spelling and try again to\nsearch Sean Kingston Eenie meenie with justin bieber lyrics\nvar GOOG_FIXURL_LANG = "en";var GOOG_FIXURL_SITE = "http://www.songlyrics.com/";',
 u'We do not have the lyrics for Love Story (Taylor Swift) yet.',
 u'[Instrumental]',
 u'We do not have the lyrics for My All  yet.',
 u'Sorry, we have no Usher - Dj got us fallin in love lyrics at the moment.\nPlease check the spelling and try again to\nsearch Usher Dj got us fallin in love lyrics\nvar GOOG_FIXURL_LANG = "en";var GOOG_FIXURL_SITE = "http://www.songlyrics.com/";',
 u'We do not have the lyrics for Always Be My Baby  yet.',
 u'[Instrumental]',
 u'Sorry, w

In [96]:
# Filter out songs that do not have lyrics in the dataset.
song_lyrics_filtered = {k:v for k,v in song_lyrics.items() if not (v.startswith('We do not have') 
                                                                    or v.startswith('[Instrumental]')
                                                                    or v.startswith('Sorry, we have no'))}

In [2]:
import requests
from bs4 import BeautifulSoup

In [41]:
def get_song_links(url):
    """Gets song titles and URLs for the lyrics page for a www.songlyrics.com top-songs page."""
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    trs = soup.find_all('table', class_='tracklist')[0].find_all('tr')
    song_links = {}
    for row in trs[1:]:
        tag = row.find_all('td')[-1]
        a_tag = tag.a
        song_links[a_tag['title']] = a_tag['href']
    return song_links

In [42]:
def get_lyrics(url):
    """Scrapes the lyrics from a www.songlyrics.com song page."""
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return soup.find_all('p', attrs={'id':'songLyricsDiv'})[0].get_text('\n', strip=True)

In [45]:
# Define variables
base_url = 'http://www.songlyrics.com/news/top-songs/'
start_year = 2011
end_year = 1990

In [46]:
# Dict to hold song titles -> link to lyrics page.
song_links = {}
cur_year = start_year

# Loop through years and scrape links to lyric pages for each year's top 100 songs.
while cur_year >= end_year:
    url = base_url + str(cur_year)
    song_links.update(get_song_links(url))
    cur_year -= 1

In [51]:
# Dict to hold song titles -> lyrics
song_lyrics = {}
errors = {}
# Visit the lyric pages for each song in song_links and scrape the lyrics.
for title, url in song_links.items():
    try:
        song_lyrics[title] = get_lyrics(url)
    except Exception as e:
        errors[title] = e

In [52]:
tup = list(song_lyrics.items())[2]
print(tup[0] +'\n'+ tup[1])

Who Says you can't Go Home Lyrics  Bon Jovi
I spent twenty years tryin' to get out of this place
I was lookin' for somethin' I couldn't replace
I was runnin' away from the only thing I've ever known
And like a blind dog without a bone
I was a gypsy lost in the twilight zone
I hijacked a rainbow and crashed into a pot of gold
I've been there, done that, now I ain't lookin' back
And the seeds I've sown, savin' dimes
Spendin' too much time on the telephone
Who says you can't go home?
Who says you can't go home?
There's only one place that call me one of their own
Just a hometown boy, born a rollin' stone
Who says you can't go home?
Who says you can't go back?
Been all around the world and as a matter of fact
There's only one place left, I want to go
Who says you can't go home? It's alright
It's alright, it's alright, it's alright, it's alright
I went as far as I could, I tried to find a new face
There isn't one of these lines that I would erase
I left a million mile of memories on that ro

In [53]:
len(song_lyrics)

2046

In [56]:
[k for k in song_lyrics.keys() if 'matthews' in k.lower()]

['The Space Between Lyrics  Dave Matthews Band']

In [57]:
errors

{'1 Thing Lyrics  Amerie': IndexError('list index out of range'),
 'Any Time, Any Place / And On And On Lyrics  Janet Jackson': IndexError('list index out of range'),
 'Before You Walk Out Of My Life / Like This And Like That Lyrics  Monica': IndexError('list index out of range'),
 'Boombastic / In The Summertime Lyrics  Shaggy': IndexError('list index out of range'),
 "C'mon And Get My Love Lyrics  D-Mob With Cathy Dennis": requests.exceptions.InvalidSchema(u"No connection adapters were found for ' http://www.songlyrics.com/d-mob-introducing-cathy-dennis/c-mon-and-get-my-love-lyrics/'"),
 'Dear Mama / Old School Lyrics Tupac': IndexError('list index out of range'),
 'Do For Love Lyrics Tupac': requests.exceptions.InvalidSchema(u"No connection adapters were found for ' http://www.songlyrics.com/tupac/do-for-love-lyrics/'"),
 "Don't Know Much Lyrics  Linda Ronstadt and Aaron Neville": IndexError('list index out of range'),
 "Don't Wanna Fall In Love Lyrics  Jane Child": requests.excepti

In [58]:
len(errors)

35

In [59]:
import json

In [None]:
# Save song lyrics.
with open('song_lyrics.json', 'wt', encoding='utf8') as f:
    json.dump(song_lyrics, f)

http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
TextBlob