<h1>Document Similarity using LSI</h1>

<h4>OUTLINE</h4>
<ol>
<li>From Wikipedia’s Lists of musicians page (https://en.wikipedia.org/wiki/Lists_of_musicians), pick five lists
(e.g., List of big band musicians) whose titles contain the word “musicians” and which
each list at least 30 musicians.
<li>Collect the URLs of all the musicians on those five pages and place them in a list
<li>Grab the page content for each musician in the list and place it in a list (of documents)
<li>Build an LSI model using this data to form the "reference" data set
<li>Grab another list of musicians from Wikipedia and create a new list of documents using the text of each musician's page to form the "musician" data set
<li>For each musician in the new list, find the most similar musician in the reference data set.
<li>Print a table that contains each musician from the musician data set and the most similar musician from the reference data set
</ol>

<h4>get_musicians</h4>
<li>A function that, given a "list of musicians" URL, returns a list of (name, URL) pairs, one for each musician's Wikipedia page

In [0]:
from bs4 import BeautifulSoup
import requests

def get_musicians(url):
    """Return (name, url) pairs for the articles linked from a list page."""
    page_soup = BeautifulSoup(requests.get(url).content, 'lxml')
    all_musicians = list()
    for tag in page_soup.find_all('li'):
        # Skip <li> elements that carry an id (typically footnotes and navigation items).
        if tag.get('id'):
            continue
        # Skip items that have no link, or a link without an href.
        a_tag = tag.find('a')
        if a_tag is None or not a_tag.get('href'):
            continue
        link = a_tag.get('href')
        name = a_tag.get_text()
        if "/wiki/" in link and non_musician_finder(link):
            all_musicians.append((name, "https://en.wikipedia.org" + link))
    return all_musicians

def non_musician_finder(link):
    """Return False if the link contains a word that marks a non-musician page."""
    non_musician_words = ['Category', 'Template', 'Portal', 'List', 'File', 'Special', 'Main', 'Help', 'User']
    return not any(word in link for word in non_musician_words)
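The substring test in `non_musician_finder` can also reject genuine articles: `/wiki/The_Specials` contains `Special`, and `/wiki/Help!_(album)` contains `Help`. One possible refinement, sketched below, looks only at a `Namespace:` prefix directly after `/wiki/`. It assumes that non-article pages are marked that way and that list pages start with `List_of_` or `Lists_of_`. The name `is_article_link` is hypothetical and not part of the notebook, and unlike the original check it requires the link to start with `/wiki/`.

```python
# Namespaces taken from the notebook's word list; 'List' and 'Main' are handled separately.
NON_ARTICLE_NAMESPACES = {'Category', 'Template', 'Portal', 'File', 'Special', 'Help', 'User'}

def is_article_link(link):
    """Return True if link looks like an ordinary article (hypothetical helper)."""
    if not link.startswith('/wiki/'):
        return False
    title = link[len('/wiki/'):]
    # List pages and the main page live in the article namespace, so test them by title.
    if title == 'Main_Page' or title.startswith(('List_of_', 'Lists_of_')):
        return False
    # Reject only when the text before the first colon is a known namespace.
    namespace, sep, _ = title.partition(':')
    return not (sep and namespace in NON_ARTICLE_NAMESPACES)

sample_links = ['/wiki/The_Specials', '/wiki/Special:Search',
                '/wiki/List_of_emo_artists', '/wiki/Cream_(band)']
for link in sample_links:
    print(link, is_article_link(link))
```

Swapping this in for `non_musician_finder` would change which pages are collected, so any results produced with the original filter would need to be regenerated.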

In [2]:
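from bs4 import BeautifulSoup
import requests
import random

def select_five_list(url):
    """Pick five random 'List of ... musicians' pages that have at least 30 entries."""
    page_soup = BeautifulSoup(requests.get(url).content, 'lxml')
    li_tags = page_soup.find_all('li')
    random.shuffle(li_tags)
    page_list = []
    for tag in li_tags:
        a_tag = tag.find('a')
        if a_tag is None or not a_tag.get('href') or not a_tag.get('title'):
            continue
        if 'musicians' not in a_tag.get_text():
            continue
        # The href already starts with '/', so the host must not end with one.
        list_url = 'https://en.wikipedia.org' + a_tag.get('href')
        try:
            musicians = get_musicians(list_url)  # fetch once and reuse the result
        except requests.RequestException:
            continue
        if len(musicians) >= 30:
            # 'List of free funk musicians' -> 'free_funk_musicians'
            page_list.append('_'.join(a_tag.get('title').split()[2:]))
            if len(page_list) == 5:
                break
    return page_list

url = 'https://en.wikipedia.org/wiki/Lists_of_musicians'
select_five_list(url)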

['free_funk_musicians',
 'country_rock_musicians',
 'Ghanaian_musicians',
 'boogie_woogie_musicians',
 'Chinese_musicians']

<h4>get_musician_text(url): returns the page text of the wikipedia page associated with a musician</h4>

In [0]:
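from bs4 import BeautifulSoup
import requests

def get_musician_text(url):
    """Return the concatenated <p> text of a page, or None if the request fails."""
    try:
        page_soup = BeautifulSoup(requests.get(url).content, 'lxml')
    except requests.RequestException:
        return None
    # Only paragraph text is kept; tables, infoboxes and lists are ignored.
    return ''.join(p_tag.get_text() for p_tag in page_soup.find_all('p'))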


<h4>get_all_musicians: A function that, given a list of genres, returns a list of (name, URL) pairs for the musicians on the corresponding Wikipedia list pages</h4>

In [0]:
def get_all_musicians(genre_list):
    all_musicians = list()
    for genre in genre_list:
        url = 'https://en.wikipedia.org/wiki/List_of_' + genre
    
        all_musicians.extend(get_musicians(url))
    
    return all_musicians

In [0]:
genre_list = ['bluegrass_musicians#G','British_blues_musicians','country_blues_musicians','emo_artists']
all_musicians = get_all_musicians(genre_list)

<h4>get_all_musician_docs: A function that, given the list of (musician,url) pairs, returns two lists, a list of musicians and a parallel (same size) list of documents. </h4>

In [0]:
def get_all_musician_docs(all_musicians):
    musician_names = list()
    musician_texts = list()
    for musician in all_musicians:
        name = musician[0]
        url = musician[1]
        
        text=get_musician_text(url)
        if text:
            musician_names.append(name)
            musician_texts.append(text)
    return musician_names,musician_texts
        

In [0]:
reference_names,reference_docs = get_all_musician_docs(all_musicians)

<h3>Set up the LSI model</h3>

In [0]:
from gensim import corpora, similarities, models
from gensim.parsing.preprocessing import STOPWORDS

# Tokenize: lowercase, split on whitespace, keep alphanumeric tokens that are not stopwords.
texts = [[word for word in doc.lower().split() if word not in STOPWORDS and word.isalnum()]
         for doc in reference_docs]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Project the bag-of-words corpus into a 5-topic latent space.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=5)
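To make the preprocessing concrete, the gensim-free sketch below mimics the tokenization above and the (token id, count) pairs that `doc2bow` returns. The tiny stopword set, the sample sentences, and the helper names are assumptions made for illustration. gensim's `STOPWORDS` is far larger, and its `Dictionary` may assign ids differently. Note that `isalnum()` drops any token with punctuation attached, such as `band,`.

```python
from collections import Counter

# Assumed stand-in for gensim's STOPWORDS, kept tiny on purpose.
STOPWORDS_SAMPLE = {'the', 'a', 'is', 'and', 'of'}

def tokenize(doc):
    """Lowercase, split on whitespace, keep alphanumeric non-stopwords."""
    return [w for w in doc.lower().split() if w not in STOPWORDS_SAMPLE and w.isalnum()]

def build_vocab(texts):
    """Map each distinct token to an integer id in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text:
            vocab.setdefault(word, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = ["Cream is a British rock band, formed in London",
        "The band played blues rock and rock"]
texts = [tokenize(d) for d in docs]
vocab = build_vocab(texts)
print(texts[0])                  # 'band,' is missing: the comma fails isalnum()
print(to_bow(texts[1], vocab))
```

Stripping punctuation before the `isalnum()` test would keep such tokens, at the cost of changing the vocabulary the LSI model is trained on.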

<h3>Construct the "musician" data set</h3>

In [0]:
musician_genre_list = ['acid_rock_artists']
all_musicians = get_all_musicians(musician_genre_list)
musician_names,musician_docs = get_all_musician_docs(all_musicians)

In [12]:
musician_names

['The 13th Floor Elevators',
 'Alice Cooper',
 'The Amboy Dukes',
 'Amon Düül',
 'Big Brother and the Holding Company',
 'Black Sabbath',
 'Blue Cheer',
 'Blues Magoos',
 'The Charlatans',
 'Count Five',
 'Country Joe and the Fish',
 'Coven',
 'Cream',
 'Deep Purple',
 'The Deviants',
 'The Doors',
 'The Electric Prunes',
 'The Fugs',
 'Grateful Dead',
 'The Great Society',
 'The Groundhogs',
 'Hawkwind',
 'Iron Butterfly',
 'Jefferson Airplane',
 'The Jimi Hendrix Experience',
 'Janis Joplin',
 'JPT Scare Band',
 'Love',
 'MC5',
 'Moby Grape',
 'The Music Machine',
 'Quicksilver Messenger Service',
 'Santana',
 'The Seeds',
 'Grace Slick',
 'Steppenwolf',
 'Tully',
 'Vanilla Fudge',
 'Wooden Shjips',
 'Acid rock',
 'Knowles, Christopher',
 'McIntyre, Iain',
 'Prown, Pete',
 'Folk',
 'Funk',
 'Hip hop',
 'Pop',
 'Rock',
 'Soul',
 'Trance',
 'Dub',
 'Psybient',
 'Suomisaundi',
 'Breaks',
 'House',
 'Jazz',
 'Punk',
 'Rock',
 'Techno',
 'Trance',
 'Krautrock',
 'Chillwave',
 'Dream-beat'

<h4>find the most similar musicians for each new musician from our reference data set</h4>
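`MatrixSimilarity` scores a query against every indexed document by cosine similarity. As a rough, gensim-free sketch of the lookup performed in this step, the code below ranks a query against a few hand-made dense vectors. The vector values and the `_demo` names are invented for illustration and are not LSI output.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(query, reference_vectors):
    """Return (position, score) of the reference vector closest to query."""
    scores = [cosine(query, ref) for ref in reference_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical 2-topic vectors standing in for LSI projections.
reference_names_demo = ['ref_a', 'ref_b', 'ref_c']
reference_vectors_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

best, score = most_similar([2.0, 1.9], reference_vectors_demo)
print(reference_names_demo[best], round(score, 3))
```

Because cosine similarity ignores vector length, a long article and a short one can still match closely when their topic mix is similar.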

In [13]:
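# Build the similarity index over the reference corpus once, outside the loop.
index_ = similarities.MatrixSimilarity(lsi[corpus])

table_data = list()
for index, musician in enumerate(musician_docs):
    # Words that are not in the reference dictionary are ignored by doc2bow.
    vec_bow = dictionary.doc2bow(musician.lower().split())
    vec_lsi = lsi[vec_bow]
    sims = index_[vec_lsi]
    # Position of the reference document with the highest similarity score.
    most_similar_musician = max(enumerate(sims), key=lambda x: x[1])[0]
    table_data.append((musician_names[index], reference_names[most_similar_musician]))

print(table_data)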
    

[('The 13th Floor Elevators', 'Marcus Mumford'), ('Alice Cooper', 'Jawbreaker'), ('The Amboy Dukes', 'Garden Variety'), ('Amon Düül', 'Secondhand Serenade'), ('Big Brother and the Holding Company', 'The Pretty Things'), ('Black Sabbath', 'Garden Variety'), ('Blue Cheer', 'Eric Burdon'), ('Blues Magoos', 'Tramp'), ('The Charlatans', 'Fragile Rock'), ('Count Five', 'Free'), ('Country Joe and the Fish', 'Vern Williams'), ('Coven', 'Rites of Spring'), ('Cream', 'Cream'), ('Deep Purple', 'The Anniversary'), ('The Deviants', 'Nude'), ('The Doors', 'Secondhand Serenade'), ('The Electric Prunes', 'Secondhand Serenade'), ('The Fugs', 'Secondhand Serenade'), ('Grateful Dead', 'Secondhand Serenade'), ('The Great Society', 'Pillar'), ('The Groundhogs', 'The Groundhogs'), ('Hawkwind', 'Wishbone Ash'), ('Iron Butterfly', 'Steamhammer'), ('Jefferson Airplane', 'Free'), ('The Jimi Hendrix Experience', 'The Jimi Hendrix Experience'), ('Janis Joplin', 'Memphis Jug Band'), ('JPT Scare Band', 'Secondhand 

In [15]:
import pprint
pp=pprint.PrettyPrinter(indent=4)
pp.pprint(table_data)

[   ('The 13th Floor Elevators', 'Marcus Mumford'),
    ('Alice Cooper', 'Jawbreaker'),
    ('The Amboy Dukes', 'Garden Variety'),
    ('Amon Düül', 'Secondhand Serenade'),
    ('Big Brother and the Holding Company', 'The Pretty Things'),
    ('Black Sabbath', 'Garden Variety'),
    ('Blue Cheer', 'Eric Burdon'),
    ('Blues Magoos', 'Tramp'),
    ('The Charlatans', 'Fragile Rock'),
    ('Count Five', 'Free'),
    ('Country Joe and the Fish', 'Vern Williams'),
    ('Coven', 'Rites of Spring'),
    ('Cream', 'Cream'),
    ('Deep Purple', 'The Anniversary'),
    ('The Deviants', 'Nude'),
    ('The Doors', 'Secondhand Serenade'),
    ('The Electric Prunes', 'Secondhand Serenade'),
    ('The Fugs', 'Secondhand Serenade'),
    ('Grateful Dead', 'Secondhand Serenade'),
    ('The Great Society', 'Pillar'),
    ('The Groundhogs', 'The Groundhogs'),
    ('Hawkwind', 'Wishbone Ash'),
    ('Iron Butterfly', 'Steamhammer'),
    ('Jefferson Airplane', 'Free'),
    ('The Jimi Hendrix Experience', 'T