<h1>Document Similarity using LSI</h1>

<h4>In this assignment we’re going to practice document similarity. Here’s
what you need to do:</h4>
<ol>
<li>From Wikipedia’s Lists of musicians page (https://en.wikipedia.org/wiki/Lists_of_musicians), pick five lists of
musicians (e.g., List of big band musicians). You can pick any five
you like, but make sure that each list has the word “musicians” in
its title and that it lists at least 30 musicians.
<li>Collect the urls of all the musicians on those five pages and place them in a list.
<li>Grab the page text for each musician in the list and place the texts in a list (of documents).
<li>Build an LSI model using this data. This is your "reference" data set.
<li>Now grab another list of musicians from Wikipedia and create a new list of documents using the text from each musician's page. This is your "musician" data set.
<li>For each musician in the new list, find the musician in the reference data set that is the closest in similarity.
<li>Print a table that contains each musician from the musician data set and the most similar musician from the reference data set.
</ol>
<h4>Use the code below to build your solution</h4>

<p><span style="color:blue">get_musicians</span>: A function that, given a "list of musicians" url, returns a list of (name, url) pairs: the name of each musician and the url of their Wikipedia page
<p>non_musician_finder tries its best to remove links that are not musician links from the page (not perfect, but good enough!)
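<p>One reason the filter is "not perfect": it looks for its keywords anywhere in the link, so a musician whose page title happens to contain "Main" or "List" is dropped as well. A minimal sketch of a stricter check, assuming that non-article links either start with a namespace prefix ending in a colon or start with "List_of_". The name is_probable_musician_link and the sample links are made up for illustration and are not part of the assignment code:

```python
# Hypothetical alternative to non_musician_finder (illustration only).
NAMESPACE_PREFIXES = ('Category:', 'Template:', 'Portal:', 'File:',
                      'Special:', 'Help:', 'User:', 'Wikipedia:', 'Talk:')

def is_probable_musician_link(link):
    # Only relative article links of the form /wiki/<title> are considered
    if not link.startswith('/wiki/'):
        return False
    title = link[len('/wiki/'):]
    # Namespaced pages (Category:, File:, ...) are not musician articles
    if title.startswith(NAMESPACE_PREFIXES):
        return False
    # Neither are list pages or the main page
    if title.startswith(('List_of_', 'Lists_of_')) or title == 'Main_Page':
        return False
    return True

# Made-up sample links, chosen to show the difference from substring matching
sample_links = ['/wiki/Lisa_Germano',
                '/wiki/Category:Blues',
                '/wiki/List_of_emo_artists',
                '/wiki/Mainliner_(band)']
kept = [link for link in sample_links if is_probable_musician_link(link)]
print(kept)
```

<p>The last sample link contains "Main", so non_musician_finder would reject it, while the prefix-based check keeps it.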

In [None]:
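from bs4 import BeautifulSoup
import requests

def get_musicians(url):
    # Download the list page and collect every <li> item on it
    page_soup = BeautifulSoup(requests.get(url).content,'lxml')
    li_tags = page_soup.find_all('li')
    all_musicians = list()
    for tag in li_tags:
        # Skip <li> tags that carry an id; musician entries do not have one
        if tag.get('id'):
            continue
        # The first link in the item is taken to be the musician's page
        a_tag = tag.find('a')
        if a_tag is None or not a_tag.get('href'):
            continue
        link = a_tag.get('href')
        name = a_tag.get_text()
        # Keep only relative article links that pass the non-musician filter
        if link.startswith("/wiki/") and non_musician_finder(link):
            all_musicians.append((name,"https://en.wikipedia.org" + link))
    return all_musicians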

def non_musician_finder(link):
    non_musician_words = ['Category','Template','Portal','List','File','Special','Main','Help','User']
    for word in non_musician_words:
        if word in link:
            return False
    return True

<h4>testing the function</h4>
<li>Note that Wikipedia does not have a standard for its page design so this code may not work with every list

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_alternative_country_musicians"
get_musicians(url)

<h4>get_musician_text(url): returns the page text of the Wikipedia page associated with a musician</h4>
<li>Since we're not sure this will always work, we use a try ... except to catch exceptions
<li>If it doesn't work, the function returns None
<li>When that happens, we will need to delete that (musician, url) pair from our musicians list

In [None]:
def get_musician_text(url):
    from bs4 import BeautifulSoup
    import requests
    try:
        page_soup = BeautifulSoup(requests.get(url).content,'lxml')
        # Concatenate the text of every paragraph on the page
        all_text = ''.join(p_tag.get_text() for p_tag in page_soup.find_all('p'))
    except Exception:
        # Any download or parsing failure is reported to the caller as None
        return None
    return all_text


<h4>testing get_musician_text</h4>

In [None]:
url = "https://en.wikipedia.org/wiki/Jim_Morrison"
get_musician_text(url)

<p><span style="color:blue">get_all_musicians</span>: A function that, given a list of genres, returns a list of (name,url) pairs for all the musicians on the Wikipedia "list of musicians" pages for those genres
<p>You need to:
<ol>
<li>initialize a list "all_musicians"
<li>iterate through the list of genres
<li>construct a url for the list of musicians (I've done these first three steps for you)
<li>call get_musicians for that url
<li>extend all_musicians by what get_musicians returns
</ol>
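<p>The steps above can be sketched as follows. To keep the sketch runnable without a network connection, it takes the page-fetching function as a parameter (in the assignment that role is played by get_musicians). The names collect_musicians and fake_get_musicians, the genre names, and the canned data are made-up stand-ins:

```python
def collect_musicians(genre_list, fetch):
    # fetch(url) is assumed to return a list of (name, url) pairs
    all_musicians = list()
    for genre in genre_list:
        url = 'https://en.wikipedia.org/wiki/List_of_' + genre
        # extend (rather than append) keeps one flat list of pairs across genres
        all_musicians.extend(fetch(url))
    return all_musicians

def fake_get_musicians(url):
    # Stand-in for get_musicians: returns canned data instead of scraping
    canned = {
        'https://en.wikipedia.org/wiki/List_of_genre_a_musicians':
            [('Musician One', 'https://example.org/one'),
             ('Musician Two', 'https://example.org/two')],
        'https://en.wikipedia.org/wiki/List_of_genre_b_musicians':
            [('Musician Three', 'https://example.org/three')],
    }
    return canned.get(url, [])

pairs = collect_musicians(['genre_a_musicians', 'genre_b_musicians'],
                          fake_get_musicians)
print(pairs)
```

<p>Using append instead of extend would produce a list of lists (one per genre) rather than a single list of pairs.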

In [None]:
def get_all_musicians(genre_list):
    all_musicians = list()
    for genre in genre_list:
        url = 'https://en.wikipedia.org/wiki/List_of_' + genre
    
        #Your code here
    
    return all_musicians

<h4>Example of how to use get_all_musicians</h4>

In [None]:
genre_list = ['bluegrass_musicians#G','British_blues_musicians','country_blues_musicians','emo_artists']
all_musicians = get_all_musicians(genre_list)

<p><span style="color:blue">get_all_musician_docs</span>: A function that, given the list of (musician,url) pairs, returns two lists: a list of musician names and a parallel (same size) list of documents.

<p>You need to:

<ol>
<li>initialize the two lists
<li>iterate through the all_musicians list
<li>extract the name and the url of the musician
<li>get the text using the get_musician_text() function
<li>if the function returns None, ignore it and move to the next musician
<li>otherwise, append the name to the musician_names list and the text to the musician_texts list
<li>return musician_names and musician_texts
</ol>
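<p>A minimal sketch of these steps, again with the text-fetching function passed in so that it runs offline (in the assignment that role is played by get_musician_text). The names collect_docs and fake_get_musician_text and the sample pages are hypothetical:

```python
def collect_docs(all_musicians, fetch_text):
    # fetch_text(url) is assumed to return the page text, or None on failure
    musician_names = list()
    musician_texts = list()
    for name, url in all_musicians:
        text = fetch_text(url)
        if text is None:
            # Skip this musician entirely so the two lists stay aligned
            continue
        musician_names.append(name)
        musician_texts.append(text)
    return musician_names, musician_texts

def fake_get_musician_text(url):
    # Stand-in for get_musician_text: unknown urls yield None
    pages = {
        'https://example.org/one': 'plays slide guitar',
        'https://example.org/three': 'sings in a band',
    }
    return pages.get(url)

sample_pairs = [('Musician One', 'https://example.org/one'),
                ('Musician Two', 'https://example.org/missing'),
                ('Musician Three', 'https://example.org/three')]
names, texts = collect_docs(sample_pairs, fake_get_musician_text)
print(names)
print(texts)
```

<p>Because a failed page is skipped in both lists at once, names[i] always belongs to texts[i], which is what the similarity step relies on later.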


In [None]:
def get_all_musician_docs(all_musicians):
    musician_names = list()
    musician_texts = list()
    for musician in all_musicians:
        name = musician[0]
        url = musician[1]
        
        #Your code here
    return musician_names,musician_texts
        

<h4>Example of how to use get_all_musician_docs</h4>

In [None]:
reference_names,reference_docs = get_all_musician_docs(all_musicians)

<h3>Set up the LSI model</h3>
<li>reference_docs is the list of documents
<li>construct texts, dictionary, and corpus (see the class IPython notebook)
<li>construct an LSI model. Use 5 topics initially, but you should experiment with this number

In [None]:
from gensim import models

#Code for LSI model goes here (build texts, dictionary, and corpus before the line below)


lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=5)

<h3>Construct the "musician" data set</h3>
<h4>Example</h4>

In [None]:
musician_genre_list = ['acid_rock_artists']
all_musicians = get_all_musicians(musician_genre_list)
musician_names,musician_docs = get_all_musician_docs(all_musicians)

<h4>find the most similar musicians for each new musician from our reference data set</h4>

In [None]:
table_data = list()
for index,musician in enumerate(musician_docs):
    
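    #Your similarity code here. Use the in-class notebook as a reference.
    #The line below expects sims to be a list of (reference_index, similarity)
    #pairs sorted from most to least similar.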
    
    most_similar_musician = sims[0][0]
    table_data.append((musician_names[index],reference_names[most_similar_musician]))
    
#Write code to print table_data after the for loop ends
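<p>One way to print the table is with plain string formatting. A minimal sketch, assuming table_data is a list of (musician, most similar reference musician) pairs as built in the loop above. The rows here are placeholders, and format_table is a made-up helper name:

```python
def format_table(rows, headers=('Musician', 'Most similar reference musician')):
    # Width of the first column: the longest entry, header included
    width = max([len(headers[0])] + [len(left) for left, _ in rows])
    lines = [f'{headers[0]:<{width}}  {headers[1]}']
    # Separator line as wide as the header line
    lines.append('-' * len(lines[0]))
    for left, right in rows:
        lines.append(f'{left:<{width}}  {right}')
    return '\n'.join(lines)

# Placeholder rows standing in for table_data
sample_rows = [('Placeholder A', 'Reference X'),
               ('Placeholder Musician B', 'Reference Y')]
table_text = format_table(sample_rows)
print(table_text)
```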
    