<h1>Document Similarity using LSI</h1>

<h4>In this assignment we’re going to practice document similarity. Here’s
what you need to do:</h4>
<ol>
<li>From Wikipedia’s Lists of writers page (https://en.wikipedia.org/wiki/Lists_of_writers), pick five lists of
writers (e.g., List of detective fiction authors). You can pick any five
you like, but make sure that each list has at least 30 writers listed.
<li>Collect the URLs of all the writers on those five pages and place them in a list.
<li>Grab the page content for each writer in the list and place it in a list (of documents).
<li>Build an LSI model using this data. This is your "reference" data set.
<li>Now grab another list of writers from Wikipedia and create a new list of documents using the text of each writer's page. This is your "writer" data set.
<li>For each writer in the new list, find the writer in the reference data set with the lowest similarity that is still at least 0.6.
<li>Print a table that contains each writer from the writer data set and the matching writer from the reference data set.
</ol>
<h4>Use the code below to build your solution</h4>

<p><span style="color:blue">get_writers</span>: A function that, given a "list of writers" URL, returns a list of (name, url) pairs holding the writers' names and the URLs of their Wikipedia pages
<p><span style="color:blue">non_writer_finder</span>: A helper that tries its best to reject links on the page that are not writer links (not perfect, but good enough!)

In [97]:
from bs4 import BeautifulSoup
import requests

def get_writers(url):
    # Parse the list page and collect (name, url) pairs from its <li> entries
    page_soup = BeautifulSoup(requests.get(url).content, 'lxml')
    all_writers = list()
    for tag in page_soup.find_all('li'):
        # Skip <li> tags that carry an id (typically footnotes and navigation items)
        if tag.get('id'):
            continue
        a_tag = tag.find('a')
        if a_tag is None:
            continue  # list item without a link
        link = a_tag.get('href')
        name = a_tag.get_text()
        if link and "/wiki/" in link and non_writer_finder(link):
            all_writers.append((name, "https://en.wikipedia.org" + link))
    return all_writers

def non_writer_finder(link):
    # Return False when the link contains a word that usually marks a non-writer page
    non_writer_words = ['Category','Portal','List','File','Template','Special','Main','Help','User','https']
    for word in non_writer_words:
        if word in link:
            return False
    return True
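
<p>Because non_writer_finder matches substrings anywhere in the link, it can also reject a real writer whose name happens to contain one of the words. The sketch below is one possible stricter check, not part of the assignment code: looks_like_article and NON_ARTICLE_PREFIXES are hypothetical names, the prefix list is an assumption, and /wiki/Jane_Mainwaring is a made-up link used only for illustration.

```python
from urllib.parse import unquote

# Hypothetical prefix list: namespaces and list pages we assume are never writer pages
NON_ARTICLE_PREFIXES = ('Category:', 'Template:', 'Portal:', 'File:', 'Special:',
                        'Help:', 'User:', 'Wikipedia:', 'Talk:',
                        'List_of_', 'Lists_of_', 'Main_Page')

def looks_like_article(link):
    """Return True for internal /wiki/ links whose title does not start with a listed prefix."""
    if not link or not link.startswith('/wiki/'):
        return False
    title = unquote(link[len('/wiki/'):])
    # str.startswith accepts a tuple and checks every prefix in it
    return not title.startswith(NON_ARTICLE_PREFIXES)

print(looks_like_article('/wiki/Margery_Allingham'))       # True
print(looks_like_article('/wiki/Category:Crime_writers'))  # False
print(looks_like_article('/wiki/Jane_Mainwaring'))         # True ('Main' is not a prefix here)
```

<p>Checking only the start of the title trades some recall for fewer false rejections; whether that is worth it depends on the lists you pick.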

<h4>testing the function</h4>
<li>Note that Wikipedia does not have a standard for its page design so this code may not work with every list

In [98]:
url = "https://en.wikipedia.org/wiki/List_of_detective_fiction_authors"
get_writers(url)

[('Mario Acevedo', 'https://en.wikipedia.org/wiki/Mario_Acevedo_(author)'),
 ('Douglas Adams', 'https://en.wikipedia.org/wiki/Douglas_Adams'),
 ('Margery Allingham', 'https://en.wikipedia.org/wiki/Margery_Allingham'),
 ('Rudolfo Anaya', 'https://en.wikipedia.org/wiki/Rudolfo_Anaya'),
 ('Gosho Aoyama', 'https://en.wikipedia.org/wiki/Gosho_Aoyama'),
 ('Frank Arnau', 'https://en.wikipedia.org/wiki/Frank_Arnau'),
 ('Taku Ashibe', 'https://en.wikipedia.org/wiki/Taku_Ashibe'),
 ('Ace Atkins', 'https://en.wikipedia.org/wiki/Ace_Atkins'),
 ('Kate Atkinson', 'https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)'),
 ('Yukito Ayatsuji', 'https://en.wikipedia.org/wiki/Yukito_Ayatsuji'),
 ('Nevada Barr', 'https://en.wikipedia.org/wiki/Nevada_Barr'),
 ('Earle Basinsky', 'https://en.wikipedia.org/wiki/Earle_Basinsky'),
 ('M. C. Beaton', 'https://en.wikipedia.org/wiki/M._C._Beaton'),
 ('E. C. Bentley', 'https://en.wikipedia.org/wiki/Edmund_Clerihew_Bentley'),
 ('Larry Beinhart', 'https://en.wikipedia.

<h4>get_writer_text(url): returns the page text of the Wikipedia page associated with a writer</h4>
<li>Since we're not sure if this will always work, we use a try ... except to catch exceptions
<li>If it doesn't work, the function returns None
<li>We will need to delete this (writer, url) pair from our writers list

In [99]:
from bs4 import BeautifulSoup
import requests

def get_writer_text(url):
    # Concatenate the text of every <p> tag on the page; return None if anything fails
    try:
        page_soup = BeautifulSoup(requests.get(url).content, 'lxml')
        return ''.join(p_tag.get_text() for p_tag in page_soup.find_all('p'))
    except Exception:
        return None

<h4>testing get_writer_text</h4>

In [100]:
url = "https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)"
get_writer_text(url)

'\nKate Atkinson MBE (born 20 December 1951) is an English writer of novels, plays and short stories.[1] She is known for creating the Jackson Brodie series of detective novels, which has been adapted into the BBC One series Case Histories.[1][2] She won the Whitbread Book of the Year prize in 1995 in the Novels category for Behind the Scenes at the Museum, winning again in 2013 and 2015 under its new name the Costa Book Awards.[1]\nThe daughter of a shopkeeper, Atkinson was born in York, the setting for several of her books.[3] She studied English literature at the University of Dundee, gaining her master\'s degree in 1974.[1] Atkinson subsequently studied for a doctorate in American literature, with a thesis titled "The post-modern American short story in its historical context".[3] She failed at the viva (oral examination) stage. After leaving the university, she took on a variety of jobs, from home help to legal secretary and teacher.[4]\nHer first novel, Behind the Scenes at the M
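
<p>The sample output above still contains footnote markers such as [1] and [2]. If you would rather not feed those to the model, one option is to strip them with a regular expression. This is a minimal sketch: strip_citations is a hypothetical helper, and the sample string is a shortened paraphrase of the text above.

```python
import re

# Matches numeric footnote markers like [1] or [23]
CITATION = re.compile(r'\[\d+\]')

def strip_citations(text):
    """Remove numeric footnote markers from page text."""
    return CITATION.sub('', text)

sample = 'She is known for the Jackson Brodie series.[1][2] She was born in York.[3]'
print(strip_citations(sample))
# She is known for the Jackson Brodie series. She was born in York.
```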

<p><span style="color:blue">get_all_writers</span>: A function that, given a list of genres, collects the writers' names and the URLs of their Wikipedia pages for every genre in that list
<p>The function should return a single list of (name, url) pairs covering all the writers in the list of genres
<p>You need to:
<ol>
<li>iterate through the list of genres
<li>initialize a list "all_writers"
<li>construct a url for the list of writers (I've done these first three steps for you)
<li>call get_writers for that url
<li>extend all_writers by what get_writers returns
</ol>

In [101]:
def get_all_writers(genre_list):
    all_writers = list()
    for genre in genre_list:
        # Construct the URL for the list of writers
        url = f"https://en.wikipedia.org/wiki/List_of_{genre}"
        # Call get_writers for that URL
        writers = get_writers(url)
        # Extend all_writers by what get_writers returns
        all_writers.extend(writers)
    return all_writers
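
<p>A writer can appear on more than one genre list (Rudolfo Anaya, for example, shows up in both the detective fiction and the Western fiction output in this notebook), so the combined list may contain the same page twice. If that matters for your model, a sketch like the following removes repeats while keeping the original order. dedupe_writers is a hypothetical helper, not part of the provided code.

```python
def dedupe_writers(pairs):
    """Drop repeated (name, url) pairs, keeping the first occurrence of each URL."""
    seen = set()
    unique = []
    for name, url in pairs:
        if url not in seen:
            seen.add(url)
            unique.append((name, url))
    return unique

pairs = [
    ('Rudolfo Anaya', 'https://en.wikipedia.org/wiki/Rudolfo_Anaya'),
    ('Douglas Adams', 'https://en.wikipedia.org/wiki/Douglas_Adams'),
    ('Rudolfo Anaya', 'https://en.wikipedia.org/wiki/Rudolfo_Anaya'),
]
print(len(dedupe_writers(pairs)))  # 2
```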



<h4>Example of how to use get_all_writers</h4>

In [102]:
genre_list = ['detective_fiction_authors', 'romantic_novelists']
all_writers = get_all_writers(genre_list)
all_writers

[('Mario Acevedo', 'https://en.wikipedia.org/wiki/Mario_Acevedo_(author)'),
 ('Douglas Adams', 'https://en.wikipedia.org/wiki/Douglas_Adams'),
 ('Margery Allingham', 'https://en.wikipedia.org/wiki/Margery_Allingham'),
 ('Rudolfo Anaya', 'https://en.wikipedia.org/wiki/Rudolfo_Anaya'),
 ('Gosho Aoyama', 'https://en.wikipedia.org/wiki/Gosho_Aoyama'),
 ('Frank Arnau', 'https://en.wikipedia.org/wiki/Frank_Arnau'),
 ('Taku Ashibe', 'https://en.wikipedia.org/wiki/Taku_Ashibe'),
 ('Ace Atkins', 'https://en.wikipedia.org/wiki/Ace_Atkins'),
 ('Kate Atkinson', 'https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)'),
 ('Yukito Ayatsuji', 'https://en.wikipedia.org/wiki/Yukito_Ayatsuji'),
 ('Nevada Barr', 'https://en.wikipedia.org/wiki/Nevada_Barr'),
 ('Earle Basinsky', 'https://en.wikipedia.org/wiki/Earle_Basinsky'),
 ('M. C. Beaton', 'https://en.wikipedia.org/wiki/M._C._Beaton'),
 ('E. C. Bentley', 'https://en.wikipedia.org/wiki/Edmund_Clerihew_Bentley'),
 ('Larry Beinhart', 'https://en.wikipedia.

<p><span style="color:blue">get_all_writer_docs</span>: A function that, given the list of (writer,url) pairs, returns two lists, a list of writers and a parallel (same size) list of documents. 

<p>You need to:

<ol>
<li>initialize the two lists

<li>iterate through the all_writers list
<li>extract the name and the url of the writer
<li>get the text using the predefined function get_writer_text
<li>if the function returns None, ignore it and move to the next writer
<li>otherwise, append the name to the writer_names list and the text to the writer_texts list
<li>return writer_names and writer_texts
</ol>


In [103]:
def get_all_writer_docs(all_writers):
    writer_names = list()
    writer_texts = list()
    for writer in all_writers:
        name, url = writer
        text = get_writer_text(url)
        if text is not None:
            writer_names.append(name)
            writer_texts.append(text)
    return writer_names, writer_texts


<h4>Example of how to use get_all_writer_docs</h4>

In [104]:
reference_names,reference_docs = get_all_writer_docs(all_writers)
print(len(reference_names),len(reference_docs))

979 979


<h3>Set up the LSI model</h3>
<li>reference_docs is the list of documents
<li>construct texts, dictionary, and corpus (see class iPython notebook)
<li>construct an LSI model. Use 5 topics initially but you should play around with this number
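
<p>To make the "dictionary" and "corpus" steps concrete, here is a tiny pure-Python sketch of the bag-of-words idea. It only mimics the shape of what corpora.Dictionary and doc2bow produce; gensim's own id assignment may differ, and toy_texts, token2id, and to_bow are hypothetical names used for illustration.

```python
from collections import Counter

# Two already-tokenized toy documents
toy_texts = [['holmes', 'detective', 'holmes'], ['detective', 'novel']]

# "Dictionary": map each distinct token to an integer id (first-seen order here)
token2id = {}
for text in toy_texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

def to_bow(text):
    """Return sorted (token_id, count) pairs; tokens missing from the dictionary are ignored."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(text) for text in toy_texts]
print(token2id)    # {'holmes': 0, 'detective': 1, 'novel': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

<p>Each document becomes a sparse vector of word counts, and the LSI model is fit on those vectors.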

In [105]:
!pip install --user -U nltk
import pprint
import nltk
from nltk import sent_tokenize
from gensim import corpora, models
from gensim.parsing.preprocessing import STOPWORDS
nltk.download('punkt')

# Normalize each document: strip every sentence and drop embedded newlines
for i, doc in enumerate(reference_docs):
    sents = [sent.strip().replace('\n', '') for sent in sent_tokenize(doc)]
    reference_docs[i] = '. '.join(sents)

# Tokenize: lowercase, split on whitespace, keep alphanumeric non-stopword tokens
texts = [[word for word in doc.lower().split()
          if word not in STOPWORDS and word.isalnum() and word != 'slate']
         for doc in reference_docs]

dictionary = corpora.Dictionary(texts)  # (word_id, word) pairs
corpus = [dictionary.doc2bow(text) for text in texts]  # (word_id, freq) pairs per document

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=5)

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(lsi.print_topics(num_words=8))




[nltk_data] Downloading package punkt to /Users/chining/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[   (   0,
        '0.267*"published" + 0.264*"caine" + 0.227*"novels" + 0.187*"new" + '
        '0.176*"wrote" + 0.166*"novel" + 0.145*"book" + 0.138*"books"'),
    (   1,
        '-0.742*"caine" + 0.240*"holmes" + 0.155*"detective" + 0.141*"novels" '
        '+ -0.113*"isle" + 0.107*"stories" + 0.106*"sherlock" + -0.106*"film"'),
    (   2,
        '0.640*"holmes" + 0.266*"sherlock" + 0.197*"conan" + -0.196*"novels" + '
        '0.193*"doyle" + 0.186*"caine" + 0.167*"watson" + 0.157*"detective"'),
    (   3,
        '0.680*"ray" + 0.455*"film" + -0.137*"caine" + 0.125*"films" + '
        '0.117*"indian" + -0.115*"novels" + -0.105*"holmes" + 0.105*"bengali"'),
    (   4,
        '0.612*"poe" + -0.217*"hibbert" + 0.170*"allan" + -0.166*"holmes" + '
        '-0.165*"ray" + -0.155*"jean" + -0.150*"novels" + -0.147*"plaidy"')]


<h3>Construct the writer data set</h3>
<h4>Example</h4>

In [106]:
writer_genre_list = ['Western_fiction_authors']
all_writers = get_all_writers(writer_genre_list)
writer_names, writer_docs = get_all_writer_docs(all_writers)

<h4>find the least similar writers with at least 0.6 similarity for each new writer from our reference data set</h4>
<li>Write code to print table_data after the for loop ends
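
<p>MatrixSimilarity scores a query against the indexed documents with cosine similarity. As a reminder of what that number means, here is a small pure-Python sketch on made-up topic vectors; cosine is a hypothetical helper and is not how gensim implements the computation internally.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [1, 0]))   # 1.0  (same direction)
print(cosine([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

<p>A threshold of 0.6 therefore keeps only reference writers whose topic vectors point in a broadly similar direction to the query writer's.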

In [107]:
from gensim import similarities

# Index the reference documents in LSI space
index = similarities.MatrixSimilarity(lsi[corpus])

# Project the new writers' documents into the same LSI space
writer_corpus = [dictionary.doc2bow(doc.lower().split()) for doc in writer_docs]
writer_lsi = [lsi[doc] for doc in writer_corpus]

table_data = list()
for i, doc in enumerate(writer_lsi):
    sims = index[doc]  # similarity to every reference document
    sims_filtered = [(doc_id, sim) for doc_id, sim in enumerate(sims) if sim >= 0.6]
    # Writers with no reference match at or above 0.6 are left out of the table
    if sims_filtered:
        # Least similar reference writer among those that pass the threshold
        doc_id, _ = min(sims_filtered, key=lambda item: item[1])
        table_data.append((writer_names[i], reference_names[doc_id]))

for row in table_data:
    print(row)
 

 

('Edward Abbey', 'Eleanor Burford')
('Andy Adams', 'Jean Barrett')
('William Lacey Amy', 'Dorothy Phillips')
('Rudolfo Anaya', 'Heather Graham')
('Todhunter Ballard', 'Ann Roth')
('S. Omar Barker', 'Jean Barrett')
('Rex Beach', 'Jean Barrett')
('James Warner Bellah', 'Jean Barrett')
('Don Bendell', 'Sir Arthur Conan Doyle')
('Tom W. Blackburn', 'Barbara Dawson Smith')
('James Carlos Blake', 'Stephanie James')
('William Blinn', 'M. L. Buchman')
('Stephen Bly', 'Sir Arthur Conan Doyle')
('Frank Bonham', 'Jean Barrett')
('Allan R. Bosworth', 'Jean Barrett')
('Peter Bowen', 'Detective fiction')
('B.M. Bower', 'Heather Graham')
('Leigh Brackett', 'Edgar Allan Poe')
('Max Brand', 'Barbara Dawson Smith')
('Lyle Brandt', 'Heather Graham')
('Peter Brandvold', 'Heather Graham')
('Matt Braun', 'Heather Graham')
('Dee Brown', 'Jean Barrett')
('Anthony Burgess', 'Jean Barrett')
('Walter Noble Burns', 'Barbara Dawson Smith')
('Daniel Carlson ', 'Margaret Moore')
('David Wynford Carnegie', 'Jacquelin
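
<p>The selection rule used above ("smallest similarity that is still at least 0.6") can be isolated in a small function, which also makes the no-match case explicit. This is a sketch on made-up scores; least_similar_above is a hypothetical helper.

```python
def least_similar_above(sims, threshold=0.6):
    """Return (index, score) of the smallest score >= threshold, or None if nothing qualifies."""
    candidates = [(i, s) for i, s in enumerate(sims) if s >= threshold]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[1])

print(least_similar_above([0.95, 0.61, 0.40, 0.72]))  # (1, 0.61)
print(least_similar_above([0.10, 0.59]))              # None
```

<p>Returning None lets the caller decide whether to skip the writer or print a placeholder row.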

# Some simple sentiment analysis

In this part we are going to run some simple sentiment analysis to find out which writer has the most positive description.

Define a function simple_sentiment_analysis(writer_names, writer_docs) that takes as inputs the list of writers and their corresponding descriptions.
The expected output is a list in which each element holds the writer's name, the percentage of positive words in their description, and the percentage of negative words in their description.

In [108]:
#Example output
"""
[('William Blinn', 0.81, 0.54),
 ('Stephen Bly', 0.75, 0.94),
 ('Frank Bonham', 3.73, 0.62)
 ...]
"""

To ensure results can be compared, please use the following function to define your lists of positive and negative words:

In [109]:
import requests

def get_pos_neg_words():
    def get_words(url):
        # Download the word list; skip blank lines and lines containing ';'
        words = requests.get(url).content.decode('latin-1')
        return [word for word in words.split('\n') if word and ';' not in word]
    # Get lists of positive and negative words
    p_url = 'http://ptrckprry.com/course/ssd/data/positive-words.txt'
    n_url = 'http://ptrckprry.com/course/ssd/data/negative-words.txt'
    positive_words = get_words(p_url)
    negative_words = get_words(n_url)
    return positive_words, negative_words

In [110]:
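from nltk import word_tokenize

positive_words, negative_words = get_pos_neg_words()
# Sets make the membership tests below much faster than scanning lists
positive_set, negative_set = set(positive_words), set(negative_words)

def simple_sentiment_analysis(writer_names, writer_docs):
    results = list()
    for name, text in zip(writer_names, writer_docs):
        tokens = word_tokenize(text)  # tokenize once per document
        total = max(len(tokens), 1)   # avoid dividing by zero on an empty document
        cpos = sum(1 for word in tokens if word in positive_set)
        cneg = sum(1 for word in tokens if word in negative_set)
        # Percentages are formatted as strings with two decimals
        results.append((name, f"{cpos / total * 100:.2f}", f"{cneg / total * 100:.2f}"))
    return results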
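
Note that word_tokenize keeps the original capitalization, so a word at the start of a sentence ("Great") will not match a lowercase lexicon entry ("great"). The sketch below shows the effect of lowercasing tokens first. It assumes the lexicon entries are lowercase; sentiment_percentages and the two tiny word sets are hypothetical and used only for illustration, and changing the matching would also change the numbers, so keep the assignment's behaviour if results need to be compared.

```python
def sentiment_percentages(tokens, positive, negative):
    """Return (percent positive, percent negative) after lowercasing each token."""
    if not tokens:
        return (0.0, 0.0)
    lowered = [t.lower() for t in tokens]
    pos = sum(t in positive for t in lowered)
    neg = sum(t in negative for t in lowered)
    return (100 * pos / len(lowered), 100 * neg / len(lowered))

# Toy lexicons standing in for the real positive and negative word lists
toy_positive = {'great', 'award'}
toy_negative = {'failed'}

tokens = 'Great reviews followed , but the sequel failed'.split()
print(sentiment_percentages(tokens, toy_positive, toy_negative))  # (12.5, 12.5)
```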

In [111]:
simple_sentiment_analysis(writer_names, writer_docs)

[('Edward Abbey', '2.09', '2.40'),
 ('Andy Adams', '1.94', '1.94'),
 ('William Lacey Amy', '1.16', '1.62'),
 ('Rudolfo Anaya', '1.77', '1.11'),
 ('Todhunter Ballard', '1.60', '3.19'),
 ('S. Omar Barker', '1.20', '0.13'),
 ('Rex Beach', '2.66', '1.60'),
 ('James Warner Bellah', '0.67', '0.67'),
 ('Don Bendell', '2.27', '1.14'),
 ('Tom W. Blackburn', '1.45', '0.66'),
 ('James Carlos Blake', '2.19', '1.10'),
 ('William Blinn', '0.81', '0.54'),
 ('Stephen Bly', '0.75', '0.94'),
 ('Frank Bonham', '3.73', '0.62'),
 ('Allan R. Bosworth', '1.89', '0.63'),
 ('Peter Bowen', '2.61', '0.87'),
 ('B.M. Bower', '1.60', '1.49'),
 ('Leigh Brackett', '1.37', '2.24'),
 ('Max Brand', '0.26', '0.79'),
 ('Lyle Brandt', '5.62', '0.62'),
 ('Peter Brandvold', '0.88', '0.44'),
 ('Matt Braun', '1.62', '0.00'),
 ('Dee Brown', '2.62', '0.59'),
 ('Anthony Burgess', '1.31', '1.61'),
 ('Walter Noble Burns', '0.33', '1.64'),
 ('Daniel Carlson ', '2.00', '0.67'),
 ('David Wynford Carnegie', '1.75', '1.63'),
 ('Forrest 
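
To answer the question this part started with (which writer has the most positive description), the results can be ranked by the positive percentage. Because the percentages are stored as strings, they should be converted to float before comparing; otherwise '10.00' would sort below '5.62'. A minimal sketch using three rows copied from the output above (sample_results is a hypothetical name):

```python
# Three rows taken from the printed results
sample_results = [
    ('Frank Bonham', '3.73', '0.62'),
    ('Lyle Brandt', '5.62', '0.62'),
    ('Matt Braun', '1.62', '0.00'),
]

# Compare numerically, not as strings
most_positive = max(sample_results, key=lambda row: float(row[1]))
print(most_positive[0])  # Lyle Brandt

# Full ranking, most positive first
ranking = sorted(sample_results, key=lambda row: float(row[1]), reverse=True)
print([name for name, _, _ in ranking])  # ['Lyle Brandt', 'Frank Bonham', 'Matt Braun']
```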