# Parsing the list of short stories

Solving the [science fiction anthology problem](https://auxiliarymemory.com/2018/08/18/the-mathematics-of-buying-science-fiction-anthologies/) first requires importing the [list of classic short stories v2](https://csfquery.com/SearchResult?mincite=8&category=story&sortby=7&list=1).



In [1]:
import requests, bs4
import pandas as pd

res = requests.get('https://csfquery.com/SearchResult?mincite=8&category=story&sortby=7&list=1') #Download the webpage
res.raise_for_status() #Check that download was OK
soup = bs4.BeautifulSoup(res.text) #Parse HTML
classics_table = soup.find("table", attrs={"class": "table pt-3"}) #Find the table with the list of stories
classics_table_data = classics_table.tbody.find_all("a", attrs={"class": "a-csf"}) #Find all links in the table

df = pd.DataFrame(columns=['story_author', 'story_title', 'story_link']) #Create dataframe for storing the list of stories
for i in classics_table_data: #Loop over the links found previously 
    if "title.cgi" in i['href']:
        story_link = i['href']
        story_title = i.text
    elif "ea.cgi" in i['href']:
        story_author = i.text
        df.loc[len(df)]=([story_author,story_title,story_link]) #Append dataframe with author, title, story link

df = df.groupby(['story_title','story_link'])['story_author'].apply(', '.join).reset_index() # Merge entries for stories with more than one author

# Creating a list of anthologies

With the short stories collected in a dataframe (`df`), we now need the list of publications on ISFDB.org that contain at least one of these stories.

In [2]:
main_df = pd.DataFrame(columns=['story_author', 'story_title', 'story_link','publication_title','publication_link']) #Create empty dataframe which will store a mapping between stories and anthologies
for index, story_row in df.iterrows(): #Loop over all stories
    story_author = story_row['story_author']
    story_title = story_row['story_title']
    story_link = story_row['story_link']
    story_link += '+1' #Change the ISFDB link to "Do not display translations"
    res = requests.get(story_link)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text)
    publications_table = soup.find("table", attrs={"class": "publications"}) #Find the table listing all publications of a story
    publications_df = pd.read_html(str(publications_table), header=0)[0] #Read the HTML table into a dataframe (there could be more tables, but we're interested in the first, indexed 0)
    publications_table_links = publications_table.find_all("a") #Extract all HTML links from the table
    publications_links = [] 
    for l in publications_table_links:
        if "pl.cgi" in l['href']:
            publications_links.append(l['href'])
    publications_df['publications_links']=publications_links #Add column with links to anthologies to dataframe
    # publications_df = publications_df[publications_df.Type.isin(['anth','omni'])] #Only anthologies, collections, and omnibuses are of interest, drop others
    publications_df = publications_df.drop_duplicates(subset='Title', keep="first") #Remove duplicates by exact title match (this could be improved)
    for index2, publication_row in publications_df.iterrows(): #Iterate over all publications
        publication_title = publication_row['Title']
        publication_link = publication_row['publications_links']
        # print("Adding to main_df: ", story_title, " in ", publication_title)
        main_df.loc[len(main_df)]=([story_author,story_title,story_link,publication_title,publication_link]) #Update main database, each entry (row) is a story-to-anthology mapping

ISFDB is missing the contents of two notable anthologies (perhaps because they are academic textbooks):

- [Sense of Wonder](http://www.isfdb.org/cgi-bin/pl.cgi?354562)
- [Science Fiction: Stories and Contexts](http://www.isfdb.org/cgi-bin/pl.cgi?612736)

We therefore add them from their listings at [Classics of Science Fiction](https://csfquery.com/).

In [3]:
def add_stories_from_anthology(publication_title, publication_link):
    res = requests.get(publication_link) #Download the webpage
    res.raise_for_status() #Check that download was OK
    soup = bs4.BeautifulSoup(res.text) #Parse HTML
    new_stories_table = soup.find("table", attrs={"class": "table"}) #Find the table with the list of stories
    new_stories_df = pd.read_html(str(new_stories_table), header=0)[0] #Read the HTML table into a dataframe (there could be more tables, but we're interested in the first, indexed 0)
    new_stories_df.columns=['#','Title','Author','Year']
    new_stories_df.Author = new_stories_df.Author.str.replace('  ', ' ') #Sanitize list of authors
    new_stories_df = new_stories_df[new_stories_df.Title.isin(df.story_title.to_list())]
    for index, new_stories_row in new_stories_df.iterrows(): #Iterate over all new stories to add
        story_author = new_stories_row.Author
        story_title = new_stories_row.Title
        story_link = df.story_link[(df.story_title==story_title)&(df.story_author==story_author)].values[0]
        # print("Adding to main_df:", story_title, "in", publication_title, "with link:", story_link)
        main_df.loc[len(main_df)]=([story_author,story_title,story_link,publication_title,publication_link]) #Update main database, each entry (row) is a story-to-anthology mapping 

add_stories_from_anthology('Sense of Wonder', 'https://csfquery.com/cworks?sortby=2&cid=37')
add_stories_from_anthology('Science Fiction: Stories and Contexts','https://csfquery.com/cworks?sortby=2&cid=49')

At this point you may want to save this mapping into a CSV file.

In [4]:
main_df.to_csv('main.csv')

If you're interested, you can list the anthologies that contain the most classic stories:

In [5]:
main_df['publication_title'].value_counts().head(10)

Sense of Wonder                                                                   34
The Wesleyan Anthology of Science Fiction                                         23
The Big Book of Science Fiction: The Ultimate Collection                          22
Science Fiction: Stories and Contexts                                             22
Science Fiction Hall of Fame: The Greatest Science Fiction Stories of All Time    17
The Science Fiction Hall of Fame, Volume I                                        17
The Science Fiction Hall of Fame, Volume One                                      17
The Science Fiction Hall of Fame, Volume One, 1929-1964                           17
The Science Fiction Hall of Fame                                                  17
The Road to Science Fiction: Volume 3: From Heinlein to Here                      16
Name: publication_title, dtype: int64

# Removing duplicate books

As you can see, duplicate entries still remain: *The Science Fiction Hall of Fame, Volume I* is listed under five alternative titles. The code below removes anthologies that contain exactly the same classic short stories as an earlier one, keeping only the first title encountered.

In [6]:
all_books=main_df['publication_title'].unique()
print("Initial unique books:",len(all_books))
reduced_main_df = main_df.copy()
for i in range(len(all_books)):
    first_book = all_books[i]
    stories_in_first = reduced_main_df.story_link[reduced_main_df.publication_title == first_book].to_list()
    if stories_in_first:
        stories_in_first.sort()
        # if (len(stories_in_first) < 3): 
        #     continue
        for j in range(i+1,len(all_books)):
            second_book = all_books[j]
            stories_in_second = reduced_main_df.story_link[reduced_main_df.publication_title == second_book].to_list()
            if stories_in_second:
                stories_in_second.sort()
                if stories_in_first == stories_in_second:
                    # print (first_book, "vs", second_book)
                    reduced_main_df = reduced_main_df[reduced_main_df.publication_title != second_book]
print("Reduced books to:", len(reduced_main_df['publication_title'].unique()))


Initial unique books: 1563
Reduced books to: 401
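
The pairwise comparison above can also be expressed by keying each book on the set of stories it contains. The following is a minimal sketch on hypothetical toy data, not the notebook's own code; the function name `dedupe_by_contents` and the titles and story labels are assumptions made for illustration. It keeps the first title seen for each distinct set of stories, which should match the behaviour of the loop above.

```python
# Hypothetical toy data: publication title -> set of classic stories it contains
contents = {
    "Hall of Fame, Volume I": {"story 1", "story 2"},
    "Hall of Fame, Volume One": {"story 2", "story 1"},  # same stories, other title
    "Another Anthology": {"story 1", "story 3"},
}

def dedupe_by_contents(contents):
    """Keep the first title seen for each distinct set of stories."""
    first_title = {}
    for title, stories in contents.items():
        # frozenset is hashable and ignores order, so identical contents collide
        first_title.setdefault(frozenset(stories), title)
    return list(first_title.values())

unique_books = dedupe_by_contents(contents)
print(unique_books)
```

Because each book is hashed once, this avoids comparing every pair of titles, which may matter with well over a thousand publications.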


In [7]:
reduced_main_df['publication_title'].value_counts().head(10)

Sense of Wonder                                                              34
The Wesleyan Anthology of Science Fiction                                    23
Science Fiction: Stories and Contexts                                        22
The Big Book of Science Fiction: The Ultimate Collection                     22
The Science Fiction Hall of Fame, Volume One                                 17
The Road to Science Fiction #3: From Heinlein to Here                        16
Science Fiction: The Science Fiction Research Association Anthology          16
The Prentice Hall Anthology of Science Fiction and Fantasy                   15
The Best of the Nebulas                                                      12
The Locus Awards: Thirty Years of the Best in Science Fiction and Fantasy    12
Name: publication_title, dtype: int64

# Greedy algorithm

One approach to solving the [set cover problem](https://en.wikipedia.org/wiki/Set_cover_problem) is the greedy algorithm implemented below.

In [8]:
import numpy as np
remaining_df = reduced_main_df.copy() #Create working copy of main database
remaining_df['story_desc']=remaining_df['story_title'] + ' by ' + remaining_df['story_author'] #Add new column of unique story description (the story title is not enough, there are two stories titled "The Star")
selected_books = []
selected_stories = []
while (np.unique(np.array(selected_stories)).size < len(df.story_title)): #Loop until we have covered all stories
    top_book = remaining_df['publication_title'].value_counts().head(1).index.values[0] #Find book with most stories
    print("Chosen book: ",top_book)
    selected_books.append(top_book)
    stories_in_top_book = remaining_df[remaining_df.publication_title==top_book].story_desc.tolist()
    selected_stories.extend(stories_in_top_book)
    print("Chosen stories: ",stories_in_top_book)
    remaining_df=remaining_df[~remaining_df.story_desc.isin(stories_in_top_book)] #Remove from database all entries of the stories we have selected
    print("Total stories selected: ",np.unique(np.array(selected_stories)).size)
print("All selected books: ",selected_books)
print("Total number of selected books: ",len(selected_books))

Chosen book:  Sense of Wonder
Chosen stories:  ['"Arena" by Fredric Brown', '"Bears Discover Fire" by Terry Bisson', '"Black Destroyer" by A. E. van Vogt', '"Blood Music" by Greg Bear', '"Bloodchild" by Octavia E. Butler', '"The Cold Equations" by Tom Godwin', '"The Country of the Kind" by Damon Knight', '"Day Million" by Frederik Pohl', '"First Contact" by Murray Leinster', '"Fondly Fahrenheit" by Alfred Bester', '"The Game of Rat and Dragon" by Cordwainer Smith', '"Hell Is the Absence of God" by Ted Chiang', '"Jeffty Is Five" by Harlan Ellison', '"The Little Black Bag" by C. M. Kornbluth', '"Lobsters" by Charles Stross', '"The Lucky Strike" by Kim Stanley Robinson', '"A Martian Odyssey" by Stanley G. Weinbaum', '"Microcosmic God" by Theodore Sturgeon', '"The Mountains of Mourning" by Lois McMaster Bujold', '"Nightfall" by Isaac Asimov', '"The Only Neat Thing to Do" by James Tiptree, Jr.', '"Or All the Seas with Oysters" by Avram Davidson', '"Passengers" by Robert Silverberg', '"The P
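
The greedy step is easier to see without the dataframe bookkeeping. Below is a minimal sketch on hypothetical toy data, assuming each book is represented as a set of (title, author) pairs; the pair plays the role of `story_desc`, so two different stories titled "The Star" stay distinct. The function name `greedy_cover` and all book and author labels are illustrative assumptions.

```python
# Hypothetical toy data: book -> set of (story title, author) pairs
books = {
    "Book X": {("The Star", "Author A"), ("Story P", "Author C")},
    "Book Y": {("The Star", "Author B")},
    "Book Z": {("The Star", "Author B"), ("Story P", "Author C"), ("Story Q", "Author D")},
}

def greedy_cover(books):
    """Repeatedly pick the book that covers the most still-uncovered stories."""
    uncovered = set().union(*books.values())
    chosen = []
    while uncovered:
        # sorted() makes ties resolve alphabetically, so the result is repeatable
        best = max(sorted(books), key=lambda title: len(books[title] & uncovered))
        chosen.append(best)
        uncovered -= books[best]
    return chosen

chosen = greedy_cover(books)
print(chosen)
```

Here "Book Z" is picked first because it covers three stories, and "Book X" is then needed for the remaining "The Star"; "Book Y" adds nothing and is never chosen.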

# Eliminating books without unique stories

An alternative approach to the science fiction anthology problem is to go through all the books in ascending order of story count, keeping those that contain at least one story found in no other remaining book and removing those that do not.
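
As a minimal sketch of this elimination pass, here is the same idea on hypothetical toy data using plain Python sets instead of the dataframe. The function name `eliminate_redundant` and the book and story labels are assumptions made for illustration only.

```python
# Hypothetical toy data: book -> set of stories
books = {
    "A": {"s1", "s2", "s3", "s4"},
    "B": {"s1", "s2", "s5"},
    "C": {"s3", "s4", "s6"},
}

def eliminate_redundant(books):
    """Drop every book whose stories all appear in the other remaining books."""
    remaining = {title: set(stories) for title, stories in books.items()}
    # Visit books from fewest to most stories (ties broken by title)
    for title in sorted(remaining, key=lambda t: (len(remaining[t]), t)):
        others = set().union(*(s for t, s in remaining.items() if t != title))
        if remaining[title] <= others:  # no story is unique to this book
            del remaining[title]
    return sorted(remaining)

kept = eliminate_redundant(books)
print(kept)
```

In this toy case "B" and "C" each hold a unique story and are kept, while "A" is removed because everything in it is available elsewhere.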

In [9]:
eliminated_df = reduced_main_df.copy()
eliminated_df['story_desc']=eliminated_df['story_title'] + ' by ' + eliminated_df['story_author'] #Add new column of unique story description 
books = eliminated_df['publication_title'].value_counts(ascending=True).index.to_list() #Books ordered in ascending count of stories
for book in books:
    stories = eliminated_df.story_desc[eliminated_df.publication_title==book].to_list()
    toEliminate = True
    for story in stories:
        if eliminated_df[eliminated_df.story_desc==story].shape[0]==1:
            print(story,"is unique to",book)
            toEliminate = False
            break
    if toEliminate:
        eliminated_df = eliminated_df[eliminated_df.publication_title != book]
print("Total books containing all stories:",len(eliminated_df.publication_title.unique()))

"R & R" by Lucius Shepard is unique to Isaac Asimov's Science Fiction Magazine, April 1986
"The Man Who Bridged the Mist" by Kij Johnson is unique to Asimov's Science Fiction, October-November 2011
"The Island" by Peter Watts is unique to Beyond the Rift
"Buffalo Gals, Won't You Come Out Tonight" by Ursula K. Le Guin is unique to The Fantasy Hall of Fame
"The Star Pit" by Samuel R. Delany is unique to Modern Classic Short Novels of Science Fiction
"The Merchant and the Alchemist's Gate" by Ted Chiang is unique to The Very Best of Fantasy & Science Fiction: 60th Anniversary Anthology
"A Song for Lya" by George R. R. Martin is unique to The Hugo Winners, Volume Three
"The Last of the Winnebagos" by Connie Willis is unique to Nebula Award-Winning Novellas
"The Moon Moth" by Jack Vance is unique to From Here to Forever
"The Screwfly Solution" by James Tiptree, Jr. is unique to The Oxford Book of Science Fiction Stories
"Beggars in Spain" by Nancy Kress is unique to Hugo and Nebula Award Wi

# Punctured greedy algorithm

Let's see if we can improve on the greedy algorithm by steering it down a different book selection path. We force a different path by excluding, one at a time, each book chosen by the original greedy algorithm and rerunning the algorithm on the remaining books.

In [10]:
import numpy as np
remaining_df = reduced_main_df.copy() #Create working copy of main database
remaining_df['story_desc']=remaining_df['story_title'] + ' by ' + remaining_df['story_author'] #Add new column of unique story description (the story title is not enough, there are two stories titled "The Star")
selected_books = []
selected_stories = []
while (np.unique(np.array(selected_stories)).size < len(df.story_title)): #Loop until we have covered all stories
    top_book = remaining_df['publication_title'].value_counts().head(1).index.values[0] #Find book with most stories
    selected_books.append(top_book)
    stories_in_top_book = remaining_df[remaining_df.publication_title==top_book].story_desc.tolist()
    selected_stories.extend(stories_in_top_book)
    remaining_df=remaining_df[~remaining_df.story_desc.isin(stories_in_top_book)] #Remove from database all entries of the stories we have selected
print("Books selected by greedy algorithm:",selected_books)
print("Total number of selected books: ",len(selected_books))
greedy_books = selected_books
greedy_stories = selected_stories

print("\nExcluded book\t\t\tNumber of books")
for book in greedy_books:
    remaining_df = reduced_main_df.copy() #Create working copy of main database
    remaining_df['story_desc']=remaining_df['story_title'] + ' by ' + remaining_df['story_author'] #Add new column of unique story description (the story title is not enough, there are two stories titled "The Star")
    remaining_df = remaining_df[remaining_df.publication_title != book]
    selected_books = []
    selected_stories = []
    while (np.unique(np.array(selected_stories)).size < len(df.story_title)): #Loop until we have covered all stories
        try:
            top_book = remaining_df['publication_title'].value_counts().head(1).index.values[0] #Find book with most stories
        except (IndexError):
            break
        # print("Chosen book: ",top_book)
        selected_books.append(top_book)
        stories_in_top_book = remaining_df[remaining_df.publication_title==top_book].story_desc.tolist()
        selected_stories.extend(stories_in_top_book)
        # print("Chosen stories: ",stories_in_top_book)
        remaining_df=remaining_df[~remaining_df.story_desc.isin(stories_in_top_book)] #Remove from database all entries of the stories we have selected
        # print("Total stories selected: ",np.unique(np.array(selected_stories)).size)
    # print("All selected books: ",selected_books)
    if (np.unique(np.array(selected_stories)).size<len(df.story_title)):
        continue #Without this book full coverage is impossible; skip it and try the next excluded book
    print(book,"\t\t\t",len(selected_books))

Books selected by greedy algorithm: ['Sense of Wonder', 'The Wesleyan Anthology of Science Fiction', 'Science Fiction: Stories and Contexts', 'The Big Book of Science Fiction: The Ultimate Collection', 'The Science Fiction Hall of Fame, Volume IV', 'The World Treasury of Science Fiction', "Hugo and Nebula Award Winners from Asimov's Science Fiction", 'Her Smoke Rose Up Forever', 'Survival Printout', 'What If? Volume 3', 'Impossible Things', 'Nebula Award Stories 5', 'GRRM: A RRetrospective', 'Beyond the Rift', 'The Magazine of Fantasy & Science Fiction, November 1987', 'Novelties & Souvenirs: Collected Short Fiction', 'The Best of the Best Volume 2: 20 Years of the Best Short Science Fiction Novels', 'From Here to Forever', "Isaac Asimov's Science Fiction Magazine, April 1986", 'Machines That Think: The Best Science Fiction Stories About Robots and Computers', "The Merchant and the Alchemist's Gate", "Asimov's Science Fiction, October-November 2011"]
Total number of selected books:  22
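
To illustrate why puncturing can help, here is a minimal sketch on hypothetical toy data in which plain greedy selection is not optimal. The function name `greedy_cover`, its `excluded` parameter, and the book and story labels are assumptions made for this example, not part of the notebook's code.

```python
# Hypothetical toy data: book -> set of stories
books = {
    "A": {"s1", "s2", "s3", "s4"},
    "B": {"s1", "s2", "s5"},
    "C": {"s3", "s4", "s6"},
}

def greedy_cover(books, excluded=frozenset()):
    """Greedy set cover; returns None if the stories cannot all be covered."""
    available = {t: set(s) for t, s in books.items() if t not in excluded}
    uncovered = set().union(*books.values())  # every story must still be covered
    chosen = []
    while uncovered:
        best = max(sorted(available),
                   key=lambda t: len(available[t] & uncovered), default=None)
        if best is None or not available[best] & uncovered:
            return None  # the excluded book held a story found nowhere else
        chosen.append(best)
        uncovered -= available[best]
    return chosen

baseline = greedy_cover(books)
punctured = {}
for title in baseline:
    cover = greedy_cover(books, excluded={title})
    if cover is not None:
        punctured[title] = len(cover)

print(baseline)
print(punctured)
```

Plain greedy takes "A" first and then needs both "B" and "C", three books in total. Excluding "A" forces the choice of "B" and "C" alone, which covers everything with two books. Excluding "B" or "C" makes full coverage impossible, so those runs are skipped.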

In [16]:
# Greedy with puncturing round 2
remaining_df = reduced_main_df.copy() #Create working copy of main database
remaining_df = remaining_df[remaining_df.publication_title != 'The Science Fiction Hall of Fame, Volume IV']
remaining_df['story_desc']=remaining_df['story_title'] + ' by ' + remaining_df['story_author'] #Add new column of unique story description (the story title is not enough, there are two stories titled "The Star")
selected_books = []
selected_stories = []
while (np.unique(np.array(selected_stories)).size < len(df.story_title)): #Loop until we have covered all stories
    top_book = remaining_df['publication_title'].value_counts().head(1).index.values[0] #Find book with most stories
    selected_books.append(top_book)
    stories_in_top_book = remaining_df[remaining_df.publication_title==top_book].story_desc.tolist()
    selected_stories.extend(stories_in_top_book)
    remaining_df=remaining_df[~remaining_df.story_desc.isin(stories_in_top_book)] #Remove from database all entries of the stories we have selected
print("Books selected by greedy algorithm:",selected_books)
print("Total number of selected books: ",len(selected_books))
greedy_books = selected_books
greedy_stories = selected_stories

print("\nExcluded book\t\t\tNumber of books")
for book in greedy_books:
    remaining_df = reduced_main_df.copy() #Create working copy of main database
    remaining_df['story_desc']=remaining_df['story_title'] + ' by ' + remaining_df['story_author'] #Add new column of unique story description (the story title is not enough, there are two stories titled "The Star")
    remaining_df = remaining_df[remaining_df.publication_title != book]
    selected_books = []
    selected_stories = []
    while (np.unique(np.array(selected_stories)).size < len(df.story_title)): #Loop until we have covered all stories
        try:
            top_book = remaining_df['publication_title'].value_counts().head(1).index.values[0] #Find book with most stories
        except (IndexError):
            break
        # print("Chosen book: ",top_book)
        selected_books.append(top_book)
        stories_in_top_book = remaining_df[remaining_df.publication_title==top_book].story_desc.tolist()
        selected_stories.extend(stories_in_top_book)
        # print("Chosen stories: ",stories_in_top_book)
        remaining_df=remaining_df[~remaining_df.story_desc.isin(stories_in_top_book)] #Remove from database all entries of the stories we have selected
        # print("Total stories selected: ",np.unique(np.array(selected_stories)).size)
    # print("All selected books: ",selected_books)
    if (np.unique(np.array(selected_stories)).size<len(df.story_title)):
        continue #Without this book full coverage is impossible; skip it and try the next excluded book
    print(book,"\t\t\t",len(selected_books))



Books selected by greedy algorithm: ['Sense of Wonder', 'The Wesleyan Anthology of Science Fiction', 'Science Fiction: Stories and Contexts', 'The Big Book of Science Fiction: The Ultimate Collection', 'The Arbor House Treasury of Modern Science Fiction', 'The Locus Awards: Thirty Years of the Best in Science Fiction and Fantasy', 'The World Treasury of Science Fiction', 'The Science Fiction Century', 'The Arbor House Treasury of Great Science Fiction Short Novels', 'Survival Printout', 'The Science Fiction Hall of Fame Volume Four', 'Beyond the Rift', 'Nebula Award-Winning Novellas', 'The Magazine of Fantasy & Science Fiction, November 1987', 'Analog Science Fiction/Science Fact, June 1977', 'From Here to Forever', "Asimov's Science Fiction, October-November 2011", 'Future on Ice', 'The Very Best of Fantasy & Science Fiction: 60th Anniversary Anthology', "Isaac Asimov's Science Fiction Magazine, April 1986", 'GRRM: A RRetrospective']
Total number of selected books:  21

Excluded book	

# Current best solution

The current best solution is to remove *The Science Fiction Hall of Fame, Volume IV* (or one of several other books listed above) from the list of available books. This lets the greedy algorithm find a better (local) minimum, though never fewer than 21 books.

In [17]:
import numpy as np
remaining_df = reduced_main_df.copy() #Create working copy of main database
remaining_df = remaining_df[remaining_df.publication_title != "The Science Fiction Hall of Fame, Volume IV"] #Remove one book
remaining_df['story_desc']=remaining_df['story_title'] + ' by ' + remaining_df['story_author'] #Add new column of unique story description (the story title is not enough, there are two stories titled "The Star")
selected_books = []
selected_stories = []
while (np.unique(np.array(selected_stories)).size < len(df.story_title)): #Loop until we have covered all stories
    top_book = remaining_df['publication_title'].value_counts().head(1).index.values[0] #Find book with most stories
    print(top_book,"\n")
    selected_books.append(top_book)
    stories_in_top_book = remaining_df[remaining_df.publication_title==top_book].story_desc.tolist()
    selected_stories.extend(stories_in_top_book)
    for story in stories_in_top_book:
        print("-",story)
    remaining_df=remaining_df[~remaining_df.story_desc.isin(stories_in_top_book)] #Remove from database all entries of the stories we have selected
    print("\nRunning story total:",np.unique(np.array(selected_stories)).size,"\n")
print("Selected books:")
for book in selected_books:
    print("-",book)
print("\nNumber of selected books: ",len(selected_books))

Sense of Wonder 

- "Arena" by Fredric Brown
- "Bears Discover Fire" by Terry Bisson
- "Black Destroyer" by A. E. van Vogt
- "Blood Music" by Greg Bear
- "Bloodchild" by Octavia E. Butler
- "The Cold Equations" by Tom Godwin
- "The Country of the Kind" by Damon Knight
- "Day Million" by Frederik Pohl
- "First Contact" by Murray Leinster
- "Fondly Fahrenheit" by Alfred Bester
- "The Game of Rat and Dragon" by Cordwainer Smith
- "Hell Is the Absence of God" by Ted Chiang
- "Jeffty Is Five" by Harlan Ellison
- "The Little Black Bag" by C. M. Kornbluth
- "Lobsters" by Charles Stross
- "The Lucky Strike" by Kim Stanley Robinson
- "A Martian Odyssey" by Stanley G. Weinbaum
- "Microcosmic God" by Theodore Sturgeon
- "The Mountains of Mourning" by Lois McMaster Bujold
- "Nightfall" by Isaac Asimov
- "The Only Neat Thing to Do" by James Tiptree, Jr.
- "Or All the Seas with Oysters" by Avram Davidson
- "Passengers" by Robert Silverberg
- "The Persistence of Vision" by John Varley
- "Rachel in Lo