# Parsing the list of short stories

Solving the [science fiction anthology problem](https://auxiliarymemory.com/2018/08/18/the-mathematics-of-buying-science-fiction-anthologies/) first requires importing the [list of classic short stories v2](https://csfquery.com/SearchResult?mincite=8&category=story&sortby=7&list=1).



In [1]:
import requests, bs4
import pandas as pd

res = requests.get('https://csfquery.com/SearchResult?mincite=8&category=story&sortby=7&list=1') #Download the webpage
res.raise_for_status() #Check that download was OK
soup = bs4.BeautifulSoup(res.text, 'html.parser') #Parse HTML (naming the parser explicitly avoids the "no parser specified" warning)
classics_table = soup.find("table", attrs={"class": "table pt-3"}) #Find the table with the list of stories
classics_table_data = classics_table.tbody.find_all("a", attrs={"class": "a-csf"}) #Find all links in the table

df = pd.DataFrame(columns=['story_author', 'story_title', 'story_link']) #Create dataframe for storing the list of stories
for i in classics_table_data: #Loop over the links found previously 
    if "title.cgi" in i['href']:
        story_link = i['href']
        story_title = i.text
    elif "ea.cgi" in i['href']:
        story_author = i.text
        df.loc[len(df)]=([story_author,story_title,story_link]) #Append dataframe with author, title, story link

df = df.groupby(['story_title','story_link'])['story_author'].apply(', '.join).reset_index() # Merge entries for stories with more than one author
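
The pairing logic in the loop above relies on page order: each title link is followed by one or more author links for that story. A minimal offline sketch of the same idea, assuming made-up `(href, text)` pairs in place of the live page (the `example.org` URLs and the names are hypothetical, and plain dictionaries stand in for the dataframe):

```python
from collections import defaultdict

# Hypothetical (href, text) pairs in page order:
# a title link, followed by the author link(s) for that story.
links = [
    ("https://example.org/cgi-bin/title.cgi?1", "Story A"),
    ("https://example.org/cgi-bin/ea.cgi?10", "Author One"),
    ("https://example.org/cgi-bin/title.cgi?2", "Story B"),
    ("https://example.org/cgi-bin/ea.cgi?11", "Author Two"),
    ("https://example.org/cgi-bin/ea.cgi?12", "Author Three"),
]

authors_by_story = defaultdict(list)
current = None  # (title, link) of the most recent title link seen
for href, text in links:
    if "title.cgi" in href:
        current = (text, href)
    elif "ea.cgi" in href and current is not None:
        # Every author link belongs to the title link that precedes it
        authors_by_story[current].append(text)

# One record per story, with co-authors merged into a single string
stories = [
    {"story_title": title, "story_link": link, "story_author": ", ".join(names)}
    for (title, link), names in authors_by_story.items()
]
print(stories)
```

Grouping by the `(title, link)` pair rather than by title alone mirrors the `groupby` call in the notebook, so two different stories that happen to share a title stay separate.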

# Creating a list of anthologies

With the stories collected in a dataframe (`df`), the next step is to obtain from ISFDB.org the list of anthologies that contain at least one of them.

In [3]:
from io import StringIO #Lets pd.read_html read the HTML from a file-like object (recent pandas versions deprecate passing a raw HTML string)

main_df = pd.DataFrame(columns=['story_author', 'story_title', 'story_link','publication_title','publication_link']) #Create empty dataframe which will store a mapping between stories and anthologies
for index, story_row in df.iterrows(): #Loop over all stories
    story_author = story_row['story_author']
    story_title = story_row['story_title']
    story_link = story_row['story_link']
    story_link += '+1' #Change the ISFDB link to "Do not display translations"
    res = requests.get(story_link)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    publications_table = soup.find("table", attrs={"class": "publications"}) #Find the table listing all publications of a story
    publications_df = pd.read_html(StringIO(str(publications_table)), header=0)[0] #Read the HTML table into a dataframe (there could be more tables, but we're interested in the first, indexed 0)
    publications_table_links = publications_table.find_all("a") #Extract all HTML links from the table
    publications_links = [] 
    for l in publications_table_links:
        if "pl.cgi" in l['href']:
            publications_links.append(l['href'])
    publications_df['publications_links']=publications_links #Add column with links to anthologies to dataframe
    publications_df = publications_df[publications_df.Type.isin(['anth','coll','omni'])] #Only anthologies, collections, and omnibuses are of interest, drop others
    publications_df = publications_df.drop_duplicates(subset='Title', keep="first") #Remove duplicates by exact title match (this could be improved)
    for index2, publication_row in publications_df.iterrows(): #Iterate over all publications
        publication_title = publication_row['Title']
        publication_link = publication_row['publications_links']
        # print("Adding to main_df: ", story_title, " in ", publication_title)
        main_df.loc[len(main_df)]=([story_author,story_title,story_link,publication_title,publication_link]) #Update main database, each entry (row) is a story-to-anthology mapping
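
The comment above notes that removing duplicates by exact title match could be improved. One possible direction, sketched here purely as an assumption, is to compare titles after normalizing case and punctuation. The helper `normalize_title` and the second title variant below are hypothetical and do not come from ISFDB:

```python
import re

def normalize_title(title):
    """Hypothetical helper: casefold, then collapse punctuation and whitespace."""
    return re.sub(r"[\W_]+", " ", title.casefold()).strip()

titles = [
    "The Big Book of Science Fiction: The Ultimate Collection",
    "The Big Book of Science Fiction - the Ultimate Collection",  # made-up variant
    "The Wesleyan Anthology of Science Fiction",
]

seen = {}
for t in titles:
    # Keep the first spelling encountered for each normalized key
    seen.setdefault(normalize_title(t), t)
unique_titles = list(seen.values())
print(unique_titles)
```

In the notebook this could be applied by storing `publications_df['Title'].map(normalize_title)` in a helper column and passing that column to `drop_duplicates`. Looser matching can also merge books that really are different, so the results are worth checking by hand.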

At this point you may want to save this mapping into a CSV file.

In [62]:
main_df.to_csv('main.csv')

If you're interested, you can list the anthologies that contain the most classic stories:

In [33]:
main_df['publication_title'].value_counts().head()

The Wesleyan Anthology of Science Fiction                   23
The Big Book of Science Fiction: The Ultimate Collection    22
Name: publication_title, dtype: int64

# Greedy algorithm

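Choosing the fewest books that together contain every story is an instance of the [set cover problem](https://en.wikipedia.org/wiki/Set_cover_problem). One approach is the greedy algorithm implemented below: repeatedly pick the book that contains the most stories not yet covered.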
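
As a warm-up that does not depend on the scraped data, here is a minimal sketch of the greedy rule on a made-up mapping of books to stories. The book names, story labels, and the function `greedy_cover` are illustrative assumptions; the notebook's own pandas implementation follows.

```python
# Toy, made-up mapping: book -> set of stories it contains
books = {
    "Book A": {"s1", "s2", "s3"},
    "Book B": {"s3", "s4"},
    "Book C": {"s4", "s5"},
    "Book D": {"s1", "s5"},
}

def greedy_cover(books):
    """Repeatedly choose the book that covers the most uncovered stories."""
    uncovered = set().union(*books.values())
    chosen = []
    while uncovered:
        # Pick the book covering the most still-uncovered stories
        best = max(books, key=lambda b: len(books[b] & uncovered))
        chosen.append(best)
        uncovered -= books[best]
    return chosen

selected = greedy_cover(books)
print(selected)  # ['Book A', 'Book C']
```

The greedy rule is simple and fast, but it is a heuristic: it is not guaranteed to find the smallest possible set of books.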

In [68]:
import numpy as np
remaining_df = main_df.copy() #Create working copy of main database (without .copy() the new column below would also be added to main_df)
remaining_df['story_desc']=remaining_df['story_title'] + ' by ' + remaining_df['story_author'] #Add new column of unique story description (the story title is not enough, there are two stories titled "The Star")
selected_books = []
selected_stories = []
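while (np.unique(np.array(selected_stories)).size < len(df.story_title)) and not remaining_df.empty: #Loop until we have covered all stories (stop early if a story appears in no anthology, which would otherwise raise an IndexError)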
    top_book = remaining_df['publication_title'].value_counts().head(1).index.values[0] #Find book with most stories
    print("Chosen book: ",top_book)
    selected_books.append(top_book)
    stories_in_top_book = remaining_df[remaining_df.publication_title==top_book].story_desc.tolist()
    selected_stories.extend(stories_in_top_book)
    print("Chosen stories: ",stories_in_top_book)
    remaining_df=remaining_df[~remaining_df.story_desc.isin(stories_in_top_book)] #Remove from database all entries of the stories we have selected
    print("Total stories selected: ",np.unique(np.array(selected_stories)).size)
print("All selected books: ",selected_books)
print("Total number of selected books: ",len(selected_books))

Chosen book:  The Wesleyan Anthology of Science Fiction
Chosen stories:  ['""Repent, Harlequin!" Said the Ticktockman" by Harlan Ellison', '"A Martian Odyssey" by Stanley G. Weinbaum', '"Air Raid" by John Varley', '"All You Zombies—" by Robert A. Heinlein', '"And I Awoke and Found Me Here on the Cold Hill\'s Side" by James Tiptree, Jr.', '"Aye, and Gomorrah …" by Samuel R. Delany', '"Burning Chrome" by William Gibson', '"Coming Attraction" by Fritz Leiber', '"Day Million" by Frederik Pohl', '"Desertion" by Clifford D. Simak', '"Fondly Fahrenheit" by Alfred Bester', '"Nine Lives" by Ursula K. Le Guin', '"Passengers" by Robert Silverberg', '"Speech Sounds" by Octavia E. Butler', '"That Only a Mother" by Judith Merril', '"The Game of Rat and Dragon" by Cordwainer Smith', '"The Sentinel" by Arthur C. Clarke', '"The Star" by H. G. Wells', '"There Will Come Soft Rains" by Ray Bradbury', '"Think Like a Dinosaur" by James Patrick Kelly', '"Thunder and Roses" by Theodore Sturgeon', '"We Can Rem