# Parsing the list of short stories

Solving the [science fiction anthology problem](https://auxiliarymemory.com/2018/08/18/the-mathematics-of-buying-science-fiction-anthologies/) first requires importing the [list of classic short stories v2](https://csfquery.com/SearchResult?mincite=8&category=story&sortby=7&list=1).



In [1]:
import requests, bs4
import pandas as pd

res = requests.get('https://csfquery.com/SearchResult?mincite=8&category=story&sortby=7&list=1') #Download the webpage
res.raise_for_status() #Check that download was OK
soup = bs4.BeautifulSoup(res.text, 'html.parser') #Parse HTML (name the parser explicitly to avoid bs4's "no parser specified" warning)
classics_table = soup.find("table", attrs={"class": "table pt-3"}) #Find the table with the list of stories
classics_table_data = classics_table.tbody.find_all("a", attrs={"class": "a-csf"}) #Find all links in the table

df = pd.DataFrame(columns=['story_author', 'story_title', 'story_link']) #Create dataframe for storing the list of stories
for i in classics_table_data: #Loop over the links found previously 
    if "title.cgi" in i['href']:
        story_link = i['href']
        story_title = i.text
    elif "ea.cgi" in i['href']:
        story_author = i.text
        df.loc[len(df)]=([story_author,story_title,story_link]) #Append dataframe with author, title, story link

df = df.groupby(['story_title','story_link'])['story_author'].apply(', '.join).reset_index() # Merge entries for stories with more than one author
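
The pairing logic in the cell above relies on each `title.cgi` link being followed by one or more `ea.cgi` author links. A minimal pure-Python sketch of the same idea, without any network access, might look like the following. The `links` list, the `example.invalid` URLs, and the helper name `pair_stories` are placeholders made up for illustration; they are not taken from the real page.

```python
# Hypothetical (href, text) pairs in page order; on the real page these
# would come from the <a class="a-csf"> elements found by BeautifulSoup.
links = [
    ("https://example.invalid/title.cgi?1", "Story A"),
    ("https://example.invalid/ea.cgi?10", "Author One"),
    ("https://example.invalid/ea.cgi?11", "Author Two"),
    ("https://example.invalid/title.cgi?2", "Story B"),
    ("https://example.invalid/ea.cgi?12", "Author Three"),
]


def pair_stories(links):
    """Attach each author link to the most recent title link."""
    stories = {}    # (title, link) -> list of author names, in page order
    current = None  # key of the story currently being read
    for href, text in links:
        if "title.cgi" in href:
            current = (text, href)
            stories.setdefault(current, [])
        elif "ea.cgi" in href and current is not None:
            stories[current].append(text)
    # Join co-authors into one string; skip titles that had no author link
    return [
        {"story_title": title, "story_link": link, "story_author": ", ".join(authors)}
        for (title, link), authors in stories.items()
        if authors
    ]


rows = pair_stories(links)
for row in rows:
    print(row["story_title"], "by", row["story_author"])
```

Collecting the authors per `(title, link)` key plays the same role as the `groupby(...).apply(', '.join)` step: a story with two authors ends up as a single row.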

# Creating a list of anthologies

With the short stories collected in `df`, the next step is to find every publication listed on ISFDB.org that contains at least one of them.

In [5]:
from io import StringIO #Recent pandas versions expect a file-like object in read_html rather than a literal HTML string

main_df = pd.DataFrame(columns=['story_author', 'story_title', 'story_link','publication_title','publication_link']) #Create empty dataframe which will store a mapping between stories and anthologies
for index, story_row in df.iterrows(): #Loop over all stories
    story_author = story_row['story_author']
    story_title = story_row['story_title']
    story_link = story_row['story_link']
    story_link += '+1' #Change the ISFDB link to "Do not display translations"
    res = requests.get(story_link)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser') #Name the parser explicitly to avoid bs4's "no parser specified" warning
    publications_table = soup.find("table", attrs={"class": "publications"}) #Find the table listing all publications of a story
    publications_df = pd.read_html(StringIO(str(publications_table)), header=0)[0] #Read the HTML table into a dataframe (there could be more tables, but we're interested in the first, indexed 0)
    publications_table_links = publications_table.find_all("a") #Extract all HTML links from the table
    publications_links = [] 
    for l in publications_table_links:
        if "pl.cgi" in l['href']:
            publications_links.append(l['href'])
    publications_df['publications_links']=publications_links #Add column with links to anthologies to dataframe
    publications_df = publications_df[publications_df.Format.str.contains('digital audio download', na=False)] #Keep only publications whose format is digital audio download; rows with a missing Format are dropped
    publications_df = publications_df.drop_duplicates(subset='Title', keep="first") #Remove duplicates by exact title match (this could be improved)
    for index2, publication_row in publications_df.iterrows(): #Iterate over all publications
        publication_title = publication_row['Title']
        publication_link = publication_row['publications_links']
        # print("Adding to main_df: ", story_title, " in ", publication_title)
        main_df.loc[len(main_df)]=([story_author,story_title,story_link,publication_title,publication_link]) #Update main database, each entry (row) is a story-to-anthology mapping
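
The filter-then-deduplicate step inside the loop can be tried in isolation on a small table. The sketch below uses made-up rows (the titles, formats, and `link-N` values are placeholders, not ISFDB data) and assumes the `Format` column may contain missing values, which is why `na=False` is passed.

```python
import pandas as pd

# Made-up rows standing in for one story's ISFDB publications table
publications_df = pd.DataFrame({
    "Title": ["Anthology X", "Anthology X", "Magazine Y", "Collection Z"],
    "Format": ["digital audio download", "digital audio download", "pb", None],
    "publications_links": ["link-1", "link-2", "link-3", "link-4"],
})

# Keep rows whose Format mentions the wanted format; na=False treats a
# missing Format as "no match" instead of producing a NaN in the mask
mask = publications_df["Format"].str.contains("digital audio download", na=False)

# Among the remaining rows, keep the first occurrence of each exact title
filtered = publications_df[mask].drop_duplicates(subset="Title", keep="first")

print(filtered)
```

Because the match is on the exact title string, two editions whose titles differ slightly would both survive, which is the limitation the loop's comment already hints at.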

At this point you may want to save this mapping into a CSV file.

In [7]:
main_df.to_csv('main_digital_audio_downloads.csv')
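
Since `to_csv` is called here without `index=False`, the row index is written as an unnamed first column. A minimal sketch of a round trip, assuming you want to load the mapping back later without that extra column, could look like this. It writes to an in-memory buffer instead of a file, and the one-row `main_df` is a made-up stand-in for the real mapping.

```python
import io

import pandas as pd

# Made-up one-row stand-in for the story-to-anthology mapping
main_df = pd.DataFrame({
    "story_title": ["Story A"],
    "publication_title": ["Anthology X"],
})

buffer = io.StringIO()
main_df.to_csv(buffer)  # index is written as an unnamed first column
buffer.seek(0)

# index_col=0 turns that first column back into the index instead of
# leaving it behind as an extra "Unnamed: 0" column
restored = pd.read_csv(buffer, index_col=0)

print(restored.columns.tolist())
```

With a real file, the same `index_col=0` argument applies when reading `main_digital_audio_downloads.csv` back in.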