# Construct function
This notebook walks through examining the structure of the pages to be scraped and building the individual scraping functions.

In [1]:
import bs4
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests
import re

Extract the list of links from the threadmarks page, for use in the pipeline.

In [3]:
with open('test_data/dl_threadmarks.html') as f:
    dl_page_html = f.read()
# Name the parser explicitly so the result does not depend on which parsers are installed
dl_soup = bs(dl_page_html, 'html.parser')

# For each threadmark, extract the title of the post, the hyperlink and the post id
post_id_ex = re.compile(r'post-[0-9]+')

html_links = [t.find('a') for t in dl_soup.find_all(class_='structItem-title threadmark_depth0')]
threadmark_links = [(a.text.strip(), a['href'], post_id_ex.findall(a['href'])[0]) for a in html_links]


In [4]:
post_id_from_tm = threadmark_links[0][2]
post_id_from_tm

'post-9987149'
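
The post id pattern can be checked in isolation. The following is a minimal sketch, assuming threadmark hrefs contain a `post-<digits>` token; the helper name `post_id_from_href` and the sample hrefs are made up for illustration and do not come from the scraped pages.

```python
import re

# One or more digits after "post-"; `+` avoids matching a bare "post-".
POST_ID_RE = re.compile(r'post-[0-9]+')

def post_id_from_href(href):
    """Return the first 'post-<digits>' token in href, or None if absent."""
    match = POST_ID_RE.search(href)
    return match.group(0) if match else None

# Hypothetical hrefs, shaped like the ones on the threadmarks page.
print(post_id_from_href('/threads/example.1/#post-9987149'))  # post-9987149
print(post_id_from_href('/threads/example.1/'))               # None
```

Returning `None` for a link without a post id avoids the `IndexError` that indexing `findall(...)[0]` would raise on such a link.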

# Extract particular posts from page

In [5]:
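with open('test_data/dl_page_1.html') as f:
    page_html = f.read()
# Name the parser explicitly so the result does not depend on which parsers are installed
page_soup = bs(page_html, 'html.parser')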

Retrieve a single post from the HTML, using the post id to identify it.

In [6]:
post = page_soup.find(attrs={'data-content':post_id_from_tm})

Retrieve the content of the post.

In [7]:
post_body = post.find('article', class_="message-body")
post_body.text[:1000]

'\n\n                   Within the borders of the Empire of Man, the Elector Counts are among the highest of nobility, second only to the Emperor and - arguably, very arguably - the High Priests. Their word is all but law over the countless thousands of souls within their province.\n                   \n\n                   You are not an Elector Count. But you are the next best thing.\n                   \n\n                   You have secured a position in the Privy Council of a newly-arisen Elector Count, an instrument for their will in a specific domain. But also an instrument for your own will; because like all people, you have your own ambitions. The way you present information, the solutions you present, the options you put towards them and how you explain them; all of these will allow you a great deal of leeway in how you go about your business, and allowing you to pursue your own private goals.\n                   \n\n                   The first question is: in what province 
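
The lookup-by-attribute step can be tried without the saved test pages. This is a rough sketch on hypothetical, heavily simplified markup: the `data-content` and `message-body` names follow the cells above, while `SAMPLE_HTML`, `get_post_body` and the post contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real forum pages are far more deeply nested.
SAMPLE_HTML = """
<article class="message" data-content="post-1" data-author="alice">
  <article class="message-body">First post.</article>
</article>
<article class="message" data-content="post-2" data-author="bob">
  <article class="message-body">Second post.</article>
</article>
"""

def get_post_body(soup, post_id):
    """Return the stripped body text of the post with this id, or None."""
    post = soup.find(attrs={'data-content': post_id})
    if post is None:
        return None
    body = post.find('article', class_='message-body')
    return body.get_text(strip=True) if body is not None else None

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(get_post_body(soup, 'post-2'))  # Second post.
print(get_post_body(soup, 'post-9'))  # None
```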

Retrieve the author and other metadata.

In [8]:
post_author = post['data-author']
post_author

'BoneyM'

In [9]:
post_time = post.find(class_='message-attribution').find('time').text.strip()
post_time

'Jan 18, 2018'
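
The date is kept as a display string. If a real date is needed later, one possible conversion is sketched below. It assumes every post date follows the `'Jan 18, 2018'` shape seen above, which may not hold for other posts; `parse_post_date` is a hypothetical helper.

```python
from datetime import datetime

def parse_post_date(text):
    """Parse a date such as 'Jan 18, 2018' into a datetime.date.

    %b matches the abbreviated month name in the current locale,
    so this assumes an English (or C) locale.
    """
    return datetime.strptime(text.strip(), '%b %d, %Y').date()

print(parse_post_date('Jan 18, 2018'))  # 2018-01-18
```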

In [10]:
# Extract meta information from threadmarked posts

In [11]:
def collect_metadata(p):
    """Return (threadmark_type, threadmark_title), or ('Regular', None) for posts without a threadmark."""
    header = p.find(class_="message-cell--threadmark-header")
    if header:
        label = header.find('label')
        # The label's 'for' attribute points at the id of the element holding the title
        t_threadmark_id = label['for']
        t_threadmark_type = label.text.strip()
        t_threadmark_title = header.find(class_='threadmarkLabel', attrs={'id': t_threadmark_id}).text.strip()
        return (t_threadmark_type, t_threadmark_title)
    else:
        return ('Regular', None)

In [12]:
collect_metadata(post)

('Threadmarks', 'Character Creation - Part 1')
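
Both branches can be exercised on a small fixture. The sketch below is a more defensive variant, not the function used in this notebook: `collect_metadata_safe` is a hypothetical name, the markup is simplified and invented, and the fallback to `('Regular', None)` when the label is missing is an assumption about the desired behaviour.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one threadmarked post and one regular post.
SAMPLE_HTML = """
<article class="message">
  <div class="message-cell--threadmark-header">
    <label for="tm-1"> Threadmarks </label>
    <span class="threadmarkLabel" id="tm-1"> Character Creation - Part 1 </span>
  </div>
</article>
<article class="message"></article>
"""

def collect_metadata_safe(p):
    """Variant of collect_metadata that falls back instead of raising."""
    header = p.find(class_='message-cell--threadmark-header')
    label = header.find('label') if header is not None else None
    if label is None:
        return ('Regular', None)
    # Only look up the title when the label actually names a target id.
    target_id = label.get('for')
    title = header.find(id=target_id) if target_id else None
    return (label.get_text(strip=True),
            title.get_text(strip=True) if title is not None else None)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
threadmarked, regular = soup.find_all('article', class_='message')
print(collect_metadata_safe(threadmarked))  # ('Threadmarks', 'Character Creation - Part 1')
print(collect_metadata_safe(regular))       # ('Regular', None)
```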

## Finalized function

In [53]:
def extract_info(p):
    """Return a dict describing one post: ids, author, date, threadmark info and text."""
    author = p['data-author']
    date = p.find(class_='message-attribution').find('time').text.strip()

    # The second link in the attribution list carries the post number (text) and the post id (href)
    identification_raw = p.find(class_='message-attribution-opposite--list').find_all('a')[1]
    sv_post_number = identification_raw['href'].split('/')[-1]
    thread_post_number = identification_raw.text.strip().replace('#', '')

    content = p.find(class_='message-body').text.strip()

    metadata = collect_metadata(p)

    return {
        'thread_post_nr': thread_post_number,
        'sv_post_id': sv_post_number,
        'post_author': author,
        'post_date': date,
        'threadmark_type': metadata[0],
        'threadmark_title': metadata[1],
        'post_content': content}


In [54]:
all_posts_in_page = page_soup.find_all('article',class_='message')
all_post_list = list(all_posts_in_page)
inspect_post = all_post_list[2]
#inspect_post

In [55]:
extract_info(inspect_post)

{'thread_post_nr': '3',
 'sv_post_id': 'post-9987281',
 'post_author': 'veekie',
 'post_date': 'Jan 18, 2018',
 'threadmark_type': 'Regular',
 'threadmark_title': None,
 'post_content': "[X] STIRLAND\n                   \n                   [X] Intrigue: The Spymaster is the master of soft power in the realm, responsible for hiding the Elector Count's secrets and uncovering those of everyone else. Gives you a great deal of leeway in how you go about things, but also makes you a target for everyone trying to get away with skulduggery in the province, and being anything short of omniscient is often considered a failing by those that don't have to do the job.\n                   \n                   [X] Sleeper Agent: Perhaps you did not quite attain the position; perhaps it was thrust unto you, and perhaps those that did the thrusting have attached several strings to the position.\n                   \n                   Only required to pass on information, making it usually less burden

# Export data

In [56]:
list_of_records = []

for post in all_post_list:
    list_of_records.append(extract_info(post))

In [57]:
posts_data = pd.DataFrame(list_of_records)

In [58]:
posts_data

|   | thread_post_nr | sv_post_id | post_author | post_date | threadmark_type | threadmark_title | post_content |
|---|---|---|---|---|---|---|---|
| 0 | 1 | post-9987149 | BoneyM | Jan 18, 2018 | Threadmarks | Character Creation - Part 1 | Within the borders of the Empire of Man, the E... |
| 1 | 2 | post-9987164 | gutza1 | Jan 18, 2018 | Regular |  | [X] AVERLAND\n \n\n ... |
| 2 | 3 | post-9987281 | veekie | Jan 18, 2018 | Regular |  | [X] STIRLAND\n \n ... |
| 3 | 4 | post-9987290 | bryanfran36 | Jan 18, 2018 | Regular |  | [X] STIRLAND\n \n ... |
| 4 | 5 | post-9987312 | Drucchi | Jan 18, 2018 | Regular |  | I really like Sylvania, so let's go with Stirl... |
| 5 | 6 | post-9987320 | avatar11792 | Jan 18, 2018 | Regular |  | [X] HOCHLAND\n \n\n ... |
| 6 | 7 | post-9987383 | Void Stalker | Jan 18, 2018 | Regular |  | [X] HOCHLAND\n \n\n ... |
| 7 | 8 | post-9987385 | gutza1 | Jan 18, 2018 | Regular |  | I personally would find it cool to play as a m... |
| 8 | 9 | post-9987398 | Hannz | Jan 18, 2018 | Regular |  | [X] HOCHLAND\n \n\n ... |
| 9 | 10 | post-9987421 | Uhtread | Jan 18, 2018 | Regular |  | [X] HOCHLAND\n \n\n ... |


In [59]:
posts_data.to_csv('extracted_data/extracted_posts.csv',index=False,mode = 'w')
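
When the CSV is read back, pandas infers column types, so values that were extracted as strings may change type. The sketch below illustrates this on a small made-up frame written to a temporary directory; the records are invented, and only the column names mirror the ones produced above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical records with the same kinds of values as the extracted posts.
records = [
    {'thread_post_nr': '1', 'sv_post_id': 'post-1', 'threadmark_title': 'Part 1'},
    {'thread_post_nr': '2', 'sv_post_id': 'post-2', 'threadmark_title': None},
]
frame = pd.DataFrame(records)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'extracted_posts.csv')
    frame.to_csv(csv_path, index=False)
    # Default read: numeric-looking strings become integers, empty cells become NaN.
    default_read = pd.read_csv(csv_path)
    # Read everything as text and keep empty cells as empty strings.
    string_read = pd.read_csv(csv_path, dtype=str, keep_default_na=False)

print(default_read['thread_post_nr'].tolist())   # [1, 2]
print(string_read['thread_post_nr'].tolist())    # ['1', '2']
print(string_read['threadmark_title'].tolist())  # ['Part 1', '']
```

Which read mode is preferable depends on how the downstream pipeline uses the post numbers and titles.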