# Web Scraping Program by David Smolinski
What this adds to my portfolio:
- web scraping
- HTML parsing
- regular expressions
- list comprehensions

The program's function: 

This program extracts data from the Beesource forum thread "Why Treatment Free". The last line of code (commented out for my portfolio) writes the dataset to a CSV file for future analysis with pandas.

Features of the dataset:
- Each row contains data from 1 post in the forum thread.
- There are rows for all posts in the thread.
- Aside from "post_date", every column describes the user who wrote the post: their ID, when they joined Beesource, their location, and the number of posts they have made.
- The dataset is small, so the program runs quickly in demonstrations and places little load on the website.

Future data analysis idea: Find the geographic locations where people believe in treatment-free (TF) beekeeping.
- Find this by counting the unique users who posted from each location and comparing the counts.
- This could affect the price of TF products (queens, nucs...) and the quality of feral genetics, which matters when choosing where to breed.
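
The per-location comparison could be sketched as follows. This is a minimal sketch using only the standard library, not part of the program: the rows are invented sample data shaped like the scraped tuples, and `unique_posters_by_location` is a hypothetical helper.

```python
from collections import defaultdict

# Invented sample rows in the same shape as the scraped data:
# (user_id, post_date, join_date, location, profile_posts)
sample_posts = [
    ('1001', '12-11-2013', 'Apr 2012', 'Town A, State X', '394'),
    ('1001', '12-12-2013', 'Apr 2012', 'Town A, State X', '394'),
    ('1002', '12-11-2013', 'Dec 2002', 'town a, state x', '5083'),
    ('1003', '12-11-2013', 'Jan 2005', 'Town B, State Y', '2983'),
]


def unique_posters_by_location(posts):
    """Count distinct user ids per location (hypothetical helper)."""
    users_by_location = defaultdict(set)
    for user_id, _post_date, _join_date, location, _profile_posts in posts:
        # Normalise case and surrounding whitespace so that differently
        # capitalised spellings of one place land in the same group.
        users_by_location[location.strip().lower()].add(user_id)
    return {loc: len(users) for loc, users in users_by_location.items()}


counts = unique_posters_by_location(sample_posts)
# Print locations from most to fewest unique posters.
for location, n_users in sorted(counts.items(), key=lambda item: -item[1]):
    print(location, n_users)
```

Using a set per location means a user who posts many times is counted once. The real location field is free text (for example "hinesville ga usa" next to "Pleasant Shade, TN"), so it would likely need more cleaning than lowercasing before the groups are meaningful. Once the CSV is loaded into a DataFrame, `df.groupby('location')['user_id'].nunique()` would give the same kind of count.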

Links:
- [my portfolio](https://github.com/DavidSmolinski/portfolio)
- [forum thread](https://www.beesource.com/forums/showthread.php?291554-Why-Treatment-Free)

In [1]:
import requests
from bs4 import BeautifulSoup  # Beautiful Soup
import re  # regular expressions
import pandas as pd

In [3]:
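def page_to_soup(page_num, url='https://www.beesource.com/forums/showthread.php?291554-Why-Treatment-Free'):
    """
    Download one page of the forum thread and parse it.

    :param page_num: page number of the forum thread
    :param url: URL of the forum thread
    :return: a bs4 BeautifulSoup object
    """
    # requests adds the page number to the URL's query string
    r = requests.get(url, params={'page': page_num}, timeout=30)
    r.raise_for_status()  # fail on an HTTP error instead of parsing an error page
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup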

In [4]:
page_num = 1
soup = page_to_soup(page_num=page_num)

# On page 1, the first "first_last" span is expected to link to the last page.
last_page_text = soup.find('span', attrs={'class': 'first_last'})
last_page_text = last_page_text.find('a')['href']

# Capture the digits after "page" in the href (\d+ so an empty match cannot reach int()).
last_page_num_re = re.compile(r'(page)(\d+)')
text_search = last_page_num_re.search(last_page_text)
last_page_num = text_search.group(2)
last_page_num = int(last_page_num)
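
To illustrate the page-number extraction in isolation, here is a small sketch run on a made-up href. The sample string and the name `extract_page_num` are assumptions for illustration; the actual link on Beesource may look different.

```python
import re

# Matches "page" followed by one or more digits; group 1 is the number.
PAGE_NUM_RE = re.compile(r'page(\d+)')


def extract_page_num(href):
    """Return the page number found in href, or None if there is none (hypothetical helper)."""
    match = PAGE_NUM_RE.search(href)
    return int(match.group(1)) if match else None


# Made-up href in the style of a forum pagination link.
sample_href = 'showthread.php?291554-Why-Treatment-Free/page12'
print(extract_page_num(sample_href))       # 12
print(extract_page_num('showthread.php'))  # None
```

Returning `None` when nothing matches lets the caller decide what to do with a single-page thread, rather than failing with an `AttributeError` on `.group`.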

In [5]:
user_id_re = re.compile(r'(php\?)(\d+)')  # digits after "php?" in the profile link
posts_list = []
while page_num <= last_page_num:
    posts = soup.find_all('li', attrs={'class': 'postbit postbitim postcontainer old'})
    for post in posts:
        # user id: taken from the href of the poster's profile link
        user_id_data = post.find('div', attrs={'class': 'popupmenu memberaction'})
        user_id_data = user_id_data.find('a')['href']
        text_search = user_id_re.search(user_id_data)
        user_id = text_search.group(2)

        # post date: keep the part before the comma and drop the time
        date_time = post.find('span', attrs={'class': 'date'}).text
        post_date = date_time.split(',')[0]

        # user stats: the three <dd> values are join date, location, and post count
        userstats = post.find('dl', attrs={'class': 'userstats'})
        userstats_list = userstats.find_all('dd')
        join_date, location, profile_posts = [e.text for e in userstats_list]

        posts_list.append((user_id, post_date, join_date, location, profile_posts))
    page_num += 1
    if page_num <= last_page_num:  # do not request a page past the end of the thread
        soup = page_to_soup(page_num=page_num)
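
The two string-handling steps inside the loop (the user id regex and the date split) can be tried without touching the website. This is a minimal sketch under stated assumptions: the sample href and timestamp are invented to resemble what the forum might produce, and `parse_user_id` and `parse_post_date` are hypothetical helper names.

```python
import re

# Digits that directly follow "php?" in a profile link.
USER_ID_RE = re.compile(r'php\?(\d+)')


def parse_user_id(href):
    """Return the digits after 'php?' as a string, or None if absent (hypothetical helper)."""
    match = USER_ID_RE.search(href)
    return match.group(1) if match else None


def parse_post_date(date_time):
    """Keep the text before the first comma and trim whitespace (hypothetical helper)."""
    return date_time.split(',')[0].strip()


# Invented inputs for illustration only.
print(parse_user_id('member.php?12345-example-user'))  # 12345
print(parse_post_date('12-11-2013, 08:15 AM'))         # 12-11-2013
```

Unlike the loop above, these helpers return `None` on a failed match and strip whitespace from the date, which makes a layout change on the site easier to spot.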

In [6]:
df = pd.DataFrame(posts_list, columns=['user_id', 'post_date', 'join_date', 'location', 'profile_posts'])

In [7]:
df

|   | user_id | post_date | join_date | location | profile_posts |
|---|---|---|---|---|---|
| 0 | 85171 | 12-11-2013 | Apr 2012 | Gainesboro, Tennessee, USA. | 394 |
| 1 | 63848 | 12-11-2013 | Dec 2002 | Medford, Oregon | 5083 |
| 2 | 61741 | 12-11-2013 | Jan 2005 | Hamilton, Alabama | 2983 |
| 3 | 76628 | 12-11-2013 | Aug 2010 | Adair Co, Oklahoma | 134 |
| 4 | 77738 | 12-11-2013 | Dec 2010 | hinesville ga usa | 762 |
| 5 | 77115 | 12-11-2013 | Oct 2010 | Baker Oregon | 3221 |
| 6 | 90519 | 12-11-2013 | Dec 2012 | Fort Walton Beach, Florida | 1253 |
| 7 | 95801 | 12-11-2013 | Jul 2013 | Pleasant Shade, TN | 763 |
| 8 | 96427 | 12-11-2013 | Aug 2013 | Little Rock, AR | 143 |
| 9 | 60655 | 12-12-2013 | Aug 2002 | Nehawka, Nebraska USA | 53582 |


In [None]:
# df.to_csv('beesource_tf.csv', index=False)