Step 1: Create df of urls w/how many pages are in the topic.

Step 2: Create system for filling out those two things in new df.

## The Plan
- Gather each piece of data that we care about
- Move through each page that exists.
- End at the end of the range of pages
- Have all rows of posts added to the posts_df
- Change the row value of "Topic_scrapped" to "True" in url_df

## Imports and Function definitions

In [1]:
import pandas as pd
import matplotlib as plt
import seaborn as sns
import numpy as np
%matplotlib inline
import requests
from time import sleep
from bs4 import BeautifulSoup
import json
from time import time
from datetime import datetime
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
def get_html(url):
    return BeautifulSoup(requests.get(url).content)

In [3]:
def urls_with_numbers(forum_page_url, forum_page_num = False):
    forum_soup     = get_html(forum_page_url)     # Input URL get Soup
    
    # List of urls of topics in forum page
    topic_url_list = ['https://us.battle.net' + topic.attrs['href'] 
        for topic in forum_soup.find_all(attrs={'class': "ForumTopic"})]
    topics         = []
    count          = 0
    
    # for i in all the forum topic infos
    for forum_topic in forum_soup.find_all(attrs={'class': "ForumTopic"}):
        # Turning each individual info bit into a dict
        posts_num   = json.loads(forum_topic.attrs['data-forum-topic']
            )['lastPosition'] 
        
        # Num of posts in Topic divided by num allowed per page +1 for 1st
        topic_pages = posts_num//20 + 1 
        
        if posts_num%20 != 0 and posts_num > 20:# If there's a remainder
            topic_pages += 1                    # Add remainder page
        topics.append({
            'url'      : topic_url_list[count],
            'pages'    : topic_pages,
            'Forum_num': forum_page_num
            })
        count += 1
    return topics # Returns list of dicts


In [4]:
# For use in scraper to keep things cleaner
# Using this instead of just as a dictionary will allow me to
# avoid a script-stopping error if a piece of data doesn't exist.
def save_datum(dictionary, column, datum):
    try:
        dictionary[str(column)] = datum
    except:
        print(dictionary, 'has no', column, 'data to assign')
    return

## Getting the initial information
My first step is getting the urls and any other needed information from the 

In [None]:
dict_list = []
b_url = 'https://us.battle.net/forums/en/overwatch/22813879/?page='
for attempt in range(5):
    try:
        for forum_num in range(9999):
            current_list = urls_with_numbers(b_url + str(forum_num),
                forum_num)
            
            dict_list += current_list
            if forum_num % 100 == 0:
                print(forum_num, 'Added', 
                    datetime.fromtimestamp(time()
                    ).strftime('%H:%M:%S'))
    except:
        print("---------")
        print('Error', attempt)
        print("---------")
        sleep(100)
    pd.DataFrame(dict_list).to_csv('./data/urls' + str(attempt) + '.csv')

### Fixing some minor mistakes

In [5]:
filterwarnings('ignore')
# I don't seem to have read_csv on this instance so I'm going 
# to settle for from_csv and ignore the warning for now.
url_df = pd.DataFrame.from_csv('./../data/urls.csv')

While fixing another issue I came upon this:

In [6]:
url_df['url'].value_counts()[55:65]

https://us.battle.net/forums/en/overwatch/topic/20761687354    2
https://us.battle.net/forums/en/overwatch/topic/20761926666    2
https://us.battle.net/forums/en/overwatch/topic/20753097396    2
https://us.battle.net/forums/en/overwatch/topic/20762036663    2
https://us.battle.net/forums/en/overwatch/topic/20761906654    2
https://us.battle.net/forums/en/overwatch/topic/20762066698    2
https://us.battle.net/forums/en/overwatch/topic/20753166205    1
https://us.battle.net/forums/en/overwatch/topic/20745706931    1
https://us.battle.net/forums/en/overwatch/topic/20752115206    1
https://us.battle.net/forums/en/overwatch/topic/20753535216    1
Name: url, dtype: int64

Just so many duplicates, but why!?  Lets take a look at a specific instance:

In [7]:
mask = url_df['url'] == 'https://us.battle.net/forums/en/overwatch/topic/20761846667'
url_df[mask]

Unnamed: 0,Forum_num,pages,url
14,0,1,https://us.battle.net/forums/en/overwatch/topi...
62,1,1,https://us.battle.net/forums/en/overwatch/topi...


ooooohhhh, when I put in the 0th and 1st forum page, I ended up essentially scraping the same page twice.  How silly of me.  Lets do a quick and easy fix.

In [8]:
mask = url_df['Forum_num'] == 0
url_df = url_df.drop([i for i in url_df[mask].index])

But there are still about 12 duplicates...

In [9]:
url_df['url'].value_counts()[10:15]

https://us.battle.net/forums/en/overwatch/topic/20753097396    2
https://us.battle.net/forums/en/overwatch/topic/20759150839    2
https://us.battle.net/forums/en/overwatch/topic/20761526007    2
https://us.battle.net/forums/en/overwatch/topic/20747304716    1
https://us.battle.net/forums/en/overwatch/topic/20759396284    1
Name: url, dtype: int64

In [10]:
mask = url_df['url'] == 'https://us.battle.net/forums/en/overwatch/topic/20752366073'
url_df[mask]

Unnamed: 0,Forum_num,pages,url
391470,7830,1,https://us.battle.net/forums/en/overwatch/topi...
391553,7832,1,https://us.battle.net/forums/en/overwatch/topi...


No obvious explaination here.  Lets look at the larger picture

In [11]:
duplicates = [
    'https://us.battle.net/forums/en/overwatch/topic/20752366073',
    'https://us.battle.net/forums/en/overwatch/topic/20759150839',    
    'https://us.battle.net/forums/en/overwatch/topic/20755047322',    
    'https://us.battle.net/forums/en/overwatch/topic/20753097396',    
    'https://us.battle.net/forums/en/overwatch/topic/20744334507',    
    'https://us.battle.net/forums/en/overwatch/topic/20747686120',    
    'https://us.battle.net/forums/en/overwatch/topic/20758737655',    
    'https://us.battle.net/forums/en/overwatch/topic/20761526007',    
    'https://us.battle.net/forums/en/overwatch/topic/20753487689',    
    'https://us.battle.net/forums/en/overwatch/topic/20760667469',    
    'https://us.battle.net/forums/en/overwatch/topic/20761766076', 
    'https://us.battle.net/forums/en/overwatch/topic/20759291345', 
    'https://us.battle.net/forums/en/overwatch/topic/20759158440'
]
for dup_url in duplicates:
    mask = url_df['url'] == dup_url
    print(url_df[mask]['Forum_num'])

391470    7830
391553    7832
Name: Forum_num, dtype: int64
391         7
66676    1333
Name: Forum_num, dtype: int64
233047    4661
233085    4662
Name: Forum_num, dtype: int64
334444    6689
335443    6709
Name: Forum_num, dtype: int64
112677    2254
488327    9767
Name: Forum_num, dtype: int64
425697    8515
429074    8582
Name: Forum_num, dtype: int64
136808    2736
136864    2738
Name: Forum_num, dtype: int64
17273    345
17800    356
Name: Forum_num, dtype: int64
300286    6006
300289    6006
Name: Forum_num, dtype: int64
43632    872
44596    892
Name: Forum_num, dtype: int64
16394    328
16395    328
Name: Forum_num, dtype: int64
77804    1556
77817    1556
Name: Forum_num, dtype: int64
103836    2077
103850    2077
Name: Forum_num, dtype: int64


Huh. This seems to be an issue to look into further once I've gathered more information.

### Adding last page.
UGH.  I made the silly mistake of forgetting the 9999th page in the forums.  I swear one day I'll get use to the range numbering system in python.

In [12]:
b_url = 'https://us.battle.net/forums/en/overwatch/22813879/?page='
last_list = urls_with_numbers(b_url + str(9999),9999)

In [15]:
last_df = pd.DataFrame(last_list)

In [16]:
url_df.tail()

Unnamed: 0,Forum_num,pages,url
499887,9998,1,https://us.battle.net/forums/en/overwatch/topi...
499888,9998,1,https://us.battle.net/forums/en/overwatch/topi...
499889,9998,1,https://us.battle.net/forums/en/overwatch/topi...
499890,9998,1,https://us.battle.net/forums/en/overwatch/topi...
499891,9998,1,https://us.battle.net/forums/en/overwatch/topi...


In [17]:
last_df.head()

Unnamed: 0,Forum_num,pages,url
0,9999,1,https://us.battle.net/forums/en/overwatch/topi...
1,9999,1,https://us.battle.net/forums/en/overwatch/topi...
2,9999,1,https://us.battle.net/forums/en/overwatch/topi...
3,9999,1,https://us.battle.net/forums/en/overwatch/topi...
4,9999,1,https://us.battle.net/forums/en/overwatch/topi...


In [18]:
# Sticking the two df's together
url_df = pd.concat([url_df,last_df])

# Resetting the index
url_df.reset_index(inplace = True)

# Renaming old index but keeping, just in case.
url_df.rename_axis({'index': 'original_index'},
                   axis = 1, inplace = True)

In [19]:
url_df.to_csv('./../data/urls_v2.csv', index = False)

## From urls to full scrapes

In [5]:
url_df = pd.DataFrame.from_csv('./../data/urls_v2.csv')

url_df['Topic_scrapped'] = False

url_df.head()#.append()

Unnamed: 0_level_0,Forum_num,pages,url,Topic_scrapped
original_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
48,1,1,https://us.battle.net/forums/en/overwatch/topi...,False
49,1,1,https://us.battle.net/forums/en/overwatch/topi...,False
50,1,238,https://us.battle.net/forums/en/overwatch/topi...,False
51,1,6,https://us.battle.net/forums/en/overwatch/topi...,False
52,1,10,https://us.battle.net/forums/en/overwatch/topi...,False


In [6]:
posts_df = pd.DataFrame(columns = ['text',
                            'date',
                            'ids_dict',
                            'post_num',
                            'auth_posts',
                            'prof_link',
                            'title',
                            'forum_page',
                            'statuses',
                            'topic_url'])

posts_df = posts_df.append({
    'date' : 'test_teeessest'},
    ignore_index= True)

posts_df

Unnamed: 0,text,date,ids_dict,post_num,auth_posts,prof_link,title,forum_page,statuses,topic_url
0,,test_teeessest,,,,,,,,


## Extracting the test html from the test url

In [9]:
# State the Row information that we need:
test_page_quan = url_df.loc[0,'pages']
test_url       = url_df.loc[0,'url']
test_done      = url_df.loc[0,'Topic_scrapped']

# Obtain the html from the url

In [16]:
test_soup = get_html(test_url)

## Extracting the data the want from each post of the html

In [51]:
test_title = test_soup.find(attrs={'class':'Topic-title'}).text

'D/C'

In [20]:
test_Topic_Posts = test_soup.find_all(attrs = {"class" :'TopicPost'})

In [None]:
for post in test_Topic_Posts:
    date                = post.find('a', {"class" :'TopicPost-timestamp'}).attrs['data-tooltip-content']
    words               = post.find('div',{"class" :'TopicPost-bodyContent'}).text 
    auth_posts_total    = post.find('a', {"class" :'Author-avatar '}).attrs

In [44]:
test_post = test_Topic_Posts[0]

test_date = test_post.find('a', 
    {"class" :'TopicPost-timestamp'}).attrs['data-tooltip-content']
print('Test date:')
print(test_date)
print('--------------')

test_words = test_post.find('div',
    {"class" :'TopicPost-bodyContent'}).text 
print('Test words:')
print(test_words)
print('--------------')

test_posts_total = test_post.find('a', {"class" :'Author-avatar '})
print("Test author's post total:")
print(test_posts_total)
print('--------------')

test_data_dict = test_post.attrs['data-topic-post']
print('Test data topic post dict:')
print("Includes the votes, post ID, author ID and author name")
print(test_data_dict)
print('--------------')

test_post_num_in_topic = test_post.attrs['id']
print('Test post # in topic:')
print(test_post_num_in_topic)
print('--------------')

test_status = test_post.attrs['data-topic']
print('Test')
print(test_status)
print('--------------')

#test_ = test_post.attrs['data-topic']
#print('Test')
#print(test_)
test_post.attrs.keys()

Test date:
05/23/2016 11:48 PM
--------------
Test words:
i can join a game, play for a few minutes and then it disconnects me from the game. says, searching for game server. Im in New Zealand
--------------
Test author's post total:
<a class="Author-avatar " href="https://playoverwatch.com/en-us/career/pc/us/Akasha-12113"><img alt="" src="https://blzgdapipro-a.akamaihd.net/game/unlocks/0x02500000000017AD.png"/></a>
--------------
Test data topic post dict:
Includes the votes, post ID, author ID and author name
{"id":"207422680682","valueVoted":0,"rank":{"voteUp":7,"voteDown":0},"author":{"id":"207423789371","name":"Akasha"}}
--------------
Test post # in topic:
post-1
--------------
Test
{ "sticky":"false","featured":"false","locked":"false","frozen":"false","hidden":"false","pollId":"0"}
--------------


dict_keys(['class', 'id', 'data-topic-post', 'data-topic'])

### References

In [None]:
test_dict = {}
#save_datum(test_dict,'text'      , words[post].contents)# Text of the post
#save_datum(test_dict,'date'      , dates[post].attrs['data-tooltip-content'])# Date of the (unedited) post
#save_datum(test_dict,'ids_dict'  , TopicPosts[post].attrs['data-topic-post'])# Author info & votes of the post
#save_datum(test_dict,'post_num'  , TopicPosts[post].attrs['id'])# Post number in the topic
#save_datum(test_dict,'title'     , title)# Title of Topic
#save_datum(test_dict,'forum_page', forum_page)# Page in the forum
#save_datum(test_dict,'topic_url' , topic_keys[post][0])
#save_datum(test_dict,'statuses'  , TopicPosts[post].attrs['data-topic'])
save_datum(test_dict,'prof_link' , prof_link[post].attrs['href'])
save_datum(test_dict,'auth_posts', auth_posts[post])  # Number of posts author has made


In [None]:
# THIS is for turning into the Requests library
list_of_dicts = []                                              # Instantiate empty list of DICKS to FILL UP
saver = './data/full_scrapes/Overwatch_Test_'                   # The path & starting name for saving junk

if forum_page % 100 == 0:
    save_posts(list_of_dicts, saver, forum_page) 
for topic_stats in topic_keys:         # For every title in the list
    for page in range(topic_stats[1]): # for page # in range of # of pages in the topic
        page += 1                      # Add 1 to compensate for starting at 0
        if page > 1:                   # Basically making sure it's not the first page
            URL = topic_stats[0] + '?page=' + str(page)     # This is basicslly for getting to all the pages
        else:
            URL = topic_stats[0]                            # This will be the first page

        # This chunk defines the lists of things that we want per row
        page_soup  = get_html(URL)                          # Soup of the page
        title      = page_soup.find(attrs={'class':'Topic-title'}).contents
        dates      = [e for e in page_soup.find_all('a',{"class" :'TopicPost-timestamp'}) if e.contents !=edt]
        words      = page_soup.find_all('div',{"class" :'TopicPost-bodyContent'})
        TopicPosts = page_soup.find_all(attrs = {"class" :'TopicPost'})
        auth_posts = [e.contents for e in page_soup.find_all('a',attrs = {'class' :'Author-posts'})]
        prof_link  = page_soup.find_all(attrs = {"class" :'Author-avatar '})
        print("forum page:", forum_page, "Posts in Topic:", topic_stats[1],"Page:",page,"URL:",URL )
        for post in range(len(words)):                           # For each post that's present
            list_of_dicts.append(turn_to_dict(post))        # Add the dict of post to list of posts
    print("forum page:", forum_page,"Topic Page:", page, "Last URL:", URL) # Sanity check
df = save_posts(list_of_dicts, saver, forum_page)               # Should both save the data & create a df to see

In [14]:
test_auth_posts_num = [int(e.find('a', attrs = {'class' :'Author-posts'}).text.strip('\n').strip(' posts')) for e in test_post_htmls]

test_dates = [e.find('a', {"class" :'TopicPost-timestamp'}).text.strip('\n').strip('\t').strip('\n') for e in test_post_htmls]

#test_dates = [e.find('a', {"class" :'TopicPost-timestamp'}).attrs['data-tooltip-content'] 
     for e in test_post_htmls]

test_words = [e.find('div',{"class" :'TopicPost-bodyContent'}).text 
    for e in test_post_htmls]

test_auth_posts = [e.find('a', {"class" :'Author-avatar '}).attrs for e in test_post_htmls]
test_auth_posts

test_prof.attrs

In [41]:
# Create the lists of desired contents
test_dates = [e.text.strip('\n').strip('\t').strip('\n') 
    for e in test_html.find_all('a', {"class" :'TopicPost-timestamp'})]

test_words = [e.text for e in test_html.find_all('div',{"class" :'TopicPost-bodyContent'})]

#TopicPosts = 
test_html.find_all(attrs = {"class" :'TopicPost'})[0].find_all('a',attrs = {'class' :'Author-posts'})

#len(test_words)
#test_html.find_all(attrs = {"class" :'TopicPost'})[1]
#auth_posts = [e.contents for e in test_html.find_all('a',attrs = {'class' :'Author-posts'})]
#prof_link  = test_html.find_all(a

[<a class="Author-posts" data-toggle="tooltip" data-tooltip-content="View Post History" href="/forums/en/overwatch/search?a=Akasha%2312113">
 3 posts
 </a>]

## Other Older Stuff

In [5]:
def save_posts(list_of_dicts, saver, forum_page):
    return pd.DataFrame(list_of_dicts, columns = ['text',
                            'date',
                            'ids_dict',
                            'post_num',
                            'auth_posts',
                            'prof_link',
                            'title',
                            'forum_page',
                            'stqatuses',
                            'topic_url']).to_csv(saver + str(forum_page), index=False)

In [6]:
words = []
dates = []
TopicPosts = []
title = []
forum_page = []
prof_link = []
auth_posts = []
topic_keys  = []
def turn_to_dict(post, words = words, 
                 dates = dates, TopicPosts = TopicPosts, 
                 title = title, forum_page = forum_page, 
                 prof_link = prof_link, auth_posts = auth_posts, topic_keys = topic_keys):
    post_dict = {}
    try:
        post_dict = {                               # Creation & Statement of dicts
            'text'      : words[post].contents,                     # Text of the post
            'date'      : dates[post].attrs['data-tooltip-content'],# Date of the (unedited) post
            'ids_dict'  : TopicPosts[post].attrs['data-topic-post'],# Author info & votes of the post
            'post_num'  : TopicPosts[post].attrs['id'] ,            # Post number in the topic
            'title'     : title,                    # Title of Topic
            'forum_page': forum_page,               # Page in the forum
            'topic_url' : topic_keys[post][0]}
    except:
        pass
    
    try:
        post_dict['statuses']   = TopicPosts[post].attrs['data-topic']
    except:
        pass
    
    try:
        post_dict['prof_link']  = prof_link[post].attrs['href']
    except:
        pass
    
    try:
        post_dict['auth_posts'] = auth_posts[post]  # Number of posts author has made
    except:
        pass
    
    return post_dict

In [7]:
edt = ['\n\t\t\t\t\t\t\t\xa0(Edited)\n']

In [10]:
# THIS is for turning into the Requests library
list_of_dicts = []                                              # Instantiate empty list of DICKS to FILL UP
saver = './data/full_scrapes/Overwatch_Test_'                   # The path & starting name for saving junk

for forum_page in range(1,101):                                 # Will go through all the forum pages specified
    topic_keys = urls_with_numbers('https://us.battle.net/forums/en/overwatch/22813879/?page='+ str(forum_page))
    if forum_page % 100 == 0:
        save_posts(list_of_dicts, saver, forum_page) 
    for topic_stats in topic_keys:                              # For every title in the list
        for page in range(topic_stats[1]):                      # for page # in range of # of pages in the topic
            page += 1                                           # Add 1 to compensate for starting at 0
            if page > 1:                                        # Basically making sure it's not the first page
                URL = topic_stats[0] + '?page=' + str(page)     # This is basicslly for getting to all the pages
            else:
                URL = topic_stats[0]                            # This will be the first page
            
            # This chunk defines the lists of things that we want per row
            page_soup  = get_html(URL)                          # Soup of the page
            title      = page_soup.find(attrs={'class':'Topic-title'}).contents
            dates      = [e for e in page_soup.find_all('a',{"class" :'TopicPost-timestamp'}) if e.contents !=edt]
            words      = page_soup.find_all('div',{"class" :'TopicPost-bodyContent'})
            TopicPosts = page_soup.find_all(attrs = {"class" :'TopicPost'})
            auth_posts = [e.contents for e in page_soup.find_all('a',attrs = {'class' :'Author-posts'})]
            prof_link  = page_soup.find_all(attrs = {"class" :'Author-avatar '})
            print("forum page:", forum_page, "Posts in Topic:", topic_stats[1],"Page:",page,"URL:",URL )
            for post in range(len(words)):                           # For each post that's present
                list_of_dicts.append(turn_to_dict(post))        # Add the dict of post to list of posts
        print("forum page:", forum_page,"Topic Page:", page, "Last URL:", URL) # Sanity check
df = save_posts(list_of_dicts, saver, forum_page)               # Should both save the data & create a df to see

FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?

In [None]:
# Tester to see if this will continue to run within different situations
for i in range(60):
    print(datetime.fromtimestamp(time()).strftime('%H:%M:%S'))
    sleep(30)

23:43:47
23:44:17
23:44:48
23:45:18
23:45:48
23:46:18
23:46:48
23:47:18
23:47:48
23:48:18
23:48:48
23:49:18
23:49:48
23:50:18
23:50:48
23:51:18
23:51:48
23:52:18
23:52:48
23:53:18
23:53:48
23:54:18
23:54:48
23:55:18
23:55:48
23:56:18
23:56:48
23:57:18
23:57:48
23:58:18
23:58:48
23:59:18
23:59:48
00:00:18
00:00:48
00:01:18
00:01:48
00:02:18


In [None]:
2:18

In [13]:
pd.DataFrame.from_csv('./data/urls0.csv')

  """Entry point for launching an IPython kernel.


Unnamed: 0,Forum_num,pages,url
0,0,1,https://us.battle.net/forums/en/overwatch/topi...
1,0,1,https://us.battle.net/forums/en/overwatch/topi...
2,0,238,https://us.battle.net/forums/en/overwatch/topi...
3,0,6,https://us.battle.net/forums/en/overwatch/topi...
4,0,10,https://us.battle.net/forums/en/overwatch/topi...
5,0,1,https://us.battle.net/forums/en/overwatch/topi...
6,0,1,https://us.battle.net/forums/en/overwatch/topi...
7,0,1,https://us.battle.net/forums/en/overwatch/topi...
8,0,1,https://us.battle.net/forums/en/overwatch/topi...
9,0,3,https://us.battle.net/forums/en/overwatch/topi...


In [24]:
def urls_with_numbers(forum_page_url, forum_page_num = False):
#current_forum_page = 'https://us.battle.net/forums/en/overwatch/22813879/?page='+ str(forum_page)
    forum_soup     = get_html(forum_page_url)                                       # Input URL get Soup
    topic_url_list = ['https://us.battle.net' + topic.attrs['href'] 
        for topic in forum_soup.find_all(attrs={'class': "ForumTopic"})]
    topic_tuples   = []                                                           # Above is list of urls of topics in forum page
    count          = 0                                                                   # Instantiating empty list and starting count
    for forum_topic in forum_soup.find_all(attrs={'class': "ForumTopic"}):      # for i in all th3e forum topic infos
        posts_num = json.loads(forum_topic.attrs['data-forum-topic']
            )['lastPosition'] # Turning each individual info bit into a dict
        topic_pages = posts_num//20 + 1                     # Number of posts in the Topic divided by num allowed per page +1 for 1st
        if posts_num%20 != 0 and posts_num > 20:            # If there's a remainder page
            topic_pages += 1                                # Add remainder page
        if forum_page_num == False:
            topic_tuples.append((topic_url_list[count],topic_pages)) # Add the url to the page amount in tuples
        else:
            topic_tuples.append((forum_page_num,topic_url_list[count],topic_pages))
        count += 1                                          # Keep track of the count
    return topic_tuples                                     # Returns list of tuples
