Adapted from the notebook found at [How to Build a Law Bot](https://lawyerist.com/how-build-law-bot/)

## Install libraries

If you haven't already, you may need to install some dependencies. On the command line, run the following to install/update gspread, oauth2client, PyOpenSSL, and python-twitter.
```
pip install gspread
pip install --upgrade oauth2client
pip install PyOpenSSL
pip install python-twitter
```
Library installs are one and done. So after doing this once, you should be all set. 

## Import modules and set variables

Now we're getting into the bot's code. This is what will run every time your bot is called. To make sure it behaves as expected, replace the placeholder values found below in the `document_key`, `credentials`, `consumer_key`, `consumer_secret`, `access_token_key`, and `access_token_secret` variables with relevant values (e.g., your access credentials). 

You will need to create a new Google Sheet (same instructions as [last time](https://lawyerist.com/126074/online-forms-meet-local-document-automation-cut-and-paste-coding/)). You **MUST** add a first row with headings. If you don't, the code below won't work. For this example, make four columns, delete rows 2-999, and fill the one remaining row beneath your headings with zeros. The blank rows have to go because the code below appends values to the end of your sheet, so if you leave them in place, new values land after them, around row 1000. The code also reads your old values from the last row of the sheet, so with the blank rows still there it would see a blank row instead of your row of zeros.
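
To see why the blank rows matter, here is a minimal sketch that stands in for the sheet with a plain Python list of rows. It does not talk to Google Sheets at all; `sheet_with_blanks`, `trimmed_sheet`, and the `/wiki/Example` value are hypothetical names and data used only for illustration.

```python
# A stand-in for a Google Sheet: each inner list is one row.
# (Illustration only; the real sheet is accessed through gspread later on.)
header = ["timestamp", "wiki", "wiki name", "temperature"]
zeros = ["0", "0", "0", "0"]
blank = ["", "", "", ""]

# A fresh sheet with the unused rows left in place (1000 rows in total).
sheet_with_blanks = [header, zeros] + [blank] * 998
# The same sheet after deleting the unused rows.
trimmed_sheet = [header, zeros]

# "Last row" lookups see a blank row unless the sheet is trimmed.
print(sheet_with_blanks[-1])  # ['', '', '', '']
print(trimmed_sheet[-1])      # ['0', '0', '0', '0']

# Appending always lands after the final row, blank or not.
new_row = ["2017-10-18 21:42:16", "/wiki/Example", "Example", "60"]
sheet_with_blanks.append(new_row)
trimmed_sheet.append(new_row)
print(len(sheet_with_blanks), len(trimmed_sheet))  # 1001 3
```

In the untrimmed version the new values end up in row 1001, far from the headings, which is the situation the instructions above are meant to avoid.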

As for a Twitter account and Twitter credentials, follow the instructions in [this post](https://lawyerist.com/?p=127093). 

*NOTE: You should be reading all of the comments (i.e., text following a #)*

In [1]:
# Load the module for visiting and reading websites.
import urllib.request
# Load the module for running regular expressions (regex).
import re 
# Load the module for date and time stuff.
import datetime
# Define the variable now as equal to the current date and time.
now = datetime.datetime.now()

In [2]:
# Set the URLs you want to scrape.
url_1 = "https://en.wikipedia.org/wiki/Main_Page"
url_2 = "http://forecast.weather.gov/MapClick.php?lat=42.36715360000011&lon=-71.10340049999996#.Wd6C8VuPJEY"

In [3]:
# Load the module for accessing Google Sheets.
import gspread
# Load the module needed for securely communicating with Google Sheets.
from oauth2client.service_account import ServiceAccountCredentials
# The scope for your access credentials
scope = ['https://spreadsheets.google.com/feeds']

# Your spreadsheet's ID
document_key = "1MgQXqAakTpZQSUYj4hdjVppncTXg3AD-LDi8ZeBjRMw"
# Your Google project's .json key
credentials = ServiceAccountCredentials.from_json_keyfile_name('../../../../../SheetsBot-51db789eba6b.json', scope)

# Use your credentials to authorize yourself.
gc = gspread.authorize(credentials)
# Open up the Sheet with the defined ID.
wks = gc.open_by_key(document_key)

#########################################
#
#  NOTE: The name of the sheet you are 
#  trying to access should be in the 
#  parenthetical below (e.g., Data). By
#  Default this is probably "Sheet1".
#
#########################################
worksheet = wks.worksheet("Sheet1")

# Resize the Sheet to its current row count. This call leaves the
# number of rows unchanged, so blank rows must be deleted by hand.
worksheet.resize(worksheet.row_count)

In [4]:
# download spreadsheet
import csv
csvfile = "output.csv"
list_of_lists = worksheet.get_all_values()
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerows(list_of_lists)

import pandas as pd
output = pd.read_csv(csvfile)
output[:3]

Unnamed: 0,timestamp,wiki,wiki name,tempature
0,0,0,0,0


In [5]:
# Import the relevant Twitter libraries so you can use Twitter.
import twitter
from twitter import TwitterError

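# Read each credential from its own file. strip() removes the trailing
# newline that many editors add, which would otherwise end up in the key.
with open('../../../../../key.txt', 'r') as myfile:
    key=myfile.read().strip()
    
with open('../../../../../secret.txt', 'r') as myfile:
    secret=myfile.read().strip()
    
with open('../../../../../token_key.txt', 'r') as myfile:
    token_key=myfile.read().strip()

with open('../../../../../token_secret.txt', 'r') as myfile:
    token_secret=myfile.read().strip()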

# Set your Twitter API credentials.
api = twitter.Api(consumer_key=key,
                  consumer_secret=secret,
                  access_token_key=token_key,
                  access_token_secret=token_secret)

## Read the contents of your first webpage

When you run the next cell, your program will visit the first URL you defined above. It will then print out that page's HTML. 

In [6]:
p_1 = urllib.request.build_opener(urllib.request.HTTPCookieProcessor).open(url_1).read()
print(p_1)

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":798174323,"wgRevisionId":798174323,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMont

## Parse the site's contents

Scan the above HTML for the content you are trying to extract. Cut and paste the HTML above into the TEST STRING box over at [Regex 101](https://regex101.com/) and craft a regex that captures your desired content. 

Remember, the parenthetical is the group you're pulling out. Once you have a working regex, plug it into the code below, and run the cell. If it worked, you'll see your scraped data as output. 

In [7]:
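# The rb prefix makes this a raw bytes pattern, so no backslash escapes are needed.
# Group 1 is the article's path and group 2 is its bolded title.
res_1 = re.search(rb'<a href="(/wiki/[^"]*)"[^>]*><b>([^<]*)</b></a>', p_1)
# Convert the matched bytes into ordinary strings.
output_1 = res_1.group(1).decode('UTF-8')
output_2 = res_1.group(2).decode('UTF-8')
print(output_1, output_2)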

/wiki/Henry_III_of_England Henry III
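
If you want to experiment with capture groups without hitting the live page, the sketch below runs the same kind of search on a small, made-up snippet of HTML. The `sample_html` content and the variable names are assumptions for illustration only; the real page's markup may differ.

```python
import re

# A tiny, made-up snippet of HTML as bytes, the same type that .read() returns.
sample_html = b'<p><a href="/wiki/Example_Article" title="Example"><b>Example Article</b></a></p>'

# Each pair of parentheses is a capture group: group(1) is the link path,
# group(2) is the bold link text. The rb prefix makes this a raw bytes pattern.
pattern = rb'<a href="(/wiki/[^"]*)"[^>]*><b>([^<]*)</b></a>'
match = re.search(pattern, sample_html)

# Searching bytes returns bytes, so decode the groups to get ordinary strings.
link_path = match.group(1).decode("UTF-8")
link_text = match.group(2).decode("UTF-8")
print(link_path, link_text)  # /wiki/Example_Article Example Article
```

Because the page is read as bytes, the pattern has to be bytes too; mixing a `str` pattern with bytes input raises a `TypeError`.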


## Read the contents of your second webpage

Same deal as above, but now we're looking at your second URL. 

In [8]:
p_2 = urllib.request.build_opener(urllib.request.HTTPCookieProcessor).open(url_2).read()
print(p_2)



## Parse the site's contents

Again, the same as above, but with a new regex on a new page.

In [9]:
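# A raw bytes pattern (rb prefix) keeps \d from being treated as a string escape.
# Group 1 captures the digits of the current temperature.
res_2 = re.search(rb'<p class="myforecast-current-lrg">(\d+).*</p>', p_2)
output_3 = res_2.group(1).decode('UTF-8')
print(output_3)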

60


## Combine Stuff

Now we're going to take the values you found above and do something with them. The new thing you'll be seeing in this code is the `if` statement. In Python, if you type `if [some evaluation]:` then the code directly below that statement and indented once will run only if that evaluation is true. For example:

In [10]:
# The if statement below says: if both regex searches found a match, do what follows.
# (re.search returns None, which counts as false, when nothing matches.)
if res_1 and res_2: 
    # output_1, output_2, and output_3 were already decoded from the matches above.
    # Combine them. Then store the value in the variable named "titles."
    titles = " | ".join([output_1, output_2, output_3])
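
The same `if` pattern can be tried on its own. The sketch below is a standalone illustration with a made-up string standing in for a downloaded page; `page`, `found`, `missing`, and `messages` are hypothetical names, not part of the bot.

```python
import re

# Made-up text standing in for a downloaded page.
page = "Current temperature: 60 degrees"

found = re.search(r"(\d+) degrees", page)    # matches, so this is a match object
missing = re.search(r"(\d+) kelvin", page)   # no match, so this is None

messages = []
if found:
    # Runs: a match object counts as true.
    messages.append("found " + found.group(1))
if missing:
    # Skipped: None counts as false.
    messages.append("found " + missing.group(1))
if found and missing:
    # Skipped: "and" needs both sides to be true.
    messages.append("found both")

print(messages)  # ['found 60']
```

Checking the search result before calling `.group()` is what keeps the code from failing with an `AttributeError` when a page changes and the regex stops matching.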

In [11]:
# Print out the old values stored in the last row of your sheet.
# Note: The first time you run this code, you will see your row of zeros, as nothing else has been stored in your sheet yet.
last_row = worksheet.row_values(worksheet.row_count)
print("%s | %s | %s | %s" % (last_row[0], last_row[1], last_row[2], last_row[3]))

0 | 0 | 0 | 0


In [12]:
status = api.PostUpdate('It\'s %s °F in Cambridge, MA, and today\'s featured Wikipedia article is about %s: http://www.wikipedia.org%s'%(output_3,output_2,output_1))
print(status.text)

It's 60 °F in Cambridge, MA, and today's featured Wikipedia article is about Henry III: https://t.co/QhVdljyi2d


## Post to Twitter and Save to Google

In [13]:
# Read the last row of the sheet once so we can compare against it.
last_row = worksheet.row_values(worksheet.row_count)
# Only continue if the scrape worked and at least one value has changed.
# The inner parentheses matter: without them, "and" binds tighter than "or".
if res_1 and (last_row[1] != output_1
              or last_row[2] != output_2
              or last_row[3] != output_3):
    
    try:
        # Post to Twitter.
        status = api.PostUpdate('It\'s %s °F in Cambridge, MA, and today\'s featured Wikipedia article is about %s: http://www.wikipedia.org%s'%(output_3,output_2,output_1))
        print(status.text)
    except TwitterError:
        # If Twitter rejects the post (e.g., as a duplicate), retry with slightly different wording.
        status = api.PostUpdate('It is %s °F in Cambridge, MA, and today\'s featured Wikipedia article is about %s: http://www.wikipedia.org%s'%(output_3,output_2,output_1))
        print(status.text)

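    # Save to Google only after Tweeting. The timestamp is converted to text
    # first, since a datetime object may not be accepted as a cell value.
    worksheet.append_row([now.strftime('%Y-%m-%d %H:%M:%S'), output_1, output_2, output_3])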

It is 60 °F in Cambridge, MA, and today's featured Wikipedia article is about Henry III: https://t.co/QhVdljyi2d
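
The post-only-when-something-changed logic can be sketched without Twitter or Google Sheets. In the hedged example below, a plain list stands in for the sheet and a stub function stands in for the Twitter call; `should_post`, `post`, `run_once`, and the second timestamp are hypothetical and exist only for illustration.

```python
def should_post(last_row, new_values):
    """Return True when any scraped value differs from the last saved row.

    last_row is a row as stored in the sheet: [timestamp, value, value, value].
    new_values is the list of freshly scraped values, without a timestamp.
    """
    # Skip the timestamp in column 0; it changes on every run.
    return last_row[1:] != new_values


# A plain list stands in for the sheet; post() stands in for the Twitter call.
rows = [["timestamp", "wiki", "wiki name", "temperature"], ["0", "0", "0", "0"]]
posted = []


def post(text):
    posted.append(text)


def run_once(timestamp, scraped):
    if should_post(rows[-1], scraped):
        post("New values: " + ", ".join(scraped))
        # Save only after posting, mirroring the order used in the cell above.
        rows.append([timestamp] + scraped)


# First run: the values differ from the row of zeros, so it posts and saves.
run_once("2017-10-18 21:42:16", ["/wiki/Henry_III_of_England", "Henry III", "60"])
# Second run with identical values: nothing is posted or saved.
run_once("2017-10-18 22:42:16", ["/wiki/Henry_III_of_England", "Henry III", "60"])
print(len(posted), len(rows))  # 1 3
```

Saving only after a successful post means a failed post leaves the sheet untouched, so the next run will try again with the same values.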


In [14]:
print(worksheet.row_values(worksheet.row_count))
#############################
# DELETE CELL AFTER TESTING
#############################

['2017-10-18 21:42:16', '/wiki/Henry_III_of_England', 'Henry III', '60']
