# Scraping Data

I imagine not all of you work for supermajors with access to every log or dataset known to man.  That being said, I don't think your boss is going to let you buy a thousand digital logs from a vendor at $150 a pop for a regional study to support a prospect.  Your tech, if you have access to one, is also going to want to murder you if you ask them to download files from the state one well at a time. To help with this, let's use Python to simulate a user interacting with a browser in a process known as scraping.

We'll touch on two styles of scraping today: with and without a browser.  A third style uses a [web spider](https://scrapy.org/), but we won't get to that today.

With scraping:
-  Check terms of service from the website.
-  Don't scrape aggressively, as you can cause enough traffic to affect other users. Be a Good Citizen! Don't be a dick. (i.e., Be Nice)
-  Just plan on the website changing from time to time and having to re-write scrapers.
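
One way to act on the checklist above is to read a site's `robots.txt` before scraping and to pace requests. The sketch below uses the standard library's `urllib.robotparser`. The `robots.txt` content, the `example.com` URLs, and the helper names `make_robot_rules` and `polite_delay` are made up for illustration, and a `robots.txt` file is not a substitute for reading the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def make_robot_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules

def polite_delay(rules, user_agent="*", default=5.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

rules = make_robot_rules(ROBOTS_TXT)
print(rules.can_fetch("*", "https://example.com/data/wells"))     # True
print(rules.can_fetch("*", "https://example.com/private/notes"))  # False
print(polite_delay(rules))                                        # 5.0
```

In a real scraper you would call `time.sleep(polite_delay(rules))` between requests and skip any URL for which `can_fetch` returns `False`.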

So let's all take an oath...

---

## Scraping Without a Browser
This is generally a much faster way of collecting data, but it doesn't handle data sources that use features designed to make scraping harder.  In this exercise we will use `geopandas` to get basic well information and `requests` to fetch our data; then we'll parse that data and store it to a `.csv` with `pandas`.  We'll walk through how to parse text and **build** a scraper for public data for this example.  After we test it, we'll roll it into its automated form with a function.

In [1]:
import pandas as pd
import requests
from numpy import nan
import geopandas as gpd
import time

%matplotlib inline
pd.options.display.max_columns = 999

-  Open the `wells.shp` to a dataframe.
-  Open COGCC's data portal in another tab in our browser. https://cogcc.state.co.us/data.html#/cogis
-  Then navigate to "facility".

Let's load in a dataframe of our Colorado wells and preview the data.

In [2]:
#Well shapefile
wells = gpd.read_file("data/Shapefiles/Wells/Wells.shp")
wells.head()

Unnamed: 0,API,API_Label,Operator,Well_Title,Facil_Id,Facil_Type,Facil_Stat,Operat_Num,Well_Num,Well_Name,Field_Code,Dist_N_S,Dir_N_S,Dist_E_W,Dir_E_W,Qtr_Qtr,Section,Township,Range,Meridian,Latitude,Longitude,Ground_Ele,Utm_X,Utm_Y,Loc_Qual,Field_Name,API_Seq,API_County,Loc_ID,Loc_Name,Spud_Date,Citing_Typ,Max_MD,Max_TVD,geometry
0,105000,05-001-05000,TOMBERLIN* BILL,1 UPRR-JOLLY,200001,WELL,DA,88925,1,UPRR-JOLLY,99999.0,664.0,S,646.0,E,SESE,35,3S,57W,6,39.741587,-103.727484,4911.0,609032,4399851,Planned Footage,WILDCAT,5000,1,311522.0,UPRR-JOLLY-63S57W 35SESE,1957-10-15,ACTUAL,5404.0,0.0,POINT (609032 4399851)
1,105001,05-001-05001,PLAINS EX,1-V STATE,200002,WELL,DA,70500,1-V,STATE,99999.0,1984.0,S,1839.0,E,NWSE,36,3S,57W,6,39.745257,-103.713113,4814.0,610257,4400276,Planned Footage,WILDCAT,5001,1,311524.0,STATE-63S57W 36NWSE,1962-06-19,ACTUAL,5308.0,0.0,POINT (610257 4400276)
2,105002,05-001-05002,JAMESON COMPANY* W L,1 JAMESON,269532,WELL,PA,100630,1,JAMESON,99999.0,,,,,C,35,3S,66W,6,39.747303,-104.744385,,521899,4399742,Planned Footage,WILDCAT,5002,1,311541.0,JAMESON-63S66W 35C,1917-09-01,HISTORICAL,1010.0,,POINT (521899 4399742)
3,105003,05-001-05003,SUPERIOR OIL COMPANY,1-31 UPRR-NOONAN,200003,WELL,DA,85100,1-31,UPRR-NOONAN,99999.0,2310.0,N,990.0,E,SENE,31,3S,58W,6,39.748516,-103.910839,5051.0,593312,4400413,Planned Footage,WILDCAT,5003,1,311603.0,UPRR-NOONAN-63S58W 31SENE,1956-01-06,ACTUAL,6086.0,,POINT (593312 4400413)
4,105004,05-001-05004,CHAMPLIN PETROLEUM COMPANY,1 PLACIDO RICO,200004,WELL,DA,15550,1,PLACIDO RICO,99999.0,660.0,N,1980.0,E,NWNE,32,3S,59W,6,39.752736,-104.008242,5030.0,584962,4400784,Planned Footage,WILDCAT,5004,1,311609.0,PLACIDO RICO-63S59W 32NWNE,1961-04-07,ACTUAL,6260.0,0.0,POINT (584962 4400784)


Now let's select all the wells in Jackson County (057).

In [3]:
#Jackson County API labels all start with '05-057-'
apis = wells.loc[wells['API_Label'].str.startswith('05-057-'), ['API_Label','Latitude','Longitude','geometry']]
apis.head()

Unnamed: 0,API_Label,Latitude,Longitude,geometry
30169,05-057-05000,40.775932,-106.253831,POINT (394193 4514640)
30170,05-057-05001,40.437766,-106.267009,POINT (392541 4477118)
30171,05-057-05002,40.440236,-106.201067,POINT (398138 4477314)
30172,05-057-05003,40.441426,-106.271739,POINT (392146 4477530)
30173,05-057-05004,40.441457,-106.276447,POINT (391746 4477539)


-  On the [COGCC](https://cogcc.state.co.us/data.html#/cogis) data site select WELL under facility type and select JACKSON county and search.
-  Click on a few wells. Notice that the URL doesn't change.
-  Now this time open a well in a new tab.
-  Notice that the URL is now specific to that well.

We're going to utilize this to get more information in a usable format for these wells.  Let's break out the non-unique portions of this URL to use.

In [4]:
baseURL = 'https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid='
tailURL = '&type=WELL'

Generally, websites like this will have a base URL separated by `?` from the variables that follow. Notice that COGCC drops the state code from the API number and uses no delimiter.
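
The URL pattern described above can be wrapped in a small helper. This is a minimal sketch using the standard library's `urllib.parse.urlencode`; the function name `facility_url` and the constant `FACILITY_URL` are made up for illustration.

```python
from urllib.parse import urlencode

FACILITY_URL = "https://cogcc.state.co.us/cogis/FacilityDetail.asp"

def facility_url(api_label, facility_type="WELL"):
    """Build a facility-detail URL from an API label such as '05-057-05128'."""
    digits = api_label.replace("-", "")   # '0505705128'
    facid = digits[2:]                    # drop the two-digit state code
    query = urlencode({"facid": facid, "type": facility_type})
    return f"{FACILITY_URL}?{query}"

print(facility_url("05-057-05128"))
# https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705128&type=WELL
```

Letting `urlencode` assemble the query string means any value that needs escaping is handled for you, which plain string concatenation does not do.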

In [5]:
url = baseURL+'05-057-05128'.replace('-','')[2:] + tailURL
print(url)
r = requests.get(url)
print('Encoding:', r.encoding)
print('RespCode:',type(r.status_code),r.status_code)

https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705128&type=WELL
Encoding: ISO-8859-1
RespCode: <class 'int'> 200


A response code of `200` lets us know that it was a good request. Now let's look at the text that COGCC sent back...

In [6]:
r.text

'\r\n  \r\n<html>\r\n\r\n<head>\r\n<script src="/urchin.js" type="text/javascript"></script>\r\n<script type="text/javascript">\r\nurchinTracker();\r\n</script>\r\n\r\n\r\n\t<title>COGIS - WELL Information</title>\r\n</head>\r\n\r\n<body onLoad=window.focus()>\r\n\r\n\r\n<font face="Arial" size="2">\r\n<!--\r\n<img SRC="images/s_cogcc_head.jpg" width="513" height="51" alt="Colorado Oil & Gas Conservation Commission"><br>\r\n<img SRC="images/s_head_fill.jpg" width="123" height="22">\r\n -->\r\n\r\n<p><font size="5" color="#000080" face="Arial"><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\r\n COGIS - WELL Information</b></font></p>\r\n\r\n\r\n\r\n\r\n\t\t\r\n\t\t<!-- BEGIN OUTPUT TO SCREEN -->\r\n\t\t<!-- BEGIN SURFACE INFORMATION -->\r\n\t\t\r\n\t\t<!-- HANDLE BAD API NUMBER -->\r\n\t\t\r\n\r\n\t\t<table cellspacing="1" cellpadding="1" border="0">\r\n\t\t\t<tr>\r\n\t\t\t\t<td colspan="4" bgcolor="#ffffcc">\r\n\t\t\t\t\t\t<font size="2">\r\n\t\t\t\t<table>\r\n\t\t\t\t<tr>\r\n\t\t\t\t\r\n\t\t\t\t\r\

Now look back at the website. Look for a keyword near the formation tops listed at the bottom of the page. Copy this keyword (`Ctrl+C`), move to this tab, open find (`Ctrl+F`), and scroll through until you find where the tops are. We are looking for unique sections of the string to split on.
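
Splitting on unique markers can be sketched as a small helper that returns `None` instead of raising an `IndexError` when a marker is missing. The name `between` and the sample `page` string are made up for illustration.

```python
def between(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker, or None."""
    _, found, rest = text.partition(start)
    if not found:
        return None
    body, found, _ = rest.partition(end)
    return body if found else None

# Made-up page text standing in for a real response
page = "<html><!-- BEGIN --><td>2076</td><!-- END --></html>"
print(between(page, "<!-- BEGIN -->", "<!-- END -->"))    # <td>2076</td>
print(between(page, "<!-- MISSING -->", "<!-- END -->"))  # None
```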

In [7]:
#Split out tops section
tops = r.text.split('<!-- LOOP FOR EACH Formation WITHIN WELLBORE -->')[1].split('<!-- END Formation LOOP -->')[0]
tops

'\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t   <tr>\r\n\t\t\t\t\t<td align="center"><font size="2" color="Navy">NIOBRARA                                          </td>\r\n\t\t\t\t    <td nowrap align="center"><font size="2">2076</font></td>\r\n\t\t\t\t    <td nowrap align="center"><font size="2"></font></td>\r\n\t\t\t\t\t<td nowrap align="center"><font size="2"></font></td>\r\n\t\t\t\t\t<td nowrap align="center"><font size="2"></font></td>\r\n\t\t\t\t</tr>\r\n\t\t\t\t\t\t  \r\n\t\t\t\t\r\n\t\t\t   \r\n\t\t\t   <tr>\r\n\t\t\t\t\t<td align="center"><font size="2" color="Navy">CARLILE                                           </td>\r\n\t\t\t\t    <td nowrap align="center"><font size="2">2332</font></td>\r\n\t\t\t\t    <td nowrap align="center"><font size="2"></font></td>\r\n\t\t\t\t\t<td nowrap align="center"><font size="2"></font></td>\r\n\t\t\t\t\t<td nowrap align="center"><font size="2"></font></td>\r\n\t\t\t\t</tr>\r\n\t\t\t\t\t\t  \r\n\t\t\t\t\r\n\t\t\t   \r\n\t\t\t   <tr>\r\n\t\t\t\t\t<td align=

In [8]:
#Remove special characters
tops = tops.replace('\r','').replace('\n','').replace('\t','').strip()
tops

'<tr><td align="center"><font size="2" color="Navy">NIOBRARA                                          </td>    <td nowrap align="center"><font size="2">2076</font></td>    <td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td></tr>        <tr><td align="center"><font size="2" color="Navy">CARLILE                                           </td>    <td nowrap align="center"><font size="2">2332</font></td>    <td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td></tr>        <tr><td align="center"><font size="2" color="Navy">FRONTIER                                          </td>    <td nowrap align="center"><font size="2">2676</font></td>    <td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size

In [9]:
#Split individual tops
tops = tops.split('<tr>')
tops = [x.strip() for x in tops if len(x) > 0]
tops

['<td align="center"><font size="2" color="Navy">NIOBRARA                                          </td>    <td nowrap align="center"><font size="2">2076</font></td>    <td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td></tr>',
 '<td align="center"><font size="2" color="Navy">CARLILE                                           </td>    <td nowrap align="center"><font size="2">2332</font></td>    <td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td></tr>',
 '<td align="center"><font size="2" color="Navy">FRONTIER                                          </td>    <td nowrap align="center"><font size="2">2676</font></td>    <td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td><td nowrap align="center"><font size="2"></font></td>

In [10]:
#Clean up data
for top in tops:
    cols = top.split('<td')
    cols = [x.strip() for x in cols if len(x) > 0]    
    formation = cols[0].split('color="Navy">')[1].split(' ')[0]
    depth = cols[1].split('<font size="2">')[1].split('</font>')[0]
    print(formation,':',depth)

NIOBRARA : 2076
CARLILE : 2332
FRONTIER : 2676
DAKOTA : 4327
FUSON : 4382
LAKOTA : 4436
MORRISON : 4456


OK, now let's roll all of that up into a function.

In [11]:
def top_parse(text):
    tops = text.split('<!-- LOOP FOR EACH Formation WITHIN WELLBORE -->')[1].split('<!-- END Formation LOOP -->')[0]
    tops = tops.replace('\r','').replace('\n','').replace('\t','').strip()
    tops = tops.split('<tr>')
    tops = [x.strip() for x in tops if len(x) > 0]
    
    formations = {}
    for top in tops:
        cols = top.split('<td')
        cols = [x.strip() for x in cols if len(x) > 0]    
        formation = cols[0].split('color="Navy">')[1].split(' ')[0]
        #int() is safer than eval() on text that came from someone else's website
        depth = int(cols[1].split('<font size="2">')[1].split('</font>')[0])
        formations[formation] = depth
    return formations


In [12]:
top_parse(r.text)

{'NIOBRARA': 2076,
 'CARLILE': 2332,
 'FRONTIER': 2676,
 'DAKOTA': 4327,
 'FUSON': 4382,
 'LAKOTA': 4436,
 'MORRISON': 4456}

Now let's iterate through our wells. It is _EXTREMELY_ important to add `try`/`except` to handle errors in scraping. Scrapers deal with other people's code, and things will go wrong. It's also a good idea on long scrapes to periodically save out your progress, as there's nothing worse than coming back to something that ran all weekend pulling data you need for a project only to see that it crashed.

In [13]:
topDF = pd.DataFrame()
i = 0
apiSample = apis.head(10) #We'll only do the first few for this example 
total = apiSample.shape[0]

for index, row in apiSample.iterrows(): 
    i += 1
    prec = str(int(100*i/total)) + '% complete  '
    print(row['API_Label'], prec, end='\r')
    try:
        url = baseURL+row['API_Label'].replace('-','')[2:]+tailURL
        r = requests.get(url, timeout=30) #Don't hang forever on a stalled connection

        if r.status_code == 200:
            formations = top_parse(r.text)
            formations['API'] = row['API_Label']
            #DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
            topDF = pd.concat([topDF, pd.DataFrame([formations])], ignore_index=True)
            time.sleep(5) #Wait 5 sec.
        else:
            print(row['API_Label'],':',r.status_code)
    except Exception as e:
        print('Error:',row['API_Label'],e)

05-057-05009 100% complete  
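
The advice about saving progress periodically can be sketched with the standard library alone. This is only an illustration: `scrape_with_checkpoints`, `_save`, `fake_fetch`, and the JSON checkpoint file are assumptions, and `fake_fetch` stands in for a real request-and-parse step.

```python
import json
import os
import tempfile

def _save(results, path):
    """Write results to a temp file, then swap it in so a crash can't leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(results, f)
    os.replace(tmp, path)

def scrape_with_checkpoints(ids, fetch, path, every=2):
    """Call fetch(id) for each id, saving results to `path` every `every` new records."""
    results = {}
    if os.path.exists(path):              # resume from an earlier run
        with open(path) as f:
            results = json.load(f)
    new_since_save = 0
    for item in ids:
        if item in results:               # already scraped, skip it
            continue
        try:
            results[item] = fetch(item)
        except Exception as e:            # log the failure and keep going
            print("Error:", item, e)
            continue
        new_since_save += 1
        if new_since_save >= every:
            _save(results, path)
            new_since_save = 0
    _save(results, path)                  # final save
    return results

def fake_fetch(api):
    """Stand-in for a real scrape; fails on one well to show the error handling."""
    if api == "05-057-05001":
        raise ValueError("bad page")
    return {"NIOBRARA": 2076}

checkpoint = os.path.join(tempfile.mkdtemp(), "tops.json")
tops = scrape_with_checkpoints(
    ["05-057-05000", "05-057-05001", "05-057-05002"], fake_fetch, checkpoint)
print(sorted(tops))  # ['05-057-05000', '05-057-05002']
```

Running the function again with the same checkpoint path skips wells that were already saved, so a crashed weekend run only has to pick up where it stopped.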

In [14]:
topDF.head()

Unnamed: 0,API,BENTON,CARLILE,CURTIS,DAKOTA,FRONTIER,FUSON,LAKOTA,MORRISON,NIOBRARA,PIERRE,COALMONT,MOWRY,MUDDY,ENTRADA,BENTONITE,GRANITE,CHUGWATER,SUNDANCE,CRETACEOUS,GREENHORN,SHANNON
0,05-057-05000,,,,,,,,,,,,,,,,,,,,,
1,05-057-05001,1600.0,1350.0,2389.0,2020.0,1570.0,2088.0,2109.0,2206.0,950.0,0.0,,,,,,,,,,,
2,05-057-05002,,,,,,,,,,,2424.0,,,,,,,,,,
3,05-057-05003,,460.0,,1140.0,675.0,1195.0,1219.0,,30.0,0.0,,1042.0,1125.0,,,,,,,,
4,05-057-05004,,,1212.0,858.0,380.0,,948.0,,0.0,,,,,1434.0,,,,,,,


I've gone ahead and pulled all the tops for Jackson County for you.  This took approximately an hour and a half for 771 records, to give you an idea of the time needed. These are available in the project folder.  This was a basic example with `requests`, but if this is something you would like to do regularly I suggest you also check out `urllib`.  There are packages available to make the searching and parsing of the HTML much easier, but when you're troubleshooting a tough website it's good to know what you are looking for.
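
As one illustration of letting a parser do the work, the sketch below uses the standard library's `html.parser` to collect table cells rather than splitting strings by hand. It is not the method used above; the class name `CellCollector` is made up, and `snippet` is a shortened stand-in shaped like the tops HTML shown earlier.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

# Shortened stand-in for the tops section of a well page
snippet = (
    '<tr><td align="center"><font size="2" color="Navy">NIOBRARA      </td>'
    '<td nowrap align="center"><font size="2">2076</font></td></tr>'
    '<tr><td align="center"><font size="2" color="Navy">CARLILE       </td>'
    '<td nowrap align="center"><font size="2">2332</font></td></tr>'
)

parser = CellCollector()
parser.feed(snippet)
parser.close()
tops = {name: int(depth) for name, depth in parser.rows if depth}
print(tops)  # {'NIOBRARA': 2076, 'CARLILE': 2332}
```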

---

## Scraping With a Browser Using Selenium

Scraping with a browser allows you to navigate around obstacles that are often put in place to discourage scraping, fill out forms, and interact with a website in ways that `requests` can't.  That being said, it can be significantly more challenging and can sometimes take much longer. In this example we will pull production data from COGCC. `selenium` locates "elements" of a web page and interacts with them to perform tasks. There are several [different methods](https://selenium-python.readthedocs.io/locating-elements.html) to locate elements. We will also use `bs4` to parse a table from HTML. BeautifulSoup uses tag names and parent/child relationships to make finding data easier.

I've previously written up this function but please open COGCC's [facility search](https://cogcc.state.co.us/cogis/FacilitySearch.asp) in a new tab. Select "Well", enter Weld County's code "123", and the sequence code "39340". Hit search. Select the well that comes up. Note the URL.

With that open, copy the link from the well name.  Notice that there is one of these per wellbore. Paste this url into a new tab. Now let's walk through finding elements & using tags to find the data you need.

In [15]:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

pd.options.display.max_columns = 50


In [16]:
def pull_CO_prod(api_05, df, driver, pull_excel=False):
    url = 'https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid='+api_05+'&type=WELL'
    print(url)
    driver.get(url)
    time.sleep(1)
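    #'tag name' is the value of selenium's By.TAG_NAME; the find_elements_by_* helpers were removed in Selenium 4.3
    anchors = driver.find_elements('tag name', 'a')
    #Some anchors have no href, so guard against None before searching it
    prod_wellbores = [a.get_attribute("href") for a in anchors if 'production' in (a.get_attribute("href") or '')]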
    print('prod_wellbores',prod_wellbores)
    for wellbore in prod_wellbores:
        driver.get(wellbore)
        time.sleep(1)
        
        #Download the file
        if pull_excel:
            #'xpath' is the value of selenium's By.XPATH
            dwnExcel = driver.find_element('xpath', '//*[@id="mainContent_btnExport"]')
            dwnExcel.click()

        #Table HTML
        table = driver.find_element('xpath', '//*[@id="mainContent_pnlResults"]/div')

        #BeautifulSoup
        soup = BeautifulSoup(table.get_attribute('innerHTML'), "html.parser")
        
        rows = soup.find_all('tr')
        row_list = []
        
        #Pull Header 
        for tr in rows[:1]:
            th = tr.find_all('th')
            row = [i.text for i in th]
            row_list.append(row)

        #Pull Rows
        for tr in rows[1:]:
            td = tr.find_all('td')
            row = [i.text.replace('\xa0','') for i in td]
            row_list.append(row)
        
        temp = pd.DataFrame(row_list[1:],columns=row_list[0])
        #Blank cells mean no days were reported; treat them as 0 before converting
        temp['Days Produced'] = temp['Days Produced'].replace('',0).astype(float)
        temp['Total_Days'] = temp['Days Produced'].cumsum()
        df = pd.concat([df,temp],ignore_index=True)

    #Return only after every wellbore has been processed
    return df, driver

## Give It a Try

Now that we have the function, complete the `for` loop below to feed the individual API numbers, minus the state code, to the function. Remember that you need to pass the dataframe and the driver to the function too.

Run it for the following wells: `0512339340`,`0512339383`,`0512339370`, & `0512339384`.

In [19]:
##I've laid out the format for you below. Make edits at *1, *2, & *3.

#*1: Make a list of your UWI codes
apis = ['0512339340','0512339383','0512339370', '0512339384']

#Make Driver
chromedriver = "chromedriver.exe"
driver = webdriver.Chrome(executable_path=chromedriver)

#*2: Make an Empty DataFrame
df = pd.DataFrame()

for api in apis:
    #*3: Insert the function w/ inputs and returned variables
    api_05 = api[2:]
    print(api_05)
    #pull_CO_prod already appends the new rows to df, so don't concat again
    df, driver = pull_CO_prod(api_05, df, driver)
    
driver.quit() #Closes every window and ends the driver session

12339340
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12339340&type=WELL
prod_wellbores ['https://cogcc.state.co.us/production/?&apiCounty=123&apiSequence=39340&APIWB=00&Year=All']
12339383
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12339383&type=WELL
prod_wellbores ['https://cogcc.state.co.us/production/?&apiCounty=123&apiSequence=39383&APIWB=00&Year=All']
12339370
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12339370&type=WELL
prod_wellbores ['https://cogcc.state.co.us/production/?&apiCounty=123&apiSequence=39370&APIWB=00&Year=All']
12339384
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12339384&type=WELL
prod_wellbores ['https://cogcc.state.co.us/production/?&apiCounty=123&apiSequence=39384&APIWB=00&Year=All']


In [20]:
#Preview your results
df.head()

Unnamed: 0,First of Month,Days Produced,Well Status,API County,API Sequence,API Sidetrack,Formation,BOM Inventory,Oil Produced,Oil Sold,Oil Adjustment,EOM Inventory,Oil Gravity,Gas Produced,Gas Flared,Gas Used,Gas Shrinkage,Gas Sold,Gas BTU,Gas Tubing Pressure,Gas Casing Pressure,Water Volume,Water Tubing Pressure,Water Casing Pressure,Water Disp Code,Total_Days
0,12/1/2019,31.0,PR,123,39340,0,NIOBRARA ...,16,428,417,,27,49.6,6673,,30,,6643,1308,,,153,,,M,31.0
1,11/1/2019,30.0,PR,123,39340,0,NIOBRARA ...,40,517,541,,16,51.3,6927,,28,,6899,1309,,,142,,,M,61.0
2,10/1/2019,27.0,PR,123,39340,0,NIOBRARA ...,43,532,535,,40,51.2,6976,,24,,6952,1302,,,91,,,M,88.0
3,9/1/2019,23.0,PR,123,39340,0,NIOBRARA ...,21,448,426,,43,53.2,7585,,20,,7565,1302,,,93,,,M,111.0
4,8/1/2019,12.0,PR,123,39340,0,NIOBRARA ...,43,216,238,,21,51.1,4038,,9,,4029,1302,,,38,,,M,123.0
