<a href="https://colab.research.google.com/github/Rocks-n-Code/PythonCourse/blob/master/6%20-%20Scraping%20Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I covered this notebook for Denver Data Drivers and you can follow along with the video [here.](https://www.youtube.com/watch?v=cO8fWCPp_6k)

---

# Scraping Data

I imagine not all of you work for supermajors with access to every log or dataset known to man.  That being said, I don't think your boss is going to let you buy a thousand digital logs from a vendor at $150 a pop for a regional study to support a prospect.  Your tech, if you have access to one, is also going to want to murder you if you ask them to download files from the state one well at a time. To help with this, let's use Python to simulate a user interacting with a browser in a process known as scraping.

We'll touch on two styles of scraping today: with and without a browser.  A third style uses a [web spider](https://scrapy.org/) but we won't get to that today.

With scraping:
-  Check terms of service from the website.
-  Don't scrape aggressively, as you can cause enough traffic to affect other users. Be a Good Citizen! Don't be a dick. (i.e. Be Nice)
-  Just plan on the website changing from time to time and having to re-write scrapers.
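
One way to act on the first two points is to read a site's `robots.txt` before scraping. The sketch below uses the standard library's `urllib.robotparser`; the rules, the `my-scraper` user agent, and the `example.com` URLs are made up for illustration and are parsed from memory, so no request is sent. A `robots.txt` file is not the same thing as the terms of service, so still read those.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this example).
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse from memory instead of fetching a URL

# Ask whether our (hypothetical) scraper may fetch each page
allowed = parser.can_fetch("my-scraper", "https://example.com/cogis/FacilityDetail.asp")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/report.html")

# Seconds the site asks us to wait between requests (None if not stated)
delay = parser.crawl_delay("my-scraper")

print(allowed, blocked, delay)
```

Against a real site you would point the parser at the live file with `set_url()` and `read()`, then pass the delay to `time.sleep()` between requests.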

So let's all take an oath...

---

## Scraping Without a Browser
This is generally a much faster way of collecting data, but it doesn't handle data sources that use features designed to make scraping harder.  In this exercise we will be using `geopandas` to get basic information and `requests` to fetch our data. We'll parse that data, then store it to a `.csv` with `pandas`.  We'll walk through how to parse text and **build** a scraper for public data for this example.  After we test it we'll roll it into its automated form with a function.

In [1]:
# To install packages in the Colab instance that are not normally available, run a
# command line command with "!"
!pip install geopandas


In [2]:
import pandas as pd
import requests
from numpy import nan
import geopandas as gpd
from shapely.geometry import Point
import time
import matplotlib.pyplot as plt
import re

%matplotlib inline
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999

-  Open the `wells.shp` to a dataframe.
-  Open COGCC's data portal in another tab in our browser. https://cogcc.state.co.us/data.html#/cogis
-  Then navigate to "facility".

Let's load in a dataframe of our Colorado wells and preview the data.

In [3]:
#Originally from COGCC well spot shapefile - Jackson County
rawurl = 'https://raw.githubusercontent.com/Rocks-n-Code/PythonCourse/master/data/Jackson_057.csv'
apis = pd.read_csv(rawurl)

#Fix raw csv geometry column
def str_to_point(point_string):
  #Pull the two numbers out of a string like 'POINT (x y)'
  x, y = point_string.split('(')[1].replace(')','').split(' ')[:2]
  #float() also handles coordinates stored with decimals
  return Point(float(x),float(y))

apis['geometry'] = apis['geometry'].apply(str_to_point)

print('Before:',type(apis))

#Change from pandas.DataFrame to geopandas.GeoDataFrame
apis = gpd.GeoDataFrame(apis,
                        geometry='geometry',
                        crs='EPSG:26913')

print('After:',type(apis))
apis.head()

Before: <class 'pandas.core.frame.DataFrame'>
After: <class 'geopandas.geodataframe.GeoDataFrame'>


Unnamed: 0,API_Label,Latitude,Longitude,geometry
0,05-057-05000,40.775932,-106.253831,POINT (394193.000 4514640.000)
1,05-057-05001,40.437766,-106.267009,POINT (392541.000 4477118.000)
2,05-057-05002,40.440236,-106.201067,POINT (398138.000 4477314.000)
3,05-057-05003,40.441426,-106.271739,POINT (392146.000 4477530.000)
4,05-057-05004,40.441457,-106.276447,POINT (391746.000 4477539.000)


-  Open [COGCC](https://cogcc.state.co.us/data5.html#/cogis_old) in a new tab.
-  On the [website](https://cogcc.state.co.us/data5.html#/cogis_old) select WELL under facility type and select JACKSON county and search.
-  Click on a few wells. Notice that the URL doesn't change.
-  Now this time open a well in a new tab (Right click + 'Open link in new tab').
-  Notice that the URL is now specific to that well.

We're going to utilize this to get more information in a usable format for these wells.  Let's break out the non-unique portions of this URL to use.

In [4]:
baseURL = 'https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid='
tailURL = '&type=WELL'

Generally websites like this will have a base URL separated from its variables by a `?`. Notice that COGCC drops the state code from the API number and uses no delimiter.
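
As a small sketch of that URL pattern, the standard library's `urlencode` can build the part after the `?` from a dictionary of variables. The helper name `facility_url` is made up for this example. The base address and the `facid` / `type` variables are the ones used in this notebook.

```python
from urllib.parse import urlencode

BASE = 'https://cogcc.state.co.us/cogis/FacilityDetail.asp'

def facility_url(api_label):
    """Build a facility-detail URL from an API label like '05-057-05128'."""
    # Drop the dashes, then the two-digit state code at the front
    facid = api_label.replace('-', '')[2:]
    # urlencode joins the variables as key=value pairs separated by '&'
    return BASE + '?' + urlencode({'facid': facid, 'type': 'WELL'})

print(facility_url('05-057-05128'))
```

Building the query from a dictionary keeps each variable visible, which helps when a site adds or renames one.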

In [8]:
url = baseURL + '05-057-05128'.replace('-','')[2:] + tailURL
print('URL:', url)
r = requests.get(url)
print('Encoding:', r.encoding)
print('RespCode:',type(r.status_code),r.status_code)

URL: https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705128&type=WELL
Encoding: ISO-8859-1
RespCode: <class 'int'> 200


A response code of `200` lets us know that it was a good request. Now let's look at the text that COGCC sent us back...

In [9]:
r.text

'\r\n  \r\n<html>\r\n\r\n<head>\r\n\r\n\r\n\r\n\t<title>COGIS - WELL Information</title>\r\n</head>\r\n\r\n<body onLoad=window.focus()>\r\n\r\n\r\n<font face="Arial" size="2">\r\n<!--\r\n<img SRC="images/s_cogcc_head.jpg" width="513" height="51" alt="Colorado Oil & Gas Conservation Commission"><br>\r\n<img SRC="images/s_head_fill.jpg" width="123" height="22">\r\n -->\r\n\r\n<p><font size="5" color="#000080" face="Arial"><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\r\n COGIS - WELL Information</b></font></p>\r\n\r\n\r\n\r\n\r\n\t\t\r\n\t\t<!-- BEGIN OUTPUT TO SCREEN -->\r\n\t\t<!-- BEGIN SURFACE INFORMATION -->\r\n\t\t\r\n\t\t<!-- HANDLE BAD API NUMBER -->\r\n\t\t\r\n\r\n\t\t<table cellspacing="1" cellpadding="1" border="0">\r\n\t\t\t<tr>\r\n\t\t\t\t<td colspan="4" bgcolor="#ffffcc">\r\n\t\t\t\t\t\t<font size="2">\r\n\t\t\t\t<table>\r\n\t\t\t\t<tr>\r\n\t\t\t\t\r\n\t\t\t\t\r\n\t\t\r\n\t\t\t\t<td valign="top"><font size="4" color="Navy">Scout Card</font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs

Since the last update of COGCC's website, we can actually send the HTML to `pandas` directly and receive a list of dataframes from the page.

In [10]:
df_list = pd.read_html(r.text)
print('Number of lists found:', len(df_list))
df_list[0]

Number of lists found: 2


Unnamed: 0,0,1,2,3,4
0,Scout Card <! -- remove comment tags from the ...,Scout Card <! -- remove comment tags from the ...,Scout Card <! -- remove comment tags from the ...,Scout Card <! -- remove comment tags from the ...,
1,Surface Location Data for API # 05-057-05128 ...,Surface Location Data for API # 05-057-05128 ...,Surface Location Data for API # 05-057-05128 ...,Surface Location Data for API # 05-057-05128 ...,
2,Well Name/No:,CAREY #1,(click well name for production),,
3,Operator:,RYAN OIL CO - 75500,RYAN OIL CO - 75500,RYAN OIL CO - 75500,
4,Status Date:,8/27/1998,Federal or State Lease #:,,
5,FacilityID:,212076,LocationID:,383777,
6,County:,JACKSON #057,Location:,NESW 19 10N78W 6 PM,
7,Field:,WILDCAT - #99999,Elevation:,"8,100 ft.",
8,Planned Location FL FL,Planned Location FL FL,Lat/Long: 40.820892/-106.19481,Lat/Long Calculated From Footages,
9,Wellbore Data for Sidetrack #00 Status: PA 8/2...,Wellbore Data for Sidetrack #00 Status: PA 8/2...,Wellbore Data for Sidetrack #00 Status: PA 8/2...,Wellbore Data for Sidetrack #00 Status: PA 8/2...,


Now we can see that most of the page's data is available in the first table. We'll parse that data down to what we need and define the column names.

In [11]:
# Select first df in list
tops = df_list[0]

#Find index of the top of the table
i = tops[tops[0] == 'Formation'].index.values[0]

#Set column names without spaces
tops.columns = [x.strip().replace(' ','_') for x in tops.loc[i,:].tolist()]

#Slice dataframe and reset the index
tops = tops[i + 1:].reset_index(drop=True)

#Preview our tops df
tops.head()

Unnamed: 0,Formation,Log_Top,Log_Bottom,Cored,DSTs
0,NIOBRARA,2076,,,
1,CARLILE,2332,,,
2,FRONTIER,2676,,,
3,DAKOTA,4327,,,
4,FUSON,4382,,,


Now we'll remove the unit and format the column content as float. I'll use a [regular expression](https://docs.python.org/3/howto/regex.html) to do this. `\D` matches any non-digit character.
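
To see what `\D` does on its own, here is a minimal sketch with made-up strings. The helper name `digits_only` is just for this example.

```python
import re

def digits_only(value):
    """Remove every non-digit character from a string."""
    return re.sub(r'\D', '', value)

print(digits_only('8,100 ft.'))   # commas, spaces, and units are removed
print(digits_only('2076'))        # already clean, so unchanged
print(digits_only('2076.5 ft'))   # careful: the decimal point is a non-digit too
```

The last case is worth keeping in mind: this pattern is only safe when the values are whole numbers, since a decimal point would be removed along with the unit.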

In [12]:
#For only the Log_Top & Log_Bottom columns
for col in tops.columns[1:3]:

  #Where the column is not null remove the non-numeric characters
  tops[col] = tops[col][tops[col].notnull()].apply(lambda x: re.sub(r'\D',
                                                                     '',
                                                                     x))
  #df[col] = df[col][where not null].apply(lambda x: re.sub(search for, 
  #                                                         replace with,
  #                                                         original string))

  tops[col] = tops[col].astype(float)
print(tops.dtypes)
tops.head()

Formation      object
Log_Top       float64
Log_Bottom    float64
Cored          object
DSTs           object
dtype: object


Unnamed: 0,Formation,Log_Top,Log_Bottom,Cored,DSTs
0,NIOBRARA,2076.0,,,
1,CARLILE,2332.0,,,
2,FRONTIER,2676.0,,,
3,DAKOTA,4327.0,,,
4,FUSON,4382.0,,,


Now that we have the tops parsed from the website HTML and formatted, we'll roll all of that code up into a function.

In [14]:
def top_parse(text):
  '''
  Input:
  text; str, html code from COGCC facility detail site

  Output:
  tops; df, DataFrame of formation tops
  '''
  #Create list of DataFrames
  df_list = pd.read_html(text)

  #Select first DF
  tops = df_list[0]
  
  #Test for no tops
  if 'Formation' not in tops[0].tolist():
    print('No Tops Found')
    return pd.DataFrame()
  
  #Set column names from the header row, then keep only the rows below it
  i = tops[tops[0] == 'Formation'].index.values[0]
  tops.columns = [x.strip().replace(' ','_') for x in tops.loc[i,:].tolist()]
  tops = tops[i + 1:].reset_index(drop=True)

  #Format Top and Bottom column
  cols = ['Formation','Log_Top','Log_Bottom','Cored','DSTs']
  tops = tops[cols].copy() #Copy to avoid SettingWithCopyWarning
  for col in cols[1:3]:
      #Strip non-digit characters (units, commas) where a value is present
      tops[col] = tops[col][tops[col].notnull()].apply(lambda x: re.sub(r'\D',
                                                                    '', x))
      try:
        tops[col] = tops[col].astype(float)
      except (ValueError, TypeError):
        print(col,'float type conversion error.')
  
  #Drop placeholder rows
  tops = tops[tops.Formation != 'No formation data to display.']
  tops = tops[tops.Formation.notnull() & ~tops.Formation.str.contains('No additional interval', na=False)]
  
  return tops

In [15]:
print(url)
top_parse(r.text)

https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705128&type=WELL


Unnamed: 0,Formation,Log_Top,Log_Bottom,Cored,DSTs
0,NIOBRARA,2076.0,,,
1,CARLILE,2332.0,,,
2,FRONTIER,2676.0,,,
3,DAKOTA,4327.0,,,
4,FUSON,4382.0,,,
5,LAKOTA,4436.0,,,
6,MORRISON,4456.0,,,


And iterate through our wells. It is _EXTREMELY_ important to add `try` / `except` to handle errors in scraping. Scrapers deal with other people's code and things *will* go wrong. It's also a good idea on long scrapes to periodically save out your progress, as there's nothing worse than getting back to something that ran all weekend pulling data that you need for a project only to see that it crashed.
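
The pattern of catching errors and saving progress can be sketched on its own, without hitting a website. Everything here is hypothetical: `scrape_all`, `fake_fetch`, and the file name are stand-ins, and the "fetch" just fails on purpose for one ID so you can see the loop carry on.

```python
import csv
import os
import tempfile

def scrape_all(ids, fetch_one, out_path, save_every=2):
    """Call fetch_one(id) for each id, saving progress every few records."""
    rows, failed = [], []
    for n, item_id in enumerate(ids, start=1):
        try:
            rows.append({'id': item_id, 'value': fetch_one(item_id)})
        except Exception as e:
            # Log the failure and keep going instead of crashing the run
            failed.append((item_id, str(e)))
        # Rewrite the progress file every save_every items and at the end
        if n % save_every == 0 or n == len(ids):
            with open(out_path, 'w', newline='') as f:
                writer = csv.DictWriter(f, fieldnames=['id', 'value'])
                writer.writeheader()
                writer.writerows(rows)
    return rows, failed

def fake_fetch(item_id):
    # Stand-in for a request + parse step; fails for one ID on purpose
    if item_id == 'bad':
        raise ValueError('page layout changed')
    return len(item_id)

out_path = os.path.join(tempfile.mkdtemp(), 'progress.csv')
rows, failed = scrape_all(['a', 'bad', 'ccc'], fake_fetch, out_path)
print(len(rows), 'saved;', len(failed), 'failed')
```

Keeping a list of failures lets you retry just those wells later instead of starting over.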

In [18]:
topDF = pd.DataFrame()
i = 0
apiSample = apis.head(10) #We'll only do the first few for this example 
total = apiSample.shape[0]

for index, row in apiSample.iterrows(): 
    i += 1
    prec = str(int(100*i/total)) + '% complete  '
    print(row['API_Label'], prec, end='\r')
    try:
        url = baseURL + row['API_Label'].replace('-','')[2:] + tailURL
        print(url)
        r = requests.get(url)

        if r.status_code == 200:
            formations = top_parse(r.text)
            formations['API'] = row['API_Label']
            # topDF = topDF.append(formations,ignore_index=True)
            topDF = pd.concat([topDF, formations],
                               ignore_index=True)
            time.sleep(5) #Wait 5 sec.
        else:
            print(row['API_Label'],':',r.status_code)
    except Exception as e:
        print('Error:',row['API_Label'],e)

topDF.head()

https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705000&type=WELL
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705001&type=WELL
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705002&type=WELL
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705003&type=WELL
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705004&type=WELL
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705005&type=WELL
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705006&type=WELL
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705007&type=WELL
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705008&type=WELL
https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=05705009&type=WELL
Log_Top float type conversion error.
Log_Bottom float type conversion error.


Unnamed: 0,Formation,Log_Top,Log_Bottom,Cored,DSTs,API
0,PIERRE,0.0,950.0,,,05-057-05001
1,NIOBRARA,950.0,1350.0,,,05-057-05001
2,CARLILE,1350.0,1570.0,,,05-057-05001
3,FRONTIER,1570.0,1600.0,,,05-057-05001
4,BENTON,1600.0,2020.0,,,05-057-05001


I've gone ahead and pulled all the tops for Jackson County for you.  To give you an idea of the time needed, this took approximately an hour and a half for 771 records. These are available in the project folder.  This was a basic example with `requests`, but if this is something you would like to do regularly I suggest you also check out `urllib`.  There are packages available to make searching and parsing the HTML much easier, but when you're troubleshooting a tough website it's good to know what you are looking for.

---

# Scraping with a Browser with Selenium

Scraping with a browser allows you to navigate around obstacles that are often put in place to discourage scraping, fill out forms, and interact with a website in ways that `requests` can't.  That being said, it can be significantly more challenging and can sometimes take much longer. In this example we will pull production data from COGCC. `selenium` locates "elements" of a web page and interacts with them to perform tasks. There are several [different methods](https://selenium-python.readthedocs.io/locating-elements.html) to locate elements. We will also use `bs4` to parse a table from HTML. BeautifulSoup uses tag names and parent-child relationships to make finding data easier.
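
Here is a minimal sketch of how BeautifulSoup walks from a table down to its rows and cells by tag name. The HTML is a made-up two-row table, not COGCC's actual page, though the column names mirror the production table used later.

```python
from bs4 import BeautifulSoup

# Hypothetical table HTML; &nbsp; is the padding many older sites use
html = """
<table>
  <tr><th>First of Month</th><th>Oil Produced</th></tr>
  <tr><td>1/1/2020</td><td>100&nbsp;</td></tr>
  <tr><td>2/1/2020</td><td>&nbsp;</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Each <tr> is a row; header cells are <th> and data cells are <td>
rows = soup.find_all('tr')
header = [th.text for th in rows[0].find_all('th')]

# &nbsp; comes through as '\xa0', so strip it from every cell
records = [[td.text.replace('\xa0', '') for td in tr.find_all('td')]
           for tr in rows[1:]]

print(header)
print(records)
```

Empty cells come back as empty strings, which is why the production columns need a cleanup step before they can be converted to numbers.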

I've previously written up this function but please open COGCC's [facility search](https://cogcc.state.co.us/cogis/FacilitySearch.asp) in a new tab. Select "Well", enter Weld County's code "123", and the sequence code "39340". Hit search. Select the well that comes up. Note the URL.

With that open, copy the link from the well name.  Notice that there is one of these per wellbore. Paste this url into a new tab. Now let's walk through finding elements & using tags to find the data you need.

In [44]:
%%shell
# Ubuntu no longer distributes chromium-browser outside of snap
#
# Proposed solution: https://askubuntu.com/questions/1204571/how-to-install-chromium-without-snap

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

Executing: /tmp/apt-key-gpghome.ZwXokuYni4/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
gpg: key DCC9EFBF77E11517: "Debian Stable Release Key (10/buster) <debian-release@lists.debian.org>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1
Executing: /tmp/apt-key-gpghome.MsnSwOLxIP/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
gpg: key DC30D7C23CBBABEE: "Debian Archive Automatic Signing Key (10/buster) <ftpmaster@debian.org>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1
Executing: /tmp/apt-key-gpghome.PuVo7wtuBh/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A
gpg: key 4DFAB270CAA96DFA: "Debian Security Archive Automatic Signing Key (10/buster) <ftpmaster@debian.org>" not changed
gpg: Total number processed: 1
gpg:              unchanged: 1
gpg: cannot open '/dev/tty': No such device or address
gpg: [stdout]: write error: Broken pipe
gpg: filter_flush failed on c



In [45]:
# To run selenium in Colab
# !apt update
# !apt install chromium-chromedriver
# !pip install selenium
#!apt-get update

# !wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb && apt install ./google-chrome-stable_current_amd64.deb



!apt-get update
!apt-get install chromium chromium-driver
!pip3 install selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "http://example.com/"
options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")

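# Make Web Driver (recent Selenium 4 releases no longer accept the driver
# path as a positional argument; chromedriver is found on the PATH)
driver = webdriver.Chrome(options=options)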


## If you want to run the code below in a local Jupyter Notebook, use this to
## create the driver
#from selenium import webdriver
#from selenium.webdriver.chrome.service import Service

## Make Driver
#chromedriver = "chromedriver.exe" #Path to your chromedriver - https://sites.google.com/a/chromium.org/chromedriver/
#driver = webdriver.Chrome(service=Service(chromedriver))


WebDriverException: ignored

In [42]:
!cat /etc/os-release

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal


In [43]:
!apt-get update
!apt-get install chromium chromium-driver
!pip3 install selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "http://example.com"
options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
driver = webdriver.Chrome(options=options)

driver.get(url)
print(driver.title)
driver.quit()


WebDriverException: ignored

In [36]:
!sudo apt install chromium-chromedriver

Reading package lists... Done
Building dependency tree       
Reading state information... Done
chromium-chromedriver is already the newest version (1:85.0.4183.83-0ubuntu0.20.04.3).
0 upgraded, 0 newly installed, 0 to remove and 26 not upgraded.
20 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Setting up libplist3:amd64 (2.1.0-4build2) ...
Setting up libxtst6:amd64 (2:1.2.3-1) ...
Setting up libxxf86dga1:amd64 (2:1.1.5-0ubuntu1) ...
Setting up chromium-sandbox (90.0.4430.212-1~deb10u1) ...
Setting up libicu63:amd64 (63.1-6+deb10u3) ...
Setting up notification-daemon (3.20.0-4) ...
Setting up libfontenc1:amd64 (1:1.1.4-0ubuntu1) ...
Setting up libjpeg62-turbo:amd64 (1:1.5.2-2+deb10u1) ...
Setting up libevent-2.1-6:amd64 (2.1.8-stable-4) ...
Setting up libusbmuxd6:amd64 (2.0.1-2) ...
Setting up libupower-glib3:amd64 (0.99.11-1build2) ...
Setting up libre2-5:amd64 (20200101+dfsg-1build1) ...
Setting up libxkbfile1:amd64 (1:1.1.0-1) ...
S

In [40]:
!sudo apt update
!sudo apt upgrade

[33m0% [Working][0m            Hit:1 http://deb.debian.org/debian buster InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.39)] [[0m                                                                               Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.39)] [[0m                                                                               Hit:3 http://deb.debian.org/debian buster-updates InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.39)] [[0m                                                                               Hit:4 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.39)] [[0m[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.39)] [[0m                                 

In [None]:
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from selenium.webdriver.common.by import By

pd.options.display.max_columns = 50

Elements can be found as a list with `driver.find_elements(By.<method>,<value>)` or individually with `driver.find_element(By.<method>,<value>)`.

In [None]:
print('"By" methods:',dir(By)[:8])

In [None]:
def pull_CO_prod(api_05, df, driver, pull_excel=False):
    '''
    Input:
    api_05; str, API number without the state code
    df; DataFrame, production pulled so far
    driver; selenium webdriver
    pull_excel; bool, also click the Excel export button

    Output:
    df, driver; updated DataFrame and the driver
    '''
    url = 'https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid='+api_05+'&type=WELL'
    print(url)
    driver.get(url)
    time.sleep(1)

    #Collect the production link for each wellbore (some <a> tags have no href)
    hrefs = [x.get_attribute("href") for x in driver.find_elements(By.TAG_NAME,'a')]
    prod_wellbores = [h for h in hrefs if h and 'production' in h]
    print('prod_wellbores',prod_wellbores)
    for wellbore in prod_wellbores:
        driver.get(wellbore)
        time.sleep(1)
        
        #Download the file
        if pull_excel:
            dwnExcel = driver.find_element(By.XPATH,'//*[@id="mainContent_btnExport"]')
            dwnExcel.click()
            
        #Table HTML
        table = driver.find_elements(By.TAG_NAME,'table')[-1]

        #BeautifulSoup
        soup = BeautifulSoup(table.get_attribute('innerHTML'), "html.parser")
        rows = soup.find_all('tr')

        #Pull Header
        header = [th.text for th in rows[0].find_all('th')]

        #Pull Rows
        row_list = [[td.text.replace('\xa0','') for td in tr.find_all('td')]
                    for tr in rows[1:]]
        
        temp = pd.DataFrame(row_list,columns=header)
        temp['First of Month'] = pd.to_datetime(temp['First of Month'])
        temp.sort_values(by='First of Month',inplace=True)

        df = pd.concat([df,temp],ignore_index=True)

    #Return only after every wellbore has been processed
    return df, driver

# Give it a try

Now that we have the function, complete the `for` loop below to feed the individual APIs, minus the state code, to the function. Remember that you need to pass the dataframe and the driver to the function too.

Run it for the following wells: `0512339340`,`0512339383`,`0512339370`, & `0512339384`.

In [20]:
##I've laid out the format for you below. Make edits at *1, *2, & *3.

apis =   '0512339340,0512339383,0512339370,0512339384'.split(',') #*1: Make a list of your UWI codes

df =  pd.DataFrame()#*2: Make an Empty DataFrame

for api in apis:
    
    api_05 = api[2:]
    print(api_05)
    df, driver =  pull_CO_prod(api_05, df, driver, pull_excel=False) #*3: Insert the function w/ inputs

df.head()

12339340


NameError: ignored

Once that works for you, let's convert some of the strings in that dataframe to floats.

In [None]:
#Set data types & preview data
cols = ['Oil Produced','Gas Produced','Water Volume','Days Produced']
for col in cols:
  #Treat empty strings as zero, then convert to float
  df[col] = df[col].replace('',0).astype(float)

df.head()

Plot cumulative oil curves.
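
Before plotting, it can help to see the cumulative sum on its own. This is a small sketch with made-up monthly volumes for two hypothetical wellbores. It uses the same column names as the production table but is not real COGCC data.

```python
import pandas as pd

# Made-up monthly volumes for two hypothetical wellbores
prod = pd.DataFrame({
    'API Sequence': ['39340', '39340', '39383', '39383'],
    'First of Month': pd.to_datetime(['2020-02-01', '2020-01-01',
                                      '2020-01-01', '2020-02-01']),
    'Oil Produced': [80.0, 100.0, 50.0, None],
})

# Sort by well and month so the running total goes forward in time
prod = prod.sort_values(['API Sequence', 'First of Month'])

# Missing months count as zero; cumsum restarts for each wellbore
prod['CumOil'] = (prod['Oil Produced']
                  .fillna(0)
                  .groupby(prod['API Sequence'])
                  .cumsum())

print(prod)
```

Sorting first matters: a cumulative sum over out-of-order months would still end at the right total but draw the wrong curve.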

In [None]:
fig=plt.figure(figsize=(15, 5))
ax=fig.add_subplot(111)

prod_cols = ['First of Month','API Sequence','Days Produced','Oil Produced']
for api_wb, group in df[prod_cols].groupby('API Sequence'):
  group = group.copy() #Work on a copy so the assignments below don't warn
  group['CumOil'] = group['Oil Produced'].fillna(0).cumsum()
  
  group['Days Produced'] = group['Days Produced'].replace('',0).astype(float)
  group['Total_Days'] = group['Days Produced'].cumsum()
  
  #Calendar days since the earliest production month in the dataset
  prod_start = df['First of Month'].min()
  group['Elapsed_Days'] = (group['First of Month'] - prod_start).dt.days.astype(float)

  ax.plot(group.Total_Days,
          group.CumOil,
          ls='-',
          label='05-123-'+str(api_wb),
          fillstyle='none')

plt.legend(loc=2)
plt.show()

---

# COGCCpy

Want all of that data in an easier-to-use package? Check out [COGCCpy](https://pypi.org/project/COGCCpy/).