## Introduction
In this notebook we will work with Wellbore report summaries from Norwegian Petroleum Directorate (NPD). The wellbore datasets are public domain.<br> They are available here: https://hotell.difi.no/?dataset=npd/wellbore/with-history (well report summaries).<br> The following link should go to all data available from Norwegian Oil Directorate, in public domain: https://data.norge.no/datasets/4304aea1-53b1-47ed-beae-52bf4d3642f3 (link might get broken over time).<br>
We will explore different NLP methods to extract insights and also create a few visualisations.
#### UNDER CONSTRUCTION, CHECK FOR UPDATES! :-))

---

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint
import re, os, operator
import collections 
import matplotlib.pyplot as plt
import requests
import json

import nltk
from nltk.corpus import stopwords

import altair as alt
# add the line below if you use notebooks and comment it if you use jupyter lab
# alt.renderers.enable('notebook')

#### API connection - check for status
Make a general get request, print out status code. 
Status code = 200 means API connection is up and running.

In [2]:
requests.get('http://data.norge.no/api/dcat/870917732/data.json')

<Response [200]>

#### Get request - Well summaries
The datasets are paginated. The first page will tell us how many pages there are in total, and how many rows for the whole dataset.

In [3]:
wells_summary = requests.get('http://hotell.difi.no/api/json/npd/wellbore/with-history?page=1')
json_wells_summary = wells_summary.json()

In [4]:
print(f'number of pages: {json_wells_summary["pages"]}')
print(f'number of rows:  {json_wells_summary["posts"]}')

number of pages: 19
number of rows:  1824


#### Let's go through all the pages and append all the rows in JSON dictionary
Create a pandas dataframe from the json file

In [5]:
def get_all_data(url_dataset, df_out_name):
    '''looping through all pages,
    append in JSON dictionary,
    create dataframe
    '''
    
    page = json_wells_summary['pages']
    all_data_list = []
    for i in range(0, page):
        url = url_dataset + f'?page={i+1}'
        dataset_all = requests.get(url)
        dataset_all_json = dataset_all.json()
        number_of_entries = len(dataset_all_json['entries'])

        for entry in range(0, number_of_entries):
            all_data_list.append(dataset_all_json['entries'][entry])

    df_out_name = pd.DataFrame(all_data_list)
    return df_out_name

#### Getting all our well summaries into a dataframe

In [6]:
url_dataset ='http://hotell.difi.no/api/json/npd/wellbore/with-history'
df_out_name = 'df_all_summaries'
df_all_summaries = get_all_data(url_dataset, df_out_name)

In [8]:
df_all_summaries.shape

(1824, 5)

In [9]:
df_all_summaries.head()

Unnamed: 0,wlbHistoryDateUpdated,wlbName,wlbNPDID_wellbore,wlbHistory,datesyncNPD
0,7/6/2016 12:00:00 AM,1/2-1,1382,<p><b>General</b></p>\n\n<p>Well 1/2-1 is loca...,15.01.2021
1,4/11/2017 12:00:00 AM,1/2-2,5192,<p>The 1/2-2 well was drilled to evaluate the ...,15.01.2021
2,5/19/2016 12:00:00 AM,1/3-1,154,<p class=MsoBodyText><b><span lang=EN-GB>Gener...,15.01.2021
3,4/11/2017 12:00:00 AM,1/3-10,5614,<p class=MsoBodyText><b><span lang=EN-GB>Gener...,15.01.2021
4,4/11/2017 12:00:00 AM,1/3-10 A,5779,<p class=MsoBodyText><b><span lang=EN-GB>Gener...,15.01.2021


#### Display a complete summary - we will need to clean all this html code!

In [10]:
# Display well summary (wlbHistory) 
df_all_summaries.iloc[1500]['wlbHistory']

'<p class=MsoBodyText><b><span lang=EN-GB>General</span></b></p>\n\n<p class=MsoBodyText><span lang=EN-GB>Well 6508/1-2 was drilled on the Skaugumsåsen\nprospect in the south-western end of the Helgeland Basin in the Norwegian Sea, about\nten kilometres south of the Norne field. The primary objective was to prove\npetroleum in reservoirs of the Early Jurassic Båt Group. A Secondary objective was\nto test the reservoir and HC potential of the Paleocene Tare Formation.</span></p>\n\n<p class=MsoBodyText><b><span lang=EN-GB>Operations and results</span></b></p>\n\n<p class=MsoBodyText><span lang=EN-GB>A 9 7/8&quot; pilot well 6508/1-U-2 was\ndrilled to 1305 m to check for shallow gas. No indications of shallow gas were\nseen. Wildcat well 6508/1-2 was spudded with the semi-submersible installation Aker\nBarents on 20 August 2011 and drilled to TD at 1810 m in the Early Jurassic\nTilje Formation. No significant problem was encountered in the operations. The\nwell was drilled with seawater 

#### Getting geolocation information for all wells - we will append this information to our well summary dataframe

In [11]:
url_dataset = 'http://hotell.difi.no/api/json/npd/wellbore/with-coordinates'
df_out_name = 'df_all_with_coord'
df_all_with_coord = get_all_data(url_dataset, df_out_name)

In [15]:
df_all_with_coord.shape

(1900, 26)

In [16]:
df_all_with_coord.head()

Unnamed: 0,wlbUtmZone,wlbWellType,wlbEwDeg,wlbEwCode,wlbEntryDate,wlbNsSec,wlbEwMin,wlbEwSec,wlbNsDecDeg,wlbProductionLicence,...,wlbEwUtm,wlbNsMin,wlbWellboreName,wlbContent,wlbDrillingOperator,wlbEwDesDeg,wlbNsUtm,wlbGeodeticDatum,datesyncNPD,wlbMainArea
0,31,EXPLORATION,2,E,20.03.1989,15.07,28,35.7,56.887519,143,...,468106.29,53,1/2-1,OIL,Phillips Petroleum Norsk AS,2.476583,6305128.26,ED50,15.01.2021,NORTH SEA
1,31,EXPLORATION,2,E,14.12.2005,32.0,29,47.66,56.992222,143 CS,...,469410.1,59,1/2-2,OIL SHOWS,Paladin Resources Norge AS,2.496572,6316774.33,ED50,15.01.2021,NORTH SEA
2,31,OTHER,2,E,09.05.2009,55.58,27,5.01,56.948772,143,...,466625.99,56,1/2-U-1,,ConocoPhillips Skandinavia AS,2.451392,6311958.73,ED50,15.01.2021,NORTH SEA
3,31,OTHER,2,E,12.05.2009,56.95,27,7.69,56.949153,143,...,466671.62,56,1/2-U-2,,ConocoPhillips Skandinavia AS,2.452136,6312000.73,ED50,15.01.2021,NORTH SEA
4,31,OTHER,2,E,12.05.2009,54.99,27,8.52,56.948608,143,...,466685.16,56,1/2-U-3,,ConocoPhillips Skandinavia AS,2.452367,6311940.01,ED50,15.01.2021,NORTH SEA
