# Discover the dataset

### About the dataset
In June 2018, the Norwegian oil and gas company **Equinor** announced that it would share a complete set of data from the Norwegian continental shelf for research and study purposes. Academic institutions, students, and researchers may use this dataset under the [Equinor Open Data Licence](https://cdn.equinor.com/files/h61q9gi9/global/de6532f6134b9a953f6c41bac47a0c055a3712d3.pdf?equinor-hrs-terms-and-conditions-for-licence-to-data-volve.pdf) without any further written permission.

The Volve production data was released as an Excel file with two sheets: **Daily Production Data** and **Monthly Production Data**.

## Key points about the industry

### Definitions

**Wellbore**: a hole drilled to aid in the exploration and recovery of natural resources, including oil, gas, or water.

**Hydrocarbons**: a term generally used to designate oil and gas.

### Important notes

In the Oil & Gas industry, when a wellbore starts producing hydrocarbons, the amount of pressure required to move the hydrocarbons from the reservoir to the surface is provided by natural forces, also known as [natural drive mechanisms](https://wiki.aapg.org/Reservoir_drive_mechanisms).

However, these forces get depleted with time and become insufficient to drive the hydrocarbons to the surface. To tackle this problem, [secondary recovery methods](https://www.britannica.com/technology/petroleum-production/Recovery-of-oil-and-gas#ref623983) were introduced, which consist of injecting water or gas to displace the hydrocarbons and drive them through a production wellbore to the surface.

In [1]:
import pandas as pd

In [2]:
volve_df = pd.read_excel(io="../../Data/raw-data/Volve production data.xlsx", sheet_name="Monthly Production Data")

In [3]:
volve_df

Unnamed: 0,Wellbore name,NPDCode,Year,Month,On Stream,Oil,Gas,Water,GI,WI
0,,,,,hrs,Sm3,Sm3,Sm3,Sm3,Sm3
1,15/9-F-1 C,7405.0,2014.0,4.0,227.5,11142.47,1597936.65,0,,
2,15/9-F-1 C,7405.0,2014.0,5.0,733.83334,24901.95,3496229.65,783.48,,
3,15/9-F-1 C,7405.0,2014.0,6.0,705.91666,19617.76,2886661.69,2068.48,,
4,15/9-F-1 C,7405.0,2014.0,7.0,742.41666,15085.68,2249365.75,6243.98,,
...,...,...,...,...,...,...,...,...,...,...
522,15/9-F-5,5769.0,2016.0,5.0,732,9724.4,1534677.16,3949.9,,0
523,15/9-F-5,5769.0,2016.0,6.0,718.41667,9121.48,1468557.12,2376.93,,
524,15/9-F-5,5769.0,2016.0,7.0,668.64168,9985.29,1602674.39,2453.71,,0
525,15/9-F-5,5769.0,2016.0,8.0,608.425,8928.9,1417278.51,2371.86,,0


## Questions and Answers

### What is the dataset about?

- The dataset provides hydrocarbon production data from the Volve field in Norway.

### What does each column tell us about?

|Column        | Information                                        | Unit                         |
|:-------------|:--------------------------------------------------:|-----------------------------:|
|Wellbore name | Name of the wellbore                               | NA                           |
|NPDCode       | Norwegian Petroleum Directorate code               | NA                           |
|Year          | Year of the record                                 | NA                           |
|Month         | Month of the record                                | NA                           |
|On Stream     | Time the wellbore was in operation                 | hours (hrs)                  |
|Oil           | Volume of oil produced                             | standard cubic meters (Sm3)  |
|Gas           | Volume of gas produced                             | standard cubic meters (Sm3)  |
|Water         | Volume of water produced                           | standard cubic meters (Sm3)  |
|GI            | Volume of gas injected during secondary recovery   | standard cubic meters (Sm3)  |
|WI            | Volume of water injected during secondary recovery | standard cubic meters (Sm3)  |

In [4]:
volve_df.dtypes

Wellbore name     object
NPDCode          float64
Year             float64
Month            float64
On Stream         object
Oil               object
Gas               object
Water             object
GI                object
WI                object
dtype: object

## Questions and Answers

### Which of the available columns are correctly formatted?
- The column `Wellbore name` is the only one in a correct format.

### Which of the available columns require type conversion?
- `NPDCode`, `Year`, `Month`: float -> int
- `On Stream`, `Oil`, `Gas`, `Water`, `GI`, `WI`: object -> float

Let's try to transform our data columns into appropriate data types.

In [5]:
volve_df[["NPDCode", "Year", "Month"]] = volve_df[["NPDCode", "Year", "Month"]].astype(int)
volve_df[["On Stream", "Oil", "Gas", "Water", "WI", "GI"]] = volve_df[["On Stream", "Oil", "Gas", "Water", "WI", "GI"]].astype(float)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

### 🤯 What just happened? 😦 Why can't I convert my data?
- If you take a look at the dataset, you will see that the first row holds only unit labels (strings such as `hrs` and `Sm3`) and missing (`NaN`) values.
- When pandas tries to convert `NaN` into `int` or `hrs` into `float`, it raises an error, in effect saying `I cannot make that kind of conversion`.
- To fix this, we will have to drop the first row of the dataset.
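The fix described above can be sketched on a small stand-in frame. The toy rows below are copied from the preview, and the `units` dictionary is an addition for illustration only (it is not part of the original notebook); it keeps the unit labels available after the row that carries them is dropped.

```python
import pandas as pd

# Toy frame mimicking the layout of the sheet: row 0 holds unit labels, not data.
raw = pd.DataFrame({
    "Wellbore name": [None, "15/9-F-1 C", "15/9-F-1 C"],
    "Year": [None, 2014.0, 2014.0],
    "On Stream": ["hrs", 227.5, 733.83334],
    "Oil": ["Sm3", 11142.47, 24901.95],
})

# Save the unit labels before discarding the row (illustrative extra step).
units = raw.iloc[0].dropna().to_dict()

# Drop the units row, then convert each column to a numeric type.
clean = raw.drop(index=[0]).reset_index(drop=True)
clean["Year"] = clean["Year"].astype(int)
for column in units:
    clean[column] = pd.to_numeric(clean[column])

print(units)
print(clean.dtypes)
```

Once the units row is gone, the remaining values in the measurement columns are plain numbers, so the conversion no longer fails.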

In [6]:
volve_df = volve_df.drop(index=[0]).reset_index(drop=True)

In [7]:
volve_df[["NPDCode", "Year", "Month"]] = volve_df[["NPDCode", "Year", "Month"]].astype(int)
volve_df[["On Stream", "Oil", "Gas", "Water", "WI", "GI"]] = volve_df[["On Stream", "Oil", "Gas", "Water", "WI", "GI"]].astype(float)

In [8]:
volve_df.dtypes

Wellbore name     object
NPDCode            int32
Year               int32
Month              int32
On Stream        float64
Oil              float64
Gas              float64
Water            float64
GI               float64
WI               float64
dtype: object

## Questions and Answers

### Questions

- Which of the available columns contain missing values?
- How do you intend to handle the missing values?

In [9]:
volve_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Wellbore name  526 non-null    object 
 1   NPDCode        526 non-null    int32  
 2   Year           526 non-null    int32  
 3   Month          526 non-null    int32  
 4   On Stream      515 non-null    float64
 5   Oil            311 non-null    float64
 6   Gas            311 non-null    float64
 7   Water          311 non-null    float64
 8   GI             0 non-null      float64
 9   WI             201 non-null    float64
dtypes: float64(6), int32(3), object(1)
memory usage: 35.1+ KB
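
The non-null counts reported by `info()` translate directly into missing counts. The sketch below hard-codes the numbers from the output above rather than loading the file; on the real frame, `volve_df.isna().sum()` should report the same figures.

```python
# Non-null counts copied from the info() output above.
total_rows = 526
non_null = {"On Stream": 515, "Oil": 311, "Gas": 311, "Water": 311, "GI": 0, "WI": 201}

# Missing values per column.
missing = {column: total_rows - count for column, count in non_null.items()}

# Print each count together with its share of all rows.
for column, count in missing.items():
    print(f"{column:<10} {count:>4} missing ({count / total_rows:.1%})")
```

`GI` has no recorded values at all, and about 41% of the `Oil`, `Gas`, and `Water` entries are missing.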


In [10]:
volve_df.describe()

Unnamed: 0,NPDCode,Year,Month,On Stream,Oil,Gas,Water,GI,WI
count,526.0,526.0,526.0,515.0,311.0,311.0,311.0,0.0,201.0
mean,5906.731939,2012.380228,6.48289,595.901633,32273.571093,4743956.0,49255.878939,,150896.186722
std,650.0211,2.633829,3.417977,196.67664,37361.937092,5302562.0,47458.158698,,56431.406213
min,5351.0,2007.0,1.0,0.0,0.0,0.0,0.0,,0.0
25%,5599.0,2010.0,4.0,561.33416,6085.665,923652.2,3534.335,,126376.884487
50%,5693.0,2013.0,6.5,683.74166,17870.72,2722573.0,36195.93,,150430.98628
75%,5769.0,2015.0,9.0,720.0,37607.385,5780980.0,94056.455,,188426.124913
max,7405.0,2016.0,12.0,745.0,166439.67,24106360.0,155365.68,,270199.39983


### Answers

- The columns `On Stream`, `Oil`, `Gas`, `Water`, `GI`, and `WI` all contain missing values; `GI` has no recorded values at all.
- Given the high standard deviation of most of these columns (*cell above*), filling the gaps with a mean or median would not be representative, so the missing values are replaced with zero (0).

>**Note:** As a rule of thumb, the standard deviation is considered high here when it is more than one third of the arithmetic mean.
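
The rule of thumb in the note can be checked by hand. The sketch below copies the mean and standard deviation of each column from the `describe()` output and compares their ratio with one third; the `stats` dictionary and `THRESHOLD` name are illustrative and not part of the notebook.

```python
# (mean, std) pairs copied from the describe() output above.
stats = {
    "On Stream": (595.901633, 196.67664),
    "Oil": (32273.571093, 37361.937092),
    "Gas": (4743956.0, 5302562.0),
    "Water": (49255.878939, 47458.158698),
    "WI": (150896.186722, 56431.406213),
}

THRESHOLD = 1 / 3  # rule of thumb from the note above

# Ratio of standard deviation to mean (the coefficient of variation).
ratios = {column: std / mean for column, (mean, std) in stats.items()}
high_spread = {column: ratio > THRESHOLD for column, ratio in ratios.items()}

for column, ratio in ratios.items():
    print(f"{column:<10} std/mean = {ratio:.2f}  high: {high_spread[column]}")
```

With these rounded statistics, `On Stream` lands at about 0.33, just below the threshold, while `Oil`, `Gas`, `Water`, and `WI` exceed it. `GI` is left out because it has no values.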

In [11]:
volve_df = volve_df.fillna(0)

In [12]:
volve_df.head(10)

Unnamed: 0,Wellbore name,NPDCode,Year,Month,On Stream,Oil,Gas,Water,GI,WI
0,15/9-F-1 C,7405,2014,4,227.5,11142.47,1597936.65,0.0,0.0,0.0
1,15/9-F-1 C,7405,2014,5,733.83334,24901.95,3496229.65,783.48,0.0,0.0
2,15/9-F-1 C,7405,2014,6,705.91666,19617.76,2886661.69,2068.48,0.0,0.0
3,15/9-F-1 C,7405,2014,7,742.41666,15085.68,2249365.75,6243.98,0.0,0.0
4,15/9-F-1 C,7405,2014,8,432.99166,6970.43,1048190.8,4529.75,0.0,0.0
5,15/9-F-1 C,7405,2014,9,630.3,9168.43,1414099.99,8317.59,0.0,0.0
6,15/9-F-1 C,7405,2014,10,745.0,9468.06,1462063.99,10364.87,0.0,0.0
7,15/9-F-1 C,7405,2014,11,579.775,6710.33,1044188.3,7234.24,0.0,0.0
8,15/9-F-1 C,7405,2014,12,27.5,120.29,25857.08,183.44,0.0,0.0
9,15/9-F-1 C,7405,2015,1,479.91667,10875.53,1604934.6,6850.8,0.0,0.0


## Questions and Answers

### Questions

- What is the temporal coverage of the dataset?
- How many wellbores are there in the dataset?
- How much hydrocarbon was produced from the Volve field?
- What secondary recovery method was employed during field development?

In [13]:
years = sorted(list(volve_df["Year"].unique()))
temporal_coverage = years[-1] - years[0]

print(f"The dataset provides production data for a duration of {temporal_coverage} years, from {years[0]} to {years[-1]}.")

The dataset provides production data for a duration of 9 years, from 2007 to 2016.
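
Subtracting the first year from the last gives a span in years, but the records are monthly, so the coverage can also be counted in months. The sketch below uses a toy frame with three `Year`/`Month` pairs taken from the preview (not the full dataset) and assumes both columns are already integers.

```python
import pandas as pd

# Toy records standing in for the Year and Month columns.
records = pd.DataFrame({"Year": [2014, 2014, 2016], "Month": [4, 5, 8]})

# pd.to_datetime can assemble dates from columns named year, month, and day.
parts = records.rename(columns={"Year": "year", "Month": "month"}).assign(day=1)
dates = pd.to_datetime(parts)

first, last = dates.min(), dates.max()

# Count months inclusively, so April to August of the same year is 5 months.
n_months = (last.year - first.year) * 12 + (last.month - first.month) + 1

print(f"From {first:%Y-%m} to {last:%Y-%m}: {n_months} months.")
```

Applied to the full frame, the same steps would report the exact first and last month of production rather than only the years.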


In [14]:
number_of_wells = len(volve_df["Wellbore name"].unique())
print(f"There is a total of {number_of_wells} wellbores in the dataset.")

There is a total of 7 wellbores in the dataset.
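
The same count is available through `Series.nunique()`, and a `groupby` gives a per-wellbore breakdown. The sketch below uses four rows copied from the preview (two wellbores only), so its numbers are not totals for the field.

```python
import pandas as pd

# Four monthly records copied from the preview of the dataset.
sample = pd.DataFrame({
    "Wellbore name": ["15/9-F-1 C", "15/9-F-1 C", "15/9-F-5", "15/9-F-5"],
    "Oil": [11142.47, 24901.95, 9724.4, 9121.48],
})

# Number of distinct wellbores.
number_of_wells = sample["Wellbore name"].nunique()

# Oil volume (Sm3) summed per wellbore.
oil_per_well = sample.groupby("Wellbore name")["Oil"].sum()

print(number_of_wells)
print(oil_per_well)
```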


### Unit conversion

- Oil volumes are commonly reported in barrels (bbl)
- Gas volumes are commonly reported in standard cubic feet (scf or ft<sup>3</sup>)

>1 bbl = 0.1589873 m<sup>3</sup> <br>
>1 scf = 0.0283168 m<sup>3</sup>
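
The two factors above can be wrapped in small helper functions. This is a minimal sketch; the names `m3_to_bbl` and `m3_to_scf` are hypothetical and do not appear in the notebook.

```python
# Conversion factors from the note above.
M3_PER_BBL = 0.1589873
M3_PER_SCF = 0.0283168


def m3_to_bbl(volume_m3):
    """Convert a volume in cubic meters to barrels."""
    return volume_m3 / M3_PER_BBL


def m3_to_scf(volume_m3):
    """Convert a volume in cubic meters to standard cubic feet."""
    return volume_m3 / M3_PER_SCF


print(f"1 m3 = {m3_to_bbl(1.0):.4f} bbl")
print(f"1 m3 = {m3_to_scf(1.0):.4f} scf")
```

Dividing by the factor (rather than multiplying) is what turns cubic meters into the smaller units, which is the same operation the next cell applies to the column totals.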

In [15]:
total_oil_production = volve_df["Oil"].sum() / 0.1589873
total_gas_production = volve_df["Gas"].sum() / 0.0283168

In [16]:
print("A total of {:.2f} million barrels of oil were produced.".format(total_oil_production / 1e6))

A total of 63.13 million barrels of oil were produced.


In [17]:
print("A total of {:.2f} billion cubic feet of gas were produced.".format(total_gas_production / 1e9))

A total of 52.10 billion cubic feet of gas were produced.


In [18]:
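# Check each injection column separately so that the "no injection" case is handled too.
water_injected = volve_df["WI"].sum() > 0
gas_injected = volve_df["GI"].sum() > 0

if water_injected and gas_injected:
    print("Both water and gas injection were used for the secondary recovery.")
elif water_injected:
    print("Water injection was used as the secondary recovery method.")
elif gas_injected:
    print("Gas injection was used as the secondary recovery method.")
else:
    print("No injection volumes were recorded in the dataset.")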

Water injection was used as the secondary recovery method.


### Basic Information

|Field name | Location | No of wells | Temporal coverage      |
|:---------:|:--------:|:-----------:|:----------------------:|
|Volve      |Norway    |7            |9 years (_2007 - 2016_) |

### Performance Evaluation

| Oil Production | Gas Production | Recovery Method |
|:--------------:|:--------------:|:---------------:|
| 63.13 MMbbl    | 52.10 BCF      |Water injection  |