The purpose of this report is to identify the most profitable region for developing oil wells. We have data from three regions. We train a linear regression model for each region, then use its predictions to estimate which region offers the highest profit and the lowest risk of losses.

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import numpy as np
from scipy import stats as st

In [2]:
df_0 = pd.read_csv('/datasets/geo_data_0.csv')
df_1 = pd.read_csv('/datasets/geo_data_1.csv')
df_2 = pd.read_csv('/datasets/geo_data_2.csv')

In [3]:
df_0.head()

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.22117,105.280062
1,2acmU,1.334711,-0.340164,4.36508,73.03775
2,409Wp,1.022732,0.15199,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647


In [4]:
df_1.head()

Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.00116,134.766305
3,KcrkZ,-13.081196,-11.506057,4.999415,137.945408
4,AHL4O,12.702195,-8.147433,5.004363,134.766305


In [5]:
df_2.head()

Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.87191
3,q6cA6,2.23606,-0.55376,0.930038,114.572842
4,WPMUX,-0.515993,1.716266,5.899011,149.600746


In [6]:
df_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [7]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [8]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [9]:
#Drop the id column: well identifiers carry no information for predicting the target.
df_0 = df_0.drop(['id'], axis=1)
df_1 = df_1.drop(['id'], axis=1)
df_2 = df_2.drop(['id'], axis=1)

In [10]:
df_0.head()

Unnamed: 0,f0,f1,f2,product
0,0.705745,-0.497823,1.22117,105.280062
1,1.334711,-0.340164,4.36508,73.03775
2,1.022732,0.15199,1.419926,85.265647
3,-0.032172,0.139033,2.978566,168.620776
4,1.988431,0.155413,4.751769,154.036647


The data is ready to use. None of the three datasets has missing values, and the remaining columns (f0, f1, f2) are the features used to predict the target (product).
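
The readiness check described above can be sketched as a small helper that counts missing values and duplicate rows. The function name `summarize_quality` and the toy table are hypothetical and exist only for illustration; the real region tables would be passed in instead.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> dict:
    """Return simple data-quality counts for one region's table."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Toy stand-in for a region table (hypothetical values, not the real data).
toy = pd.DataFrame({
    "f0": [0.7, 1.3, 1.3, None],
    "f1": [-0.5, -0.3, -0.3, 0.1],
    "f2": [1.2, 4.4, 4.4, 3.0],
    "product": [105.3, 73.0, 73.0, 168.6],
})
summary = summarize_quality(toy)
print(summary)
```

Calling `summarize_quality(df_0)` (and likewise for the other regions) would give the same counts for the real data.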

In [11]:
#Declare features and target for each zone.
features_0 = df_0.drop(['product'], axis=1)
target_0 = df_0['product']
features_1 = df_1.drop(['product'], axis=1)
target_1 = df_1['product']
features_2 = df_2.drop(['product'], axis=1)
target_2 = df_2['product']

In [12]:
#Splitting the data into training and validation sets for all 3 datasets.
features_0_train, features_0_valid, target_0_train, target_0_valid = train_test_split(
    features_0, target_0, test_size=0.25, random_state=12345)
features_1_train, features_1_valid, target_1_train, target_1_valid = train_test_split(
    features_1, target_1, test_size=0.25, random_state=12345)
features_2_train, features_2_valid, target_2_train, target_2_valid = train_test_split(
    features_2, target_2, test_size=0.25, random_state=12345)

In [13]:
#Train linear regression model for zone 0.
model = LinearRegression()
model.fit(features_0_train, target_0_train)
predictions_0_valid = model.predict(features_0_valid)
result = mean_squared_error(target_0_valid, predictions_0_valid)**0.5
print("Average volume of predicted reserves:", predictions_0_valid.mean())
print("RMSE of the linear regression model on the validation set:", result)

Average volume of predicted reserves: 92.59256778438035
RMSE of the linear regression model on the validation set: 37.5794217150813


In [14]:
#Train linear regression model for zone 1
model = LinearRegression()
model.fit(features_1_train, target_1_train)
predictions_1_valid = model.predict(features_1_valid)
result = mean_squared_error(target_1_valid, predictions_1_valid)**0.5
print("Average volume of predicted reserves:", predictions_1_valid.mean())
print("RMSE of the linear regression model on the validation set:", result)

Average volume of predicted reserves: 68.728546895446
RMSE of the linear regression model on the validation set: 0.893099286775617


In [15]:
#Train linear regression model for zone 2
model = LinearRegression()
model.fit(features_2_train, target_2_train)
predictions_2_valid = model.predict(features_2_valid)
result = mean_squared_error(target_2_valid, predictions_2_valid)**0.5
print("Average volume of predicted reserves:", predictions_2_valid.mean())
print("RMSE of the linear regression model on the validation set:", result)

Average volume of predicted reserves: 94.96504596800489
RMSE of the linear regression model on the validation set: 40.02970873393434


In [16]:
print(predictions_2_valid.mean())

94.96504596800489


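The RMSE for zones 0 and 2 is very high (about 37.6 and 40.0 units, against mean predictions of about 92.6 and 95.0 units), so individual predictions in those zones are unreliable. The RMSE for zone 1 is far lower, at about 0.89 units.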
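
One way to put an RMSE value in context is to compare it with the error of a trivial baseline that always predicts the mean. The sketch below does this with NumPy only. The helper names `rmse` and `rmse_vs_baseline` are hypothetical, the data is synthetic, and the baseline uses the mean of the evaluated targets as a simplification.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_vs_baseline(y_true, y_pred):
    """Return (model RMSE, RMSE of always predicting the mean of y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    baseline = np.full_like(y_true, y_true.mean())
    return rmse(y_true, y_pred), rmse(y_true, baseline)

# Synthetic stand-in values (hypothetical, not taken from the datasets).
rng = np.random.default_rng(0)
actual = rng.normal(loc=90.0, scale=40.0, size=1000)
predicted = actual + rng.normal(scale=5.0, size=1000)  # a fairly accurate model

model_rmse, baseline_rmse = rmse_vs_baseline(actual, predicted)
print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

A model RMSE close to the baseline RMSE would suggest the features explain little of the variation in the target.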

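The budget to develop 200 oil wells is 100 million USD, and one unit of product (1,000 barrels of oil) brings in 4,500 USD of revenue. The budget per well is 100,000,000 / 200 = 500,000 USD, and 500,000 / 4,500 ≈ 111.1 units. With only 200 wells developed, each well therefore needs to produce slightly more than 111.1 units (about 111,100 barrels) on average to break even. The mean predicted reserves in all three zones (about 92.6, 68.7, and 95.0 units) fall below this threshold, so the wells to develop must be selected by predicted volume rather than chosen at random.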
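
The break-even arithmetic can be reproduced in a few lines. This is a minimal sketch: the constant names are chosen here for illustration, and the mean predictions are the validation-set averages printed earlier, rounded to two decimals.

```python
BUDGET_USD = 100_000_000        # total budget for development
WELLS = 200                     # number of wells to develop
REVENUE_PER_UNIT_USD = 4_500    # revenue per unit (1,000 barrels)

budget_per_well = BUDGET_USD / WELLS
break_even_units = budget_per_well / REVENUE_PER_UNIT_USD

# Mean predicted reserves per zone, rounded from the outputs above.
mean_predictions = {"zone 0": 92.59, "zone 1": 68.73, "zone 2": 94.97}

print(f"Budget per well: {budget_per_well:,.0f} USD")
print(f"Break-even volume per well: {break_even_units:.1f} units")
for zone, mean_volume in mean_predictions.items():
    gap = mean_volume - break_even_units
    print(f"{zone}: mean prediction {mean_volume:.2f}, gap to break-even {gap:.2f}")
```

A negative gap means the average well in that zone would not cover its share of the budget.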

In [17]:
Revenue_Barrel = 4.5
Revenue_Unit = 4500
Budget = 100000000
Oil_Well_Amount_Budget = 200
Volume_Of_Reserves = 1000

In [18]:
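def calculate_profit(targets, probabilities, count):
    #Pick the `count` wells with the highest predicted volume.
    probs_sorted = probabilities.sort_values(ascending=False).head(count)
    #Look up their actual volumes; repeated labels can match extra rows, hence the slice.
    selected = targets[probs_sorted.index][:count]
    #Profit = revenue at Revenue_Unit per unit of product, minus the development budget.
    return (Revenue_Unit * selected.sum()) - Budget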

In [19]:
state = np.random.RandomState(12345)

In [20]:
print(pd.DataFrame(predictions_0_valid))

                0
0       95.894952
1       77.572583
2       77.892640
3       90.175134
4       70.510088
...           ...
24995  103.037104
24996   85.403255
24997   61.509833
24998  118.180397
24999  118.169392

[25000 rows x 1 columns]


In [21]:
#Bootstrap the profit distribution for Zone 0 (profit = revenue minus budget)
values_0 = []
for i in range(1000):
    target_subsample = target_0_valid.reset_index(drop=True).sample(n=500, replace=True, random_state=state)
    probs_subsample = pd.Series(predictions_0_valid).reset_index(drop=True)[target_subsample.index]
    values_0.append(calculate_profit(target_subsample, probs_subsample, 200))

values_0 = pd.Series(values_0)
lower_0 = values_0.quantile(0.025)
upper_0 = values_0.quantile(0.975)
mean_0 = values_0.mean()
print("Average profit for Zone 0:", mean_0)
print("Confidence interval 2.5% quantile:", lower_0)
print("Confidence interval 97.5% quantile:", upper_0)

Average profit for Zone 0: 4259385.269105923
Confidence interval 2.5% quantile: -1020900.9483793724
Confidence interval 97.5% quantile: 9479763.533583675


In [22]:
#Share of bootstrap samples with a non-positive profit.
losses_0 = values_0 <= 0
print(f"The risk of losses is: {losses_0.mean():.1%}")

The risk of losses is: 6.0%


In [23]:
#Bootstrap the profit distribution for Zone 1 (profit = revenue minus budget)
values_1 = []
for i in range(1000):
    target_subsample = target_1_valid.reset_index(drop=True).sample(n=500, replace=True, random_state=state)
    probs_subsample = pd.Series(predictions_1_valid).reset_index(drop=True)[target_subsample.index]
    values_1.append(calculate_profit(target_subsample, probs_subsample, 200))

values_1 = pd.Series(values_1)
lower_1 = values_1.quantile(0.025)
upper_1 = values_1.quantile(0.975)
mean_1 = values_1.mean()
print("Average profit for Zone 1:", mean_1)
print("Confidence interval 2.5% quantile:", lower_1)
print("Confidence interval 97.5% quantile:", upper_1)

Average profit for Zone 1: 5182594.93697325
Confidence interval 2.5% quantile: 1281232.3143308456
Confidence interval 97.5% quantile: 9536129.820669085


In [24]:
#Share of bootstrap samples with a non-positive profit.
losses_1 = values_1 <= 0
print(f"The risk of losses is: {losses_1.mean():.1%}")

The risk of losses is: 0.3%


In [25]:
#Bootstrap the profit distribution for Zone 2 (profit = revenue minus budget)
values_2 = []
for i in range(1000):
    target_subsample = target_2_valid.reset_index(drop=True).sample(n=500, replace=True, random_state=state)
    probs_subsample = pd.Series(predictions_2_valid).reset_index(drop=True)[target_subsample.index]
    values_2.append(calculate_profit(target_subsample, probs_subsample, 200))

values_2 = pd.Series(values_2)
lower_2 = values_2.quantile(0.025)
upper_2 = values_2.quantile(0.975)
mean_2 = values_2.mean()
print("Average profit for Zone 2:", mean_2)
print("Confidence interval 2.5% quantile:", lower_2)
print("Confidence interval 97.5% quantile:", upper_2)

Average profit for Zone 2: 4201940.0534405
Confidence interval 2.5% quantile: -1158526.0916001017
Confidence interval 97.5% quantile: 9896299.398445744


In [30]:
#Share of bootstrap samples with a non-positive profit.
losses_2 = values_2 <= 0
print(f"The risk of losses is: {losses_2.mean():.1%}")

The risk of losses is: 6.2%


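The best region for development is Zone 1. It has the highest average profit (about 5.18 million USD) and is the only zone whose 95% confidence interval (about 1.28 to 9.54 million USD) lies entirely above zero. Its estimated risk of losses is 0.3%, compared with 6.0% for Zone 0 and 6.2% for Zone 2.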
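
The bootstrap procedure behind these figures could also be written as a single reusable function. The sketch below is an alternative formulation, not the code used above: the name `bootstrap_profit` and its parameters are hypothetical, it samples by position with a NumPy `Generator` (so repeated wells cannot cause label mismatches), and it runs on synthetic data, so it will not reproduce the numbers in this report.

```python
import numpy as np

def bootstrap_profit(actual, predicted, n_sample=500, n_select=200,
                     n_boot=1000, revenue_per_unit=4500,
                     budget=100_000_000, seed=12345):
    """Bootstrap the profit of developing the top predicted wells."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_boot)
    for i in range(n_boot):
        # Sample well positions with replacement.
        idx = rng.integers(0, len(actual), size=n_sample)
        # Positions within the sample of the n_select highest predictions.
        top = np.argsort(predicted[idx])[-n_select:]
        # Profit = revenue from the actual volumes minus the budget.
        profits[i] = revenue_per_unit * actual[idx][top].sum() - budget
    return {
        "mean": float(profits.mean()),
        "lower": float(np.quantile(profits, 0.025)),
        "upper": float(np.quantile(profits, 0.975)),
        "risk": float((profits < 0).mean()),
    }

# Synthetic stand-in data (hypothetical, not the real validation sets).
data_rng = np.random.default_rng(0)
actual = data_rng.normal(loc=95.0, scale=40.0, size=5000).clip(min=0)
predicted = actual + data_rng.normal(scale=20.0, size=5000)

stats = bootstrap_profit(actual, predicted)
print(f"mean profit: {stats['mean']:,.0f}")
print(f"95% interval: {stats['lower']:,.0f} to {stats['upper']:,.0f}")
print(f"risk of losses: {stats['risk']:.1%}")
```

Calling the function once per zone with that zone's validation targets and predictions would replace the three near-identical loops.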