# Profitability prediction for oil well development

Goal: develop a machine learning model that predicts oil well profitability, and assess the potential profit and risks.

The model will be used to select a region for further development.

There are three candidate regions. Each of them has 100 000 oil wells. Data on oil quality and the volume of raw material is available for each well.

## Data Preparation

Import the necessary libraries and load the data.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from math import sqrt

In [2]:
data_region_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_region_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_region_2 = pd.read_csv('/datasets/geo_data_2.csv')

Create a function to describe the data

In [3]:
def get_to_know_your_data(df, df_name):
    for e in range(len(df)):
        element = df[e]
        print(f'NAME: {df_name[e]}')
        print("-------------------")
        print('Review first rows')
        print(element.head())
        print("-----")
        print('Summary')
        print(element.info())
        print("-----")
        print('Statistics')
        print(element.describe())
        print("-----")
        print('Duplicates amount')
        print(element.duplicated().sum())
        print()
df = [data_region_0, data_region_1, data_region_2]
df_name = ['region_0', 'region_1', 'region_2']
get_to_know_your_data(df, df_name)

NAME: region_0
-------------------
Review first rows
      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
-----
Summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
-----
Statistics
                  f0             f1             f2        product
count  100000.000000  100000.000000  100000.000000  100000.000000
mean        0.500419       0.250143    

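After the target (`product`) is separated from the features, the numeric features (`f0`, `f1`, `f2`) are standardized so that they share a comparable scale. The `id` column is an identifier and is left out at the scaling step.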

In [4]:
features_0 = data_region_0.drop('product', axis=1)
target_0 = data_region_0['product']

In [5]:
features_1 = data_region_1.drop('product', axis=1)
target_1 = data_region_1['product']

In [6]:
features_2 = data_region_2.drop('product', axis=1)
target_2 = data_region_2['product']

Create a function for feature scaling.

In [7]:
def features_standard(feature):
    # Standardize the numeric columns; the non-numeric `id` column is left out.
    numeric = ['f0', 'f1', 'f2']
    scaler = StandardScaler()
    scaled = scaler.fit_transform(feature[numeric])
    # Return a new frame so the input is not modified in place.
    return pd.DataFrame(scaled, columns=numeric, index=feature.index)

In [8]:
features_standard_0 = features_standard(features_0)
features_standard_1 = features_standard(features_1)
features_standard_2 = features_standard(features_2)

In [9]:
features_standard_0.describe()

       f0                      f1             f2
count  100000.0                100000.0       100000.0
mean   3.5444980000000006e-17  -1.768141e-17  1.079015e-16
std    1.000005                1.000005       1.000005
min    -2.189681               -2.17743       -4.491975
25%    -0.6572397              -0.8941262     -0.6818784
50%    0.0022265               0.0002166008   0.00410136
75%    0.6574259               0.8930937      0.6811217
max    2.135642                2.168043       4.15646
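
The `std` row reads 1.000005 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces the effect on synthetic data; the random values are an assumption, and only the row count is chosen to match the 100 000 rows above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 100_000
rng = np.random.default_rng(0)
# Synthetic stand-in for one feature column (not the project data).
raw = pd.DataFrame({"f0": rng.normal(5.0, 3.0, size=n)})

scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)

population_std = scaled["f0"].std(ddof=0)  # what StandardScaler normalizes to 1
sample_std = scaled["f0"].std(ddof=1)      # what describe() prints
expected_ratio = np.sqrt(n / (n - 1))      # gap between the two estimators

print(f"population std: {population_std:.6f}")
print(f"sample std:     {sample_std:.6f}")
print(f"sqrt(n/(n-1)):  {expected_ratio:.6f}")
```

The difference is a matter of convention and has no effect on the fitted models.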


###### Summary

There are no missing values or duplicated rows in any of the datasets provided. Features were scaled for further processing.
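
The same checks can be condensed into one table per dataset. This is a minimal sketch on toy frames; `quality_report` and the toy data are hypothetical names introduced for illustration, not part of the notebook.

```python
import pandas as pd


def quality_report(frames):
    """Count rows, missing values and fully duplicated rows for each named frame."""
    rows = {
        name: {
            "rows": len(frame),
            "missing_values": int(frame.isna().sum().sum()),
            "duplicated_rows": int(frame.duplicated().sum()),
        }
        for name, frame in frames.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy frames: "dirty" contains one exact duplicate of its first row.
toy = {
    "clean": pd.DataFrame({"id": ["a", "b"], "f0": [0.1, 0.2], "product": [10.0, 20.0]}),
    "dirty": pd.DataFrame({"id": ["a", "a"], "f0": [0.1, 0.1], "product": [10.0, 10.0]}),
}
report = quality_report(toy)
print(report)
```

With the notebook's data the call would be `quality_report(dict(zip(df_name, df)))`.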

## Model Training

Split each dataset into training (75%) and validation (25%) sets.

In [10]:
features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(
    features_standard_0, target_0, test_size=0.25, random_state=12345)

features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(
    features_standard_1, target_1, test_size=0.25, random_state=12345)

features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(
    features_standard_2, target_2, test_size=0.25, random_state=12345)

In [11]:
model_0 = LinearRegression()
model_1 = LinearRegression()
model_2 = LinearRegression()

In [12]:
model_0.fit(features_train_0, target_train_0)
model_1.fit(features_train_1, target_train_1)
model_2.fit(features_train_2, target_train_2)

LinearRegression()

In [13]:
predicted_valid_0 = model_0.predict(features_valid_0)
predicted_valid_1 = model_1.predict(features_valid_1)
predicted_valid_2 = model_2.predict(features_valid_2)
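
The split, fit and predict steps are identical for all three regions, so they could be folded into one helper. The sketch below assumes the same 75/25 split and `random_state=12345` as above; `fit_region` and the synthetic data are hypothetical and only illustrate the pattern.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_region(features, target, random_state=12345):
    """Split 75/25, fit a linear regression, return validation targets and predictions."""
    f_train, f_valid, t_train, t_valid = train_test_split(
        features, target, test_size=0.25, random_state=random_state)
    model = LinearRegression().fit(f_train, t_train)
    # Keep the validation index so predictions stay aligned with the targets.
    predicted = pd.Series(model.predict(f_valid), index=t_valid.index)
    return t_valid, predicted


# Synthetic stand-in for one region (not the project data).
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
demo_target = 2.0 * demo_features["f2"] + 1.0 + rng.normal(scale=0.1, size=1000)

demo_valid, demo_predicted = fit_region(demo_features, demo_target)
print(len(demo_valid), len(demo_predicted))
```

Returning the predictions as a `Series` that shares the validation index keeps targets and predictions aligned by label in later steps.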

Calculate the mean predicted volume of raw product per well, in thousand barrels.

In [14]:
print("model_0 predictions mean:", "{:.2f}".format(predicted_valid_0.mean()))
print("model_1 predictions mean:", "{:.2f}".format(predicted_valid_1.mean()))
print("model_2 predictions mean:", "{:.2f}".format(predicted_valid_2.mean()))

model_0 predictions mean: 92.59
model_1 predictions mean: 68.73
model_2 predictions mean: 94.97


In [15]:
print("target_0 mean:", "{:.2f}".format(target_valid_0.mean()))
print("target_1 mean:", "{:.2f}".format(target_valid_1.mean()))
print("target_2 mean:", "{:.2f}".format(target_valid_2.mean()))

target_0 mean: 92.08
target_1 mean: 68.72
target_2 mean: 94.88


In [16]:
mse_0 = mean_squared_error(target_valid_0, predicted_valid_0)
rmse_0 = sqrt(mse_0)
print("rmse_0:", "{:.2f}".format(rmse_0))

mse_1 = mean_squared_error(target_valid_1, predicted_valid_1)
rmse_1 = sqrt(mse_1)
print("rmse_1:", "{:.2f}".format(rmse_1))

mse_2 = mean_squared_error(target_valid_2, predicted_valid_2)
rmse_2 = sqrt(mse_2)
print("rmse_2:", "{:.2f}".format(rmse_2))

rmse_0: 37.58
rmse_1: 0.89
rmse_2: 40.03


###### Summary

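For every region, the mean of the predicted values on the validation set is close to the mean of the actual values.
The RMSE for region_1 (0.89) is far lower than for region_0 (37.58) and region_2 (40.03), so predictions for region_1 are the most accurate.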
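
An RMSE value is easier to interpret next to a baseline. The sketch below computes RMSE directly from its definition and compares a model against a constant prediction equal to the mean of the actual values. The toy numbers are assumptions for illustration, not the project data.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Toy volumes in thousand barrels.
actual = np.array([100.0, 80.0, 120.0, 60.0])
predicted = np.array([90.0, 85.0, 110.0, 70.0])

model_rmse = rmse(actual, predicted)
# Baseline: always predict the mean of the actual values.
baseline_rmse = rmse(actual, np.full_like(actual, actual.mean()))

print(f"model RMSE:    {model_rmse:.2f}")
print(f"baseline RMSE: {baseline_rmse:.2f}")
```

A model whose RMSE is close to the baseline adds little beyond the mean, while an RMSE far below it indicates that the features carry real signal.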

## Profit calculation - preparation steps

In [17]:
BUDGET = 10000000000
TOTAL_LOCATIONS_PER_REGION = 500
BEST_LOCATIONS = 200
PROFIT_PER_BARREL = 450

In [18]:
expense_per_location = BUDGET / BEST_LOCATIONS
expense_per_location

50000000.0

Calculate the volume of raw material a well needs to produce to reach the break-even point.

In [19]:
# porog: break-even threshold, in barrels per well
porog = expense_per_location / PROFIT_PER_BARREL
"{:.2f}".format(porog)

'111111.11'

###### Summary

Developing one well costs 50 mil. rubles on average (the budget divided across 200 wells). The break-even volume is 111,111 barrels per well, i.e. about 111.11 thousand barrels, which is higher than the mean volume per well in every region (92.08, 68.72 and 94.88 thousand barrels in the validation sets). Comparing the means is therefore not sufficient to determine the most profitable region, because profit depends on the best 200 wells rather than on the average well. Since region_1 has the lowest mean volume, it looks like the least favourable option at this point. Further analysis is required.
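
The unit conversion behind this comparison can be made explicit. The sketch reuses the constants defined above and the validation means printed earlier; `BARRELS_PER_UNIT` and `shortfall` are names introduced here for illustration.

```python
BUDGET = 10_000_000_000
BEST_LOCATIONS = 200
PROFIT_PER_BARREL = 450
BARRELS_PER_UNIT = 1000  # `product` is measured in thousand barrels

expense_per_location = BUDGET / BEST_LOCATIONS
break_even_barrels = expense_per_location / PROFIT_PER_BARREL
break_even_units = break_even_barrels / BARRELS_PER_UNIT

# Mean actual volume per well on the validation sets, thousand barrels.
valid_means = {"region_0": 92.08, "region_1": 68.72, "region_2": 94.88}

# How far the average well falls short of break-even in each region.
shortfall = {name: round(break_even_units - mean, 2) for name, mean in valid_means.items()}

print(f"break-even: {break_even_units:.2f} thousand barrels per well")
print(shortfall)
```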

## Calculation of profit and risks

In [20]:
target_predicted_0 = pd.Series(predicted_valid_0)
target_predicted_1 = pd.Series(predicted_valid_1)
target_predicted_2 = pd.Series(predicted_valid_2)

In [21]:
target_valid_0 = target_valid_0.reset_index(drop=True)
target_valid_1 = target_valid_1.reset_index(drop=True)
target_valid_2 = target_valid_2.reset_index(drop=True)

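Create a function for profit calculation. It selects the 200 wells with the highest predicted volume, sums their actual volume, converts it to revenue and subtracts the budget.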

In [22]:
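def revenue(target, probabilities):
    # `probabilities` holds the predicted volumes; the result is profit (revenue minus BUDGET).
    probs_sorted = probabilities.sort_values(ascending=False)
    # Actual volumes of the wells ranked highest by prediction.
    selected = target[probs_sorted.index][:BEST_LOCATIONS]
    # Volumes are in thousand barrels, hence the factor of 1000.
    return 1000 * selected.sum() * PROFIT_PER_BARREL - BUDGET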

In [23]:
state = np.random.RandomState(12345)

In [24]:
def bootstrap(target, probabilities):
    values = []
    for i in range(1000):
        target_subsample = target.sample(n=TOTAL_LOCATIONS_PER_REGION, replace=True, random_state=state)
        probs_subsample = probabilities[target_subsample.index]
        values.append(revenue(target_subsample, probs_subsample))
    
    values = pd.Series(values)
    lower = values.quantile(0.025)
    upper = values.quantile(0.975)
    risk = (values < 0).mean() * 100
    
    mean = values.mean()
    print("Average profit, mil. rubles:", "{:.2f}".format(mean / 1000000))
    print("95% Interval:", "{:.2f}".format(lower / 1000000), '-', "{:.2f}".format(upper / 1000000))
    print("Risk,%:", "{:.2f}".format(risk))
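
Because the bootstrap samples with replacement, a subsample's index can contain repeated labels, and a label-based lookup on a non-unique index returns every matching row for each requested label. The sketch below shows that effect on a toy series and a position-based alternative; `profit_by_position` is a hypothetical helper, not the function used to produce the results in this notebook.

```python
import numpy as np
import pandas as pd

BUDGET = 10_000_000_000
BEST_LOCATIONS = 200
PROFIT_PER_BARREL = 450


def profit_by_position(target, predictions, count=BEST_LOCATIONS):
    """Profit of the `count` wells with the highest predictions, paired by position."""
    target_values = np.asarray(target, dtype=float)
    predicted_values = np.asarray(predictions, dtype=float)
    # Positions of the largest predictions, highest first.
    top = np.argsort(predicted_values)[::-1][:count]
    return 1000 * target_values[top].sum() * PROFIT_PER_BARREL - BUDGET


# Toy draw in which label 7 was sampled twice.
toy_target = pd.Series([100.0, 100.0, 50.0], index=[7, 7, 3])
toy_predictions = pd.Series([90.0, 90.0, 40.0], index=[7, 7, 3])

# Label-based lookup: each requested 7 matches both rows labelled 7.
sorted_labels = toy_predictions.sort_values(ascending=False).index
label_lookup = toy_target.loc[sorted_labels]
print(len(toy_target), len(label_lookup))

# Position-based selection keeps exactly one row per sampled well.
toy_profit = profit_by_position(toy_target, toy_predictions, count=2)
print(toy_profit)
```

Inside `bootstrap`, `target_subsample` and `probs_subsample` are in the same row order, so they could be passed to such a helper as they are.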

In [25]:
# bootstrap() prints its own report and returns nothing, so it is called directly.
print("Region_0:")
bootstrap(target_valid_0, target_predicted_0)
print()
print("Region_1:")
bootstrap(target_valid_1, target_predicted_1)
print()
print("Region_2:")
bootstrap(target_valid_2, target_predicted_2)

Region_0:
Average profit, mil. rubles: 425.94
95% Interval: -102.09 - 947.98
Risk,%: 6.00

Region_1:
Average profit, mil. rubles: 518.26
95% Interval: 128.12 - 953.61
Risk,%: 0.30

Region_2:
Average profit, mil. rubles: 420.19
95% Interval: -115.85 - 989.63
Risk,%: 6.20


###### Summary 

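According to the bootstrap estimates, Region_1 has the highest average profit (518.26 mil. rubles) and the lowest risk of loss (0.30%). It is also the only region whose 95% interval lies entirely above zero, so it can be recommended for oil well development.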

## Conclusion

Three regions with oil fields were analyzed for the profitability of future development. Region_1 showed the highest predicted profit (518 mil. rubles) and the lowest risk of loss (less than 1%). The model RMSE for this region is also the lowest of the three.
Based on this study, Region_1 is recommended for further oil well development.