# Oil Well Quality Prediction

## Project description

You work for the OilyGiant mining company. The task is to find the best place for a new well.  
Steps to choose the location:  

    • Collect the oil well parameters in the selected region: oil quality and volume of reserves.  
    • Build a model for predicting the volume of reserves in the new wells.  
    • Pick the oil wells with the highest estimated values.  
    • Pick the region with the highest total profit for the selected oil wells.  

You have data on oil samples from three regions. Parameters of each oil well in the region are already known.  
Build a model that helps pick the region with the highest profit margin, and analyze the potential profit and risks using the bootstrap technique.

In [1]:
# pip install -U -q scikit-learn
# pip install -q sidetable

In [29]:
import numpy as np
import pandas as pd
import sidetable  # registers the `.stb` accessor used for the missing-values check
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Data preparation

In [3]:
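def load_data(file_name, parse_dates=None, dtype=None, sep=','):
    """Read a CSV from the relative 'datasets/' folder, falling back to '/datasets/'."""
    options = dict(parse_dates=parse_dates, dtype=dtype, sep=sep)

    try:
        return pd.read_csv('datasets/{}'.format(file_name), **options)
    except FileNotFoundError:
        # The relative path does not exist, so try the absolute path instead.
        return pd.read_csv('/datasets/{}'.format(file_name), **options)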

Loading data of the 3 regions.  
The data description is:  

    • id — unique oil well identifier  
    • f0, f1, f2 — three features of points (their specific meaning is unimportant, but the features themselves are significant)  
    • product — volume of reserves in the oil well (thousand barrels).

In [4]:
geo_0=load_data('geo_data_0.csv')
geo_1=load_data('geo_data_1.csv')
geo_2=load_data('geo_data_2.csv')
geo_0.head(2)
geo_1.head(2)
geo_2.head(2)

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.22117,105.280062
1,2acmU,1.334711,-0.340164,4.36508,73.03775


Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261


Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697


### Missing values check

In [5]:
geo_0.stb.missing(style=True)

Unnamed: 0,missing,total,percent
id,0,100000,0.00%
f0,0,100000,0.00%
f1,0,100000,0.00%
f2,0,100000,0.00%
product,0,100000,0.00%


In [6]:
geo_1.stb.missing(style=True)

Unnamed: 0,missing,total,percent
id,0,100000,0.00%
f0,0,100000,0.00%
f1,0,100000,0.00%
f2,0,100000,0.00%
product,0,100000,0.00%


In [7]:
geo_2.stb.missing(style=True)

Unnamed: 0,missing,total,percent
id,0,100000,0.00%
f0,0,100000,0.00%
f1,0,100000,0.00%
f2,0,100000,0.00%
product,0,100000,0.00%


There are no missing values in our data.
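
The same check can be done without `sidetable`. The sketch below is a minimal pandas-only version run on a small made-up table; `missing_summary` and `demo` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the missing count, row total, and missing percentage per column."""
    missing = df.isna().sum()
    total = len(df)
    summary = pd.DataFrame({
        "missing": missing,
        "total": total,
        "percent": missing / total * 100,
    })
    return summary.sort_values("missing", ascending=False)


# Made-up data with one missing value, used only to exercise the helper.
demo = pd.DataFrame({
    "f0": [0.1, np.nan, 0.3, 0.4],
    "product": [10.0, 20.0, 30.0, 40.0],
})
summary = missing_summary(demo)
print(summary)
```

Applied to `geo_0`, `geo_1`, or `geo_2`, this should reproduce the counts shown in the tables above.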

## Data analysis

### Split the source data into a training set and a validation set

Each region is split into a training set and a validation set at a ratio of 75:25. The validation part is stored in the `*_test` variables.

In [8]:
geo_0_train, geo_0_test=train_test_split(geo_0, test_size=0.25, random_state=12345)
geo_1_train, geo_1_test=train_test_split(geo_1, test_size=0.25, random_state=12345)
geo_2_train, geo_2_test=train_test_split(geo_2, test_size=0.25, random_state=12345)

### Making features & targets datasets for each dataset.

Naming convention used:

    Features=x
    Target=y

In [9]:
geo_0_x_train=geo_0_train.drop(['product', 'id'], axis=1)
geo_0_y_train=geo_0_train['product']

geo_0_x_test=geo_0_test.drop(['product', 'id'], axis=1)
geo_0_y_test=geo_0_test['product']

geo_1_x_train=geo_1_train.drop(['product', 'id'], axis=1)
geo_1_y_train=geo_1_train['product']

geo_1_x_test=geo_1_test.drop(['product', 'id'], axis=1)
geo_1_y_test=geo_1_test['product']

geo_2_x_train=geo_2_train.drop(['product', 'id'], axis=1)
geo_2_y_train=geo_2_train['product']

geo_2_x_test=geo_2_test.drop(['product', 'id'], axis=1)
geo_2_y_test=geo_2_test['product']
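
The three blocks above repeat the same steps per region. One way to reduce the repetition is a small helper, sketched here on a made-up table. `split_region` and `demo` are hypothetical names, and the defaults only mirror the values used in this notebook (`test_size=0.25`, `random_state=12345`).

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_region(df, target="product", drop=("id",), test_size=0.25, seed=12345):
    """Split one region into training/validation features (x) and targets (y)."""
    train, valid = train_test_split(df, test_size=test_size, random_state=seed)
    cols_to_drop = [target, *drop]
    return (
        train.drop(columns=cols_to_drop), train[target],
        valid.drop(columns=cols_to_drop), valid[target],
    )


# Made-up table with the same column layout as the region files.
demo = pd.DataFrame({
    "id": [f"w{i}" for i in range(8)],
    "f0": range(8),
    "f1": range(8, 16),
    "f2": range(16, 24),
    "product": [float(i) for i in range(8)],
})
x_train, y_train, x_valid, y_valid = split_region(demo)
print(x_train.shape, x_valid.shape)
```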

### Train the model and make predictions for the validation set.

A separate `LinearRegression` model is trained for each region. Per the project conditions, linear regression is the only model type considered.

In [10]:
model_0 = LinearRegression()
model_1 = LinearRegression()
model_2 = LinearRegression()

model_0.fit(geo_0_x_train, geo_0_y_train)
model_1.fit(geo_1_x_train, geo_1_y_train)
model_2.fit(geo_2_x_train, geo_2_y_train)

# RMSE is taken as the square root of MSE, because the `squared=False`
# argument is deprecated in newer scikit-learn releases.
predict_0=model_0.predict(geo_0_x_test)
rmse_0 = np.sqrt(mean_squared_error(geo_0_y_test, predict_0))

predict_1=model_1.predict(geo_1_x_test)
rmse_1 = np.sqrt(mean_squared_error(geo_1_y_test, predict_1))

predict_2=model_2.predict(geo_2_x_test)
rmse_2 = np.sqrt(mean_squared_error(geo_2_y_test, predict_2))

average_predicted_reserves_0=predict_0.mean()
average_predicted_reserves_1=predict_1.mean()
average_predicted_reserves_2=predict_2.mean()

r2_0=r2_score(geo_0_y_test, predict_0)
r2_1=r2_score(geo_1_y_test, predict_1)
r2_2=r2_score(geo_2_y_test, predict_2)

print('Region 0 prediction results')
print('---------------------------')
print('Average predicted reserves: {:.2f}'.format(average_predicted_reserves_0))
print('RMSE score: {:.2f}'.format(rmse_0))
print('R2 score: {:.2f}'.format(r2_0))
print()

print('Region 1 prediction results')
print('---------------------------')
print('Average predicted reserves: {:.2f}'.format(average_predicted_reserves_1))
print('RMSE score: {:.2f}'.format(rmse_1))
print('R2 score: {:.2f}'.format(r2_1))
print()

print('Region 2 prediction results')
print('---------------------------')
print('Average predicted reserves: {:.2f}'.format(average_predicted_reserves_2))
print('RMSE score: {:.2f}'.format(rmse_2))
print('R2 score: {:.2f}'.format(r2_2))
print()

LinearRegression()

LinearRegression()

LinearRegression()

Region 0 prediction results
---------------------------
Average predicted reserves: 92.59
RMSE score: 37.58
R2 score: 0.28

Region 1 prediction results
---------------------------
Average predicted reserves: 68.73
RMSE score: 0.89
R2 score: 1.00

Region 2 prediction results
---------------------------
Average predicted reserves: 94.97
RMSE score: 40.03
R2 score: 0.21



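The Region 1 predictions are very accurate (RMSE 0.89, R2 1.00). Regions 0 and 2 are predicted much less accurately: their RMSE is about 38 to 40 thousand barrels, and their R2 scores are only 0.28 and 0.21. Region 1 also has the lowest average predicted reserves (68.73 thousand barrels).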
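
The train, predict, and score steps are the same for each region, so they can be wrapped in one function. The sketch below runs on synthetic data, not on the region files. `evaluate_region` is a hypothetical helper, and the linear target with noise is an assumption made only for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_region(features, target, seed=12345):
    """Fit a linear regression on 75% of the data and score it on the other 25%."""
    x_train, x_valid, y_train, y_valid = train_test_split(
        features, target, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(x_train, y_train)
    predictions = model.predict(x_valid)
    return {
        "mean_prediction": predictions.mean(),
        "rmse": np.sqrt(mean_squared_error(y_valid, predictions)),
        "r2": r2_score(y_valid, predictions),
    }


# Synthetic stand-in for one region: three features and a noisy linear target.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 3))
target = 50 + features @ np.array([3.0, -2.0, 5.0]) + rng.normal(scale=1.0, size=1000)

scores = evaluate_region(features, target)
for name, value in scores.items():
    print("{}: {:.2f}".format(name, value))
```

Because the synthetic target is almost exactly linear, the scores resemble the Region 1 pattern (low RMSE, R2 close to 1) rather than Regions 0 and 2.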

## Data preparation for profit calculation

1. Set up the variables: barrel_revenue, unit_revenue, budget, best_points_amount (the best 200 wells selected from a sample of 500) and sample_size.
2. Calculate the volume of reserves sufficient for developing a new well without losses. The cost of one well is `well_cost = budget / best_points_amount`, so the break-even volume is `well_cost / unit_revenue` (in thousand barrels).
3. Compare the break-even volume with the average predicted volume of reserves in each region: `predictions.mean()` vs `well_cost / unit_revenue`.
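
Steps 2 and 3 can be worked through with the figures stated in this section (a 100 million USD budget, 200 wells, 4,500 USD per unit of product) and the average predicted reserves printed earlier. This is a minimal sketch; `break_even_volume` and `average_predicted` are names introduced here for illustration.

```python
budget = 100_000_000
best_points_amount = 200
unit_revenue = 4500

# Cost of developing one well, in USD.
well_cost = budget / best_points_amount

# Volume (thousand barrels) a well must hold to pay for itself.
break_even_volume = well_cost / unit_revenue
print("Break-even volume: {:.2f} thousand barrels".format(break_even_volume))

# Average predicted reserves per region, copied from the model output above.
average_predicted = {0: 92.59, 1: 68.73, 2: 94.97}
for region, volume in average_predicted.items():
    shortfall = break_even_volume - volume
    print("Region {}: average {:.2f}, shortfall {:.2f}".format(region, volume, shortfall))
```

The break-even volume comes to about 111.11 thousand barrels. All three regional averages fall below it, which is why only the best wells are selected rather than developing an average well.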

Conditions for profit calculation:
1. Only linear regression is suitable for model training (the rest are not sufficiently predictable).
2. When exploring the region, a study of 500 points is carried out, and the best 200 points are picked for the profit calculation.
3. The budget for the development of 200 oil wells is 100 million USD.
4. One barrel of raw materials brings 4.5 USD of revenue. The revenue from one unit of product is 4,500 USD (the volume of reserves is in thousand barrels).
5. After the risk evaluation, keep only the regions with the risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.


These conditions are summarized in the following parameters:

In [27]:
barrel_revenue=4.5
unit_revenue=4500
budget=100000000
best_points_amount=200
sample_size=500
confidence_level=1-0.025

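Calculating the revenue of each well from its known volume of reserves. The column is named `profit`, but the development budget is not subtracted at this stage.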

In [16]:
geo_0['profit']=geo_0['product']*unit_revenue
geo_1['profit']=geo_1['product']*unit_revenue
geo_2['profit']=geo_2['product']*unit_revenue

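Taking the 200 wells with the highest revenue in each region. The selection here uses the known `product` values of the whole region, not model predictions on a 500-well sample.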

In [18]:
geo_top_0=geo_0.sort_values('profit', ascending=False).head(200)
geo_top_1=geo_1.sort_values('profit', ascending=False).head(200)
geo_top_2=geo_2.sort_values('profit', ascending=False).head(200)

Calculating the total revenue of the selected wells in each region.

In [52]:
geo_profit_0=geo_top_0['profit'].sum()
geo_profit_1=geo_top_1['profit'].sum()
geo_profit_2=geo_top_2['profit'].sum()
print('Profit for region 0: {:.2f}$'.format(geo_profit_0))
print('Profit for region 1: {:.2f}$'.format(geo_profit_1))
print('Profit for region 2: {:.2f}$'.format(geo_profit_2))

Profit for region 0: 166350365.68$
Profit for region 1: 124150866.97$
Profit for region 2: 170596329.28$


Risk evaluation:

In [51]:
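def risk_evaluation(data):
    """Bootstrap the total of the 'profit' column and return its upper quantile."""
    state = np.random.RandomState(12345)
    values = []

    # Resample the table with replacement 500 times and total each resample.
    for _ in range(500):
        geosubsample = data.sample(frac=1, replace=True, random_state=state)
        values.append(geosubsample['profit'].sum())

    values=pd.Series(values)
    # With confidence_level = 0.975 this is the 97.5th percentile of the totals.
    return values.quantile(confidence_level)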

print('Profit evaluation for region 0, with confidence of 97.5%: {:.2f}$'.format(risk_evaluation(geo_top_0)))
print('Profit evaluation for region 1, with confidence of 97.5%: {:.2f}$'.format(risk_evaluation(geo_top_1)))
print('Profit evaluation for region 2, with confidence of 97.5%: {:.2f}$'.format(risk_evaluation(geo_top_2)))

Profit evaluation for region 0, with confidence of 97.5%: 166389849.46$
Profit evaluation for region 1, with confidence of 97.5%: 124150866.97$
Profit evaluation for region 2, with confidence of 97.5%: 170634365.54$


The bootstrap estimate for every region exceeds the 100 million USD budget, so all three regions look profitable under this evaluation.
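
Conditions 2 and 5 describe a somewhat different procedure from the cell above: draw 500 wells, keep the 200 with the highest predictions, and estimate the risk of a loss. The sketch below illustrates that procedure on synthetic data. It is not the method used in this notebook. The names `profit` and `bootstrap_profit`, the 1,000 iterations, and the synthetic volumes are all assumptions.

```python
import numpy as np

UNIT_REVENUE = 4500
BUDGET = 100_000_000
BEST_POINTS = 200
SAMPLE_SIZE = 500


def profit(target, predictions, count=BEST_POINTS):
    """Net profit from the `count` wells with the highest predicted volume."""
    best = np.argsort(predictions)[::-1][:count]
    return target[best].sum() * UNIT_REVENUE - BUDGET


def bootstrap_profit(target, predictions, n_iter=1000, seed=12345):
    """Bootstrap the profit distribution; return mean, 95% interval, loss risk."""
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        # Draw 500 well positions with replacement, then pick the best 200.
        idx = rng.integers(0, len(target), size=SAMPLE_SIZE)
        profits[i] = profit(target[idx], predictions[idx])
    lower, upper = np.quantile(profits, [0.025, 0.975])
    return {
        "mean": profits.mean(),
        "ci_95": (lower, upper),
        "loss_risk": (profits < 0).mean(),
    }


# Synthetic stand-in for a validation set: true volumes and noisy predictions.
rng = np.random.default_rng(0)
target = np.clip(rng.normal(95, 40, size=25_000), 0, None)
predictions = target + rng.normal(0, 30, size=25_000)

result = bootstrap_profit(target, predictions)
print("Mean profit: {:.0f} USD".format(result["mean"]))
print("95% interval: {:.0f} to {:.0f} USD".format(*result["ci_95"]))
print("Risk of loss: {:.2%}".format(result["loss_risk"]))
```

Under this scheme, a region would pass condition 5 only if its `loss_risk` is below 2.5%.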

Finding the ROI for region 2.

In [65]:
print('ROI for region 2: {:.2%}'.format((risk_evaluation(geo_top_2)-budget)/budget))

ROI for region 2: 70.63%


The most profitable region is region 2, with an estimated ROI of 70.63% on the 100 million USD investment.
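
The ROI is the estimate minus the budget, divided by the budget. As a check, the sketch below applies that formula to the three figures printed by `risk_evaluation`. The dictionary `estimates` and the variable `best_region` are names introduced here for illustration.

```python
budget = 100_000_000

# Figures copied from the risk evaluation output above, in USD.
estimates = {
    0: 166_389_849.46,
    1: 124_150_866.97,
    2: 170_634_365.54,
}

# ROI = (estimate - budget) / budget
roi = {region: (value - budget) / budget for region, value in estimates.items()}
for region, value in roi.items():
    print("ROI for region {}: {:.2%}".format(region, value))

best_region = max(roi, key=roi.get)
print("Highest ROI: region {}".format(best_region))
```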

## Conclusion

Developing an oil well is very expensive, so it is crucial to minimize the risks.  
After analyzing the data of the 3 regions, region 2 is likely to be the most profitable one, with an estimated ROI of over 70%.