# Project description

The OilyGiant mining company wants to find the best place for a new well.

Steps to choose the location:

- Collect the oil well parameters in the selected region: oil quality and volume of reserves;
- Build a model for predicting the volume of reserves in the new wells;
- Pick the oil wells with the highest estimated values;
- Pick the region with the highest total profit for the selected oil wells.

We have data on oil samples from three regions. Parameters of each oil well in the region are already known. 
We need to build a model that helps pick the region with the highest profit, and analyze the potential profit and risks using the bootstrap technique.

## Download and prepare the data 

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [2]:
geo_0 = pd.read_csv('/datasets/geo_data_0.csv')
geo_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [3]:
geo_0.describe()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.500419 | 0.250143 | 2.502647 | 92.5 |
| std | 0.871832 | 0.504433 | 3.248248 | 44.288691 |
| min | -1.408605 | -0.848218 | -12.088328 | 0.0 |
| 25% | -0.07258 | -0.200881 | 0.287748 | 56.497507 |
| 50% | 0.50236 | 0.250252 | 2.515969 | 91.849972 |
| 75% | 1.073581 | 0.700646 | 4.715088 | 128.564089 |
| max | 2.362331 | 1.343769 | 16.00379 | 185.364347 |


In [4]:
geo_1 = pd.read_csv('/datasets/geo_data_1.csv')
geo_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [5]:
geo_1.describe()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 1.141296 | -4.796579 | 2.494541 | 68.825 |
| std | 8.965932 | 5.119872 | 1.703572 | 45.944423 |
| min | -31.609576 | -26.358598 | -0.018144 | 0.0 |
| 25% | -6.298551 | -8.267985 | 1.000021 | 26.953261 |
| 50% | 1.153055 | -4.813172 | 2.011479 | 57.085625 |
| 75% | 8.621015 | -1.332816 | 3.999904 | 107.813044 |
| max | 29.421755 | 18.734063 | 5.019721 | 137.945408 |


In [6]:
geo_2 = pd.read_csv('/datasets/geo_data_2.csv')
geo_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [7]:
geo_2.describe()

| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.002023 | -0.002081 | 2.495128 | 95.0 |
| std | 1.732045 | 1.730417 | 3.473445 | 44.749921 |
| min | -8.760004 | -7.08402 | -11.970335 | 0.0 |
| 25% | -1.162288 | -1.17482 | 0.130359 | 59.450441 |
| 50% | 0.009424 | -0.009482 | 2.484236 | 94.925613 |
| 75% | 1.158535 | 1.163678 | 4.858794 | 130.595027 |
| max | 7.238262 | 7.844801 | 16.739402 | 190.029838 |


## Train and test the model for each region

In [8]:
def train_test_model(features, target, region):
    """Fit a linear regression for one region and report validation metrics."""
    # Hold out 25% of the wells for validation
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=0.25, random_state=12345)
    model = LinearRegression()
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    predicted_valid_mean = predicted_valid.mean()
    # RMSE is expressed in the same units as the target (thousand barrels)
    rmse = mean_squared_error(target_valid, predicted_valid) ** 0.5
    print('For the region', region)
    print('RMSE =', rmse)
    print('Average volume of predicted reserves:', predicted_valid_mean)
    return target_valid, predicted_valid

In [9]:
y_0 = geo_0['product']
x_0 = geo_0.drop(['product','id'], axis=1)

y_1 = geo_1['product']
x_1 = geo_1.drop(['product','id'], axis=1)

y_2 = geo_2['product']
x_2 = geo_2.drop(['product','id'], axis=1)

In [10]:
y_0_valid, predicted_valid_0 = train_test_model(x_0,y_0,0)
y_1_valid, predicted_valid_1 = train_test_model(x_1,y_1,1)
y_2_valid, predicted_valid_2 = train_test_model(x_2,y_2,2)

For the region 0
RMSE = 37.5794217150813
Average volume of predicted reserves: 92.59256778438035
For the region 1
RMSE = 0.893099286775617
Average volume of predicted reserves: 68.728546895446
For the region 2
RMSE = 40.02970873393434
Average volume of predicted reserves: 94.96504596800489


Regions 0 and 2 have the highest average predicted reserves (about 92.6 and 95.0 thousand barrels per well), but their RMSE is also high (about 37.6 and 40.0), so individual predictions in these regions are unreliable.

Region 1 has a lower average (about 68.7 thousand barrels per well), but its RMSE of about 0.89 shows that the model predicts reserves there very accurately.
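
One way to put these RMSE values in context is to compare them with the standard deviation of `product`, which is roughly the RMSE a constant "predict the mean" model would get. The sketch below uses only the figures printed above. The standard deviations come from `describe()` on the full datasets, so treating them as a baseline for the validation split is an assumption.

```python
# RMSE of the linear model on the validation set (from the output above)
rmse = {0: 37.5794217150813, 1: 0.893099286775617, 2: 40.02970873393434}

# Standard deviation of `product` on the full dataset (from describe());
# assumed to be close to the RMSE of a constant mean prediction
std_product = {0: 44.288691, 1: 45.944423, 2: 44.749921}

ratios = {}
for region in rmse:
    # A ratio near 1 means the model barely beats predicting the mean
    ratios[region] = rmse[region] / std_product[region]
    print('Region {}: RMSE / std = {:.3f}'.format(region, ratios[region]))
```

A ratio close to 1 suggests the features explain little of the variation in reserves, while a ratio close to 0 suggests they explain almost all of it.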


## Prepare for profit calculation

When exploring a region, 500 points are studied and the best 200 are picked for the profit calculation.

The budget for developing 200 oil wells is USD 100 million.

One barrel of raw materials brings USD 4.5 of revenue, so the revenue from one unit of product is USD 4,500 (the volume of reserves is measured in thousand barrels).

After the risk evaluation, keep only the regions with the risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.

In [11]:
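points = 500               # wells studied when exploring a region
best_points = 200          # wells selected for development
budget = 100e6             # USD, development budget for 200 wells
revenue_per_unit = 4500    # USD per unit of product (1 unit = 1,000 barrels)
unit = 1000                # barrels per unit of product
risk = 0.025               # maximum acceptable probability of losses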

In [12]:
product_min = budget / (best_points * revenue_per_unit)
print('Minimum required volume of the product:',product_min)

Minimum required volume of the product: 111.11111111111111


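The minimum required volume per well (about 111.1 thousand barrels) exceeds the average predicted reserves in every region, so developing an average well would lose money; only wells with the highest predicted reserves can make the project profitable.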
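
To make the gap concrete, the following sketch recomputes the break-even volume and the shortfall of each region's average prediction. The averages are rounded from the model output above; the variable names `mean_predicted` and `shortfall` are introduced here for illustration only.

```python
budget = 100e6            # USD
best_points = 200         # wells to develop
revenue_per_unit = 4500   # USD per thousand barrels

# Break-even volume per well, in thousand barrels
product_min = budget / (best_points * revenue_per_unit)

# Average predicted reserves per region, rounded from the output above
mean_predicted = {0: 92.59, 1: 68.73, 2: 94.97}

shortfall = {}
for region, mean_volume in mean_predicted.items():
    # Positive shortfall: the average well is below break-even
    shortfall[region] = product_min - mean_volume
    print('Region {}: shortfall of {:.2f} thousand barrels per well'.format(
        region, shortfall[region]))
```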

## Profit from a set of selected oil wells and model predictions

In [13]:
def set_profit(predicted_valid, y_valid, region):
    """Profit from the 200 wells with the highest predicted reserves."""
    data = {'predicted': predicted_valid, 'target': y_valid}
    reserves = pd.DataFrame(data).reset_index(drop=True)
    # Rank wells by prediction, but sum their actual reserves
    top_200 = reserves.sort_values('predicted', ascending=False).head(best_points)
    sum_target = top_200['target'].sum()
    profit = (revenue_per_unit * sum_target) - budget
    print('For the region', region)
    print('Summarized target volume of reserves:', sum_target)
    print('Profit for the obtained volume of reserves:', profit)
    return profit

In [14]:
top_200_0=set_profit(predicted_valid_0,y_0_valid,0)
top_200_1=set_profit(predicted_valid_1,y_1_valid,1)
top_200_2=set_profit(predicted_valid_2,y_2_valid,2)

For the region 0
Summarized target volume of reserves: 29601.83565142189
Profit for the obtained volume of reserves: 33208260.43139851
For the region 1
Summarized target volume of reserves: 27589.081548181137
Profit for the obtained volume of reserves: 24150866.966815114
For the region 2
Summarized target volume of reserves: 28245.22214133296
Profit for the obtained volume of reserves: 27103499.635998324


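At this stage, region 0 looks the most profitable: choosing the top 200 wells from the whole validation set gives about USD 33.2 million, compared with about USD 24.2 million for region 1 and USD 27.1 million for region 2.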

## Risks and profit for each region

In [15]:
y_0_valid = pd.Series(y_0_valid).reset_index(drop=True)
y_1_valid = pd.Series(y_1_valid).reset_index(drop=True)
y_2_valid = pd.Series(y_2_valid).reset_index(drop=True)

predicted_valid_0 = pd.Series(predicted_valid_0)
predicted_valid_1 = pd.Series(predicted_valid_1)
predicted_valid_2 = pd.Series(predicted_valid_2)

In [16]:
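def profit(target, predicted_valid, count):
    """Profit from the `count` wells with the highest predicted reserves."""
    predicted_sorted = predicted_valid.sort_values(ascending=False)
    # Look up the actual volumes by index label and keep the first `count` wells
    selected = target[predicted_sorted.index][:count]
    sum_target = selected.sum()
    total_profit = (revenue_per_unit * sum_target) - budget
    return total_profit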

In [44]:

def bootstrap_profit(y_valid, predicted_valid, count, region):
    """Bootstrap the profit distribution for one region (1000 resamples)."""
    state = np.random.RandomState(12345)
    values = []
    for i in range(1000):
        # Draw 500 wells with replacement, then take the matching predictions
        target_subsample = y_valid.sample(n=points, replace=True, random_state=state)
        predicted_subsample = predicted_valid[target_subsample.index]
        sample_profit = profit(target_subsample, predicted_subsample, count)
        values.append(sample_profit)

    values = pd.Series(values)
    mean = values.mean()
    # Bounds of the 95% interval of the bootstrap distribution
    lower = values.quantile(0.025)
    upper = values.quantile(0.975)
    # Share of resamples that end with a negative profit, in percent
    loss = values[values < 0].count()
    risk_pct = (loss / values.count()) * 100
    print('For the region', region, ':')
    print('Average profit:{:.2f}'.format(mean))
    print('95% confidence interval:{:.2f}'.format(lower), 'to {:.2f}'.format(upper))
    print('Risk of losses:', risk_pct, '%')
    return mean, lower, upper, risk_pct


In [45]:
mean_0,lower_0,upper_0,risk_0=bootstrap_profit(y_0_valid,predicted_valid_0,200,0)
mean_1,lower_1,upper_1,risk_1=bootstrap_profit(y_1_valid,predicted_valid_1,200,1)
mean_2,lower_2,upper_2,risk_2=bootstrap_profit(y_2_valid,predicted_valid_2,200,2)

For the region 0 :
Average profit:4259385.27
95% confidence interval:-1020900.95 to 9479763.53
Risk of losses: 6.0 %
For the region 1 :
Average profit:5152227.73
95% confidence interval:688732.25 to 9315475.91
Risk of losses: 1.0 %
For the region 2 :
Average profit:4350083.63
95% confidence interval:-1288805.47 to 9697069.54
Risk of losses: 6.4 %
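
The bootstrap above looks up wells by index label after sampling with replacement. When labels repeat, a label-based lookup in pandas can return a duplicated well more times than it was drawn, which may shift the profit distribution slightly. Below is a minimal sketch of a positional variant that avoids labels altogether. The function name `bootstrap_profit_positional` and its parameters are hypothetical, it uses NumPy's `default_rng` instead of `RandomState`, and it has not been run on the three regions, so it would not reproduce the figures above exactly.

```python
import numpy as np

def bootstrap_profit_positional(target, predicted, n_points=500, n_best=200,
                                revenue_per_unit=4500, budget=100e6,
                                n_iter=1000, seed=12345):
    """Return an array of bootstrapped profits using positional indexing."""
    target = np.asarray(target, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        # Positions of the wells drawn with replacement
        idx = rng.integers(0, len(target), size=n_points)
        # Among the drawn wells, keep the n_best with the highest predictions;
        # a well drawn k times can appear at most k times
        order = np.argsort(predicted[idx])[::-1][:n_best]
        best = idx[order]
        profits[i] = revenue_per_unit * target[best].sum() - budget
    return profits

# Toy check: every well holds 1 unit, so each resample sums to exactly n_best
toy_profits = bootstrap_profit_positional(
    np.ones(1000), np.arange(1000.0), revenue_per_unit=1, budget=0, n_iter=10)
risk_pct = (toy_profits < 0).mean() * 100
```

The mean, the 2.5% and 97.5% quantiles, and the share of negative values of the returned array correspond to the average profit, the interval bounds, and the risk of losses reported above.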


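Based on the bootstrap results, we recommend developing wells in region 1. It has the highest average profit (about USD 5.15 million) and is the only region whose risk of losses (1.0%) is below the 2.5% threshold. Its 95% confidence interval (about USD 0.69 million to USD 9.32 million) lies entirely above zero.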
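
The selection rule stated earlier (keep regions with a risk of losses below 2.5%, then take the highest average profit) can be sketched as follows. The numbers are copied from the bootstrap output above; the names `results`, `eligible`, and `best_region` are illustrative.

```python
# Bootstrap results per region, copied from the output above
results = {
    0: {'mean_profit': 4259385.27, 'risk_pct': 6.0},
    1: {'mean_profit': 5152227.73, 'risk_pct': 1.0},
    2: {'mean_profit': 4350083.63, 'risk_pct': 6.4},
}
max_risk_pct = 2.5  # threshold from the task description

# Step 1: keep only regions whose risk of losses is below the threshold
eligible = {region: stats for region, stats in results.items()
            if stats['risk_pct'] < max_risk_pct}

# Step 2: among the eligible regions, pick the highest average profit
best_region = None
if eligible:
    best_region = max(eligible, key=lambda region: eligible[region]['mean_profit'])

print('Eligible regions:', list(eligible))
print('Recommended region:', best_region)
```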