# Description of the project

Suppose you work for the oil production company GlavRosGosNeft, and the company needs to decide where to drill a new well.

We are given oil samples from three regions: 10,000 fields in each, with measurements of oil quality and reserve volume. The task is to build a machine learning model that helps identify the region where production will bring the most profit, and to analyze the possible profit and risks using the *Bootstrap* technique.

Steps to choose a location:

- In the selected region, deposits are surveyed and the feature values are determined for each one;
- A model is built to estimate the volume of reserves;
- The deposits with the highest estimated values are selected. The number of deposits depends on the company's budget and the cost of developing one well;
- The profit is the total profit of the selected deposits.

# 1. Loading and preparing data

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [2]:
data_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_2 = pd.read_csv('/datasets/geo_data_2.csv')
print(data_0.dtypes)
print(data_1.dtypes)
print(data_2.dtypes)
data_0

id          object
f0         float64
f1         float64
f2         float64
product    float64
dtype: object
id          object
f0         float64
f1         float64
f2         float64
product    float64
dtype: object
id          object
f0         float64
f1         float64
f2         float64
product    float64
dtype: object


| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.221170 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.365080 | 73.037750 |
| 2 | 409Wp | 1.022732 | 0.151990 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |
| ... | ... | ... | ... | ... | ... |
| 99995 | DLsed | 0.971957 | 0.370953 | 6.075346 | 110.744026 |
| 99996 | QKivN | 1.392429 | -0.382606 | 1.273912 | 122.346843 |
| 99997 | 3rnvd | 1.029585 | 0.018787 | -1.348308 | 64.375443 |
| 99998 | 7kl59 | 0.998163 | -0.528582 | 1.583869 | 74.040764 |


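***The column types are as expected in all three tables: `id` is a string, and `f0`, `f1`, `f2` and `product` are floats. No type errors were found in the data.***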
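The `dtypes` output only confirms the column types. A minimal sketch of a few further checks (missing values, duplicated rows, repeated `id` values) is shown below. The helper name `summarize_quality` and the small demo frame are illustrative assumptions, not part of the original notebook or data.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Return simple data-quality counts for one region's table (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
    }


# Tiny made-up frame (not the project data) with one repeated id and one gap.
demo = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.1, 0.2, 0.3, None],
    "f1": [1.0, 1.1, 1.2, 1.3],
    "f2": [2.0, 2.1, 2.2, 2.3],
    "product": [10.0, 20.0, 30.0, 40.0],
})
quality = summarize_quality(demo)
print(quality)
```

The same call could be applied to `data_0`, `data_1` and `data_2` to back up the statement about data quality with counts.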

# 2. Train and validate the model

In [3]:
target_0 = data_0['product']
features_0 = data_0.drop(['product', 'id'], axis=1)

# samples in a ratio of 75:25
features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(
    features_0, target_0, test_size=0.25, random_state=12345)

# Model training and predictions on the validation set.
model_0 = LinearRegression()
model_0.fit(features_train_0, target_train_0)
predicted_valid_0 = model_0.predict(features_valid_0)
mse_0 = mean_squared_error(target_valid_0, predicted_valid_0) # MSE

print("MSE=", mse_0)
print("RMSE =", mse_0 ** 0.5)

predicted_valid_mean_0 = pd.Series(target_train_0.mean(), index=target_valid_0.index) # constant model: for each object it predicts the 'mean value of the target feature'
mse_mean_0 = mean_squared_error(target_valid_0, predicted_valid_mean_0)

print("\nMean")
print("MSE=", mse_mean_0)
print("RMSE =", mse_mean_0 ** 0.5)

print("\nR2 =", r2_score(target_valid_0, predicted_valid_0))

MSE= 1412.2129364399243
RMSE = 37.5794217150813

Mean
MSE= 1961.5678757223516
RMSE = 44.289591053907365

R2 = 0.27994321524487786


In [4]:
target_1 = data_1['product']
features_1 = data_1.drop(['product', 'id'], axis=1)

# samples in a ratio of 75:25
features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(
    features_1, target_1, test_size=0.25, random_state=12345)

# Model training and predictions on the validation set.
model_1 = LinearRegression()
model_1.fit(features_train_1, target_train_1)
predicted_valid_1 = model_1.predict(features_valid_1)
mse_1 = mean_squared_error(target_valid_1, predicted_valid_1) # MSE

print("MSE=", mse_1)
print("RMSE =", mse_1 ** 0.5)

predicted_valid_mean_1 = pd.Series(target_train_1.mean(), index=target_valid_1.index) # constant model: for each object it predicts the 'mean value of the target feature'
mse_mean_1 = mean_squared_error(target_valid_1, predicted_valid_mean_1)

print("\nMean")
print("MSE=", mse_mean_1)
print("RMSE =", mse_mean_1 ** 0.5)

print("\nR2 =", r2_score(target_valid_1, predicted_valid_1))

MSE= 0.7976263360391157
RMSE = 0.893099286775617

Mean
MSE= 2117.9734309299147
RMSE = 46.02144533725462

R2 = 0.9996233978805127


In [5]:
target_2 = data_2['product']
features_2 = data_2.drop(['product', 'id'], axis=1)

# samples in a ratio of 75:25
features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(
    features_2, target_2, test_size=0.25, random_state=12345)

# Model training and predictions on the validation set.
model_2 = LinearRegression()
model_2.fit(features_train_2, target_train_2)
predicted_valid_2 = model_2.predict(features_valid_2)
mse_2 = mean_squared_error(target_valid_2, predicted_valid_2) # MSE

print("MSE=", mse_2)
print("RMSE =", mse_2 ** 0.5)

predicted_valid_mean_2 = pd.Series(target_train_2.mean(), index=target_valid_2.index) # constant model: for each object it predicts the 'mean value of the target feature'
mse_mean_2 = mean_squared_error(target_valid_2, predicted_valid_mean_2)

print("\nMean")
print("MSE=", mse_mean_2)
print("RMSE =", mse_mean_2 ** 0.5)

print("\nR2 =", r2_score(target_valid_2, predicted_valid_2))

MSE= 1602.3775813236196
RMSE = 40.02970873393434

Mean
MSE= 2016.2210072435087
RMSE = 44.90234968510566

R2 = 0.20524758386040443


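***We can conclude that the model for the second region is almost exact (RMSE about 0.89, R2 about 0.9996), while in the first and third regions linear regression is only moderately better than the constant model (RMSE 37.6 against 44.3, and 40.0 against 44.9).***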
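The three training cells above repeat the same steps for each region. One possible way to express them once is sketched below. The function name `fit_and_validate` and the synthetic table are assumptions made for illustration; the split ratio, seed, model and metrics follow the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_validate(data, random_state=12345):
    """Train a linear regression on one region and score it on a 25% hold-out."""
    target = data["product"]
    features = data.drop(["product", "id"], axis=1)
    f_train, f_valid, t_train, t_valid = train_test_split(
        features, target, test_size=0.25, random_state=random_state)
    model = LinearRegression().fit(f_train, t_train)
    predicted = pd.Series(model.predict(f_valid), index=t_valid.index)
    # Constant baseline: always predict the training-set mean.
    baseline = pd.Series(t_train.mean(), index=t_valid.index)
    return {
        "rmse": mean_squared_error(t_valid, predicted) ** 0.5,
        "baseline_rmse": mean_squared_error(t_valid, baseline) ** 0.5,
        "r2": r2_score(t_valid, predicted),
    }


# Synthetic stand-in for a region table (made-up numbers, not the project data).
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "id": [f"w{i}" for i in range(n)],
    "f0": rng.normal(size=n),
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
})
demo["product"] = 50 + 20 * demo["f2"] + rng.normal(scale=1.0, size=n)
metrics = fit_and_validate(demo)
print(metrics)
```

With the real tables, a loop over `data_0`, `data_1` and `data_2` would give the three sets of metrics side by side.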

In [6]:
budget = 10000000000 # total development budget, rubles
bar_price = 450000 # barrel price, rubles
good_holes = 200 # number of wells developed with this budget
holes = 500 # number of wells sampled before the best ones are chosen

# volume of raw materials per well needed for break-even development
good_hole_price = (budget / good_holes) / bar_price
good_hole_price

111.11111111111111

- `budget`: total development budget;
- `bar_price`: barrel price;
- `good_holes`: number of wells developed with this budget;
- `holes`: number of wells sampled before the best ones are chosen;
- `good_hole_price`: volume of raw materials per well needed for break-even development.

In [7]:
print('mean in region 1', data_0['product'].mean())
print('mean in region 2', data_1['product'].mean())
print('mean in region 3', data_2['product'].mean())

mean in region 1 92.50000000000001
mean in region 2 68.82500000000002
mean in region 3 95.00000000000004


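***It can be seen here that the mean reserves in all three regions are below the break-even volume of 111.1. The first and third regions are close to each other (92.5 and 95.0), and the second region is the lowest (68.8). An average well therefore does not pay for itself in any region, so the wells have to be selected with the model rather than at random.***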
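A regional mean describes only the average well. A complementary check is the share of wells whose reserves exceed the break-even volume. The sketch below shows this on made-up volumes; `share_above` is a hypothetical helper, and the break-even value uses the same formula as `good_hole_price`.

```python
import pandas as pd


def share_above(product: pd.Series, threshold: float) -> float:
    """Fraction of wells whose reserves exceed the given volume (hypothetical helper)."""
    return float((product > threshold).mean())


# Same formula as good_hole_price: budget per developed well divided by the price.
break_even = (10_000_000_000 / 200) / 450_000

demo_product = pd.Series([50.0, 90.0, 120.0, 150.0])  # made-up volumes
share = share_above(demo_product, break_even)
print(round(break_even, 2), share)
```

Applied to `data_0['product']`, `data_1['product']` and `data_2['product']`, this would show how many wells in each region can pay for themselves.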


# 3. Preparing to Calculate Profits

In [8]:
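def revenue(target, probabilities, count):
    # `probabilities` holds the model's predicted volumes, not probabilities
    probs_sorted = probabilities.sort_values(ascending=False)
    # true volumes of the `count` wells with the highest predictions
    selected = target[probs_sorted.index][:count]
    return bar_price * selected.sum()

# first region: first 500 wells of the validation set
target = target_valid_0[:500].reset_index(drop=True)
probabilities = pd.Series(predicted_valid_0)[:500]

res = revenue(target, probabilities, 200)

# profit = revenue minus budget
print(res - budget)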

196771396.9994259


*For the first 500 validation wells of the first region, developing the 200 wells with the highest predicted reserves gives a profit of ***196 771 397*** rubles.*
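One detail of `revenue` matters once samples are drawn with replacement: label-based lookup on an index with repeated labels can return more rows than were requested. The sketch below shows the effect on made-up data and one possible alternative that selects wells by position. The names `profit_positional`, `BUDGET` and `UNIT_PRICE` are assumptions for illustration; the figures in this notebook were produced with `revenue` as written and would likely differ under this variant.

```python
import numpy as np
import pandas as pd

BUDGET = 10_000_000_000   # same value as `budget` in the notebook
UNIT_PRICE = 450_000      # same value as `bar_price` in the notebook


def profit_positional(target, predictions, count):
    """Profit from the `count` wells with the highest predicted volume.

    Works on positions, so repeated index labels cannot select extra rows.
    """
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    top = np.argsort(predictions)[::-1][:count]  # positions of the best predictions
    return UNIT_PRICE * target[top].sum() - BUDGET


# Made-up sample with a repeated label, as a draw with replacement can produce.
t = pd.Series([100.0, 100.0, 50.0], index=["a", "a", "b"])
p = pd.Series([90.0, 90.0, 60.0], index=["a", "a", "b"])

# Label lookup matches every row carrying each requested label.
n_rows = len(t.loc[p.sort_values(ascending=False).index])
demo_profit = profit_positional(t, p, 2)
print(n_rows, demo_profit)
```

Here three labels are requested but five rows come back, because each `"a"` matches both rows labelled `"a"`. The positional version always works with exactly `count` wells.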

# 4. Profit and Risk Calculation

In [9]:
target = target_valid_0.reset_index(drop=True) 
probabilities = pd.Series(predicted_valid_0)

state = np.random.RandomState(12345)
 
values_0 = []
b = 0  # number of loss-making samples
for i in range(1000):
    target_subsample = target.sample(n=500, replace=True, random_state=state)
    probs_subsample = probabilities[target_subsample.index]

    rev = revenue(target_subsample, probs_subsample, 200)
    values_0.append(rev)
    if rev - budget < 0:
        b += 1
        
values_0 = pd.Series(values_0)
lower = values_0.quantile(0.05)
upper = values_0.quantile(0.90)
 
mean = values_0.mean()
print('\nRegion 1\n')
print("Average profit:", mean - budget)
print("5%-quantile:", lower - budget)
print("90%-quantile:", upper - budget)
print('probability of loss:', b/1000)


Region 1

Average profit: 425938526.91059303
5%-quantile: -31803114.34611702
90%-quantile: 801501278.2033138
probability of loss: 0.06


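With a negative 5%-quantile and a 6% probability of loss, the first region is too risky. As the results for the other regions show, only the second region is suitable for development.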


In [10]:
target = target_valid_1.reset_index(drop=True) 
probabilities = pd.Series(predicted_valid_1)

state = np.random.RandomState(12345)
b = 0  # number of loss-making samples
values_1 = []
for i in range(1000):
    target_subsample = target.sample(n=500, replace=True, random_state=state)
    probs_subsample = probabilities[target_subsample.index]

    rev = revenue(target_subsample, probs_subsample, 200)
    values_1.append(rev)
    if rev - budget < 0:
        b += 1
values_1 = pd.Series(values_1)
lower = values_1.quantile(0.05)
upper = values_1.quantile(0.90)
 
mean = values_1.mean()
print('\nRegion 2\n')
print("Average profit:", mean - budget)
print("5%-quantile:", lower - budget)
print("90%-quantile:", upper - budget)
print('probability of loss:', b/1000)


Region 2

Average profit: 515222773.4432907
5%-quantile: 150785740.64118004
90%-quantile: 798830252.2696915
probability of loss: 0.01


In [11]:
target = target_valid_2.reset_index(drop=True) 
probabilities = pd.Series(predicted_valid_2)

state = np.random.RandomState(12345)
b = 0  # number of loss-making samples
values_2 = []
for i in range(1000):
    target_subsample = target.sample(n=500, replace=True, random_state=state)
    probs_subsample = probabilities[target_subsample.index]

    rev = revenue(target_subsample, probs_subsample, 200)
    values_2.append(rev)
    if rev - budget < 0:
        b += 1
        
values_2 = pd.Series(values_2)
lower = values_2.quantile(0.05)
upper = values_2.quantile(0.90)
 
mean = values_2.mean()
print('\nRegion 3\n')
print("Average profit:", mean - budget)
print("5%-quantile:", lower - budget)
print("90%-quantile:", upper - budget)
print('probability of loss:', b/1000)


Region 3

Average profit: 435008362.7827568
5%-quantile: -43448491.322502136
90%-quantile: 784904953.9330616
probability of loss: 0.064
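The three bootstrap cells differ only in the region they use. A minimal sketch of a single reusable function is given below, under several assumptions that are not in the original: the name `bootstrap_profit`, NumPy's `default_rng` generator instead of `RandomState`, positional sampling, and a 95% interval taken from the 2.5% and 97.5% quantiles rather than the 5% and 90% quantiles printed above. The demo data are synthetic.

```python
import numpy as np


def bootstrap_profit(target, predictions, n_iter=1000, sample_size=500,
                     top_k=200, unit_price=450_000, budget=10_000_000_000,
                     seed=12345):
    """Bootstrap the profit of developing the top_k predicted wells."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        # Draw well positions with replacement.
        idx = rng.integers(0, len(target), size=sample_size)
        # Keep the top_k sampled wells by predicted volume.
        best = idx[np.argsort(predictions[idx])[::-1][:top_k]]
        profits[i] = unit_price * target[best].sum() - budget
    return {
        "mean_profit": profits.mean(),
        "ci_95": (np.quantile(profits, 0.025), np.quantile(profits, 0.975)),
        "loss_probability": float((profits < 0).mean()),
    }


# Synthetic volumes and noisy predictions (made-up, not the project data).
demo_rng = np.random.default_rng(0)
true_volume = demo_rng.uniform(0, 190, size=5000)
predicted = true_volume + demo_rng.normal(scale=30, size=5000)
stats = bootstrap_profit(true_volume, predicted)
print(stats)
```

Calling it with `target_valid_0` and `predicted_valid_0` (and likewise for the other regions) would give comparable summaries, although the numbers would not match the outputs above exactly because the sampling and selection differ.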


### Conclusion:
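Based on the bootstrap results, the best region for development is the second: it has the highest average profit (about 515 million rubles), a positive 5%-quantile and a 1% probability of loss. The third region is the riskiest, with a 6.4% probability of loss. In the first and third regions I would not start development, because the probability of loss there is more than 2.5%.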
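The selection rule behind this conclusion (discard regions whose probability of loss exceeds 2.5%, then take the highest average profit) can be sketched with the figures printed in the bootstrap outputs. The dictionary layout and the constant name `MAX_LOSS_PROBABILITY` are illustrative assumptions.

```python
# Average profit and probability of loss copied from the bootstrap outputs above.
results = {
    "Region 1": {"mean_profit": 425938526.91, "loss_probability": 0.06},
    "Region 2": {"mean_profit": 515222773.44, "loss_probability": 0.01},
    "Region 3": {"mean_profit": 435008362.78, "loss_probability": 0.064},
}
MAX_LOSS_PROBABILITY = 0.025  # the 2.5% threshold mentioned in the conclusion

# Keep only regions that pass the risk threshold.
acceptable = {name: r for name, r in results.items()
              if r["loss_probability"] < MAX_LOSS_PROBABILITY}

# Among those, choose the highest average profit (None if no region passes).
best_region = (max(acceptable, key=lambda name: acceptable[name]["mean_profit"])
               if acceptable else None)
print(best_region)
```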