# Choosing a location for a well

This project builds a model to predict the most promising oil production region, based on well samples from several regions and their measured indicators: oil quality and the volume of reserves in each well. The possible profit and the risks are then analysed for each region.

Steps to choose a location:

- Deposits are surveyed in the selected region, and the feature values are determined for each one;
- A model is built to estimate the volume of reserves;
- The deposits with the highest estimated values are selected. How many are selected depends on the company's budget and the cost of developing one well;
- The total profit equals the sum of the profits of the selected deposits.

## Loading and preparing data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
df_0 = pd.read_csv('/datasets/geo_data_0.csv')
df_1 = pd.read_csv('/datasets/geo_data_1.csv')
df_2 = pd.read_csv('/datasets/geo_data_2.csv')

In [3]:
display(df_1.head())

Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.00116,134.766305
3,KcrkZ,-13.081196,-11.506057,4.999415,137.945408
4,AHL4O,12.702195,-8.147433,5.004363,134.766305


In [4]:
features_0 = df_0.drop(['id','product'] , axis=1)
target_0 = df_0['product']

In [5]:
features_1 = df_1.drop(['id','product'] , axis=1)
target_1 = df_1['product']

In [6]:
features_2 = df_2.drop(['id','product'] , axis=1)
target_2 = df_2['product']

In [7]:
numeric = ['f0', 'f1', 'f2']

In [8]:
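def scal(data_1, data_2):
    # Fit the scaler on the training features only, then apply it to both sets.
    scaler = StandardScaler()
    scaler.fit(data_1[numeric])
    # Work on copies so the caller's DataFrames are not modified in place.
    data_1 = data_1.copy()
    data_2 = data_2.copy()
    data_1[numeric] = scaler.transform(data_1[numeric])
    data_2[numeric] = scaler.transform(data_2[numeric])
    return data_1, data_2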

### Comments

The data files were loaded and inspected, then split into features (`f0`, `f1`, `f2`) and the target (`product`).
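The same feature/target split is repeated for each region, so it could be wrapped in a small helper. The sketch below is illustrative only: `split_features_target` is a hypothetical name that does not appear in the notebook, and the DataFrame is synthetic stand-in data rather than the real `geo_data_*.csv` files.

```python
import numpy as np
import pandas as pd


def split_features_target(df, target_col="product", drop_cols=("id",)):
    """Return (features, target) for one region's DataFrame (hypothetical helper)."""
    features = df.drop(columns=[*drop_cols, target_col])
    target = df[target_col]
    return features, target


# Synthetic stand-in for one region; the notebook reads the real data from CSV.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": [f"w{i}" for i in range(5)],
    "f0": rng.normal(size=5),
    "f1": rng.normal(size=5),
    "f2": rng.normal(size=5),
    "product": rng.uniform(0, 150, size=5),
})

features_demo, target_demo = split_features_target(demo)
print(list(features_demo.columns))
print(target_demo.shape)
```

With the real data, the helper could be called once per region instead of repeating the `drop` and column selection three times.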

## Train and validate the model

In [9]:
features_0_train, features_0_valid, target_0_train, target_0_valid = train_test_split(
    features_0, target_0, test_size=0.25, random_state=12345)

In [10]:
features_1_train, features_1_valid, target_1_train, target_1_valid = train_test_split(
    features_1, target_1, test_size=0.25, random_state=12345)

In [11]:
features_2_train, features_2_valid, target_2_train, target_2_valid = train_test_split(
    features_2, target_2, test_size=0.25, random_state=12345)

In [12]:
print(features_0_train.shape)
print(features_0_valid.shape)

(75000, 3)
(25000, 3)


In [13]:
def model_lr(features, target):
    model = LinearRegression()
    model.fit(features, target)
    return model

In [14]:
def predict_lr(model, features_v, target_v):
    predictions_valid = model.predict(features_v)
    mse = mean_squared_error(target_v, predictions_valid)
    # Keep actual and predicted volumes side by side, aligned on the validation index.
    data = pd.DataFrame(target_v)
    data['predict'] = pd.Series(predictions_valid, index=target_v.index)
    mean_v = data['predict'].mean()
    print('rmse:', round(mse**0.5, 4))
    print()
    print('Mean predicted reserves, thousand barrels:', round(mean_v, 4))
    return data

In [34]:
features_0_train_s, features_0_valid_s = scal(features_0_train, features_0_valid)
features_1_train_s, features_1_valid_s = scal(features_1_train, features_1_valid)
features_2_train_s, features_2_valid_s = scal(features_2_train, features_2_valid)

In [35]:
features_0_train_s

Unnamed: 0,f0,f1,f2
27212,-0.544828,1.390264,-0.094959
7866,1.455912,-0.480422,1.209567
62041,0.260460,0.825069,-0.204865
70185,-1.837105,0.010321,-0.147634
82230,-1.299243,0.987558,1.273181
...,...,...,...
4094,1.567114,-1.087243,-0.272211
85412,-1.904207,-0.525360,1.327530
2177,0.418949,-1.296788,-0.196407
77285,0.400077,-1.466874,-0.445317


In [16]:
model_0 = model_lr(features_0_train_s, target_0_train)

In [17]:
pred_0 = predict_lr(model_0,features_0_valid_s, target_0_valid)
display(pred_0.head())

rmse: 37.5794

Mean predicted reserves, thousand barrels: 92.5926


Unnamed: 0,product,predict
71751,10.038645,95.894952
80493,114.551489,77.572583
2655,132.603635,77.89264
53233,169.072125,90.175134
91141,122.32518,70.510088


In [18]:
model_1 = model_lr(features_1_train_s, target_1_train)
pred_1 = predict_lr(model_1,features_1_valid_s, target_1_valid)
display(pred_1.head())

rmse: 0.8931

Mean predicted reserves, thousand barrels: 68.7285


Unnamed: 0,product,predict
71751,80.859783,82.663314
80493,53.906522,54.431786
2655,30.132364,29.74876
53233,53.906522,53.552133
91141,0.0,1.243856


In [19]:
model_2 = model_lr(features_2_train_s, target_2_train)
pred_2 = predict_lr(model_2,features_2_valid_s, target_2_valid)
display(pred_2.head())

rmse: 40.0297

Mean predicted reserves, thousand barrels: 94.965


Unnamed: 0,product,predict
71751,61.212375,93.599633
80493,41.850118,75.105159
2655,57.776581,90.066809
53233,100.053761,105.162375
91141,109.897122,115.30331


### Comments

The data were split into training and validation sets (75% / 25%), and a linear regression model was trained for each region. Comparing the predictions with the true values on the validation set, and comparing RMSE across regions, shows that the model for region 1 is by far the most accurate: its RMSE is about 0.9 thousand barrels against a mean predicted reserve of 68.7 thousand barrels. For regions 0 and 2 the mean predicted reserve is above 90 thousand barrels, but the RMSE is about 40 thousand barrels, which is almost half the mean.
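To put an RMSE value in context, it helps to compare it with the error of a constant prediction equal to the mean. The sketch below does this for the five region 1 validation rows printed above. It is only an illustration on those five rows, not a recomputation of the full validation metric, and the `rmse` helper is a hypothetical name introduced here.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# The five region 1 validation rows shown in the notebook output.
actual = [80.859783, 53.906522, 30.132364, 53.906522, 0.0]
predicted = [82.663314, 54.431786, 29.74876, 53.552133, 1.243856]

model_rmse = rmse(actual, predicted)
# Baseline: always predict the mean of the actual values.
baseline_rmse = rmse(actual, np.full(len(actual), np.mean(actual)))

print("model RMSE:   ", round(model_rmse, 2))
print("baseline RMSE:", round(baseline_rmse, 2))
```

On these rows the model error is far smaller than the baseline error, which is what a well-fitting model should show. An RMSE close to the baseline would mean the model adds little over predicting the mean.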

## Preparation for profit calculation

In [20]:
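BAR_1 = 450000      # revenue per unit of product (thousand barrels), rubles
BUDGET = 10**10     # development budget for the region, rubles
SQUAZH = 200        # number of wells selected for development
N_SQ = 500          # number of wells drawn in each bootstrap sample
BIL = 10**9         # divisor for reporting values in billions of rubles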

In [21]:
POROG = BUDGET / (SQUAZH * BAR_1)

In [22]:
POROG

111.11111111111111

In [23]:
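state = np.random.RandomState(12345)
def pryb(target):
    # Draw N_SQ wells with replacement, then keep the SQUAZH wells with the highest predictions.
    target_subsample = target.sample(N_SQ, replace=True, random_state=state)
    target_pred = target_subsample.sort_values(by = 'predict', ascending=False).head(SQUAZH)
    # Profit in billions of rubles: revenue from the actual reserves minus the budget.
    summa = target_pred['product'].sum()
    profit = summa * BAR_1 / BIL - BUDGET / BIL
    return round(profit,4)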

### Comments
The budget for developing wells in a region is 10 billion rubles, spread over 200 wells, and each unit of product (one thousand barrels) brings in 450,000 rubles. To break even, a well must therefore hold on average more than 111 thousand barrels. The mean predicted reserves on the validation sets (92.6, 68.7 and 95.0 thousand barrels) are below this threshold in all three regions, so profit depends on selecting the wells with the highest predictions.
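The break-even arithmetic can be laid out step by step. The sketch below reuses the constant values and the mean predicted reserves printed earlier in the notebook. The names `WELLS_DEVELOPED` and `REVENUE_PER_UNIT` are introduced here for readability and correspond to `SQUAZH` and `BAR_1`.

```python
BUDGET = 10**10            # rubles, as in the notebook
WELLS_DEVELOPED = 200      # same value as SQUAZH
REVENUE_PER_UNIT = 450000  # same value as BAR_1, rubles per thousand barrels

# Cost of one well, then the volume a well needs to pay for itself.
cost_per_well = BUDGET / WELLS_DEVELOPED
break_even = cost_per_well / REVENUE_PER_UNIT

# Mean predicted reserves per region, copied from the validation output.
mean_predicted = {0: 92.5926, 1: 68.7285, 2: 94.965}
shortfall = {region: break_even - mean for region, mean in mean_predicted.items()}

print("break-even volume:", round(break_even, 2), "thousand barrels")
for region, gap in shortfall.items():
    print(f"region {region}: mean is {gap:.1f} thousand barrels below break-even")
```

Every region's mean falls short of the threshold, which is why the later profit calculation keeps only the 200 best-predicted wells out of each sample of 500.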


## Calculation of profit and risks

In [24]:
profit_0 = pryb(pred_0)
profit_1 = pryb(pred_1)
profit_2 = pryb(pred_2)

In [25]:
result_all = pd.DataFrame({
    'squazhina': ['profit_0', 'profit_1', 'profit_2'],
    'profit': [profit_0, profit_1, profit_2],  # stays numeric instead of being cast to str by np.array
})

In [26]:
result_all

Unnamed: 0,squazhina,profit
0,profit_0,0.6055
1,profit_1,0.3343
2,profit_2,0.6262


In [27]:
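state = np.random.RandomState(12345)
def dover(data):
    values = []

    # Bootstrap: repeat the profit calculation on 1000 random samples.
    for i in range(1000):
        values.append(pryb(data))

    values = pd.Series(values)

    # 95% interval from the 2.5% and 97.5% quantiles of the bootstrap profits.
    lower = values.quantile(0.025)
    upper = values.quantile(0.975)

    mean_v = values.mean()
    print('Mean profit, bln', round(mean_v,2))
    print()
    print('Confidence interval, bln:', round(lower,2), '< X <',round(upper,2))
    # Risk of loss: share of bootstrap samples with negative profit.
    return (values < 0).mean()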

In [28]:
pryb_0 = dover(pred_0)

Mean profit, bln 0.4

Confidence interval, bln: -0.11 < X < 0.91


In [29]:
pryb_1 = dover(pred_1)

Mean profit, bln 0.46

Confidence interval, bln: 0.08 < X < 0.86


In [30]:
pryb_2 = dover(pred_2)

Mean profit, bln 0.39

Confidence interval, bln: -0.11 < X < 0.93


In [31]:
pryb_2*100

6.5

In [32]:
pryb_1*100

0.7000000000000001

In [33]:
pryb_0*100

6.9

## Conclusions
According to the calculations, region 1 is the best choice for well development. The mean profits from the best wells are fairly close across the three regions (0.40, 0.46 and 0.39 billion rubles for regions 0, 1 and 2), but the probability of a loss is below 1% in region 1 (0.7%), whereas in regions 0 and 2 it exceeds 6% (6.9% and 6.5%). Region 1 is also the only region whose 95% interval lies entirely above zero.
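The risk figures come from a bootstrap over the validation predictions. As a rough, self-contained sketch of that procedure, the code below runs the same steps on synthetic data. The function name `bootstrap_profit`, the renamed constants and the generated `demo` frame are assumptions made for illustration, so the numbers it prints do not correspond to any real region.

```python
import numpy as np
import pandas as pd

REVENUE_PER_UNIT = 450000  # same value as BAR_1
BUDGET = 10**10
N_SAMPLED = 500            # same value as N_SQ
N_SELECTED = 200           # same value as SQUAZH


def bootstrap_profit(data, n_iter=1000, seed=12345):
    """Bootstrap profit (billion rubles) for a frame with 'product' and 'predict'."""
    state = np.random.RandomState(seed)
    profits = []
    for _ in range(n_iter):
        # Sample wells with replacement, keep those with the highest predictions.
        sample = data.sample(N_SAMPLED, replace=True, random_state=state)
        best = sample.sort_values("predict", ascending=False).head(N_SELECTED)
        # Profit is computed from actual reserves, not from the predictions.
        profits.append(best["product"].sum() * REVENUE_PER_UNIT - BUDGET)
    profits = pd.Series(profits) / 1e9
    return {
        "mean": profits.mean(),
        "lower": profits.quantile(0.025),
        "upper": profits.quantile(0.975),
        "risk": (profits < 0).mean(),
    }


# Synthetic stand-in: true volumes plus noisy predictions.
rng = np.random.default_rng(0)
true_volume = rng.uniform(0, 190, size=5000)
demo = pd.DataFrame({
    "product": true_volume,
    "predict": true_volume + rng.normal(0, 40, size=5000),
})

summary = bootstrap_profit(demo, n_iter=200)
print({key: round(float(value), 3) for key, value in summary.items()})
```

Passing the seed into the function, rather than relying on a module-level `state`, makes each region's result reproducible regardless of the order in which the cells are run.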
