# Finding a Suitable Location for New Oil Well Drilling

# Contents <a id='contents'></a>

* [0 Overview](#big_picture)
    * [0.1 Introduction](#intro)
    * [0.2 Data Description](#data_description)
    * [0.3 Goals](#goals_and_step)

* [1 Data Preprocessing](#data_preprocessing)
    * [1.1 Load Data](#load_data)
    * [1.2 Initial Data Exploration](#initial_data_exploration)
    * [1.3 Initial Summary](#initial_summary)
    
* [2 Train and Test Model for Each Region](#train_data_and_test_model)
    * [2.1 Splitting Data into Training Set and Validation Set](#split_data)
    * [2.2 Model Train and Generate Prediction for Validation Dataset](#train_model)
    * [2.3 Save Prediction and Correct Validation Dataset](#save_valid_answer)
    * [2.4 Average Product Volume and RMSE Model](#mean_product_volume)
    * [2.5 Initial Analysis](#initial_analysis)

* [3 Initial Profit Calculation](#initial_profit_calculation)
    * [3.1 Main Variable](#key_variable)
    * [3.2 Volume of Oil Reserves Sufficient to Develop a New Well](#minimum_oil_volume)
    * [3.3 Subsection Conclusion](#subsection_conclusion)


* [4 Profit Calculation](#profit_calculation)

* [5 Risk and Return](#risk_and_return)

* [6 Summary](#summary)

# Overview <a id='big_picture'></a>

### Introduction <a id='intro'></a>

As a data scientist at OilyGiant, I have been asked to find a suitable location for drilling new oil wells. The available data consists of oil samples from three regions. In this project I build a model that helps select the region with the highest profit margin, and I analyze the potential profit and risk using the bootstrapping technique.

### Data Description <a id='data_description'></a>

**Features:**

- id: unique oil well identifier
- f0, f1, f2: three features of each point (their specific meaning is not important, but the features themselves are significant)
- product: volume of oil reserves in the well (thousand barrels).

### Goals <a id='goals_and_step'></a>

**The purpose of this project is to find a suitable location for oil well drilling.**

**Steps to be taken:**
1. Train and test the model for each area:
 - Separate data into training set and validation set with a ratio of 75:25.
 - Train your model and make predictions for the validation set.
 - Save the predictions and correct answers for the validation set.
 - Display the predicted average volume of oil reserves and the model's RMSE.
 - Analyze the results.
2. Prepare for profit calculation:
 - Save all key values for profit calculation in separate variables.
 - Calculate the volume of oil reserves sufficient to develop a new well without losses.  
 - Compare the obtained value with the average volume of oil reserves in each area.
 - Present your findings regarding the preparation for profit calculation.
3. Create a function to calculate profits from a selected set of oil wells and model predictions:
 - Select the wells with the highest predicted values.
 - Sum the target volume of oil reserves for the wells selected by those predictions.
 - Propose an area for oil well development and justify the choice. Calculate the profit for the obtained volume of oil reserves.
4. Calculate the risks and profits for each area:
 - Use the bootstrapping technique with 1,000 samples to find the profit distribution.
 - Find the average profit, the 95% confidence interval, and the risk of loss. A loss is a negative profit; calculate the probability of a loss and express it as a percentage.
 - Present your findings: suggest an area for oil well development and provide justification or reasons for your choice.

## Data Preprocessing<a id='data_preprocessing'></a>

In [None]:
# Load all libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from scipy import stats as st 

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, f1_score, roc_auc_score, make_scorer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

from sklearn.utils import shuffle

import warnings
warnings.filterwarnings('ignore')

### Load Data <a id='load_data'></a>

In [None]:
# Load the data files into DataFrames
df0 = pd.read_csv('/datasets/geo_data_0.csv')
df1 = pd.read_csv('/datasets/geo_data_1.csv')
df2 = pd.read_csv('/datasets/geo_data_2.csv')

### Initial Data Exploration <a id='initial_data_exploration'></a>

In [None]:
# Show a random sample of rows for a quick look at the data
df0.sample(5)

Unnamed: 0,id,f0,f1,f2,product
93831,E1JtE,1.898214,-0.089501,3.603826,156.112824
15431,ahykT,2.013823,0.323875,-0.242273,51.184754
56500,iTbMK,-1.034147,0.38946,5.601539,64.416238
27361,tsRG2,-0.56603,0.834494,2.160257,124.513944
41720,ceTFD,0.409916,-0.399627,5.132847,147.054842


In [None]:
# Show general information about the DataFrame
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [None]:
# Show summary statistics for the numeric columns
df0.describe()

Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.500419,0.250143,2.502647,92.5
std,0.871832,0.504433,3.248248,44.288691
min,-1.408605,-0.848218,-12.088328,0.0
25%,-0.07258,-0.200881,0.287748,56.497507
50%,0.50236,0.250252,2.515969,91.849972
75%,1.073581,0.700646,4.715088,128.564089
max,2.362331,1.343769,16.00379,185.364347


In [None]:
# Show a random sample of rows for a quick look at the data
df1.sample(5)

Unnamed: 0,id,f0,f1,f2,product
62998,2NKwm,-2.620256,3.663492,3.008799,80.859783
39524,PnlHw,0.56489,-0.023699,1.004018,26.953261
80158,opMfG,7.757181,-2.18795,-0.012661,0.0
22057,wo9lI,8.482358,-7.066569,4.000716,107.813044
72553,hC9Ae,-1.9629,-10.694838,1.989563,57.085625


In [None]:
# Show general information about the DataFrame
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [None]:
# Show summary statistics for the numeric columns
df1.describe()

Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,1.141296,-4.796579,2.494541,68.825
std,8.965932,5.119872,1.703572,45.944423
min,-31.609576,-26.358598,-0.018144,0.0
25%,-6.298551,-8.267985,1.000021,26.953261
50%,1.153055,-4.813172,2.011479,57.085625
75%,8.621015,-1.332816,3.999904,107.813044
max,29.421755,18.734063,5.019721,137.945408


In [None]:
# Show a random sample of rows for a quick look at the data
df2.sample(5)

Unnamed: 0,id,f0,f1,f2,product
52208,Ph8zL,3.511185,0.812331,4.170611,135.533778
87965,tgOl6,-6.077439,-1.366449,1.858951,153.00124
24501,tavUM,0.888976,-0.523874,7.791243,121.622966
9695,UG10W,1.393515,-1.369817,3.761893,126.167995
93144,Sa0R8,-0.812961,0.063893,8.140978,86.495576


In [None]:
# Show general information about the DataFrame
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [None]:
# Show summary statistics for the numeric columns
df2.describe()

Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.002023,-0.002081,2.495128,95.0
std,1.732045,1.730417,3.473445,44.749921
min,-8.760004,-7.08402,-11.970335,0.0
25%,-1.162288,-1.17482,0.130359,59.450441
50%,0.009424,-0.009482,2.484236,94.925613
75%,1.158535,1.163678,4.858794,130.595027
max,7.238262,7.844801,16.739402,190.029838


### Initial Summary <a id='initial_summary'></a>

**Insights:**
1. The data is complete (no null values) and all column data types are already correct.
2. On average, the points in df0 (mean 92.5) and df2 (mean 95.0) hold larger oil reserves than those in df1 (mean 68.8).
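
The checks behind these insights can be collected in one small helper. The sketch below is illustrative only: `quality_report` and `make_region` are hypothetical names, the data is synthetic (the real CSV files are not available here), and the duplicate-id count is an extra check that the notebook itself does not perform.

```python
import numpy as np
import pandas as pd


def quality_report(frames):
    """Summarize missing values, duplicated ids and mean reserves per region."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "region": name,
            "rows": len(df),
            "missing_values": int(df.isna().sum().sum()),
            "duplicated_ids": int(df["id"].duplicated().sum()),
            "mean_product": float(df["product"].mean()),
        })
    return pd.DataFrame(rows).set_index("region")


# Synthetic stand-ins for df0, df1 and df2 (assumption: same column layout)
rng = np.random.default_rng(0)


def make_region(mean_product, n=1000):
    return pd.DataFrame({
        "id": [f"w{i}" for i in range(n)],
        "f0": rng.normal(size=n),
        "f1": rng.normal(size=n),
        "f2": rng.normal(size=n),
        "product": rng.normal(loc=mean_product, scale=5.0, size=n).clip(min=0),
    })


demo_frames = {
    "region_0": make_region(92.5),
    "region_1": make_region(68.8),
    "region_2": make_region(95.0),
}
report = quality_report(demo_frames)
print(report)
```

With the real data, the same call would be `quality_report({"region_0": df0, "region_1": df1, "region_2": df2})`.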

## Train and Test Model for Each Region<a id='train_data_and_test_model'></a>

### Splitting Data into Training Set and Validation Set <a id='split_data'></a>

In [None]:
# Function to split data into training and validation set
def split_data (data):
    features = data.drop(['product','id'], axis=1)
    target = data['product']
    
    features_train, features_valid, target_train, target_valid = train_test_split(features, target, 
                                                                                  random_state=12345, 
                                                                                  test_size=0.25)
    return features_train, features_valid, target_train, target_valid

In [None]:
# Call the split_data function for each region
features_train_0, features_valid_0, target_train_0, target_valid_0 = split_data(df0)
features_train_1, features_valid_1, target_train_1, target_valid_1 = split_data(df1)
features_train_2, features_valid_2, target_train_2, target_valid_2 = split_data(df2)

In [None]:
# Checking split_data step
print(features_train_0.shape)
print(features_valid_0.shape)
print(target_train_0.shape)
print(target_valid_0.shape)

(75000, 3)
(25000, 3)
(75000,)
(25000,)


In [None]:
# Checking split_data step
print(features_train_1.shape)
print(features_valid_1.shape)
print(target_train_1.shape)
print(target_valid_1.shape)

(75000, 3)
(25000, 3)
(75000,)
(25000,)


In [None]:
# Checking split_data step
print(features_train_2.shape)
print(features_valid_2.shape)
print(target_train_2.shape)
print(target_valid_2.shape)

(75000, 3)
(25000, 3)
(75000,)
(25000,)


In [None]:
df_all = [
    df0.drop('id', axis = 1),
    df1.drop('id', axis = 1),
    df2.drop('id', axis = 1),
]

In [None]:
state = np.random.RandomState(12345)

samples_target = []
samples_predictions = []

for region in range(len(df_all)):
    data  = df_all[region]

    features = data.drop('product', axis = 1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split( 
        features, target, test_size = 0.25, random_state = state)
    
    model = LinearRegression()
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)

    samples_target.append(target_valid.reset_index(drop = True))
    samples_predictions.append(pd.Series(predictions))

    mean_product = target.mean()
    model_rmse = mean_squared_error(target_valid, predictions)**0.5

    print("—Region", region, "—")
    print("mean product amount =", mean_product)
    print("Model RMSE:", model_rmse)
    print()

—Region 0 —
mean product amount = 92.50000000000001
Model RMSE: 37.5794217150813

—Region 1 —
mean product amount = 68.82500000000002
Model RMSE: 0.889736773768065

—Region 2 —
mean product amount = 95.00000000000004
Model RMSE: 39.958042459521614



### Model Train and Generate Prediction for Validation Dataset <a id='train_model'></a>

In [None]:
# Function to tune hyperparameter for LinearRegression
def fit(features_train, target_train, features_valid, target_valid):
    param_grid = {'fit_intercept': [True, False],
                  'copy_X' : [True, False],
                  'n_jobs': [1, 2, -1],
                  'positive': [True, False],
    }

    # LinearRegression is deterministic and takes no random_state argument
    lr = LinearRegression()
    rmse_scorer = make_scorer(lambda target_valid, target_pred: 
                              np.sqrt(mean_squared_error(target_valid, target_pred)), 
                              greater_is_better=False)
    
    grid = GridSearchCV(estimator = lr, 
                        param_grid = param_grid, 
                        cv = 5, 
                        scoring = rmse_scorer)
    
    grid.fit(features_train, target_train)

    best_params = grid.best_params_
    best_rmse = -grid.best_score_
    target_pred = grid.predict(features_valid)
    mean_product = target_pred.mean()
    model_rmse = mean_squared_error(target_valid, target_pred)**0.5
    
    return best_params, best_rmse, target_pred, mean_product, model_rmse

In [None]:
# Function to tune hyperparameters for Ridge regression
def fit2(features_train, target_train, features_valid, target_valid):
    param_grid = {
                  'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'lbfgs'],
                  'alpha': [0.01, 0.1, 1, 10, 100],
    }

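    # Pass the seed as random_state (used by the 'sag' and 'saga' solvers);
    # a positional argument would be interpreted as alpha
    r = Ridge(random_state=12345)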
    rmse_scorer = make_scorer(lambda target_valid, target_pred: 
                              np.sqrt(mean_squared_error(target_valid, target_pred)), 
                              greater_is_better=False)
    
    grid = GridSearchCV(estimator = r, 
                        param_grid = param_grid, 
                        cv = 5, 
                        scoring = rmse_scorer)
    
    grid.fit(features_train, target_train)

    best_params = grid.best_params_
    best_rmse = -grid.best_score_
    target_pred = grid.predict(features_valid)
    mean_product = target_pred.mean()
    model_rmse = mean_squared_error(target_valid, target_pred)**0.5
    
    return best_params, best_rmse, target_pred, mean_product, model_rmse

In [None]:
# Call fit2 to tune and train the Ridge model for df0
best_params_0, best_rmse_0, target_pred_0, mean_product_0, model_rmse_0 = fit2(features_train_0, 
                                                                              target_train_0, 
                                                                              features_valid_0, 
                                                                              target_valid_0)

In [None]:
# Print Best Params for df0
best_params_0

{'alpha': 100, 'solver': 'saga'}

In [None]:
# Print Best RMSE for df0
best_rmse_0

37.73227317571341

In [None]:
# Call fit2 to tune and train the Ridge model for df1
best_params_1, best_rmse_1, target_pred_1, mean_product_1, model_rmse_1 = fit2(features_train_1, 
                                                                              target_train_1, 
                                                                              features_valid_1, 
                                                                              target_valid_1)

In [None]:
# Print Best Params for df1
best_params_1

{'alpha': 0.1, 'solver': 'lsqr'}

In [None]:
# Print Best RMSE for df1
best_rmse_1

0.8895409504762819

In [None]:
# Call fit2 to tune and train the Ridge model for df2
best_params_2, best_rmse_2, target_pred_2, mean_product_2, model_rmse_2 = fit2(features_train_2, 
                                                                              target_train_2, 
                                                                              features_valid_2, 
                                                                              target_valid_2)

In [None]:
# Print Best Params for df2
best_params_2

{'alpha': 1, 'solver': 'saga'}

In [None]:
# Print Best RMSE for df2
best_rmse_2

40.06572205067444

In [None]:
df_all = [
    df0.drop('id', axis = 1),
    df1.drop('id', axis = 1),
    df2.drop('id', axis = 1),
]

In [None]:
state = np.random.RandomState(12345)

samples_target = []
samples_predictions = []

for region in range(len(df_all)):
    data  = df_all[region]

    features = data.drop('product', axis = 1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split( 
        features, target, test_size = 0.25, random_state = state)
    
    model = LinearRegression(copy_X = True, 
                             fit_intercept = True, 
                             n_jobs = 1, 
                             positive = False)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)

    samples_target.append(target_valid.reset_index(drop = True))
    samples_predictions.append(pd.Series(predictions))

    mean_product = target.mean()
    model_rmse = mean_squared_error(target_valid, predictions)**0.5

    print("—Region", region, "—")
    print("mean product amount =", mean_product)
    print("Model RMSE:", model_rmse)
    print()

—Region 0 —
mean product amount = 92.50000000000001
Model RMSE: 37.5794217150813

—Region 1 —
mean product amount = 68.82500000000002
Model RMSE: 0.889736773768065

—Region 2 —
mean product amount = 95.00000000000004
Model RMSE: 39.958042459521614



In [None]:
state = np.random.RandomState(12345)

samples_target_2 = []
samples_predictions_2 = []

for region in range(len(df_all)):
    data  = df_all[region]

    features = data.drop('product', axis = 1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split( 
        features, target, test_size = 0.25, random_state = state)
    
    model = Ridge(alpha = 1, solver = 'sag')
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)

    # Store the Ridge results in their own lists so the LinearRegression results are not extended
    samples_target_2.append(target_valid.reset_index(drop = True))
    samples_predictions_2.append(pd.Series(predictions))

    mean_product = target.mean()
    model_rmse = mean_squared_error(target_valid, predictions)**0.5

    print("—Region", region, "—")
    print("mean product amount =", mean_product)
    print("Model RMSE:", model_rmse)
    print()

—Region 0 —
mean product amount = 92.50000000000001
Model RMSE: 37.57941407849715

—Region 1 —
mean product amount = 68.82500000000002
Model RMSE: 0.8928301129769857

—Region 2 —
mean product amount = 95.00000000000004
Model RMSE: 39.95801966643965



In [None]:
state = np.random.RandomState(12345)

samples_target_3 = []
samples_predictions_3 = []

for region in range(len(df_all)):
    data  = df_all[region]

    features = data.drop('product', axis = 1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split( 
        features, target, test_size = 0.25, random_state = state)
    
    model = RandomForestRegressor()
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)

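    # Store the RandomForest results in their own lists so the LinearRegression results are not extended
    samples_target_3.append(target_valid.reset_index(drop = True))
    samples_predictions_3.append(pd.Series(predictions))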

    mean_product = target.mean()
    model_rmse = mean_squared_error(target_valid, predictions)**0.5

    print("—Region", region, "—")
    print("mean product amount =", mean_product)
    print("Model RMSE:", model_rmse)
    print()

—Region 0 —
mean product amount = 92.50000000000001
Model RMSE: 38.81636571892589

—Region 1 —
mean product amount = 68.82500000000002
Model RMSE: 0.7479881844511275

—Region 2 —
mean product amount = 95.00000000000004
Model RMSE: 39.2961424117959



### Save Prediction and Correct Validation Dataset <a id='save_valid_answer'></a>

In [None]:
target_pred_df0 = pd.DataFrame(target_pred_0, columns = ['product'])
target_pred_df1 = pd.DataFrame(target_pred_1, columns = ['product'])
target_pred_df2 = pd.DataFrame(target_pred_2, columns = ['product'])

### Average Product Volume and RMSE Model <a id='mean_product_volume'></a>

In [None]:
print("Region 0 Product Mean", target_pred_df0.mean())
print("Region 0 Model RMSE:", model_rmse_0)
print()
print("Region 1 Product Mean", target_pred_df1.mean())
print("Region 1 Model RMSE:", model_rmse_1)
print()
print("Region 2 Product Mean", target_pred_df2.mean())
print("Region 2 Model RMSE:", model_rmse_2)

Region 0 Product Mean product    92.592866
dtype: float64
Region 0 Model RMSE: 37.579441607153385

Region 1 Product Mean product    68.728547
dtype: float64
Region 1 Model RMSE: 0.8930993684427244

Region 2 Product Mean product    94.965028
dtype: float64
Region 2 Model RMSE: 40.029619395448385


In [None]:
print("Actual Region 0 Product Mean", target_valid_0.mean())
print()
print("Actual Region 1 Product Mean", target_valid_1.mean())
print()
print("Actual Region 2 Product Mean", target_valid_2.mean())

Actual Region 0 Product Mean 92.07859674082927

Actual Region 1 Product Mean 68.72313602435997

Actual Region 2 Product Mean 94.88423280885438


### Initial Analysis <a id='initial_analysis'></a>

**Insights:**
1. Region 2 has the highest average predicted reserves (94.97), but also the highest RMSE (40.03).
2. Region 0 has an average predicted volume close to that of Region 2 (92.59) with a slightly lower RMSE (37.58), so at this stage Region 0 looks preferable to Region 2.
3. Region 1 has the lowest average predicted reserves (68.73) but by far the lowest RMSE (0.89), so its predictions are the most reliable of the three.
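
One way to put these RMSE values in context is to compare them with a constant baseline that always predicts the training mean; its RMSE is approximately the standard deviation of the target (about 44 to 46 in the `describe()` outputs above). The sketch below runs on synthetic data, so the features, coefficients and noise level are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): three features, only the last one is informative
rng = np.random.default_rng(12345)
n = 2000
X = rng.normal(size=(n, 3))
y = 90 + 25 * X[:, 2] + rng.normal(scale=30, size=n)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# RMSE of the fitted linear model on the validation set
model = LinearRegression().fit(X_train, y_train)
model_rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5

# RMSE of a constant prediction equal to the training mean
baseline_pred = np.full_like(y_valid, y_train.mean())
baseline_rmse = mean_squared_error(y_valid, baseline_pred) ** 0.5

print(f"model RMSE:    {model_rmse:.2f}")
print(f"baseline RMSE: {baseline_rmse:.2f}")
```

Read this way, an RMSE of about 38 to 40 against a target standard deviation of about 44 suggests only a modest gain over the baseline in Regions 0 and 2, whereas the Region 1 model is far more precise.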

## Initial Profit Calculation<a id='initial_profit_calculation'></a>

### Main Variable <a id='key_variable'></a>

In [None]:
# Cost
total_cost = 100000000
total_oil_well = 200
cost_per_well = total_cost / total_oil_well
income = 4500

### Volume of Oil Reserves Sufficient to Develop a New well  <a id='minimum_oil_volume'></a>

In [None]:
min_oil_vol = cost_per_well / income
print("Oil Reserves Sufficient to Develop a New Well", np.ceil(min_oil_vol))

Oil Reserves Sufficient to Develop a New Well 112.0


In [None]:
target_pred_df0.describe()

Unnamed: 0,product
count,25000.0
mean,92.592866
std,23.161769
min,-9.37339
25%,76.660327
50%,92.661787
75%,108.449129
max,180.152598


In [None]:
target_pred_df1.describe()

Unnamed: 0,product
count,25000.0
mean,68.728547
std,46.010204
min,-1.893744
25%,28.53668
50%,57.851592
75%,109.346467
max,139.818939


In [None]:
target_pred_df2.describe()

Unnamed: 0,product
count,25000.0
mean,94.965028
std,19.861384
min,17.103267
25%,81.385125
50%,95.030283
75%,108.496648
max,165.887128


In [None]:
print(target_pred_df0.quantile(0.8))
print(target_pred_df1.quantile(0.832))
print(target_pred_df2.quantile(0.81))

product    112.355211
Name: 0.8, dtype: float64
product    112.146728
Name: 0.832, dtype: float64
product    112.604336
Name: 0.81, dtype: float64


In [None]:
# Profit Calc for Region 0
top_200_product = pd.Series(target_pred_0).sort_values(ascending = False)[:200]
total_product = top_200_product.sum()
total_income = income * total_product
profit = total_income - total_cost
print("Profit", round(profit), "USD")
print()

Profit 39924909 USD



In [None]:
# Profit Calc for Region 1
top_200_product = pd.Series(target_pred_1).sort_values(ascending = False)[:200]
total_product = top_200_product.sum()
total_income = income * total_product
profit = total_income - total_cost
print("Profit", round(profit), "USD")
print()

Profit 24857093 USD



In [None]:
# Profit Calc for Region 2
top_200_product = pd.Series(target_pred_2).sort_values(ascending = False)[:200]
total_product = top_200_product.sum()
total_income = income * total_product
profit = total_income - total_cost
print("Profit", round(profit), "USD")
print()

Profit 33235997 USD



### Subsection Conclusion  <a id='subsection_conclusion'></a>

**Insights:**
1. A well needs at least about 111.1 thousand barrels of reserves (rounded up to 112) for its development cost to be recovered.
2. In each region, roughly 17% to 20% of the 25,000 validation wells have predicted reserves above 112 thousand barrels, while the regional averages (about 69 to 95) are all below this threshold.
3. If we develop the 200 wells with the highest predicted reserves in each region, the highest predicted profit is obtained in **Region 0**, at almost 40 million USD.
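
The break-even volume follows directly from the key values defined earlier. The sketch below repeats that arithmetic and compares it with the average predicted reserves per region; the constant names are illustrative and the averages are copied from the outputs above.

```python
import math

BUDGET = 100_000_000       # total budget for one region, USD
WELLS = 200                # wells developed with that budget
REVENUE_PER_UNIT = 4_500   # USD per unit of product (one unit = thousand barrels)

cost_per_well = BUDGET / WELLS                        # 500,000 USD
break_even_volume = cost_per_well / REVENUE_PER_UNIT  # about 111.11 units
print(f"break-even volume: {break_even_volume:.2f} (rounded up: {math.ceil(break_even_volume)})")

# Average predicted reserves per region, taken from the notebook output
predicted_means = {0: 92.592866, 1: 68.728547, 2: 94.965028}
shortfall = {region: break_even_volume - mean for region, mean in predicted_means.items()}
for region, gap in shortfall.items():
    print(f"Region {region}: average is {gap:.1f} units below break-even")
```

Since every regional average is below the break-even volume, wells have to be chosen selectively rather than at random.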

## Profit Calculation<a id='profit_calculation'></a>

In [None]:
def calculate_profit(prediction, name, income = 4500, total_cost = 100000000, points = 200):
    # Keep the `points` wells with the highest predicted reserves
    predict_top200 = prediction.sort_values(ascending = False, by = 'product')[:points]
    # Sum the 'product' column so the result is a scalar, not a one-element Series
    product = predict_top200['product'].sum()
    # Express all amounts in millions of USD
    total_cost = total_cost / 1000000
    total_income = income * product / 1000000
    profit = total_income - total_cost
    print('-------------------')
    print(f'Profitability Geo Data {name}')
    print(f'Total Income: {total_income:.0f} M USD')
    print(f'Total Cost  : {total_cost:.0f} M USD')
    print(f'Profit      : {profit:.0f} M USD')

In [None]:
calculate_profit(prediction = target_pred_df0, name = 0)
calculate_profit(prediction = target_pred_df1, name = 1)
calculate_profit(prediction = target_pred_df2, name = 2)

-------------------
Profitability Geo Data 0
Total Income: 140 M USD
Total Cost  : 100 M USD
Profit      : 40 M USD
-------------------
Profitability Geo Data 1
Total Income: 125 M USD
Total Cost  : 100 M USD
Profit      : 25 M USD
-------------------
Profitability Geo Data 2
Total Income: 133 M USD
Total Cost  : 100 M USD
Profit      : 33 M USD


**Insights:**
1. Developing the 200 wells with the highest predicted reserves yields a positive predicted profit in all three regions.
2. Region 0 has the highest predicted profit (about 40 million USD).
3. Region 0 also has the largest share of wells with predicted reserves above 112 thousand barrels.
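
Rather than probing quantiles by hand, the share of wells at or above the break-even volume can be computed directly. This is a minimal sketch: `share_above` is a hypothetical helper, and the demo predictions are synthetic values drawn to resemble the Region 0 prediction statistics (mean about 92.6, standard deviation about 23.2).

```python
import numpy as np
import pandas as pd

BREAK_EVEN = 112  # thousand barrels, rounded up as in the notebook


def share_above(predictions, threshold=BREAK_EVEN):
    """Return the count and fraction of predictions at or above the threshold."""
    predictions = pd.Series(predictions)
    above = predictions >= threshold
    return int(above.sum()), float(above.mean())


# Synthetic predictions (assumption), not the notebook's actual model output
rng = np.random.default_rng(0)
demo_predictions = rng.normal(loc=92.6, scale=23.2, size=25_000)

count, share = share_above(demo_predictions)
print(f"{count} wells ({share:.1%}) at or above {BREAK_EVEN}")
```

With the notebook's variables, the call would be `share_above(target_pred_0)`, and likewise for the other two regions.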

## Risk and Return <a id='risk_and_return'></a>

In [None]:
SAMPLE_SIZE = 500
BOOTSTRAP_SIZE = 1000

BUDGET = 100000000
COST_PER_POINT = 500000
POINTS_PER_BUDGET = BUDGET // COST_PER_POINT

PRODUCT_PRICE = 4500
POINTS_PER_BUDGET

200

In [None]:
def calculate_profit_bootstrap(prediction, name, income = 4500, total_cost = 100000000, points = 200):
    # Unused alternative to profit() below; it works on predicted volumes only
    # (`name` is kept only so the signature matches calculate_profit)
    predict_top200 = prediction.sort_values(ascending = False)[:points]
    product = predict_top200.sum()
    total_income = income * product
    profit = total_income - total_cost
    return profit

In [None]:
def profit(target, predictions):
    prediction_sorted = predictions.sort_values(ascending = False)
    selected_points = target[prediction_sorted.index][:POINTS_PER_BUDGET]
    product = selected_points.sum()
    revenue = product * PRODUCT_PRICE
    cost = BUDGET
    return revenue - cost

In [None]:
for region in range(3):

    target = samples_target[region]
    predictions = samples_predictions[region]

    profit_values = []
    
    for i in range(BOOTSTRAP_SIZE):
        target_sample = target.sample(SAMPLE_SIZE, replace = True, random_state = state)
        predictions_sample = predictions[target_sample.index]
        #profit_values.append(calculate_profit_bootstrap(prediction = predictions_sample, name = region))
        profit_values.append(profit(target_sample, predictions_sample))

    profit_values = pd.Series(profit_values)

    mean_profit = profit_values.mean()
    confidence_interval = (profit_values.quantile(0.025), profit_values.quantile(0.975))
    negative_profit_chance = (profit_values < 0).mean()
    
    print("—Region", region, "—")
    print("Mean profit =", round(mean_profit), "USD")
    print("95% confidence interval:", confidence_interval)
    print("Risk of losses =", negative_profit_chance * 100, "%")
    print()


—Region 0 —
Mean profit = 4238972 USD
95% confidence interval: (-761878.1389036368, 9578465.319517836)
Risk of losses = 4.8 %

—Region 1 —
Mean profit = 5132567 USD
95% confidence interval: (1080668.9523396173, 9285744.392324952)
Risk of losses = 0.6 %

—Region 2 —
Mean profit = 3811204 USD
95% confidence interval: (-1428006.300878686, 8933805.657503996)
Risk of losses = 7.3999999999999995 %
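
One caveat about the loop above: because `sample(..., replace=True)` can draw the same well more than once, selecting `target[prediction_sorted.index]` by label can return extra rows for repeated labels. The sketch below shows an alternative that matches rows by position instead. It is an assumption-laden illustration, not the method used to produce the figures above: `profit_positional` and `bootstrap_profit` are hypothetical names and the data is synthetic.

```python
import numpy as np
import pandas as pd

BUDGET = 100_000_000        # total development budget, USD
PRODUCT_PRICE = 4_500       # revenue per unit of product, USD
POINTS_PER_BUDGET = 200     # wells developed per region
SAMPLE_SIZE = 500           # wells explored in each bootstrap draw
BOOTSTRAP_SIZE = 1000       # number of bootstrap draws


def profit_positional(target, predictions, points=POINTS_PER_BUDGET):
    """Profit from the `points` wells with the highest predictions.

    Rows are matched by position, not by index label, so a well drawn
    twice by the bootstrap is simply counted as two separate rows.
    """
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    top = np.argsort(-predictions)[:points]
    return target[top].sum() * PRODUCT_PRICE - BUDGET


def bootstrap_profit(target, predictions, rng,
                     n_iter=BOOTSTRAP_SIZE, sample_size=SAMPLE_SIZE):
    """Bootstrap the profit distribution and summarize it."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    values = []
    for _ in range(n_iter):
        rows = rng.integers(0, len(target), size=sample_size)  # with replacement
        values.append(profit_positional(target[rows], predictions[rows]))
    values = pd.Series(values)
    return {
        "mean_profit": values.mean(),
        "ci_95": (values.quantile(0.025), values.quantile(0.975)),
        "risk_of_loss_pct": (values < 0).mean() * 100,
    }


# Synthetic stand-in for one region's validation targets and predictions
rng = np.random.default_rng(12345)
demo_target = np.clip(rng.normal(92, 44, size=25_000), 0, None)
demo_predictions = demo_target + rng.normal(0, 38, size=25_000)

summary = bootstrap_profit(demo_target, demo_predictions, rng)
print(summary)
```

With 500 wells drawn from 25,000, repeated draws are rare, so the effect on the reported figures is likely small; the positional version mainly makes the selection of exactly 200 rows explicit.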



## Summary <a id='summary'></a>

This project produced a model that predicts the volume of oil reserves in a well, to support an investment decision that is expected to generate profit. Based on the model's predictions, Region 2 has the highest average reserves (about 95 thousand barrels per well). To break even, a well must hold roughly 112 thousand barrels of reserves.

After bootstrapping, Region 2 turned out to carry the highest risk of loss (7.4%) and the lowest average profit (about 3.8 million USD) of the three regions. I therefore recommend investing in Region 1, which has the lowest risk of loss (0.6%) and the highest average profit (about 5.1 million USD).