# Optimizing Oil Well Development: Predicting Reservoir Volumes and Maximizing Profitability

## Table of Contents
* [Introduction](#introduction)
* [Import Library and Data](#import-library-and-data)
    * [Import Library](#import-library)
    * [Import Data](#import-data)
    * [Data Information](#data-information)
* [Model Training and Testing](#model-training-and-testing)
* [Preparation Profit Calculation](#preparation-profit-calculation)
* [Calculate Profit and Volume](#calculate-profit-and-volume)
    * [Total Top 200 Wells Volume Calculation](#total-top-200-wells-volume-calculation)
    * [Total Top 200 Wells Profit Calculation](#total-top-200-wells-profit-calculation)
* [Region Risk and Profit Analysis](#region-risk-and-profit-analysis)
* [Conclusion](#conclusion)

## Introduction

OilyGiant Mining Company needs to choose the best location for 200 new oil wells. This project uses files with parameters gathered from three candidate regions, covering oil quality and reserve volumes. I first train a predictive model to estimate the reserve volume of each well, then select the wells with the highest predicted volumes in each region and compute the total profit they would bring. Finally, I use bootstrapping to estimate the distribution of profit and the risk of loss, and recommend the region with the highest total profit.

## Import Library and Data

### Import Library

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.utils import resample

### Import Data

In [2]:
try:
    # Try loading the file from your laptop path
    geo_data_0 = pd.read_csv('C:/Users/Eugene/Documents/GitHub/TripleTen-Projects/9. OilyGiant Oil Well Development/geo_data_0.csv')
except FileNotFoundError:
    # If the file is not found, try loading from the PC path
    geo_data_0 = pd.read_csv('C:/Users/user/OneDrive/Documents/GitHub/TripleTen-Projects/9. OilyGiant Oil Well Development/geo_data_0.csv')

In [3]:
try:
    # Try loading the file from your laptop path
    geo_data_1 = pd.read_csv('C:/Users/Eugene/Documents/GitHub/TripleTen-Projects/9. OilyGiant Oil Well Development/geo_data_1.csv')
except FileNotFoundError:
    # If the file is not found, try loading from the PC path
    geo_data_1 = pd.read_csv('C:/Users/user/OneDrive/Documents/GitHub/TripleTen-Projects/9. OilyGiant Oil Well Development/geo_data_1.csv')

In [4]:
try:
    # Try loading the file from your laptop path
    geo_data_2 = pd.read_csv('C:/Users/Eugene/Documents/GitHub/TripleTen-Projects/9. OilyGiant Oil Well Development/geo_data_2.csv')
except FileNotFoundError:
    # If the file is not found, try loading from the PC path
    geo_data_2 = pd.read_csv('C:/Users/user/OneDrive/Documents/GitHub/TripleTen-Projects/9. OilyGiant Oil Well Development/geo_data_2.csv')
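
The three loading cells repeat the same try/except fallback. As a minimal sketch, the fallback could be expressed once with a small helper. The name `first_existing_path` and the `data_dirs` list in the usage comment are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def first_existing_path(filename, directories):
    """Return the first directory/filename that exists on disk.

    Hypothetical helper: `directories` is any ordered list of candidate
    folders (for example the laptop path, then the PC path).
    """
    for directory in directories:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{filename} not found in any of: {list(directories)}")


# Usage sketch (assumes pandas is imported as pd and data_dirs lists the folders):
# geo_data_0 = pd.read_csv(first_existing_path("geo_data_0.csv", data_dirs))
```

The helper only resolves the path, so the existing `pd.read_csv` calls stay unchanged apart from their argument.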

### Data Information

In [5]:
geo_data_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [6]:
geo_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [7]:
geo_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


All three datasets share the same structure: 100,000 rows and five columns (`id`, `f0`, `f1`, `f2`, `product`), with no missing values.

## Model Training and Testing

In [6]:
def train_and_predict(data):
    x = data[['f0', 'f1', 'f2']]
    y = data['product']

    # Split into train and validation set
    x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.25, random_state=42)

    # model train
    model = LinearRegression()
    model.fit(x_train, y_train)

    predictions = model.predict(x_valid)
    
    results = pd.DataFrame({'Prediction': predictions, 'True': y_valid})

    # Display results
    avg_prediction = results['Prediction'].mean()
    rmse = np.sqrt(mean_squared_error(results['True'], results['Prediction']))
    print(f'Average Prediction: {avg_prediction}')
    print(f'RMSE: {rmse}')

    return model, results

Train and predict for each geo_data file

In [7]:
model_0, results_0 = train_and_predict(geo_data_0)
model_1, results_1 = train_and_predict(geo_data_1)
model_2, results_2 = train_and_predict(geo_data_2)

Average Prediction: 92.3987999065777
RMSE: 37.756600350261685
Average Prediction: 68.71287803913762
RMSE: 0.890280100102884
Average Prediction: 94.77102387765939
RMSE: 40.14587231134218


The average prediction is the mean predicted reserve volume per well in each region's validation set: about 92.4 units in Region 0 (geo_data_0), 68.7 in Region 1 (geo_data_1), and 94.8 in Region 2 (geo_data_2), where one unit is 1,000 barrels. The root mean squared error (RMSE) measures the typical gap between predicted and actual volumes, so lower values indicate better accuracy. Region 1 has a notably low RMSE of 0.89, meaning its predictions align closely with the true values. Regions 0 and 2 have much higher RMSEs of 37.76 and 40.15, so predictions for individual wells there deviate far more from the actual reserves.
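
An RMSE is easier to judge next to a simple baseline. The sketch below computes RMSE by hand and compares it with a baseline that always predicts the mean of the true values. All numbers are made up for illustration and do not come from the datasets.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((true - pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values in thousand barrels, for illustration only
y_true = [100.0, 60.0, 140.0, 80.0]
y_pred = [90.0, 70.0, 120.0, 100.0]

model_rmse = rmse(y_true, y_pred)

# Baseline: always predict the mean of the true values
mean_true = sum(y_true) / len(y_true)
baseline_rmse = rmse(y_true, [mean_true] * len(y_true))

print(f"Model RMSE:    {model_rmse:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
```

A model whose RMSE is close to the baseline's adds little beyond predicting the regional average for every well.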

## Preparation Profit Calculation

Set the investment and revenue constants based on these conditions:
- Budget (`investment`): \$100,000,000 (`100e6`)
- Revenue per barrel: \$4.50
- Revenue per unit of product (`revenue_per_unit`): \$4,500 (`4.5e3`)
- 1 unit of product: 1,000 barrels (`1e3`)

Formula for the revenue from one million barrels (1,000 units):

    `revenue_per_million_barrels = revenue_per_unit * 1e3`

In [8]:
investment = 100e6

# $4.5 per barrel, so $4,500 per unit (1,000 barrels) and $4.5 million per million barrels
revenue_per_million_barrels = 4.5e3 * 1e3

Divide the investment by this revenue and assign the result to `break_even_point`, the total volume the selected wells must produce for revenue to cover the investment

In [9]:
break_even_point = investment / revenue_per_million_barrels
print(f'Break-even point: {break_even_point} million barrels')

Break-even point: 22.22222222222222 million barrels
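
The break-even can also be expressed in the units of the `product` column (thousand barrels), both in total and per well. This is a minimal sketch that assumes the \$100 million budget covers all 200 wells, as described in the introduction. The variable names are illustrative.

```python
investment = 100e6        # total budget in USD (from the conditions above)
revenue_per_unit = 4.5e3  # USD per unit of product (1 unit = 1,000 barrels)
n_wells = 200             # number of wells to develop

# Total volume needed to recover the budget, in units of product
break_even_units_total = investment / revenue_per_unit

# Average volume each of the 200 wells must deliver
break_even_units_per_well = break_even_units_total / n_wells

print(f"Total break-even volume: {break_even_units_total:,.2f} units")
print(f"Per-well break-even volume: {break_even_units_per_well:,.2f} units")
```

The per-well figure is the one to compare with the average predicted volume per well in each region.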


## Calculate Profit and Volume

### Total Top 200 Wells Volume Calculation

Select the top 200 wells based on predictions, then calculate total predicted volume.

In [10]:
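def calculate_volume(predictions, region_data):
    # region_data is not used; it is kept so the existing calls stay valid

    # Pick the 200 wells with the highest predicted volume
    top_wells = predictions.nlargest(200, 'Prediction')

    total_volume = top_wells['Prediction'].sum()

    # Predictions are in thousand barrels, while break_even_point (22.22)
    # is in million barrels, so convert before comparing
    if total_volume >= break_even_point * 1e3:
        return total_volume
    else:
        return 0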

Calculate volume for each region

In [11]:
volume_0 = calculate_volume(results_0, geo_data_0)
volume_1 = calculate_volume(results_1, geo_data_1)
volume_2 = calculate_volume(results_2, geo_data_2)

In [12]:
print(f'Total Volume Region 0: {volume_0}')
print(f'Total Volume Region 1: {volume_1}')
print(f'Total Volume Region 2: {volume_2}')

Total Volume Region 0: 30881.463288146995
Total Volume Region 1: 27748.75136666462
Total Volume Region 2: 29728.847808255443


The total volume figures represent the cumulative predicted oil extraction from the top 200 wells in each respective region. For Region 0, the total predicted volume is 30,881.46 thousand barrels, for Region 1 it is 27,748.75 thousand barrels, and for Region 2 it is 29,728.85 thousand barrels. These values are derived from the sum of the predicted volumes of individual wells within the top 200 that exhibit the highest production estimates based on the model's predictions. The calculation of total volume is a crucial metric for assessing the overall oil extraction potential in each region, helping to identify which areas may yield the highest production output.
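
The totals above sum predicted volumes. A possible check is to sum the actual volumes of the same wells, which shows how far the selection's real output may differ from the forecast. The sketch below uses a hypothetical helper, `top_k_volume`, and made-up numbers.

```python
def top_k_volume(predicted, actual, k):
    """Pick the k wells with the highest predictions.

    Returns (sum of predicted volumes, sum of actual volumes) for those wells.
    """
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    chosen = ranked[:k]
    return sum(p for p, _ in chosen), sum(a for _, a in chosen)


# Made-up volumes in thousand barrels, for illustration only
predicted = [150.0, 90.0, 120.0, 60.0, 135.0]
actual = [140.0, 95.0, 100.0, 70.0, 150.0]

pred_total, actual_total = top_k_volume(predicted, actual, k=3)
print(f"Predicted total: {pred_total}, actual total: {actual_total}")
```

With the notebook's `results` frames, the same comparison would use the `Prediction` and `True` columns.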

### Total Top 200 Wells Profit Calculation

Select the top 200 wells based on predictions, then calculate profit

Formula for revenue:
`revenue = total volume * revenue per unit`

Formula for profit:
`profit = revenue - investment`

In [13]:
def calculate_profit(predictions, region_data, investment=100e6, revenue_per_unit=4.5e3):
    top_wells = predictions.nlargest(200, 'Prediction')
    total_volume = top_wells['Prediction'].sum()

    # revenue formula
    revenue = total_volume * revenue_per_unit

    profit = revenue - investment

    return profit

Calculate profit for each region

In [14]:
profit_0 = calculate_profit(results_0, geo_data_0)
profit_1 = calculate_profit(results_1, geo_data_1)
profit_2 = calculate_profit(results_2, geo_data_2)

In [15]:
print(f'Total Profit Region 0: {profit_0}')
print(f'Total Profit Region 1: {profit_1}')
print(f'Total Profit Region 2: {profit_2}')

Total Profit Region 0: 38966584.79666147
Total Profit Region 1: 24869381.149990782
Total Profit Region 2: 33779815.1371495


The total profit figures represent the financial gain or loss associated with the predicted oil extraction from the top 200 wells in each region. Profit is determined by subtracting the initial investment of \$100 million from the revenue generated by the predicted volume. For Region 0, the total profit is \$38,966,584.80, for Region 1 it is \$24,869,381.15, and for Region 2 it is \$33,779,815.14. These values provide a comprehensive view of the financial viability of each region, aiding in the decision-making process to identify the region with the highest profitability potential for further development.
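
As a worked check of the profit formula, the sketch below plugs the total predicted volume reported for Region 0 into `profit = total volume * revenue per unit - investment`. The helper name `profit_from_volume` is illustrative.

```python
def profit_from_volume(total_volume, revenue_per_unit=4.5e3, investment=100e6):
    """profit = total volume * revenue per unit - investment (all in USD)."""
    return total_volume * revenue_per_unit - investment


# Total predicted volume reported for Region 0 (thousand barrels)
volume_region_0 = 30881.463288146995

profit_region_0 = profit_from_volume(volume_region_0)
print(f"Region 0 profit: ${profit_region_0:,.2f}")
```

The result agrees with the Region 0 profit printed by the notebook.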

## Region Risk and Profit Analysis

Draw 1,000 bootstrap samples from each region's validation results and calculate the profit for each sample.

In [16]:
def bootstrap_risk(predictions, region_data, n_samples=1000):
    profits = []

    for _ in range(n_samples):
        # bootstrap sample
        bootstrap_sample = resample(predictions, replace=True)
        
        # calculate profit with bootstrap sample
        bootstrap_profit = calculate_profit(bootstrap_sample, region_data, investment=100e6, revenue_per_unit=4.5e3)
        profits.append(bootstrap_profit)

    return profits
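
For comparison, a common variant of this bootstrap draws a smaller set of candidate wells in each iteration, ranks them by predicted volume, and values the selected wells by their actual volume. The sketch below illustrates that variant with the standard library only. The sample size of 500, the seed, the function name, and the synthetic data are all assumptions and are not taken from the notebook.

```python
import random


def bootstrap_profits(predicted, actual, n_samples=1000, sample_size=500,
                      n_wells=200, revenue_per_unit=4.5e3, investment=100e6,
                      seed=42):
    """Bootstrap the profit of the top `n_wells` wells out of `sample_size`.

    Assumed procedure (not the notebook's): sample wells with replacement,
    rank them by predicted volume, then value them by actual volume.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    indices = range(len(predicted))
    profits = []
    for _ in range(n_samples):
        sample = rng.choices(indices, k=sample_size)  # with replacement
        best = sorted(sample, key=lambda i: predicted[i], reverse=True)[:n_wells]
        total_volume = sum(actual[i] for i in best)
        profits.append(total_volume * revenue_per_unit - investment)
    return profits


# Synthetic data for illustration only (thousand barrels)
data_rng = random.Random(0)
actual = [data_rng.uniform(0, 180) for _ in range(5000)]
predicted = [a + data_rng.gauss(0, 30) for a in actual]

profits = bootstrap_profits(predicted, actual, n_samples=200)
print(f"Simulated {len(profits)} bootstrap profits")
```

Seeding the generator keeps the reported statistics stable from run to run, and valuing wells by actual volume lets prediction error show up in the spread of profits.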

Perform bootstrapping for risk analysis for each region

In [17]:
bootstrap_profits_0 = bootstrap_risk(results_0, geo_data_0)
bootstrap_profits_1 = bootstrap_risk(results_1, geo_data_1)
bootstrap_profits_2 = bootstrap_risk(results_2, geo_data_2)

Calculate the average profit, the 95% confidence interval, and the risk of loss from the bootstrap results. The target is a region with a risk of loss below 2.5% and the highest average profit.

In [18]:
# Risk of loss is the share of bootstrap samples with a negative profit
average_profit_0 = np.mean(bootstrap_profits_0)
confidence_interval_0 = np.percentile(bootstrap_profits_0, [2.5, 97.5])
risk_of_loss_0 = np.mean(np.array(bootstrap_profits_0) < 0)

average_profit_1 = np.mean(bootstrap_profits_1)
confidence_interval_1 = np.percentile(bootstrap_profits_1, [2.5, 97.5])
risk_of_loss_1 = np.mean(np.array(bootstrap_profits_1) < 0)

average_profit_2 = np.mean(bootstrap_profits_2)
confidence_interval_2 = np.percentile(bootstrap_profits_2, [2.5, 97.5])
risk_of_loss_2 = np.mean(np.array(bootstrap_profits_2) < 0)

In [19]:
print("\nRegion 0:")
print(f'Average Profit: ${average_profit_0}')
print(f'95% Confidence Interval: ${confidence_interval_0[0]} - ${confidence_interval_0[1]}')
print(f'Risk of Loss: {risk_of_loss_0 * 100}%')

print("\nRegion 1:")
print(f'Average Profit: ${average_profit_1}')
print(f'95% Confidence Interval: ${confidence_interval_1[0]} - ${confidence_interval_1[1]}')
print(f'Risk of Loss: {risk_of_loss_1 * 100}%')

print("\nRegion 2:")
print(f'Average Profit: ${average_profit_2}')
print(f'95% Confidence Interval: ${confidence_interval_2[0]} - ${confidence_interval_2[1]}')
print(f'Risk of Loss: {risk_of_loss_2 * 100}%')


Region 0:
Average Profit: $38970502.66486516
95% Confidence Interval: $37718524.90858246 - $40249120.99535558
Risk of Loss: 0.0%

Region 1:
Average Profit: $24868509.607561164
95% Confidence Interval: $24820338.849507928 - $24920654.44201528
Risk of Loss: 0.0%

Region 2:
Average Profit: $33750795.14873277
95% Confidence Interval: $32735957.74535761 - $34783600.813512355
Risk of Loss: 0.0%


For Region 0, the average profit is \$38,970,502.66, with a 95% confidence interval between \$37,718,524.91 and \$40,249,121.00, and the risk of loss is 0%. For Region 1, the average profit is \$24,868,509.61, with a confidence interval from \$24,820,338.85 to \$24,920,654.44, and the risk of loss is 0%. For Region 2, the average profit is \$33,750,795.15, with a confidence interval from \$32,735,957.75 to \$34,783,600.81, and the risk of loss is 0%. Because the resampling is not seeded, these figures vary slightly from run to run.

All three regions therefore show positive profit with no apparent risk of financial loss in the bootstrap analysis. Region 0 has the highest average profit at \$38,970,502.66, followed by Region 2 with \$33,750,795.15 and Region 1 with \$24,868,509.61. Based on the profitability criterion, Region 0 is recommended for oil well development.
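
The three statistics reported for each region can be computed from any list of bootstrap profits. The sketch below shows one way to do this with the standard library, where the risk of loss is the share of samples with a profit below zero. The function name and the example profits are made up for illustration.

```python
import statistics


def summarize_profits(profits):
    """Return (mean, 95% interval lower bound, upper bound, risk of loss)."""
    mean_profit = statistics.fmean(profits)
    # n=40 yields cut points every 2.5%; the first and last are the
    # 2.5th and 97.5th percentiles. 'inclusive' interpolates linearly.
    cuts = statistics.quantiles(profits, n=40, method="inclusive")
    lower, upper = cuts[0], cuts[-1]
    # Risk of loss: share of bootstrap profits below zero
    risk_of_loss = sum(p < 0 for p in profits) / len(profits)
    return mean_profit, lower, upper, risk_of_loss


# Made-up profits in USD, for illustration only
example_profits = [float(x) for x in range(-5, 995)]
mean_profit, lower, upper, risk = summarize_profits(example_profits)
print(f"Mean: {mean_profit:.2f}, 95% interval: {lower:.3f} to {upper:.3f}, "
      f"risk of loss: {risk:.1%}")
```

A region would pass the risk criterion when the returned risk of loss is below 0.025.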

## Conclusion

In summary, the analysis of the oil exploration project yields several key insights. First, the datasets for all three regions (geo_data_0, geo_data_1, and geo_data_2) are structurally identical, with no missing data and consistent column types. The predictive modeling results show varying accuracy across regions: Region 1 has an exceptionally low Root Mean Squared Error (RMSE) of 0.89, while Regions 0 and 2 have RMSEs of 37.76 and 40.15. The break-even point is 22.22 million barrels in total, which is about 22,222 units of product across the 200 wells, and the top 200 wells in every region exceed it.

The total predicted volumes from the top 200 wells in each region estimate the potential oil extraction. Region 0 leads with 30,881.46 thousand barrels, followed by Region 2 with 29,728.85 thousand barrels and Region 1 with 27,748.75 thousand barrels. The total profit, calculated by subtracting the initial investment from the revenue generated by these volumes, is highest in Region 0 at \$38,966,584.80, followed by Region 2 with \$33,779,815.14 and Region 1 with \$24,869,381.15.

Finally, the bootstrap analysis shows positive profit with a 0% estimated risk of loss in all three regions, so each one stays below the 2.5% risk threshold. The average profits and 95% confidence intervals for each region support the conclusion that all regions are financially viable. Among them, Region 0 has the highest average profit and is therefore recommended for oil well development.