# Oil Well Profit Maximization

-----

## Overview

### Description

<div style="color: #196CC4;">
This project evaluates geological data from three regions, using linear regression models to predict the volume of reserves in potential wells. Among the regions with a loss risk below 2.5%, the one with the highest average profit is chosen for opening new wells.
</div>

### Objective

<div style="color: #196CC4;">
Use linear regression models to identify the most profitable region, with a loss risk below 2.5%, in which to open 200 new oil wells within a budget of 100 million dollars. To do so, geological data from three regions is evaluated, the best points are selected according to the predicted volume of reserves, and the risk is analyzed using bootstrapping.
</div>

### Resources

<div style="color: #196CC4;">
<b>The geological exploration data for the three regions is stored in files:</b><br>
▶ Dataset geo_data_0.csv<br>
▶ Dataset geo_data_1.csv<br>
▶ Dataset geo_data_2.csv<br>
▶ id — unique identifier of the oil well<br>
▶ f0, f1, f2 — three features of the points (their specific meaning is not important, but the features themselves are significant)<br>
▶ product — volume of reserves in the oil well (thousands of barrels).
</div>

### Additional Conditions

<div style="color: #196CC4;">
▶ Only linear regression should be used for model training.<br>
▶ When exploring the region, a study of 500 points is conducted with the selection of the best 200 points for profit calculation.<br>
▶ The budget for developing 200 oil wells is 100 million dollars.<br>
▶ A barrel of raw materials generates 4.5 USD in revenue. The revenue of one unit of product is 4500 dollars (the volume of reserves is expressed in thousands of barrels).<br>
▶ After the risk assessment, only regions with a loss risk below 2.5% should be kept. From those that meet the criteria, the region with the highest average profit should be selected.<br>
▶ The data is synthetic: contract details and well characteristics are not published.
</div>

### Methodology

<div style="color: #196CC4;">
<b>General Procedure for Addressing this Project:</b><br>

▶ Understand the structure and content of the Dataset<br>
▶ Segmenting the data will be necessary to train and validate the model.<br>
▶ Before calculating potential profits, determine the necessary values to avoid losses, such as the minimum reserve volume.<br>
▶ Select and model the predictions of the top 200 wells in each region, thus estimating potential profits.<br>
▶ Use bootstrapping to estimate profit distributions and calculate risk and profit metrics for each region.<br>
▶ Analyze results and generate robust conclusions about the best region for oil well development, supported by detailed data.
</div>

<div style="color: #196CC4;"><br>
<b>Detailed Procedure:</b>
<ol>
<li>Data Initialization and Exploratory Analysis
<ul>
<li>Import the libraries, modules, and the 3 datasets: geo_data_0.csv, geo_data_1.csv, and geo_data_2.csv.</li>
<li>Perform exploratory data analysis on each dataset, identifying correlations between numerical and categorical variables, missing values, duplicates, and syntax anomalies.</li>
<li>Calculate descriptive statistics for each dataset.</li>
<li>Clean the data by removing the "id" column, which is not necessary for the analysis.</li>
</ul>
</li>

<li>Model Training
<ul>
<li>Divide the data into training and validation sets in a 3:1 ratio.</li>
<li>Use linear regression to predict the value of the target variable (product, the volume of reserves).</li>
<li>Evaluate model performance using the root mean squared error (RMSE).</li>
<li>Create a function to automate the process of feature extraction, model training, and evaluation.</li>
<li>Apply the function to each of the oil well datasets.</li>
<li>Present the results and identify the most relevant findings, such as average reserve volume, RMSE, and the region with the best performance.</li>
</ul>
</li>

<li>Data Preparation for Profitability Analysis
<ul>
<li>Calculate the average reserve volume for each region in the validation set.</li>
<li>Compare the average volume to the minimum required volume to determine the economic viability of each region.</li>
</ul>
</li>

<li>Potential Profit Calculation
<ul>
<li>Calculate the potential profits that could be obtained by developing oil wells in each region.</li>
<li>Define a function to calculate profits, considering the model's predictions and the price per barrel.</li>
<li>Sort the predictions in descending order and select the top 200 wells.</li>
<li>Calculate total revenue by multiplying the total reserve volume by the price per barrel.</li>
<li>Calculate net profits by subtracting the initial investment.</li>
</ul>
</li>

<li>Risk and Return Analysis
<ul>
<li>Use the bootstrapping method to estimate the profit distribution for each region.</li>
<li>Calculate the average profit, 95% confidence interval, and risk of loss for each distribution.</li>
</ul>
</li>

<li>Conclusions
<ul>
<li>Present a summary of the key results obtained for each region.</li>
<li>Identify the most promising region for the development of new oil wells, considering risk and return factors.</li>
</ul>
</li>
</ol>

</div>

-----

## General Information

### Initialization

<div style="color: #196CC4;">
▶ Import of libraries and data loading
</div>


In [1]:
# Data treatment
import pandas as pd
import numpy as np

# Models
from sklearn.linear_model import LinearRegression

# Split any dataset into two parts: training and test
from sklearn.model_selection import train_test_split

# mean squared error between predictions and true values
from sklearn.metrics import mean_squared_error

# confidence_interval
from scipy import stats as st

In [2]:
# Import data
geo_0 = pd.read_csv('datasets/geo_data_0.csv')
geo_1 = pd.read_csv('datasets/geo_data_1.csv')
geo_2 = pd.read_csv('datasets/geo_data_2.csv')

### Datasets general overview


<div style="color: #196CC4;">
▶ Dataframes general properties
</div>

In [3]:
# General Dataframe 1 properties
geo_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [4]:
# General Dataframe 2 properties
geo_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [5]:
# General Dataframe 3 properties
geo_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


<div style="color: #196CC4;">
▶ Dataframe general overview
</div>

In [6]:
# General data overview
display(geo_0.head(3))
display(geo_1.head(3))
display(geo_2.head(3))

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.22117,105.280062
1,2acmU,1.334711,-0.340164,4.36508,73.03775
2,409Wp,1.022732,0.15199,1.419926,85.265647


Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.00116,134.766305


Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.87191


<div style="color: #196CC4;">
▶ Check for duplicate values.
</div>

In [7]:
# Duplicated values
geo_0_duplicates = geo_0.duplicated()
geo_1_duplicates = geo_1.duplicated()
geo_2_duplicates = geo_2.duplicated()

# Sum of duplicated values
total_geo_0_duplicates = geo_0_duplicates.sum()
total_geo_1_duplicates = geo_1_duplicates.sum()
total_geo_2_duplicates = geo_2_duplicates.sum()

# Duplicated rows
geo_0_duplicates_rows = geo_0[geo_0_duplicates]
geo_1_duplicates_rows = geo_1[geo_1_duplicates]
geo_2_duplicates_rows = geo_2[geo_2_duplicates]

# Display data
print("Total of duplicate values in geo_0:", total_geo_0_duplicates)
print("Total of duplicate values in geo_1:", total_geo_1_duplicates)
print("Total of duplicate values in geo_2:", total_geo_2_duplicates)

Total of duplicate values in geo_0: 0
Total of duplicate values in geo_1: 0
Total of duplicate values in geo_2: 0


<div style="color: #196CC4;">
▶ Descriptive statistics for numerical data.
</div>

In [8]:
# Descriptive statistics
geo_0.describe()

Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.500419,0.250143,2.502647,92.5
std,0.871832,0.504433,3.248248,44.288691
min,-1.408605,-0.848218,-12.088328,0.0
25%,-0.07258,-0.200881,0.287748,56.497507
50%,0.50236,0.250252,2.515969,91.849972
75%,1.073581,0.700646,4.715088,128.564089
max,2.362331,1.343769,16.00379,185.364347


In [9]:
# Descriptive statistics
geo_1.describe()

Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,1.141296,-4.796579,2.494541,68.825
std,8.965932,5.119872,1.703572,45.944423
min,-31.609576,-26.358598,-0.018144,0.0
25%,-6.298551,-8.267985,1.000021,26.953261
50%,1.153055,-4.813172,2.011479,57.085625
75%,8.621015,-1.332816,3.999904,107.813044
max,29.421755,18.734063,5.019721,137.945408


In [10]:
# Descriptive statistics
geo_2.describe()

Unnamed: 0,f0,f1,f2,product
count,100000.0,100000.0,100000.0,100000.0
mean,0.002023,-0.002081,2.495128,95.0
std,1.732045,1.730417,3.473445,44.749921
min,-8.760004,-7.08402,-11.970335,0.0
25%,-1.162288,-1.17482,0.130359,59.450441
50%,0.009424,-0.009482,2.484236,94.925613
75%,1.158535,1.163678,4.858794,130.595027
max,7.238262,7.844801,16.739402,190.029838


### Preliminary Observations

<div style="color: #196CC4;">
<b>Initial Observations:</b><br>
▶ Column names are lowercase, which simplifies working with them<br>
▶ There are no missing values in any of the datasets<br>
▶ There are no duplicate rows<br>
▶ Data types are appropriate: the "id" column contains strings and the remaining columns contain floats<br>
▶ The "id" column is irrelevant to the purpose of this project<br>
▶ The DataFrames appear clean and ready for analysis
</div>
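
The missing-value and duplicate checks listed above can also be run explicitly. The following is a minimal sketch on a small hypothetical frame (`toy`) that only mirrors the column names of the real datasets; it is not the project data. It also illustrates that `duplicated()` without arguments flags fully identical rows only, so a repeated `id` with different feature values would go unnoticed unless that column is checked on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring the dataset schema (not the real data)
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3", "c3"],
    "f0": [0.5, 1.2, np.nan, -0.3],
    "f1": [0.1, -0.4, 0.7, 0.2],
    "f2": [2.0, 3.5, 1.1, 4.4],
    "product": [90.0, 75.5, 60.2, 110.0],
})

# Missing values per column
missing_per_column = toy.isna().sum()

# Fully duplicated rows versus repeated ids
duplicate_rows = int(toy.duplicated().sum())
repeated_ids = int(toy["id"].duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Repeated ids:", repeated_ids)
```

Whether ids repeat in the real files is not checked in this notebook; the sketch only shows how such a check could look.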

-----

## Exploratory Data Analysis (EDA)

### Data Cleaning

<div style="color: #196CC4;">
▶ It is suggested to drop the "id" column, as it is not valuable information and is an "object" type series
</div>

In [11]:
geo_0 = geo_0.drop('id', axis=1)
geo_1 = geo_1.drop('id', axis=1)
geo_2 = geo_2.drop('id', axis=1)

### Data Display

<div style="color: #196CC4;">
▶ Next, I will verify the changes made to the properties of the 3 DataFrames and their preview.<br>
</div>

In [12]:
# General data overview
display(geo_0.head(3))
display(geo_1.head(3))
display(geo_2.head(3))

Unnamed: 0,f0,f1,f2,product
0,0.705745,-0.497823,1.22117,105.280062
1,1.334711,-0.340164,4.36508,73.03775
2,1.022732,0.15199,1.419926,85.265647


Unnamed: 0,f0,f1,f2,product
0,-15.001348,-8.276,-0.005876,3.179103
1,14.272088,-3.475083,0.999183,26.953261
2,6.263187,-5.948386,5.00116,134.766305


Unnamed: 0,f0,f1,f2,product
0,-1.146987,0.963328,-0.828965,27.758673
1,0.262778,0.269839,-2.530187,56.069697
2,0.194587,0.289035,-5.586433,62.87191


-----

## Training

### Procedure

<div style="color: #196CC4;">
<b>The source data is divided as follows to reach a 3:1 ratio:</b><br>
▶ 75% Dataset for training<br>
▶ 25% Dataset for validation<br>
▶ The larger the training set, the more data the model will have to learn patterns and relationships in the data. On the other hand, the validation dataset will be used to evaluate the model's performance.<br>


<b>Linear Regression:</b><br>
▶ Linear regression is used to predict or explain the average value of the dependent variable as a function of the independent variables.<br>
▶ For this model, "y" is the target, the dependent variable we want to predict; in this case, the volume of reserves (product). "x" denotes the features (f0, f1, f2), the independent variables used to predict it.<br>

<b>RMSE:</b><br>
▶ RMSE stands for "Root Mean Square Error". It is a commonly used measure to evaluate the accuracy of a regression model. <br>
▶ An RMSE of 0 would indicate that the model perfectly predicts the observed values, which is rare in practice.<br>
▶ RMSE is non-negative, has no fixed upper bound, and is expressed in the same units as the target. For example, if the observed values lie between 0 and 100, an RMSE of 5 means that predictions deviate from the observed values by about 5 units, with large errors weighted more heavily.<br>
</div>
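
The split, fit, and RMSE steps described above can be tried on synthetic data first. This is a minimal sketch, assuming a made-up linear relationship with Gaussian noise (the coefficients, noise level, and sample size are illustrative assumptions, not properties of the project data). It computes the RMSE both by hand and through scikit-learn to show that the two agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): y depends linearly on 3 features plus noise
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = 50 + X @ np.array([3.0, -2.0, 10.0]) + rng.normal(scale=5.0, size=1000)

# 75% training / 25% validation, i.e. a 3:1 ratio
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_val)

# RMSE computed by hand and via scikit-learn
rmse_manual = np.sqrt(np.mean((y_val - pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_val, pred))

print(len(X_train), len(X_val))
print(rmse_manual, rmse_sklearn)
```

Because the noise was generated with a standard deviation of 5, the RMSE should come out close to that value, in the same units as `y`.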

### Execution Function

In [13]:
def geo_training(dataset):
    # Data division
    df_train, df_valid = train_test_split(dataset, test_size=0.25, random_state=12345)

    # Extract features and target variables from the training and validation sets
    X_train = df_train.drop(['product'], axis=1)
    y_train = df_train['product']

    X_val = df_valid.drop(['product'], axis=1)
    y_val = df_valid['product']
    
    # Training
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predict on the validation set
    predict = model.predict(X_val)
    
    # Mean of the predicted reserve volume
    mean_volume = np.mean(predict)
    
    # RMSE between the true and predicted values
    rmse = np.sqrt(mean_squared_error(y_val, predict))
    
    return {'rmse': rmse, 'predictions': predict, 'mean_volume': mean_volume, 'real_values': y_val}

In [14]:
# Function call + Variables
results_geo_0 = geo_training(geo_0)
results_geo_1 = geo_training(geo_1)
results_geo_2 = geo_training(geo_2)

### Results Output

<div style="color: #196CC4;">
▶ The predicted mean reserve volume (mean_volume), the model's RMSE (rmse), the predictions, and the true values for each region are stored in the dictionaries returned by geo_training, so they can be reused in the later calculations.
</div>

In [15]:
# geo_0 - Data
print("PREDICTIONS FOR REGION 0 (<geo_0>) ARE:")
print("- Average reserve volume:", results_geo_0['mean_volume'])
print("- RMSE:", results_geo_0['rmse'])
print("- Predictions:", results_geo_0['predictions'])
print()

# geo_1 - Data
print("PREDICTIONS FOR REGION 1 (<geo_1>) ARE:")
print("- Average reserve volume:", results_geo_1['mean_volume'])
print("- RMSE:", results_geo_1['rmse'])
print("- Predictions:", results_geo_1['predictions'])
print()

# geo_2 - Data
print("PREDICTIONS FOR REGION 2 (<geo_2>) ARE:")
print("- Average reserve volume:", results_geo_2['mean_volume'])
print("- RMSE:", results_geo_2['rmse'])
print("- Predictions:", results_geo_2['predictions'])
print()

PREDICTIONS FOR REGION 0 (<geo_0>) ARE:
- Average reserve volume: 92.59256778438035
- RMSE: 37.5794217150813
- Predictions: [ 95.89495185  77.57258261  77.89263965 ...  61.50983303 118.18039721
 118.16939229]

PREDICTIONS FOR REGION 1 (<geo_1>) ARE:
- Average reserve volume: 68.728546895446
- RMSE: 0.8930992867756165
- Predictions: [ 82.66331365  54.43178616  29.74875995 ... 137.87934053  83.76196568
  53.95846638]

PREDICTIONS FOR REGION 2 (<geo_2>) ARE:
- Average reserve volume: 94.96504596800489
- RMSE: 40.02970873393434
- Predictions: [ 93.59963303  75.10515854  90.06680936 ...  99.40728116  77.77991248
 129.03241718]



### Findings

<div style="color: #196CC4;">
<b>Average Volume</b><br>
▶ Region 2 has the highest average reserve volume, followed by regions 0 and 1.<br>
▶ There is a significant difference between the average reserve volume of region 1 and the other two regions.<br><br>
<b>RMSE</b><br>
▶ Region 1 has the lowest RMSE, indicating a better fit to the data for that region compared to the other two regions.<br>
▶ Region 0 and region 2 have higher RMSE values, so the predictions are not as accurate in those regions.<br><br>
<b>Region 1, although it has a lower average reserve volume, has a significantly lower RMSE, indicating a better model performance in that region.</b> <br>

</div>

-----

## Data Preparation

### Procedure

<div style="color: #196CC4;">
▶ The predicted mean reserve volume (mean_volume) and the model's root mean squared error (RMSE) have been calculated for each region.<br>
▶ Given that the minimum required revenue per well to avoid losses is $500,000, equivalent to 111.1 units, this value can be used as a benchmark to compare with the predicted mean reserve volume for each region.<br>
</div>

### Variables definition

<div style="color: #196CC4;">
▶ The following variables, which were presented in the initial project description, will be used to calculate the profit later.
</div>

In [16]:
# Variables
investment_money = 100_000_000
investment_wells = 200
well_min_profit = 500_000  
barrel_profit = 4.5
revenue_per_unit = 4_500

well_min_reserve = (well_min_profit / barrel_profit)/1000
print("Minimum units to avoid losses", well_min_reserve)

Minimum units to avoid losses 111.11111111111111


### Results Output

In [17]:
# Volume function
def compare_mean_volume_with_min_reserve(mean_volume):
    if mean_volume > well_min_reserve:
        return "Has a HIGHER AVERAGE VOLUME compared to the minimum required to avoid losses"
    else:
        return "Has a LOWER AVERAGE VOLUME compared to the minimum required to avoid losses"


# Function Calls
print("For region geo_0:\n", results_geo_0['mean_volume'], "→", compare_mean_volume_with_min_reserve(
    results_geo_0['mean_volume']), "\n")
print("For region geo_1:\n", results_geo_1['mean_volume'], "→", compare_mean_volume_with_min_reserve(
    results_geo_1['mean_volume']), "\n")
print("For region geo_2:\n", results_geo_2['mean_volume'], "→",
      compare_mean_volume_with_min_reserve(results_geo_2['mean_volume']))

For region geo_0:
 92.59256778438035 → Has a LOWER AVERAGE VOLUME compared to the minimum required to avoid losses 

For region geo_1:
 68.728546895446 → Has a LOWER AVERAGE VOLUME compared to the minimum required to avoid losses 

For region geo_2:
 94.96504596800489 → Has a LOWER AVERAGE VOLUME compared to the minimum required to avoid losses


### Findings

<div style="color: #196CC4;">
The following are the comparative results between the average volume of each region and the minimum required:<br><br>
▶  <b>These averages describe a typical well in the validation set, whereas development targets only the 200 wells with the highest predictions out of 500 explored points, so an average below the break-even volume does not by itself rule out a profit.</b><br>
▶  In all three regions the predicted average reserve volume is below the 111.1 units needed to break even, so developing wells chosen at random would not be expected to cover the costs.<br>
▶  The predicted average reserve volume for region geo_2 is the highest compared to the other two regions, with 94.97 units.
</div>
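
An average below the break-even volume describes a typical well, not the best ones. The sketch below illustrates this with synthetic volumes (the distribution parameters are assumptions chosen only to resemble a region loosely) and an idealised selection that keeps the 200 largest of 500 wells. A real model ranks wells imperfectly, so the actual gain from selection would be smaller.

```python
import numpy as np

# Break-even volume per well: 500,000 USD / 4,500 USD per unit (about 111.1)
break_even = 500_000 / 4_500

# Assumption: synthetic volumes, clipped at zero, loosely shaped like a region
rng = np.random.RandomState(1)
volumes = np.clip(rng.normal(loc=92.5, scale=44.0, size=500), 0, None)

# Idealised selection: keep the 200 largest of the 500 explored wells
top_200 = np.sort(volumes)[-200:]

print("Mean of all 500 wells:", volumes.mean())
print("Mean of the top 200:  ", top_200.mean())
print("Break-even volume:    ", break_even)
```

The comparison that matters for profitability is therefore between the break-even volume and the mean of the selected wells, not the mean of the whole region.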

-----

## Profit Calculation

### Procedure

<div style="color: #196CC4;">
The potential profits from developing oil wells in the three regions are calculated as follows:<br>
▶ A function (profit) is defined that receives a DataFrame of predictions and real values, together with the revenue per unit of product.<br>
▶ The rows are sorted by prediction in descending order.<br>
▶ The top 200 wells are selected according to their predictions.<br>
▶ The actual reserve volumes of these wells are used for the following steps.<br>
▶ The potential revenue is calculated by multiplying the total reserve volume by the revenue per unit (4,500 USD per thousand barrels).<br>
▶ The profit is calculated by subtracting the initial investment (investment_money).<br>
</div>
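
The steps above can be followed by hand on a tiny example. This is a sketch with hypothetical numbers: four wells, two of which are developed, and an assumed budget of 1,000,000 USD (`BUDGET` and `N_WELLS` are illustrative names, not the project's variables). Wells are ranked by prediction, but the profit is computed from the actual volumes of the chosen wells.

```python
import pandas as pd

REVENUE_PER_UNIT = 4_500   # USD per thousand barrels, as in the project conditions
BUDGET = 1_000_000         # assumed budget for this toy case
N_WELLS = 2                # assumed number of wells to develop

toy = pd.DataFrame({
    "predictions": [120.0, 80.0, 150.0, 95.0],
    "real_values": [110.0, 90.0, 140.0, 100.0],
})

# Rank by prediction, then pay out on the actual volumes of the chosen wells
chosen = toy.sort_values(by="predictions", ascending=False).head(N_WELLS)
toy_profit = chosen["real_values"].sum() * REVENUE_PER_UNIT - BUDGET

print(chosen)
print("Profit:", toy_profit)  # (140 + 110) * 4500 - 1_000_000 = 125_000
```

Using the real values rather than the predictions for the payout is what makes the result sensitive to how well the model ranks the wells.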

### Execution function

In [18]:
# Profit function
def profit(data_geo, barrel_prices):
    # Actual volumes of the wells with the highest predictions
    real_wells = data_geo.sort_values(by="predictions", ascending=False)['real_values'].head(investment_wells)

    # Total volume of the selected wells (thousands of barrels)
    total_reserves_volume = real_wells.sum()
    # Income: volume times the revenue per unit passed in as barrel_prices
    income = total_reserves_volume * barrel_prices
    # Net profit after subtracting the development budget
    net_profit = income - investment_money
    return net_profit

### Results Output

<div style="color: #196CC4;">
The potential profits from developing oil wells in the three regions are calculated as follows:<br>
▶ The profit function is used to calculate the potential profits for each region (geo_0, geo_1, and geo_2).<br>
▶ The results are stored in a dictionary.<br>
▶ The region with the highest potential profit is determined using the max() function.
</div>

In [19]:
# Reduced Dataframe
df_results_geo_0 = pd.DataFrame({
    'predictions': results_geo_0['predictions'],
    'real_values': results_geo_0['real_values']  })

df_results_geo_1 = pd.DataFrame({
    'predictions': results_geo_1['predictions'],
    'real_values': results_geo_1['real_values']  })

df_results_geo_2 = pd.DataFrame({
    'predictions': results_geo_2['predictions'],
    'real_values': results_geo_2['real_values']  })

# Print
#print(df_results_geo_0)
#print(df_results_geo_1)
#print(df_results_geo_2)

In [20]:
# Profit
profit_geo_0 = profit(df_results_geo_0, revenue_per_unit)
profit_geo_1 = profit(df_results_geo_1, revenue_per_unit)
profit_geo_2 = profit(df_results_geo_2, revenue_per_unit)

# Dictionary
potential_profits = {
    'geo_0': profit_geo_0,
    'geo_1': profit_geo_1,
    'geo_2': profit_geo_2
}

# Highest profit "max"
best_region = max(potential_profits, key=potential_profits.get)

# Print
print("Earnings in dollars for geo_0:", profit_geo_0)
print("Earnings in dollars for geo_1:", profit_geo_1)
print("Earnings in dollars for geo_2:", profit_geo_2)

Earnings in dollars for geo_0: 33208260.43139851
Earnings in dollars for geo_1: 24150866.966815114
Earnings in dollars for geo_2: 27103499.635998324


### Findings

<div style="color: #196CC4;">
▶ Earnings in dollars for geo_0 → $33,208,260.43<br>
▶ Earnings in dollars for geo_1 → $24,150,866.97<br>
▶ Earnings in dollars for geo_2 → $27,103,499.64<br>

<b>In summary, when the top 200 wells are taken from the whole validation set, region geo_0 shows the highest earning potential, followed by geo_2. These figures do not yet account for risk, which is assessed in the next section.</b>

</div>

-----

## Regional risk-reward analysis

### Bootstrapping

<div style="color: #196CC4;">
▶ Bootstrapping is a resampling technique used to estimate the distribution of a statistic of interest from an existing dataset. <br>
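▶ In each iteration, 500 wells are sampled with replacement from the validation results and the profit of the best 200 is computed; repeating this 1,000 times yields a distribution of profits.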
</div>
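
As a warm-up, the resampling idea can be shown on a generic statistic. The sketch below bootstraps the mean of a synthetic sample (the exponential distribution and its scale are assumptions for illustration only); the project's own function, defined next, applies the same idea to profit instead of the mean.

```python
import numpy as np

rng = np.random.RandomState(42)

# Assumption: a synthetic sample standing in for observed values
sample = rng.exponential(scale=10.0, size=300)

# Resample with replacement many times and record the statistic each time
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

print("Sample mean:", sample.mean())
print("Bootstrap mean of means:", boot_means.mean())
print("Bootstrap std of means:", boot_means.std())
```

The spread of `boot_means` approximates the uncertainty of the statistic without assuming a particular distribution for the data.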

In [21]:
# Random State
state = np.random.RandomState(54321)

# Bootstrapping function
def bootstrapping(df, num_samples=1000, alpha=0.05):
    profits = []
    
    # Loop
    for _ in range(num_samples):
        
        # Sample data with replacement
        wells = pd.DataFrame(df).sample(n=500, replace=True, random_state=state)
        
        # Profit
        benefit_sample = profit(wells, revenue_per_unit)
        
        # Append to the list
        profits.append(benefit_sample)
    
    # Lower and upper bounds (confidence interval) with percentiles
    lower_bound = np.percentile(profits, 2.5)  # Percentile 2.5
    upper_bound = np.percentile(profits, 97.5)  # Percentile 97.5
    
    # Convert to Series
    profits = pd.Series(profits)
    
    # Return
    return profits, (lower_bound, upper_bound)

<div style="color: #196CC4;">
▶ The bootstrapping function is called once for each of the three regions. Each call generates a distribution of profits from the resampled data.
</div>

In [22]:
# Profit per region
bootstrapping_geo_0= bootstrapping(df_results_geo_0)
bootstrapping_geo_1 = bootstrapping(df_results_geo_1)
bootstrapping_geo_2 = bootstrapping(df_results_geo_2)

# Print
print("Bootstrapping for region 0:", bootstrapping_geo_0[0].mean())
print("Bootstrapping for region 1:", bootstrapping_geo_1[0].mean())
print("Bootstrapping for region 2:", bootstrapping_geo_2[0].mean())

Bootstrapping for region 0: 3920829.088547083
Bootstrapping for region 1: 4604869.8247138085
Bootstrapping for region 2: 3964626.5656833537


### Key statistics 

<div style="color: #196CC4;">
Three key statistics are reported for each region:<br><br>
▶ <b>Average Profit:</b> The mean of the profit distribution, calculated by the statistics function.<br>
▶ <b>95% Confidence Interval:</b> Taken as the 2.5th and 97.5th percentiles of the bootstrapped profit distribution; it is already returned by the bootstrapping function.<br>
▶ <b>Percentage of Loss Risk:</b> The percentage of samples in the profit distribution that are less than zero, calculated by the statistics function and representing the risk of loss.
</div>
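
These three statistics can be computed in a few lines once a profit distribution is available. The sketch below uses a synthetic, normally distributed set of profits (an assumption for illustration, not the project's results) and pandas methods instead of the notebook's functions.

```python
import numpy as np
import pandas as pd

# Assumption: a synthetic profit distribution in USD (not the project's results)
rng = np.random.RandomState(7)
profits = pd.Series(rng.normal(loc=4_000_000, scale=2_500_000, size=1000))

mean_profit = profits.mean()
# Percentile interval: the central 95% of the simulated profits
lower, upper = profits.quantile([0.025, 0.975])
# Share of simulations that end in a loss, in percent
risk_of_loss = (profits < 0).mean() * 100

print("Mean profit:", mean_profit)
print("95% interval:", (lower, upper))
print("Loss risk (%):", risk_of_loss)
```

A percentile interval of this kind describes where 95% of the simulated profits fall; it makes no normality assumption about the distribution.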

In [23]:
# Mean profit and risk of loss (the 95% interval comes from bootstrapping)

def statistics(profit_distribution):
    # Mean profit
    mean_profit = np.mean(profit_distribution)

    # risk_of_loss
    risk_of_loss = np.mean([p < 0 for p in profit_distribution]) * 100
    
    # Return
    return mean_profit, risk_of_loss

### Results Output

<div style="color: #196CC4;">
▶ Finally, the statistics functions are applied to each of the generated profit distributions for the three regions, and the results are printed.
</div>

In [24]:
# Confidence Interval
confidence_interval_geo_0 = bootstrapping_geo_0[1]
confidence_interval_geo_1 = bootstrapping_geo_1[1]
confidence_interval_geo_2 = bootstrapping_geo_2[1]

# Statistics
mean_profit_geo_0, loss_risk_geo_0 = statistics(bootstrapping_geo_0[0])
mean_profit_geo_1, loss_risk_geo_1 = statistics(bootstrapping_geo_1[0])
mean_profit_geo_2, loss_risk_geo_2 = statistics(bootstrapping_geo_2[0])

In [25]:
# Print
print("Region geo_0:")
print("- Average profit:", mean_profit_geo_0)
print("- 95% Confidence interval:", confidence_interval_geo_0)
print("- % Loss risk:", loss_risk_geo_0)
print()

print("Region geo_1:")
print("- Average profit:", mean_profit_geo_1)
print("- 95% Confidence interval:", confidence_interval_geo_1)
print("- % Loss risk:", loss_risk_geo_1)
print()

print("Region geo_2:")
print("- Average profit:", mean_profit_geo_2)
print("- 95% Confidence interval:", confidence_interval_geo_2)
print("- % Loss risk:", loss_risk_geo_2)
print()

Region geo_0:
- Average profit: 3920829.088547083
- 95% Confidence interval: (-1108640.0281520828, 9033598.336663691)
- % Loss risk: 6.4

Region geo_1:
- Average profit: 4604869.8247138085
- 95% Confidence interval: (866343.877264005, 8722384.953529775)
- % Loss risk: 1.3

Region geo_2:
- Average profit: 3964626.5656833537
- 95% Confidence interval: (-1052362.8029479317, 9178751.558821613)
- % Loss risk: 6.3



-----

## Conclusions

### Given statistics

<div style="color: #196CC4;">
<b>Region geo_0:</b><br>
▶ Average Profit: 3,920,829.09<br>
▶ 95% Confidence Interval: (-1108640.0281520828, 9033598.336663691)<br>
▶ Loss Risk: 6.4%<br><br>
<b>Region geo_1:</b><br>
▶ Average Profit: 4,604,869.82<br>
▶ 95% Confidence Interval: (866343.877264005, 8722384.953529775)<br>
▶ Loss Risk: 1.3%<br><br>
<b>Region geo_2:</b><br>
▶ Average Profit: 3,964,626.57<br>
▶ 95% Confidence Interval: (-1052362.8029479317, 9178751.558821613)<br>
▶ Loss Risk: 6.3%<br>
</div>

### Most suitable region for well development

<div style="color: #196CC4;">
▶ Based on the resulting statistics, <b>region geo_1 is the best option for oil well development</b>. It is the only region whose risk of loss (1.3%) is below the 2.5% threshold; geo_0 (6.4%) and geo_2 (6.3%) exceed it and are therefore excluded. geo_1 also has the highest average profit, 4,604,869.82, and 95% of its bootstrapped profits fall between 866,343.88 and 8,722,384.95, an interval that lies entirely above zero. This suggests that losses on investments in this region are unlikely.
</div>
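
The selection rule from the project conditions (keep regions with a loss risk below 2.5%, then take the highest average profit) can be written out explicitly. This is a small sketch using the rounded figures reported above; the names `summary` and `MAX_RISK_PCT` are illustrative.

```python
import pandas as pd

# Figures copied from the results reported above
summary = pd.DataFrame(
    {
        "mean_profit": [3920829.09, 4604869.82, 3964626.57],
        "loss_risk_pct": [6.4, 1.3, 6.3],
    },
    index=["geo_0", "geo_1", "geo_2"],
)

MAX_RISK_PCT = 2.5  # threshold from the project conditions

# Keep regions under the risk threshold, then take the highest mean profit
eligible = summary[summary["loss_risk_pct"] < MAX_RISK_PCT]
selected = eligible["mean_profit"].idxmax() if not eligible.empty else None

print(eligible)
print("Selected region:", selected)
```

Applying the filter before the ranking matters: a region with a higher average profit but a risk above the threshold would still be excluded.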