# Optimizing Oil Well Locations: Maximizing Profit and Minimizing Risks

# Contents <a id='contents'></a>

[1. Introduction](#introduction)  
[2. Project Goal](#project_goal)  
[3. Initialization](#initialization)  
[4. Load data](#load-data)  
[5. Data Description](#data-description)  
[6. Data Preprocessing](#data-preprocessing)  
[7. Train and test the model for each region](#train-and-test)  
[8. Prepare for profit calculation](#prepare-for-profit-calculation)  
[9. Calculate profit from a set of selected oil wells](#profit-from-a-set-of-selected-oil-wells)  
[10. Calculate risks and profit for each region](#risks-and-profit-for-each-region)  
[11. Conclusion](#conclusion)

# 1. Introduction <a id='introduction'></a> 
[Back to Contents](#contents)

In the quest for harnessing valuable energy resources, the **OilyGiant mining company** embarks on a crucial mission to expand its operations by discovering new oil well locations. The success of this endeavor lies in identifying regions with the highest potential for profitable oil extraction while mitigating associated risks. To achieve this, a data-driven approach is adopted, leveraging geological exploration data from three distinct regions and employing techniques in data analysis and modeling.

Beyond just prediction, the project delves into the realm of profit calculation. Key variables, such as the budget for development and the revenue per barrel of raw materials, are utilized to identify the volume of reserves required for a new well to be developed without incurring losses. These calculations serve as a crucial reference for selecting profitable oil wells.

# 2. Project Goal <a id='project_goal'></a>  
[Back to Contents](#contents)

The project goal is to identify the most suitable region for the development of new oil wells, considering two primary objectives:

1. **Maximizing Profit**: The project aims to identify the region that offers the highest potential for profitable oil extraction. By accurately estimating the volume of reserves in the new wells using linear regression models and selecting the wells with the highest predicted values, the goal is to maximize the revenue generated from oil production.

2. **Minimizing Risks**: In the pursuit of profit, the project also considers the associated risks. Through the application of the Bootstrapping technique, the project evaluates the distribution of profit and calculates the probability of losses in each region. The objective is to choose a region with a risk level below 2.5%, ensuring a secure investment and minimizing potential losses.

By achieving these goals, the project aims to provide the OilyGiant mining company with data-driven insights and recommendations that lead to sound decision-making in selecting the region with the highest profit margin and the lowest risk for the development of new oil wells.

# 3. Initialization <a id='initialization'></a>  
[Back to Contents](#contents)

In [1]:
import pandas as pd
import numpy as np
from numpy.random import RandomState

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 4. Load data <a id='load-data'></a>  
[Back to Contents](#contents)

Geological exploration data for the three regions are stored in files: `geo_data_0.csv`, `geo_data_1.csv` and `geo_data_2.csv`. Let's load the datasets into separate Dataframes:

In [2]:
# Read datasets into Dataframes
df_geo_data_0 = pd.read_csv('./datasets/geo_data_0.csv')
df_geo_data_1 = pd.read_csv('./datasets/geo_data_1.csv')
df_geo_data_2 = pd.read_csv('./datasets/geo_data_2.csv')

Great! The data has been loaded from the files into separate dataframes.

# 5. Data Description <a id='data-description'></a>  
[Back to Contents](#contents)

Let's have a peek into the data in the dataframes.

In [3]:
# Get first 5 records of the dataframe - df_geo_data_0
df_geo_data_0.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


In [4]:
# Get first 5 records of the dataframe - df_geo_data_1
df_geo_data_1.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


In [5]:
# Get first 5 records of the dataframe - df_geo_data_2
df_geo_data_2.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


All the three dataframes have the same columns:  

- `id`: Unique oil well identifier, representing each oil well's individual record.

- `f0`, `f1`, `f2`: Three features of points. The specific meaning of these features is unimportant for the analysis, but they are significant for modeling the volume of reserves in the oil wells.

- `product`: Volume of reserves in the oil well, measured in thousand barrels. This is the target variable for the prediction models.

**Each dataset has five columns in total: the identifier `id`, the three input features `f0`, `f1` and `f2`, and the target variable `product`**, which represents the volume of reserves in each oil well.

Let's get some general information about each one of them:

In [6]:
# Get general information of the df_geo_data_0
df_geo_data_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


The dataframe `df_geo_data_0` has a total of **100000** rows, and none of its columns contains null values.

In [7]:
# Get general information of the df_geo_data_1
df_geo_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


The dataframe `df_geo_data_1` has a total of **100000** rows, and none of its columns contains null values.

In [8]:
# Get general information of the df_geo_data_2
df_geo_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


The dataframe `df_geo_data_2` has a total of **100000** rows, and none of its columns contains null values.
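
The three `info()` checks above could also be condensed into one summary table. The following is a minimal sketch: the helper name `summarize_frames` and the tiny stand-in dataframes are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def summarize_frames(frames: dict) -> pd.DataFrame:
    """Return one row per dataframe with its row count, column count and null count."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "dataframe": name,
            "rows": len(df),
            "columns": df.shape[1],
            "null_values": int(df.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("dataframe")


# Tiny stand-in frames (hypothetical values) with the same columns as the real data
demo_frames = {
    "region_a": pd.DataFrame({"id": ["a1", "a2"], "f0": [0.1, 0.2], "f1": [1.0, 2.0],
                              "f2": [3.0, 4.0], "product": [10.0, 20.0]}),
    "region_b": pd.DataFrame({"id": ["b1", "b2", "b3"], "f0": [0.3, None, 0.5],
                              "f1": [1.0, 2.0, 3.0], "f2": [3.0, 4.0, 5.0],
                              "product": [30.0, 40.0, 50.0]}),
}
summary = summarize_frames(demo_frames)
print(summary)
```

With the real data, the dictionary would hold `df_geo_data_0`, `df_geo_data_1` and `df_geo_data_2` instead of the stand-in frames.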

# 6. Data Preprocessing <a id='data-preprocessing'></a>  
[Back to Contents](#contents)

From the first few rows of each dataset, **the numeric features (`f0`, `f1` and `f2`) appear to be on broadly comparable scales**, which suggests they may already have been preprocessed or transformed. This impression is based on a handful of rows only. Additionally, **apart from the identifier `id`, there are no categorical features in the datasets**, which simplifies data preparation because no encoding is required.
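
The impression about feature scales could be checked with summary statistics rather than by eye. This is a minimal sketch, assuming a hypothetical helper `feature_scale_summary` and made-up toy values. It is not a check the original notebook performs.

```python
import pandas as pd


def feature_scale_summary(df: pd.DataFrame, feature_columns=("f0", "f1", "f2")) -> pd.DataFrame:
    """Return mean, standard deviation, minimum and maximum for each feature column."""
    return df[list(feature_columns)].agg(["mean", "std", "min", "max"])


# Hypothetical toy values, used only to show the shape of the result
toy = pd.DataFrame({
    "f0": [-15.0, 14.0, 6.0],
    "f1": [-8.0, -3.0, -6.0],
    "f2": [0.0, 1.0, 5.0],
})
scale_summary = feature_scale_summary(toy)
print(scale_summary)
```

Features with similar standard deviations and ranges would support the observation. Large differences would suggest scaling the features before modeling.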

## Change the column names <a id='change-the-column-names'></a>  
[Back to Contents](#contents)

We'll change the name of the column - `product` in all the dataframes to `volume_of_reserves` to make it meaningful:

In [9]:
# Rename column name
new_column_name = { 'product': 'volume_of_reserves'}

df_geo_data_0 = df_geo_data_0.rename(columns=new_column_name)
df_geo_data_1 = df_geo_data_1.rename(columns=new_column_name)
df_geo_data_2 = df_geo_data_2.rename(columns=new_column_name)

Let's confirm the change:

In [10]:
# Get list of column names in df_geo_data_0
df_geo_data_0.columns

Index(['id', 'f0', 'f1', 'f2', 'volume_of_reserves'], dtype='object')

In [11]:
# Get list of column names in df_geo_data_1
df_geo_data_1.columns

Index(['id', 'f0', 'f1', 'f2', 'volume_of_reserves'], dtype='object')

In [12]:
# Get list of column names in df_geo_data_2
df_geo_data_2.columns

Index(['id', 'f0', 'f1', 'f2', 'volume_of_reserves'], dtype='object')

Great! Let's move to the next steps.

## Look for duplicates <a id='look-for-duplicates'></a>  
[Back to Contents](#contents)

Since we are going to run the same checks on three different dataframes, let's write a function to check for duplicates:

In [13]:
# Function to check duplicate rows and ids in dataframes
def check_duplicates_in_df(df: pd.DataFrame):
    # Check for duplicate rows
    print(f"The no of duplicate rows: {df.duplicated().sum()}")
    
    # Check for duplicate IDs
    print(f"The no of duplicate IDs: {df['id'].duplicated().sum()}")

Great! Now let's call our function on the three dataframes:

In [14]:
# Check for duplicates in df_geo_data_0
check_duplicates_in_df(df_geo_data_0)

The no of duplicate rows: 0
The no of duplicate IDs: 10


In [15]:
# Check for duplicates in df_geo_data_1
check_duplicates_in_df(df_geo_data_1)

The no of duplicate rows: 0
The no of duplicate IDs: 4


In [16]:
# Check for duplicates in df_geo_data_2
check_duplicates_in_df(df_geo_data_2)

The no of duplicate rows: 0
The no of duplicate IDs: 4


Interesting! We have a few duplicate IDs in all three dataframes. Let's look at the records in each dataframe whose ID repeats an earlier one:

In [17]:
# Get all the 10 records that are duplicates in df_geo_data_0
df_geo_data_0[df_geo_data_0['id'].duplicated()]

| | id | f0 | f1 | f2 | volume_of_reserves |
|---|---|---|---|---|---|
| 7530 | HZww2 | 1.061194 | -0.373969 | 10.43021 | 158.828695 |
| 41724 | bxg6G | -0.823752 | 0.546319 | 3.630479 | 93.007798 |
| 51970 | A5aEY | -0.180335 | 0.935548 | -2.094773 | 33.020205 |
| 63593 | QcMuo | 0.635635 | -0.473422 | 0.86267 | 64.578675 |
| 66136 | 74z30 | 1.084962 | -0.312358 | 6.990771 | 127.643327 |
| 69163 | AGS9W | -0.933795 | 0.116194 | -3.655896 | 19.230453 |
| 75715 | Tdehs | 0.112079 | 0.430296 | 3.218993 | 60.964018 |
| 90815 | fiKDv | 0.049883 | 0.841313 | 6.394613 | 137.346586 |
| 92341 | TtcGQ | 0.110711 | 1.022689 | 0.911381 | 101.318008 |
| 97785 | bsk9y | 0.378429 | 0.005837 | 0.160827 | 160.637302 |


In [18]:
# Get all the 4 records that are duplicates in df_geo_data_1
df_geo_data_1[df_geo_data_1['id'].duplicated()]

| | id | f0 | f1 | f2 | volume_of_reserves |
|---|---|---|---|---|---|
| 41906 | LHZR0 | -8.989672 | -4.286607 | 2.009139 | 57.085625 |
| 82178 | bfPNe | -6.202799 | -4.820045 | 2.995107 | 84.038886 |
| 82873 | wt4Uk | 10.259972 | -9.376355 | 4.994297 | 134.766305 |
| 84461 | 5ltQ6 | 18.213839 | 2.191999 | 3.993869 | 107.813044 |


In [19]:
# Get all the 4 records that are duplicates in df_geo_data_2
df_geo_data_2[df_geo_data_2['id'].duplicated()]

| | id | f0 | f1 | f2 | volume_of_reserves |
|---|---|---|---|---|---|
| 43233 | xCHr8 | -0.847066 | 2.101796 | 5.59713 | 184.388641 |
| 49564 | VF7Jo | -0.883115 | 0.560537 | 0.723601 | 136.23342 |
| 55967 | KUPhW | 1.21115 | 3.176408 | 5.54354 | 132.831802 |
| 95090 | Vcm5J | 2.587702 | 1.986875 | 2.482245 | 92.327572 |


The duplicate IDs suggest that some wells were sampled more than once, for example at different stages of exploration or drilling. No row is duplicated in full (the duplicate-row count is 0 in every region), so rows that share an ID differ in their feature or target values. We therefore keep these records.
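
With its default `keep='first'`, `duplicated()` flags only the second and later occurrences of an ID, so the tables above show one row per repeated ID. To compare the rows that share an ID side by side, `keep=False` marks every occurrence. The following is a minimal sketch on a toy dataframe. The helper name `rows_with_shared_ids` and all values are hypothetical.

```python
import pandas as pd


def rows_with_shared_ids(df: pd.DataFrame, id_column: str = "id") -> pd.DataFrame:
    """Return every row whose ID occurs more than once, sorted so shared IDs sit together."""
    mask = df[id_column].duplicated(keep=False)
    return df[mask].sort_values(id_column)


# Toy data: the ID "w1" appears twice with different feature and target values
toy = pd.DataFrame({
    "id": ["w1", "w2", "w1", "w3"],
    "f0": [0.5, 1.2, -0.3, 0.9],
    "volume_of_reserves": [100.0, 80.0, 45.0, 60.0],
})
shared = rows_with_shared_ids(toy)
print(shared)
```

Applied to `df_geo_data_0`, this would return both rows for each of the 10 repeated IDs, which makes it easy to see how their values differ.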

## Remove columns not required for ML model <a id='remove-columns'></a>  
[Back to Contents](#contents)

After looking at the columns in all three dataframes, it is evident that the `id` column carries no information that would help our models predict the target, so let's get rid of it in all three dataframes:

In [20]:
# Drop column - id in df_geo_data_0
df_geo_data_0 = df_geo_data_0.drop(columns=['id'])
df_geo_data_0.columns

Index(['f0', 'f1', 'f2', 'volume_of_reserves'], dtype='object')

In [21]:
# Drop column - id in df_geo_data_1
df_geo_data_1 = df_geo_data_1.drop(columns=['id'])
df_geo_data_1.columns

Index(['f0', 'f1', 'f2', 'volume_of_reserves'], dtype='object')

In [22]:
# Drop column - id in df_geo_data_2
df_geo_data_2 = df_geo_data_2.drop(columns=['id'])
df_geo_data_2.columns

Index(['f0', 'f1', 'f2', 'volume_of_reserves'], dtype='object')

Great! We are good to go.

# 7. Train and test the model for each region <a id='train-and-test'></a>  
[Back to Contents](#contents)

Since we have to predict a numerical target, `volume_of_reserves`, rather than a categorical one, we will use a **Linear Regression** model. We will follow these steps:
1. Split the data into a training set and validation set at a ratio of 75:25.
2. Train the model and make predictions for the validation set.
3. Save the predictions and correct answers for the validation set.
4. Print the average volume of predicted reserves and model RMSE.
5. Analyze the results.  

Since we have to follow the above steps for all three dataframes, let's write a function to train and test the model for each region:

In [23]:
# Function to train and test the model for each region
def train_test_model(df: pd.DataFrame):
    '''
    This function:
    1. Splits the dataframe into a training set and a validation set at a ratio of 75:25.
    2. Trains the Linear Regression model and makes predictions for the validation set.
    3. Calculates the RMSE, the R2 score and the average volume of predicted reserves.
    4. Returns the correct answers and predictions for the validation set together with these metrics.
    '''
    # Split the dataframe into features and target
    features = df.drop(columns=['volume_of_reserves'])
    target = df['volume_of_reserves']
    
    # Split the dataframe into a training set and validation set at a ratio of 75:25
    features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)
    
    # Train the Linear Regression model on training set
    model = LinearRegression()
    model.fit(features_train, target_train)
    
    # Make predictions for validation set using the trained model
    predictions_valid = model.predict(features_valid)
    
    # Calculate Mean Squared Error - MSE for the validation set
    mse = mean_squared_error(target_valid, predictions_valid)
    
    # Calculate Root Mean Squared Error - RMSE for the validation set
    rmse = mse ** 0.5
    
    # Calculate Coefficient of Determination - R2 metric for the validation set
    r2 = r2_score(target_valid, predictions_valid)
    
    # Calculate the average volume of predicted reserves
    avg_volume_of_pred_reserves = predictions_valid.mean()
    
    # Return target_valid, predictions_valid, rmse, r2, avg_volume_of_pred_reserves
    return target_valid, predictions_valid, rmse, r2, avg_volume_of_pred_reserves

Good! So, now we have a function that will do our desired job on all the three dataframes. Let's start with **geographical region # 1**:

In [24]:
# Train and test the Linear Regression model for geographical region # 1
print('Train and test the Linear Regression model for geographical region # 1')

target_valid_0, predictions_valid_0, rmse_0, r2_0, avg_volume_of_pred_reserves_0 = train_test_model(df_geo_data_0)

print(f'RMSE: {rmse_0} ')
print(f'Coefficient of Determination | R2: {r2_0} ')
print(f'Average volume of predicted reserves: {avg_volume_of_pred_reserves_0} ')

Train and test the Linear Regression model for geographical region # 1
RMSE: 37.5794217150813 
Coefficient of Determination | R2: 0.27994321524487786 
Average volume of predicted reserves: 92.59256778438035 


Now for **geographical region # 2**:

In [25]:
# Train and test the Linear Regression model for geographical region # 2
print('Train and test the Linear Regression model for geographical region # 2')

target_valid_1, predictions_valid_1, rmse_1, r2_1, avg_volume_of_pred_reserves_1 = train_test_model(df_geo_data_1)

print(f'RMSE: {rmse_1} ')
print(f'Coefficient of Determination | R2: {r2_1} ')
print(f'Average volume of predicted reserves: {avg_volume_of_pred_reserves_1} ')

Train and test the Linear Regression model for geographical region # 2
RMSE: 0.893099286775617 
Coefficient of Determination | R2: 0.9996233978805127 
Average volume of predicted reserves: 68.728546895446 


And finally for **geographical region # 3**:

In [26]:
# Train and test the Linear Regression model for geographical region # 3
print('Train and test the Linear Regression model for geographical region # 3')

target_valid_2, predictions_valid_2, rmse_2, r2_2, avg_volume_of_pred_reserves_2 = train_test_model(df_geo_data_2)

print(f'RMSE: {rmse_2} ')
print(f'Coefficient of Determination | R2: {r2_2} ')
print(f'Average volume of predicted reserves: {avg_volume_of_pred_reserves_2} ')

Train and test the Linear Regression model for geographical region # 3
RMSE: 40.02970873393434 
Coefficient of Determination | R2: 0.20524758386040443 
Average volume of predicted reserves: 94.96504596800489 


So, in summary, this is what we have got:

1. **Geographical Region #1**:
- Root Mean Square Error (RMSE): 37.58
- Coefficient of Determination (R2): 0.280
- Average Volume of Predicted Reserves: 92.59 thousand barrels

2. **Geographical Region #2**:
- Root Mean Square Error (RMSE): 0.893
- Coefficient of Determination (R2): 0.9996
- Average Volume of Predicted Reserves: 68.73 thousand barrels

3. **Geographical Region #3**:
- Root Mean Square Error (RMSE): 40.03
- Coefficient of Determination (R2): 0.205
- Average Volume of Predicted Reserves: 94.97 thousand barrels

Based on the above results, **the model for Geographical Region #2 performs best**: it has the lowest RMSE (0.893) and the highest R2 (0.9996), which points to a nearly linear relationship between the features and the volume of reserves in that region.  

The models for Geographical Regions #1 and #3 are much weaker, with RMSE values of about 38 to 40 thousand barrels and R2 values below 0.3. 

Note that these metrics describe how well reserves can be predicted, not how large they are: Region #2 actually has the lowest average predicted reserves (68.73 thousand barrels). Accurate predictions make well selection in **Region #2** more reliable, but whether it is the most profitable region has to be examined in the further analysis.
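
To put an RMSE value in context, it can be compared with a baseline that always predicts the mean of the validation targets. R2 is exactly one minus the ratio of the model's MSE to that baseline's MSE. The sketch below uses NumPy only. The function names and the numbers are made up for illustration and are not taken from the project data.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_from_rmse(y_true, y_pred) -> float:
    """R2 = 1 - (model MSE) / (MSE of always predicting the mean of y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    baseline = np.full_like(y_true, y_true.mean())
    return 1.0 - (rmse(y_true, y_pred) ** 2) / (rmse(y_true, baseline) ** 2)


# Made-up numbers for illustration
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])
model_rmse = rmse(y_true, y_pred)
model_r2 = r2_from_rmse(y_true, y_pred)
print(model_rmse, model_r2)
```

Read this way, an R2 of about 0.28 (Region #1) means the model's MSE is still roughly 72% of the mean-baseline MSE, while an R2 of 0.9996 (Region #2) means almost none of the baseline error remains.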

# 8. Prepare for profit calculation <a id='prepare-for-profit-calculation'></a>  
[Back to Contents](#contents)

Now, let's prepare ourselves for profit calculation. We'll store all key values for calculations in separate variables:

In [27]:
# Store all key values for calculations in separate variables

# When exploring a region, 500 points are studied and the best 200 are picked for development
no_of_oil_wells = 200

# The budget for the development of 200 oil wells is 100 million USD
dev_budget = 100000000

# The revenue from one unit of product is 4,500 dollars
revenue_per_product = 4500

Now, let's **calculate the volume of reserves sufficient for developing a new well without losses** i.e. **break-even volume**. In the context of oil exploration and production, the break-even volume represents the minimum amount of oil reserves that must be extracted and sold to cover all the costs associated with drilling, development, and operation of the new well.  

Once the production from the new well surpasses the break-even volume, the revenue generated from selling the extracted oil exceeds the total costs, leading to profitability. Therefore, the break-even volume is a critical threshold in determining the economic viability of a new oil well project.

In [28]:
# Calculate development cost per well in USD
cost_per_well = dev_budget / no_of_oil_wells 
cost_per_well

500000.0

In [29]:
# Calculate the volume of reserves sufficient for developing a new well without losses i.e. break-even volume
breakeven_volume = cost_per_well / revenue_per_product
print(f'Volume of reserves sufficient for developing a new well without losses is {breakeven_volume} thousand barrels, with {cost_per_well} USD cost per well.')

Volume of reserves sufficient for developing a new well without losses is 111.11111111111111 thousand barrels, with 500000.0 USD cost per well.
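
The two steps above can also be written as a single formula. The symbols $B$ (development budget), $n$ (number of wells) and $r$ (revenue per unit of product) are introduced here for illustration only. They correspond to `dev_budget`, `no_of_oil_wells` and `revenue_per_product`:

$$
V_{\text{break-even}} = \frac{B}{n \cdot r} = \frac{100{,}000{,}000}{200 \times 4{,}500} \approx 111.11 \text{ thousand barrels}
$$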


Awesome! Now, let's compare the obtained value of `breakeven_volume` with the average volume of reserves in each region:

In [30]:
print(f'Average volume of predicted reserves in Region # 1: {avg_volume_of_pred_reserves_0} ')
print(f'Average volume of predicted reserves in Region # 2: {avg_volume_of_pred_reserves_1} ')
print(f'Average volume of predicted reserves in Region # 3: {avg_volume_of_pred_reserves_2} ')

Average volume of predicted reserves in Region # 1: 92.59256778438035 
Average volume of predicted reserves in Region # 2: 68.728546895446 
Average volume of predicted reserves in Region # 3: 94.96504596800489 


Let's compare and analyze the average predicted reserves in each region with the break-even volume:

1. **Region #1**:

- Average Volume of Predicted Reserves: 92.59 thousand barrels.
- Break-Even Volume: 111.11 thousand barrels.  

**The average predicted reserves in Region #1 (92.59 thousand barrels) are below the break-even volume (111.11 thousand barrels)**. This suggests that, on average, **developing a new well in this region may not be profitable**, as the predicted reserves fall short of the volume required to cover the costs without losses.

2. **Region #2**:

- Average Volume of Predicted Reserves: 68.73 thousand barrels.
- Break-Even Volume: 111.11 thousand barrels.  

**The average predicted reserves in Region #2 (68.73 thousand barrels) are also below the break-even volume (111.11 thousand barrels)**. Similar to Region #1, **developing a new well in this region may not be economically viable** based on the average predicted reserves.

3. **Region #3**:

- Average Volume of Predicted Reserves: 94.97 thousand barrels.
- Break-Even Volume: 111.11 thousand barrels.

**In Region #3, the average predicted reserves (94.97 thousand barrels) are still below the break-even volume (111.11 thousand barrels)**.  

Overall, based on the average predicted reserves, none of the regions (Region #1, Region #2, or Region #3) appear to meet the break-even volume of 111.11 thousand barrels required for developing a new well without losses, assuming a cost of 500,000 USD per well.
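
An average below the break-even volume does not by itself rule out a profit, because only the 200 best wells are developed. What matters is how many wells exceed the threshold. The sketch below illustrates this with hypothetical volumes. The helper name `share_above_breakeven` is an assumption and not part of the original notebook.

```python
import pandas as pd

# Break-even volume as computed above: budget / number of wells / revenue per unit
BREAKEVEN_VOLUME = 100_000_000 / 200 / 4500  # about 111.11 thousand barrels


def share_above_breakeven(volumes: pd.Series, breakeven: float = BREAKEVEN_VOLUME) -> float:
    """Fraction of wells whose volume exceeds the break-even volume."""
    return float((volumes > breakeven).mean())


# Hypothetical volumes in thousand barrels; their mean is below the break-even volume
toy_volumes = pd.Series([30.0, 45.0, 60.0, 120.0, 180.0])
toy_mean = toy_volumes.mean()
toy_share = share_above_breakeven(toy_volumes)
print(toy_mean, toy_share)
```

In this toy case the mean is 87.0, yet 40% of the wells are above break-even, so selecting only the best wells can still be profitable.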

# 9. Calculate profit from a set of selected oil wells <a id='profit-from-a-set-of-selected-oil-wells'></a>  
[Back to Contents](#contents)

The objective is to identify the top 200 sites that the model predicts as the most promising locations for building new oil wells. Once these sites are determined, the next step is to calculate the profit based on the real-life values associated with these sites.

In [31]:
def get_profits_for_top_200_sites(predictions, target):
    
    """
    1. This function accepts two inputs: predictions and target (the actual volumes), aligned row by row.
    2. It selects the top 200 sites with the highest predicted volumes of reserves. 
    3. For these selected sites, it calculates the revenue from the actual volumes, 
    at $4500 per unit of product (1000 barrels).
    4. The function then computes the profit by subtracting the development budget of $100 million from the revenue. 
    5. Finally, the function returns the total reserve volume of the selected sites and the profit.
    """
    # Convert predictions into a Series that shares the target's index
    predictions = pd.Series(predictions, index=target.index)
    
    # Indices of the sites with the highest predicted volumes
    top_200_predicted_sites = predictions.sort_values(ascending=False).head(no_of_oil_wells).index
    # Sum the actual volumes of the selected sites
    sum_of_top_200_sites_volumes = target[top_200_predicted_sites].sum()
    revenue = sum_of_top_200_sites_volumes * revenue_per_product
    profit = revenue - dev_budget

    return sum_of_top_200_sites_volumes, profit

Let's calculate for Region # 1:

In [32]:
total_reserve_volume_0, profit_0 = get_profits_for_top_200_sites(predictions_valid_0, target_valid_0)

print(f'The total reserve volume for Region # 1 is {total_reserve_volume_0} thousand barrels.')
print(f'The total profit for Region # 1 is {profit_0} in USD.')

The total reserve volume for Region # 1 is 29601.83565142189 thousand barrels.
The total profit for Region # 1 is 33208260.43139851 in USD.


Let's calculate for Region # 2:

In [33]:
total_reserve_volume_1, profit_1 = get_profits_for_top_200_sites(predictions_valid_1, target_valid_1)

print(f'The total reserve volume for Region # 2 is {total_reserve_volume_1} thousand barrels.')
print(f'The total profit for Region # 2 is {profit_1} in USD.')

The total reserve volume for Region # 2 is 27589.081548181137 thousand barrels.
The total profit for Region # 2 is 24150866.966815114 in USD.


Let's calculate for Region # 3:

In [34]:
total_reserve_volume_2, profit_2 = get_profits_for_top_200_sites(predictions_valid_2, target_valid_2)

print(f'The total reserve volume for Region # 3 is {total_reserve_volume_2} thousand barrels.')
print(f'The total profit for Region # 3 is {profit_2} in USD.')

The total reserve volume for Region # 3 is 28245.22214133296 thousand barrels.
The total profit for Region # 3 is 27103499.635998324 in USD.


Based on the calculations of total reserve volume and total profit, we can compare the three regions:

1. **Region #1**:

- Total Reserve Volume: 29,601.84 thousand barrels
- Total Profit: 33,208,260.43 USD

2. **Region #2**:

- Total Reserve Volume: 27,589.08 thousand barrels
- Total Profit: 24,150,866.97 USD

3. **Region #3**:

- Total Reserve Volume: 28,245.22 thousand barrels
- Total Profit: 27,103,499.64 USD  

Based on the results, **Region #1 has the highest total reserve volume of 29,601.84 thousand barrels among the selected wells**, and it also shows the highest total profit of 33,208,260.43 USD. Note, however, that these figures come from picking the best 200 wells out of the entire validation set (25,000 wells, i.e. 25% of 100,000), not from a study of 500 points. They are therefore an optimistic estimate and do not capture the risk involved, which is assessed in the next section.
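
The key idea of the profit calculation is that wells are chosen by their predicted volumes but valued at their actual volumes. A small worked example makes this concrete. This is a sketch with hypothetical toy numbers and a parameterized helper, `profit_for_top_sites`, which is an assumption and not the notebook's own function.

```python
import pandas as pd


def profit_for_top_sites(predictions: pd.Series, target: pd.Series,
                         n_sites: int, budget: float, revenue_per_unit: float) -> float:
    """Pick the n_sites wells with the highest predictions; value them at their actual volumes."""
    chosen = predictions.sort_values(ascending=False).head(n_sites).index
    return float(target.loc[chosen].sum() * revenue_per_unit - budget)


# Hypothetical toy data: 4 wells, develop the best 2 at 500,000 USD per well
predicted = pd.Series([120.0, 90.0, 150.0, 60.0], index=["a", "b", "c", "d"])
actual = pd.Series([100.0, 130.0, 140.0, 70.0], index=["a", "b", "c", "d"])
toy_profit = profit_for_top_sites(predicted, actual, n_sites=2,
                                  budget=1_000_000, revenue_per_unit=4500)
print(toy_profit)  # 80000.0
```

Wells `c` and `a` are selected because they have the highest predictions, giving (140 + 100) × 4,500 − 1,000,000 = 80,000 USD. Well `b` has a higher actual volume than `a` but is passed over, which shows how prediction errors reduce the realized profit.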

# 10. Calculate risks and profit for each region <a id='risks-and-profit-for-each-region'></a>  
[Back to Contents](#contents)

Let's use the bootstrapping technique with 1000 samples to find the distribution of profit. Also, let's find average profit, 95% confidence interval and risk of losses. Loss is negative profit, so, let's calculate it as a probability and then express as a percentage.

In [35]:
def bootstrap(predictions, target):
    
    """
    This function analyzes the predictions and targets of one region. 
    It draws 1000 random samples of 500 wells each, selects the top 200 wells in every sample and 
    uses the resulting profits to estimate the mean profit and its 95% confidence interval.
    Note: the samples are drawn with replace=False, i.e. without replacement within a sample; 
    a classical bootstrap draws with replacement (replace=True).
    The function then steps through the quantiles from 0.1% to 9.9% to find the first one at which the 
    profit is no longer negative, which approximates the risk of loss (the reported value is capped at 9.9%).
    Finally, it prints the mean profit, the 95% confidence interval and the risk of loss.
    """
    
    rand_state = RandomState(0)
    values = []
    
    # Convert predictions into Series
    predictions = pd.Series(predictions, index=target.index)

    for _ in range(1000):
        # Draw 500 wells and look up the matching predictions by index
        target_subsample = target.sample(n=500, random_state=rand_state, replace=False)
        predictions_subsample = predictions[target_subsample.index]
        profit = get_profits_for_top_200_sites(predictions_subsample, target_subsample)[1]
        
        values.append(profit)
    
    values = pd.Series(values)
    mean_profit = values.mean()
    lower_profit = values.quantile(0.025)
    upper_profit = values.quantile(0.975)
    
    print(f"Average profit is {mean_profit} in USD.")
    print(f"For 95% confidence interval, the lower profit is {lower_profit} in USD and the upper profit is {upper_profit} in USD.")
    
    # Find the first quantile (in steps of 0.1%) at which the profit is non-negative
    for quan in range(1, 100):
        profit = values.quantile(quan/1000)
        if profit >= 0:
            break
    print(f"Risk of loss is {quan/10}%")
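
The function above draws each 500-well sample with `replace=False`. A classical bootstrap draws with replacement, and the risk of loss can then be read directly as the share of iterations with a negative profit. The sketch below shows that variant under stated assumptions: the names `bootstrap_profits` and `risk_of_loss` are hypothetical, the data is synthetic, and positional NumPy indexing is used because repeated index labels make label-based lookups return extra rows. Its results would differ from the outputs reported in this section.

```python
import numpy as np


def bootstrap_profits(predictions, target, n_iterations=1000, sample_size=500,
                      n_sites=200, budget=100_000_000, revenue_per_unit=4500, seed=0):
    """Return an array of profits from samples drawn WITH replacement."""
    predictions = np.asarray(predictions, dtype=float)
    target = np.asarray(target, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iterations)
    for i in range(n_iterations):
        # Positions are drawn with replacement, so a well may appear more than once
        positions = rng.integers(0, len(target), size=sample_size)
        sample_pred = predictions[positions]
        sample_true = target[positions]
        # Select the n_sites highest predictions within this sample
        top = np.argsort(sample_pred)[-n_sites:]
        profits[i] = sample_true[top].sum() * revenue_per_unit - budget
    return profits


def risk_of_loss(profits) -> float:
    """Share of iterations with a negative profit, expressed as a percentage."""
    return float((np.asarray(profits) < 0).mean() * 100)


# Synthetic stand-in data (not the project's data): noisy predictions of random volumes
rng = np.random.default_rng(42)
true_volumes = rng.uniform(0, 190, size=5000)
predicted_volumes = true_volumes + rng.normal(0, 40, size=5000)
profits = bootstrap_profits(predicted_volumes, true_volumes)
lower, upper = np.quantile(profits, [0.025, 0.975])
print(profits.mean(), lower, upper, risk_of_loss(profits))
```

Computing the risk as `(profits < 0).mean()` avoids the 9.9% cap and the 0.1% step size of the quantile search.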

In [36]:
# For Region # 1
bootstrap(predictions_valid_0, target_valid_0)

Average profit is 3973281.161239682 in USD.
For 95% confidence interval, the lower profit is -976742.1907361418 in USD and the upper profit is 8911035.45563315 in USD.
Risk of loss is 5.9%


In [37]:
# For Region # 2
bootstrap(predictions_valid_1, target_valid_1)

Average profit is 4568945.689079863 in USD.
For 95% confidence interval, the lower profit is 431660.3432972718 in USD and the upper profit is 8451364.730538176 in USD.
Risk of loss is 0.9%


In [38]:
# For Region # 3
bootstrap(predictions_valid_2, target_valid_2)

Average profit is 3964652.612256283 in USD.
For 95% confidence interval, the lower profit is -958061.7904931746 in USD and the upper profit is 9205980.325455725 in USD.
Risk of loss is 7.6%


Based on the results from the bootstrapping technique for the three regions, we can provide the following findings and a recommendation for the development of oil wells:

1. **Region #1**:

- Average Profit: 3,973,281.16 USD
- 95% Confidence Interval (Lower Profit): -976,742.19 USD
- 95% Confidence Interval (Upper Profit): 8,911,035.46 USD
- Risk of Loss: 5.9%

2. **Region #2**:

- Average Profit: 4,568,945.69 USD
- 95% Confidence Interval (Lower Profit): 431,660.34 USD
- 95% Confidence Interval (Upper Profit): 8,451,364.73 USD
- Risk of Loss: 0.9%

3. **Region #3**:

- Average Profit: 3,964,652.61 USD
- 95% Confidence Interval (Lower Profit): -958,061.79 USD
- 95% Confidence Interval (Upper Profit): 9,205,980.33 USD
- Risk of Loss: 7.6%

In summary, **Region #2 stands out as the most favorable choice for the development of oil wells** as:

- Region #2 has the highest average profit of 4,568,945.69 USD among the three regions. This indicates that, on average, the oil well development in Region #2 is expected to yield the highest returns.

- The 95% confidence interval for Region #2 is the narrowest of the three (431,660.34 USD to 8,451,364.73 USD) and the only one whose lower bound is positive, so even the pessimistic end of the interval is a profit.

- Region #2 has the lowest risk of loss at 0.9%, and it is the only region below the 2.5% threshold set in the project goal (Regions #1 and #3 are at 5.9% and 7.6%).
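
The selection rule from the project goal (keep only regions with a risk of loss below 2.5%, then take the highest average profit) can be written out explicitly. This is a minimal sketch using the figures reported above. The helper name `pick_region` and the dictionary layout are assumptions for illustration.

```python
# Figures copied from the outputs above (average profit in USD, risk of loss in %)
results = {
    "Region #1": {"avg_profit": 3973281.16, "risk_pct": 5.9},
    "Region #2": {"avg_profit": 4568945.69, "risk_pct": 0.9},
    "Region #3": {"avg_profit": 3964652.61, "risk_pct": 7.6},
}
MAX_RISK_PCT = 2.5  # threshold stated in the project goal


def pick_region(results: dict, max_risk_pct: float):
    """Keep regions under the risk threshold, then return the one with the highest average profit."""
    eligible = {name: r for name, r in results.items() if r["risk_pct"] < max_risk_pct}
    if not eligible:
        return None  # no region satisfies the risk requirement
    return max(eligible, key=lambda name: eligible[name]["avg_profit"])


best_region = pick_region(results, MAX_RISK_PCT)
print(best_region)  # Region #2
```

Here the risk filter alone already leaves only Region #2, so the profit comparison merely confirms the choice.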

# 11. Conclusion  <a id='conclusion'></a>  
[Back to Contents](#contents)

In conclusion, this project aimed to identify the most promising region for the development of new oil wells for the OilyGiant mining company. The process involved data loading and preprocessing, model training, profit calculation, and risk assessment using bootstrapping. The primary goal was to maximize profit while keeping the risk of losses below 2.5%.

After training a linear regression model for each region, we evaluated the predictive performance. **The model for Region #2 was by far the most accurate** (RMSE 0.893, R2 0.9996), even though Region #2 had the lowest average predicted reserves (68.73 thousand barrels, against 92.59 and 94.97 for Regions #1 and #3).  

Using the bootstrapping technique with 1000 samples, we calculated the mean profit and 95% confidence interval for each region. **Region #2 showed the highest average profit and the narrowest confidence interval**. Moreover, Region #2 had the lowest risk of losses at only 0.9%, the only value below the 2.5% threshold, making it the most secure investment choice.

Based on these comprehensive findings, **we recommend Region #2 as the most suitable region for oil wells' development**. The combination of high average profit, low risk of losses, and a narrow confidence interval positions Region #2 as the most promising option for maximizing returns on investment.