## **Project Description**

The objective of this project was to determine the most profitable region for developing new oil wells for the OilyGiant mining company. The analysis involved building a predictive model of oil reserves for each of three regions, calculating the potential profit from those predictions, and using bootstrapping to assess the risk of loss in each region.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [4]:
geo_data_0 = pd.read_csv('geo_data_0.csv')
geo_data_1 = pd.read_csv('geo_data_1.csv')
geo_data_2 = pd.read_csv('geo_data_2.csv')

In [5]:
geo_data_0.head(), geo_data_1.head(), geo_data_2.head()

(      id        f0        f1        f2     product
 0  txEyH  0.705745 -0.497823  1.221170  105.280062
 1  2acmU  1.334711 -0.340164  4.365080   73.037750
 2  409Wp  1.022732  0.151990  1.419926   85.265647
 3  iJLyR -0.032172  0.139033  2.978566  168.620776
 4  Xdl7t  1.988431  0.155413  4.751769  154.036647,
       id         f0         f1        f2     product
 0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
 1  62mP7  14.272088  -3.475083  0.999183   26.953261
 2  vyE1P   6.263187  -5.948386  5.001160  134.766305
 3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
 4  AHL4O  12.702195  -8.147433  5.004363  134.766305,
       id        f0        f1        f2     product
 0  fwXo0 -1.146987  0.963328 -0.828965   27.758673
 1  WJtFt  0.262778  0.269839 -2.530187   56.069697
 2  ovLUW  0.194587  0.289035 -5.586433   62.871910
 3  q6cA6  2.236060 -0.553760  0.930038  114.572842
 4  WPMUX -0.515993  1.716266  5.899011  149.600746)

In [6]:
geo_data_0.isnull().sum(), geo_data_1.isnull().sum(), geo_data_2.isnull().sum()

(id         0
 f0         0
 f1         0
 f2         0
 product    0
 dtype: int64,
 id         0
 f0         0
 f1         0
 f2         0
 product    0
 dtype: int64,
 id         0
 f0         0
 f1         0
 f2         0
 product    0
 dtype: int64)

In [7]:
train_0, val_0 = train_test_split(geo_data_0, test_size=0.25, random_state=42)
train_1, val_1 = train_test_split(geo_data_1, test_size=0.25, random_state=42)
train_2, val_2 = train_test_split(geo_data_2, test_size=0.25, random_state=42)

train_0.shape, val_0.shape, train_1.shape, val_1.shape, train_2.shape, val_2.shape

((75000, 5), (25000, 5), (75000, 5), (25000, 5), (75000, 5), (25000, 5))

- None of the three datasets contains missing values.
- Each dataset was split into a training set (75%, 75,000 rows) and a validation set (25%, 25,000 rows).

In [8]:
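def train_and_evaluate(train_data, val_data):
    """Fit a linear regression on one region and evaluate it on the validation set."""
    features = ['f0', 'f1', 'f2']
    X_train = train_data[features]
    y_train = train_data['product']
    X_val = val_data[features]
    y_val = val_data['product']

    model = LinearRegression()
    model.fit(X_train, y_train)

    predictions = model.predict(X_val)
    # Take the square root explicitly: the `squared=False` argument is
    # deprecated (and later removed) in newer scikit-learn releases.
    rmse = np.sqrt(mean_squared_error(y_val, predictions))
    avg_predicted_reserves = predictions.mean()

    return predictions, y_val, rmse, avg_predicted_reserves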

In [9]:
results_0 = train_and_evaluate(train_0, val_0)
results_1 = train_and_evaluate(train_1, val_1)
results_2 = train_and_evaluate(train_2, val_2)

rmse_0, avg_pred_reserves_0 = results_0[2], results_0[3]
rmse_1, avg_pred_reserves_1 = results_1[2], results_1[3]
rmse_2, avg_pred_reserves_2 = results_2[2], results_2[3]

rmse_0, avg_pred_reserves_0, rmse_1, avg_pred_reserves_1, rmse_2, avg_pred_reserves_2

(37.75660035026169,
 92.3987999065777,
 0.8902801001028817,
 68.71287803913762,
 40.145872311342174,
 94.7710238776594)

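- **Region 0**: RMSE = 37.76, average predicted reserves = 92.40 thousand barrels
- **Region 1**: RMSE = 0.89, average predicted reserves = 68.71 thousand barrels
- **Region 2**: RMSE = 40.15, average predicted reserves = 94.77 thousand barrels

The Region 1 model is far more accurate than the other two (RMSE below 1 thousand barrels), although its average predicted reserves are the lowest.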
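
An RMSE value is easier to judge against a reference point. The following is a minimal sketch, on made-up numbers rather than the project data, that compares a model's RMSE with that of a constant baseline that always predicts the mean. The helper `rmse` and the toy arrays are assumptions introduced for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy targets and model predictions (illustrative values, not project data).
y_true = np.array([100.0, 80.0, 120.0, 60.0])
y_model = np.array([90.0, 85.0, 110.0, 70.0])

# Constant baseline: always predict the mean of the targets.
# (On real data the mean would come from the training set.)
y_baseline = np.full_like(y_true, y_true.mean())

model_rmse = rmse(y_true, y_model)        # about 9.01
baseline_rmse = rmse(y_true, y_baseline)  # about 22.36
```

A model whose RMSE is not clearly below the baseline adds little beyond predicting the average.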

In [10]:
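budget = 100e6               # total development budget, USD
num_wells = 200              # number of wells to develop
revenue_per_barrel = 4.5e3   # USD per unit of product (one unit = 1,000 barrels)
# Reserves each well needs, on average, to break even (thousand barrels)
min_reserves = budget / (num_wells * revenue_per_barrel)
min_reserves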

111.11111111111111

- Minimum average reserves per well needed to break even: 111.11 thousand barrels
- The average predicted reserves in every region (92.40, 68.71, and 94.77 thousand barrels) fall below this threshold, so wells must be selected rather than developed at random.
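
The break-even arithmetic can be spelled out step by step. This is a small sketch using the constants from the cell above and the rounded averages reported earlier; the names `revenue_per_unit`, `cost_per_well`, `break_even_reserves`, and `shortfall` are introduced here for illustration.

```python
budget = 100e6            # USD, total development budget
num_wells = 200           # wells to develop
revenue_per_unit = 4.5e3  # USD per thousand barrels (4.5 USD per barrel)

# Budget share per well, then the reserves needed to recover it.
cost_per_well = budget / num_wells                      # 500,000 USD
break_even_reserves = cost_per_well / revenue_per_unit  # thousand barrels

# Rounded average predicted reserves per region (thousand barrels).
avg_predicted = {0: 92.40, 1: 68.71, 2: 94.77}

# Gap between the break-even volume and each region's average.
shortfall = {
    region: break_even_reserves - avg
    for region, avg in avg_predicted.items()
}
```

Every entry of `shortfall` is positive, which is why an average well in any region would lose money.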


In [11]:
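def calculate_profit(predictions, target, num_wells, budget, revenue_per_barrel):
    """Profit from developing the num_wells wells with the highest predicted reserves."""
    # Reset both indexes so predictions and targets line up by position.
    predictions = predictions.reset_index(drop=True)
    target = target.reset_index(drop=True)
    # Select wells by predicted reserves, but compute revenue from actual reserves.
    selected_indices = predictions.sort_values(ascending=False).index[:num_wells]
    selected_reserves = target.loc[selected_indices]
    total_revenue = selected_reserves.sum() * revenue_per_barrel
    profit = total_revenue - budget

    return profit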

profit_0 = calculate_profit(pd.Series(results_0[0]), results_0[1], num_wells, budget, revenue_per_barrel)
profit_1 = calculate_profit(pd.Series(results_1[0]), results_1[1], num_wells, budget, revenue_per_barrel)
profit_2 = calculate_profit(pd.Series(results_2[0]), results_2[1], num_wells, budget, revenue_per_barrel)

profit_0, profit_1, profit_2

(33591411.14462179, 24150866.966815114, 25985717.59374112)

Profit from the 200 wells with the highest predicted reserves, selected from all 25,000 validation wells in each region:

- **Region 0**: \$33.59 million
- **Region 1**: \$24.15 million
- **Region 2**: \$25.99 million
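
The selection logic is easy to check by hand on a tiny example. The sketch below uses five made-up wells and a hypothetical helper `profit_of_top_k`; it mirrors the idea of ranking by predicted reserves and paying out on actual reserves, and is not the notebook's own function.

```python
import numpy as np

def profit_of_top_k(predicted, actual, k, budget, revenue_per_unit):
    """Pick the k wells with the highest predictions; profit uses actual reserves."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Indices of the k largest predictions, in descending order.
    top_k = np.argsort(predicted)[::-1][:k]
    return float(actual[top_k].sum() * revenue_per_unit - budget)

# Five toy wells (thousand barrels); values are illustrative only.
predicted = [120.0, 90.0, 150.0, 60.0, 130.0]
actual = [110.0, 95.0, 140.0, 70.0, 100.0]

# Wells 2, 4, and 0 are selected: actual reserves 140 + 100 + 110 = 350.
toy_profit = profit_of_top_k(predicted, actual, k=3,
                             budget=1.5e6, revenue_per_unit=4.5e3)
# 350 * 4,500 - 1,500,000 = 75,000
```

Note that well 4 is selected on its prediction of 130 even though its actual reserves are only 100, which is how prediction error feeds into profit.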

In [12]:
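def bootstrap_profit(predictions, target, num_wells, budget, revenue_per_barrel, n_samples=1000, sample_size=500, random_state=None):
    """Bootstrap the profit distribution; returns mean, 95% CI bounds, and risk of loss (%)."""
    # Pass an integer random_state for reproducible results; None keeps runs unseeded.
    rng = np.random.RandomState(random_state)
    profits = []

    # Align both series on a fresh 0..n-1 index so positions match.
    predictions = predictions.reset_index(drop=True)
    target = target.reset_index(drop=True)

    for _ in range(n_samples):
        # Draw sample_size wells with replacement, then develop the best num_wells of them.
        sample_indices = predictions.sample(n=sample_size, replace=True, random_state=rng).index
        sample_predictions = predictions.loc[sample_indices]
        sample_target = target.loc[sample_indices]
        profit = calculate_profit(sample_predictions, sample_target, num_wells, budget, revenue_per_barrel)
        profits.append(profit)

    profits = np.array(profits)
    mean_profit = np.mean(profits)
    lower_ci = np.percentile(profits, 2.5)
    upper_ci = np.percentile(profits, 97.5)
    # Share of bootstrap samples that end in a loss, as a percentage.
    risk_of_loss = np.mean(profits < 0) * 100

    return mean_profit, lower_ci, upper_ci, risk_of_loss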

bootstrap_results_0 = bootstrap_profit(pd.Series(results_0[0]), results_0[1], num_wells, budget, revenue_per_barrel)
bootstrap_results_1 = bootstrap_profit(pd.Series(results_1[0]), results_1[1], num_wells, budget, revenue_per_barrel)
bootstrap_results_2 = bootstrap_profit(pd.Series(results_2[0]), results_2[1], num_wells, budget, revenue_per_barrel)

bootstrap_results_0, bootstrap_results_1, bootstrap_results_2

((3974506.9685117737, -1063334.0632409106, 8998406.131622046, 6.7),
 (4519940.928534534, 596225.55185312, 8309269.211707684, 1.0),
 (3737761.251407242, -1443179.749425677, 8918476.40290279, 7.9))

The bootstrap output above gives the following mean profit, 95% confidence interval, and risk of loss for each region. Because the sampling is random, exact figures vary slightly between runs.

- **Region 0**: 
  - **Mean Profit**: \$3.97 million
  - **95% Confidence Interval**: [-\$1.06 million, \$9.00 million]
  - **Risk of Loss**: 6.7%

- **Region 1**: 
  - **Mean Profit**: \$4.52 million
  - **95% Confidence Interval**: [\$0.60 million, \$8.31 million]
  - **Risk of Loss**: 1.0%

- **Region 2**: 
  - **Mean Profit**: \$3.74 million
  - **95% Confidence Interval**: [-\$1.44 million, \$8.92 million]
  - **Risk of Loss**: 7.9%
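
Bootstrap figures are only repeatable when the random sampling is seeded. The sketch below shows one way to do that with NumPy's `Generator`, on synthetic reserves rather than the project data. The function `bootstrap_profit_seeded`, its `seed` parameter, and the synthetic arrays are assumptions made for illustration.

```python
import numpy as np

def bootstrap_profit_seeded(predicted, actual, k, budget, revenue_per_unit,
                            n_samples=1000, sample_size=500, seed=0):
    """Seeded bootstrap: mean profit, 95% interval bounds, and risk of loss (%)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_samples)
    for i in range(n_samples):
        # Sample well positions with replacement.
        idx = rng.integers(0, len(predicted), size=sample_size)
        # Keep the k sampled wells with the highest predictions.
        top_k = idx[np.argsort(predicted[idx])[::-1][:k]]
        profits[i] = actual[top_k].sum() * revenue_per_unit - budget
    lower, upper = np.percentile(profits, [2.5, 97.5])
    risk = float(np.mean(profits < 0) * 100)
    return float(profits.mean()), float(lower), float(upper), risk

# Synthetic reserves and noisy predictions (illustrative only).
data_rng = np.random.default_rng(42)
actual = data_rng.normal(90, 40, size=5000)
predicted = actual + data_rng.normal(0, 30, size=5000)

# Two runs with the same seed return identical results.
run_a = bootstrap_profit_seeded(predicted, actual, 200, 100e6, 4.5e3, seed=1)
run_b = bootstrap_profit_seeded(predicted, actual, 200, 100e6, 4.5e3, seed=1)
```

Fixing the seed keeps the reported numbers and the written summary in agreement across reruns.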

### 1.1 Final Conclusion

After analyzing the three candidate regions, we arrived at a data-driven recommendation for OilyGiant. The analysis involved three steps:

- **Modeling and Prediction**: A linear regression model was trained for each region to predict the volume of oil reserves per well. Each model was evaluated on a 25% validation set using RMSE, alongside the average predicted reserves.

- **Profitability Assessment**: We calculated the minimum reserves a well needs to break even (111.11 thousand barrels) and compared it with the average predicted reserves in each region. We then calculated the profit from the 200 wells with the highest predicted reserves in each region.

- **Risk Analysis**: Using bootstrapping, we drew 1,000 samples of 500 wells per region, selected the 200 wells with the highest predictions in each sample, and computed the resulting profit. The distribution of these profits gives the mean profit, a 95% confidence interval, and the risk of loss.

### 1.2 Key Findings

- **Region 1** is the most favorable option, with a mean profit of **\$4.52 million**. Its 95% confidence interval ranges from **\$0.60 million to \$8.31 million** and is the only one that lies entirely above zero. Its risk of loss is the lowest at **1.0%**, making Region 1 both the safest and the most profitable choice.

- **Region 0** shows a mean profit of **\$3.97 million**, with a 95% confidence interval between **-\$1.06 million and \$9.00 million**. The risk of loss is **6.7%**, well above that of Region 1. Although there is potential for profit, the higher risk may make this region less attractive depending on the company’s risk tolerance.

- **Region 2** has the lowest mean profit at **\$3.74 million** and the highest risk of loss at **7.9%**. Its confidence interval ranges from **-\$1.44 million to \$8.92 million**, the widest of the three. This makes Region 2 the least favorable for development.
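
The comparison can also be expressed as a simple ranking rule. The sketch below takes the values printed by the bootstrap cell, rounded, and sorts regions by risk of loss first and mean profit second. The dictionary layout and the ordering rule are assumptions made for illustration, not a criterion stated in the project.

```python
# Rounded values from the printed bootstrap output (USD and percent).
regions = {
    "Region 0": {"mean_profit": 3.97e6, "ci": (-1.06e6, 9.00e6), "risk": 6.7},
    "Region 1": {"mean_profit": 4.52e6, "ci": (0.60e6, 8.31e6), "risk": 1.0},
    "Region 2": {"mean_profit": 3.74e6, "ci": (-1.44e6, 8.92e6), "risk": 7.9},
}

# Rank by lowest risk of loss, breaking ties by highest mean profit.
ranked = sorted(
    regions,
    key=lambda name: (regions[name]["risk"], -regions[name]["mean_profit"]),
)

# Regions whose whole 95% interval is above zero.
ci_above_zero = [name for name, r in regions.items() if r["ci"][0] > 0]
```

With these inputs, `ranked` is `["Region 1", "Region 0", "Region 2"]` and `ci_above_zero` contains only Region 1.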

### 1.3 Recommendation

Based on the analysis above:

- **Region 1** is recommended as the location for developing new oil wells. It has the highest mean bootstrap profit and the lowest risk of loss of the three regions.

- **Region 0** could be considered as a secondary option. It has a lower mean profit and a higher risk of loss than Region 1, so its suitability depends on the company’s appetite for risk.

- **Region 2**, with the lowest mean profit and the highest risk of loss, is not recommended for development.

This recommendation favors the region that combines the highest expected profit with the lowest risk of loss.