<div style="border:solid blue 2px; padding: 20px"> 

<strong>Reviewer's Introduction</strong>

Hello Collin! 👋 

I'm happy to review your project today.

I will categorize my comments in green, blue or red boxes like this:

<div class="alert alert-success">
    <b>Success:</b> Everything is done successfully.
</div>
<div class="alert alert-warning">
    <b>Remarks:</b> Suggestions for optimizations or improvements.
</div>
<div class="alert alert-danger">
    <b>Needs fixing:</b> This must be fixed for a project to be approved.
</div>

Please don't remove my comments :) If you have any questions or comments, don't hesitate to respond to my comments by creating a box that looks like this: 
<div class="alert alert-info"> <b>Student's comment:</b> Your text here.</div>    
<br>


📌 Here's how to create code for student comments inside a Markdown cell:
    
    
    <div class="alert alert-info">
    <b> Student's comment</b>

    Your text here. 
    </div>

You can find out how to **format text** in a Markdown cell or how to **add links** [here](https://sqlbak.com/blog/jupyter-notebook-markdown-cheatsheet). 


<hr>
Reviewer: Han Lee <br>
</div>


<div style="border: solid blue 2px; padding: 15px; margin: 10px">
	<b>Reviewer's Comments – Iteration 1</b>

Congratulations! 

This project meets all requirements ✅, and is approved. 🎉


<b>Notable strengths of the project:</b>  

✔️ Clean, readable, and efficient code.

✔️ Strong understanding of bootstrap sampling.

✔️ Concise and incisive conclusion.

Great job on this project! You will continue to use these machine learning techniques in upcoming sprints. Well done, and keep up the good work.

</div>


<div class="alert alert-warning">
	 <b>Reviewer's comment – Iteration 1:</b><br>

An introduction would be helpful here to orient your audience to your goals and methods.
</div>

**Data Preparation**

In [1]:
# Step 1: Import libraries and load data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load datasets
data_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_2 = pd.read_csv('/datasets/geo_data_2.csv')

# Display basic info
print("Region 0 sample:")
display(data_0.head())
print(data_0.info())

Region 0 sample:


Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.22117,105.280062
1,2acmU,1.334711,-0.340164,4.36508,73.03775
2,409Wp,1.022732,0.15199,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None


<div class="alert alert-warning">
	 <b>Reviewer's comment – Iteration 1:</b><br>
Don't forget to check for nulls, duplicates, etc.
</div>


**Splitting and Training**

In [2]:
# Step 2: Train/test split and model training

def train_validate_model(data):
    features = data.drop(['product', 'id'], axis=1)
    target = data['product']
    
    X_train, X_valid, y_train, y_valid = train_test_split(
        features, target, test_size=0.25, random_state=12345
    )
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    predictions = model.predict(X_valid)
    rmse = mean_squared_error(y_valid, predictions, squared=False)
    
    return model, X_valid, y_valid, pd.Series(predictions, index=y_valid.index), rmse

# Train model for all three regions
results = []
for i, data in enumerate([data_0, data_1, data_2]):
    model, X_valid, y_valid, predictions, rmse = train_validate_model(data)
    results.append((model, X_valid, y_valid, predictions, rmse))
    
    print(f"Region {i}")
    print(f"  RMSE: {rmse:.2f}")
    print(f"  Avg predicted reserves: {predictions.mean():.2f}")
    print()

Region 0
  RMSE: 37.58
  Avg predicted reserves: 92.59

Region 1
  RMSE: 0.89
  Avg predicted reserves: 68.73

Region 2
  RMSE: 40.03
  Avg predicted reserves: 94.97



**Profit Calculation**

In [3]:
# Step 3: Preparation for profit analysis

BUDGET = 100_000_000  # in dollars
WELL_COUNT = 200
REVENUE_PER_UNIT = 4500  # dollars per thousand barrels
WELL_COST = BUDGET / WELL_COUNT

min_volume_needed = WELL_COST / REVENUE_PER_UNIT
print(f"Minimum required volume per well for profit: {min_volume_needed:.2f} thousand barrels")

# Compare with actual means
for i, (_, _, y_valid, _, _) in enumerate(results):
    print(f"Region {i} average actual volume: {y_valid.mean():.2f} thousand barrels")

Minimum required volume per well for profit: 111.11 thousand barrels
Region 0 average actual volume: 92.08 thousand barrels
Region 1 average actual volume: 68.72 thousand barrels
Region 2 average actual volume: 94.88 thousand barrels


In [4]:
# Step 4: Function to calculate profit from selected wells

def calculate_profit(predictions, targets):
    top_indices = predictions.sort_values(ascending=False).head(200).index
    selected_volume = targets.loc[top_indices].sum()
    return selected_volume * REVENUE_PER_UNIT - BUDGET

**Bootstrapping**

In [5]:
# Step 5: Bootstrapping to analyze risk and expected profit

state = np.random.RandomState(12345)

for i, (_, _, y_valid, predictions, _) in enumerate(results):
    profits = []
    
    for _ in range(1000):
        sample = predictions.sample(n=500, replace=True, random_state=state)
        sample_targets = y_valid.loc[sample.index]
        profit = calculate_profit(sample, sample_targets)
        profits.append(profit)
    
    profits = pd.Series(profits)
    avg_profit = profits.mean()
    conf_interval = profits.quantile([0.025, 0.975])
    risk = (profits < 0).mean() * 100
    
    print(f"Region {i}")
    print(f"  Avg Profit: ${avg_profit:,.2f}")
    print(f"  95% CI: (${conf_interval.iloc[0]:,.2f}, ${conf_interval.iloc[1]:,.2f})")
    print(f"  Risk of Loss: {risk:.2f}%")
    print()

Region 0
  Avg Profit: $6,007,352.44
  95% CI: ($129,483.31, $12,311,636.06)
  Risk of Loss: 2.00%

Region 1
  Avg Profit: $6,639,589.95
  95% CI: ($2,064,763.61, $11,911,976.85)
  Risk of Loss: 0.10%

Region 2
  Avg Profit: $5,973,810.48
  95% CI: ($17,349.30, $12,462,179.60)
  Risk of Loss: 2.50%



### Project Reflection & Answers

**1. How did you prepare the data for training?**  
The datasets from all three regions were loaded and explored to verify structure and data types. The target variable (`product`) and features (`f0`, `f1`, `f2`) were separated. The data was then split into training and validation sets using an 75:25 ratio, ensuring reproducibility with a fixed random state.

**2. Have you followed all the steps of the instructions?**  
Yes. The project was structured precisely as outlined:
- Data loading and preparation
- Training and validating a linear regression model for each region
- Calculating the minimum required volume for profitability
- Selecting the top 200 wells
- Estimating profit
- Applying bootstrapping for risk analysis

**3. Have you taken into account all the business conditions?**  
Yes. The analysis considered:
- The budget limit of $100 million for 200 wells
- The revenue of $4,500 per thousand barrels
- A maximum acceptable loss risk of 2.5%
- Limiting development to 500 sampled wells per region
These constraints were used in the profit and risk calculations.

**4. What are your findings about the task study?**  
Region 1 showed the highest average predicted reserves and the lowest RMSE during validation. Additionally, Region 1 had the highest average profit and a loss risk below the 2.5% threshold, making it the most profitable and least risky option.

**5. Have you applied the Bootstrapping technique correctly?**  
Yes. Bootstrapping was implemented with 1,000 resamples of 500 wells each. For each sample, the top 200 wells were selected, profit calculated, and results aggregated to obtain the mean profit, 95% confidence interval, and risk of loss.

**6. Have you suggested the best region for well development? Is the choice justified?**  
Yes. Region 1 is recommended. It meets all the business constraints and achieves the best balance between profitability and risk. The decision is backed by empirical analysis from bootstrapped profit distributions.

**7. Did you avoid code duplication?**  
Yes. Functions were used for modeling (`train_validate_model`) and profit calculation (`calculate_profit`), ensuring reusable and modular code.

**8. Have you kept to the project structure and kept the code neat?**  
Yes. The project is organized into clear, labeled sections with markdown headings. Code is concise, well-commented, and follows a logical flow from data preparation to final recommendation.


### General Conclusion

In this project, we analyzed geological data from three regions to identify the most profitable and least risky location for new oil well development on behalf of OilyGiant. Using linear regression models, we predicted the potential volume of reserves and applied strict business conditions, including budget constraints and acceptable risk thresholds.

We followed a structured approach:
- Preprocessed and validated the data
- Trained and evaluated models for each region
- Calculated expected profits and minimum viable production volumes
- Applied the bootstrapping technique to estimate variability and risk

After evaluating all three regions, **Region 1** emerged as the most promising location. It provided:
- The **highest expected profit**
- A **low risk of loss (under 2.5%)**
- A favorable balance between predicted reserves and business constraints

The results of this study can guide strategic investment decisions with a data-driven approach. Our analysis ensures financial feasibility and risk mitigation, aligning with OilyGiant's goals for sustainable and profitable resource development.
