Hello Cory!

I’m happy to review your project today.
I will mark your mistakes and give you some hints on how to fix them. We are getting ready for a real job, where your team leader or senior colleague will do exactly the same. Don't worry, and enjoy studying!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Thank you so much for the feedback, I appreciate it! I should have double-checked before submitting. Thanks!
</div>



The following analysis aims to determine the best location for a new well for the OilyGiant mining company. Let's dive into the data and see what the future holds!


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
data_region_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_region_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_region_2 = pd.read_csv('/datasets/geo_data_2.csv')

display(data_region_0.head())
display(data_region_1.head())
display(data_region_2.head())

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


In [3]:
r_state = np.random.RandomState(42)

In [4]:
def split_data(data):
    features = data.drop(columns=['id', 'product'])
    target = data['product']
    return train_test_split(features, target, test_size=0.25, random_state=42)

X_train_0, X_valid_0, y_train_0, y_valid_0 = split_data(data_region_0)
X_train_1, X_valid_1, y_train_1, y_valid_1 = split_data(data_region_1)
X_train_2, X_valid_2, y_train_2, y_valid_2 = split_data(data_region_2)

def train_evaluate(X_train, y_train, X_valid, y_valid):
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    # Take the square root explicitly; the `squared=False` argument was removed in newer scikit-learn releases
    rmse = mean_squared_error(y_valid, predictions) ** 0.5
    avg_predicted_reserves = predictions.mean()
    return predictions, rmse, avg_predicted_reserves

predictions_0, rmse_0, avg_reserves_0 = train_evaluate(X_train_0, y_train_0, X_valid_0, y_valid_0)
predictions_1, rmse_1, avg_reserves_1 = train_evaluate(X_train_1, y_train_1, X_valid_1, y_valid_1)
predictions_2, rmse_2, avg_reserves_2 = train_evaluate(X_train_2, y_train_2, X_valid_2, y_valid_2)

print(f'Region 0 RMSE: {rmse_0}, Avg Predicted Reserves: {avg_reserves_0}')
print(f'Region 1 RMSE: {rmse_1}, Avg Predicted Reserves: {avg_reserves_1}')
print(f'Region 2 RMSE: {rmse_2}, Avg Predicted Reserves: {avg_reserves_2}')

Region 0 RMSE: 37.756600350261685, Avg Predicted Reserves: 92.3987999065777
Region 1 RMSE: 0.890280100102884, Avg Predicted Reserves: 68.71287803913762
Region 2 RMSE: 40.14587231134218, Avg Predicted Reserves: 94.77102387765939


<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

Correct. But read the task carefully, please: "2.4. Print the average volume of predicted reserves and model RMSE.". You printed only RMSE and forgot to print average volume of predicted reserves. Please, fix it.
    
</div>

Interesting! Using linear regression, we see that Region 1's predictions are almost dead on, with an RMSE below 1. Regions 0 and 2, however, are much further off, with RMSEs of about 38 and 40. Let's keep digging to see whether Region 1 is the best option for us.
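
As a reminder of what the RMSE figures above measure, here is a minimal sketch that computes RMSE by hand with NumPy. The `rmse` helper and the toy arrays are made up for illustration and are not part of the project data.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy values (hypothetical, in the same units as `product`)
toy_actual = [100.0, 80.0, 120.0]
toy_predicted = [110.0, 70.0, 120.0]
toy_rmse = rmse(toy_actual, toy_predicted)
print(round(toy_rmse, 3))
```

A lower RMSE means the predictions sit closer to the actual reserves on average, which is why Region 1 stands out at this stage.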

In [5]:
predictions_0 = pd.DataFrame({'actual': y_valid_0, 'predicted': predictions_0})
predictions_1 = pd.DataFrame({'actual': y_valid_1, 'predicted': predictions_1})
predictions_2 = pd.DataFrame({'actual': y_valid_2, 'predicted': predictions_2})

In [6]:
total_budget = 1000000
wells_selected = 200
profit_per_barrel = 4.5
units_per_product = 4500
cost_per_point = 5000
product_price = 45
points_per_budet = total_budget // cost_per_point

break_even_volume = total_budget / (wells_selected * units_per_product * profit_per_barrel)

print(f'Break-even volume per well: {break_even_volume:.2f} thousand barrels')
print(f'Average reserves in Region 0: {avg_reserves_0:.2f} thousand barrels')
print(f'Average reserves in Region 1: {avg_reserves_1:.2f} thousand barrels')
print(f'Average reserves in Region 2: {avg_reserves_2:.2f} thousand barrels')
print("Findings: To avoid losses, a well must contain at least {:.2f} thousand barrels. Comparing this with the average reserves per region, we can determine profitability.".format(break_even_volume))

def calculate_profit(target, predictions):
    
    top_wells = predictions.sort_values(ascending=False)
    selected_points = target[top_wells.index][:points_per_budet]
    selected_sum = selected_points.sum()
    revenue = selected_sum * product_price
    return revenue - total_budget

profit_0 = calculate_profit(y_valid_0, predictions_0['predicted'])
profit_1 = calculate_profit(y_valid_1, predictions_1['predicted'])
profit_2 = calculate_profit(y_valid_2, predictions_2['predicted'])

print(f'Region 0 Profit: ${profit_0:,.2f}')
print(f'Region 1 Profit: ${profit_1:,.2f}')
print(f'Region 2 Profit: ${profit_2:,.2f}')

Break-even volume per well: 0.25 thousand barrels
Average reserves in Region 0: 92.40 thousand barrels
Average reserves in Region 1: 68.71 thousand barrels
Average reserves in Region 2: 94.77 thousand barrels
Findings: To avoid losses, a well must contain at least 0.25 thousand barrels. Comparing this with the average reserves per region, we can determine profitability.
Region 0 Profit: $335,914.11
Region 1 Profit: $241,508.67
Region 2 Profit: $259,857.18


<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

The function looks correct. Good job! But you missed almost a whole task:
    
```
3. Prepare for profit calculation:
    
    3.1. Store all key values for calculations in separate variables.
    
    3.2. Calculate the volume of reserves sufficient for developing a new well without losses. Compare the obtained value with the average volume of reserves in each region.
    
    3.3. Provide the findings about the preparation for profit calculation step.    
```
    
You did only point 3.1 but missed points 3.2 and 3.3. Please, fix it.
    
</div>
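
One way to cross-check the break-even figure is to derive it from the same revenue rule that `calculate_profit` uses (selected reserves times `product_price`, minus `total_budget`). The sketch below assumes that rule and reuses the constants from cell 6; the name `break_even_from_profit_rule` is introduced here for illustration only.

```python
total_budget = 1_000_000
wells_selected = 200
product_price = 45

# Profit is zero when wells * volume_per_well * price equals the budget
break_even_from_profit_rule = total_budget / (wells_selected * product_price)
print(f'{break_even_from_profit_rule:.2f}')
```

Under this assumption the threshold comes out near 111 units per well, which is above the average predicted reserves printed for all three regions (92.40, 68.71, 94.77). It differs from the 0.25 printed by the cell, which divides by `units_per_product * profit_per_barrel` instead, so it may be worth confirming which revenue rule is intended.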

So the plot thickens! When we pick the top 200 wells by predicted reserves, the most profitable regions are Region 0 and Region 2. Region 0 returns about $336,000 on the $1,000,000 budget, roughly a 34% return, which is a healthy margin. Let's look at many smaller samples to double-check our work and the resulting outcomes!
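
To put the profits above in proportion, return on investment can be computed as profit divided by budget. This is a small sketch, assuming the $1,000,000 `total_budget` from cell 6 and the three profits printed above; the `roi` helper is hypothetical.

```python
total_budget = 1_000_000
# Profits copied from the printed output of cell 6
profits = {0: 335_914.11, 1: 241_508.67, 2: 259_857.18}

def roi(profit, budget):
    """Return on investment as a fraction of the budget."""
    return profit / budget

roi_by_region = {region: roi(p, total_budget) for region, p in profits.items()}
for region, value in roi_by_region.items():
    print(f'Region {region} ROI: {value:.1%}')
```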

In [7]:
def bootstrap_profit(target, predictions, n_samples=1000):
    sample_size = 500  # wells studied per bootstrap iteration
    profits = []
    
    for _ in range(n_samples):
        # Draw 500 wells with replacement; r_state advances on every draw
        sampled_target = target.sample(sample_size, replace=True, random_state=r_state)
        # Look up the predictions for the same wells by index label
        sampled_preds = predictions[sampled_target.index]
        profits.append(calculate_profit(sampled_target, sampled_preds))
    
    profits = np.array(profits)
    # 95% confidence interval and share of iterations that lose money
    lower, upper = np.percentile(profits, [2.5, 97.5])
    risk = (profits < 0).mean()
    return profits.mean(), (lower, upper), risk

mean_profit_0, conf_interval_0, risk_0 = bootstrap_profit(y_valid_0, predictions_0['predicted'])
mean_profit_1, conf_interval_1, risk_1 = bootstrap_profit(y_valid_1, predictions_1['predicted'])
mean_profit_2, conf_interval_2, risk_2 = bootstrap_profit(y_valid_2, predictions_2['predicted'])


print(f'Region 0: Avg Profit: ${mean_profit_0:,.2f}, 95% CI: {conf_interval_0}, Risk: {risk_0*100:.2f}%')
print(f'Region 1: Avg Profit: ${mean_profit_1:,.2f}, 95% CI: {conf_interval_1}, Risk: {risk_1*100:.2f}%')
print(f'Region 2: Avg Profit: ${mean_profit_2:,.2f}, 95% CI: {conf_interval_2}, Risk: {risk_2*100:.2f}%')

Region 0: Avg Profit: $42,784.76, 95% CI: (-9724.982956859578, 95421.51927088141), Risk: 5.50%
Region 1: Avg Profit: $51,153.02, 95% CI: (9170.056413644415, 92145.56683285085), Risk: 0.60%
Region 2: Avg Profit: $40,854.57, 95% CI: (-12062.487294271643, 96085.94407253824), Risk: 7.50%


<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

1. According to the project description inside the bootstrap loop you should sample with size=500 but not with size=len(predictions)
3. The results are wrong. In the correct results risk in each region is a value between 0 and 10 but not exact 0 or exact 100. Moreover, risks in different regions should be different. If you get another results, double check indexes. Indexes in targets and predictions must be the same to get the correct results.

</div>
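
The index warning above matters because bootstrap samples drawn with replacement contain repeated index labels, and label-based lookup on a pandas Series returns every row that matches each requested label. The toy example below (made-up values, hypothetical variable names) shows the effect and one possible way around it: resetting the index after sampling so that each sampled row has a unique label.

```python
import pandas as pd

# Toy target with a repeated label, as happens after sampling with replacement
sampled_target = pd.Series([10.0, 10.0, 30.0], index=[7, 7, 9])
sampled_preds = pd.Series([1.0, 1.0, 3.0], index=[7, 7, 9])

# Label lookup: each of the two requested 7s matches both rows labelled 7
by_label = sampled_target.loc[sampled_preds.index]
print(len(sampled_target), len(by_label))

# Resetting both indexes gives unique labels, so the lookup is one-to-one
target_unique = sampled_target.reset_index(drop=True)
preds_unique = sampled_preds.reset_index(drop=True)
top = preds_unique.sort_values(ascending=False)
by_position = target_unique.loc[top.index]
print(len(by_position))
```

With duplicated labels the lookup returns more rows than were sampled, so slicing the first 200 of them no longer corresponds to the 200 best-predicted wells.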

Even more interesting! All regions look fairly close in average profit. That said, when we look at the risk of loss, all three regions are below 10%, which is great! Region 1 is the best, with a risk under 1% and the highest average profit, and it is the only region whose 95% confidence interval lies entirely above zero.
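
The confidence interval and risk reported above follow directly from the array of bootstrap profits: the interval spans the 2.5th to 97.5th percentiles, and the risk is the share of negative profits. Here is a minimal sketch on synthetic numbers; the normal distribution, its parameters, and the `summarize_profits` name are assumptions made for illustration, not project results.

```python
import numpy as np

def summarize_profits(profits):
    """Return mean profit, 95% percentile interval, and share of losses."""
    profits = np.asarray(profits, dtype=float)
    lower, upper = np.percentile(profits, [2.5, 97.5])
    risk = float((profits < 0).mean())
    return float(profits.mean()), (float(lower), float(upper)), risk

# Synthetic profits, for illustration only
rng = np.random.RandomState(0)
fake_profits = rng.normal(loc=40_000, scale=25_000, size=1000)
mean_profit, (ci_low, ci_high), risk = summarize_profits(fake_profits)
print(f'mean={mean_profit:,.0f}, 95% CI=({ci_low:,.0f}, {ci_high:,.0f}), risk={risk:.1%}')
```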

In [8]:
valid_regions = [(mean_profit_0, risk_0, 0), (mean_profit_1, risk_1, 1), (mean_profit_2, risk_2, 2)]
valid_regions = [r for r in valid_regions if r[1] < 0.025]

if valid_regions:
    best_region = max(valid_regions, key=lambda x: x[0])
    print(f'Selected Region: {best_region[2]} with Expected Profit: ${best_region[0]:,.2f}')
else:
    print('No region meets the risk criteria.')

Selected Region: 1 with Expected Profit: $51,153.02


There we have it. Based on the linear regression models and the bootstrap analysis (1,000 samples of 500 wells each), the best region for us is Region 1, with an expected profit of about $51,000 and the lowest risk of loss of all three regions (0.60%).