Hello Munir!

I’m happy to review your project today.
I will mark your mistakes and give you some hints on how to fix them. We are getting ready for a real job, where your team lead or senior colleague will do exactly the same. Don't worry, and enjoy studying!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text
</div>

We aim to help the OilyGiant mining company identify the most profitable region for drilling new oil wells. Using geological data from three regions, we build a predictive model to estimate oil reserves at potential drilling sites. By selecting the top 200 wells with the highest predicted reserves and evaluating profitability through bootstrapping, we assess which region offers the best return on investment while minimizing financial risk.

In [14]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from numpy.random import RandomState
import matplotlib.pyplot as plt

In [15]:
import os
print(os.listdir())

['calls_per_user_per_month.csv', 'f95e0262-f610-4e6b-9daa-f66d08cb72c1.ipynb', '2fede7ea-9ca6-42a3-ba35-bf4142d2fcc0.ipynb', '676da3d0-d254-4f97-bec8-7178b52f9ff9.ipynb', '.touched', '72abb834-7923-4339-bc83-8dfd388a9c9b.ipynb', '533b6048-6249-4966-8990-afd3e7db5a18.ipynb', '.ipynb_checkpoints', '1st project Triple ten.ipynb', '6a82f125-37a9-47e3-98bf-a7bbb8a86a85.ipynb', 'internet_usage_per_user_per_month.csv', 'geo_data_1.csv', 'minutes_per_user_per_month.csv', '8a6ec703-2532-43eb-90a5-0e65d9d612d9-Copy1.ipynb', 'messages_per_user_per_month.csv', 'f6575730-d902-46ad-9c6c-3526ab6c1d28.ipynb', '8a6ec703-2532-43eb-90a5-0e65d9d612d9.ipynb', 'geo_data_0.csv', '25ef8da9-6c14-4615-b68f-f7ceffea2b02.ipynb', 'A2.ipynb', 'p3.ipynb', 'final_user_data_with_plans.csv', 'dfff2238-9606-46d6-b1a3-e1d904bb0937.ipynb', 'geo_data_2.csv', '8a6ec703-2532-43eb-90a5-0e65d9d612d9-Copy2.ipynb']


In [16]:
# Train and test the model for each region
# Split the data into training and validation sets.
# We will load the datasets and split each into a training set (75%) and a validation set (25%).

# Load datasets
data_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_2 = pd.read_csv('/datasets/geo_data_2.csv')

# Function to split data
def split_data(data):
    features = data.drop(columns=['product', 'id'])
    target = data['product']
    return train_test_split(features, target, test_size=0.25, random_state=12345)

# Split data for each region
X_train_0, X_valid_0, y_train_0, y_valid_0 = split_data(data_0)
X_train_1, X_valid_1, y_train_1, y_valid_1 = split_data(data_1)
X_train_2, X_valid_2, y_train_2, y_valid_2 = split_data(data_2)

In [17]:
# Train the model and make predictions
# We will train a linear regression model on the training set and make predictions on the validation set.

def train_and_predict(X_train, y_train, X_valid):
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    return predictions, model

# Train and predict for each region
predictions_0, model_0 = train_and_predict(X_train_0, y_train_0, X_valid_0)
predictions_1, model_1 = train_and_predict(X_train_1, y_train_1, X_valid_1)
predictions_2, model_2 = train_and_predict(X_train_2, y_train_2, X_valid_2)

In [18]:
# Save predictions and correct answers
# We will save the predictions and the actual values for the validation set.

results_0 = pd.DataFrame({'Actual': y_valid_0, 'Predicted': predictions_0})
results_1 = pd.DataFrame({'Actual': y_valid_1, 'Predicted': predictions_1})
results_2 = pd.DataFrame({'Actual': y_valid_2, 'Predicted': predictions_2})

In [19]:
# Print average volume of predicted reserves and model RMSE
# We will calculate and print the average predicted reserves and RMSE for each region.

def print_results(predictions, actual):
    # Take the square root explicitly: the `squared` argument of
    # mean_squared_error is deprecated in recent scikit-learn releases.
    rmse = np.sqrt(mean_squared_error(actual, predictions))
    avg_predicted = predictions.mean()
    print(f"Average predicted volume: {avg_predicted:.2f} thousand barrels")
    print(f"RMSE: {rmse:.2f}")

print("\nRegion 0")
print_results(predictions_0, y_valid_0)

print("\nRegion 1")
print_results(predictions_1, y_valid_1)

print("\nRegion 2")
print_results(predictions_2, y_valid_2)



Region 0
Average predicted volume: 92.59 thousand barrels
RMSE: 37.58

Region 1
Average predicted volume: 68.73 thousand barrels
RMSE: 0.89

Region 2
Average predicted volume: 94.97 thousand barrels
RMSE: 40.03
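
For reference, the RMSE reported above is the square root of the mean squared difference between actual and predicted volumes. The sketch below computes it by hand on hypothetical toy values (the arrays are made up for illustration and are not taken from the datasets), so the result is in the same units as the target, thousand barrels.

```python
import numpy as np

# Hypothetical toy values in thousand barrels (not from geo_data_*.csv).
actual = np.array([100.0, 80.0, 120.0, 60.0])
predicted = np.array([90.0, 85.0, 110.0, 70.0])

# RMSE = sqrt(mean((actual - predicted)^2))
errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))
print(f"RMSE: {rmse:.2f} thousand barrels")
```

Read this way, an RMSE of about 38 to 40 for Regions 0 and 2 means a typical prediction error of that many thousand barrels per well, while Region 1's model is far more precise.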


removed it

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Everything is correct. Good job!
    
</div>

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

It seems it's better to remove this cell because you have the same calculations below
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Fixed
    
</div>

In [20]:
# Prepare for profit calculation
# Store key values
# Define constants for budget, revenue per barrel, and the number of wells.

# Constants
BUDGET = 100_000_000  
REVENUE_PER_BARREL = 4.5  
WELLS_TO_SELECT = 200

In [21]:
# Prepare for profit calculation
# Calculate minimum volume of reserves
# We will calculate the minimum volume of reserves needed to avoid losses and compare it with the average volume of reserves in each region.

# Calculate average volume of reserves for each region
avg_volume_0 = data_0['product'].mean()
avg_volume_1 = data_1['product'].mean()
avg_volume_2 = data_2['product'].mean()

# Calculate minimum volume needed to avoid losses
min_volume_needed = BUDGET / (WELLS_TO_SELECT * REVENUE_PER_BARREL * 1000)

# Print average volumes
print("\nAverage Volumes of Reserves:")
print(f"Region 0: {avg_volume_0:.2f} thousand barrels")
print(f"Region 1: {avg_volume_1:.2f} thousand barrels")
print(f"Region 2: {avg_volume_2:.2f} thousand barrels")

# Compare with minimum volume needed
print("\nComparison with Minimum Volume Needed:")
print(f"Minimum volume needed per well: {min_volume_needed:.2f} thousand barrels")
print(f"Region 0 is {'sufficient' if avg_volume_0 >= min_volume_needed else 'not sufficient'}")
print(f"Region 1 is {'sufficient' if avg_volume_1 >= min_volume_needed else 'not sufficient'}")
print(f"Region 2 is {'sufficient' if avg_volume_2 >= min_volume_needed else 'not sufficient'}")


Average Volumes of Reserves:
Region 0: 92.50 thousand barrels
Region 1: 68.83 thousand barrels
Region 2: 95.00 thousand barrels

Comparison with Minimum Volume Needed:
Minimum volume needed per well: 111.11 thousand barrels
Region 0 is not sufficient
Region 1 is not sufficient
Region 2 is not sufficient


Findings About Profit Calculation Preparation
Based on the average volumes compared to the minimum volume needed, we can summarize the findings:

Region 0: The average volume (92.50 thousand barrels) is below the 111.11 thousand barrels needed per well to break even.
Region 1: The average volume (68.83 thousand barrels) is the lowest of the three and is also not sufficient.
Region 2: The average volume (95.00 thousand barrels) is the highest, but it is still not sufficient.

No region breaks even on an average well, so development is only profitable if the wells with the highest predicted reserves are selected.
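
The break-even arithmetic can be retraced step by step. The sketch below is a minimal illustration that reuses the budget, price, and well count from the constants cell and the regional averages printed above; the lowercase variable names are introduced here only for the example.

```python
# Values copied from the constants cell and the printed averages above.
budget = 100_000_000        # USD for developing all selected wells
revenue_per_barrel = 4.5    # USD per barrel
wells_to_select = 200

# Each well must earn back its share of the budget.
budget_per_well = budget / wells_to_select
# Volumes are measured in thousand barrels, hence the factor of 1000.
break_even_volume = budget_per_well / (revenue_per_barrel * 1000)

region_means = {"Region 0": 92.50, "Region 1": 68.83, "Region 2": 95.00}
shortfalls = {name: break_even_volume - mean for name, mean in region_means.items()}

print(f"Break-even volume per well: {break_even_volume:.2f} thousand barrels")
for name, gap in shortfalls.items():
    print(f"{name}: average well is {gap:.2f} thousand barrels short")
```

Every shortfall is positive, which is why the later steps pick only the 200 wells with the highest predicted reserves rather than drilling at random.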

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Well done!
    
</div>

Calculate Profit from Selected Oil Wells
Pick the Wells with the Highest Values of Predictions
We will create a function to select the top 200 wells based on the predictions.


In [22]:
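def bootstrap_profit_distribution(predictions, target, num_samples=1000, sample_size=500):
    # Pair predictions and targets by position so their indexes always match
    data = pd.DataFrame({'prediction': np.asarray(predictions), 'target': np.asarray(target)})
    state = RandomState(12345)
    profits = []
    for _ in range(num_samples):
        # Draw a fresh sample of wells with replacement on every iteration
        sample = data.sample(n=sample_size, replace=True, random_state=state)
        profits.append(calculate_profit_from_top_wells(sample))
    return np.array(profits)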

In [23]:
def calculate_profit_from_top_wells(data, top_n=200):
    top_wells = data.sort_values(by='prediction', ascending=False).head(top_n)
    total_revenue = top_wells['target'].sum() * REVENUE_PER_BARREL * 1000  
    profit = total_revenue - BUDGET
    return profit

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

1. You need to merge the functions `select_top_wells`, `calculate_total_reserves` and `calculate_profit` into a single function, and you need to use this function in the bootstrap loop below.
2. Be careful with indexes. The code inside `select_top_wells` works correctly only when the indexes of the targets and predictions are the same; otherwise it gives wrong results. Look at the variables predictions_0 and y_valid_0: they have different indexes.
    
</div>

<div class="alert alert-danger">
<b>Reviewer's comment V2</b>

1. Not fixed. I don't see a single function for profit calculation. Once again, you should replace the select_top_wells, calculate_total_reserves and calculate_profit functions with a single function. You also still have the problem with indexes that I described in my previous comment, which is why your results are still wrong.
2. You have a task for the bootstrap calculation below. You should not run the bootstrap twice; it does not make sense. So please remove the bootstrap code here and make the fixes according to my previous comment.
    
</div>

<div class="alert alert-danger">
<b>Reviewer's comment V3</b>

1. Remove this code: `sample = data.sample(n=sample_size, replace=True)`. You should not select 500 random wells inside the function for profit calculation. This should be done inside the bootstrap loop.
2. You should remove all the bootstrap code above because it should be done below, in the next task. You should not do the same thing twice.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V4</b>

The function `calculate_profit_from_top_wells` looks correct. Good job!
    
</div>
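
The index issue raised in the reviewer comments above can be reproduced on a toy example. The sketch below uses made-up values and index labels (they are hypothetical, not from the project data) to show how pandas aligns on index labels rather than on position, and one way to keep predictions and targets paired.

```python
import numpy as np
import pandas as pd

# Toy validation target with a shuffled index, like the output of train_test_split.
target = pd.Series([10.0, 50.0, 30.0], index=[7, 2, 5])
# Model output is a plain NumPy array: it carries no index at all.
predictions = np.array([12.0, 48.0, 33.0])

# Pitfall: wrapping predictions in a Series gives them a fresh 0..n-1 index,
# so pandas aligns on labels and pairs the wrong rows (or fills in NaN).
misaligned = pd.DataFrame({"prediction": pd.Series(predictions), "target": target})

# Safer: reuse the target's index so each prediction stays with its own well.
aligned = pd.DataFrame(
    {"prediction": pd.Series(predictions, index=target.index), "target": target}
)

print(misaligned)
print(aligned)
```

In the misaligned frame, label 2 pairs the third prediction with the second target and every other row contains a NaN. Resetting both indexes, or passing `target.values` as the final cell does, avoids the problem equally well.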

Calculate Risks and Profit for Each Region
Bootstrapping Technique
We will use bootstrapping to find the distribution of profit for each region.

Average Profit, 95% Confidence Interval, and Risk of Losses
We will calculate the average profit, 95% confidence interval, and the risk of losses for each region.

In [28]:
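# Constants (redefined for this cell; volumes in the data are in thousand barrels)
BUDGET = 100e6  # USD
REVENUE_PER_BARREL = 4500  # USD per thousand barrels, i.e. 4.5 USD per barrel * 1000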

# Profit function from top wells
def calculate_profit_from_top_wells(data, top_n=200):
    top_wells = data.sort_values(by='prediction', ascending=False).head(top_n)
    total_revenue = top_wells['target'].sum() * REVENUE_PER_BARREL
    return total_revenue - BUDGET

# Bootstrapping with correct profit function
def bootstrap_profit_distribution(predictions, target, num_samples=1000, sample_size=500, top_n=200):
    data = pd.DataFrame({'prediction': predictions, 'target': target}).reset_index(drop=True)
    profits = []
    for i in range(num_samples):
        sample = data.sample(n=sample_size, replace=True, random_state=42 + i).reset_index(drop=True)
        profit = calculate_profit_from_top_wells(sample, top_n)
        profits.append(profit)
    return np.array(profits)

# Statistics calculation
def calculate_statistics(profit_distribution):
    average_profit = np.mean(profit_distribution)
    lower_bound = np.percentile(profit_distribution, 2.5)
    upper_bound = np.percentile(profit_distribution, 97.5)
    risk_of_loss = (profit_distribution < 0).mean() * 100
    return average_profit, (lower_bound, upper_bound), risk_of_loss

# bootstrap and statistics calculation for each region

# Make sure predictions and targets have matching indexes
region_0_df = pd.DataFrame({'prediction': predictions_0, 'target': y_valid_0.values})
region_1_df = pd.DataFrame({'prediction': predictions_1, 'target': y_valid_1.values})
region_2_df = pd.DataFrame({'prediction': predictions_2, 'target': y_valid_2.values})

profits_0 = bootstrap_profit_distribution(region_0_df['prediction'], region_0_df['target'])
profits_1 = bootstrap_profit_distribution(region_1_df['prediction'], region_1_df['target'])
profits_2 = bootstrap_profit_distribution(region_2_df['prediction'], region_2_df['target'])

stats_0 = calculate_statistics(profits_0)
stats_1 = calculate_statistics(profits_1)
stats_2 = calculate_statistics(profits_2)

# Print results
print("\nBootstrap Results by Region:")
for i, stats in enumerate([stats_0, stats_1, stats_2]):
    print(f"Region {i}:")
    print(f"  Average Profit: ${stats[0]:,.2f}")
    print(f"  95% Confidence Interval: ${stats[1][0]:,.2f} - ${stats[1][1]:,.2f}")
    print(f"  Risk of Loss: {stats[2]:.2f}%\n")



Bootstrap Results by Region:
Region 0:
  Average Profit: $3,895,862.60
  95% Confidence Interval: $-780,737.48 - $8,773,652.89
  Risk of Loss: 6.50%

Region 1:
  Average Profit: $4,571,482.82
  95% Confidence Interval: $594,079.66 - $8,715,527.42
  Risk of Loss: 1.70%

Region 2:
  Average Profit: $3,860,356.28
  95% Confidence Interval: $-955,100.56 - $8,754,238.27
  Risk of Loss: 6.70%



<div class="alert alert-danger">
<b>Reviewer's comment V1</b>
 
Unfortunately, the results are wrong.
    
1. Inside the bootstrap loop you should use the function for profit calculation from the previous task. But before using this function, you need to merge the 3 functions into a single one.
2. According to the task description, inside the bootstrap loop you need to sample 500 random wells, not len(predictions) random wells.
    
In the correct results, the risk in each region is a value between 0 and 10, but not exactly 0 or exactly 100. Moreover, the risks in different regions should be different. If you get other results, double-check the indexes. The indexes in targets and predictions must be the same to get the correct results.
    
In the lesson you have an example of bootstrap for the task about students and lessons. I'd recommend repeating this task because the idea here is the same, and so the code is almost the same as well.

</div>

<div class="alert alert-danger">
<b>Reviewer's comment V2</b>

1. None of the points from my previous comment are fixed.
2. You should not create dicts like region_0_stats manually. You should use variables for it.
    
</div>

<div class="alert alert-danger">
<b>Reviewer's comment V3</b>

1. Inside the bootstrap loop you should use the `calculate_profit_from_top_wells` function. The function `calculate_profit` should be removed entirely.
2. You need to print the risk, average profit, and confidence interval for each region.
3. In the correct results, the risk in each region is a value between 0 and 10, but not exactly 0 or exactly 100. Moreover, the risks in different regions should be different. If you get other results, double-check the indexes. The indexes in targets and predictions must be the same to get the correct results.
    
</div>

<div class="alert alert-danger">
<b>Reviewer's comment V4</b>

You didn't run your functions and so there are no results. Please, run your functions and be sure that: "In the correct results risk in each region is a value between 0 and 10 but not exact 0 or exact 100. Moreover, risks in different regions should be different."
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V5</b>

Everything is correct now. Well done!
    
</div>

Best Options for Development:

Region 1: This region has the highest average profit ($4,571,482.82) and the lowest risk of loss (1.70%). Its 95% confidence interval ($594,079.66 to $8,715,527.42) lies entirely above zero.
Region 0: The average profit is lower ($3,895,862.60) and the risk of loss (6.50%) exceeds the 2.5% threshold.
Region 2: This region has the lowest average profit ($3,860,356.28) and the highest risk of loss (6.70%), so it also fails the risk criterion.

Justification for Choice:

Profitability: Region 1 has the highest average profit in the bootstrapping analysis, making it the most economically viable option for development.
Risk Assessment: Region 1 is the only region whose risk of loss is below the 2.5% threshold.

Conclusion

Based on the analysis, Region 1 is recommended for oil well development due to its superior profit potential and acceptable risk level.
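
The selection rule behind this recommendation can be written down explicitly. The sketch below is one possible way to encode it, assuming the rule is "keep regions with a risk of loss below 2.5%, then pick the highest average profit"; the figures are copied from the bootstrap output above, and the dictionary layout and names are introduced only for illustration.

```python
# Bootstrap results copied from the printed output above.
results = {
    "Region 0": {"avg_profit": 3_895_862.60, "risk_pct": 6.50},
    "Region 1": {"avg_profit": 4_571_482.82, "risk_pct": 1.70},
    "Region 2": {"avg_profit": 3_860_356.28, "risk_pct": 6.70},
}
MAX_RISK_PCT = 2.5  # risk threshold quoted in the risk assessment

# Step 1: drop regions whose risk of loss is too high.
eligible = {name: r for name, r in results.items() if r["risk_pct"] < MAX_RISK_PCT}
# Step 2: among the remaining regions, take the highest average profit.
best_region = max(eligible, key=lambda name: eligible[name]["avg_profit"])

print(f"Eligible regions: {list(eligible)}")
print(f"Recommended: {best_region}")
```

With these numbers only Region 1 passes the risk filter, so the profit comparison in step 2 does not change the outcome here; it would matter if more than one region met the threshold.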