<div class="alert alert-success">
<b>Reviewer's comment V3</b>

Thanks for taking the time to improve the project! It has come a long way, and now everything is perfect! Keep up the good work on the next sprint!

</div>

**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You made a pretty good start, but there are some problems with profit calculation and bootstrapping. Hopefully my comments will help you figure them out. Let me know if you have questions!

# OilyGiant Region Oil Well Analysis

The goal of this analysis is to take data provided for three regions and, using predictive models that we create, assess which region is projected to be the most profitable with the smallest risk to OilyGiant.

We will first prepare the data, then train a predictive model for each region and assess its quality. Once the models are trained, we will estimate oil reserves from the data provided and calculate the profit for OilyGiant.

The budget we are working with is 100 million USD for 200 wells. The price per unit of oil (1,000 barrels) is 4,500 USD.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error



# from sklearn.preprocessing import StandardScaler
# from sklearn.metrics import accuracy_score
# from sklearn.metrics import f1_score
# from sklearn.metrics import roc_auc_score 
# from sklearn.metrics import mean_absolute_error
# from sklearn.utils import shuffle
# from sklearn.metrics import precision_score, recall_score
# from sklearn.metrics import r2_score

In [2]:
df0 = pd.read_csv('/datasets/geo_data_0.csv')
df1 = pd.read_csv('/datasets/geo_data_1.csv')
df2 = pd.read_csv('/datasets/geo_data_2.csv')

## Preprocessing

### Region 0 Data

#### Overview of data

In [3]:
df0.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


In [4]:
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


#### Finding/Verifying no Missing Values or Duplicates

In [5]:
df0.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

In [6]:
df0.duplicated().sum()

0

There are no missing values or duplicate rows to handle in preprocessing.
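Note that `duplicated()` with no arguments only flags rows that repeat in every column. As a minimal sketch on a made-up three-row frame (not the project data), a repeated `id` with different measurements can be checked separately by passing `subset='id'`:

```python
import pandas as pd

# Toy frame (hypothetical values): the id 'a1' appears twice with different measurements
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'a1'],
    'f0': [0.1, 0.2, 0.3],
    'product': [10.0, 20.0, 30.0],
})

full_row_duplicates = toy.duplicated().sum()        # rows identical in every column
id_duplicates = toy.duplicated(subset='id').sum()   # rows whose id was already seen

print(full_row_duplicates)  # 0
print(id_duplicates)        # 1
```

Whether repeated ids matter depends on how the wells were recorded, so this is only a check to consider, not a required step.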

### Region 1 Data

#### Overview of data

In [7]:
df1.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


In [8]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


#### Finding/Verifying no Missing Values or Duplicates

In [9]:
df1.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

In [10]:
df1.duplicated().sum()

0

There are no missing values or duplicate rows to handle in preprocessing.

### Region 2 Data

#### Overview of data

In [11]:
df2.head()

| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


In [12]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


#### Finding/Verifying no Missing Values or Duplicates

In [13]:
df2.isna().sum()

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

In [14]:
df2.duplicated().sum()

0

There are no missing values or duplicate rows to handle in preprocessing.

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected

</div>

## Training Predictive Models

### Region 0

#### Prepare Data

In [15]:
train0, valid0 = train_test_split(df0, test_size=0.25, random_state=759638)

In [16]:
features_train0 = train0.drop(['id', 'product'], axis=1)
target_train0 = train0['product']
features_valid0 = valid0.drop(['id', 'product'], axis=1)
target_valid0 = valid0['product']

#### Train Model

In [17]:
model0 = LinearRegression()
model0.fit(features_train0, target_train0) # train model on training set
predictions_valid0 = model0.predict(features_valid0) # get model predictions on validation set

result0 = mean_squared_error(target_valid0, predictions_valid0) ** 0.5

### Region 1

#### Prepare Data

In [18]:
train1, valid1 = train_test_split(df1, test_size=0.25, random_state=759638)

In [19]:
features_train1 = train1.drop(['id', 'product'], axis=1)
target_train1 = train1['product']
features_valid1 = valid1.drop(['id', 'product'], axis=1)
target_valid1 = valid1['product']

#### Train Model

In [20]:
model1 = LinearRegression()
model1.fit(features_train1, target_train1) # train model on training set
predictions_valid1 = model1.predict(features_valid1) # get model predictions on validation set

result1 = mean_squared_error(target_valid1, predictions_valid1) ** 0.5

### Region 2

#### Prepare Data

In [21]:
train2, valid2 = train_test_split(df2, test_size=0.25, random_state=759638)

In [22]:
features_train2 = train2.drop(['id', 'product'], axis=1)
target_train2 = train2['product']
features_valid2 = valid2.drop(['id', 'product'], axis=1)
target_valid2 = valid2['product']

#### Train Model

In [23]:
model2 = LinearRegression()
model2.fit(features_train2, target_train2) # train model on training set
predictions_valid2 = model2.predict(features_valid2) # get model predictions on validation set

result2 = mean_squared_error(target_valid2, predictions_valid2) ** 0.5
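The three region blocks above repeat the same steps, so they could be folded into one helper. The following is a minimal sketch, not the notebook's actual code: `fit_region` is a hypothetical name, and the demo frame is synthetic data standing in for a region file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_region(df, seed=759638):
    """Split one region 75/25, fit a linear regression, score it on the validation part."""
    train, valid = train_test_split(df, test_size=0.25, random_state=seed)
    features = ['f0', 'f1', 'f2']
    model = LinearRegression().fit(train[features], train['product'])
    # assign() returns a new frame holding the predictions next to the targets
    valid = valid.assign(projection=model.predict(valid[features]))
    rmse = mean_squared_error(valid['product'], valid['projection']) ** 0.5
    return model, valid, rmse


# Synthetic stand-in for a region file (hypothetical values, not the project data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['f0', 'f1', 'f2'])
demo['id'] = [f'w{i}' for i in range(len(demo))]
demo['product'] = 50 + 10 * demo['f2'] + rng.normal(scale=5, size=len(demo))

model, valid, rmse = fit_region(demo)
print(len(valid), round(rmse, 2))
```

Calling the helper once per region would replace the three copies of the split, fit, and score cells while keeping the same `random_state`.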

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright, you split the data in each region into train and validation sets, correctly trained and evaluated the model

</div>

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

The way you used `cross_val_score` is incorrect though: it is meant to be used on the train set, because it does [k-fold cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html) under the hood. The data is split into k parts, and then it iterates through them treating one part as validation subset and the remaining k-1 parts as train subset.
    
Doing this on the validation set means we don't use the train set at all, and the validation set is small, so we're training the models using very little data and evaluating them using even less data.
    
Apart from that it's not clear what metric is meant by 'model evaluation score' (for example are higher values better or lower?)

</div>

<div class="alert alert-info">
  I eliminated the cross-validation portion and any displayed scores in the three sections above, as I have the results below. I'm not sure why I did the cross-validation after reading the prompts a couple of times, but it's been removed since it wasn't even being done properly. Now the training and validation sets should be correct.
    
  Additionally, the RMSE is the metric used below which should be understandable compared to what was above.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Ok! Cross-validation will be used in the next sprint's project :)

</div>
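For reference, here is a minimal sketch of the usage described in the comments above, with cross-validation run on the training part only. It assumes synthetic data in place of the region files. Because scikit-learn scorers follow a "higher is better" convention, RMSE is returned negated and is flipped back here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic features and target (hypothetical values, not the project data)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 50 + 10 * X[:, 2] + rng.normal(scale=5, size=400)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 5-fold cross-validation on the training part only; the validation part stays untouched
scores = cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=5, scoring='neg_root_mean_squared_error',
)
rmse_per_fold = -scores  # flip the sign to get ordinary RMSE, where lower is better

print(rmse_per_fold.mean())
```

Naming the metric explicitly through `scoring` also answers the question of whether higher or lower values are better.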

### Predictive Model Results

In [24]:
print("Region 0 predicted average oil reserve", predictions_valid0.mean())
print("RMSE of Region 0 linear regression model on the validation set:", result0)
print()
print("Region 1 predicted average oil reserve", predictions_valid1.mean())
print("RMSE of Region 1 linear regression model on the validation set:", result1)
print()
print("Region 2 predicted average oil reserve", predictions_valid2.mean())
print("RMSE of Region 2 linear regression model on the validation set:", result2)

Region 0 predicted average oil reserve 92.56627068991745
RMSE of Region 0 linear regression model on the validation set: 37.645877640445036

Region 1 predicted average oil reserve 68.86387538632056
RMSE of Region 1 linear regression model on the validation set: 0.8880150007927671

Region 2 predicted average oil reserve 95.02735343132912
RMSE of Region 2 linear regression model on the validation set: 39.966229898885025


Regions 0 and 2 are similar: both have higher average predicted oil reserves but much less accurate models (RMSE of about 37.6 and 40.0 units). Region 1 has a lower average predicted reserve, but the data provided for it produced a far more accurate model (RMSE of about 0.89 units).

<div class="alert alert-success">
<b>Reviewer's comment</b>

Ok, very well!

</div>

## Profit Calculation

### Oil Reserve Threshold Calculation

With a budget of 100 million USD for the development of 200 oil wells, and a revenue of 4.5 USD per barrel or 4,500 USD per unit of 1,000 barrels of oil, we will calculate the minimum reserve for a well to make a profit.

```100 million USD / 200 oil wells = 500,000 USD per oil well```

```500,000 USD / 4,500 USD per unit of oil = 111.11 units of oil required for profit from a single well```

<div class="alert alert-success">
<b>Reviewer's comment</b>

Calculation is correct!

</div>

In [25]:
budget = 100000000
well_count = 200
well_cost = budget / well_count
unit_price = 4500
reserve_threshold = well_cost / unit_price
r0_average = predictions_valid0.mean()
r1_average = predictions_valid1.mean()
r2_average = predictions_valid2.mean()

In [26]:
# assign() returns a new DataFrame, so no column is written into a slice
# of the original data and pandas raises no SettingWithCopyWarning
valid0 = valid0.assign(projection=predictions_valid0)
valid1 = valid1.assign(projection=predictions_valid1)
valid2 = valid2.assign(projection=predictions_valid2)


In [27]:
print("Threshold of reserves to break even is", reserve_threshold)
print("Region 0 Average Predicted Oil Reserve", r0_average)
print("Region 1 Average Predicted Oil Reserve", r1_average)
print("Region 2 Average Predicted Oil Reserve", r2_average)

Threshold of reserves to break even is 111.11111111111111
Region 0 Average Predicted Oil Reserve 92.56627068991745
Region 1 Average Predicted Oil Reserve 68.86387538632056
Region 2 Average Predicted Oil Reserve 95.02735343132912


A well needs just over 111 units of oil to break even in this scenario. The average predicted reserves per well in Regions 0 and 2 (about 92.6 and 95.0 units) are close to that requirement. Region 1 is less promising on average, with a predicted reserve of 68.86 units, even though its model is much more reliable.

We will determine which option is profitable at lower risk: Regions 0 and 2, with higher average predicted reserves but less accurate models, or Region 1, with a lower predicted average but a very accurate model.

Because the averages in Regions 0 and 2 sit not far below the threshold, we expect each of them to have at least 200 well locations predicted above the minimum of 111.11 units. To determine whether Region 1 has enough, and to verify the other two regions, we will write a function that selects a number of well locations in a region and calculates their profit.
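The expectation about the number of wells above the threshold can be checked directly by counting them. The sketch below uses hypothetical, normally distributed predictions in place of a region's `projection` column, so the count it prints says nothing about the real regions.

```python
import numpy as np
import pandas as pd

budget = 100_000_000
well_count = 200
unit_price = 4500
reserve_threshold = budget / well_count / unit_price  # about 111.11 units

# Hypothetical predictions standing in for one region's 'projection' column
rng = np.random.default_rng(0)
projection = pd.Series(rng.normal(loc=92.5, scale=23, size=25_000))

# Count the wells predicted to clear the break-even threshold
above = int((projection > reserve_threshold).sum())
print(above, above >= well_count)
```

Applying the same comparison to `valid0`, `valid1`, and `valid2` would replace the estimate with an actual count per region.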

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright, so if we just selected the wells to develop randomly, we'd lose money in each region. Hopefully using our models to select the wells will improve the situation!

</div>

#### Profit predictions

In [33]:
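# Pass valid0, valid1 or valid2: wells are ranked by the 'projection' column,
# but profit is calculated from the actual 'product' values

def top_wells_profit(data, price, budget):
    # rank wells from highest to lowest predicted reserves
    ranked = data.sort_values(by='projection', ascending=False)
    # sum the actual reserves of the top well_count wells (well_count is set above)
    reserves = ranked['product'].head(well_count).sum()
    revenue = reserves * price
    profit = (revenue - budget).round(2)

    return profit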
    

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

The function for profit calculation should take not just predictions, but targets as its input. Because while we need to sort the wells using predictions, we need to use targets for actual profit calculation (predictions aren't necessarily correct, and targets have the information about the actual amount of product contained in the wells)

</div>

<div class="alert alert-info">
  I renamed predictions to data just to clarify when inputs are made. I'm unsure if that was the only issue with the function, or if there was more than just the naming convention I used for the incoming data. I'm using the [product] column for calculation from the original datasets as that is the same as the overall targets and with how I did the split, it's more efficient to use said datasets.
    
I also removed some of the extra outputs and narrowed it down to only profit for ease of use in the following bootstrap function.
</div>

<div class="alert alert-danger">
<s><b>Reviewer's comment V2</b>

It seems that you slightly missed my point: the function needs both targets ('product' column from the dataframes) and our model's predictions. What you need to do is to sort the wells by predictions, find the corresponding targets and use those to calculate profit. Now you're sorting based on targets which are unknown to us beforehand :)

</div>

<div class="alert alert-success">
<b>Reviewer's comment V3</b>

Great, the function is correct now!

</div>
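The point made in the comments above (rank by predictions, pay out on targets) can also be written with the two inputs passed separately. This is a minimal sketch with a hypothetical function name and toy numbers. It works on positions, so it assumes `target` and `predictions` are aligned element by element.

```python
import numpy as np


def profit_from_top(target, predictions, count, price, budget):
    """Rank wells by predicted reserves, then price the actual reserves of the top `count`."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    # positions of the `count` largest predictions
    top = np.argsort(predictions)[::-1][:count]
    # revenue comes from what those wells actually contain
    return target[top].sum() * price - budget


# Toy numbers: the well with the largest actual reserve (200) is ranked third by the model
target = [10.0, 200.0, 50.0, 120.0]
predictions = [30.0, 90.0, 150.0, 100.0]

result = profit_from_top(target, predictions, count=2, price=1, budget=100)
print(result)  # 70.0
```

The two wells chosen are the ones predicted at 150 and 100, whose actual reserves are 50 and 120. The profit reflects the model's ranking errors instead of hiding them.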

In [36]:
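# one shared random state, so every subsample is different but the run is reproducible
state = np.random.RandomState(759638)

def calc_prof(data):
    profit = []
    for _ in range(1000):
        # draw 500 candidate locations with replacement (bootstrap)
        subsample = data.sample(n=500, replace=True, random_state=state)
        # profit from the best 200 wells among those 500, ranked by prediction
        profit.append(top_wells_profit(subsample, unit_price, budget))

    return profit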

<div class="alert alert-warning">
<b>Reviewer's comment</b>

The profit from overall top 200 wells is not very relevant: we don't actually have the budget to examine all possible well locations in a region. Remember that we're only making initial measurements at 500 randomly selected locations, then select 200 best wells out of those 500.

</div>

<div class="alert alert-info">
    I removed the irrelevant code above and shifted up a second function for the bootstrapping of the profits of 500 samples.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Ok, good!

</div>

#### Risk of Loss and Confidence Interval

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

It seems that you misunderstood the task a bit: we're not looking to find average profit from one well. What we'd like to do is to estimate the profit distribution for our whole experiment of sampling 500 wells and developing 200 best (by predictions) out of those 500 using bootstrapping. 

</div>

<div class="alert alert-info">
    I removed the wrong/not relevant code again and moved the bootstrapping function to the previous section.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Alright!

</div>

In [37]:
values0 = calc_prof(valid0)
values1 = calc_prof(valid1)
values2 = calc_prof(valid2)

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

The function for bootstrapping is not quite correct: inside the loop you need to sample 500 locations (we need both predictions and targets, make sure that they correspond to each other), pass predictions and targets into your profit function, and then append the calculated profit value (instead of `subsample.quantile(percentage)`) to the list of values

</div>

<div class="alert alert-danger">
<s><b>Reviewer's comment V2</b>

Note that in bootstrapping we need to use targets and predictions for the validation set only: otherwise we'd be cheating by making predictions for the wells the model was trained on. One way to fix this problem is to create a dataframe for each region with predictions and targets columns, then inside the bootstrapping function you can just sample from this dataframe, and inside the profit function sort the input by predictions and use targets to calculate profit.

</div>

<div class="alert alert-info">
    I modified the functions to be used with valid0, valid1, and valid2 (which are the validation sets of each region, with the predictions and product columns in the same dataframe). The top wells are selected based on the predictions, and the actual profit calculations are based on the products, so now there's no more cheating.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V3</b>

Excellent! Now everything looks good!

</div>

In [38]:
def risk_loss(data):
    # share of bootstrapped profit values that are below zero
    profits = pd.Series(data)
    rol = (profits < 0).mean()
    return rol

<div class="alert alert-warning">
<b>Reviewer's comment</b>

The calculation of risk of losses can be simplified if you convert the list of profit values to a `pd.Series`:
    
```python
risk_of_losses = (values < 0).mean()
```

</div>

<div class="alert alert-info">
    I removed the old code and used the line you provided me as it's a lot simpler. I don't exactly understand how it works - but I know it does... which is a start I suppose...?
    I completely understand it's more efficient than the loop I had written previously at the very least.
</div>

<div class="alert alert-warning">
<b>Reviewer's comment V2</b>

Let me explain how it works:
    
`values < 0` returns a pandas Series of booleans: True where a value is less than 0 and False otherwise. In arithmetic, True is treated as 1 and False as 0. So when we calculate the mean of the resulting series, we are adding up a 1 for each value less than 0 (in other words, counting the values less than 0) and dividing by the length of values.

</div>
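A tiny example with made-up profit values shows each step of that explanation:

```python
import pandas as pd

# Five made-up profit values, two of them losses
values = pd.Series([-2.0, 5.0, -1.0, 3.0, 4.0])

is_loss = values < 0                # True, False, True, False, False
loss_count = is_loss.sum()          # True counts as 1, so this is the number of losses
risk_of_losses = is_loss.mean()     # losses divided by the number of values

print(loss_count)      # 2
print(risk_of_losses)  # 0.4
```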

In [39]:
rol0 = risk_loss(values0)
rol1 = risk_loss(values1)
rol2 = risk_loss(values2)

In [40]:
values0 = pd.Series(values0)
ci0 = st.t.interval(0.95, len(values0)-1, loc=values0.mean(), scale=values0.sem())
values1 = pd.Series(values1)
ci1 = st.t.interval(0.95, len(values1)-1, loc=values1.mean(), scale=values1.sem())
values2 = pd.Series(values2)
ci2 = st.t.interval(0.95, len(values2)-1, loc=values2.mean(), scale=values2.sem())

<div class="alert alert-success">
<b>Reviewer's comment</b>

If you fix the problems with bootstrapping and profit calculation, then risk of losses and confidence interval calculation would be correct

</div>
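One point worth keeping in mind when reading the intervals: `st.t.interval` with `scale=values.sem()` describes the uncertainty of the mean profit, not the range in which a single bootstrap outcome falls. A common complement, shown here only as a sketch on hypothetical values (not the notebook's results), is the 2.5% and 97.5% quantiles of the bootstrapped profits.

```python
import numpy as np
import pandas as pd
from scipy import stats as st

# Hypothetical bootstrapped profits (made-up mean and spread, not the notebook's output)
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=4_500_000, scale=2_600_000, size=1000))

# 95% confidence interval for the mean profit
mean_ci = st.t.interval(0.95, len(values) - 1, loc=values.mean(), scale=values.sem())

# Range that contains the middle 95% of individual bootstrap outcomes
spread = (values.quantile(0.025), values.quantile(0.975))

print(mean_ci)
print(spread)
```

The quantile range is much wider than the interval for the mean, which is why a region can show a narrow, entirely positive confidence interval and still carry a non-zero risk of loss.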

In [41]:
display(values0.mean())
display(ci0)
display(rol0)

4531671.53696

(4368529.474920464, 4694813.598999537)

0.044

**Region 0:**
<p>Average regional profit: 4,531,671.54 USD
<p>95% confidence interval for average profit: 4,368,529.47 USD to 4,694,813.60 USD
<p>Risk of Loss: 4.4%

In [42]:
display(values1.mean())
display(ci1)
display(rol1)

4489637.16975

(4361776.251182999, 4617498.088317001)

0.011

**Region 1:**
<p>Average regional profit: 4,489,637.17 USD
<p>95% confidence interval for average profit: 4,361,776.25 USD to 4,617,498.09 USD
<p>Risk of Loss: 1.1%

In [43]:
display(values2.mean())
display(ci2)
display(rol2)

3845461.94387

(3681713.6209656126, 4009210.2667743876)

0.073

**Region 2:**
<p>Average regional profit: 3,845,461.94 USD
<p>95% confidence interval for average profit: 3,681,713.62 USD to 4,009,210.27 USD
<p>Risk of Loss: 7.3%

## Conclusion

Based on our findings, if OilyGiant were to sample 500 sites and select the best 200 to drill at, Region 1 is the recommended region for new oil wells. Regions 0 and 2 do not meet the criterion of a maximum risk of loss of 2.5% (their risks are 4.4% and 7.3%), while Region 1 does, with a risk of loss of only 1.1%.

While the two rejected regions could be profitable, and both have a higher average predicted reserve volume than the recommended region, Region 1's predictive model was far more accurate than the others, which allows better selection of profitable well locations.

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

Please check the results and the conclusion after fixing the problems above

</div>

<div class="alert alert-info">
I know that I'll likely need a revision as the last section "Profit Calculation" is far from ideal - especially since the Risk of Loss values are all zero, which doesn't make any sense to me. I apologize for any wasted time on your part dealing with it, but I'm at my limits trying to figure out how to make it work, and the last time I asked for assistance in Discord from the tutors, it took 2 days to get advice to go back to a function that I previously had (or similar at least) but didn't see as ideal because it actually required just as many lines of code had I just done it manually, repeating a few lines for the three data sets.
    
Any advice on anything you see would be appreciated, but especially the last section. Thanks, and sorry for any and all trouble.
</div>

<div class="alert alert-success">
<b>Reviewer's comment</b>

Yeah, don't worry! This project can be a bit overwhelming, so it's not a problem at all!
    
I left a few comments above which should help to get you on track. Hopefully they make sense, but if you have any questions let me know, and I'll try to answer in the next review :)

</div>

<div class="alert alert-info">
    I made the alterations that I believe needed to be made. I completely grasp the concept of bootstrapping, but writing the code I was less than confident, so I wouldn't be surprised if there were more issues for me to correct... especially after two of the regions presented no risk of loss after changes were made.
    Hopefully it's at least improved from last time.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

No problem! Just a couple more corrections, and we're good! :)

</div>

<div class="alert alert-info">
    I changed up the functions as mentioned above, and altered my Risk of Loss section (now that it makes more sense - because any region without a risk of loss is too good to be true in my mind). And the Conclusion has once again been altered.

If there is still an issue with the functions, I'm likely going to need some assistance with figuring it out. I can manage functions decently well, even though they're not my strong suit, but putting two together has been mentally taxing so far. Hopefully they're correct, but if they're not, I would ask for additional guidance on getting them to work right, or for a better way of doing it.
</div>

<div class="alert alert-success">
<b>Reviewer's comment V3</b>

Nope, no more issues! You did great! The results and conclusions make sense now, region choice is correct and justified! Well done! :)

</div>