<div class="alert alert-success">
<b>Reviewer's comment V3</b>

Thank you for taking the time to improve the project! Now it is accepted. Keep up the good work on the next sprint!

</div>

**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job overall, but there are a couple of small problems that need to be fixed before the project is accepted. Let me know if you have any questions!

# Predicting Best Well Location

The OilyGiant Mining Company is in the market to build a new well. They have data on oil reserves in three regions and want to know which location has the highest likelihood of producing the most oil, and therefore the most profit. A Linear Regression model has been requested to help predict the most profitable region for the new well.

Three models will be trained, one for each region (dataset). Once trained, each model will be used to predict oil reserve volumes for a validation dataset. The validation predictions will be used to calculate potential profit based on the 200 largest predicted oil reserves. Once that starting point has been interpreted and analyzed, a profit distribution will be collected for each region. The average profit, 95% confidence interval, and risk of loss will be calculated for each region and displayed. From those calculated values, we will determine the best and most profitable location for OilyGiant to build their new well.

## Initialization

In this section the necessary libraries will be imported, the data will be read into DataFrames, and a summary of the data will be briefly explored.

### Load libraries

All the important libraries that are utilized throughout this report are imported in the cell block below.

In [1]:
# Import the necessary libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import library for formatting currency
import locale

# Hide warning messages
import warnings
warnings.filterwarnings('ignore')

### Load data

Below, the csv files `geo_data_0.csv`, `geo_data_1.csv`, `geo_data_2.csv` will be read and stored into DataFrames.

In [2]:
# Read and store the csv files into DataFrames

df_0 = pd.read_csv('/datasets/geo_data_0.csv')
df_1 = pd.read_csv('/datasets/geo_data_1.csv')
df_2 = pd.read_csv('/datasets/geo_data_2.csv')

### Explore the data

Let's take a look at the data stored in each of the three DataFrames. The first 15 rows of each DataFrame will be printed, followed by a summary and general info of each DataFrame.

In [3]:
datasets = [df_0, df_1, df_2]

for i in range(len(datasets)):
    print('df_' + str(i) + ':')
    print(datasets[i].head(15))
    print()

df_0:
       id        f0        f1        f2     product
0   txEyH  0.705745 -0.497823  1.221170  105.280062
1   2acmU  1.334711 -0.340164  4.365080   73.037750
2   409Wp  1.022732  0.151990  1.419926   85.265647
3   iJLyR -0.032172  0.139033  2.978566  168.620776
4   Xdl7t  1.988431  0.155413  4.751769  154.036647
5   wX4Hy  0.969570  0.489775 -0.735383   64.741541
6   tL6pL  0.645075  0.530656  1.780266   49.055285
7   BYPU6 -0.400648  0.808337 -5.624670   72.943292
8   j9Oui  0.643105 -0.551583  2.372141  113.356160
9   OLuZU  2.173381  0.563698  9.441852  127.910945
10  b8WQ6  0.371066 -0.036585  0.009208   70.326617
11  1YYm1  0.015920  1.062729 -0.722248   45.110381
12  zIYPq -0.276476  0.924865  0.095584   89.158678
13  iqTqq  0.212696 -0.111147  5.770095  164.298520
14  Ct5yY -0.018578  0.187516  2.944683  158.633720

df_1:
       id         f0         f1        f2     product
0   kBEdx -15.001348  -8.276000 -0.005876    3.179103
1   62mP7  14.272088  -3.475083  0.999183   26.

In [4]:
datasets = [df_0, df_1, df_2]

for i in range(len(datasets)):
    print('df_' + str(i) + ':')
    print(datasets[i].info())
    print()

df_0:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None

df_1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None

df_2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 col

After reviewing the previews and summaries of the datasets, it does not look like any preprocessing needs to be performed. There are no missing values, and the datatypes for each column appear to be correct. We will move forward without making any changes to the data.

It is important to note that the column names are not very descriptive at all. However, the OilyGiant Mining Company explained that the features were already processed for model training, and their exact meaning is unimportant for this task.
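The checks described above (no missing values, nothing duplicated) can be made explicit with a small helper. This is only a sketch on a made-up miniature frame: `toy` and `quality_report` are hypothetical names that do not appear in the project, and the values are invented so that one duplicated row is visible in the report.

```python
import pandas as pd

# Hypothetical miniature frame with the same columns as the regional datasets
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'c3', 'a1'],
    'f0': [0.7, 1.3, 1.0, 0.7],
    'f1': [-0.5, -0.3, 0.2, -0.5],
    'f2': [1.2, 4.4, 1.4, 1.2],
    'product': [105.3, 73.0, 85.3, 105.3],
})

def quality_report(df):
    '''Return counts of missing values, fully duplicated rows, and repeated ids.'''
    return {
        'missing_values': int(df.isna().sum().sum()),
        'duplicate_rows': int(df.duplicated().sum()),
        'repeated_ids': int(df['id'].duplicated().sum()),
    }

report = quality_report(toy)
print(report)
```

Running the same helper on `df_0`, `df_1`, and `df_2` before the `'id'` column is dropped would show whether any well appears more than once.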

## Feature Preparation

It should be noted that the `'id'` column in each of the three datasets contains a unique identification string for each oil reserve. Because the values in the `'id'` column are strings, the column cannot be passed to a Linear Regression model. Therefore, before we move on to training models, it is best to drop the `'id'` column from each of the three datasets. The below cell block will drop the `'id'` column from each of the three datasets.

In [5]:
df_0 = df_0.drop('id', axis=1)
df_1 = df_1.drop('id', axis=1)
df_2 = df_2.drop('id', axis=1)

## Modeling

Now that each dataset only contains numeric values, let's move onto splitting the data. Each dataset will be split into a training dataset and a validation dataset. The ratio for each training dataset to validation dataset will be **3:1**. This means that the training dataset for each region will contain **75%** of the overall data, and the validation dataset will contain **25%** of the overall data. The cell block below will split the data up as described, using the `train_test_split` function from the `sklearn.model_selection` module.

In [6]:
# Create the training and validation datasets
# The datasets will be split 75% and 25% for a ratio of 3:1 between the training and validation datasets

# Initializing DataFrame features for each region
df_0_features = df_0.drop('product', axis=1)
df_1_features = df_1.drop('product', axis=1)
df_2_features = df_2.drop('product', axis=1)

# Initializing DataFrame targets for each region
df_0_target = df_0['product']
df_1_target = df_1['product']
df_2_target = df_2['product']

# Split the feature and target datasets for each DataFrame into training and validation datasets
# The test_size will be 0.25 so that the validation dataset contains 25% of the data
# and the training dataset contains 75% of the data
features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(df_0_features, df_0_target, test_size=0.25)
features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(df_1_features, df_1_target, test_size=0.25)
features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(df_2_features, df_2_target, test_size=0.25)
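One thing worth noting about the split: without a fixed `random_state`, every rerun of the notebook selects different rows, so the printed metrics can drift away from the numbers quoted in the text. A minimal sketch of a seeded 75/25 split follows, using synthetic data. The helper `split_region`, the seed value, and the toy frames are assumptions for illustration and are not part of the project code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one regional dataset (values are made up)
rng = np.random.default_rng(0)
toy_features = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['f0', 'f1', 'f2'])
toy_target = pd.Series(rng.uniform(0, 190, size=1000), name='product')

def split_region(features, target, seed=12345):
    '''Split one region 75/25 with a fixed seed so reruns give the same rows.'''
    return train_test_split(features, target, test_size=0.25, random_state=seed)

first = split_region(toy_features, toy_target)
second = split_region(toy_features, toy_target)

# Training and validation shapes, then a check that both runs picked the same rows
print(first[0].shape, first[1].shape)
print(first[1].index.equals(second[1].index))
```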

### Training linear regression models

Now that the training and validation datasets have been split for each regional dataset, it's finally time to train the models! Again, one model will be trained for each regional dataset. The model of choice will be a Linear Regression model since we are trying to predict a number in a continuous range. Since the possible values are vastly more numerous than just **0** or **1**, this is not a binary classification task, so Logistic Regression is not an appropriate model for this application. The below cells will initialize a Linear Regression model, fit the model, predict target values for the validation dataset, and then output the average volume of predicted oil reserves and the Root Mean Square Error.

#### Region 0 dataset

In [7]:
# Create an instance of a LinearRegression model
model = LinearRegression()

# Fit the model using the training data
model.fit(features_train_0, target_train_0)

# Predict the target values of the validation features
predicted_valid_0 = model.predict(features_valid_0)

# Calculate and print the average volume of predicted reserves and model RMSE
result = mean_squared_error(target_valid_0, predicted_valid_0)**0.5
print(f'Average volume of predicted reserves: {round(predicted_valid_0.mean(),3)} thousand barrels')
print("RMSE of the linear regression model on the validation set:", round(result,3))

Average volume of predicted reserves: 92.607 thousand barrels
RMSE of the linear regression model on the validation set: 37.594


The average volume of predicted oil reserves is approximately **92.6** thousand barrels. The Root Mean Square Error of the Linear Regression model is approximately **37.6**. This means that a typical prediction misses the actual reserve volume by about **37.6** thousand barrels. An RMSE of **37.6** is high relative to the **92.6** average, so the actual volume of any single reserve can differ considerably from its predicted value.

#### Region 1 dataset

In [8]:
# Create an instance of a LinearRegression model
model = LinearRegression()

# Fit the model using the training data
model.fit(features_train_1, target_train_1)

# Predict the target values of the validation features
predicted_valid_1 = model.predict(features_valid_1)

# Calculate and print the average volume of predicted reserves and model RMSE
result = mean_squared_error(target_valid_1, predicted_valid_1)**0.5
print(f'Average volume of predicted reserves: {round(predicted_valid_1.mean(),3)} thousand barrels')
print("RMSE of the linear regression model on the validation set:", round(result,3))

Average volume of predicted reserves: 69.143 thousand barrels
RMSE of the linear regression model on the validation set: 0.89


The average volume of predicted oil reserves is approximately **69.1** thousand barrels. The Root Mean Square Error of the Linear Regression model is approximately **0.89**. This means that a typical prediction misses the actual reserve volume by less than **1** thousand barrels. That RMSE value is incredibly small! The model's predictions for Region 1 are very close to the actual oil reserve volumes, so the predicted volume of a reserve can be trusted far more here than in the other regions.

#### Region 2 dataset

In [9]:
# Create an instance of a LinearRegression model
model = LinearRegression()

# Fit the model using the training data
model.fit(features_train_2, target_train_2)

# Predict the target values of the validation features
predicted_valid_2 = model.predict(features_valid_2)

# Calculate and print the average volume of predicted reserves and model RMSE
result = mean_squared_error(target_valid_2, predicted_valid_2)**0.5
print(f'Average volume of predicted reserves: {round(predicted_valid_2.mean(),3)} thousand barrels')
print("RMSE of the linear regression model on the validation set:", round(result,3))

Average volume of predicted reserves: 94.888 thousand barrels
RMSE of the linear regression model on the validation set: 39.865


The average volume of predicted oil reserves is approximately **94.9** thousand barrels. The Root Mean Square Error of the Linear Regression model is approximately **39.9**. This means that a typical prediction misses the actual reserve volume by about **39.9** thousand barrels. This is a pretty high RMSE value. In fact, it's the highest RMSE value out of all three. The actual oil reserve volumes could differ quite largely from their predicted values.

### Summary

Both Region 0 and Region 2 have a large RMSE value, especially when compared with Region 1. This means that the predictions for Region 0 and Region 2 carry a wide margin of error, while the oil reserve volumes in Region 1 are predicted very accurately. It should also be noted that Region 0 and Region 2 have an average predicted oil reserve volume that is approximately **1.35** times greater than that of Region 1 (about **93** to **95** thousand barrels versus **69**). It will be interesting to see how each region plays out in the profit analysis, since Region 0 and Region 2 could potentially provide much greater profit depending on how their actual oil reserve volumes turn out. However, Region 1 has a much higher chance of providing a profit that falls within a narrow, predictable range.
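To make the RMSE comparison concrete, the sketch below computes RMSE by hand and compares it with a naive baseline that always predicts the mean volume. The numbers are made up for illustration and the names `rmse`, `toy_actual`, and `toy_predicted` are hypothetical. A model whose RMSE is close to the baseline's is adding little beyond the average, which is one way to judge whether a value such as 37 is "high".

```python
import numpy as np

def rmse(actual, predicted):
    '''Root mean squared error: the typical size of a prediction error.'''
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up actual volumes and model predictions (thousand barrels)
toy_actual = np.array([100.0, 50.0, 150.0, 80.0])
toy_predicted = np.array([90.0, 70.0, 130.0, 80.0])

model_rmse = rmse(toy_actual, toy_predicted)

# Baseline: always predict the mean of the actual volumes
baseline_rmse = rmse(toy_actual, np.full_like(toy_actual, toy_actual.mean()))

print(f'Model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}')
```

Applied to the project, the same comparison could be run per region with `target_valid_0` and `predicted_valid_0` (and their Region 1 and Region 2 counterparts).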

## Profit calculation

### Determine the minimum oil production required to break-even

To begin analyzing the potential profit that can be made from each region, let's first determine how many thousands of barrels of oil are required to ensure that the OilyGiant Mining Company is not building/operating at a loss. The below cell block will calculate the number of units (thousands of barrels) that the oil reserves must provide in order for OilyGiant to break even. This calculation is done by dividing OilyGiant's overall budget (**100,000,000 USD**) for building **200** oil wells by the revenue brought in by one unit (**4,500 USD**) of oil.

In [10]:
# Initialize constant variables
# The revenue from one unit is 4,500 USD (one-thousand barrels)
# The budget for development of 200 oil wells is $100,000,000
# The revenue from 200 oil wells needs to exceed $100,000,000
budget = 100_000_000
unit_revenue = 4_500

# Volume of reserves required to develop a new well without losses:
sufficient_volume = budget/unit_revenue

# Print the results
print(f'A total of {round(sufficient_volume,2)} thousand barrels are required to build/operate without losses.')
print(f'Requires approximately {round((sufficient_volume / 200), 2)} thousand barrels per reserve.')

A total of 22222.22 thousand barrels are required to build/operate without losses.
Requires approximately 111.11 thousand barrels per reserve.


From the above calculations, a minimum of approximately **22.2** million barrels of oil are needed from the 200 new wells in order to ensure that OilyGiant is not building or operating at a loss. This breaks down to a minimum of approximately **111.11** thousand barrels from each reserve in the region. As can quickly be seen, the average predicted oil reserve volumes for the three regions, as calculated in the Modeling section, are lower than the **111.11** thousand barrels per reserve. However, the large RMSE values for Region 0 and Region 2 suggest that actual volumes are spread widely around those averages, so there are definitely oil reserves that exceed the **111.11** value. Region 1 is likely to have fewer oil reserves that exceed the **111.11** volume, since its average predicted volume is approximately **69.14** thousand barrels and its RMSE is only approximately **0.89**. However, if OilyGiant is able to extract oil from the top 200 oil wells in Region 1, then there is a much better chance of OilyGiant ending up with a profit rather than a loss.
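One way to check how many reserves clear the break-even volume is to compute the share of wells above the per-well threshold directly. The sketch below is illustrative only: the budget and unit revenue come from the task description, while `share_above_break_even` and the `toy_volumes` list are hypothetical stand-ins for a region's `product` column.

```python
import pandas as pd

BUDGET = 100_000_000      # USD, budget for developing 200 wells
UNIT_REVENUE = 4_500      # USD per unit (one thousand barrels)
N_WELLS = 200

# Per-well break-even volume in thousand barrels
break_even_per_well = BUDGET / UNIT_REVENUE / N_WELLS

def share_above_break_even(volumes, threshold=break_even_per_well):
    '''Fraction of wells whose volume exceeds the per-well break-even volume.'''
    volumes = pd.Series(volumes, dtype=float)
    return float((volumes > threshold).mean())

# Made-up volumes (thousand barrels) standing in for a region's reserves
toy_volumes = [60.0, 95.0, 112.0, 130.0, 140.0]
share = share_above_break_even(toy_volumes)

print(f'Break-even per well: {break_even_per_well:.2f} thousand barrels')
print(f'Share of wells above break-even: {share:.0%}')
```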

### Calculate potential profit from predicted values

Now, based on the predicted oil reserve volumes from each region, let's calculate the profit OilyGiant would receive. To do this, a function will be created that accepts the actual (target) volumes, the volume predictions, and the number of oil wells to select from each region as parameters. The 200 wells with the largest predicted volumes in each region will be used to calculate OilyGiant's profit. So, the predictions from each region will be sorted from highest to lowest, and the top 200 values will be sliced from the predictions series. The actual volumes of those 200 wells will be added together for a total sum, which will then be multiplied by **4,500 USD** to get the total revenue. Lastly, to get the final profit number, OilyGiant's budget will be subtracted from the total revenue. In addition to displaying the profit that OilyGiant would receive, the total reserve volume corresponding to that profit will be displayed as well.

In [11]:
# Create a function to calculate the profit from the predicted target values of each region

# Initialize function - takes 'targets', 'predictions', and 'count' as parameters
def get_profit(targets, predictions, count):
    '''Calculate the potential profit from the actual volumes of the wells with the largest predicted volumes'''
    
    # Sort the predicted volumes from largest to smallest and keep the top 'count' wells
    predictions_sorted = predictions.sort_values(ascending=False)[:count]
    
    # Look up the target volumes (actual volumes) of the selected wells by index
    selected_wells = targets[predictions_sorted.index]
    
    # Calculate the revenue based on one unit producing $4,500 (unit_revenue),
    # then subtract the budget of $100,000,000 to get the profit
    profit = (unit_revenue * selected_wells.sum()) - budget
    
    # Return the profit value
    return round(profit, 2)


# Run the function below...

# Create a list of the predicted values and actual values (targets) for each region
predictions = [pd.Series(predicted_valid_0), pd.Series(predicted_valid_1), pd.Series(predicted_valid_2)]
targets = [target_valid_0.reset_index(drop=True) , target_valid_1.reset_index(drop=True), target_valid_2.reset_index(drop=True)]
profits = []

# For loop for executing the get_profit function for the predicted datasets of each region
# Store profit values into the profits list
for i in range(len(predictions)):
    profits.append(get_profit(targets[i], predictions[i], 200))

# Use the system's default locale for currency formatting (USD on a US locale)
locale.setlocale(locale.LC_ALL, '')
    
# Print each profit value
for i in range(len(profits)):
    print(f'Profit from Region {i}: {locale.currency(profits[i], grouping=True)}')
    print(f'Target oil reserve volume: {round((profits[i] + 100_000_000) / 4500,2)} thousand barrels\n')

Profit from Region 0: $33,619,361.58
Target oil reserve volume: 29693.19 thousand barrels

Profit from Region 1: $24,150,866.97
Target oil reserve volume: 27589.08 thousand barrels

Profit from Region 2: $26,377,304.42
Target oil reserve volume: 28083.85 thousand barrels



The ranking of the regions by profit, calculated from the actual volumes of the 200 wells with the largest predicted volumes, is as follows:

1.  Region 0
2.  Region 2
3.  Region 1

From the profits calculated above, it can quickly be seen that Region 0 provided the highest profits, followed by Region 2, and then Region 1. All three regions were able to produce profits and not losses. The approximate total oil reserve volumes needed to achieve the calculated profits are as follows:

- Region 0: **approx. 29.7 million barrels**
- Region 2: **approx. 28.1 million barrels**
- Region 1: **approx. 27.6 million barrels**

The approximate total oil reserve volumes are all greater than the approximate **22.2 million barrels** that were calculated to be required for OilyGiant to break even. That is why each region shows a profit for OilyGiant. While I expected Region 1 to produce a lower number of barrels, it seems that it contains some very large oil reserves: at least 200 of them, seeing as Region 1 was able to produce a profit. While Region 1 and Region 2 were able to provide a profit for OilyGiant, Region 0 provided the highest profit, at more than **33 million USD**. Therefore, if based only on these predictions, OilyGiant should build their new oil well in Region 0. However, these calculations are based on the 200 wells with the largest predicted reserves in the entire validation dataset, an outcome that is unlikely to come to fruition when OilyGiant actually builds their new well. Therefore, it is important to look at the distribution of profits for each region, which will be done in the next section.
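The key detail in the profit calculation is that wells are chosen by their predicted volume but valued by their actual volume. A tiny example with made-up numbers shows the effect. The function name `profit_from_top_predictions`, the toy budget, and the four-well series are assumptions for illustration; the function mirrors the logic of `get_profit` rather than replacing it.

```python
import pandas as pd

BUDGET_TOY = 100_000      # hypothetical small budget for the toy example
UNIT_REVENUE = 4_500      # USD per unit (one thousand barrels)

def profit_from_top_predictions(targets, predictions, count, budget, unit_revenue=UNIT_REVENUE):
    '''Pick wells by predicted volume, then value them by actual volume.'''
    top_index = predictions.sort_values(ascending=False).index[:count]
    return unit_revenue * targets.loc[top_index].sum() - budget

# Four made-up wells: the model badly underestimates the last one
toy_predictions = pd.Series([10.0, 30.0, 20.0, 5.0])
toy_targets = pd.Series([12.0, 25.0, 1.0, 40.0])

toy_profit = profit_from_top_predictions(toy_targets, toy_predictions, count=2, budget=BUDGET_TOY)
print(toy_profit)
```

Here the two wells with the largest predictions (indices 1 and 2) hold 26 units in reality, so the best actual well (index 3, with 40 units) is missed because its prediction was low. This is why a region with a large RMSE can deliver less profit than its predictions suggest.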

### Bootstrapping to determine the most profitable region

Although calculating the profits produced by the largest model predictions is a good starting point for our analysis of the most profitable region, it doesn't tell the whole story. We sliced the 200 wells with the largest predicted reserves from each region to obtain the highest profit possible. Now, let's create a distribution of profit values for each region. This will be done by drawing **1,000** bootstrap samples of **500** oil wells (sampled with replacement) and passing each one to the `get_profit` function. The function will then calculate the profit from the best **200** wells in each sample, as was previously done, and the profit value will be stored in the `values` list. Once the distribution has been collected, the average profit, 95% confidence interval, and risk of loss will be calculated for each region.

In [12]:
# Use the bootstrapping method with 1000 samples to find the distribution of profit
# Collect a distribution of average profits for each region based on the samples

def get_profit_distribution(targets, predictions, count):
    '''Function will calculate profit for 1000 samples based on predictions and target values. Purpose
       is to collect a profit distribution.'''
    
    # Initialize the random state
    state = np.random.RandomState(12345)
    
    # Initialize the values list
    values = []
    
    # Create a DataFrame consisting of the predictions and targets series.
    combined_df = pd.DataFrame()
    combined_df['predictions'] = predictions
    combined_df['targets'] = targets.reset_index(drop=True)
    
    # For loop for obtaining 1000 samples and passing them to the get_profit function
    for i in range(1000):
        target_subsample = combined_df.sample(n=500, replace=True, random_state=state).reset_index(drop=True)
        values.append(get_profit(target_subsample['targets'], target_subsample['predictions'], count))
    
    # Turn the values list into a pandas series
    values = pd.Series(values)
    
    # Calculate the mean of the profits
    mean = values.mean()
    
    # Obtain the upper and lower bounds of the 95% confidence interval
    upper = values.quantile(0.975)
    lower = values.quantile(0.025)
    
    # Determine the probability of negative profit (loss)
    # A separate name is used so the 'count' parameter is not overwritten
    loss_count = int((values < 0).sum())
            
    print("Average profit:", locale.currency(mean, grouping=True))
    print(f'Target oil reserve volume: {round((mean + 100_000_000) / 4500,2)} thousand barrels')
    print("95% Confidence Interval:", locale.currency(lower, grouping=True), 'to', locale.currency(upper, grouping=True))
    print(f"Risk of loss: {loss_count * 100 / len(values)}%")

for i in range(len(predictions)):
    print('Region ' + str(i))
    get_profit_distribution(targets[i], predictions[i], 200)
    print()

Region 0
Average profit: $4,843,364.99
Target oil reserve volume: 23298.53 thousand barrels
95% Confidence Interval: -$470,829.47 to $9,637,909.47
Risk of loss: 4.0%

Region 1
Average profit: $4,633,159.55
Target oil reserve volume: 23251.81 thousand barrels
95% Confidence Interval: $745,163.05 to $8,566,527.72
Risk of loss: 0.8%

Region 2
Average profit: $3,865,131.45
Target oil reserve volume: 23081.14 thousand barrels
95% Confidence Interval: -$1,119,615.68 to $9,060,771.58
Risk of loss: 7.1%



Region 0 and Region 2 both have risk of loss percentages greater than **2.5%**, which makes them not suitable to be the next location for the new well. Only Region 1 has a risk of loss percentage that is less than **2.5%**. Based on the criteria for acceptable risk of loss, only Region 1 can be considered for the location for OilyGiant's new well.

The average profit calculated for each region is much lower than what was seen when calculating the potential profit from the largest 200 oil reserves in each region. This is expected, as each profit is now calculated from the best 200 wells in a random sample of only 500, and the results are averaged over 1000 such samples. Region 0 and Region 1 have fairly high average profits (about **4.8** and **4.6 million USD**), whereas Region 2 brings in a slightly lower average profit (about **3.9 million USD**). Therefore, in addition to meeting the risk of loss criterion, Region 1 has one of the larger average profits, so it can be reported to OilyGiant that they should build their new well in Region 1.
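The three summary statistics reported above (mean, percentile-based 95% interval, and risk of loss) can also be computed in a vectorized way once the bootstrap profits are collected. The sketch below is a hedged illustration on synthetic profits drawn from a normal distribution. `summarize_profits` and `toy_profits` are hypothetical names, and the distribution parameters are invented rather than taken from the project's results.

```python
import numpy as np

def summarize_profits(profits, alpha=0.05):
    '''Mean, percentile interval, and share of negative values for bootstrap profits.'''
    profits = np.asarray(profits, dtype=float)
    lower, upper = np.quantile(profits, [alpha / 2, 1 - alpha / 2])
    return {
        'mean': float(profits.mean()),
        'lower': float(lower),
        'upper': float(upper),
        'risk_of_loss_pct': float((profits < 0).mean() * 100),
    }

# Synthetic stand-in for 1000 bootstrap profit values (USD)
rng = np.random.default_rng(12345)
toy_profits = rng.normal(loc=4_000_000, scale=2_500_000, size=1000)

summary = summarize_profits(toy_profits)
for name, value in summary.items():
    print(f'{name}: {value:,.2f}')
```

The risk of loss is simply the fraction of bootstrap profits below zero, so a region passes the 2.5% criterion only if fewer than 25 of its 1000 sampled profits are negative.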

## Conclusion

To complete this task, the datasets were broken into a training and validation dataset for each region. A model was trained for each region using the data provided by OilyGiant, and the predicted oil reserve volumes were utilized to calculate and predict the profit from each region. While this was a good base to begin our analysis with, we needed to obtain more samples for a more precise and accurate conclusion. The bootstrapping technique was used with 1000 samples to find the distribution of profit for each region. The average profit, 95% confidence interval, and risk of loss were calculated for the profit distributions of each region. After all the predictions and calculations were completed, it was determined that Region 1 would be the best and most suitable location for the OilyGiant Mining Company to build their new oil well.