# OilyGiant

Project description
You work for the OilyGiant mining company. Your task is to find the best place for a new well.
Steps to choose the location:
Collect the oil well parameters in the selected region: oil quality and volume of reserves;
Build a model for predicting the volume of reserves in the new wells;
Pick the oil wells with the highest estimated values;
Pick the region with the highest total profit for the selected oil wells.
You have data on oil samples from three regions. The parameters of each oil well in a region are already known. Build a model that will help pick the region with the highest profit margin. Analyze potential profit and risks using the bootstrapping technique.

Data description
Geological exploration data for the three regions are stored in files:
geo_data_0.csv
geo_data_1.csv
geo_data_2.csv
id — unique oil well identifier
f0, f1, f2 — three features of points (their specific meaning is unimportant, but the features themselves are significant)
product — volume of reserves in the oil well (thousand barrels).

Conditions:
Only linear regression is suitable for model training (the rest are not sufficiently predictable).
When exploring the region, a study of 500 points is carried out, and the best 200 points are picked for the profit calculation.
The budget for development of 200 oil wells is USD 100 million.
One barrel of raw materials brings 4.5 USD of revenue. The revenue from one unit of product is 4,500 USD (volume of reserves is in thousand barrels).
After the risk evaluation, keep only the regions with the risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.
The data is synthetic: contract details and well characteristics are not disclosed.

## Download and prepare the data. Explain the procedure.

### Initialization

In [27]:
# Load the libraries used in this notebook
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression  # the only model allowed by the project conditions
from sklearn.metrics import mean_squared_error  # regression metric
from sklearn.model_selection import train_test_split  # split into training and validation sets

In [28]:
try:
    # Local copies of the datasets
    region_1 = pd.read_csv("C:/Users/Lorenzo Santos/OneDrive/Documents/geo_data_0.csv")
    region_2 = pd.read_csv("C:/Users/Lorenzo Santos/OneDrive/Documents/geo_data_1.csv")
    region_3 = pd.read_csv("C:/Users/Lorenzo Santos/OneDrive/Documents/geo_data_2.csv")
except FileNotFoundError:
    # Fall back to the platform paths when the local files are missing
    region_1 = pd.read_csv("/datasets/geo_data_0.csv")
    region_2 = pd.read_csv("/datasets/geo_data_1.csv")
    region_3 = pd.read_csv("/datasets/geo_data_2.csv")

### Preparation

In [29]:
region_1

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.221170,105.280062
1,2acmU,1.334711,-0.340164,4.365080,73.037750
2,409Wp,1.022732,0.151990,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647
...,...,...,...,...,...
99995,DLsed,0.971957,0.370953,6.075346,110.744026
99996,QKivN,1.392429,-0.382606,1.273912,122.346843
99997,3rnvd,1.029585,0.018787,-1.348308,64.375443
99998,7kl59,0.998163,-0.528582,1.583869,74.040764


In [30]:
region_2

Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276000,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.001160,134.766305
3,KcrkZ,-13.081196,-11.506057,4.999415,137.945408
4,AHL4O,12.702195,-8.147433,5.004363,134.766305
...,...,...,...,...,...
99995,QywKC,9.535637,-6.878139,1.998296,53.906522
99996,ptvty,-10.160631,-12.558096,5.005581,137.945408
99997,09gWa,-7.378891,-3.084104,4.998651,137.945408
99998,rqwUm,0.665714,-6.152593,1.000146,30.132364


In [31]:
region_3

Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.871910
3,q6cA6,2.236060,-0.553760,0.930038,114.572842
4,WPMUX,-0.515993,1.716266,5.899011,149.600746
...,...,...,...,...,...
99995,4GxBu,-1.777037,1.125220,6.263374,172.327046
99996,YKFjq,-1.261523,-0.894828,2.524545,138.748846
99997,tKPY3,-1.199934,-2.957637,5.219411,157.080080
99998,nmxp2,-2.419896,2.417221,-5.548444,51.795253


### Prepare the data

In [32]:
region_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [33]:
region_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [34]:
region_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


### Fix data

No missing values so far: every column in all three tables has 100,000 non-null entries.

In [35]:
region_1[region_1.duplicated()]

Unnamed: 0,id,f0,f1,f2,product


In [36]:
region_2[region_2.duplicated()]

Unnamed: 0,id,f0,f1,f2,product


In [37]:
region_3[region_3.duplicated()]

Unnamed: 0,id,f0,f1,f2,product


No fully duplicated rows were found in any region, so no fixes are needed.
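Note that duplicated() with no arguments only flags rows where every column repeats. Since id is described as a unique well identifier, it may also be worth checking for repeated ids. A minimal sketch on a made-up frame (the values below are illustrative and not taken from the datasets):

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'a1', 'c3'],
    'f0': [0.1, 0.2, 0.3, 0.4],
    'product': [10.0, 20.0, 30.0, 40.0],
})

# Rows that are identical in every column
full_row_dupes = toy.duplicated().sum()

# Rows whose id already appeared earlier (the first occurrence is not counted)
id_dupes = toy.duplicated(subset='id').sum()

print('full-row duplicates:', full_row_dupes)
print('repeated ids:', id_dupes)
```

Here the same id appears twice with different measurements, which the full-row check misses.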

### Enrich data

The id column holds strings, so it has to be dropped (or encoded) before fitting, otherwise the model raises a ValueError. The three features are already numeric, so nothing else needs to be encoded. The product column is renamed to thousand_barrels so that the unit is explicit in later calculations.

In [38]:
#We don't need id
region_1 = region_1.drop('id', axis=1)
region_2 = region_2.drop('id', axis=1)
region_3 = region_3.drop('id', axis=1)

region_1.columns = ['f0','f1','f2','thousand_barrels']
region_2.columns = ['f0','f1','f2','thousand_barrels']
region_3.columns = ['f0','f1','f2','thousand_barrels']

#There are no categorical features though, so nothing needs to be encoded (OHE)



## Train and test the model for each region:

### Split the data into a training set and validation set at a ratio of 75:25.


In [39]:
features_1 = region_1.drop('thousand_barrels',axis=1)
target_1 = region_1['thousand_barrels']
features_2 = region_2.drop('thousand_barrels',axis=1)
target_2 = region_2['thousand_barrels']
features_3 = region_3.drop('thousand_barrels',axis=1)
target_3 = region_3['thousand_barrels']

In [40]:
features_train_1,features_valid_1, target_train_1, target_valid_1 = train_test_split(features_1,target_1,test_size=0.25,random_state=12345)
features_train_2,features_valid_2, target_train_2, target_valid_2 = train_test_split(features_2,target_2,test_size=0.25,random_state=12345)
features_train_3,features_valid_3, target_train_3, target_valid_3 = train_test_split(features_3,target_3,test_size=0.25,random_state=12345)

#Feature scaling is skipped: with ordinary least squares, standardizing the
#features changes the coefficients but not the predictions.

#Suppress pandas' SettingWithCopyWarning (the "red text")
pd.options.mode.chained_assignment = None
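On the question of scaling: as far as we understand, ordinary least squares with an intercept gives the same predictions whether or not the features are standardized, so leaving StandardScaler out should not affect the results. A small check on synthetic data (the data and variable names here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption: illustration only)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, -0.5, 0.03]) + 5.0 + rng.normal(scale=0.1, size=200)

X_train, X_valid = X[:150], X[150:]
y_train = y[:150]

# Model trained on the raw features
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_valid)

# Model trained on standardized features; the scaler is fitted on training data only
scaler = StandardScaler().fit(X_train)
scaled_pred = (
    LinearRegression()
    .fit(scaler.transform(X_train), y_train)
    .predict(scaler.transform(X_valid))
)

max_diff = np.max(np.abs(raw_pred - scaled_pred))
print('largest difference between predictions:', max_diff)
```

The difference should be at the level of floating-point noise. Note the pattern of fitting the scaler once on the training set and then calling transform on both sets.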

### Train the model and make predictions for the validation set.

In [41]:
#Condition: Only linear regression is suitable for model training (the rest are not sufficiently predictable).
model_1 = LinearRegression() #No class_weight or random_state parameter needed.
model_1.fit(features_train_1,target_train_1)
predictions_valid_1 = model_1.predict(features_valid_1)
model_2 = LinearRegression()
model_2.fit(features_train_2,target_train_2)
predictions_valid_2 = model_2.predict(features_valid_2)
model_3 = LinearRegression()
model_3.fit(features_train_3,target_train_3)
predictions_valid_3 = model_3.predict(features_valid_3)
print('trained')

trained
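The three near-identical blocks above could be folded into one helper. The sketch below is one possible way to do it, not the code used in this notebook: train_region is a hypothetical name, and the data is synthetic so the cell runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def train_region(df, target_col='thousand_barrels', seed=12345):
    """Hypothetical helper: 75:25 split, fit a linear regression,
    return validation targets and predictions with a shared index."""
    features = df.drop(target_col, axis=1)
    target = df[target_col]
    f_train, f_valid, t_train, t_valid = train_test_split(
        features, target, test_size=0.25, random_state=seed)
    model = LinearRegression().fit(f_train, t_train)
    predictions = pd.Series(model.predict(f_valid), index=t_valid.index)
    return t_valid, predictions

# Synthetic stand-ins for the three region tables (made-up data)
rng = np.random.RandomState(0)
regions = {}
for name in ['region_1', 'region_2', 'region_3']:
    f = rng.normal(size=(400, 3))
    regions[name] = pd.DataFrame({
        'f0': f[:, 0], 'f1': f[:, 1], 'f2': f[:, 2],
        'thousand_barrels': 50 + 10 * f[:, 2] + rng.normal(size=400),
    })

results = {name: train_region(df) for name, df in regions.items()}
for name, (t_valid, predictions) in results.items():
    print(name, len(t_valid), round(predictions.mean(), 2))
```

Keeping the predictions in a Series that shares the index of the validation targets makes it easier to match wells to their true volumes later.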


### Save the predictions and correct answers for the validation set.

In [42]:
mse_1 = mean_squared_error(target_valid_1, predictions_valid_1)
mse_2 = mean_squared_error(target_valid_2, predictions_valid_2)
mse_3 = mean_squared_error(target_valid_3, predictions_valid_3)
#MSE is the average squared difference between the true and predicted volume of reserves; the closer to zero, the better.
print('model_1,region_1')
print('mse',mse_1)
print('model_2,region_2')
print('mse',mse_2)
print('model_3,region_3')
print('mse',mse_3)

model_1,region_1
mse 1412.2129364399243
model_2,region_2
mse 0.7976263360391157
model_3,region_3
mse 1602.3775813236196


model_2 has a very good MSE (mean squared error) of about 0.80, close to 0. The MSE of model_1 and model_3 is above 1,400.

### Print the average volume of predicted reserves and model RMSE

In [43]:
rmse_1 = mse_1 ** 0.5 
rmse_2 = mse_2 ** 0.5
rmse_3 = mse_3 ** 0.5

#https://practicum.com/trainer/data-scientist/lesson/faedca40-cb8f-40a7-909b-34ecac64ca0e/task/2082f70f-98eb-48c6-8912-29007c777d86/
#RMSE is in the same units as the target (thousand barrels)
print('region_1')
print('rmse',rmse_1) 
print('avg',predictions_valid_1.mean())

print('region_2')
print('rmse',rmse_2)
print('avg',predictions_valid_2.mean())

print('region_3')
print('rmse',rmse_3)
print('avg',predictions_valid_3.mean())

region_1
rmse 37.5794217150813
avg 92.59256778438035
region_2
rmse 0.893099286775617
avg 68.728546895446
region_3
rmse 40.02970873393434
avg 94.96504596800489


Region 3 has the highest average predicted volume (about 95.0 thousand barrels) but also the highest RMSE (root mean squared error, about 40.0). Region 2 has the lowest average (about 68.7), but its model is far more accurate (RMSE about 0.89, close to zero).
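To make the metric concrete, here is a small worked example of RMSE on made-up numbers. The rmse helper and the toy values are assumptions for illustration. Comparing against a constant prediction of the mean is one common way to judge whether an RMSE is large or small for a given target.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy values (made up): the errors are 1, 0 and -2
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

model_rmse = rmse(y_true, y_pred)

# Baseline: always predict the mean of the true values
baseline_rmse = rmse(y_true, [np.mean(y_true)] * len(y_true))

print('model rmse', model_rmse)
print('baseline rmse', baseline_rmse)
```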

## Prepare for profit calculation:

### Store all key values for calculations in separate variables.

In [44]:
rmse_1 = mse_1 ** 0.5 
rmse_2 = mse_2 ** 0.5
rmse_3 = mse_3 ** 0.5
tar_avg_1 = target_valid_1.mean()
tar_avg_2 = target_valid_2.mean()
tar_avg_3 = target_valid_3.mean()
tar_sum_1 = target_valid_1.sum()
tar_sum_2 = target_valid_2.sum()
tar_sum_3 = target_valid_3.sum()
pred_avg_1 = predictions_valid_1.mean()
pred_avg_2 = predictions_valid_2.mean()
pred_avg_3 = predictions_valid_3.mean()
pred_sum_1 = predictions_valid_1.sum()
pred_sum_2 = predictions_valid_2.sum()
pred_sum_3 = predictions_valid_3.sum()

### Calculate the volume of reserves sufficient for developing a new well without losses. Compare the obtained value with the average volume of reserves in each region.


In [45]:
#Conditions:
#Only linear regression is suitable for model training (the rest are not sufficiently predictable).
#When exploring the region, a study of 500 points is carried out, and the best 200 points are picked for the profit calculation.
#The budget for development of 200 oil wells is USD 100 million.
#One barrel of raw materials brings 4.5 USD of revenue. The revenue from one unit of product is 4,500 USD (volume of reserves is in thousand barrels).
#After the risk evaluation, keep only the regions with the risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.
#The data is synthetic: contract details and well characteristics are not disclosed.

print(f'{100} million dollars divided by {200} wells means one well is {100/200} mil or {(100/200) * 1000} thousand')
print(f'500 thousand divided by 4500 is {500/4.5} volume')
print()
print('We need past 111 thousand barrels.')

print('region_1')
print('avg volume',tar_avg_1)
print('region_2')
print('avg volume',tar_avg_2)
print('region_3')
print('avg volume',tar_avg_3)





100 million dollars divided by 200 wells means one well is 0.5 mil or 500.0 thousand
500 thousand divided by 4500 is 111.11111111111111 volume

We need past 111 thousand barrels.
region_1
avg volume 92.07859674082927
region_2
avg volume 68.72313602435997
region_3
avg volume 94.88423280885438


A well must hold more than about 111.1 thousand barrels to cover its development cost.

### Provide the findings about the preparation for profit calculation step.

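Region 3 has the highest average volume per well (about 94.9 thousand barrels) and therefore the highest average revenue per well, but every region's average is below the break-even volume of about 111.1 thousand barrels. Developing wells at random would lose money on average, so only the wells with the best predictions should be selected.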
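The break-even arithmetic can be written out with named constants. This is a minimal sketch: the constant names are our own, and the average volumes are rounded from the output above.

```python
# Project conditions (names are illustrative)
BUDGET_USD = 100_000_000       # budget for developing 200 wells
WELLS = 200
REVENUE_PER_UNIT_USD = 4_500   # one unit of product = one thousand barrels

cost_per_well = BUDGET_USD / WELLS                         # USD per well
break_even_volume = cost_per_well / REVENUE_PER_UNIT_USD   # thousand barrels

# Average true volume per well on the validation sets, rounded from the output above
avg_volume = {'region_1': 92.08, 'region_2': 68.72, 'region_3': 94.88}

# How far each region's average falls short of the break-even volume
shortfall = {name: break_even_volume - avg for name, avg in avg_volume.items()}

print('cost per well (USD):', cost_per_well)
print('break-even volume (thousand barrels):', round(break_even_volume, 2))
for name, gap in shortfall.items():
    print(name, 'shortfall:', round(gap, 2))
```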



## Write a function to calculate profit from a set of selected oil wells and model predictions:

### Pick the wells with the highest values of predictions.

In [46]:
print('region 1')
print('total volume',pred_sum_1)
print('avg well',pred_avg_1)
print('region 2')
print('total volume',pred_sum_2)
print('avg well',pred_avg_2)
print('region 3')
print('total volume',pred_sum_3)
print('avg well',pred_avg_3)


region 1
total volume 2314814.194609509
avg well 92.59256778438035
region 2
total volume 1718213.67238615
avg well 68.728546895446
region 3
total volume 2374126.1492001223
avg well 94.96504596800489


Region 3 has the highest total predicted volume and the highest predicted mean per well.

### Summarize the target volume of reserves in accordance with these predictions

In [47]:
print('region 1')
print('total volume',tar_sum_1)
print('total prediction volume',pred_sum_1)
print('region 2')
print('total volume',tar_sum_2)
print('total prediction volume',pred_sum_2)
print('region 3')
print('total volume',tar_sum_3)
print('total prediction volume',pred_sum_3)

region 1
total volume 2301964.918520732
total prediction volume 2314814.194609509
region 2
total volume 1718078.4006089992
total prediction volume 1718213.67238615
region 3
total volume 2372105.8202213594
total prediction volume 2374126.1492001223


Region 3 has the highest total volume, both actual and predicted.

### Provide findings: suggest a region for oil wells' development and justify the choice. Calculate the profit for the obtained volume of reserves.

In [48]:
target_valid_1

71751     10.038645
80493    114.551489
2655     132.603635
53233    169.072125
91141    122.325180
            ...    
12581    170.116726
18456     93.632175
73035    127.352259
63834     99.782700
43558    177.821022
Name: thousand_barrels, Length: 25000, dtype: float64

In [49]:
predictions_valid_1

array([ 95.89495185,  77.57258261,  77.89263965, ...,  61.50983303,
       118.18039721, 118.16939229])

In [50]:
pd.concat([target_valid_1.reset_index(drop=True), pd.Series(predictions_valid_1)], axis=1)

Unnamed: 0,thousand_barrels,0
0,10.038645,95.894952
1,114.551489,77.572583
2,132.603635,77.892640
3,169.072125,90.175134
4,122.325180,70.510088
...,...,...
24995,170.116726,103.037104
24996,93.632175,85.403255
24997,127.352259,61.509833
24998,99.782700,118.180397


In [51]:
#https://pastebin.com/uh7ncRQZ Old code to show thought process



#The budget for development of 200 oil wells is USD 100 million (500 thousand USD per well).
#The revenue from one unit of product is 4,500 USD (volume of reserves is in thousand barrels).

#profit(target, predictions) does the following:
#1. Sort the wells by prediction and select the top 200.
#2. Find the targets that correspond to these 200 wells.
#3. Calculate profit from the target volumes of these wells (profit = revenue - cost).

#Notes:
#concat (or .join) only lines the two series up correctly after both indices are reset and dropped.
#Profit is based only on what the wells actually contain (the targets), not on the predictions.
#The cost is 100000000 (100 million), not 1000000 (1 million).
#Profit = sum over the 200 wells of (volume * 4500) - 100 million.

def profit(target,predictions):
    predictions = pd.Series(predictions).reset_index(drop=True)
    target = pd.Series(target).reset_index(drop=True)
    matching = pd.concat([predictions, target], axis=1) 
    matching.columns = ['predictions','target']
    matching = matching.sort_values(by='predictions',ascending=False)
    matching = matching.head(200)
    matching['revenue'] = matching['target'] * 4500
    return matching['revenue'].sum() - 100000000

#In summary: rank the wells by prediction, take the top 200, multiply their target volumes by the revenue per unit, then subtract 100 million.

region_1_profit = profit(target_valid_1,predictions_valid_1)
region_2_profit = profit(target_valid_2,predictions_valid_2)
region_3_profit = profit(target_valid_3,predictions_valid_3)
    
print(region_1_profit)
print(region_2_profit)
print(region_3_profit)

33208260.43139851
24150866.966815084
27103499.63599831


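When the top 200 wells are picked from the whole validation set, the profit is highest in region 1 (about USD 33.2 million), followed by region 3 (about 27.1 million) and region 2 (about 24.2 million).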
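A tiny worked example may help show why wells are ranked by prediction while profit is counted from the targets. The function below is a hypothetical, parameterized variant of the profit logic (profit_top_n and its arguments are our own names), run on four made-up wells:

```python
import pandas as pd

def profit_top_n(target, predictions, top_n, revenue_per_unit, cost):
    """Pick the top_n wells by predicted volume, then compute profit
    from their true volumes (profit = revenue - cost)."""
    target = pd.Series(target).reset_index(drop=True)
    predictions = pd.Series(predictions).reset_index(drop=True)
    chosen = predictions.sort_values(ascending=False).head(top_n).index
    return target.loc[chosen].sum() * revenue_per_unit - cost

# Four made-up wells: the last one has the largest true volume,
# but the model ranks it only third, so it is not selected.
toy_target = [90.0, 30.0, 50.0, 120.0]
toy_predictions = [100.0, 20.0, 80.0, 60.0]

toy_profit = profit_top_n(toy_target, toy_predictions,
                          top_n=2, revenue_per_unit=4500, cost=500_000)
print(toy_profit)  # (90 + 50) * 4500 - 500000 = 130000.0
```

The selected wells are the first and third, so the revenue comes from 140 thousand barrels rather than from the 210 that a perfect ranking would have found.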

## Calculate risks and profit for each region:

### Use the bootstrapping technique with 1000 samples to find the distribution of profit.


In [52]:
#A confidence interval is a segment of the number axis that contains the population parameter of interest with a predetermined probability. If the value falls into the range from 300 to 500 with 99% probability, then the 99% confidence interval for this value is (300, 500).
#https://pastebin.com/Rjw3VaEy #we actually need to keep the duplicates (that's kind of the point of bootstrapping).
#https://pastebin.com/fGDjxU3U #notes + extra code to make sure it's running properly.
#https://pastebin.com/HERt718U #Profit function is backwards 


region_1_values = []
region_2_values = []
region_3_values = []


#first reset and drop the index of targets (so it should just be indexed as 0, 1, 2, etc.).
target_valid_1 = pd.Series(target_valid_1).reset_index(drop=True)
target_valid_2 = pd.Series(target_valid_2).reset_index(drop=True)
target_valid_3 = pd.Series(target_valid_3).reset_index(drop=True)

#Machine Learning in Business - Data Collection - Cross-Validation in Python

#Then as you're creating a series out of predictions with no index, it will have the same standard index (0, 1, 2, etc.)
predictions_valid_1 = pd.Series(predictions_valid_1).reset_index(drop=True)
predictions_valid_2 = pd.Series(predictions_valid_2).reset_index(drop=True)
predictions_valid_3 = pd.Series(predictions_valid_3).reset_index(drop=True)

for i in range(1000):
    #Subsamples must be drawn with replacement (replace=True), so the same well can fall into a subsample several times.
    #Don't reset the index of target_subsample before selecting the corresponding predictions, otherwise the matching predictions can't be looked up.
    target_subsample_1 = target_valid_1.sample(n=500, replace=True)
    target_subsample_2 = target_valid_2.sample(n=500, replace=True)
    target_subsample_3 = target_valid_3.sample(n=500, replace=True)
    
    #Sample with replacement out of targets and use predictions.loc[target_subsample.index] to find the corresponding predictions
    #Before passing those to the profit function, reset and drop the indices of both predictions and targets subsample (this renumbers the rows to get rid of duplicate indices, but keeps all rows, as we wanted)
    predictions_subsample_1 = predictions_valid_1.loc[target_subsample_1.index].reset_index(drop=True)
    predictions_subsample_2 = predictions_valid_2.loc[target_subsample_2.index].reset_index(drop=True)
    predictions_subsample_3 = predictions_valid_3.loc[target_subsample_3.index].reset_index(drop=True)
    
    #profit(target,predictions) targets should be first
    region_1_values.append(profit(target_subsample_1,predictions_subsample_1))
    region_2_values.append(profit(target_subsample_2,predictions_subsample_2))
    region_3_values.append(profit(target_subsample_3,predictions_subsample_3))
    
#In summary, top 200 wells, of 500 samples, done 1000 times.
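The loop above draws a different random sample on every run, so the numbers below change between runs. One way to make the bootstrap reproducible is to pass a shared RandomState to sample(). The sketch below assumes that approach. The function name, its parameters, and the synthetic data are ours, and it is not the code that produced the results in this notebook.

```python
import numpy as np
import pandas as pd

def bootstrap_profits(target, predictions, n_iter=1000, sample_size=500, top_n=200,
                      revenue_per_unit=4500, cost=100_000_000, seed=12345):
    """Hypothetical helper: bootstrap distribution of profit with a fixed seed."""
    target = pd.Series(target).reset_index(drop=True)
    predictions = pd.Series(predictions).reset_index(drop=True)
    state = np.random.RandomState(seed)  # one generator shared by all iterations
    profits = []
    for _ in range(n_iter):
        # Sample wells with replacement; duplicated index labels are expected
        sample_idx = target.sample(n=sample_size, replace=True, random_state=state).index
        t = target.loc[sample_idx].reset_index(drop=True)
        p = predictions.loc[sample_idx].reset_index(drop=True)
        # Rank by prediction, count profit from the true volumes
        chosen = p.sort_values(ascending=False).head(top_n).index
        profits.append(t.loc[chosen].sum() * revenue_per_unit - cost)
    return pd.Series(profits)

# Synthetic validation data (made up) so the sketch runs on its own
rng = np.random.RandomState(0)
true_volume = pd.Series(rng.normal(loc=90, scale=40, size=5000))
predicted = true_volume + rng.normal(scale=30, size=5000)

run_a = bootstrap_profits(true_volume, predicted, n_iter=200)
run_b = bootstrap_profits(true_volume, predicted, n_iter=200)
print('identical runs:', run_a.equals(run_b))
```

Because the generator is created inside the function from the same seed, two calls with the same inputs should return the same profits.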

### Find average profit, 95% confidence interval and risk of losses. Loss is negative profit, calculate it as a probability and then express as a percentage.

In [53]:
#https://pastebin.com/L312meKd

#Machine Learning in Business - Implementing New Functionality - Confidence Interval
#95% confidence = The 2.5% percentile (lower) and the 97.5% (upper) percentile
#because 97.5 - 2.5 = 95


#for ease.
region_1_values = pd.Series(region_1_values).reset_index(drop=True)
region_2_values = pd.Series(region_2_values).reset_index(drop=True)
region_3_values = pd.Series(region_3_values).reset_index(drop=True)

avg_profit_1 = region_1_values.mean()
avg_profit_2 = region_2_values.mean()
avg_profit_3 = region_3_values.mean()

##95% confidence interval
lower_1 = region_1_values.quantile(0.025)
upper_1 = region_1_values.quantile(0.975)
lower_2 = region_2_values.quantile(0.025)
upper_2 = region_2_values.quantile(0.975)
lower_3 = region_3_values.quantile(0.025)
upper_3 = region_3_values.quantile(0.975)

#risk of losses: the share of bootstrap profits that are negative, expressed as a percentage
risk_1 = (region_1_values[region_1_values < 0].count() / 1000) * 100
risk_2 = (region_2_values[region_2_values < 0].count() / 1000) * 100
risk_3 = (region_3_values[region_3_values < 0].count() / 1000) * 100
#count() gives the number of negative profits out of 1000; sum() would add up the profit values themselves


print('region_1')
print('avg profit',avg_profit_1)
print('lower quantile',lower_1)
print('upper quantile', upper_1)
print('risk of losses', risk_1)
print('region_2')
print('avg profit',avg_profit_2)
print('lower quantile',lower_2)
print('upper quantile', upper_2)
print('risk of losses', risk_2)
print('region_3')
print('avg profit',avg_profit_3)
print('lower quantile',lower_3)
print('upper quantile', upper_3)
print('risk of losses', risk_3)

region_1
avg profit 3943675.7025880385
lower quantile -1141749.7840983581
upper quantile 8865611.519924922
risk of losses 5.800000000000001
region_2
avg profit 4533438.292328027
lower quantile 739285.1680974513
upper quantile 8344381.017520697
risk of losses 1.4000000000000001
region_3
avg profit 3798290.493455105
lower quantile -1437655.2873674463
upper quantile 8577436.778678397
risk of losses 7.8


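Region 2 has the lowest risk of losses (1.4%) and the highest average profit (about USD 4.53 million). It is the only region whose risk is below the 2.5% threshold.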

### Provide findings: suggest a region for development of oil wells and justify the choice.

In [54]:
print('region_1')
print('root mean squared error', rmse_1)
print('avg profit',avg_profit_1)
print('lower quantile',lower_1)
print('upper quantile', upper_1)
print('risk of losses', risk_1)
print('region_2')
print('root mean squared error', rmse_2)
print('avg profit',avg_profit_2)
print('lower quantile',lower_2)
print('upper quantile', upper_2)
print('risk of losses', risk_2)
print('region_3')
print('root mean squared error', rmse_3)
print('avg profit',avg_profit_3)
print('lower quantile',lower_3)
print('upper quantile', upper_3)
print('risk of losses', risk_3)

region_1
root mean squared error 37.5794217150813
avg profit 3943675.7025880385
lower quantile -1141749.7840983581
upper quantile 8865611.519924922
risk of losses 5.800000000000001
region_2
root mean squared error 0.893099286775617
avg profit 4533438.292328027
lower quantile 739285.1680974513
upper quantile 8344381.017520697
risk of losses 1.4000000000000001
region_3
root mean squared error 40.02970873393434
avg profit 3798290.493455105
lower quantile -1437655.2873674463
upper quantile 8577436.778678397
risk of losses 7.8


## Conclusion

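Region 2 is the best choice. It is the only region with a risk of losses (1.4%) below the 2.5% threshold, it has the highest average profit (about USD 4.53 million), and its 95% confidence interval (about USD 0.74 million to 8.34 million) lies entirely above zero. Its model is also by far the most accurate, with an RMSE of about 0.89 thousand barrels. Regions 1 and 3 have loss risks of 5.8% and 7.8%, negative profit at the lower quantile, and RMSEs of about 37.6 and 40.0, far above zero.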
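The selection rule from the project conditions (keep regions with risk of losses below 2.5%, then take the highest average profit) can be sketched as follows. The dictionary layout and names are our own, and the figures are rounded from the output above.

```python
MAX_RISK_PERCENT = 2.5  # threshold from the project conditions

# Bootstrap results, rounded from the output above
results = {
    'region_1': {'avg_profit': 3_943_676, 'risk_percent': 5.8},
    'region_2': {'avg_profit': 4_533_438, 'risk_percent': 1.4},
    'region_3': {'avg_profit': 3_798_290, 'risk_percent': 7.8},
}

# Step 1: keep only regions under the risk threshold
eligible = {name: r for name, r in results.items()
            if r['risk_percent'] < MAX_RISK_PERCENT}

# Step 2: among those, pick the highest average profit (None if nothing qualifies)
best_region = (max(eligible, key=lambda name: eligible[name]['avg_profit'])
               if eligible else None)

print('eligible:', list(eligible))
print('best region:', best_region)
```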