# Goals and Overview

The goal of this project is to build a model that helps pick the region with the highest profit margin and identify the best locations for new wells.

# Project

## Initialization

In [None]:
# Loading necessary libraries.
import numpy as np
import pandas as pd
from scipy import stats as st
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

## Reading Data

In [None]:
# Reading Data.
data00 = pd.read_csv('./datasets/geo_data_0.csv')
data01 = pd.read_csv('./datasets/geo_data_1.csv')
data02 = pd.read_csv('./datasets/geo_data_2.csv')

In [None]:
data00

In [None]:
data01

In [None]:
data02

In [None]:
data00.info()

In [None]:
data01.info()

In [None]:
data02.info()

All data looks right and complete.

__Missing Values__

In [None]:
data00.isna().sum()

In [None]:
data01.isna().sum()

In [None]:
data02.isna().sum()

There are no missing values in any dataframe.

__Duplicate Values__

In [None]:
data00[data00.duplicated()]

In [None]:
data01[data01.duplicated()]

In [None]:
data02[data02.duplicated()]

There are no duplicates in any dataframe.

## Data Preparation

In [None]:
# Removing from 'df00' the 'id' column.
df00 = data00.drop('id', axis=1)

# Removing from 'df01' the 'id' column.
df01 = data01.drop('id', axis=1)

# Removing from 'df02' the 'id' column.
df02 = data02.drop('id', axis=1)

In [None]:
df00.info()

In [None]:
df01.info()

In [None]:
df02.info()

Apart from removing the 'id' column, nothing else was done to modify each region's data.

'id' was removed because it is an identifier with no relationship to the other features or to the target.

### Data Splitting

In [None]:
def split_data(df):
    # Assigning to 'target' the 'product' column of the dataframe.
    target = df['product']

    # Assigning to 'features' all other columns but 'product' from the dataframe.
    features = df.drop('product', axis=1)

    # Splitting data into training and validation sets (75% / 25%).
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=0.25, random_state=12345
    )
    
    return features_train, features_valid, target_train, target_valid

# Splitting data for df00
df00_features_train, df00_features_valid, df00_target_train, df00_target_valid = split_data(df00)

# Splitting data for df01
df01_features_train, df01_features_valid, df01_target_train, df01_target_valid = split_data(df01)

# Splitting data for df02
df02_features_train, df02_features_valid, df02_target_train, df02_target_valid = split_data(df02)

Data for each region has been split into training and validation sets.
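As a quick sanity check on the 75/25 split, the sketch below applies the same `train_test_split` call to a small synthetic frame. The frame and its feature column names (`feat_a`, `feat_b`, `feat_c`) are made up for illustration; only the `product` column name and the split parameters come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for one region's data: 100 rows, three made-up
# feature columns and a 'product' target.
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["feat_a", "feat_b", "feat_c"])
toy["product"] = rng.uniform(0, 190, size=100)

features = toy.drop("product", axis=1)
target = toy["product"]

# Same parameters as split_data(): 25% of the rows go to validation.
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

print(features_train.shape, features_valid.shape)  # (75, 3) (25, 3)

# The two sets should not share any rows.
overlap = features_train.index.intersection(features_valid.index)
print(len(overlap))  # 0
```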

## Model Exploration

In [None]:
def train_model(features_train, target_train, features_valid, target_valid):
    # Training Model using 'features_train' and 'target_train'.
    lr_model = LinearRegression()
    lr_model.fit(features_train, target_train)

    # Assigning to 'predicted_valid' model predictions using 'features_valid'.
    predicted_valid = lr_model.predict(features_valid)

    # Calculating RMSE, R2 and the average predicted volume on the validation set.
    rmse = np.sqrt(mean_squared_error(target_valid, predicted_valid))
    r2 = r2_score(target_valid, predicted_valid)
    avor = predicted_valid.mean()

    print("RMSE of the linear regression model on the validation set:", rmse)
    print("R2 score of the linear regression model on the validation set:", r2)
    print("Estimated Average Volume of Reserves:", avor)
    print("Estimated Average Value of Product:", avor * 4500)
    
    return predicted_valid, target_valid

# Train and evaluate model for df00
print("Region 1 Evaluation")
df00_predicted_valid, df00_target_valid = train_model(df00_features_train, df00_target_train, df00_features_valid, df00_target_valid)
print()

# Train and evaluate model for df01
print("Region 2 Evaluation")
df01_predicted_valid, df01_target_valid = train_model(df01_features_train, df01_target_train, df01_features_valid, df01_target_valid)
print()

# Train and evaluate model for df02
print("Region 3 Evaluation")
df02_predicted_valid, df02_target_valid = train_model(df02_features_train, df02_target_train, df02_features_valid, df02_target_valid)
print()

The Region 1 model has an RMSE of 37.57 and an R2 of 0.27. Its estimated average volume of reserves (92.59 units) is decent despite the variability in the predictions.

The Region 2 model does exceptionally well on both RMSE and R2; however, its estimated average volume of reserves (68.72 units) is much lower than in the other two regions.

The Region 3 model has room for improvement, with an RMSE of 40.02 and an R2 of 0.20. It has the highest estimated average volume of reserves (94.96 units), but the low scores indicate the model may not fully capture the variability in the data.
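To make the two metrics concrete, here is a minimal sketch that computes RMSE and R2 by hand on four made-up values and compares them with the scikit-learn functions used above. The numbers are purely illustrative and are not taken from any region.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only (not from the datasets).
y_true = np.array([100.0, 80.0, 120.0, 60.0])
y_pred = np.array([90.0, 85.0, 110.0, 75.0])

# RMSE: square root of the mean squared residual, in the target's units.
residuals = y_true - y_pred
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(round(rmse_manual, 4), round(rmse_sklearn, 4))
print(round(r2_manual, 4), round(r2_sklearn, 4))  # 0.775 0.775
```

An R2 near 0.2 to 0.3, as reported for Regions 1 and 3, means the model explains only a small part of the variation in `product`, which is why the average prediction can look reasonable while individual predictions are far off.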

## Profit Calculation

In [None]:
budget = 100000000

new_wells = 200

unit_value = 4500

well_cost = round(budget/new_wells, 2)

product_required = round(well_cost/unit_value, 2)

In [None]:
print('Budget available per New Well:', well_cost, '$')
print('Volume of Reserves required to develop well without losses:', product_required)
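The break-even arithmetic can be checked on its own. The sketch below repeats the calculation and, as an illustration, compares the threshold with the estimated regional averages quoted in the Model Exploration section (92.59, 68.72 and 94.96 units).

```python
budget = 100_000_000   # total budget, in dollars
new_wells = 200        # wells to be developed
unit_value = 4500      # revenue per unit of product, in dollars

# Budget available for each well.
well_cost = budget / new_wells
# Volume a well must hold for its revenue to cover its cost.
product_required = well_cost / unit_value

print(well_cost)                   # 500000.0
print(round(product_required, 2))  # 111.11

# Estimated average volumes reported for each region's validation set.
region_means = {"Region 1": 92.59, "Region 2": 68.72, "Region 3": 94.96}
for name, mean_volume in region_means.items():
    gap = product_required - mean_volume
    print(name, "average is below break-even by", round(gap, 2), "units")
```

Every regional average is below the threshold, so profitability depends on selecting the best wells rather than drilling at random.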

In [None]:
def profit(target, predictions, count):
    # Sorting predictions and selecting the targets of the top 'count' wells.
    sorted_predictions = predictions.sort_values(ascending=False)
    selected = target[sorted_predictions.index][:count]

    # Calculating profit per well and total profit.
    well_profit = selected * unit_value - well_cost
    total_profit = well_profit.sum()

    # Calculating total losses and the number of wells at or below break-even.
    loss_mask = well_profit <= 0
    losses = well_profit[loss_mask].sum()
    loss_count = well_profit[loss_mask].count()

    loss_percent = (loss_count / count) * 100

    return total_profit, loss_percent, losses
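A small worked example may help show what the profit calculation does. The sketch below uses four made-up wells and selects the top two by prediction; the constants mirror `unit_value` and `well_cost` from the cells above, and everything else is illustrative.

```python
import pandas as pd

UNIT_VALUE = 4500     # dollars per unit of product
WELL_COST = 500_000   # budget per well (100,000,000 / 200)

# Made-up predicted and actual volumes for four wells.
predictions = pd.Series([120.0, 90.0, 150.0, 100.0])
target = pd.Series([130.0, 95.0, 105.0, 110.0])
count = 2

# Pick the wells with the highest predictions, then look up their actual volumes.
top_index = predictions.sort_values(ascending=False).index[:count]
selected = target.loc[top_index]

# Profit per well is revenue from the actual volume minus the well cost.
well_profit = selected * UNIT_VALUE - WELL_COST
total_profit = well_profit.sum()
loss_percent = (well_profit <= 0).mean() * 100

print(top_index.tolist())  # [2, 0]
print(total_profit)        # 57500.0
print(loss_percent)        # 50.0
```

The well with the highest prediction (150) actually holds 105 units and loses 27,500 dollars, while the second pick earns 85,000 dollars. Selection is driven by predictions, but profit is always computed from actual volumes.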

In [None]:
def eval_region(target, predictions):
    # Bootstrapping: 1000 samples of 200 wells, drawn with replacement.
    # Relies on the global 'state' (np.random.RandomState) defined in a separate cell.
    profits = []
    losses = []
    percents = []

    for i in range(1000):
        target_subsample = target.sample(n=200, replace=True, random_state=state)
        probs_subsample = predictions[target_subsample.index]

        plus, perc, minus = profit(target_subsample, probs_subsample, 200)

        profits.append(plus)
        percents.append(perc)
        losses.append(minus)

    profits = pd.Series(profits)
    losses = pd.Series(losses)
    risk_of_losses = pd.Series(percents)

    # Average profit, 1% quantile of profit, and average share of losing wells.
    avg_profit = round(profits.mean(), 2)
    expected_profit = round(profits.quantile(0.01), 2)
    risk_loss = round(risk_of_losses.mean(), 2)

    print("Average profit:", avg_profit, "$")
    print("Profit 1% quantile:", expected_profit, "$")
    print()
    print("Risk of losses:", risk_loss, "%")
    print('Average Losses: ', round(losses.mean(), 2))

In [None]:
state = np.random.RandomState(12345)
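One detail worth knowing when bootstrapping with pandas: sampling with replacement produces repeated index labels, and label-based selection on such a Series returns every match for every repeated label, so it can return more rows than were requested. The sketch below first shows that effect on three values, then outlines a positional (NumPy-based) bootstrap that avoids it. This is an alternative sketch, not the notebook's method: `bootstrap_profit`, `sample_size`, the synthetic data, and the definition of risk as the share of samples with negative total profit are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

UNIT_VALUE = 4500
WELL_COST = 500_000

# Repeated labels expand under label-based selection.
s = pd.Series([1.0, 2.0, 3.0])
dup = s.iloc[[0, 1, 1]]          # index labels: 0, 1, 1
expanded = dup.loc[dup.index]    # label 1 matches two rows, and is asked for twice
print(len(dup), len(expanded))   # 3 5


def bootstrap_profit(target, predictions, sample_size, count, n_iter=1000, seed=12345):
    """Hypothetical helper: positional bootstrap of total profit."""
    rng = np.random.RandomState(seed)
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        # Draw row positions with replacement; positions never expand.
        pos = rng.randint(0, len(target), size=sample_size)
        sample_pred = predictions[pos]
        sample_target = target[pos]
        # Keep the 'count' wells with the highest predictions in this sample.
        best = np.argsort(sample_pred)[::-1][:count]
        profits[i] = (sample_target[best] * UNIT_VALUE - WELL_COST).sum()
    return profits


# Synthetic data for illustration only.
rng = np.random.RandomState(0)
true_volume = rng.normal(115, 30, size=500)
predicted = true_volume + rng.normal(0, 10, size=500)

profits = bootstrap_profit(true_volume, predicted, sample_size=200, count=50)
print("Average profit:", round(profits.mean(), 2))
print("Profit 1% quantile:", round(np.quantile(profits, 0.01), 2))
print("Share of samples with negative profit (%):", (profits < 0).mean() * 100)
```

Working with positions keeps each bootstrap sample at exactly `sample_size` rows, which makes the top-`count` selection easier to reason about.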

### Region 1

In [None]:
print("Estimated Average Volume of Reserves for this Region:", round(df00_predicted_valid.mean(), 2), '/', product_required)

In [None]:
predictions = pd.Series(df00_predicted_valid)
predictions

In [None]:
target = df00_target_valid.reset_index(drop=True)
target

In [None]:
# Selecting TOP 500 Wells.
sorted_predictions = predictions.sort_values(ascending=False)
selected = target[sorted_predictions.index][:500]
select_predictions = sorted_predictions.head(500).reset_index(drop=True)
select_answer = selected.reset_index(drop=True)

In [None]:
print("Expected Average Volume of Reserves for select wells based on predictions:", round(select_predictions.mean(), 2))
print("Average Volume of Reserves in selected wells:", round(selected.mean(), 2))

The expected average volume of reserves for the top 500 wells based on predicted product is 148.37 units, well above the required 111.11 units of product to cover the well cost. On average, profit can be expected from the top selected wells based on model predictions.

In [None]:
eval_region(select_answer, select_predictions)

Based on the evaluation of the region, there is a 99% probability of a profit of at least \$23,279,289, with an average of \$30,557,524.

However, on average 11.8% of the sampled wells fall at or below the break-even volume, with an average combined loss of \$1,953,354 per sample.

In [None]:
confidence_interval = st.t.interval(
    0.95, len(select_answer)-1, loc=select_answer.mean(), scale=st.sem(select_answer))

print('95% confidence interval:', confidence_interval)
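The interval computed above is a Student's t interval for the mean actual volume of reserves among the selected wells. As a sketch of what `st.t.interval` does, the example below rebuilds the same interval by hand from the mean, the standard error and the t critical value. The `volumes` array is made up for illustration.

```python
import numpy as np
from scipy import stats as st

# Made-up volumes standing in for 'select_answer'.
volumes = np.array([140.2, 151.7, 133.9, 160.4, 148.8, 129.5, 155.1, 144.3])

n = len(volumes)
mean = volumes.mean()
sem = st.sem(volumes)                 # standard error of the mean (ddof=1)
t_crit = st.t.ppf(0.975, df=n - 1)    # two-sided 95% critical value

manual_interval = (mean - t_crit * sem, mean + t_crit * sem)
scipy_interval = st.t.interval(0.95, n - 1, loc=mean, scale=sem)

print(manual_interval)
print(scipy_interval)
```

Both lines should print the same bounds. Note that this interval describes the mean volume per well, not the profit; an interval for profit would be taken from the bootstrap distribution of profits instead.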

In [None]:
# Convert the series into a DataFrame
df = pd.DataFrame(select_predictions, columns=['predicted_product'])

# Add a new column by multiplying the original values by 4500
df['predicted_value'] = round(df['predicted_product'] * 4500, 2)

df['actual_product'] = round((select_answer), 2)

df['actual_value'] = round((select_answer * unit_value), 2)

df

### Region 2

In [None]:
print("Estimated Average Volume of Reserves for this Region:", round(df01_predicted_valid.mean(), 2), '/', product_required)

The estimated average volume of reserves for Region 2 is about 68 units, well short of the roughly 111 units required to operate without losses.

In [None]:
predictions = pd.Series(df01_predicted_valid)
predictions

In [None]:
target = df01_target_valid.reset_index(drop=True)
target

In [None]:
#Selecting TOP 500 Wells

sorted_predictions = predictions.sort_values(ascending=False)
selected = target[sorted_predictions.index][:500]
select_predictions = sorted_predictions.head(500).reset_index(drop=True)
select_answer = selected.reset_index(drop=True)

In [None]:
print("Expected Volume of Reserves based on predictions:", select_predictions.mean())
print("Average Volume of Reserves in selected wells:", round(selected.mean(), 2))

The expected average volume of reserves for the top 500 wells based on predicted product is 138.39 units, meeting the required 111.11 units of product to cover the well cost. On average, profit can be expected from the top selected wells based on model predictions.

In [None]:
eval_region(select_answer, select_predictions)

Based on the evaluation of the region, there is a 99% probability of a profit of at least \$24,150,866.97, with an average of \$24,150,781.

There is a 0% risk of loss for this region, so no average loss is expected.

This region's model had a very good RMSE score and was expected to predict accurately.

In [None]:
confidence_interval = st.t.interval(
    0.95, len(select_answer)-1, loc=select_answer.mean(), scale=st.sem(select_answer))

print('95% confidence interval:', confidence_interval)

In [None]:
# Convert the series into a DataFrame
df = pd.DataFrame(select_predictions, columns=['predicted_product'])

# Add a new column by multiplying the original values by 4500
df['predicted_value'] = round(df['predicted_product'] * 4500, 2)

df['actual_product'] = round((select_answer), 2)

df['actual_value'] = round((select_answer * unit_value), 2)

df

### Region 3

In [None]:
print("Estimated Average Volume of Reserves for this Region:", df02_predicted_valid.mean())

The estimated average volume of reserves for Region 3 is the closest to the break-even target but still falls short: 94.96 units against the required 111.11, a gap of about 16 units.

In [None]:
predictions = pd.Series(df02_predicted_valid)
predictions

In [None]:
target = df02_target_valid.reset_index(drop=True)
target

In [None]:
sorted_predictions = predictions.sort_values(ascending=False)
selected = target[sorted_predictions.index][:500]
select_predictions = sorted_predictions.head(500).reset_index(drop=True)
select_answer = selected.reset_index(drop=True)

In [None]:
print("Expected Volume of Reserves based on predictions:", select_predictions.mean())
print("Average Volume of Reserves in selected wells:", round(selected.mean(), 2))

The expected average volume of reserves for the top 500 wells based on predicted product is 142.32 units, meeting the required 111.11 units of product to cover the well cost. On average, profit can be expected from the top selected wells based on model predictions.

In [None]:
eval_region(select_answer, select_predictions)

Based on the evaluation of the region, there is a 99% probability of a profit of at least \$16,839,580, with an average of \$25,597,463.

There is a 16.77% risk of loss and an expected average loss of \$3,270,263.48 for this region.

In [None]:
confidence_interval = st.t.interval(
    0.95, len(select_answer)-1, loc=select_answer.mean(), scale=st.sem(select_answer))

print('95% confidence interval:', confidence_interval)

## Conclusion

Region 2 is the top choice for new wells: it has the lowest risk of loss (0%) and the highest 1% quantile profit (\$24,150,866.97). It does not have the highest average profit, but it is the most consistent region.

Region 1 is the second choice. It has the highest average profit (\$30,557,524) and a good 1% quantile profit (\$23,279,289), but carries more risk of loss than Region 2.

Region 3 has the lowest 1% quantile profit (\$16,839,580) and the highest risk of loss (16.77%), so it is omitted from further consideration.