Problem Statement

This challenge uses time series injection data and monitoring data from a carbon capture well to predict changes (deltas) in the well's injection rate. Correlating the change in injection rate with the behavior of other well parameters provides a checkpoint against carbon migration from the well or other losses during the process. Code that predicts injection rate deltas from monitoring well data can therefore be used to validate carbon containment throughout injection.
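
To make the prediction target concrete, the sketch below derives an injection rate delta from a time-ordered rate series. The column names `timestamp`, `inj_rate`, and `inj_rate_delta` and the values are hypothetical stand-ins, not names from the challenge dataset.

```python
import pandas as pd

def add_rate_delta(df, rate_col="inj_rate", time_col="timestamp"):
    """Return a time-sorted copy with a column holding the change in injection rate."""
    out = df.sort_values(time_col).reset_index(drop=True)
    # diff() gives rate[t] - rate[t-1]; the first row has no predecessor, so it is NaN
    out["inj_rate_delta"] = out[rate_col].diff()
    return out

# Toy data standing in for the real injection log (values are made up)
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2020-01-01 02:00", "2020-01-01 00:00", "2020-01-01 01:00"]
    ),
    "inj_rate": [12.0, 10.0, 11.5],
})
with_delta = add_rate_delta(toy)
print(with_delta)
```

Sorting by time before differencing matters: a delta computed on unordered rows would compare readings that are not adjacent in time.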

In [1]:
# Using AI to Validate Carbon Containment in the Illinois Basin

## Importing Libraries

### First, we will import the necessary Python libraries, such as pandas, numpy, and scikit-learn.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


Loading and Preprocessing Data

Load the dataset and perform some basic preprocessing steps: removing rows with missing values and converting categorical variables into one-hot encoded format. Scaling with StandardScaler is applied later, after the data is split into training and testing sets.

In [2]:
# Load the dataset
data = pd.read_csv('dataset.csv')

# Drop any rows with missing values
data.dropna(inplace=True)

# Convert categorical variables into one-hot encoded format
data = pd.get_dummies(data, columns=['categorical_var1', 'categorical_var2'])



FileNotFoundError: [Errno 2] No such file or directory: 'dataset.csv'
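
Since `dataset.csv` is a placeholder path, the same preprocessing steps can be tried on a small in-memory frame. This is a minimal sketch; the `preprocess` helper and the `pressure`, `zone`, and `target` columns are hypothetical names used only for illustration.

```python
import pandas as pd

def preprocess(df, categorical_cols):
    """Drop incomplete rows, then one-hot encode the named categorical columns."""
    cleaned = df.dropna()
    return pd.get_dummies(cleaned, columns=categorical_cols)

# Toy frame with one missing value and one categorical column (made-up values)
toy = pd.DataFrame({
    "pressure": [1.0, None, 3.0, 4.0],
    "zone": ["upper", "lower", "lower", "upper"],
    "target": [0, 1, 1, 0],
})
processed = preprocess(toy, ["zone"])
print(processed.columns.tolist())
```

Note that `pd.get_dummies` appends the encoded columns after the untouched ones, so a column such as `target` is no longer the last column once encoding has run.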

Exploratory Data Analysis

Explore the data to gain insights into its structure and characteristics. Generate visualizations to help identify any patterns or trends in the data.

In [None]:
import matplotlib.pyplot as plt

# Generate histograms of each feature
data.hist(bins=20, figsize=(20, 20))
plt.show()

# Generate a scatter plot of features
plt.scatter(data['feature1'], data['feature2'])
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()


Feature Selection

Use correlation analysis to identify the most important features for the machine learning model. This will help to improve the accuracy of predictions and reduce the risk of overfitting.

In [None]:
# Compute correlation matrix
corr_matrix = data.corr()

# Generate heatmap of correlation matrix
import seaborn as sns
sns.heatmap(corr_matrix)
plt.show()

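# List feature pairs whose absolute correlation exceeds the threshold
corr_threshold = 0.5
rows, cols = np.where(np.abs(corr_matrix.to_numpy()) > corr_threshold)
# Keep each pair once (i < j also skips the diagonal)
corr_pairs = [(corr_matrix.iloc[i, j], i, j) for i, j in zip(rows, cols) if i < j]
# Strongest relationships first, regardless of sign
corr_pairs.sort(key=lambda pair: abs(pair[0]), reverse=True)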
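
The pair list above ranks relationships but does not yet pick features. One possible way to turn the correlation analysis into a selection step is sketched below: keep the features whose absolute correlation with the target exceeds the threshold. The helper name `select_by_target_correlation` and the toy columns are assumptions for illustration, and the 0.5 threshold simply mirrors the value used above.

```python
import numpy as np
import pandas as pd

def select_by_target_correlation(df, target_col, threshold=0.5):
    """Return feature names whose |correlation| with the target exceeds threshold."""
    # Correlation of every column with the target, excluding the target itself
    corr = df.corr()[target_col].drop(target_col)
    strong = corr[corr.abs() > threshold]
    # Order the surviving features from strongest to weakest
    return strong.abs().sort_values(ascending=False).index.tolist()

# Synthetic data: one feature tracks the target closely, the other is pure noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "related": x + rng.normal(scale=0.1, size=200),
    "unrelated": rng.normal(size=200),
    "target": x,
})
selected = select_by_target_correlation(toy, "target")
print(selected)
```

Correlation only captures linear relationships, so a feature dropped by this filter may still carry useful nonlinear signal.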


Model Development

Split the preprocessed dataset into training and testing sets so that the machine learning model can be evaluated on unseen data, then scale the features.

In [None]:
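# Split data into training and testing sets
# Select features by name: one-hot columns are appended last, so 'target' is not the final column
X = data.drop(columns=['target'])
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)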

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
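
Because the challenge data is a time series, a shuffled split can let the model train on samples recorded after the ones it is tested on. A chronological split is one alternative, sketched here under the assumption that the rows are already in time order. The toy columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame whose row order stands in for time order (made-up values)
toy = pd.DataFrame({
    "feature1": np.arange(10, dtype=float),
    "feature2": np.arange(10, dtype=float) ** 2,
    "target": [0, 1] * 5,
})
X = toy.drop(columns=["target"])
y = toy["target"]

# shuffle=False keeps row order, so the last 20% of rows form the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test.index.tolist())
```

Fitting the scaler on the training rows alone keeps test-set statistics out of the model, which is the same reason the split comes before scaling.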


Model Selection and Tuning

Experiment with different machine learning algorithms, such as logistic regression, decision trees, and random forests, and tune the hyperparameters of each model to find the best combination for the task at hand.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Define the model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Evaluate the model on the test set
y_pred = model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy: ", test_accuracy)

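# Define hyperparameters to tune, pairing each solver only with penalties it supports
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = [
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': C_values},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
    # elasticnet requires saga and an l1_ratio; 0.5 is an illustrative choice
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.5], 'C': C_values},
]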

# Define grid search
grid = GridSearchCV(model, param_grid, cv=5, n_jobs=-1)

# Fit the grid search
grid.fit(X_train_scaled, y_train)

# Print best hyperparameters
print("Best hyperparameters:", grid.best_params_)

# Retrain the model with best hyperparameters
model_tuned = LogisticRegression(**grid.best_params_)
model_tuned.fit(X_train_scaled, y_train)

# Make predictions on testing set
y_pred_tuned = model_tuned.predict(X_test_scaled)

# Calculate accuracy
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print("Tuned Accuracy:", accuracy_tuned)
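
The cell above tunes only logistic regression. To compare it with the decision tree and random forest mentioned earlier, one option is to score each candidate with cross-validation, as in this minimal sketch. It uses synthetic data from `make_classification` in place of the challenge dataset, and the settings (`max_iter`, `n_estimators`, `cv=5`) are illustrative assumptions rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training data
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean 5-fold cross-validated accuracy for each candidate model
scores = {
    name: cross_val_score(estimator, X, y, cv=5).mean()
    for name, estimator in candidates.items()
}

# Report candidates from best to worst mean accuracy
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Comparing on cross-validated training data keeps the test set untouched until a final model has been chosen.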


Submission

Once the machine learning model is developed and evaluated, contestants submit the ML code to SPEMLChallenge@gmail.com. The results are submitted to the XEEK.org platform for preliminary scoring.