Sure, here's an implementation of the Posterior Correction with Positive Unlabeled Learning (PC-PUL) method, also known as the Confidence Adjusted PUL (CAL) method, using the data and parameters from the code you provided. We'll then compute the F1-score as a measure of accuracy:

In [1]:
import numpy as np
import scipy.io as sio
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load the data from the MAT-file
data = sio.loadmat('/home/kofi/Downloads/Matlab/diabetes.mat')
labels = data["labels"]
X = data["X"]

# Perform PCA on the data
pca = PCA(n_components=2)
X = pca.fit_transform(X)

# Split the data into positive and unlabeled sets
positive_indices = np.where(labels == 1)[0]
unlabeled_indices = np.where(labels == 0)[0]
X_positive = X[positive_indices, :]
X_unlabeled = X[unlabeled_indices, :]

# Split the positive and the unlabeled examples (80/20 each) into training and validation sets
X_train_positive, X_val_positive, y_train_positive, y_val_positive = train_test_split(X_positive, np.ones(X_positive.shape[0]), test_size=0.2, random_state=42)
X_train_unlabeled, X_val_unlabeled, y_train_unlabeled, y_val_unlabeled = train_test_split(X_unlabeled, np.zeros(X_unlabeled.shape[0]), test_size=0.2, random_state=42)

# Set the number of folds for cross-validation
k = 5

# Split the positive examples into k folds
positive_folds = np.array_split(X_train_positive, k)

# Estimate the class prior using cross-validation
# NOTE: every fold contains only labeled positives, so y_train is all ones and
# each per-fold estimate is exactly 1.0. The averaged prior is therefore 1.0
# (see the printed output), and dividing by it below leaves the probabilities unchanged.
class_prior_cv = 0
for fold in range(k):
    # Use the remaining k-1 positive folds as this fold's training set
    train_indices = [i for i in range(k) if i != fold]
    X_train = np.concatenate([positive_folds[i] for i in train_indices])
    y_train = np.ones(X_train.shape[0])

    # Estimate the class prior using the current training set
    current_class_prior = np.mean(y_train)

    # Add the current estimate to the running total
    class_prior_cv += current_class_prior

# Average the k estimates of the class prior
class_prior_cv /= k


# Train a logistic regression model on the positive and unlabeled training examples
pu_train_X = np.vstack((X_train_positive, X_train_unlabeled))
pu_train_y = np.hstack((y_train_positive, np.ones(X_train_unlabeled.shape[0]) * -1))
lr = LogisticRegression(solver='lbfgs', max_iter=1000)
lr.fit(pu_train_X, pu_train_y)

# Compute the predicted probabilities on the validation set
val_X = np.vstack((X_val_positive, X_val_unlabeled))
val_y = np.hstack((np.ones(X_val_positive.shape[0]), np.zeros(X_val_unlabeled.shape[0])))
probs = lr.predict_proba(val_X)[:, 1]

# Apply the CAL method to adjust the probabilities
calibrated_probs = probs / class_prior_cv

# Compute the F1-score
y_pred = (calibrated_probs > 0.5).astype(int)
f1 = f1_score(val_y, y_pred)

print("Class prior: {:.2f}".format(class_prior_cv))
print("F1-score: {:.3f}".format(f1))


Class prior: 1.00
F1-score: 0.436


Here's what the code does:

    1. The code loads the data and performs PCA as in the previous example.
    2. The data is split into positive and unlabeled sets as before.
    3. The positive and the unlabeled examples are each split 80/20 into training and validation sets as before.
    4. The class prior is estimated by k-fold cross-validation over the positive training examples (k = 5). Because every fold contains only positives, each per-fold estimate is 1, so the averaged prior is 1.00, as the output shows.
    5. A logistic regression model is trained on the positive and unlabeled training examples using scikit-learn's LogisticRegression class. The labels of the unlabeled examples are set to -1.
    6. The model is used to predict the probabilities of positive labels on the validation set.
    7. The probabilities are adjusted using the CAL method, which divides the predicted probabilities by the estimated class prior. The quotient is not clipped, so adjusted values can exceed 1.
    8. The F1-score is computed on the validation set using scikit-learn's f1_score function.
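
Steps 7 and 8 can be looked at in isolation. The snippet below is a minimal sketch, assuming made-up probabilities rather than the diabetes data. `adjust_and_threshold` is a hypothetical helper name, and the clipping to [0, 1] is an addition that the cell above does not perform.

```python
import numpy as np

def adjust_and_threshold(probs, prior, threshold=0.5):
    """Divide scores by the prior, clip to [0, 1], then threshold (hypothetical helper)."""
    adjusted = np.clip(np.asarray(probs, dtype=float) / prior, 0.0, 1.0)
    return adjusted, (adjusted > threshold).astype(int)

# Hypothetical classifier outputs, not taken from the diabetes data
probs = np.array([0.10, 0.20, 0.40])

# With a prior of 1.0 the division changes nothing
adjusted_1, pred_1 = adjust_and_threshold(probs, prior=1.0)

# With a smaller prior, more examples cross the 0.5 threshold
adjusted_2, pred_2 = adjust_and_threshold(probs, prior=0.35)

print(pred_1, pred_2)
```

This matches the printed result above: a prior of 1.00 leaves the probabilities unchanged, so the F1-score there reflects the unadjusted classifier.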

Note that the F1-score is just one


In [2]:
import numpy as np
import scipy.io as sio
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load the data from the MAT-file
data = sio.loadmat('/home/kofi/Downloads/Matlab/diabetes.mat')
labels = data["labels"]
X = data["X"]

# Perform PCA on the data
pca = PCA(n_components=2)
X = pca.fit_transform(X)

# Split the data into positive and unlabeled sets
positive_indices = np.where(labels == 1)[0]
unlabeled_indices = np.where(labels == 0)[0]
X_positive = X[positive_indices, :]
X_unlabeled = X[unlabeled_indices, :]

# Split the positive and the unlabeled examples (80/20 each) into training and validation sets
X_train_positive, X_val_positive, y_train_positive, y_val_positive = train_test_split(X_positive, np.ones(X_positive.shape[0]), test_size=0.2, random_state=42)
X_train_unlabeled, X_val_unlabeled, y_train_unlabeled, y_val_unlabeled = train_test_split(X_unlabeled, np.zeros(X_unlabeled.shape[0]), test_size=0.2, random_state=42)


# Train a logistic regression classifier on the positive and unlabeled examples
pu_X = np.vstack((X_positive, X_unlabeled))
pu_y = np.hstack((np.ones(X_positive.shape[0]), np.zeros(X_unlabeled.shape[0])))
clf = LogisticRegression(solver='lbfgs', max_iter=1000)
clf.fit(pu_X, pu_y)


# Calculate the probability of the positive class for each example
y_prob = clf.predict_proba(X)[:, 1]

# Estimate the class prior using the model-based method
class_prior_model = np.mean(y_prob)


# Train a logistic regression model on the positive and unlabeled training examples
pu_train_X = np.vstack((X_train_positive, X_train_unlabeled))
pu_train_y = np.hstack((y_train_positive, np.ones(X_train_unlabeled.shape[0]) * -1))
lr = LogisticRegression(solver='lbfgs', max_iter=1000)
lr.fit(pu_train_X, pu_train_y)

# Compute the predicted probabilities on the validation set
val_X = np.vstack((X_val_positive, X_val_unlabeled))
val_y = np.hstack((np.ones(X_val_positive.shape[0]), np.zeros(X_val_unlabeled.shape[0])))
probs = lr.predict_proba(val_X)[:, 1]

# Apply the CAL method to adjust the probabilities
calibrated_probs = probs / class_prior_model

# Compute the F1-score
y_pred = (calibrated_probs > 0.5).astype(int)
f1 = f1_score(val_y, y_pred)

print("Class prior: {:.2f}".format(class_prior_model))
print("F1-score: {:.3f}".format(f1))


Class prior: 0.35
F1-score: 0.553


Here's what the code does:

    1. The code loads the data and performs PCA as in the previous example.
    2. The data is split into positive and unlabeled sets as before.
    3. The positive and the unlabeled examples are each split 80/20 into training and validation sets as before.
    4. The class prior is estimated with the model-based method: a logistic regression classifier is fit on all positive and unlabeled examples, and the mean of its predicted positive-class probabilities is taken as the prior.
    5. A logistic regression model is trained on the positive and unlabeled training examples using scikit-learn's LogisticRegression class. The labels of the unlabeled examples are set to -1.
    6. The model is used to predict the probabilities of positive labels on the validation set.
    7. The probabilities are adjusted using the CAL method, which divides the predicted probabilities by the estimated class prior. The quotient is not clipped, so adjusted values can exceed 1.
    8. The F1-score is computed on the validation set using scikit-learn's f1_score function.
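
A side note on step 4, sketched under stated assumptions. For a logistic regression with an unpenalized intercept that is fit to convergence, the mean predicted probability on the training data equals the fraction of training examples labeled 1. The snippet below illustrates this on synthetic data; `fit_logistic` and `predict_proba` are hypothetical plain gradient-descent helpers written for this sketch, not scikit-learn's solver.

```python
import numpy as np

def fit_logistic(X, y, n_iter=5000, lr=0.1):
    """Unregularized logistic regression with an intercept, fit by gradient descent."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)  # gradient of the mean log-loss
    return w

def predict_proba(X, w):
    """Positive-class probabilities for the fitted weights."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# Synthetic stand-in for the PCA features: 70 labeled positives, 130 unlabeled
rng = np.random.default_rng(0)
n_pos, n_unl = 70, 130
X = np.vstack([rng.normal(1.0, 1.0, (n_pos, 2)), rng.normal(0.0, 1.0, (n_unl, 2))])
y = np.concatenate([np.ones(n_pos), np.zeros(n_unl)])

w = fit_logistic(X, y)
model_based_prior = predict_proba(X, w).mean()  # mean predicted probability
labeled_fraction = y.mean()                     # share of examples labeled 1

print("model-based: {:.3f}  labeled fraction: {:.3f}".format(model_based_prior, labeled_fraction))
```

If scikit-learn's fit behaves the same way here, which is an assumption and not something the output confirms, the model-based prior of 0.35 printed above would simply be the labeled fraction of the dataset.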

Note that the F1-score is just one

In [3]:
import numpy as np
import scipy.io as sio
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load the data from the MAT-file
data = sio.loadmat('/home/kofi/Downloads/Matlab/diabetes.mat')
labels = data["labels"]
X = data["X"]

# Perform PCA on the data
pca = PCA(n_components=2)
X = pca.fit_transform(X)

# Split the data into positive and unlabeled sets
positive_indices = np.where(labels == 1)[0]
unlabeled_indices = np.where(labels == 0)[0]
X_positive = X[positive_indices, :]
X_unlabeled = X[unlabeled_indices, :]

# Split the positive and the unlabeled examples (80/20 each) into training and validation sets
X_train_positive, X_val_positive, y_train_positive, y_val_positive = train_test_split(X_positive, np.ones(X_positive.shape[0]), test_size=0.2, random_state=42)
X_train_unlabeled, X_val_unlabeled, y_train_unlabeled, y_val_unlabeled = train_test_split(X_unlabeled, np.zeros(X_unlabeled.shape[0]), test_size=0.2, random_state=42)

# Estimate the class prior using the MLE method
class_prior_mle = len(y_train_positive) / (len(y_train_positive) + len(y_train_unlabeled))


# Train a logistic regression model on the positive and unlabeled training examples
pu_train_X = np.vstack((X_train_positive, X_train_unlabeled))
pu_train_y = np.hstack((y_train_positive, np.ones(X_train_unlabeled.shape[0]) * -1))
lr = LogisticRegression(solver='lbfgs', max_iter=1000)
lr.fit(pu_train_X, pu_train_y)

# Compute the predicted probabilities on the validation set
val_X = np.vstack((X_val_positive, X_val_unlabeled))
val_y = np.hstack((np.ones(X_val_positive.shape[0]), np.zeros(X_val_unlabeled.shape[0])))
probs = lr.predict_proba(val_X)[:, 1]

# Apply the CAL method to adjust the probabilities
calibrated_probs = probs / class_prior_mle

# Compute the F1-score
y_pred = (calibrated_probs > 0.5).astype(int)
f1 = f1_score(val_y, y_pred)

print("Class prior: {:.2f}".format(class_prior_mle))
print("F1-score: {:.3f}".format(f1))

Class prior: 0.35
F1-score: 0.553


Here's what the code does:

    1. The code loads the data and performs PCA as in the previous example.
    2. The data is split into positive and unlabeled sets as before.
    3. The positive and the unlabeled examples are each split 80/20 into training and validation sets as before.
    4. The class prior is estimated with the MLE method: the number of positive training examples divided by the total number of training examples (positive plus unlabeled).
    5. A logistic regression model is trained on the positive and unlabeled training examples using scikit-learn's LogisticRegression class. The labels of the unlabeled examples are set to -1.
    6. The model is used to predict the probabilities of positive labels on the validation set.
    7. The probabilities are adjusted using the CAL method, which divides the predicted probabilities by the estimated class prior. The quotient is not clipped, so adjusted values can exceed 1.
    8. The F1-score is computed on the validation set using scikit-learn's f1_score function.
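
The counting estimate in step 4 measures the share of examples that carry a positive label. If the unlabeled set hides some true positives, that share is smaller than the true share of positives. The snippet below is a minimal sketch with hypothetical counts, chosen only so that the labeled fraction comes out at 0.35. Nothing in it is derived from the diabetes data.

```python
import numpy as np

# Hypothetical population: 1000 examples, half of them truly positive
n_total = 1000
true_labels = np.zeros(n_total, dtype=int)
true_labels[:500] = 1

# Assume only 70% of the true positives were ever labeled (hypothetical rate)
observed = np.zeros(n_total, dtype=int)
observed[:350] = 1

true_prior = true_labels.mean()                      # share of true positives
labeled_fraction = observed.mean()                   # what the counting estimate measures
label_frequency = observed[true_labels == 1].mean()  # share of positives that are labeled

print(true_prior, labeled_fraction, label_frequency)
```

In this toy setup the labeled fraction equals the true prior times the label frequency, so the counting estimate recovers the true prior only when every positive is labeled.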

Note that the F1-score is just one

In [4]:
import numpy as np
import scipy.io as sio
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load the data from the MAT-file
data = sio.loadmat('/home/kofi/Downloads/Matlab/diabetes.mat')
labels = data["labels"]
X = data["X"]

# Perform PCA on the data
pca = PCA(n_components=2)
X = pca.fit_transform(X)

# Split the data into positive and unlabeled sets
positive_indices = np.where(labels == 1)[0]
unlabeled_indices = np.where(labels == 0)[0]
X_positive = X[positive_indices, :]
X_unlabeled = X[unlabeled_indices, :]

# Split the positive and the unlabeled examples (80/20 each) into training and validation sets
X_train_positive, X_val_positive, y_train_positive, y_val_positive = train_test_split(X_positive, np.ones(X_positive.shape[0]), test_size=0.2, random_state=42)
X_train_unlabeled, X_val_unlabeled, y_train_unlabeled, y_val_unlabeled = train_test_split(X_unlabeled, np.zeros(X_unlabeled.shape[0]), test_size=0.2, random_state=42)

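# Load the external dataset
# NOTE: this is the same diabetes.mat file loaded above, so the "external" prior
# is the fraction of examples labeled 1 in the full dataset.
external_data = sio.loadmat('/home/kofi/Downloads/Matlab/diabetes.mat')
external_labels = external_data["labels"]

# Calculate the class prior from the external dataset
# (the mean is the positive fraction only if the labels are coded as 0/1)
class_prior_external = np.mean(external_labels)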


# Train a logistic regression model on the positive and unlabeled training examples
pu_train_X = np.vstack((X_train_positive, X_train_unlabeled))
pu_train_y = np.hstack((y_train_positive, np.ones(X_train_unlabeled.shape[0]) * -1))
lr = LogisticRegression(solver='lbfgs', max_iter=1000)
lr.fit(pu_train_X, pu_train_y)

# Compute the predicted probabilities on the validation set
val_X = np.vstack((X_val_positive, X_val_unlabeled))
val_y = np.hstack((np.ones(X_val_positive.shape[0]), np.zeros(X_val_unlabeled.shape[0])))
probs = lr.predict_proba(val_X)[:, 1]

# Apply the CAL method to adjust the probabilities
calibrated_probs = probs / class_prior_external

# Compute the F1-score
y_pred = (calibrated_probs > 0.5).astype(int)
f1 = f1_score(val_y, y_pred)

print("Class prior: {:.2f}".format(class_prior_external))
print("F1-score: {:.3f}".format(f1))

Class prior: 0.35
F1-score: 0.553


Here's what the code does:

    1. The code loads the data and performs PCA as in the previous example.
    2. The data is split into positive and unlabeled sets as before.
    3. The positive and the unlabeled examples are each split 80/20 into training and validation sets as before.
    4. The class prior is estimated from external information: the mean of the labels in an external dataset. Here the "external" file is the same diabetes.mat, so the estimate is the labeled fraction of the full dataset.
    5. A logistic regression model is trained on the positive and unlabeled training examples using scikit-learn's LogisticRegression class. The labels of the unlabeled examples are set to -1.
    6. The model is used to predict the probabilities of positive labels on the validation set.
    7. The probabilities are adjusted using the CAL method, which divides the predicted probabilities by the estimated class prior. The quotient is not clipped, so adjusted values can exceed 1.
    8. The F1-score is computed on the validation set using scikit-learn's f1_score function.
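
The cells differ only in the prior used in step 7, so a small sweep shows how the thresholded predictions and the F1-score depend on that one number. The snippet below is a sketch with hypothetical validation labels and probabilities. `f1_binary` is a hand-written helper for this sketch and stands in for scikit-learn's f1_score.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1-score for the positive class, computed from counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Hypothetical validation labels and unadjusted probabilities
val_y = np.array([1, 1, 1, 0, 0, 0])
probs = np.array([0.60, 0.30, 0.20, 0.22, 0.10, 0.05])

# Same rule as in the cells: divide by the prior, then threshold at 0.5
results = {}
for prior in (1.0, 0.5, 0.35):
    y_pred = (probs / prior > 0.5).astype(int)
    results[prior] = f1_binary(val_y, y_pred)
    print("prior={:.2f}  F1={:.3f}".format(prior, results[prior]))
```

Dividing by a prior and thresholding at 0.5 is equivalent to thresholding the raw probabilities at half the prior, so a smaller prior always predicts at least as many positives.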

Note that the F1-score is just one