
---

## Named Entity Recognition (NER) for Protein-Protein Interactions (PPI) in Biomedical Text Mining

### Classical Computing Approach:

1. Select a corpus of biomedical literature related to protein-protein interactions.

2. Preprocess the corpus by:
   - Removing stop words
   - Removing punctuation
   - Applying stemming or lemmatization

3. Extract features from the preprocessed text, such as:
   - Word shape
   - Context
   - Part-of-speech (POS) tags
   - Chunking

4. Train a machine learning model, such as:
   - Conditional Random Fields (CRF)
   - Support Vector Machines (SVM)
   - Hidden Markov Models (HMM)

5. Evaluation:
   - Evaluate the model's performance using metrics such as precision, recall, and F1-score.
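
The token-level features listed in step 3 can be sketched as follows. This is a minimal illustration rather than a full NER pipeline: `word_shape` and `token_features` are hypothetical helper names, the POS tags are assigned by hand instead of by a tagger, and the one-token context window is an assumed choice.

```python
import re

def word_shape(token):
    """Coarse shape: uppercase -> X, lowercase -> x, digit -> d (e.g. "p53" -> "xdd")."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)

def token_features(tokens, pos_tags, i):
    """Features for token i: its own form, shape, and POS tag plus its neighbours."""
    return {
        "lower": tokens[i].lower(),
        "shape": word_shape(tokens[i]),
        "pos": pos_tags[i],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Protein", "A", "interacts", "with", "Protein", "B"]
pos_tags = ["NNP", "NNP", "VBZ", "IN", "NNP", "NNP"]  # hand-assigned for illustration
sentence_features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(sentence_features[1])
```

One feature dictionary per token is a layout that sequence models such as CRFs can consume, which is why the features are built per token rather than per sentence.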

### Quantum Computing Approach:

1. Use a quantum machine learning (QML) technique, such as Quantum Kernel Estimation (QKE), to derive features or kernel values from the preprocessed text.

2. Train a quantum machine learning model, such as:
   - Quantum Support Vector Machine (QSVM)
   - Quantum Neural Network (QNN)

3. Evaluation:
   - Evaluate the model's performance using metrics such as precision, recall, and F1-score.

4. Compare the efficiency of classical and quantum computing approaches.
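
The idea behind quantum kernel estimation can be illustrated without quantum hardware. The sketch below assumes a deliberately simple angle encoding (one RY rotation per feature, no entanglement), which is not the `ZZFeatureMap` used later in this notebook, and simulates it in plain Python. The kernel value is the squared overlap of two encoded states; `encode`, `quantum_kernel`, and the sample vectors are hypothetical.

```python
import math

def encode(x):
    """Statevector of RY(x_i)|0> on each qubit, as a flat list of real amplitudes."""
    state = [1.0]
    for angle in x:
        qubit = [math.cos(angle / 2), math.sin(angle / 2)]
        # Tensor product of the state built so far with the new qubit
        state = [a * b for a in state for b in qubit]
    return state

def quantum_kernel(x, y):
    """Fidelity |<phi(x)|phi(y)>|^2 between the two encoded states."""
    overlap = sum(a * b for a, b in zip(encode(x), encode(y)))
    return overlap ** 2

samples = [[0.1, 0.9], [0.2, 1.0], [2.5, 0.3]]  # made-up 2-feature inputs
gram = [[quantum_kernel(a, b) for b in samples] for a in samples]
for row in gram:
    print([round(value, 3) for value in row])
```

A Gram matrix such as `gram` could then be handed to a classical SVM that accepts precomputed kernels, for example scikit-learn's `SVC(kernel='precomputed')`.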

---

In [5]:
# Import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

In [6]:
corpus = [
    "Protein A interacts with Protein B to activate pathway X.",
    "Inhibition of Protein C prevents interaction with Protein D.",
    "Protein E is a key regulator of pathway Y.",
    "Protein F is a downstream target of pathway Z."
]

In [7]:
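# One label per sentence in corpus: 1 = describes a protein-protein interaction, 0 = does not
# (assumed labeling of the toy corpus)
labels = [1, 1, 0, 0]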

In [8]:
# Preprocessing
import string

nltk.download('stopwords')
# Tokenizer models for word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

def preprocess(text):
    # Tokenize
    tokens = word_tokenize(text)
    # Lowercase
    tokens = [w.lower() for w in tokens]
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Remove stopwords and the empty strings left behind by punctuation-only tokens
    stop_words = set(stopwords.words('english'))
    words = [w for w in stripped if w and w not in stop_words]
    return ' '.join(words)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:

# Preprocessing
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [ps.stem(token) for token in tokens if token.isalnum() and token not in stop_words]
    return " ".join(filtered_tokens)

preprocessed_corpus = [preprocess_text(doc) for doc in corpus]

# Feature extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_corpus)

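# Labels: a classifier cannot be fitted on a single class, so the toy corpus is divided into
# sentences that describe an interaction and sentences that do not (assumed labeling)
y = ['PPI', 'PPI', 'non-PPI', 'non-PPI']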

# Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Support Vector Machine (SVM) model
svm_classifier = LinearSVC()
svm_classifier.fit(X_train, y_train)

# Evaluate the model (zero_division=0 avoids undefined-metric warnings on the tiny test set)
predictions = svm_classifier.predict(X_test)
print(classification_report(y_test, predictions, zero_division=0))

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.svm import SVC

# Split data into training and testing sets (labels needs one entry per document in corpus)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train a Support Vector Machine (SVM) model
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

# Evaluation of the model
y_pred = svm_classifier.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, zero_division=0)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)

print("Accuracy:", accuracy)
print("F1 Score:", f1)
print("Precision:", precision)
print("Recall:", recall)

In [None]:
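# Quantum computing libraries are not yet mature for NLP tasks, so this is a conceptual example.
# qiskit.aqua (QSVM, QuantumInstance) and the Aer/execute/assemble imports from the top-level
# qiskit package are no longer available in current Qiskit releases; this cell assumes the
# qiskit-machine-learning package is installed instead.
from qiskit import QuantumCircuit, transpile
from qiskit.circuit.library import ZZFeatureMap
from qiskit_machine_learning.algorithms import QSVC
from qiskit_machine_learning.kernels import FidelityQuantumKernel
from sklearn.decomposition import TruncatedSVD

In [None]:
# Preprocessing is assumed to be done classically before quantum feature extraction
def quantum_text_classification(train_features, train_labels, test_features, test_labels):
    # Quantum feature map with one qubit per input feature
    feature_map = ZZFeatureMap(feature_dimension=train_features.shape[1], reps=2)
    # Each kernel entry is the fidelity between two feature-map states
    kernel = FidelityQuantumKernel(feature_map=feature_map)
    qsvc = QSVC(quantum_kernel=kernel)
    qsvc.fit(train_features, train_labels)
    return qsvc.score(test_features, test_labels)

# Labels: two classes are needed to fit a classifier (assumed labeling of the toy corpus)
labels = ['PPI', 'PPI', 'non-PPI', 'non-PPI']
# Reduce the TF-IDF matrix to 2 features so the feature map needs only 2 qubits
X_reduced = TruncatedSVD(n_components=2, random_state=42).fit_transform(X)
Xq_train, Xq_test, yq_train, yq_test = train_test_split(X_reduced, labels, test_size=0.2, random_state=42)
# Text classification on the sample corpus
accuracy = quantum_text_classification(Xq_train, yq_train, Xq_test, yq_test)
print(accuracy)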

# Quantum feature extraction
import numpy as np
from qiskit_aer import AerSimulator  # replaces Aer.get_backend('qasm_simulator') and execute()

def quantum_feature_extraction(text):
    # Placeholder circuit: it prepares a Bell state and does not yet depend on `text`,
    # so every document yields the same features up to shot noise
    qc = QuantumCircuit(2)
    qc.h(0)
    qc.cx(0, 1)
    qc.barrier()

    # Measure quantum state
    qc.measure_all()

    # Simulate quantum circuit
    simulator = AerSimulator()
    result = simulator.run(transpile(qc, simulator), shots=1000).result()
    counts = result.get_counts()

    # Features are the relative frequencies of the '00' and '11' outcomes
    features = np.array([counts.get('00', 0), counts.get('11', 0)], dtype=float)
    return features / features.sum()

In [None]:
# Quantum machine learning model training
class ThresholdClassifier:
    # Hypothetical stand-in model so that the evaluation step has a predict() to call:
    # it predicts 1 when the '11' frequency exceeds the threshold
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, features):
        return int(features[1] > self.threshold)

def quantum_model_training(features, labels):
    # Encode the first label as a quantum state (nothing is fitted to `features` here)
    qc = QuantumCircuit(1)
    if labels[0] == 1:
        qc.x(0)
    qc.h(0)
    qc.barrier()

    # Measure quantum state
    qc.measure_all()

    # Simulate quantum circuit
    shots = 1000
    simulator = AerSimulator()
    result = simulator.run(transpile(qc, simulator), shots=shots).result()
    counts = result.get_counts()

    # Use the relative frequency of outcome '1' as the model parameter
    parameter = counts.get('1', 0) / shots
    return ThresholdClassifier(parameter)

In [4]:
# Testing the functions with random labels
import random
from sklearn.metrics import accuracy_score

features = [quantum_feature_extraction(doc) for doc in corpus]
labels = [random.randint(0, 1) for _ in range(len(corpus))]

print("Features shape:", np.array(features).shape)
print("Labels shape:", np.array(labels).shape)

# Training a quantum classifier on the features and labels
classifier = quantum_model_training(features, labels)

# Quantum evaluation
def quantum_evaluation(model, test_documents, test_labels):
    # Encode test documents as quantum feature vectors
    test_states = [quantum_feature_extraction(doc) for doc in test_documents]

    # Predict a label for each feature vector
    predictions = [model.predict(state) for state in test_states]

    # Evaluate the predictions
    accuracy = accuracy_score(test_labels, predictions)
    print("Accuracy:", accuracy)

# Quantum evaluation
test_features = ["Protein A interacts with Protein B to activate pathway X."] * 5 + \
                ["Protein C binds to Protein D and inhibits pathway Y."] * 5
test_labels = [0]*5 + [1]*5

quantum_evaluation(classifier, test_features, test_labels)

This Python code is a conceptual example of how quantum computing can be used for natural language processing (NLP) tasks, specifically for text classification. It uses the Qiskit library, an open-source quantum computing software development framework.

- `quantum_text_classification` function: Trains a quantum support vector machine on features encoded with a `ZZFeatureMap` and returns the test accuracy.

- `quantum_feature_extraction` function: Intended to encode text as a quantum state and read features from the measurement outcomes. As written, the circuit prepares a two-qubit Bell state that does not depend on the input text, and the features are the relative frequencies of the `00` and `11` outcomes.

- `quantum_model_training` function: A placeholder for training a quantum machine learning model. It encodes the first label as a single-qubit state, applies a Hadamard gate, measures, and derives one parameter from how often outcome `1` is observed; nothing is fitted to the features.

- `quantum_evaluation` function: Evaluates the quantum model by encoding the test features as quantum states, evaluating the quantum model, and calculating the accuracy of the predictions.

The code also includes a test with random data. It extracts features from a corpus of documents, generates random labels, trains a quantum classifier on the features and labels, and evaluates the classifier.
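
Both approaches are scored with precision, recall, and F1-score. As a reference for what those numbers mean, the sketch below computes them by hand for binary labels, assuming 1 marks a sentence that describes an interaction. The function name `binary_prf` and the example labels are illustrative only and are not results from this notebook.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class; undefined ratios become 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1]  # made-up gold labels
y_pred = [1, 0, 0, 1, 1]  # made-up predictions
precision, recall, f1 = binary_prf(y_true, y_pred)
print(precision, recall, f1)
```

With a test set as small as the one in this notebook, any of these ratios can have a zero denominator, which is why the sketch falls back to 0.0 instead of raising an error.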
