# Prospecção de Dados (Data Mining) DI/FCUL - HA2

## Course Project (MC/DI/FCUL - 2024)

### GROUP: `02`

* João Martins, 62532 - Hours worked on the project: 16
* Rúben Torres, 62531 - Hours worked on the project: 16
* Nuno Pereira, 56933 - Hours worked on the project: 16

In [8]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.cluster import KMeans

with open("mol_bits.pkl", "rb") as file:
    molecular_fingerprints = pickle.load(file)

# Convert molecular fingerprints to DataFrame
mol_df = pd.DataFrame(
    list(molecular_fingerprints.items()), columns=["Molecules", "Fingerprint"]
)
mol_df

Unnamed: 0,Molecules,Fingerprint
0,CHEMBL2022243,"[10, 38, 50, 80, 107, 113, 180, 217, 315, 322,..."
1,CHEMBL2022244,"[10, 38, 50, 80, 107, 113, 180, 217, 315, 322,..."
2,CHEMBL2022245,"[10, 38, 50, 80, 104, 107, 113, 180, 184, 217,..."
3,CHEMBL2022246,"[10, 38, 50, 80, 107, 113, 118, 123, 217, 315,..."
4,CHEMBL2022247,"[10, 22, 38, 50, 66, 80, 107, 113, 160, 180, 2..."
...,...,...
73860,CHEMBL4218012,"[32, 80, 103, 147, 158, 264, 371, 389, 425, 51..."
73861,CHEMBL4217503,"[38, 80, 103, 155, 371, 389, 425, 457, 491, 51..."
73862,CHEMBL4205802,"[38, 80, 103, 115, 147, 155, 206, 371, 389, 45..."
73863,CHEMBL4204359,"[32, 80, 103, 147, 158, 264, 371, 389, 425, 49..."


In [9]:
activity = pd.read_csv(
    "activity_train.csv", header=None, names=["Proteins", "Molecules", "Rate"]
)
activity

Unnamed: 0,Proteins,Molecules,Rate
0,O14842,CHEMBL2022243,4
1,O14842,CHEMBL2022244,6
2,O14842,CHEMBL2022245,2
3,O14842,CHEMBL2022246,1
4,O14842,CHEMBL2022247,4
...,...,...,...
135706,Q9Y5Y4,CHEMBL4214909,6
135707,Q9Y5Y4,CHEMBL4218012,2
135708,Q9Y5Y4,CHEMBL4217503,7
135709,Q9Y5Y4,CHEMBL4204359,8


In [10]:
activity_test = pd.read_csv(
    "activity_test_blanked.csv", header=None, names=["Proteins", "Molecules", "Rate"]
)
activity_test

Unnamed: 0,Proteins,Molecules,Rate
0,O14842,CHEMBL2022258,0
1,O14842,CHEMBL2047161,0
2,O14842,CHEMBL2047163,0
3,O14842,CHEMBL2047168,0
4,O14842,CHEMBL2047169,0
...,...,...,...
4623,Q9Y5Y4,CHEMBL4208314,0
4624,Q9Y5Y4,CHEMBL4205421,0
4625,Q9Y5Y4,CHEMBL4207935,0
4626,Q9Y5Y4,CHEMBL4208884,0


* The file `activity_train.csv` lists interactions between molecules (identified by their ChEMBL IDs) and proteins (identified by their UniProt IDs). Each activity value is rated from 1 to 10, where 1 is INACTIVE and 10 is EXTREMELY POTENT.

* The file `activity_test_blanked.csv` has exactly the same structure as `activity_train.csv`, but its activity values are all set to zero. The goal of the project is to predict the real values.

* The molecular fingerprints are also provided (`mol_bits.pkl`). A fingerprint is a hashed structural representation of a molecule in which each set bit represents a structural feature. Molecules that share a set bit probably share a structural element. The file is a zipped pickle containing a dictionary whose keys are ChEMBL IDs and whose values are lists of the set bits of each molecule.
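
The remark that molecules sharing set bits probably share structural elements can be made concrete with a Tanimoto (Jaccard) coefficient over the set-bit lists. The following is a minimal sketch and not part of the project pipeline; the helper `tanimoto` and the toy IDs and bit indices are made up for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two lists of set-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints: define the similarity as 0 by convention
        return 0.0
    return len(a & b) / len(union)


# Toy dictionary in the same shape as mol_bits.pkl (hypothetical IDs and bits)
toy_fingerprints = {
    "MOL_A": [10, 38, 50, 80],
    "MOL_B": [10, 38, 50, 107],
    "MOL_C": [200, 300],
}

sim_ab = tanimoto(toy_fingerprints["MOL_A"], toy_fingerprints["MOL_B"])
sim_ac = tanimoto(toy_fingerprints["MOL_A"], toy_fingerprints["MOL_C"])
print(sim_ab, sim_ac)
```

`MOL_A` and `MOL_B` share three of five distinct bits, so their similarity is 3/5, while `MOL_A` and `MOL_C` share none.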

In [11]:
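def convert_to_bit_vector(bit_indices, length):
    """Return a 0/1 vector of the given length with ones at bit_indices."""
    bit_vector = np.zeros(length, dtype=int)
    bit_vector[bit_indices] = 1
    return bit_vector


# Highest set-bit index across all molecules; vectors need one more slot than this
max_bit_index = max(max(fp) for fp in molecular_fingerprints.values())

# Expand each list of set bits into a dense 0/1 row
bit_vectors = mol_df["Fingerprint"].apply(
    lambda x: convert_to_bit_vector(x, max_bit_index + 1)
)
bit_matrix = np.vstack(bit_vectors)
# One column per bit position, named "0", "1", ...
bit_df = pd.DataFrame(bit_matrix, columns=[f"{i}" for i in range(max_bit_index + 1)])
# Replace the list-valued Fingerprint column with the expanded bit columns
mol_df = pd.concat([mol_df[["Molecules"]], bit_df], axis=1)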


activity["Molecules"] = activity["Molecules"].str.strip()
mol_df["Molecules"] = mol_df["Molecules"].str.strip()
activity_test["Molecules"] = activity_test["Molecules"].str.strip()
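
A dense `int` matrix for tens of thousands of molecules can be memory-hungry. As a hedged alternative to the row-by-row `apply` above, the sketch below fills a single preallocated `uint8` matrix. The function name `fingerprints_to_matrix` and the toy dictionary are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np


def fingerprints_to_matrix(fingerprints, n_bits=None):
    """Build a dense 0/1 matrix (one row per molecule) from lists of set bits."""
    ids = list(fingerprints)
    if n_bits is None:
        # Width is the highest set-bit index plus one; empty fingerprints are skipped
        n_bits = max(
            (max(bits) for bits in fingerprints.values() if bits), default=-1
        ) + 1
    matrix = np.zeros((len(ids), n_bits), dtype=np.uint8)
    for row, mol_id in enumerate(ids):
        bits = fingerprints[mol_id]
        if bits:
            matrix[row, bits] = 1
    return ids, matrix


# Hypothetical toy input in the same shape as mol_bits.pkl
toy = {"MOL_A": [0, 2], "MOL_B": [1, 4], "MOL_C": []}
ids, matrix = fingerprints_to_matrix(toy)
print(ids)
print(matrix)
```

Using `uint8` stores each bit in one byte instead of the eight bytes a default integer typically takes, which matters when the matrix has many rows and columns.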

In [12]:
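# Attach the fingerprint bit columns to every training interaction
merged_df = pd.merge(activity, mol_df, on="Molecules")

# Same join for the blanked test interactions
merged_df_test = pd.merge(activity_test, mol_df, on="Molecules")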
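
Because `pd.merge` defaults to an inner join, any interaction whose molecule has no fingerprint is silently dropped, which would misalign the predictions when they are later written back to `activity_test`. The sketch below shows one way to check for this with a left join and the `indicator` flag. The toy frames and their IDs are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for activity_test and mol_df (hypothetical IDs and values)
toy_activity = pd.DataFrame(
    {
        "Proteins": ["P1", "P1", "P2"],
        "Molecules": ["MOL_A", "MOL_X", "MOL_B"],
        "Rate": [0, 0, 0],
    }
)
toy_mol = pd.DataFrame({"Molecules": ["MOL_A", "MOL_B"], "0": [1, 0], "1": [0, 1]})

# A left join keeps every interaction; "_merge" records where each row matched
checked = pd.merge(
    toy_activity,
    toy_mol,
    on="Molecules",
    how="left",
    validate="many_to_one",  # raises if a molecule appears twice in toy_mol
    indicator=True,
)
missing = checked.loc[checked["_merge"] == "left_only", "Molecules"].tolist()
print(f"{len(missing)} interaction(s) without a fingerprint: {missing}")
```

If `missing` is empty and the merged frame has as many rows as the input, assigning the predictions back by position is safe.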

In [13]:
# Prepare features (X) and target variable (y)
X = merged_df.drop(columns=["Proteins", "Molecules", "Rate"])
y = merged_df["Rate"]

X_test = merged_df_test.drop(columns=["Proteins", "Molecules", "Rate"])


# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42
# )


In [14]:
rf_classifier = RandomForestClassifier(n_jobs=-1)
rf_classifier.fit(X, y)

In [15]:
y_pred = rf_classifier.predict(X_test)

# accuracy = accuracy_score(y_test, y_pred)
# classification_rep = classification_report(y_test, y_pred)

print(y_pred)
# print(f"Accuracy: {accuracy}")
# print("Classification Report:")
# print(classification_rep)
y_pred.shape

[6 6 7 ... 8 8 7]


(4628,)
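
The commented-out lines above hint at a hold-out evaluation. A minimal sketch of that idea follows, using synthetic binary features rather than the project data. The names `X_val` and `y_val` are chosen here so that the split does not overwrite the real `X_test`, and the data, labels, and hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bit matrix: 300 "molecules" with 20 binary features
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(300, 20))
# Made-up labels in 1..4 that depend on the first three bits
y_toy = 1 + X_toy[:, :3].sum(axis=1)

# Hold out 20% of the labelled rows for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
print(f"Hold-out accuracy: {val_accuracy:.3f}")
```

After choosing a model this way, refitting on all labelled rows before predicting the blanked test set matches what the notebook does.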

In [16]:
activity_test["Rate"] = y_pred

# Write the predictions to a new CSV file.
# Note: this writes a header row, which the input CSV files do not have.
activity_test.to_csv("activity_test_predictions.csv", index=False)
activity_test

Unnamed: 0,Proteins,Molecules,Rate
0,O14842,CHEMBL2022258,6
1,O14842,CHEMBL2047161,6
2,O14842,CHEMBL2047163,7
3,O14842,CHEMBL2047168,6
4,O14842,CHEMBL2047169,7
...,...,...,...
4623,Q9Y5Y4,CHEMBL4208314,8
4624,Q9Y5Y4,CHEMBL4205421,3
4625,Q9Y5Y4,CHEMBL4207935,8
4626,Q9Y5Y4,CHEMBL4208884,8
