<a href="https://colab.research.google.com/github/PranavMunigala/Bioinformatics/blob/main/IntroToBioinformaticsColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Why We Do This**: Public databases such as ChEMBL hold large volumes of bioactivity measurements. This notebook pulls the records for a single target, removes incomplete and duplicate entries, and labels each compound by potency. Narrowing the list this way lets later work focus on the compounds most likely to be worth exploring as potential drugs, which saves time and effort in drug development.

Key vocabulary:

*   **Bioactivity**: The effect a compound has on a biological target, such as a protein.

*   **Standard Value**: A numerical measure of a compound's bioactivity for a given measurement type, such as IC50 or EC50. In the data retrieved below it is expressed in nM (a raw value of 30.0 uM appears as a standard value of 30000.0).

*   **EC50** (Half-Maximal Effective Concentration): The concentration of a compound required to achieve 50% of its maximum effect. A lower EC50 means a more potent compound.

*   **Canonical SMILES**: A text-based notation for chemical structures. The canonical form gives each molecule a single standardized string.

*   **Data Pre-Processing**: The process of cleaning and organizing raw data to prepare it for analysis.
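
To make the EC50 definition concrete, here is a minimal sketch based on the Hill equation, a common dose-response model. The function name `hill_response`, the Hill coefficient of 1, and the example concentrations are illustrative assumptions and are not part of this notebook's pipeline.

```python
def hill_response(concentration, ec50, max_effect=100.0, hill_coefficient=1.0):
    """Response predicted by the Hill equation (an assumed, illustrative model)."""
    ratio = (concentration / ec50) ** hill_coefficient
    return max_effect * ratio / (1.0 + ratio)

# Hypothetical compound with an EC50 of 99 nM
ec50_nm = 99.0
for conc_nm in (9.9, 99.0, 990.0):
    effect = hill_response(conc_nm, ec50_nm)
    print(f"{conc_nm:7.1f} nM -> {effect:5.1f}% of maximum effect")
```

When the concentration equals the EC50, the ratio is 1 and the response is exactly half of the maximum, which is what the definition states.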



# Step 1: Install necessary libraries
In Google Colab, you'll first need to install the `chembl_webresource_client` package to retrieve bioactivity data from the ChEMBL database.

In [23]:
# Install the ChEMBL web service client
!pip install chembl_webresource_client



# Step 2: Import necessary libraries
Once the package is installed, import the required libraries.

In [None]:
# Import libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

# Step 3: Search for target protein (Acetylcholinesterase)
Now, let's search for the target protein, Acetylcholinesterase.

In [None]:
# Target search for Acetylcholinesterase
activity = new_client.activity

target = new_client.target
target_query = target.search('acetylcholinesterase')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P22303', 'xref_name': None, 'xre...",Homo sapiens,Acetylcholinesterase,28.0,False,CHEMBL220,"[{'accession': 'P22303', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Cholinesterases; ACHE & BCHE,28.0,False,CHEMBL2095233,"[{'accession': 'P06276', 'component_descriptio...",SELECTIVITY GROUP,9606
2,[],Drosophila melanogaster,Acetylcholinesterase,18.0,False,CHEMBL2242744,"[{'accession': 'P07140', 'component_descriptio...",SINGLE PROTEIN,7227
3,[],Bemisia tabaci,AChE2,16.0,False,CHEMBL2366409,"[{'accession': 'B3SST5', 'component_descriptio...",SINGLE PROTEIN,7038
4,[],Leptinotarsa decemlineata,Acetylcholinesterase,16.0,False,CHEMBL2366490,"[{'accession': 'Q27677', 'component_descriptio...",SINGLE PROTEIN,7539
5,"[{'xref_id': 'P04058', 'xref_name': None, 'xre...",Torpedo californica,Acetylcholinesterase,15.0,False,CHEMBL4780,"[{'accession': 'P04058', 'component_descriptio...",SINGLE PROTEIN,7787
6,"[{'xref_id': 'P21836', 'xref_name': None, 'xre...",Mus musculus,Acetylcholinesterase,15.0,False,CHEMBL3198,"[{'accession': 'P21836', 'component_descriptio...",SINGLE PROTEIN,10090
7,"[{'xref_id': 'P37136', 'xref_name': None, 'xre...",Rattus norvegicus,Acetylcholinesterase,15.0,False,CHEMBL3199,"[{'accession': 'P37136', 'component_descriptio...",SINGLE PROTEIN,10116
8,"[{'xref_id': 'O42275', 'xref_name': None, 'xre...",Electrophorus electricus,Acetylcholinesterase,15.0,False,CHEMBL4078,"[{'accession': 'O42275', 'component_descriptio...",SINGLE PROTEIN,8005
9,"[{'xref_id': 'P23795', 'xref_name': None, 'xre...",Bos taurus,Acetylcholinesterase,15.0,False,CHEMBL4768,"[{'accession': 'P23795', 'component_descriptio...",SINGLE PROTEIN,9913
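
The search results mix several species and target types, so picking a target by row position is fragile if the ranking ever changes. Below is a hedged sketch of a more explicit selection. It runs on a small hand-copied subset of the table above so that it works offline, and the helper name `pick_human_single_protein` is hypothetical.

```python
import pandas as pd

# Hand-copied subset of the search results above (only the columns needed here)
targets_demo = pd.DataFrame({
    "organism": ["Homo sapiens", "Homo sapiens", "Drosophila melanogaster"],
    "pref_name": ["Acetylcholinesterase", "Cholinesterases; ACHE & BCHE", "Acetylcholinesterase"],
    "target_chembl_id": ["CHEMBL220", "CHEMBL2095233", "CHEMBL2242744"],
    "target_type": ["SINGLE PROTEIN", "SELECTIVITY GROUP", "SINGLE PROTEIN"],
})

def pick_human_single_protein(targets):
    """Return the ChEMBL ID of the first human single-protein target (hypothetical helper)."""
    mask = (targets["organism"] == "Homo sapiens") & (targets["target_type"] == "SINGLE PROTEIN")
    matches = targets.loc[mask, "target_chembl_id"]
    if matches.empty:
        raise ValueError("No human single-protein target in the search results")
    return matches.iloc[0]

print(pick_human_single_protein(targets_demo))
```

The same function could be applied to the full `targets` DataFrame, since it relies only on the `organism`, `target_type`, and `target_chembl_id` columns shown in the output.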


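# Step 4: Retrieve Bioactivity Data for EC50
Retrieve the bioactivity records for the selected target, keeping only those whose standard type is EC50 and limiting the result to the first 800 entries.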

In [None]:
# Select the first entry, which corresponds to Human Acetylcholinesterase
selected_target = targets.target_chembl_id[0]

# Retrieve only the first 800 bioactivity data points with standard type 'EC50'
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="EC50")[:800]

# Convert the results into a DataFrame
df = pd.DataFrame.from_dict(res)

# Display the first few rows of the data
df.head()


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,No data,184296,[],CHEMBL684546,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,
1,,,185671,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,30.0
2,,,185674,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,0.099
3,,No data,188345,[],CHEMBL684546,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,
4,,,188348,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,1.72


# Step 5: Handle missing data
Filter out rows that are missing a value for **standard_value** or **canonical_smiles**.

In [None]:
# Drop rows with missing standard_value or canonical_smiles
df2 = df.dropna(subset=['standard_value', 'canonical_smiles'])

# Display the first few rows after handling missing data
df2.head()


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
1,,,185671,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,30.0
2,,,185674,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,0.099
4,,,188348,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,1.72
5,,,190705,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,10.0
6,,,190708,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,3.0


# Step 6: Remove duplicate entries
Ensure that each compound (represented by **canonical_smiles**) appears only once.

In [None]:
# Remove duplicate canonical_smiles entries
df2_nr = df2.drop_duplicates(['canonical_smiles'])

# Display the first few rows of the non-redundant data
df2_nr.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
1,,,185671,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,30.0
2,,,185674,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,0.099
4,,,188348,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,1.72
5,,,190705,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,10.0
6,,,190708,[],CHEMBL684545,In vitro reversal of vecuronium-induced block ...,F,,,BAO_0000188,...,Homo sapiens,Acetylcholinesterase,9606,,,EC50,uM,UO_0000065,,3.0
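
`drop_duplicates` keeps only the first row it encounters for each SMILES string, so when a compound has several measurements the retained value depends on row order. One alternative, sketched here on made-up data, is to aggregate repeated measurements with the median. The toy IDs and values and the choice of the median are assumptions for illustration, not what this notebook does.

```python
import pandas as pd

# Made-up measurements: compound A was assayed three times, compound B once
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_A", "CHEMBL_B"],
    "canonical_smiles": ["CCO", "CCO", "CCO", "c1ccccc1"],
    "standard_value": [100.0, 300.0, 5000.0, 1720.0],
})

# Approach used in this notebook: keep the first row per structure
first_kept = toy.drop_duplicates(["canonical_smiles"])

# Alternative: one row per compound holding the median of its measurements
median_kept = (
    toy.groupby(["molecule_chembl_id", "canonical_smiles"], as_index=False)["standard_value"]
       .median()
)

print(first_kept)
print(median_kept)
```

For compound A the first approach retains 100.0 while the median is 300.0, which shows how much the choice can matter when replicate measurements disagree.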


# Step 7: Data Pre-processing
Keep only the columns needed for the analysis: the compound ID, its structure, and its standard value.

In [None]:
# Select columns for pre-processing
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2_nr[selection]

# Display the pre-processed data
df3.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
1,CHEMBL314658,COc1cc2cc(C(=O)CC3=CC[N+](C)(Cc4ccccc4)CC3)sc2...,30000.0
2,CHEMBL87170,COc1cc2cc(C(=O)CCc3cc[n+](Cc4ccccc4)cc3)sc2cc1OC,99.0
4,CHEMBL82810,COc1cc2cc(C(=O)CCc3cc[n+](CC4CCC4)cc3)sc2cc1OC...,1720.0
5,CHEMBL88049,COc1cc2cc(C(=O)CCCCC3CC[N+](C)(Cc4ccccc4)CC3)s...,10000.0
6,CHEMBL87128,CC[n+]1ccc(CCC(=O)c2cc3cc(OC)c(OC)cc3s2)cc1.[Br-],3000.0
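
The activity thresholds applied in Step 8 only make sense if every `standard_value` is in the same unit. ChEMBL activity records also carry a `standard_units` field, so one option is to keep that column and check it before classifying. The sketch below runs on made-up rows, and the helper name `keep_nanomolar` is hypothetical.

```python
import pandas as pd

# Made-up rows: two values reported in nM and one in a different unit
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
    "canonical_smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "standard_value": [99.0, 1720.0, 3.0],
    "standard_units": ["nM", "nM", "ug.mL-1"],
})

def keep_nanomolar(frame):
    """Keep rows whose standard_units is 'nM' and report how many were dropped."""
    in_nm = frame["standard_units"] == "nM"
    print(f"Dropping {int((~in_nm).sum())} row(s) not reported in nM")
    return frame.loc[in_nm].copy()

checked = keep_nanomolar(toy)
print(checked[["molecule_chembl_id", "standard_value", "standard_units"]])
```

To use this with the real data, `'standard_units'` would need to be added to the `selection` list so the column survives the pre-processing step.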


# Step 8: Label compounds as active, inactive, or intermediate
Label each compound by its standard value (in nM): **active** at 1,000 or below, **inactive** at 10,000 or above, and **intermediate** in between.

In [None]:

# Define the bioactivity classification
# Work on a copy so the assignment below does not modify a slice of df2_nr
df3 = df3.copy()

# Convert standard_value to numeric; entries that cannot be parsed become NaN
df3['standard_value'] = pd.to_numeric(df3['standard_value'], errors='coerce')

bioactivity_threshold = []
for value in df3.standard_value:
    if pd.isna(value):
        bioactivity_threshold.append("unknown")
    elif value >= 10000:
        bioactivity_threshold.append("inactive")
    elif value <= 1000:
        bioactivity_threshold.append("active")
    else:
        bioactivity_threshold.append("intermediate")

# Add the classification to the DataFrame, reusing df3's index so each label lines up with its row
bioactivity_class = pd.Series(bioactivity_threshold, name='class', index=df3.index)
df5 = pd.concat([df3, bioactivity_class], axis=1)

# Display the DataFrame with classifications
df5.head()


Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
1,CHEMBL314658,COc1cc2cc(C(=O)CC3=CC[N+](C)(Cc4ccccc4)CC3)sc2...,30000.0,inactive
2,CHEMBL87170,COc1cc2cc(C(=O)CCc3cc[n+](Cc4ccccc4)cc3)sc2cc1OC,99.0,active
4,CHEMBL82810,COc1cc2cc(C(=O)CCc3cc[n+](CC4CCC4)cc3)sc2cc1OC...,1720.0,intermediate
5,CHEMBL88049,COc1cc2cc(C(=O)CCCCC3CC[N+](C)(Cc4ccccc4)CC3)s...,10000.0,inactive
6,CHEMBL87128,CC[n+]1ccc(CCC(=O)c2cc3cc(OC)c(OC)cc3s2)cc1.[Br-],3000.0,intermediate
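
The same labelling can also be written as a vectorized operation instead of a loop. Here is a minimal sketch using `numpy.select` with the same thresholds. The sample values echo the table above, with one missing value added, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Sample standard values (nM) on a non-contiguous index, as left behind by the filtering steps
values = pd.Series([30000.0, 99.0, 1720.0, 10000.0, np.nan], index=[1, 2, 4, 5, 6])

# Conditions are checked in order; the first one that matches decides the label
conditions = [values.isna(), values >= 10000, values <= 1000]
labels = ["unknown", "inactive", "active"]

classes = pd.Series(
    np.select(conditions, labels, default="intermediate"),
    index=values.index,  # keep the original index so labels stay attached to their rows
    name="class",
)
print(pd.concat([values.rename("standard_value"), classes], axis=1))
```

Passing `index=values.index` matters here: a Series built with the default 0, 1, 2, ... index would be matched by label during `pd.concat`, which can attach a label to the wrong row.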


# Step 9: Save the processed data to a CSV file
You can save the final DataFrame to a CSV file.

In [None]:
# Save the processed data to a CSV file
df5.to_csv('acetylcholinesterase_bioactivity_data_curated.csv', index=False)

# Step 10: Display the final DataFrame
Finally, you can display the entire processed DataFrame.

In [None]:
# Display the entire DataFrame (may want to use df5.head() for large datasets)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
1,CHEMBL314658,COc1cc2cc(C(=O)CC3=CC[N+](C)(Cc4ccccc4)CC3)sc2...,30000.0,inactive
2,CHEMBL87170,COc1cc2cc(C(=O)CCc3cc[n+](Cc4ccccc4)cc3)sc2cc1OC,99.0,active
4,CHEMBL82810,COc1cc2cc(C(=O)CCc3cc[n+](CC4CCC4)cc3)sc2cc1OC...,1720.0,intermediate
5,CHEMBL88049,COc1cc2cc(C(=O)CCCCC3CC[N+](C)(Cc4ccccc4)CC3)s...,10000.0,inactive
6,CHEMBL87128,CC[n+]1ccc(CCC(=O)c2cc3cc(OC)c(OC)cc3s2)cc1.[Br-],3000.0,intermediate
7,CHEMBL85473,COc1cc2cc(C(O)C3CC[N+](C)(Cc4ccccc4)CC3)sc2cc1...,30000.0,inactive
8,CHEMBL87280,COC(=O)c1ccc(C[n+]2ccc(CCC(=O)c3cc4cc(OC)c(OC)...,1670.0,intermediate
9,CHEMBL313809,COc1cc2cc(C(=O)C=C3CC[N+](C)(Cc4ccccc4)CC3)sc2...,7670.0,intermediate
10,CHEMBL87849,COc1cc2cc(C(=O)CCc3cc[n+](Cc4ccc(F)cc4)cc3)sc2...,1650.0,intermediate
11,CHEMBL314398,COc1cc2cc(C(=O)CC(OC)C3CC[N+](C)(Cc4ccccc4)CC3...,650.0,active


# Step 11 (Optional): Download the CSV file from Google Colab
If you want to download the CSV file directly from the Colab environment, you can use the following command:

In [None]:
from google.colab import files
files.download('acetylcholinesterase_bioactivity_data_curated.csv')