# Part 1 

----------

### This notebook conducts initial inspection and description of the datasets, setting the foundation for subsequent analysis.

First, we locate the three datasets needed for the analysis, as outlined in the PDF: drug targets, drugs, and mechanism of action. The FTP URLs below point to the JSON output for each dataset, which we can download with a script.

* Target - Core annotation for targets: ftp.ebi.ac.uk/pub/databases/opentargets/platform/24.03/output/etl/json/targets
* Drug - Core annotation for drug molecules: ftp.ebi.ac.uk/pub/databases/opentargets/platform/24.03/output/etl/json/molecule
* Drug - Mechanism of action: ftp.ebi.ac.uk/pub/databases/opentargets/platform/24.03/output/etl/json/mechanismOfAction

In [2]:
# data_type and URL for the 3 sets of data we need to download, will download parquet instead of JSON format due to size
urls = {
    "Drug_Targets": "pub/databases/opentargets/platform/24.03/output/etl/parquet/targets",
    "Drug_Info": "pub/databases/opentargets/platform/24.03/output/etl/parquet/molecule",
    "Drug_Mechanism": "pub/databases/opentargets/platform/24.03/output/etl/parquet/mechanismOfAction/"
}

In [4]:
# downloading files from FTP datasets in parquet format for each of the URLs
# should be 199 separate parquet files for each data_type; ~2 mins runtime

import ftplib
import os
import pandas as pd

# STEP 0: define the FTP download function, download_data_from_ftp()
# logs in to the FTP server anonymously, downloads every file in the remote directory given by url, and saves them into a local directory (named after data_type)
def download_data_from_ftp(url, local_directory):
    ftp_host = "ftp.ebi.ac.uk"
    ftp_user = "anonymous"
    ftp_password = ""

    # create the local directory if it does not exist yet
    os.makedirs(local_directory, exist_ok=True)

    # the context manager closes the connection even if a download fails
    with ftplib.FTP(ftp_host) as ftp:
        ftp.login(ftp_user, ftp_password)
        ftp.cwd(url)

        for file_name in ftp.nlst():
            local_file_path = os.path.join(local_directory, file_name)
            with open(local_file_path, 'wb') as local_file:
                ftp.retrbinary('RETR ' + file_name, local_file.write)

# STEP 1: iterate over the URLs in the dict, create a new subfolder under ./data for each, use download_data_from_ftp() to download, print() when done
for data_type, url in urls.items():
    local_directory = f"./data/{data_type}"
    download_data_from_ftp(url, local_directory)
    print(f"Downloaded {data_type} data into {local_directory}")

Downloaded Drug_Targets data into ./data/Drug_Targets
Downloaded Drug_Info data into ./data/Drug_Info
Downloaded Drug_Mechanism data into ./data/Drug_Mechanism
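
The file count expected in the download cell can be checked once the loop has finished. The sketch below is a minimal, hypothetical helper (`count_parquet_files` is not part of the original notebook) that counts the `.parquet` files in each subfolder of `./data`. It assumes the folder layout created by the download loop above.

```python
import os

def count_parquet_files(base_dir):
    """Return {subfolder_name: number of .parquet files} for each subfolder of base_dir."""
    counts = {}
    for entry in sorted(os.listdir(base_dir)):
        folder = os.path.join(base_dir, entry)
        if os.path.isdir(folder):
            # only count parquet part files, ignoring markers such as _SUCCESS
            counts[entry] = sum(
                1 for name in os.listdir(folder) if name.endswith(".parquet")
            )
    return counts

# only runs if the download step has already created ./data
if os.path.isdir("./data"):
    for data_type, n_files in count_parquet_files("./data").items():
        print(f"{data_type}: {n_files} parquet files")
```

A count that differs from the expected number for a data type would suggest an interrupted download.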


In [6]:
# lookup the schema/format/columns and datatypes of the parquet files we are working with
import pyarrow.parquet as pq
from tabulate import tabulate

random_parquet_file = "./data/Drug_Targets/part-00000-e1e4536b-5a02-4828-887c-0d68f353f37b-c000.snappy.parquet"

table = pq.read_table(random_parquet_file)

df = table.to_pandas()

print("\n DataFrame Schema: \n")
print(df.info())


 DataFrame Schema: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 332 entries, 0 to 331
Data columns (total 28 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    332 non-null    object
 1   approvedSymbol        332 non-null    object
 2   biotype               332 non-null    object
 3   transcriptIds         332 non-null    object
 4   canonicalTranscript   315 non-null    object
 5   canonicalExons        315 non-null    object
 6   genomicLocation       332 non-null    object
 7   alternativeGenes      7 non-null      object
 8   approvedName          332 non-null    object
 9   go                    101 non-null    object
 10  hallmarks             1 non-null      object
 11  synonyms              332 non-null    object
 12  symbolSynonyms        332 non-null    object
 13  nameSynonyms          332 non-null    object
 14  functionDescriptions  98 non-null     object
 15  subcellularLocatio

In [10]:
# using pyarrow module to check the head of the 3 datasets we are working with

import pyarrow.parquet as pq
from tabulate import tabulate

# random files chosen
drug_info_parquet = "./data/Drug_Info/part-00000-5e395455-ba46-4795-aa84-f1f2f840385c-c000.snappy.parquet"
target_parquet = "./data/Drug_Targets/part-00000-e1e4536b-5a02-4828-887c-0d68f353f37b-c000.snappy.parquet"
MoA_parquet = "./data/Drug_Mechanism/part-00000-a3d47169-b4a0-47f2-b3a4-f26067bc4df3-c000.snappy.parquet"

# for a file in each sub dataset (drug_info, targets, MoA), we read the parquet files into PyArrow tables and then convert them into Pandas df's for visualisation
drug_info_table = pq.read_table(drug_info_parquet)
targets_table = pq.read_table(target_parquet)
MoA_table = pq.read_table(MoA_parquet)

drug_info_df = drug_info_table.to_pandas()
targets_df = targets_table.to_pandas()
MoA_df = MoA_table.to_pandas()

# first 3 rows of each DataFrame using tabulate
print("First 3 rows of Drug Info:")
print(tabulate(drug_info_df.head(3), headers='keys', tablefmt='grid'))
print("\nFirst 3 rows of Mechanisms of Action:")
print(tabulate(MoA_df.head(3), headers='keys', tablefmt='grid'))
print("\nFirst 3 rows of Targets:")
print(tabulate(targets_df.head(3), headers='keys', tablefmt='grid'))

First 3 rows of Drug Info:
(tabulate grid output truncated)

--------------------

#### Initial thoughts after inspecting the heads of the drugs/targets/mechanisms of action datasets, using the first parquet file in each dataset:

<br>

* The Mechanism of Action (MoA) dataset is the key link between the Drug info dataset (via its chemblIds column) and the Drug targets dataset (via its targets column):

* DRUG --> MoA --> TARGET
* (id) --> [(chemblIds), (targets)] --> (id)

Additional important information: 
> Drug modality (type/category of drug): 

    Found in Drug info dataset --> drugType column 

> Locations of drug targets: 

    Found in Targets dataset --> subcellularLocations column
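
The DRUG --> MoA --> TARGET linkage described above can be sketched with toy data. The IDs and values below are hypothetical stand-ins, not real records, and the sketch assumes that `chemblIds` and `targets` hold list-like values, so that one MoA row can reference several drugs and several targets.

```python
import pandas as pd

# Toy stand-ins for the three datasets (hypothetical IDs, not real records)
drugs = pd.DataFrame({
    "id": ["CHEMBL_A", "CHEMBL_B"],
    "drugType": ["Small molecule", "Antibody"],
})
moa = pd.DataFrame({
    "chemblIds": [["CHEMBL_A", "CHEMBL_B"], ["CHEMBL_B"]],
    "targets": [["ENSG_1"], ["ENSG_2", "ENSG_3"]],
    "actionType": ["INHIBITOR", "ANTAGONIST"],
})
targets = pd.DataFrame({
    "id": ["ENSG_1", "ENSG_2", "ENSG_3"],
    "approvedSymbol": ["GENE1", "GENE2", "GENE3"],
})

# One row per (drug, target) pair: explode both list columns in turn
links = (
    moa.explode("chemblIds")
       .explode("targets")
       .rename(columns={"chemblIds": "drug_id", "targets": "target_id"})
)

# Attach drug attributes, then target attributes, dropping the duplicate key columns
drug_target = (
    links.merge(drugs, left_on="drug_id", right_on="id", how="inner")
         .drop(columns="id")
         .merge(targets, left_on="target_id", right_on="id", how="inner")
         .drop(columns="id")
)
print(drug_target[["drug_id", "drugType", "actionType", "target_id", "approvedSymbol"]])
```

Exploding before merging keeps the join keys as plain scalar values, which is what `merge` needs to match them against the `id` columns.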

<br>

### Next, I'll merge all the Parquet files within each dataset category (drug info, Mechanism of Action (MoA), and target). <br>

### This will consolidate the data into a single dataframe for each category. By doing so, we can conduct comprehensive descriptive analysis for each dataset, enabling easier data manipulation and facilitating future analysis.

-----------------

In [7]:
# merge all parquet files in sub datasets

import os
import pandas as pd

# create merge function for parquet files
def merge_parquet_files(folder_path):
    dfs = []
    # iterate through each file in the sub dataset folder, build its file path, read it, and append the resulting DataFrame to the list
    for file_name in os.listdir(folder_path):
        if file_name.endswith('.parquet'):
            file_path = os.path.join(folder_path, file_name)
            df = pd.read_parquet(file_path)
            dfs.append(df)
    # afterwards, concatenate them all into a single merged df
    if dfs:
        merged_df = pd.concat(dfs, ignore_index=True)
        return merged_df
    # error handling
    else:
        print(f"No Parquet files found in {folder_path}")
        return None

# define folder locations
drug_info_folder = "./data/Drug_Info"
targets_folder = "./data/Drug_Targets"
mechanism_folder = "./data/Drug_Mechanism"

# now we use the function defined above for each dataset and save the output directly into ./data
# dataset 1
merged_drug_info = merge_parquet_files(drug_info_folder)
# the crossReferences column is cast to string datatype before the merged file is saved
merged_drug_info['crossReferences'] = merged_drug_info['crossReferences'].astype(str)
# save the merged drug info
merged_drug_info_file = "./data/merged_drug_info.parquet"
merged_drug_info.to_parquet(merged_drug_info_file)
print(f"Merged drug info data saved to {merged_drug_info_file}")

# dataset 2
merged_mechanism = merge_parquet_files(mechanism_folder)
merged_mechanism_file = "./data/merged_mechanism.parquet"
merged_mechanism.to_parquet(merged_mechanism_file)
print(f"Merged Mechanism of Action data saved to {merged_mechanism_file}")

# dataset 3
merged_targets = merge_parquet_files(targets_folder)
merged_targets_file = "./data/merged_targets.parquet"
merged_targets.to_parquet(merged_targets_file)
print(f"Merged Drug Targets data saved to {merged_targets_file}")

Merged drug info data saved to ./data/merged_drug_info.parquet
Merged Mechanism of Action data saved to ./data/merged_mechanism.parquet
Merged Drug Targets data saved to ./data/merged_targets.parquet


----------

### Inspecting and describing the starting datasets:

---------

In [8]:
import pyarrow.parquet as pq

merged_drug_info_file = "./data/merged_drug_info.parquet"
merged_mechanism_file = "./data/merged_mechanism.parquet"
merged_targets_file = "./data/merged_targets.parquet"

# load each merged dataset and collect the basic stats used below (row count, column count, column names), plus a pandas df for dtypes, value counts and missing values
merged_drug_info_table = pq.read_table(merged_drug_info_file)
num_rows_drug_info = len(merged_drug_info_table)
num_columns_drug_info = len(merged_drug_info_table.schema)
column_names_drug_info = merged_drug_info_table.schema.names
drug_info_df = merged_drug_info_table.to_pandas()

merged_mechanism_table = pq.read_table(merged_mechanism_file)
num_rows_mechanism = len(merged_mechanism_table)
num_columns_mechanism = len(merged_mechanism_table.schema)
column_names_mechanism = merged_mechanism_table.schema.names
mechanism_df = merged_mechanism_table.to_pandas()

merged_targets_table = pq.read_table(merged_targets_file)
num_rows_targets = len(merged_targets_table)
num_columns_targets = len(merged_targets_table.schema)
column_names_targets = merged_targets_table.schema.names
targets_df = merged_targets_table.to_pandas()

# print info for dataset stats
print("Drug Info Dataset:")
print("Number of rows:", num_rows_drug_info)
print("Number of columns:", num_columns_drug_info)
print("Column names:", column_names_drug_info)
print("Data types:\n", drug_info_df.dtypes)
print("Counts of unique values in 'drugType':\n", drug_info_df['drugType'].value_counts())
print("Presence of missing values:\n", drug_info_df.isnull().sum())
print()

print("Mechanism of Action Dataset:")
print("Number of rows:", num_rows_mechanism)
print("Number of columns:", num_columns_mechanism)
print("Column names:", column_names_mechanism)
print("Data types:\n", mechanism_df.dtypes)
print("Counts of unique 'actionType':\n", mechanism_df['actionType'].value_counts())
print("Counts of unique 'targetType':\n", mechanism_df['targetType'].value_counts())
print("Presence of missing values:\n", mechanism_df.isnull().sum())
print()

print("Drug Targets Dataset:")
print("Number of rows:", num_rows_targets)
print("Number of columns:", num_columns_targets)
print("Column names:", column_names_targets)
print("Data types:\n", targets_df.dtypes)
print("Counts of unique values in 'biotype':\n", targets_df['biotype'].value_counts())
print("Counts of unique values in 'targetClass':\n", targets_df['targetClass'].value_counts())
print("Presence of missing values:\n", targets_df.isnull().sum())


Drug Info Dataset:
Number of rows: 17111
Number of columns: 18
Data types:
 id                            object
canonicalSmiles               object
inchiKey                      object
drugType                      object
name                          object
yearOfFirstApproval          float64
maximumClinicalTrialPhase    float64
parentId                      object
hasBeenWithdrawn                bool
isApproved                    object
tradeNames                    object
synonyms                      object
crossReferences               object
childChemblIds                object
linkedDiseases                object
linkedTargets                 object
description                   object
dtype: object
Counts of unique values in 'drugType':
 drugType
Small molecule     14392
Antibody             894
Unknown              863
Protein              615
Oligonucleotide      100
Enzyme                87
Oligosaccharide       54
Gene                  50
Cell                  35
unknown

--------------

### Key Takeaways:

#### Drug Info Dataset:
- **Rows:** 17,111, **Columns:** 18
- **Data Types:** Mostly objects, some numerical.
- **Unique 'drugType':** Dominated by small molecules.
- **Missing Values:** Notable in several columns, including approval-related fields.

#### Mechanism of Action Dataset:
- **Rows:** 6,186, **Columns:** 7
- **Action Types:** Mostly inhibitors, antagonists, and agonists.
- **Target Types:** Predominantly single proteins.
- **Missing Values:** None found across the dataset.

#### Drug Targets Dataset:
- **Rows:** 63,226, **Columns:** 28
- **Biotype Variety:** Includes protein coding, lncRNA, and processed pseudogenes.
- **Target Class Diversity:** Covers a wide range, from enzymes to secreted proteins.
- **Missing Values:** Various columns have significant missing data, indicating potential data quality issues.

Some of the datasets contain missing values, which will need to be dealt with later on.
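
To make the missing-value picture easier to scan than the raw `isnull().sum()` output, the counts can be turned into a small table with percentages. This is a minimal sketch: `missing_summary` is a hypothetical helper, and the example DataFrame is made-up data that only borrows two column names from the Drug Info dataset.

```python
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, most-missing first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # keep only the columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# made-up example data, not taken from the real datasets
example = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "yearOfFirstApproval": [1999.0, None, None, None],
    "parentId": [None, "x", "y", "z"],
})
print(missing_summary(example))
```

The same function could be applied to `drug_info_df`, `mechanism_df` and `targets_df` from the inspection cell above.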

-------------------

### [Click here to go to PART 2 - preprocessing stages](./2_Preprocessing.ipynb)