GraphML-lab/BRIDGE

README - IGNORE (read internal READMEs)

This repository contains scripts for extracting Klebsiella pneumoniae data from CARD, building a knowledge graph (KG) from that data, annotating Klebsiella pneumoniae genome sequences with a BAKTA/RGI pipeline, running BLAST searches against a sequence database, and filtering the BLAST results by percentage identity.

Files and Functions

1. CARD/CARD_data2.py

Function:

  • Extracts Klebsiella pneumoniae data from CARD and writes it into extracted_data_kleb.csv.

Modification:

  • Modify line 43 to specify the organism to extract. For this project, it is set to Klebsiella pneumoniae.

Output:

  • extracted_data_kleb.csv: A CSV file containing the data of Klebsiella pneumoniae.

2. CARD/create_kg.py

Function:

  • Reads the extracted data from extracted_data_kleb.csv and creates a knowledge graph (KG).
  • For each relationship between entities (e.g., drug targets, gene relationships), separate files are generated in the format of object-relationship-subject.

Output:

  • Files representing various relationships in the knowledge graph.
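
A minimal sketch of the per-relationship output described above, assuming each row of the extracted CSV can be turned into (object, relationship, subject) triples. The column names `gene`, `drug_class`, and `resistance_mechanism`, the relationship labels, and the `.tsv` file naming are all hypothetical; the schema actually used by create_kg.py may differ.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping: relationship label -> (object column, subject column).
RELATIONS = {
    "confers_resistance_to": ("gene", "drug_class"),
    "uses_mechanism": ("gene", "resistance_mechanism"),
}


def build_triples(rows):
    """Group (object, relationship, subject) triples by relationship label."""
    triples = defaultdict(set)
    for row in rows:
        for relation, (obj_col, subj_col) in RELATIONS.items():
            obj = (row.get(obj_col) or "").strip()
            subj = (row.get(subj_col) or "").strip()
            if obj and subj:  # skip rows where either side is missing
                triples[relation].add((obj, relation, subj))
    return triples


def write_triples(triples, out_dir):
    """Write one tab-separated file per relationship into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for relation, items in triples.items():
        with open(out_dir / f"{relation}.tsv", "w", newline="") as handle:
            csv.writer(handle, delimiter="\t").writerows(sorted(items))
```

The rows could come from `csv.DictReader` opened on extracted_data_kleb.csv. Using a set per relationship drops duplicate triples before they are written.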

3. DrugTarget/getTarget.py

Function:

  • Reads UniProt IDs and antibiotic names from extracted_data_kleb.csv and uniprot_links.csv.
  • For each UniProt ID:
    • Retrieves the corresponding protein sequence from UniProt in .fasta format.
    • Writes the sequence to query.faa.
    • Performs a BLAST search using the query sequence against the database (mergedsequenceskleb.faa).
    • Saves the BLAST results in blast_results.txt.

Functions Breakdown:

  • get_protein_sequence(uniprot_id): Retrieves the protein sequence from UniProt for the given UniProt ID by making a request to the UniProt API.
  • subprocess.run: Executes the BLAST command to compare the query sequence against the provided database. The results include details such as query accession, subject accession, e-value, query length, percentage identity (pident), query start/end, subject start/end.

Output:

  • blast_results.txt: This file contains the BLAST search results for all the sequences compared to the database.
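
The retrieve-then-BLAST loop above might look roughly like the following sketch. It is not the code in getTarget.py: it uses the standard library's `urllib` instead of Requests to stay self-contained, assumes the UniProt REST endpoint `https://rest.uniprot.org/uniprotkb/<ID>.fasta`, and assumes the database is a plain FASTA file passed with `-subject` (a database formatted with `makeblastdb` would be passed with `-db` instead). The helper names `build_blast_command` and `blast_uniprot_id` are hypothetical.

```python
import subprocess
import urllib.error
import urllib.request

UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta"

# Tabular output with the fields listed above, in an assumed order.
BLAST_OUTFMT = "6 qacc sacc evalue qlen pident qstart qend sstart send"


def get_protein_sequence(uniprot_id, timeout=30):
    """Return the FASTA record for a UniProt accession, or None on failure."""
    url = UNIPROT_FASTA_URL.format(uniprot_id=uniprot_id)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            text = response.read().decode("utf-8")
    except urllib.error.URLError:
        return None
    return text if text.startswith(">") else None


def build_blast_command(query_path, subject_path):
    """Build the blastp argument list for one query file."""
    return ["blastp", "-query", query_path, "-subject", subject_path,
            "-outfmt", BLAST_OUTFMT]


def blast_uniprot_id(uniprot_id, subject_path="mergedsequenceskleb.faa",
                     query_path="query.faa", results_path="blast_results.txt"):
    """Fetch one sequence, BLAST it, and append the hits to results_path."""
    fasta = get_protein_sequence(uniprot_id)
    if fasta is None:
        return False
    with open(query_path, "w") as handle:
        handle.write(fasta)
    completed = subprocess.run(build_blast_command(query_path, subject_path),
                               capture_output=True, text=True, check=True)
    with open(results_path, "a") as handle:
        handle.write(completed.stdout)
    return True
```

Passing the command as a list avoids shell quoting issues with the multi-word `-outfmt` value, and `check=True` raises if blastp exits with an error rather than silently writing an empty result.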

4. DrugTarget/identityFiltered.py

Function:

  • Filters the BLAST results in blast_results.txt to retain only those hits where the percentage identity (pident) is greater than 90%.
  • The filtered results are written to blast_hits_pident_above_90.txt.

Steps:

  1. Reads the BLAST results from blast_results.txt.
  2. For each hit, it checks if the pident is above the threshold (e.g., 90%).
  3. If so, it saves the hit details (drug name, sequence, blast hit, and pident) to blast_hits_pident_above_90.txt.

Output:

  • blast_hits_pident_above_90.txt: A filtered results file containing only BLAST hits with pident greater than 90%.
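
The filtering steps above can be sketched as follows, assuming blast_results.txt is tab-separated and that pident sits in the fifth column (index 4). That column position and the function name `filter_by_identity` are assumptions for illustration, and the sketch omits the drug-name lookup that the real script performs.

```python
PIDENT_COLUMN = 4  # assumed position of pident in each tab-separated row


def filter_by_identity(lines, threshold=90.0, pident_column=PIDENT_COLUMN):
    """Return the rows (as field lists) whose pident is strictly above threshold."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= pident_column:
            continue  # blank or malformed line
        try:
            pident = float(fields[pident_column])
        except ValueError:
            continue  # header or comment line
        if pident > threshold:
            kept.append(fields)
    return kept
```

A file handle opened on blast_results.txt can be passed directly as `lines`, since the function only iterates over it.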

5. STRINGComparison/AMRsequences.py

Function:

  • Extracts Klebsiella pneumoniae gene sequences from CARD and writes them to query.faa. The genes to extract are taken from the list of Klebsiella pneumoniae-related genes in AMRGeneNames.csv.

Modification:

  • Modify line 7 to specify the desired output file.

Output:

  • query.faa: A FASTA (.faa) file containing the sequence of each AMR gene associated with Klebsiella pneumoniae.
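
A rough sketch of this extraction, assuming the CARD JSON layout in which each model entry carries a `model_name` and nests its sequences under `model_sequences` → `sequence` → `protein_sequence`. That layout, the FASTA header format, and both function names are assumptions made for illustration, not a description of AMRsequences.py.

```python
def extract_protein_sequences(card, gene_names):
    """Yield (gene name, accession, sequence) for models named in gene_names."""
    wanted = set(gene_names)
    for model in card.values():
        # Top-level metadata entries are plain strings, so skip non-dict values.
        if not isinstance(model, dict) or model.get("model_name") not in wanted:
            continue
        sequences = model.get("model_sequences", {}).get("sequence", {})
        for entry in sequences.values():
            protein = entry.get("protein_sequence", {})
            if protein.get("sequence"):
                yield model["model_name"], protein.get("accession", ""), protein["sequence"]


def to_fasta(records, width=60):
    """Format (name, accession, sequence) records as FASTA text."""
    lines = []
    for name, accession, sequence in records:
        lines.append(f">{name}|{accession}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "".join(line + "\n" for line in lines)
```

Here `card` would be the dictionary returned by `json.load` on the CARD JSON file, and `gene_names` the gene column read from AMRGeneNames.csv.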

6. STRINGComparison/filter.py

Function:

  • Filters the BLAST results in a specified file to retain only those hits where the percentage identity (pident) is greater than the specified threshold.

Input:

  • filter.py takes two command-line inputs: the pident threshold and the output file.

Output:

  • filteredbypident.csv: A CSV file containing the highest BLAST hits with pident >= the specified threshold.
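
One way such a command-line filter could be structured is sketched below. It assumes tab-separated BLAST input with the query ID in column 0 and pident in column 4, keeps the single highest-pident hit per query at or above the threshold, and adds a hypothetical `--input` option because the text does not say how the input file is chosen. None of this is taken from filter.py itself.

```python
import argparse
import csv


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Filter tabular BLAST hits by percentage identity.")
    parser.add_argument("threshold", type=float, help="minimum pident to keep")
    parser.add_argument("output", help="path of the filtered CSV file")
    # Hypothetical option; the original script may locate its input differently.
    parser.add_argument("--input", default="blast_results.txt",
                        help="tab-separated BLAST results")
    return parser.parse_args(argv)


def best_hits(rows, threshold, query_col=0, pident_col=4):
    """Keep, per query, the hit with the highest pident that is >= threshold."""
    best = {}
    for row in rows:
        pident = float(row[pident_col])
        if pident < threshold:
            continue
        query = row[query_col]
        if query not in best or pident > float(best[query][pident_col]):
            best[query] = row
    return list(best.values())


def main(argv=None):
    args = parse_args(argv)
    with open(args.input, newline="") as src:
        rows = [row for row in csv.reader(src, delimiter="\t") if row]
    with open(args.output, "w", newline="") as dst:
        csv.writer(dst).writerows(best_hits(rows, args.threshold))
```

Under these assumptions, calling `main(["90", "filteredbypident.csv"])` mirrors a command line of the form `python filter.py 90 filteredbypident.csv`.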

7. annotation/geneNeighborhood.py

Function:

  • Creates a gene network from the corresponding annotation_detail.csv file and embeds the network in a specified output text file.

Input:

  • geneNeighborhood.py takes three command-line inputs: file_path, the path to the input annotation file; the output file path the network should be written to; and unkown_gene, a boolean that includes unknown genes in the gene neighborhood when "True" and excludes them when "False".

Output:

  • geneNeighborhood.txt: A text file containing the gene network derived from the genome sequence annotation.
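
To illustrate what a gene neighborhood network can look like, the sketch below links each gene to the next gene on the same contig, ordered by start coordinate. The (contig, start, gene) record shape, the `"unknown"` placeholder label, and the adjacency rule are all assumptions; the columns of annotation_detail.csv and the network definition used by geneNeighborhood.py may differ.

```python
from collections import defaultdict


def parse_bool(text):
    """Interpret a command-line string such as "True" or "False"."""
    return text.strip().lower() == "true"


def gene_neighborhood(records, include_unknown=True, unknown_label="unknown"):
    """Return edges between genes that are adjacent on the same contig.

    records: iterable of (contig, start, gene); an empty gene name is
    treated as an unknown gene.
    """
    by_contig = defaultdict(list)
    for contig, start, gene in records:
        gene = gene or unknown_label
        if gene == unknown_label and not include_unknown:
            continue  # dropping unknowns makes their flanking genes neighbors
        by_contig[contig].append((start, gene))
    edges = []
    for contig in sorted(by_contig):
        ordered = [gene for _, gene in sorted(by_contig[contig])]
        edges.extend(zip(ordered, ordered[1:]))
    return edges
```

The resulting edge list could then be written one pair per line to the output text file.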

Usage

  1. Extract Data:

    • Run CARD/CARD_data2.py to extract Klebsiella pneumoniae data from CARD.
    • Ensure line 43 is modified to specify Klebsiella pneumoniae.
  2. Create Knowledge Graph:

    • Use CARD/create_kg.py to read the extracted data from extracted_data_kleb.csv and create the knowledge graph files.
  3. Retrieve Protein Sequences and Perform BLAST:

    • Execute DrugTarget/getTarget.py to read UniProt IDs and antibiotic names, retrieve protein sequences, perform BLAST searches, and save the results in blast_results.txt.
  4. Filter BLAST Results:

    • Run DrugTarget/identityFiltered.py to filter the BLAST results based on percentage identity and save the filtered results to blast_hits_pident_above_90.txt.
  5. Extract Sequence Data:

    • Run STRINGComparison/AMRsequences.py to extract Klebsiella pneumoniae gene sequences from CARD.json and write them to a specified output file.
  6. Filter BLAST Results:

    • Run STRINGComparison/filter.py to filter the BLAST results by percentage identity and save the filtered results to the specified output file.

Dependencies

  • Python 3.x
  • Requests library for API calls
  • BLAST+ for performing BLAST searches
  • UniProt API for retrieving protein sequences

Ensure all dependencies are installed and properly configured before running the scripts.
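
A small pre-flight check along these lines can confirm that the tools are reachable before running the scripts. This is only a sketch: the function name `missing_dependencies` is hypothetical, and it assumes BLAST+ is detected by looking for the `blastp` executable on the PATH.

```python
import importlib.util
import shutil


def missing_dependencies(executables=("blastp",), modules=("requests",)):
    """Return the names of executables and Python modules that cannot be found."""
    missing = [exe for exe in executables if shutil.which(exe) is None]
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing
```

An empty list means everything was found; otherwise the list names what still needs to be installed.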
