This repository contains scripts for extracting Klebsiella pneumoniae data from CARD, creating a knowledge graph (KG), annotating a Klebsiella pneumoniae genome sequence with a BAKTA/RGI pipeline, running BLAST searches against a sequence database, and filtering the BLAST results by sequence identity.
CARD/CARD_data2.py
Function:
- Extracts Klebsiella pneumoniae data from CARD and writes it into extracted_data_kleb.csv.
Modification:
- Modify line 43 to specify the organism to extract. For this project, it is set to Klebsiella pneumoniae.
Output:
extracted_data_kleb.csv: A CSV file containing the data of Klebsiella pneumoniae.
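The extraction step can be pictured with a minimal sketch. It assumes each CARD record has already been flattened into a dictionary with an `organism` field. That layout and the helper names `filter_by_organism` and `write_rows` are illustrative assumptions, not the contents of the actual script.

```python
import csv

# Organism to keep; the line the README says to modify plays this role.
TARGET_ORGANISM = "Klebsiella pneumoniae"


def filter_by_organism(records, organism=TARGET_ORGANISM):
    """Keep records whose (assumed) 'organism' field mentions the organism."""
    needle = organism.lower()
    return [r for r in records if needle in (r.get("organism") or "").lower()]


def write_rows(rows, path, fieldnames):
    """Write the kept rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

With records loaded, `write_rows(filter_by_organism(records), "extracted_data_kleb.csv", fieldnames)` would produce the output file. Switching to another organism only requires changing `TARGET_ORGANISM`.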
CARD/create_kg.py
Function:
- Reads the extracted data from extracted_data_kleb.csv and creates a knowledge graph (KG).
- For each relationship between entities (e.g., drug targets, gene relationships), separate files are generated in the format of object-relationship-subject.
Output:
- Files representing various relationships in the knowledge graph.
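As a rough sketch of the one-file-per-relationship idea, the code below groups rows into object-relationship-subject triples and writes each group to its own file. The relationship names, column names, and `.tsv` file naming are hypothetical; the real script may use different ones.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping: relationship name -> (object column, subject column).
RELATIONSHIPS = {
    "targets": ("antibiotic", "gene"),
    "belongs_to": ("gene", "gene_family"),
}


def build_triples(rows, relationships=RELATIONSHIPS):
    """Group (object, relationship, subject) triples by relationship name."""
    grouped = defaultdict(set)
    for row in rows:
        for relation, (obj_col, subj_col) in relationships.items():
            obj = (row.get(obj_col) or "").strip()
            subj = (row.get(subj_col) or "").strip()
            if obj and subj:  # skip rows that lack either end of the edge
                grouped[relation].add((obj, relation, subj))
    return {relation: sorted(triples) for relation, triples in grouped.items()}


def write_relationship_files(grouped, out_dir):
    """Write one tab-separated file per relationship."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for relation, triples in grouped.items():
        lines = ["\t".join(triple) for triple in triples]
        (out_path / f"{relation}.tsv").write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Collecting triples in a set removes duplicates when the same pair appears in several rows of the CSV.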
DrugTarget/getTarget.py
Function:
- Reads UniProt IDs and antibiotic names from extracted_data_kleb.csv and uniprot_links.csv.
- For each UniProt ID:
  - Retrieves the corresponding protein sequence from UniProt in .fasta format.
  - Writes the sequence to query.faa.
  - Performs a BLAST search using the query sequence against the database (mergedsequenceskleb.faa).
  - Saves the BLAST results in blast_results.txt.
Functions Breakdown:
- get_protein_sequence(uniprot_id): Retrieves the protein sequence from UniProt for the given UniProt ID by making a request to the UniProt API.
- subprocess.run: Executes the BLAST command to compare the query sequence against the provided database. The results include details such as query accession, subject accession, e-value, query length, percentage identity (pident), query start/end, and subject start/end.
Output:
blast_results.txt: This file contains the BLAST search results for all the sequences compared to the database.
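The fetch-then-BLAST loop might look roughly like the sketch below. It makes several assumptions: the search program is `blastp`, the database has already been formatted for use with `-db`, and the output uses BLAST+ tabular format with the fields listed above. The helper names other than `get_protein_sequence` are illustrative.

```python
import subprocess

# Fields named in the README, written as BLAST+ tabular (outfmt 6) specifiers.
BLAST_FIELDS = "qacc sacc evalue qlen pident qstart qend sstart send"


def uniprot_fasta_url(uniprot_id):
    """URL of the FASTA record for one UniProt accession."""
    return f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta"


def get_protein_sequence(uniprot_id, timeout=30):
    """Download the FASTA text for a UniProt ID (needs network access)."""
    import requests  # imported here so the other helpers work without it

    response = requests.get(uniprot_fasta_url(uniprot_id), timeout=timeout)
    response.raise_for_status()
    return response.text


def build_blast_command(query_path, db_path, program="blastp"):
    """Argument list for the search; blastp is an assumption of this sketch."""
    return [program, "-query", query_path, "-db", db_path,
            "-outfmt", f"6 {BLAST_FIELDS}"]


def run_blast(query_path, db_path):
    """Run BLAST and return its tabular output as text."""
    result = subprocess.run(build_blast_command(query_path, db_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the command as a list rather than a shell string keeps the multi-word `-outfmt` value as a single argument and avoids quoting problems.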
DrugTarget/identityFiltered.py
Function:
- Filters the BLAST results in blast_results.txt to retain only those hits where the percentage identity (pident) is greater than 90%.
- The filtered results are written to blast_hits_pident_above_90.txt.
Steps:
- Reads the BLAST results from blast_results.txt.
- For each hit, it checks if the pident is above the threshold (e.g., 90%).
- If so, it saves the hit details (drug name, sequence, blast hit, and pident) to blast_hits_pident_above_90.txt.
Output:
blast_hits_pident_above_90.txt: A filtered results file containing only BLAST hits with pident greater than 90%.
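The threshold check itself is small. The sketch below assumes tab-separated rows with pident in the fifth column; the real blast_results.txt also carries drug names and sequences, so its layout likely differs. `filter_by_identity` and `PIDENT_COLUMN` are illustrative names.

```python
PIDENT_COLUMN = 4  # assumed zero-based position of pident in each row


def filter_by_identity(lines, threshold=90.0, column=PIDENT_COLUMN):
    """Return the rows whose pident is strictly greater than the threshold."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            continue  # skip blank or malformed rows
        try:
            pident = float(fields[column])
        except ValueError:
            continue  # skip header or comment rows
        if pident > threshold:
            kept.append(fields)
    return kept
```

Note the strict comparison: a hit with pident exactly equal to 90 is dropped, matching the "greater than 90%" wording.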
STRINGComparison/AMRsequences.py
Function:
- Extracts Klebsiella pneumoniae gene sequences from CARD and writes them into query.faa. Uses the list of genes related to Klebsiella pneumoniae from AMRGeneNames.csv.
Modification:
- Modify line 7 to specify the desired output file.
Output:
query.faa: A FASTA (.faa) file containing the sequence data for each AMR gene associated with Klebsiella pneumoniae.
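Writing the selected genes as FASTA could be sketched as follows. The sketch assumes the CARD sequences have already been loaded into a plain name-to-sequence dictionary, which is a simplification of the real CARD data. The function names are illustrative.

```python
def select_sequences(all_sequences, wanted_names):
    """Pick sequences for the wanted gene names, in the order they are listed."""
    return [(name, all_sequences[name]) for name in wanted_names if name in all_sequences]


def format_fasta(name, sequence, width=60):
    """Format one record as FASTA text with wrapped sequence lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(chunks) + "\n"


def write_fasta(records, path):
    """Write (name, sequence) pairs to a FASTA file."""
    with open(path, "w", encoding="utf-8") as handle:
        for name, sequence in records:
            handle.write(format_fasta(name, sequence))
```

Gene names that are missing from the sequence dictionary are skipped silently here; a real script might prefer to log them.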
STRINGComparison/filter.py
Function:
- Filters the BLAST results in a specified file to retain only those hits where the percentage identity (pident) is greater than the specified threshold.
Input:
- filter.py takes two command-line inputs: the pident threshold and the output file.
Output:
filteredbypident.csv: A CSV file containing the highest BLAST hits with pident >= the pident threshold.
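One way to read "highest BLAST hits" is the best hit per query among those that meet the threshold. The sketch below implements that reading together with a two-argument command line. Both the interpretation and the names `build_parser` and `best_hits` are assumptions, not a description of the actual script.

```python
import argparse


def build_parser():
    """Two positional inputs, in the order the README lists them."""
    parser = argparse.ArgumentParser(description="Filter BLAST hits by pident.")
    parser.add_argument("pident_threshold", type=float)
    parser.add_argument("output_file")
    return parser


def best_hits(rows, threshold):
    """Keep, per query, the hit with the highest pident that is >= threshold."""
    best = {}
    for query, subject, pident in rows:
        if pident < threshold:
            continue
        if query not in best or pident > best[query][1]:
            best[query] = (subject, pident)
    return [(query, subject, pident) for query, (subject, pident) in best.items()]
```

Under these assumptions the script would be invoked as `python filter.py 90 filteredbypident.csv`.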
geneNeighborhood.py
Function:
- Creates a gene network from the corresponding annotation_detail.csv file and embeds the network in a specified output text file.
Input:
- geneNeighborhood.py takes three command-line inputs: file_path, the path to the input annotation file; the output file path where the network should be embedded; and unkown_gene, a boolean that includes unknown genes in the gene neighborhood when "True" and excludes them when "False".
Output:
geneNeighborhood.txt: A text file containing the gene network derived from the genome sequence annotation.
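The README does not spell out how the network is built. As one plausible reading, the sketch below links each gene to the next one in annotation order and lets a flag decide whether unknown genes take part. Treating an empty name as "unknown" and the `unknown_<position>` placeholder labels are assumptions of this sketch.

```python
def parse_flag(text):
    """Interpret the command-line string "True"/"False" as a boolean."""
    return text.strip().lower() == "true"


def neighbor_edges(gene_names, include_unknown):
    """Link each retained gene to the next one in annotation order.

    An empty name marks an unknown gene (an assumption of this sketch).
    """
    nodes = []
    for position, name in enumerate(gene_names, start=1):
        if name:
            nodes.append(name)
        elif include_unknown:
            nodes.append(f"unknown_{position}")  # placeholder label
    return list(zip(nodes, nodes[1:]))
```

The flag is compared as a string because `bool("False")` evaluates to `True` in Python, which would silently include unknown genes.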
- Extract Data:
  - Run CARD/CARD_data2.py to extract Klebsiella pneumoniae data from CARD.
  - Ensure line 43 is modified to specify Klebsiella pneumoniae.
- Create Knowledge Graph:
  - Use CARD/create_kg.py to read the extracted data from extracted_data_kleb.csv and create the knowledge graph files.
- Retrieve Protein Sequences and Perform BLAST:
  - Execute DrugTarget/getTarget.py to read UniProt IDs and antibiotic names, retrieve protein sequences, perform BLAST searches, and save the results in blast_results.txt.
- Filter BLAST Results:
  - Run DrugTarget/identityFiltered.py to filter the BLAST results based on percentage identity and save the filtered results to blast_hits_pident_above_90.txt.
- Extract Sequence Data:
  - Run STRINGComparison/AMRsequences.py to extract Klebsiella pneumoniae gene sequences from CARD.json and write them into a specified output file.
- Filter BLAST Results:
  - Run STRINGComparison/filter.py to filter the BLAST results based on percentage identity and save the filtered results to the specified output file.
- Python 3.x
- Requests library for API calls
- BLAST+ for performing BLAST searches
- UniProt API for retrieving protein sequences
Ensure all dependencies are installed and properly configured before running the scripts.
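A quick pre-flight check can confirm that the tools are reachable before running the pipeline. This is a minimal sketch; the executable name `blastp` is an assumption about which BLAST+ program the scripts call, and `missing_dependencies` is an illustrative helper.

```python
import importlib.util
import shutil


def missing_dependencies(executables=("blastp",), modules=("requests",)):
    """Return the names of command-line tools and Python modules not found."""
    missing = [exe for exe in executables if shutil.which(exe) is None]
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing
```

An empty list from `missing_dependencies()` means both BLAST+ and the Requests library were found on the current machine.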