In [1]:
pip install biopython

Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.84


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


```REST```: Representational State Transfer, a style of web service that lets you interact with web resources (such as the KEGG database) over HTTP.

KEGG REST API: KEGG exposes its data (e.g., enzyme information, pathway data) through a RESTful interface.

```request.read()```: Reads the content returned by the KEGG REST API request. The **request** object holds the response fetched from the KEGG database, and the ```.read()``` method returns that data as a string.

```.write()```: Writes that string (the enzyme data) to the file located at /content/drive/MyDrive/Biopython_Garden_City/ec_5.4.2.2.txt and returns the number of characters written, which is why the cell prints a number.

In [3]:
# Request the enzyme record and save it to Drive

from Bio.KEGG import REST
from Bio.KEGG import Enzyme
request = REST.kegg_get("ec:5.4.2.2")  # REST.kegg_get() fetches the KEGG entry named in the parentheses (here, enzyme EC 5.4.2.2)
open("/content/drive/MyDrive/Biopython_Garden_City/ec_5.4.2.2.txt", "w").write(request.read())  # Save the enzyme record to a file in Google Drive


296898
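
The cell above never closes the file handle. A minimal sketch of the same save step with a context manager, which closes the file as soon as the data is written, could look like this. The helper name ```save_text``` and the in-memory stand-in for the KEGG response are assumptions made for illustration, so the example runs without network access or a mounted Drive.

```python
import io
import os
import tempfile


def save_text(handle, path):
    """Write everything from a readable text handle to path.

    Returns the number of characters written (hypothetical helper).
    """
    with open(path, "w") as out:
        return out.write(handle.read())


# Stand-in for the handle returned by REST.kegg_get(...); not real KEGG data.
fake_response = io.StringIO("stand-in enzyme record\n")

# A temporary directory replaces the Google Drive folder in this sketch.
target = os.path.join(tempfile.mkdtemp(), "ec_5.4.2.2.txt")
written = save_text(fake_response, target)
print(written)
```

In the notebook, the first argument would presumably be the handle from ```REST.kegg_get("ec:5.4.2.2")``` and the second the Drive path used above.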

```Enzyme.parse()```: Parses the enzyme data from an open file handle and yields one record per enzyme entry.

```open()```: Opens the file for reading (the default mode).

In [4]:
records = Enzyme.parse(open("/content/drive/MyDrive/Biopython_Garden_City/ec_5.4.2.2.txt"))
record = list(records)[0] #converts the parsed records into a list, [0]-grabs the first enzyme record
print(record.entry) #Prints the entry ID of the enzyme, which is typically the EC number
print(record.name) #Prints the name of the enzyme
print(record.classname) #Prints the class of the enzyme, which categorizes the enzyme based on its function (e.g., transferase, isomerase).
print(record.pathway)  #Prints the pathways associated with this enzyme, showing which metabolic pathways this enzyme participates in.
print(record.genes)   #Prints the genes associated with the enzyme, detailing the species and corresponding genes that encode this enzyme.

5.4.2.2
['phosphoglucomutase (alpha-D-glucose-1,6-bisphosphate-dependent)', 'glucose phosphomutase (ambiguous)', 'phosphoglucose mutase (ambiguous)']
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
[('PATH', 'ec00010', 'Glycolysis / Gluconeogenesis'), ('PATH', 'ec00030', 'Pentose phosphate pathway'), ('PATH', 'ec00052', 'Galactose metabolism'), ('PATH', 'ec00230', 'Purine metabolism'), ('PATH', 'ec00500', 'Starch and sucrose metabolism'), ('PATH', 'ec00520', 'Amino sugar and nucleotide sugar metabolism'), ('PATH', 'ec00521', 'Streptomycin biosynthesis'), ('PATH', 'ec01100', 'Metabolic pathways'), ('PATH', 'ec01110', 'Biosynthesis of secondary metabolites'), ('PATH', 'ec01120', 'Microbial metabolism in diverse environments')]
[('HSA', ['5236', '55276']), ('PTR', ['456908', '461162']), ('PPS', ['100977295', '100993927']), ('GGO', ['101128874', '101131551']), ('PON', ['100190836', '100438793']), ('PPYG', ['129034752', '129035286']), ('NLE', ['100596
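
The last two fields are lists of tuples. As a small sketch, using a shortened copy of the values printed above, they can be turned into dictionaries for easier lookup. The variable names ```pathway_names``` and ```genes_by_organism``` are illustrative and not part of Biopython.

```python
# Shortened copies of record.pathway and record.genes from the output above.
pathway = [
    ("PATH", "ec00010", "Glycolysis / Gluconeogenesis"),
    ("PATH", "ec00030", "Pentose phosphate pathway"),
]
genes = [("HSA", ["5236", "55276"]), ("PTR", ["456908", "461162"])]

# Map each pathway ID to its name, ignoring the database label ("PATH").
pathway_names = {pathway_id: name for _db, pathway_id, name in pathway}

# Map each KEGG organism code to its list of gene IDs.
genes_by_organism = dict(genes)

print(pathway_names["ec00010"])
print(genes_by_organism["HSA"])
```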

### PROBLEM 1: Filter only the DNA repair pathways in humans

#### SOLUTION:
1) Import the REST module from Bio.KEGG: ```from Bio.KEGG import REST```
2) ```REST.kegg_list("pathway", "hsa")``` sends a request to the KEGG REST API to retrieve the list of human pathways.

  a) ```pathway```: The database to list; here, pathway data.

  b) ```hsa```: The KEGG organism code for humans (Homo sapiens).

3) ```.read()``` reads the API response as a string, which is stored in the variable **human_pathways**.


In [6]:
from Bio.KEGG import REST
human_pathways = REST.kegg_list("pathway", "hsa").read()
print(human_pathways) #This prints the list of human pathways retrieved from KEGG to the console

hsa01100	Metabolic pathways - Homo sapiens (human)
hsa01200	Carbon metabolism - Homo sapiens (human)
hsa01210	2-Oxocarboxylic acid metabolism - Homo sapiens (human)
hsa01212	Fatty acid metabolism - Homo sapiens (human)
hsa01230	Biosynthesis of amino acids - Homo sapiens (human)
hsa01232	Nucleotide metabolism - Homo sapiens (human)
hsa01250	Biosynthesis of nucleotide sugars - Homo sapiens (human)
hsa01240	Biosynthesis of cofactors - Homo sapiens (human)
hsa00010	Glycolysis / Gluconeogenesis - Homo sapiens (human)
hsa00020	Citrate cycle (TCA cycle) - Homo sapiens (human)
hsa00030	Pentose phosphate pathway - Homo sapiens (human)
hsa00040	Pentose and glucuronate interconversions - Homo sapiens (human)
hsa00051	Fructose and mannose metabolism - Homo sapiens (human)
hsa00052	Galactose metabolism - Homo sapiens (human)
hsa00053	Ascorbate and aldarate metabolism - Homo sapiens (human)
hsa00500	Starch and sucrose metabolism - Homo sapiens (human)
hsa00520	Amino sugar and nucleotide sugar metabo

### Now let's refine the list of human pathways, keeping only those related to repair mechanisms, as the problem states.

1) ```repair_pathways = []```: Creates an empty list to store the IDs of pathways whose description contains "repair".

2) ```pathways = {}```: Initializes an empty dictionary that maps each matching *pathway ID* (**key**) to its *description* (**value**).

3) ```for line in human_pathways.rstrip().split("\n"):``` Iterates over each line of the **human_pathways** variable. The ```.rstrip()``` method removes trailing whitespace and newlines, and ```.split("\n")``` splits the text into individual lines.
Each line represents one pathway entry (an ID and a description).

4) ```entry, description = line.split("\t")```
Splits each line into two parts at the tab character ```(\t)```:

```entry```: Contains the pathway ID (e.g., hsa03430).

```description```: Contains the pathway description (e.g., "Mismatch repair - Homo sapiens (human)").

5) ```if "repair" in description:```
Checks whether the substring "repair" appears in the pathway description. The test is case-sensitive. If it is true, the pathway is kept.

6) ```repair_pathways.append(entry)```:
Adds the pathway ID to the ```repair_pathways``` list created in step 1.

7) ```pathways[entry] = description```:
Stores the description in the ```pathways``` dictionary under the pathway ID.

In [7]:
repair_pathways = []
pathways={}
for line in human_pathways.rstrip().split("\n"):
  entry, description = line.split("\t")
  if "repair" in description:
    repair_pathways.append(entry)
    pathways[entry]=description

print(repair_pathways)
print(pathways)

['hsa03410', 'hsa03420', 'hsa03430']
{'hsa03410': 'Base excision repair - Homo sapiens (human)', 'hsa03420': 'Nucleotide excision repair - Homo sapiens (human)', 'hsa03430': 'Mismatch repair - Homo sapiens (human)'}
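
The same filter can be wrapped in a function so that the keyword becomes a parameter. This is only a sketch: ```filter_pathways``` is a hypothetical helper, ```sample_listing``` is a hand-written stand-in for the text returned by ```REST.kegg_list("pathway", "hsa").read()```, and the lowercasing is an addition that makes the match case-insensitive.

```python
def filter_pathways(listing, keyword):
    """Return {pathway_id: description} for descriptions containing keyword.

    The comparison ignores case (an addition to the notebook's logic).
    """
    matches = {}
    for line in listing.rstrip().split("\n"):
        # maxsplit=1 keeps any further tabs inside the description.
        entry, description = line.split("\t", 1)
        if keyword.lower() in description.lower():
            matches[entry] = description
    return matches


# Stand-in for the tab-separated listing; three lines in the KEGG layout.
sample_listing = (
    "hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "hsa03410\tBase excision repair - Homo sapiens (human)\n"
    "hsa03430\tMismatch repair - Homo sapiens (human)\n"
)

repair = filter_pathways(sample_listing, "repair")
print(list(repair))  # the dictionary keys are the pathway IDs
```

Because Python dictionaries preserve insertion order, the keys of the result already give the list of IDs, so a separate list is optional.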


The next cell follows the same structure as the previous one, but filters pathways on whether the word ```"signaling"``` appears in the description.

1) ```if "signaling" in description:```
Checks whether the word "signaling" appears in the pathway description. If it does, the pathway is kept.

2) ```print("    ")```:
Prints a line containing four spaces, which visually separates the list of IDs from the dictionary in the output.

In [8]:
# Separate names keep the repair results from the previous cell intact
signaling_pathways = []
signaling_descriptions = {}
for line in human_pathways.rstrip().split("\n"):
  entry, description = line.split("\t")
  if "signaling" in description:
    signaling_pathways.append(entry)
    signaling_descriptions[entry] = description

print(signaling_pathways)
print("    ")

print(signaling_descriptions)

['hsa04010', 'hsa04012', 'hsa04014', 'hsa04015', 'hsa04310', 'hsa04330', 'hsa04340', 'hsa04350', 'hsa04390', 'hsa04392', 'hsa04370', 'hsa04371', 'hsa04630', 'hsa04064', 'hsa04668', 'hsa04066', 'hsa04068', 'hsa04020', 'hsa04070', 'hsa04072', 'hsa04071', 'hsa04024', 'hsa04022', 'hsa04151', 'hsa04152', 'hsa04150', 'hsa04115', 'hsa04620', 'hsa04621', 'hsa04622', 'hsa04625', 'hsa04660', 'hsa04657', 'hsa04662', 'hsa04664', 'hsa04062', 'hsa04910', 'hsa04922', 'hsa04920', 'hsa03320', 'hsa04912', 'hsa04915', 'hsa04917', 'hsa04921', 'hsa04926', 'hsa04919', 'hsa04261', 'hsa04723', 'hsa04722', 'hsa05120', 'hsa04933']
    
{'hsa04010': 'MAPK signaling pathway - Homo sapiens (human)', 'hsa04012': 'ErbB signaling pathway - Homo sapiens (human)', 'hsa04014': 'Ras signaling pathway - Homo sapiens (human)', 'hsa04015': 'Rap1 signaling pathway - Homo sapiens (human)', 'hsa04310': 'Wnt signaling pathway - Homo sapiens (human)', 'hsa04330': 'Notch signaling pathway - Homo sapiens (human)', 'hsa04340': 'Hed

### PROBLEM 2: Get the genes involved in a pathway

#### Solution:
The code below retrieves a specific KEGG pathway and prints its content and type.

1) ```.read()```: Retrieves the content of the response as a string, which is stored in the variable **pathway_file**.

2) ```print(type(pathway_file))```:
Prints the data type of ```pathway_file```. Because ```.read()``` was used, the last line of the output will be ```<class 'str'>```, indicating that the pathway content is a string.

In [9]:

pathway_file = REST.kegg_get("hsa03410").read() #This line requests data for the KEGG pathway with the ID "hsa03410"
print(pathway_file)
print(type(pathway_file))


ENTRY       hsa03410                    Pathway
NAME        Base excision repair - Homo sapiens (human)
DESCRIPTION Base excision repair (BER) is the predominant DNA damage repair pathway for the processing of small base lesions, derived from oxidation and alkylation damages. BER is normally defined as DNA repair initiated by lesion-specific DNA glycosylases and completed by either of the two sub-pathways: short-patch BER where only one nucleotide is replaced and long-patch BER where 2-13 nucleotides are replaced. Each sub-pathway of BER relies on the formation of protein complexes that assemble at the site of the DNA lesion and facilitate repair in a coordinated fashion. This process of complex formation appears to provide an increase in specificity and efficiency to the BER pathway, thereby facilitating the maintenance of genome integrity by preventing the accumulation of highly toxic repair intermediates.
CLASS       Genetic Information Processing; Replication and repair
PATHWAY_MAP

1) ```import re```:
Imports the ```re``` module, which provides regular-expression support in Python. Regular expressions are used here to extract gene information from the KEGG pathway file.

2) ```pattern = r"GENE(.*?)REFERENCE"```:

Defines a regular expression that matches the part of the file starting at ```"GENE"``` and ending at the first following ```"REFERENCE"```.
The non-greedy group ```(.*?)``` captures as few characters as possible between "GENE" and "REFERENCE".

3) ```genes = re.findall(pattern, pathway_file, re.DOTALL)```:

```re.findall()``` searches the whole *pathway_file* string (from cell [9]) for occurrences of the pattern and returns the captured groups as a list.
Here the captured text is the block between the "GENE" and "REFERENCE" field names, which is where KEGG pathway files list their genes.

```re.DOTALL``` is a flag in Python's ```re``` module that makes the dot ```(.)``` match every character, including newlines ```(\n)```.
This lets the pattern capture everything between "GENE" and "REFERENCE" across multiple lines.

Without ```re.DOTALL```:

The dot stops at a newline, so the pattern matches only if "GENE" and "REFERENCE" are on the same line.
It matches nothing if they are on different lines.


With ```re.DOTALL```:

The dot matches all characters, including newlines, so the pattern extracts everything between "GENE" and "REFERENCE" across multiple lines.

4) ```genes[0].strip().split("\n")```:

```genes[0]```: Takes the first match from the list (the gene information between "GENE" and "REFERENCE").

```.strip()```: Removes leading and trailing whitespace from the whole block. Only the first line loses its leading spaces; the remaining lines keep their indentation.

```.split("\n")```: Splits the gene information at newlines, creating a list in which each item is one line describing a gene.
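
To see the effect of ```re.DOTALL``` in isolation, here is a small sketch on a hand-written stand-in for a pathway file. The text in ```sample``` only imitates the layout (it is not real API output, and the REFERENCE line is a placeholder). The final step, which splits each line into a gene ID and symbol, is an illustrative extension.

```python
import re

# Hand-written stand-in that imitates the layout of a KEGG pathway file.
sample = (
    "NAME        Base excision repair - Homo sapiens (human)\n"
    "GENE        4968  OGG1; 8-oxoguanine DNA glycosylase [KO:K03660]\n"
    "            7374  UNG; uracil DNA glycosylase [KO:K03648]\n"
    "REFERENCE   (placeholder)\n"
)

pattern = r"GENE(.*?)REFERENCE"

# Without the flag, "." cannot cross the newline, so nothing matches.
without_flag = re.findall(pattern, sample)

# With the flag, the group spans both gene lines.
with_flag = re.findall(pattern, sample, re.DOTALL)

# Strip every line individually so all of them lose their indentation.
lines = [line.strip() for line in with_flag[0].strip().split("\n")]

# Split "ID  SYMBOL; description" into (ID, SYMBOL) pairs.
parsed = []
for line in lines:
    gene_id, rest = line.split(None, 1)
    parsed.append((gene_id, rest.split(";")[0]))

print(without_flag)
print(parsed)
```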

In [10]:
# Import the regular expression module
import re
pattern = r"GENE(.*?)REFERENCE"
genes = re.findall(pattern, pathway_file, re.DOTALL)
genes[0].strip().split("\n")

['4968  OGG1; 8-oxoguanine DNA glycosylase [KO:K03660] [EC:3.2.2.- 4.2.99.18]',
 '            4913  NTHL1; nth like DNA glycosylase 1 [KO:K10773] [EC:3.2.2.- 4.2.99.18]',
 '            79661  NEIL1; nei like DNA glycosylase 1 [KO:K10567] [EC:3.2.2.- 4.2.99.18]',
 '            252969  NEIL2; nei like DNA glycosylase 2 [KO:K10568] [EC:3.2.2.- 4.2.99.18]',
 '            55247  NEIL3; nei like DNA glycosylase 3 [KO:K10569] [EC:3.2.2.- 4.2.99.18]',
 '            7374  UNG; uracil DNA glycosylase [KO:K03648] [EC:3.2.2.27]',
 '            23583  SMUG1; single-strand-selective monofunctional uracil-DNA glycosylase 1 [KO:K10800] [EC:3.2.2.-]',
 '            4595  MUTYH; mutY DNA glycosylase [KO:K03575] [EC:3.2.2.31]',
 '            4350  MPG; N-methylpurine DNA glycosylase [KO:K03652] [EC:3.2.2.21]',
 '            8930  MBD4; methyl-CpG binding domain 4, DNA glycosylase [KO:K10801] [EC:3.2.2.-]',
 '            6996  TDG; thymine DNA glycosylase [KO:K20813] [EC:3.2.2.29]',
 '            328  APE