# Gene Information Web Scraping

This notebook scrapes information on roughly 19k Homo sapiens genes from NCBI (https://www.ncbi.nlm.nih.gov/gene) and stores it in a pandas DataFrame. We first downloaded a .gz file directly from NCBI that lists every human gene record, then used that file to decide which genes to scrape.

## Step 0: Load Libraries

Run the cell below to import the required libraries. If a package is not installed, use !pip install 'package name' to install it.

In [None]:
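# Ex: If pandas, requests, or beautifulsoup4 are not installed, uncomment the next 3 lines and rerun this code block
# (gzip, time, random, and re ship with Python and cannot be installed with pip)
# !pip install pandas
# !pip install requests
# !pip install beautifulsoup4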

import gzip
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import random
import re

## Step 1: Import file

First, we import the .gz file.

In [None]:
# Path to the uploaded file - Change to yours
file_path = "D:\\MarkhamLab\\GeneScraping\\Homo_sapiens.gene_info.gz"

# Reading the .gz file
with gzip.open(file_path, 'rt') as f:
    # Load the file into a pandas dataframe
    gene_info_df = pd.read_csv(f, sep='\t')

# Display the first few rows to understand its structure
gene_info_df.head()


Unnamed: 0,#tax_id,GeneID,Symbol,LocusTag,Synonyms,dbXrefs,chromosome,map_location,description,type_of_gene,Symbol_from_nomenclature_authority,Full_name_from_nomenclature_authority,Nomenclature_status,Other_designations,Modification_date,Feature_type
0,9606,1,A1BG,-,A1B|ABG|GAB|HYST2477,MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410...,19,19q13.43,alpha-1-B glycoprotein,protein-coding,A1BG,alpha-1-B glycoprotein,O,alpha-1B-glycoprotein|HEL-S-163pA|epididymis s...,20240827,-
1,9606,2,A2M,-,A2MD|CPAMD5|FWP007|S863-7,MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899...,12,12p13.31,alpha-2-macroglobulin,protein-coding,A2M,alpha-2-macroglobulin,O,alpha-2-macroglobulin|C3 and PZP-like alpha-2-...,20240827,-
2,9606,3,A2MP1,-,A2MP,HGNC:HGNC:8|Ensembl:ENSG00000291190|AllianceGe...,12,12p13.31,alpha-2-macroglobulin pseudogene 1,pseudo,A2MP1,alpha-2-macroglobulin pseudogene 1,O,pregnancy-zone protein pseudogene,20240827,-
3,9606,9,NAT1,-,AAC1|MNAT|NAT-1|NATI,MIM:108345|HGNC:HGNC:7645|Ensembl:ENSG00000171...,8,8p22,N-acetyltransferase 1,protein-coding,NAT1,N-acetyltransferase 1,O,arylamine N-acetyltransferase 1|N-acetyltransf...,20240827,-
4,9606,10,NAT2,-,AAC2|NAT-2|PNAT,MIM:612182|HGNC:HGNC:7646|Ensembl:ENSG00000156...,8,8p22,N-acetyltransferase 2,protein-coding,NAT2,N-acetyltransferase 2,O,arylamine N-acetyltransferase 2|N-acetyltransf...,20240827,-
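
As a side note, pandas can usually read a gzip-compressed file without an explicit gzip.open call, and usecols limits the load to the columns the later steps need. The sketch below is only an illustration under assumptions: load_gene_info is a hypothetical helper, and the tiny sample file stands in for the real NCBI download so the cell runs anywhere.

```python
import gzip
import os
import tempfile

import pandas as pd


def load_gene_info(path, columns=None):
    """Hypothetical helper: read a tab-separated gene_info file.

    compression="infer" lets pandas pick gzip from the .gz suffix.
    """
    return pd.read_csv(path, sep="\t", usecols=columns, compression="infer")


# Tiny stand-in file (not real NCBI data beyond the two example rows shown above).
sample = (
    "#tax_id\tGeneID\tSymbol\ttype_of_gene\n"
    "9606\t1\tA1BG\tprotein-coding\n"
    "9606\t3\tA2MP1\tpseudo\n"
)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.gene_info.gz")
with gzip.open(sample_path, "wt") as f:
    f.write(sample)

# Load only the columns needed downstream.
sample_df = load_gene_info(sample_path, columns=["GeneID", "Symbol", "type_of_gene"])
print(sample_df)
```

For the real file, the same call would take file_path from the cell above in place of sample_path.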


## Step 2: Preprocess Raw Data

Run this cell to filter the data down to ~40,000 rows, so that it contains only the genes we are interested in.

In [None]:
# Condition 1: Remove rows where 'Feature_type' starts with 'regulatory:silencer' or 'regulatory:enhancer'
condition1 = ~gene_info_df['Feature_type'].str.startswith(('regulatory:silencer', 'regulatory:enhancer'), na=False)

# Condition 2: Keep only rows where 'type_of_gene' contains 'protein-coding', 'unknown', 'other', or 'pseudo' (case-insensitive substring match)
condition2 = gene_info_df['type_of_gene'].str.contains('protein-coding|unknown|other|pseudo', case=False, na=False)

# Condition 3: Remove rows that contain 'tRNA' in any of the columns
condition3 = ~gene_info_df.apply(lambda row: row.astype(str).str.contains('tRNA', case=False, na=False).any(), axis=1)

# Apply all conditions to filter the DataFrame
filtered_gene_info_df = gene_info_df[condition1 & condition2 & condition3]

# Get the total number of rows after filtering
total_rows_after_filtering = filtered_gene_info_df.shape[0]

# Display the total number of rows after filtering
total_rows_after_filtering


39933

After running the above cell, the output should be 39,933, the number of rows left after filtering.
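
To see how the three conditions interact, here is a minimal sketch on a made-up five-row frame. The toy rows are invented for illustration, and the tRNA check is written column by column, which is an alternative to the row-wise apply above and may run faster on the full table; it assumes a pandas version where str.startswith accepts a tuple, as the cell above already does.

```python
import pandas as pd

# Invented rows; only the column names match the real gene_info file.
toy = pd.DataFrame({
    "Symbol": ["A1BG", "TRNA1", "ENH1", "MIR1", "A2MP1"],
    "description": [
        "alpha-1-B glycoprotein",
        "tRNA-Ala",
        "enhancer region",
        "microRNA 1",
        "pseudogene 1",
    ],
    "type_of_gene": ["protein-coding", "other", "other", "ncRNA", "pseudo"],
    "Feature_type": ["-", "-", "regulatory:enhancer", "-", "-"],
})

# Condition 1: drop regulatory silencers and enhancers.
not_regulatory = ~toy["Feature_type"].str.startswith(
    ("regulatory:silencer", "regulatory:enhancer"), na=False
)

# Condition 2: keep the gene types of interest.
wanted_type = toy["type_of_gene"].str.contains(
    "protein-coding|unknown|other|pseudo", case=False, na=False
)

# Condition 3: test each column once for 'tRNA', then combine per row.
has_trna = (
    toy.astype(str)
    .apply(lambda col: col.str.contains("tRNA", case=False))
    .any(axis=1)
)

kept = toy[not_regulatory & wanted_type & ~has_trna]
print(kept["Symbol"].tolist())
```

Only A1BG and A2MP1 survive: TRNA1 matches the tRNA check, ENH1 is a regulatory enhancer, and MIR1 has a gene type outside the list.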

## Step 3: Scrape Genes

Now we can use the shortened gene list to scrape each gene's summary from NCBI. Again, make sure the file paths match your setup.

In [None]:
# Function to scrape gene summary from the NCBI gene page
def scrape_gene_summary(gene_id):
    url = f"https://www.ncbi.nlm.nih.gov/gene/{gene_id}"
    try:
        # A timeout keeps one stalled request from hanging the whole run
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        return "NA"  # Return 'NA' if the request itself fails (timeout, connection error)

    if response.status_code != 200:
        return "NA"  # Return 'NA' if the page request fails

    soup = BeautifulSoup(response.content, 'html.parser')

    # Collect all <dd> tags; the summary is expected to be one of them
    dd_tags = soup.find_all('dd')

    # Based on previous info, the summary is in the 10th <dd> tag (index 9)
    if len(dd_tags) >= 10:
        return dd_tags[9].text.strip()
    return "NA"  # If fewer than 10 <dd> tags, return 'NA'

# Path to save the final CSV file for 39k+ genes
output_file_path = "D:\\MarkhamLab\\GeneScraping\\ncbi_summaries.csv"

# Process in batches to avoid being blocked
def process_in_batches(gene_data, batch_size=500, sleep_between_requests=1.5, sleep_between_batches=30):

    for start in range(0, len(gene_data), batch_size):
        batch = gene_data[start:start + batch_size].copy()  # Copy batch data to avoid SettingWithCopyWarning
        batch_summaries = []

        for index, row in batch.iterrows():
            gene_id = str(row['GeneID'])
            summary = scrape_gene_summary(gene_id)
            batch_summaries.append(summary)
            
            # Sleep between requests to avoid overloading the server or being blocked
            time.sleep(random.uniform(sleep_between_requests, sleep_between_requests + 0.5))  # Randomized sleep (1.5 - 2 seconds with the defaults)

        # Add summaries to the batch and append to CSV
        batch['Summary'] = batch_summaries
        batch[['GeneID', 'Symbol', 'description', 'Summary']].to_csv(output_file_path, mode='a', header=(start == 0), index=False)
        
        # Print progress and sleep between batches
        print(f"Processed batch from {start} to {min(start + batch_size, len(gene_data))}. Taking a break.")
        time.sleep(sleep_between_batches)

# Main function for 39k+ genes
def main():
    # Assumes filtered_gene_info_df was created in Step 2; the Summary column is added per batch in process_in_batches

    # Set the batch size and delay parameters
    batch_size = 500
    sleep_between_requests = 1.5  # 1.5 seconds between requests
    sleep_between_batches = 30    # 30 seconds between batches

    # Process the gene data in batches and scrape the summaries
    process_in_batches(filtered_gene_info_df[['GeneID', 'Symbol', 'description']], batch_size=batch_size, sleep_between_requests=sleep_between_requests, sleep_between_batches=sleep_between_batches)

    # Final message when all batches are done
    print(f"Final summaries saved to {output_file_path}")

if __name__ == "__main__":
    main()
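
Picking the tenth <dd> by position is fragile, because the number of <dd> tags can differ from one gene page to the next. One possible alternative is to look the summary up by its label instead. The sketch below assumes the page marks the summary with a <dt> whose text is exactly "Summary", followed by the matching <dd>; that is an assumption about NCBI's markup and should be checked against a live page. extract_summary and the sample HTML are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup


def extract_summary(html):
    """Return the text of the <dd> that follows a <dt> labelled 'Summary', else 'NA'.

    Assumption: the summary is labelled with a <dt> reading exactly 'Summary'.
    """
    soup = BeautifulSoup(html, "html.parser")
    for dt in soup.find_all("dt"):
        if dt.get_text(strip=True) == "Summary":
            dd = dt.find_next_sibling("dd")
            if dd is not None:
                return dd.get_text(" ", strip=True)
    return "NA"


# Hand-written HTML fragments, not copied from NCBI.
sample_html = """
<dl>
  <dt>Official Symbol</dt><dd>A1BG</dd>
  <dt>Summary</dt><dd>The protein encoded by this gene is a plasma glycoprotein.</dd>
  <dt>Expression</dt><dd>Biased expression in liver</dd>
</dl>
"""
no_summary_html = "<dl><dt>Official Symbol</dt><dd>A2MP1</dd></dl>"

with_summary = extract_summary(sample_html)
without_summary = extract_summary(no_summary_html)
print(with_summary)
print(without_summary)
```

With a lookup like this, a page that has no summary returns "NA" instead of whatever text happens to sit in the tenth <dd>.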

## Step 4: Scraped Data Cleaning

We need to check whether the web scraping was successful and address any inconsistencies.

In [None]:
df_scraped = pd.read_csv("D:\\MarkhamLab\\GeneScraping\\ncbi_summaries.csv")
df_scraped

Unnamed: 0,GeneID,Symbol,description,Summary
0,1,A1BG,alpha-1-B glycoprotein,The protein encoded by this gene is a plasma g...
1,2,A2M,alpha-2-macroglobulin,The protein encoded by this gene is a protease...
2,3,A2MP1,alpha-2-macroglobulin pseudogene 1,"Biased expression in adrenal (RPKM 1.1), splee..."
3,9,NAT1,N-acetyltransferase 1,This gene is one of two arylamine N-acetyltran...
4,10,NAT2,N-acetyltransferase 2,This gene encodes an enzyme that functions to ...
...,...,...,...,...
39928,8923209,ND1,NADH dehydrogenase subunit 1,NADH dehydrogenase subunit 1
39929,8923210,ND5,NADH dehydrogenase subunit 5,NADH dehydrogenase subunit 5
39930,8923211,ATP8,ATP synthase F0 subunit 8,ATP synthase F0 subunit 8
39931,8923212,ND2,NADH dehydrogenase subunit 2,NADH dehydrogenase subunit 2


It seems that not all summaries were extracted properly: for some genes, such as those in the last rows above, the 'Summary' value merely repeats the description. We need to replace these 'bad' summaries with NA. There are also values of:

"Try the new Gene table"

"Try the new Transcript table"

So, we will:

- Replace rows with specific known invalid phrases, including multi-line ones
- Replace rows where the summary is too short
- Replace rows that look like chromosomal positions or alphanumeric codes
- Remove 'See more' from the end of the summary if present
- Replace 'Summary' with 'NA' if it is the same as 'description'


In [None]:
# Replace invalid summaries based on conditions

# 1. Replace rows with specific known invalid phrases, including multi-line ones
invalid_phrases = [
    'Try the new Gene table\n\nTry the new Transcript table',
    'mouse', 'GenBank, FASTA, Sequence Viewer (Graphics)'
]
df_scraped['Summary'] = df_scraped['Summary'].apply(lambda x: 'NA' if any(phrase in str(x) for phrase in invalid_phrases) else x)

# 2. Replace rows where the summary is too short (e.g., less than 5 words)
df_scraped['Summary'] = df_scraped['Summary'].apply(lambda x: 'NA' if len(str(x).split()) < 5 else x)

# 3. Replace rows that look like chromosomal positions or alphanumeric codes (p or q arm, e.g. 12p13.31 or 19q13.43)
df_scraped['Summary'] = df_scraped['Summary'].apply(lambda x: 'NA' if re.match(r'^\w{1,3}[pPqQ]\d{1,2}\.\d+$', str(x)) else x)

# 4. Remove 'See more' from the end of the summary if present
df_scraped['Summary'] = df_scraped['Summary'].apply(lambda x: re.sub(r'\s*See more\s*$', '', str(x)))

# 5. Replace 'Summary' with 'NA' if it is the same as 'description'
df_scraped['Summary'] = df_scraped.apply(lambda row: 'NA' if row['Summary'] == row['description'] else row['Summary'], axis=1)

# Save the cleaned data to a new CSV
df_scraped.to_csv("D:\\MarkhamLab\\GeneScraping\\ncbi_summaries_cleaned.csv", index=False)

print("Cleaning complete. Cleaned data saved as 'ncbi_summaries_cleaned.csv'.")


Cleaning complete. Cleaned data saved as 'ncbi_summaries_cleaned.csv'.
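
For spot-checking individual values, the five rules can also be written as one plain function. This is a sketch, not the code that produced the cleaned file: clean_summary is a hypothetical helper, the phrase list is passed in by the caller (here only the two widget strings quoted above), and the location pattern accepts both p and q arms.

```python
import re

# Matches strings shaped like a cytogenetic location, e.g. 19q13.43 or 12p13.31.
LOCATION_PATTERN = re.compile(r"^\w{1,3}[pPqQ]\d{1,2}\.\d+$")


def clean_summary(summary, description, invalid_phrases, min_words=5):
    """Hypothetical helper: apply the five cleaning rules to one summary string."""
    text = str(summary)
    # 1. Known invalid phrases anywhere in the text.
    if any(phrase in text for phrase in invalid_phrases):
        return "NA"
    # 2. Too short to be a real summary.
    if len(text.split()) < min_words:
        return "NA"
    # 3. Looks like a chromosomal position (single tokens are already caught by rule 2).
    if LOCATION_PATTERN.match(text):
        return "NA"
    # 4. Drop a trailing 'See more'.
    text = re.sub(r"\s*See more\s*$", "", text)
    # 5. Summary that only repeats the description.
    if text == str(description):
        return "NA"
    return text


phrases = ["Try the new Gene table", "Try the new Transcript table"]

widget = clean_summary(
    "Try the new Gene table\n\nTry the new Transcript table", "some gene", phrases
)
location = clean_summary("19q13.43", "alpha-1-B glycoprotein", phrases)
repeated = clean_summary("ATP synthase F0 subunit 8", "ATP synthase F0 subunit 8", phrases)
valid = clean_summary(
    "The protein encoded by this gene is a plasma glycoprotein. See more",
    "alpha-1-B glycoprotein",
    phrases,
)
print(widget, location, repeated, valid, sep="\n")
```

The first three calls return "NA", and the last returns the summary with the trailing 'See more' removed.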
