### Download and Extract Gene IDs from WormBase

This code automates the process of downloading and extracting gene IDs from WormBase. 

#### `download_and_extract_gene_ids(wormbase_version, output_dir)`
This function downloads a gene IDs file from WormBase for a specified release version, unzips the `.gz` file, and removes the original compressed file after extraction.

- **Behavior**:
  - Constructs the appropriate URL for the gene IDs file based on the provided WormBase version.
  - Downloads the `.gz` file to the specified output directory.
  - Unzips the downloaded `.gz` file and saves the uncompressed version in the same directory.
  - Deletes the original `.gz` file after extraction is complete.
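
The extract-and-delete steps can be tried without any download. The sketch below is a minimal illustration, not the notebook's own code: `gunzip_file` is a hypothetical helper and the file contents are made up. It derives the output name by slicing off the `.gz` suffix, because `str.rstrip(".gz")` removes any trailing `.`, `g`, or `z` characters rather than the suffix itself.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_file(gz_path):
    """Decompress gz_path next to itself, delete the archive, return the new path.

    Hypothetical helper used only for this illustration.
    """
    if not gz_path.endswith(".gz"):
        raise ValueError(f"Expected a .gz file, got: {gz_path}")
    out_path = gz_path[:-len(".gz")]  # drop exactly the ".gz" suffix
    with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    os.remove(gz_path)
    return out_path


# Build a small throwaway archive so nothing is fetched from the network
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.geneIDs.txt.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(b"made-up line\n")

txt_path = gunzip_file(gz_path)
print(os.path.basename(txt_path))   # sample.geneIDs.txt
print(os.path.exists(gz_path))      # False

# Why slicing is safer than rstrip for suffix removal
print("zigzag.gz".rstrip(".gz"))    # zigza
print("zigzag.gz"[:-len(".gz")])    # zigzag
```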


#### Example Execution:
The final cell calls `download_and_extract_gene_ids()` with the following parameters:
- **WormBase version**: `"WS293"`
- **Output directory**: `"./wormbase_data"`

The gene IDs file name, `c_elegans.PRJNA13758.WS293.geneIDs.txt.gz`, is not a parameter; the function derives it from the WormBase version.

This will download the gene IDs file for *C. elegans* (WormBase version `WS293`) and place the unzipped file in the `wormbase_data` directory.

In [2]:
import os
import requests
import gzip
import shutil

def download_url(file_url, output_file_path):
    """Stream file_url to output_file_path; raise if the server returns an error status."""
    response = requests.get(file_url, stream=True)
    # Stop here on 4xx/5xx so callers never try to read a file that was not written
    response.raise_for_status()
    with open(output_file_path, 'wb') as f:
        shutil.copyfileobj(response.raw, f)
    print(f"Downloaded: {output_file_path}")

    
def download_and_extract_gene_ids(wormbase_version, output_dir):
    gene_ids = f"c_elegans.PRJNA13758.{wormbase_version}.geneIDs.txt.gz"

    base_url = f"https://downloads.wormbase.org/releases/{wormbase_version}/species/c_elegans/PRJNA13758"
    file_url = f"{base_url}/annotation/{gene_ids}"

    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Download the file
    output_file_path = os.path.join(output_dir, gene_ids)
    download_url(file_url, output_file_path)

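    # Unzip the file; slice off ".gz" (str.rstrip strips characters, not a suffix)
    extracted_file_path = output_file_path[:-len(".gz")]
    with gzip.open(output_file_path, 'rb') as f_in:
        with open(extracted_file_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)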

    # Remove the .gz file if it exists
    if os.path.exists(output_file_path):
        os.remove(output_file_path)
        print(f"Removed: {output_file_path}")
    else:
        print(f"{output_file_path} does not exist")
        

In [3]:
# Define variables
wormbase_version = "WS293"
output_dir = "./wormbase_data"

download_and_extract_gene_ids(wormbase_version, output_dir)

Downloaded: ./wormbase_data/c_elegans.PRJNA13758.WS293.geneIDs.txt.gz
Removed: ./wormbase_data/c_elegans.PRJNA13758.WS293.geneIDs.txt.gz


### Process and Filter Gene IDs CSV File

This script processes a CSV file containing gene IDs by filtering and selecting relevant columns. It generates an output CSV file with a clean structure and appropriate headers.

#### Function: `process_gene_ids(wormbase_version, output_dir)`
This function reads the extracted gene IDs file (comma-separated, no header row), keeps only the rows whose gene status is "Live", selects the relevant columns, and saves the processed data to a new CSV file.

#### Key Steps:
1. **Generate Output File Name**:
   - The output file name is created by removing the last three characters (`txt`) from the input file name and appending `csv`, so `...geneIDs.txt` becomes `...geneIDs.csv`.

2. **Load CSV File**:
   - The input file is loaded into a pandas DataFrame without headers (`header=None`) since the input file does not contain any column names.

3. **Filter Rows**:
   - The DataFrame is filtered to include only the rows where the value in the 5th column (index `4`) is `'Live'`.

4. **Select Relevant Columns**:
   - The function selects the 2nd, 3rd, 4th, and 6th columns (which correspond to index `1, 2, 3, 5`).

5. **Add Column Headers**:
   - Appropriate column headers are added for the output DataFrame:
     - `Wormbase_Id`: The identifier from WormBase.
     - `Gene_name`: The name of the gene.
     - `Sequence_id`: The ID of the sequence.
     - `Gene_Type`: The type of gene.

6. **Save Processed Data**:
   - The resulting DataFrame is saved as a CSV file with the generated file name, and the rows are saved without the index.
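
The filter-and-select steps can be checked on a few in-memory rows before running them on the full file. This is a minimal sketch under stated assumptions: the three rows are invented for illustration (they are not taken from WormBase) and only mimic the six-column, header-less layout described in the steps.

```python
import io

import pandas as pd

# Invented rows in the same six-column layout; not real WormBase records
sample = io.StringIO(
    "6239,WBGene00000001,abc-1,A1.1,Live,protein_coding_gene\n"
    "6239,WBGene00000002,def-2,B2.2,Dead,pseudogene\n"
    "6239,WBGene00000003,xyz-3,C3.3,Live,ncRNA_gene\n"
)

# Load without headers, so columns are labelled 0..5
df = pd.read_csv(sample, header=None)

# Keep rows whose 5th column (index 4) is 'Live'
live = df[df[4] == "Live"]

# Select the 2nd, 3rd, 4th, and 6th columns and name them
selected = live[[1, 2, 3, 5]].copy()
selected.columns = ["Wormbase_Id", "Gene_name", "Sequence_id", "Gene_Type"]

print(selected.to_csv(index=False))
```

The `Dead` row is dropped and the two remaining rows are written with the four named columns.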


#### Example Execution:
- The function builds the input path from `output_dir` and the WormBase version (the downloaded file name without `.gz`), then writes the filtered CSV next to it, ready for further analysis.

In [4]:
import pandas as pd

def process_gene_ids(wormbase_version, output_dir):
    """
    Processes the extracted gene IDs file by keeping rows where the 5th column is 'Live'
    and extracting columns 2, 3, 4, and 6. Adds a header to the output file.

    The output file name is generated by replacing the trailing 'txt' of the input
    file name with 'csv'.

    Args:
        wormbase_version (str): WormBase release, e.g. "WS293".
        output_dir (str): Directory that contains the extracted gene IDs file.
    """
    gene_ids = f"c_elegans.PRJNA13758.{wormbase_version}.geneIDs.txt.gz"
    # Slice off ".gz"; str.rstrip would strip characters, not the suffix
    input_file = f"{output_dir}/{gene_ids[:-len('.gz')]}"


    # Generate the output file name
    output_file = f"{input_file[:-3]}csv"

    # Load the input CSV file into a DataFrame
    df = pd.read_csv(input_file, header=None)

    # Filter rows where the 5th column equals 'Live'
    df_filtered = df[df[4] == 'Live']

    # Select the required columns (2nd, 3rd, 4th, and 6th)
    # .copy() makes df_selected an independent frame before its columns are renamed
    df_selected = df_filtered[[1, 2, 3, 5]].copy()

    # Add the appropriate column headers
    df_selected.columns = ["Wormbase_Id", "Gene_name", "Sequence_id", "Gene_Type"]

    # Save the result to the output file
    df_selected.to_csv(output_file, index=False)

    print(f"Processed file saved to: {output_file}")




In [5]:
# Example usage
output_dir = "./wormbase_data"
wormbase_version = "WS293"
process_gene_ids(wormbase_version, output_dir)

Processed file saved to: ./wormbase_data/c_elegans.PRJNA13758.WS293.geneIDs.csv


### Summarize Gene IDs by Type

In [13]:
gene_ids_df = pd.read_csv('./wormbase_data/c_elegans.PRJNA13758.WS293.geneIDs.csv')
# Number of live genes per gene type, most frequent first
gene_type_counts = gene_ids_df["Gene_Type"].value_counts()
print(gene_type_counts)

Gene_Type
protein_coding_gene      19983
piRNA_gene               15363
ncRNA_gene                8487
pseudogene                2131
gene                      1523
tRNA_gene                  634
snoRNA_gene                346
miRNA_gene                 261
lincRNA_gene               193
snRNA_gene                 129
antisense_lncRNA_gene      100
rRNA_gene                   22
scRNA_gene                   1
Name: count, dtype: int64
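
Raw counts can also be expressed as percentages of all genes. The sketch below is one way to do that with `value_counts(normalize=True)`; it runs on a small made-up Series standing in for `gene_ids_df["Gene_Type"]`, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up stand-in for gene_ids_df["Gene_Type"]; values are illustrative only
gene_types = pd.Series(
    ["protein_coding_gene"] * 3 + ["piRNA_gene"] * 2 + ["pseudogene"]
)

# normalize=True returns fractions that sum to 1; scale to percentages
shares = gene_types.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

On the real data, replacing `gene_types` with `gene_ids_df["Gene_Type"]` would give each gene type's share of the live genes.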


### Download WormCat CSV File

This code downloads a WormCat CSV file from wormcat.com and saves it to a designated output directory.
It creates the directory if it does not exist and skips the download when the file is already present.


#### Example Execution:
- The function downloads the file `whole_genome_v2_nov-11-2021.csv` from the WormCat website and saves it to the `./wormbase_data` directory (or any other specified directory).

In [6]:
import os

def download_wormcat_csv(output_dir="./"):
    url = "http://www.wormcat.com/static/download/whole_genome_v2_nov-11-2021.csv"
    output_filename = url.split("/")[-1]  # Get the filename from the URL

    os.makedirs(output_dir, exist_ok=True)
    output_file_path = os.path.join(output_dir, output_filename)

    if os.path.exists(output_file_path):
        print(f"File already exists: {output_file_path}. Skipping download.")
        return
    
    download_url(url, output_file_path)
    print(f"File downloaded to: {output_file_path}")



In [7]:
download_wormcat_csv("./wormbase_data")

File already exists: ./wormbase_data/whole_genome_v2_nov-11-2021.csv. Skipping download.


# Appendix