# Create a BED file from a large variant table
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Eitan177/EitanAmrom/blob/main/makingbedfilefromlargetableofvariants.ipynb)

1. **Install and Import Libraries**:
    ```python
    !pip install pybiomart

    import pandas as pd
    import pybiomart
    ```
    - Installs the `pybiomart` library.
    - Imports `pandas` for data manipulation and `pybiomart` for querying the BioMart database.

2. **Read Data**:
    ```python
    combined_df = pd.read_csv('path to combined_diseases.tsv', sep='\t')
    ```
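    - Reads the tab-separated file into a DataFrame named `combined_df`. Replace the placeholder path with the location of your file; the later steps expect `Filename` and `Disease` columns.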

3. **Count Diseases**:
    ```python
    disease_counts = []
    for filename in combined_df['Filename'].unique():
        subset = combined_df[(combined_df['Filename'] == filename) & (combined_df['Disease'] != '-')]
        counts = subset['Disease'].value_counts()
        disease_counts.append(counts)
    ```
    - For each unique filename, counts how often each disease occurs, skipping rows where `Disease` is `-`, and appends the resulting `value_counts()` Series to a list.

4. **Create Dictionary of Disease Counts**:
    ```python
    # unique() preserves first-appearance order, matching the order of disease_counts
    filename_disease_counts = dict(zip(combined_df['Filename'].unique(), disease_counts))
    ```
    - Creates a dictionary where the keys are filenames and the values are the corresponding disease counts.

5. **Add Disease Counts to DataFrame**:
    ```python
    combined_df['DiseaseCounts'] = combined_df['Filename'].map(filename_disease_counts)
    combined_df
    ```
    - Adds a new column `DiseaseCounts` to `combined_df` by mapping the filenames to their disease counts.

6. **Drop Duplicates and Rename Column**:
    ```python
    dd_df = combined_df.drop_duplicates(subset=['Filename'])
    dd_df = dd_df.rename(columns={'Filename': 'Gene name'})
    ```
    - Keeps the first row for each `Filename` and renames the `Filename` column to `Gene name`, so that it matches the gene-name column returned by BioMart in the merge below.

7. **Query BioMart Database**:
    ```python
    server = pybiomart.Server(host='http://www.ensembl.org')
    dataset = server.marts['ENSEMBL_MART_ENSEMBL'].datasets['hsapiens_gene_ensembl']
    genetable = dataset.query(attributes=['external_gene_name', 'chromosome_name', 'start_position', 'end_position'])
    ```
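    - Initializes the BioMart server and the human gene dataset, then queries it for each gene's name, chromosome, start, and end. By default `pybiomart` labels the result columns with display names (`Gene name`, `Chromosome/scaffold name`, `Gene start (bp)`, `Gene end (bp)`), which is why the later steps use those names.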

8. **Merge DataFrames**:
    ```python
    merged_df = pd.merge(dd_df, genetable, on='Gene name', how='inner')
    merged_df = merged_df.drop(columns=['Disease'])
    ```
    - Merges `dd_df` with `genetable` on the `Gene name` column using an inner join, so genes without a BioMart match are dropped, then removes the `Disease` column.

9. **Filter and Format Chromosome Names**:
    ```python
    valid_chromosomes = [str(i) for i in range(1, 23)] + ['X', 'Y']
    # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
    merged_df = merged_df[merged_df['Chromosome/scaffold name'].isin(valid_chromosomes)].copy()
    merged_df['Chromosome/scaffold name'] = 'chr' + merged_df['Chromosome/scaffold name'].astype(str)
    ```
    - Keeps only genes on chromosomes 1 to 22, X, and Y (dropping the mitochondrial genome and unplaced scaffolds) and prefixes each chromosome name with `chr`.

10. **Create BED DataFrame**:
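    ```python
    bed_df = pd.DataFrame()
    bed_df['chr'] = merged_df['Chromosome/scaffold name']
    # Ensembl starts are 1-based; BED starts are 0-based, so subtract 1
    bed_df['start'] = merged_df['Gene start (bp)'].astype(int) - 1
    bed_df['end'] = merged_df['Gene end (bp)'].astype(int)
    bed_df['name'] = merged_df['Gene name']
    ```
    - Creates a new DataFrame `bed_df` with columns for chromosome, start position, end position, and gene name. Ensembl coordinates are 1-based and inclusive, whereas BED intervals are 0-based and half-open, so 1 is subtracted from the start and the end is left unchanged.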

11. **Define Helper Functions**:
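    ```python
    def collapse_dict(d):
        # d is the Series of counts for one gene; items() yields (disease, count) pairs
        return ';'.join([f"{k}:{v}" for k, v in d.items()])

    def sum_counts(d):
        # Total number of disease annotations for one gene
        return sum(d.tolist())
    ```
    - Defines two helper functions that operate on the per-gene `value_counts()` Series: `collapse_dict` joins the counts into a single `disease:count;disease:count` string, and `sum_counts` returns their total.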

12. **Add Name and Score Columns**:
    ```python
    bed_df['name'] = merged_df['Gene name'] + '-' + merged_df['DiseaseCounts'].apply(collapse_dict)
    bed_df['score'] = merged_df['DiseaseCounts'].apply(sum_counts)
    ```
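    - Replaces the `name` column from step 10 with the gene name followed by `-` and the collapsed disease counts.
    - Adds a `score` column to `bed_df` holding the total of the disease counts. The BED format defines the score as an integer from 0 to 1000, so larger totals may need capping for strict consumers.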

13. **Save to BED File**:
    ```python
    bed_df.to_csv('output.bed', sep='\t', index=False, header=False)
    ```
    - Saves the `bed_df` DataFrame to a BED file named `output.bed`.
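
Steps 3 to 5 and the helpers in step 11 can also be tried out on a small table before running them on the full file. The following is a minimal sketch, not the notebook's own method: it swaps the explicit loop for a `groupby`, the rows and gene names are invented for illustration, and `collapse_counts` is a hypothetical stand-in for `collapse_dict`.

```python
import pandas as pd

# Toy stand-in for combined_df; every value here is invented for illustration.
toy_df = pd.DataFrame({
    "Filename": ["GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_B"],
    "Disease": ["disease_x", "disease_x", "-", "disease_y", "-"],
})

# Drop placeholder rows, then count each disease within each gene.
known = toy_df[toy_df["Disease"] != "-"]
counts_by_gene = {
    gene: group["Disease"].value_counts()
    for gene, group in known.groupby("Filename")
}

def collapse_counts(counts):
    """Render a Series of counts as 'disease:count;disease:count'."""
    return ";".join(f"{disease}:{count}" for disease, count in counts.items())

# Build the BED name and score for each gene.
names = {gene: f"{gene}-{collapse_counts(c)}" for gene, c in counts_by_gene.items()}
scores = {gene: int(c.sum()) for gene, c in counts_by_gene.items()}
print(names)
print(scores)
```

One difference from the loop in step 3: a gene whose rows all have `-` as the disease does not appear in `counts_by_gene` at all, whereas the loop stores an empty Series for it.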

This code processes a TSV file containing disease data, merges it with gene information from BioMart, and saves the result as a BED file.
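
Because the BioMart query needs network access, it can help to rehearse the chromosome filter and the BED column layout offline first. The sketch below assumes a hand-made gene table with the same column names that the walkthrough uses; the gene names, scaffold name, and coordinates are invented. It also shows the usual conversion from Ensembl's 1-based inclusive coordinates to BED's 0-based half-open intervals.

```python
import pandas as pd

# Hypothetical gene table shaped like the BioMart result; values are invented.
genes = pd.DataFrame({
    "Gene name": ["GENE_A", "GENE_B", "GENE_C"],
    "Chromosome/scaffold name": ["1", "X", "SCAFFOLD_1"],
    "Gene start (bp)": [1000, 500, 200],
    "Gene end (bp)": [2000, 900, 300],
})

# Keep only chromosomes 1-22, X and Y; the scaffold row is dropped.
valid_chromosomes = [str(i) for i in range(1, 23)] + ["X", "Y"]
genes = genes[genes["Chromosome/scaffold name"].isin(valid_chromosomes)].copy()

bed = pd.DataFrame({
    "chr": "chr" + genes["Chromosome/scaffold name"],
    "start": genes["Gene start (bp)"] - 1,  # 1-based inclusive -> 0-based
    "end": genes["Gene end (bp)"],          # unchanged for a half-open end
    "name": genes["Gene name"],
})

# Print the rows as they would appear in a tab-separated BED file.
print(bed.to_csv(sep="\t", index=False, header=False))
```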

In [None]:
!pip install pybiomart

import pandas as pd
import pybiomart

combined_df = pd.read_csv('path to combined_diseases.tsv', sep='\t')

disease_counts = []
for filename in combined_df['Filename'].unique():
    subset = combined_df[(combined_df['Filename'] == filename) & (combined_df['Disease'] != '-')]
    counts = subset['Disease'].value_counts()
    disease_counts.append(counts)

# Create a dictionary to store the counts for each filename
# (unique() preserves first-appearance order, matching disease_counts)
filename_disease_counts = dict(zip(combined_df['Filename'].unique(), disease_counts))


# Create a new column in the combined_df with disease counts
combined_df['DiseaseCounts'] = combined_df['Filename'].map(filename_disease_counts)
combined_df

dd_df = combined_df.drop_duplicates(subset=['Filename'])

dd_df = dd_df.rename(columns={'Filename': 'Gene name'})



# Initialize the biomart server and dataset
server = pybiomart.Server(host='http://www.ensembl.org')
dataset = server.marts['ENSEMBL_MART_ENSEMBL'].datasets['hsapiens_gene_ensembl']

# Query gene name, chromosome, start and end for every human gene
genetable = dataset.query(attributes=['external_gene_name', 'chromosome_name', 'start_position', 'end_position'])


# Merge the two DataFrames based on the common column
merged_df = pd.merge(dd_df, genetable, on='Gene name', how='inner') # Use 'inner' or other join types as needed

merged_df = merged_df.drop(columns=['Disease'])


# Create a list of valid chromosome names
valid_chromosomes = [str(i) for i in range(1, 23)] + ['X', 'Y']

# Filter the DataFrame based on valid chromosomes
merged_df = merged_df[merged_df['Chromosome/scaffold name'].isin(valid_chromosomes)].copy()  # .copy() avoids SettingWithCopyWarning

# Add 'chr' to the beginning of the chromosome names
merged_df['Chromosome/scaffold name'] = 'chr' + merged_df['Chromosome/scaffold name'].astype(str)



bed_df = pd.DataFrame()
bed_df['chr'] = merged_df['Chromosome/scaffold name']
bed_df['start'] = merged_df['Gene start (bp)'].astype(int) - 1  # Ensembl starts are 1-based; BED starts are 0-based
bed_df['end'] = merged_df['Gene end (bp)'].astype(int)
bed_df['name'] = merged_df['Gene name']


# Function to collapse the per-gene counts into a single 'disease:count;...' string
def collapse_dict(d):
    return ';'.join([f"{k}:{v}" for k, v in d.items()])


# Function to calculate the total of the per-gene counts
def sum_counts(d):
    return sum(d.tolist())


bed_df['name'] = merged_df['Gene name'] + '-' + merged_df['DiseaseCounts'].apply(collapse_dict)
bed_df['score'] = merged_df['DiseaseCounts'].apply(sum_counts)


#Save to a bed file
bed_df.to_csv('output.bed', sep='\t', index=False, header=False)
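
Once the file is written, a quick sanity check can catch malformed intervals before the BED file is handed to other tools. This is a sketch under assumptions: `check_bed` is a hypothetical helper, not part of the notebook, and the sample rows are invented. It checks the five columns written above against the BED conventions of a non-negative start, a start smaller than the end, and a score between 0 and 1000.

```python
import io
import pandas as pd

def check_bed(handle):
    """Hypothetical helper: list the problems found in a 5-column BED file."""
    bed = pd.read_csv(handle, sep="\t", header=None,
                      names=["chr", "start", "end", "name", "score"])
    problems = []
    if (bed["start"] < 0).any():
        problems.append("negative start")
    if (bed["start"] >= bed["end"]).any():
        problems.append("start is not less than end")
    if not bed["score"].between(0, 1000).all():
        problems.append("score outside 0-1000")
    return problems

# Invented sample: the second row has a score above the BED maximum.
sample = io.StringIO(
    "chr1\t999\t2000\tGENE_A-disease_x:2\t2\n"
    "chrX\t499\t900\tGENE_B-disease_y:1500\t1500\n"
)
print(check_bed(sample))
```

The helper accepts a path as well as an open handle, so `check_bed('output.bed')` would check the file written above.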
