## Phase 1: Understanding and Analyzing the Data

### Reading and Understanding the Data
- **Explanation:**  
  - We load the training and testing data with pandas so we can inspect their structure.  
  - `pd.read_csv` reads the feature tables (`train_feats.csv`, `test_feats.csv`), the mutation tables (`train_muts_data.csv`, `test_muts_data.csv`), and the testing methylation table (`test_meth_data.csv`).  

In [1]:
import pandas as pd

# Load training features
train_feats = pd.read_csv('train_feats.csv')

# Load testing features
test_feats = pd.read_csv('test_feats.csv')

# Load training mutation data
train_muts = pd.read_csv('train_muts_data.csv')

# Load testing mutation data
test_muts = pd.read_csv('test_muts_data.csv')

# Load testing methylation data
test_meth = pd.read_csv('test_meth_data.csv')
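
Loading the files does not by itself show their structure. A minimal sketch of a structural check follows, using tiny stand-in frames rather than the real CSV files. `column_difference` is a hypothetical helper introduced here for illustration; it is not part of pandas or of this project.

```python
import pandas as pd


def column_difference(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Return the columns that appear in only one of the two frames."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
    }


# Stand-in frames (made-up values) that mimic the layout of the feature tables.
demo_train = pd.DataFrame(
    {"case_id": ["a", "b"], "Label": [1.0, 2.0], "Mutations_in_ABL1": [0, 1]}
)
demo_test = pd.DataFrame({"case_id": ["c"], "Mutations_in_ABL1": [2]})

print(demo_train.shape, demo_test.shape)  # (rows, columns) of each frame
print(demo_train.dtypes)                  # data type of every column
print(column_difference(demo_train, demo_test))
```

The same call on `train_feats` and `test_feats` would list any columns the two tables do not share; the testing table is expected to lack `Label`.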

### Exploring the Data
- **Explanation:**  
  - We examine the distribution of labels in the training data with `value_counts()` to check whether the classes are balanced.  
  - We count missing values per column with `isnull().sum()` in both the training and testing features to check data quality.  

In [2]:
# Distribution of labels in training data
print("Label Distribution:")
print(train_feats['Label'].value_counts())

# Check for missing values in training data
print("Missing Values in Training Data:")
print(train_feats.isnull().sum())

# Check for missing values in testing data
print("Missing Values in Testing Data:")
print(test_feats.isnull().sum())

Label Distribution:
Label
2.0    408
1.0    397
Name: count, dtype: int64
Missing Values in Training Data:
case_id               0
Label                 0
Mutations_in_ABL1     0
Mutations_in_AKT1     0
Mutations_in_ALK      0
                     ..
Mutations_in_TSC1     0
Mutations_in_TSC2     0
Mutations_in_U2AF1    0
Mutations_in_VHL      0
Mutations_in_WT1      0
Length: 102, dtype: int64
Missing Values in Testing Data:
case_id               0
Mutations_in_ABL1     0
Mutations_in_AKT1     0
Mutations_in_ALK      0
Mutations_in_APC      0
                     ..
Mutations_in_TSC1     0
Mutations_in_TSC2     0
Mutations_in_U2AF1    0
Mutations_in_VHL      0
Mutations_in_WT1      0
Length: 101, dtype: int64
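
The label counts (408 vs. 397 out of 805) correspond to roughly 50.7% and 49.3%, so the two classes are close to balanced. Because pandas truncates the per-column listing, a more compact check is to report class proportions and only the columns that actually contain missing values. The sketch below runs on a small stand-in frame with made-up values; `label_proportions` and `columns_with_missing` are hypothetical helper names.

```python
import pandas as pd


def label_proportions(labels: pd.Series) -> pd.Series:
    """Share of each class, as fractions that sum to 1."""
    return labels.value_counts(normalize=True)


def columns_with_missing(df: pd.DataFrame) -> pd.Series:
    """Missing-value counts, restricted to columns that have at least one."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# Stand-in frame: four cases, one missing mutation count.
demo = pd.DataFrame(
    {"Label": [1.0, 2.0, 2.0, 1.0], "Mutations_in_ABL1": [0, None, 1, 0]}
)

print(label_proportions(demo["Label"]))
print(columns_with_missing(demo))
```

An empty result from `columns_with_missing(train_feats)` would confirm that no column has missing values, including those hidden by the `..` in the printed output.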


### Analyzing Mutations
- **Explanation:**  
  - We extract the sequence flanking each mutation with a helper imported from the local `extract_sequences.py` module.  
  - We count mutation types (e.g., missense or nonsense) with `value_counts()` to see which kinds of variants dominate.  


In [3]:
# from extract_sequences import extract_seqs

# # Extract sequences for mutations in training data
# train_sequences = extract_seqs('100_genes.csv', train_muts)

# # Analyze the distribution of mutation types
# print("Mutation Type Distribution:")
# print(train_muts['mutation_type'].value_counts())
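
The mutation-type count can be sketched on its own, without the sequence-extraction helper. The column name `mutation_type` is taken from the commented-out cell and is an assumption about `train_muts_data.csv`. `mutation_type_counts` is a hypothetical helper, and the demo rows are made up for illustration.

```python
import pandas as pd


def mutation_type_counts(muts: pd.DataFrame, column: str = "mutation_type") -> pd.Series:
    """Count rows per mutation type, keeping missing types as their own entry."""
    if column not in muts.columns:
        # Fail with the available column names so the right one is easy to spot.
        raise KeyError(f"{column!r} not found; available columns: {list(muts.columns)}")
    return muts[column].value_counts(dropna=False)


# Stand-in mutation table (made-up rows).
demo_muts = pd.DataFrame(
    {
        "gene_name": ["ABL1", "AKT1", "ALK", "APC"],
        "mutation_type": ["Missense", "Nonsense", "Missense", None],
    }
)

print("Mutation Type Distribution:")
print(mutation_type_counts(demo_muts))
```

Passing `dropna=False` keeps rows with no recorded type visible, which is useful when judging how complete the annotation is.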

In [None]:
import pandas as pd
from extract_sequences import extract_flanked_region

# Load mutation data
train_muts = pd.read_csv('train_muts_data.csv')  # Adjust path if needed

# Load gene data (assuming '100_genes.csv' contains sequences)
genes_df = pd.read_csv('100_genes.csv')  # Adjust path if needed
# Example: Create a dictionary mapping gene names to sequences
# Adjust column names based on your '100_genes.csv' structure
gene_sequences = dict(zip(genes_df['gene_name'], genes_df['sequence']))

# Define flanking region sizes
upstream = 10    # Number of bases upstream
downstream = 10  # Number of bases downstream

# Extract sequences for each mutation
train_sequences = []
for index, row in train_muts.iterrows():
    gene = row['gene_name']  # Adjust column name if different
    chrom_seq_str = gene_sequences.get(gene, '')  # Get sequence for the gene
    if not chrom_seq_str:
        print(f"Warning: No sequence found for gene {gene}")
        continue
    strand = row['strand']    # Adjust column name if different
    start = row['position']   # Adjust column name if different
    stop = row['position']    # Assuming single-base mutations; adjust if start/stop differ
    try:
        sequence = extract_flanked_region(chrom_seq_str, strand, start, stop, upstream, downstream)
        train_sequences.append(sequence)
    except ValueError as e:
        print(f"Error processing mutation at index {index}: {e}")
        continue

# Report how many sequences were extracted and preview the first few
print(f"Extracted {len(train_sequences)} sequences")
print(train_sequences[:5])

ImportError: cannot import name 'extract_flanked_region' from 'extract_sequences' (consider renaming '/home/okal/Zone/Gigs/Computational-Genomics/extract_sequences.py' if it has the same name as a library you intended to import)
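
The `ImportError` means that `extract_sequences.py` does not define a name called `extract_flanked_region`. Running `dir(extract_sequences)` after `import extract_sequences` would show which names the module actually exports. Since the module's contents are not shown here, the following is only a sketch of what such a helper could look like, not the project's implementation. It assumes 1-based inclusive coordinates, a strand given as `'+'` or `'-'`, reverse-complementing on the minus strand, and flanks clipped at the sequence ends. The name `flanked_region_sketch` is hypothetical.

```python
# Hypothetical stand-in; this is NOT the contents of extract_sequences.py.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def flanked_region_sketch(seq, strand, start, stop, upstream, downstream):
    """Return seq[start..stop] plus flanking bases, oriented along `strand`.

    Assumptions: `start` and `stop` are 1-based and inclusive; `upstream` and
    `downstream` are counted relative to the strand; flanks that run past
    either end of `seq` are clipped instead of raising.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    if start < 1 or stop < start or stop > len(seq):
        raise ValueError(f"invalid interval {start}-{stop} for length {len(seq)}")

    # On the minus strand, "upstream" lies to the right in forward coordinates.
    left, right = (upstream, downstream) if strand == "+" else (downstream, upstream)
    begin = max(start - 1 - left, 0)   # convert to a 0-based slice start
    end = min(stop + right, len(seq))  # slice end is exclusive
    region = seq[begin:end]
    return region if strand == "+" else reverse_complement(region)


demo_seq = "AAACCCGGGTTT"
print(flanked_region_sketch(demo_seq, "+", 5, 5, 2, 3))  # ACCCGG
print(flanked_region_sketch(demo_seq, "-", 5, 5, 2, 3))  # CGGGTT
```

The argument order matches the call in the cell above, so the loop would work unchanged if the real helper follows the same conventions. Whether it uses 0-based or 1-based positions should be confirmed against the module itself.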