
# GeneQuest™ Genetic Profiling System
Phoenix labs

### 1. **Dataset Preparation**  
- **Source**: Data is sourced from the [1000 Genomes Project](https://www.internationalgenome.org/data/).  
- **Preprocessing**:  
  - Download `.vcf` files containing genetic data.  
  - Use tools like **bcftools** or **PLINK** to extract SNPs of interest.  
  - Convert the processed data into `.csv` format with rows representing individuals and columns representing SNPs.  
  - Add labels (e.g., disease presence or absence) for supervised learning.  
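
The conversion step can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: it assumes an uncompressed, tab-separated VCF whose `FORMAT` column starts with `GT`, and it encodes each genotype as the count of non-reference alleles (0, 1, or 2), which may differ from the encoding used in `genetic_data.csv`. The function name `vcf_to_matrix`, the sample names, and the SNP IDs are hypothetical. The `Label` column would still have to be joined in from a separate phenotype source.

```python
def vcf_to_matrix(vcf_lines):
    """Parse VCF text lines into (sample_ids, snp_ids, matrix).

    matrix has one row per individual and one column per SNP; each cell is
    the number of non-reference alleles, or None for a missing call.
    """
    samples, snp_ids, per_snp = [], [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the 9 fixed columns
            continue
        # Use the ID column, falling back to CHROM:POS when it is missing
        snp_ids.append(fields[2] if fields[2] != "." else f"{fields[0]}:{fields[1]}")
        dosages = []
        for call in fields[9:]:
            alleles = call.split(":")[0].replace("|", "/").split("/")
            dosages.append(None if "." in alleles else sum(a != "0" for a in alleles))
        per_snp.append(dosages)
    # Transpose so that rows are individuals and columns are SNPs
    matrix = [list(row) for row in zip(*per_snp)]
    return samples, snp_ids, matrix


# Tiny made-up VCF with two samples and two SNPs
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_A\tsample_B\n"
    "1\t1000\tsnp_demo1\tA\tG\t.\tPASS\t.\tGT\t1|0\t0|1\n"
    "1\t2000\tsnp_demo2\tT\tC\t.\tPASS\t.\tGT\t1|1\t0|0\n"
).splitlines()

samples, snp_ids, matrix = vcf_to_matrix(example_vcf)
for sample, row in zip(samples, matrix):
    print(sample, dict(zip(snp_ids, row)))
```

For real 1000 Genomes files, which are large and compressed, bcftools or PLINK remain the practical choice for the extraction itself.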

**Example Dataset Format**:  
| SNP1 | SNP2 | SNP3 | ... | Label |  
|------|------|------|-----|-------|  
| 0.1  | 0.3  | 0.2  | ... | 1     |  
| 0.2  | 0.1  | 0.4  | ... | 0     |  

In [None]:
import pandas as pd

# Load the SNP matrix; every column except 'Label' is a feature
data = pd.read_csv('genetic_data.csv')
X = data.drop(columns=['Label']).values
y = data['Label'].values  # NumPy array of 0/1 labels

### 2. **Data Loading and Splitting**  
- Load the `.csv` file using **pandas**.  
- Split the dataset into training and validation sets using **train_test_split** from Scikit-learn.  


In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for validation; random_state fixes the shuffle
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


### 3. **Model Architecture**  
The system uses a feedforward neural network with the following layers:  
- Input layer: Matches the number of SNP features.  
- Hidden layers: Fully connected layers with ReLU activation and dropout for regularization.  
- Output layer: A single neuron with a sigmoid activation for binary classification.  

The model is trained with the **Adam optimizer** and **binary cross-entropy loss**.

In [None]:
from tensorflow.keras import layers, models

# Feedforward network: two hidden layers with dropout, sigmoid output
model = models.Sequential([
    layers.Input(shape=(X_train.shape[1],)),  # one input per SNP feature
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(1, activation='sigmoid')     # probability of the positive label
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train for 30 epochs, reporting validation metrics after each epoch
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=30,
    batch_size=64
)


### 4. **Hyperparameter Tuning**  
- **Keras Tuner** searches over the number of hidden units (64 to 256), the dropout rate (0.2 to 0.5), and the optimizer (Adam or SGD).  
- A random search of 10 trials selects the configuration with the highest validation accuracy; the model is then rebuilt with that configuration and retrained.

In [None]:
import keras_tuner as kt

def build_model(hp):
    model = models.Sequential()
    model.add(layers.Input(shape=(X_train.shape[1],)))
    model.add(layers.Dense(hp.Int('units', min_value=64, max_value=256, step=64), activation='relu'))
    model.add(layers.Dropout(hp.Float('dropout', min_value=0.2, max_value=0.5, step=0.1)))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(
        optimizer=hp.Choice('optimizer', ['adam', 'sgd']),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

tuner = kt.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=10,
    directory='tuner_results',
    project_name='genetic_profiling'
)

# Run the random search, then rebuild and retrain the best configuration
tuner.search(X_train, y_train, epochs=20, validation_data=(X_val, y_val))
best_hps = tuner.get_best_hyperparameters()[0]
model = tuner.hypermodel.build(best_hps)
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20)


### 5. **Model Training**  
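- The model is trained on the training split (80% of the data) and validated on the held-out 20%.  
- The initial run uses a batch size of 64 for 30 epochs; the tuning search and the retraining run use 20 epochs with the Keras default batch size of 32. An early stopping callback, if added, can end a run before the epoch limit.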
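
Early stopping is mentioned here, but none of the training cells pass a callback. The following is a minimal sketch of how it could be wired in, using Keras's `EarlyStopping` callback. The synthetic data, the small stand-in model, and the `patience=3` setting are assumptions chosen for illustration, not values taken from this project.

```python
import numpy as np
from tensorflow.keras import callbacks, layers, models

# Synthetic stand-in for the SNP matrix and labels (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.random((200, 10)).astype("float32")
y_demo = (X_demo[:, 0] > 0.5).astype("float32")

demo_model = models.Sequential([
    layers.Input(shape=(X_demo.shape[1],)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
demo_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when validation loss has not improved for 3 epochs (assumed patience)
# and roll the weights back to the best epoch seen
early_stop = callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

demo_history = demo_model.fit(
    X_demo, y_demo,
    validation_split=0.2,
    epochs=30,            # upper bound; the callback may stop sooner
    batch_size=64,
    callbacks=[early_stop],
    verbose=0,
)
epochs_run = len(demo_history.history["loss"])
print(f"Stopped after {epochs_run} epochs")
```

In the notebook itself, the same `callbacks=[early_stop]` argument could be passed to `model.fit` alongside the existing `validation_data`.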

### 6. **Model Evaluation**  
- After training, the model is evaluated on the validation set using metrics like:  
  - **Accuracy**  
  - **AUC-ROC**  
- These metrics help ensure the model generalizes well to unseen data.

In [None]:
loss, accuracy = model.evaluate(X_val, y_val)
print(f"Final Validation Accuracy: {accuracy:.2f}")
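
The cell above reports accuracy only. AUC-ROC, the second metric listed, could be computed from the predicted probabilities with scikit-learn's `roc_auc_score`. This is a sketch under that assumption; the helper name `validation_auc` and the toy values are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def validation_auc(y_true, y_prob):
    """Return AUC-ROC for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true).ravel()
    # Keras predict() returns shape (n, 1) for a single sigmoid output
    y_prob = np.asarray(y_prob).ravel()
    return roc_auc_score(y_true, y_prob)


# Toy labels and probabilities, shaped like Keras output
toy_labels = [0, 0, 1, 1]
toy_probs = [[0.1], [0.4], [0.35], [0.8]]
toy_auc = validation_auc(toy_labels, toy_probs)
print(f"Toy AUC-ROC: {toy_auc:.2f}")  # 3 of 4 positive/negative pairs are ranked correctly
```

With the notebook's variables, the call would be `validation_auc(y_val, model.predict(X_val))`.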


## Prerequisites  
- Python 3.8+  
- Required libraries: TensorFlow, Keras Tuner, Scikit-learn, Pandas, NumPy  

## Steps  
1. Preprocess the `.vcf` files to create a `.csv` dataset (`genetic_data.csv`).  
2. Run the notebook cells to load the data and split it into training and validation sets.  
3. Train the model using the provided architecture.  
4. Use hyperparameter tuning to optimize the model.  
5. Save the final model and evaluate its performance.
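
Step 5 mentions saving the final model, but no cell shows it. The following is a minimal sketch, assuming a recent TensorFlow release that supports the native `.keras` file format. The file name `genetic_profile_model.keras` and the small stand-in model are hypothetical; in the notebook the tuned `model` would be saved instead.

```python
import os
import tempfile

import numpy as np
from tensorflow.keras import layers, models

# Stand-in model; in the notebook this would be the tuned `model`
stand_in = models.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
stand_in.compile(optimizer="adam", loss="binary_crossentropy")

# Hypothetical file name, written to a temporary directory for the demo
save_path = os.path.join(tempfile.mkdtemp(), "genetic_profile_model.keras")
stand_in.save(save_path)

# Reload and confirm the restored model gives the same predictions
restored = models.load_model(save_path)
sample = np.random.default_rng(0).random((3, 4)).astype("float32")
same_output = np.allclose(
    stand_in.predict(sample, verbose=0),
    restored.predict(sample, verbose=0),
)
print("Restored model matches:", same_output)
```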

## Results  
The model outputs a probability between 0 and 1 for the labeled genetic predisposition. It can be extended to multiclass classification or integrated with external systems for broader use cases.  
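
Because the output layer is a sigmoid, a downstream consumer has to pick a cutoff to turn each probability into a class label. A small sketch of that step follows, assuming a cutoff of 0.5; the helper name `to_class_labels` and the cutoff are illustrative choices, not part of the notebook.

```python
import numpy as np


def to_class_labels(probabilities, threshold=0.5):
    """Convert sigmoid outputs to 0/1 labels at the given threshold."""
    probs = np.asarray(probabilities).ravel()  # flatten Keras (n, 1) output
    return (probs >= threshold).astype(int)


# Probabilities shaped like the output of model.predict
demo_labels = to_class_labels([[0.12], [0.50], [0.91]])
print(demo_labels.tolist())  # [0, 1, 1]
```

A lower threshold trades more false positives for fewer missed cases, so the right value depends on how the predictions are used.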

## Future Work  
- Explore other ML algorithms like Random Forest or Gradient Boosted Trees for comparison.  
- Extend the system for multiclass problems (e.g., multiple genetic conditions).  
- Build a user-friendly interface for clinical use.  


## Acknowledgments  
- **Data Source**: [1000 Genomes Project](https://www.internationalgenome.org/data/)  
- **Tools Used**: TensorFlow, Keras Tuner, Scikit-learn, Pandas