## Starter Notebook Project 1

This notebook walks you through the first steps toward modeling for challenge 1.

In [None]:
import numpy as np
import pandas as pd
import os
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

Check that you have access to the train and test files; otherwise, set `data_dir` to the directory that contains them. This notebook assumes the default Kaggle input directory for the train and test data.

In [None]:
data_dir = '/kaggle/input/predicting-repair-outcomes'
for dirname, _, filenames in os.walk(data_dir):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/predicting-repair-outcomes/train.csv
/kaggle/input/predicting-repair-outcomes/test.csv


In [None]:
train_data_dir = data_dir + '/train.csv'

# Path to the test data; you will also need an output location for your submission
test_data_dir = data_dir + '/test.csv'

In [None]:
df_train = pd.read_csv(train_data_dir)
df_train.head()

Unnamed: 0,Id,GuideSeq,Fraction_Insertions,Avg_Deletion_Length,Indel_Diversity,Fraction_Frameshifts
0,0,CTGCAGGGCTAGTTTCCTATAGG,0.069572,4.301844,3.536538,0.807375
1,1,GAGATGCGGACCACCCAGCTGGG,0.287647,10.814444,3.871165,0.665696
2,2,GCAAACGGAAGTGCAATTGTCGG,0.137004,9.888889,3.931298,0.684823
3,3,GTCATCGCTGAGTTGAGGAAGGG,0.093889,4.527812,3.523067,0.753003
4,4,ATATGATTATCCCTGCACAAGGG,0.526525,6.415644,2.828101,0.887214


Check length of DNA sequences:

In [None]:
seq_len = df_train['GuideSeq'].apply(len)
seq_len.describe()

count    1065.0
mean       23.0
std         0.0
min        23.0
25%        23.0
50%        23.0
75%        23.0
max        23.0
Name: GuideSeq, dtype: float64

We see that all sequences have the same length of 23 nucleotides (a 20-nucleotide guide RNA sequence plus the 3-nucleotide PAM sequence). Next, check for NaN values:

In [None]:
# Check for missing values
df_train.isnull().sum()

Id                      0
GuideSeq                0
Fraction_Insertions     0
Avg_Deletion_Length     0
Indel_Diversity         0
Fraction_Frameshifts    0
dtype: int64
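The 23-nucleotide layout noted above (20-nucleotide guide plus 3-nucleotide PAM) can be made explicit by splitting each sequence. This is a minimal sketch, assuming the PAM occupies the last three positions and that only the bases A, C, G and T occur; the `example` frame and the `guide`, `pam` and `valid_bases` column names are hypothetical, and the two sequences are copied from the training preview.

```python
import pandas as pd

# Two rows copied from the training preview; `example` is a hypothetical stand-in for df_train
example = pd.DataFrame({
    "GuideSeq": ["CTGCAGGGCTAGTTTCCTATAGG", "GAGATGCGGACCACCCAGCTGGG"]
})

# Assumption: the first 20 nucleotides are the guide, the last 3 are the PAM
example["guide"] = example["GuideSeq"].str[:20]
example["pam"] = example["GuideSeq"].str[20:]

# Flag sequences that contain anything other than the four standard bases
example["valid_bases"] = example["GuideSeq"].str.fullmatch("[ACGT]+")

print(example[["guide", "pam", "valid_bases"]])
```

Running the same split on `df_train` and calling `value_counts()` on the `pam` column would show which PAM variants are present.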

We can also look at basic statistics of the target variables:

In [None]:
# Describe the target variables
target_cols = ['Fraction_Insertions', 'Avg_Deletion_Length', 'Indel_Diversity', 'Fraction_Frameshifts']
df_train[target_cols].describe()

Unnamed: 0,Fraction_Insertions,Avg_Deletion_Length,Indel_Diversity,Fraction_Frameshifts
count,1065.0,1065.0,1065.0,1065.0
mean,0.209327,7.370845,3.800805,0.698444
std,0.153476,3.003263,0.709694,0.123423
min,0.001168,2.346479,1.151684,0.001676
25%,0.088558,5.421346,3.357909,0.641393
50%,0.164405,6.977368,3.857114,0.714457
75%,0.292504,8.773226,4.302568,0.780449
max,0.831198,46.027027,5.548348,0.954305


Now it is your turn: continue exploring and preprocessing the data, and start modeling!

In [None]:
# YOUR CODE HERE!
# - Continue with data exploration, data preprocessing, think about ways to encode your DNA sequences
# - start modeling!
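One common way to encode DNA sequences is one-hot encoding, where each position becomes a four-element indicator vector. The sketch below is one possible starting point, not the required approach; the `one_hot_encode` helper and the `BASES` alphabet are hypothetical names, and the code assumes every sequence contains only A, C, G and T.

```python
import numpy as np

BASES = "ACGT"  # assumed alphabet; other characters raise a KeyError below


def one_hot_encode(seq: str) -> np.ndarray:
    """Return a flat one-hot vector of length 4 * len(seq)."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for pos, base in enumerate(seq):
        encoded[pos, index[base]] = 1.0
    return encoded.ravel()


# Two sequences copied from the training preview
sequences = ["CTGCAGGGCTAGTTTCCTATAGG", "GAGATGCGGACCACCCAGCTGGG"]

# Stack into a feature matrix with one row per sequence
X = np.stack([one_hot_encode(s) for s in sequences])
print(X.shape)  # (2, 92): 23 positions x 4 bases
```

Applied to the full data, `np.stack([one_hot_encode(s) for s in df_train['GuideSeq']])` would give a feature matrix that most scikit-learn regressors accept directly.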

### Submission File
To participate in the challenge, you need to submit a CSV file containing your predictions for the test set. The submission file should have the following format:
- Columns:
 - Id: Should match the 'Id' column in the test data.
 - Fraction_Insertions
 - Avg_Deletion_Length
 - Indel_Diversity
 - Fraction_Frameshifts

In [None]:
df_test = pd.read_csv(test_data_dir)
df_test.head()

Unnamed: 0,Id,GuideSeq
0,0,TGTGCAATATCTGGTACTAAGGG
1,1,TGTCTGGCCAGCAGAATACAGGG
2,2,ACTGAGAGTGGATCCGAAAGTGG
3,3,GTTCTGCACCAGCACATTCACGG
4,4,ACTGGATGGACAAGACTGGTGGG


In [None]:
# Create a DataFrame for the submission
submission = pd.DataFrame({
    'Id': df_test['Id']
})

# Generate random (dummy) predictions within reasonable ranges
np.random.seed(42)  # For reproducibility

# Fraction_Insertions and Fraction_Frameshifts are fractions, so values lie between 0 and 1
submission['Fraction_Insertions'] = np.random.uniform(0, 1, len(df_test))
submission['Avg_Deletion_Length'] = np.random.uniform(0, 50, len(df_test))  # Assuming avg deletion length between 0 and 50
# Placeholder range only: Indel_Diversity spans roughly 1.15 to 5.55 in the training data
submission['Indel_Diversity'] = np.random.uniform(0, 1, len(df_test))
submission['Fraction_Frameshifts'] = np.random.uniform(0, 1, len(df_test))

# Ensure that the columns are in the correct order
submission = submission[['Id', 'Fraction_Insertions', 'Avg_Deletion_Length', 'Indel_Diversity', 'Fraction_Frameshifts']]

# Display the first few rows of the submission
submission.head()

Unnamed: 0,Id,Fraction_Insertions,Avg_Deletion_Length,Indel_Diversity,Fraction_Frameshifts
0,0,0.37454,36.304567,0.503417,0.624149
1,1,0.950714,48.792604,0.690395,0.409412
2,2,0.731994,25.815017,0.039312,0.552047
3,3,0.598658,16.147824,0.79941,0.436127
4,4,0.156019,39.75931,0.6279,0.294466


In [None]:
# Save the submission file (uncomment the line below to write it)
# submission.to_csv('/kaggle/working/submission.csv', index=False)