# <center> Feature Engineering </center>

## Problem Statement 

- The objective of this notebook is to extract features from the given dataset. Specifically, we aim to create a Likelihood Feature that scores how closely a child's DNA sequence matches a candidate parent's, and to generate a Target Label Feature from that likelihood for paternity testing.

## Feature Engineering Steps

### 1. Creating a Likelihood Feature

- Implement the **calculate_likelihood_ratio** function, which compares the parent and child DNA sequences position by position and returns the percentage of matching positions. This score stands in for the likelihood ratio, a statistical measure used to assess the probability of observing the provided sequences under different hypotheses.

### 2. Generating a Target Label Feature
- Use the likelihood feature to generate a binary target label (0 or 1). The label is 1 when the likelihood exceeds a predefined threshold (77% in this notebook), chosen on the basis of medical research, and 0 otherwise.


------

In [1]:
# Constants
DATA_PATH = '../data/processed/1_first_processed_merged_df.pkl'
EXPORT_PATH = '../data/processed/2_second_processed_merged_df.pkl'

In [2]:
import pandas as pd 
import numpy as np
import logging 
import pickle

## Read data

In [3]:
first_processed_merged_df = pd.read_pickle(DATA_PATH)

In [4]:
first_processed_merged_df_copy = first_processed_merged_df.copy()

In [6]:
first_processed_merged_df_copy.sample(4)

Unnamed: 0,Name,Parent_full_DNA_Seq,Child_full_DNA_Seq,ParentM,ParentF
5532,A3585,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,A3585,A790
414,A289,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTTATTCCCGTG...,,
9337,A6199,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,,
12204,A8252,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,,


------

## **Likelihood Ratio**

### Comparing two sequences (alleles) for a paternity test based on the likelihood ratio

- The comparison is based on the likelihood ratio, a statistical measure used to assess the probability of observing the provided sequences under different hypotheses. In this notebook it is approximated by the percentage of positions at which the parent and child sequences match.
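
As a worked formula, the score returned by `calculate_likelihood_ratio` can be written as follows, assuming the parent sequence $f$ and the child sequence $c$ have the same length $n$ (these symbols are introduced here for illustration only):

$$
\text{score}(f, c) = \frac{100}{n} \sum_{i=1}^{n} \mathbf{1}[f_i = c_i]
$$

For example, `ACGT` against `ACGA` agrees at 3 of 4 positions, giving a score of 75%. Splitting each sequence into two halves, as the function does, does not change this value for equal-length inputs, because the halves are still compared at the same positions.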
 

------

In [9]:
def calculate_likelihood_ratio(father_allele, child_allele):
    """
    Calculate the likelihood ratio for paternity testing based on allele sequences.

    Parameters:
    - father_allele (str): Allele sequence of the father.
    - child_allele (str): Allele sequence of the child.

    Returns:
    - likelihood_ratio (float): Percentage (0-100) of positions at which the two
      sequences match, rounded to two decimals.
    """
    match_count = 0
    total_count = 0
    length = len(father_allele) // 2

    allel1_father = father_allele[:length]
    allel2_father = father_allele[length:]

    allel1_child = child_allele[:length]
    allel2_child = child_allele[length:]

    # Compare allel1 sequences of father and child
    for i in range(len(allel1_father)):
        if allel1_father[i] == allel1_child[i]:
            match_count += 1
        total_count += 1

    # Compare allel2 sequences of father and child
    for i in range(len(allel2_father)):
        if allel2_father[i] == allel2_child[i]:
            match_count += 1
        total_count += 1

    likelihood_ratio = (match_count / total_count) * 100
    return round(likelihood_ratio, 2)

# Example usage
father_sequence = first_processed_merged_df_copy.iloc[10, 1]
child_sequence = first_processed_merged_df_copy.iloc[10, 2]
likelihood_ratio = calculate_likelihood_ratio(father_sequence, child_sequence)

print("Likelihood of paternity: {:.2f}%".format(likelihood_ratio))


Likelihood of paternity: 69.03%
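
The function above assumes both sequences are non-empty and of equal length; a shorter child sequence would raise an `IndexError`, and an empty input a `ZeroDivisionError`. A minimal sketch of a stricter variant follows. The name `match_percentage` is hypothetical and is not used elsewhere in this notebook.

```python
def match_percentage(parent_seq, child_seq):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(parent_seq) != len(child_seq):
        # Fail loudly instead of comparing misaligned positions.
        raise ValueError("sequences must have the same length")
    if not parent_seq:
        # No positions to compare, so the score is undefined.
        return float("nan")
    matches = sum(p == c for p, c in zip(parent_seq, child_seq))
    return round(matches / len(parent_seq) * 100, 2)


# Toy sequences (not taken from the dataset): 7 of 8 positions agree.
print(match_percentage("ACGTACGT", "ACGTACGA"))
```

For equal-length inputs this should give the same value as `calculate_likelihood_ratio`, since comparing the two halves separately covers exactly the same positions.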


------

## **Target Label**

In [10]:
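def get_target(second_processed_merged_df):
    # Columns 1 and 2 hold the parent and child sequences.
    # A row is labelled 1 when its match percentage exceeds 77, else 0.
    second_processed_merged_df['target'] = [1 if calculate_likelihood_ratio(i[1], i[2]) > 77 else 0 for i in second_processed_merged_df.values]
    return second_processed_merged_df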

In [11]:
second_processed_merged_df = get_target(first_processed_merged_df_copy)

In [12]:
second_processed_merged_df.sample(4)

Unnamed: 0,Name,Parent_full_DNA_Seq,Child_full_DNA_Seq,ParentM,ParentF,target
1699,A1128,CTCCATCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,,,1
159,A113,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,A113,A9557,1
7153,A4723,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCCGTG...,,,0
15930,A10761,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGATTCCTGTG...,CTCCGTCGACGCTTTAGGGACATAGATGGGAGCTCTGACTCCTGTG...,,,1
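
The labelling step can also be written against column names rather than positions, keeping the likelihood score as its own column next to the label. The sketch below is one possible arrangement, not the notebook's method. `percent_match` and `add_target` are hypothetical helpers, and the toy rows are invented for illustration.

```python
import pandas as pd

THRESHOLD = 77  # same cut-off used by get_target in this notebook


def percent_match(parent_seq, child_seq):
    """Stand-in scorer (hypothetical): percentage of identical positions."""
    matches = sum(p == c for p, c in zip(parent_seq, child_seq))
    return round(matches / len(parent_seq) * 100, 2)


def add_target(df, threshold=THRESHOLD):
    """Return a copy of df with 'likelihood' and binary 'target' columns."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["likelihood"] = [
        percent_match(p, c)
        for p, c in zip(out["Parent_full_DNA_Seq"], out["Child_full_DNA_Seq"])
    ]
    out["target"] = (out["likelihood"] > threshold).astype(int)
    return out


# Invented toy rows: 9/10 and 7/10 matching positions respectively.
toy = pd.DataFrame({
    "Name": ["A1", "A2"],
    "Parent_full_DNA_Seq": ["ACGTACGTAC", "ACGTACGTAC"],
    "Child_full_DNA_Seq": ["ACGTACGTAA", "TTTTACGTAC"],
})
labelled = add_target(toy)
print(labelled[["Name", "likelihood", "target"]])
```

Referring to columns by name avoids silent errors if the column order changes, and working on a copy keeps the input frame free of side effects.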


In [14]:
second_processed_merged_df['target'].value_counts()

0    21985
1    21799
Name: target, dtype: int64
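
The counts above can be turned into class shares to check how balanced the labels are. A small sketch using only the two counts printed above:

```python
# Counts copied from the value_counts() output above.
counts = {0: 21985, 1: 21799}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"target={label}: {share:.2%}")
```

Both classes sit close to 50%, so the 77% threshold splits this dataset almost evenly.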

In [13]:
second_processed_merged_df.to_pickle(EXPORT_PATH)
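
After exporting, it can be worth reloading the file to confirm that it round-trips unchanged. The sketch below shows the idea on a small invented frame written to a temporary directory. The file name `toy_export.pkl` is hypothetical; in the notebook the same check would use `EXPORT_PATH`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented stand-in for the labelled DataFrame.
toy = pd.DataFrame({"Name": ["A1", "A2"], "target": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = Path(tmp_dir) / "toy_export.pkl"  # hypothetical path
    toy.to_pickle(export_path)
    reloaded = pd.read_pickle(export_path)

# Raises an AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(toy, reloaded)
```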