
# ML Model To Predict NSUN6 Affected Genes



## Imports & Set-Up

Currently this notebook uses only the 50:50 area selection.

In [4]:
import random
import math
import pandas as pd
import numpy as np
from pyfaidx import Fasta
import matplotlib.pyplot as plt
from matplotlib import colors
import logomaker as lm
import re

dataset = pd.read_csv("out\\NSUN6_dataset_2405241153.csv", index_col = 0)


## Dealing With The "p>>n" Problem

The current datasets have 4756 features (p) but only 227 samples in the positive dataset and 75 samples in the negative dataset (n). With far more features than samples, a model can easily over-fit, so various methods should be considered to reduce p (or increase n).

No single method is best; the recommended approach is to compare a suite of different methods in controlled experiments.
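One way to set up such a controlled experiment is to wrap each reduction method in a scikit-learn pipeline and score every pipeline on the same cross-validation folds, so that the reduction step is refit inside each fold. The sketch below is only an illustration on random toy data: the classifier (`LogisticRegression`), the number of kept components (20), the `f_classif` score function (used here as a fast stand-in for the tests discussed later) and the `balanced_accuracy` metric are all assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real data: 302 samples, 200 random features
rng = np.random.default_rng(0)
X_toy = rng.random((302, 200))
y_toy = np.array([1] * 227 + [0] * 75)  # same class sizes as the notebook

# Hypothetical candidates: each pipeline differs only in its reduction step
candidates = {
    "pca": make_pipeline(
        StandardScaler(), PCA(n_components=20), LogisticRegression(max_iter=1000)
    ),
    "select_k_best": make_pipeline(
        StandardScaler(), SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
    ),
}

# Identical stratified folds for every candidate keep the comparison fair
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring="balanced_accuracy").mean()
    for name, pipe in candidates.items()
}
print(scores)
```

Because the toy features are random, both scores should sit near chance; the point is the structure of the comparison, not the numbers.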

Whilst the features 'A', 'C', 'T', 'G' are obviously crucial, we still need to work out how to reduce the base-pairing probabilities to just the components that matter for this problem.

From a quick inspection, all base-pairing probability (bpp) columns are about equally likely to be blank: typically ~200 of 227 samples are blank for any given column!


### Dimensional Reduction [PCA, tSNE]

PCA is a linear method and is less effective at preserving local structure. t-SNE is frequently used in bioinformatics and biomedical signal processing, but it requires tuning hyperparameters such as the perplexity and the number of optimisation steps.
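For comparison with PCA, a t-SNE embedding can be sketched with scikit-learn's `TSNE`. This is a minimal illustration on random toy data; the perplexity of 30 and the PCA initialisation are assumed values, not settings tested in this notebook, and the perplexity must stay below the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for the bpp matrix: 100 samples, 20 features
rng = np.random.default_rng(0)
X_toy = rng.random((100, 20))

# perplexity is the main hyperparameter to tune (assumed value here)
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X_toy)

print(embedding.shape)
```

Note that t-SNE has no `transform` method for unseen samples, so it is better suited to visualising the data than to producing features for a classifier.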

Open question: should only the highly variable features be used?
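If only the highly variable features were kept, one simple approach would be to rank columns by variance and keep the top `k`. The sketch below uses a sparse toy frame with hypothetical column names (`pair_0`, `pair_1`, ...) and an assumed `k = 10`.

```python
import numpy as np
import pandas as pd

# Toy stand-in: 302 samples x 50 columns, mostly zero like the filled bpp data
rng = np.random.default_rng(0)
values = rng.random((302, 50)) * (rng.random((302, 50)) < 0.1)
toy = pd.DataFrame(values, columns=[f"pair_{i}" for i in range(50)])

k = 10  # hypothetical number of features to keep

# Rank columns by variance and keep the k most variable ones
variances = toy.var(axis=0)
top_columns = variances.nlargest(k).index
reduced = toy[top_columns]

print(reduced.shape)
```

This ranking has to be done before any standardisation step, because standardising gives every non-constant column unit variance.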


In [21]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Only the base-pairing probability (bpp) columns are reduced
bpp_set = dataset.iloc[:, 6:]

# Replace NaNs with 0 probability
bpp_set = bpp_set.fillna(0)

# Target: classification label
y = dataset['NSUN6_affected'].values
# Features
X = bpp_set.values

# Standardise each feature to zero mean and unit variance before PCA
X = StandardScaler().fit_transform(X)

# PCA: n_components cannot exceed min(n_samples, n_features)
pca = PCA(n_components=200)
principalComponents = pca.fit_transform(X)
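The choice of 200 components can be checked against the cumulative explained variance. The sketch below fits a full PCA on random toy data and finds how many components are needed to reach a 95% threshold; the threshold and the toy matrix are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the standardised bpp matrix: 302 samples, 500 features
rng = np.random.default_rng(0)
X_toy = StandardScaler().fit_transform(rng.normal(size=(302, 500)))

# With n_components unset, PCA keeps min(n_samples, n_features) components
pca_full = PCA().fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_for_95)
```

On the real data, plotting `cumulative` against the component index would show whether the curve flattens well before 200 components.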


### Feature Selection [Statistical Tests (e.g., chi-squared, mutual information)]

Select the most informative probabilities. More numerical encoding may be needed before statistical tests can be applied.
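As a rough sketch of statistical feature selection, scikit-learn's `SelectKBest` can rank features by mutual information with the class label. The toy data, the planted informative column (index 3) and `k = 5` are all assumptions made for illustration.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy stand-in: 302 samples, 40 random features, notebook-like class sizes
rng = np.random.default_rng(0)
n_samples, n_features = 302, 40
X_toy = rng.random((n_samples, n_features))
y_toy = np.array([1] * 227 + [0] * 75)

# Plant one informative column (hypothetical) so the selector has a signal
X_toy[:, 3] = y_toy * 0.8 + rng.random(n_samples) * 0.2

# Fix the random state of the estimator so the ranking is reproducible
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=5)
X_selected = selector.fit_transform(X_toy, y_toy)

kept = selector.get_support(indices=True)
print(kept)
```

The chi-squared test (`sklearn.feature_selection.chi2`) is a drop-in alternative, but it requires non-negative inputs, so it would have to be applied to the raw probabilities rather than to standardised values.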

In [56]:
    ### Positive Dataset Selection ###


<bound method Series.sort_values of seq         0
A           0
C           0
G           0
T           0
         ... 
61-101    219
65-96     212
21-55     218
53-74     216
48-83     219
Length: 4758, dtype: int64>
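The `<bound method Series.sort_values of ...>` text in the output above is what pandas prints when `sort_values` is referenced without parentheses, so the values shown are not actually sorted. Assuming the cell was counting blank entries per column, a minimal sketch of the intended calculation on a toy frame (the column names are borrowed from the output, the values are random) could look like this:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bpp columns: 227 samples, roughly 88% blank
rng = np.random.default_rng(0)
values = rng.random((227, 5))
values[rng.random((227, 5)) < 0.88] = np.nan
toy_bpp = pd.DataFrame(values, columns=["61-101", "65-96", "21-55", "53-74", "48-83"])

# Count blanks per column; sort_values must be called, not just referenced
blank_counts = toy_bpp.isna().sum().sort_values(ascending=False)
blank_fraction = blank_counts / len(toy_bpp)

print(blank_fraction)
```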



### Collapsing Probabilities To Reduce Empty Data

Aggregated Features: Aggregate probabilities over specific regions (e.g., sliding windows) to reduce dimensionality. For example, compute the mean, variance, or entropy of probabilities within a window.
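A minimal sketch of window aggregation follows. It assumes that each bpp column name has the form `i-j`, where `i` and `j` are the paired positions, and it assigns each pair to a window by its first position; the window width of 20 and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bpp columns after fillna(0): 3 samples, 5 pairs
toy_bpp = pd.DataFrame(
    {
        "21-55": [0.9, 0.0, 0.1],
        "48-83": [0.0, 0.4, 0.0],
        "53-74": [0.2, 0.0, 0.0],
        "61-101": [0.0, 0.7, 0.3],
        "65-96": [0.5, 0.0, 0.0],
    }
)

window = 20  # hypothetical window width in nucleotides

# Assign every pair column to a window using its first position
first_pos = [int(name.split("-")[0]) for name in toy_bpp.columns]
labels = np.array(
    [f"w{(p // window) * window}-{(p // window + 1) * window - 1}" for p in first_pos]
)

# Aggregate the columns that fall in the same window, per sample
grouped = toy_bpp.T.groupby(labels)
window_mean = grouped.mean().T.add_suffix("_mean")
window_max = grouped.max().T.add_suffix("_max")
aggregated = pd.concat([window_mean, window_max], axis=1)

print(aggregated)
```

Here five sparse columns collapse into three windows with two statistics each. Variance or entropy per window could be added in the same way, and windowing on the second position or on both positions is an equally plausible design.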