
# ML Model To Predict NSUN6 Affected Genes



## Imports & Set-Up

Currently just using the 50:50 area selection

In [2]:
import random
import math
import pandas as pd
import numpy as np
from pyfaidx import Fasta
import matplotlib.pyplot as plt
from matplotlib import colors
import logomaker as lm
import re

positive_set = pd.read_csv("out\\Positive_NSUN6_5050.csv", index_col = 0)
negative_set = pd.read_csv("out\\Negative_NSUN6_5050.csv", index_col = 0)




## Dealing With The "p>>n" Problem

The current datasets have 4756 features (p) but only 227 samples in the positive dataset and 75 samples in the negative dataset (n). To mitigate this imbalance, which can lead to over-fitting and related issues, various methods should be considered to reduce p (or increase n).

There is no best method and it is recommended to use controlled experiments to test a suite of different methods...

Whilst the features 'A', 'C', 'T', 'G' are obviously crucial, we need to identify how to reduce the base pairing probability down to just the problem-significant components.

From a quick inspection, the base-pairing probability (bpp) columns are all about equally likely to be blank: for a typical column, ~200/227 samples are blank!


### Dimensionality Reduction [PCA, t-SNE]

PCA is less effective at preserving local structure. t-SNE is frequently used in bioinformatics and biomedical signal processing, but it requires tuning hyperparameters such as perplexity and the number of iterations.

Open question: only use the highly variable features?
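One way to try the "highly variable features" idea is to rank the bpp columns by variance and keep the top few. The sketch below is not part of the notebook: it runs on a small synthetic table, the helper name `top_variance_features` is hypothetical, and it assumes that a blank (NaN) probability can be treated as 0.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bpp matrix (assumption: the real one
# would be positive_set.iloc[:, 5:]).
rng = np.random.default_rng(0)
demo_bpp = pd.DataFrame(
    rng.random((20, 6)),
    columns=["1-5", "1-12", "1-29", "1-30", "1-35", "2-40"],
)
# Blank out roughly 70% of the entries to mimic the sparse bpp columns.
demo_bpp = demo_bpp.mask(rng.random(demo_bpp.shape) < 0.7)


def top_variance_features(bpp, k):
    """Return the names of the k columns with the largest variance.

    Assumption: a blank (NaN) probability is treated as 0.
    """
    variances = bpp.fillna(0.0).var(axis=0)
    return variances.nlargest(k).index.tolist()


selected = top_variance_features(demo_bpp, k=3)
print(selected)
```

Variance alone ignores the class labels, so this is only a cheap first filter before any supervised selection.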


In [21]:
### Positive Dataset Reduction ###
from sklearn.manifold import TSNE

# Keep only the base-pairing probability (bpp) columns
bpp_pos_set = positive_set.iloc[:, 5:]
print(bpp_pos_set)

# Toy input to check the t-SNE call; not yet applied to bpp_pos_set
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
X_e = TSNE(n_components=2, learning_rate='auto',
           init='random', perplexity=3).fit_transform(X)
print(X.shape)
print(X_e.shape)

                               1-5      1-12      1-29      1-30      1-35  \
gene                                                                         
ENSG00000188157(AGRN)          NaN       NaN       NaN       NaN       NaN   
ENSG00000083444(PLOD1)         NaN       NaN       NaN       NaN       NaN   
ENSG00000055070(SZRD1)         NaN       NaN  0.268693  0.022077       NaN   
ENSG00000244038(DDOST)         NaN       NaN       NaN       NaN       NaN   
ENSG00000244038(DDOST)         NaN       NaN       NaN       NaN       NaN   
...                            ...       ...       ...       ...       ...   
ENSG00000169100(SLC25A6)       NaN  0.003383  0.006685       NaN       NaN   
ENSG00000188153(COL4A5)        NaN       NaN       NaN       NaN       NaN   
ENSG00000185825(BCAP31)        NaN       NaN       NaN       NaN       NaN   
ENSG00000071553(ATP6AP1)  0.009186       NaN       NaN       NaN  0.024568   
ENSG00000126903(SLC10A3)       NaN       NaN       NaN       NaN


### Feature Selection [Statistical Tests (e.g., chi-squared, mutual information)]

Select the most informative probabilities. More numerical encoding may be needed before a statistical test can be applied.
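
As a rough illustration of the mutual-information option, the sketch below stacks a positive and a negative table, labels them 1 and 0, and keeps the two highest-scoring columns with scikit-learn's `SelectKBest`. The data is synthetic (one column is deliberately made informative), and treating blanks as 0 is an assumption. A chi-squared test (`chi2`) could be swapped in, since probabilities are non-negative.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic positive/negative tables with bpp-style column names.
rng = np.random.default_rng(0)
n_pos, n_neg = 30, 20
cols = ["1-5", "1-12", "1-29", "1-30", "1-35"]
pos = pd.DataFrame(rng.random((n_pos, len(cols))), columns=cols)
neg = pd.DataFrame(rng.random((n_neg, len(cols))), columns=cols)
pos["1-29"] += 1.0  # make one feature clearly informative for the demo

# Assumption: blanks are encoded as 0 before scoring.
X = pd.concat([pos, neg], ignore_index=True).fillna(0.0)
y = np.array([1] * n_pos + [0] * n_neg)

# Fix random_state so the mutual information estimate is repeatable.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=2).fit(X, y)

kept = X.columns[selector.get_support()].tolist()
print(kept)
```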

In [56]:
    ### Positive Dataset Selection ###


<bound method Series.sort_values of seq         0
A           0
C           0
G           0
T           0
         ... 
61-101    219
65-96     212
21-55     218
53-74     216
48-83     219
Length: 4758, dtype: int64>
oh
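
The `<bound method Series.sort_values of ...>` text in the output above is what pandas prints when `sort_values` is referenced without being called. The cell's code is not shown, so the following is only a guess at the intent: counting blank entries per column and sorting the counts. It uses a tiny hypothetical table.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the feature table.
demo = pd.DataFrame(
    {
        "A": [1, 0, 0],
        "61-101": [np.nan, np.nan, 0.2],
        "65-96": [np.nan, 0.1, 0.3],
    }
)

# Number of blank (NaN) entries in each column.
blank_counts = demo.isna().sum()

# Note the parentheses: sort_values() returns the sorted Series,
# whereas sort_values without them is just the method object.
sorted_counts = blank_counts.sort_values(ascending=False)
print(sorted_counts)
```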



### Collapsing Probabilities To Reduce Empty Data

Aggregated Features: Aggregate probabilities over specific regions (e.g., sliding windows) to reduce dimensionality. For example, compute the mean, variance, or entropy of probabilities within a window.
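
A minimal sketch of the windowed aggregation, under several assumptions: column names have the form `i-j`, each pair is assigned to a window by its first position `i`, the sequence length is 101 (guessed from the `61-101` column), and blanks count as 0. The helper `window_aggregate` is hypothetical and only computes the mean and variance.

```python
import numpy as np
import pandas as pd


def window_aggregate(bpp, window=10, seq_len=101):
    """Collapse 'i-j' pair columns into per-window mean and variance.

    Assumptions: pairs are binned by their first position i, and a
    blank (NaN) probability is treated as 0.
    """
    first_pos = pd.Series(
        [int(name.split("-")[0]) for name in bpp.columns], index=bpp.columns
    )
    features = {}
    for start in range(1, seq_len + 1, window):
        end = start + window - 1
        cols = first_pos[(first_pos >= start) & (first_pos <= end)].index
        if len(cols) == 0:
            continue  # no pair starts in this window
        block = bpp[cols].fillna(0.0)
        features[f"mean_{start}-{end}"] = block.mean(axis=1).to_numpy()
        features[f"var_{start}-{end}"] = block.var(axis=1, ddof=0).to_numpy()
    return pd.DataFrame(features, index=bpp.index)


# Tiny hypothetical example with two genes and four pair columns.
demo = pd.DataFrame(
    {
        "1-5": [0.2, np.nan],
        "1-12": [np.nan, np.nan],
        "21-55": [0.4, 0.6],
        "48-83": [np.nan, 0.1],
    },
    index=["geneA", "geneB"],
)
agg = window_aggregate(demo, window=20, seq_len=101)
print(agg)
```

With a window of 20, the four sparse columns collapse into three windows with two features each, and no window feature is blank.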