## Exploratory Data Analysis 
*This file was written by Nicole and reformatted by Jack.*

This `.ipynb` file examines the raw data, and was used to inform the `PreProcessing Pipeline` class in `data/preprocess_data.py`. We begin by importing the necessary libraries.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

import warnings
warnings.filterwarnings("ignore")

We read in the `data.parquet` and `labels.parquet` files, encode the specific cancer labels numerically with the `encoding_dict` specified in the `README.md` file, and merge the two datasets together.

In [None]:
data = pd.read_parquet('./data/data.parquet')
labels = pd.read_parquet('./data/labels.parquet')

encoding_dict = {
    'BRCA': 0,
    'KIRC': 1,
    'COAD': 2,
    'LUAD': 3,
    'PRAD': 4
}

labels['Class'] = labels['Class'].map(encoding_dict)

data.rename(columns={'Unnamed: 0': 'sample_id'}, inplace=True)
labels.rename(columns={'Unnamed: 0': 'sample_id'}, inplace=True)

merged_data = pd.merge(labels, data, on='sample_id')
data = data.drop(columns=['sample_id'])
labels = labels.drop(columns=['sample_id'])

merged_data.head()
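
Two silent failure modes are worth checking in the step above: `Series.map` returns `NaN` for any label missing from the dictionary, and `pd.merge` can duplicate rows if a `sample_id` is repeated. The following is a minimal sketch on made-up data (the `toy_*` frames and the `'XXXX'` label are illustrative assumptions, not part of our dataset) showing how both could be detected.

```python
import pandas as pd

# Toy stand-ins for data.parquet and labels.parquet (all values are made up).
toy_data = pd.DataFrame({
    'sample_id': ['sample_0', 'sample_1', 'sample_2'],
    'gene_0': [0.0, 1.5, 2.0],
    'gene_1': [3.1, 0.0, 4.2],
})
toy_labels = pd.DataFrame({
    'sample_id': ['sample_0', 'sample_1', 'sample_2'],
    'Class': ['BRCA', 'KIRC', 'XXXX'],  # 'XXXX' is deliberately absent from the dict
})

encoding_dict = {'BRCA': 0, 'KIRC': 1, 'COAD': 2, 'LUAD': 3, 'PRAD': 4}

# Series.map yields NaN for keys that are not in the dict, so count them.
encoded = toy_labels['Class'].map(encoding_dict)
n_unmapped = int(encoded.isna().sum())
print("Unmapped labels:", n_unmapped)

# validate='one_to_one' raises a MergeError if a sample_id repeats on either side.
toy_merged = pd.merge(
    toy_labels.assign(Class=encoded), toy_data,
    on='sample_id', validate='one_to_one',
)
print(toy_merged.shape)
```

On the real data we would expect the unmapped count to be zero and the merged frame to have one row per sample.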

In [None]:
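# Load the gene identifiers and print how many there are.
gene_ids = pd.read_csv('gene_ids.csv')['gene_id'].to_list()
print(len(gene_ids))

# Keep the part of each identifier before the first '|' (the whole string if there is none)
# and print how many distinct values remain. dict.fromkeys preserves first-seen order.
unique_symbols = list(dict.fromkeys(gene_id.split('|')[0] for gene_id in gene_ids))
print(len(unique_symbols))

# Display the full list of identifiers.
gene_ids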

20531
20502


['?',
 'A1BG',
 'A1CF',
 'A2BP1',
 'A2LD1',
 'A2ML1',
 'A2M',
 'A4GALT',
 'A4GNT',
 'AAA1',
 'AAAS',
 'AACSL',
 'AACS',
 'AADACL2',
 'AADACL3',
 'AADACL4',
 'AADAC',
 'AADAT',
 'AAGAB',
 'AAK1',
 'AAMP',
 'AANAT',
 'AARS2',
 'AARSD1',
 'AARS',
 'AASDHPPT',
 'AASDH',
 'AASS',
 'AATF',
 'AATK',
 'ABAT',
 'ABCA10',
 'ABCA11P',
 'ABCA12',
 'ABCA13',
 'ABCA17P',
 'ABCA1',
 'ABCA2',
 'ABCA3',
 'ABCA4',
 'ABCA5',
 'ABCA6',
 'ABCA7',
 'ABCA8',
 'ABCA9',
 'ABCB10',
 'ABCB11',
 'ABCB1',
 'ABCB4',
 'ABCB5',
 'ABCB6',
 'ABCB7',
 'ABCB8',
 'ABCB9',
 'ABCC10',
 'ABCC11',
 'ABCC12',
 'ABCC13',
 'ABCC1',
 'ABCC2',
 'ABCC3',
 'ABCC4',
 'ABCC5',
 'ABCC6P1',
 'ABCC6P2',
 'ABCC6',
 'ABCC8',
 'ABCC9',
 'ABCD1',
 'ABCD2',
 'ABCD3',
 'ABCD4',
 'ABCE1',
 'ABCF1',
 'ABCF2',
 'ABCF3',
 'ABCG1',
 'ABCG2',
 'ABCG4',
 'ABCG5',
 'ABCG8',
 'ABHD10',
 'ABHD11',
 'ABHD12B',
 'ABHD12',
 'ABHD13',
 'ABHD14A',
 'ABHD14B',
 'ABHD15',
 'ABHD1',
 'ABHD2',
 'ABHD3',
 'ABHD4',
 'ABHD5',
 'ABHD6',
 'ABHD8',
 'ABI1',
 'ABI2',
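
The two counts printed above differ by 29, so some identifiers share the same text before the first `|`. A minimal sketch of how the repeated prefixes could be listed is given below. The identifiers in `toy_gene_ids` are hypothetical and only illustrate the technique; they are not taken from `gene_ids.csv`.

```python
from collections import Counter

# Hypothetical identifiers, for illustration only.
toy_gene_ids = ['?|1', '?|2', 'A1BG|3', 'A2M|4', 'A2M|5']

# Take the text before the first '|' and count how often each prefix occurs.
symbols = [gene_id.split('|')[0] for gene_id in toy_gene_ids]
counts = Counter(symbols)

# Prefixes that occur more than once are the ones collapsed by de-duplication.
duplicated = {symbol: n for symbol, n in counts.items() if n > 1}
n_collapsed = len(symbols) - len(counts)

print(duplicated)
print(n_collapsed)
```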
 

We examine the total activation of each gene across all subjects (the sum of each column).

In [None]:
def examine_column_sums(merged_data):
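    # Sum each numeric column. The encoded 'Class' label is numeric too,
    # so exclude it to make sure only gene features are summed.
    numeric_df = merged_data.select_dtypes(include=['number']).drop(columns=['Class'], errors='ignore')
    column_sums = numeric_df.sum()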

    plt.figure(figsize=(6, 4))
    plt.hist(column_sums, bins=10, edgecolor='black')
    plt.title('Distribution of Column Sums')
    plt.xlabel('Sum')
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()

    q1 = column_sums.quantile(0.25)
    q2 = column_sums.quantile(0.50)  
    q3 = column_sums.quantile(0.75)
    iqr = q3 - q1

    print("Quartile Ranges:")
    print(f"Q1 (25th percentile): {q1}")
    print(f"Median (50th percentile): {q2}")
    print(f"Q3 (75th percentile): {q3}")
    print(f"Interquartile Range (IQR): {iqr}")

    plt.figure(figsize=(6, 4))
    plt.boxplot(column_sums, vert=False)
    plt.title('Box Plot of Column Sums')
    plt.xlabel('Sum')
    plt.show()

    return column_sums

column_sums = examine_column_sums(merged_data=merged_data)

In [None]:
print("Columns that sum to zero:")
filtered_columns = column_sums[column_sums == 0]
# print(filtered_columns)
print("Number of features that are all zeroes:", len(filtered_columns))

A substantial number of genes (267) show zero activation across all subjects, so we drop them from our analysis and add this step to our preprocessor.

In [None]:
merged_data = merged_data.drop(columns = filtered_columns.index)
data = data.drop(columns = filtered_columns.index)
merged_data.head()
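
As a rough illustration of how this step might look inside a preprocessor, the sketch below learns the all-zero columns once and then drops the same columns from any later data. The class name `ZeroColumnDropper` and its methods are hypothetical; this is not the actual `PreProcessing Pipeline` class in `data/preprocess_data.py`.

```python
import pandas as pd


class ZeroColumnDropper:
    """Hypothetical preprocessing step: drop columns that are all zero in the fitted data."""

    def fit(self, X: pd.DataFrame) -> "ZeroColumnDropper":
        # Remember which columns contain only zeros in the data seen at fit time.
        self.zero_columns_ = X.columns[(X == 0).all(axis=0)].tolist()
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Drop the remembered columns so every dataset ends up with the same features.
        return X.drop(columns=self.zero_columns_)


# Made-up data: gene_0 and gene_2 are zero for every subject.
toy = pd.DataFrame({'gene_0': [0.0, 0.0], 'gene_1': [1.2, 0.0], 'gene_2': [0.0, 0.0]})
dropper = ZeroColumnDropper().fit(toy)
print(dropper.zero_columns_)
print(dropper.transform(toy).columns.tolist())
```

The sketch tests `(X == 0).all(axis=0)` rather than a zero column sum. The two agree when all values are non-negative, but a sum could also be zero if positive and negative values cancelled out.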

We next examine how this change affects the distribution of the sums of activations across subjects.

In [None]:
column_sums = examine_column_sums(merged_data=merged_data)

This change only has a small effect on the overall shape of the distribution.

We next examine the distribution of cancer labels across subjects.

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(x=labels['Class'], palette='viridis')
plt.title('Distribution of Cancer Types')
plt.xlabel('Cancer Type (Encoded)')
plt.ylabel('Count')
plt.xticks(ticks=[0, 1, 2, 3, 4], labels=['BRCA', 'KIRC', 'COAD', 'LUAD', 'PRAD'])
plt.show()

Subjects with breast cancer (BRCA) appear most frequently in our dataset. Kidney (KIRC), lung (LUAD), and prostate (PRAD) cancer appear in similar frequencies. Subjects with colon cancer (COAD) appear least frequently.
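
To put numbers on the imbalance seen in the plot, the class counts and proportions can be tabulated directly. The sketch below uses a made-up `toy_classes` series (an assumption for illustration, not our labels); the same calls would apply to `labels['Class']`.

```python
import pandas as pd

# Made-up encoded labels, for illustration only.
toy_classes = pd.Series([0, 0, 0, 1, 2, 3, 4, 4], name='Class')
decoding_dict = {0: 'BRCA', 1: 'KIRC', 2: 'COAD', 3: 'LUAD', 4: 'PRAD'}

# Count subjects per cancer type and convert the counts to proportions.
counts = toy_classes.map(decoding_dict).value_counts()
proportions = counts / counts.sum()

# Ratio of the most frequent class to the least frequent class.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(proportions.round(3))
print("Imbalance ratio:", imbalance_ratio)
```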

In [None]:
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
data_tsne = tsne.fit_transform(data)

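# Use the cancer-type names as the hue so that each legend entry is tied to its own points,
# rather than relabelling the legend by position afterwards.
class_names = labels['Class'].map({code: name for name, code in encoding_dict.items()})

plt.figure(figsize=(8, 6))
ax = sns.scatterplot(x=data_tsne[:, 0], y=data_tsne[:, 1], hue=class_names,
                     hue_order=list(encoding_dict), palette='viridis', legend='full')
plt.title('t-SNE Visualization')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
ax.get_legend().set_title('Cancer Type')
plt.show()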

The t-SNE visualization shows that the data forms distinct clusters that largely correspond to the cancer types. From this we believe clustering algorithms will perform well at separating the cancer types.
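
A t-SNE plot is a visual aid, so one way to quantify how well clusters match the labels is to cluster in the original feature space and compare against the true classes. The sketch below does this with `KMeans` and the adjusted Rand index on synthetic data. The data, the number of clusters, and the choice of `KMeans` are all assumptions made for illustration, not results from our dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Three well-separated synthetic groups in 5 dimensions (made-up data).
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
X = np.vstack([center + rng.normal(0.0, 1.0, size=(40, 5)) for center in centers])
y_true = np.repeat([0, 1, 2], 40)

# Cluster without using the labels, then compare the clusters to the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred = kmeans.fit_predict(X)

# The adjusted Rand index is 1.0 for a perfect match and close to 0 for a random one.
ari = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand index: {ari:.3f}")
```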

In fact, the clustering is so distinct that the single PRAD datapoint in the KIRC cluster is likely mislabelled. However, we decide not to treat it as an outlier and leave it as-is in our analysis, since the apparent mislabelling may simply be an artifact of the low-dimensional projection.
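
One way to check whether such a point is a projection artifact is to look at its nearest neighbours in the original high-dimensional space instead of in the t-SNE plot. The sketch below flags points whose neighbours all carry a different label. It runs on synthetic data with one deliberately flipped label; the data, the value of `k`, and the "no neighbour agrees" rule are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Two well-separated synthetic groups in 50 dimensions (made-up data).
X = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 50)),
    rng.normal(5.0, 1.0, size=(30, 50)),
])
y = np.array([0] * 30 + [1] * 30)
y[0] = 1  # deliberately mislabel one point from the first group

k = 5
# Ask for k + 1 neighbours because each point is returned as its own nearest neighbour.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)
neighbour_labels = y[idx[:, 1:]]

# Fraction of the k neighbours that share each point's own label.
agreement = (neighbour_labels == y[:, None]).mean(axis=1)

# Points with no agreeing neighbours are candidates for a labelling error.
suspects = np.flatnonzero(agreement == 0)
print(suspects)
```

If the PRAD point were flagged in the full feature space as well, that would suggest the disagreement is not only an effect of the two-dimensional projection.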

## Conclusions

We decide to remove the all-zero columns from our analysis.