### This notebook builds a matrix/dataframe that treats samples as objects. The goal is to provide a dataset that allows exploration with respect to the characteristics of the samples. Each sample is an aggregate of other samples (which, according to the study's publication, is for the purpose of batch correction). In addition to the aggregated beta-values, which indicate the overall methylation of CpG sites, each sample includes an aggregate of these parameters: huntingtondiseasestatus (control/pre-manifest/manifest), averageage (numeric), averagebodymassindex (numeric), and dnamage (numeric). It also includes a sentrix.id (X############) attribute that was likely used as the means of aggregation; it indicates the chip ID of the BeadChip used to generate the intensity measures.
___

imports packages

In [None]:
import pandas as pd

reads in data

In [None]:
# reads in the beta-values df that was filtered by chromosome
betaSamples_df = pd.read_csv("output_files/betaValues_annotated_chrFiltered.csv")
# drops the first column (assumed to be a leftover index column from an earlier export)
betaSamples_df = betaSamples_df.iloc[:,1:]
betaSamples_df.head()

In [None]:

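# collects the sample columns, i.e. those whose names contain the GEO "GSM" accession prefix
colnames = betaSamples_df.columns
substring = "GSM"
samples_list = [s for s in colnames if substring in s]
# keeps only the beta-value columns for those samples
betaMatrix_df = betaSamples_df[samples_list]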
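
The cell above keeps every column whose name contains "GSM". As a minimal sketch of a slightly stricter variant, the example below matches only names that start with the prefix. The toy frame, its values, and its column names (including `CHR`) are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy frame standing in for betaSamples_df (names and values are illustrative only)
toy_df = pd.DataFrame({
    "ID_REF": ["cg0001", "cg0002"],
    "GSM0000001": [0.12, 0.85],
    "GSM0000002": [0.10, 0.91],
    "CHR": ["1", "2"],
})

# Keep only the columns whose names start with the GEO sample prefix
sample_cols = [c for c in toy_df.columns if c.startswith("GSM")]
toy_matrix = toy_df[sample_cols]

print(toy_matrix.shape)
```

Matching on the prefix rather than on any occurrence avoids picking up an annotation column that merely mentions "GSM" somewhere in its name.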


In [None]:

# creates sample-level annotation df
sampleMeta_df = pd.read_csv("source_data/GEO_sample_metadata.csv")

identifies metadata attributes to be mapped to samples of beta-value dataset

In [None]:
sampleMeta_df.head(2)

The above cell shows the attributes provided for the samples, which were collected from GEO using Bioconductor's GEOquery package (in R). The attributes of interest are averageage:ch1, dnamage:ch1, and huntingtondiseasestatus:ch1 (plus sentrix.id:ch1 for good measure).

These attributes are isolated in the cell below, along with the GSM accession number, which is used to map the metadata attributes to the beta-value samples.


In [None]:
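# isolates the attributes of interest; .copy() avoids pandas' SettingWithCopyWarning on the rename below
appending_df = sampleMeta_df[['Unnamed: 0',
                    'averageage:ch1',
                    'dnamage:ch1',
                    'huntingtondiseasestatus:ch1',
                    'sentrix.id:ch1'
                ]].copy()

# renames the unnamed first column, which holds the GSM accession numbers
appending_df = appending_df.rename(columns={'Unnamed: 0': 'gsm_accession'})

appending_df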

The output above shows what the corresponding publication describes as sample aggregation for the sake of batch correction. Each sample is an aggregate of many samples that have the same Huntington's disease status and, seemingly, the same sentrix.id. Next, the attribute names in appending_df are trimmed of their contextually superfluous ":ch1" substring, and appending_df is later merged with the transposed beta-value samples. The transposition establishes the samples as the data objects, which allows associations to be explored across sample groups.
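
As a minimal sketch of the trimming step, the column names can also be cleaned with the vectorized string methods on the column index. The toy metadata frame and its values are invented for illustration; only the ":ch1" substring and the attribute names come from the notebook.

```python
import pandas as pd

# Toy metadata frame standing in for appending_df (values are illustrative only)
toy_meta = pd.DataFrame({
    "gsm_accession": ["GSM0000001"],
    "averageage:ch1": [45.0],
    "sentrix.id:ch1": ["X000000000001"],
})

# Remove every literal ':ch1' from the column names (regex=False treats it as plain text)
toy_meta.columns = toy_meta.columns.str.replace(":ch1", "", regex=False)

print(list(toy_meta.columns))
```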

In [None]:
# assigns existing column names to be updated
current_colnames = appending_df.columns

# list comprehension: iteratively removes ':ch1' from colnames that have it
new_colnames = [item.replace(":ch1", "") for item in current_colnames]

# assigns updated colnames to the appending df
appending_df.columns = new_colnames

appending_df

In [None]:
len(appending_df['gsm_accession'])

In [None]:
# transposes the beta values
TbetaSamples_df = betaSamples_df.T


In [None]:
TbetaSamples_df.shape

Adds the index values produced by the transposition of the data frame (ID_REF followed by the GSM accessions in TbetaSamples_df.index[1:]) as a gsm_accession attribute of TbetaSamples_df.

In [None]:
# assigns the index values to a variable
gsms = TbetaSamples_df.index

# checks length (this exposed that 'GSM4409678')
len(gsms)

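# stores the index values in a gsm_accession column
TbetaSamples_df['gsm_accession'] = gsms

# displays the frame with the index moved into a column (the result is not assigned)
TbetaSamples_df.reset_index()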
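
The transposition and the step of turning the index into an attribute can be sketched more compactly on toy data. This is an assumed alternative, not the notebook's own method: it sets the CpG IDs as the row index before transposing, so they become the column names directly. All names and values in the toy frame are illustrative.

```python
import pandas as pd

# Toy frame standing in for betaSamples_df (names and values are illustrative only)
toy_beta = pd.DataFrame({
    "ID_REF": ["cg0001", "cg0002"],
    "GSM0000001": [0.12, 0.85],
    "GSM0000002": [0.10, 0.91],
})

# Use the CpG IDs as the row index so they become the column names after transposing
toy_T = toy_beta.set_index("ID_REF").T

# Name the index and move it into a regular column
toy_T = toy_T.rename_axis("gsm_accession").reset_index()

# Drop the leftover 'ID_REF' label on the column axis
toy_T.columns.name = None

print(toy_T)
```

Because the CpG IDs never pass through a data row, the beta-value columns keep their numeric dtype.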

#### change colnames

get cpgs

In [None]:
colnames = betaSamples_df['ID_REF'].to_list()
colnames.insert(0,'ID_REF')
# colnames = colnames.extend(cpgs)

print(colnames)

In [None]:


TbetaSamples_df.head()

In [None]:
TbetaSamples_df.to_csv("output_files/samples_with_metadata.csv")

In [None]:
TbetaSamples_df = pd.read_csv("output_files/samples_with_metadata.csv")

In [None]:
TbetaSamples_df.head()

In [None]:
# takes the first row, which holds ID_REF followed by the CpG IDs, as the new column names
cols = TbetaSamples_df.iloc[0,:].to_list()

In [None]:
TbetaSamples_df.columns = cols

In [None]:
TbetaSamples_df.head()


In [None]:
# drops the first row now that its values serve as the header
TbetaSamples_df = TbetaSamples_df.iloc[1:,:]

In [None]:
TbetaSamples_df.iloc[0:2,:]
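
The cells above promote the first data row to the header and then drop it. A minimal sketch of that pattern on a toy frame follows; the names and values are illustrative only.

```python
import pandas as pd

# Toy frame whose first row holds the intended column names (illustrative values)
toy = pd.DataFrame([
    ["ID_REF", "cg0001", "cg0002"],
    ["GSM0000001", 0.12, 0.85],
    ["GSM0000002", 0.10, 0.91],
])

# Promote the first row to the header, drop it from the data, and renumber the rows
toy.columns = toy.iloc[0].to_list()
toy = toy.iloc[1:].reset_index(drop=True)

print(toy)
```

Columns that mixed strings and numbers keep the object dtype after this step, so the beta-values may need `pd.to_numeric` before any arithmetic.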

In [None]:
TbetaSamples_df.to_csv("output_files/samples_with_metadata.csv")

In [None]:
# drops the last column (the redundant gsm_accession values)
TbetaSamples_df = TbetaSamples_df.iloc[:,:-1]

In [None]:
# renames the key so it matches the ID_REF column of TbetaSamples_df
appending_df = appending_df.rename(columns = {'gsm_accession': 'ID_REF'})

In [None]:
appending_df.columns

In [None]:
# left-merges the metadata onto the transposed beta values, keeping every sample row
final_sampleLevel_df = pd.merge(TbetaSamples_df, appending_df, on = 'ID_REF', how = 'left')
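
A left merge silently leaves missing values for samples that have no metadata match. As a hedged sketch, `pd.merge` can check the key for duplicates with `validate` and flag unmatched rows with `indicator`. The toy frames and their values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for TbetaSamples_df and appending_df (illustrative values only)
toy_samples = pd.DataFrame({
    "ID_REF": ["GSM0000001", "GSM0000002"],
    "cg0001": [0.12, 0.10],
})
toy_meta = pd.DataFrame({
    "ID_REF": ["GSM0000001"],
    "averageage": [45.0],
})

merged = pd.merge(
    toy_samples, toy_meta,
    on="ID_REF", how="left",
    validate="one_to_one",  # raises MergeError if ID_REF repeats on either side
    indicator=True,         # adds a '_merge' column describing each row's match
)

# Samples that found no metadata row
unmatched = merged.loc[merged["_merge"] == "left_only", "ID_REF"].to_list()
print(unmatched)
```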

In [None]:
final_sampleLevel_df.to_csv("output_files/samples_with_metadata.csv")

In [None]:
# subsets the CpG sites of interest together with the sample attributes
important_site_df = final_sampleLevel_df[['ID_REF', 'cg02550322', 'cg22982173', 'cg11324953', 'cg08763102', 'averageage', 'dnamage', 'huntingtondiseasestatus']]


In [None]:
important_site_df.iloc[0:10,:]

In [None]:
important_site_df.to_csv("output_files/sites_of_interest.csv")