This notebook builds a dataset of methylation beta-value samples. The data consist of the GSM samples in GEO series GSE146917, associated with the study "DNA methylation study of Huntington's disease and motor progression in patients and in animal models," published in [Nature Communications](https://escholarship.org/uc/item/7c39w4km).

#### Imports

The libraries used to create the dataset.

In [1]:
import pandas as pd
import os

___
___
## Part 1: (sandbox cells) Tests the methodology for building the dataset from Noob-normalized beta-values
___
___

Sandbox cell: Retrieves 2 methylation sample data sets from a directory that stores the source text files downloaded from GEO. The two files are joined into a single data frame to test the procedure that Part 2 runs in a loop to build a dataset containing all of the samples.

In [None]:
# # imports data using the read_csv() function

# ## because the folder containing the files is nested inside the folder that contains this notebook,
# ## the path to each file must be passed as an argument

# ## 'skiprows=2' skips the first two rows of extraneous data

# ## "sep='\t'" tells read_csv that the .txt file is tab-delimited
# df1 = pd.read_csv('source_data/betaValue_txtFiles/GSM4409578-103266.txt', skiprows=2, sep='\t')
# df2 = pd.read_csv('source_data/betaValue_txtFiles/GSM4409579-103267.txt', skiprows=2, sep='\t')

# # checks that the column the data will be joined on is equal in both frames,
# # meaning it contains the same values in the same rows
# df1['ID_REF'].tolist() == df2['ID_REF'].tolist() # prints True

# # joins df1 and df2 into a single dataframe
# joined_df = pd.merge(df1, df2, on="ID_REF", how="outer")

True
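
The check-then-merge procedure above can be tried without the GEO files. The following is a minimal sketch on two in-memory tables; the probe IDs, values, and the two placeholder header lines are made up for illustration and only assume the layout implied by `skiprows=2` and `sep='\t'`.

```python
import io
import pandas as pd

# Hypothetical miniature of a sample file: two lines to skip,
# then a tab-separated table with ID_REF and VALUE columns.
sample_a = "header line 1\nheader line 2\nID_REF\tVALUE\ncg01\t0.10\ncg02\t0.20\n"
sample_b = "header line 1\nheader line 2\nID_REF\tVALUE\ncg01\t0.30\ncg02\t0.40\n"

df1 = pd.read_csv(io.StringIO(sample_a), skiprows=2, sep="\t")
df2 = pd.read_csv(io.StringIO(sample_b), skiprows=2, sep="\t")

# True when both files list the same probes in the same order
same_probes = df1["ID_REF"].equals(df2["ID_REF"])

# Both frames have a VALUE column, so suffixes keep the two apart after the merge
joined_df = pd.merge(df1, df2, on="ID_REF", how="outer", suffixes=("_a", "_b"))
```

Part 2 sidesteps the clashing `VALUE` names by renaming each column to its GSM accession before merging.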

___
___
## Part 2: Deploys what was done above inside of a loop to join all samples together
___
___

Joins all individual sample data sets from downloaded files into a single pandas DataFrame.

In [2]:
# path where the data is stored
folder_path = 'source_data/betaValue_txtFiles'

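# makes a sorted list of all the file names in folder_path
## (os.listdir returns names in arbitrary order, so sorting keeps the column order reproducible)
## used to iterate over in the for loop below
file_names = sorted(os.listdir(folder_path))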

# creates dataframe to be filled by the loop
## (if made in the loop, it will be created from scratch with each trip through the loop)
df = pd.DataFrame()

# fills the data frame (df) with the contents of the files
for i, file in enumerate(file_names):
    # creates a variable that holds the path and file name of the data being added to the dataframe
    ## the f-string syntax inserts the contents of variables at their place within
    ## the string being built inside the quotation marks
    file_str = f"{folder_path}/{file}"

    # deploys the file string in the .read_csv() method call to create temp df being added to the permanent df
    temp_df = pd.read_csv(file_str, skiprows=2, sep='\t')

    # creates new variable name to replace 'VALUE' with the GSM accession number for the sample
    var = file.split('-')[0]

    # print used to log the progress (since this cell takes a while to run)
    print(f"adding {var}")

    # renames the variable in the temp_df
    temp_df = temp_df.rename(columns={'VALUE': var})

    
    if i == 0:
        # the first file becomes the starting point for df
        df = temp_df
    else:
        # every later file is merged into the existing content of df
        df = pd.merge(df, temp_df, on="ID_REF", how="outer")

df.head()

adding GSM4409578
adding GSM4409579
adding GSM4409580
adding GSM4409581
adding GSM4409582
adding GSM4409583
adding GSM4409584
adding GSM4409585
adding GSM4409586
adding GSM4409587
adding GSM4409588
adding GSM4409589
adding GSM4409590
adding GSM4409591
adding GSM4409592
adding GSM4409593
adding GSM4409594
adding GSM4409595
adding GSM4409596
adding GSM4409597
adding GSM4409598
adding GSM4409599
adding GSM4409600
adding GSM4409601
adding GSM4409602
adding GSM4409603
adding GSM4409604
adding GSM4409605
adding GSM4409606
adding GSM4409607
adding GSM4409608
adding GSM4409609
adding GSM4409610
adding GSM4409611
adding GSM4409612
adding GSM4409613
adding GSM4409614
adding GSM4409615
adding GSM4409616
adding GSM4409617
adding GSM4409618
adding GSM4409619
adding GSM4409620
adding GSM4409621
adding GSM4409622
adding GSM4409623
adding GSM4409624
adding GSM4409625
adding GSM4409626
adding GSM4409627
adding GSM4409628
adding GSM4409629
adding GSM4409630
adding GSM4409631
adding GSM4409632
adding GSM

Unnamed: 0,ID_REF,GSM4409578,GSM4409579,GSM4409580,GSM4409581,GSM4409582,GSM4409583,GSM4409584,GSM4409585,GSM4409586,...,GSM4409644,GSM4409645,GSM4409646,GSM4409647,GSM4409648,GSM4409649,GSM4409650,GSM4409651,GSM4409652,GSM4409653
0,cg00000029,0.651094,0.650451,0.634303,0.620983,0.599298,0.566119,0.675059,0.600194,0.615174,...,0.651123,0.615825,0.627343,0.660676,0.654224,0.640789,0.61398,0.643257,0.631593,0.646609
1,cg00000108,0.960434,0.954877,0.957124,0.948438,0.950022,0.949574,0.950393,0.95026,0.947326,...,0.953565,0.958739,0.953261,0.95271,0.951573,0.95835,0.955655,0.951422,0.955816,0.950133
2,cg00000109,0.899284,0.835354,0.886725,0.872381,0.872987,0.867569,0.892893,0.867972,0.861312,...,0.868461,0.896487,0.880993,0.889053,0.865207,0.893211,0.884844,0.886243,0.885892,0.884258
3,cg00000165,0.162039,0.155513,0.145876,0.172293,0.188915,0.154112,0.151555,0.168012,0.143774,...,0.16995,0.151252,0.158671,0.188231,0.191992,0.163816,0.184146,0.153689,0.16087,0.194933
4,cg00000236,0.859468,0.84283,0.8469,0.841603,0.841441,0.83011,0.847721,0.835056,0.833633,...,0.859549,0.859449,0.846702,0.84613,0.853704,0.863567,0.834595,0.854181,0.8635,0.847537
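
Merging one file at a time re-joins an ever larger frame on each pass, which may be part of why the cell is slow. One alternative, not what this notebook does, is to index every sample by `ID_REF` and concatenate once along the columns. The sketch below uses in-memory stand-ins for the files; the file names, probe IDs, and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the files in folder_path: file name -> contents
fake_files = {
    "GSM0000001-000001.txt": "line1\nline2\nID_REF\tVALUE\ncg01\t0.10\ncg02\t0.20\n",
    "GSM0000002-000002.txt": "line1\nline2\nID_REF\tVALUE\ncg02\t0.40\ncg03\t0.50\n",
}

columns = []
for file_name in sorted(fake_files):
    gsm = file_name.split("-")[0]  # accession is the part before the first '-'
    sample = pd.read_csv(io.StringIO(fake_files[file_name]), skiprows=2, sep="\t")
    # one Series per sample, indexed by probe ID and named after the accession
    columns.append(sample.set_index("ID_REF")["VALUE"].rename(gsm))

# axis=1 aligns the Series on their index; join="outer" keeps probes
# that are missing from some samples (filled with NaN)
wide = (
    pd.concat(columns, axis=1, join="outer")
    .sort_index()
    .rename_axis("ID_REF")
    .reset_index()
)
```

The result has the same shape as the merged `df`: one `ID_REF` column followed by one column per sample.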


Writes the DataFrame containing all of the beta-value data to a .csv (comma-separated values) file so that it can be used by other notebooks.

In [3]:
# .to_csv is a pandas method called on the 'df' object
## the 'output_files' part of the string is the folder the file is written to
## 'beta_values.csv' is the name given to the exported file
## 'index=False' says not to write the row index out as an extra column (a really annoying default; leaving it out should be automatic)
df.to_csv("output_files/beta_values.csv", index=False)
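
To see what `index=False` changes, the sketch below writes a small made-up frame both ways and reads each result back. It uses in-memory buffers rather than files in `output_files`, and the frame contents are hypothetical.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"ID_REF": ["cg01", "cg02"], "GSM0000001": [0.1, 0.2]})

# default behaviour: the row index is written as an unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)

# index=False: only the real columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)

with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
```

Here `cols_with` starts with an extra `Unnamed: 0` column, while `cols_without` contains only `ID_REF` and the sample column.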

___
___
## Part 3: appending platform metadata to beta values
___
___

Reads in the beta_values and platform metadata datasets.

In [4]:
beta_df = pd.read_csv("output_files/beta_values.csv")
platformMeta_df = pd.read_csv("source_data/GPL13534-11288_platformManifest.txt", sep = '\t')


Checks that beta_df is not empty or incorrectly formatted.

In [5]:
beta_df.head()

Unnamed: 0,ID_REF,GSM4409578,GSM4409579,GSM4409580,GSM4409581,GSM4409582,GSM4409583,GSM4409584,GSM4409585,GSM4409586,...,GSM4409644,GSM4409645,GSM4409646,GSM4409647,GSM4409648,GSM4409649,GSM4409650,GSM4409651,GSM4409652,GSM4409653
0,cg00000029,0.651094,0.650451,0.634303,0.620983,0.599298,0.566119,0.675059,0.600194,0.615174,...,0.651123,0.615825,0.627343,0.660676,0.654224,0.640789,0.61398,0.643257,0.631593,0.646609
1,cg00000108,0.960434,0.954877,0.957124,0.948438,0.950022,0.949574,0.950393,0.95026,0.947326,...,0.953565,0.958739,0.953261,0.95271,0.951573,0.95835,0.955655,0.951422,0.955816,0.950133
2,cg00000109,0.899284,0.835354,0.886725,0.872381,0.872987,0.867569,0.892893,0.867972,0.861312,...,0.868461,0.896487,0.880993,0.889053,0.865207,0.893211,0.884844,0.886243,0.885892,0.884258
3,cg00000165,0.162039,0.155513,0.145876,0.172293,0.188915,0.154112,0.151555,0.168012,0.143774,...,0.16995,0.151252,0.158671,0.188231,0.191992,0.163816,0.184146,0.153689,0.16087,0.194933
4,cg00000236,0.859468,0.84283,0.8469,0.841603,0.841441,0.83011,0.847721,0.835056,0.833633,...,0.859549,0.859449,0.846702,0.84613,0.853704,0.863567,0.834595,0.854181,0.8635,0.847537


Checks that platformMeta_df is not empty or incorrectly formatted.

In [6]:
platformMeta_df.head()

Unnamed: 0,ID,Name,AddressA_ID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,Infinium_Design_Type,Next_Base,Color_Channel,Forward_Sequence,...,DMR,Enhancer,HMM_Island,Regulatory_Feature_Name,Regulatory_Feature_Group,DHS,RANGE_START,RANGE_END,RANGE_GB,SPOT_ID
0,cg00035864,cg00035864,31729416,AAAACACTAACAATCTTATCCACATAAACCCTTAAATTTATCTCAA...,,,II,,,AATCCAAAGATGATGGAGGAGTGCCCGCTCATGATGTGAAGTACCT...,...,,,,,,,8553009.0,8553132.0,NC_000024.9,
1,cg00050873,cg00050873,32735311,ACAAAAAAACAACACACAACTATAATAATTTTTAAAATAAATAAAC...,31717405.0,ACGAAAAAACAACGCACAACTATAATAATTTTTAAAATAAATAAAC...,I,A,Red,TATCTCTGTCTGGCGAGGAGGCAACGCACAACTGTGGTGGTTTTTG...,...,,,Y:9973136-9976273,,,,9363356.0,9363479.0,NC_000024.9,
2,cg00061679,cg00061679,28780415,AAAACATTAAAAAACTAATTCACTACTATTTAATTACTTTATTTTC...,,,II,,,TCAACAAATGAGAGACATTGAAGAACTAATTCACTACTATTTGGTT...,...,,,,,,,25314171.0,25314294.0,NC_000024.9,
3,cg00063477,cg00063477,16712347,TATTCTTCCACACAAAATACTAAACRTATATTTACAAAAATACTTC...,,,II,,,CTCCTGTACTTGTTCATTAAATAATGATTCCTTGGATATACCAAGT...,...,,,,,,,22741795.0,22741918.0,NC_000024.9,
4,cg00121626,cg00121626,19779393,AAAACTAATAAAAATAACTTACAAACCAAATACTATACCCTACAAC...,,,II,,,AGGTGAATGAAGAGACTAATGGGAGTGGCTTGCAAGCCAGGTACTG...,...,,,,,,,21664296.0,21664419.0,NC_000024.9,


Lists the column names so that the ones to keep for the merge can be chosen.

In [7]:
platformMeta_df.columns

Index(['ID', 'Name', 'AddressA_ID', 'AlleleA_ProbeSeq', 'AddressB_ID',
       'AlleleB_ProbeSeq', 'Infinium_Design_Type', 'Next_Base',
       'Color_Channel', 'Forward_Sequence', 'Genome_Build', 'CHR', 'MAPINFO',
       'SourceSeq', 'Chromosome_36', 'Coordinate_36', 'Strand', 'Probe_SNPs',
       'Probe_SNPs_10', 'Random_Loci', 'Methyl27_Loci', 'UCSC_RefGene_Name',
       'UCSC_RefGene_Accession', 'UCSC_RefGene_Group', 'UCSC_CpG_Islands_Name',
       'Relation_to_UCSC_CpG_Island', 'Phantom', 'DMR', 'Enhancer',
       'HMM_Island', 'Regulatory_Feature_Name', 'Regulatory_Feature_Group',
       'DHS', 'RANGE_START', 'RANGE_END', 'RANGE_GB', 'SPOT_ID'],
      dtype='object')

Makes a DataFrame of metadata for merging by selecting specific columns.

In [8]:
filteredMeta_df = platformMeta_df[
                    ['ID',
                    'Probe_SNPs',
                    'Probe_SNPs_10',
                    'UCSC_RefGene_Name',
                    'UCSC_RefGene_Group',
                    'UCSC_CpG_Islands_Name',
                    'Relation_to_UCSC_CpG_Island']
                ]

Checks the contents of filteredMeta_df.

In [11]:
filteredMeta_df.head()

Unnamed: 0,ID,Probe_SNPs,Probe_SNPs_10,UCSC_RefGene_Name,UCSC_RefGene_Group,UCSC_CpG_Islands_Name,Relation_to_UCSC_CpG_Island
0,cg00035864,,,TTTY18,TSS1500,,
1,cg00050873,,,TSPY4;FAM197Y2,Body;TSS1500,chrY:9363680-9363943,N_Shore
2,cg00061679,,,DAZ1;DAZ4;DAZ4,Body;Body;Body,,
3,cg00063477,rs9341313,rs13447379,EIF1AY,Body,chrY:22737825-22738052,S_Shelf
4,cg00121626,,,BCORL2,Body,chrY:21664481-21665063,N_Shore


Renames the first column of filteredMeta_df from 'ID' to 'ID_REF' so that it matches the join column in beta_df.

In [9]:
# assigns the renamed frame back instead of using inplace=True,
# which avoids pandas' SettingWithCopyWarning on a frame sliced from platformMeta_df
filteredMeta_df = filteredMeta_df.rename(columns={'ID': 'ID_REF'})

In [10]:
filteredMeta_df.head()

Unnamed: 0,ID_REF,Probe_SNPs,Probe_SNPs_10,UCSC_RefGene_Name,UCSC_RefGene_Group,UCSC_CpG_Islands_Name,Relation_to_UCSC_CpG_Island
0,cg00035864,,,TTTY18,TSS1500,,
1,cg00050873,,,TSPY4;FAM197Y2,Body;TSS1500,chrY:9363680-9363943,N_Shore
2,cg00061679,,,DAZ1;DAZ4;DAZ4,Body;Body;Body,,
3,cg00063477,rs9341313,rs13447379,EIF1AY,Body,chrY:22737825-22738052,S_Shelf
4,cg00121626,,,BCORL2,Body,chrY:21664481-21665063,N_Shore
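
Selecting a subset of columns and then modifying the result in place is a common trigger for pandas' SettingWithCopyWarning. A minimal sketch of one way to make the intent explicit, assuming a tiny made-up manifest (the column values and the `_demo` names are hypothetical), is to take a copy before renaming:

```python
import pandas as pd

# Hypothetical miniature of the platform manifest
manifest_demo = pd.DataFrame({
    "ID": ["cg01", "cg02"],
    "Name": ["cg01", "cg02"],
    "UCSC_RefGene_Name": ["GENE_A", None],
})

keep = ["ID", "UCSC_RefGene_Name"]

# .copy() makes filtered_demo an independent frame rather than a slice
filtered_demo = manifest_demo[keep].copy()

# rename returns a new frame; assigning it back avoids inplace=True
filtered_demo = filtered_demo.rename(columns={"ID": "ID_REF"})
```

The original `manifest_demo` keeps its `ID` column, so later cells that still need the full manifest are unaffected.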


In [11]:
# left merge keeps every probe in beta_df and adds metadata where the manifest has a matching ID_REF
meta_beta_values_df = pd.merge(beta_df, filteredMeta_df, on='ID_REF', how='left')

In [12]:
meta_beta_values_df.head()

Unnamed: 0,ID_REF,GSM4409578,GSM4409579,GSM4409580,GSM4409581,GSM4409582,GSM4409583,GSM4409584,GSM4409585,GSM4409586,...,GSM4409650,GSM4409651,GSM4409652,GSM4409653,Probe_SNPs,Probe_SNPs_10,UCSC_RefGene_Name,UCSC_RefGene_Group,UCSC_CpG_Islands_Name,Relation_to_UCSC_CpG_Island
0,cg00000029,0.651094,0.650451,0.634303,0.620983,0.599298,0.566119,0.675059,0.600194,0.615174,...,0.61398,0.643257,0.631593,0.646609,,,RBL2,TSS1500,chr16:53468284-53469209,N_Shore
1,cg00000108,0.960434,0.954877,0.957124,0.948438,0.950022,0.949574,0.950393,0.95026,0.947326,...,0.955655,0.951422,0.955816,0.950133,rs9857774,,C3orf35;C3orf35,Body;3'UTR,,
2,cg00000109,0.899284,0.835354,0.886725,0.872381,0.872987,0.867569,0.892893,0.867972,0.861312,...,0.884844,0.886243,0.885892,0.884258,rs9864492,,FNDC3B;FNDC3B,Body;Body,,
3,cg00000165,0.162039,0.155513,0.145876,0.172293,0.188915,0.154112,0.151555,0.168012,0.143774,...,0.184146,0.153689,0.16087,0.194933,,,,,chr1:91190489-91192804,S_Shore
4,cg00000236,0.859468,0.84283,0.8469,0.841603,0.841441,0.83011,0.847721,0.835056,0.833633,...,0.834595,0.854181,0.8635,0.847537,,,VDAC3;VDAC3,3'UTR;3'UTR,,
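
A left merge silently leaves metadata columns empty for probes that have no match in the manifest, and silently duplicates rows if a key repeats. The sketch below shows two optional checks that `pd.merge` supports, `validate` and `indicator`, on made-up frames; whether the real manifest has unmatched or repeated probes is not something this notebook establishes.

```python
import pandas as pd

# Hypothetical miniatures of beta_df and filteredMeta_df
beta_demo = pd.DataFrame({"ID_REF": ["cg01", "cg02", "cg03"],
                          "GSM0000001": [0.1, 0.2, 0.3]})
meta_demo = pd.DataFrame({"ID_REF": ["cg01", "cg02"],
                          "UCSC_RefGene_Name": ["GENE_A", "GENE_B"]})

merged_demo = pd.merge(
    beta_demo,
    meta_demo,
    on="ID_REF",
    how="left",
    validate="one_to_one",  # raises MergeError if ID_REF repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# probes present in the beta values but absent from the metadata
unmatched = merged_demo.loc[merged_demo["_merge"] == "left_only", "ID_REF"].tolist()

# drop the helper column before writing the result out
merged_demo = merged_demo.drop(columns="_merge")
```

In this toy case `unmatched` is `['cg03']`, the one probe without a manifest entry.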


Writes the merged data to a .csv file.

In [13]:
meta_beta_values_df.to_csv("output_files/meta_beta_values.csv", index=False)