# Extra Material: Data Wrangling

---

## We will continue to use the ERAP2 dataset from UToronto.

Genomic information files illustrate the challenge of transforming graph-structured data, such as pedigrees (family 'trees'), into a flat, data-science-friendly table.

You have the typical coding of nominal data such as sex and ancestry:
* male = 1, female = 2 by convention
* CEU: Utah residents with Northern and Western European ancestry from the CEPH collection
* YRI: Yoruba in Ibadan, Nigeria 
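
As a minimal sketch of how such codes can be handled in pandas, the snippet below maps the integer sex codes to labels and stores them as a categorical column. The tiny table and the `SEX_LABEL` column name are made up for illustration; only the 1/2 convention and the CEU/YRI labels come from the list above.

```python
import pandas as pd

# Toy table (made-up rows) using the same coding convention as the dataset
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "POP": ["CEU", "CEU", "YRI", "YRI"],
})

# Map the integer codes to readable labels; unmapped codes become NaN
sex_labels = {1: "male", 2: "female"}
toy["SEX_LABEL"] = toy["SEX"].map(sex_labels).astype("category")

# Nominal variables have no order, so counts per category are the natural summary
print(toy["SEX_LABEL"].value_counts())
print(toy["POP"].value_counts())
```

Keeping the original integer column alongside the labels makes it easy to go back to the file's own convention later.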

In addition, columns such as Family ID, Individual ID, Paternal ID, and Maternal ID capture each person's position in a pedigree, which lets a GWAS account for relatedness. There is a good explanation of these [wrangling decisions](https://github.com/LeiSunUofT/How-to-Run-a-GWAS/blob/main/stat-Sun-module2-data.pdf) if you are interested, especially if you use R.
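
To make the pedigree coding concrete, here is a small sketch that turns the parent columns into a list of parent-to-child edges, which is the graph that the flat table encodes. The column names `FID`, `IID`, `PAT`, `MAT` and the use of `"0"` for an unknown parent follow the common PLINK-style convention and are assumptions here; the rows are made up.

```python
import pandas as pd

# Made-up trio: two founders and one child (assumed PLINK-style columns)
ped = pd.DataFrame({
    "FID": ["F1", "F1", "F1"],
    "IID": ["dad", "mom", "kid"],
    "PAT": ["0", "0", "dad"],   # "0" = parent not in the data (assumption)
    "MAT": ["0", "0", "mom"],
})

# Reshape to one row per (child, parent) pair, then drop unknown parents
edges = ped.melt(id_vars=["FID", "IID"], value_vars=["PAT", "MAT"],
                 var_name="role", value_name="parent")
edges = edges[edges["parent"] != "0"]

# Each remaining row is one edge of the pedigree graph: parent -> child
for row in edges.itertuples(index=False):
    print(f"{row.parent} -> {row.IID} ({row.role})")

# Founders are individuals with no recorded parents
founders = ped.loc[(ped["PAT"] == "0") & (ped["MAT"] == "0"), "IID"].tolist()
print("founders:", founders)
```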

![](https://raw.githubusercontent.com/awnorowski/BDSiC_2025/refs/heads/main/images/HowToCodePedigreeGraph.png)


We will just do a **quick peek** at the data using histograms...

This will be a quick side tangent to remind you about how to pull out columns and how to visualize data sets. 

In [None]:
# import libraries
import numpy as np
import pandas as pd

In [None]:
# DATA Wrangling
# Data set one is from https://github.com/sugolov/GWAS-Workshop/tree/master 
Ex1url="https://www.utstat.toronto.edu/sun/data/GWAS-workshop-sample-dataset-ERAP2.txt"
mydata_ERAP2=pd.read_csv(Ex1url,sep='\t')
print(mydata_ERAP2.head(10))
#print(mydata_ERAP2.iloc[[1,2,3,103,104,105,194],1:12])
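
After loading, a quick structural check helps before any plotting. The sketch below reads a tiny made-up tab-separated sample from a string, so it runs without network access; the column names are assumed to resemble the real file and may not match it exactly.

```python
import io
import pandas as pd

# Made-up sample in the same tab-separated layout (column names assumed)
sample = (
    "IID\tSEX\tPOP\tPHENO\tSNP1-5618704\n"
    "id1\t1\tCEU\t8.1\tAA\n"
    "id2\t2\tCEU\t9.4\tAC\n"
    "id3\t2\tYRI\t7.9\tCC\n"
)
toy = pd.read_csv(io.StringIO(sample), sep="\t")

print(toy.shape)         # (rows, columns)
print(toy.dtypes)        # which columns were parsed as numbers vs text
print(toy.isna().sum())  # missing values per column
```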

In [None]:
print("~~~~~~~~~~~~~~~~~~~~~")
print(mydata_ERAP2.groupby(["SEX","POP"]).count())
print("~~~~~~~~~~~~~~~~~~~~~")
print(mydata_ERAP2.groupby(["SEX","POP"])["PHENO"].mean())
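
Note that `count()` reports non-missing values for every column, which is more output than a simple group size needs. A sketch contrasting it with `size()` and `pd.crosstab` on made-up rows (one phenotype is deliberately missing):

```python
import numpy as np
import pandas as pd

# Made-up rows; one PHENO value is missing on purpose
toy = pd.DataFrame({
    "SEX": [1, 1, 2, 2, 2],
    "POP": ["CEU", "YRI", "CEU", "CEU", "YRI"],
    "PHENO": [8.1, 7.7, np.nan, 9.0, 8.4],
})

# size(): number of rows per group, missing values included
print(toy.groupby(["SEX", "POP"]).size())

# count(): non-missing values per column, so PHENO is lower for (2, CEU)
print(toy.groupby(["SEX", "POP"]).count())

# crosstab(): the same row counts laid out as a SEX-by-POP table
table = pd.crosstab(toy["SEX"], toy["POP"])
print(table)
```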

When visualizing your data (we'll see this in module 8), you need to identify what type of variable you have.
* *Nominal*: SNP columns hold genotypic information. This needs to be recoded into dose numbers.
* *Numeric*: PHENO
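
Recoding a genotype into a dose means counting how many copies of a chosen allele it carries. A minimal sketch, assuming two-letter genotype strings and counting the `C` allele (which allele to count is a choice made here for illustration); the helper `allele_dose` is hypothetical and not part of pandas:

```python
import pandas as pd

def allele_dose(genotype, counted_allele="C"):
    """Return how many copies of counted_allele a two-letter genotype has."""
    if not isinstance(genotype, str) or len(genotype) != 2:
        return None  # missing or malformed genotype
    return genotype.count(counted_allele)

# Made-up genotypes, including one missing value
genotypes = pd.Series(["AA", "AC", "CA", "CC", None])
dose = genotypes.map(allele_dose)
print(dose.tolist())
```

Counting letters treats `AC` and `CA` identically, which is the same effect as listing both keys in a lookup dictionary.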

In [None]:
# We're going to focus on locus SNP1-5618704 and convert each genotype into the number of C alleles it carries: AA --> 0, AC/CA --> 1, CC --> 2
#mydata_converted=mydata_ERAP2["SNP1-5618704"].replace({"AA":0,"AC":1,"CA":1,"CC":2})
# we could also use the map method: 
mydata_converted= mydata_ERAP2["SNP1-5618704"].map({'AA': 0, 'AC': 1, 'CA':1, 'CC': 2})
print(mydata_converted)

In [None]:
# we're going to look at the phenotype column, since that is the column of interest
#round(mydata_ERAP2["PHENO"])
print(mydata_ERAP2["PHENO"].describe())
# We can use pandas' built-in plotting methods (they wrap matplotlib, which offers finer control)
Pheno_hist = mydata_ERAP2["PHENO"].plot.hist(bins = 40, color = 'green',xlabel='phenotype')
# What does this histogram tell us?

In [None]:
# there are two mounds -- maybe it is a difference in expression between males and females?
# Is the distribution of the phenotype between males and females the same? 
# I am treating the column names as attributes here
pheno_males = mydata_ERAP2[mydata_ERAP2.SEX==1]
pheno_males.head()
pheno_males["PHENO"].plot.hist(bins = 40, color = 'blue',xlabel='phenotype')
print(pheno_males["PHENO"].describe())
print("******************")
pheno_females=mydata_ERAP2[mydata_ERAP2.SEX==2]
pheno_females.tail()
print(pheno_females["PHENO"].describe())
pheno_females["PHENO"].plot.hist(bins = 40, color = 'pink',xlabel='phenotype')
# This doesn't really look particularly worthy of exploration

In [None]:
# maybe we can tease out the ancestries to investigate if THAT accounts for the two mounds?
# is the distribution of the phenotype between CEU and YRI the same? 
pheno_CEU = mydata_ERAP2[mydata_ERAP2.POP=="CEU"]
pheno_CEU.head()
pheno_CEU["PHENO"].plot.hist(bins = 40, color = 'green',xlabel='phenotype')
print(pheno_CEU["PHENO"].describe())
print("******************")
pheno_YRI=mydata_ERAP2[mydata_ERAP2.POP=="YRI"]
pheno_YRI.tail()
print(pheno_YRI["PHENO"].describe())
pheno_YRI["PHENO"].plot.hist(bins = 40, color = 'red',xlabel='phenotype')
# This doesn't really look particularly worthy of exploration
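
The filter-then-describe pattern used for sex and ancestry can also be written as a single `groupby`, which puts the group summaries side by side. A sketch on made-up numbers, not the real phenotype values:

```python
import pandas as pd

# Made-up rows with the same three columns used in the cells above
toy = pd.DataFrame({
    "SEX": [1, 2, 1, 2, 1, 2],
    "POP": ["CEU", "CEU", "CEU", "YRI", "YRI", "YRI"],
    "PHENO": [8.0, 8.2, 8.4, 9.6, 9.8, 10.0],
})

# One row of summary statistics (count, mean, std, quartiles) per group
by_sex = toy.groupby("SEX")["PHENO"].describe()
by_pop = toy.groupby("POP")["PHENO"].describe()
print(by_sex)
print(by_pop)
```

Comparing the `mean` and `std` columns across rows gives a quick numerical check of what the paired histograms show.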

In [None]:
# So, now we will see a visualization of the phenotype-genotype association
# we will plot out three distinct genotype-phenotypes: AA, AC/CA, and CC
CEUdata=mydata_ERAP2[mydata_ERAP2.POP=="CEU"]["SNP1-5618704"]
#print(CEUdata)
#CEUAA
CEUAA=mydata_ERAP2.loc[(mydata_ERAP2["POP"]=="CEU")&(mydata_ERAP2["SNP1-5618704"]=="AA")]
print(CEUAA)
#CEU with AC or CA
CEUAC=mydata_ERAP2.loc[(mydata_ERAP2["POP"]=="CEU")&((mydata_ERAP2["SNP1-5618704"]=="AC")|(mydata_ERAP2["SNP1-5618704"]=="CA"))]
print(CEUAC)
#CEU with CC
CEUCC=mydata_ERAP2.loc[(mydata_ERAP2["POP"]=="CEU")&((mydata_ERAP2["SNP1-5618704"]=="CC"))]
print(CEUCC)

In [None]:
print("And now for the same analysis for the Yoruba population")
# Now the same thing for the Yoruba (YRI)
YRIdata=mydata_ERAP2[mydata_ERAP2.POP=="YRI"]["SNP1-5618704"]
#print(YRIdata)
YRIAA=mydata_ERAP2.loc[(mydata_ERAP2["POP"]=="YRI")&(mydata_ERAP2["SNP1-5618704"]=="AA")]
print(YRIAA)
#YRI with AC or CA
YRIAC=mydata_ERAP2.loc[(mydata_ERAP2["POP"]=="YRI")&((mydata_ERAP2["SNP1-5618704"]=="AC")|(mydata_ERAP2["SNP1-5618704"]=="CA"))]
#print(YRIAC)
#YRI with CC
YRICC=mydata_ERAP2.loc[(mydata_ERAP2["POP"]=="YRI")&((mydata_ERAP2["SNP1-5618704"]=="CC"))]
print(YRICC)

In [None]:
# Let's graph out the categorical data for AA, AC/CA, CC in CEU
CEUAA["PHENO"].plot.hist(bins = 40, color = 'yellow',xlabel='phenotype')
CEUAC["PHENO"].plot.hist(bins = 40, color = 'orange',xlabel='phenotype')
CEUCC["PHENO"].plot.hist(bins = 40, color = 'red',xlabel='phenotype')

In [None]:
YRIAA["PHENO"].plot.hist(bins = 40, color = 'green',xlabel='phenotype')
YRIAC["PHENO"].plot.hist(bins = 40, color = 'blue',xlabel='phenotype')
YRICC["PHENO"].plot.hist(bins = 40, color = 'purple',xlabel='phenotype')

In [None]:
# Genotypes of both CEU & YRI all on same graph
CEUAA["PHENO"].plot.hist(bins = 40, color = 'yellow',xlabel='phenotype')
CEUAC["PHENO"].plot.hist(bins = 40, color = 'orange',xlabel='phenotype')
CEUCC["PHENO"].plot.hist(bins = 40, color = 'red',xlabel='phenotype')
YRIAA["PHENO"].plot.hist(bins = 40, color = 'green',xlabel='phenotype')
YRIAC["PHENO"].plot.hist(bins = 40, color = 'blue',xlabel='phenotype')
YRICC["PHENO"].plot.hist(bins = 40, color = 'purple',xlabel='phenotype')
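
When several histograms share one set of axes, each call with `bins = 40` picks its own bin edges, and later bars can hide earlier ones. Below is a sketch of one way to make the overlay easier to read, using matplotlib directly with shared bin edges, transparency, and a legend. The values are random numbers generated for illustration, not the ERAP2 phenotypes.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up phenotype values for three genotype groups
groups = {
    "AA": rng.normal(8.0, 0.3, 50),
    "AC/CA": rng.normal(9.0, 0.3, 50),
    "CC": rng.normal(10.0, 0.3, 50),
}

# Compute one set of bin edges from all values so the bars line up
all_values = np.concatenate(list(groups.values()))
edges = np.histogram_bin_edges(all_values, bins=40)

fig, ax = plt.subplots()
for label, values in groups.items():
    # alpha < 1 keeps overlapping bars visible
    ax.hist(values, bins=edges, alpha=0.5, label=label)
ax.set_xlabel("phenotype")
ax.set_ylabel("count")
ax.legend(title="genotype")
```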

In [None]:
print(" Here are the CEU Pop: AA, AC/CA, CC")
print(CEUAA["PHENO"].describe())
print(CEUAC["PHENO"].describe())
print(CEUCC["PHENO"].describe())
print(" Here are the YRI Pop: AA, AC/CA, CC")
print(YRIAA["PHENO"].describe())
print(YRIAC["PHENO"].describe())
print(YRICC["PHENO"].describe())
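
The six `describe()` calls can be condensed into one table by grouping on population and genotype dose together. A sketch on made-up rows; the `SNP` and `DOSE` column names are hypothetical, and the dose counts `C` alleles as in the recoding cell earlier:

```python
import pandas as pd

# Made-up rows: population, genotype, and phenotype
toy = pd.DataFrame({
    "POP": ["CEU", "CEU", "CEU", "CEU", "YRI", "YRI", "YRI", "YRI"],
    "SNP": ["AA", "AC", "CA", "CC", "AA", "AA", "AC", "CC"],
    "PHENO": [8.0, 9.0, 9.2, 10.0, 8.1, 7.9, 9.1, 10.2],
})

# Collapse AC and CA into the same dose before grouping
toy["DOSE"] = toy["SNP"].map({"AA": 0, "AC": 1, "CA": 1, "CC": 2})

# One row per (population, dose) with group size, mean, and spread
summary = toy.groupby(["POP", "DOSE"])["PHENO"].agg(["count", "mean", "std"])
print(summary)
```

Groups with a single row get `NaN` for `std`, which is a useful reminder to check group sizes before comparing means.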