Readme update with alternative graph example #14
Conversation
Currently this does not match the results obtained with previous component clustering work. My current suspicion is that the rewritten component-cluster function needs some attention.

Total growth does not increase monotonically with the component clustering threshold (from the example):

```r
# generate cluster sets under varying parameter settings
param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) {list(g=g, dist.thresh=x)})
cluster.sets <- multi.cluster(component.cluster, param.list)
growth_total <- sapply(1:40, function(i){sum((cluster.sets[SetID == i])$Growth)})
all(growth_total == cummax(growth_total))  ## FALSE - SHOULD BE TRUE
```
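The `all(x == cummax(x))` idiom in the snippet above is simply a non-decreasing check. A minimal Python sketch of the same expectation (the function name is mine, not from the package): as the distance threshold rises, components can only merge, so total growth should never decrease.

```python
# Sketch of the monotonicity check above: equivalent to R's
# all(x == cummax(x)), i.e. the sequence never decreases.
def is_nondecreasing(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))
```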
@ConnorChato can I ask @SandeepThokala to assist with this? i.e., by comparing the …

Yes, that would be wonderful - just got back from time away, so there is a bit of a project queue.

I've isolated the issue down to the clustering algorithm itself - I suspect that https://github.com/PoonLab/clustuneR/blob/master/R/graph.clustering.R would not produce the same cluster set as the …

@SandeepThokala can you please focus on this?

I think at the time we were developing this, there was a focus on the tree-based clustering that left some of these algorithms a bit less tested. I am certain the problem is …

@SandeepThokala we're going to have to walk through code changes starting from …

@SandeepThokala can you please record your work through the commit history in this PR.

@SandeepThokala determined that there was a major shift in the code base between commits …

Hi @ArtPoon @SandeepThokala - I have a couple of initial fixes to correct the component clustering algorithm used to determine components in the new code here. There were a few different problems in the way it was written, so I can provide some specifics from the investigation I've done in my off time.
Here is an overcommented code chunk from the function that I've determined is misbehaving:

```r
# A clustering algorithm to determine disconnected components.
# Uses a table of sequence metadata (g$seq.info) and a matrix of edges (filtered.edges).
# If filtered.edges[i, j] is TRUE then there is an edge between node i and node j
# (which will both have a row in g$seq.info).
# Assign a "Cluster" to each node. This is a unique number.
# To start, each node is in its own, unique cluster id.
g$seq.info[, "Cluster" := 1:nrow(g$seq.info)]
# Initialize a variable that tracks the previous cluster of each node.
# This is used as a halting condition.
previous.cluster <- rep(0, nrow(g$seq.info))
# (Tragically) uses a while loop - I am hoping there is a clever way to
# vectorize this, but I haven't found it yet.
# We will keep updating cluster memberships until things stop changing.
while (any(g$seq.info$Cluster != previous.cluster)) {
  # Update previous cluster info to current cluster info
  previous.cluster <- g$seq.info$Cluster
  # Reassign each node's cluster
  g$seq.info[, Cluster := sapply(1:nrow(g$seq.info), function(i) {
    # Look at the clusters of the nodes that node "i" is connected to.
    # Node i will never be connected to itself.
    x <- g$seq.info[which(filtered.edges[i, ]), Cluster]
    if (length(x) == 0) {
      return(i)  # If node i is not connected to any other nodes, it stays in its current cluster
    } else {
      return(min(x))  # If connected to other nodes, node i's cluster becomes one of its neighbours' clusters
    }
  })]
}
```

The issue: when deciding WHICH neighbour's cluster node i should take, the function picks the lowest cluster id among the neighbours. I'm not exactly sure how best to resolve this, but this logic can lead to infinite loops where neighbours trade their cluster ids back and forth.
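For reference, the oscillation described above goes away if a node's own label is included in the minimum, since labels can then only decrease and the loop must terminate. A minimal Python sketch of that corrected label-propagation scheme (a sketch of the idea only, not the package's R implementation):

```python
# Connected-component labelling by label propagation. Taking the minimum
# over the node's OWN label and its neighbours' labels prevents the
# two-node "label swap" oscillation described above.
def component_labels(n, edges):
    """n nodes numbered 0..n-1; edges is a list of (i, j) pairs."""
    neighbours = [[] for _ in range(n)]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    labels = list(range(n))  # each node starts in its own cluster
    changed = True
    while changed:
        changed = False
        for i in range(n):
            best = min([labels[i]] + [labels[j] for j in neighbours[i]])
            if best != labels[i]:
                labels[i] = best
                changed = True
    return labels
```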
Actually, in re-describing that issue, I may have found a solution that generates some expected results. Since we are using a slightly different method of tn93 calculation in this code to simplify dependencies, I would still expect some slight differences between this and the original results, but I hope this is fairly similar to what you two find (@SandeepThokala @ArtPoon). Apologies if this duplicates or undoes any of your current work. There are likely much more elegant (and faster) solutions that could replace this method anyway; I just wanted to finish up my revisit to this function.
The old function … whereas the new function (lines 54 to 62 in 353d3fa) …
Hi @SandeepThokala. The only point where it's truly necessary to find minimum retrospective edges is when looking at new cases linking to old cases. In the previous version, I calculated the minimum retrospective edge from every node because we were interested in sequentially re-defining the newest year, so any old year may eventually get called "New", and at that point we'd want its minimum retrospective edge.
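A hypothetical Python sketch of the idea described above (the tuple layout and function name are mine, purely for illustration): for each case, keep only its single closest edge back to an earlier case.

```python
# Sketch: reduce an edge list to "minimum retrospective edges" - for every
# case, retain only the smallest-distance edge linking it to an older case.
def min_retrospective_edges(edges):
    """edges: list of (id1, t1, id2, t2, distance) tuples."""
    best = {}
    for id1, t1, id2, t2, dist in edges:
        if t1 == t2:
            continue                      # same-year edge, not retrospective
        new_id = id1 if t1 > t2 else id2  # the later of the two cases
        if new_id not in best or dist < best[new_id][4]:
            best[new_id] = (id1, t1, id2, t2, dist)
    return list(best.values())
```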
I tried to run tn93 on data/na.fasta to see if we're using the same input as example_tn93.txt.
Here is my understanding of the functions. The input file (example_tn93.txt) contains a large list of comma-separated edge info (ID1, ID2, Distance):

```r
idf <- read.csv(iFile, stringsAsFactors = F)
# split each "name_year" ID into a name and a collection year
temp1 <- sapply(idf$ID1, function(x) (strsplit(x,'_')[[1]])[[1]])
temp2 <- sapply(idf$ID1, function(x) (strsplit(x,'_')[[1]])[[2]])
temp3 <- sapply(idf$ID2, function(x) (strsplit(x,'_')[[1]])[[1]])
temp4 <- sapply(idf$ID2, function(x) (strsplit(x,'_')[[1]])[[2]])
el <- data.frame(ID1=as.character(temp1), t1=as.numeric(temp2),
                 ID2=as.character(temp3), t2=as.numeric(temp4),
                 Distance = as.numeric(idf$Distance),
                 stringsAsFactors = F)
el$tMax <- pmax(el$t1, el$t2)   # later of the two collection years
el$tDiff <- abs(el$t1 - el$t2)  # age difference between the two cases
# vertex list: one row per unique ID with its collection year
vl <- unique(data.frame(ID = c(el$ID1, el$ID2), Time = c(el$t1, el$t2), stringsAsFactors=F))
# for every vertex after the first year, keep its single closest retrospective edge
subV <- subset(g$v, Time > min(Time))
minE <- bind_rows(lapply(1:nrow(subV), function(i){
  v <- subV[i,]
  incE <- subset(g$e, (ID1 %in% v$ID) | (ID2 %in% v$ID))
  retE <- subset(incE, (tMax == v$Time) & (tDiff > 0))
  retE[which(retE$Distance == min(retE$Distance))[[1]],]
}))
# edges in the newest year are replaced by their minimum retrospective edges
g$e <- subset(g$e, tMax != max(tMax))
g$e <- rbind(g$e, subset(minE, tMax == max(tMax)))
g$f <- subset(minE, tMax < max(tMax))
```
```r
dMax <- max(subG$e$Distance)  # largest distance among retained edges
tTab <- table(subG$f$tMax)    # retrospective edges per year
vTab <- table(subG$v$Time)    # vertices per sampling year
# number of edges incident to each year
eTab <- sapply(as.numeric(names(vTab)), function(t){
  nrow(subset(subG$e, (t1==t | t2==t)))
})
names(eTab) <- names(vTab)
# build the age-model input: positive retrospective edges by age difference
ageD <- bind_rows(lapply(as.numeric(names(tTab)), function(t) {
  temp <- subset(subG$f, tMax==t)
  dfs <- split(temp, temp$tDiff)
  Positive <- sapply(dfs, function(df){length(which(df$Distance <= dMax))})
  vTotal <- rep((vTab[[as.character(t)]]), length(dfs))
  tDiff <- as.numeric(names(Positive))
  oeDens <- sapply(tDiff, function(tD){
    oTime <- t - tD
    return(eTab[as.character(oTime)] / vTab[as.character(oTime)])
  })
  res <- data.frame(Positive=as.numeric(Positive), vTotal=vTab[[as.character(t)]],
                    oeDens=oeDens, tDiff)
  return(res)
}))
# binomial GLM predicting positive retrospective edges from age difference and edge density
mod <- glm(cbind(Positive, vTotal) ~ tDiff + oeDens, data=ageD, family='binomial')
subG$v$Weight <- predict(mod, type='response',
                         data.frame(tDiff=max(subG$v$Time) - subG$v$Time,
                                    oeDens=as.numeric(eTab[as.character(subG$v$Time)] / vTab[as.character(subG$v$Time)])))
subG <- simGrow(subG)
cPred <- subset(subG$v, Time < max(Time))[, c("Weight", "Cluster")]
# predicted growth per cluster under the age-weighted model (df1) and the null model (df2)
df1 <- data.frame(Growth = as.numeric(subG$g),
                  Pred = sapply(names(subG$c), function(x) { sum(subset(cPred, Cluster==as.numeric(x))$Weight) }))
df2 <- data.frame(Growth = as.numeric(subG$g),
                  Pred = as.numeric(subG$c) * (sum(as.numeric(subG$g)) / sum(as.numeric(subG$c))))
fit1 <- glm(Growth ~ Pred, data = df1, family = "poisson")
fit2 <- glm(Growth ~ Pred, data = df2, family = "poisson")
# GAIC: AIC difference between the age-weighted fit and the null fit
subG$gaic <- fit1$aic - fit2$aic
subG$ageMod <- mod
subG$ageFit <- fit1
subG$nullFit <- fit2
subG$f <- ageD
```
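For reference, the last lines above compute `subG$gaic` as `fit1$aic - fit2$aic`: the AIC of the age-weighted Poisson fit minus the AIC of the null fit, so a negative value favours the age-weighted model. A minimal Python sketch of that comparison, assuming the standard Poisson log-likelihood (function names are mine, not from the package):

```python
import math

# AIC of a Poisson model from observed counts y, fitted means mu, and
# k estimated parameters: AIC = 2k - 2*log-likelihood.
def poisson_aic(y, mu, k):
    loglik = sum(-m + yi * math.log(m) - math.lgamma(yi + 1)
                 for yi, m in zip(y, mu))
    return 2 * k - 2 * loglik

# GAIC as computed above: AIC(age-weighted fit) - AIC(null fit).
def gaic(y, mu_age, mu_null, k_age, k_null):
    return poisson_aic(y, mu_age, k_age) - poisson_aic(y, mu_null, k_null)
```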
@SandeepThokala reports that he was able to convert the example data file into the data structure used by the most recent code, but his conversion script is taking a long time to process. Can you post the script here so we can troubleshoot it?
Conversion script:

```python
import numpy as np
import pandas as pd
from tqdm import tqdm

ex = "./data/example_tn93.txt"

def get_distance(id1, id2):
    # look up the pairwise distance, trying both column orders
    distance = df.loc[(df['ID1'] == id1) & (df['ID2'] == id2), 'Distance']
    if not len(distance):
        distance = df.loc[(df['ID2'] == id1) & (df['ID1'] == id2), 'Distance']
    if not len(distance):
        distance = np.nan
    if isinstance(distance, pd.core.series.Series):
        return distance.iloc[0]
    else:
        return distance

df = pd.read_csv(ex)
unique_ids = set(df['ID1'].unique())
unique_ids = unique_ids.union(set(df['ID2'].unique()))
unique_ids = np.array(list(unique_ids))
num_ids = len(unique_ids)
matrix = np.zeros((num_ids, num_ids))
matrix = np.reshape(np.append(unique_ids, matrix), (num_ids + 1, num_ids))
count = 0
for row_index, row in tqdm(enumerate(unique_ids, 1)):
    for col_index, col in enumerate(unique_ids[count:], count):
        if col != row:
            matrix[row_index, col_index] = get_distance(row, col)
        else:
            matrix[row_index, col_index] = 0
    count += 1
matrix = matrix + matrix.T
np.savetxt('example_matrix.csv', matrix, delimiter=',')
```
This first loop of …
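The nested loop in the conversion script above calls `get_distance`, which scans the whole DataFrame for every pair of IDs, so the runtime grows roughly with (number of IDs)² × (number of edges). A vectorized sketch of the same conversion (the function name is mine; it assumes columns ID1, ID2, Distance as in example_tn93.txt, with NaN for pairs absent from the edge list):

```python
import numpy as np
import pandas as pd

# Build the symmetric distance matrix with two vectorized assignments
# instead of a per-pair DataFrame scan.
def edge_list_to_matrix(df):
    """df has columns ID1, ID2, Distance (one row per pair)."""
    ids = sorted(set(df["ID1"]).union(df["ID2"]))
    idx = {name: i for i, name in enumerate(ids)}
    n = len(ids)
    mat = np.full((n, n), np.nan)
    np.fill_diagonal(mat, 0.0)          # zero distance to self
    rows = df["ID1"].map(idx).to_numpy()
    cols = df["ID2"].map(idx).to_numpy()
    mat[rows, cols] = df["Distance"].to_numpy()
    mat[cols, rows] = df["Distance"].to_numpy()  # mirror for symmetry
    return pd.DataFrame(mat, index=ids, columns=ids)
```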
Parsing intermediate data structures for graph creation:

```r
# replacing usage of pull.headers
seq.info <- read.csv("./data/seq_info.csv")
max.year <- max(year(seq.info$coldate))
which.new <- which(year(seq.info$coldate) == max.year)
# replacing usage of ape::dist.dna for calculating genetic distance using TN93 model
edge.info <- read.csv("./data/edge_info.csv")
rownames(edge.info) <- colnames(edge.info)
g <- create.graph(seq.info, edge.info, which.new)
```

This resulted in the error:

```
Error in create.graph(seq.info, edge.info, which.new) :
  The pairwise distance matrix does not contain all recognized headers
  from alignment
```

The error comes from this input validation in create.graph:

```r
if (!all(colnames(edge.info) %in% seq.info$Header)) {
  stop("The pairwise distance matrix does not contain all recognized headers
       from alignment")
}
```
Solving the errors by making use of the pull.headers function:

```r
seq.info <- read.csv("./data/seq_info.csv")
seqs <- list()
for (header in seq.info$Header) {
  seqs[[header]] <- header
}
seq.info <- pull.headers(seqs, sep="_", var.names=c('accession', 'coldate', 'subtype'),
                         var.transformations=c(as.character, as.Date, as.factor))
max.year <- max(year(seq.info$coldate))
which.new <- which(year(seq.info$coldate) == max.year)
edge.info <- read.csv("./data/edge_info.csv")
# map each column name back to its full header via the accession prefix
colnames(edge.info) <- lapply(colnames(edge.info), function(x) {
  seq.info$Header[seq.info$accession == strsplit(x, "_")[[1]][1]]
})
rownames(edge.info) <- colnames(edge.info)
g <- create.graph(seq.info, edge.info, which.new)
```

With this script I was finally able to create the graph data structure.
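What pull.headers does in the workaround above can be summarized with a small Python sketch (the function name is mine; the field names come from the var.names argument in the R snippet): each header of the form accession_coldate_subtype is split on the separator into metadata columns.

```python
# Split sequence headers like "accession_coldate_subtype" into metadata
# fields, keeping the original header alongside the parsed columns.
def parse_headers(headers, sep="_", var_names=("accession", "coldate", "subtype")):
    rows = []
    for h in headers:
        fields = h.split(sep)
        rows.append(dict(zip(var_names, fields), Header=h))
    return rows
```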
.Rhistory
Outdated
We shouldn't have .Rhistory in the repository; please remove it with `git rm --cached .Rhistory`.
Just removed this. It looks like the .gitignore was a bit empty, so I updated that as well to avoid this in the future.
ok thanks. I am doing a bunch of work in branch iss12 and I will probably merge this branch into it when I'm done.
Great! I've seen some of that - thanks
I've refactored a lot of the code and merged my work from …
I'm going to go ahead and merge this PR. The master branch is broken anyhow, so I am going to commit my ongoing work directly to that branch.
Demonstrate a working example of graph-based cluster analysis using na.fasta example data and component.cluster function