
Readme update with alternative graph example #14

Merged · 26 commits · Nov 20, 2023

Conversation

@ConnorChato (Collaborator) commented Sep 20, 2023

Demonstrate a working example of graph-based cluster analysis using na.fasta example data and component.cluster function

@ConnorChato (Collaborator, Author)

Currently this does not match the results obtained with previous component clustering work. My current suspicion is that the re-written component-cluster function needs some attention.

@ConnorChato (Collaborator, Author)

Total growth does not increase monotonically with component clustering threshold

(from example)

# generate cluster sets under varying parameter settings
param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) {list(g=g, dist.thresh=x)})
cluster.sets <- multi.cluster(component.cluster, param.list) 

growth_total <- sapply(1:40, function(i){sum((cluster.sets[SetID == i])$Growth)}) 
all(growth_total == cummax(growth_total)) ## FALSE - SHOULD BE TRUE
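As a generic, language-neutral sketch of the property being tested (Python for illustration, not part of the package): raising a distance threshold can only add edges, so components can merge but never split, and total growth should be non-decreasing.

```python
# Monotonicity check equivalent to the all(x == cummax(x)) idiom above:
# every value must be at least the running maximum of the values before it.

def is_monotone_nondecreasing(values):
    running_max = float("-inf")
    for v in values:
        if v < running_max:
            return False
        running_max = v
    return True

assert is_monotone_nondecreasing([3, 5, 5, 8])       # OK: never dips
assert not is_monotone_nondecreasing([3, 5, 4, 8])   # the reported failure shape
```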

@ArtPoon (Collaborator) commented Sep 26, 2023

@ConnorChato can I ask @SandeepThokala to assist with this? i.e., by comparing the MountainPlot.R script from version v1.0 to the current component clustering script?

@ConnorChato (Collaborator, Author)

Yes that would be wonderful - just got back from time away so there is a bit of a project queue.

@ConnorChato (Collaborator, Author)

I've isolated the issue down to the clustering algorithm itself - I suspect that https://github.com/PoonLab/clustuneR/blob/master/R/graph.clustering.R would not produce the same cluster set as the compClu() function from https://github.com/PoonLab/tn/blob/master/comp_Lib.R or the MountainPlot.R code.

@ArtPoon (Collaborator) commented Sep 26, 2023

> I've isolated the issue down to the clustering algorithm itself - I suspect that https://github.com/PoonLab/clustuneR/blob/master/R/graph.clustering.R would not produce the same cluster set as the compClu() function from https://github.com/PoonLab/tn/blob/master/comp_Lib.R or the MountainPlot.R code.

@SandeepThokala can you please focus on this?

@ConnorChato (Collaborator, Author)

I think at the time we were developing this, there was a focus on the tree-based clustering that left some of these algorithms a bit less tested. I am certain the problem is compClu() though, as it doesn't make sense for total cluster growth to be decreasing with an increasing threshold.

@ArtPoon (Collaborator) commented Oct 3, 2023

@SandeepThokala we're going to have to walk through code changes starting from MountainPlot.R to graph.clustering.R (and associated scripts) to see where outputs change for the same test data.

@ArtPoon (Collaborator) commented Oct 10, 2023

@SandeepThokala can you please record your work through the commit history in this PR

@ArtPoon (Collaborator) commented Oct 10, 2023

@SandeepThokala determined that there was a major shift in the code base between commits fde8762 and 890c92e. It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.

@ConnorChato (Collaborator, Author) commented Oct 10, 2023

Hi @ArtPoon @SandeepThokala - I have made a couple of initial fixes to correct the component clustering algorithm used to determine components in the new code. There were a few different problems in how it was written, so I can provide some specifics from the investigation I've done in my off time.

@ConnorChato (Collaborator, Author)

Here is an over-commented code chunk from the function that I've determined is misbehaving (component.cluster() from R/graph.clustering.R). Hopefully it can be helpful in some way.

  # A clustering algorithm to determine disconnected components 
  # Uses a table of sequence metadata (g$seq.info) and a matrix of edges (filtered.edges).
  # If filtered.edges[i, j] is TRUE then there is an edge between node i and node j (which will both have a row in g$seq.info).

  # Assign a "Cluster" to each node. This is a unique number.
  # To start, each node is in its own, unique cluster id
  g$seq.info[, "Cluster" := 1:nrow(g$seq.info)]

  # Initialize a variable that tracks the previous cluster of each node. This is used as a halting condition.
  previous.cluster <- rep(0, nrow(g$seq.info))

  # (Tragically) Uses a while loop - I am hoping there is a clever way to vectorize this, but I haven't found it yet.
  # We will keep updating cluster memberships until things stop changing
  while (any(g$seq.info$Cluster != previous.cluster)) {
    
    # Update previous cluster info to current cluster info
    previous.cluster <- g$seq.info$Cluster

    # Reassign each node's cluster
    g$seq.info[, Cluster := sapply(1:nrow(g$seq.info), function(i) {

     # Look at the clusters of the nodes that node "i" is connected to
     # Node i will never be connected to itself.
      x <- g$seq.info[which(filtered.edges[i, ]), Cluster]
      if (length(x) == 0) {
        return(i) # If node i is not connected to any other nodes, it stays in its current cluster
      } else {
        return(min(x)) # If connected to other nodes, node i's cluster becomes one of its neighbours' clusters
      }
    })]
  }

The issue

When deciding WHICH neighbour's cluster node i should adopt, the code picks the lowest cluster id among the neighbours. I'm not exactly sure how best to resolve this, but the logic can lead to infinite loops in which neighbours trade their cluster ids back and forth.
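A minimal illustration of the oscillation (hypothetical Python, not the package's code): with simultaneous updates, taking the minimum label over the *open* neighbourhood lets two connected nodes swap labels forever, while including each node's own label (closed neighbourhood) makes labels non-increasing, so the loop must terminate.

```python
# Hypothetical sketch of the bug described above and one possible fix.

def propagate(adj, include_self, max_iter=100):
    """adj[i] lists the neighbours of node i; returns (labels, converged)."""
    labels = list(range(len(adj)))
    for _ in range(max_iter):
        new = []
        for i, neighbours in enumerate(adj):
            candidates = [labels[j] for j in neighbours]
            if include_self or not candidates:
                candidates.append(labels[i])  # isolated nodes keep their label
            new.append(min(candidates))
        if new == labels:
            return labels, True    # reached a fixed point
        labels = new
    return labels, False           # hit the iteration cap: never converged

# Two mutually connected nodes with labels 0 and 1:
assert propagate([[1], [0]], include_self=False) == ([0, 1], False)  # oscillates
assert propagate([[1], [0]], include_self=True) == ([0, 0], True)    # converges
```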

@ConnorChato (Collaborator, Author)

Actually in re-describing that issue, I may have found a solution that generates some expected results.

[figure: clustering results from the Northern Alberta data]

Since we are using a slightly different method of tn93 calculation in this code to simplify dependencies, I would still expect some slight differences between this and the original results, but I hope it is fairly similar to what you two find (@SandeepThokala @ArtPoon). Apologies if this duplicates or undoes any of your current work. There are likely more elegant (and faster) solutions to this same issue that could replace this method anyway; I just wanted to finish up my revisit to this function.

@ConnorChato ConnorChato marked this pull request as ready for review October 15, 2023 17:33
@SandeepThokala (Collaborator)

The old function impTN93 creates a list of minimum retrospective edges minE by following the steps below:

  • Gets a list of "not old" vertices.
  • Gets the closest retrospective edge, for every "not old" vertex v.
  • Binds the rows of all closest retrospective edges of all "not old" vertices.
    # The subset of vertices excluding those at the oldest year
    subV <- subset(g$v, Time > min(Time))
    # Obtain the closest retrospective edge of every vertex beyond the oldest year
    minE <- bind_rows(lapply(1:nrow(subV), function(i) {
      v <- subV[i, ]
      incE <- subset(g$e, (ID1 %in% v$ID) | (ID2 %in% v$ID))
      retE <- subset(incE, (tMax == v$Time) & (tDiff > 0))
      retE[which(retE$Distance == min(retE$Distance))[[1]], ]
    }))

Whereas the new function create.graph calls minimum.retrospective.edge, which only gets the closest retrospective edges of new vertices (not of all "not old" vertices).

minimum.retrospective.edge <- function(g) {
  # Find the minimum retrospective edge of each sequence
  new.seqs <- g$seq.info[(New), Header]
  old.seqs <- g$seq.info[!(New), Header]
  retro.edges <- g$edge.info[new.seqs, old.seqs]
  min.retro.edges <- sapply(seq_len(nrow(retro.edges)), function(i) {
    names(which.min(retro.edges[i, ]))
  })

@ConnorChato (Collaborator, Author)

Hi @SandeepThokala. The only point where it's truly necessary to find minimum retrospective edges is when looking at new cases linking to old cases. In the previous version, I calculated the minimum retrospective edge for every node because we were interested in sequentially re-defining the newest year, so any old year might eventually be called "New", and at that point we'd want its minimum retrospective edge.
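The idea can be sketched on a toy edge list (hypothetical field names and values, not the package's data structures): a minimum retrospective edge keeps, for each new node, its single shortest edge back to a node from an earlier year.

```python
# Toy sketch of minimum retrospective edges; names and data are illustrative.

def min_retrospective_edges(edges, node_year, new_nodes):
    """edges: iterable of (a, b, distance); returns {new_node: (old_node, distance)}."""
    best = {}
    for a, b, dist in edges:
        for new, old in ((a, b), (b, a)):
            if new in new_nodes and node_year[old] < node_year[new]:
                if new not in best or dist < best[new][1]:
                    best[new] = (old, dist)
    return best

node_year = {"A": 2019, "B": 2020, "C": 2021, "D": 2021}
edges = [("A", "C", 0.01), ("B", "C", 0.004), ("C", "D", 0.001), ("A", "D", 0.02)]
# D's same-year edge to C is excluded; its only retrospective edge goes to A.
assert min_retrospective_edges(edges, node_year, {"C", "D"}) == {
    "C": ("B", 0.004), "D": ("A", 0.02)}
```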

@SandeepThokala (Collaborator)

> It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.

I tried to run tn93 on data/na.fasta to see if we're using the same input as example_tn93.txt:
tn93 -o data/test_tn93.txt data/na.fasta
But the result was only 1308 edges long, whereas example_tn93.txt has 1306536 edges.

@ConnorChato (Collaborator, Author)

> It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.

> I tried to run tn93 on data/na.fasta to see if we're using the same input as example_tn93.txt: tn93 -o data/test_tn93.txt data/na.fasta But the result was only 1308 edges long, whereas example_tn93.txt has 1306536 edges.

tn93 has a threshold option -t, which I believe defaults to 0.015. Typically when we run tn93 we use an ultra-permissive threshold to keep the maximum number of edges (something like tn93 -t 1 -o data/test_tn93.txt data/na.fasta).
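A quick sanity check supports this (my own arithmetic, not from the thread): 1306536 is exactly n(n-1)/2 for n = 1617, consistent with example_tn93.txt being a complete pairwise distance list produced under a permissive threshold.

```python
from math import comb

# 1306536 edges is the complete set of unordered pairs over 1617 sequences.
# The 1617 figure is inferred from the pair count, not stated in the thread.
assert comb(1617, 2) == 1306536
```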

@SandeepThokala (Collaborator)

Uploading my understanding of the functions.

The input file (example_tn93.txt) contains a huge list of comma separated edge info (ID1, ID2, distance).

impTN93

  • Reads the input file and splits both IDs of every edge at the "_" character, extracting ID and time.
	idf <- read.csv(iFile, stringsAsFactors = F)
	temp1 <- sapply(idf$ID1, function(x) (strsplit(x,'_')[[1]])[[1]])
	temp2 <- sapply(idf$ID1, function(x) (strsplit(x,'_')[[1]])[[2]])
	temp3 <- sapply(idf$ID2, function(x) (strsplit(x,'_')[[1]])[[1]])
	temp4 <- sapply(idf$ID2, function(x) (strsplit(x,'_')[[1]])[[2]])
  • Creates a list g$e and obtains the maximum time and time difference for each edge, resulting in a data frame with 7 columns: ID1, t1, ID2, t2, Distance, tMax, tDiff.
  • Creates a list of unique vertices g$v based on ID and time columns of edge list resulting in a data frame with 2 columns ID, Time.
	el <- data.frame(ID1=as.character(temp1), t1=as.numeric(temp2), 
                     ID2=as.character(temp3), t2=as.numeric(temp4), 
                     Distance = as.numeric(idf$Distance),
                     stringsAsFactors= F)
	el$tMax <- pmax(el$t1,el$t2)
	el$tDiff <- abs(el$t1-el$t2)
	vl <- unique(data.frame(ID = c(el$ID1, el$ID2), Time = c(el$t1, el$t2), stringsAsFactors=F))
  • Creates a list of minimum retrospective edges minE
    -- Gets a list of "not old" vertices.
    -- Gets the closest retrospective edge, for every "not old" vertex v.
    -- Binds the rows of all closest retrospective edges of all "not old" vertices.
	subV <- subset(g$v, Time>min(Time))
	minE <- bind_rows(lapply(1:nrow(subV), function(i){
	    v <- subV[i,]
	    incE <- subset(g$e, (ID1%in%v$ID)|(ID2%in%v$ID))
	    retE <- subset(incE, (tMax==v$Time)&(tDiff>0))
	    retE[which(retE$Distance==min(retE$Distance))[[1]],]
	}))
  • Removes all "new" (newest-year) edges from g$e and adds back only the new minimum retrospective edges
	g$e <- subset(g$e, tMax!=max(tMax))
	g$e <- rbind(g$e, subset(minE, tMax==max(tMax)))
  • Creates a list of "not new" minimum retrospective edges g$f
	g$f <- subset(minE, tMax < max(tMax))
  • Returns g
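The parsing steps above can be paraphrased in Python (pandas, not the original R) to show the shape of the intermediate edge list:

```python
import pandas as pd
from io import StringIO

# Paraphrase of the impTN93 parsing steps: split "ID_year" labels into ID and
# time, then derive tMax and tDiff per edge. Data here is made up.
raw = StringIO("ID1,ID2,Distance\nA_2019,B_2021,0.01\nB_2021,C_2020,0.002\n")
idf = pd.read_csv(raw)

el = pd.DataFrame({
    "ID1": idf["ID1"].str.split("_").str[0],
    "t1": idf["ID1"].str.split("_").str[1].astype(int),
    "ID2": idf["ID2"].str.split("_").str[0],
    "t2": idf["ID2"].str.split("_").str[1].astype(int),
    "Distance": idf["Distance"],
})
el["tMax"] = el[["t1", "t2"]].max(axis=1)   # latest year on the edge
el["tDiff"] = (el["t1"] - el["t2"]).abs()   # age gap between the endpoints

assert list(el["tMax"]) == [2021, 2021] and list(el["tDiff"]) == [2, 1]
```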

@SandeepThokala (Collaborator)

compAnalyze

  • Takes in a subset subG of graph g, with maximum distance threshold dMax.
	dMax <- max(subG$e$Distance)
  • Gets a list of max years of retrospective edges tTab.
	tTab <- table(subG$f$tMax)
  • Gets a list of years of all unique vertices vTab.
	vTab <- table(subG$v$Time)
  • For each year in vTab, gets a list of number of all edges eTab.
	eTab <- sapply(as.numeric(names(vTab)), function(t){
		nrow(subset(subG$e, (t1==t|t2==t)))
	})
	names(eTab) <- names(vTab)
  • Creates a list of densities ageD, for each year in tTab:
    -- Gets a subset of retrospective edges of that year.
    -- Groups the subset of edges by the differences in years of vertices.
    -- For each group, counts the edges with distance at most dMax (Positive).
    -- For each group's year difference, computes the old year's ratio of edge count (eTab) to vertex count (vTab), giving oeDens.
    -- Creates a data frame with columns Positive, vTotal, oeDens, tDiff.
    -- Binds rows of all data frames.
	ageD <- bind_rows(lapply(as.numeric(names(tTab)), function(t) {
		temp <- subset(subG$f, tMax==t)
		dfs <- split(temp, temp$tDiff)
    
		Positive <- sapply(dfs, function(df){length(which(df$Distance<=dMax))})
		vTotal <- rep((vTab[[as.character(t)]]),length(dfs))
		tDiff <- as.numeric(names(Positive))
		oeDens <- sapply(tDiff, function(tD){
			oTime <- t-tD
			return(eTab[as.character(oTime)]/vTab[as.character(oTime)])
		})
    
    
		res <- data.frame(Positive=as.numeric(Positive), vTotal=vTab[[as.character(t)]], oeDens=oeDens, tDiff)
		return(res)
	}))
  • Fits a model of how frequently cases connect to new cases, as predicted by case age and edge density, and uses it to weight cases by age.
	mod <- glm(cbind(Positive, vTotal) ~ tDiff+oeDens, data=ageD, family='binomial')
	subG$v$Weight <- predict(mod, type='response',
                             data.frame(tDiff=max(subG$v$Time)-subG$v$Time,
                             oeDens=as.numeric(eTab[as.character(subG$v$Time)]/vTab[as.character(subG$v$Time)])))
  • Creates clusters for this subgraph and measures growth.
	subG <- simGrow(subG)
	cPred <- subset(subG$v, Time<max(Time))[,c("Weight", "Cluster")]
  • Creates two data frames for two predictive models: a date-informed model (df1, using the age weights) and a null model based on absolute cluster size (df2).
	df1 <- data.frame(Growth = as.numeric(subG$g), Pred = sapply(names(subG$c), function(x) { sum(subset(cPred, Cluster==as.numeric(x))$Weight) }))
	df2 <- data.frame(Growth = as.numeric(subG$g), Pred = as.numeric(subG$c) * (sum(as.numeric(subG$g))/sum(as.numeric(subG$c))))
	fit1 <- glm(Growth ~ Pred, data = df1, family = "poisson")
	fit2 <- glm(Growth ~ Pred, data = df2, family = "poisson")
  • Saves gaic, the models, and the age data as part of the output.
	subG$gaic <- fit1$aic-fit2$aic
	subG$ageMod <- mod
	subG$ageFit <- fit1
	subG$nullFit <- fit2
	subG$f <- ageD
  • Returns subG.
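The gaic value at the end is just a difference of AICs from the two Poisson fits (AIC = 2k - 2*logL); a negative value favours the date-informed model. A toy sketch with made-up log-likelihoods:

```python
# AIC = 2k - 2*logL; gaic = AIC(date-informed fit) - AIC(null fit).
# Both fits above have the same number of parameters (intercept + slope),
# so the sign of gaic is decided purely by the log-likelihoods.

def aic(k, log_likelihood):
    return 2 * k - 2 * log_likelihood

gaic = aic(2, -120.4) - aic(2, -125.0)   # made-up log-likelihood values
assert round(gaic, 6) == -9.2            # negative: date-informed model preferred
```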

@ArtPoon (Collaborator) commented Oct 31, 2023

@SandeepThokala reports that he was able to convert the example data file into the data structure used by the most recent code, but the conversion script is taking a long time to run. Can you post the script here so we can troubleshoot it?

@SandeepThokala (Collaborator) commented Oct 31, 2023

Conversion script

import numpy as np
import pandas as pd
from tqdm import tqdm

ex = "./data/example_tn93.txt"

def get_distance(id1, id2):
  distance = df.loc[(df['ID1'] == id1) & (df['ID2'] == id2), 'Distance']
  if not(len(distance)):
    distance = df.loc[(df['ID2'] == id1) & (df['ID1'] == id2), 'Distance']
    if not(len(distance)):
      distance = np.nan

  if isinstance(distance, pd.core.series.Series):
    return distance.iloc[0]
  else:
    return distance

df = pd.read_csv(ex)

unique_ids = set(df['ID1'].unique())
unique_ids = unique_ids.union(set(df['ID2'].unique()))
unique_ids = np.array(list(unique_ids))

num_ids = len(unique_ids)
matrix = np.zeros((num_ids, num_ids))
matrix = np.reshape(np.append(unique_ids, matrix), (num_ids + 1, num_ids))

count = 0
for row_index, row in tqdm(enumerate(unique_ids, 1)):
    for col_index, col in enumerate(unique_ids[count:], count):
        if col != row:
            matrix[row_index, col_index] = get_distance(row, col)
        else:
            matrix[row_index, col_index] = 0

    count += 1

matrix = matrix + matrix.T

np.savetxt('example_matrix.csv', matrix, delimiter=',')
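One likely cause of the slowness: get_distance filters the entire DataFrame once per pair, which is quadratic in the number of IDs with a full scan each time. A hedged sketch of a vectorized alternative (column names assumed to match the script above): symmetrize the edge list once, then pivot it into a square matrix.

```python
import pandas as pd

# Sketch: build the full distance matrix with one pivot instead of per-pair
# DataFrame scans. The ID1/ID2/Distance columns mirror the script above;
# the data here is made up.
df = pd.DataFrame({
    "ID1": ["a", "a", "b"],
    "ID2": ["b", "c", "c"],
    "Distance": [0.1, 0.2, 0.3],
})

# Add each edge in both directions, then reshape into a square matrix.
both = pd.concat([df, df.rename(columns={"ID1": "ID2", "ID2": "ID1"})])
matrix = both.pivot(index="ID1", columns="ID2", values="Distance")
ids = sorted(set(df["ID1"]) | set(df["ID2"]))
matrix = matrix.reindex(index=ids, columns=ids).fillna(0.0)  # diagonal -> 0

assert matrix.loc["b", "a"] == 0.1 and matrix.loc["a", "a"] == 0.0
```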

@ConnorChato (Collaborator, Author)

> @SandeepThokala reports that he was able to convert the example data file into the data structure used by the most recent code, but the conversion script is taking a long time to run. Can you post the script here so we can troubleshoot it?

The first loop of component.cluster in R/graph.clustering.R is currently a while loop. I had hoped that it would generally be limited in the number of steps it takes, but yes, it does seem to be really slow. Finding a way to vectorize this stage would likely be a big step forward.

@SandeepThokala (Collaborator)

Parsing intermediate data structures for graph creation

# replacing usage of pull.headers
seq.info <- read.csv("./data/seq_info.csv")
max.year <- max(year(seq.info$coldate))
which.new <- which(year(seq.info$coldate) == max.year)
# replacing usage of ape::dist.dna for calculating genetic distance using TN93 model
edge.info <- read.csv("./data/edge_info.csv")
rownames(edge.info) <- colnames(edge.info)
g <- create.graph(seq.info, edge.info, which.new)

Resulted in error

Error in create.graph(seq.info, edge.info, which.new) : 
  The pairwise distance matrix does not contain all recognized headers 
         from alignment

This input validation in the graph.setup.R script caused the error:

  if (!all(colnames(edge.info) %in% seq.info$Header)) {
    stop("The pairwise distance matrix does not contain all recognized headers 
         from alignment")
  }

@SandeepThokala (Collaborator)

Solving the error by making use of the pull.headers function:

seq.info <- read.csv("./data/seq_info.csv")

seqs <- list()
for (header in seq.info$Header) {
  seqs[[header]] <- header
}

seq.info <- pull.headers(seqs, sep="_", var.names=c('accession', 'coldate', 'subtype'),
                         var.transformations=c(as.character, as.Date, as.factor))
max.year <- max(year(seq.info$coldate))
which.new <- which(year(seq.info$coldate) == max.year)

edge.info <- read.csv("./data/edge_info.csv")
colnames(edge.info) <- lapply(colnames(edge.info), function(x) {
     seq.info$Header[seq.info$accession == strsplit(x, "_")[[1]][1]]
})
rownames(edge.info) <- colnames(edge.info)
g <- create.graph(seq.info, edge.info, which.new)

Finally able to create the graph data structure using the script above.
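The renaming step above boils down to an accession-to-header lookup; a small Python sketch of the same idea (header values are illustrative, not from the data):

```python
# Map accession-only column names back to full "accession_coldate_subtype"
# headers, as the colnames(edge.info) fix above does in R. Values are made up.
headers = ["KU190031_2015-07-01_B", "KU190032_2016-03-11_B"]
by_accession = {h.split("_")[0]: h for h in headers}

cols = ["KU190031", "KU190032"]
renamed = [by_accession[c] for c in cols]
assert renamed == headers
```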

.Rhistory Outdated
@ArtPoon (Collaborator) commented Nov 7, 2023

we shouldn't have .Rhistory in the repository, please remove with git rm --cached .Rhistory

@ConnorChato (Collaborator, Author)

Just removed this; it looks like the .gitignore was a bit empty, so I updated it as well to avoid this in the future.

@ArtPoon (Collaborator)

ok thanks. I am doing a bunch of work in branch iss12 and I will probably merge this branch into it when I'm done

@ConnorChato (Collaborator, Author)

Great! I've seen some of that - thanks

@ArtPoon (Collaborator) commented Nov 19, 2023

I've refactored a lot of the code and merged my work from iss12 branch into this one, but I don't think we're out of the woods yet. I am still having problems reproducing the graph clustering results with data files other than the mysterious example_tn93.txt one.

@ArtPoon (Collaborator) commented Nov 20, 2023

I'm going to go ahead and merge this PR. master branch is broken anyhow, so I am going to commit my ongoing work directly to that branch

@ArtPoon ArtPoon merged commit 23509f4 into master Nov 20, 2023
@ArtPoon ArtPoon deleted the component_example branch December 2, 2023 16:37