
Readme update with alternative graph example #14

Merged · 26 commits · Nov 20, 2023

Conversation

@ConnorChato (Collaborator) commented Sep 20, 2023

Demonstrate a working example of graph-based cluster analysis using na.fasta example data and component.cluster function

@ConnorChato (Collaborator, Author)

Currently this does not match the results obtained with previous component clustering work. My current suspicion is that the re-written component-cluster function needs some attention.

@ConnorChato (Collaborator, Author)

Total growth does not increase monotonically with component clustering threshold

(from example)

# generate cluster sets under varying parameter settings
param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) {list(g=g, dist.thresh=x)})
cluster.sets <- multi.cluster(component.cluster, param.list) 

growth_total <- sapply(1:40, function(i){sum((cluster.sets[SetID == i])$Growth)}) 
all(growth_total == cummax(growth_total)) ## FALSE - SHOULD BE TRUE
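As a generic, language-neutral sketch of the property being tested (Python for illustration, not part of the package): raising a distance threshold can only add edges, so components can merge but never split, and total growth should be non-decreasing.

```python
# Monotonicity check equivalent to the all(x == cummax(x)) idiom above:
# every value must be at least the running maximum of the values before it.

def is_monotone_nondecreasing(values):
    running_max = float("-inf")
    for v in values:
        if v < running_max:
            return False
        running_max = v
    return True

assert is_monotone_nondecreasing([3, 5, 5, 8])       # OK: never dips
assert not is_monotone_nondecreasing([3, 5, 4, 8])   # the reported failure shape
```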

@ArtPoon (Collaborator) commented Sep 26, 2023

@ConnorChato can I ask @SandeepThokala to assist with this? i.e., by comparing the MountainPlot.R script from version v1.0 to the current component clustering script?

@ConnorChato (Collaborator, Author)

Yes that would be wonderful - just got back from time away so there is a bit of a project queue.

@ConnorChato (Collaborator, Author)

I've isolated the issue down to the clustering algorithm itself - I suspect that https://github.com/PoonLab/clustuneR/blob/master/R/graph.clustering.R would not produce the same cluster set as the compClu() function from https://github.com/PoonLab/tn/blob/master/comp_Lib.R or the MountainPlot.R code.

@ArtPoon (Collaborator) commented Sep 26, 2023

> I've isolated the issue down to the clustering algorithm itself - I suspect that https://github.com/PoonLab/clustuneR/blob/master/R/graph.clustering.R would not produce the same cluster set as the compClu() function from https://github.com/PoonLab/tn/blob/master/comp_Lib.R or the MountainPlot.R code.

@SandeepThokala can you please focus on this?

@ConnorChato (Collaborator, Author)

I think at the time we were developing this, there was a focus on the tree-based clustering that left some of these algorithms a bit less tested. I am certain the problem is compClu() though, as it doesn't make sense for total cluster growth to be decreasing with an increasing threshold.

@ArtPoon (Collaborator) commented Oct 3, 2023

@SandeepThokala we're going to have to walk through code changes starting from MountainPlot.R to graph.clustering.R (and associated scripts) to see where outputs change for the same test data.

@ArtPoon (Collaborator) commented Oct 10, 2023

@SandeepThokala can you please record your work through the commit history in this PR

@ArtPoon (Collaborator) commented Oct 10, 2023

@SandeepThokala determined that there was a major shift in the code base between commits fde8762 and 890c92e. It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.

@ConnorChato (Collaborator, Author) commented Oct 10, 2023

Hi @ArtPoon @SandeepThokala - I have made a couple of initial fixes to correct the component clustering algorithm used to determine components in the new code. There were a few different problems in how it was written, so I can provide some specifics from the investigation I've done in my off time.

@ConnorChato (Collaborator, Author)

Here is an over-commented code chunk from the function that I've determined is misbehaving (component.cluster() from R/graph.clustering.R). Hopefully it can be helpful in some way.

  # A clustering algorithm to determine disconnected components 
  # Uses a table of sequence metadata (g$seq.info) and a matrix of edges (filtered.edges).
  # If filtered.edges[i, j] is TRUE then there is an edge between node i and node j (which will both have a row in g$seq.info).

  # Assign a "Cluster" to each node. This is a unique number.
  # To start, each node is in its own, unique cluster id
  g$seq.info[, "Cluster" := 1:nrow(g$seq.info)]

  # Initialize a variable that tracks the previous cluster of each node. This is used as a halting condition.
  previous.cluster <- rep(0, nrow(g$seq.info))

  # (Tragically) Uses a while loop - I am hoping there is a clever way to vectorize this, but I haven't found it yet.
  # We will keep updating cluster memberships until things stop changing
  while (any(g$seq.info$Cluster != previous.cluster)) {
    
    # Update previous cluster info to current cluster info
    previous.cluster <- g$seq.info$Cluster

    # Reassign each node's cluster
    g$seq.info[, Cluster := sapply(1:nrow(g$seq.info), function(i) {

     # Look at the clusters of the nodes that node "i" is connected to
     # Node i will never be connected to itself.
      x <- g$seq.info[which(filtered.edges[i, ]), Cluster]
      if (length(x) == 0) {
        return(i) # If node i is not connected to any other nodes, it stays in its current cluster
      } else {
        return(min(x)) # If connected to other nodes, node i's cluster becomes one of its neighbours' clusters
      }
    })]
  }

The issue

When deciding WHICH neighbour's cluster node i should adopt, the code picks the lowest cluster id among the neighbours. I'm not exactly sure how best to resolve this, but the logic can lead to infinite loops in which neighbours trade their cluster ids back and forth.
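A minimal illustration of the oscillation (hypothetical Python, not the package's code): with simultaneous updates, taking the minimum label over the *open* neighbourhood lets two connected nodes swap labels forever, while including each node's own label (closed neighbourhood) makes labels non-increasing, so the loop must terminate.

```python
# Hypothetical sketch of the bug described above and one possible fix.

def propagate(adj, include_self, max_iter=100):
    """adj[i] lists the neighbours of node i; returns (labels, converged)."""
    labels = list(range(len(adj)))
    for _ in range(max_iter):
        new = []
        for i, neighbours in enumerate(adj):
            candidates = [labels[j] for j in neighbours]
            if include_self or not candidates:
                candidates.append(labels[i])  # isolated nodes keep their label
            new.append(min(candidates))
        if new == labels:
            return labels, True    # reached a fixed point
        labels = new
    return labels, False           # hit the iteration cap: never converged

# Two mutually connected nodes with labels 0 and 1:
assert propagate([[1], [0]], include_self=False) == ([0, 1], False)  # oscillates
assert propagate([[1], [0]], include_self=True) == ([0, 0], True)    # converges
```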

@ConnorChato (Collaborator, Author)

Actually in re-describing that issue, I may have found a solution that generates some expected results.

[figure: clustering results from the Northern Alberta data]

Since we are using a slightly different method of tn93 calculation in this code to simplify dependencies, I would still expect some slight differences between this and the original results, but I hope it is fairly similar to what you two find (@SandeepThokala @ArtPoon). Apologies if this duplicates or undoes any of your current work. There are likely more elegant (and faster) solutions to this same issue that could replace this method anyway; I just wanted to finish up my revisit to this function.

@ConnorChato ConnorChato marked this pull request as ready for review October 15, 2023 17:33
@SandeepThokala (Collaborator)

The old function impTN93 creates a list of minimum retrospective edges minE by following the steps below:

  • Gets a list of "not old" vertices.
  • Gets the closest retrospective edge, for every "not old" vertex v.
  • Binds the rows of all closest retrospective edges of all "not old" vertices.
    # The subset of vertices excluding those at the oldest year
    subV <- subset(g$v, Time > min(Time))
    # Obtain the closest retrospective edge of every vertex beyond the oldest year
    minE <- bind_rows(lapply(1:nrow(subV), function(i) {
      v <- subV[i, ]
      incE <- subset(g$e, (ID1 %in% v$ID) | (ID2 %in% v$ID))
      retE <- subset(incE, (tMax == v$Time) & (tDiff > 0))
      retE[which(retE$Distance == min(retE$Distance))[[1]], ]
    }))

Whereas the new function create.graph calls minimum.retrospective.edge, which only gets the closest retrospective edges of new vertices (not of all "not old" vertices).

minimum.retrospective.edge <- function(g) {
  # Find the minimum retrospective edge of each sequence
  new.seqs <- g$seq.info[(New), Header]
  old.seqs <- g$seq.info[!(New), Header]
  retro.edges <- g$edge.info[new.seqs, old.seqs]
  min.retro.edges <- sapply(seq_len(nrow(retro.edges)), function(i) {
    names(which.min(retro.edges[i, ]))
  })

@ConnorChato (Collaborator, Author)

Hi @SandeepThokala. The only point where it's truly necessary to find minimum retrospective edges is when looking at new cases linking to old cases. In the previous version, I calculated the minimum retrospective edge for every node because we were interested in sequentially re-defining the newest year, so any old year might eventually be called "New", and at that point we'd want its minimum retrospective edge.
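The idea can be sketched on a toy edge list (hypothetical field names and values, not the package's data structures): a minimum retrospective edge keeps, for each new node, its single shortest edge back to a node from an earlier year.

```python
# Toy sketch of minimum retrospective edges; names and data are illustrative.

def min_retrospective_edges(edges, node_year, new_nodes):
    """edges: iterable of (a, b, distance); returns {new_node: (old_node, distance)}."""
    best = {}
    for a, b, dist in edges:
        for new, old in ((a, b), (b, a)):
            if new in new_nodes and node_year[old] < node_year[new]:
                if new not in best or dist < best[new][1]:
                    best[new] = (old, dist)
    return best

node_year = {"A": 2019, "B": 2020, "C": 2021, "D": 2021}
edges = [("A", "C", 0.01), ("B", "C", 0.004), ("C", "D", 0.001), ("A", "D", 0.02)]
# D's same-year edge to C is excluded; its only retrospective edge goes to A.
assert min_retrospective_edges(edges, node_year, {"C", "D"}) == {
    "C": ("B", 0.004), "D": ("A", 0.02)}
```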

@SandeepThokala (Collaborator)

> It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.

I tried to run tn93 on data/na.fasta to see if we're using the same input as example_tn93.txt:
tn93 -o data/test_tn93.txt data/na.fasta
But the result was only 1308 edges long, whereas example_tn93.txt has 1306536 edges.

@ConnorChato (Collaborator, Author)

> It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.

> I tried to run tn93 on data/na.fasta to see if we're using the same input as example_tn93.txt: tn93 -o data/test_tn93.txt data/na.fasta But the result was only 1308 edges long, whereas example_tn93.txt has 1306536 edges.

tn93 has a threshold option -t, which I believe defaults to 0.015. Typically when we run tn93 we use an ultra-permissive threshold to keep the maximum number of edges (something like tn93 -t 1 -o data/test_tn93.txt data/na.fasta).
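A quick sanity check supports this (my own arithmetic, not from the thread): 1306536 is exactly n(n-1)/2 for n = 1617, consistent with example_tn93.txt being a complete pairwise distance list produced under a permissive threshold.

```python
from math import comb

# 1306536 edges is the complete set of unordered pairs over 1617 sequences.
# The 1617 figure is inferred from the pair count, not stated in the thread.
assert comb(1617, 2) == 1306536
```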

@SandeepThokala (Collaborator)

Uploading my understanding of the functions.

The input file (example_tn93.txt) contains a huge list of comma separated edge info (ID1, ID2, distance).

impTN93

  • Reads the input file and splits both IDs of every edge at the "_" character, extracting ID and time.
	idf <- read.csv(iFile, stringsAsFactors = F)
	temp1 <- sapply(idf$ID1, function(x) (strsplit(x,'_')[[1]])[[1]])
	temp2 <- sapply(idf$ID1, function(x) (strsplit(x,'_')[[1]])[[2]])
	temp3 <- sapply(idf$ID2, function(x) (strsplit(x,'_')[[1]])[[1]])
	temp4 <- sapply(idf$ID2, function(x) (strsplit(x,'_')[[1]])[[2]])
  • Creates a list g$e and obtains the maximum time and time difference for each edge, resulting in a data frame with 7 columns: ID1, t1, ID2, t2, Distance, tMax, tDiff.
  • Creates a list of unique vertices g$v based on ID and time columns of edge list resulting in a data frame with 2 columns ID, Time.
	el <- data.frame(ID1=as.character(temp1), t1=as.numeric(temp2), 
                     ID2=as.character(temp3), t2=as.numeric(temp4), 
                     Distance = as.numeric(idf$Distance),
                     stringsAsFactors= F)
	el$tMax <- pmax(el$t1,el$t2)
	el$tDiff <- abs(el$t1-el$t2)
	vl <- unique(data.frame(ID = c(el$ID1, el$ID2), Time = c(el$t1, el$t2), stringsAsFactors=F))
  • Creates a list of minimum retrospective edges minE
    -- Gets a list of "not old" vertices.
    -- Gets the closest retrospective edge, for every "not old" vertex v.
    -- Binds the rows of all closest retrospective edges of all "not old" vertices.
	subV <- subset(g$v, Time>min(Time))
	minE <- bind_rows(lapply(1:nrow(subV), function(i){
	    v <- subV[i,]
	    incE <- subset(g$e, (ID1%in%v$ID)|(ID2%in%v$ID))
	    retE <- subset(incE, (tMax==v$Time)&(tDiff>0))
	    retE[which(retE$Distance==min(retE$Distance))[[1]],]
	}))
  • Removes all "new" (newest-year) edges from g$e and adds back only the new minimum retrospective edges
	g$e <- subset(g$e, tMax!=max(tMax))
	g$e <- rbind(g$e, subset(minE, tMax==max(tMax)))
  • Creates a list of "not new" minimum retrospective edges g$f
	g$f <- subset(minE, tMax < max(tMax))
  • Returns g
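The parsing steps above can be paraphrased in Python (pandas, not the original R) to show the shape of the intermediate edge list:

```python
import pandas as pd
from io import StringIO

# Paraphrase of the impTN93 parsing steps: split "ID_year" labels into ID and
# time, then derive tMax and tDiff per edge. Data here is made up.
raw = StringIO("ID1,ID2,Distance\nA_2019,B_2021,0.01\nB_2021,C_2020,0.002\n")
idf = pd.read_csv(raw)

el = pd.DataFrame({
    "ID1": idf["ID1"].str.split("_").str[0],
    "t1": idf["ID1"].str.split("_").str[1].astype(int),
    "ID2": idf["ID2"].str.split("_").str[0],
    "t2": idf["ID2"].str.split("_").str[1].astype(int),
    "Distance": idf["Distance"],
})
el["tMax"] = el[["t1", "t2"]].max(axis=1)   # latest year on the edge
el["tDiff"] = (el["t1"] - el["t2"]).abs()   # age gap between the endpoints

assert list(el["tMax"]) == [2021, 2021] and list(el["tDiff"]) == [2, 1]
```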

@SandeepThokala (Collaborator)

compAnalyze

  • Takes in a subset subG of graph g, with maximum distance threshold dMax.
	dMax <- max(subG$e$Distance)
  • Gets a list of max years of retrospective edges tTab.
	tTab <- table(subG$f$tMax)
  • Gets a list of years of all unique vertices vTab.
	vTab <- table(subG$v$Time)
  • For each year in vTab, gets a list of number of all edges eTab.
	eTab <- sapply(as.numeric(names(vTab)), function(t){
		nrow(subset(subG$e, (t1==t|t2==t)))
	})
	names(eTab) <- names(vTab)
  • Creates a list of densities ageD, for each year in tTab:
    -- Gets a subset of retrospective edges of that year.
    -- Groups the subset of edges by the differences in years of vertices.
    -- For each group, counts the edges with distance at most dMax (Positive).
    -- For each group's year difference, computes the old year's ratio of edge count (eTab) to vertex count (vTab), giving oeDens.
    -- Creates a data frame with columns Positive, vTotal, oeDens, tDiff.
    -- Binds rows of all data frames.
	ageD <- bind_rows(lapply(as.numeric(names(tTab)), function(t) {
		temp <- subset(subG$f, tMax==t)
		dfs <- split(temp, temp$tDiff)
    
		Positive <- sapply(dfs, function(df){length(which(df$Distance<=dMax))})
		vTotal <- rep((vTab[[as.character(t)]]),length(dfs))
		tDiff <- as.numeric(names(Positive))
		oeDens <- sapply(tDiff, function(tD){
			oTime <- t-tD
			return(eTab[as.character(oTime)]/vTab[as.character(oTime)])
		})
    
    
		res <- data.frame(Positive=as.numeric(Positive), vTotal=vTab[[as.character(t)]], oeDens=oeDens, tDiff)
		return(res)
	}))
  • Fits a model of how frequently cases connect to new cases, as predicted by case age and edge density, and uses it to weight cases by age.
	mod <- glm(cbind(Positive, vTotal) ~ tDiff+oeDens, data=ageD, family='binomial')
	subG$v$Weight <- predict(mod, type='response',
                             data.frame(tDiff=max(subG$v$Time)-subG$v$Time,
                             oeDens=as.numeric(eTab[as.character(subG$v$Time)]/vTab[as.character(subG$v$Time)])))
  • Creates clusters for this subgraph and measures growth.
	subG <- simGrow(subG)
	cPred <- subset(subG$v, Time<max(Time))[,c("Weight", "Cluster")]
  • Creates two data frames for two predictive models: a date-informed model (df1, using the age weights) and a null model based on absolute cluster size (df2).
	df1 <- data.frame(Growth = as.numeric(subG$g), Pred = sapply(names(subG$c), function(x) { sum(subset(cPred, Cluster==as.numeric(x))$Weight) }))
	df2 <- data.frame(Growth = as.numeric(subG$g), Pred = as.numeric(subG$c) * (sum(as.numeric(subG$g))/sum(as.numeric(subG$c))))
	fit1 <- glm(Growth ~ Pred, data = df1, family = "poisson")
	fit2 <- glm(Growth ~ Pred, data = df2, family = "poisson")
  • Saves gaic, the models, and the age data as part of the output.
	subG$gaic <- fit1$aic-fit2$aic
	subG$ageMod <- mod
	subG$ageFit <- fit1
	subG$nullFit <- fit2
	subG$f <- ageD
  • Returns subG.
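The gaic value at the end is just a difference of AICs from the two Poisson fits (AIC = 2k - 2*logL); a negative value favours the date-informed model. A toy sketch with made-up log-likelihoods:

```python
# AIC = 2k - 2*logL; gaic = AIC(date-informed fit) - AIC(null fit).
# Both fits above have the same number of parameters (intercept + slope),
# so the sign of gaic is decided purely by the log-likelihoods.

def aic(k, log_likelihood):
    return 2 * k - 2 * log_likelihood

gaic = aic(2, -120.4) - aic(2, -125.0)   # made-up log-likelihood values
assert round(gaic, 6) == -9.2            # negative: date-informed model preferred
```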

@ArtPoon (Collaborator) commented Oct 31, 2023

@SandeepThokala reports that he was able to convert the example data file into the data structure used by the most recent code, but the conversion script is taking a long time to run. Can you post the script here so we can troubleshoot it?

@SandeepThokala (Collaborator) commented Oct 31, 2023

Conversion script

import numpy as np
import pandas as pd
from tqdm import tqdm

ex = "./data/example_tn93.txt"

def get_distance(id1, id2):
  distance = df.loc[(df['ID1'] == id1) & (df['ID2'] == id2), 'Distance']
  if not(len(distance)):
    distance = df.loc[(df['ID2'] == id1) & (df['ID1'] == id2), 'Distance']
    if not(len(distance)):
      distance = np.nan

  if isinstance(distance, pd.core.series.Series):
    return distance.iloc[0]
  else:
    return distance

df = pd.read_csv(ex)

unique_ids = set(df['ID1'].unique())
unique_ids = unique_ids.union(set(df['ID2'].unique()))
unique_ids = np.array(list(unique_ids))

num_ids = len(unique_ids)
matrix = np.zeros((num_ids, num_ids))
matrix = np.reshape(np.append(unique_ids, matrix), (num_ids + 1, num_ids))

count = 0
for row_index, row in tqdm(enumerate(unique_ids, 1)):
    for col_index, col in enumerate(unique_ids[count:], count):
        if col != row:
            matrix[row_index, col_index] = get_distance(row, col)
        else:
            matrix[row_index, col_index] = 0

    count += 1

matrix = matrix + matrix.T

np.savetxt('example_matrix.csv', matrix, delimiter=',')
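One likely cause of the slowness: get_distance filters the entire DataFrame once per pair, which is quadratic in the number of IDs with a full scan each time. A hedged sketch of a vectorized alternative (column names assumed to match the script above): symmetrize the edge list once, then pivot it into a square matrix.

```python
import pandas as pd

# Sketch: build the full distance matrix with one pivot instead of per-pair
# DataFrame scans. The ID1/ID2/Distance columns mirror the script above;
# the data here is made up.
df = pd.DataFrame({
    "ID1": ["a", "a", "b"],
    "ID2": ["b", "c", "c"],
    "Distance": [0.1, 0.2, 0.3],
})

# Add each edge in both directions, then reshape into a square matrix.
both = pd.concat([df, df.rename(columns={"ID1": "ID2", "ID2": "ID1"})])
matrix = both.pivot(index="ID1", columns="ID2", values="Distance")
ids = sorted(set(df["ID1"]) | set(df["ID2"]))
matrix = matrix.reindex(index=ids, columns=ids).fillna(0.0)  # diagonal -> 0

assert matrix.loc["b", "a"] == 0.1 and matrix.loc["a", "a"] == 0.0
```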

@ConnorChato (Collaborator, Author)

> @SandeepThokala reports that he was able to convert the example data file into the data structure used by the most recent code, but the conversion script is taking a long time to run. Can you post the script here so we can troubleshoot it?

The first loop of component.cluster in R/graph.clustering.R is currently a while loop. I had hoped that it would generally be limited in the number of steps it takes, but yes, it does seem to be really slow. Finding a way to vectorize this stage would likely be a big step forward.

@SandeepThokala (Collaborator)

Parsing intermediate data structures for graph creation

# replacing usage of pull.headers
seq.info <- read.csv("./data/seq_info.csv")
max.year <- max(year(seq.info$coldate))
which.new <- which(year(seq.info$coldate) == max.year)
# replacing usage of ape::dist.dna for calculating genetic distance using TN93 model
edge.info <- read.csv("./data/edge_info.csv")
rownames(edge.info) <- colnames(edge.info)
g <- create.graph(seq.info, edge.info, which.new)

Resulted in error

Error in create.graph(seq.info, edge.info, which.new) : 
  The pairwise distance matrix does not contain all recognized headers 
         from alignment

This input validation in the graph.setup.R script caused the error:

  if (!all(colnames(edge.info) %in% seq.info$Header)) {
    stop("The pairwise distance matrix does not contain all recognized headers 
         from alignment")
  }

@SandeepThokala (Collaborator)

Solving the error by making use of the pull.headers function:

seq.info <- read.csv("./data/seq_info.csv")

seqs <- list()
for (header in seq.info$Header) {
  seqs[[header]] <- header
}

seq.info <- pull.headers(seqs, sep="_", var.names=c('accession', 'coldate', 'subtype'),
                         var.transformations=c(as.character, as.Date, as.factor))
max.year <- max(year(seq.info$coldate))
which.new <- which(year(seq.info$coldate) == max.year)

edge.info <- read.csv("./data/edge_info.csv")
colnames(edge.info) <- lapply(colnames(edge.info), function(x) {
     seq.info$Header[seq.info$accession == strsplit(x, "_")[[1]][1]]
})
rownames(edge.info) <- colnames(edge.info)
g <- create.graph(seq.info, edge.info, which.new)

Finally able to create the graph data structure using the script above.
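The renaming step above boils down to an accession-to-header lookup; a small Python sketch of the same idea (header values are illustrative, not from the data):

```python
# Map accession-only column names back to full "accession_coldate_subtype"
# headers, as the colnames(edge.info) fix above does in R. Values are made up.
headers = ["KU190031_2015-07-01_B", "KU190032_2016-03-11_B"]
by_accession = {h.split("_")[0]: h for h in headers}

cols = ["KU190031", "KU190032"]
renamed = [by_accession[c] for c in cols]
assert renamed == headers
```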

.Rhistory Outdated
@ArtPoon (Collaborator) commented Nov 7, 2023

we shouldn't have .Rhistory in the repository, please remove with git rm --cached .Rhistory

@ConnorChato (Collaborator, Author)

Just removed this; it looks like the .gitignore was a bit empty, so I updated it as well to avoid this in the future.

@ArtPoon (Collaborator)

ok thanks. I am doing a bunch of work in branch iss12 and I will probably merge this branch into it when I'm done

@ConnorChato (Collaborator, Author)

Great! I've seen some of that - thanks

@ArtPoon (Collaborator) commented Nov 19, 2023

I've refactored a lot of the code and merged my work from iss12 branch into this one, but I don't think we're out of the woods yet. I am still having problems reproducing the graph clustering results with data files other than the mysterious example_tn93.txt one.

@ArtPoon (Collaborator) commented Nov 20, 2023

I'm going to go ahead and merge this PR. master branch is broken anyhow, so I am going to commit my ongoing work directly to that branch

@ArtPoon ArtPoon merged commit 23509f4 into master Nov 20, 2023
@ArtPoon ArtPoon deleted the component_example branch December 2, 2023 16:37