unexpected keyword argument 'size' #3

Closed
iwwwish opened this issue Jul 21, 2021 · 7 comments

Comments

@iwwwish

iwwwish commented Jul 21, 2021

Hi Charles,

First, I'd like to thank you for contributing this interesting work and sharing the code with the community. I am trying to run the GuiltyTargets pipeline using the example data shared in the previous issue #2 (after fixing the issues with the data files, of course). To do this, I created a conda env and installed all the packages as suggested, but ran into the issue below. I would really appreciate your help in fixing this.

I originally wanted to reproduce the results from your paper, but since the Open Targets REST API has been deprecated and replaced by a GraphQL API, I had to work directly with the code from this repo. Also, you mention that some data is provided in the supplementary information, but I couldn't find any supplementary data either in the original publication or on bioRxiv. It would be nice to see those tables too.

Thank you,
Vishal

# imports
from guiltytargets.pipeline import run

# define constants
input_directory = 'exampleData/'
targets_path = 'exampleData/known_targetID.csv'
ppi_graph_path = 'exampleData/ppi_graph.csv'
dge_path = 'exampleData/dge3.tsv'
auc_output_path = 'exampleData/'
probs_output_path = 'exampleData/'
max_padj = 0.05
lfc_cutoff = 1
entrez_id_name = 'Entrez id'
log_fold_change_name = 'Log fold change'
adjusted_p_value_name = 'Adjusted p value'
base_mean_name = 'Base mean'
split_char = ';'
confidence_cutoff = 0.1

# run GuiltyTargets
run(
    input_directory,
    targets_path,
    ppi_graph_path,
    dge_path,
    auc_output_path,
    probs_output_path,
    max_adj_p=max_padj,
    max_log2_fold_change=lfc_cutoff * -1,
    min_log2_fold_change=lfc_cutoff,
    entrez_id_header=entrez_id_name,
    log2_fold_change_header=log_fold_change_name,
    adj_p_header=adjusted_p_value_name,
    base_mean_header=base_mean_name,
    entrez_delimiter=split_char,
    ppi_edge_min_confidence=confidence_cutoff,
)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-713dbffdce87> in <module>
     14     base_mean_header=base_mean_name,
     15     entrez_delimiter=split_char,
---> 16     ppi_edge_min_confidence=confidence_cutoff
     17  )

~/work/mentoring/summer_2021/code/guiltytargets/src/guiltytargets/pipeline.py in run(input_directory, targets_path, ppi_graph_path, dge_path, auc_output_path, probs_output_path, max_adj_p, max_log2_fold_change, min_log2_fold_change, entrez_id_header, log2_fold_change_header, adj_p_header, base_mean_header, entrez_delimiter, ppi_edge_min_confidence)
     58         directory=input_directory,
     59         targets=targets,
---> 60         network=network,
     61     )
     62 

~/work/mentoring/summer_2021/code/guiltytargets/src/guiltytargets/pipeline.py in rank_targets(network, targets, directory)
    110         gat2vec_config.dimension,
    111         gat2vec_config.window_size,
--> 112         output=True,
    113     )
    114     classifier = Classification(directory, directory, tr=gat2vec_config.training_ratio)

~/anaconda3/envs/guiltytargets/lib/python3.6/site-packages/GAT2VEC/gat2vec.py in train_gat2vec(self, nwalks, wlength, dsize, wsize, output)
     66             fname = paths.get_embedding_path(self.dataset_dir, self.output_dir)
     67             gat2vec_model = self._train_gat2vec(dsize, fname, nwalks, output, walks_structure,
---> 68                                                 wlength, wsize)
     69         return gat2vec_model
     70 

~/anaconda3/envs/guiltytargets/lib/python3.6/site-packages/GAT2VEC/gat2vec.py in _train_gat2vec(self, dsize, fname, nwalks, output, walks_structure, wlength, wsize, add_structure)
     93         if add_structure:
     94             walks = walks_structure + walks
---> 95         gat2vec_model = self._train_word2Vec(walks, dsize, wsize, 4, output, fname)
     96         return gat2vec_model

~/anaconda3/envs/guiltytargets/lib/python3.6/site-packages/GAT2VEC/gat2vec.py in _train_word2Vec(self, walks, dimension_size, window_size, cores, output, fname)
     48         model = Word2Vec([list(map(str, walk)) for walk in walks],
     49                          size=dimension_size, window=window_size, min_count=0, sg=1,
---> 50                          workers=cores)
     51         if output is True:
     52             model.wv.save_word2vec_format(fname)

TypeError: __init__() got an unexpected keyword argument 'size'
@cthoyt
Member

cthoyt commented Jul 21, 2021

This is a known issue because gensim changed their arguments in the word2vec model. See also:

At this point I'm not super enthusiastic about updating the code in this repo for a few reasons:

  1. GAT2VEC didn't really provide any meaningful improvements over standard random walks with DeepWalk. Our supervisor was adamant that we needed this, so we included it, but I would suggest skipping this part completely
  2. Node2vec has some much newer/better implementations that perform better than DeepWalk
  3. I've re-implemented this pipeline a few times in different places - ultimately the training we did here is pretty routine, but we just didn't have the bandwidth to clean this code up at the end of my master's student's time

If you're gearing up for a publication and need a co-author, I could probably find some time to give some real support. In the meantime, I'd suggest checking out some of the follow-up work to GuiltyTargets that uses the same ideas but has cleaner, more reusable code:

@iwwwish
Author

iwwwish commented Jul 22, 2021

Charles, thank you very much for the prompt response. I figured out the issue: gensim and a couple of other packages that gat2vec relies on had to be downgraded in order to use the code 'as is'.

My summer intern is trying to use the approach for target prioritization. So I am not sure we will be ready for a publication anytime soon, but that is an encouraging thought. Thanks for offering to help. I'll reach out if we decide to pursue this idea further.

Vishal

@ozlemmuslu
Member

Dear Vishal,

Thank you for your interest in our work. If you let me know which versions Gat2Vec requires, I can update the documentation.

I'd like to add that using Gat2Vec as opposed to DeepWalk increased the performance by 1-2% which could be important depending on how many candidates you are working with.

You can address your questions to me in the future as this is primarily my work. We no longer work on this project, but the purpose of making it open source was so that the community could also contribute.

Best,
Özlem

@SalvatoreRa

Dear Özlem,

I had the same issue. How can I solve it?

Thank you for your help.

Best,
Salvatore

@ozlemmuslu
Member

Dear Salvatore,

The error is caused by a version mismatch in the gensim library. Originally, one of the parameters to initialize a Word2Vec object was named size, but it changed to vector_size.

I have now updated the gat2vec library, which uses gensim, to be compatible with the more recent version of gensim. Please do a clean install (including, and especially, gat2vec) and let me know if it works now.
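
To illustrate the rename described above, here is a minimal sketch of picking the keyword name from the installed gensim version. The helper name `word2vec_dimension_kwarg` is hypothetical and is not part of gat2vec or gensim; the sketch assumes the rename from `size` to `vector_size` took effect with gensim 4.0.

```python
def word2vec_dimension_kwarg(gensim_version, dimension):
    """Return the Word2Vec keyword for the embedding dimension.

    Hypothetical helper: assumes gensim releases before 4.0 expect
    ``size`` and releases from 4.0 onward expect ``vector_size``.
    """
    major = int(gensim_version.split(".")[0])
    name = "vector_size" if major >= 4 else "size"
    return {name: dimension}


# Usage (requires gensim; `walks` is a list of lists of string tokens):
#   import gensim
#   from gensim.models import Word2Vec
#   kwargs = word2vec_dimension_kwarg(gensim.__version__, 128)
#   model = Word2Vec(walks, window=5, min_count=0, sg=1, workers=4, **kwargs)
```

Pinning a single gensim version in the environment is the simpler option; a switch like this is only useful if the same code has to run against both old and new releases.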

Best,
Özlem

@SalvatoreRa

Dear Özlem,

Thank you for updating the code. I have been working on this on my machine over the past few days. I uninstalled and reinstalled all the libraries, and it now appears to work.

However, it may be useful for you (or anyone else who wants to use it) to know that after uninstalling and reinstalling Gat2vec, GuiltyTargets, and DeepWalk, I still got errors.

The first error occurred on import guiltytargets; fixing it required uninstalling and reinstalling gensim.

The second was caused by a deprecated function in the gat2vec parser file: as_matrix has been deprecated and removed from pandas. Here is the corrected code that resolved the error:

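import pandas as pd


def get_embeddingDF(fname):
    """Return the embeddings read from file fname."""
    # Skip the header line; column 0 holds the node id.
    df = pd.read_csv(fname, header=None, skiprows=1, delimiter=' ')
    df.sort_values(by=[0], inplace=True)
    df = df.set_index(0)
    # as_matrix() was removed from pandas; to_numpy() replaces it.
    return df.to_numpy()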
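
For readers who want to see what this function does to the file, here is a dependency-free sketch of the same steps (skip the header line, sort rows by node id, return the vectors). It is not the gat2vec code: the name `read_embedding_rows` is hypothetical, it returns plain lists rather than a NumPy array, and it assumes the node ids in the first column are integers.

```python
import io


def read_embedding_rows(handle):
    """Return embedding vectors sorted by node id, skipping the header line."""
    next(handle)  # header line: "<number of nodes> <dimension>"
    rows = []
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        # Assumption: the first column is an integer node id.
        rows.append((int(parts[0]), [float(x) for x in parts[1:]]))
    rows.sort(key=lambda row: row[0])
    return [vector for _, vector in rows]


# Tiny in-memory example with made-up values: two nodes, three dimensions.
sample = io.StringIO("2 3\n1 0.1 0.2 0.3\n0 0.4 0.5 0.6\n")
vectors = read_embedding_rows(sample)
```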

There is another error in the gat2vec evaluation file: the iid=False argument has been deprecated and removed from scikit-learn. Here is the updated code:

def evaluate_cv(self, clf, embedding, n_splits):
        """Do a repeated stratified cross validation.

        :param clf: Classifier object.
        :param embedding: The feature matrix.
        :param n_splits: Number of folds.
        :return: Dictionary containing numerical results of the classification.
        """
        embedding = embedding[self.label_ind, :]
        results = defaultdict(list)
        grid = {
            'C': np.logspace(-4, 4, 20),
            'tol': [0.0001, 0.001, 0.01]
        }
        log_reg = linear_model.LogisticRegression(solver='liblinear')

        # tol, C
        for i in range(10):
            inner_cv = StratifiedKFold(n_splits=n_splits, shuffle=True)
            outer_cv = StratifiedKFold(n_splits=n_splits, shuffle=True)

            for train_idx, test_idx in outer_cv.split(embedding, self.labels):
                # iid=False was removed from scikit-learn, so it is no longer passed.
                clf = GridSearchCV(estimator=log_reg, param_grid=grid, cv=inner_cv)
                clf.fit(embedding, self.labels)

                print('Parameter fitting done. clf: {}'.format(clf))

                X_train, X_test, Y_train, Y_test = self._get_split(embedding, test_idx, train_idx)
                pred, probs = self.get_predictions(clf, X_train, X_test, Y_train, Y_test)
                results["TR"].append(i)
                results["accuracy"].append(accuracy_score(Y_test, pred))
                results["f1micro"].append(f1_score(Y_test, pred, average='micro'))
                results["f1macro"].append(f1_score(Y_test, pred, average='macro'))
                if self.label_count == 2:
                    results["auc"].append(roc_auc_score(Y_test, probs[:, 1]))
                else:
                    results["auc"].append(0)
        return results
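
If the same code has to run against both older and newer library releases, one possible approach (a sketch, not what gat2vec does) is to drop keyword arguments that the installed constructor no longer accepts. The helper `drop_unsupported_kwargs` and the two stand-in functions below are hypothetical; they only mimic a signature with and without `iid`.

```python
import inspect


def drop_unsupported_kwargs(func, **kwargs):
    """Keep only the keyword arguments that ``func`` accepts by name."""
    params = inspect.signature(func).parameters
    # If func takes **kwargs itself, every keyword is acceptable.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {name: value for name, value in kwargs.items() if name in params}


# Hypothetical stand-ins for an older and a newer constructor signature.
def old_search(estimator, param_grid, cv=None, iid=False):
    return "old"


def new_search(estimator, param_grid, cv=None):
    return "new"


old_kwargs = drop_unsupported_kwargs(old_search, cv=5, iid=False)
new_kwargs = drop_unsupported_kwargs(new_search, cv=5, iid=False)
```

Here `old_kwargs` keeps `iid` while `new_kwargs` drops it. Silently dropping an argument can hide a behavior change, so simply removing `iid`, as in the code above, is the more transparent fix when only recent scikit-learn needs to be supported.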

I did not receive any other errors, and it ran smoothly after that.

Since I had to slightly modify the code, I have a couple of additional questions.

a) After running GuiltyTargets with my files, it returns the following outputs:

  • _gat2vec.emb, which I suppose is the embedding
  • _na.adjlist, which I suppose is an adjacency list, though it is not clear of what
  • _graph.adjlist, which I suppose is the adjacency list of the graph
  • labels_maped.txt, which I suppose is the list of labels (if I understood correctly, the provided labels are just mapped to the graph before the classification step)
  • probe_df.csv, which I suppose is the file produced by the classification step, where the other Entrez genes are classified as possible targets or not
  • auc_df, the results of the cross-validation

Is that correct?

b) probes_df returns a dataframe with three columns (0, 1, and entrez), which I suppose hold the probabilities for each Entrez gene to be a target (class 1) or not a target (class 0). Is that correct? Did you apply an argmax function to the probabilities to decide whether a gene is a target?

c) Did you use the probability to rank the targets?
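
To make questions (b) and (c) concrete, here is a small sketch of the two options being asked about: labeling each gene by argmax over the two class probabilities, and ranking genes by their class-1 probability. Whether the authors did either is not confirmed in this thread; the function names, the row layout, and the values below are made up for illustration only.

```python
def rank_by_target_probability(rows):
    """Sort (entrez_id, p_not_target, p_target) rows by p_target, descending."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


def argmax_label(p_not_target, p_target):
    """Return 1 if the target class is more probable, else 0 (ties give 0)."""
    return 1 if p_target > p_not_target else 0


# Made-up rows: (gene id, probability of class 0, probability of class 1).
rows = [("gene_a", 0.70, 0.30), ("gene_b", 0.10, 0.90), ("gene_c", 0.45, 0.55)]
ranked = rank_by_target_probability(rows)
labels = {gene: argmax_label(p0, p1) for gene, p0, p1 in rows}
```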

Best,

Salvatore

This was referenced Nov 18, 2021
@ozlemmuslu
Member

Dear Salvatore,

Thank you for your input. Would you like to do a pull request so your contribution to the repository is more visible?

Regarding your questions:
a) The .emb file is the embedding. The .adjlist files are needed by Gat2Vec: one is the adjacency list of the structural graph, and the other is that of the attribute graph.

For the rest of your questions, I will need to double check the output files, then I will get back to you.

I am closing this issue, since it's no longer about the original question. Here are the new issues:

#4
#5

Best,
Özlem

ozlemmuslu added a commit to GuiltyTargets/GAT2VEC that referenced this issue Dec 22, 2021
Update deprecated function based on GuiltyTargets/guiltytargets#3 (comment)
ozlemmuslu added a commit to GuiltyTargets/GAT2VEC that referenced this issue Dec 22, 2021