
Script to compute clusters on Seed, Istex Expanded data, Random Istex… #21

Open
wants to merge 6 commits into base: master
Conversation

lmartinet (Collaborator):

… selection, checking that the main key phrases of the first 2 groups do not appear in the Random Istex selection

# Extract the key words representing each cluster.

# co-author : Lucie Martinet <lucie.martinet@univ-lorraine.fr>
# co-author : Hussein AL-NATSHEH <hussein.al-natsheh@ish-lyon.cnrs.fr.>
Collaborator:

There is an extra '.' at the end of the email address

# co-author : Hussein AL-NATSHEH <hussein.al-natsheh@ish-lyon.cnrs.fr.>
# Affiliation: University of Lyon, ERIC Laboratory, Lyon2

# Thanks to ISTEX project for the foundings
Collaborator:

funding

# Thanks to ISTEX project for the foundings

import os, argparse, pickle, json
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
Collaborator:

CountVectorizer is not used later in the code

import os, argparse, pickle, json
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
Collaborator:

TruncatedSVD is not used later in the code


import os, argparse, pickle, json
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
Collaborator:

cosine_similarity is not used later in the code

return keys, values

def statisticsClusterSelection(cluster, document_id, docs_topic, selection, stat_selection, outfile_pointer):
if selection in document_id and outfile_pointer != None and len(selection)==len(document_id.split("_")[0]): # keys[t] is a string, the name of the document
Collaborator:

In the documentation, what does 'keys[t]' refer to in the function code? Could you please also break this line into two (one for the code and the other for the documentation)? Same for the line after.
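A possible shape for that split, as a small documented helper (the helper name is hypothetical, not code from this PR):

```python
def selection_matches(document_id, selection):
    """Return True when `selection` is the group prefix of `document_id`.

    `document_id` is a string, the name of the document (what the original
    comment calls keys[t]); e.g. "UCBL_123" has the group prefix "UCBL".
    """
    prefix = document_id.split("_")[0]
    return selection in document_id and len(selection) == len(prefix)
```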

return stat_selection

# Compute the clusters of document and write the results in output files.
# Need the
Collaborator:

Seems an incomplete documentation line

parser.add_argument("--mx_ngram", default=2, type=int) # the upper bound of the ngram range
parser.add_argument("--mn_ngram", default=1, type=int) # the lower bound of the ngram range
parser.add_argument("--stop_words", default=1, type=int) # filtering out English stop-words
parser.add_argument("--vec_size", default=100, type=int) # the size of the vector in the semantics space
Collaborator:

No need for this argument as we do not use SVD like transformation

stop_words = 'english'
else:
stop_words = None
n_components = args.vec_size
Collaborator:

This argument is not in use

parser.add_argument("--min_count", default=12 , type=int) # minimum frequency of the token to be included in the vocabulary
parser.add_argument("--max_df", default=0.95, type=float) # how much vocabulary percent to keep at max based on frequency
parser.add_argument("--debug", default=0, type=int) # embed IPython to use the decomposed matrix while running
parser.add_argument("--compress", default="json", type=str) # for dumping resulted files
Collaborator:

unused argument

min_count = args.min_count
max_df = args.max_df
debug = args.debug
compress = args.compress
Collaborator:

unused argument

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", default='results.pickle', type=str) # is a .json file
parser.add_argument("--output_file", default='resultsTest/results_lda.txt', type=str) # is a .json file
parser.add_argument("--lemmatizer", default=0, type=int) # for using lemmatization_tokenizer
Collaborator:

The lemmatizer is not supported in this file. You need to use
'from utils import Lemmatizer'
to avoid an error if the user uses this argument with a value other than 0.
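A sketch of the suggested guard; `utils.Lemmatizer` is the project-local helper named in this comment, not verified here:

```python
def choose_tokenizer(use_lemmatizer):
    # Import the project's Lemmatizer only when the flag is set, so the
    # script fails loudly (ImportError) rather than with a NameError later.
    if use_lemmatizer:
        from utils import Lemmatizer  # project-local helper (assumption)
        return Lemmatizer()
    return None  # TfidfVectorizer then falls back to its default tokenizer
```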

tf_idf_bow = tf_idf_vectorizer.fit_transform(values)
tf_feature_names = tf_idf_vectorizer.get_feature_names()

generic = open(output_file, "w")
Collaborator:

Instead of using 'generic' as a global variable to be used in the function 'statiticsClusters', you should pass the output file as a function parameter and move this open-file line into the function.
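A minimal sketch of the proposed refactor (function and file names hypothetical): the function receives the output location as parameters and owns the file handle, instead of reading a module-level global.

```python
import os

def write_cluster_report(out_dir, filename, lines):
    # The caller passes the output location; no global file handle needed,
    # and the handle is closed automatically when the block exits.
    path = os.path.join(out_dir, filename)
    with open(path, "w") as out:
        for line in lines:
            out.write(line + "\n")
    return path
```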

tf_feature_names = tf_idf_vectorizer.get_feature_names()

generic = open(output_file, "w")
ucbl_out = open(os.path.join(out_dir, "lda_ucbl_cluster.txt"), "w")
Collaborator:

Instead of using 'ucbl_out' as a global variable to be used in the function 'statiticsClusters', you should pass the output file as a function parameter and move this open-file line into the function.


generic = open(output_file, "w")
ucbl_out = open(os.path.join(out_dir, "lda_ucbl_cluster.txt"), "w")
istex_out = open(os.path.join(out_dir, "lda_mristex_cluster.txt"), "w")
Collaborator:

Instead of using 'istex_out' as a global variable to be used in the function 'statiticsClusters', you should pass the output file as a function parameter and move this open-file line into the function.

istex_out = open(os.path.join(out_dir, "lda_mristex_cluster.txt"), "w")

for i in range(min_nb_clusters, max_nb_clusters) :
statiticsClusters(i, tf_idf_bow, tf_feature_names, ucbl_out, istex_out ,max_iter=5, learning_method='online', learning_offset=50., random_state=0)
Collaborator:

Better to create input arguments in 'main' for these parameters, with their default values:
max_iter=5, learning_method='online', learning_offset=50., random_state=0
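The proposed arguments could look like this (defaults copied from the call above; the argument names are suggestions):

```python
import argparse

parser = argparse.ArgumentParser()
# Expose the hard-coded LDA parameters as CLI arguments with defaults
parser.add_argument("--max_iter", default=5, type=int)
parser.add_argument("--learning_method", default="online", type=str)
parser.add_argument("--learning_offset", default=50., type=float)
parser.add_argument("--random_state", default=0, type=int)

args = parser.parse_args([])  # empty argv here, just to materialize the defaults
```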

natsheh (Collaborator) left a comment:

Traceback (most recent call last):
File "LDACheck_key_phrases.py", line 180, in <module>
nb_random_with_key_phrase = stat_check_vocabulary(keys, values, groups_avoid=["UCBL", "MRISTEX"], key_phrase=key_phrase)
File "LDACheck_key_phrases.py", line 116, in stat_check_vocabulary
if values[i].lower().find(key_phrase) > -1 :
AttributeError: 'numpy.float64' object has no attribute 'lower'
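One way to avoid this crash is to type-guard the comparison; this is a hypothetical helper, not the code in the PR:

```python
def contains_key_phrase(value, key_phrase):
    # Only strings can contain a key phrase; numeric cells such as
    # numpy.float64 are skipped instead of raising AttributeError.
    return isinstance(value, str) and value.lower().find(key_phrase) > -1
```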

natsheh (Collaborator) commented Dec 21, 2016:

@lmartinet Any update on this PR fix?

lmartinet (Collaborator, Author) commented Dec 22, 2016 via email

natsheh (Collaborator) commented Dec 24, 2016:

@lmartinet
Using the 'results.pickle' file generated by 'classifier.py', I got the same error message as before.

… selection, checking that the main key phrases of the first 2 groups do not appear in the Random Istex selection
… Add some instructions in the README. The first steps should be completed.
lmartinet (Collaborator, Author):

The input to the script was not the correct one. Please read the README file, then review the script again.
It should be correct now.

natsheh (Collaborator) commented May 10, 2017:

@lmartinet I will double-check and follow the steps in the readme file.
Could you move the code file to a sub-folder and have the readme file there? The readme file has a big font; could you make it smaller?

natsheh (Collaborator) left a comment:

It is now working on my side; however, please address the minor change proposals below.

# ISTEX_MentalRotation
natsheh (Collaborator), May 10, 2017:

Please leave this file empty, to be filled later as the main repository readme. Instead, you should build the same file in a sub-directory for this LDA clustering process; for example:
../LDA_analysis/readme.md

README.md Outdated
> python ids2docs.py (output: results/LDA_res_input.pickle)

# Comput clusters on the documents well classified by the classifyer from the dictionnary given by ids2docs.
Collaborator:

classifier instead of classifyer

README.md Outdated
Steps to run the experiment :

# build the classifier for the documents, according to the vectorisation done before
Collaborator:

Better to use a smaller font for the documentation than this header font.

natsheh (Collaborator) commented May 11, 2017:

I got this error message:

Nb clusters  2  Nb ucbl clusters  0 0
Traceback (most recent call last):
  File "Topic_Clustering.py", line 196, in <module>
    statisticsClusters(i, document_ids, tf_idf_bow, tf_feature_names, generic, ucbl_out, istex_out, max_iter=max_iter, learning_method=learning_method, learning_offset=learning_offset, random_state=random_state)
  File "Topic_Clustering.py", line 87, in statisticsClusters
    print "Nb clusters ", i, " Nb ucbl clusters " , len(list_ucbl.values()), len(list_ucbl.values()), min(list_ucbl.values()), " Nb istex cluster ",len(list_mristex), min(list_mristex.values()) 
ValueError: min() arg is an empty sequence

@lmartinet If the code expects certain conditions, e.g., excluding the seed articles from the results, please introduce that as a pre-processing step.
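Alternatively, if an empty group should be tolerated rather than pre-filtered, a small guard around min() would avoid the ValueError in the traceback above (hypothetical helper name):

```python
def safe_min(counts):
    # min() raises ValueError on an empty sequence; report 0 when a
    # group matched no cluster at all instead of crashing.
    return min(counts) if counts else 0
```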
