Script to compute clusters on Seed, Istex Expanded data, Random Istex… #21
base: master
LDACheck_key_phrases.py
Outdated
# Extract the key words representing each cluster.

# co-author : Lucie Martinet <lucie.martinet@univ-lorraine.fr>
# co-author : Hussein AL-NATSHEH <hussein.al-natsheh@ish-lyon.cnrs.fr.>
There is an extra '.' at the end of the email address
LDACheck_key_phrases.py
Outdated
# co-author : Hussein AL-NATSHEH <hussein.al-natsheh@ish-lyon.cnrs.fr.>
# Affiliation: University of Lyon, ERIC Laboratory, Lyon2

# Thanks to ISTEX project for the foundings
funding
LDACheck_key_phrases.py
Outdated
# Thanks to ISTEX project for the foundings

import os, argparse, pickle, json
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
CountVectorizer is not used later in the code
LDACheck_key_phrases.py
Outdated
import os, argparse, pickle, json
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
TruncatedSVD is not used later in the code
LDACheck_key_phrases.py
Outdated
import os, argparse, pickle, json
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity is not used later in the code
LDACheck_key_phrases.py
Outdated
    return keys, values

def statisticsClusterSelection(cluster, document_id, docs_topic, selection, stat_selection, outfile_pointer):
    if selection in document_id and outfile_pointer != None and len(selection)==len(document_id.split("_")[0]): # keys[t] is a string, the name of the document
In the inline comment, what does 'keys[t]' refer to in the function code? Could you please also split this line in two (one for the code, one for the comment)? The same applies to the line after it.
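A minimal sketch of what the split could look like, keeping the same test as the reviewed line. The helper name `matches_selection` and the `"<GROUP>_<suffix>"` shape of `document_id` are assumptions made for illustration; they are not taken from the script.

```python
def matches_selection(document_id, selection):
    """Return True when document_id belongs to the group named by selection."""
    # document_id is the name of the document, a string assumed to look
    # like "<GROUP>_<suffix>" (the exact format is an assumption here).
    group_prefix = document_id.split("_")[0]
    # Same test as in the reviewed line: selection occurs in the name and
    # has the same length as the group prefix.
    return selection in document_id and len(selection) == len(group_prefix)


print(matches_selection("UCBL_0001", "UCBL"))
print(matches_selection("MRISTEX_0001", "UCBL"))
```

Putting the comment on its own line above the code keeps the condition readable and makes it clear that the comment describes `document_id`, not a `keys[t]` variable that does not exist in this function.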
LDACheck_key_phrases.py
Outdated
    return stat_selection

# Compute the clusters of document and write the results in output files.
# Need the
Seems an incomplete documentation line
LDACheck_key_phrases.py
Outdated
parser.add_argument("--mx_ngram", default=2, type=int) # the upper bound of the ngram range
parser.add_argument("--mn_ngram", default=1, type=int) # the lower bound of the ngram range
parser.add_argument("--stop_words", default=1, type=int) # filtering out English stop-words
parser.add_argument("--vec_size", default=100, type=int) # the size of the vector in the semantics space
No need for this argument, as we do not use an SVD-like transformation.
LDACheck_key_phrases.py
Outdated
    stop_words = 'english'
else:
    stop_words = None
n_components = args.vec_size
This argument is not in use
LDACheck_key_phrases.py
Outdated
parser.add_argument("--min_count", default=12 , type=int) # minimum frequency of the token to be included in the vocabulary
parser.add_argument("--max_df", default=0.95, type=float) # how much vocabulary percent to keep at max based on frequency
parser.add_argument("--debug", default=0, type=int) # embed IPython to use the decomposed matrix while running
parser.add_argument("--compress", default="json", type=str) # for dumping resulted files
unused argument
LDACheck_key_phrases.py
Outdated
min_count = args.min_count
max_df = args.max_df
debug = args.debug
compress = args.compress
unused argument
LDACheck_key_phrases.py
Outdated
parser = argparse.ArgumentParser()
parser.add_argument("--input_file", default='results.pickle', type=str) # is a .json file
parser.add_argument("--output_file", default='resultsTest/results_lda.txt', type=str) # is a .json file
parser.add_argument("--lemmatizer", default=0, type=int) # for using lemmatization_tokenizer
The lemmatizer is not supported in this file. You need to add
'from utils import Lemmatizer'
to avoid an error if the user passes this argument with a value other than 0.
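As an alternative to a top-level import, the guard could be sketched as below. This is only an illustration: `build_tokenizer` is a hypothetical helper, and calling `Lemmatizer()` without arguments is an assumption about the project's `utils` module, which is not shown in this PR.

```python
def build_tokenizer(use_lemmatizer):
    """Return a tokenizer callable, or None to keep the vectorizer default."""
    if not use_lemmatizer:
        # --lemmatizer 0: nothing to import, the default tokenizer is used.
        return None
    try:
        # Project helper named in the review comment; its constructor
        # signature is assumed here, not confirmed.
        from utils import Lemmatizer
    except ImportError as exc:
        raise SystemExit("--lemmatizer needs utils.Lemmatizer: %s" % exc)
    return Lemmatizer()


tokenizer = build_tokenizer(0)
print(tokenizer)
```

With this shape, a user who passes a non-zero `--lemmatizer` value without the helper available gets an explicit message instead of a `NameError` later in the run.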
LDACheck_key_phrases.py
Outdated
tf_idf_bow = tf_idf_vectorizer.fit_transform(values)
tf_feature_names = tf_idf_vectorizer.get_feature_names()

generic = open(output_file, "w")
Instead of using 'generic' as a global variable to be used in the function 'statiticsClusters', you should pass the output_file as a function parameter and move this file-opening line into the function.
LDACheck_key_phrases.py
Outdated
tf_feature_names = tf_idf_vectorizer.get_feature_names()

generic = open(output_file, "w")
ucbl_out = open(os.path.join(out_dir, "lda_ucbl_cluster.txt"), "w")
Instead of using 'ucbl_out' as a global variable to be used in the function 'statiticsClusters', you should pass the output file as a function parameter and move this file-opening line into the function.
LDACheck_key_phrases.py
Outdated
generic = open(output_file, "w")
ucbl_out = open(os.path.join(out_dir, "lda_ucbl_cluster.txt"), "w")
istex_out = open(os.path.join(out_dir, "lda_mristex_cluster.txt"), "w")
Instead of using 'istex_out' as a global variable to be used in the function 'statiticsClusters', you should pass the output file as a function parameter and move this file-opening line into the function.
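The pattern suggested in these three comments could look roughly like the sketch below. `write_cluster_report` and its arguments are hypothetical names used only for illustration; the real function would receive the paths for the generic, UCBL and ISTEX outputs in the same way.

```python
import os
import tempfile


def write_cluster_report(output_path, lines):
    """Open output_path inside the function and write one line per entry."""
    # The file is opened (and closed) here, so no global file handle is needed.
    with open(output_path, "w") as out:
        for line in lines:
            out.write(line + "\n")
    return output_path


# Usage with a temporary directory standing in for out_dir.
out_dir = tempfile.mkdtemp()
report_path = os.path.join(out_dir, "lda_ucbl_cluster.txt")
write_cluster_report(report_path, ["cluster 0: 12 documents", "cluster 1: 7 documents"])
```

Using `with open(...)` also guarantees that the file is closed even if the clustering step raises an exception.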
LDACheck_key_phrases.py
Outdated
istex_out = open(os.path.join(out_dir, "lda_mristex_cluster.txt"), "w")

for i in range(min_nb_clusters, max_nb_clusters) :
    statiticsClusters(i, tf_idf_bow, tf_feature_names, ucbl_out, istex_out ,max_iter=5, learning_method='online', learning_offset=50., random_state=0)
Better to create input arguments in 'main' for these parameters, with their default values:
max_iter=5, learning_method='online', learning_offset=50., random_state=0
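A possible shape for these arguments, as a sketch only. The flag names simply mirror the keyword names in the call above, and the defaults are the values hard-coded there; `build_parser` and `lda_params` are hypothetical names.

```python
import argparse


def build_parser():
    """Command-line arguments for the parameters hard-coded in the loop."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_iter", default=5, type=int)
    parser.add_argument("--learning_method", default="online", type=str)
    parser.add_argument("--learning_offset", default=50.0, type=float)
    parser.add_argument("--random_state", default=0, type=int)
    return parser


# An empty argument list is parsed here so that the defaults are visible.
args = build_parser().parse_args([])
lda_params = {
    "max_iter": args.max_iter,
    "learning_method": args.learning_method,
    "learning_offset": args.learning_offset,
    "random_state": args.random_state,
}
print(lda_params)
```

The dictionary can then be forwarded with `**lda_params` to the clustering function, so the loop no longer repeats literal values.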
Traceback (most recent call last):
  File "LDACheck_key_phrases.py", line 180, in <module>
    nb_random_with_key_phrase = stat_check_vocabulary(keys, values, groups_avoid=["UCBL", "MRISTEX"], key_phrase=key_phrase)
  File "LDACheck_key_phrases.py", line 116, in stat_check_vocabulary
    if values[i].lower().find(key_phrase) > -1 :
AttributeError: 'numpy.float64' object has no attribute 'lower'
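The traceback shows that at least one entry of `values` is a float rather than a string; one plausible cause, not confirmed in this thread, is a missing text loaded as NaN. A defensive check could be sketched as follows (Python 3; `contains_key_phrase` and `count_with_key_phrase` are hypothetical helpers, not functions from the script):

```python
def contains_key_phrase(value, key_phrase):
    """Return True if value is a string that contains key_phrase."""
    # Non-string entries (for example a NaN float) cannot contain the phrase.
    if not isinstance(value, str):
        return False
    # Same comparison as the failing line: lower-case the text only.
    return key_phrase in value.lower()


def count_with_key_phrase(values, key_phrase):
    """Count the entries of values that contain key_phrase."""
    return sum(1 for value in values if contains_key_phrase(value, key_phrase))


docs = ["A Mental Rotation task", float("nan"), "unrelated abstract"]
print(count_with_key_phrase(docs, "mental rotation"))
```

Skipping non-string entries hides the symptom only; if the input is expected to contain text for every document, it may be better to fix or filter the input upstream.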
@lmartinet Any update on the fix for this PR?
45c282d to 86ead38
LDA_check_key_phrase
Done !
I hope I did it properly, this time ;-)
Lucie
2016-12-21 9:27 GMT+01:00 Hussein AL-NATSHEH <notifications@github.com>:
… @lmartinet <https://github.com/lmartinet> Any update of this PR fix?
@lmartinet
… selection checking that the main key phrase of the first 2 groups does not appear in the Random Istex selection
86ead38 to 3f36c6c
…Add some instruction in the README. The first steps should be completed.
The input to the script was not the right one. Please read the README file, then review the script again.
@lmartinet I will double-check and follow the steps in the README file.
It is now working on my side; however, please apply the minor changes proposed below.
# ISTEX_MentalRotation
Please leave this file empty, to be filled in later as the main repository README. Instead, you should put the same content in a README inside a sub-directory for this LDA clustering process; for example:
../LDA_analysis/readme.md
README.md
Outdated
> python ids2docs.py (output: results/LDA_res_input.pickle)

# Comput clusters on the documents well classified by the classifyer from the dictionnary given by ids2docs.
classifier instead of classifyer
README.md
Outdated
Steps to run the experiment :

# build the classifier for the documents, according to the vectorisation done before
Better to use a smaller font for the documentation than this header font.
I got this error message:
@lmartinet If the code expects certain conditions, e.g., excluding the seed articles from the results, please introduce that as a pre-processing step.
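Such a pre-processing step might be as small as the sketch below. It assumes the documents are held in a dictionary keyed by document id and that the seed ids are known; `exclude_seed_articles`, `documents` and `seed_ids` are hypothetical names, not taken from the scripts in this PR.

```python
def exclude_seed_articles(documents, seed_ids):
    """Return a copy of documents without the entries listed in seed_ids."""
    seed_ids = set(seed_ids)  # set lookup keeps the filter fast on large inputs
    return {
        doc_id: text
        for doc_id, text in documents.items()
        if doc_id not in seed_ids
    }


documents = {"UCBL_1": "seed text", "ISTEX_7": "result text"}
filtered = exclude_seed_articles(documents, ["UCBL_1"])
print(sorted(filtered))
```

Running this before the clustering script makes the expected input explicit instead of leaving it as an implicit requirement.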