
Script to compute clusters on Seed, Istex Expanded data, Random Istex… #21

Open
wants to merge 6 commits into base: master
Conversation

lmartinet (Collaborator):

… selection, checking that the main key phrases of the first 2 groups do not appear in the Random Istex selection

# Extract the key words representing each cluster.

# co-author : Lucie Martinet <lucie.martinet@univ-lorraine.fr>
# co-author : Hussein AL-NATSHEH <hussein.al-natsheh@ish-lyon.cnrs.fr.>
Collaborator:

There is an extra '.' at the end of the email address

# co-author : Hussein AL-NATSHEH <hussein.al-natsheh@ish-lyon.cnrs.fr.>
# Affiliation: University of Lyon, ERIC Laboratory, Lyon2

# Thanks to ISTEX project for the foundings
Collaborator:

funding

# Thanks to ISTEX project for the foundings

import os, argparse, pickle, json
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
Collaborator:

CountVectorizer is not used later in the code

import os, argparse, pickle, json
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
Collaborator:

TruncatedSVD is not used later in the code


import os, argparse, pickle, json
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
Collaborator:

cosine_similarity is not used later in the code

return keys, values

def statisticsClusterSelection(cluster, document_id, docs_topic, selection, stat_selection, outfile_pointer):
if selection in document_id and outfile_pointer != None and len(selection)==len(document_id.split("_")[0]): # keys[t] is a string, the name of the document
Collaborator:

In the documentation, what does 'keys[t]' refer to in the function code? Could you please also break this line into two (one for the code and the other for the documentation)? Same for the line after.
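A possible shape for that split, as a small documented helper (the helper name is hypothetical, not code from this PR):

```python
def selection_matches(document_id, selection):
    """Return True when `selection` is the group prefix of `document_id`.

    `document_id` is a string, the name of the document (what the original
    comment calls keys[t]); e.g. "UCBL_123" has the group prefix "UCBL".
    """
    prefix = document_id.split("_")[0]
    return selection in document_id and len(selection) == len(prefix)
```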

return stat_selection

# Compute the clusters of document and write the results in output files.
# Need the
Collaborator:

Seems an incomplete documentation line

parser.add_argument("--mx_ngram", default=2, type=int) # the upper bound of the ngram range
parser.add_argument("--mn_ngram", default=1, type=int) # the lower bound of the ngram range
parser.add_argument("--stop_words", default=1, type=int) # filtering out English stop-words
parser.add_argument("--vec_size", default=100, type=int) # the size of the vector in the semantics space
Collaborator:

No need for this argument as we do not use SVD like transformation

stop_words = 'english'
else:
stop_words = None
n_components = args.vec_size
Collaborator:

This argument is not in use

parser.add_argument("--min_count", default=12 , type=int) # minimum frequency of the token to be included in the vocabulary
parser.add_argument("--max_df", default=0.95, type=float) # how much vocabulary percent to keep at max based on frequency
parser.add_argument("--debug", default=0, type=int) # embed IPython to use the decomposed matrix while running
parser.add_argument("--compress", default="json", type=str) # for dumping resulted files
Collaborator:

unused argument

min_count = args.min_count
max_df = args.max_df
debug = args.debug
compress = args.compress
Collaborator:

unused argument

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", default='results.pickle', type=str) # is a .json file
parser.add_argument("--output_file", default='resultsTest/results_lda.txt', type=str) # is a .json file
parser.add_argument("--lemmatizer", default=0, type=int) # for using lemmatization_tokenizer
Collaborator:

The lemmatizer is not supported in this file. You need to use
'from utils import Lemmatizer'
to avoid an error if the user uses this argument with a value other than 0.
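A sketch of the suggested guard; `utils.Lemmatizer` is the project-local helper named in this comment, not verified here:

```python
def choose_tokenizer(use_lemmatizer):
    # Import the project's Lemmatizer only when the flag is set, so the
    # script fails loudly (ImportError) rather than with a NameError later.
    if use_lemmatizer:
        from utils import Lemmatizer  # project-local helper (assumption)
        return Lemmatizer()
    return None  # TfidfVectorizer then falls back to its default tokenizer
```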

tf_idf_bow = tf_idf_vectorizer.fit_transform(values)
tf_feature_names = tf_idf_vectorizer.get_feature_names()

generic = open(output_file, "w")
Collaborator:

Instead of using 'generic' as a global variable to be used in the function 'statiticsClusters', you should pass the output file as a function parameter and move this open-file line into the function.
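A minimal sketch of the proposed refactor (function and file names hypothetical): the function receives the output location as parameters and owns the file handle, instead of reading a module-level global.

```python
import os

def write_cluster_report(out_dir, filename, lines):
    # The caller passes the output location; no global file handle needed,
    # and the handle is closed automatically when the block exits.
    path = os.path.join(out_dir, filename)
    with open(path, "w") as out:
        for line in lines:
            out.write(line + "\n")
    return path
```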

tf_feature_names = tf_idf_vectorizer.get_feature_names()

generic = open(output_file, "w")
ucbl_out = open(os.path.join(out_dir, "lda_ucbl_cluster.txt"), "w")
Collaborator:

Instead of using 'ucbl_out' as a global variable to be used in the function 'statiticsClusters', you should pass the output file as a function parameter and move this open-file line into the function.


generic = open(output_file, "w")
ucbl_out = open(os.path.join(out_dir, "lda_ucbl_cluster.txt"), "w")
istex_out = open(os.path.join(out_dir, "lda_mristex_cluster.txt"), "w")
Collaborator:

Instead of using 'istex_out' as a global variable to be used in the function 'statiticsClusters', you should pass the output file as a function parameter and move this open-file line into the function.

istex_out = open(os.path.join(out_dir, "lda_mristex_cluster.txt"), "w")

for i in range(min_nb_clusters, max_nb_clusters) :
statiticsClusters(i, tf_idf_bow, tf_feature_names, ucbl_out, istex_out ,max_iter=5, learning_method='online', learning_offset=50., random_state=0)
Collaborator:

Better to create input arguments in 'main' for these parameters, with their default values:
max_iter=5, learning_method='online', learning_offset=50., random_state=0
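The proposed arguments could look like this (defaults copied from the call above; the argument names are suggestions):

```python
import argparse

parser = argparse.ArgumentParser()
# Expose the hard-coded LDA parameters as CLI arguments with defaults
parser.add_argument("--max_iter", default=5, type=int)
parser.add_argument("--learning_method", default="online", type=str)
parser.add_argument("--learning_offset", default=50., type=float)
parser.add_argument("--random_state", default=0, type=int)

args = parser.parse_args([])  # empty argv here, just to materialize the defaults
```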

natsheh (Collaborator) left a comment:

Traceback (most recent call last):
File "LDACheck_key_phrases.py", line 180, in <module>
nb_random_with_key_phrase = stat_check_vocabulary(keys, values, groups_avoid=["UCBL", "MRISTEX"], key_phrase=key_phrase)
File "LDACheck_key_phrases.py", line 116, in stat_check_vocabulary
if values[i].lower().find(key_phrase) > -1 :
AttributeError: 'numpy.float64' object has no attribute 'lower'
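One way to avoid this crash is to type-guard the comparison; this is a hypothetical helper, not the code in the PR:

```python
def contains_key_phrase(value, key_phrase):
    # Only strings can contain a key phrase; numeric cells such as
    # numpy.float64 are skipped instead of raising AttributeError.
    return isinstance(value, str) and value.lower().find(key_phrase) > -1
```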

natsheh (Collaborator) commented Dec 21, 2016:

@lmartinet Any update on this PR fix?

lmartinet (Collaborator, Author) commented Dec 22, 2016 via email

natsheh (Collaborator) commented Dec 24, 2016:

@lmartinet
Using the 'results.pickle' file generated by 'classifier.py', I got the same error message as before.

… selection, checking that the main key phrases of the first 2 groups do not appear in the Random Istex selection
… Add some instructions in the README. The first steps should be completed.
lmartinet (Collaborator, Author):

The input to the script was not the correct one. Please read the README file, then review the script again.
It should be correct now.

natsheh (Collaborator) commented May 10, 2017:

@lmartinet I will double-check and follow the steps in the readme file.
Could you move the code file to a sub-folder and have the readme file there? The readme file has a big font; could you make it smaller?

natsheh (Collaborator) left a comment:

It is now working on my side; however, please address the minor change proposals below.

# ISTEX_MentalRotation
natsheh (Collaborator), May 10, 2017:

Please leave this file empty, to be filled later as the main repository readme. Instead, you should build the same file in a sub-directory for this LDA clustering process; for example:
../LDA_analysis/readme.md

README.md Outdated
> python ids2docs.py (output: results/LDA_res_input.pickle)

# Comput clusters on the documents well classified by the classifyer from the dictionnary given by ids2docs.
Collaborator:

classifier instead of classifyer

README.md Outdated
Steps to run the experiment :

# build the classifier for the documents, according to the vectorisation done before
Collaborator:

Better to use a smaller font for the documentation than this header font.

natsheh (Collaborator) commented May 11, 2017:

I got this error message:

Nb clusters  2  Nb ucbl clusters  0 0
Traceback (most recent call last):
  File "Topic_Clustering.py", line 196, in <module>
    statisticsClusters(i, document_ids, tf_idf_bow, tf_feature_names, generic, ucbl_out, istex_out, max_iter=max_iter, learning_method=learning_method, learning_offset=learning_offset, random_state=random_state)
  File "Topic_Clustering.py", line 87, in statisticsClusters
    print "Nb clusters ", i, " Nb ucbl clusters " , len(list_ucbl.values()), len(list_ucbl.values()), min(list_ucbl.values()), " Nb istex cluster ",len(list_mristex), min(list_mristex.values()) 
ValueError: min() arg is an empty sequence

@lmartinet If the code expects certain conditions, e.g., excluding the seed articles from the results, please introduce that as a pre-processing step.
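Alternatively, if an empty group should be tolerated rather than pre-filtered, a small guard around min() would avoid the ValueError in the traceback above (hypothetical helper name):

```python
def safe_min(counts):
    # min() raises ValueError on an empty sequence; report 0 when a
    # group matched no cluster at all instead of crashing.
    return min(counts) if counts else 0
```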
