Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison against TF-IDF Vectorizer (from scratch) #8

Open
9 tasks
BALaka-18 opened this issue Sep 25, 2020 · 14 comments
Open
9 tasks

Comparison against TF-IDF Vectorizer (from scratch) #8

BALaka-18 opened this issue Sep 25, 2020 · 14 comments
Assignees
Labels
code code based issue Hacktoberfest This issue is under Hacktoberfest 2020 hard hard issues

Comments

@BALaka-18
Copy link
Owner

Description

TF-IDF is one of the most famous algorithms when it comes to keyword extraction from text. Your task is to create a function that will extract keywords from text using the TF-IDF algorithm and compare the results against this library. How similar / different are the results ?

NOTE : You have to build the Tf-idf algorithm for keyword extraction from scratch. You will then compare its performance against sklearn's TfidfVectorizer and rake_new2.

For reference :

For your reference, you may read this link

Folder Structure, Function details

Create a folder tfidf_vectorizer in the root directory. The folder must contain a .py file that will contain the function for extracting the keywords from text using the Tfidf algorithm written from scratch.

Structure : tfidf_vectorizer/extract_keywords_tfidf_scratch.py

Acceptance Criteria

  • Code must be properly formatted.
  • Code must be accompanied by appropriate comments.
  • File structure must be strictly maintained.
  • Test cases must be present at the end of the code.
  • Variables and functions must be properly named
  • IMPORTANT : Make sure requirements.txt file is updated if you are including any new library.
  • All instructions provided in the Description must be strictly followed.

Definition of Done

  • All of the required items are completed.
  • Approval by 1 mentor.

Time Estimation

2.5-3 hours (or more if needed)

@BALaka-18 BALaka-18 added code code based issue Hacktoberfest This issue is under Hacktoberfest 2020 hard hard issues labels Sep 25, 2020
@2bit-hack
Copy link
Contributor

Is it okay if I work on this issue during Hacktoberfest?

@BALaka-18
Copy link
Owner Author

Is it okay if I work on this issue during Hacktoberfest?

Yes sure @2bit-hack . Just make sure the PR is made after Oct 1.
Assigning it to you.

@2bit-hack
Copy link
Contributor

I have a few questions regarding the implementation:

  • Will there be a corpus of documents provided against which to run the TF-IDF algorithm?
    (Something like get_keywords(document, documents, stopwords)?)
  • Will these documents be text files or plaintext strings (or potentially both)?
  • How do I measure performance / accuracy of the implementation versus the other algorithms? Do I perform a diff?

@BALaka-18
Copy link
Owner Author

I have a few questions regarding the implementation:

  • Will there be a corpus of documents provided against which to run the TF-IDF algorithm?
    (Something like get_keywords(document, documents, stopwords)?)
  • Will these documents be text files or plaintext strings (or potentially both)?
  • How do I measure performance / accuracy of the implementation versus the other algorithms? Do I perform a diff?
  1. You have to run the tf-idf algorithm on only one document. That document can be a large corpus, or an article, or just 3 lines of text. The function will take in the document, the stopwords and the max_num of keywords the user wants to extract.
    If max_num == None, then extract all keywords.
    Else, extract max_num keywords.
    Default : None
    The function will return the keywords along with their tf-idf weights.
  2. Answered in the above point.
  3. The major difference that you will find is the number of keywords extracted, as tf-idf will go over all words but rake_new2 returns phrases mostly. So, in order to compare :
    - Choose a small text to work one. Using your knowledge, mark out what according to you are key words/phrases
    in the text.
    - Extract keywords using both rake_new2 and tf-idf and calculate their recall / precision score (against the
    keywords you manually extracted). READ THIS PAPER TO GET AN IDEA OF WHAT I MEAN TO SAY.
    - Time both the algorithms (using Python's time library) to see which algorithm extracts keywords faster.

@2bit-hack
Copy link
Contributor

2bit-hack commented Sep 28, 2020

tf-idf is a corpus based algorithm whereas RAKE can work on single documents. Like, if I have a collection of documents, I can figure out the tf-idf scores for words in one document against all the documents in the corpus. But if there's just one document, will tf-idf still work?

@BALaka-18
Copy link
Owner Author

tf-idf is a corpus based algorithm whereas RAKE can work on single documents. Like, if I have a collection of documents, I can figure out the tf-idf scores for words in one document against all the documents in the corpus. But if there's just one document, will tf-idf still work?

See, say you take a corpus of N documents. Like, N Stackoverflow posts for example. Your IDF will depend on all N posts. Do the calculation. Then choose one post, say post_3. Run TF-IDF and RAKE for that chosen document and compare.

@2bit-hack
Copy link
Contributor

Oh, okay. So do I change the function signature to be something like def keywords(docs, selected_doc, stopwords)?

@BALaka-18
Copy link
Owner Author

BALaka-18 commented Sep 28, 2020

Oh yes..thanks for pointing that out. Yes you change the function to include the corpus.
Make sure you include the max_num parameter.

@2bit-hack
Copy link
Contributor

Alright, thank you so much!

@2bit-hack
Copy link
Contributor

The code for keyword extraction using TF-IDF goes into tfidf_vectorizer/extract_keywords_tfidf_scratch.py. Where do the tests of RAKE vs TF-IDF go? I noticed there's a tests folder, does it go there?

@BALaka-18
Copy link
Owner Author

The code for keyword extraction using TF-IDF goes into tfidf_vectorizer/extract_keywords_tfidf_scratch.py. Where do the tests of RAKE vs TF-IDF go? I noticed there's a tests folder, does it go there?

yes, create a separate file in the tests folder.

@2bit-hack
Copy link
Contributor

So far, I've implemented tf-idf and written a main.py file for testing its usage. I've also created a file in tests which runs both RAKE and tf-idf against a document (RAKE is being given just one document, tf-idf is also fed a small corpus), prints the results from extracting keywords, and prints the runtime of both algorithms. I'm a bit confused by the precision scoring thing though. Should I manually assign keywords/phrases/both? If, for instance, RAKE extracts 'object oriented' and tf-idf extracts 'object' (RAKE generally extracts phrases, tf-idf extracts words only), do I score them equally? I'm not quite sure how exactly the F-scores are being calculated.

@BALaka-18
Copy link
Owner Author

@2bit-hack Basically you have to toil a bit. What they have done in the paper is that, they have marked out keywords manually. Then they ran the algorithms, and checked which algorithm gave results closer to the ones they had manually derived, and calculated the precision score (TP/TP+FP).

Like, say according to you, 'object oriented' is a keyword(it makes more sense), so you have noted it down. Now according to RAKE, it gave 'object oriented' but TF-IDF gave 'object'. So for RAKE, it is a True Positive, but for TF-IDF, it's a False Positive.
Iterate for all keywords, and calculate the scores for the two algorithms.

@2bit-hack
Copy link
Contributor

@BALaka-18 I've sent a PR, could you please take a look?

BALaka-18 pushed a commit that referenced this issue Oct 2, 2020
* Add tfidf implementation written from scratch

* Add example usage of tfidf implementation

* Fix 'attempted relative import with no known parent package' error

* Implement precision score

* Restyled by autopep8

* Restyled by black

* Restyled by isort

* Restyled by reorder-python-imports

* Restyled by whitespace

* Restyled by yapf

* Add tfidf implementation written from scratch and test against rake

Add example usage of tfidf implementation

Fix 'attempted relative import with no known parent package' error

Implement precision score

Restyled by autopep8

Restyled by black

Restyled by isort

Restyled by reorder-python-imports

Restyled by whitespace

Restyled by yapf

Attempt to make requested changes

Co-authored-by: Restyled.io <commits@restyled.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code code based issue Hacktoberfest This issue is under Hacktoberfest 2020 hard hard issues
Projects
None yet
2 participants