Comparison against TF-IDF Vectorizer (from scratch) #8

BALaka-18 · 2020-09-25T18:02:01Z

Description

TF-IDF is one of the most famous algorithms when it comes to keyword extraction from text. Your task is to create a function that will extract keywords from text using the TF-IDF algorithm and compare the results against this library. How similar / different are the results ?

NOTE : You have to build the Tf-idf algorithm for keyword extraction from scratch. You will then compare its performance against sklearn's TfidfVectorizer and rake_new2.

For reference :

For your reference, you may read this link

Folder Structure, Function details

Create a folder tfidf_vectorizer in the root directory. The folder must contain a .py file that will contain the function for extracting the keywords from text using the Tfidf algorithm written from scratch.

Structure : tfidf_vectorizer/extract_keywords_tfidf_scratch.py

Acceptance Criteria

Code must be properly formatted.
Code must be accompanied by appropriate comments.
File structure must be strictly maintained.
Test cases must be present at the end of the code.
Variables and functions must be properly named
IMPORTANT : Make sure requirements.txt file is updated if you are including any new library.
All instructions provided in the Description must be strictly followed.

Definition of Done

All of the required items are completed.
Approval by 1 mentor.

Time Estimation

2.5-3 hours (or more if needed)

The text was updated successfully, but these errors were encountered:

2bit-hack · 2020-09-27T13:47:05Z

Is it okay if I work on this issue during Hacktoberfest?

BALaka-18 · 2020-09-27T13:55:04Z

Is it okay if I work on this issue during Hacktoberfest?

Yes sure @2bit-hack . Just make sure the PR is made after Oct 1.
Assigning it to you.

2bit-hack · 2020-09-27T18:47:32Z

I have a few questions regarding the implementation:

Will there be a corpus of documents provided against which to run the TF-IDF algorithm?
(Something like get_keywords(document, documents, stopwords)?)
Will these documents be text files or plaintext strings (or potentially both)?
How do I measure performance / accuracy of the implementation versus the other algorithms? Do I perform a diff?

BALaka-18 · 2020-09-27T21:26:00Z

I have a few questions regarding the implementation:

Will there be a corpus of documents provided against which to run the TF-IDF algorithm?
(Something like get_keywords(document, documents, stopwords)?)

Will these documents be text files or plaintext strings (or potentially both)?

How do I measure performance / accuracy of the implementation versus the other algorithms? Do I perform a diff?

You have to run the tf-idf algorithm on only one document. That document can be a large corpus, or an article, or just 3 lines of text. The function will take in the document, the stopwords and the max_num of keywords the user wants to extract.
If max_num == None, then extract all keywords.
Else, extract max_num keywords.
Default : None
The function will return the keywords along with their tf-idf weights.
Answered in the above point.
The major difference that you will find is the number of keywords extracted, as tf-idf will go over all words but rake_new2 returns phrases mostly. So, in order to compare :
- Choose a small text to work one. Using your knowledge, mark out what according to you are key words/phrases
in the text.
- Extract keywords using both rake_new2 and tf-idf and calculate their recall / precision score (against the
keywords you manually extracted). READ THIS PAPER TO GET AN IDEA OF WHAT I MEAN TO SAY.
- Time both the algorithms (using Python's time library) to see which algorithm extracts keywords faster.

2bit-hack · 2020-09-28T12:51:22Z

tf-idf is a corpus based algorithm whereas RAKE can work on single documents. Like, if I have a collection of documents, I can figure out the tf-idf scores for words in one document against all the documents in the corpus. But if there's just one document, will tf-idf still work?

BALaka-18 · 2020-09-28T13:52:51Z

tf-idf is a corpus based algorithm whereas RAKE can work on single documents. Like, if I have a collection of documents, I can figure out the tf-idf scores for words in one document against all the documents in the corpus. But if there's just one document, will tf-idf still work?

See, say you take a corpus of N documents. Like, N Stackoverflow posts for example. Your IDF will depend on all N posts. Do the calculation. Then choose one post, say post_3. Run TF-IDF and RAKE for that chosen document and compare.

2bit-hack · 2020-09-28T13:55:46Z

Oh, okay. So do I change the function signature to be something like def keywords(docs, selected_doc, stopwords)?

BALaka-18 · 2020-09-28T13:58:08Z

Oh yes..thanks for pointing that out. Yes you change the function to include the corpus.
Make sure you include the max_num parameter.

2bit-hack · 2020-09-28T14:00:35Z

Alright, thank you so much!

2bit-hack · 2020-09-30T07:23:59Z

The code for keyword extraction using TF-IDF goes into tfidf_vectorizer/extract_keywords_tfidf_scratch.py. Where do the tests of RAKE vs TF-IDF go? I noticed there's a tests folder, does it go there?

BALaka-18 · 2020-09-30T12:41:59Z

The code for keyword extraction using TF-IDF goes into tfidf_vectorizer/extract_keywords_tfidf_scratch.py. Where do the tests of RAKE vs TF-IDF go? I noticed there's a tests folder, does it go there?

yes, create a separate file in the tests folder.

2bit-hack · 2020-09-30T14:14:35Z

So far, I've implemented tf-idf and written a main.py file for testing its usage. I've also created a file in tests which runs both RAKE and tf-idf against a document (RAKE is being given just one document, tf-idf is also fed a small corpus), prints the results from extracting keywords, and prints the runtime of both algorithms. I'm a bit confused by the precision scoring thing though. Should I manually assign keywords/phrases/both? If, for instance, RAKE extracts 'object oriented' and tf-idf extracts 'object' (RAKE generally extracts phrases, tf-idf extracts words only), do I score them equally? I'm not quite sure how exactly the F-scores are being calculated.

BALaka-18 · 2020-09-30T15:42:14Z

@2bit-hack Basically you have to toil a bit. What they have done in the paper is that, they have marked out keywords manually. Then they ran the algorithms, and checked which algorithm gave results closer to the ones they had manually derived, and calculated the precision score (TP/TP+FP).

Like, say according to you, 'object oriented' is a keyword(it makes more sense), so you have noted it down. Now according to RAKE, it gave 'object oriented' but TF-IDF gave 'object'. So for RAKE, it is a True Positive, but for TF-IDF, it's a False Positive.
Iterate for all keywords, and calculate the scores for the two algorithms.

2bit-hack · 2020-09-30T19:09:52Z

@BALaka-18 I've sent a PR, could you please take a look?

* Add tfidf implementation written from scratch * Add example usage of tfidf implementation * Fix 'attempted relative import with no known parent package' error * Implement precision score * Restyled by autopep8 * Restyled by black * Restyled by isort * Restyled by reorder-python-imports * Restyled by whitespace * Restyled by yapf * Add tfidf implementation written from scratch and test against rake Add example usage of tfidf implementation Fix 'attempted relative import with no known parent package' error Implement precision score Restyled by autopep8 Restyled by black Restyled by isort Restyled by reorder-python-imports Restyled by whitespace Restyled by yapf Attempt to make requested changes Co-authored-by: Restyled.io <commits@restyled.io>

BALaka-18 added code code based issue Hacktoberfest This issue is under Hacktoberfest 2020 hard hard issues labels Sep 25, 2020

BALaka-18 assigned 2bit-hack Sep 27, 2020

This was referenced Sep 30, 2020

Implement tf-idf from scratch and compare against RAKE #17

Closed

Comparison against TF-IDF Vectorizer (from scratch) #18

Closed

2bit-hack mentioned this issue Sep 30, 2020

Comparison against TF-IDF Vectorizer (from scratch) #8 #20

Merged

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comparison against TF-IDF Vectorizer (from scratch) #8

Comparison against TF-IDF Vectorizer (from scratch) #8

BALaka-18 commented Sep 25, 2020

2bit-hack commented Sep 27, 2020

BALaka-18 commented Sep 27, 2020

2bit-hack commented Sep 27, 2020

BALaka-18 commented Sep 27, 2020

2bit-hack commented Sep 28, 2020 •

edited

Loading

BALaka-18 commented Sep 28, 2020

2bit-hack commented Sep 28, 2020

BALaka-18 commented Sep 28, 2020 •

edited

Loading

2bit-hack commented Sep 28, 2020

2bit-hack commented Sep 30, 2020

BALaka-18 commented Sep 30, 2020

2bit-hack commented Sep 30, 2020

BALaka-18 commented Sep 30, 2020

2bit-hack commented Sep 30, 2020

Comparison against TF-IDF Vectorizer (from scratch) #8

Comparison against TF-IDF Vectorizer (from scratch) #8

Comments

BALaka-18 commented Sep 25, 2020

Description

For reference :

Folder Structure, Function details

Acceptance Criteria

Definition of Done

Time Estimation

2bit-hack commented Sep 27, 2020

BALaka-18 commented Sep 27, 2020

2bit-hack commented Sep 27, 2020

BALaka-18 commented Sep 27, 2020

2bit-hack commented Sep 28, 2020 • edited Loading

BALaka-18 commented Sep 28, 2020

2bit-hack commented Sep 28, 2020

BALaka-18 commented Sep 28, 2020 • edited Loading

2bit-hack commented Sep 28, 2020

2bit-hack commented Sep 30, 2020

BALaka-18 commented Sep 30, 2020

2bit-hack commented Sep 30, 2020

BALaka-18 commented Sep 30, 2020

2bit-hack commented Sep 30, 2020

2bit-hack commented Sep 28, 2020 •

edited

Loading

BALaka-18 commented Sep 28, 2020 •

edited

Loading