Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #20: Added basic summarize functionality #66

Merged
merged 13 commits into from Nov 14, 2019

Conversation

@ThomasLuu00
Copy link
Collaborator

ThomasLuu00 commented Nov 7, 2019

Fixes #20

Added a summarizer that uses gensim's TextRank Summarizer in python.

I included code that allows it to be called from the node.js framework just to test it and to make it easier for whoever wants to call the function.

It isn't good, but it's a starting point. I will continue working on a document summary model.

@Reza-Rajabi Reza-Rajabi added this to In progress/Review in Main Nov 7, 2019
Copy link
Contributor

humphd left a comment

Let's not take on a python dependency if we don't need to. There are a few dozen text summarization modules on npm, let's try and use those.

Next, we can get rid of express here. We don't need a web server to do this, just a function interface. You can model your code on what @jatinAroraGit is doing in #46. Basically, we need to export an async run function that accepts some text, and gives back the result via a Promise.

Also, we should be able to include a test now that the Jest code is in place. You can either do that in this PR, or file another bug for someone to work on that adds one. The test should, at the least, call your function with some snippet of text, and expect to get back results of the form we'll expect at runtime.

Copy link
Collaborator

ImmutableBox left a comment

Need to run npm eslint-fix and rerun npm test to see if all errors have been corrected.

Copy link
Contributor

humphd left a comment

This is taking shape! I left some more comments.

const { SummarizerManager } = require('node-summarizer');

module.exports = async function (text, numSentences = 3) {
const Summarizer = new SummarizerManager(text, numSentences);

This comment has been minimized.

Copy link
@humphd

humphd Nov 11, 2019

Contributor

const summarizer = ...

Use lower case to start a variable name, only use a capital for a Constructor function name.

const fs = require('fs');
const summarize = require('./summarize');

describe('summarize tests...', () => {

This comment has been minimized.

Copy link
@humphd

humphd Nov 11, 2019

Contributor

We need to convert this to Jest format (i.e., test() vs. describe())

This comment has been minimized.

Copy link
@ThomasLuu00

ThomasLuu00 Nov 11, 2019

Author Collaborator

I am unable to nest tests in Jest, I use describe to define a scope just in case I wish to add other types of test cases into the file. It isn't really needed at the moment so I could just remove describe,.

const summarize = require('./summarize');

describe('summarize tests...', () => {
const text = fs.readFileSync('./summarizer/text.txt', 'utf-8');

This comment has been minimized.

Copy link
@humphd

humphd Nov 11, 2019

Contributor

This should go in a before test setup call. You can also drop the readFileSync and just use readFile, since we can use async calls in a test setup function.


it('summarize returns string', async () => {
data = await summarize(text, 5);
expect(typeof data).toBe('string');

This comment has been minimized.

Copy link
@humphd

humphd Nov 11, 2019

Contributor

Should we also test for the exact summary?

This comment has been minimized.

Copy link
@ThomasLuu00

ThomasLuu00 Nov 11, 2019

Author Collaborator

I think it's a good idea, but I'm am not certain I have the best approach to it. I think I could compare how similar the machine summary is to a human summary and have it pass if it meets a threshold, but this requires a good human summary for each blog input.

This comment has been minimized.

Copy link
@humphd

humphd Nov 11, 2019

Contributor

What I mean is: run your code once, get the summary it produces, see if you think it makes sense. If it does, hard code that into your test, so we know in the future if it breaks for this given blog post text. This is really a regression test, to guard against future problems.

This comment has been minimized.

Copy link
@ThomasLuu00

ThomasLuu00 Nov 12, 2019

Author Collaborator

Alright, I added the test.

Copy link
Contributor

humphd left a comment

This is really close.

@@ -6,7 +6,8 @@ module.exports = {
"LICENSE",
"*.*",
"images/*.*",
"!*.css"
"!*.css",
"summarizer/*.txt"

This comment has been minimized.

Copy link
@humphd

humphd Nov 12, 2019

Contributor

This change makes me think we need to file a new issue to adjust how our stylelint works, such that it only looks at files in a particular directory, maybe src/frontend/styles/* or something, and then we don't collide with random files like this.

This comment has been minimized.

Copy link
@ThomasLuu00

ThomasLuu00 Nov 12, 2019

Author Collaborator

Posted as issue #162.

This comment has been minimized.

Copy link
@ThomasLuu00

ThomasLuu00 Nov 12, 2019

Author Collaborator

I closed the issue as duplicate. Did not see the issue you posted.

This comment has been minimized.

Copy link
@humphd

humphd Nov 12, 2019

Contributor

As you know, this is being fixed in #151, so why don't you reset your changes to this file, so you don't have merge conflicts with that:

git checkout master .stylelintrc.js

And you can go back to what it was. It will fail Travis, but we can ignore. By the way, you might want to add another review to #151 so we can land it.

@@ -0,0 +1,23 @@
const fs = require('fs').promises;

This comment has been minimized.

Copy link
@humphd

humphd Nov 12, 2019

Contributor

Let's move this into test/ where the other tests are.

@@ -6,7 +6,8 @@ module.exports = {
"LICENSE",
"*.*",
"images/*.*",
"!*.css"
"!*.css",
"summarizer/*.txt"

This comment has been minimized.

Copy link
@humphd

humphd Nov 12, 2019

Contributor

As you know, this is being fixed in #151, so why don't you reset your changes to this file, so you don't have merge conflicts with that:

git checkout master .stylelintrc.js

And you can go back to what it was. It will fail Travis, but we can ignore. By the way, you might want to add another review to #151 so we can land it.

let expected;

beforeAll(async () => {
inputData = await fs.readFile('./summarizer/input.txt', 'utf-8');

This comment has been minimized.

Copy link
@humphd

humphd Nov 12, 2019

Contributor

Move these files into test/, and/or just inline the contents of the files into this test file so we don't even have to read from disk.

return expect(predicted).toBe(expected);
});

test('summarize returns string', async () => expect(typeof predicted).toBe('string'));

This comment has been minimized.

Copy link
@humphd

humphd Nov 12, 2019

Contributor

Tests can be run in different order, so don't count on predicted to exist here. I'd re-calculate this in each test so you know it's good. Try not to have test() functions depend on shared state. It can lead to random failures.


test('summarize returns string', async () => expect(typeof predicted).toBe('string'));

test('summarize returns correct number of lines', () => expect(predicted.split('.').length).toBe(6));

This comment has been minimized.

Copy link
@humphd

humphd Nov 12, 2019

Contributor

What does predicted look like that you have to call split? I don't follow the logic here.

This comment has been minimized.

Copy link
@ThomasLuu00

ThomasLuu00 Nov 13, 2019

Author Collaborator

'Predicted' should be returning 5 sentences since I pass 5 sentences as a parameter for summarize. So I split 'predicted' in order to get the number of sentences. The module I use requires that a number of sentences be specified to produce a summary.

This comment has been minimized.

Copy link
@humphd

humphd Nov 13, 2019

Contributor

Ah, I see. Defining a sentence as "ends with a ." could be problematic, if a sentence includes something like "I'm working on a file named foo.js in my course." Probably doing /\. {1,2}/ for "period followed by 1 or 2 spaces" would be more accurate. That still leaves out sentences that end with a ;. Anyway, this gets into the weed. If you want to tighten up your split() to deal with the periods too, that's fine. Otherwise, we'll go with what you have here.

@ThomasLuu00 ThomasLuu00 force-pushed the ThomasLuu00:issue-20 branch from 0f22498 to 06c7742 Nov 13, 2019
@ThomasLuu00 ThomasLuu00 requested a review from humphd Nov 13, 2019
Copy link
Contributor

humphd left a comment

Nice, a few little things, and I think this is probably good.


test('summarize returns string', async () => expect(typeof predicted).toBe('string'));

test('summarize returns correct number of lines', () => expect(predicted.split('.').length).toBe(6));

This comment has been minimized.

Copy link
@humphd

humphd Nov 13, 2019

Contributor

Ah, I see. Defining a sentence as "ends with a ." could be problematic, if a sentence includes something like "I'm working on a file named foo.js in my course." Probably doing /\. {1,2}/ for "period followed by 1 or 2 spaces" would be more accurate. That still leaves out sentences that end with a ;. Anyway, this gets into the weed. If you want to tighten up your split() to deal with the periods too, that's fine. Otherwise, we'll go with what you have here.

module.exports = async function (text, numSentences = 3) {
const summarizer = new SummarizerManager(text, numSentences);
return summarizer.getSummaryByRank().then(
(summaryObject) => Promise.resolve(summaryObject.summary),

This comment has been minimized.

Copy link
@humphd

humphd Nov 13, 2019

Contributor

(summaryObject) => summaryObject.summary is all you need here

@ThomasLuu00 ThomasLuu00 force-pushed the ThomasLuu00:issue-20 branch from 53199ad to f304ac6 Nov 13, 2019
@humphd

This comment has been minimized.

Copy link
Contributor

humphd commented Nov 13, 2019

Looks like your regex isn't passing this test anymore. I think it's fine if you want to revert to what you had, so we can land this and improve it later.

@ThomasLuu00

This comment has been minimized.

Copy link
Collaborator Author

ThomasLuu00 commented Nov 13, 2019

I'm not sure why, but the TextRank algorithm in node-summarizer throws out inconsistent results. Sometimes it fails the test but once its rerun it might pass, or just fail again.

@ThomasLuu00

This comment has been minimized.

Copy link
Collaborator Author

ThomasLuu00 commented Nov 13, 2019

So from what I read on the node-summarizer repo, looks like TextRank might not always throw out the same result, but it should still be generating the correct number of sentences each time.

@humphd

This comment has been minimized.

Copy link
Contributor

humphd commented Nov 13, 2019

Fascinating. OK, let's adjust the tests to match the expectations of the library, and include some comments in your tests/code that indicate what your research has uncovered about how it varies with each run.

@ThomasLuu00 ThomasLuu00 force-pushed the ThomasLuu00:issue-20 branch from be43208 to a45dc1d Nov 14, 2019
@humphd
humphd approved these changes Nov 14, 2019
@ThomasLuu00 ThomasLuu00 merged commit fdda764 into Seneca-CDOT:master Nov 14, 2019
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Main automation moved this from In progress/Review to Done Nov 14, 2019
@ThomasLuu00 ThomasLuu00 deleted the ThomasLuu00:issue-20 branch Nov 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Main
Done
5 participants
You can’t perform that action at this time.