Text preprocessing #13

@bcdasilv

The overall process is: first, preprocess the comments (e.g. remove HTML tags and code snippets); second, adapt the text (review comments) to the constraints of the sentiment/emotion analysis API in use (e.g. check comment length); third, make the API calls and store the results.

In this issue, you're supposed to work on the preprocessing step. Skim a few code review comments to see what they look like. They probably contain a lot of noise for the sentiment/emotion analysis tools we use, for instance HTML/Markdown tags and code snippets.
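As a starting point, the cleanup described above could look roughly like the sketch below. It uses only the standard library and assumes the comments are GitHub-style Markdown with occasional inline HTML. The function name `clean_comment` and the exact set of patterns are illustrative assumptions, not something already in the repository, and the regular expressions are heuristics rather than a full Markdown/HTML parser.

```python
import html
import re

# Fenced code blocks (three backticks ... three backticks), possibly multi-line.
FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)
# <pre>/<code> elements together with their contents.
HTML_CODE = re.compile(r"<(pre|code)\b[^>]*>.*?</\1>", re.DOTALL | re.IGNORECASE)
# Inline code spans such as `foo()`.
INLINE_CODE = re.compile(r"`[^`\n]+`")
# Any remaining HTML tag, e.g. <p> or </div>; "a < b" is left alone.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")
# Markdown links [text](url): keep only the link text.
MD_LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")
# Bold/strikethrough markers and leading quote markers.
MD_MARKUP = re.compile(r"\*\*|__|~~|^\s*>\s?", re.MULTILINE)


def clean_comment(text: str) -> str:
    """Return a plain-text version of a review comment (hypothetical helper)."""
    text = FENCED_CODE.sub(" ", text)   # drop code blocks first
    text = HTML_CODE.sub(" ", text)
    text = INLINE_CODE.sub(" ", text)
    text = HTML_TAG.sub(" ", text)      # then strip leftover tags
    text = MD_LINK.sub(r"\1", text)
    text = MD_MARKUP.sub("", text)
    text = html.unescape(text)          # &lt; becomes <, &amp; becomes &
    return " ".join(text.split())       # collapse whitespace and newlines
```

Code is removed before tags so that markup inside snippets never reaches the tag pattern. Single-asterisk emphasis, tables, and images are deliberately not handled here and would need extra patterns if they turn out to be common in the dataset.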

Then, write a script that reads the code review comments from the dataset, preprocesses them, and produces a cleaned version of the comments that is ready for sentiment/emotion analysis.

Make sure to store the processed comments.
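A minimal sketch of such a script is shown below. It assumes the dataset is a CSV file with a text column called `body`; the dataset format is not specified in this issue, so the column name, the function name `process_comments`, and the file layout are all placeholders to adapt. The cleaned text is written to a new column next to the original so the raw comment is never lost.

```python
import csv
from pathlib import Path
from typing import Callable


def process_comments(src: Path, dst: Path, preprocess: Callable[[str], str],
                     text_column: str = "body") -> int:
    """Copy src to dst, adding a '<text_column>_clean' column. Returns the row count.

    The CSV layout and the 'body' column name are assumptions for illustration.
    """
    clean_column = f"{text_column}_clean"
    with src.open(newline="", encoding="utf-8") as fin, \
         dst.open("w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        fieldnames = list(reader.fieldnames or []) + [clean_column]
        # extrasaction="ignore" skips stray extra cells in malformed rows.
        writer = csv.DictWriter(fout, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        count = 0
        for row in reader:
            # Missing or empty comments are treated as an empty string.
            row[clean_column] = preprocess(row.get(text_column) or "")
            writer.writerow(row)
            count += 1
    return count
```

It could be called as `process_comments(Path("comments.csv"), Path("comments_clean.csv"), str.strip)`, with the file names being hypothetical and `str.strip` standing in for whatever preprocessing function we end up writing.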

This preprocessing step might vary with the sentiment/emotion analysis approach in use. For instance, EMTk may remove HTML tags internally whereas IBM Tone Analyzer may not. This is something to clarify as we work on the issue.
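One way to keep that flexibility is to make each cleanup step switchable per tool, as in the sketch below. The option names, the tool keys, and especially the flag values are placeholders: whether EMTk or IBM Tone Analyzer actually strips HTML internally has not been verified yet, so the table must be filled in once that is clarified.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PreprocessOptions:
    """Which cleanup steps to run (hypothetical option set)."""
    strip_html: bool = True
    strip_code: bool = True


# Placeholder values only: the real behaviour of each tool is still to be confirmed.
TOOL_OPTIONS = {
    "emtk": PreprocessOptions(strip_html=False),
    "ibm_tone_analyzer": PreprocessOptions(strip_html=True),
}


def preprocess_for(tool: str, text: str) -> str:
    """Apply only the steps enabled for the given tool; unknown tools get the defaults."""
    opts = TOOL_OPTIONS.get(tool, PreprocessOptions())
    if opts.strip_code:
        # Remove fenced code blocks, possibly spanning several lines.
        text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)
    if opts.strip_html:
        # Remove HTML tags but keep the text between them.
        text = re.sub(r"</?[A-Za-z][^>]*>", " ", text)
    return " ".join(text.split())
```

Keeping the per-tool differences in one table means a single script can produce one processed version of the comments per analysis tool without duplicating the cleanup logic.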
