CS290 Spring 2017

zuozhiw edited this page Aug 29, 2017 · 14 revisions

CS290: Text Analytics in the Big Data Era

Spring 2017, Department of Computer Science, UC Irvine

  • Instructor: Prof. Chen Li
  • Lecture time: Wednesday 3:30-5:00 pm, DBH 3011

Goal:

  • Gain hands-on experience building a system to manage large amounts of text information
  • Study research challenges related to text and data management
  • Form teams to do a group project; learn tools and skills to manage a software project.

Schedule

No. Date Topics Todos
01 04/05/2017 Running GUI, Use cases, Task assignments Make the GUI work on your data; Initial design Google Doc linked on a GitHub issue
02 04/12/2017 Status update (1) Medline team: Modify the backend to let DictionaryMatcher also accept a file as input; (2) Twitter team: Add a sentiment analysis module/operator (work with @zuozhiw); use Stanford NLP to split a document into sentences; (3) ProposalReport team: Wrap up the current Chinese proposal data with a query plan and move on to the next dataset and task; (4) LegalDoc team: modify the Join operator (JoinDistancePredicate) to exclude joined spans that are completely contained by other spans, and implement a PDF-to-text operator; (5) SmartGui team: modify the RelationManager to expose the metadata to the texera-web server
03 04/17/2017 Status update (1) SmartGui team: implement autocomplete using the new GUI in the zuozhi-demo-base branch; (2) Medline team: implement the new file-based dictionary using the new engine (already in master); implement a PDF2Text operator; implement a regex operator using earlier labeled entities; (3) Twitter team: implement an NlpSentenceSplitter operator; (4) LegalDoc team: implement a regex operator using earlier labeled entities; (5) ProposalReport team: implement an operator to write results to an Excel file.
04 04/26/2017 Status update (1) SmartGui team: implement an interface to upload dictionaries to the backend so that they are persisted; (2) Medline team: Continue developing an operator to support regex with labeled variables; (3) Twitter team: finish the NlpSentenceSplitter operator and look for other NLP packages for tweets; (4) LegalDoc team: design the regex operator with variables; (5) ProposalReport team: finish the ExcelFileSink operator, and implement an AsterixDB Sink operator.
05 05/03/2017 Status update (1) Implement the SentenceSplitter operator with a flag (one tuple with a spanlist or multiple tuples); then talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) To support Python in Texera, implement a simple operator (e.g., "string-length()") using two different architectures, and evaluate the development experience and performance; (3) Finish the FileReader operator for different file formats; (4) Finish an operator to write results to an Excel/CSV file; (5) Finish the first implementation of RegexMatcher with variables, and think about how to improve its performance and expressive power; (6) SmartGUI: finish a PR of the backend with MetaData, and do another PR for the frontend autocomplete; (7) Implement an operator of sentiment analysis based on Emojis.
06 05/10/2017 Status update (1) Finish the SentenceSplitter operator; then talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) Implement a simple operator based on NLTK (in Python) using two different architectures, and evaluate the development experience and performance; (3) Finish an AsterixDB reader and writer; (4) Finish the first implementation of RegexMatcher with variables; (5) Improve its performance and expressive power; (6) SmartGUI: finish a PR of the backend with MetaData, and do another PR for the frontend autocomplete; (7) Implement an operator of sentiment analysis based on Emojis.
07 05/17/2017 Status update (1) Finish the SentenceSplitter operator; talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) Implement a simple operator based on NLTK (in Python) using two different architectures, and evaluate the development experience and performance; (3) Finish an AsterixDB writer; (4) Finish the first implementation of RegexMatcher with variables; (5) Improve its performance by evaluating a subclass of regexes (those without qualifiers) without building an automaton; (6) Finish the PR for the frontend autocomplete; (7) Start implementing a UI to upload a dictionary; (8) Implement a sentiment analysis operator based on Emojis.
08 05/24/2017 Status update (1) Finish the SentenceSplitter operator; talk to Prof. Huang to make similar changes to the RegexSplitter operator; (2) Implement a simple operator based on NLTK (in Python) using two different architectures, and evaluate the development experience and performance; (3) Finish the first implementation of RegexMatcher with variables; (4) Improve its performance by evaluating a subclass of regexes (those without qualifiers) without building an automaton; (5) Start implementing a UI to upload a dictionary; (6) Implement a sentiment analysis operator based on Emojis.
09 05/31/2017 Status update Finish the pending PRs, and prepare for the integration hackathon next Wednesday!

Prerequisites:

  • Desire to learn and build a real open source system;
  • Familiarity with Java;
  • Hands-on system-building experience;
  • Eagerness to solve open problems;
  • (Optional but a big plus) Having taken CS222 or CS221.

Software Tools:

  • Java
  • Maven
  • Git
  • Wiki
  • Issue tracking

Project Protocol:

  • Do not add large files to git. Check GitHub's guidance for details.
  • Write high-quality code.
  • Do high-quality peer reviews.
  • Write good documentation using the GitHub wiki.
  • Drawing diagrams: Use Google Drawings. Add diagram source files to Google Drive and change the ownership to "texeraproject AT gmail.com". Add authors to each diagram, and include the source file link on the wiki. Here is an example.
  • Use the "sandbox/" folder on git for your own experiments. Name your folder under "sandbox/" using the format "[firstname]-[lastname]" (all lower case).
  • Use GitHub Issues to manage tasks and bugs.

Project Lead:

Chen Li