Meeting Minutes Jan 29 18

jbbela edited this page Jan 30, 2018 · 1 revision

2018-01-29

Location: ATH-452
Time: 3:30 - 4:00
Duration: 30 min

Discussion with Eleni

  • Expectation: the project should deliver more than a simple search with results
  • Clarification:
    • 'People' is a very nice and understandable quantity
    • However, a search for 'Discussions with USRI evaluations' should also detect 'teaching evaluations', 'teaching evaluations reform', 'quality of instruction reform'
  • WANT: n-grams (phrases); names from spreadsheets
    • There are other keywords hidden in text (n-grams)
    • Have to analyze source text
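
The n-gram idea above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's implementation: the function names `ngrams` and `count_phrases` and the simple tokenizer are assumptions made for the example, and NLTK offers similar utilities.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Return all n-word phrases in the text, lowercased."""
    # Assumed tokenizer: runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_phrases(text, max_n=3):
    """Count every phrase of 1 to max_n words in the text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(text, n))
    return counts
```

For example, `ngrams("Teaching evaluations reform", 2)` yields the phrases 'teaching evaluations' and 'evaluations reform', which a single-keyword search would miss.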

Diagram

  • Minutes are ‘probably’ in the spreadsheets
  • Agenda and minutes are 1:1; minutes and attachments are 1:n
  • ‘Key phrases’ come from item, from attachments, from minutes
  • For each item => agenda(i), minutes(i-1), attachments(i)
  • Look for OUTLINE OF ISSUE --> (next) OUTLINE OF ISSUE
    • Between OUTLINE OF ISSUE should be 'PDF blob' we return as search result
  • Take attachments and cross-reference with items in agenda
  • Example: Question gets submitted a week before => secretary does something, etc., something that was developed outside => gets submitted to president (attachments come from all different places)
  • e.g., Stroulia + USRIs + APC => find all USRIs in APC where Stroulia was present
  • Either:
    • We can find them as words: NLTK (NLP)
    • Entity recognition, extract entities trained to recognize more interesting key phrases: IBM Watson, Stanford
  • Search PDFs; parsing PDFs is risky
  • Elasticsearch -> free text; parses and indexes everything
  • NLTK vs. Elasticsearch -> algorithm for counting words vs. indexing
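
A rough sketch of the OUTLINE OF ISSUE splitting and the Stroulia + USRIs + APC style lookup, assuming the PDF text has already been extracted to a plain string. The names `split_items` and `find_blobs` are hypothetical, and treating the last blob as running to the end of the text is an assumption, since the notes only describe text between two markers.

```python
import re

MARKER = re.compile(r"OUTLINE OF ISSUE")

def split_items(text):
    """Split extracted text into one blob per OUTLINE OF ISSUE marker."""
    starts = [m.start() for m in MARKER.finditer(text)]
    blobs = []
    for i, start in enumerate(starts):
        # Assumption: the final blob ends where the text ends.
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        blobs.append(text[start:end].strip())
    return blobs

def find_blobs(blobs, *terms):
    """Return the blobs that mention every term (case-insensitive)."""
    wanted = [t.lower() for t in terms]
    return [b for b in blobs if all(t in b.lower() for t in wanted)]
```

With this, `find_blobs(split_items(text), "Stroulia", "USRI")` would return only the items that mention both. A real version would also need to record which committee and meeting each blob came from.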

Suggestions

  • Start with Elasticsearch (find boxes with text); once you have the boxes, find attachments (for each piece of text found, need to know where it is coming from)
  • Visualization: ‘I found this 10 times; found 6 in December, found 4 times in January.’
  • HIGHLIGHT BOXES WITH SOME REFERENCE OR PLACE IN PICTURE
  • Every item will have a 'blob' (all text associated with item from attachments => will have description)
  • Spend more time on Visualization and UI
  • Filters!!! i.e., show everything with a question attached to it; show what happened in APC, GFC; where it got elaborated, where it got slowed down
  • AGENDA, MINUTES, ATTACHMENTS => technical user story “As a PDF parser…. [find key phrases], [calculate distinctiveness]”
  • Keyphrase = something that is sufficiently frequent but not evenly distributed across all documents (covered by NLTK)
  • NLTK -> lots of resources
  • Somebody has to develop data ingestion, somebody else does Elasticsearch, somebody else does the user interface, etc.
  • “PROBLEMS”:
    • PDFs
  • To consider:
    • Find: “Stroulia” said something
    • Next: USRI scenario (get teaching quality, etc.)
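
One possible reading of 'sufficiently frequent but not evenly distributed', plus the per-month counts for the visualization, is sketched below. The TF-IDF-style score is an assumption chosen for illustration, not something decided in the meeting; `distinctiveness` and `counts_by_month` are hypothetical names, and plain substring counting stands in for proper tokenization.

```python
import math
from collections import Counter

def distinctiveness(docs, phrase):
    """Score a phrase: high if frequent overall but present in few documents."""
    # Simplification: substring counting, so 'usri' also matches 'usris'.
    counts = [doc.lower().count(phrase.lower()) for doc in docs]
    total = sum(counts)
    doc_freq = sum(1 for c in counts if c > 0)
    if doc_freq == 0:
        return 0.0
    # A phrase found in every document scores 0 because log(1) == 0.
    return total * math.log(len(docs) / doc_freq)

def counts_by_month(dated_docs, phrase):
    """dated_docs: iterable of (month_label, text). Returns hits per month."""
    result = Counter()
    for month, text in dated_docs:
        result[month] += text.lower().count(phrase.lower())
    return dict(result)
```

The output of `counts_by_month` maps directly onto the 'found 6 in December, found 4 times in January' style of summary.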

The Next Level

  • Knowledge Graphs
  • Get keyphrases and, at some point in time, do further analysis to figure out that ‘USRIs’ is a type of teaching evaluation.
  • ‘University’ is about ‘teaching’
  • Extract meaningful phrases
  • In the knowledge graph, linking ‘teaching evaluation’ and ‘USRIs’ lets the system figure out which teaching evaluations a text is talking about
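
As a very small sketch of the knowledge graph idea, the 'is a type of' relation can be stored as a mapping and used to expand a search term. The class `KnowledgeGraph` and its methods are hypothetical and only illustrate the concept; a real system would more likely use an existing graph store or entity-recognition service.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Tiny 'is a type of' graph; terms are compared case-insensitively."""

    def __init__(self):
        # Maps a broader term to the set of terms recorded as types of it.
        self.narrower = defaultdict(set)

    def add_is_a(self, narrow, broad):
        """Record that `narrow` is a type of `broad`."""
        self.narrower[broad.lower()].add(narrow.lower())

    def expand(self, term):
        """Return the term plus every narrower term reachable from it."""
        seen = set()
        stack = [term.lower()]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self.narrower.get(current, ()))
        return seen
```

After `add_is_a("USRIs", "teaching evaluation")`, a search for 'teaching evaluation' can be expanded to also match 'usris'.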

Q & A

  • Where are these PDFs?
    • PDFs are available on the web and we have to scrape them (suggested: wget)
  • Can we get access to SharePoint?
    • No, SharePoint access is not available
  • When the system is in use, will it still get PDFs from the website?
    • Our system is a prototype to demonstrate; we don’t care where the PDFs come from, just feed it PDFs
