Meeting Minutes Jan 29 18
jbbela edited this page Jan 30, 2018 · 1 revision
| Location | Time | Duration |
|---|---|---|
| ATH-452 | 3:30 - 4:00 | 30min. |
- Expectation: More from project than just simple search with results
- Clarification:
- 'People' is a very nice and understandable quantity
- However, a search for 'Discussions with USRI evaluations' should also detect 'teaching evaluations', 'teaching evaluations reform', 'quality of instruction reform'
- WANT: n-grams (phrases); names from spreadsheets
- There are other keywords hidden in text (n-grams)
- Have to analyze source text
- Minutes are ‘probably’ in the spreadsheets
- Agenda and minutes are 1:1; minutes and attachments are 1:n
- ‘Key phrases’ come from item, from attachments, from minutes
- For each item => agenda(i), minutes(i-1), attachments(i)
- Look for the text from one OUTLINE OF ISSUE heading to the next OUTLINE OF ISSUE heading
- The text between two OUTLINE OF ISSUE headings is the 'PDF blob' we return as a search result
- Take attachments and cross-reference with items in agenda
- Example: a question gets submitted a week before => the secretary does something, etc.; something that was developed outside gets submitted to the president (attachments come from all different places)
- e.g., Stroulia + USRIs + APC => find all USRIs in APC where Stroulia was present
- Either:
- We can find them as words: NLTK (NLP)
- Entity recognition: extract entities with tools trained to recognize more interesting key phrases (IBM Watson, Stanford)
- Search PDFs; parsing PDFs is risky
- Elasticsearch -> free text, parses everything, builds an index
- NLTK vs. Elasticsearch -> algorithm for counting words vs. indexing
- Start with Elasticsearch (find boxes with text), then once you have boxes, find attachments (for each piece of text found, need to know where it is coming from)
- Visualization: ‘I found this 10 times; found 6 in December, found 4 times in January.’
- HIGHLIGHT BOXES WITH SOME REFERENCE OR PLACE IN PICTURE
- Every item will have a 'blob' (all text associated with item from attachments => will have description)
- Spend more time on Visualization and UI
- Filters!!! i.e., show everything with a question attached to it, show what happened in APC, GFC; where did it get elaborated, where did it get slowed down
- AGENDA, MINUTES, ATTACHMENTS => technical user story “As a PDF parser…. [find key phrases], [calculate distinctiveness]”
- Keyphrase = something that is sufficiently frequent but not evenly distributed across all documents (covered by NLTK)
- NLTK -> lots of resources
- Somebody has to develop data ingestion, somebody else does elastic search, somebody else does user interface, etc.
- “PROBLEMS”:
- PDFs
- To consider:
- Find: “Stroulia” said something
- Next: USRI scenario (get teaching quality, etc.)
- Knowledge Graphs
- Get keyphrases and at some point do further analysis to figure out that ‘USRIs’ is a type of teaching evaluation.
- ‘University’ is about ‘teaching’
- Extract meaningful phrases
- In the knowledge graph, linking ‘teaching evaluation’ and ‘USRIs’ lets the system figure out which teaching evaluations a text is talking about
- Where are these PDFs?
- PDFs are available on the web and we have to scrape them (suggested: wget)
- Can we get access to SharePoint?
- No, SharePoint access is not available
- When the system is in use, will it still get PDFs from the website?
- Our system is a prototype for demonstration; we don’t care where the PDFs come from, just feed it PDFs
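
The item-splitting idea from the notes (everything between one OUTLINE OF ISSUE heading and the next is one item's blob) could be sketched roughly as follows. This is only a minimal illustration and assumes the PDF text has already been extracted to a plain string; the function name `split_items` and the sample text are hypothetical and were not discussed in the meeting.

```python
import re
from typing import List

# Marker named in the minutes; assumed to appear verbatim in the extracted text.
MARKER = "OUTLINE OF ISSUE"


def split_items(text: str) -> List[str]:
    """Return the text blob that follows each marker, up to the next marker."""
    parts = re.split(re.escape(MARKER), text)
    # parts[0] is whatever precedes the first marker (cover page, headers); drop it.
    return [p.strip() for p in parts[1:] if p.strip()]


# Hypothetical sample standing in for text extracted from an agenda PDF.
sample = (
    "Agenda header\n"
    "OUTLINE OF ISSUE\nItem 1: teaching evaluations reform\n"
    "OUTLINE OF ISSUE\nItem 2: budget update\n"
)
blobs = split_items(sample)
for blob in blobs:
    print(blob)
```

Each returned blob would then be a candidate unit for indexing and for cross-referencing with attachments.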
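
The keyphrase definition above (frequent, but not evenly spread over all documents) can be approximated with a TF-IDF-style score over n-grams. The sketch below is one possible reading, not the method agreed on in the meeting: it uses plain Python rather than NLTK, and the function names `ngrams` and `distinctiveness` as well as the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> List[NGram]:
    """Lowercase the text, tokenize on alphanumerics, and return all n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinctiveness(docs: List[str], n: int = 2) -> Dict[NGram, float]:
    """Score = total count * log(number of docs / docs containing the n-gram).

    An n-gram that occurs in every document scores 0, however frequent it is.
    """
    total: Counter = Counter()
    doc_freq: Counter = Counter()
    for doc in docs:
        counts = Counter(ngrams(doc, n))
        total.update(counts)            # add up occurrences
        doc_freq.update(counts.keys())  # +1 per document containing the n-gram
    return {g: total[g] * math.log(len(docs) / doc_freq[g]) for g in total}


# Hypothetical mini-corpus standing in for item blobs.
docs = [
    "teaching evaluations reform and teaching evaluations policy in the minutes",
    "budget update in the minutes",
    "parking report in the minutes",
]
scores = distinctiveness(docs)
top = max(scores, key=scores.get)
print(top)
```

With this scoring, phrases that appear in every document (such as "in the") drop to zero, while phrases concentrated in a few documents rise to the top.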
