TODO.rst

Things to be done

  • [ ] We need to decouple the dataset from the classifiers.
    We can follow scikit's dataset.target and dataset.data approach. This will make it easier for MI to deal with data before classification.
  • [x] The best way is to implement the dataset as a Vector Space,
    since this is an IR library.
    • The Vector Space will look like scikit's dataset. [see above]
    • Add a function to convert each document, or all documents, to TF and/or IDF.
    • Let's offer a way to serialize new queries; however, there is no need to put queries in a Vector Space as we do now.
  • [ ] We need to add pruning and MI (Mutual Information) to our code again.
    Use it to skip columns from the VSM as well.
  • [x] We need to add basic TF-IDF search capabilities to our Vector Space.
    Both Euclidean and Cosine distances should be added here.
  • [ ] We need a way to dump the VSM into a file (pickle) and read it back.
    Do padding automatically if it has not been done before comparisons or TF-IDF.
  • [ ] Add statistics to the VSM, i.e. most frequent terms, histograms, etc.
    We will probably add a special class for that; MI can go here too.
  • [ ] We need to implement Ye's shapelet classifier.
    Probably implement it as a standalone project, not here in irlib.
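
Several of the items above (a Vector Space with scikit-style data and target, automatic padding, TF-IDF conversion, Euclidean and Cosine search, and a pickle dump) can be sketched together. This is only a rough illustration and not irlib's actual API: the class name VectorSpace, every method name, and the unsmoothed log(N/df) IDF weighting are assumptions made for the example.

```python
import math
import pickle
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


class VectorSpace:
    """Hypothetical container: rows of term counts (data) plus labels (target)."""

    def __init__(self):
        self.terms = []    # column index -> term
        self.index = {}    # term -> column index
        self.data = []     # one raw term-count row per document
        self.target = []   # one class label per document

    def add_doc(self, tokens, label=None):
        for term in tokens:
            if term not in self.index:
                self.index[term] = len(self.terms)
                self.terms.append(term)
        row = [0] * len(self.terms)
        for term, count in Counter(tokens).items():
            row[self.index[term]] = count
        self.data.append(row)
        self.target.append(label)

    def pad(self):
        """Extend older, shorter rows with zeros so all rows share one width."""
        width = len(self.terms)
        for row in self.data:
            row.extend([0] * (width - len(row)))

    def idf(self):
        """Unsmoothed log(N / df) per column (an assumed weighting)."""
        self.pad()  # padding happens automatically before any computation
        n = len(self.data)
        weights = []
        for j in range(len(self.terms)):
            df = sum(1 for row in self.data if row[j] > 0)
            weights.append(math.log(n / df) if df else 0.0)
        return weights

    def query_vector(self, tokens):
        """Count vector for a query; terms unknown to the space are ignored."""
        row = [0] * len(self.terms)
        for term, count in Counter(tokens).items():
            if term in self.index:
                row[self.index[term]] = count
        return row

    def search(self, tokens, metric=cosine_distance):
        """Return document indices ordered from nearest to farthest."""
        idf = self.idf()
        q = [tf * w for tf, w in zip(self.query_vector(tokens), idf)]
        scored = []
        for i, row in enumerate(self.data):
            d = [tf * w for tf, w in zip(row, idf)]
            scored.append((metric(q, d), i))
        return [i for _, i in sorted(scored)]

    def dumps(self):
        """Pickle the plain state; the bytes can be written to a file."""
        self.pad()
        state = {"terms": self.terms, "data": self.data, "target": self.target}
        return pickle.dumps(state)

    @classmethod
    def loads(cls, blob):
        state = pickle.loads(blob)
        vs = cls()
        vs.terms = state["terms"]
        vs.index = {term: j for j, term in enumerate(vs.terms)}
        vs.data = state["data"]
        vs.target = state["target"]
        return vs
```

With this layout a classifier would only ever read vs.data and vs.target, and a query is converted on the fly by query_vector instead of being stored in the space. A call such as vs.search(["apple"], metric=euclidean) switches the distance without touching the stored counts.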