Merged
10 changes: 9 additions & 1 deletion README.md
@@ -11,13 +11,14 @@ Home of the **TakeLab Podium** project.
You will work in a virtual environment and keep a list of required
dependencies in a ```requirements.txt``` file. The master branch of the
project **must** be buildable with passing tests **all the time**.
Code coverage should be kept as high as possible.

```
virtualenv -p python3.6 env
source env/bin/activate
pip install -r requirements.txt
python -m pytest
py.test --cov=takepod test
```

@@ -26,6 +27,13 @@ python -m pytest
Adding a new library to the project should be done via ```pip install
<new_framework>```. **Don't forget to add it to ```requirements.txt```.**

Prefer adding dependencies to the ```requirements.txt``` file manually
instead of using ```pip freeze > requirements.txt```.
See [here](https://medium.com/@tomagee/pip-freeze-requirements-txt-considered-harmful-f0bce66cf895)
for why.
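For illustration, a hand-curated ```requirements.txt``` lists only the
packages the project imports directly (the names and pins below are
examples, not the project's actual list):

```
numpy==1.15.1
pytest==3.7.4
torch==0.4.1
```

```pip freeze```, by contrast, also dumps every transitive dependency,
which makes it hard to tell which packages the project actually uses.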


## Details

The project is packaged according to official Python packaging
122 changes: 122 additions & 0 deletions notes.md
@@ -0,0 +1,122 @@
### Bayesian optimization

- part of the sequential model-based optimization (SMBO) family of algorithms
- multiple Python libraries are already available
- [spearmint](https://github.com/HIPS/Spearmint)
- [smac](https://www.cs.ubc.ca/labs/beta/Projects/SMAC/)
- [Bayesian Optimization](https://github.com/fmfn/BayesianOptimization)
- [hyperopt](https://github.com/hyperopt/hyperopt)
- [pygpgo](http://pygpgo.readthedocs.io/en/latest/)
- [GPyOpt](https://sheffieldml.github.io/GPyOpt/)

Four preconditions for doing BO

1. Objective function: takes an input and returns a loss to minimize
2. Domain space: the range of input values to evaluate
3. Optimization algorithm: the method used to construct the surrogate
function and choose the next values to evaluate
4. Results: (score, value) pairs that the algorithm uses to build the model
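The four preconditions above can be sketched as a toy loop. This is not a
real Bayesian optimizer: the `suggest` function below is a deliberately
naive stand-in for a surrogate model (it just samples near the best point
seen so far); libraries like hyperopt or GPyOpt fit a proper surrogate
instead.

```python
import random

def objective(x):
    # 1. Objective function: a loss to minimize (simple quadratic here).
    return (x - 2.0) ** 2

domain = (-10.0, 10.0)  # 2. Domain space of input values to evaluate

def suggest(results):
    # 3. "Algorithm": naive stand-in for a surrogate -- propose a point
    # near the best one observed so far, clipped to the domain.
    if not results:
        return random.uniform(*domain)
    best_x = min(results)[1]
    return min(max(best_x + random.gauss(0, 1.0), domain[0]), domain[1])

results = []  # 4. Results: (score, value) pairs the algorithm learns from

random.seed(0)
for _ in range(50):
    x = suggest(results)
    results.append((objective(x), x))

best_score, best_x = min(results)
print(best_score, best_x)
```

Swapping `suggest` for a real surrogate (e.g. hyperopt's TPE) is what
turns this skeleton into actual Bayesian optimization.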

Benefits of Bayesian optimization over random search

- usually works better (fewer iterations to reach a good score)
- returns distributions over the searched space (e.g. which learning rates
  are better than others)
- returns the full history of trials

Resources

- Excellent notebook on [Bayesian
optimization](https://github.com/WillKoehrsen/hyperparameter-optimization/blob/master/Introduction%20to%20Bayesian%20Optimization%20with%20Hyperopt.ipynb)
- [Blog post using scikit](https://thuijskens.github.io/2016/12/29/bayesian-optimisation/)


### Active learning

The idea is to label only part of the dataset, intelligently selecting
which instances to label so as to maximize model performance.

Procedure

1. Split the data into a seed set (to be labelled) and an unlabelled set.

2. Train the model on the seed set.

3. Choose unlabelled instances to label based on one of many selection
criteria (pool-based, stream-based), e.g. least confidence.

4. Iterate steps 2 and 3 until some stopping criterion is met.
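Step 3 with pool-based, least-confidence selection can be sketched as
follows; `predict_proba` is a made-up stand-in for a trained model's class
probabilities (in practice you would call e.g. your classifier's
`predict_proba`).

```python
import math

def predict_proba(x):
    # Stand-in for a trained binary classifier: a logistic curve over a
    # single feature. Made up purely for this sketch.
    p = 1.0 / (1.0 + math.exp(-x))
    return [1.0 - p, p]

def least_confidence(pool, n):
    # Pool-based selection: pick the n instances whose most likely class
    # has the lowest probability, i.e. where the model is least confident.
    def confidence(x):
        return max(predict_proba(x))
    return sorted(pool, key=confidence)[:n]

pool = [-3.0, -0.1, 0.05, 2.5, 0.4]
print(least_confidence(pool, 2))  # instances closest to the decision boundary
```

Libraries such as modAL wrap exactly this kind of query strategy behind a
scikit-learn-compatible interface.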


Python libraries

- [google active learning](https://github.com/google/active-learning),
6 sampling strategies
- [modAL](https://cosmic-cortex.github.io/modAL/#introduction)
- [libact](https://github.com/ntucllab/libact), Pool-based
Active Learning in Python
- [acton](https://github.com/chengsoonong/acton)

Resources

- [Tutorial](https://towardsdatascience.com/active-learning-tutorial-57c3398e34d)
- [Beginner datacamp tutorial](https://www.datacamp.com/community/tutorials/active-learning)

### Framework part

- interchanging saved models -- use the ONNX format (or something more general)
- continuous integration using Travis CI

### Development setup

- Python (you will likely use these)
    - [pytest](https://docs.pytest.org/en/latest/getting-started.html):
      a [tutorial](https://semaphoreci.com/community/tutorials/testing-python-applications-with-pytest)
      and [more examples](https://docs.pytest.org/en/latest/example/simple.html)
    - [Decorators in Python](https://realpython.com/primer-on-python-decorators/)
    - [Debugger pdb](https://docs.python.org/2/library/pdb.html)
    - pandas
    - pytorch
- Git
    - [git basics](https://git-scm.com/book/en/v2/Getting-Started-Git-Basics)
    - [how I operate in git](https://medium.com/@fredrikmorken/why-you-should-stop-using-git-rebase-5552bee4fed1)
- Linux (all TakeLab servers have Ubuntu installed)
    - [tmux](https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/)
      / [nohup](https://linux.101hacks.com/unix/nohup-command/)
    - [htop](http://www.deonsworld.co.za/2012/12/20/understanding-and-using-htop-monitor-system-resources/)
    - nvidia-smi
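A typical nohup pattern for long training runs on the servers looks like
this (the `python3 -c` command stands in for a real training script):

```shell
# Run a job immune to hangups so it keeps going after you log out;
# stdout and stderr both go to train.log.
nohup python3 -c 'print("epoch 1 done")' > train.log 2>&1 &
wait            # in practice you would just disconnect instead of waiting
cat train.log   # check progress later, e.g. with tail -f train.log
```

tmux achieves the same survivability with the added benefit of being able
to reattach to a live session.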


Advanced tooling (also a matter of personal preference)

- vim (vimtutor)
- [Tmux + vim](https://blog.bugsnag.com/tmux-and-vim/)


### Some general test principles

- unit tests should run within a few seconds
- unit tests should mock external resources
- integration tests should be run separately from unit tests
- unit tests should have extremely high coverage; integration tests should
  focus only on testing the integration itself
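The mocking principle can be sketched with only the standard library's
`unittest.mock` (`download_dataset` is a made-up example function, not part
of Podium). Integration tests that hit the real resource can then be given
a custom marker such as `@pytest.mark.integration` and deselected with
`pytest -m "not integration"`.

```python
from unittest import mock

def download_dataset(url, http_get):
    # Function under test: fetch a dataset via an injected HTTP client
    # and split the body into lines.
    return http_get(url).text.splitlines()

def test_download_dataset_mocked():
    # Unit test: mock the external resource so the test runs in
    # milliseconds and needs no network access.
    fake_response = mock.Mock(text="line1\nline2")
    http_get = mock.Mock(return_value=fake_response)
    assert download_dataset("http://example.com/d.txt", http_get) == \
        ["line1", "line2"]
    http_get.assert_called_once_with("http://example.com/d.txt")

test_download_dataset_mocked()
```

Injecting the client (rather than importing it at module level) is what
makes the external resource trivial to mock.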

Some general source writing practices

- discern between public and private methods (prefix private names with '\_')
- proposed: write numpydoc-style documentation for public methods
- use flake8 to check code formatting
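A small made-up example of both conventions together: a leading underscore
for private helpers and a numpydoc-style docstring on the public method.

```python
class Vocabulary:
    """Map tokens to integer ids.

    Parameters
    ----------
    tokens : list of str
        Tokens to index, in order of first appearance.
    """

    def __init__(self, tokens):
        self._index = {}  # leading '_' marks this attribute as private
        for token in tokens:
            self._add(token)

    def _add(self, token):
        # Private helper: not part of the public API.
        self._index.setdefault(token, len(self._index))

    def lookup(self, token):
        """Return the id of ``token``.

        Parameters
        ----------
        token : str
            Token to look up.

        Returns
        -------
        int
            Integer id assigned to ``token``.
        """
        return self._index[token]
```

flake8 catches formatting issues in code like this; docstring style itself
can additionally be checked with a plugin such as flake8-docstrings.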

### Onboarding -- week 1 checklist

By the end of week 1 you should have:

- git access (use pull requests)
- ssh access to all TakeLab servers (ask the sysadmins on Trello to give you
  access)
- Trello access (ask TakeLab members to give you access)
- created and launched a deep learning model on TakeLab GPU resources
- checked out, built and run the tests of TakeLab Podium
- made a simple code change to TakeLab Podium with passing tests
- picked up a software implementation task
27 changes: 16 additions & 11 deletions requirements.txt
@@ -1,26 +1,31 @@
atomicwrites==1.2.1
attrs==18.2.0
backcall==0.1.0
decorator==4.3.0
flake8==3.5.0
ipython==6.5.0
ipython-genutils==0.2.0
jedi==0.12.1
mccabe==0.6.1
more-itertools==4.3.0
msgpack-numpy==0.4.3.1
numpy==1.15.1
parso==0.3.1
pexpect==4.6.0
pickleshare==0.7.4
plac==0.9.6
pluggy==0.7.1
preshed==1.0.1
prompt-toolkit==1.0.15
ptyprocess==0.6.0
py==1.6.0
Pygments==2.2.0
pycodestyle==2.3.1
pyflakes==1.6.0
pytest==3.7.4
pytest-cov==2.6.0
regex==2017.4.5
requests==2.19.1
scikit-learn==0.19.2
scipy==1.1.0
simplegeneric==0.8.1
six==1.11.0
sklearn==0.0
spacy==2.0.12
takepod==0.1
traitlets==4.3.2
wcwidth==0.1.7
torch==0.4.1
torchtext==0.2.3
tqdm==4.25.0
urllib3==1.23
wget==3.2
Empty file.
109 changes: 109 additions & 0 deletions takepod/preproc/stemmer/croatian_stemmer.py
@@ -0,0 +1,109 @@
# -*-coding:utf-8-*-
#
# Simple stemmer for Croatian v0.1
# Copyright 2012 Nikola Ljubešić and Ivan Pandžić
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import re
import os


class CroatianStemmer:

    # words that are their own stem (left unstemmed)
    __stop = None
    __transformations = None
    __rules = None

    def __init__(self):
        dir_path = os.path.dirname(os.path.realpath(__file__))
        # each line of rules.txt holds a space-separated "base suffix"
        # pair of regex fragments
        with open(os.path.join(dir_path, 'rules.txt'),
                  encoding='utf-8') as rules_file:
            self.__rules = [
                re.compile(r'^(' + base + ')(' + suffix + r')$')
                for base, suffix in
                (line.strip().split(' ') for line in rules_file)
            ]
        # each line of transformations.txt holds a tab-separated
        # "seek replace" pair
        with open(os.path.join(dir_path, 'transformations.txt'),
                  encoding='utf-8') as transformations_file:
            self.__transformations = [
                line.strip().split('\t') for line in transformations_file
            ]
        with open(os.path.join(dir_path, 'nostem-hr.txt'),
                  encoding='utf-8') as stop_file:
            self.__stop = {line.strip() for line in stop_file}

    def _determine_r_vowel(self, string):
        '''
        Determines whether 'r' acts as a vowel;
        if it does, uppercases it.

        Parameters
        ----------
        string : str
            word in Croatian

        Returns
        -------
        string : str
            Croatian word with the vocalic 'r' uppercased
        '''
        return re.sub(r'(^|[^aeiou])r($|[^aeiou])', r'\1R\2', string)

    def _has_vowel(self, string):
        return re.search(
            r'[aeiouR]', self._determine_r_vowel(string)) is not None

    def transform(self, word):
        # replace the first matching suffix with its transformation
        for seek, replace in self.__transformations:
            if word.endswith(seek):
                return word[:-len(seek)] + replace
        return word

    def root_word(self, word):
        # return the base of the first matching rule, provided the base
        # contains a vowel and is longer than one character
        for rule in self.__rules:
            division = rule.match(word)
            if division:
                root = division.group(1)
                if self._has_vowel(root) and len(root) > 1:
                    return root
        return word

    def stem_word(self, word):
        '''
        Returns the root of a word,
        together with any derivational affixes

        Parameters
        ----------
        word : str
            word in Croatian

        Returns
        -------
        string : str
            Croatian word root plus derivational morphemes
        '''
        if word.lower() in self.__stop:
            return word
        stem = self.root_word(self.transform(word.lower()))
        # restore the original word's casing on the stem
        return "".join(
            s.upper() if c.isupper() else s
            for s, c in zip(stem, word))
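To illustrate the rule machinery above without the data files, here is a
self-contained sketch with two invented rules in the same space-separated
"base suffix" regex format that `rules.txt` appears to use (the rules
themselves are made up for this example, not taken from the real rule set):

```python
import re

# Two illustrative "base suffix" rule lines, made up for this example.
rule_lines = ["(.+(s|š)k) ijima|ijega", "(.+) om|ima|ama"]

rules = [
    re.compile(r'^(' + base + ')(' + suffix + r')$')
    for base, suffix in (line.split(' ') for line in rule_lines)
]

def root_word(word):
    # Return the base of the first matching rule whose base contains a
    # vowel and is longer than one character, as CroatianStemmer does.
    for rule in rules:
        m = rule.match(word)
        if m:
            root = m.group(1)
            if re.search(r'[aeiou]', root) and len(root) > 1:
                return root
    return word

print(root_word("ženama"))  # strips the "ama" suffix
```

The real stemmer adds the vocalic-'r' handling, suffix transformations,
and the stopword list on top of this core matching loop.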
67 changes: 67 additions & 0 deletions takepod/preproc/stemmer/nostem-hr.txt
@@ -0,0 +1,67 @@
bi
bih
bijah
bijahu
bijasmo
bijaste
bijaše
bila
bile
bili
bilo
bio
bismo
biste
biti
biše
bje
bjeh
bjehu
bjesmo
bjeste
bješe
budem
budemo
budete
budeš
budimo
budite
budu
jesam
jesi
jesmo
jeste
jesu
mogu
mora
moraju
moram
moramo
morate
moraš
može
možemo
možete
možeš
sam
si
smo
ste
su
treba
trebaju
trebam
trebamo
trebate
trebaš
će
ćemo
ćete
ćeš
ću
žele
želi
želim
želimo
želite
želiš