Merged
10 changes: 9 additions & 1 deletion README.md
@@ -11,13 +11,14 @@ Home of the **TakeLab Podium** project.
You will work in a virtual environment and keep a list of required
dependencies in a ```requirements.txt``` file. The master branch of the
project **must** be buildable with passing tests **all the time**.
Code coverage should be kept as high as possible.

```
virtualenv -p python3.6 env
source env/bin/activate
pip install -r requirements.txt
python -m pytest
py.test --cov=takepod test
```

@@ -26,6 +27,13 @@ python -m pytest
Adding a new library to the project should be done via ```pip install
<new_framework>```. **Don't forget to add it to ```requirements.txt```.**

Prefer adding dependencies to the ```requirements.txt``` file manually
instead of using ```pip freeze > requirements.txt```.
See [here](https://medium.com/@tomagee/pip-freeze-requirements-txt-considered-harmful-f0bce66cf895)
for why.
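For illustration, a hand-curated ```requirements.txt``` lists only the
packages the project imports directly (the names and pins below are
examples, not the project's actual list):

```
numpy==1.15.1
pytest==3.7.4
torch==0.4.1
```

```pip freeze```, by contrast, also dumps every transitive dependency,
which makes it hard to tell which packages the project actually uses.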


## Details

The project is packaged according to official Python packaging
122 changes: 122 additions & 0 deletions notes.md
@@ -0,0 +1,122 @@
### Bayesian optimization

- part of the sequential model-based optimization (SMBO) family of algorithms
- multiple Python libraries are already available
- [spearmint](https://github.com/HIPS/Spearmint)
- [smac](https://www.cs.ubc.ca/labs/beta/Projects/SMAC/)
- [Bayesian Optimization](https://github.com/fmfn/BayesianOptimization)
- [hyperopt](https://github.com/hyperopt/hyperopt)
- [pygpgo](http://pygpgo.readthedocs.io/en/latest/)
- [GPyOpt](https://sheffieldml.github.io/GPyOpt/)

Four preconditions for doing BO

1. Objective function: takes an input and returns a loss to minimize
2. Domain space: the range of input values to evaluate
3. Optimization algorithm: the method used to construct the surrogate
function and choose the next values to evaluate
4. Results: (score, value) pairs that the algorithm uses to build the model
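The four preconditions above can be sketched as a toy loop. This is not a
real Bayesian optimizer: the `suggest` function below is a deliberately
naive stand-in for a surrogate model (it just samples near the best point
seen so far); libraries like hyperopt or GPyOpt fit a proper surrogate
instead.

```python
import random

def objective(x):
    # 1. Objective function: a loss to minimize (simple quadratic here).
    return (x - 2.0) ** 2

domain = (-10.0, 10.0)  # 2. Domain space of input values to evaluate

def suggest(results):
    # 3. "Algorithm": naive stand-in for a surrogate -- propose a point
    # near the best one observed so far, clipped to the domain.
    if not results:
        return random.uniform(*domain)
    best_x = min(results)[1]
    return min(max(best_x + random.gauss(0, 1.0), domain[0]), domain[1])

results = []  # 4. Results: (score, value) pairs the algorithm learns from

random.seed(0)
for _ in range(50):
    x = suggest(results)
    results.append((objective(x), x))

best_score, best_x = min(results)
print(best_score, best_x)
```

Swapping `suggest` for a real surrogate (e.g. hyperopt's TPE) is what
turns this skeleton into actual Bayesian optimization.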

Benefits of Bayesian optimization over random search

- usually works better (fewer iterations to reach a good score)
- returns distributions over the searched space (e.g. which learning rates
  are better than others)
- returns the full history of trials

Resources

- Excellent notebook on [Bayesian
optimization](https://github.com/WillKoehrsen/hyperparameter-optimization/blob/master/Introduction%20to%20Bayesian%20Optimization%20with%20Hyperopt.ipynb)
- [Blog post using scikit](https://thuijskens.github.io/2016/12/29/bayesian-optimisation/)


### Active learning

The idea is to label only part of the dataset, intelligently selecting
which instances to label so as to maximize model performance.

Procedure

1. Split the data into a seed set (to be labelled) and an unlabelled set.

2. Train the model on the seed set.

3. Choose unlabelled instances to label based on one of many selection
criteria (pool-based, stream-based), e.g. least confidence.

4. Iterate steps 2 and 3 until some stopping criterion is met.
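Step 3 with pool-based, least-confidence selection can be sketched as
follows; `predict_proba` is a made-up stand-in for a trained model's class
probabilities (in practice you would call e.g. your classifier's
`predict_proba`).

```python
import math

def predict_proba(x):
    # Stand-in for a trained binary classifier: a logistic curve over a
    # single feature. Made up purely for this sketch.
    p = 1.0 / (1.0 + math.exp(-x))
    return [1.0 - p, p]

def least_confidence(pool, n):
    # Pool-based selection: pick the n instances whose most likely class
    # has the lowest probability, i.e. where the model is least confident.
    def confidence(x):
        return max(predict_proba(x))
    return sorted(pool, key=confidence)[:n]

pool = [-3.0, -0.1, 0.05, 2.5, 0.4]
print(least_confidence(pool, 2))  # instances closest to the decision boundary
```

Libraries such as modAL wrap exactly this kind of query strategy behind a
scikit-learn-compatible interface.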


Python libraries

- [google active learning](https://github.com/google/active-learning),
6 sampling strategies
- [modAL](https://cosmic-cortex.github.io/modAL/#introduction)
- [libact](https://github.com/ntucllab/libact), Pool-based
Active Learning in Python
- [acton](https://github.com/chengsoonong/acton)

Resources

- [Tutorial](https://towardsdatascience.com/active-learning-tutorial-57c3398e34d)
- [Beginner datacamp tutorial](https://www.datacamp.com/community/tutorials/active-learning)

### Framework part

- interchanging saved models -- use the ONNX format (or something more general)
- continuous integration using Travis CI

### Development setup

- Python (you will likely use these)
    - [pytest](https://docs.pytest.org/en/latest/getting-started.html):
      a [tutorial](https://semaphoreci.com/community/tutorials/testing-python-applications-with-pytest)
      and [more examples](https://docs.pytest.org/en/latest/example/simple.html)
    - [Decorators in Python](https://realpython.com/primer-on-python-decorators/)
    - [Debugger pdb](https://docs.python.org/2/library/pdb.html)
    - pandas
    - pytorch
- Git
    - [git basics](https://git-scm.com/book/en/v2/Getting-Started-Git-Basics)
    - [how I operate in git](https://medium.com/@fredrikmorken/why-you-should-stop-using-git-rebase-5552bee4fed1)
- Linux (all TakeLab servers have Ubuntu installed)
    - [tmux](https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/)
      / [nohup](https://linux.101hacks.com/unix/nohup-command/)
    - [htop](http://www.deonsworld.co.za/2012/12/20/understanding-and-using-htop-monitor-system-resources/)
    - nvidia-smi
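A typical nohup pattern for long training runs on the servers looks like
this (the `python3 -c` command stands in for a real training script):

```shell
# Run a job immune to hangups so it keeps going after you log out;
# stdout and stderr both go to train.log.
nohup python3 -c 'print("epoch 1 done")' > train.log 2>&1 &
wait            # in practice you would just disconnect instead of waiting
cat train.log   # check progress later, e.g. with tail -f train.log
```

tmux achieves the same survivability with the added benefit of being able
to reattach to a live session.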


Advanced tooling (also a matter of personal preference)

- vim (vimtutor)
- [Tmux + vim](https://blog.bugsnag.com/tmux-and-vim/)


### Some general test principles

- unit tests should run within a few seconds
- unit tests should mock external resources
- integration tests should be run separately from unit tests
- unit tests should have extremely high coverage; integration tests should
  focus only on testing the integration itself
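The mocking principle can be sketched with only the standard library's
`unittest.mock` (`download_dataset` is a made-up example function, not part
of Podium). Integration tests that hit the real resource can then be given
a custom marker such as `@pytest.mark.integration` and deselected with
`pytest -m "not integration"`.

```python
from unittest import mock

def download_dataset(url, http_get):
    # Function under test: fetch a dataset via an injected HTTP client
    # and split the body into lines.
    return http_get(url).text.splitlines()

def test_download_dataset_mocked():
    # Unit test: mock the external resource so the test runs in
    # milliseconds and needs no network access.
    fake_response = mock.Mock(text="line1\nline2")
    http_get = mock.Mock(return_value=fake_response)
    assert download_dataset("http://example.com/d.txt", http_get) == \
        ["line1", "line2"]
    http_get.assert_called_once_with("http://example.com/d.txt")

test_download_dataset_mocked()
```

Injecting the client (rather than importing it at module level) is what
makes the external resource trivial to mock.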

Some general source writing practices

- discern between public and private methods (prefix private names with '\_')
- proposed: write numpydoc-style documentation for public methods
- use flake8 to check code formatting
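A small made-up example of both conventions together: a leading underscore
for private helpers and a numpydoc-style docstring on the public method.

```python
class Vocabulary:
    """Map tokens to integer ids.

    Parameters
    ----------
    tokens : list of str
        Tokens to index, in order of first appearance.
    """

    def __init__(self, tokens):
        self._index = {}  # leading '_' marks this attribute as private
        for token in tokens:
            self._add(token)

    def _add(self, token):
        # Private helper: not part of the public API.
        self._index.setdefault(token, len(self._index))

    def lookup(self, token):
        """Return the id of ``token``.

        Parameters
        ----------
        token : str
            Token to look up.

        Returns
        -------
        int
            Integer id assigned to ``token``.
        """
        return self._index[token]
```

flake8 catches formatting issues in code like this; docstring style itself
can additionally be checked with a plugin such as flake8-docstrings.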

### Onboarding -- week 1 checklist

By the end of week 1 you should have:

- git access (use pull requests)
- ssh access to all TakeLab servers (ask the sysadmins on Trello to give you
  access)
- Trello access (ask TakeLab members to give you access)
- created and launched a deep learning model on TakeLab GPU resources
- checked out, built and run the tests of TakeLab Podium
- made a simple code change to TakeLab Podium with passing tests
- picked up a software implementation task
27 changes: 16 additions & 11 deletions requirements.txt
@@ -1,26 +1,31 @@
atomicwrites==1.2.1
attrs==18.2.0
backcall==0.1.0
decorator==4.3.0
flake8==3.5.0
ipython==6.5.0
ipython-genutils==0.2.0
jedi==0.12.1
mccabe==0.6.1
more-itertools==4.3.0
msgpack-numpy==0.4.3.1
numpy==1.15.1
parso==0.3.1
pexpect==4.6.0
pickleshare==0.7.4
plac==0.9.6
pluggy==0.7.1
preshed==1.0.1
prompt-toolkit==1.0.15
ptyprocess==0.6.0
py==1.6.0
Pygments==2.2.0
pycodestyle==2.3.1
pyflakes==1.6.0
pytest==3.7.4
pytest-cov==2.6.0
regex==2017.4.5
requests==2.19.1
scikit-learn==0.19.2
scipy==1.1.0
simplegeneric==0.8.1
six==1.11.0
sklearn==0.0
spacy==2.0.12
takepod==0.1
traitlets==4.3.2
wcwidth==0.1.7
torch==0.4.1
torchtext==0.2.3
tqdm==4.25.0
urllib3==1.23
wget==3.2
Empty file.
109 changes: 109 additions & 0 deletions takepod/preproc/stemmer/croatian_stemmer.py
@@ -0,0 +1,109 @@
# -*-coding:utf-8-*-
#
# Simple stemmer for Croatian v0.1
# Copyright 2012 Nikola Ljubešić and Ivan Pandžić
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as published
# by the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import re
import os


class CroatianStemmer:

    # words that are their own stem (left unstemmed)
    __stop = None
    __transformations = None
    __rules = None

    def __init__(self):
        dir_path = os.path.dirname(os.path.realpath(__file__))
        # each line of rules.txt holds a space-separated "base suffix"
        # pair of regex fragments
        with open(os.path.join(dir_path, 'rules.txt'),
                  encoding='utf-8') as rules_file:
            self.__rules = [
                re.compile(r'^(' + base + ')(' + suffix + r')$')
                for base, suffix in
                (line.strip().split(' ') for line in rules_file)
            ]
        # each line of transformations.txt holds a tab-separated
        # "seek replace" pair
        with open(os.path.join(dir_path, 'transformations.txt'),
                  encoding='utf-8') as transformations_file:
            self.__transformations = [
                line.strip().split('\t') for line in transformations_file
            ]
        with open(os.path.join(dir_path, 'nostem-hr.txt'),
                  encoding='utf-8') as stop_file:
            self.__stop = {line.strip() for line in stop_file}

    def _determine_r_vowel(self, string):
        '''
        Determines whether 'r' acts as a vowel;
        if it does, uppercases it.

        Parameters
        ----------
        string : str
            word in Croatian

        Returns
        -------
        string : str
            Croatian word with the vocalic 'r' uppercased
        '''
        return re.sub(r'(^|[^aeiou])r($|[^aeiou])', r'\1R\2', string)

    def _has_vowel(self, string):
        return re.search(
            r'[aeiouR]', self._determine_r_vowel(string)) is not None

    def transform(self, word):
        # replace the first matching suffix with its transformation
        for seek, replace in self.__transformations:
            if word.endswith(seek):
                return word[:-len(seek)] + replace
        return word

    def root_word(self, word):
        # return the base of the first matching rule, provided the base
        # contains a vowel and is longer than one character
        for rule in self.__rules:
            division = rule.match(word)
            if division:
                root = division.group(1)
                if self._has_vowel(root) and len(root) > 1:
                    return root
        return word

    def stem_word(self, word):
        '''
        Returns the root of a word,
        together with any derivational affixes

        Parameters
        ----------
        word : str
            word in Croatian

        Returns
        -------
        string : str
            Croatian word root plus derivational morphemes
        '''
        if word.lower() in self.__stop:
            return word
        stem = self.root_word(self.transform(word.lower()))
        # restore the original word's casing on the stem
        return "".join(
            s.upper() if c.isupper() else s
            for s, c in zip(stem, word))
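To illustrate the rule machinery above without the data files, here is a
self-contained sketch with two invented rules in the same space-separated
"base suffix" regex format that `rules.txt` appears to use (the rules
themselves are made up for this example, not taken from the real rule set):

```python
import re

# Two illustrative "base suffix" rule lines, made up for this example.
rule_lines = ["(.+(s|š)k) ijima|ijega", "(.+) om|ima|ama"]

rules = [
    re.compile(r'^(' + base + ')(' + suffix + r')$')
    for base, suffix in (line.split(' ') for line in rule_lines)
]

def root_word(word):
    # Return the base of the first matching rule whose base contains a
    # vowel and is longer than one character, as CroatianStemmer does.
    for rule in rules:
        m = rule.match(word)
        if m:
            root = m.group(1)
            if re.search(r'[aeiou]', root) and len(root) > 1:
                return root
    return word

print(root_word("ženama"))  # strips the "ama" suffix
```

The real stemmer adds the vocalic-'r' handling, suffix transformations,
and the stopword list on top of this core matching loop.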
67 changes: 67 additions & 0 deletions takepod/preproc/stemmer/nostem-hr.txt
@@ -0,0 +1,67 @@
bi
bih
bijah
bijahu
bijasmo
bijaste
bijaše
bila
bile
bili
bilo
bio
bismo
biste
biti
biše
bje
bjeh
bjehu
bjesmo
bjeste
bješe
budem
budemo
budete
budeš
budimo
budite
budu
jesam
jesi
jesmo
jeste
jesu
mogu
mora
moraju
moram
moramo
morate
moraš
može
možemo
možete
možeš
sam
si
smo
ste
su
treba
trebaju
trebam
trebamo
trebate
trebaš
će
ćemo
ćete
ćeš
ću
žele
želi
želim
želimo
želite
želiš