diff --git a/README.md b/README.md
index bf22b8d9..c65db6c6 100644
--- a/README.md
+++ b/README.md
@@ -11,13 +11,14 @@ Home of the **TakeLab Podium** project.
 You will work in a virtual environment and keep a list of required dependencies in a ```requirements.txt``` file. The master branch of the project **must** be buildable with passing tests **all the time**.
+Code coverage should be kept as high as possible.
 ```
 virtualenv -p python3.6 env
 source env/bin/activate
 pip install -r requirements.txt
-python -m pytest
+py.test --cov=takepod test
 ```
@@ -26,6 +27,13 @@ python -m pytest
 Adding a new library to a project should be done via ```pip install ```. **Don't forget to add it to requirements.txt**
+The best thing to do is to manually add dependencies to the
+```requirements.txt``` file instead of using
+```pip freeze > requirements.txt```.
+See [here](https://medium.com/@tomagee/pip-freeze-requirements-txt-considered-harmful-f0bce66cf895)
+why.
+
+
 ## Details
 The project is packaged according to official Python packaging
diff --git a/notes.md b/notes.md
new file mode 100644
index 00000000..c7b9a149
--- /dev/null
+++ b/notes.md
@@ -0,0 +1,122 @@
+### Bayesian optimization
+
+- part of the sequential model-based optimization algorithm family
+- multiple python libraries already available
+  - [spearmint](https://github.com/HIPS/Spearmint)
+  - [smac](https://www.cs.ubc.ca/labs/beta/Projects/SMAC/)
+  - [Bayesian Optimization](https://github.com/fmfn/BayesianOptimization)
+  - [hyperopt](https://github.com/hyperopt/hyperopt)
+  - [pygpgo](http://pygpgo.readthedocs.io/en/latest/)
+  - [GPyOpt](https://sheffieldml.github.io/GPyOpt/)
+
+Four preconditions for doing BO
+
+1. Objective function: takes in an input and returns a loss to minimize
+2. Domain space: the range of input values to evaluate
+3. Optimization algorithm: the method used to construct the surrogate function and
+   choose the next values to evaluate
+4. 
Results: score, value pairs that the algorithm uses to build the model
+
+Benefits of Bayesian optimization over random search
+
+- works better most of the time (fewer iterations)
+- returns distributions of searched spaces (which learning rates are
+  better than others)
+- returns trials
+
+Resources
+
+- Excellent notebook on [Bayesian
+  optimization](https://github.com/WillKoehrsen/hyperparameter-optimization/blob/master/Introduction%20to%20Bayesian%20Optimization%20with%20Hyperopt.ipynb)
+- [Blog post using scikit](https://thuijskens.github.io/2016/12/29/bayesian-optimisation/)
+
+
+### Active learning
+
+The idea is to label part of the dataset and try to intelligently select
+which data to label to maximize model performance.
+
+Procedure
+
+1. Split data into seed (will be labelled) and unlabelled.
+
+2. Train the model using the seed.
+
+3. Choose unlabelled instances to label (pool-based or stream-based
+   sampling) using a selection criterion such as least confidence.
+
+4. Iterate 2. and 3. 
until some stopping criterion is met
+
+
+Python libraries
+
+- [google active learning](https://github.com/google/active-learning),
+  6 sampling strategies
+- [modAL](https://cosmic-cortex.github.io/modAL/#introduction)
+- [libact](https://github.com/ntucllab/libact), Pool-based
+  Active Learning in Python
+- [acton](https://github.com/chengsoonong/acton)
+
+Resources
+
+- [Tutorial](https://towardsdatascience.com/active-learning-tutorial-57c3398e34d)
+- [Beginner datacamp tutorial](https://www.datacamp.com/community/tutorials/active-learning)
+
+### Framework part
+
+- interchanging saved models -- use ONNX format (go more general)
+- continuous integration using Travis CI
+
+### Development setup
+
+- Python (you will likely use these)
+  - [pytest](https://docs.pytest.org/en/latest/getting-started.html)
+  - [Decorators in python](https://realpython.com/primer-on-python-decorators/)
+  - [Debugger pdb](https://docs.python.org/2/library/pdb.html)
+  - pandas
+  - pytorch
+  - [pytest](https://semaphoreci.com/community/tutorials/testing-python-applications-with-pytest)
+
+https://docs.pytest.org/en/latest/example/simple.html
+- Git
+  - [git basics](https://git-scm.com/book/en/v2/Getting-Started-Git-Basics)
+  - [how I operate in git](https://medium.com/@fredrikmorken/why-you-should-stop-using-git-rebase-5552bee4fed1)
+- Linux (all TakeLab servers have Ubuntu installed)
+  - [tmux](https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/)
+    / [nohup](https://linux.101hacks.com/unix/nohup-command/)
+  - [htop](http://www.deonsworld.co.za/2012/12/20/understanding-and-using-htop-monitor-system-resources/)
+  - nvidia-smi
+
+
+Advanced preconditions (also depending on personal choice)
+
+- vim (vimtutor)
+- [Tmux + vim](https://blog.bugsnag.com/tmux-and-vim/)
+
+
+### Some general test principles
+
+- unit tests should run within a few seconds
+- unit tests should mock external resources
+- integration tests should be run separately from unit tests
+- unit tests should have 
extremely high coverage, integration tests should
+  focus on only testing integration
+
+Some general source writing practices
+
+- distinguish between public and private methods ('\_' for private)
+- proposed: write numpydoc-style documentation for public methods
+- use flake8 to check code formatting
+
+### Onboarding -- week 1 checklist
+
+By the end of week 1 you should have:
+
+- git access (use pull requests)
+- ssh access to all TakeLab servers (ask sysadmins on Trello to give you
+  access)
+- trello access (ask TakeLab members to give you access)
+- created and launched a deep learning model on TakeLab GPU resources
+- checked out, built and run the tests of TakeLab podium
+- made a simple code change on TakeLab podium with tests passing
+- picked up a software implementation task
diff --git a/requirements.txt b/requirements.txt
index f326e338..632f959d 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,26 +1,31 @@
-atomicwrites==1.2.1
-attrs==18.2.0
-backcall==0.1.0
-decorator==4.3.0
+flake8==3.5.0
 ipython==6.5.0
 ipython-genutils==0.2.0
-jedi==0.12.1
+mccabe==0.6.1
 more-itertools==4.3.0
+msgpack-numpy==0.4.3.1
 numpy==1.15.1
-parso==0.3.1
-pexpect==4.6.0
-pickleshare==0.7.4
+plac==0.9.6
 pluggy==0.7.1
+preshed==1.0.1
 prompt-toolkit==1.0.15
 ptyprocess==0.6.0
 py==1.6.0
-Pygments==2.2.0
+pycodestyle==2.3.1
+pyflakes==1.6.0
 pytest==3.7.4
+pytest-cov==2.6.0
+regex==2017.4.5
+requests==2.19.1
 scikit-learn==0.19.2
 scipy==1.1.0
 simplegeneric==0.8.1
 six==1.11.0
 sklearn==0.0
+spacy==2.0.12
 takepod==0.1
-traitlets==4.3.2
-wcwidth==0.1.7
+torch==0.4.1
+torchtext==0.2.3
+tqdm==4.25.0
+urllib3==1.23
+wget==3.2
diff --git a/takepod/preproc/stemmer/__init__.py b/takepod/preproc/stemmer/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/takepod/preproc/stemmer/croatian_stemmer.py b/takepod/preproc/stemmer/croatian_stemmer.py
new file mode 100644
index 00000000..aad70818
--- /dev/null
+++ b/takepod/preproc/stemmer/croatian_stemmer.py
@@ -0,0 +1,109 @@
+# -*-coding:utf-8-*-
+#
+# Simple stemmer for Croatian v0.1
+# Copyright 2012 Nikola Ljubešić and Ivan Pandžić
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Lesser General Public License as published
+# by the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+import re
+import os
+
+
+class CroatianStemmer:
+
+    # list of words that are their own stems
+    __stop = None
+    __transformations = None
+    __rules = None
+
+    def __init__(self):
+
+        dir_path = os.path.dirname(os.path.realpath(__file__))
+        self.__rules = [
+            re.compile(r'^(' + base + ')(' + suffix + r')$') for
+            base, suffix in [
+                e.strip().split(' ')
+                for e in
+                open(os.path.join(dir_path, "rules.txt"), encoding='utf-8')]
+        ]
+        self.__transformations = [e.strip().split(
+            '\t') for e in open(os.path.join(dir_path,
+                                             'transformations.txt'),
+                                encoding='utf-8')]
+        self.__stop = set([
+            e.strip()
+            for e in
+            open(os.path.join(dir_path, 'nostem-hr.txt'), encoding='utf-8')
+        ])
+
+    def _determine_r_vowel(self, string):
+        '''
+        Determines if 'r' is a vowel or not
+        If it is => uppercase it.
+
+        Parameters
+        ----------
+        string : str
+            word in Croatian
+
+        Returns
+        -------
+        string : str
+            Croatian word with 'r' vowel uppercased
+        '''
+        return re.sub(r'(^|[^aeiou])r($|[^aeiou])', r'\1R\2', string)
+
+    def _has_vowel(self, string):
+        if re.search(r'[aeiouR]', self._determine_r_vowel(string)):
+            return True
+        else:
+            return False
+
+    def transform(self, word):
+        for seek, replace in self.__transformations:
+            if word.endswith(seek):
+                return word[:-len(seek)] + replace
+        return word
+
+    def root_word(self, word):
+        for rule in self.__rules:
+            division = rule.match(word)
+            if division:
+                root = division.group(1)
+                if self._has_vowel(root) and len(root) > 1:
+                    return root
+        return word
+
+    def stem_word(self, word):
+        '''
+        Returns the root or roots of a word,
+        together with any derivational affixes
+
+        Parameters
+        ----------
+        word : str
+            word in Croatian
+
+        Returns
+        -------
+        string : str
+            Croatian word root plus derivational morphemes
+        '''
+
+        if word.lower() in self.__stop:
+            return word
+        stem = self.root_word(self.transform(word.lower()))
+        return "".join(list(
+            stem[i].upper() if c.isupper() else stem[i] for i, c in
+            zip(range(len(stem)), word)))
diff --git a/takepod/preproc/stemmer/nostem-hr.txt b/takepod/preproc/stemmer/nostem-hr.txt
new file mode 100644
index 00000000..fbf06046
--- /dev/null
+++ b/takepod/preproc/stemmer/nostem-hr.txt
@@ -0,0 +1,67 @@
+bi
+bih
+bijah
+bijahu
+bijasmo
+bijaste
+bijaše
+bila
+bile
+bili
+bilo
+bio
+bismo
+biste
+biti
+biše
+bje
+bjeh
+bjehu
+bjesmo
+bjeste
+bješe
+budem
+budemo
+budete
+budeš
+budimo
+budite
+budu
+jesam
+jesi
+jesmo
+jeste
+jesu
+mogu
+mora
+moraju
+moram
+moramo
+morate
+moraš
+može
+možemo
+možete
+možeš
+sam
+si
+smo
+ste
+su
+treba
+trebaju
+trebam
+trebamo
+trebate
+trebaš
+će
+ćemo
+ćete
+ćeš
+ću
+žele
+želi
+želim
+želimo
+želite
+želiš
diff --git a/takepod/preproc/stemmer/rules.txt b/takepod/preproc/stemmer/rules.txt
new file mode 100644
index 00000000..965c79b0
--- /dev/null
+++ 
b/takepod/preproc/stemmer/rules.txt
@@ -0,0 +1,102 @@
+.+(s|š)k ijima|ijega|ijemu|ijem|ijim|ijih|ijoj|ijeg|iji|ije|ija|oga|ome|omu|ima|og|om|im|ih|oj|i|e|o|a|u
+.+(s|š)tv ima|om|o|a|u
+# N
+.+(t|m|p|r|g)anij ama|ima|om|a|u|e|i|
+.+an inom|ina|inu|ine|ima|in|om|u|i|a|e|
+.+in ima|ama|om|a|e|i|u|o|
+.+on ovima|ova|ove|ovi|ima|om|a|e|i|u|
+.+n ijima|ijega|ijemu|ijeg|ijem|ijim|ijih|ijoj|iji|ije|ija|iju|ima|ome|omu|oga|oj|om|ih|im|og|o|e|a|u|i|
+# Ć
+.+(a|e|u)ć oga|ome|omu|ega|emu|ima|oj|ih|om|eg|em|og|uh|im|e|a
+# G
+.+ugov ima|i|e|a
+.+ug ama|om|a|e|i|u|o
+.+log ama|om|a|u|e|
+.+[^eo]g ovima|ama|ovi|ove|ova|om|a|e|i|u|o|
+# I
+.+(rrar|ott|ss|ll)i jem|ja|ju|o|
+# J
+.+uj ući|emo|ete|mo|em|eš|e|u|
+.+(c|č|ć|đ|l|r)aj evima|evi|eva|eve|ama|ima|em|a|e|i|u|
+.+(b|c|d|l|n|m|ž|g|f|p|r|s|t|z)ij ima|ama|om|a|e|i|u|o|
+# L
+#.+al inom|ina|inu|ine|ima|om|in|i|a|e
+#.+[^(lo|ž)]il ima|om|a|e|u|i|
+.+[^z]nal ima|ama|om|a|e|i|u|o|
+.+ijal ima|ama|om|a|e|i|u|o|
+.+ozil ima|om|a|e|u|i|
+.+olov ima|i|a|e
+.+ol ima|om|a|u|e|i|
+# M
+.+lem ama|ima|om|a|e|i|u|o|
+.+ram ama|om|a|e|i|u|o
+#.+(es|e|u)m ama|om|a|e|i|u|o
+# R
+#.+(a|d|e|o|u)r ama|ima|om|u|a|e|i|
+.+(a|d|e|o)r ama|ima|om|u|a|e|i|
+# S
+.+(e|i)s ima|om|e|a|u
+# Š
+.+(t|n|j|k|j|t|b|g|v)aš ama|ima|om|em|a|u|i|e|
+.+(e|i)š ima|ama|om|em|i|e|a|u|
+# T
+.+ikat ima|om|a|e|i|u|o|
+.+lat ima|om|a|e|i|u|o|
+.+et ama|ima|om|a|e|i|u|o|
+#.+ot ama|ima|om|a|u|e|i|
+.+(e|i|k|o)st ima|ama|om|a|e|i|u|o|
+.+išt ima|em|a|e|u
+#.+ut ovima|evima|ove|ovi|ova|eve|evi|eva|ima|om|a|u|e|i|
+# V
+.+ova smo|ste|hu|ti|še|li|la|le|lo|t|h|o
+.+(a|e|i)v ijemu|ijima|ijega|ijeg|ijem|ijim|ijih|ijoj|oga|ome|omu|ima|ama|iji|ije|ija|iju|im|ih|oj|om|og|i|a|u|e|o|
+.+[^dkml]ov ijemu|ijima|ijega|ijeg|ijem|ijim|ijih|ijoj|oga|ome|omu|ima|iji|ije|ija|iju|im|ih|oj|om|og|i|a|u|e|o|
+.+(m|l)ov ima|om|a|u|e|i|
+# PRIDJEVI
+.+el ijemu|ijima|ijega|ijeg|ijem|ijim|ijih|ijoj|oga|ome|omu|ima|iji|ije|ija|iju|im|ih|oj|om|og|i|a|u|e|o|
+.+(a|e|š)nj 
ijemu|ijima|ijega|ijeg|ijem|ijim|ijih|ijoj|oga|ome|omu|ima|iji|ije|ija|iju|ega|emu|eg|em|im|ih|oj|om|og|a|e|i|o|u
+.+čin ama|ome|omu|oga|ima|og|om|im|ih|oj|a|u|i|o|e|
+.+roši vši|smo|ste|še|mo|te|ti|li|la|lo|le|m|š|t|h|o
+.+oš ijemu|ijima|ijega|ijeg|ijem|ijim|ijih|ijoj|oga|ome|omu|ima|iji|ije|ija|iju|im|ih|oj|om|og|i|a|u|e|
+.+(e|o)vit ijima|ijega|ijemu|ijem|ijim|ijih|ijoj|ijeg|iji|ije|ija|oga|ome|omu|ima|og|om|im|ih|oj|i|e|o|a|u|
+#.+tit ijima|ijega|ijemu|ijem|ijim|ijih|ijoj|ijeg|iji|ije|ija|oga|ome|omu|ima|og|om|im|ih|oj|e|o|a|u|i|
+.+ast ijima|ijega|ijemu|ijem|ijim|ijih|ijoj|ijeg|iji|ije|ija|oga|ome|omu|ima|og|om|im|ih|oj|i|e|o|a|u|
+.+k ijemu|ijima|ijega|ijeg|ijem|ijim|ijih|ijoj|oga|ome|omu|ima|iji|ije|ija|iju|im|ih|oj|om|og|i|a|u|e|o|
+# GLAGOLI
+.+(e|a|i|u)va jući|smo|ste|jmo|jte|ju|la|le|li|lo|mo|na|ne|ni|no|te|ti|še|hu|h|j|m|n|o|t|v|š|
+.+ir ujemo|ujete|ujući|ajući|ivat|ujem|uješ|ujmo|ujte|avši|asmo|aste|ati|amo|ate|aju|aše|ahu|ala|alo|ali|ale|uje|uju|uj|al|an|am|aš|at|ah|ao
+.+ač ismo|iste|iti|imo|ite|iše|eći|ila|ilo|ili|ile|ena|eno|eni|ene|io|im|iš|it|ih|en|i|e
+.+ača vši|smo|ste|smo|ste|hu|ti|mo|te|še|la|lo|li|le|ju|na|no|ni|ne|o|m|š|t|h|n
+#.+ači smo|ste|ti|li|la|lo|le|mo|te|še|m|š|t|h|o|
+# Druga_vrsta
+.+n uvši|usmo|uste|ući|imo|ite|emo|ete|ula|ulo|ule|uli|uto|uti|uta|em|eš|uo|ut|e|u|i
+.+ni vši|smo|ste|ti|mo|te|mo|te|la|lo|le|li|m|š|o
+# A
+.+((a|r|i|p|e|u)st|[^o]g|ik|uc|oj|aj|lj|ak|ck|čk|šk|uk|nj|im|ar|at|et|št|it|ot|ut|zn|zv)a jući|vši|smo|ste|jmo|jte|jem|mo|te|je|ju|ti|še|hu|la|li|le|lo|na|no|ni|ne|t|h|o|j|n|m|š
+.+ur ajući|asmo|aste|ajmo|ajte|amo|ate|aju|ati|aše|ahu|ala|ali|ale|alo|ana|ano|ani|ane|al|at|ah|ao|aj|an|am|aš
+.+(a|i|o)staj asmo|aste|ahu|ati|emo|ete|aše|ali|ući|ala|alo|ale|mo|ao|em|eš|at|ah|te|e|u|
+.+(b|c|č|ć|d|e|f|g|j|k|n|r|t|u|v)a lama|lima|lom|lu|li|la|le|lo|l
+.+(t|č|j|ž|š)aj evima|evi|eva|eve|ama|ima|em|a|e|i|u|
+#.+(e|j|k|r|u|v)al ama|ima|om|u|i|a|e|o|
+#.+(e|j|k|r|t|u|v)al ih|im
+.+([^o]m|ič|nč|uč|b|c|ć|d|đ|h|j|k|l|n|p|r|s|š|v|z|ž)a jući|vši|smo|ste|jmo|jte|mo|te|ju|ti|še|hu|la|li|le|lo|na|no|ni|ne|t|h|o|j|n|m|š
+.+(a|i|o)sta dosmo|doste|doše|nemo|demo|nete|dete|nimo|nite|nila|vši|nem|dem|neš|deš|doh|de|ti|ne|nu|du|la|li|lo|le|t|o
+.+ta smo|ste|jmo|jte|vši|ti|mo|te|ju|še|la|lo|le|li|na|no|ni|ne|n|j|o|m|š|t|h
+.+inj asmo|aste|ati|emo|ete|ali|ala|alo|ale|aše|ahu|em|eš|at|ah|ao
+.+as temo|tete|timo|tite|tući|tem|teš|tao|te|li|ti|la|lo|le
+# I
+.+(elj|ulj|tit|ac|ič|od|oj|et|av|ov)i vši|eći|smo|ste|še|mo|te|ti|li|la|lo|le|m|š|t|h|o
+.+(tit|jeb|ar|ed|uš|ič)i jemo|jete|jem|ješ|smo|ste|jmo|jte|vši|mo|še|te|ti|ju|je|la|lo|li|le|t|m|š|h|j|o
+.+(b|č|d|l|m|p|r|s|š|ž)i jemo|jete|jem|ješ|smo|ste|jmo|jte|vši|mo|lu|še|te|ti|ju|je|la|lo|li|le|t|m|š|h|j|o
+.+luč ujete|ujući|ujemo|ujem|uješ|ismo|iste|ujmo|ujte|uje|uju|iše|iti|imo|ite|ila|ilo|ili|ile|ena|eno|eni|ene|uj|io|en|im|iš|it|ih|e|i
+.+jeti smo|ste|še|mo|te|ti|li|la|lo|le|m|š|t|h|o
+.+e lama|lima|lom|lu|li|la|le|lo|l
+.+i lama|lima|lom|lu|li|la|le|lo|l
+# Pridjev_t
+.+at ijega|ijemu|ijima|ijeg|ijem|ijih|ijim|ima|oga|ome|omu|iji|ije|ija|iju|oj|og|om|im|ih|a|u|i|e|o|
+# Pridjev
+.+et avši|ući|emo|imo|em|eš|e|u|i
+.+ ajući|alima|alom|avši|asmo|aste|ajmo|ajte|ivši|amo|ate|aju|ati|aše|ahu|ali|ala|ale|alo|ana|ano|ani|ane|am|aš|at|ah|ao|aj|an
+.+ anje|enje|anja|enja|enom|enoj|enog|enim|enih|anom|anoj|anog|anim|anih|eno|ovi|ova|oga|ima|ove|enu|anu|ena|ama
+.+ nijega|nijemu|nijima|nijeg|nijem|nijim|nijih|nima|niji|nije|nija|niju|noj|nom|nog|nim|nih|an|na|nu|ni|ne|no
+.+ om|og|im|ih|em|oj|an|u|o|i|e|a
\ No newline at end of file
diff --git a/takepod/preproc/stemmer/transformations.txt b/takepod/preproc/stemmer/transformations.txt
new file mode 100644
index 00000000..52f79cb6
--- /dev/null
+++ b/takepod/preproc/stemmer/transformations.txt
@@ -0,0 +1,131 @@
+lozi loga
+lozima loga
+pjesi pjeh
+pjesima pjeh
+vojci vojka
+bojci bojka
+jaci jak
+jacima jak
+čajan čajni
+ijeran ijerni
+laran larni
+ijesan ijesni
+anjac anjca
+ajac ajca
+ajaca ajca
+ljaca ljca
+ljac ljca
+ejac ejca
+ejaca ejca
+ojac ojca
+ojaca ojca
+ajaka ajka
+ojaka ojka
+šaca šca
+šac šca
+inzima ing
+inzi ing
+tvenici tvenik
+tetici tetika
+teticima tetika
+nstava nstva
+nicima nik
+ticima tik
+zicima zik
+snici snik
+kuse kusi
+kusan kusni
+kustava kustva
+dušan dušni
+antan antni
+bilan bilni
+tilan tilni
+avilan avilni
+silan silni
+gilan gilni
+rilan rilni
+nilan nilni
+alan alni
+ozan ozni
+rave ravi
+stavan stavni
+pravan pravni
+tivan tivni
+sivan sivni
+atan atni
+cenata centa
+denata denta
+genata genta
+lenata lenta
+menata menta
+jenata jenta
+venata venta
+tetan tetni
+pletan pletni
+šave šavi
+manata manta
+tanata tanta
+lanata lanta
+sanata santa
+ačak ačka
+ačaka ačka
+ušak uška
+atak atka
+ataka atka
+atci atka
+atcima atka
+etak etka
+etaka etka
+itak itka
+itaka itka
+itci itka
+otak otka
+otaka otka
+utak utka
+utaka utka
+utci utka
+utcima utka
+eskan eskna
+tičan tični
+ojsci ojska
+esama esma
+metara metra
+centar centra
+centara centra
+istara istra
+istar istra
+ošću osti
+daba dba
+čcima čka
+čci čka
+mac mca
+maca mca
+naca nca
+nac nca
+voljan voljni
+anaka anki
+vac vca
+vaca vca
+saca sca
+sac sca
+naca nca
+nac nca
+raca rca
+rac rca
+aoca alca
+alaca alca
+alac alca
+elaca elca
+elac elca
+olaca olca
+olac olca
+olce olca
+njac njca
+njaca njca
+ekata ekta
+ekat ekta
+izam izma
+izama izma
+jebe jebi
+baci baci
+ašan ašni
\ No newline at end of file
diff --git a/test/test_preproc.py b/test/test_preproc.py
new file mode 100644
index 00000000..61871a06
--- /dev/null
+++ b/test/test_preproc.py
@@ -0,0 +1,30 @@
+from takepod.preproc.stemmer.croatian_stemmer import CroatianStemmer
+import pytest
+
+
+@pytest.fixture(scope='module')
+def cro_stemmer():
+    yield CroatianStemmer()
+
+
+def test_croatian_stemmer_stem_word(cro_stemmer):
+    assert cro_stemmer.stem_word('babicama') == 'babic'
+    assert cro_stemmer.stem_word('babice') == 'babic'
+
+
+def 
test_croatian_stemmer_stem_nostem_word(cro_stemmer):
+    assert cro_stemmer.stem_word('želimo') == 'želimo'
+    assert cro_stemmer.stem_word('jesmo') == 'jesmo'
+
+
+def test_croatian_stemmer_no_vowel_word(cro_stemmer):
+    # this is an actual word in Croatian, look it up
+    assert cro_stemmer.stem_word('sntntn') == 'sntntn'
+
+
+def test_croatian_stemmer_transformative_word(cro_stemmer):
+    assert cro_stemmer.stem_word('turizama') == 'turizm'
+
+
+def test_croatian_stemmer_preserves_case(cro_stemmer):
+    assert cro_stemmer.stem_word('Turizam') == 'Turizm'
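
Reviewer note: the stemmer in this patch first rewrites a word ending via `transformations.txt`, then tries each `base suffix` rule from `rules.txt` as an anchored regex and keeps the captured base if it contains a vowel (counting syllabic 'r'). The sketch below is a minimal standalone illustration of that flow, not the class from the patch. It assumes only two rule lines and one transformation pair copied from the files above; the module-level names (`RULE_LINES`, `TRANSFORMATIONS`, `mark_syllabic_r`, `has_vowel`, `transform`, `root_word`) are hypothetical and exist only for this example.

```python
import re

# Assumption: a hand-picked subset of the data files in this patch.
# The real CroatianStemmer loads the complete files from disk.
RULE_LINES = [
    ".+(e|i|k|o)st ima|ama|om|a|e|i|u|o|",
    ".+ om|og|im|ih|em|oj|an|u|o|i|e|a",
]
TRANSFORMATIONS = [("izama", "izma")]

# Each rule becomes ^(base)(suffix)$, so group(1) is the candidate root.
RULES = [
    re.compile(r'^(' + base + r')(' + suffix + r')$')
    for base, suffix in (line.split(' ') for line in RULE_LINES)
]


def mark_syllabic_r(word):
    # Uppercase an 'r' that has no vowel directly before or after it.
    return re.sub(r'(^|[^aeiou])r($|[^aeiou])', r'\1R\2', word)


def has_vowel(word):
    return re.search(r'[aeiouR]', mark_syllabic_r(word)) is not None


def transform(word):
    # Apply the first transformation whose ending matches the word.
    for seek, replace in TRANSFORMATIONS:
        if word.endswith(seek):
            return word[:-len(seek)] + replace
    return word


def root_word(word):
    # Return the base of the first matching rule that still has a vowel.
    for rule in RULES:
        match = rule.match(word)
        if match:
            root = match.group(1)
            if has_vowel(root) and len(root) > 1:
                return root
    return word


print(root_word("kostima"))              # kost
print(root_word(transform("turizama")))  # turizm
print(root_word("sntntn"))               # sntntn (no rule matches)
```

Rule order matters in this scheme: the first matching rule wins, which is why the most general `.+` rules sit at the end of `rules.txt`.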