TICCLAT is a tool for text-induced corpus correction and lexical assessment.
To install ticclat, do:
git clone https://github.com/ticclat/ticclat.git
cd ticclat
pip install .
Run tests (including coverage) with:
python setup.py test
Run sudo mysql_secure_installation with the following choices:
- Validate passwords: no
- Root password: pick one
- Remove anonymous users: yes
- Disallow root login remotely: no
- Remove test database and access to it: yes
- Reload privilege tables now: yes
To allow login as any user with the root password set above, you have to switch the authentication plugin for root to mysql_native_password. To check which plugin is currently in use, run:
SELECT plugin from mysql.user where User='root';
If it is auth_socket (the default on Ubuntu), you can only log in as root if you are running mysql as the Unix root user, e.g. by running it with sudo. To change it to mysql_native_password, start mysql -u root and run:
UPDATE mysql.user SET plugin = 'mysql_native_password' WHERE User = 'root';
FLUSH PRIVILEGES;
To make this authentication plugin the default, add the following to /etc/my.cnf (or another my.cnf location, run mysqladmin --help to see the locations that mysqld looks for):
[mysqld]
default-authentication-plugin = mysql_native_password
To run the ingestion script (e.g. the elex lexicon ingestion), the maximum packet size has to be high enough. We set it to 41943040 (4194304 was not enough) by setting the following line in /etc/my.cnf:
[mysqld]
max_allowed_packet = 42M
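As a quick check of the arithmetic, the sketch below converts a size value with a K, M or G suffix to bytes, assuming the suffixes are multiples of 1024 as in MySQL option files. The function name mysql_size_to_bytes is made up for this illustration. It shows that 42M is slightly above the 41943040 bytes mentioned above.

```python
# Multipliers for K/M/G suffixes in option files (powers of 1024).
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def mysql_size_to_bytes(value: str) -> int:
    """Convert an option value such as '42M' to a number of bytes.

    Hypothetical helper for illustration; it is not part of ticclat or MySQL.
    """
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in _SUFFIXES:
        return int(value[:-1]) * _SUFFIXES[suffix]
    return int(value)


configured = mysql_size_to_bytes("42M")
print(configured)              # 44040192
print(configured >= 41943040)  # True
```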
To allow loading CSV files (the fastest way to insert large batches of records), add:
[mysqld]
local_infile=ON
This allows you to run queries like this:
LOAD DATA LOCAL INFILE '/file.csv' INTO TABLE test FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\';
This loads the file /file.csv from the client and sends it to the server, which inserts it into the table test. See [MySQL Load Data Documentation](https://dev.mysql.com/doc/refman/8.0/en/load-data.html).
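A file that matches the FIELDS options in the statement above can be produced with Python's standard csv module. The following is a minimal sketch, not the code ticclat itself uses; write_mysql_csv and the sample rows are made up for illustration.

```python
import csv
import io


def write_mysql_csv(rows, handle):
    """Write rows using the quoting rules from the LOAD DATA statement above.

    Hypothetical helper; ticclat's own export code may differ.
    """
    writer = csv.writer(
        handle,
        delimiter=",",          # FIELDS TERMINATED BY ','
        quotechar='"',          # ENCLOSED BY '"'
        escapechar="\\",        # ESCAPED BY '\\'
        doublequote=False,      # escape embedded quotes instead of doubling them
        quoting=csv.QUOTE_ALL,  # enclose every field
        lineterminator="\n",    # matches MySQL's default line terminator
    )
    writer.writerows(rows)


# Made-up sample data, written to an in-memory buffer instead of /file.csv.
buffer = io.StringIO()
write_mysql_csv([["1", 'He said "hi", ok'], ["2", "plain"]], buffer)
print(buffer.getvalue())
```

Setting doublequote=False together with escapechar makes the writer emit \" for an embedded quote, which is the form the ESCAPED BY clause expects.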
To allow for saving CSV files, add:
[mysqld]
secure_file_priv=/data/tmp/mysql
Also add the following to /etc/apparmor.d/usr.sbin.mysqld, then reload AppArmor with sudo systemctl reload apparmor:
# Allow /data/tmp/mysql access
/data/tmp/mysql/ rw,
/data/tmp/mysql/** rw,
Make sure the directory /data/tmp/mysql exists and is writable by the mysql user.
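The directory check can be scripted. The sketch below is an illustration only: check_export_dir is a made-up helper, and it tests permissions for the user running the script, so on the server it would have to be run as the mysql user to be meaningful.

```python
import os
import tempfile


def check_export_dir(path: str) -> list:
    """Return a list of problems found with an export directory (empty if none).

    Hypothetical helper; it checks access for the *current* user only.
    """
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
    elif not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")
    return problems


# Demonstration on a temporary directory; on the server you would pass
# /data/tmp/mysql instead.
with tempfile.TemporaryDirectory() as tmp:
    print(check_export_dir(tmp))
    print(check_export_dir(os.path.join(tmp, "missing")))
```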
On Ubuntu 18.04, the default mysqld settings in /etc/mysql/mysql.conf.d/mysqld.cnf set the socket to a non-standard location that does not match the defaults MySQLdb expects. Change it to /tmp/mysql.sock if you get OperationalError: 2006 ... when running ticclat tasks like ingesting corpora or lexica.
Important note: Alembic scripts were removed. Use the most recent database dumps to get the newest version of the database.
To apply changes to the database schema, we use [alembic](https://alembic.sqlalchemy.org/en/latest/index.html).
Alembic is configured to read the information needed to connect to the database from the environment variable DATABASE_URL.
To migrate the database to the latest database schema run:
alembic upgrade head
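Before running a migration it can help to confirm what DATABASE_URL actually points at. The sketch below splits a URL of the form used in the .env example in this README into its parts with the standard library; it does not connect to anything. describe_database_url and the placeholder URL are assumptions made for this illustration.

```python
import os
from urllib.parse import parse_qs, urlsplit


def describe_database_url(url: str) -> dict:
    """Split a mysql:// URL into its parts (the password is deliberately omitted).

    Hypothetical helper; not part of ticclat or alembic.
    """
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "options": {key: values[0] for key, values in parse_qs(parts.query).items()},
    }


# Placeholder values; export DATABASE_URL to inspect your own setting.
example = "mysql://user:secret@localhost:3306/ticclat?charset=utf8mb4&local_infile=1"
print(describe_database_url(os.environ.get("DATABASE_URL", example)))
```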
Important note: if you are creating the database from scratch, do not use the alembic database migrations. Instead, use SQLAlchemy to create a complete new instance of the database.
The ticclat package contains scripts for ingesting data into the database.
To run the scripts, create a .env file as described under Setup virtual environment. In the directory where the .env file is located, type python and then:
>>> from ticclat import ingest
>>> ingest.run()
You can configure run() by providing arguments:
- env: path to the .env file (default: .env)
- reset_db: delete the database and recreate it before ingesting data (default: False)
- alphabet_file: path to the alphabet file (required for calculating anahashes; default: /data/ALPH/nld.aspell.dict.clip20.lc.LD3.charconfus.clip20.lc.chars)
- batch_size: size of database batches (default: 5000) (We should look into how this is used.)
- include: list of data sources to ingest (default: [])
- exclude: list of data sources to exclude from ingesting (default: [])
- ingest: boolean indicating whether data should be ingested (default: True)
- anahash: boolean indicating whether anahashes should be calculated (default: True)
- tmpdir: directory to use for storing temporary data (default: /data/tmp)
- loglevel: what log messages to show (default: INFO)
- reset_anahashes: boolean indicating whether the anahashes table should be emptied (default: False)
- base_dir: path to the directory containing the source datafiles
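The arguments above can be collected in a dictionary before calling run(). The sketch below only builds and validates that dictionary; build_run_options and DOCUMENTED_DEFAULTS are made up for this illustration, and passing the options as keyword arguments is an assumption about run()'s signature.

```python
import copy

# Defaults as listed above; alphabet_file and base_dir are left out because
# they are site-specific paths.
DOCUMENTED_DEFAULTS = {
    "env": ".env",
    "reset_db": False,
    "batch_size": 5000,
    "include": [],
    "exclude": [],
    "ingest": True,
    "anahash": True,
    "tmpdir": "/data/tmp",
    "loglevel": "INFO",
    "reset_anahashes": False,
}


def build_run_options(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names.

    Hypothetical helper; not part of ticclat.
    """
    allowed = set(DOCUMENTED_DEFAULTS) | {"alphabet_file", "base_dir"}
    unknown = set(overrides) - allowed
    if unknown:
        raise TypeError(f"unknown option(s): {sorted(unknown)}")
    return {**copy.deepcopy(DOCUMENTED_DEFAULTS), **overrides}


options = build_run_options(include=["elex"], anahash=False)
print(options["include"], options["anahash"])

# With ticclat installed and a .env file in place, the call itself would be
# (ASSUMPTION: run() accepts these names as keyword arguments):
#     from ticclat import ingest
#     ingest.run(**options)
```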
The following sources can be ingested (and added to the include and exclude lists):
- twente: spelling correction lexicon
- inl: lexicon
- SoNaR500: corpus
- elex: lexicon
- groene boekje: lexicon
- OpenTaal: lexicon
- sgd: Staten Generaal Digitaal, corpus
- edbo: Early Dutch Books Online, corpus
- dbnl: Digitale Bibliotheek voor de Nederlandse letteren
- morph_par: Morphological Paradigms
- wf_freqs: Generate materialized view (table) containing wordforms and their total frequencies in the corpora
- sgd_ticcl: ingest ticcl corrections based on the SGD data (we currently have data for two wordforms: Amsterdam and Binnenlandsche)
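To catch a misspelled source name before a long ingestion run starts, the include and exclude lists can be checked against the names above. The sketch below is one plausible reading, not ticclat's actual logic: it assumes that an empty include list means "all sources" and that exclude is applied afterwards. select_sources and KNOWN_SOURCES are names invented for this example.

```python
KNOWN_SOURCES = [
    "twente", "inl", "SoNaR500", "elex", "groene boekje", "OpenTaal",
    "sgd", "edbo", "dbnl", "morph_par", "wf_freqs", "sgd_ticcl",
]


def select_sources(include=(), exclude=()):
    """Return the sources to process, in the order listed above.

    ASSUMPTION: an empty include list selects everything; ticclat's own
    rule may differ.
    """
    unknown = [name for name in (*include, *exclude) if name not in KNOWN_SOURCES]
    if unknown:
        raise ValueError(f"unknown source(s): {unknown}")
    candidates = include or KNOWN_SOURCES
    return [name for name in KNOWN_SOURCES
            if name in candidates and name not in exclude]


print(select_sources(include=["elex", "twente"]))  # ['twente', 'elex']
print(len(select_sources(exclude=["sgd"])))        # 11
```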
Starting from Ubuntu (18.04), set up the MySQL database as described above. Then clone this repository and install the dependencies: conda (e.g. Miniconda, https://docs.conda.io/en/latest/miniconda.html), libmysqlclient-dev and build-essential (apt-get update && apt-get install -y libmysqlclient-dev build-essential).
conda create --name ticclat-web
conda activate ticclat-web
conda install pip
From the ticclat directory, install the package with pip install . (the same command as in the installation steps at the top).
Create a .env file with the following:
DATABASE_URL=mysql://[user]:[pass]@[host]:[port]/[db_name]?charset=utf8mb4&local_infile=1
FLASK_APP=ticclat.flask_app.py
FLASK_ENV=production
FLASK_DEBUG=0
#for DEV:
#FLASK_ENV=development
#FLASK_DEBUG=1
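The .env format is simple enough to read by hand, which can be useful for checking a file before starting the server. The sketch below parses KEY=VALUE lines and skips comments; parse_env is a made-up helper and not necessarily how ticclat loads the file, and the sample values are placeholders.

```python
import io


def parse_env(lines):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments.

    Hypothetical helper; ticclat may use a dedicated loader instead.
    """
    settings = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so '=' inside the URL query survives.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        settings[key.strip()] = value.strip()
    return settings


# Placeholder content in the same shape as the .env file above.
sample = io.StringIO(
    "DATABASE_URL=mysql://user:secret@localhost:3306/ticclat?charset=utf8mb4&local_infile=1\n"
    "FLASK_ENV=production\n"
    "#FLASK_DEBUG=1\n"
)
print(parse_env(sample))
```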
You can now run a development server using: flask run
Or a production server:
export $(cat .env | xargs)
gunicorn ticclat.flask_app.wsgi:app --bind 0.0.0.0:8000 --max-requests 100 --workers 2 --timeout 30
where the last three options may not be necessary, but can be tweaked for stability and performance.
If the debugger in e.g. PyCharm isn't working correctly, this might be because test coverage is enabled. Disable it temporarily by commenting out the addopts line in setup.cfg:
[tool:pytest]
testpaths = tests
#addopts = --cov --cov-report xml --cov-report term --cov-report html
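To confirm whether coverage options are currently active, the configuration can be inspected with Python's configparser, which treats lines starting with # as comments. This is a small sketch on inline sample text; coverage_enabled is a made-up name, and for the real file you would read the contents of setup.cfg instead.

```python
import configparser


def coverage_enabled(cfg_text: str) -> bool:
    """Report whether [tool:pytest] has an active (uncommented) addopts line.

    Hypothetical helper for illustration only.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.has_option("tool:pytest", "addopts")


disabled = "[tool:pytest]\ntestpaths = tests\n#addopts = --cov --cov-report xml\n"
enabled = "[tool:pytest]\ntestpaths = tests\naddopts = --cov --cov-report xml\n"
print(coverage_enabled(disabled), coverage_enabled(enabled))  # False True
```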
If you want to contribute to the development of ticclat, have a look at the contribution guidelines.
Copyright (c) 2019, Netherlands eScience Center and Meertens Instituut
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This package was created with Cookiecutter and the NLeSC/python-template.