LogisticRegression-Shared-Task-Parallel-Corpus-Filtering

This is the core system (based on "sklearn") that participated in the WMT18 "Parallel Corpus Filtering Task" (http://statmt.org/wmt18/parallel-corpus-filtering.html).
The whole pipeline is presented in the paper "A hybrid system of rule and machine learning to filter web-crawled parallel corpora". The paper is also included here in the file "HybridSystem.pdf".
In the folder Training there is a manually annotated file with positive and negative examples.
In the folder Test there is a test file that you can classify using the trained model.
In the folder Evaluation there is the automatically annotated file that we used to build the model that was run on the test corpus provided by the organizers. If you want to use this file instead of the manually annotated one, copy the automatically annotated file into the Training directory and modify the corresponding parameter values (Parameters/p-Training.txt for training and Parameters/p-Test.txt for test).

  1. Train (the parameters are configured in Parameters/p-Training.txt). Generates the features for the training file.

python train.py

  2. Test (the parameters are configured in Parameters/p-Test.txt). Fits a Logistic Regression model on the training file, generates the features for the test file, and classifies the test file.

python test.py

  3. Evaluation. Go to the folder Evaluation and evaluate the automatically annotated file described in the paper against the manually annotated file.

python compareTrainingSets.py
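
The steps above can be illustrated with a minimal, self-contained sketch. It is not the code in train.py, test.py, or compareTrainingSets.py: the two features, the toy sentence pairs, and the helper names `pair_features` and `agreement` are assumptions made up for this example. The only library calls used are scikit-learn's `LogisticRegression` with `fit`, `predict`, and `predict_proba`.

```python
from sklearn.linear_model import LogisticRegression


def pair_features(source, target):
    """Hypothetical features for one sentence pair (not the repository's feature set)."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    # Ratio of the shorter to the longer sentence in tokens (1.0 means equal length).
    longer = max(len(src_tokens), len(tgt_tokens), 1)
    shorter = min(len(src_tokens), len(tgt_tokens))
    length_ratio = shorter / longer
    # Share of source-side numbers that also appear on the target side.
    src_numbers = {t for t in src_tokens if t.isdigit()}
    tgt_numbers = {t for t in tgt_tokens if t.isdigit()}
    if src_numbers:
        number_match = len(src_numbers & tgt_numbers) / len(src_numbers)
    else:
        number_match = 1.0
    return [length_ratio, number_match]


def agreement(labels_a, labels_b):
    """Fraction of positions where two label lists agree (simple accuracy)."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy annotated pairs (assumed data): label 1 = keep the pair, 0 = discard it.
training_pairs = [
    ("the house is red", "das Haus ist rot", 1),
    ("we met in 2018", "wir trafen uns 2018", 1),
    ("see page 7 for details", "siehe Seite 7 für Details", 1),
    ("click here", "das ist ein sehr langer Satz ohne Bezug zum Original", 0),
    ("room 12 is free", "Zimmer 45", 0),
    ("home", "Alle Rechte vorbehalten 2015 bis 2018 und so weiter", 0),
]

# Step 1: generate the features for the training data.
X_train = [pair_features(src, tgt) for src, tgt, _ in training_pairs]
y_train = [label for _, _, label in training_pairs]

# Step 2: fit the model, generate the features for the test data, classify.
model = LogisticRegression()
model.fit(X_train, y_train)

test_pairs = [
    ("the car is blue", "das Auto ist blau"),
    ("ok", "dieser Text hat mit dem kurzen Original wirklich nichts zu tun"),
]
X_test = [pair_features(src, tgt) for src, tgt in test_pairs]
predicted = model.predict(X_test)
# Column 1 corresponds to class 1 ("keep") because model.classes_ is sorted.
keep_probability = model.predict_proba(X_test)[:, 1]

# Step 3: compare two annotations of the same pairs, here the model's
# labels on the training pairs against the manual labels.
train_agreement = agreement(model.predict(X_train), y_train)
print(predicted, keep_probability, train_agreement)
```

Ranking pairs by `keep_probability` rather than using the hard labels from `predict` is one way to turn the classifier into a filter with an adjustable threshold; whether the repository does this is not stated here.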