Discriminate between the Moldavian and the Romanian dialects across different text genres (news versus tweets)
(Project details can also be found on the Kaggle competition here)
Participants have to train a model on tweets, i.e., build a model for an in-genre binary classification-by-dialect task, in which the classifier must discriminate between the Moldavian (label 0) and the Romanian (label 1) dialects.
The training data consists of 7757 samples and the validation set of 2656 samples. All samples are preprocessed so that named entities are replaced with a special tag.
Note that the tweets are encrypted; you may or may not decrypt the texts when solving this project.
- train_samples.txt - the training data samples (one sample per row)
- train_labels.txt - the training labels (one label per row)
- validation_samples.txt - the validation data samples (one sample per row)
- validation_labels.txt - the validation labels (one label per row)
- test_samples.txt - the test data samples (one sample per row)
- sample_submission.txt - a sample submission file in the correct format
You can download the data for this project here
The data samples are provided as TAB-separated values, in the following format:
112752 sAFW K#xk}t fH@ae m&Xd >h& @# l@Rd}a @Hc liT ehAr@m Xgmz !}a }eAr@m Be g@@m efH RB(D Ehk&
107227 X;d:N qnwB Acke@m m*g lvc& ggcp ht*A mat; }:@ HA&@m HA@e hZ Er#@m
101685 #fEw w!ygF dDB XwfE| HrWe@mH
Each line represents a data sample where:
- The first column shows the ID of the data sample.
- The second column is the actual data sample.
The labels are provided as TAB-separated values, in the following format:
1 1
2 0
Each line represents a label associated with a data sample, where:
- The first column shows the ID of the data sample.
- The second column is the actual label.
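Assuming the files follow exactly this TAB-separated layout, a minimal loader could look like the sketch below (pandas is my choice here; the project itself may load the files differently):

```python
import pandas as pd

def load_tsv(path, value_column):
    # Each row is "<ID>\t<value>"; quoting is disabled (quoting=3, i.e.
    # csv.QUOTE_NONE) because the encrypted texts contain arbitrary punctuation.
    return pd.read_csv(path, sep="\t", header=None,
                       names=["id", value_column], quoting=3)

train_texts = load_tsv("train_samples.txt", "text")
train_labels = load_tsv("train_labels.txt", "label")
val_texts = load_tsv("validation_samples.txt", "text")
val_labels = load_tsv("validation_labels.txt", "label")
```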
The evaluation measure is the macro F1 score computed on the test set. The macro F1 score is the mean of the F1 scores computed for each class, where the per-class F1 score is F1 = 2 · (P · R) / (P + R), with P the precision and R the recall.
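In scikit-learn terms this metric is `f1_score` with `average="macro"`; a quick sanity check on toy labels, just to illustrate the formula:

```python
from sklearn.metrics import f1_score

# Macro F1 averages the per-class F1 = 2*P*R/(P+R) scores over both classes.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(f1_score(y_true, y_pred, average="macro"))  # mean of F1(class 0) and F1(class 1)
```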
You can try your submissions here
To solve the sub-dialect identification task, I tried the following models:
1. Logistic Regression
2. Naïve Bayes
3. SVM
4. Linear Regression
5. LSTM
6. LSTM + CNN
7. GRU cells
The first four models shared a similar workflow (a minimal sketch follows the list):
1. Load the data from the .txt files
2. Apply the TF-IDF technique
3. Optionally, normalize and standardize the data
4. Fit and predict
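A minimal sketch of this workflow, assuming scikit-learn and the loader shown earlier (the concrete hyperparameters are illustrative, not the exact ones I used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

model = Pipeline([
    ("tfidf", TfidfVectorizer()),                 # step 2: TF-IDF features
    ("scale", MaxAbsScaler()),                    # step 3 (optional): sparse-friendly scaling
    ("clf", LogisticRegression(max_iter=1000)),   # swap in MultinomialNB, LinearSVC, etc.
])

model.fit(train_texts["text"], train_labels["label"])  # step 4: fit...
val_pred = model.predict(val_texts["text"])            # ...and predict
print(f1_score(val_labels["label"], val_pred, average="macro"))
```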
Logistic Regression accuracy score -> 0.6295180722891566
Confusion matrix
[[792 509]
[475 880]]
Classification report
precision recall f1-score support
0 0.63 0.61 0.62 1301
1 0.63 0.65 0.64 1355
accuracy 0.63 2656
macro avg 0.63 0.63 0.63 2656
weighted avg 0.63 0.63 0.63 2656
Logistic Regression F1 score: 0.629360764766923
Naïve Bayes accuracy score -> 0.6554969879518072
Confusion matrix
[[760 541]
[374 981]]
Classification report
precision recall f1-score support
0 0.67 0.58 0.62 1301
1 0.64 0.72 0.68 1355
accuracy 0.66 2656
macro avg 0.66 0.65 0.65 2656
weighted avg 0.66 0.66 0.65 2656
Naïve Bayes F1 score: 0.6536820451582341
SVM accuracy score -> 0.6321536144578314
Confusion matrix
[[780 521]
[456 899]]
Classification report
precision recall f1-score support
0 0.63 0.60 0.61 1301
1 0.63 0.66 0.65 1355
accuracy 0.63 2656
macro avg 0.63 0.63 0.63 2656
weighted avg 0.63 0.63 0.63 2656
SVM F1 score: 0.6317494637382586
Linear Regression accuracy score -> 0.5835843373493976
Confusion matrix
[[755 546]
[560 795]]
Classification report
precision recall f1-score support
0 0.57 0.58 0.58 1301
1 0.59 0.59 0.59 1355
accuracy 0.58 2656
macro avg 0.58 0.58 0.58 2656
weighted avg 0.58 0.58 0.58 2656
Linear Regression F1 score: 0.583617401506497
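A note on the Linear Regression run: since it outputs a continuous value, I assume the predictions were thresholded at 0.5 to obtain 0/1 labels, roughly like this (X_train_tfidf / X_val_tfidf are hypothetical names for the TF-IDF matrices from the pipeline above):

```python
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train_tfidf, train_labels["label"])
# Round the continuous output to the nearest class label.
val_pred = (reg.predict(X_val_tfidf) >= 0.5).astype(int)
```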
The last three models also shared a similar workflow (again, a minimal sketch follows the list):
1. Load the data from the .txt files
2. Map each encrypted word to its frequency-ranked integer index
3. Fit and predict
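A minimal sketch of this second workflow, assuming Keras (layer sizes, sequence length, and epochs are illustrative, not the exact values I used):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Step 2: the Tokenizer indexes words by frequency (most frequent word -> 1).
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts["text"])
X_train = pad_sequences(tokenizer.texts_to_sequences(train_texts["text"]), maxlen=100)
X_val = pad_sequences(tokenizer.texts_to_sequences(val_texts["text"]), maxlen=100)

# Step 3: a small recurrent model; replace LSTM with GRU, or prepend a
# Conv1D + MaxPooling1D block, for the other two variants.
model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=64),
    LSTM(64),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, np.asarray(train_labels["label"]),
          validation_data=(X_val, np.asarray(val_labels["label"])),
          epochs=5, batch_size=64)
```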
LSTM accuracy score -> 0.6788403614457831
Confusion matrix
[[813 488]
[365 990]]
Classification report
precision recall f1-score support
0 0.69 0.62 0.66 1301
1 0.67 0.73 0.70 1355
accuracy 0.68 2656
macro avg 0.68 0.68 0.68 2656
weighted avg 0.68 0.68 0.68 2656
LSTM F1 score: 0.6778447812774927
LSTM with CNN accuracy score -> 0.6125753012048193
Confusion matrix
[[796 505]
[524 831]]
Classification report
precision recall f1-score support
0 0.60 0.61 0.61 1301
1 0.62 0.61 0.62 1355
accuracy 0.61 2656
macro avg 0.61 0.61 0.61 2656
weighted avg 0.61 0.61 0.61 2656
LSTM with CNN F1 score: 0.6126118294013413
GRU cells accuracy score -> 0.5101656626506024
Confusion matrix
[[ 0 1301]
[ 0 1355]]
Classification report
precision recall f1-score support
0 0.00 0.00 0.00 1301
1 0.51 1.00 0.68 1355
accuracy 0.51 2656
macro avg 0.26 0.50 0.34 2656
weighted avg 0.26 0.51 0.34 2656
GRU cells F1 score: 0.3446893407586967
- Best accuracy score: LSTM, with ~68%
- Second best accuracy score: Naïve Bayes, with ~66%
- Best macro F1 score: LSTM, with ~68%
- Second best macro F1 score: Naïve Bayes, with ~65%
- The GRU model degenerated into predicting only class 1 (see its confusion matrix), which explains its low macro F1 of ~0.34
- Language: Python
- Author: Arghire Gabriel (me)