Experiments:
5-Fold evaluation results (original) (abs, ne)
Resources | Acc | hPrec | hRec | hF |
---|---|---|---|---|
10k | 0.734 | 0.894 | 0.886 | 0.890 |
1M | 0.825 | 0.944 | 0.939 | 0.942 |
Full(3M) | 0.835 | 0.952 | 0.949 | 0.950 |
5-Fold evaluation results (reproduced) (abs, ne)
Resources | Acc | hPrec | hRec | hF |
---|---|---|---|---|
10k | 0.766 | 0.899 | 0.898 | 0.899 |
1M | 0.880 | 0.956 | 0.959 | 0.957 |
Full(3M) | 0.879 | 0.954 | 0.956 | 0.955 |
5-Fold 1M entries (original) abstracts, name entity recognition, preprocessing
ABS | NER | PP | Acc | hPrec | hRec | hF |
---|---|---|---|---|---|---|
+ | + | - | 0.825 | 0.944 | 0.939 | 0.942 |
+ | + | + | 0.811 | 0.937 | 0.932 | 0.935 |
+ | - | - | 0.790 | 0.937 | 0.930 | 0.933 |
+ | - | + | 0.779 | 0.929 | 0.921 | 0.925 |
- | + | - | 0.568 | 0.782 | 0.782 | 0.782 |
5-Fold 1M entries (reproduced) abstracts, name entity recognition, preprocessing
ABS | NER | PP | Acc | hPrec | hRec | hF |
---|---|---|---|---|---|---|
+ | + | - | 0.880 | 0.956 | 0.959 | 0.957 |
+ | + | + | 0.837 | 0.935 | 0.936 | 0.935 |
+ | - | - | 0.852 | 0.949 | 0.950 | 0.949 |
+ | - | + | 0.813 | 0.926 | 0.928 | 0.927 |
- | + | - | 0.575 | 0.783 | 0.779 | 0.781 |
Gold standard evaluation (original) 1825 entries
Resources | Acc | hPrec | hRec | hF |
---|---|---|---|---|
10k | 0.205 | 0.669 | 0.624 | 0.645 |
1M | 0.420 | 0.811 | 0.807 | 0.809 |
Full(3M) | 0.449 | 0.827 | 0.822 | 0.825 |
----------- | ------- | ------- | ------- | ------- |
hSVM | 0.548 | 0.890 | 0.665 | 0.761 |
SDType | 0.338 | 0.809 | 0.641 | 0.715 |
Gold standard evaluation (reproduced) 1289 entries
Resources | Acc | hPrec | hRec | hF |
---|---|---|---|---|
10k | 0.479 | 0.830 | 0.815 | 0.823 |
1M | 0.556 | 0.874 | 0.872 | 0.873 |
Full(3M) | 0.573 | 0.889 | 0.883 | 0.886 |
----------- | ------- | ------- | ------- | ------- |
hSVM | 0.548 | 0.890 | 0.665 | 0.761 |
SDType | 0.338 | 0.809 | 0.641 | 0.715 |
New gold standard eval 2563962 entries 10k: 0.639 0.855 0.893 0.874 1m: 0.653 0.875 0.926 0.900 3m: 0.667 0.883 0.934 0.908
Testing of new possible upgrades with 1Million instances: - printable names = SoccerPlayer -> Soccer Player - path2root = SoccerPlayer -> owl:Thing, Agent, Person, Athlete, SoccerPlayer
printable names | path2root | Acc | hPrec | hRec | hF |
---|---|---|---|---|---|
- | - | 0.880 | 0.956 | 0.959 | 0.957 |
+ | - | 0.873 | 0.955 | 0.955 | 0.955 |
- | + | 0.879 | 0.957 | 0.959 | 0.958 |
+ | + | 0.869 | 0.953 | 0.956 | 0.955 |
Experiment with 1 million records, how preprocessing techniques affect the result?
stw | stemm | lemma | punct | Acc | hPrec | hRec | hF |
---|---|---|---|---|---|---|---|
- | - | - | - | 0.880 | 0.956 | 0.959 | 0.958 |
+ | - | - | - | 0.733 | 0.890 | 0.892 | 0.891 |
+ | + | - | - | 0.727 | 0.883 | 0.882 | 0.883 |
+ | - | + | - | 0.599 | 0.815 | 0.817 | 0.816 |
+ | + | + | - | 0.609 | 0.817 | 0.820 | 0.818 |
+ | - | - | + | 0.728 | 0.883 | 0.886 | 0.885 |
Test with spanish data: There are in total 785750 instances, we are going to try with the 100%, 33% (259297) and 0.33% of the total data (like english).
5-Fold evaluation results (reproduced) (abs, ne)
Resources | Acc | hPrec | hRec | hF |
---|---|---|---|---|
0.33 | 0.792 | 0.896 | 0.917 | 0.906 |
33 | 0.921 | 0.972 | 0.974 | 0.973 |
Full(800k) | 0.924 | 0.973 | 0.976 | 0.974 |
5-Fold 33% entries (original) abstracts, name entity recognition, preprocessing
ABS | NER | PP | Acc | hPrec | hRec | hF |
---|---|---|---|---|---|---|
+ | + | - | 0.921 | 0.972 | 0.974 | 0.973 |
+ | + | + | 0.903 | 0.963 | 0.965 | 0.964 |
+ | - | - | 0.899 | 0.968 | 0.970 | 0.969 |
+ | - | + | 0.876 | 0.956 | 0.958 | 0.957 |
- | + | - | 0.650 | 0.796 | 0.801 | 0.799 |
printable names | path2root | Acc | hPrec | hRec | hF |
---|---|---|---|---|---|
- | - | 0.921 | 0.972 | 0.974 | 0.973 |
+ | - | 0.917 | 0.972 | 0.973 | 0.972 |
- | + | 0.920 | 0.972 | 0.974 | 0.973 |
+ | + | 0.915 | 0.971 | 0.973 | 0.972 |
There is no gold standard dataset in Spanish so there is no testing with that.
(146 unique classes on the spanish dbpedia dataset vs 405 unique classes on the english dbpedia dataset) Despite having less data, the Spanish version performs better than the English version. It is believed that this is due to the number of possible classes in both datasets.
5-Fold evaluation results (reproduced) (abs, ne)
Resources | Acc | hPrec | hRec | hF |
---|---|---|---|---|
10k | 0.766 | 0.899 | 0.898 | 0.899 |
1M | 0.880 | 0.956 | 0.959 | 0.957 |
Full(3M) | 0.xxx | 0.xxx | 0.xxx | 0.xxx |
10k: 0.777 0.912 0.918 0.915
5-Fold evaluation results (reproduced) (abs, ne) (cropping the tdm)
Resources | Acc | hPrec | hRec | hF |
---|---|---|---|---|
10k | 0.777 | 0.904 | 0.908 | 0.906 |
1M | 0.876 | 0.955 | 0.955 | 0.955 |
Full(3M) | 0.875 | 0.951 | 0.954 | 0.952 |
1m: 1715230 --> 594896 features
5-Fold evaluation results (reproduced) (abs, ne) (2-grams)
Resources | Acc | hPrec | hRec | hF |
---|---|---|---|---|
10k | 0.768 | 0.911 | 0.903 | 0.907 |
1M | 0.xxx | 0.xxx | 0.xxx | 0.xxx |
Full(3M) | 0.xxx | 0.xxx | 0.xxx | 0.xxx |
5-Fold evaluation results (reproduced) (abs, ne) (3-grams)
Resources | Acc | hPrec | hRec | hF |
---|---|---|---|---|
10k | 0.739 | 0.881 | 0.877 | 0.879 |
1M | 0.xxx | 0.xxx | 0.xxx | 0.xxx |
Full(3M) | 0.xxx | 0.xxx | 0.xxx | 0.xxx |
learning curve: acc hP hR hF 0.5: 0.879 0.956 0.959 0.957 1: 0.879 0.956 0.959 0.957 1.5: 0.882 0.957 0.958 0.958 2: 0.881 0.957 0.958 0.957 2.5: 0.880 0.955 0.957 0.956 3:
trim: baseline: doc freq 2 : term freq 2: 0.877 0.954 0.955 0.955 term freq 5: 0.867 0.949 0.951 0.950
min_term | max_doc | Features | Acc | hPrec | hRec | hF |
---|---|---|---|---|---|---|
0.95 | 0.1 | 86929 | 0.857 | 0.943 | 0.945 | 0.944 |
0.90 | 0.1 | 186195 | 0.864 | 0.947 | 0.949 | 0.948 |
0.90 | 0.1 | 366712 | 0.870 | 0.950 | 0.952 | 0.951 |
0.95 | 0.2 | 86929 | 0.857 | 0.943 | 0.945 | 0.944 |
0.90 | 0.2 | 186195 | 0.864 | 0.947 | 0.949 | 0.948 |
0.80 | 0.2 | 366712 | 0.870 | 0.950 | 0.952 | 0.951 |
0.95 | 0.3 | 86929 | 0.857 | 0.943 | 0.945 | 0.944 |
0.90 | 0.3 | 186195 | 0.864 | 0.947 | 0.949 | 0.948 |
0.80 | 0.3 | 366712 | 0.870 | 0.950 | 0.952 | 0.951 |
xxxx | xxx | 1716880 | 0.879 | 0.956 | 0.959 | 0.957 |
fasttext 1m 200 epchs
N 201230 P@1 0.898 R@1 0.898
3m 100 epchs N 609788 P@1 0.885 R@1 0.885
original ne_types
df["max"].value_counts() False 1532048 True 1516894 Name: max, dtype: int64 1516894/3048942 0.49751487565194746
no ne_types
df["max"].value_counts() False 2179510 True 854196 Name: max, dtype: int64 854196/3033706 0.2815684842235866
unique ne_types
df["max"].value_counts() False 1881317 True 1152389 Name: max, dtype: int64 1152389/3033706 0.3798617928039171
1M,en,ABS+NE+use_lower=FALSE "2022-02-11 09:09:46 CET" acc hP hR hF 0.917 0.975 0.974 0.975
1M,en,ABS+NE+use_lower=TRUE 2022-02-11 11:31:00 (abs_ne_1m_lower_en) 0.919 0.976 0.975 0.975
10k,en,ABS+NE+use_lower=TRUE 2022-02-11 11:31:00 acc hP hR hF 0.688 0.890 0.872 0.881
10k,en,ABS+NE+use_lower=FALSE 2022-02-11 11:31:00 acc hP hR hF 0.673 0.876 0.863 0.869
1M,en,ABS+use_lower=TRUE 2022-02-11 14:08:50" acc hP hR hF 0.888 0.969 0.968 0.968
1M,en,ABS+NER(use_printable_names=TRUE)+PP(use_lower=TRUE,use_steam=TRUE) 2022-02-13 09:51:34" acc hP hR hF 0.917 0.975 0.975 0.975 ---- test pp ---
pp svm en 1m base: | 1M | 0.880 | 0.956 | 0.959 | 0.957 |
lower: [1] "4. Using preprocessing before vectorization" [1] "Lowercasing abstracts" [1] " rebuilding abstracts¨" acc hP hR hF 0.867 0.950 0.953 0.952 [1] "Accuracy: 0.866610346369826" [1] "Hierarchical Precission: 0.949992308644274" [1] "Hierarchical Recall: 0.953478221501986" [1] "Hierarchical F measure: 0.951732073117106"
stw: [1] "4. Using preprocessing before vectorization" [1] " Removing Stopwords" [1] " rebuilding abstracts¨" acc hP hR hF 0.870 0.949 0.953 0.951 [1] "Accuracy: 0.869974655866422" [1] "Hierarchical Precission: 0.949157717950023" [1] "Hierarchical Recall: 0.952584375573164" [1] "Hierarchical F measure: 0.950867959596802"
stem: [1] "4. Using preprocessing before vectorization" [1] " Applying stemming" [1] " rebuilding abstracts¨" acc hP hR hF 0.868 0.950 0.953 0.952 [1] "Accuracy: 0.868135963822492" [1] "Hierarchical Precission: 0.950179908136938" [1] "Hierarchical Recall: 0.953360976776258" [1] "Hierarchical F measure: 0.951767784463337"
lemma: [1] "4. Using preprocessing before vectorization" [1] " Applying lematization" [1] " rebuilding abstracts¨" acc hP hR hF 0.871 0.951 0.954 0.953 [1] "Accuracy: 0.870660438304428" [1] "Hierarchical Precission: 0.951233090434348" [1] "Hierarchical Recall: 0.954062123452892" [1] "Hierarchical F measure: 0.952645506631878"
punct: [1] "4. Using preprocessing before vectorization" acc hP hR hF 0.878 0.955 0.958 0.956 [1] "Accuracy: 0.878447547582368" [1] "Hierarchical Precission: 0.954768271975645" [1] "Hierarchical Recall: 0.958206318213794" [1] "Hierarchical F measure: 0.956484205622963"
10k acc hP hR hF 0.758 0.913 0.909 0.911 acc hP hR hF 0.358 0.794 0.762 0.778 1m model: 0.916 0.975 0.975 0.975 gs: acc hP hR hF 0.571 0.892 0.885 0.888 3m 0.929 0.979 0.979 0.979 acc hP hR hF 0.583 0.897 0.890 0.894