Title: README.txt
Date: 2023-08-15
Author: M.Conway
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Data for Assignment 3 of COMP90049_2023_SM2
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Files used for Assignment 3:
## TRAINING/VALIDATION/TEST SPLITS
Columns:
(1) index
(2) dr-id-adjusted
(3) dr_id_gender
(4) review-text-cleaned
(5) rating
See the PDF file associated with Assignment 3 for a detailed description
of the fields.
Note that the rating column (i.e. the sentiment labels) is not present
in the test set.
./TRAIN.csv
Approximately 80% of the data
./VALIDATION.csv
Approximately 10% of the data
./TEST_NO_LABELS.csv
Approximately 10% of the data
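A minimal sketch of reading one split and checking its columns against the list above. It assumes each CSV has a header row using exactly the column names given in this README; the helper name read_split and the two sample rows are made up for illustration and are not part of the assignment data.

```python
import csv
import io

# Column names as listed in this README; that the CSV header row uses
# exactly these names is an assumption.
FEATURE_COLUMNS = ["index", "dr-id-adjusted", "dr_id_gender", "review-text-cleaned"]
LABEL_COLUMN = "rating"


def read_split(handle, has_labels=True):
    """Read one split from an open file and check its columns."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    expected = FEATURE_COLUMNS + ([LABEL_COLUMN] if has_labels else [])
    missing = [name for name in expected if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # The test split is described as having no rating column.
    if not has_labels and LABEL_COLUMN in header:
        raise ValueError("unlabelled split contains a rating column")
    return list(reader)


# Made-up two-row sample standing in for ./TRAIN.csv.
sample_train = io.StringIO(
    "index,dr-id-adjusted,dr_id_gender,review-text-cleaned,rating\n"
    "0,101,F,very helpful and kind,1\n"
    "1,102,M,long wait and rushed visit,0\n"
)
train_rows = read_split(sample_train)
print(len(train_rows), train_rows[0]["rating"])
```

With the real files, the same function could be called on an open file handle, for example open("TRAIN.csv", newline="", encoding="utf-8"), and with has_labels=False for ./TEST_NO_LABELS.csv.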
## TFIDF TRAINING/VALIDATION/TEST SPLITS
Note that row numbers are consistent across all
training/validation/test data files
./TFIDF_TRAIN.csv
TFIDF representations (the 500 highest-ranked TFIDF features) for the
training set.
./TFIDF_VALIDATION.csv
TFIDF representations (the 500 highest-ranked TFIDF features) for the
validation set.
./TFIDF_TEST.csv
TFIDF representations (the 500 highest-ranked TFIDF features) for the
test set.
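Because row numbers are stated to be consistent across files, row i of a feature file should describe the same review as row i of the matching raw split. The sketch below pairs rows by position and fails early if the counts differ. It assumes "consistent" means position-aligned; the helper name align_rows and the sample values are hypothetical.

```python
def align_rows(raw_rows, feature_rows):
    """Pair row i of a raw split with row i of its feature file."""
    if len(raw_rows) != len(feature_rows):
        raise ValueError(
            f"row count mismatch: {len(raw_rows)} raw vs "
            f"{len(feature_rows)} feature rows"
        )
    return list(zip(raw_rows, feature_rows))


# Made-up stand-ins: two reviews and two 3-value feature rows
# (the real TFIDF files are described as having 500 features).
raw = [{"index": "0", "rating": "1"}, {"index": "1", "rating": "0"}]
features = [[0.0, 0.42, 0.0], [0.13, 0.0, 0.0]]

for review, vector in align_rows(raw, features):
    print(review["index"], vector)
```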
## WORD EMBEDDING TRAINING/VALIDATION/TEST SPLITS
Note that row numbers are consistent across all
training/validation/test data files
./384EMBEDDINGS_TRAIN.csv
384-dimension word embedding representation for training set
./384EMBEDDINGS_VALIDATION.csv
384-dimension word embedding representation for validation set
./384EMBEDDINGS_TEST.csv
384-dimension word embedding representation for test set
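A quick sanity check on the embedding files is to parse each row into floats and confirm the expected width. This is only a sketch: it assumes every column in the file is numeric, with no index column, and leaves the presence of a header row as a switch, since this README does not say either way. The name read_vectors is hypothetical, and the sample uses 4 dimensions instead of 384 to stay readable.

```python
import csv
import io

EMBEDDING_DIM = 384  # dimension stated in this README


def read_vectors(handle, dim=EMBEDDING_DIM, skip_header=False):
    """Parse each CSV row into a list of floats of length `dim`."""
    reader = csv.reader(handle)
    if skip_header:
        next(reader, None)  # discard the header row if there is one
    vectors = []
    for line_no, row in enumerate(reader, start=1):
        vector = [float(value) for value in row]
        if len(vector) != dim:
            raise ValueError(
                f"row {line_no}: expected {dim} values, got {len(vector)}"
            )
        vectors.append(vector)
    return vectors


# Made-up 4-dimensional rows standing in for an embedding file.
sample = io.StringIO("0.1,0.2,0.3,0.4\n-0.5,0.0,0.25,1.0\n")
vectors = read_vectors(sample, dim=4)
print(len(vectors), len(vectors[0]))
```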
## TFIDF FEATURE SELECTION
./tfidf_words.txt
Contains the 500 most discriminating features as identified by TFIDF
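The word list can be used to make a TFIDF row readable by pairing each weight with its word. The sketch below rests on two assumptions that this README does not confirm: that tfidf_words.txt holds one word per line, and that the order of the words matches the column order of the TFIDF files. The helper names and sample values are made up.

```python
def load_vocabulary(lines):
    """Return the non-empty, stripped lines as a list of feature words."""
    return [line.strip() for line in lines if line.strip()]


def top_words(vector, vocabulary, k=3):
    """Return the k highest-weighted (word, weight) pairs for one row."""
    if len(vector) != len(vocabulary):
        raise ValueError("vector and vocabulary lengths differ")
    ranked = sorted(zip(vocabulary, vector), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up vocabulary and weights; the real list is described as having
# 500 entries. Assumes word order matches TFIDF column order.
vocabulary = load_vocabulary(["rude\n", "helpful\n", "wait\n", "\n"])
print(top_words([0.0, 0.61, 0.27], vocabulary, k=2))
```

With the real file, load_vocabulary could be passed an open handle on ./tfidf_words.txt, since iterating a text file yields its lines.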