=======
Introduction
This README v1.1 (Nov 14, 2002) comes from the directory
http://www.cs.cornell.edu/people/pabo/movie-review-data.
Please make sure to read the "Rating Information - WARNING" section
below to determine which datasets are most appropriate for your use.
=======
What's New -- Nov 14, 2002
Nathan Treloar <ntreloar@avaquest.com> of AvaQuest, Inc. kindly
provided us a further cleaned-up version of the data in
mix20_rand700_tokens_cleaned.zip. He removed a few
non-English/incomplete reviews and corrected the labels of some
mis-classified reviews (judging some sentiments differently from the
original author's rating). This cleaned version is now available as
mix20_rand700_tokens_0211.tar.gz, and details on the changes made
can be found in diff.txt.
=======
Citation Info
Some of this data was used in Bo Pang, Lillian Lee, and Shivakumar
Vaithyanathan, ``Thumbs up? Sentiment classification using machine
learning techniques'', in the proceedings of the 2002 Conference on
Empirical Methods in Natural Language Processing (EMNLP-2002). That
paper describes some of the design choices made in constructing the corpora.
@InProceedings{Pang+Lee+Vaithyanathan:02a,
author = {Bo Pang and Lillian Lee and Shivakumar Vaithyanathan},
title = {Thumbs up? Sentiment Classification using Machine Learning Techniques},
booktitle = "Proceedings of the 2002 Conference on Empirical Methods in Natural
Language Processing (EMNLP)",
year = 2002
}
=======
Data Format Summary
The original source was the Internet Movie Database (IMDb) archive of
the rec.arts.movies.reviews newsgroup at http://reviews.imdb.com/Reviews.
- movie.zip: all html files from the IMDb archive at the time we
accessed it. A subset of these files was chosen for creating
training and testing corpora (see Pang et al 2002 for rationale).
- mix20_rand700_tokens.zip: contains this readme and the folder ``tokens'',
with the roughly 1400 processed down-cased text files used in
Pang/Lee/Vaithyanathan 2002; the names of the two subdirectories
in that folder, ``pos'' and ``neg'', indicate the true
classification (sentiment) of the component files according to our
automatic rating classifier (see the Rating Decision section below).
Preliminary steps were taken to remove rating information from the
text files, but the process was noisy; *BE SURE TO READ THE ``RATING
INFORMATION'' SECTION BELOW*.
File names consist of a cross-validation tag plus the name of the
original html file. The three folds used in the emnlp paper
experiments were:
fold 1: files tagged cv000 through cv232, in numerical order
fold 2: files tagged cv233 through cv465, in numerical order
fold 3: files tagged cv466 through cv698, in numerical order
Hence the file cv114_tok-16533.txt, for example, was part of
fold 1, and was extracted from the file 16533.html in movie.zip.
- mix20_rand700_tokens_cleaned.zip: further refined versions of
the above text files, with more filtering of rating information and
some further cleaning up of html tags.
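Because the fold assignment above is purely mechanical, a file's fold can be
recovered from its name alone. A minimal sketch (the function name and regular
expression are ours, not part of the distribution):

```python
import re

def fold_of(filename):
    """Map a cross-validation tag (cv000-cv698) in a file name to its
    fold (1-3), following the ranges given in this README:
    cv000-cv232 -> fold 1, cv233-cv465 -> fold 2, cv466-cv698 -> fold 3."""
    m = re.match(r"cv(\d{3})_", filename)
    if m is None:
        raise ValueError("no cv tag in %r" % filename)
    n = int(m.group(1))
    if n <= 232:
        return 1
    if n <= 465:
        return 2
    if n <= 698:
        return 3
    raise ValueError("cv tag %d out of range" % n)

# e.g. cv114_tok-16533.txt (extracted from 16533.html) is in fold 1:
print(fold_of("cv114_tok-16533.txt"))  # 1
```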
=======
Rating Information - WARNING
It is important to note that for some of the files used in our emnlp
experiments, some rating information, such as a number of stars
(e.g. * * * *), got through our filters, which in effect means that
the correct classification could be inferred from the stray
rating information rather than the review, which is not what is
intended. The algorithms and feature sets used in
Pang/Lee/Vaithyanathan 2002 cannot take great advantage of this
information, because only unigram and bigram feature presence, not
feature frequency, was employed (see the paper for more
information). See Appendix B for performance on the cleaned data set.
HOWEVER, other feature selection and classification algorithms can
potentially accidentally use the rating information in making the
categorization decision, which would lead to artificially inflated
results.
Hence, IF you are using algorithms that can take advantage of stray
rating information, YOU SHOULD USE mix20_rand700_tokens_cleaned,
NOT THE EMNLP DATA.
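The feature presence vs. feature frequency distinction mentioned above can be
illustrated with a toy sketch (this only illustrates the idea; it is not the
feature extractor used in the paper):

```python
from collections import Counter

def unigram_features(tokens, presence=True):
    """Presence features record only whether a word occurs (value 1);
    frequency features record how often it occurs.  A classifier using
    presence features sees repeated rating tokens (e.g. stray stars)
    only once, so it benefits less from stray rating information."""
    counts = Counter(tokens)
    if presence:
        return {w: 1 for w in counts}
    return dict(counts)

tokens = "good good film bad".split()
print(unigram_features(tokens, presence=True))   # {'good': 1, 'film': 1, 'bad': 1}
print(unigram_features(tokens, presence=False))  # {'good': 2, 'film': 1, 'bad': 1}
```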
The reviews in mix20_rand700_tokens_cleaned have been
passed through further automatic filters. Potential sources of
remaining errors include
- reviews containing rating information in several places;
- reviews where the range (e.g. 5-star maximum instead of 4)
wasn't specified in an expected manner, causing a
classification of "positive" instead of "neutral", say. But
since we only considered positive vs. negative sentiments, we
do not view this as a serious issue;
- incorrect removal of extraneous boilerplate sometimes deleted
portions of the review, due to the occurrence of unexpected
tags.
=======
Rating Decision (Appendix A)
This section describes how we determined whether a review was positive
or negative.
The original html files do not have consistent formats -- a review may
not have the author's rating with it, and when it does, the rating can
appear at different places in the file in different forms. We only
recognize some of the more explicit ratings, which are extracted via a
set of ad-hoc rules. In essence, a file's classification is decided
based on the first rating we were able to identify.
- For ratings specified in stars, we assume a maximum of four stars
unless the author specified otherwise (e.g. "*** out of *****");
- For ratings specified in numbers, the maximum rating must be
specified explicitly.
- With a five-star system (or compatible number systems):
four stars and up are considered positive,
two stars and below are considered negative;
- With a four-star system (or compatible number system):
three stars and up are considered positive,
one star and below are considered negative.
We attempted to recognize half stars, but they are specified in an
especially free way, which makes them difficult to recognize. Hence,
we may occasionally lose a half star; but this only results in 2.5
stars in a five-star system (1.5 stars in a four-star system) being
categorized as negative, which is still reasonable.
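The thresholds above can be summarized in a short sketch (a simplification of
the ad-hoc extraction rules: it assumes the numeric rating and its maximum
have already been extracted from the review text):

```python
def classify(rating, max_rating=4):
    """A sketch of the Appendix A thresholds: in a five-star system,
    four stars and up are positive and two stars and below are negative;
    in a four-star system, three and up are positive and one and below
    are negative.  Ratings in between are neutral (and were excluded
    from the pos/neg corpora)."""
    if max_rating == 5:
        pos, neg = 4, 2
    elif max_rating == 4:
        pos, neg = 3, 1
    else:
        raise ValueError("unsupported maximum rating: %r" % max_rating)
    if rating >= pos:
        return "pos"
    if rating <= neg:
        return "neg"
    return "neutral"

print(classify(3, max_rating=4))  # pos
print(classify(2, max_rating=5))  # neg
```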
=======
Performance on data sets (Appendix B)
- On cleaned data set (mix20_rand700_tokens_cleaned.zip)
Features             # features     NB     ME    SVM
unigrams (freq.)          16162   79.0    n/a   73.0
unigrams                  16162   81.0   80.2   82.9
unigrams+bigrams          32324   80.7   80.7   82.8
bigrams                   16162   77.3   77.5   76.5
unigrams+POS              16688   81.3   80.3   82.0
adjectives                 2631   76.6   77.6   75.3
top 2631 unigrams          2631   80.9   81.3   81.2
unigrams+position         22407   80.8   79.8   81.8
- On original data set (mix20_rand700_tokens.zip)
We discovered a slight bug in end-of-file handling in our original Naive
Bayes code that affected (sometimes negatively, sometimes positively)
the first decimal place (i.e., tenths of a percent) of the results
reported in our EMNLP 2002 paper. Here are the results:
                     *corrected*       *in paper*
Features                     NB*     NB     ME    SVM
unigrams (freq.)            79.0   78.7    n/a   72.8
unigrams                    81.5   81.0   80.4   82.9
unigrams+bigrams            80.5   80.6   80.8   82.7
bigrams                     77.3   77.3   77.4   77.1
unigrams+POS                81.5   81.5   80.4   81.9
adjectives                  76.8   77.0   77.7   75.1
top 2633 unigrams           80.2   80.3   81.0   81.4
unigrams+position           80.8   81.0   80.1   81.6