# Project phase 1: Baseline

The goal of this phase is to create a baseline model. Note that the word baseline can mean different things. In the course we distinguished three different types of baselines:
1. The simplest possible approach (a majority baseline, i.e. predicting the most frequent label for every instance, such as positive or noun)
2. A simple machine learning classifier (logistic regression with words as features)
3. The "state-of-the-art" approach on which you want to improve (your starting point)

For this phase, you need to build a baseline of type 2 or 3.
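
To make type 2 concrete, here is a minimal sketch of logistic regression with word counts as features, written in plain Python so that it runs without extra dependencies. The whitespace tokenizer, the hyperparameters (`epochs`, `lr`) and the toy training sentences are illustrative assumptions, not part of the course material; in practice you would probably use a library implementation and train on the provided data.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercased whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()

def score_text(text, weights, bias):
    """Linear score: bias plus the weighted sum of word counts."""
    feats = Counter(tokenize(text))
    return bias + sum(weights[w] * c for w, c in feats.items())

def train_logreg(texts, labels, positive="positive", epochs=20, lr=0.5):
    """Binary logistic regression on bag-of-words counts, trained with SGD."""
    weights = Counter()  # unseen words implicitly have weight 0
    bias = 0.0
    for _ in range(epochs):
        for text, label in zip(texts, labels):
            score = score_text(text, weights, bias)
            score = max(-30.0, min(30.0, score))  # avoid overflow in exp
            prob = 1.0 / (1.0 + math.exp(-score))
            error = (1.0 if label == positive else 0.0) - prob
            for w, c in Counter(tokenize(text)).items():
                weights[w] += lr * error * c
            bias += lr * error
    return weights, bias

def predict(text, weights, bias, positive="positive", negative="negative"):
    """Predict the positive label when the linear score is above zero."""
    return positive if score_text(text, weights, bias) > 0 else negative

# Toy data for illustration only (not taken from the project data).
train_texts = ["great album , love it", "terrible noise , hate it",
               "love this great record", "hate this terrible record"]
train_labels = ["positive", "negative", "positive", "negative"]
weights, bias = train_logreg(train_texts, train_labels)
print(predict("great great love", weights, bias))
print(predict("terrible hate", weights, bias))
```

This sketch only handles two labels; if the data contains more, a multi-class variant (or a library classifier) is needed.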

If you plan to ask a research question like "can we improve sentiment detection systems by doing X?", the answer is most relevant if you have a competitive baseline (type 3). In this case we suggest using a BiLSTM or even a transformer-based model, so that you can reuse the baseline for the final research question (phase 3).

Pick one of the following tasks to build your baseline for.

## Task 1: Sentiment classification
* The data can be found in the `classification` folder.
* The goal is to predict the label in the `sentiment` field.
* **You have to upload your predictions for `music_reviews_test_masked.json.gz` to CodaLab (the link will be posted here on Monday). Note that the format should match the JSON files in the repository.**
* **Also upload a .txt file on LearnIt (one per group) with a short description of your baseline.**

The data can be read as follows:

```python
import gzip
import json

# Each line of the gzipped file is one JSON object (one review).
with gzip.open('classification/music_reviews_dev.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        review_data = json.loads(line)
        for key in review_data:
            print('"' + key + '": ' + str(review_data[key]))
        break  # only show the first review
```

```
"vote": 3
"verified": True
"reviewTime": 12 19, 2012
"reviewerID": A1KKWETTT5BZ6N
"asin": B00474S1J2
"reviewText": My dentist recommended this as a relaxation technique for dental visits. They give me an ipod with headphones, play this on it and it relieves some of the stress of dental treatment, which I dislike intensely.
It worked so well that I bought my own copy to try at home. I fall asleep after a couple of minutes and stay asleep. Instead of tossing and turning, I hardly move at all. Highly recommend.
"summary": Out like a light!
"unixReviewTime": 1355875200
"sentiment": positive
"id": 0
```

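Reading a whole file and writing predictions back in the same one-JSON-object-per-line format could look roughly like the sketch below. It assumes that the masked test file has the same fields as the dev file and that only the `sentiment` value needs to be filled in; the helper names and the output path are hypothetical, so check the result against the files in the repository before uploading.

```python
import gzip
import json
import os
import tempfile

def read_reviews(path):
    """Read a gzipped JSON-lines file into a list of dicts (one per review)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def write_predictions(reviews, predicted_labels, path):
    """Write one JSON object per line, with `sentiment` set to the prediction."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for review, label in zip(reviews, predicted_labels):
            out = dict(review)  # copy, so the input dicts stay unchanged
            out["sentiment"] = label
            f.write(json.dumps(out) + "\n")

# Round trip on a temporary file with a made-up review (illustration only).
demo_reviews = [{"id": 0, "reviewText": "Out like a light!"}]
demo_path = os.path.join(tempfile.mkdtemp(), "predictions.json.gz")
write_predictions(demo_reviews, ["positive"], demo_path)
round_trip = read_reviews(demo_path)
print(round_trip)
```

With the real data, `read_reviews('classification/music_reviews_test_masked.json.gz')` would supply the reviews, and the labels would come from your baseline model.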

## Task 2: Sentiment Expression Labeling
* The data can be found in the `seq_labeling` folder.
* The goal is to predict the BIO labels in the third column.
* Note that the evaluation metric is Span-F1, which means that you will only get "points" if you get the whole span correct! We provide an evaluation script in `seq_labeling/eval.py`.
* **You have to upload your predictions for `opener_en-test-masked.conll` to CodaLab (the link will be posted here on Monday). Note that the format should match the CoNLL files in the repository.**
* **Also upload a .txt file on LearnIt (one per group) with a short description of your baseline.**

* Note that if you use BERT-based embeddings, you need to make sure that the number of labels matches the number of tokens. This is commonly done by only using the embedding of the first subword of each token.
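
The first-subword trick can be sketched without any specific library. The tokenizer below is a hypothetical stand-in that simply cuts words into chunks of four characters; a real BERT tokenizer splits differently and usually adds special tokens such as `[CLS]` at the start, which shifts all indices, so treat this as an illustration of the bookkeeping only.

```python
def toy_subword_tokenize(token, max_len=4):
    """Hypothetical stand-in for a subword tokenizer: chunks of at most
    `max_len` characters, with continuation pieces marked by '##'."""
    pieces = [token[i:i + max_len] for i in range(0, len(token), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def first_subword_indices(tokens, tokenize=toy_subword_tokenize):
    """Return the flat subword list and, per token, the index of its first subword."""
    subwords, first_idx = [], []
    for token in tokens:
        first_idx.append(len(subwords))  # position where this token starts
        subwords.extend(tokenize(token))
    return subwords, first_idx

tokens = ["very", "friendly", "reception"]
subwords, first_idx = first_subword_indices(tokens)
print(subwords)   # ['very', 'frie', '##ndly', 'rece', '##ptio', '##n']
print(first_idx)  # [0, 1, 3]

# Placeholder "embeddings": one vector per subword (here just its index).
subword_vectors = [[float(i)] for i in range(len(subwords))]
# Keep one vector per original token, so that labels and tokens line up.
token_vectors = [subword_vectors[i] for i in first_idx]
print(len(token_vectors) == len(tokens))  # True
```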

The data looks as follows:

```shell
head seq_labeling/opener_en-dev.conll
```

```
# sent_id=opener_en/kaf/hotel/english00032_30ddf6dff464d0b92c6fbae7019ece91-2
1	very	B-Positive
2	warm	I-Positive
3	welcome	O
4	at	O
5	the	O
6	reception	O
7	,	O
8	very	B-Positive
9	friendly	I-Positive
```
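
To illustrate the file format and why Span-F1 only rewards complete spans, here is a rough sketch that parses such lines and computes a micro-averaged, labeled span F1. It is not the official scorer: `seq_labeling/eval.py` is the reference, and it may treat edge cases differently. The sketch assumes tab-separated columns, comment lines starting with `#`, and that an `I-` label without a matching preceding label starts a new span.

```python
def read_conll(lines):
    """Parse lines into sentences: lists of (word, label) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            # Blank lines and comment lines end the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split("\t")  # columns: index, word, label
        current.append((parts[1], parts[2]))
    if current:
        sentences.append(current)
    return sentences

def bio_to_spans(labels):
    """Convert BIO labels into a set of (start, end_exclusive, type) spans."""
    spans, start, kind = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes last span
        inside = label.startswith("I-") and kind == label[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, kind = i, label[2:]
    return spans

def span_f1(gold_label_seqs, pred_label_seqs):
    """Micro-averaged F1 over exactly matching spans."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_label_seqs, pred_label_seqs):
        g, p = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

demo_lines = ["# sent_id=demo", "1\tvery\tB-Positive", "2\twarm\tI-Positive",
              "3\twelcome\tO", ""]
demo_sentences = read_conll(demo_lines)

gold = ["B-Positive", "I-Positive", "O", "O", "O", "O", "O", "B-Positive", "I-Positive"]
pred = ["B-Positive", "I-Positive", "O", "O", "O", "O", "O", "B-Positive", "O"]
print(span_f1([gold], [pred]))  # 0.5
```

In the example, the second predicted span covers only "very" instead of "very friendly", so it counts as both a missed gold span and a wrong prediction, which gives precision and recall of 0.5 each.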
