# Introduction
This is a sentiment classification task using Doc2Vec in Python. Sentiment analysis is about extracting a set of opinion representations from a text. Each representation should identify the opinion holder, target, content, and context. For this project, a data set containing movie reviews is used. It is assumed that most of the elements in an opinion representation are already known, so the goal of the project is to classify the sentiment of each review. Sentiment classification takes an opinionated text object as input and typically outputs a sentiment label. This label can either be a category such as positive, negative or neutral (polarity analysis), or it can characterize the feeling of the person who generated the text (emotion analysis).

### Method
The method consists of loading the data, pre-processing it into the desired structure, building and training the model, classifying sentiments, and experimenting with different settings.

### Data 
This data set contains movie reviews along with their binary sentiment polarity labels. It contains 50,000 reviews split evenly into a train set and a test set of 25,000 reviews each. Each set has 12,500 negative and 12,500 positive labels. A review is labeled as negative when it has a score of four or less out of 10, and as positive when it has a score of seven or higher out of 10. Reviews with scores in between, which are more neutral, are not included.

### Setup
In Python, gensim is used for the implementation of Doc2Vec. Before the data is fed into the model, it has to be pre-processed. The data is cleaned by converting everything to lower case and removing punctuation. The result is four documents: two train files (one containing positive and one containing negative movie reviews) and two test files (again one positive and one negative).

Each review should occupy a single line, with reviews separated by newlines. Precision, recall and F1-score are reported as evaluation metrics. Precision is defined as the proportion of assigned labels that are correct. Recall is defined as the proportion of actual positives that were identified correctly. The F1-score is the harmonic mean of precision and recall.
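As a minimal sketch of these three metrics, the following computes them for one class from two lists of binary labels. The function name `precision_recall_f1` and the toy labels are assumptions made for illustration; the project itself relies on sklearn's `classification_report`.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for one class; hypothetical helper."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never occurs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_true, toy_pred)
```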

# Implementation
Let's start by importing the modules that are required for this project. 

In [1]:
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
import gensim.models.doc2vec
assert gensim.models.doc2vec.FAST_VERSION > -1  # -1 means the slow pure-Python fallback is in use
from gensim.models.doc2vec import Doc2Vec
import smart_open
import numpy
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
import random
import os
import re
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [2]:
import tarfile
my_tar = tarfile.open('aclImdb_v1.tar.gz')
my_tar.extractall() # specify which folder to extract to
my_tar.close()

# 1. Data loading
In this step, the data is loaded: the positive and negative examples are read, cleaned, and stored as training data and test data.

In [3]:
# train positives
reviews_train_pos = []
files_train_pos = os.listdir('aclImdb/train/pos')
for i in files_train_pos:
    with open('aclImdb/train/pos/' + i, 'r', encoding = 'utf-8') as file: 
        data = file.read()
        out = re.sub(r'[^a-zA-Z0-9\s]', '', data).lower()
    reviews_train_pos.append(out)
    
# train negatives 
reviews_train_neg = []
files_train_neg = os.listdir('aclImdb/train/neg')
for i in files_train_neg:
    with open('aclImdb/train/neg/' + i, 'r', encoding = 'utf-8') as file: 
        data = file.read()
        out = re.sub(r'[^a-zA-Z0-9\s]', '', data).lower()
    reviews_train_neg.append(out)
    
# test positives
reviews_test_pos = []
files_test_pos = os.listdir('aclImdb/test/pos')
for i in files_test_pos:
    with open('aclImdb/test/pos/' + i, 'r', encoding = 'utf-8') as file: 
        data = file.read()
        out = re.sub(r'[^a-zA-Z0-9\s]', '', data).lower()
    reviews_test_pos.append(out)
    
# test negatives
reviews_test_neg = []
files_test_neg = os.listdir('aclImdb/test/neg')
for i in files_test_neg:
    with open('aclImdb/test/neg/' + i, 'r', encoding = 'utf-8') as file: 
        data = file.read()
        out = re.sub(r'[^a-zA-Z0-9\s]', '', data).lower()
    reviews_test_neg.append(out)

In [4]:
# test negatives
with open('test-neg.txt', 'w',encoding='utf-8') as f: 
    for i in reviews_test_neg:
        f.write(i+'\n') 
        
# test positives
with open('test-pos.txt', 'w',encoding='utf-8') as f: 
    for i in reviews_test_pos:
        f.write(i+'\n') 
        
# train negatives 
with open('train-neg.txt', 'w',encoding='utf-8') as f: 
    for i in reviews_train_neg:
        f.write(i+'\n') 
        
# train positives
with open('train-pos.txt', 'w',encoding='utf-8') as f: 
    for i in reviews_train_pos:
        f.write(i+'\n') 

This is what a positive movie review looks like:

In [5]:
reviews_train_pos[0]

'bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt'
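To see what the cleaning step does to a single string, the expression from the loading loops can be applied to a short made-up review. The sample text and the helper name `clean_review` are assumptions for illustration. The raw reviews contain HTML line breaks such as `<br />`, so removing the angle brackets and the slash leaves the letters `br` behind, and these fuse with a neighbouring word when no space separates them.

```python
import re

def clean_review(text):
    # Same expression as in the loading loops: drop every character that is
    # not a letter, digit or whitespace, then lower-case the result.
    return re.sub(r'[^a-zA-Z0-9\s]', '', text).lower()

sample = "It's BORING!<br /><br />Don't watch it."
cleaned = clean_review(sample)
print(cleaned)
```

This is a likely source of tokens such as 'boringbr' in the vocabulary. Replacing the tags with a space before the substitution would avoid them.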

# 2. Models

## 2.1 Baseline model

### 2.1.1 Doc2Vec
Doc2Vec learns a vector for every word and, in addition, a vector for every document (here, every review). It treats the sentence label as a special token that is trained together with the words of that sentence, so the vector of the label ends up representing the sentence as a whole.
So sentences have to be formatted into:

$$[['word1', 'word2', 'word3', 'lastword'], ['label1']]$$

LabeledSentence, a class from gensim.models.doc2vec, is a way to do that. It contains a list of words and a label for the sentence. However, a LabeledSentence holds only one sentence; it does not read files, let alone several files with different labels. That is why the LabeledLineSentence class is written. The constructor takes a dictionary that maps each file to read to the label prefix that sentences from that file should get. Doc2Vec can then either read the collection directly via the iterator, or the array can be accessed directly. The class also has a function that returns a permuted version of the array of LabeledSentences.
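The structure this produces can be sketched without gensim. The `LabeledExample` namedtuple is a hypothetical stand-in for LabeledSentence, and the two sample lines are made up. The point is only that each line becomes a list of words plus a one-element list holding a unique label built from the prefix and the line number.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's LabeledSentence: a word list and a tag list
LabeledExample = namedtuple('LabeledExample', ['words', 'tags'])

def label_lines(lines, prefix):
    """Yield one labeled example per line, tagged '<prefix>_<line number>'."""
    for item_no, line in enumerate(lines):
        yield LabeledExample(line.split(), ['%s_%s' % (prefix, item_no)])

sample_lines = ['a great movie', 'dull and tedious']
examples = list(label_lines(sample_lines, 'TRAIN_POS'))
print(examples[1])
```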

In [6]:
class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        
        flipped = {}
        
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')
    
    def __iter__(self):
        for source, prefix in self.sources.items():
            with smart_open.open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])
    
    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with smart_open.open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences
    
    def sentences_perm(self):
        shuffled = list(self.sentences)
        random.shuffle(shuffled)
        return shuffled

Doc2Vec requires a vocabulary table, so model.build\_vocab is used, which takes an array of LabeledSentence objects.

In [7]:
sources = {'test-neg.txt':'TEST_NEG', 'test-pos.txt':'TEST_POS', 'train-neg.txt':'TRAIN_NEG', 'train-pos.txt':'TRAIN_POS'}

sentences = LabeledLineSentence(sources)

In [None]:
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=7)

model.build_vocab(sentences.to_array())

A short description of the parameters:
* min_count: ignore all words with total frequency lower than this. It is set to 1 here so that no word is dropped. The original motivation was that the sentence labels only appear once, although recent gensim versions keep document labels separate from the word vocabulary.
* window: the maximum distance between the current and predicted word within a sentence, i.e. the size of the context window used during training.
* size: dimensionality of the output feature vectors. 100 is a common choice.
* sample: threshold for configuring which higher-frequency words are randomly downsampled.
* negative: the number of noise words drawn for negative sampling.
* workers: use this many worker threads to train the model.

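After building the vocabulary, the model has to be trained. The training loop runs 25 times. Note that each call to model.train itself makes model.iter passes over the data (gensim's default is 5 unless changed), so the total number of passes is larger than 25. In each iteration, the order of the sentences fed to the model is randomized, since presenting the examples in a different order each time tends to improve training. The function sentences\_perm from the LabeledLineSentence class does this.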

In [None]:
for epoch in range(25):
    print('iteration {0}'.format(epoch))
    model.train(sentences.sentences_perm(),
                total_examples=model.corpus_count,
                epochs=model.iter,)

In [None]:
model.save('./imdb.d2v')

In [8]:
model = Doc2Vec.load('./imdb.d2v')

In [9]:
model.most_similar('good')

[('decent', 0.7449836134910583),
 ('great', 0.737765908241272),
 ('nice', 0.7261615991592407),
 ('bad', 0.7239553332328796),
 ('fine', 0.6863929033279419),
 ('excellent', 0.6742612719535828),
 ('solid', 0.6639758348464966),
 ('terrific', 0.6231818795204163),
 ('fantastic', 0.6168004870414734),
 ('wonderful', 0.611182451248169)]

In [10]:
model.most_similar('boring')

[('dull', 0.7971225380897522),
 ('tedious', 0.7050924301147461),
 ('pointless', 0.6939346194267273),
 ('uninteresting', 0.6464412212371826),
 ('predictable', 0.621228814125061),
 ('unoriginal', 0.584164023399353),
 ('bored', 0.5438627004623413),
 ('annoying', 0.5417511463165283),
 ('stupid', 0.537611722946167),
 ('ridiculous', 0.5341532230377197)]

Now that the model has converted each sentence into a vector, these vectors are used to train a classifier. An array train\_arrays is created where the first 12,500 rows contain the positive review vectors and the second 12,500 rows contain the negative review vectors. There is another array, called train\_labels, of which the first 12,500 rows have the value one (corresponding to the positive reviews) and the other 12,500 rows have the value zero (corresponding to the negative reviews). The same is done for the test data.

In [11]:
train_arrays = numpy.zeros((25000, 100))
train_labels = numpy.zeros(25000, dtype=int)

for i in range(12500):
    prefix_train_pos = 'TRAIN_POS_' + str(i)
    prefix_train_neg = 'TRAIN_NEG_' + str(i)
    train_arrays[i] = model[prefix_train_pos]
    train_arrays[12500 + i] = model[prefix_train_neg]
    train_labels[i] = 1
    train_labels[12500 + i] = 0

In [12]:
test_arrays = numpy.zeros((25000, 100))
test_labels = numpy.zeros(25000, dtype=int)

for i in range(12500):
    prefix_test_pos = 'TEST_POS_' + str(i)
    prefix_test_neg = 'TEST_NEG_' + str(i)
    test_arrays[i] = model[prefix_test_pos]
    test_arrays[12500 + i] = model[prefix_test_neg]
    test_labels[i] = 1
    test_labels[12500 + i] = 0
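
The two loops above follow the same pattern, so they could be folded into one helper. The sketch below is an assumption about how that might look: `build_arrays` is a hypothetical name, and a plain dictionary of random vectors stands in for the trained model, which is indexed by label in the same way.

```python
import numpy

def build_arrays(vectors, pos_prefix, neg_prefix, n_per_class, dim):
    """Stack positive vectors first, then negative ones, with labels 1 and 0."""
    arrays = numpy.zeros((2 * n_per_class, dim))
    labels = numpy.zeros(2 * n_per_class, dtype=int)
    for i in range(n_per_class):
        arrays[i] = vectors['%s_%d' % (pos_prefix, i)]
        arrays[n_per_class + i] = vectors['%s_%d' % (neg_prefix, i)]
        labels[i] = 1  # negative rows keep the initial value 0
    return arrays, labels

# Toy stand-in for the model: 3 reviews per class, 4-dimensional vectors
rng = numpy.random.RandomState(0)
toy_vectors = {'%s_%d' % (p, i): rng.rand(4)
               for p in ('TRAIN_POS', 'TRAIN_NEG') for i in range(3)}
toy_arrays, toy_labels = build_arrays(toy_vectors, 'TRAIN_POS', 'TRAIN_NEG', 3, 4)
```

With the real model, a call such as `build_arrays(model, 'TRAIN_POS', 'TRAIN_NEG', 12500, 100)` would be expected to reproduce train\_arrays and train\_labels.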

Now, a logistic regression classifier is trained on the training data.

In [13]:
classifier = LogisticRegression()
classifier.fit(train_arrays, train_labels)

LogisticRegression()

In [14]:
# accuracy
classifier.score(test_arrays, test_labels)

0.872

In [15]:
predict = classifier.predict(test_arrays)
print(metrics.classification_report(test_labels, predict, digits = 5))

              precision    recall  f1-score   support

           0    0.87319   0.87040   0.87179     12500
           1    0.87081   0.87360   0.87220     12500

    accuracy                        0.87200     25000
   macro avg    0.87200   0.87200   0.87200     25000
weighted avg    0.87200   0.87200   0.87200     25000



In [16]:
out_dict = metrics.classification_report(test_labels, predict, digits = 5, output_dict = True)

The accuracy of the classifier is about 87% for sentiment analysis. This is quite good, given that only a logistic regression classifier and a very shallow neural network are used.

## 2.2 Experimental setting 1
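Experimental setting 1 is similar to the baseline setting, but the pre-processing of the reviews is changed. Everything is still converted to lower case and punctuation is removed, but the reviews are now also lemmatized. A lemma is the dictionary form of a word. Note that WordNetLemmatizer.lemmatize treats every word as a noun unless a part-of-speech tag is passed, so as used here it mainly reduces plural nouns to their singular form.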

###### Lemmatize the review files and store them

In [17]:
def get_lemmatized_text(corpus):
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

lemmatized_train_pos = get_lemmatized_text(reviews_train_pos)
lemmatized_train_neg = get_lemmatized_text(reviews_train_neg)
lemmatized_test_pos = get_lemmatized_text(reviews_test_pos)
lemmatized_test_neg = get_lemmatized_text(reviews_test_neg)

In [18]:
# test negatives
with open('lemmatized_test-neg.txt', 'w',encoding='utf-8') as f: 
    for i in lemmatized_test_neg:
        f.write(i+'\n') 
        
# test positives
with open('lemmatized_test-pos.txt', 'w',encoding='utf-8') as f: 
    for i in lemmatized_test_pos:
        f.write(i+'\n') 
        
# train negatives
with open('lemmatized_train-neg.txt', 'w',encoding='utf-8') as f: 
    for i in lemmatized_train_neg:
        f.write(i+'\n') 

# train positives
with open('lemmatized_train-pos.txt', 'w',encoding='utf-8') as f: 
    for i in lemmatized_train_pos:
        f.write(i+'\n') 

### 2.2.1 Doc2Vec

In [19]:
sources1 = {'lemmatized_test-neg.txt':'LEM_TEST_NEG', 'lemmatized_test-pos.txt':'LEM_TEST_POS',
           'lemmatized_train-neg.txt':'LEM_TRAIN_NEG', 'lemmatized_train-pos.txt':'LEM_TRAIN_POS'}

sentences1 = LabeledLineSentence(sources1)

In [None]:
model1 = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=7)

model1.build_vocab(sentences1.to_array())

In [None]:
for epoch in range(25):
    print('iteration {0}'.format(epoch))
    model1.train(sentences1.sentences_perm(),
                total_examples=model1.corpus_count,
                epochs=model1.iter,)

In [None]:
model1.save('./imdb.d2v1')

In [20]:
model1 = Doc2Vec.load('./imdb.d2v1')

In [21]:
model1.most_similar('good')

[('decent', 0.7839062213897705),
 ('great', 0.7348613142967224),
 ('bad', 0.6999446749687195),
 ('solid', 0.6808467507362366),
 ('nice', 0.6562058329582214),
 ('fine', 0.6537462472915649),
 ('poor', 0.6355423927307129),
 ('excellent', 0.6106948852539062),
 ('terrible', 0.6053005456924438),
 ('terrific', 0.6030822992324829)]

In [22]:
model1.most_similar('boring')

[('dull', 0.7662643194198608),
 ('pointless', 0.6455971002578735),
 ('tedious', 0.6319400668144226),
 ('predictable', 0.6287767887115479),
 ('unfunny', 0.5826073884963989),
 ('uninteresting', 0.5795999765396118),
 ('annoying', 0.5657222270965576),
 ('bored', 0.5572026968002319),
 ('stupid', 0.5556310415267944),
 ('lame', 0.5352811813354492)]

In [23]:
train_arrays1 = numpy.zeros((25000, 100))
train_labels1 = numpy.zeros(25000, dtype=int)

for i in range(12500):
    prefix_train_pos1 = 'LEM_TRAIN_POS_' + str(i)
    prefix_train_neg1 = 'LEM_TRAIN_NEG_' + str(i)
    train_arrays1[i] = model1[prefix_train_pos1]
    train_arrays1[12500 + i] = model1[prefix_train_neg1]
    train_labels1[i] = 1
    train_labels1[12500 + i] = 0

In [24]:
test_arrays1 = numpy.zeros((25000, 100))
test_labels1 = numpy.zeros(25000, dtype=int)

for i in range(12500):
    prefix_test_pos1 = 'LEM_TEST_POS_' + str(i)
    prefix_test_neg1 = 'LEM_TEST_NEG_' + str(i)
    test_arrays1[i] = model1[prefix_test_pos1]
    test_arrays1[12500 + i] = model1[prefix_test_neg1]
    test_labels1[i] = 1
    test_labels1[12500 + i] = 0

In [25]:
classifier1 = LogisticRegression()
classifier1.fit(train_arrays1, train_labels1)

LogisticRegression()

In [26]:
# accuracy
classifier1.score(test_arrays1, test_labels1)

0.87108

In [27]:
predict = classifier1.predict(test_arrays1)
print(metrics.classification_report(test_labels1, predict, digits = 5))

              precision    recall  f1-score   support

           0    0.87188   0.87000   0.87094     12500
           1    0.87028   0.87216   0.87122     12500

    accuracy                        0.87108     25000
   macro avg    0.87108   0.87108   0.87108     25000
weighted avg    0.87108   0.87108   0.87108     25000



In [28]:
out_dict1 = metrics.classification_report(test_labels1, predict, digits = 5, output_dict = True)

## 2.3 Experimental setting 2
For experimental setting 2, the same classifier is used but this classifier is tuned. So the classifier is still logistic regression like in the baseline setting, but now the penalty is changed from L2 regularization to L1 regularization. In addition, the parameter C (the inverse of the regularization strength) is set to 0.25 instead of 1.0. A smaller value for C specifies stronger regularization.
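
For reference, the two penalties can be written out. The following is one common formulation for binary logistic regression with labels $y_i \in \{-1, 1\}$. The symbols $w$ (weights), $b$ (intercept) and $x_i$ (the Doc2Vec vector of review $i$) are notation introduced here and do not appear in the code.

$$\text{L2:}\quad \min_{w, b}\ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (x_i^T w + b)\right)\right)$$

$$\text{L1:}\quad \min_{w, b}\ \|w\|_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (x_i^T w + b)\right)\right)$$

Because $C$ multiplies the data term, lowering it from 1.0 to 0.25 gives the penalty relatively more weight. Unlike the L2 penalty, the L1 penalty can drive some weights exactly to zero.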

###### Tune the classifier, Doc2Vec procedure is the same as in the baseline run

In [29]:
classifier2 = LogisticRegression(C=0.25, penalty = 'l1', solver = 'saga')
classifier2.fit(train_arrays, train_labels)

LogisticRegression(C=0.25, penalty='l1', solver='saga')

In [30]:
# accuracy
classifier2.score(test_arrays, test_labels)

0.8718

In [31]:
predict = classifier2.predict(test_arrays)
print(metrics.classification_report(test_labels, predict, digits = 5))

              precision    recall  f1-score   support

           0    0.87290   0.87032   0.87161     12500
           1    0.87070   0.87328   0.87199     12500

    accuracy                        0.87180     25000
   macro avg    0.87180   0.87180   0.87180     25000
weighted avg    0.87180   0.87180   0.87180     25000



In [32]:
out_dict2 = metrics.classification_report(test_labels, predict, digits = 5, output_dict = True)

## 2.4 Experimental setting 3
Experimental setting 3 changes the word embeddings model. In the baseline setting, the size parameter of the word embeddings model is set to 100, which means that the model represents each review as a 100-dimensional vector. In the third experimental setting, the dimensionality is increased to 300. Note that the classifier in this setting is also created with C = 0.5 rather than the default of 1.0, so two things differ from the baseline.

###### The size parameter in Doc2Vec is increased from 100 to 300

### 2.4.1 Doc2Vec

In [None]:
model2 = Doc2Vec(min_count=1, window=10, size=300, sample=1e-4, negative=5, workers=7)

model2.build_vocab(sentences.to_array())

In [None]:
for epoch in range(25):
    print('iteration {0}'.format(epoch))
    model2.train(sentences.sentences_perm(),
                total_examples=model2.corpus_count,
                epochs=model2.iter,)

In [None]:
model2.save('./imdb.d2v2')

In [33]:
model2 = Doc2Vec.load('./imdb.d2v2')

In [34]:
model2.most_similar('good')

[('great', 0.6714057922363281),
 ('bad', 0.6277437210083008),
 ('decent', 0.5501477122306824),
 ('for', 0.534475564956665),
 ('but', 0.5228431224822998),
 ('the', 0.5226880311965942),
 ('that', 0.5085980892181396),
 ('nice', 0.4987344741821289),
 ('horrible', 0.4985167384147644),
 ('a', 0.49822670221328735)]

In [35]:
model2.most_similar('boring')

[('dull', 0.5229418277740479),
 ('bored', 0.41893628239631653),
 ('tedious', 0.4082891047000885),
 ('it', 0.3790885806083679),
 ('stupid', 0.36691418290138245),
 ('uninteresting', 0.3656439483165741),
 ('predictable', 0.3643425703048706),
 ('boringbr', 0.3641272485256195),
 ('cheesy', 0.35982877016067505),
 ('pointless', 0.3583638072013855)]

In [36]:
train_arrays2 = numpy.zeros((25000, 300))
train_labels2 = numpy.zeros(25000, dtype=int)

for i in range(12500):
    prefix_train_pos2 = 'TRAIN_POS_' + str(i)
    prefix_train_neg2 = 'TRAIN_NEG_' + str(i)
    train_arrays2[i] = model2[prefix_train_pos2]
    train_arrays2[12500 + i] = model2[prefix_train_neg2]
    train_labels2[i] = 1
    train_labels2[12500 + i] = 0

In [37]:
test_arrays2 = numpy.zeros((25000, 300))
test_labels2 = numpy.zeros(25000, dtype=int)

for i in range(12500):
    prefix_test_pos2 = 'TEST_POS_' + str(i)
    prefix_test_neg2 = 'TEST_NEG_' + str(i)
    test_arrays2[i] = model2[prefix_test_pos2]
    test_arrays2[12500 + i] = model2[prefix_test_neg2]
    test_labels2[i] = 1
    test_labels2[12500 + i] = 0

In [38]:
classifier3 = LogisticRegression(C = 0.5)
classifier3.fit(train_arrays2, train_labels2)

LogisticRegression(C=0.5)

In [39]:
# accuracy
classifier3.score(test_arrays2, test_labels2)

0.87396

In [40]:
predict = classifier3.predict(test_arrays2)
print(metrics.classification_report(test_labels2, predict, digits = 5))

              precision    recall  f1-score   support

           0    0.87616   0.87104   0.87359     12500
           1    0.87179   0.87688   0.87433     12500

    accuracy                        0.87396     25000
   macro avg    0.87397   0.87396   0.87396     25000
weighted avg    0.87397   0.87396   0.87396     25000



In [41]:
out_dict3 = metrics.classification_report(test_labels2, predict, digits = 5, output_dict = True)

# 3. Results
When the models are inspected, they appear to have captured the meaning of 'good' reasonably well, since the most similar words include 'decent' and 'great'. The same holds for 'boring', whose most similar words include 'dull' and 'tedious'. Table 1 shows the three most similar words to 'good' and 'boring' for the four settings. What stands out is that the third most similar word to 'good' in experimental setting 1 is 'bad', and in experimental setting 3 it is even the second most similar word. This looks odd at first, but antonyms such as 'good' and 'bad' tend to occur in very similar contexts, and that is what the embeddings capture. Experimental setting 2 shows the same words as the baseline, which is expected since it uses the same Doc2Vec model with a tuned classifier.
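
The scores reported by most\_similar are cosine similarities between word vectors. A minimal sketch with made-up two-dimensional vectors shows how such a ranking comes about. The numbers and the helper names `cosine` and `rank_similar` are assumptions for illustration, not values or functions from the trained models.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_similar(word, vectors, topn=3):
    """Return the topn other words ordered by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: 'bad' points in nearly the same direction as 'good'
toy_vectors = {
    'good': [1.0, 0.2],
    'great': [0.9, 0.3],
    'bad': [0.8, -0.4],
    'train': [-0.1, 1.0],
}
ranking = rank_similar('good', toy_vectors)
```

In this toy setup 'bad' ranks above an unrelated word simply because its vector points in a similar direction, which mirrors the effect seen in Table 1.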

In [42]:
def top_three(m, word):
    # Join the three most similar words into one comma-separated string
    return ', '.join(w for w, _ in m.most_similar(word)[:3])

model_three_good = top_three(model, 'good')
model_three_boring = top_three(model, 'boring')

model1_three_good = top_three(model1, 'good')
model1_three_boring = top_three(model1, 'boring')

model2_three_good = top_three(model2, 'good')
model2_three_boring = top_three(model2, 'boring')


conclus = {"Similar to 'good'":[model_three_good,
                                model1_three_good,
                                model_three_good,
                                model2_three_good],
          "Similar to 'boring'":[model_three_boring,
                                 model1_three_boring,
                                 model_three_boring,
                                 model2_three_boring]}

conclus_df = pd.DataFrame(conclus, index=['Baseline setting',"Experiment 1",'Experiment 2', 'Experiment 3'])

print("\nTable 1: An overview of the three most similar words to 'boring' and 'good' for the different settings.")
display(conclus_df)


Table 1: An overview of the three most similar words to 'boring' and 'good' for the different settings.


| | Similar to 'good' | Similar to 'boring' |
|---|---|---|
| Baseline setting | decent, great, nice | dull, tedious, pointless |
| Experiment 1 | decent, great, bad | dull, pointless, tedious |
| Experiment 2 | decent, great, nice | dull, tedious, pointless |
| Experiment 3 | great, bad, decent | dull, bored, tedious |


In [43]:
conclus1 = {"Precision":[out_dict['weighted avg']['precision'],
                         out_dict1['weighted avg']['precision'],
                         out_dict2['weighted avg']['precision'],
                         out_dict3['weighted avg']['precision']],
            "Recall":[out_dict['weighted avg']['recall'],
                      out_dict1['weighted avg']['recall'],
                      out_dict2['weighted avg']['recall'],
                      out_dict3['weighted avg']['recall']],
            "F1-score":[out_dict['weighted avg']['f1-score'],
                        out_dict1['weighted avg']['f1-score'],
                        out_dict2['weighted avg']['f1-score'],
                        out_dict3['weighted avg']['f1-score']]}

conclus_df1 = pd.DataFrame(conclus1, index=['Baseline setting',"Experiment 1",'Experiment 2', 'Experiment 3'])

print("\nTable 2: The evaluation metrics of the baseline setting and the three experimental settings.")
display(conclus_df1)


Table 2: The evaluation metrics of the baseline setting and the three experimental settings.


| | Precision | Recall | F1-score |
|---|---|---|---|
| Baseline setting | 0.872004 | 0.872 | 0.872 |
| Experiment 1 | 0.871082 | 0.87108 | 0.87108 |
| Experiment 2 | 0.871803 | 0.8718 | 0.8718 |
| Experiment 3 | 0.873973 | 0.87396 | 0.873959 |


# Conclusion
The goal of this project was to perform a sentiment classification task with a Doc2Vec feature representation. Looking at Table 2, it can be concluded that experimental setting 3 has the best precision, recall and F1-score. This setting uses a word embeddings model with 300-dimensional feature vectors instead of 100 (as in the baseline setting). However, the evaluation metrics of all settings are very close to each other, differing by less than half a percentage point, so the three settings chosen may not be the most effective way to improve the sentiment classification. In chapter 18 of the textbook 'Text Data Management and Analysis. A Practical Introduction to Information Retrieval and Text Mining', it is stated that, among other features, character n-grams and word n-grams are important in sentiment classification. The experimental settings used in this project do not change these features. A change in these features might produce better results.
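
As a pointer for such follow-up work, word and character n-grams can be extracted with a few lines of plain Python. The helper names and the sample sentence are assumptions for illustration; a real experiment would more likely use a library vectorizer.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words as space-joined strings."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Return all runs of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sample = 'not a good movie'
bigrams = word_ngrams(sample, 2)
trigrams_of_good = char_ngrams('good', 3)
```

Word bigrams keep some local word order that single words lose, while character n-grams are more tolerant of spelling variation.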