# Sentiment Analysis of student evaluation of teaching

This repository contains code for dataset creation and sentiment polarity detection on free-text answers from student surveys in University of Barcelona bachelor's programs.

The goals of this project are: 

1) to create a supervised dataset for sentiment analysis and polarity detection of student opinions in two languages (Catalan and Spanish);

2) to validate the dataset empirically and propose competitive baselines by investigating, implementing and comparing sentiment analysis algorithms and methods to automatically classify student comments as positive, negative or neutral.

The scripts, notebooks and data published in this repository are listed below.

## Dataset creation

This repository includes scripts for dataset creation and anonymization, covering:

* *pdf* scraping
* MongoDB creation
* sentence segmentation
* language detection
* translation to English
* anonymization

The original dataset is not public; the files in the *data* folder are already pre-processed and anonymized.

## Sentiment lexicons

The sentiment lexicons were downloaded from [Kaggle](https://www.kaggle.com/rtatman/sentiment-lexicons-for-81-languages/data).

They are used as an unsupervised method: the sentiment of a sentence is predicted by counting how many positive and negative words it contains.
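
The counting idea can be sketched as follows. This is a minimal illustration, not the repository's implementation: the tiny word sets, the function name `lexicon_polarity` and the rule that a tie is labelled neutral are all assumptions made for the example.

```python
import re

# Hypothetical mini-lexicons for illustration only;
# the real word lists come from the downloaded lexicon files.
POSITIVE_WORDS = {"bueno", "excelente", "claro"}
NEGATIVE_WORDS = {"malo", "aburrido", "confuso"}


def lexicon_polarity(sentence, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Label a sentence by comparing its counts of positive and negative words."""
    tokens = re.findall(r"\w+", sentence.lower())
    n_pos = sum(token in positive for token in tokens)
    n_neg = sum(token in negative for token in tokens)
    if n_pos > n_neg:
        return "positive"
    if n_neg > n_pos:
        return "negative"
    return "neutral"  # assumption: equal counts (including 0 and 0) mean neutral


print(lexicon_polarity("El profesor es excelente y muy claro"))  # positive
print(lexicon_polarity("El temario es confuso y aburrido"))      # negative
```

Exact string matching misses inflected forms (for example "aburridas"), so a sketch like this would likely benefit from lemmatized input.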


## Feature vectors

The following types of feature vectors are used in this work. They are created from:
* tokenized sentences,
* tokenized comments,
* tokenized English translations,
* lemmatized sentences,
* POS tags.

All of these vectors are tested with and without TF-IDF weighting.

### bag-of-words

All training texts are tokenized to create a vocabulary. A *bag-of-words* feature vector has the length of the vocabulary.

* **BTO (binary term occurrences)** - for each sentence, every vocabulary word gets 1 if it appears in that sentence and 0 if it does not.

* **term frequencies** - similar to BTO, but instead of 1 and 0 each entry holds the number of times the word appears in the sentence.
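
Both variants can be sketched in plain Python as shown below. The helper names and the toy sentences are hypothetical and chosen only for illustration; the notebooks may build these vectors differently (for instance with *sklearn* vectorizers).

```python
from collections import Counter


def build_vocabulary(tokenized_texts):
    """Return the sorted list of distinct tokens seen in the training texts."""
    return sorted({token for tokens in tokenized_texts for token in tokens})


def bto_vector(tokens, vocabulary):
    """Binary term occurrences: 1 if the word is in the sentence, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def term_frequency_vector(tokens, vocabulary):
    """Term frequencies: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Toy training set (already tokenized); not taken from the real data.
train = [["muy", "buen", "profesor"], ["clases", "muy", "muy", "largas"]]
vocab = build_vocabulary(train)

print(vocab)                                   # ['buen', 'clases', 'largas', 'muy', 'profesor']
print(bto_vector(train[1], vocab))             # [0, 1, 1, 1, 0]
print(term_frequency_vector(train[1], vocab))  # [0, 1, 1, 2, 0]
```

Words that are not in the training vocabulary are simply ignored when a new sentence is vectorized.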

### *n*-grams

Creating *n*-gram features is similar to creating bag-of-words features, except that instead of single words we take sequences of *n* consecutive words.

To create the vocabulary for *n*-grams, we collect all sequences of *n* adjacent words that occur in the training texts.

The length of each feature vector is then the size of the *n*-gram vocabulary.
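
A minimal sketch of *n*-gram extraction and vectorization, assuming that *n*-grams are contiguous token sequences; the function names and the example sentence are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_vocabulary(tokenized_texts, n):
    """Return the sorted list of distinct n-grams seen in the training texts."""
    return sorted({gram for tokens in tokenized_texts for gram in ngrams(tokens, n)})


def ngram_count_vector(tokens, n, vocabulary):
    """Count how often each vocabulary n-gram occurs in the sentence."""
    counts = Counter(ngrams(tokens, n))
    return [counts[gram] for gram in vocabulary]


sentence = ["el", "profesor", "explica", "bien"]
bigrams = ngrams(sentence, 2)
print(bigrams)
# [('el', 'profesor'), ('profesor', 'explica'), ('explica', 'bien')]

bigram_vocab = build_ngram_vocabulary([sentence], 2)
print(ngram_count_vector(["el", "profesor", "explica"], 2, bigram_vocab))
```

With `n = 1` this reduces to the bag-of-words case, and a sentence shorter than `n` tokens yields no *n*-grams.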

### *word2vec* embeddings

*word2vec* is a neural network model that assigns each word a vector in an n-dimensional vector space, such that words that appear in similar contexts are located close to each other in that space.

To train these embeddings, we download all texts from the Spanish and Catalan *Wikipedia*.

The word embeddings are pre-trained with *Gensim* *word2vec* and have 100 dimensions.

### TF-IDF weights

This weight can be calculated for each word, *n*-gram, lemma and so on.

The TF term normalizes for variation in sentence length (in longer sentences it is more likely that a given term appears more times).

$$
\text{TF}(t) = \frac{\text{number of times term } t \text{ appears in the sentence}}{\text{total number of terms in the sentence}}
$$

The IDF term decreases the weights of terms that appear very often across the whole training corpus (often stop words such as "la" and "y").

$$
\text{IDF}(t) = \log_{e}\Big(\frac{\text{total number of sentences}}{\text{number of sentences that contain this term}}\Big)
$$

TF-IDF combines both of these weights.

$$
\text{TF-IDF}(t) = \text{TF}(t) \times \text{IDF}(t)
$$
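
The three formulas above can be checked on a toy corpus with a short sketch. The function names and the example sentences are hypothetical, and `idf` assumes the term occurs in at least one sentence. Library implementations may differ: *sklearn*'s `TfidfVectorizer`, for example, uses a smoothed IDF by default, so its values will not match this unsmoothed version exactly.

```python
import math


def tf(term, tokens):
    """Occurrences of the term divided by the number of terms in the sentence."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Natural log of (number of sentences / sentences containing the term).

    Assumes the term appears in at least one sentence of the corpus.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    """Product of the TF and IDF weights."""
    return tf(term, tokens) * idf(term, corpus)


# Toy corpus of tokenized sentences; not taken from the real data.
corpus = [
    ["la", "clase", "es", "buena"],
    ["la", "profesora", "explica", "bien"],
    ["la", "clase", "es", "larga"],
]

print(tf_idf("la", corpus[0], corpus))     # 0.0, "la" occurs in every sentence
print(tf_idf("buena", corpus[0], corpus))  # (1/4) * ln(3)
```

The stop word "la" receives a weight of zero because it appears in every sentence, while the rarer "buena" keeps a positive weight.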

## Used ML models

All models are implemented with *sklearn*.

* support vector machines
* multinomial Naive Bayes
* logistic regression
* simple feed forward neural network

## Notebooks (feature vector creation and model testing)

[BOW and n-gram features on original Catalan and Spanish comments and on English translations](original_texts_BOW_&_ngram_features.ipynb)

[BOW features on lemmatized comments and POS tags](lemmatized_texts_&_POS_BOW_features.ipynb)

[n-gram features on lemmatized comments and POS tags](lemmatized_texts_&_POS_ngram_features.ipynb)

[word2vec features](word2vec_features.ipynb)

[concatenated features](concatenated_features.ipynb)

[simple feed forward neural network](feed_forward_neural_network.ipynb)