# Example solution for tweet sentiment analysis

This is a baseline example to help you with the third challenge. It was originally developed by our Ph.D. student Jonas Wacker.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")

# general NLP preprocessing and basic tools
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# train/test split
from sklearn.model_selection import train_test_split
# basic machine learning models
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# our evaluation metric for sentiment classification
from sklearn.metrics import fbeta_score

#from sklearn.model_selection import GridSearchCV
#from sklearn.pipeline import Pipeline
#from sklearn.metrics import f1_score

In [2]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Loading the data

In [3]:
train_df = pd.read_csv('input/eurecom-aml-2023-challenge-3/train.csv')
test_df = pd.read_csv('input/eurecom-aml-2023-challenge-3/test.csv')

## Quick data inspection

In [4]:
len(train_df)+len(test_df)

27480

In [5]:
train_df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,28ac06f416,good luck with your auction,good luck with your auction,positive
1,92098cf9a7,Hmm..You can`t judge a book by looking at its ...,Hmm..You can`t judge a book by looking at its ...,neutral
2,7858ff28f2,"Hello, yourself. Enjoy London. Watch out for ...",They`re mental.,negative
3,b0c9c67f32,We can`t even call you from belgium sucks,m suck,negative
4,7b36e9e7a5,not so good mood..,not so good mood..,negative


In [6]:
test_df.head()

Unnamed: 0,textID,text,selected_text
0,102f98e5e2,Happy Mother`s Day hahaha,Happy Mother`s Day
1,033b399113,"Sorry for the triple twitter post, was having ...","Sorry for the triple twitter post, was having ..."
2,c125e29be2,thats much better than the flu syndrome!,thats much better
3,b91e2b0679,Aww I have a tummy ache,tummy ache
4,1a46141274,hey chocolate chips is good. i want a snack ...,good.


## Data pre-processing

In [7]:
# we create a validation dataset from the training data
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=0)

We start by converting the labels to numbers. This is a requirement for the submission, and numerical values are generally more compatible with machine learning libraries.

In [8]:
target_conversion = {
    'neutral': 0,
    'positive': 1,
    'negative': -1
}

In [9]:
train_df['target'] = train_df['sentiment'].map(target_conversion)
val_df['target'] = val_df['sentiment'].map(target_conversion)

Now we need to find a numerical representation for our input data. Extracting features from text is one of the major building blocks of any Natural Language Processing (NLP) pipeline.

There have been huge developments in the field during the last decade. A very traditional approach is to extract Bag-of-Words features. See here for an explanation:

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

We will stick to this technique for the purpose of this example notebook. However, be aware that much more powerful feature extraction techniques exist. The most recent ones use neural network based language models. See e.g.:

https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/

In [10]:
nltk.download('punkt')  # Download the tokenizer resource
nltk.download('stopwords') # Download the stopwords resource

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\laura\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\laura\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
# Define tokenizer function: tokenize with NLTK, then stem and lowercase each token
stemmer = PorterStemmer()  # created once and reused across calls

def custom_tokenizer(text):
    tokens = word_tokenize(text)
    # Optional stop word removal (requires stop_words = set(stopwords.words('english'))):
    #tokens = [token for token in tokens if token.lower() not in stop_words]
    tokens = [stemmer.stem(token).lower() for token in tokens]
    return tokens

In [12]:
# Specifies the n-gram range we want to extract
ngram_range = (1, 3)

count_vect = CountVectorizer(
    tokenizer=custom_tokenizer,
    #stop_words=list(stopwords.words('english')),
    #ngram_range=ngram_range,
)

In [13]:
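# Note: this replaces the customized vectorizer defined in the previous cell,
# so the features below are extracted with the default CountVectorizer settings.
count_vect = CountVectorizer()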

In [14]:
# here we are obtaining the vocabulary from the training data minus validation data
# you may want to change this to the full training data for the final submission
X_train_counts = count_vect.fit_transform(list(train_df['text'].values))
X_val_counts = count_vect.transform(list(val_df['text'].values))
X_test_counts = count_vect.transform(list(test_df['text'].values))

In [15]:
print('Train feature shape:', X_train_counts.shape)
print('Validation feature shape:', X_val_counts.shape)
print('Test feature shape:', X_test_counts.shape)

Train feature shape: (22258, 23239)
Validation feature shape: (2474, 23239)
Test feature shape: (2748, 23239)


The Bag-of-Words representation assigns a unique ID to each word that appears in the training data. In total, 23239 unique words have been extracted. Each input data point (tweet) is then represented by a vector of the size of the vocabulary. Each of its elements is the count of the respective word in the tweet.

Therefore, the features have a huge dimension! Storing the feature matrix in dense form would require (n_datapoints x vocabulary size) * 32 bits ≈ 2 GB of CPU/GPU RAM! Imagine we were analyzing not tweets (limited vocabulary) but Wikipedia, or a larger corpus of documents. Then we could not store the features!

Instead, the Bag-of-Words features are usually stored using a sparse representation. Think of it as a dictionary of (ID, count) pairs assigned to each tweet.
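The memory estimate above can be checked with a few lines of arithmetic, and a toy matrix shows what a sparse format actually stores. This is only a sketch: the shape is the training feature shape printed earlier, the 32-bit entry size is the assumption made in the text, and the small `dense` matrix is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Training feature shape as printed earlier in this notebook
n_rows, n_cols = 22258, 23239

# 32-bit (4-byte) entries, as assumed in the text above
dense_bytes = n_rows * n_cols * 4
print(f"Dense storage: {dense_bytes / 1e9:.2f} GB")

# Toy matrix: 3 tweets, 6 vocabulary entries, mostly zeros
dense = np.array([
    [0, 0, 1, 0, 0, 2],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
])
sparse = csr_matrix(dense)

# CSR keeps only the non-zero values plus their positions
print(sparse.data)     # the non-zero counts
print(sparse.indices)  # column (word ID) of each stored count
print(sparse.indptr)   # offsets marking where each row starts in the arrays above
```

The Compressed Sparse Row (CSR) format used here is the same one scikit-learn returns from `fit_transform`.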

In [16]:
# Now we quickly analyze the matrix of word counts:
# Only 255005 of the 22258x23239 values (about 0.0493%) are non-zero.
# The sparse encoding only needs to store these.
X_train_counts

<22258x23239 sparse matrix of type '<class 'numpy.int64'>'
	with 255005 stored elements in Compressed Sparse Row format>

In [17]:
# Yet, we can ask to convert a part of the matrix into the traditional dense format.
# It's quite challenging to find any non-zeros here!
X_train_counts[:10,:10].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [18]:
# The other way around is easier. We can ask to find the ID (index) of a specific word.
count_vect.vocabulary_.get('sleep')

18618

In [19]:
# So the first tweet should have a one at this position:
print('Tweet:\n', train_df.iloc[0]['text'])
print('Number of times the word "sleep" appeared:\n', X_train_counts[0, 18618])

Tweet:
 had a horrible sleep + in a rather bad mood
Number of times the word "sleep" appeared:
 1


## Training a simple classifier

We are training a naive Bayes classifier on the Bag-of-Words features of the training data:

https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

It is already built into the sklearn library:

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

Keep in mind that not only storing the features is challenging but also processing them. A simple SVM may be quite slow on such high-dimensional features. Naive Bayes works well with Bag-of-Words.
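Before training on the real features, the following sketch shows a multinomial naive Bayes classifier on a made-up four-document corpus. The texts and variable names are invented for illustration; the labels follow the encoding used in this notebook (1 for positive, -1 for negative).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data; labels use the notebook's encoding (1 positive, -1 negative)
toy_texts = ["good great day", "great luck", "bad horrible mood", "horrible day"]
toy_labels = np.array([1, 1, -1, -1])

toy_vect = CountVectorizer()
X_toy = toy_vect.fit_transform(toy_texts)

# alpha=1.0 is Laplace smoothing (the scikit-learn default)
toy_clf = MultinomialNB(alpha=1.0)
toy_clf.fit(X_toy, toy_labels)

# The model stores one log-probability per (class, word) pair
print(toy_clf.classes_)
print(toy_clf.feature_log_prob_.shape)

# Words not seen during fitting ("very") are dropped by the vectorizer
toy_predictions = toy_clf.predict(toy_vect.transform(["very great luck", "horrible bad"]))
print(toy_predictions)
```

Because the model only needs per-class word counts, fitting scales well to sparse, high-dimensional inputs.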



In [20]:
# Cross validation of hyperparameters

# pipeline = Pipeline([
#     ('vectorizer', CountVectorizer(tokenizer=custom_tokenizer)),
#     ('classifier', MultinomialNB())
# ])

# # Defines the parameter grid to search over
# param_grid = {
#     'vectorizer__ngram_range': [(1, 1), 
#                                 (1, 2), 
#                                 (1, 3),
#                                 (2, 2),
#                                 (2, 3),
#                                 (3, 3)]  # Example ngram_range values
# }

# grid_search = GridSearchCV(pipeline, param_grid, scoring='f1_macro', cv=10)

# grid_search.fit(train_df['text'], train_df['target'])

# print("Best ngram_range:", grid_search.best_params_['vectorizer__ngram_range'])
# print("Best F1 score:", grid_search.best_score_)

## Trying stuff

In [21]:
%pip install torch
%pip install torchtext
%pip install transformers
%pip install datasets
%pip install tensorflow

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


Note: you may need to restart the kernel to use updated packages.


In [22]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
import math
import transformers
from transformers import pipeline
from transformers import AutoConfig, AutoModel
from transformers import DataCollatorWithPadding
from datasets import load_metric
import evaluate

In [23]:
pretrained_model = pipeline("sentiment-analysis", model="finiteautomata/bertweet-base-sentiment-analysis")
output = pretrained_model(list(train_df['text'].values))

emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0
Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [32]:
# Average confidence of the pretrained model in its predicted labels (this is not an accuracy)
total_score = sum(result['score'] for result in output)
average_score = total_score / len(output)
print(f"Average Prediction Score: {average_score}")

Average Prediction Score: 0.8591309786378475


In [34]:
outputs = pd.DataFrame(output)

In [35]:
def compute_metrics(eval_pred):
    # Metrics from the Hugging Face datasets library
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    # Three sentiment classes, so use the macro average (the default is binary)
    f1 = load_f1.compute(predictions=predictions, references=labels, average="macro")["f1"]
    return {"accuracy": accuracy, "f1": f1}

## End of trying stuff

In [None]:
%%time
clf = MultinomialNB().fit(X_train_counts, train_df['target'])

In [None]:
val_predictions_nb = clf.predict(X_val_counts)

In [None]:
accuracy = (val_predictions_nb == val_df['target'].values).mean()
print('The accuracy of our multinomial Naive Bayes classifier is: {:.2f}%'.format(accuracy*100))

In [None]:
fbeta = fbeta_score(val_df['target'].values, val_predictions_nb, average='macro', beta=1.0)
print('The fbeta score is:', fbeta)

In [None]:
# Creating a submission

X_train_counts = count_vect.fit_transform(list(train_df['text'].values) + list(val_df['text'].values))
X_test_counts = count_vect.transform(list(test_df['text'].values))

clf = MultinomialNB().fit(X_train_counts, np.hstack([train_df['target'].values, val_df['target'].values]))
test_predictions_nb = clf.predict(X_test_counts)

submission_df = pd.DataFrame()
submission_df['textID'] = test_df['textID']
submission_df['sentiment'] = test_predictions_nb
submission_df.to_csv('TA_baseline_NB.csv', index=False)

## How good is this score?

Early approaches in NLP used rule-based classifiers for sentiment analysis. A popular baseline is VADER, which was published in 2014:

https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/viewPaper/8109

VADER does not use any machine learning but is purely handcrafted by humans. It uses text preprocessing and lexica to determine the sentiment of a text.

In [None]:
nltk.download('vader_lexicon')

In [None]:
sid = SentimentIntensityAnalyzer()

In [None]:
# We show a few prediction examples:
for doc in val_df['text'].iloc[:5].values:
    print(doc)
    print(sid.polarity_scores(doc))

In [None]:
def vader_predict(x):
    prediction = sid.polarity_scores(x)
    prediction_list = [
        (1, prediction['pos']),
        (-1, prediction['neg']),
        (0, prediction['neu'])
    ]
    label = sorted(prediction_list, key=lambda x: x[1], reverse=True)[0][0]
    return label
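The function above picks the largest of the `pos`, `neg` and `neu` scores. A common alternative with VADER is to threshold its `compound` score instead; a threshold of 0.05 is the value usually suggested for this. The sketch below is a hypothetical variant, not part of the baseline: the helper name `label_from_compound` is made up, and the score dictionaries are hand-written stand-ins for real VADER output. Whether it scores better on this data has not been checked here.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER score dict to -1 / 0 / 1 using its 'compound' value.

    `scores` is expected to look like the output of
    SentimentIntensityAnalyzer.polarity_scores. The helper name and the
    default threshold are illustrative choices, not part of the baseline.
    """
    compound = scores['compound']
    if compound >= threshold:
        return 1
    if compound <= -threshold:
        return -1
    return 0


# Hand-written score dicts standing in for real VADER output
examples = [
    {'neg': 0.0, 'neu': 0.6, 'pos': 0.4, 'compound': 0.62},
    {'neg': 0.3, 'neu': 0.7, 'pos': 0.0, 'compound': -0.54},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]
labels = [label_from_compound(s) for s in examples]
print(labels)
```

With the analyzer from this notebook it could be applied as `val_df['text'].apply(lambda t: label_from_compound(sid.polarity_scores(t)))`.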

In [None]:
predictions_vader = val_df['text'].apply(vader_predict)

In [None]:
accuracy = (predictions_vader == val_df['target'].values).mean()
print('The accuracy of VADER is: {:.2f}%'.format(accuracy*100))

In [None]:
fbeta = fbeta_score(val_df['target'].values, predictions_vader, average='macro', beta=1.0)
print('The fbeta score is:', fbeta)

VADER performs worse! That is a good sign that our classifier learned useful generalizations from the training data (better than standard handcrafted rules).

## Where to go from here?

We can improve our Machine Learning pipeline in several respects:

### Data analysis:
How is the data distributed? Can we analyze our data to find patterns associated with the classes? Which kinds of words are useful, which aren't?
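A first look at the class distribution could be sketched as follows. To keep the example self-contained it uses only the five training rows shown by `train_df.head()` earlier, so the numbers say nothing about the full dataset; `sample_df` and `n_chars` are names made up for this illustration.

```python
import pandas as pd

# The five training rows shown by train_df.head(), reduced to two columns
sample_df = pd.DataFrame({
    'text': [
        'good luck with your auction',
        'Hmm..You can`t judge a book by looking at its ...',
        'Hello, yourself. Enjoy London. Watch out for ...',
        'We can`t even call you from belgium sucks',
        'not so good mood..',
    ],
    'sentiment': ['positive', 'neutral', 'negative', 'negative', 'negative'],
})

# Absolute and relative class frequencies
class_counts = sample_df['sentiment'].value_counts()
class_shares = sample_df['sentiment'].value_counts(normalize=True)
print(class_counts)
print(class_shares)

# Tweet length in characters per class, a simple first feature to inspect
sample_df['n_chars'] = sample_df['text'].str.len()
print(sample_df.groupby('sentiment')['n_chars'].mean())
```

Running the same calls on `train_df` gives the actual distribution.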

### Feature extraction:
Can we make our Bag-of-Words representation more compact or richer? There are many things you could try to implement. Here are some buzzwords: tokenization, stop word removal, lemmatization, n-gram extraction, ...
A useful Python library to address these issues is NLTK (https://www.nltk.org/).
The sklearn CountVectorizer we used can be combined with NLTK preprocessing: https://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes
Is there also a dense (as opposed to sparse) representation of documents (tweets in our case)? Buzzwords: word2vec, GloVe
The state of the art is neural network language models, so-called Transformers. There are pretrained models available. If you feel comfortable with neural networks, fine-tuning and GPUs, have a look here: https://huggingface.co/transformers/

In general, we also recommend spaCy as a convenient Python library that covers most of the above features at once and may be a great resource to start with: https://spacy.io/
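As a small illustration of two of the buzzwords above, the sketch below compares unigram features, unigram plus bigram features, and stop word removal on two made-up tweets. It uses only `CountVectorizer` options that exist in scikit-learn; the toy texts and variable names are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_tweets = ["not so good mood", "so good"]

# Unigrams only: both tweets share the feature "good"
unigram_vect = CountVectorizer(ngram_range=(1, 1))
unigram_vect.fit(toy_tweets)
print(sorted(unigram_vect.vocabulary_))

# Unigrams and bigrams: word order is partly preserved ("not so", "so good", ...)
bigram_vect = CountVectorizer(ngram_range=(1, 2))
bigram_vect.fit(toy_tweets)
print(sorted(bigram_vect.vocabulary_))

# Built-in English stop word list: more compact, but it also drops "not",
# which may matter for sentiment
stop_vect = CountVectorizer(stop_words='english')
stop_vect.fit(toy_tweets)
print(sorted(stop_vect.vocabulary_))
```

Richer features enlarge the vocabulary and stop word removal shrinks it, so it is worth validating each choice rather than assuming it helps.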

### Model selection:
The model of choice highly depends on the previously extracted features. Depending on whether you obtain a sparse or dense feature representation, you have to choose an appropriate model!

### Model evaluation:
Make sure to select potential model hyperparameters using cross-validation or similar. Our evaluation metric of choice is the F1-score:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html#sklearn.metrics.fbeta_score

We choose beta=1 and average=macro.
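The sketch below first shows that `fbeta_score` with `beta=1` and `average='macro'` is the plain mean of the per-class F1-scores, and then estimates the same metric with cross-validation on a pipeline. The labels, texts and the name `nb_pipeline` are made up for illustration, and `cv=2` is chosen only because the toy dataset is tiny.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, fbeta_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up labels and predictions with the notebook's encoding (-1, 0, 1)
y_true = np.array([1, 1, 0, 0, -1, -1])
y_pred = np.array([1, 0, 0, 0, -1, 1])

# Macro averaging: compute F1 per class, then take the unweighted mean
macro_f1 = fbeta_score(y_true, y_pred, average='macro', beta=1.0)
per_class_f1 = f1_score(y_true, y_pred, average=None)
print(macro_f1, per_class_f1.mean())

# Cross-validated macro F1 for a vectorizer + classifier pipeline on toy texts
toy_texts = ["good day", "great luck", "so happy", "love it",
             "bad mood", "horrible sleep", "so sad", "hate it"]
toy_labels = np.array([1, 1, 1, 1, -1, -1, -1, -1])

nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB()),
])
scores = cross_val_score(nb_pipeline, toy_texts, toy_labels, scoring='f1_macro', cv=2)
print(scores)
```

Putting the vectorizer inside the pipeline means the vocabulary is refit on each training fold, so no information from the held-out fold leaks into the features.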

### Extension idea 1:
Apart from classifying the sentiment of tweets, we can also try to determine which words led the classifier to its decision. Ground-truth labels for these words are contained in our training data (column selected_text). This evaluation will not take place on the Kaggle platform; you need to do it yourself. Use the Jaccard coefficient to evaluate the overlap between the selected words and the ground truth:

https://scikit-learn.org/stable/modules/model_evaluation.html#jaccard-similarity-coefficient-score

In [None]:
# selected_text contains the words from text that led to the label stored in sentiment
train_df[['text', 'selected_text', 'sentiment']].iloc[:5]
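For comparing a predicted word selection with `selected_text`, a word-level Jaccard coefficient can be written in a few lines of plain Python. This is a minimal sketch: the helper name `jaccard`, the lowercasing, the whitespace tokenization and the handling of two empty strings are all assumptions made here, not a prescribed evaluation protocol.

```python
def jaccard(selected, ground_truth):
    """Word-level Jaccard coefficient between two strings."""
    a = set(selected.lower().split())
    b = set(ground_truth.lower().split())
    if not a and not b:
        # Convention chosen here: two empty selections count as identical
        return 1.0
    return len(a & b) / len(a | b)


# Two of five distinct words overlap
print(jaccard("good luck", "good luck with your auction"))
# Identical selections
print(jaccard("not so good mood..", "not so good mood.."))
```

Averaging this value over all validation tweets gives a single score for the word selection task.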

### Extension idea 2:

You may want to give Kaggle's brand-new feature called Models a try!