# Multi-class classification of sentiment associated with therapies in English tweets

Daniel Jimenez Campos

### Data Description

**Training data**: 3009 tweets  
**Validation data**: 753 tweets  
**Testing data**: TBA  
**Evaluation metric**: micro-averaged F1-score  

### Data Examples

| tweet_id | therapy     | text                                                                                                                                                       | label    |
|----------|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|
| 15309    | meditation  | Did you know meditation can be one of the *most rewarding important things you do in your life*? Did you also know it’s *impossible to not be able to meditate*? For people that believe your mind must somehow go blank you’re wrong unless you’re dead. | positive |
| 15262    | acupuncture | abt to get acupuncture for my migraines for the first time ever & i am *terrified*                                                                             | neutral  |

### Submission Format

Please use the format below for submission. Submissions should contain tweet_id and label separated by tabspace in the same order as below.

tweet_id label
15309 positive
15262 neutral


The unzipped submission data needs to be named as *"answer.txt"* and be zipped.

For more information, please refer to [this link](https://github.com/codalab/codalab-competitions/wiki/User_Building-a-Scoring-Program-for-a-Competition#directory-structure-for-submissions).


Simply copy and paste the above Markdown-formatted text into an empty Markdown cell in Jupyter Notebook to display it correctly.

## 1. Importing all necessary libraries

In [1]:
# Importing necessary libraries
import re
import string

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import opinion_lexicon
from nltk.corpus import sentiwordnet as swn
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import treebank
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Download the nlkt tools
nltk.download('opinion_lexicon')
nltk.download('sentiwordnet')

[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     C:\Users\danij\AppData\Roaming\nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\danij\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


True

## 2. Read in data

In [2]:
train_data = pd.read_csv("C:\\Users\\danij\\Documents\\UC3M\\TFG\\DATA\\train.csv")
dev_data = pd.read_csv("C:\\Users\\danij\\Documents\\UC3M\\TFG\\DATA\\dev.csv")

# Concatenate the train and dev data
data = pd.concat([train_data, dev_data])

data.head()

Unnamed: 0,tweet_id,therapy,text,label
0,1550591923047600131,cannabis,@chuckschumer YES. Please. Cannabis is legal i...,neutral
1,1496301299691839491,adderall,"@youdoingtoomuch I’m a busy girl, adderall kee...",positive
2,1460587790966657024,adderall,adderall adderall caffeine caffeine caffeine k...,neutral
3,1393586192625528832,alprazolam,@justky1018 See if you can get your doctor to ...,neutral
4,1561452418285547520,diazepam,@feytaline Reminds me of the time I had a roug...,positive


## 3. Cleaning, tokenizing, removing stopwords and lemmatizing the data

In [3]:
# Downloading the stopwords corpus from NLTK (words like "the", "is", "and" that are 
# commonly used and can be ignored)
stopwords = nltk.corpus.stopwords.words('english')


# Creating a WordNet lemmatizer object from NLTK (used for lemmatizing words to their base form based on context)
wn = nltk.WordNetLemmatizer()


# Function to clean the text by removing punctuation, converting to lowercase, and stemming words
def clean_text(text):
    # Removing punctuation characters from the text and converting it to lowercase
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    # Splitting the text into tokens (words) using regular expressions
    tokens = re.split('\W+', text)
    # Lemmatizing each word in the tokens list using the WordNet lemmatizer
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # Returning the cleaned text
    return text


## Applying the function to the dataset and convert the 'cleaned_text' column from list to string
data['cleaned_text'] = data['text'].apply(lambda x: clean_text(x)).apply(' '.join)

In [4]:
data.head()

Unnamed: 0,tweet_id,therapy,text,label,cleaned_text
0,1550591923047600131,cannabis,@chuckschumer YES. Please. Cannabis is legal i...,neutral,chuckschumer yes please cannabis legal license...
1,1496301299691839491,adderall,"@youdoingtoomuch I’m a busy girl, adderall kee...",positive,youdoingtoomuch busy girl adderall keep functi...
2,1460587790966657024,adderall,adderall adderall caffeine caffeine caffeine k...,neutral,adderall adderall caffeine caffeine caffeine k...
3,1393586192625528832,alprazolam,@justky1018 See if you can get your doctor to ...,neutral,justky1018 see get doctor prescribe alprazolam...
4,1561452418285547520,diazepam,@feytaline Reminds me of the time I had a roug...,positive,feytaline reminds time rough day took xanax la...


## 4. Feature Engineering

### 4.1 Body length

In [5]:
# Applying the 'count' function to the 'text' column and storing the result in 
# a new 'body_len' column
data['body_len'] = data['text'].apply(lambda x: len(x) - x.count(" "))

### 4.2 Count punctuation signs 

In [6]:
# Function to count the percentage of punctuation characters in a given text
def count_punct(text):
    # Counting the number of punctuation characters in the text
    count = sum([1 for char in text if char in string.punctuation])
    # Calculating the percentage of punctuation characters (excluding spaces) in the text
    return round(count/(len(text) - text.count(" ")), 3) * 100


# Applying the 'count_punct' function to the 'body_text' column and storing the result in 
# a new 'punct%' column
data['punct%'] = data['text'].apply(lambda x: count_punct(x))

### 4.3 Word with associated sentiment weight function

In [7]:
data.head()

Unnamed: 0,tweet_id,therapy,text,label,cleaned_text,body_len,punct%
0,1550591923047600131,cannabis,@chuckschumer YES. Please. Cannabis is legal i...,neutral,chuckschumer yes please cannabis legal license...,239,6.3
1,1496301299691839491,adderall,"@youdoingtoomuch I’m a busy girl, adderall kee...",positive,youdoingtoomuch busy girl adderall keep functi...,67,6.0
2,1460587790966657024,adderall,adderall adderall caffeine caffeine caffeine k...,neutral,adderall adderall caffeine caffeine caffeine k...,166,3.6
3,1393586192625528832,alprazolam,@justky1018 See if you can get your doctor to ...,neutral,justky1018 see get doctor prescribe alprazolam...,129,2.3
4,1561452418285547520,diazepam,@feytaline Reminds me of the time I had a roug...,positive,feytaline reminds time rough day took xanax la...,236,4.7


Define una función que calcule el sentimiento de cada palabra en un texto, utilizando el léxico de sentimientos apropiado. Aquí tienes un ejemplo utilizando SentiWordNet:

In [8]:
def get_word_sentiment(word):
    synsets = list(swn.senti_synsets(word))
    if synsets:
        sentiment = synsets[0].pos_score() - synsets[0].neg_score()
        return sentiment
    return 0.0

Itera sobre cada texto en tu dataset y para cada palabra en el texto, utiliza la función get_word_sentiment para obtener el sentimiento de esa palabra. Puedes almacenar los sentimientos en una nueva lista o como una columna adicional en tu dataset.

In [9]:
data['sentiments'] = data['cleaned_text'].apply(lambda text: np.mean([get_word_sentiment(word) for word in text]))

In [10]:
data.head()

Unnamed: 0,tweet_id,therapy,text,label,cleaned_text,body_len,punct%,sentiments
0,1550591923047600131,cannabis,@chuckschumer YES. Please. Cannabis is legal i...,neutral,chuckschumer yes please cannabis legal license...,239,6.3,0.027778
1,1496301299691839491,adderall,"@youdoingtoomuch I’m a busy girl, adderall kee...",positive,youdoingtoomuch busy girl adderall keep functi...,67,6.0,0.004167
2,1460587790966657024,adderall,adderall adderall caffeine caffeine caffeine k...,neutral,adderall adderall caffeine caffeine caffeine k...,166,3.6,0.033179
3,1393586192625528832,alprazolam,@justky1018 See if you can get your doctor to ...,neutral,justky1018 see get doctor prescribe alprazolam...,129,2.3,0.028646
4,1561452418285547520,diazepam,@feytaline Reminds me of the time I had a roug...,positive,feytaline reminds time rough day took xanax la...,236,4.7,0.022727


### 4.5 Add sentiment intensity to the text

In [11]:
# Create a Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()

# Function to get sentiment intensity
def get_sentiment_intensity(text):
    sentiment = sia.polarity_scores(text)
    return sentiment['compound']

# Apply the function to the text column
data['sentiment_intensity'] = data['cleaned_text'].apply(get_sentiment_intensity)

In [12]:
data.head()

Unnamed: 0,tweet_id,therapy,text,label,cleaned_text,body_len,punct%,sentiments,sentiment_intensity
0,1550591923047600131,cannabis,@chuckschumer YES. Please. Cannabis is legal i...,neutral,chuckschumer yes please cannabis legal license...,239,6.3,0.027778,-0.2732
1,1496301299691839491,adderall,"@youdoingtoomuch I’m a busy girl, adderall kee...",positive,youdoingtoomuch busy girl adderall keep functi...,67,6.0,0.004167,0.6486
2,1460587790966657024,adderall,adderall adderall caffeine caffeine caffeine k...,neutral,adderall adderall caffeine caffeine caffeine k...,166,3.6,0.033179,-0.8062
3,1393586192625528832,alprazolam,@justky1018 See if you can get your doctor to ...,neutral,justky1018 see get doctor prescribe alprazolam...,129,2.3,0.028646,0.0
4,1561452418285547520,diazepam,@feytaline Reminds me of the time I had a roug...,positive,feytaline reminds time rough day took xanax la...,236,4.7,0.022727,-0.1119


### Split into train/test

In [13]:
# Splitting the data into training and testing sets
# The 'body_text', 'body_len', and 'punct%' columns are used as the features (X)
# The 'label' column is used as the target variable (y)
# The test_size parameter is set to 0.2, which means 20% of the data will be used for testing
X_train, X_test, y_train, y_test = train_test_split(data[['cleaned_text', 'body_len', 'punct%', 'sentiments', 'sentiment_intensity']], data['label'], test_size=0.2)

In [14]:
# Show data before vectorizing
data.head()

Unnamed: 0,tweet_id,therapy,text,label,cleaned_text,body_len,punct%,sentiments,sentiment_intensity
0,1550591923047600131,cannabis,@chuckschumer YES. Please. Cannabis is legal i...,neutral,chuckschumer yes please cannabis legal license...,239,6.3,0.027778,-0.2732
1,1496301299691839491,adderall,"@youdoingtoomuch I’m a busy girl, adderall kee...",positive,youdoingtoomuch busy girl adderall keep functi...,67,6.0,0.004167,0.6486
2,1460587790966657024,adderall,adderall adderall caffeine caffeine caffeine k...,neutral,adderall adderall caffeine caffeine caffeine k...,166,3.6,0.033179,-0.8062
3,1393586192625528832,alprazolam,@justky1018 See if you can get your doctor to ...,neutral,justky1018 see get doctor prescribe alprazolam...,129,2.3,0.028646,0.0
4,1561452418285547520,diazepam,@feytaline Reminds me of the time I had a roug...,positive,feytaline reminds time rough day took xanax la...,236,4.7,0.022727,-0.1119


### Vectorize text

In [16]:
# Creating a TfidfVectorizer object with the analyzer parameter set to the clean_text function
tfidf_vect = TfidfVectorizer()

# Fitting the TfidfVectorizer on the 'text' column of the training set
tfidf_vect_fit = tfidf_vect.fit(X_train['cleaned_text'])

# Transforming the 'text' column of the training and testing sets into TF-IDF features
tfidf_train = tfidf_vect_fit.transform(X_train['cleaned_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['cleaned_text'])

# Concatenating the 'body_len' and 'punct%' and 'sentiments' columns with the TF-IDF features of the training set
X_train_vect = pd.concat([X_train[['body_len', 'punct%', 'sentiments', 'sentiment_intensity']].reset_index(drop=True), 
                          pd.DataFrame(tfidf_train.toarray())], axis=1)

# Concatenating the 'body_len' and 'punct%'  and 'sentiments'columns with the TF-IDF features of the testing set
X_test_vect = pd.concat([X_test[['body_len', 'punct%', 'sentiments', 'sentiment_intensity']].reset_index(drop=True), 
                         pd.DataFrame(tfidf_test.toarray())], axis=1)

# Displaying the head (first few rows) of the X_train_vect DataFrame
X_train_vect.head()

Unnamed: 0,body_len,punct%,sentiments,sentiment_intensity,0,1,2,3,4,5,...,9848,9849,9850,9851,9852,9853,9854,9855,9856,9857
0,209,2.9,0.041139,0.5478,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,198,5.1,0.026012,-0.8942,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,223,2.2,0.031457,0.8806,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,273,2.9,0.029279,0.4404,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,136,0.7,0.039062,-0.5664,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Hyperparameter tuning

In [None]:
# Function to show hyperparameters values
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [None]:
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None]
}

cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

In [None]:
cv.best_estimator_

### Write out pickled model

In [None]:
joblib.dump(cv.best_estimator_, 'C:\\Users\\danij\\Documents\\LEARNING\\DATASETS\\ML_Algorithms_data\\RF_model.pkl')

### Final evaluation of models

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

In [None]:
# Importing the RandomForestClassifier from the sklearn.ensemble module
from sklearn.ensemble import RandomForestClassifier
import time
from sklearn.metrics import precision_recall_fscore_support as score

# Creating a RandomForestClassifier object with specified parameters
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

# Measuring the time taken to fit (train) the RandomForestClassifier on the training data
start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

# Measuring the time taken to make predictions using the trained RandomForestClassifier 
# on the testing data
start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

# Computing precision, recall, fscore, and support values for the predicted results
precision, recall, fscore, support = score(y_test, y_pred, average='macro')

# Printing the precision, recall, and F1-score
print('Macro Average Precision:', precision)
print('Macro Average Recall:', recall)
print('Macro Average F1-score:', fscore)
print('Support:', support)

In [None]:
# Importing the GradientBoostingClassifier from the sklearn.ensemble module
from sklearn.ensemble import GradientBoostingClassifier
import time
from sklearn.metrics import precision_recall_fscore_support as score

# Creating a GradientBoostingClassifier object with specified parameters
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

# Measuring the time taken to fit (train) the GradientBoostingClassifier on the training data
start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

# Measuring the time taken to make predictions using the trained GradientBoostingClassifier on 
# the testing data
start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

# Computing precision, recall, fscore, and support values for the predicted results
precision, recall, fscore, support = score(y_test, y_pred, average='macro')

# Printing the precision, recall, and F1-score
print('Macro Average Precision:', precision)
print('Macro Average Recall:', recall)
print('Macro Average F1-score:', fscore)
print('Support:', support)