# Detection of tweets written by potentially suicidal people

### In this ipynb file, a *term frequency–inverse document frequency* technique is used to train a machine learning classification model that determines whether a tweet was written by a potentially suicidal person.

The data consist of 1787 tweets labeled as either *'Potential Suicide post'* or *'Not Suicide post'*. The dataset was obtained from Kaggle:
https://www.kaggle.com/datasets/aunanya875/suicidal-tweet-detection-dataset/data

In [1]:
# import the necessary:
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Import data:
df = pd.read_csv('Suicide_Ideation_Dataset(Twitter-based).csv')

# First visualization:
df.head()

Unnamed: 0,Tweet,Suicide
0,making some lunch,Not Suicide post
1,@Alexia You want his money.,Not Suicide post
2,@dizzyhrvy that crap took me forever to put to...,Potential Suicide post
3,@jnaylor #kiwitweets Hey Jer! Since when did y...,Not Suicide post
4,Trying out &quot;Delicious Library 2&quot; wit...,Not Suicide post


First, we check whether the dataset is balanced and whether it needs to be cleaned:

In [3]:
# Is dataframe balanced?
df.Suicide.value_counts()

Suicide
Not Suicide post           1127
Potential Suicide post      660
Name: count, dtype: int64

In [4]:
# Null and NaN values (isnull is an alias of isna, so both lines print the same index):
print(df.isna()[df.isna()['Tweet'] == True].index)
print(df.isnull()[df.isnull()['Tweet'] == True].index)

Index([497, 1017], dtype='int64')
Index([497, 1017], dtype='int64')


In [5]:
# Rows with null values:
df.iloc[[497, 1017]]

Unnamed: 0,Tweet,Suicide
497,,Potential Suicide post
1017,,Not Suicide post


We have an unbalanced dataframe and a couple of null values to drop. 

First we drop null values:

In [6]:
# Drop the rows whose 'Tweet' is missing
df = df.dropna(subset=['Tweet'])

Then we balance the dataset by randomly discarding some 'Not Suicide post' tweets until both classes have the same number of tweets:

In [7]:
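randUndSam = RandomUnderSampler()
# fit_resample returns the resampled features and labels separately
x_balanced, y_balanced = randUndSam.fit_resample(df[['Tweet']], df['Suicide'])
df_balanced = x_balanced.assign(Suicide=y_balanced)
df_balanced['Suicide'].value_counts()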

Suicide
Not Suicide post           659
Potential Suicide post     659
Name: count, dtype: int64
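
To make the undersampling step concrete, the following is a minimal sketch of the same idea written with pandas only. The toy frame, its labels and the fixed `random_state` are assumptions made for illustration; it is not the implementation used by `RandomUnderSampler`.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the notebook.
toy = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(10)],
    "Suicide": ["Not Suicide post"] * 7 + ["Potential Suicide post"] * 3,
})

# Size of the smallest class.
n_min = toy["Suicide"].value_counts().min()

# Keep n_min randomly chosen rows from every class (sampling without replacement).
toy_balanced = (
    toy.groupby("Suicide", group_keys=False)
       .sample(n=n_min, random_state=0)  # random_state fixed only for reproducibility
       .reset_index(drop=True)
)

counts = toy_balanced["Suicide"].value_counts()
print(counts)
```

Both classes end up with as many rows as the minority class, which is the outcome shown in the notebook output above.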

Now we can split the dataset into train (80%) and test (20%) samples:

In [8]:
df_train, df_test = train_test_split(df_balanced, test_size=0.2, random_state=1)
print(f'Test has {df_test.shape[0]} rows')
print(f'Train has {df_train.shape[0]} rows')

Test has 264 rows
Train has 1054 rows


In [9]:
x_train, y_train = df_train['Tweet'], df_train['Suicide']
x_test, y_test = df_test['Tweet'], df_test['Suicide']

Before applying any machine learning algorithm, the text data must be transformed into numerical data. Here we use the ***term frequency–inverse document frequency (Tf-idf)*** technique, which not only turns text into numbers but also weights each word by how often it appears in a given tweet and by how rare it is across all tweets.

The Tf-idf value assigned to a word is given by the following expression [1]:

$$\text{Tf-idf} = Tf \times Idf$$

where $Tf$ is the frequency of the word in the document under study (in our case, the tweet):

$$Tf = \frac{\text{Times the word appears in the tweet}}{\text{Total words in the tweet}}$$

This frequency is then weighted using the word's document frequency:

$$Df = \frac{\text{Number of tweets containing the word}}{\text{Total number of tweets}}$$

Rather than using $Df$ directly, the logarithm of $Df^{-1}$ (the inverse document frequency) is used, which dampens the effect of $Df$ on the weight:

$$Idf = \log\left(\frac{1}{Df}\right)$$

Note that using $\log(Df)$ instead would give negative (or zero) values, since $Df \le 1$.
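
The expressions above can be checked by hand on a tiny example. The sketch below implements them literally in plain Python; the three tweets and the helper names (`tf`, `df`, `idf`, `tf_idf`) are hypothetical and chosen only for illustration, and whitespace tokenisation is assumed.

```python
import math

# Hypothetical toy corpus, already lower-cased; tokens are split on whitespace.
tweets = [
    "i feel tired today",
    "i feel happy today",
    "tired tired of everything",
]
docs = [t.split() for t in tweets]

def tf(word, doc):
    """Times the word appears in the tweet / total words in the tweet."""
    return doc.count(word) / len(doc)

def df(word, docs):
    """Number of tweets containing the word / total number of tweets."""
    return sum(word in doc for doc in docs) / len(docs)

def idf(word, docs):
    """log(1 / Df); assumes the word occurs in at least one tweet."""
    return math.log(1 / df(word, docs))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# 'tired' appears twice in the third tweet but also in another tweet;
# 'everything' appears once and only in the third tweet.
score_tired = tf_idf("tired", docs[2], docs)
score_everything = tf_idf("everything", docs[2], docs)
print(score_tired, score_everything)
```

Although 'tired' is the more frequent word within the third tweet, 'everything' gets the higher score because it is rarer across the corpus.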

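To implement the Tf-idf technique we use the ***TfidfVectorizer*** class from the *feature_extraction.text* module of *scikit-learn*. With its default settings it computes a smoothed, L2-normalized variant of the expression above, so the resulting values differ from a hand calculation with those formulas: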

In [10]:
tfidf = TfidfVectorizer(stop_words='english')
x_train_vector = tfidf.fit_transform(x_train)
x_test_vector = tfidf.transform(x_test)

Note that for the test data the *transform* method was used instead of *fit_transform*, because the vectorizer must be fitted only on the training data.
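
A small sketch of what this means in practice: the vocabulary is learned once from the training texts, and words that appear only in the test texts are ignored. The example sentences are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["feeling tired today", "happy sunny day"]  # hypothetical training tweets
test_texts = ["tired of rain", "sunny"]                    # hypothetical unseen tweets

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_texts)  # learns the vocabulary and idf weights
test_matrix = vec.transform(test_texts)        # reuses them; nothing new is learned

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, and 'of' and 'rain' contribute nothing to the first test row because they were never seen during fitting.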

The result:

In [11]:
x_train_vector

<1054x4248 sparse matrix of type '<class 'numpy.float64'>'
	with 10651 stored elements in Compressed Sparse Row format>

We can visualize one of these results as follows:

In [12]:
# Tf-idf weights of the tweet at position 100, one row per vocabulary word
aux = pd.DataFrame(x_train_vector[100].T.todense(), index=tfidf.get_feature_names_out(), columns=["TF-IDF"])
aux = aux.sort_values('TF-IDF', ascending=False)
aux.head(10)

Unnamed: 0,TF-IDF
lotion,0.440642
shower,0.398619
hair,0.385091
says,0.374037
body,0.349455
wake,0.301045
got,0.274475
like,0.197692
just,0.177688
pointless,0.0


Once the text data have been transformed into numerical data, we can apply any machine learning classification model.

***Support vector classifier:***

In [13]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(x_train_vector, y_train)

In [14]:
# Some tests:
print(svc.predict(tfidf.transform(['I just want die'])))
print(svc.predict(tfidf.transform(['This day is beautiful'])))
print(svc.predict(tfidf.transform(["I can't be strong anymore"])))

['Potential Suicide post ']
['Not Suicide post']
['Potential Suicide post ']


***Decision tree classifier:***

In [15]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(x_train_vector, y_train)

***Gaussian naive Bayes:***

In [16]:
from sklearn.naive_bayes import GaussianNB

naive = GaussianNB()
naive.fit(x_train_vector.toarray(), y_train)

***Logistic regression:***

In [17]:
from sklearn.linear_model import LogisticRegression

log = LogisticRegression()
log.fit(x_train_vector, y_train)

We compare the models' scores on the test set:

In [18]:
print(svc.score(x_test_vector.toarray(), y_test))
print(tree.score(x_test_vector.toarray(), y_test))
print(naive.score(x_test_vector.toarray(), y_test))
print(log.score(x_test_vector.toarray(), y_test))

0.9128787878787878
0.9015151515151515
0.7765151515151515
0.8977272727272727


In this case the best model is the support vector classifier, so we choose it and look at its main metrics:

In [19]:
from sklearn.metrics import f1_score

y_pred = svc.predict(x_test_vector)
f1_score(y_test, y_pred, labels=['Potential Suicide post ', 'Not Suicide post'], average=None)

array([0.91119691, 0.91449814])

In [20]:
from sklearn.metrics import classification_report, confusion_matrix

# Fix the label order so rows and columns are reported consistently
labels = ['Potential Suicide post ', 'Not Suicide post']
print('confusion matrix:\n\n', confusion_matrix(y_test, y_pred, labels=labels), '\n')
print('report:\n\n', classification_report(y_test, y_pred, labels=labels))

confusion matrix:

 [[118  12]
 [ 11 123]] 

report:

                          precision    recall  f1-score   support

Potential Suicide post        0.91      0.91      0.91       130
       Not Suicide post       0.91      0.92      0.91       134

               accuracy                           0.91       264
              macro avg       0.91      0.91      0.91       264
           weighted avg       0.91      0.91      0.91       264
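
As a cross-check, the figures for the 'Potential Suicide post' class can be recomputed from the confusion matrix printed above. This is only a worked arithmetic sketch; the variable names are chosen for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions,
# both ordered as ['Potential Suicide post ', 'Not Suicide post'].
cm = [[118, 12],
      [11, 123]]

tp, fn = cm[0]  # true 'Potential' tweets: predicted Potential / predicted Not
fp, tn = cm[1]  # true 'Not' tweets: predicted Potential / predicted Not

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
```

The F1 value agrees with the first entry returned by `f1_score`, and the accuracy agrees with the test score reported for the support vector classifier.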



We try to optimize the model parameters by running the following grid search, repeating it with different candidate values:

In [21]:
from sklearn.model_selection import GridSearchCV

parameters = {'C': [0.1, 0.5, 1, 1.2, 1.5, 2, 3], 'kernel': ['linear', 'rbf']}
svc_optimized = SVC()
svc_optimized_grid = GridSearchCV(svc_optimized, parameters, cv=8)
svc_optimized_grid.fit(x_train_vector, y_train)

In [22]:
print(svc_optimized_grid.best_estimator_)
print(svc_optimized_grid.best_params_)
print(svc_optimized_grid.best_score_)

SVC(C=2)
{'C': 2, 'kernel': 'rbf'}
0.9373987971316216
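
One possible variation, sketched under assumptions: wrapping the vectorizer and the classifier in a scikit-learn `Pipeline` lets the grid search re-fit the Tf-idf vocabulary inside every cross-validation fold instead of once on the whole training set. The toy texts, the smaller parameter grid and `cv=2` are placeholders, not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for x_train / y_train.
texts = [
    "want to disappear forever", "nothing matters anymore",
    "cannot keep going", "life feels pointless",
    "lovely lunch today", "great game tonight",
    "new phone arrived", "coffee with friends",
]
labels = ["Potential Suicide post"] * 4 + ["Not Suicide post"] * 4

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"svc__C": [0.5, 1, 2], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

With this layout the raw tweets are passed to `fit`, and `search.best_estimator_` is a fitted pipeline that accepts raw text directly.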


We apply the best parameters:

In [25]:
model = SVC(C=2, kernel='rbf')
model.fit(x_train_vector, y_train)
print(f'score: {model.score(x_test_vector, y_test)}')
y_model_pred = model.predict(x_test_vector)
labels = ['Potential Suicide post ', 'Not Suicide post']
# Per-class F1 scores are included in the classification report below
print('confusion matrix:\n\n', confusion_matrix(y_test, y_model_pred, labels=labels), '\n')
print('report:\n\n', classification_report(y_test, y_model_pred, labels=labels))


score: 0.9166666666666666
confusion matrix:

 [[119  11]
 [ 11 123]] 

report:

                          precision    recall  f1-score   support

Potential Suicide post        0.92      0.92      0.92       130
       Not Suicide post       0.92      0.92      0.92       134

               accuracy                           0.92       264
              macro avg       0.92      0.92      0.92       264
           weighted avg       0.92      0.92      0.92       264



The averaged precision, recall and F1 score improve slightly, from 0.91 to 0.92.

Finally, we save the model in a file:

In [26]:
import pickle
with open('Suicides_classification', 'wb') as f:
    pickle.dump(model,f)
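
Note that a saved classifier can only be reused together with the fitted vectorizer that produced its input features. A minimal sketch of one way to keep both in a single object, assuming a scikit-learn `Pipeline` and hypothetical toy data in place of the real training set:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for x_train / y_train.
texts = [
    "want to disappear forever", "life feels pointless",
    "lovely lunch today", "coffee with friends",
]
labels = ["Potential Suicide post", "Potential Suicide post",
          "Not Suicide post", "Not Suicide post"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svc", SVC(C=2, kernel="rbf")),
])
clf.fit(texts, labels)

# Serialise vectorizer and classifier together; with a file, use pickle.dump(clf, f).
payload = pickle.dumps(clf)
restored = pickle.loads(payload)

# The restored object takes raw text, with no separate tfidf.transform call.
print(restored.predict(["coffee with friends"]))
```

As with any pickle file, it should only be loaded from a trusted source and with a compatible scikit-learn version.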

A final test:

In [27]:
print(model.predict(tfidf.transform(["what a beautiful day to just go away "])))

['Not Suicide post']


In [29]:
print(model.predict(tfidf.transform(["what a beautiful day to just go away, from this life"])))

['Potential Suicide post ']


## References:

[1]: Abhishek Jha (October 28, 2024), "Vectorization Techniques in NLP [Guide]". Retrieved from: https://neptune.ai/blog/vectorization-techniques-in-nlp-guide