# Dataset Description: Suicidal Tweet Detection

This dataset is a collection of tweets, each annotated to indicate whether it is related to suicide. It is intended to support the development and evaluation of machine learning models that classify tweets as expressing suicidal sentiments or not. Our team generated the dataset internally for our NLP project.

Columns:

1. Tweet: the text content of the tweet, obtained from various sources. The tweets cover a wide range of topics, emotions, and expressions.
2. Suicide: the annotation giving the class of the tweet. The possible values are:
   - Not Suicide post: assigned to tweets that do not express any suicidal sentiments or intentions.
   - Potential Suicide post: assigned to tweets that show indications of suicidal thoughts, feelings, or intentions.
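
Because the label column holds free-text strings, it can help to confirm that only the two documented values occur before training. The following is a minimal sketch on an invented two-row frame (`toy`, `EXPECTED_LABELS`, and the `target` column are illustrative names, not part of the dataset):

```python
import pandas as pd

# Invented rows standing in for the real CSV; only the label strings come from the description.
toy = pd.DataFrame({
    "Tweet": ["making some lunch", "feel hopeless tonight"],
    "Suicide": ["Not Suicide post", "Potential Suicide post"],
})

EXPECTED_LABELS = {"Not Suicide post", "Potential Suicide post"}

# Collect any label outside the documented set (for example one with stray whitespace).
unexpected = set(toy["Suicide"].unique()) - EXPECTED_LABELS
print("unexpected labels:", unexpected)

# Hypothetical numeric target: 1 = potential suicide post, 0 = not.
toy["target"] = (toy["Suicide"] == "Potential Suicide post").astype(int)
print(toy["Suicide"].value_counts())
```

The `value_counts()` call also shows the class balance, which matters when reading accuracy figures later on.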

In [149]:
import pandas as pd

In [99]:
df=pd.read_csv("Suicide_Ideation_Dataset(Twitter-based).csv")

In [100]:
df

Unnamed: 0,Tweet,Suicide
0,making some lunch,Not Suicide post
1,@Alexia You want his money.,Not Suicide post
2,@dizzyhrvy that crap took me forever to put to...,Potential Suicide post
3,@jnaylor #kiwitweets Hey Jer! Since when did y...,Not Suicide post
4,Trying out &quot;Delicious Library 2&quot; wit...,Not Suicide post
...,...,...
1780,i have forgotten how much i love my Nokia N95-1,Not Suicide post
1781,Starting my day out with a positive attitude! ...,Not Suicide post
1782,"@belledame222 Hey, it's 5 am...give a girl som...",Not Suicide post
1783,2 drunken besties stumble into my room and we ...,Not Suicide post


In [101]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1785 entries, 0 to 1784
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Tweet    1785 non-null   object
 1   Suicide  1785 non-null   object
dtypes: object(2)
memory usage: 28.0+ KB


In [102]:
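# Keep only word characters and whitespace; this drops @, #, and punctuation,
# and also the & and ; of HTML entities, so "&quot;" becomes "quot".
df['cleaned_data']=df['Tweet'].str.replace(r'[^\w\s]','',regex=True)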

In [103]:
df.head()

Unnamed: 0,Tweet,Suicide,cleaned_data
0,making some lunch,Not Suicide post,making some lunch
1,@Alexia You want his money.,Not Suicide post,Alexia You want his money
2,@dizzyhrvy that crap took me forever to put to...,Potential Suicide post,dizzyhrvy that crap took me forever to put tog...
3,@jnaylor #kiwitweets Hey Jer! Since when did y...,Not Suicide post,jnaylor kiwitweets Hey Jer Since when did you ...
4,Trying out &quot;Delicious Library 2&quot; wit...,Not Suicide post,Trying out quotDelicious Library 2quot with mi...
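
The last row above shows a side effect of the punctuation rule: the HTML entity `&quot;` survives as the token `quot`, and user handles stay in the text. One possible refinement, sketched here with the standard library only, is to decode entities and drop handles before stripping punctuation. The function name `clean_tweet` and the example strings are illustrative, and removing handles is an assumption about what is useful for this task:

```python
import html
import re

def clean_tweet(text):
    """Decode HTML entities, drop @handles, then strip punctuation."""
    text = html.unescape(text)           # '&quot;' becomes '"'
    text = re.sub(r"@\w+", "", text)     # assumption: user handles carry no signal
    text = re.sub(r"[^\w\s]", "", text)  # same punctuation rule as the cell above
    return re.sub(r"\s+", " ", text).strip()

first = clean_tweet("Trying out &quot;Delicious Library 2&quot; today")
second = clean_tweet("@Alexia You want his money.")
print(first)
print(second)
```

On the DataFrame this could be applied with `df['Tweet'].map(clean_tweet)` in place of the single `str.replace` call.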


In [104]:
df['cleaned_data']=df['cleaned_data'].astype('string')

In [105]:
!pip install nltk



In [106]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop-word lists

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jyoti\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [107]:
stopwords.words('english')[:10]  # Show the first ten stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [108]:
def remove_stop_words(text):
    """Return text with English stop words removed (case-insensitive match)."""
    stop_words = set(stopwords.words('english'))
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)


In [109]:
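# Spot-check a single row. apply() runs on every field of the row,
# so a stop word in the label (such as "Not") is removed in this preview too.
df.loc[10].apply(remove_stop_words)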

Tweet           Everything lover survival, everything mother s...
Suicide                                              Suicide post
cleaned_data    Everything lover survival everything mother su...
Name: 10, dtype: object
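
Calling `apply` on a whole row passes the label through the stop-word filter as well as the text. The sketch below illustrates that behaviour on an invented row. To stay runnable without downloading NLTK data it uses a tiny hand-written stop list (`STOP_WORDS`) and a stand-in function (`strip_stop_words`); both are assumptions for illustration, although "not" is also in NLTK's English list:

```python
import pandas as pd

# Tiny stand-in for NLTK's English stop-word list.
STOP_WORDS = {"not", "some", "the", "a"}

def strip_stop_words(text):
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

row = pd.Series({"Tweet": "making some lunch", "Suicide": "Not Suicide post"})

# Series.apply visits every value in the row, label included.
whole_row = row.apply(strip_stop_words)

# Restricting the call to the text field leaves the label untouched.
text_only = strip_stop_words(row["Tweet"])

print(whole_row)
print(text_only)
```

This only affects the preview: the next cell applies the function to the `cleaned_data` column alone, so the stored labels are unchanged.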

In [110]:
df['cleaned']=df['cleaned_data'].apply(remove_stop_words)

In [111]:
df.head()

Unnamed: 0,Tweet,Suicide,cleaned_data,cleaned
0,making some lunch,Not Suicide post,making some lunch,making lunch
1,@Alexia You want his money.,Not Suicide post,Alexia You want his money,Alexia want money
2,@dizzyhrvy that crap took me forever to put to...,Potential Suicide post,dizzyhrvy that crap took me forever to put tog...,dizzyhrvy crap took forever put together iâm g...
3,@jnaylor #kiwitweets Hey Jer! Since when did y...,Not Suicide post,jnaylor kiwitweets Hey Jer Since when did you ...,jnaylor kiwitweets Hey Jer Since start twittering
4,Trying out &quot;Delicious Library 2&quot; wit...,Not Suicide post,Trying out quotDelicious Library 2quot with mi...,Trying quotDelicious Library 2quot mixed resul...


In [112]:
from sklearn.feature_extraction.text import CountVectorizer

In [113]:
# Binary presence/absence features, used below with BernoulliNB
vectorizer1=CountVectorizer(binary=True,stop_words='english')

In [114]:
# Raw term counts, used below with MultinomialNB
vectorizer2=CountVectorizer(binary=False,stop_words='english')

In [116]:
x1=vectorizer1.fit_transform(df['cleaned'])

In [117]:
x2=vectorizer2.fit_transform(df['cleaned'])
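
The two vectorizers differ only in the `binary` flag. A small sketch on two invented documents shows the effect: with `binary=True` a repeated word is recorded as 1, while with `binary=False` it keeps its count.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["sad sad day", "happy day"]  # invented examples

binary_vec = CountVectorizer(binary=True)
count_vec = CountVectorizer(binary=False)

x_binary = binary_vec.fit_transform(docs).toarray()
x_counts = count_vec.fit_transform(docs).toarray()

# Column order follows the learned vocabulary indices.
columns = sorted(binary_vec.vocabulary_, key=binary_vec.vocabulary_.get)
print(columns)
print(x_binary)
print(x_counts)
```

Here "sad" appears twice in the first document, so its column holds 1 in `x_binary` and 2 in `x_counts`.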

In [119]:
y=df['Suicide']

In [120]:
from sklearn.model_selection import train_test_split

In [121]:
xtrain,xtest,ytrain,ytest=train_test_split(x1,y,test_size=0.25,random_state=101)

In [122]:
xtrain2,xtest2,ytrain2,ytest2=train_test_split(x2,y,test_size=0.25,random_state=101)
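
In the cells above, the vectorizers are fitted on the full dataset before the split, so the vocabulary includes words that occur only in test tweets. An alternative ordering, sketched below on invented data, splits the raw text first and learns the vocabulary from the training rows only. The variable names and toy tweets are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Invented toy corpus standing in for df['cleaned'] and df['Suicide'].
texts = ["making lunch", "want money", "feel hopeless tonight", "love new phone",
         "cannot go anymore", "positive attitude today", "nobody would miss",
         "drunken besties stumble"]
labels = ["Not Suicide post", "Not Suicide post", "Potential Suicide post",
          "Not Suicide post", "Potential Suicide post", "Not Suicide post",
          "Potential Suicide post", "Not Suicide post"]

# Split the raw text first...
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=101)

# ...then learn the vocabulary from the training rows only.
vec = CountVectorizer(binary=True)
x_train = vec.fit_transform(text_train)
x_test = vec.transform(text_test)  # words unseen in training are ignored

print(x_train.shape, x_test.shape)
```

The TF-IDF pipelines later in the notebook already follow this order, because `make_pipeline` fits the vectorizer inside `model.fit` on the training split.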

In [123]:
from sklearn.naive_bayes import BernoulliNB,MultinomialNB

In [124]:
bn=BernoulliNB()
mnb=MultinomialNB()

In [125]:
bn.fit(xtrain,ytrain)

In [126]:
mnb.fit(xtrain2,ytrain2)

In [127]:
pred1=bn.predict(xtest)

In [128]:
pred2=mnb.predict(xtest2)

In [129]:
from sklearn.metrics import accuracy_score

In [130]:
accuracy_score(ytest,pred1)

0.7718120805369127

In [131]:
accuracy_score(ytest2,pred2)

0.8903803131991052
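
Accuracy alone does not show how the errors are distributed between the two classes. A confusion matrix and per-class report give that breakdown. The sketch below uses invented labels and predictions, not the notebook's actual results:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented ground truth and predictions: 6 negative tweets, 4 positive tweets.
y_true = ["Not Suicide post"] * 6 + ["Potential Suicide post"] * 4
y_pred = (["Not Suicide post"] * 6          # all negatives classified correctly
          + ["Not Suicide post"] * 2        # two positives missed
          + ["Potential Suicide post"] * 2) # two positives found

labels = ["Not Suicide post", "Potential Suicide post"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

In this toy case accuracy is 0.8, yet only half of the "Potential Suicide post" tweets are detected. On the notebook's data the same calls would be `confusion_matrix(ytest2, pred2)` and `classification_report(ytest2, pred2)`.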

In [132]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [133]:
from sklearn.pipeline import make_pipeline

In [134]:
model=make_pipeline(TfidfVectorizer(stop_words='english'),MultinomialNB())

In [141]:
model1=make_pipeline(TfidfVectorizer(stop_words='english'),BernoulliNB())

In [135]:
x=df['cleaned']

In [137]:
xtrain3,xtest3,ytrain3,ytest3=train_test_split(x,y,test_size=0.25,random_state=101)

In [138]:
model.fit(xtrain3,ytrain3)

In [142]:
model1.fit(xtrain3,ytrain3)

In [145]:
pred3=model.predict(xtest3)

In [146]:
pred4=model1.predict(xtest3)

In [144]:
accuracy_score(ytest3,pred3)

0.901565995525727

In [148]:
accuracy_score(ytest3,pred4)

0.767337807606264
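
BernoulliNB scores almost the same with TF-IDF as with binary counts. A likely reason is that `BernoulliNB` thresholds its input with the default `binarize=0.0`, so every positive TF-IDF weight becomes 1 and the weighting is discarded. The sketch below checks this on invented data by comparing a model fitted on TF-IDF features with one fitted on the presence/absence version of the same matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Invented toy corpus.
texts = ["making lunch today", "love my new phone", "feel hopeless and alone",
         "nobody would miss me", "positive attitude today", "so tired of everything"]
labels = ["Not Suicide post", "Not Suicide post", "Potential Suicide post",
          "Potential Suicide post", "Not Suicide post", "Potential Suicide post"]

x_tfidf = TfidfVectorizer().fit_transform(texts)
x_presence = (x_tfidf > 0).astype(int)  # 1 where a term occurs, 0 elsewhere

nb_tfidf = BernoulliNB().fit(x_tfidf, labels)
nb_presence = BernoulliNB().fit(x_presence, labels)

same_params = np.allclose(nb_tfidf.feature_log_prob_, nb_presence.feature_log_prob_)
same_preds = np.array_equal(nb_tfidf.predict(x_tfidf), nb_presence.predict(x_presence))
print(same_params, same_preds)
```

Under this reading, the small gap between 0.772 and 0.767 would come from the vocabulary (fitted on all rows in one case, on training rows in the other) rather than from the TF-IDF weights.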

# Conclusion
Test-set accuracy (25% hold-out):

Using CountVectorizer:
1. BernoulliNB = 0.772
2. MultinomialNB = 0.890

Using TfidfVectorizer:
1. BernoulliNB = 0.767
2. MultinomialNB = 0.902

TF-IDF vectorization with MultinomialNB gives the highest accuracy of the four combinations, so it is the best-fit model on this split.
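
The comparison above rests on a single train/test split. One way to check how stable the ranking is would be k-fold cross-validation over the same four combinations. The sketch below runs on an invented toy corpus, so its scores say nothing about the real dataset. The `candidates` dictionary and the choice of `cv=3` are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus (6 rows per class) so the sketch runs on its own.
texts = [
    "making lunch today", "love my new phone", "positive attitude this morning",
    "great coffee with friends", "watching a movie tonight", "sunny walk in the park",
    "feel hopeless and alone", "nobody would miss me", "so tired of everything",
    "cannot see a way out", "everything feels pointless", "alone and hopeless again",
]
labels = ["Not Suicide post"] * 6 + ["Potential Suicide post"] * 6

candidates = {
    "count + BernoulliNB": make_pipeline(CountVectorizer(binary=True), BernoulliNB()),
    "count + MultinomialNB": make_pipeline(CountVectorizer(), MultinomialNB()),
    "tfidf + BernoulliNB": make_pipeline(TfidfVectorizer(), BernoulliNB()),
    "tfidf + MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

mean_accuracy = {}
for name, pipeline in candidates.items():
    # Each fold refits the vectorizer on its own training part only.
    scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```

With the real data, `texts` and `labels` would be `df['cleaned']` and `df['Suicide']`.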