
# Read the Dataset
This dataset is a collection of tweets, each annotated as related to suicide or not. Its primary purpose is to support the development and evaluation of machine learning models that classify tweets as expressing suicidal sentiment or not.

Potential Applications:

*   Suicidal Ideation Detection: The dataset can be used to train models to automatically detect and flag tweets containing potential suicidal content, enabling platforms to take appropriate actions.
*   Mental Health Support: Insights from this dataset can be used to develop tools that offer mental health resources or interventions to users who express signs of distress.
*   Sentiment Analysis Research: Researchers can analyze the linguistic patterns and sentiment of both non-suicidal and potentially suicidal tweets to gain insights into the language used by individuals in different emotional states.
*   Public Health Awareness: The dataset can be used to raise awareness about mental health issues and the importance of responsible social media usage.

[N.B.: The annotations in the "Suicide" column are based on indicators present in the tweet text. The dataset does not provide any personal or identifying information about the users who posted the tweets. Researchers and developers should still handle this data responsibly and ethically, with user privacy in mind.]
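
One possible way to act on the privacy note above is to mask user handles and links before tweet text is shared or logged. This is only a sketch and is not part of the original notebook: the helper name `mask_identifiers` and the placeholder tokens `@user` and `<url>` are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns: a handle is "@" plus word characters, a link starts with http(s)://
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")

def mask_identifiers(text: str) -> str:
    """Return the text with links and user mentions replaced by placeholders."""
    text = URL_RE.sub("<url>", text)      # mask links first
    return MENTION_RE.sub("@user", text)  # then mask handles

masked = mask_identifiers("@Alexia You want his money.")
print(masked)
```

Applied to the whole column, this could look like `df["Tweet"].astype(str).map(mask_identifiers)`, assuming the `Tweet` column loaded below.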

In [12]:
import numpy as np
import pandas as pd
df = pd.read_csv("/content/Suicide_Ideation_Dataset(Twitter-based).csv")

# Let's Explore What We Have!

In [13]:
df.head()

Unnamed: 0,Tweet,Suicide
0,making some lunch,Not Suicide post
1,@Alexia You want his money.,Not Suicide post
2,@dizzyhrvy that crap took me forever to put to...,Potential Suicide post
3,@jnaylor #kiwitweets Hey Jer! Since when did y...,Not Suicide post
4,Trying out &quot;Delicious Library 2&quot; wit...,Not Suicide post


In [14]:
df

Unnamed: 0,Tweet,Suicide
0,making some lunch,Not Suicide post
1,@Alexia You want his money.,Not Suicide post
2,@dizzyhrvy that crap took me forever to put to...,Potential Suicide post
3,@jnaylor #kiwitweets Hey Jer! Since when did y...,Not Suicide post
4,Trying out &quot;Delicious Library 2&quot; wit...,Not Suicide post
...,...,...
1782,i have forgotten how much i love my Nokia N95-1,Not Suicide post
1783,Starting my day out with a positive attitude! ...,Not Suicide post
1784,"@belledame222 Hey, it's 5 am...give a girl som...",Not Suicide post
1785,2 drunken besties stumble into my room and we ...,Not Suicide post


In [15]:
df.isnull().sum()

Tweet      2
Suicide    0
dtype: int64

In [16]:
df.Suicide.value_counts()

Not Suicide post           1127
Potential Suicide post      660
Name: Suicide, dtype: int64
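
The class counts above show a moderately imbalanced dataset, so it may help to know what accuracy a trivial model would reach by always predicting the larger class. The sketch below works that out from the printed counts; it uses the counts before any rows are dropped, so the figure is approximate (roughly 0.63).

```python
# Class counts copied from the value_counts() output above.
counts = {"Not Suicide post": 1127, "Potential Suicide post": 660}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"{majority_label}: {baseline_accuracy:.3f}")
```

Any accuracy score reported later is best read against this baseline rather than against 0.5.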

# Process the Data

In [18]:
from sklearn.preprocessing import LabelEncoder
df.Suicide = LabelEncoder().fit_transform(df.Suicide)
df.Suicide

0       0
1       0
2       1
3       0
4       0
       ..
1782    0
1783    0
1784    0
1785    0
1786    0
Name: Suicide, Length: 1787, dtype: int64
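
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why "Not Suicide post" becomes 0 and "Potential Suicide post" becomes 1 in the output above. A minimal pure-Python sketch of that mapping, using a three-label sample made up for illustration:

```python
# Illustrative sample of labels (not the full column).
labels = ["Not Suicide post", "Not Suicide post", "Potential Suicide post"]

# Sorted unique labels are mapped to 0..n-1.
classes = sorted(set(labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}
encoded = [label_to_id[label] for label in labels]

print(label_to_id)
print(encoded)
```

In the notebook itself, keeping the fitted encoder in a variable would allow reading the same mapping from its `classes_` attribute.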

# TfidfVectorizer

In [29]:
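from sklearn.feature_extraction.text import TfidfVectorizer

# Drop the two rows with a missing tweet before vectorizing.
df = df.dropna()
# Keep at most the 2000 most frequent terms as features.
# Note: fitting on all rows before the split lets test-set text influence the features.
vectorizer = TfidfVectorizer(max_features=2000)
X = vectorizer.fit_transform(df["Tweet"])
y = df.Suicide
# Convert the sparse matrix to a dense array for the models below.
X = X.toarray()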
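
To make the vectorization step less of a black box, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as I understand them: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The two example sentences are made up, and the sketch ignores `max_features` and other options.

```python
import math
import re
from collections import Counter

docs = ["making some lunch", "lunch was some good lunch"]

# Lowercase, then keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

n_docs = len(docs)
# Number of documents in which each term appears at least once.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
# Smoothed inverse document frequency.
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf_row(tokens):
    """Return one L2-normalized TF-IDF row over the shared vocabulary."""
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

matrix = [tfidf_row(toks) for toks in tokenized]

for tok, weight in zip(vocab, matrix[0]):
    print(f"{tok:>7}: {weight:.3f}")
```

Terms that occur in every document (here "lunch" and "some") get the lowest idf, so rarer terms such as "making" carry more weight within a row.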

# Split Data

In [30]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
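
As a quick sanity check on the split, the expected row counts can be worked out by hand. The sketch assumes that `train_test_split` rounds a fractional `test_size` up to whole rows and gives the remainder to the training set; the input numbers are taken from the outputs shown earlier.

```python
import math

n_rows_raw = 1787   # length reported for df.Suicide above
n_missing = 2       # null tweets reported by df.isnull().sum()
test_size = 0.25

n_rows = n_rows_raw - n_missing          # rows left after dropna()
n_test = math.ceil(test_size * n_rows)   # test share is rounded up
n_train = n_rows - n_test

print(n_train, n_test)
```

The resulting training size matches the 1338 data points that LightGBM reports in its log further down. Because the classes are imbalanced, passing `stratify=y` to `train_test_split` is an option worth considering so that both splits keep a similar class ratio.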

# Model Training and Testing

In [32]:
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

models = [
    ('XGBClassifier', XGBClassifier()),
    ('LGBMClassifier', LGBMClassifier()),
    ('RandomForestClassifier', RandomForestClassifier()),
    ('LogisticRegression', LogisticRegression()),
    ('SVC', SVC(probability=True))
]

for model_name, model in models:
    print()
    print(f"Training {model_name}.....")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # accuracy_score expects (y_true, y_pred)
    score = accuracy_score(y_test, y_pred)
    print(f"{model_name}'s Accuracy Score: {score:.2f}")
    print()
    print("_______________________________________________________________")



Training XGBClassifier.....
XGBClassifier's Accuracy Score: 0.94

_______________________________________________________________

Training LGBMClassifier.....
[LightGBM] [Info] Number of positive: 485, number of negative: 853
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002096 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4058
[LightGBM] [Info] Number of data points in the train set: 1338, number of used features: 169
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.362481 -> initscore=-0.564611
[LightGBM] [Info] Start training from score -0.564611
LGBMClassifier's Accuracy Score: 0.94

_______________________________________________________________

Training RandomForestClassifier.....
RandomForestClassifier's Accuracy Score: 0.94

_______________________________________________________________

Training LogisticRegression.....

In [33]:
from sklearn.ensemble import VotingClassifier

# Soft voting averages the predicted class probabilities of all models,
# which is why SVC above is created with probability=True.
voting_classifier = VotingClassifier(estimators=models, voting='soft')
print("Training voting classifier...")

voting_classifier.fit(X_train, y_train)

y_pred2 = voting_classifier.predict(X_test)

# accuracy_score expects (y_true, y_pred)
score = accuracy_score(y_test, y_pred2)
print(f"Voting Ensemble Accuracy Score: {score:.2f}")

Training voting classifier...
[LightGBM] [Info] Number of positive: 485, number of negative: 853
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002760 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4058
[LightGBM] [Info] Number of data points in the train set: 1338, number of used features: 169
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.362481 -> initscore=-0.564611
[LightGBM] [Info] Start training from score -0.564611
Voting Ensemble Accuracy Score: 0.95
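
To illustrate what `voting='soft'` does, the sketch below averages class probabilities across models and compares the result with a simple majority (hard) vote. The probabilities are hypothetical numbers for a single tweet, not output from the models trained above, and the averaging is unweighted.

```python
# Hypothetical per-model probabilities [P(class 0), P(class 1)] for one tweet.
probabilities = [
    [0.90, 0.10],
    [0.45, 0.55],
    [0.45, 0.55],
]

n_models = len(probabilities)
n_classes = len(probabilities[0])

# Soft voting: average each class probability across models, then take the argmax.
mean_proba = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
soft_vote = max(range(n_classes), key=lambda c: mean_proba[c])

# Hard voting for comparison: each model casts one vote for its top class.
votes = [max(range(n_classes), key=lambda c: p[c]) for p in probabilities]
hard_vote = max(set(votes), key=votes.count)

print(mean_proba, soft_vote, hard_vote)
```

In this made-up case one very confident model outweighs two mildly confident ones under soft voting, while hard voting would follow the two-model majority.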
