This script explores different techniques for building a model to classify text data into suicidal or non-suicidal categories. Here's a summary of the key steps involved:

**Data Loading and Preprocessing:**
* Reduces the dataset size to 10,000 entries for faster processing.
* Converts all text to lowercase.
* Removes punctuation marks.
* Tokenizes the text data by splitting it into individual words.
* Removes stop words (common words like "the", "a", "an") from the vocabulary.
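
The preprocessing steps listed above can be sketched without any external dependency. This is a minimal illustration only: the `preprocess` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook itself uses NLTK's `word_tokenize` and its English stop-word list), and whitespace splitting is a simplification of real tokenization.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "i", "is", "to", "and"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # keep only word characters and whitespace
    tokens = text.split()                # whitespace tokenizer (stand-in for word_tokenize)
    return [tok for tok in tokens if tok not in STOP_WORDS]

result = preprocess("I don't want to go, the END is near!")
print(result)  # ['dont', 'want', 'go', 'end', 'near']
```

Note that stripping punctuation turns "don't" into "dont", which is why the vocabulary lookups later in the notebook search for 'dont'.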

**Data Splitting:**

Splits the preprocessed data into training and testing sets using an 80:20 ratio.
The training set is used to train the model, and the testing set is used to evaluate the model's performance.

**Feature Extraction with CountVectorizer:**

CountVectorizer is used to transform text data into numerical features.
It counts the occurrences of each word in the text data.
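
As a small illustration of what those counts look like, the sketch below fits a CountVectorizer on three made-up sentences (the sentences are assumptions for the example, not rows from the dataset) and prints the learned vocabulary next to the document-term matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents, used only to show the shape of the output.
docs = ["help me please", "please please help", "bored right now"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Order the vocabulary by column index so it lines up with the matrix columns.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
matrix = counts.toarray().tolist()

print(vocab)   # ['bored', 'help', 'me', 'now', 'please', 'right']
print(matrix)  # [[0, 1, 1, 0, 1, 0], [0, 1, 0, 0, 2, 0], [1, 0, 0, 1, 0, 1]]
```

Each column corresponds to one vocabulary entry, so the second row records that "please" occurs twice in the second document.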

**Experiments with different n-gram options:**
* Unigrams (single words)
* Bigrams (word pairs)
* Trigrams (three-word sequences)
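
To make the three options concrete, the following sketch builds word n-grams by hand from a short token list. The `word_ngrams` helper is hypothetical and only mirrors the idea; in the notebook the n-grams are produced by CountVectorizer's `ngram_range` argument.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-word sequences, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["dont", "want", "anyone", "know"]

unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

print(unigrams)  # ['dont', 'want', 'anyone', 'know']
print(bigrams)   # ['dont want', 'want anyone', 'anyone know']
print(trigrams)  # ['dont want anyone', 'want anyone know']
```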

**Model Building and Evaluation:**

Evaluates different machine learning models for classifying posts as suicidal or non-suicidal:
1. Support Vector Machine (SVM)
1. Multinomial Naive Bayes (NB)
1. Decision Tree (DT)
1. K-Nearest Neighbors (KNN)

> Uses pipelines to chain feature extraction (CountVectorizer), a TF-IDF transformation, and the machine learning model. Each model is trained on the training set and evaluated on the testing set using the accuracy score. Multinomial Naive Bayes is additionally run on raw counts without TF-IDF.
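
The pipeline pattern can be sketched in a compact loop. The example below is an assumption-laden illustration: the texts and labels are invented, `build_pipeline` is a hypothetical helper, and Multinomial Naive Bayes is used only because it is the fastest of the four models to fit. Accuracies on six toy sentences are meaningless, so none are quoted.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data, only to make the sketch runnable.
train_texts = [
    "i want to die tonight", "i cannot go on anymore", "nobody would miss me anyway",
    "anyone want to chat tonight", "i got my first reward", "what game should i play",
]
train_labels = ["suicide"] * 3 + ["non-suicide"] * 3
test_texts = ["i want to die", "anyone want to chat"]
test_labels = ["suicide", "non-suicide"]

def build_pipeline(n):
    """Counts of n-word sequences -> TF-IDF weighting -> classifier."""
    return Pipeline([
        ("vect", CountVectorizer(ngram_range=(n, n))),
        ("tfidf", TfidfTransformer()),
        ("clf", MultinomialNB()),
    ])

results = {}
for n in (1, 2, 3):
    pipe = build_pipeline(n)
    pipe.fit(train_texts, train_labels)  # fits the vectorizer and the model together
    results[n] = accuracy_score(test_labels, pipe.predict(test_texts))

print(results)
```

Because the vectorizer lives inside the pipeline, the test texts are transformed with the vocabulary learned from the training texts, which avoids leaking test data into feature extraction.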

Overall, the script explores various feature engineering and machine learning techniques to identify the best approach for classifying text data into suicidal and non-suicidal categories.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re,json,nltk
import string
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,accuracy_score,precision_score,recall_score,f1_score
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import FunctionTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [4]:
# df = pd.read_csv('Suicide_Detection.csv')
df = pd.read_csv('../input/suicide-watch/Suicide_Detection.csv')

df = df.iloc[:10000, :]
df

Unnamed: 0.1,Unnamed: 0,text,class
0,2,Ex Wife Threatening SuicideRecently I left my ...,suicide
1,3,Am I weird I don't get affected by compliments...,non-suicide
2,4,Finally 2020 is almost over... So I can never ...,non-suicide
3,8,i need helpjust help me im crying so hard,suicide
4,9,"I’m so lostHello, my name is Adam (16) and I’v...",suicide
...,...,...,...
9995,15026,Help me plz I got my first reward aka the gold...,non-suicide
9996,15028,Anyone wanna chat? \n\nIm a bit bored right n...,non-suicide
9997,15029,I’m on a bridgeI don’t want to die but right n...,suicide
9998,15030,This is serious My dad just turned gay anyone ...,non-suicide


In [5]:
print(f"Data shape {df.shape}")
print(f"Number of Unique Elements: {len(df['text'].unique())}")

Data shape (10000, 3)
Number of Unique Elements: 10000


The first 10,000 rows of the dataset are selected here; after preprocessing they are divided into a training set and a testing set.


# Data Preprocessing

In [6]:
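# make all the text in the dataframe lowercase (left disabled in this first pass)
# df['text'] = df['text'].apply(lambda x: x.lower())

# remove punctuation (regex=True is needed on pandas >= 2.0, where str.replace matches literally by default)
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)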

# tokenize the text
df['text'] = df['text'].apply(lambda x: word_tokenize(x))

# remove stop words
stop_words = set(stopwords.words("english"))
df['text'] = df['text'].apply(lambda x: [word for word in x if word not in stop_words])

# save the preprocessed dataframe
df.to_csv('preprocessed_data.csv', index=False)

To split the data in an 80:20 ratio, set the test_size argument to 0.2.
The random_state argument seeds the random number generator used for shuffling, so that the same split is produced every time the code is run. The call in the next cell does not pass random_state, so the split, and therefore the reported accuracies, will vary from run to run.
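
A minimal sketch of a reproducible split follows. The seed value 42 is arbitrary, the ten placeholder texts are invented, and `stratify` is an addition not used in the notebook; it keeps the class proportions equal in both subsets.

```python
from sklearn.model_selection import train_test_split

# Invented placeholder data: ten posts with perfectly balanced labels.
texts = [f"post {i}" for i in range(10)]
labels = ["suicide", "non-suicide"] * 5

# Same arguments, same seed: the two calls must return identical splits.
split_a = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)
split_b = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)

x_train, x_test, y_train, y_test = split_a
print(len(x_train), len(x_test))  # 8 2
print(split_a == split_b)         # True
```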

In [7]:
X = df.drop(labels=['class', 'Unnamed: 0' ], axis=1)
y = df['class']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

In [8]:
print("Shape of x_train:", x_train.shape)
print("Head of x_train:")
print(x_train.head())

print("\nShape of y_train:", y_train.shape)
print("Head of y_train:")
print(y_train.head())

print("\nShape of x_test:", x_test.shape)
print("Head of x_test:")
print(x_test.head())

print("\nShape of y_test:", y_test.shape)
print("Head of y_test:")
print(y_test.head())

Shape of x_train: (8000, 1)
Head of x_train:
                                                   text
2189  [I, joined, sub, I, turned, 13, October, 8, .,...
3354  [OK, GUYS, GUYS, GUYS, HEAR, ME, OUT, #, 🐽, I,...
3745  [3, weeks, ago, today, I, psych, ward.3, weeks...
7327  [Is, dick, size, normal, ?, 3.5, inches, ., Im...
8249  [I, want, change, educational, system, country...

Shape of y_train: (8000,)
Head of y_train:
2189    non-suicide
3354    non-suicide
3745        suicide
7327    non-suicide
8249    non-suicide
Name: class, dtype: object

Shape of x_test: (2000, 1)
Head of x_test:
                                                   text
3474  [Anyone, wan, na, chat, ?, I, need, someone, I...
1407  [How, guys, feel, going, dates, without, knowi...
5255  [I, hate, much, ,, existing, painful, ., I, wa...
8281  [I, ’, sure, ’, feel, right, labeling, introve...
2175  [One, Topic, best, ., [, link, ], (, https, :,...

Shape of y_test: (2000,)
Head of y_test:
3474    non-suicide
1407  

Perform feature extraction on the training data. CountVectorizer transforms each text into a vector of word counts over the vocabulary learned from the whole training corpus.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer object
count_vect = CountVectorizer()

# Transform the training data using the fit_transform method
x_train_text = [' '.join(text) for text in x_train['text']]
x_train_counts = count_vect.fit_transform(x_train_text)

# x_train_counts = count_vect.fit_transform(x_train['text'])

# Print the shape of the transformed data
print(x_train_counts.shape)

# Get the index of the word "dont" in the vocabulary
print(f"Index of the word 'dont' in the vocabulary: {count_vect.vocabulary_.get(u'dont')}")

(8000, 25481)
Index of the word 'dont' in the vocabulary: 7290


In [10]:
# Initialize the CountVectorizer object with ngram_range=(1, 1) for uni-gram
count_vect_uni = CountVectorizer(ngram_range=(1, 1))
x_train_counts_uni = count_vect_uni.fit_transform(x_train_text)
# Print the shape of the transformed data for uni-gram
print(f"Uni-gram shape: {x_train_counts_uni.shape}")
# Get the index of the word "dont" in the uni-gram vocabulary
print(f"Index of the word 'dont' in the uni-gram vocabulary: {count_vect_uni.vocabulary_.get(u'dont')}")

Uni-gram shape: (8000, 25481)
Index of the word 'dont' in the uni-gram vocabulary: 7290


In [11]:
# Initialize the CountVectorizer object with ngram_range=(2, 2) for bi-gram
count_vect_bi = CountVectorizer(ngram_range=(2, 2), min_df=1)
x_train_counts_bi = count_vect_bi.fit_transform(x_train_text)
# Print the shape of the transformed data for bi-gram
print(f"Bi-gram shape: {x_train_counts_bi.shape}")
print(f"Index of the word 'dont' in the bi-gram vocabulary: {count_vect_bi.vocabulary_.get(u'dont')}")
print(f"Index of the word 'dont want' in the bi-gram vocabulary: {count_vect_bi.vocabulary_.get(u'dont want')}")

Bi-gram shape: (8000, 310818)
Index of the word 'dont' in the bi-gram vocabulary: None
Index of the word 'dont want' in the bi-gram vocabulary: 72129
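
The lookup for 'dont' returns None because `ngram_range=(2, 2)` keeps only two-word sequences, so single words never enter the vocabulary. The sketch below, on two invented sentences, contrasts that with `ngram_range=(1, 2)`, which keeps both unigrams and bigrams; the notebook does not use this combined setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented documents, already lowercased and stripped of punctuation.
docs = ["dont want anyone know", "dont want help"]

only_bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)
uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

bigram_vocab = sorted(only_bigrams.vocabulary_)
print(bigram_vocab)                          # ['anyone know', 'dont want', 'want anyone', 'want help']
print(only_bigrams.vocabulary_.get("dont"))  # None
print("dont" in uni_and_bi.vocabulary_)      # True
print("dont want" in uni_and_bi.vocabulary_) # True
```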


In [12]:
# Initialize the CountVectorizer object with ngram_range=(3, 3) for tri-gram
count_vect_tri = CountVectorizer(ngram_range=(3, 3))
x_train_counts_tri = count_vect_tri.fit_transform(x_train_text)
print(f"Tri-gram shape: {x_train_counts_tri.shape}")
print(f"Index of the word 'dont' in the tri-gram vocabulary: {count_vect_tri.vocabulary_.get(u'dont')}")
print(f"Index of the word 'dont want' in the tri-gram vocabulary: {count_vect_tri.vocabulary_.get(u'dont want')}")
print(f"Index of the word 'dont want anyone' in the tri-gram vocabulary: {count_vect_tri.vocabulary_.get(u'dont want anyone')}")

Tri-gram shape: (8000, 463357)
Index of the word 'dont' in the tri-gram vocabulary: None
Index of the word 'dont want' in the tri-gram vocabulary: None
Index of the word 'dont want anyone' in the tri-gram vocabulary: 98448


# N-grams (uni-gram, bi-gram, tri-gram) with CountVectorizer, TF-IDF, and SVM

In [13]:
# Define the pipeline
pipeline_uni = Pipeline([
    ('count_vect', CountVectorizer(ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer()),
    ('svm', SVC())
])

pipeline_bi = Pipeline([
    ('count_vect', CountVectorizer(ngram_range=(2, 2), min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('svm', SVC())
])

pipeline_tri = Pipeline([
    ('count_vect', CountVectorizer(ngram_range=(3, 3))),
    ('tfidf', TfidfTransformer()),
    ('svm', SVC())
])

In [14]:
# Fit the pipeline to the training data
pipeline_uni.fit(x_train_text, y_train)
pipeline_bi.fit(x_train_text, y_train)
pipeline_tri.fit(x_train_text, y_train)

In [15]:
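# Predict on the test data; the pipelines expect raw strings, so join each token list first
y_pred_uni = pipeline_uni.predict([' '.join(text) for text in x_test['text']])
y_pred_bi = pipeline_bi.predict([' '.join(text) for text in x_test['text']])
y_pred_tri = pipeline_tri.predict([' '.join(text) for text in x_test['text']])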

In [16]:
# y_test is already a Series holding only the class labels, so no column needs to be dropped
y_test

3474    non-suicide
1407    non-suicide
5255        suicide
8281    non-suicide
2175    non-suicide
           ...     
7611    non-suicide
1922    non-suicide
9035    non-suicide
6608    non-suicide
3808    non-suicide
Name: class, Length: 2000, dtype: object

In [17]:
# Join the words of each text in the test data to make a single string
x_test_text = [' '.join(text) for text in x_test['text']]

In [18]:
# Make predictions on the test data using the uni-gram pipeline
y_pred_uni_SVM = pipeline_uni.predict(x_test_text)
# Make predictions on the test data using the bi-gram pipeline
y_pred_bi_SVM = pipeline_bi.predict(x_test_text)
# Make predictions on the test data using the tri-gram pipeline
y_pred_tri_SVM = pipeline_tri.predict(x_test_text)

In [19]:
# Compute the accuracy for each pipeline
accuracy_uni_SVM = accuracy_score(y_test, y_pred_uni_SVM)
accuracy_bi_SVM = accuracy_score(y_test, y_pred_bi_SVM)
accuracy_tri_SVM = accuracy_score(y_test, y_pred_tri_SVM)

# Print the accuracy for each pipeline
print(f"Uni-gram accuracy: {accuracy_uni_SVM:.2f}")
print(f"Bi-gram accuracy: {accuracy_bi_SVM:.2f}")
print(f"Tri-gram accuracy: {accuracy_tri_SVM:.2f}")

Uni-gram accuracy: 0.92
Bi-gram accuracy: 0.85
Tri-gram accuracy: 0.79
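
Accuracy alone hides which kind of error a model makes, and the notebook already imports `precision_score` and `recall_score` without using them. The sketch below shows how they could be computed for the "suicide" class; the five labels and predictions are invented for illustration and are not results from the models above.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented labels and predictions, only to show the calls.
y_true = ["suicide", "suicide", "non-suicide", "non-suicide", "suicide"]
y_pred = ["suicide", "non-suicide", "non-suicide", "suicide", "suicide"]

acc = accuracy_score(y_true, y_pred)
# pos_label names the class treated as "positive" when labels are strings.
prec = precision_score(y_true, y_pred, pos_label="suicide")
rec = recall_score(y_true, y_pred, pos_label="suicide")

print(f"accuracy:  {acc:.2f}")   # 0.60
print(f"precision: {prec:.2f}")  # 0.67
print(f"recall:    {rec:.2f}")   # 0.67
```

For this task recall on the "suicide" class is arguably the more important figure, since a missed at-risk post is costlier than a false alarm.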


# N-grams with CountVectorizer, TF-IDF, and Multinomial Naive Bayes

## Data Preprocessing and 80:20 split

In [20]:
df = pd.read_csv('../input/suicide-watch/Suicide_Detection.csv')
df = df.iloc[:10000, :]

# make all the text in the dataframe lowercase
df['text'] = df['text'].apply(lambda x: x.lower())

# remove punctuation
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)

# tokenize the text
df['text'] = df['text'].apply(lambda x: word_tokenize(x))

# remove stop words
stop_words = set(stopwords.words("english"))
df['text'] = df['text'].apply(lambda x: [word for word in x if word not in stop_words])

# save the preprocessed dataframe
df.to_csv('preprocessed_data.csv', index=False)

X = df.drop(labels=['class', 'Unnamed: 0' ], axis=1)
y = df['class']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

In [21]:
# Join the words of each text in the data to make a single string
x_train_text = [' '.join(text) for text in x_train['text']]
x_test_text = [' '.join(text) for text in x_test['text']]

In [22]:
# Apply TF-IDF, CountVectorizer, and Naive Bayes using pipeline
pipeline_uni_NB = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,1))),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])

pipeline_bi_NB = Pipeline([
    ('vect', CountVectorizer(ngram_range=(2,2))),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])

pipeline_tri_NB = Pipeline([
    ('vect', CountVectorizer(ngram_range=(3,3))),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])

In [23]:
# Fit the pipeline to the training data
pipeline_uni_NB.fit(x_train_text, y_train)
pipeline_bi_NB.fit(x_train_text, y_train)
pipeline_tri_NB.fit(x_train_text, y_train)

In [24]:
# Predict the classes of the test data
y_pred_uni_NB = pipeline_uni_NB.predict(x_test_text)
y_pred_bi_NB = pipeline_bi_NB.predict(x_test_text)
y_pred_tri_NB = pipeline_tri_NB.predict(x_test_text)

In [25]:
# Evaluate the accuracy of the model
accuracy_uni_NB = accuracy_score(y_test, y_pred_uni_NB)
accuracy_bi_NB = accuracy_score(y_test, y_pred_bi_NB)
accuracy_tri_NB = accuracy_score(y_test, y_pred_tri_NB)

print("Accuracy:", accuracy_uni_NB)
print("Accuracy:", accuracy_bi_NB)
print("Accuracy:", accuracy_tri_NB)

Accuracy: 0.818
Accuracy: 0.7685
Accuracy: 0.8115


## N-grams with CountVectorizer and Multinomial Naive Bayes (without TF-IDF)

In [26]:
# Initialize the CountVectorizer object with ngram_range=(1, 1) for uni-gram
count_vect_uni_NB = CountVectorizer(ngram_range=(1, 1))
x_train_counts_uni_NB = count_vect_uni_NB.fit_transform(x_train_text)
# Print the shape of the transformed data for uni-gram
print(f"Uni-gram shape: {x_train_counts_uni_NB.shape}")
# Get the index of the word "dont" in the uni-gram vocabulary
print(f"Index of the word 'dont' in the uni-gram vocabulary: {count_vect_uni_NB.vocabulary_.get(u'dont')}")

Uni-gram shape: (8000, 25484)
Index of the word 'dont' in the uni-gram vocabulary: 7245


In [27]:
# Initialize the CountVectorizer object with ngram_range=(2, 2) for bi-gram
count_vect_bi_NB = CountVectorizer(ngram_range=(2, 2), min_df=1)
x_train_counts_bi_NB = count_vect_bi_NB.fit_transform(x_train_text)
# Print the shape of the transformed data for bi-gram
print(f"Bi-gram shape: {x_train_counts_bi_NB.shape}")
print(f"Index of the word 'dont' in the bi-gram vocabulary: {count_vect_bi_NB.vocabulary_.get(u'dont')}")
print(f"Index of the word 'dont want' in the bi-gram vocabulary: {count_vect_bi_NB.vocabulary_.get(u'dont want')}")

Bi-gram shape: (8000, 301048)
Index of the word 'dont' in the bi-gram vocabulary: None
Index of the word 'dont want' in the bi-gram vocabulary: 70142


In [28]:
# Initialize the CountVectorizer object with ngram_range=(3, 3) for tri-gram
count_vect_tri_NB = CountVectorizer(ngram_range=(3, 3))
x_train_counts_tri_NB = count_vect_tri_NB.fit_transform(x_train_text)
print(f"Tri-gram shape: {x_train_counts_tri_NB.shape}")
print(f"Index of the word 'dont' in the tri-gram vocabulary: {count_vect_tri_NB.vocabulary_.get(u'dont')}")
print(f"Index of the word 'dont want' in the tri-gram vocabulary: {count_vect_tri_NB.vocabulary_.get(u'dont want')}")
print(f"Index of the word 'dont want anyone' in the tri-gram vocabulary: {count_vect_tri_NB.vocabulary_.get(u'dont want anyone')}")

Tri-gram shape: (8000, 443659)
Index of the word 'dont' in the tri-gram vocabulary: None
Index of the word 'dont want' in the tri-gram vocabulary: None
Index of the word 'dont want anyone' in the tri-gram vocabulary: 94579


In [29]:
from sklearn.naive_bayes import MultinomialNB

# Initialize the Naive Bayes model for uni-gram
NB_uni = MultinomialNB()
NB_uni.fit(x_train_counts_uni_NB, y_train)

# Initialize the Naive Bayes model for bi-gram
NB_bi = MultinomialNB()
NB_bi.fit(x_train_counts_bi_NB, y_train)

# Initialize the Naive Bayes model for tri-gram
NB_tri = MultinomialNB()
NB_tri.fit(x_train_counts_tri_NB, y_train)


In [30]:
# Transform the test data into feature representations
x_test_text = [' '.join(text) for text in x_test['text']]
x_test_counts_uni_NB = count_vect_uni_NB.transform(x_test_text)
x_test_counts_bi_NB = count_vect_bi_NB.transform(x_test_text)
x_test_counts_tri_NB = count_vect_tri_NB.transform(x_test_text)

In [31]:
# Make predictions using the Naive Bayes models
y_pred_uni = NB_uni.predict(x_test_counts_uni_NB)
y_pred_bi = NB_bi.predict(x_test_counts_bi_NB)
y_pred_tri = NB_tri.predict(x_test_counts_tri_NB)

In [32]:
# Calculate the accuracy of the models
accuracy_uni = accuracy_score(y_test, y_pred_uni)
accuracy_bi = accuracy_score(y_test, y_pred_bi)
accuracy_tri = accuracy_score(y_test, y_pred_tri)

# Print the accuracy of the models
print(f"Accuracy of uni-gram model: {accuracy_uni}")
print(f"Accuracy of bi-gram model: {accuracy_bi}")
print(f"Accuracy of tri-gram model: {accuracy_tri}")

Accuracy of uni-gram model: 0.856
Accuracy of bi-gram model: 0.7575
Accuracy of tri-gram model: 0.8105
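
The only difference between this run and the pipeline version is the TF-IDF step. As a rough sketch of what that step does to a count matrix, the example below reproduces TfidfTransformer's default weighting by hand (smoothed IDF followed by L2 row normalization) on an invented 3x3 count matrix and checks it against scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Invented counts: 3 documents (rows) x 3 terms (columns).
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [0, 1, 3]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Default smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
# Default norm='l2': scale each row to unit Euclidean length.
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfTransformer().fit_transform(counts).toarray()
matches = bool(np.allclose(manual, sk))
print(matches)  # True
```

The middle term appears in every document, so it receives the lowest IDF weight; terms that are rare across documents are weighted up.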


# N-grams with CountVectorizer, TF-IDF, and Decision Tree
## Data Preprocessing

In [33]:
df = pd.read_csv('../input/suicide-watch/Suicide_Detection.csv')
df = df.iloc[:10000, :]

# make all the text in the dataframe lowercase
df['text'] = df['text'].apply(lambda x: x.lower())

# remove punctuation
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)

# tokenize the text
df['text'] = df['text'].apply(lambda x: word_tokenize(x))

# remove stop words
stop_words = set(stopwords.words("english"))
df['text'] = df['text'].apply(lambda x: [word for word in x if word not in stop_words])

# save the preprocessed dataframe
df.to_csv('preprocessed_data.csv', index=False)

X = df.drop(labels=['class', 'Unnamed: 0' ], axis=1)
y = df['class']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

In [34]:
# Join the words of each text in the data to make a single string
x_train_text = [' '.join(text) for text in x_train['text']]
x_test_text = [' '.join(text) for text in x_test['text']]

In [35]:
# Apply CountVectorizer, TF-IDF, and Decision Tree using a pipeline
pipeline_uni_DT = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,1))),
    ('tfidf', TfidfTransformer()),
    ('dt', DecisionTreeClassifier()),
])

pipeline_bi_DT = Pipeline([
    ('vect', CountVectorizer(ngram_range=(2,2))),
    ('tfidf', TfidfTransformer()),
    ('dt', DecisionTreeClassifier()),
])

pipeline_tri_DT = Pipeline([
    ('vect', CountVectorizer(ngram_range=(3,3))),
    ('tfidf', TfidfTransformer()),
    ('dt', DecisionTreeClassifier()),
])

In [36]:
# Fit the pipeline to the training data
pipeline_uni_DT.fit(x_train_text, y_train)
pipeline_bi_DT.fit(x_train_text, y_train)
pipeline_tri_DT.fit(x_train_text, y_train)

In [37]:
# Predict the classes of the test data
y_pred_uni_DT = pipeline_uni_DT.predict(x_test_text)
y_pred_bi_DT = pipeline_bi_DT.predict(x_test_text)
y_pred_tri_DT = pipeline_tri_DT.predict(x_test_text)

In [38]:
# Evaluate the accuracy of the model
accuracy_uni_DT = accuracy_score(y_test, y_pred_uni_DT)
accuracy_bi_DT = accuracy_score(y_test, y_pred_bi_DT)
accuracy_tri_DT = accuracy_score(y_test, y_pred_tri_DT)

print("Accuracy:", accuracy_uni_DT)
print("Accuracy:", accuracy_bi_DT)
print("Accuracy:", accuracy_tri_DT)

Accuracy: 0.8335
Accuracy: 0.7915
Accuracy: 0.699


# N-grams with CountVectorizer, TF-IDF, and KNN

In [39]:
pipeline_unigram = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,1))),
    ('tfidf', TfidfTransformer()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

pipeline_bigram = Pipeline([
    ('vect', CountVectorizer(ngram_range=(2,2))),
    ('tfidf', TfidfTransformer()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

pipeline_trigram = Pipeline([
    ('vect', CountVectorizer(ngram_range=(3,3))),
    ('tfidf', TfidfTransformer()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Fit the pipelines to the training data
pipeline_unigram.fit(x_train_text, y_train)
pipeline_bigram.fit(x_train_text, y_train)
pipeline_trigram.fit(x_train_text, y_train)

# Predict the classes of the test data
y_pred_unigram = pipeline_unigram.predict(x_test_text)
y_pred_bigram = pipeline_bigram.predict(x_test_text)
y_pred_trigram = pipeline_trigram.predict(x_test_text)

# Calculate the accuracy scores for each pipeline
acc_unigram = accuracy_score(y_test, y_pred_unigram)
acc_bigram = accuracy_score(y_test, y_pred_bigram)
acc_trigram = accuracy_score(y_test, y_pred_trigram)

print("Accuracy of unigram model:", acc_unigram)
print("Accuracy of bigram model:", acc_bigram)
print("Accuracy of trigram model:", acc_trigram)

Accuracy of unigram model: 0.842
Accuracy of bigram model: 0.6855
Accuracy of trigram model: 0.5145
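
The accuracies printed throughout the notebook can be gathered into one structure for a side-by-side view. The numbers below are copied from the recorded outputs; the dictionary layout and the labels are just one possible arrangement. Because the sections use different unseeded train/test splits, gaps of a few points between models may be noise rather than real differences.

```python
# Accuracies as recorded in the notebook outputs (SVM values were printed to 2 decimals).
reported = {
    "SVM + TF-IDF": {"unigram": 0.92, "bigram": 0.85, "trigram": 0.79},
    "Naive Bayes + TF-IDF": {"unigram": 0.818, "bigram": 0.7685, "trigram": 0.8115},
    "Naive Bayes (counts only)": {"unigram": 0.856, "bigram": 0.7575, "trigram": 0.8105},
    "Decision Tree + TF-IDF": {"unigram": 0.8335, "bigram": 0.7915, "trigram": 0.699},
    "KNN + TF-IDF": {"unigram": 0.842, "bigram": 0.6855, "trigram": 0.5145},
}

# Best n-gram setting per model.
best_per_model = {model: max(scores, key=scores.get) for model, scores in reported.items()}
for model, ngram in best_per_model.items():
    print(f"{model:<26} best: {ngram:<8} ({reported[model][ngram]:.4f})")

# Model with the single highest recorded accuracy.
best_model = max(reported, key=lambda m: max(reported[m].values()))
print("Highest recorded accuracy:", best_model)  # SVM + TF-IDF
```

In every recorded run the unigram features score highest, and the SVM pipeline with unigrams has the highest recorded accuracy overall.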
