# Sentiment Analysis Using scikit-learn

## Imports

In [67]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
import re  
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelBinarizer

## Exploring Data

### Dataset Glossary

This dataset is a collection of statements, each tagged with a mental health status. It combines raw data from multiple sources, cleaned and compiled for developing chatbots and performing sentiment analysis.

Data Source:

The dataset integrates information from the following Kaggle datasets:

    3k Conversations Dataset for Chatbot
    Depression Reddit Cleaned
    Human Stress Prediction
    Predicting Anxiety in Mental Health Data
    Mental Health Dataset Bipolar
    Reddit Mental Health Data
    Students Anxiety and Depression Dataset
    Suicidal Mental Health Dataset
    Suicidal Tweet Detection Dataset

Data Overview:

The dataset consists of statements tagged with one of the following seven mental health statuses:

    Normal
    Depression
    Suicidal
    Anxiety
    Stress
    Bipolar
    Personality disorder

Data Collection:

The data comes from social media platforms, including Reddit and Twitter, among others. Each entry is tagged with a specific mental health status, which makes the dataset useful for:

    Developing intelligent mental health chatbots.
    Performing in-depth sentiment analysis.
    Research and studies related to mental health trends.

Features:

    unique_id: A unique identifier for each entry (stored as the unnamed first column of the CSV and dropped after loading).
    statement: The text of the post.
    status: The mental health status tagged for the statement.

Usage:

This dataset is suited to training machine learning models that predict a mental health status from text. It can be used in applications such as:

    Chatbot development for mental health support.
    Sentiment analysis to gauge mental health trends.
    Academic research on mental health patterns.

Acknowledgments:

This dataset was created by aggregating and cleaning data from various publicly available datasets on Kaggle. Special thanks to the original dataset creators for their contributions.

### Data Loading

In [68]:
df = pd.read_csv("../datasets/health.csv")
# Drop the index column that was written into the CSV
df.drop(columns=["Unnamed: 0"], inplace=True)

# df = df.dropna()
# df.isna().sum()
df.head(11)

,statement,status
0,oh my gosh,Anxiety
1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,I've shifted my focus to something else but I'...,Anxiety
4,"I'm restless and restless, it's been a month n...",Anxiety
5,"every break, you must be nervous, like somethi...",Anxiety
6,"I feel scared, anxious, what can I do? And may...",Anxiety
7,Have you ever felt nervous but didn't know why?,Anxiety
8,"I haven't slept well for 2 days, it's like I'm...",Anxiety
9,"I'm really worried, I want to cry.",Anxiety


In [69]:
df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53043 entries, 0 to 53042
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   statement  52681 non-null  object
 1   status     53043 non-null  object
dtypes: object(2)
memory usage: 828.9+ KB


,statement,status
count,52681,53043
unique,51073,7
top,what do you mean?,Normal
freq,22,16351
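
The `info()` output above reports 52,681 non-null statements out of 53,043 rows, so 362 statements are missing, and `describe()` shows `Normal` as the most frequent of the 7 labels (16,351 rows, about 31%). A minimal sketch of how both could be checked directly is shown below. The toy rows are invented for illustration and are not taken from `health.csv`.

```python
import pandas as pd

# Toy stand-in for the real data; these rows are made up for illustration.
toy = pd.DataFrame({
    "statement": ["oh my gosh", None, "I feel fine", "what do you mean?"],
    "status": ["Anxiety", "Anxiety", "Normal", "Normal"],
})

# Number of missing values in each column
missing = toy.isna().sum()
print(missing)

# Share of rows per label, a quick view of class imbalance
label_share = toy["status"].value_counts(normalize=True)
print(label_share)
```

On the real frame, the same two calls on `df` would show which labels dominate before any modelling decisions are made.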


### Data Preprocessing

In [75]:
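def preprocess_text(text):
    """Lowercase, strip unwanted characters, and normalize whitespace."""
    text = str(text).lower()
    # Keep only letters, whitespace, and the punctuation marks ! ? .
    text = re.sub(r'[^a-z\s!?.]', '', text)
    # Collapse runs of whitespace, including gaps left by removed tokens
    text = ' '.join(text.split())
    return text


# Fill missing statements first; otherwise str(NaN) becomes the text "nan"
df['statement'] = df['statement'].fillna('')
df['statement'] = df['statement'].apply(preprocess_text)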
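
To make the effect of the character filter concrete, the sketch below applies the same regular expression to two lowercased sample strings. It is only an illustration: the sample strings and the variable names (`keep_pattern`, `filtered`, `gapped`, `collapsed`) are assumptions, not part of the notebook.

```python
import re

# Same character class as in preprocess_text: keep letters, whitespace, ! ? .
keep_pattern = re.compile(r'[^a-zA-Z\s!?.]')

# Apostrophes and commas are dropped, so "i'm" becomes "im"
filtered = keep_pattern.sub('', "i'm really worried, i want to cry.")
print(repr(filtered))

# Removing a standalone token such as a digit leaves a double space behind
gapped = keep_pattern.sub('', "i haven't slept well for 2 days!")
print(repr(gapped))

# Collapsing whitespace after the filter removes that gap
collapsed = ' '.join(gapped.split())
print(repr(collapsed))
```

Note that contractions lose their apostrophes and all digits disappear, which may or may not be desirable depending on how much signal those tokens carry.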

### Splitting Data and Calculating TF-IDF (Weighted Words)

In [71]:
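X = df['statement']  # Input text
y = df['status']     # Target labels

# Per-class weights for handling imbalance, using the "balanced" heuristic
# n_samples / (n_classes * class_count).
# Note: these weights are not passed to the pipeline in this notebook;
# MultinomialNB has no class_weight parameter.
class_counts = df['status'].value_counts()
total_samples = len(df)
class_weights = {class_: total_samples / (len(class_counts) * count)
                 for class_, count in class_counts.items()}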
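
`MultinomialNB` does not accept a `class_weight` argument, so per-class weights only take effect if they are converted to per-sample weights and passed to `fit`. The sketch below shows one possible way to do that with scikit-learn's `compute_sample_weight` and the `clf__sample_weight` fit parameter of a `Pipeline`. It is an assumption about how the weights could be used, not what this notebook does, and the toy texts and labels are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_sample_weight

# Invented toy data: 4 "Normal" rows and 2 "Stress" rows
texts = [
    "had a nice walk today",
    "dinner with friends was fun",
    "watching a movie tonight",
    "what do you mean?",
    "deadlines are crushing me",
    "too much pressure at work",
]
labels = ["Normal"] * 4 + ["Stress"] * 2

# "balanced" gives each sample n_samples / (n_classes * count of its class)
weights = compute_sample_weight(class_weight="balanced", y=labels)
print(weights)

weighted_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB(alpha=0.1)),
])

# The "clf__" prefix routes sample_weight to the step named "clf"
weighted_clf.fit(texts, labels, clf__sample_weight=weights)

# With balanced weights the learned class priors come out equal
priors = np.exp(weighted_clf.named_steps["clf"].class_log_prior_)
print(priors)
```

Because `fit_prior=True` derives priors from the weighted class counts, balanced weights flatten the priors. That typically trades some accuracy on the majority classes for better recall on the rare ones.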

In [72]:

# Build the pipeline: TF-IDF features followed by multinomial Naive Bayes
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=10000,         # Keep the 10,000 most frequent terms
        ngram_range=(1, 3),         # Unigrams, bigrams, and trigrams
        stop_words='english',
        min_df=2,                   # Drop terms found in fewer than 2 documents
        max_df=0.95,                # Drop terms found in more than 95% of documents
        sublinear_tf=True,          # Use 1 + log(tf) instead of raw tf
        use_idf=True,
        smooth_idf=True
    )),
    ('clf', MultinomialNB(
        alpha=0.1,                  # Smoothing parameter
        fit_prior=True,
        class_prior=None
    ))
])

# Split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Preserve class proportions in both splits
)

# Fit the pipeline on the training split
text_clf.fit(X_train, y_train)

In [73]:
# Make predictions
y_pred = text_clf.predict(X_test)

# Print metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.71627863135074

Classification Report:
                       precision    recall  f1-score   support

             Anxiety       0.79      0.70      0.74       778
             Bipolar       0.77      0.67      0.72       575
          Depression       0.60      0.76      0.67      3081
              Normal       0.85      0.83      0.84      3270
Personality disorder       0.86      0.38      0.53       240
              Stress       0.67      0.34      0.45       534
            Suicidal       0.69      0.63      0.66      2131

            accuracy                           0.72     10609
           macro avg       0.75      0.62      0.66     10609
        weighted avg       0.73      0.72      0.71     10609
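
The report shows low recall for `Stress` (0.34) and `Personality disorder` (0.38), but not which labels those statements were assigned instead. A confusion matrix answers that question. The sketch below uses a small set of invented labels rather than the notebook's actual predictions; on the real data one would pass `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# Invented true and predicted labels, for illustration only
toy_true = ["Depression", "Depression", "Suicidal", "Suicidal", "Stress", "Stress"]
toy_pred = ["Depression", "Depression", "Depression", "Suicidal", "Depression", "Stress"]
label_order = ["Depression", "Stress", "Suicidal"]

# Rows are true labels, columns are predicted labels, both in label_order
cm = confusion_matrix(toy_true, toy_pred, labels=label_order)
print(cm)

# Per-class recall is the diagonal divided by each row total
recall = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(label_order, recall)))
```

Reading along a row shows where a class's missed statements went. In this toy case, the missed `Stress` and `Suicidal` rows were all predicted as `Depression`.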

