![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

In [49]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Part 3: Implementation

After the previous analysis, we decided to follow the methodology below.

We divide the mental health corpus into two sub-corpora by label:

- Label 1: mentally ill people
- Label 0: healthy people

Steps:

1. Split both corpora into train and test sets.
2. Extract a dictionary from each corpus.
3. Choose how to weight word frequencies, so that we can discard words from the mentally ill corpus that are also frequently mentioned in the healthy corpus.
4. Fit a simple logistic regression to test the accuracy of the generated dictionary.
5. Improve the approach if time allows.


In [50]:
df = pd.read_csv("data/preprocessing/mental_health_preprocessed.csv")
df.head()

| | Unnamed: 0 | text | label | text_no_stopwords | text_stem | text_lemma |
|---|---|---|---|---|---|---|
| 0 | 0 | dear american teens question dutch person hear... | 0 | dear american teens question dutch person hear... | dear american teen question dutch person heard... | dear american teen question dutch person hear ... |
| 1 | 1 | nothing look forward lifei dont many reasons k... | 1 | nothing look forward lifei dont many reasons k... | noth look forward lifei dont mani reason keep ... | look forward lifei not reason going feel like ... |
| 2 | 2 | music recommendations im looking expand playli... | 0 | music recommendations im looking expand playli... | music recommend im look expand playlist usual ... | music recommendation m look expand playlist us... |
| 3 | 3 | im done trying feel betterthe reason im still ... | 1 | im done trying feel betterthe reason im still ... | im done tri feel betterth reason im still aliv... | m try feel betterthe reason m alive know mum d... |
| 4 | 4 | worried year old girl subject domestic physic... | 1 | worried year old girl subject domestic physica... | worri year old girl subject domest physicalmen... | worry year old girl subject domestic physicalm... |


## Split the Corpus into Train & Test Sets

In [51]:
from sklearn.model_selection import train_test_split

# Separate texts based on label
texts_0 = df[df["label"] == 0]["text_lemma"].dropna().tolist()  # Healthy group
texts_1 = df[df["label"] == 1]["text_lemma"].dropna().tolist()  # Mentally ill group

# Split dataset into training and testing sets
train_0, test_0 = train_test_split(texts_0, test_size=0.2, random_state=42)
train_1, test_1 = train_test_split(texts_1, test_size=0.2, random_state=42)

# Convert to DataFrame
train_df = pd.DataFrame({"text": train_0 + train_1, "label": [0] * len(train_0) + [1] * len(train_1)})
test_df = pd.DataFrame({"text": test_0 + test_1, "label": [0] * len(test_0) + [1] * len(test_1)})

# Save to CSV for reference
train_df.to_csv("data/training/train_data.csv", index=False)
test_df.to_csv("data/training/test_data.csv", index=False)

print("Training and test sets created successfully.")

Training and test sets created successfully.
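
A similar split can also be obtained in a single call by passing `stratify` to `train_test_split`, which keeps the class proportions equal in both sets. The sketch below is only an illustration: `toy_df` is a hypothetical stand-in for the real corpus, and the selected rows will differ from those produced by the per-class split above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the preprocessed corpus
toy_df = pd.DataFrame({
    "text_lemma": [f"document {i}" for i in range(20)],
    "label": [0] * 10 + [1] * 10,
})

# stratify keeps the 50/50 label balance in both the train and test sets
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["label"]
)

print(toy_train["label"].value_counts().to_dict())
print(toy_test["label"].value_counts().to_dict())
```

Splitting each class separately, as done above, gives the same balance guarantee; the stratified call mainly saves the step of rebuilding the DataFrames.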


## Extract a Dictionary from Each Corpus

In [52]:
# Separate texts based on label
texts_train_0 = train_df[train_df["label"] == 0]["text"].dropna().tolist()
texts_train_1 = train_df[train_df["label"] == 1]["text"].dropna().tolist()

# Apply TF-IDF separately for each group
vectorizer_0 = TfidfVectorizer(stop_words="english", max_features=30)
tfidf_0 = vectorizer_0.fit_transform(texts_train_0)
words_0 = vectorizer_0.get_feature_names_out()
scores_0 = np.asarray(tfidf_0.mean(axis=0)).flatten()

vectorizer_1 = TfidfVectorizer(stop_words="english", max_features=30)
tfidf_1 = vectorizer_1.fit_transform(texts_train_1)
words_1 = vectorizer_1.get_feature_names_out()
scores_1 = np.asarray(tfidf_1.mean(axis=0)).flatten()

# Create DataFrames with extracted words and scores
df_tfidf_0 = pd.DataFrame({"word": words_0, "score": scores_0}).sort_values(by="score", ascending=False)
df_tfidf_1 = pd.DataFrame({"word": words_1, "score": scores_1}).sort_values(by="score", ascending=False)

# Extract dictionary from each corpus using TF-IDF
dictionary_0 = set(df_tfidf_0["word"])  # Healthy words
dictionary_1 = set(df_tfidf_1["word"])  # Mentally ill words

# Save the dictionaries for future use (sorted so the file order is reproducible)
with open("data/dictionaries/dictionary_healthy.txt", "w") as f:
    f.write("\n".join(sorted(dictionary_0)))

with open("data/dictionaries/dictionary_mentally_ill.txt", "w") as f:
    f.write("\n".join(sorted(dictionary_1)))

print("Dictionaries extracted and saved.")

Dictionaries extracted and saved.


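To discard words that are also frequent in the healthy corpus, we weight each word of the mentally ill dictionary by the ratio of its mean TF-IDF scores in the two corpora, where $\epsilon = 10^{-6}$ avoids division by zero:

$\text{adjusted score} = \frac{\text{TF-IDF in mentally ill}}{\text{TF-IDF in healthy} + \epsilon}$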

In [53]:
# Look up each word's mean TF-IDF score in both classes
scores_by_word_0 = df_tfidf_0.set_index("word")["score"].to_dict()
scores_by_word_1 = df_tfidf_1.set_index("word")["score"].to_dict()

EPSILON = 1e-6  # Avoid division by zero

word_weights = {}
for word in dictionary_1:
    score_1 = scores_by_word_1.get(word, 0.0)
    score_0 = scores_by_word_0.get(word, 0.0)  # 0 if the word is absent from the healthy dictionary
    word_weights[word] = score_1 / (score_0 + EPSILON)

# Sort words by importance (higher values = more relevant to mentally ill class)
# Keep at most 100 words; dictionary_1 holds at most 30 because of max_features=30
filtered_words = sorted(word_weights.items(), key=lambda x: x[1], reverse=True)[:100]

# Save filtered dictionary
with open("data/dictionaries/filtered_dictionary.txt", "w") as f:
    f.write("\n".join([word for word, score in filtered_words]))

print("Filtered dictionary generated.")


df_filtered_words = pd.DataFrame(filtered_words, columns=["Word", "Importance Score"])
df_filtered_words

Filtered dictionary generated.


| | Word | Importance Score |
|---|---|---|
| 0 | try | 85430.850024 |
| 1 | kill | 80732.394717 |
| 2 | help | 75115.802095 |
| 3 | live | 73611.718322 |
| 4 | die | 73238.740831 |
| 5 | end | 69068.97451 |
| 6 | fuck | 67512.827835 |
| 7 | anymore | 65301.134306 |
| 8 | bad | 62129.046782 |
| 9 | need | 60410.81563 |
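
Scores of this magnitude are consistent with words that do not appear in the healthy dictionary at all: their healthy score is 0, so the ratio reduces to the mentally ill score divided by $\epsilon = 10^{-6}$. The sketch below illustrates this with hypothetical TF-IDF values; `adjusted_score` is a helper introduced only for this example.

```python
EPSILON = 1e-6  # same smoothing constant as in the weighting cell


def adjusted_score(tfidf_ill: float, tfidf_healthy: float, eps: float = EPSILON) -> float:
    """Ratio of mean TF-IDF scores, smoothed to avoid division by zero."""
    return tfidf_ill / (tfidf_healthy + eps)


# Hypothetical mean TF-IDF values, chosen only for illustration
absent_from_healthy = adjusted_score(0.08, 0.0)   # word missing from the healthy dictionary
shared_with_healthy = adjusted_score(0.08, 0.04)  # word present in both dictionaries

print(f"absent from healthy: {absent_from_healthy:.1f}")
print(f"shared with healthy: {shared_with_healthy:.3f}")
```

A word shared by both dictionaries gets a score close to the plain ratio of its two TF-IDF means, several orders of magnitude below a word that is absent from the healthy dictionary, so the ranking effectively puts exclusive words first.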


In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Convert train and test text into TF-IDF features using only filtered words
vectorizer = TfidfVectorizer(vocabulary=[word for word, _ in filtered_words])
X_train = vectorizer.fit_transform(train_df["text"])
X_test = vectorizer.transform(test_df["text"])

# Train a logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, train_df["label"])

# Make predictions
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(test_df["label"], y_pred)
print(f"Logistic Regression Accuracy: {accuracy:.4f}")


Logistic Regression Accuracy: 0.8077
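
Accuracy alone does not show which class the errors come from. As a possible complement, a confusion matrix with per-class precision and recall separates missed class 1 texts from false alarms. The sketch below uses hypothetical labels (`y_true_toy`, `y_pred_toy`), not the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration; replace with test_df["label"] and y_pred
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true labels, columns are predicted labels
toy_cm = confusion_matrix(y_true_toy, y_pred_toy)

# Precision and recall for the positive class (label 1)
toy_precision = precision_score(y_true_toy, y_pred_toy)
toy_recall = recall_score(y_true_toy, y_pred_toy)

print(toy_cm)
print(f"precision={toy_precision:.3f}, recall={toy_recall:.3f}")
```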
